By Ian Stewart
The language that people use to communicate online is in constant flux. People may have once written “haha” to indicate laughter but over time have adopted “lol” instead. Entire dictionaries and websites such as UrbanDictionary.com are dedicated to tracking the ebb and flow of the latest slang (i.e. nonstandard) words that propagate across online communities. These changes often seem arbitrary but may be predictable based on the context of how the words are used.
Such language change can be predicted from social processes such as membership turnover, where new members join an online community and bring in new words while old members leave and are forgotten. In general, it is assumed that a nonstandard word manages to “stick” based on how many people adopt the word. However, it is less often considered how a new word’s function can help predict its growth. If a word like “lol” can occur in a wide range of linguistic contexts, will that give it an edge to overtake competing words such as “haha”? Furthermore, can the word’s linguistic properties predict growth better than its social properties? This is a critical question that can help linguists understand language change, and it can also help social scientists understand the limits of social theory.
Below, we visualize frequencies of laughter words on Reddit over a 3-year period. We see that the laughter acronym “lmao” grew in frequency while the typical form for laughter “haha” declined, even though these words had a similar initial frequency.
How well can these patterns of growth and decline be predicted in advance, using data about the words’ usage?
Our work studies the growth and decline of nonstandard words on Reddit, comparing the relative importance of social and linguistic “dissemination” in explaining these changes. Here, social dissemination refers to the number of Reddit users, communities, and threads in which a word appears, normalized by the expected count of users, communities, and threads. Similarly, linguistic dissemination refers to the relative number of contexts in which a word occurs. For instance, the adjective “trashy” occurs in an unusually high number of contexts as compared to the abbreviation “lmao,” which usually occurs at the beginning or end of a sentence. Below, the word “kinda” occurs in a broad range of linguistic contexts as well as social contexts.
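To make the two definitions concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation. For social dissemination it covers users only, and it assumes the expected count comes from a baseline in which every user draws words independently at the word's overall rate. For linguistic contexts it assumes a context is the pair of tokens immediately before and after the word, and it returns a raw count that is not yet normalized for frequency. The function names and the `(user, tokens)` input format are hypothetical.

```python
import math
from collections import defaultdict


def social_dissemination(posts, word):
    """Ratio of observed to expected number of users who use `word`.

    `posts` is a list of (user, tokens) pairs. The expected count assumes
    each user draws every token independently at the word's overall rate
    (an assumption made for this sketch).
    """
    total_tokens = sum(len(tokens) for _, tokens in posts)
    word_count = sum(tokens.count(word) for _, tokens in posts)
    if total_tokens == 0 or word_count == 0:
        return 0.0
    rate = word_count / total_tokens

    tokens_per_user = defaultdict(int)
    users_with_word = set()
    for user, tokens in posts:
        tokens_per_user[user] += len(tokens)
        if word in tokens:
            users_with_word.add(user)

    # Chance that a user who wrote m tokens used the word at least once.
    expected_users = sum(
        1.0 - math.exp(-rate * m) for m in tokens_per_user.values()
    )
    return len(users_with_word) / expected_users


def linguistic_contexts(posts, word):
    """Number of distinct (previous token, next token) pairs around `word`."""
    contexts = set()
    for _, tokens in posts:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - 1):
            if padded[i] == word:
                contexts.add((padded[i - 1], padded[i + 1]))
    return len(contexts)
```

A social dissemination value above 1 means the word reaches more users than its frequency alone would suggest, and a value below 1 means its use is concentrated among fewer users. A fuller treatment would apply the same ratio to communities and threads, and would adjust the context count for the word's frequency.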
We use several statistical methods, including logistic regression, causal inference, and survival analysis, to compare the relative importance of social and linguistic dissemination. Overwhelmingly, linguistic dissemination is a better predictor of word frequency change than social dissemination.
Linguistic dissemination can differentiate a future growing word from a future declining word even when controlling for relative frequency. Furthermore, linguistic dissemination can accurately forecast the point of “death” for declining words (for example, when “hehe” starts to fall out of favor), while social dissemination cannot forecast this as accurately. Although the models used are relatively simple from a machine learning perspective, they reveal new insight into patterns in language change.
The prediction results below reveal the success of linguistic dissemination as a predictor of word growth versus decline, using several months of data to train several logistic regression models. The linguistic dissemination model (f+L = frequency plus linguistic dissemination) outperforms the baseline system (f = frequency alone) and the social dissemination model (f+S = frequency plus social dissemination).
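The comparison between feature sets can be sketched as follows. This is a minimal illustration, assuming each word is described by one value per feature and a binary label (1 for growth, 0 for decline). The logistic regression here is a plain gradient-descent version written for this sketch, and the feature names, the `FEATURE_SETS` mapping, and the `evaluate` helper are all hypothetical. For brevity it scores each model on the same words it was fit on, whereas a real comparison would score held-out words.

```python
import math

# Hypothetical names for the three feature sets described above.
FEATURE_SETS = {
    "f": ["frequency"],
    "f+L": ["frequency", "linguistic"],
    "f+S": ["frequency", "social"],
}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, xi):
    """Return 1 (growth) if the predicted probability is at least 0.5."""
    w, b = model
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5 else 0


def evaluate(rows, labels, feature_names):
    """Fit on the chosen features and return accuracy on the same rows."""
    X = [[row[name] for name in feature_names] for row in rows]
    model = fit_logistic(X, labels)
    correct = sum(predict(model, xi) == yi for xi, yi in zip(X, labels))
    return correct / len(labels)
```

Calling `evaluate(rows, labels, FEATURE_SETS[name])` for each name gives one score per model, which is the kind of side-by-side comparison described above. Features would normally be standardized first so that a single learning rate works across them.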
Future work should look into more fine-grained definitions of linguistic dissemination: does it make a difference if we count syntactic contexts rather than word contexts? It’s also worth considering other kinds of language change, such as the adoption of phrases (“like a boss”) rather than just words. If you want to make words like “fetch” happen, it may take more linguistic creativity than social acceptance.
Stewart and his advisor, Jacob Eisenstein, will be presenting their paper at the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), which takes place October 31 to November 4 in Brussels, Belgium.
The full paper is available on arXiv.