Complexity in Computational Sociolinguistics: Exploring the Interplay between Geography, Culture and the Social Fabric
Louf, Thomas (Supervisors: Ramasco, José J.; Sánchez, David)
PhD Thesis (2023)
Language plays a crucial role in, and is greatly influenced by, widely different spheres of society, from simple interpersonal communication to the economy and culture. This is what makes sociolinguistics, the study of the interactions between language and society, a complex but decidedly worthwhile endeavour. Because a wealth of linguistic data can now be retrieved from online social media, the development of new theoretical models aimed at uncovering the mechanisms underlying sociolinguistic phenomena can be better guided and tested than ever before. In this thesis, we harness this potential and take an interdisciplinary approach to sociolinguistics, inspired by methods from complex systems and data science.
First, we study languages as coherent units that compete with one another for speakers, in order to identify the drivers of language extinction and to understand how multiple languages might come to coexist in an interconnected society. Crucially, we take into account the spatial embedding of languages, which we first observe using Twitter data. We find that two languages can coexist in completely separated communities, but also in communities that are mixed in space and feature a large population of bilinguals. We capture this diversity of coexistence states by introducing a model that considers a potential cultural attachment to one language, which may counteract a globally lower prestige, as well as the relative ease of learning one language when knowing the other. Both simulation and analytical results support our claims.
We then focus on variation within a language to point out a potential dependence of standard language use on socio-economic status. Focusing on England, we find a slight tendency for English Twitter users to make more grammatical mistakes the lower their income is. This tendency, however, differs greatly from one metropolitan area to another, and it actually seems to be weaker the more socio-economic classes mix. We propose a model that accounts for potentially different mixing patterns and preferences for a language variety. It reproduces the observed effect both in a simple setting that allows us to analyse it mathematically and in more realistic agent-based simulations. We thus find that increased social mixing is crucial to tackle the potential social and economic segregation reflected in this linguistic variation.
Lastly, we leverage the interrelationship between language and culture in a case study of the United States to define its major cultural regions. From geotagged tweets written in English, we identify the usage hotspots of the words they contain and then compute the principal dimensions of lexical variation. With these, we are able to infer coherent cultural regions and the topics that define them. This quantitative, automatic analysis thus provides robust answers to the debate around cultural geography, which has historically been marked by differing definitions of the relevant cultural factors.
The strength of the results we obtained across quite diverse areas of sociolinguistics reflects the strength of the approach we took throughout our work, which relies on computational tools, large datasets and simple mathematical models. It calls for further developments of this kind, which are most probably still in their infancy.