Crowdsourcing Dialect Characterization through Twitter

Aug. 12, 2014

Language is the most characteristic trait of human communication but takes on many heterogeneous forms. Dialects, in particular, are linguistic varieties that differ phonologically, grammatically or lexically across geographically separated regions. However, despite its fundamental importance and many recent developments, the way language varies spatially is still poorly understood.

Traditional methodological approaches in the study of regional dialects are based on interviews and questionnaires administered by a researcher to a small number (typically, a few hundred) of selected speakers known as informants. The linguistic atlases generated from their answers are naturally limited in scope, depend on the particular choice of locations and informants, and are perhaps not completely free of unwanted influences from the dialectologist. Another approach is the use of mass media corpora, which provide a wealth of information on language usage but suffer from the tendency of media and newspapers to use standard norms ("BBC English", for example), which limits their usefulness for the study of informal local variations.

On the other hand, the recent rise of online social tools has resulted in an unprecedented avalanche of content that is naturally and organically generated by millions or tens of millions of geographically distributed individuals who are likely to write in the vernacular and do not feel constrained to use standard linguistic norms. This, combined with the widespread use of GPS-enabled smartphones to access social media tools, provides a unique opportunity to observe how languages are used in everyday life and across vast regions of space.

In this work, we use a large dataset of geolocated Tweets to study local language variations across the world. Similar datasets have recently been used to map public opinion and social behavior and to analyze planetary language diversity.

Preliminary results demonstrating the feasibility of this approach have thus far been limited to considering only a few words or just a few geographical areas. Here, we move beyond the mere proof of concept and provide a detailed global picture of spatial variants for a specific language. For definiteness, we choose Spanish, as it is not only one of the most widely spoken languages in the world but also has the added advantage of being spatially distributed across several continents. Several other languages, such as Mandarin or English, have more native speakers or higher supraregional status, but their use is hindered by the limited local availability of Twitter (Mandarin) or a high abundance of homographs that precludes a detailed lexicographic analysis (English).

Using a large dataset of user-generated content in vernacular Spanish, we analyse the diatopic structure of the modern-day Spanish language at the lexical level. By applying standard machine learning techniques, we find, for the first time, two large Spanish varieties which are related to, respectively, international and local speech. We can also identify regional dialects and their approximate isoglosses. Our results are relevant to understanding empirically how languages are used in real life across vastly different geographical regions. We believe that our work has considerable latitude for further applications in the computational study of linguistics, a field full of rewarding opportunities. One can envisage much deeper analyses pointing the way towards new developments in sociolinguistic studies (bilingualism, creole varieties). Our work is based on a synchronic approach to language. However, the possibilities presented by the combination of large-scale online social networks with easily affordable GPS-enabled devices are so remarkable that they might permit us to observe, for the first time, how diatopic differences arise and develop in time.
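
As a rough illustration of what a lexical analysis of geolocated tweets can look like, the sketch below bins tweets into latitude/longitude grid cells and reports the most frequent lexical variant of one concept in each cell. It is only a toy example under stated assumptions, not the method used in this work: the tweets, the variant list for "car" and the helper names (cell_of, variant_counts, dominant_variant) are all hypothetical.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy data: (latitude, longitude, text). Not taken from the study.
TWEETS = [
    (40.4, -3.7, "he aparcado el coche"),
    (40.5, -3.6, "mi coche nuevo"),
    (19.4, -99.1, "voy en mi carro"),
    (19.3, -99.2, "el carro no arranca"),
    (-34.6, -58.4, "me compré un auto"),
    (-34.7, -58.5, "lavando el auto"),
]

# Hypothetical list of regional variants for a single concept ("car").
VARIANTS = ("coche", "carro", "auto")


def cell_of(lat, lon, size=1.0):
    """Map a coordinate to a square grid cell of `size` degrees."""
    return (math.floor(lat / size), math.floor(lon / size))


def variant_counts(tweets, variants, size=1.0):
    """Count how often each variant occurs in each grid cell."""
    counts = defaultdict(Counter)
    for lat, lon, text in tweets:
        tokens = text.lower().split()  # naive whitespace tokenisation
        for variant in variants:
            n = tokens.count(variant)
            if n:
                counts[cell_of(lat, lon, size)][variant] += n
    return counts


def dominant_variant(counts):
    """Return the most frequent variant for each grid cell."""
    return {cell: c.most_common(1)[0][0] for cell, c in counts.items()}


counts = variant_counts(TWEETS, VARIANTS)
dominant = dominant_variant(counts)
for cell in sorted(dominant):
    print(cell, dominant[cell])
```

Repeating this over many concepts would give each cell a vector of variant frequencies, and cells with similar vectors could then be grouped by a clustering algorithm. The choice of grid size, tokeniser and clustering method here is an assumption made for the example.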
