Big Data, memes, information diffusion in online social networks and opinion dynamics

Luque, Alvaro (Advisor: Ramasco, JJ)
Master Thesis (2023)

Online Social Networks are a key source of information on human interactions, owing to their widespread use in contemporary society. In this work, a weighted, directed network was built from Twitter reply data from three countries with different population sizes, over an eight-year observation window. Once the network is built, community detection methods are applied to find densely connected clusters of users. These communities are then studied from a thematic point of view, treating hashtags as memes through which members of a community share ideas and common interests. A statistical study is conducted on the variety and repetition rate of hashtags inside communities, and the similarity between pairs of groups is quantified. The objective is to test whether online communication through hashtags on Twitter follows two trends. First, whether the growth in the number of unique hashtags as a function of community size follows Heaps' law, a well-known law for written texts in which the number of unique words grows sublinearly with text length. Second, whether hashtag use in such groups changes around the value SD = 150 members, which has been proposed as the limit on the number of stable social relationships a human being is able to maintain, following the ideas of Robin Dunbar: below this threshold, communities should behave more like groups of close acquaintances in real life, exhibiting a wide range of topics that are repeated less, in contrast to big groups, which should represent communities that aggregate users following a certain topic and thus exhibit a higher repetition rate.
It also follows from the second idea that there should be more nonzero similarity values for pairs of small groups, since covering a larger number of hashtags with less repetition should lead to some overlap in the topics they cover, unlike pairs of big communities, which should show many zero similarity values because their hashtag distributions are peaked around certain topics.
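As an illustration of the first hypothesis, Heaps' law is usually written V(n) = K n^β with 0 < β < 1, where n is here taken to be the community size and V the number of unique hashtags. The sketch below estimates K and β by least squares on log-transformed values and also computes a simple repetition rate. The function names, the log-log fitting procedure, and the definition of repetition rate as total uses divided by distinct hashtags are assumptions made for this example, not necessarily the definitions used in the thesis.

```python
import math

def fit_heaps(sizes, unique_counts):
    """Fit V = K * n**beta by ordinary least squares on (log n, log V).

    Hypothetical helper: returns (K, beta). Inputs must be positive and
    contain at least two distinct sizes.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in unique_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope in log-log space
    k = math.exp(mean_y - beta * mean_x)   # intercept mapped back to linear scale
    return k, beta

def repetition_rate(hashtags):
    """Average number of uses per distinct hashtag (assumed definition).

    A value of 1.0 means every hashtag was used exactly once.
    """
    if not hashtags:
        return 0.0
    return len(hashtags) / len(set(hashtags))

# Synthetic data that follows V = 2 * n**0.5 exactly (not thesis data).
sizes = [10, 100, 1000, 10000]
unique_counts = [2 * n ** 0.5 for n in sizes]
k, beta = fit_heaps(sizes, unique_counts)
```

A fitted β clearly below 1 would be consistent with sublinear growth. Under the second hypothesis, `repetition_rate` would be expected to be larger for communities above roughly 150 members than for those below.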

In the first place, the vast amount of data gathered made the desired metrics computationally expensive to obtain and forced us to sample a small number of communities for each country. Moreover, data from the most populated country had to be left out because of its size. As for the hypotheses, the growth of unique hashtags as a function of group size was confirmed, although the curve does not resemble Heaps' law, the growth being markedly slow. Then, the separation that follows from Dunbar's results reveals that the repetition rate is indeed higher for more populated groups of users than for small groups. Finally, the similarity measure implemented for this work does not yield conclusive results for testing our hypothesis, mainly as a consequence of the community sampling and the limited processing applied to hashtags. In future work, the implementation must be improved to tackle problems such as lemmatization, intruder hashtags that do not belong naturally to their respective networks, and computational efficiency, the latter by using Big Data tools that reduce the cost of handling several million data entries. Moreover, we propose a network of communities whose link weights are proportional to their similarity, to which a propagation model for hashtags can be applied to emulate the transmission of specific topics in such a network.
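The proposed network of communities can be sketched as a weighted edge list in which each pair of communities is linked by the similarity of their hashtag sets. The abstract does not specify the similarity measure, so this example assumes Jaccard similarity; the function names and the toy data are likewise hypothetical.

```python
from itertools import combinations

def jaccard(tags_a, tags_b):
    """Jaccard similarity between two hashtag collections (assumed measure)."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_network(community_hashtags):
    """Return {(u, v): weight} for every pair of communities with nonzero similarity.

    community_hashtags maps a community label to the hashtags used in it.
    Pairs with zero similarity are left unlinked.
    """
    edges = {}
    for (u, tags_u), (v, tags_v) in combinations(sorted(community_hashtags.items()), 2):
        weight = jaccard(tags_u, tags_v)
        if weight > 0:
            edges[(u, v)] = weight
    return edges

# Toy example, not thesis data.
communities = {
    "A": ["#news", "#sports"],
    "B": ["#sports", "#music"],
    "C": ["#cooking"],
}
edges = similarity_network(communities)
```

Here only communities A and B share a hashtag, so the network contains a single link between them. A propagation model for hashtags could then be run over these weighted links.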
