Abstract:
With the increase in online social media interactions, the true identity of user profiles becomes increasingly doubtful. Fake profiles are used to engineer perceptions of opinions and also to create online relationships under false pretence. Natural language text -- how the user structures a sentence and uses words -- provides useful information to discover expected patterns, given the assumed social profile of the user. We expect, for example, different word use and sentence structures from teenagers than from adults. Sociolinguistics is the study of language in the context of social factors such as age, culture and common interest. Natural language processing (NLP) provides quantitative methods to discover sociolinguistic patterns in text data. Current NLP methods make use of a multinomial naïve Bayes classifier to classify unseen documents into predefined sociolinguistic classes. One property of language that is not captured in binomial or multinomial models, is that of burstiness. Burstiness defines the phenomenon that if a person uses a word, they are more likely to use that word again. Thus, the independence assumption between respective counts of the same word is relaxed. The Poisson distribution family captures this phenomenon and in the field of biostatistics, it is often referred to as contagious distributions (because the counts between contagious diseases is not independent). In this research, we relax this count independence assumption of the naïve Bayes classifier by replacing the baseline multinomial likelihood function with a Poisson likelihood function. In the second stage of the NLP pipeline, we use the top words identified in each class to explore the conditional dependencies between these words. For this purpose, an unsupervised Bayesian network is trained on a Bag-of-Words vectorisation of the top words. The output of the second stage is an exploration of the sociolinguistic patterns among different groups of people. The proposed methodology is applied to two data sets. In both cases, the contagious naïve Bayes classifier achieved the best results and we were able to extract word dependency structures from the Bayesian network learning. The methods developed in this research has the potential to aid security institutions, forensic investigations, and market researchers in identifying valuable sociolinguistic features associated with social groups of interest.