A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text

dc.contributor.advisor De Waal, Alta
dc.contributor.postgraduate Derks, Iena Petronella
dc.date.accessioned 2020-02-12T06:50:23Z
dc.date.available 2020-02-12T06:50:23Z
dc.date.created 2020-04-15
dc.date.issued 2020
dc.description Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020. en_ZA
dc.description.abstract With the increase in online social media interactions, the true identity of user profiles becomes increasingly doubtful. Fake profiles are used to engineer perceptions of opinions and also to create online relationships under false pretences. Natural language text (how the user structures a sentence and uses words) provides useful information for discovering the patterns expected given the assumed social profile of the user. We expect, for example, different word use and sentence structures from teenagers than from adults. Sociolinguistics is the study of language in the context of social factors such as age, culture and common interests. Natural language processing (NLP) provides quantitative methods to discover sociolinguistic patterns in text data. Current NLP methods make use of a multinomial naïve Bayes classifier to classify unseen documents into predefined sociolinguistic classes. One property of language that is not captured by binomial or multinomial models is burstiness. Burstiness is the phenomenon that once a person uses a word, they are more likely to use that word again. To model this, the independence assumption between respective counts of the same word must be relaxed. The Poisson family of distributions captures this phenomenon; in the field of biostatistics, these are often referred to as contagious distributions (because the counts of contagious disease cases are not independent). In this research, we relax this count independence assumption of the naïve Bayes classifier by replacing the baseline multinomial likelihood function with a Poisson likelihood function. In the second stage of the NLP pipeline, we use the top words identified in each class to explore the conditional dependencies between these words. For this purpose, an unsupervised Bayesian network is trained on a Bag-of-Words vectorisation of the top words. The output of the second stage is an exploration of the sociolinguistic patterns among different groups of people.
The proposed methodology is applied to two data sets. In both cases, the contagious naïve Bayes classifier achieved the best results, and we were able to extract word dependency structures from the learned Bayesian network. The methods developed in this research have the potential to aid security institutions, forensic investigations, and market researchers in identifying valuable sociolinguistic features associated with social groups of interest. en_ZA
dc.description.availability Unrestricted en_ZA
dc.description.degree MCom (Statistics) en_ZA
dc.description.department Statistics en_ZA
dc.description.sponsorship Center for Artificial Intelligence (CAIR) en_ZA
dc.identifier.citation Derks, IP 2020, A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text, MCom Mini-dissertation, University of Pretoria, Pretoria en_ZA
dc.identifier.other A2020 en_ZA
dc.identifier.uri http://hdl.handle.net/2263/73230
dc.language.iso en en_ZA
dc.publisher University of Pretoria
dc.rights © 2019 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject UCTD en_ZA
dc.subject Statistics en_ZA
dc.title A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text en_ZA
dc.type Mini Dissertation en_ZA
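
Illustrative sketch (not part of the record or the dissertation): the abstract describes replacing the multinomial likelihood of a naïve Bayes classifier with a Poisson likelihood. The Python below shows one minimal way such a classifier could look, under stated assumptions. The function names fit_poisson_nb and predict_poisson_nb, the additive smoothing constant alpha, and the toy documents are all hypothetical; the author's actual model, estimation method, and data are not reproduced here. Each word count is treated as an independent Poisson variable whose rate is the smoothed mean count of that word per document in the class.

```python
import math
from collections import Counter, defaultdict


def fit_poisson_nb(docs, labels, alpha=1.0):
    """Fit per-class Poisson rates for every vocabulary word.

    docs   : list of token lists
    labels : list of class labels, one per document
    alpha  : smoothing constant (an assumption of this sketch) that keeps
             every rate strictly positive
    """
    vocab = sorted({w for d in docs for w in d})
    by_class = defaultdict(list)
    for d, y in zip(docs, labels):
        by_class[y].append(Counter(d))

    model = {}
    for y, counters in by_class.items():
        n = len(counters)
        # Smoothed mean count of each word per document in class y.
        rates = {
            w: (sum(c[w] for c in counters) + alpha) / (n + alpha)
            for w in vocab
        }
        log_prior = math.log(n / len(docs))
        model[y] = (log_prior, rates)
    return vocab, model


def predict_poisson_nb(doc, vocab, model):
    """Return the class with the highest Poisson naive Bayes log posterior."""
    counts = Counter(doc)
    best_label, best_score = None, -math.inf
    for y, (log_prior, rates) in model.items():
        score = log_prior
        for w in vocab:
            x = counts[w]
            lam = rates[w]
            # Poisson log pmf: x*log(lam) - lam - log(x!)
            score += x * math.log(lam) - lam - math.lgamma(x + 1)
        if score > best_score:
            best_label, best_score = y, score
    return best_label


# Toy data, invented purely for illustration.
train_docs = [
    ["lol", "omg", "lol"],
    ["lol", "cool"],
    ["regards", "meeting"],
    ["meeting", "schedule", "regards"],
]
train_labels = ["teen", "teen", "adult", "adult"]

vocab, model = fit_poisson_nb(train_docs, train_labels)
pred_a = predict_poisson_nb(["lol", "lol", "omg"], vocab, model)
pred_b = predict_poisson_nb(["meeting", "regards"], vocab, model)
```

Unlike a multinomial model, this likelihood also scores the words that are absent from a document (through the -lam term), and words outside the training vocabulary are simply ignored. Those are design choices of the sketch, not claims about the dissertation.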

