Topic Modelling for Short Text


dc.contributor.advisor De Waal, Annari en
dc.contributor.coadvisor Millard, Sollie M. en
dc.contributor.postgraduate Mazarura, Jocelyn Rangarirai en
dc.date.accessioned 2015-11-25T09:47:19Z
dc.date.available 2015-11-25T09:47:19Z
dc.date.created 2015/09/01 en
dc.date.issued 2015 en
dc.description Dissertation (MSc)--University of Pretoria, 2015. en
dc.description.abstract Over the past few years, our increased ability to store large amounts of data, coupled with the increasing accessibility of the internet, has created massive stores of digital information. Consequently, it has become increasingly challenging to find and extract relevant information, creating a need for tools that can effectively extract and summarize it. One such tool is topic modelling, a method of extracting hidden themes or topics from a large collection of documents. Information is stored in many forms, but of particular interest is information stored as short text, which typically arises as posts on websites such as Facebook and Twitter, where people freely share their ideas, interests and opinions. With such a wealth of data and so many diverse users, these stores of short text could provide useful information about, for instance, public opinion and current trends. Unlike long text, such as news and journal articles, short text contains few words, so a document may not contain sufficiently many meaningful words; this is one of the well-known challenges of applying topic models to short text. The Latent Dirichlet Allocation (LDA) model is one of the most popular topic models, and it makes the generative assumption that a document belongs to many topics. Conversely, the Multinomial Mixture (MM) model assumes that a document belongs to at most one topic, which we believe is an intuitively sensible assumption for short text. Based on this key difference, we posit that the MM model should perform better than the LDA model on short text. To validate this hypothesis we compare the performance of the LDA and MM models on two long text and two short text corpora, using coherence as our main performance measure. Our experiments reveal that the LDA model performs slightly better than the MM model on long text, whereas the MM model performs better than the LDA model on short text. en
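The modelling difference the abstract hinges on — LDA lets one document mix several topics, while the Multinomial Mixture ties each document to a single topic — can be illustrated with a minimal EM fit of a Multinomial Mixture on a toy corpus. This is a hedged sketch, not the dissertation's implementation: the function name `mm_em`, the smoothing constant, and the toy documents are all illustrative assumptions.

```python
import math
import random
from collections import Counter

def mm_em(docs, K=2, iters=50, seed=0):
    """Minimal EM for a Multinomial Mixture (illustrative sketch).

    Unlike LDA, each document draws ALL of its words from one topic,
    so the E-step assigns one responsibility vector per whole document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    counts = [Counter(widx[w] for w in d) for d in docs]

    # Random (asymmetric) initialisation of topic priors and word distributions.
    pi = [1.0 / K] * K
    phi = [[rng.random() + 0.1 for _ in range(V)] for _ in range(K)]
    for k in range(K):
        s = sum(phi[k])
        phi[k] = [p / s for p in phi[k]]

    for _ in range(iters):
        # E-step: responsibility of each topic for each whole document,
        # computed in log space with the max trick for numerical stability.
        resp = []
        for c in counts:
            logp = [math.log(pi[k])
                    + sum(n * math.log(phi[k][w]) for w, n in c.items())
                    for k in range(K)]
            m = max(logp)
            e = [math.exp(l - m) for l in logp]
            s = sum(e)
            resp.append([x / s for x in e])
        # M-step: re-estimate topic priors and per-topic word probabilities.
        pi = [sum(r[k] for r in resp) / len(docs) for k in range(K)]
        for k in range(K):
            num = [0.01] * V  # small smoothing so no word probability hits zero
            for r, c in zip(resp, counts):
                for w, n in c.items():
                    num[w] += r[k] * n
            s = sum(num)
            phi[k] = [x / s for x in num]
    return pi, phi, resp, vocab
```

On two groups of short documents with disjoint vocabularies, the fitted responsibilities become near-hard single-topic assignments, which is exactly the "one document, one topic" behaviour the abstract argues suits short text.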
dc.description.availability Unrestricted en
dc.description.degree MSc en
dc.description.department Statistics en
dc.description.librarian tm2015 en
dc.identifier.citation Mazarura, JR 2015, Topic Modelling for Short Text, MSc Dissertation, University of Pretoria, Pretoria, viewed yymmdd <http://hdl.handle.net/2263/50694> en
dc.identifier.other S2015 en
dc.identifier.uri http://hdl.handle.net/2263/50694
dc.language.iso en en
dc.publisher University of Pretoria en_ZA
dc.rights © 2015 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. en
dc.subject UCTD en
dc.title Topic Modelling for Short Text en
dc.type Dissertation en

