Topic Modelling for Short Text

dc.contributor.advisor: De Waal, Annarien
dc.contributor.coadvisor: Millard, Sollie M.
dc.contributor.email: u10220420@tuks.co.za
dc.contributor.postgraduate: Mazarura, Jocelyn Rangarirai
dc.date.accessioned: 2015-11-25T09:47:19Z
dc.date.available: 2015-11-25T09:47:19Z
dc.date.created: 2015/09/01
dc.date.issued: 2015
dc.description: Dissertation (MSc)--University of Pretoria, 2015.
dc.description.abstract: Over the past few years, our increased ability to store large amounts of data, coupled with the increasing accessibility of the internet, has created massive stores of digital information. Consequently, it has become increasingly challenging to find and extract relevant information, creating a need for tools that can effectively extract and summarize it. One such tool is topic modelling, a method of extracting hidden themes, or topics, from a large collection of documents. Information is stored in many forms, but of particular interest is short text, which typically arises as posts on websites like Facebook and Twitter, where people freely share their ideas, interests and opinions. With such a wealth of data and so many diverse users, these stores of short text could provide useful information about public opinion and current trends, for instance. Unlike long text, such as news and journal articles, short text contains few words and may therefore not contain sufficiently many meaningful words, which is a well-known challenge when applying topic models to it. The Latent Dirichlet Allocation (LDA) model is one of the most popular topic models, and it makes the generative assumption that a document belongs to many topics. Conversely, the Multinomial Mixture (MM) model, another topic model, assumes that a document belongs to at most one topic, which we believe is an intuitively sensible assumption for short text. Based on this key difference, we posit that the MM model should perform better than the LDA model. To validate this hypothesis, we compare the performance of the LDA and MM models on two long text and two short text corpora, using coherence as our main performance measure. Our experiments reveal that the LDA model performs slightly better than the MM model on long text, whereas the MM model performs better than the LDA model on short text.
dc.description.availability: Unrestricted
dc.description.degree: MSc
dc.description.department: Statistics
dc.description.librarian: tm2015
dc.identifier.citation: Mazarura, JR 2015, Topic Modelling for Short Text, MSc Dissertation, University of Pretoria, Pretoria, viewed yymmdd <http://hdl.handle.net/2263/50694>
dc.identifier.other: S2015
dc.identifier.uri: http://hdl.handle.net/2263/50694
dc.language.iso: en
dc.publisher: University of Pretoria
dc.rights: © 2015 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject: UCTD
dc.title: Topic Modelling for Short Text
dc.type: Dissertation
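The abstract contrasts the generative assumptions of the two models: LDA lets every word in a document come from a different topic, while the Multinomial Mixture draws a single topic for the whole document. The sketch below illustrates that difference with toy generative processes; it is not code from the dissertation, and the topic distributions and parameter values are purely illustrative.

```python
import random

def generate_lda_doc(topic_word, alpha, n_words, rng):
    # LDA assumption: a document mixes many topics. Draw a
    # document-specific topic distribution theta (a symmetric
    # Dirichlet draw via normalised gamma variates), then sample
    # a topic independently for every word.
    k = len(topic_word)
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(k), weights=theta)[0]
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]
        doc.append((z, w))
    return doc

def generate_mm_doc(topic_word, pi, n_words, rng):
    # MM assumption: at most one topic per document. A single
    # topic z is drawn once, and every word comes from it.
    k = len(topic_word)
    z = rng.choices(range(k), weights=pi)[0]
    return [(z, rng.choices(range(len(topic_word[z])),
                            weights=topic_word[z])[0])
            for _ in range(n_words)]

# Two toy topics over a 4-word vocabulary (hypothetical values).
topics = [[0.7, 0.2, 0.05, 0.05],
          [0.05, 0.05, 0.2, 0.7]]
rng = random.Random(0)
lda_doc = generate_lda_doc(topics, alpha=0.5, n_words=10, rng=rng)
mm_doc = generate_mm_doc(topics, pi=[0.5, 0.5], n_words=10, rng=rng)
```

Under the MM process every `(topic, word)` pair in a document shares the same topic index, whereas an LDA document can contain pairs from several topics; this is the short-text intuition the abstract builds on, since a tweet-length post rarely spans many themes.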

Files

Original bundle

Name: Mazarura_Topic_2015.pdf
Size: 2.44 MB
Format: Adobe Portable Document Format
Description: Dissertation