Topic Modelling for Short Text

Publisher

University of Pretoria

Abstract

Over the past few years, our growing ability to store large amounts of data, coupled with the increasing accessibility of the internet, has created massive stores of digital information. Consequently, it has become increasingly challenging to find and extract relevant information, creating a need for tools that can effectively extract and summarise it. One such tool is topic modelling, a method of extracting hidden themes or topics from a large collection of documents. Information is stored in many forms, but of particular interest is short text, which typically arises as posts on websites like Facebook and Twitter, where people freely share their ideas, interests and opinions. With such a wealth of data and so many diverse users, these stores of short text could provide useful information about, for instance, public opinion and current trends. A well-known challenge of applying topic models to short text is that, unlike long text such as news and journal articles, it contains few words and so may not contain enough meaningful words. The Latent Dirichlet Allocation (LDA) model is one of the most popular topic models, and it makes the generative assumption that a document belongs to many topics. Conversely, the Multinomial Mixture (MM) model, another topic model, assumes that a document can belong to at most one topic, which we believe is an intuitively sensible assumption for short text. Based on this key difference, we posit that the MM model should perform better than the LDA model on short text. To validate this hypothesis, we compare the performance of the LDA and MM models on two long text and two short text corpora, using coherence as our main performance measure. Our experiments reveal that the LDA model performs slightly better than the MM model on long text, whereas the MM model performs better than the LDA model on short text.
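
The difference between the two generative assumptions can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the toy topic-word probabilities, the Dirichlet parameter alpha and the mixture weights are all assumptions chosen for illustration. Under LDA every word in a document draws its own topic from document-specific topic proportions, whereas under the MM model one topic is drawn for the whole document.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_lda_document(topic_word, alpha, n_words, rng):
    """LDA assumption: each word has its own topic assignment."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    topics, words = [], []
    for _ in range(n_words):
        z = sample_index(theta, rng)  # topic for this word only
        topics.append(z)
        words.append(sample_index(topic_word[z], rng))
    return topics, words


def generate_mm_document(topic_word, mixture_weights, n_words, rng):
    """MM assumption: a single topic generates every word in the document."""
    z = sample_index(mixture_weights, rng)  # one topic for the document
    words = [sample_index(topic_word[z], rng) for _ in range(n_words)]
    return [z] * n_words, words


# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
toy_topic_word = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 favours words 0 and 1
    [0.05, 0.05, 0.45, 0.45],  # topic 1 favours words 2 and 3
]
rng = random.Random(0)
lda_topics, lda_words = generate_lda_document(toy_topic_word, [1.0, 1.0], 8, rng)
mm_topics, mm_words = generate_mm_document(toy_topic_word, [0.5, 0.5], 8, rng)
print("LDA per-word topics:", lda_topics)
print("MM per-word topics: ", mm_topics)
```

In a document of only a handful of words, the per-document proportions in the LDA sketch are estimated from very little evidence, which is the intuition behind preferring the single-topic assumption for short text.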

Description

Dissertation (MSc)--University of Pretoria, 2015.

Keywords

UCTD

Citation

Mazarura, JR 2015, Topic Modelling for Short Text, MSc Dissertation, University of Pretoria, Pretoria, viewed yymmdd <http://hdl.handle.net/2263/50694>