Abstract:
Over the past few years, our increased ability to store large amounts of data, coupled with the
increasing accessibility of the internet, has created massive stores of digital information. Consequently,
it has become harder to find and extract relevant information, which
creates a need for tools that can effectively extract and summarize it. One such
tool is topic modelling, a method for extracting hidden themes, or topics, from a large collection
of documents.
Information is stored in many forms, but of particular interest is the information stored as short
text, which typically arises as posts on websites like Facebook and Twitter where people freely
share their ideas, interests and opinions. With such a wealth of data and so many diverse users,
such stores of short text could provide useful information about, for instance, public opinion and
current trends. Unlike long text, such as news and journal articles, short text contains few words
and therefore may not contain enough meaningful words, which is a well-known challenge
when applying topic models to short text.
The Latent Dirichlet Allocation (LDA) model is one of the most popular topic models, and
it makes the generative assumption that a document belongs to many topics. In contrast, the
Multinomial Mixture (MM) model, another topic model, assumes that a document belongs to a
single topic, which we believe is an intuitively sensible assumption for short text. Based on this
key difference, we posit that the MM model should perform better than the LDA model on short text.
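The difference between the two generative assumptions can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the toy vocabulary size, the Dirichlet hyperparameter alpha and the uniform topic weights for the MM model are all assumptions chosen for illustration.

```python
import numpy as np

def generate_lda_doc(rng, topic_word, alpha, doc_len):
    """LDA-style document: every word position draws its own topic."""
    num_topics, vocab_size = topic_word.shape
    # Per-document topic proportions (alpha is an assumed symmetric hyperparameter).
    theta = rng.dirichlet(np.full(num_topics, alpha))
    topics = rng.choice(num_topics, size=doc_len, p=theta)
    words = np.array([rng.choice(vocab_size, p=topic_word[z]) for z in topics])
    return topics, words

def generate_mm_doc(rng, topic_word, topic_weights, doc_len):
    """MM-style document: one topic is drawn once and generates every word."""
    vocab_size = topic_word.shape[1]
    topic = rng.choice(len(topic_weights), p=topic_weights)
    words = rng.choice(vocab_size, size=doc_len, p=topic_word[topic])
    return topic, words

rng = np.random.default_rng(0)
# Toy setup (assumed): 3 topics over a 6-word vocabulary.
topic_word = rng.dirichlet(np.full(6, 0.5), size=3)
lda_topics, lda_words = generate_lda_doc(rng, topic_word, alpha=0.5, doc_len=8)
mm_topic, mm_words = generate_mm_doc(rng, topic_word, np.full(3, 1 / 3), doc_len=8)
```

In this sketch an LDA document carries one topic label per word, whereas an MM document carries one topic label in total, which is why the MM assumption seems a natural fit for documents of only a few words.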
To test this hypothesis, we compare the performance of the LDA and MM models on two long text
and two short text corpora, using coherence as our main performance measure. Our experiments
reveal that the LDA model performs slightly better than the MM model on long text, whereas the
MM model performs better than the LDA model on short text.