Abstract:
Cluster analysis is a popular unsupervised statistical method used to group observations
into clusters. Identifying latent segments and groupings in the data aids in the understanding
of natural phenomena. In today’s data-driven society, high-dimensional data are
ubiquitous, and noise variables are therefore unavoidable. Model-based
clustering methods have had to adapt in order to identify these non-informative
variables, since they unduly increase a model’s complexity. This mini-dissertation reviews
the effectiveness of different penalized likelihood approaches and how they aid in identifying
and removing uninformative variables. An EM algorithm is used to fit a penalized
Gaussian mixture model to the data. The penalized log-likelihood is maximized, and if
a variable’s parameter estimates are shrunk to the same value across all clusters, the
variable is deemed uninformative and removed from the model. It was found that penalizing
the cluster means successfully identified and removed uninformative variables.
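
As an illustration of the kind of objective described above, the following is a minimal sketch rather than the specific formulation used in this work. It assumes an L1 penalty on the cluster means, data standardized so that each variable has overall mean zero, and a common diagonal covariance matrix; the symbols n, K, p, \pi_k, \mu_{kj}, \sigma_j^2, \tau_{ik} and the tuning parameter \lambda are notation introduced here for illustration only.

```latex
% Penalized log-likelihood (assumed L1 penalty on the cluster means)
\ell_P(\Theta) = \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \pi_k \,
    \phi\!\left(x_i;\, \mu_k,\, \Sigma\right) \right\}
    - \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \left| \mu_{kj} \right|,
\qquad \Sigma = \operatorname{diag}\!\left(\sigma_1^2, \dots, \sigma_p^2\right)

% M-step update for the means under this penalty (soft-thresholding),
% where \tau_{ik} is the E-step posterior probability that
% observation i belongs to cluster k
\tilde{\mu}_{kj} = \frac{\sum_{i=1}^{n} \tau_{ik}\, x_{ij}}{\sum_{i=1}^{n} \tau_{ik}},
\qquad
\hat{\mu}_{kj} = \operatorname{sign}\!\left(\tilde{\mu}_{kj}\right)
    \left( \left| \tilde{\mu}_{kj} \right|
    - \frac{\lambda\, \sigma_j^2}{\sum_{i=1}^{n} \tau_{ik}} \right)_{+}

% Under these assumptions, variable j is treated as uninformative when
\hat{\mu}_{1j} = \hat{\mu}_{2j} = \dots = \hat{\mu}_{Kj} = 0
```

Under these assumptions, a larger \lambda shrinks more cluster means exactly to zero, so more variables take the same value across all clusters and are dropped from the model.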