Penalized feature selection in model-based clustering
Publisher
University of Pretoria
Abstract
Cluster analysis is a popular unsupervised statistical method used to group observations
into clusters. Identifying latent segments and groupings in the data aids in the understanding
of natural phenomena. In today's data-driven society, high-dimensional
data are ubiquitous, and noise variables are therefore unavoidable. Model-based
clustering methods have had to adapt in order to identify these non-informative
variables, since they unduly increase a model’s complexity. This mini-dissertation reviews
the effectiveness of different penalized likelihood approaches and how they aid in identifying
and removing uninformative variables. An EM algorithm is used to fit a penalized
Gaussian mixture model to the data. The penalized log-likelihood is maximized and, if
a variable’s parameter estimates are shrunk to the same value across all clusters, it is
deemed uninformative and removed from the model. Penalizing the cluster means was
found to identify and remove uninformative variables successfully.
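
The procedure described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: it assumes standardized data, a diagonal covariance shared by all clusters, and an L1 penalty on the cluster means, which is one common choice of penalty. Under those assumptions the M-step for the means reduces to soft-thresholding, and a variable whose mean is shrunk to zero in every cluster is flagged as uninformative. The names penalized_gmm_em, soft_threshold, lam and init_idx, as well as the simulated data, are hypothetical and chosen for illustration only.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink x towards zero by t, setting values with |x| <= t to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def penalized_gmm_em(X, n_clusters, lam, init_idx, n_iter=200):
    """EM for a Gaussian mixture with an L1 penalty on the cluster means.

    Assumptions of this sketch: columns are standardized inside the
    function, the covariance is diagonal and shared across clusters, and
    init_idx gives the rows used as starting values for the means.
    """
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # overall mean 0, variance 1
    mu = Xs[np.asarray(init_idx)].copy()         # (K, p) initial means
    var = np.ones(p)                             # shared diagonal variances
    pi = np.full(n_clusters, 1.0 / n_clusters)   # mixing proportions

    for _ in range(n_iter):
        # E-step: posterior cluster probabilities, computed on the log scale.
        diff = Xs[:, None, :] - mu[None, :, :]                    # (n, K, p)
        log_dens = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
        log_w = np.log(pi) + log_dens
        log_w -= log_w.max(axis=1, keepdims=True)
        tau = np.exp(log_w)
        tau /= tau.sum(axis=1, keepdims=True)                     # (n, K)

        # M-step: proportions, soft-thresholded means, shared variances.
        nk = tau.sum(axis=0) + 1e-12
        pi = nk / n
        mu_tilde = (tau.T @ Xs) / nk[:, None]                     # unpenalized means
        mu = soft_threshold(mu_tilde, lam * var[None, :] / nk[:, None])
        diff = Xs[:, None, :] - mu[None, :, :]
        var = np.maximum(np.einsum("nk,nkp->p", tau, diff**2) / n, 1e-6)

    # A variable is kept only if at least one cluster mean is non-zero.
    informative = np.any(mu != 0.0, axis=0)
    return mu, var, pi, informative

# Simulated example: variables 0-2 separate two clusters, variables 3-5 are noise.
rng = np.random.default_rng(1)
A = rng.normal(0.0, 1.0, size=(200, 6))
B = rng.normal(0.0, 1.0, size=(200, 6))
A[:, :3] -= 5.0
B[:, :3] += 5.0
X = np.vstack([A, B])

# Start from one row of each simulated cluster (first and last row).
mu, var, pi, informative = penalized_gmm_em(
    X, n_clusters=2, lam=50.0, init_idx=[0, len(X) - 1]
)
print(informative)
```

The penalty weight lam controls how aggressively means are shrunk; in practice it would be chosen by a model selection criterion or cross-validation rather than fixed by hand as it is here.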
Description
Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2022.
Keywords
UCTD, Variable selection, Clustering, Expectation Maximisation, Penalized log-likelihood, Penalized feature selection
