Penalized feature selection in model-based clustering

doi:10.25403/UPresearchdata.23219531

Files

Potgieter_Penalized_2022.pdf (3.66 MB)

Date

2022

Authors

Potgieter, Luandrie

Publisher

University of Pretoria

Abstract

Cluster analysis is a popular unsupervised statistical method used to group observations into clusters. Identifying latent segments and groupings in the data aids the understanding of natural phenomena. In today’s data-driven society, high-dimensional data are ubiquitous, and noise variables are therefore unavoidable. Model-based clustering methods have had to adapt in order to identify these non-informative variables, since they unduly increase a model’s complexity. This mini-dissertation reviews the effectiveness of different penalized likelihood approaches and how they aid in identifying and removing uninformative variables. An EM algorithm is used to fit a penalized Gaussian mixture model to the data. The penalized log-likelihood is maximized, and if a variable’s parameter estimates are reduced to the same value across all clusters, the variable is deemed uninformative and removed from the model. It was found that, by penalizing the mean, uninformative variables were successfully identified and removed.
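The procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not state which penalty or covariance structure the dissertation uses, so everything specific here is an assumption: an L1 penalty on the cluster means, a diagonal covariance shared by all clusters, and data standardized so that a mean of zero in every cluster corresponds to "the same value across all clusters". The function name `penalized_gmm_em` and the parameters `lam` and `init_idx` are hypothetical and chosen only for illustration.

```python
import numpy as np

def penalized_gmm_em(X, k, lam, n_iter=100, init_idx=None, seed=0):
    """EM for a Gaussian mixture with an L1 penalty on the cluster means.

    Assumptions (not taken from the dissertation): shared diagonal
    covariance, standardized data, soft-thresholded mean updates.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Standardize so every variable has overall mean 0 and variance 1.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    idx = rng.choice(n, size=k, replace=False) if init_idx is None else np.asarray(init_idx)
    mu = X[idx].copy()            # starting means: k observations
    pi = np.full(k, 1.0 / k)      # mixing proportions
    sigma2 = np.ones(p)           # shared per-variable variances
    for _ in range(n_iter):
        # E-step: responsibilities. Terms common to all clusters cancel
        # because the covariance is shared, so they are omitted.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2 / sigma2).sum(axis=2)
        log_w = np.log(pi) - 0.5 * sq
        log_w -= log_w.max(axis=1, keepdims=True)
        tau = np.exp(log_w)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted means, then soft-thresholding from the L1 penalty.
        nk = tau.sum(axis=0) + 1e-10
        pi = nk / nk.sum()
        mu_tilde = (tau.T @ X) / nk[:, None]
        thresh = lam * sigma2[None, :] / nk[:, None]
        mu = np.sign(mu_tilde) * np.maximum(np.abs(mu_tilde) - thresh, 0.0)
        # Shared variance per variable, pooled over clusters.
        resid2 = (X[:, None, :] - mu[None, :, :]) ** 2
        sigma2 = np.maximum((tau[:, :, None] * resid2).sum(axis=(0, 1)) / n, 1e-8)
    # A variable whose mean is zero in every cluster is treated as uninformative.
    informative = np.any(mu != 0.0, axis=0)
    return mu, informative, tau.argmax(axis=1)

# Synthetic example: 3 variables separate two clusters, 3 are pure noise.
rng = np.random.default_rng(1)
n_half = 500
signal = np.vstack([rng.normal(-3, 1, (n_half, 3)), rng.normal(3, 1, (n_half, 3))])
noise = rng.normal(0, 1, (2 * n_half, 3))
X = np.hstack([signal, noise])
mu, informative, labels = penalized_gmm_em(
    X, k=2, lam=100.0, init_idx=(0, 2 * n_half - 1)
)
print(informative)
```

In this sketch the penalty weight `lam` controls how aggressively means are shrunk: larger values remove more variables, and a value that is too large shrinks every mean to zero. In practice it would be chosen by a model selection criterion, which is not shown here.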

Description

Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2022.

Keywords

UCTD, Variable selection, Clustering, Expectation Maximisation, Penalized log-likelihood, Penalized feature selection

URI

http://hdl.handle.net/2263/91035

Collections

Theses and Dissertations (University of Pretoria)
Theses and Dissertations (Statistics)
