South African isiZulu and siSwati news corpus creation, annotation and categorisation

Madodonga, Andani

dc.contributor.advisor	Marivate, Vukosi
dc.contributor.coadvisor	Adendorff, M.
dc.contributor.postgraduate	Madodonga, Andani
dc.date.accessioned	2023-10-09T08:01:33Z
dc.date.available	2023-10-09T08:01:33Z
dc.date.created	2023-04
dc.date.issued	2022
dc.description	Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2022.	en_US
dc.description.abstract	South Africa has eleven official languages, nine of which are local low-resourced languages. It is therefore essential to build resources for these languages so that they can benefit from advances in natural language processing. This project created annotated datasets for two of these languages, isiZulu and siSwati, for the task of news topic classification, and presents the results of baseline classification models trained on them. Because data for these South African languages is scarce, the datasets were augmented and oversampled to increase their size and to reduce class imbalance. Four classification models were used: logistic regression, Naive Bayes, XGBoost and LSTM. These models were trained on three different word embeddings: Count vectorizer, TF-IDF vectorizer and word2vec. The results showed that XGBoost, logistic regression and LSTM trained on word2vec embeddings performed better than the other combinations.	en_US
dc.description.availability	Unrestricted	en_US
dc.description.degree	MIT (Big Data Science)	en_US
dc.description.department	Computer Science	en_US
dc.identifier.citation	*	en_US
dc.identifier.other	A2023	en_US
dc.identifier.uri	http://hdl.handle.net/2263/92767
dc.language.iso	en	en_US
dc.publisher	University of Pretoria
dc.rights	© 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject	UCTD	en_US
dc.subject	South African Local Languages	en_US
dc.subject	Low Resources Languages	en_US
dc.subject	Data Augmentation	en_US
dc.subject	Topic Classification	en_US
dc.subject	Logistic regression	en_US
dc.title	South African isiZulu and siSwati news corpus creation, annotation and categorisation	en_US
dc.type	Mini Dissertation	en_US