dc.contributor.advisor |
Marivate, Vukosi |
|
dc.contributor.coadvisor |
Adendorff, M. |
|
dc.contributor.postgraduate |
Madodonga, Andani |
|
dc.date.accessioned |
2023-10-09T08:01:33Z |
|
dc.date.available |
2023-10-09T08:01:33Z |
|
dc.date.created |
2023-04 |
|
dc.date.issued |
2022 |
|
dc.description |
Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2022. |
en_US |
dc.description.abstract |
South Africa has eleven official languages, nine of which are local, low-resourced
languages. It is therefore essential to build resources for these languages so that
they can benefit from advances in natural language processing. This project created
annotated datasets for isiZulu and siSwati for the task of news topic classification
and presents the findings from baseline classification models trained on them. Because
data for these local South African languages is scarce, the datasets were augmented and
oversampled to increase their size and to overcome class imbalance. Four classification
models were used: logistic regression, Naive Bayes, XGBoost and LSTM. These models were
trained on three feature representations: count vectorizer, TF-IDF vectorizer and
word2vec. The results showed that XGBoost, logistic regression and LSTM trained on
word2vec features performed better than the other combinations. |
en_US |
dc.description.availability |
Unrestricted |
en_US |
dc.description.degree |
MIT (Big Data Science) |
en_US |
dc.description.department |
Computer Science |
en_US |
dc.identifier.citation |
* |
en_US |
dc.identifier.other |
A2023 |
en_US |
dc.identifier.uri |
http://hdl.handle.net/2263/92767 |
|
dc.language.iso |
en |
en_US |
dc.publisher |
University of Pretoria |
|
dc.rights |
© 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. |
|
dc.subject |
UCTD |
en_US |
dc.subject |
South African Local Languages |
en_US |
dc.subject |
Low Resources Languages |
en_US |
dc.subject |
Data Augmentation |
en_US |
dc.subject |
Topic Classification |
en_US |
dc.subject |
Logistic regression |
en_US |
dc.title |
South African isiZulu and siSwati news corpus creation, annotation and categorisation |
en_US |
dc.type |
Mini Dissertation |
en_US |