dc.contributor.advisor |
Marivate, Vukosi |
|
dc.contributor.coadvisor |
Mazarura, Jocelyn |
|
dc.contributor.postgraduate |
Nemakhavhani, Ndamulelo |
|
dc.date.accessioned |
2024-09-13T12:01:23Z |
|
dc.date.available |
2024-09-13T12:01:23Z |
|
dc.date.created |
2024-04 |
|
dc.date.issued |
2023-06 |
|
dc.description |
Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2023. |
en_US |
dc.description.abstract |
The information age has been a critical driver of the impressive advancement of Natural
Language Processing (NLP) applications in recent years. The benefits of these applications have
been most prominent in populations with relatively better access to technology and information.
By contrast, low-resourced regions such as South Africa have seen a lag in NLP advancement
owing to the limited high-quality datasets required to build reliable NLP models. To address this
challenge, recent NLP research has emphasised language-agnostic models that enable
Cross-Lingual Language Understanding (XLU) through cross-lingual transfer learning.
Several empirical results have shown that XLU models work well when applied to languages with
sufficient morphological or lexical similarity. In this study, we sought to exploit this capability
to improve Tshivenda NLP representation using Sepedi and other related Bantu languages that
have relatively more data resources.
Current state-of-the-art cross-lingual language models such as XLM-RoBERTa are trained
on hundreds of languages, most of them high-resourced languages of European origin.
Although the cross-lingual performance of these models is impressive for popular African
languages such as Swahili, there is still considerable room for improvement. As the size of such
models continues to soar, questions have been raised about whether competitive performance can
still be achieved with downsized training data, so as to minimise the environmental impact of
ever-increasing computational requirements. Fortunately, practical results from AfriBERTa,
a multilingual language model trained on a 1 GB corpus spanning eleven African languages, showed
that this could be a tenable approach to addressing the lack of representation for low-resourced
languages in a sustainable way.
Inspired by these recent successes, including XLM-RoBERTa and AfriBERTa, we
present Zabantu-XLM-R, a novel fleet of small-scale, cross-lingual, pre-trained language models
aimed at enhancing NLP coverage of Tshivenda. Although the study focused solely on Tshivenda,
the presented methods can easily be adapted to other under-resourced South African languages,
such as Xitsonga and isiNdebele. The language models were trained on different sets of
South African Bantu languages, with each set chosen heuristically based on its similarity to
Tshivenda. We used a novel news headline dataset, annotated following the International Press
Telecommunications Council (IPTC) standards, to conduct an extrinsic evaluation of the language
models on a short text classification task.
Our custom language models achieved an impressive average weighted F1-score of 60% in few-
shot settings with as few as 50 examples per class from the target language. We also found that
open-source models such as AfriBERTa and AfroXLMR exhibited similar performance, even though
Tshivenda and Sepedi had minimal representation in their pre-training corpora. These
findings validated our hypothesis that we can leverage the relatedness among Bantu languages
to develop state-of-the-art NLP models for Tshivenda. To our knowledge, no similar work has
been carried out that focuses solely on few-shot performance for Tshivenda. |
en_US |
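
To make the extrinsic evaluation described in the abstract more concrete, the sketch below shows one way to run few-shot fine-tuning of a pretrained cross-lingual checkpoint on a handful of labelled headlines using the Hugging Face Transformers and Datasets libraries. This is a minimal illustration under assumed names: the checkpoint ("xlm-roberta-base"), the label count, the placeholder headlines, and all hyperparameters are illustrative stand-ins, not the study's actual Zabantu-XLM-R artefacts or IPTC dataset.

# Minimal sketch (assumed names throughout): few-shot fine-tuning of a
# pretrained cross-lingual model for news-headline topic classification.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# "xlm-roberta-base" is a publicly available stand-in checkpoint.
checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Placeholder training set: in a real few-shot run, roughly 50 labelled
# headlines per topic class in the target language would be supplied here.
train_data = Dataset.from_dict({
    "text": ["Headline one ...", "Headline two ..."],
    "label": [0, 1],
})

def tokenize(batch):
    # Headlines are short, so a small max_length keeps training cheap.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="headline-classifier",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)
Trainer(model=model, args=args, train_dataset=train_data).train()

A weighted F1-score of the kind reported in the abstract could then be computed on a held-out test split of the target language, for example with sklearn.metrics.f1_score(y_true, y_pred, average="weighted").
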
dc.description.availability |
Unrestricted |
en_US |
dc.description.degree |
MIT (Big Data Science) |
en_US |
dc.description.department |
Computer Science |
en_US |
dc.description.faculty |
Faculty of Engineering, Built Environment and Information Technology |
en_US |
dc.identifier.citation |
* |
en_US |
dc.identifier.other |
A2024 |
en_US |
dc.identifier.uri |
http://hdl.handle.net/2263/98198 |
|
dc.language.iso |
en |
en_US |
dc.publisher |
University of Pretoria |
|
dc.rights |
© 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. |
|
dc.subject |
UCTD |
en_US |
dc.subject |
Natural Language Processing (NLP) |
en_US |
dc.subject |
Tshivenda NLP coverage |
en_US |
dc.subject |
Cross-Lingual Learning Techniques |
en_US |
dc.subject |
Low-resource NLP |
en_US |
dc.subject |
XLM-RoBERTa |
en_US |
dc.title |
Exploring cross-lingual learning techniques for advancing Tshivenda NLP coverage |
en_US |
dc.type |
Mini Dissertation |
en_US |