Text-based language identification for the South African languages

dc.contributor.advisor Barnard, E. en
dc.contributor.postgraduate Botha, Gerrit Reinier en
dc.date.accessioned 2013-09-07T12:10:12Z
dc.date.available 2008-09-09 en
dc.date.available 2013-09-07T12:10:12Z
dc.date.created 2008-04-09 en
dc.date.issued 2008-09-09 en
dc.date.submitted 2008-09-04 en
dc.description Dissertation (MEng)--University of Pretoria, 2008. en
dc.description.abstract We investigate the factors that determine the performance of text-based language identification, with a particular focus on the 11 official languages of South Africa. Our study uses n-gram statistics as features for classification. In particular, we compare support vector machine, Naïve Bayesian and difference-in-frequency classifiers on different amounts of input text and various values of n, for different amounts of training data. For a fixed value of n, the support vector machine generally outperforms the other classifiers, but the simpler classifiers are able to handle larger values of n. The additional computational complexity of training the support vector machine classifier may not be justified in light of the importance of using a large value of n, except possibly for small input windows when limited training data is available. We find that it is more difficult to discriminate between languages within a language family than between languages in different families. Accuracy on small input strings is low for this reason, but for input strings of 100 characters or more there is only slight confusion within families, and accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when the languages within each family are grouped together, this corresponds to a usable 95.1% accuracy. The relationship between the amount of training data and the accuracy achieved is found to depend on the window size: for the largest window (300 characters), about 400 000 characters are sufficient to achieve close-to-optimal accuracy, whereas for smaller windows improvements in accuracy are found even beyond 1.6 million characters of training data. Finally, we show that the confusions between the different languages in our set can be used to derive informative graphical representations of the relationships between the languages. en
dc.description.availability unrestricted en
dc.description.department Electrical, Electronic and Computer Engineering en
dc.identifier.citation a 2008 en
dc.identifier.other E1086/gm en
dc.identifier.upetdurl http://upetd.up.ac.za/thesis/available/etd-09042008-133715/ en
dc.identifier.uri http://hdl.handle.net/2263/27725
dc.language.iso en
dc.publisher University of Pretoria en_ZA
dc.rights © University of Pretoria 2008 E1086/ en
dc.subject Naïve bayesian classification en
dc.subject Support vector machine en
dc.subject N-gram statistics en
dc.subject Text-based language identification en
dc.subject Difference-in-frequency classification en
dc.subject UCTD en_US
dc.title Text-based language identification for the South African languages en
dc.type Dissertation en
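
The abstract describes classifying text by its character n-gram statistics. As a rough illustration only, the sketch below shows one way a Naïve Bayesian classifier over character n-grams could look. It is not the dissertation's implementation: the names `char_ngrams` and `NgramNaiveBayes`, the add-one smoothing, the uniform language priors, and the two toy training sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram counts
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in any language

    def train(self, language, text):
        grams = char_ngrams(text.lower(), self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def log_likelihood(self, language, text):
        # Add-one smoothing (an assumption) keeps unseen n-grams from
        # driving the probability to zero.
        counts = self.counts[language]
        denom = self.totals[language] + len(self.vocab)
        return sum(
            math.log((counts[g] + 1) / denom)
            for g in char_ngrams(text.lower(), self.n)
        )

    def classify(self, text):
        # Uniform priors assumed: pick the language with the highest likelihood.
        return max(self.counts, key=lambda lang: self.log_likelihood(lang, text))


# Toy training data, far smaller than any realistic corpus.
model = NgramNaiveBayes(n=3)
model.train("eng", "the cat sits on the mat and the dog sleeps in the sun")
model.train("afr", "die kat sit op die mat en die hond slaap in die son")

print(model.classify("the dog sleeps"))
print(model.classify("die hond slaap"))
```

In this sketch the input window is simply the length of the string passed to `classify`, and `n` is fixed when the model is constructed, so the effect of window size and n-gram order reported in the abstract could be explored by varying those two values on a real corpus.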
