On the development of a tagset for Northern Sotho with special reference to the issue of standardisation

Taljard, Elsabe (Elizabeth); Faaß, Gertrud; Heid, Ulrich; Prinsloo, Danie J. (Daniel Jacobus), 1953-

Date: 2008

Abstract:

Until now, work with corpora in the South African Bantu languages has been limited to the use of raw corpora. Such corpora, however, have limited functionality. The next logical step in any NLP application is therefore the development of software for automatic tagging of electronic texts. The development of a tagset is one of the first steps in corpus annotation. The authors of this article argue that the design of a tagset cannot be isolated from the purpose of the tagset, or from the place of the tagset and its design within the bigger picture of the architecture of corpus annotation. Usage-related aspects therefore feature prominently in the design of the tagset for Northern Sotho. It is explained why the proposed tagset is biased towards human readability rather than machine readability; the choice of a stochastic tagger is motivated; and the relationship between tokenising, tagging, morphological analysis and parsing is discussed. In order to account, at least to some extent, for the morphological complexity of Northern Sotho at the tagging level, a multilevel annotation is opted for: the first level comprises obligatory information and the second optional and recommended information. Finally, aspects of standardisation are considered against the background of reuse, of sharing of resources, and of possible adaptation for use by other disjunctively written South African Bantu languages. It is not the aim of this article to evaluate the results of any tagging procedure using the proposed tagset; it only describes the design and motivates the choices made with regard to it. However, an evaluation is in progress and the results will be published in the near future (cf. Faaß et al., s.a.).

Tot dusver was die gebruik van korpora in die Suid-Afrikaanse Bantoetale beperk tot die ontginning van rou korpora. Die gebruiksmoontlikhede van hierdie tipe korpora is egter beperk. Die volgende logiese stap in enige toepassing van natuurlike taalprosessering is dus die ontwikkeling van sagteware vir outomatiese teksannotering. Die ontwikkeling van ’n stel annoteringsmerkers is een van die eerste stappe in korpus-annotering. Die outeurs van hierdie artikel meen dat die ontwerp van ’n annoteringstel direk verband hou met die doel van so ’n stel, en die posisie daarvan binne die groter raamwerk van die argitektuur van korpusannotasie. Gebruiksaspekte staan daarom sentraal in die ontwerp van ’n annoteringstel vir Noord-Sotho. Daar word verduidelik waarom hierdie stel eerder vir menslike leesbaarheid as vir masjienleesbaarheid voorsiening maak; die keuse van ’n stokastiese annoteerder word gemotiveer, en die verhouding tussen tokenisering, annotasie, en morfologiese en sintaktiese analise word bespreek. Ten einde op annoteringsvlak gedeeltelik voorsiening te maak vir die morfologiese kompleksiteit van Noord-Sotho, is ’n veelvlakkige annotasie verkies waar die eerste annotasievlak verpligte inligting bevat, en die tweede vlak opsionele en aanbevole inligting. Ten slotte word aspekte rondom standaardisering beskou teen die agtergrond van herbruikbaarheid, die doel van hulpbronne en moontlike aanpassing vir gebruik deur ander disjunktief-geskrewe Suid-Afrikaanse Bantoetale. Dit is nie die doel van hierdie artikel om enige annoteringsproses waarin hierdie stel annoteringsmerkers gebruik word, te evalueer nie. Dit beskryf slegs die ontwerp en motiveer die keuses wat tydens die ontwerp van die annoteringsmerkstel gemaak is. ’n Evalueringsproses word tans onderneem en die resultate sal in Faaß et al. (s.a.) gepubliseer word.
