Optical character recognition and text cleaning in the indigenous South African languages

dc.contributor.author Prinsloo, Danie J. (Daniel Jacobus), 1953-
dc.contributor.author Taljard, Elsabe (Elizabeth)
dc.contributor.author Goosen, Michelle
dc.date.accessioned 2023-10-18T12:25:36Z
dc.date.available 2023-10-18T12:25:36Z
dc.date.issued 2022
dc.description.abstract This article follows up on unpublished presentations by the authors on text and corpus cleaning strategies for the African languages. We provide a comparative description of the cleaning of web-sourced and text-based material to be used for the compilation of corpora, with specific attention to the cleaning of text-based material, since this is particularly relevant for the indigenous South African languages. For the purposes of this study, we use the term “web-sourced material” to refer to digital data sourced from the internet, whereas “text-based material” refers to hard copy textual material. We identify the different types of errors found in such texts, looking specifically at typical scanning errors in these languages, followed by an evaluation of three commercially available Optical Character Recognition (OCR) tools. We argue that the cleanness of texts is a matter of granularity, depending on the envisaged application of the corpus that the texts make up. Text corpora which are to be used for, e.g., lexicographic purposes can tolerate a higher level of ‘noise’ than those used for the compilation of, e.g., spelling and grammar checkers. We conclude with some suggestions for text cleaning for the indigenous languages of South Africa. en_US
dc.description.department African Languages en_US
dc.description.librarian am2023 en_US
dc.description.sponsorship The South African Centre for Digital Language Resources (SADiLaR) and the National Research Foundation of South Africa. en_US
dc.description.uri http://spil.journals.ac.za en_US
dc.identifier.citation Prinsloo, D.J., Taljard, E., Goosen, M. 2022, 'Optical character recognition and text cleaning in the indigenous South African languages', Stellenbosch Papers in Linguistics Plus, vol. 64, pp. 165-187. DOI: 10.5842/64-1-867. en_US
dc.identifier.issn 1027-3417 (print)
dc.identifier.issn 2223-9936 (online)
dc.identifier.other 10.5842/64-1-867
dc.identifier.uri http://hdl.handle.net/2263/92986
dc.language.iso en en_US
dc.publisher Stellenbosch University, Library and Information Service en_US
dc.rights © 2021 The authors. This work is licensed under a Creative Commons Attribution 3.0 License. en_US
dc.subject Text cleaning en_US
dc.subject Scanning errors en_US
dc.subject Granularity of cleanness en_US
dc.subject Optical character recognition (OCR) en_US
dc.subject African languages en_US
dc.subject Corpus cleaning en_US
dc.subject Indigenous languages en_US
dc.subject South Africa (SA) en_US
dc.subject.other Humanities articles SDG-09
dc.subject.other SDG-09: Industry, innovation and infrastructure
dc.title Optical character recognition and text cleaning in the indigenous South African languages en_US
dc.type Article en_US

