Abstract:
Although the complexity of prosody is widely recognised, there is a lack of widely accepted descriptive standards for prosodic phenomena. This situation has become particularly noticeable with the development of increasingly capable text-to-speech (TTS) systems, which require detailed prosodic models to sound natural. For the languages of Southern Africa, the deficiencies in our modelling capabilities are acute. Little quantitative work has been published on the languages of the Nguni family (such as isiZulu and isiXhosa), and the literature on this topic contains significant contradictions and imprecisions. We have therefore embarked on a programme aimed at understanding the relationship between linguistic and physical variables of a prosodic nature in this family of languages. We then use the knowledge gathered to build intonation models for isiZulu and isiXhosa as representatives of the Nguni languages. Firstly, we need to extract physical measurements from voice recordings of the Nguni languages. A number of pitch tracking algorithms have been developed; however, to our knowledge, these algorithms have not been formally evaluated on a Nguni language. In order to decide on an appropriate algorithm for further analysis, we evaluated two state-of-the-art algorithms, namely the Praat pitch tracker and Yin (developed by Alain de Cheveigné). Praat's pitch tracker performs somewhat better than Yin in terms of gross and fine errors, and we use this algorithm for the rest of our analysis.
For South African languages, the task of building an intonation model is complicated by the lack of available intonation resources. We describe the methodology used for developing a general-purpose intonation corpus and the various methods implemented to extract relevant features such as fundamental frequency, intensity and duration from the spoken utterances of these languages.
In order to understand how the ‘expected’ intonation relates to the measured characteristics, we developed two different statistical approaches to building intonation models for isiZulu and isiXhosa. The first is based on straightforward statistical techniques and the second uses a neural network classifier. Both models achieve fairly good accuracy on our isiZulu and isiXhosa data sets, with the neural network classifier producing slightly better results on both sets than the statistical method. The classification model is also more robust and can easily learn from the training data. We show that it is possible to build fairly good intonation models for these languages using different approaches, and that intensity and fundamental frequency are comparable in predictive value for the ascribed tone.
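As an illustration of the pitch tracker comparison mentioned above, the following is a minimal sketch of how gross and fine errors might be computed from a reference pitch track and an estimated one. The abstract does not define these measures, so the definitions here are assumptions based on common practice: a frame counts as a gross error when the estimate deviates from the reference by more than 20%, and the fine error is the mean absolute deviation in Hz over the remaining frames. The function name `pitch_errors`, the threshold and the sample values are hypothetical.

```python
from statistics import mean


def pitch_errors(reference_hz, estimated_hz, gross_threshold=0.2):
    """Return (gross_error_rate, fine_error_hz) for two pitch tracks.

    Assumed definitions (not taken from the abstract):
      - a frame is a gross error if |est - ref| / ref > gross_threshold;
      - the fine error is the mean absolute deviation in Hz over the
        remaining (non-gross) frames.
    Frames marked 0 (unvoiced) in either track are skipped.
    """
    if len(reference_hz) != len(estimated_hz):
        raise ValueError("tracks must have the same number of frames")

    # Keep only frames that both tracks consider voiced.
    voiced = [(r, e) for r, e in zip(reference_hz, estimated_hz) if r > 0 and e > 0]
    if not voiced:
        return 0.0, 0.0

    deviations = [abs(e - r) for r, e in voiced]
    is_gross = [d / r > gross_threshold for d, (r, _) in zip(deviations, voiced)]

    gross_rate = sum(is_gross) / len(voiced)
    fine = [d for d, g in zip(deviations, is_gross) if not g]
    fine_error = mean(fine) if fine else 0.0
    return gross_rate, fine_error


# Hypothetical frame-level F0 values in Hz (0.0 marks an unvoiced frame).
reference = [120.0, 125.0, 0.0, 130.0, 128.0]
estimate = [121.0, 250.0, 0.0, 129.0, 0.0]
gross_rate, fine_error = pitch_errors(reference, estimate)
print(f"gross error rate: {gross_rate:.3f}, fine error: {fine_error:.2f} Hz")
```

In this toy example the second frame is an octave error and is counted as gross, while the frame that only the reference marks as voiced is ignored; a fuller evaluation would likely also score voicing decisions separately.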