Automated phoneme mapping for cross-language speech recognition

Show simple item record

dc.contributor.advisor Botha, Elizabeth C. en
dc.contributor.postgraduate Sooful, Jayren Jugpal en
dc.date.accessioned 2013-09-07T19:22:24Z
dc.date.available 2005-01-11 en
dc.date.available 2013-09-07T19:22:24Z
dc.date.created 2004-06-10 en
dc.date.issued 2006-01-11 en
dc.date.submitted 2005-01-11 en
dc.description Dissertation (MEng (Computer Engineering))--University of Pretoria, 2006. en
dc.description.abstract This dissertation explores a unique automated approach to map one phoneme set to another, based on the acoustic distances between the individual phonemes. Although the focus of this investigation is on cross-language applications, this automated approach can be extended to same-language but different-database applications as well. The main goal of this investigation is to be able to use the data of a source language, to train the initial acoustic models of a target language for which very little speech data may be available. To do this, an automatic technique for mapping the phonemes of the two data sets must be found. Using this technique, it would be possible to accelerate the development of a speech recognition system for a new language. The current research in the cross-language speech recognition field has focused on manual methods to map phonemes. This investigation has considered an English-to-Afrikaans phoneme mapping, as well as an Afrikaans-to-English phoneme mapping. This has been previously applied to these language instances, but utilising manual phoneme mapping methods. To determine the best phoneme mapping, different acoustic distance measures are compared. The distance measures that are considered are the Kullback-Leibler measure, the Bhattacharyya distance metric, the Mahalanobis measure, the Euclidean measure, the L2 metric and the Jeffreys-Matusita distance. The distance measures are tested by comparing the cross-database recognition results obtained on phoneme models created from the TIMIT speech corpus and a locally-compiled South African SUN Speech database. By selecting the most appropriate distance measure, an automated procedure to map phonemes from the source language to the target language can be done. The best distance measure for the mapping gives recognition rates comparable to a manual mapping process undertaken by a phonetic expert. This study also investigates the effect of the number of Gaussian mixture components on the mapping and on the speech recognition system’s performance. The results indicate that the recogniser’s performance increases up to a limit as the number of mixtures increase. In addition, this study has explored the effect of excluding the Mel Frequency delta and acceleration cepstral coefficients. It is found that the inclusion of these temporal features help improve the mapping and the recognition system’s phoneme recognition rate. Experiments are also carried out to determine the impact of the number of HMM recogniser states. It is found that single-state HMMs deliver the optimum cross-language phoneme recognition results. After having done the mapping, speaker adaptation strategies are applied on the recognisers to improve their target-language performance. The models of a fully trained speech recogniser in a source language are adapted to target-language models using Maximum Likelihood Linear Regression (MLLR) followed by Maximum A Posteriori (MAP) techniques. Embedded Baum-Welch re-estimation is used to further adapt the models to the target language. These techniques result in a considerable improvement in the phoneme recognition rate. Although a combination of MLLR and MAP techniques have been used previously in speech adaptation studies, the combination of MLLR, MAP and EBWR in cross-language speech recognition is a unique contribution of this study. Finally, a data pooling technique is applied to build a new recogniser using the automatically mapped phonemes from the target language as well as the source language phonemes. This new recogniser demonstrates moderate bilingual phoneme recognition capabilities. The bilingual recogniser is then further adapted to the target language using MAP and embedded Baum-Welch re-estimation techniques. This combination of adaptation techniques together with the data pooling strategy is uniquely applied in the field of cross-language recognition. The results obtained using this technique outperform all other techniques tested in terms of phoneme recognition rates, although it requires a considerably more time consuming training process. It displays only slightly poorer phoneme recognition than the recognisers trained and tested on the same language database. en
dc.description.availability unrestricted en
dc.description.department Electrical, Electronic and Computer Engineering en
dc.identifier.citation Sooful, J 2004, Automated phoneme mapping for cross-language speech recognition, MEng dissertation, University of Pretoria, Pretoria, viewed yymmdd < http://hdl.handle.net/2263/30584 > en
dc.identifier.upetdurl http://upetd.up.ac.za/thesis/available/etd-01112005-131128/ en
dc.identifier.uri http://hdl.handle.net/2263/30584
dc.language.iso en
dc.publisher University of Pretoria en_ZA
dc.rights © 2004, University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. en
dc.subject Map en
dc.subject Mllr en
dc.subject Data pooling en
dc.subject Embedded baum-welch re-estimation en
dc.subject Transformation-based adaptation en
dc.subject Cross-language en
dc.subject Phoneme mapping en
dc.subject Acoustic distance measures en
dc.subject UCTD en_US
dc.title Automated phoneme mapping for cross-language speech recognition en
dc.type Dissertation en


Files in this item

This item appears in the following Collection(s)

Show simple item record