End-to-end automated speech recognition using a character based small scale transformer architecture.


dc.contributor.advisor De Villiers, Pieter
dc.contributor.coadvisor De Freitas, Allan
dc.contributor.postgraduate Loubser, Alexander
dc.date.accessioned 2024-02-14T12:45:07Z
dc.date.available 2024-02-14T12:45:07Z
dc.date.created 2024-04-29
dc.date.issued 2024-02-12
dc.description Dissertation (MEng(Electronic Engineering))--University of Pretoria, 2024. en_US
dc.description.abstract This study explores the feasibility of constructing a small-scale speech recognition system capable of competing with larger, modern automated speech recognition (ASR) systems in both performance and word error rate (WER). Our central hypothesis posits that a compact transformer-based ASR model can achieve WERs comparable to those of traditional ASR models while challenging contemporary ASR systems of significantly larger computational size. The aim is to extend ASR capabilities to under-resourced languages with limited corpora, catering to scenarios where practitioners face constraints in both data availability and computational resources. The model, comprising a compact convolutional neural network (CNN) and transformer architecture with 2.214 million parameters, challenges the conventional wisdom that large-scale transformer-based ASR systems are essential for achieving high accuracy. In comparison, contemporary ASR systems often deploy over 300 million parameters. Trained on a modest dataset of approximately 3000 hours—significantly less than the 50,000 hours used in larger systems—the proposed model leverages the Common Voice and LibriSpeech datasets. Evaluation on the LibriSpeech test-clean and test-other datasets produced character error rates (CERs) of 6.40% and 16.73%, and WERs of 16.03% and 35.51%, respectively. Comparisons with existing architectures showcase the efficiency of our model. A gated recurrent unit (GRU) architecture, albeit achieving lower error rates, incurred a computational cost 24 times larger than that of our proposed model. Large-scale transformer architectures, while achieving marginally lower WERs (2-4% on LibriSpeech test-clean), require 200 times more parameters and 53,000 additional hours of training data. Modern large language models are sometimes used to improve WERs, but they require substantial computational resources.
To further enhance performance, a small 4-gram language model was integrated into our end-to-end ASR model, resulting in improved WERs. The overarching goal of this work is to provide a practical solution for practitioners dealing with limited datasets and computational resources, particularly in the context of under-resourced languages. en_US
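The abstract describes a character-based model trained with connectionist temporal classification (CTC). As a minimal illustration of how such a model's per-frame character outputs become a transcript, the sketch below implements greedy CTC decoding (collapse repeats, then drop blanks). The blank symbol and function names are illustrative assumptions, not taken from the dissertation.

```python
# Minimal sketch of greedy CTC decoding for a character-based ASR model.
# Assumption: the model emits one character (or blank) per acoustic frame.

BLANK = "_"  # assumed CTC blank symbol

def ctc_greedy_decode(frame_chars):
    """Collapse consecutive repeated characters, then remove CTC blanks."""
    out = []
    prev = None
    for ch in frame_chars:
        if ch != prev and ch != BLANK:  # keep only changes that are not blank
            out.append(ch)
        prev = ch
    return "".join(out)

# Example: per-frame argmax characters spelling "cat".
frames = ["c", "c", "_", "a", "a", "_", "t", "t"]
print(ctc_greedy_decode(frames))  # → "cat"
```

Note that a blank between two identical characters preserves both (e.g. the double "l" in "hello"), which is why CTC can represent repeated letters; in practice the decoded hypotheses would then be rescored with the 4-gram language model mentioned above.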
dc.description.availability Unrestricted en_US
dc.description.degree Master of Engineering (Electronic Engineering) en_US
dc.description.department Electrical, Electronic and Computer Engineering en_US
dc.description.faculty Faculty of Engineering, Built Environment and Information Technology en_US
dc.description.sdg SDG-09: Industry, innovation and infrastructure en_US
dc.description.sponsorship MultiChoice Chair of Machine Learning en_US
dc.identifier.citation * en_US
dc.identifier.doi https://doi.org/10.25403/UPresearchdata.25217993 en_US
dc.identifier.other April 2024 (A2024) en_US
dc.identifier.uri http://hdl.handle.net/2263/94605
dc.language.iso en en_US
dc.publisher University of Pretoria
dc.rights © 2023 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject UCTD en_US
dc.subject Speech recognition en_US
dc.subject transformer en_US
dc.subject end-to-end en_US
dc.subject character based en_US
dc.subject connectionist temporal classification en_US
dc.title End-to-end automated speech recognition using a character based small scale transformer architecture. en_US
dc.type Dissertation en_US

