End-to-end automated speech recognition using a character based small scale transformer architecture

dc.contributor.author: Loubser, Alexander
dc.contributor.author: De Villiers, Johan Pieter
dc.contributor.author: De Freitas, Allan
dc.contributor.email: a.loubser@tuks.co.za
dc.date.accessioned: 2024-05-21T07:46:44Z
dc.date.available: 2024-05-21T07:46:44Z
dc.date.issued: 2024-10
dc.description: DATA AVAILABILITY: Data will be made available on request.
dc.description.abstract: This study explores the feasibility of constructing a small-scale speech recognition system capable of competing with larger, modern automated speech recognition (ASR) systems in both performance and word error rate (WER). Our central hypothesis is that a compact transformer-based ASR model can achieve a WER comparable to that of traditional ASR models while challenging contemporary ASR systems of significantly larger computational size. The aim is to extend ASR capabilities to under-resourced languages with limited corpora, catering to scenarios where practitioners face constraints in both data availability and computational resources. The model, comprising a compact convolutional neural network (CNN) and transformer architecture with 2.214 million parameters, challenges the conventional wisdom that large-scale transformer-based ASR systems are essential for achieving high accuracy. In comparison, contemporary ASR systems often deploy over 300 million parameters. Trained on a modest dataset of approximately 3000 hours (significantly less than the 50,000 hours used by larger systems), the proposed model leverages the Common Voice and LibriSpeech datasets. Evaluation on the LibriSpeech test-clean and test-other datasets produced character error rates (CERs) of 6.40% and 16.73% and WERs of 16.03% and 35.51%, respectively. Comparisons with existing architectures showcase the efficiency of our model. A gated recurrent unit (GRU) architecture, albeit achieving lower error rates, incurred a computational cost 24 times larger than that of our proposed model. Large-scale transformer architectures, while achieving marginally lower WERs (2%–4% on LibriSpeech test-clean), require 200 times more parameters and 53,000 additional hours of training data. Modern large language models can also be used to improve WERs, but they require large computational resources. To further enhance performance, a small 4-gram language model was integrated into our end-to-end ASR model, resulting in improved WERs. The overarching goal of this work is to provide a practical solution for practitioners dealing with limited datasets and computational resources, particularly in the context of under-resourced languages.
dc.description.department: Electrical, Electronic and Computer Engineering
dc.description.librarian: hj2024
dc.description.sdg: SDG-09: Industry, innovation and infrastructure
dc.description.sponsorship: The MultiChoice Chair of Machine Learning.
dc.description.uri: https://www.elsevier.com/locate/eswa
dc.identifier.citation: Loubser, A., De Villiers, P. & De Freitas, A. 2024, 'End-to-end automated speech recognition using a character based small scale transformer architecture', Expert Systems with Applications, vol. 252, part A, art. 124119, pp. 1-11, DOI: 10.1016/j.eswa.2024.124119.
dc.identifier.issn: 0957-4174 (print)
dc.identifier.issn: 1873-6793 (online)
dc.identifier.other: 10.1016/j.eswa.2024.124119
dc.identifier.uri: http://hdl.handle.net/2263/96106
dc.language.iso: en
dc.publisher: Elsevier
dc.rights: © 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license.
dc.subject: Automated speech recognition (ASR)
dc.subject: Speech recognition
dc.subject: Transformer
dc.subject: End-to-end
dc.subject: Character based
dc.subject: Connectionist temporal classification
dc.subject: Convolutional neural network (CNN)
dc.subject: Word error rate (WER)
dc.subject: Character error rate (CER)
dc.subject: SDG-09: Industry, innovation and infrastructure
dc.title: End-to-end automated speech recognition using a character based small scale transformer architecture
dc.type: Article
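
For readers wanting to experiment with the approach described in the abstract, the sketch below shows the general pattern: a compact CNN front-end that downsamples the spectrogram in time, a small transformer encoder, and connectionist temporal classification (CTC) training over characters. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the character set, layer sizes, and all hyperparameters are illustrative choices made only to keep the example self-contained.

    # Minimal sketch: compact CNN + transformer encoder trained with CTC over
    # characters. NOT the paper's model -- sizes and vocabulary are assumptions.
    import torch
    import torch.nn as nn

    CHARS = "abcdefghijklmnopqrstuvwxyz' "   # assumed character vocabulary
    NUM_CLASSES = len(CHARS) + 1             # +1 for the CTC blank token

    class SmallCTCTransformer(nn.Module):
        def __init__(self, n_mels=80, d_model=144, n_heads=4, n_layers=4):
            super().__init__()
            # CNN front-end: downsample the mel spectrogram in time (4x here)
            # before the quadratic-cost attention layers.
            self.cnn = nn.Sequential(
                nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
            )
            enc_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads,
                dim_feedforward=4 * d_model, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
            self.head = nn.Linear(d_model, NUM_CLASSES)  # per-frame char logits

        def forward(self, mels):                 # mels: (batch, n_mels, time)
            x = self.cnn(mels).transpose(1, 2)   # -> (batch, time', d_model)
            x = self.encoder(x)
            return self.head(x).log_softmax(-1)  # CTC expects log-probabilities

    # One CTC training step; nn.CTCLoss wants log-probs as (time, batch, classes).
    model = SmallCTCTransformer()
    ctc = nn.CTCLoss(blank=NUM_CLASSES - 1, zero_infinity=True)
    mels = torch.randn(2, 80, 400)                        # dummy batch
    targets = torch.randint(0, NUM_CLASSES - 1, (2, 30))  # dummy char indices
    log_probs = model(mels).transpose(0, 1)               # (time', batch, classes)
    loss = ctc(log_probs, targets,
               input_lengths=torch.full((2,), log_probs.size(0)),
               target_lengths=torch.full((2,), 30))
    loss.backward()

At inference time, greedy CTC decoding (argmax per frame, collapse repeats, drop blanks) yields a character string; the 4-gram language model rescoring mentioned in the abstract would sit on top of a beam-search decoder rather than this greedy step.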

Files

Original bundle

Name: Loubser_EndToEnd_2024_2024.pdf
Size: 645.93 KB
Format: Adobe Portable Document Format
Description: Article

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed to upon submission