End-to-End Speech Recognition Models for a Low-Resourced Indonesian Language
Published in 2020 8th International Conference on Information and Communication Technology (ICoICT), 2021
Abstract - Recent automatic speech recognition (ASR) systems are commonly developed using deep learning (DL) instead of the Hidden Markov Model (HMM). Many researchers have shown that DL performs much better than HMM in noisy environments. However, DL needs a huge speech corpus, although it requires neither a dictionary nor the concept of phonemes or syllables. Many DL-based tools have been developed and are claimed to be language-independent ASR, such as Mozilla DeepSpeech (MDS) and Kaituoxu SpeechTransformer (KST). Both MDS and KST are classified as End-to-End ASR (E2EASR), but MDS uses a Recurrent Neural Network (RNN) while KST uses a Transformer network. In this paper, two Indonesian ASR (INASR) systems are developed, one with MDS and one with KST, to compare how well they handle a low-resourced language. Evaluation on a small Bahasa Indonesia speech corpus of 40k utterances shows that KST is slightly better than MDS: KST gives a word error rate (WER) of 22.00%, while MDS produces a WER of 23.10%.
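For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the example sentences are hypothetical, and the paper's own scoring procedure (for example, any text normalization) is not described in this abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch of the conventional WER definition; tokenization
    is plain whitespace splitting, which is an assumption here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("saya makan nasi goreng", "saya makan nasi"))
```

Under this definition, a WER of 22.00% corresponds to roughly 22 word-level errors per 100 reference words.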
Recommended citation: S. Suyanto, A. Arifianto, A. Sirwan and A. P. Rizaendra, “End-to-End Speech Recognition Models for a Low-Resourced Indonesian Language,” 2020 8th International Conference on Information and Communication Technology (ICoICT), 2020, pp. 1-6, doi: 10.1109/ICoICT49345.2020.9166346.