doi: 10.17586/2226-1494-2022-22-6-1143-1149


Improving out of vocabulary words recognition accuracy for an end-to-end Russian speech recognition system

A. Y. Andrusenko, A. N. Romanenko


Article in English

For citation:
Andrusenko A.Yu., Romanenko A.N. Improving out of vocabulary words recognition accuracy for an end-to-end Russian speech recognition system. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2022, vol. 22, no. 6, pp. 1143–1149. doi: 10.17586/2226-1494-2022-22-6-1143-1149


Abstract
Automatic Speech Recognition (ASR) systems are being actively introduced into everyday life, simplifying the way we interact with electronic devices, and the advent of end-to-end approaches has only accelerated this process. However, the constant evolution and the high degree of inflection of the Russian language lead to the problem of recognizing out-of-vocabulary (OOV) words, i.e., new words that did not take part in training the ASR system. In such cases, the ASR model tends to predict the most similar word from the training data, which results in a recognition error. This is especially true for ASR models that use decoding based on a Weighted Finite State Transducer (WFST), since they are inherently limited to the list of vocabulary words that can appear in the recognition output. In this paper, the problem is investigated on an open Russian-language dataset (Common Voice) with an end-to-end ASR system that uses a WFST decoder. Two methods are proposed: retraining the end-to-end ASR system with the discriminative Maximum Mutual Information (MMI) loss function, and decoding the end-to-end model with a TG graph. Discriminative training smooths the probability distribution of acoustic class predictions, thus adding more variability to the recognition results. Decoding with the TG graph, in turn, is not limited to vocabulary words and allows the use of a language model trained on a large amount of external text data. An eight-hour subset of the Common Voice corpus is used as the test set; the share of OOV words in it is 18.1 %. The results show that the proposed methods reduce the Word Error Rate (WER) by 3 % absolute compared to the standard decoding method for end-to-end models (beam search) while keeping OOV recognition at a comparable level. The proposed methods should improve the overall recognition quality of ASR systems and make them more robust to new words that were not seen during training.
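For reference, the MMI criterion maximized during such discriminative training is conventionally written as follows (the notation, with acoustic features X_u, reference transcription W_u, model parameters \theta, and acoustic scale \kappa, is the standard one from the literature rather than the paper's own):

    \mathcal{F}_{\mathrm{MMI}}(\theta) = \sum_{u=1}^{U} \log \frac{p_\theta(X_u \mid W_u)^{\kappa}\, P(W_u)}{\sum_{W'} p_\theta(X_u \mid W')^{\kappa}\, P(W')}

The numerator scores the reference transcription, while the denominator sums over all competing word sequences (in the lattice-free variant, over a compact token-level denominator graph); maximizing their ratio raises the correct hypothesis relative to its competitors and yields less peaky, smoother acoustic posteriors than pure CTC training.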

Keywords: automatic speech recognition, end-to-end ASR, discriminative training, OOV words, weighted finite state transducer
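As an illustration of TG decoding-graph construction, the sketch below composes a CTC token topology T with a grammar G using the open-source k2 library. This is a minimal sketch under stated assumptions: the file path and unit inventory size are placeholders, and G is taken to be an n-gram model over sub-word units (consistent with the abstract's claim that TG decoding is not restricted to vocabulary words); it mirrors the common k2 recipe, not necessarily the authors' exact configuration.

import k2

num_tokens = 500                       # illustrative size of the sub-word unit inventory
T = k2.ctc_topo(max_token=num_tokens)  # CTC topology: collapses repeats, removes blanks

# G: an n-gram LM over the same sub-word units, exported to OpenFst text
# format ("G_fst.txt" is a placeholder path).
with open("G_fst.txt") as f:
    G = k2.Fsa.from_openfst(f.read(), acceptor=False)

G = k2.arc_sort(G)                     # compose() expects an arc-sorted right operand
TG = k2.compose(T, G, treat_epsilons_specially=True)
TG = k2.arc_sort(k2.connect(TG))       # drop dead states, sort arcs for decoding

Because G here scores sequences of sub-word units rather than whole words, the search space can spell out words that never occurred in the acoustic training data, which is what preserves OOV recognition while still injecting knowledge from external text.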

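Finally, the quoted quality measures are straightforward to reproduce. Below is a minimal, self-contained Python sketch of WER (word-level edit distance) and of the OOV share, such as the 18.1 % figure quoted above; the function names are illustrative and not taken from the paper.

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def oov_rate(test_utterances: list[str], train_vocab: set[str]) -> float:
    """Share of word tokens in the test set absent from the training vocabulary."""
    tokens = [w for utt in test_utterances for w in utt.split()]
    return sum(w not in train_vocab for w in tokens) / max(len(tokens), 1)

if __name__ == "__main__":
    # One substitution out of three reference words -> WER = 1/3
    print(wer("пример распознавания речи", "пример распознание речи"))
    # Two of three test tokens are missing from the training vocabulary
    print(oov_rate(["новое слово тест"], {"тест"}))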

