DOI: 10.17586/2226-1494-2016-16-6-1063-1072


GAUSSIAN MIXTURE MODELS FOR ADAPTATION OF DEEP NEURAL NETWORK ACOUSTIC MODELS IN AUTOMATIC SPEECH RECOGNITION SYSTEMS

N. A. Tomashenko, Y. Y. Khokhlov, A. Larcher, Y. Estève, Y. N. Matveev


Article in Russian

For citation: Tomashenko N.A., Khokhlov Yu.Yu., Larcher A., Estève Ya., Matveev Yu. N. Gaussian mixture models for adaptation of deep neural network acoustic models in automatic speech recognition systems. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2016, vol. 16, no. 6, pp. 1063–1072. doi: 10.17586/2226-1494-2016-16-6-1063-1072

Abstract

Subject of Research. We study speaker adaptation of deep neural network (DNN) acoustic models in automatic speech recognition systems. The aim of speaker adaptation techniques is to improve the accuracy of the speech recognition system for a particular speaker. Method. A novel method for the training and adaptation of DNN acoustic models has been developed. It is based on an auxiliary GMM (Gaussian Mixture Model) and GMMD (GMM-derived) features. The principal advantage of the proposed GMMD features is that they make it possible to adapt a DNN by adapting the auxiliary GMM. Because any GMM adaptation method can be used in the proposed approach, it provides a universal way of transferring adaptation algorithms developed for GMMs to DNN adaptation. Main Results. The effectiveness of the proposed approach was demonstrated with one of the most common adaptation algorithms for GMM models, MAP (Maximum A Posteriori) adaptation. Different ways of integrating the proposed approach into a state-of-the-art DNN architecture have been proposed and explored, and the choice of the type of the auxiliary GMM model is analyzed. Experimental results on the TED-LIUM corpus demonstrate that, in an unsupervised adaptation mode, the proposed technique provides approximately an 11–18% relative word error rate (WER) reduction on different adaptation sets compared to the speaker-independent DNN system built on conventional features, and a 3–6% relative WER reduction compared to the SAT-DNN trained on fMLLR-adapted features.
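The core idea described above — using per-Gaussian scores of an auxiliary GMM as input features for the DNN, and adapting those features to a speaker by MAP-adapting the GMM means — can be sketched as follows. This is an illustrative NumPy sketch under our own simplifying assumptions (diagonal covariances, a single relevance factor `tau`, and hypothetical function names), not the authors' implementation:

```python
import numpy as np

def gmm_log_likelihoods(x, means, variances, weights):
    """Per-component weighted log-likelihoods of frame x under a
    diagonal-covariance GMM. The vector of these K scores (one per
    Gaussian) is the GMM-derived (GMMD) feature for the frame."""
    diff = x - means  # shape (K, D)
    ll = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=1)
    return ll + np.log(weights)

def map_adapt_means(means, variances, weights, frames, tau=10.0):
    """MAP re-estimation of GMM means from a speaker's adaptation frames:
    mu_k_new = (tau * mu_k + sum_t gamma_tk * x_t) / (tau + sum_t gamma_tk),
    where gamma_tk are frame-level component posteriors."""
    gamma_sum = np.zeros(means.shape[0])
    x_sum = np.zeros_like(means)
    for x in frames:
        log_post = gmm_log_likelihoods(x, means, variances, weights)
        post = np.exp(log_post - log_post.max())
        post /= post.sum()  # responsibilities gamma_tk
        gamma_sum += post
        x_sum += post[:, None] * x
    return (tau * means + x_sum) / (tau + gamma_sum)[:, None]
```

At recognition time, each frame of a new speaker would be scored against the MAP-adapted GMM, and the resulting GMMD feature vectors (typically with temporal context splicing) fed to the DNN, so the DNN itself never needs speaker-specific retraining.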


Keywords: automatic speech recognition (ASR), acoustic models, speaker adaptation, deep neural networks (DNN), GMM-derived features, GMMD, maximum a posteriori (MAP), fMLLR, GMM, acoustic model adaptation, fusion

Acknowledgements. The work is partially financially supported by the Government of the Russian Federation (grant 074-U01).


