AUTOMATIC SPEECH RECOGNITION – THE MAIN STAGES OVER LAST 50 YEARS
For citation: Tampel I.B. Automatic speech recognition – the main stages over last 50 years. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2015, vol. 15, no. 6, pp. 957–968.
The main stages in the development of automatic speech recognition systems over the last 50 years are reviewed. An attempt is made to evaluate the different methods in terms of how closely they approach the functioning of biological systems. An implementation based on the dynamic programming algorithm, dating from 1968, is taken as the baseline. The shortcomings of this method, which restrict its use to command recognition, are discussed. The next method considered is based on the formalism of Markov chains. Starting from the notion of coarticulation, the need to use context-dependent triphones and biphones instead of context-independent phonemes is shown. The insufficiency of speech databases for triphone training, which led to state-tying methods, is explained. The importance of model adaptation and feature normalization methods, which provide better invariance to speakers, communication channels, and additive noise, is shown. Deep neural networks and recurrent networks are considered as the most up-to-date methods. The similarity between deep (multilayer) neural networks and biological systems is noted. In conclusion, the problems and drawbacks of modern automatic speech recognition systems are described, and a forecast of their development is given.
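As a rough illustration of the dynamic programming approach to command recognition mentioned in the abstract, template matching by dynamic time warping can be sketched as follows. This is a minimal sketch and not the 1968 implementation: the function names `dtw_distance` and `recognize`, the Euclidean frame distance, and the unconstrained three-way step pattern are all assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Accumulated distance of the best monotonic alignment between two
    sequences of equal-dimension feature vectors (assumed frame format)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance (assumption)
            # Allowed steps: stretch a, stretch b, or advance both sequences.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical one-dimensional "features" for two command templates.
templates = {
    "yes": [(0.0,), (1.0,), (2.0,)],
    "no": [(5.0,), (5.0,), (4.0,)],
}
# A slower pronunciation of "yes": the same frames, some of them repeated.
utterance = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]
best = recognize(utterance, templates)
```

Because every command needs its own stored template and the whole utterance is compared as one unit, a scheme of this kind scales poorly beyond small vocabularies, which is consistent with the limitation to command recognition noted in the abstract.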
Acknowledgements. This work is partially financially supported by the Government of the Russian Federation (grant № 074-U01).
1. Levin K., Ponomareva I., Bulusheva A., Chernykh G., Medennikov I., Merkin N., Prudnikov A., Tomashenko N. Automated closed captioning for Russian live broadcasting. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Singapore, 2014, pp. 1438–1442.
2. Terry K. Instant patient records and all you have to do is talk. Medical Economics, 1999, vol. 76, no. 19, pp. 101–102, 107–108, 111–112.
3. Zafar A., Overhage J.M., McDonald C.J. Continuous speech recognition for clinicians. Journal of the American Medical Informatics Association, 1999, vol. 6, no. 3, pp. 195–204.
4. Goedert J. Speech recognition technology gives voice to clinical data. Health Data Management, 2002, vol. 10, no. 12, pp. 30–32, 34, 36.
5. Zick R.G., Olsen J. Voice recognition software versus a traditional transcription service for physician charting in the ED. American Journal of Emergency Medicine, 2001, vol. 19, no. 4, pp. 295–298.
6. Apple - iOS 8 - Siri. Available at: http://www.apple.com/ru/ios/siri (accessed 10.10.2015).
7. Voco: Windows application for translation speech to text. Available at: http://www.speechpro.ru/product/transcription/voco (accessed 10.10.2015).
8. Chistovich L.A., Ventsov A.V., Granstrem M.P. et al. Rukovodstvo po Fiziologii. Fiziologiya Rechi. Vospriyatie Rechi Chelovekom [Guidance on Physiology. Physiology of Speech. The Perception of Human Speech]. Leningrad, Nauka Publ., 1976, 388 p.
9. Huang X., Acero A., Hon H.-W. Spoken Language Processing. Prentice Hall, 2001, 1008 p.
10. The HTK book. Cambridge University Engineering Department. Available at: http://speech.ee.ntu.edu.tw/homework/DSP_HW2-1/htkbook.pdf (accessed 22.10.2015).
11. Tou J.T., Gonzalez R.C. Pattern Recognition Principles. 2nd ed. Addison-Wesley, 1977, 377 p.
12. Hermansky H. Should recognizers have ears? Speech Communication, 1998, vol. 25, no. 1–3, pp. 3–27.
13. Vintsyuk T.K. Raspoznavanie slov ustnoi rechi metodami dinamicheskogo programmirovaniya [Oral speech recognition using dynamic programming]. Kibernetika, 1968, no. 1, pp. 81–88.
14. Velichko V.M., Zagoruiko N.G. Avtomaticheskoe raspoznavanie ogranichennogo nabora ustnykh komand [Automatic recognition of a limited set of verbal commands]. Vychislitel'nye Sistemy, 1969, no. 36, pp. 101–110.
15. Sakoe H., Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1978, vol. 26, no. 1, pp. 43–49. doi: 10.1109/TASSP.1978.1163055
16. Kullback S. Letter to the Editor: The Kullback-Leibler distance. The American Statistician, 1987, vol. 41, no. 4, pp. 340–341.
17. Mansour D., Juang B.H. A family of distortion measures based upon projection operation for robust speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 1989, vol. 37, no. 11, pp. 1659–1671. doi: 10.1109/29.46548
18. Itakura F., Saito S. Analysis synthesis telephony based on the maximum likelihood method. Proc. 6th Int. Congress on Acoustics. Los Alamitos, 1968, pp. 17–20.
19. Flanagan J.L. Speech Analysis, Synthesis and Perception. Springer, 1965. doi: 10.1007/978-3-662-00849-2
20. Baker J.K. The dragon system – an overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1975, vol. ASSP 23, no. 1, pp. 24–29.
21. Jelinek F. Continuous speech recognition by statistical methods. Proc. of IEEE, 1976, vol. 64, no. 4, pp. 532–556. doi: 10.1109/PROC.1976.10159
22. Rabiner L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989, vol. 77, no. 2, pp. 257–286. doi: 10.1109/5.18626
23. Ramesh P., Wilpon J.G. Modeling state durations in hidden Markov models for automatic speech recognition. Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP-92. San Francisco, USA, 1992, vol. 1, pp. 381–384.
24. Bonafonte A., Ros X., Mariño J.B. An efficient algorithm to find the best state sequence in HSMM. Proc. 3rd European Conf. on Speech, Communication and Technology, EUROSPEECH’93. Berlin, Germany, 1993, pp. 1547–1550.
25. Burshtein D. Robust parametric modeling of durations in hidden Markov models. IEEE Transactions on Speech and Audio Processing, 1996, vol. 4, no. 3, pp. 240–242. doi: 10.1109/89.496221
26. Pylkkönen J. Phone Duration Modeling Techniques in Continuous Speech Recognition. Master’s Thesis. Helsinki University of Technology, 2004. Available at: http://users.ics.aalto.fi/jpylkkon/mt.pdf (accessed 18.10.2015).
27. Introduction to Automatic Speech Recognition. MIT, 2003. Available at: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/lecture-notes/lecture1.pdf (accessed 23.10.2015).
28. Sakti S., Markov K., Nakamura S. Incorporation of pentaphone-context dependency based on hybrid HMM/BN acoustic modeling framework. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP. Toulouse, France, 2006, vol. 1, pp. I1177–I1180.
29. Shafran I., Ostendorf M. Use of higher level linguistic structure in acoustic modeling for speech recognition. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP. Istanbul, Turkey, 2000, vol. 2, pp. 1021–1024.
30. Odell J.J. The Use of Context in Large Vocabulary Speech Recognition. 1995. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.7786 (accessed 18.10.2015).
31. Digalakis V., Murveit H. Genones: optimizing the degree of mixture tying in a large vocabulary hidden Markov model-based speech recognizer. Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP. Adelaide, South Australia, 1994, vol. 1, pp. 537–540.
32. Molau S., Kanthak S., Ney H. Efficient vocal tract normalization in automatic speech recognition. Konf. Elektron. Sprachsignalverarbeitung. Cottbus, 2000, pp. 209–216.
33. Hain T., Woodland P.C., Niesler T.R., Whittaker E.W.D. 1998 HTK system for transcription of conversational telephone speech. Proc. Int. Conf. on Acoustics, Speech and Signal Processing, 1999, vol. 1, pp. 57–60.
34. Gauvain J.-L., Lee C.-H. Maximum a posteriori estimation of multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 1994, vol. 2, no. 2, pp. 291–298. doi: 10.1109/89.279278
35. Leggetter C.J., Woodland P.C. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 1995, vol. 9, no. 2, pp. 171–185. doi: 10.1006/csla.1995.0010
36. Gales M.J.F., Woodland P.C. Mean and variance adaptation within the MLLR framework. Computer Speech and Language, 1996, vol. 10, no. 4, pp. 249–264. doi: 10.1006/csla.1996.0013
37. Digalakis V.V., Rtischev D., Neumeyer L. Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Transactions on Speech and Audio Processing, 1995, vol. 3, no. 5, pp. 357–366. doi: 10.1109/89.466659
38. Nguyen P. Fast Speaker Adaptation. 1998. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8771&rep=rep1&type=pdf (accessed 18.10.2015).
39. Kuhn R., Junqua J.-C., Nguyen P., Niedzielski N. Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 2000, vol. 8, no. 6, pp. 695–706. doi: 10.1109/89.876308
40. Kalinli O., Seltzer M.L., Droppo J., Acero A. Noise adaptive training for robust automatic speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 2010, vol. 18, no. 8, pp. 1889–1901. doi: 10.1109/TASL.2010.2040522
41. Bourlard H., Wellekens C.J. Links between Markov models and multilayer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1990, vol. 12, no. 12, pp. 1167–1178. doi: 10.1109/34.62605
42. Bourlard H., Hermansky H., Morgan N. Towards increasing speech recognition error rates. Speech Communication, 1996, vol. 18, no. 3, pp. 205–231. doi: 10.1016/0167-6393(96)00003-9
43. Hornik K., Stinchcombe M., White H. Multilayer feedforward networks are universal approximators. Neural Networks, 1989, vol. 2, no. 5, pp. 359–366. doi: 10.1016/0893-6080(89)90020-8
44. Hinton G., Deng L., Yu D., Dahl G., Mohamed A.-R., Jaitly N., Senior A., Vanhoucke V., Nguyen P., Sainath T., Kingsbury B. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine, 2012, vol. 29, no. 6, pp. 82–97. doi: 10.1109/MSP.2012.2205597
45. Yu D., Deng L. Automatic Speech Recognition. A Deep Learning Approach. London, Springer, 2015, 321 p. doi: 10.1007/978-1-4471-5779-3
46. Hermansky H., Ellis D., Sharma S. Tandem connectionist feature extraction for conventional HMM systems. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP. Istanbul, Turkey, 2000, vol. 3, pp. 1635–1638.
47. Robinson A.J. An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 1994, vol. 5, no. 2, pp. 298–305. doi: 10.1109/72.279192
48. Robinson T., Hochberg M., Renals S. The use of recurrent neural networks in continuous speech recognition. In Automatic Speech and Speaker Recognition. Advanced Topics. Eds. C.H. Lee, F.K. Soong, K. Paliwal. Kluwer Academic Publishers, 1996, 518 p. doi: 10.1007/978-1-4613-1367-0
49. Schwarz P. Phoneme Recognition Based on Long Temporal Context. Ph.D. Thesis. Brno University of Technology, 2008. Available at: http://www.fit.vutbr.cz/~schwarzp/publi/thesis.pdf (accessed 18.10.2015).
50. Triefenbach F., Demuynck K., Martens J.-P. Large vocabulary continuous speech recognition with reservoir-based acoustic models. IEEE Signal Processing Letters, 2014, vol. 21, no. 3, pp. 311–315. doi: 10.1109/LSP.2014.2302080