doi: 10.17586/2226-1494-2018-18-2-346-349


Accuracy increase for automatic visual Russian speech recognition: viseme classes optimization

D. V. Ivanko, D. V. Fedotov, A. A. Karpov

Article in Russian

For citation: Ivanko D.V., Fedotov D.V., Karpov A. A. Accuracy increase for automatic visual Russian speech recognition: viseme classes optimization. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2018, vol. 18, no. 2, pp. 346–349 (in Russian). doi: 10.17586/2226-1494-2018-18-2-346-349


Research continues into which viseme classes are best suited for effective automatic lip-reading. This paper proposes a structured approach to developing speaker-dependent viseme classes. The method produces a set of phoneme-to-viseme correspondence maps in which the number of viseme classes varies from two to forty-eight while the phoneme inventory remains fixed. Viseme classes are defined by their mapping from phonemes, which are converted into viseme groups during the speech recognition process. Using the obtained correspondence maps together with the HAVRUS audio-visual Russian speech corpus, the paper demonstrates how visual speech recognition accuracy depends on the number of viseme classes used. High-speed video data made it possible to expand the optimal set of viseme classes to twenty, which improved recognition accuracy by 1.34% compared with the standard set of fourteen classes.
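The core idea of a phoneme-to-viseme correspondence map can be sketched as a many-to-one lookup: several phonemes that look alike on the lips collapse into one viseme class. The phoneme symbols and groupings below are hypothetical illustrations, not the actual HAVRUS mapping from the paper.

```python
# Minimal sketch of a phoneme-to-viseme correspondence map.
# The symbols and groupings are illustrative only; the paper's
# speaker-dependent maps vary the class count from 2 to 48.

PHONEME_TO_VISEME = {
    # bilabials share one visual appearance
    "p": "V1", "b": "V1", "m": "V1",
    # labiodentals
    "f": "V2", "v": "V2",
    # rounded vs. unrounded vowels
    "a": "V3", "o": "V4", "u": "V4",
}

def phonemes_to_visemes(phonemes):
    """Convert a phoneme transcription into its viseme-class sequence.

    Phonemes missing from the map fall back to a catch-all class "V0".
    """
    return [PHONEME_TO_VISEME.get(p, "V0") for p in phonemes]

print(phonemes_to_visemes(["m", "a", "p", "a"]))  # ['V1', 'V3', 'V1', 'V3']
```

Varying how many distinct viseme classes the map's values contain (while keeping the phoneme keys fixed) is exactly the knob the paper tunes, finding twenty classes optimal for high-speed video.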

Keywords: visual speech recognition, visemes, automatic lip-reading

Acknowledgements. The research was supported by the Ministry of Education and Science of the Russian Federation, contract No. 8.9957.2017/DAAD, as well as in the framework of the Russian state research No. 0073-2018-0002.

  1. Bear H., Harvey R., Theobald B., Lan Y. Which phoneme-to-viseme maps best improve visual-only computer lip-reading. Lecture Notes in Computer Science, 2014, vol. 8888, pp. 230–239.
  2. Hazen T., Saenko K., La C., Glass J. A segment-based audio-visual speech recognizer: data collection, development, and initial experiments. Proc. 6th Int. Conf. on Multimodal Interfaces. New York, 2004, pp. 235–242.
  3. Verkhodanova V., Ronzhin A., Kipyatkova I., Ivanko D., Karpov A., Zelezny M. HAVRUS corpus: high-speed recordings of audio-visual Russian speech. Lecture Notes in Computer Science, 2016, vol. 9811, pp. 338–345.
  4. Ivanko D., Karpov A., Ryumin D., Kipyatkova I., Saveliev A., Budkov V., Ivanko Dm., Zelezny M. Using a high-speed video camera for robust audio-visual speech recognition in acoustically noisy conditions. Lecture Notes in Computer Science, 2017, vol. 10458, pp. 757–767.
  5. Karpov A. An automatic multimodal speech recognition system with audio and video information. Automation and Remote Control, 2014, vol. 75, no. 12, pp. 2190–2200. doi: 10.1134/S000511791412008X
  6. Websdale D., Milner B. Analysing the importance of different visual feature coefficients. Proc. Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing. Vienna, 2015, pp. 137–142.
  7. Savchenko A., Khokhlova Y. About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems. Optical Memory and Neural Networks, 2014, vol. 23, no. 1, pp. 34–42. doi: 10.3103/S1060992X14010068
  8. Zheng G.L., Zhu M., Feng L. Review of lip-reading recognition. Proc. 7th International Symposium on Computational Intelligence and Design. Hangzhou, China, 2014, pp. 293–298. doi: 10.1109/ISCID.2014.110
  9. Karpov A., Kipyatkova I., Zelezny M. A framework for recording audio-visual speech corpora with a microphone and a high-speed camera. Lecture Notes in Computer Science, 2014, vol. 8773, pp. 50–57.


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License