DOI: 10.17586/2226-1494-2018-18-2-243-254


A. L. Oleinik

Article in Russian

For citation: Oleinik A.L. Audio-visual speech processing and analysis based on subspace projections. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2018, vol. 18, no. 2, pp. 243–254 (in Russian). doi: 10.17586/2226-1494-2018-18-2-243-254


Subject of Research. The paper addresses the mutual reconstruction (transformation) of the acoustic and visual components (modalities) of speech. An audio recording of the voice represents the acoustic component, whereas a parallel video recording of the speaker’s face comprises the visual component. Because these modalities differ in physical nature, their joint analysis is accompanied by numerous difficulties. Reconstruction methods can be used to overcome these difficulties. Method. The proposed approach is based on Principal Component Analysis (PCA), Multiple Linear Regression (MLR), Partial Least Squares regression (PLS regression) and the K-means clustering algorithm. Attention is also paid to data preprocessing. Mel-frequency cepstral coefficients (MFCCs) are used as acoustic features, and twenty key points representing the mouth contour comprise the visual features. Main Results. Experiments on the reconstruction of the mouth contour from MFCCs are presented. The experiments were carried out on the VidTIMIT dataset of audio-visual recordings of English phrases. Four variants of the proposed approach were tested and evaluated: one based on PCA and one based on PLS regression, each with and without clustering. Quantitative (objective) and qualitative (subjective) assessment confirmed the efficiency of the proposed approach. The implementation based on PLS regression with preliminary clustering gave the best results. Practical Relevance. The proposed approach can be used to develop various bimodal biometric systems, voice-driven virtual “avatars”, mobile access control systems and other useful human-computer interaction solutions. Moreover, it is shown that, given a proper implementation, PCA and PLS significantly reduce the computational complexity of the reconstruction operation. In addition, the clustering step can be omitted to further increase processing speed at the cost of slightly lower reconstruction quality.
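The PCA-based variant without clustering can be illustrated with a minimal sketch. This is not the author's implementation: the function names (`fit_pca`, `fit_reconstructor`, `reconstruct`), the numbers of retained components, the 13 MFCCs per frame and the synthetic data standing in for VidTIMIT features are all assumptions made for illustration. Each modality is projected onto its principal subspace, MLR is fitted between the two sets of projections, and the mouth contour (20 key points, i.e. 40 coordinates) is reconstructed from acoustic features alone.

```python
import numpy as np


def fit_pca(data, n_components):
    """Return the mean and the top principal directions (as rows) of `data`."""
    mean = data.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]


def fit_reconstructor(audio, visual, n_audio=8, n_visual=6):
    """Fit PCA on each modality, then MLR between the two score spaces."""
    a_mean, a_basis = fit_pca(audio, n_audio)
    v_mean, v_basis = fit_pca(visual, n_visual)
    a_scores = (audio - a_mean) @ a_basis.T
    v_scores = (visual - v_mean) @ v_basis.T
    # Least-squares matrix mapping audio scores to visual scores.
    coef, *_ = np.linalg.lstsq(a_scores, v_scores, rcond=None)
    return {"a_mean": a_mean, "a_basis": a_basis,
            "v_mean": v_mean, "v_basis": v_basis, "coef": coef}


def reconstruct(model, audio):
    """Predict mouth-contour coordinates from acoustic feature vectors."""
    a_scores = (audio - model["a_mean"]) @ model["a_basis"].T
    v_scores = a_scores @ model["coef"]
    return v_scores @ model["v_basis"] + model["v_mean"]


# Synthetic stand-in data (an assumption, NOT VidTIMIT): both modalities
# are noisy linear functions of a shared 4-dimensional latent signal.
rng = np.random.default_rng(0)
n_frames, n_mfcc, n_points = 500, 13, 20
latent = rng.normal(size=(n_frames, 4))
audio = latent @ rng.normal(size=(4, n_mfcc))
audio += 0.05 * rng.normal(size=audio.shape)
visual = latent @ rng.normal(size=(4, 2 * n_points))
visual += 0.05 * rng.normal(size=visual.shape)

# Train on the first 400 frames, evaluate on the remaining 100.
model = fit_reconstructor(audio[:400], visual[:400])
predicted = reconstruct(model, audio[400:])

rmse = float(np.sqrt(np.mean((predicted - visual[400:]) ** 2)))
# Baseline: always predict the mean training contour.
baseline = float(np.sqrt(np.mean((visual[:400].mean(axis=0) - visual[400:]) ** 2)))
print(f"reconstruction RMSE: {rmse:.3f}, mean-contour baseline: {baseline:.3f}")
```

In this sketch the regression matrix has size 8 × 6 instead of the 13 × 40 needed for regression on the raw features, which hints at how subspace projections can lower the cost of the reconstruction step. A clustered variant could, under the same assumptions, fit one such model per K-means cluster of the acoustic features and select the model by nearest centroid.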

Keywords: bimodal speech systems, reconstruction, principal component analysis, clustering, partial least squares, regression

Acknowledgements. This work was partially supported by ITMO University start-up funding.


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License