AUDIO-VISUAL SPEECH PROCESSING AND ANALYSIS BASED ON SUBSPACE PROJECTIONS

Oleinik Andrei L

2018 , VOLUME 18, NUMBER 2 ( March-April )

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

Publications

Editor-in-Chief

Nikiforov
Vladimir O.
D.Sc., Prof.

Partners

doi: 10.17586/2226-1494-2018-18-2-243-254

AUDIO-VISUAL SPEECH PROCESSING AND ANALYSIS BASED ON SUBSPACE PROJECTIONS

A. L. Oleinik

Read the full article

Article in Russian

For citation: Oleinik A.L. Audio-visual speech processing and analysis based on subspace projections. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2018, vol. 18, no. 2, pp. 243–254 (in Russian). doi: 10.17586/2226-1494-2018-18-2-243-254

Abstract

Subject of Research.The paper deals with the problems of the mutual reconstruction (transformation) of acoustic and visual components (modalities) of speech. Audio recording of voice represents the acoustic component whereas the parallel video recording of the speaker’s face comprises the visual component. Because of the different physical nature of these modalities, their mutual analysis is accompanied by numerous difficulties. Reconstruction methods can be used to overcome these difficulties. Method. The proposed approach is based on Principal Component Analysis (PCA), Multiple Linear Regression (MLR), Partial Least Squares regression (PLS regression) and K-means clustering algorithm. Moreover, attention is paid to data preprocessing. Mel-frequency cepstral coefficients (MFCCs) are used as acoustic features, and twenty key points, which represent the mouth contour, comprise visual features. Main Results. The experiments on the reconstruction of the mouth contour from the MFCCs are presented. The experiments were carried out on VidTIMIT dataset of audio-visual phrase recordings in English. Four variants of the proposed approach were tested and evaluated. They are based on PCA and PLS regression with clustering and without it. Quantitative (objective) and qualitative (subjective) assessment confirmed the efficiency of the proposed approach. The implementation based on PLS regression with preliminary clustering led to the best results. Practical Relevance. The proposed approach can be used to develop various bimodal biometric systems, voice-driven virtual “avatars”, mobile access control systems and other useful human-computer interaction solutions. Moreover, it is shown that, given the proper implementation, PCA and PLS reduce significantly the computational complexity of the reconstruction operation. In addition, the clustering step can be omitted to increase additionally the processing speed at the cost of slightly lower reconstruction quality.

Keywords: bimodal speech systems, reconstruction, principal component analysis, clustering, partial least squares, regression

Acknowledgements. This work was partially supported by ITMO University start-up funding.

References

Ivanko D.V., Karpov A.A. An analysis of perspectives for using high-speed cameras in processing dynamic video information. SPIIRAS Proceedings, 2016, no. 1, pp. 98–113. doi: 10.15622/SP.44.7 (In Russian)
McGurk H., MacDonald J. Hearing lips and seeing voices. Nature, 1976, vol. 264, no. 5588, pp. 746–748.
Atrey P.K., Hossain M.A., El Saddik A., Kankanhalli M.S. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 2010, vol. 16, no. 6, pp. 345–379. doi: 10.1007/s00530-010-0182-0
Nefian A.V., Liang L., Pi X. et al. A coupled HMM for audio-visual speech recognition. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2002, vol. 2, pp. 2013–2016. doi: 10.1109/ICASSP.2002.574502 7
Karpov A. An automatic multimodal speech recognition system with audio and video information. Automation and Remote Control,2014,vol. 75,no. 12,pp. 2190–2200. doi: 10.1134/S000511791412008X
Pachoud S., Gong S., Cavallaro A. Space-time audio-visual speech recognition with multiple multi-class probabilistic support vector machines. Proc. Auditory-Visual Speech Processing AVSP. Norwich, UK, 2009, pp. 155–160.
Hammami I., Mercies G., Hamouda A. The Kohonen map for credal fusion of heterogeneous data. Proc. IEEE International Geoscience and Remote Sensing Symposium, IGARSS. Milan, Italy, 2015, pp. 2947–2950. doi: 10.1109/IGARSS.2015.7326433
Hochreiter S., Schmidhuber J. Long short-term memory. Neural Computation, 1997, vol. 9, no. 8, pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735
Jaeger H. The «echo state» approach to analysing and training recurrent neural networks - with an erratum note. GMD Technical Report 148, German National Research Center for Information Technology,2001, 13 p.
LeCun Y. et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, vol. 86, no. 11, pp. 2278–2324. doi: 10.1109/5.726791
Hou J.-C., Wang S.S., Lai Y.H., Tsao Y., Chang H.W., Wan H.M. Audio-visual speech enhancement based on multimodal deep convolutional neural network. ArXiv Prepr, ArXiv170310893, 2017.
Noda K., Yamaguchi Y., Nakadai K., Okuno H.G., Ogata T. Audio-visual speech recognition using deep learning. Applied Intelligence, 2015, vol. 42, no. 4, pp. 722–737. doi: 10.1007/s10489-014-0629-7
Ren J., Hu Y., Tai Y.W. et al. Look, listen and learn - a multimodal LSTM for speaker identification. Proc. 30^th AAAI Conference on Artificial Intelligence. Phoenix, USA, 2016, pp. 3581–3587.
Kukharev G.A., Kamenskaya E.I., Matveev Y.N., Shchegoleva N.L. Methods for Face Image Processing and Recognition in Biometric Applications. Ed. M.V. Khitrov. St. Petersburg, Politekhnika Publ., 2013, 388 p. (In Russian)
Meng H., Huang D., Wang H., Yang H., Al-Shuraifi M., Wang Y. Depression recognition based on dynamic facial and vocal expression features using partial least square regression. Proc. 3^rd ACM International Workshop on Audio/Visual Emotion Challenge, AVEC 2013. Barselona, Spain, 2013, pp. 21–29. doi: 10.1145/2512530.2512532
Liu M., Wang R., Huang Z., Shan S., Chen X. Partial least squares regression on grassmannian manifold for emotion recognition. Proc. 15^th ACM on Int. Conf. on Multimodal Interaction. Sydney, Australia, 2013, pp. 525–530. doi: 10.1145/2522848.2531738
Bakry A., Elgammal A. MKPLS: Manifold kernel partial least squares for lipreading and speaker identification. Proc. 26^th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2013. Portland, USA, 2013, pp. 684–691. doi: 10.1109/CVPR.2013.94
Sargin M.E., Yemez Y., Erzin E., Tekalp A.M. Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Transactions on Multimedia, 2007, vol. 9, no. 7, pp. 1396–1403. doi: 10.1109/TMM.2007.906583
Sigg C., Fischer B., Ommer B., Roth V., Buhmann J. Nonnegative CCA for audiovisual source separation. Proc. 17^th IEEE Int. Workshop on Machine Learning for Signal Processing. Thessaloniki, Greece, 2007, pp. 253–258. doi: 10.1109/MLSP.2007.4414315
Lee J.-S., Ebrahimi T. Two-level bimodal association for audio-visual speech recognition. Lecture Notes in Computer Science, 2009, vol. 5807, pp. 133–144.doi: 10.1007/978-3-642-04697-1_13
De Bie T., Cristianini N., Rosipal R. Eigenproblems in pattern recognition. In: Handbook of Geometric Computing. Ed. E.B. Corrochano. Berlin, Springer, 2005, pp. 129–167. doi: 10.1007/3-540-28247-5_5
Esbensen K.H. Multivariate Date Analysis – In Practice.
5^th ed. Oslo, Norway, CAMO Process AS, 2002, 598 p.
Prasad N.V., Umesh S. Improved cepstral mean and variance normalization using Bayesian framework. Proc. 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 156–161. doi: 10.1109/ASRU.2013.6707722
OpenCV Library. URL: http://opencv.org (accessed: 20.01.2018).
Kazemi V., Sullivan J. One millisecond face alignment with an ensemble of regression trees. Proc. IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA, 2014, pp. 1867–1874. doi: 10.1109/CVPR.2014.241
dlib C++ Library. URL: http://dlib.net (accessed: 20.01.2018).
Oleinik A.L. Application of Partial Least Squares regression for audio-visual speech processing and modeling.Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2015, vol. 15, no. 5, pp. 886–892. (In Russian) doi: 10.17586/2226-1494-2015-15-5-886-892
SoX - Sound eXchange. HomePage. URL: http://sox.sourceforge.net (accessed: 09.09.2017).
Wojcicki K. Mel Frequency Cepstral Coefficient Feature Extraction. Available at: www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab (accessed: 20.01.2018).
The VidTIMIT Audio-Video Database. URL: http://conradsanderson.id.au/vidtimit/ (accessed: 20.01.2018).
Sanderson C., Lovell B.C. Multi-region probabilistic histograms for robust and scalable identity inference. Lecture Notes in Computer Science, 2009, vol. 5558, pp. 199–208. doi: 10.1007/978-3-642-01793-3_21
Benton A., Khayrallah H., Gujral B., Reisinger D.A., Zhang S., Arora R. Deep generalized canonical correlation analysis. ArXiv Prepr, ArXiv1702.02519, 2017, 14 p.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License