doi: 10.17586/2226-1494-2018-18-2-243-254
AUDIO-VISUAL SPEECH PROCESSING AND ANALYSIS BASED ON SUBSPACE PROJECTIONS
For citation: Oleinik A.L. Audio-visual speech processing and analysis based on subspace projections. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2018, vol. 18, no. 2, pp. 243–254 (in Russian). doi: 10.17586/2226-1494-2018-18-2-243-254
Abstract
Subject of Research. The paper deals with the problem of mutual reconstruction (transformation) of the acoustic and visual components (modalities) of speech. An audio recording of the voice represents the acoustic component, whereas a parallel video recording of the speaker's face comprises the visual component. Because these modalities differ in physical nature, their joint analysis is accompanied by numerous difficulties, and reconstruction methods can be used to overcome them. Method. The proposed approach is based on Principal Component Analysis (PCA), Multiple Linear Regression (MLR), Partial Least Squares (PLS) regression and the K-means clustering algorithm. Attention is also paid to data preprocessing. Mel-frequency cepstral coefficients (MFCCs) are used as acoustic features, and twenty key points representing the mouth contour comprise the visual features. Main Results. Experiments on the reconstruction of the mouth contour from MFCCs are presented. The experiments were carried out on the VidTIMIT dataset of audio-visual phrase recordings in English. Four variants of the proposed approach were tested and evaluated: those based on PCA and on PLS regression, each with and without clustering. Quantitative (objective) and qualitative (subjective) assessment confirmed the efficiency of the proposed approach; the implementation based on PLS regression with preliminary clustering gave the best results. Practical Relevance. The proposed approach can be used to develop various bimodal biometric systems, voice-driven virtual "avatars", mobile access control systems and other human-computer interaction solutions. Moreover, it is shown that, given a proper implementation, PCA and PLS significantly reduce the computational complexity of the reconstruction operation. In addition, the clustering step can be omitted to further increase processing speed at the cost of slightly lower reconstruction quality.
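The cluster-then-regress pipeline described in the abstract can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: ordinary least squares stands in for PLS regression, a toy K-means is written inline to keep the example self-contained, and the feature dimensions (13 MFCCs per audio frame, 40 coordinates for 20 mouth key points) merely mirror the setup described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the paper's features: 13 MFCCs per audio frame,
# 40 values (x, y for 20 mouth key points) per parallel video frame.
n, d_audio, d_visual, k = 600, 13, 40, 4
A = rng.normal(size=(n, d_audio))                        # acoustic features
W_true = rng.normal(size=(d_audio, d_visual))
V = A @ W_true + 0.01 * rng.normal(size=(n, d_visual))   # visual features

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means: returns centroids and hard cluster assignments."""
    r = np.random.default_rng(seed)
    C = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C, labels

# 1. Preliminary clustering of the audio frames.
C, labels = kmeans(A, k)

# 2. One linear map MFCC -> mouth contour per cluster
#    (least squares here; the paper uses PLS regression).
maps = [np.linalg.lstsq(A[labels == j], V[labels == j], rcond=None)[0]
        for j in range(k)]

# 3. Reconstruction: route each frame to its nearest centroid's map.
def reconstruct(a):
    j = np.argmin(((a - C) ** 2).sum(-1))
    return a @ maps[j]

err = np.mean([np.linalg.norm(reconstruct(A[i]) - V[i]) for i in range(n)])
print(f"mean reconstruction error: {err:.3f}")
```

Dropping step 1 (a single global map) corresponds to the faster, slightly less accurate variant without clustering that the abstract mentions.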
Acknowledgements. This work was partially supported by ITMO University start-up funding.