DOI: 10.17586/2226-1494-2016-16-4-581-592


A. A. Karpov, H. Kaya, A. A. Salah

Article in Russian

For citation: Karpov A.A., Kaya H., Salah A.A. State-of-the-art tasks and achievements of paralinguistic speech analysis systems. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2016, vol. 16, no. 4, pp. 581–592. doi: 10.17586/2226-1494-2016-16-4-581-592


We present an analytical survey of current tasks in computational paralinguistics and of recent achievements of automatic systems for paralinguistic analysis of conversational speech. Paralinguistics studies non-verbal aspects of human communication and speech, such as natural emotions, accents, psycho-physiological states, pronunciation features, and speaker voice parameters. We describe the architecture of a baseline computer system for acoustical paralinguistic analysis, its main components, and useful speech processing methods. We give an overview of the international Computational Paralinguistics Challenge (ComParE), which has been held annually since 2009 within the INTERSPEECH conference organized by the International Speech Communication Association. We review the sub-challenges (tasks) proposed at the ComParE Challenges in 2009-2016 and analyze the winning systems and the results obtained for each sub-challenge. The most recently completed challenge, ComParE-2015, was held in September 2015 in Germany and proposed three sub-challenges: 1) Degree of Nativeness (DN), determining a speaker's degree of nativeness from acoustics; 2) Parkinson's Condition (PC), estimating the degree of Parkinson's condition from speech; 3) Eating Condition (EC), determining the speaker's eating condition during speech or dialogue and classifying the type of food consumed (one of seven food classes). The EC sub-challenge was won by a joint Turkish-Russian team consisting of the authors of this paper, whose system was the most effective at detecting and classifying the corresponding acoustical paralinguistic events.
The paper describes the architecture of this system, its main modules and methods, the audio data used for training and evaluation, and the best results obtained in automatic classification of these acoustical paralinguistic events.

Keywords: computational paralinguistics, speech technology, acoustical analysis, emotion recognition, machine learning, speaker states, acoustical paralinguistic events

Acknowledgements. This research is financially supported by the Russian Foundation for Basic Research (project No. 16-37-60100) and by the Council for Grants of the President of Russia (project No. MD-3035.2015.8).
