P. G. Chistikov, A. O. Talanov, D. S. Zakharov, A. I. Solomennik

Article in Russian


We propose an approach to synthesizing high-quality speech from a small initial speech database. A robust solution to this problem is vital for voice restoration, that is, the recovery of lost fragments of recordings from the available speech material of a well-known person such as an actor. The proposed text-to-speech (TTS) system is a hybrid that combines the advantages of HMM-based and Unit Selection-based TTS systems. The approach relies on statistical models of intonation parameters, which makes it possible to preserve the speaker's pronunciation in the synthesized speech. We describe how the database is prepared and how the shortage of original speech material for model training is overcome. Dedicated algorithms for concatenating and modifying speech elements adjust their parameters to the required values, ensure overall tonal smoothness, and reduce spectral distortion at the boundaries of concatenated elements. Listening tests showed that the proposed methods are effective and confirmed that high-quality speech synthesis is possible even with a small speech database (up to one hour of speech).

Keywords: speech synthesis, voice restoration, hidden Markov models, Unit Selection, speech modification
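
The hybrid idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the cost functions, features, or weights, so everything below is an assumption: a generic unit-selection search in which a statistical model supplies target F0 and duration for each phone, and a dynamic-programming pass picks recorded units that stay close to those targets while keeping F0 continuous at the joins. The names `Unit`, `Target`, `target_cost`, `join_cost`, `select_units` and the weights `w_f0`, `w_dur` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A recorded speech element from the database (hypothetical layout)."""
    phone: str
    f0_start: float  # F0 at the left edge, Hz
    f0_end: float    # F0 at the right edge, Hz
    duration: float  # seconds


@dataclass(frozen=True)
class Target:
    """Parameters predicted by a statistical model for one phone (assumed)."""
    phone: str
    f0: float        # predicted mean F0, Hz
    duration: float  # predicted duration, seconds


def target_cost(t, u, w_f0=1.0, w_dur=100.0):
    # Distance between a candidate unit and the model-predicted parameters.
    # The weights are illustrative assumptions, not values from the paper.
    mean_f0 = 0.5 * (u.f0_start + u.f0_end)
    return w_f0 * abs(mean_f0 - t.f0) + w_dur * abs(u.duration - t.duration)


def join_cost(left, right):
    # F0 jump at the boundary; a real system would also compare spectra.
    return abs(left.f0_end - right.f0_start)


def select_units(targets, database):
    """Viterbi search for the unit sequence with the lowest total cost."""
    candidates = [[u for u in database if u.phone == t.phone] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("no candidate unit for some target phone")

    # best[i][j] = (cheapest cost of a path ending in candidates[i][j],
    #               index of the predecessor in candidates[i - 1])
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

With a small database, some targets will have only poorly matching candidates. That is where the concatenation and modification step mentioned in the abstract would come in, adjusting the selected units toward the predicted parameters; this sketch does not attempt that step.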

Copyright 2001-2017 ©
Scientific and Technical Journal
of Information Technologies, Mechanics and Optics.
All rights reserved.