ACOUSTIC MODELING FOR KAZAKH SPEECH SYNTHESIS

Arman K. Kaliyev, Sergey V. Rybin

doi:10.17586/2226-1494-2019-19-5-951-954

2019, Volume 19, Number 5 (September-October)

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

Article in Russian

For citation:

Kaliyev A.K., Rybin S.V. Acoustic modeling for Kazakh speech synthesis. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 5, pp. 951–954 (in Russian). doi: 10.17586/2226-1494-2019-19-5-951-954

Abstract

We present a new generative adversarial network (GAN) framework for training an acoustic model for speech synthesis. The proposed network consists of a generator and a pair of agent discriminators; the generator predicts acoustic features from the linguistic representation. Training and testing were carried out on a Kazakh speech corpus containing 5.6 hours of recorded speech. In the experiments the synthesized speech received a mean opinion score of 3.46, which indicates acceptable synthesis quality. This approach to acoustic model development can also be applied in speech synthesis systems for other languages.
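
For readers unfamiliar with the metric, a mean opinion score is the arithmetic mean of listener ratings, conventionally given on a 1 to 5 scale. The following is a minimal sketch of how such a score and an approximate 95% confidence interval could be computed. The function name `mean_opinion_score` and the ratings are hypothetical illustrations and are not the data or the evaluation procedure of the article.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is the half-width of a 95% confidence interval under a
    normal approximation (an assumption; small panels may call for a
    t-distribution instead).
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.fmean(ratings)
    if len(ratings) < 2:
        # A single rating gives no estimate of spread.
        return mos, 0.0
    std_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * std_error


# Hypothetical ratings, invented for illustration only.
ratings = [4, 3, 4, 3, 3, 4, 4, 3, 3, 4]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean makes it easier to judge whether a difference between two systems is larger than the listener-to-listener variation.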

Keywords: acoustic model, speech synthesis, Kazakh language, generative adversarial network (GAN), speech corpus

Acknowledgements. This work was financially supported by the initial funding from ITMO University within the framework of research practice No. 618278 “Emotional speech synthesis based on generative adversarial networks”.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License