The speech synthesis detection algorithm based on cepstral coefficients and convolutional neural network

Roman A. Murtazin, Alexander Yu. Kouznetsov, Evgeny A. Fedorov, Ilnur M. Garipov, Anna V. Kholodenina, Yulia B. Baldanova, Alisa A. Vorobeva

2021, Volume 21, Number 4 (July-August)

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

doi: 10.17586/2226-1494-2021-21-4-545-552

Article in Russian

For citation:

Murtazin R.A., Kuznetsov A.Yu., Fedorov E.A., Garipov I.M., Kholodenina A.V., Baldanova Yu.B., Vorobeva A.A. The speech synthesis detection algorithm based on cepstral coefficients and convolutional neural network. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2021, vol. 21, no. 4, pp. 545–552 (in Russian). doi: 10.17586/2226-1494-2021-21-4-545-552

Abstract

Existing approaches to detecting synthesized speech are reviewed in light of current issues in synthesizing voice sequences. The stages of an algorithm for detecting spoofing attacks on voice biometric systems are described, and its final workflow is presented. The research focuses on detecting synthesized speech, as it is the most dangerous type of attack. The authors designed a software application for the experimental study, present its structure, and propose an algorithm for detecting synthesized speech. The algorithm uses mel-frequency and constant Q cepstral coefficients to extract speech features. A Gaussian mixture model is used to construct a user model. A convolutional neural network was chosen as the classifier that determines the authenticity of the voice. Two baseline methods for countering spoofing attacks, proposed by the organizers of the ASVspoof2019 competition, were selected for comparison. One of them uses linear frequency cepstral coefficients as speech features, and the other uses constant Q cepstral coefficients. Both use Gaussian mixture models for classification. To evaluate the effectiveness of the proposed solution and compare it with the other methods, a voice database was created, and the EER (equal error rate) and minDCF (minimum detection cost function) metrics were applied. The experimental results demonstrated the advantages of the proposed algorithm over the other algorithms. An advantage of the proposed solution is that the speech features it extracts are also effective for user identification. This makes it possible to use the algorithm to optimize a voice biometric system with embedded protection against spoofing attacks based on speech synthesis. In addition, the proposed method can be used for voice identification with minimal modifications.
Voice biometric identification systems have strong prospects in the banking sector. Such systems allow banks to simplify and accelerate financial transactions and to provide their users with advanced banking functions remotely. The adoption of voice biometric systems is hindered by their vulnerability to spoofing attacks, particularly those conducted by means of speech synthesis. The proposed solution can be integrated into voice biometric systems to improve their security.
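
The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the convention that a higher score means "more likely genuine", and the cost parameters `p_target`, `c_miss` and `c_fa` are assumptions chosen for illustration only. The sketch sweeps a decision threshold over all observed scores, takes the EER at the point where the miss rate and the false alarm rate are closest, and takes minDCF as the smallest normalized detection cost over the same thresholds.

```python
from typing import List, Tuple


def error_rates(genuine: List[float], spoof: List[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, miss_rate, false_alarm_rate) for each candidate threshold.

    Assumed convention: a trial is accepted as genuine when score >= threshold.
    """
    # Candidate thresholds: every observed score, plus one value above the
    # maximum so that the "reject everything" operating point is included.
    scores = sorted(set(genuine) | set(spoof))
    thresholds = scores + [scores[-1] + 1.0]
    rates = []
    for t in thresholds:
        miss = sum(s < t for s in genuine) / len(genuine)      # genuine rejected
        false_alarm = sum(s >= t for s in spoof) / len(spoof)  # spoof accepted
        rates.append((t, miss, false_alarm))
    return rates


def equal_error_rate(genuine: List[float], spoof: List[float]) -> float:
    """Approximate the EER as the mean of the two error rates where they are closest."""
    _, miss, false_alarm = min(
        error_rates(genuine, spoof), key=lambda r: abs(r[1] - r[2])
    )
    return (miss + false_alarm) / 2.0


def min_dcf(
    genuine: List[float],
    spoof: List[float],
    p_target: float = 0.05,  # assumed prior probability of a genuine trial
    c_miss: float = 1.0,     # assumed cost of rejecting a genuine trial
    c_fa: float = 1.0,       # assumed cost of accepting a spoofed trial
) -> float:
    """Minimum normalized detection cost over all candidate thresholds."""
    # Cost of the best trivial system (always accept or always reject).
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return min(
        (c_miss * p_target * miss + c_fa * (1.0 - p_target) * fa) / default_cost
        for _, miss, fa in error_rates(genuine, spoof)
    )


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    genuine_scores = [0.9, 0.8, 0.7, 0.4]
    spoof_scores = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(genuine_scores, spoof_scores))  # 0.25
    print(min_dcf(genuine_scores, spoof_scores))
```

Lower values are better for both metrics. The EER summarizes a single balanced operating point, while minDCF weights the two error types by the assumed prior and costs, so it depends on the parameters chosen.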

Keywords: biometric, automatic speaker verification in banking, synthetic speech, spoofing detection, cepstral analysis, convolutional neural network

Acknowledgements. The paper was prepared at ITMO University within the framework of the scientific project No. 50449 “Development of cyberspace protection algorithms for solving applied problems of ensuring cybersecurity of banking organizations”.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License