doi: 10.17586/2226-1494-2024-24-5-758-769
Low-complexity multi task learning for joint acoustic scenes classification and sound events detection
Article in Russian
For citation:
Surkov M.K. Low-complexity multi task learning for joint acoustic scenes classification and sound events detection. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 5, pp. 758–769 (in Russian). doi: 10.17586/2226-1494-2024-24-5-758-769
Abstract
The task of automatic metainformation recognition from audio sources is to detect and extract data of various natures (speech, noises, acoustic scenes, acoustic events, anomalies) from a given input audio signal. This area is well studied by the scientific community, and many high-quality approaches exist. However, the vast majority of these methods are based on large neural networks with a huge number of trainable weights, which makes them impractical in environments with severely limited computing resources. The smart device industry is currently growing rapidly: smartphones, smartwatches, voice assistants, TVs, smart home systems. Such products are constrained in both processing power and memory. At present, the state-of-the-art way to cope with these constraints is to use so-called low-complexity models, and in recent years the interest of the scientific community in this problem has been growing (the DCASE Workshop). Two of the most important subtasks in the overall meta-information recognition problem are acoustic scene classification and sound event detection. The key scientific questions are the development of both an optimal low-complexity neural network architecture and learning algorithms that yield a low-resource, high-quality system for classifying acoustic scenes and detecting sound events. In this paper, the datasets from the DCASE Challenge tasks "Low-Complexity Acoustic Scene Classification" and "Sound Event Detection with Weak Labels and Synthetic Soundscapes" were used. A multitask neural network architecture is proposed, consisting of a common encoder and two independent decoders, one for each of the two tasks. The classical multitask learning algorithms SoftMTL and HardMTL were considered, and two modifications were developed: CrossMTL, which is based on the idea of reusing data from one task when training the decoder of the other task, and FreezeMTL, in which the weights of the common encoder are frozen after training on the first task and reused to optimize the second decoder. The experiments showed that the CrossMTL modification significantly increases the accuracy of acoustic scene classification and event detection compared with the classical SoftMTL and HardMTL approaches. The FreezeMTL algorithm yielded a model that provides 42.44 % accuracy in scene classification and 45.86 % accuracy in event detection, which is comparable to the results of the 2023 baseline solutions. The proposed low-complexity neural network consists of 633.5 K trainable parameters and requires 43.2 M MACs to process one second of audio, i.e. 7.8 % fewer trainable parameters and 40 % fewer MACs than the naive application of two independent models. Thanks to the small number of trainable parameters and the small number of MACs required at inference, the developed model can be used in smart devices.
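To make the shared-encoder/two-decoder layout and the FreezeMTL stage described above more concrete, the following is a minimal PyTorch sketch. It is not the paper's actual model: all layer sizes, class counts, and module names are illustrative assumptions, and the only points it demonstrates are (a) one common encoder feeding two task-specific decoders and (b) freezing the encoder before optimizing the second decoder.

```python
# Hypothetical sketch of a multitask ASC + SED model with a shared encoder.
# Layer sizes and class counts are assumptions for illustration only.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Common convolutional front-end over log-mel spectrograms."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.MaxPool2d((2, 1)),          # pool frequency, keep time resolution
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )

    def forward(self, x):                  # x: (batch, 1, n_mels, time)
        h = self.net(x)                    # (batch, C, n_mels / 4, time)
        return h.mean(dim=2)               # average over frequency -> (batch, C, time)

class SceneDecoder(nn.Module):
    """Clip-level acoustic scene classifier (one label per recording)."""
    def __init__(self, channels=32, n_scenes=10):
        super().__init__()
        self.head = nn.Linear(channels, n_scenes)

    def forward(self, h):                  # h: (batch, C, time)
        return self.head(h.mean(dim=2))    # pool over time -> scene logits

class EventDecoder(nn.Module):
    """Frame-level sound event detector (multi-label per frame)."""
    def __init__(self, channels=32, n_events=10):
        super().__init__()
        self.rnn = nn.GRU(channels, channels, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * channels, n_events)

    def forward(self, h):                  # h: (batch, C, time)
        out, _ = self.rnn(h.transpose(1, 2))
        return self.head(out)              # (batch, time, n_events) frame logits

encoder, asc_decoder, sed_decoder = SharedEncoder(), SceneDecoder(), EventDecoder()

# FreezeMTL-style second stage: after the encoder has been trained together
# with the first decoder, its weights are frozen and only the second
# decoder's parameters are optimized.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(sed_decoder.parameters(), lr=1e-3)
```

Because the encoder is shared, its parameters and MACs are counted once for both tasks, which is the source of the savings over running two independent models.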
Keywords: acoustic scene classification, sound event detection, compact models, multitask neural networks, multitask learning, meta-information recognition, smart devices, neural networks