doi: 10.17586/2226-1494-2026-26-2-306-314
Hierarchical multi-task learning for low-complexity models based on task synergy analysis
Article in Russian
For citation:
Surkov M.K. Hierarchical multi-task learning for low-complexity models based on task synergy analysis. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2026, vol. 26, no. 2, pp. 306–314 (in Russian). doi: 10.17586/2226-1494-2026-26-2-306-314
Abstract
The widespread adoption of wearable devices and smart home systems has driven significant growth in potential use cases for such solutions. The abundance of devices, and the need for convenient interaction with them, drives the active development of approaches implementing various aspects of this interaction. Speech is currently one of the most convenient human-machine interfaces. Advances in audio and speech signal processing and analysis enable the successful solution of complex tasks such as automatic speech recognition, speaker identification and verification, and detection of the speaker's emotions, gender, and age. However, these technologies typically require significant computational resources, which are often unavailable on wearable devices and smart home systems. Solving audio/speech analysis tasks in isolation significantly limits human-machine interaction scenarios, while combining multiple technologies on a single device increases the demands on computational resources. The greatest current interest therefore lies in technologies for multi-task audio/speech signal analysis with reduced computational requirements, suitable for deployment on wearable devices and smart home systems. This paper proposes a method for the automatic construction of hierarchical multi-task models for audio/speech signal analysis. The method determines task compatibility while maintaining overall accuracy across all tasks and significantly reducing the number of trainable parameters in the multi-task model. In the first stage, isolated recognition models are trained for each target task and their metrics are measured. In the second stage, the pairwise compatibility of audio/speech analysis tasks is determined by iterating over the number of shared layers in a deep neural network. In the final stage, the hierarchical architecture implementing the multi-task recognition model is formed automatically.
It is demonstrated that, compared to baseline approaches, the developed method produces a compact hierarchical model. Relative to a set of independent single-task models, the proposed architecture reduces the number of trainable parameters by 56 % with an accuracy drop of no more than 1.9 %, whereas a classical ("flat") multi-task architecture loses 2.7 % accuracy. Applying existing multi-task model optimization approaches, LT4REC and the Lottery Ticket Hypothesis, leads to accuracy reductions of 9 % and 6.5 %, respectively. The results have practical significance for the smart device industry (smartphones, wearable gadgets, smart speakers): the proposed algorithm enables efficient audio analysis systems that perform multiple functions simultaneously with minimal computational and memory requirements on resource-constrained devices.
Keywords: hierarchical multi-task learning, on-device audio analysis, resource-efficient neural networks, task synergy, low-complexity models, voice activity detection, speech command recognition, speaker biometrics
References
1. Hebbar R., Somandepalli K., Narayanan S. Robust speech activity detection in movie audio: Data resources and experimental evaluation.
2. Sharma M., Joshi S., Chatterjee T., Hamid R. A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows. Neurocomputing, 2022, vol. 494, pp. 116–131. https://doi.org/10.1016/j.neucom.2022.04.084
3. de Andrade D.C., Leo S., Da Silva Viana M.L., Bernkopf C. A neural attention model for speech command recognition. arXiv, 2018. arXiv:1808.08929. https://doi.org/10.48550/arXiv.1808.08929
4. Sánchez-Hevia H.A., Gil-Pita R., Utrilla-Manso M., Rosa-Zurera M. Age group classification and gender recognition from speech with temporal convolutional neural networks. Multimedia Tools and Applications, 2022, vol. 81, no. 3, pp. 3535–3552. https://doi.org/10.1007/s11042-021-11614-4
5. Koutini K., Schlüter J., Eghbal-zadeh H., Widmer G. Efficient training of audio transformers with Patchout. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2022, pp. 2753–2757. https://doi.org/10.21437/interspeech.2022-227
6. Chen S., Wu Y., Wang C., Liu S., Tompkins D., Chen Z., et al. Beats: audio pre-training with acoustic tokenizers. Proc. of the 40th International Conference on Machine Learning, PMLR, 2023, vol. 202, pp. 5178–5193.
7. Yamashita R., Nishio M., Do R.K.G., Togashi K. Convolutional neural networks: an overview and application in radiology. Insights into Imaging, 2018, vol. 9, no. 4, pp. 611–629. https://doi.org/10.1007/s13244-018-0639-9
8. Sharma M., Joshi S., Chatterjee T., Hamid R. A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows. Neurocomputing, 2022, vol. 494, pp. 116–131. https://doi.org/10.1016/j.neucom.2022.04.084
9. Hoo Z.H., Candlish J., Teare D. What is an ROC curve? Emergency Medicine Journal, 2017, vol. 34, no. 6, pp. 357–359. https://doi.org/10.1136/emermed-2017-206735
10. Ardila R., Branson M., Davis K., Kohler M., Meyer J., Henretty M., et al. Common voice: A massively-multilingual speech corpus. Proc. of the 12th Language Resources and Evaluation Conference, 2020, pp. 4218–4222.
11. Ayache M., Kanaan H., Kassir K., Kassir Y. Speech command recognition using deep learning. Proc. of the 6th International Conference on Advances in Biomedical Engineering (ICABME), 2021, pp. 24–29. https://doi.org/10.1109/ICABME53305.2021.9604862
12. Warden P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv, 2018. arXiv:1804.03209. https://doi.org/10.48550/arXiv.1804.03209
13. Zhang Y., Yang Q. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2022, vol. 34, no. 12, pp. 5586–5609. https://doi.org/10.1109/TKDE.2021.3070203
14. Moritz N., Wichern G., Hori T., Le Roux J. All-in-one transformer: Unifying speech recognition, audio tagging, and event detection. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2020, pp. 3112–3116.
15. Chu Y., Xu J., Zhou X., Yang Q., Zhang S., Yan Z., et al. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv, 2023. arXiv:2311.07919. https://doi.org/10.48550/arXiv.2311.07919
16. Standley T., Zamir A., Chen D., Guibas L., Malik J., Savarese S. Which tasks should be learned together in multi-task learning? Proc. of the 37th International Conference on Machine Learning, PMLR, 2020, vol. 119, pp. 9120–9132.
17. Zamir A.R., Sax A., Shen W., Guibas L., Malik J., Savarese S. Taskonomy: disentangling task transfer learning. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3712–3722. https://doi.org/10.1109/CVPR.2018.00391
18. Surkov M.K. Towards efficient universal audio analysis: a low-complexity model via synergistic multi-task learning. Proc. of the 38th Conference of FRUCT Association, 2025, vol. 38, no. 2, pp. 420–427.
19. Chen T., Zhang Z., Liu S., Chang S., Wang Z. Long live the lottery: The existence of winning tickets in lifelong learning. Proc. of the International Conference on Learning Representations, 2021, pp. 1–19.
20. Frankle J., Carbin M.J. The lottery ticket hypothesis: finding sparse, trainable neural networks. Proc. of the 7th International Conference on Learning Representations, 2019.
21. Malach E., Yehudai G., Shalev-shwartz S., Shamir O. Proving the lottery ticket hypothesis: Pruning is all you need. Proc. of the 37th International Conference on Machine Learning, PMLR, 2020, vol. 119, pp. 6682–6691.
22. Xiao X., Chen H., Liu Y., Yao X., Liu P., Fan C., et al. LT4REC: a lottery ticket hypothesis based multi-task practice for video recommendation system. arXiv, 2020. arXiv:2008.09872. https://doi.org/10.48550/arXiv.2008.09872
23. Fifty C., Amid E., Zhao Z., Yu T., Anil R., Finn C. Efficiently identifying task groupings for multi-task learning. Proc. of the 35th International Conference on Neural Information Processing Systems, 2021, pp. 27503–27516.
24. Schmid F., Primus P., Heittola T., Mesaros A., Martín-Morató I., Koutini K., et al. Data-efficient low-complexity acoustic scene classification in the DCASE 2024 challenge. arXiv, 2024. arXiv:2405.10018. https://doi.org/10.48550/arXiv.2405.10018

