doi: 10.17586/2226-1494-2026-26-2-306-314
Hierarchical multi-task learning for low-complexity models based on task synergy analysis
Article in Russian
For citation:
Surkov M.K. Hierarchical multi-task learning for low-complexity models based on task synergy analysis. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2026, vol. 26, no. 2, pp. 306–314 (in Russian). doi: 10.17586/2226-1494-2026-26-2-306-314
Abstract
The widespread adoption of wearable devices and smart home systems has driven significant growth in potential use cases for such solutions. The abundance of devices, and the need for convenient interaction with them, drives the active development of approaches implementing various aspects of this interaction. Speech is currently one of the most convenient human-machine interfaces. Advances in audio and speech signal processing and analysis enable the successful solution of complex tasks such as automatic speech recognition, speaker identification and verification, and detection of the speaker's emotions, gender, and age. However, these technologies typically require significant computational resources, which are often unavailable on wearable devices and smart home systems. Solving audio/speech analysis tasks in isolation significantly limits human-machine interaction scenarios, while combining multiple technologies on a single device increases the demands on computational resources. The greatest current interest therefore lies in technologies for multi-task audio/speech signal analysis with reduced computational requirements, suitable for deployment on wearable devices and smart home systems. This paper proposes a method for the automatic construction of hierarchical multi-task models for audio/speech signal analysis. The method determines task compatibility while maintaining overall accuracy across all tasks and significantly reducing the number of trainable parameters in the multi-task model. In the first stage, isolated recognition models are trained for each target task and their metrics are measured. In the second stage, the pairwise compatibility of audio/speech analysis tasks is determined by iterating over the number of shared layers in a deep neural network. In the final stage, the hierarchical architecture implementing the multi-task recognition model is formed automatically.
It is demonstrated that, compared to baseline approaches, the developed method produces a compact hierarchical model. Relative to a set of independent single-task models, the proposed architecture reduces the number of trainable parameters by 56 % with an accuracy drop of no more than 1.9 %, whereas a classical ("flat") multi-task architecture loses 2.7 % accuracy. Applying existing multi-task model optimization approaches, LT4REC and the Lottery Ticket Hypothesis, leads to accuracy reductions of 9 % and 6.5 %, respectively. The results have practical significance for the smart device industry (smartphones, wearable gadgets, smart speakers): the proposed algorithm enables efficient audio analysis systems that perform multiple functions simultaneously with minimal computational and memory requirements on resource-constrained devices.
Keywords: hierarchical multi-task learning, on-device audio analysis, resource-efficient neural networks, task synergy, low-complexity models, voice activity detection, speech command recognition, speaker biometrics
References
1. Hebbar R., Somandepalli K., Narayanan S. Robust speech activity detection in movie audio: Data resources and experimental evaluation.
2. Sharma M., Joshi S., Chatterjee T., Hamid R. A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows. Neurocomputing, 2022, vol. 494, pp. 116–131. https://doi.org/10.1016/j.neucom.2022.04.084
3. de Andrade D.C., Leo S., Da Silva Viana M.L., Bernkopf C. A neural attention model for speech command recognition. arXiv, 2018. arXiv:1808.08929. https://doi.org/10.48550/arXiv.1808.08929
4. Sánchez-Hevia H.A., Gil-Pita R., Utrilla-Manso M., Rosa-Zurera M. Age group classification and gender recognition from speech with temporal convolutional neural networks. Multimedia Tools and Applications, 2022, vol. 81, no. 3, pp. 3535–3552. https://doi.org/10.1007/s11042-021-11614-4
5. Koutini K., Schlüter J., Eghbal-zadeh H., Widmer G. Efficient training of audio transformers with Patchout. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2022, pp. 2753–2757. https://doi.org/10.21437/interspeech.2022-227
6. Chen S., Wu Y., Wang C., Liu S., Tompkins D., Chen Z., et al. Beats: audio pre-training with acoustic tokenizers. Proc. of the 40th International Conference on Machine Learning, PMLR, 2023, vol. 202, pp. 5178–5193.
7. Yamashita R., Nishio M., Do R.K.G., Togashi K. Convolutional neural networks: an overview and application in radiology. Insights into Imaging, 2018, vol. 9, no. 4, pp. 611–629. https://doi.org/10.1007/s13244-018-0639-9
8. Sharma M., Joshi S., Chatterjee T., Hamid R. A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows. Neurocomputing, 2022, vol. 494, pp. 116–131. https://doi.org/10.1016/j.neucom.2022.04.084
9. Hoo Z.H., Candlish J., Teare D. What is an ROC curve? Emergency Medicine Journal, 2017, vol. 34, no. 6, pp. 357–359. https://doi.org/10.1136/emermed-2017-206735
10. Ardila R., Branson M., Davis K., Kohler M., Meyer J., Henretty M., et al. Common voice: A massively-multilingual speech corpus. Proc. of the 12th Language Resources and Evaluation Conference, 2020, pp. 4218–4222.
11. Ayache M., Kanaan H., Kassir K., Kassir Y. Speech command recognition using deep learning. Proc. of the 6th International Conference on Advances in Biomedical Engineering (ICABME), 2021, pp. 24–29. https://doi.org/10.1109/ICABME53305.2021.9604862
12. Warden P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv, 2018. arXiv:1804.03209. https://doi.org/10.48550/arXiv.1804.03209
13. Zhang Y., Yang Q. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2022, vol. 34, no. 12, pp. 5586–5609. https://doi.org/10.1109/TKDE.2021.3070203
14. Moritz N., Wichern G., Hori T., Le Roux J. All-in-one transformer: Unifying speech recognition, audio tagging, and event detection. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2020, pp. 3112–3116.
15. Chu Y., Xu J., Zhou X., Yang Q., Zhang S., Yan Z., et al. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv, 2023. arXiv:2311.07919. https://doi.org/10.48550/arXiv.2311.07919
16. Standley T., Zamir A., Chen D., Guibas L., Malik J., Savarese S. Which tasks should be learned together in multi-task learning? Proc. of the 37th International Conference on Machine Learning, PMLR, 2020, vol. 119, pp. 9120–9132.
17. Zamir A.R., Sax A., Shen W., Guibas L., Malik J., Savarese S. Taskonomy: disentangling task transfer learning. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3712–3722. https://doi.org/10.1109/CVPR.2018.00391
18. Surkov M.K. Towards efficient universal audio analysis: a low-complexity model via synergistic multi-task learning. Proc. of the 38th Conference of FRUCT Association, 2025, vol. 38, no. 2, pp. 420–427.
19. Chen T., Zhang Z., Liu S., Chang S., Wang Z. Long live the lottery: The existence of winning tickets in lifelong learning. Proc. of the International Conference on Learning Representations, 2021, pp. 1–19.
20. Frankle J., Carbin M.J. The lottery ticket hypothesis: finding sparse, trainable neural networks. Proc. of the 7th International Conference on Learning Representations, 2019.
21. Malach E., Yehudai G., Shalev-shwartz S., Shamir O. Proving the lottery ticket hypothesis: Pruning is all you need. Proc. of the 37th International Conference on Machine Learning, PMLR, 2020, vol. 119, pp. 6682–6691.
22. Xiao X., Chen H., Liu Y., Yao X., Liu P., Fan C., et al. LT4REC: a lottery ticket hypothesis based multi-task practice for video recommendation system. arXiv, 2020. arXiv:2008.09872. https://doi.org/10.48550/arXiv.2008.09872
23. Fifty C., Amid E., Zhao Z., Yu T., Anil R., Finn C. Efficiently identifying task groupings for multi-task learning. Proc. of the 35th International Conference on Neural Information Processing Systems, 2021, pp. 27503–27516.
24. Schmid F., Primus P., Heittola T., Mesaros A., Martín-Morató I., Koutini K., et al. Data-efficient low-complexity acoustic scene classification in the DCASE 2024 challenge. arXiv, 2024. arXiv:2405.10018. https://doi.org/10.48550/arXiv.2405.10018

