
doi: 10.17586/2226-1494-2025-25-4-737-743
Optimizing knowledge distillation models for language models

Abstract
The problem of optimizing large neural networks is considered using language models as an example. The size of large language models is an obstacle to their practical application under constraints on computing resources and memory. One actively developed direction for compressing large neural network models is knowledge distillation: the transfer of knowledge from a large teacher model to a smaller student model without significant loss of accuracy. Currently known knowledge distillation methods have certain disadvantages: inaccurate knowledge transfer, a long training process, and error accumulation on long sequences. Two methods are proposed that improve the quality of knowledge distillation for language models: selective teacher intervention in the student’s training process and low-rank adaptation. The first approach transfers teacher tokens during student training to those neural network layers for which an exponentially decreasing threshold on the discrepancy between the teacher’s and the student’s probability distributions is reached. The second approach reduces the number of parameters in the neural network by replacing fully connected layers with low-rank ones, which lowers the risk of overfitting and speeds up training. The limitations of each method on long sequences are shown, and it is proposed to combine the two methods into an improved classical knowledge distillation model for long sequences. The combined approach to knowledge distillation on long sequences made it possible to compress the resulting model substantially with only a slight loss of quality, and to significantly reduce GPU memory consumption and response time. The complementary approaches to optimizing knowledge transfer and compressing the model showed better results than selective teacher intervention or low-rank adaptation applied separately. On long sequences, the answer quality of the improved classical knowledge distillation model reached 97 % of the quality of full fine-tuning and 98 % of the quality of the low-rank adaptation method in terms of ROUGE-L and perplexity, while the number of trainable parameters is reduced by 99 % compared to full fine-tuning and by 49 % compared to low-rank adaptation. In addition, GPU memory usage is reduced by 75 % and 30 %, respectively, and inference time by 30 %. The proposed combination of knowledge distillation methods can find application in problems with limited computational resources.
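
The following minimal PyTorch sketch illustrates how the two proposed techniques could fit together. It is not the authors' implementation: the module and function names, the threshold schedule, and all hyperparameters are illustrative assumptions based only on the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLinear(nn.Module):
    # Stand-in for low-rank adaptation: a dense in_dim x out_dim layer is
    # replaced by two factors of rank r, cutting the parameter count from
    # in_dim * out_dim to r * (in_dim + out_dim).
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


def intervention_threshold(step: int, tau0: float = 1.0, decay: float = 0.99) -> float:
    # One possible reading of the "exponentially decreasing threshold" from the
    # abstract; the exact schedule used in the paper may differ.
    return tau0 * (decay ** step)


def distillation_step(student_logits, teacher_logits, targets, step, temperature=2.0):
    # Selective teacher intervention: the teacher's soft targets contribute to
    # the loss only for tokens whose teacher-student divergence exceeds the
    # current threshold.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="none",
    ).sum(dim=-1)  # per-token KL divergence, shape (batch, seq_len)

    mask = (kl > intervention_threshold(step)).float()  # 1 where the teacher intervenes

    ce = F.cross_entropy(
        student_logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(kl)

    # Hard-label loss on all tokens, distillation loss only on intervened tokens.
    return ce.mean() + (temperature ** 2) * (mask * kl).mean()

In a full training loop the teacher logits would come from a frozen teacher model, and LowRankLinear (or a residual low-rank adapter added to a frozen dense layer, as in LoRA) would replace selected fully connected layers of the student.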