
doi: 10.17586/2226-1494-2025-25-4-737-743
Optimizing knowledge distillation models for language models

Abstract
The problem of optimizing large neural networks is considered using language models as an example. The size of large language models is an obstacle to their practical application under constraints on computing resources and memory. One actively developed direction for compressing large neural network models is knowledge distillation: the transfer of knowledge from a large teacher model to a smaller student model without significant loss of accuracy. Currently known knowledge distillation methods have certain disadvantages: inaccurate knowledge transfer, a long training process, and error accumulation on long sequences. Two methods are proposed that improve the quality of knowledge distillation for language models: selective teacher intervention in the student’s training process and low-rank adaptation. The first approach transfers teacher tokens during student training to those neural network layers for which an exponentially decreasing threshold on the discrepancy between the teacher’s and the student’s probability distributions is reached. The second approach reduces the number of parameters in the neural network by replacing fully connected layers with low-rank ones, which lowers the risk of overfitting and speeds up training. The limitations of each method on long sequences are shown, and it is proposed to combine the two methods into an improved classical knowledge distillation model for long sequences. The combined approach to knowledge distillation on long sequences made it possible to compress the resulting model substantially with only a slight loss of quality, and to significantly reduce GPU memory consumption and response time. The complementary approaches to optimizing knowledge transfer and compressing the model showed better results than selective teacher intervention or low-rank adaptation applied separately. On long sequences, the answer quality of the improved classical knowledge distillation model reached 97 % of the quality of full fine-tuning and 98 % of the quality of the low-rank adaptation method in terms of ROUGE-L and perplexity, while the number of trainable parameters is reduced by 99 % compared to full fine-tuning and by 49 % compared to low-rank adaptation. In addition, GPU memory usage is reduced by 75 % and 30 %, respectively, and inference time by 30 %. The proposed combination of knowledge distillation methods can find application in problems with limited computational resources.
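
The following minimal PyTorch sketch illustrates how the two proposed techniques could fit together. It is not the authors' implementation: the module and function names, the threshold schedule, and all hyperparameters are illustrative assumptions based only on the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLinear(nn.Module):
    # Stand-in for low-rank adaptation: a dense in_dim x out_dim layer is
    # replaced by two factors of rank r, cutting the parameter count from
    # in_dim * out_dim to r * (in_dim + out_dim).
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


def intervention_threshold(step: int, tau0: float = 1.0, decay: float = 0.99) -> float:
    # One possible reading of the "exponentially decreasing threshold" from the
    # abstract; the exact schedule used in the paper may differ.
    return tau0 * (decay ** step)


def distillation_step(student_logits, teacher_logits, targets, step, temperature=2.0):
    # Selective teacher intervention: the teacher's soft targets contribute to
    # the loss only for tokens whose teacher-student divergence exceeds the
    # current threshold.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="none",
    ).sum(dim=-1)  # per-token KL divergence, shape (batch, seq_len)

    mask = (kl > intervention_threshold(step)).float()  # 1 where the teacher intervenes

    ce = F.cross_entropy(
        student_logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(kl)

    # Hard-label loss on all tokens, distillation loss only on intervened tokens.
    return ce.mean() + (temperature ** 2) * (mask * kl).mean()

In a full training loop the teacher logits would come from a frozen teacher model, and LowRankLinear (or a residual low-rank adapter added to a frozen dense layer, as in LoRA) would replace selected fully connected layers of the student.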