doi: 10.17586/2226-1494-2024-24-6-1024-1034
Improving question answering in programming domain with pretrained language model finetuning using structured diverse online forum data
Article in English
For citation:
Gorbatovski A.V., Razin A.D., Aliev A.A., Kovalchuk S.V. Improving question answering in programming domain with pretrained language model finetuning using structured diverse online forum data. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 6, pp. 1024–1034. doi: 10.17586/2226-1494-2024-24-6-1024-1034
Abstract
Today, Community Question Answering (CQA) forums such as Stack Overflow have become an indispensable tool for software developers, providing fast and efficient solution search and prompt community response. Although modern Pretrained Language Models (PLMs), which are trained in part on data from such forums, have the potential to automate the answering of domain-specific questions, they often show significant limitations in complex domains such as programming, due to the heterogeneity of the domain and the variety of contexts in which questions are asked. In this study, we propose an approach to this problem based on structuring data in a complex domain. The first stage decomposes the available forum data into thematic subsets. Next, for individual topics, models are finetuned using Reinforcement Learning from Human Feedback (RLHF), with the votes available in the forum data serving as the feedback signal. Finally, to manage the ensemble of finetuned models, an incoming question is classified and the appropriate model is selected. Experimental studies were conducted on a subset of Python-related questions from Stack Overflow, using the Llama 7B model as the base PLM. The results show that classifying questions improves model performance by up to +22.5 % on the Rouge metric, and that adding RLHF yields a further improvement of up to +11.2 %. To validate these results, we performed a human evaluation of the generated responses, which confirmed the effectiveness of our approach. This study shows that by structuring community data and processing implicit feedback, we can significantly improve PLM performance on CQA tasks in complex, highly heterogeneous domains such as software development.
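The routing stage described above (classify the question, then dispatch it to a topic-specific model) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the topic names, the keyword-based classifier, and the stand-in expert callables are all hypothetical, whereas in the actual study the experts would be RLHF-finetuned Llama 7B variants and the classifier a trained model.

```python
def classify_topic(question: str) -> str:
    """Toy topic classifier: pick the topic whose keyword set overlaps
    the question most; fall back to 'general' on no overlap."""
    keywords = {
        "pandas": {"dataframe", "pandas", "csv"},
        "web": {"django", "flask", "request"},
        "general": set(),
    }
    words = set(question.lower().split())
    best = max(keywords, key=lambda t: len(keywords[t] & words))
    return best if keywords[best] & words else "general"

# Stand-ins for the per-topic finetuned PLMs (hypothetical names).
experts = {
    "pandas": lambda q: f"[pandas expert] answer to: {q}",
    "web": lambda q: f"[web expert] answer to: {q}",
    "general": lambda q: f"[general expert] answer to: {q}",
}

def answer(question: str) -> str:
    # Route the question to the expert selected by the classifier.
    return experts[classify_topic(question)](question)

print(answer("How do I read a csv file into a pandas DataFrame?"))
```

The same dispatch structure applies regardless of how the classifier and experts are realized; only the two lookup steps in `answer` are essential to the ensemble-management idea.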
Keywords: question answering, natural language processing, natural language generation, pretrained language models, large language models, finetuning, software development
Acknowledgements. The research was supported by the Russian Science Foundation, agreement No. 24-11-00272, https://rscf.ru/project/24-11-00272/.