doi: 10.17586/2226-1494-2021-21-5-709-719

Automatic construction of the dialog tree based on unmarked text corpora in Russian

E. A. Feldina, O. V. Makhnytkina

Article in Russian

For citation:
Feldina E.A., Makhnytkina O.V. Automatic construction of the dialog tree based on unmarked text corpora in Russian. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2021, vol. 21, no. 5, pp. 709–719 (in Russian). doi: 10.17586/2226-1494-2021-21-5-709-719

In this paper, we propose a method for automatically determining the tree structure and the key topics of nodes when building a dialog tree from unmarked (unlabeled) text corpora. Building a dialog tree is a time-consuming task in creating an automatic dialog system and in most cases relies on manual markup, which requires considerable time and resources. The proposed hierarchical clustering of dialogs takes into account the semantic proximity of messages, allows a different number of nodes to be allocated at each level of the hierarchy, and limits the dialog tree in width and depth. The algorithm for annotating the nodes of the dialog tree takes the hierarchy of topics into account by building thematic chains. The method combines natural language processing techniques (tokenization, lemmatization, part-of-speech tagging, word embeddings, etc.), principal component analysis for dimensionality reduction, and cluster analysis. Experiments on constructing the dialog tree structure and annotating its nodes demonstrated the potential of the proposed method for automatic dialog tree construction. On a reference dialog tree containing 13 nodes at the first level, 381 nodes at the second level and 299 nodes at the third level, the recognition accuracy was 0.8, 0.7 and 0.5, respectively. Automatic construction of dialog trees can be useful in developing automatic dialog systems and in improving the quality of answers generated for user questions.

Keywords: dialog tree, dialog system, machine learning, cluster analysis, natural language processing
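
The ideas summarized in the abstract (clustering by semantic proximity, a different number of nodes at each level, limits on width and depth, and node annotation through thematic chains) can be illustrated with a minimal sketch. It is not the authors' algorithm: a simple greedy "leader" clustering stands in for the cluster analysis used in the paper, the three-dimensional vectors are invented stand-ins for word embeddings after dimensionality reduction, and every name here (`build_tree`, `leader_clusters`, `thresholds`, `max_width`, the toy `dialogs`) is a hypothetical choice made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def leader_clusters(items, threshold, max_width):
    """Greedy one-pass clustering of (tokens, vector) pairs.

    An item joins the most similar cluster if the similarity reaches
    `threshold`, or if `max_width` clusters already exist; otherwise it
    opens a new cluster. Assumption: a stand-in for the paper's clustering.
    """
    clusters = []
    for item in items:
        sims = [cosine(item[1], centroid([v for _, v in c])) for c in clusters]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and (sims[best] >= threshold or len(clusters) >= max_width):
            clusters[best].append(item)
        else:
            clusters.append([item])
    return clusters


def build_tree(items, thresholds, max_width, level=0, chain=()):
    """Recursively cluster items; depth is limited by len(thresholds)."""
    node = {"chain": chain, "size": len(items), "children": []}
    if level == len(thresholds) or len(items) < 2:
        return node
    for cluster in leader_clusters(items, thresholds[level], max_width):
        # Label = most frequent token not already used higher in the chain,
        # so each child topic refines its ancestors' topics.
        counts = Counter(t for tokens, _ in cluster for t in tokens if t not in chain)
        label = counts.most_common(1)[0][0] if counts else "?"
        child = build_tree(cluster, thresholds, max_width, level + 1, chain + (label,))
        node["children"].append(child)
    return node


def chains(node):
    """Thematic chains of all non-root nodes, in pre-order."""
    out = [node["chain"]] if node["chain"] else []
    for child in node["children"]:
        out.extend(chains(child))
    return out


# Toy data: lemmatized tokens plus an invented low-dimensional embedding.
dialogs = [
    (["card", "block"], [1.0, 0.1, 0.0]),
    (["card", "limit"], [0.9, 0.3, 0.0]),
    (["loan", "rate"], [0.0, 1.0, 0.1]),
    (["loan", "term"], [0.1, 0.9, 0.3]),
    (["card", "block"], [1.0, 0.0, 0.1]),
]

# A looser threshold at the first level, a stricter one at the second.
tree = build_tree(dialogs, thresholds=(0.5, 0.98), max_width=3)
for c in chains(tree):
    print(" > ".join(c))
```

Using one similarity threshold per level is only one way to let the number of nodes differ between levels; the length of `thresholds` bounds the depth and `max_width` bounds the number of children of any node.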

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License