Automatic construction of the dialog tree based on unmarked text corpora in Russian

Evgeniya A. Feldina, Olesia V. Makhnytkina

2021, Volume 21, Number 5 (September-October)

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

doi: 10.17586/2226-1494-2021-21-5-709-719

Article in Russian

For citation:

Feldina E.A., Makhnytkina O.V. Automatic construction of the dialog tree based on unmarked text corpora in Russian. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2021, vol. 21, no. 5, pp. 709–719 (in Russian). doi: 10.17586/2226-1494-2021-21-5-709-719

Abstract

This paper proposes a method for automatically determining the tree structure and the key topics of nodes when building a dialog tree from unlabeled (unmarked) text corpora. Building a dialog tree is a time-consuming task in creating an automatic dialog system, and in most cases it relies on manual annotation, which requires considerable time and resources. The proposed hierarchical clustering of dialogs takes the semantic proximity of messages into account, allows a different number of nodes to be allocated at each level of the hierarchy, and limits the width and depth of the dialog tree. The algorithm for annotating dialog tree nodes takes the hierarchy of topics into account by building thematic chains. The method combines natural language processing techniques (tokenization, lemmatization, part-of-speech tagging, word embeddings, etc.), principal component analysis for dimensionality reduction, and cluster analysis. Experiments on constructing the dialog tree structure and annotating its nodes indicate that the proposed method is suitable for building dialog trees automatically. On a reference dialog tree containing 13 nodes at the first level, 381 nodes at the second level and 299 nodes at the third level, the recognition accuracy was 0.8, 0.7 and 0.5, respectively. Automatic construction of dialog trees can be useful in developing automatic dialog systems and in improving the quality of generated answers to user questions.

Keywords: dialog tree, dialog system, machine learning, cluster analysis, natural language processing
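
The hierarchical clustering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each message has already been converted to a low-dimensional vector (for example, averaged word embeddings after dimensionality reduction), and it swaps in plain average-linkage agglomerative merging with cosine distance. The function names, the per-level distance thresholds and the max_width parameter are hypothetical choices made for illustration. The number of children of each node is not fixed in advance, the width is capped by max_width, and the depth is capped by the number of thresholds.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def agglomerate(vectors, indices, threshold, max_width):
    """Average-linkage merging of the messages listed in `indices`.

    Merging continues while the closest pair of clusters is within
    `threshold`, or while more than `max_width` clusters remain.
    """
    clusters = [[i] for i in indices]

    def linkage(a, b):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        dist, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if dist > threshold and len(clusters) <= max_width:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

def build_tree(vectors, indices, thresholds, max_width, depth=0):
    """Recursively split messages; depth is limited by len(thresholds)."""
    node = {"messages": indices, "children": []}
    if depth >= len(thresholds) or len(indices) < 2:
        return node
    groups = agglomerate(vectors, indices, thresholds[depth], max_width)
    if len(groups) > 1:  # a single group means there is nothing to split
        node["children"] = [
            build_tree(vectors, group, thresholds, max_width, depth + 1)
            for group in groups
        ]
    return node

# Toy vectors standing in for reduced message embeddings (made-up data).
vectors = [
    [1.0, 0.0, 0.0], [0.99, 0.05, 0.0],   # topic A, subtopic 1
    [0.9, 0.0, 0.4], [0.9, 0.05, 0.42],   # topic A, subtopic 2
    [0.0, 1.0, 0.0], [0.05, 1.0, 0.05],   # topic B
]

# A looser threshold at the first level, a tighter one at the second.
tree = build_tree(vectors, list(range(len(vectors))),
                  thresholds=[0.5, 0.05], max_width=3)

for child in tree["children"]:
    print(child["messages"], [g["messages"] for g in child["children"]])
```

On this toy data the first level separates the two topics, and only the first topic is split further, which shows how different nodes can receive a different number of children.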

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License