doi: 10.17586/2226-1494-2019-19-6-1058-1063
TEXT CLUSTERING POWERED BY SEMANTICO-SYNTACTIC FEATURES
Article in Russian
For citation:
Lapshin S.V., Lebedev I.S., Spivak A.I. Text clustering powered by semantico-syntactic features. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 6, pp. 1058–1063 (in Russian). doi: 10.17586/2226-1494-2019-19-6-1058-1063
Abstract
Subject of Research. The study aims to improve text clustering quality indicators. The focus is on extracting the features that form the mathematical model of the texts. The resulting vector representations of the texts are clustered with the k-means method. Method. An analytical approach is proposed that is based on semantico-syntactic features of the clustered texts. Features were extracted with the Stanford CoreNLP Toolkit. Selected links between words in the “Enhanced++ Dependencies” representation were encoded together with the words they connect. The values of the semantico-syntactic features were calculated from the frequencies of the encoded links in the texts. Main Results. In an experiment, a prototype built on the proposed method was compared with a clustering system based on statistical features; the proposed method reduced the number of clustering errors by more than 15 %. Practical Relevance. Obtaining semantico-syntactic features of the texts requires no pre-training. The proposed approach can therefore be used to improve clustering quality when large text corpora, which are needed to pre-train statistical language models based on word embeddings, are not available.
Keywords: text clustering, semantico-syntactic features, word context, k-means
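The pipeline outlined in the abstract (encode dependency links together with the words they connect, turn link frequencies into feature vectors, cluster with k-means) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The dependency triples are invented sample data standing in for parser output. The `gov|rel|dep` encoding, the use of relative frequencies, the encoding of every link rather than a selected subset, and the helper names `encode_link`, `link_frequencies`, `vectorize` and `kmeans` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical pre-parsed input: each text is a list of
# (governor, relation, dependent) triples, standing in for the output
# of a dependency parser. The triples below are invented sample data.
TEXTS = [
    [("eats", "nsubj", "cat"), ("eats", "obj", "fish")],
    [("eats", "nsubj", "cat"), ("eats", "obj", "mouse"), ("eats", "nsubj", "cat")],
    [("rises", "nsubj", "price"), ("rises", "obl:in", "market")],
    [("rises", "nsubj", "price"), ("falls", "nsubj", "stock")],
]


def encode_link(governor, relation, dependent):
    """Encode one dependency link together with the two words it connects."""
    return f"{governor}|{relation}|{dependent}"


def link_frequencies(triples):
    """Relative frequency of each encoded link within one text (an assumption)."""
    counts = Counter(encode_link(*t) for t in triples)
    total = sum(counts.values())
    return {link: n / total for link, n in counts.items()}


def vectorize(texts):
    """Map every text to a dense vector over the sorted link vocabulary."""
    freqs = [link_frequencies(t) for t in texts]
    vocab = sorted({link for f in freqs for link in f})
    return vocab, [[f.get(link, 0.0) for link in vocab] for f in freqs]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, initial_centroids, max_iter=100):
    """Plain Lloyd's k-means with caller-supplied initial centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vocab, vectors = vectorize(TEXTS)
# Fixed initial centroids keep the sketch deterministic.
labels = kmeans(vectors, initial_centroids=[vectors[0], vectors[2]])
print(labels)
```

On this toy data the two "eats" texts fall into one cluster and the two "rises" texts into the other. In practice the triples would come from a dependency parser and the initial centroids would be chosen by a seeding strategy rather than fixed by hand.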
Acknowledgements. This work has been performed according to the program of fundamental research of the Russian Academy of Sciences in priority areas determined by the Presidium of the Russian Academy of Sciences No. 2 “Mechanisms for ensuring fault tolerance of modern high-performance and highly reliable computing”.
References
- Xu J., Xu B., Wang P., Zheng S., Tian G., Zhao J., Xu B. Self-taught convolutional neural networks for short text clustering. Neural Networks, 2017, vol. 88, pp. 22–31. doi: 10.1016/j.neunet.2016.12.008
- Parhomenko P.A., Grigorev A.A., Astrakhantsev N.A. A survey and an experimental comparison of methods for text clustering: application to scientific articles. Proceedings of ISP RAS, 2017, vol. 29, no. 2, pp. 161–200. (in Russian). doi: 10.15514/ISPRAS-2017-29(2)-6
- Whissell J.S., Clarke C.L.A. Improving document clustering using Okapi BM25 feature weighting. Information Retrieval, 2011, vol. 14, no. 5, pp. 466–487. doi: 10.1007/s10791-011-9163-y
- Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, vol. 41, no. 6, pp. 391–407. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
- Hofmann T. Probabilistic latent semantic indexing. Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999), 1999, pp. 50–57. doi: 10.1145/312624.312649
- Blei D.M., Ng A.Y., Jordan M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, vol. 3, no. 4-5, pp. 993–1022.
- Mikolov T., Chen K., Corrado G., Dean J. Efficient estimation of word representations in vector space. Proc. 1st International Conference on Learning Representations (ICLR 2013), 2013.
- Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. 2018.
- Staab S., Hotho A. Ontology-based text document clustering. Proc. International Intelligent Information Systems/Intelligent Information Processing and Web Mining Conference (IIS: IIPWM’03), 2003, pp. 451–452.
- Choudhary B., Bhattacharyya P. Text clustering using semantics. Available at: http://vima01220.ethz.ch/CDstore/www2002/poster/79.pdf (accessed: 23.10.2019)
- Liang S., Yilmaz E., Kanoulas E. Collaboratively tracking interests for user clustering in streams of short texts. IEEE Transactions on Knowledge and Data Engineering, 2019, vol. 31, no. 2, pp. 257–272. doi: 10.1109/TKDE.2018.2832211
- Popova S., Danilova V. Document representation for clustering of scientific abstracts. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2014, vol. 19, no. 1(89), pp. 99–107. (in Russian)
- Schuster S., Manning C.D. Enhanced English Universal Dependencies: an improved representation for natural language understanding tasks. Proc. 10th International Conference on Language Resources and Evaluation (LREC 2016), 2016, pp. 2371–2378.
- Manning C., Surdeanu M., Bauer J., Finkel J., Bethard S.J., McClosky D. The Stanford CoreNLP natural language processing toolkit. Proc. 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60. doi: 10.3115/v1/P14-5010