S. V. Popova, V. V. Danilova



This paper addresses the clustering of narrow-domain short texts, such as scientific abstracts. The work builds on observations made while improving the performance of a key phrase extraction algorithm: an extended stop-word list, built automatically for key phrase extraction, considerably improved the quality of the phrases extracted from scientific publications. The procedure for creating the stop-word list is described. The main objective is to investigate whether this stop-word list, together with information about the parts of speech of lexemes, can increase the performance and/or speed of clustering. In the latter case, documents are represented with a vocabulary that contains not every word occurring in the collection, but only nouns and adjectives, or their sequences, encountered in the documents. Two baseline clustering algorithms are applied: k-means and hierarchical clustering (average agglomerative method). The results show that using an extended stop-word list and an adjective-noun document representation improves both the performance and the speed of k-means clustering. Under the same conditions, the average agglomerative method may show a decline in clustering quality. Using adjective-noun sequences for document representation lowers the clustering quality for both algorithms and is justified only when a considerable reduction of feature space dimensionality is necessary.

Keywords: document clustering, document representation, key phrase application, use of nouns and adjectives, creation of an extended stop-word list, retrieval results representation
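
The vocabulary restriction described in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw term-frequency vectors and cosine similarity. The stop lists, the toy part-of-speech dictionary and the three sample documents are hypothetical and are not the resources used in the paper. A real pipeline would use an automatically built stop list and a part-of-speech tagger.

```python
from collections import Counter
import math

# Hypothetical toy resources (assumptions for illustration only).
BASIC_STOP_WORDS = {"the", "of", "a", "is", "for", "we", "and", "in"}
EXTENDED_STOP_WORDS = BASIC_STOP_WORDS | {
    "paper", "method", "results", "new", "propose",
}
TOY_POS = {  # word -> coarse part-of-speech tag
    "clustering": "NOUN", "short": "ADJ", "texts": "NOUN",
    "abstracts": "NOUN", "scientific": "ADJ", "keyphrase": "NOUN",
    "extraction": "NOUN", "paper": "NOUN", "method": "NOUN",
    "results": "NOUN", "new": "ADJ", "propose": "VERB",
    "improves": "VERB", "show": "VERB",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()


def build_vocabulary(docs, stop_words, keep_pos=None):
    """Collect words that are not stop words and, if keep_pos is given,
    whose toy POS tag is in keep_pos."""
    vocab = set()
    for doc in docs:
        for word in tokenize(doc):
            if word in stop_words:
                continue
            if keep_pos is not None and TOY_POS.get(word) not in keep_pos:
                continue
            vocab.add(word)
    return sorted(vocab)


def tf_vector(doc, vocab):
    """Raw term-frequency vector of a document over the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    "we propose a new method for clustering of short scientific abstracts",
    "the paper is a new method for keyphrase extraction",
    "results show clustering of short texts improves",
]

# Full representation: every word except the basic stop words.
full_vocab = build_vocabulary(docs, BASIC_STOP_WORDS)
# Reduced representation: extended stop list plus nouns and adjectives only.
reduced_vocab = build_vocabulary(
    docs, EXTENDED_STOP_WORDS, keep_pos={"NOUN", "ADJ"}
)

sim_full = cosine(tf_vector(docs[0], full_vocab), tf_vector(docs[1], full_vocab))
sim_reduced = cosine(
    tf_vector(docs[0], reduced_vocab), tf_vector(docs[1], reduced_vocab)
)

print(len(full_vocab), len(reduced_vocab))
print(sim_full, sim_reduced)
```

In this toy data the reduced vocabulary is half the size of the full one, and the similarity between the first two documents, which came only from generic words such as "new" and "method", disappears. The resulting vectors could then be passed to any k-means or average agglomerative clustering implementation. The sketch makes no claim about how the paper's own experiments were configured.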

Copyright 2001-2017 ©
Scientific and Technical Journal
of Information Technologies, Mechanics and Optics.
All rights reserved.