EVALUATION OF SEMANTIC SIMILARITY FOR SENTENCES IN NATURAL LANGUAGE BY MATHEMATICAL STATISTICS METHODS

Pismak Alexey E., Kharitonova Anastassia E., Tsopa Evgeny A., Klimenkov Sergey V.

2016, Volume 16, Number 2 (March–April)

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

doi: 10.17586/2226-1494-2016-16-2-324-330

A. E. Pismak, A. E. Kharitonova, E. A. Tsopa, S. V. Klimenkov

Article in Russian

For citation: Pismak A.E., Kharitonova A.E., Tsopa E.A., Klimenkov S.V. Evaluation of semantic similarity for sentences in natural language by mathematical statistics methods. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2016, vol. 16, no. 2, pp. 324–330. doi:10.17586/2226-1494-2016-16-2-324-330

Abstract

Subject of Research. The paper examines the structural organization of Wiktionary articles with a view to using Wiktionary as the basis for a semantic network. Wiktionary community reference materials, article templates, and article markup features are analyzed. The problem of numerically estimating the semantic similarity of structural elements in Wiktionary articles is considered. Existing software for estimating the semantic similarity of such elements is analyzed: its algorithms are studied, and its advantages and disadvantages are shown. Methods. Methods of mathematical statistics were used to analyze the markup features of Wiktionary articles. A method for computing semantic similarity from statistical data on the compared structural elements is proposed. Main Results. We conclude that Wiktionary articles cannot be used directly as the source for a semantic network. We propose finding hidden similarity between article elements, and for that purpose we have developed an algorithm that calculates confidence coefficients indicating how likely each pair of sentences is to be semantically close. A study of the quantitative and qualitative characteristics of the developed algorithm has shown a major performance advantage over existing solutions, at the cost of an insignificantly higher error rate. Practical Relevance. The resulting algorithm may be useful in developing tools for automatic parsing of Wiktionary articles. The method could also be used to compute semantic similarity for short natural-language text fragments when performance requirements outweigh accuracy requirements.

Keywords: semantic similarity, mathematical statistics, sets, tokens, Wiktionary, semantic analysis, text
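
The abstract does not spell out the proposed algorithm, so the following is only an illustrative sketch of the general idea suggested by the keywords (sets, tokens): scoring a pair of short sentences by the overlap of their token sets. The Jaccard coefficient, the regular-expression tokenizer, and the function names `tokenize` and `jaccard_similarity` are assumptions made for this example and are not taken from the paper.

```python
from __future__ import annotations

import re


def tokenize(sentence: str) -> set[str]:
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Return |A & B| / |A | B| for the token sets of two sentences.

    This is a generic set-overlap score (an assumption for illustration),
    not the confidence coefficient defined by the authors.
    """
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        # Two empty sentences share no evidence of similarity.
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    score = jaccard_similarity("a small domestic cat", "a small wild cat")
    print(f"{score:.2f}")
```

A score of this kind is cheap to compute because it needs no syntactic parsing or external corpus, which matches the performance-over-accuracy trade-off the abstract describes, although the actual statistics used by the authors may differ.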

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License