DOI: 10.17586/2226-1494-2017-17-3-490-497


A. G. Andreeva, T. A. Markina

Read the full article 
Article in Russian

For citation: Andreeva A.G., Markina T.A. Tree similarity estimation by calculation of pq-gram distance. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2017, vol. 17, no. 3, pp. 490–497 (in Russian). doi: 10.17586/2226-1494-2017-17-3-490-497


The paper presents an algorithm for similarity estimation of hierarchical data based on the pq-gram distance calculation. The dependence of the algorithm sensitivity on the selected parameters p andq is analyzed. We  show how much the result of the algorithm will change at comparing of two trees that have difference in one random node when one of the nodes of the source tree is deleted, renamed, or an extra node is added. It is demonstrated that such analysis enables to select the parameters p and q in relation with the solving problem. The problem of a tree preliminary evaluation is substantiated - an approximate analysis of the initial level of node differences in the selected pq-grams of the compared trees. The basic terms and definitions relating to the  tree-based data structuring algorithms are described. Examples of the algorithm practical application and the details of its implementation on a real problem are shown

Keywords: pq-gram, trees, tree similarity, pq-gram distance, pq-gram pattern, pq-gram profile, XML

1.     Tai K.-C. The tree-to-tree correction problem. Journal of ACM, 1979, vol. 26, no. 3, pp. 422–433. doi: 10.1145/322139.322143
2.     Zhang K., Shasha D. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal of Computing, 1989, vol. 18, no. 6, pp. 1245–1262.
3.     Demaine E.D., Mozes S., Rossman B., Weimann O. An optimal decomposition algorithm for tree edit distance. ACM Transactions on Algorithms, 2009, vol. 6, no. 1, art. 2. doi: 10.1145/1644015.1644017
4.     Pawlik M., Augsten N. RTED: a robust algorithm for the tree edit distance. Proc. 38th VLDB Endowment. Istanbul, Turkey, 2012, vol. 5, no. 4, pp. 334–345.
5.     Augsten N., Boehlen M., Gamper J. Approximate matching of hierarchical data using pq-grams. Proc. 31st Int. Conf. on Very Large Data Bases. Trondheim, Norway, 2005, vol. 1, pp. 301–312.
6.     Kubenskii A.A. Structures and algorithms of data processing. Object-oriented approach and implementation on C++. St. Petersburg, BHV Publ., 2004, 464 p. (In Russian)
7.     Gusfield D. Algorithms on String, Trees, and Sequences. Computer Science and Computational Biology. CambridgeUniversityPress, 1997, 556 p.
8.     Knuth D.E. The Art of Computer Programming. Vol. 2. Seminumerical Algorithms. Addison-Wesley, 1998.
9.     Cormen T.H., Leiserson C.E., Rivest R.L., Stein C. Introduction to Algorithms. 3rd ed. MIT Press, 2009, 1312 p.
10.  Bogomolov A.M., Salii V.N. Algebraic Foundations of the Discrete Systems Theory. Moscow, Fizmatlit Publ., 1997, 368 p. (In Russian)
11.  Kremer N.Sh. Probability Theory and Mathematical Statistics. 2nd ed. Moscow, Yuniti-Dana, 2003, 573 p. (In Russian)
12.  Srivastava N., Mishra V., Bhattacharya A. Analyzing the sensitivity of pq-gram distance with p and q. Proc. 10th Int. Conf. on Very Large Data Bases. Singapore, 2010.
13.  Wagner R.A., Fischer M.J. The string-to-string correction problem. Journal of ACM, 1974, vol. 21, no. 1, pp. 168–173. doi: 10.1145/321796.321811
14.  Zakharov V.N., Khoroshilov A.A. Automatic assessment of similarity of the texts’ thematic content on the base of their formalized semantic descriptions comparison. Proc. RCDL-2012. Pereslavl'-Zalesskii, Russia, 2012.
15.  Hunter D., Rafter J., Fawcett J., van der Vlist E. et. al. Beginning XML. 4th ed. Wiley, 2007, 1080 p.
Flach P. Machine Learning: The Art and Science of Algorithms That Make Sense of Data. Cambridge University Press, 2012, 409 p.
Copyright 2001-2017 ©
Scientific and Technical Journal
of Information Technologies, Mechanics and Optics.
All rights reserved.