For citation: Bessmertny I.A., Nugumanova A.B., Mansurova M.Ye., Baiburin Ye.M. Method of rare term contrastive extraction from natural language texts. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2017, vol. 17, no. 1, pp. 81–91. doi: 10.17586/2226-1494-2017-17-1-81-91
The paper addresses automatic extraction of domain terms from a document corpus by means of a contrast collection. Existing contrastive methods extract frequently used terms successfully but mishandle rare terms, which can impoverish the resulting thesaurus. Pointwise mutual information is a well-known statistical measure for term extraction that does find rare terms, but it also extracts many false terms. The proposed approach applies pointwise mutual information to extract rare terms and then filters the candidates by their co-occurrence with other candidates. We build a “documents-by-terms” matrix and subject it to singular value decomposition to eliminate noise and reveal strong interconnections. We then derive from it a “terms-by-terms” matrix that reflects the strength of interconnections between words. The approach was tested on a document collection from the “Geology” domain, using contrast documents on topics such as “Politics”, “Culture”, “Economics” and “Accidents” taken from Internet resources. The experimental results demonstrate that the method is workable for rare term extraction.
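The behaviour of pointwise mutual information on rare words can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pmi_scores`, the toy corpora, and the choice to estimate probabilities from raw token counts are all assumptions made for illustration. The score used is PMI(w, domain) = log2(P(domain | w) / P(domain)).

```python
import math
from collections import Counter

def pmi_scores(domain_docs, contrast_docs):
    """Score every word seen in the domain corpus by its PMI with the domain.

    PMI(w, domain) = log2(P(domain | w) / P(domain)), with probabilities
    estimated from token counts (an assumption of this sketch).
    """
    domain_counts = Counter(tok for doc in domain_docs for tok in doc)
    contrast_counts = Counter(tok for doc in contrast_docs for tok in doc)
    n_domain = sum(domain_counts.values())
    n_total = n_domain + sum(contrast_counts.values())
    p_domain = n_domain / n_total

    scores = {}
    for word, in_domain in domain_counts.items():
        total = in_domain + contrast_counts[word]
        scores[word] = math.log2((in_domain / total) / p_domain)
    return scores

# Hypothetical toy corpora: tokenized "Geology" documents and contrast documents.
domain_docs = [
    ["granite", "is", "an", "igneous", "rock"],
    ["the", "rock", "contains", "feldspar"],
]
contrast_docs = [
    ["the", "election", "is", "an", "event"],
    ["the", "market", "is", "open"],
]

scores = pmi_scores(domain_docs, contrast_docs)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:+.3f}")
```

In this toy run the word "feldspar", which occurs only once, receives the same top score as the more frequent "rock", so the rare term is not lost. The general word "contains" also receives the top score because it happens to be absent from the contrast documents; this is the kind of false candidate that a subsequent co-occurrence filter is meant to remove.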
Keywords: contrastive term extraction, termhood, mutual information, semantic connections, rare term extraction
Acknowledgements. The research reported in this paper was partially supported by Grant 5033/ГФ4 of the Ministry of Education and Science of the Republic of Kazakhstan, "The development of intelligent high-performance information and analysis search engine for semistructured data processing".