Homograph recognition algorithm based on Euclidean metric

Izrailova Elisa S., Astemirov Arslanbek V., Badaeva Ayshat S., Sultanov Zelimhan A., Umarkhadzhiev Salaudin M., Khekhaev Mokhmad-Salekh L., Yasaeva Madina L.

2024, Volume 24, Number 1 (January-February)

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

doi: 10.17586/2226-1494-2024-24-1-41-50

Article in Russian

For citation:

Izrailova E.S., Astemirov A.V., Badaeva A.S., Sultanov Z.A., Umarkhadzhiev S.M., Khekhaev M.-S.L., Yasaeva M.L. Homograph recognition algorithm based on Euclidean metric. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 1, pp. 41–50 (in Russian). doi: 10.17586/2226-1494-2024-24-1-41-50

Abstract

The problem of resolving homonymy-related ambiguity in the Chechen language has become especially relevant since the creation of speech synthesis systems. The main shortcoming of Chechen speech synthesizers is errors in reading homographs that differ in vowel length, because vowel length is not marked in writing. Diphthongs also cause problems, since they are written the same way as the monophthongs closest to them in sound. Improving the quality of synthesized Chechen speech requires a program for automatic homograph recognition. To this end, the article treats the task as one of word sense disambiguation (WSD). Algorithmic (supervised) methods based on a pre-annotated database were selected for the Chechen language. Such methods are the most common solutions to word sense disambiguation, but they depend on large annotated corpora, which are unavailable for most of the world's languages, including Chechen. Chechen is a low-resource language, for which the optimal approach in terms of labor and time is a semi-supervised hybrid method of homograph recognition that combines algorithmic and statistical methods. The article presents an algorithm, created by the authors, that recognizes homographs from six adjacent words in a sentence. The method is implemented as a program. Preparing the input data for the algorithm includes manually labeling sentences with the meanings of the homographs they contain. The program was evaluated with standard metrics and achieved an F1 score of 39 % and an accuracy of 45 %.
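One way to picture recognition from six adjacent words with a Euclidean metric is a nearest-centroid classifier over bag-of-words context vectors. The sketch below is only an illustration under stated assumptions, not the authors' program: the window of three words on each side, the centroid representation, all function names, and the English toy homograph "lead" are hypothetical and do not come from the article.

```python
import math
from collections import Counter

WINDOW = 3  # assumption: "six adjacent words" = three on each side


def context_window(tokens, index, window=WINDOW):
    """Return up to `window` words on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def euclidean(a, b):
    """Euclidean distance between two sparse word-count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


def train(labeled):
    """Build one mean context vector (centroid) per homograph sense.

    `labeled` holds (tokens, index_of_homograph, sense) triples that
    were annotated by hand.
    """
    sums, counts = {}, Counter()
    for tokens, index, sense in labeled:
        sums.setdefault(sense, Counter()).update(context_window(tokens, index))
        counts[sense] += 1
    return {
        sense: {word: c / counts[sense] for word, c in vec.items()}
        for sense, vec in sums.items()
    }


def predict(centroids, tokens, index):
    """Pick the sense whose centroid is closest to the observed context."""
    vec = Counter(context_window(tokens, index))
    return min(centroids, key=lambda sense: euclidean(vec, centroids[sense]))


# Toy English data (hypothetical, not from the article's Chechen corpus).
labeled = [
    ("they will lead the team to victory".split(), 2, "verb"),
    ("she can lead the group up the hill".split(), 2, "verb"),
    ("the pipe was made of lead and copper".split(), 5, "metal"),
    ("old paint contains lead and other toxins".split(), 3, "metal"),
]
centroids = train(labeled)
print(predict(centroids, "he will lead the team home".split(), 2))
print(predict(centroids, "the box was lined with lead and tin".split(), 5))
```

In a speech synthesizer, the predicted sense would then select the pronunciation (for Chechen, the vowel length or diphthong) of the written form.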
A comparison with the results of other methods and models showed that the accuracy of the presented algorithm is closest to that of algorithms based on the Lesk method. For English, the Lesk method yielded F1 scores of 41.1 % (simple Lesk) and 51.1 % (extended Lesk). Methods based on neural networks achieve higher WSD accuracy for most languages; however, they require large volumes of data, which are not always available for low-resource languages, including Chechen.
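For readers comparing these figures, the two reported metrics can be computed from gold and predicted sense labels as sketched below. The abstract does not state how F1 was averaged, so macro-averaging over senses is an assumption here, and the labels "long" and "short" are hypothetical placeholders for two readings of a homograph.

```python
def accuracy(gold, pred):
    """Share of homograph occurrences that received the correct sense."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold, pred, label):
    """F1 for one sense: harmonic mean of precision and recall."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-sense F1 (assumed averaging scheme)."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Hypothetical labels, for illustration only.
gold = ["long", "long", "short", "short"]
pred = ["long", "short", "short", "short"]
print(accuracy(gold, pred))   # 0.75
print(macro_f1(gold, pred))   # mean of 2/3 ("long") and 0.8 ("short")
```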

Keywords: graphic homonymy, homographs, WSD, speech synthesis, Chechen language, low-resource languages, text corpus

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License