doi: 10.17586/2226-1494-2024-24-1-41-50
Homograph recognition algorithm based on Euclidean metric
Article in Russian
For citation:
Izrailova E.S., Astemirov A.V., Badaeva A.S., Sultanov Z.A., Umarkhadzhiev S.M., Khekhaev M.-S.L., Yasaeva M.L. Homograph recognition algorithm based on Euclidean metric. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 1, pp. 41–50 (in Russian). doi: 10.17586/2226-1494-2024-24-1-41-50
Abstract
Resolving homonymy-related ambiguity in the Chechen language has become especially relevant since speech synthesis systems for the language were created. The main weakness of Chechen speech synthesizers is the misreading of homographs that differ only in vowel length, because vowel length is not marked in writing. Diphthongs cause further problems, since they are written the same way as the monophthongs closest to them in sound. Improving the quality of synthesized Chechen speech therefore requires a program that recognizes homographs automatically. To this end, the article treats homograph recognition as a word sense disambiguation (WSD) task. Algorithmic (supervised) methods based on a pre-annotated database were selected for the Chechen language. Such methods are the most common approach to WSD, but they require large annotated corpora, which are unavailable for most of the world's languages, including Chechen. Chechen is a low-resource language, for which the optimal approach in terms of labor and time is a semi-supervised hybrid method of homograph recognition that combines algorithmic and statistical methods. The article presents the authors' algorithm, which recognizes homographs from six adjacent words in a sentence. The method is implemented as a program. Preparing the input data for the algorithm includes manually labeling sentences with the senses of the homographs they contain. The program was evaluated with standard metrics and achieved an F1 score of 39 % and an accuracy of 45 %.
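The abstract does not spell out the algorithm, so the following Python sketch is only one possible reading of "six adjacent words" combined with the Euclidean metric named in the title. It assumes a window of three words on each side of the homograph, bag-of-words count vectors, and a nearest-centroid rule; the function names, the English toy homograph "bass", and the training sentences are all hypothetical and do not come from the article.

```python
import math
from collections import Counter


def context_window(tokens, index, size=3):
    """Return up to `size` words on each side of tokens[index] (assumed 3 + 3 = 6)."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right


def centroid(windows):
    """Average the bag-of-words counts over all context windows of one sense."""
    total = Counter()
    for window in windows:
        total.update(window)
    return {word: count / len(windows) for word, count in total.items()}


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def train(labeled):
    """Build one centroid per sense from (tokens, homograph_index, sense) triples."""
    by_sense = {}
    for tokens, index, sense in labeled:
        by_sense.setdefault(sense, []).append(context_window(tokens, index))
    return {sense: centroid(windows) for sense, windows in by_sense.items()}


def predict(centroids, tokens, index):
    """Pick the sense whose centroid is closest to the context of tokens[index]."""
    vector = Counter(context_window(tokens, index))
    return min(centroids, key=lambda sense: euclidean(vector, centroids[sense]))


# Hypothetical, manually labeled toy data (English stand-in for Chechen sentences).
labeled = [
    ("he caught a large bass in the lake".split(), 4, "fish"),
    ("the bass swam under the boat".split(), 1, "fish"),
    ("she plays bass in a jazz band".split(), 2, "music"),
    ("turn up the bass on the speaker".split(), 3, "music"),
]
centroids = train(labeled)

print(predict(centroids, "we saw a bass in the lake".split(), 3))
print(predict(centroids, "he plays bass in the band".split(), 2))
```

With a real corpus, the manually labeled sentences mentioned in the abstract would take the place of `labeled`, and the predicted senses could be scored against held-out labels to obtain accuracy and F1.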
A comparison with the results of other methods and models showed that the accuracy of the algorithm presented in this article is closest to that of algorithms based on the Lesk method. For English, the Lesk method yielded F1 scores of 41.1 % (simple Lesk) and 51.1 % (extended Lesk). Methods based on neural networks achieve higher WSD accuracy for most languages; however, they require large corpora, which are not always available for low-resource languages, including Chechen.
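For readers unfamiliar with the Lesk baseline used in this comparison, a minimal Python sketch of the simplified variant follows: it picks the sense whose dictionary gloss shares the most words with the context of the ambiguous word. The glosses, the stopword list, and the function name are illustrative assumptions, not material from the article or from any particular dictionary.

```python
def simplified_lesk(context_words, glosses, stopwords=frozenset()):
    """Return the sense whose gloss overlaps most with the context words.

    `glosses` maps a sense label to its definition text. Ties go to the
    sense listed first, because only a strictly larger overlap replaces it.
    """
    context = {word.lower() for word in context_words} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        gloss_words = {word.lower() for word in gloss.split()} - stopwords
        overlap = len(context & gloss_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical glosses for the ambiguous word "cone".
glosses = {
    "pine_cone": "fruit of an evergreen tree such as pine or fir",
    "ice_cream_cone": "crisp wafer for holding ice cream",
}
stopwords = frozenset({"a", "an", "the", "of", "for", "or", "as", "such", "at"})

print(simplified_lesk("she bought an ice cream cone at the beach".split(), glosses, stopwords))
print(simplified_lesk("the cone fell from the pine tree".split(), glosses, stopwords))
```

The extended variant mentioned above additionally compares the glosses of related senses, which requires a lexical resource such as WordNet.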
Keywords: graphic homonymy, homographs, WSD, speech synthesis, Chechen language, low-resource languages, text corpus
References
- Izrailova E. Creating a system for synthesizing the Chechen speech. Izvestia: Herzen University Journal of Humanities & Sciences, 2020, no. 198, pp. 171–177. (in Russian). https://doi.org/10.33910/1992-6464-2020-198-171-177
- Izrailova E.S., Badaeva A.S. Analysis of the speech signal quality of the Chechen speech synthesis system. Automatic Documentation and Mathematical Linguistics, 2021, vol. 55, no. 2, pp. 74–78. https://doi.org/10.3103/S0005105521020059
- Lesk M. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. Proc. of the 5th Annual International Conference on Systems Documentation, 1986, pp. 24–26. https://doi.org/10.1145/318723.318728
- Banerjee S., Pedersen T. An adapted Lesk algorithm for word sense disambiguation using WordNet. Lecture Notes in Computer Science, 2002, vol. 2276, pp. 136–145. https://doi.org/10.1007/3-540-45715-1_11
- Lastra-Diaz J.J., Goikoetxea J., Taieb M.A.H., Garcia-Serrano A., Aouicha M.B., Agirre E. A reproducible survey on word embeddings and ontology-based methods for word similarity: linear combinations outperform the state of the art. Engineering Applications of Artificial Intelligence, 2019, vol. 85, pp. 645–665. https://doi.org/10.1016/j.engappai.2019.07.010
- Kumar S., Jat S., Saxena K., Talukdar P. Zero-shot word sense disambiguation using sense definition embeddings. Proc. of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5670–5681. https://doi.org/10.18653/v1/p19-1568
- Scozzafava F., Maru M., Brignone F., Torrisi G., Navigli R. Personalized PageRank with syntagmatic information for multilingual Word Sense Disambiguation. Proc. of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020, pp. 37–46. https://doi.org/10.18653/v1/2020.acl-demos.6
- Escudero G., Marquez L., Rigau G., Salgado J.G. On the portability and tuning of supervised word sense disambiguation systems. Research report, 2000.
- Manning C.D., Clark K., Hewitt J., Khandelwal U., Levy O. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 2020, vol. 117, no. 48, pp. 30046–30054. https://doi.org/10.1073/pnas.1907367117
- Lin D. Automatic retrieval and clustering of similar words. Proc. of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics. V. 2, 1998, pp. 768–774. https://doi.org/10.3115/980691.980696
- Hadiwinoto C., Ng H.T., Gan W.C. Improved Word Sense Disambiguation using pre-trained contextualized word representations. Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5297–5306. https://doi.org/10.18653/v1/D19-1533
- Vial L., Lecouteux B., Schwab D. Sense vocabulary compression through the semantic knowledge of WordNet for neural Word Sense Disambiguation. Proc. of the 10th Global Wordnet Conference, 2019, pp. 108–117.
- Scarlini B., Pasini T., Navigli R. SensEmBERT: Context-enhanced sense embeddings for multilingual Word Sense Disambiguation. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, no. 5, pp. 8758–8765. https://doi.org/10.1609/aaai.v34i05.6402
- Scarlini B., Pasini T., Navigli R. With more contexts comes better performance: Contextualized sense embeddings for all-round Word Sense Disambiguation. Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 3528–3539. https://doi.org/10.18653/v1/2020.emnlp-main.285
- Zhang C.X., Liu R., Gao X.Y., Yu B. Graph convolutional network for word sense disambiguation. Discrete Dynamics in Nature and Society, 2021, vol. 2021, pp. 2822126. https://doi.org/10.1155/2021/2822126
- Conia S., Navigli R. Framing Word Sense Disambiguation as a multi-label problem for model-agnostic knowledge integration. Proc. of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 3269–3275. https://doi.org/10.18653/v1/2021.eacl-main.286
- Amrami A., Goldberg Y. Towards better substitution-based word sense induction. arXiv, 2019, arXiv:1905.12598. https://doi.org/10.48550/arXiv.1905.12598
- Arefyev N., Sheludko B., Panchenko A. Combining lexical substitutes in neural word sense induction. Proc. of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 2019, pp. 62–70. https://doi.org/10.26615/978-954-452-056-4_008
- Vasilescu F., Langlais P., Lapalme G. Evaluating variants of the Lesk approach for disambiguating words. Proc. of the Fourth International Conference on Language Resources and Evaluation (LREC’04), 2004.
- Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.
- El-Razzaz M., Fakhr M.W., Maghraby F.A. Arabic Gloss WSD Using BERT. Applied Sciences, 2021, vol. 11, no. 6, pp. 2567. https://doi.org/10.3390/app11062567
- Kilgarriff A., Rosenzweig J. Framework and results for English SENSEVAL. Computers and the Humanities, 2000, vol. 34, no. 1, pp. 15–48. https://doi.org/10.1023/A:1002693207386
- Gataullin R.R., Gilmullin R.A., Khakimov B.E. Morphological disambiguation in the national corpus of Tatar language using Purepos and LSTM models. VI International Conference on Computer Processing of Turkic Languages “TurkLang 2018” (Proceedings of the Conference), Tashkent, Navoiy Universiteti Publ., 2018, pp. 133–138. (in Russian)
- Haveliwala T.H. Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 2003, vol. 15, no. 4, pp. 784–796. https://doi.org/10.1109/tkde.2003.1208999
- Peters M.E., Neumann M., Iyyer M., Gardner M., Clark C., Lee K., Zettlemoyer L. Deep contextualized word representations. Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 2227–2237. https://doi.org/10.18653/v1/N18-1202
- Khomitsevich O.G., Rybin S.V., Anichkin I.M. Application of linguistic analysis for text normalization and homonymy resolution in Russian text-to-speech system. Journal of Instrument Engineering, 2013, vol. 56, no. 2, pp. 42–46. (in Russian)
- WordNet: An Electronic Lexical Database. Ed. by Ch. Fellbaum. Cambridge, MA, MIT Press, 1998, 423 p.
- Yasaeva M.L. Creation of databases of Chechen texts for processing homograph recognition algorithms by computer systems. All-Russian Scientific and Practical Conference “Current Problems of Native Language and Literature Research”, Grozny, 2022, pp. 65–69. (in Russian)
- Karpov A.A., Verkhodanova V.O. Speech technologies for under-resourced languages of the world. Voprosy Jazykoznanija, 2015, no. 2, pp. 117–135. (in Russian)
- Israilov E.S., Astemirov A.V. Statistical context analysis program for removing graphic homonymy in texts in the Chechen language. Proceedings of the International Scientific Conference "Current Issues in the Development of Modern Science" dedicated to the 30th anniversary of the Academy of Sciences of the Chechen Republic, Makhachkala, Chechen Academy of Sciences, 2023, pp. 478–485. (in Russian)