PROBABILITY DISTRIBUTION OVER THE SET OF CLASSES IN ARABIC DIALECT CLASSIFICATION TASK

Durandin Oleg V., Hilal Nadezhda R., Strebkov Dmitrii Yu., Zolotykh Nikolay Yu.

2017, VOLUME 17, NUMBER 1 (January–February)

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

doi: 10.17586/2226-1494-2017-17-1-110-116

Article in Russian

For citation: Durandin O.V., Hilal N.R., Strebkov D.Y., Zolotykh N.Y. Probability distribution over the set of classes in Arabic dialect classification task. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2017, vol. 17, no. 1, pp. 110–116. doi: 10.17586/2226-1494-2017-17-1-110-116

Abstract

Subject of Research. We propose an approach to machine learning classification problems that uses information about the probability distribution over the set of class labels in the training data. The algorithm is illustrated on a complex natural language processing task: the classification of Arabic dialects. Method. Each object in the training set is associated with a probability distribution over the set of class labels rather than with a single class label. The proposed approach takes this distribution into account when solving the classification problem in order to improve the quality of the resulting classifier. Main Results. The approach is illustrated on the example of automatic classification of Arabic dialects. The analyzed data were collected from the Twitter social network, contain marker words, and belong to six Arabic dialects (Saudi, Levantine, Algerian, Egyptian, Iraqi, Jordanian) and to Modern Standard Arabic (MSA). The results show that taking the probability distributions over the set of classes into account increases the quality of the classifier. In the experiments, even a relatively naive use of the probability distributions improves the precision of the classifier from 44% to 67%. Practical Relevance. The approach and the corresponding algorithm can be used effectively when manual annotation by experts requires significant financial and time resources, but a system of heuristic rules can be created instead. Implementing the proposed algorithm makes it possible to reduce data preparation costs significantly without a substantial loss in classification precision.
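The idea of training on label distributions instead of single labels can be sketched as follows. This is a minimal illustration only, not the authors' algorithm: it swaps in a multinomial naive Bayes model over character bigrams, in which each training object contributes to every class in proportion to its probability. The class names "A" and "B", the toy strings, the `SoftLabelNaiveBayes` class, and the smoothing constant `alpha` are all hypothetical.

```python
import math
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return the list of character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class SoftLabelNaiveBayes:
    """Multinomial naive Bayes trained on label distributions (illustrative)."""

    def __init__(self, classes, alpha=1.0):
        self.classes = list(classes)
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        self.class_mass = defaultdict(float)
        self.feature_mass = {c: defaultdict(float) for c in self.classes}
        self.vocab = set()

    def fit(self, texts, label_dists):
        # Each object adds fractional counts to every class,
        # weighted by the probability assigned to that class.
        for text, dist in zip(texts, label_dists):
            grams = char_ngrams(text)
            self.vocab.update(grams)
            for c, p in dist.items():
                self.class_mass[c] += p
                for g in grams:
                    self.feature_mass[c][g] += p
        return self

    def predict_proba(self, text):
        grams = char_ngrams(text)
        total = sum(self.class_mass.values())
        k, v = len(self.classes), len(self.vocab)
        log_scores = {}
        for c in self.classes:
            denom = sum(self.feature_mass[c].values()) + self.alpha * v
            score = math.log((self.class_mass[c] + self.alpha) / (total + self.alpha * k))
            for g in grams:
                score += math.log((self.feature_mass[c].get(g, 0.0) + self.alpha) / denom)
            log_scores[c] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}

    def predict(self, text):
        proba = self.predict_proba(text)
        return max(proba, key=proba.get)


# Toy training set: every object carries a distribution, not a hard label.
texts = ["aaab", "aaba", "bbba", "bbab"]
label_dists = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.8, "B": 0.2},
    {"A": 0.1, "B": 0.9},
    {"A": 0.3, "B": 0.7},
]
model = SoftLabelNaiveBayes(classes=["A", "B"]).fit(texts, label_dists)
print(model.predict("aaaa"), model.predict_proba("aaaa"))
```

In a setting like the one described above, the distributions would come from heuristic rules (for example, rules based on marker words) rather than from manual annotation; how those rules assign probabilities is not specified here and is left out of the sketch.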

Keywords: classification task, multiclass classification, automatic annotation, Arabic dialects, classification of dialects, label noise, class noise

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License