Article in Russian
For citation: Durandin O.V., Hilal N.R., Strebkov D.Y., Zolotykh N.Y. Probability distribution over the set of classes in Arabic dialect classification task. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2017, vol. 17, no. 1, pp. 110–116. doi: 10.17586/2226-1494-2017-17-1-110-116
Abstract
Subject of Research. We propose an approach to machine learning classification problems that uses information about the probability distribution over the set of class labels in the training data. The algorithm is illustrated on a complex natural language processing task: the classification of Arabic dialects. Method. Each object in the training set is associated with a probability distribution over the class label set rather than with a single class label. The proposed approach takes this distribution into account when solving the classification problem in order to improve the quality of the resulting classifier. Main Results. The approach is illustrated on the task of automatic Arabic dialect classification. The analyzed data were mined from the Twitter social network, contain word markers, and belong to six Arabic dialects (Saudi, Levantine, Algerian, Egyptian, Iraqi, and Jordanian) and to Modern Standard Arabic (MSA). The results show that taking probability distributions over the set of classes into account improves the quality of the resulting classifier. In our experiments, even a relatively naive use of the probability distributions raises the precision of the classifier from 44% to 67%. Practical Relevance. The approach and the corresponding algorithm can be used effectively when manual annotation by experts requires significant financial and time resources but a system of heuristic rules can be created instead. Implementing the proposed algorithm makes it possible to significantly reduce data preparation costs without a substantial loss of classification precision.
Keywords: classification task, multiclass classification, automatic annotation, Arabic dialects, classification of dialects, label noise, class noise
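The idea of training on label distributions rather than single labels can be illustrated with a small sketch. The code below is an assumption made for illustration and is not the authors' algorithm: it uses a multinomial naive Bayes model over character bigrams in which every training object contributes fractional ("soft") counts to each class in proportion to its label distribution. The function names, the classes "A" and "B", and the toy strings are all hypothetical.

```python
import math
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return the list of character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_soft_nb(samples, n=2):
    """Collect soft counts for a multinomial naive Bayes model.

    samples: list of (text, {class_label: probability}) pairs.
    Each n-gram occurrence is split across classes in proportion to
    the object's label distribution instead of going to one class.
    """
    class_mass = defaultdict(float)                      # soft class counts
    gram_mass = defaultdict(lambda: defaultdict(float))  # class -> n-gram -> soft count
    vocab = set()
    for text, dist in samples:
        grams = char_ngrams(text, n)
        vocab.update(grams)
        for label, p in dist.items():
            class_mass[label] += p
            for g in grams:
                gram_mass[label][g] += p
    return class_mass, gram_mass, vocab


def predict(model, text, n=2, alpha=1.0):
    """Return the most likely class using Laplace-smoothed log scores."""
    class_mass, gram_mass, vocab = model
    total = sum(class_mass.values())
    best_label, best_score = None, -math.inf
    for label, mass in class_mass.items():
        if mass <= 0:
            continue  # a class with zero total probability cannot be predicted
        denom = sum(gram_mass[label].values()) + alpha * len(vocab)
        score = math.log(mass / total)
        for g in char_ngrams(text, n):
            if g in vocab:  # n-grams unseen in training are ignored
                score += math.log((gram_mass[label].get(g, 0.0) + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: each object carries a distribution over two classes.
samples = [
    ("aaab", {"A": 0.9, "B": 0.1}),
    ("abaa", {"A": 0.8, "B": 0.2}),
    ("bbba", {"A": 0.1, "B": 0.9}),
    ("babb", {"A": 0.3, "B": 0.7}),
]
model = train_soft_nb(samples)
pred_a = predict(model, "aaaa")
pred_b = predict(model, "bbbb")
```

With hard labels the same code reduces to ordinary naive Bayes, since each distribution then puts probability 1 on a single class. A distribution produced by heuristic annotation rules can be passed in unchanged, which is the situation the abstract describes.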
References
1. Kearns M.J., Vazirani U.V. An Introduction to Computational Learning Theory. MIT Press, 1994, 221 p.
2. Flach P. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, 2012, 409 p.
3. Bezdek J.C., Keller J., Krishnapuram R., Pal N. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. Springer, 1999, 776 p.
4. Denoeux T., Zouhal L.M. Handling possibilistic labels in pattern classification using evidential reasoning. Fuzzy Sets and Systems, 2001, vol. 122, no. 3, pp. 409–424. doi: 10.1016/s0165-0114(00)00086-5
5. Denoeux T. Maximum likelihood estimation from uncertain data in the belief function framework. IEEE Transactions on Knowledge and Data Engineering, 2013, vol. 25, no. 1, pp. 119–130. doi: 10.1109/tkde.2011.201
6. Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 7th ed. Springer, 2013, 745 p.
7. Durandin O., Hilal N., Strebkov D. Automatic Arabic dialect identification. Computational Linguistics and Intellectual Technologies: Proc. Int. Conf. “Dialogue 2016”. Moscow, 2016.
8. Habash N.Y. Introduction to Arabic Natural Language Processing. Toronto, Morgan & Claypool, 2010, 186 p.
9. Heintz I. Arabic language modeling with stem-derived morphemes for automatic speech recognition. Ph.D. thesis. Ohio State University, 2010, 202 p.
10. Almeman K., Lee M. Toward developing a multi-dialect morphological analyser for Arabic. Proc. 4th Int. Conf. on Arabic Language Processing. Rabat, Morocco, 2012, pp. 19–25.
11. Cavnar W.B., Trenkle J.M. N-gram-based text categorization. Proc. 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994, pp. 161–175.
12. Miao Y., Keselj V., Milios E. Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering. Proc. 14th ACM Int. Conf. on Information and Knowledge Management, 2005, pp. 357–358.
13. Breiman L. Random forests. Machine Learning, 2001, vol. 45, no. 1, pp. 5–32.
14. Zhang M.L., Zhou Z.H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2014, vol. 26, no. 8, pp. 1819–1837. doi: 10.1109/tkde.2013.39
15. Segal M.R. Machine Learning Benchmarks and Random Forests Regression. Technical Report. Univ. California, San Francisco, 2004.