GENERATING DATASETS FOR THE BINARY CLASSIFICATION TASK BASED ON THEIR CHARACTERISTIC DESCRIPTIONS

Zabashta Alexey S, Andrey A. Filchenkov

doi:10.17586/2226-1494-2017-17-3-498-505

2017 , VOLUME 17, NUMBER 3 ( May-June )

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

Publications

Editor-in-Chief

Nikiforov
Vladimir O.
D.Sc., Prof.

Partners

doi: 10.17586/2226-1494-2017-17-3-498-505

GENERATING DATASETS FOR THE BINARY CLASSIFICATION TASK BASED ON THEIR CHARACTERISTIC DESCRIPTIONS

A. S. Zabashta, A. A. Filchenkov

Read the full article

Article in Russian

For citation: Zabashta A.S., Filchenkov A.A. Generating datasets for the binary classification task based on their characteristic descriptions. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2017, vol. 17, no. 3, pp. 498–505 (in Russian). doi: 10.17586/2226-1494-2017-17-3-498-505

Abstract

Subject of Study. We present a method for generating instances of the binary classification task by (according to, based on) their characteristic descriptions in the form of a meta-feature vector. We propose a naïve method for the same problem solution to be used as a referral one. We study the characteristic space of the binary classification task instances, as well as the methods for this space traversal. Method. The proposed method is based on genetic algorithm, where the distance in the characteristic space from the description vector of the generated instance for the binary classification task to the specified one is used as the minimized objective function. We developed the crossover and mutation operators for the genetic algorithm. These operators are based on such transformations as addition or removal of features and objects from datasets. Main Results. In order to validate the proposed method, we chose several non-trivial two-dimensional meta-feature spaces that were generated from statistical, information-theoretical and structural characteristics of classification task instance. We used the baseline method to evaluate the relative error of the proposed method. Both methods used the same number of classification tasks instances. The proposed method outperformed the naïve method and reduced average error by 30 times. Practical Relevance. The proposed method for generating instances for classification task based on their characteristic description allows obtaining unknown instances that are required to evaluate the performance of classifiers in certain areas of the meta-features space for design of automatic algorithm selection systems

Keywords: machine learning, meta-learning, classification problem, evolutionary computation, genetic algorithm

Acknowledgements. This work was financially supported by the Government of the Russian Federation, Grant 074-U01, and the Russian Foundation for Basic Research, Grant 16-37-60115 mol_a_dk.

References

1. Cook S.A., Mitchell D.G. Finding hard instances of the satisfiability problem. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1997, vol. 35, pp. 1–17. doi: 10.1090/dimacs/035/01

2. Horie S., Watanabe O. Hard instance generation for SAT. Lecture Notes in Computer Science, 1997, vol. 1350, pp.
22–31. doi: 10.1007/3-540-63890-3_4

3. Selman B., Mitchell D.G., Levesque H.J. Generating hard satisfiability problems. Artificial Intelligence, 1996, vol. 81, no. 1-2, pp. 17–29.

4. Xu K., Boussemart F., Hemery F., Lecoutre C. Random constraint satisfaction: easy generation of hard (satisfiable) instances. Artificial Intelligence, 2007, vol. 171, no. 8, pp. 514–534. doi: 10.1016/j.artint.2007.04.001

5. van Hemert J.I. Evolving combinatorial problem instances that are difficult to solve. Evolutionary Computation, 2006, vol. 14, no. 4, pp. 433–462. doi: 10.1162/evco.2006.14.4.433

6. Ajtai M. Generating hard instances of the short basis problem. Lecture Notes in Computer Science, 1999, vol. 1644, pp. 1–9.

7. Buzdalov M.V. Tests generation for olympiad programming tasks using genetic algorithms. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2011, no. 2, pp. 72–77. (In Russian)

8. Buzdalov M.V. Test generation for programming challenge tasks in graph theory by evolution strategies. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2011, no. 6, pp. 123–127. (In Russian)

9. Smith-Miles K., Tan T.T. Measuring algorithm footprints in instance space. IEEE Congress on Evolutionary Computation. Brisbane, Australia, 2012, pp. 3446–3453. doi: 10.1109/CEC.2012.6252992

10. Asahiro Y., Iwama K., Miyano E. Random generation of test instances with controlled attributes. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1996, vol. 26, pp. 377–393. doi: 10.1090/dimacs/026/18

11. Smith-Miles K., Wreford B., Lopes L., Insani N. Predicting metaheuristic performance on graph coloring problems using data mining. Studies in Computational Intelligence, 2013, vol. 434, pp. 417–432. doi: 10.1007/978-3-642-30671-6-16

12. Smith-Miles K., Baatar D., Wreford B., Lewis R. Towards objective measures of algorithm performance across instance space. Computers and Operations Research, 2014, vol. 45, pp. 12–24. doi: 10.1016/j.cor.2013.11.015

13. Giraud-Carrier C. Metalearning – a tutorial. Tutorial at 7^th Int. Conf. on Machine Learning and Applications, ICMLA. San Diego, California, 2008, pp. 11–13.

14. Brazdil P., Giraud-Carrier C., Soares C., Vilalta R. Metalearning. Applications to Data Mining. Springer, 2009, 176 p. doi: 10.1007/978-3-540-73263-1

15. Wolpert D.H., Macready W.G. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1997, vol. 1, no. 1, pp. 67–82. doi: 10.1109/4235.585893

16. Wolpert D.H. The supervised learning no-free-lunch theorems. Soft Computing and Industry, 2002, pp. 25–42. doi: 10.1007/978-1-4471-0123-9_3

17. Vorontsov K.V. Matematicheskie metody obucheniya po pretsedentam (teoriya obucheniya mashin). Available at: http://docplayer.ru/2064-K-v-voroncov-http-www-ccas-ru-voron-voron-ccas-ru.html (accessed 24.03.2017).

18. Nikolenko S.I., Tulup'ev A.L. Self-Learning Systems. Moscow, MTsNMO Publ., 2009, 287 p. (In Russian)

19. Jin Y., Branke J. Evolutionary optimization in uncertain environments – a survey. IEEE Transactions on Evolutionary Computation, 2005, vol. 9, no. 3, pp. 303–317. doi: 10.1109/TEVC.2005.846356

20. Gladkov L.A., Kureichik V.V., Kureichik V.M. Genetic Algorithms.Moscow, Fizmatlit Publ., 2006, 319 p. (In Russian)

21. Kirkpatrick S., Gelatt C.D., Vecchi M.P. Optimization by simulated annealing. Science, 1983, vol. 220, no. 4598, pp. 671–680.

22. Skobtsov Yu.A.Fundamentals of Evolutionary Computation. Donetsk, DonNTU Publ., 2008, 326 p. (in Russian)

23. Vanschoren J., van Rijn J.N., Bischl B., Torgo L. OpenML: networked science in machine learning. ACMSIGKDDExplorationsNewsletter, 2014, vol. 15, pp. 49–60. doi: 10.1145/2641190.2641198

24. Filchenkov A., Pendryak A. Datasets meta-feature description for recommending feature selection algorithm. Proc. Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT. St. Petersburg, Russia, 2015, pp. 11–18. doi: 10.1109/AINL-ISMW-FRUCT.2015.7382962

25. Indyk P., Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. 13^th Annual ACM Symposium on Theory of Computing. Dallas, USA, 1998, pp. 604–613.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License