doi: 10.17586/2226-1494-2023-23-5-967-979


Clustering in big data analytics: a systematic review and comparative analysis (review article) 

H. Shili


Article in English

For citation:
Shili H. Clustering in big data analytics: a systematic review and comparative analysis (review article). Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2023, vol. 23, no. 5, pp. 967–979. doi: 10.17586/2226-1494-2023-23-5-967-979


Abstract
In the modern world, the widespread use of information and communication technology has led to the accumulation of vast and diverse quantities of data, commonly known as Big Data. This creates a need for novel concepts and analytical techniques that help individuals extract meaningful insights from rapidly increasing volumes of digital data. Clustering is a fundamental approach used in data mining to retrieve valuable information. Although a wide range of clustering methods has been described and implemented in various fields, their sheer variety makes it difficult to keep up with the latest advancements. This research aims to provide a comprehensive evaluation of the clustering algorithms developed for Big Data, highlighting their various features. The study also conducts empirical evaluations on six large datasets, using several validity metrics and computing time to assess the performance of the clustering methods under consideration.

Keywords: big data, clustering, data mining, empirical evaluations, performance metrics
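
The kind of assessment the abstract describes, scoring a partition with a validity metric and recording computing time, might look roughly like the sketch below. The abstract does not name the metrics used, so the Davies-Bouldin index is chosen here only as an illustration; the toy points, the label lists, and the helper names `centroid`, `davies_bouldin`, and `timed` are all hypothetical and are not taken from the article.

```python
import math
import time


def centroid(points):
    """Mean of a non-empty list of equal-length numeric tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def davies_bouldin(points, labels):
    """Davies-Bouldin index: lower values mean compact, well-separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    centers = {k: centroid(v) for k, v in clusters.items()}
    # Scatter: mean distance from each member to its own cluster centroid.
    scatter = {
        k: sum(math.dist(p, centers[k]) for p in v) / len(v)
        for k, v in clusters.items()
    }
    # For each cluster, keep the largest similarity ratio to any other cluster.
    # Assumes no two centroids coincide (that would divide by zero).
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Toy data (assumed): two tight groups far apart.
# The label lists stand in for the output of a clustering algorithm.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]

good_score, good_time = timed(davies_bouldin, points, good_labels)
bad_score, _ = timed(davies_bouldin, points, bad_labels)
print(f"good partition: DB = {good_score:.3f} ({good_time:.6f} s)")
print(f"bad partition:  DB = {bad_score:.3f}")
```

In this sketch `timed` wraps the metric only because no clustering algorithm is included; in a comparison of methods it would more naturally wrap the clustering call itself, with the validity index computed on the labels it returns.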




This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
