Clustering in big data analytics: a systematic review and comparative analysis (review article)

Shili Hechmi

2023, Volume 23, Number 5 (September-October)

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

doi: 10.17586/2226-1494-2023-23-5-967-979

Article in English

For citation:

Shili H. Clustering in big data analytics: a systematic review and comparative analysis (review article). Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2023, vol. 23, no. 5, pp. 967–979. doi: 10.17586/2226-1494-2023-23-5-967-979

Abstract

In the modern world, the widespread use of information and communication technology has led to the accumulation of vast and diverse quantities of data, commonly known as Big Data. This calls for novel concepts and analytical techniques that help individuals extract meaningful insights from rapidly growing volumes of digital data. Clustering is a fundamental data mining approach for retrieving valuable information. Although a wide range of clustering methods has been described and implemented in various fields, their sheer variety makes it difficult to keep up with the latest advancements. This research aims to provide a comprehensive evaluation of the clustering algorithms developed for Big Data, highlighting their various features. The study also conducts empirical evaluations on six large datasets, using several validity metrics and computing time to assess the performance of the clustering methods under consideration.

Keywords: big data, clustering, data mining, empirical evaluations, performance metrics
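As a rough illustration of the kind of empirical evaluation described in the abstract, the sketch below scores one clustering run with an internal validity index and its computing time. It is not the study's experimental code: the synthetic two-blob data, the plain Lloyd-style k-means with a simple deterministic initialisation, the choice of the Davies-Bouldin index, and the helper names `make_blobs`, `kmeans`, `davies_bouldin` and `evaluate` are all assumptions made for this example.

```python
import math
import random
import time


def make_blobs(n_per_blob=100, seed=0):
    """Synthetic stand-in data: two 2-D Gaussian blobs (hypothetical, not a study dataset)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 10.0)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in centers for _ in range(n_per_blob)]


def kmeans(points, k, iters=50):
    """Plain Lloyd-style k-means; returns (labels, centroids)."""
    n = len(points)
    # Assumption: evenly spaced points as initial centroids, for reproducibility.
    centroids = [points[i * n // k] for i in range(k)]
    labels = [-1] * n
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


def davies_bouldin(points, labels, centroids):
    """Davies-Bouldin index: lower values mean more compact, better separated clusters."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Mean distance of the cluster's members to its centroid.
        scatter.append(sum(math.dist(p, centroids[j]) for p in members) / len(members)
                       if members else 0.0)
    worst = [max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k


def evaluate(points, k):
    """Run one clustering and report a validity score together with wall-clock time."""
    start = time.perf_counter()
    labels, centroids = kmeans(points, k)
    elapsed = time.perf_counter() - start
    return {"labels": labels,
            "davies_bouldin": davies_bouldin(points, labels, centroids),
            "seconds": elapsed}


if __name__ == "__main__":
    result = evaluate(make_blobs(), k=2)
    print(f"Davies-Bouldin: {result['davies_bouldin']:.3f}, time: {result['seconds']:.4f} s")
```

A comparative study would repeat such a run for each algorithm and dataset and tabulate the resulting scores and timings side by side; timing only the clustering call, as done here, keeps the cost of computing the index out of the reported computing time.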

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License