doi: 10.17586/2226-1494-2020-20-5-755-760


PROCESS CHARACTERISTICS ESTIMATION IN WEB APPLICATIONS USING K-MEANS CLUSTERING

V. V. Evstratov, M. S. Ananyevskiy


Article in Russian

For citation:
Evstratov V.V., Ananyevskiy M.S. Process characteristics estimation in web applications using K-means clustering. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2020, vol. 20, no. 5, pp. 755–760 (in Russian). doi: 10.17586/2226-1494-2020-20-5-755-760


Abstract
Subject of Research. The paper studies the problem of estimating process characteristics for the particular case of predicting user activity in online computer games. Several machine learning methods are considered, and the advantages of clustering-based approaches are identified. A range of metrics for assessing clustering quality is examined.
Method. A clustering-based approach to estimating process characteristics was developed from a hypothesis formulated during preliminary analysis of user activity data. Activity data were collected for users whose values of the predicted characteristics were already known. Each user was represented as a pair of vectors: the first described the user's first days of activity, and the second described the later days for which performance is to be predicted. The first-days vectors were used as training data for the K-means algorithm. A specially developed entropy-like loss function was used to find a value of K suitable for the problem. Each cluster was matched with the vector of predicted process characteristics averaged over all users in that cluster. These matches were then used to predict the characteristics of new users.
Main Results. An approach to determining a suitable number of clusters is proposed that takes into account the specifics of the data considered. A numerical experiment demonstrates the applicability of the developed method.
Practical Relevance. The proposed approach allows several characteristics of online game users to be predicted simultaneously and therefore supports a range of planning and analytics tasks in online game development. For example, the method was used to analyze the payback of developing new game elements and to predict server load so that computational resources could be increased in advance.
The advantages of the developed method are that the training set needs no expert labeling and that the computational cost is relatively low, owing to the low computational complexity of the proposed loss function used to estimate the hyperparameter K.
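
The pipeline described under Method can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names, the toy data, and the deterministic farthest-point initialization are assumptions made for illustration only. The entropy-like loss used to choose K is not specified in the abstract, so K is simply fixed here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest(centroids, point, candidates=None):
    """Index of the centroid closest to `point` (optionally among `candidates`)."""
    indices = range(len(centroids)) if candidates is None else candidates
    return min(indices, key=lambda i: squared_distance(centroids[i], point))


def init_centroids(points, k):
    """Deterministic farthest-point seeding (an assumption, chosen for reproducibility)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iterations=100):
    """Standard Lloyd iterations: assign points, recompute means, stop when stable."""
    centroids = init_centroids(points, k)
    for _ in range(max_iterations):
        labels = [nearest(centroids, p) for p in points]
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # An empty cluster keeps its previous centroid.
            updated.append(mean_vector(members) if members else centroids[i])
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(centroids, p) for p in points]
    return centroids, labels


def fit_predictor(early_vectors, target_vectors, k):
    """Cluster the first-days vectors and average the target vectors per cluster."""
    centroids, labels = kmeans(early_vectors, k)
    profiles = {}
    for i in range(k):
        members = [t for t, label in zip(target_vectors, labels) if label == i]
        if members:
            profiles[i] = mean_vector(members)
    return centroids, profiles


def predict(centroids, profiles, new_early_vector):
    """Predict a new user's characteristics from the nearest non-empty cluster."""
    cluster = nearest(centroids, new_early_vector, candidates=list(profiles))
    return profiles[cluster]


# Hypothetical toy data: activity in the first three days (early) and
# two later characteristics to be predicted (target) for six known users.
early = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [9, 8, 10], [10, 9, 9], [8, 10, 9]]
target = [[0.5, 2], [1.5, 4], [1.0, 3], [20, 40], [22, 44], [24, 42]]

centroids, profiles = fit_predictor(early, target, k=2)
prediction = predict(centroids, profiles, [9, 9, 9])
print(prediction)  # [22.0, 42.0]
```

Because each cluster stores a whole averaged vector, a single nearest-centroid lookup yields all predicted characteristics of a new user at once, which matches the simultaneous prediction mentioned under Practical Relevance.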

Keywords: clustering, K-means, K-means algorithm, clustering quality assessment, entropy, machine learning, algorithms, web

Acknowledgements. This study has been supported by the Russian Foundation for Basic Research, grant no. 19-08-00865 А.



This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
Copyright 2001-2025 ©
Scientific and Technical Journal
of Information Technologies, Mechanics and Optics.
All rights reserved.
