Natural language based malicious domain detection using machine learning and deep learning

Abdul Samad Saleem Raja, Ganesan Pradeepa, Somasundaram Mahalakshmi, Manickam Sam Jayakumar

2023, Volume 23, Number 2 (March-April)

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

doi: 10.17586/2226-1494-2023-23-2-304-312

Article in English

For citation:

Saleem Raja A.S., Pradeepa G., Mahalakshmi S., Jayakumar M.S. Natural language based malicious domain detection using machine learning and deep learning. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2023, vol. 23, no. 2, pp. 304–312. doi: 10.17586/2226-1494-2023-23-2-304-312

Abstract

Cyberattacks remain a challenge because their number grows day by day. Cybercriminals employ a variety of strategies to manipulate and exploit their targets' vulnerabilities. Malicious URLs are one such strategy, used to target large groups on various social media platforms. To lure internet users, these web addresses are disguised as safe. Deliberate or inadvertent use of such URLs exposes the user or the organization in cyberspace and opens the way for further attacks. Systems that use rule-based or machine learning algorithms to find malicious URLs usually rely on feature engineering, which requires domain expertise and experience. Even then, the extracted features may not fully exploit the information in the dataset. The proposed method employs Natural Language Processing (NLP) approaches to vectorize the words in the URLs and applies machine learning and deep learning models for classification. Vectorization in NLP reduces the effort of feature engineering and maximizes the use of the dataset. Three different vectorization methods are used to vectorize the URL text. To evaluate the performance of the proposed method, two different datasets (D1 and D2) that are regularly used in this research domain were employed. The results demonstrate that, with the D1 dataset, the highest accuracy of 92.4 % is achieved by the Decision Tree (DT) with count vectorizer and by the Random Forest (RF) with Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer. With the D2 dataset, DT with TF-IDF vectorizer obtains a higher accuracy of 99.5 %. The Artificial Neural Network (ANN) model achieves 89.6 % accuracy with the D1 dataset and 99.2 % accuracy with the D2 dataset.
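
The vectorization step described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the tokenization rule, the sample URLs, the function names and the particular TF-IDF variant (smoothed inverse document frequency with L2 normalization) are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Split a URL into lowercase alphanumeric word tokens (assumed rule)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def count_vectors(urls):
    """Return (vocabulary, count matrix): one row of word counts per URL."""
    docs = [Counter(tokenize(u)) for u in urls]
    vocab = sorted(set().union(*docs))
    return vocab, [[d[w] for w in vocab] for d in docs]


def tfidf_vectors(urls):
    """Return (vocabulary, TF-IDF matrix) with L2-normalized rows."""
    vocab, counts = count_vectors(urls)
    n = len(urls)
    # Document frequency: number of URLs that contain each word.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n) / (1 + d)) + 1.0 for d in df]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        rows.append([x / norm for x in weighted])
    return vocab, rows


# Hypothetical sample URLs, not taken from the D1 or D2 datasets.
urls = [
    "http://example.com/login/account",
    "http://secure-login.example.net/verify/account",
    "https://example.org/docs/index.html",
]
vocab, counts = count_vectors(urls)
_, tfidf = tfidf_vectors(urls)
```

Each row of `counts` or `tfidf` is a fixed-length numeric vector that could be passed to a classifier such as a decision tree, a random forest or a neural network. Words that occur in every URL (here `example`) receive a lower TF-IDF weight than words that occur in only one, which is what lets the classifier focus on distinguishing tokens without hand-crafted features.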

Keywords: malicious domain, phishing URL, NLP, machine learning, deep learning, ANN, CNN

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License