DOI: 10.17586/2226-1494-2018-18-3-447-456


A. A. Vorobeva, V. A. Pozvolenko, A. S. Korobitsyna, A. A. Sharafiev

Article in Russian

For citation: Vorobeva A.A., Pozvolenko V.A., Korobitsyna A.S., Sharafiev A.A. Cross-domain web author identification. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2018, vol. 18, no. 3, pp. 447–456 (in Russian). doi: 10.17586/2226-1494-2018-18-3-447-456

The paper addresses cross-domain web author attribution (identification), in which a user's messages are obtained from several sources (web-sites). We focus on identifying a user of one web-site from the messages that user posted on another web-site. We found that texts written by the same user on different web-sites differ stylistically. We also determined that a single feature space can be formed for texts received from various sources while still providing sufficient accuracy of linguistic identification. Two subtasks were studied: 1) mixed sources: the training and test datasets both include messages from mixed sources (web-sites); 2) separated sources: the message sources of the training and test datasets do not intersect, so the training dataset includes texts from one source and the test dataset includes texts from another. The experiments showed an identification accuracy of 0.82 in the mixed sources task and 0.74 in the separated sources task. We conclude that texts created by one user on different web-sites do differ in style, but that a single feature space can nevertheless be formed for messages received from various web-sites with sufficient identification accuracy.
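The two subtasks differ only in how messages are divided into training and test sets. The sketch below illustrates that difference using the Python standard library. It is not the authors' implementation: the `Message` record, the function names, the toy data, and the 0.25 test fraction are all assumptions made for illustration, and no feature extraction or classifier is shown.

```python
import random
from collections import namedtuple

# Hypothetical record type: one message with its author and source web-site.
Message = namedtuple("Message", ["author", "source", "text"])


def mixed_sources_split(messages, test_fraction=0.25, seed=0):
    """Subtask 1 (mixed sources): shuffle all messages regardless of source,
    so both the training and the test set may contain every web-site."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def separated_sources_split(messages, train_source, test_source):
    """Subtask 2 (separated sources): train on messages from one web-site and
    test on messages from another, so the sources do not intersect."""
    train = [m for m in messages if m.source == train_source]
    test = [m for m in messages if m.source == test_source]
    return train, test


# Toy data (invented): two users, each posting five messages on two web-sites.
messages = [
    Message(author, source, f"{author} on {source} #{i}")
    for author in ("user_a", "user_b")
    for source in ("site_1", "site_2")
    for i in range(5)
]

mixed_train, mixed_test = mixed_sources_split(messages)
sep_train, sep_test = separated_sources_split(messages, "site_1", "site_2")
```

In the separated setting, every test message comes from a web-site the classifier never saw during training. That matches the abstract's description of the harder task, which reports the lower accuracy (0.74 versus 0.82).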

Keywords: web users' identification, forensic linguistics, linguistic identification, cross-domain identification, author attribution





This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License