doi: 10.17586/2226-1494-2016-16-3-482-496
AUTOMATIC SUMMARIZATION OF WEB FORUMS AS SOURCES OF PROFESSIONALLY SIGNIFICANT INFORMATION
For citation: Buraya K.I., Vinogradov P.D., Grozin V.A., Gusarova N.F., Dobrenko N.V., Trofimov V.A. Automatic summarization of web forums as sources of professionally significant information. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2016, vol. 16, no. 3, pp. 482–496. doi: 10.17586/2226-1494-2016-16-3-482-496
Abstract
Subject of Research. A competitive advantage of a modern specialist is the widest possible coverage of information sources that are useful for obtaining relevant professionally significant information (PSI). Professional web forums occupy a significant place among these sources. The paper considers the problem of automatic forum text summarization, i.e. identification of those fragments that contain professionally relevant information. Method. The research is based on statistical analysis of forum texts by means of machine learning. Six web forums devoted to technological aspects of various subject domains were selected for the study. The forums were labeled manually by experts. Using various machine learning methods, models were built that reflect the functional relationship between the estimated quality of PSI extraction and the features of posts. The cumulative NDCG metric and its variance were used to assess the quality of the models. Main Results. We have shown that the request context plays an important role in assessing PSI extraction efficiency. Request contexts characteristic of PSI extraction have been identified; they reflect different interpretations of users' information needs and are designated by the terms relevance and informational content. Scales for their assessment have been designed in line with internationally accepted approaches. We have experimentally confirmed that the results of forum summarization performed manually by experts depend significantly on the request context. We have shown that, in the overall assessment of PSI extraction efficiency, relevance is described rather well by a linear combination of features, whereas the assessment of informational content requires a nonlinear combination. In relevance assessment the leading role is played by features connected with keywords, while in the assessment of informational content the characteristics of the post text as a whole come to the fore, together with features related to the structure of the thread as a text and to the social graph. We have shown that the efficiency of extracting informative posts depends only weakly on the way keywords are assigned, whereas this dependence is essential for extracting relevant posts. The keyword extraction method that is most effective for real applications has been identified. We have shown that for extraction of relevant posts linear methods are more efficient than nonlinear ones, with the LDA model in between; for extraction of informative posts linear and nonlinear methods are equally efficient, and the LDA model is considerably inferior to both. We have proposed a conceptual model explaining the obtained results. Practical Relevance. The obtained results can serve as a basis for creating new algorithms of web forum summarization and for adequately applying existing ones, which will significantly reduce the time and resources users spend on obtaining and studying up-to-date professionally significant information.
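The quality measure named in the abstract, NDCG with its spread across threads, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the discount `1 / log2(rank + 1)` used here is an assumed common variant, and the function names and expert grades are hypothetical illustrations rather than the authors' implementation or data.

```python
import math
from statistics import mean, pvariance


def dcg(gains):
    """Discounted cumulative gain of expert grades listed in ranked order.

    Assumption: the common 1 / log2(rank + 1) discount, ranks starting at 1.
    """
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains_in_model_order):
    """DCG of the model's ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains_in_model_order, reverse=True))
    return dcg(gains_in_model_order) / ideal if ideal > 0 else 0.0


# Hypothetical expert grades of posts, listed in the order a model ranked them,
# one list per forum thread (illustrative values only).
threads = [[3, 2, 0, 1], [0, 1, 2], [2, 2, 1, 0]]

scores = [ndcg(t) for t in threads]
mean_ndcg = mean(scores)      # average ranking quality over threads
var_ndcg = pvariance(scores)  # spread of quality over threads
print(mean_ndcg, var_ndcg)
```

A model that ranks posts exactly as the expert grades them scores 1.0 on that thread; a low variance across threads suggests that the ranking quality is stable rather than dependent on a few easy threads.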