doi: 10.17586/2226-1494-2023-23-3-585-594


Joint recognition of text and layout in historical Russian documents

S. Mohammed, N. Teslya


Article in English

For citation:
Mohammed S., Teslya N. Joint recognition of text and layout in historical Russian documents. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2023, vol. 23, no. 3, pp. 585–594. doi: 10.17586/2226-1494-2023-23-3-585-594


Abstract
In this paper, we evaluate the Document Attention Network (DAN), the first end-to-end segmentation-free architecture, on historical Russian documents. The DAN model jointly recognizes text and layout: it takes whole documents of any size as input and outputs the text together with logical layout tokens. For comparison purposes, we conduct our experiments on the Digital Peter dataset, which has previously been recognized at line level. The dataset consists of manuscripts of Peter the Great; the ground truth is represented according to a sophisticated XML schema that enables an accurate, detailed definition of layout and text regions. At page level, we achieved a Character Error Rate (CER) of 18.71 %, a Word Error Rate (WER) of 39.7 %, a Layout Ordering Error Rate (LOER) of 14.11 %, and a mean Average Precision (mAP) of 66.67 %.
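For readers unfamiliar with the two text metrics reported above, the following is a minimal sketch of how CER and WER are commonly computed. It assumes the usual definition (total Levenshtein edit distance divided by total reference length, over characters for CER and over whitespace-separated words for WER); the abstract does not state the exact normalization used in the paper, and the function names `edit_distance`, `error_rate`, `cer`, and `wer` are illustrative, not taken from the authors' code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit distance over total reference length (assumed normalization)."""
    errors = sum(edit_distance(tokenize(r), tokenize(h))
                 for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return errors / total


def cer(refs, hyps):
    """Character Error Rate: tokens are single characters."""
    return error_rate(refs, hyps, list)


def wer(refs, hyps):
    """Word Error Rate: tokens are whitespace-separated words."""
    return error_rate(refs, hyps, str.split)
```

Under this definition, a single wrong character in a short word changes CER only slightly but counts as a whole word error, which is one reason WER is typically much higher than CER on the same output.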

Keywords: document understanding, handwritten text recognition, layout analysis, fully connected networks, transformers

Acknowledgements. The study was carried out at the expense of state funding, topic project No. FFZF-2022-0005.



This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
Copyright 2001-2024 ©
Scientific and Technical Journal
of Information Technologies, Mechanics and Optics.
All rights reserved.
