doi: 10.17586/2226-1494-2024-24-5-669-686
UDC 004.5, 004.93
Automatic sign language translation: a review of neural network methods for recognition and synthesis of spoken and signed speech
Article language: Russian
For citation:
Ivanko D.V., Ryumin D.A. Automatic sign language translation: a review of neural network methods for recognition and synthesis of spoken and signed speech // Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2024. V. 24. N 5. P. 669–686 (in Russian). doi: 10.17586/2226-1494-2024-24-5-669-686
Abstract
Introduction. The paper reviews modern methods and technologies for automatic machine sign language translation, covering recognition and synthesis of both spoken and signed speech. The reviewed methods are intended to enable effective communication between deaf, hard-of-hearing, and hearing people, and the solutions discussed can be applied in modern human-machine interaction interfaces. Methods. Key aspects of these technologies are examined: methods for recognition and synthesis of signed and audio-visual speech, existing datasets for training neural network models, and current systems for automatic machine sign language translation. State-of-the-art neural network approaches are presented, including deep learning methods such as convolutional and recurrent neural networks as well as transformers. The review also analyzes existing datasets for training speech recognition and synthesis systems, along with the problems and limitations of current machine sign language translation systems. Main results. The principal shortcomings and specific problems of current automatic machine sign language translation technologies are identified, and promising ways of addressing them are outlined. Particular attention is paid to the applicability of automatic machine sign language translation systems in real-world conditions. Discussion. The review shows the need for further research on data collection and annotation, and substantiates the development of new methods and neural network models, as well as innovative technologies for audio and video data processing, aimed at improving the quality and efficiency of existing automatic machine sign language translation systems.
Keywords: automatic speech recognition, speech synthesis, gesture recognition, gesture synthesis, automatic sign language translation, machine learning
Acknowledgments. The "Subject of research" section was supported by a state budget theme (no. FFZF-2022-0005); the remaining research was financially supported by the Russian Science Foundation (project no. 23-71-01056).
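As an illustration of the neural network pipeline the abstract describes (a convolutional feature extractor applied per video frame, followed by a transformer encoder over time), below is a minimal sketch in PyTorch for isolated sign gesture classification. This is not code from the article: the SignRecognizer class and all layer names and sizes are illustrative assumptions.

```python
# Minimal, hypothetical sketch of a CNN + transformer pipeline for isolated
# sign recognition; not the authors' implementation. Layer sizes are assumed.
import torch
import torch.nn as nn

class SignRecognizer(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 256):
        super().__init__()
        # Spatial features: a small CNN applied independently to each frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch*frames, 64, 1, 1)
        )
        self.proj = nn.Linear(64, d_model)
        # Temporal modeling: a transformer encoder over the frame sequence.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, 3, height, width)
        b, t = video.shape[:2]
        x = self.cnn(video.flatten(0, 1)).flatten(1)  # (b*t, 64)
        x = self.proj(x).view(b, t, -1)               # (b, t, d_model)
        x = self.temporal(x).mean(dim=1)              # average over time
        return self.head(x)                           # gesture class logits

# Usage example: a batch of two 16-frame RGB clips.
logits = SignRecognizer(num_classes=10)(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 10])
```

Real systems of the kind surveyed replace the toy CNN with a pretrained visual backbone and add sequence-level decoding (e.g., CTC or an encoder-decoder) for continuous signing, but the frame-encoder-plus-temporal-model structure is the common core.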