doi: 10.17586/2226-1494-2024-24-2-241-248
ViSL One-shot: generating Vietnamese sign language data set
Article in English
For citation:
Dang Khanh, Bessmertny I.A. ViSL One-shot: generating Vietnamese sign language data set. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 2, pp. 241–248. doi: 10.17586/2226-1494-2024-24-2-241-248
Abstract
Automatic recognition of objects in a video stream, and sign language recognition in particular, requires large amounts of video data for training. An established way to enrich training data for machine learning is to apply distortion and noise. What distinguishes linguistic gestures from other gestures is that a small change in posture can radically change the meaning of a gesture, which imposes specific requirements on data variability. The novelty of the proposed method is that, instead of distorting frames with affine image transformations, the signer's pose is vectorized and noise is then added in the form of random deviations of the skeletal elements. To implement controlled gesture variability, the pose is converted with the MediaPipe library into a vector format in which each vector corresponds to a skeletal element; the image of the figure is then restored from this vector representation. The advantage of this approach is the possibility of controlled distortion of gestures that corresponds to real deviations in the signer's posture. The developed video data enrichment method was tested on a set of 60 words of Indian Sign Language (common to the languages and dialects spoken across India), represented by 782 video fragments. For each word, the most representative gesture was selected and 100 variations were generated; the remaining, less representative gestures were used as test data. The resulting word-level classification and recognition model based on a GRU-LSTM neural network achieves an accuracy above 95 %. The validated method was then applied to a corpus of 4364 Vietnamese Sign Language videos covering all three regions of Vietnam: Northern, Central, and Southern. This produced 436,400 data samples, 100 per word meaning, which can be used to develop and improve Vietnamese sign language recognition methods by generating many gesture variations with varying degrees of deviation from the reference gestures. A limitation of the proposed method is that its accuracy depends on the error of the MediaPipe library. The created video dataset can also be used for automatic sign language translation.
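To make the augmentation step concrete, the following is a minimal Python sketch of the idea described in the abstract: each skeletal element is treated as a vector, given a small random rotation and length deviation, and the figure is rebuilt from the perturbed vectors. Landmark extraction (e.g., with MediaPipe Pose) is assumed to have already happened; the skeleton topology, noise magnitudes, and the function name augment_pose are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

# Illustrative skeleton: (parent, child) index pairs forming a tree
# rooted at joint 0 and ordered root-first. The paper uses MediaPipe's
# landmark topology; this small tree is an assumption for the sketch.
BONES = [
    (0, 1), (1, 2), (2, 3),   # e.g., shoulder -> elbow -> wrist chain
    (0, 4), (4, 5), (5, 6),   # mirrored chain for the other arm
]

def augment_pose(landmarks, angle_std=0.05, length_std=0.02, rng=None):
    """Perturb a 2D pose by rotating/scaling each skeletal vector.

    landmarks  : (N, 2) array of joint coordinates.
    angle_std  : std. dev. of the random rotation per bone (radians).
    length_std : std. dev. of the relative bone-length jitter.
    """
    rng = rng or np.random.default_rng()
    out = landmarks.copy()
    for parent, child in BONES:                        # walk the tree root-first
        bone = landmarks[child] - landmarks[parent]    # original skeletal vector
        theta = rng.normal(0.0, angle_std)             # small random deviation
        c, s = np.cos(theta), np.sin(theta)
        bone = np.array([[c, -s], [s, c]]) @ bone      # rotate the bone
        bone *= 1.0 + rng.normal(0.0, length_std)      # jitter its length
        out[child] = out[parent] + bone                # restore the figure
    return out

# Example: 100 variations of one reference frame, as in the paper's setup.
pose = np.random.rand(7, 2)                            # stand-in for real landmarks
variations = [augment_pose(pose) for _ in range(100)]
```

Computing each bone from the original landmarks while chaining the reconstruction through already-perturbed parents keeps the deviations local to each skeletal element while letting them accumulate naturally along a limb, which matches the controlled-variability goal stated above.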
Keywords: Vietnamese sign language, Indian Sign Language, sign language recognition, MediaPipe, coordinate transformation, vector space, random noise, GRU-LSTM, one-shot, data augmentation
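The abstract names a GRU-LSTM network for word-level classification but does not specify the framework or layer sizes. The sketch below shows one plausible shape of such a model in Keras; the layer widths, dropout rate, and input dimensions are all assumptions for illustration.

```python
from tensorflow.keras import layers, models

def build_gru_lstm(seq_len, n_features, n_classes):
    """Word-level classifier over per-frame pose vectors.

    A GRU layer feeding an LSTM layer, as named in the abstract;
    all sizes here are illustrative assumptions.
    """
    model = models.Sequential([
        layers.Input(shape=(seq_len, n_features)),  # frames x pose vector
        layers.GRU(128, return_sequences=True),     # per-frame features
        layers.LSTM(64),                            # whole-sequence summary
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g., 60 ISL words, 30-frame clips, 66 pose coordinates per frame
model = build_gru_lstm(seq_len=30, n_features=66, n_classes=60)
```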