CN112487939A

CN112487939A - Pure vision light weight sign language recognition system based on deep learning

Info

Publication number: CN112487939A
Application number: CN202011349613.8A
Authority: CN
Inventors: 吴宗正; 李凌; 刘云云; 辜嘉
Original assignee: Shenzhen Relitaihe Life Technology Co ltd
Current assignee: Shenzhen Relitaihe Life Technology Co ltd
Priority date: 2020-11-26
Filing date: 2020-11-26
Publication date: 2021-03-12

Abstract

The invention belongs to the field of sign language recognition, and particularly discloses a pure visual light weight sign language recognition system based on deep learning. The invention can output sentences through operation only by inputting image data without additional information, has simple and efficient overall network structure and short training period, and is suitable for being deployed at a mobile terminal. The program can be operated by only one terminal device, so that the use convenience is greatly improved, and the large-area popularization is facilitated.

Description

Pure vision light weight sign language recognition system based on deep learning

Technical Field

The invention relates to the technical field of sign language recognition, in particular to a pure visual light-weight sign language recognition system based on deep learning.

Background

Sign language is an important communication mode between the deaf-mute and the hearing-aid person, and in order to promote the communication convenience between the deaf-mute and the hearing-aid person, it is particularly important to design a sign language recognition system capable of running in real time at a mobile terminal. However, the sign language has rich semantics, and the action amplitude has locality and detail compared with other human behaviors, and is influenced by illumination, background, motion speed and the like, so that the traditional pattern recognition and machine learning method is difficult to realize ideal precision and robustness. In addition, the sign language recognition algorithm in the laboratory environment with a large amount of computation is difficult to deploy in the mobile terminal and realize efficient operation because the hardware conditions of the mobile terminal include a CPU, a GPU, a memory, and the like.

In recent years, image-based deep learning methods have been increasingly successful in continuous phrase sign language recognition tasks. Continuous sentence sign language recognition requires establishing more reliable long-term timing dependencies. Generally, a bidirectional long-time and short-time memory network model is adopted to better model the context semantic information of the long-time sequence of the sign language. Compared with the complexity of the BLSTM network model, the continuous sign language recognition based on the 1-dimensional convolutional network model and the 3-dimensional convolutional network model avoids the complex modeling of the BLSTM network, and saves complex calculated amount on the basis of time sequence modeling. The conventional sign language sentence time sequence segmentation method is complex in process and high in misjudgment rate, and in recent years, scholars gradually bypass time sequence segmentation, introduce a time sequence alignment algorithm CTC in a voice recognition field into the sign language recognition field and achieve good effects.

The existing implementation scheme generally utilizes a bracelet or a glove with a sensor to collect information such as motion and position of a hand, transmits the information to a cloud, extracts hand word information from the information by the cloud through pattern recognition, machine learning or deep learning methods, and finally generates sentences. For the prior art scheme, due to the fact that additional hardware assistance is needed, the method is high in use cost, poor in use convenience, difficult to popularize in a large area, low in identification precision and poor in robustness. The recognition algorithm is complex and is difficult to be deployed on a mobile phone to run in real time.

Aiming at the problems, the invention provides a pure vision light weight sign language recognition system based on deep learning.

Disclosure of Invention

The invention aims to solve the defects in the prior art and provides a pure visual light sign language recognition system based on deep learning.

In order to achieve the purpose, the invention adopts the following technical scheme:

a pure visual light sign language recognition system based on deep learning comprises data acquisition, gesture feature extraction, time sequence feature extraction and sentence generation, wherein the data acquisition is used for acquiring a sign language video to be recognized and preprocessing images, the gesture feature extraction is used for acquiring gesture feature vectors from frames in the sign language video, the time sequence feature extraction is used for extracting sign language word information from a gesture feature vector sequence, and the sentence generation is used for combining all sign language word information into a text sentence according to context;

the identification system further comprises the following use steps:

s1, the application program opens a mobile phone camera to shoot and acquire the sign language video, or directly acquires the sign language video from a folder, and after clicking a start recognition button for a moment, the sign language recognition result is displayed on a screen;

s2, after a sign language video is obtained, firstly, an image sequence obtained by four-time down-sampling is used as source input of a sign language recognition model, an image sequence obtained by eight-time down-sampling is used as source input of a human body detection model, human body coordinates are predicted, then, a source input image is cut by taking a human body as a center and is zoomed to 224 pixels with high width and 224 pixels with high width, and finally, normalization is carried out, and data preparation is finished;

s3, in the gesture feature extraction part, firstly, the first feature extraction layer adopts a 2D convolution layer and a maximum pooling layer for zooming the image, which is beneficial to reducing the calculation amount, and the specific parameters are as follows: the convolution kernel size is 7x7, the step size is 2, the all-zero padding is 3, the channel is 64, the second feature extraction layer adopts two basic residual blocks, and the specific parameters are as follows: convolution kernel size 3x3, step 1, all-zero padding 1, channel 64, the third feature extraction layer uses two basic residual blocks, the specific parameters are: convolution kernel size 3x3, step 1, all-zero padding 1, channel 128, and the fourth feature extraction layer adopts two basic residual blocks, and the specific parameters are: the convolution kernel size is 3x3, the step size is 1, the all-zero padding is 1, the channel is 256, the fifth feature extraction layer adopts two basic residual blocks, and the specific parameters are as follows: convolution kernel size 3x3, step 1, all zero padding 1, channel 512. Finally, a global average pooling layer is followed, and the gesture feature extraction part finally outputs a series of feature vectors with the length of 512;

s4, in the time sequence feature extraction part, firstly accessing a 1D convolutional layer, then accessing a maximum pooling layer, and finally accessing a 1D convolutional layer;

s5, in the sentence generating part, a BLSTM layer is adopted, the obtained sign language word information is used as input, sign language sentence information is output according to the context environment, the sign language sentence information is mapped to a prediction space through a full connection layer, and finally a prediction result can be obtained through CTC beam search decoding.

In another aspect, the present invention provides a mobile phone including the above-mentioned pure visual light sign language recognition system based on deep learning.

In another aspect, the invention provides a tablet computer, which includes the above-mentioned pure visual light sign language recognition system based on deep learning.

In another aspect, the present invention provides a PC computer including the above-mentioned pure visual light sign language recognition system based on deep learning.

In still another aspect, the present invention provides a server including the above-mentioned deep learning-based pure visual lightweight sign language recognition system.

Compared with the prior art, the invention has the beneficial effects that:

the invention ingeniously uses two 1DCNN layers as a short-distance time sequence extractor, thereby outputting sign language word information, and the invention has small operand and high operation speed.

The invention uses a BLSTM layer as a long-distance time sequence extractor, which can capture forward information and backward information, so that the output sentence information is more accurate and smooth. A full connection layer is accessed behind the BLSTM layer, a prediction result is directly output, and simplicity and high efficiency are achieved.

The invention can output sentences through operation only by inputting image data without additional information, has simple and efficient overall network structure and short training period, and is suitable for being deployed at a mobile terminal. The program can be operated by only one terminal device, so that the use convenience is greatly improved, and the large-area popularization is facilitated.

Drawings

Fig. 1 is a framework diagram of an application program in which a sign language recognition system is deployed at a mobile phone end in one embodiment.

FIG. 2 is a flow diagram of a sign language recognition system in one embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments.

Examples

Referring to fig. 1-2, the pure visual light weight sign language recognition system based on deep learning provided by the invention comprises data acquisition, gesture feature extraction, time sequence feature extraction and sentence generation, wherein the data acquisition is the acquisition of a sign language video to be recognized and the image preprocessing, the gesture feature extraction is the acquisition of gesture feature vectors from each frame in the sign language video, the time sequence feature extraction is the extraction of sign language word information from a gesture feature vector sequence, and the sentence generation is the combination of all sign language word information into a text sentence according to the context;

the identification system further comprises the following use steps:

s4, in the time sequence feature extraction part, firstly accessing a 1D convolution layer, then accessing a maximum pooling layer, and finally accessing a 1D convolution layer, wherein due to the effects of two layers of convolution and one layer of pooling, the 1DCNN is helpful for extracting short-distance time sequence features, so that sign language word information is output after passing through the 1DCNN layer;

The invention ingeniously uses two 1DCNN layers as a short-distance time sequence extractor, thereby outputting sign language word information, and the invention has small operand and high operation speed. The invention uses a BLSTM layer as a long-distance time sequence extractor, which can capture forward information and backward information, so that the output sentence information is more accurate and smooth. A full connection layer is accessed behind the BLSTM layer, a prediction result is directly output, and simplicity and high efficiency are achieved. The invention adopts the network structure of 2DCNN +1DCNN + BLSTM + CTC, and sentences can be output through operation only by inputting image data without additional information. The overall network structure is simple and efficient, the training period is short, and the method is suitable for being deployed at a mobile terminal. The program can be operated by only one terminal device with a camera, such as a mobile phone, so that the use convenience is greatly improved, and the method is favorable for large-area popularization.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art should be considered to be within the technical scope of the present invention, and the technical solutions and the inventive concepts thereof according to the present invention should be equivalent or changed within the scope of the present invention.

Claims

1. A pure visual light weight sign language recognition system based on deep learning is characterized by comprising data acquisition, gesture feature extraction, time sequence feature extraction and sentence generation, wherein the data acquisition is used for acquiring a sign language video to be recognized and preprocessing images, the gesture feature extraction is used for acquiring gesture feature vectors from frames in the sign language video, the time sequence feature extraction is used for extracting sign language word information from a gesture feature vector sequence, and the sentence generation is used for combining all sign language word information into a text sentence according to context;

the identification system further comprises the following use steps:

2. A handset comprising the deep learning based pure visual lightweight sign language recognition system of claim 1.

3. A tablet computer comprising the deep learning based pure visual lightweight sign language recognition system of claim 1.

4. A PC computer comprising the deep learning based pure visual lightweight sign language recognition system of claim 1.

5. A server comprising the deep learning based pure visual lightweight sign language recognition system of claim 1.