CN113642422A - Continuous Chinese sign language recognition method - Google Patents
- Publication number
- CN113642422A (application number CN202110848023.8A)
- Authority
- CN
- China
- Prior art keywords
- video; word; sign language; self-encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/084 — Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a continuous Chinese sign language recognition method comprising the following steps: acquiring video data of a sign language presenter; performing region-of-interest (ROI) processing on the video; constructing a self-encoder and inputting the processed video into it to obtain a feature vector for each frame of the video; inputting the processed video into a key frame identification module to identify key frames; generating a time-based attention curve for each word from the obtained key frame information; fusing each attention curve with the feature vectors produced by the self-encoder and inputting the result into a long short-term memory network to obtain a regression result for the video segment corresponding to each word; and, once all video segments have been recognized, combining the recognized words to complete recognition of the semantics of the continuous sign language video. The method effectively segments the continuous video and trains word by word, recognizes each word in the video, avoids separate training on every sentence that contains the same word, and effectively recognizes continuous sign language with different word combinations.
Description
Technical Field
The invention belongs to the technical field of information processing, and in particular relates to a method for recognizing continuous Chinese sign language.
Background
Sign language is the main means of communication for hearing-impaired people and plays a large role in daily life. However, because most hearing people in China have never learned sign language, hearing-impaired people face many difficulties in expressing their needs. Continuous sign language is the most common form in which hearing-impaired people express semantics; its coherent spatial motion and intuitive, easily understood semantics give it high research value in society.
Existing continuous sign language recognition methods take the following three steps: (1) train a recognition model by continually expanding training samples of various semantics and extracting features from the videos; (2) extract features from the sign language video demonstrated by the presenter to be recognized; (3) input the extracted features into the model for classification and output the classification result as the recognition result. However, because continuous sign language combines vocabulary in many ways, such methods can only recognize the fixed vocabulary combinations present in the training samples and adapt poorly to the diversity of continuous sign language.
Therefore, how to recognize continuous sign language with different vocabulary combinations more effectively is an urgent problem.
Disclosure of Invention
In view of the above, the invention provides a continuous sign language recognition method that uses the key frames of the video to be recognized as cues and, aided by an attention mechanism spanning the key frames, accurately recognizes the semantics of each word in continuous sign language, thereby handling the diversity of vocabulary combinations in continuous sign language.
The application provides a continuous Chinese sign language recognition method; the recognition process is shown in FIG. 1 and comprises the following steps:
acquiring sign language presenter video data information;
performing ROI (region of interest) processing on the video;
inputting the processed video into a self-encoder to obtain a feature vector of each frame of the video;
inputting the processed video into a key frame identification module to identify key frames;
generating a time-based attention curve for each word from the obtained key frame information;
fusing the obtained attention curve with the feature vectors generated by the self-encoder and inputting the result into a long short-term memory network to obtain a regression result for the video segment corresponding to each word in the video;
and performing approximate matching between the regression result and the word vectors to obtain and output the final semantic result.
The video data of the sign language presenter is acquired with an RGB color camera; the ROI processing amplifies the performer's limb motion in the video to obtain more salient action features.
Optical flow is applied to the processed video to locate the pauses within the continuous motion, which are taken as key frames.
The self-encoder is designed on the basis of a convolutional neural network (CNN) to obtain a feature vector of the action information in each frame image.
Arranging the per-frame feature vectors in video-frame order yields a group of vectors that closely describes the video's continuous motion features.
The attention curve is designed with a Gaussian function: given the key frames and their related information, the Gaussian function generates the attention curve of the video segment between each pair of key frames.
Based on the obtained attention curve and the per-frame feature vectors of the video, the features of the corresponding word are amplified while the features of the other words in the same video are suppressed, enabling each word in the video to be recognized individually.
Each word is recognized by regressing on the video's feature vectors; the network's regression result is then matched against the word vectors for similarity, and the semantic vocabulary corresponding to the closest word vector is output.
Preferably, the method uses a long short-term memory network to recognize, over temporal and spatial continuity, the action features of the group of feature vectors corresponding to a word, which avoids the preprocessing step of traditional recognition methods that unifies the lengths of video data.
Preferably, the method achieves recognition of different word combinations by amplifying, segmenting, and recognizing the features of the continuous video.
According to the technical scheme provided by the invention, combining video key frame screening with an attention mechanism effectively recognizes continuous sign language. Moreover, during training the recognition model achieves the effect of training on isolated sign language videos while only continuous sign language videos are input, which greatly reduces the labor of manually annotating videos and yields efficient continuous sign language recognition.
Drawings
In order to illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a recognition process of a continuous Chinese sign language recognition method according to an embodiment of the present invention;
FIG. 2 is a graph of the effect of attention curve generation;
FIG. 3 is a flow chart of a training process of a continuous Chinese sign language recognition method according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are only some, not all, embodiments of the invention. All other embodiments obtained by those skilled in the art without creative effort based on these embodiments fall within the scope of the invention.
FIG. 3 shows the flow chart of the training process of the continuous Chinese sign language recognition method. The method uses an RGB color camera to acquire data from the performers, and the acquired video data is classified and stored according to meaning.
Before training, ROI processing must be applied to the video. The ROI step is based on face recognition: since most Chinese sign language is demonstrated with the upper half of the body, a fixed window covering the upper torso and the arm-swing area is cropped according to the position of the demonstrator's face in the video and the fraction of the frame area that the face occupies. This further amplifies the demonstrator's feature information.
After this feature amplification, irrelevant information such as the background image and the surrounding environment still remains in the video. Foreground segmentation is therefore used to separate the performer from the environment, further amplifying the sign language performer's body-motion feature information.
Because a performer's build and distance from the camera make the size of each ROI-processed video differ, the processed images are unified to 224 × 224 to improve the accuracy of subsequent feature extraction and recognition.
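The fixed-window interception and the 224 × 224 unification described above can be sketched as follows. This is a minimal illustration, not the patent's exact procedure: the window proportions (about three face-widths wide and four face-heights tall) and the nearest-neighbour resize are assumptions, and the face box is presumed to come from an external face detector.

```python
import numpy as np

def roi_crop(frame, face_box, out_size=224):
    """Crop a fixed window covering the upper torso and arm-swing area,
    scaled from the detected face box, then resize to out_size x out_size.
    Window proportions are illustrative guesses, not from the patent."""
    x, y, w, h = face_box                    # face bounding box (x, y, w, h)
    cx = x + w // 2
    left = max(cx - 3 * w // 2, 0)           # ~3 face-widths wide
    right = min(cx + 3 * w // 2, frame.shape[1])
    top = max(y - h // 2, 0)                 # from just above the head
    bottom = min(y + 4 * h, frame.shape[0])  # down ~4 face-heights
    crop = frame[top:bottom, left:right]
    # nearest-neighbour resize with plain numpy indexing
    ys = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)
    return crop[np.ix_(ys, xs)]
```

In practice the face box would come from a face detector and the resize from an image library, but the window arithmetic is the same.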
During training, the self-encoder is trained independently: the ROI-processed video data undergoes encode-decode training, so that a video frame passed through the encoder yields a feature vector Tf that closely describes its action features. The representational power of the feature vector is judged by whether the decoder can reconstruct the video frame from it.
Both the encoder and the decoder of the self-encoder are designed with convolutional neural networks (CNN).
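A minimal CNN self-encoder in this spirit might look as follows (PyTorch). The patent only states that the encoder and decoder are CNN-based, so the layer sizes and the 60-dimensional code here are assumptions, chosen to match the 1 × 60 vectors used later in the description.

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Illustrative CNN auto-encoder for 224x224 RGB frames; layer
    sizes and the 60-dim code are assumptions, not from the patent."""
    def __init__(self, code_dim=60):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=4), nn.ReLU(),   # 224 -> 56
            nn.Conv2d(16, 32, 4, stride=4), nn.ReLU(),  # 56 -> 14
            nn.Flatten(),
            nn.Linear(32 * 14 * 14, code_dim),          # frame -> Tf
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32 * 14 * 14), nn.ReLU(),
            nn.Unflatten(1, (32, 14, 14)),
            nn.ConvTranspose2d(32, 16, 4, stride=4), nn.ReLU(),  # 14 -> 56
            nn.ConvTranspose2d(16, 3, 4, stride=4), nn.Sigmoid() # 56 -> 224
        )

    def forward(self, x):
        code = self.encoder(x)            # feature vector Tf
        return self.decoder(code), code   # reconstruction + code
```

Training would minimize the reconstruction error between input and decoded frames, matching the "compare generated image with original image" criterion in the description.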
During demonstration there is a distinct pause between the sign movements of consecutive words, during which the hands move less than during the movements themselves. The optical flow between temporally adjacent frames is therefore computed by an optical flow method, and a histogram of the optical-flow vectors over the image area is accumulated.
Because every optical-flow vector fluctuates slightly, a threshold is set on the flow magnitude of each pixel so that these small fluctuations do not affect the statistics.
After the optical-flow histogram statistics are complete, a threshold identifies the key frame segment of the transition between two words in the sign language video, and the middle frame of that segment is stored as the key frame.
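The pause-detection step above can be sketched as follows, assuming the per-frame optical-flow statistics have already been reduced to one mean flow magnitude per frame. The threshold values are illustrative; the patent sets them empirically.

```python
import numpy as np

def find_key_frames(flow_mag, pause_thresh=0.5, min_len=3):
    """Given the mean optical-flow magnitude per frame, return the
    middle frame of each low-motion run (a pause between two signs).
    pause_thresh and min_len are illustrative, not from the patent."""
    low = flow_mag < pause_thresh
    keys, start = [], None
    for i, flag in enumerate(np.append(low, False)):  # sentinel ends last run
        if flag and start is None:
            start = i                                 # run of low motion begins
        elif not flag and start is not None:
            if i - start >= min_len:                  # long enough to be a pause
                keys.append((start + i - 1) // 2)     # middle frame of the pause
            start = None
    return keys
```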
μ and σ describe the video segment of a segmented word: μ is the index of the middle frame of the video segment between two key frames, and σ is the distance from that middle frame to the nearer key frame.
The obtained μ and σ values, together with the video frame length i, are input group by group to the attention mechanism module, which generates for each (μ, σ) pair an attention curve W of length i.
The attention curve is obtained with a Gaussian function, defined as follows:

W(i) = a · exp(−(i − μ)² / (2σ²))

where a is the amplitude parameter, i is the index of the current video frame, and μ and σ are, respectively, the index of the middle frame of the word's video segment and the distance of that frame from the segment boundary. FIG. 2 visualizes the relationship between μ, σ, i, the generated attention curve, and the video.
Each obtained attention curve is fused with the video feature vector group. As the different Gaussian distributions in FIG. 2 show, each attention curve enhances the feature vectors of the video segment determined by its μ and σ values, thereby associating the corresponding word with the corresponding video segment.
Moreover, because the length of the attention curve equals the video frame length i, no feature data is lost or truncated.
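The fusion step can be illustrated as a simple per-frame scaling of the feature vectors by the attention weights (the patent does not specify the fusion operator, so element-wise weighting is an assumption):

```python
import numpy as np

def fuse(features, curve):
    """Weight each frame's feature vector by its attention value.
    features: (num_frames, feat_dim); curve: (num_frames,)."""
    return features * curve[:, None]   # broadcast weight over feat_dim
```

Frames near the curve's peak keep their features nearly unchanged, while frames belonging to other words are attenuated toward zero.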
The fused attention curve and video feature vector group are input to the recognition module for training. The recognition module is designed as a long short-term memory (LSTM) network; both the network output and the word vectors are sized 1 × 60 to simplify forming and computing the loss function, and the computed result is back-propagated to complete the training of the neural network.
The long short-term memory network is a recurrent neural network suited to processing and predicting events separated by relatively long intervals and delays in a time series. Continuous sign language recognition is a feature recognition problem over a continuous time series, and the semantics of continuous sign language exhibit clear contextual relations within the continuous motion. During training the network therefore not only retains what it learned earlier but also relates subsequently learned content to it, which reduces the influence of differing video sequence lengths on learning and lets the network learn continuous sign language action features effectively.
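A minimal LSTM recognition head of this kind might be (PyTorch; the hidden size is an assumption, while the 60-dimensional output follows the 1 × 60 size stated above). Because the LSTM consumes the whole sequence and only its final state is regressed, videos of different lengths need no padding to a common length, as the description notes.

```python
import torch
import torch.nn as nn

class SignWordRecognizer(nn.Module):
    """LSTM regression head: a variable-length sequence of fused frame
    features in, a 1x60 word-vector prediction out. Hidden size 128 is
    an assumption; 60 matches the word-vector size in the description."""
    def __init__(self, feat_dim=60, hidden=128, word_dim=60):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, word_dim)

    def forward(self, x):          # x: (batch, num_frames, feat_dim)
        _, (h, _) = self.lstm(x)   # h[-1]: final hidden state
        return self.head(h[-1])    # (batch, word_dim) regression result
```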
Sentences are segmented with the jieba module in Python so that the words composing a sentence stand alone; the segmented words are then trained with the word2vec function of the Gensim module, and the trained model generates a word vector for each input word.
During training, because the network output has the same dimensionality as the word vectors, the loss function is constructed with MSELoss (mean squared error), computed per element as:

loss(x_i, y_i) = (x_i − y_i)²

where x and y have the same dimensions (vectors or matrices) and i is a subscript.
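The per-element loss above, averaged over a vector, is the usual mean squared error; a small sketch:

```python
import numpy as np

def mse_loss(pred, target):
    """Mean over all elements of (x_i - y_i)^2, as in the description."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))
```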
The result computed by the loss function is back-propagated to adjust the parameters of the long short-term memory network, so that the prediction generated by the network approaches the corresponding word vector as closely as possible, i.e., the value of the loss function steadily approaches 0. When the loss produced by every training sample is below 0.001, the network is considered to generate correct predictions; training then stops and the network model is saved.
The recognition process mirrors the training process: the RGB camera captures the performer's actions and ROI preprocessing is applied; the processed video yields the key information and is segmented; the relevant information is fed to the attention mechanism to generate the corresponding attention curves, which are fused with the video feature vector sequence and input to the network. Since the recognition module is the model already completed during training, it directly produces a prediction for the input. The prediction then undergoes approximate matching, i.e., the standard word vector most similar to it is found, recognizing the corresponding word in the sign language video. After all video segments are recognized, the recognized words are combined to complete recognition of the continuous sign language video's semantics.
The method thus effectively segments the continuous video and trains word by word, recognizes each word in the video, avoids separate training on every sentence that contains the same word, and effectively recognizes continuous sign language with different word combinations.
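The approximate-matching step, i.e. finding the standard word vector most similar to the network's prediction, can be sketched with cosine similarity (the description says only "close matching", so using cosine rather than Euclidean distance is an assumption):

```python
import numpy as np

def nearest_word(pred, vocab_vecs, vocab_words):
    """Return the vocabulary word whose standard word vector is most
    similar (by cosine similarity) to the network's 1x60 prediction."""
    v = pred / (np.linalg.norm(pred) + 1e-12)
    m = vocab_vecs / (np.linalg.norm(vocab_vecs, axis=1, keepdims=True) + 1e-12)
    return vocab_words[int(np.argmax(m @ v))]   # best-matching row
```

Running this once per recognized video segment and concatenating the results yields the sentence-level semantics described above.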
The above description of the disclosed embodiments enables those skilled in the art to understand the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art. Therefore, any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall be included in the protection scope of the invention.
Claims (5)
1. A continuous sign language recognition method, comprising the following steps:
acquiring sign language presenter video data information;
performing ROI (region of interest) processing on the video;
inputting the processed video into a self-encoder to obtain a feature vector of each frame of the video;
inputting the processed video into a key frame identification module to identify key frames;
generating a time-based attention curve for each word from the obtained key frame information;
fusing the obtained attention curve with the feature vectors generated by the self-encoder and inputting the result into a long short-term memory network to obtain a regression result for the video segment corresponding to each word in the video;
and performing approximate matching between the regression result and the word vectors to obtain and output the final semantic result.
2. The method of claim 1, wherein the ROI processing is performed on the motion feature information of the video images, comprising:
Step 1: intercepting the human-contour image region of the obtained sign language video based on face recognition;
Step 2: separating the performer from the environment by foreground segmentation, further amplifying the sign language performer's body-motion feature information.
3. The method of claim 1, wherein the feature vectors are obtained with a self-encoder, implemented as follows:
Step 1: decomposing the training samples frame by frame and inputting them into the self-encoder for training;
Step 2: the encoder in the self-encoder extracts image features and converts them into feature vectors, the decoder restores the image from the feature vectors, and the self-encoder is trained by comparing the generated image with the original image so that the generated image reproduces the input image as closely as possible;
Step 3: after training, the self-encoder model is saved; an image input to the self-encoder then yields a feature vector that uniquely represents the image.
4. The method of claim 1, wherein the key frames and related information are obtained by an optical flow method: the histogram of optical-flow vectors over the image area is accumulated; a threshold identifies the key frame segment of the transition between two words in the sign language video; the middle frame of that segment is stored as the key frame; and from the positions of two key frames the related information is obtained, namely the index of the middle frame of a segmented word's video segment and the distance of that frame from the segment boundary.
5. The method of claim 1, wherein the neural network takes the feature vectors and the attention curve as input, forms a loss function from the network's predicted value and the word vector, and completes training by back propagation, comprising:
Step 1: designing the attention curve with a Gaussian function, taking the related data information obtained in claim 4 and the video length as input to obtain the attention curve;
Step 2: fusing the attention curve with the feature vectors and inputting the result word by word into the neural network for training; computing the mean squared loss between the predicted value output by the network and the standard word vector; and adjusting the network parameters by back-propagating the computed result, thereby training the network model on the video segment corresponding to each word of the continuous video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110848023.8A CN113642422A (en) | 2021-07-27 | 2021-07-27 | Continuous Chinese sign language recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113642422A true CN113642422A (en) | 2021-11-12 |
Family
ID=78418474
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110848023.8A Pending CN113642422A (en) | 2021-07-27 | 2021-07-27 | Continuous Chinese sign language recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113642422A (en) |
Citations (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000311180A (en) * | 1999-03-11 | 2000-11-07 | Fuji Xerox Co Ltd | Method for feature set selection, method for generating video image class stastic model, method for classifying and segmenting video frame, method for determining similarity of video frame, computer-readable medium, and computer system |
US20040088723A1 (en) * | 2002-11-01 | 2004-05-06 | Yu-Fei Ma | Systems and methods for generating a video summary |
CN101655859A (en) * | 2009-07-10 | 2010-02-24 | 北京大学 | Method for fast removing redundancy key frames and device thereof |
CN102089616A (en) * | 2008-06-03 | 2011-06-08 | 焕·J·郑 | Interferometric defect detection and classification |
US20120123780A1 (en) * | 2010-11-15 | 2012-05-17 | Futurewei Technologies, Inc. | Method and system for video summarization |
US20120143358A1 (en) * | 2009-10-27 | 2012-06-07 | Harmonix Music Systems, Inc. | Movement based recognition and evaluation |
WO2012078702A1 (en) * | 2010-12-10 | 2012-06-14 | Eastman Kodak Company | Video key frame extraction using sparse representation |
CN105005769A (en) * | 2015-07-08 | 2015-10-28 | 山东大学 | Deep information based sign language recognition method |
CN106210444A (en) * | 2016-07-04 | 2016-12-07 | 石家庄铁道大学 | Kinestate self adaptation key frame extracting method |
CN107027051A (en) * | 2016-07-26 | 2017-08-08 | 中国科学院自动化研究所 | A kind of video key frame extracting method based on linear dynamic system |
CN107748761A (en) * | 2017-09-26 | 2018-03-02 | 广东工业大学 | A kind of extraction method of key frame of video frequency abstract |
CN107784118A (en) * | 2017-11-14 | 2018-03-09 | 北京林业大学 | A kind of Video Key information extracting system semantic for user interest |
US20180204111A1 (en) * | 2013-02-28 | 2018-07-19 | Z Advanced Computing, Inc. | System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform |
CN108347625A (en) * | 2018-03-09 | 2018-07-31 | 北京数码视讯软件技术发展有限公司 | A kind of method and apparatus of TS Streaming Medias positioning |
CN109409231A (en) * | 2018-09-27 | 2019-03-01 | 合肥工业大学 | Multiple features fusion sign Language Recognition Method based on adaptive hidden Markov |
CN109871781A (en) * | 2019-01-28 | 2019-06-11 | 山东大学 | Dynamic gesture identification method and system based on multi-modal 3D convolutional neural networks |
CN110019817A (en) * | 2018-12-04 | 2019-07-16 | 阿里巴巴集团控股有限公司 | A kind of detection method, device and the electronic equipment of text in video information |
CN110399850A (en) * | 2019-07-30 | 2019-11-01 | 西安工业大学 | A kind of continuous sign language recognition method based on deep neural network |
CN110569823A (en) * | 2019-09-18 | 2019-12-13 | 西安工业大学 | sign language identification and skeleton generation method based on RNN |
CN111158491A (en) * | 2019-12-31 | 2020-05-15 | 苏州莱孚斯特电子科技有限公司 | Gesture recognition man-machine interaction method applied to vehicle-mounted HUD |
US20200184278A1 (en) * | 2014-03-18 | 2020-06-11 | Z Advanced Computing, Inc. | System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform |
CN111325099A (en) * | 2020-01-21 | 2020-06-23 | 南京邮电大学 | Sign language identification method and system based on double-current space-time diagram convolutional neural network |
CN111340005A (en) * | 2020-04-16 | 2020-06-26 | 深圳市康鸿泰科技有限公司 | Sign language identification method and system |
WO2020258661A1 (en) * | 2019-06-26 | 2020-12-30 | 平安科技(深圳)有限公司 | Speaking person separation method and apparatus based on recurrent neural network and acoustic features |
CN112241470A (en) * | 2020-09-24 | 2021-01-19 | 北京影谱科技股份有限公司 | Video classification method and system |
CN112257513A (en) * | 2020-09-27 | 2021-01-22 | 南京工业大学 | Training method, translation method and system for sign language video translation model |
CN112464816A (en) * | 2020-11-27 | 2021-03-09 | 南京特殊教育师范学院 | Local sign language identification method and device based on secondary transfer learning |
Patent Citations (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000311180A (en) * | 1999-03-11 | 2000-11-07 | Fuji Xerox Co Ltd | Method for feature set selection, method for generating video image class stastic model, method for classifying and segmenting video frame, method for determining similarity of video frame, computer-readable medium, and computer system |
US20040088723A1 (en) * | 2002-11-01 | 2004-05-06 | Yu-Fei Ma | Systems and methods for generating a video summary |
CN102089616A (en) * | 2008-06-03 | 2011-06-08 | 焕·J·郑 | Interferometric defect detection and classification |
CN101655859A (en) * | 2009-07-10 | 2010-02-24 | 北京大学 | Method for fast removing redundancy key frames and device thereof |
US20120143358A1 (en) * | 2009-10-27 | 2012-06-07 | Harmonix Music Systems, Inc. | Movement based recognition and evaluation |
US20120123780A1 (en) * | 2010-11-15 | 2012-05-17 | Futurewei Technologies, Inc. | Method and system for video summarization |
WO2012078702A1 (en) * | 2010-12-10 | 2012-06-14 | Eastman Kodak Company | Video key frame extraction using sparse representation |
US20180204111A1 (en) * | 2013-02-28 | 2018-07-19 | Z Advanced Computing, Inc. | System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform |
US20200184278A1 (en) * | 2014-03-18 | 2020-06-11 | Z Advanced Computing, Inc. | System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform |
CN105005769A (en) * | 2015-07-08 | 2015-10-28 | 山东大学 | Depth-information-based sign language recognition method |
CN106210444A (en) * | 2016-07-04 | 2016-12-07 | 石家庄铁道大学 | Motion-state-adaptive key frame extraction method |
CN107027051A (en) * | 2016-07-26 | 2017-08-08 | 中国科学院自动化研究所 | Video key frame extraction method based on linear dynamic systems |
CN107748761A (en) * | 2017-09-26 | 2018-03-02 | 广东工业大学 | Key frame extraction method for video summarization |
CN107784118A (en) * | 2017-11-14 | 2018-03-09 | 北京林业大学 | Video key information extraction system oriented to user-interest semantics |
CN108347625A (en) * | 2018-03-09 | 2018-07-31 | 北京数码视讯软件技术发展有限公司 | Method and apparatus for TS streaming media positioning |
CN109409231A (en) * | 2018-09-27 | 2019-03-01 | 合肥工业大学 | Multi-feature fusion sign language recognition method based on adaptive hidden Markov models |
CN110019817A (en) * | 2018-12-04 | 2019-07-16 | 阿里巴巴集团控股有限公司 | Method, apparatus and electronic device for detecting text in video information |
CN109871781A (en) * | 2019-01-28 | 2019-06-11 | 山东大学 | Dynamic gesture recognition method and system based on multimodal 3D convolutional neural networks |
WO2020258661A1 (en) * | 2019-06-26 | 2020-12-30 | 平安科技(深圳)有限公司 | Speaker separation method and apparatus based on recurrent neural networks and acoustic features |
CN110399850A (en) * | 2019-07-30 | 2019-11-01 | 西安工业大学 | Continuous sign language recognition method based on deep neural networks |
CN110569823A (en) * | 2019-09-18 | 2019-12-13 | 西安工业大学 | Sign language recognition and skeleton generation method based on RNN |
CN111158491A (en) * | 2019-12-31 | 2020-05-15 | 苏州莱孚斯特电子科技有限公司 | Gesture-recognition human-machine interaction method for vehicle-mounted HUD |
CN111325099A (en) * | 2020-01-21 | 2020-06-23 | 南京邮电大学 | Sign language recognition method and system based on a dual-stream spatio-temporal graph convolutional neural network |
CN111340005A (en) * | 2020-04-16 | 2020-06-26 | 深圳市康鸿泰科技有限公司 | Sign language recognition method and system |
CN112241470A (en) * | 2020-09-24 | 2021-01-19 | 北京影谱科技股份有限公司 | Video classification method and system |
CN112257513A (en) * | 2020-09-27 | 2021-01-22 | 南京工业大学 | Training method, translation method and system for sign language video translation model |
CN112464816A (en) * | 2020-11-27 | 2021-03-09 | 南京特殊教育师范学院 | Local sign language identification method and device based on secondary transfer learning |
Non-Patent Citations (5)
Title |
---|
JUAN P. VELASQUEZ et al.: "Dynamic Sign Language Recognition Using Gaussian Process Dynamical Models", International Work-Conference on the Interplay Between Natural and Artificial Computation, pages 491-500 * |
SHENGWEI ZHANG et al.: "Research on Dynamic Sign Language Recognition Based on Key Frame Weighted of DTW", International Conference on Multimedia Technology and Enhanced Learning, pages 11-20 * |
SHILIANG HUANG et al.: "A Novel Chinese Sign Language Recognition Method Based on Keyframe-Centered Clips", IEEE Signal Processing Letters, vol. 25, no. 3, pages 442-446 * |
YING Rui et al.: "Human Action Recognition Based on Motion Blocks and Key Frames", Journal of Fudan University (Natural Science), vol. 53, no. 6, pages 815-822 * |
XIE Qina et al.: "A Survey of Sign Language Recognition Methods and Technologies", Computer Engineering and Applications, vol. 57, no. 18, pages 1-12 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Akmeliawati et al. | Real-time Malaysian sign language translation using colour segmentation and neural network | |
Oliver et al. | Layered representations for human activity recognition | |
Yang et al. | Sign language spotting with a threshold model based on conditional random fields | |
Dabre et al. | Machine learning model for sign language interpretation using webcam images | |
Nimisha et al. | A brief review of the recent trends in sign language recognition | |
NadeemHashmi et al. | A lip reading model using CNN with batch normalization | |
Sharma et al. | Vision-based sign language recognition system: A Comprehensive Review | |
CN111354246A (en) | System and method for helping deaf-mute people communicate | |
De Coster et al. | Machine translation from signed to spoken languages: State of the art and challenges | |
Gogate et al. | Real time emotion recognition and gender classification | |
CN116129013A (en) | Method, device and storage medium for generating virtual human animation video | |
CN115964638A (en) | Multimodal social data emotion classification method, system, terminal, device and application | |
CN114694255A (en) | Sentence-level lip reading method based on channel attention and a temporal convolutional network | |
Abrar et al. | Deep lip reading-a deep learning based lip-reading software for the hearing impaired | |
Mistree et al. | Towards Indian sign language sentence recognition using INSIGNVID: Indian sign language video dataset | |
CN116564338B (en) | Speech animation generation method, apparatus, electronic device and medium | |
Tewari et al. | Real Time Sign Language Recognition Framework For Two Way Communication | |
Shokoori et al. | Sign language recognition and translation into pashto language alphabets | |
CN114882590B (en) | Lip reading method based on event-camera multi-granularity spatio-temporal feature perception | |
Avula et al. | CNN based recognition of emotion and speech from gestures and facial expressions | |
CN113642422A (en) | Continuous Chinese sign language recognition method | |
CN112135200B (en) | Video description generation method for compressed videos | |
CN112926665A (en) | Text line recognition system based on domain adaptation and method of use | |
Vayadande et al. | LipReadNet: A Deep Learning Approach to Lip Reading | |
Chanda et al. | Automatic hand gesture recognition with semantic segmentation and deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||