CN113642422B - Continuous Chinese sign language recognition method - Google Patents

Continuous Chinese sign language recognition method

Info

Publication number
CN113642422B
CN113642422B
Authority
CN
China
Prior art keywords
video
word
sign language
autoencoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110848023.8A
Other languages
Chinese (zh)
Other versions
CN113642422A (en)
Inventor
马乐
吴晓越
黄仰来
李志伟
徐东甫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Electric Power University
Original Assignee
Northeast Dianli University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Dianli University filed Critical Northeast Dianli University
Priority to CN202110848023.8A priority Critical patent/CN113642422B/en
Publication of CN113642422A publication Critical patent/CN113642422A/en
Application granted granted Critical
Publication of CN113642422B publication Critical patent/CN113642422B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a continuous Chinese sign language recognition method, which comprises the following steps: acquiring video data of a sign language demonstrator; performing ROI (Region of Interest) processing on the video; constructing an autoencoder and inputting the processed video into it to obtain a feature vector for each frame of the video; inputting the processed video into a key frame identification module to identify key frames; generating a time-series-based attention curve for each word from the obtained key frame information; fusing each obtained attention curve with the feature vectors generated by the autoencoder and inputting the result into a long short-term memory network to obtain a regression result for the video segment corresponding to each word in the video; and, after all video segments have been recognized, combining the recognized words to complete recognition of the semantics of the continuous sign language video. The method effectively segments continuous video and trains word by word, can recognize each word in a video, avoids separately training sentences that contain the same word, and effectively recognizes continuous sign language with different vocabulary combinations.

Description

Continuous Chinese sign language recognition method
Technical Field
The invention belongs to the technical field of information processing, and particularly relates to a continuous Chinese sign language recognition method.
Background
Sign language, as the primary means of communication for hearing-impaired people, plays a major role in daily life. However, because most hearing people in China have not learned Chinese sign language, hearing-impaired people face many difficulties in expressing their needs. Continuous sign language is the most common form in which hearing-impaired people express semantics; with its coherent spatial actions and intuitive, easily understood meaning, it has high research value for real-world society.
Existing continuous sign language recognition methods comprise the following three steps: (1) training a recognition model by continuously expanding the training data samples of various semantics and extracting features from the videos; (2) extracting features from the sign language video demonstrated by the presenter to be recognized; (3) inputting the extracted features of the presenter's sign language video into the model for classification, and outputting the classification result as the recognition result. However, the vocabulary combinations of continuous sign language are highly varied, and such methods can only recognize the semantics of the fixed vocabulary combinations present in the training samples, making it difficult to adapt to the diversity of continuous sign language.
Therefore, how to recognize continuous sign language with different vocabulary combinations more effectively is an urgent problem to be solved.
Disclosure of Invention
In view of the above, the invention provides a continuous sign language recognition method that takes the key frames of the video to be recognized as cues and, assisted by an attention mechanism between the key frames, accurately recognizes the semantics of each word in continuous sign language, thereby meeting the diversity requirement of the different vocabulary combinations of continuous sign language.
The application provides a continuous Chinese sign language recognition method, whose recognition flow is shown in FIG. 1. The method comprises:
acquiring video data of a sign language demonstrator;
performing ROI (Region of Interest) processing on the video;
inputting the processed video into an autoencoder to obtain a feature vector for each frame of the video;
inputting the processed video into a key frame identification module to identify key frames;
generating a time-series-based attention curve for each word from the obtained key frame information;
fusing each obtained attention curve with the feature vectors generated by the autoencoder and inputting the result into a long short-term memory (LSTM) network to obtain a regression result for the video segment corresponding to each word in the video;
and closely matching the regression result against the word vectors to obtain and output the final semantic result.
The video data of the sign language demonstrator is acquired with an RGB color camera, and the ROI processing amplifies the limb motion information of the performer in the video to obtain more distinct motion features.
Optical flow is applied to the processed video to locate the pauses within the continuous actions, and these pauses are taken as key frames.
The autoencoder is designed based on a convolutional neural network (CNN) to obtain a feature vector of the motion information in each frame image.
The feature vectors of the frames are arranged in video-frame order to obtain a group of vectors that closely describes the continuous motion features of the video.
The attention curve is designed with a Gaussian function, and the attention curve of each video segment between key frames is generated from the obtained key frame information.
Based on the obtained attention curve and the per-frame feature vectors, the features of the corresponding word in the video are amplified while the features of the other words in the same video are suppressed, so that each word in the video can be recognized individually.
Each word is recognized by regressing over the video's feature vectors; the result of the network regression is closely matched against the word vectors, and the semantic vocabulary corresponding to the best-matching word vector is output.
Preferably, the method uses a long short-term memory network to recognize the action features of a group of feature vectors corresponding to a word with temporal and spatial continuity, avoiding the preprocessing step of unifying the length of the video data that traditional recognition methods require.
Preferably, the method achieves recognition of different word combinations by amplifying, segmenting and recognizing the features of the continuous video.
According to the technical scheme provided by the invention, continuous sign language can be recognized effectively by combining video key frame screening with an attention mechanism. In addition, during training of the recognition model, inputting continuous sign language video achieves the same effect as training on isolated sign language videos, which markedly reduces the labor of manually annotating videos while achieving efficient continuous sign language recognition.
Drawings
For a clearer description of the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of the recognition process of the continuous Chinese sign language recognition method according to an embodiment of the present invention;
FIG. 2 is a graph of the attention curve generation effect;
FIG. 3 is a flowchart of the training process of the continuous Chinese sign language recognition method according to an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and the detailed description. It is apparent that the described embodiments are only some, not all, embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
A flowchart of the training process of the continuous Chinese sign language recognition method is shown in FIG. 3. The method provided by the invention uses an RGB color camera to acquire video data of the performer, and the acquired video data is classified and stored according to its semantics.
Before training, the ROI processing stage detects the demonstrator's face and uses its position in the video and its size relative to the whole frame to crop a fixed window covering the upper torso and the arm-waving area, exploiting the fact that most Chinese sign language is demonstrated in front of the upper torso.
In addition, the video after this feature amplification still contains irrelevant information such as the background image and surrounding environment. Foreground segmentation is therefore used to separate the performer from the environment, further amplifying the limb action features of the sign language performer.
Because of differences in the performers' builds and their distances from the camera, the videos produced by ROI processing vary in size; to improve the accuracy of subsequent feature extraction and recognition, the processed images are resized to a uniform 224 × 224.
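By way of illustration only, this ROI step could be sketched as follows in Python with OpenCV; the Haar cascade, the window proportions derived from the detected face box, and the fallback when no face is found are assumptions of this sketch, not values specified by the patent:

    import cv2

    # Hypothetical ROI step: face detection, fixed-window crop of the upper
    # torso and arm area, then resizing to the uniform 224 x 224 input size.
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def roi_crop(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            return cv2.resize(frame, (224, 224))            # assumed fallback
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
        # Fixed window scaled from the face position and size
        # (the proportions are illustrative assumptions).
        top = max(0, y - h // 2)
        left = max(0, x - 2 * w)
        right = min(frame.shape[1], x + 3 * w)
        bottom = min(frame.shape[0], y + 4 * h)
        return cv2.resize(frame[top:bottom, left:right], (224, 224))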
During training, the autoencoder is trained independently: by encoding and decoding the ROI-processed video data, each video frame passed through the encoder yields a feature vector Tf that closely describes its motion features, and the representational power of the feature vector is judged by observing whether the decoder can restore the video frame from it.
Both the encoder and the decoder of the autoencoder are designed with convolutional neural networks (CNNs).
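A minimal sketch of such an encoder-decoder pair, assuming PyTorch; the channel counts and the dimension of the feature vector Tf are illustrative assumptions:

    import torch.nn as nn

    class ConvAutoencoder(nn.Module):
        def __init__(self, feat_dim=256):
            super().__init__()
            self.encoder = nn.Sequential(      # 3 x 224 x 224 -> feat_dim
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 28 * 28, feat_dim),
            )
            self.decoder = nn.Sequential(      # feat_dim -> 3 x 224 x 224
                nn.Linear(feat_dim, 64 * 28 * 28),
                nn.Unflatten(1, (64, 28, 28)),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            tf = self.encoder(x)               # per-frame feature vector Tf
            return self.decoder(tf), tf

Training would minimize a reconstruction loss (for example nn.MSELoss) between the decoder output and the input frame; how faithfully the decoder restores the frame indicates how well Tf characterizes it.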
During demonstration, there is a distinct pause between the sign language actions of consecutive words, in which the range of hand motion is smaller than during an action. The invention therefore uses an optical flow method to compute the optical flow between temporally adjacent frames and to build an optical flow vector histogram over the image area.
When building the optical flow vector histogram, since every optical flow exhibits tiny changes, a threshold is applied to the optical flow vector value of each pixel to prevent these tiny changes from affecting the statistics.
After the optical flow vector histogram has been computed, the key frame segments marking the transition between two words in the sign language video are determined by a threshold, and the middle frame of each key frame segment is stored as a key frame.
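A minimal sketch of this key frame step, assuming OpenCV's Farneback optical flow; for brevity the sketch sums the thresholded flow magnitudes instead of building a full histogram, and both threshold values are assumptions:

    import cv2

    PIXEL_THRESH = 0.5     # assumed per-pixel threshold for tiny changes
    PAUSE_THRESH = 1000.0  # assumed total-motion threshold for a pause

    def find_key_frames(frames):
        """frames: list of grayscale images; returns key frame indices."""
        pause_runs, run = [], []
        for i in range(1, len(frames)):
            flow = cv2.calcOpticalFlowFarneback(
                frames[i - 1], frames[i], None, 0.5, 3, 15, 3, 5, 1.2, 0)
            mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            mag[mag < PIXEL_THRESH] = 0.0   # suppress tiny flow changes
            if mag.sum() < PAUSE_THRESH:    # little motion: part of a pause
                run.append(i)
            elif run:
                pause_runs.append(run)
                run = []
        if run:
            pause_runs.append(run)
        # the middle frame of each pause segment is stored as a key frame
        return [r[len(r) // 2] for r in pause_runs]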
μ and σ are, respectively, the index of the middle frame of the video segment corresponding to a segmented word and the distance from that middle frame to the segment boundary: μ is the index of the middle frame between two key frames, and σ is the number of frames from μ to the nearest key frame.
The obtained μ and σ values, together with the video frame length i, are input group by group into the attention mechanism module to generate attention curves W of length i; each (μ, σ) pair corresponds to one attention curve W.
The attention curve is obtained using a Gaussian function, defined as follows:
W(i) = a · exp(−(i − μ)² / (2σ²))
where a is an amplitude parameter, i is the index of the current video frame, and μ and σ are the middle-frame index of the video segment corresponding to a segmented word and the distance of that middle frame from the segment boundary. FIG. 2 shows intuitively how μ, σ and i relate to the generated attention curve and the video.
The obtained attention curves are each fused with the group of video feature vectors. As the different Gaussian distributions in FIG. 2 show, each attention curve strengthens the feature vectors over the video segment determined by its μ and σ values, so that each word becomes associated with its corresponding video segment.
Meanwhile, the length of each attention curve exactly matches the video frame length i, so no feature data is lost or truncated.
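A minimal sketch of attention curve generation and fusion, using NumPy; the amplitude a = 1 and the helper that derives μ and σ from a pair of key frame indices are assumptions of this sketch:

    import numpy as np

    def attention_curve(mu, sigma, length, a=1.0):
        """W(i) = a * exp(-(i - mu)^2 / (2 * sigma^2)) over frame indices."""
        i = np.arange(length)
        return a * np.exp(-((i - mu) ** 2) / (2.0 * sigma ** 2))

    def mu_sigma(k0, k1):
        """Derive mu and sigma for the word segment between key frames k0 < k1."""
        mu = (k0 + k1) // 2                    # index of the middle frame
        sigma = max(min(mu - k0, k1 - mu), 1)  # frames to the nearest key frame
        return mu, sigma

    def fuse(features, mu, sigma):
        """features: (num_frames, feat_dim) array of per-frame Tf vectors."""
        w = attention_curve(mu, sigma, len(features))
        return features * w[:, None]           # amplify one word, damp the rest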
Each attention curve, fused with the group of video feature vectors, is input into the recognition module for training. The recognition module is designed with a long short-term memory (LSTM) network; both the network output and the word vectors are sized 1 × 60 to simplify forming and computing the loss function, and the computed result completes the training of the neural network through back propagation.
The long short-term memory network is a time-recursive neural network suited to processing and predicting important events with relatively long intervals and delays in a time series. Continuous sign language recognition is a feature recognition problem over a continuous time series, and the contexts within continuous actions are clearly related; during training, the network can therefore retain previously learned content and relate subsequently learned content to it, which reduces the influence of varying video sequence lengths on learning and allows continuous sign language action features to be learned effectively.
Sentences are segmented with the jieba module in Python so that the words composing a sentence are separated one by one; the segmented words are then used to train a word2vec model from the Gensim module, and the trained model can generate the word vector corresponding to any input word.
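A minimal sketch of this word vector step; jieba and Gensim's word2vec are the tools the text names, while the placeholder sentences and min_count=1 are illustrative, and vector_size=60 matches the 1 × 60 word vectors described above:

    import jieba
    from gensim.models import Word2Vec

    sentence_labels = ["我去学校", "你好吗"]    # placeholder sentence labels
    corpus = [list(jieba.cut(s)) for s in sentence_labels]  # word segmentation
    model = Word2Vec(corpus, vector_size=60, min_count=1)   # 60-dim vectors
    vec = model.wv["学校"]                      # word vector for an input word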
Since the output of the network has the same dimensionality as the word vectors during training, MSELoss (the mean square error loss) is used to construct the loss function. MSELoss is calculated as follows:
loss(xᵢ, yᵢ) = (xᵢ − yᵢ)²
where x and y have the same dimensions, may be vectors or matrices, and i is an index.
The result of the loss function is used to adjust the parameters of the long short-term memory network through back propagation, so that the predicted value generated by the network approximates the corresponding word vector as closely as possible, that is, the value of the loss function continuously approaches 0. When the loss produced by every training sample is below 0.001, the network is considered to generate correct predictions; training is stopped and the network model is saved.
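A minimal sketch of the recognition module and its training loop, assuming PyTorch; the hidden size, optimizer, learning rate and the empty train_pairs placeholder are assumptions of this sketch:

    import torch
    import torch.nn as nn

    class WordRegressor(nn.Module):
        def __init__(self, feat_dim=256, hidden=128, word_dim=60):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, word_dim)  # output sized 1 x 60

        def forward(self, seq):                    # seq: (1, frames, feat_dim)
            out, _ = self.lstm(seq)
            return self.fc(out[:, -1])             # regress to a word vector

    model = WordRegressor()
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    train_pairs = []  # placeholder: (fused feature sequence, 1 x 60 target) pairs
    for epoch in range(1000):
        worst = 0.0
        for seq, word_vec in train_pairs:
            pred = model(seq)
            loss = criterion(pred, word_vec)       # mean square loss
            optimizer.zero_grad()
            loss.backward()                        # back propagation
            optimizer.step()
            worst = max(worst, loss.item())
        if worst < 0.001:                          # stopping rule from the text
            break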
In the recognition process, as in training, an RGB camera captures the performer's actions and ROI preprocessing is applied. Key information is extracted from the processed video and the video is segmented; the relevant information is input into the attention mechanism to generate the corresponding attention curves, which are fused with the video feature vector sequence and input into the network. Since the recognition module network used during recognition is the model completed during training, the input information directly yields the corresponding prediction. The prediction is then closely matched, that is, the standard word vector most similar to it is found, which recognizes the corresponding word in the sign language video. After all video segments have been recognized, the recognized words are combined, completing the semantic recognition of the continuous sign language video.
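A minimal sketch of the close matching step; cosine similarity is an assumption here, since the text does not name the similarity measure used to find the most similar standard word vector:

    import numpy as np

    def close_match(pred, word_vectors):
        """word_vectors: dict mapping each vocabulary word to its standard vector."""
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        return max(word_vectors, key=lambda w: cos(pred, word_vectors[w]))

    # sentence = "".join(close_match(p, word_vectors) for p in per_segment_preds)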
The method effectively segments continuous videos and trains word by word; it can recognize each word in a video, avoids separately training sentences that contain the same word, and effectively recognizes continuous sign language with different vocabulary combinations.
The foregoing embodiments are readily understood by those skilled in the art, and various modifications to them will be readily apparent. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall therefore fall within the scope of the present invention.

Claims (5)

1. A continuous sign language recognition method, the method comprising:
acquiring video data of a sign language demonstrator;
performing ROI (Region of Interest) processing on the video;
inputting the processed video into an autoencoder to obtain a feature vector for each frame of the video;
inputting the processed video into a key frame identification module, and identifying key frames and their information by an optical flow method, including: building an optical flow vector histogram over the image area; determining, by a threshold, the key frame segments of transition between two words in the sign language video; storing the middle frame of each key frame segment as a key frame; and obtaining, from the positions of two key frames, the middle-frame index of the video segment corresponding to a segmented word and the distance of that middle frame from the segment boundary;
generating a time-series-based attention curve for each word from the obtained key frame information;
designing the time-series-based attention curve with a Gaussian function, taking the middle-frame index as the mean of the Gaussian function and the distance from the segment boundary as its standard deviation, fusing the curve with the feature vectors generated by the autoencoder, and inputting the result into a long short-term memory network to obtain a regression result for the video segment corresponding to each word in the video;
and closely matching the regression result against the word vectors to obtain and output the final semantic result.
2. The method according to claim 1, wherein the ROI processing is performed on the motion feature information of the video images, comprising the steps of:
Step 1: cropping the human body contour image region from the acquired sign language video based on facial recognition;
Step 2: separating the performer from the environment by foreground segmentation, further amplifying the limb action feature information of the sign language performer.
3. The method according to claim 1, wherein the feature vectors are obtained by the autoencoder, comprising the steps of:
Step 1: disassembling the training samples frame by frame and inputting them into the autoencoder for training;
Step 2: the encoder in the autoencoder extracts image features and converts them into feature vectors, and the decoder restores the feature vectors into images; the autoencoder is trained by comparing each generated image with the original image, so that the generated image restores the input image as closely as possible;
Step 3: after training, the autoencoder model is saved; every image input into the autoencoder then yields a feature vector that uniquely represents it.
4. The method of claim 1, wherein the key frames and key frame information are identified by an optical flow method: an optical flow vector histogram is built over the image area; the key frame segments of transition between two words in the sign language video are determined by a threshold; the middle frame of each key frame segment is stored as a key frame; and the middle-frame index of the video segment corresponding to a segmented word and the distance of that middle frame from the segment boundary are obtained from the positions of two key frames.
5. The method of claim 1, wherein the neural network takes the fusion of the feature vectors and the attention curve as input, a loss function is formed from the predicted value output by the neural network and the word vector, and training is accomplished by back propagation, comprising the steps of:
Step 1: designing the attention curve with a Gaussian function, taking the obtained middle-frame index, the distance from the segment boundary and the video length as inputs to obtain an attention curve;
Step 2: fusing each attention curve with the feature vectors, inputting the result into the neural network for training word by word, computing the mean square loss function between the predicted value output by the network and the standard word vector, and adjusting the network parameters by back-propagating the computed result, thereby training the network model on the video segment corresponding to each word of the continuous video.
CN202110848023.8A 2021-07-27 2021-07-27 Continuous Chinese sign language recognition method Active CN113642422B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110848023.8A CN113642422B (en) 2021-07-27 2021-07-27 Continuous Chinese sign language recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110848023.8A CN113642422B (en) 2021-07-27 2021-07-27 Continuous Chinese sign language recognition method

Publications (2)

Publication Number Publication Date
CN113642422A CN113642422A (en) 2021-11-12
CN113642422B (en) 2024-05-24

Family

ID=78418474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110848023.8A Active CN113642422B (en) 2021-07-27 2021-07-27 Continuous Chinese sign language recognition method

Country Status (1)

Country Link
CN (1) CN113642422B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040088723A1 (en) * 2002-11-01 2004-05-06 Yu-Fei Ma Systems and methods for generating a video summary
US9981193B2 (en) * 2009-10-27 2018-05-29 Harmonix Music Systems, Inc. Movement based recognition and evaluation
EP2641401B1 (en) * 2010-11-15 2017-04-05 Huawei Technologies Co., Ltd. Method and system for video summarization
US11195057B2 (en) * 2014-03-18 2021-12-07 Z Advanced Computing, Inc. System and method for extremely efficient image and pattern recognition and artificial intelligence platform
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000311180A (en) * 1999-03-11 2000-11-07 Fuji Xerox Co Ltd Method for feature set selection, method for generating video image class statistic model, method for classifying and segmenting video frame, method for determining similarity of video frame, computer-readable medium, and computer system
CN102089616A (en) * 2008-06-03 2011-06-08 焕·J·郑 Interferometric defect detection and classification
CN101655859A (en) * 2009-07-10 2010-02-24 北京大学 Method for fast removing redundancy key frames and device thereof
WO2012078702A1 (en) * 2010-12-10 2012-06-14 Eastman Kodak Company Video key frame extraction using sparse representation
CN105005769A (en) * 2015-07-08 2015-10-28 山东大学 Deep information based sign language recognition method
CN106210444A (en) * 2016-07-04 2016-12-07 石家庄铁道大学 Kinestate self adaptation key frame extracting method
CN107027051A (en) * 2016-07-26 2017-08-08 中国科学院自动化研究所 A kind of video key frame extracting method based on linear dynamic system
CN107748761A (en) * 2017-09-26 2018-03-02 广东工业大学 A kind of extraction method of key frame of video frequency abstract
CN107784118A (en) * 2017-11-14 2018-03-09 北京林业大学 A kind of Video Key information extracting system semantic for user interest
CN108347625A (en) * 2018-03-09 2018-07-31 北京数码视讯软件技术发展有限公司 A kind of method and apparatus of TS Streaming Medias positioning
CN109409231A (en) * 2018-09-27 2019-03-01 合肥工业大学 Multiple features fusion sign Language Recognition Method based on adaptive hidden Markov
CN110019817A (en) * 2018-12-04 2019-07-16 阿里巴巴集团控股有限公司 A kind of detection method, device and the electronic equipment of text in video information
CN109871781A (en) * 2019-01-28 2019-06-11 山东大学 Dynamic gesture identification method and system based on multi-modal 3D convolutional neural networks
WO2020258661A1 (en) * 2019-06-26 2020-12-30 平安科技(深圳)有限公司 Speaking person separation method and apparatus based on recurrent neural network and acoustic features
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN110569823A (en) * 2019-09-18 2019-12-13 西安工业大学 sign language identification and skeleton generation method based on RNN
CN111158491A (en) * 2019-12-31 2020-05-15 苏州莱孚斯特电子科技有限公司 Gesture recognition man-machine interaction method applied to vehicle-mounted HUD
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111340005A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system
CN112241470A (en) * 2020-09-24 2021-01-19 北京影谱科技股份有限公司 Video classification method and system
CN112257513A (en) * 2020-09-27 2021-01-22 南京工业大学 Training method, translation method and system for sign language video translation model
CN112464816A (en) * 2020-11-27 2021-03-09 南京特殊教育师范学院 Local sign language identification method and device based on secondary transfer learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"A Novel Chinese Sign Language Recognition Method Based on Keyframe-Centered Clips";Shiliang Huang等;《IEEE SIGNAL PROCESSING LETTERS》;第25卷(第3期);第442-446页 *
"Dynamic Sign Language Recognition Using Gaussian Process Dynamical Models";Juan P. Velasquez等;《International Work-Conference on the Interplay Between Natural and Artificial Computation》;第491-500页 *
"Research on Dynamic Sign Language Recognition Based on Key Frame Weighted of DTW";ShengWei Zhang等;《International Conference on Multimedia Technology and Enhanced Learning》;第11-20页 *
"基于运动块及关键帧的人体动作识别";应锐等;《复旦学报(自然科学版)》;第53卷(第6期);第815-822页 *
"手语识别方法与技术综述";解启娜等;《计算机工程与应用》;第57卷(第18期);第1-12页 *

Also Published As

Publication number Publication date
CN113642422A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN106919903B (en) robust continuous emotion tracking method based on deep learning
Akmeliawati et al. Real-time Malaysian sign language translation using colour segmentation and neural network
NadeemHashmi et al. A lip reading model using CNN with batch normalization
CN111339837A (en) Continuous sign language recognition method
CN110210416B (en) Sign language recognition system optimization method and device based on dynamic pseudo tag decoding
CN111104884A (en) Chinese lip language identification method based on two-stage neural network model
Sharma et al. Vision-based sign language recognition system: A Comprehensive Review
CN114973412B (en) Lip language identification method and system
CN115187704A (en) Virtual anchor generation method, device, equipment and storage medium
CN115964638A (en) Multi-mode social data emotion classification method, system, terminal, equipment and application
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
Mistry et al. Indian sign language recognition using deep learning
Singh et al. Action recognition in dark videos using spatio-temporal features and bidirectional encoder representations from transformers
Abrar et al. Deep lip reading-a deep learning based lip-reading software for the hearing impaired
Vayadande et al. Lipreadnet: A deep learning approach to lip reading
CN113642422B (en) Continuous Chinese sign language recognition method
Silveira et al. SynLibras: A Disentangled Deep Generative Model for Brazilian Sign Language Synthesis
Chowdhury et al. Text Extraction through Video Lip Reading Using Deep Learning
Gavade et al. Facial Expression Recognition in Videos by learning Spatio-Temporal Features with Deep Neural Networks
Hallyal et al. Optimized recognition of CAPTCHA through attention models
CN111209433A (en) Video classification algorithm based on feature enhancement
Kumaragurubaran et al. Unlocking Sign Language Communication: A Deep Learning Paradigm for Overcoming Accessibility Challenges
Zhang et al. EvSign: Sign Language Recognition and Translation with Streaming Events
Manglani et al. Lip Reading Into Text Using Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant