CN110175551B - Sign language recognition method - Google Patents

Sign language recognition method

Info

Publication number
CN110175551B
Authority
CN
China
Prior art keywords
layer
pooling
convolution
sign language
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910426216.7A
Other languages
Chinese (zh)
Other versions
CN110175551A (en)
Inventor
张淑军
张群
李辉
王传旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Science and Technology filed Critical Qingdao University of Science and Technology
Priority to CN201910426216.7A priority Critical patent/CN110175551B/en
Publication of CN110175551A publication Critical patent/CN110175551A/en
Application granted granted Critical
Publication of CN110175551B publication Critical patent/CN110175551B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language

Abstract

The invention discloses a sign language recognition method, which comprises the following steps: performing frequency-domain transformation on the video sequence corresponding to a sign language video to obtain the phase information of the images; feeding the phase information and the video sequence into a C3D convolutional neural network for a first convolution and fusing the results to form feature information; and feeding the feature information into a deep convolutional neural network for a second convolution and pooling, executing an adaptive learning pooling algorithm during pooling, screening out target feature vectors, and feeding the target feature vectors into a fully connected layer to output the classification result. The method incorporates frequency-domain transformation into the deep learning algorithm: the phase information in the sign language video is extracted by frequency-domain transformation and, as an aid to the RGB spatial information, is fed into the deep learning network to generate the sign language features, so that the obtained features are more essential and accurate. By adding the adaptive learning pooling algorithm to the pooling layers of the 3D convolutional neural network model, more abstract, higher-level video features in the sign language video can be mined and a more accurate classification result obtained.

Description

Sign language recognition method
Technical Field
The invention belongs to the technical field of video recognition, and particularly relates to a method for sign language semantic recognition.
Background
In an era of rapid development of computer technology, human-computer interaction technology has received wide attention and produced notable research results; it mainly includes facial expression recognition, action recognition, sign language recognition and the like. Sign language is the main mode of communication between deaf-mute people and hearing people, but most hearing people have never received sign language training and, apart from understanding a few simple, common gestures, cannot genuinely grasp what deaf-mute people intend to express, which makes communication between the two groups difficult. Meanwhile, sign language recognition can also be applied to assist the education and teaching of disabled people, so as to support their normal life and learning.
Traditional sign language recognition methods require the deaf-mute person to wear a data glove fitted with multiple sensors; the glove collects the trajectory of the wearer's limb movements, and understandable semantics are generated from the trajectory information. At present, most behavior recognition methods designed on the original 3D convolutional neural network model achieve low accuracy for sign language recognition on small data sets, involve a large amount of computation, are prone to overfitting, and generalize poorly.
Chinese patent application No. CN107506712A discloses a human behavior recognition method based on a 3D deep convolutional network, which improves the standard 3D convolutional network C3D and introduces multi-stage pooling, so that features can be extracted from video segments of arbitrary resolution and duration to obtain the final classification result. However, the C3D convolutional network used in this method has a shallow structure, its recognition accuracy on larger data sets is low, and it is difficult for it to extract the optimal feature information.
Chinese patent application No. CN107679491A discloses a 3D convolutional neural network sign language recognition method that fuses multi-modal data: features are extracted from gesture infrared images and contour images in both the spatial and temporal dimensions, and the outputs of the two networks, which operate on different data formats, are fused for the final sign language classification. However, the network input requires infrared and contour images additionally captured with a motion-sensing device, the input data are more complicated to process, and the method recognizes fine-grained behaviors with large motion amplitude poorly.
Chinese patent application No. CN104281853A discloses a behavior recognition method based on a 3D convolutional neural network that feeds optical flow information into the network as multi-channel data for feature extraction and finally performs behavior classification through the fully connected layer; the whole procedure is divided into an offline training stage and an online recognition stage. The method enables online recognition, but it places high demands on the data set, requires optical flow information, involves complex computation and has low recognition efficiency.
Disclosure of Invention
The invention aims to provide a sign language recognition method that addresses the poor feature extraction and low recognition accuracy of existing sign language recognition methods.
In order to solve the technical problems, the invention adopts the following technical scheme:
A sign language recognition method, comprising the following process:
forming a video sequence X from the sign language video;
performing frequency-domain-transform-based image processing on the video sequence X and extracting the phase information;
respectively feeding the phase information and the video sequence X into a C3D convolutional neural network for a first convolution, and performing weighted fusion on the features obtained after convolution to form the fused feature information;
and feeding the fused feature information into a 3D ResNets deep convolutional neural network for a second convolution and pooling, executing an adaptive learning pooling algorithm during pooling, screening out the target feature vectors, feeding the target feature vectors into the fully connected layer of the 3D ResNets deep convolutional neural network, and outputting the classification result.
Compared with the prior art, the invention has the following advantages and positive effects. The sign language recognition method incorporates frequency-domain transformation into the deep learning algorithm: the phase information in the sign language video is extracted by frequency-domain transformation and fed into the deep learning algorithm to generate feature information, so that the obtained feature information is more essential and accurate. In addition, the 3D convolutional neural network model is improved by adding the adaptive learning pooling algorithm to the pooling layers of the network model, so that more abstract, higher-level video features in the sign language video can be mined, a more accurate classification result is obtained, and the accuracy of sign language recognition is significantly improved.
Other features and advantages of the present invention will become more apparent from the detailed description of the embodiments of the present invention when taken in conjunction with the accompanying drawings.
Drawings
In order to illustrate the technical solution in the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the invention; a person skilled in the art could derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of an embodiment of a sign language identification method proposed by the present invention;
FIG. 2 is a block diagram of one embodiment of a 3D ResNets deep convolutional neural network;
FIG. 3 is a diagram of an example of the dimension reduction of a feature matrix using an adaptive learning pooling algorithm.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The sign language recognition method of this embodiment mainly comprises two stages:
(1) Feature encoding stage based on frequency-domain transformation
Frequency-domain transformation is combined with deep learning: the phase information in the sign language video is extracted by the frequency-domain transformation; the phase information and the sign language video data are then respectively fed into a C3D convolutional neural network for a first convolution, and the features obtained after convolution are weighted and fused to form the fused feature information.
(2) Feature decoding stage based on the improved 3D ResNets deep convolutional neural network
The fused feature information formed in the first stage is fed into the improved deep convolutional neural network (3D ResNets), and a second convolution is performed on the temporal information at different time positions using convolution kernels of different scales; then the adaptive learning pooling algorithm proposed in this embodiment is used to reduce the dimension of the feature matrix obtained by the second convolution, so that more abstract, higher-level target feature vectors are screened out and fed into the fully connected layer to obtain a more accurate classification result.
The specific process of the sign language identification method of the present embodiment is described in detail below with reference to fig. 1.
S1, forming a video sequence X according to a sign language video;
In this process, the following steps can be specifically designed:
S101, frame-cutting the sign language video;
The original RGB data of the sign language video are sliced into N image frames, with N preferably equal to or greater than 34. Given the characteristics of the Chinese Sign Language data set, in which the sign language video corresponding to each semantic item is short, cutting each sign language video into 34 frames is most suitable for that data set.
S102, preprocessing an image frame;
Considering that in each sign language video the first and last frames are usually still or background frames, a data preprocessing step is preferably performed after frame cutting to preliminarily screen out the useful image frames, the so-called key frames, in order to reduce the amount of computation in the subsequent steps. As a preferred embodiment, among the N image frames generated by frame cutting, the first f frames and the last f frames are removed as redundant frames and only the middle image frames are kept as key frames. Preferably, f is 5 or less.
For the Chinese sign language data set, the first 5 frames and the last 5 frames of the 34 cut image frames can be removed, and the middle 24 frames are reserved as key frames.
S103, equally dividing the key frame into n segments according to a time sequence;
As a preferred embodiment, n = 3, i.e. the preprocessed key frames are divided equally into three segments in temporal order.
S104, randomly selecting continuous m image frames from each segment to form a video sequence X;
In this embodiment, it is preferable to randomly select 8 consecutive image frames from each segment to form the video sequence X = (x_1, x_2, ..., x_n), where x_i represents the m image frames in the i-th segment, i = 1, 2, ..., n.
If the 34 image frames generated by frame cutting are not preprocessed, i.e. the redundant frames are not removed, then 11 consecutive image frames can be randomly selected from each segment to form the video sequence X.
Of course, when the number of image frames generated by frame cutting is greater than 34, or the number of key frames remaining after removing the redundant frames is greater than 24, or the key frames are divided in temporal order into fewer than 3 segments, more than 8 consecutive image frames can be randomly selected from each segment to form the video sequence X.
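By way of illustration only, steps S101 to S104 can be sketched in Python as follows; the function name, its default arguments and the use of NumPy arrays for the decoded frames are assumptions of the sketch rather than part of the method:

    import numpy as np

    def build_video_sequence(frames, f=5, n=3, m=8, rng=None):
        """Build the video sequence X of steps S101-S104 from a list of decoded RGB frames."""
        rng = rng or np.random.default_rng()
        # S102: drop the first f and last f frames, keep the middle key frames
        key_frames = frames[f:len(frames) - f]
        # S103: split the key frames into n equal segments in temporal order
        seg_len = len(key_frames) // n
        segments = [key_frames[i * seg_len:(i + 1) * seg_len] for i in range(n)]
        # S104: randomly pick m consecutive frames x_i from each segment
        X = []
        for seg in segments:
            start = rng.integers(0, len(seg) - m + 1)
            X.append(np.stack(seg[start:start + m]))
        return X  # X = (x_1, ..., x_n), each x_i of shape (m, H, W, 3)

For a 34-frame Chinese Sign Language clip this yields three segments of 8 key frames each; calling it with f=0 and m=11 corresponds to the unpreprocessed case mentioned above.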
S2, performing image processing based on frequency domain transformation on the video sequence X, and extracting image phase information;
Among the various frequency-domain transform algorithms, the Gabor transform, compared with the Fourier transform, has better locality, orientation selectivity and band-pass characteristics, and stronger resistance to interference. Meanwhile, for the sign language recognition task, when the spatial position within a video frame changes, the amplitude of the Gabor features changes relatively little, whereas the phase changes at a certain rate with the position; relative to the amplitude, therefore, the Gabor phase information better represents the abstract characteristics of the behavior itself and is of greater significance.
In summary, in this embodiment, in combination with the characteristics of the sign language video, the Gabor transform in the frequency domain transform is preferably adopted to extract the phase information of the video sequence X, so that not only can all information of the signal be provided as a whole, but also information of the intensity of signal change in any local time can be provided, and optimization of the behavior characteristics of the sign language is achieved. Since there are many methods for calculating Gabor Phase information, in principle, the combination of these methods and a deep learning network falls within the scope of the present invention, but in order to reduce the data dimension and the amount of computation, it is preferable in this embodiment to use a Local Gabor Phase Difference Pattern (LGPDP) proposed in the literature [ Guo Y, xu Z, local Gabor Phase Difference Pattern for Face Recognition, the 19th International Conference on Pattern Recognition, ieee,2008 1-4] to extract Phase information after Gabor transformation of an image frame. Of course, other LGPDP-based improvement algorithms are equally applicable.
S3, respectively feeding the video sequence X and the extracted phase information into a C3D convolutional neural network for the first convolution;
In this embodiment, the video sequence X and the extracted phase information are preferably first fed into a conventional C3D convolutional neural network model and subjected to one convolution pass, generating the feature information after the first convolution.
S4, performing weighted fusion on the feature information obtained after the first convolution to form the fused feature information;
In this embodiment, a conventional weighted fusion algorithm may be adopted to fuse the feature information output by the C3D convolutional neural network, forming the fused feature matrix.
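A trivial sketch of such a weighted fusion in PyTorch is shown below; the fixed weight alpha is an assumption and could equally be made a learnable parameter:

    import torch

    def weighted_fusion(feat_rgb, feat_phase, alpha=0.6):
        """Weighted fusion of two first-convolution feature maps of shape (batch, C, T, H, W)."""
        return alpha * feat_rgb + (1.0 - alpha) * feat_phase

    # Example with dummy feature maps
    fused = weighted_fusion(torch.randn(1, 64, 8, 56, 56), torch.randn(1, 64, 8, 56, 56))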
S5, feeding the fused feature information into the 3D ResNets deep convolutional neural network for the second convolution and pooling, so as to screen out the target feature vectors;
In order to obtain more accurate video features, this embodiment improves the 3D ResNets deep convolutional neural network by introducing an adaptive learning pooling algorithm based on a weighted cross covariance matrix, which reduces the dimension of the feature matrix obtained by convolution and screens out more abstract, higher-level target feature vectors.
As a preferred embodiment, a 19-layer 3D ResNets deep convolutional neural network is preferably adopted, which comprises 1 data input layer, 8 3D convolutional layers with convolution kernels of different scales, 8 pooling layers and two fully connected layers. As shown in FIG. 2, the 8 3D convolutional layers and the 8 pooling layers are preferably interleaved, wherein:
C1-C8 are the 8 3D convolutional layers; the convolution kernel of each 3D convolutional layer is 3 × 3, and the number of convolution kernels increases from 64 to 512 in sequence, so that more kinds of high-level features can be generated from combinations of low-level features. After each convolutional layer, the two streams of information undergo feature fusion at the convolutional layer;
S1-S8 are the 8 pooling layers; each pooling layer performs dimension reduction using the adaptive learning pooling algorithm. The second pooling layer S2, the sixth pooling layer S6, the seventh pooling layer S7 and the eighth pooling layer S8 use 2 × 2 windows to down-sample the temporal and spatial dimensions simultaneously, while the other pooling layers S1, S3, S4 and S5 use 1 × 2 windows to down-sample only the spatial dimension.
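By way of illustration only, the interleaved convolution/pooling stack can be sketched in PyTorch as follows. The exact channel progression from 64 to 512, the interpretation of the 3 × 3 kernels as 3 × 3 × 3 in three dimensions, the reading of the 2 × 2 and 1 × 2 windows as (time, height, width) triples, and the use of standard max pooling in place of the adaptive learning pooling are all assumptions of the sketch; the residual (skip) connections of 3D ResNets are omitted:

    import torch.nn as nn

    # Assumed channel progression for the eight 3D convolutional layers C1-C8 (64 -> 512)
    CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 512, 512]
    # Pooling windows per layer S1-S8 as (T, H, W); S2, S6, S7, S8 down-sample time and space,
    # the remaining layers down-sample space only
    POOL_WINDOWS = [(1, 2, 2), (2, 2, 2), (1, 2, 2), (1, 2, 2),
                    (1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]

    def build_backbone(in_channels=3):
        """Skeleton of the interleaved conv/pool stack (residual connections omitted)."""
        layers, c_in = [], in_channels
        for c_out, window in zip(CONV_CHANNELS, POOL_WINDOWS):
            layers += [
                nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),  # 3D convolution layer
                nn.BatchNorm3d(c_out),                             # BN after each 3D conv layer
                nn.ELU(inplace=True),                              # ELU activation
                nn.MaxPool3d(kernel_size=window),  # stand-in for the adaptive learning pooling
            ]
            c_in = c_out
        return nn.Sequential(*layers)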
In the 3D convolutional layers of this embodiment, it is preferable to perform the second convolution on the temporal information at different time positions with convolution kernels of different scales, and then aggregate the convolution features of each time position along the time dimension, thereby reducing the amount of computation of the network structure. As a preferred embodiment, a 1 × 1 convolution kernel may first be used to reduce the dimension of the feature matrix fed in through the data input layer, which helps reduce the model parameters and normalizes the dimensions of the different features. The temporal information at different time positions is then convolved with convolution kernels of different scales; for example, 3 × 3 and 5 × 5 kernels are chosen to convolve the high-level and low-level video features respectively. The convolution results of each time position are then weighted and fused to form the aggregated feature matrix, which is fed into the pooling layer for adaptive feature pooling.
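A rough PyTorch sketch of this multi-scale second convolution and aggregation is given below; the 1 × 1, 3 × 3 and 5 × 5 kernels are treated as 3D kernels, the channel sizes are placeholders, and the softmax-normalized learnable fusion weights merely stand in for the weighted fusion described above:

    import torch
    import torch.nn as nn

    class MultiScaleTemporalConv(nn.Module):
        """1x1 channel reduction, then 3x3 and 5x5 convolutions fused with learnable weights."""
        def __init__(self, in_ch, mid_ch, out_ch):
            super().__init__()
            self.reduce = nn.Conv3d(in_ch, mid_ch, kernel_size=1)           # dimension reduction
            self.conv3 = nn.Conv3d(mid_ch, out_ch, kernel_size=3, padding=1)
            self.conv5 = nn.Conv3d(mid_ch, out_ch, kernel_size=5, padding=2)
            self.w = nn.Parameter(torch.tensor([0.5, 0.5]))                  # fusion weights

        def forward(self, x):                  # x: (batch, C, T, H, W)
            x = self.reduce(x)
            y3, y5 = self.conv3(x), self.conv5(x)
            w = torch.softmax(self.w, dim=0)   # keep the fusion weights normalized
            return w[0] * y3 + w[1] * y5       # aggregated feature matrix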
In this embodiment, the pooling algorithm executed by each pooling layer is improved and an adaptive learning pooling algorithm is proposed. As shown in FIG. 3, the cross covariance matrix of the aggregated feature matrix is first computed, and a dimension-reduction operation is then performed on the resulting cross covariance matrix to obtain the feature vector up to the current time; next, the importance of each frame is obtained, the pooled feature vector of each frame is evaluated and assigned a weight according to its importance, and the feature vector with the largest weight is selected as the target feature vector.
The specific procedure of the adaptive learning pooling algorithm proposed in this embodiment is as follows:
S501, from the feature matrix F_n obtained after convolution and fusion in the 3D convolutional layers, compute the cross covariance matrix Q_n of F_n;
S502, apply a conventional pooling algorithm to the cross covariance matrix Q_n to perform pooling dimension reduction and form a dimension-reduced feature vector;
S503, denote the dimension-reduced feature vector at the time of frame t, and compute the importance β_{t+1} of the dimension-reduced feature vector at the time of frame t+1, where f_p is a prediction function in the perceptron algorithm and φ(x_{t+1}) denotes the dimension-reduced feature vector from frame 1 to frame t+1 of the video sequence X;
S504, compute the weight ω that the feature vector at the time of frame t+1 should satisfy according to its importance;
S505, repeat steps S503-S504 to compute the weight of the feature vector at every frame time;
S506, rank the weights of the per-frame feature vectors computed in step S505 in descending order; the higher the weight, the more useful information the frame contains;
S507, select the feature vector with the largest weight as the target feature vector and feed it into the fully connected layer.
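Since the importance and weight formulas of steps S503 and S504 appear only in the drawings, the following PyTorch sketch is a rough stand-in rather than a faithful implementation: a single linear layer plays the role of the perceptron prediction function f_p, a softmax produces normalized weights ω, and the cross covariance computation and its dimension reduction are likewise simplified:

    import torch
    import torch.nn as nn

    class AdaptiveLearningPool(nn.Module):
        """Rough sketch of steps S501-S507 over per-frame feature matrices."""
        def __init__(self, feat_dim):
            super().__init__()
            self.f_p = nn.Linear(feat_dim, 1)            # stand-in for the prediction function f_p

        def forward(self, feats):                        # feats: (T, C, D), one feature matrix per frame
            # S501: cross covariance of each frame's (mean-centered) feature matrix
            centered = feats - feats.mean(dim=-1, keepdim=True)
            q = centered @ centered.transpose(1, 2)      # (T, C, C)
            # S502: pooling dimension reduction of Q_n to one vector per frame (placeholder reduction)
            v = q.mean(dim=-1)                           # (T, C)
            # S503-S505: importance beta_t and normalized weights omega_t for every frame
            beta = self.f_p(v).squeeze(-1)               # (T,)
            omega = torch.softmax(beta, dim=0)
            # S506-S507: keep the feature vector with the largest weight as the target vector
            return v[torch.argmax(omega)]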
In this embodiment, the data fed into each 3D convolutional layer is a feature matrix, and after convolution and pooling a target feature vector is obtained from each pooling layer. The target feature vectors obtained from the pooling layers are respectively fed into the fully connected layers to obtain a more accurate classification result. In order to prevent problems such as gradient explosion or vanishing in the deep network, it is preferable to add a BN layer after each 3D convolutional layer and to perform a dropout operation at each fully connected layer.
S6, feeding the screened target feature vectors into the fully connected layers to obtain the final classification result;
the 3D ResNets deep convolutional neural network of the present embodiment preferably designs two fully connected layers, as shown in FIG. 2. Wherein, the first and the second end of the pipe are connected with each other,
FC1 is the first fully-connected layer, preferably containing 512 neurons, connected to 512 neurons in the FC1 layer by the eigenvectors output by the eighth pooling layer S8, where they are converted into 512-dimensional eigenvectors; using a Dropout layer between the eighth pooling layer S8 and the first fully-connected layer FC1, discarding a part of the neural network units with a probability of 0.5, and freezing a part of the connection of the eighth pooling layer S8 and the first fully-connected layer FC1 with a probability of 0.1 using a migration learning algorithm;
FC2 is a second fully-connected layer, and is also a dense output layer, including the same number of neurons as the classification result, for example, the number of neurons is 6; and each neuron in the second full connection layer FC2 is fully connected with 512 neurons in the first full connection layer FC1, and finally, the classification is carried out through the Softmax regression of the classifier, and the classification result of the belonged sign language class is output.
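For illustration, the FC1/FC2 head described above can be sketched in PyTorch as follows; the input dimension of FC1 and the six-class output follow the example values in the text, and folding Softmax into the head rather than into the loss is a choice of the sketch:

    import torch.nn as nn

    def build_classifier(in_features=512, num_classes=6):
        """Dropout(0.5) -> FC1 (512 neurons, ELU) -> FC2 (dense output) -> Softmax."""
        return nn.Sequential(
            nn.Dropout(p=0.5),               # drop units between pooling layer S8 and FC1
            nn.Linear(in_features, 512),     # FC1: 512 neurons
            nn.ELU(inplace=True),
            nn.Linear(512, num_classes),     # FC2: one neuron per sign language class
            nn.Softmax(dim=1),
        )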
As a preferred embodiment, in the 3D ResNets deep convolutional neural network, the 3D convolutional layers and the first fully connected layer FC1 preferably use ELU as the activation function to improve the performance of the deep network. The second fully connected layer FC2 preferably uses Softmax as the activation function, the optimization function is preferably the SGD function, and the loss function is preferably the sum of the multi-class cross-entropy function and the error of the adaptive learning pooling algorithm; that is, the loss function can be expressed as:
L(X, Y) = l_cro(x, y) + μ·l_B(τ);
where L(X, Y) is the loss function, l_cro(x, y) is the multi-class cross-entropy function, l_B(τ) is the error of the adaptive learning pooling algorithm, and μ is a hyperparameter. Since the loss function, the multi-class cross-entropy function and the pooling-algorithm error are prior art, the meaning of the relevant parameters in each term of the above formula is well known to those skilled in the art and is not described in detail in this embodiment.
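A minimal sketch of this combined loss, assuming the pooling-algorithm error l_B(τ) is supplied as a scalar tensor and taking an arbitrary value for the hyperparameter μ:

    import torch.nn.functional as F

    def total_loss(logits, targets, pooling_error, mu=0.1):
        """L(X, Y) = l_cro(x, y) + mu * l_B(tau); mu = 0.1 is an assumed value."""
        l_cro = F.cross_entropy(logits, targets)   # multi-class cross entropy on raw class scores
        return l_cro + mu * pooling_error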
The classification result output by the fully connected layer of the 3D ResNets deep convolutional neural network is thus the recognized sign language meaning.
The sign language recognition method of this embodiment can be divided into a training stage and a testing stage. The training stage follows steps S1 to S6 above. Before training, the whole network structure is first weight-initialized, preferably using the public benchmark behavior recognition data set Kinetics to initialize the weights of the 3D ResNets deep convolutional neural network, so that the initialization is well suited to the native sign language recognition task. During training, a transfer learning strategy is then applied to the whole network structure: the convolutional layers are frozen and the last fully connected layer continues to be trained, so that the final classification result is more accurate. Further, the initial learning rate is set to 0.001 and is reduced by a factor of ten step by step as training proceeds, until it stops changing about 2000 iterations before the total number of iterations is completed; the accuracy likewise stabilizes gradually around 2000 iterations before the end of training. The momentum is set to 0.9; after thirty thousand iterations the network model is loaded one final time and the testing stage begins.
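For illustration, the optimizer and learning-rate schedule described above might be configured as follows; the step size of the decay schedule and the exact set of frozen parameters are assumptions of the sketch:

    import torch.optim as optim

    def configure_training(model, fc_layer):
        """Freeze the backbone, fine-tune the last FC layer with SGD (lr 0.001, momentum 0.9)."""
        for p in model.parameters():        # freeze the pretrained backbone for transfer learning
            p.requires_grad = False
        for p in fc_layer.parameters():     # keep training the last fully connected layer
            p.requires_grad = True
        optimizer = optim.SGD(fc_layer.parameters(), lr=0.001, momentum=0.9)
        scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=0.1)
        return optimizer, scheduler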
In the testing stage, a Chinese sign language data set can be selected as a data source, and all testing processes are carried out on the data set.
The sign language recognition method incorporates frequency-domain transformation into the deep learning algorithm: Gabor phase information, which has good discriminative power, assists the RGB spatial information of the sign language video, and combining the extracted phase information with the deep learning process yields more essential and accurate sign language behavior features. The improved 19-layer deep convolutional neural network mines more abstract, higher-level video features from the original video; convolution kernels of different scales capture the video-level features at different time positions, which reduces the amount of computation, makes full use of the original information in the video, and adapts better to sign language recognition against complex backgrounds. Finally, the adaptive learning pooling algorithm reduces the dimension of the feature matrix obtained by convolution, yielding a more accurate classification result and improving the accuracy of sign language recognition.
Of course, the above embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A sign language identification method, comprising:
cutting frames of the sign language video;
equally dividing an image frame corresponding to the sign language video into n segments according to a time sequence;
randomly selecting m consecutive image frames from each segment to form a video sequence X = (x_1, x_2, ..., x_n), wherein x_i represents the m image frames in the i-th segment;
performing image processing based on frequency domain transformation on the video sequence X, and extracting phase information;
respectively sending the phase information and the video sequence X into a C3D convolution neural network for primary convolution, and performing weighted fusion on the features obtained after the convolution to form fused feature information;
sending the fused feature information into a 3D ResNets deep convolutional neural network for a second convolution and pooling;
from the feature matrix F_n generated after the second convolution, obtaining the cross covariance matrix Q_n of F_n;
performing pooling dimension reduction on the cross covariance matrix Q_n to form a dimension-reduced feature vector;
denoting the dimension-reduced feature vector at the time of frame t, and calculating the importance β_{t+1} of the dimension-reduced feature vector at the time of frame t+1, wherein f_p is a prediction function in a perceptron algorithm and φ(x_{t+1}) represents the dimension-reduced feature vector up to frame t+1 of the video sequence X;
calculating the weight ω that the feature vector at the time of frame t+1 should satisfy according to the importance β_{t+1};
calculating the weight of the feature vector of each frame time, and selecting the feature vector with the maximum weight as a target feature vector;
and sending the target feature vector to a fully connected layer of the 3D ResNets deep convolutional neural network, and outputting a classification result.
2. The sign language identification method according to claim 1, wherein in the process of forming the video sequence X, specifically comprising:
each sign language video is cut into N frames, N is more than or equal to 34, the front f frame and the rear f frame are taken as redundant frames to be removed, the middle key frame is reserved, and f is less than or equal to 5;
equally dividing the middle key frame into three segments according to the time sequence;
randomly selecting at least 8 continuous image frames from each segment to form the video sequence X.
3. The sign language recognition method of claim 1, wherein in extracting the phase information based on the frequency domain transform, a Gabor transform is used to extract the phase information of the image frame.
4. The sign language recognition method according to any one of claims 1 to 3, wherein in the 3D ResNets deep convolutional neural network, a 3D convolutional layer performs the second convolution on the temporal information at different time positions using convolution kernels of different scales, then aggregates the convolution features of each time position along the time dimension to form the feature matrix after the second convolution, which is fed into a pooling layer; dimension reduction is then performed by the adaptive learning pooling algorithm to screen out the target feature vectors.
5. The sign language recognition method of claim 4, wherein the 3D ResNets deep convolutional neural network comprises 8 3D convolutional layers and 8 pooling layers, the 8 3D convolutional layers and 8 pooling layers being interleaved; wherein,
the convolution kernel of each 3D convolutional layer is 3 × 3, the number of convolution kernels increases from 64 to 512 in sequence, and after the convolutional layers the two streams of information undergo feature fusion at the convolutional layer;
each pooling layer performs dimension reduction using the adaptive learning pooling algorithm, wherein the second, sixth, seventh and eighth pooling layers use 2 × 2 windows to down-sample the temporal and spatial dimensions simultaneously, and the other pooling layers use 1 × 2 windows to down-sample only the spatial dimension.
6. The sign language recognition method of claim 5, wherein a BN layer is added after each 3D convolutional layer.
7. The sign language identification method of claim 5, wherein said 3D ResNets deep convolutional neural network further comprises a data input layer and two fully connected layers, wherein,
the first full-connection layer comprises 512 neurons, the feature vector output by the eighth pooling layer is converted into a 512-dimensional feature vector in the layer, a Dropout layer is used between the eighth pooling layer and the first full-connection layer, partial neural network units are discarded according to the probability of 0.5, and partial connection of the eighth pooling layer and the first full-connection layer is frozen by using a transfer learning algorithm according to the probability of 0.1;
the second full connection layer is a dense output layer and comprises neurons with the same number as the classification result, each neuron in the second full connection layer is fully connected with 512 neurons in the first full connection layer, and finally the classification result of the sign language class is output after the classification is carried out through a classifier.
8. The sign language identification method according to claim 7, wherein the 3D convolutional layer and the first fully-connected layer use ELU as an activation function, the second fully-connected layer uses Softmax as an activation function, the optimization function uses SGD function, and the loss function is the sum of errors of a multi-class cross-entropy function and an adaptive learning pooling algorithm.
CN201910426216.7A 2019-05-21 2019-05-21 Sign language recognition method Active CN110175551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910426216.7A CN110175551B (en) 2019-05-21 2019-05-21 Sign language recognition method

Publications (2)

Publication Number Publication Date
CN110175551A CN110175551A (en) 2019-08-27
CN110175551B true CN110175551B (en) 2023-01-10

Family

ID=67691821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910426216.7A Active CN110175551B (en) 2019-05-21 2019-05-21 Sign language recognition method

Country Status (1)

Country Link
CN (1) CN110175551B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126493B (en) * 2019-12-25 2023-08-01 东软睿驰汽车技术(沈阳)有限公司 Training method and device for deep learning model, electronic equipment and storage medium
CN111339837B (en) * 2020-02-08 2022-05-03 河北工业大学 Continuous sign language recognition method
CN111310701B (en) * 2020-02-27 2023-02-10 腾讯科技(深圳)有限公司 Gesture recognition method, device, equipment and storage medium
US11227151B2 (en) 2020-03-05 2022-01-18 King Fahd University Of Petroleum And Minerals Methods and systems for computerized recognition of hand gestures
CN111507275B (en) * 2020-04-20 2023-10-10 北京理工大学 Video data time sequence information extraction method and device based on deep learning
CN112464816A (en) * 2020-11-27 2021-03-09 南京特殊教育师范学院 Local sign language identification method and device based on secondary transfer learning
CN113378722B (en) * 2021-06-11 2023-04-07 西安电子科技大学 Behavior identification method and system based on 3D convolution and multilevel semantic information fusion
CN116343342B (en) * 2023-05-30 2023-08-04 山东海量信息技术研究院 Sign language recognition method, system, device, electronic equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5901246A (en) * 1995-06-06 1999-05-04 Hoffberg; Steven M. Ergonomic man-machine interface incorporating adaptive pattern recognition based control system
CN104376306A (en) * 2014-11-19 2015-02-25 天津大学 Optical fiber sensing system invasion identification and classification method and classifier based on filter bank
CN105654037A (en) * 2015-12-21 2016-06-08 浙江大学 Myoelectric signal gesture recognition method based on depth learning and feature images
CN107767405A (en) * 2017-09-29 2018-03-06 华中科技大学 A kind of nuclear phase for merging convolutional neural networks closes filtered target tracking
CN107845390A (en) * 2017-09-21 2018-03-27 太原理工大学 A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features
CN109409276A (en) * 2018-10-19 2019-03-01 大连理工大学 A kind of stalwartness sign language feature extracting method
JP2019074478A (en) * 2017-10-18 2019-05-16 沖電気工業株式会社 Identification device, identification method and program

Also Published As

Publication number Publication date
CN110175551A (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN110175551B (en) Sign language recognition method
CN108319686B (en) Antagonism cross-media retrieval method based on limited text space
US11748919B2 (en) Method of image reconstruction for cross-modal communication system and device thereof
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
CN112926396B (en) Action identification method based on double-current convolution attention
CN107562784A (en) Short text classification method based on ResLCNN models
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN111126488A (en) Image identification method based on double attention
CN109063719B (en) Image classification method combining structure similarity and class information
CN110378208B (en) Behavior identification method based on deep residual error network
CN109817276A (en) A kind of secondary protein structure prediction method based on deep neural network
CN109743732B (en) Junk short message distinguishing method based on improved CNN-LSTM
WO2023231954A1 (en) Data denoising method and related device
CN116385945B (en) Video interaction action detection method and system based on random frame complement and attention
WO2023226186A1 (en) Neural network training method, human activity recognition method, and device and storage medium
Liu et al. Dicnet: Deep instance-level contrastive network for double incomplete multi-view multi-label classification
CN115131613A (en) Small sample image classification method based on multidirectional knowledge migration
Wang et al. Introduction of artificial Intelligence
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN112668543A (en) Isolated word sign language recognition method based on hand model perception
CN115222998B (en) Image classification method
Raghavachari et al. Deep learning framework for fingerspelling system using CNN
Ying Gated recurrent unit based on feature attention mechanism for physical behavior recognition analysis
Wang et al. Distance correlation autoencoder
DR RECOGNITION OF SIGN LANGUAGE USING DEEP NEURAL NETWORK.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant