CN112883922B - Sign language identification method based on CNN-BiGRU neural network fusion


Publication number
CN112883922B
Authority
CN
China
Prior art keywords
sign language
bigru
data set
cnn
palm
Prior art date
Legal status
Active
Application number
CN202110304616.8A
Other languages
Chinese (zh)
Other versions
CN112883922A (en)
Inventor
李桢旻
祝东疆
苏彦博
贺子珊
鲁杰
彭靖宇
杜高明
王晓蕾
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202110304616.8A
Publication of CN112883922A
Application granted
Publication of CN112883922B
Legal status: Active

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V 40/20 Movements or behaviour, e.g. gesture recognition
              • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F 18/25 Fusion techniques
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
              • G06N 3/08 Learning methods
                • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a sign language recognition method based on CNN-BiGRU neural network fusion, which comprises the following steps: 1, collecting sign language data and adding labels to build a sign language data set; 2, preprocessing the sign language data set; 3, dividing the augmented feature data into a training data set, a validation data set and a test data set; 4, establishing a CNN-BiGRU deep neural network model that fuses a one-dimensional CNN with a BiGRU; and 5, collecting sign language data in real time, preprocessing it, and inputting it into the final model to obtain the sign language classification result. The invention makes full use of the spatio-temporal information of the sign language feature sequence and improves the recognition accuracy of the overall model, so that sign language recognition and classification can be realized effectively and accurately.

Description

Sign language identification method based on CNN-BiGRU neural network fusion
Technical Field
The invention relates to the field of sign language identification, in particular to a sign language identification method based on CNN-BiGRU neural network fusion.
Background
Against the background of increasingly diverse information transmission modes in intelligent human-computer interaction, sign language recognition, which reconstructs semantics from the dynamic spatial information of sign language interaction, addresses a pressing practical need. China has more than twenty million people who are deaf or speech-impaired, a large population with a comparatively low literacy rate, for whom sign language is the communication mode of daily life, so it is widely used and the demand for translation is large. However, the sign language translation industry has developed slowly, social training capacity is weak and infrastructure is lacking, so the supply of highly skilled sign language interpreters is scarce. In addition, online sign language translation is costly to operate and difficult to popularize. Automatic and accurate sign language recognition is therefore of great significance.
Traditional stand-alone means of capturing sign language state include skin-surface electromyography (EMG) sensors, wearable data gloves and ordinary cameras. EMG sensing devices collect the weak neural currents produced by muscle movement and then recognize and feed back the corresponding sign language actions through a pattern recognition algorithm. Recognition algorithms based on an ordinary camera mainly obtain low-dimensional sign language data by segmenting the target from the background and constructing local descriptors, which are then used for classification; increasing the number of cameras helps to further extract the spatial trajectory of the hand. However, complex background information in the captured images makes it harder to recognize hand postures and positions, and it is difficult to extract sufficient depth information from a single image, which limits recognition accuracy. EMG sensors and wearable data gloves must be worn, which is inconvenient, raises hygiene concerns for shared use in public places during an epidemic, and considerably limits their practical adoption.
At present, with the rise of artificial intelligence, deep learning has gradually penetrated various fields, and sign language recognition has gradually turned to deep learning with good results; nevertheless, recognition techniques for sign language remain few and immature. Traditional deep learning methods for sign language recognition mainly rely on the convolutional neural network (CNN) and the long short-term memory network (LSTM). CNN-based sign language recognition systems are limited to local features and cannot learn deeply from the pooled features, while the LSTM network considers only past feature sequences, ignores future time sequence information, and has a complex network structure that is difficult to train.
Disclosure of Invention
To address the problem of sign language recognition and overcome the shortcomings of the prior art, the invention provides a sign language recognition method based on CNN-BiGRU neural network fusion, so that the spatio-temporal information of the sign language feature sequence can be fully utilized, the recognition accuracy of the recognition model is improved, and sign language recognition and classification can be realized effectively and accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a sign language identification method based on CNN-BiGRU neural network fusion, which is characterized by comprising the following steps:
step 1: capturing by thumb position coordinates F with a depth camera device 1 Index finger position coordinate F 2 Middle finger position coordinate F 3 Position coordinates F of ring finger 4 Position coordinates of little finger F 5 Palm center position P C Palm stable position P S Palm Pitch angle Pitch, palm Yaw angle Yaw, palm Roll angle Roll, hand grip radius r, palm width P W Various sign language data composed of palm center velocity v; each sign language data is provided with a corresponding category label; forming a sign language data set by a plurality of sign language data and category labels thereof;
step 2: carrying out data preprocessing on the sign language data set;
step 2.1, calculating the relative hand-held sphere radius S, the fingertip-to-palm-center distance D, the relative palm-center velocity V, the absolute three-dimensional palm position deviation P and the inter-finger distance L from the sign language data set, to be used as the feature data set;
step 2.2, standardizing the feature data set with a zero-mean normalization method to obtain a preprocessed feature data set;
step 2.3, performing data augmentation on the preprocessed feature data set to obtain an augmented feature data set;
step 3: dividing the augmented feature data set into a training data set, a validation data set and a test data set;
step 4: establishing a CNN-BiGRU deep neural network model that fuses a one-dimensional CNN with a BiGRU; the CNN-BiGRU deep neural network model comprises: a SpatialDropout layer, a one-dimensional CNN network, a BiGRU network and a fully connected layer; the one-dimensional CNN network consists of a one-dimensional convolutional layer and a one-dimensional max pooling layer; the BiGRU network is formed by combining a forward-propagating GRU unit and a backward-propagating GRU unit;
step 4.1, setting hyperparameters and initializing the parameters of the CNN-BiGRU deep neural network model, so as to obtain the current network model;
step 4.2, inputting the training data set into the SpatialDropout layer of the current network model, and obtaining primary sign language features through the convolution and max pooling operations of the one-dimensional CNN network; after the primary sign language features pass through the BiGRU network, the time sequence information of the sign language features is obtained and the probability of each sign language category is output through the fully connected layer;
step 4.3, back-propagating the error of the sign language category probabilities through the current network model with an optimization algorithm, thereby updating the parameters of each layer and obtaining an updated CNN-BiGRU deep neural network model;
step 4.4, verifying the accuracy of the updated CNN-BiGRU deep neural network model on the validation data set to judge whether it has converged; if so, taking the updated model as the optimal sign language classification model under the current hyperparameter setting; otherwise, taking the updated model as the current network model and returning to step 4.2;
and step 4.5, obtaining the optimal sign language classification models under different hyperparameters according to the process of steps 4.1 to 4.4, and comparing their accuracy on the test data set, so that the optimal sign language classification model with the highest accuracy is selected as the model finally used for recognizing sign language.
The sign language identification method based on the CNN-BiGRU neural network fusion is also characterized in that the step 2.1 comprises the following steps:
step 2.1.1, obtaining the relative hand-held sphere radius S using formula (1):
[Formula (1) is given as an image in the original publication.]
step 2.1.2, obtaining the fingertip-to-palm-center distance D using formula (2):
D_i = sqrt((F_ix - P_Cx)^2 + (F_iy - P_Cy)^2 + (F_iz - P_Cz)^2)   (2)
In formula (2), F_ix, F_iy and F_iz denote the x, y and z coordinates of the i-th finger position, P_Cx, P_Cy and P_Cz denote the x, y and z coordinates of the palm center position, and i = 1, 2, 3, 4, 5 denotes the thumb, index finger, middle finger, ring finger and little finger, respectively;
step 2.1.3, obtaining the relative palm-center velocity V_x in the x direction, V_y in the y direction and V_z in the z direction using formulas (3), (4) and (5), respectively:
[Formulas (3), (4) and (5) are given as images in the original publication.]
In formulas (3) to (5), v_k denotes the palm velocity in the k direction;
step 2.1.4, obtaining the absolute three-dimensional palm position deviation P using formula (6):
P = P_C - P_S   (6)
step 2.1.5, obtaining the inter-finger (fingertip-to-fingertip) distance L using formula (7):
L_ij = sqrt((F_ix - F_jx)^2 + (F_iy - F_jy)^2 + (F_iz - F_jz)^2),  i ≠ j   (7)
In formula (7), F_jx, F_jy and F_jz denote the x, y and z coordinates of the j-th finger position, j = 1, 2, 3, 4, 5 denotes the thumb, index finger, middle finger, ring finger and little finger, respectively, and i ≠ j.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention accurately obtains the features of the palm and fingers with the Leap Motion depth camera and fuses a one-dimensional CNN network with a BiGRU network, thereby achieving sign language recognition with high accuracy;
2. The invention accurately obtains the spatio-temporal features of the palm and fingers with the Leap Motion depth camera in a contact-free, marker-free setting. Compared with a traditional glove-based recognition system, no special wearable equipment is needed to obtain the features, which reduces equipment cost; compared with an ordinary two-dimensional camera, the Leap Motion depth camera captures more spatial features of the sign language, which improves recognition accuracy;
3. The CNN-BiGRU deep neural network model uses the one-dimensional CNN network to perform primary feature extraction on the feature quantities through convolution, extracting the spatial features of the sign language; this greatly reduces the number of network parameters while keeping the feature vector at each moment complete, avoids mutual interference among parameters thanks to the sparsity of convolution, and improves the effectiveness of feature extraction;
4. The CNN-BiGRU deep neural network model adopts a BiGRU network that extracts the time sequence information of the sign language features produced by the one-dimensional CNN network in both the forward and backward directions along the time dimension, thereby obtaining more complete temporal feature information and improving the accuracy of sign language recognition.
drawings
FIG. 1 is a flow chart of a sign language recognition method based on fusion according to the present invention;
FIG. 2a is a diagram illustrating finger position, palm position and palm velocity in the sign language data feature of the present invention;
FIG. 2b is a schematic diagram of a palm angle in the sign language data feature of the present invention;
FIG. 2c is a schematic diagram of the hand-held sphere radius and the palm width in the sign language data feature of the present invention;
FIG. 3 is a schematic structural diagram of a CNN-BiGRU combined network model implemented by the present invention.
Detailed Description
In this embodiment, a sign language identification method based on a CNN-BiGRU neural network fusion algorithm, as shown in FIG. 1, includes the following steps:
step 1: obtaining thumb position coordinates F using a depth camera device Leap Motion 1 Index finger position coordinate F 2 Middle finger position coordinates F 3 Position coordinates of ring finger F 4 Position coordinates of little finger F 5 Palm center position P C Palm stable position P S Palm center velocity v, palm Pitch angle Pitch, palm Yaw Angle Yaw, palm Roll Angle Roll, hand grip radius r, palm Width P W The formed various sign language data; each sign language data is provided with a corresponding category label; forming a sign language data set by a plurality of sign language data and category labels thereof;
in specific implementation, the operation of acquiring the sign language data set specifically includes: 10 collection subjects, 15 sign language actions, each recording 20 sets of sign language data sets per sign language action. When recording, the palm faces to the Leap Motion device, appointed sign language actions are respectively completed in the visual field range of about 25mm to 600mm above the palm, and original characteristic data describing the palm position and the finger state are obtained.
As shown in FIG. 2a, the thumb position coordinates F_1, index finger position coordinates F_2, middle finger position coordinates F_3, ring finger position coordinates F_4, little finger position coordinates F_5, palm center position P_C and palm velocity v in the sign language data set are obtained;
as shown in FIG. 2b, the palm Pitch angle Pitch, palm Yaw angle Yaw and palm Roll angle Roll in the sign language data set are obtained;
as shown in FIG. 2c, the hand-held sphere radius r and the palm width P_W in the sign language data set are obtained.
The 10 subjects are volunteers selected from the recruited candidates with individual variation factors such as gender, age and handedness taken into account; all of them have some sign language background and take part in the data set recording after a short period of learning.
Step 2: carrying out data preprocessing on the sign language data set;
step 2.1, calculating the relative hand-held sphere radius S, the fingertip-to-palm-center distance D, the relative palm-center velocity V, the absolute three-dimensional palm position deviation P and the inter-finger distance L from the sign language data set, to be used as the feature data;
step 2.1.1, obtaining the relative hand-held sphere radius S using formula (1):
[Formula (1) is given as an image in the original publication.]
step 2.1.2, obtaining the fingertip-to-palm-center distance D using formula (2):
D_i = sqrt((F_ix - P_Cx)^2 + (F_iy - P_Cy)^2 + (F_iz - P_Cz)^2)   (2)
In formula (2), F_ix, F_iy and F_iz denote the x, y and z coordinates of the i-th finger position, P_Cx, P_Cy and P_Cz denote the x, y and z coordinates of the palm center position, and i = 1, 2, 3, 4, 5 denotes the thumb, index finger, middle finger, ring finger and little finger, respectively;
step 2.1.3, obtaining the relative palm-center velocity V_x in the x direction, V_y in the y direction and V_z in the z direction using formulas (3), (4) and (5), respectively:
[Formulas (3), (4) and (5) are given as images in the original publication.]
In formulas (3) to (5), v_k denotes the palm velocity in the k direction;
step 2.1.4, obtaining the absolute three-dimensional palm position deviation P using formula (6):
P = P_C - P_S   (6)
step 2.1.5, obtaining the inter-finger (fingertip-to-fingertip) distance L using formula (7):
L_ij = sqrt((F_ix - F_jx)^2 + (F_iy - F_jy)^2 + (F_iz - F_jz)^2),  i ≠ j   (7)
In formula (7), F_jx, F_jy and F_jz denote the x, y and z coordinates of the j-th finger position, j = 1, 2, 3, 4, 5 denotes the thumb, index finger, middle finger, ring finger and little finger, respectively, and i ≠ j, i.e. the distance from a fingertip to itself is not calculated.
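For illustration only, the per-frame geometry of step 2.1 can be computed with NumPy as in the following sketch. The helper name frame_features and the array layout (a (5, 3) matrix of fingertip positions plus two 3-vectors for P_C and P_S) are assumptions of this sketch; only D (formula (2)), L (formula (7)) and P (formula (6)) are computed, since the exact expressions for S and the relative velocities are given only as images in the original.

```python
import numpy as np

def frame_features(fingers, palm_center, palm_stable):
    """Per-frame geometric features from the captured hand data.

    fingers:      (5, 3) array of fingertip positions F_1..F_5 (thumb..little finger)
    palm_center:  (3,) palm center position P_C
    palm_stable:  (3,) palm stable position P_S
    """
    # Fingertip-to-palm-center distances D_i, formula (2).
    D = np.linalg.norm(fingers - palm_center, axis=1)
    # Pairwise fingertip distances L_ij with i != j, formula (7); upper triangle only.
    iu, ju = np.triu_indices(5, k=1)
    L = np.linalg.norm(fingers[iu] - fingers[ju], axis=1)
    # Absolute three-dimensional palm position deviation P, formula (6).
    P = palm_center - palm_stable
    return np.concatenate([D, L, P])
```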
Step 2.2, standardizing the feature data with a zero-mean normalization method to eliminate the influence of differences in dimension and value range among the features, and obtaining the preprocessed feature data set;
the zero-mean normalization algorithm is embodied as follows:
Figure BDA0002987590640000063
in the formula (8), x represents characteristic data,
Figure BDA0002987590640000064
represents the mean of the characteristic data, sigma represents the standard deviation of the characteristic data, x * The resulting normalized preprocessed feature data is represented.
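As a small illustration of formula (8), the normalization statistics can be fitted on the training portion and reused for the validation, test and real-time data. The helper name zscore_fit_transform and the (samples, timesteps, features) array layout are assumptions of this sketch, not details given in the text.

```python
import numpy as np

def zscore_fit_transform(train_features):
    """Zero-mean normalization, formula (8): x* = (x - mean) / std."""
    mean = train_features.mean(axis=(0, 1))          # per-feature mean over samples and frames
    std = train_features.std(axis=(0, 1)) + 1e-8     # small epsilon avoids division by zero
    return (train_features - mean) / std, mean, std
```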
Step 2.3, performing data augmentation on the preprocessed feature data set. The time series of the data samples are transformed in the time domain using a center-averaging method based on weighted dynamic time warping. Specifically: an initial time series is randomly selected from the data set and given a weight of 0.5, serving as the initial series of the center-averaging technique; the dynamic time warping distances between this initial series and the other samples are computed, and the 5 time series with the shortest distances are found; two time series are randomly selected from these 5 nearest neighbours and each given a weight of 0.15; the remaining sequences equally share the remaining weight of 0.2. This yields the augmented feature data set;
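The sketch below illustrates this augmentation step under stated assumptions: each sample is a NumPy array of shape (frames, features), the DTW distance and alignment path are computed with a plain dynamic-programming routine, and the weighted combination aligns each neighbour onto the reference timeline in a single pass, which approximates weighted DTW barycentric averaging rather than reproducing the exact procedure of the embodiment. All function names are illustrative.

```python
import numpy as np

def dtw(a, b):
    """DTW distance between two (T, F) series plus the optimal alignment path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m                      # backtrack the warping path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return cost[n, m], path[::-1]

def augment_sample(samples, rng=None):
    """Create one synthetic sample by a weighted, DTW-aligned average."""
    rng = rng if rng is not None else np.random.default_rng()
    ref_idx = int(rng.integers(len(samples)))
    ref = samples[ref_idx]                                       # reference series, weight 0.5
    results = {k: dtw(ref, samples[k]) for k in range(len(samples)) if k != ref_idx}
    nearest = sorted(results, key=lambda k: results[k][0])[:5]   # 5 closest series
    weights = {k: 0.2 / 3 for k in nearest}                      # three of them share 0.2
    for k in rng.choice(nearest, size=2, replace=False):         # two of them get 0.15 each
        weights[int(k)] = 0.15
    out = 0.5 * ref
    for k in nearest:
        aligned, counts = np.zeros_like(ref), np.zeros(len(ref))
        for i, j in results[k][1]:             # map neighbour frames onto the reference timeline
            aligned[i] += samples[k][j]
            counts[i] += 1
        out = out + weights[k] * aligned / counts[:, None]
    return out
```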
Step 3: dividing the augmented feature data set into a training data set, a validation data set and a test data set in a ratio of 6 : 2 : 2;
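A minimal sketch of the 6 : 2 : 2 split; the helper name split_6_2_2, the shuffling and the fixed seed are assumptions of this illustration.

```python
import numpy as np

def split_6_2_2(features, labels, seed=0):
    """Shuffle and split into training / validation / test sets at a 6:2:2 ratio."""
    idx = np.random.default_rng(seed).permutation(len(features))
    n_train, n_val = int(0.6 * len(idx)), int(0.2 * len(idx))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return ((features[train], labels[train]),
            (features[val], labels[val]),
            (features[test], labels[test]))
```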
Step 4: as shown in FIG. 3, a CNN-BiGRU deep neural network model fusing a one-dimensional CNN with a BiGRU is established. The CNN-BiGRU deep neural network model comprises: a SpatialDropout layer, a one-dimensional CNN network, a BiGRU network and a fully connected layer. The SpatialDropout layer randomly zeroes part of the input sign language feature map, which improves the generalization ability of the model. The one-dimensional CNN network consists of a one-dimensional convolutional layer and a max pooling layer: the one-dimensional convolutional layer obtains global information by comprehensively learning the local features of the sign language sequence through convolution, and the max pooling layer down-samples the input feature map to eliminate part of the redundant information. The BiGRU network is formed by combining a forward-propagating GRU unit and a backward-propagating GRU unit: the forward GRU unit processes the input sequence data in chronological order and the backward GRU unit processes it in reverse chronological order, turning the unidirectional structure into a bidirectional one, so that context information can be fully utilized, feature information ignored by a unidirectional GRU can be captured, redundant information can be further eliminated, and the spatial and temporal feature information of the original sign language sequence is finally obtained. The fully connected layer re-integrates the input data and maps it to the sample label space.
Step 4.1, setting the hyperparameters and initializing the parameters of the CNN-BiGRU deep neural network model, so as to obtain the current network model. The hyperparameter settings are as follows: the activation function of the one-dimensional convolutional layer is ReLU, the filter size is 5, the number of filters is 64, and L2 regularization with a coefficient of 0.001 is used to avoid overfitting; the pooling window size of the pooling layer is 4. In the BiGRU layer, the dropout ratio of the input units is 0.1, the dropout ratio of the recurrent units is 0.1, and L2 regularization with a coefficient of 0.001 is used to avoid overfitting. The fully connected layer uses a Softmax activation function.
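A minimal Keras sketch of a model with this layer order and the hyperparameters listed above. The number of GRU units, the SpatialDropout rate and the builder name build_cnn_bigru are not specified in the text and are assumptions of this illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_cnn_bigru(timesteps, n_features, n_classes):
    """One-dimensional CNN + BiGRU classifier following the layer order above."""
    return keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        layers.SpatialDropout1D(0.1),                        # rate assumed; zeroes whole feature channels
        layers.Conv1D(64, kernel_size=5, activation="relu",  # 64 filters of size 5, ReLU
                      kernel_regularizer=regularizers.l2(0.001)),
        layers.MaxPooling1D(pool_size=4),                    # pooling window of 4
        layers.Bidirectional(layers.GRU(64,                  # unit count assumed
                                        dropout=0.1, recurrent_dropout=0.1,
                                        kernel_regularizer=regularizers.l2(0.001))),
        layers.Dense(n_classes, activation="softmax"),       # fully connected Softmax output
    ])
```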
Step 4.2, inputting the training data set into the current network model: the SpatialDropout layer randomly zeroes part of the input sign language feature map to improve the generalization ability of the model; the convolution operation of the one-dimensional CNN network extracts primary sign language features, which contain spatial information such as positions, and its max pooling operation eliminates redundant information. After the primary sign language features pass through the BiGRU network, the time sequence information of the sign language features is obtained as high-level features, and finally the probability of each sign language category is output through the fully connected layer.
Step 4.3, back-propagating the error of the sign language category probabilities through the current network model with an optimization algorithm, thereby updating the parameters of each layer and obtaining an updated CNN-BiGRU deep neural network model. Adam is selected as the optimizer, cross entropy is used as the loss function, accuracy is selected as the evaluation metric, the number of epochs is 15 and the batch_size is 16.
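The training settings of steps 4.3 and 4.4 might be expressed as in the following sketch, which reuses the build_cnn_bigru sketch above and assumes one-hot encoded labels (otherwise sparse_categorical_crossentropy would be used); the EarlyStopping callback stands in for the convergence check on the validation set.

```python
from tensorflow import keras

def train(model, x_train, y_train, x_val, y_val):
    """Compile and fit with the Adam / cross-entropy / accuracy settings above."""
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=3,
                                         restore_best_weights=True)
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     epochs=15, batch_size=16,
                     callbacks=[stop])
```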
Step 4.4, verifying the accuracy of the updated CNN-BiGRU deep neural network model on the validation data set to judge whether it has converged; if so, the updated model is taken as the optimal sign language classification model under the current hyperparameter setting; otherwise, the updated model is taken as the current network model and the procedure returns to step 4.2.
Step 4.5, according to the process of steps 4.1 to 4.4, cross validation is performed with several sets of hyperparameters to obtain the optimal sign language classification models under the different hyperparameters, and their accuracy on the test data set is compared, so that the optimal sign language classification model with the highest accuracy is selected as the model finally used for recognizing sign language.
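Steps 4.1 to 4.5 can then be tied together as below, reusing the build_cnn_bigru and train sketches above; passing one builder function per hyperparameter setting is an assumption of this illustration.

```python
def select_best_model(build_fns, data):
    """Train one model per hyperparameter configuration and keep the one with
    the highest accuracy on the test data set."""
    (x_train, y_train), (x_val, y_val), (x_test, y_test) = data
    best_model, best_acc = None, 0.0
    for build_fn in build_fns:   # each builder encodes one hyperparameter setting
        model = build_fn(x_train.shape[1], x_train.shape[2], y_train.shape[1])
        train(model, x_train, y_train, x_val, y_val)
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model, best_acc
```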
And 5: sign language data are collected in real time, preprocessed and input into a final model to obtain sign language classification results.

Claims (2)

1. A sign language identification method based on CNN-BiGRU neural network fusion is characterized by comprising the following steps:
step 1: obtaining by thumb position coordinates F with a depth camera device 1 Index finger position coordinate F 2 Middle finger position coordinate F 3 Position coordinates F of ring finger 4 Position coordinates of little finger F 5 Palm center position P C Palm stable position P S Palm Pitch angle Pitch, palm Yaw angle Yaw, palm Roll angle Roll, hand grip radius r, palm width P W Various sign language data composed of palm center velocity v; each sign language data is provided with a corresponding category label; forming a sign language data set by a plurality of sign language data and category labels thereof;
and 2, step: carrying out data preprocessing on the sign language data set;
step 2.1, calculating a relative radius S of a hand-held ball, a fingertip palm center distance D, a palm center relative speed V, an absolute three-dimensional palm position standard deviation P and an inter-finger distance L according to the sign language data set to be used as a feature data set;
obtaining the relative hand-held sphere radius S using formula (1):
[Formula (1) is given as an image in the original publication.]
obtaining the relative palm-center velocity V_x in the x direction, V_y in the y direction and V_z in the z direction using formulas (2), (3) and (4), respectively:
[Formulas (2), (3) and (4) are given as images in the original publication.]
In formulas (2) to (4), v_k denotes the palm velocity in the k direction;
step 2.1.4, obtaining the absolute three-dimensional palm position deviation P using formula (5):
P = P_C - P_S   (5)
step 2.2, standardizing the feature data set with a zero-mean normalization method to obtain a preprocessed feature data set;
step 2.3, performing data augmentation on the preprocessed feature data set to obtain an augmented feature data set;
step 3: dividing the augmented feature data set into a training data set, a validation data set and a test data set;
step 4: establishing a CNN-BiGRU deep neural network model that fuses a one-dimensional CNN with a BiGRU; the CNN-BiGRU deep neural network model comprises: a SpatialDropout layer, a one-dimensional CNN network, a BiGRU network and a fully connected layer; the one-dimensional CNN network consists of a one-dimensional convolutional layer and a one-dimensional max pooling layer; the BiGRU network is formed by combining a forward-propagating GRU unit and a backward-propagating GRU unit;
step 4.1, setting hyperparameters and initializing the parameters of the CNN-BiGRU deep neural network model, so as to obtain the current network model;
step 4.2, inputting the training data set into the SpatialDropout layer of the current network model, and obtaining primary sign language features through the convolution and max pooling operations of the one-dimensional CNN network; after the primary sign language features pass through the BiGRU network, the time sequence information of the sign language features is obtained and the probability of each sign language category is output through the fully connected layer;
step 4.3, back-propagating the error of the sign language category probabilities through the current network model with an optimization algorithm, thereby updating the parameters of each layer and obtaining an updated CNN-BiGRU deep neural network model;
step 4.4, verifying the accuracy of the updated CNN-BiGRU deep neural network model on the validation data set to judge whether it has converged; if so, taking the updated model as the optimal sign language classification model under the current hyperparameter setting; otherwise, taking the updated model as the current network model and returning to step 4.2;
and step 4.5, obtaining the optimal sign language classification models under different hyperparameters according to the process of steps 4.1 to 4.4, and comparing their accuracy on the test data set, so that the optimal sign language classification model with the highest accuracy is selected as the model finally used for recognizing sign language.
2. The method for sign language recognition based on CNN-BiGRU neural network fusion as claimed in claim 1, wherein the step 2.1 comprises:
step 2.1.1, obtaining the fingertip-to-palm-center distance D using formula (6):
D_i = sqrt((F_ix - P_Cx)^2 + (F_iy - P_Cy)^2 + (F_iz - P_Cz)^2)   (6)
In formula (6), F_ix, F_iy and F_iz denote the x, y and z coordinates of the i-th finger position, P_Cx, P_Cy and P_Cz denote the x, y and z coordinates of the palm center position, and i = 1, 2, 3, 4, 5 denotes the thumb, index finger, middle finger, ring finger and little finger, respectively;
step 2.1.2, obtaining the inter-finger distance L using formula (7):
L_ij = sqrt((F_ix - F_jx)^2 + (F_iy - F_jy)^2 + (F_iz - F_jz)^2),  i ≠ j   (7)
In formula (7), F_jx, F_jy and F_jz denote the x, y and z coordinates of the j-th finger position, j = 1, 2, 3, 4, 5 denotes the thumb, index finger, middle finger, ring finger and little finger, respectively, and i ≠ j.
CN202110304616.8A 2021-03-23 2021-03-23 Sign language identification method based on CNN-BiGRU neural network fusion Active CN112883922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110304616.8A CN112883922B (en) 2021-03-23 2021-03-23 Sign language identification method based on CNN-BiGRU neural network fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110304616.8A CN112883922B (en) 2021-03-23 2021-03-23 Sign language identification method based on CNN-BiGRU neural network fusion

Publications (2)

Publication Number Publication Date
CN112883922A CN112883922A (en) 2021-06-01
CN112883922B true CN112883922B (en) 2022-08-30

Family

ID=76041665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110304616.8A Active CN112883922B (en) 2021-03-23 2021-03-23 Sign language identification method based on CNN-BiGRU neural network fusion

Country Status (1)

Country Link
CN (1) CN112883922B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113542241B (en) * 2021-06-30 2023-05-09 杭州电子科技大学 Intrusion detection method and device based on CNN-BiGRU hybrid model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779296B1 (en) * 2016-04-01 2017-10-03 Kofax, Inc. Content-based detection and three dimensional geometric reconstruction of objects in image and video data
CN107563286A (en) * 2017-07-28 2018-01-09 南京邮电大学 A kind of dynamic gesture identification method based on Kinect depth information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9672418B2 (en) * 2015-02-06 2017-06-06 King Fahd University Of Petroleum And Minerals Arabic sign language recognition using multi-sensor data fusion

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779296B1 (en) * 2016-04-01 2017-10-03 Kofax, Inc. Content-based detection and three dimensional geometric reconstruction of objects in image and video data
CN107563286A (en) * 2017-07-28 2018-01-09 南京邮电大学 A kind of dynamic gesture identification method based on Kinect depth information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"American sign language recognition and training method with recurrent neural network";C.K.M. Lee等;《Expert Systems with Applications》;20201203;第167卷;全文 *
"融合关节旋转特征和指尖距离特征的手势识别";缪永伟等;《计算机学报》;20200131;第43卷(第01期);全文 *

Also Published As

Publication number Publication date
CN112883922A (en) 2021-06-01

Similar Documents

Publication Publication Date Title
Yu et al. Exploration of Chinese sign language recognition using wearable sensors based on deep belief net
CN107908288A (en) A kind of quick human motion recognition method towards human-computer interaction
CN111476161A (en) Somatosensory dynamic gesture recognition method fusing image and physiological signal dual channels
Wu et al. A Visual-Based Gesture Prediction Framework Applied in Social Robots.
Alrubayi et al. A pattern recognition model for static gestures in malaysian sign language based on machine learning techniques
CN112148128B (en) Real-time gesture recognition method and device and man-machine interaction system
CN111444488A (en) Identity authentication method based on dynamic gesture
CN106502390B (en) A kind of visual human's interactive system and method based on dynamic 3D Handwritten Digit Recognition
Bu Human motion gesture recognition algorithm in video based on convolutional neural features of training images
CN109325408A (en) A kind of gesture judging method and storage medium
CN108960171B (en) Method for converting gesture recognition into identity recognition based on feature transfer learning
CN109765996A (en) Insensitive gesture detection system and method are deviated to wearing position based on FMG armband
Wang et al. Research on gesture image recognition method based on transfer learning
CN112883922B (en) Sign language identification method based on CNN-BiGRU neural network fusion
CN114255508A (en) OpenPose-based student posture detection analysis and efficiency evaluation method
Ma et al. Difference-guided representation learning network for multivariate time-series classification
CN111382699A (en) Dynamic gesture recognition method based on particle swarm optimization LSTM algorithm
Chen et al. Unsupervised sim-to-real adaptation for environmental recognition in assistive walking
Zhou et al. Intelligent recognition of medical motion image combining convolutional neural network with Internet of Things
CN109993116B (en) Pedestrian re-identification method based on mutual learning of human bones
Alhersh et al. Learning human activity from visual data using deep learning
Lu et al. Pose-guided model for driving behavior recognition using keypoint action learning
Savio et al. Image processing for face recognition using HAAR, HOG, and SVM algorithms
Deng et al. Attention based visual analysis for fast grasp planning with a multi-fingered robotic hand
Bilang et al. Cactaceae detection using MobileNet architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant