CN114943990A - Continuous sign language recognition method and device based on ResNet34 network-attention mechanism


Info

Publication number
CN114943990A
Authority
CN
China
Prior art keywords
video data
network topology
attention mechanism
training set
resnet34
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210709795.8A
Other languages
Chinese (zh)
Inventor
沈丛
杨甜
东天宇
幸高松
陆星元
袁甜甜
陈胜勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202210709795.8A priority Critical patent/CN114943990A/en
Publication of CN114943990A publication Critical patent/CN114943990A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/113Recognition of static hand signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a continuous sign language recognition method and device based on a ResNet34 network-attention mechanism, relating to the technical field of artificial intelligence recognition and comprising the following steps. S1: acquiring a first video data training set, and obtaining a second video data training set by adopting a KFE clustering algorithm; S2: constructing a ResNet34 network topology, fusing a PSA channel attention mechanism and an RCC spatial attention mechanism into a PR attention mechanism, and integrating the PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set; S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set; S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters. The method can alleviate the technical problem of overfitting of the neural network structure caused by video redundancy in the prior art.

Description

Continuous sign language recognition method and device based on ResNet34 network-attention mechanism
Technical Field
The invention relates to the technical field of artificial intelligence recognition, in particular to a continuous sign language recognition method and device based on a ResNet34 network-attention mechanism.
Background
As a communication language specific to deaf and hearing-impaired people, sign language integrates knowledge from the fields of natural language processing and computer vision, and sign language recognition, as a subtask of these fields, has also attracted the attention of researchers. Generally speaking, sign language recognition is divided into isolated word recognition and continuous sign language recognition tasks. Although isolated sign language word recognition has achieved excellent results, it ignores the latent semantic relationships in sign language and the long-term temporal dependencies within a sign language sentence, so the continuous sign language recognition task is receiving increasing attention.
In recent years, various sign language recognition methods have been devised to improve the accuracy of continuous sign language recognition. Early sign language recognition studies often relied on data gloves and other sensor devices to collect gesture motion changes and timing information in real time, and either modeled the temporal information of sign language with traditional hidden Markov models or extracted hand information with conditional random fields.
Later, with the rise of deep learning (DL), more and more researchers performed continuous sign language recognition using neural networks. The rapid development of neural networks opened a new research door for the tasks of sign language recognition, sign language translation, and sign language generation. At present, researchers recognize sign language with models such as convolutional neural networks, recurrent neural networks, graph convolutional neural networks, and skeleton-based models, and some researchers use the connectionist temporal classification (CTC) method to better align sign language videos with the recognized text. However, current sign language recognition work rarely focuses on multi-modal input; for a neural network fed with a complete sign language sequence, the redundancy of the video easily causes overfitting of the network, so it is necessary to design a novel continuous sign language recognition method.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for continuous sign language recognition based on the ResNet34 network-attention mechanism, so as to alleviate the technical problem of overfitting of the neural network structure caused by video redundancy and improve the generalization capability of the continuous sign language recognition method.
The invention relates to a continuous sign language identification method based on a ResNet34 network-attention mechanism, which comprises the following steps:
S1: acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to acquire a second video data training set, wherein the second video data training set is provided with labels;
S2: constructing a ResNet34 network topology, fusing a PSA channel attention mechanism and an RCC spatial attention mechanism into a PR attention mechanism, and integrating the PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set;
S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
Preferably, the method further comprises:
acquiring a first video data test set, and testing the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology.
Preferably, the step of extracting the key frames of the first video data training set by using a KFE clustering algorithm to obtain a second video data training set includes:
acquiring an initial threshold, a frame set of the first video data training set and cluster centroids of all clusters;
acquiring frames of the first video data training set based on the frame set of the first video data training set, and acquiring the closest distance from the frames of the first video data training set to a cluster centroid based on the cluster centroids of all the clusters;
determining whether a closest distance of a frame of the first training set of video data to a cluster centroid is less than an initial threshold;
if so, classifying the frames of the first video data training set into a cluster centroid class with the closest distance, removing the frames of the first video data training set from the frame set of the first video data training set, and executing the steps of acquiring an initial threshold, the frame set of the first video data training set and cluster centroids of all clusters;
if not, defining the frames of the first video data training set to be in a new category, removing the frames of the first video data training set from the frame set of the first video data training set, and executing the steps of obtaining an initial threshold value, the frame set of the first video data training set and cluster centroids of all clusters.
Preferably, the ResNet34 network topology includes an initial layer, a first residual layer, a second residual layer, a third residual layer, a fourth residual layer, and a global average pooling layer;
the first residual layers have 64 convolution kernels each, and there are 3 first residual layers;
the second residual layers have 128 convolution kernels each, and there are 4 second residual layers;
the third residual layers have 256 convolution kernels each, and there are 6 third residual layers;
the fourth residual layers have 512 convolution kernels each, and there are 3 fourth residual layers;
integrating a PR attention mechanism with the ResNet34 network topology includes:
introducing the PR attention mechanism between the fourth residual layer and a global average pooling layer.
Preferably, in the step of combining the PSA channel attention mechanism and the RCC spatial attention mechanism into the PR attention mechanism, the PSA channel attention mechanism is:

$[X_0, X_1, \ldots, X_{S-1}] = \mathrm{Split}(X)$;

$F_i = \mathrm{Conv}(K_i \times K_i, G_i)(X_i)$;

$F = \mathrm{Cat}([F_0, F_1, \ldots, F_{S-1}])$;

where $X \in \mathbb{R}^{C \times W \times H}$ is the first feature map obtained by passing the second video training set through the first four residual layers of the ResNet34 network; $C$, $W$ and $H$ are the channel, width and height of the first feature map; Split equally divides the first feature map $X \in \mathbb{R}^{C \times W \times H}$ into $S$ parts along the channel dimension; $X_i \in \mathbb{R}^{C/S \times W \times H}$ are the equally divided feature maps with $C/S$ channels; $K_i$ are the different convolution kernel sizes; $G_i$ are the group convolution parameters; $F_i \in \mathbb{R}^{C/S \times W \times H}$ are the multi-scale features after multi-scale feature extraction; Cat concatenates the multi-scale features under different receptive fields along the channel dimension; $F \in \mathbb{R}^{C \times W \times H}$ is the feature vector after multi-scale feature concatenation.

The weights of the feature vector after multi-scale feature concatenation are extracted by the following formulas:

$g_i = \mathrm{AvgPool}(F_i)$;

$Z_i = \sigma(W_1\,\delta(W_0(g_i)))$;

$Z = \mathrm{Cat}([Z_0, Z_1, \ldots, Z_{S-1}])$;

where $\mathrm{AvgPool}(\cdot)$ denotes global average pooling; $\sigma(\cdot)$ is the sigmoid activation function; $\delta(\cdot)$ is the ReLU activation function; $g_i \in \mathbb{R}^{C/S \times 1 \times 1}$ is the feature vector obtained by global average pooling of the multi-scale features; $W_0$ and $W_1$ are weight matrices of dimensions $[C/S/r, C/S]$ and $[C/S, C/S/r]$ respectively, where $r$ denotes the reduction ratio; $Z_i$ are the attention weights of the different parts with dimension $[C/S, 1, 1]$; $Z$ is the cross-dimension channel attention feature weight map with dimension $[C, 1, 1]$.

The obtained attention weights are normalized by the following formulas, and a tensor product operation is performed between the weights and the feature vector after multi-scale feature extraction:

$\mathrm{att} = \mathrm{Softmax}(Z)$;

$Y = \mathrm{att} \odot F$;

where att is the normalized channel attention weight.
The RCC attention mechanism connects the Criss-Cross module in series twice to obtain rich context information, where the Criss-Cross attention mechanism is:

$Q = W_Q Y$;

$K = W_K Y$;

$V = W_V Y$;

where $W_Q$ and $W_K$ are weight matrices of dimension $[C', C]$, and $W_V$ is a weight matrix of dimension $[C, C]$.

The PR attention mechanism is integrated with the ResNet34 network topology to extract the feature information of the second video data set by the following formulas.

An Affinity operation is performed to obtain the relationship between each pixel and the pixels in the same row and column in the feature map of size $[W, H]$:

$D = \mathrm{Affinity}(Q, K)$;

$d_{i,u} = Q_u\,\Omega_{i,u}^{\mathsf{T}}$;

where $Q$ and $K$ are feature maps of dimension $[C', W, H]$; for each position $u$ in the spatial dimension of $Q$ there is a feature vector $Q_u \in \mathbb{R}^{C'}$; $\Omega_u \in \mathbb{R}^{(H+W-1) \times C'}$ is the set of feature vectors extracted from $K$ that lie in the same row and column as position $u$, with $\Omega_{i,u}$ being the $i$-th element of $\Omega_u$; $d_{i,u} \in D$ is the degree of correlation between the features $Q_u$ and $\Omega_{i,u}$, $i = 1, \ldots, H+W-1$.

Based on the relationship $D$ between each pixel and the pixels in the same row and column in the feature map of size $[W, H]$, a softmax layer is applied along the channel dimension to compute the attention map $A$:

$A = \mathrm{softmax}(D)$;

An Aggregation operation is performed on the attention map $A$ to collect the context information $Y'$:

$Y'_u = \sum_{i=0}^{H+W-1} A_{i,u}\,\Phi_{i,u} + Y_u$;

where, for each position $u$ in the spatial dimension of $V$, there is a feature vector $V_u \in \mathbb{R}^{C}$, and $\Phi_u \in \mathbb{R}^{(H+W-1) \times C}$ is the set of feature vectors extracted from the $V$ matrix that lie in the same row and column as position $u$; $Y'$ is the captured context information with long-range connections in the vertical and horizontal directions.

The capture of context information with long-range connections in the vertical and horizontal directions is applied repeatedly, as shown in the following formulas:

$Y' = \mathrm{CrissCross}(Y)$;

$Y'' = \mathrm{CrissCross}(Y')$;

where $Y''$ is the feature vector that has acquired global pixel information.
Preferably, the step of decoding the encoded second video data set by using the LSTM-CTC end-to-end network structure topology and the labels of the second video data training set includes:

calculating the CTC loss function of the LSTM-CTC end-to-end network, which specifically comprises the following steps:

defining a many-to-one mapping function $\beta(\cdot)$ to the target sequence $y$, i.e. $y = \beta(\pi)$, so that

$p(y \mid X) = \sum_{\pi \in \beta^{-1}(y)} p(\pi \mid X)$;

where

$p(\pi \mid X) = \prod_{n=1}^{T} y^{n}_{\pi_n}$;

in the formula, $\pi_n$ is the label of path $\pi$ at time $n$, and $y^{n}_{\pi_n}$ is its probability of occurrence at time $n$.

The CTC loss function is:

$L_{\mathrm{CTC}} = -\ln p(y \mid X)$.
Preferably, the objective function is constructed by the following formula:

$L = \sum_{(X, y) \in S} L_{\mathrm{CTC}}(X, y) + \lambda \lVert \omega \rVert_{2}$;

where $L$ is the constructed objective function used to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM encoder network topology parameters, and the LSTM-CTC decoder network topology; $S$ is the given second video data set; $\lVert \omega \rVert_{2}$ is a regularization term that avoids overfitting; $\lambda$ is the hyper-parameter of the regularization term.
Preferably, the step of acquiring the first video data test set and testing the network topology obtained after integrating the constructed ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology includes:

obtaining the WER value to represent the recognition accuracy:

$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%$;

where $S$, $I$ (Ins) and $D$ (Del) are the minimum numbers of substitution, insertion and deletion operations, respectively, and $N$ represents the total number of words in the label.
Preferably, the step of acquiring the first video data test set and testing the network topology obtained after integrating the constructed ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology includes:

obtaining the Accuracy to represent the recognition accuracy:

$\mathrm{Accuracy} = \frac{N - S - D - I}{N} \times 100\%$;

where $S$, $I$ (Ins) and $D$ (Del) are the minimum numbers of substitution, insertion and deletion operations, respectively, and $N$ represents the total number of words in the label.
In another aspect, a continuous sign language recognition device based on the ResNet34 network-attention mechanism comprises:
a video acquisition module: used for acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and key frames of the first video data training set are extracted by adopting a KFE clustering algorithm to acquire a second video data training set, the second video data training set being provided with labels;
a feature extraction module: used for constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
a decoding module: used for constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set;
a parameter adjusting module: used for constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
The embodiments of the invention have the following beneficial effects. The invention provides a continuous sign language recognition method and device based on a ResNet34 network-attention mechanism, comprising the following steps: S1: acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to acquire a second video data training set, the second video data training set being provided with labels; S2: constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set; S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set; S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters. The method and device provided by the invention can alleviate the technical problem of overfitting of the neural network structure caused by video redundancy in the prior art and improve the generalization capability of the continuous sign language recognition method.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a continuous sign language recognition method based on the ResNet34 network-attention mechanism according to an embodiment of the present invention;
FIG. 2 is a diagram of a ResNet 34-based network-attention mechanism network architecture according to an embodiment of the present invention;
fig. 3 is a network flow diagram of a continuous sign language recognition method ResNet34 residual module based on a ResNet34 network-attention mechanism according to an embodiment of the present invention;
FIG. 4 is a network structure diagram of a ResNet34 network-attention mechanism-based continuous sign language recognition method ResNet34 residual module network according to an embodiment of the present invention;
FIG. 5 is a network diagram of a LSTM basic unit module of a continuous sign language recognition method based on a ResNet34 network-attention mechanism according to an embodiment of the present invention;
fig. 6 is a network structure diagram of a continuous sign language recognition method BiLSTM encoder based on the ResNet34 network-attention mechanism according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, sign language recognition work rarely focuses on multi-modal input; for a neural network fed with a complete sign language sequence, the redundancy of the video easily causes overfitting of the network.
For the convenience of understanding the present embodiment, the method and apparatus for continuous sign language recognition based on the ResNet34 network-attention mechanism disclosed in the present embodiment will be described in detail first.
The first embodiment is as follows:
with reference to fig. 1, an embodiment of the present invention provides a continuous sign language recognition method based on a ResNet34 network-attention mechanism, including:
S1: acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to acquire a second video data training set, wherein the second video data training set is provided with labels;
S2: constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
With reference to fig. 6, S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set. It should be noted that LSTM, as a special RNN, can learn long-distance temporal dependencies. As shown in FIG. 5, at time $t$ the LSTM has three inputs: the network input $x_t$ at the current time, the LSTM output $h_{t-1}$ at the previous time, and the cell state $C_{t-1}$ at the previous time; and two outputs: the LSTM output $h_t$ at the current time and the cell state $C_t$ at the current time. Its operation is controlled by the internal input gate, forget gate and output gate.

With reference to fig. 5, the LSTM has the following characteristics:

1): the forget gate determines how much of the cell state at the previous time needs to be kept to the current time; the forget gate yields $f_t$:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$;

2): the input gate determines how much of the network input at the current time needs to be stored into the cell state; the input gate yields $i_t$, and the candidate cell state yields the temporary state $\tilde{C}_t$ at the current time. Combining the cell state $C_{t-1}$ of the previous LSTM cell, the forget gate output $f_t$, the input gate output $i_t$ and the temporary state $\tilde{C}_t$, the cell state $C_t$ at the current time is obtained:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$;

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$;

$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$;

3): the output gate controls how much of the current cell state needs to be output to the current output value; the output gate yields $o_t$, and combining the cell state $C_t$ at the current time with $o_t$ gives the final output $h_t$:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$;

$h_t = o_t * \tanh(C_t)$;
BiLSTM is a combination of a forward LSTM and a backward LSTM, used to obtain long-range context information. Specifically, the reversed input sequence is processed in the same LSTM manner, and finally the result of the forward LSTM and the result of the backward LSTM are stacked.

Further, the PR attention mechanism is integrated with the ResNet34 network topology to extract high-dimensional sign language video feature vectors, and the extracted feature vectors are fed to the BiLSTM-LSTM encoder-decoder. Specifically, the sign language frames $X = \{x_1, x_2, \ldots, x_i, \ldots, x_T\}$ are taken as the input of the network obtained after integrating the PR attention mechanism with the ResNet34 network topology, so as to extract the feature information of the second video data set.

With reference to fig. 6, the output of this network is denoted $E = \{e_1, e_2, \ldots, e_i, \ldots, e_T\}$, where $x_i \in \mathbb{R}^{C \times H \times W}$, $e_i \in \mathbb{R}^{C'}$, and $T$ represents the number of frames of the sign language video. Stacking the outputs of the feature extraction network yields the feature matrix $E \in \mathbb{R}^{T \times C'}$, which serves as the input of the BiLSTM encoder;
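For illustration, the following PyTorch-style sketch shows how per-frame features can be stacked over T frames and fed to a BiLSTM encoder as described above; the module name, hidden sizes and tensor shapes are assumptions for this sketch, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Minimal sketch of a BiLSTM encoder over stacked per-frame features (assumed sizes)."""
    def __init__(self, feat_dim=512, hidden_dim=256, num_layers=2):
        super().__init__()
        # bidirectional LSTM: forward and backward outputs are concatenated, giving 2*hidden_dim
        self.bilstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim,
                              num_layers=num_layers, batch_first=True, bidirectional=True)

    def forward(self, frame_feats):
        # frame_feats: [batch, T, feat_dim] -- per-frame vectors e_1..e_T stacked over time
        encoded, _ = self.bilstm(frame_feats)   # encoded: [batch, T, 2*hidden_dim]
        return encoded

# usage sketch: T=64 key frames, per-frame feature dimension 512 (assumed)
feats = torch.randn(2, 64, 512)
print(BiLSTMEncoder()(feats).shape)  # torch.Size([2, 64, 512])
```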
S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
Preferably, the method further comprises:
acquiring a first video data test set, and testing the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology.
Preferably, the step of extracting the key frames of the first video data training set by using a KFE clustering algorithm to obtain a second video data training set includes:
acquiring an initial threshold, a frame set of the first video data training set and cluster centroids of all clusters;
acquiring frames of the first video data training set based on the frame set of the first video data training set, and acquiring the closest distance from the frames of the first video data training set to a cluster centroid based on the cluster centroids of all the clusters;
determining whether a closest distance of a frame of the first training set of video data to a cluster centroid is less than an initial threshold;
if so, classifying the frames of the first video data training set into a cluster centroid class with the closest distance, removing the frames of the first video data training set from the frame set of the first video data training set, and executing the steps of acquiring an initial threshold, the frame set of the first video data training set and cluster centroids of all clusters;
if not, defining the frames of the first video data training set to be in a new category, removing the frames of the first video data training set from the frame set of the first video data training set, and executing the steps of obtaining an initial threshold value, the frame set of the first video data training set and cluster centroids of all clusters.
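A minimal sketch of the threshold-based key-frame clustering steps above; the distance metric (Euclidean distance on flattened pixels), the threshold value, the running-mean centroid update and the choice of the member closest to each centroid as the key frame are all assumptions for illustration, not details specified by the patent:

```python
import numpy as np

def kfe_keyframes(frames, threshold=50.0):
    """Threshold-based clustering sketch: each frame joins its nearest cluster
    if the distance is below the threshold, otherwise it starts a new cluster.
    One representative frame per cluster is kept as a key frame."""
    centroids, members = [], []              # running centroids and member lists
    for idx, frame in enumerate(frames):
        vec = frame.astype(np.float32).ravel()
        if centroids:
            dists = [np.linalg.norm(vec - c) for c in centroids]
            k = int(np.argmin(dists))
            if dists[k] < threshold:         # close enough: assign to the nearest cluster
                members[k].append((idx, vec))
                centroids[k] = np.mean([v for _, v in members[k]], axis=0)
                continue
        centroids.append(vec)                # otherwise: open a new cluster
        members.append([(idx, vec)])
    # keep, per cluster, the member frame closest to the centroid as the key frame
    keys = [min(m, key=lambda t: np.linalg.norm(t[1] - c))[0]
            for c, m in zip(centroids, members)]
    return sorted(keys)
```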
With reference to fig. 2 to 4, preferably, the ResNet34 network topology includes an initial layer, a first residual layer, a second residual layer, a third residual layer, a fourth residual layer, and a global average pooling layer;
the first residual layers have 64 convolution kernels each, and there are 3 first residual layers;
the second residual layers have 128 convolution kernels each, and there are 4 second residual layers;
the third residual layers have 256 convolution kernels each, and there are 6 third residual layers;
the fourth residual layers have 512 convolution kernels each, and there are 3 fourth residual layers;
Further, as a CNN backbone network, ResNet has long been widely applied to various computer vision scenarios.
ResNet (Deep Residual Network) takes its name from the deep residual structure. Its most important characteristic is that several residual units with the same parameters are connected to form a BasicBlock, and multiple BasicBlocks, together with a preprocessing layer and a final fully connected classification layer, constitute the ResNet network; the residual unit is shown in FIG. 3. Considering the significant impact of neural network depth, ResNet is chosen herein to address this problem. Specifically, the ResNet34 shown in FIG. 4 is used as the backbone network of the present invention, and spatial and channel attention mechanisms are fused into it, so that the image feature information of the sign language video can be well extracted while the gradient vanishing problem of deep networks is avoided.
In a further aspect, the step of integrating a PR attention mechanism with the ResNet34 network topology comprises:
introducing the PR attention mechanism between the fourth residual layer and a global average pooling layer.
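A structural sketch of this placement, using torchvision's ResNet34 as the backbone; the PRAttention module below is only a stand-in for the PSA+RCC block described later, and the torchvision usage and channel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class PRAttention(nn.Module):
    """Placeholder for the fused PSA (channel) + RCC (spatial) attention block."""
    def __init__(self, channels=512):
        super().__init__()
        self.channels = channels
    def forward(self, x):
        return x  # the real block would reweight channels and spatial positions

class ResNet34PR(nn.Module):
    """ResNet34 backbone with PR attention inserted between layer4 and global average pooling."""
    def __init__(self):
        super().__init__()
        base = resnet34(weights=None)
        # initial layer + the four residual stages (3, 4, 6, 3 blocks with 64/128/256/512 kernels)
        self.stem = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool)
        self.layer1, self.layer2 = base.layer1, base.layer2
        self.layer3, self.layer4 = base.layer3, base.layer4
        self.pr_attention = PRAttention(channels=512)    # inserted before pooling
        self.avgpool = base.avgpool

    def forward(self, x):                      # x: [batch, 3, H, W]
        x = self.stem(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = self.pr_attention(x)               # PR attention between layer4 and avgpool
        return torch.flatten(self.avgpool(x), 1)   # [batch, 512] per-frame feature

print(ResNet34PR()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512])
```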
It should be noted that, in the embodiment provided by the present invention, because the feature map mixes in redundant information, the important regions of spatial hand motion are highlighted herein, with different weights representing different degrees of importance. Since each channel of a feature map acts as a feature detector, the purpose of channel attention is to determine which features are meaningful. Furthermore, the spatial attention module introduced herein focuses on which spatial locations are more meaningful.
Preferably, in the step of combining the PSA channel attention mechanism and the RCC spatial attention mechanism into the PR attention mechanism, the PSA channel attention mechanism is:

$[X_0, X_1, \ldots, X_{S-1}] = \mathrm{Split}(X)$;

$F_i = \mathrm{Conv}(K_i \times K_i, G_i)(X_i)$;

$F = \mathrm{Cat}([F_0, F_1, \ldots, F_{S-1}])$;

where $X \in \mathbb{R}^{C \times W \times H}$ is the first feature map obtained by passing the second video training set through the first four residual layers of the ResNet34 network; $C$, $W$ and $H$ are the channel, width and height of the first feature map; Split equally divides the first feature map $X \in \mathbb{R}^{C \times W \times H}$ into $S$ parts along the channel dimension; $X_i \in \mathbb{R}^{C/S \times W \times H}$ are the equally divided feature maps with $C/S$ channels; $K_i$ are the different convolution kernel sizes; $G_i$ are the group convolution parameters; $F_i \in \mathbb{R}^{C/S \times W \times H}$ are the multi-scale features after multi-scale feature extraction; Cat concatenates the multi-scale features under different receptive fields along the channel dimension; $F \in \mathbb{R}^{C \times W \times H}$ is the feature vector after multi-scale feature concatenation.

The weights of the feature vector after multi-scale feature concatenation are extracted by the following formulas:

$g_i = \mathrm{AvgPool}(F_i)$;

$Z_i = \sigma(W_1\,\delta(W_0(g_i)))$;

$Z = \mathrm{Cat}([Z_0, Z_1, \ldots, Z_{S-1}])$;

where $\mathrm{AvgPool}(\cdot)$ denotes global average pooling; $\sigma(\cdot)$ is the sigmoid activation function; $\delta(\cdot)$ is the ReLU activation function; $g_i \in \mathbb{R}^{C/S \times 1 \times 1}$ is the feature vector obtained by global average pooling of the multi-scale features; $W_0$ and $W_1$ are weight matrices of dimensions $[C/S/r, C/S]$ and $[C/S, C/S/r]$ respectively, where $r$ denotes the reduction ratio; $Z_i$ are the attention weights of the different parts with dimension $[C/S, 1, 1]$; $Z$ is the cross-dimension channel attention feature weight map with dimension $[C, 1, 1]$.

The obtained attention weights are normalized by the following formulas, and a tensor product operation is performed between the weights and the feature vector after multi-scale feature extraction:

$\mathrm{att} = \mathrm{Softmax}(Z)$;

$Y = \mathrm{att} \odot F$;

where att is the normalized channel attention weight;
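The following PyTorch-style sketch illustrates a PSA-style channel attention of the kind described by the formulas above (split into S groups, multi-scale group convolutions, SE-style weighting, softmax normalization); the kernel sizes, group counts and reduction ratio are illustrative assumptions, not the patent's exact settings:

```python
import torch
import torch.nn as nn

class PSAChannelAttention(nn.Module):
    """Sketch of pyramid split attention: split channels into S groups, apply
    multi-scale convolutions, compute SE-style weights per group, then softmax."""
    def __init__(self, channels=512, splits=4, kernels=(3, 5, 7, 9), groups=(1, 4, 8, 16), r=4):
        super().__init__()
        c = channels // splits
        self.splits = splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=g) for k, g in zip(kernels, groups))
        self.se = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1),               # g_i = AvgPool(F_i)
                          nn.Conv2d(c, c // r, 1), nn.ReLU(),     # W_0, delta
                          nn.Conv2d(c // r, c, 1), nn.Sigmoid())  # W_1, sigma
            for _ in range(splits))
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):                                         # x: [B, C, H, W]
        parts = torch.chunk(x, self.splits, dim=1)                # Split(X)
        feats = [conv(p) for conv, p in zip(self.convs, parts)]   # F_i
        f = torch.stack(feats, dim=1)                             # [B, S, C/S, H, W]
        z = torch.stack([se(fi) for se, fi in zip(self.se, feats)], dim=1)  # Z_i
        att = self.softmax(z)                                     # att = Softmax(Z) across splits
        return (att * f).flatten(1, 2)                            # Y = att ⊙ F, back to [B, C, H, W]

print(PSAChannelAttention()(torch.randn(1, 512, 7, 7)).shape)  # torch.Size([1, 512, 7, 7])
```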
It should be noted that, in the embodiment provided by the present invention, the RCC attention mechanism connects the Criss-Cross module in series twice to obtain rich context information, where the Criss-Cross attention mechanism is:

$Q = W_Q Y$;

$K = W_K Y$;

$V = W_V Y$;

where $W_Q$ and $W_K$ are weight matrices of dimension $[C', C]$, and $W_V$ is a weight matrix of dimension $[C, C]$.

The PR attention mechanism is integrated with the ResNet34 network topology to extract the feature information of the second video data set by the following formulas.

An Affinity operation is performed to obtain the relationship between each pixel and the pixels in the same row and column in the feature map of size $[W, H]$:

$D = \mathrm{Affinity}(Q, K)$;

$d_{i,u} = Q_u\,\Omega_{i,u}^{\mathsf{T}}$;

where $Q$ and $K$ are feature maps of dimension $[C', W, H]$; for each position $u$ in the spatial dimension of $Q$ there is a feature vector $Q_u \in \mathbb{R}^{C'}$; meanwhile, the set of feature vectors lying in the same row and column as position $u$ can be extracted from the $K$ matrix as $\Omega_u \in \mathbb{R}^{(H+W-1) \times C'}$, with $\Omega_{i,u}$ being the $i$-th element of $\Omega_u$; $d_{i,u} \in D$ is the degree of correlation between the features $Q_u$ and $\Omega_{i,u}$, $i = 1, \ldots, H+W-1$.

Based on the relationship $D$ between each pixel and the pixels in the same row and column in the feature map of size $[W, H]$, a softmax layer is applied along the channel dimension to compute the attention map $A$:

$A = \mathrm{softmax}(D)$;

An Aggregation operation is performed on the attention map $A$ to collect the context information $Y'$:

$Y'_u = \sum_{i=0}^{H+W-1} A_{i,u}\,\Phi_{i,u} + Y_u$;

where, for each position $u$ in the spatial dimension of $V$, there is a feature vector $V_u \in \mathbb{R}^{C}$, and $\Phi_u \in \mathbb{R}^{(H+W-1) \times C}$ is the set of feature vectors extracted from the $V$ matrix that lie in the same row and column as position $u$; $Y'$ is the captured context information with long-range connections in the vertical and horizontal directions.

However, at this point the pixel-level feature information is still somewhat sparse; therefore, the capture of context information with long-range connections in the vertical and horizontal directions needs to be applied repeatedly, as shown in the following formulas:

$Y' = \mathrm{CrissCross}(Y)$;

$Y'' = \mathrm{CrissCross}(Y')$;

where $Y''$ is the feature vector that has acquired global pixel information.
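A simplified PyTorch-style sketch of applying a criss-cross style attention block twice, as described above. For brevity it computes row attention and column attention with separate softmaxes rather than the joint softmax over the H+W-1 criss-cross positions in the formulas, and the reduction C' = C/8 and the learnable residual scale gamma are assumptions; it is an illustration of the pattern, not the patent's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """Compact criss-cross attention sketch: each position attends to its row and column."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        c_ = channels // reduction
        self.q = nn.Conv2d(channels, c_, 1)        # W_Q
        self.k = nn.Conv2d(channels, c_, 1)        # W_K
        self.v = nn.Conv2d(channels, channels, 1)  # W_V
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):                          # x: [B, C, H, W]
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # row (horizontal) affinity: for every row, attention over the W positions
        q_h = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        k_h = k.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        v_h = v.permute(0, 2, 3, 1).reshape(b * h, w, c)
        a_h = F.softmax(torch.bmm(q_h, k_h.transpose(1, 2)), dim=-1)   # [B*H, W, W]
        # column (vertical) affinity: for every column, attention over the H positions
        q_w = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        k_w = k.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        v_w = v.permute(0, 3, 2, 1).reshape(b * w, h, c)
        a_w = F.softmax(torch.bmm(q_w, k_w.transpose(1, 2)), dim=-1)   # [B*W, H, H]
        out_h = torch.bmm(a_h, v_h).reshape(b, h, w, c).permute(0, 3, 1, 2)
        out_w = torch.bmm(a_w, v_w).reshape(b, w, h, c).permute(0, 3, 2, 1)
        return self.gamma * (out_h + out_w) + x    # aggregation plus residual connection

class RCCAttention(nn.Module):
    """Two passes: Y' = CrissCross(Y), Y'' = CrissCross(Y')."""
    def __init__(self, channels):
        super().__init__()
        self.cc = CrissCrossAttention(channels)
    def forward(self, y):
        return self.cc(self.cc(y))

print(RCCAttention(512)(torch.randn(1, 512, 7, 7)).shape)  # torch.Size([1, 512, 7, 7])
```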
Preferably, the step of decoding the encoded second video data set by using the LSTM-CTC end-to-end network structure topology and the labels of the second video data training set includes:

calculating the CTC loss function of the LSTM-CTC end-to-end network, which specifically comprises the following steps:

defining a many-to-one mapping function $\beta(\cdot)$ to the target sequence $y$, i.e. $y = \beta(\pi)$, so that

$p(y \mid X) = \sum_{\pi \in \beta^{-1}(y)} p(\pi \mid X)$;

where

$p(\pi \mid X) = \prod_{n=1}^{T} y^{n}_{\pi_n}$;

in the formula, $\pi_n$ is the label of path $\pi$ at time $n$, and $y^{n}_{\pi_n}$ is its probability of occurrence at time $n$.

The CTC loss function is:

$L_{\mathrm{CTC}} = -\ln p(y \mid X)$.

Further, CTC is an objective function that integrates all possible alignments between the input and the target sequence. In the data set used in the present invention, the blank tag (-) has been added to the sign language annotations to accurately model the transition between two adjacent sign language words.

The intermediate label path of the input sequence is denoted $\pi = (\pi_1, \pi_2, \ldots, \pi_t, \ldots, \pi_T)$, where $\pi_t \in V \cup \{-\}$; $V$ is the sign language word vocabulary and $-$ is the blank tag.

For a given input $X$, the probability of path $\pi$ is calculated as

$p(\pi \mid X) = \prod_{n=1}^{T} y^{n}_{\pi_n}$,

where $\pi_n$ is the label of $\pi$ at time $n$ and $y^{n}_{\pi_n}$ is its probability of occurrence at time $n$.

Because different segmentations of the sign language annotation tags tend to result in different alignments between the same input sequence and target sequence, CTC defines a many-to-one mapping function $\beta(\cdot)$ to its target sequence $y$, i.e. $y = \beta(\pi)$. The probability of $y$ can be defined as the summed probability of all alignments that match it, calculated as follows:

$p(y \mid X) = \sum_{\pi \in \beta^{-1}(y)} p(\pi \mid X)$.

Thus, the above equation can be turned into the CTC loss function, defined as follows:

$L_{\mathrm{CTC}} = -\ln p(y \mid X)$.
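For illustration, the sketch below shows how such a CTC objective can be computed with PyTorch's built-in nn.CTCLoss over per-frame vocabulary distributions; the vocabulary size, sequence lengths and blank index are assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

# assumed sizes: T=64 time steps, batch of 2, vocabulary of 500 sign words + 1 blank (index 0)
T, batch, vocab = 64, 2, 501
log_probs = torch.randn(T, batch, vocab).log_softmax(dim=-1)  # decoder outputs y^n per time step

targets = torch.randint(1, vocab, (batch, 12))                # ground-truth gloss label sequences
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 12, dtype=torch.long)

# L_CTC = -ln p(y|X), summing over all alignment paths pi with beta(pi) = y
ctc = nn.CTCLoss(blank=0, reduction="mean")
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```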
Preferably, the objective function is constructed by the following formula:

$L = \sum_{(X, y) \in S} L_{\mathrm{CTC}}(X, y) + \lambda \lVert \omega \rVert_{2}$;

where $L$ is the constructed objective function used to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM encoder network topology parameters, and the LSTM-CTC decoder network structure topology; $S$ is the given second video data set; $\lVert \omega \rVert_{2}$ is a regularization term that avoids overfitting; $\lambda$ is the hyper-parameter of the regularization term.
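A training-step sketch of this objective (CTC data term plus an L2 regularization term on the network weights); the value of lambda, the explicit penalty in place of optimizer weight decay, and the model interface are illustrative assumptions:

```python
import torch

def training_step(model, ctc_loss, frames, targets, in_lens, tgt_lens, optimizer, lam=1e-4):
    """One gradient-descent step on L = L_CTC + lambda * ||w||_2 (sketch; `model` is assumed to
    map input frames to per-frame log-probabilities of shape [T, batch, vocab])."""
    log_probs = model(frames)                                         # decoder outputs
    loss = ctc_loss(log_probs, targets, in_lens, tgt_lens)            # CTC data term
    l2 = torch.sqrt(sum(p.pow(2).sum() for p in model.parameters()))  # ||w||_2 over all weights
    objective = loss + lam * l2                                       # regularized objective
    optimizer.zero_grad()
    objective.backward()
    optimizer.step()
    return objective.item()
```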
Preferably, the step of acquiring the first video data test set and testing the network topology obtained after integrating the constructed ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology includes:

obtaining the WER value to represent the recognition accuracy:

$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%$;

where $S$, $I$ (Ins) and $D$ (Del) are the minimum numbers of substitution, insertion and deletion operations, respectively, and $N$ represents the total number of words in the label.

It should be noted that the word error rate (WER) is often used as an evaluation index in speech recognition and machine translation tasks. In the present invention, two experimental indexes, namely the word error rate WER and the Accuracy, are adopted to evaluate the network quality. The model effect can be judged by observing these evaluation indexes, but the invention does not make the network learn according to these indexes, so the CRB-Net designed by the invention is still trained with gradient descent.

WER measures sequence conversion through a combination of substitution, deletion and insertion operations. For the sign language recognition task, a low WER value generally corresponds to high recognition precision, and the WER is calculated by the formula given above.
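A small sketch of computing WER by dynamic-programming edit distance over word sequences, a straightforward implementation of the formula above (variable names and the example sentences are illustrative):

```python
def wer(reference, hypothesis):
    """WER = (S + D + I) / N, computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[n][m] / max(n, 1) * 100.0

print(wer("I LIKE SIGN LANGUAGE", "I SIGN LANGUAGE TODAY"))  # 50.0
```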
Preferably, the step of acquiring the first video data test set and testing the network topology obtained after integrating the constructed ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology includes:

obtaining the Accuracy to represent the recognition accuracy:

$\mathrm{Accuracy} = \frac{N - S - D - I}{N} \times 100\%$;

where $S$, $I$ (Ins) and $D$ (Del) are the minimum numbers of substitution, insertion and deletion operations, respectively, and $N$ represents the total number of words in the label.
Example two:
The invention provides a continuous sign language recognition device based on the ResNet34 network-attention mechanism, comprising:
a video acquisition module: used for acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and key frames of the first video data training set are extracted by adopting a KFE clustering algorithm to acquire a second video data training set, the second video data training set being provided with labels;
a feature extraction module: used for constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
a decoding module: used for constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set;
a parameter adjusting module: used for constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
Example three:
In order to demonstrate the generalization of the proposed network model on test sets under strongly supervised learning, the method of the invention is evaluated on two large sign language data sets, including the CSL data set from the University of Science and Technology of China and the TJUT sign language recognition and translation data set (TJUT-SLRT). The experimental results show that the proposed algorithm achieves high precision in the WER evaluation on the different data sets, which proves that the invention has extremely high generalization ability.
The invention has the following advantages:
1) The algorithm is evaluated on two large-scale sign language data sets, and experimental tests are performed on the test set split from the TJUT-SLRT data set. The results are shown in the following table; the word error rate (WER) of the method proposed in this application is lower than that of the continuous sign language recognition method based on the CBAM attention mechanism, indicating good generalization of the method.

WER results of the proposed method and of the CBAM-attention-based sign language recognition method on the TJUT test data set:

Dataset    ResNet34+CBAM+BiLSTM    ResNet34+PRR+BiLSTM (Ours)
TJUT       11.45%                  11.26%

WER results of the proposed method on the CSL data set:

DataSet    DEV      TEST
CSL        2.01%    1.76%

Accuracy of the proposed method under different settings on the CSL data set:

Method                    Accuracy
CNN+LSTM                  0.873
ResNet18+LSTM             0.905
ResNet34+LSTM             0.926
CNN+BiLSTM                0.896
ResNet18+BiLSTM           0.928
ResNet34+BiLSTM           0.943
PRR-ResNet34+BiLSTM       0.982
2) The use of RGB video and depth video data enables the network to better learn image feature representations;
3) By adopting the KFE clustering algorithm based on the K-means algorithm, the redundancy of the video is reduced, which further alleviates the technical problem of overfitting of the neural network structure and improves the generalization capability of the continuous sign language recognition method;
4) A PR attention module is introduced into the existing ResNet34 network topology, so that the network focuses more on hand features. The PR attention module captures spatial information of different scales in the channel dimension to enrich the feature space and establishes long-range dependencies from global spatial information. Second-order attention is adopted in the spatial dimension to generate feature maps with dense and rich context information; because two consecutive sparse self-attention computations approximate a dense self-attention computation, this not only reduces memory consumption and time complexity while maintaining advanced performance, but also reduces the interference of redundant information among global pixels;
5) End-to-end continuous sign language recognition is achieved through an encoder-decoder network model, which can capture the semantic information of image features in the time dimension while realizing the mapping between source and target sequences of unequal length. From a longitudinal perspective, the hidden layer contains more information from previous nodes; especially for sign language sequences with many video frames, the model has better memory, effectively avoiding the problem of information loss in long sequences.
Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.
The device provided by the embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the device embodiment is not detailed, reference may be made to the corresponding contents in the method embodiments.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as being fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for continuous sign language recognition based on the ResNet34 network-attention mechanism, comprising:
S1: acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to acquire a second video data training set, wherein the second video data training set is provided with labels;
S2: constructing a ResNet34 network topology, fusing a PSA channel attention mechanism and an RCC spatial attention mechanism into a PR attention mechanism, and integrating the PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and decoding the encoded second video data set by adopting an LSTM-CTC end-to-end network structure topology and the labels of the second video data training set;
S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
2. The method of claim 1, further comprising:
acquiring a first video data test set, and testing the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology.
3. The method according to claim 1, wherein the step of extracting key frames of the first training set of video data by using a KFE clustering algorithm to obtain a second training set of video data comprises:
acquiring an initial threshold, a frame set of the first video data training set and cluster centroids of all clusters;
acquiring frames of the first video data training set based on the frame set of the first video data training set, and acquiring the closest distance from the frames of the first video data training set to the cluster centroid based on the cluster centroids of all the clusters;
determining whether a closest distance of a frame of the first training set of video data to a cluster centroid is less than an initial threshold;
if so, classifying the frame of the first video data training set into the class of the closest cluster centroid, removing the frame from the frame set of the first video data training set, and returning to the step of acquiring the initial threshold, the frame set of the first video data training set and the cluster centroids of all clusters;
if not, assigning the frame of the first video data training set to a new class, removing the frame from the frame set of the first video data training set, and returning to the step of acquiring the initial threshold, the frame set of the first video data training set and the cluster centroids of all clusters.
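As an illustration of the clustering loop in claim 3, the sketch below assigns each frame to the nearest cluster centroid when the distance is below the threshold and opens a new cluster otherwise; representing frames as flattened feature vectors, using Euclidean distance, and taking the member closest to each centroid as that cluster's key frame are all assumptions, and the function name extract_key_frames is hypothetical.

import numpy as np

def extract_key_frames(frames, threshold):
    """Threshold-based frame clustering in the spirit of claim 3."""
    feats = frames.astype(float)
    clusters = []     # frame indices belonging to each cluster
    centroids = []    # running centroid of each cluster
    for idx, frame in enumerate(feats):
        if centroids:
            dists = [np.linalg.norm(frame - c) for c in centroids]
            nearest = int(np.argmin(dists))
        if not centroids or dists[nearest] >= threshold:
            # the frame is far from every existing centroid: open a new cluster
            clusters.append([idx])
            centroids.append(frame.copy())
        else:
            # the frame is within the threshold: join the nearest cluster and update its centroid
            clusters[nearest].append(idx)
            centroids[nearest] = feats[clusters[nearest]].mean(axis=0)
    # take the member closest to each centroid as the cluster's key frame (an assumption)
    key_frames = []
    for members, centroid in zip(clusters, centroids):
        gaps = np.linalg.norm(feats[members] - centroid, axis=1)
        key_frames.append(members[int(np.argmin(gaps))])
    return key_frames

video = np.random.rand(120, 2048)                 # 120 frames with 2048-d descriptors (assumed)
print(extract_key_frames(video, threshold=18.0))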
4. The method of claim 1, wherein the ResNet34 network topology comprises an initial layer, a first residual layer, a second residual layer, a third residual layer, a fourth residual layer, and a global average pooling layer;
the number of convolution kernels of the first residual layer is 64, and the number of first residual layers is 3;
the number of convolution kernels of the second residual layer is 128, and the number of second residual layers is 4;
the number of convolution kernels of the third residual layer is 256, and the number of third residual layers is 6;
the number of convolution kernels of the fourth residual layer is 512, and the number of fourth residual layers is 3;
integrating a PR attention mechanism with the ResNet34 network topology includes:
introducing the PR attention mechanism between the fourth residual layer and a global average pooling layer.
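A sketch of the placement described in claim 4, wrapping torchvision's resnet34 (whose four residual stages already contain 3, 4, 6 and 3 blocks with 64, 128, 256 and 512 kernels) and inserting an attention module between the fourth residual stage and global average pooling; the class names are illustrative, and PRAttentionStub is only a placeholder gate, since the actual PR attention is defined in claim 5.

import torch
import torch.nn as nn
from torchvision.models import resnet34

class PRAttentionStub(nn.Module):
    """Placeholder for the PR attention of claim 5; here a simple per-channel gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1),
                                  nn.Sigmoid())
    def forward(self, x):
        return x * self.gate(x)

class PRResNet34(nn.Module):
    """ResNet34 backbone with attention inserted between layer4 and global average pooling."""
    def __init__(self):
        super().__init__()
        net = resnet34(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layers = nn.Sequential(net.layer1, net.layer2, net.layer3, net.layer4)
        self.attention = PRAttentionStub(512)       # between the 4th residual stage and GAP
        self.pool = nn.AdaptiveAvgPool2d(1)
    def forward(self, x):
        x = self.layers(self.stem(x))
        x = self.attention(x)
        return self.pool(x).flatten(1)              # (B, 512) frame feature

features = PRResNet34()(torch.randn(2, 3, 224, 224))
print(features.shape)   # torch.Size([2, 512])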
5. The method of claim 4, wherein, in the step of fusing the PSA channel attention mechanism and the RCC spatial attention mechanism into the PR attention mechanism, the PSA channel attention mechanism is:
[X_0, X_1, …, X_{S-1}] = Split(X);
F_i = Conv(K_i × K_i, G_i)(X_i);
F = Cat([F_0, F_1, …, F_{S-1}]);
X ∈ R^{C×W×H} - the first feature map, obtained by passing the second video data training set through the first four residual layers of the ResNet34 network;
C, W and H - the number of channels, the width and the height of the first feature map;
Split - equally divides the first feature map X ∈ R^{C×W×H} into S parts along the channel dimension;
X_i ∈ R^{C/S×W×H} - the equally divided feature maps with C/S channels;
K_i - the different convolution kernel parameters;
G_i - the group convolution parameters;
F_i ∈ R^{C/S×W×H} - the multi-scale features after multi-scale feature extraction;
Cat - concatenates the multi-scale features under different receptive fields along the channel dimension;
F ∈ R^{C×W×H} - the feature vector after multi-scale feature concatenation;
and extracting the weights of the feature vector after multi-scale feature concatenation by adopting the following formulas:
g_i = AvgPool(F_i);
Z_i = σ(W_1 δ(W_0(g_i)));
Z = Cat([Z_0, Z_1, …, Z_{S-1}]);
AvgPool(·) - the global average pooling operation;
σ(·) - the sigmoid activation function;
δ(·) - the ReLU activation function;
g_i ∈ R^{C/S×1×1} - the feature vector obtained by global average pooling of the multi-scale features;
W_0 and W_1 - weight matrices of dimensions [C/S/r, C/S] and [C/S, C/S/r] respectively, where r denotes the reduction rate;
Z_i - the attention weights of the different parts, with dimension [C/S, 1, 1];
Z - the cross-dimensional channel attention feature weight map, with dimension [C, 1, 1];
normalizing the obtained attention weight by adopting the following formula, and performing tensor product operation on the weight and the feature vector subjected to multi-scale feature extraction:
att=Softmax(Z);
Y=att⊙F;
att - the normalized channel attention weight.
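A minimal PyTorch sketch of the PSA branch as laid out above: Split the input into S channel groups, extract multi-scale features with a different kernel size and group setting per branch, compute SE-style weights Z_i, concatenate, softmax-normalize across the S groups and reweight F. The kernel sizes (3, 5, 7, 9), the group settings, S = 4 and the reduction rate are assumptions for illustration.

import torch
import torch.nn as nn

class PSAChannelAttention(nn.Module):
    """Sketch of the PSA branch: Split -> multi-scale Conv -> Cat -> SE weights -> Softmax -> reweight."""
    def __init__(self, channels, splits=4, reduction=4,
                 kernel_sizes=(3, 5, 7, 9), groups=(1, 4, 8, 16)):
        super().__init__()
        c = channels // splits
        self.splits = splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=g)
            for k, g in zip(kernel_sizes, groups))
        self.se = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1),
                          nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(c // reduction, c, 1), nn.Sigmoid())
            for _ in range(splits))

    def forward(self, x):                                    # x: (B, C, spatial, spatial)
        parts = torch.chunk(x, self.splits, dim=1)           # Split(X) -> X_0 .. X_{S-1}
        feats = [conv(p) for conv, p in zip(self.convs, parts)]   # F_i, multi-scale extraction
        weights = [se(f) for se, f in zip(self.se, feats)]        # Z_i per split
        F = torch.cat(feats, dim=1)                          # F = Cat([F_0 .. F_{S-1}])
        Z = torch.cat(weights, dim=1)                        # Z, shape (B, C, 1, 1)
        B, C, _, _ = Z.shape
        att = torch.softmax(Z.view(B, self.splits, C // self.splits, 1, 1), dim=1).view(B, C, 1, 1)
        return att * F                                       # Y = att reweighting F

y = PSAChannelAttention(512)(torch.randn(2, 512, 7, 7))
print(y.shape)   # torch.Size([2, 512, 7, 7])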
The RCC attention mechanism connects a Criss-Cross module in series twice to obtain rich context information, wherein the Criss-Cross attention mechanism is:
Q = W_Q Y;
K = W_K Y;
V = W_V Y;
W_Q and W_K - weight matrices of dimension [C′, C];
W_V - a weight matrix of dimension [C, C];
and integrating a PR attention mechanism with the ResNet34 network topology to extract feature information for the second video data set using the following formula:
performing an Affinity operation to obtain the relationship between each pixel and the pixels in the same row and column in the feature map of size [W, H]:
D = Affinity(Q, K);
d_{i,u} = Q_u Ω_{i,u}^T;
Affinity - Q and K are both feature maps of dimension [C′, W, H];
Ω_u - for each position u in the spatial dimension of Q there is a feature vector Q_u ∈ R^{C′}, and Ω_u is the set of feature vectors of K located in the same row or column as position u;
wherein Ω_{i,u} ∈ R^{C′} is the i-th element of Ω_u,
d_{i,u} ∈ D - the degree of correlation between the features Q_u and Ω_{i,u}, i = [1, …, H+W-1].
applying a softmax layer on the channel dimension, based on the relationship D between each pixel and the pixels in the same row and column in the feature map of size [W, H], so as to calculate the attention map A:
A=softmax(D);
performing an Aggregation operation on the map A to collect the context information Y′:
Y′_u = Σ_{i=1}^{H+W-1} A_{i,u} Φ_{i,u} + Y_u;
Aggregation - for each position u in the spatial dimension of V there is a feature vector V_u ∈ R^C, and Φ_u ∈ R^{(H+W-1)×C} is the set of feature vectors extracted from the matrix V that are located in the same row and column as position u;
Y′ - the context information captured by the long-range connections in the vertical and horizontal directions;
the Criss-Cross operation capturing the long-range context information in the vertical and horizontal directions is applied twice in series, as shown in the following formulas:
Y′ = CrissCross(Y);
Y″ = CrissCross(Y′);
Y″ - the feature vector that has acquired global pixel information.
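A compact sketch of one Criss-Cross pass used by the RCC mechanism (applied twice, as in Y′ = CrissCross(Y) and Y″ = CrissCross(Y′)): 1x1 convolutions play the role of W_Q, W_K and W_V, the Affinity step scores each position against its own row and column only, softmax gives A, and the Aggregation step gathers V with those weights. The channel reduction factor, the learned residual scale gamma and the unmasked self-position are simplifying assumptions.

import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    """One criss-cross pass: each position attends to its own row and column only."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        c_red = channels // reduction                       # C' in the claim
        self.to_q = nn.Conv2d(channels, c_red, 1)           # plays the role of W_Q
        self.to_k = nn.Conv2d(channels, c_red, 1)           # plays the role of W_K
        self.to_v = nn.Conv2d(channels, channels, 1)        # plays the role of W_V
        self.gamma = nn.Parameter(torch.zeros(1))           # learned residual scale (assumption)

    def forward(self, x):                                   # x: (B, C, H, W)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        B, _, H, W = q.shape
        # Affinity along rows (same h) and columns (same w); the reference formulation
        # additionally masks the duplicated self-position to get H + W - 1 terms.
        e_row = torch.einsum('bchw,bchv->bhwv', q, k)       # (B, H, W, W)
        e_col = torch.einsum('bchw,bcgw->bhwg', q, k)       # (B, H, W, H)
        attn = torch.softmax(torch.cat([e_row, e_col], dim=-1), dim=-1)   # A = softmax(D)
        a_row, a_col = attn[..., :W], attn[..., W:]
        # Aggregation: gather V along the same row and column, weighted by A.
        out = (torch.einsum('bhwv,bchv->bchw', a_row, v) +
               torch.einsum('bhwg,bcgw->bchw', a_col, v))
        return self.gamma * out + x

cc = CrissCrossAttention(512)
y2 = cc(cc(torch.randn(2, 512, 7, 7)))   # applied twice, as in Y'' = CrissCross(CrissCross(Y))
print(y2.shape)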
6. The method of claim 1, wherein the step of decoding the encoded second video data training set by adopting the LSTM-CTC end-to-end network structure topology and the labels of the second video data training set comprises:
calculating a CTC loss function of the LSTM-CTC end-to-end network, which specifically comprises the following steps:
defining a many-to-one mapping function β(·) from a frame-level label path π to its target sequence y:
y = β(π);
wherein the probability of a path π is
p(π|x) = ∏_{n=1}^{N} p(π_n|x);
in the formula, π_n - the label of the path π at time n;
p(π_n|x) - the probability of the label π_n occurring at time n;
the CTC loss function is:
L_CTC = -ln p(y|x) = -ln Σ_{π∈β^{-1}(y)} p(π|x).
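For the CTC term of claim 6, PyTorch's nn.CTCLoss already marginalizes over all paths π with β(π) = y (removing blanks and repeated labels internally), so a sketch only needs per-timestep log-probabilities; the sizes, the dummy gloss labels and the choice of index 0 as the blank are assumptions.

import torch
import torch.nn as nn

T, B, V = 16, 2, 101                       # timesteps, batch size, 100 glosses + CTC blank (assumed)
log_probs = torch.randn(T, B, V, requires_grad=True).log_softmax(-1)   # per-frame log p(π_n | x), (T, B, V)
targets = torch.tensor([[3, 7, 7, 2],      # target gloss sequences y; 0 is reserved for the blank
                        [5, 1, 9, 9]])
input_lengths = torch.full((B,), T)
target_lengths = torch.tensor([4, 3])      # the second sequence uses only its first 3 labels

ctc = nn.CTCLoss(blank=0)                  # sums p(π|x) over all paths with β(π) = y, then takes -ln
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))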
7. The method of claim 6, wherein the objective function is constructed using the following formula:
L = (1/S) Σ_{s=1}^{S} L_CTC(s) + λ‖ω‖_2;
L - the constructed objective function, used to adjust the parameters of the network topology obtained after the ResNet34 network topology and the PR attention mechanism are integrated, the BiLSTM encoder network topology parameters, and the LSTM-CTC decoder network topology parameters;
S - the given size of the second video data training set;
‖ω‖_2 - a regularization term that avoids overfitting;
λ - the hyper-parameter of the regularization term.
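A sketch of how the regularized objective of claim 7 might be assembled in a training step: the CTC data term is averaged over the batch of S samples and a penalty on the weights ω, scaled by λ, is added (in practice the optimizer's weight_decay plays the same role). The stand-in network, the squared form of the penalty and all sizes are assumptions.

import torch
import torch.nn as nn

T, S, V = 16, 2, 101                               # timesteps, samples, glosses + blank (assumed sizes)
model = nn.LSTM(512, 256, batch_first=True)        # stand-in for the trainable network ω
proj = nn.Linear(256, V)
ctc = nn.CTCLoss(blank=0)                          # 'mean' reduction averages the per-sample CTC terms
lam = 1e-4                                         # λ, hyper-parameter of the regularization term

def objective(log_probs, targets, in_lens, tgt_lens):
    """Averaged CTC loss plus an L2-style penalty on the stand-in network's weights."""
    data_term = ctc(log_probs, targets, in_lens, tgt_lens)
    reg_term = sum(p.pow(2).sum() for p in model.parameters())
    return data_term + lam * reg_term

out, _ = model(torch.randn(S, T, 512))
log_probs = proj(out).log_softmax(-1).transpose(0, 1)          # (T, S, V) as CTC expects
targets = torch.tensor([[3, 7, 2, 4], [5, 1, 9, 9]])
loss = objective(log_probs, targets, torch.full((S,), T), torch.tensor([4, 3]))
loss.backward()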
8. The method of claim 2, wherein the step of acquiring a first video data test set and testing the network topology obtained after the ResNet34 network topology and the PR attention mechanism are integrated, the BiLSTM network topology and the LSTM-CTC end-to-end network structure topology comprises:
acquiring a WER value to represent the recognition accuracy:
WER = (S + I + D) / N × 100%;
S (Sub), I (Ins) and D (Del) - the minimum numbers of substitution, insertion and deletion operations, respectively;
N - the total number of words in the label.
9. The method of claim 2, wherein the step of acquiring a first video data test set and testing the network topology obtained after the ResNet34 network topology and the PR attention mechanism are integrated, the BiLSTM network topology and the LSTM-CTC end-to-end network structure topology comprises:
acquiring an Accuracy value to represent the recognition accuracy:
Accuracy = (N - S - I - D) / N × 100%;
S (Sub), I (Ins) and D (Del) - the minimum numbers of substitution, insertion and deletion operations, respectively;
N - the total number of words in the label.
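Claims 8 and 9 both reduce to the minimum edit-operation counts S, I and D between the recognized gloss sequence and the reference label; the dynamic-programming sketch below recovers those counts and evaluates WER = (S + I + D) / N together with an Accuracy of 1 - WER, which is an assumed reading of claim 9 consistent with its variables. The example sentences are invented.

def edit_ops(ref, hyp):
    """Minimum substitution/insertion/deletion counts turning ref into hyp (Levenshtein DP)."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (total cost, S, I, D) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0) for _ in range(H + 1)] for _ in range(R + 1)]
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, 0, i)              # all reference words deleted
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, j, 0)              # all hypothesis words inserted
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
                continue
            sub = dp[i - 1][j - 1]
            ins = dp[i][j - 1]
            dele = dp[i - 1][j]
            dp[i][j] = min(
                (sub[0] + 1, sub[1] + 1, sub[2], sub[3]),     # substitution
                (ins[0] + 1, ins[1], ins[2] + 1, ins[3]),     # insertion
                (dele[0] + 1, dele[1], dele[2], dele[3] + 1)  # deletion
            )
    _, S, I, D = dp[R][H]
    return S, I, D

ref = "MY NAME IS WHAT".split()              # reference gloss label, N = 4 words
hyp = "MY NAME WHAT YOU".split()             # recognized gloss sequence
S, I, D = edit_ops(ref, hyp)
wer = (S + I + D) / len(ref) * 100
print(f"S={S} I={I} D={D}  WER={wer:.1f}%  Accuracy={100 - wer:.1f}%")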
10. A continuous sign language recognition device based on the ResNet34 network-attention mechanism, comprising:
a video acquisition module: for acquiring a first video data training set, wherein the first video data training set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to obtain a second video data training set, wherein the second video data training set is provided with labels;
a feature extraction module: for constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract feature information of the second video data training set;
a decoding module: for constructing a BiLSTM network topology to encode the feature information of the second video data training set, and decoding the encoded second video data training set by adopting an LSTM-CTC end-to-end network structure topology;
a parameter adjusting module: for constructing an objective function to adjust the parameters of the network topology obtained after the ResNet34 network topology and the PR attention mechanism are integrated, the BiLSTM network topology parameters and the LSTM-CTC end-to-end network structure topology parameters.
CN202210709795.8A 2022-06-23 2022-06-23 Continuous sign language recognition method and device based on ResNet34 network-attention mechanism Pending CN114943990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210709795.8A CN114943990A (en) 2022-06-23 2022-06-23 Continuous sign language recognition method and device based on ResNet34 network-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210709795.8A CN114943990A (en) 2022-06-23 2022-06-23 Continuous sign language recognition method and device based on ResNet34 network-attention mechanism

Publications (1)

Publication Number Publication Date
CN114943990A true CN114943990A (en) 2022-08-26

Family

ID=82910292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210709795.8A Pending CN114943990A (en) 2022-06-23 2022-06-23 Continuous sign language recognition method and device based on ResNet34 network-attention mechanism

Country Status (1)

Country Link
CN (1) CN114943990A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424275A (en) * 2022-08-30 2022-12-02 青岛励图高科信息技术有限公司 Fishing boat brand identification method and system based on deep learning technology
CN115424275B (en) * 2022-08-30 2024-02-02 青岛励图高科信息技术有限公司 Fishing boat license plate identification method and system based on deep learning technology
CN117725528A (en) * 2024-01-30 2024-03-19 中原工学院 Depth feature fusion-based personnel action recognition method in industrial scene

Similar Documents

Publication Publication Date Title
CN109284506B (en) User comment emotion analysis system and method based on attention convolution neural network
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN109670576B (en) Multi-scale visual attention image description method
CN114943990A (en) Continuous sign language recognition method and device based on ResNet34 network-attention mechanism
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN109508686B (en) Human behavior recognition method based on hierarchical feature subspace learning
CN111475622A (en) Text classification method, device, terminal and storage medium
Dandıl et al. Real-time facial emotion classification using deep learning
Lopes et al. An AutoML-based approach to multimodal image sentiment analysis
Jasani et al. Skeleton based zero shot action recognition in joint pose-language semantic space
CN114663915A (en) Image human-object interaction positioning method and system based on Transformer model
CN116110089A (en) Facial expression recognition method based on depth self-adaptive metric learning
Yang et al. Event camera data pre-training
Das et al. Determining attention mechanism for visual sentiment analysis of an image using svm classifier in deep learning based architecture
Ahammad et al. Recognizing Bengali sign language gestures for digits in real time using convolutional neural network
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
Xing et al. Understanding spatio-temporal relations in human-object interaction using pyramid graph convolutional network
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN110298331A (en) A kind of testimony of a witness comparison method
CN116680407A (en) Knowledge graph construction method and device
Li et al. Multiple instance discriminative dictionary learning for action recognition
Xiao et al. Multi-modal sign language recognition with enhanced spatiotemporal representation
Katti et al. Character and word level gesture recognition of Indian Sign language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination