CN114943990A - Continuous sign language recognition method and device based on ResNet34 network-attention mechanism


Info

Publication number
CN114943990A
Authority
CN
China
Prior art keywords
video data
network topology
attention mechanism
training set
resnet34
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210709795.8A
Other languages
Chinese (zh)
Inventor
沈丛
杨甜
东天宇
幸高松
陆星元
袁甜甜
陈胜勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202210709795.8A priority Critical patent/CN114943990A/en
Publication of CN114943990A publication Critical patent/CN114943990A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/113Recognition of static hand signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a continuous sign language recognition method and device based on a ResNet34 network-attention mechanism, relating to the technical field of artificial intelligence recognition and comprising the following steps. S1: acquiring a first video data training set, and obtaining a second video data training set by adopting a KFE clustering algorithm; S2: constructing a ResNet34 network topology, fusing a PSA channel attention mechanism and an RCC spatial attention mechanism into a PR attention mechanism, and integrating the PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set; S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set; S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters. The method can alleviate the technical problem of overfitting of the neural network structure caused by video redundancy in the prior art.

Description

Continuous sign language recognition method and device based on ResNet34 network-attention mechanism
Technical Field
The invention relates to the technical field of artificial intelligence recognition, in particular to a continuous sign language recognition method and device based on a ResNet34 network-attention mechanism.
Background
As a communication language specific to deaf and hearing-impaired people, sign language integrates knowledge from the fields of natural language processing and computer vision, and sign language recognition, as a subtask of these fields, has also attracted the attention of researchers. Generally speaking, sign language recognition is divided into isolated word recognition and continuous sign language recognition tasks. Although isolated sign language word recognition has achieved excellent results, it ignores the latent semantic relationships in sign language and the long-term temporal dependencies within a sign language sentence, so the continuous sign language recognition task is receiving increasing attention.
In recent years, various sign language recognition methods have been devised to improve the accuracy of continuous sign language recognition. Early sign language recognition studies often relied on data gloves and other sensor devices to collect gesture motion changes and timing information in real time, and either modeled the temporal information of sign language with traditional hidden Markov models or extracted hand information with conditional random fields.
Later, with the rise of deep learning (DL), more and more researchers performed continuous sign language recognition using neural networks. The rapid development of neural networks opened a new research door for the tasks of sign language recognition, sign language translation, and sign language generation. At present, researchers recognize sign language with models such as convolutional neural networks, recurrent neural networks, graph convolutional neural networks, and skeleton-based models, and some researchers use the connectionist temporal classification (CTC) method to better align sign language videos with the recognized text. However, current sign language recognition work rarely focuses on multi-modal input; for a neural network fed with a complete sign language sequence, the redundancy of the video easily causes overfitting of the network, so it is necessary to design a novel continuous sign language recognition method.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for continuous sign language recognition based on the ResNet34 network-attention mechanism, so as to alleviate the technical problem of overfitting of the neural network structure caused by video redundancy and improve the generalization capability of the continuous sign language recognition method.
The invention relates to a continuous sign language identification method based on a ResNet34 network-attention mechanism, which comprises the following steps:
S1: acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to acquire a second video data training set, wherein the second video data training set is provided with labels;
S2: constructing a ResNet34 network topology, fusing a PSA channel attention mechanism and an RCC spatial attention mechanism into a PR attention mechanism, and integrating the PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set;
S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
Preferably, the method further comprises:
acquiring a first video data test set, and testing the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology.
Preferably, the step of extracting the key frames of the first video data training set by using a KFE clustering algorithm to obtain a second video data training set includes:
acquiring an initial threshold, a frame set of the first video data training set and cluster centroids of all clusters;
acquiring frames of the first video data training set based on the frame set of the first video data training set, and acquiring the closest distance from the frames of the first video data training set to a cluster centroid based on the cluster centroids of all the clusters;
determining whether a closest distance of a frame of the first training set of video data to a cluster centroid is less than an initial threshold;
if so, classifying the frames of the first video data training set into a cluster centroid class with the closest distance, removing the frames of the first video data training set from the frame set of the first video data training set, and executing the steps of acquiring an initial threshold, the frame set of the first video data training set and cluster centroids of all clusters;
if not, defining the frames of the first video data training set to be in a new category, removing the frames of the first video data training set from the frame set of the first video data training set, and executing the steps of obtaining an initial threshold value, the frame set of the first video data training set and cluster centroids of all clusters.
Preferably, the ResNet34 network topology includes an initial layer, a first residual layer, a second residual layer, a third residual layer, a fourth residual layer, and a global average pooling layer;
the first residual layers have 64 convolution kernels each, and there are 3 first residual layers;
the second residual layers have 128 convolution kernels each, and there are 4 second residual layers;
the third residual layers have 256 convolution kernels each, and there are 6 third residual layers;
the fourth residual layers have 512 convolution kernels each, and there are 3 fourth residual layers;
integrating a PR attention mechanism with the ResNet34 network topology includes:
introducing the PR attention mechanism between the fourth residual layer and a global average pooling layer.
Preferably, in the step of combining the PSA channel attention mechanism and the RCC spatial attention mechanism into the PR attention mechanism, the PSA channel attention mechanism is:

$[X_0, X_1, \ldots, X_{S-1}] = \mathrm{Split}(X)$;

$F_i = \mathrm{Conv}(K_i \times K_i, G_i)(X_i)$;

$F = \mathrm{Cat}([F_0, F_1, \ldots, F_{S-1}])$;

where $X \in \mathbb{R}^{C \times W \times H}$ is the first feature map obtained by passing the second video training set through the first four residual layers of the ResNet34 network; $C$, $W$ and $H$ are the channel, width and height of the first feature map; Split equally divides the first feature map $X \in \mathbb{R}^{C \times W \times H}$ into $S$ parts along the channel dimension; $X_i \in \mathbb{R}^{C/S \times W \times H}$ are the equally divided feature maps with $C/S$ channels; $K_i$ are the different convolution kernel sizes; $G_i$ are the group convolution parameters; $F_i \in \mathbb{R}^{C/S \times W \times H}$ are the multi-scale features after multi-scale feature extraction; Cat concatenates the multi-scale features under different receptive fields along the channel dimension; $F \in \mathbb{R}^{C \times W \times H}$ is the feature vector after multi-scale feature concatenation.

The weights of the feature vector after multi-scale feature concatenation are extracted by the following formulas:

$g_i = \mathrm{AvgPool}(F_i)$;

$Z_i = \sigma(W_1\,\delta(W_0(g_i)))$;

$Z = \mathrm{Cat}([Z_0, Z_1, \ldots, Z_{S-1}])$;

where $\mathrm{AvgPool}(\cdot)$ denotes global average pooling; $\sigma(\cdot)$ is the sigmoid activation function; $\delta(\cdot)$ is the ReLU activation function; $g_i \in \mathbb{R}^{C/S \times 1 \times 1}$ is the feature vector obtained by global average pooling of the multi-scale features; $W_0$ and $W_1$ are weight matrices of dimensions $[C/S/r, C/S]$ and $[C/S, C/S/r]$ respectively, where $r$ denotes the reduction ratio; $Z_i$ are the attention weights of the different parts with dimension $[C/S, 1, 1]$; $Z$ is the cross-dimension channel attention feature weight map with dimension $[C, 1, 1]$.

The obtained attention weights are normalized by the following formulas, and a tensor product operation is performed between the weights and the feature vector after multi-scale feature extraction:

$\mathrm{att} = \mathrm{Softmax}(Z)$;

$Y = \mathrm{att} \odot F$;

where att is the normalized channel attention weight.
The RCC attention mechanism connects the Criss-Cross module in series twice to obtain rich context information, where the Criss-Cross attention mechanism is:

$Q = W_Q Y$;

$K = W_K Y$;

$V = W_V Y$;

where $W_Q$ and $W_K$ are weight matrices of dimension $[C', C]$, and $W_V$ is a weight matrix of dimension $[C, C]$.

The PR attention mechanism is integrated with the ResNet34 network topology to extract the feature information of the second video data set by the following formulas.

An Affinity operation is performed to obtain the relationship between each pixel and the pixels in the same row and column in the feature map of size $[W, H]$:

$D = \mathrm{Affinity}(Q, K)$;

$d_{i,u} = Q_u\,\Omega_{i,u}^{\mathsf{T}}$;

where $Q$ and $K$ are feature maps of dimension $[C', W, H]$; for each position $u$ in the spatial dimension of $Q$ there is a feature vector $Q_u \in \mathbb{R}^{C'}$; $\Omega_u \in \mathbb{R}^{(H+W-1) \times C'}$ is the set of feature vectors extracted from $K$ that lie in the same row and column as position $u$, with $\Omega_{i,u}$ being the $i$-th element of $\Omega_u$; $d_{i,u} \in D$ is the degree of correlation between the features $Q_u$ and $\Omega_{i,u}$, $i = 1, \ldots, H+W-1$.

Based on the relationship $D$ between each pixel and the pixels in the same row and column in the feature map of size $[W, H]$, a softmax layer is applied along the channel dimension to compute the attention map $A$:

$A = \mathrm{softmax}(D)$;

An Aggregation operation is performed on the attention map $A$ to collect the context information $Y'$:

$Y'_u = \sum_{i=0}^{H+W-1} A_{i,u}\,\Phi_{i,u} + Y_u$;

where, for each position $u$ in the spatial dimension of $V$, there is a feature vector $V_u \in \mathbb{R}^{C}$, and $\Phi_u \in \mathbb{R}^{(H+W-1) \times C}$ is the set of feature vectors extracted from the $V$ matrix that lie in the same row and column as position $u$; $Y'$ is the captured context information with long-range connections in the vertical and horizontal directions.

The capture of context information with long-range connections in the vertical and horizontal directions is applied repeatedly, as shown in the following formulas:

$Y' = \mathrm{CrissCross}(Y)$;

$Y'' = \mathrm{CrissCross}(Y')$;

where $Y''$ is the feature vector that has acquired global pixel information.
Preferably, the step of decoding the encoded second video data set by using the LSTM-CTC end-to-end network structure topology and the labels of the second video data training set includes:

calculating the CTC loss function of the LSTM-CTC end-to-end network, which specifically comprises the following steps:

defining a many-to-one mapping function $\beta(\cdot)$ to the target sequence $y$, i.e. $y = \beta(\pi)$, so that

$p(y \mid X) = \sum_{\pi \in \beta^{-1}(y)} p(\pi \mid X)$;

where

$p(\pi \mid X) = \prod_{n=1}^{T} y^{n}_{\pi_n}$;

in the formula, $\pi_n$ is the label of path $\pi$ at time $n$, and $y^{n}_{\pi_n}$ is its probability of occurrence at time $n$.

The CTC loss function is:

$L_{\mathrm{CTC}} = -\ln p(y \mid X)$.
Preferably, the objective function is constructed by the following formula:

$L = \sum_{(X, y) \in S} L_{\mathrm{CTC}}(X, y) + \lambda \lVert \omega \rVert_{2}$;

where $L$ is the constructed objective function used to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM encoder network topology parameters, and the LSTM-CTC decoder network topology; $S$ is the given second video data set; $\lVert \omega \rVert_{2}$ is a regularization term that avoids overfitting; $\lambda$ is the hyper-parameter of the regularization term.
Preferably, the step of acquiring the first video data test set and testing the network topology obtained after integrating the constructed ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology includes:

obtaining the WER value to represent the recognition accuracy:

$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%$;

where $S$, $I$ (Ins) and $D$ (Del) are the minimum numbers of substitution, insertion and deletion operations, respectively, and $N$ represents the total number of words in the label.
Preferably, the step of acquiring the first video data test set and testing the network topology obtained after integrating the constructed ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology includes:

obtaining the Accuracy to represent the recognition accuracy:

$\mathrm{Accuracy} = \frac{N - S - D - I}{N} \times 100\%$;

where $S$, $I$ (Ins) and $D$ (Del) are the minimum numbers of substitution, insertion and deletion operations, respectively, and $N$ represents the total number of words in the label.
In another aspect, a continuous sign language recognition device based on the ResNet34 network-attention mechanism comprises:
a video acquisition module: used for acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and key frames of the first video data training set are extracted by adopting a KFE clustering algorithm to acquire a second video data training set, the second video data training set being provided with labels;
a feature extraction module: used for constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
a decoding module: used for constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set;
a parameter adjusting module: used for constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
The embodiments of the invention have the following beneficial effects. The invention provides a continuous sign language recognition method and device based on a ResNet34 network-attention mechanism, comprising the following steps: S1: acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to acquire a second video data training set, the second video data training set being provided with labels; S2: constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set; S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set; S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters. The method and device provided by the invention can alleviate the technical problem of overfitting of the neural network structure caused by video redundancy in the prior art and improve the generalization capability of the continuous sign language recognition method.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a continuous sign language recognition method based on the ResNet34 network-attention mechanism according to an embodiment of the present invention;
FIG. 2 is a diagram of a ResNet 34-based network-attention mechanism network architecture according to an embodiment of the present invention;
fig. 3 is a network flow diagram of a continuous sign language recognition method ResNet34 residual module based on a ResNet34 network-attention mechanism according to an embodiment of the present invention;
FIG. 4 is a network structure diagram of a ResNet34 network-attention mechanism-based continuous sign language recognition method ResNet34 residual module network according to an embodiment of the present invention;
FIG. 5 is a network diagram of a LSTM basic unit module of a continuous sign language recognition method based on a ResNet34 network-attention mechanism according to an embodiment of the present invention;
fig. 6 is a network structure diagram of a continuous sign language recognition method BiLSTM encoder based on the ResNet34 network-attention mechanism according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, sign language recognition work rarely focuses on multi-modal input; for a neural network fed with a complete sign language sequence, the redundancy of the video easily causes overfitting of the network.
For the convenience of understanding the present embodiment, the method and apparatus for continuous sign language recognition based on the ResNet34 network-attention mechanism disclosed in the present embodiment will be described in detail first.
The first embodiment is as follows:
with reference to fig. 1, an embodiment of the present invention provides a continuous sign language recognition method based on a ResNet34 network-attention mechanism, including:
S1: acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to acquire a second video data training set, wherein the second video data training set is provided with labels;
S2: constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
With reference to fig. 6, S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set. It should be noted that LSTM, as a special RNN, can learn long-distance temporal dependencies. As shown in FIG. 5, at time $t$ the LSTM has three inputs: the network input $x_t$ at the current time, the LSTM output $h_{t-1}$ at the previous time, and the cell state $C_{t-1}$ at the previous time; and two outputs: the LSTM output $h_t$ at the current time and the cell state $C_t$ at the current time. Its operation is controlled by the internal input gate, forget gate and output gate.

With reference to fig. 5, the LSTM has the following characteristics:

1): the forget gate determines how much of the cell state at the previous time needs to be kept to the current time; the forget gate yields $f_t$:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$;

2): the input gate determines how much of the network input at the current time needs to be stored into the cell state; the input gate yields $i_t$, and the candidate cell state yields the temporary state $\tilde{C}_t$ at the current time. Combining the cell state $C_{t-1}$ of the previous LSTM cell, the forget gate output $f_t$, the input gate output $i_t$ and the temporary state $\tilde{C}_t$, the cell state $C_t$ at the current time is obtained:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$;

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$;

$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$;

3): the output gate controls how much of the current cell state needs to be output to the current output value; the output gate yields $o_t$, and combining the cell state $C_t$ at the current time with $o_t$ gives the final output $h_t$:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$;

$h_t = o_t * \tanh(C_t)$;
BiLSTM is a combination of a forward LSTM and a backward LSTM, used to obtain long-range context information. Specifically, the reversed input sequence is processed in the same LSTM manner, and finally the result of the forward LSTM and the result of the backward LSTM are stacked.

Further, the PR attention mechanism is integrated with the ResNet34 network topology to extract high-dimensional sign language video feature vectors, and the extracted feature vectors are fed to the BiLSTM-LSTM encoder-decoder. Specifically, the sign language frames $X = \{x_1, x_2, \ldots, x_i, \ldots, x_T\}$ are taken as the input of the network obtained after integrating the PR attention mechanism with the ResNet34 network topology, so as to extract the feature information of the second video data set.

With reference to fig. 6, the output of this network is denoted $E = \{e_1, e_2, \ldots, e_i, \ldots, e_T\}$, where $x_i \in \mathbb{R}^{C \times H \times W}$, $e_i \in \mathbb{R}^{C'}$, and $T$ represents the number of frames of the sign language video. Stacking the outputs of the feature extraction network yields the feature matrix $E \in \mathbb{R}^{T \times C'}$, which serves as the input of the BiLSTM encoder;
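For illustration, the following PyTorch-style sketch shows how per-frame features can be stacked over T frames and fed to a BiLSTM encoder as described above; the module name, hidden sizes and tensor shapes are assumptions for this sketch, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Minimal sketch of a BiLSTM encoder over stacked per-frame features (assumed sizes)."""
    def __init__(self, feat_dim=512, hidden_dim=256, num_layers=2):
        super().__init__()
        # bidirectional LSTM: forward and backward outputs are concatenated, giving 2*hidden_dim
        self.bilstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim,
                              num_layers=num_layers, batch_first=True, bidirectional=True)

    def forward(self, frame_feats):
        # frame_feats: [batch, T, feat_dim] -- per-frame vectors e_1..e_T stacked over time
        encoded, _ = self.bilstm(frame_feats)   # encoded: [batch, T, 2*hidden_dim]
        return encoded

# usage sketch: T=64 key frames, per-frame feature dimension 512 (assumed)
feats = torch.randn(2, 64, 512)
print(BiLSTMEncoder()(feats).shape)  # torch.Size([2, 64, 512])
```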
S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
Preferably, the method further comprises:
acquiring a first video data test set, and testing the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology.
Preferably, the step of extracting the key frames of the first video data training set by using a KFE clustering algorithm to obtain a second video data training set includes:
acquiring an initial threshold, a frame set of the first video data training set and cluster centroids of all clusters;
acquiring frames of the first video data training set based on the frame set of the first video data training set, and acquiring the closest distance from the frames of the first video data training set to a cluster centroid based on the cluster centroids of all the clusters;
determining whether a closest distance of a frame of the first training set of video data to a cluster centroid is less than an initial threshold;
if so, classifying the frames of the first video data training set into a cluster centroid class with the closest distance, removing the frames of the first video data training set from the frame set of the first video data training set, and executing the steps of acquiring an initial threshold, the frame set of the first video data training set and cluster centroids of all clusters;
if not, defining the frames of the first video data training set to be in a new category, removing the frames of the first video data training set from the frame set of the first video data training set, and executing the steps of obtaining an initial threshold value, the frame set of the first video data training set and cluster centroids of all clusters.
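A minimal sketch of the threshold-based key-frame clustering steps above; the distance metric (Euclidean distance on flattened pixels), the threshold value, the running-mean centroid update and the choice of the member closest to each centroid as the key frame are all assumptions for illustration, not details specified by the patent:

```python
import numpy as np

def kfe_keyframes(frames, threshold=50.0):
    """Threshold-based clustering sketch: each frame joins its nearest cluster
    if the distance is below the threshold, otherwise it starts a new cluster.
    One representative frame per cluster is kept as a key frame."""
    centroids, members = [], []              # running centroids and member lists
    for idx, frame in enumerate(frames):
        vec = frame.astype(np.float32).ravel()
        if centroids:
            dists = [np.linalg.norm(vec - c) for c in centroids]
            k = int(np.argmin(dists))
            if dists[k] < threshold:         # close enough: assign to the nearest cluster
                members[k].append((idx, vec))
                centroids[k] = np.mean([v for _, v in members[k]], axis=0)
                continue
        centroids.append(vec)                # otherwise: open a new cluster
        members.append([(idx, vec)])
    # keep, per cluster, the member frame closest to the centroid as the key frame
    keys = [min(m, key=lambda t: np.linalg.norm(t[1] - c))[0]
            for c, m in zip(centroids, members)]
    return sorted(keys)
```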
With reference to fig. 2 to 4, preferably, the ResNet34 network topology includes an initial layer, a first residual layer, a second residual layer, a third residual layer, a fourth residual layer, and a global average pooling layer;
the first residual layers have 64 convolution kernels each, and there are 3 first residual layers;
the second residual layers have 128 convolution kernels each, and there are 4 second residual layers;
the third residual layers have 256 convolution kernels each, and there are 6 third residual layers;
the fourth residual layers have 512 convolution kernels each, and there are 3 fourth residual layers;
Further, as a CNN backbone network, ResNet has long been widely applied to various computer vision scenarios.
ResNet (Deep Residual Network) takes its name from the deep residual structure. Its most important characteristic is that several residual units with the same parameters are connected to form a BasicBlock, and multiple BasicBlocks, together with a preprocessing layer and a final fully connected classification layer, constitute the ResNet network; the residual unit is shown in FIG. 3. Considering the significant impact of neural network depth, ResNet is chosen herein to address this problem. Specifically, the ResNet34 shown in FIG. 4 is used as the backbone network of the present invention, and spatial and channel attention mechanisms are fused into it, so that the image feature information of the sign language video can be well extracted while the gradient vanishing problem of deep networks is avoided.
In a further aspect, the step of integrating a PR attention mechanism with the ResNet34 network topology comprises:
introducing the PR attention mechanism between the fourth residual layer and a global average pooling layer.
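A structural sketch of this placement, using torchvision's ResNet34 as the backbone; the PRAttention module below is only a stand-in for the PSA+RCC block described later, and the torchvision usage and channel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class PRAttention(nn.Module):
    """Placeholder for the fused PSA (channel) + RCC (spatial) attention block."""
    def __init__(self, channels=512):
        super().__init__()
        self.channels = channels
    def forward(self, x):
        return x  # the real block would reweight channels and spatial positions

class ResNet34PR(nn.Module):
    """ResNet34 backbone with PR attention inserted between layer4 and global average pooling."""
    def __init__(self):
        super().__init__()
        base = resnet34(weights=None)
        # initial layer + the four residual stages (3, 4, 6, 3 blocks with 64/128/256/512 kernels)
        self.stem = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool)
        self.layer1, self.layer2 = base.layer1, base.layer2
        self.layer3, self.layer4 = base.layer3, base.layer4
        self.pr_attention = PRAttention(channels=512)    # inserted before pooling
        self.avgpool = base.avgpool

    def forward(self, x):                      # x: [batch, 3, H, W]
        x = self.stem(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = self.pr_attention(x)               # PR attention between layer4 and avgpool
        return torch.flatten(self.avgpool(x), 1)   # [batch, 512] per-frame feature

print(ResNet34PR()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512])
```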
It should be noted that, in the embodiment provided by the present invention, because the feature map mixes in redundant information, the important regions of spatial hand motion are highlighted herein, with different weights representing different degrees of importance. Since each channel of a feature map acts as a feature detector, the purpose of channel attention is to determine which features are meaningful. Furthermore, the spatial attention module introduced herein focuses on which spatial locations are more meaningful.
Preferably, in the step of combining the PSA channel attention mechanism and the RCC spatial attention mechanism into the PR attention mechanism, the PSA channel attention mechanism is:

$[X_0, X_1, \ldots, X_{S-1}] = \mathrm{Split}(X)$;

$F_i = \mathrm{Conv}(K_i \times K_i, G_i)(X_i)$;

$F = \mathrm{Cat}([F_0, F_1, \ldots, F_{S-1}])$;

where $X \in \mathbb{R}^{C \times W \times H}$ is the first feature map obtained by passing the second video training set through the first four residual layers of the ResNet34 network; $C$, $W$ and $H$ are the channel, width and height of the first feature map; Split equally divides the first feature map $X \in \mathbb{R}^{C \times W \times H}$ into $S$ parts along the channel dimension; $X_i \in \mathbb{R}^{C/S \times W \times H}$ are the equally divided feature maps with $C/S$ channels; $K_i$ are the different convolution kernel sizes; $G_i$ are the group convolution parameters; $F_i \in \mathbb{R}^{C/S \times W \times H}$ are the multi-scale features after multi-scale feature extraction; Cat concatenates the multi-scale features under different receptive fields along the channel dimension; $F \in \mathbb{R}^{C \times W \times H}$ is the feature vector after multi-scale feature concatenation.

The weights of the feature vector after multi-scale feature concatenation are extracted by the following formulas:

$g_i = \mathrm{AvgPool}(F_i)$;

$Z_i = \sigma(W_1\,\delta(W_0(g_i)))$;

$Z = \mathrm{Cat}([Z_0, Z_1, \ldots, Z_{S-1}])$;

where $\mathrm{AvgPool}(\cdot)$ denotes global average pooling; $\sigma(\cdot)$ is the sigmoid activation function; $\delta(\cdot)$ is the ReLU activation function; $g_i \in \mathbb{R}^{C/S \times 1 \times 1}$ is the feature vector obtained by global average pooling of the multi-scale features; $W_0$ and $W_1$ are weight matrices of dimensions $[C/S/r, C/S]$ and $[C/S, C/S/r]$ respectively, where $r$ denotes the reduction ratio; $Z_i$ are the attention weights of the different parts with dimension $[C/S, 1, 1]$; $Z$ is the cross-dimension channel attention feature weight map with dimension $[C, 1, 1]$.

The obtained attention weights are normalized by the following formulas, and a tensor product operation is performed between the weights and the feature vector after multi-scale feature extraction:

$\mathrm{att} = \mathrm{Softmax}(Z)$;

$Y = \mathrm{att} \odot F$;

where att is the normalized channel attention weight;
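The following PyTorch-style sketch illustrates a PSA-style channel attention of the kind described by the formulas above (split into S groups, multi-scale group convolutions, SE-style weighting, softmax normalization); the kernel sizes, group counts and reduction ratio are illustrative assumptions, not the patent's exact settings:

```python
import torch
import torch.nn as nn

class PSAChannelAttention(nn.Module):
    """Sketch of pyramid split attention: split channels into S groups, apply
    multi-scale convolutions, compute SE-style weights per group, then softmax."""
    def __init__(self, channels=512, splits=4, kernels=(3, 5, 7, 9), groups=(1, 4, 8, 16), r=4):
        super().__init__()
        c = channels // splits
        self.splits = splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=g) for k, g in zip(kernels, groups))
        self.se = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1),               # g_i = AvgPool(F_i)
                          nn.Conv2d(c, c // r, 1), nn.ReLU(),     # W_0, delta
                          nn.Conv2d(c // r, c, 1), nn.Sigmoid())  # W_1, sigma
            for _ in range(splits))
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):                                         # x: [B, C, H, W]
        parts = torch.chunk(x, self.splits, dim=1)                # Split(X)
        feats = [conv(p) for conv, p in zip(self.convs, parts)]   # F_i
        f = torch.stack(feats, dim=1)                             # [B, S, C/S, H, W]
        z = torch.stack([se(fi) for se, fi in zip(self.se, feats)], dim=1)  # Z_i
        att = self.softmax(z)                                     # att = Softmax(Z) across splits
        return (att * f).flatten(1, 2)                            # Y = att ⊙ F, back to [B, C, H, W]

print(PSAChannelAttention()(torch.randn(1, 512, 7, 7)).shape)  # torch.Size([1, 512, 7, 7])
```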
It should be noted that, in the embodiment provided by the present invention, the RCC attention mechanism connects the Criss-Cross module in series twice to obtain rich context information, where the Criss-Cross attention mechanism is:

$Q = W_Q Y$;

$K = W_K Y$;

$V = W_V Y$;

where $W_Q$ and $W_K$ are weight matrices of dimension $[C', C]$, and $W_V$ is a weight matrix of dimension $[C, C]$.

The PR attention mechanism is integrated with the ResNet34 network topology to extract the feature information of the second video data set by the following formulas.

An Affinity operation is performed to obtain the relationship between each pixel and the pixels in the same row and column in the feature map of size $[W, H]$:

$D = \mathrm{Affinity}(Q, K)$;

$d_{i,u} = Q_u\,\Omega_{i,u}^{\mathsf{T}}$;

where $Q$ and $K$ are feature maps of dimension $[C', W, H]$; for each position $u$ in the spatial dimension of $Q$ there is a feature vector $Q_u \in \mathbb{R}^{C'}$; meanwhile, the set of feature vectors lying in the same row and column as position $u$ can be extracted from the $K$ matrix as $\Omega_u \in \mathbb{R}^{(H+W-1) \times C'}$, with $\Omega_{i,u}$ being the $i$-th element of $\Omega_u$; $d_{i,u} \in D$ is the degree of correlation between the features $Q_u$ and $\Omega_{i,u}$, $i = 1, \ldots, H+W-1$.

Based on the relationship $D$ between each pixel and the pixels in the same row and column in the feature map of size $[W, H]$, a softmax layer is applied along the channel dimension to compute the attention map $A$:

$A = \mathrm{softmax}(D)$;

An Aggregation operation is performed on the attention map $A$ to collect the context information $Y'$:

$Y'_u = \sum_{i=0}^{H+W-1} A_{i,u}\,\Phi_{i,u} + Y_u$;

where, for each position $u$ in the spatial dimension of $V$, there is a feature vector $V_u \in \mathbb{R}^{C}$, and $\Phi_u \in \mathbb{R}^{(H+W-1) \times C}$ is the set of feature vectors extracted from the $V$ matrix that lie in the same row and column as position $u$; $Y'$ is the captured context information with long-range connections in the vertical and horizontal directions.

However, at this point the pixel-level feature information is still somewhat sparse; therefore, the capture of context information with long-range connections in the vertical and horizontal directions needs to be applied repeatedly, as shown in the following formulas:

$Y' = \mathrm{CrissCross}(Y)$;

$Y'' = \mathrm{CrissCross}(Y')$;

where $Y''$ is the feature vector that has acquired global pixel information.
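A simplified PyTorch-style sketch of applying a criss-cross style attention block twice, as described above. For brevity it computes row attention and column attention with separate softmaxes rather than the joint softmax over the H+W-1 criss-cross positions in the formulas, and the reduction C' = C/8 and the learnable residual scale gamma are assumptions; it is an illustration of the pattern, not the patent's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """Compact criss-cross attention sketch: each position attends to its row and column."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        c_ = channels // reduction
        self.q = nn.Conv2d(channels, c_, 1)        # W_Q
        self.k = nn.Conv2d(channels, c_, 1)        # W_K
        self.v = nn.Conv2d(channels, channels, 1)  # W_V
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):                          # x: [B, C, H, W]
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # row (horizontal) affinity: for every row, attention over the W positions
        q_h = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        k_h = k.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        v_h = v.permute(0, 2, 3, 1).reshape(b * h, w, c)
        a_h = F.softmax(torch.bmm(q_h, k_h.transpose(1, 2)), dim=-1)   # [B*H, W, W]
        # column (vertical) affinity: for every column, attention over the H positions
        q_w = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        k_w = k.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        v_w = v.permute(0, 3, 2, 1).reshape(b * w, h, c)
        a_w = F.softmax(torch.bmm(q_w, k_w.transpose(1, 2)), dim=-1)   # [B*W, H, H]
        out_h = torch.bmm(a_h, v_h).reshape(b, h, w, c).permute(0, 3, 1, 2)
        out_w = torch.bmm(a_w, v_w).reshape(b, w, h, c).permute(0, 3, 2, 1)
        return self.gamma * (out_h + out_w) + x    # aggregation plus residual connection

class RCCAttention(nn.Module):
    """Two passes: Y' = CrissCross(Y), Y'' = CrissCross(Y')."""
    def __init__(self, channels):
        super().__init__()
        self.cc = CrissCrossAttention(channels)
    def forward(self, y):
        return self.cc(self.cc(y))

print(RCCAttention(512)(torch.randn(1, 512, 7, 7)).shape)  # torch.Size([1, 512, 7, 7])
```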
Preferably, the step of decoding the encoded second video data set by using the LSTM-CTC end-to-end network structure topology and the labels of the second video data training set includes:

calculating the CTC loss function of the LSTM-CTC end-to-end network, which specifically comprises the following steps:

defining a many-to-one mapping function $\beta(\cdot)$ to the target sequence $y$, i.e. $y = \beta(\pi)$, so that

$p(y \mid X) = \sum_{\pi \in \beta^{-1}(y)} p(\pi \mid X)$;

where

$p(\pi \mid X) = \prod_{n=1}^{T} y^{n}_{\pi_n}$;

in the formula, $\pi_n$ is the label of path $\pi$ at time $n$, and $y^{n}_{\pi_n}$ is its probability of occurrence at time $n$.

The CTC loss function is:

$L_{\mathrm{CTC}} = -\ln p(y \mid X)$.

Further, CTC is an objective function that integrates all possible alignments between the input and the target sequence. In the data set used in the present invention, the blank tag (-) has been added to the sign language annotations to accurately model the transition between two adjacent sign language words.

The intermediate label path of the input sequence is denoted $\pi = (\pi_1, \pi_2, \ldots, \pi_t, \ldots, \pi_T)$, where $\pi_t \in V \cup \{-\}$; $V$ is the sign language word vocabulary and $-$ is the blank tag.

For a given input $X$, the probability of path $\pi$ is calculated as

$p(\pi \mid X) = \prod_{n=1}^{T} y^{n}_{\pi_n}$,

where $\pi_n$ is the label of $\pi$ at time $n$ and $y^{n}_{\pi_n}$ is its probability of occurrence at time $n$.

Because different segmentations of the sign language annotation tags tend to result in different alignments between the same input sequence and target sequence, CTC defines a many-to-one mapping function $\beta(\cdot)$ to its target sequence $y$, i.e. $y = \beta(\pi)$. The probability of $y$ can be defined as the summed probability of all alignments that match it, calculated as follows:

$p(y \mid X) = \sum_{\pi \in \beta^{-1}(y)} p(\pi \mid X)$.

Thus, the above equation can be turned into the CTC loss function, defined as follows:

$L_{\mathrm{CTC}} = -\ln p(y \mid X)$.
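For illustration, the sketch below shows how such a CTC objective can be computed with PyTorch's built-in nn.CTCLoss over per-frame vocabulary distributions; the vocabulary size, sequence lengths and blank index are assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

# assumed sizes: T=64 time steps, batch of 2, vocabulary of 500 sign words + 1 blank (index 0)
T, batch, vocab = 64, 2, 501
log_probs = torch.randn(T, batch, vocab).log_softmax(dim=-1)  # decoder outputs y^n per time step

targets = torch.randint(1, vocab, (batch, 12))                # ground-truth gloss label sequences
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 12, dtype=torch.long)

# L_CTC = -ln p(y|X), summing over all alignment paths pi with beta(pi) = y
ctc = nn.CTCLoss(blank=0, reduction="mean")
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```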
Preferably, the objective function is constructed by the following formula:

$L = \sum_{(X, y) \in S} L_{\mathrm{CTC}}(X, y) + \lambda \lVert \omega \rVert_{2}$;

where $L$ is the constructed objective function used to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM encoder network topology parameters, and the LSTM-CTC decoder network structure topology; $S$ is the given second video data set; $\lVert \omega \rVert_{2}$ is a regularization term that avoids overfitting; $\lambda$ is the hyper-parameter of the regularization term.
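A training-step sketch of this objective (CTC data term plus an L2 regularization term on the network weights); the value of lambda, the explicit penalty in place of optimizer weight decay, and the model interface are illustrative assumptions:

```python
import torch

def training_step(model, ctc_loss, frames, targets, in_lens, tgt_lens, optimizer, lam=1e-4):
    """One gradient-descent step on L = L_CTC + lambda * ||w||_2 (sketch; `model` is assumed to
    map input frames to per-frame log-probabilities of shape [T, batch, vocab])."""
    log_probs = model(frames)                                         # decoder outputs
    loss = ctc_loss(log_probs, targets, in_lens, tgt_lens)            # CTC data term
    l2 = torch.sqrt(sum(p.pow(2).sum() for p in model.parameters()))  # ||w||_2 over all weights
    objective = loss + lam * l2                                       # regularized objective
    optimizer.zero_grad()
    objective.backward()
    optimizer.step()
    return objective.item()
```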
Preferably, the step of acquiring the first video data test set and testing the network topology obtained after integrating the constructed ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology includes:

obtaining the WER value to represent the recognition accuracy:

$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%$;

where $S$, $I$ (Ins) and $D$ (Del) are the minimum numbers of substitution, insertion and deletion operations, respectively, and $N$ represents the total number of words in the label.

It should be noted that the word error rate (WER) is often used as an evaluation index in speech recognition and machine translation tasks. In the present invention, two experimental indexes, namely the word error rate WER and the Accuracy, are adopted to evaluate the network quality. The model effect can be judged by observing these evaluation indexes, but the invention does not make the network learn according to these indexes, so the CRB-Net designed by the invention is still trained with gradient descent.

WER measures sequence conversion through a combination of substitution, deletion and insertion operations. For the sign language recognition task, a low WER value generally corresponds to high recognition precision, and the WER is calculated by the formula given above.
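A small sketch of computing WER by dynamic-programming edit distance over word sequences, a straightforward implementation of the formula above (variable names and the example sentences are illustrative):

```python
def wer(reference, hypothesis):
    """WER = (S + D + I) / N, computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[n][m] / max(n, 1) * 100.0

print(wer("I LIKE SIGN LANGUAGE", "I SIGN LANGUAGE TODAY"))  # 50.0
```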
Preferably, the step of acquiring the first video data test set and testing the network topology obtained after integrating the constructed ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology includes:

obtaining the Accuracy to represent the recognition accuracy:

$\mathrm{Accuracy} = \frac{N - S - D - I}{N} \times 100\%$;

where $S$, $I$ (Ins) and $D$ (Del) are the minimum numbers of substitution, insertion and deletion operations, respectively, and $N$ represents the total number of words in the label.
Example two:
The invention provides a continuous sign language recognition device based on the ResNet34 network-attention mechanism, comprising:
a video acquisition module: used for acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and key frames of the first video data training set are extracted by adopting a KFE clustering algorithm to acquire a second video data training set, the second video data training set being provided with labels;
a feature extraction module: used for constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
a decoding module: used for constructing a BiLSTM network topology to encode the feature information of the second video data set, and adopting an LSTM-CTC end-to-end network structure topology to decode the encoded second video data set;
a parameter adjusting module: used for constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
Example three:
In order to demonstrate the generalization of the proposed network model on test sets under strongly supervised learning, the method of the invention is evaluated on two large sign language data sets, including the CSL data set from the University of Science and Technology of China and the TJUT sign language recognition and translation data set (TJUT-SLRT). The experimental results show that the proposed algorithm achieves high precision in the WER evaluation on the different data sets, which proves that the invention has extremely high generalization ability.
The invention has the following advantages:
1) The algorithm is evaluated on two large-scale sign language data sets, and experimental tests are performed on the test set split from the TJUT-SLRT data set. The results are shown in the following table; the word error rate (WER) of the method proposed in this application is lower than that of the continuous sign language recognition method based on the CBAM attention mechanism, indicating good generalization of the method.

WER results of the proposed method and of the CBAM-attention-based sign language recognition method on the TJUT test data set:

Dataset    ResNet34+CBAM+BiLSTM    ResNet34+PRR+BiLSTM (Ours)
TJUT       11.45%                  11.26%

WER results of the proposed method on the CSL data set:

DataSet    DEV      TEST
CSL        2.01%    1.76%

Accuracy of the proposed method under different settings on the CSL data set:

Method                    Accuracy
CNN+LSTM                  0.873
ResNet18+LSTM             0.905
ResNet34+LSTM             0.926
CNN+BiLSTM                0.896
ResNet18+BiLSTM           0.928
ResNet34+BiLSTM           0.943
PRR-ResNet34+BiLSTM       0.982
2) The use of RGB video and depth video data enables the network to better learn image feature representations;
3) By adopting the KFE clustering algorithm based on the K-means algorithm, the redundancy of the video is reduced, which further alleviates the technical problem of overfitting of the neural network structure and improves the generalization capability of the continuous sign language recognition method;
4) A PR attention module is introduced into the existing ResNet34 network topology, so that the network focuses more on hand features. The PR attention module captures spatial information of different scales in the channel dimension to enrich the feature space and establishes long-range dependencies from global spatial information. Second-order attention is adopted in the spatial dimension to generate feature maps with dense and rich context information; because two consecutive sparse self-attention computations approximate a dense self-attention computation, this not only reduces memory consumption and time complexity while maintaining advanced performance, but also reduces the interference of redundant information among global pixels;
5) End-to-end continuous sign language recognition is achieved through an encoder-decoder network model, which can capture the semantic information of image features in the time dimension while realizing the mapping between source and target sequences of unequal length. From a longitudinal perspective, the hidden layer contains more information from previous nodes; especially for sign language sequences with many video frames, the model has better memory, effectively avoiding the problem of information loss in long sequences.
Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.
The device provided by the embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the device embodiment is not detailed, reference may be made to the corresponding contents in the method embodiments.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as being fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for continuous sign language recognition based on the ResNet34 network-attention mechanism, comprising:
S1: acquiring a first video data training set, wherein the first video data set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to acquire a second video data training set, wherein the second video data training set is provided with labels;
S2: constructing a ResNet34 network topology, fusing a PSA channel attention mechanism and an RCC spatial attention mechanism into a PR attention mechanism, and integrating the PR attention mechanism with the ResNet34 network topology to extract the feature information of the second video data set;
S3: constructing a BiLSTM network topology to encode the feature information of the second video data set, and decoding the encoded second video data set by adopting an LSTM-CTC end-to-end network structure topology and the labels of the second video data training set;
S4: constructing an objective function to adjust the parameters of the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology parameters, and the LSTM-CTC end-to-end network structure topology parameters.
2. The method of claim 1, further comprising:
acquiring a first video data test set, and testing the network topology obtained after integrating the ResNet34 network topology with the PR attention mechanism, the BiLSTM network topology, and the LSTM-CTC end-to-end network structure topology.
3. The method according to claim 1, wherein the step of extracting key frames of the first training set of video data by using a KFE clustering algorithm to obtain a second training set of video data comprises:
acquiring an initial threshold, a frame set of the first video data training set and cluster centroids of all clusters;
acquiring frames of the first video data training set based on the frame set of the first video data training set, and acquiring the closest distance from the frames of the first video data training set to the cluster centroid based on the cluster centroids of all the clusters;
determining whether a closest distance of a frame of the first training set of video data to a cluster centroid is less than an initial threshold;
if so, classifying the frame of the first video data training set into the class of the closest cluster centroid, removing the frame from the frame set of the first video data training set, and returning to the step of acquiring the initial threshold, the frame set of the first video data training set and the cluster centroids of all clusters;
if not, assigning the frame of the first video data training set to a new class, removing the frame from the frame set of the first video data training set, and returning to the step of acquiring the initial threshold, the frame set of the first video data training set and the cluster centroids of all clusters.
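As an illustration of the clustering loop in claim 3, the sketch below assigns each frame to the nearest cluster centroid when the distance is below the threshold and opens a new cluster otherwise; representing frames as flattened feature vectors, using Euclidean distance, and taking the member closest to each centroid as that cluster's key frame are all assumptions, and the function name extract_key_frames is hypothetical.

import numpy as np

def extract_key_frames(frames, threshold):
    """Threshold-based frame clustering in the spirit of claim 3."""
    feats = frames.astype(float)
    clusters = []     # frame indices belonging to each cluster
    centroids = []    # running centroid of each cluster
    for idx, frame in enumerate(feats):
        if centroids:
            dists = [np.linalg.norm(frame - c) for c in centroids]
            nearest = int(np.argmin(dists))
        if not centroids or dists[nearest] >= threshold:
            # the frame is far from every existing centroid: open a new cluster
            clusters.append([idx])
            centroids.append(frame.copy())
        else:
            # the frame is within the threshold: join the nearest cluster and update its centroid
            clusters[nearest].append(idx)
            centroids[nearest] = feats[clusters[nearest]].mean(axis=0)
    # take the member closest to each centroid as the cluster's key frame (an assumption)
    key_frames = []
    for members, centroid in zip(clusters, centroids):
        gaps = np.linalg.norm(feats[members] - centroid, axis=1)
        key_frames.append(members[int(np.argmin(gaps))])
    return key_frames

video = np.random.rand(120, 2048)                 # 120 frames with 2048-d descriptors (assumed)
print(extract_key_frames(video, threshold=18.0))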
4. The method of claim 1, wherein the ResNet34 network topology comprises an initial layer, a first residual layer, a second residual layer, a third residual layer, a fourth residual layer, and a global average pooling layer;
the number of convolution kernels of the first residual layer is 64, and the number of first residual layers is 3;
the number of convolution kernels of the second residual layer is 128, and the number of second residual layers is 4;
the number of convolution kernels of the third residual layer is 256, and the number of third residual layers is 6;
the number of convolution kernels of the fourth residual layer is 512, and the number of fourth residual layers is 3;
integrating a PR attention mechanism with the ResNet34 network topology includes:
introducing the PR attention mechanism between the fourth residual layer and a global average pooling layer.
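A sketch of the placement described in claim 4, wrapping torchvision's resnet34 (whose four residual stages already contain 3, 4, 6 and 3 blocks with 64, 128, 256 and 512 kernels) and inserting an attention module between the fourth residual stage and global average pooling; the class names are illustrative, and PRAttentionStub is only a placeholder gate, since the actual PR attention is defined in claim 5.

import torch
import torch.nn as nn
from torchvision.models import resnet34

class PRAttentionStub(nn.Module):
    """Placeholder for the PR attention of claim 5; here a simple per-channel gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1),
                                  nn.Sigmoid())
    def forward(self, x):
        return x * self.gate(x)

class PRResNet34(nn.Module):
    """ResNet34 backbone with attention inserted between layer4 and global average pooling."""
    def __init__(self):
        super().__init__()
        net = resnet34(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layers = nn.Sequential(net.layer1, net.layer2, net.layer3, net.layer4)
        self.attention = PRAttentionStub(512)       # between the 4th residual stage and GAP
        self.pool = nn.AdaptiveAvgPool2d(1)
    def forward(self, x):
        x = self.layers(self.stem(x))
        x = self.attention(x)
        return self.pool(x).flatten(1)              # (B, 512) frame feature

features = PRResNet34()(torch.randn(2, 3, 224, 224))
print(features.shape)   # torch.Size([2, 512])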
5. The method of claim 4, wherein, in the step of fusing the PSA channel attention mechanism and the RCC spatial attention mechanism into the PR attention mechanism, the PSA channel attention mechanism is:
[X_0, X_1, …, X_{S-1}] = Split(X);
F_i = Conv(K_i × K_i, G_i)(X_i);
F = Cat([F_0, F_1, …, F_{S-1}]);
X ∈ R^{C×W×H} - the first feature map, obtained by passing the second video data training set through the first four residual layers of the ResNet34 network;
C, W and H - the number of channels, the width and the height of the first feature map;
Split - equally divides the first feature map X ∈ R^{C×W×H} into S parts along the channel dimension;
X_i ∈ R^{C/S×W×H} - the equally divided feature maps with C/S channels;
K_i - the different convolution kernel parameters;
G_i - the group convolution parameters;
F_i ∈ R^{C/S×W×H} - the multi-scale features after multi-scale feature extraction;
Cat - concatenates the multi-scale features under different receptive fields along the channel dimension;
F ∈ R^{C×W×H} - the feature vector after multi-scale feature concatenation;
and extracting the weights of the feature vector after multi-scale feature concatenation by adopting the following formulas:
g_i = AvgPool(F_i);
Z_i = σ(W_1 δ(W_0(g_i)));
Z = Cat([Z_0, Z_1, …, Z_{S-1}]);
AvgPool(·) - the global average pooling operation;
σ(·) - the sigmoid activation function;
δ(·) - the ReLU activation function;
g_i ∈ R^{C/S×1×1} - the feature vector obtained by global average pooling of the multi-scale features;
W_0 and W_1 - weight matrices of dimensions [C/S/r, C/S] and [C/S, C/S/r] respectively, where r denotes the reduction rate;
Z_i - the attention weights of the different parts, with dimension [C/S, 1, 1];
Z - the cross-dimensional channel attention feature weight map, with dimension [C, 1, 1];
normalizing the obtained attention weight by adopting the following formula, and performing tensor product operation on the weight and the feature vector subjected to multi-scale feature extraction:
att=Softmax(Z);
Y=att⊙F;
att - the normalized channel attention weight.
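A minimal PyTorch sketch of the PSA branch as laid out above: Split the input into S channel groups, extract multi-scale features with a different kernel size and group setting per branch, compute SE-style weights Z_i, concatenate, softmax-normalize across the S groups and reweight F. The kernel sizes (3, 5, 7, 9), the group settings, S = 4 and the reduction rate are assumptions for illustration.

import torch
import torch.nn as nn

class PSAChannelAttention(nn.Module):
    """Sketch of the PSA branch: Split -> multi-scale Conv -> Cat -> SE weights -> Softmax -> reweight."""
    def __init__(self, channels, splits=4, reduction=4,
                 kernel_sizes=(3, 5, 7, 9), groups=(1, 4, 8, 16)):
        super().__init__()
        c = channels // splits
        self.splits = splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=g)
            for k, g in zip(kernel_sizes, groups))
        self.se = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1),
                          nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(c // reduction, c, 1), nn.Sigmoid())
            for _ in range(splits))

    def forward(self, x):                                    # x: (B, C, spatial, spatial)
        parts = torch.chunk(x, self.splits, dim=1)           # Split(X) -> X_0 .. X_{S-1}
        feats = [conv(p) for conv, p in zip(self.convs, parts)]   # F_i, multi-scale extraction
        weights = [se(f) for se, f in zip(self.se, feats)]        # Z_i per split
        F = torch.cat(feats, dim=1)                          # F = Cat([F_0 .. F_{S-1}])
        Z = torch.cat(weights, dim=1)                        # Z, shape (B, C, 1, 1)
        B, C, _, _ = Z.shape
        att = torch.softmax(Z.view(B, self.splits, C // self.splits, 1, 1), dim=1).view(B, C, 1, 1)
        return att * F                                       # Y = att reweighting F

y = PSAChannelAttention(512)(torch.randn(2, 512, 7, 7))
print(y.shape)   # torch.Size([2, 512, 7, 7])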
The RCC attention mechanism connects a Criss-Cross module in series twice to obtain rich context information, wherein the Criss-Cross attention mechanism is:
Q = W_Q Y;
K = W_K Y;
V = W_V Y;
W_Q and W_K - weight matrices of dimension [C′, C];
W_V - a weight matrix of dimension [C, C];
and integrating a PR attention mechanism with the ResNet34 network topology to extract feature information for the second video data set using the following formula:
performing an Affinity operation to obtain the relationship between each pixel and the pixels in the same row and column in the feature map of size [W, H]:
D = Affinity(Q, K);
d_{i,u} = Q_u Ω_{i,u}^T;
Affinity - Q and K are both feature maps of dimension [C′, W, H];
Ω_u - for each position u in the spatial dimension of Q there is a feature vector Q_u ∈ R^{C′}, and Ω_u is the set of feature vectors of K located in the same row or column as position u;
wherein Ω_{i,u} ∈ R^{C′} is the i-th element of Ω_u,
d_{i,u} ∈ D - the degree of correlation between the features Q_u and Ω_{i,u}, i = [1, …, H+W-1].
applying a softmax layer on the channel dimension, based on the relationship D between each pixel and the pixels in the same row and column in the feature map of size [W, H], so as to calculate the attention map A:
A=softmax(D);
performing an Aggregation operation on the map A to collect the context information Y′:
Y′_u = Σ_{i=1}^{H+W-1} A_{i,u} Φ_{i,u} + Y_u;
Aggregation - for each position u in the spatial dimension of V there is a feature vector V_u ∈ R^C, and Φ_u ∈ R^{(H+W-1)×C} is the set of feature vectors extracted from the matrix V that are located in the same row and column as position u;
Y′ - the context information captured by the long-range connections in the vertical and horizontal directions;
the Criss-Cross operation capturing the long-range context information in the vertical and horizontal directions is applied twice in series, as shown in the following formulas:
Y′ = CrissCross(Y);
Y″ = CrissCross(Y′);
Y″ - the feature vector that has acquired global pixel information.
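A compact sketch of one Criss-Cross pass used by the RCC mechanism (applied twice, as in Y′ = CrissCross(Y) and Y″ = CrissCross(Y′)): 1x1 convolutions play the role of W_Q, W_K and W_V, the Affinity step scores each position against its own row and column only, softmax gives A, and the Aggregation step gathers V with those weights. The channel reduction factor, the learned residual scale gamma and the unmasked self-position are simplifying assumptions.

import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    """One criss-cross pass: each position attends to its own row and column only."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        c_red = channels // reduction                       # C' in the claim
        self.to_q = nn.Conv2d(channels, c_red, 1)           # plays the role of W_Q
        self.to_k = nn.Conv2d(channels, c_red, 1)           # plays the role of W_K
        self.to_v = nn.Conv2d(channels, channels, 1)        # plays the role of W_V
        self.gamma = nn.Parameter(torch.zeros(1))           # learned residual scale (assumption)

    def forward(self, x):                                   # x: (B, C, H, W)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        B, _, H, W = q.shape
        # Affinity along rows (same h) and columns (same w); the reference formulation
        # additionally masks the duplicated self-position to get H + W - 1 terms.
        e_row = torch.einsum('bchw,bchv->bhwv', q, k)       # (B, H, W, W)
        e_col = torch.einsum('bchw,bcgw->bhwg', q, k)       # (B, H, W, H)
        attn = torch.softmax(torch.cat([e_row, e_col], dim=-1), dim=-1)   # A = softmax(D)
        a_row, a_col = attn[..., :W], attn[..., W:]
        # Aggregation: gather V along the same row and column, weighted by A.
        out = (torch.einsum('bhwv,bchv->bchw', a_row, v) +
               torch.einsum('bhwg,bcgw->bchw', a_col, v))
        return self.gamma * out + x

cc = CrissCrossAttention(512)
y2 = cc(cc(torch.randn(2, 512, 7, 7)))   # applied twice, as in Y'' = CrissCross(CrissCross(Y))
print(y2.shape)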
6. The method of claim 1, wherein the step of decoding the encoded second video data training set by adopting the LSTM-CTC end-to-end network structure topology and the labels of the second video data training set comprises:
calculating a CTC loss function of the LSTM-CTC end-to-end network, which specifically comprises the following steps:
defining a many-to-one mapping function β(·) from a frame-level label path π to its target sequence y:
y = β(π);
wherein the probability of a path π is
p(π|x) = ∏_{n=1}^{N} p(π_n|x);
in the formula, π_n - the label of the path π at time n;
p(π_n|x) - the probability of the label π_n occurring at time n;
the CTC loss function is:
L_CTC = -ln p(y|x) = -ln Σ_{π∈β^{-1}(y)} p(π|x).
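For the CTC term of claim 6, PyTorch's nn.CTCLoss already marginalizes over all paths π with β(π) = y (removing blanks and repeated labels internally), so a sketch only needs per-timestep log-probabilities; the sizes, the dummy gloss labels and the choice of index 0 as the blank are assumptions.

import torch
import torch.nn as nn

T, B, V = 16, 2, 101                       # timesteps, batch size, 100 glosses + CTC blank (assumed)
log_probs = torch.randn(T, B, V, requires_grad=True).log_softmax(-1)   # per-frame log p(π_n | x), (T, B, V)
targets = torch.tensor([[3, 7, 7, 2],      # target gloss sequences y; 0 is reserved for the blank
                        [5, 1, 9, 9]])
input_lengths = torch.full((B,), T)
target_lengths = torch.tensor([4, 3])      # the second sequence uses only its first 3 labels

ctc = nn.CTCLoss(blank=0)                  # sums p(π|x) over all paths with β(π) = y, then takes -ln
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))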
7. The method of claim 6, wherein the objective function is constructed using the following formula:
L = (1/S) Σ_{s=1}^{S} L_CTC(s) + λ‖ω‖_2;
L - the constructed objective function, used to adjust the parameters of the network topology obtained after the ResNet34 network topology and the PR attention mechanism are integrated, the BiLSTM encoder network topology parameters, and the LSTM-CTC decoder network topology parameters;
S - the given size of the second video data training set;
‖ω‖_2 - a regularization term that avoids overfitting;
λ - the hyper-parameter of the regularization term.
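A sketch of how the regularized objective of claim 7 might be assembled in a training step: the CTC data term is averaged over the batch of S samples and a penalty on the weights ω, scaled by λ, is added (in practice the optimizer's weight_decay plays the same role). The stand-in network, the squared form of the penalty and all sizes are assumptions.

import torch
import torch.nn as nn

T, S, V = 16, 2, 101                               # timesteps, samples, glosses + blank (assumed sizes)
model = nn.LSTM(512, 256, batch_first=True)        # stand-in for the trainable network ω
proj = nn.Linear(256, V)
ctc = nn.CTCLoss(blank=0)                          # 'mean' reduction averages the per-sample CTC terms
lam = 1e-4                                         # λ, hyper-parameter of the regularization term

def objective(log_probs, targets, in_lens, tgt_lens):
    """Averaged CTC loss plus an L2-style penalty on the stand-in network's weights."""
    data_term = ctc(log_probs, targets, in_lens, tgt_lens)
    reg_term = sum(p.pow(2).sum() for p in model.parameters())
    return data_term + lam * reg_term

out, _ = model(torch.randn(S, T, 512))
log_probs = proj(out).log_softmax(-1).transpose(0, 1)          # (T, S, V) as CTC expects
targets = torch.tensor([[3, 7, 2, 4], [5, 1, 9, 9]])
loss = objective(log_probs, targets, torch.full((S,), T), torch.tensor([4, 3]))
loss.backward()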
8. The method of claim 2, wherein the step of acquiring a first video data test set and testing the network topology obtained after the ResNet34 network topology and the PR attention mechanism are integrated, the BiLSTM network topology and the LSTM-CTC end-to-end network structure topology comprises:
acquiring a WER value to represent the recognition accuracy:
WER = (S + I + D) / N × 100%;
S (Sub), I (Ins) and D (Del) - the minimum numbers of substitution, insertion and deletion operations, respectively;
N - the total number of words in the label.
9. The method of claim 2, wherein the step of acquiring a first video data test set and testing the network topology obtained after the ResNet34 network topology and the PR attention mechanism are integrated, the BiLSTM network topology and the LSTM-CTC end-to-end network structure topology comprises:
acquiring an Accuracy value to represent the recognition accuracy:
Accuracy = (N - S - I - D) / N × 100%;
S (Sub), I (Ins) and D (Del) - the minimum numbers of substitution, insertion and deletion operations, respectively;
N - the total number of words in the label.
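Claims 8 and 9 both reduce to the minimum edit-operation counts S, I and D between the recognized gloss sequence and the reference label; the dynamic-programming sketch below recovers those counts and evaluates WER = (S + I + D) / N together with an Accuracy of 1 - WER, which is an assumed reading of claim 9 consistent with its variables. The example sentences are invented.

def edit_ops(ref, hyp):
    """Minimum substitution/insertion/deletion counts turning ref into hyp (Levenshtein DP)."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (total cost, S, I, D) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0) for _ in range(H + 1)] for _ in range(R + 1)]
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, 0, i)              # all reference words deleted
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, j, 0)              # all hypothesis words inserted
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
                continue
            sub = dp[i - 1][j - 1]
            ins = dp[i][j - 1]
            dele = dp[i - 1][j]
            dp[i][j] = min(
                (sub[0] + 1, sub[1] + 1, sub[2], sub[3]),     # substitution
                (ins[0] + 1, ins[1], ins[2] + 1, ins[3]),     # insertion
                (dele[0] + 1, dele[1], dele[2], dele[3] + 1)  # deletion
            )
    _, S, I, D = dp[R][H]
    return S, I, D

ref = "MY NAME IS WHAT".split()              # reference gloss label, N = 4 words
hyp = "MY NAME WHAT YOU".split()             # recognized gloss sequence
S, I, D = edit_ops(ref, hyp)
wer = (S + I + D) / len(ref) * 100
print(f"S={S} I={I} D={D}  WER={wer:.1f}%  Accuracy={100 - wer:.1f}%")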
10. A continuous sign language recognition device based on the ResNet34 network-attention mechanism, comprising:
a video acquisition module: for acquiring a first video data training set, wherein the first video data training set comprises RGB videos and depth videos, and extracting key frames of the first video data training set by adopting a KFE clustering algorithm to obtain a second video data training set, wherein the second video data training set is provided with labels;
a feature extraction module: for constructing a ResNet34 network topology and integrating a PR attention mechanism with the ResNet34 network topology to extract feature information of the second video data training set;
a decoding module: for constructing a BiLSTM network topology to encode the feature information of the second video data training set, and decoding the encoded second video data training set by adopting an LSTM-CTC end-to-end network structure topology;
a parameter adjusting module: for constructing an objective function to adjust the parameters of the network topology obtained after the ResNet34 network topology and the PR attention mechanism are integrated, the BiLSTM network topology parameters and the LSTM-CTC end-to-end network structure topology parameters.
CN202210709795.8A 2022-06-23 2022-06-23 Continuous sign language recognition method and device based on ResNet34 network-attention mechanism Pending CN114943990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210709795.8A CN114943990A (en) 2022-06-23 2022-06-23 Continuous sign language recognition method and device based on ResNet34 network-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210709795.8A CN114943990A (en) 2022-06-23 2022-06-23 Continuous sign language recognition method and device based on ResNet34 network-attention mechanism

Publications (1)

Publication Number Publication Date
CN114943990A true CN114943990A (en) 2022-08-26

Family

ID=82910292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210709795.8A Pending CN114943990A (en) 2022-06-23 2022-06-23 Continuous sign language recognition method and device based on ResNet34 network-attention mechanism

Country Status (1)

Country Link
CN (1) CN114943990A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424275A (en) * 2022-08-30 2022-12-02 青岛励图高科信息技术有限公司 Fishing boat brand identification method and system based on deep learning technology
CN115424275B (en) * 2022-08-30 2024-02-02 青岛励图高科信息技术有限公司 Fishing boat license plate identification method and system based on deep learning technology
CN117725528A (en) * 2024-01-30 2024-03-19 中原工学院 Depth feature fusion-based personnel action recognition method in industrial scene

Similar Documents

Publication Publication Date Title
CN109284506B (en) User comment emotion analysis system and method based on attention convolution neural network
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN109670576B (en) Multi-scale visual attention image description method
CN114943990A (en) Continuous sign language recognition method and device based on ResNet34 network-attention mechanism
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN109508686B (en) Human behavior recognition method based on hierarchical feature subspace learning
CN111475622A (en) Text classification method, device, terminal and storage medium
Dandıl et al. Real-time facial emotion classification using deep learning
Lopes et al. An AutoML-based approach to multimodal image sentiment analysis
Jasani et al. Skeleton based zero shot action recognition in joint pose-language semantic space
CN114663915A (en) Image human-object interaction positioning method and system based on Transformer model
CN116110089A (en) Facial expression recognition method based on depth self-adaptive metric learning
Yang et al. Event camera data pre-training
Das et al. Determining attention mechanism for visual sentiment analysis of an image using svm classifier in deep learning based architecture
Ahammad et al. Recognizing Bengali sign language gestures for digits in real time using convolutional neural network
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
Xing et al. Understanding spatio-temporal relations in human-object interaction using pyramid graph convolutional network
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN110298331A (en) A kind of testimony of a witness comparison method
CN116680407A (en) Knowledge graph construction method and device
Li et al. Multiple instance discriminative dictionary learning for action recognition
Xiao et al. Multi-modal sign language recognition with enhanced spatiotemporal representation
Katti et al. Character and word level gesture recognition of Indian Sign language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination