CN113537024B - Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism - Google Patents

Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism

Info

Publication number
CN113537024B
CN113537024B
Authority
CN
China
Prior art keywords
sign language
network
neural network
layer
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110773432.6A
Other languages
Chinese (zh)
Other versions
CN113537024A (en)
Inventor
袁甜甜
周乐员
张剑华
陈胜勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202110773432.6A priority Critical patent/CN113537024B/en
Publication of CN113537024A publication Critical patent/CN113537024A/en
Application granted granted Critical
Publication of CN113537024B publication Critical patent/CN113537024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism adopts an encoder-decoder neural network fused by the multi-layer attention fusion mechanism, combined with a Transformer language model, to perform recognition from continuous sign language video to continuous sign language and to generate translated sentences. Pose information is extracted by a transfer-learning convolution module pre-trained on the large-scale image dataset ImageNet; a bidirectional gated recurrent network and a multi-layer residual stacked gated recurrent network perform temporal feature encoding; bottom-layer semantics and high-dimensional features of sign language grammar are fused by the multi-layer temporal attention fusion mechanism; a sign language recognition sentence expressing the meaning of the signer's actions is obtained by greedy decoding and a teacher-forcing training method; and the difficulty of recognizing and translating continuous sign language video is addressed by a language model. The invention can promote communication between hearing people and sign language users.

Description

Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism
Technical Field
The invention relates to the technical fields of computer vision, artificial intelligence, data mining, natural language processing, deep learning and the like. In particular, it relates to a weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism.
Background
Computer vision is a technology that enables a computer or machine to perceive as human eyes do, and is widely applied in graphics and imaging, three-dimensional reconstruction, target tracking, face detection and video understanding. Natural language processing is a technology that enables a computer or machine to reason like a human being, and is widely applied to tasks such as machine translation, reading comprehension, language generation and multi-turn dialogue. Before the advent of deep learning techniques, traditional computer vision and natural language processing relied heavily on manually extracted features and manually defined grammatical rules. As the amount of data has increased and the cost of GPU computation has decreased, deep learning techniques typified by deep neural networks have gradually emerged, and computer vision and natural language techniques based on deep learning have begun to flourish. Because deep learning has strong representation learning capability, a neural network can learn and understand certain knowledge through end-to-end joint training on data and labels alone, without humans manually extracting features or formulating complicated rules. Therefore, by combining deep-learning-based computer vision perception with natural language processing cognition, a weakly supervised neural network algorithm with a multi-layer temporal attention fusion mechanism for continuous sign language video recognition and translation is designed, so that a computer can effectively understand the content expressed by the signer in a sign language video.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention aims to design a weakly supervised neural network algorithm with a multi-layer temporal attention fusion mechanism for continuous sign language video recognition and translation, which solves the difficulty of recognizing and translating continuous sign language video, so that a computer can learn to understand the meaning expressed by the signer and communication between hearing people and sign language users can be promoted.
The technical scheme of the invention is as follows:
a weakly supervised neural network sign language recognition method of a multi-layer time sequence attention fusion mechanism comprises the following steps:
1) Given a sign language video V = (f_1, ..., f_u), for the continuous sign language recognition task the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n), and for the continuous sign language translation task the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u);
Uniform and random frame sampling is performed on each sign language video using the OpenCV library, so that the number of frames of every sign language video is consistent; the corpus label sentences of the sign language videos are segmented into words, and each sign language video is automatically annotated using the Python programming language;
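As an illustrative sketch of this preprocessing step, the following Python code uses the OpenCV library to draw the same fixed number of frames from every video by uniform-and-random sampling; the target frame count, function name and parameters are assumptions for illustration, not values fixed by the invention.

```python
import cv2
import random

def sample_frames(video_path, target_frames=128, seed=None):
    """Uniformly and randomly sample a fixed number of frames from one video.

    The video is split into target_frames equal segments and one frame is
    drawn at random from each segment, so every video ends up with the same
    number of frames (the count itself is an illustrative assumption).
    """
    rng = random.Random(seed)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    bounds = [int(i * total / target_frames) for i in range(target_frames + 1)]
    frames = []
    for i in range(target_frames):
        lo, hi = bounds[i], max(bounds[i] + 1, bounds[i + 1])
        idx = rng.randrange(lo, hi)            # random frame within this segment
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the chosen frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```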
2) Sign language videos of a batch of a specified size are passed frame by frame into the encoder part of the neural network; feature extraction is first performed on each sign language video frame by a pre-trained convolutional neural network module, and effective pose information is then obtained through two residual fully connected layers, which is used as the spatial embedding of the network:
S_u = SpatialEmbedding(f_u) (1)
where f_u denotes a sign language video frame and S_u is the spatial embedding vector obtained after convolutional network feature extraction;
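A minimal PyTorch sketch of the spatial embedding of Eq. (1) is given below, assuming a frozen resnet152 backbone (per the parameter settings given later) followed by two residual fully connected layers; the extra projection from the 2048-dimensional backbone features to the 2600-dimensional embedding, and the torchvision weights API, are added assumptions so that the dimensions line up.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SpatialEmbedding(nn.Module):
    """Frozen pre-trained CNN + two residual fully connected layers (Eq. (1))."""

    def __init__(self, embed_dim=2600):
        super().__init__()
        backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        # Drop the classification head; keep features up to global average pooling.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.cnn.parameters():          # freeze all CNN parameters
            p.requires_grad = False
        self.proj = nn.Linear(2048, embed_dim)   # assumed projection to embed_dim
        self.fc1 = nn.Linear(embed_dim, embed_dim)
        self.fc2 = nn.Linear(embed_dim, embed_dim)

    def forward(self, frames):                   # frames: (batch, 3, H, W)
        feat = self.cnn(frames).flatten(1)       # (batch, 2048) pooled CNN features
        x = self.proj(feat)
        x = x + torch.relu(self.fc1(x))          # residual fully connected layer 1
        x = x + torch.relu(self.fc2(x))          # residual fully connected layer 2
        return x                                 # S_u, the spatial embedding
```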
The spatial embedding vectors of the sign language video contain rich feature information. They are input into the next module, a bidirectional gated recurrent network, which effectively models the features of the time-dimension sequence of sign language video frames and obtains sign language action context information B_u through forward and backward bidirectional modeling; B_u is then passed through a three-layer residual stacked unidirectional gated recurrent network to obtain higher-dimensional abstract information E_u;
Through the above operations, the encoder part of the neural network performs spatio-temporal encoding of the sign language video to obtain a hidden vector h_u, and h_u is passed to the decoder part of the neural network; the decoder network combines the h_u vector with the C_mix vector obtained by the multi-layer temporal attention fusion mechanism, obtains the recognized sign language word at each time step of the multi-layer residual gated recurrent network, and finally combines the words into a complete sign language sentence;
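The following is a sketch of the temporal encoding just described, assuming the hidden sizes given later (1300 per direction for the bidirectional gated recurrent network, 2600 for each residual stacked layer); module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """BiGRU context (B_u) + three-layer residual stacked unidirectional GRU (E_u)."""

    def __init__(self, embed_dim=2600, hidden_dim=1300):
        super().__init__()
        self.bigru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)              # output: 2 * 1300 = 2600
        self.layers = nn.ModuleList(
            [nn.GRU(2 * hidden_dim, 2 * hidden_dim, batch_first=True)
             for _ in range(3)])                             # residual stacked GRUs

    def forward(self, S):                                    # S: (batch, T, embed_dim)
        B, _ = self.bigru(S)                                 # B_u: forward/backward context
        E, h_last = B, None
        for gru in self.layers:
            out, h_last = gru(E)
            E = E + out                                      # residual connection
        # E: higher-dimensional abstract features E_u; h_last: encoder hidden vector h_u
        return B, E, h_last
```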
The fusion vector of the multi-layer temporal attention mechanism is obtained as follows. First a score is calculated: the hidden vector h_{n-1} from the previous step of each decoder time step is taken as the query term, and the query term h_{n-1} is operated with E_u and B_u respectively to yield two score vectors, score1 and score2, as follows:
score1(h_{n-1}, E_u) = E_u W h_{n-1}^T (2)
score2(h_{n-1}, B_u) = B_u D h_{n-1}^T (3)
Using the above scoring functions, where W and D are trainable neural network weight parameters, the two scores are then used to obtain the temporal attention weights r and p of the sign language video, which align the sign language video frames with the words; the calculation is as follows:
r_{n,k} = exp(score1(h_{n-1}, E_k)) / Σ_{k'} exp(score1(h_{n-1}, E_{k'})) (4)
p_{n,k} = exp(score2(h_{n-1}, B_k)) / Σ_{k'} exp(score2(h_{n-1}, B_{k'})) (5)
where k denotes the k-th time step in the temporal dimension of the encoder network and n denotes the n-th time step in the temporal dimension of the decoder network; the obtained temporal attention weights r and p of the sign language video are then operated with E_u and B_u respectively to obtain two sign language attention context vectors C_t and C_b, as follows:
C_t = Σ_k r_{n,k} E_k (6)
C_b = Σ_k p_{n,k} B_k (7)
The two context vectors C_t and C_b are then fused to obtain C_mix:
C_mix = Fusion(C_t, C_b) (8)
This attention context vector is called the sign language sequence context fusion vector C_mix.
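A sketch of the multi-layer temporal attention fusion of Eqs. (2)-(8) follows; the scoring functions and softmax weights follow the formulas above, while the concrete fusion step of Eq. (8) is implemented here as concatenation followed by a linear layer and tanh, which is an assumption rather than the form fixed by the invention.

```python
import torch
import torch.nn as nn

class MultiLayerTemporalAttentionFusion(nn.Module):
    """Computes C_mix from decoder state h_{n-1}, BiGRU outputs B and stacked-GRU outputs E."""

    def __init__(self, enc_dim=2600, dec_dim=800):
        super().__init__()
        self.W = nn.Linear(dec_dim, enc_dim, bias=False)    # trainable W of Eq. (2)
        self.D = nn.Linear(dec_dim, enc_dim, bias=False)    # trainable D of Eq. (3)
        self.fuse = nn.Linear(2 * enc_dim, enc_dim)         # assumed fusion for Eq. (8)

    def forward(self, h_prev, E, B):
        # h_prev: (batch, dec_dim); E, B: (batch, T, enc_dim)
        score1 = torch.bmm(E, self.W(h_prev).unsqueeze(2)).squeeze(2)  # Eq. (2): (batch, T)
        score2 = torch.bmm(B, self.D(h_prev).unsqueeze(2)).squeeze(2)  # Eq. (3)
        r = torch.softmax(score1, dim=1)                    # Eq. (4): attention weights r
        p = torch.softmax(score2, dim=1)                    # Eq. (5): attention weights p
        C_t = torch.bmm(r.unsqueeze(1), E).squeeze(1)       # Eq. (6): weighted sum of E
        C_b = torch.bmm(p.unsqueeze(1), B).squeeze(1)       # Eq. (7): weighted sum of B
        C_mix = torch.tanh(self.fuse(torch.cat([C_t, C_b], dim=1)))  # Eq. (8), assumed form
        return C_mix
```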
3) In the decoding stage, decoding starts from the input <BOS> symbol; the <BOS> symbol serves as the start symbol of each network training run and is input into the first time step of the decoder network; at the same time, C_mix is concatenated with the embedded sign language word and input into the decoder at the current time step; the output is obtained after the nonlinear operations of the decoder's four-layer stacked residual gated recurrent network; the word with the maximum probability at the current time step is generated through a fully connected layer; and decoding proceeds cyclically until the <End> symbol is met, at which point the generation of a complete sign language recognition sentence is finished.
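A sketch of this greedy decoding loop is shown below; decoder_step (the four-layer stacked residual gated recurrent network plus the final fully connected layer) and attention (the fusion module above) are assumed callables, and the vocabulary handling is illustrative. During training, teacher forcing would feed the ground-truth previous word instead of the predicted one.

```python
import torch

def greedy_decode(decoder_step, attention, E, B, h0, embed, word2id, id2word, max_len=50):
    """Greedy decoding: start from <BOS>, stop at <End> (illustrative driver loop)."""
    token = torch.tensor([word2id["<BOS>"]])
    h, words = h0, []
    for _ in range(max_len):
        c_mix = attention(h, E, B)                   # fusion vector for this time step
        x = torch.cat([embed(token), c_mix], dim=1)  # concat word embedding with C_mix
        logits, h = decoder_step(x, h)               # residual GRU stack + FC layer
        token = logits.argmax(dim=1)                 # word with maximum probability
        if id2word[token.item()] == "<End>":
            break                                    # stop at the end symbol
        words.append(id2word[token.item()])
    return " ".join(words)                           # the recognized sign language sentence
```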
Further, the language model generates natural language text conforming to spoken-language expression, and language learning is performed using a Transformer as the language model to obtain the result of continuous sign language translation.
In the spatial embedding module of the encoder network, a convolutional neural network pre-trained on ImageNet is used and all of its parameters are frozen; the pre-trained convolutional network is resnet152, and the output of the penultimate layer or of the last layer is used; two 2600-dimensional trainable residual fully connected layers are added after the pre-trained convolutional neural network and are residually connected with the output of the following bidirectional gated recurrent unit module; the hidden units of the encoder's bidirectional gated recurrent units are set to 1300 dimensions, so that after past and future information is concatenated the output is 2600-dimensional, and the hidden unit dimension of each subsequent layer of the gated recurrent network is also 2600, which allows residual connections; at the decoder stage, the word embedding dimension for sign language words is set to 256 and the hidden unit dimension of each gated recurrent network in the decoder is set to 800.
PyTorch's default Adam optimizer and cross-entropy loss function are adopted in the training process, with each batch set to 10; the learning rate schedule has two stages: the first stage uses 0.00004, and after the 8th epoch the learning rate is adjusted to 0.000004 and training continues for 6 more epochs, completing the convergence of the neural network parameters.
The technical conception of the invention is as follows: using the strong representation learning capability of deep learning and a large-scale sign language video dataset, feature modeling of the sign language video is performed with the strong feature extraction capability of convolutional neural networks and the long-sequence modeling capability of gated recurrent networks, combined with the multi-layer temporal attention mechanism fusion technique and the strong translation capability of the Transformer language model, so as to obtain, from continuous sign language video, continuous sign language translation sentences that conform to natural spoken word order. The proposed algorithm aims at learning a sequence-to-sequence mapping: given a sign language video V = (f_1, ..., f_u), for the continuous sign language recognition task the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n), and for the continuous sign language translation task the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u).
The invention has the following beneficial effects: in this method, the task, whether continuous sign language recognition or translation, is a weakly supervised video-to-text task that uses only sentence-level annotation; each word does not need to be individually annotated by temporal segmentation of the video, and through end-to-end deep neural network training the network can recognize and understand the signer's meaning from the sign language video.
Drawings
FIG. 1 is a block diagram of a weakly supervised neural network sign language identification method of a multi-layer temporal attention fusion mechanism.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, a weakly supervised neural network sign language recognition method of a multi-layer time-series attention fusion mechanism includes the following steps:
1) Learn the sequence-to-sequence mapping relation: given a sign language video V = (f_1, ..., f_u), for the continuous sign language recognition task the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n), and for the continuous sign language translation task the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u);
Uniform and random frame sampling is performed on each sign language video using the OpenCV library, so that the number of frames of every sign language video is consistent; the corpus label sentences of the sign language videos are segmented into words, and each sign language video is automatically annotated using the Python programming language;
2) Sign language videos of a batch of a specified size are passed frame by frame into the encoder part of the neural network; feature extraction is first performed on each sign language video frame by a pre-trained convolutional neural network module, and effective pose information is then obtained through two residual fully connected layers, which is used as the spatial embedding of the network:
S_u = SpatialEmbedding(f_u) (1)
where f_u denotes a sign language video frame and S_u is the spatial embedding vector obtained after convolutional network feature extraction;
The spatial embedding vectors of the sign language video contain rich feature information. They are input into the next module, a bidirectional gated recurrent network, which effectively models the features of the time-dimension sequence of sign language video frames and obtains sign language action context information B_u through forward and backward bidirectional modeling; B_u is then passed through a three-layer residual stacked unidirectional gated recurrent network to obtain higher-dimensional abstract information E_u;
Through the above operations, the encoder part of the neural network performs spatio-temporal encoding of the sign language video to obtain a hidden vector h_u, and h_u is passed to the decoder part of the neural network; the decoder network combines the h_u vector with the C_mix vector obtained by the multi-layer temporal attention fusion mechanism, obtains the recognized sign language word at each time step of the multi-layer residual gated recurrent network, and finally combines the words into a complete sign language sentence;
The fusion vector of the multi-layer temporal attention mechanism is obtained as follows. First a score is calculated: the hidden vector h_{n-1} from the previous step of each decoder time step is taken as the query term, and the query term h_{n-1} is operated with E_u and B_u respectively to yield two score vectors, score1 and score2, as follows:
score1(h_{n-1}, E_u) = E_u W h_{n-1}^T (2)
score2(h_{n-1}, B_u) = B_u D h_{n-1}^T (3)
Using the above scoring functions, where W and D are trainable neural network weight parameters, the two scores are then used to obtain the temporal attention weights r and p of the sign language video, which align the sign language video frames with the words; the calculation is as follows:
r_{n,k} = exp(score1(h_{n-1}, E_k)) / Σ_{k'} exp(score1(h_{n-1}, E_{k'})) (4)
p_{n,k} = exp(score2(h_{n-1}, B_k)) / Σ_{k'} exp(score2(h_{n-1}, B_{k'})) (5)
where k denotes the k-th time step in the temporal dimension of the encoder network and n denotes the n-th time step in the temporal dimension of the decoder network; the obtained temporal attention weights r and p of the sign language video are then operated with E_u and B_u respectively to obtain two sign language attention context vectors C_t and C_b, as follows:
C_t = Σ_k r_{n,k} E_k (6)
C_b = Σ_k p_{n,k} B_k (7)
The two context vectors C_t and C_b are then fused to obtain C_mix:
C_mix = Fusion(C_t, C_b) (8)
This attention context vector is called the sign language sequence context fusion vector C_mix.
3) In the decoding stage, decoding starts from the input <BOS> symbol; the <BOS> symbol serves as the start symbol of each network training run and is input into the first time step of the decoder network; at the same time, C_mix is concatenated with the embedded sign language word and input into the decoder at the current time step; the output is obtained after the nonlinear operations of the decoder's four-layer stacked residual gated recurrent network; the word with the maximum probability at the current time step is generated through a fully connected layer; and decoding proceeds cyclically until the <End> symbol is met, at which point the generation of a complete sign language recognition sentence is finished.
Further, the language model aims to generate natural language text that conforms to spoken-language expression. Since the sentences generated by continuous sign language recognition may not conform to spoken descriptions, a Transformer is used as the language model to perform language learning, so as to further obtain the result of continuous sign language translation; a very small set of model parameters is used to train the Transformer network so that sign language recognition sentences are mapped one to one to the translated natural language; and the originally static positional encoding in the Transformer structure is replaced by a dynamic, trainable positional encoding, so that the positional relations between words in a sequence can be learned more easily.
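Below is a sketch of a small Transformer language model with a trainable positional encoding replacing the static one, as described above; the model dimensions, vocabulary sizes and maximum length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SmallTransformerLM(nn.Module):
    """Small Transformer mapping sign language recognition sentences to natural language.

    The static sinusoidal positional encoding is replaced by a trainable
    nn.Embedding over positions (sizes here are illustrative).
    """

    def __init__(self, src_vocab, tgt_vocab, d_model=128, nhead=4, layers=2, max_len=64):
        super().__init__()
        self.src_tok = nn.Embedding(src_vocab, d_model)
        self.tgt_tok = nn.Embedding(tgt_vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)            # trainable positional encoding
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=layers,
                                          num_decoder_layers=layers,
                                          batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def add_pos(self, emb):
        idx = torch.arange(emb.size(1), device=emb.device)
        return emb + self.pos(idx)                           # learned position added to tokens

    def forward(self, src_ids, tgt_ids):
        src = self.add_pos(self.src_tok(src_ids))
        tgt = self.add_pos(self.tgt_tok(tgt_ids))
        mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1)).to(src.device)
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(out)                                 # logits over the target vocabulary
```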
The model is built using the PyTorch deep learning framework, and the network parameters are configured as follows: in the spatial embedding module of the encoder network, a convolutional neural network pre-trained on ImageNet is used and, for ease of training, all of its parameters are frozen; the pre-trained convolutional network is resnet152, and the output of the penultimate layer or of the last layer is used; two 2600-dimensional trainable residual fully connected layers are added after the pre-trained convolutional neural network and are residually connected with the output of the following bidirectional gated recurrent unit module; the hidden units of the encoder's bidirectional gated recurrent units are set to 1300 dimensions, so that after past and future information is concatenated the output is 2600-dimensional, and the hidden unit dimension of each layer of the gated recurrent network is also set to 2600, which allows residual connections; at the decoder stage, the word embedding dimension for sign language words is set to 256 and the hidden unit dimension of each gated recurrent network in the decoder is set to 800.
During training, PyTorch's default Adam optimizer and cross-entropy loss function are used, and each batch is set to 10. The learning rate schedule has two stages: the first stage uses 0.00004, and after the 8th epoch the learning rate is adjusted to 0.000004 and training continues for 6 more epochs, which completes the convergence of the neural network parameters.
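A sketch of this training setup is shown below, assuming a model with a teacher-forced forward pass and a data loader yielding batches of size 10; the padding index used to mask the loss is an added assumption.

```python
import torch
import torch.nn as nn

def train(model, train_loader, pad_id=0):
    """Two-stage schedule described above (model and data loader are assumed)."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)   # stage 1 learning rate
    for epoch in range(14):                                     # 8 epochs + 6 epochs
        if epoch == 8:
            for group in optimizer.param_groups:
                group["lr"] = 4e-6                               # stage 2 learning rate
        for videos, targets in train_loader:                     # batch size 10
            optimizer.zero_grad()
            logits = model(videos, targets)                      # teacher-forced forward pass
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             targets.reshape(-1))
            loss.backward()
            optimizer.step()
```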
Therefore, the weakly supervised neural network method with a multi-layer temporal attention fusion mechanism for continuous sign language video recognition and translation provided by the invention enables the network to recognize and understand the signer's meaning from the sign language video.

Claims (4)

1. A weakly supervised neural network sign language recognition method of a multi-layer time sequence attention fusion mechanism is characterized by comprising the following steps:
1) Given a sign language video V = (f_1, ..., f_u), for the continuous sign language recognition task the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n), and for the continuous sign language translation task the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u);
Uniform and random frame sampling is performed on each sign language video using the OpenCV library, so that the number of frames of every sign language video is consistent; the corpus label sentences of the sign language videos are segmented into words, and each sign language video is automatically annotated using the Python programming language;
2) Sign language videos of a batch of a specified size are passed frame by frame into the encoder part of the neural network; feature extraction is first performed on each sign language video frame by a pre-trained convolutional neural network module, and effective pose information is then obtained through two residual fully connected layers, which is used as the spatial embedding of the network:
S_u = SpatialEmbedding(f_u) (1)
where f_u denotes a sign language video frame and S_u is the spatial embedding vector obtained after convolutional network feature extraction;
The spatial embedding vectors of the sign language video contain rich feature information. They are input into the next module, a bidirectional gated recurrent network, which effectively models the features of the time-dimension sequence of sign language video frames and obtains sign language action context information B_u through forward and backward bidirectional modeling; B_u is then passed through a three-layer residual stacked unidirectional gated recurrent network to obtain higher-dimensional abstract information E_u;
Through the above operations, the encoder part of the neural network performs spatio-temporal encoding of the sign language video to obtain a hidden vector h_u, and h_u is passed to the decoder part of the neural network; the decoder network combines the h_u vector with the C_mix vector obtained by the multi-layer temporal attention fusion mechanism, obtains the recognized sign language word at each time step of the multi-layer residual gated recurrent network, and finally combines the words into a complete sign language sentence;
The fusion vector of the multi-layer temporal attention mechanism is obtained as follows. First a score is calculated: the hidden vector h_{n-1} from the previous step of each decoder time step is taken as the query term, and the query term h_{n-1} is operated with E_u and B_u respectively to yield two score vectors, score1 and score2, as follows:
score1(h_{n-1}, E_u) = E_u W h_{n-1}^T (2)
score2(h_{n-1}, B_u) = B_u D h_{n-1}^T (3)
Using the above scoring functions, where W and D are trainable neural network weight parameters, the two scores are then used to obtain the temporal attention weights r and p of the sign language video, which align the sign language video frames with the words; the calculation is as follows:
r_{n,k} = exp(score1(h_{n-1}, E_k)) / Σ_{k'} exp(score1(h_{n-1}, E_{k'})) (4)
p_{n,k} = exp(score2(h_{n-1}, B_k)) / Σ_{k'} exp(score2(h_{n-1}, B_{k'})) (5)
where k denotes the k-th time step in the temporal dimension of the encoder network and n denotes the n-th time step in the temporal dimension of the decoder network; the obtained temporal attention weights r and p of the sign language video are then operated with E_u and B_u respectively to obtain two sign language attention context vectors C_t and C_b, as follows:
C_t = Σ_k r_{n,k} E_k (6)
C_b = Σ_k p_{n,k} B_k (7)
The two context vectors C_t and C_b are then fused to obtain C_mix:
C_mix = Fusion(C_t, C_b) (8)
This attention context vector is called the sign language sequence context fusion vector C_mix.
3) In the decoding stage, decoding starts from the input <BOS> symbol; the <BOS> symbol serves as the start symbol of each network training run and is input into the first time step of the decoder network; at the same time, C_mix is concatenated with the embedded sign language word and input into the decoder at the current time step; the output is obtained after the nonlinear operations of the decoder's four-layer stacked residual gated recurrent network; the word with the maximum probability at the current time step is generated through a fully connected layer; and decoding proceeds cyclically until the <End> symbol is met, at which point the generation of a complete sign language recognition sentence is finished.
2. The method as claimed in claim 1, wherein the language model generates natural language text conforming to spoken-language expression, language learning is performed by using a Transformer as the language model so as to further obtain the result of continuous sign language translation, and the originally static positional encoding in the Transformer structure is replaced by a dynamically trainable positional encoding.
3. The weakly supervised neural network sign language recognition method of the multi-layer temporal attention fusion mechanism as claimed in claim 1 or 2, characterized in that a language model is built using the PyTorch deep learning framework and the network parameters are configured as follows: in the spatial embedding module of the encoder network, a convolutional neural network pre-trained on ImageNet is used and all of its parameters are frozen; the pre-trained convolutional network is resnet152, and the output of the penultimate layer or of the last layer is used; two 2600-dimensional trainable residual fully connected layers are added after the pre-trained convolutional neural network and are residually connected with the output of the following bidirectional gated recurrent unit module; the hidden units of the encoder's bidirectional gated recurrent units are set to 1300 dimensions, so that after past and future information is concatenated the output is 2600-dimensional, and the hidden unit dimension of each layer of the gated recurrent network is also set to 2600, which allows residual connections; at the decoder stage, the word embedding dimension for sign language words is set to 256 and the hidden unit dimension of each gated recurrent network in the decoder is set to 800.
4. The weakly supervised neural network sign language recognition method of the multi-layer temporal attention fusion mechanism as claimed in claim 3, characterized in that PyTorch's default Adam optimizer and cross-entropy loss function are adopted in the training process, each batch is set to 10, and the learning rate schedule has two stages: the first stage uses 0.00004, and after the 8th epoch the learning rate is adjusted to 0.000004 and training continues for 6 more epochs, completing the convergence of the neural network parameters.
CN202110773432.6A 2021-07-08 2021-07-08 Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism Active CN113537024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110773432.6A CN113537024B (en) 2021-07-08 2021-07-08 Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110773432.6A CN113537024B (en) 2021-07-08 2021-07-08 Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism

Publications (2)

Publication Number Publication Date
CN113537024A CN113537024A (en) 2021-10-22
CN113537024B true CN113537024B (en) 2022-06-21

Family

ID=78127177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110773432.6A Active CN113537024B (en) 2021-07-08 2021-07-08 Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism

Country Status (1)

Country Link
CN (1) CN113537024B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114115531B (en) * 2021-11-11 2022-09-30 合肥工业大学 End-to-end sign language recognition method based on attention mechanism
CN113920989B (en) * 2021-12-13 2022-04-01 中国科学院自动化研究所 End-to-end system and equipment for voice recognition and voice translation
CN115345257B (en) * 2022-09-22 2023-06-06 中山大学 Flight trajectory classification model training method, classification method, device and storage medium
CN116089593B (en) * 2023-03-24 2023-06-13 齐鲁工业大学(山东省科学院) Multi-pass man-machine dialogue method and device based on time sequence feature screening coding module

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN108647603A (en) * 2018-04-28 2018-10-12 清华大学 Semi-supervised continuous sign language interpretation method based on attention mechanism and device
CN109190578A (en) * 2018-09-13 2019-01-11 合肥工业大学 The sign language video interpretation method merged based on convolution network with Recognition with Recurrent Neural Network
CN110163181A (en) * 2019-05-29 2019-08-23 中国科学技术大学 Sign Language Recognition Method and device
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN111340006A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325099B (en) * 2020-01-21 2022-08-26 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111339837B (en) * 2020-02-08 2022-05-03 河北工业大学 Continuous sign language recognition method
CN112101262B (en) * 2020-09-22 2022-09-06 中国科学技术大学 Multi-feature fusion sign language recognition method and network model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN108647603A (en) * 2018-04-28 2018-10-12 清华大学 Semi-supervised continuous sign language interpretation method based on attention mechanism and device
CN109190578A (en) * 2018-09-13 2019-01-11 合肥工业大学 The sign language video interpretation method merged based on convolution network with Recognition with Recurrent Neural Network
CN110163181A (en) * 2019-05-29 2019-08-23 中国科学技术大学 Sign Language Recognition Method and device
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN111340006A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Attention-Enhanced Multi-Scale and Dual Sign Language Recognition Network Based on a Graph Convolution Network; Lu Meng et al.; Sensors; 2021-02-05; full text *
Multi-modal sign language recognition fusing attention mechanism and connectionist temporal classification; 王军 et al.; Signal Processing (《信号处理》); 2020-09-30; Vol. 36, No. 9; full text *

Also Published As

Publication number Publication date
CN113537024A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113537024B (en) Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN110633683B (en) Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM
Wu et al. Multimodal large language models: A survey
CN111967272B (en) Visual dialogue generating system based on semantic alignment
CN111339837A (en) Continuous sign language recognition method
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN113792177A (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN116246213B (en) Data processing method, device, equipment and medium
CN111354246A (en) System and method for helping deaf-mute to communicate
CN111259785A (en) Lip language identification method based on time offset residual error network
CN114385802A (en) Common-emotion conversation generation method integrating theme prediction and emotion inference
CN109933773A (en) A kind of multiple semantic sentence analysis system and method
CN113780059A (en) Continuous sign language identification method based on multiple feature points
CN115563335A (en) Model training method, image-text data processing device, image-text data processing equipment and image-text data processing medium
CN113032535A (en) Visual question and answer method and device for assisting visually impaired people, computing equipment and storage medium
CN117272237B (en) Multi-modal fusion-based patent drawing multi-language graphic generation method and system
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN116311493A (en) Two-stage human-object interaction detection method based on coding and decoding architecture
CN111079661A (en) Sign language recognition system
CN113590800B (en) Training method and device for image generation model and image generation method and device
CN116091978A (en) Video description method based on advanced semantic information feature coding
CN113469260B (en) Visual description method based on convolutional neural network, attention mechanism and self-attention converter
CN113553445B (en) Method for generating video description

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant