CN113033452B

CN113033452B - Lip language identification method fusing channel attention and selective feature fusion mechanism

Info

Publication number: CN113033452B
Application number: CN202110366767.6A
Authority: CN
Inventors: 薛峰; 杨添; 王文博; 洪自坤
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2021-04-06
Filing date: 2021-04-06
Publication date: 2022-09-16
Anticipated expiration: 2041-04-06
Also published as: CN113033452A

Abstract

The invention discloses a lip language identification method fusing a channel attention and selective feature fusion mechanism, which comprises the following steps: 1. downloading a data set GRID for training a model, and preprocessing the data set; 2. building a lip language recognition network, and selecting a proper objective function to optimize model parameters; 3. evaluating the effect of the model by adopting corresponding evaluation indexes; 4. and carrying out lip language recognition on the video by using the trained model. The invention uses the stacked 3D convolution neural network, the selective characteristic fusion network and the bidirectional GRU network to encode the input video frame, wherein a channel attention mechanism is added between each layer of 3D convolution layer, and finally a CTC decoder is adopted to generate an output text, so that the characteristics of the speaker lip region can be better learned, and the more accurate lip reading effect is realized.

Description

Lip language recognition method fusing channel attention and selective feature fusion mechanism

Technical Field

The invention belongs to the technical field of computer machine learning and artificial intelligence, and mainly relates to a lip language identification method of a deep neural network.

Background

Lip language plays a crucial role in human communication and speech understanding, however, human lip reading ability is poor as research shows. Good lip speech recognition technology can be a complement to audio-based speech recognition, can be used for improving hearing aids, improving the acquisition of speech information in silent, safe and noisy environments, and the like, has great practicability, and thus becomes an increasingly interesting field. Most of the lip reading work was based on manually designed feature learning before the occurrence of deep learning, and this type of method is computationally intensive and less accurate. Deep learning methods, which are continuously developed in recent years, are used to extract static features of the lip region of a speaker or to construct an end-to-end architecture. The 3D convolutional neural network can effectively learn the motion information of the lip part; the cyclic neural network can better process the information of the sequence; the ctc (connectionist Temporal classification) training approach may eliminate the need for alignment of the inputs with the target outputs, so that the sequence modeling is trained in an end-to-end fashion. On the basis of the deep learning methods, the lip language recognition technology is greatly developed.

Lip language recognition can be divided into two categories, word-level and sentence-level, depending on whether the modeling task is to classify words or phonemes or to predict a complete sentence sequence. The lip language recognition method at the word level only predicts a single isolated word, the prediction object is usually a short video of about 0.5s, the information provided by context for word prediction is ignored, the prediction object can be a video segment of several seconds or even longer, the context information can be fully utilized to help predict the word, and the later represents larger practical significance. In recent years, word-level lip language recognition methods have been developed rapidly, and the accuracy of single word classification can reach over 86% (on the LRW dataset). The sentence-level lip language recognition method for predicting the complete sentence sequence is relatively less researched, partial features of the lip region are insufficiently extracted by the existing model, the accuracy rate of lip language recognition is still low, and the method can be improved.

Disclosure of Invention

Aiming at the related problems in the existing lip language recognition, the invention provides a lip language recognition method fusing channel attention and a selective feature fusion mechanism so as to better extract the features of a speaker lip region, thereby realizing more accurate lip reading and achieving better lip language recognition effect.

The invention adopts the following technical scheme for solving the technical problems:

the invention relates to a lip language identification method fusing channel attention and selective feature fusion mechanisms, which is characterized by comprising the following steps of:

step 1, a sentence-level lip language recognition video data set is obtained, face feature detection is carried out on each video in the lip language recognition video data set, lip region images are extracted, so that a lip region image set of each video is obtained, and a lip region image data set L is formed;

step 2, dividing the lip region image data set L into a training set L ₁ And test set L ₂ And the training set L is ₁ Dividing the video into a plurality of batches, wherein each batch contains lip region image sets corresponding to B videos and serves as B training samples; each training sample comprises T frames of lip region images; the number of channels of each frame of lip region image is C, the height is H and the width is W;

step 3, training set L ₁ And test set L ₂ The real texts corresponding to the lip region image sets of each video contained in the text table are respectively marked as G ₁ And G ₂ ；

Step 4, constructing a lip language identification network fusing a channel attention and selective feature fusion mechanism;

step 4.1, constructing a front-end network HN of a fusion channel attention mechanism;

the front-end network HN is formed by connecting three same sub-modules CAN in series, and each sub-module CAN sequentially comprises a 3D convolution layer, a 3D batch regularization layer, a ReLU activation function, a 3D Dropout layer, a 3D maximum pooling layer and a channel attention network layer CA; the output of the channel attention network CA and the input of the channel attention network CA are multiplied element by element to obtain a result which is used as the output of each sub-module CAN;

the channel attention network CA comprises two branches, a first branch comprising in sequence: a 3D global maximum pooling layer, a 3D convolutional layer for reducing the number of input feature channels by r times, a ReLU activation function and a 3D convolutional layer for increasing the number of input feature channels by r times; the other branch is the same as the first branch except that the 3D global maximum pooling layer is changed into a 3D global average pooling layer; adding the outputs of the two branches element by element, and obtaining the output of the attention network CA through a Sigmoid activation function;

step 4.2, constructing a selective feature fusion network SKN;

the selective characteristic fusion network SKN is formed by connecting n identical selective fusion sub-modules SK in series, and each selective fusion sub-module SK is processed according to the formula (1):

in formula (1), Z represents the output of each selective fusion submodule SK;

the representative feature matrix multiplies the elements one by one; tan h is Tanh activation function; x and Y are two different feature matrices obtained by inputting the selective fusion sub-module SK through two fusion branches, and each fusion branch comprises a full connection layer; g (U) represents a result U obtained by adding two different feature matrixes X and Y obtained by two fusion branches element by element, and then sequentially passes through a full connection layer for reducing input dimension by r times, a ReLU activation function, a full connection layer for increasing input dimension by r times and a Sigmoid activation function;

4.3, constructing a back-end network TN for extracting the long-time information;

the back-end network TN sequentially comprises two bidirectional GRU layers, a full connection layer and a CTC loss layer; the input of the back-end network TN is the output of the selective characteristic fusion network SKN;

step 4.4, using the training set L ₁ As the input of the lip language recognition network, and the training set L ₁ Corresponding real text set G ₁ As a label, CTC loss is used as a loss function, the lip language recognition network is trained by using an Adam optimization algorithm, and the lip language recognition network is combined in a test set L ₂ The final lip language identification network is obtained and used for identifying the movement of the lips of the speaker in the video, namely, machine lip reading is realized.

Compared with the existing lip language recognition technology, the invention has the following advantages:

1. the invention integrates the channel attention mechanism into the lip language recognition model, and adds the channel attention mechanism into the 3D convolution neural network extracting the short-time information and the space characteristics from the front end of the model, so that the model can fully utilize the characteristics with the maximum information quantity and restrain the useless characteristics according to the dependence degree on each channel. The lip language identification network added with the channel attention mechanism can have better lip reading effect.

2. According to the method, a Batch standardization layer is added between each convolution layer and the active layer to optimize the model training process, the optimization method not only greatly accelerates the model training speed, but also enables the model to be free from overfitting on a limited data set, and improves the effect of model lip language recognition to a certain extent.

3. The present invention employs a selective feature fusion mechanism. Compared with the high way Network, the high way Network only provides dynamic and nonlinear fusion weight, the provided selective feature fusion mechanism not only solves the problem that the Network is difficult to train after deepening, but also can adaptively selectively learn the information of different feature spaces, thereby extracting richer semantic information and greatly improving the effect of model lip language recognition.

Drawings

FIG. 1 is a block diagram of a model of a network according to the present invention;

FIG. 2 is a block diagram of an optional feature fusion module according to the present invention;

FIG. 3 is a flow chart of the present invention.

Detailed Description

In this embodiment, a lip language recognition method based on a fusion mechanism of channel attention and selective features is to recognize the content expressed by a speaker according to the motion of the lip region of the speaker in a video, and map the content to a text language, thereby implementing lip reading based on deep learning. Downloading a sentence-level lip language recognition data set GRID, obtaining an image of a lip region of a speaker after face feature detection processing, constructing a complete lip language recognition model, and accelerating the model training speed through batch standardization; the attention mechanism of the fusion channel improves the effect of the model; updating the parameters of the optimization model by adopting an Adam optimization algorithm; and recognizing the content expressed by the speaker according to the motion of the lip of the speaker in the video by using the finally trained model, converting the content into a text language form, and finishing the whole lip language recognition function. Specifically, as shown in fig. 3, the method comprises the following steps:

step 1, a sentence-level lip language recognition video data set is obtained, face feature detection is carried out on each video in the lip language recognition video data set, and lip region images are extracted, so that a lip region image set of each video is obtained, and a lip region image data set L is formed;

step 2, dividing lip region image data set L into training sets L ₁ And test set L ₂ And will train set L ₁ Dividing the video into a plurality of batches, wherein each batch contains lip region image sets corresponding to B videos and serves as B training samples; each training sample comprises T frames of lip region images; the number of channels of each frame of lip region image is C, the height is H and the width is W; in the specific example, B is 50, T is 75, C is 3, H is 64, and W is 128;

Step 4, constructing a lip language identification network fusing a channel attention and selective feature fusion mechanism, wherein the network structure is shown in figure 1;

step 4.1, constructing a front-end network HN of a fusion channel attention mechanism, and extracting short-time information and spatial characteristics of an input picture set;

the front-end network HN is composed of three identical sub-modules CAN in series, each sub-module sequentially comprises a 3D convolution layer, a 3D batch regularization layer (BN layer), a ReLU activation function, a 3D Dropout, a 3D maximum pooling layer and a channel attention network layer CA, and the output of each sub-module is a result of element-by-element multiplication of the output of the channel attention network CA and the input of the channel attention network CA.

In the embodiment, the 3D convolutional layers in the three sub-modules sequentially change the number of channels of the input features into 32, 64 and 96, each 3D maximum pooling layer reduces 1/2 the height and width of the input feature map, and the 3D convolutional neural network and the 3D maximum pooling layer can reduce the computational complexity and extract spatial features and short-time information in the input lip region image set; adding a 3D batch regularization layer (BN layer) after each 3D convolution layer can greatly accelerate the model training speed, has quick convergence and can improve the accuracy of the model.

The channel attention network CA comprises two branches, a first branch comprising in sequence: a 3D global maximum pooling layer, a 3D convolutional layer for reducing the number of input feature channels by r times, a ReLU activation function and a 3D convolutional layer for increasing the number of input feature channels by r times; in the specific example, r is 16; the other branch is the same as the first branch except that the 3D global maximum pooling layer is changed into a 3D global average pooling layer; and adding the outputs of the two branches element by element, and obtaining the output of the attention network CA through a Sigmoid activation function. The channel attention mechanism can improve the representation capability of the network by modeling the dependency of each channel, and can adjust the characteristics channel by channel, so that the network can selectively strengthen the learning of the characteristics containing useful information and inhibit useless characteristics;

step 4.2, constructing a selective feature fusion network SKN;

the selective characteristic fusion network SKN is formed by connecting n identical selective fusion submodules SK in series, the selective fusion submodules SK are shown in fig. 2, and experiments show that the model can finally achieve the best effect when n is 2, so that n is 2 in a specific embodiment. Each selective fusion submodule SK is processed according to equation (1):

in formula (1), Z represents the output of each selective fusion submodule SK;

the representative feature matrix is multiplied element by element; tan h is Tanh activation function; x and Y are two different feature matrixes obtained by inputting the selective fusion sub-module SK through two branches, and each branch comprises a full connection layer; g (U) represents: recording a result of element-by-element addition of two different feature matrices X and Y obtained by the two branches as U, and sequentially passing through a full connection layer for reducing the input dimension by r times, a ReLU activation function, a full connection layer for increasing the input dimension by r times and a Sigmoid activation function.

4.3, constructing a long-time information extraction back-end network TN;

the back-end network TN sequentially comprises two bidirectional GRU (gated Current Unit) layers, a full connection layer and a CTC (Connectionsist Temporal Classification) loss layer; the input of the back-end network TN is the output of the selective feature fusion network SKN. In a specific example, each GRU layer comprises 256 hidden neurons, the output of the front-end network HN can be further effectively aggregated by using two stacked bidirectional GRU layers, and long-term information in an input feature sequence can be acquired; the reason for selecting the CTC penalty function is that it eliminates the need for training data to align the input with the target output, eliminating many of the cumbersome post-processing operations.

Step 4.4 training divided by the resulting lip region image datasetCollection L ₁ For lip language recognition network input, and training set L ₁ Corresponding real text set G ₁ As a label, CTC loss is used as a loss function, the lip language recognition network is trained by using an Adam optimization algorithm, and then the lip language recognition network is combined in a test set L ₂ The final lip language identification network is obtained and used for identifying the motion of the lips of the speaker in the video, namely, the lip reading of a machine is realized.

Claims

1. A lip language recognition method fusing channel attention and selective feature fusion mechanisms is characterized by comprising the following steps of:

step 2, dividing the lip region image data set L into a training set L ₁ And test set L ₂ And the training set L is divided into ₁ Dividing the video into a plurality of batches, wherein each batch contains lip region image sets corresponding to B videos and serves as B training samples; each training sample comprises T frames of lip region images; the number of channels of each frame of lip region image is C, the height is H and the width is W;

step 4.2, constructing a selective feature fusion network SKN;

in formula (1), Z represents the output of each selective fusion submodule SK;

the representative feature matrix is multiplied element by element; tan h is a Tanh activation function; x and Y are two different feature matrices obtained by inputting the selective fusion sub-module SK through two fusion branches, and each fusion branch comprises a full connection layer; g (U) represents a result U obtained by adding two different feature matrixes X and Y obtained by two fusion branches element by element, and then sequentially passes through a full connection layer for reducing input dimension by r times, a ReLU activation function, a full connection layer for increasing input dimension by r times and a Sigmoid activation function;

step 4.4, using the training set L ₁ As the input of the lip language recognition network, and the training set L ₁ Corresponding real text set G ₁ As a label, CTC loss is used as a loss function, the lip language recognition network is trained by using an Adam optimization algorithm, and the lip language recognition network is combined in a test set L ₂ The final lip language identification network is obtained and used for identifying the motion of the lips of the speaker in the video, namely, the lip reading of a machine is realized.