CN111753704B - Time sequence centralized prediction method based on video character lip reading recognition - Google Patents
- Publication number: CN111753704B (application CN202010562822.4A)
- Authority
- CN
- China
- Prior art keywords
- time
- attention
- sequence
- character
- lip
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a time sequence centralized prediction method based on video character lip reading recognition. First, a sequence of video frames of a person's lip movement is input and the spatio-temporal features of the lips are extracted; a residual network with an embedded SENet module then obtains the useful multi-channel lip features. These features are fed into bidirectional gated recurrent units (Bi-GRU) to obtain the probability distribution of the characters corresponding to the lip-movement contour, and connectionist temporal classification (CTC) is introduced to align the text labels with the characters over the time steps. Next, to capture the forward and backward temporal relations, a time-sequence-correlated attention window is built from the hidden states of the Bi-GRU and condensed into a context vector, and the probability distribution vectors of the context vector are subdivided and planned over the length of the attention window. Finally, an attention unit is set for the probability distribution vector at each current time and recombined into a unit that predicts the probability of the character corresponding to the lip reading. By concentrating the forward and backward association of the temporal information, the invention can effectively predict and recognize the lip-read content of a person in a video.
Description
Technical Field
The invention provides a time sequence centralized prediction method based on video character lip reading recognition. It mainly uses a multilayer convolutional neural network that extracts spatio-temporal and multi-channel features, together with a time-sequence prediction method with an embedded mixed attention mechanism, and belongs to the intersecting fields of video mining, pattern recognition, and computer vision.
Background
Lip reading recognition of a video character judges lip contours visually and connects them to the content of the characters spoken; to improve recognition accuracy, the effect is usually enhanced by combining the auditory channel, i.e. speech. It has therefore become an important research subject in video character pattern recognition in recent years, with high application prospects and value for assisting hearing-impaired people, transcribing videos with damaged audio, accident monitoring, and the like.
In lip reading recognition of a video character, the most critical step is to predict the sentence content corresponding to the motion contour of the human lips while speaking, for which the association of the forward and backward time sequence is particularly important. Time sequence centralized prediction aims to concentrate attention carrying temporal information, align the probability distribution of the content spoken at each time step with the text label sequence, improve the prediction probability, and let the recognized characters form a complete and meaningful sentence.
Time sequence centralized prediction for lip reading recognition of a video character involves the following three methods:
(1) Multilayer convolutional neural network: a 3D-CNN extracts the lip features of the character in the spatio-temporal dimensions of consecutive video frames, and a residual network with an embedded SENet module associates the channels and extracts multi-channel lip features; within the multilayer convolution architecture, useless features are removed while useful features are retained to the greatest extent.
(2) Time-sequence prediction: bidirectional gated recurrent units obtain the probability distribution of the characters, and connectionist temporal classification (CTC) establishes the paths and the loss function of the label sequence to be recognized, serving as a powerful base model for lip reading recognition.
(3) Mixed attention mechanism: all feature information and position information are mixed and concentrated using the forward-backward association of the time sequence, so that the character prediction at each time step is associated with the content of the preceding and following times; this avoids the long-term dependence problem and improves the accuracy of predicting preceding and following characters.
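The gated recurrent unit underlying the time-sequence prediction of method (2) can be sketched minimally. The following is an illustrative NumPy sketch, not the patent's implementation; all dimensions and the weight layout are assumptions, and a bidirectional layer would run one such cell forward and one backward over the sequence and concatenate their states.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: gates decide how much of the past state to keep."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde

rng = np.random.default_rng(1)
d_in, d_h = 6, 4
# Alternate input-to-hidden and hidden-to-hidden weights: Wz, Uz, Wr, Ur, Wh, Uh.
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0 else
          rng.standard_normal((d_h, d_h)) for i in range(6)]
h = np.zeros(d_h)
seq = rng.standard_normal((5, d_in))
for x in seq:          # forward pass over the sequence
    h = gru_cell(x, h, *params)
print(h.shape)  # (4,)
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden values stay bounded, which is one reason gated units avoid the vanishing-gradient problem mentioned later in the document.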
Disclosure of Invention
The technical problem is as follows: the invention aims to align the opening and closing motion of the lip contour with the character labels in lip reading recognition of a video character, obtain the character probability distribution from the lip features of the character, improve the association of the forward and backward time sequences through an attention mechanism mixing content and position information, predict the corresponding character probabilities in a concentrated manner, and connect all characters into complete, meaningful words and sentences.
The technical scheme is as follows: the time sequence centralized prediction method based on video character lip reading recognition according to the invention comprises the following steps:
step 1): decoding lip reading content, comprising the following steps:
step 11): input Frames = {frame_1, frame_2, ..., frame_n}, denoting a sequence of n consecutive video frames of human lip movement. First extract the spatio-temporal features of the lips with a 3D-CNN, then extract the multi-channel lip features through a residual network with an embedded SENet module to obtain X̃ = {x_1, x_2, ..., x_m}, the m-dimensional feature vector of the human lips. The three-dimensional convolutional neural network has a three-layer structure: each layer has a 3D convolutional layer, the feature map is batch-normalized after each convolution operation, a Leaky ReLU activation function is then applied, a 3D dropout layer is added, and finally a 3D max-pooling layer is attached, except in the third layer;
step 12): set up two Bidirectional Gated Recurrent Units (Bi-GRU), take X̃ as the input of the first Bi-GRU, and normalize the probability of the corresponding character at each time step with the softmax function after the fully connected layer of the second Bi-GRU;
step 13): introduce Connectionist Temporal Classification (CTC): let L = L′ ∪ {⟨blank⟩}, where ⟨blank⟩ denotes the blank label, L′ denotes all labels except the blank label, ∪ denotes the union, and L denotes the full label set. Set the alignment of CTC over the time steps T as π = (π_1, π_2, ..., π_T), where π represents a path of the label sequence to be recognized; define a transformation Trans over the paths π, where Trans denotes the transformation by which all paths π approach y′ within the time steps T, and y′ denotes the real label sequence;
step 2): establishing an attention focusing window, and adding a front-back time sequence association, wherein the steps are as follows:
step 21): centering on q, h ═ h q-τ ,...,h q ,...,h q+τ ]As a mixed attention window, the q represents the current time,h=[h q-τ ,...,h q ,...,h q+τ ]Representing the hidden state sequence output by two bidirectional gating circulation units, tau representing the Length of two sides of a mixed attention window, and setting the total Length of the window to be Length win =2τ+1;
Step 22): computingThe Conv' represents a convolution kernel, t ∈ [ q- τ, q + τ [ ]]Represents a time period, context q The expression is concentrated in t epsilon [ q-tau, q + tau]Length is an inner win The context vector of the position information and the content information of all the features under the length, and theta is set q =[θ q-τ ,...,θ q ,...,θ q+τ ]Theta of q =[θ q-τ ,...,θ q ,...,θ q+τ ]Expressed in t ∈ [ q- τ, q + τ)]Length of intrinsic or intrinsic Length win All attention probability distribution vectors ofRecord asInstant gameWherein g is t Representing the characteristic signal, θ, convolved over a time period t q,t Expressed in t ∈ [ q- τ, q + τ)]Attention probability distribution vector weight of (1);
and step 3): strengthening the time sequence correlation before and after, comprising the following steps:
step 31): compute decoder q =Conv soft context q +b soft Said decoder q Indicating the decoding status, Conv, at the current time q soft Representing a convolution kernel subjected to a logarithmic operation of the softmax function, b soft Representing an offset value subjected to a logarithmic operation of a softmax function;
step 32): decoder fusing decoding state of last time q-1 q-1 Attention probability distribution vector θ q-1 And a characteristic signal g t CalculatingThe Single feedforward A single-layer feed-forward network is shown,representing convolution operation, wherein eta, U, W, V', ξ and b all represent mixed attention parameters learned by the network in the decoding training process, tanh (-) represents a tanh activation function, and if enough mixed attention parameters are learned, the step 33) is carried out, otherwise, the training process of the step 32) is repeated;
step 33): calculating theta q,t =Attention(decoder q-1 ,θ q-1 ,g t ) Attention (·) means an Attention unit, which means that the result of step 32) is normalized, i.e. calculated, by a softmax activation function
Step 34): for p (y' | X) ~ ) Modeling is carried out to let p (y' | X) ~ )=p(π t |X ~ ) Wherein p (pi) t |X ~ )=softmax(decoder t ) The p (y' | X) ~ ) Feature vector X representing human lips ~ Probability vector, p (π) corresponding to the true tag sequence y t |X ~ ) Expressed in t ∈ [ q-tau, q + tau]Inner figure lip feature vector X ~ Corresponding path pi t The probability vector of (1), softmax (·) denotes the softmax activation function, decoder t Expressed in t ∈ [ q- τ, q + τ)]All decoding states within;
step 35): by way of calculation in step 22)Weighted summation is in t epsilon [ q-tau, q + tau]Internally and mixedly attention window Length win All attention probability distribution vector weights [ theta ] under q,t-τ ,...,θ q,t ,...,θ q,t+τ ]Obtaining the current time qContext vector context of q CalculatingThe Loss CTC Representing the CTC loss function for statistical character probability, aligning the probability vector p (π) of the label at each time node t |X ~ ) And each real label sequence y' is subjected to character-by-character prediction, wherein during decoding, a prefix bundle search is used for generating a decoding sequence, and the sequence is combined into a complete sentence to be output.
Further, in step 11), n is empirically set to 75, and a residual network with the ResNet101 structure is used.
Further, in step 13), the label set L is empirically given 27 character labels, comprising 1 blank label ⟨blank⟩ and 26 non-blank labels L′; the value of the time step T, corresponding to the input frame number n, is set to 75.
Further, in step 21), the length τ on either side of the mixed attention window is empirically taken as 2, so the total window length Length_win is taken as 5.
Further, in step 22), the dimensions of the hidden state sequence h and the context vector context_q are set to 512.
Further, in step 35), the beam size of the prefix beam search is empirically set to 8.
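The prefix beam search used in step 35) can be sketched in pure Python. This is a standard CTC prefix beam search (tracking, per prefix, the probability of ending in blank versus non-blank), offered as an illustrative sketch rather than the patent's implementation; the toy probability table is an assumption.

```python
from collections import defaultdict

import numpy as np

def prefix_beam_search(probs, beam_size=8, blank=0):
    """CTC prefix beam search over per-step label probabilities.

    probs : (T, V) array, probs[t, v] = p(label v at step t).
    Each beam entry stores [p_blank, p_nonblank] for a prefix.
    """
    T, V = probs.shape
    beams = {(): [1.0, 0.0]}
    for t in range(T):
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (pb, pnb) in beams.items():
            # extend with blank: the prefix is unchanged
            nxt[prefix][0] += (pb + pnb) * probs[t, blank]
            for v in range(V):
                if v == blank:
                    continue
                if prefix and prefix[-1] == v:
                    # repeating the last label merges unless a blank intervened
                    nxt[prefix][1] += pnb * probs[t, v]
                    nxt[prefix + (v,)][1] += pb * probs[t, v]
                else:
                    nxt[prefix + (v,)][1] += (pb + pnb) * probs[t, v]
        # keep only the beam_size most probable prefixes
        beams = dict(sorted(nxt.items(),
                            key=lambda kv: -(kv[1][0] + kv[1][1]))[:beam_size])
    return max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]

# Toy check: the per-step argmax path is blank,1,1,2, which collapses to (1, 2).
probs = np.array([[0.6, 0.2, 0.2],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
print(prefix_beam_search(probs, beam_size=8))  # (1, 2)
```

Unlike greedy decoding, the beam sums probability mass over all paths sharing a prefix, which is why a beam size such as 8 can recover sequences the argmax path would miss.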
Beneficial effects: the invention provides a time sequence centralized prediction method based on video character lip reading recognition, with the following specific beneficial effects:
(1) The method extracts the temporal and spatial lip features of the character in a video frame sequence through a 3D-CNN, uses SENet to associate channel information and extract multi-channel lip features, and fuses the features through shortcut connections and multilayer convolution in a residual network, so that a large number of useless lip features are removed, useful features are maximally retained, and the training parameters are reduced.
(2) The invention adopts bidirectional gated recurrent units as the sentence decoder and uses CTC to align the text labels and the decoded characters over the time steps, which not only solves the problems of long-term dependence and vanishing gradients, but also effectively associates the forward and backward information of the time sequence, making the prediction of the composed sentences more complete and accurate.
(3) The invention introduces a mixed attention mechanism into CTC, concentrating the content and position of all data in the attention window, strengthening the association between the predicted characters and the preceding and following information, and effectively solving CTC's inherent label alignment deficiency.
Drawings
Fig. 1 is a flow chart of a time-series centralized prediction method based on lip reading identification of a video character.
Fig. 2 is a diagram of a network architecture based on video character lip feature extraction.
Fig. 3 is a network architecture diagram of a bi-directional gated loop unit.
Fig. 4 is a lip reading recognition framework incorporating a hybrid attention mechanism.
Detailed Description
Some embodiments of the invention are described in more detail below with reference to the accompanying drawings.
In an implementation, fig. 1 is a flow chart of the time sequence centralized prediction method based on lip reading recognition of a video character. First the lip reading content is decoded and the lip features are extracted. A sequence of n consecutive lip video frames, Frames = {frame_1, frame_2, ..., frame_n}, is input to the 3D-CNN; n is empirically set to 75, since the spoken duration of a sentence averages 3 s and the frame rate of each video is about 25 fps. Each video frame has size H × W. The 3D-CNN has three layers, each with one 3D convolutional layer and one 3D max-pooling layer, and after the convolution operation the features are further processed, in order, by batch normalization, a Leaky ReLU activation function, and 3D dropout. The third layer of the 3D-CNN has only the convolution operation and is followed by a ResNet101 with bottleneck blocks, i.e. convolution kernels of 1×1,64, 3×3,64 and 1×1,256 are used in turn for convolution and parameter dimensionality reduction. SENet is embedded in the residual blocks of ResNet101 to correlate the information of each channel, extract more useful features, and suppress and remove useless ones, finally yielding the m-dimensional feature vector X̃ = {x_1, x_2, ..., x_m}; fig. 2 is the network structure diagram of the lip feature extraction for the video character. Two bidirectional gated recurrent units are then set up, and fig. 3 is the network structure diagram of the bidirectional gated recurrent unit: X̃ serves as the input of the first Bi-GRU, and the probability of the corresponding character at each time step is normalized with the softmax function after the fully connected layer of the second Bi-GRU.
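The squeeze-and-excitation reweighting that SENet applies inside the residual blocks can be sketched in NumPy. This is an illustrative sketch only: the array shapes, the reduction ratio r, and the random weights are assumptions, not the patent's configuration.

```python
import numpy as np

def se_reweight(feat, w1, b1, w2, b2):
    """Squeeze-and-excitation channel reweighting.

    feat : (C, T, H, W) feature map from the 3D conv stack.
    w1/b1, w2/b2 : weights of the two fully connected excitation layers.
    """
    C = feat.shape[0]
    # Squeeze: global average pool each channel over time and space.
    z = feat.reshape(C, -1).mean(axis=1)              # (C,)
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid gates.
    s = np.maximum(0.0, w1 @ z + b1)                  # (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))          # (C,) gates in (0, 1)
    # Scale: emphasise useful channels, suppress useless ones.
    return feat * s[:, None, None, None]

rng = np.random.default_rng(0)
C, r = 8, 2
feat = rng.standard_normal((C, 4, 6, 6))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
out = se_reweight(feat, w1, b1, w2, b2)
print(out.shape)  # (8, 4, 6, 6)
```

Because each gate lies in (0, 1), the module can only attenuate channels, which matches the document's description of suppressing useless features while retaining useful ones.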
Connectionist temporal classification (CTC) is then introduced. Since the lip video frame sequence to be recognized has 75 frames, and the number of characters normally spoken in the corresponding seconds can hardly exceed 75, a blank symbol ⟨blank⟩ is added as an additional label and allowed to recur. The label set L is empirically given 27 character labels, comprising 1 blank label ⟨blank⟩ and 26 non-blank labels L′ corresponding to the 26 letters; the entire lip reading recognition process uses an existing English sentence data set. The alignment of CTC over the time steps T is set as π = (π_1, π_2, ..., π_T), with T set to 75, the same as the number of frames n; π_t ∈ L indicates the path of a certain label sequence to be recognized and can select the index of the corresponding character at time step t. A transformation Trans is defined to convert the alignments π; it maps the values of all alignments π that approach the real label sequence y′ within the time steps T.
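The Trans transformation just described, merging repeated labels and then dropping blanks, can be sketched in pure Python. The blank index 0 and the letter-index encoding are illustrative assumptions.

```python
def trans_collapse(path, blank=0):
    """Collapse a CTC path: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for p in path:
        if p != prev:          # merge consecutive repeated labels
            if p != blank:     # drop the blank label
                out.append(p)
        prev = p
    return out

# With a=1, ..., z=26: the path h e e <b> l l <b> l o collapses to "hello".
print(trans_collapse([8, 5, 5, 0, 12, 12, 0, 12, 15]))  # [8, 5, 12, 12, 15]
```

Note that the blank between the two l-runs is what allows a doubled letter to survive the merge, which is why the blank label is allowed to recur in the alignment.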
Because CTC predicts the current time depending only on the hidden feature vector and does not directly participate in the feature prediction of adjacent frames, its inherent label alignment deficiency makes the independence assumption of the CTC output inaccurate. On this basis an attention focusing window is established and the forward-backward temporal association is added; fig. 4 shows the lip reading recognition framework with the introduced mixed attention mechanism, which contains the sentence decoding and the mixed-attention CTC. Centered on the current time q, the hidden state sequence h = [h_(q-τ), ..., h_q, ..., h_(q+τ)] is taken as the mixed attention window, with the dimension of h set to 512, where τ denotes the length on either side of the window and the total window length is Length_win = 2τ + 1; τ is empirically taken as 2, so Length_win is 5, the value at which the time sequence centralized prediction is most effective. With the convolution kernels Conv = [Conv′_(q-τ), ..., Conv′_q, ..., Conv′_(q+τ)] over the window length Length_win, g_t = Conv′ * h_t and context_q = Σ_t θ_(q,t) · g_t are computed, i.e. the context vector context_q is calculated from the convolution kernels Conv′ and the hidden state sequence h. The context vector is the core of the mixed-attention CTC and concentrates the position information and content information of all features over the length Length_win; the dimension of context_q is set to 512. Since context_q results from convolving the output hidden state sequence h with Conv′ over time, Conv′ and context_q can be regarded as the temporal convolution kernel and the temporal convolution feature, respectively. To account for the different probability distributions of the labels, the attention probability distribution vector θ_q = [θ_(q-τ), ..., θ_q, ..., θ_(q+τ)] is introduced for t ∈ [q-τ, q+τ].
Here Σ_t θ_(q,t) · g_t is denoted as context_q, i.e. the context vector of the current time q is computed, completing the construction of the mixed attention network; this requires the non-uniform attention probability distribution vector weights θ_(q,t) over Length_win, where g_t denotes the feature signal convolved at time t.
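The window focusing with τ = 2 can be sketched as follows: convolve the hidden states into feature signals g_t, then form the context vector as the attention-weighted sum over the window. The 1-D kernel, the dimensions, and the softmax-derived weights here are illustrative assumptions.

```python
import numpy as np

def context_vector(h, theta, conv_kernel, q, tau):
    """Weighted sum of convolved hidden states over [q - tau, q + tau]."""
    window = range(q - tau, q + tau + 1)
    k = len(conv_kernel)
    # g_t: feature signal from convolving hidden states around time t.
    g = {t: sum(conv_kernel[i] * h[t - k // 2 + i] for i in range(k))
         for t in window}
    return sum(theta[t] * g[t] for t in window)

rng = np.random.default_rng(2)
T, d, tau, q = 12, 4, 2, 6                      # window length 2*tau + 1 = 5
h = rng.standard_normal((T, d))                 # Bi-GRU hidden states
scores = rng.standard_normal(T)
theta = np.exp(scores) / np.exp(scores).sum()   # attention weights
kernel = [0.25, 0.5, 0.25]                      # illustrative 1-D kernel
ctx = context_vector(h, theta, kernel, q, tau)
print(ctx.shape)  # (4,)
```

The context vector is linear in the attention weights, so sharpening θ toward one window position makes it approach that position's convolved feature signal.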
Finally the forward-backward temporal association is strengthened. The attention probability distribution vector weight θ_(q,t) of the current time q depends on the decoding state decoder_(q-1) of the previous time q-1. Through p(y′|X̃) = p(π_t|X̃) = softmax(decoder_t), the probability vector p(y′|X̃) of the lip feature vector X̃ corresponding to the real label sequence y′ within t ∈ [q-τ, q+τ] is modeled, with each p(π_q|x) = [p(π_q=1|x), p(π_q=2|x), ..., p(π_q=L|x)]^T denoting the probability vector of the text label aligned to the label at the current time q. The decoding state decoder_q represents the logarithm of the softmax and is computed as decoder_q = Conv_soft · context_q + b_soft. The attention probability distribution vector weight θ_(q,t) of the current time q is then computed by the attention unit Attention, which fuses the decoding state decoder_(q-1) of the previous time q-1, the attention probability distribution vector θ_(q-1) and the feature signal g_t through the single-layer feed-forward network Single_feedforward, and normalizes the output of Single_feedforward with softmax, as in the following formula: θ_(q,t) = softmax(ξ^T · tanh(W · decoder_(q-1) + V′ · g_t + U · (η * θ_(q-1))_t + b)).
The output under the action of the network Single_feedforward contains both the content and the position information of the features, since the content of the previous time q-1 is encoded in decoder_(q-1) and the position information is encoded in θ_(q-1); η, U, W, V′, ξ and b are all mixed attention parameters learned by the network during decoding training. Then, according to context_q = Σ_t θ_(q,t) · g_t, the weighted sum of all attention probability distribution vector weights [θ_(q,t-τ), ..., θ_(q,t), ..., θ_(q,t+τ)] under the mixed attention window length Length_win yields the context vector context_q of the current time q. To obtain all outputs that map closest to the real label sequence y′, the CTC loss function Loss_CTC = -ln Σ_(π: Trans(π)=y′) p(π|X̃) is used to compute the probability: the probability vector p(π_t|X̃) of the label at each time node is compared, character by character, with the real label sequence y′. During CTC decoding, a prefix beam search generates the decoded sequence; the beam size is empirically set to 8, the value at which the graph search performs best. The sequence is then assembled into a complete sentence and output, completing the lip reading recognition of the video character; the output result is, for example, "hello". The added mixed attention mechanism not only strengthens and concentrates the character probabilities predicted by CTC, but also fully considers the relevance between the current time and the preceding and following times, so that the composed sentence is more completely meaningful while the accuracy of lip reading recognition of the video character is improved.
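The attention unit described above, a single feed-forward scorer over content (the previous decoding state), location (the previous attention weights convolved with a kernel η), and the feature signals, followed by softmax normalization, can be sketched in NumPy. All parameter shapes and random values are illustrative assumptions.

```python
import numpy as np

def attention_weights(dec_prev, theta_prev, g, eta, U, W, V, xi, b):
    """Mixed attention: score each window position, then softmax-normalize.

    dec_prev   : previous decoding state                      (d,)
    theta_prev : previous attention weights over the window   (n,)
    g          : convolved feature signals, one per position  (n, d)
    """
    n = len(theta_prev)
    # Location term: convolve the previous attention weights with eta.
    loc = np.convolve(theta_prev, eta, mode="same")           # (n,)
    scores = np.array([
        xi @ np.tanh(W @ dec_prev + V @ g[t] + U * loc[t] + b)
        for t in range(n)
    ])
    e = np.exp(scores - scores.max())                         # stable softmax
    return e / e.sum()

rng = np.random.default_rng(3)
n, d, a = 5, 4, 3                        # window length 5 (tau = 2)
dec_prev = rng.standard_normal(d)
theta_prev = np.full(n, 1.0 / n)         # previous weights, uniform start
g = rng.standard_normal((n, d))
eta = np.array([0.2, 0.6, 0.2])          # location convolution kernel
U = rng.standard_normal(a)
W, V = rng.standard_normal((a, d)), rng.standard_normal((a, d))
xi, b = rng.standard_normal(a), rng.standard_normal(a)
theta = attention_weights(dec_prev, theta_prev, g, eta, U, W, V, xi, b)
print(theta.sum())  # 1.0
```

Feeding the previous weights back through the location term is what lets the window track how attention moved at the preceding step, rather than scoring content alone.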
Claims (6)
1. A time sequence centralized prediction method based on video character lip reading recognition, characterized by comprising the following steps:
step 1): decoding the lip reading content, comprising the following steps:
step 11): inputting Frames = {frame_1, frame_2, ..., frame_n}, representing a sequence of n consecutive video frames of human lip movement; extracting the spatio-temporal features of the lips with a 3D-CNN, then extracting the multi-channel lip features through a residual network with an embedded SENet module to obtain X̃ = {x_1, x_2, ..., x_m}, the m-dimensional feature vector of the human lips, wherein the three-dimensional convolutional neural network has a three-layer structure, each layer has a 3D convolutional layer, the feature map is batch-normalized after each convolution operation, a Leaky ReLU activation function is then used, a 3D dropout layer is added, and finally a 3D max-pooling layer is attached, except in the third layer;
step 12): setting up two bidirectional gated recurrent units, taking X̃ as the input of the first bidirectional gated recurrent unit, and normalizing the probability of the corresponding character at each time step with the softmax function after the fully connected layer of the second bidirectional gated recurrent unit;
step 13): introducing connectionist temporal classification: letting L = L′ ∪ {⟨blank⟩}, wherein ⟨blank⟩ denotes the blank label, L′ denotes all labels except the blank label, ∪ denotes the union, and L denotes the full label set; setting the alignment of CTC over the time steps T as π = (π_1, π_2, ..., π_T), wherein π represents a path of the label sequence to be recognized; defining a transformation Trans over the paths π, wherein Trans denotes the transformation by which all paths π approach y′ within the time steps T, and y′ denotes the real label sequence;
step 2): establishing an attention focusing window, and adding a front-back time sequence association, wherein the steps are as follows:
step 21): centering on q, taking h = [h_(q-τ), ..., h_q, ..., h_(q+τ)] as the mixed attention window, wherein q denotes the current time, h = [h_(q-τ), ..., h_q, ..., h_(q+τ)] denotes the hidden state sequence output by the two bidirectional gated recurrent units, τ denotes the length on either side of the mixed attention window, and the total window length is set to Length_win = 2τ + 1;
Step 22): calculating outThe Conv' represents a convolution kernel, t ∈ [ q- τ, q + τ [ ]]Represents a time period, context q The expression is concentrated in t epsilon [ q-tau, q + tau]Internal Length win The context vector of the position information and the content information of all the features under the length, and theta is set q =[θ q-τ ,...,θ q ,...,θ q+τ ]Theta of q =[θ q-τ ,...,θ q ,...,θ q+τ ]Expressed in t ∈ [ q- τ, q + τ)]Length of intrinsic or intrinsic Length win All attention probability distribution vectors ofRecord asInstant gameWherein g is t Representing the characteristic signal, theta, convolved over a time period t q,t Expressed in t ∈ [ q- τ, q + τ)]Attention probability distribution vector weight of (1);
step 3): strengthening the time sequence correlation before and after, comprising the following steps:
step 31): computing decoder q =Conv soft context q +b soft Said decoder q Indicating the decoding state at the current time q, Conv soft Representing a convolution kernel subjected to a logarithmic operation of the softmax function, b soft Representing an offset value subjected to a logarithmic operation of a softmax function;
step 32): decoding state decoder fused with last time q-1 q-1 Attention probability distribution vector θ q-1 And a characteristic signal g t CalculatingThe Single feedforward A single-layer feed-forward network is shown,represents convolution operation, eta, U, W, V', xi, b all represent mixed attention parameters learned by the network in the decoding training process, and tanh (DEG) representstanh activating function, if sufficient mixed attention parameter is learned, going to step 33), otherwise, repeating the training process of step 32);
step 33): calculating theta q,t =Attention(decoder q-1 ,θ q-1 ,g t ) Attention (·) means an Attention unit, which means that the result of step 32) is normalized, i.e. calculated, by a softmax activation function
Step 34): for p (y' | X) ~ ) Modeling was done, let p (y' | X) ~ )=p(π t |X ~ ) Wherein p (pi) t |X ~ )=softmax(decoder t ) The p (y' | X) ~ ) Feature vector X representing human lip ~ Probability vector, p (π) corresponding to the true tag sequence y t |X ~ ) Expressed in t ∈ [ q- τ, q + τ)]Inner figure lip feature vector X ~ Corresponding path pi t The probability vector of (1), softmax (·) denotes the softmax activation function, decoder t Expressed in t ∈ [ q- τ, q + τ)]All decoding states within;
step 35): by way of calculation in step 22)Weighted summation at t ∈ [ q- τ, q + τ]Intra-i-mixed attention window Length win All attention probability distribution vector weights [ theta ] under q,t-τ ,...,θ q,t ,...,θ q,t+τ ]Get the context vector context of the current time q q CalculatingThe Loss CTC Representing a CTC loss function for statistical character probability, aligning the probability vector p (π) of the tag at each time node t |X ~ ) And each real label sequence y' and performing character-by-character prediction, wherein during decoding, a prefix bundle search is used to generate a decoded sequence, and the sequence is assembledAnd outputting the whole sentence.
2. The method as claimed in claim 1, wherein in step 11), n is set to 75 empirically, and a residual network of a ResNet101 structure is used.
4. The method as claimed in claim 1, wherein in step 21), the length τ on either side of the mixed attention window is empirically taken as 2, and the total window length Length_win is taken as 5.
5. The method as claimed in claim 1, wherein in step 22), the dimensions of the hidden state sequence h and the context vector context_q are set to 512.
6. The method as claimed in claim 1, wherein in step 35), the beam size of the prefix beam search is empirically set to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010562822.4A CN111753704B (en) | 2020-06-19 | 2020-06-19 | Time sequence centralized prediction method based on video character lip reading recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111753704A CN111753704A (en) | 2020-10-09 |
CN111753704B true CN111753704B (en) | 2022-08-26 |
Family
ID=72676329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010562822.4A Active CN111753704B (en) | 2020-06-19 | 2020-06-19 | Time sequence centralized prediction method based on video character lip reading recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111753704B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113313056A (en) * | 2021-06-16 | 2021-08-27 | 中国科学技术大学 | Compact 3D convolution-based lip language identification method, system, device and storage medium |
CN113239903B (en) * | 2021-07-08 | 2021-10-01 | 中国人民解放军国防科技大学 | Cross-modal lip reading antagonism dual-contrast self-supervision learning method |
CN113658582B (en) * | 2021-07-15 | 2024-05-07 | 中国科学院计算技术研究所 | Lip language identification method and system for audio-visual collaboration |
CN113343937B (en) * | 2021-07-15 | 2022-09-02 | 北华航天工业学院 | Lip language identification method based on deep convolution and attention mechanism |
CN113435421B (en) * | 2021-08-26 | 2021-11-05 | 湖南大学 | Cross-modal attention enhancement-based lip language identification method and system |
CN114694255B (en) * | 2022-04-01 | 2023-04-07 | 合肥工业大学 | Sentence-level lip language recognition method based on channel attention and time convolution network |
CN115050092A (en) * | 2022-05-20 | 2022-09-13 | 宁波明家智能科技有限公司 | Lip reading algorithm and system for intelligent driving |
CN115886830A (en) * | 2022-12-09 | 2023-04-04 | 中科南京智能技术研究院 | Twelve-lead electrocardiogram classification method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111104884B (en) * | 2019-12-10 | 2022-06-03 | 电子科技大学 | Chinese lip language identification method based on two-stage neural network model |
- 2020-06-19 CN CN202010562822.4A patent/CN111753704B/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111753704B (en) | Time sequence centralized prediction method based on video character lip reading recognition | |
CN110188343B (en) | Multi-mode emotion recognition method based on fusion attention network | |
CN113158875B (en) | Image-text emotion analysis method and system based on multi-mode interaction fusion network | |
Gelly et al. | Optimization of RNN-based speech activity detection | |
Huang et al. | Image captioning with end-to-end attribute detection and subsequent attributes prediction | |
Saha et al. | Towards emotion-aided multi-modal dialogue act classification | |
CN112348075A (en) | Multi-mode emotion recognition method based on contextual attention neural network | |
CN110321418A (en) | A kind of field based on deep learning, intention assessment and slot fill method | |
CN111967272B (en) | Visual dialogue generating system based on semantic alignment | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN111178157A (en) | Chinese lip language identification method from cascade sequence to sequence model based on tone | |
CN110443129A (en) | Chinese lip reading recognition methods based on deep learning | |
CN114360005B (en) | Micro-expression classification method based on AU region and multi-level transducer fusion module | |
CN113537024B (en) | Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism | |
Li et al. | Key action and joint ctc-attention based sign language recognition | |
CN111428481A (en) | Entity relation extraction method based on deep learning | |
CN113516152A (en) | Image description method based on composite image semantics | |
CN114385802A (en) | Common-emotion conversation generation method integrating theme prediction and emotion inference | |
CN110569823A (en) | sign language identification and skeleton generation method based on RNN | |
Zhang et al. | Multi-modal emotion recognition based on deep learning in speech, video and text | |
CN111401116B (en) | Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network | |
CN113378919B (en) | Image description generation method for fusing visual sense and enhancing multilayer global features | |
CN114694255A (en) | Sentence-level lip language identification method based on channel attention and time convolution network | |
Boukdir et al. | Character-level arabic text generation from sign language video using encoder–decoder model | |
CN113095201A (en) | AU degree estimation model establishment method based on self-attention and uncertainty weighted multi-task learning among different regions of human face |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||