CN111753704B - Time sequence centralized prediction method based on video character lip reading recognition - Google Patents

Time sequence centralized prediction method based on video character lip reading recognition

Info

Publication number
CN111753704B
CN111753704B (application CN202010562822.4A)
Authority
CN
China
Prior art keywords
time
attention
sequence
character
lip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010562822.4A
Other languages
Chinese (zh)
Other versions
CN111753704A (en)
Inventor
陈志�
刘玲
岳文静
祝驭航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010562822.4A priority Critical patent/CN111753704B/en
Publication of CN111753704A publication Critical patent/CN111753704A/en
Application granted granted Critical
Publication of CN111753704B publication Critical patent/CN111753704B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a time sequence centralized prediction method based on video character lip reading recognition. First, a sequence of video frames of a person's lip motion is input and the spatio-temporal features of the lips are extracted; a residual network embedded with SENet modules is adopted to obtain the useful multi-channel features of the person's lips; the features are input into bidirectional gated recurrent units to obtain the probability distribution of the characters corresponding to the lip-motion contour; and a connectionist temporal classification (CTC) algorithm is introduced to align the text labels and the characters over the time steps. Then, for the forward and backward temporal relation, the hidden states of the bidirectional gated recurrent units are used to build a temporally correlated attention window that is concentrated into a context vector, and the probability distribution vector of the context vector is subdivided and planned over the length of the attention window. Finally, an attention unit is set for the probability distribution vector of each current time and recombined into a unit that can predict the probability of the character corresponding to the lip reading. Through the forward and backward centralized association of the temporal information, the invention can effectively predict and recognize the lip reading content of a person in a video.

Description

Time sequence centralized prediction method based on video character lip reading recognition
Technical Field
The invention provides a time sequence centralized prediction method based on video character lip reading recognition, which mainly uses a multilayer convolutional neural network capable of extracting spatio-temporal and multi-channel features together with a time sequence prediction method embedded with a mixed attention mechanism, and belongs to the field of applications at the intersection of video mining, pattern recognition and computer vision.
Background
Video character lip reading recognition judges the lip contours visually and connects them to the content spoken by the person; to improve recognition accuracy, the effect is usually enhanced by combining the auditory channel, namely speech. It has therefore become an important research topic in the field of video character pattern recognition in recent years, and has high application prospects and value for assisting hearing-impaired people, transcribing videos with damaged audio, accident monitoring, and so on.
In video character lip recognition, the most critical step is to predict the sentence content corresponding to the motion contour of the person's lips while speaking, and the association of the preceding and following time steps is particularly important for this. Time sequence centralized prediction aims to concentrate attention that carries temporal information, align the probability distribution of the content spoken by the person at each time step with the text label sequence, improve the prediction probability, and make the recognized characters form a complete and meaningful sentence.
Time sequence centralized prediction for video character lip reading recognition involves the following three methods:
(1) Multilayer convolutional neural networks: a 3D-CNN is used to extract the person's lip features in the spatio-temporal dimensions of consecutive video frames, and a residual network embedded with SENet modules is adopted to correlate channels and extract multi-channel lip features; in the multilayer convolution architecture, useless features are removed while useful features are retained to the greatest extent.
(2) Time sequence prediction: bidirectional gated recurrent units are used to obtain the probability distribution of the characters, and connectionist temporal classification (CTC) is used to establish the paths and the loss function of the label sequence to be recognized, serving as a strong base model for lip reading recognition.
(3) A mixed attention mechanism: all feature information and position information are mixed and concentrated using the forward and backward association of the time series, so that the character prediction at each time step is associated with the content of the preceding and following times; this avoids the long-term dependence problem and improves the accuracy of predicting preceding and following characters.
Disclosure of Invention
The technical problem is as follows: the invention aims to align the opening and closing action of the lip contour with the character labels in video character lip reading recognition, obtain the character probability distribution using the person's lip features, improve the forward and backward temporal association through an attention mechanism that mixes content and position information, predict the corresponding character probabilities in a centralized way, and connect all characters to form complete and meaningful words and sentences.
The technical scheme is as follows: the time sequence centralized prediction method based on video character lip reading recognition of the invention comprises the following steps:
step 1): decoding lip reading content, comprising the following steps:
Step 11): input Frames, where Frames = {frame_1, frame_2, ..., frame_n} denotes a sequence of n consecutive video frames of the person's lip motion. First, a 3D-CNN is used to extract the spatio-temporal features of the lips, and then the multi-channel lip features are extracted through a residual network embedded with SENet modules, obtaining X' = {x_1, x_2, ..., x_m}, where X' = {x_1, x_2, ..., x_m} denotes the m-dimensional feature vector of the person's lips. The three-dimensional convolutional neural network has a three-layer structure: each layer has a 3D convolutional layer, the feature map is batch-normalized after each convolution operation, then a Leaky ReLU activation function is used and a 3D dropout layer is added, and finally a 3D max-pooling layer is attached except in the third layer;
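By way of illustration, a minimal PyTorch sketch of the front end described in step 11) follows. The channel widths, kernel sizes and dropout rate are assumptions chosen for readability rather than values fixed by the method, and the SENet-embedded ResNet101 branch is reduced here to a single squeeze-and-excitation block.

import torch
import torch.nn as nn

class STCNNFrontEnd(nn.Module):
    """Three 3D-conv layers as in step 11): Conv3d -> BatchNorm3d -> LeakyReLU -> Dropout3d,
    with a 3D max-pooling layer after the first two layers only (assumed channel widths)."""
    def __init__(self, in_channels: int = 3, dropout: float = 0.3):
        super().__init__()
        def block(cin, cout, pool):
            layers = [nn.Conv3d(cin, cout, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                      nn.BatchNorm3d(cout),
                      nn.LeakyReLU(0.01, inplace=True),
                      nn.Dropout3d(dropout)]
            if pool:
                layers.append(nn.MaxPool3d(kernel_size=(1, 2, 2)))
            return nn.Sequential(*layers)
        self.layer1 = block(in_channels, 32, pool=True)
        self.layer2 = block(32, 64, pool=True)
        self.layer3 = block(64, 96, pool=False)   # third layer: no pooling

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, channels, n_frames, H, W), e.g. n_frames = 75
        return self.layer3(self.layer2(self.layer1(frames)))

class SEBlock3D(nn.Module):
    """Squeeze-and-excitation over channels, standing in for the SENet modules embedded in
    the residual blocks: reweight channels so that useful features are kept."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w   # channel-wise reweighting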
Step 12): arrange two bidirectional gated recurrent units (Bi-GRU), take X' as the input of the first bidirectional gated recurrent unit, and normalize the probability of the corresponding character at each time step using the softmax function after the fully connected layer of the second bidirectional gated recurrent unit;
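A corresponding sketch of the two-Bi-GRU decoder of step 12), with assumed sizes; a hidden size of 256 per direction gives the 512-dimensional hidden states h_t used later, and 27 output labels follow the preferred values stated further below. Log-softmax is used instead of softmax only because the CTC loss shown later expects log-probabilities.

import torch
import torch.nn as nn

class BiGRUDecoder(nn.Module):
    """Two stacked bidirectional GRUs, a fully connected layer, and a per-time-step
    (log-)softmax over the character labels, as in step 12)."""
    def __init__(self, feat_dim: int = 512, hidden: int = 256, n_labels: int = 27):
        super().__init__()
        self.gru1 = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.gru2 = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, feat_dim) lip feature sequence X'
        h, _ = self.gru1(x)
        h, _ = self.gru2(h)      # hidden states h_t, later used by the attention window
        return torch.log_softmax(self.fc(h), dim=-1)   # per-time-step label probabilities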
Step 13): introduce connectionist temporal classification (CTC): let L = L' ∪ {∅}, where ∅ denotes the blank label, L' denotes all labels except the blank label, ∪ denotes the union, and L denotes the full label set. Set the CTC alignment π = (π_1, π_2, ..., π_T) over the time steps T, where π denotes a path over the label sequence to be recognized. Define the transformation Trans applied to the paths π, where Trans denotes the transformation that maps all paths π over the T time steps onto y', and y' denotes the ground-truth label sequence;
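The label set and the many-to-one mapping Trans of step 13) can be sketched as follows: merge consecutive repeated labels, then delete blanks. The concrete alphabet ordering and the blank index 0 are assumptions made only for this example.

BLANK = 0                                        # assumed index of the blank label in L
LETTERS = "abcdefghijklmnopqrstuvwxyz"           # the 26 non-blank labels L'
LABELS = ["<blank>"] + list(LETTERS)             # |L| = 27 labels in total

def trans(path):
    """Trans: map an alignment pi = (pi_1, ..., pi_T) onto a label sequence y'
    by merging consecutive repeats and then removing blank labels."""
    out, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            out.append(p)
        prev = p
    return "".join(LETTERS[i - 1] for i in out)

# e.g. a path of length T = 12 that collapses onto "hello"
print(trans([8, 8, 0, 5, 0, 12, 12, 0, 12, 15, 15, 0]))   # -> "hello"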
Step 2): establish an attention focusing window and add the forward and backward temporal association, with the following steps:
Step 21): take h = [h_{q-τ}, ..., h_q, ..., h_{q+τ}], centered on q, as the mixed attention window, where q denotes the current time, h = [h_{q-τ}, ..., h_q, ..., h_{q+τ}] denotes the hidden-state sequence output by the two bidirectional gated recurrent units, τ denotes the length on either side of the mixed attention window, and the total window length is set to Length_win = 2τ + 1;
Step 22): compute context_q = Σ_{t=q-τ}^{q+τ} θ_{q,t} · g_t with g_t = Conv' ∗ h_t, where Conv' denotes a convolution kernel, t ∈ [q-τ, q+τ] denotes the time period, and context_q denotes the context vector that concentrates the position information and content information of all features within t ∈ [q-τ, q+τ] under the length Length_win; set θ_q = [θ_{q-τ}, ..., θ_q, ..., θ_{q+τ}], where θ_q = [θ_{q-τ}, ..., θ_q, ..., θ_{q+τ}] denotes all attention probability distribution vectors within t ∈ [q-τ, q+τ] under the length Length_win, g_t denotes the feature signal convolved over the time period t, and θ_{q,t} denotes the attention probability distribution vector weight at t ∈ [q-τ, q+τ];
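A minimal sketch of the window and context computation of steps 21) and 22), assuming, as in step 22), that the context vector is the attention-weighted sum of the convolved hidden states g_t within the window; the tensor shapes follow the preferred sizes (dimension 512, τ = 2).

import torch
import torch.nn.functional as F

def context_vector(h, theta_q, conv_kernel, q, tau=2):
    """Step 22): g_t = Conv' * h_t, context_q = sum_t theta_{q,t} * g_t over the
    mixed attention window t in [q - tau, q + tau] (Length_win = 2 * tau + 1)."""
    # h: (T, d) hidden-state sequence from the Bi-GRUs; theta_q: (2*tau + 1,) attention
    # weights over the window; conv_kernel: (d, d, k) 1-D convolution kernel Conv'.
    window = h[q - tau : q + tau + 1]                                   # (Length_win, d)
    g = F.conv1d(window.t().unsqueeze(0), conv_kernel,                  # convolve along time
                 padding=conv_kernel.shape[-1] // 2).squeeze(0).t()     # (Length_win, d)
    return (theta_q.unsqueeze(1) * g).sum(dim=0)                        # context_q: (d,)

# toy usage
h = torch.randn(75, 512)
theta_q = torch.softmax(torch.randn(5), dim=0)
conv_kernel = torch.randn(512, 512, 3)
print(context_vector(h, theta_q, conv_kernel, q=10).shape)   # torch.Size([512])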
Step 3): strengthen the forward and backward temporal correlation, comprising the following steps:
Step 31): compute decoder_q = Conv_soft ∗ context_q + b_soft, where decoder_q denotes the decoding state at the current time q, Conv_soft denotes a convolution kernel subjected to the logarithmic softmax operation, and b_soft denotes a bias value subjected to the logarithmic softmax operation;
Step 32): fuse the decoding state decoder_{q-1} of the previous time q-1, the attention probability distribution vector θ_{q-1} and the feature signal g_t, and compute Single_feedforward(decoder_{q-1}, θ_{q-1}, g_t), a score of the form η tanh(U · decoder_{q-1} + W · (V' ∗ θ_{q-1}) + ξ · g_t + b), where Single_feedforward denotes a single-layer feed-forward network, ∗ denotes the convolution operation, η, U, W, V', ξ and b all denote mixed attention parameters learned by the network during the decoding training process, and tanh(·) denotes the tanh activation function; if sufficient mixed attention parameters have been learned, go to step 33), otherwise repeat the training process of step 32);
Step 33): compute θ_{q,t} = Attention(decoder_{q-1}, θ_{q-1}, g_t), where Attention(·) denotes the attention unit, meaning that the result of step 32) is normalized by a softmax activation function, i.e. θ_{q,t} = exp(Single_feedforward(decoder_{q-1}, θ_{q-1}, g_t)) / Σ_{t'=q-τ}^{q+τ} exp(Single_feedforward(decoder_{q-1}, θ_{q-1}, g_{t'}));
Step 34): model p(y'|X'): let p(y'|X') = p(π_t|X'), where p(π_t|X') = softmax(decoder_t); p(y'|X') denotes the probability vector that the person's lip feature vector X' corresponds to the ground-truth label sequence y', p(π_t|X') denotes the probability vector of the path π_t corresponding to the person's lip feature vector X' within t ∈ [q-τ, q+τ], softmax(·) denotes the softmax activation function, and decoder_t denotes all decoding states within t ∈ [q-τ, q+τ];
Step 35): using the computation of step 22), context_q = Σ_{t=q-τ}^{q+τ} θ_{q,t} · g_t, i.e. the weighted summation over all attention probability distribution vector weights [θ_{q,t-τ}, ..., θ_{q,t}, ..., θ_{q,t+τ}] within t ∈ [q-τ, q+τ] under the mixed attention window length Length_win, obtain the context vector context_q of the current time q. Compute Loss_CTC = -ln Σ_{π ∈ Trans⁻¹(y')} Π_{t=1}^{T} p(π_t|X'), where Loss_CTC denotes the CTC loss function used for the statistical character probability; the probability vector p(π_t|X') of the label at each time node is aligned with each ground-truth label sequence y' and character-by-character prediction is performed. During decoding, a prefix beam search is used to generate the decoded sequence, and the sequence is assembled into a complete sentence and output.
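A sketch of the attention unit of steps 31)-33) follows. The exact way the learned parameters η, U, W, V', ξ and b are combined inside the single-layer feed-forward network is an assumed location-aware form: the method fixes the parameter list, but the concrete wiring below is illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedAttention(nn.Module):
    """Score each window position t from (decoder_{q-1}, theta_{q-1}, g_t) with a
    single-layer feed-forward network, then softmax-normalise into theta_{q,t}."""
    def __init__(self, dec_dim=512, feat_dim=512, attn_dim=128, loc_kernel=3):
        super().__init__()
        self.U = nn.Linear(dec_dim, attn_dim, bias=False)       # acts on decoder_{q-1} (content)
        self.xi = nn.Linear(feat_dim, attn_dim, bias=False)     # acts on g_t
        self.V = nn.Conv1d(1, attn_dim, loc_kernel, padding=loc_kernel // 2)  # V' * theta_{q-1} (location)
        self.W = nn.Linear(attn_dim, attn_dim, bias=False)
        self.b = nn.Parameter(torch.zeros(attn_dim))
        self.eta = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, decoder_prev, theta_prev, g):
        # decoder_prev: (dec_dim,), theta_prev: (Length_win,), g: (Length_win, feat_dim)
        loc = self.W(self.V(theta_prev.view(1, 1, -1)).squeeze(0).t())   # (Length_win, attn_dim)
        scores = self.eta(torch.tanh(self.U(decoder_prev) + loc + self.xi(g) + self.b))
        return F.softmax(scores.squeeze(-1), dim=0)                      # theta_{q,t}

At training time, the loss of step 35) can be computed with an off-the-shelf CTC implementation; the sketch below uses PyTorch's nn.CTCLoss with blank index 0 purely as an illustration. The prefix beam search with beam size 8 used at decoding time is not reproduced here.

import torch
import torch.nn as nn

T, batch, n_labels = 75, 4, 27              # 75 time steps, 27 labels (blank + 26 letters)
log_probs = torch.randn(T, batch, n_labels, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, n_labels, (batch, 20))          # ground-truth label sequences y'
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                   # sums over all paths pi with Trans(pi) = y'
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                             # backpropagation through the network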
Further, in step 11), n is empirically set to 75, and a residual network of the ResNet101 structure is used.
Further, in step 13), 27 character labels are empirically set in the label set L, comprising 1 blank label ∅ and 26 non-blank labels L'; the value of the time step T, corresponding to the input frame number n, is set to 75.
Further, in step 21), the length τ on either side of the mixed attention window is empirically taken as 2, so the total window length Length_win is taken as 5.
Further, in step 22), the dimensions of the hidden-state sequence h and the context vector context_q are set to 512.
Further, in step 35), the beam size of the prefix beam search is empirically set to 8.
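For convenience, the empirically preferred values stated above can be collected in one place; a small configuration sketch (the names are illustrative only):

from dataclasses import dataclass

@dataclass(frozen=True)
class LipReadingConfig:
    n_frames: int = 75        # input frame count n (about 3 s of speech at ~25 fps)
    time_steps: int = 75      # CTC time steps T, equal to n
    n_labels: int = 27        # 1 blank label + 26 non-blank letter labels
    tau: int = 2              # half-width of the mixed attention window
    window_len: int = 5       # Length_win = 2 * tau + 1
    hidden_dim: int = 512     # dimension of h and context_q
    beam_size: int = 8        # prefix beam search width at decoding time

cfg = LipReadingConfig()
assert cfg.window_len == 2 * cfg.tau + 1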
Beneficial effects: the invention provides a time sequence centralized prediction method based on video character lip reading recognition, with the following specific beneficial effects:
(1) The method extracts the temporal and spatial lip features of the person in the video frame sequence through a 3D-CNN, uses SENet to correlate channel information and extract multi-channel lip features, and fuses the features through the shortcut connections and multilayer convolutions of a residual network, so that a large number of useless lip features are removed, useful features are retained to the greatest extent, and the training parameters are reduced.
(2) The invention adopts bidirectional gated recurrent units as the sentence decoder and uses CTC to align the text labels and the decoded characters over the time steps, which not only solves the long-term dependence and vanishing-gradient problems, but also effectively correlates the forward and backward information of the time series, making the prediction of the composed sentences more complete and accurate.
(3) The method introduces a mixed attention mechanism into the CTC, concentrating the content and position of all data in the attention window, strengthening the association between the predicted characters and the preceding and following information, and effectively alleviating the inherent label-alignment deficiency of CTC.
Drawings
Fig. 1 is a flow chart of the time sequence centralized prediction method based on video character lip reading recognition.
Fig. 2 is a diagram of the network architecture for video character lip feature extraction.
Fig. 3 is a network architecture diagram of the bidirectional gated recurrent unit.
Fig. 4 is the lip reading recognition framework incorporating the mixed attention mechanism.
Detailed Description
Some embodiments of the invention are described in more detail below with reference to the accompanying drawings.
In an implementation, fig. 1 is a flow chart of the time sequence centralized prediction method based on video character lip reading recognition. First, the lip reading content is decoded and the lip features are extracted. A sequence of n consecutive lip video frames Frames = {frame_1, frame_2, ..., frame_n} is input to the 3D-CNN; n is empirically set to 75, since the spoken duration of a sentence averages about 3 s and the frame rate of each video is about 25 fps. Each video frame has a size of H × W. The 3D-CNN has three layers, each with one 3D convolutional layer and one 3D max-pooling layer, and after the convolution operation the features are further processed in sequence with batch normalization, a Leaky ReLU activation function and 3D dropout. The third layer of the 3D-CNN has only the convolution operation and is followed by a ResNet101 with bottleneck blocks, i.e. convolution kernels of 1 × 1, 64, then 3 × 3, 64, then 1 × 1, 256 are used in turn for convolution and parameter dimensionality reduction. SENet is embedded in the residual blocks of the ResNet101 to correlate the information of each channel, extract more useful features, and suppress and remove useless features, finally yielding the m-dimensional feature vector X' = {x_1, x_2, ..., x_m}; fig. 2 is the network structure diagram for video character lip feature extraction. Then two bidirectional gated recurrent units are arranged; fig. 3 is the network structure diagram of the bidirectional gated recurrent unit. X' is taken as the input of the first bidirectional gated recurrent unit, and the softmax function after the fully connected layer of the second bidirectional gated recurrent unit normalizes the probability of the corresponding character at each time step. Connectionist temporal classification (CTC) is then introduced. Since the lip video frame sequence to be recognized is 75 frames long and the number of characters normally spoken in the corresponding seconds hardly exceeds 75, a blank symbol ∅ is added as an additional label and is allowed to repeat, at which point the label set is L = L' ∪ {∅}. L is empirically set to 27 character labels, including 1 blank label ∅ and 26 non-blank labels L', the 26 corresponding to the 26 letters; the whole lip reading recognition process uses an existing English sentence data set. The CTC alignment over the time steps T is set as π = (π_1, π_2, ..., π_T), with T set to 75, the same as the number n of frames, where π_t ∈ L indicates the path of a certain label sequence to be recognized and selects the index of the corresponding character at time step t. A transformation Trans is defined to convert the alignments π; this transformation maps all alignments π over the T time steps onto the ground-truth label sequence y'.
The CTC predicts the current time depending only on the hidden feature vector and does not directly involve the feature prediction of adjacent frames; this inherent label-alignment deficiency makes the conditional-independence assumption of the CTC output inaccurate. On this basis, an attention focusing window is established and the forward and backward temporal association is added. Fig. 4 shows the lip reading recognition framework with the mixed attention mechanism introduced, which contains the sentence decoding and the mixed-attention CTC. With the current time q as the center, the hidden-state sequence h is taken as the mixed attention window, i.e. h = [h_{q-τ}, ..., h_q, ..., h_{q+τ}], and the dimension of h is set to 512, where τ denotes the length on either side of the mixed attention window and the total window length is Length_win = 2τ + 1; τ is empirically taken as 2, so Length_win is 5, the value at which the time sequence centralized prediction is most effective. With the convolution kernel Conv = [Conv'_{q-τ}, ..., Conv'_q, ..., Conv'_{q+τ}] over the length Length_win, compute context_q = Σ_{t=q-τ}^{q+τ} θ_{q,t} · g_t with g_t = Conv' ∗ h_t, i.e. the context vector context_q is calculated using the convolution kernel Conv' and the hidden-state sequence h. The context vector is the core of the mixed-attention CTC and concentrates the position information and content information of all features under the length Length_win; the dimension of context_q is set to 512. Since context_q results from convolving the output hidden-state sequence h with Conv' over time, Conv' and context_q can represent the temporal convolution kernel and the temporal convolution feature, respectively. Taking into account the different probability distributions of the labels, the attention probability distribution vector θ_q = [θ_{q-τ}, ..., θ_q, ..., θ_{q+τ}] is introduced for t ∈ [q-τ, q+τ]. Computing the context vector context_q of the current time q and completing the construction of the mixed attention network requires the Length_win non-uniform attention probability distribution vector weights θ_{q,t}, with g_t denoting the feature signal convolved at time t.
Finally, the forward and backward temporal correlation is strengthened. The attention probability distribution vector weight θ_{q,t} of the current time q depends on the decoding state decoder_{q-1} of the previous time q-1. The probability vector p(y'|X') that the person's lip feature vector X' corresponds to the ground-truth label sequence y' within t ∈ [q-τ, q+τ] is modeled by p(y'|X') = p(π_t|X') = softmax(decoder_t), where each p(π_q|x) = [p(π_q = 1|x), p(π_q = 2|x), ..., p(π_q = L|x)]^T denotes the probability vector of the text label aligned with the label at the current time q, and the decoding state decoder_q represents the logarithmic softmax, computed as decoder_q = Conv_soft ∗ context_q + b_soft. The attention probability distribution vector weight θ_{q,t} of the current time q is then obtained by the Attention unit, which integrates, through the single-layer feed-forward network Single_feedforward, the decoding state decoder_{q-1} of the previous time q-1, the attention probability distribution vector θ_{q-1} and the feature signal g_t, and normalizes the output of the network Single_feedforward with softmax: θ_{q,t} = exp(Single_feedforward(decoder_{q-1}, θ_{q-1}, g_t)) / Σ_{t'=q-τ}^{q+τ} exp(Single_feedforward(decoder_{q-1}, θ_{q-1}, g_{t'})), with Single_feedforward(decoder_{q-1}, θ_{q-1}, g_t) taking the form η tanh(U · decoder_{q-1} + W · (V' ∗ θ_{q-1}) + ξ · g_t + b). The output under the action of the network Single_feedforward contains both the content and the position information of the features, since the content of the previous time q-1 is encoded in decoder_{q-1} and the position information is encoded in θ_{q-1}, where η, U, W, V', ξ and b are all mixed attention parameters learned by the network during the decoding training process. Then, according to context_q = Σ_{t=q-τ}^{q+τ} θ_{q,t} · g_t, i.e. the weighted summation over all attention probability distribution vector weights [θ_{q,t-τ}, ..., θ_{q,t}, ..., θ_{q,t+τ}] under the mixed attention window length Length_win, the context vector context_q of the current time q is obtained. To obtain all outputs that map closest to the ground-truth label sequence y', the CTC loss function Loss_CTC = -ln Σ_{π ∈ Trans⁻¹(y')} Π_{t=1}^{T} p(π_t|X') is used to compute the probability, i.e. through the CTC loss function the probability vector p(π_t|X') of the labels at each time node is compared, character by character, with the ground-truth label sequence y'. During CTC decoding, a prefix beam search is used to generate the decoded sequence; the beam size is empirically set to 8, at which the graph-search capability is optimal, and the sequence is then assembled into a complete sentence and output, completing the video character lip reading recognition process; the output result is, for example, "hello". The added mixed attention mechanism not only makes the character probabilities predicted by the CTC more concentrated, but also fully considers the correlation between the current time and the preceding and following times, so that the composed sentence is more complete and meaningful, while the accuracy of video character lip reading recognition is improved.
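For completeness, a minimal decoding sketch: the method generates the output with a prefix beam search of beam size 8, which is not reproduced here; the simpler best-path decode below (per-frame argmax followed by the Trans collapse) only illustrates how the per-time-step probabilities become a sentence such as "hello", under the same assumed 27-label alphabet as before.

import torch

LABELS = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz")   # assumed 27-label set, blank at index 0

def best_path_decode(log_probs: torch.Tensor) -> str:
    """log_probs: (T, 27) per-frame log-probabilities from the decoder.
    Greedy stand-in for prefix beam search: argmax per frame, merge repeats, drop blanks."""
    path = log_probs.argmax(dim=-1).tolist()
    out, prev = [], None
    for p in path:
        if p != prev and p != 0:
            out.append(LABELS[p])
        prev = p
    return "".join(out)

# toy check: a hand-built path that collapses onto "hi"
fake = torch.full((6, 27), -10.0)
for t, idx in enumerate([8, 8, 0, 9, 9, 0]):   # h h _ i i _
    fake[t, idx] = 0.0
print(best_path_decode(fake))   # -> "hi"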

Claims (6)

1. A time sequence centralized prediction method based on video character lip reading recognition, characterized by comprising the following steps:
step 1): decoding the lip reading content, comprising the following steps:
step 11): inputting Frames, where Frames = {frame_1, frame_2, ..., frame_n} denotes a sequence of n consecutive video frames of the person's lip motion, extracting the spatio-temporal lip features using a 3D-CNN, then extracting the multi-channel lip features through a residual network embedded with SENet modules, and obtaining X' = {x_1, x_2, ..., x_m}, where X' = {x_1, x_2, ..., x_m} denotes the m-dimensional feature vector of the person's lips; the three-dimensional convolutional neural network has a three-layer structure, each layer has a 3D convolutional layer, the feature map is batch-normalized after each convolution operation, then a Leaky ReLU activation function is used and a 3D dropout layer is added, and finally a 3D max-pooling layer is attached except in the third layer;
step 12): arranging two bidirectional gated recurrent units, taking X' as the input of the first bidirectional gated recurrent unit, and normalizing the probability of the corresponding character at each time step using the softmax function after the fully connected layer of the second bidirectional gated recurrent unit;
step 13): introducing connectionist temporal classification: letting L = L' ∪ {∅}, where ∅ denotes the blank label, L' denotes all labels except the blank label, ∪ denotes the union, and L denotes the full label set; setting the CTC alignment π = (π_1, π_2, ..., π_T) over the time steps T, where π denotes a path over the label sequence to be recognized; defining the transformation Trans applied to the paths π, where Trans denotes the transformation that maps all paths π over the T time steps onto y', and y' denotes the ground-truth label sequence;
step 2): establishing an attention focusing window and adding the forward and backward temporal association, with the following steps:
step 21): taking h = [h_{q-τ}, ..., h_q, ..., h_{q+τ}], centered on q, as the mixed attention window, where q denotes the current time, h = [h_{q-τ}, ..., h_q, ..., h_{q+τ}] denotes the hidden-state sequence output by the two bidirectional gated recurrent units, τ denotes the length on either side of the mixed attention window, and the total window length is set to Length_win = 2τ + 1;
step 22): computing context_q = Σ_{t=q-τ}^{q+τ} θ_{q,t} · g_t with g_t = Conv' ∗ h_t, where Conv' denotes a convolution kernel, t ∈ [q-τ, q+τ] denotes the time period, and context_q denotes the context vector that concentrates the position information and content information of all features within t ∈ [q-τ, q+τ] under the length Length_win; setting θ_q = [θ_{q-τ}, ..., θ_q, ..., θ_{q+τ}], where θ_q = [θ_{q-τ}, ..., θ_q, ..., θ_{q+τ}] denotes all attention probability distribution vectors within t ∈ [q-τ, q+τ] under the length Length_win, g_t denotes the feature signal convolved over the time period t, and θ_{q,t} denotes the attention probability distribution vector weight at t ∈ [q-τ, q+τ];
step 3): strengthening the forward and backward temporal correlation, comprising the following steps:
step 31): computing decoder_q = Conv_soft ∗ context_q + b_soft, where decoder_q denotes the decoding state at the current time q, Conv_soft denotes a convolution kernel subjected to the logarithmic softmax operation, and b_soft denotes a bias value subjected to the logarithmic softmax operation;
step 32): fusing the decoding state decoder_{q-1} of the previous time q-1, the attention probability distribution vector θ_{q-1} and the feature signal g_t, and computing Single_feedforward(decoder_{q-1}, θ_{q-1}, g_t), a score of the form η tanh(U · decoder_{q-1} + W · (V' ∗ θ_{q-1}) + ξ · g_t + b), where Single_feedforward denotes a single-layer feed-forward network, ∗ denotes the convolution operation, η, U, W, V', ξ and b all denote mixed attention parameters learned by the network during the decoding training process, and tanh(·) denotes the tanh activation function; if sufficient mixed attention parameters have been learned, going to step 33), otherwise repeating the training process of step 32);
step 33): computing θ_{q,t} = Attention(decoder_{q-1}, θ_{q-1}, g_t), where Attention(·) denotes the attention unit, meaning that the result of step 32) is normalized by a softmax activation function, i.e. θ_{q,t} = exp(Single_feedforward(decoder_{q-1}, θ_{q-1}, g_t)) / Σ_{t'=q-τ}^{q+τ} exp(Single_feedforward(decoder_{q-1}, θ_{q-1}, g_{t'}));
step 34): modeling p(y'|X'): letting p(y'|X') = p(π_t|X'), where p(π_t|X') = softmax(decoder_t); p(y'|X') denotes the probability vector that the person's lip feature vector X' corresponds to the ground-truth label sequence y', p(π_t|X') denotes the probability vector of the path π_t corresponding to the person's lip feature vector X' within t ∈ [q-τ, q+τ], softmax(·) denotes the softmax activation function, and decoder_t denotes all decoding states within t ∈ [q-τ, q+τ];
step 35): using the computation of step 22), context_q = Σ_{t=q-τ}^{q+τ} θ_{q,t} · g_t, i.e. the weighted summation over all attention probability distribution vector weights [θ_{q,t-τ}, ..., θ_{q,t}, ..., θ_{q,t+τ}] within t ∈ [q-τ, q+τ] under the mixed attention window length Length_win, to obtain the context vector context_q of the current time q; computing Loss_CTC = -ln Σ_{π ∈ Trans⁻¹(y')} Π_{t=1}^{T} p(π_t|X'), where Loss_CTC denotes the CTC loss function used for the statistical character probability, aligning the probability vector p(π_t|X') of the label at each time node with each ground-truth label sequence y' and performing character-by-character prediction, wherein during decoding a prefix beam search is used to generate the decoded sequence, and the sequence is assembled into a complete sentence and output.
2. The method as claimed in claim 1, wherein in step 11), n is set to 75 empirically, and a residual network of a ResNet101 structure is used.
3. The method as claimed in claim 1, wherein in step 13), the label set L is empirically set to 27 character labels, comprising 1 blank label ∅ and 26 non-blank labels L', and the value of the time step T, corresponding to the input frame number n, is set to 75.
4. The method as claimed in claim 1, wherein in step 21), the length τ on either side of the mixed attention window is empirically taken as 2, so the total window length Length_win is taken as 5.
5. The method as claimed in claim 1, wherein in step 22), the dimensions of the hidden-state sequence h and the context vector context_q are set to 512.
6. The method as claimed in claim 1, wherein in step 35), the beam size of the prefix beam search is empirically set to 8.
CN202010562822.4A 2020-06-19 2020-06-19 Time sequence centralized prediction method based on video character lip reading recognition Active CN111753704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010562822.4A CN111753704B (en) 2020-06-19 2020-06-19 Time sequence centralized prediction method based on video character lip reading recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010562822.4A CN111753704B (en) 2020-06-19 2020-06-19 Time sequence centralized prediction method based on video character lip reading recognition

Publications (2)

Publication Number Publication Date
CN111753704A CN111753704A (en) 2020-10-09
CN111753704B (en) 2022-08-26

Family

ID=72676329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010562822.4A Active CN111753704B (en) 2020-06-19 2020-06-19 Time sequence centralized prediction method based on video character lip reading recognition

Country Status (1)

Country Link
CN (1) CN111753704B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313056A (en) * 2021-06-16 2021-08-27 中国科学技术大学 Compact 3D convolution-based lip language identification method, system, device and storage medium
CN113239903B (en) * 2021-07-08 2021-10-01 中国人民解放军国防科技大学 Cross-modal lip reading antagonism dual-contrast self-supervision learning method
CN113658582B (en) * 2021-07-15 2024-05-07 中国科学院计算技术研究所 Lip language identification method and system for audio-visual collaboration
CN113343937B (en) * 2021-07-15 2022-09-02 北华航天工业学院 Lip language identification method based on deep convolution and attention mechanism
CN113435421B (en) * 2021-08-26 2021-11-05 湖南大学 Cross-modal attention enhancement-based lip language identification method and system
CN114694255B (en) * 2022-04-01 2023-04-07 合肥工业大学 Sentence-level lip language recognition method based on channel attention and time convolution network
CN115050092A (en) * 2022-05-20 2022-09-13 宁波明家智能科技有限公司 Lip reading algorithm and system for intelligent driving
CN115886830A (en) * 2022-12-09 2023-04-04 中科南京智能技术研究院 Twelve-lead electrocardiogram classification method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104884B (en) * 2019-12-10 2022-06-03 电子科技大学 Chinese lip language identification method based on two-stage neural network model

Also Published As

Publication number Publication date
CN111753704A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111753704B (en) Time sequence centralized prediction method based on video character lip reading recognition
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN113158875B (en) Image-text emotion analysis method and system based on multi-mode interaction fusion network
Gelly et al. Optimization of RNN-based speech activity detection
Huang et al. Image captioning with end-to-end attribute detection and subsequent attributes prediction
Saha et al. Towards emotion-aided multi-modal dialogue act classification
CN112348075A (en) Multi-mode emotion recognition method based on contextual attention neural network
CN110321418A (en) A kind of field based on deep learning, intention assessment and slot fill method
CN111967272B (en) Visual dialogue generating system based on semantic alignment
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN111178157A (en) Chinese lip language identification method from cascade sequence to sequence model based on tone
CN110443129A (en) Chinese lip reading recognition methods based on deep learning
CN114360005B (en) Micro-expression classification method based on AU region and multi-level transducer fusion module
CN113537024B (en) Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism
Li et al. Key action and joint ctc-attention based sign language recognition
CN111428481A (en) Entity relation extraction method based on deep learning
CN113516152A (en) Image description method based on composite image semantics
CN114385802A (en) Common-emotion conversation generation method integrating theme prediction and emotion inference
CN110569823A (en) sign language identification and skeleton generation method based on RNN
Zhang et al. Multi-modal emotion recognition based on deep learning in speech, video and text
CN111401116B (en) Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network
CN113378919B (en) Image description generation method for fusing visual sense and enhancing multilayer global features
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
Boukdir et al. Character-level arabic text generation from sign language video using encoder–decoder model
CN113095201A (en) AU degree estimation model establishment method based on self-attention and uncertainty weighted multi-task learning among different regions of human face

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant