CN111753704B - Time sequence centralized prediction method based on video character lip reading recognition - Google Patents
- Publication number: CN111753704B (application CN202010562822.4A)
- Authority
- CN
- China
- Prior art keywords
- time
- attention
- sequence
- character
- lip
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a time sequence centralized prediction method based on video character lip reading recognition. First, a sequence of video frames of a person's lip movement is input and the spatio-temporal features of the lips are extracted; a residual network with an embedded SENet module then obtains the useful multi-channel lip features. These features are fed into bidirectional gated recurrent units (Bi-GRU) to obtain the probability distribution of the characters corresponding to the lip-movement contour, and connectionist temporal classification (CTC) is introduced to align the text labels with the characters over the time steps. Next, to capture the forward and backward temporal relations, a time-sequence-correlated attention window is built from the hidden states of the Bi-GRU and condensed into a context vector, and the probability distribution vectors of the context vector are subdivided and planned over the length of the attention window. Finally, an attention unit is set for the probability distribution vector at each current time and recombined into a unit that predicts the probability of the character corresponding to the lip reading. By concentrating the forward and backward association of the temporal information, the invention can effectively predict and recognize the lip-read content of a person in a video.
Description
Technical Field
The invention provides a time sequence centralized prediction method based on video character lip reading recognition. It mainly uses a multilayer convolutional neural network that extracts spatio-temporal and multi-channel features, together with a time-sequence prediction method with an embedded mixed attention mechanism, and belongs to the intersecting fields of video mining, pattern recognition, and computer vision.
Background
Lip reading recognition of a video character judges lip contours visually and connects them to the content of the characters spoken; to improve recognition accuracy, the effect is usually enhanced by combining the auditory channel, i.e. speech. It has therefore become an important research subject in video character pattern recognition in recent years, with high application prospects and value for assisting hearing-impaired people, transcribing videos with damaged audio, accident monitoring, and the like.
In lip reading recognition of a video character, the most critical step is to predict the sentence content corresponding to the motion contour of the human lips while speaking, for which the association of the forward and backward time sequence is particularly important. Time sequence centralized prediction aims to concentrate attention carrying temporal information, align the probability distribution of the content spoken at each time step with the text label sequence, improve the prediction probability, and let the recognized characters form a complete and meaningful sentence.
Time sequence centralized prediction for lip reading recognition of a video character involves the following three methods:
(1) Multilayer convolutional neural network: a 3D-CNN extracts the lip features of the character in the spatio-temporal dimensions of consecutive video frames, and a residual network with an embedded SENet module associates the channels and extracts multi-channel lip features; within the multilayer convolution architecture, useless features are removed while useful features are retained to the greatest extent.
(2) Time-sequence prediction: bidirectional gated recurrent units obtain the probability distribution of the characters, and connectionist temporal classification (CTC) establishes the paths and the loss function of the label sequence to be recognized, serving as a powerful base model for lip reading recognition.
(3) Mixed attention mechanism: all feature information and position information are mixed and concentrated using the forward-backward association of the time sequence, so that the character prediction at each time step is associated with the content of the preceding and following times; this avoids the long-term dependence problem and improves the accuracy of predicting preceding and following characters.
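The gated recurrent unit underlying the time-sequence prediction of method (2) can be sketched minimally. The following is an illustrative NumPy sketch, not the patent's implementation; all dimensions and the weight layout are assumptions, and a bidirectional layer would run one such cell forward and one backward over the sequence and concatenate their states.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: gates decide how much of the past state to keep."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde

rng = np.random.default_rng(1)
d_in, d_h = 6, 4
# Alternate input-to-hidden and hidden-to-hidden weights: Wz, Uz, Wr, Ur, Wh, Uh.
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0 else
          rng.standard_normal((d_h, d_h)) for i in range(6)]
h = np.zeros(d_h)
seq = rng.standard_normal((5, d_in))
for x in seq:          # forward pass over the sequence
    h = gru_cell(x, h, *params)
print(h.shape)  # (4,)
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden values stay bounded, which is one reason gated units avoid the vanishing-gradient problem mentioned later in the document.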
Disclosure of Invention
The technical problem is as follows: the invention aims to align the opening and closing motion of the lip contour with the character labels in lip reading recognition of a video character, obtain the character probability distribution from the lip features of the character, improve the association of the forward and backward time sequences through an attention mechanism mixing content and position information, predict the corresponding character probabilities in a concentrated manner, and connect all characters into complete, meaningful words and sentences.
The technical scheme is as follows: the time sequence centralized prediction method based on video character lip reading recognition according to the invention comprises the following steps:
step 1): decoding lip reading content, comprising the following steps:
step 11): input Frames = {frame_1, frame_2, ..., frame_n}, denoting a sequence of n consecutive video frames of human lip movement. First extract the spatio-temporal features of the lips with a 3D-CNN, then extract the multi-channel lip features through a residual network with an embedded SENet module to obtain X̃ = {x_1, x_2, ..., x_m}, the m-dimensional feature vector of the human lips. The three-dimensional convolutional neural network has a three-layer structure: each layer has a 3D convolutional layer, the feature map is batch-normalized after each convolution operation, a Leaky ReLU activation function is then applied, a 3D dropout layer is added, and finally a 3D max-pooling layer is attached, except in the third layer;
step 12): set up two Bidirectional Gated Recurrent Units (Bi-GRU), take X̃ as the input of the first Bi-GRU, and normalize the probability of the corresponding character at each time step with the softmax function after the fully connected layer of the second Bi-GRU;
step 13): introduce Connectionist Temporal Classification (CTC): let L = L′ ∪ {⟨blank⟩}, where ⟨blank⟩ denotes the blank label, L′ denotes all labels except the blank label, ∪ denotes the union, and L denotes the full label set. Set the alignment of CTC over the time steps T as π = (π_1, π_2, ..., π_T), where π represents a path of the label sequence to be recognized; define a transformation Trans over the paths π, where Trans denotes the transformation by which all paths π approach y′ within the time steps T, and y′ denotes the real label sequence;
step 2): establishing an attention focusing window, and adding a front-back time sequence association, wherein the steps are as follows:
step 21): centering on q, h ═ h q-τ ,...,h q ,...,h q+τ ]As a mixed attention window, the q represents the current time,h=[h q-τ ,...,h q ,...,h q+τ ]Representing the hidden state sequence output by two bidirectional gating circulation units, tau representing the Length of two sides of a mixed attention window, and setting the total Length of the window to be Length win =2τ+1;
Step 22): computingThe Conv' represents a convolution kernel, t ∈ [ q- τ, q + τ [ ]]Represents a time period, context q The expression is concentrated in t epsilon [ q-tau, q + tau]Length is an inner win The context vector of the position information and the content information of all the features under the length, and theta is set q =[θ q-τ ,...,θ q ,...,θ q+τ ]Theta of q =[θ q-τ ,...,θ q ,...,θ q+τ ]Expressed in t ∈ [ q- τ, q + τ)]Length of intrinsic or intrinsic Length win All attention probability distribution vectors ofRecord asInstant gameWherein g is t Representing the characteristic signal, θ, convolved over a time period t q,t Expressed in t ∈ [ q- τ, q + τ)]Attention probability distribution vector weight of (1);
and step 3): strengthening the time sequence correlation before and after, comprising the following steps:
step 31): compute decoder q =Conv soft context q +b soft Said decoder q Indicating the decoding status, Conv, at the current time q soft Representing a convolution kernel subjected to a logarithmic operation of the softmax function, b soft Representing an offset value subjected to a logarithmic operation of a softmax function;
step 32): decoder fusing decoding state of last time q-1 q-1 Attention probability distribution vector θ q-1 And a characteristic signal g t CalculatingThe Single feedforward A single-layer feed-forward network is shown,representing convolution operation, wherein eta, U, W, V', ξ and b all represent mixed attention parameters learned by the network in the decoding training process, tanh (-) represents a tanh activation function, and if enough mixed attention parameters are learned, the step 33) is carried out, otherwise, the training process of the step 32) is repeated;
step 33): calculating theta q,t =Attention(decoder q-1 ,θ q-1 ,g t ) Attention (·) means an Attention unit, which means that the result of step 32) is normalized, i.e. calculated, by a softmax activation function
Step 34): for p (y' | X) ~ ) Modeling is carried out to let p (y' | X) ~ )=p(π t |X ~ ) Wherein p (pi) t |X ~ )=softmax(decoder t ) The p (y' | X) ~ ) Feature vector X representing human lips ~ Probability vector, p (π) corresponding to the true tag sequence y t |X ~ ) Expressed in t ∈ [ q-tau, q + tau]Inner figure lip feature vector X ~ Corresponding path pi t The probability vector of (1), softmax (·) denotes the softmax activation function, decoder t Expressed in t ∈ [ q- τ, q + τ)]All decoding states within;
step 35): by way of calculation in step 22)Weighted summation is in t epsilon [ q-tau, q + tau]Internally and mixedly attention window Length win All attention probability distribution vector weights [ theta ] under q,t-τ ,...,θ q,t ,...,θ q,t+τ ]Obtaining the current time qContext vector context of q CalculatingThe Loss CTC Representing the CTC loss function for statistical character probability, aligning the probability vector p (π) of the label at each time node t |X ~ ) And each real label sequence y' is subjected to character-by-character prediction, wherein during decoding, a prefix bundle search is used for generating a decoding sequence, and the sequence is combined into a complete sentence to be output.
Further, in step 11), n is empirically set to 75, and a residual network with the ResNet101 structure is used.
Further, in step 13), the label set L is empirically given 27 character labels, comprising 1 blank label ⟨blank⟩ and 26 non-blank labels L′; the value of the time step T, corresponding to the input frame number n, is set to 75.
Further, in step 21), the length τ on either side of the mixed attention window is empirically taken as 2, so the total window length Length_win is taken as 5.
Further, in step 22), the dimensions of the hidden state sequence h and the context vector context_q are set to 512.
Further, in step 35), the beam size of the prefix beam search is empirically set to 8.
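The prefix beam search used in step 35) can be sketched in pure Python. This is a standard CTC prefix beam search (tracking, per prefix, the probability of ending in blank versus non-blank), offered as an illustrative sketch rather than the patent's implementation; the toy probability table is an assumption.

```python
from collections import defaultdict

import numpy as np

def prefix_beam_search(probs, beam_size=8, blank=0):
    """CTC prefix beam search over per-step label probabilities.

    probs : (T, V) array, probs[t, v] = p(label v at step t).
    Each beam entry stores [p_blank, p_nonblank] for a prefix.
    """
    T, V = probs.shape
    beams = {(): [1.0, 0.0]}
    for t in range(T):
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (pb, pnb) in beams.items():
            # extend with blank: the prefix is unchanged
            nxt[prefix][0] += (pb + pnb) * probs[t, blank]
            for v in range(V):
                if v == blank:
                    continue
                if prefix and prefix[-1] == v:
                    # repeating the last label merges unless a blank intervened
                    nxt[prefix][1] += pnb * probs[t, v]
                    nxt[prefix + (v,)][1] += pb * probs[t, v]
                else:
                    nxt[prefix + (v,)][1] += (pb + pnb) * probs[t, v]
        # keep only the beam_size most probable prefixes
        beams = dict(sorted(nxt.items(),
                            key=lambda kv: -(kv[1][0] + kv[1][1]))[:beam_size])
    return max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]

# Toy check: the per-step argmax path is blank,1,1,2, which collapses to (1, 2).
probs = np.array([[0.6, 0.2, 0.2],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
print(prefix_beam_search(probs, beam_size=8))  # (1, 2)
```

Unlike greedy decoding, the beam sums probability mass over all paths sharing a prefix, which is why a beam size such as 8 can recover sequences the argmax path would miss.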
Beneficial effects: the invention provides a time sequence centralized prediction method based on video character lip reading recognition, with the following specific beneficial effects:
(1) The method extracts the temporal and spatial lip features of the character in a video frame sequence through a 3D-CNN, uses SENet to associate channel information and extract multi-channel lip features, and fuses the features through shortcut connections and multilayer convolution in a residual network, so that a large number of useless lip features are removed, useful features are maximally retained, and the training parameters are reduced.
(2) The invention adopts bidirectional gated recurrent units as the sentence decoder and uses CTC to align the text labels and the decoded characters over the time steps, which not only solves the problems of long-term dependence and vanishing gradients, but also effectively associates the forward and backward information of the time sequence, making the prediction of the composed sentences more complete and accurate.
(3) The invention introduces a mixed attention mechanism into CTC, concentrating the content and position of all data in the attention window, strengthening the association between the predicted characters and the preceding and following information, and effectively solving CTC's inherent label alignment deficiency.
Drawings
Fig. 1 is a flow chart of a time-series centralized prediction method based on lip reading identification of a video character.
Fig. 2 is a diagram of a network architecture based on video character lip feature extraction.
Fig. 3 is a network architecture diagram of a bi-directional gated loop unit.
Fig. 4 is a lip reading recognition framework incorporating a hybrid attention mechanism.
Detailed Description
Some embodiments of the invention are described in more detail below with reference to the accompanying drawings.
In an implementation, fig. 1 is a flow chart of the time sequence centralized prediction method based on lip reading recognition of a video character. First the lip reading content is decoded and the lip features are extracted. A sequence of n consecutive lip video frames, Frames = {frame_1, frame_2, ..., frame_n}, is input to the 3D-CNN; n is empirically set to 75, since the spoken duration of a sentence averages 3 s and the frame rate of each video is about 25 fps. Each video frame has size H × W. The 3D-CNN has three layers, each with one 3D convolutional layer and one 3D max-pooling layer, and after the convolution operation the features are further processed, in order, by batch normalization, a Leaky ReLU activation function, and 3D dropout. The third layer of the 3D-CNN has only the convolution operation and is followed by a ResNet101 with bottleneck blocks, i.e. convolution kernels of 1×1,64, 3×3,64 and 1×1,256 are used in turn for convolution and parameter dimensionality reduction. SENet is embedded in the residual blocks of ResNet101 to correlate the information of each channel, extract more useful features, and suppress and remove useless ones, finally yielding the m-dimensional feature vector X̃ = {x_1, x_2, ..., x_m}; fig. 2 is the network structure diagram of the lip feature extraction for the video character. Two bidirectional gated recurrent units are then set up, and fig. 3 is the network structure diagram of the bidirectional gated recurrent unit: X̃ serves as the input of the first Bi-GRU, and the probability of the corresponding character at each time step is normalized with the softmax function after the fully connected layer of the second Bi-GRU.
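The squeeze-and-excitation reweighting that SENet applies inside the residual blocks can be sketched in NumPy. This is an illustrative sketch only: the array shapes, the reduction ratio r, and the random weights are assumptions, not the patent's configuration.

```python
import numpy as np

def se_reweight(feat, w1, b1, w2, b2):
    """Squeeze-and-excitation channel reweighting.

    feat : (C, T, H, W) feature map from the 3D conv stack.
    w1/b1, w2/b2 : weights of the two fully connected excitation layers.
    """
    C = feat.shape[0]
    # Squeeze: global average pool each channel over time and space.
    z = feat.reshape(C, -1).mean(axis=1)              # (C,)
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid gates.
    s = np.maximum(0.0, w1 @ z + b1)                  # (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))          # (C,) gates in (0, 1)
    # Scale: emphasise useful channels, suppress useless ones.
    return feat * s[:, None, None, None]

rng = np.random.default_rng(0)
C, r = 8, 2
feat = rng.standard_normal((C, 4, 6, 6))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
out = se_reweight(feat, w1, b1, w2, b2)
print(out.shape)  # (8, 4, 6, 6)
```

Because each gate lies in (0, 1), the module can only attenuate channels, which matches the document's description of suppressing useless features while retaining useful ones.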
Connectionist temporal classification (CTC) is then introduced. Since the lip video frame sequence to be recognized has 75 frames, and the number of characters normally spoken in the corresponding seconds can hardly exceed 75, a blank symbol ⟨blank⟩ is added as an additional label and allowed to recur. The label set L is empirically given 27 character labels, comprising 1 blank label ⟨blank⟩ and 26 non-blank labels L′ corresponding to the 26 letters; the entire lip reading recognition process uses an existing English sentence data set. The alignment of CTC over the time steps T is set as π = (π_1, π_2, ..., π_T), with T set to 75, the same as the number of frames n; π_t ∈ L indicates the path of a certain label sequence to be recognized and can select the index of the corresponding character at time step t. A transformation Trans is defined to convert the alignments π; it maps the values of all alignments π that approach the real label sequence y′ within the time steps T.
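The Trans transformation just described, merging repeated labels and then dropping blanks, can be sketched in pure Python. The blank index 0 and the letter-index encoding are illustrative assumptions.

```python
def trans_collapse(path, blank=0):
    """Collapse a CTC path: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for p in path:
        if p != prev:          # merge consecutive repeated labels
            if p != blank:     # drop the blank label
                out.append(p)
        prev = p
    return out

# With a=1, ..., z=26: the path h e e <b> l l <b> l o collapses to "hello".
print(trans_collapse([8, 5, 5, 0, 12, 12, 0, 12, 15]))  # [8, 5, 12, 12, 15]
```

Note that the blank between the two l-runs is what allows a doubled letter to survive the merge, which is why the blank label is allowed to recur in the alignment.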
Because CTC predicts the current time depending only on the hidden feature vector and does not directly participate in the feature prediction of adjacent frames, its inherent label alignment deficiency makes the independence assumption of the CTC output inaccurate. On this basis an attention focusing window is established and the forward-backward temporal association is added; fig. 4 shows the lip reading recognition framework with the introduced mixed attention mechanism, which contains the sentence decoding and the mixed-attention CTC. Centered on the current time q, the hidden state sequence h = [h_(q-τ), ..., h_q, ..., h_(q+τ)] is taken as the mixed attention window, with the dimension of h set to 512, where τ denotes the length on either side of the window and the total window length is Length_win = 2τ + 1; τ is empirically taken as 2, so Length_win is 5, the value at which the time sequence centralized prediction is most effective. With the convolution kernels Conv = [Conv′_(q-τ), ..., Conv′_q, ..., Conv′_(q+τ)] over the window length Length_win, g_t = Conv′ * h_t and context_q = Σ_t θ_(q,t) · g_t are computed, i.e. the context vector context_q is calculated from the convolution kernels Conv′ and the hidden state sequence h. The context vector is the core of the mixed-attention CTC and concentrates the position information and content information of all features over the length Length_win; the dimension of context_q is set to 512. Since context_q results from convolving the output hidden state sequence h with Conv′ over time, Conv′ and context_q can be regarded as the temporal convolution kernel and the temporal convolution feature, respectively. To account for the different probability distributions of the labels, the attention probability distribution vector θ_q = [θ_(q-τ), ..., θ_q, ..., θ_(q+τ)] is introduced for t ∈ [q-τ, q+τ].
Here Σ_t θ_(q,t) · g_t is denoted as context_q, i.e. the context vector of the current time q is computed, completing the construction of the mixed attention network; this requires the non-uniform attention probability distribution vector weights θ_(q,t) over Length_win, where g_t denotes the feature signal convolved at time t.
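The window focusing with τ = 2 can be sketched as follows: convolve the hidden states into feature signals g_t, then form the context vector as the attention-weighted sum over the window. The 1-D kernel, the dimensions, and the softmax-derived weights here are illustrative assumptions.

```python
import numpy as np

def context_vector(h, theta, conv_kernel, q, tau):
    """Weighted sum of convolved hidden states over [q - tau, q + tau]."""
    window = range(q - tau, q + tau + 1)
    k = len(conv_kernel)
    # g_t: feature signal from convolving hidden states around time t.
    g = {t: sum(conv_kernel[i] * h[t - k // 2 + i] for i in range(k))
         for t in window}
    return sum(theta[t] * g[t] for t in window)

rng = np.random.default_rng(2)
T, d, tau, q = 12, 4, 2, 6                      # window length 2*tau + 1 = 5
h = rng.standard_normal((T, d))                 # Bi-GRU hidden states
scores = rng.standard_normal(T)
theta = np.exp(scores) / np.exp(scores).sum()   # attention weights
kernel = [0.25, 0.5, 0.25]                      # illustrative 1-D kernel
ctx = context_vector(h, theta, kernel, q, tau)
print(ctx.shape)  # (4,)
```

The context vector is linear in the attention weights, so sharpening θ toward one window position makes it approach that position's convolved feature signal.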
Finally the forward-backward temporal association is strengthened. The attention probability distribution vector weight θ_(q,t) of the current time q depends on the decoding state decoder_(q-1) of the previous time q-1. Through p(y′|X̃) = p(π_t|X̃) = softmax(decoder_t), the probability vector p(y′|X̃) of the lip feature vector X̃ corresponding to the real label sequence y′ within t ∈ [q-τ, q+τ] is modeled, with each p(π_q|x) = [p(π_q=1|x), p(π_q=2|x), ..., p(π_q=L|x)]^T denoting the probability vector of the text label aligned to the label at the current time q. The decoding state decoder_q represents the logarithm of the softmax and is computed as decoder_q = Conv_soft · context_q + b_soft. The attention probability distribution vector weight θ_(q,t) of the current time q is then computed by the attention unit Attention, which fuses the decoding state decoder_(q-1) of the previous time q-1, the attention probability distribution vector θ_(q-1) and the feature signal g_t through the single-layer feed-forward network Single_feedforward, and normalizes the output of Single_feedforward with softmax, as in the following formula: θ_(q,t) = softmax(ξ^T · tanh(W · decoder_(q-1) + V′ · g_t + U · (η * θ_(q-1))_t + b)).
The output under the action of the network Single_feedforward contains both the content and the position information of the features, since the content of the previous time q-1 is encoded in decoder_(q-1) and the position information is encoded in θ_(q-1); η, U, W, V′, ξ and b are all mixed attention parameters learned by the network during decoding training. Then, according to context_q = Σ_t θ_(q,t) · g_t, the weighted sum of all attention probability distribution vector weights [θ_(q,t-τ), ..., θ_(q,t), ..., θ_(q,t+τ)] under the mixed attention window length Length_win yields the context vector context_q of the current time q. To obtain all outputs that map closest to the real label sequence y′, the CTC loss function Loss_CTC = -ln Σ_(π: Trans(π)=y′) p(π|X̃) is used to compute the probability: the probability vector p(π_t|X̃) of the label at each time node is compared, character by character, with the real label sequence y′. During CTC decoding, a prefix beam search generates the decoded sequence; the beam size is empirically set to 8, the value at which the graph search performs best. The sequence is then assembled into a complete sentence and output, completing the lip reading recognition of the video character; the output result is, for example, "hello". The added mixed attention mechanism not only strengthens and concentrates the character probabilities predicted by CTC, but also fully considers the relevance between the current time and the preceding and following times, so that the composed sentence is more completely meaningful while the accuracy of lip reading recognition of the video character is improved.
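The attention unit described above, a single feed-forward scorer over content (the previous decoding state), location (the previous attention weights convolved with a kernel η), and the feature signals, followed by softmax normalization, can be sketched in NumPy. All parameter shapes and random values are illustrative assumptions.

```python
import numpy as np

def attention_weights(dec_prev, theta_prev, g, eta, U, W, V, xi, b):
    """Mixed attention: score each window position, then softmax-normalize.

    dec_prev   : previous decoding state                      (d,)
    theta_prev : previous attention weights over the window   (n,)
    g          : convolved feature signals, one per position  (n, d)
    """
    n = len(theta_prev)
    # Location term: convolve the previous attention weights with eta.
    loc = np.convolve(theta_prev, eta, mode="same")           # (n,)
    scores = np.array([
        xi @ np.tanh(W @ dec_prev + V @ g[t] + U * loc[t] + b)
        for t in range(n)
    ])
    e = np.exp(scores - scores.max())                         # stable softmax
    return e / e.sum()

rng = np.random.default_rng(3)
n, d, a = 5, 4, 3                        # window length 5 (tau = 2)
dec_prev = rng.standard_normal(d)
theta_prev = np.full(n, 1.0 / n)         # previous weights, uniform start
g = rng.standard_normal((n, d))
eta = np.array([0.2, 0.6, 0.2])          # location convolution kernel
U = rng.standard_normal(a)
W, V = rng.standard_normal((a, d)), rng.standard_normal((a, d))
xi, b = rng.standard_normal(a), rng.standard_normal(a)
theta = attention_weights(dec_prev, theta_prev, g, eta, U, W, V, xi, b)
print(theta.sum())  # 1.0
```

Feeding the previous weights back through the location term is what lets the window track how attention moved at the preceding step, rather than scoring content alone.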
Claims (6)
1. A time sequence centralized prediction method based on video character lip reading recognition, characterized by comprising the following steps:
step 1): decoding the lip reading content, comprising the following steps:
step 11): inputting Frames = {frame_1, frame_2, ..., frame_n}, representing a sequence of n consecutive video frames of human lip movement; extracting the spatio-temporal features of the lips with a 3D-CNN, then extracting the multi-channel lip features through a residual network with an embedded SENet module to obtain X̃ = {x_1, x_2, ..., x_m}, the m-dimensional feature vector of the human lips, wherein the three-dimensional convolutional neural network has a three-layer structure, each layer has a 3D convolutional layer, the feature map is batch-normalized after each convolution operation, a Leaky ReLU activation function is then used, a 3D dropout layer is added, and finally a 3D max-pooling layer is attached, except in the third layer;
step 12): setting up two bidirectional gated recurrent units, taking X̃ as the input of the first bidirectional gated recurrent unit, and normalizing the probability of the corresponding character at each time step with the softmax function after the fully connected layer of the second bidirectional gated recurrent unit;
step 13): introducing connectionist temporal classification: letting L = L′ ∪ {⟨blank⟩}, wherein ⟨blank⟩ denotes the blank label, L′ denotes all labels except the blank label, ∪ denotes the union, and L denotes the full label set; setting the alignment of CTC over the time steps T as π = (π_1, π_2, ..., π_T), wherein π represents a path of the label sequence to be recognized; defining a transformation Trans over the paths π, wherein Trans denotes the transformation by which all paths π approach y′ within the time steps T, and y′ denotes the real label sequence;
step 2): establishing an attention focusing window, and adding a front-back time sequence association, wherein the steps are as follows:
step 21): centering on q, taking h = [h_(q-τ), ..., h_q, ..., h_(q+τ)] as the mixed attention window, wherein q denotes the current time, h = [h_(q-τ), ..., h_q, ..., h_(q+τ)] denotes the hidden state sequence output by the two bidirectional gated recurrent units, τ denotes the length on either side of the mixed attention window, and the total window length is set to Length_win = 2τ + 1;
Step 22): calculating outThe Conv' represents a convolution kernel, t ∈ [ q- τ, q + τ [ ]]Represents a time period, context q The expression is concentrated in t epsilon [ q-tau, q + tau]Internal Length win The context vector of the position information and the content information of all the features under the length, and theta is set q =[θ q-τ ,...,θ q ,...,θ q+τ ]Theta of q =[θ q-τ ,...,θ q ,...,θ q+τ ]Expressed in t ∈ [ q- τ, q + τ)]Length of intrinsic or intrinsic Length win All attention probability distribution vectors ofRecord asInstant gameWherein g is t Representing the characteristic signal, theta, convolved over a time period t q,t Expressed in t ∈ [ q- τ, q + τ)]Attention probability distribution vector weight of (1);
step 3): strengthening the time sequence correlation before and after, comprising the following steps:
step 31): computing decoder q =Conv soft context q +b soft Said decoder q Indicating the decoding state at the current time q, Conv soft Representing a convolution kernel subjected to a logarithmic operation of the softmax function, b soft Representing an offset value subjected to a logarithmic operation of a softmax function;
step 32): decoding state decoder fused with last time q-1 q-1 Attention probability distribution vector θ q-1 And a characteristic signal g t CalculatingThe Single feedforward A single-layer feed-forward network is shown,represents convolution operation, eta, U, W, V', xi, b all represent mixed attention parameters learned by the network in the decoding training process, and tanh (DEG) representstanh activating function, if sufficient mixed attention parameter is learned, going to step 33), otherwise, repeating the training process of step 32);
step 33): calculating theta q,t =Attention(decoder q-1 ,θ q-1 ,g t ) Attention (·) means an Attention unit, which means that the result of step 32) is normalized, i.e. calculated, by a softmax activation function
Step 34): for p (y' | X) ~ ) Modeling was done, let p (y' | X) ~ )=p(π t |X ~ ) Wherein p (pi) t |X ~ )=softmax(decoder t ) The p (y' | X) ~ ) Feature vector X representing human lip ~ Probability vector, p (π) corresponding to the true tag sequence y t |X ~ ) Expressed in t ∈ [ q- τ, q + τ)]Inner figure lip feature vector X ~ Corresponding path pi t The probability vector of (1), softmax (·) denotes the softmax activation function, decoder t Expressed in t ∈ [ q- τ, q + τ)]All decoding states within;
step 35): by way of calculation in step 22)Weighted summation at t ∈ [ q- τ, q + τ]Intra-i-mixed attention window Length win All attention probability distribution vector weights [ theta ] under q,t-τ ,...,θ q,t ,...,θ q,t+τ ]Get the context vector context of the current time q q CalculatingThe Loss CTC Representing a CTC loss function for statistical character probability, aligning the probability vector p (π) of the tag at each time node t |X ~ ) And each real label sequence y' and performing character-by-character prediction, wherein during decoding, a prefix bundle search is used to generate a decoded sequence, and the sequence is assembledAnd outputting the whole sentence.
2. The method as claimed in claim 1, wherein in step 11), n is set to 75 empirically, and a residual network of a ResNet101 structure is used.
4. The method as claimed in claim 1, wherein in step 21), the length τ on either side of the mixed attention window is empirically taken as 2, and the total window length Length_win is taken as 5.
5. The method as claimed in claim 1, wherein in step 22), the dimensions of the hidden state sequence h and the context vector context_q are set to 512.
6. The method as claimed in claim 1, wherein in step 35), the beam size of the prefix beam search is empirically set to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010562822.4A CN111753704B (en) | 2020-06-19 | 2020-06-19 | Time sequence centralized prediction method based on video character lip reading recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111753704A CN111753704A (en) | 2020-10-09 |
CN111753704B true CN111753704B (en) | 2022-08-26 |
Family
ID=72676329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010562822.4A Active CN111753704B (en) | 2020-06-19 | 2020-06-19 | Time sequence centralized prediction method based on video character lip reading recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111753704B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113313056A (en) * | 2021-06-16 | 2021-08-27 | 中国科学技术大学 | Compact 3D convolution-based lip language identification method, system, device and storage medium |
CN113239903B (en) * | 2021-07-08 | 2021-10-01 | 中国人民解放军国防科技大学 | Cross-modal lip reading antagonism dual-contrast self-supervision learning method |
CN113658582B (en) * | 2021-07-15 | 2024-05-07 | 中国科学院计算技术研究所 | Lip language identification method and system for audio-visual collaboration |
CN113343937B (en) * | 2021-07-15 | 2022-09-02 | 北华航天工业学院 | Lip language identification method based on deep convolution and attention mechanism |
CN113435421B (en) * | 2021-08-26 | 2021-11-05 | 湖南大学 | Cross-modal attention enhancement-based lip language identification method and system |
CN114694255B (en) * | 2022-04-01 | 2023-04-07 | 合肥工业大学 | Sentence-level lip language recognition method based on channel attention and time convolution network |
CN115050092A (en) * | 2022-05-20 | 2022-09-13 | 宁波明家智能科技有限公司 | Lip reading algorithm and system for intelligent driving |
CN115886830A (en) * | 2022-12-09 | 2023-04-04 | 中科南京智能技术研究院 | Twelve-lead electrocardiogram classification method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111104884B (en) * | 2019-12-10 | 2022-06-03 | 电子科技大学 | Chinese lip language identification method based on two-stage neural network model |
- 2020-06-19 CN CN202010562822.4A patent/CN111753704B/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111753704B (en) | Time sequence centralized prediction method based on video character lip reading recognition | |
CN110188343B (en) | Multi-mode emotion recognition method based on fusion attention network | |
CN113158875B (en) | Image-text emotion analysis method and system based on multi-mode interaction fusion network | |
Gelly et al. | Optimization of RNN-based speech activity detection | |
Huang et al. | Image captioning with end-to-end attribute detection and subsequent attributes prediction | |
Saha et al. | Towards emotion-aided multi-modal dialogue act classification | |
CN112348075A (en) | Multi-mode emotion recognition method based on contextual attention neural network | |
CN110321418A (en) | A kind of field based on deep learning, intention assessment and slot fill method | |
CN111967272B (en) | Visual dialogue generating system based on semantic alignment | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN111178157A (en) | Chinese lip language identification method from cascade sequence to sequence model based on tone | |
CN110443129A (en) | Chinese lip reading recognition methods based on deep learning | |
CN114360005B (en) | Micro-expression classification method based on AU region and multi-level transducer fusion module | |
CN113537024B (en) | Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism | |
Li et al. | Key action and joint ctc-attention based sign language recognition | |
CN111428481A (en) | Entity relation extraction method based on deep learning | |
CN113516152A (en) | Image description method based on composite image semantics | |
CN114385802A (en) | Common-emotion conversation generation method integrating theme prediction and emotion inference | |
CN110569823A (en) | sign language identification and skeleton generation method based on RNN | |
Zhang et al. | Multi-modal emotion recognition based on deep learning in speech, video and text | |
CN111401116B (en) | Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network | |
CN113378919B (en) | Image description generation method for fusing visual sense and enhancing multilayer global features | |
CN114694255A (en) | Sentence-level lip language identification method based on channel attention and time convolution network | |
Boukdir et al. | Character-level arabic text generation from sign language video using encoder–decoder model | |
CN113095201A (en) | AU degree estimation model establishment method based on self-attention and uncertainty weighted multi-task learning among different regions of human face |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||