CN112949481A - Speaker-independent lip language recognition method and system
- Publication number: CN112949481A
- Application number: CN202110226432.4A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06V40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
Abstract
The invention relates to a speaker-independent lip language recognition method and system. The method comprises the following steps: acquiring a training lip language picture sequence; inputting the training lip language picture sequence into an identity and semantic deep coupling model to obtain feature sequences, and calculating the loss of each network; iteratively optimizing the coupling model and the lip language prediction network with the weighted sum of the losses as the optimization target to obtain an optimal recognition model; and inputting the picture sequence to be recognized into the recognition model to obtain the recognized text. The method encodes the identity features and the semantic features of the lip language picture sequence separately, constrains the identity encoding process with an identity contrast loss between different samples and an identity difference loss between different frames of the same sample, constrains the semantic encoding process with a supervision loss, and constrains the learned identity and semantic features with an identity and semantic coupling reconstruction network, thereby effectively preventing identity information from being mixed into the semantic features and improving the recognition accuracy of the lip language recognition model under speaker-independent conditions.
Description
Technical Field
The invention relates to the technical field of intelligent human-computer interaction, and in particular to a speaker-independent lip language recognition method and system.
Background
Lip language recognition is a new mode of human-computer interaction that understands what a speaker is saying purely from visual information, by analyzing the dynamic changes of the lip region. The technology can overcome the shortcomings of speech recognition in noisy environments and effectively improve the reliability of semantic analysis systems. Lip language recognition has broad application prospects and can be used for recognizing spoken interaction in all kinds of noisy settings, such as hospitals and shopping malls. In addition, lip language recognition can assist the semantic understanding of hearing- and speech-impaired people and thus help them develop the ability to speak.
At present, the accuracy of lip language recognition technology is still far from meeting the requirements of practical applications. Lip articulation is formed by the coupling of speaker identity and speech content in the space-time domain. Different speakers differ greatly in lip appearance and speaking style, and even the same speaker differs in speaking style and speaking speed at different times and in different scenes. Identity information therefore causes serious interference with the semantic content during recognition, and the high coupling between speaker identity information and semantic content severely restricts improvements in the accuracy of lip language recognition systems.
Disclosure of Invention
The invention aims to provide a speaker-independent lip language recognition method and system, which can eliminate the interference of speaker identity information on the recognition result and improve the accuracy of lip language recognition.
In order to achieve the purpose, the invention provides the following scheme:
a speaker-independent lip language recognition method, comprising:
acquiring training lip language picture sequences of a plurality of speaker samples;
inputting a plurality of training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity characteristic sequence, a semantic characteristic sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
calculating the contrast loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
calculating difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
calculating supervision loss according to the predicted text sequence and the real text sequence;
taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, and performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimal lip language recognition model;
acquiring a lip language picture sequence to be identified;
and inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
Preferably, the 2D dense convolutional neural network and the 3D dense convolutional neural network are each composed of a dense convolutional neural network framework; the dense convolutional neural network framework comprises a dense connection transition layer, a pooling layer and a full connection layer which are connected in sequence; the dense connection transition layer comprises a plurality of dense connection transition units; each dense connection transition unit comprises a dense connection module and a transition module;
the lip language prediction network is a seq2seq network based on an attention-free mechanism; the lip language prediction network comprises an input module, an Encoder module, a Decoder module and a classification module;
the input module is respectively connected with the Encoder module and the Decoder module, the input module is used for acquiring a semantic feature sequence and a word vector sequence corresponding to the semantic feature sequence and embedding semantic vectors at different moments in the semantic feature sequence and word vectors in the word vector sequence into time position information, the Decoder module is respectively connected with the Encoder module and the classification module, and the Encoder module is used for performing deep feature mining on the semantic feature sequence embedded with the time position information to obtain a first feature sequence; the Decoder module is used for obtaining a second characteristic sequence according to the attention of the first characteristic sequence and the attention of a word vector sequence embedded with time position information, and the classification module is used for judging and obtaining a prediction text sequence according to the second characteristic sequence.
Preferably, the calculation formula of the contrast loss is as follows:
where L_c is the contrast loss; N represents the number of speaker samples; the two image terms denote the t-th frame image of sample i and the t'-th frame image of sample j, and the two feature terms denote their corresponding identity features; y indicates whether the two groups of samples match in identity (y = 1 when the identities of the two groups of samples are the same, otherwise y = 0); margin is a set threshold.
Preferably, the calculation formula of the difference loss is as follows:
where L_d is the difference loss; N represents the number of speaker samples; the two image terms denote the j-th and k-th frame images of sample i, and the two feature terms denote their corresponding identity features; T represents the number of frames in a speaker sample.
Preferably, the calculation formula of the difference loss of the gaussian distribution is as follows:
where L_dd represents the Gaussian distribution difference loss; the image term denotes the t-th frame image of the i-th sample in the P group of speaker samples and the feature term denotes the semantic features of that frame; Σ_P and Σ_Q are the covariance matrices of the semantic features of speaker-sample groups P and Q; μ_P and μ_Q are the mean vectors of the semantic features of groups P and Q; det denotes the value of the matrix determinant; Z represents the dimension of the semantic coding features; and T represents the number of frames in a speaker sample.
Preferably, the correlation loss is calculated by the formula:
where L_R represents the correlation loss; T represents the number of frames in a speaker sample; N represents the number of speaker samples; the image term denotes the t-th frame image of sample i, and the two feature terms denote its identity features and its semantic features.
Preferably, the calculation formula of the reconstruction error loss is:
where L_con represents the reconstruction error loss; T represents the number of frames in a speaker sample; N represents the number of speaker samples; the image term denotes the t-th frame image of sample i; the remaining terms denote its identity features and the reconstruction obtained from the connected (concatenated) identity feature vector and semantic feature vector.
Preferably, the formula for calculating the supervision loss is as follows:
where L_seq represents the supervision loss; N represents the number of speaker samples; T represents the number of frames in a speaker sample; C represents the number of text categories; the two probability terms are the true probability and the predicted probability that the text category of the t-th frame of speaker sample i is j; s_i is the encoding matrix of the semantic features of sample i; E_p represents the self-attention-based lip language prediction network; the image terms denote the 1st, 2nd, …, T-th frame images of the i-th sample, and the corresponding feature terms denote the semantic features of those frames; the t-th lip language prediction output is determined from the semantic features of all frames and the lip language prediction outputs of items 0 to t-1.
preferably, the iterative optimization of the identity and semantic deep coupling model and the lip language prediction network, with the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, to obtain an optimal lip language recognition model includes:
taking the weighted loss as an optimization function, and utilizing an Adam optimizer to perform iterative learning on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimized identity and semantic deep coupling model and a lip language prediction network;
wherein the optimization function is L(θ) = L_seq + α_1·L_c + α_2·L_d + α_3·L_dd + α_4·L_R + α_5·L_con, where L(θ) is the weighted loss; L_seq is the supervision loss; L_c is the contrast loss; L_d is the difference loss; L_dd represents the Gaussian distribution difference loss; L_R represents the correlation loss; L_con represents the reconstruction error loss; and α_1, α_2, α_3, α_4 and α_5 represent the weights of the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss and the reconstruction error loss, respectively.
A system for speaker independent lip language recognition, comprising:
the first acquisition module is used for acquiring training lip language picture sequences of a plurality of speaker samples;
the feature output module is used for inputting the training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
the first calculation module is used for calculating the contrast loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
the second calculation module is used for calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
the third calculation module is used for calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
the fourth calculation module is used for calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
the fifth calculation module is used for calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
the text output module is used for inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
the sixth calculation module is used for calculating supervision loss according to the predicted text sequence and the real text sequence;
the training module is used for performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model;
the second acquisition module is used for acquiring a lip language picture sequence to be identified;
and the recognition module is used for inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the lip language identification method and system for the speaker independence adopt two groups of independent networks, namely a 2D dense convolution neural network and a 3D dense convolution neural network, to respectively encode the identity information and the semantic information of a lip language picture sequence to obtain an identity characteristic sequence and a semantic characteristic sequence. The invention restricts the identity coding process by the identity comparison loss of different samples and the identity difference loss of different frames of the same sample, and solves the problem of influence on the identification result due to the interference of the identity information of the speaker. According to the method, the identity and semantic deep coupling model and the lip language prediction network are iteratively optimized by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the reconstruction error loss and the supervision loss as optimization targets, so that the problem of overfitting of a learned feature space is avoided. Semantic features are effectively prevented from being mixed into identity information, and therefore the recognition accuracy of the lip language recognition model under the speaker-independent condition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a method for speaker independent lip language identification in accordance with the present invention;
FIG. 2 is a schematic block diagram of an identification method in an embodiment of the invention;
fig. 3 is a diagram of an identity and semantic feature coding network framework in an embodiment of the present invention, where fig. 3(a) is a diagram of a dense convolutional neural network structure, fig. 3(b) is a diagram of a 2D convolutional structure in an identity coding network, and fig. 3(c) is a diagram of a 3D convolutional structure in a semantic coding network;
FIG. 4 is a diagram of an identity and semantic coupling reconstruction network in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a structure of a lip language prediction network based on a self-attention mechanism according to an embodiment of the present invention;
FIG. 6 is a block diagram of a lip language identification system for speaker independence in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a speaker-independent lip language recognition method and system, which can eliminate the interference of speaker identity information on the recognition result and improve the accuracy of lip language recognition.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a method for recognizing speaker-independent lip language according to the present invention, and as shown in fig. 1, the method for recognizing speaker-independent lip language according to the present embodiment includes:
step 100: and acquiring training lip language picture sequences of a plurality of speaker samples.
Step 200: inputting a plurality of training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity characteristic sequence, a semantic characteristic sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain the reconstructed image sequence.
Step 300: And calculating the contrast loss according to the identity characteristics of different speaker samples in the identity characteristic sequence.
Step 301: and calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence.
Step 302: and calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method.
Step 303: and calculating the correlation loss according to the identity characteristic sequence and the semantic characteristic sequence.
Step 304: and calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence.
Step 400: and inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence.
Step 500: and calculating the supervision loss according to the predicted text sequence and the real text sequence.
Step 600: and carrying out iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model.
Step 700: and acquiring a lip language picture sequence to be identified.
Step 800: and inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
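For illustration only, the following is a minimal PyTorch sketch of how the three sub-networks of the identity and semantic deep coupling model of step 200 could be composed to produce the identity feature sequence, the semantic feature sequence and the reconstructed image sequence. The module interfaces, feature dimensions and tensor shapes are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class CouplingModel(nn.Module):
    """Sketch of the identity and semantic deep coupling model.

    identity_encoder : 2D CNN applied frame by frame  -> identity feature sequence
    semantic_encoder : 3D CNN applied to the clip     -> semantic feature sequence
    reconstructor    : deconvolution network decoding the concatenated features
    """
    def __init__(self, identity_encoder, semantic_encoder, reconstructor):
        super().__init__()
        self.identity_encoder = identity_encoder
        self.semantic_encoder = semantic_encoder
        self.reconstructor = reconstructor

    def forward(self, frames):                       # frames: (N, T, C, H, W)
        n, t, c, h, w = frames.shape
        # identity features: encode each frame independently with the 2D network
        ident = self.identity_encoder(frames.reshape(n * t, c, h, w)).reshape(n, t, -1)
        # semantic features: encode the whole clip with the 3D network, input (N, C, T, H, W)
        sem = self.semantic_encoder(frames.permute(0, 2, 1, 3, 4))   # assumed (N, T, D_s)
        # reconstruction: decode the concatenated identity + semantic vectors per frame
        fused = torch.cat([ident, sem], dim=-1).reshape(n * t, -1)
        recon = self.reconstructor(fused).reshape(n, t, c, h, w)
        return ident, sem, recon
```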
Preferably, the 2D dense convolutional neural network and the 3D dense convolutional neural network are each composed of a dense convolutional neural network framework; the dense convolutional neural network framework comprises a dense connection transition layer, a pooling layer and a full connection layer which are connected in sequence; the dense connection transition layer comprises a plurality of dense connection transition units; each dense connection transition unit comprises a dense connection module and a transition module.
Fig. 3 shows the identity and semantic feature coding network of an embodiment of the present invention; fig. 3(a) is a structural diagram of the dense convolutional neural network, which, as shown in fig. 3(a), is composed of dense connection modules, transition modules, pooling layers and fully connected layers. Unlike a traditional neural network, which has no cross-layer connections, a dense connection module connects the output of the current layer to the input of every subsequent layer. If the network has L layers, a traditional network contains L connections, whereas the dense convolution scheme contains L(L-1)/2 connections. This dense connection pattern realizes feature reuse, effectively reduces the number of channels in each layer, and thereby reduces the number of network parameters to a certain extent. In addition, the large number of cross-layer connections effectively alleviates the vanishing-gradient problem that deep neural networks face as their depth increases. Suppose the output of the l-th layer is x_l; then the input and output of the l-th densely connected layer can be expressed as:
x_l = H_l([x_0, x_1, …, x_{l-1}])
where H_l(·) denotes the convolution module of the l-th layer. Fig. 3(b) shows the 2D convolution structure used in the identity coding network and fig. 3(c) the 3D convolution structure used in the semantic coding network. As shown in fig. 3(b) and fig. 3(c), a 2D or 3D convolution structure is adopted according to whether the task is identity or semantic coding, and each module is mainly composed of batch normalization, ReLU, a 1×1 convolution and a 3×3 convolution. Identity feature coding applies 2D convolution to every static lip frame to extract image structural features, while semantic feature coding applies 3D convolution to several consecutive frames to extract spatio-temporal features. The input of H_l(·) is the channel-wise combination of the outputs of layers 0 to l-1, which requires the feature maps output by each layer to have a uniform scale; within each dense connection module the feature-map dimensions therefore remain unchanged. However, an essential element of convolutional neural networks is reducing the feature-map scale through pooling so as to capture a larger receptive field. The dense convolutional neural network therefore inserts transition modules, shown in fig. 3(b) and fig. 3(c), between the dense connection modules; each transition module is composed of batch normalization, ReLU, a 1×1 convolution and 2×2 pooling, the 1×1 convolution compressing channels and the 2×2 pooling downsampling the feature map so that features can be captured over a wider range. The final pooling layer of the dense convolutional neural network performs global pooling on the output feature map, retaining only the channel information, and the features are finally transformed through a fully connected layer.
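A minimal PyTorch sketch of a 2D densely connected block and transition module with the layer composition described above (batch normalization, ReLU, 1×1 convolution, 3×3 convolution; transition: 1×1 convolution and 2×2 pooling). The growth rate, layer counts and pooling type are illustrative assumptions; a 3D variant would replace the 2D layers with their 3D counterparts.

```python
import torch
import torch.nn as nn

class DenseLayer2D(nn.Module):
    """One H_l(.): BN -> ReLU -> 1x1 conv -> BN -> ReLU -> 3x3 conv."""
    def __init__(self, in_ch, growth, bottleneck=4):
        super().__init__()
        mid = bottleneck * growth
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, growth, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.body(x)

class DenseBlock2D(nn.Module):
    """x_l = H_l([x_0, ..., x_{l-1}]): every layer sees all previous outputs."""
    def __init__(self, in_ch, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer2D(in_ch + i * growth, growth) for i in range(num_layers)])

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # channel-wise combination
        return torch.cat(feats, dim=1)

class Transition2D(nn.Module):
    """BN -> ReLU -> 1x1 conv (channel compression) -> 2x2 pooling (downsampling)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.body(x)
```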
Fig. 4 is a structural diagram of the identity and semantic coupling reconstruction network of an embodiment of the present invention. After the identity and semantic features have been extracted by the dense convolutional neural networks, the two features are concatenated and input into the coupling reconstruction network shown in fig. 4. The network expands the feature vector into a feature map through a 4×4 deconvolution and then performs high-resolution reconstruction by upsampling; after each reconstruction step, features are extracted with the convolution module shown in fig. 3, which is composed of a 3×3 convolution, batch normalization and ReLU. The upsampling and convolution operations are repeated until the scale of the output feature map matches that of the lip language picture, completing the reconstruction process.
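A minimal sketch of such a coupling reconstruction decoder: a 4×4 deconvolution expands the concatenated feature vector into a feature map, followed by repeated upsampling and 3×3 convolution with batch normalization and ReLU until the lip picture scale is reached. The channel counts, number of upsampling stages and output resolution are assumptions.

```python
import torch
import torch.nn as nn

class CouplingReconstructor(nn.Module):
    """Decodes the concatenated identity + semantic feature vector into a lip image."""
    def __init__(self, feat_dim, out_ch=1, base=256, num_up=4):
        super().__init__()
        # 4x4 deconvolution: feature vector (treated as a 1x1 map) -> 4x4 feature map
        self.expand = nn.ConvTranspose2d(feat_dim, base, kernel_size=4)
        blocks, ch = [], base
        for _ in range(num_up):                    # 4x4 -> 8 -> 16 -> 32 -> 64
            blocks += [
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(ch // 2), nn.ReLU(inplace=True),
            ]
            ch //= 2
        self.up = nn.Sequential(*blocks)
        self.to_image = nn.Conv2d(ch, out_ch, kernel_size=3, padding=1)

    def forward(self, fused_vec):                  # fused_vec: (B, feat_dim)
        x = self.expand(fused_vec[:, :, None, None])
        return self.to_image(self.up(x))           # (B, out_ch, 64, 64) under the defaults
```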
Fig. 5 is a structural diagram of a lip language prediction network based on a self-attention mechanism in an embodiment of the present invention, and as shown in fig. 5, the lip language prediction network is a seq2seq network based on a self-attention mechanism; the lip language prediction network comprises an input module, an Encoder module, a Decoder module and a classification module.
The input module is respectively connected with the Encoder module and the Decoder module, the input module is used for acquiring a semantic feature sequence and a word vector sequence corresponding to the semantic feature sequence and embedding semantic vectors at different moments in the semantic feature sequence and word vectors in the word vector sequence into time position information, the Decoder module is respectively connected with the Encoder module and the classification module, and the Encoder module is used for performing deep feature mining on the semantic feature sequence embedded with the time position information to obtain a first feature sequence; the Decoder module is used for obtaining a second characteristic sequence according to the attention of the first characteristic sequence and the attention of a word vector sequence embedded with time position information, and the classification module is used for judging and obtaining a prediction text sequence according to the second characteristic sequence.
Specifically, in the input module, the lip language picture sequence is passed through semantic feature coding to produce semantic vectors at different moments, which form the input sequence received by the input part of the lip language prediction network. Unlike RNNs, which process timing signals through a recursive relationship, the lip language prediction network encodes information at different moments by superimposing time-position information on the input data.
The position embedding uses sine and cosine position coding: position codes are generated with sine and cosine functions of different frequencies and then added to the semantic vector at the corresponding position, so the dimension of the position vector must match that of the semantic vector. The specific calculation formula is as follows:
wherein pos represents the position of the semantic vector in the current sequence, i represents the ith position in the semantic vector, and d represents the dimension of the semantic vector.
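As a concrete illustration, the sketch below implements the standard sinusoidal position coding (sine and cosine functions of different frequencies, matching the variables pos, i and d defined above); the exact constants used in the patent figure are not reproduced here, so this form is an assumption.

```python
import math
import torch

def sinusoidal_position_encoding(seq_len: int, d: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    assert d % 2 == 0, "an even feature dimension is assumed"
    pe = torch.zeros(seq_len, d)
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                            # (seq_len, d)

# the position code is added to the semantic vector of the corresponding position:
# semantic_seq (T, d)  ->  semantic_seq + sinusoidal_position_encoding(T, d)
```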
Optionally, in the Encoder module, the semantic features after embedding of the time and position information are input to the Encoder module for deep feature mining. The Encoder module is divided into two parts of a transition layer and an output layer, wherein the transition layer is composed of multi-head attention and layer normalization, and the input and output relationship can be expressed as follows:
where the input term denotes the semantic feature vector sequence of the i-th sample after the time-position information has been embedded, MultiHeadAttention(·) denotes multi-head attention, and LayerNorm(·) denotes layer normalization.
Multi-head attention allows the neural network to focus more on the relevant parts of the input and less on irrelevant parts when performing a prediction task. An attention function can be described as mapping a Query and a set of Key-Value pairs to an output, where the Query, Keys, Values and output are all vectors. The output is computed as a weighted sum of the Values, where the weight assigned to each Value is computed from a compatibility function of the Query and the corresponding Key. Specifically:
MultiHeadAttention(s_i) = MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O
where Q is the query vector sequence, K is the key vector sequence, V is the value vector sequence, and Q = K = V = s_i; Concat(·) denotes matrix concatenation and W^O is the output transformation matrix.
Single-head attention is calculated by the following formula:
where the three transformation matrices are the i-th head transformation matrices of the query, key and value vector sequences, respectively, and h is the number of attention heads.
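A minimal PyTorch sketch of the multi-head attention described above; the scaling by the square root of the per-head dimension follows the standard scaled dot-product formulation and is an assumption here.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, with per-head
    scaled dot-product attention softmax(Q_i K_i^T / sqrt(d_k)) V_i."""
    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)       # output transformation matrix W_O

    def forward(self, q, k, v):                      # each (B, T, d_model)
        b = q.size(0)
        def split(x, w):                             # project and split into heads
            return w(x).view(b, -1, self.h, self.d_k).transpose(1, 2)   # (B, h, T, d_k)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)                         # Concat(heads) W_O
```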
Layer normalization is a common method for addressing the internal covariate shift problem; it pulls the data distribution into the unsaturated region of the activation function, is invariant to scaling of the weights and data, and has the effect of alleviating gradient vanishing/explosion, accelerating training and providing regularization. Layer normalization is implemented as follows:
where z represents an input D-dimensional feature vector, and α and β are transform coefficients.
Optionally, the Encoder output layer is composed of a fully connected layer and layer normalization, and the mapping between its input and output is as follows:
preferably, the Decoder module is similar in overall structure to the Encoder module, and adds attention between the Encoder output and the Decoder input based on the Encoder, and calculates the output with the EncoderK, V as an attention model calculation in Decoder, input as DecoderThe Decoder model output is computed as Q.
Optionally, the Decoder input is the word-vector sequence of the text, in which each word vector represents the j-th moment of the i-th lip language sequence; the word vectors of the language sequence come from the real text sequence. The word vectors input to the Decoder are first embedded with time-position information in the same way as in the Encoder, yielding a word-vector sequence with embedded time-position information. The input and output relationship after the first attention layer is as follows:
MultiHeadAttention(·) and LayerNorm(·) are calculated in the same manner as in the Encoder module.
Unlike the Encoder module, after obtaining the output of its first attention layer, the Decoder uses it as the query Q of a further multi-head attention, with the Encoder output serving as the key K and value V, thereby computing the attention between the word-vector sequence and the semantic feature sequence.
The resulting attention vector then passes through a fully connected layer and layer normalization to give the final output of the Decoder:
The output module takes the Decoder output and determines the lip language output content through a fully connected layer and a softmax layer.
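Putting the pieces together, the sketch below shows one possible Encoder layer (multi-head self-attention with layer normalization, followed by a fully connected output layer) and one possible Decoder layer (self-attention over the word vectors, attention with the Encoder output as keys and values, then a fully connected layer and softmax classification). The residual connections, dimensions and omission of causal masking are assumptions.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Transition layer (multi-head self-attention + layer norm) followed by
    an output layer (fully connected + layer norm)."""
    def __init__(self, d_model: int, h: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.fc = nn.Linear(d_model, d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, s):                          # s: semantic features + position code, (B, T, d)
        x = self.norm1(s + self.attn(s, s, s)[0])
        return self.norm2(x + self.fc(x))

class DecoderLayer(nn.Module):
    """Self-attention over the word vectors, attention over the Encoder output
    (which supplies K and V while the Decoder state supplies Q), then a fully
    connected layer, layer norm and a softmax classifier over text categories."""
    def __init__(self, d_model: int, h: int, num_classes: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.fc = nn.Linear(d_model, d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.classify = nn.Linear(d_model, num_classes)

    def forward(self, words, enc_out):             # words: word vectors + position code, (B, T_w, d)
        # causal masking of future word positions is omitted for brevity
        x = self.norm1(words + self.self_attn(words, words, words)[0])
        x = self.norm2(x + self.cross_attn(x, enc_out, enc_out)[0])
        x = self.norm3(x + self.fc(x))
        return torch.softmax(self.classify(x), dim=-1)   # text-category probabilities
```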
Preferably, the calculation formula of the contrast loss is as follows:
where L_c is the contrast loss; N represents the number of speaker samples; the two image terms denote the t-th frame image of sample i and the t'-th frame image of sample j, and the two feature terms denote their corresponding identity features; y indicates whether the two groups of samples match in identity (y = 1 when the identities of the two groups of samples are the same, otherwise y = 0); margin is a set threshold.
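The contrast-loss formula itself appears only as a figure in the patent; as an illustration, a standard contrastive form consistent with the variables defined above is sketched below (the exact distance and normalization used in the patent may differ).

```python
import torch

def contrast_loss(f_i: torch.Tensor, f_j: torch.Tensor,
                  y: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Identity contrast loss between frames of two speaker samples.

    f_i, f_j : identity features of a frame of sample i and a frame of sample j, (N, D)
    y        : 1 where the two samples have the same identity, 0 otherwise, (N,)
    """
    d = torch.norm(f_i - f_j, dim=1)                       # Euclidean distance
    same = y * d.pow(2)                                     # pull same identities together
    diff = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # push different ones apart
    return (same + diff).mean()
```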
Preferably, the calculation formula of the difference loss is as follows:
where L_d is the difference loss; N represents the number of speaker samples; the two image terms denote the j-th and k-th frame images of sample i, and the two feature terms denote their corresponding identity features; T represents the number of frames in a speaker sample.
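The difference-loss formula likewise appears only as a figure; the sketch below implements one natural reading of the definitions, the mean squared distance between the identity features of different frames of the same sample, and is an assumption.

```python
import torch

def difference_loss(identity_seq: torch.Tensor) -> torch.Tensor:
    """Identity difference loss across frames of the same speaker sample.

    identity_seq : identity features of one batch, shape (N, T, D).
    Penalizes disagreement between the identity features of frames j and k
    of the same sample, averaged over all frame pairs and all samples.
    """
    diff = identity_seq.unsqueeze(2) - identity_seq.unsqueeze(1)   # (N, T, T, D)
    return diff.pow(2).sum(dim=-1).mean()
```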
Preferably, the calculation formula of the difference loss of the gaussian distribution is as follows:
where L_dd represents the Gaussian distribution difference loss; the image term denotes the t-th frame image of the i-th sample in the P group of speaker samples and the feature term denotes the semantic features of that frame; Σ_P and Σ_Q are the covariance matrices of the semantic features of speaker-sample groups P and Q; μ_P and μ_Q are the mean vectors of the semantic features of groups P and Q; det denotes the value of the matrix determinant; Z represents the dimension of the semantic coding features; and T represents the number of frames in a speaker sample.
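The definitions above (covariance matrices, mean vectors, a determinant and the feature dimension Z) match the closed-form KL divergence between two multivariate Gaussians, so that form is sketched below as an assumption; the patent figure may use a different but related expression.

```python
import torch

def gaussian_difference_loss(sem_p: torch.Tensor, sem_q: torch.Tensor,
                             eps: float = 1e-5) -> torch.Tensor:
    """KL divergence between Gaussians fitted to two groups of semantic features.

    sem_p, sem_q : semantic features of speaker-sample groups P and Q,
                   flattened over samples and frames, shape (M, Z).
    """
    mu_p, mu_q = sem_p.mean(dim=0), sem_q.mean(dim=0)
    z = sem_p.size(1)
    eye = eps * torch.eye(z, device=sem_p.device)           # small ridge for stability
    sigma_p = torch.cov(sem_p.T) + eye                       # (Z, Z) covariance of group P
    sigma_q = torch.cov(sem_q.T) + eye                       # (Z, Z) covariance of group Q
    sigma_q_inv = torch.linalg.inv(sigma_q)
    diff = (mu_q - mu_p).unsqueeze(1)                        # (Z, 1)
    kl = 0.5 * (torch.logdet(sigma_q) - torch.logdet(sigma_p) - z
                + torch.trace(sigma_q_inv @ sigma_p)
                + (diff.T @ sigma_q_inv @ diff).squeeze())
    return kl
```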
Preferably, the correlation loss is calculated by the formula:
where L_R represents the correlation loss; T represents the number of frames in a speaker sample; N represents the number of speaker samples; the image term denotes the t-th frame image of sample i, and the two feature terms denote its identity features and its semantic features.
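The correlation-loss formula is given only as a figure; one natural reading consistent with the definitions, penalizing the cross-covariance between identity and semantic features, is sketched below as an assumption.

```python
import torch

def correlation_loss(identity_seq: torch.Tensor, semantic_seq: torch.Tensor) -> torch.Tensor:
    """Penalizes correlation between identity and semantic features.

    identity_seq : (N, T, D_id) identity features
    semantic_seq : (N, T, D_s)  semantic features
    Squared Frobenius norm of the cross-covariance matrix between the centred
    identity and semantic features of all frames.
    """
    ident = identity_seq.reshape(-1, identity_seq.size(-1))
    sem = semantic_seq.reshape(-1, semantic_seq.size(-1))
    ident = ident - ident.mean(dim=0, keepdim=True)
    sem = sem - sem.mean(dim=0, keepdim=True)
    cross_cov = ident.T @ sem / (ident.size(0) - 1)          # (D_id, D_s)
    return cross_cov.pow(2).sum()
```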
Preferably, the calculation formula of the reconstruction error loss is:
where L_con represents the reconstruction error loss; T represents the number of frames in a speaker sample; N represents the number of speaker samples; the image term denotes the t-th frame image of sample i; the remaining terms denote its identity features and the reconstruction obtained from the connected (concatenated) identity feature vector and semantic feature vector.
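A plain per-pixel squared error between the original frames and their reconstructions is sketched below, consistent with the definitions; the exact norm used in the patent figure may differ.

```python
import torch

def reconstruction_loss(frames: torch.Tensor, recon: torch.Tensor) -> torch.Tensor:
    """Mean squared error between the original lip frames and the frames
    reconstructed from the concatenated identity + semantic feature vectors.

    frames, recon : (N, T, C, H, W)
    """
    return (frames - recon).pow(2).mean()
```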
Preferably, the formula for calculating the supervision loss is as follows:
where L_seq represents the supervision loss; N represents the number of speaker samples; T represents the number of frames in a speaker sample; C represents the number of text categories; the two probability terms are the true probability and the predicted probability that the text category of the t-th frame of speaker sample i is j; s_i is the encoding matrix of the semantic features of sample i; E_p represents the self-attention-based lip language prediction network; the image terms denote the 1st, 2nd, …, T-th frame images of the i-th sample, and the corresponding feature terms denote the semantic features of those frames; the t-th lip language prediction output is determined from the semantic features of all frames and the lip language prediction outputs of items 0 to t-1.
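The supervision loss described above is a cross-entropy between the true and predicted text-category probabilities over all samples and frames; a minimal sketch follows (tensor shapes are assumptions).

```python
import torch

def supervision_loss(pred_probs: torch.Tensor, true_probs: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
    """Cross-entropy between predicted and true text-category distributions.

    pred_probs : (N, T, C) probabilities output by the lip language prediction network
    true_probs : (N, T, C) one-hot (or soft) ground-truth text categories
    """
    return -(true_probs * torch.log(pred_probs + eps)).sum(dim=-1).mean()
```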
Preferably, the iterative optimization of the identity and semantic deep coupling model and the lip language prediction network, with the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, to obtain an optimal lip language recognition model includes:
taking the weighted loss as an optimization function, and utilizing an Adam optimizer to perform iterative learning on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimized identity and semantic deep coupling model and a lip language prediction network;
wherein the optimization function is L(θ) = L_seq + α_1·L_c + α_2·L_d + α_3·L_dd + α_4·L_R + α_5·L_con, where L(θ) is the weighted loss; L_seq is the supervision loss; L_c is the contrast loss; L_d is the difference loss; L_dd represents the Gaussian distribution difference loss; L_R represents the correlation loss; L_con represents the reconstruction error loss; and α_1, α_2, α_3, α_4 and α_5 represent the weights of the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss and the reconstruction error loss, respectively.
Specifically, the Adam optimizer combines the advantages of the AdaGrad and RMSProp optimization algorithms, computing the update step from both the first-moment and the second-moment estimates of the gradient. For the optimization of the total loss, the Adam optimizer proceeds as follows:
(1) Randomly initialize the parameters θ and initialize the first moment m_0 and the second moment v_0 at time 0.
(2) Compute the gradient of the weighted loss with respect to the parameters: g_t ← ∇_θ L(θ_{t-1}).
(3) Update the first moment: m_t ← β_1·m_{t-1} + (1-β_1)·g_t.
(4) Update the second moment: v_t ← β_2·v_{t-1} + (1-β_2)·g_t².
(5) Compute the bias-corrected first moment: m̂_t ← m_t/(1-β_1^t).
(6) Compute the bias-corrected second moment: v̂_t ← v_t/(1-β_2^t).
(7) Update the parameters: θ_t ← θ_{t-1} - α·m̂_t/(√v̂_t + ε).
Steps (2)-(7) are repeated until the loss converges,
where β_1 and β_2 are the exponential decay rates, β_1^t and β_2^t are their t-th powers, α is the learning rate, g_t² is the element-wise square of the gradient g_t, and ε = 10^-8.
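A minimal sketch of how the weighted loss and the Adam update could be wired together with torch.optim.Adam; the learning rate and decay rates shown are illustrative assumptions.

```python
import torch

def make_optimizer(model_params, lr=1e-4):
    """Adam with assumed decay rates beta_1, beta_2 and epsilon = 1e-8."""
    return torch.optim.Adam(model_params, lr=lr, betas=(0.9, 0.999), eps=1e-8)

def training_step(optimizer, losses, alphas):
    """One iteration on the weighted loss
    L(theta) = L_seq + a1*L_c + a2*L_d + a3*L_dd + a4*L_R + a5*L_con."""
    l_seq, l_c, l_d, l_dd, l_r, l_con = losses
    a1, a2, a3, a4, a5 = alphas
    total = l_seq + a1 * l_c + a2 * l_d + a3 * l_dd + a4 * l_r + a5 * l_con
    optimizer.zero_grad()
    total.backward()        # computes the gradient g_t of the weighted loss
    optimizer.step()        # Adam moment updates and parameter update
    return total.detach()
```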
Optionally, the method further comprises:
and inputting the lip language picture sequence to be recognized into the 3D dense convolution neural network in the optimal lip language recognition model to obtain a semantic feature sequence to be recognized.
And inputting the semantic feature sequence to be recognized into a lip language prediction network in the optimal lip language recognition model to obtain a predicted text sequence.
In particular, semantic features are extracted and recognized using the semantic information coding network E_s and the lip language prediction network E_p;
the input lip language picture sequence outputs a semantic feature sequence after being subjected to semantic codingAnd predicting the word vector output at the t moment by the lip language prediction network according to the input semantic feature sequence and all the word vectors before the t moment. The input predicted feature sequence is an Encoder structure shown in figure 4, and semantic coding output is calculatedDecoder will self-attentively vector the words at time t-1All word vectors characterized by the first t-1 momentAttention weighted sum ofAnd then the semantic feature codes output by the Encoder are correlated through a self-attention mechanismAndthe Decoder output is calculated and the word vector at t moment is predictedThe word vector is output at the time t-1,decoder will start word vectors according to defaultAnd predicting, and recursively predicting the lip language output word vectors at each moment layer by layer.
Fig. 6 is a block diagram of the speaker-independent lip language recognition system of the present invention. As shown in fig. 6, the speaker-independent lip language recognition system of the present invention comprises:
the first acquisition module is used for acquiring training lip language picture sequences of a plurality of speaker samples;
the feature output module is used for inputting the training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
the first calculation module is used for calculating the contrast loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
the second calculation module is used for calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
the third calculation module is used for calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
the fourth calculation module is used for calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
the fifth calculation module is used for calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
the text output module is used for inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
the sixth calculation module is used for calculating supervision loss according to the predicted text sequence and the real text sequence;
the training module is used for performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model;
the second acquisition module is used for acquiring a lip language picture sequence to be identified;
and the recognition module is used for inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
The invention has the following beneficial effects:
firstly, the identity information and the semantic information of the lip language picture sequence are encoded separately by two groups of independent networks; the identity encoding process is constrained by the identity contrast loss between different samples and the identity difference loss between different frames of the same sample, and the semantic encoding process is constrained by the seq2seq supervision loss. Compared with current lip language recognition methods, which rely on a single semantic supervision constraint, this effectively prevents identity information from being mixed into the semantic features and improves the recognition accuracy of the lip language recognition model under speaker-independent conditions.
Secondly, on the basis of the coupling model, the invention further introduces a correlation loss constraint between the identity features and the semantic features to ensure minimal correlation between the identity information and the semantic information. In addition, the invention assumes that the semantic features follow a single Gaussian distribution and uses the Gaussian distribution difference between different groups of samples as a loss constraint, ensuring that the distributions of semantic features extracted from different speakers differ as little as possible and keeping the semantic space independent of the speaker, thereby improving the robustness of the lip language recognition system to changes in speaker identity.
Thirdly, the invention adopts a seq2seq model based on the self-attention mechanism in the semantic prediction process. Compared with the recurrent neural networks such as LSTM and GRU adopted by current lip language recognition methods, it provides long-term memory of, and correlation across, the temporal features, improving the accuracy of the lip language prediction process. In addition, unlike the recursive training mode of a traditional recurrent neural network, the self-attention mechanism allows the model to be trained in parallel, greatly shortening the learning time of the lip language recognition network.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (10)
1. A speaker-independent lip language identification method, characterized by comprising the following steps:
acquiring training lip language picture sequences of a plurality of speaker samples;
inputting a plurality of training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity characteristic sequence, a semantic characteristic sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
calculating the contrast loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
calculating difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
calculating supervision loss according to the predicted text sequence and the real text sequence;
taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, and performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimal lip language recognition model;
acquiring a lip language picture sequence to be identified;
and inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
2. The method for speaker-independent lip language identification according to claim 1, wherein the 2D dense convolutional neural network and the 3D dense convolutional neural network are each composed of a dense convolutional neural network framework; the dense convolutional neural network framework comprises a dense connection transition layer, a pooling layer and a full connection layer which are connected in sequence; the dense connection transition layer comprises a plurality of dense connection transition units; each dense connection transition unit comprises a dense connection module and a transition module;
the lip language prediction network is a seq2seq network based on an attention-free mechanism; the lip language prediction network comprises an input module, an Encoder module, a Decoder module and a classification module;
the input module is respectively connected with the Encoder module and the Decoder module, the input module is used for acquiring a semantic feature sequence and a word vector sequence corresponding to the semantic feature sequence and embedding semantic vectors at different moments in the semantic feature sequence and word vectors in the word vector sequence into time position information, the Decoder module is respectively connected with the Encoder module and the classification module, and the Encoder module is used for performing deep feature mining on the semantic feature sequence embedded with the time position information to obtain a first feature sequence; the Decoder module is used for obtaining a second characteristic sequence according to the attention of the first characteristic sequence and the attention of a word vector sequence embedded with time position information, and the classification module is used for judging and obtaining a prediction text sequence according to the second characteristic sequence.
3. The method of claim 1, wherein the contrast loss is calculated by the formula:
wherein $L_c$ is the contrast loss; $N$ represents the number of speaker samples; $x_i^t$ represents the $t$-th frame image of sample $i$; $x_j^{t'}$ represents the $t'$-th frame image of sample $j$; $I_i^t$ and $I_j^{t'}$ represent their respective identity features; $y$ indicates whether the two samples match: $y = 1$ when the identities of the two samples are the same and $y = 0$ otherwise; and margin is a set threshold.
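The formula image itself is not reproduced in this text. A standard margin-based contrastive form that is consistent with the variables defined above, offered only as an assumed reconstruction (possibly averaged over sample pairs), is:

$$L_c = y\,\big\|I_i^t - I_j^{t'}\big\|_2^2 \;+\; (1-y)\,\max\!\big(\mathrm{margin} - \big\|I_i^t - I_j^{t'}\big\|_2,\; 0\big)^2$$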
4. The method of claim 1, wherein the difference loss is calculated by the formula:
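The formula image is not reproduced here and claim 4 gives no variable list. One plausible reading, assuming the difference loss penalises variation of the identity feature across frames of the same speaker sample (so that identity is frame-invariant), is:

$$L_d = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\big\|I_i^t - \bar{I}_i\big\|_2^2, \qquad \bar{I}_i = \frac{1}{T}\sum_{t=1}^{T} I_i^t$$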
5. The method of claim 1, wherein the Gaussian distribution difference loss is calculated by the formula:
wherein $L_{dd}$ represents the Gaussian distribution difference loss; $x_{P,i}^t$ represents the $t$-th frame image of the $i$-th sample in the $P$ group of speaker samples, and $S_{P,i}^t$ represents the semantic feature of that frame; $\Sigma_P$ and $\Sigma_Q$ represent the covariance matrices of the semantic features of the $P$ and $Q$ groups of speaker samples, respectively; $\mu_P$ and $\mu_Q$ represent the corresponding mean vectors; $\det$ denotes the matrix determinant; $z$ represents the dimension of the semantic coding feature; and $T$ represents the number of frames in a speaker sample.
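The formula image is not reproduced here. Given the variables above (group means, covariance matrices, determinant and feature dimension $z$), a Kullback-Leibler divergence between two multivariate Gaussians fitted to the semantic features of groups $P$ and $Q$ is a natural reading, offered as an assumption:

$$L_{dd} = \frac{1}{2}\left[\operatorname{tr}\!\big(\Sigma_Q^{-1}\Sigma_P\big) + (\mu_Q - \mu_P)^{\top}\Sigma_Q^{-1}(\mu_Q - \mu_P) - z + \ln\frac{\det\Sigma_Q}{\det\Sigma_P}\right]$$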
6. The method of claim 1, wherein the correlation loss is calculated by the formula:
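The formula image is not reproduced here, and claim 6 gives no variable list. A common choice for decorrelating two feature spaces, offered purely as an assumption, is the squared Frobenius norm of the cross-covariance between the mean-centred identity features $\hat{I}$ and semantic features $\hat{S}$ stacked row-wise over all frames of all samples:

$$L_R = \Big\|\tfrac{1}{NT}\,\hat{I}^{\top}\hat{S}\Big\|_F^2$$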
7. The method of claim 1, wherein the reconstruction error loss is calculated by the formula:
wherein $L_{con}$ represents the reconstruction error loss; $T$ represents the number of frames in a speaker sample; $N$ represents the number of speaker samples; $x_i^t$ represents the $t$-th frame image of sample $i$; $I_i^t$ and $S_i^t$ represent its identity feature and semantic feature, respectively; and $[\,I_i^t ; S_i^t\,]$ denotes the concatenation of the identity feature vector and the semantic feature vector.
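The formula image is not reproduced here. A mean squared reconstruction error consistent with the variables above, writing $D(\cdot)$ for the deconvolution neural network (a notation introduced here for illustration), would be:

$$L_{con} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\big\|x_i^t - D\big([\,I_i^t ; S_i^t\,]\big)\big\|_2^2$$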
8. The method of claim 1, wherein the supervised loss is calculated by the formula:
wherein $L_{seq}$ represents the supervision loss; $N$ represents the number of speaker samples; $T$ represents the number of frames in a speaker sample; $C$ represents the number of text categories; $y_{i,t,j}$ is the true probability that the text category of the $t$-th frame of sample $i$ is $j$, and $\hat{y}_{i,t,j}$ is the predicted probability that the text category of the $t$-th frame of sample $i$ is $j$; $S_i$ is the encoding matrix formed by the semantic features $S_i^1, S_i^2, \dots, S_i^T$ of the frame images $x_i^1, x_i^2, \dots, x_i^T$ of sample $i$; $E_p$ represents the lip language prediction network based on the self-attention mechanism; and the $t$-th lip language prediction output is decided from the semantic features of all frames together with the prediction outputs of items $0$ to $t-1$.
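The formula image is not reproduced here. A sequence cross-entropy consistent with the variables above, with each output produced autoregressively by $E_p$ from the semantic encoding $S_i$ and the previous outputs, would be (the normalisation factor is an assumption):

$$L_{seq} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{j=1}^{C} y_{i,t,j}\,\log\hat{y}_{i,t,j}, \qquad \hat{y}_i^{\,t} = E_p\big(S_i,\ \hat{y}_i^{\,0},\dots,\hat{y}_i^{\,t-1}\big)$$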
9. The method according to claim 1, wherein the step of performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network with the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model comprises:
taking the weighted loss as an optimization function, and utilizing an Adam optimizer to perform iterative learning on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimized identity and semantic deep coupling model and a lip language prediction network;
wherein the optimization function is $L(\theta) = L_{seq} + \alpha_1 L_c + \alpha_2 L_d + \alpha_3 L_{dd} + \alpha_4 L_R + \alpha_5 L_{con}$, where $L(\theta)$ is the weighted loss; $L_{seq}$ is the supervision loss; $L_c$ is the contrast loss; $L_d$ is the difference loss; $L_{dd}$ is the Gaussian distribution difference loss; $L_R$ is the correlation loss; $L_{con}$ is the reconstruction error loss; and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$ and $\alpha_5$ are the weights of the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss and the reconstruction error loss, respectively.
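A hedged sketch of one training step of the weighted-loss optimisation of claim 9: the six losses are combined with the weights $\alpha_1$ to $\alpha_5$ and both networks are updated with Adam. The callables in the `losses` dictionary are placeholders for the expressions discussed above, and the learning rate and weight values are illustrative, not taken from the patent.

```python
import torch

def train_step(model, predictor, optimizer, frames, tokens, targets, losses, alphas):
    """One optimisation step; frames: (B,T,1,H,W), tokens/targets: teacher-forcing inputs and labels."""
    ident, sem, recon = model(frames)            # identity/semantic deep coupling model
    logits = predictor(sem, tokens)              # lip language prediction network
    L_seq = torch.nn.functional.cross_entropy(logits.transpose(1, 2), targets)  # supervision loss
    L_total = (L_seq
               + alphas[0] * losses['contrast'](ident)
               + alphas[1] * losses['difference'](ident)
               + alphas[2] * losses['gaussian'](sem)
               + alphas[3] * losses['correlation'](ident, sem)
               + alphas[4] * losses['reconstruction'](frames, recon))
    optimizer.zero_grad()
    L_total.backward()
    optimizer.step()                             # Adam update of both networks
    return L_total.item()

# Example optimizer over both networks (hyperparameters illustrative):
# optimizer = torch.optim.Adam(list(model.parameters()) + list(predictor.parameters()), lr=1e-4)
```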
10. A system for speaker-independent lip language recognition, comprising:
the first acquisition module is used for acquiring training lip language picture sequences of a plurality of speaker samples;
the feature output module is used for inputting the training lip language picture sequences into an identity and semantic deep coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises a 2D dense convolutional neural network, a 3D dense convolutional neural network and a deconvolution neural network; the 2D dense convolutional neural network is used for encoding the identity features of the training lip language picture sequence to obtain the identity feature sequence; the 3D dense convolutional neural network is used for encoding the semantic features of the training lip language picture sequence to obtain the semantic feature sequence; and the deconvolution neural network is used for coupling and reconstructing the identity feature sequence and the semantic feature sequence to obtain the reconstructed image sequence;
the first calculation module is used for calculating a contrast loss according to the identity features of different speaker samples in the identity feature sequence;
the second calculation module is used for calculating a difference loss according to the identity features of different frames of the same speaker sample in the identity feature sequence;
the third calculation module is used for calculating a Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
the fourth calculation module is used for calculating a correlation loss according to the identity feature sequence and the semantic feature sequence;
the fifth calculation module is used for calculating a reconstruction error loss according to the training lip language picture sequence and the reconstructed image sequence;
the text output module is used for inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
the sixth calculation module is used for calculating supervision loss according to the predicted text sequence and the real text sequence;
the training module is used for performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model;
the second acquisition module is used for acquiring a lip language picture sequence to be identified;
and the recognition module is used for inputting the lip language picture sequence to be recognized into the optimal lip language recognition model to obtain a recognized text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110226432.4A CN112949481B (en) | 2021-03-01 | 2021-03-01 | Lip language identification method and system for speaker independence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112949481A (en) | 2021-06-11 |
CN112949481B CN112949481B (en) | 2023-09-22 |
Family
ID=76246958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110226432.4A Active CN112949481B (en) | 2021-03-01 | 2021-03-01 | Lip language identification method and system for speaker independence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112949481B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113313056A (en) * | 2021-06-16 | 2021-08-27 | 中国科学技术大学 | Compact 3D convolution-based lip language identification method, system, device and storage medium |
CN114092496A (en) * | 2021-11-30 | 2022-02-25 | 西安邮电大学 | Lip segmentation method and system based on spatial weighting |
CN114466179A (en) * | 2021-09-09 | 2022-05-10 | 马上消费金融股份有限公司 | Method and device for measuring synchronism of voice and image |
CN116959060A (en) * | 2023-04-20 | 2023-10-27 | 湘潭大学 | Lip language identification method for patient with language disorder in hospital environment |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111339806A (en) * | 2018-12-19 | 2020-06-26 | 马上消费金融股份有限公司 | Training method of lip language recognition model, living body recognition method and device |
WO2020252922A1 (en) * | 2019-06-21 | 2020-12-24 | 平安科技(深圳)有限公司 | Deep learning-based lip reading method and apparatus, electronic device, and medium |
CN112330713A (en) * | 2020-11-26 | 2021-02-05 | 南京工程学院 | Method for improving speech comprehension degree of severe hearing impaired patient based on lip language recognition |
Non-Patent Citations (2)
Title |
---|
马宁; 田国栋; 周曦: "A lip reading method based on long short-term memory", Journal of University of Chinese Academy of Sciences, no. 01 *
马惠珠; 宋朝晖; 季飞; 侯嘉; 熊小芸: "Research directions and keywords of computer-aided project acceptance: 2012 acceptance overview and notes for 2013", Journal of Electronics & Information Technology, no. 01 *
Also Published As
Publication number | Publication date |
---|---|
CN112949481B (en) | 2023-09-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112949481B (en) | Lip language identification method and system for speaker independence | |
CN112991354B (en) | High-resolution remote sensing image semantic segmentation method based on deep learning | |
CN108133188B (en) | Behavior identification method based on motion history image and convolutional neural network | |
CN109934261B (en) | Knowledge-driven parameter propagation model and few-sample learning method thereof | |
CN109919204B (en) | Noise image-oriented deep learning clustering method | |
CN109949824B (en) | City sound event classification method based on N-DenseNet and high-dimensional mfcc characteristics | |
CN112116593B (en) | Domain self-adaptive semantic segmentation method based on base index | |
CN111340046A (en) | Visual saliency detection method based on feature pyramid network and channel attention | |
CN110399850A (en) | A kind of continuous sign language recognition method based on deep neural network | |
CN111461025B (en) | Signal identification method for self-evolving zero-sample learning | |
CN115952407B (en) | Multipath signal identification method considering satellite time sequence and airspace interactivity | |
CN107491729B (en) | Handwritten digit recognition method based on cosine similarity activated convolutional neural network | |
CN114220154A (en) | Micro-expression feature extraction and identification method based on deep learning | |
CN111259785B (en) | Lip language identification method based on time offset residual error network | |
CN114898773B (en) | Synthetic voice detection method based on deep self-attention neural network classifier | |
CN117809181A (en) | High-resolution remote sensing image water body extraction network model and method | |
CN114694255B (en) | Sentence-level lip language recognition method based on channel attention and time convolution network | |
CN117033657A (en) | Information retrieval method and device | |
CN114783418A (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN116935403A (en) | End-to-end character recognition method based on dynamic sampling | |
CN116825120A (en) | Gas leakage acoustic signal noise reduction method and system based on hourglass model | |
CN113887504B (en) | Strong-generalization remote sensing image target identification method | |
CN114004295B (en) | Small sample image data expansion method based on countermeasure enhancement | |
CN114092716A (en) | Target detection method, system, computer equipment and storage medium thereof based on U2net | |
CN114529939A (en) | Pedestrian identification method based on millimeter wave radar point cloud clustering and deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||