CN112949481A - Speaker-independent lip language identification method and system - Google Patents

Speaker-independent lip language identification method and system

Info

Publication number
CN112949481A
Authority
CN
China
Prior art keywords
loss
sequence
identity
semantic
lip language
Prior art date
Legal status
Granted
Application number
CN202110226432.4A
Other languages
Chinese (zh)
Other versions
CN112949481B (en)
Inventor
路龙宾
宁都
金小敏
滑文强
孙涛
Current Assignee
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202110226432.4A (granted as CN112949481B)
Publication of CN112949481A
Application granted
Publication of CN112949481B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Multimedia (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a speaker-independent lip language identification method and system, wherein the method comprises the following steps: acquiring a training lip language picture sequence; inputting the training lip language picture sequence into an identity and semantic deep coupling model to obtain feature sequences and calculating the loss of each network; iteratively optimizing the coupling model and the lip language prediction network with the weighted sum of the losses as the optimization target to obtain an optimal recognition model; and inputting the picture sequence to be detected into the recognition model to obtain the recognition text. The method encodes the identity features and the semantic features of the lip language picture sequence separately, constrains the identity encoding process with an identity contrast loss between different samples and an identity difference loss between different frames of the same sample, constrains the semantic encoding process with a supervision loss, and further constrains the learned identity and semantic features with an identity and semantic coupling reconstruction network, thereby effectively preventing identity information from mixing into the semantic features and improving the recognition accuracy of the lip language recognition model under the speaker-independent condition.

Description

Speaker-independent lip language identification method and system
Technical Field
The invention relates to the technical field of intelligent human-computer interaction, and in particular to a speaker-independent lip language identification method and system.
Background
Lip language recognition is a new human-computer interaction mode that understands speaker semantics from visual information by analyzing the dynamic changes of the lip region. The technology can well overcome the shortcomings of speech recognition in noisy environments and effectively improve the reliability of a semantic analysis system. Lip language recognition has broad application prospects and can be used for language interaction recognition tasks in various noisy environments, such as language recognition in noisy places like hospitals and shopping malls. In addition, lip language recognition can be applied to assist the semantic understanding of hearing- and speech-impaired people, thereby helping them establish speaking ability.
At present, the accuracy of lip language recognition technology is far from meeting the requirements of practical application. Lip vocalization is formed by the mutual coupling of speaker identity and speech content in the time-space domain. Different speakers differ greatly in lip appearance, speaking style and the like, and even the same speaker differs in speaking style, speaking speed and the like at different times and in different scenes. Therefore, different identity information causes serious interference to the semantic content during recognition. This high coupling between speaker identity information and semantic content severely restricts improvement of the accuracy of lip language recognition systems.
Disclosure of Invention
The invention aims to provide a speaker-independent lip language identification method and system, which can solve the problem that the recognition result is disturbed by speaker identity information and improve the accuracy of lip language identification.
In order to achieve the purpose, the invention provides the following scheme:
A speaker-independent lip language identification method, comprising:
acquiring training lip language picture sequences of a plurality of speaker samples;
inputting a plurality of training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity characteristic sequence, a semantic characteristic sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
calculating difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
calculating supervision loss according to the predicted text sequence and the real text sequence;
taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, and performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimal lip language recognition model;
acquiring a lip language picture sequence to be identified;
and inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
Preferably, the 2D dense convolutional neural network and the 3D dense convolutional neural network are each composed of a dense convolutional neural network framework; the dense convolutional neural network framework comprises a dense connection transition layer, a pooling layer and a full connection layer which are connected in sequence; the dense connection transition layer comprises a plurality of dense connection transition units; each dense connection transition unit comprises a dense connection module and a transition module;
the lip language prediction network is a seq2seq network based on a self-attention mechanism; the lip language prediction network comprises an input module, an Encoder module, a Decoder module and a classification module;
the input module is respectively connected with the Encoder module and the Decoder module, the input module is used for acquiring a semantic feature sequence and a word vector sequence corresponding to the semantic feature sequence and embedding semantic vectors at different moments in the semantic feature sequence and word vectors in the word vector sequence into time position information, the Decoder module is respectively connected with the Encoder module and the classification module, and the Encoder module is used for performing deep feature mining on the semantic feature sequence embedded with the time position information to obtain a first feature sequence; the Decoder module is used for obtaining a second characteristic sequence according to the attention of the first characteristic sequence and the attention of a word vector sequence embedded with time position information, and the classification module is used for judging and obtaining a prediction text sequence according to the second characteristic sequence.
Preferably, the calculation formula of the contrast loss is as follows:

L_c = ∑_{i,j} [ y · ‖f(x_t^i) − f(x_{t'}^j)‖² + (1 − y) · max(0, margin − ‖f(x_t^i) − f(x_{t'}^j)‖)² ]

wherein L_c is the contrast loss; N represents the number of speaker samples from which the pairs (i, j) are drawn; x_t^i denotes the t-th frame image of sample i; x_{t'}^j denotes the t'-th frame image of sample j; f(x_t^i) and f(x_{t'}^j) denote their identity features; y indicates whether different groups of samples match: y is 1 when the identities of the two groups of samples are the same, otherwise y is 0; margin is a set threshold.
Preferably, the calculation formula of the difference loss is as follows:

L_d = ∑_{i=1}^{N} ∑_{j=1}^{T} ∑_{k=1}^{T} ‖f(x_j^i) − f(x_k^i)‖²

wherein L_d is the difference loss; N represents the number of speaker samples; x_j^i denotes the j-th frame image of sample i; x_k^i denotes the k-th frame image of sample i; f(x_j^i) and f(x_k^i) denote their identity features; T represents the number of frames in the speaker sample.
Preferably, the calculation formula of the Gaussian distribution difference loss is as follows:

μ_P = (1/(N_P·T)) ∑_{i,t} s(x_t^{i,P}),   Σ_P = (1/(N_P·T)) ∑_{i,t} (s(x_t^{i,P}) − μ_P)(s(x_t^{i,P}) − μ_P)^T

L_dd = ½ [ tr(Σ_Q^{−1} Σ_P) + (μ_Q − μ_P)^T Σ_Q^{−1} (μ_Q − μ_P) − z + ln( det(Σ_Q) / det(Σ_P) ) ]

wherein L_dd represents the Gaussian distribution difference loss; x_t^{i,P} denotes the t-th frame image of the i-th sample in group P of speaker samples, and s(x_t^{i,P}) its semantic feature; Σ_P and Σ_Q are the covariance matrices of the semantic features of the group-P and group-Q speaker samples; μ_P and μ_Q are the mean vectors of the semantic features of the group-P and group-Q speaker samples (computed analogously for group Q); det denotes the value of the matrix determinant; z represents the dimension of the semantic coding feature; T represents the number of frames in the speaker sample.
Preferably, the correlation loss is calculated by the formula:

L_R = ∑_{i=1}^{N} ∑_{t=1}^{T} ( ⟨f(x_t^i), s(x_t^i)⟩ / (‖f(x_t^i)‖ · ‖s(x_t^i)‖) )²

wherein L_R represents the correlation loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i denotes the t-th frame image of sample i; f(x_t^i) denotes its identity feature; s(x_t^i) denotes its semantic feature.
Preferably, the calculation formula of the reconstruction error loss is:

L_con = ∑_{i=1}^{N} ∑_{t=1}^{T} ‖ x_t^i − D([f(x_t^i); s(x_t^i)]) ‖²

wherein L_con represents the reconstruction error loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i denotes the t-th frame image of sample i; f(x_t^i) denotes its identity feature; [f(x_t^i); s(x_t^i)] denotes the concatenation of the identity feature vector and the semantic feature vector; D(·) denotes the coupling reconstruction network.
Preferably, the formula for calculating the supervision loss is as follows:

S_i = [ s(x_1^i), s(x_2^i), …, s(x_T^i) ]

(p̂_{t1}^i, …, p̂_{tC}^i) = E_p(S_i, ŵ_0^i, ŵ_1^i, …, ŵ_{t−1}^i)

L_seq = −(1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∑_{j=1}^{C} y_{tj}^i · log p̂_{tj}^i

wherein L_seq represents the supervision loss; N represents the number of speaker samples; T represents the number of frames in the speaker sample; C represents the number of text categories; y_{tj}^i is the true probability that the text class of the t-th frame of sample i is j, and p̂_{tj}^i is the predicted probability that the text category of the t-th frame of speaker sample i is j; S_i is the encoding matrix of the semantic features; E_p represents the lip language prediction network based on the self-attention mechanism; x_1^i, x_2^i, …, x_T^i are the 1st, 2nd, …, T-th frame images of the i-th sample, and s(x_1^i), s(x_2^i), …, s(x_T^i) their semantic features; ŵ_0^i, …, ŵ_{t−1}^i are the prediction outputs of items 0 to t−1, so the lip language prediction output of the t-th item is judged from the semantic features of all frames and the lip language prediction output contents of the 0th item to the (t−1)-th item.
Preferably, the iterative optimization of the identity and semantic deep coupling model and the lip language prediction network with the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model includes:
taking the weighted loss as an optimization function, and utilizing an Adam optimizer to perform iterative learning on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimized identity and semantic deep coupling model and a lip language prediction network;
wherein the optimization function is L(θ) = L_seq + α1·L_c + α2·L_d + α3·L_dd + α4·L_R + α5·L_con, where L(θ) is the weighted loss; L_seq is the supervision loss; L_c is the contrast loss; L_d is the difference loss; L_dd represents the Gaussian distribution difference loss; L_R represents the correlation loss; L_con represents the reconstruction error loss; α1 represents the weight of the contrast loss, α2 the weight of the difference loss, α3 the weight of the Gaussian distribution difference loss, α4 the weight of the correlation loss, and α5 the weight of the reconstruction error loss.
A system for speaker independent lip language recognition, comprising:
the first acquisition module is used for acquiring training lip language picture sequences of a plurality of speaker samples;
the feature output module is used for inputting the training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
the first calculation module is used for calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
the second calculation module is used for calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
the third calculation module is used for calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
the fourth calculation module is used for calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
the fifth calculation module is used for calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
the text output module is used for inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
the sixth calculation module is used for calculating supervision loss according to the predicted text sequence and the real text sequence;
the training module is used for performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model;
the second acquisition module is used for acquiring a lip language picture sequence to be identified;
and the recognition module is used for inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the lip language identification method and system for the speaker independence adopt two groups of independent networks, namely a 2D dense convolution neural network and a 3D dense convolution neural network, to respectively encode the identity information and the semantic information of a lip language picture sequence to obtain an identity characteristic sequence and a semantic characteristic sequence. The invention restricts the identity coding process by the identity comparison loss of different samples and the identity difference loss of different frames of the same sample, and solves the problem of influence on the identification result due to the interference of the identity information of the speaker. According to the method, the identity and semantic deep coupling model and the lip language prediction network are iteratively optimized by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the reconstruction error loss and the supervision loss as optimization targets, so that the problem of overfitting of a learned feature space is avoided. Semantic features are effectively prevented from being mixed into identity information, and therefore the recognition accuracy of the lip language recognition model under the speaker-independent condition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a method for speaker independent lip language identification in accordance with the present invention;
FIG. 2 is a schematic block diagram of an identification method in an embodiment of the invention;
fig. 3 is a diagram of an identity and semantic feature coding network framework in an embodiment of the present invention, where fig. 3(a) is a diagram of a dense convolutional neural network structure, fig. 3(b) is a diagram of a 2D convolutional structure in an identity coding network, and fig. 3(c) is a diagram of a 3D convolutional structure in a semantic coding network;
FIG. 4 is a diagram of an identity and semantic coupling reconstruction network in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a structure of a lip language prediction network based on a self-attention mechanism according to an embodiment of the present invention;
FIG. 6 is a block diagram of a lip language identification system for speaker independence in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a speaker-independent lip language identification method and system, which can solve the problem that the recognition result is disturbed by speaker identity information and improve the accuracy of lip language identification.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a method for recognizing speaker-independent lip language according to the present invention, and as shown in fig. 1, the method for recognizing speaker-independent lip language according to the present embodiment includes:
step 100: and acquiring training lip language picture sequences of a plurality of speaker samples.
Step 200: inputting a plurality of training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity characteristic sequence, a semantic characteristic sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain the reconstructed image sequence.
Step 300: and calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence.
Step 301: and calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence.
Step 302: and calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method.
Step 303: and calculating the correlation loss according to the identity characteristic sequence and the semantic characteristic sequence.
Step 304: and calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence.
Step 400: and inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence.
Step 500: and calculating the supervision loss according to the predicted text sequence and the real text sequence.
Step 600: and carrying out iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model.
Step 700: and acquiring a lip language picture sequence to be identified.
Step 800: and inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
Preferably, the 2D dense convolutional neural network and the 3D dense convolutional neural network are each composed of a dense convolutional neural network framework; the dense convolutional neural network framework comprises a dense connection transition layer, a pooling layer and a full connection layer which are connected in sequence; the dense connection transition layer comprises a plurality of dense connection transition units; each dense connection transition unit comprises a dense connection module and a transition module.
Fig. 3 is a frame diagram of the identity and semantic feature coding network in an embodiment of the present invention, and fig. 3(a) is a structural diagram of the dense convolutional neural network, which is composed of dense connection modules, transition modules, pooling layers and full connection layers, as shown in fig. 3(a). The dense connection module connects the output of the current layer to the input of each subsequent layer, unlike the traditional neural network, which has no cross-layer connections. If the current network has L layers, the traditional network has L connections, while the dense convolution mode has L(L-1)/2 connection modes. This dense connection mode realizes feature multiplexing, effectively reduces the number of channels of each layer, and reduces the number of network parameters to a certain extent. In addition, the large number of cross-layer connections can effectively relieve the gradient vanishing problem of deep neural networks as the depth increases. Suppose the output of the l-th layer is x_l; then the input and output of the densely connected l-th layer can be expressed as:
x_l = H_l([x_0, x_1, …, x_{l−1}])
where H_l(·) denotes the l-th layer convolution module. Fig. 3(b) shows the 2D convolution structure in the identity coding network and fig. 3(c) shows the 3D convolution structure in the semantic coding network; as shown in figs. 3(b) and 3(c), a 2D or 3D convolution structure is adopted according to the identity or semantic coding task, and the module is mainly composed of batch normalization, ReLU, 1×1 convolution and 3×3 convolution. Identity feature coding uses 2D convolution to extract the structural features of each static lip frame, while semantic feature coding uses 3D convolution operations to extract spatio-temporal features from several consecutive frames. The input of H_l(·) is the channel-wise combination of the outputs of layers 0 to l−1, which requires the feature maps output by each layer to have a uniform scale; within each dense connection module, the dimensions of the feature maps therefore remain unchanged. However, an essential element in convolutional neural networks is reducing the scale of the feature map through pooling operations, thereby capturing a larger receptive field. The dense convolutional neural network therefore introduces the transition modules shown in figs. 3(b) and 3(c) between different dense connection modules; each transition module consists of batch normalization, ReLU, 1×1 convolution and 2×2 pooling, where the 1×1 convolution compresses the channels and the 2×2 pooling downsamples the feature map, enabling feature capture over a wider range. The final pooling layer of the dense convolutional neural network performs global pooling on the output feature map, retaining only channel information, and the features are finally transformed through a fully connected layer.
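By way of illustration only, the dense connection rule x_l = H_l([x_0, x_1, …, x_{l−1}]) described above can be sketched in PyTorch as follows; the layer count, growth rate and the extra bottleneck normalization inside each H_l(·) are assumptions for the example, not the patented configuration:

```python
import torch
import torch.nn as nn

class DenseLayer2D(nn.Module):
    """One H_l(.) unit in the spirit of fig. 3(b): BN -> ReLU -> 1x1 conv -> 3x3 conv."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * growth_rate, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.body(x)

class DenseBlock2D(nn.Module):
    """Each layer receives the channel-wise concatenation of all previous outputs,
    i.e. x_l = H_l([x_0, x_1, ..., x_{l-1}])."""
    def __init__(self, num_layers: int, in_channels: int, growth_rate: int):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer2D(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```

A 3D variant for the semantic coding branch would follow the same pattern with Conv3d and BatchNorm3d modules, and transition modules (1×1 convolution plus 2×2 pooling) would be placed between successive dense blocks.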
Fig. 4 shows the structure of the identity and semantic coupling reconstruction network in the embodiment of the present invention. After the identity and semantic features are extracted by the dense convolutional neural networks, the two features are concatenated and input into the coupling reconstruction network shown in fig. 4. The network expands the feature vector into a feature map through a 4×4 deconvolution operation, then performs high-resolution reconstruction by upsampling, and after each upsampling extracts features through the convolution module shown in fig. 3, which consists of 3×3 convolution, batch normalization and ReLU. The upsampling and convolution operations are repeated until the scale of the output feature map is consistent with that of the lip language picture, completing the reconstruction process.
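A minimal sketch of such a coupling reconstruction path is given below; the concatenated feature dimension, channel widths, number of upsampling stages and the single-channel output are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CoupledReconstructionNet(nn.Module):
    """Expands the concatenated [identity; semantic] vector into a 4x4 feature map
    with a deconvolution, then alternates upsampling and 3x3 conv blocks until the
    lip-image resolution is reached."""
    def __init__(self, feat_dim: int = 512, base_channels: int = 256, num_upsamples: int = 5):
        super().__init__()
        self.expand = nn.ConvTranspose2d(feat_dim, base_channels, kernel_size=4)  # 1x1 -> 4x4
        blocks, ch = [], base_channels
        for _ in range(num_upsamples):
            blocks += [
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1),
                nn.BatchNorm2d(ch // 2), nn.ReLU(inplace=True),
            ]
            ch //= 2
        self.blocks = nn.Sequential(*blocks)
        self.to_image = nn.Conv2d(ch, 1, kernel_size=3, padding=1)  # reconstructed lip frame

    def forward(self, identity_feat, semantic_feat):
        z = torch.cat([identity_feat, semantic_feat], dim=1)  # (batch, feat_dim)
        z = z.view(z.size(0), -1, 1, 1)
        return self.to_image(self.blocks(self.expand(z)))
```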
Fig. 5 is a structural diagram of a lip language prediction network based on a self-attention mechanism in an embodiment of the present invention, and as shown in fig. 5, the lip language prediction network is a seq2seq network based on a self-attention mechanism; the lip language prediction network comprises an input module, an Encoder module, a Decoder module and a classification module.
The input module is respectively connected with the Encoder module and the Decoder module, the input module is used for acquiring a semantic feature sequence and a word vector sequence corresponding to the semantic feature sequence and embedding semantic vectors at different moments in the semantic feature sequence and word vectors in the word vector sequence into time position information, the Decoder module is respectively connected with the Encoder module and the classification module, and the Encoder module is used for performing deep feature mining on the semantic feature sequence embedded with the time position information to obtain a first feature sequence; the Decoder module is used for obtaining a second characteristic sequence according to the attention of the first characteristic sequence and the attention of a word vector sequence embedded with time position information, and the classification module is used for judging and obtaining a prediction text sequence according to the second characteristic sequence.
Specifically, in the input module, the lip language picture sequence is first passed through semantic feature coding to output semantic vectors s_1^i, s_2^i, …, s_T^i at different moments, which form the input sequence received by the input part of the lip language prediction network. Unlike RNNs, which process timing signals through a recursive relationship, the lip language prediction network encodes information at different times by superimposing time position information on the input data.
The position embedding information uses sine and cosine position coding: the position codes are generated with sine and cosine functions of different frequencies and then added to the semantic vector of the corresponding position, so the dimension of the position vector must be consistent with the dimension of the semantic vector. The specific calculation formula is as follows:

PE(pos, 2i) = sin( pos / 10000^(2i/d) )

PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

where pos represents the position of the semantic vector in the current sequence, i indexes the i-th pair of dimensions in the semantic vector, and d represents the dimension of the semantic vector.
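A short sketch of this sinusoidal position coding (assuming an even dimension d and PyTorch tensors) could look as follows:

```python
import torch

def sinusoidal_position_encoding(seq_len: int, d: int) -> torch.Tensor:
    """One d-dimensional position code per position (d assumed even)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)              # 2i = 0, 2, 4, ...
    angle = pos / (10000.0 ** (two_i / d))                          # pos / 10000^(2i/d)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(angle)                                  # even dimensions
    pe[:, 1::2] = torch.cos(angle)                                  # odd dimensions
    return pe

# the codes are added to the semantic vectors before they enter the Encoder, e.g.:
# s_tilde = semantic_seq + sinusoidal_position_encoding(T, d)
```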
Optionally, in the Encoder module, the semantic features embedded with time position information are input for deep feature mining. The Encoder module is divided into two parts, a transition layer and an output layer. The transition layer is composed of multi-head attention and layer normalization, and its input and output relationship can be expressed as:

h^i = LayerNorm( s̃^i + MultiHeadAttention(s̃^i) )

where s̃^i represents the semantic feature vector sequence of the i-th sample after time position information is embedded, MultiHeadAttention(·) is the multi-head attention, and LayerNorm(·) is the layer normalization.
Multi-headed attention allows neural networks to focus more on relevant parts of the input and less on irrelevant parts when performing predictive tasks. An attention function can be described as mapping a Query to an output with a set of Key-Value pairs (Key-Value), where Query, Key, Value, and output are vectors. The output may be calculated by a weighted sum of the values, where the weight assigned to each value may be calculated by a fitness function of Query and the corresponding Key. The method comprises the following specific steps:
MultiHeadAttention(s̃^i) = MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

where Q is the query vector sequence, K is the key vector sequence, V is the value vector sequence, and Q = K = V = s̃^i; Concat(·) is matrix concatenation and W^O is the output transformation matrix.

Single-head attention is calculated by the following formula:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) = softmax( (Q W_i^Q)(K W_i^K)^T / √d_k ) · V W_i^V

where W_i^Q is the i-th head transformation matrix of the query vector sequence, W_i^K is the i-th head transformation matrix of the key vector sequence, W_i^V is the i-th head transformation matrix of the value vector sequence, d_k is the per-head feature dimension, and h is the number of attention heads.
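For illustration, the single-head computation above, and a multi-head self-attention call with Q = K = V = s̃^i, might be sketched as follows; the embedding size, head count and batch/sequence shapes are assumed values:

```python
import math
import torch

def single_head_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head; Q, K, V: (batch, seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(torch.softmax(scores, dim=-1), V)

# multi-head self-attention over the embedded semantic sequence with Q = K = V:
mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
s_tilde = torch.randn(2, 75, 512)                 # (batch, T, d) dummy semantic sequence
out, _ = mha(query=s_tilde, key=s_tilde, value=s_tilde)
```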
Layer normalization is a common method for solving the problem of Internal Covariate Shift, can pull data distribution to an unsaturated region of an activation function, has the characteristic of weight/data expansion invariance, and has the effects of relieving gradient disappearance/explosion, accelerating training and regularization. Layer normalization is specifically implemented as follows:
μ = (1/D) ∑_{d=1}^{D} z_d,   σ² = (1/D) ∑_{d=1}^{D} (z_d − μ)²

LayerNorm(z) = α ⊙ (z − μ) / √(σ² + ε) + β
where z represents an input D-dimensional feature vector, and α and β are transform coefficients.
Optionally, the Encoder output layer is composed of a full connection layer and a layer normalization, and a mapping relationship between input and output is as follows:
s_e^i = LayerNorm( h^i + FC(h^i) )
preferably, the Decoder module is similar in overall structure to the Encoder module, and adds attention between the Encoder output and the Decoder input based on the Encoder, and calculates the output with the Encoder
Figure BDA0002956520610000122
K, V as an attention model calculation in Decoder, input as Decoder
Figure BDA0002956520610000123
The Decoder model output is computed as Q.
Optionally, the Decoder input is the word vector sequence of the language sequence, w^i = (w_1^i, w_2^i, …), where w_j^i represents the word vector at the j-th moment of the i-th lip language sequence; the word vectors of the language sequence are the real text sequence. The word vectors input to the Decoder are first embedded with the same time position information as in the Encoder, obtaining the position-embedded word vector sequence w̃^i. The input and output relationship after the first-layer self-attention is:

d^i = LayerNorm( w̃^i + MultiHeadAttention(w̃^i) )

where MultiHeadAttention(·) and LayerNorm(·) are calculated in the same manner as in the Encoder module.
Unlike the Encoder module, after obtaining d^i, the Decoder uses it as the query value Q of a second multi-head attention and uses the Encoder output s_e^i as the key K and value V of that multi-head attention, thereby calculating the attention between the word vector sequence and the semantic feature sequence:

a^i = LayerNorm( d^i + MultiHeadAttention(d^i, s_e^i, s_e^i) )

The output attention vector is then subjected to full connection and layer normalization to obtain the final output of the Decoder:

o^i = LayerNorm( a^i + FC(a^i) )

The output module judges the output content of the lip language from the Decoder output o^i through a full connection layer and a softmax layer.
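A rough sketch of one such Decoder step, assuming equal model dimensions for the word vectors and the Encoder output and using residual connections as in the formulas above, could be:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Self-attention on the word-vector sequence, cross-attention against the
    Encoder output (used as K and V), then a fully connected layer with LayerNorm."""
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.fc = nn.Linear(d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, word_vecs, encoder_out, causal_mask=None):
        # first-layer self-attention over the (position-embedded) word vectors
        sa, _ = self.self_attn(word_vecs, word_vecs, word_vecs, attn_mask=causal_mask)
        q = self.norm1(word_vecs + sa)
        # cross-attention: Q from the Decoder, K and V from the Encoder output
        ca, _ = self.cross_attn(q, encoder_out, encoder_out)
        h = self.norm2(q + ca)
        return self.norm3(h + self.fc(h))
```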
Preferably, the calculation formula of the contrast loss is as follows:

L_c = ∑_{i,j} [ y · ‖f(x_t^i) − f(x_{t'}^j)‖² + (1 − y) · max(0, margin − ‖f(x_t^i) − f(x_{t'}^j)‖)² ]

wherein L_c is the contrast loss; N represents the number of speaker samples from which the pairs (i, j) are drawn; x_t^i denotes the t-th frame image of sample i; x_{t'}^j denotes the t'-th frame image of sample j; f(x_t^i) and f(x_{t'}^j) denote their identity features; y indicates whether different groups of samples match: y is 1 when the identities of the two groups of samples are the same, otherwise y is 0; margin is a set threshold.
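One possible implementation sketch of a contrastive constraint of this form (pair construction, reduction and function names are illustrative assumptions rather than the patented code):

```python
import torch
import torch.nn.functional as F

def identity_contrast_loss(f_a, f_b, same_identity, margin: float = 1.0):
    """Contrastive loss on identity features of sampled frame pairs.
    f_a, f_b: (num_pairs, dim) identity features; same_identity: (num_pairs,)
    with y = 1 for frames of the same speaker and y = 0 otherwise."""
    dist = F.pairwise_distance(f_a, f_b)                                    # Euclidean distance
    pos = same_identity * dist.pow(2)                                       # pull matching pairs together
    neg = (1.0 - same_identity) * torch.clamp(margin - dist, min=0).pow(2)  # push others beyond the margin
    return (pos + neg).mean()
```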
Preferably, the calculation formula of the difference loss is as follows:

L_d = ∑_{i=1}^{N} ∑_{j=1}^{T} ∑_{k=1}^{T} ‖f(x_j^i) − f(x_k^i)‖²

wherein L_d is the difference loss; N represents the number of speaker samples; x_j^i denotes the j-th frame image of sample i; x_k^i denotes the k-th frame image of sample i; f(x_j^i) and f(x_k^i) denote their identity features; T represents the number of frames in the speaker sample.
Preferably, the calculation formula of the Gaussian distribution difference loss is as follows:

μ_P = (1/(N_P·T)) ∑_{i,t} s(x_t^{i,P}),   Σ_P = (1/(N_P·T)) ∑_{i,t} (s(x_t^{i,P}) − μ_P)(s(x_t^{i,P}) − μ_P)^T

L_dd = ½ [ tr(Σ_Q^{−1} Σ_P) + (μ_Q − μ_P)^T Σ_Q^{−1} (μ_Q − μ_P) − z + ln( det(Σ_Q) / det(Σ_P) ) ]

wherein L_dd represents the Gaussian distribution difference loss; x_t^{i,P} denotes the t-th frame image of the i-th sample in group P of speaker samples, and s(x_t^{i,P}) its semantic feature; Σ_P and Σ_Q are the covariance matrices of the semantic features of the group-P and group-Q speaker samples; μ_P and μ_Q are the mean vectors of the semantic features of the group-P and group-Q speaker samples (computed analogously for group Q); det denotes the value of the matrix determinant; z represents the dimension of the semantic coding feature; T represents the number of frames in the speaker sample.
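A sketch of this Gaussian distribution difference term as a KL divergence between Gaussians fitted to the two groups of semantic features; the regularization eps and the pooling of samples and frames into one matrix are assumptions:

```python
import torch

def gaussian_difference_loss(sem_p, sem_q, eps: float = 1e-5):
    """KL divergence between Gaussians fitted to two groups of semantic features.
    sem_p, sem_q: (num_vectors, z) semantic features pooled over samples and frames."""
    z = sem_p.size(1)
    mu_p, mu_q = sem_p.mean(dim=0), sem_q.mean(dim=0)
    eye = eps * torch.eye(z, device=sem_p.device)
    cov_p = torch.cov(sem_p.T) + eye                      # covariance of group P
    cov_q = torch.cov(sem_q.T) + eye                      # covariance of group Q
    cov_q_inv = torch.linalg.inv(cov_q)
    diff = (mu_q - mu_p).unsqueeze(1)                     # (z, 1)
    return 0.5 * (torch.trace(cov_q_inv @ cov_p)
                  + (diff.T @ cov_q_inv @ diff).squeeze()
                  - z
                  + torch.logdet(cov_q) - torch.logdet(cov_p))
```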
Preferably, the correlation loss is calculated by the formula:

L_R = ∑_{i=1}^{N} ∑_{t=1}^{T} ( ⟨f(x_t^i), s(x_t^i)⟩ / (‖f(x_t^i)‖ · ‖s(x_t^i)‖) )²

wherein L_R represents the correlation loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i denotes the t-th frame image of sample i; f(x_t^i) denotes its identity feature; s(x_t^i) denotes its semantic feature.
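As one hedged illustration, a correlation penalty of this kind could be implemented as the mean squared cosine similarity between per-frame identity and semantic features (this assumes both features share the same dimension; the exact formulation of the loss may differ):

```python
import torch
import torch.nn.functional as F

def correlation_loss(identity_feats, semantic_feats):
    """identity_feats, semantic_feats: (N, T, dim) per-frame features of equal size.
    Penalizes correlation so that the semantic code carries little identity information."""
    cos = F.cosine_similarity(identity_feats, semantic_feats, dim=-1)  # (N, T)
    return cos.pow(2).mean()
```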
Preferably, the calculation formula of the reconstruction error loss is:

L_con = ∑_{i=1}^{N} ∑_{t=1}^{T} ‖ x_t^i − D([f(x_t^i); s(x_t^i)]) ‖²

wherein L_con represents the reconstruction error loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i denotes the t-th frame image of sample i; f(x_t^i) denotes its identity feature; [f(x_t^i); s(x_t^i)] denotes the concatenation of the identity feature vector and the semantic feature vector; D(·) denotes the coupling reconstruction network.
Preferably, the formula for calculating the supervision loss is as follows:

S_i = [ s(x_1^i), s(x_2^i), …, s(x_T^i) ]

(p̂_{t1}^i, …, p̂_{tC}^i) = E_p(S_i, ŵ_0^i, ŵ_1^i, …, ŵ_{t−1}^i)

L_seq = −(1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∑_{j=1}^{C} y_{tj}^i · log p̂_{tj}^i

wherein L_seq represents the supervision loss; N represents the number of speaker samples; T represents the number of frames in the speaker sample; C represents the number of text categories; y_{tj}^i is the true probability that the text class of the t-th frame of sample i is j, and p̂_{tj}^i is the predicted probability that the text category of the t-th frame of speaker sample i is j; S_i is the encoding matrix of the semantic features; E_p represents the lip language prediction network based on the self-attention mechanism; x_1^i, x_2^i, …, x_T^i are the 1st, 2nd, …, T-th frame images of the i-th sample, and s(x_1^i), s(x_2^i), …, s(x_T^i) their semantic features; ŵ_0^i, …, ŵ_{t−1}^i are the prediction outputs of items 0 to t−1, so the lip language prediction output of the t-th item is judged from the semantic features of all frames and the lip language prediction output contents of the 0th item to the (t−1)-th item.
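A compact sketch of this supervision term as a per-step cross entropy averaged over samples (the logits/targets layout is an assumption):

```python
import torch
import torch.nn.functional as F

def supervision_loss(logits, targets):
    """logits: (N, T, C) outputs of the lip language prediction network;
    targets: (N, T) integer text-class labels. Cross entropy summed over output
    steps, averaged over the N speaker samples."""
    n, t, c = logits.shape
    ce = F.cross_entropy(logits.reshape(n * t, c), targets.reshape(n * t), reduction="sum")
    return ce / n
```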
Preferably, the iterative optimization of the identity and semantic deep coupling model and the lip language prediction network with the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model includes:
taking the weighted loss as an optimization function, and utilizing an Adam optimizer to perform iterative learning on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimized identity and semantic deep coupling model and a lip language prediction network;
wherein the optimization function is L(θ) = L_seq + α1·L_c + α2·L_d + α3·L_dd + α4·L_R + α5·L_con, where L(θ) is the weighted loss; L_seq is the supervision loss; L_c is the contrast loss; L_d is the difference loss; L_dd represents the Gaussian distribution difference loss; L_R represents the correlation loss; L_con represents the reconstruction error loss; α1 represents the weight of the contrast loss, α2 the weight of the difference loss, α3 the weight of the Gaussian distribution difference loss, α4 the weight of the correlation loss, and α5 the weight of the reconstruction error loss.
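For illustration, the weighted optimization target can be composed as below; the weight values are placeholders rather than values specified by the patent:

```python
# placeholder weights for alpha_1 ... alpha_5
alphas = {"c": 1.0, "d": 1.0, "dd": 1.0, "r": 1.0, "con": 1.0}

def total_loss(l_seq, l_c, l_d, l_dd, l_r, l_con, a=alphas):
    # L(theta) = L_seq + a1*L_c + a2*L_d + a3*L_dd + a4*L_R + a5*L_con
    return l_seq + a["c"] * l_c + a["d"] * l_d + a["dd"] * l_dd + a["r"] * l_r + a["con"] * l_con
```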
Specifically, the Adam optimizer combines the advantages of the AdaGrad and RMSProp optimization algorithms, using both the first moment estimate and the second moment estimate of the gradient to calculate the update step. For the optimization problem of the total loss, the Adam optimizer is implemented in the following specific steps:

(1) randomly initialize the parameters θ, the first moment m_0 and the second moment v_0 at time 0;

(2) update the gradient at time t: g_t = ∇_θ L(θ_{t−1});

(3) update the first moment: m_t ← β1·m_{t−1} + (1 − β1)·g_t;

(4) update the second moment: v_t ← β2·v_{t−1} + (1 − β2)·g_t²;

(5) update the unbiased first moment: m̂_t = m_t / (1 − β1^t);

(6) update the unbiased second moment: v̂_t = v_t / (1 − β2^t);

(7) update the parameters: θ_t ← θ_{t−1} − α·m̂_t / (√v̂_t + ε);

repeat (2)-(7) until the loss converges,

where β1 and β2 are the exponential decay rates, β1^t and β2^t their t-th powers, α is the learning rate, g_t² is the element-wise square of the gradient g_t, and ε = 10⁻⁸.
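A minimal sketch of one update following steps (2) to (7) above; in practice torch.optim.Adam performs the same computation internally:

```python
import torch

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; theta, grad, m, v are tensors, t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad                  # first moment
    v = beta2 * v + (1 - beta2) * grad.pow(2)           # second moment
    m_hat = m / (1 - beta1 ** t)                        # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                        # bias-corrected second moment
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)   # parameter update
    return theta, m, v

# the same update is obtained with the built-in optimizer, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```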
Optionally, the method further comprises:
and inputting the lip language picture sequence to be recognized into the 3D dense convolution neural network in the optimal lip language recognition model to obtain a semantic feature sequence to be recognized.
And inputting the semantic feature sequence to be recognized into a lip language prediction network in the optimal lip language recognition model to obtain a predicted text sequence.
Specifically, the semantic coding network E_s and the lip language prediction network E_p are used to extract and recognize the semantic features:

(s_1^i, s_2^i, …, s_T^i) = E_s(x_1^i, x_2^i, …, x_T^i)

ŵ_t^i = E_p(s_1^i, …, s_T^i, ŵ_0^i, …, ŵ_{t−1}^i)

The input lip language picture sequence is semantically encoded into the semantic feature sequence s_1^i, …, s_T^i, and the lip language prediction network predicts the word vector output at time t from the input semantic feature sequence and all word vectors before time t. The input semantic feature sequence passes through the Encoder structure shown in fig. 5, which computes the semantic coding output s_e^i. The Decoder applies self-attention to the word vector ŵ_{t−1}^i output at time t−1, characterizing it as an attention-weighted sum over all word vectors of the first t−1 moments; it then relates the semantic feature code s_e^i output by the Encoder to this representation through the self-attention mechanism, computes the Decoder output, and predicts the word vector ŵ_t^i at time t. For the word vector output at the first moment, the Decoder predicts from the default start word vector ŵ_0^i, and the lip language output word vectors at each moment are predicted recursively, layer by layer.
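An illustrative greedy decoding loop for this recursive prediction; the prediction_net callable, its signature and the start/end token ids are assumptions for the example:

```python
import torch

def greedy_decode(prediction_net, semantic_seq, start_id: int, end_id: int, max_len: int = 50):
    """Recursive word-by-word prediction at inference time.
    prediction_net(semantic_seq, tokens) is assumed to return logits of shape
    (1, len(tokens), C)."""
    tokens = torch.tensor([[start_id]])                        # default start word
    for _ in range(max_len):
        logits = prediction_net(semantic_seq, tokens)          # attends over all frames
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # most probable next word
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == end_id:
            break
    return tokens[:, 1:]                                       # drop the start token
```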
Fig. 6 is a block diagram of a lip language identification system for speaker independence according to the present invention, and as shown in fig. 6, a lip language identification system for speaker independence according to the present invention includes:
the first acquisition module is used for acquiring training lip language picture sequences of a plurality of speaker samples;
the feature output module is used for inputting the training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
the first calculation module is used for calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
the second calculation module is used for calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
the third calculation module is used for calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
the fourth calculation module is used for calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
the fifth calculation module is used for calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
the text output module is used for inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
the sixth calculation module is used for calculating supervision loss according to the predicted text sequence and the real text sequence;
the training module is used for performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model;
the second acquisition module is used for acquiring a lip language picture sequence to be identified;
and the recognition module is used for inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
The invention has the following beneficial effects:
Firstly, two groups of independent networks are adopted to separately encode the identity information and the semantic information of the lip language picture sequence; the identity coding process is constrained by the identity contrast loss between different samples and the identity difference loss between different frames of the same sample, and the semantic coding process is constrained by the seq2seq supervision loss. Compared with current lip language recognition methods, which rely on a single semantic supervision constraint, this effectively prevents identity information from being mixed into the semantic features and improves the recognition accuracy of the lip language recognition model under the speaker-independent condition.
Secondly, on the basis of the coupling model, the invention further introduces the related loss constraint of the identity characteristic and the semantic characteristic to ensure the minimum correlation of the identity information and the semantic information; in addition, the invention further assumes that the semantic features obey single Gaussian distribution, takes the Gaussian distribution differences of different groups of samples as loss constraints, ensures the minimum semantic feature distribution difference extracted by different speakers, and limits the independence of semantic space, thereby improving the robust performance of the lip language recognition system on speaker identity change.
Thirdly, the invention adopts a seq2seq model based on the self-attention mechanism in the semantic prediction process. Compared with the recurrent neural networks such as LSTM and GRU adopted by current lip language identification methods, it achieves long-term memory and correlation of the time-sequence features, thereby improving the precision of the lip language prediction process. In addition, unlike the traditional recurrent neural network, which is trained in a recursive manner, the self-attention mechanism allows the model to be trained in parallel, so the learning time of the lip language recognition network is greatly shortened.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A speaker-independent lip language identification method, characterized by comprising the following steps:
acquiring training lip language picture sequences of a plurality of speaker samples;
inputting a plurality of training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity characteristic sequence, a semantic characteristic sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
calculating difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
calculating supervision loss according to the predicted text sequence and the real text sequence;
taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, and performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimal lip language recognition model;
acquiring a lip language picture sequence to be identified;
and inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
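For readers who want a concrete picture of the data flow described in claim 1, the following is a deliberately simplified PyTorch-style sketch of a coupling model with a per-frame 2D encoder (identity), a clip-level 3D encoder (semantics) and a deconvolution decoder that reconstructs each frame from the concatenated codes. The layer sizes, channel counts and the 32x32 frame resolution are illustrative assumptions and do not reproduce the dense convolutional architectures specified in claim 2.

```python
import torch
import torch.nn as nn

class CouplingModelSketch(nn.Module):
    """Toy stand-in for the identity/semantic deep coupling model:
    a 2D encoder per frame (identity), a 3D encoder over the clip
    (semantics), and a deconvolution decoder that reconstructs each
    frame from the concatenated identity + semantic codes."""

    def __init__(self, id_dim=64, sem_dim=64):
        super().__init__()
        self.encoder_2d = nn.Sequential(              # identity branch
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, id_dim))
        self.encoder_3d = nn.Sequential(              # semantic branch
            nn.Conv3d(1, 16, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)))
        self.sem_proj = nn.Linear(16, sem_dim)
        self.decoder = nn.Sequential(                 # reconstruction branch
            nn.Linear(id_dim + sem_dim, 16 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (16, 8, 8)),
            nn.ConvTranspose2d(16, 1, 4, stride=4))   # back to 32x32 frames

    def forward(self, clip):                          # clip: (B, T, 1, 32, 32)
        b, t = clip.shape[:2]
        frames = clip.reshape(b * t, *clip.shape[2:])
        identity = self.encoder_2d(frames).view(b, t, -1)        # (B, T, id_dim)
        sem = self.encoder_3d(clip.transpose(1, 2))              # (B, 16, T, 1, 1)
        sem = self.sem_proj(sem.squeeze(-1).squeeze(-1).transpose(1, 2))  # (B, T, sem_dim)
        recon = self.decoder(torch.cat([identity, sem], dim=-1).reshape(b * t, -1))
        return identity, sem, recon.view(b, t, 1, 32, 32)

model = CouplingModelSketch()
ids, sems, recon = model(torch.randn(2, 5, 1, 32, 32))
print(ids.shape, sems.shape, recon.shape)
```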
2. The method for speaker-independent lip language identification according to claim 1, wherein the 2D dense convolutional neural network and the 3D dense convolutional neural network are each composed of a dense convolutional neural network framework; the dense convolutional neural network framework comprises a dense connection transition layer, a pooling layer and a full connection layer which are connected in sequence; the dense connection transition layer comprises a plurality of dense connection transition units; each dense connection transition unit comprises a dense connection module and a transition module;
the lip language prediction network is a seq2seq network based on a self-attention mechanism; the lip language prediction network comprises an input module, an Encoder module, a Decoder module and a classification module;
the input module is respectively connected with the Encoder module and the Decoder module, and is used for acquiring the semantic feature sequence and a word vector sequence corresponding to the semantic feature sequence, and for embedding time position information into the semantic vectors at different moments in the semantic feature sequence and into the word vectors in the word vector sequence; the Decoder module is respectively connected with the Encoder module and the classification module; the Encoder module is used for performing deep feature mining on the semantic feature sequence embedded with the time position information to obtain a first feature sequence; the Decoder module is used for obtaining a second feature sequence according to the attention between the first feature sequence and the word vector sequence embedded with the time position information; and the classification module is used for obtaining the predicted text sequence by classification according to the second feature sequence.
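A minimal sketch of such a self-attention seq2seq predictor is given below, using PyTorch's stock Transformer layers as stand-ins for the Encoder and Decoder modules; the model dimension, number of layers and vocabulary size are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class LipSeq2SeqSketch(nn.Module):
    """Illustrative self-attention seq2seq predictor: positional embedding,
    Transformer encoder over the semantic features, Transformer decoder over
    embedded target tokens, and a linear classifier over the vocabulary."""

    def __init__(self, sem_dim=64, vocab=40, d_model=128, max_len=200):
        super().__init__()
        self.in_proj = nn.Linear(sem_dim, d_model)             # input module
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)          # time-position info
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 2)
        self.classifier = nn.Linear(d_model, vocab)            # classification module

    def forward(self, sem_seq, tokens):
        # sem_seq: (B, T, sem_dim) semantic features; tokens: (B, L) target ids.
        t, l = sem_seq.size(1), tokens.size(1)
        src = self.in_proj(sem_seq) + self.pos_emb(torch.arange(t))   # embed positions
        tgt = self.tok_emb(tokens) + self.pos_emb(torch.arange(l))
        memory = self.encoder(src)                                    # first feature sequence
        causal = nn.Transformer.generate_square_subsequent_mask(l)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)           # second feature sequence
        return self.classifier(hidden)                                # (B, L, vocab) logits

model = LipSeq2SeqSketch()
logits = model(torch.randn(2, 75, 64), torch.randint(0, 40, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 40])
```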
3. The method of claim 1, wherein the contrast loss is calculated by a formula of the form:
L_c = \sum_{i,j} \left[ \, y \, \left\| E_I(x_t^i) - E_I(x_{t'}^j) \right\|_2^2 + (1-y) \, \max\!\left( margin - \left\| E_I(x_t^i) - E_I(x_{t'}^j) \right\|_2 ,\, 0 \right)^2 \right]
wherein L_c is the contrast loss; N represents the number of speaker samples; x_t^i represents the t-th frame image of sample i; x_{t'}^j represents the t'-th frame image of sample j; E_I(x_t^i) and E_I(x_{t'}^j) represent their identity features; y indicates whether different groups of samples match, with y = 1 when the identities of the two groups of samples are the same and y = 0 otherwise; and margin is a set threshold.
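The following sketch shows one common realisation of such a pairwise contrastive constraint on identity features; the averaging over pairs and the exact weighting are assumptions for the example, not the patent's reference formula.

```python
import torch
import torch.nn.functional as F

def contrast_loss(id_a, id_b, y, margin=1.0):
    """Pairwise contrastive loss on identity features (one common form of the
    constraint described in claim 3): matched pairs (y = 1) are pulled
    together, mismatched pairs (y = 0) are pushed at least `margin` apart.

    id_a, id_b: (P, D) identity features of two groups of frames.
    y:          (P,) 1 when the two frames come from the same speaker.
    """
    dist = F.pairwise_distance(id_a, id_b)                      # Euclidean distance
    pull = y * dist.pow(2)                                      # same identity: shrink distance
    push = (1 - y) * torch.clamp(margin - dist, min=0).pow(2)   # different identity: enforce margin
    return (pull + push).mean()

loss = contrast_loss(torch.randn(8, 64), torch.randn(8, 64),
                     torch.randint(0, 2, (8,)).float())
print(loss.item())
```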
4. The method of claim 1, wherein the difference loss is calculated by a formula of the form:
L_d = \frac{1}{N T^2} \sum_{i=1}^{N} \sum_{j=1}^{T} \sum_{k=1}^{T} \left\| E_I(x_j^i) - E_I(x_k^i) \right\|_2^2
wherein L_d is the difference loss; N represents the number of speaker samples; x_j^i represents the j-th frame image of sample i; x_k^i represents the k-th frame image of sample i; E_I(x_j^i) and E_I(x_k^i) represent their identity features; and T represents the number of frames in the speaker sample.
5. The method of claim 1, wherein the Gaussian distribution difference loss is calculated by a formula of the form:
L_{dd} = \frac{1}{2} \left[ \ln \frac{\det \Sigma_Q}{\det \Sigma_P} - z + \mathrm{tr}\!\left( \Sigma_Q^{-1} \Sigma_P \right) + \left( \mu_Q - \mu_P \right)^{\top} \Sigma_Q^{-1} \left( \mu_Q - \mu_P \right) \right]
wherein L_dd represents the Gaussian distribution difference loss; x_t^{i,P} represents the t-th frame image of the i-th sample in the P-th group of speaker samples; E_S(x_t^{i,P}) represents the semantic feature of that frame; mu_P and Sigma_P are the mean vector and covariance matrix of the semantic features of the P-th group of speaker samples; mu_Q and Sigma_Q are the mean vector and covariance matrix of the semantic features of the Q-th group of speaker samples; det represents the value of the matrix determinant; z represents the dimension of the semantic coding feature; and T represents the number of frames in the speaker sample.
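Assuming the distribution-difference term is the KL divergence between single Gaussians fitted to the two groups of semantic features, a sketch could look as follows; the regularisation constant and the use of torch.cov are choices made for the example.

```python
import torch

def gaussian_difference_loss(sem_p, sem_q, eps=1e-5):
    """KL divergence between single Gaussians fitted to two groups of
    semantic features, one plausible reading of the loss in claim 5.

    sem_p, sem_q: (M, z) semantic features of groups P and Q."""
    z = sem_p.size(1)
    mu_p, mu_q = sem_p.mean(0), sem_q.mean(0)
    cov_p = torch.cov(sem_p.T) + eps * torch.eye(z)   # regularise for stability
    cov_q = torch.cov(sem_q.T) + eps * torch.eye(z)
    cov_q_inv = torch.linalg.inv(cov_q)
    diff = (mu_q - mu_p).unsqueeze(1)                 # (z, 1)
    kl = 0.5 * (torch.logdet(cov_q) - torch.logdet(cov_p) - z
                + torch.trace(cov_q_inv @ cov_p)
                + (diff.T @ cov_q_inv @ diff).squeeze())
    return kl

print(gaussian_difference_loss(torch.randn(200, 8), torch.randn(200, 8)).item())
```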
6. The method of claim 1, wherein the correlation loss is calculated by a formula of the form:
L_R = \frac{1}{N T} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \frac{ E_I(x_t^i)^{\top} E_S(x_t^i) }{ \left\| E_I(x_t^i) \right\|_2 \left\| E_S(x_t^i) \right\|_2 } \right)^{2}
wherein L_R represents the correlation loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i represents the t-th frame image of sample i; E_I(x_t^i) represents its identity feature; and E_S(x_t^i) represents its semantic feature.
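One plausible reading of this correlation constraint, assuming the identity and semantic codes share a common dimension, is a squared-cosine penalty per frame, sketched below for illustration only.

```python
import torch
import torch.nn.functional as F

def correlation_loss(identity_seq, semantic_seq):
    """Decorrelation penalty between identity and semantic features (one way
    to realise the correlation constraint of claim 6): drive the cosine
    similarity of the two codes of each frame towards zero.

    identity_seq, semantic_seq: (N, T, D) per-frame feature sequences.
    """
    cos = F.cosine_similarity(identity_seq, semantic_seq, dim=-1)  # (N, T)
    return cos.pow(2).mean()

print(correlation_loss(torch.randn(4, 10, 64), torch.randn(4, 10, 64)).item())
```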
7. The method of claim 1, wherein the reconstruction error loss is calculated by a formula of the form:
L_{con} = \frac{1}{N T} \sum_{i=1}^{N} \sum_{t=1}^{T} \left\| x_t^i - D\!\left( \left[ E_I(x_t^i), E_S(x_t^i) \right] \right) \right\|_2^2
wherein L_con represents the reconstruction error loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i represents the t-th frame image of sample i; E_I(x_t^i) represents its identity feature; E_S(x_t^i) represents its semantic feature; [E_I(x_t^i), E_S(x_t^i)] denotes the connection (concatenation) of the identity feature vector and the semantic feature vector; and D denotes the deconvolution neural network used for reconstruction.
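A sketch of this reconstruction-error term is given below, with an arbitrary linear stand-in in place of the deconvolution network of claim 1; tensor shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(frames, decoder, identity_seq, semantic_seq):
    """Reconstruction-error term of claim 7: decode the concatenated identity
    and semantic codes of every frame and compare with the original frame.

    frames:       (N, T, C, H, W) input lip images.
    decoder:      maps a (id_dim + sem_dim) vector back to a flattened frame.
    identity_seq: (N, T, id_dim); semantic_seq: (N, T, sem_dim).
    """
    n, t = frames.shape[:2]
    codes = torch.cat([identity_seq, semantic_seq], dim=-1).reshape(n * t, -1)
    recon = decoder(codes).view_as(frames)
    return F.mse_loss(recon, frames)     # mean squared reconstruction error

# Toy usage with a linear "decoder" standing in for the deconvolution network.
dec = torch.nn.Sequential(torch.nn.Linear(128, 1 * 32 * 32))
loss = reconstruction_loss(torch.randn(2, 5, 1, 32, 32), dec,
                           torch.randn(2, 5, 64), torch.randn(2, 5, 64))
print(loss.item())
```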
8. The method of claim 1, wherein the supervision loss is calculated by formulas of the form:
S_i = \left[ E_S(x_1^i), E_S(x_2^i), \ldots, E_S(x_T^i) \right]
q_t^i = E_p\!\left( S_i, \hat{y}_0^i, \hat{y}_1^i, \ldots, \hat{y}_{t-1}^i \right)
L_{seq} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{j=1}^{C} p_{t,j}^{i} \log q_{t,j}^{i}
wherein L_seq represents the supervision loss; N represents the number of speaker samples; T represents the number of frames in the speaker sample; C represents the number of text categories; p_{t,j}^i is the true probability that the text category of the t-th frame of sample i is j; q_{t,j}^i is the predicted probability that the text category of the t-th frame of speaker sample i is j; S_i is the encoding matrix of the semantic features; E_p represents the lip language prediction network based on the self-attention mechanism; x_1^i, x_2^i, ..., x_T^i represent the 1st, 2nd, ..., T-th frame images of the i-th sample; E_S(x_1^i), E_S(x_2^i), ..., E_S(x_T^i) represent their semantic features; and the lip language prediction output at step t is determined from the semantic features of all frames and from the lip language prediction outputs \hat{y}_0^i, ..., \hat{y}_{t-1}^i already produced at steps 0 to t-1.
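In practice such a supervision term is ordinarily implemented as a cross-entropy over the predictor's logits; the sketch below assumes one-hot reference labels, so the true-probability weighting reduces to indexing the target class.

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits, targets):
    """Cross-entropy supervision of the predicted text sequence (claim 8):
    at every step the predictor, conditioned on the semantic features of all
    frames and on the outputs already produced, scores the C text classes,
    and the loss compares those scores with the reference text.

    logits:  (N, T, C) unnormalised class scores from the prediction network.
    targets: (N, T)    reference class indices.
    """
    n, t, c = logits.shape
    return F.cross_entropy(logits.reshape(n * t, c), targets.reshape(n * t))

loss = supervised_loss(torch.randn(2, 12, 40), torch.randint(0, 40, (2, 12)))
print(loss.item())
```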
9. The method according to claim 1, wherein the iterative optimization of the identity and semantic deep coupling model and the lip language prediction network, with the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, to obtain an optimal lip language recognition model comprises:
taking the weighted loss as an optimization function, and utilizing an Adam optimizer to perform iterative learning on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimized identity and semantic deep coupling model and a lip language prediction network;
wherein the optimization function is L(\theta) = L_{seq} + \alpha_1 L_c + \alpha_2 L_d + \alpha_3 L_{dd} + \alpha_4 L_R + \alpha_5 L_{con}, wherein L(\theta) is the weighted loss; L_seq is the supervision loss; L_c is the contrast loss; L_d is the difference loss; L_dd represents the Gaussian distribution difference loss; L_R represents the correlation loss; L_con represents the reconstruction error loss; \alpha_1 represents the weight of the contrast loss, \alpha_2 represents the weight of the difference loss, \alpha_3 represents the weight of the Gaussian distribution difference loss, \alpha_4 represents the weight of the correlation loss, and \alpha_5 represents the weight of the reconstruction error loss.
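A sketch of the weighted objective and a single Adam update follows; the individual loss values and the alpha weights are placeholders for illustration and are not values taken from the patent.

```python
import torch

# Illustrative weights for the five auxiliary losses of claim 9 (placeholders).
alphas = {"c": 1.0, "d": 1.0, "dd": 0.1, "r": 0.1, "con": 1.0}

def total_loss(l_seq, l_c, l_d, l_dd, l_r, l_con):
    """L(theta) = L_seq + a1*Lc + a2*Ld + a3*Ldd + a4*LR + a5*Lcon."""
    return (l_seq + alphas["c"] * l_c + alphas["d"] * l_d
            + alphas["dd"] * l_dd + alphas["r"] * l_r + alphas["con"] * l_con)

# A single Adam step on a toy parameter, standing in for the joint optimisation
# of the coupling model and the prediction network.
param = torch.nn.Parameter(torch.randn(4))
optimizer = torch.optim.Adam([param], lr=1e-4)
losses = [param.pow(2).mean() for _ in range(6)]   # placeholder loss terms
loss = total_loss(*losses)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```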
10. A system for speaker independent lip recognition, comprising:
the first acquisition module is used for acquiring training lip language picture sequences of a plurality of speaker samples;
the feature output module is used for inputting the training lip language picture sequences into an identity and semantic deep coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding the semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain the reconstructed image sequence;
the first calculation module is used for calculating a contrast loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
the second calculation module is used for calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
the third calculation module is used for calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
the fourth calculation module is used for calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
the fifth calculation module is used for calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
the text output module is used for inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
the sixth calculation module is used for calculating supervision loss according to the predicted text sequence and the real text sequence;
the training module is used for performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model;
the second acquisition module is used for acquiring a lip language picture sequence to be identified;
and the recognition module is used for inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
CN202110226432.4A 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence Active CN112949481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110226432.4A CN112949481B (en) 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence

Publications (2)

Publication Number Publication Date
CN112949481A true CN112949481A (en) 2021-06-11
CN112949481B CN112949481B (en) 2023-09-22

Family

ID=76246958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110226432.4A Active CN112949481B (en) 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence

Country Status (1)

Country Link
CN (1) CN112949481B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339806A (en) * 2018-12-19 2020-06-26 马上消费金融股份有限公司 Training method of lip language recognition model, living body recognition method and device
WO2020252922A1 (en) * 2019-06-21 2020-12-24 平安科技(深圳)有限公司 Deep learning-based lip reading method and apparatus, electronic device, and medium
CN112330713A (en) * 2020-11-26 2021-02-05 南京工程学院 Method for improving speech comprehension degree of severe hearing impaired patient based on lip language recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
马宁; 田国栋; 周曦: "A lip language recognition method based on long short-term memory", Journal of University of Chinese Academy of Sciences, No. 01 *
马惠珠; 宋朝晖; 季飞; 侯嘉; 熊小芸: "Research directions and keywords of computer-aided project acceptance: the 2012 acceptance situation and points for attention in 2013", Journal of Electronics & Information Technology, No. 01 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313056A (en) * 2021-06-16 2021-08-27 中国科学技术大学 Compact 3D convolution-based lip language identification method, system, device and storage medium
CN114466179A (en) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 Method and device for measuring synchronism of voice and image
CN114092496A (en) * 2021-11-30 2022-02-25 西安邮电大学 Lip segmentation method and system based on spatial weighting
CN116959060A (en) * 2023-04-20 2023-10-27 湘潭大学 Lip language identification method for patient with language disorder in hospital environment

Also Published As

Publication number Publication date
CN112949481B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN112949481B (en) Lip language identification method and system for speaker independence
CN112991354B (en) High-resolution remote sensing image semantic segmentation method based on deep learning
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN109934261B (en) Knowledge-driven parameter propagation model and few-sample learning method thereof
CN109919204B (en) Noise image-oriented deep learning clustering method
CN109949824B (en) City sound event classification method based on N-DenseNet and high-dimensional mfcc characteristics
CN112116593B (en) Domain self-adaptive semantic segmentation method based on base index
CN111340046A (en) Visual saliency detection method based on feature pyramid network and channel attention
CN110399850A (en) A kind of continuous sign language recognition method based on deep neural network
CN111461025B (en) Signal identification method for self-evolving zero-sample learning
CN115952407B (en) Multipath signal identification method considering satellite time sequence and airspace interactivity
CN107491729B (en) Handwritten digit recognition method based on cosine similarity activated convolutional neural network
CN114220154A (en) Micro-expression feature extraction and identification method based on deep learning
CN111259785B (en) Lip language identification method based on time offset residual error network
CN114898773B (en) Synthetic voice detection method based on deep self-attention neural network classifier
CN117809181A (en) High-resolution remote sensing image water body extraction network model and method
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN117033657A (en) Information retrieval method and device
CN114783418A (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN116935403A (en) End-to-end character recognition method based on dynamic sampling
CN116825120A (en) Gas leakage acoustic signal noise reduction method and system based on hourglass model
CN113887504B (en) Strong-generalization remote sensing image target identification method
CN114004295B (en) Small sample image data expansion method based on countermeasure enhancement
CN114092716A (en) Target detection method, system, computer equipment and storage medium thereof based on U2net
CN114529939A (en) Pedestrian identification method based on millimeter wave radar point cloud clustering and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant