CN112949481A - Speaker-independent lip language identification method and system - Google Patents

Speaker-independent lip language identification method and system

Info

Publication number
CN112949481A
Authority
CN
China
Prior art keywords
loss
sequence
identity
semantic
lip language
Prior art date
Legal status
Granted
Application number
CN202110226432.4A
Other languages
Chinese (zh)
Other versions
CN112949481B (en)
Inventor
路龙宾
宁都
金小敏
滑文强
孙涛
Current Assignee
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202110226432.4A (granted as CN112949481B)
Publication of CN112949481A
Application granted
Publication of CN112949481B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Multimedia (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a speaker-independent lip language identification method and system, wherein the method comprises the following steps: acquiring a training lip language picture sequence; inputting the training lip language picture sequence into an identity and semantic deep coupling model to obtain feature sequences and calculating the loss of each network; iteratively optimizing the coupling model and the lip language prediction network with the weighted sum of the losses as the optimization target to obtain an optimal recognition model; and inputting the picture sequence to be detected into the recognition model to obtain the recognition text. The method encodes the identity features and the semantic features of the lip language picture sequence separately, constrains the identity encoding process with an identity contrast loss between different samples and an identity difference loss between different frames of the same sample, constrains the semantic encoding process with a supervision loss, and further constrains the learned identity and semantic features with an identity and semantic coupling reconstruction network, thereby effectively preventing identity information from mixing into the semantic features and improving the recognition accuracy of the lip language recognition model under the speaker-independent condition.

Description

Speaker-independent lip language identification method and system
Technical Field
The invention relates to the technical field of intelligent human-computer interaction, and in particular to a speaker-independent lip language identification method and system.
Background
Lip language recognition is a new human-computer interaction mode that understands speaker semantics from visual information by analyzing the dynamic changes of the lip region. The technology can well overcome the shortcomings of speech recognition in noisy environments and effectively improve the reliability of a semantic analysis system. Lip language recognition has broad application prospects and can be used for language interaction recognition tasks in various noisy environments, such as language recognition in noisy places like hospitals and shopping malls. In addition, lip language recognition can be applied to assist the semantic understanding of hearing- and speech-impaired people, thereby helping them establish speaking ability.
At present, the accuracy of lip language recognition technology is far from meeting the requirements of practical application. Lip vocalization is formed by the mutual coupling of speaker identity and speech content in the time-space domain. Different speakers differ greatly in lip appearance, speaking style and the like, and even the same speaker differs in speaking style, speaking speed and the like at different times and in different scenes. Therefore, different identity information causes serious interference to the semantic content during recognition. This high coupling between speaker identity information and semantic content severely restricts improvement of the accuracy of lip language recognition systems.
Disclosure of Invention
The invention aims to provide a speaker-independent lip language identification method and system, which can solve the problem that the recognition result is disturbed by speaker identity information and improve the accuracy of lip language identification.
In order to achieve the purpose, the invention provides the following scheme:
A speaker-independent lip language identification method, comprising:
acquiring training lip language picture sequences of a plurality of speaker samples;
inputting a plurality of training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity characteristic sequence, a semantic characteristic sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
calculating difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
calculating supervision loss according to the predicted text sequence and the real text sequence;
taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, and performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimal lip language recognition model;
acquiring a lip language picture sequence to be identified;
and inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
Preferably, the 2D dense convolutional neural network and the 3D dense convolutional neural network are each composed of a dense convolutional neural network framework; the dense convolutional neural network framework comprises a dense connection transition layer, a pooling layer and a full connection layer which are connected in sequence; the dense connection transition layer comprises a plurality of dense connection transition units; each dense connection transition unit comprises a dense connection module and a transition module;
the lip language prediction network is a seq2seq network based on a self-attention mechanism; the lip language prediction network comprises an input module, an Encoder module, a Decoder module and a classification module;
the input module is respectively connected with the Encoder module and the Decoder module, the input module is used for acquiring a semantic feature sequence and a word vector sequence corresponding to the semantic feature sequence and embedding semantic vectors at different moments in the semantic feature sequence and word vectors in the word vector sequence into time position information, the Decoder module is respectively connected with the Encoder module and the classification module, and the Encoder module is used for performing deep feature mining on the semantic feature sequence embedded with the time position information to obtain a first feature sequence; the Decoder module is used for obtaining a second characteristic sequence according to the attention of the first characteristic sequence and the attention of a word vector sequence embedded with time position information, and the classification module is used for judging and obtaining a prediction text sequence according to the second characteristic sequence.
Preferably, the calculation formula of the contrast loss is as follows:

L_c = ∑_{i,j} [ y · ‖f(x_t^i) − f(x_{t'}^j)‖² + (1 − y) · max(0, margin − ‖f(x_t^i) − f(x_{t'}^j)‖)² ]

wherein L_c is the contrast loss; N represents the number of speaker samples from which the pairs (i, j) are drawn; x_t^i denotes the t-th frame image of sample i; x_{t'}^j denotes the t'-th frame image of sample j; f(x_t^i) and f(x_{t'}^j) denote their identity features; y indicates whether different groups of samples match: y is 1 when the identities of the two groups of samples are the same, otherwise y is 0; margin is a set threshold.
Preferably, the calculation formula of the difference loss is as follows:

L_d = ∑_{i=1}^{N} ∑_{j=1}^{T} ∑_{k=1}^{T} ‖f(x_j^i) − f(x_k^i)‖²

wherein L_d is the difference loss; N represents the number of speaker samples; x_j^i denotes the j-th frame image of sample i; x_k^i denotes the k-th frame image of sample i; f(x_j^i) and f(x_k^i) denote their identity features; T represents the number of frames in the speaker sample.
Preferably, the calculation formula of the Gaussian distribution difference loss is as follows:

μ_P = (1/(N_P·T)) ∑_{i,t} s(x_t^{i,P}),   Σ_P = (1/(N_P·T)) ∑_{i,t} (s(x_t^{i,P}) − μ_P)(s(x_t^{i,P}) − μ_P)^T

L_dd = ½ [ tr(Σ_Q^{−1} Σ_P) + (μ_Q − μ_P)^T Σ_Q^{−1} (μ_Q − μ_P) − z + ln( det(Σ_Q) / det(Σ_P) ) ]

wherein L_dd represents the Gaussian distribution difference loss; x_t^{i,P} denotes the t-th frame image of the i-th sample in group P of speaker samples, and s(x_t^{i,P}) its semantic feature; Σ_P and Σ_Q are the covariance matrices of the semantic features of the group-P and group-Q speaker samples; μ_P and μ_Q are the mean vectors of the semantic features of the group-P and group-Q speaker samples (computed analogously for group Q); det denotes the value of the matrix determinant; z represents the dimension of the semantic coding feature; T represents the number of frames in the speaker sample.
Preferably, the correlation loss is calculated by the formula:

L_R = ∑_{i=1}^{N} ∑_{t=1}^{T} ( ⟨f(x_t^i), s(x_t^i)⟩ / (‖f(x_t^i)‖ · ‖s(x_t^i)‖) )²

wherein L_R represents the correlation loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i denotes the t-th frame image of sample i; f(x_t^i) denotes its identity feature; s(x_t^i) denotes its semantic feature.
Preferably, the calculation formula of the reconstruction error loss is:

L_con = ∑_{i=1}^{N} ∑_{t=1}^{T} ‖ x_t^i − D([f(x_t^i); s(x_t^i)]) ‖²

wherein L_con represents the reconstruction error loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i denotes the t-th frame image of sample i; f(x_t^i) denotes its identity feature; [f(x_t^i); s(x_t^i)] denotes the concatenation of the identity feature vector and the semantic feature vector; D(·) denotes the coupling reconstruction network.
Preferably, the formula for calculating the supervision loss is as follows:

S_i = [ s(x_1^i), s(x_2^i), …, s(x_T^i) ]

(p̂_{t1}^i, …, p̂_{tC}^i) = E_p(S_i, ŵ_0^i, ŵ_1^i, …, ŵ_{t−1}^i)

L_seq = −(1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∑_{j=1}^{C} y_{tj}^i · log p̂_{tj}^i

wherein L_seq represents the supervision loss; N represents the number of speaker samples; T represents the number of frames in the speaker sample; C represents the number of text categories; y_{tj}^i is the true probability that the text class of the t-th frame of sample i is j, and p̂_{tj}^i is the predicted probability that the text category of the t-th frame of speaker sample i is j; S_i is the encoding matrix of the semantic features; E_p represents the lip language prediction network based on the self-attention mechanism; x_1^i, x_2^i, …, x_T^i are the 1st, 2nd, …, T-th frame images of the i-th sample, and s(x_1^i), s(x_2^i), …, s(x_T^i) their semantic features; ŵ_0^i, …, ŵ_{t−1}^i are the prediction outputs of items 0 to t−1, so the lip language prediction output of the t-th item is judged from the semantic features of all frames and the lip language prediction output contents of the 0th item to the (t−1)-th item.
Preferably, the iterative optimization of the identity and semantic deep coupling model and the lip language prediction network with the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model includes:
taking the weighted loss as an optimization function, and utilizing an Adam optimizer to perform iterative learning on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimized identity and semantic deep coupling model and a lip language prediction network;
wherein the optimization function is L(θ) = L_seq + α1·L_c + α2·L_d + α3·L_dd + α4·L_R + α5·L_con, where L(θ) is the weighted loss; L_seq is the supervision loss; L_c is the contrast loss; L_d is the difference loss; L_dd represents the Gaussian distribution difference loss; L_R represents the correlation loss; L_con represents the reconstruction error loss; α1 represents the weight of the contrast loss, α2 the weight of the difference loss, α3 the weight of the Gaussian distribution difference loss, α4 the weight of the correlation loss, and α5 the weight of the reconstruction error loss.
A system for speaker independent lip language recognition, comprising:
the first acquisition module is used for acquiring training lip language picture sequences of a plurality of speaker samples;
the feature output module is used for inputting the training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
the first calculation module is used for calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
the second calculation module is used for calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
the third calculation module is used for calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
the fourth calculation module is used for calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
the fifth calculation module is used for calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
the text output module is used for inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
the sixth calculation module is used for calculating supervision loss according to the predicted text sequence and the real text sequence;
the training module is used for performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model;
the second acquisition module is used for acquiring a lip language picture sequence to be identified;
and the recognition module is used for inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the lip language identification method and system for the speaker independence adopt two groups of independent networks, namely a 2D dense convolution neural network and a 3D dense convolution neural network, to respectively encode the identity information and the semantic information of a lip language picture sequence to obtain an identity characteristic sequence and a semantic characteristic sequence. The invention restricts the identity coding process by the identity comparison loss of different samples and the identity difference loss of different frames of the same sample, and solves the problem of influence on the identification result due to the interference of the identity information of the speaker. According to the method, the identity and semantic deep coupling model and the lip language prediction network are iteratively optimized by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the reconstruction error loss and the supervision loss as optimization targets, so that the problem of overfitting of a learned feature space is avoided. Semantic features are effectively prevented from being mixed into identity information, and therefore the recognition accuracy of the lip language recognition model under the speaker-independent condition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a method for speaker independent lip language identification in accordance with the present invention;
FIG. 2 is a schematic block diagram of an identification method in an embodiment of the invention;
fig. 3 is a diagram of an identity and semantic feature coding network framework in an embodiment of the present invention, where fig. 3(a) is a diagram of a dense convolutional neural network structure, fig. 3(b) is a diagram of a 2D convolutional structure in an identity coding network, and fig. 3(c) is a diagram of a 3D convolutional structure in a semantic coding network;
FIG. 4 is a diagram of an identity and semantic coupling reconstruction network in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a structure of a lip language prediction network based on a self-attention mechanism according to an embodiment of the present invention;
FIG. 6 is a block diagram of a lip language identification system for speaker independence in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a speaker-independent lip language identification method and system, which can solve the problem that the recognition result is disturbed by speaker identity information and improve the accuracy of lip language identification.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a method for recognizing speaker-independent lip language according to the present invention, and as shown in fig. 1, the method for recognizing speaker-independent lip language according to the present embodiment includes:
step 100: and acquiring training lip language picture sequences of a plurality of speaker samples.
Step 200: inputting a plurality of training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity characteristic sequence, a semantic characteristic sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain the reconstructed image sequence.
Step 300: and calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence.
Step 301: and calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence.
Step 302: and calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method.
Step 303: and calculating the correlation loss according to the identity characteristic sequence and the semantic characteristic sequence.
Step 304: and calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence.
Step 400: and inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence.
Step 500: and calculating the supervision loss according to the predicted text sequence and the real text sequence.
Step 600: and carrying out iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model.
Step 700: and acquiring a lip language picture sequence to be identified.
Step 800: and inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
Preferably, the 2D dense convolutional neural network and the 3D dense convolutional neural network are each composed of a dense convolutional neural network framework; the dense convolutional neural network framework comprises a dense connection transition layer, a pooling layer and a full connection layer which are connected in sequence; the dense connection transition layer comprises a plurality of dense connection transition units; each dense connection transition unit comprises a dense connection module and a transition module.
Fig. 3 is a frame diagram of the identity and semantic feature coding network in an embodiment of the present invention, and fig. 3(a) is a structural diagram of the dense convolutional neural network, which is composed of dense connection modules, transition modules, pooling layers and full connection layers, as shown in fig. 3(a). The dense connection module connects the output of the current layer to the input of each subsequent layer, unlike the traditional neural network, which has no cross-layer connections. If the current network has L layers, the traditional network has L connections, while the dense convolution mode has L(L-1)/2 connection modes. This dense connection mode realizes feature multiplexing, effectively reduces the number of channels of each layer, and reduces the number of network parameters to a certain extent. In addition, the large number of cross-layer connections can effectively relieve the gradient vanishing problem of deep neural networks as the depth increases. Suppose the output of the l-th layer is x_l; then the input and output of the densely connected l-th layer can be expressed as:
x_l = H_l([x_0, x_1, …, x_{l−1}])
where H_l(·) denotes the l-th layer convolution module. Fig. 3(b) shows the 2D convolution structure in the identity coding network and fig. 3(c) shows the 3D convolution structure in the semantic coding network; as shown in figs. 3(b) and 3(c), a 2D or 3D convolution structure is adopted according to the identity or semantic coding task, and the module is mainly composed of batch normalization, ReLU, 1×1 convolution and 3×3 convolution. Identity feature coding uses 2D convolution to extract the structural features of each static lip frame, while semantic feature coding uses 3D convolution operations to extract spatio-temporal features from several consecutive frames. The input of H_l(·) is the channel-wise combination of the outputs of layers 0 to l−1, which requires the feature maps output by each layer to have a uniform scale; within each dense connection module, the dimensions of the feature maps therefore remain unchanged. However, an essential element in convolutional neural networks is reducing the scale of the feature map through pooling operations, thereby capturing a larger receptive field. The dense convolutional neural network therefore introduces the transition modules shown in figs. 3(b) and 3(c) between different dense connection modules; each transition module consists of batch normalization, ReLU, 1×1 convolution and 2×2 pooling, where the 1×1 convolution compresses the channels and the 2×2 pooling downsamples the feature map, enabling feature capture over a wider range. The final pooling layer of the dense convolutional neural network performs global pooling on the output feature map, retaining only channel information, and the features are finally transformed through a fully connected layer.
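By way of illustration only, the dense connection rule x_l = H_l([x_0, x_1, …, x_{l−1}]) described above can be sketched in PyTorch as follows; the layer count, growth rate and the extra bottleneck normalization inside each H_l(·) are assumptions for the example, not the patented configuration:

```python
import torch
import torch.nn as nn

class DenseLayer2D(nn.Module):
    """One H_l(.) unit in the spirit of fig. 3(b): BN -> ReLU -> 1x1 conv -> 3x3 conv."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * growth_rate, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.body(x)

class DenseBlock2D(nn.Module):
    """Each layer receives the channel-wise concatenation of all previous outputs,
    i.e. x_l = H_l([x_0, x_1, ..., x_{l-1}])."""
    def __init__(self, num_layers: int, in_channels: int, growth_rate: int):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer2D(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```

A 3D variant for the semantic coding branch would follow the same pattern with Conv3d and BatchNorm3d modules, and transition modules (1×1 convolution plus 2×2 pooling) would be placed between successive dense blocks.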
Fig. 4 shows the structure of the identity and semantic coupling reconstruction network in the embodiment of the present invention. After the identity and semantic features are extracted by the dense convolutional neural networks, the two features are concatenated and input into the coupling reconstruction network shown in fig. 4. The network expands the feature vector into a feature map through a 4×4 deconvolution operation, then performs high-resolution reconstruction by upsampling, and after each upsampling extracts features through the convolution module shown in fig. 3, which consists of 3×3 convolution, batch normalization and ReLU. The upsampling and convolution operations are repeated until the scale of the output feature map is consistent with that of the lip language picture, completing the reconstruction process.
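A minimal sketch of such a coupling reconstruction path is given below; the concatenated feature dimension, channel widths, number of upsampling stages and the single-channel output are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CoupledReconstructionNet(nn.Module):
    """Expands the concatenated [identity; semantic] vector into a 4x4 feature map
    with a deconvolution, then alternates upsampling and 3x3 conv blocks until the
    lip-image resolution is reached."""
    def __init__(self, feat_dim: int = 512, base_channels: int = 256, num_upsamples: int = 5):
        super().__init__()
        self.expand = nn.ConvTranspose2d(feat_dim, base_channels, kernel_size=4)  # 1x1 -> 4x4
        blocks, ch = [], base_channels
        for _ in range(num_upsamples):
            blocks += [
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1),
                nn.BatchNorm2d(ch // 2), nn.ReLU(inplace=True),
            ]
            ch //= 2
        self.blocks = nn.Sequential(*blocks)
        self.to_image = nn.Conv2d(ch, 1, kernel_size=3, padding=1)  # reconstructed lip frame

    def forward(self, identity_feat, semantic_feat):
        z = torch.cat([identity_feat, semantic_feat], dim=1)  # (batch, feat_dim)
        z = z.view(z.size(0), -1, 1, 1)
        return self.to_image(self.blocks(self.expand(z)))
```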
Fig. 5 is a structural diagram of a lip language prediction network based on a self-attention mechanism in an embodiment of the present invention, and as shown in fig. 5, the lip language prediction network is a seq2seq network based on a self-attention mechanism; the lip language prediction network comprises an input module, an Encoder module, a Decoder module and a classification module.
The input module is respectively connected with the Encoder module and the Decoder module, the input module is used for acquiring a semantic feature sequence and a word vector sequence corresponding to the semantic feature sequence and embedding semantic vectors at different moments in the semantic feature sequence and word vectors in the word vector sequence into time position information, the Decoder module is respectively connected with the Encoder module and the classification module, and the Encoder module is used for performing deep feature mining on the semantic feature sequence embedded with the time position information to obtain a first feature sequence; the Decoder module is used for obtaining a second characteristic sequence according to the attention of the first characteristic sequence and the attention of a word vector sequence embedded with time position information, and the classification module is used for judging and obtaining a prediction text sequence according to the second characteristic sequence.
Specifically, in the input module, the lip language picture sequence is first passed through semantic feature coding to output semantic vectors s_1^i, s_2^i, …, s_T^i at different moments, which form the input sequence received by the input part of the lip language prediction network. Unlike RNNs, which process timing signals through a recursive relationship, the lip language prediction network encodes information at different times by superimposing time position information on the input data.
The position embedding information uses sine and cosine position coding: the position codes are generated with sine and cosine functions of different frequencies and then added to the semantic vector of the corresponding position, so the dimension of the position vector must be consistent with the dimension of the semantic vector. The specific calculation formula is as follows:

PE(pos, 2i) = sin( pos / 10000^(2i/d) )

PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

where pos represents the position of the semantic vector in the current sequence, i indexes the i-th pair of dimensions in the semantic vector, and d represents the dimension of the semantic vector.
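A short sketch of this sinusoidal position coding (assuming an even dimension d and PyTorch tensors) could look as follows:

```python
import torch

def sinusoidal_position_encoding(seq_len: int, d: int) -> torch.Tensor:
    """One d-dimensional position code per position (d assumed even)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)              # 2i = 0, 2, 4, ...
    angle = pos / (10000.0 ** (two_i / d))                          # pos / 10000^(2i/d)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(angle)                                  # even dimensions
    pe[:, 1::2] = torch.cos(angle)                                  # odd dimensions
    return pe

# the codes are added to the semantic vectors before they enter the Encoder, e.g.:
# s_tilde = semantic_seq + sinusoidal_position_encoding(T, d)
```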
Optionally, in the Encoder module, the semantic features embedded with time position information are input for deep feature mining. The Encoder module is divided into two parts, a transition layer and an output layer. The transition layer is composed of multi-head attention and layer normalization, and its input and output relationship can be expressed as:

h^i = LayerNorm( s̃^i + MultiHeadAttention(s̃^i) )

where s̃^i represents the semantic feature vector sequence of the i-th sample after time position information is embedded, MultiHeadAttention(·) is the multi-head attention, and LayerNorm(·) is the layer normalization.
Multi-headed attention allows neural networks to focus more on relevant parts of the input and less on irrelevant parts when performing predictive tasks. An attention function can be described as mapping a Query to an output with a set of Key-Value pairs (Key-Value), where Query, Key, Value, and output are vectors. The output may be calculated by a weighted sum of the values, where the weight assigned to each value may be calculated by a fitness function of Query and the corresponding Key. The method comprises the following specific steps:
MultiHeadAttention(s̃^i) = MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

where Q is the query vector sequence, K is the key vector sequence, V is the value vector sequence, and Q = K = V = s̃^i; Concat(·) is matrix concatenation and W^O is the output transformation matrix.

Single-head attention is calculated by the following formula:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) = softmax( (Q W_i^Q)(K W_i^K)^T / √d_k ) · V W_i^V

where W_i^Q is the i-th head transformation matrix of the query vector sequence, W_i^K is the i-th head transformation matrix of the key vector sequence, W_i^V is the i-th head transformation matrix of the value vector sequence, d_k is the per-head feature dimension, and h is the number of attention heads.
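For illustration, the single-head computation above, and a multi-head self-attention call with Q = K = V = s̃^i, might be sketched as follows; the embedding size, head count and batch/sequence shapes are assumed values:

```python
import math
import torch

def single_head_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head; Q, K, V: (batch, seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(torch.softmax(scores, dim=-1), V)

# multi-head self-attention over the embedded semantic sequence with Q = K = V:
mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
s_tilde = torch.randn(2, 75, 512)                 # (batch, T, d) dummy semantic sequence
out, _ = mha(query=s_tilde, key=s_tilde, value=s_tilde)
```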
Layer normalization is a common method for solving the problem of Internal Covariate Shift, can pull data distribution to an unsaturated region of an activation function, has the characteristic of weight/data expansion invariance, and has the effects of relieving gradient disappearance/explosion, accelerating training and regularization. Layer normalization is specifically implemented as follows:
μ = (1/D) ∑_{d=1}^{D} z_d,   σ² = (1/D) ∑_{d=1}^{D} (z_d − μ)²

LayerNorm(z) = α ⊙ (z − μ) / √(σ² + ε) + β
where z represents an input D-dimensional feature vector, and α and β are transform coefficients.
Optionally, the Encoder output layer is composed of a full connection layer and a layer normalization, and a mapping relationship between input and output is as follows:
s_e^i = LayerNorm( h^i + FC(h^i) )
preferably, the Decoder module is similar in overall structure to the Encoder module, and adds attention between the Encoder output and the Decoder input based on the Encoder, and calculates the output with the Encoder
Figure BDA0002956520610000122
K, V as an attention model calculation in Decoder, input as Decoder
Figure BDA0002956520610000123
The Decoder model output is computed as Q.
Optionally, the Decoder input is the word vector sequence of the language sequence, w^i = (w_1^i, w_2^i, …), where w_j^i represents the word vector at the j-th moment of the i-th lip language sequence; the word vectors of the language sequence are the real text sequence. The word vectors input to the Decoder are first embedded with the same time position information as in the Encoder, obtaining the position-embedded word vector sequence w̃^i. The input and output relationship after the first-layer self-attention is:

d^i = LayerNorm( w̃^i + MultiHeadAttention(w̃^i) )

where MultiHeadAttention(·) and LayerNorm(·) are calculated in the same manner as in the Encoder module.
Unlike the Encoder module, after obtaining d^i, the Decoder uses it as the query value Q of a second multi-head attention and uses the Encoder output s_e^i as the key K and value V of that multi-head attention, thereby calculating the attention between the word vector sequence and the semantic feature sequence:

a^i = LayerNorm( d^i + MultiHeadAttention(d^i, s_e^i, s_e^i) )

The output attention vector is then subjected to full connection and layer normalization to obtain the final output of the Decoder:

o^i = LayerNorm( a^i + FC(a^i) )

The output module judges the output content of the lip language from the Decoder output o^i through a full connection layer and a softmax layer.
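A rough sketch of one such Decoder step, assuming equal model dimensions for the word vectors and the Encoder output and using residual connections as in the formulas above, could be:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Self-attention on the word-vector sequence, cross-attention against the
    Encoder output (used as K and V), then a fully connected layer with LayerNorm."""
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.fc = nn.Linear(d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, word_vecs, encoder_out, causal_mask=None):
        # first-layer self-attention over the (position-embedded) word vectors
        sa, _ = self.self_attn(word_vecs, word_vecs, word_vecs, attn_mask=causal_mask)
        q = self.norm1(word_vecs + sa)
        # cross-attention: Q from the Decoder, K and V from the Encoder output
        ca, _ = self.cross_attn(q, encoder_out, encoder_out)
        h = self.norm2(q + ca)
        return self.norm3(h + self.fc(h))
```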
Preferably, the calculation formula of the contrast loss is as follows:

L_c = ∑_{i,j} [ y · ‖f(x_t^i) − f(x_{t'}^j)‖² + (1 − y) · max(0, margin − ‖f(x_t^i) − f(x_{t'}^j)‖)² ]

wherein L_c is the contrast loss; N represents the number of speaker samples from which the pairs (i, j) are drawn; x_t^i denotes the t-th frame image of sample i; x_{t'}^j denotes the t'-th frame image of sample j; f(x_t^i) and f(x_{t'}^j) denote their identity features; y indicates whether different groups of samples match: y is 1 when the identities of the two groups of samples are the same, otherwise y is 0; margin is a set threshold.
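One possible implementation sketch of a contrastive constraint of this form (pair construction, reduction and function names are illustrative assumptions rather than the patented code):

```python
import torch
import torch.nn.functional as F

def identity_contrast_loss(f_a, f_b, same_identity, margin: float = 1.0):
    """Contrastive loss on identity features of sampled frame pairs.
    f_a, f_b: (num_pairs, dim) identity features; same_identity: (num_pairs,)
    with y = 1 for frames of the same speaker and y = 0 otherwise."""
    dist = F.pairwise_distance(f_a, f_b)                                    # Euclidean distance
    pos = same_identity * dist.pow(2)                                       # pull matching pairs together
    neg = (1.0 - same_identity) * torch.clamp(margin - dist, min=0).pow(2)  # push others beyond the margin
    return (pos + neg).mean()
```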
Preferably, the calculation formula of the difference loss is as follows:

L_d = ∑_{i=1}^{N} ∑_{j=1}^{T} ∑_{k=1}^{T} ‖f(x_j^i) − f(x_k^i)‖²

wherein L_d is the difference loss; N represents the number of speaker samples; x_j^i denotes the j-th frame image of sample i; x_k^i denotes the k-th frame image of sample i; f(x_j^i) and f(x_k^i) denote their identity features; T represents the number of frames in the speaker sample.
Preferably, the calculation formula of the Gaussian distribution difference loss is as follows:

μ_P = (1/(N_P·T)) ∑_{i,t} s(x_t^{i,P}),   Σ_P = (1/(N_P·T)) ∑_{i,t} (s(x_t^{i,P}) − μ_P)(s(x_t^{i,P}) − μ_P)^T

L_dd = ½ [ tr(Σ_Q^{−1} Σ_P) + (μ_Q − μ_P)^T Σ_Q^{−1} (μ_Q − μ_P) − z + ln( det(Σ_Q) / det(Σ_P) ) ]

wherein L_dd represents the Gaussian distribution difference loss; x_t^{i,P} denotes the t-th frame image of the i-th sample in group P of speaker samples, and s(x_t^{i,P}) its semantic feature; Σ_P and Σ_Q are the covariance matrices of the semantic features of the group-P and group-Q speaker samples; μ_P and μ_Q are the mean vectors of the semantic features of the group-P and group-Q speaker samples (computed analogously for group Q); det denotes the value of the matrix determinant; z represents the dimension of the semantic coding feature; T represents the number of frames in the speaker sample.
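A sketch of this Gaussian distribution difference term as a KL divergence between Gaussians fitted to the two groups of semantic features; the regularization eps and the pooling of samples and frames into one matrix are assumptions:

```python
import torch

def gaussian_difference_loss(sem_p, sem_q, eps: float = 1e-5):
    """KL divergence between Gaussians fitted to two groups of semantic features.
    sem_p, sem_q: (num_vectors, z) semantic features pooled over samples and frames."""
    z = sem_p.size(1)
    mu_p, mu_q = sem_p.mean(dim=0), sem_q.mean(dim=0)
    eye = eps * torch.eye(z, device=sem_p.device)
    cov_p = torch.cov(sem_p.T) + eye                      # covariance of group P
    cov_q = torch.cov(sem_q.T) + eye                      # covariance of group Q
    cov_q_inv = torch.linalg.inv(cov_q)
    diff = (mu_q - mu_p).unsqueeze(1)                     # (z, 1)
    return 0.5 * (torch.trace(cov_q_inv @ cov_p)
                  + (diff.T @ cov_q_inv @ diff).squeeze()
                  - z
                  + torch.logdet(cov_q) - torch.logdet(cov_p))
```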
Preferably, the correlation loss is calculated by the formula:

L_R = ∑_{i=1}^{N} ∑_{t=1}^{T} ( ⟨f(x_t^i), s(x_t^i)⟩ / (‖f(x_t^i)‖ · ‖s(x_t^i)‖) )²

wherein L_R represents the correlation loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i denotes the t-th frame image of sample i; f(x_t^i) denotes its identity feature; s(x_t^i) denotes its semantic feature.
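As one hedged illustration, a correlation penalty of this kind could be implemented as the mean squared cosine similarity between per-frame identity and semantic features (this assumes both features share the same dimension; the exact formulation of the loss may differ):

```python
import torch
import torch.nn.functional as F

def correlation_loss(identity_feats, semantic_feats):
    """identity_feats, semantic_feats: (N, T, dim) per-frame features of equal size.
    Penalizes correlation so that the semantic code carries little identity information."""
    cos = F.cosine_similarity(identity_feats, semantic_feats, dim=-1)  # (N, T)
    return cos.pow(2).mean()
```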
Preferably, the calculation formula of the reconstruction error loss is:

L_con = ∑_{i=1}^{N} ∑_{t=1}^{T} ‖ x_t^i − D([f(x_t^i); s(x_t^i)]) ‖²

wherein L_con represents the reconstruction error loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i denotes the t-th frame image of sample i; f(x_t^i) denotes its identity feature; [f(x_t^i); s(x_t^i)] denotes the concatenation of the identity feature vector and the semantic feature vector; D(·) denotes the coupling reconstruction network.
Preferably, the formula for calculating the supervision loss is as follows:

S_i = [ s(x_1^i), s(x_2^i), …, s(x_T^i) ]

(p̂_{t1}^i, …, p̂_{tC}^i) = E_p(S_i, ŵ_0^i, ŵ_1^i, …, ŵ_{t−1}^i)

L_seq = −(1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∑_{j=1}^{C} y_{tj}^i · log p̂_{tj}^i

wherein L_seq represents the supervision loss; N represents the number of speaker samples; T represents the number of frames in the speaker sample; C represents the number of text categories; y_{tj}^i is the true probability that the text class of the t-th frame of sample i is j, and p̂_{tj}^i is the predicted probability that the text category of the t-th frame of speaker sample i is j; S_i is the encoding matrix of the semantic features; E_p represents the lip language prediction network based on the self-attention mechanism; x_1^i, x_2^i, …, x_T^i are the 1st, 2nd, …, T-th frame images of the i-th sample, and s(x_1^i), s(x_2^i), …, s(x_T^i) their semantic features; ŵ_0^i, …, ŵ_{t−1}^i are the prediction outputs of items 0 to t−1, so the lip language prediction output of the t-th item is judged from the semantic features of all frames and the lip language prediction output contents of the 0th item to the (t−1)-th item.
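A compact sketch of this supervision term as a per-step cross entropy averaged over samples (the logits/targets layout is an assumption):

```python
import torch
import torch.nn.functional as F

def supervision_loss(logits, targets):
    """logits: (N, T, C) outputs of the lip language prediction network;
    targets: (N, T) integer text-class labels. Cross entropy summed over output
    steps, averaged over the N speaker samples."""
    n, t, c = logits.shape
    ce = F.cross_entropy(logits.reshape(n * t, c), targets.reshape(n * t), reduction="sum")
    return ce / n
```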
Preferably, the iterative optimization of the identity and semantic deep coupling model and the lip language prediction network with the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model includes:
taking the weighted loss as an optimization function, and utilizing an Adam optimizer to perform iterative learning on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimized identity and semantic deep coupling model and a lip language prediction network;
wherein the optimization function is L(θ) = L_seq + α1·L_c + α2·L_d + α3·L_dd + α4·L_R + α5·L_con, where L(θ) is the weighted loss; L_seq is the supervision loss; L_c is the contrast loss; L_d is the difference loss; L_dd represents the Gaussian distribution difference loss; L_R represents the correlation loss; L_con represents the reconstruction error loss; α1 represents the weight of the contrast loss, α2 the weight of the difference loss, α3 the weight of the Gaussian distribution difference loss, α4 the weight of the correlation loss, and α5 the weight of the reconstruction error loss.
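For illustration, the weighted optimization target can be composed as below; the weight values are placeholders rather than values specified by the patent:

```python
# placeholder weights for alpha_1 ... alpha_5
alphas = {"c": 1.0, "d": 1.0, "dd": 1.0, "r": 1.0, "con": 1.0}

def total_loss(l_seq, l_c, l_d, l_dd, l_r, l_con, a=alphas):
    # L(theta) = L_seq + a1*L_c + a2*L_d + a3*L_dd + a4*L_R + a5*L_con
    return l_seq + a["c"] * l_c + a["d"] * l_d + a["dd"] * l_dd + a["r"] * l_r + a["con"] * l_con
```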
Specifically, the Adam optimizer combines the advantages of the AdaGrad and RMSProp optimization algorithms, using both the first moment estimate and the second moment estimate of the gradient to calculate the update step. For the optimization problem of the total loss, the Adam optimizer is implemented in the following specific steps:

(1) randomly initialize the parameters θ, the first moment m_0 and the second moment v_0 at time 0;

(2) update the gradient at time t: g_t = ∇_θ L(θ_{t−1});

(3) update the first moment: m_t ← β1·m_{t−1} + (1 − β1)·g_t;

(4) update the second moment: v_t ← β2·v_{t−1} + (1 − β2)·g_t²;

(5) update the unbiased first moment: m̂_t = m_t / (1 − β1^t);

(6) update the unbiased second moment: v̂_t = v_t / (1 − β2^t);

(7) update the parameters: θ_t ← θ_{t−1} − α·m̂_t / (√v̂_t + ε);

repeat (2)-(7) until the loss converges,

where β1 and β2 are the exponential decay rates, β1^t and β2^t their t-th powers, α is the learning rate, g_t² is the element-wise square of the gradient g_t, and ε = 10⁻⁸.
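A minimal sketch of one update following steps (2) to (7) above; in practice torch.optim.Adam performs the same computation internally:

```python
import torch

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; theta, grad, m, v are tensors, t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad                  # first moment
    v = beta2 * v + (1 - beta2) * grad.pow(2)           # second moment
    m_hat = m / (1 - beta1 ** t)                        # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                        # bias-corrected second moment
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)   # parameter update
    return theta, m, v

# the same update is obtained with the built-in optimizer, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```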
Optionally, the method further comprises:
and inputting the lip language picture sequence to be recognized into the 3D dense convolution neural network in the optimal lip language recognition model to obtain a semantic feature sequence to be recognized.
And inputting the semantic feature sequence to be recognized into a lip language prediction network in the optimal lip language recognition model to obtain a predicted text sequence.
Specifically, the semantic coding network E_s and the lip language prediction network E_p are used to extract and recognize the semantic features:

(s_1^i, s_2^i, …, s_T^i) = E_s(x_1^i, x_2^i, …, x_T^i)

ŵ_t^i = E_p(s_1^i, …, s_T^i, ŵ_0^i, …, ŵ_{t−1}^i)

The input lip language picture sequence is semantically encoded into the semantic feature sequence s_1^i, …, s_T^i, and the lip language prediction network predicts the word vector output at time t from the input semantic feature sequence and all word vectors before time t. The input semantic feature sequence passes through the Encoder structure shown in fig. 5, which computes the semantic coding output s_e^i. The Decoder applies self-attention to the word vector ŵ_{t−1}^i output at time t−1, characterizing it as an attention-weighted sum over all word vectors of the first t−1 moments; it then relates the semantic feature code s_e^i output by the Encoder to this representation through the self-attention mechanism, computes the Decoder output, and predicts the word vector ŵ_t^i at time t. For the word vector output at the first moment, the Decoder predicts from the default start word vector ŵ_0^i, and the lip language output word vectors at each moment are predicted recursively, layer by layer.
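An illustrative greedy decoding loop for this recursive prediction; the prediction_net callable, its signature and the start/end token ids are assumptions for the example:

```python
import torch

def greedy_decode(prediction_net, semantic_seq, start_id: int, end_id: int, max_len: int = 50):
    """Recursive word-by-word prediction at inference time.
    prediction_net(semantic_seq, tokens) is assumed to return logits of shape
    (1, len(tokens), C)."""
    tokens = torch.tensor([[start_id]])                        # default start word
    for _ in range(max_len):
        logits = prediction_net(semantic_seq, tokens)          # attends over all frames
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # most probable next word
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == end_id:
            break
    return tokens[:, 1:]                                       # drop the start token
```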
Fig. 6 is a block diagram of a lip language identification system for speaker independence according to the present invention, and as shown in fig. 6, a lip language identification system for speaker independence according to the present invention includes:
the first acquisition module is used for acquiring training lip language picture sequences of a plurality of speaker samples;
the feature output module is used for inputting the training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
the first calculation module is used for calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
the second calculation module is used for calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
the third calculation module is used for calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
the fourth calculation module is used for calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
the fifth calculation module is used for calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
the text output module is used for inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
the sixth calculation module is used for calculating supervision loss according to the predicted text sequence and the real text sequence;
the training module is used for performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model;
the second acquisition module is used for acquiring a lip language picture sequence to be identified;
and the recognition module is used for inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
The invention has the following beneficial effects:
Firstly, two groups of independent networks are adopted to separately encode the identity information and the semantic information of the lip language picture sequence; the identity coding process is constrained by the identity contrast loss between different samples and the identity difference loss between different frames of the same sample, and the semantic coding process is constrained by the seq2seq supervision loss. Compared with current lip language recognition methods, which rely on a single semantic supervision constraint, this effectively prevents identity information from being mixed into the semantic features and improves the recognition accuracy of the lip language recognition model under the speaker-independent condition.
Secondly, on the basis of the coupling model, the invention further introduces the related loss constraint of the identity characteristic and the semantic characteristic to ensure the minimum correlation of the identity information and the semantic information; in addition, the invention further assumes that the semantic features obey single Gaussian distribution, takes the Gaussian distribution differences of different groups of samples as loss constraints, ensures the minimum semantic feature distribution difference extracted by different speakers, and limits the independence of semantic space, thereby improving the robust performance of the lip language recognition system on speaker identity change.
Thirdly, the invention adopts a seq2seq model based on the self-attention mechanism in the semantic prediction process. Compared with the recurrent neural networks such as LSTM and GRU adopted by current lip language identification methods, it achieves long-term memory and correlation of the time-sequence features, thereby improving the precision of the lip language prediction process. In addition, unlike the traditional recurrent neural network, which is trained in a recursive manner, the self-attention mechanism allows the model to be trained in parallel, so the learning time of the lip language recognition network is greatly shortened.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A speaker-independent lip language identification method, characterized by comprising the following steps:
acquiring training lip language picture sequences of a plurality of speaker samples;
inputting a plurality of training lip language picture sequences into an identity and semantic depth coupling model to obtain an identity characteristic sequence, a semantic characteristic sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain a reconstructed image sequence;
calculating the comparison loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
calculating difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
calculating supervision loss according to the predicted text sequence and the real text sequence;
taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, and performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimal lip language recognition model;
acquiring a lip language picture sequence to be identified;
and inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
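For readers who want a concrete picture of the data flow described in claim 1, the following is a deliberately simplified PyTorch-style sketch of a coupling model with a per-frame 2D encoder (identity), a clip-level 3D encoder (semantics) and a deconvolution decoder that reconstructs each frame from the concatenated codes. The layer sizes, channel counts and the 32x32 frame resolution are illustrative assumptions and do not reproduce the dense convolutional architectures specified in claim 2.

```python
import torch
import torch.nn as nn

class CouplingModelSketch(nn.Module):
    """Toy stand-in for the identity/semantic deep coupling model:
    a 2D encoder per frame (identity), a 3D encoder over the clip
    (semantics), and a deconvolution decoder that reconstructs each
    frame from the concatenated identity + semantic codes."""

    def __init__(self, id_dim=64, sem_dim=64):
        super().__init__()
        self.encoder_2d = nn.Sequential(              # identity branch
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, id_dim))
        self.encoder_3d = nn.Sequential(              # semantic branch
            nn.Conv3d(1, 16, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)))
        self.sem_proj = nn.Linear(16, sem_dim)
        self.decoder = nn.Sequential(                 # reconstruction branch
            nn.Linear(id_dim + sem_dim, 16 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (16, 8, 8)),
            nn.ConvTranspose2d(16, 1, 4, stride=4))   # back to 32x32 frames

    def forward(self, clip):                          # clip: (B, T, 1, 32, 32)
        b, t = clip.shape[:2]
        frames = clip.reshape(b * t, *clip.shape[2:])
        identity = self.encoder_2d(frames).view(b, t, -1)        # (B, T, id_dim)
        sem = self.encoder_3d(clip.transpose(1, 2))              # (B, 16, T, 1, 1)
        sem = self.sem_proj(sem.squeeze(-1).squeeze(-1).transpose(1, 2))  # (B, T, sem_dim)
        recon = self.decoder(torch.cat([identity, sem], dim=-1).reshape(b * t, -1))
        return identity, sem, recon.view(b, t, 1, 32, 32)

model = CouplingModelSketch()
ids, sems, recon = model(torch.randn(2, 5, 1, 32, 32))
print(ids.shape, sems.shape, recon.shape)
```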
2. The method for speaker-independent lip language identification according to claim 1, wherein the 2D dense convolutional neural network and the 3D dense convolutional neural network are each composed of a dense convolutional neural network framework; the dense convolutional neural network framework comprises a dense connection transition layer, a pooling layer and a full connection layer which are connected in sequence; the dense connection transition layer comprises a plurality of dense connection transition units; each dense connection transition unit comprises a dense connection module and a transition module;
the lip language prediction network is a seq2seq network based on a self-attention mechanism; the lip language prediction network comprises an input module, an Encoder module, a Decoder module and a classification module;
the input module is respectively connected with the Encoder module and the Decoder module, and is used for acquiring the semantic feature sequence and a word vector sequence corresponding to the semantic feature sequence, and for embedding time position information into the semantic vectors at different moments in the semantic feature sequence and into the word vectors in the word vector sequence; the Decoder module is respectively connected with the Encoder module and the classification module; the Encoder module is used for performing deep feature mining on the semantic feature sequence embedded with the time position information to obtain a first feature sequence; the Decoder module is used for obtaining a second feature sequence according to the attention between the first feature sequence and the word vector sequence embedded with the time position information; and the classification module is used for obtaining the predicted text sequence by classification according to the second feature sequence.
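A minimal sketch of such a self-attention seq2seq predictor is given below, using PyTorch's stock Transformer layers as stand-ins for the Encoder and Decoder modules; the model dimension, number of layers and vocabulary size are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class LipSeq2SeqSketch(nn.Module):
    """Illustrative self-attention seq2seq predictor: positional embedding,
    Transformer encoder over the semantic features, Transformer decoder over
    embedded target tokens, and a linear classifier over the vocabulary."""

    def __init__(self, sem_dim=64, vocab=40, d_model=128, max_len=200):
        super().__init__()
        self.in_proj = nn.Linear(sem_dim, d_model)             # input module
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)          # time-position info
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 2)
        self.classifier = nn.Linear(d_model, vocab)            # classification module

    def forward(self, sem_seq, tokens):
        # sem_seq: (B, T, sem_dim) semantic features; tokens: (B, L) target ids.
        t, l = sem_seq.size(1), tokens.size(1)
        src = self.in_proj(sem_seq) + self.pos_emb(torch.arange(t))   # embed positions
        tgt = self.tok_emb(tokens) + self.pos_emb(torch.arange(l))
        memory = self.encoder(src)                                    # first feature sequence
        causal = nn.Transformer.generate_square_subsequent_mask(l)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)           # second feature sequence
        return self.classifier(hidden)                                # (B, L, vocab) logits

model = LipSeq2SeqSketch()
logits = model(torch.randn(2, 75, 64), torch.randint(0, 40, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 40])
```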
3. The method of claim 1, wherein the contrast loss is calculated by a formula of the form:
L_c = \sum_{i,j} \left[ \, y \, \left\| E_I(x_t^i) - E_I(x_{t'}^j) \right\|_2^2 + (1-y) \, \max\!\left( margin - \left\| E_I(x_t^i) - E_I(x_{t'}^j) \right\|_2 ,\, 0 \right)^2 \right]
wherein L_c is the contrast loss; N represents the number of speaker samples; x_t^i represents the t-th frame image of sample i; x_{t'}^j represents the t'-th frame image of sample j; E_I(x_t^i) and E_I(x_{t'}^j) represent their identity features; y indicates whether different groups of samples match, with y = 1 when the identities of the two groups of samples are the same and y = 0 otherwise; and margin is a set threshold.
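The following sketch shows one common realisation of such a pairwise contrastive constraint on identity features; the averaging over pairs and the exact weighting are assumptions for the example, not the patent's reference formula.

```python
import torch
import torch.nn.functional as F

def contrast_loss(id_a, id_b, y, margin=1.0):
    """Pairwise contrastive loss on identity features (one common form of the
    constraint described in claim 3): matched pairs (y = 1) are pulled
    together, mismatched pairs (y = 0) are pushed at least `margin` apart.

    id_a, id_b: (P, D) identity features of two groups of frames.
    y:          (P,) 1 when the two frames come from the same speaker.
    """
    dist = F.pairwise_distance(id_a, id_b)                      # Euclidean distance
    pull = y * dist.pow(2)                                      # same identity: shrink distance
    push = (1 - y) * torch.clamp(margin - dist, min=0).pow(2)   # different identity: enforce margin
    return (pull + push).mean()

loss = contrast_loss(torch.randn(8, 64), torch.randn(8, 64),
                     torch.randint(0, 2, (8,)).float())
print(loss.item())
```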
4. The method of claim 1, wherein the difference loss is calculated by a formula of the form:
L_d = \frac{1}{N T^2} \sum_{i=1}^{N} \sum_{j=1}^{T} \sum_{k=1}^{T} \left\| E_I(x_j^i) - E_I(x_k^i) \right\|_2^2
wherein L_d is the difference loss; N represents the number of speaker samples; x_j^i represents the j-th frame image of sample i; x_k^i represents the k-th frame image of sample i; E_I(x_j^i) and E_I(x_k^i) represent their identity features; and T represents the number of frames in the speaker sample.
5. The method of claim 1, wherein the Gaussian distribution difference loss is calculated by a formula of the form:
L_{dd} = \frac{1}{2} \left[ \ln \frac{\det \Sigma_Q}{\det \Sigma_P} - z + \mathrm{tr}\!\left( \Sigma_Q^{-1} \Sigma_P \right) + \left( \mu_Q - \mu_P \right)^{\top} \Sigma_Q^{-1} \left( \mu_Q - \mu_P \right) \right]
wherein L_dd represents the Gaussian distribution difference loss; x_t^{i,P} represents the t-th frame image of the i-th sample in the P-th group of speaker samples; E_S(x_t^{i,P}) represents the semantic feature of that frame; mu_P and Sigma_P are the mean vector and covariance matrix of the semantic features of the P-th group of speaker samples; mu_Q and Sigma_Q are the mean vector and covariance matrix of the semantic features of the Q-th group of speaker samples; det represents the value of the matrix determinant; z represents the dimension of the semantic coding feature; and T represents the number of frames in the speaker sample.
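Assuming the distribution-difference term is the KL divergence between single Gaussians fitted to the two groups of semantic features, a sketch could look as follows; the regularisation constant and the use of torch.cov are choices made for the example.

```python
import torch

def gaussian_difference_loss(sem_p, sem_q, eps=1e-5):
    """KL divergence between single Gaussians fitted to two groups of
    semantic features, one plausible reading of the loss in claim 5.

    sem_p, sem_q: (M, z) semantic features of groups P and Q."""
    z = sem_p.size(1)
    mu_p, mu_q = sem_p.mean(0), sem_q.mean(0)
    cov_p = torch.cov(sem_p.T) + eps * torch.eye(z)   # regularise for stability
    cov_q = torch.cov(sem_q.T) + eps * torch.eye(z)
    cov_q_inv = torch.linalg.inv(cov_q)
    diff = (mu_q - mu_p).unsqueeze(1)                 # (z, 1)
    kl = 0.5 * (torch.logdet(cov_q) - torch.logdet(cov_p) - z
                + torch.trace(cov_q_inv @ cov_p)
                + (diff.T @ cov_q_inv @ diff).squeeze())
    return kl

print(gaussian_difference_loss(torch.randn(200, 8), torch.randn(200, 8)).item())
```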
6. The method of claim 1, wherein the correlation loss is calculated by a formula of the form:
L_R = \frac{1}{N T} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \frac{ E_I(x_t^i)^{\top} E_S(x_t^i) }{ \left\| E_I(x_t^i) \right\|_2 \left\| E_S(x_t^i) \right\|_2 } \right)^{2}
wherein L_R represents the correlation loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i represents the t-th frame image of sample i; E_I(x_t^i) represents its identity feature; and E_S(x_t^i) represents its semantic feature.
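One plausible reading of this correlation constraint, assuming the identity and semantic codes share a common dimension, is a squared-cosine penalty per frame, sketched below for illustration only.

```python
import torch
import torch.nn.functional as F

def correlation_loss(identity_seq, semantic_seq):
    """Decorrelation penalty between identity and semantic features (one way
    to realise the correlation constraint of claim 6): drive the cosine
    similarity of the two codes of each frame towards zero.

    identity_seq, semantic_seq: (N, T, D) per-frame feature sequences.
    """
    cos = F.cosine_similarity(identity_seq, semantic_seq, dim=-1)  # (N, T)
    return cos.pow(2).mean()

print(correlation_loss(torch.randn(4, 10, 64), torch.randn(4, 10, 64)).item())
```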
7. The method of claim 1, wherein the reconstruction error loss is calculated by a formula of the form:
L_{con} = \frac{1}{N T} \sum_{i=1}^{N} \sum_{t=1}^{T} \left\| x_t^i - D\!\left( \left[ E_I(x_t^i), E_S(x_t^i) \right] \right) \right\|_2^2
wherein L_con represents the reconstruction error loss; T represents the number of frames in the speaker sample; N represents the number of speaker samples; x_t^i represents the t-th frame image of sample i; E_I(x_t^i) represents its identity feature; E_S(x_t^i) represents its semantic feature; [E_I(x_t^i), E_S(x_t^i)] denotes the connection (concatenation) of the identity feature vector and the semantic feature vector; and D denotes the deconvolution neural network used for reconstruction.
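A sketch of this reconstruction-error term is given below, with an arbitrary linear stand-in in place of the deconvolution network of claim 1; tensor shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(frames, decoder, identity_seq, semantic_seq):
    """Reconstruction-error term of claim 7: decode the concatenated identity
    and semantic codes of every frame and compare with the original frame.

    frames:       (N, T, C, H, W) input lip images.
    decoder:      maps a (id_dim + sem_dim) vector back to a flattened frame.
    identity_seq: (N, T, id_dim); semantic_seq: (N, T, sem_dim).
    """
    n, t = frames.shape[:2]
    codes = torch.cat([identity_seq, semantic_seq], dim=-1).reshape(n * t, -1)
    recon = decoder(codes).view_as(frames)
    return F.mse_loss(recon, frames)     # mean squared reconstruction error

# Toy usage with a linear "decoder" standing in for the deconvolution network.
dec = torch.nn.Sequential(torch.nn.Linear(128, 1 * 32 * 32))
loss = reconstruction_loss(torch.randn(2, 5, 1, 32, 32), dec,
                           torch.randn(2, 5, 64), torch.randn(2, 5, 64))
print(loss.item())
```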
8. The method of claim 1, wherein the supervision loss is calculated by formulas of the form:
S_i = \left[ E_S(x_1^i), E_S(x_2^i), \ldots, E_S(x_T^i) \right]
q_t^i = E_p\!\left( S_i, \hat{y}_0^i, \hat{y}_1^i, \ldots, \hat{y}_{t-1}^i \right)
L_{seq} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{j=1}^{C} p_{t,j}^{i} \log q_{t,j}^{i}
wherein L_seq represents the supervision loss; N represents the number of speaker samples; T represents the number of frames in the speaker sample; C represents the number of text categories; p_{t,j}^i is the true probability that the text category of the t-th frame of sample i is j; q_{t,j}^i is the predicted probability that the text category of the t-th frame of speaker sample i is j; S_i is the encoding matrix of the semantic features; E_p represents the lip language prediction network based on the self-attention mechanism; x_1^i, x_2^i, ..., x_T^i represent the 1st, 2nd, ..., T-th frame images of the i-th sample; E_S(x_1^i), E_S(x_2^i), ..., E_S(x_T^i) represent their semantic features; and the lip language prediction output at step t is determined from the semantic features of all frames and from the lip language prediction outputs \hat{y}_0^i, ..., \hat{y}_{t-1}^i already produced at steps 0 to t-1.
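In practice such a supervision term is ordinarily implemented as a cross-entropy over the predictor's logits; the sketch below assumes one-hot reference labels, so the true-probability weighting reduces to indexing the target class.

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits, targets):
    """Cross-entropy supervision of the predicted text sequence (claim 8):
    at every step the predictor, conditioned on the semantic features of all
    frames and on the outputs already produced, scores the C text classes,
    and the loss compares those scores with the reference text.

    logits:  (N, T, C) unnormalised class scores from the prediction network.
    targets: (N, T)    reference class indices.
    """
    n, t, c = logits.shape
    return F.cross_entropy(logits.reshape(n * t, c), targets.reshape(n * t))

loss = supervised_loss(torch.randn(2, 12, 40), torch.randint(0, 40, (2, 12)))
print(loss.item())
```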
9. The method according to claim 1, wherein the iterative optimization of the identity and semantic deep coupling model and the lip language prediction network, with the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets, to obtain an optimal lip language recognition model comprises:
taking the weighted loss as an optimization function, and utilizing an Adam optimizer to perform iterative learning on the identity and semantic deep coupling model and the lip language prediction network to obtain an optimized identity and semantic deep coupling model and a lip language prediction network;
wherein the optimization function is L(\theta) = L_{seq} + \alpha_1 L_c + \alpha_2 L_d + \alpha_3 L_{dd} + \alpha_4 L_R + \alpha_5 L_{con}, wherein L(\theta) is the weighted loss; L_seq is the supervision loss; L_c is the contrast loss; L_d is the difference loss; L_dd represents the Gaussian distribution difference loss; L_R represents the correlation loss; L_con represents the reconstruction error loss; \alpha_1 represents the weight of the contrast loss, \alpha_2 represents the weight of the difference loss, \alpha_3 represents the weight of the Gaussian distribution difference loss, \alpha_4 represents the weight of the correlation loss, and \alpha_5 represents the weight of the reconstruction error loss.
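A sketch of the weighted objective and a single Adam update follows; the individual loss values and the alpha weights are placeholders for illustration and are not values taken from the patent.

```python
import torch

# Illustrative weights for the five auxiliary losses of claim 9 (placeholders).
alphas = {"c": 1.0, "d": 1.0, "dd": 0.1, "r": 0.1, "con": 1.0}

def total_loss(l_seq, l_c, l_d, l_dd, l_r, l_con):
    """L(theta) = L_seq + a1*Lc + a2*Ld + a3*Ldd + a4*LR + a5*Lcon."""
    return (l_seq + alphas["c"] * l_c + alphas["d"] * l_d
            + alphas["dd"] * l_dd + alphas["r"] * l_r + alphas["con"] * l_con)

# A single Adam step on a toy parameter, standing in for the joint optimisation
# of the coupling model and the prediction network.
param = torch.nn.Parameter(torch.randn(4))
optimizer = torch.optim.Adam([param], lr=1e-4)
losses = [param.pow(2).mean() for _ in range(6)]   # placeholder loss terms
loss = total_loss(*losses)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```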
10. A system for speaker independent lip recognition, comprising:
the first acquisition module is used for acquiring training lip language picture sequences of a plurality of speaker samples;
the feature output module is used for inputting the training lip language picture sequences into an identity and semantic deep coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence; the identity and semantic deep coupling model comprises: a 2D dense convolutional neural network, a 3D dense convolutional neural network, and a deconvolution neural network; the 2D dense convolutional neural network is used for coding the identity characteristics of the training lip language picture sequence to obtain the identity characteristic sequence; the 3D dense convolutional neural network is used for coding the semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolution neural network is used for reconstructing and coupling the identity characteristic sequence and the semantic characteristic sequence to obtain the reconstructed image sequence;
the first calculation module is used for calculating a contrast loss according to the identity characteristics of different speaker samples in the identity characteristic sequence;
the second calculation module is used for calculating the difference loss according to the identity characteristics of different frames of the same speaker sample in the identity characteristic sequence;
the third calculation module is used for calculating the Gaussian distribution difference loss of the semantic feature sequence based on a Gaussian distribution method;
the fourth calculation module is used for calculating correlation loss according to the identity characteristic sequence and the semantic characteristic sequence;
the fifth calculation module is used for calculating reconstruction error loss according to the training lip language picture sequence and the reconstruction image sequence;
the text output module is used for inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence;
the sixth calculation module is used for calculating supervision loss according to the predicted text sequence and the real text sequence;
the training module is used for performing iterative optimization on the identity and semantic deep coupling model and the lip language prediction network by taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization targets to obtain an optimal lip language recognition model;
the second acquisition module is used for acquiring a lip language picture sequence to be identified;
and the recognition module is used for inputting the lip language picture sequence to be recognized into an optimal lip language recognition model to obtain a recognition text.
CN202110226432.4A 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence Active CN112949481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110226432.4A CN112949481B (en) 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence

Publications (2)

Publication Number Publication Date
CN112949481A true CN112949481A (en) 2021-06-11
CN112949481B CN112949481B (en) 2023-09-22

Family

ID=76246958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110226432.4A Active CN112949481B (en) 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence

Country Status (1)

Country Link
CN (1) CN112949481B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339806A (en) * 2018-12-19 2020-06-26 马上消费金融股份有限公司 Training method of lip language recognition model, living body recognition method and device
WO2020252922A1 (en) * 2019-06-21 2020-12-24 平安科技(深圳)有限公司 Deep learning-based lip reading method and apparatus, electronic device, and medium
CN112330713A (en) * 2020-11-26 2021-02-05 南京工程学院 Method for improving speech comprehension degree of severe hearing impaired patient based on lip language recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
马宁; 田国栋; 周曦: "A lip language recognition method based on long short-term memory", Journal of University of Chinese Academy of Sciences, No. 01 *
马惠珠; 宋朝晖; 季飞; 侯嘉; 熊小芸: "Research directions and keywords of computer-aided project acceptance: the 2012 acceptance situation and points for attention in 2013", Journal of Electronics & Information Technology, No. 01 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313056A (en) * 2021-06-16 2021-08-27 中国科学技术大学 Compact 3D convolution-based lip language identification method, system, device and storage medium
CN114466179A (en) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 Method and device for measuring synchronism of voice and image
CN114092496A (en) * 2021-11-30 2022-02-25 西安邮电大学 Lip segmentation method and system based on spatial weighting
CN116959060A (en) * 2023-04-20 2023-10-27 湘潭大学 Lip language identification method for patient with language disorder in hospital environment

Also Published As

Publication number Publication date
CN112949481B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN112949481B (en) Lip language identification method and system for speaker independence
CN112991354B (en) High-resolution remote sensing image semantic segmentation method based on deep learning
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN109934261B (en) Knowledge-driven parameter propagation model and few-sample learning method thereof
CN109919204B (en) Noise image-oriented deep learning clustering method
CN109949824B (en) City sound event classification method based on N-DenseNet and high-dimensional mfcc characteristics
CN112116593B (en) Domain self-adaptive semantic segmentation method based on base index
CN111340046A (en) Visual saliency detection method based on feature pyramid network and channel attention
CN110399850A (en) A kind of continuous sign language recognition method based on deep neural network
CN111461025B (en) Signal identification method for self-evolving zero-sample learning
CN115952407B (en) Multipath signal identification method considering satellite time sequence and airspace interactivity
CN107491729B (en) Handwritten digit recognition method based on cosine similarity activated convolutional neural network
CN114220154A (en) Micro-expression feature extraction and identification method based on deep learning
CN111259785B (en) Lip language identification method based on time offset residual error network
CN114898773B (en) Synthetic voice detection method based on deep self-attention neural network classifier
CN117809181A (en) High-resolution remote sensing image water body extraction network model and method
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN117033657A (en) Information retrieval method and device
CN114783418A (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN116935403A (en) End-to-end character recognition method based on dynamic sampling
CN116825120A (en) Gas leakage acoustic signal noise reduction method and system based on hourglass model
CN113887504B (en) Strong-generalization remote sensing image target identification method
CN114004295B (en) Small sample image data expansion method based on countermeasure enhancement
CN114092716A (en) Target detection method, system, computer equipment and storage medium thereof based on U2net
CN114529939A (en) Pedestrian identification method based on millimeter wave radar point cloud clustering and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant