CN112949481A - Speaker-independent lip reading recognition method and system - Google Patents

Speaker-independent lip reading recognition method and system

Info

Publication number
CN112949481A
Authority
CN
China
Prior art keywords
sequence
loss
identity
semantic
lip language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110226432.4A
Other languages
Chinese (zh)
Other versions
CN112949481B (en)
Inventor
路龙宾
宁都
金小敏
滑文强
孙涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202110226432.4A
Publication of CN112949481A
Application granted
Publication of CN112949481B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Multimedia (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a speaker-independent lip reading recognition method and system. The method comprises: acquiring training lip image sequences; feeding the training lip image sequences into an identity-semantics deep coupling model to obtain feature sequences and compute the loss of each network; iteratively optimizing the coupling model and the lip reading prediction network with the weighted losses as the optimization objective, to obtain an optimal recognition model; and feeding the image sequence to be tested into the recognition model to obtain the recognized text. The invention encodes the identity features and the semantic features of a lip image sequence separately, constrains the identity encoding process with a contrastive loss between the identities of different samples and an identity difference loss between different frames of the same sample, constrains the semantic encoding process with a supervision loss, and constrains the learned identity and semantic features with an identity-semantics coupled reconstruction network. This effectively prevents identity information from being mixed into the semantic features and improves the recognition accuracy of the lip reading model under speaker-independent conditions.

Description

A method and system for speaker-independent lip reading recognition

Technical Field

The present invention relates to the technical field of intelligent human-computer interaction, and in particular to a speaker-independent lip reading recognition method and system.

Background

As an emerging mode of human-computer interaction, lip reading starts from visual information and infers the speaker's meaning by analyzing the dynamic changes of the lip region. The technique overcomes the shortcomings of speech recognition in noisy environments and effectively improves the reliability of semantic analysis systems. Lip reading has broad application prospects: it can be used for spoken-interaction recognition in all kinds of noisy settings, such as hospitals and shopping malls. It can also assist deaf people with semantic comprehension and thereby help them develop speaking ability.

At present, the accuracy of lip reading technology falls far short of practical requirements. Lip movement during speech arises from the coupling of speaker identity and speech content in the spatio-temporal domain. Different speakers differ greatly in lip appearance and speaking style, and even the same person speaks differently, and at different speeds, at different times and in different scenarios. During recognition, identity information therefore interferes severely with the semantic content. It is precisely this tight coupling between speaker identity information and semantic content that limits the accuracy of lip reading systems.

Summary of the Invention

The purpose of the present invention is to provide a speaker-independent lip reading recognition method and system that eliminate the interference of speaker identity information with the recognition result and improve recognition accuracy.

To achieve the above purpose, the present invention provides the following scheme:

A speaker-independent lip reading recognition method, comprising:

acquiring training lip image sequences of multiple speaker samples;

inputting the training lip image sequences into an identity-semantics deep coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence, wherein the identity-semantics deep coupling model comprises a 2D dense convolutional neural network, a 3D dense convolutional neural network and a deconvolutional neural network; the 2D dense convolutional neural network encodes the identity features of the training lip image sequence to produce the identity feature sequence; the 3D dense convolutional neural network encodes the semantic features of the training lip image sequence to produce the semantic feature sequence; and the deconvolutional neural network couples and reconstructs the identity feature sequence and the semantic feature sequence to produce the reconstructed image sequence;

calculating a contrastive loss from the identity features of different speaker samples in the identity feature sequence;

calculating a difference loss from the identity features of different frames of the same speaker sample in the identity feature sequence;

calculating a Gaussian-distribution difference loss of the semantic feature sequence based on a Gaussian-distribution method;

calculating a correlation loss from the identity feature sequence and the semantic feature sequence;

calculating a reconstruction error loss from the training lip image sequence and the reconstructed image sequence;

inputting the semantic feature sequence into a lip reading prediction network to obtain a predicted text sequence;

calculating a supervision loss from the predicted text sequence and the ground-truth text sequence;

iteratively optimizing the identity-semantics deep coupling model and the lip reading prediction network with the contrastive loss, the difference loss, the Gaussian-distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as the optimization objective, to obtain an optimal lip reading recognition model;

acquiring a lip image sequence to be recognized;

inputting the lip image sequence to be recognized into the optimal lip reading recognition model to obtain the recognized text.

Preferably, the 2D dense convolutional neural network and the 3D dense convolutional neural network are both built on a dense convolutional network framework comprising, connected in sequence, a densely connected transition layer, a pooling layer and a fully connected layer; the densely connected transition layer comprises several densely connected transition units, each consisting of one dense connection module and one transition module.

The lip reading prediction network is a seq2seq network based on the self-attention mechanism, comprising an input module, an Encoder module, a Decoder module and a classification module.

The input module is connected to the Encoder module and to the Decoder module. It receives the semantic feature sequence and the corresponding word-vector sequence, and embeds temporal position information into the semantic vectors at each time step and into the word vectors. The Decoder module is connected to the Encoder module and to the classification module. The Encoder module performs deep feature mining on the position-embedded semantic feature sequence to obtain a first feature sequence; the Decoder module derives a second feature sequence from attention over the first feature sequence and attention over the position-embedded word-vector sequence; and the classification module predicts the text sequence from the second feature sequence.

Preferably, the contrastive loss is calculated as:

$$L_c=\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\Big[\,y\,\big\|f_I(x_i^t)-f_I(x_j^{t'})\big\|_2^2+(1-y)\max\big(margin-\big\|f_I(x_i^t)-f_I(x_j^{t'})\big\|_2,\ 0\big)^2\Big]$$

where $L_c$ is the contrastive loss; $N$ is the number of speaker samples; $x_i^t$ is the $t$-th frame of the $i$-th sample and $x_j^{t'}$ the $t'$-th frame of the $j$-th sample; $f_I(x_i^t)$ and $f_I(x_j^{t'})$ are their identity features; $y$ indicates whether the two samples match: $y=1$ when the two samples share the same identity, otherwise $y=0$; and $margin$ is a preset threshold.

Preferably, the difference loss is calculated as:

$$L_d=\frac{1}{NT^2}\sum_{i=1}^{N}\sum_{j=1}^{T}\sum_{k=1}^{T}\big\|f_I(x_i^j)-f_I(x_i^k)\big\|_2^2$$

where $L_d$ is the difference loss; $N$ is the number of speaker samples; $x_i^j$ and $x_i^k$ are the $j$-th and $k$-th frames of the $i$-th sample; $f_I(x_i^j)$ and $f_I(x_i^k)$ are their identity features; and $T$ is the number of frames in a speaker sample.
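A hedged PyTorch sketch of this within-sample difference loss, assuming the identity features of one sample's T frames are stacked into a (T, D) tensor (average over the batch externally):

```python
import torch

def frame_difference_loss(id_feats):
    """Identity-difference loss within one sample.

    id_feats: (T, D) identity features of the T frames of a single sample.
    Penalizes any drift of the per-frame identity code across time.
    """
    diff = id_feats.unsqueeze(0) - id_feats.unsqueeze(1)   # (T, T, D) pairwise differences
    return diff.pow(2).sum(dim=-1).mean()                  # mean squared distance over frame pairs
```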

Preferably, the Gaussian-distribution difference loss is calculated as the KL divergence between the Gaussian distributions fitted to the semantic features of two groups of speaker samples P and Q:

$$L_{dd}=\frac{1}{2}\Big[\operatorname{tr}\big(\Sigma_Q^{-1}\Sigma_P\big)+(\mu_Q-\mu_P)^{\top}\Sigma_Q^{-1}(\mu_Q-\mu_P)-z+\ln\frac{\det\Sigma_Q}{\det\Sigma_P}\Big]$$

$$\mu_P=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}f_S\big(x_{P,i}^t\big)$$

$$\Sigma_P=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\big(f_S(x_{P,i}^t)-\mu_P\big)\big(f_S(x_{P,i}^t)-\mu_P\big)^{\top}$$

where $L_{dd}$ is the Gaussian-distribution difference loss; $x_{P,i}^t$ is the $t$-th frame of the $i$-th sample in group P and $f_S(x_{P,i}^t)$ its semantic feature; $\Sigma_P$ and $\Sigma_Q$ are the covariance matrices of the semantic features of the speaker samples in groups P and Q, and $\mu_P$ and $\mu_Q$ the corresponding mean vectors ($\mu_Q$ and $\Sigma_Q$ are computed analogously over group Q); $\det$ denotes the matrix determinant; $z$ is the dimension of the semantic code; and $T$ is the number of frames in a speaker sample.
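The KL divergence between two fitted Gaussians can be sketched in PyTorch as follows (the small regularization term added for numerical invertibility is my assumption, not from the patent):

```python
import torch

def gaussian_kl(feats_p, feats_q, eps=1e-5):
    """KL divergence between Gaussians fitted to two groups of semantic features.

    feats_p, feats_q: (M, z) semantic feature matrices of groups P and Q
    (all frames of all samples in the group stacked together).
    """
    mu_p, mu_q = feats_p.mean(0), feats_q.mean(0)
    z = feats_p.shape[1]
    cov_p = torch.cov(feats_p.T) + eps * torch.eye(z)   # regularize for invertibility
    cov_q = torch.cov(feats_q.T) + eps * torch.eye(z)
    q_inv = torch.linalg.inv(cov_q)
    dmu = (mu_q - mu_p).unsqueeze(1)                    # (z, 1) mean difference
    term_tr = torch.trace(q_inv @ cov_p)
    term_mu = (dmu.T @ q_inv @ dmu).squeeze()
    term_det = torch.logdet(cov_q) - torch.logdet(cov_p)
    return 0.5 * (term_tr + term_mu - z + term_det)
```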

Preferably, the correlation loss is calculated as:

$$L_R=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(\frac{f_I(x_i^t)^{\top}f_S(x_i^t)}{\|f_I(x_i^t)\|_2\,\|f_S(x_i^t)\|_2}\right)^{2}$$

where $L_R$ is the correlation loss; $T$ is the number of frames in a speaker sample; $N$ is the number of speaker samples; $x_i^t$ is the $t$-th frame of the $i$-th sample; $f_I(x_i^t)$ is its identity feature; and $f_S(x_i^t)$ is its semantic feature.
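Since the patent does not spell out the exact correlation measure, the sketch below uses squared cosine similarity between the identity and semantic codes of the same frames, assuming the two encoders share an output dimension:

```python
import torch
import torch.nn.functional as F

def correlation_loss(id_feats, sem_feats):
    """Decorrelation penalty between identity and semantic codes of the same frames.

    id_feats, sem_feats: (B, D) feature batches for the same frames.
    Driving the squared cosine to zero makes the two codes orthogonal.
    """
    cos = F.cosine_similarity(id_feats, sem_feats, dim=-1)
    return cos.pow(2).mean()
```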

Preferably, the reconstruction error loss is calculated as:

$$L_{con}=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\big\|x_i^t-D\big(conj(f_I(x_i^t),f_S(x_i^t))\big)\big\|_2^2$$

where $L_{con}$ is the reconstruction error loss; $T$ is the number of frames in a speaker sample; $N$ is the number of speaker samples; $x_i^t$ is the $t$-th frame of the $i$-th sample; $f_I(x_i^t)$ is its identity feature and $f_S(x_i^t)$ its semantic feature; $conj$ denotes the concatenation of the identity and semantic feature vectors; and $D$ is the deconvolutional reconstruction network.
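A minimal sketch of the reconstruction error, assuming a `decoder` module implementing the deconvolutional reconstruction network described above:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(frames, id_feats, sem_feats, decoder):
    """Reconstruction error of the coupled decoder.

    frames:   (B, C, H, W) original lip frames.
    id_feats: (B, Di) identity codes; sem_feats: (B, Ds) semantic codes.
    decoder:  deconvolutional network mapping the joint code back to an image.
    """
    code = torch.cat([id_feats, sem_feats], dim=-1)   # the `conj` concatenation
    recon = decoder(code)
    return F.mse_loss(recon, frames)
```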

Preferably, the supervision loss is calculated as:

$$L_{seq}=-\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{j=1}^{C}p_{t,j}^{i}\,\ln\hat{p}_{t,j}^{i}$$

$$\hat{p}_t^{i}=E_p\big(S_i,\hat{p}_0^{i},\dots,\hat{p}_{t-1}^{i}\big)$$

$$S_i=\big[f_S(x_i^1),f_S(x_i^2),\dots,f_S(x_i^T)\big]$$

where $L_{seq}$ is the supervision loss; $N$ is the number of speaker samples; $T$ is the number of frames in a speaker sample; $C$ is the number of text categories; $p_{t,j}^{i}$ is the ground-truth probability that the text category of frame $t$ of sample $i$ is $j$, and $\hat{p}_{t,j}^{i}$ is the corresponding predicted probability; $S_i$ is the semantic-feature encoding matrix of sample $i$; $E_p$ is the lip reading prediction network based on the self-attention mechanism; and $x_i^1,x_i^2,\dots,x_i^T$ are the frames of the $i$-th sample, with $f_S(x_i^1),f_S(x_i^2),\dots,f_S(x_i^T)$ their semantic features. The prediction at step $t$ is determined from the semantic features of all frames together with the prediction outputs of steps $0$ through $t-1$.
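The supervision loss is an ordinary sequence cross-entropy; a PyTorch sketch, assuming integer class targets:

```python
import torch
import torch.nn.functional as F

def supervision_loss(logits, targets):
    """Sequence cross-entropy between predicted and ground-truth text.

    logits:  (B, T, C) unnormalized class scores from the prediction network.
    targets: (B, T) integer class indices of the reference transcription.
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```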

Preferably, iteratively optimizing the identity-semantics deep coupling model and the lip reading prediction network with the contrastive loss, the difference loss, the Gaussian-distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as the optimization objective, to obtain the optimal lip reading recognition model, comprises:

taking the weighted loss as the optimization function and iteratively training the identity-semantics deep coupling model and the lip reading prediction network with the Adam optimizer, to obtain the optimized identity-semantics deep coupling model and lip reading prediction network;

wherein the optimization function is $L(\theta)=L_{seq}+\alpha_1 L_c+\alpha_2 L_d+\alpha_3 L_{dd}+\alpha_4 L_R+\alpha_5 L_{con}$, where $L(\theta)$ is the weighted loss; $L_{seq}$ is the supervision loss; $L_c$ is the contrastive loss; $L_d$ is the difference loss; $L_{dd}$ is the Gaussian-distribution difference loss; $L_R$ is the correlation loss; $L_{con}$ is the reconstruction error loss; and $\alpha_1,\alpha_2,\alpha_3,\alpha_4,\alpha_5$ are the respective weights of the contrastive, difference, Gaussian-distribution difference, correlation and reconstruction error losses.
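A sketch of one optimization step on this weighted objective, assuming `optimizer` is an Adam optimizer over the parameters of the coupling model and the prediction network (the weight values shown are placeholders; the patent does not fix them):

```python
import torch

def training_step(optimizer, losses, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """One Adam step on the weighted objective.

    losses:  dict with keys 'seq', 'c', 'd', 'dd', 'r', 'con' holding scalar tensors.
    weights: (a1..a5) weighting the auxiliary losses.
    """
    a1, a2, a3, a4, a5 = weights
    total = (losses['seq'] + a1 * losses['c'] + a2 * losses['d']
             + a3 * losses['dd'] + a4 * losses['r'] + a5 * losses['con'])
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.detach()
```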

A speaker-independent lip reading recognition system, comprising:

a first acquisition module for acquiring training lip image sequences of multiple speaker samples;

a feature output module for inputting the training lip image sequences into the identity-semantics deep coupling model to obtain an identity feature sequence, a semantic feature sequence and a reconstructed image sequence, wherein the model comprises a 2D dense convolutional neural network that encodes the identity features of the training lip image sequence to produce the identity feature sequence, a 3D dense convolutional neural network that encodes its semantic features to produce the semantic feature sequence, and a deconvolutional neural network that couples and reconstructs the identity feature sequence and the semantic feature sequence to produce the reconstructed image sequence;

a first calculation module for calculating the contrastive loss from the identity features of different speaker samples in the identity feature sequence;

a second calculation module for calculating the difference loss from the identity features of different frames of the same speaker sample in the identity feature sequence;

a third calculation module for calculating the Gaussian-distribution difference loss of the semantic feature sequence based on the Gaussian-distribution method;

a fourth calculation module for calculating the correlation loss from the identity feature sequence and the semantic feature sequence;

a fifth calculation module for calculating the reconstruction error loss from the training lip image sequence and the reconstructed image sequence;

a text output module for inputting the semantic feature sequence into the lip reading prediction network to obtain a predicted text sequence;

a sixth calculation module for calculating the supervision loss from the predicted text sequence and the ground-truth text sequence;

a training module for iteratively optimizing the identity-semantics deep coupling model and the lip reading prediction network with the contrastive loss, the difference loss, the Gaussian-distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as the optimization objective, to obtain the optimal lip reading recognition model;

a second acquisition module for acquiring the lip image sequence to be recognized;

a recognition module for inputting the lip image sequence to be recognized into the optimal lip reading recognition model to obtain the recognized text.

According to the specific embodiments provided herein, the present invention discloses the following technical effects:

The speaker-independent lip reading recognition method and system of the present invention use two independent networks, a 2D dense convolutional neural network and a 3D dense convolutional neural network, to encode the identity information and the semantic information of the lip image sequence respectively, yielding an identity feature sequence and a semantic feature sequence. The invention constrains the identity encoding process with a contrastive loss between the identities of different samples and an identity difference loss between different frames of the same sample, removing the interference of speaker identity information from the recognition result. By taking the contrastive loss, difference loss, Gaussian-distribution difference loss, correlation loss, reconstruction error loss and supervision loss as the optimization objective and iteratively optimizing the identity-semantics deep coupling model and the lip reading prediction network, the invention avoids overfitting of the learned feature space, effectively prevents identity information from being mixed into the semantic features, and thereby improves the recognition accuracy of the lip reading model under speaker-independent conditions.

Brief Description of the Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the speaker-independent lip reading recognition method of the present invention;

Fig. 2 is a schematic block diagram of the recognition method in an embodiment of the present invention;

Fig. 3 shows the identity and semantic feature encoding network in an embodiment of the present invention, where Fig. 3(a) is the structure of the dense convolutional neural network, Fig. 3(b) the 2D convolution structure of the identity encoding network, and Fig. 3(c) the 3D convolution structure of the semantic encoding network;

Fig. 4 is the structure of the identity-semantics coupled reconstruction network in an embodiment of the present invention;

Fig. 5 is the structure of the self-attention-based lip reading prediction network in an embodiment of the present invention;

Fig. 6 is a module connection diagram of the speaker-independent lip reading recognition system of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

The purpose of the present invention is to provide a speaker-independent lip reading recognition method and system that eliminate the interference of speaker identity information with the recognition result and improve recognition accuracy.

To make the above objects, features and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a flowchart of the speaker-independent lip reading recognition method of the present invention. As shown in Fig. 1, the method provided by this embodiment comprises:

Step 100: acquire training lip image sequences of multiple speaker samples.

Step 200: input the training lip image sequences into the identity-semantics deep coupling model to obtain the identity feature sequence, the semantic feature sequence and the reconstructed image sequence. The model comprises a 2D dense convolutional neural network, a 3D dense convolutional neural network and a deconvolutional neural network: the 2D dense convolutional neural network encodes the identity features of the training lip image sequence to produce the identity feature sequence; the 3D dense convolutional neural network encodes its semantic features to produce the semantic feature sequence; and the deconvolutional neural network couples and reconstructs the identity and semantic feature sequences to produce the reconstructed image sequence.

Step 300: calculate the contrastive loss from the identity features of different speaker samples in the identity feature sequence.

Step 301: calculate the difference loss from the identity features of different frames of the same speaker sample in the identity feature sequence.

Step 302: calculate the Gaussian-distribution difference loss of the semantic feature sequence based on the Gaussian-distribution method.

Step 303: calculate the correlation loss from the identity feature sequence and the semantic feature sequence.

Step 304: calculate the reconstruction error loss from the training lip image sequence and the reconstructed image sequence.

Step 400: input the semantic feature sequence into the lip reading prediction network to obtain the predicted text sequence.

Step 500: calculate the supervision loss from the predicted text sequence and the ground-truth text sequence.

Step 600: iteratively optimize the identity-semantics deep coupling model and the lip reading prediction network with the contrastive loss, the difference loss, the Gaussian-distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as the optimization objective, to obtain the optimal lip reading recognition model.

Step 700: acquire the lip image sequence to be recognized.

Step 800: input the lip image sequence to be recognized into the optimal lip reading recognition model to obtain the recognized text.

Preferably, the 2D dense convolutional neural network and the 3D dense convolutional neural network are both built on a dense convolutional network framework comprising, connected in sequence, a densely connected transition layer, a pooling layer and a fully connected layer; the densely connected transition layer comprises several densely connected transition units, each consisting of one dense connection module and one transition module.

Fig. 3 shows the identity and semantic feature encoding network in an embodiment of the present invention. Fig. 3(a) shows the dense convolutional neural network, composed of dense connection modules, transition modules, a pooling layer and a fully connected layer. Unlike a traditional neural network, which has no cross-layer connections, the dense connection module feeds the output of the current layer into the input of every subsequent layer. For a network with L layers, a traditional network has L connections, whereas dense connectivity yields L(L+1)/2 connections. This dense connectivity enables feature reuse, effectively reduces the number of channels per layer, and thus reduces the parameter count to some extent. In addition, the large number of cross-layer connections alleviates the vanishing-gradient problem that deep networks suffer from as depth grows. Denoting the output of layer $l$ by $x_l$, the input-output relation of a densely connected layer is:

$$x_l=H_l\big([x_0,x_1,\dots,x_{l-1}]\big)$$

Here $H_l(\cdot)$ denotes the convolution module of layer $l$. Fig. 3(b) shows the 2D convolution structure used in the identity encoding network and Fig. 3(c) the 3D convolution structure used in the semantic encoding network; depending on whether the task is identity or semantic encoding, the 2D or 3D variant is used. The module is mainly composed of batch normalization, ReLU, a 1×1 convolution and a 3×3 convolution. Identity feature encoding applies 2D convolutions to each static lip frame to extract the structural features of the image, while semantic feature encoding applies 3D convolutions over several consecutive frames to extract spatio-temporal features. The input of $H_l(\cdot)$ is the channel-wise concatenation of the outputs of layers 0 through $l-1$, which requires the feature maps output by every layer to share the same spatial scale; within each dense connection module, the feature-map scale therefore stays unchanged. However, an essential element of convolutional networks is reducing the feature-map scale by pooling in order to capture a larger receptive field. The dense convolutional network therefore inserts, between dense connection modules, the transition module shown in Fig. 3(b) and Fig. 3(c), composed of batch normalization, ReLU, a 1×1 convolution and 2×2 pooling: the 1×1 convolution compresses the channels and the 2×2 pooling downsamples the feature map, enabling feature capture at a larger scale. The final pooling layer of the dense convolutional network performs global pooling on the output feature map, retaining only channel information, and a fully connected layer then performs the final feature transformation.
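As an illustration of the dense connection and transition modules just described, here is a minimal 2D PyTorch sketch; the 3D semantic encoder would swap in Conv3d/BatchNorm3d/AvgPool3d (the class names and bottleneck width are illustrative, not from the patent):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One H_l unit: BN -> ReLU -> 1x1 conv -> BN -> ReLU -> 3x3 conv (2D variant)."""
    def __init__(self, in_ch, growth, bottleneck=4):
        super().__init__()
        mid = bottleneck * growth
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, growth, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # dense connectivity: concatenate this layer's output to all previous ones
        return torch.cat([x, self.body(x)], dim=1)

class Transition(nn.Module):
    """Transition module: BN -> ReLU -> 1x1 conv (channel compression) -> 2x2 pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.AvgPool2d(2),
        )

    def forward(self, x):
        return self.body(x)
```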

Fig. 4 shows the identity-semantics coupled reconstruction network in an embodiment of the present invention. After the identity and semantic features have been extracted by the dense convolutional networks above, the two features are concatenated and fed into the coupled reconstruction network of Fig. 4. The network first expands the feature vector into a feature map with a 4×4 deconvolution, then performs high-resolution reconstruction by upsampling; after each upsampling step, features are extracted by the convolution module of Fig. 3, composed of a 3×3 convolution, batch normalization and ReLU. Upsampling and convolution are repeated until the output feature map has the same scale as the lip image, completing the reconstruction.
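A hedged sketch of such a coupled reconstruction path, assuming a 64×64 grayscale lip frame (the channel widths and output size are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class CoupledDecoder(nn.Module):
    """Sketch of the reconstruction path: a 4x4 deconvolution expands the joint code
    into a feature map, then (upsample -> 3x3 conv -> BN -> ReLU) repeats until the
    output reaches image resolution."""
    def __init__(self, code_dim, out_size=64, base_ch=128):
        super().__init__()
        self.stem = nn.ConvTranspose2d(code_dim, base_ch, kernel_size=4)  # 1x1 -> 4x4
        blocks, ch, size = [], base_ch, 4
        while size < out_size:
            blocks += [nn.Upsample(scale_factor=2),
                       nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1),
                       nn.BatchNorm2d(ch // 2), nn.ReLU(inplace=True)]
            ch, size = ch // 2, size * 2
        blocks.append(nn.Conv2d(ch, 1, kernel_size=3, padding=1))  # grayscale lip frame
        self.body = nn.Sequential(*blocks)

    def forward(self, id_feat, sem_feat):
        code = torch.cat([id_feat, sem_feat], dim=-1)      # couple the two codes
        x = code.view(code.size(0), -1, 1, 1)              # vector -> 1x1 feature map
        return self.body(self.stem(x))
```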

Fig. 5 shows the structure of the self-attention-based lip reading prediction network in an embodiment of the present invention. As shown in Fig. 5, the lip reading prediction network is a seq2seq network based on the self-attention mechanism, comprising an input module, an Encoder module, a Decoder module and a classification module.

The input module is connected to the Encoder module and to the Decoder module. It receives the semantic feature sequence and the corresponding word-vector sequence, and embeds temporal position information into the semantic vectors at each time step and into the word vectors. The Decoder module is connected to the Encoder module and to the classification module. The Encoder module performs deep feature mining on the position-embedded semantic feature sequence to obtain a first feature sequence; the Decoder module derives a second feature sequence from attention over the first feature sequence and attention over the position-embedded word-vector sequence; and the classification module predicts the text sequence from the second feature sequence.

Specifically, in the input module, the lip image sequence is encoded by the semantic encoder into semantic vectors $s_i^1,s_i^2,\dots,s_i^T$ at the successive time steps, and the input part of the lip reading prediction network receives this sequence. Unlike an RNN, which processes temporal signals through recurrence, the lip reading prediction network realizes the semantic encoding of temporal information by superimposing temporal position information on the input data.

The position embedding uses sine-cosine positional encoding. The position codes are generated with sine and cosine functions of different frequencies and added to the semantic vector at the corresponding position; the dimension of the position vector must equal that of the semantic vector. The specific formulas are:

$$PE_{(pos,2i)}=\sin\big(pos/10000^{2i/d}\big)$$

$$PE_{(pos,2i+1)}=\cos\big(pos/10000^{2i/d}\big)$$

where $pos$ is the position of the semantic vector in the current sequence, $i$ indexes the $i$-th coordinate of the vector, and $d$ is the dimension of the semantic vector.
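These two formulas translate directly into code; a small sketch, assuming an even model dimension d:

```python
import math
import torch

def positional_encoding(seq_len, d):
    """Sine/cosine position codes of shape (seq_len, d), added to the semantic vectors.
    Assumes d is even so the sin/cos columns interleave cleanly."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d))                          # 1 / 10000^(2i/d)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(pos * div)                                   # even coordinates
    pe[:, 1::2] = torch.cos(pos * div)                                   # odd coordinates
    return pe
```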

Optionally, in the Encoder module, the position-embedded semantic features are fed in for deep feature mining. The Encoder module consists of a transition layer and an output layer. The transition layer is composed of multi-head attention and layer normalization, and its input-output relation can be expressed (in the standard residual form) as:

$$\tilde{e}_i=\mathrm{LayerNorm}\big(\tilde{s}_i+\mathrm{MultiHeadAttention}(\tilde{s}_i)\big)$$

where $\tilde{s}_i$ is the position-embedded semantic feature vector sequence of the $i$-th sample, MultiHeadAttention() is multi-head attention, and LayerNorm() is layer normalization.

Multi-head attention lets the network attend more to the relevant parts of the input and less to the irrelevant parts when performing a prediction task. An attention function can be described as mapping a Query and a set of Key-Value pairs to an output, where Query, Key, Value and output are all vectors. The output is a weighted sum of the values, where the weight assigned to each value is computed from a compatibility function of the Query and the corresponding Key. Concretely:

$$\mathrm{MultiHeadAttention}(s_i)=\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(head_1,\dots,head_h)\,W^O$$

where $Q$ is the query vector sequence, $K$ the key vector sequence and $V$ the value vector sequence, with $Q=K=V=s_i$; Concat() is matrix concatenation and $W^O$ is the output transformation matrix.

Each single attention head is computed by scaled dot-product attention:

$$head_i=\mathrm{Attention}\big(QW_i^Q,KW_i^K,VW_i^V\big)=\mathrm{softmax}\!\left(\frac{QW_i^Q\big(KW_i^K\big)^{\top}}{\sqrt{d_k}}\right)VW_i^V$$

where $W_i^Q$, $W_i^K$ and $W_i^V$ are the $i$-th head's transformation matrices for the query, key and value sequences respectively, each of size $d\times d_k$ with $d_k=d/h$, and $h$ is the number of attention heads.

Layer normalization is a common remedy for the internal covariate shift problem. It pulls the data distribution into the non-saturated region of the activation function, is invariant to weight and data rescaling, and thereby mitigates vanishing/exploding gradients, speeds up training and acts as a regularizer. It is implemented as:

$$\mu=\frac{1}{D}\sum_{i=1}^{D}z_i,\qquad \sigma^2=\frac{1}{D}\sum_{i=1}^{D}(z_i-\mu)^2$$

$$\mathrm{LayerNorm}(z)=\alpha\odot\frac{z-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta$$

where $z$ is the input $D$-dimensional feature vector and $\alpha$, $\beta$ are transformation coefficients.

Optionally, the Encoder output layer consists of a fully connected layer and layer normalization, with the input-output mapping:

$$e_i=\mathrm{LayerNorm}\big(\tilde{e}_i+\mathrm{FC}(\tilde{e}_i)\big)$$

where FC() denotes the fully connected layer.

Preferably, the overall structure of the Decoder module is similar to that of the Encoder. On top of the Encoder structure, attention between the Encoder output and the Decoder input is added: the Encoder output $e_i$ serves as the K and V of the attention model in the Decoder, and the Decoder input serves as Q, from which the Decoder output is computed.

Optionally, the Decoder input is the word-vector sequence of the language sequence $w_i^1,w_i^2,\dots,w_i^T$, where $w_i^j$ is the word vector of the $i$-th lip sequence at time step $j$; the word vectors of the language sequence are the ground-truth text sequence. The word vectors entering the Decoder first receive the same temporal position embedding as in the Encoder, yielding the position-embedded word-vector sequence $\tilde{w}_i$. After the first attention layer, the input-output relation is:

$$\tilde{d}_i=\mathrm{LayerNorm}\big(\tilde{w}_i+\mathrm{MultiHeadAttention}(\tilde{w}_i)\big)$$

where MultiHeadAttention() and LayerNorm() are computed in the same way as in the Encoder module.

Unlike the Encoder module, the Decoder, after obtaining $\tilde{d}_i$, uses it as the query Q of a second multi-head attention, with the Encoder output $e_i$ as the keys K and values V, to compute the attention between the word-vector sequence and the semantic feature sequence:

$$\tilde{o}_i=\mathrm{LayerNorm}\big(\tilde{d}_i+\mathrm{MultiHeadAttention}(\tilde{d}_i,e_i,e_i)\big)$$

The resulting attention vectors then pass through a fully connected layer and layer normalization, giving the final Decoder output:

$$o_i=\mathrm{LayerNorm}\big(\tilde{o}_i+\mathrm{FC}(\tilde{o}_i)\big)$$

The output module takes the Decoder output $o_i$, passes it through a fully connected layer and a softmax layer, and determines the lip reading output content.
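The autoregressive prediction described earlier (step t conditioned on all semantic features plus the outputs of steps 0 through t-1) can be sketched as greedy decoding; `model.encode`/`model.decode` are assumed interface names for illustration, not from the patent:

```python
import torch

def greedy_decode(model, sem_seq, bos_id, eos_id, max_len=40):
    """Greedy autoregressive decoding sketch over a seq2seq model.

    sem_seq: (1, T, d) position-embedded semantic feature sequence.
    Token t is predicted from the encoder memory plus tokens already emitted.
    """
    memory = model.encode(sem_seq)                 # (1, T, d) encoder output
    tokens = [bos_id]
    for _ in range(max_len):
        prev = torch.tensor(tokens).unsqueeze(0)   # (1, t) tokens emitted so far
        logits = model.decode(prev, memory)        # (1, t, C) class scores
        nxt = int(logits[0, -1].argmax())          # most likely next token
        if nxt == eos_id:
            break
        tokens.append(nxt)
    return tokens[1:]
```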

优选地,所述对比损失的计算公式为:Preferably, the calculation formula of the contrast loss is:

Figure BDA00029565206100001213
Figure BDA00029565206100001213

其中,Lc为对比损失;N表示所述说话人样本的数量;

Figure BDA0002956520610000131
表示第i个样本的第t帧图像;
Figure BDA0002956520610000132
表示第j个样本的第t′帧图像;
Figure BDA0002956520610000133
表示
Figure BDA0002956520610000134
的身份特征;
Figure BDA0002956520610000135
表示
Figure BDA0002956520610000136
的身份特征;y表示不同组样本是否匹配标签,当两组样本身份相同,y=1,否则y=0;margin为设定的阈值。Among them, L c is the contrast loss; N is the number of the speaker samples;
Figure BDA0002956520610000131
Represents the t-th frame image of the i-th sample;
Figure BDA0002956520610000132
represents the t'th frame image of the jth sample;
Figure BDA0002956520610000133
express
Figure BDA0002956520610000134
identity;
Figure BDA0002956520610000135
express
Figure BDA0002956520610000136
y indicates whether the different groups of samples match the label, when the identities of the two groups of samples are the same, y=1, otherwise y=0; margin is the set threshold.

优选地,所述差异损失的计算公式为:Preferably, the calculation formula of the difference loss is:

Figure BDA0002956520610000137
Figure BDA0002956520610000137

其中,Ld为差异损失;N表示所述说话人样本的数量,

Figure BDA0002956520610000138
表示第i个样本的第j帧图像;
Figure BDA0002956520610000139
表示第i个样本的第k帧图像;
Figure BDA00029565206100001310
表示
Figure BDA00029565206100001311
的身份特征;
Figure BDA00029565206100001312
表示
Figure BDA00029565206100001313
的身份特征;T表示说话人样本中的帧数。Among them, L d is the difference loss; N is the number of the speaker samples,
Figure BDA0002956520610000138
represents the jth frame image of the ith sample;
Figure BDA0002956520610000139
represents the k-th frame image of the i-th sample;
Figure BDA00029565206100001310
express
Figure BDA00029565206100001311
identity;
Figure BDA00029565206100001312
express
Figure BDA00029565206100001313
The identity feature of ; T represents the number of frames in the speaker sample.

优选地,所述高斯分布差异损失的计算公式为:Preferably, the calculation formula of the Gaussian distribution difference loss is:

Figure BDA00029565206100001314
Figure BDA00029565206100001314

Figure BDA00029565206100001315
Figure BDA00029565206100001315

Figure BDA00029565206100001316
Figure BDA00029565206100001316

其中,Ldd表示所述高斯分布差异损失;

Figure BDA00029565206100001317
表示P组说话人样本中第i个样本的第t帧图像;
Figure BDA00029565206100001318
表示P组样本中第i个样本的第t帧图像的语义特征;ΣP表示P组说话人样本的语义特征的协方差矩阵;ΣQ表示Q组说话人样本的语义特征的协方差矩阵;μP表示P组说话人样本的语义特征的均值向量;μQ表示Q组说话人样本的语义特征的均值向量;det表示矩阵行列式的值;z表示语义编码特征的维度,T表示说话人样本中的帧数。Wherein, L dd represents the Gaussian distribution difference loss;
Figure BDA00029565206100001317
represents the t-th frame image of the i-th sample in the P group of speaker samples;
Figure BDA00029565206100001318
represents the semantic feature of the t-th frame image of the ith sample in the P group of samples; Σ P represents the covariance matrix of the semantic feature of the P group of speaker samples; Σ Q represents the covariance matrix of the semantic feature of the Q group of speaker samples; μ P represents the mean vector of the semantic features of the P group of speaker samples; μ Q represents the mean vector of the semantic features of the Q group of speaker samples; det represents the value of the matrix determinant; z represents the dimension of the semantic encoding feature, T represents the speaker The number of frames in the sample.

优选地,所述相关损失的计算公式为:Preferably, the calculation formula of the correlation loss is:

Figure BDA00029565206100001319
Figure BDA00029565206100001319

其中,LR表示所述相关损失;T表示说话人样本中的帧数;N表示所述说话人样本的数量;

Figure BDA00029565206100001320
表示第i个样本的第t帧图像;
Figure BDA00029565206100001321
表示
Figure BDA00029565206100001322
的身份特征;
Figure BDA00029565206100001323
表示
Figure BDA00029565206100001324
的语义特征。Among them, LR represents the correlation loss; T represents the number of frames in the speaker sample; N represents the number of the speaker sample;
Figure BDA00029565206100001320
Represents the t-th frame image of the i-th sample;
Figure BDA00029565206100001321
express
Figure BDA00029565206100001322
identity;
Figure BDA00029565206100001323
express
Figure BDA00029565206100001324
semantic features.

优选地,所述重建误差损失的计算公式为:Preferably, the calculation formula of the reconstruction error loss is:

Figure BDA0002956520610000141
Figure BDA0002956520610000141

其中,Lcon表示所述重建误差损失;T表示说话人样本中的帧数;N表示所述说话人样本的数量;

Figure BDA0002956520610000142
表示第i个样本的第t帧图像;
Figure BDA0002956520610000143
表示
Figure BDA0002956520610000144
的身份特征;
Figure BDA0002956520610000145
表示
Figure BDA0002956520610000146
的语义特征,conj表示身份特征向量和语义特征向量连接。Wherein, L con represents the reconstruction error loss; T represents the number of frames in the speaker sample; N represents the number of the speaker sample;
Figure BDA0002956520610000142
Represents the t-th frame image of the i-th sample;
Figure BDA0002956520610000143
express
Figure BDA0002956520610000144
identity;
Figure BDA0002956520610000145
express
Figure BDA0002956520610000146
The semantic feature of , conj represents the connection between the identity feature vector and the semantic feature vector.

优选地，所述监督损失的计算公式为：Preferably, the supervision loss is computed as:

Lseq = −(1/N)·Σ_{i=1..N} Σ_{t=1..T} Σ_{j=1..C} p_{t,j}^i·log p̂_{t,j}^i

其中，语义特征编码矩阵 Si = [S(f_1^i), S(f_2^i), …, S(f_T^i)]，第t项的预测概率由 p̂_t^i = Ep(Si, ŷ_0^i, …, ŷ_{t−1}^i) 给出，ŷ_k^i为第k项的唇语预测输出。where the semantic feature encoding matrix is Si = [S(f_1^i), S(f_2^i), …, S(f_T^i)] and the prediction for item t is given by p̂_t^i = Ep(Si, ŷ_0^i, …, ŷ_{t−1}^i), with ŷ_k^i the lip language prediction output of item k.

其中，Lseq表示所述监督损失；N表示所述说话人样本的数量；T表示说话人样本中的帧数；C表示文本类别的数量；p_{t,j}^i为样本i的第t帧的文本类别为j的真实概率，p̂_{t,j}^i为说话人样本i的第t帧的文本类别为j的预测概率；Si表示所述语义特征的编码矩阵；Ep表示基于自注意力机制的所述唇语预测网络；f_1^i、f_2^i、…、f_T^i分别表示第i个样本的第1帧到第T帧图像；S(f_1^i)、…、S(f_T^i)为对应各帧的语义特征；第t项的唇语预测输出根据所有帧的语义特征以及第0项到第t−1项的唇语预测输出内容判定。Here Lseq denotes the supervision loss; N the number of speaker samples; T the number of frames in a speaker sample; C the number of text categories; p_{t,j}^i the true probability that the text category of frame t of sample i is j, and p̂_{t,j}^i the corresponding predicted probability; Si the encoding matrix of the semantic features; Ep the self-attention-based lip language prediction network; f_1^i through f_T^i the 1st to T-th frame images of the i-th sample, with S(f_1^i) through S(f_T^i) their semantic features. The lip language prediction output of item t is determined from the semantic features of all frames together with the prediction outputs of items 0 through t−1.

优选地，所述以所述对比损失、所述差异损失、所述高斯分布差异损失、所述相关损失、所述重建误差损失和所述监督损失作为优化目标，对所述身份与语义深度耦合模型和所述唇语预测网络进行迭代寻优，得到最优唇语识别模型，包括：Preferably, taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization objectives, performing iterative optimization on the identity and semantic depth coupling model and the lip language prediction network to obtain the optimal lip language recognition model includes:

以加权损失为优化函数,利用Adam优化器对所述身份与语义深度耦合模型以及所述唇语预测网络进行迭代学习,得到优化后的身份与语义深度耦合模型以及唇语预测网络;Taking the weighted loss as the optimization function, the Adam optimizer is used to iteratively learn the identity and semantic depth coupling model and the lip language prediction network to obtain the optimized identity and semantic depth coupling model and the lip language prediction network;

其中，所述优化函数为L(θ)=Lseq+α1Lc+α2Ld+α3Ldd+α4LR+α5Lcon，其中，L(θ)为加权损失；Lseq为所述监督损失；Lc为所述对比损失；Ld为所述差异损失；Ldd表示所述高斯分布差异损失；LR表示所述相关损失；Lcon表示所述重建误差损失；α1表示所述对比损失的权重，α2表示所述差异损失的权重，α3表示所述高斯分布差异损失的权重，α4表示所述相关损失的权重，α5表示所述重建误差损失的权重。Wherein, the optimization function is L(θ)=Lseq+α1·Lc+α2·Ld+α3·Ldd+α4·LR+α5·Lcon, where L(θ) is the weighted loss; Lseq is the supervision loss; Lc is the contrast loss; Ld is the difference loss; Ldd is the Gaussian distribution difference loss; LR is the correlation loss; Lcon is the reconstruction error loss; and α1, α2, α3, α4 and α5 are the weights of the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss and the reconstruction error loss, respectively.
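A minimal sketch of this weighted combination; the α weights are tunable hyperparameters, so the defaults below are placeholders rather than values from the patent:

```python
def total_loss(l_seq, l_c, l_d, l_dd, l_r, l_con,
               alphas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """L(θ) = Lseq + α1·Lc + α2·Ld + α3·Ldd + α4·LR + α5·Lcon."""
    a1, a2, a3, a4, a5 = alphas
    return l_seq + a1 * l_c + a2 * l_d + a3 * l_dd + a4 * l_r + a5 * l_con
```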

具体的，Adam优化器结合AdaGrad和RMSProp两种优化算法的优点，对梯度的一阶矩估计和二阶矩估计进行综合考虑，计算出更新步长。针对上述总体损失的优化问题，Adam优化器的具体实现步骤如下：Specifically, the Adam optimizer combines the advantages of the AdaGrad and RMSProp optimization algorithms, jointly considering the first-order and second-order moment estimates of the gradient to compute the update step size. For the optimization problem of the above overall loss, the Adam optimizer proceeds as follows:

(1) 初始化参数θ以及第0时刻的一阶矩m0与二阶矩v0。(1) Initialize the parameter θ, the first moment m0 and the second moment v0 at time 0.

(2) 更新t时刻的梯度：g_t ← ∇_θ L(θ_{t−1}) (2) Update the gradient at time t: g_t ← ∇_θ L(θ_{t−1})

(3) 更新一阶矩：m_t ← β1·m_{t−1} + (1−β1)·g_t (3) Update the first moment: m_t ← β1·m_{t−1} + (1−β1)·g_t

(4) 更新二阶矩：v_t ← β2·v_{t−1} + (1−β2)·g_t² (4) Update the second moment: v_t ← β2·v_{t−1} + (1−β2)·g_t²

(5) 更新无偏一阶矩：m̂_t ← m_t/(1−β1^t) (5) Update the bias-corrected first moment: m̂_t ← m_t/(1−β1^t)

(6) 更新无偏二阶矩：v̂_t ← v_t/(1−β2^t) (6) Update the bias-corrected second moment: v̂_t ← v_t/(1−β2^t)

(7) 更新参数：θ_t ← θ_{t−1} − α·m̂_t/(√v̂_t + ε) (7) Update the parameters: θ_t ← θ_{t−1} − α·m̂_t/(√v̂_t + ε)

重复(2)-(7)直至损失收敛。Repeat (2)-(7) until the loss converges.

其中，β1、β2表示指数衰减率，β1^t、β2^t表示β1、β2的t次幂，α为学习率，g_t²表示梯度g_t的逐元素平方，ε=10^−8。Here β1 and β2 are the exponential decay rates, β1^t and β2^t their t-th powers, α the learning rate, g_t² the element-wise square of the gradient g_t, and ε=10^−8.
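The steps above map directly onto code. A minimal NumPy sketch of one Adam update; the default hyperparameter values are the common choices rather than values prescribed by this patent:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following steps (2)-(7); t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad           # step (3): first moment
    v = beta2 * v + (1 - beta2) * grad ** 2      # step (4): second moment
    m_hat = m / (1 - beta1 ** t)                 # step (5): bias correction
    v_hat = v / (1 - beta2 ** t)                 # step (6): bias correction
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # step (7)
    return theta, m, v
```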

可选地,所述方法还包括:Optionally, the method further includes:

将所述待识别唇语图片序列输入所述最优唇语识别模型中的3D稠密卷积神经网络中，得到待识别语义特征序列。The lip language picture sequence to be recognized is input into the 3D dense convolutional neural network of the optimal lip language recognition model to obtain the semantic feature sequence to be recognized.

将所述待识别语义特征序列输入所述最优唇语识别模型中的唇语预测网络中，得到预测文本序列。The semantic feature sequence to be recognized is input into the lip language prediction network of the optimal lip language recognition model to obtain the predicted text sequence.

具体的，利用语义信息编码网络Es以及唇语预测网络Ep进行语义特征提取与识别：Specifically, the semantic information coding network Es and the lip language prediction network Ep perform semantic feature extraction and recognition:

[公式图像 / formula images: 语义编码与预测过程 semantic encoding and prediction]

输入的唇语图片序列经过语义编码后输出语义特征序列 S=[s_1, …, s_T]。唇语预测网络根据输入的语义特征序列以及t时刻之前所有的词向量预测t时刻的词向量输出。输入的语义特征序列经图4所示的Encoder结构计算语义编码输出；Decoder通过自注意力将t−1时刻的词向量表征为前t−1时刻所有词向量的注意力加权和，再通过注意力机制关联Encoder输出的语义特征编码，以此计算Decoder的输出并预测t时刻的词向量。在t=1时刻，Decoder根据默认的开始词向量进行预测，层层递推地预测每一时刻的唇语输出词向量。After semantic encoding, the input lip language picture sequence yields the semantic feature sequence S=[s_1, …, s_T]. The lip language prediction network predicts the word vector output at time t from the input semantic feature sequence and all word vectors before time t. The semantic feature sequence is passed through the Encoder structure shown in Figure 4 to compute the semantic encoding output; the Decoder uses self-attention to represent the word vector at time t−1 as an attention-weighted sum of all word vectors up to time t−1, then relates it to the semantic feature encoding output by the Encoder through the attention mechanism, computes the Decoder output, and predicts the word vector at time t. At t=1 the Decoder predicts from the default start word vector, recursively predicting the lip language output word vector at every time step.
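This recursive prediction is ordinary autoregressive decoding. A minimal sketch under stated assumptions: encode_fn and decode_fn are hypothetical stand-ins for the semantic coding network Es plus the Encoder, and for the Decoder plus its classifier, respectively:

```python
import numpy as np

def greedy_decode(frames, encode_fn, decode_fn, bos_id, max_len):
    """encode_fn: lip frame sequence -> semantic encoding; decode_fn:
    (encoding, tokens so far) -> probability distribution over the next
    token. Decoding starts from the default start token and recurses."""
    encoding = encode_fn(frames)
    tokens = [bos_id]                        # id of the default start word vector
    for _ in range(max_len):
        probs = decode_fn(encoding, tokens)  # attends to all prior outputs
        tokens.append(int(np.argmax(probs))) # greedy choice at time t
    return tokens[1:]
```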

图6为本发明用于说话人无关的唇语识别系统的模块连接图，如图6所示，本发明一种用于说话人无关的唇语识别系统，包括：Fig. 6 is the module connection diagram of the speaker-independent lip language recognition system of the present invention. As shown in Fig. 6, the speaker-independent lip language recognition system of the present invention includes:

第一获取模块,用于获取多个说话人样本的训练唇语图片序列;The first acquisition module is used to acquire the training lip language picture sequence of multiple speaker samples;

特征输出模块,用于将多个所述训练唇语图片序列输入身份与语义深度耦合模型中,得到身份特征序列、语义特征序列和重建图像序列;所述身份与语义深度耦合模型包括:2D稠密卷积神经网络、3D稠密卷积神经网络和反卷积神经网络;所述2D稠密卷积神经网络用于编码所述训练唇语图片序列的身份特征,得到所述身份特征序列;所述3D稠密卷积神经网络用于编码所述训练唇语图片序列的语义特征,得到所述语义特征序列;所述反卷积神经网络用于对所述身份特征序列与所述语义特征序列进行重建耦合,得到所述重建图像序列;The feature output module is used to input a plurality of the training lip language picture sequences into the identity and semantic depth coupling model to obtain the identity feature sequence, the semantic feature sequence and the reconstructed image sequence; the identity and semantic depth coupling model includes: 2D dense Convolutional neural network, 3D dense convolutional neural network and deconvolutional neural network; the 2D dense convolutional neural network is used to encode the identity feature of the training lip language picture sequence to obtain the identity feature sequence; the 3D The dense convolutional neural network is used to encode the semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolutional neural network is used to reconstruct and couple the identity feature sequence and the semantic feature sequence , to obtain the reconstructed image sequence;

第一计算模块,用于根据所述身份特征序列中不同说话人样本的身份特征计算对比损失;a first computing module, configured to calculate a comparison loss according to the identity features of different speaker samples in the identity feature sequence;

第二计算模块,用于根据所述身份特征序列中相同说话人样本的不同帧的身份特征计算差异损失;a second calculation module, configured to calculate the difference loss according to the identity features of different frames of the same speaker sample in the sequence of identity features;

第三计算模块,用于基于高斯分布方法计算所述语义特征序列的高斯分布差异损失;a third calculation module, configured to calculate the Gaussian distribution difference loss of the semantic feature sequence based on the Gaussian distribution method;

第四计算模块,用于根据所述身份特征序列和所述语义特征序列计算相关损失;a fourth calculation module, configured to calculate a correlation loss according to the identity feature sequence and the semantic feature sequence;

第五计算模块,用于根据所述训练唇语图片序列和所述重建图像序列计算重建误差损失;a fifth calculation module, configured to calculate the reconstruction error loss according to the training lip language picture sequence and the reconstructed image sequence;

文本输出模块,用于将所述语义特征序列输入唇语预测网络中,得到预测文本序列;a text output module for inputting the semantic feature sequence into the lip language prediction network to obtain a predicted text sequence;

第六计算模块,用于根据所述预测文本序列和真实文本序列计算监督损失;The sixth calculation module is used for calculating the supervision loss according to the predicted text sequence and the real text sequence;

训练模块，用于以所述对比损失、所述差异损失、所述高斯分布差异损失、所述相关损失、所述重建误差损失和所述监督损失作为优化目标，对所述身份与语义深度耦合模型和所述唇语预测网络进行迭代寻优，得到最优唇语识别模型；a training module, configured to take the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization objectives, and to iteratively optimize the identity and semantic depth coupling model and the lip language prediction network to obtain the optimal lip language recognition model;

第二获取模块,用于获取待识别唇语图片序列;The second acquisition module is used to acquire the sequence of lip language pictures to be recognized;

识别模块,用于将所述待识别唇语图片序列输入最优唇语识别模型中,得到识别文本。The recognition module is used for inputting the lip language picture sequence to be recognized into the optimal lip language recognition model to obtain the recognized text.

本发明的有益效果如下:The beneficial effects of the present invention are as follows:

第一，本发明采用两组独立的网络分别对唇语图片序列的身份信息与语义信息编码，以不同样本身份对比损失以及相同样本不同帧的身份差异损失对身份编码过程进行约束，以seq2seq监督损失对语义编码过程进行约束，本发明找到最优的身份空间对语义编码特征进行约束，避免了学习的特征空间出现过拟合问题。相比于目前唇语识别方法通过单一语义监督约束的方式，可以有效的避免语义特征混入身份信息，进而提高唇语识别模型在说话人无关条件下的识别准确率。First, the present invention uses two independent networks to encode the identity information and the semantic information of the lip language picture sequence separately, constrains the identity encoding process with the identity contrast loss across different samples and the identity difference loss across different frames of the same sample, and constrains the semantic encoding process with the seq2seq supervision loss; the present invention thus finds the optimal identity space with which to constrain the semantic encoding features, avoiding overfitting of the learned feature space. Compared with current lip language recognition methods that rely on a single semantic supervision constraint, this effectively prevents identity information from mixing into the semantic features, thereby improving the recognition accuracy of the lip language recognition model under speaker-independent conditions.

第二，本发明在上述耦合模型的基础上，进一步引入了身份特征与语义特征的相关损失约束，确保身份信息与语义信息的相关性最小；此外，本发明进一步假设语义特征服从单一高斯分布，以不同组样本的高斯分布差异性作为损失约束，保证不同说话人提取的语义特征分布差异最小，限定了语义空间的独立性，从而改善唇语识别系统对说话人身份变化的鲁棒性能。Second, on the basis of the above coupling model, the present invention further introduces a correlation loss constraint between identity features and semantic features, ensuring that the correlation between identity information and semantic information is minimal. In addition, the present invention further assumes that the semantic features obey a single Gaussian distribution and uses the difference between the Gaussian distributions of different sample groups as a loss constraint, guaranteeing minimal difference between the distributions of semantic features extracted from different speakers and confining the independence of the semantic space, thereby improving the robustness of the lip language recognition system to changes in speaker identity.

第三，本发明在语义预测过程中采用了基于自注意力机制的seq2seq模型，相比于目前唇语识别方法采用的如LSTM、GRU等循环神经网络，可以实现时序特征的长期记忆与关联能力，从而提高唇语预测过程的精度。此外，自注意力机制不同于传统的循环神经网络通过递推训练的方式，该模型可以实现并行化训练，进而大幅缩短唇语识别网络的学习时间。Third, the present invention adopts a seq2seq model based on the self-attention mechanism in the semantic prediction process. Compared with the recurrent neural networks such as LSTM and GRU used by current lip language recognition methods, it achieves long-term memory of and association between temporal features, improving the accuracy of the lip language prediction process. In addition, unlike traditional recurrent neural networks trained recursively, the self-attention model can be trained in parallel, greatly shortening the learning time of the lip language recognition network.

本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的系统而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same and similar parts between the various embodiments can be referred to each other. For the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant part can be referred to the description of the method.

本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的一般技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处。综上所述,本说明书内容不应理解为对本发明的限制。In this paper, specific examples are used to illustrate the principles and implementations of the present invention. The descriptions of the above embodiments are only used to help understand the methods and core ideas of the present invention; meanwhile, for those skilled in the art, according to the present invention There will be changes in the specific implementation and application scope. In conclusion, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1.一种用于说话人无关的唇语识别方法,其特征在于,包括:1. a kind of lip language recognition method that is irrelevant for the speaker, is characterized in that, comprises: 获取多个说话人样本的训练唇语图片序列;Obtain a sequence of training lip language pictures for multiple speaker samples; 将多个所述训练唇语图片序列输入身份与语义深度耦合模型中,得到身份特征序列、语义特征序列和重建图像序列;所述身份与语义深度耦合模型包括:2D稠密卷积神经网络、3D稠密卷积神经网络和反卷积神经网络;所述2D稠密卷积神经网络用于编码所述训练唇语图片序列的身份特征,得到所述身份特征序列;所述3D稠密卷积神经网络用于编码所述训练唇语图片序列的语义特征,得到所述语义特征序列;所述反卷积神经网络用于对所述身份特征序列与所述语义特征序列进行重建耦合,得到所述重建图像序列;A plurality of the training lip language picture sequences are input into the identity and semantic depth coupling model, and the identity feature sequence, the semantic feature sequence and the reconstructed image sequence are obtained; the identity and semantic depth coupling model includes: 2D dense convolutional neural network, 3D Dense convolutional neural network and deconvolutional neural network; the 2D dense convolutional neural network is used to encode the identity feature of the training lip language picture sequence to obtain the identity feature sequence; the 3D dense convolutional neural network uses encoding the semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolutional neural network is used to reconstruct and couple the identity feature sequence and the semantic feature sequence to obtain the reconstructed image sequence; 根据所述身份特征序列中不同说话人样本的身份特征计算对比损失;Calculate the comparison loss according to the identity features of different speaker samples in the identity feature sequence; 根据所述身份特征序列中相同说话人样本的不同帧的身份特征计算差异损失;Calculate the difference loss according to the identity features of different frames of the same speaker sample in the identity feature sequence; 基于高斯分布方法计算所述语义特征序列的高斯分布差异损失;Calculate the Gaussian distribution difference loss of the semantic feature sequence based on the Gaussian distribution method; 根据所述身份特征序列和所述语义特征序列计算相关损失;Calculate a correlation loss according to the identity feature sequence and the semantic feature sequence; 根据所述训练唇语图片序列和所述重建图像序列计算重建误差损失;Calculate a reconstruction error loss according to the training lip language picture sequence and the reconstructed image sequence; 将所述语义特征序列输入唇语预测网络中,得到预测文本序列;Inputting the semantic feature sequence into a lip language prediction network to obtain a predicted text sequence; 根据所述预测文本序列和真实文本序列计算监督损失;compute a supervised loss according to the predicted text sequence and the real text sequence; 以所述对比损失、所述差异损失、所述高斯分布差异损失、所述相关损失、所述重建误差损失和所述监督损失作为优化目标,对所述身份与语义深度耦合模型和所述唇语预测网络进行迭代寻优,得到最优唇语识别模型;Taking the contrast loss, the disparity loss, the Gaussian disparity loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization goals, the identity and semantic depth coupling model and the lip The language prediction network is used for iterative optimization, and the optimal lip language recognition model is obtained; 获取待识别唇语图片序列;Obtain the sequence of lip language pictures to be recognized; 将所述待识别唇语图片序列输入最优唇语识别模型中,得到识别文本。Inputting the lip language picture sequence to be recognized into the optimal lip language recognition model to obtain the recognized text. 2.根据权利要求1所述的用于说话人无关的唇语识别方法,其特征在于,所述2D稠密卷积神经网络和所述3D稠密卷积神经网络均由稠密卷积神经网络框架构成;所述稠密卷积神经网络框架包括依次连接的稠密连接过渡层、池化层和全连接层;所述稠密连接过渡层包括多个稠密连接过渡单元;每个所述稠密连接过渡单元均包括一个稠密连接模块和一个过渡模块;2. 
The method for speaker-independent lip language recognition according to claim 1, wherein the 2D dense convolutional neural network and the 3D dense convolutional neural network are both composed of a dense convolutional neural network framework ; The dense convolutional neural network framework includes a dense connection transition layer, a pooling layer and a fully connected layer that are sequentially connected; the dense connection transition layer includes a plurality of dense connection transition units; each of the dense connection transition units includes A dense connection module and a transition module; 所述唇语预测网络为基于自注意力机制的seq2seq网络;所述唇语预测网络包括输入模块、Encoder模块、Decoder模块和分类模块;The lip language prediction network is a seq2seq network based on a self-attention mechanism; the lip language prediction network includes an input module, an Encoder module, a Decoder module and a classification module; 所述输入模块分别与所述Encoder模块和所述Decoder模块连接,所述输入模块用于获取语义特征序列和语义特征序列对应的词向量序列,并将语义特征序列中不同时刻的语义向量和所述词向量序列中的词向量嵌入时间位置信息,所述Decoder模块分别与所述Encoder模块和所述分类模块连接,所述Encoder模块用于对嵌入时间位置信息的语义特征序列进行深度特征挖掘,得到第一特征序列;所述Decoder模块用于根据所述第一特征序列的注意力和嵌入时间位置信息的词向量序列的注意力得到第二特征序列,所述分类模块用于根据所述第二特征序列判定得到预测文本序列。The input module is respectively connected with the Encoder module and the Decoder module, and the input module is used to obtain the semantic feature sequence and the word vector sequence corresponding to the semantic feature sequence, and combine the semantic vectors at different times in the semantic feature sequence with all the word vector sequences. The word vector in the predicate vector sequence is embedded with time position information, and the Decoder module is respectively connected with the Encoder module and the classification module, and the Encoder module is used to perform deep feature mining on the semantic feature sequence embedded with the time position information, Obtain the first feature sequence; the Decoder module is used to obtain the second feature sequence according to the attention of the first feature sequence and the attention of the word vector sequence embedded with time position information, and the classification module is used to obtain the second feature sequence according to the first feature sequence. The two feature sequences are determined to obtain the predicted text sequence. 3.根据权利要求1所述的用于说话人无关的唇语识别方法,其特征在于,所述对比损失的计算公式为:3. The lip language recognition method for speaker irrelevance according to claim 1, wherein the calculation formula of the contrast loss is:
[公式图像 / formula image: Lc]

其中，Lc为对比损失；N表示所述说话人样本的数量；f_t^i表示第i个样本的第t帧图像；f_t′^j表示第j个样本的第t′帧图像；I(f_t^i)表示f_t^i的身份特征；I(f_t′^j)表示f_t′^j的身份特征；y表示不同组样本是否匹配的标签，当两组样本身份相同时y=1，否则y=0；margin为设定的阈值。Here Lc is the contrast loss; N denotes the number of speaker samples; f_t^i the t-th frame image of the i-th sample; f_t′^j the t′-th frame image of the j-th sample; I(f_t^i) and I(f_t′^j) their respective identity features; y is the label indicating whether two groups of samples match, y=1 when the two samples share the same identity and y=0 otherwise; margin is a set threshold.
4.根据权利要求1所述的用于说话人无关的唇语识别方法，其特征在于，所述差异损失的计算公式为：4. The speaker-independent lip language recognition method according to claim 1, wherein the difference loss is computed as:
[公式图像 / formula image: Ld]

其中，Ld为差异损失；N表示所述说话人样本的数量；f_j^i表示第i个样本的第j帧图像；f_k^i表示第i个样本的第k帧图像；I(f_j^i)表示f_j^i的身份特征；I(f_k^i)表示f_k^i的身份特征；T表示说话人样本中的帧数。Here Ld is the difference loss; N denotes the number of speaker samples; f_j^i the j-th frame image of the i-th sample; f_k^i the k-th frame image of the i-th sample; I(f_j^i) and I(f_k^i) their respective identity features; T the number of frames in a speaker sample.
5.根据权利要求1所述的用于说话人无关的唇语识别方法,其特征在于,所述高斯分布差异损失的计算公式为:5. The method for speaker-independent lip language recognition according to claim 1, wherein the calculation formula of the Gaussian distribution difference loss is:
[公式图像 / formula images: Ldd及μ、Σ的估计 Ldd and the estimates of μ and Σ]

其中，Ldd表示所述高斯分布差异损失；f_t^{P,i}表示P组说话人样本中第i个样本的第t帧图像；S(f_t^{P,i})表示P组样本中第i个样本的第t帧图像的语义特征；ΣP表示P组说话人样本语义特征的协方差矩阵；ΣQ表示Q组说话人样本语义特征的协方差矩阵；μP表示P组说话人样本语义特征的均值向量；μQ表示Q组说话人样本语义特征的均值向量；det表示矩阵行列式的值；z表示语义编码特征的维度；T表示说话人样本中的帧数。Here Ldd denotes the Gaussian distribution difference loss; f_t^{P,i} the t-th frame image of the i-th sample in speaker group P and S(f_t^{P,i}) its semantic feature; ΣP and ΣQ the covariance matrices of the semantic features of speaker groups P and Q; μP and μQ the corresponding mean vectors; det the value of the matrix determinant; z the dimension of the semantic encoding feature; T the number of frames in a speaker sample.
6.根据权利要求1所述的用于说话人无关的唇语识别方法,其特征在于,所述相关损失的计算公式为:6. The method for speaker-independent lip language recognition according to claim 1, wherein the calculation formula of the correlation loss is:
[公式图像 / formula image: LR]

其中，LR表示所述相关损失；T表示说话人样本中的帧数；N表示所述说话人样本的数量；f_t^i表示第i个样本的第t帧图像；I(f_t^i)表示f_t^i的身份特征；S(f_t^i)表示f_t^i的语义特征。Here LR denotes the correlation loss; T the number of frames in a speaker sample; N the number of speaker samples; f_t^i the t-th frame image of the i-th sample; I(f_t^i) its identity feature; S(f_t^i) its semantic feature.
7.根据权利要求1所述的用于说话人无关的唇语识别方法,其特征在于,所述重建误差损失的计算公式为:7. The method for speaker-independent lip language recognition according to claim 1, wherein the calculation formula of the reconstruction error loss is:
[公式图像 / formula image: Lcon]

其中，Lcon表示所述重建误差损失；T表示说话人样本中的帧数；N表示所述说话人样本的数量；f_t^i表示第i个样本的第t帧图像；I(f_t^i)表示f_t^i的身份特征；S(f_t^i)表示f_t^i的语义特征；conj表示身份特征向量和语义特征向量的连接。Here Lcon denotes the reconstruction error loss; T the number of frames in a speaker sample; N the number of speaker samples; f_t^i the t-th frame image of the i-th sample; I(f_t^i) its identity feature; S(f_t^i) its semantic feature; conj the concatenation of the identity feature vector and the semantic feature vector.
8.根据权利要求1所述的用于说话人无关的唇语识别方法,其特征在于,所述监督损失的计算公式为:8. The method for speaker-independent lip language recognition according to claim 1, wherein the calculation formula of the supervision loss is:
Lseq = −(1/N)·Σ_{i=1..N} Σ_{t=1..T} Σ_{j=1..C} p_{t,j}^i·log p̂_{t,j}^i

其中，Si = [S(f_1^i), S(f_2^i), …, S(f_T^i)]，p̂_t^i = Ep(Si, ŷ_0^i, …, ŷ_{t−1}^i)，ŷ_k^i为第k项的唇语预测输出；where Si = [S(f_1^i), S(f_2^i), …, S(f_T^i)], p̂_t^i = Ep(Si, ŷ_0^i, …, ŷ_{t−1}^i), and ŷ_k^i is the lip language prediction output of item k;

其中，Lseq表示所述监督损失；N表示所述说话人样本的数量；T表示说话人样本中的帧数；C表示文本类别的数量；p_{t,j}^i为样本i的第t帧的文本类别为j的真实概率，p̂_{t,j}^i为说话人样本i的第t帧的文本类别为j的预测概率；Si表示所述语义特征的编码矩阵；Ep表示基于自注意力机制的所述唇语预测网络；f_1^i、f_2^i、…、f_T^i分别表示第i个样本的第1帧到第T帧图像；S(f_1^i)、…、S(f_T^i)为对应各帧的语义特征；第t项的唇语预测输出根据所有帧的语义特征以及第0项到第t−1项的唇语预测输出内容判定。Here Lseq denotes the supervision loss; N the number of speaker samples; T the number of frames in a speaker sample; C the number of text categories; p_{t,j}^i the true probability that the text category of frame t of sample i is j, and p̂_{t,j}^i the corresponding predicted probability; Si the encoding matrix of the semantic features; Ep the self-attention-based lip language prediction network; f_1^i through f_T^i the 1st to T-th frame images of the i-th sample, with S(f_1^i) through S(f_T^i) their semantic features. The lip language prediction output of item t is determined from the semantic features of all frames together with the prediction outputs of items 0 through t−1.
9.根据权利要求1所述的用于说话人无关的唇语识别方法，其特征在于，所述以所述对比损失、所述差异损失、所述高斯分布差异损失、所述相关损失、所述重建误差损失和所述监督损失作为优化目标，对所述身份与语义深度耦合模型和所述唇语预测网络进行迭代寻优，得到最优唇语识别模型，包括：以加权损失为优化函数，利用Adam优化器对所述身份与语义深度耦合模型以及所述唇语预测网络进行迭代学习，得到优化后的身份与语义深度耦合模型以及唇语预测网络；其中，所述优化函数为L(θ)=Lseq+α1Lc+α2Ld+α3Ldd+α4LR+α5Lcon，其中，L(θ)为加权损失；Lseq为所述监督损失；Lc为所述对比损失；Ld为所述差异损失；Ldd表示所述高斯分布差异损失；LR表示所述相关损失；Lcon表示所述重建误差损失；α1表示所述对比损失的权重，α2表示所述差异损失的权重，α3表示所述高斯分布差异损失的权重，α4表示所述相关损失的权重，α5表示所述重建误差损失的权重。9. The speaker-independent lip language recognition method according to claim 1, wherein taking the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss and the supervision loss as optimization objectives and iteratively optimizing the identity and semantic depth coupling model and the lip language prediction network to obtain the optimal lip language recognition model includes: taking the weighted loss as the optimization function and using the Adam optimizer to iteratively learn the identity and semantic depth coupling model and the lip language prediction network, obtaining the optimized identity and semantic depth coupling model and lip language prediction network; wherein the optimization function is L(θ)=Lseq+α1·Lc+α2·Ld+α3·Ldd+α4·LR+α5·Lcon, where L(θ) is the weighted loss; Lseq is the supervision loss; Lc is the contrast loss; Ld is the difference loss; Ldd is the Gaussian distribution difference loss; LR is the correlation loss; Lcon is the reconstruction error loss; and α1, α2, α3, α4 and α5 are the weights of the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss and the reconstruction error loss, respectively. 10.
A speaker-independent lip language recognition system, comprising: 第一获取模块,用于获取多个说话人样本的训练唇语图片序列;The first acquisition module is used to acquire the training lip language picture sequence of multiple speaker samples; 特征输出模块,用于将多个所述训练唇语图片序列输入身份与语义深度耦合模型中,得到身份特征序列、语义特征序列和重建图像序列;所述身份与语义深度耦合模型包括:2D稠密卷积神经网络、3D稠密卷积神经网络和反卷积神经网络;所述2D稠密卷积神经网络用于编码所述训练唇语图片序列的身份特征,得到所述身份特征序列;所述3D稠密卷积神经网络用于编码所述训练唇语图片序列的语义特征,得到所述语义特征序列;所述反卷积神经网络用于对所述身份特征序列与所述语义特征序列进行重建耦合,得到所述重建图像序列;The feature output module is used to input a plurality of the training lip language picture sequences into the identity and semantic depth coupling model to obtain the identity feature sequence, the semantic feature sequence and the reconstructed image sequence; the identity and semantic depth coupling model includes: 2D dense Convolutional neural network, 3D dense convolutional neural network and deconvolutional neural network; the 2D dense convolutional neural network is used to encode the identity feature of the training lip language picture sequence to obtain the identity feature sequence; the 3D The dense convolutional neural network is used to encode the semantic features of the training lip language picture sequence to obtain the semantic feature sequence; the deconvolutional neural network is used to reconstruct and couple the identity feature sequence and the semantic feature sequence , to obtain the reconstructed image sequence; 第一计算模块,用于根据所述身份特征序列中不同说话人样本的身份特征计算对比损失;a first computing module, configured to calculate a comparison loss according to the identity features of different speaker samples in the identity feature sequence; 第二计算模块,用于根据所述身份特征序列中相同说话人样本的不同帧的身份特征计算差异损失;a second calculation module, configured to calculate the difference loss according to the identity features of different frames of the same speaker sample in the sequence of identity features; 第三计算模块,用于基于高斯分布方法计算所述语义特征序列的高斯分布差异损失;a third calculation module, configured to calculate the Gaussian distribution difference loss of the semantic feature sequence based on the Gaussian distribution method; 第四计算模块,用于根据所述身份特征序列和所述语义特征序列计算相关损失;a fourth calculation module, configured to calculate a correlation loss according to the identity feature sequence and the semantic feature sequence; 第五计算模块,用于根据所述训练唇语图片序列和所述重建图像序列计算重建误差损失;a fifth calculation module, configured to calculate the reconstruction error loss according to the training lip language picture sequence and the reconstructed image sequence; 文本输出模块,用于将所述语义特征序列输入唇语预测网络中,得到预测文本序列;a text output module for inputting the semantic feature sequence into the lip language prediction network to obtain a predicted text sequence; 第六计算模块,用于根据所述预测文本序列和真实文本序列计算监督损失;The sixth calculation module is used for calculating the supervision loss according to the predicted text sequence and the real text sequence; 训练模块,用于以所述对比损失、所述差异损失、所述高斯分布差异损失、所述相关损失、所述重建误差损失和所述监督损失作为优化目标,对所述身份与语义深度耦合模型和所述唇语预测网络进行迭代寻优,得到最优唇语识别模型;A training module for using the contrast loss, the difference loss, the Gaussian distribution difference loss, the correlation loss, the reconstruction error loss, and the supervision loss as optimization goals to deeply couple the identity and semantics The model and the lip language prediction network are iteratively optimized to obtain the optimal lip language recognition model; 第二获取模块,用于获取待识别唇语图片序列;The second acquisition module is used to acquire the sequence of lip language pictures to be recognized; 识别模块,用于将所述待识别唇语图片序列输入最优唇语识别模型中,得到识别文本。The recognition module is used for inputting the lip language picture sequence to be 
recognized into the optimal lip language recognition model to obtain the recognized text.
CN202110226432.4A 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence Active CN112949481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110226432.4A CN112949481B (en) 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110226432.4A CN112949481B (en) 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence

Publications (2)

Publication Number Publication Date
CN112949481A true CN112949481A (en) 2021-06-11
CN112949481B CN112949481B (en) 2023-09-22

Family

ID=76246958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110226432.4A Active CN112949481B (en) 2021-03-01 2021-03-01 Lip language identification method and system for speaker independence

Country Status (1)

Country Link
CN (1) CN112949481B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339806A (en) * 2018-12-19 2020-06-26 马上消费金融股份有限公司 Training method of lip language recognition model, living body recognition method and device
WO2020252922A1 (en) * 2019-06-21 2020-12-24 平安科技(深圳)有限公司 Deep learning-based lip reading method and apparatus, electronic device, and medium
CN112330713A (en) * 2020-11-26 2021-02-05 南京工程学院 Improvement method of speech comprehension in severely hearing impaired patients based on lip recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
马宁;田国栋;周曦;: "一种基于long short-term memory的唇语识别方法", 中国科学院大学学报, no. 01 *
马惠珠;宋朝晖;季飞;侯嘉;熊小芸;: "项目计算机辅助受理的研究方向与关键词――2012年度受理情况与2013年度注意事项", 电子与信息学报, no. 01 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313056A (en) * 2021-06-16 2021-08-27 中国科学技术大学 Compact 3D convolution-based lip language identification method, system, device and storage medium
CN114466179A (en) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 Method and device for measuring synchronism of voice and image
CN114092496A (en) * 2021-11-30 2022-02-25 西安邮电大学 A method and system for lip segmentation based on spatial weighting
CN114141249A (en) * 2021-12-02 2022-03-04 河南职业技术学院 Teaching voice recognition optimization method and system
CN114419731A (en) * 2021-12-28 2022-04-29 西安邮电大学 A method and system for lip language recognition based on staged cross-training
CN114419731B (en) * 2021-12-28 2025-02-07 西安邮电大学 A lip reading recognition method and system based on phased cross-training
CN116959060A (en) * 2023-04-20 2023-10-27 湘潭大学 Lip language identification method for patient with language disorder in hospital environment

Also Published As

Publication number Publication date
CN112949481B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN112949481B (en) Lip language identification method and system for speaker independence
CN108596958B (en) A Target Tracking Method Based on Difficult Positive Sample Generation
CN112330713B (en) Improvement method for speech understanding degree of severe hearing impairment patient based on lip language recognition
CN113159023B (en) Scene text recognition method based on explicit supervised attention mechanism
CN114220154A (en) A deep learning-based micro-expression feature extraction and recognition method
CN114821640A (en) Skeleton action identification method based on multi-stream multi-scale expansion space-time diagram convolution network
CN115439857A (en) Inclined character recognition method based on complex background image
CN113450313B (en) Image significance visualization method based on regional contrast learning
CN111695457A (en) Human body posture estimation method based on weak supervision mechanism
CN112232395B (en) Semi-supervised image classification method for generating countermeasure network based on joint training
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN116246338B (en) Behavior recognition method based on graph convolution and transducer composite neural network
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN112365551A (en) Image quality processing system, method, device and medium
CN115100709B (en) Feature separation image face recognition and age estimation method
CN111967358A (en) Neural network gait recognition method based on attention mechanism
CN112289338A (en) Signal processing method and device, computer device and readable storage medium
CN113936333A (en) Action recognition algorithm based on human body skeleton sequence
CN115687772A (en) Sequence recommendation method based on sequence dependence enhanced self-attention network
CN113489958A (en) Dynamic gesture recognition method and system based on video coding data multi-feature fusion
CN116935292B (en) Short video scene classification method and system based on self-attention model
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN118585964A (en) Video saliency prediction method and system based on audio-visual correlation feature fusion strategy
CN111339782A (en) Sign language translation system and method based on multilevel semantic analysis
CN116884412A (en) A lip language recognition method based on hybrid three-dimensional residual gated recurrent units

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant