WO2023088080A1

WO2023088080A1 - Speaking video generation method and apparatus, and electronic device and storage medium

Info

Publication number: WO2023088080A1
Application number: PCT/CN2022/128584
Authority: WO
Inventors: 王宇欣; 吴文岩
Original assignee: 上海商汤智能科技有限公司
Priority date: 2021-11-22
Filing date: 2022-10-31
Publication date: 2023-05-25
Also published as: CN114093384A

Abstract

Disclosed are a speaking video generation method and apparatus, and a device and a storage medium. The method comprises: acquiring phoneme features and acoustic features of voice drive data, wherein the voice drive data comprises at least one of audio and text; acquiring at least one set of facial key point information of a target object in a first image according to the phoneme features and the acoustic features; according to the at least one set of facial key point information and a second image including the face of the target object, obtaining at least one target facial image corresponding to the voice drive data, wherein a set area comprising a specific part of the target object in the second image is blocked; and obtaining a speaking video of the target object according to the voice drive data and the at least one target facial image.

Description

Talking video generation method, device, electronic device and storage medium

相关申请的交叉引用Cross References to Related Applications

本专利申请要求于2021年11月22日提交的、申请号为202111386695.8的中国专利申请的优先权，其全部内容通过引用并入本文中。This patent application claims priority to Chinese Patent Application No. 202111386695.8 filed on November 22, 2021, the entire contents of which are incorporated herein by reference.

technical field

本公开涉及计算机视觉技术领域，具体涉及一种说话视频生成方法、装置、设备以及存储介质。The present disclosure relates to the technical field of computer vision, and in particular to a method, device, device and storage medium for generating a speaking video.

Background technique

说话视频生成技术是语音驱动人物形象、跨模态视频生成等中用到的一类重要技术，在虚拟数字对象商业化中起到关键作用。目前通常根据语音帧确定对应的口型图像，从而获取输出语音对应的一系列口型图像来生成说话视频，然而该方法所生成的视频中说话人的口型准确度较低且口型变化生硬。Talking video generation technology is an important technology used in voice-driven character images and cross-modal video generation, and plays a key role in the commercialization of virtual digital objects. At present, the corresponding mouth shape image is usually determined according to the voice frame, so as to obtain a series of mouth shape images corresponding to the output voice to generate a speaking video. However, the accuracy of the speaker's mouth shape in the video generated by this method is low and the mouth shape changes abruptly. .

发明内容Contents of the invention

本公开实施例提供一种说话视频生成方案。An embodiment of the present disclosure provides a solution for generating a talking video.

根据本公开的第一方面，提供一种说话视频生成方法，所述方法包括：获取声音驱动数据的音素特征以及声学特征，所述声音驱动数据包括音频、文本中的至少一项；根据所述音素特征和所述声学特征获取第一图像中目标对象的至少一组脸部关键点信息；根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像，其中，所述第二图像中包括所述目标对象的特定部位的设定区域被遮挡；根据所述声音驱动数据和所述至少一个目标脸部图像，得到所述目标对象的说话视频。According to the first aspect of the present disclosure, there is provided a method for generating a talking video, the method comprising: acquiring phoneme features and acoustic features of sound-driven data, the sound-driven data including at least one of audio and text; according to the The phoneme feature and the acoustic feature obtain at least one set of facial key point information of the target object in the first image; according to the at least one set of facial key point information and the second image containing the face of the target object, obtain At least one target facial image corresponding to the sound driving data, wherein a set area including a specific part of the target object in the second image is blocked; according to the sound driving data and the at least one target Facial image to obtain the speaking video of the target object.

结合本公开提供的任一实施方式，所述获取声音驱动数据的音素特征以及声学特征，包括：获取所述声音驱动数据对应的音频所包含的音素以及各个音素对应的时间戳，得到所述声音驱动数据的音素特征；对所述声音驱动数据对应的音频进行特征提取，得到所述声音驱动数据的声学特征。In combination with any implementation manner provided by the present disclosure, the acquiring the phoneme features and acoustic features of the sound driving data includes: acquiring the phonemes contained in the audio corresponding to the sound driving data and the time stamps corresponding to each phoneme, and obtaining the sound The phoneme feature of the driving data; performing feature extraction on the audio corresponding to the sound driving data to obtain the acoustic features of the sound driving data.

结合本公开提供的任一实施方式，所述根据所述音素特征和所述声学特征获取第一图像中目标对象的至少一组脸部关键点信息，包括：获取所述音素特征所包含的多个子音素特征以及所述多个子音素特征对应的子声学特征；将所述子音素特征和对应的子声学特征输入至脸部关键点提取网络，得到与所述子音素特征和所述子声学特征对应的脸部关键点信息。In combination with any implementation manner provided by the present disclosure, the acquiring at least one set of facial key point information of the target object in the first image according to the phoneme features and the acoustic features includes: acquiring multiple phoneme features contained in the phoneme features Sub-phoneme features and sub-acoustic features corresponding to the plurality of sub-phoneme features; the sub-phoneme features and corresponding sub-acoustic features are input to the face key point extraction network, and the sub-phoneme features and the sub-acoustic features are obtained. Corresponding facial key point information.

结合本公开提供的任一实施方式，所述脸部关键点信息包括3D脸部关键点信息，在根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像之前，所述方法还包括：将所述3D脸部关键点信息投影到2D平面上，得到所述3D脸部关键点信息对应的2D脸部关键点信息；利用所述2D脸部关键点信息更新所述脸部关键点信息。In combination with any implementation manner provided by the present disclosure, the facial key point information includes 3D facial key point information, and according to the at least one set of facial key point information and the second image containing the face of the target object Before obtaining at least one target facial image corresponding to the sound driving data, the method further includes: projecting the 3D facial key point information onto a 2D plane, and obtaining the 3D facial key point information corresponding to 2D facial key point information; using the 2D facial key point information to update the facial key point information.

结合本公开提供的任一实施方式，在根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像之前，所述方法还包括：对多组脸部关键点信息进行滤波处理，使每个图像帧的脸部关键点信息与该图像帧的相邻帧的脸部关键点信息之间的变化量满足设定条件。In combination with any implementation manner provided by the present disclosure, at least one target face corresponding to the sound driving data is obtained according to the at least one set of face key point information and the second image containing the face of the target object Before the image, the method also includes: performing filtering processing on multiple groups of face key point information, so that the change between the face key point information of each image frame and the face key point information of adjacent frames of the image frame The quantity satisfies the set conditions.

结合本公开提供的任一实施方式，所述根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像，包括：将每组脸部关键点信息与所述第二图像输入至脸部补全网络，得到与所述脸部关键点信息对应的目标脸部图像，其中，所述脸部补全网络用于根据脸部关键点信息对所述第二图像中被遮挡的设定区域进行补全。In combination with any implementation manner provided by the present disclosure, at least one target face corresponding to the sound driving data is obtained according to the at least one set of facial key point information and the second image containing the face of the target object The internal image includes: inputting each group of face key point information and the second image to the face complement network to obtain a target face image corresponding to the face key point information, wherein the face complement The whole network is used to complement the occluded set area in the second image according to the key point information of the face.

结合本公开提供的任一实施方式，所述根据所述声音驱动数据和所述至少一个目标脸部图像，得到所述目标对象的说话视频，包括：将所述至少一个目标脸部图像与设定背景图像进行融合，得到第一图像序列；根据所述第一图像序列与所述声音驱动数据对应的音频，得到所述目标对象的说话视频。In combination with any implementation manner provided by the present disclosure, the obtaining the speaking video of the target object according to the sound driving data and the at least one target facial image includes: combining the at least one target facial image with the set fused with a given background image to obtain a first image sequence; according to the audio corresponding to the first image sequence and the sound driving data, a speaking video of the target object is obtained.

结合本公开提供的任一实施方式，所述脸部关键点提取网络利用音素特征样本和对应的声学特征样本训练得到，其中，所述音素特征样本和所述声学特征样本包括标注的脸部关键点信息。In combination with any of the implementations provided in the present disclosure, the facial key point extraction network is trained using phoneme feature samples and corresponding acoustic feature samples, wherein the phoneme feature samples and the acoustic feature samples include labeled facial key points point information.

结合本公开提供的任一实施方式，所述脸部关键点提取网络通过以下方式训练得到：根据所述音素特征样本和对应的声学特征样本，对初始脸部关键点提取网络进行训练，在网络损失的变化满足收敛条件时完成训练得到所述脸部关键点提取网络，其中，所述网络损失包括所述初始神经网络预测得到的脸部关键点信息与标注的脸部关键点信息之间的差异。In combination with any of the implementations provided in the present disclosure, the facial key point extraction network is trained in the following manner: according to the phoneme feature samples and the corresponding acoustic feature samples, the initial facial key point extraction network is trained, and the network When the change of the loss meets the convergence condition, the training is completed to obtain the facial key point extraction network, wherein the network loss includes the difference between the facial key point information predicted by the initial neural network and the marked facial key point information. difference.

结合本公开提供的任一实施方式，所述音素特征样本和所述声学特征样本通过对一对象的音频的音素特征和声学特征进行所述对象的脸部关键点信息标注得到。In combination with any implementation manner provided in the present disclosure, the phoneme feature sample and the acoustic feature sample are obtained by marking the object's facial key point information on the phoneme feature and the acoustic feature of an object's audio.

结合本公开提供的任一实施方式，所述音素特征样本和所述声学特征样本通过以下方式得到：获取所述对象的说话视频；根据所述说话视频获取多个脸部图像，以及与每个所述脸部图像对应的至少一个音频帧；获取每个所述脸部图像对应的至少一个音频帧的音素特征以及声学特征；根据所述多个脸部图像获取脸部关键点信息，并根据所述脸部关键点信息对所述音素特征和所述声学特征进行标注，得到所述音素特征样本和所述声学特征样本。In combination with any implementation manner provided by the present disclosure, the phoneme feature sample and the acoustic feature sample are obtained in the following manner: acquiring a speaking video of the object; acquiring multiple facial images according to the speaking video, and combining with each At least one audio frame corresponding to the facial image; obtaining phoneme features and acoustic features of at least one audio frame corresponding to each facial image; obtaining facial key point information according to the plurality of facial images, and according to The facial key point information marks the phoneme features and the acoustic features to obtain the phoneme feature samples and the acoustic feature samples.

结合本公开提供的任一实施方式，所述脸部补全网络利用生成对抗网络训练得到，所述生成对抗网络包括所述脸部补全网络和第一鉴别网络，所述训练的网络损失包括：第一损失，用于指示所述脸部补全网络输出的脸部补全图像与完整脸部图像之间的差异，其中，所述完整脸部图像是所述脸部关键点信息对应的脸部图像；第二损失，用于指示所述第一鉴别网络对于输入图像输出的分类结果与所述输入图像的标注信息之间的差异，其中，所述标注信息指示所述输入图像为所述脸部补全网络输出的脸部补全图像或者为真实脸部图像。In combination with any of the embodiments provided in the present disclosure, the face completion network is trained by generating an adversarial network, the generation adversarial network includes the face completion network and the first identification network, and the trained network loss includes : The first loss, which is used to indicate the difference between the face completion image output by the face completion network and the complete face image, wherein the complete face image is corresponding to the facial key point information A face image; a second loss, which is used to indicate the difference between the classification result output by the first discrimination network for the input image and the annotation information of the input image, wherein the annotation information indicates that the input image is the The face complement image output by the face complement network may be a real face image.

结合本公开提供的任一实施方式，所述生成对抗网络还包括第二鉴别网络，所述训练的网络损失还包括：第三损失，用于指示所述第二鉴别网络对于所述脸部补全图像与音素特征之间的对应的判别结果与真实对应结果之间的差异。In combination with any implementation manner provided by the present disclosure, the generation confrontation network further includes a second discriminant network, and the trained network loss further includes: a third loss, which is used to instruct the second discriminative network to be effective for the face complement The difference between the discriminative result and the true corresponding result of the correspondence between the full image and the phoneme features.

根据本公开的第二方面，提供一种说话视频生成装置，所述装置包括：第一获取单元，用于获取声音驱动数据的音素特征以及声学特征，所述声音驱动数据包括音频、文本中的至少一项；第二获取单元，用于根据所述音素特征和所述声学特征获取第一图像中目标对象的至少一组脸部关键点信息；第一得到单元，用于根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像，其中，所述第二图像中包括所述目标对象的特定部位的设定区域被遮挡；第二得到单元，用于根据所述声音驱动数据和所述至少一个目标脸部图像，得到所述目标对象的说话视频。According to a second aspect of the present disclosure, there is provided an apparatus for generating a speaking video, the apparatus comprising: a first acquisition unit configured to acquire phoneme features and acoustic features of sound driving data, the sound driving data including audio, text At least one item; a second obtaining unit, configured to obtain at least one set of facial key point information of the target object in the first image according to the phoneme feature and the acoustic feature; a first obtaining unit, configured to obtain at least one set of face key point information according to the at least one Combining face key point information and a second image containing the face of the target object to obtain at least one target face image corresponding to the sound driving data, wherein the second image includes the face of the target object The set area of the specific part is blocked; the second obtaining unit is configured to obtain the speaking video of the target object according to the sound driving data and the at least one target facial image.

根据本公开的第三方面，提供一种电子设备，所述设备包括存储器、处理器，所述存储器用于存储可在处理器上运行的计算机指令，所述处理器用于在执行所述计算机指令时实现本公开提供的任一实施方式所述的说话视频生成方法。According to a third aspect of the present disclosure, there is provided an electronic device, the device includes a memory and a processor, the memory is used to store computer instructions executable on the processor, and the processor is used to execute the computer instructions At this time, the method for generating a speaking video described in any implementation manner provided by the present disclosure is implemented.

根据本公开的第四方面，提供一种计算机可读存储介质，其上存储有计算机程序，所述程序被处理器执行时实现本公开提供的任一实施方式所述的说话视频生成方法。According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the method for generating a talking video in any implementation manner provided by the present disclosure is implemented.

本公开一个或多个实施例的说话视频生成方法、装置、设备及计算机可读存储介质，根据声音驱动数据的音素特征以及声学特征，获取第一图像中目标对象的至少一组脸部关键点信息；并根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像，其中，所述第二图像中包括所述目标对象的特定部位的设定区域被遮挡；最后根据所述声音驱动数据和所述至少一个目标脸部图像，得到所述目标对象的说话视频。本公开实施例根据与声音驱动数据对应的目标对象的脸部关键信息以及所述目标对象的遮挡了特定部位的图像来生成目标脸部图像，所得到的所述目标对象的说话视频中所述目标对象的口型与所述声音驱动数据的匹配度高，且口型变化连贯，所述目标对象说话状态真实、自然。The speaking video generation method, device, device, and computer-readable storage medium of one or more embodiments of the present disclosure acquire at least one set of facial key points of the target object in the first image according to the phoneme features and acoustic features of the voice driving data information; and according to the at least one set of face key point information and the second image containing the face of the target object, at least one target face image corresponding to the sound driving data is obtained, wherein the second A set area including a specific part of the target object in the image is blocked; finally, according to the sound driving data and the at least one target face image, a speaking video of the target object is obtained. In the embodiment of the present disclosure, a target face image is generated according to the key face information of the target object corresponding to the sound driving data and the image of the target object covering a specific part, and the obtained speech video of the target object is described in The mouth shape of the target object has a high degree of matching with the voice driving data, and the mouth shape changes coherently, and the speaking state of the target object is real and natural.

Description of drawings

为了更清楚地说明本说明书一个或多个实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本说明书一个或多个实施例中记载的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in one or more embodiments of this specification, the following will briefly introduce the drawings that need to be used in the description of the embodiments. Obviously, the drawings in the following description are only one example of this specification. Or some embodiments described in multiple embodiments, for those skilled in the art, other drawings can also be obtained according to these drawings without paying creative efforts.

图1是本公开至少一个实施例提出的一种说话视频生成方法的流程图；FIG. 1 is a flow chart of a method for generating a speaking video proposed by at least one embodiment of the present disclosure;

图2是本公开至少一个实施例提出的脸部关键点提取网络训练方法流程图；Fig. 2 is a flow chart of a face key point extraction network training method proposed by at least one embodiment of the present disclosure;

图3是本公开至少一个实施例提出的样本获取方法流程图；Fig. 3 is a flowchart of a sample acquisition method proposed by at least one embodiment of the present disclosure;

图4是本公开至少一个实施例提出的另一种说话视频生成方法的流程图；Fig. 4 is a flowchart of another method for generating a speaking video proposed by at least one embodiment of the present disclosure;

图5是图4所示的说话视频生成方法的示意图；Fig. 5 is a schematic diagram of the speaking video generation method shown in Fig. 4;

图6是本公开至少一个实施例提出的说话视频生成方法中获取脸部关键点信息的示意图；Fig. 6 is a schematic diagram of acquiring face key point information in a method for generating a speaking video proposed by at least one embodiment of the present disclosure;

图7是本公开至少一个实施例提出的说话视频生成装置的结构示意图；Fig. 7 is a schematic structural diagram of a speaking video generation device proposed by at least one embodiment of the present disclosure;

图8是本公开至少一个实施例提出的电子设备的结构示意图。Fig. 8 is a schematic structural diagram of an electronic device proposed by at least one embodiment of the present disclosure.

Detailed ways

这里将详细地对示例性实施例进行说明，其示例表示在附图中。下面的描述涉及附图时，除非另有表示，不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反，它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present disclosure as recited in the appended claims.

本文中术语“和/或”，仅仅是一种描述关联对象的关联关系，表示可以存在三种关系，例如，A和/或B，可以表示：单独存在A，同时存在A和B，单独存在B这三种情况。另外，本文中术语“至少一种”表示多种中的任意一种或多种中的至少两种的任意组合，例如，包括A、B、C中的至少一种，可以表示包括从A、B和C构成的集合中选择的任意一个或多个元素。The term "and/or" in this article is just an association relationship describing associated objects, which means that there can be three relationships, for example, A and/or B can mean: A exists alone, A and B exist simultaneously, and there exists alone B these three situations. In addition, the term "at least one" herein means any one of a variety or any combination of at least two of the more, for example, including at least one of A, B, and C, which may mean including from A, Any one or more elements selected from the set formed by B and C.

本公开至少一个实施例提供了一种说话视频生成方法，该方法可以由终端设备或服务器等电子设备执行。所述终端设备可以是固定终端或移动终端，例如手机、平板电脑、游戏机、台式机、广告机、一体机、车载设备等等，所述服务器包括本地服务器或云端服务器等。所述方法还可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。At least one embodiment of the present disclosure provides a method for generating a talking video, and the method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game machine, a desktop computer, an advertising machine, an all-in-one machine, a vehicle-mounted device, etc., and the server includes a local server or a cloud server. The method can also be implemented by a processor invoking computer-readable instructions stored in a memory.

图1示出根据本公开至少一个实施例的说话视频生成方法的流程图，如图1所示，所述方法包括步骤101～步骤104。Fig. 1 shows a flowchart of a method for generating a speaking video according to at least one embodiment of the present disclosure. As shown in Fig. 1 , the method includes steps 101-104.

在步骤101中，获取声音驱动数据的音素特征以及声学特征。In step 101, phoneme features and acoustic features of sound driving data are acquired.

音素是构成音节的最小语音单位，所述声音驱动数据对应的音频可以包含一个或多个音素，音素特征可以包括表示各个音素的发音起止时间的特征。以所述音频是“你好”语音段为例，所述声音驱动数据的音素特征例如可以包括：n[0,0.2]，i3[0.2,0.4]，h[0.5,0.7]，ao3[0.7,1.2]，其中，[]内的数据指示相应音素的发音起止时间，单位例如为秒。在本公开实施例中，可以通过获取所述声音驱动数据对应的音频所包含的音素以及各个音素对应的时间戳，得到所述声音驱动数据的音素特征。A phoneme is the smallest unit of speech constituting a syllable. The audio corresponding to the sound driving data may contain one or more phonemes, and phoneme features may include features representing the start and end times of pronunciation of each phoneme. Taking the audio segment "Hello" as an example, the phoneme features of the sound-driven data may include, for example: n[0,0.2], i3[0.2,0.4], h[0.5,0.7], ao3[0.7 ,1.2], wherein, the data in [] indicates the starting and ending time of the pronunciation of the corresponding phoneme, and the unit is, for example, seconds. In the embodiment of the present disclosure, the phoneme features of the sound driving data may be obtained by acquiring the phonemes contained in the audio corresponding to the sound driving data and the time stamps corresponding to each phoneme.

声学特征主要用于描述音频的发音特点，所述声学特征包括但不限于线性预测参数、梅尔频率倒谱系数、感知线性高效能系数等中的至少一种。在本公开实施例中，所述声学特征例如为梅尔频率倒谱系数。在本公开实施例中，所述声音驱动数据的声学特征可以通过对所述声音驱动数据对应的音频进行特征提取得到。The acoustic features are mainly used to describe the pronunciation characteristics of the audio, and the acoustic features include but are not limited to at least one of linear prediction parameters, Mel-frequency cepstral coefficients, perceptual linear high-efficiency coefficients, and the like. In the embodiment of the present disclosure, the acoustic features are, for example, Mel-frequency cepstral coefficients. In the embodiment of the present disclosure, the acoustic features of the sound driving data may be obtained by performing feature extraction on the audio corresponding to the sound driving data.

所述声音驱动数据被预先存储在执行该说话视频生成方法的电子设备中，或预先存储在该电子设备以外的其他设备中，或者通过声音采集设备现场采集获得，等等，本公开不对所述声音驱动数据的来源进行限制。在本公开实施例，所述声音驱动数据可以包括音频、文本中的至少一项。The sound driving data is pre-stored in the electronic device that executes the talking video generation method, or in other devices other than the electronic device, or is collected on-site by a sound collection device, etc. Sources of sound driver data are restricted. In the embodiment of the present disclosure, the sound driving data may include at least one of audio and text.

在所述声音驱动数据只包括音频的情况下，可以通过对所述音频进行语音识别，确定所述音频对应的文本(例如，文字信息)；In the case where the sound driving data only includes audio, the text (for example, text information) corresponding to the audio can be determined by performing speech recognition on the audio;

在所述声音驱动数据只包括文本的情况下，可以通过对所述文本进行语音合成，将文本所对应的文字信息转化为音频(例如，语音段)；In the case where the sound driving data only includes text, the text information corresponding to the text can be converted into audio (for example, a speech segment) by performing speech synthesis on the text;

在所述声音驱动数据包括音频和文本的情况下，该音频和文本对应于相同的发音。例如，该文本是“你好”，所述声音驱动数据中的音频则是发出“你好”声音的语音段。Where the sound-driven data includes audio and text, the audio and text correspond to the same pronunciation. For example, the text is "Hello", and the audio in the sound driving data is a speech segment that emits the sound of "Hello".

在一些实施例中，通过对所述声音驱动数据对应的音频和文本进行对齐操作，可以得到所述声音驱动数据的音素特征。其中，对齐操作是指将音频中各个语音段与该语音段所对应的文本中的音素进行对齐，也即确定在所述音频中从何时开始发出文本对应的读音。通过对音频和文本进行对齐操作，一方面确定了所述音频所包含的音素，同时根据读音的持续时间可以得到各个音素对应的时间戳，从而可以得到所述声音驱动数据的音素特征。In some embodiments, the phoneme features of the sound driving data can be obtained by performing an alignment operation on the audio and text corresponding to the sound driving data. Wherein, the alignment operation refers to aligning each speech segment in the audio with the phoneme in the text corresponding to the speech segment, that is, determining when the pronunciation corresponding to the text in the audio starts to be pronounced. By aligning the audio and text, on the one hand, the phonemes contained in the audio are determined, and at the same time, the time stamp corresponding to each phoneme can be obtained according to the duration of the pronunciation, so that the phoneme features of the sound driving data can be obtained.

仍以“你好”为例，在对音频和文本进行对齐操作后，则可以确定在0～0.2秒发出音素“n”的读音，在0.2～0.4秒发出“i3”的读音，等等，从而可以得到所述声音驱动数据的音素特征。本领域技术人员应当理解，也可以通过其他方式获取所述声音驱动数据的音素特征，本公开实施例对此不进行限制。Still taking "Hello" as an example, after aligning the audio and text, it can be determined that the pronunciation of the phoneme "n" will be pronounced in 0-0.2 seconds, and the pronunciation of "i3" will be pronounced in 0.2-0.4 seconds, etc., Thus, the phoneme features of the sound driving data can be obtained. Those skilled in the art should understand that the phoneme features of the sound driving data may also be acquired in other ways, which is not limited in this embodiment of the present disclosure.

在本公开实施例中，通过利用音素特征，可以采用音频或文本任意一种方式来对目标对象进行驱动以生成目标对象的说话视频。In the embodiment of the present disclosure, by utilizing the phoneme features, the target object may be driven in any manner of audio or text to generate a speaking video of the target object.

在步骤102中，根据所述音素特征和所述声学特征获取第一图像中目标对象的至少一组脸部关键点信息。In step 102, at least one set of facial key point information of the target object in the first image is obtained according to the phoneme feature and the acoustic feature.

目标对象(例如，人)在发出不同的语音时，口型会发生相应的变化，因此，目标对象的嘴部区域或者包含嘴部区域的设定区域内的脸部关键点的位置会发生相应的变化。由此可知，一个音频帧的音素特征和声学特征是与目标对象的一组脸部关键点信息相对应的。目标对象在发出某一音素的读音时，其面部对应的脸部关键点信息是可以确定的。其中，所述脸部关键点信息包括目标对象的脸部关键点(例如，五官和脸部轮廓对应的关键点)在包含所述目标对象的脸部的图像(例如，第一图像)中的位置信息。在本公开中，可以将各个脸部关键点在同一时刻的信息称为一组脸部关键点信息。When the target object (for example, a person) makes different voices, the mouth shape will change accordingly. Therefore, the position of the target object's mouth area or the facial key points in the set area containing the mouth area will change accordingly. The change. It can be known that the phoneme features and acoustic features of an audio frame correspond to a set of facial key point information of the target object. When the target object pronounces a certain phoneme, the facial key point information corresponding to its face can be determined. Wherein, the facial key point information includes the facial key points of the target object (for example, key points corresponding to facial features and facial contours) in the image (for example, the first image) containing the face of the target object location information. In the present disclosure, the information of each facial key point at the same time may be referred to as a set of facial key point information.

以生成第一图像中目标对象的说话视频为例，在本步骤中，根据所述声音驱动数据的音素特征和声学特征，获取第一图像中目标对象的至少一组脸部关键点信息。在所述声音驱动数据对应的音频包括多个音素的情况下，可以得到这些音素和相应的声学特征所对应的脸部关键点信息序列。其中，所述脸部关键点信息序列包括按照时间顺序排列的多组脸部关键点信息。Taking the generation of the speaking video of the target object in the first image as an example, in this step, at least one set of facial key point information of the target object in the first image is obtained according to the phoneme features and acoustic features of the voice driving data. In the case that the audio corresponding to the sound driving data includes multiple phonemes, the facial key point information sequence corresponding to these phonemes and corresponding acoustic features can be obtained. Wherein, the facial key point information sequence includes multiple sets of facial key point information arranged in chronological order.

本公开实施例在所述声音驱动数据的音素特征的基础上还利用了声学特征，使得所获取的脸部关键点信息与该声音驱动数据对应的音频的发音特征更加匹配，使得后续生成的说话视频更加真实。The embodiment of the present disclosure also utilizes the acoustic features on the basis of the phoneme features of the voice driving data, so that the acquired facial key point information can better match the pronunciation features of the audio corresponding to the voice driving data, so that the subsequent generated speech Video is more realistic.

在步骤103中，根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像。In step 103, at least one target facial image corresponding to the sound driving data is obtained according to the at least one set of facial key point information and the second image including the target object's face.

其中，所述第二图像是包含所述目标对象的脸部的图像，所述第二图像可以通过对所述第一图像进行遮挡处理获得，也可以通过对与所述第一图像不同的另一图像进行遮挡处理获得。例如，第一图像是目标对象A正在微笑的脸部图像，与第一图像不同的另一图像可以是目标对象A正在撇嘴的脸部图像。Wherein, the second image is an image containing the face of the target object, and the second image can be obtained by performing occlusion processing on the first image, or by occluding another image different from the first image. An image is obtained by performing occlusion processing. For example, the first image is a face image of the target object A smiling, and another image different from the first image may be a face image of the target object A curling his lips.

所述第二图像中的包括所述目标对象的特定部位(例如，嘴部)的设定区域被遮挡，所述设定区域包括目标对象在说话时脸部关键点的位置会发生变化的区域，例如，可以是所述目标对象的脸部的下半部分，也可以是额头以下的脸部区域，还可以是嘴部区域，本公开实施例对于遮挡的具体区域不进行限制。In the second image, a set area including a specific part (for example, mouth) of the target object is blocked, and the set area includes an area where the position of key points on the face of the target object changes when the target object speaks For example, it may be the lower half of the face of the target object, or the facial area below the forehead, or the mouth area. Embodiments of the present disclosure do not limit the specific area to be blocked.

在一些实施例中，可以通过对所述设定区域进行噪声填充，生成所述设定区域被遮挡的第二图像。其中，对所述设定区域进行噪声填充是指利用随机生成的像素值对所述设定区域内的各个像素进行设置。本领域技术人员应当理解，也可以通过其他方式进行所述设定区域的遮挡，本公开对此不进行限制。In some embodiments, the second image in which the set area is blocked may be generated by filling the set area with noise. Wherein, performing noise filling on the set area refers to setting each pixel in the set area with randomly generated pixel values. Those skilled in the art should understand that the blocking of the set area can also be performed in other ways, which is not limited in the present disclosure.

根据在步骤102中所得到的至少一组脸部关键点信息，可以对所述第二图像中被遮挡的部分(即，设定区域)进行补全，从而使得第二图像中被遮挡的设定区域中的脸部关键点的分布，与声音驱动数据的音素特征和声学特征一致。这样，根据所述至少一组脸部关键点信息以及所述第二图像生成的至少一个目标脸部图像中，与所述设定区域相对应区域(即，补全后的区域)的脸部关键点信息与所述声音驱动数据是匹配的。According to at least one set of facial key point information obtained in step 102, the blocked part (that is, the set area) in the second image can be complemented, so that the blocked set in the second image The distribution of facial keypoints in a given region is consistent with the phonemic and acoustic features of the voice-driven data. In this way, in the at least one target facial image generated according to the at least one set of facial key point information and the second image, the face in the area corresponding to the set area (that is, the completed area) The key point information is matched with the sound driving data.

在步骤104中，根据所述声音驱动数据和所述至少一个目标脸部图像，得到所述目标对象的说话视频。In step 104, a speaking video of the target object is obtained according to the sound driving data and the at least one target facial image.

本公开实施例中，得到的所述目标对象的说话视频中，所述目标对象输出的声音为所述声音驱动数据对应的音频，而所述说话视频的各个图像帧中，所述目标对象的脸部关键点信息是与输出的声音的音素特征和声学特征对应的。由此生成的说话视频中所述目标对象的口型和说话表情是与发音一致的，使观众产生所述目标对象正在说话的感觉。In the embodiment of the present disclosure, in the obtained speech video of the target object, the sound output by the target object is the audio corresponding to the sound driving data, and in each image frame of the speech video, the sound of the target object The face key point information corresponds to the phoneme feature and the acoustic feature of the output voice. The mouth shapes and speaking expressions of the target object in the generated speaking video are consistent with the pronunciation, so that the audience can feel that the target object is speaking.

本公开实施例根据与声音驱动数据对应的目标对象的至少一组脸部关键点信息以及所述目标对象的遮挡了特定部位的图像来生成至少一个目标脸部图像，所得到的所述目标对象的说话视频中所述目标对象的口型与所述声音驱动数据的匹配度高，且口型变化连贯，所述目标对象说话状态真实、自然。In an embodiment of the present disclosure, at least one target facial image is generated according to at least one set of face key point information of the target object corresponding to the sound driving data and an image of the target object that blocks a specific part, and the obtained target object The mouth shape of the target object in the speaking video matches the voice driving data to a high degree, and the mouth shape changes coherently, and the speaking state of the target object is real and natural.

在一些实施例中，可以将所述至少一个目标脸部图像与设定背景图像进行融合，得到第一视频，并根据所述第一视频与所述声音驱动数据对应的音频，得到所述目标对象的说话视频。在一个示例中，可以将所述目标脸部图像中脸部区域的像素作为前景像素，与所述设定背景图像进行叠加，以实现所述目标脸部图像与设定背景图像的融合。本领域技术人员应当理解，可以采用多种方式对目标脸部图像与设定背景图像进行融合，本公开对此不进行限制In some embodiments, the at least one target face image can be fused with the set background image to obtain the first video, and the target target can be obtained according to the audio corresponding to the first video and the sound driving data. Subject's speaking video. In an example, the pixels of the face area in the target face image may be used as foreground pixels to be superimposed on the set background image, so as to realize the fusion of the target face image and the set background image. Those skilled in the art should understand that multiple ways can be used to fuse the target face image and the set background image, which is not limited in the present disclosure

通过上述方法，可以生成目标对象在任意背景下的说话视频，丰富了说话视频生成方法的应用场景。Through the above method, the speaking video of the target object in any background can be generated, which enriches the application scenarios of the method for generating the speaking video.

在一些实施例中，可以利用脸部关键点提取网络，得到所述音素特征和所述声学特征对应的第一图像中目标对象的至少一组脸部关键点信息。In some embodiments, at least one set of facial key point information of the target object in the first image corresponding to the phoneme feature and the acoustic feature may be obtained by using a facial key point extraction network.

首先，获取所述音素特征所包含的多个子音素特征以及所述多个子音素特征对应的子声学特征。First, multiple sub-phoneme features included in the phoneme features and sub-acoustic features corresponding to the multiple sub-phoneme features are acquired.

在一个示例中，可以通过在所述声音驱动数据的音素特征和声学特征上进行滑窗的方式，获得所述音素特征所包含的多个子音素特征以及与所述多个子音素特征对应的子声学特征。比如，音素特征和声学特征可以按照时间窗口的长度来分别划分出多个子音素特征和子声学特征。具体地，在对所述声音驱动数据的音素特征和声学特征进行滑窗的过程中，可以将每次滑窗操作后获得的该次时间窗口内的音素特征和声学特征，作为子音素特征和子声学特征，并且，在相同时间窗口内的子音素特征和子声学特征是对应于同一语音段的。In an example, a plurality of sub-phoneme features included in the phoneme feature and sub-acoustic sub-features corresponding to the plurality of sub-phoneme features may be obtained by performing a sliding window on the phoneme feature and the acoustic feature of the sound driving data. feature. For example, the phoneme feature and the acoustic feature can be divided into multiple sub-phoneme features and sub-acoustic features according to the length of the time window. Specifically, in the process of sliding the window on the phoneme features and acoustic features of the sound driving data, the phoneme features and acoustic features in the time window obtained after each sliding window operation can be used as sub-phoneme features and sub-phoneme features. Acoustic features, and sub-phoneme features and sub-acoustic features in the same time window correspond to the same speech segment.

接下来，将所述子音素特征和对应的子声学特征输入至经训练的脸部关键点提取网络，得到与所述子音素特征和所述子声学特征对应的脸部关键点信息。具体的，可以将多个子音素特征和对应的多个子声学特征，以多个子音素特征-子声学特征对的形式，按时间顺序输入至脸部关键点提取网络。所述脸部关键点提取网络用于根据每个子音素特征-子声学特征对，确定对应的一组脸部关键点信息。在将所有的子音素特征-子声学特征对输入至脸部关键点提取网络后，即可以得到所述声音驱动数据对应的多组脸部关键点信息。Next, the sub-phoneme features and corresponding sub-acoustic features are input to the trained facial key point extraction network to obtain facial key point information corresponding to the sub-phoneme features and the sub-acoustic features. Specifically, multiple sub-phoneme features and corresponding multiple sub-acoustic features may be input into the facial key point extraction network in time sequence in the form of multiple sub-phoneme feature-sub-acoustic feature pairs. The facial key point extraction network is used to determine a corresponding set of facial key point information according to each sub-phoneme feature-sub-acoustic feature pair. After inputting all the sub-phoneme feature-sub-acoustic feature pairs into the facial key point extraction network, multiple sets of facial key point information corresponding to the voice driving data can be obtained.

在本公开实施例中，通过经训练的脸部关键点提取网络，得到与各个子音素特征-子声学特征对对应的脸部关键点信息，可以实现所述目标对象的发音与口型和说话表情的良好匹配。In the embodiment of the present disclosure, through the trained facial key point extraction network, the facial key point information corresponding to each sub-phoneme feature-sub-acoustic feature pair can be obtained, and the pronunciation, mouth shape and speaking of the target object can be realized. Good match for expressions.

在本公开实施例中，所述脸部关键点提取网络可以是三维3D脸部关键点提取网络，也即所输出的脸部关键点信息是3D脸部关键点信息，所述3D脸部关键点信息除了包含所述脸部关键点的位置信息外，还包括所述脸部关键点的深度信息；所述脸部关键点提取网络也可以是二维2D脸部关键点提取网络，也即所输出的脸部关键点信息是2D脸部关键点信息。In the embodiment of the present disclosure, the facial key point extraction network may be a three-dimensional 3D facial key point extraction network, that is, the output facial key point information is 3D facial key point information, and the 3D facial key point In addition to the position information of the facial key points, the point information also includes the depth information of the facial key points; the facial key point extraction network can also be a two-dimensional 2D facial key point extraction network, that is The output facial key point information is 2D facial key point information.

在所述脸部关键点信息为3D脸部关键点信息的情况下，在根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像之前，所述方法还包括：将所述3D脸部关键点信息投影到2D平面上，得到所述3D脸部关键点信息对应的2D脸部关键点信息；利用所述2D脸部关键点信息更新所述脸部关键点信息。之后，根据至少一组2D脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像；最后，根据所述声音驱动数据和所述至少一个目标脸部图像，得到所述目标对象的说话视频。In the case that the facial key point information is 3D facial key point information, according to the at least one set of facial key point information and the second image containing the face of the target object, the sound corresponding to the sound is obtained. Before driving at least one target facial image corresponding to the data, the method further includes: projecting the 3D facial key point information onto a 2D plane to obtain 2D facial key point information corresponding to the 3D facial key point information ; Utilizing the 2D facial key point information to update the facial key point information. Afterwards, according to at least one set of 2D face key point information and a second image containing the face of the target object, at least one target face image corresponding to the sound driving data is obtained; finally, according to the sound driving data and the at least one target face image to obtain a speaking video of the target object.

在一些实施例中，可以对多组脸部关键点信息进行滤波处理，使最后获得的说话视频中每个图像帧的脸部关键点信息与该图像帧的相邻帧(包括前一帧和/或后一帧)的脸部关键点信息之间的变化量满足设定条件，该设定条件例如可以包括一个图像帧中各个脸部关键点的位置与相邻帧中对应脸部关键点的位置之间的变化量均小于设定阈值。通过上述方法可以滤除脸部关键点信息变化幅度较大的抖动帧，避免所生成的说话视频中出现口型突然变化的情况。In some embodiments, multiple groups of face key point information can be filtered so that the face key point information of each image frame in the finally obtained talking video is consistent with the adjacent frames of the image frame (including the previous frame and The amount of change between the face key point information of the next frame) satisfies the setting condition, which can include, for example, the position of each face key point in an image frame and the corresponding face key point in the adjacent frame. The variations between the positions are all smaller than the set threshold. Vibrating frames with large changes in face key point information can be filtered out by the above method, so as to avoid sudden changes in mouth shape in the generated speaking video.

在一个示例中，可以通过对多组脸部关键点信息在时间窗口上进行高斯滤波，来实现对所述多组脸部关键点信息所对应的连续帧的移动平均处理。其中，移动平均处理指对每一帧的脸部关键点的值和相邻帧的脸部关键点的值进行加权平均，利用加权平均的结果更新该帧的脸部关键点的值。In an example, the moving average processing of the consecutive frames corresponding to the multiple sets of facial key point information may be implemented by performing Gaussian filtering on the multiple sets of facial key point information in a time window. Wherein, the moving average processing refers to carrying out a weighted average of the value of the facial key point of each frame and the value of the facial key point of the adjacent frame, and updating the value of the facial key point of the frame by using the result of the weighted average.

在一些实施例中，可以通过以下方式得到与所述声音驱动数据对应的至少一个目标脸部图像：将每组脸部关键点信息与包含所述目标对象的脸部的第二图像输入至脸部补全网络，得到与所述脸部关键点信息对应的目标脸部图像，其中，所述脸部补全网络用于根据脸部关键点信息对所述第二图像中被遮挡的设定区域进行补全。In some embodiments, at least one target face image corresponding to the sound driving data may be obtained by inputting each set of face key point information and a second image containing the face of the target object into the face A part completion network to obtain a target face image corresponding to the facial key point information, wherein the face completion network is used to set the occlusion in the second image according to the facial key point information area to complete.

在本公开实施例中，通过脸部补全网络，根据脸部关键点信息对所述第二图像中被遮挡的设定区域进行补全，可以使得所述设定区域的脸部关键点信息与输入的脸部关键点信息一致，从而使得所述目标对象的口型以及说话表情与发出的语音匹配，并且利用脸部补全网络对所述第二图像中被遮挡的设定区域进行补全，可以生成清晰程度高的目标脸部图像。In the embodiment of the present disclosure, the masked set area in the second image is complemented according to the face key point information through the face completion network, so that the face key point information of the set area Consistent with the input facial key point information, so that the target object's mouth shape and speaking expression match the voice, and use the face complement network to complement the blocked set area in the second image It can generate high-definition target face images.

在一些实施例中，所述脸部关键点提取网络可以利用音素特征样本和声学特征样本进行训练得到。该训练方法可以由服务器执行，并且执行该训练方法的服务器与执行上述说话视频生成方法的设备可以是不同的。In some embodiments, the facial key point extraction network can be obtained by using phoneme feature samples and acoustic feature samples for training. The training method may be executed by a server, and the server executing the training method may be different from the device executing the above-mentioned talking video generation method.

图2示出了本公开至少一个实施例提出的脸部关键点提取网络的训练方法，如图2所示，该训练方法包括步骤201～202。FIG. 2 shows a training method for a facial key point extraction network proposed by at least one embodiment of the present disclosure. As shown in FIG. 2 , the training method includes steps 201-202.

在步骤201中，获取音素特征样本和对应的声学特征样本，所述音素特征样本和所述声学特征样本包括标注的所述目标对象的脸部关键点信息。其中，所述音素特征样本和对应的声学特征样本，是基于同一语音段得到的，并且所述音素特征样本和对应的声学特征样本中所标注的脸部关键点信息是相同的。In step 201, a phoneme feature sample and a corresponding acoustic feature sample are acquired, and the phoneme feature sample and the acoustic feature sample include marked facial key point information of the target object. Wherein, the phoneme feature sample and the corresponding acoustic feature sample are obtained based on the same speech segment, and the facial key point information marked in the phoneme feature sample and the corresponding acoustic feature sample are the same.

在步骤202中，根据所述音素特征样本和对应的声学特征样本，对初始脸部关键点提取网络进行训练，在网络损失的变化满足收敛条件时完成训练得到所述脸部关键点提取网络，其中，所述网络损失包括所述初始神经网络预测得到的脸部关键点信息与标注的脸部关键点信息之间的差异。In step 202, according to the phoneme feature sample and the corresponding acoustic feature sample, the initial facial key point extraction network is trained, and when the change of the network loss meets the convergence condition, the training is completed to obtain the facial key point extraction network, Wherein, the network loss includes the difference between the facial key point information predicted by the initial neural network and the labeled facial key point information.

在一些实施例中，所述音素特征样本和所述声学特征样本通过对一对象的音频的音素特征和声学特征进行所述对象的脸部关键点信息标注得到。一个示例中，可以通过图3所示的方法获取所述音素特征样本和对应的声学特征样本。In some embodiments, the phoneme feature samples and the acoustic feature samples are obtained by marking the object's facial key point information on the phoneme features and acoustic features of the audio of an object. In an example, the phoneme feature samples and corresponding acoustic feature samples may be acquired through the method shown in FIG. 3 .

在步骤301中，获取所述对象的说话视频。其中，该对象可以是待生成说话视频的上述目标对象，也可以是与该目标对象不同的对象。In step 301, the talking video of the subject is acquired. Wherein, the object may be the above-mentioned target object whose talking video is to be generated, or may be a different object from the target object.

在一示例中，在想要生成某一目标对象的说话视频的情况下，则获取该目标对象的已有说话视频，用于获取音素特征样本和声学特征样本。In an example, when it is desired to generate a speaking video of a certain target object, the existing speaking video of the target object is acquired to obtain phoneme feature samples and acoustic feature samples.

在步骤302中，根据所述说话视频获取多个脸部图像，以及与每个所述脸部图像对应的至少一个音频帧。In step 302, a plurality of facial images and at least one audio frame corresponding to each of the facial images are acquired according to the talking video.

通过对所述说话视频进行拆分，得到所述说话视频对应的语音段以及所述说话视频所包含的多个脸部图像。其中，所述语音段中的多个音频帧与所述多个脸部图像具有对应关系。By splitting the speaking video, a speech segment corresponding to the speaking video and a plurality of facial images included in the speaking video are obtained. Wherein, the multiple audio frames in the speech segment have a corresponding relationship with the multiple facial images.

在步骤303中，获取每个所述脸部图像对应的至少一个音频帧的音素特征以及声学特征。In step 303, phoneme features and acoustic features of at least one audio frame corresponding to each of the facial images are acquired.

根据所述多个脸部图像与所述语音段中的多个音频帧的对应关系，获取其中任意脸部图像对应的至少一个音频帧的音素特征以及声学特征。According to the corresponding relationship between the plurality of facial images and the plurality of audio frames in the speech segment, the phoneme features and acoustic features of at least one audio frame corresponding to any facial image are acquired.

在步骤304中，根据所述多个脸部图像获取脸部关键点信息，并根据所述脸部关键点信息对所述音素特征和所述声学特征进行标注，得到所述音素特征样本和所述声学特征样本。In step 304, facial key point information is obtained according to the plurality of facial images, and the phoneme feature and the acoustic feature are marked according to the facial key point information to obtain the phoneme feature sample and the Acoustic feature samples described above.

在本公开实施例中，通过根据待生成说话视频所针对的目标对象的已有说话视频，生成音素特征样本和声学特征样本，可以准确地建立起所述目标对象所说的语音的音素特征和声学特征与脸部关键点信息之间的关联，可以更好地实现对脸部关键点生成网络的训练。In the embodiment of the present disclosure, by generating phoneme feature samples and acoustic feature samples according to the existing speech video of the target object for which the speech video is to be generated, the phoneme feature and acoustic feature samples of the speech spoken by the target object can be accurately established. The association between acoustic features and facial keypoint information can better realize the training of facial keypoint generation network.

在一些实施例中，所述脸部补全网络可以利用生成对抗网络进行训练。该训练方法可以由服务器执行，并且执行该训练方法的服务器与执行上述说话视频生成方法的设备可以是不同的。In some embodiments, the face completion network can be trained using a generative adversarial network. The training method may be executed by a server, and the server executing the training method may be different from the device executing the above-mentioned speaking video generation method.

其中，所述生成对抗网络包括所述脸部补全网络和第一鉴别网络，所述脸部补全网络用于根据脸部关键点信息对输入的遮挡脸部图像(即，存在遮挡区域的脸部图像)进行补全，生成脸部补全图像，其中，所述遮挡脸部图像是通过对完整脸部图像中包括特定部位(例如，嘴部)的设定区域进行遮挡得到的，所述完整脸部图像可以是所述脸部关键点信息对应的脸部图像；所生成的脸部补全图像与真实脸部图像被随机输入至所述第一鉴别网络，所述第一鉴别网络输出对于输入图像的鉴别结果，也即判断所述输入图像是脸部补全图像或真实脸部图像。Wherein, the generation confrontation network includes the face completion network and the first identification network, and the face completion network is used to analyze the input occluded face image (that is, the one with the occlusion area) according to the facial key point information. face image) to generate a face complement image, wherein the occluded face image is obtained by blocking a set area including a specific part (for example, mouth) in the complete face image, so The complete face image may be the face image corresponding to the face key point information; the generated face complement image and the real face image are randomly input into the first identification network, and the first identification network Outputting an identification result for the input image, that is, judging whether the input image is a face complement image or a real face image.

利用所述生成对抗网络对所述脸部补全网络进行训练的损失包括：The loss of using the generation confrontation network to train the face completion network includes:

第一损失，用于指示所述脸部补全网络输出的脸部补全图像与完整脸部图像之间的差异，其中，所述完整脸部图像是所述脸部关键点信息对应的脸部图像；The first loss is used to indicate the difference between the face completion image output by the face completion network and the complete face image, wherein the complete face image is the face corresponding to the facial key point information internal image;

第二损失，用于指示所述第一鉴别网络对于输入图像输出的分类结果与所述输入图像的标注信息之间的差异，其中，所述标注信息指示所述输入图像为所述脸部补全网络输出的脸部补全图像或者为真实脸部图像。The second loss is used to indicate the difference between the classification result output by the first discrimination network for the input image and the annotation information of the input image, wherein the annotation information indicates that the input image is the face complement The face complement image output by the whole network may be a real face image.

在所述训练的损失的变化满足收敛条件时完成训练，得到所述脸部补全网络。The training is completed when the variation of the training loss satisfies the convergence condition, and the face completion network is obtained.

在本公开实施例中，利用生成对抗网络对所述脸部补全网络进行训练，可以提高所述脸部补全网络输出的脸部补全图像的准确度，有利于提高所生成的所述目标对象的说话视频的图像质量。In the embodiment of the present disclosure, using a generative confrontation network to train the face completion network can improve the accuracy of the face completion image output by the face completion network, which is conducive to improving the generated The image quality of the speaking video of the target subject.

在一些实施例中，还可以增加用于判断脸部补全图像是否与音素特征对齐的第二鉴别网络，以辅助所述脸部补全网络的训练。在该训练方法中，所述脸部补全网络输出的脸部补全图像输入至所述第二鉴别网络。In some embodiments, a second discriminant network for judging whether the face complement image is aligned with phoneme features can also be added to assist the training of the face complement network. In the training method, the face complement image output by the face complement network is input to the second discrimination network.

该训练的损失在上述第一损失和第二损失之外，还包括第三损失，所述第三损失用于指示所述第二鉴别网络对于所述脸部补全图像与音素特征之间的对应(例如，对齐) 的判别结果与真实对应结果之间的差异。In addition to the above-mentioned first loss and second loss, the loss of this training also includes a third loss, and the third loss is used to indicate that the second discriminant network is for the difference between the face complement image and the phoneme feature The difference between the corresponding (eg, aligned) discriminative result and the true corresponding result.

通过增加第二鉴别网络对所述脸部补全网络进行训练，进一步提高了音素特征与脸部关键点的对齐效果，有利于提高说话视频的质量。By adding the second discrimination network to train the face completion network, the alignment effect between phoneme features and facial key points is further improved, which is beneficial to improve the quality of speaking videos.

以下结合图4所示的说话视频生成方法流程图、图5所示的说话视频生成方法示意图和图6所示的获取脸部关键点信息的示意图，对本公开实施例提出的一种说话视频生成方法进行描述。Combining the flow chart of the speaking video generation method shown in FIG. 4, the schematic diagram of the speaking video generation method shown in FIG. 5, and the schematic diagram of obtaining facial key point information shown in FIG. The method is described.

在步骤401中，对声音驱动数据对应的音频和文本进行对齐操作，得到所述声音驱动数据的音素特征。In step 401, an alignment operation is performed on the audio and text corresponding to the sound driving data to obtain phoneme features of the sound driving data.

所述声音驱动数据对应的音频例如图6所示，为“你好”语音段，所述声音驱动数据对应的文本则为“你好”文本。通过对所述音频和文本进行对齐操作，得到所述声音驱动数据的音素特征。The audio corresponding to the sound driving data is, for example, as shown in FIG. 6 , which is the speech segment of "Hello", and the text corresponding to the sound driving data is the text of "Hello". By performing an alignment operation on the audio and the text, the phoneme features of the sound driving data are obtained.

如图5所示，在所述声音驱动数据只包括音频的情况下，可以通过对所述音频进行语音识别，确定所述音频对应的文本；在所述声音驱动数据只包括文本的情况下，可以通过对所述文本进行语音合成，将文本所对应的文字信息转化为音频。As shown in Figure 5, in the case where the sound driving data only includes audio, the text corresponding to the audio can be determined by performing speech recognition on the audio; in the case where the sound driving data only includes text, Text information corresponding to the text may be converted into audio by performing speech synthesis on the text.

在步骤402中，对所述声音驱动数据对应的音频进行特征提取，得到所述声音驱动数据的梅尔倒谱特征，也即梅尔频率倒谱系数。In step 402, feature extraction is performed on the audio corresponding to the sound driving data to obtain the Mel cepstrum feature of the sound driving data, that is, Mel frequency cepstral coefficients.

在步骤403中，通过在所述声音驱动数据的音素特征和声学特征上进行滑窗的方式，获得所述音素特征所包含的多个子音素特征以及与所述多个子音素特征对应的子声学特征。时间窗口如图6中的虚线框所示，箭头示出时间窗口的滑动方向。在滑动过程中，每次获得的时间窗口内的音素特征和声学特征为子音素特征和子声学特征，并且，在相同时间窗口内的子音素特征和子声学特征是对应于同一语音段的。In step 403, a plurality of sub-phoneme features contained in the phoneme feature and sub-acoustic features corresponding to the plurality of sub-phoneme features are obtained by performing a sliding window on the phoneme feature and the acoustic feature of the sound driving data . The time window is shown as a dotted box in FIG. 6 , and the arrow shows the sliding direction of the time window. During the sliding process, the phoneme features and acoustic features obtained in each time window are sub-phoneme features and sub-acoustic features, and the sub-phoneme features and sub-acoustic features in the same time window correspond to the same speech segment.

在步骤404中，将所述子音素特征和对应的子声学特征输入至经训练的脸部关键点提取网络，得到与所述子音素特征和所述子声学特征对应的脸部关键点信息。如图6所示，对于每次获得的时间窗口内的子音素特征和子声学特征，所述脸部关键点提取网络输出该时间窗口对应的脸部关键点信息。In step 404, the sub-phoneme features and the corresponding sub-acoustic features are input to the trained facial key point extraction network to obtain facial key point information corresponding to the sub-phoneme features and the sub-acoustic features. As shown in FIG. 6 , for each obtained sub-phoneme feature and sub-acoustic feature in a time window, the face key point extraction network outputs face key point information corresponding to the time window.

示例性的，所述脸部关键点提取网络为3D脸部关键点提取网络，相应地，所得到的脸部关键点信息为3D脸部关键点信息。Exemplarily, the facial key point extraction network is a 3D facial key point extraction network, and correspondingly, the obtained facial key point information is 3D facial key point information.

在步骤405中，获取所述3D脸部关键点信息对应的2D脸部关键点信息。In step 405, 2D facial key point information corresponding to the 3D facial key point information is acquired.

在步骤406中，对多组2D脸部关键点信息进行滤波处理，使每个图像帧的2D脸部关键点信息与相邻帧的脸部关键点信息之间的变化量满足设定条件。In step 406, filter processing is performed on multiple sets of 2D facial key point information, so that the variation between the 2D facial key point information of each image frame and the facial key point information of adjacent frames meets the set condition.

在步骤407中，将经滤波处理后的每组2D脸部关键点信息与所述第二图像输入至脸部补全网络，得到与所述2D脸部关键点信息对应的目标脸部图像，其中，所述第二图像为遮挡脸部图像，第二图像中的下半脸被噪声填充以进行遮挡。In step 407, each set of filtered 2D facial key point information and the second image are input to the face completion network to obtain a target facial image corresponding to the 2D facial key point information, Wherein, the second image is an occluded face image, and the lower half of the face in the second image is filled with noise for occlusion.

在步骤408中，将步骤407中得到的多帧目标脸部图像(例如，说话人的脸部图像)与背景图像进行融合，得到第一图像序列。In step 408, the multi-frame target facial images obtained in step 407 (for example, the speaker's facial images) are fused with the background image to obtain a first image sequence.

在步骤409中，根据所述第一图像序列与所述声音驱动数据对应的音频，得到所述目标对象的说话视频。In step 409, a speaking video of the target object is obtained according to the first image sequence and the audio corresponding to the sound driving data.

图7是本公开至少一个实施例提出的说话视频生成装置的结构示意图；如图7所示，所述装置包括：第一获取单元701，用于获取声音驱动数据的音素特征以及声学特征，所述声音驱动数据包括音频、文本中的至少一项；第二获取单元702，用于根据所述音素特征和所述声学特征获取第一图像中目标对象的至少一组脸部关键点信息；第一得到单元703，用于根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像，其中，所述第二图像中包括所述目标对象的特定部位的设定区域被遮挡；第二得到单元704，用于根据所述声音驱动数据和所述至少一个目标脸部图像，得到所述目标对象的说话视频。Fig. 7 is a schematic structural diagram of a talking video generating device proposed by at least one embodiment of the present disclosure; as shown in Fig. 7 , the device includes: a first acquiring unit 701, configured to acquire phoneme features and acoustic features of sound driving data, so The sound driving data includes at least one of audio and text; the second acquisition unit 702 is configured to acquire at least one set of facial key point information of the target object in the first image according to the phoneme features and the acoustic features; the second An obtaining unit 703, configured to obtain at least one target facial image corresponding to the sound driving data according to the at least one set of facial key point information and the second image containing the target object's face, wherein, A set area including a specific part of the target object in the second image is blocked; a second obtaining unit 704 is configured to obtain the target object according to the sound driving data and the at least one target facial image talking video.

结合本公开提供的任一实施方式，所述第一获取单元具体用于：获取所述声音驱动数据对应的音频所包含的音素以及各个音素对应的时间戳，得到所述声音驱动数据的音素特征；对所述声音驱动数据对应的音频进行特征提取，得到所述声音驱动数据的声学特征。In combination with any implementation manner provided in the present disclosure, the first acquisition unit is specifically configured to: acquire the phonemes contained in the audio corresponding to the sound driving data and the time stamps corresponding to each phoneme, and obtain the phoneme features of the sound driving data ; Performing feature extraction on the audio corresponding to the sound driving data to obtain the acoustic features of the sound driving data.

结合本公开提供的任一实施方式，所述第二获取单元具体用于：获取所述音素特征所包含的多个子音素特征以及所述多个子音素特征对应的子声学特征；将所述子音素特征和对应的子声学特征输入至脸部关键点提取网络，得到与所述子音素特征和所述子声学特征对应的脸部关键点信息。In combination with any implementation manner provided in the present disclosure, the second acquisition unit is specifically configured to: acquire a plurality of sub-phoneme features included in the phoneme feature and sub-acoustic features corresponding to the plurality of sub-phoneme features; The features and corresponding sub-acoustic features are input to the facial key point extraction network to obtain facial key point information corresponding to the sub-phoneme features and the sub-acoustic features.

结合本公开提供的任一实施方式，所述脸部关键点信息包括3D脸部关键点信息，所述装置还包括投影单元，用于在根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像之前，将所述3D脸部关键点信息投影到2D平面上，得到所述3D脸部关键点信息对应的2D脸部关键点信息；利用所述2D脸部关键点信息更新所述脸部关键点信息。In combination with any of the implementations provided in the present disclosure, the facial key point information includes 3D facial key point information, and the device further includes a projection unit, configured to use the at least one set of facial key point information and include the The second image of the face of the target object, before obtaining at least one target face image corresponding to the sound driving data, project the 3D facial key point information onto a 2D plane to obtain the 3D facial key 2D facial key point information corresponding to the point information; using the 2D facial key point information to update the facial key point information.

结合本公开提供的任一实施方式，所述装置还包括滤除单元，用于在根据所述至少一组脸部关键点信息以及包含所述目标对象的脸部的第二图像，得到与所述声音驱动数据对应的至少一个目标脸部图像之前，对多组脸部关键点信息进行滤波处理，使每个图像帧的脸部关键点信息与相邻帧的脸部关键点信息之间的变化量满足设定条件。In combination with any of the implementations provided in the present disclosure, the device further includes a filtering unit, configured to obtain, according to the at least one set of face key point information and the second image containing the face of the target object, a Before the at least one target facial image corresponding to the sound driving data, multiple groups of facial key point information are filtered, so that the facial key point information of each image frame and the facial key point information of adjacent frames The amount of change satisfies the set condition.

结合本公开提供的任一实施方式，所述第一得到单元具体用于：将每组脸部关键点信息与所述第二图像输入至脸部补全网络，得到与所述脸部关键点信息对应的目标脸部图像，其中，所述脸部补全网络用于根据脸部关键点信息对所述第二图像中被遮挡的设定区域进行补全。In combination with any implementation manner provided by the present disclosure, the first obtaining unit is specifically configured to: input each set of facial key point information and the second image to the face completion network, and obtain the facial key point The target face image corresponding to the information, wherein the face completion network is used to complete the blocked set area in the second image according to the facial key point information.

结合本公开提供的任一实施方式，所述第二得到单元具体用于：将所述至少一个目标脸部图像与设定背景图像进行融合，得到第一图像序列；根据所述第一图像序列与所述声音驱动数据对应的音频，得到所述目标对象的说话视频。In combination with any implementation manner provided in the present disclosure, the second obtaining unit is specifically configured to: fuse the at least one target face image with the set background image to obtain a first image sequence; according to the first image sequence The audio corresponding to the sound driving data is used to obtain the speaking video of the target object.

结合本公开提供的任一实施方式，所述音素特征样本和所述声学特征样本通过以下方式得到：获取所述对象的说话视频；根据所述说话视频获取多个脸部图像，以及与每个所述脸部图像对应的至少一个音频帧；获取每个所述脸部图像对应的至少一个音频帧的音素特征以及声学特征；根据所述多个脸部图像获取脸部关键点信息，并根据所述脸部关键点信息对所述音素特征和所述声学特征进行标注，得到所述音素特征样本和所述声学特征样本。In combination with any implementation manner provided by the present disclosure, the phoneme feature sample and the acoustic feature sample are obtained in the following manner: acquiring a speaking video of the object; acquiring multiple facial images according to the speaking video, and combining with each At least one audio frame corresponding to the facial image; obtaining phoneme features and acoustic features of at least one audio frame corresponding to each facial image; obtaining facial key point information according to the plurality of facial images, and according to The facial key point information marks the phoneme feature and the acoustic feature to obtain the phoneme feature sample and the acoustic feature sample.

结合本公开提供的任一实施方式，所述生成对抗网络还包括第二鉴别网络，所述训练的网络损失还包括：第三损失，用于指示所述第二鉴别网络对于所述人脸补全图像与音素特征对应的判别结果与真实对应结果之间的差异。In combination with any implementation manner provided by the present disclosure, the generation confrontation network further includes a second discrimination network, and the trained network loss further comprises: a third loss, which is used to indicate that the second discrimination network is suitable for the face complement The discrepancy between the discriminative results corresponding to the full image and phoneme features and the true corresponding results.

本公开至少一个实施例还提供了一种电子设备，如图8所示，所述设备包括存储器、处理器，存储器用于存储可在处理器上运行的计算机指令，处理器用于在执行所述计算机指令时实现本公开任一实施例所述的说话视频生成方法。At least one embodiment of the present disclosure also provides an electronic device, as shown in FIG. 8 , the device includes a memory and a processor, the memory is used to store computer instructions that can be run on the processor, and the processor is used to execute the described The computer instructions implement the method for generating a talking video in any embodiment of the present disclosure.

本公开至少一个实施例还提供了一种计算机可读存储介质，其上存储有计算机程序，所述程序被处理器执行时实现本公开任一实施例所述的说话视频生成方法。At least one embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method for generating a talking video in any embodiment of the present disclosure is implemented.

本领域技术人员应明白，本说明书一个或多个实施例可提供为方法、系统或计算机程序产品。因此，本说明书一个或多个实施例可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且，本说明书一个或多个实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, system or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may employ a computer program embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein. The form of the product.

本说明书中的各个实施例均采用递进的方式描述，各个实施例之间相同相似的部分互相参见即可，每个实施例重点说明的都是与其他实施例的不同之处。尤其，对于数据处理设备实施例而言，由于其基本相似于方法实施例，所以描述的比较简单，相关之处参见方法实施例的部分说明即可。Each embodiment in this specification is described in a progressive manner, the same and similar parts of each embodiment can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the data processing device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for relevant parts, please refer to part of the description of the method embodiment.

上述对本说明书特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下，在权利要求书中记载的行为或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外，在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中，多任务处理和并行处理也是可以的或者可能是有利的。The foregoing describes specific embodiments of this specification. Other implementations are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Multitasking and parallel processing are also possible or may be advantageous in certain embodiments.

本说明书中描述的主题及功能操作的实施例可以在以下中实现：数字电子电路、有形体现的计算机软件或固件、包括本说明书中公开的结构及其结构性等同物的计算机硬件、或者它们中的一个或多个的组合。本说明书中描述的主题的实施例可以实现为一个或多个计算机程序，即编码在有形非暂时性程序载体上以被数据处理装置执行或控制数据处理装置的操作的计算机程序指令中的一个或多个模块。可替代地或附加地，程序指令可以被编码在人工生成的传播信号上，例如机器生成的电、光或电磁信号，该信号被生成以将信息编码并传输到合适的接收机装置以由数据处理装置执行。计算机存储介质可以是机器可读存储设备、机器可读存储基板、随机或串行存取存储器设备、或它们中的一个或多个的组合。Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or in A combination of one or more of . Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more of computer program instructions encoded on a tangible, non-transitory program carrier for execution by or to control the operation of data processing apparatus. Multiple modules. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical or electromagnetic signal, which is generated to encode and transmit information to a suitable receiver device for transmission by the data The processing means executes. A computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

本说明书中描述的处理及逻辑流程可以由执行一个或多个计算机程序的一个或多个可编程计算机执行，以通过根据输入数据进行操作并生成输出来执行相应的功能。所述处理及逻辑流程还可以由专用逻辑电路—例如FPGA(现场可编程门阵列)或ASIC (专用集成电路)来执行，并且装置也可以实现为专用逻辑电路。The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).

适合用于执行计算机程序的计算机包括，例如通用和/或专用微处理器，或任何其他类型的中央处理单元。通常，中央处理单元将从只读存储器和/或随机存取存储器接收指令和数据。计算机的基本组件包括用于实施或执行指令的中央处理单元以及用于存储指令和数据的一个或多个存储器设备。通常，计算机还将包括用于存储数据的一个或多个大容量存储设备，例如磁盘、磁光盘或光盘等，或者计算机将可操作地与此大容量存储设备耦接以从其接收数据或向其传送数据，抑或两种情况兼而有之。然而，计算机不是必须具有这样的设备。此外，计算机可以嵌入在另一设备中，例如移动电话、个人数字助理(PDA)、移动音频或视频播放器、游戏操纵台、全球定位系统(GPS)接收机、或例如通用串行总线(USB)闪存驱动器的便携式存储设备，仅举几例。Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic or magneto-optical disks, or optical disks, to receive data therefrom or to It transmits data, or both. However, it is not necessary for a computer to have such a device. In addition, a computer may be embedded in another device such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a device such as a Universal Serial Bus (USB) ) portable storage devices like flash drives, to name a few.

适合于存储计算机程序指令和数据的计算机可读介质包括所有形式的非易失性存储器、媒介和存储器设备，例如包括半导体存储器设备(例如EPROM、EEPROM和闪存设备)、磁盘(例如内部硬盘或可移动盘)、磁光盘以及CD ROM和DVD-ROM盘。处理器和存储器可由专用逻辑电路补充或并入专用逻辑电路中。Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and memory can be supplemented by, or incorporated in, special purpose logic circuitry.

虽然本说明书包含许多具体实施细节，但是这些不应被解释为限制任何发明的范围或所要求保护的范围，而是主要用于描述特定发明的具体实施例的特征。本说明书内在多个实施例中描述的某些特征也可以在单个实施例中被组合实施。另一方面，在单个实施例中描述的各种特征也可以在多个实施例中分开实施或以任何合适的子组合来实施。此外，虽然特征可以如上所述在某些组合中起作用并且甚至最初如此要求保护，但是来自所要求保护的组合中的一个或多个特征在一些情况下可以从该组合中去除，并且所要求保护的组合可以指向子组合或子组合的变型。While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as primarily describing features of particular embodiments of particular inventions. Certain features that are described in this specification in multiple embodiments can also be implemented in combination in a single embodiment. On the other hand, various features that are described in a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may function in certain combinations as described above and even be initially so claimed, one or more features from a claimed combination may in some cases be removed from that combination and the claimed A protected combination can point to a subcombination or a variant of a subcombination.

类似地，虽然在附图中以特定顺序描绘了操作，但是这不应被理解为要求这些操作以所示的特定顺序执行或顺次执行、或者要求所有例示的操作被执行，以实现期望的结果。在某些情况下，多任务和并行处理可能是有利的。此外，上述实施例中的各种系统模块和组件的分离不应被理解为在所有实施例中均需要这样的分离，并且应当理解，所描述的程序组件和系统通常可以一起集成在单个软件产品中，或者封装成多个软件产品。Similarly, while operations are depicted in the figures in a particular order, this should not be construed as requiring that those operations be performed in the particular order shown, or sequentially, or that all illustrated operations be performed, to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the above-described embodiments should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can often be integrated together in a single software product in, or packaged into multiple software products.

由此，主题的特定实施例已被描述。其他实施例在所附权利要求书的范围以内。在某些情况下，权利要求书中记载的动作可以以不同的顺序执行并且仍实现期望的结果。此外，附图中描绘的处理并非必需所示的特定顺序或顺次顺序，以实现期望的结果。在某些实现中，多任务和并行处理可能是有利的。Thus, certain embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.

以上所述仅为本说明书一个或多个实施例的较佳实施例而已，并不用以限制本说明书一个或多个实施例，凡在本说明书一个或多个实施例的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本说明书一个或多个实施例保护的范围之内。The above descriptions are only preferred embodiments of one or more embodiments of this specification, and are not intended to limit one or more embodiments of this specification. Within the spirit and principles of one or more embodiments of this specification, Any modification, equivalent replacement, improvement, etc. should be included in the scope of protection of one or more embodiments of this specification.

Claims

A method for generating a speaking video, characterized in that the method comprises:

Obtain phoneme features and acoustic features of sound-driven data, where the sound-driven data includes at least one of audio and text;

Acquiring at least one set of facial key point information of the target object in the first image according to the phoneme feature and the acoustic feature;

According to the at least one set of face key point information and the second image containing the face of the target object, at least one target face image corresponding to the sound driving data is obtained, wherein the second image includes The set area of the specific part of the target object is blocked;

A speaking video of the target object is obtained according to the sound driving data and the at least one target facial image.

The method according to claim 1, wherein said obtaining phoneme features and acoustic features of the sound driving data comprises:

Obtaining the phonemes contained in the audio corresponding to the sound driving data and the timestamps corresponding to each phoneme, and obtaining the phoneme characteristics of the sound driving data;

Feature extraction is performed on the audio corresponding to the sound driving data to obtain the acoustic features of the sound driving data.

The method according to claim 1 or 2, wherein said obtaining at least one set of face key point information of the target object in the first image according to the phoneme feature and the acoustic feature comprises:

Obtaining a plurality of sub-phoneme features contained in the phoneme feature and sub-acoustic features corresponding to the plurality of sub-phoneme features;

The sub-phoneme features and the corresponding sub-acoustic features are input to the facial key point extraction network to obtain facial key point information corresponding to the sub-phoneme features and the sub-acoustic features.

The method according to any one of claims 1 to 3, wherein the facial key point information includes 3-dimensional (3D) facial key point information, and according to the at least one set of facial key point information and The second image containing the face of the target object, before obtaining at least one target face image corresponding to the sound driving data, the method further includes:

Projecting the 3D face key point information onto a 2-dimensional (2D) plane to obtain 2D face key point information corresponding to the 3D face key point information;

Utilizing the 2D facial key point information to update the facial key point information.

The method according to any one of claims 1 to 4, characterized in that, according to the at least one set of face key point information and the second image containing the face of the target object, the sound drive is obtained. Before at least one target facial image corresponding to the data, the method also includes:

Filtering is performed on multiple sets of facial key point information, so that the variation between the facial key point information of each image frame and the facial key point information of adjacent frames of the image frame satisfies a set condition.

The method according to any one of claims 1 to 5, characterized in that, according to the at least one set of face key point information and the second image containing the face of the target object, the sound At least one target face image corresponding to the driving data, including:

Each set of face key point information and the second image are input to the face complement network to obtain a target face image corresponding to the face key point information, wherein the face complement network is used to The key point information of the face complements the blocked set area in the second image.

The method according to any one of claims 1 to 6, wherein the obtaining the speaking video of the target object according to the sound driving data and the at least one target facial image comprises:

Fusing the at least one target face image with the set background image to obtain a first image sequence;

According to the first image sequence and the audio corresponding to the sound driving data, a speaking video of the target object is obtained.

The method according to any one of claims 3 to 7, wherein the facial key point extraction network is trained by using phoneme feature samples and corresponding acoustic feature samples, wherein the phoneme feature samples and the acoustic feature samples Feature samples include labeled facial keypoint information.

The method according to claim 8, wherein the facial key point extraction network is trained in the following manner:

According to the phoneme feature sample and the corresponding acoustic feature sample, the initial facial key point extraction network is trained, and when the change of network loss meets the convergence condition, the training is completed to obtain the facial key point extraction network, wherein the network The loss includes the difference between the facial key point information predicted by the initial neural network and the labeled facial key point information.

The method according to claim 8 or 9, wherein the phoneme feature sample and the acoustic feature sample are obtained by marking the object's facial key point information on the phoneme feature and acoustic feature of an object's audio .

The method according to claim 10, wherein the phoneme feature samples and the acoustic feature samples are obtained in the following manner:

Obtain a speaking video of the subject;

Obtaining a plurality of facial images according to the speaking video, and at least one audio frame corresponding to each of the facial images;

Obtain phoneme features and acoustic features of at least one audio frame corresponding to each of the facial images;

Facial key point information is acquired according to the plurality of facial images, and the phoneme feature and the acoustic feature are marked according to the facial key point information to obtain the phoneme feature sample and the acoustic feature sample.

The method according to any one of claims 6 to 11, wherein the face completion network is trained by generating an adversarial network, and the generating adversarial network includes the face completion network and a first identification network , the network loss of the training includes:

The first loss is used to indicate the difference between the face completion image output by the face completion network and the complete face image, wherein the complete face image is the face corresponding to the facial key point information internal image;

The second loss is used to indicate the difference between the classification result output by the first discrimination network for the input image and the annotation information of the input image, wherein the annotation information indicates that the input image is the face complement The face complement image output by the whole network may be a real face image.

The method according to claim 12, wherein the generation confrontation network also includes a second discrimination network, and the network loss of the training also includes:

The third loss is used to indicate the difference between the discrimination result of the second discrimination network for the correspondence between the face complement image and the phoneme feature and the real correspondence result.

A talking video generation device, characterized in that the device comprises:

A first acquisition unit, configured to acquire phoneme features and acoustic features of sound-driven data, where the sound-driven data includes at least one of audio and text;

A second acquiring unit, configured to acquire at least one set of facial key point information of the target object in the first image according to the phoneme feature and the acoustic feature;

The first obtaining unit is configured to obtain at least one target facial image corresponding to the sound driving data according to the at least one set of facial key point information and the second image containing the target object's face, wherein, A set area including a specific part of the target object in the second image is blocked;

The second obtaining unit is configured to obtain the speaking video of the target object according to the sound driving data and the at least one target facial image.

An electronic device, characterized in that the device comprises a memory and a processor, the memory is used to store computer instructions executable on the processor, and the processor is used to implement claim 1 when executing the computer instructions to the method described in any one of 13.

A computer-readable storage medium on which a computer program is stored, wherein the program implements the method according to any one of claims 1 to 13 when the program is executed by a processor.