CN113987269A - Digital human video generation method and device, electronic equipment and storage medium

Digital human video generation method and device, electronic equipment and storage medium

Info

Publication number
CN113987269A
Authority
CN
China
Prior art keywords
audio
model
target
audio frame
sub
Prior art date
Legal status
Pending
Application number
CN202111169280.5A
Other languages
Chinese (zh)
Inventor
王鑫宇
刘炫鹏
常向月
刘云峰
Current Assignee
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202111169280.5A priority Critical patent/CN113987269A/en
Publication of CN113987269A publication Critical patent/CN113987269A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the disclosure discloses a digital human video generation method, a digital human video generation device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a target audio and a target face image; for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generating a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target region image is a region image in the target face image except for a mouth region image, and the target image corresponding to the audio frame is used for instructing the person indicated by the target face image to emit the audio indicated by the audio frame; and based on the generated target image, a digital human video is generated. The disclosed embodiments can improve the efficiency of digital human generation.

Description

Digital human video generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of digital human video generation technologies, and in particular, to a digital human video generation method, an apparatus, an electronic device, and a storage medium.
Background
Digital human generation technology is becoming increasingly mature. Existing schemes are digital human generation methods based on pix2pix, pix2pixHD and video-to-video synthesis. Specifically, a large number of digital human generation technologies are currently available, for example, digital human generation methods based on pix2pix, pix2pixHD, Vid2Vid, few-shot vid2vid, NeRF, StyleGAN, and the like.
However, in these conventional schemes, if the generated face key points are inaccurate or the generated sketch is of poor quality, the finally generated digital human picture is also of poor quality.
Disclosure of Invention
In view of the above, to solve some or all of the technical problems, embodiments of the present disclosure provide a digital human video generation method, apparatus, electronic device and storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for generating a digital human video, where the method includes:
acquiring a target audio and a target face image;
for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generating a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target region image is a region image in the target face image except for a mouth region image, and the target image corresponding to the audio frame is used for instructing the person indicated by the target face image to emit the audio indicated by the audio frame;
based on the generated target image, a digital human video is generated.
Optionally, in the method according to any embodiment of the present disclosure, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model, where input data of the first sub-model is an audio frame sequence corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model includes the first hidden vector and the second hidden vector, and output data of the third sub-model includes a target image; and
the above inputting the audio frame sequence corresponding to the audio frame and the target area image in the target face image into a pre-trained end-to-end model to generate the target image corresponding to the audio frame includes:
inputting the audio frame sequence corresponding to the audio frame to the first sub-model to obtain a first hidden vector;
inputting a target area image in the target face image into the second sub-model to obtain a second hidden vector;
merging the first hidden vector and the second hidden vector to obtain a merged vector;
and inputting the merged vector into the third sub-model to obtain a target image corresponding to the audio frame.
Optionally, in the method according to any embodiment of the present disclosure, the end-to-end model is trained as follows:
acquiring video data;
extracting audio frames and face images corresponding to the audio frames from the video data, taking the audio frame sequence corresponding to the extracted audio frames as sample audio, and taking the extracted face images as sample face images;
and adopting a machine learning algorithm, taking the sample audio as input data of a generator in the generative adversarial network, obtaining a target image which corresponds to the sample audio and is generated by the generator, and taking the current generator as an end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition.
Optionally, in the method according to any embodiment of the present disclosure, taking the sample audio as input data of a generator in a generative adversarial network, obtaining a target image generated by the generator corresponding to the sample audio, and taking a current generator as an end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator satisfies a preset training end condition, includes:
acquiring an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is the first hidden vector, and the output data of the fourth sub-model is mouth key points;
performing a first training step as follows:
inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
inputting the first hidden vector corresponding to the sample audio into a fourth submodel to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on predicted mouth keypoints corresponding to the sample audio and mouth keypoints extracted from a sample face image corresponding to the sample audio;
and if the calculated first function value is less than or equal to a first preset threshold value, determining the model parameters of the first sub-model included by the current initially-generated countermeasure network as the model parameters of the first sub-model included by the trained end-to-end model.
Optionally, in the method according to any embodiment of the present disclosure, the taking a sample audio as input data of a generator in a generative adversarial network to obtain a target image, corresponding to the sample audio, generated by the generator, and taking a current generator as an end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, further includes:
if the calculated first function value is larger than the first preset threshold value, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network, and continuing to execute the first training step based on the initial generative adversarial network after the model parameters are updated.
Optionally, in the method according to any embodiment of the present disclosure, the taking a sample audio as input data of a generator in a generative adversarial network to obtain a target image, corresponding to the sample audio, generated by the generator, and taking a current generator as an end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, further includes:
performing a second training step as follows:
inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio;
merging the first hidden vector corresponding to the sample audio and the second hidden vector corresponding to the sample audio to obtain a merged vector corresponding to the sample audio;
inputting the merged vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio;
calculating a second function value of a second preset loss function based on a predicted target image corresponding to the sample audio and a target image extracted from a sample face image corresponding to the sample audio;
and if the calculated second function value is less than or equal to a second preset threshold value, determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model.
Optionally, in the method according to any embodiment of the present disclosure, the taking a sample audio as input data of a generator in a generative adversarial network to obtain a target image, corresponding to the sample audio, generated by the generator, and taking a current generator as an end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, further includes:
and if the calculated second function value is larger than the second preset threshold value, updating the model parameters of the second sub-model and the third sub-model included in the current initial generative adversarial network, and continuing to execute the second training step based on the initial generative adversarial network after the model parameters are updated.
Optionally, in the method according to any embodiment of the present disclosure, the preset training end condition includes at least one of:
the function value of a preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
and the function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is greater than or equal to a second preset value.
Optionally, in the method according to any embodiment of the present disclosure, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
Optionally, in the method according to any embodiment of the present disclosure, the sequence of audio frames corresponding to the audio frame includes the audio frame and audio frames consecutive to a preset number of frames before the audio frame in the target audio.
In a second aspect, an embodiment of the present disclosure provides a digital human video generating apparatus, where the apparatus includes:
an acquisition unit configured to acquire a target audio and a target face image;
an input unit configured to input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that include the audio frame, the target region image is a region image in the target face image except a mouth region image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame;
a generating unit configured to generate a digital human video based on the generated target image.
Optionally, in the apparatus according to any embodiment of the present disclosure, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model, where input data of the first sub-model is an audio frame sequence corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model includes the first hidden vector and the second hidden vector, and output data of the third sub-model includes a target image; and
the above generation unit, further configured to:
inputting the audio frame sequence corresponding to the audio frame to the first sub-model to obtain a first hidden vector;
inputting a target area image in the target face image into the second sub-model to obtain a second hidden vector;
merging the first hidden vector and the second hidden vector to obtain a merged vector;
and inputting the merged vector into the third sub-model to obtain a target image corresponding to the audio frame.
Optionally, in the apparatus according to any embodiment of the present disclosure, the end-to-end model is trained as follows:
acquiring video data;
extracting audio frames and face images corresponding to the audio frames from the video data, taking the audio frame sequence corresponding to the extracted audio frames as sample audio, and taking the extracted face images as sample face images;
and adopting a machine learning algorithm, taking the sample audio as input data of a generator in the generative adversarial network, obtaining a target image which corresponds to the sample audio and is generated by the generator, and taking the current generator as an end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition.
Optionally, in an apparatus according to any embodiment of the present disclosure, the taking a sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator corresponding to the sample audio, and taking a current generator as an end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator satisfies a preset training end condition includes:
acquiring an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is the first hidden vector, and the output data of the fourth sub-model is mouth key points;
performing a first training step as follows:
inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
inputting the first hidden vector corresponding to the sample audio into a fourth submodel to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on predicted mouth keypoints corresponding to the sample audio and mouth keypoints extracted from a sample face image corresponding to the sample audio;
and if the calculated first function value is less than or equal to a first preset threshold value, determining the model parameters of the first sub-model included by the current initially-generated countermeasure network as the model parameters of the first sub-model included by the trained end-to-end model.
Optionally, in the apparatus according to any embodiment of the present disclosure, the taking a sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator corresponding to the sample audio, and if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking a current generator as an end-to-end model further includes:
if the calculated first function value is larger than the first preset threshold value, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network, and continuing to execute the first training step based on the initial generative adversarial network after the model parameters are updated.
Optionally, in the apparatus according to any embodiment of the present disclosure, the taking a sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator corresponding to the sample audio, and if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking a current generator as an end-to-end model further includes:
performing a second training step as follows:
inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio;
merging the first hidden vector corresponding to the sample audio and the second hidden vector corresponding to the sample audio to obtain a merged vector corresponding to the sample audio;
inputting the merged vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio;
calculating a second function value of a second preset loss function based on a predicted target image corresponding to the sample audio and a target image extracted from a sample face image corresponding to the sample audio;
and if the calculated second function value is less than or equal to a second preset threshold value, determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model.
Optionally, in the apparatus according to any embodiment of the present disclosure, the taking a sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator corresponding to the sample audio, and if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking a current generator as an end-to-end model further includes:
and if the calculated second function value is larger than the second preset threshold value, updating the model parameters of the second sub-model and the third sub-model included in the current initial generative adversarial network, and continuing to execute the second training step based on the initial generative adversarial network after the model parameters are updated.
Optionally, in the apparatus according to any embodiment of the present disclosure, the preset training end condition includes at least one of:
the function value of a preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
and the function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is greater than or equal to a second preset value.
Optionally, in the apparatus according to any embodiment of the present disclosure, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
Optionally, in the apparatus according to any embodiment of the present disclosure, the sequence of audio frames corresponding to the audio frame includes the audio frame and audio frames consecutive to a preset number of frames before the audio frame in the target audio.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory for storing a computer program;
a processor, configured to execute the computer program stored in the memory, and when the computer program is executed, implement the method of any embodiment of the digital human video generation method of the first aspect of the present disclosure.
In a fourth aspect, the disclosed embodiments provide a computer-readable medium storing a computer program which, when executed by a processor, implements the method of any of the embodiments of the digital human video generation method of the first aspect.
In a fifth aspect, embodiments of the present disclosure provide a computer program comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the steps in the method as in any of the embodiments of the digital human video generation method of the first aspect described above.
Based on the method for generating digital human video provided by the foregoing embodiment of the present disclosure, a target audio and a target face image are obtained, then, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image are input into a pre-trained end-to-end model, a target image corresponding to the audio frame is generated, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target area image is an area image in the target face image except a mouth area image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame, and finally, based on the generated target image, a digital human video is generated. Therefore, the target image used for generating the digital human video is directly obtained by adopting the end-to-end model, so that the efficiency of generating the digital human video is improved by improving the speed of generating the target image.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is an exemplary system architecture diagram of a digital human video generation method or a digital human video generation apparatus provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method for generating a digital human video provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of one application scenario for the embodiment of FIG. 2;
FIG. 4A is a flow chart of another method for generating digital human video provided by embodiments of the present disclosure;
fig. 4B is a schematic structural diagram of a mouth region image generation model in a digital human video generation method according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a digital human video generating device provided by an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of parts and steps, numerical expressions, and values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those within the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one object, step, device, or module from another object, and do not denote any particular technical meaning or logical order therebetween.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure is only one kind of association relationship describing an associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is an exemplary system architecture diagram of a digital human video generation method or a digital human video generation apparatus provided by an embodiment of the present disclosure.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or transmit data (e.g., target audio and target facial images), etc. Various client applications, such as audio/video processing software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, such as a background server that processes data transmitted by the terminal devices 101, 102, 103. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be further noted that the digital human video generation method provided by the embodiments of the present disclosure may be executed by a server, may also be executed by a terminal device, and may also be executed by the server and the terminal device in cooperation with each other. Accordingly, each part (for example, each unit, sub-unit, module, sub-module) included in the digital human video generating device may be entirely disposed in the server, may be entirely disposed in the terminal device, and may be disposed in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the digital human video generation method operates does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or a terminal device) on which the digital human video generation method operates.
Fig. 2 shows a flow 200 of a digital human video generation method provided by an embodiment of the present disclosure. The digital human video generation method comprises the following steps:
step 201, acquiring a target audio and a target face image.
In this embodiment, an execution subject (for example, a server or a terminal device shown in fig. 1) of the digital human video generation method may acquire the target audio and the target face image from other electronic devices or locally.
The target audio may be any audio. The target audio may be used as the audio to be uttered by the digital human in the digital human video generated in the subsequent steps. For example, the target audio may be speech audio or audio generated by machine conversion of text.
The target face image can be any face image. As an example, the target face image may be a shot image containing a face, or a frame of face image extracted from a video.
In some cases, there may be no association between the target audio and the target face image. For example, the target audio may be audio uttered by a first person, and the target face image may be a face image of a second person, where the second person may be a person other than the first person; alternatively, the target audio may be audio emitted by the first person at a first time, and the target facial image may be a facial image of the first person at a second time, where the second time may be any time different from the first time.
Step 202, for the audio frame in the target audio, inputting the audio frame sequence corresponding to the audio frame and the target area image in the target face image into a pre-trained end-to-end model, and generating a target image corresponding to the audio frame.
In this embodiment, the executing entity may input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame.
The audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio. The target area image is an area image of the target face image excluding the mouth area image. The target image corresponding to the audio frame is used for instructing the person indicated by the target face image to emit the audio indicated by the audio frame. The end-to-end model can represent the corresponding relation among an audio frame sequence corresponding to the audio frame, a target area image in the target face image, a target image corresponding to the audio frame and the audio frame.
Here, the audio frame sequence corresponding to the audio frame may be a sequence of a preset number of audio frames containing the audio frame in the target audio. For example, the audio frame sequence may comprise the audio frame and the 4 audio frames preceding it, or the audio frame sequence may comprise the audio frame, the 2 audio frames preceding it, and the 2 audio frames following it.
Optionally, the audio frame sequence corresponding to the audio frame includes the audio frame and consecutive audio frames of a preset number of frames before the audio frame in the target audio.
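As a non-limiting illustration, such an audio frame window (the current frame plus a preset number of preceding frames) might be assembled as in the following sketch; the function name, the default window of 4 preceding frames, and the padding of positions before the start of the audio are assumptions made for illustration only.

```python
from typing import List, Sequence

def audio_frame_window(frames: Sequence, index: int, history: int = 4) -> List:
    """Return the audio frame at `index` preceded by `history` consecutive frames.

    Positions before the start of the audio are padded by repeating the first
    frame; this padding choice is an assumption of this sketch.
    """
    start = index - history
    return [frames[max(i, 0)] for i in range(start, index + 1)]

# Example: for frame 10 of the target audio, the window holds frames 6..10.
# window = audio_frame_window(target_audio_frames, 10)
```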
In some optional implementations of this embodiment, the end-to-end model includes a first submodel, a second submodel, and a third submodel. The input data of the first submodel is a sequence of audio frames corresponding to the audio frames. The output data of the first sub-model is a first hidden vector. The input data of the second sub-model is a target area image in the target face image. The output data of the second sub-model is a second hidden vector. The input data of the third submodel includes the first hidden vector and the second hidden vector. The output data of the third submodel includes a target image.
On this basis, the executing entity may execute the step 202 in such a manner that the audio frame sequence corresponding to the audio frame and the target region image in the target face image are input into a pre-trained end-to-end model, and a target image corresponding to the audio frame is generated:
firstly, inputting the audio frame sequence corresponding to the audio frame to the first sub-model to obtain a first hidden vector.
The first sub-model may include model structures such as CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory network), and the like. As an example, the first sub-model may include 2 CNN layers and 2 LSTM layers. The first hidden vector may be an acoustic coding vector, i.e., a vector output by an intermediate layer.
And secondly, inputting the target area image in the target face image into the second sub-model to obtain a second hidden vector.
The second sub-model may include model structures such as CNN, LSTM, and the like. As an example, the second sub-model may include 4 CNN layers. The second hidden vector may be a hidden-space vector of the target region image (e.g., the target region image in the target face image, or the target region image in a sample face image), for example, the hidden-space vector output by the sketch encoder.
And thirdly, combining the first hidden vector and the second hidden vector to obtain a combined vector.
And fourthly, inputting the merged vector into the third sub-model to obtain a target image corresponding to the audio frame.
The third sub-model may include model structures such as CNN, LSTM, and the like. As an example, the third sub-model may include 4 CNN layers. The third sub-model can represent the corresponding relation between the merged vector and the target image.
It is to be understood that, in the above alternative implementation, the target image corresponding to the audio frame is generated through the first sub-model, the second sub-model and the third sub-model included in the end-to-end model, so that the generation effect of the digital human video can be improved by improving the accuracy of the generated target image. In some cases, in the optional implementation manner, in the use process of the end-to-end model, operations such as key point extraction and inverse normalization processing are not required, so that the accuracy of digital human video generation can be improved.
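As a non-limiting illustration, the following PyTorch sketch shows one possible arrangement of the three sub-models described above (an audio encoder with 2 CNN layers and 2 LSTM layers, an image encoder with 4 CNN layers, and a decoder operating on the merged vector). All tensor shapes, layer sizes, and the way the two hidden vectors are merged (broadcasting the audio vector over the spatial feature map and concatenating channels) are assumptions of this sketch rather than requirements of the disclosed method.

```python
import torch
import torch.nn as nn

class EndToEndModel(nn.Module):
    """Sketch of the first/second/third sub-models; all dimensions are illustrative."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # First sub-model: e.g. 2 CNN layers + 2 LSTM layers over the audio frame sequence.
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.audio_lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        # Second sub-model: e.g. 4 CNN layers encoding the target region image.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Third sub-model: decoder mapping the merged vector back to an image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * hidden, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, audio_seq: torch.Tensor, region_img: torch.Tensor) -> torch.Tensor:
        # audio_seq: (B, T, n_mels) audio features of the frame sequence;
        # region_img: (B, 3, H, W) face image with the mouth region removed.
        a = self.audio_cnn(audio_seq.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.audio_lstm(a)
        first_hidden = h[-1]                                # first hidden vector, (B, hidden)
        second_hidden = self.image_encoder(region_img)      # second hidden vector, (B, hidden, H/16, W/16)
        # Merge: broadcast the audio vector over the spatial map and concatenate channels.
        a_map = first_hidden[:, :, None, None].expand(
            -1, -1, second_hidden.shape[2], second_hidden.shape[3])
        merged = torch.cat([a_map, second_hidden], dim=1)   # merged vector
        return self.decoder(merged)                         # target image for the audio frame
```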
In some cases of the above alternative implementations, the above end-to-end model is trained by:
step one, video data is obtained.
The video data may be any video data containing voice and face images. In the video data, each video frame comprises an audio frame and a face image, namely, each audio frame has a corresponding face image. For example, in video data within one second, if the video within one second includes 5 frames, that is, 5 audio frames and 5 face images, the audio frames correspond to the face images one to one.
And step two, extracting audio frames and face images corresponding to the audio frames from the video data, taking the audio frame sequence corresponding to the extracted audio frames as sample audio, and taking the extracted face images as sample face images.
And step three, using a machine learning algorithm, using the sample audio as input data of a generator in the generative adversarial network to obtain a target image which corresponds to the sample audio and is generated by the generator, and using the current generator as an end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition.
Wherein, the preset training end condition may include at least one of the following: the calculated loss function value is less than or equal to a preset threshold, and the probability that the mouth region image generated by the generator is the mouth region image of the sample face image corresponding to the sample audio is 50%.
It is understood that in the above case, the end-to-end model is obtained based on the generative adversarial network, which can improve the generation effect of the digital human video by improving the accuracy of the target image generated by the generator.
In some cases, the preset training end condition also includes at least one of the following:
the first item is that the function value of a preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value.
The audio frame sequence corresponding to the audio frame may be a sequence formed by a preset number of audio frames including the audio frame in the target audio. For example, the sequence of audio frames may comprise the audio frame and the first 4 audio frames of the audio frame.
And the second term is that the function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is greater than or equal to a second preset value.
The audio frame sequence corresponding to the non-audio frame (the audio frame in question is hereinafter referred to as the target frame) may be a sequence composed of audio frames other than those in the audio frame sequence corresponding to the target frame. For example, the audio frame sequence corresponding to the non-audio frame may be a sequence of a preset number of audio frames randomly selected from the video data or the target video. The audio frame sequence corresponding to the non-audio frame may or may not include the target frame.
In some cases, the sequence of audio frames corresponding to non-audio frames, and the sequence of audio frames corresponding to audio frames may contain an equal number of audio frames.
It can be understood that, in the above case, the audio frame sequence corresponding to the audio frame (for example, the current frame and the previous 4 frames) and the target image in the sample face image are input into the discriminator, and a smaller loss indicates a better result. Specifically, the 26 mouth key points inferred from the audio of the current frame and the previous 4 frames, together with the 26 mouth key points of the real face in the current frame, are input into the discriminator to calculate the function value of the preset loss function; the smaller this function value is, the more realistic the mouth generated against the discriminator is, that is, the better the generation effect of the digital human video. Conversely, for audio that does not correspond to the current frame, a larger function value of the preset loss function is better. Specifically, 5 frames of audio not corresponding to the current frame are adopted (for example, the 26 key points inferred from the other 5 frames of audio and the 26 mouth key points of the real face in the current frame are input into the discriminator to calculate the function value of the preset loss function); the larger this function value is, the more realistic the mouth produced by the adversarially trained generator is, that is, the better the generation effect of the digital human video.
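As a non-limiting illustration, the matched/mismatched discriminator objective described above might look roughly as follows; the 26-key-point input format, the fully connected discriminator, and the binary cross-entropy loss are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class KeypointSyncDiscriminator(nn.Module):
    """Scores whether 26 predicted mouth key points match 26 real mouth key points."""

    def __init__(self, num_keypoints: int = 26):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_keypoints * 2 * 2, 256), nn.ReLU(),  # (x, y) for both key point sets
            nn.Linear(256, 1),
        )

    def forward(self, predicted_kp: torch.Tensor, real_kp: torch.Tensor) -> torch.Tensor:
        # predicted_kp, real_kp: (B, 26, 2)
        x = torch.cat([predicted_kp, real_kp], dim=1).flatten(1)
        return self.net(x)  # raw logit: higher means "matching"

def discriminator_loss(disc, kp_from_matching_audio, kp_from_other_audio, real_kp):
    """Matched pairs should score high (small loss); mismatched pairs should score low."""
    bce = nn.BCEWithLogitsLoss()
    real_logit = disc(kp_from_matching_audio, real_kp)
    fake_logit = disc(kp_from_other_audio, real_kp)
    return bce(real_logit, torch.ones_like(real_logit)) + \
           bce(fake_logit, torch.zeros_like(fake_logit))
```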
Optionally, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
In some application scenarios in the foregoing cases, taking the sample audio as input data of a generator in a generative adversarial network, obtaining a target image generated by the generator corresponding to the sample audio, and taking a current generator as an end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator satisfies a preset training end condition, includes:
First, an initial generative adversarial network is obtained. The initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model; the input data of the fourth sub-model is the first hidden vector, and the output data of the fourth sub-model is mouth key points.
Then, the following first training step (including step one to step four) is performed:
step one, inputting a sample audio into a first sub-model included in an initial generation type confrontation network to obtain a first hidden vector corresponding to the sample audio.
And step two, inputting the first hidden vector corresponding to the sample audio into a fourth submodel to obtain a predicted mouth key point corresponding to the sample audio.
And step three, calculating a first function value of the first preset loss function based on the predicted mouth key point corresponding to the sample audio and the mouth key point extracted from the sample face image corresponding to the sample audio.
And step four, if the calculated first function value is less than or equal to a first preset threshold value, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
Optionally, if the calculated first function value is greater than the first preset threshold, the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network are updated, and the first training step is continuously performed on the basis of the initial generative adversarial network after the model parameters are updated.
It can be understood that, in the above optional implementation manner, whether the model parameters of the first sub-model and the model parameters of the fourth sub-model in the generative adversarial network can be used for inference is judged according to the magnitude of the first function value, and the trained generator in the generative adversarial network is adopted to generate the digital human video, so that the generation effect of the digital human video is improved; moreover, in the stage of using the generator, no additional sub-model is required to obtain key points, so that the generation efficiency of the digital human video can be improved.
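As a non-limiting illustration, the first training step might be sketched as follows; the optimizer, the L1 form of the first preset loss function, and the stopping logic are assumptions, and audio_encoder and keypoint_head stand for the first and fourth sub-models respectively.

```python
import torch
import torch.nn as nn

def first_training_step(audio_encoder, keypoint_head, sample_loader,
                        first_threshold: float = 1e-3, max_epochs: int = 100):
    """Train the first sub-model (audio_encoder) and the fourth sub-model (keypoint_head)
    until the first function value drops to the first preset threshold or below."""
    params = list(audio_encoder.parameters()) + list(keypoint_head.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    criterion = nn.L1Loss()  # assumed form of the first preset loss function

    for _ in range(max_epochs):
        for sample_audio, real_mouth_keypoints in sample_loader:
            first_hidden = audio_encoder(sample_audio)         # first hidden vector
            predicted_keypoints = keypoint_head(first_hidden)  # predicted mouth key points
            loss = criterion(predicted_keypoints, real_mouth_keypoints)
            if loss.item() <= first_threshold:
                # Keep the first sub-model's parameters for the trained end-to-end model.
                return audio_encoder.state_dict()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return audio_encoder.state_dict()
```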
Optionally, the step of training the end-to-end model may further include performing a second training step (including the first step to the sixth step) as follows.
A first step of inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio.
And a second step of inputting the target area image in the sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio.
And a third step of merging the first hidden vector corresponding to the sample audio and the second hidden vector corresponding to the sample audio to obtain a merged vector corresponding to the sample audio.
And a fourth step of inputting the merged vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio.
A fifth step of calculating a second function value of a second preset loss function based on the predicted target image corresponding to the sample audio and the target image extracted from the sample face image corresponding to the sample audio.
And a sixth step of determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model, if the calculated second function value is less than or equal to the second preset threshold value.
Optionally, if the calculated second function value is greater than the second preset threshold, the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generative adversarial network are updated, and the second training step is continuously performed on the basis of the initial generative adversarial network after the model parameters are updated.
It can be understood that, after the model parameters of the first sub-model and the model parameters of the fourth sub-model are fixed, whether the model parameters of the second sub-model and the third sub-model can be used for inference is judged through the magnitude of the second function value, and the generator in the trained generative adversarial network is adopted to generate the digital human video, so that the generation effect of the digital human video is improved; moreover, in the stage of using the generator, no additional sub-model is required to obtain key points, so that the generation efficiency of the digital human video can be further improved.
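As a non-limiting illustration, the second training step (with the first sub-model frozen) might be sketched as follows; here the two hidden vectors are assumed to be flat vectors of the same batch size before concatenation, and the optimizer and the L1 form of the second preset loss function are again assumptions.

```python
import torch
import torch.nn as nn

def second_training_step(audio_encoder, image_encoder, decoder, sample_loader,
                         second_threshold: float = 1e-3, max_epochs: int = 100):
    """Train the second/third sub-models while the first sub-model's parameters stay fixed."""
    audio_encoder.eval()  # first sub-model already trained in the first training step
    params = list(image_encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    criterion = nn.L1Loss()  # assumed form of the second preset loss function

    for _ in range(max_epochs):
        for sample_audio, region_image, real_target_image in sample_loader:
            with torch.no_grad():
                first_hidden = audio_encoder(sample_audio)   # first hidden vector, (B, d1)
            second_hidden = image_encoder(region_image)      # second hidden vector, (B, d2)
            merged = torch.cat([first_hidden, second_hidden], dim=1)
            predicted_target = decoder(merged)               # predicted target image
            loss = criterion(predicted_target, real_target_image)
            if loss.item() <= second_threshold:
                return image_encoder.state_dict(), decoder.state_dict()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return image_encoder.state_dict(), decoder.state_dict()
```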
Step 203, generating a digital human video based on the generated target image.
In the present embodiment, the execution body described above may generate a digital human video based on the respective target images generated.
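As a non-limiting illustration, the generated target images might be assembled into a digital human video as follows; the use of OpenCV, the frame rate, and the muxing of the target audio with ffmpeg are assumptions of this sketch.

```python
import cv2  # OpenCV; assumed available in the execution environment

def write_video(target_images, out_path: str = "digital_human.mp4", fps: int = 25):
    """Write the generated target images (H x W x 3, BGR uint8) as a video file."""
    height, width = target_images[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(out_path, fourcc, fps, (width, height))
    for frame in target_images:
        writer.write(frame)
    writer.release()
    # The target audio can then be muxed onto the silent video, e.g. with ffmpeg:
    #   ffmpeg -i digital_human.mp4 -i target_audio.wav -c:v copy -c:a aac output.mp4
```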
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the digital human video generation method according to the present embodiment. In fig. 3, a server 310 (i.e., the executing agent) first acquires a target audio 301 and a target face image 304. The server 310 inputs, for an audio frame 302 in the target audio 301, an audio frame sequence 303 corresponding to the audio frame 302 and a target region image 305 in the target face image 304 into a pre-trained end-to-end model 306, and generates a target image 307 corresponding to the audio frame 302, where the audio frame sequence 303 corresponding to the audio frame 302 is a sequence of consecutive audio frames including the audio frame 302 in the target audio 301, the target region image 305 is a region image except a mouth region image in the target face image 304, and the target image 307 corresponding to the audio frame 302 is used for instructing a person indicated by the target face image to emit audio indicated by the audio frame. Server 310 generates digital human video 308 based on generated target image 307.
The method provided by the foregoing embodiment of the present disclosure includes acquiring a target audio and a target face image, then, for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, generating a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target area image is an area image in the target face image except for a mouth area image, the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame, and finally, generating a digital human video based on the generated target image. Therefore, the target image used for generating the digital human video is directly obtained by adopting the end-to-end model, so that the efficiency of generating the digital human video is improved by improving the speed of generating the target image.
With further reference to fig. 4A, a flow 400 of yet another embodiment of a digital human video generation method is shown. The process of the digital human video generation method comprises the following steps:
Step 401, acquiring a target audio and a target face image.
Step 402, for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame to the first sub-model to obtain a first hidden vector, inputting a target region image in the target face image to the second sub-model to obtain a second hidden vector, merging the first hidden vector and the second hidden vector to obtain a merged vector, and inputting the merged vector to the third sub-model to obtain a target image corresponding to the audio frame.
The input data of the first sub-model is a sequence of audio frames corresponding to the audio frames, the output data of the first sub-model is a first hidden vector, the input data of the second sub-model is a target area image in the target face image, the output data of the second sub-model is a second hidden vector, the input data of the third sub-model comprises the first hidden vector and the second hidden vector, and the output data of the third sub-model comprises the target image.
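For illustration, the three sub-models may be composed into a single end-to-end module as sketched below; the layer choices, hidden-vector size, 5-frame/80-feature audio window, and 64 × 64 image resolution are placeholder assumptions rather than the configuration of the present disclosure.

```python
# Schematic PyTorch composition of the three sub-models (layer choices and
# tensor sizes are illustrative assumptions only).
import torch
import torch.nn as nn

class EndToEndModel(nn.Module):
    def __init__(self, hidden_dim=512):
        super().__init__()
        # First sub-model: audio frame sequence -> first hidden vector.
        self.first_sub_model = nn.Sequential(
            nn.Flatten(), nn.Linear(5 * 80, hidden_dim), nn.ReLU())
        # Second sub-model: target region image -> second hidden vector.
        self.second_sub_model = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, hidden_dim), nn.ReLU())
        # Third sub-model: merged vector -> target image.
        self.third_sub_model = nn.Sequential(
            nn.Linear(2 * hidden_dim, 3 * 64 * 64), nn.Sigmoid())

    def forward(self, audio_frame_sequence, target_region_image):
        first_hidden = self.first_sub_model(audio_frame_sequence)
        second_hidden = self.second_sub_model(target_region_image)
        merged = torch.cat([first_hidden, second_hidden], dim=1)   # merged vector
        out = self.third_sub_model(merged)
        return out.view(-1, 3, 64, 64)                             # target image

# One forward pass: a batch of one audio window (5 frames x 80 features)
# and one 64x64 RGB target region image.
model = EndToEndModel()
image = model(torch.randn(1, 5, 80), torch.randn(1, 3, 64, 64))
```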
Step 403, generating a digital human video based on the generated target image.
As an example, the digital human video generation method in the present embodiment may be performed as follows:
first, the format of data is described:
in this embodiment, the size of the face sketch in the digital human video generation method is 512 × 1, the size of the target face image is 512 × 3, and the face sketch and the target face image are combined (concatenated along the channel dimension) to form a whole with the size of 512 × 4.
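A minimal sketch of this data format, assuming the stated channel counts refer to a 512 × 512 spatial resolution (an assumption of this sketch):

```python
# Combine the face sketch (1 channel) and the target face image (3 channels)
# into a single 4-channel input; 512x512 spatial size is assumed.
import numpy as np

face_sketch = np.zeros((512, 512, 1), dtype=np.float32)   # e.g. Canny line drawing
face_image = np.zeros((512, 512, 3), dtype=np.float32)    # RGB target face image

combined = np.concatenate([face_sketch, face_image], axis=-1)
print(combined.shape)  # (512, 512, 4)
```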
Referring to fig. 4B, the implementation process of the specific scheme is described as follows:
after obtaining the user audio (i.e., the target audio), the user audio is processed by an encoder (i.e., the first sub-model) to generate an acoustic coding vector LM1 (the intermediate-layer hidden space of the CNN or LSTM, i.e., the first hidden vector); the acoustic coding vector LM1 is then synthesized, in a channel synthesis manner, with the original picture vector LM2 (the hidden space encoded by the sketch encoder from the original picture, i.e., the target region image) to obtain a channel synthesis vector LM3 (i.e., the merged vector, which contains the characteristics of the mouth and the face image); the channel synthesis vector LM3 is then processed by a decoder (i.e., the third sub-model), that is, it is input into the GAN generation model for decoding, to obtain a generated digital human picture (i.e., the target image); the digital human video (one video including multiple frames of pictures) is then output.
In the training phase, this can be performed by:
the training is divided into two stages:
in the first stage, sound (i.e., the sample audio) passes through CNN and LSTM, which are collectively called a model lmencor (i.e., the first submodel), and is fully connected (i.e., the third submodel), so as to obtain inferred 26 key points (i.e., the mouth key points, which may include, for example, 20 key points of the mouth and 6 key points of the chin), and the inferred 26 key points and real key points (i.e., the mouth key points extracted from the sample face image corresponding to the sample audio) to obtain a first function value of a first preset loss function, thereby training the lmencor.
In the second stage, after the first function value of the first preset loss function for the 26 key points has stabilized (for example, the calculated first function value is less than or equal to the first preset threshold), the model parameters of the LMEncoder are fixed, and the training of the encoder and decoder LipGAN is started. Specifically, the procedure is as follows:
first, video data is prepared, the video data including audio (i.e., sample audio) and pictures (i.e., sample face images corresponding to the sample audio).
Then, the data is processed at a frame rate of 25 frames per second: features are extracted from the audio, and face key points and the corresponding Canny lines are extracted from the pictures. Namely, for each video frame, audio features are extracted from the video audio (the sample audio), and 68 face key points are extracted from the video picture (i.e., the sample face image corresponding to the sample audio). The audio features may be MFCC features extracted via the Fourier transform, features extracted with a DeepSpeech model, or features extracted with other algorithms (e.g., an ASR, i.e., speech recognition, model).
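One possible way to extract the per-frame audio features is sketched below; the use of librosa, the 16 kHz sample rate, and a hop length chosen to align one MFCC column with each 25 fps video frame are assumptions of this sketch.

```python
# Per-frame MFCC extraction aligned to a 25 fps video (assumptions as noted).
import librosa

def extract_mfcc(wav_path, sr=16000, fps=25, n_mfcc=13):
    audio, _ = librosa.load(wav_path, sr=sr)
    hop_length = sr // fps                       # 640 samples -> one feature vector per video frame
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    return mfcc.T                                # (num_frames, n_mfcc)

# features = extract_mfcc("sample_video_audio.wav")   # hypothetical file name
```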
Then, as shown in fig. 4B, after the voice passes through the CNN and the LSTM, a voice coding vector LM1 is generated; 26 key points of the mouth are then generated through the fully connected layer (i.e., the 26 key points are obtained by inference), and the loss (i.e., the first function value) is calculated from the inferred 26 key points and the 26 key points of the real face mouth, so as to train the LMEncoder.
Subsequently, after the loss (the first function value) has stabilized, the LMEncoder parameters (i.e., the first sub-model) are fixed; that is, once the LMEncoder model is trained, the training of the encoder and decoder LipGAN (i.e., the second sub-model and the third sub-model) is started. Specifically, in the hidden layer, the hidden vector of the face picture with the mouth portion removed (i.e., the original picture vector LM2) and the hidden vector of the real person's voice (i.e., the voice coding vector LM1) are merged into a 1024 × 1 vector (i.e., the channel synthesis vector LM3, which contains the features of the mouth and the face picture), and the generated picture (i.e., the target image) is then output through the decoder.
It should be noted that, in both the first stage and the second stage, the mouth picture of one frame can be trained using one frame of audio data or multiple frames of audio data. Specifically, when one frame of mouth picture (i.e., the 26 face key points) is trained using N frames of audio data, for example when training the face mouth key points of the t-th frame picture, the 26 face mouth key points of the t-th frame picture can be trained using the audio data corresponding to the t-th frame and the t-1, t-2, ..., t-(N-1) frames, so as to improve the generation effect of the face mouth picture and make the generated digital human picture better. N may be greater than 1; in general, the larger N is, the better the generated mouth. For example, the final target image may be output using the current audio frame, the 4 frames preceding the current audio frame, and the mouth-removed picture of the current frame (i.e., the target area image).
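A small sketch of assembling the N-frame audio window (the current frame plus the previous N-1 frames); padding with the first frame at the start of the sequence is an assumption of this sketch.

```python
# Build the N-frame audio window used to generate the mouth of frame t.
def audio_window(audio_frames, t, n=5):
    """audio_frames: list of per-frame audio features; returns frames t-(n-1)..t."""
    window = []
    for i in range(t - (n - 1), t + 1):
        window.append(audio_frames[max(i, 0)])   # repeat frame 0 before the start
    return window

frames = [f"feat_{i}" for i in range(10)]
print(audio_window(frames, t=1))   # ['feat_0', 'feat_0', 'feat_0', 'feat_0', 'feat_1']
print(audio_window(frames, t=6))   # ['feat_2', 'feat_3', 'feat_4', 'feat_5', 'feat_6']
```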
In addition, a loss function of a new discriminator (namely, a fourth sub-model) can be added into the LipGAN to ensure the stability of image generation;
the audio of the present frame and the previous 4 frames (i.e., the audio frame sequence corresponding to the audio frame) together with the current real picture (i.e., the target area image) are input into the discriminator, and the smaller the loss, the better. Specifically, the 26 key points generated by inference from the audio of the present frame and the previous 4 frames, together with the 26 key points of the real face mouth of the current frame, are input into the discriminator to compute the loss; the smaller this loss is, the better, which indicates that the mouth generated by the generator is more realistic, i.e., the effect is good.
Conversely, five other frames of audio (i.e., the audio frame sequence corresponding to the non-audio frame, that is, audio not corresponding to the current frame) together with the current frame picture (i.e., the target area image) are input into the discriminator, and the larger the loss, the better. Specifically, the 26 key points generated by inference from 5 frames of audio that do not correspond to the current frame, together with the 26 key points of the real face mouth of the current frame, are input into the discriminator to compute the loss; the larger this loss is, the better, which likewise indicates that the mouth generated by the generator is more realistic, i.e., the effect is good.
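One standard way to realize such a synchronization discriminator is sketched below: matched audio/key-point pairs are pushed toward a "real" score and mismatched pairs toward a "fake" score. The MLP architecture and the binary cross-entropy loss are assumptions of this sketch, not the loss function specified by the present disclosure.

```python
# Hedged sketch of the extra discriminator loss described above.
import torch
import torch.nn as nn

discriminator = nn.Sequential(                   # the "new discriminator"
    nn.Linear(52 + 52, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
bce = nn.BCELoss()

inferred_kp_matched = torch.randn(8, 52)     # from the current frame + previous 4 frames of audio
inferred_kp_mismatch = torch.randn(8, 52)    # from 5 frames of audio not corresponding to the frame
real_kp = torch.randn(8, 52)                 # 26 real mouth key points of the current frame

# Matched pairs are scored against target 1 ("real"); mismatched pairs against
# target 0 ("fake"), i.e. the discriminator should give them a low score.
score_matched = discriminator(torch.cat([inferred_kp_matched, real_kp], dim=1))
score_mismatch = discriminator(torch.cat([inferred_kp_mismatch, real_kp], dim=1))
d_loss = bce(score_matched, torch.ones_like(score_matched)) + \
         bce(score_mismatch, torch.zeros_like(score_mismatch))
```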
In the inference (application) phase:
Firstly, the audio of the current frame and the previous 4 frames (i.e., the audio frame sequence corresponding to the audio frame), or the audio features extracted from it, is input into the model LMEncoder (i.e., the first sub-model) to obtain the hidden vector LM1 (i.e., the first hidden vector).
Then, the area of the current picture with the mouth removed (i.e., the target area image) is obtained, and the hidden vector IM2 (i.e., the second hidden vector) is obtained from it through the encoder (i.e., the second sub-model).
Finally, the hidden vector LM1 and the hidden vector IM2 are merged to obtain a merged hidden vector (i.e., the channel synthesis vector LM3, which contains the features of the mouth and the face image); this vector is input into the decoder (i.e., the third sub-model), and the final picture (i.e., the target image) is output. The digital human video is then output.
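The three inference steps can be summarized in a short sketch; the trained sub-models are assumed to be given, with shapes as in the earlier sketches.

```python
# Inference-time sketch corresponding to the three steps above.
import torch

@torch.no_grad()
def generate_target_image(lm_encoder, image_encoder, decoder,
                          audio_window, mouth_removed_image):
    lm1 = lm_encoder(audio_window)                 # first hidden vector
    im2 = image_encoder(mouth_removed_image)       # second hidden vector
    lm3 = torch.cat([lm1, im2], dim=1)             # merged / channel synthesis vector
    return decoder(lm3)                            # target image (one video frame)
```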
The sound inference model may be configured to extract audio features from the audio. The input sound may be in WAV format (a lossless audio file format), and the frame rate may be 100, 50, or 25. The sound features may be MFCC features, or features extracted by a model such as DeepSpeech, an ASR (speech recognition) model, or wav2vec. The sound inference model may be an LSTM, BERT (Bidirectional Encoder Representations from Transformers), a Transformer, a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or the like. The 3DMM is a statistical 3D morphable model of the human face; it is a relatively basic three-dimensional statistical model, first proposed to solve the problem of recovering a three-dimensional shape from a two-dimensional face image. Its authors collected three-dimensional head data for 200 faces and used this set of data as the basis for PCA (principal component analysis), obtaining principal components that can represent the shape and texture of the face.
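As background illustration of the PCA step behind the 3DMM mentioned above (not part of the claimed method), a toy sketch with made-up data dimensions:

```python
# Toy illustration of the PCA idea behind a 3DMM: a set of 3D face scans
# (random stand-in data for the 200 scans) is reduced to a few principal
# components spanning face shape variation.
import numpy as np

scans = np.random.rand(200, 3 * 1000)            # 200 faces, 1000 3D vertices each (made-up size)
mean_shape = scans.mean(axis=0)
_, _, vt = np.linalg.svd(scans - mean_shape, full_matrices=False)
components = vt[:50]                             # first 50 principal components

# Any face shape is then approximated as mean_shape + coefficients @ components.
coefficients = (scans[0] - mean_shape) @ components.T
reconstruction = mean_shape + coefficients @ components
```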
In this embodiment, the specific implementation manners of the steps 401 to 403 may refer to the related descriptions of the embodiment corresponding to fig. 2, and are not repeated herein. In addition, besides the above-mentioned contents, the embodiment of the present disclosure may further include the same or similar features and effects as the embodiment corresponding to fig. 2, and details are not repeated herein.
In the digital human video generation method, the digital human video is generated in an end-to-end manner: the audio is input and, combined with the hidden space encoded by the sketch encoder, the target image used for generating the digital human video is generated directly. That is, there is no need to obtain key points or to perform inverse normalization, so the efficiency is high; furthermore, the audio features need not be extracted separately, which further improves efficiency, while extracting audio features can yield a better generation effect. In addition, the loss function of the new discriminator (namely, the fourth sub-model) helps maintain the stability of target image generation.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a digital human video generating apparatus, which corresponds to the above-described method embodiment, and which may include the same or corresponding features as the above-described method embodiment and produce the same or corresponding effects as the above-described method embodiment, in addition to the features described below. The device can be applied to various electronic equipment.
As shown in fig. 5, the digital human video generating apparatus 500 of the present embodiment includes: an acquisition unit 501, an input unit 502, and a generation unit 503. The acquisition unit 501 is configured to acquire a target audio and a target face image; the input unit 502 is configured to input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that include the audio frame, the target region image is a region image in the target face image except a mouth region image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame; the generation unit 503 is configured to generate a digital human video based on the generated target image.
In the present embodiment, the acquisition unit 501 of the digital human video generating apparatus 500 may acquire a target audio and a target face image.
In this embodiment, the input unit 502 may input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that includes the audio frame, the target region image is a region image in the target face image except for a mouth region image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame.
In the present embodiment, the generation unit 503 may generate a digital human video based on the generated target image.
In some optional implementations of this embodiment, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model, where input data of the first sub-model is an audio frame sequence corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model includes the first hidden vector and the second hidden vector, and output data of the third sub-model includes a target image; and
the above generation unit, further configured to:
inputting the audio frame sequence corresponding to the audio frame to the first sub-model to obtain a first hidden vector;
inputting a target area image in the target face image into the second sub-model to obtain a second hidden vector;
merging the first hidden vector and the second hidden vector to obtain a merged vector;
and inputting the merged vector into the third sub-model to obtain a target image corresponding to the audio frame.
In some optional implementations of this embodiment, the end-to-end model is trained as follows:
acquiring video data;
extracting audio frames and face images corresponding to the audio frames from the video data, taking the audio frame sequence corresponding to the extracted audio frames as sample audio, and taking the extracted face images as sample face images;
and adopting a machine learning algorithm, taking the sample audio as input data of a generator in the generative confrontation network, obtaining a target image which corresponds to the sample audio and is generated by the generator, and taking the current generator as an end-to-end model if a discriminator in the generative confrontation network determines that the target image generated by the generator meets a preset training end condition.
In some optional implementations of this embodiment, the taking the sample audio as the input data of the generator in the generative confrontation network to obtain the target image generated by the generator corresponding to the sample audio, and if the discriminator in the generative confrontation network determines that the target image generated by the generator meets the preset training end condition, taking the current generator as an end-to-end model includes:
acquiring an initial generation type countermeasure network, wherein the initial generation type countermeasure network comprises a first submodel, a second submodel, a third submodel and a fourth submodel, the input data of the fourth submodel is a first hidden vector, and the output data of the fourth submodel is a key point of a mouth;
performing a first training step as follows:
inputting the sample audio into a first sub-model included in the initially generated confrontation network to obtain a first hidden vector corresponding to the sample audio;
inputting the first hidden vector corresponding to the sample audio into a fourth submodel to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on predicted mouth keypoints corresponding to the sample audio and mouth keypoints extracted from a sample face image corresponding to the sample audio;
and if the calculated first function value is less than or equal to a first preset threshold value, determining the model parameters of the first sub-model included by the current initially-generated countermeasure network as the model parameters of the first sub-model included by the trained end-to-end model.
In some optional implementations of this embodiment, the taking a sample audio as input data of a generator in a generative confrontation network to obtain a target image generated by the generator corresponding to the sample audio, and if a discriminator in the generative confrontation network determines that the target image generated by the generator meets a preset training end condition, taking a current generator as an end-to-end model further includes:
if the calculated first function value is larger than the first preset threshold value, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initially generated countermeasure network, and continuing to execute the first training step based on the initially generated countermeasure network after the model parameters are updated.
In some optional implementations of this embodiment, the taking a sample audio as input data of a generator in a generative confrontation network to obtain a target image generated by the generator corresponding to the sample audio, and if a discriminator in the generative confrontation network determines that the target image generated by the generator meets a preset training end condition, taking a current generator as an end-to-end model further includes:
performing a second training step as follows:
inputting the sample audio into a first sub-model included in the initially generated confrontation network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio into a second sub-model included in the initial generation type confrontation network to obtain a second hidden vector corresponding to the sample audio;
merging the first hidden vector corresponding to the sample audio and the second hidden vector corresponding to the sample audio to obtain a merged vector corresponding to the sample audio;
inputting the merged vector corresponding to the sample audio into a third sub-model included in the initial generation type countermeasure network to obtain a prediction target image corresponding to the sample audio;
calculating a second function value of a second preset loss function based on a predicted target image corresponding to the sample audio and a target image extracted from a sample face image corresponding to the sample audio;
and if the calculated second function value is less than or equal to a second preset threshold value, determining the model parameters of the second sub-model included by the current initial generation type countermeasure network as the model parameters of the second sub-model included by the trained end-to-end model, and determining the model parameters of the third sub-model included by the current initial generation type countermeasure network as the model parameters of the third sub-model included by the trained end-to-end model.
In some optional implementations of this embodiment, the taking a sample audio as input data of a generator in a generative confrontation network to obtain a target image generated by the generator corresponding to the sample audio, and if a discriminator in the generative confrontation network determines that the target image generated by the generator meets a preset training end condition, taking a current generator as an end-to-end model further includes:
and if the calculated second function value is larger than the second preset threshold value, updating the model parameters of the second sub-model and the third sub-model included in the current initially-generated countermeasure network, and continuing to execute the second training step based on the initially-generated countermeasure network after the model parameters are updated.
In some optional implementations of the present embodiment, the preset training end condition includes at least one of:
the function value of a preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
and the function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is greater than or equal to a second preset value.
In some optional implementations of this embodiment, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
In some optional implementations of this embodiment, the sequence of audio frames corresponding to the audio frame includes the audio frame and audio frames consecutive to a preset number of frames before the audio frame in the target audio.
In the apparatus 500 provided by the foregoing embodiment of the present disclosure, the acquisition unit 501 may acquire a target audio and a target face image; then the input unit 502 may input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target area image is an area image except a mouth area image in the target face image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to send out audio indicated by the audio frame; and finally, the generation unit 503 may generate a digital human video based on the generated target image. Therefore, the target image used for generating the digital human video is directly obtained by adopting the end-to-end model, so that the efficiency of generating the digital human video is improved by improving the speed of generating the target image.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, where the electronic device 600 shown in fig. 6 includes: at least one processor 601, memory 602, and at least one network interface 604 and other user interfaces 603. The various components in the electronic device 600 are coupled together by a bus system 605. It is understood that the bus system 605 is used to enable communications among the components. The bus system 605 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 605 in fig. 6.
The user interface 603 may include a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touch pad, or touch screen), among others.
It will be appreciated that the memory 602 in embodiments of the disclosure may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 602 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 602 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system 6021 and application programs 6022.
The operating system 6021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program 6022 includes various application programs such as a Media Player (Media Player), a Browser (Browser), and the like, and is used to implement various application services. Programs that implement methods of embodiments of the disclosure can be included in the application program 6022.
In the embodiment of the present disclosure, by calling a program or an instruction stored in the memory 602, specifically, a program or an instruction stored in the application program 6022, the processor 601 is configured to execute the method steps provided by the method embodiments, for example, including: acquiring a target audio and a target face image; for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generating a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target region image is a region image in the target face image except for a mouth region image, and the target image corresponding to the audio frame is used for indicating a person indicated by the target face image to send out an audio indicated by the audio frame; based on the generated target image, a digital human video is generated.
The method disclosed by the embodiments of the present disclosure may be applied to the processor 601 or implemented by the processor 601. The processor 601 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 601. The processor 601 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software elements in the decoding processor. The software elements may be located in RAM, flash memory, ROM, PROM, or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 602, and the processor 601 reads the information in the memory 602 and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented in one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be the electronic device shown in fig. 6, and may execute all the steps of the digital human video generation method shown in fig. 2, so as to achieve the technical effect of the digital human video generation method shown in fig. 2.
The disclosed embodiments also provide a storage medium (computer-readable storage medium). The storage medium herein stores one or more programs. Among others, the storage medium may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; the memory may also comprise a combination of memories of the kind described above.
When one or more programs in the storage medium are executable by one or more processors, the above-described digital human video generation method executed on the electronic device side is implemented.
The processor is configured to execute the communication program stored in the memory to implement the following steps of the digital human video generation method executed on the electronic device side: acquiring a target audio and a target face image; for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generating a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target region image is a region image in the target face image except for a mouth region image, and the target image corresponding to the audio frame is used for indicating a person indicated by the target face image to send out an audio indicated by the audio frame; based on the generated target image, a digital human video is generated.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present disclosure in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present disclosure, and are not intended to limit the scope of the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (13)

1. A method for generating a digital human video, the method comprising:
acquiring a target audio and a target face image;
for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generating a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of continuous audio frames containing the audio frame in the target audio, the target region image is a region image in the target face image except for a mouth region image, and the target image corresponding to the audio frame is used for indicating a person indicated by the target face image to send out an audio indicated by the audio frame;
based on the generated target image, a digital human video is generated.
2. The method of claim 1, wherein the end-to-end model comprises a first sub-model, a second sub-model and a third sub-model, wherein input data of the first sub-model is a sequence of audio frames corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model comprises the first hidden vector and the second hidden vector, and output data of the third sub-model comprises a target image; and
inputting the audio frame sequence corresponding to the audio frame and the target area image in the target face image into a pre-trained end-to-end model, and generating a target image corresponding to the audio frame, including:
inputting the audio frame sequence corresponding to the audio frame to the first sub-model to obtain a first hidden vector;
inputting a target area image in the target face image into the second sub-model to obtain a second hidden vector;
merging the first hidden vector and the second hidden vector to obtain a merged vector;
and inputting the merged vector to the third sub-model to obtain a target image corresponding to the audio frame.
3. The method of claim 2, wherein the end-to-end model is trained by:
acquiring video data;
extracting audio frames and face images corresponding to the audio frames from the video data, taking the audio frame sequence corresponding to the extracted audio frames as sample audio, and taking the extracted face images as sample face images;
and adopting a machine learning algorithm, taking the sample audio as input data of a generator in the generative confrontation network to obtain a target image which corresponds to the sample audio and is generated by the generator, and taking the current generator as an end-to-end model if a discriminator in the generative confrontation network determines that the target image generated by the generator meets a preset training end condition.
4. The method of claim 3, wherein the using the sample audio as input data of a generator in the generative confrontation network obtains a target image generated by the generator corresponding to the sample audio, and if a discriminator in the generative confrontation network determines that the target image generated by the generator meets a preset training end condition, then using the current generator as an end-to-end model comprises:
acquiring an initial generation type countermeasure network, wherein the initial generation type countermeasure network comprises a first submodel, a second submodel, a third submodel and a fourth submodel, the input data of the fourth submodel is a first hidden vector, and the output data of the fourth submodel is a key point of a mouth;
performing a first training step as follows:
inputting the sample audio into a first sub-model included in the initially generated confrontation network to obtain a first hidden vector corresponding to the sample audio;
inputting the first hidden vector corresponding to the sample audio into a fourth submodel to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on predicted mouth keypoints corresponding to the sample audio and mouth keypoints extracted from a sample face image corresponding to the sample audio;
and if the calculated first function value is less than or equal to a first preset threshold value, determining the model parameters of the first sub-model included by the current initially-generated countermeasure network as the model parameters of the first sub-model included by the trained end-to-end model.
5. The method of claim 4, wherein the using the sample audio as the input data of the generator in the generative confrontation network obtains the target image generated by the generator corresponding to the sample audio, and if the discriminator in the generative confrontation network determines that the target image generated by the generator meets the preset training end condition, then using the current generator as an end-to-end model, further comprises:
if the calculated first function value is larger than the first preset threshold value, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model which are included in the current initial generation type countermeasure network, and continuing to execute the first training step based on the initial generation type countermeasure network after the model parameters are updated.
6. The method of claim 4, wherein the using the sample audio as the input data of the generator in the generative confrontation network obtains the target image generated by the generator corresponding to the sample audio, and if the discriminator in the generative confrontation network determines that the target image generated by the generator meets the preset training end condition, then using the current generator as an end-to-end model, further comprises:
performing a second training step as follows:
inputting the sample audio into a first sub-model included in the initially generated confrontation network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio into a second sub-model included in the initial generation type confrontation network to obtain a second hidden vector corresponding to the sample audio;
merging the first hidden vector corresponding to the sample audio and the second hidden vector corresponding to the sample audio to obtain a merged vector corresponding to the sample audio;
inputting the merged vector corresponding to the sample audio into a third sub-model included in the initial generation type countermeasure network to obtain a prediction target image corresponding to the sample audio;
calculating a second function value of a second preset loss function based on a predicted target image corresponding to the sample audio and a target image extracted from a sample face image corresponding to the sample audio;
and if the calculated second function value is less than or equal to a second preset threshold value, determining the model parameters of the second sub-model included by the current initial generation type countermeasure network as the model parameters of the second sub-model included by the trained end-to-end model, and determining the model parameters of the third sub-model included by the current initial generation type countermeasure network as the model parameters of the third sub-model included by the trained end-to-end model.
7. The method of claim 6, wherein the using the sample audio as the input data of the generator in the generative confrontation network obtains the target image generated by the generator corresponding to the sample audio, and if the discriminator in the generative confrontation network determines that the target image generated by the generator meets the preset training end condition, then using the current generator as an end-to-end model, further comprises:
and if the calculated second function value is larger than the second preset threshold value, updating the model parameters of the second sub-model and the third sub-model included in the current initial generation type countermeasure network, and continuously executing the second training step based on the initial generation type countermeasure network after the model parameters are updated.
8. The method according to one of claims 3 to 7, wherein the preset training end condition comprises at least one of:
the function value of a preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
and the function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is greater than or equal to a second preset value.
9. The method according to one of claims 2 to 7, wherein the second submodel is an encoder and the third submodel is a decoder corresponding to the encoder.
10. The method according to any of claims 1-7, wherein the sequence of audio frames corresponding to the audio frame comprises the audio frame and a predetermined number of consecutive audio frames preceding the audio frame in the target audio.
11. A digital human video generating apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire a target audio and a target face image;
an input unit, configured to input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target region image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that include the audio frame, the target region image is a region image in the target face image except a mouth region image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to emit audio indicated by the audio frame;
a generating unit configured to generate a digital human video based on the generated target image.
12. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing a computer program stored in the memory, and when executed, implementing the method of any of the preceding claims 1-10.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of the preceding claims 1 to 10.
CN202111169280.5A 2021-09-30 2021-09-30 Digital human video generation method and device, electronic equipment and storage medium Pending CN113987269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111169280.5A CN113987269A (en) 2021-09-30 2021-09-30 Digital human video generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111169280.5A CN113987269A (en) 2021-09-30 2021-09-30 Digital human video generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113987269A true CN113987269A (en) 2022-01-28

Family

ID=79737725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111169280.5A Pending CN113987269A (en) 2021-09-30 2021-09-30 Digital human video generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113987269A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278297A (en) * 2022-06-14 2022-11-01 北京达佳互联信息技术有限公司 Data processing method, device and equipment based on drive video and storage medium
CN115278297B (en) * 2022-06-14 2023-11-28 北京达佳互联信息技术有限公司 Data processing method, device, equipment and storage medium based on drive video
WO2024078293A1 (en) * 2022-10-14 2024-04-18 北京字跳网络技术有限公司 Image processing method and apparatus, electronic device, and storage medium
CN115937375A (en) * 2023-01-05 2023-04-07 深圳市木愚科技有限公司 Digital body-separating synthesis method, device, computer equipment and storage medium
CN115937375B (en) * 2023-01-05 2023-09-29 深圳市木愚科技有限公司 Digital split synthesis method, device, computer equipment and storage medium
CN117372553A (en) * 2023-08-25 2024-01-09 华院计算技术(上海)股份有限公司 Face image generation method and device, computer readable storage medium and terminal
CN117372553B (en) * 2023-08-25 2024-05-10 华院计算技术(上海)股份有限公司 Face image generation method and device, computer readable storage medium and terminal
CN117593473A (en) * 2024-01-17 2024-02-23 淘宝(中国)软件有限公司 Method, apparatus and storage medium for generating motion image and video

Similar Documents

Publication Publication Date Title
CN113987269A (en) Digital human video generation method and device, electronic equipment and storage medium
CN109785824B (en) Training method and device of voice translation model
CN113886643A (en) Digital human video generation method and device, electronic equipment and storage medium
CN110288980A (en) Audio recognition method, the training method of model, device, equipment and storage medium
CN108984679B (en) Training method and device for dialogue generation model
CN117094419B (en) Multi-modal content output-oriented large language model training method, device and medium
CN113886644A (en) Digital human video generation method and device, electronic equipment and storage medium
CN112733616B (en) Dynamic image generation method and device, electronic equipment and storage medium
WO2022062800A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
WO2023173890A1 (en) Real-time voice recognition method, model training method, apparatus, device, and storage medium
CN112804558B (en) Video splitting method, device and equipment
CN113299312A (en) Image generation method, device, equipment and storage medium
CN114581980A (en) Method and device for generating speaker image video and training face rendering model
US20230394306A1 (en) Multi-Modal Machine Learning Models with Improved Computational Efficiency Via Adaptive Tokenization and Fusion
CN113111812A (en) Mouth action driving model training method and assembly
CN114255737B (en) Voice generation method and device and electronic equipment
CN114567693A (en) Video generation method and device and electronic equipment
CN113178200B (en) Voice conversion method, device, server and storage medium
CN112735377B (en) Speech synthesis method, device, terminal equipment and storage medium
CN113903338A (en) Surface labeling method and device, electronic equipment and storage medium
CN113763232B (en) Image processing method, device, equipment and computer readable storage medium
CN116310004A (en) Virtual human teaching animation generation method, device, computer equipment and storage medium
CN117689745A (en) Generating images from text based on hints
JP7352243B2 (en) Computer program, server device, terminal device, learned model, program generation method, and method
KR20230141932A (en) Adaptive visual speech recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination