CN113901894A - Video generation method, device, server and storage medium - Google Patents

Video generation method, device, server and storage medium

Info

Publication number
CN113901894A
CN113901894A
Authority
CN
China
Prior art keywords
face
image
audio data
video
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111109871.3A
Other languages
Chinese (zh)
Inventor
杨跃
董治
雷兆恒
梅立锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202111109871.3A priority Critical patent/CN113901894A/en
Publication of CN113901894A publication Critical patent/CN113901894A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Abstract

The embodiment of the application discloses a video generation method, a video generation device, a server and a storage medium, wherein the method comprises the following steps: acquiring human voice audio data and acquiring a face image of a target object; generating simulated facial expression parameters according to the human voice audio data, and generating three-dimensional face parameters of the target object according to the face image; generating an initial dynamic face video of the target object according to the simulated facial expression parameters and the three-dimensional face parameters; and correcting the lip shape of the face in the initial dynamic face video according to the human voice audio data to obtain a target dynamic face video, wherein the lip shape of the face in the target dynamic face video corresponds to the voice content in the human voice audio data. A dynamic face video with realistic visual and auditory effects can thus be generated.

Description

Video generation method, device, server and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a video generation method, an apparatus, a server, and a storage medium.
Background
With the rapid development of computer technology, computer vision is increasingly applied to daily life, work and study. For example, a large number of synthesized videos exist on the network, such as facial expression videos generated from a piece of speech and a face image. However, when a facial expression video is currently synthesized from voice and a face image, usually only the mouth movement of the face is considered; the synchronization between the overall facial expression and the voice content is poor, and a face that only moves its mouth appears relatively stiff, which results in a poor video effect of the synthesized facial expression video. Therefore, how to synthesize a more realistic facial expression video has become a hot research problem in current computer vision technology.
Disclosure of Invention
The embodiment of the application provides a video generation method, a video generation device, a server and a storage medium, so that the dynamic face video can achieve better effects in two dimensions of expression and lip shape, and the dynamic face video with vivid visual and auditory effects is generated.
A first aspect of an embodiment of the present application discloses a video generation method, where the method includes:
acquiring voice audio data and acquiring a face image of a target object;
generating simulated facial expression parameters according to the human voice audio data, and generating three-dimensional face parameters of the target object according to the face image;
generating an initial dynamic face video of the target object according to the simulated face expression parameters and the three-dimensional face parameters;
correcting the lip shape of the face in the initial dynamic face video according to the voice audio data to obtain a target dynamic face video; the lip shape of the face in the target dynamic face video corresponds to the voice content in the voice audio data.
A second aspect of the embodiments of the present application discloses a video generating apparatus, including:
the acquisition unit is used for acquiring human voice audio data and acquiring a human face image of a target object;
the first generating unit is used for generating simulated facial expression parameters according to the human voice audio data and generating three-dimensional face parameters of the target object according to the face image;
the second generating unit is used for generating an initial dynamic face video of the target object according to the simulated face expression parameters and the three-dimensional face parameters;
the correction unit is used for correcting the lip shape of the face in the initial dynamic face video according to the voice audio data to obtain a target dynamic face video; the lip shape of the face in the target dynamic face video corresponds to the voice content in the voice audio data.
In a third aspect of embodiments of the present application, a server is disclosed, which includes a processor, a memory, and a network interface, where the processor, the memory, and the network interface are connected to each other, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method of the first aspect.
A fourth aspect of the embodiments of the present application discloses a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program includes program instructions, which, when executed by a processor, cause the processor to execute the method of the first aspect.
A fifth aspect of embodiments of the present application discloses a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the server reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the server to perform the method of the first aspect described above.
In the embodiment of the application, the server can acquire the human voice audio data and the face image of the target object, and generate the initial dynamic face video of the target object based on the human voice audio data and the face image of the target object. When the initial dynamic face video of the target object is generated based on the voice audio data and the face image of the target object, the simulated face expression parameters can be generated according to the voice audio data, and the three-dimensional face parameters of the target object can be generated according to the face image, so that the initial dynamic face video aiming at the target object can be generated according to the simulated face expression parameters and the three-dimensional face parameters. Further, the lip shape of the face in the initial dynamic face video may be modified according to the voice audio data to obtain a target dynamic face video, where the lip shape of the face in the target dynamic face video corresponds to the voice content in the voice audio data. By implementing the method, the human face expression change video related to the human voice audio content can be generated through the three-dimensional human face parameters and the image generated by the human voice audio and simulating the human face expression parameters. And the lip shape included in the face of the facial expression change video can be synchronized with the voice audio content based on the lip shape generation method. Therefore, the finally obtained dynamic face video can achieve better effects in two dimensions of expression and lip shape, and further can generate the dynamic face video with vivid visual and auditory effects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic architecture diagram of a video generation system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a video generation method according to an embodiment of the present application;
fig. 3a is a schematic structural diagram of acquiring human voice audio data and a human face image according to an embodiment of the present application;
fig. 3b is a schematic structural diagram of determining an initial face image according to an embodiment of the present application;
fig. 3c is a schematic structural diagram of a lip shape generation countermeasure model according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a video generation method according to an embodiment of the present application;
fig. 5a is a schematic structural diagram of determining converted human voice audio data according to an embodiment of the present application;
fig. 5b is a schematic structural diagram of a tone conversion model provided in an embodiment of the present application;
fig. 5c is a schematic structural diagram of a face repairing model according to an embodiment of the present application;
fig. 5d is a schematic structural diagram of a GAN prior network model provided in an embodiment of the present application;
fig. 5e is a schematic structural diagram of a GAN block according to an embodiment of the present application;
fig. 5f is a schematic structural diagram of a face repairing confrontation model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a video recommendation scene according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer Vision (CV) technology is a science that studies how to make machines "see"; it uses cameras and computers in place of human eyes to identify, track and measure targets, and further processes the resulting images so that they are more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application relates to the video processing, image processing and other technologies in the artificial intelligence computer vision technology, and is specifically explained by the following embodiments:
in an embodiment of the present application, there is provided a video generation method, whose general principle is as follows: first, human voice audio data and a face image of a target object may be acquired, and an initial dynamic face video for the target object may be generated based on the human voice audio data and the face image of the target object. When the initial dynamic face video aiming at the target object is generated based on the voice audio data and the face image of the target object, the simulated face expression parameters can be generated according to the voice audio data, the three-dimensional face parameters of the target object can be generated according to the face image, and therefore the initial dynamic face video of the target object can be generated according to the simulated face expression parameters and the three-dimensional face parameters. Further, the lip shape of the face in the initial dynamic face video may be modified according to the voice audio data to obtain a target dynamic face video, where the lip shape of the face in the target dynamic face video corresponds to the voice content in the voice audio data. In the application, a facial expression change video related to the content of the human voice audio can be generated through the three-dimensional facial parameters and the image generated by the simulated facial expression parameters generated by the human voice audio. And the lip shape included in the face of the facial expression change video can be synchronized with the voice audio content based on the lip shape generation method. Therefore, the finally obtained dynamic face video can achieve better effects in two dimensions of expression and lip shape, and further can generate the dynamic face video with vivid visual and auditory effects.
It should be noted that the method for generating a video provided by the present application can be specifically applied to a system for generating a video, please refer to fig. 1, and fig. 1 is a schematic structural diagram of a video generation system provided by an embodiment of the present application. The application relates to a terminal and a server.
Taking the terminal as an example, the target object may input a face image and voice audio data of the target object on the terminal interface, and the terminal may obtain the face image and the voice audio data input by the target object. After the terminal acquires the face image and the voice audio data, the face image and the voice audio data can be uploaded to the server, so that the server acquires the face image and the voice audio data, an initial dynamic face video is generated according to the face image and the voice audio data, and a target dynamic face video is generated according to the initial dynamic face video. After the target dynamic face video is generated, the server can also return the target dynamic face video to the terminal so that the target dynamic face video is displayed on the terminal interface.
The terminal shown in fig. 1 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, or other devices, and may also be an external device such as a handle, a touch screen, or other devices; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited herein.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
referring to fig. 2, fig. 2 is a schematic flowchart illustrating a video generation method according to an embodiment of the present disclosure. The method is applied to a server and can be executed by the server, as shown in fig. 2, the video generation method can include:
s201: and acquiring voice audio data and acquiring a face image of the target object.
The human voice audio data may be a speaking voice of a certain object, or a song, etc.
In an implementation manner, an object in the present application may refer to a user, and thus, a target object may refer to any user, and when the target object needs to generate a dynamic face video including human voice audio data of a face image by using human voice audio data and a face image of the target object, the target object may perform a related operation through a user operation interface output by a terminal, so that a computer device may acquire the human voice audio data and the face image. The lip shape of the face in the dynamic face video corresponds to the voice content in the voice audio data, and the correspondence can be understood as that the lip shape of the face changes correspondingly along with the voice content, and the pronunciation of each word in the voice content is consistent with the corresponding lip shape (mouth shape) of the face. See, for example, FIG. 3 a: the object terminal (or terminal) used by the target object may display a user interface in the terminal screen, which may include at least an input data setting area, labeled 301, and a confirmation control, labeled 302. The input data setting area comprises an audio input setting item and an image input setting item, the audio input setting item is used for inputting human voice audio data by the target object, and the image input setting item is used for inputting a human face image by the target object. If the target object wants to generate a dynamic face video containing the human voice audio data of the face image, the target object may input a human voice audio data and a face image in the input data setting area 301. For example, the voice audio data may be a speech of the target object or a song sung by the target object, or may be a speech of other objects or a song sung by other objects; the face image may be a photograph containing the face of the target object. After the target object inputs the voice audio data and the face image, a trigger operation (such as a click operation, a press operation, or the like) may be performed on the confirmation control 302, so that the terminal used by the target object may acquire the voice audio data and the face image in the input data setting area 301 and send the voice audio data and the face image to the server.
S202: and generating a simulated facial expression parameter according to the human voice audio data, and generating a three-dimensional facial parameter of the target object according to the facial image.
In one implementation, in order to make the finally generated video be a video based on facial expression changes, the image with facial expression changes may be determined to be generated by combining the human voice audio data and the facial image, so that a video based on facial expression changes may be generated according to the image with facial expression changes. Specifically, the simulated facial expression parameters can be generated according to the human voice audio data, and the three-dimensional facial parameters for the target object can be generated according to the facial image, so that subsequent processing can be performed according to the simulated facial expression parameters and the three-dimensional facial parameters.
In one implementation, a facial expression mapping method may be introduced in the process of generating the simulated facial expression parameters according to the human voice audio data. The method may be implemented by an expression parameter extraction model, through which facial expression parameters can be estimated from a piece of audio. For example, the human voice audio data may be input into the expression parameter extraction model, which processes the audio to generate the simulated facial expression parameters. Optionally, the expression parameter extraction model may first perform feature conversion on the human voice audio data to obtain speech feature parameters, which may be Mel Frequency Cepstrum Coefficient (MFCC) features. MFCC features are obtained by cepstral analysis of the Mel spectrum: the logarithm of the Mel spectrum is taken and an inverse transformation is applied, usually implemented by a discrete cosine transform, and the 2nd to 13th coefficients after the discrete cosine transform may be taken as the MFCC features. In the present application, the MFCC features serve as the speech feature parameters of the human voice audio data. After the speech feature parameters are obtained, feature migration may be performed on them based on the expression parameter extraction model to obtain target audio features of the human voice audio data, which may be high-order audio features. Compared with low-order features (generally related to contours, edges, colors, shapes and the like, and containing less semantic information), high-order features contain more semantic information and can subsequently be used to represent a complete face image. The extracted high-order features are effectively independent of any specific identity and contain enough semantic information to predict the simulated facial expression parameters required later. After the target audio features are obtained, expression parameter mapping may be performed on them to obtain the simulated facial expression parameters.
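For intuition, the following minimal sketch (our own illustration, not code from the application) shows how MFCC-style speech feature parameters could be extracted from human voice audio data, assuming the librosa library is available; the feature-migration and mapping steps are only indicated in comments.

```python
# Minimal sketch: extracting MFCC-style speech feature parameters from human
# voice audio data, assuming librosa. Sample rate and frame counts are examples.
import librosa

def extract_speech_features(audio_path, n_mfcc=13):
    # Load the human voice audio data at a fixed sample rate.
    y, sr = librosa.load(audio_path, sr=16000)
    # MFCCs: cepstral analysis of the mel spectrum (log + discrete cosine transform).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (num_frames, n_mfcc)

# A feature-migration network (e.g. an AT-Net-style model producing 256-dimensional
# high-order audio features) would consume these MFCC frames; it is omitted here.
```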
For example, the expression parameter extraction model may include three networks: a feature conversion network, a feature migration network and a mapping network. The feature conversion network can be used to convert the input human voice audio data into MFCC features (the speech feature parameters); the feature migration network is used to convert the input MFCC features into the target audio features, which may be high-order audio features. The feature migration network may be AT-Net, and the 256-dimensional output features of the penultimate layer of AT-Net may be used as the target audio features; these output features are effectively independent of any specific identity and contain enough feature semantic information for the subsequent mapping. Optionally, a 256-dimensional output feature may be extracted for every 40 milliseconds of human voice audio, corresponding to one image frame of a video containing 25 frames per second. The mapping network may be an audio-to-facial-expression mapping network; the target audio features serve as its input, and it predicts the expression parameters of the face, which may be referred to as the simulated facial expression parameters. The network structure of the mapping network may be as shown in Table 1:
Table 1:

    Type                    Kernel    Stride    Output
    Convolutional layer     3         1         5 × 254
    Convolutional layer     3         1         3 × 252
    Convolutional layer     3         1         1 × 250
    Fully connected layer   -         -         64
As can be seen from Table 1, the mapping network may include three one-dimensional convolutional layers and one fully connected layer. Each convolutional layer uses a one-dimensional convolution kernel of size 3 with a stride of 1; the output of the first convolutional layer is a 5 × 254 matrix, the output of the second convolutional layer is a 3 × 252 matrix, and the output of the third convolutional layer is a 1 × 250 matrix. The fully connected layer may include 64 nodes, i.e. it reduces the length-250 vector output by the third convolutional layer to a length-64 vector, which is the expression parameter of the face. In other words, the fully connected layer of 64 nodes outputs the predicted facial expression parameters. The loss function corresponding to the mapping network may be the Mean Square Error (MSE), which may be expressed as L_exp = MSE(H(F_t) - δ_t), where H(·) denotes the effect of the mapping network, F_t denotes the target audio feature, H(F_t) denotes the predicted facial expression parameters, and δ_t denotes the real facial expression parameters.
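As an illustration of the mapping network in Table 1, a PyTorch sketch is given below. The assumption that seven stacked frames of 256-dimensional target audio features form the input channels is ours; the application only specifies the per-layer outputs.

```python
# Sketch of the audio-to-expression mapping network of Table 1 (three 1-D
# convolutions, kernel 3, stride 1, then a 64-node fully connected layer).
import torch
import torch.nn as nn

class ExpressionMappingNet(nn.Module):
    def __init__(self, in_channels=7, feat_len=256, out_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 5, kernel_size=3, stride=1),  # -> 5 x 254
            nn.ReLU(),
            nn.Conv1d(5, 3, kernel_size=3, stride=1),            # -> 3 x 252
            nn.ReLU(),
            nn.Conv1d(3, 1, kernel_size=3, stride=1),            # -> 1 x 250
            nn.ReLU(),
        )
        self.fc = nn.Linear(feat_len - 6, out_dim)                # 250 -> 64

    def forward(self, audio_feat):            # audio_feat: (batch, 7, 256)
        x = self.conv(audio_feat)             # (batch, 1, 250)
        return self.fc(x.squeeze(1))          # (batch, 64) expression parameters

# Training would minimize the mean squared error between the predicted and the
# real facial expression parameters, as in L_exp = MSE(H(F_t) - delta_t).
```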
In one implementation, a three-dimensional face parameter for a target object may be generated according to a face image, where the three-dimensional face parameter may include an initial face expression parameter and a face shape parameter of the target object, and the face shape parameter may include a geometric shape parameter, a face texture parameter, a head posture parameter, and an illumination parameter, where the geometric shape parameter may be used to represent a face shape, the face texture parameter may be used to represent a face texture, the head posture parameter may be used to represent a head pose corresponding to a face, and the illumination parameter may be used to represent illumination on the face. Optionally, a three-dimensional face model may be constructed to obtain three-dimensional face parameters for the target object. For example, a face image may be input into the three-dimensional face construction model, so that face key points of a face of a target object in the face image may be extracted based on the three-dimensional face construction model, and then, in the three-dimensional face construction model, face reconstruction may be performed on the target object based on the face key points to obtain three-dimensional face parameters of the target object.
For example, the three-dimensional face construction model may be a Detailed Expression Capture and Animation (DECA) model. The DECA model can perform three-dimensional face reconstruction from a single picture, reconstructing a three-dimensional head model with head pose, shape and detailed face geometry parameters from a single input image. The DECA model introduces a detail consistency loss with which identity-related and expression-related details of an object can be decomposed; this makes it possible to synthesize realistic wrinkles by controlling the expression parameters while keeping the object-specific details unchanged. That is, the DECA model can be used to obtain object-specific detail parameters, which are specific to an object and can vary with the change of expression.
S203: and generating an initial dynamic face video of the target object according to the simulated face expression parameters and the three-dimensional face parameters.
In one implementation, as can be seen above, the three-dimensional face parameters may include initial facial expression parameters and facial morphology parameters of the target object; the human voice audio data can have multiple frames of audio data, and one frame of audio data corresponds to one group of simulated facial expression parameters. The specific implementation of generating the initial dynamic face video for the target object according to the simulated facial expression parameters and the three-dimensional face parameters may be as follows: firstly, the initial facial expression parameters in the three-dimensional facial parameters can be respectively replaced by the simulated facial expression parameters corresponding to each frame of audio data, and the target facial parameters corresponding to each frame of audio data of the target object are obtained. And then generating initial face images of target objects respectively corresponding to each frame of audio data according to the target face parameters respectively corresponding to each frame of audio data. Therefore, the initial dynamic face video can be generated according to the initial face image corresponding to each frame of audio data. For example, the initial face images respectively corresponding to each frame of audio data can be synthesized into an initial dynamic face video through the FFMPEG tool. The FFMPEG tool is a tool that can be used for audio-video processing, with which a series of image frames can be combined into one video.
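A simplified sketch of this step is shown below, under our own assumptions about the data layout; render_face_frame is a hypothetical stand-in for the 3D face rendering step, and FFMPEG is invoked as described above to assemble the frames into the initial dynamic face video.

```python
# Illustrative sketch of step S203: per-frame simulated expression parameters
# replace the initial expression parameters of the three-dimensional face
# parameters, each merged parameter set is rendered to an image, and FFMPEG
# assembles the frames into the initial dynamic face video.
import os
import subprocess

def build_initial_video(face_params, per_frame_expressions, render_face_frame,
                        out_path="initial_face.mp4", fps=25):
    os.makedirs("frames", exist_ok=True)
    for i, expr in enumerate(per_frame_expressions):
        target_params = dict(face_params)         # geometry, texture, pose, lighting...
        target_params["expression"] = expr        # replace initial expression parameters
        image = render_face_frame(target_params)  # hypothetical renderer call
        image.save(f"frames/{i:05d}.png")
    # Combine the per-frame images into a video with FFMPEG (25 frames per second).
    subprocess.run(["ffmpeg", "-y", "-framerate", str(fps),
                    "-i", "frames/%05d.png",
                    "-c:v", "libx264", "-pix_fmt", "yuv420p", out_path], check=True)
```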
In one implementation, after the initial face image is synthesized from the target face parameters, an initial dynamic face video generated directly from that image has poor fidelity: for example, it may contain only a face, with no hair or background matching the face, and the face may not look realistic enough. Rendering processing may therefore be performed on the initial face image, so that the initial dynamic face video is generated from the rendered initial face image. Optionally, image rendering may be performed on the initial face images corresponding to each frame of audio data to obtain rendered face images corresponding to each frame of audio data. After the rendered face images are obtained, the initial dynamic face video can be generated from the rendered face image corresponding to each frame of audio data.
Optionally, the step of image rendering may include: the synthesized initial face image may first be rendered by a rendering engine, where the rendering mainly concerns the face region. It can be understood that an initial face image rendered by the rendering engine contains no hair or background area, whereas hair and background areas are generally necessary for the realism of a dynamic face video; for a real singing video, for example, hair and a background are indispensable. For convenience of description, the initial face image rendered by the rendering engine may be referred to as a first rendered face image. Then, face alignment may be implemented by a 3D Dense Face Alignment (3DDFA) method to further render the first rendered face image and obtain a second rendered face image, in which there are hair and a background matching the face. After this rendering operation, the second rendered face image may still look computer-synthesized, i.e. it is not yet very realistic. To make the face in the second rendered face image more vivid and natural, a neural face rendering model may be used to convert the synthetic frame in the second rendered face image into a real frame; the face image produced by the neural face rendering model is more vivid and may be referred to as the rendered face image. The neural face rendering model may be a generation confrontation rendering model composed of a generation network and a discrimination network. It can be understood that the model structure used when training the neural face rendering model is a generation confrontation structure, and when the model is applied to an actual rendering scene, only the generation network in the neural face rendering model is used. The generation network in the neural face rendering model may include a rendering-encoding network and a rendering-decoding network. The rendering-encoding network may use convolutional layers to downsample its input data, followed by a normalization layer and an activation layer (ReLU) for batch normalization and activation. The rendering-decoding network may synthesize a high-quality output from a low-dimensional latent representation through upsampling, transposed convolution, batch normalization, Dropout layers and an activation function (ReLU). As can be seen, after the rendering processing, a relatively realistic rendered face image can be obtained.
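The following PyTorch sketch illustrates a generator of the kind described for the neural face rendering model (convolutional downsampling with batch normalization and ReLU in the rendering-encoding network; transposed convolution, batch normalization, Dropout and ReLU in the rendering-decoding network). Channel widths and depth are our assumptions.

```python
# Rough sketch of the generation network of the neural face rendering model.
import torch.nn as nn

def encoder_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def decoder_block(cin, cout, dropout=0.0):
    layers = [nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
              nn.BatchNorm2d(cout)]
    if dropout:
        layers.append(nn.Dropout(dropout))
    layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class NeuralFaceRenderer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(encoder_block(3, 64), encoder_block(64, 128),
                                    encoder_block(128, 256))
        self.decode = nn.Sequential(decoder_block(256, 128, dropout=0.5),
                                    decoder_block(128, 64),
                                    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
                                    nn.Tanh())

    def forward(self, synthetic_frame):                      # (batch, 3, H, W)
        return self.decode(self.encode(synthetic_frame))     # more realistic frame
```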
For example, as shown in fig. 3b, which is a schematic structural diagram of obtaining an initial facial image, as shown in fig. 3b, human voice and audio data may be input into the expression parameter extraction model to obtain simulated facial expression parameters. Meanwhile, the face image can also be input into the three-dimensional face construction model to obtain the three-dimensional face parameters of the target object, wherein the three-dimensional face parameters can comprise the initial face expression parameters and the face shape parameters of the target object. Then, the initial facial expression parameters in the three-dimensional facial parameters can be replaced by the simulated facial expression parameters respectively, so as to obtain target facial parameters corresponding to the target object. And further generating an initial face image of the target object according to the target face parameters.
S204: and correcting the lip shape of the face in the initial dynamic face video according to the voice audio data to obtain a target dynamic face video.
In one implementation, the lip shape of the face in the initial dynamic face video is relatively rough and lacks detail change, and in order to make the video more realistic, the lip shape of the face in the initial dynamic face video can be optimized to obtain a realistic video. For example, the lip shape of the face in the initial dynamic face video may be modified according to the voice audio data to obtain a target dynamic face video after the lip shape of the face is modified, where the lip shape of the face in the target dynamic face video corresponds to the voice content in the voice audio data. Optionally, the lip shape of the face in the initial dynamic face video may be modified according to the lip shape modification model, so as to obtain the target dynamic face video. For example, the human voice audio data and the initial dynamic human face video may be input into a lip shape modification model, so as to extract audio data features corresponding to each frame of audio data based on the lip shape modification model, and then modify the lip shape in the initial human face image corresponding to each frame of audio data in the initial dynamic human face video according to the audio data features corresponding to each frame of audio data, so as to obtain target human face images corresponding to each frame of audio data subjected to the lip shape modification. Furthermore, a target dynamic face video subjected to face lip shape correction can be generated according to target face images respectively corresponding to each frame of audio data.
In one implementation, the lip shape modification model may be obtained by training a lip shape generation countermeasure model, which may be a wav2lip network model. The lip shape generation countermeasure model may comprise a lip shape generation network, a lip shape discrimination network and a video quality discrimination network. Specifically, a first training sample set may be obtained, which may include at least one first training sample pair, and after the first training sample set is obtained, the lip shape generation countermeasure model may be trained on it. In the description of the subsequent training process, any first training sample pair in the first training sample set is taken as an example. The lip shape modification model may then be obtained by training the lip shape generation countermeasure model as follows:
a first training sample pair may be obtained, which may contain sample audio data and sample video data, one frame of audio data of the sample audio data corresponding to one frame of video data of the sample video data. For example, referring to fig. 3c, a first training sample pair may be input to a lip shape generation network to obtain a predicted dynamic face video, and after obtaining the predicted dynamic face video, the predicted dynamic face video and the sample audio data may be input to a lip shape determination network to obtain a lip shape determination result for the predicted dynamic face video. The predicted dynamic face video and the sample video data can be input into a video quality judgment network to obtain a quality judgment result aiming at the predicted dynamic face video. After the predicted dynamic face video, the sample video data, the lip shape determination result and the quality determination result are obtained, the network parameters of the lip shape generation network can be corrected according to the predicted dynamic face video, the sample video data, the lip shape determination result and the quality determination result, so that a target lip shape generation network is obtained, and the target lip shape generation network can be determined as a lip shape correction model.
In one implementation, the first sample training pair may contain a plurality of sets of data frames, a set of data frames containing a frame of audio data of the corresponding sample audio data and a frame of video data of the sample video data. For the lip-generated network, the specific implementation of inputting the first training sample pair into the lip-generated network to obtain the predicted dynamic face video may include: first, sample audio features of each frame of audio data in sample audio data and sample video features of each frame of video data in sample video data may be respectively extracted based on a lip-shaped generation network. And then, respectively carrying out feature fusion on the sample audio features corresponding to the sample audio data and the sample video features corresponding to the sample video data contained in each group of data frames to obtain fused sample features respectively corresponding to each group of data frames. Furthermore, the fused sample features respectively corresponding to each group of data frames can be decoded to obtain the predicted face image respectively corresponding to each data frame. Then, after the predicted face images respectively corresponding to each group of data frames are obtained, the predicted dynamic face video can be generated according to the predicted face images respectively corresponding to each group of data frames.
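The encode-fuse-decode flow of the lip shape generation network can be sketched as follows; the encoder and decoder shapes here are simplified placeholders rather than the exact wav2lip architecture.

```python
# Simplified sketch of the lip shape generation network: encode audio and video
# features, fuse them, and decode the fused feature into a predicted face image.
import torch
import torch.nn as nn

class LipGenerator(nn.Module):
    def __init__(self, audio_dim=512, video_dim=512):
        super().__init__()
        self.audio_encoder = nn.Sequential(nn.Linear(80 * 16, audio_dim), nn.ReLU())
        self.video_encoder = nn.Sequential(nn.Flatten(),
                                           nn.Linear(3 * 96 * 96, video_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(audio_dim + video_dim, 3 * 96 * 96),
                                     nn.Sigmoid())

    def forward(self, audio_frame, video_frame):
        a = self.audio_encoder(audio_frame.flatten(1))   # sample audio feature
        v = self.video_encoder(video_frame)              # sample video feature
        fused = torch.cat([a, v], dim=1)                 # fused sample feature
        out = self.decoder(fused)                        # predicted face image
        return out.view(-1, 3, 96, 96)
```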
In one implementation, the lip shape generation countermeasure model may employ the following loss function to optimize its network parameters. The loss function comprises three kinds of losses: a reconstruction loss L1, a synchronization loss L2 and a quality loss L3. The reconstruction loss L1 refers to the loss associated with the lip shape generation network; it may be the L1 norm and can be used to measure the difference between the predicted dynamic face video (the video output by the lip shape generation network) and the sample video data (which can be understood as real video data). The synchronization loss L2 refers to the loss associated with the lip shape discrimination network, which has been pre-trained and verified to perform well, and which can be used to detect lip synchronization errors in the predicted dynamic face video. The synchronization loss L2 may be expressed as shown in equation 1:

L2 = -(1/N) Σ_{i=1}^{N} log(P_i)    (equation 1)

where N denotes the number of first training sample pairs and P_i denotes the probability that the i-th pair of samples is a positive sample (which can be understood as sample video data, i.e. real video data). P_i may be calculated as shown in equation 2:

P_i = (v · s) / (||v||_2 ||s||_2)    (equation 2)

where v and s are the sample video feature and the sample audio feature, respectively, and v · s denotes the dot product of the sample video feature and the sample audio feature.

The quality loss L3 refers to the loss associated with the video quality discrimination network. During training of the lip shape generation countermeasure model, the lip shape generation network may produce very accurate lips because the lip shape discrimination network is very strong, but in this case the image frames contained in the predicted dynamic face video generated by the lip shape generation network may be blurred or slightly ghosted, which degrades the image quality of those frames and in turn the video quality of the predicted dynamic face video. To alleviate this loss of image quality (or video quality), the video quality discrimination network of the present application may be introduced. The video quality discrimination network penalizes unrealistic (low-quality) faces generated in the predicted dynamic face video without performing any check on the generated lips. The quality loss L3 may be a binary cross-entropy loss function.
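The three losses can be summarized in the following sketch, assuming the video quality discrimination network outputs probabilities in [0, 1]; the synchronization probability follows the cosine-similarity form of equation 2.

```python
# Sketch of the three losses: L1 reconstruction, synchronization loss over the
# lip discriminator's probabilities, and a binary cross-entropy quality loss.
import torch
import torch.nn.functional as F

def reconstruction_loss(pred_video, real_video):
    return F.l1_loss(pred_video, real_video)                 # L1

def sync_probability(v, s, eps=1e-8):
    # P_i = (v . s) / (||v|| * ||s||): cosine similarity of video and audio features.
    return torch.clamp(F.cosine_similarity(v, s, dim=1), eps, 1.0)

def sync_loss(video_feats, audio_feats):
    # L2 = -(1/N) * sum_i log(P_i)
    return -torch.log(sync_probability(video_feats, audio_feats)).mean()

def quality_loss(disc_scores_on_generated):
    # Binary cross entropy pushing generated frames toward the "real" label;
    # assumes the quality discriminator outputs probabilities in [0, 1].
    target = torch.ones_like(disc_scores_on_generated)
    return F.binary_cross_entropy(disc_scores_on_generated, target)
```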
In the embodiment of the application, the human voice audio data and the face image of the target object can be acquired, and the initial dynamic face video for the target object is generated based on the human voice audio data and the face image of the target object. When the initial dynamic face video aiming at the target object is generated based on the voice audio data and the face image of the target object, the simulated face expression parameters can be generated according to the voice audio data, the three-dimensional face parameters of the target object can be generated according to the face image, and therefore the initial dynamic face video of the target object can be generated according to the simulated face expression parameters and the three-dimensional face parameters. Further, the lip shape of the face in the initial dynamic face video may be modified according to the voice audio data to obtain a target dynamic face video, where the lip shape of the face in the target dynamic face video corresponds to the voice content in the voice audio data. In the application, a facial expression change video related to the content of the human voice audio can be generated through the three-dimensional facial parameters and the image generated by the simulated facial expression parameters generated by the human voice audio. And the lip shape included in the face of the facial expression change video can be synchronized with the voice audio content based on the lip shape generation method. Therefore, the finally obtained dynamic face video can achieve better effects in two dimensions of expression and lip shape, and further can generate the dynamic face video with vivid visual and auditory effects.
Referring to fig. 4, fig. 4 is a schematic flowchart of another video generation method according to an embodiment of the present disclosure. The method is applied to a server and can be executed by the server. As shown in fig. 4, the video generation method may include:
s401: and acquiring voice audio data and acquiring a face image of the target object.
In one implementation, target human voice audio data of the target object may also be obtained, where the target human voice audio data may be a segment of speaking voice audio or song audio of the target object, or the like. Tone color conversion may then be performed according to the human voice audio data and the target human voice audio data, converting the original tone color of the human voice audio data into the tone color of the target human voice audio data while keeping the human voice content in the human voice audio data unchanged.
In one implementation, the specific implementation of the tone color conversion may include: after the target human voice audio data is acquired, the human voice tone color feature of the target object may be extracted from the target human voice audio data; for example, it may be predicted by a pre-trained neural network model, which may be a speaker embedding model. The human voice audio data may also be converted into a human voice audio sequence, and acoustic feature data of the human voice audio data may be acquired, where the acoustic feature data may include the fundamental frequency, a Root Mean Square Error (RMSE) sequence and the audio durations of the human voice audio data. Further, the human voice tone color feature, the human voice audio sequence and the acoustic feature data are input into the tone color conversion model to obtain a mel-frequency spectrogram containing the human voice tone color of the target object, where a mel-frequency spectrogram is a spectrogram whose frequency axis has been converted to the mel scale. After the mel-frequency spectrogram is obtained, it can be converted by a vocoder to obtain the converted human voice audio data of the target object. The converted human voice audio data has the human voice content of the human voice audio data and the human voice tone color of the target human voice audio data; that is, the human voice tone color in the human voice audio data is replaced.
For example, fig. 5a shows a schematic structural diagram of tone color conversion. Referring to fig. 5a, assume that the human voice audio data is a song and the target human voice audio data is the speaking voice of the target object; the purpose of the tone color conversion is then to convert the tone color of the original singer in the song into the tone color of the target object while ensuring that the singing content of the song is unchanged. In a specific implementation, after the song and the speaking voice are acquired, they can be processed separately to obtain the data required for tone color conversion. For the song, acoustic feature data may be obtained, which may include the fundamental frequency, the RMSE sequence and the audio durations of the song; the acoustic feature data serve as conditions required in the tone color conversion process. The fundamental frequency may be obtained by processing the song with a World Vocoder. The song may also be converted into a text sequence, from which a human voice audio sequence corresponding to the song is obtained; after the human voice audio sequence is obtained, the audio durations may be obtained by aligning the human voice audio sequence with the song. The alignment may, for example, be forced alignment, and refers to determining from the song the duration of each audio frame in the human voice audio sequence, for example measured by the duration of the speech unit corresponding to each audio frame. The audio duration can be understood as the duration corresponding to each audio frame contained in the human voice audio sequence. For the speaking voice, a pre-trained speech embedding network may be used to obtain the speech embedding of the target object, which may serve as the human voice tone color feature of the target object. After the human voice tone color feature, the human voice audio sequence and the acoustic feature data (fundamental frequency, RMSE sequence and audio durations) are obtained, these data may be input into the tone color conversion model to obtain a mel-frequency spectrogram containing the human voice tone color of the target object. The mel-frequency spectrogram may then be converted by a neural vocoder (such as WaveRNN) to generate a waveform, thereby obtaining the converted human voice audio data of the target object. The converted human voice audio data has the human voice content of the song and the human voice tone color of the speaking voice. That is, through the above tone color conversion, the tone color of the singing voice contained in the song can be converted into the voice of the target object while keeping the song content unchanged.
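The overall tone color conversion flow of fig. 5a can be sketched as follows; every model object and helper function in this sketch (speaker_embedder, timbre_conversion_model, neural_vocoder, extract_acoustic_features, to_audio_sequence) is a hypothetical placeholder for the components described above.

```python
# End-to-end sketch of the tone color conversion flow of fig. 5a.
def convert_timbre(song_audio, target_speech_audio,
                   speaker_embedder, timbre_conversion_model, neural_vocoder,
                   extract_acoustic_features, to_audio_sequence):
    # Human voice tone color feature of the target object (speaker embedding).
    timbre_feature = speaker_embedder(target_speech_audio)
    # Human voice audio sequence plus acoustic features of the song: fundamental
    # frequency, RMSE sequence and per-frame audio durations.
    audio_sequence = to_audio_sequence(song_audio)
    f0, rmse, durations = extract_acoustic_features(song_audio, audio_sequence)
    # The tone color conversion model outputs a mel spectrogram with the target timbre.
    mel = timbre_conversion_model(audio_sequence, timbre_feature, f0, rmse, durations)
    # A neural vocoder (e.g. WaveRNN-style) turns the mel spectrogram into a waveform.
    return neural_vocoder(mel)
```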
In one implementation, the tone color conversion model may include an encoding network, an alignment network, and a decoding network, and the specific implementation of obtaining the mel frequency spectrum diagram through the tone color conversion model may include: the human voice audio sequence can be input into a coding network for coding processing, and an implicit characteristic sequence of the human voice audio sequence is obtained. After the implicit characteristic sequence is obtained, the implicit characteristic sequence, the human voice tone color characteristic and the acoustic characteristic data can be input into an alignment network, and the audio frame alignment is performed on the implicit characteristic sequence based on the human voice tone color characteristic and the acoustic characteristic data in the alignment network to obtain an aligned audio sequence. Finally, the aligned audio sequence may be input to a decoding module for decoding processing to obtain a mel-frequency spectrogram.
For example, fig. 5b shows a schematic structural diagram of a tone color conversion model, which may be a Duration Informed Attention Network based Singing Voice Conversion System (DurIAN-SC) model. The tone color conversion model comprises a context encoding network operable to encode an audio sequence, an alignment network for aligning the audio sequence (of the human voice audio data) with a target audio sequence (of the target human voice audio data), and an autoregressive decoding network for generating the mel-frequency spectrogram features frame by frame. The encoding network (context encoding network) of the tone color conversion model may include an audio embedding layer, a fully connected layer and a CBHG layer, where the CBHG layer may further include a convolutional layer, a highway layer and a bidirectional Gated Recurrent Unit (GRU) layer. Specifically, assume a human voice audio sequence x_{1:N}, where N represents the length of the human voice audio sequence. The human voice audio sequence x_{1:N} may be input into the encoding network to obtain an implicit feature sequence of the human voice audio sequence, which may be expressed as h_{1:N} = encoder(x_{1:N}), where encoder(·) denotes the effect of the encoding network. When the human voice audio sequence is input into the encoding network, it may first pass through the audio embedding layer and the fully connected layer to generate an implicit representation corresponding to the human voice audio sequence, and the CBHG layer then extracts features from this implicit representation to obtain the implicit feature sequence, which preserves the order of the audio frames in the human voice audio sequence.
The aim of the alignment network is to generate, from the implicit feature sequence, a frame-aligned implicit state, which may be referred to as the aligned audio sequence, and to input this state into the decoding network. After the implicit feature sequence is input into the alignment network, the representation of the implicit feature sequence h_{1:N} after state expansion according to the audio durations may be e_{1:T} = state_expand(h_{1:N}, d_{1:N}), where T is the total number of input audio frames and state_expand(·) denotes the effect of state expansion. State expansion means performing a copy operation according to the duration d_{1:N} corresponding to each audio frame. For convenience of description, the sequence obtained after state expansion of the implicit feature sequence may be referred to as the first implicit feature sequence. After state expansion, the frame-aligned implicit feature sequence may be concatenated with the fundamental frequency, the RMSE sequence and the human voice tone color feature to obtain the aligned audio sequence. For example, the first implicit feature sequence obtained after state expansion may be concatenated (Concat) with the acoustic feature data (fundamental frequency and RMSE sequence) to obtain a second implicit feature sequence. The second implicit feature sequence may then be concatenated with the human voice tone color feature to obtain a third implicit feature sequence. Finally, the third implicit feature sequence passes through a fully connected layer, and the resulting sequence is the output of the alignment network, namely the aligned audio sequence.
The decoding network is the same as the decoding network in the Duration Informed Attention Network (DurIAN) model and likewise consists of two autoregressive Recurrent Neural Network (RNN) layers. The aligned audio sequence output by the alignment network may be passed through a decoding network composed of fully-connected layers and RNN layers for autoregressive decoding; for example, two frames may be decoded at each time step in the decoding process, yielding the output of the decoding network y_{1:T} = decoder(v_{1:T}), where v_{1:T} represents the output of the alignment network and decoder(·) denotes the effect of the decoding network. The output of the decoding network may also be processed by a CBHG layer to improve the quality of the resulting mel spectrogram.
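A simplified autoregressive decoding loop in the spirit of the above, assuming PyTorch; it emits two mel frames per step as described, while the prenet size, hidden sizes, and the omitted CBHG post-processing are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ARDecoder(nn.Module):
    # Fully-connected prenet + two autoregressive RNN layers, two mel frames per step.
    def __init__(self, in_dim=256, mel_dim=80, hidden=256, frames_per_step=2):
        super().__init__()
        self.frames_per_step = frames_per_step
        self.prenet = nn.Linear(mel_dim * frames_per_step, hidden)
        self.rnn1 = nn.GRUCell(in_dim + hidden, hidden)
        self.rnn2 = nn.GRUCell(hidden, hidden)
        self.proj = nn.Linear(hidden, mel_dim * frames_per_step)

    def forward(self, v):                                     # v: (1, T, in_dim) aligned audio sequence
        B, T, _ = v.shape
        mel = self.proj.out_features // self.frames_per_step
        prev = v.new_zeros(B, mel * self.frames_per_step)     # previously decoded frames (autoregression)
        h1 = v.new_zeros(B, self.rnn1.hidden_size)
        h2 = v.new_zeros(B, self.rnn2.hidden_size)
        outputs = []
        for t in range(0, T, self.frames_per_step):           # decode two frames per time step
            x = torch.cat([v[:, t], torch.relu(self.prenet(prev))], dim=-1)
            h1 = self.rnn1(x, h1)
            h2 = self.rnn2(h1, h2)
            prev = self.proj(h2)
            outputs.append(prev.view(B, self.frames_per_step, mel))
        return torch.cat(outputs, dim=1)[:, :T]               # mel spectrogram y_{1:T}
```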
It can be seen that, after the converted human voice audio data is obtained, generating the simulated facial expression parameters according to the human voice audio data in the subsequent step S402 may also be performed according to the converted human voice audio data, and modifying the face lip shape in the initial dynamic face video according to the human voice audio data in step S404 may also be performed according to the converted human voice audio data. In this way, the face lip shape in the finally obtained target dynamic face video corresponds to the human voice content in the converted human voice audio data, and the target dynamic face video carries the human voice timbre of the target human voice audio data.
S402: and generating a simulated facial expression parameter according to the human voice audio data, and generating a three-dimensional facial parameter of the target object according to the facial image.
S403: and generating an initial dynamic face video of the target object according to the simulated face expression parameters and the three-dimensional face parameters.
S404: and correcting the lip shape of the face in the initial dynamic face video according to the voice audio data to obtain a corrected dynamic face video.
The modified dynamic face video may include a plurality of frames of modified face images.
S405: and acquiring the face position in each frame of corrected face image, and performing image quality restoration on the face in each frame of corrected face image according to the face position in each frame of corrected face image.
In an implementation manner, in the process of generating the initial dynamic face video for the target object and correcting the face lip shape in the initial dynamic face video to obtain the corrected dynamic face video, some image quality may be lost, so that the corrected dynamic face video may appear blurred. In order for the finally obtained video to present a better image quality effect, the corrected dynamic face video may further be subjected to image quality restoration processing to obtain a video with better image quality; after image quality restoration, the corrected dynamic face video is clearer.
In an implementation manner, when performing image quality restoration on the modified dynamic face video, it may be considered to extract a local face image in the modified dynamic face video first to realize the restoration of the face in the modified dynamic face video. It is understood that the modified dynamic face video may include a plurality of frames of modified face images. The face position in each frame of the modified face image can be obtained first, so that the image quality of the face in each frame of the modified face image is repaired according to the face position in each frame of the modified face image.
Optionally, the face position in each frame of the corrected face image may be obtained first, so that the face local image in each frame of the corrected face image can be extracted according to that face position. For example, face extraction may be performed based on a RetinaFace model. The RetinaFace model is a robust single-stage face detector that performs multi-task learning with joint supervised and self-supervised signals, adopts a feature pyramid to extract multi-scale features, and can locate faces of different scales, so that the face position in each frame of corrected face image can be obtained accurately. After the face position in each frame of corrected face image is obtained, affine transformation and cropping can be performed on the corrected face image according to that face position, so as to extract the effective face region in each frame of corrected face image; this effective face region may be referred to as the face local image. It can be understood that one frame of corrected face image corresponds to one frame of face local image.
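A sketch of the affine-transform-and-crop step, assuming OpenCV and NumPy; the landmark positions are assumed to come from a RetinaFace-style detector (not shown here), and the reference points and crop size are illustrative, not taken from this application.

```python
import cv2
import numpy as np

# Illustrative canonical positions for left eye, right eye and nose tip in a 256x256 crop.
REFERENCE_POINTS = np.float32([[85, 100], [171, 100], [128, 150]])

def crop_aligned_face(frame, landmarks, size=256):
    # landmarks: (3, 2) float32 array of detected left-eye, right-eye and nose positions
    # for one corrected face image (assumed to be produced by the face detector).
    matrix = cv2.getAffineTransform(np.float32(landmarks), REFERENCE_POINTS)
    # Affine transformation + cropping yields the face local image for this frame.
    return cv2.warpAffine(frame, matrix, (size, size), flags=cv2.INTER_LINEAR)
```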
After the face local image corresponding to each frame of the modified face image in the modified dynamic face video is obtained, image quality restoration can be performed on each frame of the face local image, so that a target dynamic face video can be obtained according to the face local image subjected to image quality restoration in the following process. For example, the image quality of each frame of face partial image can be restored by using a face restoration model. In a specific implementation, each frame of face partial image can be input into a face restoration model, so as to extract the coded face features of each frame of face partial image based on the face restoration model. After the coded face features of each frame of face partial image are obtained, the restored face features of each frame of face partial image can be obtained according to the coded face features of each frame of face partial image. Furthermore, image quality restoration can be performed on each frame of face local image according to the coded face features and the restored face features corresponding to each frame of face local image, so that the restored face local images corresponding to each frame of face local image are obtained.
The face restoration model may be as shown in fig. 5c. The face restoration model adopts a classical encoding-decoding structure, formed by embedding a Generative Adversarial Network (GAN) prior network, as the decoding network, into a deep neural network with a U-shaped structure. The network structure of the GAN prior network may be as shown in fig. 5d; for relevant details of the GAN prior network, reference may be made to the description of the face restoration model, which is not repeated here. It can be understood that the face restoration model may include an encoding network and a decoding network, the decoding network being the GAN prior network, where the encoding network may include a plurality of feature extraction layers and the decoding network may include a plurality of GAN blocks; the number of feature extraction layers is equal to the number of GAN blocks. A GAN block may be understood as one generative adversarial layer as shown in fig. 5c, and the network structure of each GAN block may be as shown in fig. 5e. Considering that the face restoration model is a low-quality face image processing model that is not fully convolutional, before a face local image is input into the face restoration model, the low-resolution face local image may be adjusted to the required resolution (for example, to 1024 × 1024); the resolution adjustment of the face local image may be implemented with a bilinear interpolator. The resolution-adjusted face local image can then be input into the face restoration model.
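A sketch of the bilinear resizing step performed before the restoration model, assuming OpenCV; the 1024 × 1024 target follows the example given above.

```python
import cv2

def to_model_resolution(face_local_image, size=1024):
    # Bilinear interpolation adjusts the low-resolution face local image to the
    # fixed resolution expected by the (non-fully-convolutional) restoration model.
    return cv2.resize(face_local_image, (size, size), interpolation=cv2.INTER_LINEAR)
```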
For a frame of face local image, the encoding network in the face restoration model first extracts the coded face features of the face local image, and these coded face features include the coded face features corresponding to each feature extraction layer included in the encoding network. The coded face features may include shallow features and deep features. The shallow features may be understood as the coded face features corresponding to the shallower feature extraction layers in the encoding network and generally include basic features of the face local image, such as edges or corners of the face; the deep features may be understood as the coded face features corresponding to the deeper feature extraction layers in the encoding network and are generally features characterizing the complete face. After the coded face features corresponding to each feature extraction layer are obtained, the coded face features corresponding to the last feature extraction layer in the encoding network are taken, passed through a fully-connected layer to obtain a latent face feature, and the latent face feature is passed through a mapping network to obtain a less entangled repaired face feature for the face local image.
After the repaired face features and the coded face features corresponding to each feature extraction layer in the encoding network are obtained, the repaired face features may be broadcast to each GAN block in the decoding network, and the coded face features corresponding to each feature extraction layer in the encoding network may be input, as skip-layer connections in the form of noise, to the GAN blocks of the decoding network in a one-to-one correspondence, so as to control the global face structure, the local face details, and the background of the finally obtained repaired face local image. The output of the decoding network is the repaired face local image corresponding to the input face local image; through the face restoration model, the repaired face local image corresponding to each frame of face local image can be obtained. After each frame of repaired face local image is obtained, each frame of corrected face image and the corresponding repaired face local image can be synthesized to obtain each frame of corrected face image subjected to image quality restoration.
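A heavily simplified sketch of this U-shaped encoder-decoder flow, assuming PyTorch; the real GAN prior blocks are replaced here by plain convolution-plus-upsampling stand-ins, and all channel counts and the mapping network depth are illustrative assumptions rather than the structure in figs. 5c–5e.

```python
import torch
import torch.nn as nn

class TinyFaceRestorer(nn.Module):
    # The deepest encoder feature is mapped (fully-connected layer + mapping network) to a
    # latent "repaired face feature" broadcast to every decoder block, while each encoder
    # feature is injected into the matching decoder block as a noise-like skip connection.
    def __init__(self, ch=32, latent=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, 2, 1), nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv2d(ch * 2, ch * 4, 3, 2, 1), nn.LeakyReLU(0.2))
        self.to_latent = nn.Linear(ch * 4, latent)            # fully-connected layer
        self.mapping = nn.Sequential(nn.Linear(latent, latent), nn.LeakyReLU(0.2))
        up = dict(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec3 = nn.Sequential(nn.Conv2d(ch * 4 + latent, ch * 2, 3, 1, 1),
                                  nn.LeakyReLU(0.2), nn.Upsample(**up))
        self.dec2 = nn.Sequential(nn.Conv2d(ch * 4 + latent, ch, 3, 1, 1),
                                  nn.LeakyReLU(0.2), nn.Upsample(**up))
        self.dec1 = nn.Sequential(nn.Conv2d(ch * 2 + latent, ch, 3, 1, 1),
                                  nn.LeakyReLU(0.2), nn.Upsample(**up))
        self.out = nn.Conv2d(ch, 3, 3, 1, 1)

    @staticmethod
    def _broadcast(w, ref):
        b, _, h, width = ref.shape
        return w.view(b, -1, 1, 1).expand(b, w.size(1), h, width)

    def forward(self, x):                                     # x: (B, 3, H, W), H and W divisible by 8
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)                                    # coded face features per level
        w = self.mapping(self.to_latent(f3.mean(dim=(2, 3)))) # repaired face feature
        y = self.dec3(torch.cat([f3, self._broadcast(w, f3)], dim=1))
        y = self.dec2(torch.cat([y, f2, self._broadcast(w, f2)], dim=1))
        y = self.dec1(torch.cat([y, f1, self._broadcast(w, f1)], dim=1))
        return self.out(y)                                    # repaired face local image
```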
In an implementation manner, the face repairing model may be obtained by training a face repairing confrontation model, and the face repairing confrontation model may be a GAN Prior Embedded Network (GPEN) model. Likewise, the decoding network of the face repairing confrontation model may be a GAN prior network; the GAN prior network can generate high-quality face images and can be embedded relatively easily, as the decoding network, into a GPEN model with a U-shaped structure. The face repairing confrontation model may include an encoding network, a decoding network, and a repair discrimination network, as shown in fig. 5f. Specifically, a second training sample set may be obtained, where the second training sample set may include at least one second training sample pair; after the second training sample set is obtained, the face repairing confrontation model may be trained according to the second training sample set. In describing the subsequent training process, any second training sample pair in the second training sample set is taken as an example. The specific implementation of training the face repairing confrontation model to obtain the face repairing model may then be as follows:
A second training sample pair may be acquired, which may comprise a first resolution image and a second resolution image, wherein the image resolution of the first resolution image is smaller than that of the second resolution image, and the two images have the same image content. For example, referring to fig. 5f, the second training sample pair may be input into the encoding network to obtain sample coded face features corresponding to the first resolution image; since the encoding network may include a plurality of feature extraction layers, the sample coded face features corresponding to the first resolution image include the sample coded face features corresponding to each feature extraction layer. The sample restored face features of the first resolution image may then be obtained by mapping from the sample coded face features; the specific manner of determining the sample restored face features may refer to the manner of determining the repaired face features described above and is not repeated here. Further, the sample coded face features and the sample restored face features may be input into the decoding network to obtain a predicted resolution image corresponding to the first resolution image. After the predicted resolution image is obtained, the second resolution image and the predicted resolution image may be input into the repair discrimination network to obtain a repair discrimination result for the predicted resolution image. A first image resolution characteristic of the first resolution image and a second image resolution characteristic of the second resolution image may also be obtained based on the repair discrimination network. Furthermore, the network parameters of the face repairing confrontation model can be corrected according to the predicted resolution image, the second resolution image, the repair discrimination result, the first image resolution characteristic, and the second image resolution characteristic. The face repairing model can then be determined according to the encoding network and the decoding network in the corrected face repairing confrontation model.
It should be noted that, before the face repairing confrontation model is trained, the GAN prior network (the decoding network) is trained first; after the training of the GAN prior network is completed, the pre-trained GAN prior network can be embedded into the U-shaped deep neural network as the decoding network to obtain the face repairing confrontation model. The second training sample set is then used to fine-tune the face repairing confrontation model, so that the encoding network and the decoding network in the face repairing confrontation model can be combined better. A high-definition face image dataset, for example the FFHQ (Flickr-Faces-HQ) dataset, may be used, and the GAN prior network may be trained according to the training strategy of StyleGAN.
In one implementation, in order to fine-tune (or train) the face repairing confrontation model, the network parameters of the face repairing confrontation model may be optimized with the loss function shown in Equation 3. The loss function contains three types of losses: the adversarial loss L_A, the content loss L_C, and the feature matching loss L_F. L_A and L_F are shown in Equation 4 and Equation 5, respectively.
L = L_A + α·L_C + β·L_F    (Equation 3)
L_A = min_G max_D E_X[ log(1 + exp(−D(G(X̃)))) ]    (Equation 4)

L_F = min_G E_X[ Σ_{i=1…T} ‖ D_i(X) − D_i(G(X̃)) ‖_2 ]    (Equation 5)
The content loss L_C strengthens the detail features in the face image (the first resolution image) and retains the original color information, and introducing the feature matching loss L_F based on the repair discrimination network can better balance the adversarial loss L_A, so that a more real and vivid face image, namely a face image with better image quality, is restored. α and β are balance parameters that can be set as required; practice shows that the training effect of the face repairing confrontation model is better when α = 1 and β = 0.02. X̃ represents the first resolution image (which may be understood as a low-resolution, blurred image), and X represents the second resolution image (which may be understood as the corresponding high-resolution, high-definition image). G represents the generation network used when training the face repairing confrontation model, which may include the encoding network and the decoding network, and D represents the repair discrimination network; G(·) and D(·) denote the effects of the generation network and the repair discrimination network, respectively. The content loss L_C is an L1 norm and can be used to measure the difference between the predicted resolution image (the image output by the decoding network) and the second resolution image (which can be understood as the real high-definition image). The feature matching loss L_F is a perceptual loss based on the repair discrimination network, where T is the total number of feature extraction layers included in the repair discrimination network and D_i(X) represents the features extracted by the i-th layer of the repair discrimination network. In this application, D_i(X) may be understood as the second image resolution characteristic obtained by the repair discrimination network for the second resolution image, and D_i(G(X̃)) may be understood as the first image resolution characteristic obtained by the repair discrimination network for the first resolution image.
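A sketch of combining the three losses with α = 1 and β = 0.02, assuming PyTorch; the softplus form of the adversarial term follows the reconstruction of Equation 4 above, and the caller is assumed to supply the discriminator output and per-layer features.

```python
import torch
import torch.nn.functional as F

def face_repair_loss(pred, target, disc_logit_fake, feats_real, feats_fake,
                     alpha=1.0, beta=0.02):
    # pred:            predicted resolution image G(X~) output by the decoding network
    # target:          second resolution image X (real high-definition image)
    # disc_logit_fake: repair discrimination network output D(G(X~)) for the prediction
    # feats_real:      [D_1(X), ..., D_T(X)]          per-layer discriminator features
    # feats_fake:      [D_1(G(X~)), ..., D_T(G(X~))]  per-layer discriminator features
    adv = F.softplus(-disc_logit_fake).mean()            # adversarial loss L_A (generator side)
    content = F.l1_loss(pred, target)                    # content loss L_C (L1 norm)
    feat = sum(torch.norm(r - f, p=2)                    # feature matching loss L_F
               for r, f in zip(feats_real, feats_fake))
    return adv + alpha * content + beta * feat           # Equation 3
```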
S406: and generating a target dynamic face video according to each frame of modified face images subjected to image quality restoration.
In one implementation, after each frame of corrected face image subjected to image quality restoration is obtained, the target dynamic face video may be synthesized from these restored corrected face images. The lip shape of the face in the target dynamic face video corresponds to the human voice content in the human voice audio data. If the audio data acquired by the server includes both the human voice audio data and the target human voice audio data, timbre conversion is performed between them, and the human voice timbre in the target dynamic face video is then the human voice timbre of the target human voice audio data. If the server acquires only the human voice audio data, the human voice timbre in the target dynamic face video is that of the human voice audio data.
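A sketch of assembling the restored frames into the target dynamic face video, assuming OpenCV for frame writing and an external ffmpeg binary for muxing the (possibly converted) human voice audio; the file paths and frame rate are illustrative.

```python
import subprocess
import cv2

def write_target_video(frames, audio_path, out_path="target_face.mp4", fps=25):
    # frames: list of restored corrected face images (H x W x 3, BGR uint8)
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter("silent.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    # Mux the human voice audio so the lip shape stays aligned with the voice content.
    subprocess.run(["ffmpeg", "-y", "-i", "silent.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_path], check=True)
```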
For specific implementation of steps S401 to S404, reference may be made to the detailed description of steps S201 to S204 in the above embodiment, which is not described herein again.
In the embodiment of the application, a facial expression change video related to the content of the human voice audio can be generated from the images produced by the three-dimensional face parameters and the simulated facial expression parameters generated from the human voice audio. Based on the lip shape generation method, the lip shape of the face in the facial expression change video is synchronized with the voice audio content, so that both the expression and the lip shape in the facial expression change video achieve a better effect. Image quality restoration can also be performed on the facial expression change video based on the face restoration model, so as to obtain a high-definition face in the facial expression change video. For example, the target human voice audio of a certain object can be used to convert the original timbre in the human voice audio into the timbre of the target human voice audio while keeping the original audio content unchanged. In this way, the video is guaranteed to have vivid visual and auditory effects, the interest of the video is increased, and the user experience can be improved.
To better understand the video generation method provided in the embodiment of the present application, a video recommendation scene is further described below with reference to the scene schematic diagram of video recommendation shown in fig. 6, where the video generation method is implemented by a cloud server. Referring to fig. 6, in the video recommendation scenario, a target object may input a song (which may be understood as the above-mentioned human voice audio data, where a song may be input, or a segment of speaking voice) and a photo of a face of the target object (which may be understood as the above-mentioned face image) on a user operation interface of a target terminal corresponding to the target object. The user interface may be as shown in the interface in fig. 3 a. After the target object clicks the determination control, the object terminal can acquire the song and the face photo and send the song and the face photo to the server, so that the server can generate a target dynamic face video based on the acquired song and the face photo, and the face lip shape in the target dynamic face video corresponds to the voice content in the song.
In a specific implementation, the server may drive generation of the facial expression based on audio (for example, a song), for example, an image with facial expression changes may be determined to be generated by combining the song and a facial photo, so that a video based on the facial expression changes may be subsequently generated according to the image with facial expression changes. Specifically, the song may be input into the expression parameter extraction model to generate simulated facial expression parameters of the target object, and the facial photograph may be input into the three-dimensional face construction model to generate three-dimensional face parameters for the target object. And then replacing the initial facial expression parameters in the three-dimensional facial parameters with the simulated facial expression parameters to obtain target facial parameters for the target object, so that an initial facial image of the target object can be synthesized according to the target facial parameters, and for the authenticity of a subsequent video, the initial facial image can be rendered to generate a vivid rendered facial image, so that an initial dynamic facial video is generated according to the rendered facial image. And then inputting the initial dynamic face video into a lip-shaped correction model, and also inputting a song into the lip-shaped correction model, so that the lip-shaped correction model can correct the face lip shape in the initial dynamic face video according to the song to obtain a corrected target dynamic face video, wherein the face lip shape in the corrected dynamic face video corresponds to the voice content in the song. And finally, performing image quality restoration on the face in the corrected target dynamic face video based on the face restoration model to generate a song video with vivid visual and auditory effects, wherein the song video is the target dynamic face video. When the human voice audio data is speaking voice, the target dynamic human face video can be speaking video.
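The flow in this scenario can be summarized with the sketch below; every function name here is a hypothetical stand-in for the models described above (expression parameter extraction, three-dimensional face construction, rendering, lip shape correction, face restoration and timbre conversion), not an API defined in this application.

```python
def generate_target_dynamic_face_video(song_audio, face_photo, target_voice=None):
    # Optional timbre conversion: keep the song content, take on the target object's timbre.
    if target_voice is not None:
        song_audio = convert_timbre(song_audio, target_voice)       # hypothetical helper

    expr_params = extract_expression_params(song_audio)             # one set per audio frame
    face_params = build_3d_face(face_photo)                         # three-dimensional face parameters

    frames = []
    for expr in expr_params:
        params = replace_expression(face_params, expr)              # target face parameters
        frames.append(render_face(params))                          # rendered face image

    frames = correct_lip_shape(frames, song_audio)                  # lip shape correction model
    frames = [restore_image_quality(f) for f in frames]             # face restoration model
    return assemble_video(frames, song_audio)                       # target dynamic face video
```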
In one implementation, when the target object inputs the song and the face photo on the user operation interface of the target terminal, a segment of speaking voice (which may be understood as the target human voice audio data, where the speaking voice differs from the voice in the human voice audio data) may also be input. The object terminal may then further send the acquired speaking voice to the server, so that the server can perform song timbre conversion based on the speaking voice and the song, converting the voice timbre of the original singer in the song into the voice timbre of the target object when speaking, thereby generating a high-quality object song (which may be understood as the converted human voice audio data described above). In that case, the lip shape of the face in the finally generated target dynamic face video corresponds to the voice content in the song, while the human voice timbre is no longer that of the original singer but the speaking timbre of the target object.
Fig. 7 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application. The video generation device described in this embodiment includes:
an obtaining unit 701, configured to obtain human voice audio data, and obtain a face image of a target object;
a first generating unit 702, configured to generate a simulated facial expression parameter according to the human voice audio data, and generate a three-dimensional facial parameter of the target object according to the facial image;
a second generating unit 703, configured to generate an initial dynamic face video of the target object according to the simulated facial expression parameters and the three-dimensional face parameters;
a correcting unit 704, configured to correct a face lip shape in the initial dynamic face video according to the voice audio data to obtain a target dynamic face video; the lip shape of the face in the target dynamic face video corresponds to the voice content in the voice audio data.
In an implementation manner, the first generating unit 702 is specifically configured to:
inputting the voice and audio data into an expression parameter extraction model, and performing feature conversion on the voice and audio data based on the expression parameter extraction model to obtain voice feature parameters of the voice and audio data;
carrying out feature migration on the voice feature parameters based on the expression parameter extraction model to obtain target audio features of the human voice audio data;
and mapping expression parameters according to the target audio characteristics to obtain the simulated facial expression parameters.
In an implementation manner, the first generating unit 702 is specifically configured to:
inputting the face image into a three-dimensional face construction model so that the three-dimensional face construction model extracts face key points of the target object in the face image, and performing face reconstruction on the target object by using the face key points to obtain three-dimensional face parameters of the target object.
In one implementation, the three-dimensional face parameters include initial facial expression parameters and facial morphology parameters of the target object; the human voice audio data comprises a plurality of frames of audio data, and one frame of audio data corresponds to one group of the simulated human face expression parameters; the first generating unit 702 is specifically configured to:
replacing the initial facial expression parameters in the three-dimensional facial parameters with the simulated facial expression parameters corresponding to each frame of audio data respectively to obtain target facial parameters corresponding to each frame of audio data of the target object respectively;
generating initial face images respectively corresponding to each frame of audio data according to the target face parameters respectively corresponding to each frame of audio data;
and generating an initial dynamic human face video of the target object according to the initial human face images respectively corresponding to the audio data of each frame.
In an implementation manner, the second generating unit 703 is specifically configured to:
performing image rendering on the initial face images respectively corresponding to each frame of audio data to obtain rendered face images respectively corresponding to each frame of audio data;
and generating an initial dynamic face video of the target object according to the rendered face images corresponding to the audio data of each frame.
In an implementation manner, the modification unit 704 is specifically configured to:
inputting the voice audio data and the initial dynamic human face video into a lip shape correction model;
extracting audio data characteristics corresponding to each frame of audio data respectively based on the lip shape correction model;
according to the audio data characteristics corresponding to each frame of audio data, lip shapes in initial face images corresponding to each frame of audio data in the initial dynamic face video are corrected, and target face images corresponding to each frame of audio data are obtained;
and generating the target dynamic human face video subjected to the lip shape correction according to the target human face image corresponding to each frame of audio data.
In one implementation, the lip shape modification model is obtained by training a lip shape generation countermeasure model, and the lip shape generation countermeasure model comprises a lip shape generation network, a lip shape discrimination network and a video quality discrimination network; the modification unit 704 is further configured to:
obtaining a first training sample pair; the first training sample pair comprises sample audio data and sample video data, wherein one frame of audio data in the sample audio data corresponds to one frame of video data in the sample video data;
inputting the first training sample pair into the lip-shaped generation network to obtain a predicted dynamic face video;
inputting the predicted dynamic face video and the sample audio data into the lip shape discrimination network to obtain a lip shape discrimination result aiming at the predicted dynamic face video;
inputting the predicted dynamic face video and the sample video data into the video quality judgment network to obtain a quality judgment result aiming at the predicted dynamic face video;
correcting network parameters of the lip-shaped generation network according to the predicted dynamic face video, the sample video data, the lip-shaped judgment result and the quality judgment result to obtain a target lip-shaped generation network;
determining the target lip generation network as the lip modification model.
In one implementation, the first sample training pair includes a plurality of sets of data frames, and a set of data frames includes a corresponding one of the sample audio data and one of the sample video data; the modification unit 704 is specifically configured to:
respectively extracting sample audio features of each frame of audio data in the sample audio data and respectively extracting sample video features of each frame of video data in the sample video data based on the lip-shaped generation network;
respectively carrying out feature fusion on the sample audio features of the sample audio data and the sample video features of the sample video data contained in each group of data frames to obtain fused sample features respectively corresponding to each group of data frames;
decoding the fusion sample characteristics respectively corresponding to each group of data frames to obtain the prediction face images respectively corresponding to each group of data frames;
and generating the predicted dynamic face video according to the predicted face images respectively corresponding to each group of data frames.
In an implementation manner, the obtaining unit 701 is further configured to:
acquiring target human voice audio data of the target object, and extracting human voice tone characteristics of the target object according to the target human voice audio data;
converting the human voice audio data into a human voice audio sequence, and acquiring acoustic characteristic data of the human voice audio data;
inputting the human voice tone color characteristics, the human voice audio frequency sequence and the acoustic characteristic data into a tone color conversion model to obtain a Mel frequency spectrogram containing the human voice tone color of the target object;
performing sound code conversion on the Mel frequency spectrogram to obtain converted human voice audio data of the target object; the converted human voice audio data has the human voice content of the human voice audio data and has the human voice tone of the target human voice audio data.
In one implementation, the tone color conversion model includes an encoding network, an alignment network, and a decoding network; the obtaining unit 701 is specifically configured to:
inputting the voice tone color feature, the voice audio sequence and the acoustic feature data into a tone color conversion model to obtain a mel frequency spectrum diagram containing the voice tone color of the target object, comprising:
inputting the voice audio sequence into the coding network for coding to obtain an implicit characteristic sequence of the voice audio sequence;
inputting the implicit characteristic sequence, the human voice tone color characteristic and the acoustic characteristic data into the alignment network, and performing audio frame alignment on the implicit characteristic sequence based on the human voice tone color characteristic and the acoustic characteristic data in the alignment network to obtain an aligned audio sequence;
and inputting the aligned audio sequence into the decoding network for decoding to obtain the Mel frequency spectrogram.
In an implementation manner, the modification unit 704 is specifically configured to:
modifying the lip shape of the face in the initial dynamic face video according to the voice audio data to obtain a modified dynamic face video, wherein the modified dynamic face video comprises a plurality of frames of modified face images;
acquiring the face position in each frame of corrected face image, and performing image quality restoration on the face in each frame of corrected face image according to the face position in each frame of corrected face image;
and generating the target dynamic face video according to each frame of modified face image subjected to image quality restoration.
In an implementation manner, the modification unit 704 is specifically configured to:
extracting a local face image in each frame of corrected face image according to the face position in each frame of corrected face image; one frame of corrected face image corresponds to one frame of partial face image;
inputting each frame of face local image into a face restoration model, and extracting the coded face features of each frame of face local image based on the face restoration model;
respectively mapping according to the coded face features of each frame of face local image to obtain the repaired face features of each frame of face local image;
according to the coded face features and the repaired face features of each frame of face local image, carrying out image quality repair on each frame of face local image to obtain repaired face local images corresponding to each frame of face local image;
and respectively carrying out image synthesis on each frame of corrected face image and each frame of repaired face local image with the corresponding relation to obtain each frame of corrected face image subjected to image quality repair.
In one implementation, the face repairing model is obtained by training a face repairing confrontation model, and the face repairing confrontation model comprises an encoding network, a decoding network and a repairing discrimination network; the modification unit 704 is further configured to:
obtaining a second training sample pair, the second training sample pair comprising a first resolution image and a second resolution image; the image resolution of the first resolution image is less than the image resolution of the second resolution image, the first resolution image and the second resolution image having the same image content;
inputting the second training sample pair into the coding network to obtain sample coding face features corresponding to the first resolution ratio image, and mapping according to the sample coding face features to obtain sample repairing face features of the first resolution ratio image;
inputting the sample coding face features and the sample repairing face features into the decoding network to obtain a prediction resolution image corresponding to the first resolution image;
inputting the second resolution image and the predicted resolution image into the repair discrimination network to obtain a repair discrimination result of the predicted resolution image;
acquiring a first image resolution characteristic of the first resolution image and a second image resolution characteristic of the second resolution image based on the repairing discrimination network;
correcting the network parameters of the face repairing countermeasure model according to the predicted resolution image, the second resolution image, the repairing judgment result, the first image resolution characteristic and the second image resolution characteristic;
and determining the face repairing model according to the coding network and the decoding network in the modified face repairing confrontation model.
In an implementation manner, the obtaining unit 701 is specifically configured to:
acquiring the human voice audio data and the human face image which are sent by an object terminal of the target object;
the apparatus further includes an output unit 705, where the output unit 705 is specifically configured to:
and sending the target dynamic face video to the object terminal so that the object terminal outputs the target dynamic face video.
It is understood that the division of the units in the embodiments of the present application is illustrative, and is only one logical function division, and there may be another division manner in actual implementation. Each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
Please refer to fig. 8, which is a schematic structural diagram of a server according to an embodiment of the present application. The server described in this embodiment includes: a processor 801, a memory 802, and a network interface 803. Data may be exchanged between the processor 801, the memory 802, and the network interface 803.
The Processor 801 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 802, which may include both read-only memory and random-access memory, provides program instructions and data to the processor 801. A portion of the memory 802 may also include non-volatile random access memory. Wherein, the processor 801 is configured to execute, when calling the program instruction:
acquiring voice audio data and acquiring a face image of a target object;
generating simulation facial expression parameters according to the human voice audio data, and generating three-dimensional facial parameters of the target object according to the facial image;
generating an initial dynamic face video of the target object according to the simulated face expression parameters and the three-dimensional face parameters;
correcting the lip shape of the face in the initial dynamic face video according to the voice audio data to obtain a target dynamic face video; the lip shape of the face in the target dynamic face video corresponds to the voice content in the voice audio data.
In one implementation, the processor 801 is specifically configured to:
inputting the voice and audio data into an expression parameter extraction model, and performing feature conversion on the voice and audio data based on the expression parameter extraction model to obtain voice feature parameters of the voice and audio data;
carrying out feature migration on the voice feature parameters based on the expression parameter extraction model to obtain target audio features of the human voice audio data;
and mapping expression parameters according to the target audio characteristics to obtain the simulated facial expression parameters.
In one implementation, the processor 801 is specifically configured to:
inputting the face image into a three-dimensional face construction model so that the three-dimensional face construction model extracts face key points of the target object in the face image, and performing face reconstruction on the target object by using the face key points to obtain three-dimensional face parameters of the target object.
In one implementation, the three-dimensional face parameters include initial facial expression parameters and facial morphology parameters of the target object; the human voice audio data comprises a plurality of frames of audio data, and one frame of audio data corresponds to one group of the simulated human face expression parameters; the processor 801 is specifically configured to:
replacing the initial facial expression parameters in the three-dimensional facial parameters with the simulated facial expression parameters corresponding to each frame of audio data respectively to obtain target facial parameters corresponding to each frame of audio data of the target object respectively;
generating initial face images respectively corresponding to each frame of audio data according to the target face parameters respectively corresponding to each frame of audio data;
and generating an initial dynamic human face video of the target object according to the initial human face images respectively corresponding to the audio data of each frame.
In one implementation, the processor 801 is specifically configured to:
performing image rendering on the initial face images respectively corresponding to each frame of audio data to obtain rendered face images respectively corresponding to each frame of audio data;
and generating an initial dynamic face video of the target object according to the rendered face images corresponding to the audio data of each frame.
In one implementation, the processor 801 is specifically configured to:
inputting the voice audio data and the initial dynamic human face video into a lip shape correction model;
extracting audio data characteristics corresponding to each frame of audio data respectively based on the lip shape correction model;
according to the audio data characteristics corresponding to each frame of audio data, lip shapes in initial face images corresponding to each frame of audio data in the initial dynamic face video are corrected, and target face images corresponding to each frame of audio data are obtained;
and generating the target dynamic human face video subjected to the lip shape correction according to the target human face image corresponding to each frame of audio data.
In one implementation, the lip shape modification model is obtained by training a lip shape generation countermeasure model, and the lip shape generation countermeasure model comprises a lip shape generation network, a lip shape discrimination network and a video quality discrimination network; the processor 801 is further configured to:
obtaining a first training sample pair; the first training sample pair comprises sample audio data and sample video data, wherein one frame of audio data in the sample audio data corresponds to one frame of video data in the sample video data;
inputting the first training sample pair into the lip-shaped generation network to obtain a predicted dynamic face video;
inputting the predicted dynamic face video and the sample audio data into the lip shape discrimination network to obtain a lip shape discrimination result aiming at the predicted dynamic face video;
inputting the predicted dynamic face video and the sample video data into the video quality judgment network to obtain a quality judgment result aiming at the predicted dynamic face video;
correcting network parameters of the lip-shaped generation network according to the predicted dynamic face video, the sample video data, the lip-shaped judgment result and the quality judgment result to obtain a target lip-shaped generation network;
determining the target lip generation network as the lip modification model.
In one implementation, the first sample training pair includes a plurality of sets of data frames, and a set of data frames includes a corresponding one of the sample audio data and one of the sample video data; the processor 801 is specifically configured to:
respectively extracting sample audio features of each frame of audio data in the sample audio data and respectively extracting sample video features of each frame of video data in the sample video data based on the lip-shaped generation network;
respectively carrying out feature fusion on the sample audio features of the sample audio data and the sample video features of the sample video data contained in each group of data frames to obtain fused sample features respectively corresponding to each group of data frames;
decoding the fusion sample characteristics respectively corresponding to each group of data frames to obtain the prediction face images respectively corresponding to each group of data frames;
and generating the predicted dynamic face video according to the predicted face images respectively corresponding to each group of data frames.
In one implementation, the processor 801 is further configured to:
acquiring target human voice audio data of the target object, and extracting human voice tone characteristics of the target object according to the target human voice audio data;
converting the human voice audio data into a human voice audio sequence, and acquiring acoustic characteristic data of the human voice audio data;
inputting the human voice tone color characteristics, the human voice audio frequency sequence and the acoustic characteristic data into a tone color conversion model to obtain a Mel frequency spectrogram containing the human voice tone color of the target object;
performing sound code conversion on the Mel frequency spectrogram to obtain converted human voice audio data of the target object; the converted human voice audio data has the human voice content of the human voice audio data and has the human voice tone of the target human voice audio data.
In one implementation, the tone color conversion model includes an encoding network, an alignment network, and a decoding network; the processor 801 is specifically configured to:
inputting the voice tone color feature, the voice audio sequence and the acoustic feature data into a tone color conversion model to obtain a mel frequency spectrum diagram containing the voice tone color of the target object, comprising:
inputting the voice audio sequence into the coding network for coding to obtain an implicit characteristic sequence of the voice audio sequence;
inputting the implicit characteristic sequence, the human voice tone color characteristic and the acoustic characteristic data into the alignment network, and performing audio frame alignment on the implicit characteristic sequence based on the human voice tone color characteristic and the acoustic characteristic data in the alignment network to obtain an aligned audio sequence;
and inputting the aligned audio sequence into the decoding network for decoding to obtain the Mel frequency spectrogram.
In one implementation, the processor 801 is specifically configured to:
modifying the lip shape of the face in the initial dynamic face video according to the voice audio data to obtain a modified dynamic face video, wherein the modified dynamic face video comprises a plurality of frames of modified face images;
acquiring the face position in each frame of corrected face image, and performing image quality restoration on the face in each frame of corrected face image according to the face position in each frame of corrected face image;
and generating the target dynamic face video according to each frame of modified face image subjected to image quality restoration.
In one implementation, the processor 801 is specifically configured to:
extracting a local face image in each frame of corrected face image according to the face position in each frame of corrected face image; one frame of corrected face image corresponds to one frame of partial face image;
inputting each frame of face local image into a face restoration model, and extracting the coded face features of each frame of face local image based on the face restoration model;
respectively mapping according to the coded face features of each frame of face local image to obtain the repaired face features of each frame of face local image;
according to the coded face features and the repaired face features of each frame of face local image, carrying out image quality repair on each frame of face local image to obtain repaired face local images corresponding to each frame of face local image;
and respectively carrying out image synthesis on each frame of corrected face image and each frame of repaired face local image with the corresponding relation to obtain each frame of corrected face image subjected to image quality repair.
In one implementation, the face repairing model is obtained by training a face repairing confrontation model, and the face repairing confrontation model comprises an encoding network, a decoding network and a repairing discrimination network; the processor 801 is further configured to:
obtaining a second training sample pair, the second training sample pair comprising a first resolution image and a second resolution image; the image resolution of the first resolution image is less than the image resolution of the second resolution image, the first resolution image and the second resolution image having the same image content;
inputting the second training sample pair into the coding network to obtain sample coding face features corresponding to the first resolution ratio image, and mapping according to the sample coding face features to obtain sample repairing face features of the first resolution ratio image;
inputting the sample coding face features and the sample repairing face features into the decoding network to obtain a prediction resolution image corresponding to the first resolution image;
inputting the second resolution image and the predicted resolution image into the repair discrimination network to obtain a repair discrimination result of the predicted resolution image;
acquiring a first image resolution characteristic of the first resolution image and a second image resolution characteristic of the second resolution image based on the repairing discrimination network;
correcting the network parameters of the face repairing countermeasure model according to the predicted resolution image, the second resolution image, the repairing judgment result, the first image resolution characteristic and the second image resolution characteristic;
and determining the face repairing model according to the coding network and the decoding network in the modified face repairing confrontation model.
In one implementation, the processor 801 is specifically configured to:
acquiring the human voice audio data and the human face image which are sent by an object terminal of the target object;
and sending the target dynamic face video to the object terminal so that the object terminal outputs the target dynamic face video.
The embodiment of the present application also provides a computer storage medium, in which program instructions are stored, and when the program is executed, some or all of the steps of the video generation method in the embodiment corresponding to fig. 2 or fig. 4 may be included.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of the server, and the processor executes the computer instructions to cause the server to perform the steps performed in the embodiments of the methods described above.
The video generation method, the video generation device, the video generation server and the storage medium provided by the embodiments of the present application are described in detail above, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the above embodiments is only used to help understanding the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (16)

1. A method of video generation, the method comprising:
acquiring voice audio data and acquiring a face image of a target object;
generating simulation facial expression parameters according to the human voice audio data, and generating three-dimensional facial parameters of the target object according to the facial image;
generating an initial dynamic face video of the target object according to the simulated face expression parameters and the three-dimensional face parameters;
correcting the lip shape of the face in the initial dynamic face video according to the voice audio data to obtain a target dynamic face video; the lip shape of the face in the target dynamic face video corresponds to the voice content in the voice audio data.
2. The method of claim 1, wherein generating simulated facial expression parameters from the human voice audio data comprises:
inputting the voice and audio data into an expression parameter extraction model, and performing feature conversion on the voice and audio data based on the expression parameter extraction model to obtain voice feature parameters of the voice and audio data;
carrying out feature migration on the voice feature parameters based on the expression parameter extraction model to obtain target audio features of the human voice audio data;
and mapping expression parameters according to the target audio characteristics to obtain the simulated facial expression parameters.
3. The method of claim 1, wherein the generating three-dimensional face parameters of the target object from the face image comprises:
inputting the face image into a three-dimensional face construction model so that the three-dimensional face construction model extracts face key points of the target object in the face image, and performing face reconstruction on the target object by using the face key points to obtain three-dimensional face parameters of the target object.
4. The method of claim 1, wherein the three-dimensional face parameters comprise initial facial expression parameters and facial morphology parameters of the target object; the human voice audio data comprises a plurality of frames of audio data, and one frame of audio data corresponds to one group of the simulated facial expression parameters;
and generating the initial dynamic face video of the target object according to the simulated facial expression parameters and the three-dimensional face parameters comprises:
replacing the initial facial expression parameters in the three-dimensional face parameters with the simulated facial expression parameters respectively corresponding to each frame of audio data, to obtain target face parameters of the target object respectively corresponding to each frame of audio data;
generating initial face images respectively corresponding to each frame of audio data according to the target face parameters respectively corresponding to each frame of audio data;
and generating the initial dynamic face video of the target object according to the initial face images respectively corresponding to each frame of audio data.
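A sketch of the parameter replacement in claim 4: the identity-related morphology parameters are kept, while the expression parameters are overwritten frame by frame with the simulated ones. The dictionary layout and the `render_face` callable are assumptions.

```python
def build_initial_frames(face_params, simulated_expr_params, render_face):
    """Claim 4, sketched: swap in the simulated expression parameters frame by frame.
    face_params: dict with 'morphology_params' and 'expression_params'.
    simulated_expr_params: (n_frames, n_expr) array, one group per frame of audio data.
    render_face: hypothetical callable turning target face parameters into an image."""
    frames = []
    for expr in simulated_expr_params:
        target_params = {
            "morphology_params": face_params["morphology_params"],  # identity kept fixed
            "expression_params": expr,                               # expression replaced
        }
        frames.append(render_face(target_params))
    return frames          # initial face images, one per frame of audio data
```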
5. The method of claim 4, wherein generating the initial dynamic face video of the target object according to the initial face images respectively corresponding to each frame of audio data comprises:
performing image rendering on the initial face images respectively corresponding to each frame of audio data to obtain rendered face images respectively corresponding to each frame of audio data;
and generating the initial dynamic face video of the target object according to the rendered face images respectively corresponding to each frame of audio data.
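A sketch of assembling the rendered face images into the initial dynamic face video, using OpenCV's `VideoWriter`; the frame rate and codec are assumptions, and the rendering step itself is taken as given.

```python
import cv2
import numpy as np

def write_face_video(rendered_frames, path="initial_face.mp4", fps=25):
    """Assemble per-frame rendered face images into an initial dynamic face video.
    rendered_frames: list of HxWx3 uint8 BGR images (assumed already rendered)."""
    h, w = rendered_frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")     # codec choice is illustrative
    writer = cv2.VideoWriter(path, fourcc, fps, (w, h))
    for frame in rendered_frames:
        writer.write(frame.astype(np.uint8))
    writer.release()
    return path
```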
6. The method of claim 4, wherein correcting the lip shape of the face in the initial dynamic face video according to the human voice audio data to obtain the target dynamic face video comprises:
inputting the human voice audio data and the initial dynamic face video into a lip shape correction model;
extracting audio data features respectively corresponding to each frame of audio data based on the lip shape correction model;
correcting, according to the audio data features respectively corresponding to each frame of audio data, lip shapes in the initial face images respectively corresponding to each frame of audio data in the initial dynamic face video, to obtain target face images respectively corresponding to each frame of audio data;
and generating the lip-corrected target dynamic face video according to the target face images respectively corresponding to each frame of audio data.
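A sketch of lip correction at inference time as described in claim 6, assuming `lip_model` is a trained lip shape correction network that takes a per-frame audio feature and the corresponding initial face image and returns a corrected image.

```python
import torch

@torch.no_grad()
def correct_lips(lip_model, audio_frames, initial_frames):
    """Claim 6, sketched. audio_frames: (n, audio_dim) tensor of per-frame audio
    data features; initial_frames: (n, 3, H, W) tensor of initial face images.
    lip_model is a hypothetical trained lip shape correction network."""
    corrected = []
    for audio_feat, face in zip(audio_frames, initial_frames):
        # The model fuses the per-frame audio feature with the face image and
        # redraws the mouth region so the lip shape matches the speech content.
        corrected.append(lip_model(audio_feat.unsqueeze(0), face.unsqueeze(0)))
    return torch.cat(corrected, dim=0)   # target face images, one per audio frame
```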
7. The method of claim 6, wherein the lip shape correction model is obtained by training a lip shape generative adversarial model, and the lip shape generative adversarial model comprises a lip shape generation network, a lip shape discrimination network and a video quality discrimination network; the method further comprises:
obtaining a first training sample pair, the first training sample pair comprising sample audio data and sample video data, wherein one frame of audio data in the sample audio data corresponds to one frame of video data in the sample video data;
inputting the first training sample pair into the lip shape generation network to obtain a predicted dynamic face video;
inputting the predicted dynamic face video and the sample audio data into the lip shape discrimination network to obtain a lip shape discrimination result for the predicted dynamic face video;
inputting the predicted dynamic face video and the sample video data into the video quality discrimination network to obtain a quality discrimination result for the predicted dynamic face video;
correcting network parameters of the lip shape generation network according to the predicted dynamic face video, the sample video data, the lip shape discrimination result and the quality discrimination result, to obtain a target lip shape generation network;
and determining the target lip shape generation network as the lip shape correction model.
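A sketch of one generator update for the lip shape generative adversarial model of claim 7. The reconstruction term, the two adversarial terms, and their weights are assumptions; the claim only fixes which signals (predicted video, sample video, lip shape discrimination result, quality discrimination result) feed the parameter correction.

```python
import torch
import torch.nn.functional as F

def train_step(generator, sync_disc, quality_disc, optimizer,
               sample_audio, sample_video, w_sync=0.3, w_quality=0.1):
    """One generator update in the spirit of claim 7. The three networks and the
    loss weights are assumptions; both discriminators are assumed to output
    probabilities in (0, 1)."""
    pred_video = generator(sample_audio, sample_video)

    # Reconstruction: predicted frames should match the ground-truth sample frames.
    recon_loss = F.l1_loss(pred_video, sample_video)
    # Lip shape discrimination: does the mouth motion match the sample audio?
    sync_score = sync_disc(pred_video, sample_audio)
    sync_loss = F.binary_cross_entropy(sync_score, torch.ones_like(sync_score))
    # Video quality discrimination: do predicted frames look like real frames?
    quality_score = quality_disc(pred_video)
    quality_loss = F.binary_cross_entropy(quality_score, torch.ones_like(quality_score))

    loss = recon_loss + w_sync * sync_loss + w_quality * quality_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```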
8. The method of claim 7, wherein the first training sample pair comprises a plurality of groups of data frames, one group of data frames comprising one frame of audio data in the sample audio data and the corresponding frame of video data in the sample video data;
and inputting the first training sample pair into the lip shape generation network to obtain the predicted dynamic face video comprises:
extracting, based on the lip shape generation network, sample audio features of each frame of audio data in the sample audio data and sample video features of each frame of video data in the sample video data;
performing feature fusion on the sample audio features and the sample video features contained in each group of data frames to obtain fused sample features respectively corresponding to each group of data frames;
decoding the fused sample features respectively corresponding to each group of data frames to obtain predicted face images respectively corresponding to each group of data frames;
and generating the predicted dynamic face video according to the predicted face images respectively corresponding to each group of data frames.
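A sketch of the lip shape generation network of claim 8 as an encode-fuse-decode module operating on one audio frame / video frame pair; the convolutional layout and the 64x64 image size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LipGenerator(nn.Module):
    """Hypothetical lip shape generation network (claim 8): encode audio and video
    frames separately, fuse the features per data-frame group, decode to an image."""
    def __init__(self, audio_dim=80, feat_dim=256, img_channels=3):
        super().__init__()
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, feat_dim), nn.ReLU())
        self.video_encoder = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.decoder = nn.Sequential(                # fused feature -> 3x64x64 image
            nn.Linear(2 * feat_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, audio_frame, video_frame):     # (B, audio_dim), (B, 3, 64, 64)
        fused = torch.cat([self.audio_encoder(audio_frame),
                           self.video_encoder(video_frame)], dim=1)
        return self.decoder(fused)                   # predicted face image
```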
9. The method of claim 1, wherein, after acquiring the human voice audio data, the method further comprises:
acquiring target human voice audio data of the target object, and extracting a human voice timbre feature of the target object according to the target human voice audio data;
converting the human voice audio data into a human voice audio sequence, and acquiring acoustic feature data of the human voice audio data;
inputting the human voice timbre feature, the human voice audio sequence and the acoustic feature data into a timbre conversion model to obtain a mel spectrogram containing the human voice timbre of the target object;
and performing vocoder conversion on the mel spectrogram to obtain converted human voice audio data of the target object, wherein the converted human voice audio data has the voice content of the human voice audio data and the human voice timbre of the target human voice audio data.
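A sketch of the timbre conversion pipeline in claim 9. The `speaker_encoder`, `conversion_model`, and `vocoder` are hypothetical trained components; `librosa` is used here only to illustrate the audio sequence and acoustic feature inputs (MFCCs and a mel spectrogram), which is an assumption about the feature types.

```python
import librosa

def convert_voice(source_wav, target_wav, sr,
                  speaker_encoder, conversion_model, vocoder):
    """Claim 9, sketched. speaker_encoder, conversion_model and vocoder are
    hypothetical trained components; librosa only supplies acoustic features."""
    # Human voice timbre feature of the target object (speaker embedding).
    timbre = speaker_encoder(target_wav)
    # Human voice audio sequence and acoustic feature data of the source speech.
    audio_sequence = librosa.feature.mfcc(y=source_wav, sr=sr, n_mfcc=20).T
    acoustic_feats = librosa.feature.melspectrogram(y=source_wav, sr=sr, n_mels=80).T
    # Mel spectrogram carrying the source content with the target timbre.
    mel = conversion_model(audio_sequence, timbre, acoustic_feats)
    # The vocoder turns the mel spectrogram back into a waveform.
    return vocoder(mel)
```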
10. The method of claim 9, wherein the timbre conversion model comprises an encoding network, an alignment network and a decoding network;
and inputting the human voice timbre feature, the human voice audio sequence and the acoustic feature data into the timbre conversion model to obtain the mel spectrogram containing the human voice timbre of the target object comprises:
inputting the human voice audio sequence into the encoding network for encoding to obtain an implicit feature sequence of the human voice audio sequence;
inputting the implicit feature sequence, the human voice timbre feature and the acoustic feature data into the alignment network, and performing audio frame alignment on the implicit feature sequence in the alignment network based on the human voice timbre feature and the acoustic feature data, to obtain an aligned audio sequence;
and inputting the aligned audio sequence into the decoding network for decoding to obtain the mel spectrogram.
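A sketch of the encoding-alignment-decoding structure of claim 10, simplified to a batch of one and a duration-based length regulator for the alignment step; the acoustic feature data input is folded into the timbre conditioning here for brevity, which is a simplification of the claim.

```python
import torch
import torch.nn as nn

class TimbreConverter(nn.Module):
    """Hypothetical timbre conversion model (claim 10): encoding network,
    alignment network (duration-based length regulation), decoding network."""
    def __init__(self, in_dim=20, hidden=256, timbre_dim=256, n_mels=80):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)
        self.duration = nn.Linear(hidden + timbre_dim, 1)   # frames per input step
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, audio_seq, timbre):
        # audio_seq: (1, T, in_dim); timbre: (1, timbre_dim); batch of one assumed.
        hidden_seq, _ = self.encoder(audio_seq)             # implicit feature sequence
        timbre_exp = timbre.unsqueeze(1).expand(-1, hidden_seq.size(1), -1)
        durations = self.duration(torch.cat([hidden_seq, timbre_exp], dim=-1))
        repeats = durations.squeeze(-1).round().clamp(min=1).long()[0]
        # Alignment: repeat each implicit feature according to its predicted duration.
        aligned = hidden_seq[0].repeat_interleave(repeats, dim=0).unsqueeze(0)
        decoded, _ = self.decoder(aligned)
        return self.to_mel(decoded)                         # mel spectrogram frames
```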
11. The method of claim 1, wherein correcting the lip shape of the face in the initial dynamic face video according to the human voice audio data to obtain the target dynamic face video comprises:
correcting the lip shape of the face in the initial dynamic face video according to the human voice audio data to obtain a corrected dynamic face video, wherein the corrected dynamic face video comprises a plurality of frames of corrected face images;
acquiring a face position in each frame of corrected face image, and performing image quality restoration on the face in each frame of corrected face image according to the face position in each frame of corrected face image;
and generating the target dynamic face video according to each frame of corrected face image subjected to image quality restoration.
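A sketch of the per-frame restoration loop in claim 11: locate the face, restore only the cropped face region, and composite it back at the same position. `detect_face_box` and `restore_face` are hypothetical callables, and `restore_face` is assumed to preserve the crop size.

```python
def restore_video_frames(corrected_frames, detect_face_box, restore_face):
    """Claim 11, sketched: detect_face_box returns (x, y, w, h) per frame and
    restore_face enhances a cropped face; both are hypothetical callables."""
    repaired_frames = []
    for frame in corrected_frames:                  # HxWx3 uint8 arrays
        x, y, w, h = detect_face_box(frame)
        face_crop = frame[y:y + h, x:x + w]
        restored = restore_face(face_crop)          # image quality restoration
        out = frame.copy()
        out[y:y + h, x:x + w] = restored            # composite back at the face position
        repaired_frames.append(out)
    return repaired_frames
```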
12. The method of claim 11, wherein performing image quality restoration on the face in each frame of corrected face image according to the face position in each frame of corrected face image comprises:
extracting a local face image from each frame of corrected face image according to the face position in each frame of corrected face image, wherein one frame of corrected face image corresponds to one frame of local face image;
inputting each frame of local face image into a face restoration model, and extracting coded face features of each frame of local face image based on the face restoration model;
performing mapping according to the coded face features of each frame of local face image to obtain restored face features of each frame of local face image;
performing image quality restoration on each frame of local face image according to the coded face features and the restored face features of each frame of local face image, to obtain a restored local face image corresponding to each frame of local face image;
and performing image synthesis on each frame of corrected face image and the corresponding restored local face image to obtain each frame of corrected face image subjected to image quality restoration.
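A sketch of the face restoration model's internal steps from claim 12 (encode the local face image, map the coded features to restored features, decode from both); the layer layout is an illustrative assumption, and the composition back into the full frame is handled as in the previous sketch.

```python
import torch
import torch.nn as nn

class FaceRestorer(nn.Module):
    """Hypothetical face restoration model (claim 12): encode the local face image,
    map the coded features to restored features, then decode both jointly."""
    def __init__(self, channels=3, feat=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.mapper = nn.Conv2d(feat, feat, 1)       # coded -> restored features
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, face_crop):                    # (B, 3, H, W), values in [0, 1]
        coded = self.encoder(face_crop)
        restored = self.mapper(coded)
        # Decode from both feature sets so low-level detail and restored detail combine.
        return self.decoder(torch.cat([coded, restored], dim=1))
```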
13. The method of claim 12, wherein the face restoration model is obtained by training a face restoration adversarial model, and the face restoration adversarial model comprises an encoding network, a decoding network and a restoration discrimination network;
the method further comprises:
obtaining a second training sample pair, the second training sample pair comprising a first resolution image and a second resolution image, wherein the image resolution of the first resolution image is lower than the image resolution of the second resolution image, and the first resolution image and the second resolution image have the same image content;
inputting the second training sample pair into the encoding network to obtain sample coded face features corresponding to the first resolution image, and performing mapping according to the sample coded face features to obtain sample restored face features of the first resolution image;
inputting the sample coded face features and the sample restored face features into the decoding network to obtain a predicted resolution image corresponding to the first resolution image;
inputting the second resolution image and the predicted resolution image into the restoration discrimination network to obtain a restoration discrimination result of the predicted resolution image;
acquiring a first image resolution feature of the first resolution image and a second image resolution feature of the second resolution image based on the restoration discrimination network;
correcting network parameters of the face restoration adversarial model according to the predicted resolution image, the second resolution image, the restoration discrimination result, the first image resolution feature and the second image resolution feature;
and determining the face restoration model according to the encoding network and the decoding network in the corrected face restoration adversarial model.
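A sketch of a generator-side training loss in the spirit of claim 13, assuming the restoration discrimination network returns both a real/fake score in (0, 1) and an internal resolution feature map. Note one deliberate simplification: the claim extracts resolution features of the first and second resolution images, whereas this sketch uses the common feature-matching variant that compares the predicted image's features with the second resolution image's features.

```python
import torch
import torch.nn.functional as F

def restorer_loss(restorer, discriminator, low_res, high_res,
                  w_adv=0.05, w_feat=1.0):
    """Loss sketch for claim 13. `discriminator` is assumed to return both a
    real/fake score and an internal resolution feature map for an input image."""
    pred = restorer(low_res)                               # predicted resolution image

    score_pred, feat_pred = discriminator(pred)            # restoration discrimination result
    _, feat_high = discriminator(high_res)                 # second resolution features

    pixel_loss = F.l1_loss(pred, high_res)                 # match the sharp target
    adv_loss = F.binary_cross_entropy(score_pred, torch.ones_like(score_pred))
    feat_loss = F.l1_loss(feat_pred, feat_high.detach())   # resolution feature matching

    return pixel_loss + w_adv * adv_loss + w_feat * feat_loss
```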
14. The method of any one of claims 1-13, wherein acquiring the human voice audio data and acquiring the face image of the target object comprises:
acquiring the human voice audio data and the face image sent by an object terminal of the target object;
and the method further comprises:
sending the target dynamic face video to the object terminal, so that the object terminal outputs the target dynamic face video.
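A minimal sketch of the terminal-server exchange described in claim 14, using Flask; the endpoint name, field names, and the `run_generation_pipeline` helper are hypothetical.

```python
from flask import Flask, request, send_file

app = Flask(__name__)

def run_generation_pipeline(audio_file, image_file):
    """Placeholder for the full pipeline of claim 1; returns a path to the result."""
    raise NotImplementedError("wire in the trained models here")

@app.route("/generate_video", methods=["POST"])      # endpoint name is illustrative
def generate_video():
    # Human voice audio data and face image sent by the object terminal (claim 14).
    audio_file = request.files["voice_audio"]
    image_file = request.files["face_image"]
    video_path = run_generation_pipeline(audio_file, image_file)
    # Send the target dynamic face video back so the terminal can output it.
    return send_file(video_path, mimetype="video/mp4")
```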
15. A server, comprising a processor, a memory, and a network interface, the processor, the memory, and the network interface being interconnected, wherein the memory is configured to store a computer program comprising program instructions, and wherein the processor is configured to invoke the program instructions to perform the method of any of claims 1-14.
16. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause a computer device having the processor to perform the method of any of claims 1-14.
CN202111109871.3A 2021-09-22 2021-09-22 Video generation method, device, server and storage medium Pending CN113901894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111109871.3A CN113901894A (en) 2021-09-22 2021-09-22 Video generation method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111109871.3A CN113901894A (en) 2021-09-22 2021-09-22 Video generation method, device, server and storage medium

Publications (1)

Publication Number Publication Date
CN113901894A true CN113901894A (en) 2022-01-07

Family

ID=79028755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111109871.3A Pending CN113901894A (en) 2021-09-22 2021-09-22 Video generation method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN113901894A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567693A (en) * 2022-02-11 2022-05-31 维沃移动通信有限公司 Video generation method and device and electronic equipment
CN114567693B (en) * 2022-02-11 2024-01-30 维沃移动通信有限公司 Video generation method and device and electronic equipment
WO2024056078A1 (en) * 2022-09-16 2024-03-21 腾讯科技(深圳)有限公司 Video generation method and apparatus and computer-readable storage medium
CN115580743A (en) * 2022-12-08 2023-01-06 成都索贝数码科技股份有限公司 Method and system for driving human mouth shape in video
CN116228895A (en) * 2023-01-16 2023-06-06 北京百度网讯科技有限公司 Video generation method, deep learning model training method, device and equipment
CN116228895B (en) * 2023-01-16 2023-11-17 北京百度网讯科技有限公司 Video generation method, deep learning model training method, device and equipment
CN116884066A (en) * 2023-07-10 2023-10-13 深锶科技(北京)有限公司 Lip synthesis technology-based 2D real person digital avatar generation method
CN116668611A (en) * 2023-07-27 2023-08-29 小哆智能科技(北京)有限公司 Virtual digital human lip synchronization method and system
CN117524244A (en) * 2024-01-08 2024-02-06 广州趣丸网络科技有限公司 Voice driving method and device for 3D digital person, storage medium and related equipment
CN117524244B (en) * 2024-01-08 2024-04-12 广州趣丸网络科技有限公司 Voice driving method and device for 3D digital person, storage medium and related equipment

Similar Documents

Publication Publication Date Title
CN113901894A (en) Video generation method, device, server and storage medium
CN111489287B (en) Image conversion method, device, computer equipment and storage medium
CN110706302B (en) System and method for synthesizing images by text
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
JP2023545642A (en) Target object movement driving method, device, equipment and computer program
JP7246811B2 (en) Data processing method, data processing device, computer program, and computer device for facial image generation
CN111652049A (en) Face image processing model training method and device, electronic equipment and storage medium
CN112733616B (en) Dynamic image generation method and device, electronic equipment and storage medium
WO2023050650A1 (en) Animation video generation method and apparatus, and device and storage medium
WO2023072067A1 (en) Face attribute editing model training and face attribute editing methods
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN115170388A (en) Character line draft generation method, device, equipment and medium
CN113837290A (en) Unsupervised unpaired image translation method based on attention generator network
CN110097615B (en) Stylized and de-stylized artistic word editing method and system
CN116634242A (en) Speech-driven speaking video generation method, system, equipment and storage medium
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN114529785A (en) Model training method, video generation method and device, equipment and medium
CN113450824B (en) Voice lip reading method and system based on multi-scale video feature fusion
CN113140023B (en) Text-to-image generation method and system based on spatial attention
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
Gowda et al. From pixels to portraits: A comprehensive survey of talking head generation techniques and applications
CN113822790A (en) Image processing method, device, equipment and computer readable storage medium
CN117557689B (en) Image processing method, device, electronic equipment and storage medium
CN115631285B (en) Face rendering method, device, equipment and storage medium based on unified driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination