CN117133310A - Audio-driven video generation method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN117133310A
Authority
CN
China
Prior art keywords: frame, audio, audio data, sample, video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311081384.XA
Other languages
Chinese (zh)
Inventor
王志波
徐晖宇
刘文鑫
金帅帆
胡佳慧
任奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Zhejiang University ZJU
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Alipay Hangzhou Information Technology Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202311081384.XA priority Critical patent/CN117133310A/en
Publication of CN117133310A publication Critical patent/CN117133310A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/16 Transforming into a non-visible representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The specification discloses an audio-driven video generation method and apparatus, a storage medium, and an electronic device. In the audio-driven video generation method provided in the specification, a target audio is acquired and input into a video generation model; the audio features of each frame of audio data are extracted; for each frame of audio data, the timing features of that frame are obtained from its audio features and the audio features of the audio data related to it; the audio features and the timing features of the frame are fused to obtain its audio-video features; a spatial attention map is obtained from the audio-video features of the frame and a preset speaker pose; the audio-video features of the frame are adjusted according to the spatial attention map; a target image of the frame is generated from the adjusted audio-video features; and a target video is determined from the target images of all frames of audio data.

Description

Audio-driven video generation method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to an audio-driven video generation method, an audio-driven video generation apparatus, a storage medium, and an electronic device.
Background
With the development of computer vision and deep learning technology, the use of artificial intelligence to synthesize and edit video is becoming increasingly widespread and mature. Audio-driven speaker video generation technology is widely applied in fields such as virtual reality, film production and privacy protection.
At present, existing audio-driven speaker video generation methods generally extract audio features from the audio and then let a neural network model learn the mapping from the audio features to the speaker video. In this process, the neural network model only indirectly learns the association between audio and video when generating the speaker video. However, audio and video differ in structure, and it is difficult for such indirect learning to capture the correlation between them, which ultimately leads to poor audio-video synchronization in the generated video.
Therefore, how to improve the audio-video synchronization of speaker videos generated under audio driving is a problem to be solved.
Disclosure of Invention
The present disclosure provides an audio-driven video generating method, apparatus, storage medium, and electronic device, so as to at least partially solve the foregoing problems of the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides an audio-driven video generation method, which comprises the following steps:
acquiring target audio, and inputting the target audio into a pre-trained video generation model, wherein the video generation model at least comprises an extraction subnet, a processing subnet, a fusion subnet, a spatial subnet, an adjustment subnet and a generation subnet;
extracting the audio characteristics of each frame of audio data in the target audio through the extraction subnet;
inputting the audio characteristics of the frame of audio data and the audio characteristics of a plurality of frames of audio data related to the frame of audio data into the processing sub-network for each frame of audio data so as to output the time sequence characteristics of the frame of audio data through the processing sub-network;
the audio characteristics and the time sequence characteristics of the frame of audio data are fused through the fusion sub-network, so that the audio and video characteristics of the frame of audio data are obtained;
inputting the audio-video characteristics of the frame of audio data and the preset speaker pose into the space subnet to obtain a space attention map output by the space subnet, wherein the space attention map is used for representing the attention weights of all part areas of the face of the speaker;
Adjusting the audio and video characteristics of the frame of audio data according to the spatial attention map through the adjustment sub-network;
inputting the adjusted audio and video characteristics into the generation sub-network so that the generation sub-network generates a target image of the frame of audio data according to the audio and video characteristics of the frame of audio data;
and determining a target video according to the target image of each frame of audio data.
Optionally, inputting the audio features of the frame of audio data and the audio features of the plurality of frames of audio data related to the frame of audio data into the processing subnet specifically includes:
the audio characteristics of the frame of audio data, and the audio characteristics of a specified number of frames of audio data that precede the frame of audio data in succession, are input to the processing sub-network.
Optionally, fusing the audio feature and the time sequence feature of the frame of audio data specifically includes:
and splicing the audio characteristics and the time sequence characteristics of the frame of audio data.
Optionally, the adjusting the audio-video feature of the frame of audio data according to the spatial attention map specifically includes:
multiplying the spatial attention map by the audio-video characteristics of the frame of audio data.
Optionally, the pre-training video generation model specifically includes:
Acquiring sample audio and sample video corresponding to the sample audio, and determining each frame of sample image in the sample video, wherein each frame of sample image of the sample video corresponds to each frame of sample audio data of the sample audio one by one;
inputting the sample audio into a video generation model to be trained;
extracting audio features to be optimized of each frame of sample audio data in the sample audio through the extraction subnet;
for each frame of sample audio data, inputting the audio characteristics to be optimized of the frame of sample audio data and sample audio characteristics of a plurality of frames of sample audio data related to the frame of sample audio data into the processing sub-network so as to output time sequence characteristics to be optimized of the frame of sample audio data through the processing sub-network;
fusing the audio characteristics to be optimized and the time sequence characteristics to be optimized of the frame of sample audio data through the fusion subnetwork to obtain the audio and video characteristics to be optimized of the frame of sample audio data;
inputting the audio and video characteristics to be optimized of the frame sample audio data and the preset speaker pose into the space subnet to obtain the space attention map to be optimized output by the space subnet;
Adjusting the audio and video characteristics to be optimized of the frame sample audio data according to the space attention diagram to be optimized through the adjustment sub-network;
inputting the adjusted audio and video characteristics to be optimized into the generation sub-network, so that the generation sub-network generates a target image to be optimized of the frame sample audio data according to the audio and video characteristics to be optimized of the frame sample audio data;
and training the video generation model by taking the minimum difference between the sample image corresponding to the frame of sample audio data and the target image to be optimized of the frame of sample audio data as an optimization target.
Optionally, training the video generation model specifically includes:
and adjusting parameters of the extraction subnet, the fusion subnet, the adjustment subnet and the generation subnet in the video generation model.
Optionally, the video generation model further includes: predicting a subnet;
the method further comprises the steps of:
determining the key points of the real designated parts of each frame of sample image, wherein the key points of the designated parts are used for representing the shape and the position of the designated parts of the speaker;
inputting the time sequence characteristics to be optimized of the frame sample audio data into the prediction sub-network to obtain key points of the designated parts to be optimized of the frame sample audio data output by the prediction sub-network;
And taking the minimum difference between the actual designated position key point of the sample image corresponding to the frame sample audio data and the designated position key point to be optimized of the frame sample audio data as an optimization target, and at least adjusting the parameters of the processing sub-network.
Optionally, the method further comprises:
determining a real space attention map of each frame of sample image;
and adjusting at least the parameters of the space sub-network by taking the minimum difference between the real space attention map of the sample image corresponding to the frame of sample audio data and the space attention map to be optimized of the frame of sample audio data as an optimization target.
Optionally, determining the real space attention map of each frame of sample image specifically includes:
for each frame of sample image, determining the real space attention map of the frame of sample image according to the real speaker pose in the frame of sample image and the preset weight.
The present specification provides an audio-driven video generation apparatus including:
the acquisition module is used for acquiring target audio and inputting the target audio into a pre-trained video generation model, wherein the video generation model at least comprises an extraction subnet, a processing subnet, a fusion subnet, a space subnet, an adjustment subnet and a generation subnet;
The extraction module is used for extracting the audio characteristics of each frame of audio data in the target audio through the extraction subnet;
the processing module is used for inputting the audio characteristics of the frame of audio data and the audio characteristics of a plurality of frames of audio data related to the frame of audio data into the processing sub-network for each frame of audio data so as to output the time sequence characteristics of the frame of audio data through the processing sub-network;
the fusion module is used for fusing the audio characteristics and the time sequence characteristics of the frame of audio data through the fusion subnet to obtain the audio and video characteristics of the frame of audio data;
the space module is used for inputting the audio-video characteristics of the frame of audio data and the preset speaker pose into the space subnet to obtain a space attention map output by the space subnet, wherein the space attention map is used for representing the attention weights of all parts of the face of the speaker;
the adjusting module is used for adjusting the audio and video characteristics of the frame of audio data according to the spatial attention map through the adjusting sub-network;
the generation module is used for inputting the adjusted audio and video characteristics into the generation subnet so that the generation subnet generates a target image of the frame of audio data according to the audio and video characteristics of the frame of audio data;
And the determining module is used for determining a target video according to the target image of each frame of audio data.
Optionally, the processing module is specifically configured to input the audio feature of the frame of audio data and the audio feature of the specified number of frames of audio data that are consecutive before the frame of audio data into the processing subnet.
Optionally, the fusion module is specifically configured to splice an audio feature and a time sequence feature of the frame of audio data.
Optionally, the adjusting module is specifically configured to multiply the spatial attention map with an audio-video feature of the frame of audio data.
Optionally, the device further comprises a training module, which is specifically configured to obtain sample audio and sample video corresponding to the sample audio, and determine each frame of sample image in the sample video, where each frame of sample image of the sample video corresponds to each frame of sample audio data of the sample audio one by one; inputting the sample audio into a video generation model to be trained; extracting audio features to be optimized of each frame of sample audio data in the sample audio through the extraction subnet; for each frame of sample audio data, inputting the audio characteristics to be optimized of the frame of sample audio data and sample audio characteristics of a plurality of frames of sample audio data related to the frame of sample audio data into the processing sub-network so as to output time sequence characteristics to be optimized of the frame of sample audio data through the processing sub-network; fusing the audio characteristics to be optimized and the time sequence characteristics to be optimized of the frame of sample audio data through the fusion subnetwork to obtain the audio and video characteristics to be optimized of the frame of sample audio data; inputting the audio and video characteristics to be optimized of the frame sample audio data and the preset speaker pose into the space subnet to obtain the space attention map to be optimized output by the space subnet; adjusting the audio and video characteristics to be optimized of the frame sample audio data according to the space attention diagram to be optimized through the adjustment sub-network; inputting the adjusted audio and video characteristics to be optimized into the generation sub-network, so that the generation sub-network generates a target image to be optimized of the frame sample audio data according to the audio and video characteristics to be optimized of the frame sample audio data; and training the video generation model by taking the minimum difference between the sample image corresponding to the frame of sample audio data and the target image to be optimized of the frame of sample audio data as an optimization target.
Optionally, the training module is specifically configured to adjust parameters of the extraction subnet, the fusion subnet, the adjustment subnet, and the generation subnet in the video generation model.
Optionally, the video generation model further includes: predicting a subnet;
the training module is also used for determining the key points of the real designated parts of each frame of sample image, wherein the key points of the designated parts are used for representing the shape and the position of the designated parts of the speaker; inputting the time sequence characteristics to be optimized of the frame sample audio data into the prediction sub-network to obtain key points of the designated parts to be optimized of the frame sample audio data output by the prediction sub-network; and taking the minimum difference between the actual designated position key point of the sample image corresponding to the frame sample audio data and the designated position key point to be optimized of the frame sample audio data as an optimization target, and at least adjusting the parameters of the processing sub-network.
Optionally, the training module is further configured to determine a real-space attention map of each frame of sample image; and adjusting at least the parameters of the space sub-network by taking the minimum difference between the real space attention map of the sample image corresponding to the frame of sample audio data and the space attention map to be optimized of the frame of sample audio data as an optimization target.
Optionally, the training module is specifically configured to determine, for each frame of sample image, a real spatial attention map of the frame of sample image according to a real speaker pose in the frame of sample image and a preset weight.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above-described audio-driven video generation method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described audio-driven video generation method when executing the program.
At least one of the technical solutions adopted in this specification can achieve the following beneficial effects:
In the audio-driven video generation method provided in the specification, a target audio is acquired and input into a video generation model; the audio features of each frame of audio data are extracted; for each frame of audio data, the timing features of that frame are obtained from its audio features and the audio features of the audio data related to it; the audio features and the timing features of the frame are fused to obtain its audio-video features; a spatial attention map is obtained from the audio-video features of the frame and a preset speaker pose; the audio-video features of the frame are adjusted according to the spatial attention map; a target image of the frame is generated from the adjusted audio-video features; and a target video is determined from the target images of all frames of audio data.
When the audio-driven video generation method provided in this specification is used to generate a speaker video, the audio features and timing features of each frame of audio data in the target audio can be obtained through a pre-trained video generation model and fused into audio-video features; the audio-video features are then adjusted with a spatial attention map so that they can distinguish the different regions of the speaker's face, and a target image is generated for each frame of audio data of the target audio, yielding the target video. When this method is used to generate a speaker video, the relations between frames of audio data and the importance of the different regions of the speaker's face are additionally taken into account on top of the audio features, so that a better correspondence is established between audio and video and a more accurate, more vivid target video is finally generated.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate exemplary embodiments of the specification and, together with the description, serve to explain the specification without unduly limiting it. In the accompanying drawings:
fig. 1 is a schematic flow chart of an audio-driven video generating method provided in the present specification;
Fig. 2 is a schematic diagram of a model structure of a video generating model provided in the present specification when applied;
FIG. 3 is a schematic diagram of a model structure of a video generating model during training provided in the present specification;
FIG. 4 is a schematic diagram of a preset weight provided in the present specification;
fig. 5 is a schematic structural diagram of an audio-driven video generating apparatus provided in the present specification;
fig. 6 is a schematic diagram of an electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a flow chart of an audio driving video generating method provided in the present specification, which includes the following steps:
S100: and obtaining target audio, and inputting the target audio into a pre-trained video generation model, wherein the video generation model at least comprises an extraction subnet, a processing subnet, a fusion subnet, a spatial subnet, an adjustment subnet and a generation subnet.
In this specification, the execution body that implements the audio-driven video generation method may be a designated device such as a server deployed on a service platform. For convenience of description, this specification takes the server as the execution body by way of example to describe the audio-driven video generation method provided herein.
The audio-driven video generation method provided in this specification is used to generate a video of a speaker speaking from the audio of the speaker's speech. Accordingly, in this step, the target audio may first be acquired and input into the pre-trained video generation model, where the target audio is typically audio that contains at least the speech content of the speaker.
The video generation model is used to generate a corresponding video from the input audio; in plain terms, it generates the picture of a speaker speaking according to what the speaker says. Fig. 2 shows the model structure of the video generation model provided in this specification. As shown in fig. 2, the video generation model may at least include an extraction subnet, a processing subnet, a fusion subnet, a spatial subnet, an adjustment subnet and a generation subnet. The video generation model may be a neural network model constructed based on a neural radiance field (NeRF).
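As an aid to understanding only, the overall subnet structure described above can be sketched in code. The following Python/PyTorch sketch is an assumption-laden illustration rather than the disclosed implementation: all dimensions and layer types are placeholders, the fusion and adjustment subnets are reduced to a concatenation and a multiplication, and the generation subnet is a single convolution rather than a neural radiance field.
```python
import torch
import torch.nn as nn

class AudioDrivenVideoModel(nn.Module):
    """Hypothetical skeleton of the six-subnet video generation model.
    All dimensions and layer choices are assumptions for illustration only."""

    def __init__(self, audio_dim=64, ts_dim=64, pose_dim=6, img_size=64):
        super().__init__()
        self.img_size = img_size
        # extraction subnet: per-frame raw audio feature -> audio feature (S102)
        self.extract = nn.Linear(80, audio_dim)
        # processing subnet: window of audio features -> timing feature (S104)
        self.process = nn.LSTM(audio_dim, ts_dim, batch_first=True)
        # spatial subnet: audio-video feature + pose -> spatial attention map (S108)
        self.spatial = nn.Sequential(
            nn.Linear(audio_dim + ts_dim + pose_dim, img_size * img_size),
            nn.Sigmoid())
        # generation subnet: adjusted audio-video feature -> target image (S112)
        self.generate = nn.Conv2d(audio_dim + ts_dim, 3, kernel_size=3, padding=1)

    def forward(self, frame_raw, window_raw, pose):
        # frame_raw: (B, 80) raw features of the current frame
        # window_raw: (B, T, 80) raw features of the current frame and its related frames
        a = self.extract(frame_raw)                             # audio feature (S102)
        _, (h, _) = self.process(self.extract(window_raw))
        ts = h[-1]                                              # timing feature (S104)
        av = torch.cat([a, ts], dim=-1)                         # fusion subnet: splicing (S106)
        attn = self.spatial(torch.cat([av, pose], dim=-1))
        attn = attn.view(-1, 1, self.img_size, self.img_size)   # spatial attention map (S108)
        av_adj = av[:, :, None, None] * attn                    # adjustment subnet: multiply (S110)
        img = torch.sigmoid(self.generate(av_adj))              # target image, (B, 3, H, W) (S112)
        return img, attn

model = AudioDrivenVideoModel()
img, attn = model(torch.randn(2, 80), torch.randn(2, 11, 80), torch.randn(2, 6))
```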
S102: and extracting the audio characteristics of each frame of audio data in the target audio through the extraction subnet.
In the video generation model, the extraction subnet is used to extract audio features from the target audio input into the model. The extraction subnet extracts the audio features of each frame of audio data of the target audio for use in the subsequent steps. It is easy to see that each frame of audio data in the target audio yields one corresponding frame of image when the video is generated, so the number of frames of the target audio should equal the number of frames of the target video to be generated. The specific frame rate, such as 24 frames per second or 30 frames per second, is not limited here and can be set as required.
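The specification does not prescribe how the extraction subnet computes per-frame audio features; the sketch below assumes a log-mel spectrogram front end (via the librosa library) aligned to the video frame rate, purely as one plausible example. The sample rate, frame rate and mel-band count are all assumptions.
```python
import numpy as np
import librosa

def per_frame_audio_features(wav_path: str, fps: int = 25, n_mels: int = 80) -> np.ndarray:
    """One possible front end for the extraction subnet: a log-mel spectrogram
    with one feature vector per video frame (an assumed choice)."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = sr // fps                                   # one spectrogram column per video frame
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop)
    return librosa.power_to_db(mel).T                 # shape: (num_frames, n_mels)
```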
S104: For each frame of audio data, inputting the audio features of the frame of audio data and the audio features of several frames of audio data related to it into the processing subnet, so as to output the timing features of the frame of audio data through the processing subnet.
In this step, the timing characteristics of the audio data of each frame may be determined according to the audio characteristics obtained in step S102 through the processing subnet in the video generation model.
Normally, during speaking, the facial expression, body movements and so on of the speaker may differ when the same word is spoken in different contexts; even for the same pronunciation, the speaker's mouth shape may differ when the preceding and following content differs. Therefore, when generating a target video from a target audio, the audio features cannot be considered in isolation; the contextual relations within the audio content, that is, the relations between audio features, also need to be considered.
Thus, in this step, for each frame of audio data, the audio characteristics of the frame of audio data, and the audio characteristics of several frames of audio data associated with the frame of audio data, may be input into the processing sub-network, resulting in the timing characteristics of the frame of audio data output by the processing sub-network. The audio data related to the frame of audio data may be audio data that is located before and after the frame of audio data in the target audio and is continuous with the frame of audio data. The processing subnetwork may be constructed based on Long Short-Term Memory (LSTM), among others.
Specifically, the audio features of the frame of audio data and the audio features of a specified number of consecutive frames of audio data preceding it may be input into the processing subnet, where the specified number can be set according to specific requirements. For example, assuming the specified number is 10, then for the 30th frame of audio data, the audio features of the 10 consecutive preceding frames, that is, the 20th to 29th frames, together with the audio features of the 30th frame itself (i.e., the audio features of the 20th to 30th frames) may be input into the processing subnet to obtain the timing features of the 30th frame of audio data given by the processing subnet.
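A minimal sketch of how such a window of audio features could be assembled is given below; the window length of 10 preceding frames and the padding strategy for the earliest frames are assumptions, not disclosed details.
```python
import torch

def build_window(audio_feats: torch.Tensor, t: int, n_prev: int = 10) -> torch.Tensor:
    """Collect the features of frame t and the n_prev frames immediately before it.

    audio_feats: (T, D) per-frame audio features of the whole target audio.
    For early frames with fewer than n_prev predecessors, the first frame is
    repeated as padding (an assumption; the specification does not specify this).
    """
    start = t - n_prev
    if start < 0:
        pad = audio_feats[0:1].repeat(-start, 1)     # repeat frame 0 to fill the gap
        window = torch.cat([pad, audio_feats[0:t + 1]], dim=0)
    else:
        window = audio_feats[start:t + 1]
    return window                                    # shape: (n_prev + 1, D)

# Example: for frame 30 with n_prev=10, the window covers frames 20..30.
feats = torch.randn(100, 64)
assert build_window(feats, 30).shape == (11, 64)
```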
S106: and fusing the audio characteristics and the time sequence characteristics of the frame of audio data through the fusion sub-network to obtain the audio and video characteristics of the frame of audio data.
For each frame of audio data, when a target image corresponding to the frame of audio data is obtained in a subsequent step, the audio feature and the time sequence feature of the frame of audio data are required to be adopted at the same time, so that in the step, the audio feature and the time sequence feature of the frame of audio data can be fused through a fusion network in a video generation model to obtain the audio and video feature of the frame of audio data.
There are various ways to fuse the audio features and the timing features, which can be chosen according to specific requirements in practical applications. For example, the audio features and the timing features of the frame of audio data can be spliced (concatenated), which preserves the original content of both features to the greatest extent in the resulting audio-video features.
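For example, the splicing operation can be sketched as follows, assuming 64-dimensional audio and timing features (the dimensions are assumptions).
```python
import torch

# Assumed 64-dimensional audio feature and timing feature for one frame of audio data.
audio_feat = torch.randn(1, 64)
timing_feat = torch.randn(1, 64)

# Fusion by splicing (concatenation) along the feature dimension, which keeps the
# original content of both features intact in the resulting audio-video feature.
av_feat = torch.cat([audio_feat, timing_feat], dim=-1)   # shape: (1, 128)
```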
S108: and inputting the audio-video characteristics of the frame of audio data and the preset speaker pose into the space subnet to obtain a space attention map output by the space subnet, wherein the space attention map is used for representing the attention weights of all the part areas of the face of the speaker.
In this step, for each frame of audio data, the audio-video feature of the frame of audio data and the preset speaker pose determined in step S106 may be input into the spatial subnet, so as to obtain a spatial attention map of the frame of audio data output by the spatial subnet.
The spatial attention map is used to characterize the attention weights of the regions of the speaker's face. It can take various forms; for example, it may be expressed as a vector whose components represent the attention weights of the different regions.
It is easy to see that different regions of the face differ in importance during speaking. For example, the mouth region of the speaker usually changes frequently and over a wide range while speaking, so it is relatively important and can be given a higher attention weight; the neck region rarely changes and only slightly, so it is relatively unimportant and its attention weight can be lower.
The speaker pose refers to the position and orientation of the speaker in the video, where the position may include the coordinates of the speaker in three-dimensional space and the orientation may include the angles of the speaker in three-dimensional space. The speaker pose can be preset according to specific requirements and is not specifically limited in this specification.
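A hypothetical spatial subnet along these lines is sketched below; the map resolution, the six-dimensional pose encoding and the network layers are assumptions rather than disclosed details.
```python
import torch
import torch.nn as nn

class SpatialSubnet(nn.Module):
    """Hypothetical spatial subnet: maps the audio-video feature and the preset
    speaker pose to a per-pixel attention map over the face region."""
    def __init__(self, av_dim=128, pose_dim=6, map_size=64):
        super().__init__()
        self.map_size = map_size
        self.net = nn.Sequential(
            nn.Linear(av_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, map_size * map_size), nn.Sigmoid())  # weights in [0, 1]

    def forward(self, av_feat, pose):
        x = torch.cat([av_feat, pose], dim=-1)
        return self.net(x).view(-1, 1, self.map_size, self.map_size)

# The pose is assumed here to be a 6-dim vector (3D position + 3 rotation angles).
attn_map = SpatialSubnet()(torch.randn(1, 128), torch.randn(1, 6))  # (1, 1, 64, 64)
```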
S110: and adjusting the audio and video characteristics of the frame of audio data according to the spatial attention map through the adjustment sub-network.
In this step, the audio-video features of the frame of audio data may be adjusted through the adjustment subnet in the video generation model, using the spatial attention map of the frame determined in step S108. By incorporating the spatial attention map, the audio-video features capture how important the different regions of the speaker's face are while speaking, so that the motion of each facial region can be better controlled when the target video is generated later.
There may be various ways to adjust the audio-video feature by using the spatial attention map, for example, the spatial attention map may be specifically multiplied by the audio-video feature of the frame of audio data, so that the audio-video feature has the capability of distinguishing the semantics of the different part areas of the face.
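A sketch of this multiplicative adjustment, with assumed tensor shapes, is as follows.
```python
import torch

av_feat = torch.randn(1, 128)            # fused audio-video feature (assumed 128-dim)
attn_map = torch.rand(1, 1, 64, 64)      # spatial attention map with weights in [0, 1]

# Adjustment by multiplication: broadcast the audio-video feature over the spatial
# grid and scale every position by its attention weight, so that facial regions
# with higher weights dominate the adjusted feature.
av_adjusted = av_feat[:, :, None, None] * attn_map   # shape: (1, 128, 64, 64)
```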
S112: and inputting the adjusted audio and video characteristics into the generation sub-network so that the generation sub-network generates a target image of the frame of audio data according to the audio and video characteristics of the frame of audio data.
In this step, the audio and video characteristics of the frame of audio data adjusted in step S110 may be input into a generating subnet of the video generating model, so as to obtain a target image of the frame of audio data output by the generating subnet. The target image is a frame of image in the target video which is expected to be obtained finally.
S114: and determining a target video according to the target image of each frame of audio data.
Finally, in this step, a target video may be determined from the target image of each frame of audio data generated by the foregoing steps. The target video can be obtained by arranging the target images of the audio data of each frame in the correct order.
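For illustration, the per-frame target images could be written out in order with OpenCV as sketched below; the frame rate, codec and file name are placeholders, not values fixed by the specification.
```python
import cv2
import numpy as np

def frames_to_video(frames, path="target_video.mp4", fps=25):
    """Write the per-frame target images, in order, to a video file.
    Each frame is assumed to be an (H, W, 3) uint8 BGR array."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()

# Toy usage with 50 blank frames.
frames_to_video([np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(50)])
```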
When the audio-driven video generation method provided in this specification is used to generate a speaker video, the audio features and timing features of each frame of audio data in the target audio can be obtained through a pre-trained video generation model and fused into audio-video features; the audio-video features are then adjusted with a spatial attention map so that they can distinguish the different regions of the speaker's face, and a target image is generated for each frame of audio data of the target audio, yielding the target video. When this method is used to generate a speaker video, the relations between frames of audio data and the importance of the different regions of the speaker's face are additionally taken into account on top of the audio features, so that a better correspondence is established between audio and video and a more accurate, more vivid target video is finally generated.
Additionally, the video generation model employed in the present specification may be trained in advance. Specifically, sample audio and sample video corresponding to the sample audio can be obtained, and each frame of sample image in the sample video is determined, wherein each frame of sample image of the sample video corresponds to each frame of sample audio data of the sample audio one by one; inputting the sample audio into a video generation model to be trained; extracting audio features to be optimized of each frame of sample audio data in the sample audio through the extraction subnet; for each frame of sample audio data, inputting the audio characteristics to be optimized of the frame of sample audio data and sample audio characteristics of a plurality of frames of sample audio data related to the frame of sample audio data into the processing sub-network so as to output time sequence characteristics to be optimized of the frame of sample audio data through the processing sub-network; fusing the audio characteristics to be optimized and the time sequence characteristics to be optimized of the frame of sample audio data through the fusion subnetwork to obtain the audio and video characteristics to be optimized of the frame of sample audio data; inputting the audio and video characteristics to be optimized of the frame sample audio data and the preset speaker pose into the space subnet to obtain the space attention map to be optimized output by the space subnet; adjusting the audio and video characteristics to be optimized of the frame sample audio data according to the space attention diagram to be optimized through the adjustment sub-network; inputting the adjusted audio and video characteristics to be optimized into the generation sub-network, so that the generation sub-network generates a target image to be optimized of the frame sample audio data according to the audio and video characteristics to be optimized of the frame sample audio data; and training the video generation model by taking the minimum difference between the sample image corresponding to the frame of sample audio data and the target image to be optimized of the frame of sample audio data as an optimization target.
In the training process, sample audio and sample video for training may be first acquired, where the sample audio and sample video may be content from an audio track and content from an image track in the same complete speaker video. The time length and the frame number of the sample audio and the sample video are the same, and each frame of sample image corresponds to each frame of sample audio data one by one. After the sample audio is input into the video generation model to be trained to obtain target images to be optimized of each frame of sample audio output by the video generation model, for each frame of sample audio, the minimum difference between the sample image corresponding to the frame of sample audio data and the target image to be optimized of the frame of sample audio data can be used as an optimization target, and the video generation model is trained. By the method, each target image to be optimized generated by the video generation model can be more close to a real original image. The loss function during training can be set as follows:
L_photo = Σ_{h=1}^{H} Σ_{w=1}^{W} ‖I_r(h, w) − I_gt(h, w)‖
wherein L_photo denotes the loss function of the overall video generation model, W denotes the width of the target image to be optimized and H denotes its height (the sample image has the same width and height as the target image to be optimized); I_r denotes the target image to be optimized and I_gt denotes the sample image; and (h, w) denotes the coordinate point at height h and width w in the image. The formula represents the sum of the differences between the pixels at corresponding positions of the target image to be optimized and the sample image of the same frame. The video generation model is trained with minimization of this loss function as the optimization target.
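A sketch of such a per-frame reconstruction loss in PyTorch is shown below; the absolute per-pixel difference follows the description above, and the image size is an assumption.
```python
import torch

def photometric_loss(img_pred: torch.Tensor, img_gt: torch.Tensor) -> torch.Tensor:
    """Per-frame reconstruction loss: sum over all pixel positions (h, w) of the
    difference between the to-be-optimized target image I_r and the sample image I_gt."""
    return (img_pred - img_gt).abs().sum()

pred = torch.rand(3, 64, 64, requires_grad=True)   # stand-in for the generated image
gt = torch.rand(3, 64, 64)                         # corresponding sample image
photometric_loss(pred, gt).backward()              # gradients reach the generating subnets
```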
In addition, in the audio-driven video generation method provided in this specification, when training the video generation model, the two key subnets, namely the processing subnet used to determine the timing features and the spatial subnet used to determine the spatial attention map, may each be trained separately. Specifically, when training the video generation model as a whole, the parameters of the extraction subnet, the fusion subnet, the adjustment subnet and the generation subnet may be adjusted.
The parameters of the processing subnet and the spatial subnet are not adjusted when the video generation model is trained as a whole; instead, each of them can be trained separately. How to train the processing subnet and the spatial subnet of the video generation model separately is described below.
The processing sub-network is used for determining the time sequence characteristic of one frame of audio data according to the audio characteristics of a plurality of continuous frames of audio data. Essentially, the content learned by the timing characteristics is the link between frames of audio data. As introduced in step S104 of the present specification, the association between the audio data of each frame is finally reflected on the facial expression of the speaker, so that the processing sub-network can be trained independently by taking the region of the face of the speaker as the focus, so that the time sequence characteristics determined by the processing sub-network can reflect the state of the face of the speaker more accurately.
Specifically, as shown in fig. 3, the video generation model may further include a prediction subnet; determining the key points of the real designated parts of each frame of sample image, wherein the key points of the designated parts are used for representing the shape and the position of the designated parts of the speaker; inputting the time sequence characteristics to be optimized of the frame sample audio data into the prediction sub-network to obtain key points of the designated parts to be optimized of the frame sample audio data output by the prediction sub-network; and taking the minimum difference between the actual designated position key point of the sample image corresponding to the frame sample audio data and the designated position key point to be optimized of the frame sample audio data as an optimization target, and at least adjusting the parameters of the processing sub-network.
The prediction sub-network is used for predicting key points of the appointed position to be optimized of the speaker according to the time sequence characteristics determined by the processing sub-network. The designated position key points can be expressed in a coordinate form, and the number of the designated position key points can be determined according to specific requirements and the size of the designated position; it is conceivable that the number of real designated-site keypoints should be the same as the number of designated-site keypoints to be optimized. In short, the key point of the specified part to be optimized can be a group of coordinates, and the prediction subnet considers that each coordinate in the group of coordinates is in the specified part area of the speaker in the sample image; the real designated location key point is also a set of coordinates, each of which is actually located in the designated location area of the speaker in the sample image. The designated portion may be any portion of the face of the speaker, and this specification is not particularly limited.
Take as an example the case where the designated location is the relatively critical lip region of the speaker. The to-be-optimized timing features of one frame of sample audio data are input into the prediction subnet to obtain the to-be-optimized lip key points of that frame predicted by the prediction subnet, while the real lip key points can be extracted from the sample image corresponding to that frame of sample audio data. With minimization of the difference between the to-be-optimized lip key points and the real lip key points as the optimization target, at least the parameters of the processing subnet are adjusted to complete the separate training of the processing subnet. The prediction subnet may be pre-trained or untrained: when the prediction subnet is pre-trained, only the parameters of the processing subnet may be adjusted; when the prediction subnet is untrained, the parameters of the processing subnet and the prediction subnet may be adjusted simultaneously. The processing subnet may be trained with the following formula as the loss function:
L_lmk = MSE(lmk_i, lmk′_i)
wherein L_lmk denotes the loss function of the processing subnet (and prediction subnet), lmk_i denotes the i-th real designated-location key point, and lmk′_i denotes the i-th to-be-optimized designated-location key point. The formula represents the sum of the differences between each real designated-location key point and the corresponding to-be-optimized designated-location key point. With minimization of this loss function as the optimization target, at least the parameters of the processing subnet are adjusted to complete the separate training of the processing subnet.
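The keypoint-supervised training of the processing subnet can be sketched as follows; the prediction subnet architecture, the number of key points and the feature dimensions are assumptions. This is a training-time sketch only, consistent with the note below that the prediction subnet is not used at inference.
```python
import torch
import torch.nn as nn

class PredictionSubnet(nn.Module):
    """Hypothetical prediction subnet: maps the timing feature of one frame of
    sample audio data to designated-location (here, lip) key-point coordinates."""
    def __init__(self, ts_dim=64, n_keypoints=20):
        super().__init__()
        self.n_keypoints = n_keypoints
        self.net = nn.Linear(ts_dim, n_keypoints * 2)   # (x, y) per key point

    def forward(self, timing_feat):
        return self.net(timing_feat).view(-1, self.n_keypoints, 2)

predict = PredictionSubnet()
timing_feat = torch.randn(1, 64, requires_grad=True)   # to-be-optimized timing feature
lmk_real = torch.rand(1, 20, 2)                        # real lip key points from the sample image

lmk_pred = predict(timing_feat)                        # to-be-optimized key points
loss_lmk = nn.functional.mse_loss(lmk_pred, lmk_real)  # L_lmk = MSE(lmk, lmk')
loss_lmk.backward()  # in the full model the gradient flows through the timing feature
                     # back into the processing subnet, whose parameters are adjusted
```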
It should be noted that the prediction subnet exists only in the stage of training the processing subnet alone, and does not exist in the process of training the video generation model as a whole or in the actual application process.
The spatial subnetworks are used to determine a spatial attention map of each frame of sample audio, the spatial attention map being used to characterize the attention weights of each region of the speaker's face. Thus, the spatial subnetwork may be trained separately based on the spatial attention map itself. Specifically, a real-space attention map of each frame of sample image may be determined; and adjusting at least the parameters of the space sub-network by taking the minimum difference between the real space attention map of the sample image corresponding to the frame of sample audio data and the space attention map to be optimized of the frame of sample audio data as an optimization target. The spatial subnetwork may be trained using the following loss functions:
L_M = Σ_{h=1}^{H} Σ_{w=1}^{W} ‖M_R(h, w) − M′_R(h, w)‖
wherein L_M is the loss function of the spatial subnet, W is the width of the sample image and H is its height; M_R is the real spatial attention map, M′_R is the to-be-optimized spatial attention map, and (h, w) denotes the coordinate point at height h and width w in the sample image. The formula characterizes the sum of the differences between the attention weights of the pixels at corresponding positions in the real spatial attention map and the to-be-optimized spatial attention map of the same frame. With minimization of this loss function as the optimization target, the parameters of the spatial subnet are adjusted to complete the training of the spatial subnet.
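A sketch of this loss, with an assumed 64×64 attention map and an absolute per-pixel difference, is:
```python
import torch

def spatial_attention_loss(m_opt: torch.Tensor, m_real: torch.Tensor) -> torch.Tensor:
    """Sum over all pixel positions of the difference between the to-be-optimized
    attention map M'_R and the real attention map M_R of the same frame."""
    return (m_opt - m_real).abs().sum()

m_opt = torch.rand(64, 64, requires_grad=True)   # output of the spatial subnet
m_real = torch.rand(64, 64)                      # real spatial attention map
spatial_attention_loss(m_opt, m_real).backward() # adjusts at least the spatial subnet
```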
Wherein the real-space attention map of each frame of sample images can be directly determined according to the content in each frame of sample images. Specifically, for each frame of sample image, a real space attention map of the frame of sample image can be determined according to the pose of the real speaker in the frame of sample image and a preset weight.
In a sample image, the real speaker pose determines which facial region each pixel position in the sample image belongs to. Different facial regions differ in importance during speaking, so they are assigned different weights. The weight of each region can be preset according to specific requirements and is not limited in this specification. For example, fig. 4 shows one example of the preset weights provided in this specification. Since the method generates the target video of the speaker speaking from the target audio of the speaker's speech, the mouth region of the speaker is the most critical: the weights of regions close to the mouth can be set higher, and the weights of regions far from the mouth can be set lower.
Specifically, the real-space attention map may be determined according to the following formula:
M_R(h, w) = β · ω_i, for (h, w) ∈ Ω_i, i = 1, 2, …, K
wherein Ω_i denotes the set of all coordinate points within the i-th facial region in the sample image, and M_R(h, w) denotes the attention weight of each coordinate point within that region; K denotes the total number of facial regions and i denotes the i-th region; ω_i denotes the preset weight of the i-th region, with each ω_i taking a value in [0, 1]; and β denotes a normalization coefficient used to adjust the maximum attention weight to 1. The real spatial attention map of each frame of sample image can be determined by this formula.
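A sketch of constructing the real spatial attention map from per-region masks and preset weights follows; how the region masks are derived from the real speaker pose is assumed given and is not shown, and the weights used here are illustrative.
```python
import numpy as np

def real_spatial_attention(region_masks, region_weights):
    """Build the real spatial attention map of one sample image.

    region_masks:   list of K boolean (H, W) arrays, one per facial region Omega_i,
                    derived from the real speaker pose in that frame (assumed given).
    region_weights: list of K preset weights in [0, 1], higher near the mouth.
    """
    attn = np.zeros(region_masks[0].shape, dtype=np.float32)
    for mask, w in zip(region_masks, region_weights):
        attn[mask] = w
    beta = 1.0 / max(attn.max(), 1e-8)   # normalization so the maximum weight becomes 1
    return beta * attn

# Toy example with two regions: mouth (weight 1.0) and neck (weight 0.2).
h = w = 8
mouth = np.zeros((h, w), dtype=bool); mouth[4:6, 2:6] = True
neck = np.zeros((h, w), dtype=bool); neck[7:, :] = True
attn_map = real_spatial_attention([mouth, neck], [1.0, 0.2])
```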
The foregoing describes one or more embodiments of the present disclosure, and based on the same concept, the present disclosure further provides a corresponding audio-driven video generating device, as shown in fig. 5.
Fig. 5 is a schematic diagram of an audio-driven video generating apparatus provided in the present specification, including:
the obtaining module 200 is configured to obtain a target audio, and input the target audio into a pre-trained video generation model, where the video generation model at least includes an extraction subnet, a processing subnet, a fusion subnet, a spatial subnet, an adjustment subnet, and a generation subnet;
an extracting module 202, configured to extract, through the extraction subnet, an audio feature of each frame of audio data in the target audio;
A processing module 204, configured to input, for each frame of audio data, an audio feature of the frame of audio data and audio features of a plurality of frames of audio data related to the frame of audio data into the processing subnet, so as to output a time sequence feature of the frame of audio data through the processing subnet;
the fusion module 206 is configured to fuse the audio feature and the time sequence feature of the frame of audio data through the fusion subnet, so as to obtain an audio-video feature of the frame of audio data;
the space module 208 is configured to input the audio-video feature of the frame of audio data and a preset speaker pose into the space subnet, and obtain a space attention map output by the space subnet, where the space attention map is used to characterize the attention weights of each part of the face of the speaker;
an adjustment module 210, configured to adjust, through the adjustment subnet, an audio-video feature of the frame of audio data according to the spatial attention map;
the generating module 212 is configured to input the adjusted audio and video feature into the generating subnet, so that the generating subnet generates a target image of the frame of audio data according to the audio and video feature of the frame of audio data;
the determining module 214 is configured to determine a target video according to the target image of each frame of audio data.
Optionally, the processing module 204 is specifically configured to input the audio feature of the frame of audio data and the audio feature of the specified number of frames of audio data that are consecutive before the frame of audio data into the processing subnet.
Optionally, the fusion module 206 is specifically configured to splice the audio feature and the time sequence feature of the frame of audio data.
Optionally, the adjusting module 210 is specifically configured to multiply the spatial attention map with an audio-video feature of the frame of audio data.
Optionally, the apparatus further includes a training module 216, specifically configured to obtain sample audio and sample video corresponding to the sample audio, and determine each frame of sample image in the sample video, where each frame of sample image of the sample video corresponds to each frame of sample audio data of the sample audio one by one; inputting the sample audio into a video generation model to be trained; extracting audio features to be optimized of each frame of sample audio data in the sample audio through the extraction subnet; for each frame of sample audio data, inputting the audio characteristics to be optimized of the frame of sample audio data and sample audio characteristics of a plurality of frames of sample audio data related to the frame of sample audio data into the processing sub-network so as to output time sequence characteristics to be optimized of the frame of sample audio data through the processing sub-network; fusing the audio characteristics to be optimized and the time sequence characteristics to be optimized of the frame of sample audio data through the fusion subnetwork to obtain the audio and video characteristics to be optimized of the frame of sample audio data; inputting the audio and video characteristics to be optimized of the frame sample audio data and the preset speaker pose into the space subnet to obtain the space attention map to be optimized output by the space subnet; adjusting the audio and video characteristics to be optimized of the frame sample audio data according to the space attention diagram to be optimized through the adjustment sub-network; inputting the adjusted audio and video characteristics to be optimized into the generation sub-network, so that the generation sub-network generates a target image to be optimized of the frame sample audio data according to the audio and video characteristics to be optimized of the frame sample audio data; and training the video generation model by taking the minimum difference between the sample image corresponding to the frame of sample audio data and the target image to be optimized of the frame of sample audio data as an optimization target.
Optionally, the training module 216 is specifically configured to adjust parameters of the extraction subnet, the fusion subnet, the adjustment subnet, and the generation subnet in the video generation model.
Optionally, the video generation model further includes: predicting a subnet;
the training module 216 is further configured to determine a real designated location key point of each frame of sample image, where the designated location key point is used to characterize a shape and a location of a designated location of the speaker; inputting the time sequence characteristics to be optimized of the frame sample audio data into the prediction sub-network to obtain key points of the designated parts to be optimized of the frame sample audio data output by the prediction sub-network; and taking the minimum difference between the actual designated position key point of the sample image corresponding to the frame sample audio data and the designated position key point to be optimized of the frame sample audio data as an optimization target, and at least adjusting the parameters of the processing sub-network.
Optionally, the training module 216 is further configured to determine a real-space attention map of each frame of sample images; and adjusting at least the parameters of the space sub-network by taking the minimum difference between the real space attention map of the sample image corresponding to the frame of sample audio data and the space attention map to be optimized of the frame of sample audio data as an optimization target.
Optionally, the training module is specifically configured to determine, for each frame of sample image, a real spatial attention map of the frame of sample image according to a real speaker pose in the frame of sample image and a preset weight.
The present specification also provides a computer readable storage medium storing a computer program operable to perform an audio-driven video generation method as provided in fig. 1 above.
The present specification also provides a schematic structural diagram, shown in fig. 6, of an electronic device corresponding to fig. 1. At the hardware level, as illustrated in fig. 6, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the audio-driven video generation method described in fig. 1. Of course, this specification does not exclude other implementations, such as logic devices or combinations of hardware and software; that is, the execution subject of the processing flows described above is not limited to logic units, and may also be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures: designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a particular programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can easily be obtained merely by slightly logically programming the method flow into an integrated circuit using one of the hardware description languages above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller; examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component, or even as both software modules implementing the method and structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing is merely an embodiment of the present specification and is not intended to limit the present specification. Various modifications and variations of the present specification will be apparent to those skilled in the art. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present specification shall fall within the scope of the claims of the present specification.

Claims (20)

1. An audio-driven video generation method, comprising:
acquiring target audio, and inputting the target audio into a pre-trained video generation model, wherein the video generation model at least comprises an extraction subnet, a processing subnet, a fusion subnet, a spatial subnet, an adjustment subnet and a generation subnet;
extracting the audio characteristics of each frame of audio data in the target audio through the extraction subnet;
inputting, for each frame of audio data, the audio characteristics of the frame of audio data and the audio characteristics of a plurality of frames of audio data related to the frame of audio data into the processing sub-network, so as to output the time sequence characteristics of the frame of audio data through the processing sub-network;
fusing the audio characteristics and the time sequence characteristics of the frame of audio data through the fusion sub-network to obtain the audio and video characteristics of the frame of audio data;
inputting the audio and video characteristics of the frame of audio data and a preset speaker pose into the space subnet to obtain a spatial attention map output by the space subnet, wherein the spatial attention map is used for representing the attention weights of the regions of the speaker's face;
adjusting the audio and video characteristics of the frame of audio data according to the spatial attention map through the adjustment sub-network;
inputting the adjusted audio and video characteristics into the generation sub-network so that the generation sub-network generates a target image of the frame of audio data according to the audio and video characteristics of the frame of audio data;
and determining a target video according to the target image of each frame of audio data.
2. The method of claim 1, wherein inputting the audio characteristics of the frame of audio data and the audio characteristics of the plurality of frames of audio data related to the frame of audio data into the processing subnet comprises:
inputting the audio characteristics of the frame of audio data and the audio characteristics of a specified number of frames of audio data that immediately precede the frame of audio data into the processing subnet.
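For readers, claim 2 can be pictured as a simple sliding window over the per-frame audio features; the window length and the clamping at the start of the sequence are assumptions made for this sketch, not values fixed by the claims.

```python
# Illustrative sketch only: the window length (4 preceding frames) and padding by
# repeating the first frame are assumptions.
import numpy as np

def context_window(audio_feats: np.ndarray, t: int, n_prev: int = 4) -> np.ndarray:
    """Return the features of frame t together with the n_prev frames that
    immediately precede it, as the input to the processing subnet."""
    idx = [max(i, 0) for i in range(t - n_prev, t + 1)]   # clamp at sequence start
    return audio_feats[idx]                               # shape (n_prev + 1, D)
```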
3. The method of claim 1, wherein fusing the audio characteristics and the time sequence characteristics of the frame of audio data comprises:
splicing the audio characteristics and the time sequence characteristics of the frame of audio data.
4. The method of claim 1, wherein adjusting the audio and video characteristics of the frame of audio data according to the spatial attention map comprises:
multiplying the spatial attention map by the audio and video characteristics of the frame of audio data.
5. The method of claim 1, wherein pre-training the video generation model specifically comprises:
acquiring sample audio and a sample video corresponding to the sample audio, and determining each frame of sample image in the sample video, wherein the frames of sample image of the sample video correspond one-to-one to the frames of sample audio data of the sample audio;
inputting the sample audio into a video generation model to be trained;
extracting audio features to be optimized of each frame of sample audio data in the sample audio through the extraction subnet;
for each frame of sample audio data, inputting the audio characteristics to be optimized of the frame of sample audio data and the audio characteristics to be optimized of a plurality of frames of sample audio data related to the frame of sample audio data into the processing sub-network, so as to output the time sequence characteristics to be optimized of the frame of sample audio data through the processing sub-network;
fusing the audio characteristics to be optimized and the time sequence characteristics to be optimized of the frame of sample audio data through the fusion subnetwork to obtain the audio and video characteristics to be optimized of the frame of sample audio data;
inputting the audio and video characteristics to be optimized of the frame of sample audio data and the preset speaker pose into the space subnet to obtain the spatial attention map to be optimized output by the space subnet;
adjusting, through the adjustment sub-network, the audio and video characteristics to be optimized of the frame of sample audio data according to the spatial attention map to be optimized;
inputting the adjusted audio and video characteristics to be optimized into the generation sub-network, so that the generation sub-network generates a target image to be optimized of the frame of sample audio data according to the audio and video characteristics to be optimized of the frame of sample audio data;
and training the video generation model by taking the minimum difference between the sample image corresponding to the frame of sample audio data and the target image to be optimized of the frame of sample audio data as an optimization target.
6. The method of claim 5, training the video generation model, comprising:
and adjusting parameters of the extraction subnet, the fusion subnet, the adjustment subnet and the generation subnet in the video generation model.
7. The method of claim 6, wherein the video generation model further comprises a prediction subnet;
the method further comprises the steps of:
determining the real designated-part key points of each frame of sample image, wherein the designated-part key points are used for representing the shape and position of a designated part of the speaker;
inputting the time sequence characteristics to be optimized of the frame of sample audio data into the prediction sub-network to obtain the designated-part key points to be optimized of the frame of sample audio data output by the prediction sub-network;
and adjusting at least the parameters of the processing sub-network by taking the minimum difference between the real designated-part key points of the sample image corresponding to the frame of sample audio data and the designated-part key points to be optimized of the frame of sample audio data as an optimization target.
8. The method of claim 6, the method further comprising:
determining a real space attention map of each frame of sample image;
and adjusting at least the parameters of the space sub-network by taking the minimum difference between the real space attention map of the sample image corresponding to the frame of sample audio data and the space attention map to be optimized of the frame of sample audio data as an optimization target.
9. The method of claim 8, wherein determining the real space attention map of each frame of sample image comprises:
for each frame of sample image, determining the real space attention map of the frame of sample image according to the real speaker pose in the frame of sample image and the preset weight.
10. An audio-driven video generating apparatus comprising:
the acquisition module is used for acquiring target audio and inputting the target audio into a pre-trained video generation model, wherein the video generation model at least comprises an extraction subnet, a processing subnet, a fusion subnet, a space subnet, an adjustment subnet and a generation subnet;
the extraction module is used for extracting the audio characteristics of each frame of audio data in the target audio through the extraction subnet;
the processing module is used for inputting the audio characteristics of the frame of audio data and the audio characteristics of a plurality of frames of audio data related to the frame of audio data into the processing sub-network for each frame of audio data so as to output the time sequence characteristics of the frame of audio data through the processing sub-network;
the fusion module is used for fusing the audio characteristics and the time sequence characteristics of the frame of audio data through the fusion subnet to obtain the audio and video characteristics of the frame of audio data;
the space module is used for inputting the audio-video characteristics of the frame of audio data and the preset speaker pose into the space subnet to obtain a spatial attention map output by the space subnet, wherein the spatial attention map is used for representing the attention weights of the regions of the speaker's face;
the adjusting module is used for adjusting the audio and video characteristics of the frame of audio data according to the spatial attention map through the adjusting sub-network;
the generation module is used for inputting the adjusted audio and video characteristics into the generation subnet so that the generation subnet generates a target image of the frame of audio data according to the audio and video characteristics of the frame of audio data;
and the determining module is used for determining a target video according to the target image of each frame of audio data.
11. The apparatus of claim 10, wherein the processing module is configured to input the audio characteristics of the frame of audio data and the audio characteristics of a specified number of consecutive frames of audio data preceding the frame of audio data into the processing subnet.
12. The apparatus of claim 10, wherein the fusion module is specifically configured to splice an audio feature and a timing feature of the frame of audio data.
13. The apparatus of claim 10, wherein the adjusting module is specifically configured to multiply the spatial attention map by the audio and video characteristics of the frame of audio data.
14. The apparatus of claim 10, further comprising a training module, specifically configured to: acquire sample audio and a sample video corresponding to the sample audio, and determine each frame of sample image in the sample video, wherein the frames of sample image of the sample video correspond one-to-one to the frames of sample audio data of the sample audio; input the sample audio into a video generation model to be trained; extract, through the extraction subnet, the audio characteristics to be optimized of each frame of sample audio data in the sample audio; for each frame of sample audio data, input the audio characteristics to be optimized of the frame of sample audio data and the audio characteristics to be optimized of a plurality of frames of sample audio data related to the frame of sample audio data into the processing sub-network, so as to output the time sequence characteristics to be optimized of the frame of sample audio data through the processing sub-network; fuse, through the fusion subnet, the audio characteristics to be optimized and the time sequence characteristics to be optimized of the frame of sample audio data to obtain the audio and video characteristics to be optimized of the frame of sample audio data; input the audio and video characteristics to be optimized of the frame of sample audio data and the preset speaker pose into the space subnet to obtain the spatial attention map to be optimized output by the space subnet; adjust, through the adjustment sub-network, the audio and video characteristics to be optimized of the frame of sample audio data according to the spatial attention map to be optimized; input the adjusted audio and video characteristics to be optimized into the generation sub-network, so that the generation sub-network generates a target image to be optimized of the frame of sample audio data according to the audio and video characteristics to be optimized of the frame of sample audio data; and train the video generation model by taking the minimum difference between the sample image corresponding to the frame of sample audio data and the target image to be optimized of the frame of sample audio data as an optimization target.
15. The apparatus of claim 14, wherein the training module is specifically configured to adjust parameters of an extraction subnet, a fusion subnet, an adjustment subnet, and a generation subnet in the video generation model.
16. The apparatus of claim 15, wherein the video generation model further comprises a prediction subnet;
the training module is further configured to: determine the real designated-part key points of each frame of sample image, wherein the designated-part key points are used for representing the shape and position of a designated part of the speaker; input the time sequence characteristics to be optimized of the frame of sample audio data into the prediction sub-network to obtain the designated-part key points to be optimized of the frame of sample audio data output by the prediction sub-network; and adjust at least the parameters of the processing sub-network by taking the minimum difference between the real designated-part key points of the sample image corresponding to the frame of sample audio data and the designated-part key points to be optimized of the frame of sample audio data as an optimization target.
17. The apparatus of claim 15, wherein the training module is further configured to determine the real space attention map of each frame of sample image, and to adjust at least the parameters of the space sub-network by taking the minimum difference between the real space attention map of the sample image corresponding to the frame of sample audio data and the spatial attention map to be optimized of the frame of sample audio data as an optimization target.
18. The apparatus of claim 17, wherein the training module is configured to determine, for each frame of sample image, a real spatial attention map of the frame of sample image based on a real speaker pose in the frame of sample image and a preset weight.
19. A computer readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of the preceding claims 1 to 9.
20. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding claims 1-9 when the program is executed.
CN202311081384.XA 2023-08-24 2023-08-24 Audio drive video generation method and device, storage medium and electronic equipment Pending CN117133310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311081384.XA CN117133310A (en) 2023-08-24 2023-08-24 Audio drive video generation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311081384.XA CN117133310A (en) 2023-08-24 2023-08-24 Audio drive video generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117133310A true CN117133310A (en) 2023-11-28

Family

ID=88857813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311081384.XA Pending CN117133310A (en) 2023-08-24 2023-08-24 Audio drive video generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117133310A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117689783A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field
CN117689783B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination