CN114202604A - Voice-driven target person video generation method and device and storage medium - Google Patents

Voice-driven target person video generation method and device and storage medium

Info

Publication number
CN114202604A
CN114202604A (application CN202111466434.7A)
Authority
CN
China
Prior art keywords
key point
voice
video
image
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111466434.7A
Other languages
Chinese (zh)
Inventor
王波
吴笛
张沅
刘吉伟
罗东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Great Wall Information Co Ltd
Original Assignee
Great Wall Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Great Wall Information Co Ltd filed Critical Great Wall Information Co Ltd
Priority to CN202111466434.7A
Publication of CN114202604A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a voice-driven target person video generation method, device and storage medium. The method comprises the following steps: acquiring voice data and a front image of a person's upper body; extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix from the acquired upper body front image; separating voice content information and audio information from the acquired voice data; training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinates and the upper body key point coordinates based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper body key point coordinate matrix; generating a video image frame sequence based on the multi-dimensional mapping relation; and splicing the video image frame sequence with the voice data to obtain the target person speech audio/video. The linkage between head motion and the upper body is fully considered, and the generated video is natural and highly realistic.

Description

Voice-driven target person video generation method and device and storage medium
Technical Field
The invention relates to the technical field of computer information, in particular to a method and a device for generating a video of a voice-driven target person and a storage medium.
Background
Voice-driven face video generation is an important research direction in human-computer interaction: an arbitrary segment of speech is input, and a video of a target person speaking with synchronized voice and mouth shape is generated. Existing approaches are mainly end-to-end trained encoder-decoder frameworks. Chen et al. use an attention mechanism to train facial features and improve video accuracy. Mittal et al. propose separating speech content from emotion information and controlling face and head motion with different feature dimensions. Edwards et al. map speech to the face in multiple dimensions to control facial motion. Qian et al. propose AutoVC, a few-shot voice conversion approach that separates speech into content and speaker identity information. Zhou et al. propose separating speech into content and target person information: the content information strongly controls the motion of the lips and nearby facial regions, while the target person information determines facial expression details and other dynamics of the head. Nian et al. propose a face-and-mouth key point method that uses face contour key points and lip key points to represent the head motion and lip motion of the target person, respectively. Although these prior methods work well, they mainly focus on facial expression and lip motion and ignore the linkage between head motion and the upper body. The generated talking videos of the target person basically contain only facial motion and a small amount of head motion, and the head and upper body are not naturally linked.
Disclosure of Invention
The invention provides a voice-driven target person video generation method, device and storage medium, aiming to solve the problem that the target person talking videos generated by the prior art basically contain only facial motion and a small amount of head motion, with unnatural linkage between the head and the upper body.
In a first aspect, a method for generating a video of a voice-driven target person is provided, which includes:
acquiring voice data and a front image of the person's upper body;
extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired front image of the upper body of the person;
separating voice content information and audio information based on the acquired voice data;
training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
generating a video image frame sequence based on a multi-dimensional mapping relation;
and splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
Further, the extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired image of the front of the upper body of the person includes:
dividing the person's upper body front image into a head image and an upper body image;
extracting an initial head key point coordinate matrix from the head image by using a head key point extraction model;
and extracting an initial upper body key point coordinate matrix in the upper body image by adopting an upper body key point extraction model.
Further, the separating the voice content information and the audio information based on the acquired voice data includes:
extracting a Mel spectrogram feature matrix of the voice data;
inputting the Mel spectrogram feature matrix into an LSTM-based voice feature extraction network to obtain a voice feature matrix;
for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into a voice content encoder to obtain a voice content matrix;
and for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into an audio encoder to obtain an audio matrix.
Further, the training of the multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinates and the upper body key point coordinates based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper body key point coordinate matrix specifically includes:
inputting the voice content matrix and the initial head key point coordinate matrix into a first multilayer perceptron, and predicting the displacement of the head key point coordinate matrix in each frame of image;
obtaining a head position prediction coordinate of each frame of image based on the initial head key point coordinate matrix and the displacement of the head key point coordinate matrix in each frame of image;
fusing the voice content matrix and the audio matrix based on a self-attention network to obtain a self-reconstructed audio movement matrix;
inputting the self-reconstructed audio movement matrix, the initial head key point coordinate matrix and the initial upper body key point coordinate matrix into a second multilayer perceptron, and predicting the overall displacement of the head key point coordinate matrix and the upper body key point coordinate matrix for each frame of image;
and obtaining the overall predicted coordinate of each frame of image based on the head position predicted coordinate of each frame of image, the head key point coordinate matrix of each frame of image and the overall displacement of the upper body key point coordinate matrix.
Further, in training the mapping relation between the voice content matrix and the head position prediction coordinates, the voice content encoder and the first multilayer perceptron that are used can be obtained in advance by joint training on collected video data, and the minimized loss function L_c adopted during training is as follows:
L_c = Σ_{t=1}^{T} Σ_{i=1}^{N} ( ||P_{i,t} − P̂_{i,t}||_2 + λ_c·||L(P_{i,t}) − L(P̂_{i,t})||_2 )
where P_{i,t} denotes the predicted coordinate position of the i-th head key point in the t-th frame image, P̂_{i,t} denotes the actual coordinate position of the i-th head key point in the t-th frame image, λ_c denotes a weight coefficient, L(P_{i,t}) and L(P̂_{i,t}) denote the graph Laplacian coordinates of P_{i,t} and P̂_{i,t} respectively, N denotes the total number of head key points, T denotes the total number of image frames, and ||·||_2 denotes the L2 norm.
Further, the generating a video image frame sequence based on the multi-dimensional mapping relationship includes:
and inputting the initial head key point coordinate matrix, the head position prediction coordinate of each frame of image and the overall prediction coordinate of each frame of image into a video frame reconstruction model, and fusing to obtain a reconstructed video frame sequence.
Further, the minimized loss function L_a adopted in training the video frame reconstruction model is as follows:
L_a = ||Q_trg − Q̂_trg||_1 + λ_a·||φ(Q_trg) − φ(Q̂_trg)||_1
where src denotes the source video frame and trg the target video frame, Q_trg denotes the reconstructed video frame, Q̂_trg denotes the real video frame, λ_a denotes a weight coefficient with λ_a = 1, φ(Q_trg) and φ(Q̂_trg) denote the features of the reconstructed and real video frames extracted by a pre-trained ResNet-34 network, and ||·||_1 denotes the L1 norm.
Further, the splicing of the video image frame sequence with the voice data to obtain the target person speech audio/video includes:
splicing the video image frame sequence with the voice data using ffmpeg to obtain the target person speech video, wherein the number of frames per second of the target person speech audio/video is a preset value.
In a second aspect, a voice-driven target person video generation apparatus is provided, including:
the data acquisition module is used for acquiring voice data and a front image of the person's upper body;
the key point extraction module is used for extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired upper body front image;
the voice data separation module is used for separating content information and audio information based on the acquired voice data;
the voice image mapping module is used for training a multi-dimensional mapping relation among the content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
the video frame generation module is used for generating a video image frame sequence based on a multi-dimensional mapping relation;
and the video generation module is used for splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
In a third aspect, a computer-readable storage medium is provided, which stores a computer program executed by a processor to implement the voice-driven target person video generation method as described above.
Advantageous effects
The invention provides a voice-driven target person video generation method, device and storage medium. In the technical scheme of the invention, the voice content information and the audio information in the voice data are first separated; the motion of the head is controlled by the voice content information, and the natural swing of the head and upper body is controlled by the audio information. Multi-dimensional mapping training is performed between the voice content information and audio information on one side and the head key points and upper body key points on the other side to generate the mapping relation, a video image frame sequence is obtained, and finally the front-view speech video of the target person's upper body is generated. The linkage between head motion and the upper body is fully considered, and the generated video is natural and highly realistic.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for generating a voice-driven target person video according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
Example 1
The embodiment provides a voice-driven target person video generation method, which comprises the following steps:
S1: Acquire voice data and a front image of the person's upper body. In implementation, the upper body front image includes the head and the upper body part, and there may be one or more such images.
S2: Extract an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired upper body front image. More specifically, this comprises the following steps:
S21: The upper body front image is preprocessed to a uniform size, for example 768 × 768. The image is split into a head image and an upper body image by fixed rectangular cropping; the crop region is determined by the coordinates of just two points, the upper left corner and the lower right corner. For example:
S211: With the head's upper left corner coordinates (256, 0) and lower right corner coordinates (512, 256), the head image img_head is obtained by formula (1):
img_head = Image[256:512, 0:256, :]    (1)
S212: The upper body image does not need the head, so only the y coordinates 256 to 768 of the picture are required, with all x-coordinate pixels kept, and the upper body image img_body is obtained by formula (2):
img_body = Image[:, 256:768, :]    (2)
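As a concrete illustration of S21-S212, the cropping can be written directly as array slicing. This is a minimal sketch in which the use of OpenCV for reading and resizing and the file name are assumptions; the slice axes simply follow formulas (1) and (2) as given above.

```python
import cv2

# Read the upper body front image and resize it to the uniform 768 x 768
# resolution described in S21 (file name and OpenCV I/O are illustrative).
image = cv2.imread("person_upper_body.jpg")
image = cv2.resize(image, (768, 768))

# S211: head crop defined by the upper left corner (256, 0) and the
# lower right corner (512, 256), as in formula (1).
img_head = image[256:512, 0:256, :]

# S212: upper body crop keeping coordinates 256..767 on the second axis
# and all pixels on the first axis, as in formula (2).
img_body = image[:, 256:768, :]
```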
S22: The head key point extraction model is used to extract the initial head key point coordinate matrix from the head image. In this embodiment, the open-source DLIB face detection and key point extraction toolkit is used directly to obtain the 68-point initial head key point coordinate matrix q_head of the head image img_head; the head key point coordinates contain the lip motion information and head motion information of the face.
q_head = DLIB(img_head)    (3)
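A minimal sketch of the 68-point extraction in S22 with the open-source DLIB toolkit follows. The shape-predictor file path is an assumption (the standard pre-trained shape_predictor_68_face_landmarks.dat model must be downloaded separately), and DLIB returns 2D landmarks, whereas q_head is later treated as a 68 × 3 matrix, so any lifting to 3D is outside this sketch.

```python
import dlib
import numpy as np

# DLIB frontal face detector and the pre-trained 68-landmark predictor
# (the model file path is illustrative).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_head_keypoints(img_head):
    """Return a 68 x 2 matrix of initial head key point coordinates."""
    faces = detector(img_head, 1)          # upsample once to help small crops
    if not faces:
        raise ValueError("no face detected in the head image")
    shape = predictor(img_head, faces[0])  # 68 facial landmarks
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)
```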
S23: The upper body key point extraction model is used to extract the initial upper body key point coordinate matrix from the upper body image. In this embodiment, the open-source Inception_v4 model is used to obtain the 128-dimensional initial key point coordinate matrix q_body of the upper body image img_body.
q_body = Inception_v4(img_body)    (4)
S3: training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix.
A piece of voice data can be divided into voice content information and audio information, which are obtained with different models. The voice content information mainly determines the general motion of the lips and the nearby regions, while the audio information determines the dynamics of the head and the fine motion of the body. For example, while a person is speaking, the head and facial expression change and there is no obvious breathing motion, whereas during short pauses the changes in head and facial expression diminish and breathing motion becomes noticeable. The training specifically comprises the following steps.
S31: Voice content information-image training
The coordinate mapping of the head feature key points, mainly the key points of the lips and the surrounding face, is obtained by training on the voice content signal.
S311: The voice data is read in python with the librosa library.
S312: The Mel spectrogram feature matrix Audio_mfcc of the voice data is extracted with the python_speech_features package.
S313: The Mel spectrogram feature matrix Audio_mfcc of the voice data is input into the LSTM-based voice feature extraction network to obtain the voice feature matrix A_t ∈ R^{M×D}, where M is the total number of input speech frames and D is the dimension of the voice feature matrix.
A_t = LSTM(Audio_mfcc)    (5)
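A sketch of the feature extraction in S311-S313 under stated assumptions: the audio file name, sampling rate and the choice of 13 cepstral coefficients are illustrative, and the three-layer LSTM with hidden size 256 follows the architecture note given later in S315.

```python
import librosa
import torch
import torch.nn as nn
from python_speech_features import mfcc

# S311: read the voice data with librosa (file name and rate are illustrative).
signal, sr = librosa.load("speech.wav", sr=16000)

# S312: Mel-cepstral feature matrix Audio_mfcc of shape (M, 13).
audio_mfcc = mfcc(signal, samplerate=sr, numcep=13)

# S313: LSTM-based voice feature extraction network producing A_t in R^{M x D}.
class VoiceFeatureLSTM(nn.Module):
    def __init__(self, in_dim=13, hidden=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)

    def forward(self, x):          # x: (1, M, in_dim)
        out, _ = self.lstm(x)
        return out                 # (1, M, hidden), i.e. the matrix A_t

A_t = VoiceFeatureLSTM()(torch.from_numpy(audio_mfcc).float().unsqueeze(0))
```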
S314: The sampling rate of the voice data is much higher than the video frame rate, so multiple frames of voice data correspond to one image frame within the corresponding time window. Therefore, for each image frame, the voice feature matrices within the corresponding window of a preset number of speech frames (set to 18 frames in this embodiment) are input into the AutoVC voice content encoder E_c to obtain the voice content matrix c_t:
c_t = AutoVC.E_c(A_t; w_lstm,C)    (6)
S315: The voice content matrix c_t and the initial head key point coordinate matrix q_head, with q_head ∈ R^{68×3}, are input into the first multilayer perceptron (MLP_c) to predict the displacement Δq_t of the head key point coordinate matrix in each frame of image:
Δq_t = MLP_c(c_t, q_head; w_mlp,C)    (7)
where {w_lstm,C, w_mlp,C} are the learnable parameters of the AutoVC.E_c and MLP_c networks, respectively. The LSTM has three layers of cells, each with an internal hidden state vector of size 256. The decoder (first multilayer perceptron) MLP_c has three layers, with internal hidden state vector sizes of 512, 256 and 204 (68 × 3), respectively.
S316: Based on the initial head key point coordinate matrix q_head and the displacement Δq_t of the head key point coordinate matrix in each frame of image, the head position prediction coordinates P_t of each frame of image are obtained:
P_t = q_head + Δq_t    (8)
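A sketch of the displacement decoder of formulas (7) and (8): a three-layer perceptron with hidden sizes 512 and 256 and a 204-dimensional output (68 × 3), as described in S315. The 256-dimensional voice content vector, the ReLU activations and the concatenation of c_t with the flattened q_head are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MLPc(nn.Module):
    """First multilayer perceptron: predicts the per-frame head key point
    displacement delta_q_t from the voice content matrix c_t and the
    initial head key point coordinates q_head (formula (7))."""
    def __init__(self, content_dim=256, n_kp=68):
        super().__init__()
        self.n_kp = n_kp
        self.net = nn.Sequential(
            nn.Linear(content_dim + n_kp * 3, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_kp * 3),          # 204 = 68 x 3 displacements
        )

    def forward(self, c_t, q_head):
        # c_t: (B, content_dim), q_head: (B, 68, 3)
        x = torch.cat([c_t, q_head.flatten(start_dim=1)], dim=-1)
        return self.net(x).view(-1, self.n_kp, 3)

# Formula (8): head position prediction coordinates for frame t.
# P_t = q_head + MLPc()(c_t, q_head)
```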
In implementation, when training the mapping relation between the voice content matrix and the head position prediction coordinates, the voice content encoder and the first multilayer perceptron that are used can be obtained in advance by joint training on collected video data. The minimized loss function L_c adopted during training evaluates the distance between the actual and predicted coordinate positions, as well as the distance between their respective graph Laplacian coordinates, which helps place the coordinates correctly relative to each other and preserves head detail. The loss function is formulated as follows:
L_c = Σ_{t=1}^{T} Σ_{i=1}^{N} ( ||P_{i,t} − P̂_{i,t}||_2 + λ_c·||L(P_{i,t}) − L(P̂_{i,t})||_2 )    (9)
where P_{i,t} denotes the predicted coordinate position of the i-th head key point in the t-th frame image, P̂_{i,t} denotes the actual coordinate position of the i-th head key point in the t-th frame image, λ_c denotes a weight coefficient (λ_c = 1 in this embodiment), L(P_{i,t}) and L(P̂_{i,t}) denote the graph Laplacian coordinates of P_{i,t} and P̂_{i,t} respectively, N denotes the total number of head key points, T denotes the total number of image frames, and ||·||_2 denotes the L2 norm. The graph Laplacian coordinates are calculated by the following formula:
L(P_{i,t}) = P_{i,t} − (1/|N(P_i)|) · Σ_{P_j ∈ N(P_i)} P_{j,t}    (10)
where N(P_i) contains the coordinates adjacent to P_i belonging to the same facial part; 8 facial parts are used, each containing a predefined subset of the coordinates of the face template. L(P̂_{i,t}) is calculated in the same way as in formula (10) and is not repeated here.
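The graph Laplacian coordinate of formula (10) is simply the offset of each key point from the mean of its neighbours within the same facial part. A minimal sketch follows; the grouping of the 68 landmarks into 8 facial parts uses common DLIB index ranges, which are an assumption here, and all other points of a part are treated as neighbours.

```python
import numpy as np

# Illustrative grouping of the 68 landmarks into 8 facial parts
# (jaw, right brow, left brow, nose, right eye, left eye, outer lip, inner lip).
FACE_PARTS = [range(0, 17), range(17, 22), range(22, 27), range(27, 36),
              range(36, 42), range(42, 48), range(48, 60), range(60, 68)]

def graph_laplacian_coords(P):
    """P: (68, 3) key point coordinates -> (68, 3) graph Laplacian coordinates,
    L(P_i) = P_i - mean of the neighbouring points in the same facial part."""
    L = np.zeros_like(P)
    for part in FACE_PARTS:
        idx = np.array(list(part))
        for i in idx:
            neighbours = idx[idx != i]
            L[i] = P[i] - P[neighbours].mean(axis=0)
    return L
```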
s32: speech content information and audio information-image training
The audio information similarity between different utterances spoken by the same target speaker is maximized, while the similarity between different target speakers is minimized. Through training on the audio information of the speech, the mapping relation between the audio information and the corresponding head and upper body motion is obtained. The subtle correlated motion of the head and upper body is also a key factor in generating a realistic talking target person.
S321: The voice data is read in python with the librosa library.
S322: The Mel spectrogram feature matrix Audio_mfcc of the voice data is extracted with the python_speech_features package.
S323: The Mel spectrogram feature matrix Audio_mfcc of the voice data is input into the LSTM-based voice feature extraction network to obtain the voice feature matrix A_t ∈ R^{M×D}, where M is the total number of input speech frames and D is the dimension of the voice feature matrix.
A_t = LSTM(Audio_mfcc)    (11)
S324: For each image frame, the voice feature matrices within the corresponding window of a preset number of speech frames (set to 18 frames in this embodiment) are input into the AutoVC audio encoder E_s to obtain the audio matrix s_t:
s_t = AutoVC.E_s(A_t; w_lstm,S)    (12)
where w_lstm,S are the learnable parameters of the AutoVC.E_s audio encoder.
S325: A single-layer MLP is used to reduce the dimension of the audio matrix s_t from 256 down to 128, giving the reduced audio matrix s̃_t; this improves the generalization ability of the facial video, particularly for target persons not observed during training.
s̃_t = MLP(s_t)    (13)
S326: Producing consistent head and upper body motion requires capturing longer temporal correlations than voice content motion. Speech audio information typically lasts tens of milliseconds, whereas head motion (e.g., the head swinging from left to right) and upper body motion (e.g., breathing) may last one or more seconds, or even orders of magnitude longer. To capture this long-term, structured dependency, the output is computed with a self-attention network: the voice content matrix and the audio matrix are input into the self-attention encoder to obtain the self-reconstructed audio movement matrix h_t, where w_attn,s denotes the trainable parameters of the self-attention network (self-attention encoder).
h_t = Attn_s(c_t, s̃_t; w_attn,s)    (14)
S327: The transformed voice content matrix is concatenated with the audio matrix and the two sets of initial coordinates. The weight assigned to each frame is computed by a compatibility function that compares the representations of all frames in the window. In all experiments, the window size is set to τ = 256 frames (4 seconds). The self-reconstructed audio movement matrix h_t, the initial head key point coordinate matrix q_head and the initial upper body key point coordinate matrix q_body are input into the second multilayer perceptron MLP_s to generate the target-speaker-aware coordinate displacement: MLP_s predicts the final overall displacement Δp_t of the head key point coordinate matrix and upper body key point coordinate matrix for each frame of image, where w_mlp,S denotes the learnable parameters of the second multilayer perceptron MLP_s (the decoder).
Δp_t = MLP_s(h_t, q_head, q_body; w_mlp,S)    (15)
S328: Based on the head position prediction coordinates P_t of each frame of image and the overall displacement Δp_t of the head key point coordinate matrix and upper body key point coordinate matrix of each frame of image, the overall predicted coordinates y_t of each frame of image are obtained:
y_t = P_t + Δp_t    (16)
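A sketch of the speaker-aware branch of formulas (14)-(16) under stated assumptions: the self-attention is realized with PyTorch's nn.MultiheadAttention over the τ-frame window, and the feature dimensions (256-dimensional content, 128-dimensional reduced audio), the single hidden layer of MLP_s and the way q_head and q_body are flattened and concatenated are all illustrative.

```python
import torch
import torch.nn as nn

class SpeakerAwareHead(nn.Module):
    """Fuses the voice content matrix c_t and reduced audio matrix s_t with
    self-attention (formula (14)) and predicts the overall displacement
    delta_p_t of the head and upper body key points (formula (15))."""
    def __init__(self, content_dim=256, audio_dim=128,
                 n_head_kp=68, n_body_kp=128, heads=4):
        super().__init__()
        d_model = content_dim + audio_dim
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        kp_dim = (n_head_kp + n_body_kp) * 3
        self.mlp_s = nn.Sequential(
            nn.Linear(d_model + kp_dim, 512), nn.ReLU(),
            nn.Linear(512, kp_dim),
        )

    def forward(self, c, s, q_head, q_body):
        # c: (B, tau, content_dim), s: (B, tau, audio_dim) feature windows.
        x = torch.cat([c, s], dim=-1)
        h, _ = self.attn(x, x, x)                            # h_t, formula (14)
        q0 = torch.cat([q_head, q_body], dim=1).flatten(1)   # (B, kp_dim)
        q0 = q0.unsqueeze(1).expand(-1, h.size(1), -1)
        return self.mlp_s(torch.cat([h, q0], dim=-1))        # delta_p_t, (15)

# Formula (16): y_t = P_t + delta_p_t gives the overall predicted coordinates.
```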
In implementation, when pre-training each network model, in addition to capturing the key point positions, the head motion and upper body motion of the target speaker need to be matched. To this end, a discriminator network is created whose goal is to judge whether the temporal dynamics of the speaker's facial coordinates look "real" or fake. It takes the generated overall predicted coordinate sequence within the same window used by the generator, together with the voice content matrix and the audio matrix, as input, and returns a representation r_t:
r_t = Attn_d(y_{t−τ:t}, c_t, s̃_t; w_attn,d)    (17)
During training, the LSGAN loss function L_gan is used to train the discriminator parameters w_attn,d; for each frame, the training (ground-truth) coordinates are regarded as "real" and the generated coordinates as "fake":
L_gan = Σ_{t=1}^{T} ( (1 − r̂_t)^2 + r_t^2 )    (18)
where r̂_t denotes the discriminator output when the training coordinates are used as its input, and r_t denotes the discriminator output for the generated coordinates.
To train the parameters w_attn,s and maximize the "realism" of the output, a minimized loss function L_s is set that accounts for the training distance in both the absolute positions and the graph Laplacian coordinates:
L_s = Σ_{t=1}^{T} Σ_{i=1}^{N} ( ||y_{i,t} − ŷ_{i,t}||_2 + λ_s·||L(y_{i,t}) − L(ŷ_{i,t})||_2 ) + μ_s·Σ_{t=1}^{T} (1 − r_t)^2    (19)
where λ_s = 1 and μ_s = 0.001 are set on a held-out validation set, y_{i,t} denotes the predicted coordinate position of the i-th head or upper body key point in the t-th frame image, ŷ_{i,t} denotes the actual coordinate position of the i-th head or upper body key point in the t-th frame image, and L(y_{i,t}) and L(ŷ_{i,t}) denote the graph Laplacian coordinates of y_{i,t} and ŷ_{i,t} respectively. Training alternates between the generator and the discriminator so that they improve each other, as in the usual GAN approach.
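A sketch of the loss terms of formulas (18) and (19), assuming the discriminator emits one scalar realism score per frame; the reduction (sum versus mean) and the tensor shapes are illustrative choices.

```python
import torch

def lsgan_discriminator_loss(r_real, r_fake):
    """Formula (18): real (training) coordinates labelled 1, generated
    coordinates labelled 0; r_real, r_fake are per-frame discriminator scores."""
    return ((1.0 - r_real) ** 2).sum() + (r_fake ** 2).sum()

def speaker_aware_loss(y_pred, y_true, L_pred, L_true, r_fake,
                       lambda_s=1.0, mu_s=0.001):
    """Formula (19): absolute-position distance plus graph-Laplacian distance,
    plus a realism term pushing the discriminator score of generated
    coordinates towards 1."""
    pos = torch.norm(y_pred - y_true, dim=-1).sum()
    lap = torch.norm(L_pred - L_true, dim=-1).sum()
    real = ((1.0 - r_fake) ** 2).sum()
    return pos + lambda_s * lap + mu_s * real
```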
S4: A video image frame sequence is generated based on the multi-dimensional mapping relation. Specifically:
The initial head key point coordinate matrix q_head, the head position prediction coordinates P_t of each frame of image and the overall predicted coordinates y_t of each frame of image are input into the video frame reconstruction model G and fused to obtain the reconstructed video frame sequence Q_trg:
Q_trg = G(q_head, P_t, y_t)    (20)
The minimized loss function L_a adopted in training the video frame reconstruction model is as follows:
L_a = ||Q_trg − Q̂_trg||_1 + λ_a·||φ(Q_trg) − φ(Q̂_trg)||_1    (21)
where src denotes the source video frame and trg the target video frame, Q_trg denotes the reconstructed video frame, Q̂_trg denotes the real video frame, λ_a denotes a weight coefficient with λ_a = 1, φ(Q_trg) and φ(Q̂_trg) denote the features of the reconstructed and real video frames extracted by a pre-trained ResNet-34 network, and ||·||_1 denotes the L1 norm.
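A sketch of the reconstruction loss of formula (21), using torchvision's pre-trained ResNet-34 (truncated before its classification layer) as the feature extractor φ; the truncation point and the frozen-weights choice are assumptions.

```python
import torch.nn as nn
from torchvision import models

class ReconstructionLoss(nn.Module):
    """L_a = ||Q_trg - Q_hat||_1 + lambda_a * ||phi(Q_trg) - phi(Q_hat)||_1,
    with phi given by a frozen, pre-trained ResNet-34 (formula (21))."""
    def __init__(self, lambda_a=1.0):
        super().__init__()
        resnet = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
        self.phi = nn.Sequential(*list(resnet.children())[:-1]).eval()
        for p in self.phi.parameters():
            p.requires_grad_(False)
        self.lambda_a = lambda_a
        self.l1 = nn.L1Loss()

    def forward(self, q_trg, q_real):
        # q_trg: reconstructed frames, q_real: real frames, both (B, 3, H, W).
        pixel = self.l1(q_trg, q_real)
        perceptual = self.l1(self.phi(q_trg), self.phi(q_real))
        return pixel + self.lambda_a * perceptual
```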
The video frame reconstruction model is an encoder-decoder network that performs image-to-image translation to produce the video frame images. The encoder uses 6 convolutional layers, each containing a stride-2 convolution followed by two residual blocks, and creates a bottleneck that is then decoded by a symmetric upsampling decoder. Skip connections between the symmetric layers of the encoder and decoder are used when generating each frame image. Since the coordinates vary smoothly over time, the output images formed by interpolating these coordinates exhibit temporal coherence.
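A compact sketch of the encoder-decoder just described: 6 encoder stages, each a stride-2 convolution followed by two residual blocks, a symmetric upsampling decoder, and skip connections between symmetric layers. The channel counts and the idea of feeding the key point coordinates as extra rendered input channels are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class FrameReconstructor(nn.Module):
    """Encoder-decoder with skip connections for video frame reconstruction."""
    def __init__(self, in_ch=6, base=32):
        super().__init__()
        chs = [base * 2 ** i for i in range(6)]          # 6 encoder stages
        self.enc = nn.ModuleList()
        c_prev = in_ch
        for c in chs:                                    # stride-2 conv + 2 res blocks
            self.enc.append(nn.Sequential(
                nn.Conv2d(c_prev, c, 4, stride=2, padding=1), nn.ReLU(),
                ResBlock(c), ResBlock(c)))
            c_prev = c
        # Symmetric upsampling decoder; each stage doubles the resolution and
        # concatenates the matching encoder feature (skip connection).
        self.dec = nn.ModuleList()
        for i, c in enumerate(reversed(chs[:-1])):
            in_c = c_prev if i == 0 else c_prev * 2
            self.dec.append(nn.Sequential(
                nn.ConvTranspose2d(in_c, c, 4, stride=2, padding=1), nn.ReLU()))
            c_prev = c
        self.out = nn.ConvTranspose2d(c_prev * 2, 3, 4, stride=2, padding=1)

    def forward(self, x):
        feats = []
        for stage in self.enc:                           # encode to the bottleneck
            x = stage(x)
            feats.append(x)
        for i, stage in enumerate(self.dec):             # decode with skips
            x = stage(x) if i == 0 else stage(torch.cat([x, feats[-1 - i]], dim=1))
        return torch.sigmoid(self.out(torch.cat([x, feats[0]], dim=1)))
```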
S5: The video image frame sequence is spliced with the voice data to obtain the target person speech audio/video. Specifically:
The video image frame sequence and the voice data are spliced with ffmpeg to obtain the target person speech video; in this embodiment, the frame rate of the target person speech audio/video is 29 frames per second.
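A sketch of the final muxing step with the ffmpeg command-line tool, assuming the reconstructed frames have been written out as numbered PNG files and the driving speech is a WAV file (paths and codec choices are illustrative).

```python
import subprocess

def mux_video(frames_pattern="frames/%05d.png", audio_path="speech.wav",
              out_path="target_person.mp4", fps=29):
    """Splice the generated image frame sequence with the voice data."""
    cmd = [
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frames_pattern,   # image sequence input
        "-i", audio_path,                               # speech audio input
        "-c:v", "libx264", "-pix_fmt", "yuv420p",       # widely playable video
        "-c:a", "aac",
        "-shortest",                                    # stop at shorter stream
        out_path,
    ]
    subprocess.run(cmd, check=True)
```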
Example 2
The embodiment provides a voice-driven target person video generation device, which comprises:
the data acquisition module is used for acquiring voice data and a front image of the person's upper body;
the key point extraction module is used for extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired upper body front image;
the voice data separation module is used for separating content information and audio information based on the acquired voice data;
the voice image mapping module is used for training a multi-dimensional mapping relation among the content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
the video frame generation module is used for generating a video image frame sequence based on a multi-dimensional mapping relation;
and the video generation module is used for splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
Example 3
The present embodiment provides a computer-readable storage medium storing a computer program executed by a processor to implement the voice-driven target person video generation method as described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A voice-driven target person video generation method is characterized by comprising the following steps:
acquiring voice data and a front image of the person's upper body;
extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired front image of the upper body of the person;
separating voice content information and audio information based on the acquired voice data;
training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
generating a video image frame sequence based on a multi-dimensional mapping relation;
and splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
2. The method for generating video of a voice-driven target person according to claim 1, wherein the extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired image of the front of the upper body of the person includes:
dividing the person's upper body front image into a head image and an upper body image;
extracting an initial head key point coordinate matrix from the head image by using a head key point extraction model;
and extracting an initial upper body key point coordinate matrix in the upper body image by adopting an upper body key point extraction model.
3. The voice-driven target person video generation method according to claim 1, wherein the separating of the voice content information and the audio information based on the acquired voice data includes:
extracting a Mel spectrogram feature matrix of the voice data;
inputting the Mel spectrogram feature matrix into an LSTM-based voice feature extraction network to obtain a voice feature matrix;
for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into a voice content encoder to obtain a voice content matrix;
and for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into an audio encoder to obtain an audio matrix.
4. The method for generating video of a voice-driven target person according to claim 3, wherein the training of the multidimensional mapping relationship between the voice content information and the audio information and the coordinates of the head key point and the coordinates of the upper body key point based on the voice content information, the audio information, the coordinate matrix of the head key point and the coordinate matrix of the upper body key point specifically comprises:
inputting the voice content matrix and the initial head key point coordinate matrix into a first multilayer perceptron, and predicting the displacement of the head key point coordinate matrix in each frame of image;
obtaining a head position prediction coordinate of each frame of image based on the initial head key point coordinate matrix and the displacement of the head key point coordinate matrix in each frame of image;
fusing the voice content matrix and the audio matrix based on a self-attention network to obtain a self-reconstructed audio movement matrix;
inputting the self-reconstructed audio movement matrix, the initial head key point coordinate matrix and the initial upper body key point coordinate matrix into a second multilayer perceptron, and predicting the overall displacement of the head key point coordinate matrix and the upper body key point coordinate matrix for each frame of image;
and obtaining the overall predicted coordinate of each frame of image based on the head position predicted coordinate of each frame of image, the head key point coordinate matrix of each frame of image and the overall displacement of the upper body key point coordinate matrix.
5. The method according to claim 4, wherein the voice content encoder and the first multilayer perceptron used in training the mapping relation between the voice content matrix and the head position prediction coordinates are obtained in advance by joint training on collected video data, and the minimized loss function L_c adopted during training is as follows:
L_c = Σ_{t=1}^{T} Σ_{i=1}^{N} ( ||P_{i,t} − P̂_{i,t}||_2 + λ_c·||L(P_{i,t}) − L(P̂_{i,t})||_2 )
where P_{i,t} denotes the predicted coordinate position of the i-th head key point in the t-th frame image, P̂_{i,t} denotes the actual coordinate position of the i-th head key point in the t-th frame image, λ_c denotes a weight coefficient, L(P_{i,t}) and L(P̂_{i,t}) denote the graph Laplacian coordinates of P_{i,t} and P̂_{i,t} respectively, N denotes the total number of head key points, T denotes the total number of image frames, and ||·||_2 denotes the L2 norm.
6. The method as claimed in claim 4, wherein the generating a video image frame sequence based on multi-dimensional mapping comprises:
and inputting the initial head key point coordinate matrix, the head position prediction coordinate of each frame of image and the overall prediction coordinate of each frame of image into a video frame reconstruction model, and fusing to obtain a reconstructed video frame sequence.
7. The method according to claim 6, wherein the minimized loss function L_a adopted in training the video frame reconstruction model is as follows:
L_a = ||Q_trg − Q̂_trg||_1 + λ_a·||φ(Q_trg) − φ(Q̂_trg)||_1
where src denotes the source video frame and trg the target video frame, Q_trg denotes the reconstructed video frame, Q̂_trg denotes the real video frame, λ_a denotes a weight coefficient with λ_a = 1, φ(Q_trg) and φ(Q̂_trg) denote the features of the reconstructed and real video frames extracted by a pre-trained ResNet-34 network, and ||·||_1 denotes the L1 norm.
8. The method for generating the video of the target person by voice driving according to claim 4, wherein the splicing of the video image frame sequence with the voice data to obtain the target person speech audio/video comprises:
splicing the video image frame sequence with the voice data using ffmpeg to obtain the target person speech video, wherein the number of frames per second of the target person speech audio/video is a preset value.
9. A voice-driven target person video generating device, comprising:
the data acquisition module is used for acquiring voice data and a front image of the person's upper body;
the key point extraction module is used for extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired upper body front image;
the voice data separation module is used for separating content information and audio information based on the acquired voice data;
the voice image mapping module is used for training a multi-dimensional mapping relation among the content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
the video frame generation module is used for generating a video image frame sequence based on a multi-dimensional mapping relation;
and the video generation module is used for splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the voice-driven target person video generation method according to any one of claims 1 to 8.
CN202111466434.7A 2021-11-30 2021-11-30 Voice-driven target person video generation method and device and storage medium Pending CN114202604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111466434.7A CN114202604A (en) 2021-11-30 2021-11-30 Voice-driven target person video generation method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111466434.7A CN114202604A (en) 2021-11-30 2021-11-30 Voice-driven target person video generation method and device and storage medium

Publications (1)

Publication Number Publication Date
CN114202604A true CN114202604A (en) 2022-03-18

Family

ID=80650381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111466434.7A Pending CN114202604A (en) 2021-11-30 2021-11-30 Voice-driven target person video generation method and device and storage medium

Country Status (1)

Country Link
CN (1) CN114202604A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023241298A1 (en) * 2022-06-16 2023-12-21 虹软科技股份有限公司 Video generation method and apparatus, storage medium and electronic device
CN115100329A (en) * 2022-06-27 2022-09-23 太原理工大学 Multi-mode driving-based emotion controllable facial animation generation method
WO2024078243A1 (en) * 2022-10-13 2024-04-18 腾讯科技(深圳)有限公司 Training method and apparatus for video generation model, and storage medium and computer device
CN115996303A (en) * 2023-03-23 2023-04-21 科大讯飞股份有限公司 Video generation method, device, electronic equipment and storage medium
CN115996303B (en) * 2023-03-23 2023-07-25 科大讯飞股份有限公司 Video generation method, device, electronic equipment and storage medium
CN116342835A (en) * 2023-03-31 2023-06-27 华院计算技术(上海)股份有限公司 Face three-dimensional surface grid generation method, device, computing equipment and storage medium
CN117478818A (en) * 2023-12-26 2024-01-30 荣耀终端有限公司 Voice communication method, terminal and storage medium

Similar Documents

Publication Publication Date Title
CN114202604A (en) Voice-driven target person video generation method and device and storage medium
CN113192161B (en) Virtual human image video generation method, system, device and storage medium
CN110874557B (en) Voice-driven virtual face video generation method and device
Cudeiro et al. Capture, learning, and synthesis of 3D speaking styles
US11776188B2 (en) Style-aware audio-driven talking head animation from a single image
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
US6735566B1 (en) Generating realistic facial animation from speech
Sargin et al. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation
US20100057455A1 (en) Method and System for 3D Lip-Synch Generation with Data-Faithful Machine Learning
CN115100329B (en) Multi-mode driving-based emotion controllable facial animation generation method
US7257538B2 (en) Generating animation from visual and audio input
EP4010899A1 (en) Audio-driven speech animation using recurrent neutral network
Gururani et al. Space: Speech-driven portrait animation with controllable expression
Potamianos et al. Joint audio-visual speech processing for recognition and enhancement
Liu et al. Synthesizing talking faces from text and audio: an autoencoder and sequence-to-sequence convolutional neural network
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
Lavagetto Time-delay neural networks for estimating lip movements from speech analysis: A useful tool in audio-video synchronization
CN116828129B (en) Ultra-clear 2D digital person generation method and system
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
Wang et al. CA-Wav2Lip: Coordinate Attention-based Speech To Lip Synthesis In The Wild
Asadiabadi et al. Multimodal speech driven facial shape animation using deep neural networks
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
Liu et al. Real-time speech-driven animation of expressive talking faces
Caplier et al. Image and video for hearing impaired people
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination