CN114663539A - 2D face restoration technology under mask based on audio drive - Google Patents
- Publication number: CN114663539A
- Application number: CN202210232796.8A
- Authority
- CN
- China
- Prior art keywords
- audio
- training
- network
- emotion
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T11/00—2D [Two Dimensional] image generation
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for estimating an emotional state
Abstract
The invention provides an audio-driven technology for restoring, in 2D, the face occluded by a mask, so that the complete face of a speaker can be realistically recovered. An audio emotion decoupler effectively separates the content and emotion information contained in the audio, avoiding interference between the two and enhancing the realism and naturalness of the generated facial expressions and mouth shapes. Through 2D modeling, the geometric features, facial accessories and skin color of the face are extracted and the three-dimensional Euler angles of the head are estimated, which sidesteps the difficulty and instability of 3D modeling while preserving the quality of the generated talking face.
Description
Technical Field
The invention belongs to the field of face generation, and particularly relates to an audio-driven technology for restoring, in 2D, a face occluded by a mask.
Background
With the spread of the COVID-19 pandemic around the world, wearing a mask when going out has become the norm, especially in public places. Although masks ensure safety, communication is greatly limited when a large part of the face is occluded. According to the McGurk effect, every language people master depends to some extent on visual information for speech perception, and the information conveyed by the other party's lip movements and facial expressions is likewise important during communication.
In the modern era of abundant computing power and advanced technology, many complex tasks once thought impossible have been accomplished. In the field of face generation, a number of successful models and neural network architectures can already generate realistic human faces efficiently. As a popular research direction in recent years, face generation has attracted the attention of many researchers and shows a promising trend of development.
Audio-driven face generation must reproduce not only the true appearance of the target but also the expressions and facial movements made while speaking, and it faces many challenges in practical applications, such as sound source stability, environmental complexity, the realism of the generated face, and the continuity of consecutive frames.
Disclosure of Invention
To solve these problems, the invention discloses an audio-driven 2D technology for restoring the face under a mask, which realistically recovers the complete face of a speaker. An audio emotion decoupler effectively separates the content and emotion information contained in the audio, avoiding interference between the two and enhancing the realism and naturalness of the generated face. Through 2D modeling, the geometric features, facial accessories and skin color of the face are extracted and the three-dimensional Euler angles of the head are estimated, which sidesteps the difficulty and instability of 3D modeling while preserving the quality of the generated talking face.
In order to achieve the purpose, the technical scheme of the invention is as follows:
An audio-driven 2D under-mask face restoration method comprises the following steps:
Step 1: obtain image information of a training video, audio information synchronized with the training video, and a source identity image of the target object;
Step 2: train an audio emotion decoupler by a cross-reconstruction method on the basis of the audio information;
Step 3: generate an identity code from the source identity image through a feature extraction network;
Step 4: perform feature extraction on the image information to obtain the head pose code of each frame;
Step 5: perform feature extraction on the audio information with the audio emotion decoupler to obtain the emotion code of each frame;
Step 6: construct an adversarial generation network used to render images, and train it on the identity code, emotion code and head pose code;
Step 7: obtain the image and audio information of a target video in which the target object wears a mask, repeat steps 3, 4 and 5, use the resulting identity, emotion and head pose codes as conditioning information, and render images with the adversarial generation network to generate the target video.
The step 1 comprises the following steps:
Step 1-1: the training video is a single-person, front-facing, unoccluded talking video. The video is in color; the speaking duration is not limited, though 3-5 minutes is optimal; the resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation.
Step 1-2: the source identity image of the target object is a single color picture with a resolution of 720p or 1080p, in which the person faces the camera directly, the face is unoccluded, and the lighting is good.
The step 2 comprises the following steps:
step 2-1, for the audio information, using a Mel frequency cepstrum coefficient as an audio representation to obtain an audio vector;
Step 2-2: stretch or compress the audio vectors with a dynamic time warping algorithm to obtain equal-length audio training pairs {x_{i,m}, x_{i,n}} with the same content but different emotions, where x_{i,m}, x_{i,n} ∈ X, i denotes the shared content, and m and n denote two different emotions;
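The alignment in step 2-2 can be sketched in numpy as below. This is a minimal dynamic time warping implementation that warps two 1-D feature sequences onto a common length along the optimal alignment path; the function names and the 1-D simplification (real audio vectors would be frame-wise MFCC features) are assumptions for illustration, not the patent's implementation.

```python
import numpy as np

def dtw_path(a, b):
    """Dynamic time warping between two 1-D feature sequences.
    Returns the optimal alignment path as (index_a, index_b) pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_to_common_length(a, b):
    """Stretch/compress both sequences along the DTW path so they share one length,
    yielding an aligned training pair as in step 2-2."""
    path = dtw_path(a, b)
    ia = [p[0] for p in path]
    ib = [p[1] for p in path]
    return a[ia], b[ib]
```

After warping, the two clips have identical length and frame-to-frame content correspondence, which is what makes the cross-reconstruction losses of step 2-3 well defined.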
Step 2-3: train the audio emotion decoupler by the cross-reconstruction method; denote the trained content encoder by E_c, the emotion encoder by E_e, and the decoder by D.
Step 2-1 comprises:
Step 2-1-1: resample the original video audio to a fixed sampling frequency;
Step 2-1-2: compute the frequency-domain features of the resampled audio and represent them as Mel cepstral coefficients.
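Steps 2-1-1 and 2-1-2 can be sketched as follows: a self-contained numpy/scipy pipeline that resamples audio to a fixed rate and computes MFCC frames. The specific parameter values (16 kHz target rate, 512-point FFT, 26 mel filters, 13 coefficients) are common defaults assumed for illustration; the patent does not specify them.

```python
import numpy as np
from scipy.signal import resample_poly
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(audio, orig_sr, target_sr=16000, n_fft=512, hop=160, n_filters=26, n_ceps=13):
    """Step 2-1: resample to a fixed rate, then compute one MFCC vector per frame."""
    audio = resample_poly(audio, target_sr, orig_sr)  # fixed sampling frequency
    n_frames = 1 + max(0, len(audio) - n_fft) // hop
    window = np.hamming(n_fft)
    fb = mel_filterbank(n_filters, n_fft, target_sr)
    feats = []
    for t in range(n_frames):
        frame = audio[t * hop: t * hop + n_fft] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        mel_energy = np.log(fb @ power + 1e-10)
        feats.append(dct(mel_energy, norm='ortho')[:n_ceps])
    return np.array(feats)  # shape (n_frames, n_ceps)
```

The resulting matrix of per-frame coefficients is the "audio vector" representation that steps 2-2 and 2-3 operate on.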
The step 2-3 comprises the following steps:
Step 2-3-1: the codec cross-reconstruction loss, as in equation (1). The content encoder extracts the content code of x_{i,m} and the emotion encoder extracts the emotion code of x_{j,n}; the decoder combines them into a reconstructed audio x'_{i,n}, the loss against x_{i,n} is computed and backpropagated to train the codec network, which enforces the independence of the emotion code and the content code.
L_cross = ||D(E_c(x_{i,m}), E_e(x_{j,n})) - x_{i,n}||_2 + ||D(E_c(x_{j,n}), E_e(x_{i,m})) - x_{j,m}||_2    (1)
Step 2-3-2, the codec reconstructs the loss itself, as in equation (2). And combining emotion encoding and content encoding of the same section of audio, calculating loss according to the new audio and the original audio obtained after decoding by a decoder, and reversely propagating and training the encoding and decoding network to ensure the integrity of encoding.
Lself=||D(Ec(xi,m),Ee(xi,m))-xi,m||2+||D(Ec(xj,n),Ee(xj,n))-xj,n||2 (2)
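Equations (1) and (2) can be sketched numerically as follows. The encoders E_c, E_e and decoder D are stand-in random linear maps here, purely so the two loss terms are computable; the real decoupler uses trained neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODE = 20, 8

# Stand-in linear encoders/decoder (assumption for illustration only).
Wc = rng.normal(size=(CODE, DIM))      # content encoder E_c
We = rng.normal(size=(CODE, DIM))      # emotion encoder E_e
Wd = rng.normal(size=(DIM, 2 * CODE))  # decoder D

def E_c(x): return Wc @ x
def E_e(x): return We @ x
def D(content, emotion): return Wd @ np.concatenate([content, emotion])

def cross_reconstruction_loss(x_im, x_jn, x_in, x_jm):
    """Equation (1): swap emotion codes between two clips with contents i, j
    and emotions m, n; the decoder should reproduce x_{i,n} and x_{j,m}."""
    t1 = np.linalg.norm(D(E_c(x_im), E_e(x_jn)) - x_in)
    t2 = np.linalg.norm(D(E_c(x_jn), E_e(x_im)) - x_jm)
    return t1 + t2

def self_reconstruction_loss(x_im, x_jn):
    """Equation (2): content and emotion codes of the same clip must
    decode back to that clip."""
    t1 = np.linalg.norm(D(E_c(x_im), E_e(x_im)) - x_im)
    t2 = np.linalg.norm(D(E_c(x_jn), E_e(x_jn)) - x_jn)
    return t1 + t2
```

During training both losses would be minimized jointly: the cross term forces the two codes to carry disjoint information, while the self term prevents either code from discarding information needed to reconstruct the audio.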
Step 3 outputs K three-dimensional points forming a point cloud that encodes the geometric features, facial accessories and skin color of the face.
Step 4-1: perform face key point detection on each frame, obtaining the two-dimensional key points of the eye region that remain visible under the mask. The key point detection network consists of convolutional layers, bottleneck layers, further convolutional layers, and fully-connected layers.
Step 4-2: feed the obtained two-dimensional face key points into a pose estimation network to estimate the three-dimensional Euler angles of the face, obtaining a 12-dimensional head pose code comprising a 9-dimensional rotation code and a 3-dimensional translation code. The pose estimation network consists of convolutional layers and fully-connected layers.
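A plausible reading of the 12-dimensional pose code (9 rotation entries + 3 translation entries) is a flattened 3x3 rotation matrix concatenated with a translation vector. The sketch below builds such a code from Euler angles; the Z-Y-X rotation order is an assumption, since the patent does not state the convention.

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """3-D Euler angles (radians) to a 3x3 rotation matrix, R = Rz @ Ry @ Rx."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

def head_pose_code(yaw, pitch, roll, translation):
    """12-D pose code: 9 rotation-matrix entries (flattened) + 3 translation entries."""
    R = euler_to_rotation(yaw, pitch, roll)
    return np.concatenate([R.ravel(), np.asarray(translation, dtype=float)])
```

Flattening the rotation matrix instead of keeping raw Euler angles avoids angle-wrapping discontinuities when the code is fed to the generation network.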
Step 5: feed the audio information into the trained audio emotion decoupler to obtain the emotion codes.
The step 6 comprises the following steps:
Step 6-1: self-evaluation of the adversarial generation network. Because the network adopts a progressively growing structure, the generator output and the training image are each turned into a series of downsampled images by Gaussian smoothing and subsampling, forming a Gaussian pyramid; a loss is computed at each level n of the pyramid and backpropagated to train the generator at the corresponding resolution,
where m denotes the number of samples, Disc is the discriminator, img_generated is the image produced by the generator, and img_real is the image information of the training video;
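The pyramid construction of step 6-1 can be sketched as below. The per-level loss here is a plain L1 distance between matched levels, used only to make the sketch self-contained; the patent's actual per-level term involves the discriminator Disc, whose formula is not reproduced in this text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(img, levels):
    """Repeated Gaussian smoothing + 2x subsampling, as in step 6-1."""
    pyr = [img]
    for _ in range(levels - 1):
        img = gaussian_filter(img, sigma=1.0)[::2, ::2]
        pyr.append(img)
    return pyr

def pyramid_loss(generated, real, levels=4):
    """Sum of mean L1 distances over matching pyramid levels (illustrative
    stand-in for the per-resolution loss of the progressive GAN)."""
    pg = gaussian_pyramid(generated, levels)
    pr = gaussian_pyramid(real, levels)
    return sum(np.mean(np.abs(g - r)) for g, r in zip(pg, pr))
```

Computing a loss at every resolution is what lets the progressively growing generator receive a training signal at each stage of its growth.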
Step 6-2: self-evaluation of the feature extraction network.
Step 6-2 comprises:
Step 6-2-1: feature evaluation of the feature extraction network. The feature extraction loss is computed from the pixels at the same positions in the generated image and the training video image and backpropagated to train the feature extraction network, with the loss function given in equation (4),
where H is the height of the input training video image, W is its width, Gen denotes the generated image, and Real the input training video image.
Step 6-2-2: point cloud distribution evaluation of the feature extraction network. To avoid information redundancy in the three-dimensional point cloud, a distance loss is computed from the Euclidean distances between the K points of the point cloud U and backpropagated to train the feature extraction network, with the loss function given in equation (5).
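One way the anti-redundancy term of step 6-2-2 could look is a hinge-style spread loss that penalizes point pairs closer than some margin, so the K points of the cloud U do not collapse onto each other. The hinge form and the margin value are assumptions; equation (5) itself is not reproduced in this text.

```python
import numpy as np

def pointcloud_spread_loss(U, margin=0.1):
    """Penalize point pairs of the cloud U (shape (K, 3)) that lie closer
    than `margin`, discouraging redundant, overlapping points."""
    K = U.shape[0]
    diff = U[:, None, :] - U[None, :, :]         # (K, K, 3) pairwise differences
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)  # pairwise Euclidean distances
    mask = ~np.eye(K, dtype=bool)                # ignore self-distances
    return np.maximum(margin - dist[mask], 0.0).mean()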
Step 6-3: pose evaluation of the adversarial generation network. The Euclidean distance between the head pose codes of the generated face and of the face in the corresponding training video frame is computed and backpropagated as the pose loss to train the generation network;
Step 6-4: emotion evaluation of the adversarial generation network. Two-dimensional key points are extracted from the generated face and from the face in the corresponding training video frame, yielding the key point sets P_gen and P_real; the emotion loss is computed from P_gen and P_real as in equation (6) and backpropagated to train the generation network;
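Steps 6-3 and 6-4 can be sketched together as below. The mean-per-point form of the key point loss is an assumed instantiation of equation (6), which is not reproduced in this text; the pose loss follows the Euclidean-distance description of step 6-3 directly.

```python
import numpy as np

def pose_loss(pose_gen, pose_real):
    """Step 6-3: Euclidean distance between two 12-D head pose codes."""
    return np.linalg.norm(pose_gen - pose_real)

def emotion_keypoint_loss(P_gen, P_real):
    """Step 6-4 (assumed form): mean Euclidean distance between matched 2-D
    key points of the generated face and the training-video face."""
    return np.mean(np.linalg.norm(P_gen - P_real, axis=1))
```

Both terms are simple distances, so they backpropagate directly through the generator's pose and expression pathways without needing a discriminator.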
the step 7 comprises the following steps:
Step 7-1: the acquired condition video is a single-person talking video in which the person wears a mask. The video is in color; the speaking duration is not limited; the resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation.
Step 7-2: generate an identity code from the source identity image through the feature extraction network;
Step 7-3: perform feature extraction on the image information to obtain the head pose code of each frame;
Step 7-4: perform feature extraction on the audio information with the audio emotion decoupler to obtain the emotion code of each frame;
Step 7-5: use the obtained identity, emotion and head pose codes as conditioning information and render images with the adversarial generation network to generate the target image under the current view angle and audio conditions.
The invention has the beneficial effects that:
the 2D face restoration technology under the mask based on the audio drive is used for really restoring the complete face of a speaker. Through the audio decoupler, the content and the emotion information contained in the audio are effectively separated, the interference between the content and the emotion information is avoided, and the authenticity and the naturalness of the generated face are effectively enhanced. Through 2D modeling, the geometric features, the face decoration and the skin color features of the human face are extracted, the three-dimensional Euler angle of the head is estimated, and the difficulty and the instability of 3D modeling are effectively solved while the generation of the speaking human face effect is ensured.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a training method of an audio emotion decoupler;
FIG. 3 is a schematic diagram of a network for extracting eye key points and estimating head pose under a mask;
FIG. 4 is a schematic diagram of the adversarial generation network.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and the detailed description, which should be understood as illustrative only and not limiting in scope.
The application discloses a real-time audio-driven face generation method: given a talking-face video in which the speaker wears a mask, a high-quality audio-driven video of the restored face under the mask is generated using an audio emotion decoupler with an encoder-decoder structure, a masked-state head pose estimation network with a MobileNet regressor as its backbone, and a style-based generative network with an adversarial architecture.
Illustratively, a front-facing, well-lit, unoccluded color photograph of a person's face is given as the identity source image. The acquired target video is a single-person talking video in which the target object wears a mask; the video is in color, the resolution is 720P, 1080P, 2K or 4K, the frame rate is 30 frames/second, the audio bit rate is 128 kb/s, and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation. The feature extraction network acquires identity features from the source image, the audio emotion decoupler decouples emotion codes from the audio information of the target video, and the head pose network extracts head pose codes from the image information of the target video. Finally, a talking-face video is generated that matches the identity of the person in the photograph and the unoccluded facial expressions and head poses of the person in the target video.
As shown in fig. 1, the flow of the method of the invention comprises: train an audio emotion decoupler by the cross-reconstruction method; generate an identity code from the source identity image through the feature extraction network; acquire the image information of the masked target video and extract the head pose code of each frame with the head pose estimation network; acquire the audio information of the masked target video and extract the emotion code of each frame with the audio emotion decoupler; and feed the identity, emotion and head pose codes into the adversarial generation network to generate the talking-face video.
(1) Acquiring image information of a training video, audio information synchronized with the training video and a source identity image of a target object, wherein the requirements are as follows:
(11) The training video is a single-person, front-facing, unoccluded talking video. The video is in color; the speaking duration is not limited, though 3-5 minutes is optimal; the resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation.
(12) The source identity image of the target object is a single color picture with a resolution of 720p or 1080p, in which the person faces the camera directly, the face is unoccluded, and the lighting is good.
(2) On the basis of the audio information of the training video, the audio emotion decoupler is trained by the cross-reconstruction method as follows:
(21) for the audio information, using a Mel frequency cepstrum coefficient as an audio representation to obtain an audio vector;
(211) resampling the original video and audio to a fixed sampling frequency;
(212) Compute the frequency-domain features of the resampled audio and represent them as Mel cepstral coefficients.
(22) Stretch or compress the audio vectors with a dynamic time warping algorithm to obtain equal-length audio training pairs {x_{i,m}, x_{i,n}} with the same content but different emotions, where x_{i,m}, x_{i,n} ∈ X, i denotes the shared content, and m and n denote two different emotions;
(23) Train the audio emotion decoupler by the cross-reconstruction method; denote the trained content encoder by E_c, the emotion encoder by E_e, and the decoder by D.
(231) The codec cross-reconstruction loss, as in equation (1). The content encoder extracts the content code of x_{i,m} and the emotion encoder extracts the emotion code of x_{j,n}; the decoder combines them into a reconstructed audio x'_{i,n}, the loss against x_{i,n} is computed and backpropagated to train the codec network, which enforces the independence of the emotion code and the content code.
L_cross = ||D(E_c(x_{i,m}), E_e(x_{j,n})) - x_{i,n}||_2 + ||D(E_c(x_{j,n}), E_e(x_{i,m})) - x_{j,m}||_2    (1)
(232) The codec self-reconstruction loss, as in equation (2). The emotion code and content code of the same audio clip are combined; the loss between the audio decoded from them and the original audio is computed and backpropagated to train the codec network, which ensures the completeness of the codes.
L_self = ||D(E_c(x_{i,m}), E_e(x_{i,m})) - x_{i,m}||_2 + ||D(E_c(x_{j,n}), E_e(x_{j,n})) - x_{j,n}||_2    (2)
The stability of publicly available three-dimensional key point algorithms is difficult to guarantee, and their demands on equipment and computing power are relatively strict. To avoid these problems, modeling is performed in two dimensions, and the goal of generating a three-dimensional point cloud is reached through unsupervised learning. The corresponding feature extraction network is trained simultaneously with the adversarial generation network.
(3) Generating an identity code from a source identity image through a feature extraction network;
(4) To cope with the large-area occlusion caused by the mask, a masked-state head pose estimation network with a MobileNet regressor as its backbone is used to obtain the head pose code; the network structure is shown in fig. 3.
(41) Perform face key point detection on each frame, obtaining the two-dimensional key points of the eye region that remain visible under the mask. The key point detection network consists of convolutional layers, bottleneck layers, further convolutional layers, and fully-connected layers;
(42) Feed the two-dimensional face key points into a pose estimation network to estimate the three-dimensional Euler angles of the face, obtaining a 12-dimensional head pose code comprising a 9-dimensional rotation code and a 3-dimensional translation code. The pose estimation network consists of convolutional layers and fully-connected layers;
(5) and inputting audio information of a training video through the audio emotion decoupler obtained through training to obtain emotion codes.
(6) An adversarial generation network used to render images is constructed and trained on the identity code, emotion code and head pose code. As shown in fig. 4, A is the identity code, B is the head pose code, and C is the emotion code. The training process is as follows:
(61) Self-evaluation of the adversarial generation network. Because the network adopts a progressively growing structure, the generator output and the training image are each turned into a series of downsampled images by Gaussian smoothing and subsampling, forming a Gaussian pyramid; a loss is computed at each level n of the pyramid and backpropagated to train the generator at the corresponding resolution,
where m denotes the number of samples, Disc is the discriminator, img_generated is the image produced by the generator, and img_real is the image information of the training video;
(62) Self-evaluation of the feature extraction network.
(621) Feature evaluation of the feature extraction network. The feature extraction loss is computed from the pixels at the same positions in the generated image and the training video image and backpropagated to train the feature extraction network, with the loss function given in equation (4),
where H is the height of the input training video image, W is its width, Gen denotes the generated image, and Real the input training video image.
(622) Point cloud distribution evaluation of the feature extraction network. To avoid information redundancy in the three-dimensional point cloud, a distance loss is computed from the Euclidean distances between the K points of the point cloud U and backpropagated to train the feature extraction network, with the loss function given in equation (5).
(63) Pose evaluation of the adversarial generation network. The Euclidean distance between the head pose codes of the generated face and of the face in the corresponding training video frame is computed and backpropagated as the pose loss to train the generation network;
(64) Emotion evaluation of the adversarial generation network. Two-dimensional key points are extracted from the generated face and from the face in the corresponding training video frame, yielding the key point sets P_gen and P_real; the emotion loss is computed from P_gen and P_real as in equation (6) and backpropagated to train the generation network;
(7) Obtain the image and audio information of a target video in which the target object wears a mask, repeat steps 3, 4 and 5, use the resulting identity, emotion and head pose codes as conditioning information, and render images with the adversarial generation network to generate the target video:
(71) The acquired condition video is a single-person talking video in which the person wears a mask. The video is in color; the speaking duration is not limited; the resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation.
(72) Generate an identity code from the source identity image through the feature extraction network;
(73) Perform feature extraction on the image information to obtain the head pose code of each frame;
(74) Perform feature extraction on the audio information with the audio emotion decoupler to obtain the emotion code of each frame;
(75) Use the obtained identity, emotion and head pose codes as conditioning information and render images with the adversarial generation network to generate the target image under the current view angle and audio conditions.
It should be noted that the above description only illustrates the technical idea of the invention and does not thereby limit its scope of protection; it will be obvious to those skilled in the art that several modifications and improvements can be made without departing from the principle of the invention, and such modifications and improvements fall within the scope of protection of the claims of the invention.
Claims (5)
1. An audio-driven 2D under-mask face restoration method, characterized by comprising the following steps:
step 1, acquiring image information of a training video, audio information synchronous with the training video and a source identity image of a target object;
step 2, training an audio emotion decoupler by a cross reconstruction method on the basis of the audio information;
step 3, generating an identity code from the source identity image through a feature extraction network;
step 4, extracting the features of the image information to obtain the head posture code of each frame of image;
step 5, performing feature extraction on the audio information with the audio emotion decoupler to obtain the emotion code of each frame;
step 6, constructing an adversarial generation network used to render images; training the adversarial generation network on the identity code, emotion code and head pose code;
step 7, obtaining the image and audio information of a target video in which the target object wears a mask, repeating steps 3, 4 and 5, using the resulting identity, emotion and head pose codes as conditioning information, and rendering images with the adversarial generation network to generate the target video.
2. The audio-driven 2D under-mask face restoration method according to claim 1, characterized in that the audio emotion decoupler of step 2 is obtained by the following steps:
step 2-1, for the audio information, using Mel-frequency cepstral coefficients (MFCC) as the audio representation to obtain audio vectors;
step 2-2, stretching or shrinking the audio vectors by a dynamic time warping algorithm to obtain audio training pairs {x_{i,m}, x_{i,n}} of the same length and the same content but different emotions, where x_{i,m}, x_{i,n} ∈ X, i denotes the content, and m and n denote two different emotions;
step 2-3, training the audio emotion decoupler by the cross-reconstruction method, denoting the trained content encoder as E_c, the emotion encoder as E_e, and the decoder as D.
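The alignment of step 2-2 can be sketched with a plain dynamic-time-warping pass over two feature sequences. The implementation below is a generic textbook DTW in NumPy, not the patent's exact algorithm:

```python
import numpy as np

def dtw_align(a, b):
    """Align two feature sequences (frames x dims) by dynamic time
    warping and return equal-length warped copies of both (step 2-2)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the optimal warping path from the end.
    path, i, j = [], n, m
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        if i == 1:
            j -= 1
        elif j == 1:
            i -= 1
        else:
            step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
            if step == 0:
                i -= 1
                j -= 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
    path.append((0, 0))
    path.reverse()
    aligned_a = np.array([a[p] for p, _ in path])
    aligned_b = np.array([b[q] for _, q in path])
    return aligned_a, aligned_b

# Two clips of different lengths come out with a common length.
aligned_a, aligned_b = dtw_align(np.arange(5.0).reshape(-1, 1),
                                 np.arange(8.0).reshape(-1, 1))
print(len(aligned_a), len(aligned_b))
```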
3. The 2D face restoration technology under a mask based on audio driving according to claim 2, wherein training the audio emotion decoupler by the cross-reconstruction method in step 2-3 comprises the following steps:
step 2-3-1, codec cross-reconstruction loss, as in formula (1): combine the content encoding of x_{i,m} obtained by the content encoder with the emotion encoding of x_{j,n} obtained by the emotion encoder, decode them with the decoder into a reconstructed audio x'_{i,n}, calculate the loss against x_{i,n}, and back-propagate it to train the codec network, ensuring the independence of the emotion encoding and the content encoding;
L_cross = ||D(E_c(x_{i,m}), E_e(x_{j,n})) − x_{i,n}||² + ||D(E_c(x_{j,n}), E_e(x_{i,m})) − x_{j,m}||²   (1)
step 2-3-2, codec self-reconstruction loss, as in formula (2): combine the emotion encoding and the content encoding of the same audio clip, calculate the loss between the decoded audio and the original audio, and back-propagate it to train the codec network, ensuring the completeness of the encodings;
L_self = ||D(E_c(x_{i,m}), E_e(x_{i,m})) − x_{i,m}||² + ||D(E_c(x_{j,n}), E_e(x_{j,n})) − x_{j,n}||²   (2).
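Formulas (1) and (2) can be sketched with stand-in linear encoders and decoder; the point is the pairing of codes, not the network internals (all networks below are random dummies, not the patented architectures):

```python
import numpy as np

rng = np.random.default_rng(0)
Wc = rng.normal(size=(8, 32))   # dummy content encoder weights
We = rng.normal(size=(8, 32))   # dummy emotion encoder weights
Wd = rng.normal(size=(32, 16))  # dummy decoder weights

E_c = lambda x: Wc @ x                           # content encoder E_c
E_e = lambda x: We @ x                           # emotion encoder E_e
D = lambda c, e: Wd @ np.concatenate([c, e])     # decoder D

def cross_loss(x_im, x_jn, x_in, x_jm):
    """Formula (1): swapped code pairs must reconstruct the swapped targets."""
    t1 = np.sum((D(E_c(x_im), E_e(x_jn)) - x_in) ** 2)
    t2 = np.sum((D(E_c(x_jn), E_e(x_im)) - x_jm) ** 2)
    return t1 + t2

def self_loss(x_im, x_jn):
    """Formula (2): each clip's own codes must reconstruct the clip itself."""
    t1 = np.sum((D(E_c(x_im), E_e(x_im)) - x_im) ** 2)
    t2 = np.sum((D(E_c(x_jn), E_e(x_jn)) - x_jn) ** 2)
    return t1 + t2

x_im, x_jn, x_in, x_jm = (rng.normal(size=32) for _ in range(4))
loss_c = cross_loss(x_im, x_jn, x_in, x_jm)
loss_s = self_loss(x_im, x_jn)
```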
4. The 2D face restoration technology under a mask based on audio driving according to claim 1, wherein the head pose code of step 4 is obtained by the following steps:
step 4-1, performing face keypoint detection on each frame to obtain the two-dimensional keypoints of the eye region, which remain visible under the occlusion of the mask; the keypoint detection network involved comprises a convolutional layer, a bottleneck layer, a convolutional layer and a fully connected layer;
step 4-2, inputting the obtained two-dimensional face keypoints into a pose estimation network to estimate the three-dimensional Euler angles of the face and obtain a 12-dimensional head pose code comprising a 9-dimensional rotation code and a 3-dimensional offset code; the pose estimation network involved comprises convolutional layers and fully connected layers.
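The 12-dimensional code of step 4-2 (9 rotation entries plus a 3-dimensional offset) can be formed by flattening the rotation matrix built from the estimated Euler angles; a NumPy sketch (the Z-Y-X composition order is an assumption, the claim does not fix a convention):

```python
import numpy as np

def pose_code(yaw, pitch, roll, offset):
    """Build the 12-D head pose code of step 4-2: a flattened 3x3
    rotation matrix (9 values) followed by a 3-D offset."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    R = Rz @ Ry @ Rx
    return np.concatenate([R.ravel(), np.asarray(offset, dtype=float)])

code = pose_code(0.1, -0.05, 0.0, [1.0, 2.0, 0.5])
print(code.shape)  # (12,)
```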
5. The 2D face restoration technology under a mask based on audio driving according to claim 1, wherein the generative adversarial network of step 6 is constructed by the following steps:
step 6-1, self-evaluation of the generative adversarial network; because the generative adversarial network adopts a progressive-growing structure, the image produced by the generator and the image used for training are each Gaussian-smoothed and subsampled into a series of down-sampled images forming a Gaussian pyramid; the loss is calculated separately for each pyramid level n and back-propagated through the loss function of formula (3) to train the generator network at the corresponding resolution,
where m denotes the number of samples, Disc is the discriminator, img_generated is the image produced by the generator, and img_real is the image information of the training video;
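The multi-resolution comparison of step 6-1 rests on a standard Gaussian pyramid: smooth, then subsample by 2 at each level. A self-contained NumPy sketch (the 5-tap binomial kernel is a common choice, not specified by the claim):

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
KERNEL /= KERNEL.sum()

def blur(img):
    """Separable Gaussian smoothing with reflect padding."""
    for axis in (0, 1):
        img = np.apply_along_axis(
            lambda r: np.convolve(np.pad(r, 2, mode="reflect"), KERNEL, mode="valid"),
            axis, img)
    return img

def gaussian_pyramid(img, levels):
    """Return [img, img/2, img/4, ...]: the per-resolution images used to
    compare the generated image and the training image (step 6-1)."""
    pyr = [img]
    for _ in range(levels - 1):
        img = blur(img)[::2, ::2]
        pyr.append(img)
    return pyr

pyr = gaussian_pyramid(np.random.rand(64, 64), 4)
print([p.shape for p in pyr])  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```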
step 6-2, self-evaluation of the feature extraction network, comprising the following steps:
step 6-2-1, feature evaluation of the feature extraction network; calculate the feature extraction network loss from the pixels of the generated image and of the training video image at the same positions, and back-propagate it to train the feature extraction network, with the loss function given by formula (4),
where H is the height of the input training video image, W is its width, Gen denotes the generated image, and Real denotes the input training video image;
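A straightforward reading of step 6-2-1 is a per-pixel squared difference normalized by the image size H·W; the exact normalization of formula (4) is an assumption here:

```python
import numpy as np

def pixel_loss(gen, real):
    """Squared per-pixel difference between the generated image and the
    training frame, averaged over the H*W pixels (step 6-2-1)."""
    h, w = real.shape[:2]
    return float(np.sum((gen - real) ** 2) / (h * w))

a = np.zeros((4, 4))
b = np.ones((4, 4))
print(pixel_loss(a, b))  # 1.0
```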
step 6-2-2, point cloud distribution evaluation of the feature extraction network; to avoid information redundancy in the three-dimensional point cloud, a distance loss is calculated from the Euclidean distances among the K points in the point cloud U and back-propagated to train the feature extraction network, with the loss function given by formula (5);
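Step 6-2-2 penalizes redundancy through the pairwise Euclidean distances among the K points. One plausible form (an assumption, since formula (5) is not reproduced on this page) is a repulsion term that grows as points collapse onto each other:

```python
import numpy as np

def point_cloud_loss(U, eps=1e-8):
    """Repulsion-style distance loss over a point cloud U of shape (K, 3):
    small pairwise distances are penalized, discouraging redundant points
    (one possible reading of step 6-2-2)."""
    diff = U[:, None, :] - U[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    iu = np.triu_indices(len(U), k=1)   # each unordered pair once
    return float(np.mean(1.0 / (dist[iu] + eps)))

spread = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
tight = spread * 0.01
```

A tightly clustered cloud incurs a larger loss than a well-spread one, which is the redundancy-avoidance behavior the claim describes.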
step 6-3, pose evaluation of the generative adversarial network; calculate the Euclidean distance between the head pose code of the generated face and that of the face in the corresponding frame of the training video image information, and back-propagate it as the pose loss to train the generator network;
step 6-4, emotion evaluation of the generative adversarial network; extract two-dimensional keypoints from the generated face and from the face in the corresponding frame of the training video image information to obtain the keypoint sets P_gen and P_real, calculate the emotion loss from P_gen and P_real as in formula (6), and back-propagate it to train the generator network.
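A straightforward reading of the emotion loss in step 6-4 is the mean Euclidean distance between corresponding keypoints of the two sets (the exact form of formula (6) is an assumption here):

```python
import numpy as np

def emotion_loss(P_gen, P_real):
    """Mean Euclidean distance between corresponding 2-D keypoints of the
    generated face and the real face (step 6-4)."""
    return float(np.mean(np.linalg.norm(P_gen - P_real, axis=1)))

P_gen = np.array([[0.0, 0.0], [1.0, 1.0]])
P_real = np.array([[3.0, 4.0], [1.0, 1.0]])
print(emotion_loss(P_gen, P_real))  # 2.5
```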
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210232796.8A CN114663539B (en) | 2022-03-09 | 2022-03-09 | 2D face restoration technology under mask based on audio drive |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114663539A true CN114663539A (en) | 2022-06-24 |
CN114663539B CN114663539B (en) | 2023-03-14 |
Family
ID=82028539
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210232796.8A Active CN114663539B (en) | 2022-03-09 | 2022-03-09 | 2D face restoration technology under mask based on audio drive |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114663539B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020168731A1 (en) * | 2019-02-19 | 2020-08-27 | 华南理工大学 | Generative adversarial mechanism and attention mechanism-based standard face generation method |
CN111783603A (en) * | 2020-06-24 | 2020-10-16 | 有半岛(北京)信息科技有限公司 | Training method for generating confrontation network, image face changing method and video face changing method and device |
CN111797897A (en) * | 2020-06-03 | 2020-10-20 | 浙江大学 | Audio face image generation method based on deep learning |
CN113793408A (en) * | 2021-09-15 | 2021-12-14 | 宿迁硅基智能科技有限公司 | Real-time audio-driven face generation method and device and server |
WO2021254499A1 (en) * | 2020-06-19 | 2021-12-23 | 北京灵汐科技有限公司 | Editing model generation method and apparatus, face image editing method and apparatus, device, and medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115100329A (en) * | 2022-06-27 | 2022-09-23 | 太原理工大学 | Multi-mode driving-based emotion controllable facial animation generation method |
CN117474807A (en) * | 2023-12-27 | 2024-01-30 | 科大讯飞股份有限公司 | Image restoration method, device, equipment and storage medium |
CN117474807B (en) * | 2023-12-27 | 2024-05-31 | 科大讯飞股份有限公司 | Image restoration method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114663539B (en) | 2023-03-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113378697B (en) | Method and device for generating speaking face video based on convolutional neural network | |
CN112887698B (en) | High-quality face voice driving method based on nerve radiation field | |
CN111145322B (en) | Method, apparatus, and computer-readable storage medium for driving avatar | |
CN114663539B (en) | 2D face restoration technology under mask based on audio drive | |
Jonell et al. | Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings | |
CN113269872A (en) | Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization | |
CN108288072A (en) | A kind of facial expression synthetic method based on generation confrontation network | |
CN115914505B (en) | Video generation method and system based on voice-driven digital human model | |
CN112785671B (en) | Virtual dummy face animation synthesis method | |
CN110796593A (en) | Image processing method, device, medium and electronic equipment based on artificial intelligence | |
CN115908659A (en) | Method and device for synthesizing speaking face based on generation countermeasure network | |
CN113470170A (en) | Real-time video face region space-time consistent synthesis method using voice information | |
Rebol et al. | Passing a non-verbal turing test: Evaluating gesture animations generated from speech | |
CN113838173A (en) | Virtual human head motion synthesis method driven by voice and background sound | |
Li et al. | Buccal: Low-cost cheek sensing for inferring continuous jaw motion in mobile virtual reality | |
Wang et al. | 3d-talkemo: Learning to synthesize 3d emotional talking head | |
CN108908353B (en) | Robot expression simulation method and device based on smooth constraint reverse mechanical model | |
Tang et al. | Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar | |
Hill et al. | Range-and domain-specific exaggeration of facial speech | |
WO2024124680A1 (en) | Speech signal-driven personalized three-dimensional facial animation generation method, and application thereof | |
Yi et al. | Predicting personalized head movement from short video and speech signal | |
CN115984452A (en) | Head three-dimensional reconstruction method and equipment | |
Tin | Facial extraction and lip tracking using facial points | |
CN113343761A (en) | Real-time facial expression migration method based on generation confrontation | |
Kumar et al. | Multi modal adaptive normalization for audio to video generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||