CN114663539A - 2D face restoration technology under mask based on audio drive - Google Patents

2D face restoration technology under mask based on audio drive

Info

Publication number
CN114663539A
CN114663539A
Authority
CN
China
Prior art keywords
audio
training
network
emotion
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210232796.8A
Other languages
Chinese (zh)
Other versions
CN114663539B (en)
Inventor
李新德
王航宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210232796.8A priority Critical patent/CN114663539B/en
Publication of CN114663539A publication Critical patent/CN114663539A/en
Application granted granted Critical
Publication of CN114663539B publication Critical patent/CN114663539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • Computational Linguistics (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an audio-driven 2D face restoration technique for a masked face, which is used to realistically restore the complete face of a speaker. An audio emotion decoupler effectively separates the content and emotion information contained in the audio, avoiding interference between them and enhancing the realism and naturalness of the generated facial expressions and mouth shapes. Through 2D modeling, the geometric, facial-decoration and skin-color features of the face are extracted and the three-dimensional Euler angles of the head are estimated, which guarantees the quality of the generated talking face while avoiding the difficulty and instability of 3D modeling.

Description

2D face restoration technology under mask based on audio drive
Technical Field
The invention belongs to the field of face generation, and particularly relates to a 2D face restoration technology under a mask based on audio driving.
Background
With the worldwide spread of the COVID-19 epidemic, wearing a mask when going out, especially in public places, has become the norm. Although wearing a mask ensures safety, the effectiveness of communication is greatly reduced when such a large part of the face is occluded. According to the McGurk effect, every language people master depends to some extent on visual cues of speech perception, and the information conveyed by the other party's lip movements and facial expressions is also important in the course of communication.
In the current era of abundant computing power and rapid technological progress, many complex tasks once thought impossible have been accomplished. In the field of face generation, a number of successful models and neural network architectures can already generate realistic human faces efficiently. As a popular research direction in recent years, face generation has attracted the attention of many researchers and shows a promising development trend.
Audio-driven face generation must reproduce not only the true appearance of the target but also the expressions and facial movements during speech, and it faces many challenges in practical applications, such as sound-source stability, environmental complexity, the realism of the generated face, and the continuity of the frames.
Disclosure of Invention
In order to solve the above problems, the invention discloses an audio-driven 2D face restoration technique for a masked face, which is used to realistically restore the complete face of a speaker. An audio emotion decoupler effectively separates the content and emotion information contained in the audio, avoiding interference between them and enhancing the realism and naturalness of the generated face. Through 2D modeling, the geometric, facial-decoration and skin-color features of the face are extracted and the three-dimensional Euler angles of the head are estimated, which guarantees the quality of the generated talking face while avoiding the difficulty and instability of 3D modeling.
In order to achieve the purpose, the technical scheme of the invention is as follows:
An audio-driven 2D face restoration technique under a mask comprises the following steps:
step 1, acquiring image information of a training video, audio information synchronized with the training video, and a source identity image of a target object;
step 2, training an audio emotion decoupler on the audio information by a cross-reconstruction method;
step 3, generating an identity code from the source identity image through a feature extraction network;
step 4, extracting features from the image information to obtain the head pose code of each frame of image;
step 5, extracting features from the audio information with the audio emotion decoupler to obtain the emotion code of each frame of image;
step 6, constructing an adversarial generation network used to produce rendered images, and training the adversarial generation network on the identity codes, emotion codes and head pose codes;
step 7, obtaining the image information and audio information of a target video in which the target object wears a mask, repeating steps 3, 4 and 5, using the resulting identity code, emotion code and head pose code as conditional information, and rendering images with the adversarial generation network to generate the target video.
The step 1 comprises the following steps:
Step 1-1, the training video obtained is a single-person, front-facing, unoccluded talking video. The video frames are in color; the talking duration of the person in the video is not limited, with 3-5 minutes being optimal; the video resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these video attributes, all except the duration and resolution may be set according to the actual situation.
Step 1-2, the source identity image of the target object obtained is a single picture. The picture is in color, with a resolution of 720P or 1080P; the person in the picture faces the camera frontally, unoccluded and under good lighting conditions (an illustrative data-loading sketch follows).
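The following is a minimal, illustrative loading sketch for the material described in steps 1-1 and 1-2; it is not part of the patent. The file names and the use of OpenCV and librosa are assumptions, and the audio track is assumed to have been demuxed to a WAV file beforehand.

```python
# Illustrative sketch (not the patent's implementation): load a training video
# and its audio track at the sampling rate named in step 1-1.
import cv2
import librosa

def load_training_data(video_path="train.mp4", audio_path="train.wav", audio_sr=44100):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)        # expected to be 25 frames/second
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    # The audio is assumed to have been extracted to a WAV file beforehand.
    audio, sr = librosa.load(audio_path, sr=audio_sr, mono=True)
    return frames, fps, audio, sr
```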
The step 2 comprises the following steps:
Step 2-1, for the audio information, Mel-frequency cepstral coefficients are used as the audio representation to obtain audio vectors;
Step 2-2, the audio vectors are stretched or shrunk by a dynamic time warping algorithm to obtain audio training pairs {x_{i,m}, x_{i,n}} of equal length with the same content but different emotions, where x_{i,m}, x_{i,n} ∈ X, i denotes the content and m and n denote two different emotions (an alignment sketch is given after step 2-1-2);
Step 2-3, the audio emotion decoupler is trained by a cross-reconstruction method; after training, the content encoder is denoted E_c, the emotion encoder E_e, and the decoder D.
Step 2-1 comprises:
Step 2-1-1, the original video audio is resampled to a fixed sampling frequency;
Step 2-1-2, the frequency-domain features of the resampled audio are computed and represented as Mel cepstral coefficients.
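A short sketch of steps 2-1 and 2-2, assuming librosa for MFCC extraction and dynamic time warping; the target sampling rate, the number of MFCC coefficients and the frame-wise alignment strategy are illustrative choices, not values fixed by the patent.

```python
# Illustrative sketch of steps 2-1 (MFCC representation) and 2-2 (DTW alignment).
import librosa
import numpy as np

def mfcc_features(audio, sr, target_sr=16000, n_mfcc=13):
    # Step 2-1: resample to a fixed rate, then take Mel-frequency cepstral coefficients.
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    return librosa.feature.mfcc(y=audio, sr=target_sr, n_mfcc=n_mfcc)   # (n_mfcc, T)

def align_pair(mfcc_a, mfcc_b):
    # Step 2-2: dynamic time warping; warp sequence B onto the time axis of A so the
    # training pair has equal length but carries different emotion.
    _, wp = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    wp = wp[::-1]                                      # start-to-end order
    idx_b_for_a = np.zeros(mfcc_a.shape[1], dtype=int)
    for ia, ib in wp:
        idx_b_for_a[ia] = ib                           # keep the last match per frame of A
    return mfcc_a, mfcc_b[:, idx_b_for_a]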
The step 2-3 comprises the following steps:
Step 2-3-1, codec cross-reconstruction loss, as in equation (1). The content code of x_{i,m} produced by the content encoder is combined with the emotion code of x_{j,n} produced by the emotion encoder, and the decoder decodes the combination into a reconstructed audio x'_{i,n}; the loss between x'_{i,n} and x_{i,n} (and, symmetrically, between the cross reconstruction of x_{j,m} and x_{j,m}) is computed and back-propagated to train the encoder-decoder network, which enforces the independence of the emotion code and the content code.

L_cross = ||D(E_c(x_{i,m}), E_e(x_{j,n})) - x_{i,n}||_2 + ||D(E_c(x_{j,n}), E_e(x_{i,m})) - x_{j,m}||_2    (1)

Step 2-3-2, codec self-reconstruction loss, as in equation (2). The emotion code and content code of the same audio clip are combined, the decoder decodes them into a new audio, the loss between this reconstruction and the original audio is computed, and the encoder-decoder network is trained by back-propagation, which ensures the completeness of the codes.

L_self = ||D(E_c(x_{i,m}), E_e(x_{i,m})) - x_{i,m}||_2 + ||D(E_c(x_{j,n}), E_e(x_{j,n})) - x_{j,n}||_2    (2)
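A minimal PyTorch sketch of the cross-reconstruction training of equations (1) and (2). The fully-connected encoders and decoder are placeholders (the patent does not specify the architectures), and the squared-L2 form of the norm is an assumption.

```python
# Illustrative sketch, not the patent's implementation.
import torch
import torch.nn as nn

class AudioEmotionDecoupler(nn.Module):
    def __init__(self, feat_dim=13, code_dim=64):
        super().__init__()
        self.Ec = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.Ee = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.D = nn.Sequential(nn.Linear(2 * code_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def decode(self, content, emotion):
        return self.D(torch.cat([content, emotion], dim=-1))

def decoupler_losses(model, x_im, x_in, x_jm, x_jn):
    # Equation (1): cross reconstruction -- content of one clip + emotion of the other.
    rec_in = model.decode(model.Ec(x_im), model.Ee(x_jn))
    rec_jm = model.decode(model.Ec(x_jn), model.Ee(x_im))
    l_cross = ((rec_in - x_in) ** 2).sum(-1).mean() + ((rec_jm - x_jm) ** 2).sum(-1).mean()

    # Equation (2): self reconstruction -- content and emotion from the same clip.
    rec_im = model.decode(model.Ec(x_im), model.Ee(x_im))
    rec_jn = model.decode(model.Ec(x_jn), model.Ee(x_jn))
    l_self = ((rec_im - x_im) ** 2).sum(-1).mean() + ((rec_jn - x_jn) ** 2).sum(-1).mean()
    return l_cross + l_self
```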
In step 3, the feature extraction network outputs K three-dimensional points that form a point cloud containing the geometric, facial-decoration and skin-color features of the face.
Step 4 comprises the following steps:
Step 4-1, face key point detection is performed on each frame, and the two-dimensional face key points of the eye region are obtained under mask occlusion. The key point detection network involved consists of a convolutional layer, a bottleneck layer, a convolutional layer and a fully-connected layer.
Step 4-2, the obtained two-dimensional face key points are fed into a pose estimation network, which estimates the three-dimensional Euler angles of the face and outputs a 12-dimensional head pose code comprising a 9-dimensional rotation code and a 3-dimensional deflection code. The pose estimation network involved consists of convolutional layers and fully-connected layers (an illustrative network sketch follows).
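An illustrative PyTorch sketch of the two networks of steps 4-1 and 4-2. The patent only names the layer types (convolution, bottleneck, convolution, fully-connected for key point detection; convolution and fully-connected for pose estimation); the layer sizes, the MobileNet-style interpretation of the bottleneck, and the number of key points are assumptions.

```python
# Illustrative sketch, not the patent's exact architecture.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Assumed MobileNetV2-style inverted-residual bottleneck."""
    def __init__(self, c_in, c_out, expand=4):
        super().__init__()
        hidden = c_in * expand
        self.use_res = (c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1), nn.ReLU6(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.ReLU6(),
            nn.Conv2d(hidden, c_out, 1),
        )

    def forward(self, x):
        out = self.block(x)
        return out + x if self.use_res else out

class KeypointNet(nn.Module):
    """Step 4-1: convolution -> bottleneck -> convolution -> fully connected,
    predicting 2D key points of the eye region from a masked face image."""
    def __init__(self, k_pts=20):
        super().__init__()
        self.k_pts = k_pts
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            Bottleneck(32, 32),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, k_pts * 2)

    def forward(self, img):                      # img: (B, 3, H, W)
        feat = self.features(img).flatten(1)     # (B, 64)
        return self.fc(feat).view(-1, self.k_pts, 2)

class PoseNet(nn.Module):
    """Step 4-2: maps 2D key points to a 12-D head pose code (9-D rotation + 3-D deflection)."""
    def __init__(self, k_pts=20):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(2, 32, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(32 * k_pts, 12)

    def forward(self, kpts):                     # kpts: (B, K, 2)
        x = self.conv(kpts.transpose(1, 2))      # (B, 32, K)
        return self.fc(x.flatten(1))             # (B, 12)
```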
In step 5, the audio information of the training video is fed into the trained audio emotion decoupler to obtain the emotion codes.
The step 6 comprises the following steps:
Step 6-1, self-evaluation of the adversarial generation network. Because the adversarial generation network adopts a progressively growing structure, the image produced by the generator and the image used for training are each Gaussian-smoothed and subsampled into a series of down-sampled images forming a Gaussian pyramid; a loss is computed at every level n of the pyramid, and each loss function is back-propagated to train the generator network of the corresponding resolution:

[Equation (3), reproduced only as an image in the original publication: the adversarial loss at pyramid level n, computed from the discriminator outputs Disc(img_generated) and Disc(img_real) over m samples.]

where m denotes the number of samples, Disc is the discriminator, img_generated is the image produced by the generator, and img_real is the image information of the training video.
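A sketch of the Gaussian pyramid construction and per-level losses of step 6-1. Because equation (3) is reproduced only as an image, the non-saturating GAN losses used below are a stand-in, not the patent's actual formula; the 5x5 binomial kernel and the number of pyramid levels are likewise assumptions.

```python
# Illustrative sketch: Gaussian smoothing + subsampling, and per-resolution losses.
import torch
import torch.nn.functional as F

def _gauss_kernel(device, dtype):
    k = torch.tensor([1., 4., 6., 4., 1.], device=device, dtype=dtype)
    k = torch.outer(k, k)
    return (k / k.sum()).expand(3, 1, 5, 5)      # one 5x5 kernel per RGB channel

def gaussian_pyramid(img, levels=4):
    """img: (B, 3, H, W) tensor; each level = Gaussian smoothing + 2x subsampling."""
    pyr = [img]
    for _ in range(levels - 1):
        x = pyr[-1]
        k = _gauss_kernel(x.device, x.dtype)
        x = F.conv2d(F.pad(x, (2, 2, 2, 2), mode="reflect"), k, groups=3)
        pyr.append(x[:, :, ::2, ::2])
    return pyr

def per_level_adversarial_losses(discriminators, gen_pyr, real_pyr):
    # One discriminator per resolution; each generator stage is trained with the
    # loss of its own pyramid level (non-saturating GAN losses as a stand-in).
    g_losses, d_losses = [], []
    for disc, g, r in zip(discriminators, gen_pyr, real_pyr):
        d_losses.append(F.softplus(disc(g.detach())).mean() + F.softplus(-disc(r)).mean())
        g_losses.append(F.softplus(-disc(g)).mean())
    return g_losses, d_losses
```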
Step 6-2, self-evaluation of the feature extraction network.
Step 6-2 comprises the following steps:
Step 6-2-1, feature evaluation of the feature extraction network. The feature extraction network loss is computed from the pixels of the generated image and of the training video image at the same positions and back-propagated to train the feature extraction network, with the loss function given in equation (4):

[Equation (4), reproduced only as an image in the original publication: a pixel-wise loss between the generated image Gen and the training image Real, accumulated over all H x W pixel positions.]

where H is the height of the input training video image, W is its width, Gen denotes the generated image, and Real denotes the input training video image.
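A one-line sketch of the pixel-level loss of step 6-2-1; since equation (4) is reproduced only as an image, the mean squared error over co-located pixels is an assumed, representative form.

```python
# Illustrative stand-in for equation (4).
import torch

def pixel_loss(gen_img, real_img):
    """gen_img, real_img: (B, 3, H, W) tensors normalised to the same range."""
    return ((gen_img - real_img) ** 2).mean()
```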
Step 6-2-2, point cloud distribution evaluation of the feature extraction network. To avoid information redundancy in the three-dimensional point cloud, a distance loss is computed from the Euclidean distances between the K points of the point cloud U and back-propagated to train the feature extraction network, with the loss function given in equation (5):

[Equation (5), reproduced only as an image in the original publication: a distance loss over the pairwise Euclidean distances between the K points of the point cloud U.]
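A sketch of the point-cloud distribution loss of step 6-2-2. Equation (5) is reproduced only as an image, so the margin-based repulsion term below, which penalises pairs of the K points that fall too close together, is an assumed stand-in consistent with the stated goal of avoiding redundancy.

```python
# Illustrative stand-in for equation (5).
import torch

def point_cloud_spread_loss(U, margin=0.05):
    """U: (K, 3) three-dimensional point cloud from the feature extraction network."""
    d = torch.cdist(U, U)                                 # (K, K) pairwise Euclidean distances
    K = U.shape[0]
    off_diag = d + torch.eye(K, device=U.device) * 1e6    # ignore self-distances
    return torch.clamp(margin - off_diag, min=0).mean()   # penalise near-duplicate points
```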
And 6-3, generating network attitude evaluation for the confrontation. Calculating and generating Euclidean distance between the human face and head posture codes of the human face in a frame corresponding to training video image information, and taking the Euclidean distance as a posture loss back propagation training generation network;
and 6-4, generating network emotion evaluation by confrontation. Obtaining a generated face and a face in a frame corresponding to training video image information, respectively extracting two-dimensional key points to obtain a corresponding key point set Pgen,PrealAccording to the set of key points Pgen,PrealCalculating the emotional loss, such as formula (6), and performing back propagation training to generate a network;
Figure BDA0003539158870000041
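A sketch of the pose loss of step 6-3 and the emotion loss of step 6-4. Equation (6) is reproduced only as an image, so the mean Euclidean distance between matched key points (and between pose codes) is an assumed form consistent with the surrounding description.

```python
# Illustrative stand-ins for the pose loss and equation (6).
import torch

def pose_loss(pose_gen, pose_real):
    """12-D head pose codes of the generated face and of the corresponding training frame."""
    return torch.norm(pose_gen - pose_real, dim=-1).mean()

def emotion_loss(P_gen, P_real):
    """P_gen, P_real: (N, 2) two-dimensional key point sets extracted from the two faces."""
    return torch.norm(P_gen - P_real, dim=-1).mean()
```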
the step 7 comprises the following steps:
Step 7-1, the conditional video acquired is a single-person talking video in which the person wears a mask while speaking. The video frames are in color; the talking duration is not limited; the video resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these video attributes, all except the duration and resolution may be set according to the actual situation.
Step 7-2, an identity code is generated from the source identity image through the feature extraction network;
Step 7-3, features are extracted from the image information to obtain the head pose code of each frame of image;
Step 7-4, the audio emotion decoupler extracts features from the audio information to obtain the emotion code of each frame of image;
Step 7-5, the obtained identity code, emotion code and head pose code are used as conditional information, and the adversarial generation network renders the target image under the current viewing angle and audio conditions.
The invention has the beneficial effects that:
the 2D face restoration technology under the mask based on the audio drive is used for really restoring the complete face of a speaker. Through the audio decoupler, the content and the emotion information contained in the audio are effectively separated, the interference between the content and the emotion information is avoided, and the authenticity and the naturalness of the generated face are effectively enhanced. Through 2D modeling, the geometric features, the face decoration and the skin color features of the human face are extracted, the three-dimensional Euler angle of the head is estimated, and the difficulty and the instability of 3D modeling are effectively solved while the generation of the speaking human face effect is ensured.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a training method of an audio emotion decoupler;
FIG. 3 is a schematic diagram of a network for extracting eye key points and estimating head pose under a mask;
FIG. 4 is a schematic diagram of the adversarial generation network.
Detailed Description
The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments, which should be understood as illustrative only and not limiting the scope of the invention.
The application discloses a real-time audio-driven face generation method. Given a video of a face speaking while wearing a mask, a high-quality, audio-driven restoration video of the face under the mask is generated using an audio emotion decoupler with an encoder-decoder structure, a head pose estimation network for the masked state with a MobileNet regressor as its backbone, and a style-based generation network with an adversarial generation structure.
Illustratively, a front-facing, well-lit, unobstructed color photograph of a person is given as the identity source image. The acquired target video is a single-person talking video in which the target object wears a mask; the video frames are in color; the resolution is 720P, 1080P, 2K or 4K; the frame rate is 30 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these video attributes, all except the duration and resolution may be set according to the actual situation. The feature extraction network obtains identity features from the source image, the audio emotion decoupler decouples emotion codes from the audio information of the target video, and the head pose estimation network extracts head pose codes from the image information of the target video. Finally, a talking-face video is generated that matches the identity of the person in the photograph and the unoccluded facial expressions and head poses of the person in the target video.
FIG. 1 is a flow chart of the method of the present invention, which comprises: training an audio emotion decoupler by the cross-reconstruction method; generating an identity code from the source identity image through the feature extraction network; acquiring the image information of the masked target video and extracting features with the head pose estimation network to obtain the head pose code of each frame; acquiring the audio information of the masked target video and extracting features with the audio emotion decoupler to obtain the emotion code of each frame; and feeding the identity code, emotion code and head pose code into the adversarial generation network to generate the talking-face video (an end-to-end inference sketch follows).
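The following is an end-to-end inference sketch matching the flow of FIG. 1. All component names (feature_extractor, emotion_decoupler, keypoint_net, pose_net, generator) are hypothetical stand-ins for the trained networks described above, not identifiers used by the patent.

```python
# Illustrative orchestration of the inference pipeline; component objects are assumed
# to be the trained networks described in the text.
def restore_face(source_image, target_frames, target_audio, nets):
    identity_code = nets["feature_extractor"](source_image)          # identity code from the source image
    emotion_codes = nets["emotion_decoupler"].encode(target_audio)   # per-frame emotion codes from the audio
    outputs = []
    for frame, emotion_code in zip(target_frames, emotion_codes):
        kpts = nets["keypoint_net"](frame)                           # eye key points under the mask
        pose_code = nets["pose_net"](kpts)                           # 12-D head pose code
        outputs.append(nets["generator"](identity_code, pose_code, emotion_code))
    return outputs                                                   # frames of the restored talking face
```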
(1) Image information of a training video, audio information synchronized with the training video, and a source identity image of the target object are acquired, with the following requirements:
(11) the training video obtained is a single-person, front-facing, unoccluded talking video; the video frames are in color; the talking duration of the person in the video is not limited, with 3-5 minutes being optimal; the video resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz; among these video attributes, all except the duration and resolution may be set according to the actual situation.
(12) The source identity image of the target object obtained is a single picture; the picture is in color, with a resolution of 720P or 1080P; the person in the picture faces the camera frontally, unoccluded and under good lighting conditions.
(2) On the basis of the audio information of the training video, the audio emotion decoupler is trained by the cross-reconstruction method. The training process of the audio emotion decoupler is as follows:
(21) For the audio information, Mel-frequency cepstral coefficients are used as the audio representation to obtain audio vectors;
(211) the original video audio is resampled to a fixed sampling frequency;
(212) the frequency-domain features of the resampled audio are computed and represented as Mel cepstral coefficients.
(22) The audio vectors are stretched or shrunk by a dynamic time warping algorithm to obtain audio training pairs {x_{i,m}, x_{i,n}} of equal length with the same content but different emotions, where x_{i,m}, x_{i,n} ∈ X, i denotes the content and m and n denote two different emotions;
(23) the audio emotion decoupler is trained by the cross-reconstruction method; the trained content encoder is denoted E_c, the emotion encoder E_e, and the decoder D.
(231) Codec cross-reconstruction loss, as in equation (1). The content code of x_{i,m} from the content encoder is combined with the emotion code of x_{j,n} from the emotion encoder, the decoder decodes the combination into a reconstructed audio x'_{i,n}, and the loss between x'_{i,n} and x_{i,n} (and, symmetrically, for x_{j,m}) is computed and back-propagated to train the encoder-decoder network, enforcing the independence of the emotion code and the content code.

L_cross = ||D(E_c(x_{i,m}), E_e(x_{j,n})) - x_{i,n}||_2 + ||D(E_c(x_{j,n}), E_e(x_{i,m})) - x_{j,m}||_2    (1)

(232) Codec self-reconstruction loss, as in equation (2). The emotion code and content code of the same audio clip are combined, the decoder decodes them into a new audio, the loss between this reconstruction and the original audio is computed and back-propagated to train the encoder-decoder network, ensuring the completeness of the codes.

L_self = ||D(E_c(x_{i,m}), E_e(x_{i,m})) - x_{i,m}||_2 + ||D(E_c(x_{j,n}), E_e(x_{j,n})) - x_{j,n}||_2    (2)
The stability of publicly available three-dimensional key point algorithms is difficult to guarantee, and their requirements on equipment and computing power are relatively strict. To avoid these problems, modeling is performed in a two-dimensional manner, and the goal of generating a three-dimensional point cloud is achieved by unsupervised learning. The corresponding feature extraction network is trained simultaneously with the adversarial generation network.
(3) Generating an identity code from a source identity image through a feature extraction network;
(4) To cope with the large-area occlusion caused by the mask, a head pose estimation network for the masked state with a MobileNet regressor as its backbone is used to obtain the head pose code; the network structure is shown in FIG. 3.
(41) Face key point detection is performed on each frame, and the two-dimensional face key points of the eye region are obtained under mask occlusion. The key point detection network involved consists of a convolutional layer, a bottleneck layer, a convolutional layer and a fully-connected layer;
(42) the two-dimensional face key points are fed into the pose estimation network, which estimates the three-dimensional Euler angles of the face and outputs a 12-dimensional head pose code comprising a 9-dimensional rotation code and a 3-dimensional deflection code. The pose estimation network involved consists of convolutional layers and fully-connected layers;
(5) The audio information of the training video is fed into the trained audio emotion decoupler to obtain the emotion codes.
(6) An adversarial generation network used to produce rendered images is constructed and trained on the identity codes, emotion codes and head pose codes, as shown in FIG. 4, where A is the identity code, B is the head pose code and C is the emotion code. The training process is as follows:
(61) Self-evaluation of the adversarial generation network. Because the adversarial generation network adopts a progressively growing structure, the image produced by the generator and the image used for training are each Gaussian-smoothed and subsampled into a series of down-sampled images forming a Gaussian pyramid; a loss is computed at every level n of the pyramid, and each loss function is back-propagated to train the generator network of the corresponding resolution:

[Equation (3), reproduced only as an image in the original publication: the adversarial loss at pyramid level n, computed from the discriminator outputs Disc(img_generated) and Disc(img_real) over m samples.]

where m denotes the number of samples, Disc is the discriminator, img_generated is the image produced by the generator, and img_real is the image information of the training video;
(62) self-evaluation of the feature extraction network.
(621) Feature evaluation of the feature extraction network. The feature extraction network loss is computed from the pixels of the generated image and of the training video image at the same positions and back-propagated to train the feature extraction network, with the loss function given in equation (4):

[Equation (4), reproduced only as an image in the original publication: a pixel-wise loss between the generated image Gen and the training image Real, accumulated over all H x W pixel positions.]

where H is the height of the input training video image, W is its width, Gen denotes the generated image, and Real denotes the input training video image.
(622) Point cloud distribution evaluation of the feature extraction network. To avoid information redundancy in the three-dimensional point cloud, a distance loss is computed from the Euclidean distances between the K points of the point cloud U and back-propagated to train the feature extraction network, with the loss function given in equation (5).

[Equation (5), reproduced only as an image in the original publication: a distance loss over the pairwise Euclidean distances between the K points of the point cloud U.]

(63) Pose evaluation of the adversarial generation network. The Euclidean distance between the head pose code of the generated face and that of the face in the corresponding frame of the training video image information is computed and back-propagated as the pose loss to train the generation network;
(64) emotion evaluation of the adversarial generation network. Two-dimensional key points are extracted from the generated face and from the face in the corresponding frame of the training video image information, giving the key point sets P_gen and P_real; the emotion loss is computed from P_gen and P_real as in equation (6) and back-propagated to train the generation network;

[Equation (6), reproduced only as an image in the original publication: the emotion loss computed from the key point sets P_gen and P_real.]
(7) The image information and audio information of a target video in which the target object wears a mask are obtained, steps 3, 4 and 5 are repeated, the resulting identity code, emotion code and head pose code are used as conditional information, and the adversarial generation network renders images to generate the target video:
(71) the conditional video acquired is a single-person talking video in which the person wears a mask while speaking; the video frames are in color; the talking duration is not limited; the video resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz; among these video attributes, all except the duration and resolution may be set according to the actual situation.
(72) An identity code is generated from the source identity image through the feature extraction network;
(73) features are extracted from the image information to obtain the head pose code of each frame of image;
(74) the audio emotion decoupler extracts features from the audio information to obtain the emotion code of each frame of image;
(75) the obtained identity code, emotion code and head pose code are used as conditional information, and the adversarial generation network renders the target image under the current viewing angle and audio conditions.
It should be noted that the above merely illustrates the technical idea of the present invention and does not limit its scope of protection; it will be obvious to those skilled in the art that several modifications and refinements can be made without departing from the principle of the present invention, and such modifications and refinements fall within the scope of protection of the claims of the present invention.

Claims (5)

1. An audio-driven 2D face restoration technique under a mask, characterized in that it comprises the following steps:
step 1, acquiring image information of a training video, audio information synchronous with the training video and a source identity image of a target object;
step 2, training an audio emotion decoupler by a cross reconstruction method on the basis of the audio information;
step 3, generating an identity code from the source identity image through a feature extraction network;
step 4, extracting features from the image information to obtain the head pose code of each frame of image;
step 5, extracting features from the audio information with the audio emotion decoupler to obtain the emotion code of each frame of image;
step 6, constructing an adversarial generation network used to produce rendered images, and training the adversarial generation network on the identity codes, emotion codes and head pose codes;
step 7, obtaining the image information and audio information of a target video in which the target object wears a mask, repeating steps 3, 4 and 5, using the resulting identity code, emotion code and head pose code as conditional information, and rendering images with the adversarial generation network to generate the target video.
2. The audio-driven 2D face restoration technique under a mask according to claim 1, wherein the audio emotion decoupler of step 2 is obtained by the following steps:
step 2-1, for the audio information, using Mel-frequency cepstral coefficients as the audio representation to obtain audio vectors;
step 2-2, stretching or shrinking the audio vectors with a dynamic time warping algorithm to obtain audio training pairs {x_{i,m}, x_{i,n}} of equal length with the same content but different emotions, where x_{i,m}, x_{i,n} ∈ X, i denotes the content and m and n denote two different emotions;
step 2-3, training the audio emotion decoupler by the cross-reconstruction method, the trained content encoder being denoted E_c, the emotion encoder E_e, and the decoder D.
3. The audio-driven 2D face restoration technique under a mask according to claim 2, wherein the step 2-3 of training the audio emotion decoupler by the cross-reconstruction method comprises the following steps:
step 2-3-1, codec cross-reconstruction loss, as in equation (1): the content code of x_{i,m} from the content encoder is combined with the emotion code of x_{j,n} from the emotion encoder, the decoder decodes the combination into a reconstructed audio x'_{i,n}, and the loss between x'_{i,n} and x_{i,n} (and, symmetrically, for x_{j,m}) is computed and back-propagated to train the encoder-decoder network, ensuring the independence of the emotion code and the content code;

L_cross = ||D(E_c(x_{i,m}), E_e(x_{j,n})) - x_{i,n}||_2 + ||D(E_c(x_{j,n}), E_e(x_{i,m})) - x_{j,m}||_2    (1)

step 2-3-2, codec self-reconstruction loss, as in equation (2): the emotion code and content code of the same audio clip are combined, the decoder decodes them into a new audio, the loss between this reconstruction and the original audio is computed and back-propagated to train the encoder-decoder network, ensuring the completeness of the codes;

L_self = ||D(E_c(x_{i,m}), E_e(x_{i,m})) - x_{i,m}||_2 + ||D(E_c(x_{j,n}), E_e(x_{j,n})) - x_{j,n}||_2    (2).
4. The audio-driven 2D face restoration technique under a mask according to claim 1, wherein the head pose code of step 4 is obtained by the following steps:
step 4-1, performing face key point detection on each frame and obtaining the two-dimensional face key points of the eye region under mask occlusion, the key point detection network involved consisting of a convolutional layer, a bottleneck layer, a convolutional layer and a fully-connected layer;
step 4-2, feeding the obtained two-dimensional face key points into a pose estimation network, estimating the three-dimensional Euler angles of the face, and obtaining a 12-dimensional head pose code comprising a 9-dimensional rotation code and a 3-dimensional deflection code, the pose estimation network involved consisting of convolutional layers and fully-connected layers.
5. The audio-driven 2D face restoration technique under a mask according to claim 1, wherein the adversarial generation network of step 6 is constructed and trained by the following steps:
step 6-1, self-evaluation of the adversarial generation network: because the adversarial generation network adopts a progressively growing structure, the image produced by the generator and the image used for training are each Gaussian-smoothed and subsampled into a series of down-sampled images forming a Gaussian pyramid, a loss is computed at every level n of the pyramid, and each loss function is back-propagated to train the generator network of the corresponding resolution:

[Equation (3), reproduced only as an image in the original publication: the adversarial loss at pyramid level n, computed from the discriminator outputs Disc(img_generated) and Disc(img_real) over m samples.]

where m denotes the number of samples, Disc is the discriminator, img_generated is the image produced by the generator, and img_real is the image information of the training video;
step 6-2, self-evaluation of the feature extraction network;
step 6-2 comprises:
step 6-2-1, feature evaluation of the feature extraction network: the feature extraction network loss is computed from the pixels of the generated image and of the training video image at the same positions and back-propagated to train the feature extraction network, with the loss function given in equation (4):

[Equation (4), reproduced only as an image in the original publication: a pixel-wise loss between the generated image Gen and the training image Real, accumulated over all H x W pixel positions.]

where H is the height of the input training video image, W is its width, Gen denotes the generated image, and Real denotes the input training video image;
step 6-2-2, point cloud distribution evaluation of the feature extraction network: to avoid information redundancy in the three-dimensional point cloud, a distance loss is computed from the Euclidean distances between the K points of the point cloud U and back-propagated to train the feature extraction network, with the loss function given in equation (5);

[Equation (5), reproduced only as an image in the original publication: a distance loss over the pairwise Euclidean distances between the K points of the point cloud U.]

step 6-3, pose evaluation of the adversarial generation network: the Euclidean distance between the head pose code of the generated face and that of the face in the corresponding frame of the training video image information is computed and back-propagated as the pose loss to train the generation network;
step 6-4, emotion evaluation of the adversarial generation network: two-dimensional key points are extracted from the generated face and from the face in the corresponding frame of the training video image information, giving the key point sets P_gen and P_real, and the emotion loss is computed from P_gen and P_real as in equation (6) and back-propagated to train the generation network;

[Equation (6), reproduced only as an image in the original publication: the emotion loss computed from the key point sets P_gen and P_real.]
CN202210232796.8A 2022-03-09 2022-03-09 2D face restoration technology under mask based on audio drive Active CN114663539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210232796.8A CN114663539B (en) 2022-03-09 2022-03-09 2D face restoration technology under mask based on audio drive

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210232796.8A CN114663539B (en) 2022-03-09 2022-03-09 2D face restoration technology under mask based on audio drive

Publications (2)

Publication Number Publication Date
CN114663539A true CN114663539A (en) 2022-06-24
CN114663539B CN114663539B (en) 2023-03-14

Family

ID=82028539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210232796.8A Active CN114663539B (en) 2022-03-09 2022-03-09 2D face restoration technology under mask based on audio drive

Country Status (1)

Country Link
CN (1) CN114663539B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020168731A1 (en) * 2019-02-19 2020-08-27 华南理工大学 Generative adversarial mechanism and attention mechanism-based standard face generation method
CN111797897A (en) * 2020-06-03 2020-10-20 浙江大学 Audio face image generation method based on deep learning
WO2021254499A1 (en) * 2020-06-19 2021-12-23 北京灵汐科技有限公司 Editing model generation method and apparatus, face image editing method and apparatus, device, and medium
CN111783603A (en) * 2020-06-24 2020-10-16 有半岛(北京)信息科技有限公司 Training method for generating confrontation network, image face changing method and video face changing method and device
CN113793408A (en) * 2021-09-15 2021-12-14 宿迁硅基智能科技有限公司 Real-time audio-driven face generation method and device and server

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100329A (en) * 2022-06-27 2022-09-23 太原理工大学 Multi-mode driving-based emotion controllable facial animation generation method
CN117474807A (en) * 2023-12-27 2024-01-30 科大讯飞股份有限公司 Image restoration method, device, equipment and storage medium
CN117474807B (en) * 2023-12-27 2024-05-31 科大讯飞股份有限公司 Image restoration method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114663539B (en) 2023-03-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant