CN114663539A - 2D face restoration technology under mask based on audio drive - Google Patents
- Publication number: CN114663539A
- Application number: CN202210232796.8A
- Authority
- CN
- China
- Prior art keywords
- audio
- training
- network
- emotion
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T11/00—2D [Two Dimensional] image generation
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for estimating an emotional state
Abstract
The invention provides an audio-driven technology for restoring, in 2D, the face occluded by a mask, so that the complete face of a speaker can be realistically recovered. An audio emotion decoupler effectively separates the content and emotion information contained in the audio, avoiding interference between the two and enhancing the realism and naturalness of the generated facial expressions and mouth shapes. Through 2D modeling, the geometric features, facial accessories and skin color of the face are extracted and the three-dimensional Euler angles of the head are estimated, which sidesteps the difficulty and instability of 3D modeling while preserving the quality of the generated talking face.
Description
Technical Field
The invention belongs to the field of face generation, and particularly relates to an audio-driven technology for restoring, in 2D, a face occluded by a mask.
Background
With the spread of the COVID-19 pandemic around the world, wearing a mask when going out has become the norm, especially in public places. Although masks ensure safety, communication is greatly limited when a large part of the face is occluded. According to the McGurk effect, every language people master depends to some extent on visual information for speech perception, and the information conveyed by the other party's lip movements and facial expressions is likewise important during communication.
In the modern era of abundant computing power and advanced technology, many complex tasks once thought impossible have been accomplished. In the field of face generation, a number of successful models and neural network architectures can already generate realistic human faces efficiently. As a popular research direction in recent years, face generation has attracted the attention of many researchers and shows a promising trend of development.
Audio-driven face generation must reproduce not only the true appearance of the target but also the expressions and facial movements made while speaking, and it faces many challenges in practical applications, such as sound source stability, environmental complexity, the realism of the generated face, and the continuity of consecutive frames.
Disclosure of Invention
To solve these problems, the invention discloses an audio-driven 2D technology for restoring the face under a mask, which realistically recovers the complete face of a speaker. An audio emotion decoupler effectively separates the content and emotion information contained in the audio, avoiding interference between the two and enhancing the realism and naturalness of the generated face. Through 2D modeling, the geometric features, facial accessories and skin color of the face are extracted and the three-dimensional Euler angles of the head are estimated, which sidesteps the difficulty and instability of 3D modeling while preserving the quality of the generated talking face.
In order to achieve the purpose, the technical scheme of the invention is as follows:
An audio-driven 2D under-mask face restoration method comprises the following steps:
Step 1: obtain image information of a training video, audio information synchronized with the training video, and a source identity image of the target object;
Step 2: train an audio emotion decoupler by a cross-reconstruction method on the basis of the audio information;
Step 3: generate an identity code from the source identity image through a feature extraction network;
Step 4: perform feature extraction on the image information to obtain the head pose code of each frame;
Step 5: perform feature extraction on the audio information with the audio emotion decoupler to obtain the emotion code of each frame;
Step 6: construct an adversarial generation network used to render images, and train it on the identity code, emotion code and head pose code;
Step 7: obtain the image and audio information of a target video in which the target object wears a mask, repeat steps 3, 4 and 5, use the resulting identity, emotion and head pose codes as conditioning information, and render images with the adversarial generation network to generate the target video.
The step 1 comprises the following steps:
Step 1-1: the training video is a single-person, front-facing, unoccluded talking video. The video is in color; the speaking duration is not limited, though 3-5 minutes is optimal; the resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation.
Step 1-2: the source identity image of the target object is a single color picture with a resolution of 720p or 1080p, in which the person faces the camera directly, the face is unoccluded, and the lighting is good.
The step 2 comprises the following steps:
step 2-1, for the audio information, using a Mel frequency cepstrum coefficient as an audio representation to obtain an audio vector;
Step 2-2: stretch or compress the audio vectors with a dynamic time warping algorithm to obtain equal-length audio training pairs {x_{i,m}, x_{i,n}} with the same content but different emotions, where x_{i,m}, x_{i,n} ∈ X, i denotes the shared content, and m and n denote two different emotions;
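The alignment in step 2-2 can be sketched in numpy as below. This is a minimal dynamic time warping implementation that warps two 1-D feature sequences onto a common length along the optimal alignment path; the function names and the 1-D simplification (real audio vectors would be frame-wise MFCC features) are assumptions for illustration, not the patent's implementation.

```python
import numpy as np

def dtw_path(a, b):
    """Dynamic time warping between two 1-D feature sequences.
    Returns the optimal alignment path as (index_a, index_b) pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_to_common_length(a, b):
    """Stretch/compress both sequences along the DTW path so they share one length,
    yielding an aligned training pair as in step 2-2."""
    path = dtw_path(a, b)
    ia = [p[0] for p in path]
    ib = [p[1] for p in path]
    return a[ia], b[ib]
```

After warping, the two clips have identical length and frame-to-frame content correspondence, which is what makes the cross-reconstruction losses of step 2-3 well defined.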
Step 2-3: train the audio emotion decoupler by the cross-reconstruction method; denote the trained content encoder by E_c, the emotion encoder by E_e, and the decoder by D.
Step 2-1 comprises:
Step 2-1-1: resample the original video audio to a fixed sampling frequency;
Step 2-1-2: compute the frequency-domain features of the resampled audio and represent them as Mel cepstral coefficients.
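Steps 2-1-1 and 2-1-2 can be sketched as follows: a self-contained numpy/scipy pipeline that resamples audio to a fixed rate and computes MFCC frames. The specific parameter values (16 kHz target rate, 512-point FFT, 26 mel filters, 13 coefficients) are common defaults assumed for illustration; the patent does not specify them.

```python
import numpy as np
from scipy.signal import resample_poly
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(audio, orig_sr, target_sr=16000, n_fft=512, hop=160, n_filters=26, n_ceps=13):
    """Step 2-1: resample to a fixed rate, then compute one MFCC vector per frame."""
    audio = resample_poly(audio, target_sr, orig_sr)  # fixed sampling frequency
    n_frames = 1 + max(0, len(audio) - n_fft) // hop
    window = np.hamming(n_fft)
    fb = mel_filterbank(n_filters, n_fft, target_sr)
    feats = []
    for t in range(n_frames):
        frame = audio[t * hop: t * hop + n_fft] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        mel_energy = np.log(fb @ power + 1e-10)
        feats.append(dct(mel_energy, norm='ortho')[:n_ceps])
    return np.array(feats)  # shape (n_frames, n_ceps)
```

The resulting matrix of per-frame coefficients is the "audio vector" representation that steps 2-2 and 2-3 operate on.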
The step 2-3 comprises the following steps:
Step 2-3-1: the codec cross-reconstruction loss, as in equation (1). The content encoder extracts the content code of x_{i,m} and the emotion encoder extracts the emotion code of x_{j,n}; the decoder combines them into a reconstructed audio x'_{i,n}, the loss against x_{i,n} is computed and backpropagated to train the codec network, which enforces the independence of the emotion code and the content code.
L_cross = ||D(E_c(x_{i,m}), E_e(x_{j,n})) - x_{i,n}||_2 + ||D(E_c(x_{j,n}), E_e(x_{i,m})) - x_{j,m}||_2    (1)
Step 2-3-2, the codec reconstructs the loss itself, as in equation (2). And combining emotion encoding and content encoding of the same section of audio, calculating loss according to the new audio and the original audio obtained after decoding by a decoder, and reversely propagating and training the encoding and decoding network to ensure the integrity of encoding.
Lself=||D(Ec(xi,m),Ee(xi,m))-xi,m||2+||D(Ec(xj,n),Ee(xj,n))-xj,n||2 (2)
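Equations (1) and (2) can be sketched numerically as follows. The encoders E_c, E_e and decoder D are stand-in random linear maps here, purely so the two loss terms are computable; the real decoupler uses trained neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODE = 20, 8

# Stand-in linear encoders/decoder (assumption for illustration only).
Wc = rng.normal(size=(CODE, DIM))      # content encoder E_c
We = rng.normal(size=(CODE, DIM))      # emotion encoder E_e
Wd = rng.normal(size=(DIM, 2 * CODE))  # decoder D

def E_c(x): return Wc @ x
def E_e(x): return We @ x
def D(content, emotion): return Wd @ np.concatenate([content, emotion])

def cross_reconstruction_loss(x_im, x_jn, x_in, x_jm):
    """Equation (1): swap emotion codes between two clips with contents i, j
    and emotions m, n; the decoder should reproduce x_{i,n} and x_{j,m}."""
    t1 = np.linalg.norm(D(E_c(x_im), E_e(x_jn)) - x_in)
    t2 = np.linalg.norm(D(E_c(x_jn), E_e(x_im)) - x_jm)
    return t1 + t2

def self_reconstruction_loss(x_im, x_jn):
    """Equation (2): content and emotion codes of the same clip must
    decode back to that clip."""
    t1 = np.linalg.norm(D(E_c(x_im), E_e(x_im)) - x_im)
    t2 = np.linalg.norm(D(E_c(x_jn), E_e(x_jn)) - x_jn)
    return t1 + t2
```

During training both losses would be minimized jointly: the cross term forces the two codes to carry disjoint information, while the self term prevents either code from discarding information needed to reconstruct the audio.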
Step 3 outputs K three-dimensional points forming a point cloud that encodes the geometric features, facial accessories and skin color of the face.
Step 4-1: perform face key point detection on each frame, obtaining the two-dimensional key points of the eye region that remain visible under the mask. The key point detection network consists of convolutional layers, bottleneck layers, further convolutional layers, and fully-connected layers.
Step 4-2: feed the obtained two-dimensional face key points into a pose estimation network to estimate the three-dimensional Euler angles of the face, obtaining a 12-dimensional head pose code comprising a 9-dimensional rotation code and a 3-dimensional translation code. The pose estimation network consists of convolutional layers and fully-connected layers.
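A plausible reading of the 12-dimensional pose code (9 rotation entries + 3 translation entries) is a flattened 3x3 rotation matrix concatenated with a translation vector. The sketch below builds such a code from Euler angles; the Z-Y-X rotation order is an assumption, since the patent does not state the convention.

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """3-D Euler angles (radians) to a 3x3 rotation matrix, R = Rz @ Ry @ Rx."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

def head_pose_code(yaw, pitch, roll, translation):
    """12-D pose code: 9 rotation-matrix entries (flattened) + 3 translation entries."""
    R = euler_to_rotation(yaw, pitch, roll)
    return np.concatenate([R.ravel(), np.asarray(translation, dtype=float)])
```

Flattening the rotation matrix instead of keeping raw Euler angles avoids angle-wrapping discontinuities when the code is fed to the generation network.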
Step 5: feed the audio information into the trained audio emotion decoupler to obtain the emotion codes.
The step 6 comprises the following steps:
Step 6-1: self-evaluation of the adversarial generation network. Because the network adopts a progressively growing structure, the generator output and the training image are each turned into a series of downsampled images by Gaussian smoothing and subsampling, forming a Gaussian pyramid; a loss is computed at each level n of the pyramid and backpropagated to train the generator at the corresponding resolution,
where m denotes the number of samples, Disc is the discriminator, img_generated is the image produced by the generator, and img_real is the image information of the training video;
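The pyramid construction of step 6-1 can be sketched as below. The per-level loss here is a plain L1 distance between matched levels, used only to make the sketch self-contained; the patent's actual per-level term involves the discriminator Disc, whose formula is not reproduced in this text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(img, levels):
    """Repeated Gaussian smoothing + 2x subsampling, as in step 6-1."""
    pyr = [img]
    for _ in range(levels - 1):
        img = gaussian_filter(img, sigma=1.0)[::2, ::2]
        pyr.append(img)
    return pyr

def pyramid_loss(generated, real, levels=4):
    """Sum of mean L1 distances over matching pyramid levels (illustrative
    stand-in for the per-resolution loss of the progressive GAN)."""
    pg = gaussian_pyramid(generated, levels)
    pr = gaussian_pyramid(real, levels)
    return sum(np.mean(np.abs(g - r)) for g, r in zip(pg, pr))
```

Computing a loss at every resolution is what lets the progressively growing generator receive a training signal at each stage of its growth.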
Step 6-2: self-evaluation of the feature extraction network.
Step 6-2 comprises:
Step 6-2-1: feature evaluation of the feature extraction network. The feature extraction loss is computed from the pixels at the same positions in the generated image and the training video image and backpropagated to train the feature extraction network, with the loss function given in equation (4),
where H is the height of the input training video image, W is its width, Gen denotes the generated image, and Real the input training video image.
Step 6-2-2: point cloud distribution evaluation of the feature extraction network. To avoid information redundancy in the three-dimensional point cloud, a distance loss is computed from the Euclidean distances between the K points of the point cloud U and backpropagated to train the feature extraction network, with the loss function given in equation (5).
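One way the anti-redundancy term of step 6-2-2 could look is a hinge-style spread loss that penalizes point pairs closer than some margin, so the K points of the cloud U do not collapse onto each other. The hinge form and the margin value are assumptions; equation (5) itself is not reproduced in this text.

```python
import numpy as np

def pointcloud_spread_loss(U, margin=0.1):
    """Penalize point pairs of the cloud U (shape (K, 3)) that lie closer
    than `margin`, discouraging redundant, overlapping points."""
    K = U.shape[0]
    diff = U[:, None, :] - U[None, :, :]         # (K, K, 3) pairwise differences
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)  # pairwise Euclidean distances
    mask = ~np.eye(K, dtype=bool)                # ignore self-distances
    return np.maximum(margin - dist[mask], 0.0).mean()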
Step 6-3: pose evaluation of the adversarial generation network. The Euclidean distance between the head pose codes of the generated face and of the face in the corresponding training video frame is computed and backpropagated as the pose loss to train the generation network;
Step 6-4: emotion evaluation of the adversarial generation network. Two-dimensional key points are extracted from the generated face and from the face in the corresponding training video frame, yielding the key point sets P_gen and P_real; the emotion loss is computed from P_gen and P_real as in equation (6) and backpropagated to train the generation network;
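Steps 6-3 and 6-4 can be sketched together as below. The mean-per-point form of the key point loss is an assumed instantiation of equation (6), which is not reproduced in this text; the pose loss follows the Euclidean-distance description of step 6-3 directly.

```python
import numpy as np

def pose_loss(pose_gen, pose_real):
    """Step 6-3: Euclidean distance between two 12-D head pose codes."""
    return np.linalg.norm(pose_gen - pose_real)

def emotion_keypoint_loss(P_gen, P_real):
    """Step 6-4 (assumed form): mean Euclidean distance between matched 2-D
    key points of the generated face and the training-video face."""
    return np.mean(np.linalg.norm(P_gen - P_real, axis=1))
```

Both terms are simple distances, so they backpropagate directly through the generator's pose and expression pathways without needing a discriminator.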
the step 7 comprises the following steps:
Step 7-1: the acquired condition video is a single-person talking video in which the person wears a mask. The video is in color; the speaking duration is not limited; the resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation.
Step 7-2: generate an identity code from the source identity image through the feature extraction network;
Step 7-3: perform feature extraction on the image information to obtain the head pose code of each frame;
Step 7-4: perform feature extraction on the audio information with the audio emotion decoupler to obtain the emotion code of each frame;
Step 7-5: use the obtained identity, emotion and head pose codes as conditioning information and render images with the adversarial generation network to generate the target image under the current view angle and audio conditions.
The invention has the beneficial effects that:
the 2D face restoration technology under the mask based on the audio drive is used for really restoring the complete face of a speaker. Through the audio decoupler, the content and the emotion information contained in the audio are effectively separated, the interference between the content and the emotion information is avoided, and the authenticity and the naturalness of the generated face are effectively enhanced. Through 2D modeling, the geometric features, the face decoration and the skin color features of the human face are extracted, the three-dimensional Euler angle of the head is estimated, and the difficulty and the instability of 3D modeling are effectively solved while the generation of the speaking human face effect is ensured.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a training method of an audio emotion decoupler;
FIG. 3 is a schematic diagram of a network for extracting eye key points and estimating head pose under a mask;
FIG. 4 is a schematic diagram of the adversarial generation network.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and the detailed description, which should be understood as illustrative only and not limiting in scope.
The application discloses a real-time audio-driven face generation method: given a talking-face video in which the speaker wears a mask, a high-quality audio-driven video of the restored face under the mask is generated using an audio emotion decoupler with an encoder-decoder structure, a masked-state head pose estimation network with a MobileNet regressor as its backbone, and a style-based generative network with an adversarial architecture.
Illustratively, a front-facing, well-lit, unoccluded color photograph of a person's face is given as the identity source image. The acquired target video is a single-person talking video in which the target object wears a mask; the video is in color, the resolution is 720P, 1080P, 2K or 4K, the frame rate is 30 frames/second, the audio bit rate is 128 kb/s, and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation. The feature extraction network acquires identity features from the source image, the audio emotion decoupler decouples emotion codes from the audio information of the target video, and the head pose network extracts head pose codes from the image information of the target video. Finally, a talking-face video is generated that matches the identity of the person in the photograph and the unoccluded facial expressions and head poses of the person in the target video.
As shown in fig. 1, the flow of the method of the invention comprises: train an audio emotion decoupler by the cross-reconstruction method; generate an identity code from the source identity image through the feature extraction network; acquire the image information of the masked target video and extract the head pose code of each frame with the head pose estimation network; acquire the audio information of the masked target video and extract the emotion code of each frame with the audio emotion decoupler; and feed the identity, emotion and head pose codes into the adversarial generation network to generate the talking-face video.
(1) Acquiring image information of a training video, audio information synchronized with the training video and a source identity image of a target object, wherein the requirements are as follows:
(11) The training video is a single-person, front-facing, unoccluded talking video. The video is in color; the speaking duration is not limited, though 3-5 minutes is optimal; the resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation.
(12) The source identity image of the target object is a single color picture with a resolution of 720p or 1080p, in which the person faces the camera directly, the face is unoccluded, and the lighting is good.
(2) On the basis of the audio information of the training video, the audio emotion decoupler is trained by the cross-reconstruction method as follows:
(21) for the audio information, using a Mel frequency cepstrum coefficient as an audio representation to obtain an audio vector;
(211) resampling the original video and audio to a fixed sampling frequency;
(212) Compute the frequency-domain features of the resampled audio and represent them as Mel cepstral coefficients.
(22) Stretch or compress the audio vectors with a dynamic time warping algorithm to obtain equal-length audio training pairs {x_{i,m}, x_{i,n}} with the same content but different emotions, where x_{i,m}, x_{i,n} ∈ X, i denotes the shared content, and m and n denote two different emotions;
(23) Train the audio emotion decoupler by the cross-reconstruction method; denote the trained content encoder by E_c, the emotion encoder by E_e, and the decoder by D.
(231) The codec cross-reconstruction loss, as in equation (1). The content encoder extracts the content code of x_{i,m} and the emotion encoder extracts the emotion code of x_{j,n}; the decoder combines them into a reconstructed audio x'_{i,n}, the loss against x_{i,n} is computed and backpropagated to train the codec network, which enforces the independence of the emotion code and the content code.
L_cross = ||D(E_c(x_{i,m}), E_e(x_{j,n})) - x_{i,n}||_2 + ||D(E_c(x_{j,n}), E_e(x_{i,m})) - x_{j,m}||_2    (1)
(232) The codec self-reconstruction loss, as in equation (2). The emotion code and content code of the same audio clip are combined; the loss between the audio decoded from them and the original audio is computed and backpropagated to train the codec network, which ensures the completeness of the codes.
L_self = ||D(E_c(x_{i,m}), E_e(x_{i,m})) - x_{i,m}||_2 + ||D(E_c(x_{j,n}), E_e(x_{j,n})) - x_{j,n}||_2    (2)
The stability of publicly available three-dimensional key point algorithms is difficult to guarantee, and their demands on equipment and computing power are relatively strict. To avoid these problems, modeling is performed in two dimensions, and the goal of generating a three-dimensional point cloud is reached through unsupervised learning. The corresponding feature extraction network is trained simultaneously with the adversarial generation network.
(3) Generating an identity code from a source identity image through a feature extraction network;
(4) To cope with the large-area occlusion caused by the mask, a masked-state head pose estimation network with a MobileNet regressor as its backbone is used to obtain the head pose code; the network structure is shown in fig. 3.
(41) Perform face key point detection on each frame, obtaining the two-dimensional key points of the eye region that remain visible under the mask. The key point detection network consists of convolutional layers, bottleneck layers, further convolutional layers, and fully-connected layers;
(42) Feed the two-dimensional face key points into a pose estimation network to estimate the three-dimensional Euler angles of the face, obtaining a 12-dimensional head pose code comprising a 9-dimensional rotation code and a 3-dimensional translation code. The pose estimation network consists of convolutional layers and fully-connected layers;
(5) and inputting audio information of a training video through the audio emotion decoupler obtained through training to obtain emotion codes.
(6) An adversarial generation network used to render images is constructed and trained on the identity code, emotion code and head pose code. As shown in fig. 4, A is the identity code, B is the head pose code, and C is the emotion code. The training process is as follows:
(61) Self-evaluation of the adversarial generation network. Because the network adopts a progressively growing structure, the generator output and the training image are each turned into a series of downsampled images by Gaussian smoothing and subsampling, forming a Gaussian pyramid; a loss is computed at each level n of the pyramid and backpropagated to train the generator at the corresponding resolution,
where m denotes the number of samples, Disc is the discriminator, img_generated is the image produced by the generator, and img_real is the image information of the training video;
(62) Self-evaluation of the feature extraction network.
(621) Feature evaluation of the feature extraction network. The feature extraction loss is computed from the pixels at the same positions in the generated image and the training video image and backpropagated to train the feature extraction network, with the loss function given in equation (4),
where H is the height of the input training video image, W is its width, Gen denotes the generated image, and Real the input training video image.
(622) Point cloud distribution evaluation of the feature extraction network. To avoid information redundancy in the three-dimensional point cloud, a distance loss is computed from the Euclidean distances between the K points of the point cloud U and backpropagated to train the feature extraction network, with the loss function given in equation (5).
(63) Pose evaluation of the adversarial generation network. The Euclidean distance between the head pose codes of the generated face and of the face in the corresponding training video frame is computed and backpropagated as the pose loss to train the generation network;
(64) Emotion evaluation of the adversarial generation network. Two-dimensional key points are extracted from the generated face and from the face in the corresponding training video frame, yielding the key point sets P_gen and P_real; the emotion loss is computed from P_gen and P_real as in equation (6) and backpropagated to train the generation network;
(7) Obtain the image and audio information of a target video in which the target object wears a mask, repeat steps 3, 4 and 5, use the resulting identity, emotion and head pose codes as conditioning information, and render images with the adversarial generation network to generate the target video:
(71) The acquired condition video is a single-person talking video in which the person wears a mask. The video is in color; the speaking duration is not limited; the resolution is 720P or 1080P; the frame rate is 25 frames/second; the audio bit rate is 128 kb/s; and the audio sampling rate is 44100 Hz. Among these attributes, all except the duration and resolution may be adjusted to the actual situation.
(72) Generate an identity code from the source identity image through the feature extraction network;
(73) Perform feature extraction on the image information to obtain the head pose code of each frame;
(74) Perform feature extraction on the audio information with the audio emotion decoupler to obtain the emotion code of each frame;
(75) Use the obtained identity, emotion and head pose codes as conditioning information and render images with the adversarial generation network to generate the target image under the current view angle and audio conditions.
It should be noted that the above description only illustrates the technical idea of the invention and does not thereby limit its scope of protection; it will be obvious to those skilled in the art that several modifications and improvements can be made without departing from the principle of the invention, and such modifications and improvements fall within the scope of protection of the claims of the invention.
Claims (5)
1. An audio-driven 2D under-mask face restoration method, characterized by comprising the following steps:
step 1, acquiring image information of a training video, audio information synchronous with the training video and a source identity image of a target object;
step 2, training an audio emotion decoupler by a cross reconstruction method on the basis of the audio information;
step 3, generating an identity code from the source identity image through a feature extraction network;
step 4, extracting the features of the image information to obtain the head posture code of each frame of image;
step 5, performing feature extraction on the audio information with the audio emotion decoupler to obtain the emotion code of each frame;
step 6, constructing an adversarial generation network used to render images; training the adversarial generation network on the identity code, emotion code and head pose code;
step 7, obtaining the image and audio information of a target video in which the target object wears a mask, repeating steps 3, 4 and 5, using the resulting identity, emotion and head pose codes as conditioning information, and rendering images with the adversarial generation network to generate the target video.
2. The audio-driven 2D under-mask face restoration method according to claim 1, characterized in that the audio emotion decoupler of step 2 is obtained by the following steps:
step 2-1, for the audio information, using Mel-frequency cepstral coefficients (MFCC) as the audio representation to obtain audio vectors;
step 2-2, stretching or shrinking the audio vectors by a dynamic time warping algorithm to obtain audio training pairs {x_{i,m}, x_{i,n}} of the same length and the same content but different emotions, where x_{i,m}, x_{i,n} ∈ X, i denotes the content, and m and n denote two different emotions;
step 2-3, training the audio emotion decoupler by the cross-reconstruction method, denoting the trained content encoder as E_c, the emotion encoder as E_e, and the decoder as D.
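The alignment of step 2-2 can be sketched with a plain dynamic-time-warping pass over two feature sequences. The implementation below is a generic textbook DTW in NumPy, not the patent's exact algorithm:

```python
import numpy as np

def dtw_align(a, b):
    """Align two feature sequences (frames x dims) by dynamic time
    warping and return equal-length warped copies of both (step 2-2)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the optimal warping path from the end.
    path, i, j = [], n, m
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        if i == 1:
            j -= 1
        elif j == 1:
            i -= 1
        else:
            step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
            if step == 0:
                i -= 1
                j -= 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
    path.append((0, 0))
    path.reverse()
    aligned_a = np.array([a[p] for p, _ in path])
    aligned_b = np.array([b[q] for _, q in path])
    return aligned_a, aligned_b

# Two clips of different lengths come out with a common length.
aligned_a, aligned_b = dtw_align(np.arange(5.0).reshape(-1, 1),
                                 np.arange(8.0).reshape(-1, 1))
print(len(aligned_a), len(aligned_b))
```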
3. The 2D face restoration technology under a mask based on audio driving according to claim 2, wherein training the audio emotion decoupler by the cross-reconstruction method in step 2-3 comprises the following steps:
step 2-3-1, codec cross-reconstruction loss, as in formula (1): combine the content encoding of x_{i,m} obtained by the content encoder with the emotion encoding of x_{j,n} obtained by the emotion encoder, decode them with the decoder into a reconstructed audio x'_{i,n}, calculate the loss against x_{i,n}, and back-propagate it to train the codec network, ensuring the independence of the emotion encoding and the content encoding;
L_cross = ||D(E_c(x_{i,m}), E_e(x_{j,n})) − x_{i,n}||² + ||D(E_c(x_{j,n}), E_e(x_{i,m})) − x_{j,m}||²   (1)
step 2-3-2, codec self-reconstruction loss, as in formula (2): combine the emotion encoding and the content encoding of the same audio clip, calculate the loss between the decoded audio and the original audio, and back-propagate it to train the codec network, ensuring the completeness of the encodings;
L_self = ||D(E_c(x_{i,m}), E_e(x_{i,m})) − x_{i,m}||² + ||D(E_c(x_{j,n}), E_e(x_{j,n})) − x_{j,n}||²   (2).
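Formulas (1) and (2) can be sketched with stand-in linear encoders and decoder; the point is the pairing of codes, not the network internals (all networks below are random dummies, not the patented architectures):

```python
import numpy as np

rng = np.random.default_rng(0)
Wc = rng.normal(size=(8, 32))   # dummy content encoder weights
We = rng.normal(size=(8, 32))   # dummy emotion encoder weights
Wd = rng.normal(size=(32, 16))  # dummy decoder weights

E_c = lambda x: Wc @ x                           # content encoder E_c
E_e = lambda x: We @ x                           # emotion encoder E_e
D = lambda c, e: Wd @ np.concatenate([c, e])     # decoder D

def cross_loss(x_im, x_jn, x_in, x_jm):
    """Formula (1): swapped code pairs must reconstruct the swapped targets."""
    t1 = np.sum((D(E_c(x_im), E_e(x_jn)) - x_in) ** 2)
    t2 = np.sum((D(E_c(x_jn), E_e(x_im)) - x_jm) ** 2)
    return t1 + t2

def self_loss(x_im, x_jn):
    """Formula (2): each clip's own codes must reconstruct the clip itself."""
    t1 = np.sum((D(E_c(x_im), E_e(x_im)) - x_im) ** 2)
    t2 = np.sum((D(E_c(x_jn), E_e(x_jn)) - x_jn) ** 2)
    return t1 + t2

x_im, x_jn, x_in, x_jm = (rng.normal(size=32) for _ in range(4))
loss_c = cross_loss(x_im, x_jn, x_in, x_jm)
loss_s = self_loss(x_im, x_jn)
```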
4. The 2D face restoration technology under a mask based on audio driving according to claim 1, wherein the head pose code of step 4 is obtained by the following steps:
step 4-1, performing face keypoint detection on each frame to obtain the two-dimensional keypoints of the eye region, which remain visible under the occlusion of the mask; the keypoint detection network involved comprises a convolutional layer, a bottleneck layer, a convolutional layer and a fully connected layer;
step 4-2, inputting the obtained two-dimensional face keypoints into a pose estimation network to estimate the three-dimensional Euler angles of the face and obtain a 12-dimensional head pose code comprising a 9-dimensional rotation code and a 3-dimensional offset code; the pose estimation network involved comprises convolutional layers and fully connected layers.
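The 12-dimensional code of step 4-2 (9 rotation entries plus a 3-dimensional offset) can be formed by flattening the rotation matrix built from the estimated Euler angles; a NumPy sketch (the Z-Y-X composition order is an assumption, the claim does not fix a convention):

```python
import numpy as np

def pose_code(yaw, pitch, roll, offset):
    """Build the 12-D head pose code of step 4-2: a flattened 3x3
    rotation matrix (9 values) followed by a 3-D offset."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    R = Rz @ Ry @ Rx
    return np.concatenate([R.ravel(), np.asarray(offset, dtype=float)])

code = pose_code(0.1, -0.05, 0.0, [1.0, 2.0, 0.5])
print(code.shape)  # (12,)
```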
5. The 2D face restoration technology under a mask based on audio driving according to claim 1, wherein the generative adversarial network of step 6 is constructed by the following steps:
step 6-1, self-evaluation of the generative adversarial network; because the generative adversarial network adopts a progressive-growing structure, the image produced by the generator and the image used for training are each Gaussian-smoothed and subsampled into a series of down-sampled images forming a Gaussian pyramid; the loss is calculated separately for each pyramid level n and back-propagated through the loss function of formula (3) to train the generator network at the corresponding resolution,
where m denotes the number of samples, Disc is the discriminator, img_generated is the image produced by the generator, and img_real is the image information of the training video;
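The multi-resolution comparison of step 6-1 rests on a standard Gaussian pyramid: smooth, then subsample by 2 at each level. A self-contained NumPy sketch (the 5-tap binomial kernel is a common choice, not specified by the claim):

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
KERNEL /= KERNEL.sum()

def blur(img):
    """Separable Gaussian smoothing with reflect padding."""
    for axis in (0, 1):
        img = np.apply_along_axis(
            lambda r: np.convolve(np.pad(r, 2, mode="reflect"), KERNEL, mode="valid"),
            axis, img)
    return img

def gaussian_pyramid(img, levels):
    """Return [img, img/2, img/4, ...]: the per-resolution images used to
    compare the generated image and the training image (step 6-1)."""
    pyr = [img]
    for _ in range(levels - 1):
        img = blur(img)[::2, ::2]
        pyr.append(img)
    return pyr

pyr = gaussian_pyramid(np.random.rand(64, 64), 4)
print([p.shape for p in pyr])  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```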
step 6-2, self-evaluation of the feature extraction network, comprising the following steps:
step 6-2-1, feature evaluation of the feature extraction network; calculate the feature extraction network loss from the pixels of the generated image and of the training video image at the same positions, and back-propagate it to train the feature extraction network, with the loss function given by formula (4),
where H is the height of the input training video image, W is its width, Gen denotes the generated image, and Real denotes the input training video image;
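A straightforward reading of step 6-2-1 is a per-pixel squared difference normalized by the image size H·W; the exact normalization of formula (4) is an assumption here:

```python
import numpy as np

def pixel_loss(gen, real):
    """Squared per-pixel difference between the generated image and the
    training frame, averaged over the H*W pixels (step 6-2-1)."""
    h, w = real.shape[:2]
    return float(np.sum((gen - real) ** 2) / (h * w))

a = np.zeros((4, 4))
b = np.ones((4, 4))
print(pixel_loss(a, b))  # 1.0
```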
step 6-2-2, point cloud distribution evaluation of the feature extraction network; to avoid information redundancy in the three-dimensional point cloud, a distance loss is calculated from the Euclidean distances among the K points in the point cloud U and back-propagated to train the feature extraction network, with the loss function given by formula (5);
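Step 6-2-2 penalizes redundancy through the pairwise Euclidean distances among the K points. One plausible form (an assumption, since formula (5) is not reproduced on this page) is a repulsion term that grows as points collapse onto each other:

```python
import numpy as np

def point_cloud_loss(U, eps=1e-8):
    """Repulsion-style distance loss over a point cloud U of shape (K, 3):
    small pairwise distances are penalized, discouraging redundant points
    (one possible reading of step 6-2-2)."""
    diff = U[:, None, :] - U[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    iu = np.triu_indices(len(U), k=1)   # each unordered pair once
    return float(np.mean(1.0 / (dist[iu] + eps)))

spread = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
tight = spread * 0.01
```

A tightly clustered cloud incurs a larger loss than a well-spread one, which is the redundancy-avoidance behavior the claim describes.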
step 6-3, pose evaluation of the generative adversarial network; calculate the Euclidean distance between the head pose code of the generated face and that of the face in the corresponding frame of the training video image information, and back-propagate it as the pose loss to train the generator network;
step 6-4, emotion evaluation of the generative adversarial network; extract two-dimensional keypoints from the generated face and from the face in the corresponding frame of the training video image information to obtain the keypoint sets P_gen and P_real, calculate the emotion loss from P_gen and P_real as in formula (6), and back-propagate it to train the generator network.
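A straightforward reading of the emotion loss in step 6-4 is the mean Euclidean distance between corresponding keypoints of the two sets (the exact form of formula (6) is an assumption here):

```python
import numpy as np

def emotion_loss(P_gen, P_real):
    """Mean Euclidean distance between corresponding 2-D keypoints of the
    generated face and the real face (step 6-4)."""
    return float(np.mean(np.linalg.norm(P_gen - P_real, axis=1)))

P_gen = np.array([[0.0, 0.0], [1.0, 1.0]])
P_real = np.array([[3.0, 4.0], [1.0, 1.0]])
print(emotion_loss(P_gen, P_real))  # 2.5
```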
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210232796.8A CN114663539B (en) | 2022-03-09 | 2022-03-09 | 2D face restoration technology under mask based on audio drive |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114663539A true CN114663539A (en) | 2022-06-24 |
CN114663539B CN114663539B (en) | 2023-03-14 |
Family
ID=82028539
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210232796.8A Active CN114663539B (en) | 2022-03-09 | 2022-03-09 | 2D face restoration technology under mask based on audio drive |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114663539B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020168731A1 (en) * | 2019-02-19 | 2020-08-27 | 华南理工大学 | Generative adversarial mechanism and attention mechanism-based standard face generation method |
CN111783603A (en) * | 2020-06-24 | 2020-10-16 | 有半岛(北京)信息科技有限公司 | Training method for generating confrontation network, image face changing method and video face changing method and device |
CN111797897A (en) * | 2020-06-03 | 2020-10-20 | 浙江大学 | Audio face image generation method based on deep learning |
CN113793408A (en) * | 2021-09-15 | 2021-12-14 | 宿迁硅基智能科技有限公司 | Real-time audio-driven face generation method and device and server |
WO2021254499A1 (en) * | 2020-06-19 | 2021-12-23 | 北京灵汐科技有限公司 | Editing model generation method and apparatus, face image editing method and apparatus, device, and medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115100329A (en) * | 2022-06-27 | 2022-09-23 | 太原理工大学 | Multi-mode driving-based emotion controllable facial animation generation method |
CN117474807A (en) * | 2023-12-27 | 2024-01-30 | 科大讯飞股份有限公司 | Image restoration method, device, equipment and storage medium |
CN117474807B (en) * | 2023-12-27 | 2024-05-31 | 科大讯飞股份有限公司 | Image restoration method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114663539B (en) | 2023-03-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113378697B (en) | Method and device for generating speaking face video based on convolutional neural network | |
CN112887698B (en) | High-quality face voice driving method based on nerve radiation field | |
CN111145322B (en) | Method, apparatus, and computer-readable storage medium for driving avatar | |
CN114663539B (en) | 2D face restoration technology under mask based on audio drive | |
Jonell et al. | Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings | |
CN113269872A (en) | Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization | |
CN108288072A (en) | A kind of facial expression synthetic method based on generation confrontation network | |
CN115914505B (en) | Video generation method and system based on voice-driven digital human model | |
CN112785671B (en) | Virtual dummy face animation synthesis method | |
CN110796593A (en) | Image processing method, device, medium and electronic equipment based on artificial intelligence | |
CN115908659A (en) | Method and device for synthesizing speaking face based on generation countermeasure network | |
CN113470170A (en) | Real-time video face region space-time consistent synthesis method using voice information | |
Rebol et al. | Passing a non-verbal turing test: Evaluating gesture animations generated from speech | |
CN113838173A (en) | Virtual human head motion synthesis method driven by voice and background sound | |
Li et al. | Buccal: Low-cost cheek sensing for inferring continuous jaw motion in mobile virtual reality | |
Wang et al. | 3d-talkemo: Learning to synthesize 3d emotional talking head | |
CN108908353B (en) | Robot expression simulation method and device based on smooth constraint reverse mechanical model | |
Tang et al. | Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar | |
Hill et al. | Range-and domain-specific exaggeration of facial speech | |
WO2024124680A1 (en) | Speech signal-driven personalized three-dimensional facial animation generation method, and application thereof | |
Yi et al. | Predicting personalized head movement from short video and speech signal | |
CN115984452A (en) | Head three-dimensional reconstruction method and equipment | |
Tin | Facial extraction and lip tracking using facial points | |
CN113343761A (en) | Real-time facial expression migration method based on generation confrontation | |
Kumar et al. | Multi modal adaptive normalization for audio to video generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||