CN114155139B - Deepfake generation method based on vector discretization representation - Google Patents
- Publication number
- CN114155139B (application CN202111400589.0A)
- Authority
- CN
- China
- Prior art keywords
- video frame
- picture
- vector
- face
- target video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06T3/04
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045 — Neural networks; combinations of networks
- G06N3/048 — Neural networks; activation functions
- G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent
- G06T5/73
- G06T9/002 — Image coding using neural networks
- G06T2207/10016 — Video; image sequence
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20221 — Image fusion; image merging
- G06T2207/30201 — Face
Abstract
A deepfake generation method based on vector discretization representation extracts video frames from a source video and a target video, then sequentially applies face detection, face alignment, a trained face-swap network, face sharpening and fusion, and video frame combination to the source video frames to obtain the final result. The method converts the encoding result into a discrete vector representation, which reduces the artifacts generated during the face-swapping operation. Meanwhile, a discriminator added during training makes the details of the decoded pictures clearer and their quality more stable.
Description
Technical Field
The invention relates to the field of face swapping in videos, and in particular to a deepfake generation method based on vector discretization representation.
Background
With the development of deep learning techniques and the flood of personal media data in public network environments, many fake face-swapped videos have been produced. The technique of generating fake face-swapped videos with deep learning is called Deepfake. Specifically, the technique replaces the face in a source video with the face from a target video, ensuring that the swapped face keeps the source face's attribute information (expression, illumination, background, etc.) and the target face's identity information. Current generation techniques are mainly realized with autoencoders and generative adversarial networks.
Autoencoder-based generation uses a common encoder and two decoders, one for the source and one for the target. During training, the source face pictures from the source video frames and the target face pictures from the target video frames are put into the common encoder to extract general facial features, which are then reconstructed into source and target face pictures by their respective decoders. Face swapping puts a source face picture into the trained shared encoder and outputs it from the target face's decoder to obtain the face-swap result. However, faces synthesized by the autoencoder are all manipulated at the pixel level, so the whole process produces artifacts, and the synthesized images lack detail sharpness and stable quality.
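For illustration, the shared-encoder/two-decoder wiring described above can be sketched with linear stand-ins (a hypothetical sketch: the dimensions, weight matrices, and function names below are illustrative and are not part of the patent, which uses convolutional networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the shared encoder E and the two
# decoders G_s, G_t; real systems use convolutional networks, but the
# wiring of training and face swapping is the same.
D_IMG, D_LAT = 64, 16
W_enc = rng.normal(size=(D_LAT, D_IMG))      # shared encoder
W_dec_src = rng.normal(size=(D_IMG, D_LAT))  # source-identity decoder
W_dec_tgt = rng.normal(size=(D_IMG, D_LAT))  # target-identity decoder

def encode(x):
    return W_enc @ x                 # common feature extraction

def decode_src(z):
    return W_dec_src @ z             # reconstruct a source-identity face

def decode_tgt(z):
    return W_dec_tgt @ z             # reconstruct a target-identity face

# Training reconstructs each face through its own decoder...
src_face = rng.normal(size=D_IMG)
recon_src = decode_src(encode(src_face))

# ...while face swapping routes a source face through the TARGET decoder.
swapped = decode_tgt(encode(src_face))
```

The key design point is that only the decoders differ per identity; the encoder's weights are shared so that both identities map into one common feature space.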
Disclosure of Invention
To overcome the defects of the above technology, the invention provides a deepfake generation method based on vector discretization representation, which reduces the artifacts generated during the face-swapping operation.
The technical scheme adopted by the invention for overcoming the technical problems is as follows:
a deepfake generation method based on vector discretization representation comprises the following steps:
a) extracting frames of a source video and a target video, and identifying and aligning human faces in the frames of the source video and the target video;
b) establishing a network model, and optimizing the network model by using a loss function;
c) sequentially passing aligned face pictures in a source video frame through an encoder, a discrete vector embedding unit and a decoder to obtain a face-changed picture;
d) sharpening and fusing the face-changed picture, and putting the sharpened and fused picture into a video frame;
e) repeating the steps c) -d), and combining the video frames into a final video.
Further, step a) comprises the following steps:
a-1) using the multimedia processing tool ffmpeg, extract source video frames Frame_s from the source video V_s and target video frames Frame_t from the target video V_t;
a-2) from the source video frames Frame_s and the target video frames Frame_t, crop out the face pictures P_s^face_detection and P_t^face_detection respectively using the S3FD face detection algorithm;
a-3) align the facial feature points of the face pictures P_s^face_detection and P_t^face_detection using the 2DFAN face alignment algorithm to obtain the aligned face pictures P_s^face_align and P_t^face_align, where P_s^i is the i-th aligned face picture in P_s^face_align and P_t^j is the j-th aligned face picture in P_t^face_align.
Further, step b) comprises the following steps:
b-1) establish a network model composed of an encoder E, a source video frame picture decoder G_s, a target video frame picture decoder G_t, a source video frame picture discrete vector embedding unit E_s, a target video frame picture discrete vector embedding unit E_t, a source video frame picture discriminator D_s, and a target video frame picture discriminator D_t. The encoder E consists, in sequence, of 2 residual units and 4 downsampling convolutional layers; the decoders G_s and G_t each consist, in sequence, of 2 residual units and 4 upsampling convolutional layers; the embedding units E_s and E_t each consist, in sequence, of 2 residual units, the AttnBlock module of the Transformer model, and a dictionary vector Embedding function; the discriminators D_s and D_t each consist, in sequence, of 2 convolution-plus-activation layers, 3 convolution-plus-activation-plus-batch-normalization layers, and 2 convolution-plus-activation layers;
b-2) input P_s^i and P_t^j into the encoder E to obtain the encoding vectors s_q and t_q respectively. Input the encoding vector s_q into the source video frame picture discrete vector embedding unit E_s and calculate Z_s = q(s_q) := argmin_{z_k ∈ Z} ||s_q − z_k|| to obtain the space vector Z_s, quantized to the most similar vector in the discrete space Z. In the formula, s_q ∈ R^{h×w×n_z}, h is the height value of P_s^i and P_t^j, w is the width value of P_s^i and P_t^j, n_z is the embedding dimension, Z = {z_k}_{k=1}^K ⊂ R^{n_z}, and K is the number of dictionary vectors. Input the encoding vector t_q into the target video frame picture discrete vector embedding unit E_t and calculate Z_t = q(t_q) in the same way to obtain the space vector Z_t, quantized to the most similar vector in the discrete space Z. Input the space vector Z_s into the source video frame picture decoder G_s to obtain the decoding result s_g, and input the space vector Z_t into the target video frame picture decoder G_t to obtain the decoding result t_g;
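The nearest-neighbour quantization q(·) performed by the discrete vector embedding units can be sketched as follows (a NumPy illustration; the toy sizes h = w = 2, n_z = 4, K = 8 and the function name `quantize` are assumptions for illustration only):

```python
import numpy as np

def quantize(encodings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Replace each spatial encoding vector by its nearest dictionary vector.

    encodings: (h, w, n_z) output of the encoder for one picture.
    codebook:  (K, n_z) discrete space Z of K embedding vectors z_k.
    """
    h, w, n_z = encodings.shape
    flat = encodings.reshape(-1, n_z)                                # (h*w, n_z)
    # Squared distance from every encoding vector to every z_k.
    dist = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (h*w, K)
    nearest = dist.argmin(axis=1)                                    # argmin over k
    return codebook[nearest].reshape(h, w, n_z)

rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 4))        # K = 8 dictionary vectors, n_z = 4
s_q = rng.normal(size=(2, 2, 4))   # toy encoder output, h = w = 2
Z_s = quantize(s_q, Z)             # quantized space vector
```

After quantization, every spatial position of Z_s is an exact copy of one dictionary vector, which is what removes the continuous pixel-level freedom the background section blames for artifacts.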
b-3) calculate the loss l_1 = ||s_g − P_s^i||_2 + ||t_g − P_t^j||_2 and the loss l_2 = ||s_g − Z_s||_2 + ||t_g − Z_t||_2. Input the decoding result s_g into the source video frame picture discriminator D_s for judgment and the decoding result t_g into the target video frame picture discriminator D_t for judgment, and calculate the loss between the reconstructed pictures and the original pictures as l_3 = log D_s(s_g) + log(1 − D_s(P_s^i)) + log D_t(t_g) + log(1 − D_t(P_t^j)). Back-propagate the losses l_1, l_2 and l_3, and continuously adjust the network model of b-1) iteratively using an optimizer.
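The three losses of step b-3) can be written out directly from the formulas above (an illustrative NumPy sketch; the function names are assumed, and the discriminator scores are taken to be probabilities in (0, 1)):

```python
import numpy as np

def loss_l1(s_g, p_s_i, t_g, p_t_j):
    # l1: reconstruction loss against the original aligned face pictures.
    return np.sum((s_g - p_s_i) ** 2) + np.sum((t_g - p_t_j) ** 2)

def loss_l2(s_g, z_s, t_g, z_t):
    # l2: keeps the decoding results close to the quantized space vectors.
    return np.sum((s_g - z_s) ** 2) + np.sum((t_g - z_t) ** 2)

def loss_l3(ds_fake, ds_real, dt_fake, dt_real):
    # l3: adversarial term built from the discriminator scores of
    # both the source and target branches.
    return (np.log(ds_fake) + np.log(1 - ds_real)
            + np.log(dt_fake) + np.log(1 - dt_real))
```

In a real training loop these would be summed and back-propagated by an optimizer; handling the non-differentiable quantization step (e.g. with a straight-through gradient) is not detailed by the source.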
Further, in step c), the aligned face pictures P_s^face_align from the source video frames are input into the network model iterated in step b-3), passing sequentially through the encoder E, the target video frame picture discrete vector embedding unit E_t, and the target video frame picture decoder G_t to obtain the decoding result t_stog.
Further, in step d), the decoding result t_stog is sharpened and fused to obtain a number of video frames Frame_f.
Further, in step e), the video frames Frame_f are combined into the final video V_f using the multimedia processing tool ffmpeg.
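The ffmpeg frame extraction of step a-1) and the frame combination of step e) can be sketched as command builders (the specific ffmpeg flags, frame-naming pattern, and codec below are common usage, not specified by the patent):

```python
def frame_extract_cmd(video: str, out_pattern: str) -> list[str]:
    # Step a-1: split a video into numbered image frames.
    return ["ffmpeg", "-i", video, out_pattern]

def frame_combine_cmd(pattern: str, fps: int, out_video: str) -> list[str]:
    # Step e): assemble processed frames back into the final video.
    return ["ffmpeg", "-framerate", str(fps), "-i", pattern,
            "-c:v", "libx264", "-pix_fmt", "yuv420p", out_video]

# Example invocation (commented out so the sketch has no side effects):
# import subprocess
# subprocess.run(frame_extract_cmd("source.mp4", "src/frame_%06d.png"),
#                check=True)
```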
Preferably, the residual unit of step b-1) comprises a normalized convolution module and a convolutional layer; the normalized convolution module consists, in sequence, of two normalization-plus-convolution layers. The picture passes through the normalized convolution module and through the convolutional layer, and the two outputs are added. The convolution kernel of these convolutional layers is 3×3 with stride 1 and padding 1.
Preferably, the convolution kernel of the downsampling convolutional layers of the encoder E in step b-1) is 3×3 with stride 2 and padding 0; the convolution kernel of the upsampling convolutional layers is 3×3 with stride 1 and padding 1. The convolution kernel of the convolutions in the source video frame picture discriminator D_s and the target video frame picture discriminator D_t is 4×4 with stride 2 and padding 1, and the activation function in D_s and D_t is the LeakyReLU activation function.
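The spatial sizes produced by these convolution settings follow the standard formula floor((n + 2p − k) / s) + 1. A small sketch (the 128-pixel input side is an assumed example, since the patent does not state an input resolution):

```python
def conv_out(n: int, k: int, s: int, p: int) -> int:
    """Output side length of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Encoder E downsampling: 3x3 kernel, stride 2, padding 0, applied 4 times.
side = 128                     # assumed input resolution for illustration
for _ in range(4):
    side = conv_out(side, k=3, s=2, p=0)   # 128 -> 63 -> 31 -> 15 -> 7

# Discriminator convolution: 4x4 kernel, stride 2, padding 1 halves the side.
half = conv_out(128, k=4, s=2, p=1)        # 64

# Residual-unit convolution: 3x3, stride 1, padding 1 preserves the side.
same = conv_out(128, k=3, s=1, p=1)        # 128
```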
The beneficial effects of the invention are: the final result is obtained by extracting video frames from the source and target videos, then sequentially applying face detection, face alignment, the trained face-swap network, face sharpening and fusion, and video frame combination to the source video frames. The method converts the encoding result into a discrete vector representation, reducing the artifacts generated during the face-swapping operation. Meanwhile, the discriminator added during training makes the details of the decoded pictures clearer and their quality more stable.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the preprocessing process of the invention;
FIG. 3 is a diagram of a training network of the model of the present invention;
FIG. 4 is a diagram of the testing and post-processing of the model of the present invention;
FIG. 5 is a diagram of a residual unit structure of the model of the present invention;
FIG. 6 is a diagram of an encoder network model of the present invention;
FIG. 7 is a diagram of a decoder network model of the present invention;
FIG. 8 is a diagram of a discrete spatial embedding process of the model of the present invention;
FIG. 9 is a diagram of a network model of the discriminator according to the present invention.
Detailed Description
The present invention is further described with reference to fig. 1 to 9.
A deepfake generation method based on vector discretization representation comprises the following steps:
a) extracting frames of a source video and a target video, and identifying and aligning human faces in the frames of the source video and the target video;
b) establishing a network model, and optimizing the network model by using a loss function;
c) sequentially passing aligned face pictures in a source video frame through an encoder, a discrete vector embedding unit and a decoder to obtain a face-changed picture;
d) sharpening and fusing the face-changed picture, and putting the sharpened and fused picture into a video frame;
e) repeating the steps c) -d), and combining the video frames into a final video.
The preprocessed source face picture is put into the trained shared encoder to obtain an encoding result. To reduce the artifacts generated during the face-swapping operation, the encoding result is quantized into a space vector in discrete representation, which is then put into the trained target face decoder to obtain the face-swap result. A series of post-processing steps on the face-swap result yields the final video. The method converts the encoding result into a discrete vector representation, reducing the artifacts generated during the face-swapping operation. Meanwhile, the discriminator added during training makes the details of the decoded pictures clearer and their quality more stable.
The step a) comprises the following steps:
a-1) using the multimedia processing tool ffmpeg, extract source video frames Frame_s from the source video V_s and target video frames Frame_t from the target video V_t;
a-2) from the source video frames Frame_s and the target video frames Frame_t, crop out the face pictures P_s^face_detection and P_t^face_detection respectively using the S3FD face detection algorithm;
a-3) align the facial feature points of the face pictures P_s^face_detection and P_t^face_detection using the 2DFAN face alignment algorithm to obtain the aligned face pictures P_s^face_align and P_t^face_align, where P_s^i is the i-th aligned face picture in P_s^face_align and P_t^j is the j-th aligned face picture in P_t^face_align.
Step b) comprises the following steps:
b-1) establish a network model composed of an encoder E, a source video frame picture decoder G_s, a target video frame picture decoder G_t, a source video frame picture discrete vector embedding unit E_s, a target video frame picture discrete vector embedding unit E_t, a source video frame picture discriminator D_s, and a target video frame picture discriminator D_t. The encoder E consists, in sequence, of 2 residual units and 4 downsampling convolutional layers; the decoders G_s and G_t each consist, in sequence, of 2 residual units and 4 upsampling convolutional layers; the embedding units E_s and E_t each consist, in sequence, of 2 residual units, the AttnBlock module of the Transformer model, and a dictionary vector Embedding function; the discriminators D_s and D_t each consist, in sequence, of 2 convolution-plus-activation layers, 3 convolution-plus-activation-plus-batch-normalization layers, and 2 convolution-plus-activation layers;
b-2) input P_s^i and P_t^j into the encoder E to obtain the encoding vectors s_q and t_q respectively. Input the encoding vector s_q into the source video frame picture discrete vector embedding unit E_s and calculate Z_s = q(s_q) := argmin_{z_k ∈ Z} ||s_q − z_k|| to obtain the space vector Z_s, quantized to the most similar vector in the discrete space Z. In the formula, s_q ∈ R^{h×w×n_z}, h is the height value of P_s^i and P_t^j, w is the width value of P_s^i and P_t^j, n_z is the embedding dimension, Z = {z_k}_{k=1}^K ⊂ R^{n_z}, and K is the number of dictionary vectors. Input the encoding vector t_q into the target video frame picture discrete vector embedding unit E_t and calculate Z_t = q(t_q) in the same way to obtain the space vector Z_t. Input the space vector Z_s into the source video frame picture decoder G_s to obtain the decoding result s_g, and input the space vector Z_t into the target video frame picture decoder G_t to obtain the decoding result t_g;
b-3) calculate the loss l_1 = ||s_g − P_s^i||_2 + ||t_g − P_t^j||_2 and the loss l_2 = ||s_g − Z_s||_2 + ||t_g − Z_t||_2. Input the decoding result s_g into the source video frame picture discriminator D_s for judgment and the decoding result t_g into the target video frame picture discriminator D_t for judgment, and calculate the loss between the reconstructed pictures and the original pictures as l_3 = log D_s(s_g) + log(1 − D_s(P_s^i)) + log D_t(t_g) + log(1 − D_t(P_t^j)). Back-propagate the losses l_1, l_2 and l_3, and continuously adjust the network model of b-1) iteratively using an optimizer.
In step c), the aligned face pictures P_s^face_align from the source video frames are input into the network model iterated in step b-3), passing sequentially through the encoder E, the target video frame picture discrete vector embedding unit E_t, and the target video frame picture decoder G_t to obtain the decoding result t_stog.
In step d), the decoding result t_stog is sharpened and fused to obtain a number of video frames Frame_f.
In step e), the video frames Frame_f are combined into the final video V_f using the multimedia processing tool ffmpeg. The residual unit of step b-1) comprises a normalized convolution module and a convolutional layer; the normalized convolution module consists, in sequence, of two normalization-plus-convolution layers; the picture passes through the normalized convolution module and through the convolutional layer, and the two outputs are added. The convolution kernel of these convolutional layers is 3×3 with stride 1 and padding 1. The convolution kernel of the downsampling convolutional layers of the encoder E in step b-1) is 3×3 with stride 2 and padding 0; the convolution kernel of the upsampling convolutional layers is 3×3 with stride 1 and padding 1. The convolution kernel of the convolutions in the source video frame picture discriminator D_s and the target video frame picture discriminator D_t is 4×4 with stride 2 and padding 1, and the activation function in D_s and D_t is the Leaky ReLU activation function.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents may be substituted for elements thereof. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (6)
1. A deepfake generation method based on vector discretization representation, characterized by comprising the following steps:
a) extracting frames of a source video and a target video, and identifying and aligning human faces in the frames of the source video and the target video;
b) establishing a network model, and optimizing the network model by using a loss function;
c) sequentially passing aligned face pictures in a source video frame through an encoder, a discrete vector embedding unit and a decoder to obtain a face-changed picture;
d) sharpening and fusing the face-changed picture, and putting the sharpened and fused picture into a video frame;
e) repeating steps c) -d), and combining the video frames into a final video;
the step a) comprises the following steps:
a-1) using the multimedia processing tool ffmpeg, extract source video frames Frame_s from the source video V_s and target video frames Frame_t from the target video V_t;
a-2) from the source video frames Frame_s and the target video frames Frame_t, crop out the face pictures P_s^face_detection and P_t^face_detection respectively using the S3FD face detection algorithm;
a-3) align the facial feature points of the face pictures P_s^face_detection and P_t^face_detection using the 2DFAN face alignment algorithm to obtain the aligned face pictures P_s^face_align and P_t^face_align, where P_s^i is the i-th aligned face picture in P_s^face_align and P_t^j is the j-th aligned face picture in P_t^face_align; step b) comprises the following steps:
b-1) establish a network model composed of an encoder E, a source video frame picture decoder G_s, a target video frame picture decoder G_t, a source video frame picture discrete vector embedding unit E_s, a target video frame picture discrete vector embedding unit E_t, a source video frame picture discriminator D_s, and a target video frame picture discriminator D_t. The encoder E consists, in sequence, of 2 residual units and 4 downsampling convolutional layers; the decoders G_s and G_t each consist, in sequence, of 2 residual units and 4 upsampling convolutional layers; the embedding units E_s and E_t each consist, in sequence, of 2 residual units, the AttnBlock module of the Transformer model, and a dictionary vector Embedding function; the discriminators D_s and D_t each consist, in sequence, of 2 convolution-plus-activation layers, 3 convolution-plus-activation-plus-batch-normalization layers, and 2 convolution-plus-activation layers;
b-2) input P_s^i and P_t^j into the encoder E to obtain the encoding vectors s_q and t_q respectively. Input the encoding vector s_q into the source video frame picture discrete vector embedding unit E_s and calculate Z_s = q(s_q) := argmin_{z_k ∈ Z} ||s_q − z_k|| to obtain the space vector Z_s, quantized to the most similar vector in the discrete space Z. In the formula, s_q ∈ R^{h×w×n_z}, h is the height value of P_s^i and P_t^j, w is the width value of P_s^i and P_t^j, n_z is the embedding dimension, Z = {z_k}_{k=1}^K ⊂ R^{n_z}, and K is the number of dictionary vectors. Input the encoding vector t_q into the target video frame picture discrete vector embedding unit E_t and calculate Z_t = q(t_q) in the same way to obtain the space vector Z_t, quantized in the discrete space Z. Input the space vector Z_s into the source video frame picture decoder G_s to obtain the decoding result s_g, and input the space vector Z_t into the target video frame picture decoder G_t to obtain the decoding result t_g;
b-3) calculate the loss l_1 = ||s_g − P_s^i||_2 + ||t_g − P_t^j||_2 and the loss l_2 = ||s_g − Z_s||_2 + ||t_g − Z_t||_2. Input the decoding result s_g into the source video frame picture discriminator D_s for judgment and the decoding result t_g into the target video frame picture discriminator D_t for judgment, and calculate the loss between the reconstructed pictures and the original pictures as l_3 = log D_s(s_g) + log(1 − D_s(P_s^i)) + log D_t(t_g) + log(1 − D_t(P_t^j)). Back-propagate the losses l_1, l_2 and l_3, and continuously adjust the network model of b-1) iteratively using an optimizer.
2. The deepfake generation method based on vector discretization representation according to claim 1, wherein: in step c), the aligned face pictures P_s^face_align from the source video frames are input into the network model iterated in step b-3), passing sequentially through the encoder E, the target video frame picture discrete vector embedding unit E_t, and the target video frame picture decoder G_t to obtain the decoding result t_stog.
3. The deepfake generation method based on vector discretization representation according to claim 2, wherein: in step d), the decoding result t_stog is sharpened and fused to obtain a number of video frames Frame_f.
4. The deepfake generation method based on vector discretization representation according to claim 3, wherein: in step e), the video frames Frame_f are combined into the final video V_f using the multimedia processing tool ffmpeg.
5. The deepfake generation method based on vector discretization representation according to claim 1, wherein: the residual unit of step b-1) comprises a normalized convolution module and a convolutional layer; the normalized convolution module consists, in sequence, of two normalization-plus-convolution layers; the picture passes through the normalized convolution module and through the convolutional layer, and the two outputs are added; the convolution kernel of the convolutional layers is 3×3 with stride 1 and padding 1.
6. The deepfake generation method based on vector discretization representation according to claim 1, wherein: in step b-1), the convolution kernel of the downsampling convolutional layers of the encoder E is 3×3 with stride 2 and padding 0; the convolution kernel of the upsampling convolutional layers is 3×3 with stride 1 and padding 1; the convolution kernel of the convolutions in the source video frame picture discriminator D_s and the target video frame picture discriminator D_t is 4×4 with stride 2 and padding 1, and the activation function in D_s and D_t is the Leaky ReLU activation function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111400589.0A | 2021-11-23 | 2021-11-23 | Deepfake generation method based on vector discretization representation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114155139A | 2022-03-08 |
CN114155139B | 2022-07-22 |
Family
ID=80457238
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111400589.0A | Deepfake generation method based on vector discretization representation | 2021-11-23 | 2021-11-23 |
Country Status (1)
Country | Link |
---|---|
CN | CN114155139B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115311720B (en) * | 2022-08-11 | 2023-06-06 | 山东省人工智能研究院 | Method for generating deepfake based on transducer |
CN116246022B (en) * | 2023-03-09 | 2024-01-26 | 山东省人工智能研究院 | Face image identity synthesis method based on progressive denoising guidance |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1125568C (en) * | 1996-01-22 | 2003-10-22 | 松下电器产业株式会社 | Digital image encoding and decoding method and apparatus using same |
US9747495B2 (en) * | 2012-03-06 | 2017-08-29 | Adobe Systems Incorporated | Systems and methods for creating and distributing modifiable animated video messages |
CN103489011A (en) * | 2013-09-16 | 2014-01-01 | 广东工业大学 | Three-dimensional face identification method with topology robustness |
CN110870302B (en) * | 2017-07-03 | 2021-06-29 | 诺基亚技术有限公司 | Apparatus, method and computer program for omnidirectional video |
KR102262554B1 (en) * | 2017-12-14 | 2021-06-09 | 한국전자통신연구원 | Method and apparatus for encoding and decoding image using prediction network |
US11410275B2 (en) * | 2019-09-23 | 2022-08-09 | Tencent America LLC | Video coding for machine (VCM) based system and method for video super resolution (SR) |
GB2588438B (en) * | 2019-10-24 | 2022-06-08 | Sony Interactive Entertainment Inc | Encoding and decoding apparatus |
CN112446364B (en) * | 2021-01-29 | 2021-06-08 | 中国科学院自动化研究所 | High-definition face replacement video generation method and system |
CN113192161B (en) * | 2021-04-22 | 2022-10-18 | 清华珠三角研究院 | Virtual human image video generation method, system, device and storage medium |
CN113240575A (en) * | 2021-05-12 | 2021-08-10 | 中国科学技术大学 | Face counterfeit video effect enhancement method |
- 2021-11-23: application CN202111400589.0A filed in CN; patent CN114155139B granted, status Active
Non-Patent Citations (2)
Title |
---|
Ivan Perov: "DeepFaceLab: A simple, flexible and extensible face…", arXiv, 2020-05-20, pp. 1-17 * |
Gao Wei et al.: "Security issues behind DeepFake technology: opportunities and challenges" (DeepFake技术背后的安全问题：机遇与挑战), Information Security Research, no. 7, 2020-07-05, pp. 64-74 * |
Also Published As
Publication number | Publication date |
---|---|
CN114155139A (en) | 2022-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114155139B (en) | Deepfake generation method based on vector discretization representation | |
CN109993678B (en) | Robust information hiding method based on deep confrontation generation network | |
CN115311720B (en) | Method for generating deepfake based on transducer | |
CN111369565A (en) | Digital pathological image segmentation and classification method based on graph convolution network | |
CN109996073B (en) | Image compression method, system, readable storage medium and computer equipment | |
Ji et al. | U2-former: A nested u-shaped transformer for image restoration | |
CN116246022B (en) | Face image identity synthesis method based on progressive denoising guidance | |
CN110880193A (en) | Image compression method using depth semantic segmentation technology | |
CN116309107A (en) | Underwater image enhancement method based on Transformer and generated type countermeasure network | |
CN115829876A (en) | Real degraded image blind restoration method based on cross attention mechanism | |
CN113747163A (en) | Image coding and decoding method and compression method based on context reorganization modeling | |
CN115713680A (en) | Semantic guidance-based face image identity synthesis method | |
CN113781324B (en) | Old photo restoration method | |
CN112750175B (en) | Image compression method and system based on octave convolution and semantic segmentation | |
CN108171325B (en) | Time sequence integration network, coding device and decoding device for multi-scale face recovery | |
CN116523985B (en) | Structure and texture feature guided double-encoder image restoration method | |
CN115880762B (en) | Human-machine hybrid vision-oriented scalable face image coding method and system | |
CN117061760A (en) | Video compression method and system based on attention mechanism | |
Ma et al. | AFEC: adaptive feature extraction modules for learned image compression | |
Kim et al. | End-to-end learnable multi-scale feature compression for vcm | |
CN115619681A (en) | Image reconstruction method based on multi-granularity Vit automatic encoder | |
CN115393452A (en) | Point cloud geometric compression method based on asymmetric self-encoder structure | |
CN115496134A (en) | Traffic scene video description generation method and device based on multi-modal feature fusion | |
Huang et al. | CLSR: cross-layer interaction pyramid super-resolution network | |
CN114494387A (en) | Data set network generation model and fog map generation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |