CN114155139A - Deepfake generation method based on vector discretization representation - Google Patents


Info

Publication number
CN114155139A
CN114155139A
Authority
CN
China
Prior art keywords
video frame
picture
vector
face
target video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111400589.0A
Other languages
Chinese (zh)
Other versions
CN114155139B (en)
Inventor
舒明雷
曹伟
陈达
刘丽
许继勇
孔祥龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Shandong Institute of Artificial Intelligence
Original Assignee
Qilu University of Technology
Shandong Institute of Artificial Intelligence
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology and Shandong Institute of Artificial Intelligence
Priority: CN202111400589.0A
Publication of CN114155139A
Application granted
Publication of CN114155139B
Legal status: Active
Anticipated expiration

Classifications

    • G06T3/04
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T5/73
    • G06T9/002 Image coding using neural networks
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20221 Image fusion; Image merging
    • G06T2207/30201 Face


Abstract

A deepfake generation method based on vector discretization representation extracts video frames from a source video and a target video, then sequentially applies face detection, face alignment, a trained face-exchange network, face sharpening and fusion, and video-frame combination to the source video frames to obtain the final result. The method converts the encoding result into a discrete vector representation, reducing the artifacts produced during the face-swapping operation. Meanwhile, a discriminator added during training makes the details of the decoded picture clearer and the quality more stable.

Description

Deepfake generation method based on vector discretization representation
Technical Field
The invention relates to the field of face-swap generation in video, and in particular to a deepfake generation method based on vector discretization representation.
Background
With the development of deep learning techniques and the flood of personal media data in public network environments, many fake face-swapping videos have been produced. The technique of generating such fake face-swapping videos with deep learning is called deepfake. Specifically, the technique replaces the face in the source video with the face from the target video, while ensuring that the swapped face keeps the source face's attribute information (expression, illumination, background, and so on) and the target face's identity information. Current generation techniques are mainly implemented with autoencoders and generative adversarial networks.
Autoencoder-based generation uses a common encoder and two decoders, one for the source and one for the target. During training, the source face picture from the source video frames and the target face picture from the target video frames are fed into the common encoder to extract general facial features, which are then reconstructed into the source and target face pictures by their respective decoders. Face swapping then feeds the source face picture into the trained common encoder and outputs it through the target face's decoder, yielding the face-swapping result. However, faces synthesized by such autoencoders are manipulated entirely at the pixel level, so the process produces artifacts, and the synthesized images lack detail sharpness and stable quality.
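The shared-encoder, two-decoder routing described above can be sketched as follows. The linear maps standing in for the encoder and decoders are toy placeholders, not the trained networks of the invention; only the routing (encode with the common encoder, decode with the target decoder) mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: one shared "encoder" and one "decoder" per identity.
W_enc = rng.standard_normal((8, 16))            # 16-dim face vector -> 8-dim code
W_dec = {
    "source": rng.standard_normal((16, 8)),     # would be trained to rebuild source faces
    "target": rng.standard_normal((16, 8)),     # would be trained to rebuild target faces
}

def encode(face):
    # Common encoder shared by both identities.
    return W_enc @ face

def decode(code, identity):
    # Identity-specific decoder.
    return W_dec[identity] @ code

def face_swap(source_face):
    # Face swap = source face through the SHARED encoder,
    # then out through the TARGET identity's decoder.
    return decode(encode(source_face), "target")

source_face = rng.standard_normal(16)
swapped = face_swap(source_face)
```

With trained decoders, `swapped` would carry the target identity while keeping the source face's pose and expression; with these random placeholders the sketch only demonstrates the data flow.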
Disclosure of Invention
In order to overcome the shortcomings of the above technology, the invention provides a deepfake generation method based on vector discretization representation, which reduces the artifacts produced during the face-swapping operation.
The technical scheme adopted by the invention for overcoming the technical problems is as follows:
a deepfake generation method based on vector discretization representation comprises the following steps:
a) extracting frames of a source video and a target video, and identifying and aligning human faces in the frames of the source video and the target video;
b) establishing a network model, and optimizing the network model by using a loss function;
c) sequentially passing aligned face pictures in a source video frame through an encoder, a discrete vector embedding unit and a decoder to obtain a face-changed picture;
d) sharpening and fusing the face-changed picture, and putting the sharpened and fused picture into a video frame;
e) repeating the steps c) -d), and combining the video frames into a final video.
Further, step a) comprises the following steps:
a-1) Using the multimedia processing tool ffmpeg, extract source video frames frame_s from the source video V_s and target video frames frame_t from the target video V_t;
a-2) From the source video frames frame_s and the target video frames frame_t, crop out face pictures P_s^face_detection and P_t^face_detection respectively using the S3FD face detection algorithm;
a-3) Align the facial feature points of the face pictures P_s^face_detection and P_t^face_detection using the 2DFAN face alignment algorithm to obtain the aligned face pictures P_s^face_align and P_t^face_align respectively, where P_s^i is the i-th aligned face picture in P_s^face_align and P_t^j is the j-th aligned face picture in P_t^face_align.
Further, step b) comprises the following steps:
b-1) Establish a network model composed of an encoder E, a source video frame picture decoder G_s, a target video frame picture decoder G_t, a source video frame picture discrete vector embedding unit E_s, a target video frame picture discrete vector embedding unit E_t, a source video frame picture discriminator D_s and a target video frame picture discriminator D_t. The encoder E consists sequentially of 2 residual units and 4 downsampling convolutional layers; the decoders G_s and G_t each consist sequentially of 2 residual units and 4 upsampling convolutional layers; the discrete vector embedding units E_s and E_t each consist sequentially of 2 residual units, the AttnBlock attention module of a Transformer model and a dictionary vector Embedding function; the discriminators D_s and D_t each consist sequentially of 2 layers of convolution plus activation function, 3 layers of convolution plus activation function plus batch normalization, and 2 layers of convolution plus activation function;
b-2) Input P_s^i and P_t^j into the encoder E to obtain the encoding vectors s_q and t_q respectively. Input the encoding vector s_q into the source video frame picture discrete vector embedding unit E_s and compute, through the formula
Z_s = argmin_{z_k ∈ Z} ||s_q - z_k||,
the space vector Z_s quantized to its nearest neighbour in the discrete space Z, where s_q ∈ R^(h×w×n_z), h is the height of P_s^i and P_t^j, w is the width of P_s^i and P_t^j, n_z is the embedding dimension, Z = {z_k ∈ R^(n_z) | k = 1, ..., K} and K is the number of dictionary vectors. Input the encoding vector t_q into the target video frame picture discrete vector embedding unit E_t and compute, through the formula
Z_t = argmin_{z_k ∈ Z} ||t_q - z_k||,
the space vector Z_t quantized to its nearest neighbour in the discrete space Z. Input the space vector Z_s into the source video frame picture decoder G_s to obtain the decoding result s_g, and input the space vector Z_t into the target video frame picture decoder G_t to obtain the decoding result t_g;
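The quantization in b-2) replaces each n_z-dimensional encoder output with its nearest dictionary vector. A minimal numpy sketch of this nearest-neighbour lookup (the dictionary contents and tensor sizes below are arbitrary illustrations, not the patent's values):

```python
import numpy as np

def quantize(codes, codebook):
    """Snap each n_z-dim vector in `codes` (h x w x n_z) to the nearest
    row of `codebook` (K x n_z) in Euclidean distance, as in step b-2)."""
    h, w, nz = codes.shape
    flat = codes.reshape(-1, nz)                                   # (h*w, n_z)
    # Squared distance from every spatial position to every dictionary vector.
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (h*w, K)
    idx = d2.argmin(axis=1)                                        # nearest entry per position
    return codebook[idx].reshape(h, w, nz), idx.reshape(h, w)

K, nz = 4, 3
codebook = np.eye(K, nz)     # toy dictionary: rows [1,0,0], [0,1,0], [0,0,1], [0,0,0]
codes = np.zeros((2, 2, nz))
codes[..., 0] = 0.9          # every position lies closest to dictionary row 0
Z_q, idx = quantize(codes, codebook)
```

Every encoder output is thus expressed exactly as one of the K dictionary vectors, which is what makes the representation discrete.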
b-3) Compute the loss l_1 = ||s_g - P_s^i||_2 + ||t_g - P_t^j||_2 and the loss l_2 = ||s_g - Z_s||_2 + ||t_g - Z_t||_2. Input the decoding result s_g into the source video frame picture discriminator D_s for discrimination and the decoding result t_g into the target video frame picture discriminator D_t for discrimination, and compute the loss between the reconstructed pictures and the original pictures l_3 = log D_s(s_g) + log(1 - D_s(P_s^i)) + log D_t(t_g) + log(1 - D_t(P_t^j)). Back-propagate the losses l_1, l_2 and l_3 and iteratively adjust the network model of b-1) with an optimizer.
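The three losses of b-3) can be written out directly. The discriminator outputs below are scalar stand-in probabilities in (0, 1), since the real D_s and D_t are convolutional networks:

```python
import numpy as np

def dist(a, b):
    # ||a - b||_2 over whole tensors.
    return float(np.linalg.norm(a - b))

def training_losses(s_g, P_si, t_g, P_tj, Z_s, Z_t,
                    Ds_fake, Ds_real, Dt_fake, Dt_real):
    """l_1: reconstruction loss, l_2: quantization loss, l_3: adversarial
    loss, as written in step b-3). The D*_fake / D*_real arguments are
    scalar stand-ins for the discriminator outputs."""
    l1 = dist(s_g, P_si) + dist(t_g, P_tj)
    l2 = dist(s_g, Z_s) + dist(t_g, Z_t)
    l3 = (np.log(Ds_fake) + np.log(1.0 - Ds_real)
          + np.log(Dt_fake) + np.log(1.0 - Dt_real))
    return l1, l2, float(l3)

x = np.ones((4, 4))
l1, l2, l3 = training_losses(x, x, x, x, x, x, 0.5, 0.5, 0.5, 0.5)
```

Perfect reconstruction drives l_1 and l_2 to zero, while an undecided discriminator (every output 0.5) gives l_3 = 4·log(0.5).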
Further, in step c) the aligned face picture P_s^face_align from the source video frame is input into the network model iterated in step b-3) and passes sequentially through the encoder E, the target video frame picture discrete vector embedding unit E_t and the target video frame picture decoder G_t to obtain the decoding result t_stog.
Further, in step d) the decoding result t_stog is sharpened and fused to obtain a plurality of video frames frame_f.
Further, in step e) the plurality of video frames frame_f are combined into the final video V_f using the multimedia processing tool ffmpeg.
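The two ffmpeg invocations (frame extraction in step a-1) and recombination in step e)) can be assembled from Python. The file names, the `frame_%06d.png` output pattern and the 25 fps rate are illustrative assumptions; the commands are only built here, not executed.

```python
def extract_frames_cmd(video_path: str, out_dir: str) -> list:
    # Dump every frame of the video as numbered PNGs (step a-1)).
    return ["ffmpeg", "-i", video_path, f"{out_dir}/frame_%06d.png"]

def combine_frames_cmd(frames_dir: str, out_video: str, fps: int = 25) -> list:
    # Reassemble numbered frames into a video (step e)).
    return ["ffmpeg", "-framerate", str(fps),
            "-i", f"{frames_dir}/frame_%06d.png", out_video]

extract = extract_frames_cmd("V_s.mp4", "frames_s")   # hypothetical paths
combine = combine_frames_cmd("frames_f", "V_f.mp4")
# To actually run one: subprocess.run(extract, check=True)
```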
Preferably, the residual unit in step b-1) comprises a normalized convolution module and a convolutional layer. The normalized convolution module consists sequentially of two normalization-layer-plus-convolutional-layer stages; the picture passes sequentially through these two stages, and the result is added to the output of the convolutional layer. The convolution kernel of the convolutional layer is 3×3 with stride 1 and padding 1.
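A shape-level sketch of that residual unit follows. The `norm` and `conv3x3` placeholders are simple array functions that only reproduce the wiring (two norm+conv stages whose result is added to a convolutional skip path); they are assumptions standing in for the actual layers, not implementations of them.

```python
import numpy as np

def norm(x):
    # Placeholder for a normalization layer.
    return (x - x.mean()) / (x.std() + 1e-5)

def conv3x3(x):
    # Placeholder for a 3x3, stride-1, padding-1 convolution:
    # shape-preserving, here just a scaled copy.
    return 0.5 * x

def residual_unit(x):
    # Two (normalization + convolution) stages ...
    y = conv3x3(norm(x))
    y = conv3x3(norm(y))
    # ... whose result is added to the output of the skip convolutional layer.
    return y + conv3x3(x)

x = np.arange(16.0).reshape(4, 4)
out = residual_unit(x)
```

Because every stage is shape-preserving (3×3 kernel, stride 1, padding 1), the addition at the end is well defined.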
Preferably, the convolution kernel of the downsampling convolutional layers of the encoder E in step b-1) is 3×3 with stride 2 and padding 0; the convolution kernel of the upsampling convolutional layers is 3×3 with stride 1 and padding 1. In the source video frame picture discriminator D_s and the target video frame picture discriminator D_t, the convolutions have 4×4 kernels with stride 2 and padding 1, and the activation function is the LeakyReLU activation function.
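These kernel/stride/padding choices can be sanity-checked with the standard convolution output formula out = floor((n + 2p − k)/s) + 1. The 256×256 input resolution below is a hypothetical example, not a value stated in the patent:

```python
def conv_out(n: int, k: int, s: int, p: int) -> int:
    # Output size of a convolution along one spatial dimension.
    return (n + 2 * p - k) // s + 1

# Encoder E: four downsampling convs, kernel 3, stride 2, padding 0.
n = 256                                   # hypothetical input resolution
sizes = []
for _ in range(4):
    n = conv_out(n, k=3, s=2, p=0)
    sizes.append(n)
# 256 -> 127 -> 63 -> 31 -> 15

# Discriminator convs (kernel 4, stride 2, padding 1) halve resolution exactly.
half = conv_out(256, k=4, s=2, p=1)
```

Note the slight asymmetry: the padding-0 encoder convs shrink the map a little more than a clean halving, while the discriminator's 4×4/stride-2/padding-1 convs halve it exactly.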
The invention has the beneficial effects that: the final result is obtained by extracting video frames from a source video and a target video, then sequentially applying face detection, face alignment, the trained face-exchange network, face sharpening and fusion, and video-frame combination to the source video frames. The method converts the encoding result into a discrete vector representation, reducing the artifacts produced during the face-swapping operation. Meanwhile, the discriminator added during training makes the details of the decoded picture clearer and the quality more stable.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of the preprocessing process of the present invention;
FIG. 3 is a diagram of a training network of the model of the present invention;
FIG. 4 is a diagram of the testing and post-processing of the model of the present invention;
FIG. 5 is a diagram of a residual unit structure of the model of the present invention;
FIG. 6 is a diagram of an encoder network model of the present invention;
FIG. 7 is a diagram of a decoder network model of the present invention;
FIG. 8 is a diagram of a discrete space embedding process of the model of the present invention;
FIG. 9 is a diagram of a network model of the discriminator according to the present invention.
Detailed Description
The present invention will be further described with reference to fig. 1 to 9.
A deepfake generation method based on vector discretization representation comprises the following steps:
a) extracting frames of a source video and a target video, and identifying and aligning human faces in the frames of the source video and the target video;
b) establishing a network model, and optimizing the network model by using a loss function;
c) sequentially passing aligned face pictures in a source video frame through an encoder, a discrete vector embedding unit and a decoder to obtain a face-changed picture;
d) sharpening and fusing the face-changed picture, and putting the sharpened and fused picture into a video frame;
e) repeating the steps c) -d), and combining the video frames into a final video.
The source face picture obtained through preprocessing is fed into the trained shared encoder to obtain an encoding result. To reduce the artifacts produced during the face-swapping operation, the encoding result is quantized into a space vector in discrete representation, which is then fed into the trained target-face decoder to obtain the face-swapping result. A series of post-processing steps on the face-swapping result yields the final video. The method converts the encoding result into a discrete vector representation, reducing the artifacts produced during the face-swapping operation; meanwhile, the discriminator added during training makes the details of the decoded picture clearer and the quality more stable.
The step a) comprises the following steps:
a-1) Using the multimedia processing tool ffmpeg, extract source video frames frame_s from the source video V_s and target video frames frame_t from the target video V_t;
a-2) From the source video frames frame_s and the target video frames frame_t, crop out face pictures P_s^face_detection and P_t^face_detection respectively using the S3FD face detection algorithm;
a-3) Align the facial feature points of the face pictures P_s^face_detection and P_t^face_detection using the 2DFAN face alignment algorithm to obtain the aligned face pictures P_s^face_align and P_t^face_align respectively, where P_s^i is the i-th aligned face picture in P_s^face_align and P_t^j is the j-th aligned face picture in P_t^face_align.
The step b) comprises the following steps:
b-1) Establish a network model composed of an encoder E, a source video frame picture decoder G_s, a target video frame picture decoder G_t, a source video frame picture discrete vector embedding unit E_s, a target video frame picture discrete vector embedding unit E_t, a source video frame picture discriminator D_s and a target video frame picture discriminator D_t. The encoder E consists sequentially of 2 residual units and 4 downsampling convolutional layers; the decoders G_s and G_t each consist sequentially of 2 residual units and 4 upsampling convolutional layers; the discrete vector embedding units E_s and E_t each consist sequentially of 2 residual units, the AttnBlock attention module of a Transformer model and a dictionary vector Embedding function; the discriminators D_s and D_t each consist sequentially of 2 layers of convolution plus activation function, 3 layers of convolution plus activation function plus batch normalization, and 2 layers of convolution plus activation function.
b-2) Input P_s^i and P_t^j into the encoder E to obtain the encoding vectors s_q and t_q respectively. Input the encoding vector s_q into the source video frame picture discrete vector embedding unit E_s and compute, through the formula Z_s = argmin_{z_k ∈ Z} ||s_q - z_k||, the space vector Z_s quantized to its nearest neighbour in the discrete space Z, where s_q ∈ R^(h×w×n_z), h is the height of P_s^i and P_t^j, w is the width of P_s^i and P_t^j, n_z is the embedding dimension, Z = {z_k ∈ R^(n_z) | k = 1, ..., K} and K is the number of dictionary vectors. Input the encoding vector t_q into the target video frame picture discrete vector embedding unit E_t and compute, through the formula Z_t = argmin_{z_k ∈ Z} ||t_q - z_k||, the space vector Z_t quantized to its nearest neighbour in the discrete space Z. Input the space vector Z_s into the source video frame picture decoder G_s to obtain the decoding result s_g, and input the space vector Z_t into the target video frame picture decoder G_t to obtain the decoding result t_g.
b-3) Compute the loss l_1 = ||s_g - P_s^i||_2 + ||t_g - P_t^j||_2 and the loss l_2 = ||s_g - Z_s||_2 + ||t_g - Z_t||_2. Input the decoding result s_g into the source video frame picture discriminator D_s for discrimination and the decoding result t_g into the target video frame picture discriminator D_t for discrimination, and compute the loss between the reconstructed pictures and the original pictures l_3 = log D_s(s_g) + log(1 - D_s(P_s^i)) + log D_t(t_g) + log(1 - D_t(P_t^j)). Back-propagate the losses l_1, l_2 and l_3 and iteratively adjust the network model of b-1) with an optimizer.
In step c), the aligned face picture P_s^face_align from the source video frame is input into the network model iterated in step b-3) and passes sequentially through the encoder E, the target video frame picture discrete vector embedding unit E_t and the target video frame picture decoder G_t to obtain the decoding result t_stog.
In step d), the decoding result t_stog is sharpened and fused to obtain a plurality of video frames frame_f.
In step e), the plurality of video frames frame_f are combined into the final video V_f using the multimedia processing tool ffmpeg. The residual unit in step b-1) comprises a normalized convolution module and a convolutional layer; the normalized convolution module consists sequentially of two normalization-layer-plus-convolutional-layer stages, the picture passes sequentially through these two stages, and the result is added to the output of the convolutional layer; the convolution kernel of the convolutional layer is 3×3 with stride 1 and padding 1. The convolution kernel of the downsampling convolutional layers of the encoder E in step b-1) is 3×3 with stride 2 and padding 0; the convolution kernel of the upsampling convolutional layers is 3×3 with stride 1 and padding 1; in the source video frame picture discriminator D_s and the target video frame picture discriminator D_t, the convolutions have 4×4 kernels with stride 2 and padding 1, and the activation function is the LeakyReLU activation function.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A deepfake generation method based on vector discretization representation is characterized by comprising the following steps:
a) extracting frames of a source video and a target video, and identifying and aligning human faces in the frames of the source video and the target video;
b) establishing a network model, and optimizing the network model by using a loss function;
c) sequentially passing aligned face pictures in a source video frame through an encoder, a discrete vector embedding unit and a decoder to obtain a face-changed picture;
d) sharpening and fusing the face-changed picture, and putting the sharpened and fused picture into a video frame;
e) repeating the steps c) -d), and combining the video frames into a final video.
2. The method for deepfake generation based on vector discretization representation according to claim 1, wherein step a) comprises the following steps:
a-1) using the multimedia processing tool ffmpeg, extracting source video frames frame_s from the source video V_s and target video frames frame_t from the target video V_t;
a-2) from the source video frames frame_s and the target video frames frame_t, cropping out face pictures P_s^face_detection and P_t^face_detection respectively using the S3FD face detection algorithm;
a-3) aligning the facial feature points of the face pictures P_s^face_detection and P_t^face_detection using the 2DFAN face alignment algorithm to obtain the aligned face pictures P_s^face_align and P_t^face_align respectively, where P_s^i is the i-th aligned face picture in P_s^face_align and P_t^j is the j-th aligned face picture in P_t^face_align.
3. The method for deepfake generation based on vector discretization representation according to claim 2, wherein step b) comprises the following steps:
b-1) establishing a network model composed of an encoder E, a source video frame picture decoder G_s, a target video frame picture decoder G_t, a source video frame picture discrete vector embedding unit E_s, a target video frame picture discrete vector embedding unit E_t, a source video frame picture discriminator D_s and a target video frame picture discriminator D_t, wherein the encoder E consists sequentially of 2 residual units and 4 downsampling convolutional layers; the decoders G_s and G_t each consist sequentially of 2 residual units and 4 upsampling convolutional layers; the discrete vector embedding units E_s and E_t each consist sequentially of 2 residual units, the AttnBlock attention module of a Transformer model and a dictionary vector Embedding function; and the discriminators D_s and D_t each consist sequentially of 2 layers of convolution plus activation function, 3 layers of convolution plus activation function plus batch normalization, and 2 layers of convolution plus activation function;
b-2) inputting P_s^i and P_t^j into the encoder E to obtain the encoding vectors s_q and t_q respectively; inputting the encoding vector s_q into the source video frame picture discrete vector embedding unit E_s and computing, through the formula Z_s = argmin_{z_k ∈ Z} ||s_q - z_k||, the space vector Z_s quantized to its nearest neighbour in the discrete space Z, where s_q ∈ R^(h×w×n_z), h is the height of P_s^i and P_t^j, w is the width of P_s^i and P_t^j, n_z is the embedding dimension, Z = {z_k ∈ R^(n_z) | k = 1, ..., K} and K is the number of dictionary vectors; inputting the encoding vector t_q into the target video frame picture discrete vector embedding unit E_t and computing, through the formula Z_t = argmin_{z_k ∈ Z} ||t_q - z_k||, the space vector Z_t quantized to its nearest neighbour in the discrete space Z; inputting the space vector Z_s into the source video frame picture decoder G_s to obtain the decoding result s_g, and inputting the space vector Z_t into the target video frame picture decoder G_t to obtain the decoding result t_g;
b-3) computing the loss l_1 = ||s_g - P_s^i||_2 + ||t_g - P_t^j||_2 and the loss l_2 = ||s_g - Z_s||_2 + ||t_g - Z_t||_2; inputting the decoding result s_g into the source video frame picture discriminator D_s for discrimination and the decoding result t_g into the target video frame picture discriminator D_t for discrimination; computing the loss between the reconstructed pictures and the original pictures l_3 = log D_s(s_g) + log(1 - D_s(P_s^i)) + log D_t(t_g) + log(1 - D_t(P_t^j)); and back-propagating the losses l_1, l_2 and l_3 while iteratively adjusting the network model of b-1) with an optimizer.
4. The deepfake generation method based on vector discretization representation according to claim 3, characterized in that: in step c) the aligned face picture P_s^face_align from the source video frame is input into the network model iterated in step b-3) and passes sequentially through the encoder E, the target video frame picture discrete vector embedding unit E_t and the target video frame picture decoder G_t to obtain the decoding result t_stog.
5. The deepfake generation method based on vector discretization representation according to claim 4, characterized in that: in step d) the decoding result t_stog is sharpened and fused to obtain a plurality of video frames frame_f.
6. The deepfake generation method based on vector discretization representation according to claim 5, characterized in that: in step e) the plurality of video frames frame_f are combined into the final video V_f using the multimedia processing tool ffmpeg.
7. The deepfake generation method based on vector discretization representation according to claim 1, characterized in that: the residual unit in step b-1) comprises a normalized convolution module and a convolutional layer; the normalized convolution module consists sequentially of two normalization-layer-plus-convolutional-layer stages; the picture passes sequentially through these two stages, and the result is added to the output of the convolutional layer; the convolution kernel of the convolutional layer is 3×3 with stride 1 and padding 1.
8. The deepfake generation method based on vector discretization representation according to claim 1, characterized in that: the convolution kernel of the downsampling convolutional layers of the encoder E in step b-1) is 3×3 with stride 2 and padding 0; the convolution kernel of the upsampling convolutional layers is 3×3 with stride 1 and padding 1; in the source video frame picture discriminator D_s and the target video frame picture discriminator D_t, the convolutions have 4×4 kernels with stride 2 and padding 1, and the activation function is the LeakyReLU activation function.
CN202111400589.0A 2021-11-23 2021-11-23 Deepfake generation method based on vector discretization representation Active CN114155139B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111400589.0A CN114155139B (en) 2021-11-23 2021-11-23 Deepfake generation method based on vector discretization representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111400589.0A CN114155139B (en) 2021-11-23 2021-11-23 Deepfake generation method based on vector discretization representation

Publications (2)

Publication Number Publication Date
CN114155139A (en) 2022-03-08
CN114155139B (en) 2022-07-22

Family

ID=80457238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111400589.0A Active CN114155139B (en) 2021-11-23 2021-11-23 Deepfake generation method based on vector discretization representation

Country Status (1)

Country Link
CN (1) CN114155139B (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6415056B1 (en) * 1996-01-22 2002-07-02 Matsushita Electric Industrial, Co., Ltd. Digital image encoding and decoding method and digital image encoding and decoding device using the same
US20130235045A1 (en) * 2012-03-06 2013-09-12 Mixamo, Inc. Systems and methods for creating and distributing modifiable animated video messages
CN103489011A (en) * 2013-09-16 2014-01-01 广东工业大学 Three-dimensional face identification method with topology robustness
US20200288171A1 (en) * 2017-07-03 2020-09-10 Nokia Technologies Oy Apparatus, a method and a computer program for omnidirectional video
CN112446364A (en) * 2021-01-29 2021-03-05 中国科学院自动化研究所 High-definition face replacement video generation method and system
US20210084290A1 (en) * 2017-12-14 2021-03-18 Electronics And Telecommunications Research Institute Image encoding and decoding method and device using prediction network
US20210090217A1 (en) * 2019-09-23 2021-03-25 Tencent America LLC Video coding for machine (vcm) based system and method for video super resolution (sr)
US20210124996A1 (en) * 2019-10-24 2021-04-29 Sony Interactive Entertainment Inc. Encoding and decoding apparatus
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN113240575A (en) * 2021-05-12 2021-08-10 中国科学技术大学 Face counterfeit video effect enhancement method


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DAVID GUERA: "Deepfake Video Detection Using Recurrent Neural Networks", IEEE *
IVAN PEROV: "DeepFaceLab: A simple, flexible and extensible face", arXiv *
WANG Xianxian et al.: "A Facial Expression Generation Method Based on an Improved Conditional Generative Adversarial Network", Journal of Chinese Computer Systems (《小型微型计算机系统》) *
GAO Wei et al.: "Security Issues Behind DeepFake Technology: Opportunities and Challenges", Journal of Information Security Research (《信息安全研究》) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311720A (en) * 2022-08-11 2022-11-08 山东省人工智能研究院 Defekake generation method based on Transformer
CN116246022A (en) * 2023-03-09 2023-06-09 山东省人工智能研究院 Face image identity synthesis method based on progressive denoising guidance
CN116246022B (en) * 2023-03-09 2024-01-26 山东省人工智能研究院 Face image identity synthesis method based on progressive denoising guidance


Similar Documents

Publication Publication Date Title
Gu et al. NTIRE 2022 challenge on perceptual image quality assessment
CN113658051B (en) Image defogging method and system based on cyclic generation countermeasure network
CN114155139B (en) Deepfake generation method based on vector discretization representation
CN109993678B (en) Robust information hiding method based on deep confrontation generation network
CN115311720B (en) Method for generating deepfake based on transducer
CN111369565A (en) Digital pathological image segmentation and classification method based on graph convolution network
Oquab et al. Low bandwidth video-chat compression using deep generative models
CN116246022B (en) Face image identity synthesis method based on progressive denoising guidance
CN110880193A (en) Image compression method using depth semantic segmentation technology
CN115829876A (en) Real degraded image blind restoration method based on cross attention mechanism
CN115546060A (en) Reversible underwater image enhancement method
CN116309107A (en) Underwater image enhancement method based on Transformer and generated type countermeasure network
CN115713680A (en) Semantic guidance-based face image identity synthesis method
CN114449276B (en) Super prior side information compensation image compression method based on learning
CN116612211A (en) Face image identity synthesis method based on GAN and 3D coefficient reconstruction
Fujihashi et al. Wireless 3D point cloud delivery using deep graph neural networks
CN113781324B (en) Old photo restoration method
CN108171325B (en) Time sequence integration network, coding device and decoding device for multi-scale face recovery
CN115879516B (en) Data evidence obtaining method
CN116523985A (en) Structure and texture feature guided double-encoder image restoration method
Huang et al. CLSR: cross-layer interaction pyramid super-resolution network
CN114494387A (en) Data set network generation model and fog map generation method
CN107018287A (en) The method and apparatus for carrying out noise reduction to image using video epitome
Xie et al. Visual Redundancy Removal of Composite Images via Multimodal Learning
CN115272122B (en) Priori-guided single-stage distillation image defogging method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant