WO2024109374A1

WO2024109374A1 - Training method and apparatus for face swapping model, and device, storage medium and program product

Info

Publication number: WO2024109374A1
Application number: PCT/CN2023/124045
Authority: WO
Inventors: 贺珂珂; 朱俊伟; 邰颖; 汪铖杰
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2022-11-22
Filing date: 2023-10-11
Publication date: 2024-05-30
Also published as: CN115565238B; CN115565238A

Abstract

A training method for a face swapping model, comprising: splicing an expression feature of a template image and an identity feature of a face source image to obtain a combined feature (304); by means of a generative network of a face swapping model, performing encoding according to the face source image and the template image, so as to obtain an encoded feature, fusing the encoded feature and the combined feature to obtain a fused feature, and by means of the generative network of the face swapping model, performing decoding according to the fused feature, so as to obtain a face swapped image (306); by means of a discriminative network of the face swapping model, respectively predicting image attribute discrimination results with regard to the face swapped image and a reference image, wherein image attributes comprise a forged image and a non-forged image (308); and calculating the difference between an expression feature of the face swapped image and the expression feature of the template image, calculating the difference between an identity feature of the face swapped image and the identity feature of the face source image, and updating the generative network and the discriminative network according to the image attribute discrimination results with regard to the face swapped image and the reference image, the calculated difference between the expression features and the calculated difference between the identity features (310).

Description

Face-changing model training method, device, equipment, storage medium and program product

This application claims priority to a Chinese patent application filed with the China Patent Office on November 22, 2022, with application number 2022114680626, and entitled “Training method, device, equipment, storage medium and program product for face-changing model”, the entire contents of which are incorporated by reference into this application.

Technical Field

The present application relates to the field of computer technology, and in particular to a training method, apparatus, computer equipment, storage medium and computer program product for a face-changing model.

Background technique

With the rapid development of computer technology and artificial intelligence technology, face replacement technology has emerged. Face replacement refers to replacing the face in the image to be replaced (i.e., template image) with the face in the source image. The goal of face replacement technology is to keep the expression, angle, background and other information of the face in the template image, and to make it as similar as possible to the face in the source image. Face replacement has many application scenarios. For example, video face replacement can be applied to film and television portrait production, game character design, virtual image, privacy protection, etc.

The ability to maintain rich expressions is both the focus and the difficulty for face replacement technology. At present, most face-changing algorithms can achieve satisfactory results in common expression scenes, such as smiling. However, in some scenes with richer expressions, such as pouting, closing eyes, blinking one eye, getting angry, etc., the expression retention effect of the face-changing image is not good, and even some difficult expressions cannot be maintained, resulting in the accuracy of face-changing of face images being affected and the face-changing effect is poor.

Summary of the invention

Based on this, according to the various embodiments provided in the present application, a method, apparatus, computer device, computer-readable storage medium and computer program product for training a face-changing model are provided.

The present application provides a method for training a face-changing model. The method is executed by a computer device and includes:

Acquire a sample triplet, wherein the sample triplet includes a face source image, a template image, and a reference image;

splicing the expression features of the template image and the identity features of the face source image to obtain a combined feature;

By means of the generative network of the face-changing model, encoding is performed according to the face source image and the template image to obtain the encoding features required for face-changing;

Fusing the coding feature with the combined feature to obtain a fused feature;

Decoding is performed according to the fusion features through the generative network of the face-changing model to obtain a face-changing image;

Predicting the image attribute discrimination results of the face-swapped image and the reference image respectively through the discriminant network of the face-swapped model, wherein the image attributes include forged images and non-forged images; and

Calculate the difference between the expression features of the face-swapped image and the expression features of the template image, calculate the difference between the identity features of the face-swapped image and the identity features of the face source image, and update the generation network and the discrimination network based on the image attribute discrimination result of the face-swapped image and the reference image, the difference between the calculated expression features, and the difference between the identity features.

The present application also provides a training device for a face-changing model. The device comprises:

An acquisition module, used for acquiring a sample triplet, wherein the sample triplet includes a face source image, a template image and a reference image;

A splicing module, used for splicing the expression features of the template image and the identity features of the face source image to obtain a combined feature;

A generation module, configured to encode the face source image and the template image through the generation network of the face-changing model to obtain the encoding features required for face-changing, fuse the encoding features with the combined features to obtain the fused features, and decode the fused features through the generation network of the face-changing model to obtain the face-changing image;

A discrimination module, used for predicting the discrimination results of the image attributes of the face-swapped image and the reference image respectively through the discrimination network of the face-swapped model, wherein the image attributes include a forged image and a non-forged image;

An updating module is used to calculate the difference between the expression features of the face-swapped image and the expression features of the template image, calculate the difference between the identity features of the face-swapped image and the identity features of the face source image, and update the generation network and the discrimination network according to the image attribute discrimination result of the face-swapped image and the reference image, the difference between the calculated expression features, and the difference between the identity features.

The present application also provides a computer device, which includes a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the steps of the above-mentioned face-changing model training method when executing the computer-readable instructions.

The present application also provides a computer-readable storage medium having computer-readable instructions stored thereon, which, when executed by a processor, implement the steps of the above-mentioned face-changing model training method.

The present application also provides a computer program product, which includes computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the training method of the above-mentioned face-changing model are implemented.

The details of one or more embodiments of the present application are set forth in the following drawings and description. Other features and advantages of the present application will become apparent from the description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for use in the embodiments or the description of the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying any creative work.

FIG1 is a schematic diagram of image face swapping in one embodiment;

FIG2 is a diagram showing an application environment of a face-changing model training method according to an embodiment;

FIG3 is a schematic diagram of a flow chart of a method for training a face-changing model in one embodiment;

FIG4 is a schematic diagram of a model structure of a face-changing model in one embodiment;

FIG5 is a schematic diagram of a flow chart of a method for training a face-changing model in one embodiment;

FIG6 is a schematic diagram of a training framework of a face-changing model in one embodiment;

FIG7 is a schematic diagram of key points of a face in one embodiment;

FIG8 is a schematic diagram of a training framework of a face-changing model in another embodiment;

FIG9 is a schematic diagram of a feature extraction network in one embodiment;

FIG10 is a schematic diagram of a training framework of a face-changing model in yet another embodiment;

FIG11 is a schematic diagram of a process of video face swapping in one embodiment;

FIG12 is a schematic diagram showing the effect of performing face swapping on a photo in one embodiment;

FIG13 is a structural block diagram of a training device for a face-changing model in one embodiment;

FIG. 14 is a diagram showing the internal structure of a computer device in one embodiment.

Detailed ways

In order to make the purpose, technical solution and advantages of the present application more clearly understood, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not used to limit the present application.

Supervised learning is a machine learning task in which an algorithm can learn or establish a pattern from a labeled training set and infer new instances based on this pattern. The training set consists of a series of training examples, each of which consists of input and supervision information (i.e. expected output, also called labeling information). The output inferred by the algorithm based on the input can be a continuous value or a classification label.

Unsupervised learning is a machine learning task. Algorithms learn patterns, structures, and relationships from unlabeled data to discover hidden information and meaningful structures in the data. Unlike supervised learning, there is no supervised information to guide the learning process in unsupervised learning, and the algorithm needs to discover the inherent patterns of the data on its own.

Generative Adversarial Network (GAN): A method of unsupervised learning that learns by letting two neural networks compete with each other. It consists of a generator network and a discriminator network. The generator network randomly samples from the latent space as input, and its output needs to imitate the samples in the training set as much as possible, that is, its training goal is to generate samples that are as similar as possible to the samples in the training set. The input of the discriminator network is the output of the generator network, and its purpose is to distinguish the samples output by the generator network from the samples in the training set as much as possible, that is, its training goal is to distinguish the samples generated by the generator network from the samples in the training set. The generator network should deceive the discriminator network as much as possible. The two networks compete with each other and continuously update parameters, and finally the generator network can generate samples that are very similar to the samples in the training set.

Face swapping: It is to swap the face in the input face source image with the template image, output the swapped face image, and make the output swapped face image keep the expression, angle, background and other information of the template image. As shown in Figure 1, the input face source image of the face swapping process is face A, and the face in the template image is another face B. Through face swapping, a photo is output in which face B in the template image is replaced with face A.

Face-swapping model: A machine learning model implemented using deep learning and face recognition technology that can extract a person's facial expressions, eyes, mouth and other features from a photo or video and match them with the facial features of another person.

Video face swapping has many application scenarios, such as film and television portrait production, game character design, virtual image, privacy protection, etc. In film and television production, when an actor cannot complete professional actions, professionals can complete them first, and then face swapping technology can be used to automatically replace the human face with the actor. When an actor needs to be replaced, a new face can be replaced by face swapping technology to avoid reshooting, which can save a lot of costs. In virtual image design, for example, in live broadcast scenes, users can use virtual characters to swap faces to increase the fun of live broadcasts and protect personal privacy. The results of video face swapping can also provide anti-attack training materials for services such as face recognition.

GT: Ground Truth, true value, also known as reference information, labeled information or supervised information.

At present, in the related technology, by designing a relatively complex face-changing network and training the face-changing model, satisfactory results can be achieved in common expression scenes, such as smiling, etc. However, in some scenes with richer expressions, such as pouting, closing eyes, blinking one eye, getting angry, etc., the expression retention effect of the face-changing image is not good, and even some difficult expressions cannot be maintained, resulting in poor face-changing effects.

The face-changing model training method provided in the embodiment of the present application can be applied in the application environment shown in FIG. 2. The terminal 102 communicates with the server 104 via a network. The data storage system can store data that the server 104 needs to process. The data storage system can be integrated on the server 104, or placed on the cloud or other servers. It can be, but is not limited to, various personal computers, laptops, smart phones, tablet computers, IoT devices and portable wearable devices. IoT devices can be smart speakers, smart TVs, smart air conditioners, smart car devices, etc. Portable wearable devices can be smart watches, smart bracelets, head-mounted devices, etc. The server 104 can be implemented as an independent server or a server cluster consisting of multiple servers.

In one embodiment, the terminal 102 may have an application client, and the server 104 may be a background server providing services for the application client. The application client may send the image or video collected by the terminal 102 to the server 104. The server 104 may obtain the trained face-changing model through the training method of the face-changing model provided in the present application, and then replace the face in the image or video collected by the terminal 102 with another face or a virtual image through the generation network of the trained face-changing model, and then return it to the terminal 102 in real time. The terminal 102 displays the face-changing image or video through the application client. The application client may be a video client, a social application client, an instant messaging client, and the like.

FIG3 is a flowchart of a training method for a face-changing model provided by the present application. The execution subject of this embodiment can be a computer device or a computer device cluster composed of multiple computer devices. The computer device can be a server or a terminal. Therefore, the execution subject in the embodiment of the present application can be a server or a terminal, or it can be composed of a server and a terminal. Here, the execution subject in the embodiment of the present application is a server as an example, and the following steps are included:

Step 302, obtaining a sample triplet, where the sample triplet includes a face source image, a template image, and a reference image.

In this application, the face-changing model includes a generator network (Generator Network) and a discriminator network (Discriminator Network). The face-changing model is trained through a generative adversarial network (GAN) formed by the generator network and the discriminator network. The specific content will be introduced later.

In the present application, the sample triplet is the sample data used to train the face-changing model. The server can obtain multiple sample triplets for training the face-changing model. Each sample triplet includes a face source image, a template image and a reference image. The face source image is an image that provides a human face, which can be recorded as source. The template image is an image that provides information such as facial expression, posture, image background, etc., which can be recorded as template. Face-changing is to replace the face in the template image with the face in the face source image. At the same time, the face-changing image can maintain the expression, posture, image background, etc. of the template image. The reference image is an image that serves as the supervisory information required for training the face-changing model, which can be recorded as GT. Since the principle of using each sample triplet (or a batch of sample triplets) to train the face-changing model is the same, the process of training the face-changing model using a sample triplet is used as an example to illustrate here.

It can be understood that based on the definition of face swapping, for each sample triplet, the reference image used to provide the supervision information required for model training should have the same identity attributes as the face source image and the same non-identity attributes as the template image. In addition, to ensure the face swapping effect, the face source image should have different identity attributes from the template image. Human faces are usually unique, and identity attributes refer to the identity represented by the face in the image. Having the same identity attributes means that the image is the same face. Non-identity attributes refer to the posture, expression, and makeup of the face in the image. Non-identity attributes also include attributes such as the style and background of the image.

For example, in a video face swapping scenario, the face in the face source image and the face in the reference image are the same person's face, but the facial expressions, makeup, posture of the person and the background in the image may be partially the same or different. The face in the face source image and the face in the template image are the faces of two different people. It is understandable that the face source image and the reference image may also be the same image.

In one embodiment, the sample triplet can be constructed by: obtaining a first image and a second image, An image corresponds to the same identity attribute as a second image, and corresponds to different non-identity attributes, and a third image is obtained, and the third image corresponds to different identity attributes from the first image; an object in the second image is replaced by an object in the third image to obtain a fourth image, and the first image is used as a face source image, the fourth image as a template image, and the second image as a reference image as a sample triplet.

Specifically, the server can randomly obtain the first image, determine the identity information corresponding to the face in the first image, and then obtain another image corresponding to the identity information as the second image, so that the first image and the second image have the same face, that is, they have the same identity attributes. Then, the server can randomly obtain the third image, and the third image corresponds to different identity attributes from the first image, that is, the face in the third image is not the face of the same person as the face in the first image. The server can input the second image and the third image into the face-changing model, and replace the face in the second image with the object in the third image through the generative network of the face-changing model to obtain the fourth image, and the fourth image retains the expression, posture, image background and other characteristics in the second image. It should be noted that the first image, the second image, and the third image are all images including faces, and the server can randomly obtain these images from the face image data set.

For example, the first image contains the face of Mr. A, and the facial expression of the first image is laughing, and the image background is background 1. The second image contains the face of Mr. A, and the facial expression of the second image is smiling, and the image background is background 2. The third image contains the face of Ms. B, and the facial expression of the third image is angry, and the image background is background 3. Obviously, the face of Mr. A is different from the face of Ms. B, that is, the third image has a different face from the first image and the second image. The server replaces the face of Mr. A in the second image with the face of Ms. B to obtain the fourth image, and the expression of the fourth image maintains the smiling expression of the second image, and the background maintains the image background 2. Therefore, the first image is used as the face source image, that is, the face of Mr. A, the laughing expression and the image background 1 are provided, the fourth image is used as the template image, and the face of Ms. B, the smiling expression and the image background 2 are provided, and the second image is used as the reference image, and the face of Mr. A, the smiling expression and the image background 2 are provided, so as to construct a sample triple. It can be seen that the reference image is a real image, not a forged image or a synthetic image.

In this embodiment, the second image serving as the reference image is a real image rather than a forged image. By using such a reference image as a reference, the face-changing image output by the generated network is continuously close to the real reference image, thereby ensuring that the output face-changing image can maintain consistency and smoothness with the non-synthetic parts in terms of shape, lighting, movement, etc., thereby obtaining a high-quality face-changing image or video with a better face-changing effect.

In one embodiment, after the server obtains the above-mentioned sample triples, it can directly input them into the face-changing model to perform model training on the face-changing model.

In one embodiment, after the server obtains the above-mentioned sample triplet, it first performs image preprocessing on the three images in the sample triplet respectively, and uses the preprocessed images to train the face-changing model. Specifically, the preprocessing may include the following aspects: 1. Since the face in the image often only occupies a part of the image, the server may first perform face detection on the image to obtain the face area. The face detection network or face detection algorithm required for face detection may adopt a pre-trained neural network model. 2. Face key point detection, perform key point detection in the face area to obtain the key points of the face, such as the key points of the eyes, mouth corners, and facial contours. 3. Face registration, face registration is to use affine transformation to uniformly "straighten" the face according to the identified key points, try to eliminate the errors caused by different postures, and crop the face image after face registration.

Optionally, the server can obtain the cropped face source image, template image and reference image through the above preprocessing steps, input the cropped image into the face-changing model, and the face-changing model outputs a face-changing image containing only the human face, and then use the output face-changing image to replace the human face area in the template image, so as to obtain the final output face-changing image. Ensure the training effect of the face-changing model.

Step 304 , combining the expression features of the template image and the identity features of the face source image to obtain a combined feature.

The expression features of an image can reflect the expression information expressed by the image. They are features of facial expressions obtained by locating and extracting the organ features, texture areas and predefined feature points of the face. Expression features are the key to expression recognition, and they determine the final expression recognition results. The identity features of an image are biological features that can be used for identity recognition, such as facial features, pupil features, fingerprint features, palm print features, etc. In this application, identity features are facial features based on human face recognition, which can be used for face recognition.

In one embodiment, the server can extract features of the template image through the expression recognition network of the face-changing model to obtain the expression features of the template image, and extract features of the face source image through the face recognition network of the face-changing model to obtain the identity features of the face source image.

In this embodiment, the face-changing model includes not only a generation network and a discrimination network, but also a pre-trained expression recognition network and a face recognition network, and both the expression recognition network and the face recognition network are pre-trained neural network models.

Facial expression recognition is an important research direction in the field of computer vision. Facial expression recognition is the process of predicting the category of emotion expressed by a face by analyzing and processing the face image. The embodiment of the present application does not limit the network structure of the facial expression recognition network. Optionally, the facial expression recognition network can be built based on a convolutional neural network (CNN). The convolutional neural network uses convolutional layers and pooling layers to extract the features of the input facial image, and performs facial expression classification through a fully connected layer.

Optionally, the expression recognition network can be trained by a series of pictures and corresponding expression labels. Specifically, it is necessary to obtain a face image data set containing expression labels, which includes sample face images of different emotion categories, such as happiness, sadness, anger, blinking, single blinking, making faces, and other common expressions and complex expressions. For the expression recognition network built on the basis of the convolutional neural network, the more abstract and advanced feature representations of the sample face image, i.e., the expression features, can be gradually extracted through the multiple convolutional layers and pooling layers stacked in the convolutional neural network. The extracted expression features are classified through the fully connected layer to obtain the predicted results of the facial expressions in the sample face image. According to the difference between the predicted results and the expression labels of the sample face images, the loss function of the expression recognition network can be constructed, and the network parameters of the expression recognition network can be updated based on the loss function, for example, the network parameters of the expression recognition network can be optimized by minimizing the loss function. In this way, multiple updates are performed based on multiple sample face images, and finally a trained expression recognition network is obtained. The trained expression recognition network can be used to extract the expression features of the image. The expression features can be used in this application to constrain the consistency of the expression, i.e., constrain the expression similarity between the face-changing image and the template image. The server can directly extract features from the template image through the trained expression recognition network to obtain the corresponding expression features. The server can also perform face detection on the template image through the expression recognition network, determine the face area in the template image based on the detection results, and then extract features from the face area to obtain the corresponding expression features. The expression features of the template image can be recorded as template_exp_features.

Face recognition is a biometric technique that identifies a person based on their facial features, and is one of the research challenges in the field of biometric recognition. The embodiments of the present application do not limit the network structure used by the face recognition network. Optionally, the face recognition network can be built based on a convolutional neural network (CNN), which uses convolutional layers and pooling layers to extract features of the input face image, and performs identity classification through a fully connected layer. The face recognition network can be trained using a series of images and corresponding identity labels. Specifically, the face recognition network includes multiple stacked convolutional layers and pooling layers, as well as a fully connected layer. The convolutional layer uses a set of learnable filters (also known as pooling layers) to extract features of the input face image. The server can directly perform feature extraction on the face source image through the trained face recognition network to obtain the corresponding identity features. The server can also perform face detection on the template image through the trained face recognition network, determine the face area in the face source image according to the detection results, and then perform feature extraction on the face area to obtain the corresponding identity features. The identity features of the face source image can be recorded as source_id_features.

The combined feature is a feature obtained by the server by splicing the expression feature of the template image with the identity feature of the face source image. For example, if the expression feature is a 1024-dimensional feature and the identity feature is a 512-dimensional feature, the two are concatenated according to the feature dimension to obtain a 1536-dimensional combined feature. Of course, the splicing method is not limited to this, and the embodiments of the present application are not limited to this. For example, a multi-scale feature fusion method can also be used to extract features of different scales from different layers of two networks and fuse them to obtain a combined feature. The combined feature can be recorded as id_exp_features.

The combined features obtained by the server can be subsequently decoded together with the coding features required for face-changing to output the face-changing image. That is to say, in this application, when training the face-changing model, not only the coding features of the template image and the face source image itself are decoded to output the face-changing image, but also the expression features of the template image and the identity features of the face source image are decoded to output the face-changing image, so that the output face-changing image can have both the expression information of the template image and the identity information of the face source image, that is, while keeping the expression of the template image as much as possible, it can also be as similar as possible to the face source image, thereby improving the accuracy and effect of face-changing of face images.

Step 306, encoding the face source image and the template image through the generative network of the face-changing model to obtain the coding features required for face-changing, fusing the coding features with the combined features to obtain fused features, and decoding according to the fused features through the generative network of the face-changing model to obtain the face-changing image.

As shown in Figure 4, it is a schematic diagram of the model structure of a face-changing model in an embodiment. Referring to Figure 4, the face-changing model includes a face recognition network, an expression recognition network, a generation network and a discrimination network.

In this application, the face-changing model is trained by a generative adversarial network (GAN) formed by a generator network and a discriminator network. In one embodiment, referring to FIG4 , the generator network includes an encoder and a decoder. The encoder continuously halves the size (resolution) of the input image through convolution calculation, and the number of channels gradually increases. The encoding process is essentially achieved by applying a convolution kernel (also called a filter) to the input data corresponding to the input image. The encoder consists of multiple convolution kernels and finally outputs a feature vector. The decoder performs a deconvolution operation, gradually doubles the size of the feature training, gradually reduces the number of channels, and reconstructs or generates the image based on the feature.

In one embodiment, encoding is performed based on a face source image and a template image through a generative network of a face-changing model to obtain coding features required for face-changing, including: splicing the face source image and the template image to obtain an input image, inputting the input image into the face-changing model, encoding the input image through the generative network of the face-changing model, and obtaining coding features required for face-changing the template image.

Specifically, the face source image and the template image are both three-channel images. The server can splice the face source image and the template image according to the image channels. The six-channel input image obtained after splicing is input into the encoder of the generation network. Through the encoder, the input image is gradually encoded to obtain an intermediate result in the latent space, namely the encoding feature (which can be recorded as swap_features). For example, the input image is gradually encoded from a resolution of 512*512*6 to 256*256*32, 128*128*64, 64*64*128, 32*32*256, and so on, and finally get an intermediate result in the latent space, called the encoding feature, i.e. swap_features. This encoding feature also has the image information of the face source image and the image information of the template image.

Furthermore, the server may fuse the coding feature with the above-mentioned combined feature to obtain a fused feature, which has both the content of the coding feature and the style of the combined feature.

In one embodiment, the server may calculate the mean and standard deviation of the coding features and the combined features respectively; normalize the coding features according to the mean and standard deviation of the coding features to obtain normalized coding features; and transfer the style of the combined features to the normalized coding features according to the mean and standard deviation of the combined features to obtain fused features.

Specifically, the server can fuse the encoding feature with the combined feature by means of AdaIN (Adaptive Instance Normalization) to obtain the fused feature. The specific principle is shown by the following formula:

x and y are the coded features and combined features, respectively, σ and μ are the standard deviation and mean, respectively. This formula aligns the mean and standard deviation of the coded features with the combined features. μ(x) is the mean of the coded features, σ(x) is the standard deviation of the coded features, σ(y) is the standard deviation of the combined features, and μ(y) is the mean of the combined features. It can be understood that both the coded features and the combined features are a multi-channel two-dimensional matrix. For example, the matrix size of the coded features is 32*32*256. For each channel, the mean and standard deviation of the corresponding channel can be calculated based on the values of all elements to obtain the mean and standard deviation of the coded features in each channel. The same is true for the combined features, that is, for each channel of the combined features, the mean and standard deviation of the corresponding channel can be calculated based on the values of all elements to obtain the mean and standard deviation of the combined features in each channel.

First, the server uses the mean and standard deviation of the coding features to normalize the coding features. That is, the normalized coding features are obtained by subtracting the mean of the coding features from the coding features and dividing them by the standard deviation of the coding features. The coding features are normalized, and the mean of the normalized features is 0 and the standard deviation is 1, which eliminates the original style of the coding features and retains the original content of the coding features. Next, the style of the combined features is transferred to the normalized coding features using the mean and standard deviation of the combined features. That is, the normalized coding features are multiplied by the standard deviation of the combined features and then added to the mean of the combined features to obtain the fused features. In this way, the obtained fused features retain the content of the coding features and have the style of the combined features.

It can be understood that, as mentioned above, the coding features have both the image information of the face source image and the image information of the template image, and the combined features have both the expression features and identity features required for face changing. Then, by fusing the coding features and the combined features in this way, the fused features obtained can make the face in the decoded face-changing image similar to the face in the face source image, while allowing the face-changing image to retain the expression, posture and image background of the face in the template image, thereby improving the accuracy of the output face-changing image.

Of course, the server can also fuse the coding features with the combined features in other ways, such as batch normalization, instance normalization, conditional instance normalization, etc. The embodiment of the present application does not limit the fusion method.

After obtaining the fused feature, the server inputs the fused feature into the decoder of the generation network. Through the deconvolution operation of the decoder, the resolution of the fused feature is gradually doubled, the number of channels is gradually reduced, and the face-swapped image is output. For example, the resolution of the fused feature is 32*32*256, and through the gradual deconvolution operation of the decoder, 64*64*128, 128*128*64, 256*256*32, 512*512*3 are output in sequence, and finally the face-swapped image is output.

Step 308, using the discriminant network of the face-changing model, respectively predict the image attribute discrimination results of the face-changing image and the reference image, where the image attributes include forged images and non-forged images.

Referring to Figure 4, the face-changing model also includes a discriminant network, which is used to determine whether the input image is a forged image or a non-forged image. After the face-changing image is output through the generation network, the server inputs the face-changing image into the discriminant network, and the discriminant network extracts features of the input face-changing image to obtain low-dimensional discriminant information, and classifies the image attributes based on the extracted discriminants to obtain corresponding image attribute discrimination results. In the present application, the classification of the discriminant network is a binary classification of image attributes, that is, to determine whether the image is a forged image or a non-forged image. Forged images are also called synthetic images, and non-forged images are also called real images.

In addition, the server will input the reference image in the sample triplet into the discriminant network, extract features from the input reference image through the discriminant network to obtain low-dimensional discriminant information, classify the image attributes based on the extracted discriminants, and obtain the corresponding image attribute discrimination results.

In one embodiment, the discriminant network of the face-changing model obtains the corresponding image attribute discrimination result according to the face-changing image and the reference image, including: inputting the face-changing image into the discriminant network of the face-changing model to obtain the first probability that the face-changing image is a non-forged image; inputting the reference image into the discriminant network of the face-changing model to obtain the second probability that the reference image is a non-forged image. It can be understood that the training goal of the discriminant network is to make the first probability output by the discriminant network as small as possible and the second probability output as large as possible, so that the discriminant network has better performance.

Step 310, calculate the difference between the expression features of the face-swapped image and the expression features of the template image, calculate the difference between the identity features of the face-swapped image and the identity features of the face source image, and update the generation network and the discrimination network based on the image attribute discrimination results of the face-swapped image and the reference image, the difference between the calculated expression features, and the difference between the identity features.

In the present application, the face-changing model includes a generating network and a discriminating network. The generating network and the discriminating network perform adversarial training on the image attribute discrimination results of the real reference data and the output forged data based on the discriminating network. In addition, in the embodiment of the present application, referring to FIG. 4, in order to make the output face-changing image retain the expression of the face in the template image as much as possible and retain the identity attribute of the face source image, during the training process, the server will also calculate the difference between the expression features of the face-changing image and the expression features of the template image, and calculate the difference between the identity features of the face-changing image and the identity features of the face source image. According to the calculated difference and the image attribute discrimination results of the face-changing image and the reference image output by the discriminating network, the loss function of the entire face-changing model is jointly constructed, and the network parameters of the generating network and the discriminating network are optimized and updated with the goal of minimizing the loss function. It should be noted that the embodiment of the present application does not limit the specific network structure adopted by the generating network and the discriminating network, and only requires the generating network to support the above-mentioned image reconstruction and generation capabilities and the discriminating network to support the above-mentioned image attribute discrimination capabilities. In addition, the expression features of the face-changing image can be obtained by extracting image features through the above-mentioned expression recognition network, and the identity features of the face-changing image can be obtained by extracting image features through the above-mentioned face recognition network.

In one embodiment, the server alternately: when the network parameters of the generation network are fixed, according to the first probability that the face-swapped image belongs to the non-forged image and the second probability that the reference image belongs to the non-forged image, construct the discriminant loss of the discriminant network, and update the network parameters of the discriminant network using the discriminant loss. When the network parameters of the generation network are fixed, according to the first probability that the face-swapped image belongs to the non-forged image and the second probability that the reference image belongs to the non-forged image, construct the generation loss of the generation network, according to the first probability that the face-swapped image belongs to the non-forged image, and update the network parameters of the discriminant network using the discriminant loss. The expression loss is constructed based on the difference between the expression features of the face-changing image and the expression features of the template image. The identity loss is constructed based on the difference between the identity features of the face-changing image and the identity features of the face source image. The face-changing loss of the generation network is constructed based on the generation loss, expression loss and identity loss. The network parameters of the generation network are updated using the face-changing loss. The training is repeated until the training stop condition is met, and the trained discriminant network and generation network are obtained.

In this embodiment, the training of the face-changing model includes two alternating stages, the first stage is to train the discriminant network, and the second stage is to train the generative network.

The training goal of the first stage is to make the discriminant network identify the face-swapped image as a forged image as much as possible, and to make the discriminant network identify the reference image as a non-forged image as much as possible. Therefore, in the first stage, the parameters of the generation network are fixed, the sample triples are input into the face-swapped model, and after the face-swapped image is output, the server updates the network parameters of the discriminant network according to the image attribute discrimination results of the face-swapped image and the reference image predicted by the discriminant network. That is, when the network parameters of the generation network are fixed, the server constructs the discriminant loss of the discriminant network according to the first probability that the face-swapped image belongs to a non-forged image and the second probability that the reference image belongs to a non-forged image, and uses the discriminant loss to update the network parameters of the discriminant network.

Optionally, the discriminant loss of the discriminant network can be expressed by the following formula:
D_Loss = -logD(GT) - log(1 - D(fake));

D represents the discriminant network, GT is the reference image, fake is the face-swapped image, D(fake) represents the first probability that the face-swapped image is a non-fake image, and D(GT) represents the second probability that the reference image is a non-fake image.

The training goal of the second stage is to make the face-swapped images output by the generator network "deceive" the discriminant network as much as possible, and be predicted by the discriminant network as non-forged images. Therefore, in the second stage, the parameters of the discriminant network are fixed, and the same batch of sample triplets are input into the face-swapped model. After the generator network outputs the face-swapped images, the loss function for training the generator network is constructed based on the image attribute discrimination results of the face-swapped images and the reference images predicted by the discriminant network, and the network parameters of the generator network are updated according to the loss function.

Optionally, in the second stage, in addition to the generation loss of the generation network, the server also introduces expression loss and identity loss in the loss function used to train the generation network. Specifically, the server extracts features of the face-changing image through the expression recognition network of the face-changing model to obtain the expression features of the face-changing image, and extracts features of the face-changing image through the face recognition network of the face-changing model to obtain the identity features of the face-changing image. Both the expression recognition network and the face recognition network are pre-trained neural network models.

Thus, in stage two, the server can construct the generation loss of the generation network according to the first probability that the face-swapped image is a non-forged image, construct the expression loss according to the difference between the expression features of the face-swapped image and the expression features of the template image, construct the identity loss according to the difference between the identity features of the face-swapped image and the identity features of the face source image, construct the face-swapped loss of the generation network according to the generation loss, expression loss and identity loss, and use the face-swapped loss to update the network parameters of the generation network.

In one embodiment, the generation loss of the generation network can be expressed by the following formula:
G_Loss = log(1-D(fake));

In one embodiment, the expression loss of the generative network can be expressed by the following formula:
Exp_features_loss=(template_exp_features-fake_exp_features) ² ;

template_exp_features is the expression features of the template image, and fake_exp_features is the expression features of the face-swapped image.

In one embodiment, the identity loss of the generated network can be expressed by the following formula:
ID_loss = 1 - cosine_similarity(fake_id_features, source_id_features);

cosine_similarity() is the cosine similarity, fake_id_features is the identity features of the face-swapped image, and source_id_features is the identity features of the face source image.

As shown in FIG5 , it is a flowchart of a method for training a face-changing model in an embodiment. The method can be executed by a computer device, and specifically includes the following steps:

Step 502, obtaining a sample triplet, where the sample triplet includes a face source image, a template image, and a reference image;

Step 504, extracting features of the template image through the expression recognition network of the face-changing model to obtain expression features of the template image;

Step 506, extracting features from the face source image through the face recognition network of the face-changing model to obtain identity features of the face source image;

Step 508, combining the expression features of the template image and the identity features of the face source image to obtain a combined feature;

Step 510, splicing the face source image and the template image to obtain an input image, inputting the input image into the face-changing model, encoding the input image through the generative network of the face-changing model, and obtaining the encoding features required for face-changing the template image;

Step 512, respectively calculating the mean and standard deviation of the coding feature and the combined feature, normalizing the coding feature according to the mean and standard deviation of the coding feature to obtain the normalized coding feature, and migrating the style of the combined feature to the normalized coding feature according to the mean and standard deviation of the combined feature to obtain the fusion feature;

Step 514, decoding the fused features through the generative network of the face-changing model to obtain a face-changing image;

Step 516, inputting the face-swapped image into the discriminant network of the face-swapped model to obtain a first probability that the face-swapped image is a non-forged image;

Step 518, inputting the reference image into the discriminant network of the face-changing model to obtain a second probability that the reference image is a non-forged image;

Step 520, when the network parameters of the generating network are fixed, a discriminant loss for the discriminant network is constructed according to a first probability that the face-swapped image is a non-forged image and a second probability that the reference image is a non-forged image, and the network parameters of the discriminant network are updated using the discriminant loss;

Step 522, under the condition of fixing the network parameters of the discriminant network, extracting features of the face-changing image through the expression recognition network of the face-changing model to obtain the expression features of the face-changing image; extracting features of the face-changing image through the face recognition network of the face-changing model The invention relates to a method for generating a face-changing image by extracting features of the face image to obtain identity features of the face-changing image; and constructing a generation loss of the generation network according to a first probability that the face-changing image is a non-forged image, constructing an expression loss according to the difference between the expression features of the face-changing image and the expression features of the template image, constructing an identity loss according to the difference between the identity features of the face-changing image and the identity features of the face source image, constructing a face-changing loss about the generation network according to the generation loss, the expression loss and the identity loss, and using the face-changing loss to update the network parameters of the generation network.

In the training method of the above-mentioned face-changing model, when training the face-changing model, not only the encoding features of the template image and the face source image themselves are involved in decoding to output the face-changing image, but also the expression features of the template image and the identity features of the face source image are involved in decoding to output the face-changing image, so that the output face-changing image can have both the expression information of the template image and the identity information of the face source image, that is, while maintaining the expression of the template image, it can also be similar to the face source image. In addition, the face-changing model is updated by the difference between the expression features of the template image and the expression features of the face-changing image, and the difference between the identity features of the face source image and the identity features of the face-changing image. The former can constrain the expression similarity between the face-changing image and the template image, and the latter can constrain the identity similarity between the face-changing image and the face source image. In this way, even if the expression of the template image is relatively complex, the output face-changing image can still maintain this complex expression, thereby improving the face-changing effect. Moreover, when updating the network parameters of the generative network and the discriminative network of the face-changing model, the generative network and the discriminative network will be trained adversarially based on the image attribute discrimination results predicted by the discriminative network for the face-changing image and the reference image, thereby improving the image quality of the face-changing image output by the face-changing model as a whole.

In one embodiment, as shown in FIG6 , the present application also introduces a pre-trained facial key point network when training the face-changing model, and trains the generation network of the face-changing model according to the difference between the facial key point information of the template image and the face-changing image. Specifically, the above method may also include: using the pre-trained facial key point network, respectively performing facial key point recognition on the template image and the face-changing image to obtain the respective facial key point information; constructing a key point loss according to the difference between the facial key point information of the template image and the face-changing image; and the key point loss is used to participate in the training of the generation network of the face-changing model.

In order to better achieve the effect of maintaining complex facial expressions in the template image when the facial expressions are special and complex, the face-changing image generated can optionally introduce a face key point network when training the face-changing model. The face key point network can locate the positions of the facial key points on the image, and thus construct a key point loss based on the difference between the facial key point information of the template image and the face-changing image, and participate in the training of the generation network, so as to ensure the consistency of the expressions of the template image and the face-changing image.

The key points of the face are the pixels where the facial features related to the facial expressions are located on the face in the image, such as the pixels where the eyebrows, mouth, eyes, nose, and facial contours are located. As shown in FIG7 , it is a schematic diagram of the key points of the face in one embodiment. In FIG7 , 97 key points of the face are illustrated, 0-32 points are the facial contour, 33-50 are the eyebrow contour, 51-59 are the nose, 60-75 are the eye contour, 76-95 are the mouth contour, and 96 and 97 are the positions of the pupils. Of course, the key point network of the face can also locate more key points of the face, for example, some can locate 256 key points of the face.

Facial key point detection is a process of locating the key points of the face based on the input face area. Affected by factors such as light, occlusion, and posture, facial key point detection is also a challenging task.

In one embodiment, the server locates the facial key points of the face-swapped image and the template image respectively through a pre-trained facial key point network. For some or all facial key points, the server calculates the square of the eigenvalue difference between the face-swapped image and the template image according to the eigenvalue of the same facial key point, and then sums them up, which is recorded as the key point loss landmark_loss. During training, it is hoped that the key point loss is as small as possible. For example, for the 95th key point, the square of the eigenvalue difference is calculated. According to the feature values of the 95th facial key point corresponding to the expression feature fake_landmark of the face-swapped image and the expression feature template_landmark of the template image, the square of the difference is calculated, and the facial key points are summed in this way to obtain the key point loss. Of course, in some embodiments, the server can also characterize the expression difference between the face-swapped image and the template image only based on the difference in the feature values of the key points where the eyebrows, mouth, and eyes are located.

The embodiment of the present application does not limit the network structure of the facial key point network used. Optionally, the facial key point network can be built based on a convolutional neural network. For example, by designing a cascaded convolutional neural network with three levels, the feature extraction capability of multi-level convolution is utilized to gradually obtain accurate features from coarse to fine, and then the fully connected layer is used to predict the position of the facial key points. When training the facial key point network, it is necessary to obtain a sample facial image data set, each image in the sample facial image data set has corresponding key point annotation information, that is, the position data of the facial key points. The sample facial image is input into the facial key point network, and the predicted position of each key point is output through the facial key point network, thereby calculating the difference between the annotated position and the predicted position of each key point, and summing the differences corresponding to all key points to obtain the predicted difference of the entire sample facial image. A loss function is constructed based on the predicted difference, and the network parameters of the facial key point network are optimized by minimizing the loss function.

In this embodiment, by introducing a facial key point network and key point loss when training the face-changing model, the generative network of the trained face-changing model can output a face-changing image with better expression retention effect.

In one embodiment, as shown in FIG8 , the present application also introduces a pre-trained feature extraction network when training the face-changing model, and trains the generation network of the face-changing model according to the difference between the image features of the template image and the face-changing image. Specifically, the above method may also include: extracting image features of the face-changing image and the reference image respectively through the pre-trained feature extraction network to obtain their respective image features; constructing a similarity loss according to the difference between the image features of the face-changing image and the reference image; and the similarity loss is used to participate in the training of the generation network of the face-changing model.

In this embodiment, in order to measure the difference between the face-swapped image and the reference image at the feature level, it is hoped that the features of the generated face-swapped image are similar to those of the reference image. Optionally, when training the face-swapped model, a similarity loss is also introduced. The similarity loss can be, for example, the learned perceptual image patch similarity (LPIPS). The pre-trained feature extraction network is used to extract the features of the face-swapped image and the reference image at different levels, compare the feature differences between the face-swapped image and the reference image at the same level, and construct a similarity loss. During training, it is hoped that the feature differences between the face-swapped image and the reference image are as small as possible. The embodiment of the present application does not limit the network structure of the feature extraction network used.

As shown in Figure 9, it is a schematic diagram of a feature extraction network in an embodiment. Referring to Figure 9, in the feature extraction process, the deeper the level, the smaller the resolution of the feature. The low-level features can represent low-level features such as lines and colors, and the high-level features can represent high-level features such as parts and objects. By comparing the image features extracted from two images, it can be used to measure the overall similarity of the two images.

Refer to Figure 9 for the visualization of features at different network layers. The feature extraction network includes five convolution operations. The resolution of the input image is 224*224*3. After the first-level convolution operation Conv1, the first-level image features are extracted, denoted as fake_fea1, with a resolution of 55*55*96. After the second-level convolution Conv2 and pooling operation, the second-level image features are extracted, denoted as fake_fea2, with a resolution of 27*27*256. The third-level convolution Conv3 and pooling operation extract the third-level image features, denoted as fake_fea3, with a resolution of 13*13*384. Finally, the fourth-level convolution operation Conv5 and pooling operation are used to obtain image features, denoted as fake_fea4, with a resolution of 13*13*256. Finally, a fully connected layer is used to obtain an output vector with a dimension of 1000 for image classification or target detection.

In one embodiment, the image features extracted by the server from the face-swapped image through the feature extraction network can be recorded as:
feature(fake) = (fake_fea1, fake_fea2, fake_fea3, fake_fea4);

Similarly, the image features extracted by the server from the reference image through the feature extraction network can be recorded as:
feature(GT) = (GT_fea1, GT_fea2, GT_fea3, GT_fea4);

The similarity loss can be expressed by the following formula:

In this embodiment, by constructing a similarity loss based on the similarity between the features of the face-changing image and the reference image when training the face-changing model, and participating in the training of the generative network of the face-changing model, the generative network of the trained face-changing model can output face-changing images with realistic face-changing effects.

In one embodiment, the present application also introduces reconstruction loss when training the face-changing model. According to the pixel-level difference between the reference image and the face-changing image, the reconstruction loss is constructed to train the generative network of the face-changing model. Specifically, the above method may also include: constructing a reconstruction loss according to the pixel-level difference between the face-changing image and the reference image; the reconstruction loss is used to participate in the training of the generative network of the face-changing model. During training, it is hoped that the pixel-level difference between the face-changing image and the reference image is as small as possible. The reconstruction loss can be expressed by the following formula:
Reconstruction_loss = |fake-GT|.

This formula represents the difference between the fake face-swapped image and the reference image GT of the same size. Specifically, the server can calculate the difference in pixel values corresponding to the same pixel position of the two images, sum the differences of all pixel positions, and obtain the overall difference between the two images at the image pixel level. This overall difference can be used to construct the reconstruction loss.

It is understandable that when training the face-changing model, in the training stage of the generative network, the above-mentioned generation loss, expression loss, identity loss, key point loss, similarity loss and reconstruction loss can be introduced at the same time to construct the overall face-changing loss of the generative network, in the hope of achieving better face-changing effects that retain complex expressions through these constraints in various aspects.

As shown in FIG10 , it is a schematic diagram of the training architecture of the face-changing model in a specific embodiment. Referring to FIG10 , the networks introduced when training the face-changing model include: a generation network, a discrimination network, an expression recognition network, a face recognition network, a face key point network, and a feature extraction network. In conjunction with FIG10 , the training process of the face-changing model is described as follows:

The server obtains training samples, which include multiple sample triplets, and the sample triplets include a face source image, a template image, and a reference image.

Next, the server extracts features from the template image through a pre-trained expression recognition network to obtain the expression features of the template image. The server extracts features from the face source image through a pre-trained face recognition network to obtain the identity features of the face source image, and then concatenates the expression features of the template image with the identity features of the face source image to obtain the combined features.

Next, the server also splices the face source image with the template image to obtain an input image, inputs the input image into the face-changing model, encodes the input image through the generative network of the face-changing model, and obtains the encoding features required for face-changing the template image.

Next, the server fuses the encoded features with the combined features to obtain fused features, and decodes them according to the fused features through the generative network of the face-changing model to obtain the face-changing image.

Next, the server inputs the face-changing image into the discriminant network of the face-changing model through the discriminant network of the face-changing model to obtain a first probability that the face-changing image is a non-forged image, and inputs the reference image into the discriminant network of the face-changing model to obtain a second probability that the reference image is a non-forged image.

Next, while fixing the network parameters of the generating network, a discriminative loss for the discriminative network is constructed based on a first probability that the face-swapped image is a non-forged image and a second probability that the reference image is a non-forged image, and the network parameters of the discriminative network are updated using the discriminative loss.

Next, under the condition that the network parameters of the discriminant network are fixed, the server re-inputs the face-swapped image into the updated discriminant network to obtain the first probability that the face-swapped image is a non-forged image, and constructs the generation loss of the generation network according to the first probability that the face-swapped image is a non-forged image. Through the expression recognition network of the face-swapped model, feature extraction is performed on the face-swapped image to obtain the expression features of the face-swapped image, and the expression loss is constructed according to the difference between the expression features of the face-swapped image and the expression features of the template image. Through the face recognition network of the face-swapped model, feature extraction is performed on the face-swapped image to obtain the identity features of the face-swapped image, and the identity loss is constructed according to the difference between the identity features of the face-swapped image and the identity features of the face source image. Through the pre-trained face key point network, face key points of the template image and the face-swapped image are respectively recognized to obtain the respective face key point information, and the key point loss is constructed according to the difference between the face key point information of the template image and the face-swapped image. Through the pre-trained feature extraction network, image feature extraction is performed on the face-swapped image and the reference image to obtain the respective image features, and the similarity loss is constructed according to the difference between the image features of the face-swapped image and the reference image. According to the pixel-level difference between the face-swapped image and the reference image, the reconstruction loss is constructed. Finally, according to the generation loss, expression loss, identity loss, key point loss, similarity loss, and reconstruction loss, the face-swapped loss of the generation network is constructed, and the network parameters of the generation network are updated using the face-swapped loss.

According to this alternating training method, when the training stop condition is met, a trained face-changing model can be obtained.

In one embodiment, after obtaining a trained face-changing model, the server can use the generative network, pre-trained expression recognition network and face recognition network in the trained face-changing model to perform face-changing on the target image or target video to obtain a face-changing image or face-changing video.

Taking face swapping of a target video as an example, the following steps are included: video acquisition, image input, face detection, cropping of face areas, video face swapping with expression optimization, and result display.

As shown in FIG11, it is a schematic diagram of the process of video face swapping in one embodiment. The execution subject of this embodiment can be a computer device or a computer device cluster composed of multiple computer devices. The computer device can be a server or a terminal. Referring to FIG11, the following steps are included:

Step 1102, obtaining the video to be replaced with the face and the face source image containing the target face.

The face source image may be an original image containing a human face, or may be a cropped image containing only a human face obtained by performing face detection and configuration on the original image.

Step 1104, for each video frame of the face-changing video, the trained expression recognition network is used to extract features of the video frame to obtain expression features of the video frame.

The server can directly perform subsequent processing on the video frame, or perform face detection on the video frame and obtain a cropped image containing only the face after configuration.

Step 1106, extracting features from the face source image through the trained face recognition network to obtain identity features of the face source image.

Step 1108, concatenate the expression feature and the identity feature to obtain a combined feature.

Step 1110, encoding is performed based on the face source image and video frame containing the target face through the generative network of the trained face-changing model to obtain the coding features required for face-changing.

Step 1112, fusing the coding feature and the combined feature to obtain a fused feature.

Step 1114, through the generative network of the trained face-changing model, decoding is performed according to the fusion features, and a face-changing video is outputted in which the object in the video frame is replaced with the target face.

As shown in Figure 12, it is a schematic diagram of the effect of face swapping of photos in one embodiment. The face swapping model trained by the face swapping model training method provided in the embodiment of the present application can still maintain a good face swapping effect under complex expressions, and can be used in a variety of scenarios such as ID photo production, film and television portrait production, game character design, virtual image, privacy protection, etc. The expression of the face in the template image can still be maintained under complex expressions, and the face swapping requirements in some complex expression scenes in film and television can be met. Moreover, in the video scene, the expression is maintained smoothly and naturally.

It should be understood that, although the steps in the flowcharts involved in the above embodiments are displayed in sequence according to the indication of the arrows, these steps are not necessarily executed in sequence according to the order indicated by the arrows. Unless there is a clear explanation in this article, the execution of these steps is not strictly limited in order, and these steps can be executed in other orders. Moreover, at least a part of the steps in the flowcharts involved in the above embodiments may include multiple steps or multiple stages, and these steps or stages are not necessarily executed at the same time, but can be executed at different times, and the execution order of these steps or stages is not necessarily carried out in sequence, but can be executed in turn or alternately with other steps or at least a part of the steps or stages in other steps.

Based on the same inventive concept, the embodiment of the present application also provides a face-changing model training device for implementing the face-changing model training method involved above. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the above method, so the specific limitations in the embodiments of one or more face-changing model training devices provided below can refer to the limitations of the face-changing model training method above, and will not be repeated here.

In one embodiment, as shown in FIG. 13 , a training device 1300 for a face-changing model is provided, comprising: an acquisition module 1302 , a splicing module 1304 , a generation module 1306 , a determination module 1308 , and an update module 1310 , wherein:

An acquisition module 1302 is used to acquire a sample triplet, where the sample triplet includes a face source image, a template image, and a reference image;

A splicing module 1304 is used to splice the expression features of the template image and the identity features of the face source image to obtain a combined feature;

The generation module 1306 is used to encode the face source image and the template image through the generation network of the face-changing model to obtain the encoding features required for face-changing, fuse the encoding features with the combined features to obtain the fused features, and decode according to the fused features through the generation network of the face-changing model to obtain the face-changing image;

A discrimination module 1308, for predicting the image attribute discrimination results of the face-swapped image and the reference image respectively through the discrimination network of the face-swapped model, where the image attributes include forged images and non-forged images;

The updating module 1310 is used to calculate the difference between the expression features of the face-swapped image and the expression features of the template image, calculate the difference between the identity features of the face-swapped image and the identity features of the face source image, and update the generation network and the discrimination network based on the image attribute discrimination results of the face-swapped image and the reference image, the difference between the calculated expression features, and the difference between the identity features.

In one embodiment, the acquisition module 1302 is also used to acquire a first image and a second image, wherein the first image and the second image correspond to the same identity attribute and different non-identity attributes; acquire a third image, wherein the third image and the first image correspond to different identity attributes; replace the object in the second image with the object in the third image to obtain a fourth image; and use the first image as a face source image, the fourth image as a template image, and the second image as a reference image as a sample triplet.

In one embodiment, the face-changing model training device 1300 further includes:

The expression recognition module is used to extract features of the template image through the expression recognition network of the face-changing model to obtain the expression features of the template image;

A face recognition module is used to extract features of the face source image through the face recognition network of the face-changing model to obtain identity features of the face source image;

Both the expression recognition network and the face recognition network are pre-trained neural network models.

In one embodiment, the generation module 1306 is also used to splice the face source image with the template image to obtain an input image; input the input image to the face-changing model; encode the input image through the generation network of the face-changing model to obtain the encoding features required for face-changing the template image.

The fusion module is used to calculate the mean and standard deviation of the coding features and the combined features respectively; according to the mean and standard deviation of the coding features, the coding features are normalized to obtain the normalized coding features; according to the mean and standard deviation of the combined features, the style of the combined features is transferred to the normalized coding features to obtain the fusion features.

In one embodiment, the discrimination module 1308 is further used to input the face-swapped image into the discriminant network of the face-swapped model to obtain a first probability that the face-swapped image is a non-forged image; and input the reference image into the discriminant network of the face-swapped model to obtain a second probability that the reference image is a non-forged image.

The expression recognition module is used to extract features of the face-swapped image through the expression recognition network of the face-swapped model to obtain the expression features of the face-swapped image;

A face recognition module is used to extract features of the face-swapped image through the face recognition network of the face-swapped model to obtain identity features of the face-swapped image;

In one embodiment, the updating module 1310 is further used to alternately, when the network parameters of the generating network are fixed, construct a discriminant loss about the discriminant network according to a first probability that the face-swapped image belongs to a non-forged image and a second probability that the reference image belongs to a non-forged image, and update the network parameters of the discriminant network using the discriminant loss; when the network parameters of the discriminant network are fixed, construct a generating loss of the generating network according to the first probability that the face-swapped image belongs to a non-forged image, construct an expression loss according to the difference between the expression features of the face-swapped image and the expression features of the template image, construct an identity loss according to the difference between the identity features of the face-swapped image and the identity features of the face source image, and construct an identity loss according to the generating loss, the expression features, and the identity features. The face-changing loss of the generated network is constructed by combining emotion loss and identity loss. The network parameters of the generated network are updated using the face-changing loss until the training stop condition is met, and the trained discriminant network and generative network are obtained.

The key point positioning module is used to identify the facial key points of the template image and the face-swapped image through a pre-trained facial key point network to obtain their respective facial key point information;

The updating module 1310 is also used to construct a key point loss according to the difference between the facial key point information of the template image and the face-changing image; the key point loss is used to participate in the training of the generative network of the face-changing model.

An image feature extraction module is used to extract image features from the face-swapped image and the reference image through a pre-trained feature extraction network to obtain their respective image features;

The updating module 1310 is also used to construct a similarity loss according to the difference between the image features of the face-changing image and the reference image; the similarity loss is used to participate in the training of the generative network of the face-changing model.

In one embodiment, the updating module 1310 is further used to construct a reconstruction loss according to the pixel-level difference between the face-changing image and the reference image; the reconstruction loss is used to participate in the training of the generative network of the face-changing model.

The face-changing module is used to obtain the video to be replaced and the face source image containing the target face; for each video frame of the video to be replaced, the expression features of the video frame are obtained; the identity features of the face source image containing the target face are obtained; the expression features and the identity features are spliced to obtain the combined features; through the generative network of the trained face-changing model, the face source image containing the target face and the video frame are encoded to obtain the coding features required for face-changing, and the fused features obtained by fusing the coding features and the combined features are decoded to output the face-changing video in which the object in the video frame is replaced with the target face.

In the training device 1300 of the face-changing model, when training the face-changing model, not only the coding features of the template image and the face source image themselves are involved in decoding to output the face-changing image, but also the expression features of the template image and the identity features of the face source image are involved in decoding to output the face-changing image, so that the output face-changing image can have both the expression information of the template image and the identity information of the face source image, that is, while maintaining the expression of the template image, it can also be similar to the face source image. In addition, the face-changing model is updated by the difference between the expression features of the template image and the expression features of the face-changing image, and the difference between the identity features of the face source image and the identity features of the face-changing image. The former can constrain the expression similarity between the face-changing image and the template image, and the latter can constrain the identity similarity between the face-changing image and the face source image. In this way, even if the expression of the template image is relatively complex, the output face-changing image can still maintain such a complex expression, thereby improving the face-changing effect. Moreover, when updating the network parameters of the generative network and the discriminative network of the face-changing model, the generative network and the discriminative network will be trained adversarially based on the image attribute discrimination results predicted by the discriminative network for the face-changing image and the reference image, thereby improving the image quality of the face-changing image output by the face-changing model as a whole.

Each module in the above-mentioned face-changing model training device 1300 can be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above-mentioned modules can be embedded in or independent of a processor in a computer device in the form of hardware, or can be stored in a memory in a computer device in the form of software, so that the processor can call and execute operations corresponding to each of the above modules.

In one embodiment, a computer device is provided, which may be a server or a terminal, and its internal structure diagram may be shown in FIG14. The computer device includes a processor, a memory, an input/output interface (I/O for short), and a communication interface. The processor, the memory, and the input/output interface are connected via a system bus, and the communication interface is connected to the system bus via the input/output interface. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage The medium stores an operating system, computer-readable instructions and a database. The internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used to communicate with the external device through a network connection. When the computer-readable instructions are executed by the processor, a training method for a face-changing model is implemented.

Those skilled in the art will understand that the structure shown in FIG. 14 is merely a block diagram of a partial structure related to the scheme of the present application, and does not constitute a limitation on the computer device to which the scheme of the present application is applied. The specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the training method steps of the face-changing model provided in any embodiment of the present application are implemented.

In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, the training method steps of the face-changing model provided in any embodiment of the present application are implemented.

In one embodiment, a computer program product is provided, including computer-readable instructions, which, when executed by a processor, implement the steps of the face-changing model training method provided in any embodiment of the present application.

It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with relevant laws, regulations and standards of relevant countries and regions.

A person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment method can be completed by instructing the relevant hardware through computer-readable instructions, and the computer-readable instructions can be stored in a non-volatile computer-readable storage medium. When the computer-readable instructions are executed, they can include the processes of the embodiments of the above-mentioned methods. Any reference to the memory, database or other medium used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. As an illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The database involved in each embodiment provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include distributed databases based on blockchains, etc., but are not limited to this. The processor involved in each embodiment provided in this application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, etc., but are not limited to this.

The technical features of the above embodiments may be arbitrarily combined. To make the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

The above-mentioned embodiments only express several implementation methods of the present application, and the descriptions thereof are relatively specific and detailed, but they cannot be understood as limiting the scope of the present application. It should be pointed out that it is within the skill of ordinary technicians in this field to In other words, without departing from the concept of the present application, several modifications and improvements can be made, which all belong to the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the attached claims.

Claims

A face-changing model training method is performed by a computer device, the method comprising:

Acquire a sample triplet, wherein the sample triplet includes a face source image, a template image, and a reference image;

splicing the expression features of the template image and the identity features of the face source image to obtain a combined feature;

By means of the generative network of the face-changing model, encoding is performed according to the face source image and the template image to obtain the encoding features required for face-changing;

Fusing the coding feature with the combined feature to obtain a fused feature;

Decoding is performed according to the fusion features through the generative network of the face-changing model to obtain a face-changing image;

Predicting the image attribute discrimination results of the face-swapped image and the reference image respectively through the discriminant network of the face-swapped model, wherein the image attributes include a forged image and a non-forged image; and

Calculate the difference between the expression features of the face-swapped image and the expression features of the template image, calculate the difference between the identity features of the face-swapped image and the identity features of the face source image, and update the generation network and the discrimination network based on the image attribute discrimination result of the face-swapped image and the reference image, the difference between the calculated expression features, and the difference between the identity features.
The method according to claim 1, characterized in that obtaining the sample triplet comprises:

Acquire a first image and a second image, wherein the first image and the second image correspond to the same identity attribute and correspond to different non-identity attributes;

Acquire a third image, where the third image and the first image correspond to different identity attributes;

replacing the object in the second image with the object in the third image to obtain a fourth image; and

A sample triplet is constructed by taking the first image as a face source image, the fourth image as a template image, and the second image as a reference image.
The method according to claim 1 or 2, characterized in that the method further comprises:

Extracting features of the template image through the expression recognition network of the face-changing model to obtain expression features of the template image; and

Extracting features of the face source image through the face recognition network of the face-changing model to obtain identity features of the face source image;

The expression recognition network and the face recognition network are both pre-trained neural network models.
The method according to any one of claims 1 to 3, characterized in that the encoding of the face source image and the template image through the generative network of the face-changing model to obtain the encoding features required for face-changing comprises:

splicing the face source image with the template image to obtain an input image;

Inputting the input image into the face-changing model; and

The input image is encoded through the generative network of the face-changing model to obtain the encoding features required for face-changing the template image.
The method according to any one of claims 1 to 4, characterized in that the step of fusing the coding feature with the combined feature to obtain a fused feature comprises:

Calculate the mean and standard deviation of the coding feature, and calculate the mean and standard deviation of the combined feature;

Normalizing the coding feature according to the mean and standard deviation of the coding feature to obtain a normalized coding feature; and

According to the mean and standard deviation of the combined feature, the style of the combined feature is transferred to the normalized encoding feature to obtain the fused feature.
The method according to any one of claims 1 to 5, characterized in that the predicting of the image attribute discrimination results of the face-swapped image and the reference image respectively through the discriminant network of the face-swapped model comprises:

Inputting the face-swapped image into the discriminant network of the face-swapped model, and predicting a first probability that the face-swapped image is a non-forged image through the discriminant network; and

The reference image is input into the discriminant network of the face-changing model, and the discriminant network is used to predict a second probability that the reference image is a non-forged image.
The method according to any one of claims 1 to 6, characterized in that after obtaining the face-swapped image, the method further comprises:

Extracting features of the face-swapped image through the expression recognition network of the face-swapped model to obtain expression features of the face-swapped image; and

Extracting features of the face-swapped image through a face recognition network of the face-swapped model to obtain identity features of the face-swapped image;

The expression recognition network and the face recognition network are both pre-trained neural network models.
The method according to any one of claims 1 to 7 is characterized in that the calculation of the difference between the expression features of the face-swapped image and the expression features of the template image, the calculation of the difference between the identity features of the face-swapped image and the identity features of the face source image, and updating the generation network and the discrimination network according to the image attribute discrimination result of the face-swapped image and the reference image, the difference between the calculated expression features, and the difference between the identity features, comprises:

Alternately, when the network parameters of the generating network are fixed, a discriminant loss for the discriminant network is constructed according to a first probability that the face-swapped image is a non-forged image and a second probability that the reference image is a non-forged image, and the network parameters of the discriminant network are updated using the discriminant loss;

Under the condition that the network parameters of the discriminant network are fixed, according to the first probability that the face-swapped image belongs to a non-forged image, a generation loss of the generation network is constructed, according to the difference between the expression features of the face-swapped image and the expression features of the template image, an expression loss is constructed, according to the difference between the identity features of the face-swapped image and the identity features of the face source image, an identity loss is constructed, according to the generation loss, the expression loss and the identity loss, a face-swapped loss of the generation network is constructed, and the network parameters of the generation network are updated using the face-swapped loss;

The training is repeated until the training stop condition is met, and the trained discriminant network and generative network are obtained.
The method according to any one of claims 1 to 8, characterized in that the method further comprises:

Using a pre-trained facial key point network, respectively, the template image and the face-swapped image are subjected to facial key point recognition to obtain respective facial key point information; and

According to the difference between the facial key point information of the template image and the face-changing image, a key point loss is constructed, and the key point loss is used to participate in the training of the generation network of the face-changing model.
The method according to any one of claims 1 to 9, characterized in that the method further comprises:

Using a pre-trained feature extraction network, respectively extracting image features from the face-swapped image and the reference image to obtain respective image features; and

According to the difference between the image features of the face-changing image and the reference image, a similarity loss is constructed, and the similarity loss is used to participate in the training of the generation network of the face-changing model.
The method according to any one of claims 1 to 10, characterized in that the method further comprises:

A reconstruction loss is constructed according to the pixel-level difference between the face-changing image and the reference image, and the reconstruction loss is used to participate in the training of the generative network of the face-changing model.
The method according to any one of claims 1 to 11, characterized in that the method further comprises:

Obtain the video to be face-swapped and the face source image containing the target face;

For each video frame of the video to be face-swapped, acquiring expression features of the video frame;

Acquire the identity feature of the face source image containing the target face;

splicing the expression feature and the identity feature to obtain a combined feature; and

Through the trained generative network of the face-changing model, the face source image containing the target face and the video frame are encoded to obtain the encoding features required for face-changing, and decoding is performed based on the fusion features obtained by fusing the encoding features with the combined features, and a face-changing video in which the object in the video frame is replaced with the target face is output.
A training device for a face-changing model, the device comprising:

An acquisition module, used for acquiring a sample triplet, wherein the sample triplet includes a face source image, a template image and a reference image;

A splicing module, used for splicing the expression features of the template image and the identity features of the face source image to obtain a combined feature;

A generation module, configured to encode the face source image and the template image through the generation network of the face-changing model to obtain the encoding features required for face-changing, fuse the encoding features with the combined features to obtain the fused features, and decode the fused features through the generation network of the face-changing model to obtain the face-changing image;

A discrimination module, used for predicting the image attribute discrimination results of the face-swapped image and the reference image respectively through the discrimination network of the face-swapped model, wherein the image attributes include a forged image and a non-forged image;

An updating module is used to calculate the difference between the expression features of the face-swapped image and the expression features of the template image, calculate the difference between the identity features of the face-swapped image and the identity features of the face source image, and update the generation network and the discrimination network according to the image attribute discrimination result of the face-swapped image and the reference image, the difference between the calculated expression features, and the difference between the identity features.
A computer device comprises a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the steps of any one of the methods of claims 1 to 12 when executing the computer-readable instructions.
A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 12.
A computer program product comprises computer readable instructions, which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 12.