WO2024109374A1 - Training method and apparatus for face swapping model, and device, storage medium and program product


Info

Publication number: WO2024109374A1
Application number: PCT/CN2023/124045
Authority: WO (WIPO PCT)
Prior art keywords: image, face, features, network, swapped
Other languages: French (fr), Chinese (zh)
Inventors: 贺珂珂 (Keke He), 朱俊伟 (Junwei Zhu), 邰颖 (Ying Tai), 汪铖杰 (Chengjie Wang)
Original Assignee: 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2024109374A1

Classifications

    • G06V40/168: Feature extraction; Face representation (Human faces, e.g. facial parts, sketches or expressions)
    • G06V40/172: Classification, e.g. identification (Human faces)
    • G06N3/08: Learning methods (Neural networks; Computing arrangements based on biological models)
    • G06V10/761: Proximity, similarity or dissimilarity measures (Image or video pattern matching)
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/806: Fusion of extracted features (combining data at the sensor, preprocessing, feature extraction or classification level)
    • G06V10/82: Image or video recognition or understanding using neural networks

Definitions

  • The present application relates to the field of computer technology, and in particular to a training method, apparatus, computer device, storage medium and computer program product for a face-swapping model.
  • Face swapping refers to replacing the face in the image to be processed (i.e., the template image) with the face in the source image.
  • The goal of face-swapping technology is to keep the expression, angle, background and other information of the template image while making the result as similar as possible to the face in the source image.
  • Face swapping has many application scenarios. For example, video face swapping can be applied to film and television portrait production, game character design, virtual avatars, privacy protection, and so on.
  • The ability to preserve rich expressions is both the focus and the difficulty of face-swapping technology.
  • Most face-swapping algorithms can achieve satisfactory results for common expressions, such as smiling.
  • For complex expressions, however, the expression is poorly preserved in the swapped image, and some difficult expressions cannot be maintained at all, which reduces the accuracy of face swapping and degrades the overall effect.
  • A training method, apparatus, computer device, computer-readable storage medium and computer program product for a face-swapping model are provided.
  • The present application provides a training method for a face-swapping model.
  • The method is executed by a computer device and includes:
  • acquiring a sample triplet, wherein the sample triplet includes a face source image, a template image, and a reference image;
  • encoding according to the face source image and the template image to obtain the encoding features required for face swapping;
  • decoding according to the fused features through the generative network of the face-swapping model to obtain a face-swapped image.
  • The present application also provides a training apparatus for a face-swapping model.
  • The apparatus comprises:
  • an acquisition module, configured to acquire a sample triplet, wherein the sample triplet includes a face source image, a template image and a reference image;
  • a splicing module, configured to splice the expression features of the template image and the identity features of the face source image to obtain a combined feature;
  • a generation module, configured to encode the face source image and the template image through the generative network of the face-swapping model to obtain the encoding features required for face swapping, fuse the encoding features with the combined feature to obtain fused features, and decode the fused features through the generative network to obtain a face-swapped image;
  • a discrimination module, configured to predict, through the discrimination network of the face-swapping model, the image attribute discrimination results of the face-swapped image and the reference image respectively, wherein the image attributes include forged image and non-forged image;
  • an updating module, configured to calculate the difference between the expression features of the face-swapped image and those of the template image, calculate the difference between the identity features of the face-swapped image and those of the face source image, and update the generative network and the discrimination network according to the image attribute discrimination results of the face-swapped image and the reference image, the difference between the expression features, and the difference between the identity features.
  • The present application also provides a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the steps of the above face-swapping model training method when executing the computer-readable instructions.
  • The present application also provides a computer-readable storage medium having computer-readable instructions stored thereon, which, when executed by a processor, implement the steps of the above face-swapping model training method.
  • The present application also provides a computer program product, including computer-readable instructions that, when executed by a processor, implement the steps of the above face-swapping model training method.
  • FIG. 1 is a schematic diagram of image face swapping in one embodiment;
  • FIG. 2 is a diagram of an application environment for a face-swapping model training method in one embodiment;
  • FIG. 3 is a flowchart of a method for training a face-swapping model in one embodiment;
  • FIG. 4 is a schematic diagram of the model structure of a face-swapping model in one embodiment;
  • FIG. 5 is a flowchart of a method for training a face-swapping model in one embodiment;
  • FIG. 6 is a schematic diagram of a training framework of a face-swapping model in one embodiment;
  • FIG. 7 is a schematic diagram of facial key points in one embodiment;
  • FIG. 8 is a schematic diagram of a training framework of a face-swapping model in another embodiment;
  • FIG. 9 is a schematic diagram of a feature extraction network in one embodiment;
  • FIG. 10 is a schematic diagram of a training framework of a face-swapping model in yet another embodiment;
  • FIG. 11 is a schematic diagram of a video face-swapping process in one embodiment;
  • FIG. 12 is a schematic diagram showing the effect of face swapping on a photo in one embodiment;
  • FIG. 13 is a structural block diagram of a training apparatus for a face-swapping model in one embodiment;
  • FIG. 14 is a diagram of the internal structure of a computer device in one embodiment.
  • Supervised learning is a machine learning task in which an algorithm can learn or establish a pattern from a labeled training set and infer new instances based on this pattern.
  • the training set consists of a series of training examples, each of which consists of input and supervision information (i.e. expected output, also called labeling information).
  • the output inferred by the algorithm based on the input can be a continuous value or a classification label.
  • Unsupervised learning is a machine learning task. Algorithms learn patterns, structures, and relationships from unlabeled data to discover hidden information and meaningful structures in the data. Unlike supervised learning, there is no supervised information to guide the learning process in unsupervised learning, and the algorithm needs to discover the inherent patterns of the data on its own.
  • Generative adversarial network (GAN): a method of unsupervised learning that learns by letting two neural networks compete with each other. It consists of a generator network and a discriminator network.
  • The generator network randomly samples from a latent space as input, and its output needs to imitate the samples in the training set as much as possible; that is, its training goal is to generate samples that are as similar as possible to the samples in the training set.
  • The discriminator network takes as input either a generated sample or a real sample from the training set, and its training goal is to distinguish the samples produced by the generator network from the samples in the training set as accurately as possible.
  • The generator network, in turn, should deceive the discriminator network as much as possible.
  • The two networks compete with each other and continuously update their parameters, until finally the generator network can produce samples that are very similar to the samples in the training set.
  • Face swapping: replacing the face in the template image with the face from the input face source image and outputting the swapped image, such that the output keeps the expression, angle, background and other information of the template image.
  • For example, if the face in the input face source image is face A and the face in the template image is another face B, then after face swapping a photo is output in which face B in the template image has been replaced with face A.
  • Face-swapping model: a machine learning model implemented using deep learning and face recognition technology, which can extract a person's facial expressions, eyes, mouth and other features from a photo or video and match them with the facial features of another person.
  • Video face swapping has many application scenarios, such as film and television portrait production, game character design, virtual avatars, and privacy protection.
  • In film and television production, when an actor cannot perform professional actions, professionals can perform them first, and face-swapping technology can then automatically replace the stand-in's face with the actor's.
  • When an actor needs to be replaced, face-swapping technology can substitute a new face without reshooting, which can save a lot of cost.
  • In virtual avatar design, for example in live-streaming scenes, users can swap faces with virtual characters to make the stream more entertaining and to protect personal privacy.
  • The results of video face swapping can also provide anti-attack training material for services such as face recognition.
  • GT: Ground Truth, the true value, also known as reference information, label information or supervision information.
  • The face-swapping model training method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 2.
  • The terminal 102 communicates with the server 104 via a network.
  • A data storage system can store the data that the server 104 needs to process.
  • The data storage system can be integrated on the server 104, or placed on the cloud or on other servers. The terminal 102 can be, but is not limited to, a personal computer, laptop, smartphone, tablet computer, IoT device or portable wearable device.
  • IoT devices can be smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, etc.
  • Portable wearable devices can be smart watches, smart bracelets, head-mounted devices, etc.
  • The server 104 can be implemented as an independent server or as a server cluster consisting of multiple servers.
  • The terminal 102 may run an application client, and the server 104 may be a backend server providing services for the application client.
  • The application client may send the image or video captured by the terminal 102 to the server 104.
  • The server 104 may obtain a trained face-swapping model through the training method provided in the present application, replace the face in the image or video captured by the terminal 102 with another face or a virtual avatar through the generative network of the trained model, and return the result to the terminal 102 in real time.
  • The terminal 102 then displays the face-swapped image or video through the application client.
  • The application client may be a video client, a social application client, an instant messaging client, and the like.
  • FIG. 3 is a flowchart of the training method for a face-swapping model provided by the present application.
  • The execution subject of this embodiment can be a computer device or a cluster composed of multiple computer devices.
  • The computer device can be a server or a terminal, so the execution subject in the embodiments of the present application can be a server, a terminal, or a combination of both.
  • Taking a server as the execution subject as an example, the method includes the following steps:
  • Step 302: obtain a sample triplet, where the sample triplet includes a face source image, a template image, and a reference image.
  • The face-swapping model includes a generator network and a discriminator network.
  • The face-swapping model is trained through the generative adversarial network (GAN) formed by the generator network and the discriminator network.
  • A sample triplet is the sample data used to train the face-swapping model.
  • The server can obtain multiple sample triplets for training the face-swapping model.
  • Each sample triplet includes a face source image, a template image and a reference image.
  • The face source image is an image that provides a human face, and can be denoted source.
  • The template image is an image that provides information such as facial expression, posture and image background, and can be denoted template.
  • Face swapping replaces the face in the template image with the face in the face source image.
  • The face-swapped image can maintain the expression, posture, image background, etc. of the template image.
  • The reference image is an image that serves as the supervision information required for training the face-swapping model, and can be denoted GT. Since the principle of using each sample triplet (or each batch of sample triplets) to train the face-swapping model is the same, the process of training with a single sample triplet is used as an example here.
  • The reference image used to provide the supervision information required for model training should have the same identity attributes as the face source image and the same non-identity attributes as the template image.
  • The face source image should have different identity attributes from the template image.
  • Human faces are usually unique, and identity attributes refer to the identity represented by the face in an image. Having the same identity attributes means the images show the same face.
  • Non-identity attributes refer to the posture, expression and makeup of the face in the image.
  • Non-identity attributes also include attributes such as the style and background of the image.
  • The face in the face source image and the face in the reference image belong to the same person, but the facial expression, makeup, posture and image background may be partially the same or different.
  • The face in the face source image and the face in the template image belong to two different people. It is understandable that the face source image and the reference image may also be the same image.
  • The sample triplet can be constructed as follows: obtain a first image and a second image, where the first image corresponds to the same identity attribute as the second image but to different non-identity attributes; obtain a third image, where the third image corresponds to a different identity attribute from the first image; replace the object in the second image with the object in the third image to obtain a fourth image; and use the first image as the face source image, the fourth image as the template image, and the second image as the reference image to form a sample triplet.
  • The server can randomly obtain the first image, determine the identity information corresponding to the face in the first image, and then obtain another image corresponding to the same identity information as the second image, so that the first image and the second image contain the same face, i.e., have the same identity attributes. The server can then randomly obtain a third image whose identity attribute differs from that of the first image, i.e., the face in the third image does not belong to the same person as the face in the first image.
  • The server can input the second image and the third image into a face-swapping model, and replace the face in the second image with the object in the third image through the generative network to obtain the fourth image; the fourth image retains the expression, posture, image background and other characteristics of the second image.
  • The first image, the second image and the third image are all images containing faces, and the server can obtain them randomly from a face image data set.
  • For example, the first image contains the face of Mr. A with a laughing expression against background 1.
  • The second image contains the face of Mr. A with a smiling expression against background 2.
  • The third image contains the face of Ms. B with an angry expression against background 3.
  • The face of Mr. A is different from the face of Ms. B; that is, the third image has a different face from the first and second images.
  • The server replaces the face of Mr. A in the second image with the face of Ms. B to obtain the fourth image; the fourth image keeps the smiling expression of the second image, and its background remains background 2.
  • The first image is used as the face source image, providing the face of Mr. A, the laughing expression and background 1; the fourth image is used as the template image, providing the face of Ms. B, the smiling expression and background 2; and the second image is used as the reference image, providing the face of Mr. A, the smiling expression and background 2. This constructs a sample triplet, as sketched below.
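As a rough illustration of this construction, the following minimal sketch assembles one triplet from an identity-grouped image collection. The names `dataset` (a dict mapping identity to a list of images) and `pretrained_swap` (an already-trained swapping generator) are hypothetical stand-ins, not interfaces defined in the patent:

```python
# Illustrative triplet construction; `dataset` and `pretrained_swap` are assumptions.
import random

def build_triplet(dataset, pretrained_swap):
    # First and second images: same identity, different non-identity attributes.
    identity_a = random.choice(list(dataset.keys()))
    first, second = random.sample(dataset[identity_a], 2)

    # Third image: a different identity.
    identity_b = random.choice([k for k in dataset if k != identity_a])
    third = random.choice(dataset[identity_b])

    # Fourth image: put the face of the third image onto the second image,
    # keeping the second image's expression, pose and background.
    fourth = pretrained_swap(source=third, template=second)

    # (face source image, template image, reference image / GT)
    return first, fourth, second
```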
  • The reference image is a real image, not a forged or synthetic image.
  • The second image serving as the reference image is therefore a real image rather than a forged one.
  • During training, the face-swapped image output by the generative network is pushed ever closer to the real reference image, which ensures that the output stays consistent and smooth with the non-synthetic parts in terms of shape, lighting, motion and so on, yielding a high-quality face-swapped image or video with a better face-swapping effect.
  • After the server obtains the above sample triplets, it can input them directly into the face-swapping model for training.
  • Alternatively, after obtaining a sample triplet, the server first performs image preprocessing on its three images and uses the preprocessed images to train the face-swapping model.
  • The preprocessing may include the following steps. (1) Face detection: since the face often occupies only part of the image, the server first performs face detection on the image to obtain the face region.
  • The face detection network or algorithm used for this step may be a pre-trained neural network model.
  • (2) Face key point detection: key point detection is performed in the face region to obtain the facial key points, such as the key points of the eyes, mouth corners and facial contours.
  • (3) Face registration: face registration uses an affine transformation to uniformly "straighten" the face according to the identified key points, eliminating as far as possible the errors caused by different poses, and the face image is cropped after registration.
  • Through the above preprocessing steps, the server can obtain cropped face source, template and reference images, input the cropped images into the face-swapping model, and have the model output a face-swapped image containing only the human face; the output is then pasted back to replace the face region of the template image, producing the final result. This ensures the training effect of the face-swapping model.
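A minimal face-registration sketch using OpenCV, assuming the key points (eye centers and mouth center) have already been detected in step (2). The canonical 512x512 template coordinates below are illustrative values, not taken from the patent:

```python
import cv2
import numpy as np

# Canonical positions the aligned face should match (illustrative values).
CANONICAL = np.float32([
    [187, 239],   # left eye center
    [325, 239],   # right eye center
    [256, 371],   # mouth center
])

def register_face(image, keypoints):
    """Warp the detected face so its key points line up with CANONICAL."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(keypoints), CANONICAL)
    return cv2.warpAffine(image, matrix, (512, 512))
```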
  • Step 304: combine the expression features of the template image and the identity features of the face source image to obtain a combined feature.
  • The expression features of an image reflect the expression information it conveys. They are features of facial expressions obtained by locating and extracting the organ features, texture regions and predefined feature points of the face. Expression features are the key to expression recognition and determine the final recognition result.
  • The identity features of an image are biological features that can be used for identity recognition, such as facial features, pupil features, fingerprint features and palm print features. In this application, identity features are facial features obtained through face recognition, which can be used to recognize faces.
  • The server can extract features from the template image through the expression recognition network of the face-swapping model to obtain the expression features of the template image, and extract features from the face source image through the face recognition network of the face-swapping model to obtain the identity features of the face source image.
  • The face-swapping model thus includes not only a generative network and a discrimination network, but also a pre-trained expression recognition network and a pre-trained face recognition network, both of which are pre-trained neural network models.
  • Facial expression recognition is an important research direction in the field of computer vision: it predicts the category of emotion expressed by a face by analyzing and processing the face image.
  • The embodiments of the present application do not limit the network structure of the expression recognition network.
  • For example, the expression recognition network can be built on a convolutional neural network (CNN).
  • The convolutional neural network uses convolutional layers and pooling layers to extract features from the input face image, and performs expression classification through a fully connected layer.
  • The expression recognition network can be trained on a series of images with corresponding expression labels. Specifically, a face image data set containing expression labels is required, with sample face images of different emotion categories, such as happiness, sadness, anger, blinking, single-eye blinking, making faces, and other common and complex expressions.
  • The convolutional layers extract increasingly abstract, high-level feature representations of the sample face image, i.e., the expression features.
  • The extracted expression features are classified through the fully connected layer to obtain the predicted expression for the sample face image.
  • A loss function for the expression recognition network can then be constructed, and the network parameters updated based on it, for example by minimizing the loss function.
  • The trained expression recognition network can then be used to extract the expression features of an image.
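A toy PyTorch sketch of such a network: convolution and pooling for feature extraction, a fully connected layer for classification, trained against expression labels with cross-entropy. Layer sizes and the class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        feats = self.features(x).flatten(1)   # expression features
        return self.classifier(feats)         # expression logits

model = ExpressionNet()
loss_fn = nn.CrossEntropyLoss()  # minimized over (image, expression label) pairs
```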
  • In this application, the expression features are used to constrain expression consistency, i.e., to constrain the expression similarity between the face-swapped image and the template image.
  • The server can directly extract features from the template image through the trained expression recognition network to obtain the corresponding expression features.
  • The server can also first perform face detection on the template image through the expression recognition network, determine the face region from the detection results, and then extract features from the face region to obtain the corresponding expression features.
  • The expression features of the template image can be denoted template_exp_features.
  • Face recognition is a biometric technique that identifies a person based on facial features, and is one of the research challenges in the field of biometric recognition.
  • The embodiments of the present application do not limit the network structure used by the face recognition network.
  • For example, the face recognition network can be built on a convolutional neural network (CNN), which uses convolutional layers and pooling layers to extract features from the input face image and performs identity classification through a fully connected layer.
  • The face recognition network can be trained using a series of images with corresponding identity labels.
  • The face recognition network includes multiple stacked convolutional and pooling layers, followed by a fully connected layer.
  • Each convolutional layer uses a set of learnable filters (also known as convolution kernels) to extract features from the input face image.
  • The server can directly extract features from the face source image through the trained face recognition network to obtain the corresponding identity features.
  • The server can also first perform face detection on the face source image through the trained face recognition network, determine the face region from the detection results, and then extract features from the face region to obtain the corresponding identity features.
  • The identity features of the face source image can be denoted source_id_features.
  • The combined feature is obtained by splicing the expression feature of the template image with the identity feature of the face source image. For example, if the expression feature is 1024-dimensional and the identity feature is 512-dimensional, concatenating the two along the feature dimension yields a 1536-dimensional combined feature.
  • The splicing method is not limited to this; the embodiments of the present application impose no restriction.
  • For example, a multi-scale feature fusion method can also be used: features of different scales are extracted from different layers of the two networks and fused to obtain the combined feature.
  • The combined feature can be denoted id_exp_features, as in the sketch below.
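The concatenation from the example above, in PyTorch (random tensors stand in for the real network outputs):

```python
import torch

template_exp_features = torch.randn(1, 1024)  # from the expression recognition network
source_id_features = torch.randn(1, 512)      # from the face recognition network

# Concatenate along the feature dimension: 1024 + 512 = 1536.
id_exp_features = torch.cat([template_exp_features, source_id_features], dim=1)
print(id_exp_features.shape)  # torch.Size([1, 1536])
```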
  • The combined feature obtained by the server is subsequently decoded together with the encoding features required for face swapping to produce the face-swapped image. In other words, when training the face-swapping model, not only do the encoding features of the template image and face source image themselves participate in decoding, but so do the expression features of the template image and the identity features of the face source image. The output face-swapped image can therefore carry both the expression information of the template image and the identity information of the face source image: it keeps the template image's expression as far as possible while remaining as similar as possible to the face source image, improving the accuracy and effect of face swapping.
  • Step 306: encode the face source image and the template image through the generative network of the face-swapping model to obtain the encoding features required for face swapping, fuse the encoding features with the combined feature to obtain fused features, and decode according to the fused features through the generative network to obtain the face-swapped image.
  • The face-swapping model includes a face recognition network, an expression recognition network, a generative network and a discrimination network.
  • The face-swapping model is trained through the generative adversarial network (GAN) formed by the generator network and the discriminator network.
  • The generator network includes an encoder and a decoder.
  • The encoder repeatedly halves the size (resolution) of the input image through convolution computations while gradually increasing the number of channels.
  • The encoding process is essentially achieved by applying convolution kernels (also called filters) to the input data corresponding to the input image.
  • The encoder consists of multiple convolution kernels and finally outputs a feature vector.
  • The decoder performs deconvolution operations, gradually doubling the size of the feature maps and reducing the number of channels, and reconstructs or generates the image from the features.
  • Encoding based on the face source image and the template image through the generative network to obtain the encoding features required for face swapping includes: splicing the face source image and the template image to obtain an input image, inputting the input image into the face-swapping model, and encoding the input image through the generative network to obtain the encoding features required for swapping the face of the template image.
  • The face source image and the template image are both three-channel images.
  • The server can splice the face source image and the template image along the image channels.
  • The six-channel input image obtained after splicing is fed into the encoder of the generative network.
  • The input image is encoded step by step to obtain an intermediate result in the latent space, namely the encoding features (which can be denoted swap_features).
  • For example, the input image is encoded from a resolution of 512*512*6 down to 256*256*32, 128*128*64, 64*64*128, 32*32*256, and so on, finally giving an intermediate latent-space result called the encoding features, i.e., swap_features.
  • These encoding features carry the image information of both the face source image and the template image.
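A minimal sketch of this encoder path, assuming stride-2 convolutions as the downsampling mechanism (the patent does not fix the layer types); the channel counts mirror the example resolutions above:

```python
import torch
import torch.nn as nn

def down(cin, cout):
    # Stride-2 convolution: halves resolution, changes channel count.
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.LeakyReLU(0.2))

encoder = nn.Sequential(
    down(6, 32),     # 512x512x6  -> 256x256x32
    down(32, 64),    # 256x256x32 -> 128x128x64
    down(64, 128),   # 128x128x64 -> 64x64x128
    down(128, 256),  # 64x64x128  -> 32x32x256
)

source = torch.randn(1, 3, 512, 512)
template = torch.randn(1, 3, 512, 512)
# Channel-wise splice of the two 3-channel images into a 6-channel input.
swap_features = encoder(torch.cat([source, template], dim=1))  # (1, 256, 32, 32)
```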
  • The server can fuse the encoding features with the above combined feature to obtain fused features, which carry both the content of the encoding features and the style of the combined feature.
  • To do so, the server can compute the mean and standard deviation of the encoding features and of the combined feature respectively; normalize the encoding features using their own mean and standard deviation to obtain normalized encoding features; and transfer the style of the combined feature onto the normalized encoding features using the combined feature's mean and standard deviation, obtaining the fused features.
  • Concretely, the server can fuse the encoding features with the combined feature by means of AdaIN (Adaptive Instance Normalization).
  • The fusion can be written as AdaIN(x, y) = σ(y) · ((x − μ(x)) / σ(x)) + μ(y), where x and y are the encoding features and the combined features respectively, μ(x) and σ(x) are the mean and standard deviation of the encoding features, and μ(y) and σ(y) are the mean and standard deviation of the combined features. This formula aligns the mean and standard deviation of the encoding features with those of the combined features.
  • Both the encoding features and the combined features are multi-channel two-dimensional matrices.
  • For example, the matrix size of the encoding features is 32*32*256.
  • For each channel of the encoding features, the mean and standard deviation are computed from the values of all elements in that channel, giving the per-channel mean and standard deviation of the encoding features.
  • Likewise, the per-channel mean and standard deviation of the combined features are computed from the values of all elements in each channel.
  • The server uses the mean and standard deviation of the encoding features to normalize them: the normalized encoding features are obtained by subtracting the mean from the encoding features and dividing by the standard deviation.
  • After normalization, the features have mean 0 and standard deviation 1, which removes the original style of the encoding features while retaining their content.
  • The style of the combined features is then transferred to the normalized encoding features using the combined features' statistics: the normalized encoding features are multiplied by the standard deviation of the combined features, and the mean of the combined features is added, giving the fused features. The fused features thus retain the content of the encoding features while carrying the style of the combined features.
  • The encoding features carry the image information of both the face source image and the template image, and the combined features carry both the expression features and identity features required for face swapping. Fusing them in this way lets the decoded face-swapped image resemble the face in the face source image while retaining the expression, posture and image background of the template image, improving the accuracy of the output.
  • The server can also fuse the encoding features with the combined features in other ways, such as batch normalization, instance normalization or conditional instance normalization.
  • The embodiments of the present application do not limit the fusion method.
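A direct implementation of the AdaIN formula above, assuming the combined feature has already been reshaped to a feature map with the same channel count as the encoding features (the exact reshaping is not specified in the text):

```python
import torch

def adain(x, y, eps=1e-5):
    # x: encoding features (N, C, H, W); y: combined features as a map (N, C, H', W').
    x_mean = x.mean(dim=(2, 3), keepdim=True)
    x_std = x.std(dim=(2, 3), keepdim=True) + eps
    y_mean = y.mean(dim=(2, 3), keepdim=True)
    y_std = y.std(dim=(2, 3), keepdim=True) + eps
    normalized = (x - x_mean) / x_std        # zero mean, unit std: style removed
    return normalized * y_std + y_mean       # take on the combined feature's style
```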
  • After obtaining the fused features, the server inputs them into the decoder of the generative network. Through the decoder's deconvolution operations, the resolution of the fused features is gradually doubled and the number of channels gradually reduced, until the face-swapped image is output. For example, starting from fused features of size 32*32*256, the decoder's successive deconvolutions produce 64*64*128, 128*128*64, 256*256*32 and finally 512*512*3, the face-swapped image.
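A matching decoder sketch, assuming transposed convolutions as the deconvolution mechanism; the stage sizes mirror the example shapes above:

```python
import torch.nn as nn

def up(cin, cout):
    # Stride-2 transposed convolution: doubles resolution, changes channel count.
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.ReLU())

decoder = nn.Sequential(
    up(256, 128),                                  # 32x32x256 -> 64x64x128
    up(128, 64),                                   # 64x64x128 -> 128x128x64
    up(64, 32),                                    # 128x128x64 -> 256x256x32
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),  # -> 512x512x3
    nn.Tanh(),                                     # image values in [-1, 1]
)
```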
  • Step 308: through the discrimination network of the face-swapping model, predict the image attribute discrimination results of the face-swapped image and the reference image respectively, where the image attributes include forged image and non-forged image.
  • The face-swapping model also includes a discrimination network, which is used to judge whether an input image is a forged image or a non-forged image.
  • The server inputs the face-swapped image into the discrimination network, which extracts features from it to obtain low-dimensional discriminant information and classifies the image attribute based on the extracted discriminant features, yielding the corresponding image attribute discrimination result.
  • The classification performed by the discrimination network is a binary classification of image attributes, i.e., deciding whether the image is forged or non-forged. Forged images are also called synthetic images, and non-forged images are also called real images.
  • Similarly, the server inputs the reference image of the sample triplet into the discrimination network, extracts features from it to obtain low-dimensional discriminant information, classifies the image attribute based on the extracted discriminant features, and obtains the corresponding discrimination result.
  • Obtaining the corresponding image attribute discrimination results through the discrimination network according to the face-swapped image and the reference image includes: inputting the face-swapped image into the discrimination network to obtain a first probability that the face-swapped image is a non-forged image; and inputting the reference image into the discrimination network to obtain a second probability that the reference image is a non-forged image.
  • The training goal of the discrimination network is to make the first probability it outputs as small as possible and the second probability as large as possible, so that the discrimination network performs well.
  • Step 310: calculate the difference between the expression features of the face-swapped image and those of the template image, calculate the difference between the identity features of the face-swapped image and those of the face source image, and update the generative network and the discrimination network based on the image attribute discrimination results of the face-swapped image and the reference image, the difference between the expression features, and the difference between the identity features.
  • The face-swapping model includes a generative network and a discrimination network.
  • The generative network and the discrimination network are trained adversarially based on the discrimination network's image attribute predictions for the real reference data and the generated forged data.
  • To make the output face-swapped image retain the expression of the template image's face and the identity attributes of the face source image as far as possible, during training the server also calculates the difference between the expression features of the face-swapped image and those of the template image, and the difference between the identity features of the face-swapped image and those of the face source image.
  • From these, the loss function of the entire face-swapping model is jointly constructed, and the network parameters of the generative and discrimination networks are optimized and updated with the goal of minimizing that loss.
  • The embodiments of the present application do not limit the specific network structures of the generative and discrimination networks; the generative network only needs to support the image reconstruction and generation described above, and the discrimination network only needs to support the image attribute discrimination described above.
  • The expression features of the face-swapped image can be obtained by extracting image features through the above expression recognition network, and the identity features of the face-swapped image by extracting image features through the above face recognition network.
  • The server alternates between two phases. When the network parameters of the generative network are fixed, the discrimination loss of the discrimination network is constructed from the first probability that the face-swapped image is a non-forged image and the second probability that the reference image is a non-forged image, and the network parameters of the discrimination network are updated using the discrimination loss.
  • When the network parameters of the discrimination network are fixed, the generation loss of the generative network is constructed from the first probability that the face-swapped image is a non-forged image.
  • The expression loss is constructed from the difference between the expression features of the face-swapped image and those of the template image.
  • The identity loss is constructed from the difference between the identity features of the face-swapped image and those of the face source image.
  • The face-swapping loss of the generative network is constructed from the generation loss, the expression loss and the identity loss.
  • The network parameters of the generative network are updated using the face-swapping loss. The training alternates in this way until the stop condition is met, giving the trained discrimination and generative networks.
  • The training of the face-swapping model thus comprises two alternating stages: the first stage trains the discrimination network, and the second trains the generative network.
  • The training goal of the first stage is for the discrimination network to identify the face-swapped image as a forged image and the reference image as a non-forged image as reliably as possible. Therefore, in the first stage the parameters of the generative network are fixed, the sample triplets are input into the face-swapping model, and after the face-swapped image is produced, the server updates the discrimination network's parameters according to its predicted image attribute discrimination results for the face-swapped image and the reference image.
  • Specifically, the server constructs the discrimination loss from the first probability that the face-swapped image is a non-forged image and the second probability that the reference image is a non-forged image, and updates the discrimination network's parameters with that loss.
  • Here D denotes the discrimination network,
  • GT the reference image,
  • fake the face-swapped image,
  • D(fake) the first probability that the face-swapped image is a non-forged image,
  • and D(GT) the second probability that the reference image is a non-forged image.
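One way to realize a discrimination loss from D(fake) and D(GT) is the standard binary cross-entropy GAN objective; the patent names the two probabilities but does not spell out the exact loss form, so this is an assumption:

```python
import torch

def d_loss(d_fake, d_gt, eps=1e-8):
    # d_fake: first probability (face-swapped image judged non-forged), should be small.
    # d_gt:   second probability (reference image judged non-forged), should be large.
    return -(torch.log(d_gt + eps) + torch.log(1.0 - d_fake + eps)).mean()
```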
  • The training goal of the second stage is for the face-swapped images output by the generator network to "deceive" the discrimination network as far as possible, i.e., to be predicted as non-forged images. Therefore, in the second stage the parameters of the discrimination network are fixed and the same batch of sample triplets is input into the face-swapping model.
  • The loss function for training the generator network is constructed from the discrimination network's predicted image attribute results for the face-swapped images and the reference images, and the generator network's parameters are updated according to that loss.
  • In addition to the generation loss, the server introduces an expression loss and an identity loss into the loss function used to train the generative network. Specifically, the server extracts features from the face-swapped image through the expression recognition network of the face-swapping model to obtain its expression features, and through the face recognition network to obtain its identity features.
  • Both the expression recognition network and the face recognition network are pre-trained neural network models.
  • The server can construct the generation loss from the first probability that the face-swapped image is a non-forged image, the expression loss from the difference between the expression features of the face-swapped image and those of the template image, and the identity loss from the difference between the identity features of the face-swapped image and those of the face source image; the face-swapping loss of the generative network is built from the generation, expression and identity losses and used to update the generative network's parameters.
  • Here template_exp_features are the expression features of the template image,
  • fake_exp_features the expression features of the face-swapped image,
  • cosine_similarity() the cosine similarity,
  • fake_id_features the identity features of the face-swapped image,
  • and source_id_features the identity features of the face source image.
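A sketch combining the three generator-side losses. Cosine similarity is named in the text; the 1-minus form and the weights lambda_exp / lambda_id are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def g_loss(d_fake, fake_exp, template_exp, fake_id, source_id,
           lambda_exp=1.0, lambda_id=1.0, eps=1e-8):
    gen_loss = -torch.log(d_fake + eps).mean()  # push D(fake) toward "non-forged"
    exp_loss = 1.0 - F.cosine_similarity(fake_exp, template_exp).mean()
    id_loss = 1.0 - F.cosine_similarity(fake_id, source_id).mean()
    return gen_loss + lambda_exp * exp_loss + lambda_id * id_loss
```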
  • FIG. 5 is a flowchart of a method for training a face-swapping model in one embodiment.
  • The method can be executed by a computer device and specifically includes the following steps:
  • Step 502: obtain a sample triplet, where the sample triplet includes a face source image, a template image, and a reference image.
  • Step 504: extract features from the template image through the expression recognition network of the face-swapping model to obtain the expression features of the template image.
  • Step 506: extract features from the face source image through the face recognition network of the face-swapping model to obtain the identity features of the face source image.
  • Step 508: combine the expression features of the template image and the identity features of the face source image to obtain a combined feature.
  • Step 510: splice the face source image and the template image to obtain an input image, input the input image into the face-swapping model, and encode it through the generative network to obtain the encoding features required for swapping the face of the template image.
  • Step 512: compute the mean and standard deviation of the encoding features and of the combined feature respectively; normalize the encoding features using their own mean and standard deviation to obtain normalized encoding features; and transfer the style of the combined feature onto the normalized encoding features using the combined feature's mean and standard deviation, obtaining the fused features.
  • Step 514: decode the fused features through the generative network of the face-swapping model to obtain a face-swapped image.
  • Step 516: input the face-swapped image into the discrimination network of the face-swapping model to obtain a first probability that the face-swapped image is a non-forged image.
  • Step 518: input the reference image into the discrimination network of the face-swapping model to obtain a second probability that the reference image is a non-forged image.
  • Step 520: with the network parameters of the generative network fixed, construct the discrimination loss of the discrimination network from the first probability that the face-swapped image is a non-forged image and the second probability that the reference image is a non-forged image, and update the discrimination network's parameters using the discrimination loss.
  • Step 522: with the network parameters of the discrimination network fixed, extract features from the face-swapped image through the expression recognition network of the face-swapping model to obtain its expression features, and through the face recognition network to obtain its identity features; construct the generation loss of the generative network from the first probability that the face-swapped image is a non-forged image, the expression loss from the difference between the expression features of the face-swapped image and those of the template image, and the identity loss from the difference between the identity features of the face-swapped image and those of the face source image; construct the face-swapping loss of the generative network from the generation, expression and identity losses, and update the generative network's parameters using the face-swapping loss.
  • In the above face-swapping model training method, not only do the encoding features of the template image and face source image themselves participate in decoding the face-swapped image, but so do the expression features of the template image and the identity features of the face source image. The output face-swapped image can therefore carry both the expression information of the template image and the identity information of the face source image: it maintains the template image's expression while remaining similar to the face source image.
  • Furthermore, the face-swapping model is updated using the difference between the expression features of the template image and those of the face-swapped image, and the difference between the identity features of the face source image and those of the face-swapped image.
  • The former constrains the expression similarity between the face-swapped image and the template image, while the latter constrains the identity similarity between the face-swapped image and the face source image.
  • Even when the template image carries a complex expression, the output face-swapped image can still maintain it, thereby improving the face-swapping effect.
  • In addition, the generative network and the discrimination network are trained adversarially on the discrimination network's image attribute predictions for the face-swapped image and the reference image, which improves the overall image quality of the model's output.
  • In one embodiment, the present application also introduces a pre-trained facial key point network when training the face-swapping model, and trains the generative network according to the difference between the facial key point information of the template image and that of the face-swapped image.
  • The above method may therefore also include: performing facial key point recognition on the template image and the face-swapped image respectively through the pre-trained facial key point network to obtain their respective facial key point information; constructing a key point loss from the difference between the two sets of facial key point information; and using the key point loss to participate in training the generative network of the face-swapping model.
  • That is, a facial key point network can optionally be introduced when training the face-swapping model.
  • The facial key point network locates the positions of the facial key points in an image; a key point loss is then built from the difference between the facial key point information of the template image and that of the face-swapped image and participates in training the generative network, ensuring expression consistency between the template image and the face-swapped image.
  • Facial key points are the pixels where the facial features related to expression are located in an image, such as the pixels of the eyebrows, mouth, eyes, nose and facial contour.
  • FIG. 7 is a schematic diagram of facial key points in one embodiment.
  • It illustrates 98 facial key points, numbered 0 to 97: points 0-32 are the facial contour, 33-50 the eyebrow contours, 51-59 the nose, 60-75 the eye contours, 76-95 the mouth contour, and 96 and 97 the pupil positions.
  • A facial key point network can also locate more key points; some, for example, can locate 256 facial key points.
  • Facial key point detection locates the key points of a face from an input face region. Affected by factors such as lighting, occlusion and pose, it is itself a challenging task.
  • the server locates the facial key points of the face-swapped image and the template image through the pre-trained facial key point network. For some or all of the facial key points, the server computes the squared difference between the coordinate values of the same key point in the two images and sums these squares; the result is recorded as the key point loss landmark_loss. During training, this loss should be as small as possible. For example, for the 95th key point, the squared difference of its coordinate values in the two images contributes one term to the sum.
  • alternatively, the server can characterize the expression difference between the face-swapped image and the template image using only the coordinate differences of the key points of the eyebrows, mouth, and eyes, as in the sketch below.
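As a concrete illustration, the following is a minimal sketch of landmark_loss, assuming the key point network returns (N, 2) tensors of coordinates and that the loss is an unweighted sum of squared coordinate differences (the exact weighting is not fixed by the text); the optional `indices` argument restricts the loss to a subset such as the eyebrow, mouth, and eye points:

```python
import torch

def landmark_loss(fake_pts: torch.Tensor, template_pts: torch.Tensor,
                  indices=None) -> torch.Tensor:
    """Sum of squared coordinate differences over facial key points.

    fake_pts / template_pts: (N, 2) key point coordinates predicted by the
    pre-trained facial key point network for the face-swapped image and
    the template image. indices: optional subset of key points.
    """
    if indices is not None:
        fake_pts = fake_pts[indices]
        template_pts = template_pts[indices]
    return ((fake_pts - template_pts) ** 2).sum()
```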
  • the facial key point network can be built on a convolutional neural network. For example, a cascaded convolutional neural network with three levels can exploit the feature extraction capability of multi-level convolution to obtain progressively more accurate features from coarse to fine, after which a fully connected layer predicts the positions of the facial key points.
  • to train the facial key point network, a sample facial image data set is required; each image in the data set carries key point annotation information, that is, the position data of its facial key points.
  • during training, a sample facial image is input into the facial key point network, which outputs the predicted position of each key point; the difference between the annotated position and the predicted position of each key point is computed, and the differences of all key points are summed to obtain the prediction error for the whole sample image.
  • a loss function is constructed from this prediction error, and the network parameters of the facial key point network are optimized by minimizing the loss function.
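A rough sketch of one such supervised training step follows; the squared-error objective and the tensor shapes are assumptions consistent with the description above, and `net` and `optimizer` are placeholders:

```python
import torch

def keypoint_train_step(net, optimizer, images, annotated_pts):
    """One optimization step for the facial key point network.

    images: (B, 3, H, W) sample facial images; annotated_pts: (B, N, 2)
    annotated key point positions. Predict the positions, sum the
    per-key-point errors, and minimize the resulting loss.
    """
    pred_pts = net(images)                           # (B, N, 2) predictions
    loss = ((pred_pts - annotated_pts) ** 2).sum()   # summed prediction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```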
  • in this way, the generative network of the trained face-changing model can output face-swapped images with better expression retention.
  • in some embodiments, the present application also introduces a pre-trained feature extraction network when training the face-changing model, and trains the generative network according to the difference between the image features of the face-swapped image and those of the reference image.
  • the above method may also include: extracting image features of the face-changing image and the reference image respectively through the pre-trained feature extraction network to obtain their respective image features; constructing a similarity loss according to the difference between the image features of the face-changing image and the reference image; and the similarity loss is used to participate in the training of the generation network of the face-changing model.
  • the similarity loss can be, for example, the learned perceptual image patch similarity (LPIPS).
  • the pre-trained feature extraction network is used to extract the features of the face-swapped image and the reference image at different levels, compare the feature differences between the face-swapped image and the reference image at the same level, and construct a similarity loss.
  • the embodiment of the present application does not limit the network structure of the feature extraction network used.
  • Referring to FIG. 9, which is a schematic diagram of a feature extraction network in an embodiment:
  • low-level features capture primitives such as lines and colors,
  • while high-level features represent semantic content such as parts and objects.
  • the feature extraction network includes five convolution operations.
  • the resolution of the input image is 224*224*3.
  • after the first-level convolution Conv1, the first-level image features are extracted, denoted fake_fea1, with a resolution of 55*55*96.
  • after a pooling operation and the second-level convolution Conv2, the second-level image features are extracted, denoted fake_fea2, with a resolution of 27*27*256.
  • after another pooling operation and the third-level convolution Conv3, the third-level image features are extracted, denoted fake_fea3, with a resolution of 13*13*384.
  • after the fourth-level convolution Conv4 and the fifth-level convolution Conv5, the fourth set of image features is obtained, denoted fake_fea4, with a resolution of 13*13*256.
  • finally, fully connected layers produce an output vector of dimension 1000 for image classification or target detection.
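The level shapes quoted above (55*55*96, 27*27*256, 13*13*384 and 13*13*256 from a 224*224*3 input) match an AlexNet-style stack of five convolutions. The sketch below reproduces those tap shapes; it is an illustrative reconstruction, not necessarily the exact network of FIG. 9:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Five-convolution backbone; comments give the output shape per tap."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 96, 11, stride=4, padding=2), nn.ReLU())
        self.pool1 = nn.MaxPool2d(3, stride=2)
        self.conv2 = nn.Sequential(nn.Conv2d(96, 256, 5, padding=2), nn.ReLU())
        self.pool2 = nn.MaxPool2d(3, stride=2)
        self.conv3 = nn.Sequential(nn.Conv2d(256, 384, 3, padding=1), nn.ReLU())
        self.conv4 = nn.Sequential(nn.Conv2d(384, 384, 3, padding=1), nn.ReLU())
        self.conv5 = nn.Sequential(nn.Conv2d(384, 256, 3, padding=1), nn.ReLU())

    def forward(self, x):                        # x: (B, 3, 224, 224)
        fea1 = self.conv1(x)                     # (B, 96, 55, 55)
        fea2 = self.conv2(self.pool1(fea1))      # (B, 256, 27, 27)
        fea3 = self.conv3(self.pool2(fea2))      # (B, 384, 13, 13)
        fea4 = self.conv5(self.conv4(fea3))      # (B, 256, 13, 13)
        return fea1, fea2, fea3, fea4
```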
  • similarly, multi-level features are extracted from the reference image GT, denoted feature(GT) = (GT_fea1, GT_fea2, GT_fea3, GT_fea4);
  • the similarity loss can then be expressed as the sum of the per-level feature differences: similarity_loss = |fake_fea1 - GT_fea1| + |fake_fea2 - GT_fea2| + |fake_fea3 - GT_fea3| + |fake_fea4 - GT_fea4|.
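In code, the per-level comparison might look like the sketch below; using the mean absolute difference per level is one reasonable reading of the formula (published LPIPS additionally applies learned per-channel weights, which are omitted here):

```python
def similarity_loss(fake_feats, gt_feats):
    """LPIPS-style loss: sum the feature differences at each level.

    fake_feats / gt_feats: the tuples (fake_fea1..fake_fea4) and
    (GT_fea1..GT_fea4) produced by the feature extraction network.
    """
    return sum((f - g).abs().mean() for f, g in zip(fake_feats, gt_feats))
```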
  • in this way, the generative network of the trained face-changing model can output face-swapped images with realistic effects.
  • the present application also introduces reconstruction loss when training the face-changing model.
  • the reconstruction loss is constructed to train the generative network of the face-changing model.
  • the above method may also include: constructing a reconstruction loss according to the pixel-level difference between the face-changing image and the reference image; the reconstruction loss is used to participate in the training of the generative network of the face-changing model. During training, it is hoped that the pixel-level difference between the face-changing image and the reference image is as small as possible.
  • recon_loss = |fake - GT|; this formula represents the pixel-wise difference between the face-swapped image fake and the reference image GT of the same size.
  • the server can calculate the difference in pixel values corresponding to the same pixel position of the two images, sum the differences of all pixel positions, and obtain the overall difference between the two images at the image pixel level. This overall difference can be used to construct the reconstruction loss.
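A minimal sketch of this pixel-level loss, assuming an L1 (absolute) difference; a squared difference would work the same way:

```python
def reconstruction_loss(fake, gt):
    """Sum of per-pixel differences between the face-swapped image `fake`
    and the same-size reference image `gt` (e.g. (B, C, H, W) tensors)."""
    return (fake - gt).abs().sum()
```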
  • Referring to FIG. 10, it shows the training architecture of the face-changing model in a specific embodiment.
  • the networks introduced when training the face-changing model include: a generation network, a discrimination network, an expression recognition network, a face recognition network, a face key point network, and a feature extraction network.
  • the training process of the face-changing model is described as follows:
  • the server obtains training samples, which include multiple sample triplets, and the sample triplets include a face source image, a template image, and a reference image.
  • the server extracts features from the template image through a pre-trained expression recognition network to obtain the expression features of the template image.
  • the server extracts features from the face source image through a pre-trained face recognition network to obtain the identity features of the face source image, and then concatenates the expression features of the template image with the identity features of the face source image to obtain the combined features.
  • the server also splices the face source image with the template image to obtain an input image, inputs the input image into the face-changing model, encodes the input image through the generative network of the face-changing model, and obtains the encoding features required for face-changing the template image.
  • the server fuses the encoded features with the combined features to obtain fused features, and decodes them according to the fused features through the generative network of the face-changing model to obtain the face-changing image.
  • the server inputs the face-swapped image into the discriminative network of the face-changing model to obtain a first probability that the face-swapped image is a non-forged image, and inputs the reference image into the same discriminative network to obtain a second probability that the reference image is a non-forged image.
  • a discriminative loss for the discriminative network is constructed based on a first probability that the face-swapped image is a non-forged image and a second probability that the reference image is a non-forged image, and the network parameters of the discriminative network are updated using the discriminative loss.
  • the server re-inputs the face-swapped image into the updated discriminant network to obtain the first probability that the face-swapped image is a non-forged image, and constructs the generation loss of the generation network according to the first probability that the face-swapped image is a non-forged image.
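Losses of this shape are conventionally written with a binary cross-entropy objective; the sketch below is one plausible form, assuming the discriminative network outputs probabilities in [0, 1] (the text does not pin down the exact loss function):

```python
import torch
import torch.nn.functional as F

def discriminant_loss(p_fake, p_real):
    """p_fake: first probability (face-swapped image is non-forged);
    p_real: second probability (reference image is non-forged).
    The discriminative network is pushed toward p_fake -> 0, p_real -> 1."""
    return (F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)) +
            F.binary_cross_entropy(p_real, torch.ones_like(p_real)))

def generation_loss(p_fake):
    """The generative network is rewarded when the updated discriminative
    network assigns a high non-forged probability to the face-swapped image."""
    return F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
```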
  • through the expression recognition network of the face-changing model, feature extraction is performed on the face-swapped image to obtain its expression features, and the expression loss is constructed from the difference between the expression features of the face-swapped image and those of the template image.
  • through the face recognition network of the face-changing model, feature extraction is performed on the face-swapped image to obtain its identity features, and the identity loss is constructed from the difference between the identity features of the face-swapped image and those of the face source image.
  • through the pre-trained facial key point network, the facial key points of the template image and the face-swapped image are respectively recognized to obtain their facial key point information, and the key point loss is constructed from the difference between the two.
  • through the pre-trained feature extraction network, image features are extracted from the face-swapped image and the reference image, and the similarity loss is constructed from the difference between their image features.
  • the reconstruction loss is constructed from the pixel-level difference between the face-swapped image and the reference image.
  • based on the generation loss, expression loss, identity loss, key point loss, similarity loss, and reconstruction loss, the face-swapping loss of the generative network is constructed, and the network parameters of the generative network are updated using this loss.
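Combining the six terms might look like the following sketch; the equal default weights are an assumption, since the text does not specify how the losses are balanced:

```python
def face_swap_loss(gen_loss, expr_loss, id_loss, kp_loss, sim_loss, recon_loss,
                   weights=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the generation, expression, identity, key point,
    similarity, and reconstruction losses used to update the generative network."""
    terms = (gen_loss, expr_loss, id_loss, kp_loss, sim_loss, recon_loss)
    return sum(w * t for w, t in zip(weights, terms))
```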
  • the server can use the generative network, pre-trained expression recognition network and face recognition network in the trained face-changing model to perform face-changing on the target image or target video to obtain a face-changing image or face-changing video.
  • Referring to FIG. 11, it shows the process of video face swapping in one embodiment.
  • the execution subject of this embodiment can be a computer device or a computer device cluster composed of multiple computer devices.
  • the computer device can be a server or a terminal. Referring to FIG. 11, the following steps are included:
  • Step 1102, obtaining the video to be face-swapped and a face source image containing the target face.
  • the face source image may be an original image containing a human face, or a cropped image containing only the face, obtained by performing face detection and registration on the original image.
  • Step 1104, for each video frame of the video to be face-swapped, extracting features from the video frame through the trained expression recognition network to obtain the expression features of the video frame.
  • the server can process the video frame directly, or perform face detection on the video frame and obtain a cropped image containing only the face after registration.
  • Step 1106 extracting features from the face source image through the trained face recognition network to obtain identity features of the face source image.
  • Step 1108 concatenate the expression feature and the identity feature to obtain a combined feature.
  • Step 1110, encoding, through the generative network of the trained face-changing model, according to the face source image containing the target face and the video frame, to obtain the coding features required for face-swapping.
  • Step 1112 fusing the coding feature and the combined feature to obtain a fused feature.
  • Step 1114, decoding, through the generative network of the trained face-changing model, according to the fused features, and outputting a face-swapped video in which the object in each video frame is replaced with the target face.
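Steps 1102-1114 amount to extracting the source identity once and looping over frames. The sketch below assumes hypothetical `encode`, `fuse`, and `decode` methods on the generative network and omits face detection, alignment, and pasting the result back into the full frame:

```python
import torch

def swap_video(frames, source_img, expr_net, id_net, generator):
    """frames: aligned face crops from the video to be face-swapped;
    source_img: aligned crop of the face source image with the target face."""
    id_feat = id_net(source_img)                                          # Step 1106
    out_frames = []
    for frame in frames:
        expr_feat = expr_net(frame)                                       # Step 1104
        combined = torch.cat([expr_feat, id_feat], dim=1)                 # Step 1108
        coding = generator.encode(torch.cat([source_img, frame], dim=1))  # Step 1110
        fused = generator.fuse(coding, combined)                          # Step 1112
        out_frames.append(generator.decode(fused))                        # Step 1114
    return out_frames
```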
  • Referring to FIG. 12, it shows the effect of face swapping on photos in one embodiment.
  • the face swapping model trained by the face swapping model training method provided in the embodiment of the present application can still maintain a good face swapping effect under complex expressions, and can be used in a variety of scenarios such as ID photo production, film and television portrait production, game character design, virtual image, privacy protection, etc.
  • even under complex expressions, the expression of the face in the template image can be maintained, which meets the face-swapping requirements of complex expression scenes in film and television; moreover, in video scenes the expressions remain smooth and natural.
  • the embodiment of the present application also provides a face-changing model training device for implementing the face-changing model training method involved above.
  • since the solution this device provides is similar to the solution recorded in the method above, the specific limitations in the one or more face-changing model training device embodiments below can refer to the limitations of the face-changing model training method above, and will not be repeated here.
  • a training device 1300 for a face-changing model, comprising: an acquisition module 1302, a splicing module 1304, a generation module 1306, a discrimination module 1308, and an update module 1310, wherein:
  • An acquisition module 1302 is used to acquire a sample triplet, where the sample triplet includes a face source image, a template image, and a reference image;
  • a splicing module 1304 is used to splice the expression features of the template image and the identity features of the face source image to obtain a combined feature;
  • the generation module 1306 is used to encode the face source image and the template image through the generation network of the face-changing model to obtain the encoding features required for face-changing, fuse the encoding features with the combined features to obtain the fused features, and decode according to the fused features through the generation network of the face-changing model to obtain the face-changing image;
  • the updating module 1310 is used to calculate the difference between the expression features of the face-swapped image and the expression features of the template image, calculate the difference between the identity features of the face-swapped image and the identity features of the face source image, and update the generation network and the discrimination network based on the image attribute discrimination results of the face-swapped image and the reference image, the difference between the calculated expression features, and the difference between the identity features.
  • the acquisition module 1302 is also used to acquire a first image and a second image, wherein the first image and the second image correspond to the same identity attribute and different non-identity attributes; acquire a third image, wherein the third image and the first image correspond to different identity attributes; replace the object in the second image with the object in the third image to obtain a fourth image; and use the first image as a face source image, the fourth image as a template image, and the second image as a reference image as a sample triplet.
  • the face-changing model training device 1300 further includes:
  • the expression recognition module is used to extract features of the template image through the expression recognition network of the face-changing model to obtain the expression features of the template image;
  • a face recognition module is used to extract features of the face source image through the face recognition network of the face-changing model to obtain identity features of the face source image;
  • Both the expression recognition network and the face recognition network are pre-trained neural network models.
  • the generation module 1306 is also used to splice the face source image with the template image to obtain an input image; input the input image to the face-changing model; encode the input image through the generation network of the face-changing model to obtain the encoding features required for face-changing the template image.
  • the face-changing model training device 1300 further includes:
  • the fusion module is used to calculate the mean and standard deviation of the coding features and the combined features respectively; according to the mean and standard deviation of the coding features, the coding features are normalized to obtain the normalized coding features; according to the mean and standard deviation of the combined features, the style of the combined features is transferred to the normalized coding features to obtain the fusion features.
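This is the usual adaptive instance normalization (AdaIN) recipe. The sketch below follows the description literally, using the global statistics of each feature; an actual implementation might instead project the combined feature to per-channel statistics, which the text does not specify:

```python
import torch

def fuse_features(coding, combined, eps=1e-5):
    """coding: (B, C, H, W) encoder features; combined: (B, D) spliced
    expression+identity vector. Normalize the coding features by their
    own statistics, then transfer the mean/std of the combined features."""
    c_mean = coding.mean(dim=(1, 2, 3), keepdim=True)
    c_std = coding.std(dim=(1, 2, 3), keepdim=True)
    normalized = (coding - c_mean) / (c_std + eps)       # normalized coding features
    s_mean = combined.mean(dim=1)[:, None, None, None]   # "style" statistics
    s_std = combined.std(dim=1)[:, None, None, None]
    return normalized * s_std + s_mean                   # fused features
```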
  • the discrimination module 1308 is further used to input the face-swapped image into the discriminant network of the face-swapped model to obtain a first probability that the face-swapped image is a non-forged image; and input the reference image into the discriminant network of the face-swapped model to obtain a second probability that the reference image is a non-forged image.
  • the face-changing model training device 1300 further includes:
  • the expression recognition module is used to extract features of the face-swapped image through the expression recognition network of the face-swapped model to obtain the expression features of the face-swapped image;
  • a face recognition module is used to extract features of the face-swapped image through the face recognition network of the face-swapped model to obtain identity features of the face-swapped image;
  • Both the expression recognition network and the face recognition network are pre-trained neural network models.
  • the updating module 1310 is further used to alternately: when the network parameters of the generative network are fixed, construct a discriminative loss for the discriminative network according to the first probability that the face-swapped image is a non-forged image and the second probability that the reference image is a non-forged image, and update the network parameters of the discriminative network using the discriminative loss; and, when the network parameters of the discriminative network are fixed, construct the generation loss of the generative network according to the first probability that the face-swapped image is a non-forged image, construct an expression loss according to the difference between the expression features of the face-swapped image and those of the template image, construct an identity loss according to the difference between the identity features of the face-swapped image and those of the face source image, construct the face-swapping loss of the generative network by combining the generation loss, the expression loss, and the identity loss, and update the network parameters of the generative network using the face-swapping loss.
  • the face-changing model training device 1300 further includes:
  • the key point positioning module is used to identify the facial key points of the template image and the face-swapped image through a pre-trained facial key point network to obtain their respective facial key point information;
  • the updating module 1310 is also used to construct a key point loss according to the difference between the facial key point information of the template image and the face-changing image; the key point loss is used to participate in the training of the generative network of the face-changing model.
  • the face-changing model training device 1300 further includes:
  • An image feature extraction module is used to extract image features from the face-swapped image and the reference image through a pre-trained feature extraction network to obtain their respective image features;
  • the updating module 1310 is also used to construct a similarity loss according to the difference between the image features of the face-changing image and the reference image; the similarity loss is used to participate in the training of the generative network of the face-changing model.
  • the updating module 1310 is further used to construct a reconstruction loss according to the pixel-level difference between the face-changing image and the reference image; the reconstruction loss is used to participate in the training of the generative network of the face-changing model.
  • the face-changing model training device 1300 further includes:
  • the face-changing module is used to obtain the video to be face-swapped and a face source image containing the target face; for each video frame of the video, obtain the expression features of the video frame; obtain the identity features of the face source image containing the target face; splice the expression features and the identity features to obtain combined features; through the generative network of the trained face-changing model, encode the face source image containing the target face and the video frame to obtain the coding features required for face-swapping, and decode the fused features obtained by fusing the coding features with the combined features, to output a face-swapped video in which the object in each video frame is replaced with the target face.
  • with the above training device 1300 for the face-changing model, when the model is trained, not only are the coding features of the template image and the face source image themselves involved in decoding the face-swapped image, but so are the expression features of the template image and the identity features of the face source image, so that the output face-swapped image carries both the expression information of the template image and the identity information of the face source image; that is, it preserves the expression of the template image while resembling the face source image.
  • the face-changing model is updated using both the difference between the expression features of the template image and those of the face-swapped image, and the difference between the identity features of the face source image and those of the face-swapped image.
  • the former constrains the expression similarity between the face-swapped image and the template image, and the latter constrains the identity similarity between the face-swapped image and the face source image.
  • as a result, even when the template image carries a complex expression, the output face-swapped image can still maintain it, thereby improving the face-swapping effect.
  • in addition, the generative network and the discriminative network are trained adversarially based on the image attribute discrimination results that the discriminative network predicts for the face-swapped image and the reference image, which improves the overall image quality of the face-swapped images output by the face-changing model.
  • Each module in the above-mentioned face-changing model training device 1300 can be implemented in whole or in part by software, hardware, or a combination thereof.
  • Each of the above-mentioned modules can be embedded in or independent of a processor in a computer device in the form of hardware, or can be stored in a memory in a computer device in the form of software, so that the processor can call and execute operations corresponding to each of the above modules.
  • a computer device is provided, which may be a server or a terminal; its internal structure diagram may be as shown in FIG. 14.
  • the computer device includes a processor, a memory, an input/output interface (I/O for short), and a communication interface.
  • the processor, the memory, and the input/output interface are connected via a system bus, and the communication interface is connected to the system bus via the input/output interface.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium.
  • the input/output interface of the computer device is used to exchange information between the processor and the external device.
  • the communication interface of the computer device is used to communicate with the external device through a network connection.
  • FIG. 14 is merely a block diagram of a partial structure related to the scheme of the present application, and does not constitute a limitation on the computer device to which the scheme of the present application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, wherein the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the training method steps of the face-changing model provided in any embodiment of the present application are implemented.
  • a computer-readable storage medium on which computer-readable instructions are stored.
  • the training method steps of the face-changing model provided in any embodiment of the present application are implemented.
  • a computer program product including computer-readable instructions, which, when executed by a processor, implement the steps of the face-changing model training method provided in any embodiment of the present application.
  • the user information involved includes, but is not limited to, user device information and user personal information;
  • the data involved includes, but is not limited to, data used for analysis, stored data, and displayed data.
  • Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc.
  • Volatile memory can include random access memory (RAM) or external cache memory, etc.
  • RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
  • the database involved in each embodiment provided in this application may include at least one of a relational database and a non-relational database.
  • Non-relational databases may include distributed databases based on blockchains, etc., but are not limited to this.
  • the processor involved in each embodiment provided in this application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, etc., but are not limited to this.


Abstract

A training method for a face swapping model, comprising: splicing an expression feature of a template image and an identity feature of a face source image to obtain a combined feature (304); by means of a generative network of a face swapping model, performing encoding according to the face source image and the template image, so as to obtain an encoded feature, fusing the encoded feature and the combined feature to obtain a fused feature, and by means of the generative network of the face swapping model, performing decoding according to the fused feature, so as to obtain a face swapped image (306); by means of a discriminative network of the face swapping model, respectively predicting image attribute discrimination results with regard to the face swapped image and a reference image, wherein image attributes comprise a forged image and a non-forged image (308); and calculating the difference between an expression feature of the face swapped image and the expression feature of the template image, calculating the difference between an identity feature of the face swapped image and the identity feature of the face source image, and updating the generative network and the discriminative network according to the image attribute discrimination results with regard to the face swapped image and the reference image, the calculated difference between the expression features and the calculated difference between the identity features (310).

Description

Training method, apparatus, device, storage medium and program product for a face-changing model

This application claims priority to the Chinese patent application filed with the China Patent Office on November 22, 2022, with application number 2022114680626 and entitled "Training method, apparatus, device, storage medium and program product for a face-changing model", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of computer technology, and in particular to a training method, apparatus, computer device, storage medium and computer program product for a face-changing model.
Background

With the rapid development of computer technology and artificial intelligence, face replacement technology has emerged. Face replacement, that is, face swapping, refers to replacing the face in the image to be processed (the template image) with the face in a face source image. The goal of face-swapping technology is that the face in the resulting face-swapped image keeps the expression, angle, background and other information of the template image while resembling the face in the face source image as closely as possible. Face replacement has many application scenarios; for example, video face swapping can be applied to film and television portrait production, game character design, virtual avatars, privacy protection, and so on.

The ability to preserve rich expressions is both the focus and the difficulty of face replacement technology. At present, most face-swapping algorithms achieve satisfactory results for ordinary expressions, such as smiling. But in scenes with richer expressions, such as pouting, closed eyes, winking with one eye, or anger, the expression of the face-swapped image is poorly preserved, and some harder expressions cannot be maintained at all, so the accuracy of face swapping is affected and the face-swapping effect is poor.
Summary of the Invention

Based on this, according to the various embodiments provided in the present application, a training method, apparatus, computer device, computer-readable storage medium and computer program product for a face-changing model are provided.

The present application provides a training method for a face-changing model. The method is executed by a computer device and includes:

acquiring a sample triplet, the sample triplet including a face source image, a template image and a reference image;

splicing the expression features of the template image and the identity features of the face source image to obtain a combined feature;

encoding, through the generative network of the face-changing model, according to the face source image and the template image, to obtain the coding features required for face-swapping;

fusing the coding features with the combined features to obtain fused features;

decoding, through the generative network of the face-changing model, according to the fused features, to obtain a face-swapped image;

predicting, through the discriminative network of the face-changing model, image attribute discrimination results for the face-swapped image and the reference image respectively, the image attributes including forged image and non-forged image; and

calculating the difference between the expression features of the face-swapped image and the expression features of the template image, calculating the difference between the identity features of the face-swapped image and the identity features of the face source image, and updating the generative network and the discriminative network according to the image attribute discrimination results for the face-swapped image and the reference image, the calculated difference between the expression features, and the difference between the identity features.
The present application also provides a training apparatus for a face-changing model. The apparatus includes:

an acquisition module, used to acquire a sample triplet, the sample triplet including a face source image, a template image and a reference image;

a splicing module, used to splice the expression features of the template image and the identity features of the face source image to obtain a combined feature;

a generation module, used to encode, through the generative network of the face-changing model, according to the face source image and the template image, to obtain the coding features required for face-swapping, fuse the coding features with the combined features to obtain fused features, and decode, through the generative network of the face-changing model, according to the fused features, to obtain a face-swapped image;

a discrimination module, used to predict, through the discriminative network of the face-changing model, image attribute discrimination results for the face-swapped image and the reference image respectively, the image attributes including forged image and non-forged image; and

an update module, used to calculate the difference between the expression features of the face-swapped image and the expression features of the template image, calculate the difference between the identity features of the face-swapped image and the identity features of the face source image, and update the generative network and the discriminative network according to the image attribute discrimination results for the face-swapped image and the reference image, the calculated difference between the expression features, and the difference between the identity features.
The present application also provides a computer device. The computer device includes a memory and a processor, the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the steps of the above face-changing model training method are implemented.

The present application also provides a computer-readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the steps of the above face-changing model training method are implemented.

The present application also provides a computer program product, including computer-readable instructions; when the computer-readable instructions are executed by a processor, the steps of the above face-changing model training method are implemented.

The details of one or more embodiments of the present application are set forth in the following drawings and description. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings

To more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of image face swapping in one embodiment;

FIG. 2 is a diagram of the application environment of a face-changing model training method in one embodiment;

FIG. 3 is a schematic flowchart of a face-changing model training method in one embodiment;

FIG. 4 is a schematic diagram of the model structure of a face-changing model in one embodiment;

FIG. 5 is a schematic flowchart of a face-changing model training method in one embodiment;

FIG. 6 is a schematic diagram of a training framework of a face-changing model in one embodiment;

FIG. 7 is a schematic diagram of facial key points in one embodiment;

FIG. 8 is a schematic diagram of a training framework of a face-changing model in another embodiment;

FIG. 9 is a schematic diagram of a feature extraction network in one embodiment;

FIG. 10 is a schematic diagram of a training framework of a face-changing model in yet another embodiment;

FIG. 11 is a schematic flowchart of video face swapping in one embodiment;

FIG. 12 is a schematic diagram of the effect of face swapping on photos in one embodiment;

FIG. 13 is a structural block diagram of a face-changing model training device in one embodiment;

FIG. 14 is an internal structure diagram of a computer device in one embodiment.
Detailed Description

To make the purpose, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Supervised learning is a machine learning task in which an algorithm learns or establishes a pattern from a labeled training set and infers new instances based on this pattern. The training set consists of a series of training examples; each training example (or sample) consists of an input and supervision information (that is, the expected output, also called annotation information). The output the algorithm infers from an input can be a continuous value or a classification label.

Unsupervised learning is a machine learning task in which an algorithm learns patterns, structures and relationships from unlabeled data to discover hidden information and meaningful structure in the data. Unlike supervised learning, there is no supervision information to guide the learning process; the algorithm must discover the inherent patterns of the data on its own.

Generative Adversarial Network (GAN): a method of unsupervised learning that learns by letting two neural networks compete with each other. It consists of a generator network and a discriminator network. The generator network randomly samples from a latent space as input, and its output needs to imitate the samples in the training set as closely as possible; that is, its training goal is to generate samples as similar as possible to the training samples. The input of the discriminator network is the output of the generator network, and its purpose is to distinguish the samples output by the generator network from the samples in the training set as well as possible. The generator network, in turn, tries to deceive the discriminator network. The two networks compete with each other and continuously update their parameters; finally, the generator network can generate samples that are very similar to the samples in the training set.
Face swapping: replacing the face in a template image with the face in an input face source image, outputting a face-swapped image, and making the output face-swapped image keep the expression, angle, background and other information of the template image. As shown in FIG. 1, the input face source image of the face-swapping process contains face A and the template image contains another face B; through face swapping, a photo is output in which face B in the template image is replaced with face A.

Face-changing model: a machine learning model implemented with deep learning and face recognition technology, which can extract a person's facial expression, eyes, mouth and other features from a photo or video and match them with the facial features of another person.

Video face swapping has many application scenarios, such as film and television portrait production, game character design, virtual avatars, and privacy protection. In film and television production, when an actor cannot perform a professional action, a professional can perform it first, and face-swapping technology can later replace the face with the actor's automatically. When an actor needs to be replaced, a new face can be substituted through face-swapping technology, avoiding reshooting and saving substantial cost. In virtual avatar design, for example in live streaming, users can swap faces with virtual characters to make the broadcast more entertaining and to protect personal privacy. The results of video face swapping can also provide adversarial-attack training material for services such as face recognition.
GT: Ground Truth, the true value, also called reference information, annotation information or supervision information.

At present, in the related art, face-changing models trained with rather complex face-swapping networks can achieve satisfactory results for ordinary expressions, such as smiling. But in scenes with richer expressions, such as pouting, closed eyes, winking with one eye, or anger, the expression of the face-swapped image is poorly preserved, and some harder expressions cannot be maintained at all, resulting in a poor face-swapping effect.
The face-changing model training method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 2. The terminal 102 communicates with the server 104 via a network. A data storage system can store the data the server 104 needs to process; it can be integrated on the server 104, or placed on the cloud or on other servers. The terminal 102 can be, but is not limited to, a personal computer, a laptop, a smartphone, a tablet, an Internet-of-Things device or a portable wearable device; the IoT device can be a smart speaker, smart TV, smart air conditioner, smart in-vehicle device, and the like, and the portable wearable device can be a smart watch, smart bracelet, head-mounted device, and the like. The server 104 can be implemented as an independent server or as a server cluster consisting of multiple servers.

In one embodiment, the terminal 102 can run an application client, and the server 104 can be the background server serving that client. The client can send images or videos collected by the terminal 102 to the server 104; after obtaining a trained face-changing model through the training method provided in this application, the server 104 can, through the generative network of the trained model, replace the face in the images or videos collected by the terminal 102 with another face or a virtual avatar and return the result to the terminal 102 in real time, and the terminal 102 then displays the face-swapped images or videos through the client. The application client can be a video client, a social application client, an instant messaging client, and so on.
FIG. 3 is a schematic flowchart of a face-changing model training method provided by the present application. The execution subject of this embodiment can be a computer device or a cluster of computer devices, and the computer device can be a server or a terminal. Therefore, the execution subject in the embodiments of the present application can be a server, a terminal, or a combination of the two. Here, the execution subject is a server by way of example, and the method includes the following steps:
Step 302, acquire a sample triplet, the sample triplet including a face source image, a template image and a reference image.

In this application, the face-changing model includes a generator network and a discriminator network, and it is trained through the generative adversarial network (GAN) formed by the two; the details are introduced later.

In this application, the sample triplets are the sample data used to train the face-changing model. The server can obtain multiple sample triplets for training. Each sample triplet includes a face source image, a template image and a reference image. The face source image provides the face and can be recorded as source; the template image provides the expression, pose, image background and other information and can be recorded as template. Face swapping replaces the face in the template image with the face in the face source image, and the swapped image keeps the expression, pose, image background, etc. of the template image. The reference image serves as the supervision information required for training and can be recorded as GT. Since training with each sample triplet (or each batch of sample triplets) follows the same principle, the process of training with one sample triplet is used here as an example.

It can be understood that, based on the definition of face swapping, for each sample triplet, the reference image that provides the supervision information should have the same identity attributes as the face source image and the same non-identity attributes as the template image. In addition, to guarantee the face-swapping effect, the face source image should have different identity attributes from the template image. A human face is usually unique; the identity attribute refers to the identity represented by the face in the image, and having the same identity attribute means the images show the same face. Non-identity attributes refer to the pose, expression and makeup of the face, as well as attributes such as the style and background of the image.

For example, in a video face-swapping scene, the face in the face source image and the face in the reference image are the same person's face, but the expression, makeup, pose and image background of the two can be partially the same or different. The face in the face source image and the face in the template image are the faces of two different people. It can be understood that the face source image and the reference image can also be the same image.
In one embodiment, the sample triplet can be constructed as follows (a code sketch follows the example below): acquire a first image and a second image, where the first image and the second image correspond to the same identity attribute and to different non-identity attributes; acquire a third image, where the third image and the first image correspond to different identity attributes; replace the object in the second image with the object in the third image to obtain a fourth image; and use the first image as the face source image, the fourth image as the template image, and the second image as the reference image, forming one sample triplet.

Specifically, the server can randomly acquire the first image, determine the identity information corresponding to the face in the first image, and acquire another image corresponding to that identity information as the second image, so that the first and second images contain the same face, that is, the same identity attribute. The server can then randomly acquire a third image whose identity attribute differs from the first image's, that is, whose face belongs to a different person. The server can input the second and third images into the face-changing model and, through its generative network, replace the face in the second image with the object in the third image to obtain the fourth image, which keeps the expression, pose, image background and other characteristics of the second image. It should be noted that the first, second and third images are all images containing faces, and the server can acquire them randomly from a face image data set.

For example, the first image contains the face of Mr. A with a laughing expression against background 1. The second image contains the face of Mr. A with a smiling expression against background 2. The third image contains the face of Ms. B with an angry expression against background 3. Obviously, Mr. A's face differs from Ms. B's face; that is, the third image has a different face from the first and second images. The server replaces Mr. A's face in the second image with Ms. B's face to obtain the fourth image, whose expression keeps the smile of the second image and whose background remains background 2. Thus the first image serves as the face source image (providing Mr. A's face, the laughing expression and background 1), the fourth image serves as the template image (providing Ms. B's face, the smiling expression and background 2), and the second image serves as the reference image (providing Mr. A's face, the smiling expression and background 2), forming a sample triplet. It follows that the reference image is a real image rather than a forged or synthesized one.
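A compact sketch of this construction follows; the `face_swap` callable stands in for the generation step that produces the fourth image, and all names are illustrative:

```python
def build_sample_triplet(first_img, second_img, third_img, face_swap):
    """first_img and second_img share one identity; third_img has another.
    The fourth image puts the third image's face onto the second image while
    keeping the second image's expression, pose, and background."""
    fourth_img = face_swap(template=second_img, source=third_img)
    return {"source": first_img,       # face source image
            "template": fourth_img,    # template image
            "reference": second_img}   # reference image (real, not synthesized)
```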
In this embodiment, the second image serving as the reference image is a real image rather than a forged one. Using such a reference image as the target makes the face-swapped image output by the generative network successively approach the real reference image, which ensures that the output face-swapped image stays coherent and smooth with the non-synthesized parts in shape, lighting, motion, and other respects, yielding high-quality face-swapped images or videos with a better face-swapping effect.
In one embodiment, after the server obtains the above sample triplets, it may input them directly into the face-swapping model to train the model.
In one embodiment, after the server obtains the above sample triplet, it first performs image preprocessing on each of the three images in the triplet, and uses the preprocessed images to train the face-swapping model. Specifically, the preprocessing may include the following aspects. First, since a face usually occupies only part of an image, the server may first perform face detection on the image to obtain the face region; the face detection network or algorithm required may be a pre-trained neural network model. Second, facial key point detection: key point detection is performed within the face region to obtain the key points of the face, such as the key points of the eyes, mouth corners, and facial contour. Third, face registration: based on the identified key points, an affine transformation is used to uniformly align and "straighten" the faces, eliminating as far as possible the errors caused by different poses, after which the face image is cropped.
Optionally, through the above preprocessing steps, the server may obtain the cropped face source image, template image, and reference image, and input the cropped images into the face-swapping model; the face-swapped image output by the model contains only the face, and that output is then used to replace the face region in the template image, yielding the final face-swapped image. This ensures the training effect of the face-swapping model.
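A minimal preprocessing sketch follows; detect_face and detect_landmarks are hypothetical stand-ins for whatever pre-trained detectors are actually used, and the canonical point coordinates are illustrative assumptions. Only the registration step uses real OpenCV calls:

```python
import cv2
import numpy as np

# Canonical positions (in the 512*512 crop) that the detected key points are
# mapped onto; these exact coordinates are illustrative assumptions.
CANONICAL_POINTS = np.float32([[186, 220], [326, 220], [256, 360]])  # eyes, mouth

def preprocess(image, detect_face, detect_landmarks, size=512):
    """Face detection -> key point detection -> affine registration -> crop.

    `detect_face` and `detect_landmarks` are hypothetical placeholders for
    the pre-trained networks described in the text (aspects 1 and 2).
    """
    box = detect_face(image)                  # aspect 1: face region
    landmarks = detect_landmarks(image, box)  # aspect 2: eyes / mouth / contour
    pts = np.float32([landmarks["left_eye"],
                      landmarks["right_eye"],
                      landmarks["mouth"]])
    # Aspect 3: affine transform that "straightens" the face onto the
    # canonical pose, then crops the registered face image.
    M, _ = cv2.estimateAffinePartial2D(pts, CANONICAL_POINTS)
    return cv2.warpAffine(image, M, (size, size))
```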
Step 304: concatenate the expression features of the template image with the identity features of the face source image to obtain a combined feature.
The expression features of an image reflect the expression information the image conveys; they are features of the facial expression obtained by locating and extracting the organ features, texture regions, and predefined feature points of the face. Expression features are the key to expression recognition and determine the final recognition result. The identity features of an image are biometric features that can be used for identity recognition, such as facial features, pupil features, fingerprint features, palm print features, and so on. In this application, the identity features are facial features recognized from the human face and can be used for face recognition.
In one embodiment, the server may extract features from the template image through the expression recognition network of the face-swapping model to obtain the expression features of the template image, and extract features from the face source image through the face recognition network of the face-swapping model to obtain the identity features of the face source image.
In this embodiment, besides the generative network and the discriminative network, the face-swapping model also includes a pre-trained expression recognition network and a pre-trained face recognition network; both are pre-trained neural network models.
Expression recognition is an important research direction in computer vision: it is the process of predicting the emotion category expressed by a face by analyzing and processing the face image. The embodiments of this application do not limit the network structure of the expression recognition network. Optionally, the expression recognition network may be built on a convolutional neural network (CNN), which uses convolutional and pooling layers to extract features from the input face image and performs expression classification through a fully connected layer.
Optionally, the expression recognition network may be trained on a series of images with corresponding expression labels. Specifically, a face image dataset with expression labels is obtained, containing sample face images of different emotion categories, such as happiness, sadness, anger, blinking, winking, making faces, and other common and complex expressions. For an expression recognition network built on a convolutional neural network, the stacked convolutional and pooling layers gradually extract increasingly abstract, high-level feature representations of the sample face image, namely the expression features; the fully connected layer classifies the extracted expression features to obtain a prediction of the facial expression in the sample image. From the difference between this prediction and the sample image's expression label, a loss function for the expression recognition network can be constructed, and the network parameters are updated from this loss function, for example by minimizing it. Repeating such updates over multiple sample face images finally yields a trained expression recognition network. The trained expression recognition network can extract the expression features of an image; in this application, expression features are used to constrain expression consistency, that is, the expression similarity between the face-swapped image and the template image. The server may extract features from the template image directly through the trained expression recognition network to obtain the corresponding expression features. The server may also perform face detection on the template image through the expression recognition network, determine the face region in the template image from the detection result, and then extract features from the face region to obtain the corresponding expression features. The expression features of the template image may be denoted template_exp_features.
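A minimal sketch of the kind of CNN classifier described above; the layer widths, the input size, and the choice of 7 emotion classes are assumptions, not values taken from this application:

```python
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    """CNN expression classifier: conv + pool feature extractor, FC classifier."""

    def __init__(self, num_classes=7, feat_dim=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.embed = nn.Linear(128 * 4 * 4, feat_dim)   # expression features
        self.classify = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.embed(self.features(x).flatten(1))
        return feats, self.classify(feats)

# One training step: minimize cross-entropy between the prediction and the
# expression label, as the paragraph above describes.
net = ExpressionNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
images, labels = torch.randn(8, 3, 112, 112), torch.randint(0, 7, (8,))
_, logits = net(images)
loss = nn.functional.cross_entropy(logits, labels)
opt.zero_grad(); loss.backward(); opt.step()
```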
Face recognition is a biometric technique that identifies a person based on facial feature information, and is one of the research challenges in the field of biometrics. The embodiments of this application do not limit the network structure adopted by the face recognition network. Optionally, the face recognition network may be built on a convolutional neural network (CNN), which uses convolutional and pooling layers to extract features from the input face image and performs identity classification through a fully connected layer. The face recognition network may be trained on a series of images with corresponding identity labels. Specifically, the face recognition network includes multiple stacked convolutional and pooling layers as well as a fully connected layer. A convolutional layer filters the input sample face image with a set of learnable filters (also called convolution kernels) to extract local features from the sample face image. A pooling layer reduces the dimensionality of the local features, reduces the amount of computation, and strengthens the model's invariance to the input image. The fully connected layer maps the extracted features to the final output categories, for example the specific object identity in face recognition. The trained face recognition network can extract the identity features of an image; in this application, identity features are used to constrain identity consistency, that is, the identity similarity between the face-swapped image and the face source image. The server may extract features from the face source image directly through the trained face recognition network to obtain the corresponding identity features. The server may also perform face detection through the trained face recognition network, determine the face region in the face source image from the detection result, and then extract features from the face region to obtain the corresponding identity features. The identity features of the face source image may be denoted source_id_features.
The combined feature is obtained by the server by concatenating the expression features of the template image with the identity features of the face source image. For example, if the expression feature is a 1024-dimensional vector and the identity feature a 512-dimensional vector, concatenating the two along the feature dimension (concat) yields a 1536-dimensional combined feature. Of course, the concatenation method is not limited to this, and the embodiments of this application impose no restriction on it. For example, multi-scale feature fusion may also be used, extracting features of different scales from different layers of the two networks and fusing them to obtain the combined feature. The combined feature may be denoted id_exp_features.
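The plain concatenation variant can be expressed in one call; the batch dimension and the random tensors are assumptions for illustration:

```python
import torch

# Dimensions follow the example in the text: 1024-d expression features
# concatenated with 512-d identity features along the feature dimension.
template_exp_features = torch.randn(1, 1024)  # from the expression recognition network
source_id_features = torch.randn(1, 512)      # from the face recognition network

id_exp_features = torch.cat([template_exp_features, source_id_features], dim=1)
assert id_exp_features.shape == (1, 1536)     # the combined feature
```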
The combined feature obtained by the server subsequently participates in decoding together with the encoded features required for face swapping, so as to output the face-swapped image. In other words, in this application, when training the face-swapping model, not only do the encoded features of the template image and the face source image themselves participate in decoding to output the face-swapped image, but the expression features of the template image and the identity features of the face source image also participate in decoding, so that the output face-swapped image carries both the expression information of the template image and the identity information of the face source image. That is, while preserving the expression of the template image as far as possible, the output also resembles the face source image as closely as possible, improving the accuracy and effect of face swapping on face images.
Step 306: encode the face source image and the template image through the generative network of the face-swapping model to obtain the encoded features required for face swapping; fuse the encoded features with the combined feature to obtain a fused feature; and decode the fused feature through the generative network of the face-swapping model to obtain the face-swapped image.
As shown in Figure 4, which is a schematic diagram of the model structure of a face-swapping model in one embodiment, the face-swapping model includes a face recognition network, an expression recognition network, a generative network, and a discriminative network.
In this application, the face-swapping model is trained as a Generative Adversarial Network (GAN) formed by a Generator Network and a Discriminator Network. In one embodiment, referring to Figure 4, the generative network consists of two parts, an encoder and a decoder. The encoder repeatedly halves the size (resolution) of the input image through convolution while the number of channels gradually increases; the encoding process is essentially realized by applying convolution kernels (also called filters) to the input data corresponding to the input image, and the encoder, composed of multiple convolution kernels, finally outputs a feature vector. The decoder performs deconvolution operations, gradually doubling the feature resolution while the number of channels gradually decreases, reconstructing or generating an image from the features.
In one embodiment, encoding based on the face source image and the template image through the generative network of the face-swapping model to obtain the encoded features required for face swapping includes: concatenating the face source image with the template image to obtain an input image, inputting the input image into the face-swapping model, and encoding the input image through the generative network of the model to obtain the encoded features required for swapping the face in the template image.
Specifically, the face source image and the template image are both three-channel images. The server may concatenate them along the image channels, and the resulting six-channel input image is fed into the encoder of the generative network. Through the encoder, the input image is encoded step by step to obtain an intermediate result in the latent space, namely the encoded features (which may be denoted swap_features). For example, the input image is encoded step by step from a resolution of 512*512*6 to 256*256*32, 128*128*64, 64*64*128, 32*32*256, and so on, finally yielding an intermediate result in the latent space, called the encoded features, i.e., swap_features. These encoded features carry the image information of both the face source image and the template image.
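A sketch of the channel-wise concatenation and the progressive encoder, following the resolutions quoted above; the exact convolution settings (kernel size, stride, activation) are assumptions:

```python
import torch
import torch.nn as nn

def down(cin, cout):
    # Each stage halves the resolution while changing the channel count.
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.LeakyReLU(0.2))

# Channel progression follows the example: 512*512*6 -> ... -> 32*32*256.
encoder = nn.Sequential(down(6, 32), down(32, 64), down(64, 128), down(128, 256))

source = torch.randn(1, 3, 512, 512)    # face source image
template = torch.randn(1, 3, 512, 512)  # template image
input_image = torch.cat([source, template], dim=1)  # six-channel input image
swap_features = encoder(input_image)
assert swap_features.shape == (1, 256, 32, 32)      # latent-space encoded features
```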
Further, the server may fuse the encoded features with the above combined feature to obtain a fused feature, which carries both the content of the encoded features and the style of the combined feature.
In one embodiment, the server may separately compute the mean and standard deviation of the encoded features and of the combined feature; normalize the encoded features according to their mean and standard deviation to obtain normalized encoded features; and transfer the style of the combined feature onto the normalized encoded features according to the combined feature's mean and standard deviation, obtaining the fused feature.
Specifically, the server may fuse the encoded features with the combined feature by means of AdaIN (Adaptive Instance Normalization) to obtain the fused feature. The principle is shown by the following formula:
AdaIN(x, y) = σ(y) · ((x - μ(x)) / σ(x)) + μ(y);
Here x and y are the encoded features and the combined feature respectively, and σ and μ are the standard deviation and the mean respectively; the formula aligns the mean and standard deviation of the encoded features with those of the combined feature. μ(x) is the mean of the encoded features, σ(x) their standard deviation, σ(y) the standard deviation of the combined feature, and μ(y) its mean. It can be understood that both the encoded features and the combined feature are multi-channel two-dimensional matrices; for example, the encoded feature matrix is of size 32*32*256, and for each channel the mean and standard deviation can be computed from the values of all its elements, giving the per-channel mean and standard deviation of the encoded features. The same holds for the combined feature: for each of its channels, the mean and standard deviation are computed from the values of all its elements, giving the per-channel mean and standard deviation of the combined feature.
First, the server normalizes the encoded features with their own mean and standard deviation; that is, subtracting the mean of the encoded features and then dividing by their standard deviation yields the normalized encoded features. After normalization, the features have a mean of 0 and a standard deviation of 1, which removes the original style of the encoded features while retaining their original content. Next, using the mean and standard deviation of the combined feature, the style of the combined feature is transferred onto the normalized encoded features; that is, the normalized encoded features are multiplied by the combined feature's standard deviation and the combined feature's mean is then added, yielding the fused feature. In this way, the fused feature retains the content of the encoded features while carrying the style of the combined feature.
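A per-channel AdaIN sketch consistent with the formula above; how the 1536-d combined feature is mapped to the per-channel statistics (μ(y), σ(y)) is an implementation choice and is assumed here to happen outside this function:

```python
import torch

def adain(swap_features, style_stats, eps=1e-5):
    """AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y), per channel.

    swap_features: encoded features of shape (N, C, H, W).
    style_stats: (mu_y, sigma_y), each of shape (N, C, 1, 1), assumed to be
    derived from the combined feature id_exp_features (e.g. by a small
    linear layer; that mapping is a hypothetical placeholder).
    """
    mu_x = swap_features.mean(dim=(2, 3), keepdim=True)
    sigma_x = swap_features.std(dim=(2, 3), keepdim=True) + eps
    normalized = (swap_features - mu_x) / sigma_x  # content kept, style removed
    mu_y, sigma_y = style_stats
    return sigma_y * normalized + mu_y             # combined feature's style applied

x = torch.randn(1, 256, 32, 32)
mu_y, sigma_y = torch.randn(1, 256, 1, 1), torch.rand(1, 256, 1, 1)
fused_features = adain(x, (mu_y, sigma_y))
```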
It can be understood that, as noted above, the encoded features carry the image information of both the face source image and the template image, while the combined feature carries both the expression features and the identity features required for face swapping. Fusing the encoded features with the combined feature in this way therefore yields a fused feature that makes the face in the decoded face-swapped image resemble the face in the face source image, while the face-swapped image retains the expression, pose, image background, and other characteristics of the face in the template image, improving the accuracy of the output face-swapped image.
Of course, the server may also fuse the encoded features with the combined feature in other ways, for example Batch Normalization, Instance Normalization, Conditional Instance Normalization, and so on; the embodiments of this application do not limit the fusion method.
After obtaining the fused feature, the server feeds it into the decoder of the generative network. Through the decoder's deconvolution operations, the resolution of the fused feature is gradually doubled while the number of channels gradually decreases, and the face-swapped image is output. For example, with a fused feature of resolution 32*32*256, the decoder's successive deconvolutions output 64*64*128, 128*128*64, 256*256*32, and 512*512*3 in turn, finally producing the face-swapped image.
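The mirror-image decoder can be sketched the same way; the exact deconvolution settings and the Tanh output activation are assumptions:

```python
import torch
import torch.nn as nn

def up(cin, cout, act=nn.ReLU()):
    # Each stage doubles the resolution while reducing the channel count.
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), act)

# Resolution progression follows the example: 32*32*256 -> ... -> 512*512*3.
decoder = nn.Sequential(up(256, 128), up(128, 64), up(64, 32),
                        up(32, 3, act=nn.Tanh()))

fused_features = torch.randn(1, 256, 32, 32)
fake = decoder(fused_features)            # the face-swapped image
assert fake.shape == (1, 3, 512, 512)
```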
Step 308: through the discriminative network of the face-swapping model, separately predict the image attribute discrimination results for the face-swapped image and the reference image, where the image attributes include forged image and non-forged image.
Referring to Figure 4, the face-swapping model further includes a discriminative network, which is used to judge whether an input image is a forged image or a non-forged image. After the generative network outputs the face-swapped image, the server feeds it into the discriminative network, which extracts features from the input face-swapped image to obtain low-dimensional discriminative information, and classifies the image attribute based on the extracted information, yielding the corresponding image attribute discrimination result. In this application, the classification performed by the discriminative network is a binary classification of image attributes, that is, judging whether an image is a forged image or a non-forged image. A forged image is also called a synthesized image, and a non-forged image is also called a real image.
In addition, the server also feeds the reference image of the sample triplet into the discriminative network, which extracts features from the input reference image to obtain low-dimensional discriminative information, and classifies the image attribute based on the extracted information, yielding the corresponding image attribute discrimination result.
In one embodiment, obtaining the corresponding image attribute discrimination results through the discriminative network of the face-swapping model based on the face-swapped image and the reference image includes: inputting the face-swapped image into the discriminative network of the face-swapping model to obtain a first probability that the face-swapped image is a non-forged image; and inputting the reference image into the discriminative network to obtain a second probability that the reference image is a non-forged image. It can be understood that the training goal of the discriminative network is to make the first probability it outputs as small as possible and the second probability as large as possible; such a discriminative network has good performance.
Step 310: compute the difference between the expression features of the face-swapped image and those of the template image, compute the difference between the identity features of the face-swapped image and those of the face source image, and update the generative network and the discriminative network according to the image attribute discrimination results for the face-swapped image and the reference image, the computed difference between expression features, and the difference between identity features.
In this application, the face-swapping model includes a generative network and a discriminative network, which are trained adversarially based on the discriminative network's image attribute discrimination results for the real reference data and the generated forged data. In addition, in the embodiments of this application, referring to Figure 4, in order that the output face-swapped image preserve the facial expression of the template image and the identity attribute of the face source image as far as possible, during training the server also computes the difference between the expression features of the face-swapped image and those of the template image, and the difference between the identity features of the face-swapped image and those of the face source image. From the computed differences and the discriminative network's image attribute discrimination results for the face-swapped image and the reference image, the loss function of the whole face-swapping model is jointly constructed, and the network parameters of the generative and discriminative networks are optimized and updated with the goal of minimizing this loss function. It should be noted that the embodiments of this application do not limit the specific network structures adopted by the generative and discriminative networks; it suffices that the generative network supports the image reconstruction and generation capabilities described above and that the discriminative network supports the image attribute discrimination capability described above. In addition, the expression features of the face-swapped image may be obtained by image feature extraction through the expression recognition network described above, and its identity features by image feature extraction through the face recognition network described above.
In one embodiment, the server alternates the following. With the network parameters of the generative network fixed, it constructs a discriminative loss for the discriminative network from the first probability that the face-swapped image is a non-forged image and the second probability that the reference image is a non-forged image, and updates the discriminative network's parameters with this loss. With the network parameters of the discriminative network fixed, it constructs the generative loss of the generative network from the first probability that the face-swapped image is a non-forged image, constructs an expression loss from the difference between the expression features of the face-swapped image and those of the template image, constructs an identity loss from the difference between the identity features of the face-swapped image and those of the face source image, constructs a face-swapping loss for the generative network from the generative loss, the expression loss, and the identity loss, and updates the generative network's parameters with the face-swapping loss. The alternation ends when the training stop condition is met, yielding the trained discriminative network and generative network.
In this embodiment, the training of the face-swapping model includes two alternating stages: stage one trains the discriminative network, and stage two trains the generative network.
The training goal of stage one is to make the discriminative network classify the face-swapped image as a forged image and the reference image as a non-forged image as far as possible. Therefore, in stage one, the parameters of the generative network are fixed, the sample triplet is input into the face-swapping model, and after the face-swapped image is output, the server updates the network parameters of the discriminative network according to the image attribute discrimination results it predicts for the face-swapped image and the reference image. That is, with the generative network's parameters fixed, the server constructs a discriminative loss for the discriminative network from the first probability that the face-swapped image is a non-forged image and the second probability that the reference image is a non-forged image, and updates the discriminative network's parameters with this loss.
Optionally, the discriminative loss of the discriminative network may be expressed by the following formula:
D_Loss = -log D(GT) - log(1 - D(fake));
Here D denotes the discriminative network, GT the reference image, and fake the face-swapped image; D(fake) is the first probability that the face-swapped image is a non-forged image, and D(GT) is the second probability that the reference image is a non-forged image.
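Stated directly in code; the eps term guards against log(0) and is an implementation detail, not part of the formula:

```python
import torch

def discriminator_loss(d_fake, d_gt, eps=1e-8):
    """D_Loss = -log D(GT) - log(1 - D(fake)).

    d_fake: the discriminative network's probability that the face-swapped
    image is non-forged; d_gt: its probability that the reference image is
    non-forged.
    """
    return -torch.log(d_gt + eps) - torch.log(1 - d_fake + eps)

# The loss falls as D assigns low probability to fake and high to GT.
print(discriminator_loss(torch.tensor(0.1), torch.tensor(0.9)))
```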
The training goal of stage two is to make the face-swapped image output by the generative network "deceive" the discriminative network as far as possible, so that it is predicted as a non-forged image. Therefore, in stage two, the parameters of the discriminative network are fixed, the same batch of sample triplets is input into the face-swapping model, and after the generative network outputs the face-swapped images, a loss function for training the generative network is constructed from the image attribute discrimination results the discriminative network predicts for the face-swapped image and the reference image; the generative network's parameters are updated from this loss function.
Optionally, in stage two, besides the generative loss of the generative network, the server also introduces an expression loss and an identity loss into the loss function used to train the generative network. Specifically, the server extracts features from the face-swapped image through the expression recognition network of the face-swapping model to obtain the expression features of the face-swapped image, and extracts features from the face-swapped image through the face recognition network of the model to obtain its identity features; both the expression recognition network and the face recognition network are pre-trained neural network models.
Thus, in stage two, the server may construct the generative loss of the generative network from the first probability that the face-swapped image is a non-forged image, construct the expression loss from the difference between the expression features of the face-swapped image and those of the template image, construct the identity loss from the difference between the identity features of the face-swapped image and those of the face source image, construct the face-swapping loss of the generative network from the generative loss, the expression loss, and the identity loss, and update the generative network's parameters with the face-swapping loss.
In one embodiment, the generative loss of the generative network may be expressed by the following formula:
G_Loss = log(1 - D(fake));
In one embodiment, the expression loss of the generative network may be expressed by the following formula:
Exp_features_loss = (template_exp_features - fake_exp_features)²;
Here template_exp_features denotes the expression features of the template image, and fake_exp_features the expression features of the face-swapped image.
In one embodiment, the identity loss of the generative network may be expressed by the following formula:
ID_loss = 1 - cosine_similarity(fake_id_features, source_id_features);
Here cosine_similarity() is the cosine similarity, fake_id_features denotes the identity features of the face-swapped image, and source_id_features the identity features of the face source image.
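The three stage-two terms translate almost line for line into code; taking the mean over the feature dimension of the squared difference is an assumption about how that term is reduced to a scalar:

```python
import torch
import torch.nn.functional as F

def generator_losses(d_fake, template_exp_features, fake_exp_features,
                     source_id_features, fake_id_features, eps=1e-8):
    """G_Loss, Exp_features_loss and ID_loss as given in the formulas above."""
    g_loss = torch.log(1 - d_fake + eps)                                 # G_Loss
    exp_loss = ((template_exp_features - fake_exp_features) ** 2).mean() # expression loss
    id_loss = 1 - F.cosine_similarity(fake_id_features,
                                      source_id_features, dim=-1).mean() # identity loss
    return g_loss, exp_loss, id_loss
```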
As shown in Figure 5, which is a schematic flowchart of a method for training a face-swapping model in one embodiment, the method may be executed by a computer device and specifically includes the following steps (a consolidated code sketch of the whole loop follows the step list):
Step 502: obtain a sample triplet, the sample triplet including a face source image, a template image, and a reference image;
Step 504: extract features from the template image through the expression recognition network of the face-swapping model to obtain the expression features of the template image;
Step 506: extract features from the face source image through the face recognition network of the face-swapping model to obtain the identity features of the face source image;
Step 508: concatenate the expression features of the template image with the identity features of the face source image to obtain a combined feature;
Step 510: concatenate the face source image with the template image to obtain an input image, input the input image into the face-swapping model, and encode the input image through the generative network of the model to obtain the encoded features required for swapping the face in the template image;
Step 512: separately compute the mean and standard deviation of the encoded features and of the combined feature, normalize the encoded features according to their mean and standard deviation to obtain normalized encoded features, and transfer the style of the combined feature onto the normalized encoded features according to the combined feature's mean and standard deviation, obtaining the fused feature;
Step 514: decode the fused feature through the generative network of the face-swapping model to obtain the face-swapped image;
Step 516: input the face-swapped image into the discriminative network of the face-swapping model to obtain a first probability that the face-swapped image is a non-forged image;
Step 518: input the reference image into the discriminative network of the face-swapping model to obtain a second probability that the reference image is a non-forged image;
Step 520: with the network parameters of the generative network fixed, construct a discriminative loss for the discriminative network from the first probability that the face-swapped image is a non-forged image and the second probability that the reference image is a non-forged image, and update the discriminative network's parameters with this loss;
Step 522: with the network parameters of the discriminative network fixed, extract features from the face-swapped image through the expression recognition network of the face-swapping model to obtain the expression features of the face-swapped image; extract features from the face-swapped image through the face recognition network of the model to obtain its identity features; and construct the generative loss of the generative network from the first probability that the face-swapped image is a non-forged image, construct the expression loss from the difference between the expression features of the face-swapped image and those of the template image, construct the identity loss from the difference between the identity features of the face-swapped image and those of the face source image, construct the face-swapping loss of the generative network from the generative loss, the expression loss, and the identity loss, and update the generative network's parameters with the face-swapping loss.
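The following minimal sketch strings steps 502-522 together. The `nets` object and its members (expression_net, face_net, encoder, decoder, discriminator, to_style_stats) are hypothetical placeholders for the application's actual modules, and `adain`, `discriminator_loss`, and `generator_losses` are the sketches given earlier:

```python
import torch

def training_step(triplet, nets, opt_d, opt_g):
    source, template, reference = triplet                        # step 502
    exp = nets.expression_net(template)                          # step 504
    idf = nets.face_net(source)                                  # step 506
    id_exp_features = torch.cat([exp, idf], dim=1)               # step 508
    x = torch.cat([source, template], dim=1)                     # step 510
    swap_features = nets.encoder(x)
    fused = adain(swap_features,
                  nets.to_style_stats(id_exp_features))          # step 512
    fake = nets.decoder(fused)                                   # step 514

    # Stage one: update the discriminative network (steps 516-520).
    d_fake = nets.discriminator(fake.detach())                   # step 516
    d_gt = nets.discriminator(reference)                         # step 518
    d_loss = discriminator_loss(d_fake, d_gt).mean()             # step 520
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Stage two: update the generative network (step 522).
    g_loss, exp_loss, id_loss = generator_losses(
        nets.discriminator(fake).mean(),
        exp, nets.expression_net(fake),
        idf, nets.face_net(fake))
    swap_loss = g_loss + exp_loss + id_loss                      # face-swapping loss
    opt_g.zero_grad(); swap_loss.backward(); opt_g.step()
```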
In the above method for training the face-swapping model, when training the model, not only do the encoded features of the template image and the face source image themselves participate in decoding to output the face-swapped image, but the expression features of the template image and the identity features of the face source image also participate in decoding, so that the output face-swapped image carries both the expression information of the template image and the identity information of the face source image; that is, it preserves the expression of the template image while resembling the face source image. Moreover, the face-swapping model is updated from the difference between the expression features of the template image and those of the face-swapped image, and from the difference between the identity features of the face source image and those of the face-swapped image; the former constrains the expression similarity between the face-swapped image and the template image, and the latter constrains the identity similarity between the face-swapped image and the face source image. In this way, even if the expression of the template image is rather complex, the output face-swapped image can still preserve that complex expression, improving the face-swapping effect. Furthermore, when updating the network parameters of the generative and discriminative networks, the two networks are trained adversarially against the image attribute discrimination results the discriminative network predicts for the face-swapped image and the reference image, improving the overall image quality of the face-swapped images output by the model.
In one embodiment, as shown in Figure 6, this application further introduces a pre-trained facial key point network when training the face-swapping model, and trains the generative network of the model according to the difference between the facial key point information of the template image and that of the face-swapped image. Specifically, the above method may further include: performing facial key point recognition on the template image and the face-swapped image respectively through the pre-trained facial key point network to obtain their respective facial key point information; and constructing a key point loss from the difference between the facial key point information of the template image and that of the face-swapped image, the key point loss being used in training the generative network of the face-swapping model.
To better achieve the effect that, when the facial expression in the template image is particularly complex, the generated face-swapped image can still preserve it, this application optionally also introduces a facial key point network when training the face-swapping model. The facial key point network can locate the positions of facial key points in an image, so a key point loss can be constructed from the difference between the facial key point information of the template image and that of the face-swapped image and used in training the generative network, guaranteeing expression consistency between the template image and the face-swapped image.
Facial key points are the pixels in an image where the facial features related to expression are located, such as the pixels of the eyebrows, mouth, eyes, nose, and facial contour. As shown in Figure 7, which is a schematic diagram of facial key points in one embodiment, 97 facial key points are illustrated: points 0-32 are the facial contour, 33-50 the eyebrow contours, 51-59 the nose, 60-75 the eye contours, 76-95 the mouth contour, and points 96 and 97 the pupil positions. Of course, a facial key point network may also locate more key points; some, for example, can locate 256 facial key points.
Facial key point detection is the process of locating the positions of the key points of a face based on an input face region. Affected by factors such as lighting, occlusion, and pose, facial key point detection is also a challenging task.
In one embodiment, the server locates the facial key points of the face-swapped image and of the template image respectively through a pre-trained facial key point network. For some or all of the facial key points, it computes, for each key point, the square of the difference between the feature values of the face-swapped image and the template image at that same key point, and sums these squares, recorded as the key point loss landmark_loss. During training, the smaller the key point loss, the better. For example, for key point No. 95, the square of the difference is computed between the feature values corresponding to facial key point No. 95 in the face-swapped image's key point output fake_landmark and in the template image's key point output template_landmark; summing over the facial key points in this way yields the key point loss. Of course, in some embodiments the server may also characterize the expression difference between the face-swapped image and the template image using only the differences in feature values at the key points of the eyebrows, mouth, and eyes.
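A direct sketch of that sum; representing each key point by its 2-D coordinates is an assumption, since the text speaks more generally of per-key-point feature values:

```python
import torch

def landmark_loss(fake_landmark, template_landmark, indices=None):
    """Sum of squared differences over corresponding facial key points.

    fake_landmark / template_landmark: (K, 2) tensors of key point
    coordinates from the pre-trained facial key point network. `indices`
    can restrict the loss to a subset such as eyebrows, mouth and eyes.
    """
    if indices is not None:
        fake_landmark = fake_landmark[indices]
        template_landmark = template_landmark[indices]
    return ((fake_landmark - template_landmark) ** 2).sum()

# E.g. mouth contour only, using the 76-95 numbering quoted above.
fake, template = torch.randn(98, 2), torch.randn(98, 2)
print(landmark_loss(fake, template, list(range(76, 96))))
```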
The embodiments of this application do not limit the network structure of the facial key point network used. Optionally, it may be built on a convolutional neural network, for example by designing a cascaded convolutional neural network with three levels, using the feature extraction capability of multi-level convolution to obtain progressively more accurate features from coarse to fine, and then predicting the positions of the facial key points with a fully connected layer. When training the facial key point network, a sample face image dataset is obtained in which each image carries corresponding key point annotations, that is, the position data of the facial key points. A sample face image is input into the facial key point network, which outputs the predicted position of each key point; the difference between the annotated and predicted positions of each key point is computed, and summing the differences over all key points gives the prediction difference for the whole sample face image. A loss function is constructed from this prediction difference, and the network parameters of the facial key point network are optimized by minimizing it.
In this embodiment, by introducing the facial key point network and the key point loss when training the face-swapping model, the generative network of the trained model can output face-swapped images with a better expression preservation effect.
In one embodiment, as shown in Figure 8, this application further introduces a pre-trained feature extraction network when training the face-swapping model, and trains the generative network of the model according to the difference between the image features of the face-swapped image and those of the reference image. Specifically, the above method may further include: extracting image features from the face-swapped image and the reference image respectively through the pre-trained feature extraction network to obtain their respective image features; and constructing a similarity loss from the difference between the image features of the face-swapped image and those of the reference image, the similarity loss being used in training the generative network of the face-swapping model.
In this embodiment, in order to measure the difference between the face-swapped image and the reference image at the feature level, it is desirable that the features of the generated face-swapped image be similar to those of the reference image. Optionally, a similarity loss is therefore also introduced when training the face-swapping model; the similarity loss may be, for example, Learned Perceptual Image Patch Similarity (LPIPS). The pre-trained feature extraction network extracts the features of the face-swapped image and the reference image at different levels respectively, and the similarity loss is constructed by comparing the feature differences between the two images at each corresponding level. During training, the smaller the feature difference between the face-swapped image and the reference image, the better. The embodiments of this application do not limit the network structure of the feature extraction network used.
As shown in Figure 9, which is a schematic diagram of a feature extraction network in one embodiment: during feature extraction, the deeper the level, the smaller the resolution of the features; low-level features can represent low-level attributes such as lines and colors, while high-level features can represent high-level attributes such as parts and objects. Comparing the image features extracted from two images can thus be used to measure the overall similarity of the two images.
Referring to Figure 9, which visualizes the features of different network layers: the feature extraction network includes five convolution operations. The input image has a resolution of 224*224*3. The first-level convolution operation Conv1 extracts the first-level image features, denoted fake_fea1, with resolution 55*55*96; the second-level convolution Conv2 and pooling extract the second-level image features, denoted fake_fea2, with resolution 27*27*256; the third-level convolution Conv3 and pooling extract the third-level image features, denoted fake_fea3, with resolution 13*13*384; and finally the convolution operation Conv5 and pooling yield the fourth-level image features, denoted fake_fea4, with resolution 13*13*256. At the end, a fully connected layer produces an output vector of dimension 1000, used for image classification or object detection.
In one embodiment, the image features the server extracts from the face-swapped image through the feature extraction network may be recorded as:
feature(fake) = (fake_fea1, fake_fea2, fake_fea3, fake_fea4);
Similarly, the image features the server extracts from the reference image through the feature extraction network may be recorded as:
feature(GT) = (GT_fea1, GT_fea2, GT_fea3, GT_fea4);
The similarity loss may then be expressed by the following formula:
Similarity_loss = (fake_fea1 - GT_fea1)² + (fake_fea2 - GT_fea2)² + (fake_fea3 - GT_fea3)² + (fake_fea4 - GT_fea4)²;
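As a sketch of that per-level comparison (the per-level mean reduction is an assumption, and the learned per-channel weights of full LPIPS are omitted here):

```python
def similarity_loss(fake_features, gt_features):
    """Sum the squared per-level feature differences (an LPIPS-style loss).

    fake_features / gt_features: tuples (fea1, ..., fea4) extracted from
    the face-swapped image and the reference image by the same pre-trained
    feature extraction network.
    """
    return sum(((f - g) ** 2).mean() for f, g in zip(fake_features, gt_features))
```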
In this embodiment, by constructing a similarity loss from the similarity between the features of the face-swapped image and those of the reference image when training the face-swapping model, and using it in training the generative network, the generative network of the trained model can output face-swapped images with a realistic face-swapping effect.
In one embodiment, this application also introduces a reconstruction loss when training the face-swapping model: it is constructed from the pixel-level difference between the reference image and the face-swapped image and used to train the generative network of the model. Specifically, the above method may further include: constructing a reconstruction loss from the pixel-level difference between the face-swapped image and the reference image, the reconstruction loss being used in training the generative network of the face-swapping model. During training, the smaller the pixel-level difference between the face-swapped image and the reference image, the better. The reconstruction loss may be expressed by the following formula:
Reconstruction_loss = |fake - GT|.
This formula expresses the difference between the face-swapped image fake and the reference image GT, which are of the same size. Specifically, the server may compute the difference between the pixel values of the two images at each identical pixel position and sum the differences over all pixel positions, obtaining the overall difference between the two images at the pixel level; this overall difference constitutes the reconstruction loss.
It can be understood that, when training the face-swapping model, the generative loss, expression loss, identity loss, key point loss, similarity loss, and reconstruction loss described above may all be introduced in the training stage of the generative network to construct the overall face-swapping loss of the generative network, in the hope that these many-sided constraints achieve a face-swapping effect that better preserves complex expressions.
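A weighted-sum sketch of that overall loss; the per-term weights and the mean reduction of |fake − GT| are assumptions, since the application does not state how the terms are balanced:

```python
def face_swapping_loss(g, exp, idl, lm, sim, fake, gt, weights=None):
    """Overall generator-side loss combining the six terms named above."""
    w = weights or dict(g=1.0, exp=1.0, id=1.0, lm=1.0, sim=1.0, rec=1.0)
    reconstruction = (fake - gt).abs().mean()  # Reconstruction_loss = |fake - GT|
    return (w["g"] * g + w["exp"] * exp + w["id"] * idl
            + w["lm"] * lm + w["sim"] * sim + w["rec"] * reconstruction)
```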
Figure 10 is a schematic diagram of the training architecture of the face-swapping model in a specific embodiment. Referring to Figure 10, the networks introduced when training the face-swapping model include: a generative network, a discriminative network, an expression recognition network, a face recognition network, a face keypoint network, and a feature extraction network. With reference to Figure 10, the training process of the face-swapping model is described as follows:
The server obtains training samples, which include multiple sample triplets; each sample triplet includes a face source image, a template image, and a reference image.
Next, the server extracts features from the template image through a pre-trained expression recognition network to obtain the expression features of the template image, extracts features from the face source image through a pre-trained face recognition network to obtain the identity features of the face source image, and concatenates the expression features of the template image with the identity features of the face source image to obtain combined features.
Next, the server also splices the face source image with the template image to obtain an input image, inputs the input image into the face-swapping model, and encodes the input image through the generative network of the face-swapping model to obtain the encoded features required for swapping the face in the template image.
Next, the server fuses the encoded features with the combined features to obtain fused features, and decodes the fused features through the generative network of the face-swapping model to obtain the face-swapped image.
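This forward pass can be summarized in a short sketch; the network objects expr_net, id_net, and generator, and their encode/fuse/decode interfaces, are hypothetical stand-ins for the modules of Figure 10:

```python
import torch

def forward_pass(face_source, template, expr_net, id_net, generator):
    expr_feat = expr_net(template)                # expression features of template
    id_feat = id_net(face_source)                 # identity features of face source
    combined = torch.cat([expr_feat, id_feat], dim=1)   # concatenated combined features

    x = torch.cat([face_source, template], dim=1) # channel-wise image splicing
    encoded = generator.encode(x)                 # encoded features for face swapping
    fused = generator.fuse(encoded, combined)     # e.g. AdaIN-style feature fusion
    fake = generator.decode(fused)                # the face-swapped image
    return fake, expr_feat, id_feat
```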
Next, the server inputs the face-swapped image into the discriminative network of the face-swapping model to obtain a first probability that the face-swapped image is a non-forged image, and inputs the reference image into the discriminative network to obtain a second probability that the reference image is a non-forged image.
Next, with the network parameters of the generative network fixed, a discriminative loss for the discriminative network is constructed from the first probability that the face-swapped image is non-forged and the second probability that the reference image is non-forged, and the network parameters of the discriminative network are updated using the discriminative loss.
Next, with the network parameters of the discriminative network fixed, the server re-inputs the face-swapped image into the updated discriminative network to obtain the first probability that the face-swapped image is a non-forged image, and constructs the generation loss of the generative network from this first probability. Through the expression recognition network of the face-swapping model, features are extracted from the face-swapped image to obtain its expression features, and an expression loss is constructed from the difference between the expression features of the face-swapped image and those of the template image. Through the face recognition network of the face-swapping model, features are extracted from the face-swapped image to obtain its identity features, and an identity loss is constructed from the difference between the identity features of the face-swapped image and those of the face source image. Through the pre-trained face keypoint network, face keypoint recognition is performed on the template image and the face-swapped image respectively to obtain their face keypoint information, and a keypoint loss is constructed from the difference between the two sets of keypoint information. Through the pre-trained feature extraction network, image features are extracted from the face-swapped image and the reference image respectively, and a similarity loss is constructed from the difference between their image features. A reconstruction loss is constructed from the pixel-level difference between the face-swapped image and the reference image. Finally, the face-swapping loss for the generative network is constructed from the generation loss, expression loss, identity loss, keypoint loss, similarity loss, and reconstruction loss, and the network parameters of the generative network are updated using the face-swapping loss.
By training alternately in this way, a trained face-swapping model is obtained once the training stop condition is met.
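One round of this alternating scheme could be sketched as follows, reusing forward_pass from the sketch above. The optimizers, the binary cross-entropy form of the adversarial losses (assuming a sigmoid-output discriminator), and the aux_loss_fn helper aggregating the expression, identity, keypoint, similarity, and reconstruction terms are all assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(batch, expr_net, id_net, generator, discriminator,
               opt_g, opt_d, aux_loss_fn):
    src, template, gt = batch                          # one sample triplet

    fake, _, _ = forward_pass(src, template, expr_net, id_net, generator)

    # (1) Update the discriminator with the generator's parameters fixed.
    p_fake = discriminator(fake.detach())              # first probability
    p_real = discriminator(gt)                         # second probability
    d_loss = (F.binary_cross_entropy(p_real, torch.ones_like(p_real)) +
              F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # (2) Update the generator with the discriminator's parameters fixed:
    # re-input the face-swapped image into the updated discriminator.
    p_fake = discriminator(fake)
    gen_loss = F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
    g_loss = gen_loss + aux_loss_fn(fake, src, template, gt)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```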
In one embodiment, after the trained face-swapping model is obtained, the server can use the generative network of the trained model together with the pre-trained expression recognition network and face recognition network to swap faces in a target image or target video, obtaining a face-swapped image or face-swapped video.
Taking face swapping of a target video as an example, the following steps are involved: video acquisition, image input, face detection, cropping of the face region, expression-optimized video face swapping, and result display.
Figure 11 is a schematic flowchart of video face swapping in one embodiment. The execution subject of this embodiment may be a computer device or a cluster of multiple computer devices; the computer device may be a server or a terminal. Referring to Figure 11, the method includes the following steps:
Step 1102: obtain the video whose faces are to be swapped and a face source image containing the target face.
The face source image may be the original image containing a face, or a cropped image containing only the face, obtained by performing face detection and alignment on the original image.
Step 1104: for each video frame of the video to be face-swapped, extract features from the video frame through the trained expression recognition network to obtain the expression features of the video frame.
The server may process the video frame directly, or it may first perform face detection and alignment on the video frame and work on the resulting cropped image containing only the face.
Step 1106: extract features from the face source image through the trained face recognition network to obtain the identity features of the face source image.
Step 1108: concatenate the expression features with the identity features to obtain combined features.
Step 1110: through the generative network of the trained face-swapping model, encode the face source image containing the target face together with the video frame to obtain the encoded features required for face swapping.
Step 1112: fuse the encoded features with the combined features to obtain fused features.
Step 1114: through the generative network of the trained face-swapping model, decode the fused features and output a face-swapped video in which the object in each video frame is replaced with the target face.
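The per-frame inference loop of steps 1102 to 1114 can be sketched as follows; frame I/O and the detection/cropping stage are omitted, the images are assumed to be pre-aligned to a common size, and the network interfaces are the same hypothetical ones used earlier:

```python
import torch

@torch.no_grad()
def swap_video(frames, face_source, expr_net, id_net, generator):
    id_feat = id_net(face_source)                     # step 1106, computed once
    out = []
    for frame in frames:                              # steps 1104-1114, per frame
        expr_feat = expr_net(frame)                   # expression features of frame
        combined = torch.cat([expr_feat, id_feat], dim=1)       # step 1108
        x = torch.cat([face_source, frame], dim=1)    # same splicing as training
        encoded = generator.encode(x)                 # step 1110
        fused = generator.fuse(encoded, combined)     # step 1112
        out.append(generator.decode(fused))           # step 1114
    return out
```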
Figure 12 is a schematic diagram of the effect of face swapping on photographs in one embodiment. A face-swapping model trained with the training method provided by the embodiments of the present application maintains a good swapping effect even under complex expressions, and can be used in many scenarios such as ID photo production, film and television portrait production, game character design, virtual avatars, and privacy protection. It preserves the facial expression of the template image even when that expression is complex, meets the face-swapping requirements of film and television scenes with complex expressions, and, in video scenarios, keeps the expression smooth and natural.
It should be understood that, although the steps in the flowcharts involved in the above embodiments are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a training apparatus for a face-swapping model for implementing the training method involved above. The solution to the problem provided by the apparatus is similar to that described for the method above; therefore, for the specific limitations in the one or more apparatus embodiments provided below, reference may be made to the limitations on the training method above, which are not repeated here.
In one embodiment, as shown in Figure 13, a training apparatus 1300 for a face-swapping model is provided, including an acquisition module 1302, a splicing module 1304, a generation module 1306, a discrimination module 1308, and an update module 1310, wherein:
the acquisition module 1302 is configured to obtain sample triplets, each sample triplet including a face source image, a template image, and a reference image;
the splicing module 1304 is configured to concatenate the expression features of the template image with the identity features of the face source image to obtain combined features;
the generation module 1306 is configured to encode the face source image and the template image through the generative network of the face-swapping model to obtain the encoded features required for face swapping, fuse the encoded features with the combined features to obtain fused features, and decode the fused features through the generative network of the face-swapping model to obtain the face-swapped image;
the discrimination module 1308 is configured to predict, through the discriminative network of the face-swapping model, image attribute discrimination results for the face-swapped image and the reference image respectively, the image attributes including forged image and non-forged image;
the update module 1310 is configured to calculate the difference between the expression features of the face-swapped image and those of the template image, calculate the difference between the identity features of the face-swapped image and those of the face source image, and update the generative network and the discriminative network according to the image attribute discrimination results for the face-swapped image and the reference image, the calculated difference between expression features, and the difference between identity features.
In one embodiment, the acquisition module 1302 is further configured to obtain a first image and a second image, the first image and the second image corresponding to the same identity attribute and to different non-identity attributes; obtain a third image, the third image and the first image corresponding to different identity attributes; replace the object in the second image with the object in the third image to obtain a fourth image; and take the first image as the face source image, the fourth image as the template image, and the second image as the reference image to form a sample triplet.
In one embodiment, the training apparatus 1300 for the face-swapping model further includes:
an expression recognition module, configured to extract features from the template image through the expression recognition network of the face-swapping model to obtain the expression features of the template image;
a face recognition module, configured to extract features from the face source image through the face recognition network of the face-swapping model to obtain the identity features of the face source image;
the expression recognition network and the face recognition network are both pre-trained neural network models.
In one embodiment, the generation module 1306 is further configured to splice the face source image with the template image to obtain an input image; input the input image into the face-swapping model; and encode the input image through the generative network of the face-swapping model to obtain the encoded features required for swapping the face in the template image.
In one embodiment, the training apparatus 1300 for the face-swapping model further includes:
a fusion module, configured to calculate the mean and standard deviation of the encoded features and of the combined features respectively; normalize the encoded features according to the mean and standard deviation of the encoded features to obtain normalized encoded features; and transfer the style of the combined features to the normalized encoded features according to the mean and standard deviation of the combined features to obtain the fused features.
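The fusion described here corresponds to adaptive instance normalization (AdaIN). A minimal sketch follows, assuming both inputs are spatial feature maps with per-channel statistics (if the combined features are a plain vector, the statistics would be computed over its dimensions instead); the eps stabilizer is a conventional assumption:

```python
import torch

def adain_fuse(encoded: torch.Tensor, combined: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    # Per-channel mean/std over spatial positions; shapes (N, C, 1, 1)
    mu_e = encoded.mean(dim=(2, 3), keepdim=True)
    std_e = encoded.std(dim=(2, 3), keepdim=True) + eps
    mu_c = combined.mean(dim=(2, 3), keepdim=True)
    std_c = combined.std(dim=(2, 3), keepdim=True) + eps
    normalized = (encoded - mu_e) / std_e          # normalized encoded features
    return std_c * normalized + mu_c               # style-transferred fused features
```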
In one embodiment, the discrimination module 1308 is further configured to input the face-swapped image into the discriminative network of the face-swapping model to obtain a first probability that the face-swapped image is a non-forged image, and to input the reference image into the discriminative network of the face-swapping model to obtain a second probability that the reference image is a non-forged image.
In one embodiment, the training apparatus 1300 for the face-swapping model further includes:
an expression recognition module, configured to extract features from the face-swapped image through the expression recognition network of the face-swapping model to obtain the expression features of the face-swapped image;
a face recognition module, configured to extract features from the face-swapped image through the face recognition network of the face-swapping model to obtain the identity features of the face-swapped image;
the expression recognition network and the face recognition network are both pre-trained neural network models.
In one embodiment, the update module 1310 is further configured to, alternately: with the network parameters of the generative network fixed, construct a discriminative loss for the discriminative network according to the first probability that the face-swapped image is a non-forged image and the second probability that the reference image is a non-forged image, and update the network parameters of the discriminative network using the discriminative loss; and, with the network parameters of the discriminative network fixed, construct a generation loss of the generative network according to the first probability that the face-swapped image is a non-forged image, construct an expression loss according to the difference between the expression features of the face-swapped image and those of the template image, construct an identity loss according to the difference between the identity features of the face-swapped image and those of the face source image, construct a face-swapping loss for the generative network according to the generation loss, expression loss, and identity loss, and update the network parameters of the generative network using the face-swapping loss; the alternation ends when a training stop condition is met, yielding the trained discriminative network and generative network.
In one embodiment, the training apparatus 1300 for the face-swapping model further includes:
a keypoint localization module, configured to perform face keypoint recognition on the template image and the face-swapped image respectively through a pre-trained face keypoint network to obtain their respective face keypoint information;
the update module 1310 is further configured to construct a keypoint loss according to the difference between the face keypoint information of the template image and that of the face-swapped image; the keypoint loss is used to participate in the training of the generative network of the face-swapping model.
In one embodiment, the training apparatus 1300 for the face-swapping model further includes:
an image feature extraction module, configured to perform image feature extraction on the face-swapped image and the reference image respectively through a pre-trained feature extraction network to obtain their respective image features;
the update module 1310 is further configured to construct a similarity loss according to the difference between the image features of the face-swapped image and those of the reference image; the similarity loss is used to participate in the training of the generative network of the face-swapping model.
In one embodiment, the update module 1310 is further configured to construct a reconstruction loss according to the pixel-level difference between the face-swapped image and the reference image; the reconstruction loss is used to participate in the training of the generative network of the face-swapping model.
In one embodiment, the training apparatus 1300 for the face-swapping model further includes:
a face-swapping module, configured to obtain a video whose faces are to be swapped and a face source image containing a target face; for each video frame of the video, obtain the expression features of the video frame; obtain the identity features of the face source image containing the target face; concatenate the expression features with the identity features to obtain combined features; and, through the generative network of the trained face-swapping model, encode the face source image containing the target face and the video frame to obtain the encoded features required for face swapping, decode the fused features obtained by fusing the encoded features with the combined features, and output a face-swapped video in which the object in the video frame is replaced with the target face.
In the above training apparatus 1300, when the face-swapping model is trained, not only do the encoded features of the template image and the face source image themselves participate in decoding to output the face-swapped image, but the expression features of the template image and the identity features of the face source image also participate in decoding, so that the output face-swapped image carries both the expression information of the template image and the identity information of the face source image; that is, it preserves the expression of the template image while still resembling the face source image. In addition, the face-swapping model is updated using the difference between the expression features of the template image and those of the face-swapped image, and the difference between the identity features of the face source image and those of the face-swapped image: the former constrains the expression similarity between the face-swapped image and the template image, and the latter constrains the identity similarity between the face-swapped image and the face source image. In this way, even when the expression in the template image is complex, the output face-swapped image can still retain that complex expression, improving the face-swapping effect. Moreover, when the network parameters of the generative network and the discriminative network are updated, the image attribute discrimination results predicted by the discriminative network for the face-swapped image and the reference image are also used, so that the generative network and the discriminative network are trained adversarially, improving the overall image quality of the face-swapped images output by the model.
Each module in the above training apparatus 1300 may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server or a terminal; its internal structure may be as shown in Figure 14. The computer device includes a processor, a memory, an input/output interface (I/O), and a communication interface. The processor, the memory, and the input/output interface are connected via a system bus, and the communication interface is connected to the system bus via the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used to communicate with external devices via a network connection. The computer-readable instructions, when executed by the processor, implement a training method for a face-swapping model.
Those skilled in the art will understand that the structure shown in Figure 14 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing computer-readable instructions; when the processor executes the computer-readable instructions, the steps of the training method for the face-swapping model provided in any embodiment of the present application are implemented.
In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the steps of the training method for the face-swapping model provided in any embodiment of the present application are implemented.
In one embodiment, a computer program product is provided, including computer-readable instructions; when the computer-readable instructions are executed by a processor, the steps of the training method for the face-swapping model provided in any embodiment of the present application are implemented.
It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in the present application are information and data authorized by the users or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through computer-readable instructions, which may be stored in a non-volatile computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the above methods. Any reference to memory, database, or other media used in the embodiments provided in the present application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory, among others. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided in the present application may include at least one of a relational database and a non-relational database; non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments provided in the present application may be, without limitation, general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, or data processing logic devices based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the present patent application. It should be noted that a person of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (16)

1. A training method for a face-swapping model, performed by a computer device, the method comprising:
    obtaining a sample triplet, the sample triplet comprising a face source image, a template image, and a reference image;
    concatenating the expression features of the template image with the identity features of the face source image to obtain combined features;
    encoding, through a generative network of the face-swapping model, the face source image and the template image to obtain encoded features required for face swapping;
    fusing the encoded features with the combined features to obtain fused features;
    decoding, through the generative network of the face-swapping model, the fused features to obtain a face-swapped image;
    predicting, through a discriminative network of the face-swapping model, image attribute discrimination results for the face-swapped image and the reference image respectively, the image attributes comprising forged image and non-forged image; and
    calculating the difference between the expression features of the face-swapped image and the expression features of the template image, calculating the difference between the identity features of the face-swapped image and the identity features of the face source image, and updating the generative network and the discriminative network according to the image attribute discrimination results for the face-swapped image and the reference image, the calculated difference between expression features, and the difference between identity features.
2. The method according to claim 1, wherein the obtaining a sample triplet comprises:
    obtaining a first image and a second image, the first image and the second image corresponding to the same identity attribute and to different non-identity attributes;
    obtaining a third image, the third image and the first image corresponding to different identity attributes;
    replacing the object in the second image with the object in the third image to obtain a fourth image; and
    constructing a sample triplet by taking the first image as the face source image, the fourth image as the template image, and the second image as the reference image.
3. The method according to claim 1 or 2, further comprising:
    extracting features from the template image through an expression recognition network of the face-swapping model to obtain the expression features of the template image; and
    extracting features from the face source image through a face recognition network of the face-swapping model to obtain the identity features of the face source image;
    wherein the expression recognition network and the face recognition network are both pre-trained neural network models.
4. The method according to any one of claims 1 to 3, wherein the encoding, through the generative network of the face-swapping model, the face source image and the template image to obtain the encoded features required for face swapping comprises:
    splicing the face source image with the template image to obtain an input image;
    inputting the input image into the face-swapping model; and
    encoding the input image through the generative network of the face-swapping model to obtain the encoded features required for swapping the face in the template image.
5. The method according to any one of claims 1 to 4, wherein the fusing the encoded features with the combined features to obtain fused features comprises:
    calculating the mean and standard deviation of the encoded features, and calculating the mean and standard deviation of the combined features;
    normalizing the encoded features according to the mean and standard deviation of the encoded features to obtain normalized encoded features; and
    transferring the style of the combined features to the normalized encoded features according to the mean and standard deviation of the combined features to obtain the fused features.
6. The method according to any one of claims 1 to 5, wherein the predicting, through the discriminative network of the face-swapping model, image attribute discrimination results for the face-swapped image and the reference image respectively comprises:
    inputting the face-swapped image into the discriminative network of the face-swapping model, and predicting, through the discriminative network, a first probability that the face-swapped image is a non-forged image; and
    inputting the reference image into the discriminative network of the face-swapping model, and predicting, through the discriminative network, a second probability that the reference image is a non-forged image.
7. The method according to any one of claims 1 to 6, wherein, after the face-swapped image is obtained, the method further comprises:
    extracting features from the face-swapped image through the expression recognition network of the face-swapping model to obtain the expression features of the face-swapped image; and
    extracting features from the face-swapped image through the face recognition network of the face-swapping model to obtain the identity features of the face-swapped image;
    wherein the expression recognition network and the face recognition network are both pre-trained neural network models.
8. The method according to any one of claims 1 to 7, wherein the calculating the difference between the expression features of the face-swapped image and the expression features of the template image, calculating the difference between the identity features of the face-swapped image and the identity features of the face source image, and updating the generative network and the discriminative network according to the image attribute discrimination results for the face-swapped image and the reference image, the calculated difference between expression features, and the difference between identity features comprises:
    alternately: with the network parameters of the generative network fixed, constructing a discriminative loss for the discriminative network according to the first probability that the face-swapped image is a non-forged image and the second probability that the reference image is a non-forged image, and updating the network parameters of the discriminative network using the discriminative loss; and
    with the network parameters of the discriminative network fixed, constructing a generation loss of the generative network according to the first probability that the face-swapped image is a non-forged image, constructing an expression loss according to the difference between the expression features of the face-swapped image and the expression features of the template image, constructing an identity loss according to the difference between the identity features of the face-swapped image and the identity features of the face source image, constructing a face-swapping loss for the generative network according to the generation loss, the expression loss, and the identity loss, and updating the network parameters of the generative network using the face-swapping loss;
    until the alternation ends when a training stop condition is met, obtaining the trained discriminative network and generative network.
9. The method according to any one of claims 1 to 8, further comprising:
    performing face keypoint recognition on the template image and the face-swapped image respectively through a pre-trained face keypoint network to obtain their respective face keypoint information; and
    constructing a keypoint loss according to the difference between the face keypoint information of the template image and that of the face-swapped image, the keypoint loss being used to participate in the training of the generative network of the face-swapping model.
10. The method according to any one of claims 1 to 9, further comprising:
    performing image feature extraction on the face-swapped image and the reference image respectively through a pre-trained feature extraction network to obtain their respective image features; and
    constructing a similarity loss according to the difference between the image features of the face-swapped image and those of the reference image, the similarity loss being used to participate in the training of the generative network of the face-swapping model.
11. The method according to any one of claims 1 to 10, further comprising:
    constructing a reconstruction loss according to the pixel-level difference between the face-swapped image and the reference image, the reconstruction loss being used to participate in the training of the generative network of the face-swapping model.
12. The method according to any one of claims 1 to 11, further comprising:
    obtaining a video whose faces are to be swapped and a face source image containing a target face;
    for each video frame of the video, obtaining the expression features of the video frame;
    obtaining the identity features of the face source image containing the target face;
    concatenating the expression features with the identity features to obtain combined features; and
    encoding, through the generative network of the trained face-swapping model, the face source image containing the target face and the video frame to obtain the encoded features required for face swapping, decoding the fused features obtained by fusing the encoded features with the combined features, and outputting a face-swapped video in which the object in the video frame is replaced with the target face.
13. A training apparatus for a face-swapping model, the apparatus comprising:
    an acquisition module, configured to obtain a sample triplet, the sample triplet comprising a face source image, a template image, and a reference image;
    a splicing module, configured to concatenate the expression features of the template image with the identity features of the face source image to obtain combined features;
    a generation module, configured to encode the face source image and the template image through a generative network of the face-swapping model to obtain encoded features required for face swapping, fuse the encoded features with the combined features to obtain fused features, and decode the fused features through the generative network of the face-swapping model to obtain a face-swapped image;
    a discrimination module, configured to predict, through a discriminative network of the face-swapping model, image attribute discrimination results for the face-swapped image and the reference image respectively, the image attributes comprising forged image and non-forged image; and
    an update module, configured to calculate the difference between the expression features of the face-swapped image and the expression features of the template image, calculate the difference between the identity features of the face-swapped image and the identity features of the face source image, and update the generative network and the discriminative network according to the image attribute discrimination results for the face-swapped image and the reference image, the calculated difference between expression features, and the difference between identity features.
14. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the steps of the method according to any one of claims 1 to 12.
15. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 12.
16. A computer program product, comprising computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 12.
PCT/CN2023/124045 2022-11-22 2023-10-11 Training method and apparatus for face swapping model, and device, storage medium and program product WO2024109374A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211468062.6 2022-11-22
CN202211468062.6A CN115565238B (en) 2022-11-22 2022-11-22 Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product

Publications (1)

Publication Number Publication Date
WO2024109374A1 true WO2024109374A1 (en) 2024-05-30

Family

ID=84770880

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/124045 WO2024109374A1 (en) 2022-11-22 2023-10-11 Training method and apparatus for face swapping model, and device, storage medium and program product

Country Status (2)

Country Link
CN (1) CN115565238B (en)
WO (1) WO2024109374A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118366207A (en) * 2024-06-20 2024-07-19 杭州名光微电子科技有限公司 3D face anti-counterfeiting system and method based on deep learning

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115565238B (en) * 2022-11-22 2023-03-28 腾讯科技(深圳)有限公司 Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
CN116229214B (en) * 2023-03-20 2023-12-01 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN116739893A (en) * 2023-08-14 2023-09-12 北京红棉小冰科技有限公司 Face changing method and device
CN117196937B (en) * 2023-09-08 2024-05-14 天翼爱音乐文化科技有限公司 Video face changing method, device and storage medium based on face recognition model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353546A (en) * 2020-03-09 2020-06-30 腾讯科技(深圳)有限公司 Training method and device of image processing model, computer equipment and storage medium
CN111401216A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
CN111523413A (en) * 2020-04-10 2020-08-11 北京百度网讯科技有限公司 Method and device for generating face image
CN111553267A (en) * 2020-04-27 2020-08-18 腾讯科技(深圳)有限公司 Image processing method, image processing model training method and device
CN112766160A (en) * 2021-01-20 2021-05-07 西安电子科技大学 Face replacement method based on multi-stage attribute encoder and attention mechanism
WO2021258920A1 (en) * 2020-06-24 2021-12-30 百果园技术(新加坡)有限公司 Generative adversarial network training method, image face swapping method and apparatus, and video face swapping method and apparatus
CN114387656A (en) * 2022-01-14 2022-04-22 平安科技(深圳)有限公司 Face changing method, device, equipment and storage medium based on artificial intelligence
CN115565238A (en) * 2022-11-22 2023-01-03 腾讯科技(深圳)有限公司 Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705290A (en) * 2021-02-26 2021-11-26 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN115171199B (en) * 2022-09-05 2022-11-18 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353546A (en) * 2020-03-09 2020-06-30 腾讯科技(深圳)有限公司 Training method and device of image processing model, computer equipment and storage medium
CN111401216A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
CN111523413A (en) * 2020-04-10 2020-08-11 北京百度网讯科技有限公司 Method and device for generating face image
CN111553267A (en) * 2020-04-27 2020-08-18 腾讯科技(深圳)有限公司 Image processing method, image processing model training method and device
WO2021258920A1 (en) * 2020-06-24 2021-12-30 百果园技术(新加坡)有限公司 Generative adversarial network training method, image face swapping method and apparatus, and video face swapping method and apparatus
CN112766160A (en) * 2021-01-20 2021-05-07 西安电子科技大学 Face replacement method based on multi-stage attribute encoder and attention mechanism
CN114387656A (en) * 2022-01-14 2022-04-22 平安科技(深圳)有限公司 Face changing method, device, equipment and storage medium based on artificial intelligence
CN115565238A (en) * 2022-11-22 2023-01-03 腾讯科技(深圳)有限公司 Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118366207A (en) * 2024-06-20 2024-07-19 杭州名光微电子科技有限公司 3D face anti-counterfeiting system and method based on deep learning

Also Published As

Publication number Publication date
CN115565238B (en) 2023-03-28
CN115565238A (en) 2023-01-03

Similar Documents

Publication Publication Date Title
WO2024109374A1 (en) Training method and apparatus for face swapping model, and device, storage medium and program product
Lu et al. Image generation from sketch constraint using contextual gan
CN111401216B (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
Zhang et al. Facial expression analysis under partial occlusion: A survey
CN112990054B (en) Compact linguistics-free facial expression embedding and novel triple training scheme
WO2021078157A1 (en) Image processing method and apparatus, electronic device, and storage medium
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN111553267B (en) Image processing method, image processing model training method and device
CN111354079A (en) Three-dimensional face reconstruction network training and virtual face image generation method and device
Zhang et al. Computer models for facial beauty analysis
Tolosana et al. DeepFakes detection across generations: Analysis of facial regions, fusion, and performance evaluation
CN108830237B (en) Facial expression recognition method
CN112800903A (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Cai et al. Semi-supervised natural face de-occlusion
CN107025678A (en) A kind of driving method and device of 3D dummy models
Liu et al. A 3 GAN: an attribute-aware attentive generative adversarial network for face aging
CN113705290A (en) Image processing method, image processing device, computer equipment and storage medium
CN113570684A (en) Image processing method, image processing device, computer equipment and storage medium
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
CN115050064A (en) Face living body detection method, device, equipment and medium
CN115862120B (en) Face action unit identification method and equipment capable of decoupling separable variation from encoder
Agbo-Ajala et al. A lightweight convolutional neural network for real and apparent age estimation in unconstrained face images
CN112101087A (en) Facial image identity de-identification method and device and electronic equipment
CN113705301A (en) Image processing method and device
WO2024059374A1 (en) User authentication based on three-dimensional face modeling using partial face images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23893466

Country of ref document: EP

Kind code of ref document: A1