WO2022166840A1 - Training method for face attribute editing model, face attribute editing method and device - Google Patents

Training method for face attribute editing model, face attribute editing method and device

Info

Publication number: WO2022166840A1 (PCT/CN2022/074742)
Authority: WO - WIPO (PCT)
Prior art keywords: picture, attribute, face, loss, target
Application number: PCT/CN2022/074742
Other languages: English (en), French (fr)
Inventors: 黄嘉彬, 李玉乐, 项伟
Original Assignee: 百果园技术(新加坡)有限公司
Priority date: (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 百果园技术(新加坡)有限公司, 黄嘉彬
Publication of WO2022166840A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/04 - Context-preserving transformations, e.g. by using an importance map
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 - Image coding
    • G06T 9/002 - Image coding using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 - Classification, e.g. identification

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and in particular, to a training method for a face attribute editing model, a face attribute editing method, and a device.
  • Face attribute editing is a technology that changes specific attributes of faces in pictures or videos. For example, through face attribute editing, faces in pictures can be made to look older or younger, male faces can be changed into female faces, or faces can be changed into a celebrity's face.
  • In the related art, a face attribute editing model trained based on an auto-encoder is usually used to edit face attributes, and an adversarial loss is usually used as the loss function for model training.
  • the embodiments of the present application provide a training method for a face attribute editing model, a face attribute editing method, and a device.
  • the technical solution is as follows:
  • an embodiment of the present application provides a training method for a face attribute editing model, the method comprising:
  • inputting the first picture into a picture encoder to obtain coding features output by the picture encoder, where the face in the first picture has a first attribute;
  • inputting the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, where the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute;
  • constructing a target loss function of the first picture decoder based on the second picture and the third picture, where the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
  • training the first picture decoder based on the target loss function, and determining the picture encoder and the trained first picture decoder as the face attribute editing model.
  • an embodiment of the present application provides a face attribute editing method, the method comprising:
  • acquiring a picture to be edited and a target face attribute, where the attribute of the face in the picture to be edited is different from the target face attribute;
  • determining a target face attribute editing model corresponding to the target face attribute, where the target face attribute editing model is composed of a picture encoder and a picture decoder, the picture decoder is trained based on a target loss function, the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
  • inputting the picture to be edited into the target face attribute editing model to obtain a target picture output by the target face attribute editing model, where the face in the target picture has the target face attribute.
  • an embodiment of the present application provides a training device for a face attribute editing model, and the device includes:
  • an encoding module configured to input the first picture into the picture encoder, to obtain the encoding feature output by the picture encoder, and the face in the first picture has a first attribute
  • a decoding module configured to input the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, where the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute;
  • a loss building module configured to construct a target loss function of the first picture decoder based on the second picture and the third picture, the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss Used to constrain the similarity of deep semantic features between images;
  • a training module configured to train the first picture decoder based on the target loss function, and determine the picture encoder and the first picture decoder obtained by training as the face attribute editing model.
  • an embodiment of the present application provides a face attribute editing apparatus, the apparatus includes:
  • an acquisition module configured to acquire the picture to be edited and the attributes of the target face, where the attributes of the face in the picture to be edited are different from the attributes of the target face;
  • a model determination module configured to determine a target face attribute editing model corresponding to the target face attribute, the target face attribute editing model is composed of a picture encoder and a picture decoder, and the picture decoder is based on the target loss function Obtained from training, the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
  • an editing module configured to input the picture to be edited into the target face attribute editing model, and obtain a target picture output by the target face attribute editing model, where the face in the target picture has the target face attribute .
  • an embodiment of the present application provides a computer device, the computer device including a processor and a memory; the memory stores at least one instruction, and the at least one instruction is used to be executed by the processor to implement the training method of the face attribute editing model described in the above aspect, or to implement the face attribute editing method described in the above aspect.
  • an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores at least one instruction, and the at least one instruction is configured to be executed by a processor to implement the training method of the face attribute editing model described in the above aspect, or to implement the face attribute editing method described in the above aspect.
  • an embodiment of the present application provides a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the training method of the face attribute editing model provided in the above aspect, or the face attribute editing method provided in the above aspect.
  • FIG. 1 shows a flowchart of a training method for a face attribute editing model provided by an exemplary embodiment of the present application
  • FIG. 2 is a schematic diagram of the principle of a model training process shown in an exemplary embodiment of the present application
  • FIG. 3 shows a flowchart of a training method for a face attribute editing model provided by another exemplary embodiment of the present application
  • FIG. 4 is a schematic diagram of the implementation of a model training process shown in an exemplary embodiment of the present application.
  • FIG. 5 is a schematic diagram of a network structure of a decoder shown in an exemplary embodiment of the present application.
  • FIG. 6 is a schematic diagram of the implementation of a process for determining an attribute perception loss according to an exemplary embodiment of the present application
  • FIG. 7 shows a flowchart of a method for editing a face attribute provided by an exemplary embodiment of the present application
  • FIG. 8 shows a structural block diagram of a training device for a face attribute editing model provided by an exemplary embodiment of the present application
  • FIG. 9 is a block diagram showing the structure of an apparatus for editing a face attribute provided by an exemplary embodiment of the present application.
  • Autoencoder: an unsupervised learning neural network used to perform representation learning on input information by taking the input information as the learning target, so that the output information of the autoencoder approximates the input information.
  • An autoencoder consists of an encoder (Encoder) and a decoder (Decoder), wherein the encoder is used to extract features from the input information, and the decoder is used to restore the input information based on the features extracted by the encoder.
  • Generative Adversarial Network (GAN): a deep learning model for unsupervised learning on complex distributions, consisting of a generative model (Generative Model) and a discriminative model (Discriminative Model), where the generative model is used to generate an image based on an original image, and the discriminative model is used to determine whether an image is an original image or a generated image.
  • the training process of a GAN is a game between the generative model and the discriminative model: the training goal of the generative model is that the discriminative model cannot distinguish original images from generated images, while the training goal of the discriminative model is to accurately distinguish them.
  • the training method for a face attribute editing model and the face attribute editing method provided by the embodiments of the present application are suitable for a face attribute editing scene.
  • the following uses two typical application scenarios as examples for description.
  • When applying the method provided by the embodiments of the present application to a picture editing scene, the developer first trains face attribute editing models for editing different face attributes, based on the face attributes to be edited. For example, the developer trains a face attribute editing model for making faces look older, one for making faces look younger, one for changing male faces into female faces, and one for changing female faces into male faces. After the training of a face attribute editing model is completed, the model can be deployed on the server side, and a model calling interface can be set on the application side.
  • When editing face attributes, the user uploads the face picture to be edited through the application and selects the face attribute to be edited; the application then uploads the face picture and the face attribute to the application's background server through the model calling interface. The background server obtains the face attribute editing model matching the face attribute to be edited, uses the face picture as the model input to obtain the edited face picture output by the model, and feeds the edited face picture back to the application, which then displays it.
  • Similar to the picture editing scene, when the method provided by this embodiment of the present application is applied to a video editing scene, the developer also needs to train face attribute editing models for editing different face attributes.
  • In one possible implementation, when a celebrity face-swapping function needs to be implemented, the developer pre-trains corresponding face attribute editing models based on pictures of different celebrities, deploys the models on the server side, and sets a model calling interface on the application side.
  • When editing face attributes, the user uploads the video to be edited through the application and selects the celebrity whose face is to be swapped in; the application then uploads the video to be edited and the celebrity identifier to the application's background server through the model calling interface.
  • The background server determines the matching face attribute editing model according to the celebrity identifier, uses each video frame of the video to be edited as model input to obtain the edited video frames output by the model (in which the face is changed to the celebrity's face), generates an edited video based on the edited frames, and feeds the edited video back to the application, which then displays the video.
  • Of course, in addition to the above scenarios, the training method of the face attribute editing model and the face attribute editing method provided by the embodiments of the present application can also be applied to other scenarios in which face attributes need to be edited, and can edit face attributes other than those in the above examples, which is not limited by the embodiments of the present application.
  • The training method of the face attribute editing model provided by the embodiments of the present application can be applied to computer devices with strong data processing capabilities, such as personal computers, workstations, and servers. The face attribute editing method provided by the embodiments of the present application can be applied to electronic devices such as smartphones and tablet computers (for example, deploying the trained face attribute editing model in a smartphone, so that face attribute editing is performed locally), and can also be applied to computer devices such as personal computers, workstations, and servers (for example, deploying the trained model on a server, so that the server provides face attribute editing services for applications).
  • For ease of description, the following embodiments are described by taking the application of the training method of the face attribute editing model and the face attribute editing method to a computer device as an example.
  • FIG. 1 shows a flowchart of a training method for a face attribute editing model provided by an exemplary embodiment of the present application.
  • the method may include the following steps:
  • Step 101 Input the first picture into the picture encoder to obtain the coding feature output by the picture encoder, and the face in the first picture has the first attribute.
  • In one possible implementation, when a face attribute editing model for changing a first attribute of a face into a second attribute needs to be trained, the developer needs to prepare a training data set in advance, where the pictures in the training data set contain human faces having the first attribute.
  • For example, when the first attribute is adult and the second attribute is child (i.e., making the face look younger), the pictures in the training data set are all adult face pictures; when the first attribute is male and the second attribute is female, the pictures in the training data set are all male face pictures.
  • the computer equipment will input the first picture in the training data set as a training sample into the picture encoder, and the picture encoder will perform feature extraction on the first picture to obtain coding features.
  • the picture encoder adopts a convolutional neural network, which is used to perform feature extraction on the first picture through several convolutional layers.
  • In one possible implementation, in order to improve the quality of encoding and decoding, and thus the quality of attribute editing, the picture encoder adopts a down-sampling convolutional neural network that outputs structural coding features (structure) and style coding features (texture) through a series of convolutional layers and fully connected layers.
  • The structural coding features are used to represent the spatial structure of the image (which helps improve the quality of background reconstruction and handle occlusions in the picture), while the style coding features are used to represent the style of the image.
  • The structural coding feature is a feature map S of height H, width W, and channel count C, and the style coding feature is a feature vector.
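  • The following is a minimal PyTorch sketch of such a two-headed down-sampling encoder. Only the split into a spatial structure map and a style vector (and the 2048-dimensional style size shown in FIG. 5) come from this publication; the class name, layer counts, and channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PictureEncoder(nn.Module):
    """Down-sampling CNN emitting a structural feature map and a style vector.

    A sketch: the structure/style split follows the publication, while the
    concrete layer sizes are assumptions.
    """
    def __init__(self):
        super().__init__()
        # Stride-2 convolutions progressively downsample the input picture.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Structural branch: keeps the spatial layout as an H x W x C map.
        self.structure_head = nn.Conv2d(256, 32, 3, padding=1)
        # Style branch: global pooling plus a fully connected layer.
        self.style_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 2048)
        )

    def forward(self, x):
        h = self.backbone(x)
        return self.structure_head(h), self.style_head(h)
```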
  • Illustratively, as shown in FIG. 2, after the computer device inputs the first picture 21 into the picture encoder 22, the picture encoder 22 encodes the first picture 21 to obtain the coding features 221.
  • Step 102: Input the coding features into the first picture decoder and the second picture decoder respectively, to obtain the second picture output by the first picture decoder and the third picture output by the second picture decoder; the face in the second picture has the second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute.
  • the computer device decodes the encoded features through the first picture decoder and the second picture decoder, respectively.
  • In this embodiment of the present application, the first picture decoder is a decoder for editing face attributes, while the second picture decoder is a decoder for face reconstruction.
  • Therefore, the second picture obtained by inputting the coding features into the first picture decoder has face attributes different from those of the first picture, while the third picture obtained by inputting the coding features into the second picture decoder has the same face attributes as the first picture.
  • Optionally, the first attribute of the face in the first picture and the second attribute of the face in the second picture may be different attribute values of the same attribute type; for example, the attribute types corresponding to the first attribute and the second attribute are both age, with the first attribute being adult and the second attribute being child, or the attribute types corresponding to the first attribute and the second attribute are both gender, with the first attribute being male and the second attribute being female.
  • the computer device inputs the coding feature 221 into the first picture decoder 23 to obtain a second picture 231 decoded by the first picture decoder 23; and inputs the coding feature 221 into the second picture decoder 24 , to obtain the third picture 241 decoded by the second picture decoder 24 .
  • Step 103: Based on the second picture and the third picture, construct the target loss function of the first picture decoder.
  • The target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures.
  • Unlike the related art, which uses only the adversarial loss as the loss function for model training, in this embodiment of the present application the computer device also needs to determine the feature matching loss between pictures, in order to ensure the similarity of deep semantic features between the generated picture and the original picture (i.e., the similarity of face features before and after attribute editing) and to avoid feature loss during editing; the target loss function of the first picture decoder is then determined based on the adversarial loss and the feature matching loss.
  • In one possible implementation, the computer device adopts the idea of a GAN: it determines the adversarial loss of the first picture decoder based on the second picture (the purpose being to make the generated second picture have the second attribute), determines the feature matching loss of the first picture decoder based on the second picture and the third picture (the purpose being to make the deep semantic features of the second picture and the third picture similar), and then fuses the adversarial loss and the feature matching loss to obtain the target loss function of the first picture decoder.
  • Illustratively, as shown in FIG. 2, the computer device determines the adversarial loss 25 based on the second picture 231, determines the feature matching loss 26 based on the second picture 231 and the third picture 241, and then determines the target loss function 27 based on the adversarial loss 25 and the feature matching loss 26.
  • Step 104 Train the first picture decoder based on the target loss function, and determine the picture encoder and the trained first picture decoder as a face attribute editing model.
  • the computer equipment trains the first picture decoder based on the constructed target loss function, and completes the training when the loss converges.
  • the computer device uses a gradient back-propagation algorithm to optimize the parameters of the first picture decoder.
  • After training is completed, the computer device determines the picture encoder and the first picture decoder as the face attribute editing model for editing the face attribute from the first attribute to the second attribute; in subsequent face attribute editing, the picture encoder performs feature extraction, and the first picture decoder performs face attribute editing based on the extracted features.
  • Of course, the computer device can also use different training data sets to train face attribute editing models for editing different face attributes (the picture encoder can be shared), which is not repeated in the embodiments of the present application.
  • the computer device may use the test data set to test the first picture decoder, which is not limited in this embodiment.
  • the computer device trains the first picture decoder 23 based on the target loss function 27 , and determines the first picture decoder 23 and the picture encoder 22 obtained by training as a face attribute editing model.
  • To sum up, when training the face attribute editing model, the first picture decoder is used to perform face attribute editing to obtain the second picture, the second picture decoder is used to perform face reconstruction to obtain the third picture, and the feature matching loss that constrains the similarity of deep semantic features between pictures is used as part of the loss function to train the first picture decoder. Since the similarity of deep semantic features between pictures is considered during training, when the trained first picture decoder performs face attribute editing, it can ensure the consistency of deep features between the generated picture and the original picture, prevent the generated picture from losing important features of the original picture, and help improve the quality of face attribute editing.
  • In one possible implementation, in order to ensure that during face attribute editing, apart from changing the first attribute to the second attribute, the generated picture remains consistent with the other face attributes of the original picture, such as pupil color, bangs type, and whether glasses are worn, the computer device also uses an attribute perception loss for constraining face attributes as part of the target loss function, which is described below using an exemplary embodiment.
  • FIG. 3 shows a flowchart of a training method for a face attribute editing model provided by another exemplary embodiment of the present application.
  • the method may include the following steps:
  • Step 301 Input the first picture into the picture encoder to obtain the coding feature output by the picture encoder, and the face in the first picture has the first attribute.
  • Illustratively, as shown in FIG. 4, when the face in a picture needs to be made younger, during training the computer device inputs the first picture 41 (where the face attribute is adult) into the picture encoder 42, and obtains the structural coding feature map 421 and the style coding feature vector 422 output by the picture encoder 42.
  • Step 302: Input the coding features into the first picture decoder and the second picture decoder respectively, to obtain the second picture output by the first picture decoder and the third picture output by the second picture decoder; the face in the second picture has the second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute.
  • Illustratively, as shown in FIG. 4, the computer device inputs the structural coding feature map 421 and the style coding feature vector 422 into the first picture decoder 43 to obtain the second picture 431 output by the first picture decoder 43, where the attribute of the face in the second picture 431 is child; the computer device also inputs the structural coding feature map 421 and the style coding feature vector 422 into the second picture decoder 44 to obtain the third picture 441 output by the second picture decoder 44, where the attribute of the face in the third picture 441 is likewise adult.
  • Step 303 Determine the adversarial loss based on the second image.
  • In one possible implementation, the computer device adopts a least squares adversarial network (Least Squares GAN, LSGAN), and constrains the second picture through the adversarial loss of the network, so that the generated second picture has the second attribute.
  • In this embodiment of the present application, during model training the computer device sets up a discriminator for distinguishing original pictures (also called real pictures) having the second attribute from generated pictures; that is, the discriminator is used to determine whether a picture with the second attribute is an original picture or a generated picture output by the generator (in this application, the first picture decoder).
  • The discriminator plays a key role in the adversarial loss: it needs to learn during training to distinguish generated pictures from original pictures (both having the second attribute), while the generator competes with the discriminator during training so that the discriminator cannot distinguish original pictures from generated pictures.
  • In some embodiments, the computer device inputs the second picture into the discriminator, obtains the discrimination result output by the discriminator, and determines the adversarial loss based on the discrimination result. The adversarial loss of the generator can be expressed as:
  • Loss_G = (D(G(x)) - 1)^2
  • where G is the generator, D is the discriminator, and x is the first picture. The discrimination result of the discriminator is a value between 0 and 1: a result of 0 indicates that the picture is a generated picture, and a result of 1 indicates that the picture is the original picture.
  • Correspondingly, in the process of training the first picture decoder, the computer device also needs to train the discriminator (using generated pictures with the second attribute and original pictures with the second attribute), where the first picture decoder and the discriminator can be trained alternately. During discriminator training, the adversarial loss of the discriminator can be expressed as:
  • Loss_D = (D(x) - 1)^2 + D(G(x))^2
  • the computer device determines the adversarial loss 45 of the first picture decoder 43 based on the second picture 431 .
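  • The two LSGAN objectives above translate directly into code; the following is a minimal PyTorch sketch (the mean reduction over the batch is an assumption, since the publication does not specify one).

```python
import torch

def lsgan_generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Loss_G = (D(G(x)) - 1)^2: push the discriminator's score on
    # generated pictures toward the "original picture" label 1.
    return ((d_fake - 1) ** 2).mean()

def lsgan_discriminator_loss(d_real: torch.Tensor,
                             d_fake: torch.Tensor) -> torch.Tensor:
    # Loss_D = (D(x) - 1)^2 + D(G(x))^2: score originals as 1, fakes as 0.
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
```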
  • Step 304 Determine the feature matching loss based on the deep semantic features corresponding to the second picture and the third picture respectively.
  • In order that the second picture obtained through attribute editing can still maintain feature similarity with the first picture, and since the third picture is reconstructed from the first picture (and therefore has features similar to it), the computer device obtains the deep semantic features corresponding to the second picture and the third picture respectively, and determines the feature matching loss according to the difference between the two.
  • In one possible implementation, this step may include the following steps.
  • 1. Obtain the first depth feature map generated by the first picture decoder in the process of generating the second picture, and the second depth feature map generated by the second picture decoder in the process of generating the third picture; the first picture decoder and the second picture decoder have the same network structure, and the first depth feature map and the second depth feature map are feature maps output at the same network level.
  • In this embodiment of the present application, since the network structures of the first picture decoder and the second picture decoder are the same, when determining the degree of feature matching between the second picture and the third picture, the computer device extracts a low-resolution feature map carrying deep semantic information from the first picture decoder (i.e., the first depth feature map) and the low-resolution feature map at the same network depth (i.e., output by the same network level) from the second picture decoder (i.e., the second depth feature map), so that the feature matching degree can be determined based on depth feature maps of the same semantic depth, ensuring that after attribute editing the picture still retains, on the low-resolution feature map, features similar to the first picture.
  • the network level used when extracting the first depth feature map and the second depth feature map may be preset by the developer, which is not limited in this embodiment.
  • In one possible implementation, both the first picture decoder and the second picture decoder are composed of a series of residual modules based on Adaptive Instance Normalization (AdaIN), and the upsampling layers adopt transposed convolution layers (Transpose Convolution Layer).
  • Illustratively, the structures of the first picture decoder and the second picture decoder are shown in FIG. 5.
  • Here, each feature map in the structural coding features input to the picture decoder has a size of 8×32, and the style coding feature is a 1×2048 vector.
  • The sizes of the residual modules in the picture decoder are, in order, 32×128, 32×256, 32×384, 32×512, 54×512, 128×512, 256×256, and 512×128.
  • The residual module can be expressed by the formula y = F(x) + x, where F(·) denotes a convolutional transformation, x is the input of the residual module, and y is its output. The residual module allows the input feature x to be reused and provides a shortcut for x in the back-propagation process of parameter optimization, making a neural network with residual modules easier to train.
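  • The following is a hypothetical PyTorch sketch of one such AdaIN-based residual block, in which the style code predicts the per-channel scale and bias applied after instance normalization; the layer sizes and the (1 + scale) parameterization are illustrative assumptions rather than details taken from this publication.

```python
import torch
import torch.nn as nn

class AdaINResBlock(nn.Module):
    """Residual block y = F(x) + x, with F conditioned on a style code."""
    def __init__(self, channels: int, style_dim: int = 2048):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        # AdaIN: the style vector predicts per-channel scale and bias.
        self.affine = nn.Linear(style_dim, channels * 2)
        self.act = nn.ReLU(inplace=True)

    def adain(self, x, style):
        scale, bias = self.affine(style).chunk(2, dim=1)
        return ((1 + scale[:, :, None, None]) * self.norm(x)
                + bias[:, :, None, None])

    def forward(self, x, style):
        h = self.act(self.adain(self.conv1(x), style))
        h = self.adain(self.conv2(h), style)
        return h + x  # residual shortcut: y = F(x) + x
```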
  • 2. Determine the feature matching loss based on the first depth feature map and the second depth feature map.
  • Further, the computer device determines the feature matching loss of the first picture decoder by comparing the difference between the first depth feature map and the second depth feature map. The feature matching loss can be expressed as:
  • Feature Matching Loss = (x_ - y_)^2
  • where x_ is the first depth feature map and y_ is the second depth feature map; the larger the feature matching loss, the greater the loss or change of features when the first picture decoder performs attribute editing and the worse the feature retention; conversely, the smaller the loss, the better the feature retention.
  • Illustratively, as shown in FIG. 4, the computer device obtains the first depth feature map 432 and the second depth feature map 442 corresponding to the first picture decoder 43 and the second picture decoder 44 respectively, and determines the feature matching loss 46 between the second picture and the third picture based on the first depth feature map 432 and the second depth feature map 442.
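  • As a sketch, the feature matching loss reduces to a mean squared error between the two same-level decoder feature maps (the mean reduction is an assumption):

```python
import torch.nn.functional as F

def feature_matching_loss(first_depth_map, second_depth_map):
    # (x_ - y_)^2 averaged over the two same-level decoder feature maps.
    return F.mse_loss(first_depth_map, second_depth_map)
```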
  • Step 305 Determine the attribute perception loss based on the first image and the second image, where the attribute perception loss is used to constrain the face attributes other than the first attribute and the second attribute.
  • In one possible implementation, before training the first picture decoder, the computer device first trains a face attribute classifier for classifying face attributes, and then uses the trained face attribute classifier to determine the attribute perception loss between the first picture and the second picture.
  • In one possible implementation, the face attribute classifier is composed of a feature extraction layer (made up of several convolutional layers) and a fully connected layer, where the feature extraction layer is used to perform feature extraction on the input picture, and the fully connected layer is used to perform classification based on the extracted features.
  • In one possible implementation, training the face attribute classifier may include the following steps: first, each first sample picture in the training set is input into the face attribute classifier, where each first sample picture contains a corresponding attribute label.
  • the attribute tag is used to indicate the pupil color, lip shape, bangs type, whether to wear glasses, etc. of the face in the first sample picture, and this embodiment does not limit the attribute type.
  • During training, the face attribute classifier extracts the features of the first sample picture through the feature extraction layer and inputs the extracted features into the fully connected layer; the fully connected layer performs full connection processing, and the result is then classified to obtain the sample attribute classification result corresponding to the first sample picture.
  • Further, the computer device determines the attribute classification loss (a cross-entropy loss) of the face attribute classifier by using the attribute labels as supervision for the sample attribute classification results, trains the face attribute classifier based on the attribute classification loss, and finally obtains a face attribute classifier that can accurately identify face attributes.
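  • A minimal PyTorch sketch of such a classifier and its supervised pre-training objective follows; the concrete architecture and the single-label cross-entropy formulation are illustrative assumptions (a multi-label attribute setup would use a per-attribute loss instead).

```python
import torch
import torch.nn as nn

class FaceAttributeClassifier(nn.Module):
    """Feature extraction layer (several conv layers) plus a fully connected layer."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(  # feature extraction layer
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, num_classes)  # fully connected layer

    def forward(self, x):
        fmap = self.features(x)  # last-conv feature map, reused below for
                                 # the attribute perception loss
        logits = self.fc(self.pool(fmap).flatten(1))  # pre-softmax result
        return fmap, logits

# Attribute classification loss: cross-entropy supervised by attribute labels.
criterion = nn.CrossEntropyLoss()
```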
  • In one possible implementation, determining the attribute perception loss between the first picture and the second picture by the computer device may include the following steps:
  • After inputting the first picture and the second picture into the face attribute classifier respectively, the computer device obtains the feature maps output by the last convolutional layer in the feature extraction layer as the first attribute feature map and the second attribute feature map, respectively.
  • Illustratively, as shown in FIG. 6, the feature extraction layer 51 in the face attribute classifier performs feature extraction on the first picture 41 to obtain the first attribute feature map 52, and performs feature extraction on the second picture 431 to obtain the second attribute feature map 53.
  • The computer device performs full connection processing on the first attribute feature map and the second attribute feature map respectively through the fully connected layer of the face attribute classifier, and obtains the first face attribute classification result corresponding to the first picture and the second face attribute classification result corresponding to the second picture, where the first face attribute classification result and the second face attribute classification result are classification results that have not undergone softmax processing.
  • Illustratively, as shown in FIG. 6, the fully connected layer 54 in the face attribute classifier performs full connection processing on the first attribute feature map 52 and the second attribute feature map 53 respectively, to obtain the first face attribute classification result 55 and the second face attribute classification result 56.
  • Further, the computer device jointly determines the loss between the first attribute feature map and the second attribute feature map, together with the loss between the first face attribute classification result and the second face attribute classification result, as the attribute perception loss of the first picture decoder, where both the loss between the feature maps and the loss between the classification results can be L2 losses.
  • The attribute perception loss (Attribute Perceptual Loss) of the first picture decoder can be expressed as:
  • Attribute Perceptual Loss = (Ext(G(x)) - Ext(x))^2 + (classifier(G(x)) - classifier(x))^2
  • where x is the first picture, Ext(·) is the feature map output by the last convolutional layer of the face attribute classifier, classifier(·) is the face attribute classification result, and G is the generator (consisting of the picture encoder and the first picture decoder).
  • Illustratively, as shown in FIG. 6, the computer device jointly determines the L2 loss between the first attribute feature map 52 and the second attribute feature map 53, together with the L2 loss between the first face attribute classification result 55 and the second face attribute classification result 56, as the attribute perception loss 47 of the first picture decoder.
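  • Assuming a classifier with the interface sketched above (returning the last-conv feature map and the pre-softmax classification result), the attribute perception loss can be sketched as:

```python
import torch.nn.functional as F

def attribute_perception_loss(classifier, first_picture, second_picture):
    # Pass the original and the generated picture through the frozen
    # face attribute classifier.
    fmap_x, logits_x = classifier(first_picture)
    fmap_g, logits_g = classifier(second_picture)
    # L2 loss between feature maps plus L2 loss between classification
    # results: (Ext(G(x))-Ext(x))^2 + (classifier(G(x))-classifier(x))^2.
    return F.mse_loss(fmap_g, fmap_x) + F.mse_loss(logits_g, logits_x)
```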
  • Step 306 Determine the target loss function based on the adversarial loss, the feature matching loss and the attribute perception loss.
  • In one possible implementation, the computer device jointly determines the target loss function based on the adversarial loss, the feature matching loss, and the attribute perception loss, where each of the three losses in the target loss function can have its own loss weight; the loss weights can be hyperparameters set for the training process, which is not limited in this embodiment.
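  • In code, this combination is a weighted sum of the three losses computed above; the weight values below are purely hypothetical placeholders for such hyperparameters.

```python
def target_loss_fn(adv_loss, fm_loss, attr_loss):
    # Hypothetical loss weights (hyperparameters in the publication's terms).
    w_adv, w_fm, w_attr = 1.0, 10.0, 1.0
    return (w_adv * adv_loss        # constrains the second attribute
            + w_fm * fm_loss        # constrains deep semantic features
            + w_attr * attr_loss)   # constrains the remaining face attributes
```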
  • Step 307 Train the first picture decoder based on the target loss function, and determine the picture encoder and the trained first picture decoder as a face attribute editing model.
  • In this embodiment of the present application, the computer device pre-trains the face attribute classifier and uses it to determine the attribute perception loss between the original picture (i.e., the first picture) and the generated picture (i.e., the second picture); the attribute perception loss is then used as part of the target loss function for model training, so that during training the generated picture stays consistent with the original picture in the face attributes other than the first attribute and the second attribute, further improving the quality of face attribute editing.
  • In one possible implementation, before training the first picture decoder, the computer device first trains the picture encoder and the second picture decoder, i.e., pre-trains the autoencoder for face picture reconstruction; in the subsequent process of training the first picture decoder, the picture encoder and the second picture decoder no longer need to be trained.
  • the computer device performs self-encoding on the second sample picture through the picture encoder and the second picture decoder to obtain the sample generated picture, so as to determine the reconstruction loss function based on the sample generated picture and the second sample picture.
  • The computer device uses second sample pictures with diverse face attributes, that is, training is performed with sample pictures of different ages and genders; the reconstruction loss function can be the L1 loss between the sample generated picture and the second sample picture, which is not limited in this implementation.
  • the computer device trains the picture encoder and the second picture decoder based on the reconstruction loss function.
  • the computer device can use the gradient back-propagation algorithm to optimize the parameters of the picture encoder and the second picture decoder.
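  • A minimal sketch of one such pre-training step is shown below, assuming encoder and second_decoder objects with the interfaces sketched earlier; the Adam optimizer and learning rate are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(second_decoder.parameters()), lr=1e-4)

def reconstruction_step(second_sample_picture):
    structure, style = encoder(second_sample_picture)    # self-encoding
    sample_generated = second_decoder(structure, style)
    # Reconstruction loss: L1 between the sample generated picture
    # and the second sample picture.
    loss = F.l1_loss(sample_generated, second_sample_picture)
    optimizer.zero_grad()
    loss.backward()   # gradient back-propagation
    optimizer.step()  # parameter optimization
    return loss.item()
```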
  • FIG. 7 shows a flowchart of a method for editing a face attribute provided by an exemplary embodiment of the present application.
  • the method may include the following steps:
  • Step 701 Obtain the attributes of the image to be edited and the target face, and the attributes of the face in the image to be edited are different from the attributes of the target face.
  • the application program provides several editable face attributes for the user to select, and the editable face attribute selected by the user is the target face attribute.
  • The target face attribute may be a child face, an adult face, a male face, a female face, a specific celebrity face, or the like, which is not limited in this embodiment.
  • the picture to be edited is a single picture, or a video frame in a video.
  • Step 702: Determine the target face attribute editing model corresponding to the target face attribute; the target face attribute editing model is composed of a picture encoder and a picture decoder, the picture decoder is trained based on the target loss function, and the target loss function includes an adversarial loss and a feature matching loss, where the feature matching loss is used to constrain the similarity of deep semantic features between pictures.
  • In one possible implementation, face attribute editing models for editing different face attributes are deployed in the computer device; each face attribute editing model is composed of a picture encoder and a picture decoder, and each face attribute editing model is obtained through the training method of the face attribute editing model provided by the above embodiments.
  • the computer device selects a face attribute editing model for editing the target face attribute as the target face attribute editing model.
  • Optionally, different face attribute editing models may share a picture encoder, while different face attribute editing models correspond to different picture decoders.
  • Step 703 Input the image to be edited into the target face attribute editing model to obtain a target image output by the target face attribute editing model, where the face in the target image has the target face attribute.
  • The computer device takes the picture to be edited as the model input; the picture encoder in the target face attribute editing model encodes the picture to be edited (i.e., performs feature extraction) to obtain the coding features, and the picture decoder decodes based on the coding features (i.e., performs picture reconstruction) to obtain the target picture with the target face attribute.
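  • At inference time the pipeline is therefore a single encode-decode pass; below is a sketch under the interface assumptions used above.

```python
import torch

@torch.no_grad()
def edit_face(picture, encoder, target_decoder):
    # Encoding (feature extraction) followed by decoding (picture
    # reconstruction) with the decoder of the selected editing model.
    structure, style = encoder(picture)
    return target_decoder(structure, style)
```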
  • To sum up, when training the face attribute editing model, the first picture decoder is used to perform face attribute editing to obtain the second picture, the second picture decoder is used to perform face reconstruction to obtain the third picture, and the feature matching loss that constrains the similarity of deep semantic features between pictures is used as part of the loss function to train the first picture decoder. Since the similarity of deep semantic features between pictures is considered during training, when the trained first picture decoder performs face attribute editing, it can ensure the consistency of deep features between the generated picture and the original picture, prevent the generated picture from losing important features of the original picture, and help improve the quality of face attribute editing.
  • FIG. 8 shows a structural block diagram of an apparatus for training a face attribute editing model provided by an exemplary embodiment of the present application.
  • the apparatus may include:
  • the encoding module 801 is configured to input the first picture into the picture encoder to obtain the encoding feature output by the picture encoder, and the face in the first picture has the first attribute;
  • the decoding module 802 is configured to input the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, where the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute;
  • the loss construction module 803 is configured to construct a target loss function of the first picture decoder based on the second picture and the third picture, the target loss function includes an adversarial loss and a feature matching loss, and the feature matching The loss is used to constrain the similarity of deep semantic features between images;
  • the training module 804 is configured to train the first picture decoder based on the target loss function, and determine the picture encoder and the first picture decoder obtained by training as the face attribute editing model.
  • the loss building module 803 includes:
  • a first loss determination unit configured to determine the confrontation loss based on the second picture
  • a second loss determining unit configured to determine the feature matching loss based on the deep semantic features corresponding to the second picture and the third picture respectively;
  • a loss construction unit configured to determine the target loss function based on the adversarial loss and the feature matching loss.
  • the second loss determination unit is configured to: obtain a first depth feature map generated by the first picture decoder in the process of generating the second picture and a second depth feature map generated by the second picture decoder in the process of generating the third picture, where the network structures of the first picture decoder and the second picture decoder are the same, and the first depth feature map and the second depth feature map are feature maps output from the same network level; and determine the feature matching loss based on the first depth feature map and the second depth feature map.
  • the first loss determination unit is set to:
  • the second picture is input into the discriminator, and the discrimination result output by the discriminator is obtained, and the discriminator is used to discriminate the original picture and the generated picture with the second attribute;
  • the adversarial loss is determined based on the discrimination result.
  • the loss building module 803 further includes:
  • a third loss determining unit configured to determine an attribute-aware loss based on the first picture and the second picture, where the attribute-aware loss is used to constrain face attributes other than the first attribute and the second attribute ;
  • the loss building unit set to:
  • the objective loss function is determined based on the adversarial loss, the feature matching loss, and the attribute-aware loss.
  • the third loss determination unit is set to:
  • the first picture and the second picture are respectively input into the feature extraction layer of the face attribute classifier, and the first attribute feature map corresponding to the first picture and the second attribute feature map corresponding to the second picture are obtained. ;
  • the first attribute feature map and the second attribute feature map are respectively input into the fully connected layer of the face attribute classifier, to obtain the first face attribute classification result corresponding to the first picture and the second face attribute classification result corresponding to the second picture; and the attribute perception loss is determined based on the first attribute feature map, the second attribute feature map, the first face attribute classification result, and the second face attribute classification result.
  • the device further includes a classifier training module, and the classifier training module is configured to: input a first sample picture into the face attribute classifier to obtain a sample attribute classification result, where the first sample picture contains a corresponding attribute label; and train the face attribute classifier based on the attribute label and the sample attribute classification result.
  • the device further includes an autoencoder training module, and the autoencoder training module is set to:
  • perform self-encoding on a second sample picture through the picture encoder and the second picture decoder to obtain a sample generated picture; determine a reconstruction loss function based on the sample generated picture and the second sample picture;
  • and train the picture encoder and the second picture decoder based on the reconstruction loss function.
  • the coding features include structural coding features and stylistic coding features, where the structural coding features are used to represent the spatial structural features of the image, and the stylistic coding features are used to represent the style features of the images.
  • To sum up, when training the face attribute editing model, the first picture decoder is used to perform face attribute editing to obtain the second picture, the second picture decoder is used to perform face reconstruction to obtain the third picture, and the feature matching loss that constrains the similarity of deep semantic features between pictures is used as part of the loss function to train the first picture decoder. Since the similarity of deep semantic features between pictures is considered during training, when the trained first picture decoder performs face attribute editing, it can ensure the consistency of deep features between the generated picture and the original picture, prevent the generated picture from losing important features of the original picture, and help improve the quality of face attribute editing.
  • FIG. 9 shows a structural block diagram of an apparatus for editing a face attribute provided by an exemplary embodiment of the present application.
  • the apparatus may include:
  • the obtaining module 901 is configured to obtain the picture to be edited and the attribute of the target face, and the attribute of the face in the picture to be edited is different from the attribute of the target face;
  • the model determination module 902 is configured to determine a target face attribute editing model corresponding to the target face attribute, the target face attribute editing model is composed of a picture encoder and a picture decoder, and the picture decoder is based on the target loss Obtained from function training, the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
  • the editing module 903 is configured to input the picture to be edited into the target face attribute editing model to obtain a target picture output by the target face attribute editing model, and the face in the target picture has the target face Attributes.
  • Embodiments of the present application further provide a computer device, the computer device includes a processor and a memory, the memory stores at least one instruction, and the at least one instruction is used to be executed by the processor to implement the face attribute editing model described in the above embodiments The training method, or, to implement the face attribute editing method described in each of the above embodiments.
  • Embodiments of the present application further provide a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is loaded and executed by a processor to implement the training method of the face attribute editing model described in the above embodiments, or to implement the face attribute editing method described in the above embodiments.
  • Embodiments of the present application further provide a computer program product or computer program, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the training method of the face attribute editing model provided in various optional implementations of the above aspects, Or, perform the face attribute editing method provided in the various optional implementation manners of the above aspect.
  • Computer-readable storage media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A training method for a face attribute editing model, a face attribute editing method, and a device, belonging to the field of artificial intelligence. The method comprises: inputting a first picture into a picture encoder to obtain coding features output by the picture encoder; inputting the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture and a third picture; constructing a target loss function of the first picture decoder based on the second picture and the third picture, the target loss function comprising an adversarial loss and a feature matching loss, the feature matching loss being used to constrain the similarity of deep semantic features between pictures (103); and training the first picture decoder based on the target loss function, and determining the picture encoder and the trained first picture decoder as the face attribute editing model (104).

Description

Training method for face attribute editing model, face attribute editing method and device
This application claims priority to Chinese Patent Application No. 202110143937.4, filed on February 2, 2021 and entitled "Training method for face attribute editing model, face attribute editing method and device", the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of artificial intelligence, and in particular to a training method for a face attribute editing model, a face attribute editing method, and a device.
Background
Face attribute editing is a technology for changing specific attributes of a face in a picture or video. For example, through face attribute editing, a face in a picture can be made to look older or younger, a male face can be changed into a female face, or a face can be changed into a celebrity's face.
In the related art, a face attribute editing model trained based on an auto-encoder is usually used to perform face attribute editing, and when training the face attribute editing model, an adversarial loss is usually used as the loss function.
Summary
The embodiments of the present application provide a training method for a face attribute editing model, a face attribute editing method, and a device. The technical solution is as follows:
In one aspect, an embodiment of the present application provides a training method for a face attribute editing model, the method comprising:
inputting a first picture into a picture encoder to obtain coding features output by the picture encoder, where the face in the first picture has a first attribute;
inputting the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, where the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute;
constructing a target loss function of the first picture decoder based on the second picture and the third picture, where the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
training the first picture decoder based on the target loss function, and determining the picture encoder and the trained first picture decoder as the face attribute editing model.
In another aspect, an embodiment of the present application provides a face attribute editing method, the method comprising:
acquiring a picture to be edited and a target face attribute, where the attribute of the face in the picture to be edited is different from the target face attribute;
determining a target face attribute editing model corresponding to the target face attribute, where the target face attribute editing model is composed of a picture encoder and a picture decoder, the picture decoder is trained based on a target loss function, the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
inputting the picture to be edited into the target face attribute editing model to obtain a target picture output by the target face attribute editing model, where the face in the target picture has the target face attribute.
In another aspect, an embodiment of the present application provides a training apparatus for a face attribute editing model, the apparatus comprising:
an encoding module, configured to input a first picture into a picture encoder to obtain coding features output by the picture encoder, where the face in the first picture has a first attribute;
a decoding module, configured to input the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, where the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute;
a loss construction module, configured to construct a target loss function of the first picture decoder based on the second picture and the third picture, where the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
a training module, configured to train the first picture decoder based on the target loss function, and determine the picture encoder and the trained first picture decoder as the face attribute editing model.
In another aspect, an embodiment of the present application provides a face attribute editing apparatus, the apparatus comprising:
an acquisition module, configured to acquire a picture to be edited and a target face attribute, where the attribute of the face in the picture to be edited is different from the target face attribute;
a model determination module, configured to determine a target face attribute editing model corresponding to the target face attribute, where the target face attribute editing model is composed of a picture encoder and a picture decoder, the picture decoder is trained based on a target loss function, the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
an editing module, configured to input the picture to be edited into the target face attribute editing model to obtain a target picture output by the target face attribute editing model, where the face in the target picture has the target face attribute.
In another aspect, an embodiment of the present application provides a computer device, the computer device comprising a processor and a memory; the memory stores at least one instruction, and the at least one instruction is used to be executed by the processor to implement the training method for a face attribute editing model described in the above aspect, or to implement the face attribute editing method described in the above aspect.
In another aspect, an embodiment of the present application provides a computer-readable storage medium, the storage medium storing at least one instruction, the at least one instruction being used to be executed by a processor to implement the training method for a face attribute editing model described in the above aspect, or to implement the face attribute editing method described in the above aspect.
In another aspect, an embodiment of the present application provides a computer program product or computer program, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the training method for a face attribute editing model provided in the above aspect, or the face attribute editing method provided in the above aspect.
Brief Description of the Drawings
FIG. 1 shows a flowchart of a training method for a face attribute editing model provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of the principle of a model training process shown in an exemplary embodiment of the present application;
FIG. 3 shows a flowchart of a training method for a face attribute editing model provided by another exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of the implementation of a model training process shown in an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of the network structure of a decoder shown in an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of the implementation of a process for determining an attribute perception loss shown in an exemplary embodiment of the present application;
FIG. 7 shows a flowchart of a face attribute editing method provided by an exemplary embodiment of the present application;
FIG. 8 shows a structural block diagram of a training apparatus for a face attribute editing model provided by an exemplary embodiment of the present application;
FIG. 9 shows a structural block diagram of a face attribute editing apparatus provided by an exemplary embodiment of the present application.
Detailed Description
For ease of understanding, the terms involved in the embodiments of the present application are explained below.
Autoencoder: an unsupervised learning neural network used to perform representation learning on input information by taking the input as the learning target, so that the output of the autoencoder approximates its input. An autoencoder consists of an encoder (Encoder) and a decoder (Decoder), where the encoder is used to extract features from the input information, and the decoder is used to restore the input information based on the features extracted by the encoder.
Generative Adversarial Network (GAN): a deep learning model for unsupervised learning on complex distributions, consisting of a generative model (Generative Model) and a discriminative model (Discriminative Model), where the generative model is used to generate an image based on an original image, and the discriminative model is used to determine whether an image is an original image or a generated image. The training process of a GAN is a game between the generative model and the discriminative model: the training goal of the generative model is that the discriminative model cannot distinguish original images from generated images, while the training goal of the discriminative model is to accurately distinguish them.
The training method for a face attribute editing model and the face attribute editing method provided by the embodiments of the present application are suitable for face attribute editing scenarios. Two typical application scenarios are described below as examples.
1. Picture editing scenario
When the method provided by the embodiments of the present application is applied to a picture editing scenario, the developer first trains face attribute editing models for editing different face attributes based on the face attributes to be edited. For example, the developer trains a face attribute editing model for making faces look older, one for making faces look younger, one for changing male faces into female faces, and one for changing female faces into male faces. After the training of a face attribute editing model is completed, the model can be deployed on the server side, and a model calling interface can be set on the application side.
When editing face attributes, the user uploads the face picture to be edited through the application and selects the face attribute to be edited; the application then uploads the face picture and the face attribute to the application's background server through the model calling interface. The background server obtains the face attribute editing model matching the face attribute to be edited, uses the face picture as the model input to obtain the edited face picture output by the model, and feeds the edited face picture back to the application, which then displays it.
2. Video editing scenario
Similar to the picture editing scenario, when the method provided by the embodiments of the present application is applied to a video editing scenario, the developer also needs to train face attribute editing models for editing different face attributes. In one possible implementation, when a celebrity face-swapping function needs to be implemented, the developer pre-trains corresponding face attribute editing models based on pictures of different celebrities, deploys the models on the server side, and sets a model calling interface on the application side.
When editing face attributes, the user uploads the video to be edited through the application and selects the celebrity whose face is to be swapped in; the application then uploads the video to be edited and the celebrity identifier to the application's background server through the model calling interface. The background server determines the matching face attribute editing model according to the celebrity identifier, uses each video frame of the video to be edited as model input to obtain the edited video frames output by the model (in which the face is changed to the celebrity's face), generates an edited video based on the edited frames, and feeds the edited video back to the application, which then displays the video.
Of course, in addition to the above scenarios, the training method for a face attribute editing model and the face attribute editing method provided by the embodiments of the present application can also be applied to other scenarios in which face attributes need to be edited, and can edit face attributes other than those in the above examples, which is not limited by the embodiments of the present application.
The training method for a face attribute editing model provided by the embodiments of the present application can be applied to computer devices with strong data processing capabilities, such as personal computers, workstations, and servers. The face attribute editing method provided by the embodiments of the present application can be applied to electronic devices such as smartphones and tablet computers (for example, deploying the trained face attribute editing model in a smartphone, so that face attribute editing is performed locally), and can also be applied to computer devices such as personal computers, workstations, and servers (for example, deploying the trained model on a server, so that the server provides face attribute editing services for applications). For ease of description, the following embodiments are described by taking the application of the training method for a face attribute editing model and the face attribute editing method to a computer device as an example.
Please refer to FIG. 1, which shows a flowchart of a training method for a face attribute editing model provided by an exemplary embodiment of the present application. The method may include the following steps:
Step 101: Input a first picture into a picture encoder to obtain coding features output by the picture encoder, where the face in the first picture has a first attribute.
In one possible implementation, when a face attribute editing model for changing a first attribute of a face into a second attribute needs to be trained, the developer needs to prepare a training data set in advance, where the pictures in the training data set contain human faces having the first attribute. For example, when the first attribute is adult and the second attribute is child (i.e., making the face look younger), the pictures in the training data set are all adult face pictures; when the first attribute is male and the second attribute is female, the pictures in the training data set are all male face pictures.
在进行模型训练过程中,计算机设备即将训练数据集中的第一图片作为训练样本输入图片编码器,由图片编码器对第一图片进行特征提取,得到编码特征。可选的,该图片编码器采用卷积神经网络,用于通过若干层卷积层对第一图片进行特征提取。
在一种可能的实施方式中,为了提高编码以及解码质量,进而提高属性编辑质量,该图片编码器采用下采样卷积神经网络,用于通过一系列卷积层和全连接层输出结构性编码特征(structure)以及风格性编码特征(texture),结构性编码特征用于表征图像在空间上的结构特征(有助于提高背景重构质量,优化图片中的遮挡情况),风格性编码特征用于表征图像的风格特征。其中,结构性编码特征为高H、宽W、通道C的特征图S H*W*C,而风格性编码特征则为特征向量。
Schematically, as shown in FIG. 2, after the computer device inputs the first picture 21 into the picture encoder 22, the picture encoder 22 encodes the first picture 21 to obtain the coding features 221.
Step 102: input the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, where the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute.
The computer device decodes the coding features through the first picture decoder and the second picture decoder respectively. In the embodiments of the present application, the first picture decoder is the decoder used for face attribute editing, while the second picture decoder is the decoder used for face reconstruction. Therefore, the second picture obtained by feeding the coding features into the first picture decoder has a different face attribute from the first picture, while the third picture obtained by feeding the coding features into the second picture decoder has the same face attribute as the first picture.
Optionally, the first attribute of the face in the first picture and the second attribute of the face in the second picture may be different attribute values of the same attribute type. For example, the attribute type corresponding to both the first attribute and the second attribute is age, with the first attribute being adult and the second attribute being child; or the attribute type corresponding to both is gender, with the first attribute being male and the second attribute being female.
Schematically, as shown in FIG. 2, the computer device inputs the coding features 221 into the first picture decoder 23 to obtain the second picture 231 decoded by the first picture decoder 23, and inputs the coding features 221 into the second picture decoder 24 to obtain the third picture 241 decoded by the second picture decoder 24.
Step 103: construct a target loss function of the first picture decoder based on the second picture and the third picture, where the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures.
Unlike the related art, which performs model training using only the adversarial loss as the loss function, in the embodiments of the present application, in order to ensure, while maintaining the face attribute editing effect, the similarity of deep semantic features between the generated picture after face attribute editing and the original picture (that is, to ensure the similarity of facial features before and after editing and avoid feature loss during editing), the computer device also needs to determine the feature matching loss between pictures, and then determines the target loss function of the first picture decoder based on the adversarial loss and the feature matching loss.
In one possible implementation, following the idea of GANs, the computer device determines the adversarial loss of the first picture decoder based on the second picture (so that the generated second picture has the second attribute), determines the feature matching loss of the first picture decoder based on the second picture and the third picture (so that the deep semantic features of the second picture and the third picture are similar), and then fuses the adversarial loss and the feature matching loss to obtain the target loss function of the first picture decoder.
Schematically, as shown in FIG. 2, the computer device determines the adversarial loss 25 based on the second picture 231, determines the feature matching loss 26 based on the second picture 231 and the third picture 241, and then determines the target loss function 27 based on the adversarial loss 25 and the feature matching loss 26.
Step 104: train the first picture decoder based on the target loss function, and determine the picture encoder and the trained first picture decoder as the face attribute editing model.
Further, the computer device trains the first picture decoder based on the constructed target loss function, completing training when the loss converges. Optionally, during training, the computer device optimizes the parameters of the first picture decoder using a gradient back-propagation algorithm.
After training is completed, the computer device determines the picture encoder and the first picture decoder as the face attribute editing model for editing a face attribute from the first attribute to the second attribute. In subsequent face attribute editing, feature extraction is performed by the picture encoder, and the first picture decoder performs face attribute editing based on the extracted features. Of course, the computer device may also use different training data sets to train face attribute editing models for editing different face attributes (which may share the picture encoder); details are not repeated here in the embodiments of the present application.
Optionally, after the training of the first picture decoder is completed, the computer device may test the first picture decoder using a test data set; this embodiment does not limit this.
Schematically, as shown in FIG. 2, the computer device trains the first picture decoder 23 based on the target loss function 27, and determines the trained first picture decoder 23 and the picture encoder 22 as the face attribute editing model.
To sum up, in the embodiments of the present application, when the face attribute editing model is trained, the first picture decoder performs face attribute editing to obtain the second picture, the second picture decoder performs face reconstruction to obtain the third picture, and the feature matching loss constraining the similarity of deep semantic features between pictures is used as part of the loss function to train the first picture decoder. Since the similarity of deep semantic features between pictures is taken into account during training, when the trained first picture decoder is used for face attribute editing, the consistency of deep features between the generated picture and the original picture can be guaranteed, preventing the generated picture from losing important features of the original picture and helping improve the quality of face attribute editing.
In one possible implementation, in order to ensure that, apart from changing the first attribute into the second attribute during face attribute editing, the other face attributes in the generated picture remain consistent with those in the original picture (such as pupil color, bangs style, and whether glasses are worn), in the embodiments of the present application the computer device also uses an attribute perceptual loss for constraining face attributes as part of the target loss function. This is described below with an exemplary embodiment.
Please refer to FIG. 3, which shows a flowchart of a training method for a face attribute editing model provided by another exemplary embodiment of the present application. The method may include the following steps:
Step 301: input a first picture into a picture encoder to obtain coding features output by the picture encoder, where the face in the first picture has a first attribute.
For the implementation of this step, refer to step 101 above; details are not repeated here.
Schematically, as shown in FIG. 4, when the face in a picture needs to be rejuvenated, during training the computer device inputs the first picture 41 (in which the face attribute is adult) into the picture encoder 42 to obtain the structural coding feature map 421 and the stylistic coding feature vector 422 output by the picture encoder 42.
Step 302: input the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, where the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute.
For the implementation of this step, refer to step 102 above; details are not repeated here.
Schematically, as shown in FIG. 4, the computer device inputs the structural coding feature map 421 and the stylistic coding feature vector 422 into the first picture decoder 43 to obtain the second picture 431 output by the first picture decoder 43, in which the face attribute is child; the computer device inputs the structural coding feature map 421 and the stylistic coding feature vector 422 into the second picture decoder 44 to obtain the third picture 441 output by the second picture decoder 44, in which the face attribute is likewise adult.
Step 303: determine the adversarial loss based on the second picture.
In one possible implementation, the computer device adopts a least squares adversarial network (Least Squares GAN, LSGAN) and constrains the second picture through the adversarial loss of the adversarial network so that the generated second picture has the second attribute.
In the embodiments of the present application, during model training the computer device sets up a discriminator for discriminating between original pictures (also called real pictures) having the second attribute and generated pictures, i.e., for determining whether a picture with the second attribute is an original picture or a generated picture output by the generator (in the present application, the generator comprises the picture encoder and the first picture decoder). The discriminator plays a key role in the adversarial loss and needs to learn during training to distinguish generated pictures from original pictures (both having the second attribute); the generator, in turn, competes against the discriminator during training so that the discriminator cannot distinguish original pictures from generated pictures.
In some embodiments, the computer device inputs the second picture into the discriminator to obtain the discrimination result output by the discriminator, and determines the adversarial loss based on the discrimination result. The adversarial loss of the generator can be expressed as:
Loss_G = (D(G(x)) - 1)^2
where G is the generator, D is the discriminator, and x is the first picture. The discrimination result output by the discriminator is a value between 0 and 1: a result of 0 indicates that the picture is a generated picture, and a result of 1 indicates that the picture is an original picture.
Correspondingly, in the process of training the first picture decoder, the computer device also needs to train the discriminator (using generated pictures having the second attribute and original pictures having the second attribute), and the first picture decoder and the discriminator may be trained alternately. During the training of the discriminator, the adversarial loss of the discriminator can be expressed as:
Loss_D = (D(x) - 1)^2 + D(G(x))^2
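The two least squares losses above can be transcribed directly; in this sketch, `fake` denotes the second picture output by the editing branch and `real` denotes an original picture having the second attribute:

```python
def lsgan_g_loss(D, fake):
    # Loss_G = (D(G(x)) - 1)^2
    return ((D(fake) - 1) ** 2).mean()

def lsgan_d_loss(D, real, fake):
    # Loss_D = (D(x) - 1)^2 + D(G(x))^2; detach so only D is updated
    return ((D(real) - 1) ** 2).mean() + (D(fake.detach()) ** 2).mean()
```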
Schematically, as shown in FIG. 4, the computer device determines the adversarial loss 45 of the first picture decoder 43 based on the second picture 431.
Step 304: determine the feature matching loss based on the deep semantic features corresponding to the second picture and the third picture respectively.
In order for the second picture obtained by attribute editing to retain feature similarity with the first picture, and since the third picture is reconstructed from the first picture (and therefore has features similar to the first picture), in the embodiments of the present application the computer device obtains the deep semantic features corresponding to the second picture and the third picture respectively, and determines the feature matching loss according to the difference between the two sets of deep semantic features. In one possible implementation, this step may include the following steps.
1. Obtain a first deep feature map produced while the first picture decoder generates the second picture, and a second deep feature map produced while the second picture decoder generates the third picture, where the first picture decoder and the second picture decoder have the same network structure, and the first deep feature map and the second deep feature map are feature maps output by the same network level.
In the embodiments of the present application, the first picture decoder and the second picture decoder have the same network structure. Therefore, when determining the degree of feature matching between the second picture and the third picture, the computer device extracts from the first picture decoder one low-resolution feature map carrying deep semantic information (i.e., the first deep feature map), and extracts from the second picture decoder the low-resolution feature map at the same network depth (i.e., output by the same network level, that is, the second deep feature map), so that the degree of feature matching between the second picture and the third picture can be determined based on the first deep feature map and the second deep feature map at the same semantic depth, ensuring that after attribute editing the result still retains features similar to the first picture at the level of the low-resolution feature maps.
The network level used when extracting the first deep feature map and the second deep feature map can be preset by developers; this embodiment does not limit this.
In one possible implementation, both the first picture decoder and the second picture decoder consist of a series of residual modules based on Adaptive Instance Normalization (AdaIN), and the upsampling layers use transpose convolution layers (Transpose Convolution Layer). Schematically, the structure of the first picture decoder and the second picture decoder is shown in FIG. 5.
Each feature map in the structural coding features input to the picture decoder has a size of 8×32, and the stylistic coding feature is a 1×2048 vector. The sizes of the residual modules in the picture decoder are, in order, 32×128, 32×256, 32×384, 32×512, 64×512, 128×512, 256×256, and 512×128.
A residual module can be expressed by the formula y = F(x) + x, where F(·) denotes a convolutional transformation, x is the input of the residual module, and y is the output of the residual module. A residual module allows the input features x to be reused, and provides a shortcut for x during the back-propagation of parameter optimization, which makes neural networks containing residual modules easier to train (an illustrative sketch follows).
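An AdaIN-based residual module of the kind described above might be sketched as follows; the exact modulation scheme is an assumption here (a common choice is to predict a per-channel scale and bias from the style vector), and the 2048-dimensional style input follows the example size above:

```python
import torch.nn as nn

class AdaINResBlock(nn.Module):
    """Residual module y = F(x) + x whose convolutions are modulated
    by AdaIN: the style vector predicts a per-channel scale and bias
    applied after instance normalization."""
    def __init__(self, channels, style_dim=2048):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.style = nn.Linear(style_dim, channels * 2)   # -> (scale, bias)
        self.act = nn.ReLU()

    def adain(self, x, s):
        scale, bias = self.style(s).chunk(2, dim=1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)
        bias = bias.unsqueeze(-1).unsqueeze(-1)
        return (1 + scale) * self.norm(x) + bias

    def forward(self, x, s):
        h = self.act(self.adain(self.conv1(x), s))
        h = self.adain(self.conv2(h), s)
        return h + x                                      # shortcut: y = F(x) + x
```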
2. Determine the feature matching loss based on the first deep feature map and the second deep feature map.
Further, the computer device determines the feature matching loss of the first picture decoder by comparing the difference between the first deep feature map and the second deep feature map, where the feature matching loss can be expressed as:
Feature Matching Loss = (x_ - y_)^2
where x_ is the first deep feature map and y_ is the second deep feature map. The larger the feature matching loss, the greater the degree to which features are lost or changed by the first picture decoder during attribute editing and the worse the feature preservation; conversely, the smaller the loss, the better the feature preservation.
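In code this is simply the mean squared error between the two intermediate feature maps; in the sketch below, `feat_edit` and `feat_recon` stand for the first and second deep feature maps taken at the same network level of the two decoders:

```python
import torch.nn.functional as F

def feature_matching_loss(feat_edit, feat_recon):
    # (x_ - y_)^2, averaged over all elements of the feature maps
    return F.mse_loss(feat_edit, feat_recon)
```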
Schematically, as shown in FIG. 4, the computer device obtains the first deep feature map 432 and the second deep feature map 442 corresponding to the first picture decoder 43 and the second picture decoder 44 respectively, and determines the feature matching loss 46 between the second picture and the third picture based on the first deep feature map 432 and the second deep feature map 442.
Step 305: determine an attribute perceptual loss based on the first picture and the second picture, where the attribute perceptual loss is used to constrain face attributes other than the first attribute and the second attribute.
In the related art, when face attribute editing is performed, besides the target face attribute to be edited, attributes other than the target face attribute may also change or even be lost, affecting the final quality of attribute editing. In order to keep the face attributes other than the target face attribute in the generated picture consistent with the original picture, in the embodiments of the present application, during the training of the first picture decoder, an attribute perceptual loss that constrains the face attributes other than the first attribute and the second attribute is used as part of the target loss function.
To quantify the attribute perceptual loss between pictures, in the embodiments of the present application, before training the first picture decoder, the computer device first trains a face attribute classifier for face attribute classification, and then uses the trained face attribute classifier to determine the attribute perceptual loss between the first picture and the second picture.
Optionally, the face attribute classifier consists of a feature extraction layer (composed of several convolutional layers) and a fully connected layer; the feature extraction layer extracts features from the input picture, and the fully connected layer performs classification based on the extracted features. Training the face attribute classifier may include the following steps (a training sketch is given after this list):
1. Obtain a first sample picture, where the first sample picture contains a corresponding attribute label.
When training the face attribute classifier, a training set needs to be constructed first, in which each first sample picture contains a corresponding attribute label. For example, the attribute label indicates the pupil color, lip shape, and bangs style of the face in the first sample picture, whether glasses are worn, and so on; this embodiment does not limit the attribute types.
2. Input the first sample picture into the face attribute classifier to obtain a sample attribute classification result output by the face attribute classifier.
After the computer device inputs the first sample picture into the face attribute classifier, the face attribute classifier extracts features from the first sample picture through the feature extraction layer, feeds the extracted features into the fully connected layer for fully connected processing, and then classifies the fully connected result to obtain the sample attribute classification result corresponding to the first sample picture.
3. Determine the face attribute classifier based on the attribute labels and the sample attribute classification results.
In one possible implementation, the computer device uses the attribute labels as supervision for the sample attribute classification results to determine the attribute classification loss (a cross-entropy loss) of the face attribute classifier, and trains the face attribute classifier based on the attribute classification loss, finally obtaining a face attribute classifier capable of accurately recognizing face attributes.
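The training step described above might be sketched as follows; `extractor` (the convolutional feature extraction layer), `head` (the fully connected layer), and the labeled batch `pictures`/`labels` are assumed to be defined elsewhere, and the optimizer choice is a placeholder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Sequential(extractor, nn.Flatten(), head)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

logits = classifier(pictures)            # sample attribute classification result
loss = F.cross_entropy(logits, labels)   # supervised by the attribute labels
optimizer.zero_grad(); loss.backward(); optimizer.step()
```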
Correspondingly, the computer device determining the attribute perceptual loss between the first picture and the second picture may include the following steps:
1. Input the first picture and the second picture respectively into the feature extraction layer of the face attribute classifier to obtain a first attribute feature map corresponding to the first picture and a second attribute feature map corresponding to the second picture.
Optionally, after inputting the first picture and the second picture into the face attribute classifier, the computer device takes the feature maps output by the last convolutional layer of the feature extraction layer as the first attribute feature map and the second attribute feature map.
Schematically, as shown in FIG. 6, the feature extraction layer 51 in the face attribute classifier extracts features from the first picture 41 to obtain the first attribute feature map 52, and extracts features from the second picture 431 to obtain the second attribute feature map 53.
2. Input the first attribute feature map and the second attribute feature map respectively into the fully connected layer of the face attribute classifier to obtain a first face attribute classification result corresponding to the first picture and a second face attribute classification result corresponding to the second picture.
Further, the computer device performs fully connected processing on the first attribute feature map and the second attribute feature map through the fully connected layer of the face attribute classifier to obtain the first face attribute classification result corresponding to the first picture and the second face attribute classification result corresponding to the second picture, where the first face attribute classification result and the second face attribute classification result are classification results that have not been processed by softmax.
Schematically, as shown in FIG. 6, the fully connected layer 54 in the face attribute classifier performs fully connected processing on the first attribute feature map 52 and the second attribute feature map 53 respectively, obtaining the first face attribute classification result 55 and the second face attribute classification result 56.
3. Determine the L2 loss between the first attribute feature map and the second attribute feature map, together with the L2 loss between the first face attribute classification result and the second face attribute classification result, as the attribute perceptual loss.
To avoid the one-sidedness of determining the attribute perceptual loss based only on the face attribute classification results, in this embodiment the computer device determines both the loss between the first attribute feature map and the second attribute feature map and the loss between the first face attribute classification result and the second face attribute classification result as the attribute perceptual loss of the first picture decoder, where an L2 loss can be used both between the feature maps and between the classification results.
The attribute perceptual loss (Attribute Perceptual Loss) of the first picture decoder can be expressed as:
Attribute Perceptual Loss = (Ext(G(x)) - Ext(x))^2 + (classifier(G(x)) - classifier(x))^2
where x is the first picture, Ext(·) denotes the feature map output by the last convolutional layer of the face attribute classifier, classifier(·) denotes the face attribute classification result, and G is the generator (comprising the picture encoder and the first picture decoder).
Schematically, as shown in FIG. 6, the computer device determines the L2 loss between the first attribute feature map 52 and the second attribute feature map 53, together with the L2 loss between the first face attribute classification result 55 and the second face attribute classification result 56, as the attribute perceptual loss 47 of the first picture decoder.
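Written out against a frozen classifier, the loss might look as follows; `extractor` and `head` again denote the classifier's feature extraction layer and fully connected layer, and `x`/`g_x` denote the first picture and the generated second picture:

```python
import torch.nn.functional as F

def attribute_perceptual_loss(extractor, head, x, g_x):
    # L2 on the last-conv feature maps plus L2 on the pre-softmax results
    f_real, f_fake = extractor(x), extractor(g_x)
    logit_real = head(f_real.flatten(1))
    logit_fake = head(f_fake.flatten(1))
    return F.mse_loss(f_fake, f_real) + F.mse_loss(logit_fake, logit_real)
```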
Step 306: determine the target loss function based on the adversarial loss, the feature matching loss, and the attribute perceptual loss.
Further, the computer device determines the target loss function jointly from the adversarial loss, the feature matching loss, and the attribute perceptual loss, where in the target loss function the adversarial loss, the feature matching loss, and the attribute perceptual loss may each have their own loss weight; these loss weights may be hyperparameters set during training, and this embodiment does not limit them.
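A weighted combination of this kind can be sketched as below; the weight values are placeholders only (they are unspecified hyperparameters), and the picture encoder and the second picture decoder are assumed to be frozen so that the gradients update only the first picture decoder:

```python
w_adv, w_fm, w_attr = 1.0, 10.0, 1.0   # placeholder hyperparameters
target_loss = w_adv * adv_loss + w_fm * fm_loss + w_attr * attr_loss
target_loss.backward()                  # back-propagate into the first picture decoder
```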
Step 307: train the first picture decoder based on the target loss function, and determine the picture encoder and the trained first picture decoder as the face attribute editing model.
For the implementation of this step, refer to step 104 above; details are not repeated here.
In this embodiment, the computer device pre-trains a face attribute classifier and uses the face attribute classifier to determine the attribute perceptual loss between the original picture (i.e., the first picture) and the generated picture (i.e., the second picture), and then uses this attribute perceptual loss as part of the target loss function for model training, so that during model training the face attributes other than the first attribute and the second attribute remain consistent between the generated picture and the original picture, further improving the quality of face attribute editing.
In one possible implementation, before training the first picture decoder, the computer device first trains the picture encoder and the second picture decoder, that is, it pre-trains the autoencoder used for face picture reconstruction, so that the picture encoder and the second picture decoder no longer need to be trained during the subsequent training of the first picture decoder. Optionally, the computer device auto-encodes a second sample picture through the picture encoder and the second picture decoder to obtain a sample generated picture, and determines a reconstruction loss function based on the sample generated picture and the second sample picture. To ensure training quality, the second sample pictures used by the computer device have face attribute diversity, i.e., sample pictures of different ages and genders are used for training, and the reconstruction loss function may be the L1 loss between the sample generated picture and the second sample picture; this embodiment does not limit this.
Further, the computer device trains the picture encoder and the second picture decoder based on the reconstruction loss function, and may optimize the parameters of the picture encoder and the second picture decoder using a gradient back-propagation algorithm.
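A sketch of this pre-training step follows, assuming the encoder returns the structure/style pair described earlier and that `recon_decoder` accepts that pair (both signatures are assumptions):

```python
import torch.nn.functional as F

structure, texture = encoder(sample)    # auto-encode the second sample picture
recon = recon_decoder(structure, texture)
recon_loss = F.l1_loss(recon, sample)   # L1 reconstruction loss
recon_loss.backward()                   # updates both the encoder and the decoder
```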
Please refer to FIG. 7, which shows a flowchart of a face attribute editing method provided by an exemplary embodiment of the present application. The method may include the following steps:
Step 701: obtain a picture to be edited and a target face attribute, where the attribute of the face in the picture to be edited is different from the target face attribute.
In one possible implementation, the application provides several editable face attributes for the user to choose from, and the editable face attribute selected by the user is the target face attribute. The target face attribute may be a child face, an adult face, a male face, a female face, a specific celebrity face, and so on; this embodiment does not limit this.
Optionally, the picture to be edited is a single picture, or a video frame in a video.
Step 702: determine a target face attribute editing model corresponding to the target face attribute, where the target face attribute editing model consists of a picture encoder and a picture decoder, the picture decoder is trained based on a target loss function, the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures.
In one possible implementation, face attribute editing models for editing different face attributes are deployed in the computer device; each face attribute editing model consists of a picture encoder and a picture decoder, and each face attribute editing model is trained using the training method for a face attribute editing model provided by the above embodiments. Correspondingly, the computer device selects the face attribute editing model used for editing the target face attribute as the target face attribute editing model.
Optionally, different face attribute editing models may share the picture encoder while corresponding to different picture decoders.
Step 703: input the picture to be edited into the target face attribute editing model to obtain a target picture output by the target face attribute editing model, where the face in the target picture has the target face attribute.
Further, the computer device feeds the picture to be edited into the model: the picture encoder in the target face attribute editing model encodes the picture to be edited (i.e., performs feature extraction) to obtain coding features, and the picture decoder decodes based on the coding features (i.e., performs picture reconstruction) to obtain the target picture having the target face attribute.
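This encode-then-decode inference pass can be sketched as below; the decoder signature (taking the structure map and the style vector) is an assumption carried over from the sketches above:

```python
import torch

@torch.no_grad()
def edit_face(image, encoder, attr_decoder):
    # Encode the picture, then decode with the decoder trained
    # for the requested target face attribute.
    structure, texture = encoder(image)
    return attr_decoder(structure, texture)
```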
To sum up, in the embodiments of the present application, when the face attribute editing model is trained, the first picture decoder performs face attribute editing to obtain the second picture, the second picture decoder performs face reconstruction to obtain the third picture, and the feature matching loss constraining the similarity of deep semantic features between pictures is used as part of the loss function to train the first picture decoder. Since the similarity of deep semantic features between pictures is taken into account during training, when the trained first picture decoder is used for face attribute editing, the consistency of deep features between the generated picture and the original picture can be guaranteed, preventing the generated picture from losing important features of the original picture and helping improve the quality of face attribute editing.
Please refer to FIG. 8, which shows a structural block diagram of a training apparatus for a face attribute editing model provided by an exemplary embodiment of the present application. The apparatus may include:
an encoding module 801, configured to input a first picture into a picture encoder to obtain coding features output by the picture encoder, where the face in the first picture has a first attribute;
a decoding module 802, configured to input the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, where the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute;
a loss construction module 803, configured to construct a target loss function of the first picture decoder based on the second picture and the third picture, where the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
a training module 804, configured to train the first picture decoder based on the target loss function, and determine the picture encoder and the trained first picture decoder as the face attribute editing model.
Optionally, the loss construction module 803 includes:
a first loss determination unit, configured to determine the adversarial loss based on the second picture;
a second loss determination unit, configured to determine the feature matching loss based on the deep semantic features corresponding to the second picture and the third picture respectively;
a loss construction unit, configured to determine the target loss function based on the adversarial loss and the feature matching loss.
Optionally, the second loss determination unit is configured to:
obtain a first deep feature map produced while the first picture decoder generates the second picture, and a second deep feature map produced while the second picture decoder generates the third picture, where the first picture decoder and the second picture decoder have the same network structure, and the first deep feature map and the second deep feature map are feature maps output by the same network level;
determine the feature matching loss based on the first deep feature map and the second deep feature map.
Optionally, the first loss determination unit is configured to:
input the second picture into a discriminator to obtain a discrimination result output by the discriminator, where the discriminator is used to discriminate between original pictures having the second attribute and generated pictures;
determine the adversarial loss based on the discrimination result.
Optionally, the loss construction module 803 further includes:
a third loss determination unit, configured to determine an attribute perceptual loss based on the first picture and the second picture, where the attribute perceptual loss is used to constrain face attributes other than the first attribute and the second attribute;
the loss construction unit is configured to:
determine the target loss function based on the adversarial loss, the feature matching loss, and the attribute perceptual loss.
Optionally, the third loss determination unit is configured to:
input the first picture and the second picture respectively into the feature extraction layer of a face attribute classifier to obtain a first attribute feature map corresponding to the first picture and a second attribute feature map corresponding to the second picture;
input the first attribute feature map and the second attribute feature map respectively into the fully connected layer of the face attribute classifier to obtain a first face attribute classification result corresponding to the first picture and a second face attribute classification result corresponding to the second picture;
determine the L2 loss between the first attribute feature map and the second attribute feature map, together with the L2 loss between the first face attribute classification result and the second face attribute classification result, as the attribute perceptual loss.
Optionally, the apparatus further includes a classifier training module, configured to:
obtain a first sample picture, where the first sample picture contains a corresponding attribute label;
input the first sample picture into the face attribute classifier to obtain a sample attribute classification result output by the face attribute classifier;
determine the face attribute classifier based on the attribute label and the sample attribute classification result.
Optionally, the apparatus further includes an autoencoder training module, configured to:
auto-encode a second sample picture through the picture encoder and the second picture decoder to obtain a sample generated picture;
determine a reconstruction loss function based on the sample generated picture and the second sample picture;
train the picture encoder and the second picture decoder based on the reconstruction loss function.
Optionally, the coding features include structural coding features and stylistic coding features, where the structural coding features are used to characterize the spatial structure of the image, and the stylistic coding features are used to characterize the style of the image.
To sum up, in the embodiments of the present application, when the face attribute editing model is trained, the first picture decoder performs face attribute editing to obtain the second picture, the second picture decoder performs face reconstruction to obtain the third picture, and the feature matching loss constraining the similarity of deep semantic features between pictures is used as part of the loss function to train the first picture decoder. Since the similarity of deep semantic features between pictures is taken into account during training, when the trained first picture decoder is used for face attribute editing, the consistency of deep features between the generated picture and the original picture can be guaranteed, preventing the generated picture from losing important features of the original picture and helping improve the quality of face attribute editing.
Please refer to FIG. 9, which shows a structural block diagram of a face attribute editing apparatus provided by an exemplary embodiment of the present application. The apparatus may include:
an acquisition module 901, configured to obtain a picture to be edited and a target face attribute, where the attribute of the face in the picture to be edited is different from the target face attribute;
a model determination module 902, configured to determine a target face attribute editing model corresponding to the target face attribute, where the target face attribute editing model consists of a picture encoder and a picture decoder, the picture decoder is trained based on a target loss function, the target loss function includes an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
an editing module 903, configured to input the picture to be edited into the target face attribute editing model to obtain a target picture output by the target face attribute editing model, where the face in the target picture has the target face attribute.
It should be noted that when the apparatus provided by the above embodiments implements its functions, the division into the above functional modules is only used as an example; in practical applications, the above functions can be assigned to different functional modules as needed, i.e., the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments provided above belong to the same concept as the method embodiments; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
An embodiment of the present application further provides a computer device, which includes a processor and a memory; the memory stores at least one instruction, and the at least one instruction is executed by the processor to implement the training method for a face attribute editing model described in the above embodiments, or to implement the face attribute editing method described in the above embodiments.
An embodiment of the present application further provides a computer-readable storage medium storing at least one instruction, where the at least one instruction is loaded and executed by a processor to implement the training method for a face attribute editing model described in the above embodiments, or to implement the face attribute editing method described in the above embodiments.
According to one aspect of the present application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the training method for a face attribute editing model provided in the various optional implementations of the above aspects, or to perform the face attribute editing method provided in the various optional implementations of the above aspects.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the embodiments of the present application can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable storage medium or transmitted as one or more instructions or code on a computer-readable storage medium. Computer-readable storage media include computer storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium that a general-purpose or special-purpose computer can access.
The above are only optional embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, etc., made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (15)

  1. A training method for a face attribute editing model, the method comprising:
    inputting a first picture into a picture encoder to obtain coding features output by the picture encoder, wherein the face in the first picture has a first attribute;
    inputting the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, wherein the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute;
    constructing a target loss function of the first picture decoder based on the second picture and the third picture, wherein the target loss function comprises an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
    training the first picture decoder based on the target loss function, and determining the picture encoder and the trained first picture decoder as the face attribute editing model.
  2. The method according to claim 1, wherein constructing the target loss function of the first picture decoder based on the second picture and the third picture comprises:
    determining the adversarial loss based on the second picture;
    determining the feature matching loss based on the deep semantic features corresponding to the second picture and the third picture respectively;
    determining the target loss function based on the adversarial loss and the feature matching loss.
  3. The method according to claim 2, wherein determining the feature matching loss based on the deep semantic features corresponding to the second picture and the third picture respectively comprises:
    obtaining a first deep feature map produced while the first picture decoder generates the second picture, and a second deep feature map produced while the second picture decoder generates the third picture, wherein the first picture decoder and the second picture decoder have the same network structure, and the first deep feature map and the second deep feature map are feature maps output by the same network level;
    determining the feature matching loss based on the first deep feature map and the second deep feature map.
  4. The method according to claim 2, wherein determining the adversarial loss based on the second picture comprises:
    inputting the second picture into a discriminator to obtain a discrimination result output by the discriminator, wherein the discriminator is used to discriminate between original pictures having the second attribute and generated pictures;
    determining the adversarial loss based on the discrimination result.
  5. The method according to claim 2, wherein constructing the target loss function of the first picture decoder based on the second picture and the third picture further comprises:
    determining an attribute perceptual loss based on the first picture and the second picture, wherein the attribute perceptual loss is used to constrain face attributes other than the first attribute and the second attribute;
    and wherein determining the target loss function based on the adversarial loss and the feature matching loss comprises:
    determining the target loss function based on the adversarial loss, the feature matching loss, and the attribute perceptual loss.
  6. The method according to claim 5, wherein determining the attribute perceptual loss based on the first picture and the second picture comprises:
    inputting the first picture and the second picture respectively into a feature extraction layer of a face attribute classifier to obtain a first attribute feature map corresponding to the first picture and a second attribute feature map corresponding to the second picture;
    inputting the first attribute feature map and the second attribute feature map respectively into a fully connected layer of the face attribute classifier to obtain a first face attribute classification result corresponding to the first picture and a second face attribute classification result corresponding to the second picture;
    determining the L2 loss between the first attribute feature map and the second attribute feature map, together with the L2 loss between the first face attribute classification result and the second face attribute classification result, as the attribute perceptual loss.
  7. The method according to claim 6, wherein the method further comprises:
    obtaining a first sample picture, wherein the first sample picture contains a corresponding attribute label;
    inputting the first sample picture into the face attribute classifier to obtain a sample attribute classification result output by the face attribute classifier;
    determining the face attribute classifier based on the attribute label and the sample attribute classification result.
  8. The method according to any one of claims 1 to 7, wherein before inputting the first picture into the picture encoder to obtain the coding features output by the picture encoder, the method further comprises:
    auto-encoding a second sample picture through the picture encoder and the second picture decoder to obtain a sample generated picture;
    determining a reconstruction loss function based on the sample generated picture and the second sample picture;
    training the picture encoder and the second picture decoder based on the reconstruction loss function.
  9. The method according to any one of claims 1 to 7, wherein the coding features comprise structural coding features and stylistic coding features, the structural coding features being used to characterize the spatial structure of the image, and the stylistic coding features being used to characterize the style of the image.
  10. A face attribute editing method, the method comprising:
    obtaining a picture to be edited and a target face attribute, wherein the attribute of the face in the picture to be edited is different from the target face attribute;
    determining a target face attribute editing model corresponding to the target face attribute, wherein the target face attribute editing model consists of a picture encoder and a picture decoder, the picture decoder is trained based on a target loss function, the target loss function comprises an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
    inputting the picture to be edited into the target face attribute editing model to obtain a target picture output by the target face attribute editing model, wherein the face in the target picture has the target face attribute.
  11. A training apparatus for a face attribute editing model, the apparatus comprising:
    an encoding module, configured to input a first picture into a picture encoder to obtain coding features output by the picture encoder, wherein the face in the first picture has a first attribute;
    a decoding module, configured to input the coding features into a first picture decoder and a second picture decoder respectively, to obtain a second picture output by the first picture decoder and a third picture output by the second picture decoder, wherein the face in the second picture has a second attribute, the face in the third picture has the first attribute, and the first attribute is different from the second attribute;
    a loss construction module, configured to construct a target loss function of the first picture decoder based on the second picture and the third picture, wherein the target loss function comprises an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
    a training module, configured to train the first picture decoder based on the target loss function, and determine the picture encoder and the trained first picture decoder as the face attribute editing model.
  12. A face attribute editing apparatus, the apparatus comprising:
    an acquisition module, configured to obtain a picture to be edited and a target face attribute, wherein the attribute of the face in the picture to be edited is different from the target face attribute;
    a model determination module, configured to determine a target face attribute editing model corresponding to the target face attribute, wherein the target face attribute editing model consists of a picture encoder and a picture decoder, the picture decoder is trained based on a target loss function, the target loss function comprises an adversarial loss and a feature matching loss, and the feature matching loss is used to constrain the similarity of deep semantic features between pictures;
    an editing module, configured to input the picture to be edited into the target face attribute editing model to obtain a target picture output by the target face attribute editing model, wherein the face in the target picture has the target face attribute.
  13. A computer device, comprising a processor and a memory; the memory stores at least one instruction, and the at least one instruction is executed by the processor to implement the training method for a face attribute editing model according to any one of claims 1 to 9, or to implement the face attribute editing method according to claim 10.
  14. A computer-readable storage medium storing at least one instruction, wherein the at least one instruction is executed by a processor to implement the training method for a face attribute editing model according to any one of claims 1 to 9, or to implement the face attribute editing method according to claim 10.
  15. A computer program product, comprising computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the training method for a face attribute editing model according to any one of claims 1 to 9, or to perform the face attribute editing method according to claim 10.