WO2022135490A1 - Face image synthesis method and system, electronic device, and storage medium - Google Patents

Face image synthesis method and system, electronic device, and storage medium Download PDF

Info

Publication number
WO2022135490A1
WO2022135490A1 PCT/CN2021/140563 CN2021140563W WO2022135490A1 WO 2022135490 A1 WO2022135490 A1 WO 2022135490A1 CN 2021140563 W CN2021140563 W CN 2021140563W WO 2022135490 A1 WO2022135490 A1 WO 2022135490A1
Authority
WO
WIPO (PCT)
Prior art keywords
face image
face
encoder
original
loss
Prior art date
Application number
PCT/CN2021/140563
Other languages
English (en)
French (fr)
Inventor
李安
李玉乐
项伟
Original Assignee
百果园技术(新加坡)有限公司
李安
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司, 李安
Publication of WO2022135490A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • the embodiments of the present application relate to the field of computer technologies, and in particular, to a method, system, electronic device, and storage medium for synthesizing a face image.
  • face image synthesis based on two face images is an interesting and challenging technology. For example, by inputting a male face image and a female face image, the face images of the future children of the two are synthesized.
  • when synthesizing face images, existing face image synthesis algorithms take little account of features such as race and skin color, so their feature fusion is relatively poor and the result lacks authenticity.
  • moreover, they place relatively high demands on the quality of the images to be synthesized; for the synthesis of low-quality images, the quality of the output image is relatively poor.
  • the embodiments of the present application provide a face image synthesis method, system, electronic device and storage medium, which can improve the authenticity of face image synthesis, improve the image synthesis quality, and optimize the face image synthesis effect.
  • an embodiment of the present application provides a method for synthesizing a face image, including:
  • the generator includes a first encoder, a second encoder and a decoder
  • the first encoder extracts the skin color information and face feature information of the first original face image based on a first set weight, converts them into a plurality of first encoding vectors, and inputs the first encoding vectors into the decoder;
  • the second encoder extracts the skin color information and face feature information of the second original face image based on a second set weight, converts them into a plurality of second encoding vectors, and inputs the second encoding vectors into the decoder; the proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and the proportion of face feature information extracted under the second set weight is greater than that under the first set weight;
  • the decoder performs face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on a multi-channel affine transformation module and a style transfer module, and introduces random noise to generate a face composite image corresponding to the first original face image and the second original face image.
  • the training process of the generative model includes: training the generator with two training sample images as model input and the face composite image of the training sample images as model output; and using the discriminator of the generative model to verify the face composite image, adjusting the generator's face attribute synthesis parameters according to the loss functions until the loss functions converge;
  • the loss function corresponding to the first encoder includes a generative adversarial network loss, a face feature loss and a coding vector distance loss;
  • the loss function corresponding to the second encoder includes a generative adversarial network loss, a skin color loss and a coding vector distance loss.
  • the generative adversarial network loss includes a generator loss and a discriminator loss, calculated as:
  • Loss_G = E((D(G(x)) - 1)^2)
  • Loss_D = E((D(x) - 1)^2 + D(G(x))^2)
  • D is the discriminator
  • G is the generator
  • x is the feature map input by the model
  • E is the mean value
  • G(x) is the synthetic face image generated by the generator
  • D(x) is the discriminator's verification result for the face composite image
  • Loss_G represents the corresponding generator loss
  • Loss_D represents the discriminator loss.
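  • purely as an illustration of the least-squares losses above, the following PyTorch-style sketch realizes E(.) as a batch mean; D, G and x are placeholders, and the detach() call is a standard training detail not stated in the source:

```python
import torch

def lsgan_losses(D, G, x):
    """Least-squares GAN losses per the formulas above; E(.) is the batch mean.

    D and G are placeholder torch modules, x is a batch of input feature maps.
    """
    fake = G(x)
    # Loss_G = E((D(G(x)) - 1)^2): push synthesized faces toward the "real" label 1
    loss_g = torch.mean((D(fake) - 1) ** 2)
    # Loss_D = E((D(x) - 1)^2 + D(G(x))^2): real samples -> 1, synthesized -> 0
    # (detach() stops the discriminator loss from updating the generator)
    loss_d = torch.mean((D(x) - 1) ** 2 + D(fake.detach()) ** 2)
    return loss_g, loss_d
```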
  • the coding vector distance loss is calculated as wLoss = E((w - w_mean)^2), where:
  • w is the coding vector output by the first encoder or the second encoder
  • w_mean is the mean value of the coding vector in the decoder
  • the face feature loss is determined based on a face recognition network and is calculated as idLoss = E(cosin(Facenet(x), Facenet(G(x)))), where:
  • idLoss represents the loss of facial features
  • E represents the mean value
  • Facenet represents the face recognition network
  • x represents the feature map input by the model
  • G represents the generator
  • G(x) represents the synthetic face image generated by the generator.
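  • the two encoder-specific losses above can be sketched the same way; this is a hedged illustration in which `facenet` is a placeholder embedding network, and the literal formula idLoss = E(cosin(...)) is kept as written, although in practice one would typically minimize 1 - cos to increase similarity:

```python
import torch
import torch.nn.functional as F

def w_distance_loss(w, w_mean):
    # wLoss = E((w - w_mean)^2): keeps encoder outputs near the decoder's
    # average coding vector, which aids generalization on low-quality inputs.
    return torch.mean((w - w_mean) ** 2)

def id_loss(facenet, x, g_x):
    # idLoss = E(cosin(Facenet(x), Facenet(G(x)))) as written in the source;
    # `facenet` is a placeholder face-recognition embedding network.
    cos = F.cosine_similarity(facenet(x), facenet(g_x), dim=-1)
    return torch.mean(cos)
```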
  • the decoder performing face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on the multi-channel affine transformation module and the style transfer module, and introducing random noise to generate the face composite image corresponding to the first original face image and the second original face image, includes:
  • inputting the first encoding vector or the second encoding vector into the affine transformation module to generate a corresponding bias factor and scaling factor, and inputting the bias factor and the scaling factor into the corresponding style transfer module;
  • the style transfer module performs a face style conversion calculation based on the feature map of the first original face image or the second original face image and the corresponding bias factor and scaling factor, introduces random noise to generate a corresponding face style conversion result, and obtains the face composite image from the individual face style conversion results.
  • the calculation formula of the style transfer module is AdaIN(x, y) = y_s * (x - u(x)) / σ(x) + y_b, where:
  • AdaIN(x, y) is the face style conversion result
  • y_s is the scaling factor
  • y_b is the bias factor
  • x is the feature map input by the model
  • σ denotes taking the standard deviation
  • u denotes taking the mean.
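  • a minimal sketch of the AdaIN computation as defined above, assuming per-channel statistics over an (N, C, H, W) feature map (the shape convention is an assumption, not stated in the source):

```python
import torch

def adain(x, y_s, y_b, eps=1e-5):
    """AdaIN(x, y) = y_s * (x - u(x)) / sigma(x) + y_b.

    x: feature map of shape (N, C, H, W); y_s, y_b: per-channel scaling and
    bias factors of shape (N, C, 1, 1) produced by the affine module "A".
    """
    u = x.mean(dim=(2, 3), keepdim=True)           # per-channel mean u(x)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps  # per-channel std sigma(x)
    return y_s * (x - u) / sigma + y_b
```

  • normalizing each channel and re-modulating it with y_s and y_b is what lets each injected coding vector impose its "style" on the feature map.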
  • an embodiment of the present application provides a system for synthesizing a face image, including:
  • an input module for inputting the first original face image and the second original face image into a generator of the generation model, the generator comprising a first encoder, a second encoder and a decoder;
  • the conversion module is used to extract, through the first encoder and based on the first set weight, the skin color information and face feature information of the first original face image, convert them into a plurality of first encoding vectors, and input the first encoding vectors into the decoder; and to extract, through the second encoder and based on the second set weight, the skin color information and face feature information of the second original face image, convert them into a plurality of second encoding vectors, and input the second encoding vectors into the decoder; the proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and the proportion of face feature information extracted under the second set weight is greater than that under the first set weight;
  • the synthesis module is configured to perform, through the decoder, face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on the multi-channel affine transformation module and the style transfer module, and to introduce random noise to generate the face composite image corresponding to the first original face image and the second original face image.
  • an embodiment of the present application provides an electronic device, including: a memory and one or more processors;
  • the memory is configured to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the method for synthesizing a face image according to the first aspect.
  • an embodiment of the present application provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the method for synthesizing a face image according to the first aspect.
  • in the embodiments of the present application, the first original face image and the second original face image are input into the generator of the generative model; the first encoder and the second encoder of the generator extract the skin color information and face feature information according to the set weights and convert them into multiple encoding vectors; and the decoder of the generator converts and synthesizes the encoding vectors based on the multi-channel affine transformation module and the style transfer module, introducing random noise to generate the face composite image corresponding to the first original face image and the second original face image.
  • FIG. 1 is a flowchart of a method for synthesizing a face image provided in Embodiment 1 of the present application;
  • FIG. 2 is a schematic diagram of the network architecture of the generator in Embodiment 1 of the present application;
  • FIG. 3 is a training flowchart of the generative model in Embodiment 1 of the present application;
  • FIG. 4 is a schematic diagram of the architecture of the residual module in Embodiment 1 of the present application;
  • FIG. 5 is a flowchart of the decoder's face image conversion and synthesis in Embodiment 1 of the present application;
  • FIG. 6 is a schematic diagram of the decoder's face image conversion architecture in Embodiment 1 of the present application;
  • FIG. 7 is a flowchart of the encoding vector conversion of the affine transformation module in Embodiment 1 of the present application;
  • FIG. 8 is a schematic structural diagram of a system for synthesizing a face image provided in Embodiment 2 of the present application;
  • FIG. 9 is a schematic structural diagram of an electronic device provided in Embodiment 3 of the present application.
  • the face image synthesis method provided by the present application aims to convert and synthesize two original face images through a generative model to obtain a corresponding face composite image.
  • during conversion and synthesis, multi-channel face image conversion is performed by converting the original face images into multiple encoding vectors, and random noise is introduced to increase the authenticity of the face composite image and improve the image quality.
  • when converting the original face images into encoding vectors, the two encoders place different emphasis on extracting face feature information and skin color information; extracting according to the set weights yields a first encoding vector with stronger skin color information and a second encoding vector with stronger face feature information, which avoids missing skin color information in the face composite image and optimizes the face image synthesis effect.
  • traditional 3D texture-mapping approaches to face image synthesis require pre-processing a designed 3D texture, deforming the face according to face key point information, and then applying skin smoothing, texturing and similar processing to obtain the face composite image; such approaches rely on face key points, offer poor diversity, produce unnatural images, and have certain limitations.
  • traditional generative adversarial network models for face image synthesis generally take one or two face images as input and extract the face feature information in them to generate a face composite image; such approaches demand high image quality, produce relatively poor results for low-quality images, and do not properly consider the influence of face skin color. Based on this, the method for synthesizing a face image according to the embodiments of the present application is provided to solve the skin color and image quality problems of existing face image synthesis.
  • FIG. 1 shows a flowchart of a method for synthesizing a face image provided in Embodiment 1 of the present application.
  • the method for synthesizing a face image provided in this embodiment may be performed by a face image synthesis device, which may be realized by means of software and/or hardware and may consist of one physical entity or of two or more physical entities.
  • the face image synthesis device may be a computer device such as a computer.
  • the face image synthesis method specifically includes:
  • the generator includes a first encoder, a second encoder and a decoder
  • the first encoder extracts the skin color information and face feature information of the first original face image based on the first set weight, converts them into a plurality of first encoding vectors, and inputs the first encoding vectors into the decoder;
  • the second encoder extracts the skin color information and face feature information of the second original face image based on the second set weight, converts them into a plurality of second encoding vectors, and inputs the second encoding vectors into the decoder; the proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and the proportion of face feature information extracted under the second set weight is greater than that under the first set weight.
  • the embodiments of the present application perform face image synthesis based on a generative model.
  • the generative model is built on a generative adversarial network: it uses the architecture of the generative adversarial network model, and the generator of the generative model realizes the synthesis of face images.
  • the generative model needs to be pre-trained for subsequent face image synthesis.
  • the generative model based on the generative adversarial network in the embodiment of the present application mainly trains its generator during training, so that the generator can perform face image synthesis.
  • as for the discriminator, it is used to verify the face composite image converted and synthesized by the generator and to judge whether it is accurate. It is understandable that, based on the principle of generative adversarial networks, the generator and the discriminator learn through an adversarial game: when the discriminator cannot identify the difference between the face composite image generated by the generator and a real sample composite image, the training of the generator is completed and it can be used for face image synthesis.
  • the generator includes two parts: encoder and decoder.
  • the encoder includes a first encoder and a second encoder, whose purpose is to extract face feature information and skin color information, encode any face image into multiple encoding vectors w, and then inject w into the decoder.
  • the decoder is built on the structure of the StyleGAN model (a generative adversarial network model that can generate high-resolution images); it performs style conversion on the encoding vectors to obtain face style conversion results and generates the face composite image from the multi-channel face style conversion results.
  • random noise information (such as face spot features and similar information) is further introduced during image conversion, which makes the generated face composite image realistic and high-definition, thereby improving the generalization performance of the generative model of the embodiments of the present application and optimizing the face image synthesis effect.
  • the encoder in this embodiment of the present application includes a first encoder and a second encoder; the two encoders receive the same or different first and second original face images respectively and encode them to obtain the first encoding vectors and the second encoding vectors.
  • the first encoder and the second encoder extract face feature information and skin color information according to their respective set weights, and convert them into corresponding coding vectors based on the face feature information and skin color information.
  • the feature information that the first encoder and the second encoder focus on extracting is different: the first encoder focuses on extracting skin color information, while the second encoder focuses on extracting face feature information.
  • in the first set weight used by the first encoder, the proportion assigned to skin color information is greater than in the second set weight used by the second encoder to extract feature information.
  • correspondingly, the proportion of face feature information extracted under the second set weight is greater than that under the first set weight.
  • both encoders adopt the same network structure, wherein the first encoder extracts the skin color information with a larger proportion, and the second encoder extracts the face feature information with a larger proportion.
  • the encoder adopts a residual block (Resblock) structure, which consists of 5 residual blocks and a fully connected layer (FC).
  • the input of the encoder is the original face image
  • the output of the encoder is the coded vector w of 14*512.
  • unlike the encoder of a traditional generative model, which outputs only a single 1*512-dimensional vector, the encoder in the embodiments of the present application outputs N different 1*512-dimensional vectors, which better controls the attribute entanglement problem; the N encoding vectors w are injected into the decoder in the AdaIN manner for image conversion and synthesis.
  • the decoder is built on the structure of the StyleGAN model (a generative adversarial network model that can generate high-resolution images), a style-based generation network that draws on style transfer methods to generate high-definition face composite images and can, to a certain extent, disentangle the different attributes of a face image, which benefits face image synthesis.
  • the constant qualifier "const 4 ⁇ 4 ⁇ 512" of the decoder is a parameter that can be learned, and its role is to learn an average face.
  • “A” represents the affine transformation module
  • “AdaIN” represents the style transfer module
  • the affine transformation module is a learnable affine transformation, which includes a fully connected layer.
  • for an input encoding vector w of 1*512, the affine transformation module doubles its dimension and outputs a 2*512 vector.
  • the output vector is then converted into the bias factor and the scaling factor.
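  • a hedged sketch of the affine transformation module "A" as described: one fully connected layer maps a 1*512 coding vector to a 2*512 vector, which is split into the scaling factor y_s and the bias factor y_b; the layer sizes follow the text, everything else is an assumption:

```python
import torch.nn as nn

class AffineTransform(nn.Module):
    """The "A" module: one FC layer maps a 1*512 coding vector w to a 2*512
    vector, which is split into the scaling factor y_s and bias factor y_b."""

    def __init__(self, w_dim=512, channels=512):
        super().__init__()
        self.fc = nn.Linear(w_dim, 2 * channels)  # doubles the dimension

    def forward(self, w):
        y = self.fc(w)                 # (N, 2*512)
        y_s, y_b = y.chunk(2, dim=1)   # scaling and bias factors
        # reshape to (N, C, 1, 1) so they broadcast over the feature map
        return y_s[:, :, None, None], y_b[:, :, None, None]
```

  • the reshape to (N, C, 1, 1) is what lets a single 1*512 coding vector modulate an entire feature map channel-wise in the style transfer module.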
  • the style transfer module performs style conversion using the bias factor and the scaling factor together with the feature map of the corresponding original face image, obtaining the corresponding face style conversion result. Since the encoding vectors are multi-channel, the affine transformation and style conversion of the encoding vectors are carried out through the multi-channel affine transformation module and style transfer module to obtain the corresponding face style conversion results; in this way the different attributes of the face image can be disentangled and the problem of face attribute entanglement is avoided.
  • the decoder of the embodiment of the present application further includes an upsampling (Upsample) module.
  • the Upsample module provides a corresponding upsampling operation by which the feature map can be upsampled to a corresponding size.
  • since the first encoder and the second encoder each output multiple channels of encoding vectors to the decoder, when processing each encoding vector (that is, each first encoding vector and second encoding vector), the decoder uses the corresponding single-channel affine transformation module and style transfer module to perform face image conversion on the input first or second encoding vector, combining random noise to obtain the corresponding face style conversion result.
  • a synthetic face image is obtained by synthesizing each face style conversion result.
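  • combining the pieces, one decoder stage might look like the following hedged sketch, which inlines the affine transform (A), the AdaIN style transfer, the noise injection (B) and the upsampling; all sizes and the learned noise scale are assumptions:

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One decoder stage: affine transform (A) of one coding vector, AdaIN
    style transfer on the feature map, noise injection (B), and upsampling."""

    def __init__(self, channels=512, w_dim=512):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * channels)            # module "A"
        self.noise_scale = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear",
                                    align_corners=False)

    def forward(self, feat, w):
        y_s, y_b = self.affine(w).chunk(2, dim=1)
        y_s, y_b = y_s[:, :, None, None], y_b[:, :, None, None]
        u = feat.mean(dim=(2, 3), keepdim=True)
        sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-5
        feat = y_s * (feat - u) / sigma + y_b                    # AdaIN
        feat = feat + self.noise_scale * torch.randn_like(feat)  # noise "B"
        return self.upsample(feat)
```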
  • referring to FIG. 3, the training process of the generative model includes:
  • S101, use two training sample images as model input, and train the generator with the face composite image of the training sample images as model output;
  • S102, use the discriminator of the generative model to verify the face composite image of the training sample images, and adjust the face attribute synthesis parameters of the generator according to the loss functions of the generative model until the loss functions converge.
  • based on the characteristics of the generative adversarial network model, when training the generative model, face images with strong diversity and of different skin colors, races and ages are used as training sample images; the two training sample images are input into the first encoder and the second encoder for encoding, and the multi-channel encoding vectors are then injected into the decoder to generate the corresponding face composite image. The discriminator verifies the difference between the face composite image generated by the decoder and the real sample images, and the face synthesis parameters of the generator are continuously adjusted based on the loss functions. When the loss functions converge and the discriminator cannot distinguish the face composite image generated by the decoder from a real sample composite image, the generator training is completed.
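  • the alternating update just described can be sketched as a single training step; this is illustrative only, with the auxiliary losses (idLoss, wLoss, skin color loss) omitted and `generator`/`discriminator` as placeholders:

```python
import torch

def train_step(generator, discriminator, opt_g, opt_d, sample_a, sample_b):
    """One illustrative training step using only the least-squares GAN losses."""
    fake = generator(sample_a, sample_b)  # synthesize a face from two samples

    # discriminator step: real samples toward 1, synthesized faces toward 0
    opt_d.zero_grad()
    loss_d = torch.mean((discriminator(sample_a) - 1) ** 2
                        + discriminator(fake.detach()) ** 2)
    loss_d.backward()
    opt_d.step()

    # generator step: make the discriminator score the synthesized face as real
    opt_g.zero_grad()
    loss_g = torch.mean((discriminator(fake) - 1) ** 2)
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```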
  • the loss function corresponding to the first encoder includes generative adversarial network loss, face feature loss and coding vector distance loss;
  • the loss function corresponding to the second encoder includes generative adversarial network loss, skin color loss and encoding vector distance loss.
  • the generative adversarial network loss includes a generator loss and a discriminator loss, calculated as:
  • Loss_G = E((D(G(x)) - 1)^2)
  • Loss_D = E((D(x) - 1)^2 + D(G(x))^2)
  • D is the discriminator
  • G is the generator
  • x is the feature map input by the model
  • E is the mean value
  • G(x) is the synthetic face image generated by the generator
  • D(x) is the discriminator's verification result for the face composite image
  • Loss_G represents the corresponding generator loss
  • Loss_D represents the discriminator loss.
  • the generative adversarial network loss adopts a least-squares loss and is mainly used to constrain whether the synthesized face composite image is the one the generative model actually wants, as well as the authenticity of the face composite image.
  • the coding vector distance loss is calculated as wLoss = E((w - w_mean)^2), where:
  • w is the coding vector output by the first encoder or the second encoder
  • w_mean is the mean value of the coding vector in the decoder. The coding vector distance loss helps guarantee the generalization performance of the generative model, ensuring that a relatively high-quality face composite image can be generated regardless of whether the input image is high-definition or low-quality.
  • the face feature loss is determined based on the face recognition network and is calculated as idLoss = E(cosin(Facenet(x), Facenet(G(x)))), where:
  • idLoss represents the loss of facial features
  • E represents the mean value
  • Facenet represents the face recognition network
  • x represents the feature map input by the model
  • G represents the generator
  • G(x) represents the synthetic face image generated by the generator.
  • idLoss uses a cosine loss and constrains the similarity between the face composite image and the original face image, ensuring a certain connection between the generated face composite image and the input second original face image.
  • the skin color loss uses the LAB color space to calculate the skin color difference loss, and considers the histogram loss as the skin color loss.
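  • the source names the ingredients of the skin color loss (a LAB color space difference plus a histogram loss) but not the exact formula; the following NumPy/scikit-image sketch is therefore only an assumed construction:

```python
import numpy as np
from skimage.color import rgb2lab

def skin_color_loss(img_a, img_b, bins=32):
    """Assumed construction: mean absolute LAB difference plus an L1 histogram
    loss over the a/b chroma channels of two RGB images in [0, 1]."""
    lab_a, lab_b = rgb2lab(img_a), rgb2lab(img_b)
    color_diff = np.mean(np.abs(lab_a - lab_b))  # LAB skin color difference
    hist_loss = 0.0
    for ch in (1, 2):  # a and b chroma channels carry the skin tone
        h_a, _ = np.histogram(lab_a[..., ch], bins=bins,
                              range=(-128, 127), density=True)
        h_b, _ = np.histogram(lab_b[..., ch], bins=bins,
                              range=(-128, 127), density=True)
        hist_loss += np.abs(h_a - h_b).sum()
    return color_diff + hist_loss
```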
  • the first encoder and the second encoder are trained respectively corresponding to their own loss functions. After the loss function converges, the training of the generative model is completed, and the generator of the generative model can be used for face image synthesis.
  • illustratively, when generating a face composite image from two original face images, the two original face images are input into the first encoder and the second encoder respectively, and each encoder converts its original face image into multiple first or second encoding vectors through its five residual modules and one fully connected layer (FC).
  • when converting into the first and second encoding vectors, face feature information and skin color information are extracted according to the respective set weights, ensuring that the two encoders focus their extraction on face feature information or skin color information respectively, so that the final face composite image retains a certain relationship with the original face images in face features and skin color.
  • the residual module (Resblock) is shown in Figure 4.
  • the structure used by the residual module connects the convolutional layers and activation function (leaky_relu) layers with skip connections, which effectively prevents the vanishing-gradient problem during training.
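  • a minimal sketch of such a residual block; the kernel sizes, channel counts and LeakyReLU slope are assumptions, and only the skip-connected conv/leaky_relu pattern follows the text:

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """One encoder residual block: conv + leaky_relu layers joined by a skip
    ("cross") connection, which mitigates vanishing gradients."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # match channels

    def forward(self, x):
        return self.body(x) + self.skip(x)
```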
  • S130, the decoder performs face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on the multi-channel affine transformation module and the style transfer module, and introduces random noise to generate the face composite image corresponding to the first original face image and the second original face image.
  • after the first encoder and the second encoder generate the multi-channel first and second encoding vectors, the first encoding vectors and the second encoding vectors are input into the decoder, which performs image conversion and synthesis through the affine transformation module and the style transfer module to obtain the corresponding face composite image.
  • referring to FIG. 5, the decoder's face image conversion and synthesis process includes:
  • S1301, input the first encoding vector or the second encoding vector into the affine transformation module to generate the corresponding bias factor and scaling factor, and input the bias factor and the scaling factor into the corresponding style transfer module;
  • S1302, the style transfer module performs a face style conversion calculation based on the feature map of the first original face image or the second original face image and the corresponding bias factor and scaling factor, introduces random noise to generate the corresponding face style conversion result, and obtains the face composite image from the individual face style conversion results.
  • FIG. 6 a schematic diagram of a face image conversion architecture of a decoder according to an embodiment of the present application is provided.
  • the decoder forms a style-based generation network through the affine transformation module (A) and the style transfer module (AdaIN), so as to obtain the face style transformation result based on the encoding vector transformation.
  • referring to FIG. 7, when the first encoding vector or the second encoding vector is input into the affine transformation module to generate the corresponding bias factor and scaling factor, the encoding vector is converted into an output vector, and the output vector is converted into the corresponding bias factor and scaling factor. The bias factor and scaling factor are injected into the style transfer module (AdaIN), which performs face style conversion based on the feature map of the original face image, the bias factor and the scaling factor, introducing random noise to output the face style conversion result.
  • the calculation formula of the style transfer module is:
  • AdaIN(x, y) = y_s * (x - u(x)) / σ(x) + y_b
  • where AdaIN(x, y) is the face style conversion result, y_s is the scaling factor, y_b is the bias factor, x is the feature map input by the model, σ denotes taking the standard deviation, and u denotes taking the mean.
  • the style transfer module (AdaIN) is a commonly used module in style transfer technology, which can change the image style well and realize the style transfer of face images. Based on the calculation formula of the above style transfer module, the corresponding face style conversion result can be obtained through image conversion. Moreover, considering the authenticity requirements of face composite images, random noise is further introduced to make the generated face composite images natural and authentic. Finally, by synthesizing various face style conversion results, a synthetic face image is generated and output, and the synthesis of two original face images is completed.
  • two face images corresponding to the parents are input into the first encoder and the second encoder respectively.
  • the first encoder and the second encoder extract the skin color information and the face feature information of the parent's face image according to the set weight, and convert them into a plurality of first encoding vectors and second encoding vectors.
  • the decoder then performs face image conversion and synthesis to obtain the face composite image, that is, the face image corresponding to the child.
  • in summary, the first original face image and the second original face image are input into the generator of the generative model; the first encoder and the second encoder of the generator extract the skin color information and face feature information according to the set weights and convert them into multiple encoding vectors; and the decoder of the generator performs face image conversion and synthesis on the encoding vectors based on the multi-channel affine transformation module and the style transfer module, introducing random noise to generate the face composite image corresponding to the first original face image and the second original face image.
  • Embodiment 2:
  • FIG. 8 is a schematic structural diagram of a face image synthesis system provided in Embodiment 2 of the present application.
  • the face image synthesis system provided in this embodiment specifically includes: an input module 21 , a conversion module 22 and a synthesis module 23 .
  • the input module 21 is used to input the first original face image and the second original face image into the generator of the generation model, and the generator includes a first encoder, a second encoder and a decoder;
  • the conversion module 22 is configured to extract, through the first encoder and based on the first set weight, the skin color information and face feature information of the first original face image, convert them into a plurality of first encoding vectors, and input the first encoding vectors into the decoder; and to extract, through the second encoder and based on the second set weight, the skin color information and face feature information of the second original face image, convert them into a plurality of second encoding vectors, and input the second encoding vectors into the decoder; the proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and the proportion of face feature information extracted under the second set weight is greater than that under the first set weight;
  • the synthesis module 23 is configured to perform, through the decoder, face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on the multi-channel affine transformation module and the style transfer module, and to introduce random noise to generate the face composite image corresponding to the first original face image and the second original face image.
  • in this way, the skin color information and face feature information are extracted by the first encoder and the second encoder of the generator according to the set weights and converted into multiple encoding vectors, and the decoder of the generator performs face image conversion and synthesis on the encoding vectors based on the multi-channel affine transformation module and the style transfer module, introducing random noise to generate the face composite image corresponding to the first original face image and the second original face image.
  • the face image synthesis system provided in Embodiment 2 of the present application can be used to execute the face image synthesis method provided in Embodiment 1 above, and has corresponding functions and beneficial effects.
  • the third embodiment of the present application provides an electronic device.
  • the electronic device includes: a processor 31 , a memory 32 , a communication module 33 , an input device 34 and an output device 35 .
  • the memory 32, as a computer-readable storage medium, can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the face image synthesis method described in any embodiment of the present application (for example, the input module, conversion module and synthesis module of the face image synthesis system).
  • the communication module 33 is used for data transmission.
  • the processor 31 executes various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory, ie, implements the above-mentioned method for synthesizing face images.
  • the input device 34 may be used to receive input numerical or character information and to generate key signal input related to user settings and function control of the device.
  • the output device 35 may include a display device such as a display screen.
  • the electronic device provided above can be used to execute the method for synthesizing a face image provided in the first embodiment, and has corresponding functions and beneficial effects.
  • Embodiment 4:
  • embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the above method for synthesizing a face image; the storage medium may be any of various types of memory devices or storage devices.
  • of course, the storage medium containing computer-executable instructions provided by the embodiments of the present application is not limited to the above method for synthesizing a face image; the computer-executable instructions can also execute related operations in the face image synthesis method provided by any embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

A face image synthesis method and system, an electronic device, and a storage medium. In the method, a first original face image and a second original face image are input into the generator of a generative model; the first and second encoders of the generator extract skin color information and face feature information according to set weights and convert them into multiple encoding vectors; and the decoder of the generator performs face image conversion and synthesis on the encoding vectors based on a multi-channel affine transformation module and a style transfer module, introducing random noise to generate a face composite image corresponding to the first original face image and the second original face image. By setting different weights for extracting skin color information and face feature information, the method improves the authenticity of face image synthesis, optimizes the face image synthesis effect, and improves the image synthesis quality. In addition, by introducing random noise, the method can further improve the authenticity of face image synthesis.

Description

Face image synthesis method and system, electronic device, and storage medium
This application claims priority to Chinese patent application No. 202011566624.1, filed with the Chinese Patent Office on December 25, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of computer technologies, and in particular to a face image synthesis method and system, an electronic device, and a storage medium.
Background
At present, in the fields of short video and film image production, synthesizing a face image from two face images is an interesting and challenging technology. For example, by inputting a male face image and a female face image, a face image of the couple's future child is synthesized.
However, when synthesizing face images, existing face image synthesis algorithms take little account of features such as race and skin color, so their feature fusion is relatively poor and lacks authenticity. Moreover, they place relatively high demands on the quality of the images to be synthesized; for the synthesis of low-quality images, the quality of the output image is relatively poor.
Summary
The embodiments of the present application provide a face image synthesis method and system, an electronic device, and a storage medium, which can improve the authenticity of face image synthesis, improve the image synthesis quality, and optimize the face image synthesis effect.
In a first aspect, an embodiment of the present application provides a face image synthesis method, including:
inputting a first original face image and a second original face image into a generator of a generative model, the generator including a first encoder, a second encoder, and a decoder;
extracting, by the first encoder and based on a first set weight, skin color information and face feature information of the first original face image, converting them into a plurality of first encoding vectors, and inputting the first encoding vectors into the decoder; extracting, by the second encoder and based on a second set weight, skin color information and face feature information of the second original face image, converting them into a plurality of second encoding vectors, and inputting the second encoding vectors into the decoder; wherein the proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and the proportion of face feature information extracted under the second set weight is greater than that under the first set weight;
performing, by the decoder, face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on a multi-channel affine transformation module and a style transfer module, and introducing random noise to generate a face composite image corresponding to the first original face image and the second original face image.
Further, the training process of the generative model includes:
training the generator with two training sample images as model input and the face composite image of the training sample images as model output;
using the discriminator of the generative model to verify the face composite image of the training sample images, and adjusting the face attribute synthesis parameters of the generator according to the loss functions of the generative model until the loss functions converge.
Further, the loss function corresponding to the first encoder includes a generative adversarial network loss, a face feature loss, and a coding vector distance loss; the loss function corresponding to the second encoder includes a generative adversarial network loss, a skin color loss, and a coding vector distance loss.
Further, the generative adversarial network loss includes a generator loss and a discriminator loss, calculated as:
Loss_G = E((D(G(x)) - 1)^2)
Loss_D = E((D(x) - 1)^2 + D(G(x))^2)
where D is the discriminator, G is the generator, x is the feature map input to the model, E denotes taking the mean, G(x) is the face composite image generated by the generator, D(x) is the discriminator's verification result for the face composite image, Loss_G is the corresponding generator loss, and Loss_D is the discriminator loss.
Further, the coding vector distance loss is calculated as:
wLoss = E((w - w_mean)^2)
where w is the coding vector output by the first encoder or the second encoder, and w_mean is the mean value of the coding vector in the decoder.
Further, the face feature loss is determined based on a face recognition network and is calculated as:
idLoss = E(cosin(Facenet(x), Facenet(G(x))))
where idLoss is the face feature loss, E denotes taking the mean, Facenet is the face recognition network, x is the feature map input to the model, G is the generator, and G(x) is the face composite image generated by the generator.
Further, performing, by the decoder, face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on the multi-channel affine transformation module and the style transfer module, and introducing random noise to generate the face composite image corresponding to the first original face image and the second original face image, includes:
inputting the first encoding vector or the second encoding vector into the affine transformation module to generate a corresponding bias factor and scaling factor, and inputting the bias factor and the scaling factor into the corresponding style transfer module;
performing, by the style transfer module, a face style conversion calculation based on the feature map of the first original face image or the second original face image and the corresponding bias factor and scaling factor, introducing random noise to generate a corresponding face style conversion result, and obtaining the face composite image based on the individual face style conversion results.
Further, inputting the first encoding vector or the second encoding vector into the affine transformation module to generate the corresponding bias factor and scaling factor includes:
converting the first encoding vector or the second encoding vector into an output vector, and converting the output vector into the corresponding bias factor and scaling factor.
Further, the calculation formula of the style transfer module is:
AdaIN(x, y) = y_s * (x - u(x)) / σ(x) + y_b
where AdaIN(x, y) is the face style conversion result, y_s is the scaling factor, y_b is the bias factor, x is the feature map input to the model, σ denotes taking the standard deviation, and u denotes taking the mean.
In a second aspect, an embodiment of the present application provides a face image synthesis system, including:
an input module, configured to input a first original face image and a second original face image into a generator of a generative model, the generator including a first encoder, a second encoder, and a decoder;
a conversion module, configured to extract, through the first encoder and based on a first set weight, the skin color information and face feature information of the first original face image, convert them into a plurality of first encoding vectors, and input the first encoding vectors into the decoder; and to extract, through the second encoder and based on a second set weight, the skin color information and face feature information of the second original face image, convert them into a plurality of second encoding vectors, and input the second encoding vectors into the decoder; wherein the proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and the proportion of face feature information extracted under the second set weight is greater than that under the first set weight;
a synthesis module, configured to perform, through the decoder, face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on a multi-channel affine transformation module and a style transfer module, and to introduce random noise to generate a face composite image corresponding to the first original face image and the second original face image.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a memory and one or more processors;
the memory being configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the face image synthesis method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the face image synthesis method according to the first aspect.
In the embodiments of the present application, the first original face image and the second original face image are input into the generator of the generative model; the first encoder and the second encoder of the generator extract skin color information and face feature information according to the set weights and convert them into multiple encoding vectors; and the decoder of the generator performs face image conversion and synthesis on the encoding vectors based on the multi-channel affine transformation module and the style transfer module, introducing random noise to generate the face composite image corresponding to the first original face image and the second original face image. With the above technical means, by setting different weights for extracting skin color information and face feature information, the authenticity of face image synthesis can be improved, the face image synthesis effect can be optimized, and the image synthesis quality can be improved. In addition, introducing random noise can further improve the authenticity of face image synthesis.
Brief Description of the Drawings
FIG. 1 is a flowchart of a face image synthesis method provided in Embodiment 1 of the present application;
FIG. 2 is a schematic diagram of the network architecture of the generator in Embodiment 1 of the present application;
FIG. 3 is a training flowchart of the generative model in Embodiment 1 of the present application;
FIG. 4 is a schematic diagram of the architecture of the residual module in Embodiment 1 of the present application;
FIG. 5 is a flowchart of the decoder's face image conversion and synthesis in Embodiment 1 of the present application;
FIG. 6 is a schematic diagram of the decoder's face image conversion architecture in Embodiment 1 of the present application;
FIG. 7 is a flowchart of the encoding vector conversion of the affine transformation module in Embodiment 1 of the present application;
FIG. 8 is a schematic structural diagram of a face image synthesis system provided in Embodiment 2 of the present application;
FIG. 9 is a schematic structural diagram of an electronic device provided in Embodiment 3 of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, specific embodiments of the present application are described in further detail below with reference to the accompanying drawings. It can be understood that the specific embodiments described here are only used to explain the present application rather than to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present application rather than all of its content. Before discussing the exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as sequential processing, many of the operations may be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be rearranged. The processing may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. The processing may correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like.
The face image synthesis method provided by the present application aims to convert and synthesize two original face images through a generative model to obtain a corresponding face composite image. During the face image conversion and synthesis, multi-channel face image conversion is performed by converting the original face images into multiple encoding vectors, and random noise is introduced to increase the authenticity of the face composite image, improve the image quality, and optimize the face image editing effect. Moreover, when the original face images are converted into encoding vectors, the two encoders place different emphasis on extracting face feature information and skin color information; extracting face feature information and skin color information according to the set weights yields a first encoding vector with stronger skin color information and a second encoding vector with stronger face feature information, which avoids missing skin color information in the face composite image and optimizes the face image synthesis effect.
Traditional 3D texture-mapping approaches to face image synthesis require pre-processing a designed 3D texture, deforming the face according to face key point information, and then applying skin smoothing, texturing, and similar processing to obtain the face composite image; such approaches rely on face key points, offer poor diversity, produce unnatural images, and have certain limitations. Traditional generative adversarial network models for face image synthesis generally take one or two face images as input and extract the face feature information in them to generate a face composite image; such approaches demand high image quality, produce relatively poor results for low-quality images, and do not properly consider the influence of face skin color. Based on this, the face image synthesis method of the embodiments of the present application is provided to solve the skin color and image quality problems of existing face image synthesis.
Embodiment 1:
FIG. 1 shows a flowchart of a face image synthesis method provided in Embodiment 1 of the present application. The face image synthesis method provided in this embodiment may be performed by a face image synthesis device, which may be realized by means of software and/or hardware and may consist of one physical entity or of two or more physical entities. Generally, the face image synthesis device may be a computer device such as a computer.
The following description takes the face image synthesis device as the subject performing the face image synthesis method. Referring to FIG. 1, the face image synthesis method specifically includes:
S110, input a first original face image and a second original face image into a generator of a generative model, the generator including a first encoder, a second encoder, and a decoder;
S120, the first encoder extracts skin color information and face feature information of the first original face image based on a first set weight, converts them into a plurality of first encoding vectors, and inputs the first encoding vectors into the decoder; the second encoder extracts skin color information and face feature information of the second original face image based on a second set weight, converts them into a plurality of second encoding vectors, and inputs the second encoding vectors into the decoder; the proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and the proportion of face feature information extracted under the second set weight is greater than that under the first set weight.
The embodiments of the present application perform face image synthesis based on a generative model. The generative model is built on a generative adversarial network: it uses the architecture of a generative adversarial network model, and the synthesis of face images is realized through the generator of the generative model.
Before this, the generative model needs to be pre-trained for subsequent face image synthesis. It should be noted that, during training, the generative model based on a generative adversarial network in the embodiments of the present application mainly trains its generator so that the generator can perform face image synthesis. The discriminator is used to verify the face composite image converted and synthesized by the generator and judge whether it is accurate. It is understandable that, based on the principle of generative adversarial networks, the generator and the discriminator learn through an adversarial game: when the discriminator cannot identify the difference between the face composite image generated by the generator and a real sample composite image, the training of the generator is completed and it can be used for face image synthesis.
Specifically, referring to FIG. 2, a schematic diagram of the network architecture of the generator of the embodiments of the present application is provided. As shown in FIG. 2, the generator includes two parts: an encoder and a decoder. The encoder includes a first encoder and a second encoder, whose purpose is to extract face feature information and skin color information, encode any face image into multiple encoding vectors w, and then inject w into the decoder. The decoder is built on the structure of the StyleGAN model (a generative adversarial network model that can generate high-resolution images); the decoder performs style conversion on the encoding vectors to obtain face style conversion results and generates the face composite image based on the multi-channel face style conversion results. During the image conversion, random noise information (such as face spot features and similar information) is further introduced, which makes the generated face composite image realistic and high-definition, thereby improving the generalization performance of the generative model of the embodiments of the present application and optimizing the face image synthesis effect.
It should be noted that the encoder of the embodiments of the present application includes a first encoder and a second encoder; the two encoders receive the same or different first and second original face images respectively and encode the original face images to obtain the first encoding vectors and the second encoding vectors. The first encoder and the second encoder extract face feature information and skin color information according to their respective set weights and convert them into corresponding encoding vectors. Moreover, the feature information that the first encoder and the second encoder focus on extracting is different: the first encoder focuses on extracting skin color information, while the second encoder focuses on extracting face feature information. In the first set weight used by the first encoder to extract feature information, the proportion of skin color information is greater than in the second set weight used by the second encoder; correspondingly, the proportion of face feature information in the second set weight is greater than in the first set weight. By having the encoders extract face feature information and skin color information according to their respective set weights, a first encoding vector with stronger skin color information and a second encoding vector with stronger face features are obtained. This avoids missing skin color information in the face composite image and mismatches between the skin color of the face composite image and that of the original face images. Specifically, both encoders adopt the same network structure, with the first encoder extracting a larger proportion of skin color information and the second encoder extracting a larger proportion of face feature information. Through the above method, the skin color and face feature information of the two original face images can be well disentangled, so that the two encoders control different information; in the actual face image synthesis process, this effectively addresses the correlation of skin color and race between the face composite image and the original face images.
Further, the encoder adopts a residual block (Resblock) structure and consists of five residual blocks and one fully connected layer (FC). The input of the encoder is an original face image, and the output is a 14*512 encoding vector w. Unlike the encoder of a traditional generative model, which outputs only a single 1*512-dimensional vector, the encoder of the embodiments of the present application outputs N different 1*512-dimensional vectors, which better controls the attribute entanglement problem; the N encoding vectors w are injected into the decoder in the AdaIN manner for image conversion and synthesis to obtain the corresponding face composite image. The decoder is built on the structure of the StyleGAN model (a generative adversarial network model that can generate high-resolution images); StyleGAN is a style-based generation network that draws on style transfer methods to generate high-definition face composite images and can, to a certain extent, disentangle the different attributes of a face image, which benefits face image synthesis. The constant qualifier "const 4×4×512" of the decoder is a learnable parameter whose role is to learn an average face. "A" denotes the affine transformation module and "AdaIN" denotes the style transfer module; the affine transformation module is a learnable affine transformation that includes one fully connected layer. For an input encoding vector w (1*512), the affine transformation module doubles its dimension and outputs a 2*512 vector, which is then converted into a bias factor and a scaling factor. The style transfer module performs style conversion using the bias factor and the scaling factor together with the feature map of the corresponding original face image to obtain the corresponding face style conversion result. Since the encoding vectors are multi-channel, the affine transformation and style conversion of the encoding vectors are carried out through the multi-channel affine transformation module and style transfer module to obtain the corresponding face style conversion results; in this way the different attributes of the face image can be disentangled and the problem of face attribute entanglement is avoided. The final face composite image is then output by combining the individual face style conversion results, completing the synthesis of the two original face images. Furthermore, when converting the encoding vectors into the face composite image, the decoder also adds corresponding random noise (such as face spot features and similar information) to each face style conversion result, injecting the random noise (B) into the decoder, which makes the generated face composite image more authentic. In addition, the decoder of the embodiments of the present application also includes an upsampling (Upsample) module, which provides a corresponding upsampling operation through which the feature map can be upsampled to the corresponding size. It should be noted that, since the first encoder and the second encoder output multiple channels of first and second encoding vectors to the decoder, when performing affine transformation and style conversion on each encoding vector (that is, each first and second encoding vector), the decoder uses the corresponding single-channel affine transformation module and style transfer module to perform face image conversion on the first or second encoding vector input to it, combining random noise to obtain the corresponding face style conversion result. Finally, the face composite image is obtained by combining the individual face style conversion results.
When training the generative model of the embodiments of the present application, referring to FIG. 3, the training process of the generative model includes:
S101, use two training sample images as model input, and train the generator with the face composite image of the training sample images as model output;
S102, use the discriminator of the generative model to verify the face composite image of the training sample images, and adjust the face attribute synthesis parameters of the generator according to the loss functions of the generative model until the loss functions converge.
Based on the characteristics of the generative adversarial network model, when training the generative model, face images with strong diversity and of different skin colors, races, and ages are used as training sample images. The two training sample images are input into the first encoder and the second encoder for encoding, and the multi-channel encoding vectors are then injected into the decoder to generate the corresponding face composite image. The discriminator verifies this face composite image, judging the difference between the face composite image generated by the decoder and the real sample images, and the face synthesis parameters of the generator are continuously adjusted based on the loss functions. When the loss functions converge and the discriminator cannot distinguish the face composite image generated by the decoder from a real sample composite image, the generator training is completed.
Specifically, the loss function corresponding to the first encoder includes a generative adversarial network loss, a face feature loss, and a coding vector distance loss; the loss function corresponding to the second encoder includes a generative adversarial network loss, a skin color loss, and a coding vector distance loss. The first encoder and the second encoder are trained based on the above corresponding loss functions; when each loss function converges, the training of the first encoder and the second encoder is completed.
Further, the generative adversarial network loss includes a generator loss and a discriminator loss, calculated as:
Loss_G = E((D(G(x)) - 1)^2)
Loss_D = E((D(x) - 1)^2 + D(G(x))^2)
where D is the discriminator, G is the generator, x is the feature map input to the model, E denotes taking the mean, G(x) is the face composite image generated by the generator, D(x) is the discriminator's verification result for the face composite image, Loss_G is the corresponding generator loss, and Loss_D is the discriminator loss. The generative adversarial network loss adopts a least-squares loss and is mainly used to constrain whether the synthesized face composite image is the one the generative model actually wants, as well as the authenticity of the face composite image.
The coding vector distance loss is calculated as:
wLoss = E((w - w_mean)^2)
where w is the coding vector output by the first encoder or the second encoder, and w_mean is the mean value of the coding vector in the decoder. The coding vector distance loss helps guarantee the generalization performance of the generative model, ensuring that a relatively high-quality face composite image can be generated regardless of whether the input image is high-definition or low-quality.
The face feature loss is determined based on a face recognition network and is calculated as:
idLoss = E(cosin(Facenet(x), Facenet(G(x))))
where idLoss is the face feature loss, E denotes taking the mean, Facenet is the face recognition network, x is the feature map input to the model, G is the generator, and G(x) is the face composite image generated by the generator. idLoss uses a cosine loss and constrains the similarity between the face composite image and the original face image, ensuring a certain connection between the generated face composite image and the input second original face image.
In addition, the skin color loss calculates the skin color difference in the LAB color space and considers a histogram loss as the skin color loss.
The first encoder and the second encoder are trained with their respective loss functions; after the loss functions converge, the training of the generative model is completed, and the generator of the generative model can then be used for face image synthesis.
Illustratively, when generating a face composite image from two original face images, the two original face images are input into the first encoder and the second encoder respectively, and each encoder converts its original face image into multiple first and second encoding vectors through its five residual modules and one fully connected layer (FC). Moreover, when converting into the first and second encoding vectors, face feature information and skin color information are extracted according to the respective set weights, ensuring that the two encoders focus on face feature information or skin color information respectively, so that the final face composite image retains a certain relationship with the original face images in face features and skin color. Specifically, the residual module (Resblock) is shown in FIG. 4; the structure adopted by the residual module connects the convolutional layers and activation function (leaky_relu) layers with skip connections, which effectively prevents the vanishing-gradient problem during training.
S130, the decoder performs face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on the multi-channel affine transformation module and the style transfer module, and introduces random noise to generate the face composite image corresponding to the first original face image and the second original face image.
After the first encoder and the second encoder generate the multiple channels of first and second encoding vectors, the first encoding vectors and the second encoding vectors are input into the decoder, and the decoder performs image conversion and synthesis through the affine transformation module and the style transfer module to obtain the corresponding face composite image. Specifically, referring to FIG. 5, the decoder's face image conversion and synthesis process includes:
S1301, input the first encoding vector or the second encoding vector into the affine transformation module to generate the corresponding bias factor and scaling factor, and input the bias factor and the scaling factor into the corresponding style transfer module;
S1302, the style transfer module performs a face style conversion calculation based on the feature map of the first original face image or the second original face image and the corresponding bias factor and scaling factor, introduces random noise to generate the corresponding face style conversion result, and obtains the face composite image based on the individual face style conversion results.
Specifically, as shown in FIG. 6, a schematic diagram of the decoder's face image conversion architecture of the embodiments of the present application is provided. The decoder forms a style-based generation network through the affine transformation module (A) and the style transfer module (AdaIN), so as to obtain the face style conversion results by converting the encoding vectors. Referring to FIG. 7, when the first encoding vector or the second encoding vector is input into the affine transformation module to generate the corresponding bias factor and scaling factor, the first or second encoding vector is converted into an output vector, and the output vector is converted into the corresponding bias factor and scaling factor. The bias factor and the scaling factor are injected into the style transfer module (AdaIN), which performs face style conversion based on the feature map of the original face image, the bias factor, and the scaling factor, introducing random noise to output the face style conversion result.
The calculation formula of the style transfer module is:
AdaIN(x, y) = y_s * (x - u(x)) / σ(x) + y_b
where AdaIN(x, y) is the face style conversion result, y_s is the scaling factor, y_b is the bias factor, x is the feature map input to the model, σ denotes taking the standard deviation, and u denotes taking the mean.
The style transfer module (AdaIN) is a commonly used module in style transfer technology; it can change the image style well and realize the style transfer of face images. Based on the above calculation formula of the style transfer module, the corresponding face style conversion result can be obtained through image conversion. Moreover, considering the authenticity requirements of the face composite image, random noise is further introduced so that the generated face composite image is natural and authentic. Finally, the face composite image is generated and output by combining the face style conversion results of all channels, completing the synthesis of the two original face images.
Illustratively, based on the above generative model, when performing face image synthesis, two face images corresponding to the parents are input into the first encoder and the second encoder respectively. The first encoder and the second encoder extract the skin color information and face feature information of the parents' face images according to the set weights and convert them into multiple first and second encoding vectors. The decoder then performs face image conversion and synthesis to obtain the face composite image, that is, the face image corresponding to the child.
As described above, the first original face image and the second original face image are input into the generator of the generative model; the first encoder and the second encoder of the generator extract skin color information and face feature information according to the set weights and convert them into multiple encoding vectors; and the decoder of the generator performs face image conversion and synthesis on the encoding vectors based on the multi-channel affine transformation module and the style transfer module, introducing random noise to generate the face composite image corresponding to the first original face image and the second original face image. With the above technical means, by setting different weights for extracting skin color information and face feature information, the authenticity of face image synthesis can be improved, the face image synthesis effect optimized, and the image synthesis quality improved. In addition, introducing random noise can further improve the authenticity of face image synthesis.
Embodiment 2:
On the basis of the above embodiment, FIG. 8 is a schematic structural diagram of a face image synthesis system provided in Embodiment 2 of the present application. Referring to FIG. 8, the face image synthesis system provided in this embodiment specifically includes: an input module 21, a conversion module 22, and a synthesis module 23.
The input module 21 is configured to input a first original face image and a second original face image into a generator of a generative model, the generator including a first encoder, a second encoder, and a decoder;
the conversion module 22 is configured to extract, through the first encoder and based on a first set weight, the skin color information and face feature information of the first original face image, convert them into a plurality of first encoding vectors, and input the first encoding vectors into the decoder; and to extract, through the second encoder and based on a second set weight, the skin color information and face feature information of the second original face image, convert them into a plurality of second encoding vectors, and input the second encoding vectors into the decoder; the proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and the proportion of face feature information extracted under the second set weight is greater than that under the first set weight;
the synthesis module 23 is configured to perform, through the decoder, face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on a multi-channel affine transformation module and a style transfer module, and to introduce random noise to generate the face composite image corresponding to the first original face image and the second original face image.
As described above, the first original face image and the second original face image are input into the generator of the generative model; the first encoder and the second encoder of the generator extract skin color information and face feature information according to the set weights and convert them into multiple encoding vectors; and the decoder of the generator performs face image conversion and synthesis on the encoding vectors based on the multi-channel affine transformation module and the style transfer module, introducing random noise to generate the face composite image corresponding to the first original face image and the second original face image. With the above technical means, by setting different weights for extracting skin color information and face feature information, the authenticity of face image synthesis can be improved, the face image synthesis effect optimized, and the image synthesis quality improved. In addition, introducing random noise can further improve the authenticity of face image synthesis.
The face image synthesis system provided in Embodiment 2 of the present application can be used to execute the face image synthesis method provided in Embodiment 1 above, and has the corresponding functions and beneficial effects.
Embodiment 3:
Embodiment 3 of the present application provides an electronic device. Referring to FIG. 9, the electronic device includes: a processor 31, a memory 32, a communication module 33, an input device 34, and an output device 35. As a computer-readable storage medium, the memory 32 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the face image synthesis method described in any embodiment of the present application (for example, the input module, conversion module, and synthesis module of the face image synthesis system). The communication module 33 is used for data transmission. The processor 31 executes the various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory, that is, implements the face image synthesis method described above. The input device 34 can be used to receive input numeric or character information and to generate key signal input related to user settings and function control of the device. The output device 35 may include a display device such as a display screen. The electronic device provided above can be used to execute the face image synthesis method provided in Embodiment 1 above, and has the corresponding functions and beneficial effects.
Embodiment 4:
The embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the face image synthesis method described above; the storage medium may be any of various types of memory devices or storage devices. Of course, the storage medium containing computer-executable instructions provided by the embodiments of the present application is not limited to the face image synthesis method described above; the computer-executable instructions can also execute related operations in the face image synthesis method provided by any embodiment of the present application.
The above are only preferred embodiments of the present application and the technical principles applied. The present application is not limited to the specific embodiments described here; various obvious changes, readjustments, and substitutions that can be made by those skilled in the art will not depart from the protection scope of the present application. Therefore, although the present application has been described in some detail through the above embodiments, it is not limited to the above embodiments and may include more other equivalent embodiments without departing from the concept of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (10)

  1. A face image synthesis method, characterized by comprising:
    inputting a first original face image and a second original face image into a generator of a generative model, the generator comprising a first encoder, a second encoder, and a decoder;
    extracting, by the first encoder and based on a first set weight, skin color information and face feature information of the first original face image, converting them into a plurality of first encoding vectors, and inputting the first encoding vectors into the decoder; extracting, by the second encoder and based on a second set weight, skin color information and face feature information of the second original face image, converting them into a plurality of second encoding vectors, and inputting the second encoding vectors into the decoder; wherein a proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and a proportion of face feature information extracted under the second set weight is greater than that under the first set weight;
    performing, by the decoder, face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on a multi-channel affine transformation module and a style transfer module, and introducing random noise to generate a face composite image corresponding to the first original face image and the second original face image.
  2. The face image synthesis method according to claim 1, characterized in that the training process of the generative model comprises:
    training the generator with two training sample images as model input and the face composite image of the training sample images as model output;
    using a discriminator of the generative model to verify the face composite image of the training sample images, and adjusting face attribute synthesis parameters of the generator according to loss functions of the generative model until the loss functions converge.
  3. The face image synthesis method according to claim 2, characterized in that the loss function corresponding to the first encoder comprises a generative adversarial network loss, a face feature loss, and a coding vector distance loss; and the loss function corresponding to the second encoder comprises a generative adversarial network loss, a skin color loss, and a coding vector distance loss.
  4. The face image synthesis method according to claim 3, characterized in that the generative adversarial network loss comprises a generator loss and a discriminator loss.
  5. The face image synthesis method according to claim 3, characterized in that the face feature loss is determined based on a face recognition network.
  6. The face image synthesis method according to claim 1, characterized in that performing, by the decoder, face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on the multi-channel affine transformation module and the style transfer module, and introducing random noise to generate the face composite image corresponding to the first original face image and the second original face image, comprises:
    inputting the first encoding vector or the second encoding vector into the affine transformation module to generate a corresponding bias factor and scaling factor, and inputting the bias factor and the scaling factor into the corresponding style transfer module;
    performing, by the style transfer module, a face style conversion calculation based on the feature map of the first original face image or the second original face image and the corresponding bias factor and scaling factor, introducing random noise to generate a corresponding face style conversion result, and obtaining the face composite image based on the individual face style conversion results.
  7. The face image synthesis method according to claim 6, characterized in that inputting the first encoding vector or the second encoding vector into the affine transformation module to generate the corresponding bias factor and scaling factor comprises:
    converting the first encoding vector or the second encoding vector into an output vector, and converting the output vector into the corresponding bias factor and scaling factor.
  8. A face image synthesis system, characterized by comprising:
    an input module, configured to input a first original face image and a second original face image into a generator of a generative model, the generator comprising a first encoder, a second encoder, and a decoder;
    a conversion module, configured to extract, through the first encoder and based on a first set weight, skin color information and face feature information of the first original face image, convert them into a plurality of first encoding vectors, and input the first encoding vectors into the decoder; and to extract, through the second encoder and based on a second set weight, skin color information and face feature information of the second original face image, convert them into a plurality of second encoding vectors, and input the second encoding vectors into the decoder; wherein a proportion of skin color information extracted under the first set weight is greater than that under the second set weight, and a proportion of face feature information extracted under the second set weight is greater than that under the first set weight;
    a synthesis module, configured to perform, through the decoder, face image conversion and synthesis on the first encoding vectors and the second encoding vectors based on a multi-channel affine transformation module and a style transfer module, and to introduce random noise to generate a face composite image corresponding to the first original face image and the second original face image.
  9. An electronic device, characterized by comprising:
    a memory and one or more processors;
    the memory being configured to store one or more programs;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the face image synthesis method according to any one of claims 1-7.
  10. A storage medium containing computer-executable instructions, characterized in that the computer-executable instructions, when executed by a computer processor, are used to execute the face image synthesis method according to any one of claims 1-7.
PCT/CN2021/140563 2020-12-25 2021-12-22 Face image synthesis method and system, electronic device, and storage medium WO2022135490A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011566624.1 2020-12-25
CN202011566624.1A CN112651915B (zh) 2020-12-25 2020-12-25 Face image synthesis method and system, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022135490A1 (zh)

Family

ID=75363041

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140563 WO2022135490A1 (zh) 2021-12-22 2020-12-25 Face image synthesis method and system, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112651915B (zh)
WO (1) WO2022135490A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651915B (zh) 2020-12-25 2023-08-29 百果园技术(新加坡)有限公司 Face image synthesis method and system, electronic device, and storage medium
CN113191943B (zh) * 2021-05-31 2023-09-05 大连民族大学 Multi-channel parallel image content feature separation style transfer method and system
CN114648787A (zh) * 2022-02-11 2022-06-21 华为技术有限公司 Face image processing method and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156730A (zh) * 2016-06-30 2016-11-23 腾讯科技(深圳)有限公司 Method and apparatus for synthesizing a face image
US20190295302A1 (en) * 2018-03-22 2019-09-26 Northeastern University Segmentation Guided Image Generation With Adversarial Networks
CN111275613A (zh) * 2020-02-27 2020-06-12 辽宁工程技术大学 Face attribute editing method using a generative adversarial network with an attention mechanism
CN111754596A (zh) * 2020-06-19 2020-10-09 北京灵汐科技有限公司 Editing model generation and face image editing methods, apparatuses, devices, and media
CN112651915A (zh) * 2020-12-25 2021-04-13 百果园技术(新加坡)有限公司 Face image synthesis method and system, electronic device, and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113728353A (zh) * 2018-11-15 2021-11-30 巴黎欧莱雅 System and method for augmented reality using conditional cycle-consistent generative image-to-image translation models
CN110189249B (zh) * 2019-05-24 2022-02-18 深圳市商汤科技有限公司 Image processing method and apparatus, electronic device, and storage medium
CN111368662B (zh) * 2020-02-25 2023-03-21 华南理工大学 Face image attribute editing method and apparatus, storage medium, and device

Also Published As

Publication number Publication date
CN112651915A (zh) 2021-04-13
CN112651915B (zh) 2023-08-29

Similar Documents

Publication Publication Date Title
WO2022135490A1 (zh) Face image synthesis method and system, electronic device, and storage medium
WO2021258920A1 (zh) Generative adversarial network training method, and image face swapping and video face swapping methods and apparatuses
CN111489287B (zh) Image conversion method and apparatus, computer device, and storage medium
WO2022135013A1 (zh) Face attribute editing method and system, electronic device, and storage medium
Ning et al. Multi‐view frontal face image generation: a survey
CN109255831B (zh) Method for single-view 3D face reconstruction and texture generation based on multi-task learning
WO2022267641A1 (zh) Image dehazing method and system based on a cycle generative adversarial network
Zhang et al. Text-guided neural image inpainting
WO2023072067A1 (zh) Training of a face attribute editing model, and face attribute editing method
CN110084193B (zh) Data processing method, device, and medium for facial image generation
CN113901894A (zh) Video generation method and apparatus, server, and storage medium
KR102409988B1 (ko) Face conversion method and apparatus using a deep learning network
WO2022205755A1 (zh) Texture generation method, apparatus, device, and storage medium
US20220399025A1 Method and device for generating speech video using audio signal
CN117496072B (zh) Three-dimensional digital human generation and interaction method and system
CN114863533A (zh) Digital human generation method and apparatus, and storage medium
CN115914505A (zh) Video generation method and system based on a speech-driven digital human model
CN115393480A (zh) Talking-head synthesis method and apparatus based on dynamic neural textures, and storage medium
CN117036583A (zh) Video generation method and apparatus, storage medium, and computer device
CN111489405B (zh) Face sketch synthesis system based on a condition-enhanced generative adversarial network
CN117671764A (zh) Transformer-based dynamic talking face image generation system and method
CN115082636B (zh) Single-image three-dimensional reconstruction method and device based on a Gaussian mixture network
KR102543451B1 (ko) System for extracting and synthesizing image features using deep learning, and learning method thereof
CN114724209A (zh) Model training method, image generation method, apparatus, device, and medium
CN115482368A (zh) Method for editing a three-dimensional scene using semantic maps

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21909485

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21909485

Country of ref document: EP

Kind code of ref document: A1