CN112949707A - Cross-mode face image generation method based on multi-scale semantic information supervision - Google Patents

Cross-mode face image generation method based on multi-scale semantic information supervision

Info

Publication number
CN112949707A
CN112949707A (application CN202110218611.3A)
Authority
CN
China
Prior art keywords
face
image
target
representing
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110218611.3A
Other languages
Chinese (zh)
Other versions
CN112949707B (en)
Inventor
王楠楠
杨玥颖
郝毅
朱明瑞
李洁
高新波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110218611.3A priority Critical patent/CN112949707B/en
Publication of CN112949707A publication Critical patent/CN112949707A/en
Application granted granted Critical
Publication of CN112949707B publication Critical patent/CN112949707B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a cross-modal face image generation method based on multi-scale semantic information supervision, which comprises the following steps: converting a source-modality face image to be processed into a preliminary generated image of the target-modality face; extracting depth features from the source-modality face image to be processed to obtain multi-scale depth features; fusing, according to the structural characteristics of the face, the multi-scale depth features with the face semantic labels of the source-modality face image to obtain multi-scale semantic-information depth features; and inputting the preliminary generated image of the target-modality face into the generator of a target generation model for feature-space encoding and feature-space decoding in turn, with the multi-scale semantic-information depth features providing auxiliary supervision, to obtain the generated image of the target-modality face. The method markedly enhances the detail information around the contours of the facial features, captures the texture details of the facial structure more effectively, and thereby improves the realism of the generated image.

Description

Cross-mode face image generation method based on multi-scale semantic information supervision
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a cross-modal face image generation method based on multi-scale semantic information supervision.
Background
With the development of social informatization, the face image has become one of the most widely used forms of human identity authentication information. Because face information can be acquired in many different ways, face images captured in different forms constitute different modalities, and the task of converting face images between modalities is called cross-modal face image generation. Cross-modal face image generation enriches the representation of facial content across modalities while preserving the shared facial information as far as possible, and has broad application value and important research significance in fields such as public security and digital entertainment.
Sample-based cross-modal face image generation methods dominated early research in this field. Their main idea is to mine the correspondence between the input test image and the source-modality images in the training set, and then directly assemble sample images or image patches from the training set into the output image of the test image in the target modality. However, because patch stitching usually relies on mean smoothing and the number of training images is limited, the output images of such methods tend to be blurred and deformed.
Zhang et al., in the document "Liliang Zhang, Liang Lin, Xian Wu, et al., 'End-to-end photo-sketch generation via fully convolutional representation learning', in ACM ICMR, 2015, pp. 627-634", constructed an end-to-end fully convolutional neural network to model the nonlinear mapping between face photos and sketches, bringing cross-modal face image generation into the deep-learning-based research stage. However, the network is structurally simple and shallow, and has difficulty capturing the changes of texture detail between modalities, so the generated images are not ideal.
Because of the powerful capability that GANs (Generative Adversarial Networks) have shown on sharp image generation, researchers have further studied cross-modal image generation with GAN-based models. The conditional generative adversarial network model proposed by Isola et al. in "Phillip Isola, Jun-Yan Zhu, et al., 'Image-to-image translation with conditional adversarial networks', in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125-1134" can perform a variety of image-to-image translation tasks and can also be applied to cross-modal face image generation with good results. However, existing deep convolutional neural networks based on encoder-decoder structures suffer from loss of facial structure and neglect the detailed representation of the features around the facial features, so texture is lost in the generated results.
In summary, existing cross-modal face image generation methods cannot capture the texture details of the facial structure well, so the generated images are not ideal.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a cross-modal face image generation method based on multi-scale semantic information supervision. The technical problem to be solved by the invention is addressed by the following technical solutions:
the embodiment of the invention provides a cross-modal face image generation method based on multi-scale semantic information supervision, which comprises the following steps:
s1, converting the source mode face image to be processed into a target mode face primary generation image;
s2, performing depth feature extraction on the source mode face image to be processed to obtain multi-scale depth features;
s3, performing feature fusion on the multi-scale depth features and the face semantic labels of the source modal face images according to the facial structure characteristics to obtain multi-scale semantic information depth features;
and S4, inputting the primary target modal face generated image into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the multi-scale semantic information depth features to obtain a target modal face generated image.
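For illustration only, the four steps S1 to S4 can be chained at inference time as in the following minimal sketch; all callables in it (preliminary_generator, autoencoder_encoder, build_semantic_info_features, generator) are hypothetical placeholders standing in for the components described above, not names used by this disclosure.

```python
# A minimal sketch of how steps S1-S4 could be chained at inference time.
# All callables are hypothetical placeholders for the components described
# in this disclosure.
def generate_cross_modal_face(source_img, preliminary_generator,
                              autoencoder_encoder,
                              build_semantic_info_features, generator):
    # S1: convert the source-modality face image into a preliminary
    #     generated image of the target-modality face.
    preliminary = preliminary_generator(source_img)
    # S2: extract multi-scale depth features from the source-modality image.
    depth_feats = autoencoder_encoder(source_img)
    # S3: fuse the depth features with the face semantic labels to obtain
    #     the multi-scale semantic-information depth features.
    sem_feats = build_semantic_info_features(source_img, depth_feats)
    # S4: encode and decode the preliminary image in feature space under
    #     auxiliary supervision of the semantic-information depth features.
    return generator(preliminary, sem_feats)
```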
In one embodiment of the present invention, step S1 is preceded by the steps of:
a number of source modality-target modality face image pairs are acquired.
In one embodiment of the present invention, step S2 includes:
inputting the source mode face images in the source mode-target mode face image pairs into a self-encoder for reconstruction to obtain reconstructed source mode face images;
calculating a reconstruction loss function of the source modal face image and the reconstructed source modal face image, and training the self-encoder by using the reconstruction loss function to obtain a trained self-encoder;
and carrying out deep feature extraction on the source mode face image to be processed by utilizing the trained self-encoder to obtain the multi-scale depth features.
In one embodiment of the invention, the reconstruction loss function is:

L_AE(θ_AE) = || Î_x - I_x ||_1

where θ_AE denotes the model parameters of the self-encoder, Î_x denotes the reconstructed source-modality face image, and I_x denotes the input source-modality face image.
In one embodiment of the present invention, step S3 includes:
s31, extracting the face semantic label of the source mode face image;
s32, selecting a face area to be enhanced from the face semantic labels according to the facial structure characteristics, and constructing a multi-scale semantic mask set of the source modal face image by using the face area to be enhanced;
s33, selecting target multi-scale depth features from the multi-scale depth features, and performing feature fusion on the target multi-scale depth features and the multi-scale semantic mask set to obtain the multi-scale semantic information depth features.
In one embodiment of the invention, the facial region to be enhanced includes a facial skin region, a left ear region, a right ear region, and a neck region.
In one embodiment of the present invention, step S4 includes:
s41, inputting the preliminary generated image of the target modal face into an encoder of the generator for down-sampling operation, and performing feature fusion on output features in the down-sampling operation process and the depth features of the multi-scale semantic information to obtain the depth features of the preliminary generated image of the target modal face;
and S42, inputting the depth features of the preliminary generated image of the target modal face into a decoder of the generator for up-sampling operation, and performing feature fusion on the output features in the up-sampling operation process and the output features corresponding to the down-sampling operation to obtain the generated image of the target modal face.
In an embodiment of the present invention, step S4 is followed by the steps of:
s5, combining the target modal face image in the source modal-target modal face image pair, judging the distribution similarity degree of the target modal face generated image by using a discriminator in the target generation model, calculating a discriminant loss function according to the distribution similarity degree, and then updating the parameter of the discriminator by using the discriminant loss function;
s6, updating the parameters of the generator by using the generation loss function of the generator, the adversarial loss function between the generator and the discriminator, and the fusion loss function of the generator and the self-encoder;
and S7, obtaining the trained target generation model when the discriminator and the generator reach an adversarial equilibrium.
In one embodiment of the present invention, the discriminator loss function is:

L_D(θ_D) = -E[ log D(I_x, I_y) ] - E[ log(1 - D(I_x, Î_y)) ]

where θ_D denotes the parameters of the discriminator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator;

the generation loss function is:

L_G(θ_G) = || I_y - Î_y ||_1

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, and Î_y denotes the generated target-modality face image;

the adversarial loss function is:

L_GAN(θ_G) = -E[ log D(I_x, Î_y) ]

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator;

the fusion loss function is:

L_f(θ_G, θ_AE) = L_GAN + L_G + L_AE

where L_GAN denotes the adversarial loss function, L_G denotes the generation loss function, L_AE denotes the reconstruction loss function of the self-encoder, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, θ_G denotes the parameters of the generator, and θ_AE denotes the parameters of the self-encoder.
Compared with the prior art, the invention has the following beneficial effects:
The cross-modal face image generation method supervises the cross-modal generation process with the multi-scale semantic-information depth features. As a result, the generated target-modality face image expresses rich facial content while preserving the shared facial information, the detail information around the contours of the facial features (such as the eye corners and eyelids) is markedly enhanced, the texture details of the facial structure are captured more effectively, the accuracy and continuity of the facial structure of the source-modality face image are preserved, and the realism of the generated image is further improved.
Drawings
Fig. 1 is a schematic flow diagram of a cross-modal face image generation method based on multi-scale semantic information supervision according to an embodiment of the present invention;
fig. 2 is a comparison diagram of simulation results provided by the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a cross-modal face image generation method based on multi-scale semantic information supervision according to an embodiment of the present invention. The cross-mode face image generation method comprises the following steps:
and S1, acquiring a plurality of source modality-target modality face image pairs.
Specifically, from a public cross-modality face database, M source-modality images of different face subjects and the M corresponding target-modality images are selected to form M source-modality-target-modality face image pairs. That is, the source-modality image A and the target-modality image B of the same person form one source-modality-target-modality face image pair, so the images of M persons form M such pairs. The remaining source-modality-target-modality face image pairs in the database are used for testing.
The source-modality face image I_x serves as the input of the model, and the target-modality face image I_y is used to measure the content and structural similarity of the finally generated target-modality face image.
And S2, converting the source mode face image to be processed into a target mode face primary generation image.
In this embodiment, a face photo-portrait synthesis method based on a probability graph model is used to convert the input source-modality face image into the preliminary generated image of the target-modality face.
Specifically, the input source-modality face image I_x is processed by image blocking, nearest-neighbour search, weight combination optimised by a Markov-weight-field probability graph model, and image stitching, in sequence, to obtain the preliminary generated image of the target-modality face. The specific implementation of this method is prior art and is not described in detail here.
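For illustration, the following is a greatly simplified, hedged sketch of such a patch-based conversion: it keeps only image blocking, nearest-neighbour search over co-located training patches and mean-smoothed stitching, and omits the Markov-weight-field probability-graph-model optimisation of the combination weights used by the actual method; all names and parameter values are illustrative assumptions.

```python
# A greatly simplified sketch of the patch-based conversion step; the
# Markov-weight-field weight optimisation of the actual method is omitted.
import numpy as np

def preliminary_synthesis(src, train_src, train_tgt, patch=16, stride=8):
    """src: H x W source-modality image; train_src / train_tgt: lists of
    aligned H x W training images in the source / target modality."""
    h, w = src.shape
    out = np.zeros((h, w), dtype=np.float64)
    cnt = np.zeros((h, w), dtype=np.float64)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            q = src[y:y + patch, x:x + patch]
            best, best_d = None, np.inf
            # nearest-neighbour search over co-located training patches
            for s_img, t_img in zip(train_src, train_tgt):
                d = np.sum((s_img[y:y + patch, x:x + patch] - q) ** 2)
                if d < best_d:
                    best_d = d
                    best = t_img[y:y + patch, x:x + patch]
            out[y:y + patch, x:x + patch] += best
            cnt[y:y + patch, x:x + patch] += 1.0
    return out / np.maximum(cnt, 1.0)   # mean smoothing over overlaps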
And S3, performing depth feature extraction on the source mode face image to be processed to obtain the multi-scale depth feature of the source mode face image.
In this embodiment, the source-modality face image I_x is input into a self-encoder AE with a U-shaped structure for self-representation learning, yielding the multi-scale depth features φ_l(I_x) of the source-modality face image, where l denotes the index of the selected network layer. Specifically, the source-modality face images in the several source-modality-target-modality face image pairs are first input into the self-encoder for reconstruction to obtain reconstructed source-modality face images; the reconstruction loss between the source-modality face images and the reconstructed source-modality face images is then computed and used to train the self-encoder; finally, the source-modality face image to be processed is input into the trained self-encoder for depth feature extraction, yielding the multi-scale depth features.
In a specific embodiment, the self-encoder comprises an encoder and a decoder connected in series.
The encoder performs depth feature extraction on the input source-modality face image I_x. It comprises several encoding sub-modules connected in sequence, denoted {E_l, l = 1, ..., L}, where L is the number of encoding sub-modules. The number of encoding sub-modules may be chosen according to actual requirements; in this embodiment L = 5. Specifically, the first encoding sub-module E_1 comprises a convolution layer with kernel size 4 and stride 2 and an activation function layer, which filter the source-modality face image I_x; its output feature is denoted φ_1(I_x). Each of the second to fifth encoding sub-modules E_l (l = 2, 3, 4, 5) comprises a convolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer; the output of E_l is denoted φ_l(I_x), and its input is the output φ_{l-1}(I_x) of the previous encoding sub-module E_{l-1}.
The decoder has a structure symmetric to that of the encoder and reconstructs the depth features extracted by the encoder into a reconstructed source-modality face image Î_x. The decoder comprises several decoding sub-modules connected in sequence, denoted {D_l, l = L, L-1, ..., 1}, where L is the number of decoding sub-modules. The number of decoding sub-modules may be chosen according to actual requirements; in this embodiment L = 5. The decoding sub-modules numbered 5 to 2 have the same structure: each D_l (l = 5, 4, 3, 2) comprises a deconvolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer. The input of the decoding sub-module numbered 5 is the encoder output φ_5(I_x); the input of each of the other decoding sub-modules is the fusion of the output of the previous decoding sub-module D_{l+1} and the output φ_l(I_x) of the encoding sub-module E_l with the corresponding number. The deconvolution module numbered 1, i.e. the last decoding sub-module, comprises a deconvolution layer with kernel size 4 and stride 2 and an activation function layer, and upsamples the feature map output by the previous layer into the reconstructed source-modality image Î_x.
Further, the self-encoder is trained as follows. First, the source-modality face images in the several source-modality-target-modality face image pairs are input into the encoder for depth feature extraction; the extracted depth features are then input into the decoder to obtain the reconstructed source-modality face image Î_x; finally, the reconstruction loss between the reconstructed source-modality face image Î_x and the input source-modality face image is computed and used to train the self-encoder. Specifically, the reconstruction loss of the source-modality face image is based on the minimum absolute-value error between the reconstructed source-modality face image and the input source-modality face image, and the reconstruction loss function is expressed as:

L_AE(θ_AE) = || Î_x - I_x ||_1

where θ_AE denotes the model parameters of the self-encoder, Î_x denotes the reconstructed source-modality face image, and I_x denotes the input source-modality face image.
It should be noted that, when the self-encoder is trained, both the encoder and the decoder in the self-encoder perform corresponding steps; and when the trained self-encoder is used for testing, only the encoder in the self-encoder is used for extracting the multi-scale depth features.
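For illustration, a minimal PyTorch sketch of a U-shaped self-encoder of the kind described above is given below: five encoding and five decoding sub-modules with 4 x 4 kernels and stride 2, the encoder features φ_1 to φ_5 fused into the decoder, and an L1 reconstruction loss. The channel widths, the use of channel concatenation as the fusion, and the activation-function choices are assumptions made here for concreteness, not details fixed by this embodiment.

```python
# A minimal sketch of the U-shaped self-encoder; widths and activations are
# illustrative assumptions.
import torch
import torch.nn as nn

class UShapedAE(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8, base * 8]
        enc, prev = [], in_ch
        for i, c in enumerate(chs):            # E1..E5 (E1 has no norm layer)
            layers = [nn.Conv2d(prev, c, 4, 2, 1)]
            if i > 0:
                layers.append(nn.BatchNorm2d(c))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            enc.append(nn.Sequential(*layers))
            prev = c
        self.enc = nn.ModuleList(enc)
        dec = []
        for i in range(4, 0, -1):              # D5..D2
            in_c = chs[i] if i == 4 else chs[i] * 2
            dec.append(nn.Sequential(
                nn.ConvTranspose2d(in_c, chs[i - 1], 4, 2, 1),
                nn.BatchNorm2d(chs[i - 1]),
                nn.ReLU(inplace=True)))
        self.dec = nn.ModuleList(dec)
        self.out = nn.Sequential(               # D1: upsample to the image
            nn.ConvTranspose2d(chs[0], in_ch, 4, 2, 1),
            nn.Tanh())

    def forward(self, x):
        feats = []
        for e in self.enc:                      # collect phi_1..phi_5
            x = e(x)
            feats.append(x)
        y = feats[-1]
        for i, d in enumerate(self.dec):
            # D5 takes phi_5 directly; D4..D2 take [previous output, phi_l]
            y = d(y if i == 0 else torch.cat([y, feats[4 - i]], dim=1))
        return self.out(y), feats

# L1 reconstruction loss used to train the self-encoder
def ae_loss(recon, x):
    return torch.mean(torch.abs(recon - x))
```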
And S4, performing feature fusion on the multi-scale depth features and the face semantic labels of the source mode face images according to the facial structure characteristics to obtain the multi-scale semantic information depth features of the source mode face images.
And S41, extracting the face semantic label of the face image in the source mode.
In a specific embodiment, a semantic segmentation model BiSeNet trained on the CelebA-HQ database is used to extract the face semantic labels of the source-modality face image, denoted {m_i, i = 1, ..., N}, where N denotes the total number of face-component labels; in this embodiment N = 19.
S42, selecting a face area to be enhanced from the face semantic labels according to the facial structure characteristics, and constructing a multi-scale semantic mask set of the source mode face image by utilizing the face area to be enhanced.
Specifically, the facial regions to be enhanced include the details around the facial features, such as the double eyelids around the eyes. In this embodiment, the facial skin region, the left ear region, the right ear region and the neck region outside the facial features are selected as the facial regions to be enhanced, and these regions are denoted m_1, m_7, m_8 and m_14. After the facial regions to be enhanced are selected, they are used to construct the binary face-semantic-label mask M_x of the source-modality face image I_x, i.e. the mask value is 1 inside the regions to be enhanced and 0 elsewhere. The binary mask M_x is then transformed by nearest-neighbour interpolation to obtain the multi-scale semantic mask set {M_x^s, s ∈ {128, 64, 32}} with sizes 128 × 128, 64 × 64 and 32 × 32.
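A hedged sketch of this mask construction is given below; it assumes that the face-parsing label map uses CelebAMask-HQ-style numeric region ids 1, 7, 8 and 14 for the facial skin, left ear, right ear and neck regions, which may differ from the label mapping actually used.

```python
# Build the binary mask over the regions to be enhanced and resize it to
# 128/64/32 with nearest-neighbour interpolation; label ids are assumptions.
import torch
import torch.nn.functional as F

def build_mask_set(label_map, region_ids=(1, 7, 8, 14), sizes=(128, 64, 32)):
    """label_map: (B, H, W) integer tensor of face semantic labels."""
    mask = torch.zeros_like(label_map, dtype=torch.float32)
    for rid in region_ids:
        mask = torch.where(label_map == rid, torch.ones_like(mask), mask)
    mask = mask.unsqueeze(1)                          # (B, 1, H, W)
    return {s: F.interpolate(mask, size=(s, s), mode='nearest')
            for s in sizes}
```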
S43, selecting target multi-scale depth features from the multi-scale depth features, and performing feature fusion on the target multi-scale depth features and the multi-scale semantic mask set to obtain the multi-scale semantic information depth features.
In this embodiment, the target multi-scale depth features need to contain richer detail information of the face regions. The encoder in the self-encoder is a convolutional neural network in which the first encoding sub-module filters the source-modality face image, the feature maps from the second encoding sub-module onwards learn more image detail, and, as the network deepens, the feature maps shrink and the detail information gradually decreases. Therefore, in this embodiment the multi-scale depth features output by the encoding sub-modules between the first and the last are selected as the target multi-scale depth features. In one embodiment, when the number of encoding sub-modules is 5, the depth features {φ_l(I_x), l = 2, 3, 4} of the source-modality face image output by the second to fourth encoding sub-modules are selected as the target multi-scale depth features.
Further, the target multi-scale depth features are fused with the multi-scale semantic mask set {M_x^s, s ∈ {128, 64, 32}} by element-wise multiplication, each feature being multiplied by the mask of the matching spatial size, to obtain the multi-scale semantic-information depth features.
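The following minimal sketch illustrates this element-wise multiplication; pairing each selected feature with the mask of the same spatial size is the assumption made here for illustration.

```python
# Element-wise multiplication of selected depth features with the semantic
# masks (broadcast over channels); layer-to-mask pairing is by spatial size.
def semantic_info_features(depth_feats, mask_set):
    """depth_feats: dict {layer_index: (B, C, H, W) tensor};
    mask_set: dict {size: (B, 1, size, size) tensor}."""
    fused = {}
    for l, feat in depth_feats.items():
        s = feat.shape[-1]
        if s in mask_set:                 # e.g. layers 2-4 <-> 128/64/32
            fused[l] = feat * mask_set[s]
    return fused
```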
In this embodiment, fusing the multi-scale semantic information with the target depth features enhances the detail features of the contour regions around all the facial features, so that the texture details of the facial structure are captured better. The subsequently generated target-modality face image therefore expresses rich facial content while preserving the shared facial information, retains the accuracy and continuity of the facial structure of the source-modality face image, and is more realistic.
And S5, inputting the primary generated image of the target modal face into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the depth features of the multi-scale semantic information to obtain the generated image of the target modal face.
And S51, inputting the preliminary generated image of the target modal face into an encoder of the generator for down-sampling operation, and performing feature fusion on output features and multi-scale semantic information depth features in the down-sampling operation process to obtain the depth features of the preliminary generated image of the target modal face.
Specifically, the encoder of the generator comprises a plurality of encoding sub-modules which are connected in sequence and used for performing down-sampling operation on the preliminarily generated image of the target modal face; the number of the coding sub-modules is determined according to the requirement, and the deeper feature representation of the image can be obtained when the number of the coding sub-modules is larger.
In this embodiment, the encoder of the generator consists of eight encoding sub-modules connected in sequence. The first encoding sub-module comprises a convolution layer with kernel size 4 and stride 2 and an activation function layer, connected in sequence, which convolve and downsample the input preliminary generated image of the target-modality face. Each of the second to eighth encoding sub-modules comprises a convolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer, connected in sequence. The inputs of the second to fourth encoding sub-modules are the concatenation of the output of the previous encoding sub-module with the semantic-information depth feature of the corresponding scale; the inputs of the fifth to eighth encoding sub-modules are the outputs of their previous sub-modules. The output of the eighth encoding sub-module is the depth feature of the preliminary generated image of the target-modality face.
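For illustration, a PyTorch sketch of such a generator encoder is given below; the channel widths, the channel counts of the concatenated semantic-information depth features (sem_chs), and the activation functions are assumptions, not details fixed by this embodiment.

```python
# Generator encoder sketch: eight 4x4 stride-2 encoding sub-modules;
# sub-modules 2-4 additionally take the semantic-information depth feature
# of matching spatial size. Widths are illustrative assumptions.
import torch
import torch.nn as nn

class GeneratorEncoder(nn.Module):
    def __init__(self, in_ch=3, base=64, sem_chs=(128, 256, 512)):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8,
               base * 8, base * 8, base * 8, base * 8]      # E1..E8 widths
        blocks, prev = [], in_ch
        for i, c in enumerate(chs):
            extra = sem_chs[i - 1] if 1 <= i <= 3 else 0    # E2-E4 only
            layers = [nn.Conv2d(prev + extra, c, 4, 2, 1)]
            if i > 0:
                layers.append(nn.BatchNorm2d(c))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            blocks.append(nn.Sequential(*layers))
            prev = c
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x, sem_feats):
        """x: preliminary target-modality image; sem_feats: the three
        semantic-information depth features, largest spatial size first."""
        skips = []
        for i, blk in enumerate(self.blocks):
            if 1 <= i <= 3:               # concatenate before E2, E3, E4
                x = torch.cat([x, sem_feats[i - 1]], dim=1)
            x = blk(x)
            skips.append(x)
        return x, skips                   # deepest feature + E1..E8 outputs
```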
And S52, inputting the depth features of the preliminary generated image of the target modal face into a decoder of the generator for up-sampling operation, and performing feature fusion on the output features in the up-sampling operation process and the output features corresponding to the down-sampling operation to obtain the generated image of the target modal face.
Specifically, the decoder of the generator comprises a plurality of decoding sub-modules which are connected in sequence and used for performing up-sampling operation on the preliminarily generated image of the target modal face; the number of decoding sub-modules corresponds to the number of encoding sub-modules one to one.
In this embodiment, the decoder of the generator consists of eight decoding sub-modules. The decoding sub-modules numbered 8 to 2 have a similar structure: each comprises a deconvolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer, connected in sequence. The input of the sub-module numbered 8 is the feature output by the encoder, i.e. the depth feature of the preliminary generated image of the target-modality face; the input of each of the other decoding sub-modules is the concatenation of the output of the previous decoding sub-module with the output of the encoder sub-module of the corresponding number. The decoding sub-module numbered 1, i.e. the last decoding sub-module, comprises a deconvolution layer with kernel size 4 and stride 2 and an activation function layer, connected in sequence, and upsamples the feature map output by the previous layer into the final generated image of the target-modality face.
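A corresponding PyTorch sketch of the generator decoder is given below; the channel widths and the Tanh output activation are assumptions made for illustration.

```python
# Generator decoder sketch: eight 4x4 stride-2 decoding sub-modules; D8
# takes the deepest encoder feature, D7..D2 take skip concatenations, and
# D1 upsamples to the final image. Widths are illustrative assumptions.
import torch
import torch.nn as nn

class GeneratorDecoder(nn.Module):
    def __init__(self, out_ch=3, base=64):
        super().__init__()
        enc_chs = [base, base * 2, base * 4, base * 8,
                   base * 8, base * 8, base * 8, base * 8]  # E1..E8 widths
        blocks = []
        for i in range(7, 0, -1):                           # D8..D2
            in_c = enc_chs[i] if i == 7 else enc_chs[i] * 2
            blocks.append(nn.Sequential(
                nn.ConvTranspose2d(in_c, enc_chs[i - 1], 4, 2, 1),
                nn.BatchNorm2d(enc_chs[i - 1]),
                nn.ReLU(inplace=True)))
        self.blocks = nn.ModuleList(blocks)
        self.out = nn.Sequential(                            # D1
            nn.ConvTranspose2d(enc_chs[0], out_ch, 4, 2, 1),
            nn.Tanh())

    def forward(self, deepest, skips):
        """deepest: the E8 output; skips: list of E1..E8 outputs."""
        y = deepest
        for i, blk in enumerate(self.blocks):
            # D8 takes the deepest feature; D7..D2 take [previous, E_l]
            y = blk(y if i == 0 else torch.cat([y, skips[7 - i]], dim=1))
        return self.out(y)
```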
S6, combining the target mode face image in the source mode-target mode face image pair, judging the distribution similarity degree of the target mode face generated image by using the discriminator in the target generation model, calculating a discriminant loss function according to the distribution similarity degree, and then updating the parameter of the discriminator by using the discriminant loss function.
In this embodiment, the discriminator consists of 5 convolution layers and 5 activation function layers connected alternately, each convolution layer having a kernel size of 4 and a stride of 2.
Specifically, the target-modality face image and the generated target-modality face image are input into the discriminator separately; the discriminator outputs prediction values of 1 and 0 representing real and fake, and these prediction values serve as the distribution similarity of the generated target-modality face image.
The loss function of the discriminator is a binary classification loss, where the source-modality face image I_x is additionally fed to the discriminator as part of its input in order to guide the generation process of the generator more effectively. The specific loss function is:

L_D(θ_D) = -E[ log D(I_x, I_y) ] - E[ log(1 - D(I_x, Î_y)) ]

where θ_D denotes the model parameters of the discriminator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, D denotes the discriminator, and D(I_x, I_y) and D(I_x, Î_y) denote the distribution-similarity outputs.
In this embodiment, the discriminator loss function is minimized and the parameters of the discriminator are updated by stochastic gradient descent, which is prior art and is not described in detail in this embodiment.
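For illustration, a hedged sketch of such a conditional discriminator and its binary-classification loss is given below; the channel widths, the sigmoid output and the binary cross-entropy form of the loss are assumptions consistent with, but not dictated by, the description above.

```python
# Conditional discriminator sketch: five 4x4 stride-2 convolutions
# alternating with five activation layers; the source-modality image is
# concatenated with the real or generated target-modality image.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_ch=6, base=64):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8, 1]
        layers, prev = [], in_ch
        for c in chs:                      # 5 conv layers + 5 activations
            layers.append(nn.Conv2d(prev, c, 4, 2, 1))
            layers.append(nn.Sigmoid() if c == 1
                          else nn.LeakyReLU(0.2, inplace=True))
            prev = c
        self.net = nn.Sequential(*layers)

    def forward(self, src, tgt):
        return self.net(torch.cat([src, tgt], dim=1))

def discriminator_loss(D, src, real_tgt, fake_tgt):
    bce = nn.BCELoss()
    real = D(src, real_tgt)
    fake = D(src, fake_tgt.detach())       # do not backpropagate into G
    return (bce(real, torch.ones_like(real))
            + bce(fake, torch.zeros_like(fake)))
```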
And S7, updating the parameters of the generator by using the generation loss function of the generator, the adversarial loss function between the generator and the discriminator, and the fusion loss function of the generator and the self-encoder.
Specifically, the generation loss function measures the generation loss of the generated target-modality face image. This loss is based on the mean absolute error between the generated target-modality face image and the target-modality face image in the source-modality-target-modality face image pair, so the generation loss function of the generator is:

L_G(θ_G) = || I_y - Î_y ||_1

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, and Î_y denotes the generated target-modality face image.
The adversarial loss of the target-modality face image is designed so that the generated result can deceive the discriminator, i.e. the generator and the discriminator constrain each other adversarially. In this embodiment, the adversarial loss function is computed from the distribution similarity output by the discriminator, and is expressed as:

L_GAN(θ_G) = -E[ log D(I_x, Î_y) ]

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator.
The fusion loss function of the generator and the self-encoder is:

L_f(θ_G, θ_AE) = L_GAN + L_G + L_AE

where L_GAN denotes the adversarial loss function, L_G denotes the generation loss function, L_AE denotes the reconstruction loss function of the self-encoder, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, θ_G denotes the parameters of the generator, and θ_AE denotes the parameters of the self-encoder.
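The following minimal sketch computes the three generator-side losses and their fusion; the equal weighting of the three terms is an assumption made here, since no weighting coefficients are given in the description.

```python
# Generator-side losses: L_G (mean absolute error), L_GAN (fool the
# discriminator), L_AE (self-encoder reconstruction), fused by summation.
import torch
import torch.nn as nn

def generator_losses(D, src, real_tgt, fake_tgt, ae_recon):
    bce = nn.BCELoss()
    l_g = torch.mean(torch.abs(real_tgt - fake_tgt))          # L_G
    pred = D(src, fake_tgt)
    l_gan = bce(pred, torch.ones_like(pred))                  # L_GAN
    l_ae = torch.mean(torch.abs(ae_recon - src))              # L_AE
    fusion = l_gan + l_g + l_ae                               # L_f
    return fusion, {'L_G': l_g, 'L_GAN': l_gan, 'L_AE': l_ae}
```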
Further, the fusion loss function L_f and the discriminator loss function L_D are updated alternately to train all parameters of the model. Specifically, following the adversarial idea of GANs, the parameters of the generator and the discriminator are updated alternately, fixing one side while iteratively updating the other: first, the generator parameters θ_G and θ_AE are fixed and the discriminator parameters θ_D are updated by stochastic gradient descent until the fluctuation of the discriminator loss L_D falls within a small range; then the discriminator parameters θ_D are fixed and the generator parameters θ_G and θ_AE are updated by stochastic gradient descent until the fluctuation of the fusion loss L_f falls within a small range.
And S8, obtaining the trained target generation model when the discriminator and the generator reach an adversarial equilibrium.
Specifically, through the alternate training of the generator and the discriminator, when the two reach a certain adversarial equilibrium, i.e. the target generation model has converged, the training of the target generation model is complete and the trained target generation model is obtained. The adversarial equilibrium manifests itself as follows: when the generator is fixed and the discriminator is updated, the discriminator loss no longer changes noticeably; conversely, when the discriminator is fixed and the generator is updated, the fusion loss no longer changes noticeably.
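A hedged sketch of this alternating update scheme, reusing the loss helpers sketched above, is given below; the optimiser choice, the learning rate and the decision to take a single step per side per batch are assumptions, and G denotes a generator composed of the encoder and decoder sketched above, AE the self-encoder and D the discriminator.

```python
# Alternating updates: fix G/AE and update D, then fix D and update G/AE.
import torch

def train(G, AE, D, loader, epochs=100, lr=2e-4):
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_g = torch.optim.Adam(list(G.parameters()) + list(AE.parameters()),
                             lr=lr)
    for _ in range(epochs):
        for src, real_tgt, preliminary, sem_feats in loader:
            # 1) fix the generator side, update the discriminator
            fake_tgt = G(preliminary, sem_feats)
            loss_d = discriminator_loss(D, src, real_tgt, fake_tgt)
            opt_d.zero_grad()
            loss_d.backward()
            opt_d.step()
            # 2) fix the discriminator, update generator and self-encoder
            fake_tgt = G(preliminary, sem_feats)
            ae_recon, _ = AE(src)
            loss_f, _ = generator_losses(D, src, real_tgt, fake_tgt, ae_recon)
            opt_g.zero_grad()
            loss_f.backward()
            opt_g.step()
```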
And S9, performing cross-modal face generation on the source modal face image to be processed by using the trained target generation model.
And S91, converting the source mode face image to be processed into a target mode face primary generation image.
S92, performing depth feature extraction on the source mode face image to be processed to obtain the multi-scale depth feature of the source mode face image to be processed.
And S93, performing feature fusion on the multi-scale depth features and the face semantic labels of the source mode face images according to the facial structure characteristics to obtain the multi-scale semantic information depth features of the source mode face images.
And S94, inputting the primary target modal face generated image into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the multi-scale semantic information depth features to obtain a target modal face generated image.
Please refer to fig. 1 and steps S2 to S5 for detailed operation steps of the above steps, which are not described herein again.
The cross-modal face image generation method supervises the cross-modal generation process with the multi-scale semantic-information depth features. As a result, the generated target-modality face image expresses rich facial content while preserving the shared facial information, the detail information around the contours of the facial features (such as the eye corners and eyelids) is markedly enhanced, the texture details of the facial structure are captured more effectively, the accuracy and continuity of the facial structure of the source-modality face image are preserved, and the realism of the generated image is further improved.
Example two
On the basis of the first embodiment, the effect of the cross-modal face image generation method based on multi-scale semantic information supervision is further explained through a simulation experiment.
(1) Simulation conditions
The simulation experiments are carried out with the PyTorch framework on a platform with an Intel(R) Core(TM) i7-8700K 3.70 GHz CPU as the central processing unit, an NVIDIA GeForce RTX 2080 Ti GPU, and the Linux Mint 18.3 Sylvia operating system. The CUFS database is used as the image database.
The following methods are used in the simulation experiments. Method 1: the method based on the probability graph model, denoted MWF in the experiments; Method 2: the method based on conditional generative adversarial networks, denoted pix2pix; Method 3: the method based on a multi-scale adversarial network, denoted PS2MAN; Method 4: the method based on knowledge transfer, denoted KT; Method 5: the method assisted by face components, denoted SCA-GAN. Methods 1 to 5 are all prior art and are not described in detail in this embodiment.
(2) Emulated content
Source-modality face images and the target-modality face images of the same subjects are selected from the test set of the database to form source-modality-target-modality face image data pairs, and cross-modal generation is performed with the cross-modal face image generation method based on multi-scale semantic information supervision of this embodiment and with the 5 prior-art methods, respectively.
Referring to fig. 2, fig. 2 is a comparison diagram of the simulation results provided by the embodiment of the present invention, where Inputs are the input source-modality face images, Outputs are the results of the cross-modal face image generation method based on multi-scale semantic information supervision of this embodiment, and GroundTruth are the real target-modality face images corresponding to the source-modality face images in the data pairs. As can be seen from fig. 2, the cross-modal face images generated by the method of the embodiment of the present invention retain the facial detail characteristics of the person, and the generated images have good structural integrity and consistency.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (9)

1. A cross-mode face image generation method based on multi-scale semantic information supervision is characterized by comprising the following steps:
s1, converting the source mode face image to be processed into a target mode face primary generation image;
s2, performing depth feature extraction on the source mode face image to be processed to obtain multi-scale depth features;
s3, performing feature fusion on the multi-scale depth features and the face semantic labels of the source modal face images according to the facial structure characteristics to obtain multi-scale semantic information depth features;
and S4, inputting the primary target modal face generated image into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the multi-scale semantic information depth features to obtain a target modal face generated image.
2. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 1, wherein step S1 is preceded by the steps of:
a number of source modality-target modality face image pairs are acquired.
3. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 2, wherein the step S2 comprises:
inputting the source mode face images in the source mode-target mode face image pairs into a self-encoder for reconstruction to obtain reconstructed source mode face images;
calculating a reconstruction loss function of the source modal face image and the reconstructed source modal face image, and training the self-encoder by using the reconstruction loss function to obtain a trained self-encoder;
and carrying out deep feature extraction on the source mode face image to be processed by utilizing the trained self-encoder to obtain the multi-scale depth features.
4. The multi-scale semantic information supervision-based cross-modal face image generation method of claim 3, wherein the reconstruction loss function is:

L_AE(θ_AE) = || Î_x - I_x ||_1

where θ_AE denotes the model parameters of the self-encoder, Î_x denotes the reconstructed source-modality face image, and I_x denotes the input source-modality face image.
5. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 1, wherein the step S3 comprises:
s31, extracting the face semantic label of the source mode face image;
s32, selecting a face area to be enhanced from the face semantic labels according to the facial structure characteristics, and constructing a multi-scale semantic mask set of the source modal face image by using the face area to be enhanced;
s33, selecting target multi-scale depth features from the multi-scale depth features, and performing feature fusion on the target multi-scale depth features and the multi-scale semantic mask set to obtain the multi-scale semantic information depth features.
6. The cross-modal facial image generation method based on multi-scale semantic information supervision as claimed in claim 5, wherein the facial regions to be enhanced comprise a facial skin region, a left ear region, a right ear region and a neck region.
7. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 1, wherein the step S4 comprises:
s41, inputting the preliminary generated image of the target modal face into an encoder of the generator for down-sampling operation, and performing feature fusion on output features in the down-sampling operation process and the depth features of the multi-scale semantic information to obtain the depth features of the preliminary generated image of the target modal face;
and S42, inputting the depth features of the preliminary generated image of the target modal face into a decoder of the generator for up-sampling operation, and performing feature fusion on the output features in the up-sampling operation process and the output features corresponding to the down-sampling operation to obtain the generated image of the target modal face.
8. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 2, further comprising the following steps after step S4:
s5, combining the target modal face image in the source modal-target modal face image pair, judging the distribution similarity degree of the target modal face generated image by using a discriminator in the target generation model, calculating a discriminant loss function according to the distribution similarity degree, and then updating the parameter of the discriminator by using the discriminant loss function;
s6, updating the parameters of the generator by using the generation loss function of the generator, the adversarial loss function between the generator and the discriminator, and the fusion loss function of the generator and the self-encoder;
and S7, obtaining the trained target generation model when the discriminator and the generator reach an adversarial equilibrium.
9. The multi-scale semantic information supervision-based cross-modal face image generation method of claim 8, wherein the discriminator loss function is:

L_D(θ_D) = -E[ log D(I_x, I_y) ] - E[ log(1 - D(I_x, Î_y)) ]

where θ_D denotes the parameters of the discriminator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator;

the generation loss function is:

L_G(θ_G) = || I_y - Î_y ||_1

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, and Î_y denotes the generated target-modality face image;

the adversarial loss function is:

L_GAN(θ_G) = -E[ log D(I_x, Î_y) ]

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator;

the fusion loss function is:

L_f(θ_G, θ_AE) = L_GAN + L_G + L_AE

where L_GAN denotes the adversarial loss function, L_G denotes the generation loss function, L_AE denotes the reconstruction loss function of the self-encoder, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, θ_G denotes the parameters of the generator, and θ_AE denotes the parameters of the self-encoder.
CN202110218611.3A 2021-02-26 2021-02-26 Cross-modal face image generation method based on multi-scale semantic information supervision Active CN112949707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110218611.3A CN112949707B (en) 2021-02-26 2021-02-26 Cross-modal face image generation method based on multi-scale semantic information supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110218611.3A CN112949707B (en) 2021-02-26 2021-02-26 Cross-modal face image generation method based on multi-scale semantic information supervision

Publications (2)

Publication Number Publication Date
CN112949707A true CN112949707A (en) 2021-06-11
CN112949707B CN112949707B (en) 2024-02-09

Family

ID=76246481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110218611.3A Active CN112949707B (en) 2021-02-26 2021-02-26 Cross-modal face image generation method based on multi-scale semantic information supervision

Country Status (1)

Country Link
CN (1) CN112949707B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409377A (en) * 2021-06-23 2021-09-17 四川大学 Phase unwrapping method for generating countermeasure network based on jump connection
CN114187408A (en) * 2021-12-15 2022-03-15 中国电信股份有限公司 Three-dimensional face model reconstruction method and device, electronic equipment and storage medium
WO2023280065A1 (en) * 2021-07-09 2023-01-12 南京邮电大学 Image reconstruction method and apparatus for cross-modal communication system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688821A (en) * 2017-07-11 2018-02-13 西安电子科技大学 View-based access control model conspicuousness and across the modality images natural language description methods of semantic attribute
EP3511942A2 (en) * 2018-01-16 2019-07-17 Siemens Healthcare GmbH Cross-domain image analysis and cross-domain image synthesis using deep image-to-image networks and adversarial networks
CN110675316A (en) * 2019-08-29 2020-01-10 中山大学 Multi-domain image conversion method, system and medium for generating countermeasure network based on condition
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism
CN112270644A (en) * 2020-10-20 2021-01-26 西安工程大学 Face super-resolution method based on spatial feature transformation and cross-scale feature integration

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688821A (en) * 2017-07-11 2018-02-13 西安电子科技大学 View-based access control model conspicuousness and across the modality images natural language description methods of semantic attribute
EP3511942A2 (en) * 2018-01-16 2019-07-17 Siemens Healthcare GmbH Cross-domain image analysis and cross-domain image synthesis using deep image-to-image networks and adversarial networks
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change
CN110675316A (en) * 2019-08-29 2020-01-10 中山大学 Multi-domain image conversion method, system and medium for generating countermeasure network based on condition
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism
CN112270644A (en) * 2020-10-20 2021-01-26 西安工程大学 Face super-resolution method based on spatial feature transformation and cross-scale feature integration

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
柳欣; 李鹤洋; 钟必能; 杜吉祥: "Cross audio-visual speaker annotation with a supervised joint-consistency autoencoder" (结合有监督联合一致性自编码器的跨音视频说话人标注), Journal of Electronics & Information Technology (电子与信息学报), no. 07
魏?; 孙硕: "Research on algorithms for perceptually occluded face restoration with generative adversarial networks" (生成对抗网络进行感知遮挡人脸还原的算法研究), Journal of Chinese Computer Systems (小型微型计算机系统), no. 02
黄菲; 高飞; 朱静洁; 戴玲娜; 俞俊: "Heterogeneous face image synthesis based on generative adversarial networks: progress and challenges" (基于生成对抗网络的异质人脸图像合成:进展与挑战), Journal of Nanjing University of Information Science & Technology (Natural Science Edition) (南京信息工程大学学报(自然科学版)), no. 06

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409377A (en) * 2021-06-23 2021-09-17 四川大学 Phase unwrapping method for generating countermeasure network based on jump connection
CN113409377B (en) * 2021-06-23 2022-09-27 四川大学 Phase unwrapping method for generating countermeasure network based on jump connection
WO2023280065A1 (en) * 2021-07-09 2023-01-12 南京邮电大学 Image reconstruction method and apparatus for cross-modal communication system
US11748919B2 (en) 2021-07-09 2023-09-05 Nanjing University Of Posts And Telecommunications Method of image reconstruction for cross-modal communication system and device thereof
CN114187408A (en) * 2021-12-15 2022-03-15 中国电信股份有限公司 Three-dimensional face model reconstruction method and device, electronic equipment and storage medium
CN114187408B (en) * 2021-12-15 2023-04-07 中国电信股份有限公司 Three-dimensional face model reconstruction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112949707B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN108520503B (en) Face defect image restoration method based on self-encoder and generation countermeasure network
US11276231B2 (en) Semantic deep face models
CN112949707A (en) Cross-mode face image generation method based on multi-scale semantic information supervision
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN111932444A (en) Face attribute editing method based on generation countermeasure network and information processing terminal
CN111192201B (en) Method and device for generating face image and training model thereof, and electronic equipment
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
WO2023065503A1 (en) Facial expression classification method and electronic device
Sreekala et al. Capsule Network‐Based Deep Transfer Learning Model for Face Recognition
CN113553961B (en) Training method and device of face recognition model, electronic equipment and storage medium
CN116433898A (en) Method for segmenting transform multi-mode image based on semantic constraint
CN113392791A (en) Skin prediction processing method, device, equipment and storage medium
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
CN115147261A (en) Image processing method, device, storage medium, equipment and product
Gao A method for face image inpainting based on generative adversarial networks
Zhou et al. A superior image inpainting scheme using Transformer-based self-supervised attention GAN model
CN114049290A (en) Image processing method, device, equipment and storage medium
Liu et al. Learning shape and texture progression for young child face aging
CN113762117A (en) Training method of image processing model, image processing model and computer equipment
CN115631285B (en) Face rendering method, device, equipment and storage medium based on unified driving
Xu et al. Correlation via synthesis: end-to-end nodule image generation and radiogenomic map learning based on generative adversarial network
WO2023173827A1 (en) Image generation method and apparatus, and device, storage medium and computer program product
CN112990123B (en) Image processing method, apparatus, computer device and medium
Wu et al. Voice2mesh: Cross-modal 3d face model generation from voices
Xu et al. Correlation via synthesis: End-to-end image generation and radiogenomic learning based on generative adversarial network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant