CN112949707A - Cross-mode face image generation method based on multi-scale semantic information supervision - Google Patents
- Publication number: CN112949707A (application CN202110218611.3A)
- Authority
- CN
- China
- Legal status: Granted (the legal status is an assumption by Google and is not a legal conclusion)
Classifications
- G06F18/253 — Fusion techniques of extracted features (pattern recognition)
- G06N3/045 — Combinations of networks (neural networks)
- G06N3/08 — Learning methods (neural networks)
- G06V40/168 — Feature extraction; face representation (human faces)
- Y02D10/00 — Energy efficient computing
Abstract
The invention relates to a cross-modal face image generation method based on multi-scale semantic information supervision, comprising the following steps: converting a source-modality face image to be processed into a preliminary target-modality face image; performing depth feature extraction on the source-modality face image to obtain multi-scale depth features; performing, according to the structural characteristics of the face, feature fusion of the multi-scale depth features with the face semantic labels of the source-modality face image to obtain multi-scale semantic-information depth features; and inputting the preliminary target-modality face image into the generator of a target generation model for feature-space encoding and decoding in sequence, with the multi-scale semantic-information depth features providing auxiliary supervision, to obtain the generated target-modality face image. The method markedly enhances the detail around the contours of the facial features, captures the texture details of the facial structure more effectively, and thereby improves the realism of the generated images.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a cross-modal face image generation method based on multi-scale semantic information supervision.
Background
With the development of social informatization, face images have become one of the most widely used forms of identity authentication information. Because face information can be acquired in many different ways, the face images acquired in different forms constitute different modalities, and the task of converting face images between modalities is called cross-modal face image generation. Cross-modal face image generation enriches the representation of face content in different modalities while preserving the common face information as far as possible, and has wide application value and important research significance in fields such as public security and digital entertainment.
Sample-based cross-modal face image generation methods were the mainstream of early research in this field. Their main idea is to mine the consistency between an input test image and the source-modality images in a training set, and then directly combine sample images or image patches from the training set into the output image of the test image in the target modality. However, because patch stitching usually relies on mean smoothing and the number of training images is limited, the outputs of such methods suffer from blurring and deformation.
Zhang et al., in "Liliang Zhang, Liang Lin, Xian Wu, et al., 'End-to-end photo-sketch generation via fully convolutional representation learning,' in ACM ICMR, 2015, pp. 627-634", constructed an end-to-end fully convolutional neural network to model the nonlinear mapping between face photos and sketches, bringing cross-modal face image generation into the deep-learning research stage. However, that network is structurally simple and shallow, and has difficulty capturing the changes in texture detail between modalities, so the generated images are unsatisfactory.
Because of the powerful capability that GANs (Generative Adversarial Networks) exhibit on clear image generation tasks, researchers have further studied cross-modal image generation with GAN-based models. The conditional generative adversarial model proposed by Isola et al. in "Phillip Isola, Jun-Yan Zhu, et al., 'Image-to-image translation with conditional adversarial networks,' in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125-1134" can accomplish a variety of image-to-image translation tasks and also performs well when applied to cross-modal face image generation. However, existing deep convolutional neural networks based on encoder-decoder structures suffer from loss of facial structure and neglect the detailed representation of the features around the facial features, so texture is lost in the generated results.
In summary, the existing cross-modality face image generation method cannot capture the texture details of the face structure well, so that the generated image effect is not ideal.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a cross-modal face image generation method based on multi-scale semantic information supervision. The technical problem to be solved by the invention is realized by the following technical solutions:
the embodiment of the invention provides a cross-modal face image generation method based on multi-scale semantic information supervision, which comprises the following steps:
s1, converting the source mode face image to be processed into a target mode face primary generation image;
s2, performing depth feature extraction on the source mode face image to be processed to obtain multi-scale depth features;
s3, performing feature fusion on the multi-scale depth features and the face semantic labels of the source modal face images according to the facial structure characteristics to obtain multi-scale semantic information depth features;
and S4, inputting the primary target modal face generated image into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the multi-scale semantic information depth features to obtain a target modal face generated image.
In one embodiment of the present invention, step S1 is preceded by the steps of:
a number of source modality-target modality face image pairs are acquired.
In one embodiment of the present invention, step S2 includes:
inputting the source mode face images in the source mode-target mode face image pairs into a self-encoder for reconstruction to obtain reconstructed source mode face images;
calculating a reconstruction loss function of the source modal face image and the reconstructed source modal face image, and training the self-encoder by using the reconstruction loss function to obtain a trained self-encoder;
and carrying out deep feature extraction on the source mode face image to be processed by utilizing the trained self-encoder to obtain the multi-scale depth features.
In one embodiment of the invention, the reconstruction loss function is:

$L_{AE}(\theta_{AE}) = \left\| \hat{I}_x - I_x \right\|_1$

where $\theta_{AE}$ denotes the model parameters of the self-encoder, $\hat{I}_x$ denotes the reconstructed source-modality face image, and $I_x$ denotes the input source-modality face image.
In one embodiment of the present invention, step S3 includes:
s31, extracting the face semantic label of the source mode face image;
s32, selecting a face area to be enhanced from the face semantic labels according to the facial structure characteristics, and constructing a multi-scale semantic mask set of the source modal face image by using the face area to be enhanced;
s33, selecting target multi-scale depth features from the multi-scale depth features, and performing feature fusion on the target multi-scale depth features and the multi-scale semantic mask type group to obtain the multi-scale semantic information depth features.
In one embodiment of the invention, the facial region to be enhanced includes a facial skin region, a left ear region, a right ear region, and a neck region.
In one embodiment of the present invention, step S4 includes:
s41, inputting the preliminary generated image of the target modal face into an encoder of the generator for down-sampling operation, and performing feature fusion on output features in the down-sampling operation process and the depth features of the multi-scale semantic information to obtain the depth features of the preliminary generated image of the target modal face;
and S42, inputting the depth features of the preliminary generated image of the target modal face into a decoder of the generator for up-sampling operation, and performing feature fusion on the output features in the up-sampling operation process and the output features corresponding to the down-sampling operation to obtain the generated image of the target modal face.
In an embodiment of the present invention, step S4 is followed by the steps of:
s5, combining the target modal face image in the source modal-target modal face image pair, judging the distribution similarity degree of the target modal face generated image by using a discriminator in the target generation model, calculating a discriminant loss function according to the distribution similarity degree, and then updating the parameter of the discriminator by using the discriminant loss function;
s6, updating the parameters of the generator by using the generating loss function of the generator, the fighting loss function of the generator and the discriminator and the fusion loss function of the generator and the self-encoder;
and S7, obtaining the trained target generation model when the arbiter and the generator reach a counterbalance state.
In one embodiment of the present invention, the discriminator loss function is:

$L_D(\theta_D) = -\mathbb{E}\left[\log D(I_x, I_y)\right] - \mathbb{E}\left[\log\left(1 - D(I_x, \hat{I}_y)\right)\right]$

where $\theta_D$ denotes the parameters of the discriminator, $I_y$ the target-modality face image, $I_x$ the source-modality face image, $\hat{I}_y$ the generated target-modality face image, and $D$ the discriminator;

the generation loss function is:

$L_G(\theta_G) = \left\| \hat{I}_y - I_y \right\|_1$

where $\theta_G$ denotes the parameters of the generator, $I_y$ the target-modality face image, and $\hat{I}_y$ the generated target-modality face image;

the adversarial loss function is:

$L_{GAN}(\theta_G) = \mathbb{E}\left[\log D(I_x, I_y)\right] + \mathbb{E}\left[\log\left(1 - D(I_x, \hat{I}_y)\right)\right]$

where $\theta_G$ denotes the parameters of the generator, $I_y$ the target-modality face image, $I_x$ the source-modality face image, $\hat{I}_y$ the generated target-modality face image, and $D$ the discriminator;

the fusion loss function is:

$L(\theta_G, \theta_{AE}) = L_{GAN}(\theta_G) + L_G(\theta_G) + L_{AE}(\theta_{AE})$

where $L_{GAN}$ denotes the adversarial loss function, $L_G$ the generation loss function, $L_{AE}$ the reconstruction loss function of the self-encoder, $\hat{I}_y$ the generated target-modality face image, $\theta_G$ the parameters of the generator, and $\theta_{AE}$ the parameters of the self-encoder.
Compared with the prior art, the invention has the beneficial effects that:
the cross-modal face image generation method monitors the cross-modal face image generation process through the multi-scale semantic information depth features, so that the generated target modal face image has rich face content expression on the premise of keeping common face information, the detail information of the surrounding outline of five sense organs (such as canthus and eyelid) can be obviously enhanced, the texture detail of a face structure has better capturing capability, the accuracy and the continuity of the face structure of a source modal face image are kept, and the authenticity of the generated image is further improved.
Drawings
Fig. 1 is a schematic flow diagram of a cross-modal face image generation method based on multi-scale semantic information supervision according to an embodiment of the present invention;
fig. 2 is a comparison diagram of simulation results provided by the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a cross-modal face image generation method based on multi-scale semantic information supervision according to an embodiment of the present invention. The cross-modal face image generation method comprises the following steps:
and S1, acquiring a plurality of source modality-target modality face image pairs.
Specifically, M source-modality images of different face subjects and the M corresponding target-modality images are selected from a public cross-modality face database to form M source-modality/target-modality face image pairs. That is, the image A of a person in the source modality and the image B of the same person in the target modality form one pair, so the images of M persons yield M pairs. The remaining source-modality/target-modality face image pairs in the database are used as the test set.
The source-modality face image $I_x$ serves as the input of the model, and the target-modality face image $I_y$ is used to measure the content and structural similarity of the finally generated target-modality face image.
And S2, converting the source mode face image to be processed into a target mode face primary generation image.
In this embodiment, a face photo-portrait synthesis method based on a probabilistic graphical model is used to convert the input source-modality face image into the preliminary target-modality face image. Specifically, the input source-modality face image $I_x$ is processed in sequence by image blocking, nearest-neighbour search, weight optimization of the probabilistic graphical model based on a Markov weight field, and image stitching, yielding the preliminary target-modality face image $\hat{I}_y$. The specific implementation is prior art and is not described in detail here.
And S3, performing depth feature extraction on the source mode face image to be processed to obtain the multi-scale depth feature of the source mode face image.
In this embodiment, the source-modality face image $I_x$ is input into a self-encoder AE with a U-shaped structure for self-representation learning, yielding the multi-scale depth features $\Phi_l(I_x)$ of the source-modality face image, where $l$ denotes the selected network layer. Specifically, the source-modality face images of the source-modality/target-modality image pairs are first input into the self-encoder for reconstruction to obtain reconstructed source-modality face images; the reconstruction loss between the source-modality face images and their reconstructions is computed, and the self-encoder is trained with this loss to obtain a trained self-encoder; the source-modality face image to be processed is then fed into the trained self-encoder for depth feature extraction, producing the multi-scale depth features.
In a specific embodiment, the self-encoder comprises an encoder and a decoder connected in series.
The encoder performs depth feature extraction on the input source-modality face image $I_x$. It comprises a number of coding sub-modules connected in sequence, denoted $\{E_l, l = 1, \dots, L\}$, where $L$ is the number of coding sub-modules and may be chosen according to actual requirements; in this embodiment $L = 5$. The first coding sub-module $E_1$ comprises a convolution layer with kernel size 4 and stride 2 followed by an activation function layer, and filters the source-modality face image $I_x$; its output feature is denoted $\Phi_1(I_x)$. Each of the second to fifth coding sub-modules $E_l$ ($l = 2, 3, 4, 5$) comprises a convolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer; the output of $E_l$ is denoted $\Phi_l(I_x)$, and its input is the output $\Phi_{l-1}(I_x)$ of the previous coding sub-module $E_{l-1}$.
The decoder is structurally symmetric to the encoder and reconstructs the depth features extracted by the encoder into a reconstructed source-modality face image $\hat{I}_x$. It comprises several decoding sub-modules connected in sequence, denoted $\{D_l, l = L, L-1, \dots, 1\}$, where $L$ is the number of decoding sub-modules and may be chosen according to actual requirements; in this embodiment $L = 5$. The decoding sub-modules numbered 5 to 2 share the same structure: each $D_l$ ($l = 5, 4, 3, 2$) contains a deconvolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer. The input of the decoding sub-module numbered 5 is the encoder output $\Phi_5(I_x)$; the input of each other decoding sub-module is the fusion of the output of the previous decoding sub-module $D_{l+1}$ with the output $\Phi_l(I_x)$ of the correspondingly numbered coding sub-module $E_l$. The decoding sub-module numbered 1, i.e. the last one, comprises a deconvolution layer with kernel size 4 and stride 2 and an activation function layer, and upsamples the feature map output by the previous layer to the reconstructed source-modality image $\hat{I}_x$.
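As a concrete check on the geometry of the stride-2 encoder described above, the sketch below computes the spatial sizes produced by a chain of kernel-4, stride-2 convolutions. The 256×256 input size and padding of 1 are illustrative assumptions, not stated in the text:

```python
def conv_out_size(size, kernel=4, stride=2, padding=1):
    """Standard convolution output-size formula."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(input_size=256, num_modules=5):
    """Spatial sizes through the 5-module U-shaped encoder described above."""
    sizes = [input_size]
    for _ in range(num_modules):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes

# Each kernel-4, stride-2, padding-1 convolution halves the spatial size.
print(encoder_sizes())  # [256, 128, 64, 32, 16, 8]
```

Under these assumptions, the five coding sub-modules produce feature maps of sizes 128, 64, 32, 16 and 8, which matches the 128/64/32 mask scales used later.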
Further, the training process of the self-encoder is as follows: the source-modality face images of the source-modality/target-modality image pairs are first input into the encoder for depth feature extraction; the extracted depth features are then input into the decoder to obtain the reconstructed source-modality face images $\hat{I}_x$; finally, the reconstruction loss between each reconstructed image and the corresponding input source-modality face image is computed and used to train the self-encoder. Specifically, the reconstruction loss is the minimum absolute value error between the reconstructed and input source-modality face images, expressed as:

$L_{AE}(\theta_{AE}) = \left\| \hat{I}_x - I_x \right\|_1$

where $\theta_{AE}$ denotes the model parameters of the self-encoder, $\hat{I}_x$ the reconstructed source-modality face image, and $I_x$ the input source-modality face image.
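The minimum-absolute-value reconstruction loss can be sketched in NumPy as a per-pixel mean absolute error; the image shapes and values below are illustrative only:

```python
import numpy as np

def reconstruction_loss(i_x_hat, i_x):
    # L1 (minimum absolute value) error between the reconstructed image
    # I_x_hat and the input source-modality image I_x, averaged per element.
    return float(np.abs(i_x_hat - i_x).mean())

i_x = np.zeros((3, 8, 8))            # toy input image (C, H, W)
i_x_hat = np.full((3, 8, 8), 0.5)    # toy reconstruction
print(reconstruction_loss(i_x_hat, i_x))  # 0.5
```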
It should be noted that, when the self-encoder is trained, both the encoder and the decoder in the self-encoder perform corresponding steps; and when the trained self-encoder is used for testing, only the encoder in the self-encoder is used for extracting the multi-scale depth features.
And S4, performing feature fusion on the multi-scale depth features and the face semantic labels of the source mode face images according to the facial structure characteristics to obtain the multi-scale semantic information depth features of the source mode face images.
And S41, extracting the face semantic label of the face image in the source mode.
In a specific embodiment, the semantic segmentation model BiSeNet trained on the CelebA-HQ database is used to extract the face semantic labels of the source-modality face image, where $N$ denotes the total number of face-component labels; in this embodiment, $N = 19$.
S42, selecting a face area to be enhanced from the face semantic labels according to the facial structure characteristics, and constructing a multi-scale semantic mask set of the source mode face image by utilizing the face area to be enhanced.
Specifically, the facial regions to be enhanced cover the details around the facial features, such as the double eyelids around the eyes. In this embodiment, the facial-skin, left-ear, right-ear and neck regions outside the facial features are selected as the regions to be enhanced, denoted $m_1, m_7, m_8, m_{14}$. After selecting the regions to be enhanced, a binary mask $M_x$ of the face semantic labels of the source-modality face image $I_x$ is constructed from them: positions inside the regions to be enhanced take the value 1, and all other positions take the value 0. The binary mask $M_x$ is then transformed by nearest-neighbour interpolation into the multi-scale semantic mask set with sizes 128×128, 64×64 and 32×32.
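A minimal NumPy sketch of this step, assuming a 256×256 label map (the size is an illustrative assumption) and the label indices $m_1, m_7, m_8, m_{14}$ listed above:

```python
import numpy as np

ENHANCE_LABELS = [1, 7, 8, 14]  # skin, left ear, right ear, neck (per the text)

def binary_mask(labels):
    """Binary mask M_x: 1 inside the regions to be enhanced, 0 elsewhere."""
    return np.isin(labels, ENHANCE_LABELS).astype(np.float32)

def nn_resize(mask, out_size):
    """Nearest-neighbour downsampling of a square mask to out_size x out_size."""
    h, w = mask.shape
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return mask[np.ix_(rows, cols)]

labels = np.zeros((256, 256), dtype=np.int64)
labels[64:192, 64:192] = 1          # toy "facial skin" region
m_x = binary_mask(labels)
pyramid = {s: nn_resize(m_x, s) for s in (128, 64, 32)}  # multi-scale mask set
print([p.shape for p in pyramid.values()])
```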
S43, selecting target multi-scale depth features from the multi-scale depth features, and performing feature fusion on the target multi-scale depth features and the multi-scale semantic mask set to obtain the multi-scale semantic information depth features.
In this embodiment, the target multi-scale depth features need to contain more facial-region detail. The encoder of the self-encoder is a convolutional neural network whose first coding sub-module filters the source-modality face image; from the second coding sub-module onward, the feature maps learn more image detail, but as the network deepens, the feature maps shrink and the detail gradually decreases. This embodiment therefore selects the multi-scale depth features output by the coding sub-modules between the first and the last as the target multi-scale depth features. In one embodiment, when the number of coding sub-modules is 5, the depth features $\{\Phi_l(I_x), l = 2, 3, 4\}$ output by the second to fourth coding sub-modules are selected as the target multi-scale depth features.
Further, the target multi-scale depth features are fused with the multi-scale semantic mask set by elementwise multiplication, yielding the multi-scale semantic-information depth features.
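The multiplicative fusion itself is a broadcast elementwise product of a feature map with the same-scale mask; the channel count and sizes below are illustrative assumptions:

```python
import numpy as np

def fuse(features, mask):
    """Elementwise fusion of a (C, H, W) depth feature with an (H, W) mask."""
    return features * mask[None, :, :]   # broadcast the mask over channels

phi = np.ones((64, 64, 64))              # toy depth feature Phi_l (C, H, W)
mask = np.zeros((64, 64))
mask[16:48, 16:48] = 1.0                 # toy same-scale binary mask
fused = fuse(phi, mask)
print(fused.shape)  # only positions inside the mask survive the product
```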
By fusing the multi-scale semantic information with the target depth features, the detail features of all the contour regions around the facial features are enhanced and the texture details of the facial structure are captured more effectively, so that the subsequently generated target-modality face image carries rich facial content while preserving the common face information, maintains the accuracy and continuity of the facial structure of the source-modality face image, and is therefore more realistic.
And S5, inputting the primary generated image of the target modal face into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the depth features of the multi-scale semantic information to obtain the generated image of the target modal face.
And S51, inputting the preliminary generated image of the target modal face into an encoder of the generator for down-sampling operation, and performing feature fusion on output features and multi-scale semantic information depth features in the down-sampling operation process to obtain the depth features of the preliminary generated image of the target modal face.
Specifically, the encoder of the generator comprises a plurality of encoding sub-modules which are connected in sequence and used for performing down-sampling operation on the preliminarily generated image of the target modal face; the number of the coding sub-modules is determined according to the requirement, and the deeper feature representation of the image can be obtained when the number of the coding sub-modules is larger.
In this embodiment, the encoder of the generator consists of eight coding sub-modules. The first comprises a convolution layer with kernel size 4 and stride 2 followed by an activation function layer, connected in sequence, which convolve and downsample the input image $\hat{I}_y$. Each of the second to eighth coding sub-modules comprises a convolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer, connected in sequence. The inputs of the second to fourth coding sub-modules are the concatenation of the previous sub-module's output with the semantic-information depth feature of the corresponding scale; the inputs of the fifth to eighth coding sub-modules are simply the outputs of their respective previous sub-modules. The output of the eighth coding sub-module is the depth feature of the preliminary target-modality face image.
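The "splicing fusion" feeding coding sub-modules 2 to 4 is a channel-axis concatenation of the previous sub-module's output with the same-scale semantic-information depth feature; the channel counts below are illustrative assumptions:

```python
import numpy as np

prev_out = np.ones((64, 128, 128))       # toy output of the first sub-module
semantic_feat = np.ones((64, 128, 128))  # toy same-scale semantic feature
# Concatenate along the channel axis: channels add up, spatial size is kept.
fused_input = np.concatenate([prev_out, semantic_feat], axis=0)
print(fused_input.shape)  # (128, 128, 128)
```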
And S52, inputting the depth features of the preliminary generated image of the target modal face into a decoder of the generator for up-sampling operation, and performing feature fusion on the output features in the up-sampling operation process and the output features corresponding to the down-sampling operation to obtain the generated image of the target modal face.
Specifically, the decoder of the generator comprises a plurality of decoding sub-modules which are connected in sequence and used for performing up-sampling operation on the preliminarily generated image of the target modal face; the number of decoding sub-modules corresponds to the number of encoding sub-modules one to one.
In this embodiment, the decoder of the generator is composed of eight decoding sub-modules, which can be expressed as {D_1, D_2, …, D_8} and are applied in the order D_8 to D_1. The decoding sub-modules numbered 8 to 2 have a similar structure: each comprises a deconvolution layer with convolution kernel size 4 and stride 2, a normalization layer and an activation function layer, connected in sequence. The output of decoding sub-module D_l is denoted d_l. The input of the sub-module numbered 8 is the depth feature e_8 output by the encoder; the input of every other decoding sub-module is the splicing fusion feature of the previous decoding sub-module's output d_{l+1} and the output e_l of the encoding sub-module with the corresponding number. The decoding sub-module numbered 1, i.e. the last decoding sub-module, comprises a deconvolution layer with convolution kernel size 4 and stride 2 and an activation function layer, connected in sequence; it up-samples the feature map d_2 output by the previous layer to produce the final target-modality face generation image.
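The skip ("splicing") fusion described above can be sketched minimally as a channel-wise concatenation of a decoder output and the encoder output at the same spatial scale. Features are modelled here as (channels, height, width) shape tuples; the channel counts are illustrative assumptions, not values taken from the text:

```python
def concat_channels(a, b):
    """Channel-wise concatenation; spatial sizes must match."""
    assert a[1:] == b[1:], "splicing fusion needs matching spatial sizes"
    return (a[0] + b[0], a[1], a[2])

d8 = (512, 2, 2)   # output of decoding sub-module 8, up-sampled to 2x2 (hypothetical channels)
e7 = (512, 2, 2)   # output of encoding sub-module 7 at the same scale
fused = concat_channels(d8, e7)   # input to decoding sub-module 7
print(fused)  # (1024, 2, 2)
```

The concatenation doubles the channel count while leaving the spatial size unchanged, which is why each decoding sub-module's deconvolution must accept twice as many input channels as the corresponding encoder feature carries.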
S6, combining the target mode face image in the source mode-target mode face image pair, judging the distribution similarity degree of the target mode face generated image by using the discriminator in the target generation model, calculating a discriminant loss function according to the distribution similarity degree, and then updating the parameter of the discriminator by using the discriminant loss function.
In this embodiment, the discriminator comprises 5 convolution layers and 5 activation function layers, connected alternately; each convolution layer has a convolution kernel size of 4 and a stride of 2.
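Five stride-2 convolutions reduce the input by a factor of 32. The sketch below assumes a 256×256 input and padding 1 (both assumptions; the text fixes only kernel size and stride), under which the discriminator produces an 8×8 map of patch-level predictions, i.e. a PatchGAN-style discriminator:

```python
def disc_out_size(size, layers=5, kernel=4, stride=2, padding=1):
    """Spatial size of the discriminator output map after `layers` stride-2 convs."""
    for _ in range(layers):
        size = (size + 2 * padding - kernel) // stride + 1
    return size

print(disc_out_size(256))  # 8
```

Each cell of the output map scores one receptive-field patch of the input as real or fake, which tends to sharpen local texture compared with a single scalar prediction.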
Specifically, the target-modality face image and the target-modality face generation image are respectively input into the discriminator; the discriminator outputs prediction values in which 1 represents real and 0 represents fake, and these prediction values are the distribution similarity degrees of the target-modality face generation image.
The loss function of the discriminator is a binary classification loss function, in which the source-modality face image I_x is additionally added to the input of the discriminator as a condition, so as to guide the generation process of the generator more effectively. The specific loss function is:

L_D(θ_D) = −E[log D(I_x, I_y)] − E[log(1 − D(I_x, Î_y))]

wherein θ_D represents the model parameters of the discriminator, I_y represents the target-modality face image, I_x represents the source-modality face image, Î_y represents the target-modality face generation image, D represents the discriminator, and D(I_x, I_y) and D(I_x, Î_y) indicate the degrees of distribution similarity.
In this embodiment, the discriminator loss function is minimized and the parameters of the discriminator are updated by stochastic gradient descent, which is prior art and is not described in detail in this embodiment.
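A minimal stdlib-only sketch of this conditional discriminator loss, for one training pair: binary cross-entropy pushes D(I_x, I_y) towards 1 and D(I_x, Î_y) towards 0. Here `d_real` and `d_fake` stand in for the scalar similarity scores output by the discriminator; the values are illustrative:

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy on real/fake scores; eps guards against log(0)."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))

# Confident and correct scores give a small loss:
print(discriminator_loss(0.9, 0.1))
```

Minimizing this quantity is exactly the two-class classification objective described above; the same scores reappear (with opposite sign) in the generator's adversarial loss.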
And S7, updating the parameters of the generator by using the generation loss function of the generator, the adversarial loss function of the generator and the discriminator, and the fusion loss function of the generator and the self-encoder.
Specifically, the generation loss function measures the generation loss of the target-modality face generation image, which is based on the mean absolute error between the target-modality face generation image and the target-modality face image in the source-modality-target-modality face image pair; the generation loss function of the generator is therefore:

L_G(θ_G) = E[‖I_y − Î_y‖_1]

wherein θ_G represents the parameters of the generator, I_y represents the target-modality face image, and Î_y represents the target-modality face generation image.
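The mean absolute error above is straightforward to sketch; images are modelled here as flat lists of pixel values purely for illustration:

```python
def l1_generation_loss(target, generated):
    """Mean absolute error (L1) between target and generated pixel values."""
    assert len(target) == len(generated), "images must have the same size"
    return sum(abs(t - g) for t, g in zip(target, generated)) / len(target)

print(l1_generation_loss([0.0, 0.5, 1.0], [0.0, 0.25, 0.5]))  # 0.25
```

L1 is the usual choice over L2 in image-to-image generators because it penalizes large errors less aggressively and tends to produce less blurry outputs.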
The adversarial loss of the target-modality face image is designed so that the generation result can fool the discriminator, i.e. the generator and the discriminator are in a mutually adversarial, mutually constraining relationship. In this embodiment, the adversarial loss function is calculated from the distribution similarity output by the discriminator; its expression is:

L_GAN(θ_G) = −E[log D(I_x, Î_y)]

wherein θ_G represents the parameters of the generator, I_y represents the target-modality face image, I_x represents the source-modality face image, Î_y represents the target-modality face generation image, and D represents the discriminator.
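The generator-side objective can be sketched in the same style: the generator is rewarded when the conditional discriminator scores its output as real, so it minimizes −log D(I_x, Î_y). Here `d_fake` stands in for the discriminator's score on a generated image (an illustrative value):

```python
import math

def adversarial_loss(d_fake, eps=1e-12):
    """Generator adversarial loss: small when the discriminator is fooled."""
    return -math.log(d_fake + eps)

print(adversarial_loss(0.5))  # ~0.693: the discriminator is undecided
```

Note the opposing pressures: the discriminator loss drives this score towards 0 while the adversarial loss drives it towards 1, which is the constraint relationship the text describes.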
The fusion loss function of the generator and the self-encoder is:

L_f(θ_G, θ_AE) = L_GAN(θ_G) + L_G(θ_G) + L_AE(θ_AE)

wherein L_GAN represents the adversarial loss function, L_G represents the generation loss function, L_AE represents the reconstruction loss function of the self-encoder, I_y represents the target-modality face image, I_x represents the source-modality face image, Î_y represents the target-modality face generation image, θ_G represents the parameters of the generator, and θ_AE represents the parameters of the self-encoder.
Further, the fusion loss function L_f and the discriminator loss function L_D are updated alternately to train all parameters of the model. Specifically, following the adversarial idea of the GAN, the parameters of the generator and the discriminator are updated alternately, fixing one side while iteratively updating the other: first, the generator parameters θ_G and θ_AE are fixed and the discriminator parameters θ_D are updated by stochastic gradient descent until the fluctuation of the discriminator loss function L_D falls within a certain small range; then the discriminator parameters θ_D are fixed and the generator parameters θ_G and θ_AE are updated by stochastic gradient descent until the fluctuation of the fusion loss function L_f falls within a certain small range.
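The alternating scheme can be sketched as a toy control loop. `step_discriminator` and `step_generator` stand in for one SGD update each (returning the new loss); the stabilisation window and tolerance are illustrative assumptions for what "fluctuation within a certain small range" means:

```python
def stabilised(history, window=3, tol=0.05):
    """True once the last `window` losses vary by less than `tol`."""
    recent = history[-window:]
    return len(recent) == window and max(recent) - min(recent) < tol

def alternate(step_discriminator, step_generator, rounds=2, max_steps=100):
    """Fix one side, update the other until its loss stabilises; then swap."""
    d_hist, g_hist = [], []
    for _ in range(rounds):
        d_hist = []
        while not stabilised(d_hist) and len(d_hist) < max_steps:
            d_hist.append(step_discriminator())   # theta_G, theta_AE fixed
        g_hist = []
        while not stabilised(g_hist) and len(g_hist) < max_steps:
            g_hist.append(step_generator())       # theta_D fixed
    return d_hist, g_hist

# Demo with decaying 1/n losses standing in for real SGD updates:
counts = {"d": 0, "g": 0}
def step_d():
    counts["d"] += 1
    return 1.0 / counts["d"]
def step_g():
    counts["g"] += 1
    return 1.0 / counts["g"]

d_hist, g_hist = alternate(step_d, step_g)
```

Real implementations usually use a fixed number of steps per side (often 1:1) rather than an explicit plateau test, but the fix-one-update-the-other structure is the same.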
And S8, obtaining a trained target generation model when the discriminator and the generator reach a counterbalance state.
Specifically, through alternating training of the generator and the discriminator, when the generator and the discriminator reach a certain adversarial balance, i.e. the target generation model is in a convergent state, training of the target generation model is complete and the trained target generation model is obtained. The counterbalance condition is embodied as follows: when the discriminator is updated with the generator fixed, the discrimination loss no longer changes significantly; conversely, when the generator is updated with the discriminator fixed, the fusion loss no longer changes significantly.
And S9, performing cross-modal face generation on the source modal face image to be processed by using the trained target generation model.
And S91, converting the source mode face image to be processed into a target mode face primary generation image.
S92, performing depth feature extraction on the source mode face image to be processed to obtain the multi-scale depth feature of the source mode face image to be processed.
And S93, performing feature fusion on the multi-scale depth features and the face semantic labels of the source mode face images according to the facial structure characteristics to obtain the multi-scale semantic information depth features of the source mode face images.
And S94, inputting the primary target modal face generated image into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the multi-scale semantic information depth features to obtain a target modal face generated image.
Please refer to fig. 1 and steps S2 to S5 for detailed operation steps of the above steps, which are not described herein again.
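One scale of the semantic fusion in step S93 can be sketched minimally as applying a binary mask for a face region to be enhanced (e.g. facial skin) to a depth-feature map of the same spatial size. The element-wise masking and all shapes here are illustrative assumptions; the text specifies only that masks and features are fused per scale:

```python
def mask_fuse(feature, mask):
    """Element-wise product of a 2-D feature map and a 0/1 region mask."""
    assert len(feature) == len(mask) and len(feature[0]) == len(mask[0])
    return [[f * m for f, m in zip(frow, mrow)]
            for frow, mrow in zip(feature, mask)]

feature = [[0.2, 0.8], [0.5, 0.1]]
mask = [[1, 0], [1, 1]]          # keep left column and bottom-right cell
print(mask_fuse(feature, mask))  # [[0.2, 0.0], [0.5, 0.1]]
```

Masking zeroes out responses outside the selected region, so the supervised features emphasise exactly the facial areas chosen from the semantic labels.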
The cross-modal face image generation method supervises the cross-modal face image generation process with multi-scale semantic information depth features. As a result, the generated target-modality face image has rich facial content expression while preserving common face information: detail around the contours of the five sense organs (such as the canthus and eyelids) is noticeably enhanced, the texture details of the facial structure are captured better, and the accuracy and continuity of the facial structure of the source-modality face image are preserved, further improving the realism of the generated image.
Example two
On the basis of the first embodiment, the effect of the cross-modal face image generation method based on multi-scale semantic information supervision is further explained through a simulation experiment.
(1) Simulation conditions
The simulation experiment uses the PyTorch framework on a machine with an Intel(R) Core(TM) i7-8700K 3.70 GHz CPU as the central processing unit, an NVIDIA GeForce RTX 2080 Ti GPU, and the Linux Mint 18.3 Sylvia operating system. The image database is the CUFS database.
The following methods are adopted for simulation experiment:
Method 1: a probability-graph-model-based method, denoted MWF in the experiments. Method 2: a conditional generative adversarial network method, denoted pix2pix. Method 3: a multi-scale adversarial network method, denoted PS2MAN. Method 4: a knowledge-transfer-based method, denoted KT. Method 5: a face-component-assisted method, denoted SCA-GAN. Methods 1 to 5 are all prior art and are not described in detail in this embodiment.
(2) Emulated content
A number of source-modality face images and target-modality face images of the same subjects are selected from the database test set to form source-modality-target-modality face image data pairs, and cross-modality generation is performed with the cross-modal face image generation method based on multi-scale semantic information supervision of this embodiment and with the 5 prior-art methods respectively.
Referring to fig. 2, fig. 2 is a comparison diagram of simulation results provided in an embodiment of the present invention, where Inputs are the input source-modality face images, Outputs are the results of the cross-modal face image generation method based on multi-scale semantic information supervision according to this embodiment, and Groundtruth is the target-modality face real image corresponding to each source-modality face image in the data pairs. As can be seen from fig. 2, the cross-modality face images generated by the method of the embodiment of the present invention retain the facial detail characteristics of the person, and the generated images have good structural integrity and consistency.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (9)
1. A cross-mode face image generation method based on multi-scale semantic information supervision is characterized by comprising the following steps:
s1, converting the source mode face image to be processed into a target mode face primary generation image;
s2, performing depth feature extraction on the source mode face image to be processed to obtain multi-scale depth features;
s3, performing feature fusion on the multi-scale depth features and the face semantic labels of the source modal face images according to the facial structure characteristics to obtain multi-scale semantic information depth features;
and S4, inputting the primary target modal face generated image into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the multi-scale semantic information depth features to obtain a target modal face generated image.
2. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 1, wherein step S1 is preceded by the steps of:
a number of source modality-target modality face image pairs are acquired.
3. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 2, wherein the step S2 comprises:
inputting the source mode face images in the source mode-target mode face image pairs into a self-encoder for reconstruction to obtain reconstructed source mode face images;
calculating a reconstruction loss function of the source modal face image and the reconstructed source modal face image, and training the self-encoder by using the reconstruction loss function to obtain a trained self-encoder;
and carrying out deep feature extraction on the source mode face image to be processed by utilizing the trained self-encoder to obtain the multi-scale depth features.
4. The multi-scale semantic information surveillance-based cross-modal face image generation method of claim 3, wherein the reconstruction loss function is:
5. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 1, wherein the step S3 comprises:
s31, extracting the face semantic label of the source mode face image;
s32, selecting a face area to be enhanced from the face semantic labels according to the facial structure characteristics, and constructing a multi-scale semantic mask set of the source modal face image by using the face area to be enhanced;
S33, selecting target multi-scale depth features from the multi-scale depth features, and performing feature fusion on the target multi-scale depth features and the multi-scale semantic mask set to obtain the multi-scale semantic information depth features.
6. The cross-modal facial image generation method based on multi-scale semantic information supervision as claimed in claim 5, wherein the facial regions to be enhanced comprise a facial skin region, a left ear region, a right ear region and a neck region.
7. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 1, wherein the step S4 comprises:
s41, inputting the preliminary generated image of the target modal face into an encoder of the generator for down-sampling operation, and performing feature fusion on output features in the down-sampling operation process and the depth features of the multi-scale semantic information to obtain the depth features of the preliminary generated image of the target modal face;
and S42, inputting the depth features of the preliminary generated image of the target modal face into a decoder of the generator for up-sampling operation, and performing feature fusion on the output features in the up-sampling operation process and the output features corresponding to the down-sampling operation to obtain the generated image of the target modal face.
8. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 2, further comprising the following steps after step S4:
s5, combining the target modal face image in the source modal-target modal face image pair, judging the distribution similarity degree of the target modal face generated image by using a discriminator in the target generation model, calculating a discriminant loss function according to the distribution similarity degree, and then updating the parameter of the discriminator by using the discriminant loss function;
S6, updating the parameters of the generator by using the generation loss function of the generator, the adversarial loss function of the generator and the discriminator, and the fusion loss function of the generator and the self-encoder;
and S7, obtaining the trained target generation model when the arbiter and the generator reach a counterbalance state.
9. The multi-scale semantic information surveillance-based cross-modal face image generation method of claim 8, wherein the discriminant loss function is:

L_D(θ_D) = −E[log D(I_x, I_y)] − E[log(1 − D(I_x, Î_y))]

wherein θ_D represents the parameters of the discriminator, I_y represents the target-modality face image, I_x represents the source-modality face image, Î_y represents the target-modality face generation image, and D represents the discriminator;
the generation loss function is:

L_G(θ_G) = E[‖I_y − Î_y‖_1]

wherein θ_G represents the parameters of the generator, I_y represents the target-modality face image, and Î_y represents the target-modality face generation image;
the adversarial loss function is:

L_GAN(θ_G) = −E[log D(I_x, Î_y)]

wherein θ_G represents the parameters of the generator, I_y represents the target-modality face image, I_x represents the source-modality face image, Î_y represents the target-modality face generation image, and D represents the discriminator;
the fusion loss function is:

L_f(θ_G, θ_AE) = L_GAN(θ_G) + L_G(θ_G) + L_AE(θ_AE)

wherein L_GAN represents the adversarial loss function, L_G represents the generation loss function, L_AE represents the reconstruction loss function of the self-encoder, I_y represents the target-modality face image, I_x represents the source-modality face image, Î_y represents the target-modality face generation image, θ_G represents the parameters of the generator, and θ_AE represents the parameters of the self-encoder.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110218611.3A CN112949707B (en) | 2021-02-26 | 2021-02-26 | Cross-modal face image generation method based on multi-scale semantic information supervision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112949707A true CN112949707A (en) | 2021-06-11 |
CN112949707B CN112949707B (en) | 2024-02-09 |
Family
ID=76246481
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110218611.3A Active CN112949707B (en) | 2021-02-26 | 2021-02-26 | Cross-modal face image generation method based on multi-scale semantic information supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112949707B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107688821A (en) * | 2017-07-11 | 2018-02-13 | 西安电子科技大学 | View-based access control model conspicuousness and across the modality images natural language description methods of semantic attribute |
EP3511942A2 (en) * | 2018-01-16 | 2019-07-17 | Siemens Healthcare GmbH | Cross-domain image analysis and cross-domain image synthesis using deep image-to-image networks and adversarial networks |
CN110675316A (en) * | 2019-08-29 | 2020-01-10 | 中山大学 | Multi-domain image conversion method, system and medium for generating countermeasure network based on condition |
WO2020029356A1 (en) * | 2018-08-08 | 2020-02-13 | 杰创智能科技股份有限公司 | Method employing generative adversarial network for predicting face change |
CN111243066A (en) * | 2020-01-09 | 2020-06-05 | 浙江大学 | Facial expression migration method based on self-supervision learning and confrontation generation mechanism |
CN112270644A (en) * | 2020-10-20 | 2021-01-26 | 西安工程大学 | Face super-resolution method based on spatial feature transformation and cross-scale feature integration |
Non-Patent Citations (3)
Title |
---|
柳欣; 李鹤洋; 钟必能; 杜吉祥: "Cross audio-visual speaker annotation combining a supervised joint-consistency auto-encoder", Journal of Electronics & Information Technology, no. 07 |
魏?; 孙硕: "Research on algorithms for restoring perceptually occluded faces with generative adversarial networks", Journal of Chinese Computer Systems, no. 02 |
黄菲; 高飞; 朱静洁; 戴玲娜; 俞俊: "Heterogeneous face image synthesis based on generative adversarial networks: progress and challenges", Journal of Nanjing University of Information Science & Technology (Natural Science Edition), no. 06 |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113409377A (en) * | 2021-06-23 | 2021-09-17 | 四川大学 | Phase unwrapping method for generating countermeasure network based on jump connection |
CN113409377B (en) * | 2021-06-23 | 2022-09-27 | 四川大学 | Phase unwrapping method for generating countermeasure network based on jump connection |
WO2023280065A1 (en) * | 2021-07-09 | 2023-01-12 | 南京邮电大学 | Image reconstruction method and apparatus for cross-modal communication system |
US11748919B2 (en) | 2021-07-09 | 2023-09-05 | Nanjing University Of Posts And Telecommunications | Method of image reconstruction for cross-modal communication system and device thereof |
CN114187408A (en) * | 2021-12-15 | 2022-03-15 | 中国电信股份有限公司 | Three-dimensional face model reconstruction method and device, electronic equipment and storage medium |
CN114187408B (en) * | 2021-12-15 | 2023-04-07 | 中国电信股份有限公司 | Three-dimensional face model reconstruction method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112949707B (en) | 2024-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108520503B (en) | Face defect image restoration method based on self-encoder and generation countermeasure network | |
US11276231B2 (en) | Semantic deep face models | |
CN112949707A (en) | Cross-mode face image generation method based on multi-scale semantic information supervision | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
CN111932444A (en) | Face attribute editing method based on generation countermeasure network and information processing terminal | |
CN111192201B (en) | Method and device for generating face image and training model thereof, and electronic equipment | |
CN115565238B (en) | Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product | |
WO2023065503A1 (en) | Facial expression classification method and electronic device | |
Sreekala et al. | Capsule Network‐Based Deep Transfer Learning Model for Face Recognition | |
CN113553961B (en) | Training method and device of face recognition model, electronic equipment and storage medium | |
CN116433898A (en) | Method for segmenting transform multi-mode image based on semantic constraint | |
CN113392791A (en) | Skin prediction processing method, device, equipment and storage medium | |
CN114972016A (en) | Image processing method, image processing apparatus, computer device, storage medium, and program product | |
CN115147261A (en) | Image processing method, device, storage medium, equipment and product | |
Gao | A method for face image inpainting based on generative adversarial networks | |
Zhou et al. | A superior image inpainting scheme using Transformer-based self-supervised attention GAN model | |
CN114049290A (en) | Image processing method, device, equipment and storage medium | |
Liu et al. | Learning shape and texture progression for young child face aging | |
CN113762117A (en) | Training method of image processing model, image processing model and computer equipment | |
CN115631285B (en) | Face rendering method, device, equipment and storage medium based on unified driving | |
Xu et al. | Correlation via synthesis: end-to-end nodule image generation and radiogenomic map learning based on generative adversarial network | |
WO2023173827A1 (en) | Image generation method and apparatus, and device, storage medium and computer program product | |
CN112990123B (en) | Image processing method, apparatus, computer device and medium | |
Wu et al. | Voice2mesh: Cross-modal 3d face model generation from voices | |
Xu et al. | Correlation via synthesis: End-to-end image generation and radiogenomic learning based on generative adversarial network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||