CN112949707A - Cross-mode face image generation method based on multi-scale semantic information supervision - Google Patents

Cross-mode face image generation method based on multi-scale semantic information supervision

Info

Publication number
CN112949707A
CN112949707A (application CN202110218611.3A)
Authority
CN
China
Prior art keywords
face
image
target
representing
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110218611.3A
Other languages
Chinese (zh)
Other versions
CN112949707B (en)
Inventor
王楠楠
杨玥颖
郝毅
朱明瑞
李洁
高新波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110218611.3A priority Critical patent/CN112949707B/en
Publication of CN112949707A publication Critical patent/CN112949707A/en
Application granted granted Critical
Publication of CN112949707B publication Critical patent/CN112949707B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a cross-modal face image generation method based on multi-scale semantic information supervision, which comprises the following steps: converting a source-modality face image to be processed into a preliminary generated image of the target-modality face; extracting depth features from the source-modality face image to be processed to obtain multi-scale depth features; fusing, according to the structural characteristics of the face, the multi-scale depth features with the face semantic labels of the source-modality face image to obtain multi-scale semantic-information depth features; and inputting the preliminary generated image of the target-modality face into the generator of a target generation model for feature-space encoding and feature-space decoding in turn, with the multi-scale semantic-information depth features providing auxiliary supervision, to obtain the generated image of the target-modality face. The method markedly enhances the detail information around the contours of the facial features, captures the texture details of the facial structure more effectively, and thereby improves the realism of the generated image.

Description

Cross-mode face image generation method based on multi-scale semantic information supervision
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a cross-modal face image generation method based on multi-scale semantic information supervision.
Background
With the development of social informatization, the face image has become one of the most widely used forms of human identity authentication information. Because face information can be acquired in many different ways, face images captured in different forms constitute different modalities, and the task of converting face images between modalities is called cross-modal face image generation. Cross-modal face image generation enriches the representation of facial content across modalities while preserving the shared facial information as far as possible, and has broad application value and important research significance in fields such as public security and digital entertainment.
Sample-based cross-modal face image generation methods dominated early research in this field. Their main idea is to mine the correspondence between the input test image and the source-modality images in the training set, and then directly assemble sample images or image patches from the training set into the output image of the test image in the target modality. However, because patch stitching usually relies on mean smoothing and the number of training images is limited, the output images of such methods tend to be blurred and deformed.
Zhang et al., in the document "Liliang Zhang, Liang Lin, Xian Wu, et al., 'End-to-end photo-sketch generation via fully convolutional representation learning', in ACM ICMR, 2015, pp. 627-634", constructed an end-to-end fully convolutional neural network to model the nonlinear mapping between face photos and sketches, bringing cross-modal face image generation into the deep-learning-based research stage. However, the network is structurally simple and shallow, and has difficulty capturing the changes of texture detail between modalities, so the generated images are not ideal.
Because of the powerful capability that GANs (Generative Adversarial Networks) have shown on sharp image generation, researchers have further studied cross-modal image generation with GAN-based models. The conditional generative adversarial network model proposed by Isola et al. in "Phillip Isola, Jun-Yan Zhu, et al., 'Image-to-image translation with conditional adversarial networks', in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125-1134" can perform a variety of image-to-image translation tasks and can also be applied to cross-modal face image generation with good results. However, existing deep convolutional neural networks based on encoder-decoder structures suffer from loss of facial structure and neglect the detailed representation of the features around the facial features, so texture is lost in the generated results.
In summary, existing cross-modal face image generation methods cannot capture the texture details of the facial structure well, so the generated images are not ideal.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a cross-modal face image generation method based on multi-scale semantic information supervision. The technical problem to be solved by the invention is addressed by the following technical solutions:
the embodiment of the invention provides a cross-modal face image generation method based on multi-scale semantic information supervision, which comprises the following steps:
s1, converting the source mode face image to be processed into a target mode face primary generation image;
s2, performing depth feature extraction on the source mode face image to be processed to obtain multi-scale depth features;
s3, performing feature fusion on the multi-scale depth features and the face semantic labels of the source modal face images according to the facial structure characteristics to obtain multi-scale semantic information depth features;
and S4, inputting the primary target modal face generated image into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the multi-scale semantic information depth features to obtain a target modal face generated image.
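For illustration only, the four steps S1 to S4 can be chained at inference time as in the following minimal sketch; all callables in it (preliminary_generator, autoencoder_encoder, build_semantic_info_features, generator) are hypothetical placeholders standing in for the components described above, not names used by this disclosure.

```python
# A minimal sketch of how steps S1-S4 could be chained at inference time.
# All callables are hypothetical placeholders for the components described
# in this disclosure.
def generate_cross_modal_face(source_img, preliminary_generator,
                              autoencoder_encoder,
                              build_semantic_info_features, generator):
    # S1: convert the source-modality face image into a preliminary
    #     generated image of the target-modality face.
    preliminary = preliminary_generator(source_img)
    # S2: extract multi-scale depth features from the source-modality image.
    depth_feats = autoencoder_encoder(source_img)
    # S3: fuse the depth features with the face semantic labels to obtain
    #     the multi-scale semantic-information depth features.
    sem_feats = build_semantic_info_features(source_img, depth_feats)
    # S4: encode and decode the preliminary image in feature space under
    #     auxiliary supervision of the semantic-information depth features.
    return generator(preliminary, sem_feats)
```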
In one embodiment of the present invention, step S1 is preceded by the steps of:
a number of source modality-target modality face image pairs are acquired.
In one embodiment of the present invention, step S2 includes:
inputting the source mode face images in the source mode-target mode face image pairs into a self-encoder for reconstruction to obtain reconstructed source mode face images;
calculating a reconstruction loss function of the source modal face image and the reconstructed source modal face image, and training the self-encoder by using the reconstruction loss function to obtain a trained self-encoder;
and carrying out deep feature extraction on the source mode face image to be processed by utilizing the trained self-encoder to obtain the multi-scale depth features.
In one embodiment of the invention, the reconstruction loss function is:

L_AE(θ_AE) = || Î_x - I_x ||_1

where θ_AE denotes the model parameters of the self-encoder, Î_x denotes the reconstructed source-modality face image, and I_x denotes the input source-modality face image.
In one embodiment of the present invention, step S3 includes:
s31, extracting the face semantic label of the source mode face image;
s32, selecting a face area to be enhanced from the face semantic labels according to the facial structure characteristics, and constructing a multi-scale semantic mask set of the source modal face image by using the face area to be enhanced;
s33, selecting target multi-scale depth features from the multi-scale depth features, and performing feature fusion on the target multi-scale depth features and the multi-scale semantic mask set to obtain the multi-scale semantic information depth features.
In one embodiment of the invention, the facial region to be enhanced includes a facial skin region, a left ear region, a right ear region, and a neck region.
In one embodiment of the present invention, step S4 includes:
s41, inputting the preliminary generated image of the target modal face into an encoder of the generator for down-sampling operation, and performing feature fusion on output features in the down-sampling operation process and the depth features of the multi-scale semantic information to obtain the depth features of the preliminary generated image of the target modal face;
and S42, inputting the depth features of the preliminary generated image of the target modal face into a decoder of the generator for up-sampling operation, and performing feature fusion on the output features in the up-sampling operation process and the output features corresponding to the down-sampling operation to obtain the generated image of the target modal face.
In an embodiment of the present invention, step S4 is followed by the steps of:
s5, combining the target modal face image in the source modal-target modal face image pair, judging the distribution similarity degree of the target modal face generated image by using a discriminator in the target generation model, calculating a discriminant loss function according to the distribution similarity degree, and then updating the parameter of the discriminator by using the discriminant loss function;
s6, updating the parameters of the generator by using the generation loss function of the generator, the adversarial loss function between the generator and the discriminator, and the fusion loss function of the generator and the self-encoder;
and S7, obtaining the trained target generation model when the discriminator and the generator reach an adversarial equilibrium.
In one embodiment of the present invention, the discriminator loss function is:

L_D(θ_D) = -E[ log D(I_x, I_y) ] - E[ log(1 - D(I_x, Î_y)) ]

where θ_D denotes the parameters of the discriminator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator;

the generation loss function is:

L_G(θ_G) = || I_y - Î_y ||_1

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, and Î_y denotes the generated target-modality face image;

the adversarial loss function is:

L_GAN(θ_G) = -E[ log D(I_x, Î_y) ]

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator;

the fusion loss function is:

L_f(θ_G, θ_AE) = L_GAN + L_G + L_AE

where L_GAN denotes the adversarial loss function, L_G denotes the generation loss function, L_AE denotes the reconstruction loss function of the self-encoder, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, θ_G denotes the parameters of the generator, and θ_AE denotes the parameters of the self-encoder.
Compared with the prior art, the invention has the following beneficial effects:
The cross-modal face image generation method supervises the cross-modal generation process with the multi-scale semantic-information depth features. As a result, the generated target-modality face image expresses rich facial content while preserving the shared facial information, the detail information around the contours of the facial features (such as the eye corners and eyelids) is markedly enhanced, the texture details of the facial structure are captured more effectively, the accuracy and continuity of the facial structure of the source-modality face image are preserved, and the realism of the generated image is further improved.
Drawings
Fig. 1 is a schematic flow diagram of a cross-modal face image generation method based on multi-scale semantic information supervision according to an embodiment of the present invention;
fig. 2 is a comparison diagram of simulation results provided by the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a cross-modal face image generation method based on multi-scale semantic information supervision according to an embodiment of the present invention. The cross-mode face image generation method comprises the following steps:
and S1, acquiring a plurality of source modality-target modality face image pairs.
Specifically, from a public cross-modality face database, M source-modality images of different face subjects and the M corresponding target-modality images are selected to form M source-modality-target-modality face image pairs. That is, the source-modality image A and the target-modality image B of the same person form one source-modality-target-modality face image pair, so the images of M persons form M such pairs. The remaining source-modality-target-modality face image pairs in the database are used for testing.
The source-modality face image I_x serves as the input of the model, and the target-modality face image I_y is used to measure the content and structural similarity of the finally generated target-modality face image.
And S2, converting the source mode face image to be processed into a target mode face primary generation image.
In this embodiment, a face photo-portrait synthesis method based on a probability graph model is used to convert the input source-modality face image into the preliminary generated image of the target-modality face.
Specifically, the input source-modality face image I_x is processed by image blocking, nearest-neighbour search, weight combination optimised by a Markov-weight-field probability graph model, and image stitching, in sequence, to obtain the preliminary generated image of the target-modality face. The specific implementation of this method is prior art and is not described in detail here.
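For illustration, the following is a greatly simplified, hedged sketch of such a patch-based conversion: it keeps only image blocking, nearest-neighbour search over co-located training patches and mean-smoothed stitching, and omits the Markov-weight-field probability-graph-model optimisation of the combination weights used by the actual method; all names and parameter values are illustrative assumptions.

```python
# A greatly simplified sketch of the patch-based conversion step; the
# Markov-weight-field weight optimisation of the actual method is omitted.
import numpy as np

def preliminary_synthesis(src, train_src, train_tgt, patch=16, stride=8):
    """src: H x W source-modality image; train_src / train_tgt: lists of
    aligned H x W training images in the source / target modality."""
    h, w = src.shape
    out = np.zeros((h, w), dtype=np.float64)
    cnt = np.zeros((h, w), dtype=np.float64)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            q = src[y:y + patch, x:x + patch]
            best, best_d = None, np.inf
            # nearest-neighbour search over co-located training patches
            for s_img, t_img in zip(train_src, train_tgt):
                d = np.sum((s_img[y:y + patch, x:x + patch] - q) ** 2)
                if d < best_d:
                    best_d = d
                    best = t_img[y:y + patch, x:x + patch]
            out[y:y + patch, x:x + patch] += best
            cnt[y:y + patch, x:x + patch] += 1.0
    return out / np.maximum(cnt, 1.0)   # mean smoothing over overlaps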
And S3, performing depth feature extraction on the source mode face image to be processed to obtain the multi-scale depth feature of the source mode face image.
In this embodiment, the source-modality face image I_x is input into a self-encoder AE with a U-shaped structure for self-representation learning, yielding the multi-scale depth features φ_l(I_x) of the source-modality face image, where l denotes the index of the selected network layer. Specifically, the source-modality face images in the several source-modality-target-modality face image pairs are first input into the self-encoder for reconstruction to obtain reconstructed source-modality face images; the reconstruction loss between the source-modality face images and the reconstructed source-modality face images is then computed and used to train the self-encoder; finally, the source-modality face image to be processed is input into the trained self-encoder for depth feature extraction, yielding the multi-scale depth features.
In a specific embodiment, the self-encoder comprises an encoder and a decoder connected in series.
The encoder performs depth feature extraction on the input source-modality face image I_x. It comprises several encoding sub-modules connected in sequence, denoted {E_l, l = 1, ..., L}, where L is the number of encoding sub-modules. The number of encoding sub-modules may be chosen according to actual requirements; in this embodiment L = 5. Specifically, the first encoding sub-module E_1 comprises a convolution layer with kernel size 4 and stride 2 and an activation function layer, which filter the source-modality face image I_x; its output feature is denoted φ_1(I_x). Each of the second to fifth encoding sub-modules E_l (l = 2, 3, 4, 5) comprises a convolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer; the output of E_l is denoted φ_l(I_x), and its input is the output φ_{l-1}(I_x) of the previous encoding sub-module E_{l-1}.
The decoder has a structure symmetric to that of the encoder and reconstructs the depth features extracted by the encoder into a reconstructed source-modality face image Î_x. The decoder comprises several decoding sub-modules connected in sequence, denoted {D_l, l = L, L-1, ..., 1}, where L is the number of decoding sub-modules. The number of decoding sub-modules may be chosen according to actual requirements; in this embodiment L = 5. The decoding sub-modules numbered 5 to 2 have the same structure: each D_l (l = 5, 4, 3, 2) comprises a deconvolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer. The input of the decoding sub-module numbered 5 is the encoder output φ_5(I_x); the input of each of the other decoding sub-modules is the fusion of the output of the previous decoding sub-module D_{l+1} and the output φ_l(I_x) of the encoding sub-module E_l with the corresponding number. The deconvolution module numbered 1, i.e. the last decoding sub-module, comprises a deconvolution layer with kernel size 4 and stride 2 and an activation function layer, and upsamples the feature map output by the previous layer into the reconstructed source-modality image Î_x.
Further, the self-encoder is trained as follows. First, the source-modality face images in the several source-modality-target-modality face image pairs are input into the encoder for depth feature extraction; the extracted depth features are then input into the decoder to obtain the reconstructed source-modality face image Î_x; finally, the reconstruction loss between the reconstructed source-modality face image Î_x and the input source-modality face image is computed and used to train the self-encoder. Specifically, the reconstruction loss of the source-modality face image is based on the minimum absolute-value error between the reconstructed source-modality face image and the input source-modality face image, and the reconstruction loss function is expressed as:

L_AE(θ_AE) = || Î_x - I_x ||_1

where θ_AE denotes the model parameters of the self-encoder, Î_x denotes the reconstructed source-modality face image, and I_x denotes the input source-modality face image.
It should be noted that, when the self-encoder is trained, both the encoder and the decoder in the self-encoder perform corresponding steps; and when the trained self-encoder is used for testing, only the encoder in the self-encoder is used for extracting the multi-scale depth features.
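For illustration, a minimal PyTorch sketch of a U-shaped self-encoder of the kind described above is given below: five encoding and five decoding sub-modules with 4 x 4 kernels and stride 2, the encoder features φ_1 to φ_5 fused into the decoder, and an L1 reconstruction loss. The channel widths, the use of channel concatenation as the fusion, and the activation-function choices are assumptions made here for concreteness, not details fixed by this embodiment.

```python
# A minimal sketch of the U-shaped self-encoder; widths and activations are
# illustrative assumptions.
import torch
import torch.nn as nn

class UShapedAE(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8, base * 8]
        enc, prev = [], in_ch
        for i, c in enumerate(chs):            # E1..E5 (E1 has no norm layer)
            layers = [nn.Conv2d(prev, c, 4, 2, 1)]
            if i > 0:
                layers.append(nn.BatchNorm2d(c))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            enc.append(nn.Sequential(*layers))
            prev = c
        self.enc = nn.ModuleList(enc)
        dec = []
        for i in range(4, 0, -1):              # D5..D2
            in_c = chs[i] if i == 4 else chs[i] * 2
            dec.append(nn.Sequential(
                nn.ConvTranspose2d(in_c, chs[i - 1], 4, 2, 1),
                nn.BatchNorm2d(chs[i - 1]),
                nn.ReLU(inplace=True)))
        self.dec = nn.ModuleList(dec)
        self.out = nn.Sequential(               # D1: upsample to the image
            nn.ConvTranspose2d(chs[0], in_ch, 4, 2, 1),
            nn.Tanh())

    def forward(self, x):
        feats = []
        for e in self.enc:                      # collect phi_1..phi_5
            x = e(x)
            feats.append(x)
        y = feats[-1]
        for i, d in enumerate(self.dec):
            # D5 takes phi_5 directly; D4..D2 take [previous output, phi_l]
            y = d(y if i == 0 else torch.cat([y, feats[4 - i]], dim=1))
        return self.out(y), feats

# L1 reconstruction loss used to train the self-encoder
def ae_loss(recon, x):
    return torch.mean(torch.abs(recon - x))
```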
And S4, performing feature fusion on the multi-scale depth features and the face semantic labels of the source mode face images according to the facial structure characteristics to obtain the multi-scale semantic information depth features of the source mode face images.
And S41, extracting the face semantic label of the face image in the source mode.
In a specific embodiment, a semantic segmentation model BiSeNet trained on the CelebA-HQ database is used to extract the face semantic labels of the source-modality face image, denoted {m_i, i = 1, ..., N}, where N denotes the total number of face-component labels; in this embodiment N = 19.
S42, selecting a face area to be enhanced from the face semantic labels according to the facial structure characteristics, and constructing a multi-scale semantic mask set of the source mode face image by utilizing the face area to be enhanced.
Specifically, the facial regions to be enhanced include the details around the facial features, such as the double eyelids around the eyes. In this embodiment, the facial skin region, the left ear region, the right ear region and the neck region outside the facial features are selected as the facial regions to be enhanced, and these regions are denoted m_1, m_7, m_8 and m_14. After the facial regions to be enhanced are selected, they are used to construct the binary face-semantic-label mask M_x of the source-modality face image I_x, i.e. the mask value is 1 inside the regions to be enhanced and 0 elsewhere. The binary mask M_x is then transformed by nearest-neighbour interpolation to obtain the multi-scale semantic mask set {M_x^s, s ∈ {128, 64, 32}} with sizes 128 × 128, 64 × 64 and 32 × 32.
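A hedged sketch of this mask construction is given below; it assumes that the face-parsing label map uses CelebAMask-HQ-style numeric region ids 1, 7, 8 and 14 for the facial skin, left ear, right ear and neck regions, which may differ from the label mapping actually used.

```python
# Build the binary mask over the regions to be enhanced and resize it to
# 128/64/32 with nearest-neighbour interpolation; label ids are assumptions.
import torch
import torch.nn.functional as F

def build_mask_set(label_map, region_ids=(1, 7, 8, 14), sizes=(128, 64, 32)):
    """label_map: (B, H, W) integer tensor of face semantic labels."""
    mask = torch.zeros_like(label_map, dtype=torch.float32)
    for rid in region_ids:
        mask = torch.where(label_map == rid, torch.ones_like(mask), mask)
    mask = mask.unsqueeze(1)                          # (B, 1, H, W)
    return {s: F.interpolate(mask, size=(s, s), mode='nearest')
            for s in sizes}
```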
S43, selecting target multi-scale depth features from the multi-scale depth features, and performing feature fusion on the target multi-scale depth features and the multi-scale semantic mask set to obtain the multi-scale semantic information depth features.
In this embodiment, the target multi-scale depth features need to contain richer detail information of the face regions. The encoder in the self-encoder is a convolutional neural network in which the first encoding sub-module filters the source-modality face image, the feature maps from the second encoding sub-module onwards learn more image detail, and, as the network deepens, the feature maps shrink and the detail information gradually decreases. Therefore, in this embodiment the multi-scale depth features output by the encoding sub-modules between the first and the last are selected as the target multi-scale depth features. In one embodiment, when the number of encoding sub-modules is 5, the depth features {φ_l(I_x), l = 2, 3, 4} of the source-modality face image output by the second to fourth encoding sub-modules are selected as the target multi-scale depth features.
Further, the target multi-scale depth features are fused with the multi-scale semantic mask set {M_x^s, s ∈ {128, 64, 32}} by element-wise multiplication, each feature being multiplied by the mask of the matching spatial size, to obtain the multi-scale semantic-information depth features.
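The following minimal sketch illustrates this element-wise multiplication; pairing each selected feature with the mask of the same spatial size is the assumption made here for illustration.

```python
# Element-wise multiplication of selected depth features with the semantic
# masks (broadcast over channels); layer-to-mask pairing is by spatial size.
def semantic_info_features(depth_feats, mask_set):
    """depth_feats: dict {layer_index: (B, C, H, W) tensor};
    mask_set: dict {size: (B, 1, size, size) tensor}."""
    fused = {}
    for l, feat in depth_feats.items():
        s = feat.shape[-1]
        if s in mask_set:                 # e.g. layers 2-4 <-> 128/64/32
            fused[l] = feat * mask_set[s]
    return fused
```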
In this embodiment, fusing the multi-scale semantic information with the target depth features enhances the detail features of the contour regions around all the facial features, so that the texture details of the facial structure are captured better. The subsequently generated target-modality face image therefore expresses rich facial content while preserving the shared facial information, retains the accuracy and continuity of the facial structure of the source-modality face image, and is more realistic.
And S5, inputting the primary generated image of the target modal face into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the depth features of the multi-scale semantic information to obtain the generated image of the target modal face.
And S51, inputting the preliminary generated image of the target modal face into an encoder of the generator for down-sampling operation, and performing feature fusion on output features and multi-scale semantic information depth features in the down-sampling operation process to obtain the depth features of the preliminary generated image of the target modal face.
Specifically, the encoder of the generator comprises a plurality of encoding sub-modules which are connected in sequence and used for performing down-sampling operation on the preliminarily generated image of the target modal face; the number of the coding sub-modules is determined according to the requirement, and the deeper feature representation of the image can be obtained when the number of the coding sub-modules is larger.
In this embodiment, the encoder of the generator consists of eight encoding sub-modules connected in sequence. The first encoding sub-module comprises a convolution layer with kernel size 4 and stride 2 and an activation function layer, connected in sequence, which convolve and downsample the input preliminary generated image of the target-modality face. Each of the second to eighth encoding sub-modules comprises a convolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer, connected in sequence. The inputs of the second to fourth encoding sub-modules are the concatenation of the output of the previous encoding sub-module with the semantic-information depth feature of the corresponding scale; the inputs of the fifth to eighth encoding sub-modules are the outputs of their previous sub-modules. The output of the eighth encoding sub-module is the depth feature of the preliminary generated image of the target-modality face.
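For illustration, a PyTorch sketch of such a generator encoder is given below; the channel widths, the channel counts of the concatenated semantic-information depth features (sem_chs), and the activation functions are assumptions, not details fixed by this embodiment.

```python
# Generator encoder sketch: eight 4x4 stride-2 encoding sub-modules;
# sub-modules 2-4 additionally take the semantic-information depth feature
# of matching spatial size. Widths are illustrative assumptions.
import torch
import torch.nn as nn

class GeneratorEncoder(nn.Module):
    def __init__(self, in_ch=3, base=64, sem_chs=(128, 256, 512)):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8,
               base * 8, base * 8, base * 8, base * 8]      # E1..E8 widths
        blocks, prev = [], in_ch
        for i, c in enumerate(chs):
            extra = sem_chs[i - 1] if 1 <= i <= 3 else 0    # E2-E4 only
            layers = [nn.Conv2d(prev + extra, c, 4, 2, 1)]
            if i > 0:
                layers.append(nn.BatchNorm2d(c))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            blocks.append(nn.Sequential(*layers))
            prev = c
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x, sem_feats):
        """x: preliminary target-modality image; sem_feats: the three
        semantic-information depth features, largest spatial size first."""
        skips = []
        for i, blk in enumerate(self.blocks):
            if 1 <= i <= 3:               # concatenate before E2, E3, E4
                x = torch.cat([x, sem_feats[i - 1]], dim=1)
            x = blk(x)
            skips.append(x)
        return x, skips                   # deepest feature + E1..E8 outputs
```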
And S52, inputting the depth features of the preliminary generated image of the target modal face into a decoder of the generator for up-sampling operation, and performing feature fusion on the output features in the up-sampling operation process and the output features corresponding to the down-sampling operation to obtain the generated image of the target modal face.
Specifically, the decoder of the generator comprises a plurality of decoding sub-modules which are connected in sequence and used for performing up-sampling operation on the preliminarily generated image of the target modal face; the number of decoding sub-modules corresponds to the number of encoding sub-modules one to one.
In this embodiment, the decoder of the generator consists of eight decoding sub-modules. The decoding sub-modules numbered 8 to 2 have a similar structure: each comprises a deconvolution layer with kernel size 4 and stride 2, a normalization layer and an activation function layer, connected in sequence. The input of the sub-module numbered 8 is the feature output by the encoder, i.e. the depth feature of the preliminary generated image of the target-modality face; the input of each of the other decoding sub-modules is the concatenation of the output of the previous decoding sub-module with the output of the encoder sub-module of the corresponding number. The decoding sub-module numbered 1, i.e. the last decoding sub-module, comprises a deconvolution layer with kernel size 4 and stride 2 and an activation function layer, connected in sequence, and upsamples the feature map output by the previous layer into the final generated image of the target-modality face.
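A corresponding PyTorch sketch of the generator decoder is given below; the channel widths and the Tanh output activation are assumptions made for illustration.

```python
# Generator decoder sketch: eight 4x4 stride-2 decoding sub-modules; D8
# takes the deepest encoder feature, D7..D2 take skip concatenations, and
# D1 upsamples to the final image. Widths are illustrative assumptions.
import torch
import torch.nn as nn

class GeneratorDecoder(nn.Module):
    def __init__(self, out_ch=3, base=64):
        super().__init__()
        enc_chs = [base, base * 2, base * 4, base * 8,
                   base * 8, base * 8, base * 8, base * 8]  # E1..E8 widths
        blocks = []
        for i in range(7, 0, -1):                           # D8..D2
            in_c = enc_chs[i] if i == 7 else enc_chs[i] * 2
            blocks.append(nn.Sequential(
                nn.ConvTranspose2d(in_c, enc_chs[i - 1], 4, 2, 1),
                nn.BatchNorm2d(enc_chs[i - 1]),
                nn.ReLU(inplace=True)))
        self.blocks = nn.ModuleList(blocks)
        self.out = nn.Sequential(                            # D1
            nn.ConvTranspose2d(enc_chs[0], out_ch, 4, 2, 1),
            nn.Tanh())

    def forward(self, deepest, skips):
        """deepest: the E8 output; skips: list of E1..E8 outputs."""
        y = deepest
        for i, blk in enumerate(self.blocks):
            # D8 takes the deepest feature; D7..D2 take [previous, E_l]
            y = blk(y if i == 0 else torch.cat([y, skips[7 - i]], dim=1))
        return self.out(y)
```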
S6, combining the target mode face image in the source mode-target mode face image pair, judging the distribution similarity degree of the target mode face generated image by using the discriminator in the target generation model, calculating a discriminant loss function according to the distribution similarity degree, and then updating the parameter of the discriminator by using the discriminant loss function.
In this embodiment, the discriminator consists of 5 convolution layers and 5 activation function layers connected alternately, each convolution layer having a kernel size of 4 and a stride of 2.
Specifically, the target-modality face image and the generated target-modality face image are input into the discriminator separately; the discriminator outputs prediction values of 1 and 0 representing real and fake, and these prediction values serve as the distribution similarity of the generated target-modality face image.
The loss function of the discriminator is a binary classification loss, where the source-modality face image I_x is additionally fed to the discriminator as part of its input in order to guide the generation process of the generator more effectively. The specific loss function is:

L_D(θ_D) = -E[ log D(I_x, I_y) ] - E[ log(1 - D(I_x, Î_y)) ]

where θ_D denotes the model parameters of the discriminator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, D denotes the discriminator, and D(I_x, I_y) and D(I_x, Î_y) denote the distribution-similarity outputs.
In this embodiment, the discriminator loss function is minimized and the parameters of the discriminator are updated by stochastic gradient descent, which is prior art and is not described in detail in this embodiment.
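For illustration, a hedged sketch of such a conditional discriminator and its binary-classification loss is given below; the channel widths, the sigmoid output and the binary cross-entropy form of the loss are assumptions consistent with, but not dictated by, the description above.

```python
# Conditional discriminator sketch: five 4x4 stride-2 convolutions
# alternating with five activation layers; the source-modality image is
# concatenated with the real or generated target-modality image.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_ch=6, base=64):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8, 1]
        layers, prev = [], in_ch
        for c in chs:                      # 5 conv layers + 5 activations
            layers.append(nn.Conv2d(prev, c, 4, 2, 1))
            layers.append(nn.Sigmoid() if c == 1
                          else nn.LeakyReLU(0.2, inplace=True))
            prev = c
        self.net = nn.Sequential(*layers)

    def forward(self, src, tgt):
        return self.net(torch.cat([src, tgt], dim=1))

def discriminator_loss(D, src, real_tgt, fake_tgt):
    bce = nn.BCELoss()
    real = D(src, real_tgt)
    fake = D(src, fake_tgt.detach())       # do not backpropagate into G
    return (bce(real, torch.ones_like(real))
            + bce(fake, torch.zeros_like(fake)))
```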
And S7, updating the parameters of the generator by using the generation loss function of the generator, the adversarial loss function between the generator and the discriminator, and the fusion loss function of the generator and the self-encoder.
Specifically, the generation loss function measures the generation loss of the generated target-modality face image. This loss is based on the mean absolute error between the generated target-modality face image and the target-modality face image in the source-modality-target-modality face image pair, so the generation loss function of the generator is:

L_G(θ_G) = || I_y - Î_y ||_1

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, and Î_y denotes the generated target-modality face image.
The adversarial loss of the target-modality face image is designed so that the generated result can deceive the discriminator, i.e. the generator and the discriminator constrain each other adversarially. In this embodiment, the adversarial loss function is computed from the distribution similarity output by the discriminator, and is expressed as:

L_GAN(θ_G) = -E[ log D(I_x, Î_y) ]

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator.
The fusion loss function of the generator and the self-encoder is:

L_f(θ_G, θ_AE) = L_GAN + L_G + L_AE

where L_GAN denotes the adversarial loss function, L_G denotes the generation loss function, L_AE denotes the reconstruction loss function of the self-encoder, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, θ_G denotes the parameters of the generator, and θ_AE denotes the parameters of the self-encoder.
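The following minimal sketch computes the three generator-side losses and their fusion; the equal weighting of the three terms is an assumption made here, since no weighting coefficients are given in the description.

```python
# Generator-side losses: L_G (mean absolute error), L_GAN (fool the
# discriminator), L_AE (self-encoder reconstruction), fused by summation.
import torch
import torch.nn as nn

def generator_losses(D, src, real_tgt, fake_tgt, ae_recon):
    bce = nn.BCELoss()
    l_g = torch.mean(torch.abs(real_tgt - fake_tgt))          # L_G
    pred = D(src, fake_tgt)
    l_gan = bce(pred, torch.ones_like(pred))                  # L_GAN
    l_ae = torch.mean(torch.abs(ae_recon - src))              # L_AE
    fusion = l_gan + l_g + l_ae                               # L_f
    return fusion, {'L_G': l_g, 'L_GAN': l_gan, 'L_AE': l_ae}
```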
Further, the fusion loss function L_f and the discriminator loss function L_D are updated alternately to train all parameters of the model. Specifically, following the adversarial idea of GANs, the parameters of the generator and the discriminator are updated alternately, fixing one side while iteratively updating the other: first, the generator parameters θ_G and θ_AE are fixed and the discriminator parameters θ_D are updated by stochastic gradient descent until the fluctuation of the discriminator loss L_D falls within a small range; then the discriminator parameters θ_D are fixed and the generator parameters θ_G and θ_AE are updated by stochastic gradient descent until the fluctuation of the fusion loss L_f falls within a small range.
And S8, obtaining the trained target generation model when the discriminator and the generator reach an adversarial equilibrium.
Specifically, through the alternate training of the generator and the discriminator, when the two reach a certain adversarial equilibrium, i.e. the target generation model has converged, the training of the target generation model is complete and the trained target generation model is obtained. The adversarial equilibrium manifests itself as follows: when the generator is fixed and the discriminator is updated, the discriminator loss no longer changes noticeably; conversely, when the discriminator is fixed and the generator is updated, the fusion loss no longer changes noticeably.
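A hedged sketch of this alternating update scheme, reusing the loss helpers sketched above, is given below; the optimiser choice, the learning rate and the decision to take a single step per side per batch are assumptions, and G denotes a generator composed of the encoder and decoder sketched above, AE the self-encoder and D the discriminator.

```python
# Alternating updates: fix G/AE and update D, then fix D and update G/AE.
import torch

def train(G, AE, D, loader, epochs=100, lr=2e-4):
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_g = torch.optim.Adam(list(G.parameters()) + list(AE.parameters()),
                             lr=lr)
    for _ in range(epochs):
        for src, real_tgt, preliminary, sem_feats in loader:
            # 1) fix the generator side, update the discriminator
            fake_tgt = G(preliminary, sem_feats)
            loss_d = discriminator_loss(D, src, real_tgt, fake_tgt)
            opt_d.zero_grad()
            loss_d.backward()
            opt_d.step()
            # 2) fix the discriminator, update generator and self-encoder
            fake_tgt = G(preliminary, sem_feats)
            ae_recon, _ = AE(src)
            loss_f, _ = generator_losses(D, src, real_tgt, fake_tgt, ae_recon)
            opt_g.zero_grad()
            loss_f.backward()
            opt_g.step()
```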
And S9, performing cross-modal face generation on the source modal face image to be processed by using the trained target generation model.
And S91, converting the source mode face image to be processed into a target mode face primary generation image.
S92, performing depth feature extraction on the source mode face image to be processed to obtain the multi-scale depth feature of the source mode face image to be processed.
And S93, performing feature fusion on the multi-scale depth features and the face semantic labels of the source mode face images according to the facial structure characteristics to obtain the multi-scale semantic information depth features of the source mode face images.
And S94, inputting the primary target modal face generated image into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the multi-scale semantic information depth features to obtain a target modal face generated image.
Please refer to fig. 1 and steps S2 to S5 for detailed operation steps of the above steps, which are not described herein again.
The cross-modal face image generation method supervises the cross-modal generation process with the multi-scale semantic-information depth features. As a result, the generated target-modality face image expresses rich facial content while preserving the shared facial information, the detail information around the contours of the facial features (such as the eye corners and eyelids) is markedly enhanced, the texture details of the facial structure are captured more effectively, the accuracy and continuity of the facial structure of the source-modality face image are preserved, and the realism of the generated image is further improved.
Example two
On the basis of the first embodiment, the effect of the cross-modal face image generation method based on multi-scale semantic information supervision is further explained through a simulation experiment.
(1) Simulation conditions
The simulation experiments are carried out with the PyTorch framework on a platform with an Intel(R) Core(TM) i7-8700K 3.70 GHz CPU as the central processing unit, an NVIDIA GeForce RTX 2080 Ti GPU, and the Linux Mint 18.3 Sylvia operating system. The CUFS database is used as the image database.
The following methods are used in the simulation experiments. Method 1: the method based on the probability graph model, denoted MWF in the experiments; Method 2: the method based on conditional generative adversarial networks, denoted pix2pix; Method 3: the method based on a multi-scale adversarial network, denoted PS2MAN; Method 4: the method based on knowledge transfer, denoted KT; Method 5: the method assisted by face components, denoted SCA-GAN. Methods 1 to 5 are all prior art and are not described in detail in this embodiment.
(2) Emulated content
Source-modality face images and the target-modality face images of the same subjects are selected from the test set of the database to form source-modality-target-modality face image data pairs, and cross-modal generation is performed with the cross-modal face image generation method based on multi-scale semantic information supervision of this embodiment and with the 5 prior-art methods, respectively.
Referring to fig. 2, fig. 2 is a comparison diagram of the simulation results provided by the embodiment of the present invention, where Inputs are the input source-modality face images, Outputs are the results of the cross-modal face image generation method based on multi-scale semantic information supervision of this embodiment, and GroundTruth are the real target-modality face images corresponding to the source-modality face images in the data pairs. As can be seen from fig. 2, the cross-modal face images generated by the method of the embodiment of the present invention retain the facial detail characteristics of the person, and the generated images have good structural integrity and consistency.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (9)

1. A cross-mode face image generation method based on multi-scale semantic information supervision is characterized by comprising the following steps:
s1, converting the source mode face image to be processed into a target mode face primary generation image;
s2, performing depth feature extraction on the source mode face image to be processed to obtain multi-scale depth features;
s3, performing feature fusion on the multi-scale depth features and the face semantic labels of the source modal face images according to the facial structure characteristics to obtain multi-scale semantic information depth features;
and S4, inputting the primary target modal face generated image into a generator of a target generation model to sequentially perform feature space coding and feature space decoding, and simultaneously performing auxiliary supervision by using the multi-scale semantic information depth features to obtain a target modal face generated image.
2. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 1, wherein step S1 is preceded by the steps of:
a number of source modality-target modality face image pairs are acquired.
3. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 2, wherein the step S2 comprises:
inputting the source mode face images in the source mode-target mode face image pairs into a self-encoder for reconstruction to obtain reconstructed source mode face images;
calculating a reconstruction loss function of the source modal face image and the reconstructed source modal face image, and training the self-encoder by using the reconstruction loss function to obtain a trained self-encoder;
and carrying out deep feature extraction on the source mode face image to be processed by utilizing the trained self-encoder to obtain the multi-scale depth features.
4. The multi-scale semantic information supervision-based cross-modal face image generation method of claim 3, wherein the reconstruction loss function is:

L_AE(θ_AE) = || Î_x - I_x ||_1

where θ_AE denotes the model parameters of the self-encoder, Î_x denotes the reconstructed source-modality face image, and I_x denotes the input source-modality face image.
5. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 1, wherein the step S3 comprises:
s31, extracting the face semantic label of the source mode face image;
s32, selecting a face area to be enhanced from the face semantic labels according to the facial structure characteristics, and constructing a multi-scale semantic mask set of the source modal face image by using the face area to be enhanced;
s33, selecting target multi-scale depth features from the multi-scale depth features, and performing feature fusion on the target multi-scale depth features and the multi-scale semantic mask set to obtain the multi-scale semantic information depth features.
6. The cross-modal facial image generation method based on multi-scale semantic information supervision as claimed in claim 5, wherein the facial regions to be enhanced comprise a facial skin region, a left ear region, a right ear region and a neck region.
7. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 1, wherein the step S4 comprises:
s41, inputting the preliminary generated image of the target modal face into an encoder of the generator for down-sampling operation, and performing feature fusion on output features in the down-sampling operation process and the depth features of the multi-scale semantic information to obtain the depth features of the preliminary generated image of the target modal face;
and S42, inputting the depth features of the preliminary generated image of the target modal face into a decoder of the generator for up-sampling operation, and performing feature fusion on the output features in the up-sampling operation process and the output features corresponding to the down-sampling operation to obtain the generated image of the target modal face.
8. The method for generating a cross-modal facial image based on multi-scale semantic information supervision as claimed in claim 2, further comprising the following steps after step S4:
s5, combining the target modal face image in the source modal-target modal face image pair, judging the distribution similarity degree of the target modal face generated image by using a discriminator in the target generation model, calculating a discriminant loss function according to the distribution similarity degree, and then updating the parameter of the discriminator by using the discriminant loss function;
s6, updating the parameters of the generator by using the generation loss function of the generator, the adversarial loss function between the generator and the discriminator, and the fusion loss function of the generator and the self-encoder;
and S7, obtaining the trained target generation model when the discriminator and the generator reach an adversarial equilibrium.
9. The multi-scale semantic information supervision-based cross-modal face image generation method of claim 8, wherein the discriminator loss function is:

L_D(θ_D) = -E[ log D(I_x, I_y) ] - E[ log(1 - D(I_x, Î_y)) ]

where θ_D denotes the parameters of the discriminator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator;

the generation loss function is:

L_G(θ_G) = || I_y - Î_y ||_1

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, and Î_y denotes the generated target-modality face image;

the adversarial loss function is:

L_GAN(θ_G) = -E[ log D(I_x, Î_y) ]

where θ_G denotes the parameters of the generator, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, and D denotes the discriminator;

the fusion loss function is:

L_f(θ_G, θ_AE) = L_GAN + L_G + L_AE

where L_GAN denotes the adversarial loss function, L_G denotes the generation loss function, L_AE denotes the reconstruction loss function of the self-encoder, I_y denotes the target-modality face image, I_x denotes the source-modality face image, Î_y denotes the generated target-modality face image, θ_G denotes the parameters of the generator, and θ_AE denotes the parameters of the self-encoder.
CN202110218611.3A 2021-02-26 2021-02-26 Cross-modal face image generation method based on multi-scale semantic information supervision Active CN112949707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110218611.3A CN112949707B (en) 2021-02-26 2021-02-26 Cross-modal face image generation method based on multi-scale semantic information supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110218611.3A CN112949707B (en) 2021-02-26 2021-02-26 Cross-modal face image generation method based on multi-scale semantic information supervision

Publications (2)

Publication Number Publication Date
CN112949707A true CN112949707A (en) 2021-06-11
CN112949707B CN112949707B (en) 2024-02-09

Family

ID=76246481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110218611.3A Active CN112949707B (en) 2021-02-26 2021-02-26 Cross-modal face image generation method based on multi-scale semantic information supervision

Country Status (1)

Country Link
CN (1) CN112949707B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409377A (en) * 2021-06-23 2021-09-17 四川大学 Phase unwrapping method for generating countermeasure network based on jump connection
CN114187408A (en) * 2021-12-15 2022-03-15 中国电信股份有限公司 Three-dimensional face model reconstruction method and device, electronic equipment and storage medium
WO2023280065A1 (en) * 2021-07-09 2023-01-12 南京邮电大学 Image reconstruction method and apparatus for cross-modal communication system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688821A (en) * 2017-07-11 2018-02-13 西安电子科技大学 View-based access control model conspicuousness and across the modality images natural language description methods of semantic attribute
EP3511942A2 (en) * 2018-01-16 2019-07-17 Siemens Healthcare GmbH Cross-domain image analysis and cross-domain image synthesis using deep image-to-image networks and adversarial networks
CN110675316A (en) * 2019-08-29 2020-01-10 中山大学 Multi-domain image conversion method, system and medium for generating countermeasure network based on condition
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism
CN112270644A (en) * 2020-10-20 2021-01-26 西安工程大学 Face super-resolution method based on spatial feature transformation and cross-scale feature integration

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688821A (en) * 2017-07-11 2018-02-13 西安电子科技大学 View-based access control model conspicuousness and across the modality images natural language description methods of semantic attribute
EP3511942A2 (en) * 2018-01-16 2019-07-17 Siemens Healthcare GmbH Cross-domain image analysis and cross-domain image synthesis using deep image-to-image networks and adversarial networks
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change
CN110675316A (en) * 2019-08-29 2020-01-10 中山大学 Multi-domain image conversion method, system and medium for generating countermeasure network based on condition
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism
CN112270644A (en) * 2020-10-20 2021-01-26 西安工程大学 Face super-resolution method based on spatial feature transformation and cross-scale feature integration

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
柳欣; 李鹤洋; 钟必能; 杜吉祥: "Cross audio-visual speaker annotation with a supervised joint-consistency autoencoder" (结合有监督联合一致性自编码器的跨音视频说话人标注), Journal of Electronics & Information Technology (电子与信息学报), no. 07
魏?; 孙硕: "Research on algorithms for perceptually occluded face restoration with generative adversarial networks" (生成对抗网络进行感知遮挡人脸还原的算法研究), Journal of Chinese Computer Systems (小型微型计算机系统), no. 02
黄菲; 高飞; 朱静洁; 戴玲娜; 俞俊: "Heterogeneous face image synthesis based on generative adversarial networks: progress and challenges" (基于生成对抗网络的异质人脸图像合成:进展与挑战), Journal of Nanjing University of Information Science & Technology (Natural Science Edition) (南京信息工程大学学报(自然科学版)), no. 06

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409377A (en) * 2021-06-23 2021-09-17 四川大学 Phase unwrapping method for generating countermeasure network based on jump connection
CN113409377B (en) * 2021-06-23 2022-09-27 四川大学 Phase unwrapping method for generating countermeasure network based on jump connection
WO2023280065A1 (en) * 2021-07-09 2023-01-12 南京邮电大学 Image reconstruction method and apparatus for cross-modal communication system
US11748919B2 (en) 2021-07-09 2023-09-05 Nanjing University Of Posts And Telecommunications Method of image reconstruction for cross-modal communication system and device thereof
CN114187408A (en) * 2021-12-15 2022-03-15 中国电信股份有限公司 Three-dimensional face model reconstruction method and device, electronic equipment and storage medium
CN114187408B (en) * 2021-12-15 2023-04-07 中国电信股份有限公司 Three-dimensional face model reconstruction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112949707B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN108520503B (en) Face defect image restoration method based on self-encoder and generation countermeasure network
US11276231B2 (en) Semantic deep face models
CN112949707A (en) Cross-mode face image generation method based on multi-scale semantic information supervision
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN111932444A (en) Face attribute editing method based on generation countermeasure network and information processing terminal
CN111192201B (en) Method and device for generating face image and training model thereof, and electronic equipment
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
WO2023065503A1 (en) Facial expression classification method and electronic device
Sreekala et al. Capsule Network‐Based Deep Transfer Learning Model for Face Recognition
CN113553961B (en) Training method and device of face recognition model, electronic equipment and storage medium
CN116433898A (en) Method for segmenting transform multi-mode image based on semantic constraint
CN113392791A (en) Skin prediction processing method, device, equipment and storage medium
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
CN115147261A (en) Image processing method, device, storage medium, equipment and product
Gao A method for face image inpainting based on generative adversarial networks
Zhou et al. A superior image inpainting scheme using Transformer-based self-supervised attention GAN model
CN114049290A (en) Image processing method, device, equipment and storage medium
Liu et al. Learning shape and texture progression for young child face aging
CN113762117A (en) Training method of image processing model, image processing model and computer equipment
CN115631285B (en) Face rendering method, device, equipment and storage medium based on unified driving
Xu et al. Correlation via synthesis: end-to-end nodule image generation and radiogenomic map learning based on generative adversarial network
WO2023173827A1 (en) Image generation method and apparatus, and device, storage medium and computer program product
CN112990123B (en) Image processing method, apparatus, computer device and medium
Wu et al. Voice2mesh: Cross-modal 3d face model generation from voices
Xu et al. Correlation via synthesis: End-to-end image generation and radiogenomic learning based on generative adversarial network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant