CN115375596A - Face photo-sketch portrait synthesis method based on two-way conditional normalization - Google Patents

Face photo-sketch portrait synthesis method based on two-way conditional normalization

Info

Publication number: CN115375596A
Application number: CN202210885729.6A
Authority: CN (China)
Prior art keywords: module, sketch, face photo, synthesis
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: Wang Nannan (王楠楠), Wu Zicheng (吴子成), Zhu Mingrui (朱明瑞), Yi Yun (易云), He Xiao (何潇)
Current and Original Assignee: Xidian University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Xidian University
Priority to CN202210885729.6A
Publication of CN115375596A

Classifications

    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/806 Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06T2207/20081 Training; learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20221 Image fusion; image merging
    • G06T2207/30201 Face

Abstract

The invention discloses a face photo-sketch portrait synthesis method based on two-way conditional normalization, comprising: acquiring a face photo-sketch image pair to be synthesized; and inputting the face photo-sketch image pair to be synthesized into a trained face photo-sketch synthesis network to obtain a synthesis result. The face photo-sketch synthesis network comprises an encoder, a generator and a semantic segmentation module, wherein the generator comprises a two-way normalization module, a gated attention feature fusion module and a decoder. The trained face photo-sketch synthesis network is obtained by training on a face photo-sketch pair training set. The invention improves the quality of the face photo-sketch synthesis results.

Description

Face photo-sketch portrait synthesis method based on two-way conditional normalization
Technical Field
The invention belongs to the technical field of artificial intelligence and image processing, and particularly relates to a face photo-sketch synthesis method based on two-way conditional normalization.
Background
With the rapid development of computer vision, face photo-sketch synthesis has gradually become a research hotspot. Face photo-sketch synthesis is the process of synthesizing a face sketch from a face photo, or a face photo from a face sketch. Through this conversion, photo information and sketch information originally lying in different domains can be brought into the same domain. Because manually drawing a face sketch requires a professional artist and a large amount of time, computer image processing algorithms that automatically synthesize a face sketch from a face photo have high application value.
Early generation methods were mostly example-based and unidirectional, from photo to sketch. These methods use the idea of nearest-neighbor patch matching, but the generated results often contain a large amount of blurring and artifacts. Zhang et al., in "L. Zhang, L. Lin, X. Wu, S. Ding, and L. Zhang, 'End-to-end photo-sketch generation via fully convolutional representation learning,' in Proceedings of the 5th ACM International Conference on Multimedia Retrieval, 2015, pp. 627-634," propose an end-to-end fully convolutional network (FCN) to directly learn the mapping between a face photo and a sketch. Isola et al., in "P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, 'Image-to-image translation with conditional adversarial networks,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125-1134," propose a general method named pix2pix for image-to-image translation on paired datasets. Zhu et al., in "J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, 'Unpaired image-to-image translation using cycle-consistent adversarial networks,' in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223-2232," propose the CycleGAN framework, which forms a general mapping from domain A to domain B, learning how to translate between two domains rather than being tied to a specific image pair; it can be trained on unpaired datasets and has strong adaptability. Chen et al., in "C. Chen, W. Liu, X. Tan, and K.-Y. K. Wong, 'Semi-supervised learning for face sketch synthesis in the wild,' in Asian Conference on Computer Vision. Springer, 2018, pp. 216-231," propose a semi-supervised learning method that extends photo-sketch pairs by constructing pseudo-sketch features for additional training photos. Yu et al., in "J. Yu, X. Xu, F. Gao, S. Shi, M. Wang, D. Tao, and Q. Huang, 'Toward realistic face photo-sketch synthesis via composition-aided GANs,' IEEE Transactions on Cybernetics, vol. 51, no. 9, pp. 4350-4362, 2020," propose using facial composition information as supplementary input and introducing a compositional loss to concentrate training on specific facial parts. Nie et al., in "L. Nie, L. Liu, Z. Wu, and W. Kang, 'Unconstrained face sketch synthesis via perception-adaptive network and a new benchmark,' Neurocomputing, vol. 494, pp. 192-202, 2022," propose face sketch synthesis based on a perception-adaptive network under both constrained and unconstrained conditions.
However, the above conventional methods do not encode sufficient texture and spatial information during training, and the visual quality of the generated images suffers from defects such as blurring, artifacts, a lack of detail, and a missing sketch-stroke feel.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a face photo-sketch synthesis method based on two-way conditional normalization. The technical problem to be solved by the invention is realized by the following technical scheme:
An embodiment of the invention provides a face photo-sketch synthesis method based on two-way conditional normalization, comprising the following steps:
acquiring a face photo-sketch image pair to be synthesized;
inputting the face photo-sketch image pair to be synthesized into a trained face photo-sketch synthesis network to obtain a synthesis result;
wherein the face photo-sketch synthesis network comprises an encoder, a generator and a semantic segmentation module, the generator comprising a two-way normalization module, a gated attention feature fusion module and a decoder; the trained face photo-sketch synthesis network is obtained by training on a face photo-sketch pair training set; the corresponding training process comprises the following steps:
the encoder encodes the face photo-sketch pair training set and outputs depth features; the semantic segmentation module extracts the semantic labels of the sketches in the training set; the two-way normalization module performs enhancement along a spatial information branch and a texture information branch according to the semantic labels and the depth features; the gated attention feature fusion module fuses the outputs of the spatial information branch and the texture information branch; the decoder decodes the fusion result and outputs the synthesis results corresponding to the face photo-sketch pair training set; a loss function of the face photo-sketch synthesis network is constructed from the face photo-sketch pair training set and the corresponding synthesis results; and the parameters of the face photo-sketch synthesis network are updated according to the loss function and training continues until an iteration stop condition is met, yielding the trained face photo-sketch synthesis network.
In one embodiment of the invention, the encoder comprises a plurality of sequentially connected convolutional layers;
in the encoder, the outputs of the last three convolutional layers of the encoder are used as the depth features.
In one embodiment of the invention, the two-way normalization module comprises a SPADE Resblock module and an AdaIN Resblock module, wherein
the SPADE Resblock module performs spatial information enhancement according to the semantic labels of the sketches in the training set and the depth features of the face photos in the training set;
and the AdaIN Resblock module performs texture information enhancement according to the depth features of the sketches and face photos in the training set.
In an embodiment of the present invention, the AdaIN Resblock module includes a plurality of residual AdaIN modules connected in sequence; each residual AdaIN module comprises a basic AdaIN module and a residual module which are sequentially connected.
In one embodiment of the invention, the output of the SPADE Resblock module is represented as:

A_i^{c,y,x} = \gamma_i^{c,y,x}(M_s) \cdot \frac{F_p^{i,c,y,x} - \mu_c(F_p^i)}{\sigma_c(F_p^i)} + \beta_i^{c,y,x}(M_s)

wherein p denotes a face photo, A_i denotes the output of the SPADE Resblock module corresponding to the i-th (i = 1,2,3) layer depth feature, F_p^i denotes the depth feature corresponding to the face photo p at the i-th (i = 1,2,3) layer, c denotes the channel, (y, x) denotes a position of F_p^i on channel c, F_p^{i,c,y,x} denotes the value of F_p^i at point (y, x) on the c-th channel, M_s denotes the semantic label, \gamma_i^{c,y,x}(M_s) and \beta_i^{c,y,x}(M_s) respectively denote the scale and offset learned from the semantic label M_s on the c-th channel, and \mu_c(F_p^i) and \sigma_c(F_p^i) respectively denote the mean and variance of F_p^i on the c-th channel.
In one embodiment of the invention, the output of the AdaIN Resblock module is represented as:

B_i^{c,y,x} = \sigma_c(F_p^i) \cdot \frac{F_s^{i,c,y,x} - \mu_c(F_s^i)}{\sigma_c(F_s^i)} + \mu_c(F_p^i)

wherein s denotes a sketch, B_i denotes the output of the AdaIN Resblock module corresponding to the i-th (i = 1,2,3) layer depth feature, F_s^i denotes the depth feature corresponding to the sketch s at the i-th (i = 1,2,3) layer, F_s^{i,c,y,x} denotes the value of F_s^i on the c-th channel at the same point (y, x) as in F_p^i, and \mu_c(F_s^i) and \sigma_c(F_s^i) respectively denote the mean and variance of F_s^i on the c-th channel.
In one embodiment of the invention, the gated attention feature fusion module comprises two gating modules and a channel attention module, wherein
the outputs of the SPADE Resblock module and the AdaIN Resblock module are respectively input into the two gating modules;
the outputs of the SPADE Resblock module and the AdaIN Resblock module are summed and then input into the channel attention module;
and the outputs of the two gating modules and the output of the channel attention module are fused.
In one embodiment of the present invention, the fusion of the outputs of the two gating modules and the output of the channel attention module is represented as:

C_i = \mathrm{CA}(A_i + B_i) \odot \left( G_{A_i}(A_i) \odot A_i \right) \oplus \left( 1 - \mathrm{CA}(A_i + B_i) \right) \odot \left( G_{B_i}(B_i) \odot B_i \right)

wherein C_i denotes the output of the gated attention feature fusion module corresponding to the i-th (i = 1,2,3) layer depth feature, CA(·) denotes the channel attention function, G_{A_i}(·) denotes the gating function corresponding to A_i, G_{B_i}(·) denotes the gating function corresponding to B_i, ⊙ denotes element-wise multiplication, and ⊕ denotes element-wise addition.
In one embodiment of the invention, the decoder is an AFF-module-based decoder;
in the decoder, the outputs of the gated attention feature fusion module are up-sampled to features of the same resolution, fused using AFF modules, and decoded for output.
In one embodiment of the present invention, the constructed loss function of the face photo-sketch synthesis network is represented as:

L_{full} = \lambda_1 L_{adversarial} + \lambda_2 L_{cycle} + \lambda_3 L_{perceptual}

wherein L_{full} denotes the loss function of the face photo-sketch synthesis network; L_{adversarial} denotes the generative adversarial loss, L_{cycle} denotes the cycle-consistency loss, L_{perceptual} denotes the perceptual loss, and \lambda_1, \lambda_2, \lambda_3 denote balance parameters.
The invention has the beneficial effects that:
The invention provides a face photo-sketch synthesis method based on two-way conditional normalization, built on a face photo-sketch synthesis network comprising a two-way normalization module and a gated attention feature fusion module. The two branches of the two-way normalization module fully encode texture and spatial information, strengthening the learning of spatial and texture information so that more realistic edge and detail features are preserved. The gated attention feature fusion module screens and fuses the useful information of the two branches, avoiding information redundancy to a certain extent. Together, the two paths improve the quality of the face photo-sketch synthesis results; in a user voting survey, the images synthesized by the embodiment of the invention gave users a better subjective impression and a better user experience.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
FIG. 1 is a schematic flow chart of a face photo-sketch portrait synthesis method based on two-way conditional normalization according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a face photo-sketch synthesis network according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a two-way conditional normalization module according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a gated attention feature fusion module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the training process of a face photo-sketch synthesis network according to an embodiment of the present invention;
FIG. 6 is a schematic comparison of the sketch synthesis results of the method of the present invention and 6 existing methods on the CUFS, CUFSF and WildSketch datasets respectively;
FIG. 7 is a schematic comparison of the face photo synthesis results of the method of the present invention and 3 existing methods on the CUFS and CUFSF datasets respectively;
FIG. 8 is a schematic comparison of the satisfaction voting results from user surveys of the face photo synthesis results of the method of the present invention and 3 existing methods on the CUFS and CUFSF datasets;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
In order to improve the quality of the face photo-sketch synthesis results, referring to fig. 1, an embodiment of the present invention provides a face photo-sketch synthesis method based on two-way conditional normalization, which specifically comprises the following steps:
and S10, acquiring a picture-pixel drawing image pair of the human face to be synthesized.
Specifically, in the embodiment of the present invention, for the face photo-sketch image pair to be synthesized: if a face photo is to be synthesized from a sketch, a reference face photo is randomly selected, and the reference face photo and the sketch form the face photo-sketch image pair to be synthesized; if a sketch is to be synthesized from a face photo, a reference sketch is randomly selected, and the face photo and the reference sketch form the face photo-sketch image pair to be synthesized. The face photo-sketch synthesis method can therefore realize both sketch synthesis and face photo synthesis. The randomly selected reference face photo or reference sketch provides a large amount of texture and spatial prior information, improving the quality of the final face photo-sketch synthesis results.
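For illustration only, this pairing step can be sketched as below; the function name, the file-list inputs and the direction flags are hypothetical and not part of the claimed method.

```python
import random
from typing import List, Tuple
from PIL import Image

def build_pair_to_synthesize(query_path: str, direction: str,
                             reference_paths: List[str]) -> Tuple[Image.Image, Image.Image]:
    """Pair the query image with a randomly selected reference from the
    opposite domain; the reference supplies texture and spatial priors."""
    query = Image.open(query_path).convert("RGB")
    reference = Image.open(random.choice(reference_paths)).convert("RGB")
    if direction == "sketch2photo":      # synthesize a photo from a sketch
        return reference, query          # (reference photo p_ref, sketch s)
    if direction == "photo2sketch":      # synthesize a sketch from a photo
        return query, reference          # (photo p, reference sketch s_ref)
    raise ValueError(f"unknown direction: {direction!r}")
```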
S20, inputting the face photo-sketch image pair to be synthesized into the trained face photo-sketch synthesis network to obtain the synthesis result.
Specifically, referring to fig. 2, the face photo-sketch synthesis network provided in the embodiment of the present invention comprises an encoder, a generator and a semantic segmentation module, where the generator comprises a two-way normalization module, a gated attention feature fusion module and a decoder. The same network structure is used during synthesis and during training, and likewise for sketch synthesis and face photo synthesis; only the inputs and outputs differ, while the processing in the network is similar. The following description therefore does not distinguish between samples to be synthesized and training samples. For each face photo-sketch pair, whether in a training set or to be synthesized, the specific design of each part in fig. 2 is as follows:
the encoder comprises a plurality of convolution layers which are connected in sequence, and multi-scale depth features are extracted through a depth convolution network; specifically, in the encoder, the output corresponding to the last three convolutional layers of the encoder is used as the depth feature. Specifically, the method comprises the following steps:
an input sketch image s is propagated in the forward direction through an encoder composed of a series of convolution layers. Outputting depth features of different resolutions at the last three levels of the encoder as
Figure BDA0003765580460000071
Where i =1,2,3 represents three levels of output characteristics, c i 、h i 、w i Respectively showing the channel number, height and width of the ith layer characteristic diagram.
The input face picture p is subjected to forward propagation through the same encoder (encoder sharing parameter) as that for the sketch image s. Outputting depth features of different resolutions at the last three levels of the encoder as
Figure BDA0003765580460000072
Where i =1,2,3 represents three levels of output characteristics, c i 、h i 、w i Respectively showing the channel number, height and width of the ith layer characteristic diagram.
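Purely as an illustrative sketch (the patent does not fix the number of layers or channel widths, so those below are assumptions), a shared-parameter encoder that returns the last three levels could look like:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stack of strided conv layers; the outputs of the last three levels
    serve as the multi-scale depth features F^1, F^2, F^3."""
    def __init__(self, in_ch: int = 3, widths=(64, 128, 256, 512)):
        super().__init__()
        blocks, prev = [], in_ch
        for w in widths:
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.InstanceNorm2d(w),
                nn.ReLU(inplace=True)))
            prev = w
        self.levels = nn.ModuleList(blocks)

    def forward(self, x: torch.Tensor):
        feats = []
        for level in self.levels:
            x = level(x)
            feats.append(x)
        return feats[-3:]  # depth features of the last three levels

# The sketch s and the photo p pass through the same (parameter-shared) encoder:
# F_s1, F_s2, F_s3 = encoder(s); F_p1, F_p2, F_p3 = encoder(p)
```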
Here, if a face photo is to be synthesized from a sketch s, the face photo p in fig. 2 is a randomly selected reference face photo p_ref; if a sketch is to be synthesized from a face photo p, the sketch s in fig. 2 is a randomly selected reference sketch s_ref. Likewise, during training, when training the face photo-sketch synthesis network to synthesize face photos from sketches s, the face photo p in fig. 2 is a random reference face photo p_ref from the training set; when training the face photo-sketch synthesis network to synthesize sketches from face photos p, the sketch s in fig. 2 is a random reference sketch s_ref from the training set.
Further, the embodiment of the invention performs two-way normalization on the extracted depth features F_s^i and F_p^i. The two-way normalization operation comprises two conditional normalization branches in total, explicitly decomposing the overall mapping into two independent mappings that respectively strengthen the learning of spatial information and texture information. Referring to fig. 3, the embodiment of the present invention designs a two-way normalization module following this design concept; the two-way normalization module comprises two branches formed by a SPADE Resblock module and an AdaIN Resblock module, wherein
the first, spatial information branch is realized by the SPADE Resblock module, which performs spatial information enhancement according to the semantic labels of the sketches in the training set and the depth features of the face photos in the training set; and the second, texture information branch is realized by the AdaIN Resblock module, which performs texture information enhancement according to the depth features of the sketches and face photos in the training set.
For the first, spatial information branch, the SPADE Resblock module of the embodiment of the present invention can directly use the existing SPADE Resblock module, described in detail in "T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, 'Semantic image synthesis with spatially-adaptive normalization,' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2337-2346," and not described again here; the network structure of the SPADE Resblock module is shown in fig. 3. The embodiment of the invention extracts the semantic label M_s of the input sketch s using the semantic segmentation model BiSeNet pre-trained on the CelebA-HQ database, and then inputs F_p^i and M_s into the SPADE Resblock module, where F_p^i is modulated at the channel level by the scale and offset learned from the semantic label M_s, strengthening the learning of spatial information. The output of the final SPADE Resblock module is expressed as:
A_i^{c,y,x} = \gamma_i^{c,y,x}(M_s) \cdot \frac{F_p^{i,c,y,x} - \mu_c(F_p^i)}{\sigma_c(F_p^i)} + \beta_i^{c,y,x}(M_s)    (1)

wherein p denotes a face photo, A_i denotes the output of the SPADE Resblock module corresponding to the i-th (i = 1,2,3) layer depth feature, F_p^i denotes the depth feature corresponding to the face photo p at the i-th (i = 1,2,3) layer, c denotes the channel, (y, x) denotes a position of F_p^i on channel c, F_p^{i,c,y,x} denotes the value of F_p^i at point (y, x) on the c-th channel, M_s denotes the semantic label, \gamma_i^{c,y,x}(M_s) and \beta_i^{c,y,x}(M_s) respectively denote the scale and offset learned from the semantic label M_s on the c-th channel, and \mu_c(F_p^i) and \sigma_c(F_p^i) respectively denote the mean and variance of F_p^i on the c-th channel, expressed as:

\mu_c(F_p^i) = \frac{1}{H_i W_i} \sum_{y=1}^{H_i} \sum_{x=1}^{W_i} F_p^{i,c,y,x}    (2)

\sigma_c(F_p^i) = \sqrt{ \frac{1}{H_i W_i} \sum_{y=1}^{H_i} \sum_{x=1}^{W_i} \left( F_p^{i,c,y,x} - \mu_c(F_p^i) \right)^2 }    (3)

wherein H_i and W_i respectively denote the height and width of F_p^i.
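A minimal PyTorch sketch of the modulation in equation (1) follows; the hidden width of the label branch and the 3×3 kernels are assumptions borrowed from the cited SPADE paper, and the residual (1 + γ) form used there is noted in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADELayer(nn.Module):
    """Normalize F_p per channel and modulate it with scale/offset maps
    predicted from the semantic label M_s, as in equation (1)."""
    def __init__(self, feat_ch: int, label_ch: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)  # (F_p - mu_c) / sigma_c
        self.shared = nn.Sequential(
            nn.Conv2d(label_ch, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.gamma = nn.Conv2d(hidden, feat_ch, 3, padding=1)  # gamma(M_s)
        self.beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)   # beta(M_s)

    def forward(self, f_p: torch.Tensor, m_s: torch.Tensor) -> torch.Tensor:
        m_s = F.interpolate(m_s, size=f_p.shape[2:], mode="nearest")
        h = self.shared(m_s)
        # equation (1) writes gamma * normalized + beta; the cited SPADE
        # paper uses (1 + gamma) * normalized + beta as a residual variant
        return self.norm(f_p) * self.gamma(h) + self.beta(h)
```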
However, since the semantic label M_s carries only semantic information, using this branch alone inevitably loses fine texture information. Therefore, the embodiment of the invention designs another branch, in which an AdaIN Resblock module realizes the enhancement of texture information.
For the second, texture information branch, the embodiment of the present invention improves on the existing AdaIN module. The conventional AdaIN module is described in detail in "X. Huang and S. Belongie, 'Arbitrary style transfer in real-time with adaptive instance normalization,' in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1501-1510," and is not described again here. Since the AdaIN module is a real-time style transfer module, introducing it makes it possible, by adjusting the mean and variance, to transfer the style information of F_p^i into F_s^i, enhancing the characterization capability for texture.
However, directly using the existing AdaIN module is not suitable for this task, owing to the large modulation gap between face photos and sketches and the insufficient adaptive capability of a parameter-free operation. Therefore, the embodiment of the present invention introduces residual blocks to improve the learning ability. Referring again to fig. 3, the AdaIN Resblock module in the embodiment of the present invention comprises a plurality of sequentially connected residual AdaIN modules; each residual AdaIN module comprises a basic AdaIN module and a residual module connected in sequence. For example, the AdaIN Resblock module in the embodiment of the present invention consists of 9 sequentially connected residual AdaIN modules, each composed of an existing AdaIN module and a residual block; the residual block added after the existing AdaIN module enlarges the receptive field and enhances the adaptive capability of this branch. The output of the final AdaIN Resblock module is represented as:
B_i^{c,y,x} = \sigma_c(F_p^i) \cdot \frac{F_s^{i,c,y,x} - \mu_c(F_s^i)}{\sigma_c(F_s^i)} + \mu_c(F_p^i)    (4)

wherein s denotes a sketch, B_i denotes the output of the AdaIN Resblock module corresponding to the i-th (i = 1,2,3) layer depth feature, F_s^i denotes the depth feature corresponding to the sketch s at the i-th (i = 1,2,3) layer, F_s^{i,c,y,x} denotes the value of F_s^i on the c-th channel at the same point (y, x) as in F_p^i, and \mu_c(F_s^i), \sigma_c(F_s^i) respectively denote the mean and variance of F_s^i on the c-th channel, calculated analogously to those of F_p^i and respectively expressed as:

\mu_c(F_s^i) = \frac{1}{H_i W_i} \sum_{y=1}^{H_i} \sum_{x=1}^{W_i} F_s^{i,c,y,x}    (5)

\sigma_c(F_s^i) = \sqrt{ \frac{1}{H_i W_i} \sum_{y=1}^{H_i} \sum_{x=1}^{W_i} \left( F_s^{i,c,y,x} - \mu_c(F_s^i) \right)^2 }    (6)

wherein equations (5) and (6) use the height H_i and width W_i of F_s^i, which are the same as those of F_p^i.
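The following sketch illustrates equation (4) and one residual AdaIN block; the 3×3 convolutional layout of the residual block is an assumption, since the text only specifies an AdaIN module followed by a residual block.

```python
import torch
import torch.nn as nn

def adain(f_s: torch.Tensor, f_p: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Equation (4): normalize F_s with its own channel statistics, then
    re-modulate it with the channel statistics of F_p."""
    mu_s = f_s.mean(dim=(2, 3), keepdim=True)
    sigma_s = f_s.std(dim=(2, 3), keepdim=True) + eps
    mu_p = f_p.mean(dim=(2, 3), keepdim=True)
    sigma_p = f_p.std(dim=(2, 3), keepdim=True) + eps
    return sigma_p * (f_s - mu_s) / sigma_s + mu_p

class ResidualAdaIN(nn.Module):
    """One of the 9 sequentially connected residual AdaIN modules:
    a basic AdaIN module followed by a residual block."""
    def __init__(self, ch: int):
        super().__init__()
        self.res = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, f_s: torch.Tensor, f_p: torch.Tensor) -> torch.Tensor:
        x = adain(f_s, f_p)
        return x + self.res(x)  # residual block enlarges the receptive field
```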
Further, the embodiment of the present invention provides a gated channel attention fusion module, integrating a gating mechanism and a channel attention mechanism to fuse the two information branches, allowing the network to selectively amplify useful feature channels based on global information while suppressing useless ones. Referring to fig. 4, the gated attention feature fusion module of the embodiment of the present invention comprises two gating modules and a channel attention module, wherein
the outputs of the SPADE Resblock module and the AdaIN Resblock module are respectively input into two gating modules; the outputs of the SPADE Resblock module and the AdaIN Resblock module are superposed and then input into the channel attention module; the outputs of the two gating modules and the output of the channel attention module are fused, and the fusion of the outputs of the two gating modules and the output of the channel attention module is specifically expressed as:
Figure BDA00037655804600001015
wherein A is i And B i Denotes the output of the SPADE Resblock module and the output of the AdaIN Resblock module corresponding to the i-th (i =1,2,3) layer depth feature, respectively, ". Denotes element-by-element multiplication, which is shown in FIG. 3 by
Figure BDA00037655804600001016
Denotes the addition element by element, which is denoted by "+" in FIG. 3
Figure BDA00037655804600001017
Is represented by C i Represents the output of the gated attention feature fusion module corresponding to the i-th (i =1,2,3) layer depth feature, CA (-) represents the channel attention function,
Figure BDA00037655804600001018
is represented by A i The corresponding gating function is set to be,
Figure BDA0003765580460000111
is represented by B i The corresponding gating function. Gating module input A in FIG. 3 i Corresponding output
Figure BDA0003765580460000112
Gating module input B i Corresponding output
Figure BDA0003765580460000113
In FIG. 3, α is CA (A) i +B i )。
Here, the gating functions G_{A_i}(·) and G_{B_i}(·) are respectively expressed as:

G_{A_i}(A_i) = \sigma(\mathrm{Conv}(A_i))    (8)

G_{B_i}(B_i) = \sigma(\mathrm{Conv}(B_i))    (9)

where Conv denotes a 1×1 convolution operation and σ denotes the sigmoid function.

The channel attention function CA(·) is expressed as:

\mathrm{CA}(A_i + B_i) = \sigma(\mathrm{Conv}_2(\delta(\mathrm{Conv}_1(A_i + B_i))))    (10)

wherein Conv_1 and Conv_2 each denote a 1×1 convolution operation and δ denotes the rectified linear unit (ReLU).
The gating functions of the embodiment of the invention judge the importance of each feature vector in the feature map and effectively control the information flow, while the channel attention explicitly models the interdependencies between channels and filters the information flow globally, so that the network is allowed to selectively amplify useful feature channels based on global information while suppressing useless feature channels.
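A sketch of the fusion of equations (7)-(10) is given below; note that the complementary (1 − α) weighting mirrors the reconstruction of equation (7) above, and the global average pooling before the two 1×1 convolutions of CA(·) is an assumption common to channel attention designs, not stated by equation (10).

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Fuse the spatial branch output A_i and the texture branch output B_i
    with two sigmoid gates (eqs. (8)-(9)) and channel attention (eq. (10))."""
    def __init__(self, ch: int, reduction: int = 4):
        super().__init__()
        self.gate_a = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())  # G_A
        self.gate_b = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())  # G_B
        self.ca = nn.Sequential(            # CA(.): assumed pooled variant
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        alpha = self.ca(a + b)              # alpha = CA(A_i + B_i)
        gated_a = self.gate_a(a) * a        # G_A(A_i) elementwise* A_i
        gated_b = self.gate_b(b) * b        # G_B(B_i) elementwise* B_i
        return alpha * gated_a + (1 - alpha) * gated_b  # eq. (7)
```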
Further, referring again to fig. 2, the decoder of the embodiment of the present invention is based on AFF modules; in the decoder, the outputs of the gated attention feature fusion module are up-sampled to features of the same resolution, fused with AFF modules, and decoded for output. Specifically:
The outputs of the three levels of the gated attention feature fusion module are integrated using an attentional feature fusion (AFF) based decoder to generate the final synthesis result. The AFF module is a conventional attentional feature fusion module, described in detail in "Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard, 'Attentional feature fusion,' in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3560-3569," and not described again here. The embodiment of the invention up-samples the three feature maps obtained by the gated attention feature fusion module to the same resolution, performs pairwise feature fusion with AFF modules, and decodes to generate the final synthesis result, which is either a face photo \hat{p} or a sketch \hat{s}.
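The decoding step can be sketched as follows; `aff_factory` stands in for the AFF module of the cited paper (a hypothetical constructor here), and the channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFFDecoder(nn.Module):
    """Up-sample the three fused maps C_1..C_3 to a common resolution,
    merge them pairwise with AFF modules, then decode to an image."""
    def __init__(self, aff_factory, chans=(512, 256, 128), out_ch: int = 3):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, chans[-1], 1) for c in chans])
        self.aff_12 = aff_factory(chans[-1])  # fuse C_1' with C_2'
        self.aff_23 = aff_factory(chans[-1])  # fuse the result with C_3'
        self.head = nn.Sequential(
            nn.Conv2d(chans[-1], out_ch, 3, padding=1), nn.Tanh())

    def forward(self, feats):
        target = feats[-1].shape[2:]          # highest of the three resolutions
        f = [proj(F.interpolate(x, size=target, mode="bilinear",
                                align_corners=False))
             for proj, x in zip(self.proj, feats)]
        y = self.aff_12(f[0], f[1])
        y = self.aff_23(y, f[2])
        return self.head(y)                   # synthesized photo or sketch
```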
Furthermore, the trained face photo-sketch synthesis network used in the synthesis process of the embodiment of the invention is obtained by training on a face photo-sketch pair training set: M face photos are selected from a face photo dataset to form a face photo training set, and the M face sketches corresponding to these M face photos are selected from a face sketch dataset, together forming M face photo-sketch pairs that serve as the face photo-sketch pair training set.
Referring to fig. 5, the detailed training process includes the following steps:
s201, an encoder encodes a face photo-sketch portrait to a training set and outputs depth characteristics;
s202, a semantic segmentation module extracts semantic labels of sketch portraits in a training set;
s203, the two-way normalization module performs reinforcement according to the semantic tags and the depth features and the spatial information branch and the texture information branch;
s204, the gated attention feature fusion module fuses output results of the spatial information branch and the texture information branch;
s205, decoding the fusion result by a decoder to output a synthetic result corresponding to the training set by the face photo-sketch portrait;
the implementation of S201-S205 is described with reference to the detailed design of each part above.
S206, constructing a loss function of the human face photo-sketch synthesis network according to the human face photo-sketch pair training set and the corresponding synthesis result.
Specifically, the loss function of the face photo-sketch synthesis network constructed by the embodiment of the invention is expressed as:

L_{full} = \lambda_1 L_{adversarial} + \lambda_2 L_{cycle} + \lambda_3 L_{perceptual}    (11)

wherein L_{full} denotes the loss function of the face photo-sketch synthesis network; L_{adversarial} denotes the generative adversarial loss, L_{cycle} denotes the cycle-consistency loss, L_{perceptual} denotes the perceptual loss, and \lambda_1, \lambda_2, \lambda_3 denote balance parameters used to balance the different loss terms; in the embodiment of the invention, \lambda_1 = 1, \lambda_2 = 0.5, \lambda_3 = 0.5. The loss function of the embodiment of the invention thus consists of three parts:
generating the confrontation loss: generating the antagonistic loss aims to guide the generator to obtain a more realistic generated result. When the generation of the damage to the resistance is calculated, a discriminator is connected after the generation, and the discriminator in the CycleGAN in the structural references of J. -Y.Zhu, T.park, P.Isola, and A.A.Efrons, "Unperared image-to-image transformation using cycle-dependent adaptive networks," in Proceedings of the IEEE international conference on computer vision,2017, pp.2223-2232 "is trained. The resulting opposing losses for the generator and the arbiter are expressed as:
L adversarial =E p~Pdata(p) [(D(p)) 2 ]+E s~Pdata(s) [(1-D(G(s,p ref ,M(s)))) 2 ] (12)
wherein E (-) represents the expected value of the distribution function, E p~Pdata(p) (. E) represents the expectation of correspondence of the face photograph p in the face photograph dataset, E s~Pdata(s) (. Table)Showing the corresponding expectation of sketch portrait s in sketch portrait data set, D (-) represents a discriminator, D (p) represents a discrimination result obtained by inputting face picture p into the discriminator, M (-) represents a face semantic segmentation network BiSeNet, M(s) represents a semantic label result obtained by inputting sketch portrait s into the semantic segmentation network, G (-) represents a generator for generating face picture by sketch portrait s, G (s, p) represents a semantic label result obtained by inputting sketch portrait s into the semantic segmentation network ref M (s)) represents a sketch image s and a reference sample picture p ref The semantic label M(s) is input into the synthesis result obtained in the generator, i.e. the photo generated in FIG. 2
Figure BDA0003765580460000131
D(G(s,p ref M (s))) represents the generation of a photograph
Figure BDA0003765580460000132
The discrimination result obtained in the discriminator is input.
Cycle-consistency loss: given a sketch s and its semantic label M(s), after the cyclic translations S → P and P → S between the S and P domains, the sketch s should be mapped back to its original domain. The corresponding cycle-consistency loss is expressed as:

L_{cycle} = \mathbb{E}_{s \sim P_{data}(s)}\left[ \left\| F(G(s, p_{ref}, M(s)), s_{ref}, M(G(s, p_{ref}, M(s)))) - s \right\|_1 \right]    (13)

wherein s_{ref} and p_{ref} denote reference sample images of the S and P domains respectively, G(s, p_{ref}, M(s)) denotes the photo \hat{p} generated by the generator, M(G(s, p_{ref}, M(s))) denotes the semantic label obtained by inputting the generated photo \hat{p} into the semantic segmentation network, F(·) denotes the generator that generates a sketch from a face photo, F(G(s, p_{ref}, M(s)), s_{ref}, M(G(s, p_{ref}, M(s)))) denotes the result of inputting the generated photo \hat{p}, the reference sample sketch s_{ref} and the semantic label M(G(s, p_{ref}, M(s))) into the generator F(·), which has the same network structure as G(·), and ||·||_1 denotes the L1 norm.
Perceptual loss: the perceptual loss is introduced so that generated photos resemble real photos at the semantic feature level. It is designed with a pre-trained VGG-19 model: the real photo and the generated photo are input into the pre-trained VGG-19 model, and the corresponding perceptual loss is expressed as:

L_{perceptual} = \sum_j \frac{1}{C_j H_j W_j} \left\| \phi_j(\hat{p}) - \phi_j(p) \right\|_2^2    (14)

wherein \phi_j(·) denotes the output feature map of the j-th layer of the pre-trained VGG-19 model, and C_j, H_j and W_j respectively denote the number of channels, the height and the width of the j-th layer output feature map.
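A sketch of the total objective of equations (11)-(14), using λ1 = 1, λ2 = λ3 = 0.5 as stated above; `vgg_features` is an assumed helper returning the selected VGG-19 layer outputs.

```python
import torch
import torch.nn.functional as F

def total_loss(D, fake_p, cycled_s, s, real_p, vgg_features,
               lambdas=(1.0, 0.5, 0.5)) -> torch.Tensor:
    """L_full = l1*L_adversarial + l2*L_cycle + l3*L_perceptual (eq. (11)).
    fake_p = G(s, p_ref, M(s)); cycled_s = F(fake_p, s_ref, M(fake_p))."""
    # generator-side adversarial term of eq. (12)
    l_adv = ((1 - D(fake_p)) ** 2).mean()
    # eq. (13): the cycled sketch must reproduce the input sketch (L1 norm)
    l_cyc = F.l1_loss(cycled_s, s)
    # eq. (14): feature-level distance on pre-trained VGG-19 layers
    l_per = fake_p.new_zeros(())
    for phi_fake, phi_real in zip(vgg_features(fake_p), vgg_features(real_p)):
        l_per = l_per + F.mse_loss(phi_fake, phi_real)  # 1/(C_j H_j W_j) folded into the mean
    return lambdas[0] * l_adv + lambdas[1] * l_cyc + lambdas[2] * l_per
```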
Further, the parameters of the face photo-sketch synthesis network are updated according to the loss function, and training continues until the iteration stop condition is met; then:
S207, the current face photo-sketch synthesis network is output as the trained face photo-sketch synthesis network.
Throughout the iterative computation, the parameters can be updated with a gradient descent algorithm until the face photo-sketch synthesis network model converges, although the method is not limited to the gradient descent algorithm.
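An illustrative single update step under the above loss; the Adam-style optimizers, the generator interface and the LSGAN-style discriminator targets are assumptions (the text only requires a gradient descent algorithm and the CycleGAN discriminator structure).

```python
import torch

def train_step(batch, generator, discriminator, opt_g, opt_d, loss_fn):
    """One parameter update of the face photo-sketch synthesis network."""
    s, p_ref, real_p = batch["sketch"], batch["ref_photo"], batch["photo"]

    # generator update: minimize L_full of eq. (11)
    opt_g.zero_grad()
    fake_p, cycled_s = generator(s, p_ref)     # forward pass + cycle pass
    g_loss = loss_fn(discriminator, fake_p, cycled_s, s, real_p)
    g_loss.backward()
    opt_g.step()

    # discriminator update with least-squares targets (assumed)
    opt_d.zero_grad()
    d_loss = ((discriminator(real_p) - 1) ** 2).mean() \
             + (discriminator(fake_p.detach()) ** 2).mean()
    d_loss.backward()
    opt_d.step()
    return g_loss.item(), d_loss.item()
```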
It should be noted that, in the embodiment of the present invention, whether sketches or face photos are to be synthesized, the above training process can be used to train the corresponding face photo-sketch synthesis network model; because the training process selects a corresponding reference face photo or reference sketch, it provides a large amount of texture and spatial prior information, and the trained face photo-sketch synthesis network model can better synthesize sketches or face photos.
In order to verify the effectiveness of the face photo-sketch synthesis method based on two-way conditional normalization provided by the embodiment of the invention, the following experiments were performed.
1. Simulation conditions
The embodiment of the invention uses the PyTorch framework for simulation on an Intel(R) Xeon(R) Gold 6226R 2.90GHz CPU, an NVIDIA GeForce RTX 3090 GPU, and the Ubuntu 16.04 operating system. Training is performed on the CUFS, CUFSF, and WildSketch datasets respectively.
The methods compared in the experiment were as follows:
one is FCN, referred to as "L.Zhang, L.Lin, X.Wu, S.Ding, and L.Zhang," End-to-End photo-deletion generation via complete collaborative presentation learning, "in Proceedings of the 5th ACM on International Conference on Multimedia retrieval,2015, pp.627-634." which proposes an End-to-End full convolution network to directly learn the mapping of a picture of a human face to a portrait. However, the network is too shallow to dig out deep semantic information.
Second is pix2pix, referenced as "p.isola, j. -y.zhu, t.zhou, and a.a.efrost," Image-to-Image transformation with conditional adaptation network, "in Proceedings of the IEEE connection on computer vision and pattern recognition,2017, pp.1125-1134," which uses the condition GAN (cGAN) as a unified solution for Image-to-Image transformation on paired datasets.
Thirdly, cycleGAN, referenced as "j. -y.zhu, t.park, p.isola, and a.a.efros," unknown image-to-image transformation using cycle-dependent adaptive networks, "in Proceedings of the IEEE international conference on computer vision,2017, pp.2223-2232", which proposes a CycleGAN network framework that can form a universal mapping from domain a to domain B, learn how to transform between two domains, rather than being tied to a specific picture transformation, that can be trained using Unpaired datasets, with strong adaptability.
Fourth, a field human face sketch synthesis method based on Semi-supervised learning, which is denoted as Wild in the experiment, and the reference is "c.chen, w.liu, x.tan, and k. -y.k.wong," Semi-supervised learning for face sketch synthesis in the world, "in assistant Conference on Computer vision, spring, 2018, pp.216-231", and the method extends photo-picture pairs by constructing pseudo-picture features of additional training photos.
Fifth, SCAGAN, referenced as "j.yu, x.xu, f.gao, s.shi, m.wang, d.tao, and q.huang," heated responsive face photo-skin synthesis vision composition-aided gates, "IEEE transactions on cybernetics, vol.51, no.9, pp.4350-4362,2020", proposes using facial composition information as a supplemental input and introducing site loss to focus training on a specific site.
Sixthly, PANET, the reference is "L.Nie, L.Liu, Z.Wu, and W.kang," Unconstrained face mask synthesis via prediction-adaptive network and a new benchmark, "neuro-rendering, vol.494, pp.192-202,2022", and the method provides face sketch synthesis based on perception adaptive network under the condition of constraint and no constraint.
2. Simulation content
Partial face photo-sketch pairs are selected from the CUFS, CUFSF and WildSketch datasets as the face photo-sketch image pairs to be synthesized; the sketch synthesis results of the method of the invention and the 6 existing methods are shown in fig. 6. Specifically: row 1 of fig. 6 shows sketch synthesis results on the cuhk sub-dataset of the CUFS dataset, rows 2 and 3 on the ar and xm2vts sub-datasets of the CUFS dataset, and rows 4 and 5 on the CUFSF and WildSketch datasets; row 1 of fig. 7 shows face photo synthesis results on the cuhk sub-dataset of the CUFS dataset, row 2 on the ar sub-dataset of the CUFS dataset, row 3 on the xm2vts sub-dataset of the CUFS dataset, and row 4 on the CUFSF dataset. In fig. 6 and 7, the rightmost Ground Truth column is the target image corresponding to the leftmost Test Photo; the closer a method's synthesis result is to the Ground Truth, the better its synthesis effect. In fig. 6 and 7, the second column from the right shows the synthesis results of the method of the invention.
As can be seen from fig. 6 and fig. 7, the synthesis results of the method of the embodiment of the invention restore texture and structure information well, retain more realistic edge and detail features, and better overcome the problems of blurring and artifacts.
Meanwhile, 3 existing methods and the method of the invention (denoted DCNP in fig. 8) were selected for a user survey of the face photo synthesis results on the CUFS and WildSketch datasets. 14 groups of questionnaires were set in total (7 groups on the CUFS dataset and 7 groups on the WildSketch dataset), and users voted for the most satisfactory result in each group. 700 votes from 50 users were collected; the specific share of votes for each method is shown in fig. 8.
In summary, the face photo-sketch synthesis method based on two-way conditional normalization provided by the embodiment of the invention builds a face photo-sketch synthesis network comprising a two-way normalization module and a gated attention feature fusion module. The branches of the two-way normalization module fully encode texture and spatial information, strengthening the learning of spatial and texture information so that more realistic edge and detail features are retained; the gated attention feature fusion module screens and fuses the useful information of the two branches, avoiding information redundancy to a certain extent. Together, the two paths improve the quality of the face photo-sketch synthesis results, and according to the user survey, the images synthesized by the embodiment of the invention give users a better subjective impression and a better user experience.
Meanwhile, the embodiment of the invention additionally introduces reference sample images (a reference face photo p_ref or a reference sketch s_ref) during synthesis and training to provide a large amount of texture and spatial prior information, further improving the quality of the face photo-sketch synthesis results.
Referring to fig. 9, an embodiment of the present invention provides an electronic device, which includes a processor 901, a communication interface 902, a memory 903, and a communication bus 904, where the processor 901, the communication interface 902, and the memory 903 complete mutual communication through the communication bus 904;
a memory 903 for storing computer programs;
the processor 901 is configured to implement the steps of the above-mentioned method for synthesizing a human face photo-sketch image based on two-way condition normalization when executing the program stored in the memory 903.
The embodiment of the invention provides a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the human face photo-sketch portrait synthesis method based on two-way condition normalization are realized.
For the electronic device/storage medium embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A face photo-sketch portrait synthesis method based on two-way conditional normalization, characterized by comprising the following steps:
acquiring a face photo-sketch image pair to be synthesized;
inputting the face photo-sketch image pair to be synthesized into a trained face photo-sketch synthesis network to obtain a synthesis result;
wherein the face photo-sketch synthesis network comprises an encoder, a generator and a semantic segmentation module, the generator comprising a two-way normalization module, a gated attention feature fusion module and a decoder; the trained face photo-sketch synthesis network is obtained by training on a face photo-sketch pair training set; the corresponding training process comprises the following steps:
the encoder encodes the face photo-sketch pair training set and outputs depth features; the semantic segmentation module extracts the semantic labels of the sketches in the training set; the two-way normalization module performs enhancement along a spatial information branch and a texture information branch according to the semantic labels and the depth features; the gated attention feature fusion module fuses the outputs of the spatial information branch and the texture information branch; the decoder decodes the fusion result and outputs the synthesis results corresponding to the face photo-sketch pair training set; a loss function of the face photo-sketch synthesis network is constructed from the face photo-sketch pair training set and the corresponding synthesis results; and the parameters of the face photo-sketch synthesis network are updated according to the loss function and training continues until an iteration stop condition is met, obtaining the trained face photo-sketch synthesis network.
2. The method of claim 1, wherein the encoder comprises a plurality of sequentially connected convolutional layers;
in the encoder, the outputs of the last three convolutional layers of the encoder are used as the depth features.
3. The method of claim 1, wherein the two-way normalization module comprises a SPADE Resblock module and an AdaIN Resblock module, wherein
the SPADE Resblock module performs spatial information enhancement according to the semantic labels of the sketches in the training set and the depth features of the face photos in the training set;
and the AdaIN Resblock module performs texture information enhancement according to the depth features of the sketches and face photos in the training set.
4. The method of claim 3, wherein the AdaIN Resblock module comprises a plurality of sequentially connected residual AdaIN modules; each residual AdaIN module comprises a basic AdaIN module and a residual module connected in sequence.
5. The method of claim 3, wherein the output of the SPADE Resblock module is represented as:
$$A_i^{c,y,x} = \gamma_i^{c,y,x}(M_s)\,\frac{F_{p,i}^{c,y,x}-\mu_i^c(F_{p,i})}{\sigma_i^c(F_{p,i})}+\beta_i^{c,y,x}(M_s)$$
wherein $p$ denotes the face photo; $A_i$ denotes the output of the SPADE Resblock module corresponding to the $i$-th ($i=1,2,3$) layer depth feature; $F_{p,i}$ denotes the depth feature corresponding to the face photo $p$ in the $i$-th layer; $c$ indexes the channel and $(y,x)$ a position of $F_{p,i}$ on channel $c$, so that $F_{p,i}^{c,y,x}$ denotes the value of $F_{p,i}$ at point $(y,x)$ on the $c$-th channel; $M_s$ denotes the semantic label; $\gamma_i^{c,y,x}(M_s)$ and $\beta_i^{c,y,x}(M_s)$ respectively denote the scale and offset learned from the semantic label $M_s$ on the $c$-th channel; and $\mu_i^c(F_{p,i})$ and $\sigma_i^c(F_{p,i})$ respectively denote the mean and standard deviation of $F_{p,i}$ on the $c$-th channel.
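[Editor's note] A compact PyTorch reading of this modulation, in the spirit of SPADE; the hidden width, kernel sizes, and nearest-neighbour resizing of the label map are assumptions, not the patent's implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADENorm(nn.Module):
    """Normalizes the photo feature per channel, then applies a scale gamma and
    offset beta predicted spatially from the semantic label M_s (claim-5 form)."""
    def __init__(self, feat_ch, label_ch, hidden=128, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.shared = nn.Sequential(
            nn.Conv2d(label_ch, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.gamma = nn.Conv2d(hidden, feat_ch, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)

    def forward(self, f_p, m_s):
        # mu_c and sigma_c of the photo feature, per channel over (y, x)
        mu = f_p.mean(dim=(2, 3), keepdim=True)
        sigma = f_p.std(dim=(2, 3), keepdim=True) + self.eps
        # Bring the semantic label to the feature resolution
        m = F.interpolate(m_s, size=f_p.shape[2:], mode='nearest')
        h = self.shared(m)
        # A_i = gamma(M_s) * (F_p - mu) / sigma + beta(M_s), element-wise
        return self.gamma(h) * (f_p - mu) / sigma + self.beta(h)
```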
6. The face photo-sketch portrait synthesis method based on two-way condition normalization of claim 5, wherein the output of the AdaIN Resblock module is represented as:
$$B_i^{c,y,x} = \sigma_i^c(F_{p,i})\,\frac{F_{s,i}^{c,y,x}-\mu_i^c(F_{s,i})}{\sigma_i^c(F_{s,i})}+\mu_i^c(F_{p,i})$$
wherein $s$ denotes the sketch portrait; $B_i$ denotes the output of the AdaIN Resblock module corresponding to the $i$-th ($i=1,2,3$) layer depth feature; $F_{s,i}$ denotes the depth feature corresponding to the sketch portrait $s$ in the $i$-th layer; $F_{s,i}^{c,y,x}$ denotes the value of $F_{s,i}$ on the $c$-th channel at the same point $(y,x)$ as $F_{p,i}$; and $\mu_i^c(F_{s,i})$ and $\sigma_i^c(F_{s,i})$ respectively denote the mean and standard deviation of $F_{s,i}$ on the $c$-th channel.
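[Editor's note] The same statistics written out as a function with the claim's symbols; the pairing of sketch values with photo statistics follows the reconstruction above and should be read as an assumption:

```python
def adain_branch(f_s, f_p, eps=1e-5):
    """B_i = sigma_c(F_p) * (F_s - mu_c(F_s)) / sigma_c(F_s) + mu_c(F_p),
    per channel c over spatial positions (y, x)."""
    mu_s = f_s.mean(dim=(2, 3), keepdim=True)        # mu_c(F_s)
    sd_s = f_s.std(dim=(2, 3), keepdim=True) + eps   # sigma_c(F_s)
    mu_p = f_p.mean(dim=(2, 3), keepdim=True)        # mu_c(F_p)
    sd_p = f_p.std(dim=(2, 3), keepdim=True) + eps   # sigma_c(F_p)
    return sd_p * (f_s - mu_s) / sd_s + mu_p
```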
7. The method of claim 6, wherein the gated attention feature fusion module comprises two gating modules and a channel attention module, wherein
the outputs of the SPADE Resblock module and the AdaIN Resblock module are respectively input into the two gating modules;
the outputs of the SPADE Resblock module and the AdaIN Resblock module are superposed and then input into the channel attention module;
and the outputs of the two gating modules and the output of the channel attention module are fused.
8. The method of claim 7, wherein the outputs of the two gating modules and the output of the channel attention module are fused as:
$$C_i = g_{A_i}\odot A_i + g_{B_i}\odot B_i + \mathrm{CA}(A_i + B_i)$$
wherein $C_i$ denotes the output of the gated attention feature fusion module corresponding to the $i$-th ($i=1,2,3$) layer depth feature; $\mathrm{CA}(\cdot)$ denotes the channel attention function; $g_{A_i}$ denotes the gating function corresponding to $A_i$; $g_{B_i}$ denotes the gating function corresponding to $B_i$; and $\odot$ denotes element-wise multiplication.
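[Editor's note] One plausible PyTorch reading of this fusion; the sigmoid gates, the squeeze-and-excitation-style channel attention, and the additive combination are all assumptions layered on the reconstructed formula:

```python
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """C_i = g_A (*) A_i + g_B (*) B_i + CA(A_i + B_i), with learned sigmoid gates."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.gate_a = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())
        # Squeeze-and-excitation style channel attention over the superposition
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, a, b):
        s = a + b  # superposition of the SPADE and AdaIN branch outputs
        return self.gate_a(a) * a + self.gate_b(b) * b + self.ca(s) * s
```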
9. The face photo-sketch portrait synthesis method based on two-way condition normalization of claim 1, wherein the decoder is an AFF module based decoder;
in the decoder, the outputs of the gated attention feature fusion module are up-sampled to features of the same resolution, fused by an AFF module, and then decoded for output.
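[Editor's note] A sketch of such a decoder; the AFF block below is a simplified attention-weighted blend standing in for the published AFF module, and all channel counts and the single-channel output are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class AFFBlock(nn.Module):
    """Attention-weighted blend of two same-shaped features (simplified AFF)."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x, y):
        w = self.att(x + y)
        return w * x + (1 - w) * y

class AFFDecoder(nn.Module):
    """Upsamples the fused multi-scale features to one resolution, merges them
    with AFF blocks, then decodes to a single-channel sketch."""
    def __init__(self, chs=(512, 256, 128)):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, chs[-1], 1) for c in chs)
        self.aff1 = AFFBlock(chs[-1])
        self.aff2 = AFFBlock(chs[-1])
        self.out = nn.Sequential(
            nn.Conv2d(chs[-1], 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1), nn.Tanh())

    def forward(self, feats):
        # feats: coarse-to-fine outputs of the gated attention feature fusion module
        size = feats[-1].shape[2:]
        ups = [F.interpolate(p(f), size=size, mode='bilinear', align_corners=False)
               for p, f in zip(self.proj, feats)]
        x = self.aff1(ups[0], ups[1])
        x = self.aff2(x, ups[2])
        return self.out(x)
```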
10. The face photo-sketch portrait synthesis method based on two-way condition normalization of claim 1, wherein the constructed loss function of the face photo-sketch portrait synthesis network is represented as:
$$L_{full} = \lambda_1 L_{adversarial} + \lambda_2 L_{cycle} + \lambda_3 L_{perceptual}$$
wherein $L_{full}$ denotes the loss function of the face photo-sketch portrait synthesis network; $L_{adversarial}$ denotes the adversarial loss, $L_{cycle}$ denotes the cycle consistency loss, and $L_{perceptual}$ denotes the perceptual loss; $\lambda_1$, $\lambda_2$, $\lambda_3$ denote balance parameters.
CN202210885729.6A 2022-07-26 2022-07-26 Face photo-sketch portrait synthesis method based on two-way condition normalization Pending CN115375596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210885729.6A CN115375596A (en) 2022-07-26 2022-07-26 Face photo-sketch portrait synthesis method based on two-way condition normalization

Publications (1)

Publication Number Publication Date
CN115375596A true CN115375596A (en) 2022-11-22

Family

ID=84063410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210885729.6A Pending CN115375596A (en) 2022-07-26 2022-07-26 Face photo-sketch portrait synthesis method based on two-way condition normalization

Country Status (1)

Country Link
CN (1) CN115375596A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392247A (en) * 2023-09-25 2024-01-12 清华大学 Image video semantic coding and decoding method and device based on sketch

Similar Documents

Publication Publication Date Title
US10593021B1 (en) Motion deblurring using neural network architectures
Li et al. Zero-shot image dehazing
Zhu et al. One shot face swapping on megapixels
Xia et al. Gan inversion: A survey
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN111079601A (en) Video content description method, system and device based on multi-mode attention mechanism
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN109410135B (en) Anti-learning image defogging and fogging method
CN113780149A (en) Method for efficiently extracting building target of remote sensing image based on attention mechanism
CN112837215B (en) Image shape transformation method based on generation countermeasure network
CN114783034A (en) Facial expression recognition method based on fusion of local sensitive features and global features
CN114339409A (en) Video processing method, video processing device, computer equipment and storage medium
CN111986105A (en) Video time sequence consistency enhancing method based on time domain denoising mask
Hu et al. Dear-gan: Degradation-aware face restoration with gan prior
CN113129234A (en) Incomplete image fine repairing method based on intra-field and extra-field feature fusion
Chen et al. MICU: Image super-resolution via multi-level information compensation and U-net
CN115375596A (en) Face photo-sketch portrait synthesis method based on two-way condition normalization
Zhou et al. A superior image inpainting scheme using Transformer-based self-supervised attention GAN model
CN116523985B (en) Structure and texture feature guided double-encoder image restoration method
Zhou et al. Cloud removal for optical remote sensing imagery using distortion coding network combined with compound loss functions
CN112686830A (en) Super-resolution method of single depth map based on image decomposition
Liu et al. Diverse Hyperspectral Remote Sensing Image Synthesis With Diffusion Models
CN116975347A (en) Image generation model training method and related device
Huang et al. DSRD: deep sparse representation with learnable dictionary for remotely sensed image denoising

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination