CN112991484A

CN112991484A - Intelligent face editing method and device, storage medium and equipment

Info

Publication number: CN112991484A
Application number: CN202110466411.XA
Authority: CN
Inventors: 高林; 陈姝宇; 刘锋林
Original assignee: Institute Of Digital Economy Industry Institute Of Computing Technology Chinese Academy Of Sciences
Current assignee: Institute Of Digital Economy Industry Institute Of Computing Technology Chinese Academy Of Sciences
Priority date: 2021-04-28
Filing date: 2021-04-28
Publication date: 2021-06-18
Anticipated expiration: 2041-04-28
Also published as: CN112991484B

Abstract

The invention relates to an intelligent face editing method, an intelligent face editing device, a storage medium and intelligent face editing equipment. It is suitable for the fields of computer vision and computer graphics. An intelligent face editing method, an intelligent face editing device, a storage medium and equipment are provided. The technical scheme adopted by the invention is as follows: an intelligent face editing method is characterized in that: inputting the geometric feature image and the appearance feature image of the face into corresponding trained local decoupling modules according to the face parts, and extracting the corresponding geometric features and appearance features of each part of the face; the local decoupling module generates local images and local intermediate feature images corresponding to all parts of the human face based on the geometric features and the appearance features; and fusing the local intermediate feature images corresponding to all parts of the human face through the trained global fusion module to generate a final human face image with the geometric features and the appearance features.

Description

Intelligent face editing method and device, storage medium and equipment

Technical Field

The invention relates to an intelligent face editing method, an intelligent face editing device, a storage medium and intelligent face editing equipment. It is suitable for the fields of computer vision and computer graphics.

Background

The face synthesis technology is one of the important subjects in the field of digital image processing, and there are many related technologies around high-quality face synthesis. The face synthesis technology based on deep learning mainly includes the following two types: synthesizing a new face by Gaussian sampling and using a generative confrontation network (GAN); and (3) using a condition generating type confrontation network, inputting information such as a semantic tag graph, a sketch and an attribute label, and synthesizing a corresponding face. Although there are many techniques for synthesizing a real face by sketch, most of them cannot control the appearance of the generated face or the effect of face synthesis is poor.

For face editing, some prior art techniques use tagged face attribute tag data to decouple the hidden space of the GAN. And editing the attributes of the human face by operating the projection codes of the hidden space. However, these techniques can only edit a specific attribute, and cannot modify contents other than the attribute tag, and thus the degree of freedom is low. Some techniques use semantic markup to edit faces, but because of the lack of geometric information in their input, cannot edit geometric information of the face such as wrinkles, hair style trends, etc. Some technologies use sketches to edit faces, but the technologies are based on image completion technology and have more limitations.

Portenier et al, published in 2018 in ACM Transactions on Graphics, "facial: Deep Sketch-Based Face Image Editing," propose a Sketch-Based Face Editing system, which edits a Face using a masking mark, a Sketch, and a color stroke. Jo et al, 2019, "SC-FEGAN, Face Editing genetic adaptive Network With User's Sketch and color," published by Proceedings of the IEEE/CVF International Conference on Computer Vision, used style loss to generate higher quality, more robust results. However, the prior art cannot edit the overall appearance of the face, and cannot generate a real and vivid face when a pure sketch is used as input.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: aiming at the existing problems, an intelligent face editing method, an intelligent face editing device, a storage medium and intelligent face editing equipment are provided.

The technical scheme adopted by the invention is as follows: an intelligent face editing method is characterized in that:

inputting the geometric feature image and the appearance feature image of the face into corresponding trained local decoupling modules according to the face parts, and extracting the corresponding geometric features and appearance features of each part of the face;

the local decoupling module generates local images and local intermediate feature images corresponding to all parts of the human face based on the geometric features and the appearance features;

and fusing the local intermediate feature images corresponding to all parts of the human face through the trained global fusion module to generate a final human face image with the geometric features and the appearance features.

The geometric feature image is a draft of a human face or a real image of the human face, and the local decoupling module comprises a draft encoder

Image encoder

Appearance encoder

And image composition generator

；

The extracting of the geometric features corresponding to each part of the human face from the geometric feature image includes:

training cursive chart encoder

Sketch encoder

The hidden space in the bottleneck layer is of dimension of

Of low resolution, wherein

、

And

the height, width and channel number of the geometric feature map;

order to

Represents a true image of a human face, and

representing the corresponding face sketch;

sketch encoder through pre-training

Extracting

The geometric feature of (A) is expressed as

；

Training image encoder

Corresponding human face real image

A geometric hidden space mapped to the sketch representation, denoted as

。

When in use

And

input pre-trained decoder

Time, algorithm pair

Each layer of (a) imposes constraints while also at the final output

And

with the addition of L1 losses.

The geometric feature extraction training of the local decoupling module comprises the following steps:

first, train

And

learning the geometric hidden space of the sketch using the L1 reconstruction loss function, once

After training, the geometric characteristics are expressed as

；

The network is then trained

From the real image of the human face

As input and predict geometric characteristics

So that it follows the same distribution as the learned geometry space, the loss function is defined as follows:

wherein the content of the first and second substances,Nis a decoder

Index 0 corresponds to the input feature map, indexNThe other indices are intermediate feature maps corresponding to the output image.

The appearance encoder

And eliminating spatial information on the face appearance characteristic image and extracting appearance characteristics irrelevant to the geometric characteristics by utilizing global average pooling.

The appearance encoder and the image synthesis generator adopt an exchange training strategy;

using geometric images of human faces

Geometric characteristics of

And human face appearance characteristic image

Appearance characteristics of

Generating an image

I.e. by

；

Use of

The appearance characteristics of

Geometric features of (1), cyclically reconstructing the image

I.e. by

。

Training a local decoupling module by adopting the following loss functions, comprising:

a. self-weight building loss:

wherein:

representing a loss of perception;

a loss of feature matching representing a discriminator;

expressing color loss, converting the image to CIE-Lab color space, and controlling hue by calculating chroma distance in a and b channels; the a and b channels contain color information in the CIE LAB color space;

、

、

setting specific parameters according to experience;

b. loss of cyclic exchange:

wherein

、

Setting specific parameters according to experience;

c. the resistance loss:

the distribution of the generated image is limited using a multi-scale discriminator D to match the distribution of the real image:

weighted value

，

。

An intelligent face editing device, comprising:

the characteristic extraction unit is used for inputting the human face geometric characteristic image and the human face appearance characteristic image into corresponding trained local decoupling modules according to human face parts and extracting the geometric characteristics and appearance characteristics corresponding to all parts of the human face;

the image generation unit is used for generating local images and local intermediate feature images corresponding to all parts of the human face by the local decoupling module based on the geometric features and the appearance features;

and the image fusion unit is used for fusing the local intermediate feature images corresponding to all parts of the human face through the trained global fusion module to generate a final human face image with the geometric features and the appearance features.

A storage medium having stored thereon a computer program executable by a processor, the computer program comprising: the computer program, when executed, implements the steps of the intelligent face editing method.

A computer device having a memory and a processor, the memory having stored thereon a computer program executable by the processor, the computer program comprising: the computer program, when executed, implements the steps of the intelligent face editing method.

The invention has the beneficial effects that: the invention relates to a face synthesis and editing technology based on a sketch, wherein the geometric information represented by the sketch is rich, the geometric details of a face can be controlled, and the face synthesis and editing technology is more flexible compared with the prior art. Meanwhile, the decoupling technology can replace the appearance corresponding to the human face and edit the information such as skin color, hair color and the like.

The invention divides the human face into five parts of a left eye, a right eye, a nose, a mouth and a background, combines a local decoupling module and a global fusion module, can respectively edit local geometric and appearance characteristics, and has higher quality of a synthetic result at a local detail position.

The local decoupling module of the invention codes the image and the sketch into the same space, thereby ensuring the decoupling of information. In the training process, the geometric and appearance of different images are combined by using exchange operation to generate an intermediate result, and the result is respectively subjected to geometric and appearance constraints, so that geometric information can be extracted from a human face or a real image and combined with the appearance of other images to generate a new local synthesis result.

According to the invention, a local decoupling module generates an intermediate feature map, a global fusion module splices the intermediate feature map according to a fixed position, and then a down-sampling network, a residual error network and an up-sampling network are used, and the result of block splicing is fused through a discriminator in the GAN and a related loss function optimization network. And splicing the intermediate feature maps synthesized by the local decoupling modules to generate the human face with high reality sense.

Drawings

Fig. 1 is a network framework structure diagram (showing a structured local-to-global training strategy, and showing a training strategy of exchange and loop reconstruction of a local decoupling module) of the embodiment.

Fig. 2 is a schematic diagram of the structure and training strategy of the geometric encoder in the embodiment (encoding the sketch geometry and the real image into the same potential space, and extracting geometric information from the real image and the sketch).

Fig. 3 shows the face generation results of the geometry and appearance exchange in the example (the first row provides appearance information and the first column provides geometry information).

Fig. 4 shows the partial editing result using the sketch in the embodiment (the input sketch is edited in a serialized manner, and the system generates a corresponding face editing result, and the system provides a high degree of freedom and creativity for the user).

Fig. 5 shows the result of the partial appearance editing in the embodiment (geometric features of the fixed image, which is generated by replacing the appearance reference image of the eyes and mouth).

Fig. 6 shows the result of interpolation of geometry and appearance in the embodiment (the images in the upper left corner and the lower right corner are real images, and the rest are interpolation generation results).

Detailed Description

As shown in fig. 1, the present embodiment provides an intelligent face editing method by using an image decoupling technology and adopting a local-to-global method.

The face is divided into 5 parts in this example: after image blocking is completed, a local decoupling module is designed to extract and generate decoupling characteristics of images of each part; after the generation result of each part is obtained, the blocking results are spliced and fused through the global fusion module, and a face image result with global consistency is obtained.

The network structure of the embodiment comprises 5 local decoupling modules for decoupling geometric and appearance information, and 1 global fusion module realizes the fusion of local features and generates a result with high quality and global consistency. In the process of network training, an exchange strategy is used, and the characteristic constraint with consistent cycle is designed, so that the robustness and the generalization of a network framework are ensured.

The intelligent face editing method in the embodiment specifically comprises the following steps:

1) local decoupling: inputting a human face geometric feature image (the image comprises the geometric features of the human face) and a human face appearance feature image (the image comprises the appearance features of the human face) into a corresponding trained local decoupling module according to the human face part, and extracting the geometric features and the appearance features corresponding to all parts of the human face, wherein the geometric feature image is a human face sketch or a human face real image.

The geometrical characteristics mainly comprise two aspects: 1. shape information such as the shape of the five sense organs, the face shape of a person, the length of hair, and the like; 2. geometric details, i.e. the representation of details of geometric features of a human face, such as wrinkles of the person's face, the trend of the hair, etc.

The appearance characteristics mainly comprise three contents: 1. color information such as color development, skin color, lip color, and the like of a human face; 2. material information, namely the texture of the hair and skin of the human face, such as the smoothness of the skin and the like; 3. the illumination information is information of the influence of the illumination condition on the brightness of the human face, such as the brightness of light, the change of shadow, and the like. In some cases, the effects of the above factors on appearance are mutual, for example, illumination changes may affect the expression of skin color, and appearance characteristics do not make clear division between each of the above factors.

And aiming at each local block, the local decoupling module extracts the geometric and appearance information of the local block and then fuses the geometric and appearance information to generate a local feature map. Therefore, the local decoupling module comprises a geometric encoder and an appearance encoder, which acquire geometric and appearance features, respectively.

1a) Geometric encoder: the sketch is a monochromatic outline of a real image, and geometric information can be extracted. Thus, the self-encoder network extracts the geometric information directly from the input sketch. It is difficult to directly extract geometric features from real images, and when a real face partial image is used as input, an intuitive method for extracting the geometric features is to convert the real image into a sketch by using a pre-trained image-to-sketch conversion network, and then apply the generated sketch to a geometric encoder of the sketch.

To simplify the network, this embodiment proposes a unified method for extracting geometric information from the sketch and the real image, which is achieved by training two autoencoders, the sketch encoder

And an image encoder

One for sketches and the other for images. In the present embodiment, the implicit distribution in the image space is aligned with the implicit distribution in the sketch space, so that only the geometric information is encoded.

First training the encoder by sketch

Sketch encoder

The composed network generates the intermediate features of the sketch, as shown in fig. 2. To preserve the necessary spatial information, the hidden space in the bottleneck layer is not in the form of vectors, but has dimensions of

Of low resolution, wherein

、

And

is the height, width and number of channels of the geometric feature map. The input and output of the network are sketches, which can be edge maps extracted from the images or hand-drawn sketches. For hand-drawn sketches, particularly incomplete sketches in the sketches drawing process, the inventor uses the sketches manifold projection in the preprocessing process to improve the robustness of the system.

Order to

Represents a real image, and

showing the corresponding sketch thereof. By pre-training

Extracting

Is given by the formula

。

Then, the encoder needs to be trained

Will correspond to the image

A geometric hidden space mapped to the sketch representation, denoted as

。

To ensure

And

follow the same scoreCloth, when

And

input pre-trained decoder

Time, algorithm pair

Each layer of (a) imposes constraints while also at the final output

And

with the addition of L1 losses.

1b) Appearance encoder: appearance is another important attribute of facial images. The mapping between the geometric sketch of the face and the real face image is one-to-many and by specifying the appearance, this ambiguity can be resolved.

The present embodiment uses an appearance encoder

Appearance encoder for extracting appearance features

Global average pooling (i.e., averaging all spatial positions in the feature map for each feature channel) is utilized to remove spatial information and extract appearance features that are independent of geometric features.

Since the appearance features are extracted for local regions, deleting spatial information does not result in a significant loss of useful information. The interpolation experiment of the appearance and the geometric characteristics of the human face, such as the graph 6, proves that

A continuous face appearance space can be learned.

2) Image generation: and the local decoupling module generates local images and local intermediate feature images corresponding to all parts of the human face based on the geometric features and the appearance features.

The local decoupling module further comprises an image composition generator

And combining the geometric features and appearance features in one image or two images (one providing the geometric features and the other providing the appearance features) to obtain the converted local images and the intermediate feature map.

An image composition generator: and inputting independent geometric and appearance characteristics to generate a result of reconstruction or geometric and appearance exchange. To control the appearance of the generated face image, the present embodiment employs adaptive instance normalization (AdaIN) in the face image synthesis generation network.

The image composition generator in this embodiment includes 4 residual blocks and 4 upsampling layers: firstly, injecting appearance characteristics into each residual block; then, obtaining feature maps through 4 times of upsampling operations, wherein each feature map has the same resolution as the input image but 64 channels; finally, an image is generated by a convolutional layer that is consistent with the input geometric and appearance characteristics.

3) And (3) global fusion: and fusing the local intermediate feature images corresponding to all parts of the human face through the trained global fusion module to generate a final human face image with the geometric features and the appearance features.

To convert the local image feature map into a complete and natural facial image, one feasible approach is to directly combine local image blocks generated by local decoupling (as generated in fig. 2)

The process of (d). However, this intuitive approach is prone to show artifacts at the boundaries of local partitions.

In the embodiment, the local image blocks are not directly combined, but the intermediate characteristic diagram generated by the local decoupling module is input into the image generation network for combination, so that the network integrates more information streams and generates a high-quality image.

The global fusion module in this example comprises three units: an encoder, a residual block and a decoder. On the basis of the feature map of a given background part, corresponding blocks in the feature map of the background are replaced by feature maps generated by corresponding other blocks in the sequence of mouth, nose, left eye and right eye, and the influence of overlapping among the blocks is reduced. And sending the fused feature map into a global fusion module to generate a brand new face with the input specified appearance and geometric features.

In this example, the whole network framework is trained step by step, the local decoupling module is trained first, then the parameters of the local decoupling module are fixed, and the global fusion module is trained.

A-1) training data set: this embodiment requires a large-scale sketch to match the image data set training network. Meanwhile, the sketch in the sketch image pair requires higher quality, and is similar to a hand-drawn sketch. Conventional edge extraction techniques, such as HED and Canny, typically fail to produce an ideal edge map. Therefore, in the present embodiment, the edge image is obtained using the photocopy filter in Photoshop, then the edge image is simplified to obtain the sketch, and the training data set is constructed, and the resolution of both the image and the sketch is set to 512 × 512 using CelebA HQ as the training data.

A-2) decoupling training, wherein the training process of a local decoupling module comprises three steps:

first, train

And

the geometric hidden space of the sketch is learned using the L1 reconstruction loss function. Once the cover is closed

After training, the geometric characteristics can be expressed as

。

Then, training the network

From a real image

As input and predict geometric characteristics

So that it follows the same distribution as the learned geometry space. The loss function is defined as follows:

wherein the content of the first and second substances,N=7 is a decoder

Index 0 corresponds to the input feature map, indexNThe other indices are intermediate feature maps corresponding to the output image. In this example, in optimizing

When the parameters of (1) are fixed

And

the weight of (c).

Fixing

And

weight of (2)Training appearance encoder

And image composition generator

。

Geometric characteristics of sketch

Or geometrical characteristics of real images

Random input

. In the following sections, will

And

are all shown as

Without distinguishing the origin.

In the embodiment, the appearance encoder and the image synthesis generator adopt an exchange training strategy, and the appearance and the geometric structure of the real face image are decoupled by using a cycle consistency loss item; multi-scale discriminators and adversarial loss are also employed to ensure the realism of the generated images.

The exchange training strategy in this example is illustrated as follows: given two images in a training set

（

Is a real image or sketch (as a geometric feature image of a human face)) and

（

is a real image (as a face appearance feature image)), as shown in fig. 1, through pre-training

Or

From

And

extracting geometric features from

And

by using

From

Extracting appearance characteristics.

By mixing

Geometric characteristics of

Exchange of geometrical characteristics of, use of

Geometric characteristics of

And

to generate an image

I.e. by

。

Use of

The appearance characteristics of

Geometric features of (1), cyclically reconstructing the image

I.e. by

。

The present embodiment also includes self-reconstruction loss: when in use

And

in the same image (e.g. in the same picture)

) As input, it can be reconstructed by using its geometric and appearance features,

can be expressed as

。

In this embodiment, the following loss function is used to train the local decoupling module:

self-weight building loss:

when the geometric and appearance features are from the same image, i.e.

The self-reconstruction consistency of the algorithm requires that it be reconstructable through the network frameworkI。

The self-reconstruction loss function contains three terms: 1) loss of perception

Generating visual similarity between the image and the input image by the pre-trained VGG-19 model metric; 2) loss of feature matching for discriminators

Aiming at stabilizing the training process; 3) color loss

Converting the image into CIE-Lab color space by calculationaAndbthe chrominance distance in the channel controls the hue. The self-reconstruction loss can be expressed as:

whereinaAndbthe channels contain color information in the CIE LAB color space, set empirically

、

、

。

Loss of cyclic exchange:

to completely decouple the geometric and appearance features, the present embodiment uses a swapping method to generate a face image from the geometric and appearance features of different images, i.e.

≠

Loss of cyclic exchange

Containing item

And

。

to completely decouple the geometric and appearance characteristics of the face image, the face image is processed

Is replaced by

After appearance of (2)

After that, should be kept

The geometry of (2). Thus, the algorithm introduces geometric losses

（geometry loss）Constraining the geometry of the generated image to be invariant by comparison with the input image:

the network uses a cyclic consistency penalty term to ensure that images are exchanged

The appearance of (1)

The same is true. Use of

Geometry of (2) and exchange images

The appearance of (1), the image produced

Should cyclically reconstruct the image

. Reconstruction loss formula before this example uses

Constraints to achieve cycle consistency:

、

、

the same as before.

The cycle exchange loss was:

、

the settings are 1 and 1.

The resistance loss:

in this embodiment, a multi-scale discriminator D is used to limit the distribution of the generated image to match the distribution of the real image:

weighted value

，

。

Optimization target of local decoupling module in embodiment

Is the sum of the above 3 items, minimizing

The following 3 networks will be optimized:

，

，

。

can be expressed as:

a-3) global fusion training: after the local decoupling module is trained, a global fusion module needs to be trained, and a feature map generated by the local decoupling module is fused to generate a final result. Similar to the previous stage, the countermeasure penalty, feature matching penalty, and perceptual penalty are used as penalty functions for the global fusion module, which does not use a swapping strategy because it does not involve any decoupling operations.

The embodiment also provides an intelligent face editing device, which comprises a feature extraction unit, an image generation unit and an image fusion unit, wherein the feature extraction unit is used for inputting the geometric feature image and the appearance feature image of the face into corresponding trained local decoupling modules according to the face part and extracting the geometric features and the appearance features corresponding to all parts of the face; the image generation unit is used for generating local images and intermediate characteristic images corresponding to all parts of the human face by the local decoupling module based on the geometric characteristics and the appearance characteristics; the image fusion unit is used for fusing the local intermediate feature images corresponding to all parts of the human face through the trained global fusion module to generate a final human face image with the geometric features and the appearance features.

The present embodiment also provides a storage medium having stored thereon a computer program executable by a processor, the computer program, when executed, implementing the steps of the intelligent face editing method in the present embodiment.

The present embodiment also provides a computer device having a memory and a processor, the memory storing thereon a computer program executable by the processor, the computer program, when executed, implementing the steps of the intelligent face editing method in the present embodiment.

Claims

1. An intelligent face editing method is characterized in that:

2. The intelligent face editing method according to claim 1, characterized in that: the geometric feature image is a draft of a human face or a real image of the human face, and the local decoupling module comprises a draft encoder

Image encoder

Appearance encoder

And image composition generator

；

training cursive chart encoder

Sketch encoder

The hidden space in the bottleneck layer is of dimension of

Of low resolution, wherein

、

And

the height, width and channel number of the geometric feature map;

order to

Represents a true image of a human face, and

representing the corresponding face sketch;

sketch encoder through pre-training

Extracting

The geometric feature of (A) is expressed as

；

Training image encoder

Corresponding human face real image

A geometric hidden space mapped to the sketch representation, denoted as

。

3. The intelligent face editing method according to claim 2, characterized in that:

when in use

And

input pre-trained decoder

Time, algorithm pair

Each layer of (a) imposes constraints while also at the final output

And

with the addition of L1 losses.

4. The intelligent face editing method according to claim 3, wherein the training of the local decoupling module comprises:

first, train

And

After training, the geometric characteristics are expressed as

；

The network is then trained

From the real image of the human face

As input and predict geometric characteristics

wherein the content of the first and second substances,Nis a decoder

5. The intelligent face editing method according to claim 2, 3 or 4, characterized in that: the appearance encoder

6. The intelligent face editing method of claim 5, wherein the appearance encoder and the image synthesis generator employ an exchange training strategy;

using geometric images of human faces

Geometric characteristics of

And human face appearance characteristic image

Appearance characteristics of

Generating an image

I.e. by

；

Use of

The appearance characteristics of

Geometric features of (1), cyclically reconstructing the image

I.e. by

。

7. The intelligent face editing method of claim 6, wherein the training of the local decoupling module using the following loss function comprises:

a. self-weight building loss:

wherein:

representing a loss of perception;

a loss of feature matching representing a discriminator;

、

、

setting specific parameters according to experience;

b. loss of cyclic exchange:

wherein

、

Setting specific parameters according to experience;

c. the resistance loss:

weighted value

，

。

8. An intelligent face editing device, comprising:

9. A storage medium having stored thereon a computer program executable by a processor, the computer program comprising: the computer program when executed implements the steps of the intelligent face editing method of any one of claims 1 to 7.

10. A computer device having a memory and a processor, the memory having stored thereon a computer program executable by the processor, the computer program comprising: the computer program when executed implements the steps of the intelligent face editing method of any one of claims 1 to 7.