CN115527258A - Face exchange method based on identity information response - Google Patents

Face exchange method based on identity information response

Info

Publication number
CN115527258A
Authority
CN
China
Prior art keywords
identity
feature
code
face
texture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211224223.7A
Other languages
Chinese (zh)
Inventor
杨嘉琛
李新锋
程晨
肖帅
温家宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202211224223.7A priority Critical patent/CN115527258A/en
Publication of CN115527258A publication Critical patent/CN115527258A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/54 Extraction of image or video features relating to texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Collating Specific Patterns (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

StyleGAN is currently used in face-swapping tasks to guarantee the quality and definition of the generated pictures; how to exploit the latent-space code of StyleGAN for face swapping, and thereby save substantial resources in the face-swapping operation, is a problem that remains to be solved. The invention relates to a face-swapping method. For GAN inversion, a multi-scale feature-pyramid hybrid encoder is proposed, which efficiently extracts both the texture features and the structural features of the face image. After GAN inversion, identity-feature decoupling is performed on the obtained latent-space codes: the invention proposes screening the feature layers by identity-feature response and then exchanging the identity attribute, completing the face-swapping operation while keeping the other attributes unchanged. The method has been verified on an experimental platform and can be widely applied to face-swapping technology in engineering.

Description

Face exchange method based on identity information response
Technical Field
The invention relates to a face image editing method, in particular to a face exchange method based on identity information response.
Background
People now have many face-editing methods at their disposal and place high demands on the definition of the edited picture and the accuracy of the editing technique. In this setting StyleGAN, which can generate a large number of sharp face images, stands out, and a great deal of attribute-editing work has been built on top of it. The latent-space code of StyleGAN is itself feature-decoupled by pixel scale, so editing the latent code can accomplish many image-attribute editing tasks well. In the face-swapping task, however, StyleGAN inversion work largely relies on an extended latent space, and the resulting latent-space code cannot be used directly for the exchange. How to use StyleGAN in the face-swapping task so as to save substantial resources in the face-swapping operation is a problem that remains to be solved.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a face exchange method based on identity information response. In order to solve the technical problems, the invention adopts the following technical scheme:
an identity information response face exchange method based on StyleGAN comprises the following steps:
(1) Establishing a StyleGAN-based face-image generation model;
(2) Constructing a StyleGAN inversion model, CTSNet: to achieve a better image-inversion effect, an inversion model combining a Transformer with convolution is proposed; drawing on the advantages of a convolutional network and a vision Transformer, a multi-scale encoder is designed in which the vision Transformer captures the low-pixel-scale structural features and the convolutional network extracts detail features such as texture and color;
(3) Inputting the target face image and the background face image into CTSNet to obtain the latent-space codes C_id and C_back of the corresponding images, respectively;
(4) Constructing an identity-feature response exchange network that directly extracts and exchanges the identity features of the latent space, so that the method is compatible with other inversion methods;
(5) Inputting the latent-space codes C_id and C_back of the target face image and the background face image into the identity-feature response exchange network to exchange the identity attribute, obtaining the latent-space code C_mix of the face-swapped image;
(6) Inputting the latent-space code of the face-swapped image into the StyleGAN-based face-image generation model to obtain the target image (a minimal end-to-end sketch of these steps follows below);
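For concreteness, a minimal end-to-end sketch of steps (1)-(6) in PyTorch-style Python follows. It is an illustration under stated assumptions rather than the patented implementation: encoder, swap_net and generator are hypothetical stand-ins for CTSNet, the identity-feature response exchange network and the StyleGAN generator, each of which this document describes only at the architecture level.

    # Minimal sketch of the whole pipeline; all module names are placeholders.
    import torch

    def face_swap(img_target, img_background, encoder, swap_net, generator):
        # img_*: (1, 3, H, W) tensors with H = W a power of two, e.g. 1024
        # (3) GAN inversion: latent-space codes, (1, 18, 512) at 1024x1024
        c_id = encoder(img_target)        # identity-providing face
        c_back = encoder(img_background)  # attribute-providing background face
        # (4)-(5) exchange the identity attribute in the screened layers
        c_mix = swap_net(c_id, c_back)    # latent code of the swapped face
        # (6) synthesize the face-swapped image with the StyleGAN generator
        return generator(c_mix)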
the invention has the advantages and positive effects that:
(1) By combining the advantages of a convolutional network and a vision Transformer, the invention designs a multi-scale encoder that achieves a better effect in the field of GAN image inversion.
(2) The method proposes an identity-feature response exchange network that can directly extract the identity information in the latent-space code for the exchange, with excellent results.
(3) The inversion model and the identity-exchange module of the method are decoupled from each other and compatible with other methods, so they can be widely applied in the field of face-image editing.
(4) The method has been verified through a large number of experiments, which effectively establishes its reliability.
Drawings
FIG. 1 is an overall flow diagram in an embodiment of the present invention;
FIG. 2 is a graph of face change results in accordance with an embodiment of the present invention;
Detailed Description
This embodiment illustrates a specific implementation of the present invention, taking two face images as an example.
In order to make the purpose and technical scheme of the invention clearer, the specific implementation steps of the invention are described in detail below with reference to the accompanying drawings.
Referring to FIG. 1, the overall flow of an embodiment of the present invention is shown and described in detail below:
1. Inversion network. We propose a multi-scale feature-pyramid hybrid encoder. Matching the generation process of StyleGAN, the input image of the inversion network must be cropped to a power-of-two resolution, e.g. 256 × 256, up to a maximum of 1024 × 1024.
After cropping, the image is normalized and then input to the network. The whole network is divided into an encoding module and a mapping module: the upper levels nearest the input form a convolutional encoding module, and the bottom levels form a Transformer encoding module. Taking a single image as an example, the per-level dimensions of the input are (1, number of feature layers, 4). Low-level feature information is generated at low pixel scales such as 4 × 4 and 8 × 8; at these scales the information the network can extract is mainly structural, and as the pixel scale grows the network captures detail information such as texture and color. The structural design of StyleGAN was inspired by ProGAN and exploits the local attention of the convolutional network well, while the vision Transformer has been shown to extract features of the overall structure of an object better than a convolutional network, so the Transformer plays a better role on the low-scale structural information. A multi-scale encoder is therefore designed that combines the advantages of both: the vision Transformer captures the low-pixel-scale structural features, and the convolutional network extracts detail features such as texture and color.
The inversion encoder group consists of two parts: a texture-feature extractor and a structural-feature extractor. For the texture extractor: layers 16-17 in StyleGAN correspond to 1024 × 1024 images, layers 14-15 to 512 × 512, and so on down to layers 0-1 at 4 × 4. In general, the feature layers containing detail information such as texture are concentrated in layers 12-17, and the higher the layer number, the more texture information it carries, because the local receptive field of convolution covers a small area and its features are closer to texture features. Layers 12-13 are comparatively blurry and still carry considerable internal structure information; through related experiments it was finally determined that layers 12-17 are extracted by the texture encoder, while the 256 × 256 image corresponding to layers 12-13 is additionally input to the vision Transformer that encodes the structural features. The texture-feature encoder is a pure convolutional structure corresponding to StyleGAN: for an input image I of any size, a down-sampled map at 256 × 256, denoted I256, is first obtained; the original image is input to the texture-feature extractor, and the down-sampled map I256 is input to the structural-feature extractor to encode the layer 0-11 features. After all hierarchy information is obtained, feature mapping is performed: the features of each scale are mapped into different styles according to the input image size, i.e. (1, 2(n-1), 512).
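A concrete illustration of the hybrid encoder just described is sketched below in PyTorch-style Python for a 1024 × 1024 input (18 style vectors): the original image passes through a convolutional texture branch producing styles 12-17, and the down-sampled map I256 passes through a vision-Transformer structure branch producing styles 0-11. The layer split follows the text, but every module size, the patch embedding, and the pooling choices are illustrative assumptions, not the patented architecture.

    # Sketch of the multi-scale hybrid encoder; shapes assume a 1024x1024 input.
    # All module dimensions are illustrative placeholders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HybridInversionEncoder(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            # texture branch: pure convolution on the original image -> styles 12-17
            self.texture_cnn = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool2d(1),
            )
            self.texture_heads = nn.ModuleList([nn.Linear(128, dim) for _ in range(6)])
            # structure branch: vision Transformer on the 256x256 map -> styles 0-11
            self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
            block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.vit = nn.TransformerEncoder(block, num_layers=4)
            self.struct_heads = nn.ModuleList([nn.Linear(dim, dim) for _ in range(12)])

        def forward(self, img):                       # img: (1, 3, 1024, 1024)
            i256 = F.interpolate(img, size=256, mode='bilinear', align_corners=False)
            # texture features from the original image
            t = self.texture_cnn(img).flatten(1)      # (1, 128)
            tex = [h(t) for h in self.texture_heads]  # styles for layers 12-17
            # structural features from the down-sampled map I256
            p = self.patch_embed(i256).flatten(2).transpose(1, 2)  # (1, 256, 512)
            s = self.vit(p).mean(dim=1)               # (1, 512) pooled tokens
            struct = [h(s) for h in self.struct_heads]  # styles for layers 0-11
            return torch.stack(struct + tex, dim=1)   # latent code (1, 18, 512)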
2. Inversion vectors of size (1, 2(n-1), 512) are obtained according to the size of the input image, giving the corresponding inversion features C_id and C_back.
3. Identity decoupling. The obtained inversion features are input to the face-information feature-extraction exchange network. The face identity-decoupling network consists mainly of two parts: the first performs identity-feature encoding on one side and background-feature encoding, i.e. the identity-irrelevant attributes, on the other; the second performs identity-feature mixing, carrying out the identity-attribute transformation by the AdaIN method. First, the feature layers with the largest influence on identity information are selected, using the identity-feature response method: 200 face images imgX are inverted with the trained inversion encoder, and the corresponding latent codes, denoted CodeX, are obtained. To find the feature layers with the largest influence on identity information, the feature layers of CodeX are replaced: for each CodeX, 400 face images are randomly drawn from the dataset and inverted to obtain CodeY. From one CodeX and one CodeY, 18 feature-fused latent codes can be obtained, one for each layer of CodeX replaced by the corresponding layer of CodeY. Each mixed code is input into StyleGAN for image generation, and the identity cosine similarity between the generated image imgMIX and imgX is computed to judge which feature layers have the largest influence on identity information. The layers 8-10 screened in this way are then used for decoupling training.
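The screening procedure above is sketched below, assuming PyTorch, the encoder and generator from the earlier sketches, and a pretrained face-recognition embedder id_embed (e.g. an ArcFace-style network) to compute the identity cosine similarity; all names, and the exhaustive triple loop, are illustrative.

    # Sketch of identity-feature response screening over the 18 latent layers.
    # A higher average similarity drop means the layer carries more identity.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def layer_identity_response(imgs_x, imgs_y, encoder, generator, id_embed):
        response = torch.zeros(18)
        for img_x in imgs_x:                         # e.g. 200 probe faces
            code_x = encoder(img_x)                  # CodeX, (1, 18, 512)
            e_x = id_embed(img_x)                    # identity embedding of imgX
            for img_y in imgs_y:                     # e.g. 400 random faces
                code_y = encoder(img_y)              # CodeY
                for layer in range(18):
                    mixed = code_x.clone()
                    mixed[:, layer] = code_y[:, layer]        # swap one layer
                    img_mix = generator(mixed)                # imgMIX
                    sim = F.cosine_similarity(e_x, id_embed(img_mix), dim=-1)
                    response[layer] += 1.0 - sim.item()
        return response / (len(imgs_x) * len(imgs_y))  # layers 8-10 should peak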
4. Taking a single image as an example, the input to the identity-decoupling network has size (1, 3, 512): the codes corresponding to layers 8-10 are input to the identity-decoupling network, whose output of size (1, 3, 512) serves as the identity-decoupled layers 8-10; layers 5-7 are then replaced directly, yielding the final latent-space code C_mix of size (1, 18, 512).
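A minimal sketch of assembling C_mix under these choices follows, again assuming PyTorch: layers 5-7 are replaced directly and layers 8-10 are mixed AdaIN-style before an optional pass through the decoupling network. Both the direction of the direct copy and the exact AdaIN formulation over 512-dimensional style vectors are assumptions for illustration; decouple_net is a hypothetical stand-in for the trained identity-decoupling network.

    # Sketch of latent-code identity mixing; latent shapes are (1, 18, 512).
    import torch

    def adain(content, style, eps=1e-5):
        # re-normalize `content` to the per-layer statistics of `style`
        c_mu, c_std = content.mean(-1, keepdim=True), content.std(-1, keepdim=True)
        s_mu, s_std = style.mean(-1, keepdim=True), style.std(-1, keepdim=True)
        return s_std * (content - c_mu) / (c_std + eps) + s_mu

    def mix_codes(c_id, c_back, decouple_net=None):
        c_mix = c_back.clone()                  # keep the background attributes
        c_mix[:, 5:8] = c_id[:, 5:8]            # layers 5-7: direct replacement
        mixed = adain(c_back[:, 8:11], c_id[:, 8:11])  # layers 8-10: AdaIN mixing
        if decouple_net is not None:            # optional decoupling refinement
            mixed = decouple_net(mixed)         # (1, 3, 512) in and out
        c_mix[:, 8:11] = mixed
        return c_mix                            # final latent code C_mix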
5. The final latent-space code is input into the StyleGAN generator to obtain the corresponding image. FIG. 2 shows the resulting face swaps.
The above embodiments are merely preferred embodiments; their description is intended to help in understanding the method and the core idea of the present invention, and modifications and equivalents of the technical solutions described above fall within the scope of the present invention.

Claims (3)

1. A face exchange method based on identity information response is characterized by comprising the following steps:
(1) Establishing a StyleGAN-based face-image generation model;
(2) Constructing a StyleGAN inversion model, CTSNet: to achieve a better image-inversion effect, an inversion model combining a Transformer with convolution is proposed; drawing on the advantages of a convolutional network and a vision Transformer, a multi-scale encoder is designed in which the vision Transformer captures the low-pixel-scale structural features and the convolutional network extracts detail features such as texture and color;
(3) Inputting the target face image and the background face image into CTSNet to obtain the latent-space codes C_id and C_back of the corresponding images, respectively;
(4) Constructing an identity-feature response exchange network that directly extracts and exchanges the identity features of the latent space, so that the method is compatible with other inversion methods;
(5) Inputting the latent-space codes C_id and C_back of the target face image and the background face image into the identity-feature response exchange network to exchange the identity attribute, obtaining the latent-space code C_mix of the face-swapped image;
(6) Inputting the latent-space code of the face-swapped image into the StyleGAN-based face-image generation model to obtain the target image.
2. The face exchange method based on identity information response as claimed in claim 1, characterized in that the inversion model in step (2) is as follows:
the inversion encoder group is composed of two parts, the first part is a texture feature extractor, and the second part is a structural feature extractor. First we look at the texture extractor, in StyleGAN layers 16-17 correspond to 1024 x 1024 images, i.e., 14-15 to 512 x 512 and so on until 0-1 to 4 x 4. Generally, feature layers containing detail information such as texture information are mainly concentrated in 12-17 layers, wherein the higher the layer number is, the more certain the feature layers contain the texture information, because the local operation of convolution includes a small area, the features are more approximate to the texture features, the 12-13 layers are relatively fuzzy, the internal structure information is still much, and through related experiments, the final determination is that 12-17 layers are extracted by a texture encoder, but 256 × 256 images corresponding to 12-13 layers are still input into a vision transformer of the encoding structure features.
Firstly, a texture feature encoder is used, the group is a pure convolution structure and corresponds to StyleGAN, and for input images I with different sizes, firstly, down-sampled images are obtained and down-sampled to 256 × 256I256, and the original images are input into a texture feature extractor. The downsampled map I256 is input to the structural feature extractor to encode the 0-11 layer features.
3. The face exchange method based on identity information response as claimed in claim 1, characterized in that the identity-feature response exchange network operates as follows:
In order to reduce the difficulty of decoupling the identity information, the method first performs W+ feature screening to select the feature layers with the largest influence on identity information, using the identity-feature response method: 200 face images imgX are inverted with the trained inversion encoder, and the corresponding latent codes, denoted CodeX, are obtained. To find the feature layers with the largest influence on identity information, the feature layers of CodeX are replaced: for each CodeX, 400 face images are randomly drawn from the dataset and inverted to obtain CodeY. From one CodeX and one CodeY, 18 feature-fused latent codes can be obtained, one for each layer of CodeX replaced by the corresponding layer of CodeY. Each mixed code is input into StyleGAN for image generation, and the identity cosine similarity between the generated image imgMIX and imgX is computed to judge which feature layers have the largest influence on identity information.
The face identity-decoupling network consists mainly of two parts. The first performs identity-feature encoding on one side and background-feature encoding, i.e. the identity-irrelevant attributes, on the other. The second performs identity-feature mixing, carrying out the identity-attribute transformation by the AdaIN method.
CN202211224223.7A 2022-09-30 2022-09-30 Face exchange method based on identity information response Pending CN115527258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211224223.7A CN115527258A (en) 2022-09-30 2022-09-30 Face exchange method based on identity information response

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211224223.7A CN115527258A (en) 2022-09-30 2022-09-30 Face exchange method based on identity information response

Publications (1)

Publication Number Publication Date
CN115527258A true CN115527258A (en) 2022-12-27

Family

ID=84701096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211224223.7A Pending CN115527258A (en) 2022-09-30 2022-09-30 Face exchange method based on identity information response

Country Status (1)

Country Link
CN (1) CN115527258A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095136A (en) * 2023-10-19 2023-11-21 中国科学技术大学 Multi-object and multi-attribute image reconstruction and editing method based on 3D GAN
CN117095136B (en) * 2023-10-19 2024-03-29 中国科学技术大学 Multi-object and multi-attribute image reconstruction and editing method based on 3D GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination