CN115115560A - Image processing method, apparatus, device and medium

Image processing method, apparatus, device and medium

Info

Publication number
CN115115560A
Authority
CN
China
Prior art keywords
image
hair style
fusion
encoder
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210672542.8A
Other languages
Chinese (zh)
Inventor
李禹源
江源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210672542.8A priority Critical patent/CN115115560A/en
Publication of CN115115560A publication Critical patent/CN115115560A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 5/00: Image enhancement or restoration
                    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
                • G06T 7/00: Image analysis
                    • G06T 7/10: Segmentation; Edge detection
                        • G06T 7/11: Region-based segmentation
                    • G06T 7/90: Determination of colour characteristics
                • G06T 2207/00: Indexing scheme for image analysis or image enhancement
                    • G06T 2207/20: Special algorithmic details
                        • G06T 2207/20081: Training; Learning
                        • G06T 2207/20084: Artificial neural networks [ANN]
                        • G06T 2207/20212: Image combination
                            • G06T 2207/20221: Image fusion; Image merging
                    • G06T 2207/30: Subject of image; Context of image processing
                        • G06T 2207/30196: Human being; Person
                            • G06T 2207/30201: Face
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00: Computing arrangements based on biological models
                    • G06N 3/02: Neural networks
                        • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The application discloses an image processing method, apparatus, device, and medium, relating to the field of computer vision and in particular to image generation. The method comprises the following steps: performing face segmentation on a target image to obtain a mask image of the target image; fusing the mask image with the target image to obtain a fused image; inputting the fused image into an encoder for encoding to obtain an image coding vector; and inputting the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image, where the hair style generation model derives a plurality of control vectors from the image coding vector and controls a plurality of image features of the hair style transformation image according to these control vectors. Because the mask image is incorporated, the image features of the hair style transformation image are adjusted individually, which improves the clarity of its texture details.

Description

Image processing method, apparatus, device and medium
Technical Field
The present disclosure relates generally to the field of computer vision, in particular to image generation, and more specifically to an image processing method, apparatus, device, and medium.
Background
In the related art, hair style transformation is usually guided by a mask image obtained by segmenting the image into different regions; for example, different regions are directly assigned different values to change the style of a region, such as the hair style. However, the hair style transformation image obtained in this way has poor texture detail. Another common approach uses a generative adversarial network that takes a face image as input and outputs a hair style transformation image. Because the network generates the image from a single overall feature vector of the input face, and hair texture is relatively complex, it is difficult to ensure that every part of the generated image meets requirements: the contour of the face and other regions are often changed, and the texture detail of the hair is also poor.
Disclosure of Invention
In view of the foregoing defects or shortcomings in the prior art, it is desirable to provide an image processing method, apparatus, device, and medium that adjust a plurality of image features of a hair style transformation image individually, in combination with a mask image, and thereby improve the clarity of the texture details of the hair style transformation image.
In a first aspect, the present application provides an image processing method, including: performing face segmentation processing on a target image to obtain a mask image of the target image, wherein the mask image represents a hair region of the target image and a face region of the target image; performing fusion processing on the mask image and the target image to obtain a fused image; inputting the fused image into an encoder for encoding processing to obtain an image coding vector; and inputting the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image, wherein the hair style generation model obtains a plurality of control vectors based on the image coding vector and controls a plurality of image features of the hair style transformation image according to the plurality of control vectors.
In a second aspect, the present application provides an image processing apparatus comprising: a segmentation unit, configured to perform face segmentation processing on a target image to obtain a mask image of the target image, wherein the mask image represents a hair region of the target image and a face region of the target image; a fusion unit, configured to perform fusion processing on the mask image and the target image to obtain a fused image; an encoding unit, configured to input the fused image into an encoder for encoding processing to obtain an image coding vector; and a generating unit, configured to input the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image, wherein the hair style generation model obtains a plurality of control vectors based on the image coding vector and controls a plurality of image features of the hair style transformation image according to the plurality of control vectors.
In a possible implementation manner of the second aspect, the generating unit is specifically configured to: obtaining the plurality of control vectors according to the image coding vector; respectively inputting the control vectors into a plurality of convolution networks of the hair style generation model to obtain a plurality of feature maps; and performing fusion processing on the plurality of feature maps to obtain the hair style transformation image of the target image.
In a possible implementation manner of the second aspect, the apparatus further includes: a decoder, configured to perform decoding and reconstruction processing on the image coding vector to obtain a plurality of reconstructed images, where the hair style in the reconstructed images differs from the hair style in the target image. The generating unit is specifically configured to: inputting the image coding vector into the hair style generation model to obtain a plurality of output images, where the output images have different resolutions; and performing fusion processing on the plurality of output images and the plurality of reconstructed images to obtain the hair style transformation image of the target image.
In a possible implementation manner of the second aspect, the generating unit is specifically configured to: inputting output images 1 to n into fusion modules 1 to n in one-to-one correspondence, and inputting reconstructed images 1 to n into fusion modules 1 to n in one-to-one correspondence, where fusion module 1 through fusion module n are connected in sequence and n is the number of output images; performing fusion processing on output image 1 and reconstructed image 1 through fusion module 1 to obtain fused image 1; performing fusion processing on output image x, reconstructed image x and up-sampled image y through fusion module x to obtain fused image x, where x is a positive integer greater than 1 and less than or equal to n, up-sampled image y is obtained by up-sampling fused image y output by fusion module y, and y = x-1; and when x equals n, obtaining fused image n output by fusion module n and taking fused image n as the hair style transformation image of the target image.
In a possible implementation manner of the second aspect, the encoding unit is specifically configured to:
inputting the image coding vector into a plurality of deconvolution networks of the decoder to obtain a plurality of reconstructed images.
In a possible implementation manner of the second aspect, the training process of the decoder includes: inputting a sample image into the encoder, and obtaining an encoding vector of the sample image; inputting the coding vector of the sample image into an initial decoder, training the initial decoder according to the loss between the output of the initial decoder and a label image, and obtaining the decoder, wherein the label image comprises a hair style different from that of the sample image.
In a possible implementation manner of the second aspect, the training process of the encoder includes: inputting a sample image into an initial encoder, and training the initial encoder according to the loss between the output of the initial encoder and the latent vector of the sample image to obtain the encoder.
In a possible implementation manner of the second aspect, the segmentation unit is specifically configured to: inputting the target image into a self-encoder to obtain a mask image of the target image, wherein the training samples of the self-encoder are face images, the label of each training sample is the mask image of the face image, and the mask image of the face image represents the face region and the hair style region of the face image.
In a possible implementation manner of the second aspect, the training process of the self-encoder includes: and inputting the training samples into an initial self-encoder, and training the initial self-encoder according to the loss between the output of the initial self-encoder and the labels of the training samples to obtain the self-encoder.
In a possible implementation manner of the second aspect, the fusion unit is specifically configured to perform any one of the following: adding and fusing pixels of the mask image with corresponding pixels of the target image to obtain the fused image; performing feature fusion on the regions of the mask image and the regions of the target image using an attention mechanism to obtain the fused image; or appending the pixel color channel of the mask image to the corresponding pixel color channels of the target image to obtain the fused image.
In a third aspect, embodiments of the present application provide a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method as described in embodiments of the present application when executing the program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method as described in the embodiments of the present application.
According to the image processing method and apparatus provided by the present application, the hair style generation model obtains a plurality of control vectors from the image coding vector of the input fused image, and the plurality of image features of the hair style transformation image can be controlled individually according to these control vectors. Controlling different image features independently through different control vectors prevents the adjustment of one image feature from affecting the others, so the hair style transformation image can be controlled at a finer level of features, which improves its texture. In addition, the image coding vector fed to the hair style generation model contains the features of the image obtained by fusing the target image with the mask image, including the features of the hair region and face region represented by the mask image. Therefore, according to the features of the hair region and face region of the mask image, the face contour and the hair extent of the hair style transformation image can be effectively controlled, artifacts in the hair style transformation image are reduced, and its display effect is improved.
Therefore, the method and apparatus adjust the image features of the hair style transformation image individually in combination with the mask image; compared with the prior art, in which the hair style transformation image is generated from the face image alone, the clarity of the texture details of the hair style transformation image is improved and its quality is effectively raised.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a self-encoder according to an embodiment of the present application;
fig. 3 is a schematic view of a processing model of an image processing method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a fusion model for fusing an output image and a reconstructed image according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of a fusion module provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments and the attached drawings.
At present, when the person in an image is subjected to hair style transformation, such as changing the hair style or adding bangs, the textures of the hair style, the bangs and other details in the resulting hair style transformation image are often not clear.
Based on this, the present application provides an image processing method, apparatus, device, and storage medium that use a fused image, obtained by fusing a target image with its mask image, to control a plurality of image features of the hair style transformation image separately, thereby greatly improving the clarity of textures such as the hair style in the hair style transformation image.
In the implementation environment of the embodiments of the application, a personal computer, a mobile terminal or a similar device can perform face segmentation processing on the target image to obtain a mask image of the target image; perform fusion processing on the mask image and the target image to obtain a fused image; input the fused image into an encoder for encoding processing to obtain an image coding vector; and input the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image.
Alternatively, the method may be implemented by a server. For example, a personal computer or a mobile terminal sends a request to the server; the server performs face segmentation processing on the target image to obtain a mask image of the target image, performs fusion processing on the mask image and the target image to obtain a fused image, inputs the fused image into an encoder for encoding processing to obtain an image coding vector, and inputs the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image, which is finally returned to the personal computer or mobile terminal.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform.
The embodiment of the application provides an image processing method, which can be applied to the computer device shown in fig. 7. As shown in fig. 1, the method comprises the following steps:
101. Performing face segmentation processing on the target image to obtain a mask image of the target image, where the mask image represents a hair region of the target image and a face region of the target image.
The target image may be any image containing a face, and its mask image may be a black-and-white mask image; for example, the face region and the hair region of the mask image are white, and the other regions are black.
Face segmentation first identifies the face and the hair in the target image, then segments the image into a face region, a hair region and other regions, and finally assigns black or white to the different regions to obtain the mask image.
In a possible implementation manner, performing face segmentation processing on a target image to obtain a mask image of the target image includes: inputting a target image into a self-encoder to obtain a mask image of the target image, wherein a training sample of the self-encoder is a face image, a label of the training sample is the mask image of the face image, and the mask image of the face image is used for representing a face area and a hairstyle area of the face image.
As shown in fig. 2, the self-encoder includes an encoder 1 and a decoder 1: the target image passes through the encoder 1, which extracts its features, and the decoder 1 then generates the mask image corresponding to the target image.
In this example, the encoder 1 includes a plurality of convolutional layers. As shown in fig. 2, taking an encoder 1 with three convolutional layers as an example, encoder 1 is a stack of three convolutional layers, from left to right a first, a second and a third convolutional layer, which extract image features of the target image at different resolutions. The decoder 1 includes a plurality of deconvolution layers; taking a decoder 1 with three deconvolution layers as an example, decoder 1 is a stack of three deconvolution layers, from right to left a first, a second and a third deconvolution layer, which generate the mask image of the target image by deconvolution from the image features at different resolutions passed on by encoder 1. To improve the accuracy of the resulting mask image, in one example, as shown in fig. 2, the image features of encoder 1 and decoder 1 at corresponding resolutions are linked by skip connections, that is, the first convolutional layer is connected to the first deconvolution layer, the second convolutional layer to the second deconvolution layer, and the third convolutional layer to the third deconvolution layer. Connecting the features of encoder 1 and decoder 1 at corresponding resolutions in this way ensures that each deconvolution layer of decoder 1 in the self-encoder receives richer features and loses less information, so a more accurate mask image can be decoded.
A down-sampling layer (i.e., a pooling layer) may also be placed between the convolutional layers of the self-encoder, for example one between the first and second convolutional layers and one between the second and third convolutional layers. The input of a pooling layer comes from the preceding convolutional layer, and it mainly provides robustness; for example, max pooling takes the maximum value within a small region, so if the other values in that region change slightly, or the image shifts slightly, the pooled result remains unchanged. Pooling layers are sandwiched between successive convolutional layers and compress the amount of data and parameters, reducing over-fitting.
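For illustration only, the following is a minimal PyTorch sketch of such a three-level self-encoder with skip connections and pooling. The channel widths, kernel sizes and the single-channel mask output are assumptions made for the sketch; the patent does not fix these hyperparameters.

```python
import torch
import torch.nn as nn

class MaskSelfEncoder(nn.Module):
    """Three-level encoder/decoder with skip connections, in the spirit of fig. 2."""
    def __init__(self):
        super().__init__()
        # Encoder 1: three stacked convolutional layers with pooling between them.
        self.conv1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        # Decoder 1: three deconvolution (transposed convolution) layers.
        self.deconv3 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU())
        self.deconv2 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 32, 4, stride=2, padding=1), nn.ReLU())
        self.deconv1 = nn.Conv2d(32 + 32, 1, 3, padding=1)  # single-channel mask logits

    def forward(self, x):
        f1 = self.conv1(x)               # full-resolution features
        f2 = self.conv2(self.pool(f1))   # 1/2 resolution
        f3 = self.conv3(self.pool(f2))   # 1/4 resolution
        d3 = self.deconv3(f3)            # back to 1/2 resolution
        d2 = self.deconv2(torch.cat([d3, f2], dim=1))  # skip connection from conv2
        d1 = self.deconv1(torch.cat([d2, f1], dim=1))  # skip connection from conv1
        return torch.sigmoid(d1)         # mask values in [0, 1]
```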
It will be appreciated that this example uses a self-encoder that is a neural network model: the target image is input into the neural network model to obtain the mask image of the target image. Before the neural network model is used, it may be trained in advance, that is, the self-encoder is trained.
In one embodiment, the training process of the above self-encoder is also provided, and includes: inputting training samples into an initial self-encoder, and training the initial self-encoder according to the loss between the output of the initial self-encoder and the labels of the training samples, to obtain the self-encoder. That is, training samples of face images are obtained in advance together with their mask images, and each mask image serves as the label of its training sample. The parameters of the initial self-encoder are adjusted according to the loss between its output and the labels of the training samples, and when the loss meets the requirement, for example falls below a preset value, training is complete and the self-encoder is obtained. In this way, when any target image containing a human face is input into the self-encoder, the mask image of that target image can be obtained.
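A possible sketch of this training procedure, assuming the MaskSelfEncoder above, (face image, mask label) pairs, an L1 pixel loss and an Adam optimizer; the concrete loss function, optimizer and stopping threshold are illustrative assumptions, since the patent only requires training on the loss between the output and the label:

```python
import torch
from torch.utils.data import DataLoader

def train_self_encoder(model, dataset, epochs=20, loss_threshold=0.01):
    """Train the initial self-encoder on (face image, mask label) pairs."""
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.L1Loss()            # loss between output and label
    for epoch in range(epochs):
        for face, mask_label in loader:       # training sample and its mask label
            pred_mask = model(face)
            loss = criterion(pred_mask, mask_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < loss_threshold:      # stop once the loss meets the preset value
            break
    return model
```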
102. Performing fusion processing on the mask image and the target image to obtain a fused image.
As shown in fig. 3, "+" in fig. 3 denotes the fusion operation between the mask image and the target image. The fusion method includes, but is not limited to, pixel-by-pixel addition, cascade fusion, and attention-based fusion; that is, the fusion processing of the mask image and the target image to obtain the fused image may use any one of the following methods:
(1) Pixel-by-pixel addition: the pixels of the mask image are added to the corresponding pixels of the target image to obtain the fused image.
(2) Attention-based fusion: the regions of the mask image and the regions of the target image are fused using an attention mechanism to obtain the fused image.
(3) Cascade (concatenate) fusion: the pixel color channel of the mask image is appended to the corresponding pixel color channels of the target image to obtain the fused image. For example, the color of a pixel of the target image is characterized by a three-channel color vector [R, G, B], while the color of a pixel of the mask image is characterized by a single channel, such as 0 for black and 255 for white; after the corresponding pixel color channels are concatenated, the pixel color is characterized by four channels, namely [R, G, B, 0 or 255]. In this embodiment, the fused image is obtained by fusing the target image and the mask image with the cascade fusion method. That is, the mask image of the target image (i.e., the face segmentation mask) extracted by the self-encoder shown in fig. 2 is concatenated with the target image that was input to the self-encoder, so that the pixel colors of the fused image use the four-channel representation [R, G, B, 0 or 255]. Because the fourth channel encodes the hair region and the face region of the target image, fusing the mask image with the target image allows the contour of the face and the extent of the hair in the subsequently generated hair style transformation image to be controlled according to that fourth channel, which effectively reduces artifacts in the generated hair style transformation image and improves its quality.
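The cascade fusion itself reduces to a single channel concatenation. A minimal sketch, assuming an (N, C, H, W) tensor layout:

```python
import torch

def cascade_fuse(target_rgb: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Concatenate a 3-channel target image (N, 3, H, W) with a
    1-channel mask (N, 1, H, W) into a 4-channel fused image."""
    return torch.cat([target_rgb, mask], dim=1)   # (N, 4, H, W): [R, G, B, mask]
```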
103. Inputting the fused image into an encoder for encoding processing to obtain an image coding vector.
Here, the image coding vector may be a latent vector code W, that is, a one-dimensional feature vector obtained by mapping the high-dimensional image space to a one-dimensional space.
In one possible implementation, as shown in fig. 3, the encoder 2 extracts the image features of the fused image to obtain the latent vector code W, i.e., the image coding vector.
In this example, the encoder 2 is a feature extraction network comprising a plurality of convolutional layers. Before the encoder 2 is used to obtain the image coding vector from the fused image, a process of training the encoder 2 is therefore also provided, which includes: inputting a sample image into an initial encoder and training the initial encoder according to the loss between the output of the initial encoder and the latent vector of the sample image, to obtain the encoder 2. For example, when the encoder 2 uses a ResNet-50 feature extraction network, the distance metric L1 loss may be used in the training stage as the loss function between the latent code W (i.e., the image coding vector) output by the encoder 2 and the latent vector of the sample image; the loss between the output of the initial encoder and the latent vector of the sample image is obtained through this L1 loss, and when the loss satisfies a set condition, training of the initial encoder is complete and the encoder 2 is obtained.
The L1 loss is a distance-metric-based loss function. It generally maps the input data onto a feature space with a distance metric, such as Euclidean space, treats the mapped samples as points in that space, and measures the distance between the true value and the predicted value of a sample in the feature space with a suitable loss function. Generally, the smaller the distance between the two points in the feature space, the better the prediction performance of the encoder 2.
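As an illustration of this stage, the following sketch assumes a torchvision ResNet-50 backbone whose first convolution is widened to accept the four-channel fused image and whose final layer outputs a 512-dimensional latent code W, trained with the L1 distance to the target latent vector. The dimensions and the input adaptation are assumptions, not details fixed by the patent:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class LatentEncoder(nn.Module):
    """Encoder 2 sketch: maps a 4-channel fused image to a latent code W."""
    def __init__(self, latent_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)  # in practice a pretrained backbone would likely be loaded
        # widen the stem to accept the [R, G, B, mask] input (assumption)
        backbone.conv1 = nn.Conv2d(4, 64, 7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, latent_dim)
        self.backbone = backbone

    def forward(self, fused_image):
        return self.backbone(fused_image)  # (N, latent_dim)

def encoder_loss(pred_w, target_w):
    """L1 distance between the predicted code and the sample's latent vector."""
    return nn.functional.l1_loss(pred_w, target_w)
```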
104. Inputting the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image, where the hair style generation model obtains a plurality of control vectors based on the image coding vector and controls a plurality of image features of the hair style transformation image according to the plurality of control vectors.
In one possible implementation, inputting the image coding vector into the hair style generation model to obtain the hair style transformation image of the target image includes: obtaining a plurality of control vectors from the image coding vector; inputting the control vectors into a plurality of convolution networks of the hair style generation model to obtain a plurality of feature maps; and performing fusion processing on the plurality of feature maps to obtain the hair style transformation image of the target image. In other words, the hair style generation model does not generate the feature maps directly from the image coding vector. Unlike a traditional generator, which feeds the latent code W directly to its input layer, the latent code W is first mapped non-linearly, f: Z → W, through a mapping network consisting of, for example, 8 fully connected layers, where Z and W have the same dimensions, e.g. 512×1. The mapping network encodes the latent code W into an intermediate vector, which is then passed to a generation network, for example an 18-layer generation network, so that each layer of the generation network generates one control vector; 18 control vectors can thus be obtained, enabling different control vectors to control different image features, including but not limited to color features and texture features.
In the above description, the plurality of image features refers to features extracted from an image, such as color features and texture features. The plurality of feature maps refers to images of different resolutions, for example the images of different resolutions output by the deconvolution layers of the multiple convolution networks of the hair style generation model.
The hair style generation model controls the image features of the hair style transformation image with the plurality of control vectors obtained from the image coding vector, which improves the clarity of the texture details of the hair style transformation image. For example, when one part of the control vectors controls one set of image features of the hair style transformation image and another part of the control vectors controls another set, adjusting either set does not affect the other; that is, when certain image features are adjusted to meet the requirements, the remaining image features do not change. Controlling different image features separately in this way gives the final hair style transformation image clearer texture details and higher quality.
As shown in fig. 3, the image coding vector is input into the hair style generation model 3 to obtain the hair style transformation image of the target image. In this example, the hair style generation model 3 uses a neural network comprising StyleGAN. Since StyleGAN contains rich portrait texture information, it can serve as prior knowledge of the texture and details of the generated hair style transformation image. Specifically, inputting the latent vector code W into StyleGAN modulates the weights of each convolution layer in StyleGAN, which helps to generate rich texture details.
The main working principle of StyleGAN is as follows. Unlike a traditional generator, which feeds the latent code W directly to its input layer, StyleGAN first maps the latent code W non-linearly, f: Z → W, through an 8-layer fully connected mapping network, where Z and W have the same dimensions, e.g. 512×1. The mapping network encodes the latent code W into an intermediate vector, which is then passed to a generation network, for example an 18-layer generation network, yielding 18 control vectors that allow different control vectors to control different image features, including but not limited to color features and texture features. The purpose of the mapping network is to disentangle the image features: for example, without it, controlling the colors generated for the face at a resolution of 32×32 would also change other features controlled at that resolution, such as texture. The mapping network unwraps the features of the latent code W so that specific image features can be adjusted by different control vectors without influencing the others. Different image features can therefore be adjusted separately, the texture details of each image feature can be made sufficiently clear, and a high-quality hair style transformation image with high-definition texture details can be obtained.
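The following is a simplified sketch of this mapping stage, not the actual StyleGAN implementation: an 8-layer fully connected mapping network turns the latent code into an intermediate vector, and one affine transform per generator layer produces the 18 control vectors. The layer widths and the per-layer affine transforms are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """8-layer MLP: latent code z -> intermediate vector w (same 512 dimensionality)."""
    def __init__(self, dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.mlp = nn.Sequential(*layers)

    def forward(self, z):
        return self.mlp(z)

class ControlVectors(nn.Module):
    """One control (style) vector per generation-network layer, derived from w."""
    def __init__(self, dim=512, num_layers=18):
        super().__init__()
        self.affines = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, w):
        # Each layer's affine transform yields an independent control vector,
        # so color, texture, etc. can be adjusted without disturbing the others.
        return [affine(w) for affine in self.affines]
```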
In one specific application, the target image is a portrait with one hair style, and a change to another hair style is desired. As shown in fig. 3, the target image (the leftmost image in fig. 3) is processed by the self-encoder to obtain its mask image, the two images are fused by cascade fusion to obtain a fused image, and the fused image is input into the encoder 2, which produces the image coding vector W. The image coding vector W is input into the decoder 2 and the hair style generation model 3: the decoder 2 obtains from W a plurality of reconstructed images whose hair styles differ from that of the target image, and the hair style generation model 3 generates from W a plurality of output images. The output images and the reconstructed images are then fused by the fusion model 4, finally yielding a hair style transformation image in which the portrait has undergone the hair style change (the rightmost image in fig. 3). As shown in fig. 3, a hair style transformation image with bangs added is obtained, whose hair style differs from that of the target image.
In the image processing method provided by this embodiment of the application, the hair style generation model obtains a plurality of control vectors from the image coding vector of the input fused image: for example, the image coding vector is mapped non-linearly through a multi-layer fully connected mapping network, which encodes it into an intermediate vector; the intermediate vector is then passed to a multi-layer generation network, and each layer of the generation network generates one control vector, so a plurality of control vectors is obtained. The plurality of image features of the hair style transformation image can then be controlled individually according to these control vectors. Controlling different image features separately through different control vectors prevents the adjustment of one image feature from changing the others, allows the hair style transformation image to be controlled through finer features, and improves its texture. In addition, the image coding vector fed to the hair style generation model contains the features of the image obtained by fusing the target image with the mask image; therefore, according to the features of the hair region and the face region of the mask image, the face contour and the hair extent of the hair style transformation image can be effectively controlled, artifacts in the hair style transformation image are reduced, and its display effect is improved.
In a possible implementation, the method further includes: inputting the image coding vector into a decoder for decoding and reconstruction processing to obtain a plurality of reconstructed images, where the hair style in the reconstructed images differs from the hair style in the target image. On this basis, inputting the image coding vector into the hair style generation model to obtain the hair style transformation image of the target image includes: inputting the image coding vector into the hair style generation model to obtain a plurality of output images, where the output images have different resolutions; and performing fusion processing on the plurality of output images and the plurality of reconstructed images to obtain the hair style transformation image of the target image.
A reconstructed image is an image obtained by transforming the hair style of the portrait in the target image; the reconstructed image and the target image differ in image quality and in the hair style of the portrait, but the portraits in both represent the same person. The output images are the same as the feature maps of the above embodiment: the hair style generation model obtains a plurality of control vectors from the image coding vector and inputs them into its multiple convolution networks to obtain the output images. The output images are generated based on the target image, with different image features controlled by different control vectors, which improves their quality. Because the hair style generation model uses a neural network such as StyleGAN, the quality of its output images is high, but the portrait in the output images may differ from the portrait in the target image.
As shown in fig. 3, the image coding vector W is input into the hair style generation model 3 to obtain a plurality of output images; inputting the image coding vector W into a decoder 2 for decoding and reconstructing processing to obtain a plurality of reconstructed images; and performing fusion processing on the plurality of output images and the plurality of reconstructed images to obtain a hairstyle transformation image of the target image.
In this example, inputting the image coding vector into the decoder 2 for decoding and reconstruction processing to obtain a plurality of reconstructed images includes: inputting the image coding vector into a plurality of deconvolution networks of the decoder 2 to obtain the plurality of reconstructed images. That is, the reconstructed images are generated by a stack of deconvolution layers.
In one possible implementation, the present application provides a process for training the decoder 2, including: inputting a sample image into the encoder to obtain the encoding vector of the sample image; and inputting the encoding vector of the sample image into an initial decoder and training the initial decoder according to the loss between the output of the initial decoder and a label image, to obtain the decoder 2, where the hair style in the label image differs from the hair style in the sample image. For example, training of the decoder 2 is supervised with an L1 reconstruction loss and a perceptual loss between its output and the label image. The encoder 2 and the decoder 2 may form a generative adversarial network; however, since hair texture is generally complex, the hair in an image generated by such a network alone lacks realistic texture, and it is difficult to ensure the clarity of the hair style texture.
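A hedged sketch of such supervision, combining an L1 reconstruction loss with a VGG-feature perceptual loss between the decoder output and the label image; the choice of VGG-16 features, the truncation point and the loss weighting are assumptions rather than details given by the patent:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class ReconstructionLoss(nn.Module):
    """L1 loss + perceptual loss between decoder output and label image (sketch)."""
    def __init__(self, perceptual_weight=0.1):
        super().__init__()
        # pretrained VGG weights would be used in practice; weights=None keeps the sketch self-contained
        self.features = vgg16(weights=None).features[:16].eval()
        for p in self.features.parameters():
            p.requires_grad = False
        self.l1 = nn.L1Loss()
        self.w = perceptual_weight

    def forward(self, output, label):
        pixel_loss = self.l1(output, label)
        perceptual_loss = self.l1(self.features(output), self.features(label))
        return pixel_loss + self.w * perceptual_loss
```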
As can be seen from the above embodiment, the hair style generation model 3 can produce a hair style transformation image of the target image with clear texture details, but some aspects of the face, such as the facial features, may differ from those in the target image. That is, because the hair style generation model uses the StyleGAN neural network, it can control different image features of the output image independently through different control vectors and improve image quality, but its working principle means the portrait in the output image can differ from the portrait in the target image. Therefore, in the present application, in order to make the face in the hair style transformation image more consistent with the face in the target image, the decoder 2 is used to obtain a plurality of reconstructed images that are closer to the face in the target image, and these reconstructed images are then fused with the output images of the hair style generation model 3, so that the face in the resulting hair style transformation image is consistent with the face in the target image and the clarity of the texture details of the hair style transformation image is improved.
In one possible implementation, as shown in fig. 4, the fusion model 4 may be used to fuse the plurality of output images and the plurality of reconstructed images to obtain the hair style transformation image of the target image; this includes inputting the output images and the reconstructed images, in one-to-one correspondence, into a plurality of sequentially cascaded fusion modules and fusing them to obtain the hair style transformation image of the target image.
As shown in fig. 4, assume the number of reconstructed images, output images and fusion modules is n, where n is a positive integer. The fusion model 4 fuses the output images and the reconstructed images to obtain the hair style transformation image of the target image as follows: output images 1 to n are input into fusion modules 1 to n in one-to-one correspondence, and reconstructed images 1 to n are input into fusion modules 1 to n in one-to-one correspondence, with fusion module 1 through fusion module n connected in sequence and n being the number of output images; fusion module 1 fuses output image 1 and reconstructed image 1 to obtain fused image 1; fusion module x fuses output image x, reconstructed image x and up-sampled image y to obtain fused image x, where x is a positive integer greater than 1 and not greater than n, up-sampled image y is obtained by up-sampling fused image y output by fusion module y, and y = x-1; and when x equals n, fused image n output by fusion module n is obtained and taken as the hair style transformation image of the target image.
Referring to fig. 4, Feat_D1, Feat_D2 to Feat_Dn are the reconstructed images produced by the decoder 2, and Feat_G1, Feat_G2 to Feat_Gn are the output images produced by the hair style generation model 3. Fusion modules 1 to n perform the image fusion processing, that is, they progressively fuse Feat_D1, Feat_D2 to Feat_Dn with Feat_G1, Feat_G2 to Feat_Gn to obtain the hair style transformation image. This is referred to as the progressive fusion mode.
In the progressive fusion mode, the specific fusion process from fusion module 1 to fusion module n is as follows: fusion module 1 fuses Feat_G1 and Feat_D1, up-samples the fused features and passes them to fusion module 2; fusion module 2 fuses Feat_G2, Feat_D2 and the output features of fusion module 1, up-samples the fused features and passes them to the next fusion module; and so on, until the fused image output by the last fusion module n is taken as the hair style transformation image of the target image.
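Written as a loop, the progressive fusion reads as in the sketch below; fusion_modules stands for fusion modules 1 to n (for example, instances of the FusionModule sketched after the next paragraph), and the 2x bilinear up-sampling is an assumption:

```python
import torch.nn.functional as F

def progressive_fusion(fusion_modules, outputs, reconstructions):
    """outputs[i] / reconstructions[i]: i-th resolution feature images from the
    hair style generation model and the decoder, lowest resolution first."""
    fused = fusion_modules[0](outputs[0], reconstructions[0])
    for x in range(1, len(outputs)):
        upsampled = F.interpolate(fused, scale_factor=2, mode='bilinear', align_corners=False)
        fused = fusion_modules[x](outputs[x], reconstructions[x], upsampled)
    return fused  # fused image n == hair style transformation image
```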
As shown in fig. 5, taking fusion module 1 as an example, its inputs may first be concatenated by a cascade (concatenate) layer and features extracted by a conv + bn + relu convolution block. The features then pass through a global pooling layer, a 1x1 convolution layer and a sigmoid layer in sequence to extract the global information of each channel; the global information is multiplied with the features channel by channel through a multiplication layer, and the product is then added to the features pixel by pixel through an addition layer to complete the fusion. This channel-wise weighting and pixel-wise addition integrates the images well, effectively preserves the features of the target image, improves the texture details of the hair style transformation image, and thus further raises its quality, yielding a high-quality hair style transformation image.
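A sketch of one such fusion module following the description above: concatenation, conv + bn + relu feature extraction, a global pool, 1x1 conv and sigmoid branch for per-channel global information, channel-wise multiplication and pixel-wise addition. The channel counts are assumptions and must match the concatenated inputs:

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # conv + bn + relu feature extraction after the cascade (concatenate) layer
        self.feature = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # global pool -> 1x1 conv -> sigmoid: global information per channel
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, *inputs):
        # inputs: (output image x, reconstructed image x[, up-sampled image y])
        feat = self.feature(torch.cat(inputs, dim=1))   # concatenate, then conv/bn/relu
        gate = self.attention(feat)                      # per-channel global weights
        return feat * gate + feat                        # channel-wise multiply, pixel-wise add
```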
In an embodiment of the present application, the outputs of the global pooling layer, the 1x1 convolution layer and the sigmoid layer of each of fusion modules 1 to n may also be supervised, to further improve the clarity of the hair style transformation image.
Fig. 6 is a block diagram of an image processing apparatus according to an embodiment of the present application.
As shown in fig. 6, the image processing apparatus includes: a segmentation unit 601, a fusion unit 602, an encoding unit 603, and a generation unit 604, wherein:
a segmentation unit 601, configured to perform face segmentation processing on a target image to obtain a mask image of the target image, where the mask image represents a hair region of the target image and a face region of the target image; a fusion unit 602, configured to perform fusion processing on the mask image and the target image to obtain a fused image; an encoding unit 603, configured to input the fused image into an encoder for encoding processing to obtain an image coding vector; and a generating unit 604, configured to input the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image, where the hair style generation model obtains a plurality of control vectors based on the image coding vector and controls a plurality of image features of the hair style transformation image according to the plurality of control vectors.
In a possible implementation manner, the generating unit 604 is specifically configured to: obtaining a plurality of control vectors according to the image coding vector; respectively inputting the control vectors into a plurality of convolution networks of the hair style generation model to obtain a plurality of feature maps; and performing fusion processing on the plurality of feature maps to obtain the hair style transformation image of the target image.
In one possible implementation manner, the apparatus further includes: a decoder, configured to perform decoding and reconstruction processing on the image coding vector to obtain a plurality of reconstructed images, where the hair style in the reconstructed images differs from the hair style in the target image. The generating unit 604 is specifically configured to: inputting the image coding vector into the hair style generation model to obtain a plurality of output images, where the output images have different resolutions; and performing fusion processing on the plurality of output images and the plurality of reconstructed images to obtain the hair style transformation image of the target image.
In a possible implementation manner, the generating unit 604 is specifically configured to: inputting output images 1 to n into fusion modules 1 to n in one-to-one correspondence, and inputting reconstructed images 1 to n into fusion modules 1 to n in one-to-one correspondence, where fusion module 1 through fusion module n are connected in sequence and n is the number of output images; performing fusion processing on output image 1 and reconstructed image 1 through fusion module 1 to obtain fused image 1; performing fusion processing on output image x, reconstructed image x and up-sampled image y through fusion module x to obtain fused image x, where x is a positive integer greater than 1 and less than or equal to n, up-sampled image y is obtained by up-sampling fused image y output by fusion module y, and y = x-1; and when x equals n, obtaining fused image n output by fusion module n and taking fused image n as the hair style transformation image of the target image.
In a possible implementation manner, the encoding unit 603 is specifically configured to:
inputting the image coding vector into a plurality of deconvolution networks of the decoder to obtain a plurality of reconstructed images.
In one possible implementation, the training process of the decoder includes: inputting a sample image into an encoder to obtain an encoding vector of the sample image; and inputting the coding vector of the sample image into an initial decoder, training the initial decoder according to the loss between the output of the initial decoder and the label image, and obtaining the decoder, wherein the hair style included by the label image is different from the hair style included by the sample image.
In one possible implementation, the training process of the encoder includes: inputting a sample image into an initial encoder, and training the initial encoder according to the loss between the output of the initial encoder and the latent vector of the sample image to obtain the encoder.
In a possible implementation manner, the segmentation unit 601 is specifically configured to: inputting a target image into a self-encoder to obtain a mask image of the target image, wherein a training sample of the self-encoder is a face image, a label of the training sample is the mask image of the face image, and the mask image of the face image is used for representing a face area and a hairstyle area of the face image.
In one possible implementation, the training process of the self-encoder includes: and inputting the training samples into an initial self-encoder, and training the initial self-encoder according to the loss between the output of the initial self-encoder and the labels of the training samples to obtain the self-encoder.
In a possible implementation manner, the fusion unit 602 is specifically configured to perform any one of the following: adding and fusing pixels of the mask image with corresponding pixels of the target image to obtain the fused image; fusing the regions of the mask image and the regions of the target image using an attention mechanism to obtain the fused image; or appending the pixel color channel of the mask image to the corresponding pixel color channels of the target image to obtain the fused image.
In the image processing apparatus provided by this embodiment of the application, the hair style generation model obtains a plurality of control vectors from the image coding vector of the input fused image: for example, the image coding vector is mapped non-linearly through a multi-layer fully connected mapping network, which encodes it into an intermediate vector; the intermediate vector is then passed to a multi-layer generation network, and each layer of the generation network generates one control vector, so a plurality of control vectors is obtained. The plurality of image features of the hair style transformation image can then be controlled individually according to these control vectors. Controlling different image features separately through different control vectors prevents the adjustment of one image feature from changing the others, allows the hair style transformation image to be controlled through finer features, and improves its texture. In addition, the image coding vector fed to the hair style generation model contains the features of the image obtained by fusing the target image with the mask image; therefore, according to the features of the hair region and the face region of the mask image, the face contour and the hair extent of the hair style transformation image can be effectively controlled, artifacts in the hair style transformation image are reduced, and its display effect is improved.
It should be understood that the units described in the image processing apparatus correspond to the respective steps in the image processing method described with reference to fig. 1. Thus, the operations and features described above for the method are equally applicable to the image processing apparatus and the units included therein, and are not described in detail here. The image processing apparatus may be implemented in a browser or other security applications of the computer device in advance, or may be loaded into the browser or other security applications of the computer device by downloading or the like. The corresponding units in the image processing apparatus may cooperate with units in the computer device to implement the solution of the embodiments of the present application.
The division into several modules or units mentioned in the above detailed description is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.
It should be noted that, for details not disclosed in the image processing apparatus of the embodiment of the present application, reference may be made to the details disclosed in the above embodiments of the present application, which are not described herein again.
Referring now to fig. 7, fig. 7 shows a schematic block diagram of a computer device suitable for implementing embodiments of the present application. As shown in fig. 7, a computer system 700 includes a Central Processing Unit (CPU) 701 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the system. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as necessary, so that a computer program read therefrom is installed into the storage section 708 as needed.
In particular, according to embodiments of the present application, the process described above with reference to the flowchart of fig. 1 may be implemented as a computer software program. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When executed by the Central Processing Unit (CPU) 701, the computer program performs the above-described functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or by hardware. The described units or modules may also be provided in a processor, and may, for example, be described as: a processor including a first receiving module, a second receiving module, and a sending module. The names of these units or modules do not, in some cases, constitute a limitation of the units or modules themselves.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable storage medium stores one or more programs that, when executed by one or more processors, perform the image processing methods described herein.
The foregoing description is only exemplary of the preferred embodiments of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other combinations of features described above or their equivalents without departing from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (13)

1. An image processing method, comprising:
performing face segmentation processing on a target image to obtain a mask image of the target image, wherein the mask image is used for representing a hair area of the target image and a face area of the target image;
performing fusion processing on the mask image and the target image to obtain a fusion image;
inputting the fused image into an encoder for encoding processing to obtain an image encoding vector;
and inputting the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image, wherein the hair style generation model obtains a plurality of control vectors based on the image coding vector and controls a plurality of image features on the hair style transformation image according to the plurality of control vectors.
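A minimal end-to-end sketch of the claimed flow, assuming callable PyTorch-style components named segmenter, fuse, encoder and hair_style_generator (these names and interfaces are illustrative and not part of the claim):

```python
import torch

@torch.no_grad()
def transform_hair_style(target_image, segmenter, fuse, encoder, hair_style_generator):
    mask_image = segmenter(target_image)            # hair area and face area of the target image
    fused_image = fuse(mask_image, target_image)    # fusion processing of the mask image and target image
    image_code = encoder(fused_image)               # image coding vector
    return hair_style_generator(image_code)         # hair style transformation image
```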
2. The image processing method according to claim 1, wherein said inputting the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image comprises:
obtaining the plurality of control vectors according to the image coding vector;
respectively inputting the control vectors into a plurality of convolution networks of the hair style generation model to obtain a plurality of feature maps;
and performing fusion processing on the plurality of feature maps to obtain the hair style transformation image of the target image.
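A hedged sketch of the flow in claim 2, in which each control vector drives its own convolutional branch and the resulting feature maps are fused (the spatial size, channel counts and element-wise-sum fusion are assumptions for illustration):

```python
import torch
import torch.nn as nn

class ControlledGenerator(nn.Module):
    def __init__(self, code_dim=512, num_branches=4, channels=64):
        super().__init__()
        self.channels = channels
        self.to_spatial = nn.Linear(code_dim, channels * 4 * 4)
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(num_branches)]
        )
        self.to_rgb = nn.Conv2d(channels, 3, kernel_size=1)

    def forward(self, control_vectors):
        feature_maps = []
        for control_vector, branch in zip(control_vectors, self.branches):
            x = self.to_spatial(control_vector).view(-1, self.channels, 4, 4)  # control vector -> spatial tensor
            feature_maps.append(branch(x))                                     # one feature map per branch
        fused = torch.stack(feature_maps, dim=0).sum(dim=0)                    # assumed fusion: element-wise sum
        return self.to_rgb(fused)                                              # hair style transformation image
```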
3. The image processing method according to claim 1, further comprising:
inputting the image coding vector into a decoder for decoding and reconstruction processing to obtain a plurality of reconstructed images, wherein the hair styles in the reconstructed images are different from the hair style in the target image;
the inputting the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image includes:
inputting the image coding vector into the hair style generation model to obtain a plurality of output images, wherein the plurality of output images have different resolutions;
and performing fusion processing on the plurality of output images and the plurality of reconstructed images to obtain a hair style transformation image of the target image.
4. The image processing method according to claim 3, wherein performing fusion processing on the plurality of output images and the plurality of reconstructed images to obtain a hair style transformation image of the target image comprises:
inputting output images 1 to n into fusion modules 1 to n in a one-to-one correspondence, and inputting reconstructed images 1 to n into the fusion modules 1 to n in a one-to-one correspondence, wherein the fusion modules 1 to n are connected in sequence and n is the number of the output images;
performing fusion processing on the output image 1 and the reconstructed image 1 through the fusion module 1 to obtain a fusion image 1;
performing fusion processing on the output image x, the reconstructed image x and the up-sampled image y through the fusion module x to obtain a fusion image x, wherein x is a positive integer greater than 1 and not greater than n, the up-sampled image y is obtained by up-sampling the fusion image y output by the fusion module y, and y = x-1;
and when x is equal to n, obtaining the fusion image n output by the fusion module n, and taking the fusion image n as the hair style transformation image of the target image.
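The chained fusion of claims 3 and 4 might be sketched as follows (the 1x1-convolution fusion and bilinear up-sampling are illustrative choices; only the overall wiring of modules 1 to n follows the claim):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    # Fusion module x: fuses output image x, reconstructed image x and, for x > 1,
    # the up-sampled fusion image produced by fusion module x-1.
    def __init__(self, channels=3):
        super().__init__()
        self.fuse_first = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.fuse_rest = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, output_x, reconstructed_x, upsampled_prev=None):
        if upsampled_prev is None:                                   # fusion module 1
            return self.fuse_first(torch.cat([output_x, reconstructed_x], dim=1))
        return self.fuse_rest(torch.cat([output_x, reconstructed_x, upsampled_prev], dim=1))

def fuse_chain(output_images, reconstructed_images, fusion_modules):
    # Images 1..n enter fusion modules 1..n one-to-one; the fusion image of module n
    # is taken as the hair style transformation image.
    fused = fusion_modules[0](output_images[0], reconstructed_images[0])
    for x in range(1, len(fusion_modules)):
        upsampled = F.interpolate(fused, size=output_images[x].shape[-2:],
                                  mode="bilinear", align_corners=False)   # up-sample fusion image x-1
        fused = fusion_modules[x](output_images[x], reconstructed_images[x], upsampled)
    return fused
```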
5. The image processing method according to claim 3, wherein inputting the image coding vector into a decoder for decoding reconstruction processing to obtain a plurality of reconstructed images, comprises:
and inputting the image coding vectors into a plurality of deconvolution networks of the decoder to obtain a plurality of reconstructed images.
6. The image processing method of claim 5, wherein the training process of the decoder comprises:
inputting a sample image into the encoder, and obtaining an encoding vector of the sample image;
inputting the coding vector of the sample image into an initial decoder, training the initial decoder according to the loss between the output of the initial decoder and a label image, and obtaining the decoder, wherein the label image comprises a hair style different from that of the sample image.
7. The image processing method according to any of claims 1 to 5, wherein the training process of the encoder comprises:
inputting a sample image into an initial encoder, and training the initial encoder according to the loss between the output of the initial encoder and the hidden vector of the sample image to obtain the encoder.
8. The image processing method according to any one of claims 1 to 7, wherein performing face segmentation processing on a target image to obtain a mask image of the target image comprises:
inputting the target image into an auto-encoder, and obtaining a mask image of the target image, wherein a training sample of the auto-encoder is a face image, a label of the training sample is the mask image of the face image, and the mask image of the face image is used for representing a face region and a hairstyle region of the face image.
9. The image processing method of claim 8, wherein the training process of the auto-encoder comprises:
inputting the training samples into an initial auto-encoder, and training the initial auto-encoder according to the loss between the output of the initial auto-encoder and the labels of the training samples to obtain the auto-encoder.
10. The image processing method according to any one of claims 1 to 9, wherein performing fusion processing on the mask image and the target image to obtain a fused image comprises at least one of the following:
adding and fusing pixels of the mask image and corresponding pixels of the target image to obtain a fused image;
fusing the area of the mask image and the area of the target image by adopting an attention mechanism to obtain a fused image;
and adding and fusing the pixel color channel of the mask image and the corresponding pixel color channel of the target image to obtain the fused image.
11. An image processing apparatus characterized by comprising:
the segmentation unit is used for carrying out face segmentation processing on a target image to obtain a mask image of the target image, wherein the mask image is used for representing a hair area of the target image and a face area of the target image;
the fusion unit is used for carrying out fusion processing on the mask image and the target image to obtain a fusion image;
the encoding unit is used for inputting the fusion image into an encoder to carry out encoding processing to obtain an image encoding vector;
and the generating unit is used for inputting the image coding vector into a hair style generation model to obtain a hair style transformation image of the target image, wherein the hair style generation model obtains a plurality of control vectors based on the image coding vector and controls a plurality of image features on the hair style transformation image according to the plurality of control vectors.
12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the image processing method according to any of claims 1-10.
13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the image processing method according to any one of claims 1 to 10.
CN202210672542.8A 2022-06-14 2022-06-14 Image processing method, apparatus, device and medium Pending CN115115560A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210672542.8A CN115115560A (en) 2022-06-14 2022-06-14 Image processing method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210672542.8A CN115115560A (en) 2022-06-14 2022-06-14 Image processing method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
CN115115560A true CN115115560A (en) 2022-09-27

Family

ID=83327853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210672542.8A Pending CN115115560A (en) 2022-06-14 2022-06-14 Image processing method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN115115560A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030201A (en) * 2023-03-28 2023-04-28 美众(天津)科技有限公司 Method, device, terminal and storage medium for generating multi-color hairstyle demonstration image

Similar Documents

Publication Publication Date Title
US11625613B2 (en) Generative adversarial neural network assisted compression and broadcast
CN110197229B (en) Training method and device of image processing model and storage medium
Yang et al. Towards coding for human and machine vision: Scalable face image coding
KR20220029335A (en) Method and apparatus to complement the depth image
CN111901598B (en) Video decoding and encoding method, device, medium and electronic equipment
US20210397945A1 (en) Deep hierarchical variational autoencoder
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN112233012B (en) Face generation system and method
JP2022532669A (en) Methods and equipment for identifying videos
CN116958534A (en) Image processing method, training method of image processing model and related device
CN116569218A (en) Image processing method and image processing apparatus
CN114648787A (en) Face image processing method and related equipment
CN115115560A (en) Image processing method, apparatus, device and medium
CN113379606B (en) Face super-resolution method based on pre-training generation model
CN117036436A (en) Monocular depth estimation method and system based on double encoder-decoder
CN116739950A (en) Image restoration method and device, terminal equipment and storage medium
US11948245B2 (en) Relighting images and video using learned lighting and geometry
CN113591838B (en) Target detection method, device, electronic equipment and storage medium
CN115375909A (en) Image processing method and device
CN114299105A (en) Image processing method, image processing device, computer equipment and storage medium
CN114399708A (en) Video motion migration deep learning system and method
CN112102461A (en) Face rendering method and device, electronic equipment and storage medium
WO2024007968A1 (en) Methods and system for generating an image of a human
KR102656674B1 (en) Method and apparatus for transforming input image based on target style and target corlor information
CN116228895B (en) Video generation method, deep learning model training method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40072621

Country of ref document: HK