CN116012501A - Image generation method based on style content self-adaptive normalized posture guidance - Google Patents

Info

Publication number
CN116012501A
CN116012501A (application CN202211590853.6A)
Authority
CN
China
Prior art keywords
image
layer
style
target
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211590853.6A
Other languages
Chinese (zh)
Inventor
魏巍 (Wei Wei)
杨霞 (Yang Xia)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University
Priority to CN202211590853.6A
Publication of CN116012501A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an image generation method based on style content self-adaptive normalized posture guidance, belonging to the technical field of image synthesis. Person images are input, a source image and a target image are selected from them, and a target person image is generated that has the same style as the source image and the same pose as the target image. First, pose information is transferred in advance by an aligned multi-scale content transfer network to predict the target edge map, which both preserves texture content and alleviates spatial misalignment. Second, a style-texture transfer network progressively transfers the source style features onto the target pose and arranges them reasonably: a style adaptive normalization generator maps the source style, the target pose and the edges into the same latent space, and adaptively adjusting the source style and target pose enhances the consistency of style texture and content, so that the source style features enhance the texture details of the generated target.

Description

Image generation method based on style content self-adaptive normalized posture guidance
Technical Field
The invention belongs to the technical field of image synthesis, and particularly relates to an image generation method based on style content self-adaptive normalized posture guidance.
Background
Pose-guided person image transformation is an image generation task that synthesizes an arbitrary target pose conditioned on a source image of a person. The topic has many potential applications, such as video generation and virtual try-on. In addition, as deep-learning research on human behavior advances, the demand for human behavior data has grown greatly, so human pose transfer can supply corresponding data for this research and provide a large amount of data for further studies of human behavior.
In recent years, converting a source image into a target pose with conditional GANs has achieved notable success. These methods build on a conditional GAN, insert several repeated modules, and learn the correspondence between poses with a neural network, recombining the source image features into an image in the target pose. However, they cannot preserve the relationship between the source style and its spatial context, and they struggle to predict a clear and reasonable target image. To address this, flow-based methods predict positional offsets between source and target and guide the source features to warp into a reasonable target pose, obtaining more accurate and realistic textures; however, large changes between the source and target poses cause obvious artifacts. To alleviate the misalignment caused by large pose changes, some methods introduce human parsing maps that provide semantic correspondence with the target pose and synthesize a target image close to the source style. Although these methods synthesize more satisfactory person images, they still fail to generate realistic texture details.
Disclosure of Invention
To solve the problems in the prior art, the invention provides an image generation method based on style content self-adaptive normalized posture guidance, which aims to improve the accuracy of pose transfer and the realism of the person's appearance, effectively synthesizes vivid person images, and, while ensuring image quality, reduces training time and speeds up convergence.
The technical solution adopted by the invention to solve the technical problem is as follows: an image generation method based on style content self-adaptive normalized posture guidance takes person images as input, selects a source image and a target image from them, and generates a target person image with the same style as the source image and the same pose as the target image, specifically comprising the following steps:
S1: detecting human body key points in the person images to obtain pose heat maps;
S2: extracting the edge map information of the human body in the person images to obtain edge maps;
S3: randomly selecting two images from the person images as the source image and the target image, and predicting the edge map of the target image with an aligned multi-scale content transfer network from the obtained edge maps and pose heat maps;
S4: inputting the pose heat maps of the source image and the target image into an optical flow estimation model to obtain the optical flow map and occlusion mask between the source image and the target image;
S5: inputting the optical flow map, the occlusion mask, the target pose heat map and the source image into a local attention model to obtain a coarse target person image;
S6: inputting the coarse target person image, the target edge map and the source image into a style adaptive normalization generator to obtain the final target person image in the transferred pose.
Further, in step S1, an 18-channel pose heat map of the person image is estimated with the OpenPose method. It contains 18 key points, each represented by one channel: nose, neck, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist, left hip, left knee, left ankle, right hip, right knee, right ankle, left eye, right eye, left ear and right ear. The key points are connected to form the skeletal structure of the human body.
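As an illustration of the 18-channel representation described above, the following NumPy sketch renders keypoint coordinates into per-joint Gaussian heat-map channels. This is a hypothetical reconstruction, not the patented implementation; the map size, the Gaussian width `sigma` and the sample coordinates are assumed values (OpenPose itself only supplies the keypoint coordinates):

```python
import numpy as np

def render_pose_heatmap(keypoints, size=64, sigma=2.0):
    """Render K keypoints (x, y) into a K-channel Gaussian heat map.

    keypoints: list of (x, y) pixel coordinates, or None for an
    undetected joint (its channel stays all-zero).
    """
    k = len(keypoints)
    heatmap = np.zeros((k, size, size), dtype=np.float32)
    ys, xs = np.mgrid[0:size, 0:size]
    for c, pt in enumerate(keypoints):
        if pt is None:
            continue
        x, y = pt
        # One Gaussian bump per joint, peak value 1 at the joint location.
        heatmap[c] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap

# 18 joints in the OpenPose-style order listed above; only three are set here.
pts = [None] * 18
pts[0] = (32, 10)   # nose
pts[1] = (32, 20)   # neck
pts[14] = (28, 8)   # left eye
hm = render_pose_heatmap(pts)
```

One channel per joint keeps the joints separable, which is what lets the later networks compute per-joint correspondences between source and target poses.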
Further, in step S2, the edge map information of the person image is extracted with an extended difference-of-Gaussians (DoG) edge detection method, yielding a black-and-white grayscale source edge map of the human body in the person image.
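The difference-of-Gaussians idea behind this step can be sketched as follows: blur the image at two scales, subtract, and threshold to a black-and-white edge map. This is a minimal illustration with assumed `sigma`, `k` and `threshold` values, not the exact edge extractor used by the invention:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur implemented with 1-D convolutions."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    blurred = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, img)
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, blurred)

def dog_edge_map(img, sigma=1.0, k=1.6, threshold=0.01):
    """Difference of Gaussians: subtract a wide blur from a narrow one,
    then threshold to a black-and-white edge map (1 = edge pixel)."""
    dog = gaussian_blur(img, sigma) - gaussian_blur(img, k * sigma)
    return (np.abs(dog) > threshold).astype(np.float32)

# A synthetic image with a vertical step edge starting at column 16.
img = np.zeros((32, 32), dtype=np.float32)
img[:, 16:] = 1.0
edges = dog_edge_map(img)
```

The black-and-white contrast of the resulting map is what the description later uses to highlight texture details.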
Further, the aligned multi-scale content transfer network in step S3 consists of an aligned multi-scale transfer decoder and three encoders. Each encoder consists of a downsampling layer, an instance normalization layer, an activation layer and a residual block. The aligned multi-scale transfer decoder consists of a deconvolution layer, an instance normalization layer, an activation function and a residual block; the deconvolution layer uses a 4×4 kernel with stride 2 and padding 1. The source edge map, the source pose heat map and the target pose heat map are each fed into an encoder to obtain feature maps with 256 channels and size 32×32, which are decoded after attention computation to obtain the target edge map.
Further, the downsampling layer of the encoder uses a 4×4 kernel with stride 2 and padding 1. The residual block consists of two convolution layers, two instance normalization layers and an activation layer; the convolution layers use a 3×3 kernel with stride 1 and padding 1. Each convolution layer is followed by an instance normalization layer, and a ReLU activation function is added after the first instance normalization layer.
Further, in step S4, the optical flow estimation model consists of an encoder and a decoder. The encoder consists of downsampling layers and convolution layers, each preceded by an instance normalization layer and an activation function layer; the downsampling layers use a 4×4 kernel with stride 2, and the convolution layers use a 3×3 kernel with stride 1. After the source image, the source pose heat map and the target pose heat map are concatenated along the channel dimension, the encoder produces a feature map with 256 channels and size 32×32, and the decoder outputs a two-channel optical flow map (a 2D flow field) and a one-channel occlusion mask.
Further, in step S5, pairs of local feature patches are extracted from the source and target images according to the flow field, a context-aware sampling kernel is computed with a kernel prediction network, and the source features are then sampled to obtain the warped result at each sampling position. The local feature patch pairs are extracted with a 3×3 kernel; the kernel prediction network consists of a convolution layer, an activation layer and a softmax, yielding the local correlation between source and target and guiding the deformation of the local source features.
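A minimal sketch of the kernel-prediction sampling described in this step: for each target position, an n×n source patch is gathered at the flow-indicated location and combined with softmax-normalized predicted weights. The logits below are hand-set stand-ins for the output of the kernel prediction network, and integer flow offsets are assumed for simplicity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention_sample(source, flow, weight_logits, n=3):
    """For each target pixel, gather an n*n source patch at the position
    indicated by the flow field and combine it with predicted weights.

    source:        (H, W) feature map
    flow:          (H, W, 2) integer offsets (dy, dx) from target to source
    weight_logits: (H, W, n*n) raw kernel-prediction outputs
    """
    h, w = source.shape
    r = n // 2
    padded = np.pad(source, r)            # zero-pad so border patches exist
    out = np.zeros_like(source)
    for y in range(h):
        for x in range(w):
            sy = int(np.clip(y + flow[y, x, 0], 0, h - 1))
            sx = int(np.clip(x + flow[y, x, 1], 0, w - 1))
            patch = padded[sy:sy + n, sx:sx + n]   # n*n patch centred at (sy, sx)
            weights = softmax(weight_logits[y, x]).reshape(n, n)
            out[y, x] = (patch * weights).sum()
    return out

# With a uniform flow of (0, +2), each output pixel samples 2 columns right.
src = np.arange(64, dtype=np.float32).reshape(8, 8)
flow = np.zeros((8, 8, 2)); flow[..., 1] = 2
logits = np.zeros((8, 8, 9))
logits[..., 4] = 10.0                     # concentrate weight on the centre tap
warped = local_attention_sample(src, flow, logits)
```

Because the weights are predicted per position, the sampling kernel can adapt to local context rather than applying one fixed interpolation everywhere.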
Further, the style adaptive normalization generator in step S6 consists of a pose encoder, a style encoder, a residual block, a style adaptive normalization module and a residual decoder. The pose encoder consists of a downsampling layer with a 4×4 kernel and stride 2 and a convolution layer with a 3×3 kernel and stride 1, each preceded by an instance normalization layer and an activation function layer. The style encoder consists of a downsampling layer with a 4×4 kernel and stride 2 and a convolution layer with a 3×3 kernel and stride 1 combined with self-attention, each preceded by an instance normalization layer and an activation function layer. The residual block consists of two convolution layers, each composed of an activation layer, an instance normalization layer and a convolution layer with a 3×3 kernel. The residual decoder consists of a convolution layer with a 3×3 kernel and a transposed convolution layer with a 4×4 kernel. The style adaptive normalization module consists of three region-adaptive normalization layers, each of which modulates the input features with two spatially adaptive normalization parameters.
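The region-adaptive normalization layers can be sketched as instance normalization followed by a spatially varying scale and shift, in the spirit of SPADE-style spatially adaptive normalization. The random gamma/beta maps below are placeholders for the modulation parameters that the invention derives from the source image and target edge map:

```python
import numpy as np

def region_adaptive_norm(feat, gamma_map, beta_map, eps=1e-5):
    """Instance-normalize a (C, H, W) feature map, then modulate it with
    spatially varying scale (gamma) and shift (beta) maps."""
    mean = feat.mean(axis=(1, 2), keepdims=True)    # per-channel statistics
    std = feat.std(axis=(1, 2), keepdims=True)
    normalized = (feat - mean) / (std + eps)
    return normalized * (1 + gamma_map) + beta_map

c, h, w = 4, 8, 8
rng = np.random.default_rng(0)
feat = rng.normal(size=(c, h, w))
# Stand-ins for the two spatially adaptive parameters of each layer.
gamma = rng.normal(scale=0.1, size=(c, h, w))
beta = rng.normal(scale=0.1, size=(c, h, w))
out = region_adaptive_norm(feat, gamma, beta)
```

Because gamma and beta vary per pixel, different regions of the target (for example, areas inside versus outside the edge map) can receive different style modulation, which is the point of making the normalization "region adaptive".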
The beneficial effects of the invention include: 1) using the edge map as an additional constraint on the pose heat map alleviates the lack of content information, guiding the network to enhance texture details and generate more realistic person images.
2) The style adaptive normalization generator explicitly distributes the style features of the source image onto the target pose, injecting the source style into the features layer by layer and preserving realistic texture information.
3) A new aligned multi-scale content transfer network is presented that warps and reasonably recombines the input data at the feature level, not only generating new content but also speeding up the convergence of the network.
Drawings
FIG. 1 is a flow chart of the overall method of the present invention;
FIG. 2 is a diagram of the overall model structure of the present invention.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The invention provides an image generation method based on style content self-adaptive normalized posture guidance. First, pose information is transferred in advance by an aligned multi-scale content transfer network to predict the target edge map, which both preserves texture content and alleviates spatial misalignment. Second, a style-texture transfer network progressively transfers the source style features onto the target pose and arranges them reasonably: a style adaptive normalization generator maps the source style, the target pose and the edges into the same latent space, and adaptively adjusting the source style and target pose enhances the consistency of style texture and content, so that the source style features enhance the texture details of the generated target.
Example 1
The image generation method based on style content self-adaptive normalized posture guidance takes person images as input, selects a source image and a target image from them, and generates a target person image with the same style as the source image and the same pose as the target image.
As shown in FIG. 1, the coordinates of the person's key points, i.e. the pose heat map, are first extracted from the person image, and then the edge map of the person image is extracted. A source image and a target image are selected from the person images, and a new target edge map is generated from the pose heat map and edge map of the source image and the pose heat map of the target image. The correspondence between the poses is computed from the key-point coordinates of the source and target images, and the corresponding optical flow map and occlusion mask are output. A coarse target person image is generated from the optical flow map, the occlusion mask, the source image and the target pose heat map. Finally, the coarse target person image is refined with the target pose heat map, the generated target edge map and the source image to obtain a vivid target person image. Specifically:
1) The pose heat maps and edge maps of all person images in the training set are extracted. Eighteen key points are extracted from each person image with the OpenPose method, each representing a joint of the human body. The edge information of the person image is extracted with the extended difference-of-Gaussians edge detection method, highlighting the texture details of the image through black-and-white contrast. A pair of images is selected from the training set as the source image and the target image.
2) The obtained key points of the person image are rendered in different colors to obtain the pose heat map. The nose, neck, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist, left hip, left knee, left ankle, right hip, right knee, right ankle, left eye, right eye, left ear and right ear each have a different color and are linked with corresponding lines to form a diagram approximating the skeletal structure of the human body.
3) As shown in FIG. 2, feature information is extracted from the source pose heat map, the source edge map and the target pose heat map at a resolution of 256×256. An input layer produces a feature map with 64 channels at 128×128; a downsampling layer with a 4×4 kernel and stride 2 followed by a convolution layer with a 3×3 kernel and stride 1 produces a feature map with 128 channels at 64×64; another downsampling and convolution layer produces a feature map with 256 channels at 32×32. The 32×32 feature maps of the source pose and the target pose are combined by weighted summation to obtain a relation matrix, which is applied to the equally sized source edge feature map to obtain a coarse target edge feature map; this is then added pixel-wise to the source edge feature map. The result is fed into a deconvolution layer with a 4×4 kernel and stride 2 to obtain a target edge feature map with 128 channels at 64×64. The same operation is performed on the 64×64 feature maps of the source and target poses to obtain a target edge feature map with 64 channels at 128×128, which passes through one more deconvolution layer, five residual convolution layers with 3×3 kernels and stride 1, and an output layer, and the target edge map is produced through a Tanh function.
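The relation-matrix step above can be sketched as soft attention between flattened pose features, applied to the source edge features. The exact combination in the description is ambiguous, so this version matrix-multiplies the relation matrix with the source edge features and then adds the source edge features pixel-wise; the 1/sqrt(C) scaling is an added assumption borrowed from standard attention:

```python
import numpy as np

def softmax_rows(m):
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def transfer_edge_features(src_pose, tgt_pose, src_edge):
    """Build a relation matrix between source and target pose features and
    use it to rearrange the source edge features into the target pose.

    All inputs are (C, H, W); features are flattened to (H*W, C).
    """
    c, h, w = src_pose.shape
    s = src_pose.reshape(c, h * w).T                 # (HW, C) source pose
    t = tgt_pose.reshape(c, h * w).T                 # (HW, C) target pose
    relation = softmax_rows(t @ s.T / np.sqrt(c))    # (HW, HW) relation matrix
    e = src_edge.reshape(c, h * w).T                 # (HW, C) source edges
    coarse = relation @ e                            # coarse target edge features
    # pixel-wise addition of the source edge features, as in the description
    return (coarse + e).T.reshape(c, h, w)

c, h, w = 16, 8, 8
rng = np.random.default_rng(1)
out = transfer_edge_features(rng.normal(size=(c, h, w)),
                             rng.normal(size=(c, h, w)),
                             rng.normal(size=(c, h, w)))
```

Each row of the relation matrix says which source positions a given target position should borrow edge content from, which is what lets the decoder assemble edges in the new pose.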
4) The source image, the source pose heat map and the target pose heat map are concatenated along the channel dimension into a feature map with 39 channels at a resolution of 256×256. A downsampling layer with a 4×4 kernel and stride 2 followed by a convolution layer with a 3×3 kernel and stride 1 yields a feature map with 32 channels at 128×128; repeating this operation yields feature maps with 64 channels at 64×64, 128 channels at 32×32, 256 channels at 16×16 and 256 channels at 8×8.
The 256-channel 8×8 feature map passes through a residual convolution layer with a 3×3 kernel and stride 1 and then a deconvolution layer, and is added to the 16×16 feature map; convolution layers with 3×3 kernels and stride 1 then output, at 16×16, flow field information with 2 channels and a 1-channel map that is passed through a sigmoid function to obtain the occlusion mask. The 256-channel 16×16 feature map passes through a deconvolution layer with a 3×3 kernel and stride 2 to obtain a 32×32 feature map, which is added to the equally sized feature map obtained during downsampling; deconvolution is performed once more to output the 64×64 flow field and occlusion mask.
5) The pose features of the source image and the target image are extracted with downsampling layers using 4×4 kernels and stride 2 to obtain feature maps of sizes 64×64 and 32×32. The source feature map is warped by bilinear interpolation; the warped source feature map and the target pose feature map are concatenated along the channel dimension, and a convolution layer with a 3×3 kernel and stride 1 followed by a Softmax function yields an attention matrix. The warped source feature map is weighted and summed with the attention matrix and then average-pooled.
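The bilinear warping used to deform the source features toward the target pose can be sketched as follows for a single-channel feature map; the occlusion-mask blending shown in the trailing comment is how the mask from step S4 would typically be applied, stated here as an assumption rather than a detail spelled out in the description:

```python
import numpy as np

def bilinear_warp(feat, flow):
    """Warp a (H, W) feature map with a (H, W, 2) flow of (dy, dx)
    offsets, sampling feat[y + dy, x + dx] by bilinear interpolation."""
    h, w = feat.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    sy = np.clip(ys + flow[..., 0], 0, h - 1)
    sx = np.clip(xs + flow[..., 1], 0, w - 1)
    y0 = np.floor(sy).astype(int); x0 = np.floor(sx).astype(int)
    y1 = np.minimum(y0 + 1, h - 1); x1 = np.minimum(x0 + 1, w - 1)
    wy = sy - y0; wx = sx - x0
    return (feat[y0, x0] * (1 - wy) * (1 - wx)
            + feat[y0, x1] * (1 - wy) * wx
            + feat[y1, x0] * wy * (1 - wx)
            + feat[y1, x1] * wy * wx)

# A uniform half-pixel shift to the right averages neighbouring columns.
feat = np.tile(np.arange(8, dtype=np.float64), (8, 1))
flow = np.zeros((8, 8, 2)); flow[..., 1] = 0.5
warped = bilinear_warp(feat, flow)

# Assumed use of the occlusion mask m from step S4 (values in [0, 1]):
#   out = m * warped + (1 - m) * generated_fallback
```

Bilinear interpolation keeps the warp differentiable in the flow, which is what allows the optical flow estimation model to be trained end to end.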
6) The coarse target person image is taken as input and passed through three residual convolution layers, and its appearance and content are modulated in the region-adaptive normalization layers using the source image, the target image and the target edge map. Convolution layers with 3×3 kernels and stride 1 derive modulation parameters from the source image and target edge map at sizes 32×32, 64×64 and 128×128; together with modulation parameters from the source image and target pose heat map, the coarse target image features are multiplied and added with these parameters to obtain the modulated target features. Finally, three transposed convolution layers with 3×3 kernels, an output layer with a 1×1 kernel and stride 1, and a Tanh function produce the final target person image.
In summary, the image generation method based on style content self-adaptive normalized posture guidance disclosed by the invention: 1) proposes a new two-stage network that decouples style and content, aiming to improve the accuracy of pose transfer and the realism of the person's appearance; 2) predicts the target edge map with the aligned multi-scale content transfer network, transferring pose information in advance, which both preserves texture content and alleviates spatial misalignment; 3) progressively transfers the source style features onto the target pose with the style-texture transfer network and arranges them reasonably, with the style adaptive normalization generator mapping the source style, target pose and edges into the same latent space, and adaptively adjusting the source style and target pose to enhance the consistency of style texture and content, so that the source style features enhance the texture details of the generated target. The invention generates person images consistent with the target pose while preserving the style texture of the source image, reduces training difficulty and accelerates model convergence.
It is apparent that the above embodiments are given by way of illustration only and are not limiting. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art; it is neither necessary nor possible to list all embodiments exhaustively. Any obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (9)

1. An image generation method based on style content self-adaptive normalized posture guidance, characterized in that person images are input, a source image and a target image are selected from the person images, and a target person image with the same style as the source image and the same pose as the target image is generated, specifically comprising the following steps:
S1: detecting human body key points in the person images to obtain pose heat maps;
S2: extracting the edge map information of the human body in the person images to obtain edge maps;
S3: randomly selecting two images from the person images as the source image and the target image, and predicting the edge map of the target image with an aligned multi-scale content transfer network from the obtained edge maps and pose heat maps;
S4: inputting the pose heat maps of the source image and the target image into an optical flow estimation model to obtain the optical flow map and occlusion mask between the source image and the target image;
S5: inputting the optical flow map, the occlusion mask, the target pose heat map and the source image into a local attention model to obtain a coarse target person image;
S6: inputting the coarse target person image, the target edge map and the source image into a style adaptive normalization generator to obtain the final target person image in the transferred pose.
2. The image generation method based on style content self-adaptive normalized posture guidance according to claim 1, wherein in step S1 an 18-channel pose heat map of the person image is estimated with the OpenPose method, containing 18 key points, each represented by one channel: nose, neck, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist, left hip, left knee, left ankle, right hip, right knee, right ankle, left eye, right eye, left ear and right ear; the key points are connected to form the skeletal structure of the human body.
3. The image generation method based on style content self-adaptive normalized posture guidance according to claim 1, wherein in step S2 the edge map information of the person image is extracted with an extended difference-of-Gaussians edge detection method, yielding a black-and-white grayscale source edge map of the human body in the person image.
4. The image generation method based on style content self-adaptive normalized posture guidance according to claim 1, wherein the aligned multi-scale content transfer network in step S3 consists of an aligned multi-scale transfer decoder and three encoders; each encoder consists of a downsampling layer, an instance normalization layer, an activation layer and a residual block; the aligned multi-scale transfer decoder consists of a deconvolution layer, an instance normalization layer, an activation function and a residual block, the deconvolution layer using a 4×4 kernel with stride 2 and padding 1; and the source edge map, the source pose heat map and the target pose heat map are each fed into an encoder to obtain feature maps with 256 channels and size 32×32, which are decoded after attention computation to obtain the target edge map.
5. The image generation method based on style content self-adaptive normalized posture guidance according to claim 4, wherein the downsampling layer of the encoder uses a 4×4 kernel with stride 2 and padding 1; the residual block consists of two convolution layers, two instance normalization layers and an activation layer, the convolution layers using a 3×3 kernel with stride 1 and padding 1; each convolution layer is followed by an instance normalization layer, and a ReLU activation function is added after the first instance normalization layer.
6. The image generation method based on style content self-adaptive normalized posture guidance according to claim 1, wherein in step S4 the optical flow estimation model consists of an encoder and a decoder; the encoder consists of downsampling layers and convolution layers, each preceded by an instance normalization layer and an activation function layer, the downsampling layers using a 4×4 kernel with stride 2 and the convolution layers using a 3×3 kernel with stride 1; and after the source image, the source pose heat map and the target pose heat map are concatenated along the channel dimension, the encoder produces a feature map with 256 channels and size 32×32, and the decoder outputs a two-channel optical flow map (a 2D flow field) and a one-channel occlusion mask.
7. The image generation method based on style content self-adaptive normalized posture guidance according to claim 6, wherein in step S5 pairs of local feature patches are extracted from the source and target images according to the flow field, a context-aware sampling kernel is computed with a kernel prediction network, and the source features are then sampled to obtain the warped result at each sampling position.
8. The image generation method based on style content self-adaptive normalized posture guidance according to claim 7, wherein the local feature patch pairs are extracted with a 3×3 kernel, and the kernel prediction network consists of a convolution layer, an activation layer and a softmax.
9. The image generation method based on style content self-adaptive normalized posture guidance according to claim 1, wherein the style adaptive normalization generator in step S6 consists of a pose encoder, a style encoder, a residual block, a style adaptive normalization module and a residual decoder; the pose encoder consists of a downsampling layer with a 4×4 kernel and stride 2 and a convolution layer with a 3×3 kernel and stride 1, each preceded by an instance normalization layer and an activation function layer; the style encoder consists of a downsampling layer with a 4×4 kernel and stride 2 and a convolution layer with a 3×3 kernel and stride 1 combined with self-attention, each preceded by an instance normalization layer and an activation function layer; the residual block consists of two convolution layers, each composed of an activation layer, an instance normalization layer and a convolution layer with a 3×3 kernel; the residual decoder consists of a convolution layer with a 3×3 kernel and a transposed convolution layer with a 4×4 kernel; and the style adaptive normalization module consists of three region-adaptive normalization layers, each of which modulates the input features with two spatially adaptive normalization parameters.
CN202211590853.6A 2022-12-12 2022-12-12 Image generation method based on style content self-adaptive normalized posture guidance Pending CN116012501A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211590853.6A CN116012501A (en) 2022-12-12 2022-12-12 Image generation method based on style content self-adaptive normalized posture guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211590853.6A CN116012501A (en) 2022-12-12 2022-12-12 Image generation method based on style content self-adaptive normalized posture guidance

Publications (1)

Publication Number Publication Date
CN116012501A true CN116012501A (en) 2023-04-25

Family

ID=86018309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211590853.6A Pending CN116012501A (en) 2022-12-12 2022-12-12 Image generation method based on style content self-adaptive normalized posture guidance

Country Status (1)

Country Link
CN (1) CN116012501A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934591A (en) * 2023-06-28 2023-10-24 深圳市碧云祥电子有限公司 Image stitching method, device and equipment for multi-scale feature extraction and storage medium
CN117993480A (en) * 2024-04-02 2024-05-07 湖南大学 AIGC federal learning method for designer style fusion and privacy protection


Similar Documents

Publication Publication Date Title
Ren et al. Low-light image enhancement via a deep hybrid network
CN116012501A (en) Image generation method based on style content self-adaptive normalized posture guidance
CN108875935B (en) Natural image target material visual characteristic mapping method based on generation countermeasure network
CN108830913B (en) Semantic level line draft coloring method based on user color guidance
CN110097609B (en) Sample domain-based refined embroidery texture migration method
Wang et al. TMS-GAN: A twofold multi-scale generative adversarial network for single image dehazing
CN113724354B (en) Gray image coloring method based on reference picture color style
CN112258269A (en) Virtual fitting method and device based on 2D image
US11403800B1 (en) Image generation from 3D model using neural network
Sun et al. Underwater image enhancement with reinforcement learning
CN107730568B (en) Coloring method and device based on weight learning
CN116309232B (en) Underwater image enhancement method combining physical priori with deep learning
CN110853119A (en) Robust reference picture-based makeup migration method
CN117011207A (en) Virtual fitting method based on diffusion model
CN115239861A (en) Face data enhancement method and device, computer equipment and storage medium
CN111833360A (en) Image processing method, device, equipment and computer readable storage medium
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN117635771A (en) Scene text editing method and device based on semi-supervised contrast learning
CN116403063A (en) No-reference screen content image quality assessment method based on multi-region feature fusion
CN113436058B (en) Character virtual clothes changing method, terminal equipment and storage medium
CN111768326A (en) High-capacity data protection method based on GAN amplification image foreground object
CN116266251A (en) Sketch generation countermeasure network, rendering generation countermeasure network and clothes design method thereof
CN117593178A (en) Virtual fitting method based on feature guidance
CN113538254A (en) Image restoration method and device, electronic equipment and computer readable storage medium
CN116863044A (en) Face model generation method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination