CN113592971B

CN113592971B - Virtual human body image generation method, system, equipment and medium

Info

Publication number: CN113592971B
Application number: CN202110865481.2A
Authority: CN
Inventors: 王乐; 师皓玥; 周三平; 陈仕韬; 辛景民; 郑南宁
Original assignee: Ningbo Shun'an Artificial Intelligence Research Institute; Xian Jiaotong University
Current assignee: Ningbo Shun'an Artificial Intelligence Research Institute; Xian Jiaotong University
Priority date: 2021-07-29
Filing date: 2021-07-29
Publication date: 2024-04-16
Anticipated expiration: 2041-07-29
Also published as: CN113592971A

Abstract

The invention discloses a virtual human body image generation method, system, device and medium. The method comprises: inputting a source human body image and a target posture image into a pre-trained virtual human body image generation network to obtain a human body image in the target posture. The virtual human body image generation network is a convolutional neural network comprising: an encoder, which takes the source human body image and the target posture image as input and encodes them into source human body features and target human body features; a structure-based appearance generation module, which takes the source human body features and the target human body features as input and updates them to obtain updated source human body features and target human body features; and a decoder, which takes the target human body features output by the structure-based appearance generation module as input and decodes them into the human body image in the target posture. Using this posture-guided virtual human body image generation network based on human body structure, the invention can generate realistic human body images with the correct target posture.

Description

Virtual human body image generation method, system, equipment and medium

Technical Field

The invention belongs to the intersection of computer vision and computer graphics, and particularly relates to a virtual human body image generation method, system, device and medium.

Background

The goal of posture-guided virtual human body image generation is, given a source human body image and a target posture image, to generate a new human body image in the target posture, such that the human body posture of the generated image is consistent with the target posture and its appearance is similar to that of the source human body image. The task has many application scenarios, such as film production, virtual reality, and data augmentation for action recognition.

At present, posture-guided virtual human body image generation methods mainly have the following problems and defects:

Generating a virtual human body image in a target posture requires considering both the posture consistency and the appearance consistency of the generated image. When the target posture differs greatly from the posture in the source human body image, or the target posture is complex, existing methods generally consider how to deform the source human body image to obtain a human body image in the target posture. Because the effectiveness of the deformation cannot be guaranteed, the human body posture of the generated image is often blurred or even inconsistent with the target posture, so the quality of the generated human body image in the specified posture is low.

In summary, a new posture-guided virtual human body image generation method, system, device and medium based on human body structure is needed.

Disclosure of Invention

The invention aims to provide a virtual human body image generation method, system, device and medium, so as to solve one or more of the above technical problems. Using a posture-guided virtual human body image generation network based on human body structure, the invention can generate realistic human body images with the correct target posture.

In order to achieve the above purpose, the invention adopts the following technical scheme:

The invention discloses a virtual human body image generation method, which comprises the following steps:

inputting a source human body image and a target posture image into a pre-trained virtual human body image generation network, and obtaining the target-posture human body image output by the network;

wherein the virtual human body image generation network is a convolutional neural network comprising:

an encoder, configured to take the source human body image and the target posture image as input and encode them into source human body features and target human body features;

a structure-based appearance generation module, configured to take the source human body features and the target human body features as input and update them to obtain updated source human body features and target human body features;

and a decoder, configured to take the target human body features output by the structure-based appearance generation module as input and decode them into the target-posture human body image.

In a further improvement of the invention, the trained virtual human body image generation network is obtained by the following steps:

acquiring a sample data set; wherein each sample data in the sample data set includes source human body image sample data, target human body image sample data, source human body posture sample data, and target human body posture sample data;

inputting source human body image sample data, source human body posture sample data and target human body posture sample data in selected sample data of the sample data set into the virtual human body image generation network to obtain virtual target human body image data; constructing a loss function based on the virtual target human body image data and target human body image sample data in the selected sample data, and performing iterative optimization on the virtual human body image generation network;

and after the preset iteration times or convergence conditions are reached, obtaining a trained virtual human body image generation network.

In a further improvement of the invention, the structure-based appearance generation module includes:

a structure-aware adaptive normalization module, configured to take the source human body features and the target human body features as input and to generate and output stylized target posture features;

a feature enhancement module, configured to take the generated stylized target posture features and the source human body features as input and to output the updated source human body features and target human body features.

In a further improvement of the invention, in the sample data set, the source human body posture sample data and the target human body posture sample data in each sample are acquired as follows: posture estimation is performed on the human body images using the OpenPose posture estimation method to obtain coordinate sequences of 18 human body joint points; the posture of the source human body image $I_s$ is expressed as $P(I_s) = \{p_1, \dots, p_K\}$, $K = 18$, and the posture of the target human body image $I_t$ is expressed as $P(I_t) = \{p_1, \dots, p_K\}$, $K = 18$; based on the obtained human body joint point coordinate sequences, the human body posture information is represented by $K$ heat maps, the source human body posture information being denoted $P_s$ and the target human body posture information being denoted $P_t$.

In the encoder, the step of obtaining the source human body features and the target human body features by encoding specifically comprises: encoding the target posture information $P_t$ into the target human body feature $C_t$ with 2 downsampling convolutional layers; and encoding the source human body image $I_s$ and its posture information $P_s$ into the source human body feature $C_s$ with 2 downsampling convolutional layers.

In a further improvement of the invention, in the structure-based appearance generation module, the step of taking the source human body features and the target human body features as input and updating them to obtain the updated source human body features and target human body features comprises:

dividing the human body image into a plurality of human body parts and 1 background part based on the obtained human body joint point coordinate sequence, giving masks for the $L$ parts; the part masks of the source human body image are denoted $M_s$, and the part masks of the target human body image are denoted $M_t$;

convolving the target human body feature $C_t$ with two convolution layers to obtain the target human body feature $F_t$, and convolving the source human body feature $C_s$ with two convolution layers to obtain the source human body feature $F_s$; generating a style vector $V_{sty}$ from the source human body feature $F_s$ and the part masks $M_s$ of the source human body image, where each element $V_{sty}^{l}$ of $V_{sty}$ is a $C$-dimensional vector representing the features of the corresponding part of the source human body image; the style vector $V_{sty}^{l}$ of the $l$-th part is obtained by mean pooling:

$V_{sty}^{l} = \mathrm{Pool}\left(F_s \odot \mathrm{Resize}(M_s^{l})\right)$

where $\mathrm{Resize}(\cdot)$ denotes a scaling operation; $\odot$ denotes element-by-element multiplication; and $\mathrm{Pool}(\cdot)$ denotes a pooling operation;

according to the correspondence between the parts of the source human body image and the parts of the target human body image, inserting the style vectors $V_{sty}$ into the corresponding parts of the target part masks $M_t$ to obtain the style matrix $T_{sty}$; the $l$-th style vector $V_{sty}^{l}$ is broadcast and inserted into the $l$-th mask of the target human body image to generate the $l$-th style matrix $T_{sty}^{l}$, and all $L$ style matrices $T_{sty}^{l}$, $l = 0, \dots, L$, are added element by element to yield the final style matrix $T_{sty}$:

$T_{sty} = \sum_{l} T_{sty}^{l}$

applying two convolution layers to the style matrix $T_{sty}$ to obtain the modulation parameters $\gamma$ and $\beta$ of the normalization operation;

batch-normalizing the target posture feature $F_t$ to obtain $F_{norm}$, and modulating $F_{norm}$ with $\gamma$ and $\beta$ to obtain the stylized target posture feature $F_{sty}$:

$F_{sty} = \gamma F_{norm} + \beta$;

concatenating and fusing the stylized target posture feature $F_{sty}$ and the source human body feature $F_s$, and then enhancing the fused feature with a Squeeze-and-Excitation operation to obtain the enhanced feature $F_{fuse}$,

and obtaining the updated target human body feature $C'_t$:

where the fusion operator denotes a feature fusion operation that fuses $C_t$, $F_t$ and $F_s$ by concatenation and addition;

the updated source human body feature $C'_s$ is the concatenation of the source human body feature $F_s$ and the updated target human body feature $C'_t$.

In a further improvement of the invention, the loss function comprises: an adversarial loss function, a perceptual loss function, and a loss function based on human body structure similarity.

A further improvement of the invention is that the structure-based appearance generation module is replaced by an integrated appearance generation module;

the integrated appearance generation module is formed by cascading a plurality of structure-based appearance generation modules.

The virtual human body image generating system of the present invention comprises:

an image generation module, configured to input the source human body image and the target posture image into a pre-trained virtual human body image generation network and to obtain the target-posture human body image output by that network;

An electronic device of the present invention includes a processor and a memory, where the processor is configured to execute a computer program stored in the memory to implement the virtual human body image generation method according to any one of the above aspects of the present invention.

A computer-readable storage medium of the present invention stores at least one instruction that, when executed by a processor, implements the virtual human body image generation method according to any one of the above-described aspects of the present invention.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a new posture-guided virtual human body image generation method based on human body structure, which can generate realistic human body images with the correct target posture. Specifically, existing methods cannot guarantee the effectiveness of deformation, so the generated human body images in a specified posture are of low quality, tend to show blurred human body postures, and may even fail to maintain posture consistency. To address these technical problems, the invention constructs a posture-guided virtual human body image generation network based on human body structure (SAGN) and iteratively optimizes this convolutional neural network to obtain a pre-trained network that performs posture-guided virtual human body image generation. Because SAGN generates the appearance directly according to the target posture, the human body posture of the generated image is kept as consistent as possible with the target posture while the human body appearance remains realistic, so that realistic human body images with the correct posture can be generated. The invention thereby also provides a new approach to the difficult task of generating human body images in a target posture.

To address the problems that existing methods cannot guarantee the effectiveness of deformation, that the generated human body images in a specified posture are of low quality, and that blurred human body postures are easily produced or posture consistency is even lost, the system of the invention introduces a posture-guided virtual human body image generation network based on human body structure (SAGN), which consists of a series of structure-based appearance generation modules (SAG-Blk). Each structure-based appearance generation module consists of a structure-aware adaptive normalization (SAN) sub-module and a feature enhancement (FE) sub-module: the SAN sub-module generates stylized target posture features by a normalization method, and the FE sub-module supplies rich appearance information to the stylized target posture features, further enhancing their appearance features. The two sub-modules cooperate to progressively generate a realistic human body image with the correct posture, thereby realizing virtual human body image generation under target posture guidance.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below; it will be apparent to those of ordinary skill in the art that the drawings described below show some embodiments of the invention, and that other drawings may be derived from them without creative effort.

FIG. 1 is a flow chart of a posture-guided virtual human body image generation method based on human body structure according to an embodiment of the present invention;

FIG. 2 is a schematic illustration of a human joint point in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a virtual human body image generation network (SAGN) under human body structure-based pose guidance in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a structure-aware adaptive normalization (SAN) module according to an embodiment of the present invention;

FIG. 5 is a partial result diagram on a Market-1501 dataset in an embodiment of the present invention;

FIG. 6 is a schematic diagram of partial results on the DeepFashion dataset in an embodiment of the invention.

Detailed Description

In order to make the purposes, technical effects and technical solutions of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it will be apparent that the described embodiments are some of the embodiments of the present invention. Other embodiments, which may be made by those of ordinary skill in the art based on the disclosed embodiments without undue burden, are within the scope of the present invention.

Example 1

In embodiment 1 of the invention, existing methods cannot guarantee the effectiveness of deformation, so the generated human body images in a specified posture are of low quality, tend to show blurred human body postures, and may even fail to maintain posture consistency. Considering that generating the appearance of the image directly according to the target posture keeps the human body posture as consistent as possible with the target posture, the invention provides a posture-guided virtual human body image generation method based on human body structure.

SAGN consists of a series of structure-based appearance generation modules (SAG-Blk), each composed of a structure-aware adaptive normalization (SAN) sub-module and a feature enhancement (FE) sub-module. The SAN sub-module generates stylized target posture features by a normalization method, and the FE sub-module supplies rich appearance information to the stylized target posture features, further enhancing their appearance features. The two sub-modules cooperate to progressively generate a realistic human body image with the correct posture, thereby realizing virtual human body image generation under target posture guidance.

The invention provides a virtual human body image generation method based on human body structure posture guidance, which comprises the following steps:

step 1, acquiring images: a source human body image and a target human body image are acquired, and posture information is obtained from the source human body image and the target human body image; the source and target human body images show different human body postures with the same appearance;

step 2, constructing a posture-guided virtual human body image generation network based on human body structure (SAGN); the specific steps of constructing the network include:

constructing an encoder (Encoder); constructing a structure-based appearance generation module (SAG-Blk); and constructing a decoder (Decoder); wherein constructing the structure-based appearance generation module includes constructing a structure-aware adaptive normalization (SAN) module and a feature enhancement (FE) module;

step 3, inputting the source human body image, the source human body posture information and the target human body posture information in the step 1 into the network constructed in the step 2 to obtain a virtual target human body image;

step 4, constructing a loss function based on the virtual target human body image obtained in step 3 and the target human body image acquired in step 1, and iteratively optimizing the network constructed in step 2; after a preset number of iterations is reached, an optimized posture-guided virtual human body image generation network based on human body structure is obtained, which is used for generating virtual human body images under target posture guidance, that is, realistic human body images with the correct posture.
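The iterate-until-done structure of steps 3 and 4 can be sketched as follows. This is only a minimal illustration under strong assumptions: a one-parameter toy model and a squared-error loss stand in for the SAGN and its actual losses, and the names `train`, `max_iters` and `tol` are hypothetical. It shows both stopping criteria mentioned in the text, a preset iteration count and a convergence condition.

```python
def train(samples, lr=0.1, max_iters=1000, tol=1e-8):
    """Fit y = w * x by gradient descent; stop at max_iters or on convergence."""
    w = 0.0
    prev_loss = float("inf")
    for it in range(1, max_iters + 1):
        # Forward pass: generated outputs compared with the real targets.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
        # Convergence condition: the loss has stopped changing.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return w, it

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy (input, target) pairs
w, iters = train(samples)
```

In the actual method the parameter update would be applied to the weights of the encoder, the SAG-Blks and the decoder rather than to a single scalar.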

In embodiment 2 of the present invention, the specific steps in step 1 for obtaining the posture information from the source human body image and the target human body image include:

step 1.1, performing posture estimation on the human body images using a posture estimation method to obtain a joint point coordinate sequence with a preset number of joint points for the source human body image and for the target human body image;

step 1.2, based on the human body joint point coordinate sequences obtained in step 1.1, representing the human body posture information with heat maps to obtain the source human body posture information and the target human body posture information.

Illustratively, step 1.1 of the embodiment of the present invention specifically includes: performing posture estimation on the human body images using the OpenPose posture estimation method to obtain coordinate sequences of 18 human body joint points; the posture of the source human body image $I_s$ is expressed as $P(I_s) = \{p_1, \dots, p_K\}$, $K = 18$, and the posture of the target human body image $I_t$ is expressed as $P(I_t) = \{p_1, \dots, p_K\}$, $K = 18$.

Step 1.2 specifically includes: based on the human body joint point coordinate sequences obtained in step 1.1, representing the human body posture information with $K$ heat maps; the source human body posture information is denoted $P_s$ and the target human body posture information is denoted $P_t$.
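The conversion from a joint point coordinate sequence to K heat maps can be sketched as follows. The text does not specify how the heat maps are computed, so this sketch assumes one Gaussian bump per joint; the function name `keypoints_to_heatmaps`, the `sigma` parameter, and the convention that an undetected joint (given as `None`) yields an all-zero map are all assumptions made for illustration.

```python
import math

def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Return K heat maps (each height x width) from K (x, y) joint coordinates.

    A joint given as None (not detected) yields an all-zero map.
    """
    heatmaps = []
    for point in keypoints:
        if point is None:
            heatmaps.append([[0.0] * width for _ in range(height)])
            continue
        px, py = point
        # Gaussian centered on the joint; the peak value is 1.0 at (px, py).
        heatmaps.append([
            [math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ])
    return heatmaps

# K = 18 joints as in the text; only two are placed here, the rest are missing.
joints = [(3, 2), (5, 6)] + [None] * 16
maps = keypoints_to_heatmaps(joints, height=8, width=8)
```

The result is a K x H x W stack, which matches the role of the posture information fed to the encoder.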

In embodiment 3 of the present invention, the specific steps in step 2 for constructing the posture-guided virtual human body image generation network based on human body structure (SAGN) include:

step 2.1, constructing an encoder (Encoder), which encodes the input target posture information into target human body features, and the input source human body image and its posture information into source human body features;

step 2.2, constructing structure-based appearance generation modules (SAG-Blk), which update the target human body features and source human body features from step 2.1 (or the new target and source human body features output by the previous SAG-Blk), generating appearance information for the target human body features according to the source human body features to obtain new target human body features and new source human body features; T = 9 cascaded SAG-Blks are used to progressively generate the appearance information of the target human body features;

step 2.3, constructing a decoder (Decoder), which decodes the new target human body features output by the last SAG-Blk in step 2.2 to generate the human body image in the target posture.
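The data flow of steps 2.1 to 2.3 (encoder, then cascaded SAG-Blks, then decoder) can be sketched as follows. This is a structural sketch only: `run_sagn` and the toy stand-in callables are hypothetical, and the stand-ins merely count how many blocks have processed the features instead of performing any convolution.

```python
def run_sagn(source_image, source_pose, target_pose, encoder, blocks, decoder):
    """Encoder -> cascaded appearance generation blocks -> decoder."""
    c_s, c_t = encoder(source_image, source_pose, target_pose)
    for block in blocks:
        # Each block consumes the features output by the previous block.
        c_s, c_t = block(c_s, c_t)
    # Only the target features of the last block are decoded.
    return decoder(c_t)

# Toy stand-ins that only track how many blocks touched the features.
toy_encoder = lambda img, p_s, p_t: (0, 0)
toy_block = lambda c_s, c_t: (c_s + 1, c_t + 1)
toy_decoder = lambda c_t: c_t

T = 9  # number of cascaded blocks stated in the text
output = run_sagn(None, None, None, toy_encoder, [toy_block] * T, toy_decoder)
```

Passing the blocks as a list keeps the cascade depth a plain parameter, so the single-block variant and the cascaded variant share the same code path.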

Illustratively, step 2.1 of the embodiment of the present invention specifically includes: encoding the target posture information $P_t$ into the target human body feature $C_t$ with 2 downsampling convolutional layers; and encoding the source human body image $I_s$ and its posture information $P_s$ into the source human body feature $C_s$ with 2 downsampling convolutional layers.

In embodiment 4 of the present invention, the specific steps in step 2.2 for constructing the structure-based appearance generation module (SAG-Blk) include:

step 2.2.1, constructing a structure-aware adaptive normalization (SAN) module, which updates the target human body features from step 2.1 (or the new target human body features output by the previous SAG-Blk) by a normalization method to generate stylized target posture features;

step 2.2.2, constructing a feature enhancement (FE) module, which uses the source human body features from step 2.1 (or the new source human body features output by the previous SAG-Blk) to supply rich appearance information to the stylized target posture features obtained in step 2.2.1, further enhancing their appearance features.

Illustratively, step 2.2.1 of the embodiment of the present invention specifically includes: dividing the human body image into a plurality of human body parts and 1 background part based on the human body joint point coordinate sequence obtained in step 1.1, giving masks for the $L$ parts; the part masks of the source human body image are denoted $M_s$, and the part masks of the target human body image are denoted $M_t$.

Step 2.2.1 further includes: convolving the target human body feature $C_t$ (or the new target human body feature $C'_t$ output by the previous SAG-Blk) with two convolution layers to obtain the target human body feature $F_t$; and convolving the source human body feature $C_s$ (or the new source human body feature $C'_s$ output by the previous SAG-Blk) with two convolution layers to obtain the source human body feature $F_s$.

A style vector $V_{sty}$ is generated from the source human body feature $F_s$ and the part masks $M_s$ of the source human body image, where each element $V_{sty}^{l}$ of $V_{sty}$ is a $C$-dimensional vector representing the features of the corresponding part of the source human body image. Specifically, the style vector $V_{sty}^{l}$ of the $l$-th part is obtained by mean pooling:

$V_{sty}^{l} = \mathrm{Pool}\left(F_s \odot \mathrm{Resize}(M_s^{l})\right)$

where $\mathrm{Resize}(\cdot)$ denotes a scaling operation, needed here to scale $M_s$ to the same spatial size as $F_s$, namely $H' \times W'$; $\odot$ denotes element-by-element multiplication; and $\mathrm{Pool}(\cdot)$ denotes a pooling operation in which all non-zero elements are pooled.
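The masked mean pooling that produces one style vector per body part can be sketched as follows. This is a minimal sketch under stated assumptions: features are plain nested lists of shape C x H x W, the mask is binary and already resized to the feature resolution, the average is taken over the mask's non-zero pixels, and an empty mask yields a zero vector. The function name `part_style_vector` is hypothetical.

```python
def part_style_vector(features, mask):
    """Average each channel of `features` over the pixels where `mask` is non-zero.

    features: C x H x W nested lists; mask: H x W nested list of 0/1 values,
    assumed already resized to the feature resolution.
    """
    count = sum(1 for row in mask for m in row if m != 0)
    if count == 0:
        return [0.0] * len(features)  # part not visible: zero style (an assumption)
    return [
        sum(v * m for f_row, m_row in zip(channel, mask)
            for v, m in zip(f_row, m_row)) / count
        for channel in features
    ]

features = [
    [[1.0, 2.0], [3.0, 4.0]],      # channel 0
    [[10.0, 20.0], [30.0, 40.0]],  # channel 1
]
mask = [[1, 0], [0, 1]]
style = part_style_vector(features, mask)  # one C-dimensional vector
```

Repeating the call for each of the L part masks would give the full set of style vectors.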

According to the correspondence between the parts of the source human body image and the parts of the target human body image, the style vectors $V_{sty}$ are inserted into the corresponding parts of the target part masks $M_t$ to obtain the style matrix $T_{sty}$. The $l$-th style vector $V_{sty}^{l}$ is broadcast and inserted into the $l$-th mask of the target human body to generate the $l$-th style matrix $T_{sty}^{l}$, and all $L$ style matrices $T_{sty}^{l}$ are added element by element to obtain the final style matrix $T_{sty}$:

$T_{sty} = \sum_{l} T_{sty}^{l}$

The style matrix $T_{sty}$ obtained here has a human body posture consistent with the target human body posture and also contains the most critical human body appearance information.
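The broadcast-and-sum construction of the style matrix can be sketched as follows. It assumes binary target part masks at the feature resolution and represents the result as a C x H x W nested list; `build_style_matrix` is a hypothetical helper name.

```python
def build_style_matrix(style_vectors, target_masks):
    """Paint each part's style vector onto that part's region and sum the results.

    style_vectors: L vectors of length C; target_masks: L masks of size H x W (0/1).
    Returns a C x H x W nested list.
    """
    channels = len(style_vectors[0])
    height, width = len(target_masks[0]), len(target_masks[0][0])
    matrix = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for vector, mask in zip(style_vectors, target_masks):
        for c in range(channels):
            for y in range(height):
                for x in range(width):
                    # Broadcast the scalar vector[c] over the masked pixels.
                    matrix[c][y][x] += vector[c] * mask[y][x]
    return matrix

style_vectors = [[1.0, 2.0], [5.0, 6.0]]             # L = 2 parts, C = 2
target_masks = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]  # non-overlapping parts
t_sty = build_style_matrix(style_vectors, target_masks)
```

Because the source appearance is written into regions defined by the target masks, the result carries source style laid out in the target posture, which is the property the text attributes to the style matrix.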

Two convolution layers are applied to the style matrix $T_{sty}$ to obtain the modulation parameters $\gamma$ and $\beta$ of the normalization operation. Meanwhile, the target posture feature $F_t$ is batch-normalized to obtain $F_{norm}$. Finally, $F_{norm}$ is modulated with $\gamma$ and $\beta$ to obtain the stylized target posture feature $F_{sty}$:

$F_{sty} = \gamma F_{norm} + \beta$;

Notably, the stylized target posture feature $F_{sty}$ obtained here preserves the posture information of the target posture and also contains the most critical human body appearance information.
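The normalize-then-modulate step can be sketched for a single feature channel as follows. This is a simplified illustration: normalization statistics are computed over one channel of one sample rather than over a training batch, and `gamma` and `beta` are fixed toy values here, whereas in the text they are produced from the style matrix by convolution. The helper names are hypothetical.

```python
import math

def normalize_channel(channel, eps=1e-5):
    """Zero-mean, unit-variance normalization of one H x W channel."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in channel]

def modulate(f_norm, gamma, beta):
    """Element-wise F_sty = gamma * F_norm + beta for one channel."""
    return [
        [g * f + b for f, g, b in zip(f_row, g_row, b_row)]
        for f_row, g_row, b_row in zip(f_norm, gamma, beta)
    ]

f_t = [[1.0, 3.0], [5.0, 7.0]]
gamma = [[2.0, 2.0], [2.0, 2.0]]  # toy values; spatially varying in general
beta = [[1.0, 1.0], [1.0, 1.0]]   # toy values; spatially varying in general
f_sty = modulate(normalize_channel(f_t), gamma, beta)
```

Since `gamma` and `beta` have a value at every spatial position, different body parts can receive different scale and shift values.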

Step 2.2.2 of the embodiment of the present invention specifically includes: concatenating and fusing the stylized target posture feature $F_{sty}$ and the source human body feature $F_s$, and then enhancing the fused feature with a Squeeze-and-Excitation operation to obtain the enhanced feature $F_{fuse}$; a residual connection is added to further accelerate network training, giving the final new target human body feature $C'_t$:

where the fusion operator denotes a feature fusion operation that fuses $C_t$, $F_t$ and $F_s$ by concatenation and addition. Finally, the new source human body feature $C'_s$ is the concatenation of the source human body feature $F_s$ and the new target human body feature $C'_t$.
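The squeeze-and-excitation style enhancement of a feature map can be sketched as follows. This follows the common form of the operation (global average pooling, a bottleneck layer with ReLU, a second layer with sigmoid, then channel-wise scaling); the weights `w1` and `w2` are hypothetical toy values, bias terms are omitted, and the patent's exact layer configuration is not stated in the text.

```python
import math

def squeeze_excite(features, w1, w2):
    """Rescale channels of a C x H x W feature by learned per-channel gates."""
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(v for row in ch for v in row) / (len(ch) * len(ch[0]))
                for ch in features]
    # Excitation: bottleneck linear layer + ReLU, then linear layer + sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every channel by its gate in (0, 1).
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(features, gates)]
    return scaled, gates

features = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]  # C = 2
w1 = [[0.5, 0.5]]     # C -> 1 bottleneck (hypothetical weights)
w2 = [[1.0], [-1.0]]  # 1 -> C
enhanced, gates = squeeze_excite(features, w1, w2)
```

In the described module the input to this operation would be the fused (concatenated) feature rather than the toy two-channel map used here.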

Step 2.3 of the embodiment of the present invention specifically includes: decoding the new target human body feature $C'_t$ output by the last SAG-Blk with 2 upsampling convolutional layers to generate the human body image in the target posture.

In step 4 of the embodiment of the present invention, the loss function constructed from the virtual target human body image obtained in step 3 and the target human body image acquired in step 1 specifically includes: an adversarial loss function, a perceptual loss function, and a loss function based on human body structure similarity.
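One common way to combine several loss terms is a weighted sum, sketched below. The text names the three terms but does not state how they are combined, so the weighted-sum form, the function name `total_loss`, and the weight values are all assumptions made for illustration.

```python
def total_loss(adversarial, perceptual, structural,
               w_adv=1.0, w_per=1.0, w_str=1.0):
    """Weighted sum of the three loss terms named in the text (assumed form)."""
    return w_adv * adversarial + w_per * perceptual + w_str * structural

# Toy per-term values and weights, chosen only to show the arithmetic.
loss = total_loss(0.5, 0.2, 0.1, w_adv=1.0, w_per=2.0, w_str=4.0)
```

The weights would normally be tuned so that no single term dominates the gradient.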

The invention provides a virtual human body image generation system based on human body structure posture guidance, which comprises:

the sample acquisition module is used for acquiring a source human body image and a target human body image, and for obtaining posture information from the source human body image and the target human body image;

the network model building module is used for building a posture-guided virtual human body image generation network based on human body structure (SAGN); wherein the network (SAGN) comprises three parts: an encoder (Encoder), a structure-based appearance generation module (SAG-Blk), and a decoder (Decoder); and the structure-based appearance generation module comprises two parts: a structure-aware adaptive normalization (SAN) module and a feature enhancement (FE) module;

The training module is used for inputting the source human body image, the source human body posture information and the target human body posture information into a constructed network (SAGN) to obtain a virtual target human body image;

the optimization module is used for constructing a loss function based on the virtual target human body image and the real target human body image and carrying out iterative optimization on a network (SAGN); and after the preset iteration times are reached, obtaining an optimized virtual human body image generation network under the posture guidance based on the human body structure, wherein the virtual human body image generation network is used for generating the virtual human body image under the target posture guidance and generating a vivid human body image with the correct posture.

To address the problems that existing methods cannot guarantee the effectiveness of deformation, that the generated human body images in a specified posture are of low quality, and that blurred human body postures are easily produced or posture consistency is even lost, the system of the invention introduces a posture-guided virtual human body image generation network based on human body structure (SAGN), which consists of a series of structure-based appearance generation modules (SAG-Blk). Each structure-based appearance generation module is composed of a structure-aware adaptive normalization (SAN) sub-module and a feature enhancement (FE) sub-module. The SAN sub-module generates stylized target posture features by a normalization method, and the FE sub-module supplies rich appearance information to the stylized target posture features, further enhancing their appearance features. The two sub-modules cooperate to progressively generate a realistic human body image with the correct posture, thereby realizing virtual human body image generation under target posture guidance.

The invention relates to a virtual human body image generation method based on human body structure posture guidance, which comprises the following steps:

step 1, acquiring posture information of a source human body image and a target human body image according to the source human body image and the target human body image:

1.1 Performing pose estimation on the human body images by using a pose estimation method to obtain a preset number of joint point coordinates for the source human body image and for the target human body image;

1.2 Based on the human body joint point coordinate sequences obtained in step 1.1, representing the human body posture information by heat maps to obtain the source human body posture information and the target human body posture information.

Step 2, constructing a virtual human body image generation network (SAGN) under the guidance of the posture based on the human body structure:

2.1 Constructing an Encoder that encodes the input target posture information into target human body features, and encodes the source human body image together with its posture information into source human body features;

2.2 Constructing a structure-based appearance generation module (SAG-Blk) that updates the target human body features and the source human body features of step 2.1 (or the new target human body features and new source human body features output by the previous SAG-Blk), and generates appearance information for the target human body features according to the source human body features, so as to obtain new target human body features and new source human body features; wherein T = 9 cascaded SAG-Blks are used to generate the appearance information of the target human body features step by step, so that 9 cascaded structure-based appearance generation modules are finally obtained;

2.3 Constructing a Decoder that decodes the new target human body features output by the last SAG-Blk of step 2.2 to obtain the generated human body image in the target posture.
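As a non-limiting illustration of the data flow of steps 2.1 to 2.3, the cascade can be sketched in Python as follows. The function name `run_sagn` and all callables passed to it are hypothetical placeholders introduced only for this sketch; each stands in for a trained sub-network and none of them is specified by the invention in this form.

```python
def run_sagn(encode_target, encode_source, blocks, decode,
             source_image, source_pose, target_pose):
    """Sketch of the SAGN forward pass: encoder -> cascaded SAG-Blks -> decoder.

    encode_target, encode_source, decode and every entry of `blocks` are
    hypothetical callables standing in for trained sub-networks.
    """
    # Step 2.1: encode the target posture and the source image with its posture.
    c_t = encode_target(target_pose)
    c_s = encode_source(source_image, source_pose)

    # Step 2.2: each SAG-Blk returns new target and new source features,
    # which become the inputs of the next block (T = 9 in the embodiment).
    for block in blocks:
        c_t, c_s = block(c_t, c_s)

    # Step 2.3: only the target features of the last block are decoded.
    return decode(c_t)
```

In this sketch the source features are passed through every block alongside the target features, mirroring the statement that each SAG-Blk outputs both new target and new source human body features.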

Step 3, generating a target posture human body image:

1) Organizing data input into a convolutional neural network;

2) Generating the target posture human body image by using the convolutional neural network (SAGN) constructed in step 2.

Step 4, constructing a loss function of a convolutional neural network (SAGN):

4.1 Constructing an adversarial loss function;

4.2 Constructing a perceptual loss function;

4.3 Constructing a loss function based on the similarity of human body structures;

Step 5, optimizing the network parameters and generating a virtual human body image under the guidance of the target posture:

5.1 Iteratively optimizing the parameters of the network constructed in step 2 according to the loss function obtained in step 4;

5.2 After the preset number of iterations is reached, using the convolutional neural network (SAGN) constructed in step 2 to generate the virtual human body image under the guidance of the target posture.

Existing methods cannot guarantee the effectiveness of the deformation, produce low-quality human body images in the designated posture, and easily generate blurred human body postures or even fail to maintain posture consistency. To address these problems, the virtual human body image generation method under posture guidance based on the human body structure introduces a virtual human body image generation network (SAGN) under posture guidance based on the human body structure, wherein the SAGN consists of a series of structure-based appearance generation modules (SAG-Blk). Each SAG-Blk is composed of a structure-aware adaptive normalization (SAN) sub-module and a Feature Enhancement (FE) sub-module. The SAN sub-module generates stylized target posture features by means of a normalization method, and the FE sub-module supplies rich appearance information to the stylized target posture features and further enhances their appearance features. The two sub-modules cooperate to gradually generate a realistic human body image with the correct posture, that is, a virtual human body image under the guidance of the target posture.

Referring to fig. 1, the method for generating a virtual human body image under the guidance of a posture based on a human body structure according to the embodiment of the invention includes the following steps:

1.1 Using a posture estimation method to perform posture estimation on the human body image to obtain K=18 joint point coordinates of the source human body image and K=18 joint point coordinates of the target human body image.

In the embodiment of the invention, the OpenPose pose estimation method is used to perform pose estimation on the human body images, yielding a sequence of 18 human body joint point coordinates for each image; wherein the pose of the source human body image I_s is expressed as P(I_s) = {p_1, …, p_K}, K = 18, and the pose of the target human body image I_t is expressed as P(I_t) = {p_1, …, p_K}, K = 18. Fig. 2 is a schematic diagram of the 18 joint points.

In the embodiment of the invention, in order to exploit the spatial characteristics of the human body joint point coordinates, K = 18 heat maps are used to represent the human body posture information; wherein the source human body posture information is denoted P_s and the target human body posture information is denoted P_t.
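As a non-limiting illustration, the conversion of K = 18 joint point coordinates into K heat maps can be sketched as follows. The Gaussian shape of each heat map, the value of `sigma`, the handling of undetected joints (an all-zero channel) and the function name `joints_to_heatmaps` are assumptions made for this sketch; the text above only states that one heat map per joint point is used.

```python
import numpy as np

def joints_to_heatmaps(joints, height, width, sigma=6.0):
    """Render one heat map per joint point.

    joints: sequence of K entries, each an (x, y) pixel coordinate or None
            for a joint that the pose estimator did not detect (assumption).
    Returns an array of shape (K, height, width).
    """
    ys, xs = np.mgrid[0:height, 0:width]            # pixel coordinate grids
    maps = np.zeros((len(joints), height, width), dtype=np.float32)
    for k, joint in enumerate(joints):
        if joint is None:
            continue                                # leave the channel empty
        x, y = joint
        # Assumed Gaussian bump centred on the joint, with peak value 1.
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps

# Usage: 18 joints, only the first one detected at x = 5, y = 3.
pose = joints_to_heatmaps([(5, 3)] + [None] * 17, height=16, width=8)
```

Each channel keeps the spatial location of one joint, which is the property the heat-map representation is meant to preserve.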

the virtual human body image generation network (SAGN) guided by the posture of the human body structure consists of an encoder, a structure-based appearance generation module and a decoder; fig. 3 is a schematic diagram of a virtual human body image generation network (SAGN) structure under the guidance of a posture based on a human body structure.

2.1 An Encoder (Encoder) is constructed to encode the input target pose information and source human body image and pose information, respectively.

In the embodiment of the invention, 2 downsampling convolution layers encode the target posture information P_t into the target human body feature C_t; likewise, 2 downsampling convolution layers encode the source human body image I_s and its posture information P_s into the source human body feature C_s.

2.2 Construction of a structure-based appearance generation module (SAG-Blk).

In the embodiment of the invention, the virtual human body image generation network (SAGN) under the guidance of the human body structure comprises T = 9 structure-based appearance generation modules (SAG-Blk), and each SAG-Blk consists of a structure-aware adaptive normalization (SAN) module and a Feature Enhancement (FE) module. The SAN module generates stylized target posture features by means of a normalization method, and the FE module supplies rich appearance information to the stylized target posture features and further enhances their appearance features. The two sub-modules cooperate to gradually generate a realistic human body image with the correct posture, that is, a virtual human body image under the guidance of the target posture.

2.2.1 A structure-aware adaptive normalization (SAN) module is constructed.

In the embodiment of the invention, based on the human body joint point coordinate sequence obtained in step 1.1, the human body image is divided into 10 human body parts and 1 background part, namely the head, the left and right upper arms, the left and right lower arms, the left and right thighs, the left and right shanks, the trunk, and the background; masks of these L parts are obtained. The part masks of the source human body image are denoted M_s, and the part masks of the target human body image are denoted M_t.

FIG. 4 is a schematic diagram of the structure-aware adaptive normalization (SAN) module. In the embodiment of the invention, two convolution layers are applied to the target human body feature C_t (or the new target human body feature C'_t output by the previous SAG-Blk) to obtain the target human body feature F_t; two convolution layers are applied to the source human body feature C_s (or the new source human body feature C'_s output by the previous SAG-Blk) to obtain the source human body feature F_s.

According to the source human body feature F_s and the part masks M_s of the source human body image, a style vector V_sty is generated, wherein each element of V_sty, the style vector of the l-th part, is a C-dimensional vector representing the features of that part of the source human body image; specifically, mean pooling is used here to obtain the style vector of the l-th part.
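As a non-limiting illustration, mean pooling of the source features over each part mask, followed by the broadcasting of each style vector into the corresponding target part mask and element-wise summation (as recited in the claims for the style matrix T_sty), can be sketched as follows. The array layouts, the use of binary masks, the treatment of empty parts and the function names `style_vectors` and `style_matrix` are assumptions of this sketch.

```python
import numpy as np

def style_vectors(f_s, masks_s):
    """Masked mean pooling.

    f_s:     (C, H, W) source human body features F_s.
    masks_s: (L, H, W) binary part masks M_s of the source image.
    Returns V_sty with shape (L, C): one C-dimensional vector per part.
    """
    sums = np.einsum("lhw,chw->lc", masks_s, f_s)     # per-part feature sums
    areas = masks_s.sum(axis=(1, 2))[:, None]         # pixels per part
    # Assumption: an empty part (area 0) yields an all-zero style vector.
    return sums / np.maximum(areas, 1.0)

def style_matrix(v_sty, masks_t):
    """Broadcast each style vector into its target part mask and sum.

    v_sty:   (L, C) style vectors.
    masks_t: (L, H, W) binary part masks M_t of the target image.
    Returns T_sty with shape (C, H, W).
    """
    return np.einsum("lc,lhw->chw", v_sty, masks_t)

# Usage on a toy 2-channel, 2x2 feature map with two parts.
f_s = np.arange(8, dtype=float).reshape(2, 2, 2)
m_s = np.array([[[1, 1], [0, 0]], [[0, 0], [1, 1]]], dtype=float)
m_t = np.array([[[1, 0], [1, 0]], [[0, 1], [0, 1]]], dtype=float)
t_sty = style_matrix(style_vectors(f_s, m_s), m_t)
```

Because the part masks of the target image do not overlap in this toy case, every target pixel receives the style vector of exactly one source part.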

Two convolution layers are applied to the style matrix T_sty to obtain the modulation parameters γ and β of the normalization operation. Meanwhile, batch normalization is applied to the target posture feature F_t to obtain F_norm. Finally, F_norm is modulated with γ and β to obtain the stylized target posture feature F_sty:

F_sty = γ F_norm + β;
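As a non-limiting illustration, the batch normalization of the target posture feature and its modulation by γ and β can be sketched as follows. The sketch assumes that γ and β are already available as arrays of the same shape as the features (in the invention they are produced by two convolution layers from the style matrix), that the normalization has no learned affine parameters of its own, and that the value of `eps` is free to choose; the function names are hypothetical.

```python
import numpy as np

def batch_norm(f_t, eps=1e-5):
    """Normalize (N, C, H, W) features per channel over batch and space."""
    mean = f_t.mean(axis=(0, 2, 3), keepdims=True)
    var = f_t.var(axis=(0, 2, 3), keepdims=True)
    return (f_t - mean) / np.sqrt(var + eps)

def san_modulate(f_t, gamma, beta):
    """Compute F_sty = gamma * F_norm + beta, element by element."""
    return gamma * batch_norm(f_t) + beta

# Usage with constant modulation parameters (illustrative values only).
rng = np.random.default_rng(0)
f_t = rng.normal(size=(2, 3, 4, 4))
f_sty = san_modulate(f_t, np.full(f_t.shape, 2.0), np.full(f_t.shape, 0.5))
```

Since γ and β vary over spatial positions, each body part of the target posture can receive a different scale and shift derived from the corresponding source part.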

2.2.2 A Feature Enhancement (FE) module is constructed.

The stylized target posture feature F_sty and the source human body feature F_s are concatenated and fused, and the fused features are then enhanced by a Squeeze-and-Excitation operation to obtain the enhanced feature F_fuse. A residual module is added to further accelerate network training, yielding the final new target human body feature C'_t.
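As a non-limiting illustration, the concatenation, fusion, Squeeze-and-Excitation and residual steps can be sketched as follows. Several details are assumptions of this sketch and are not stated in the text above: the fusion is modelled as a 1x1 convolution with weights `w_fuse`, the excitation uses two fully connected layers `w1` and `w2` with a ReLU and a sigmoid, and the residual connection adds the block input `c_t` to the enhanced feature. All names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_enhance(f_sty, f_s, c_t, w_fuse, w1, w2):
    """Sketch of the FE module on single-sample (C, H, W) features.

    w_fuse: (C, 2C) weights of an assumed 1x1 fusion convolution.
    w1:     (C // r, C) and w2: (C, C // r) excitation weights (assumed).
    """
    fused = np.concatenate([f_sty, f_s], axis=0)          # (2C, H, W)
    fused = np.einsum("oc,chw->ohw", w_fuse, fused)       # fuse to (C, H, W)
    squeezed = fused.mean(axis=(1, 2))                    # squeeze: (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))   # excitation: (C,)
    f_fuse = fused * gate[:, None, None]                  # channel reweighting
    return c_t + f_fuse                                   # assumed residual sum

# Usage with random weights (illustrative only).
rng = np.random.default_rng(0)
C, H, W = 4, 3, 3
f_sty, f_s, c_t = rng.normal(size=(3, C, H, W))
new_c_t = feature_enhance(f_sty, f_s, c_t,
                          rng.normal(size=(C, 2 * C)),
                          rng.normal(size=(C // 2, C)),
                          rng.normal(size=(C, C // 2)))
```

The channel gate lets the module emphasise the appearance channels that are most informative for the current target posture, while the residual path keeps the block easy to optimize.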

2.3 A Decoder is constructed to decode the new target human body features output by the last SAG-Blk of step 2.2 and generate the human body image in the target posture.

In the embodiment of the invention, 2 upsampling convolution layers decode the new target human body feature C'_t output by the last SAG-Blk of step 2.2 to generate the human body image in the target posture.

Step 3, generating a target posture human body image:

1) Data input to the convolutional neural network is organized.

The data input to the network is divided into two parts, one part is the target human body posture information expressed by the heat map obtained in the step 1, and the other part is the source human body image and the source human body posture information expressed by the heat map.

2) The organized data is input into the network to generate the human body image in the target posture.

Step 4, constructing a loss function of a convolutional neural network (SAGN):

In the embodiment of the invention, a combination of an adversarial loss function, a perceptual loss function and a loss function based on human body structural similarity is used as the loss function of the convolutional neural network (SAGN) proposed by the invention.

The adversarial loss function uses a discriminator to measure the distance between the distribution of real images and the distribution of generated images, and continuously reduces the distance between the two distributions. The invention constructs two discriminators, an appearance discriminator and a posture discriminator, which ensure appearance consistency and posture consistency, respectively, between the real image I_t and the generated image. In the formula defining this loss, the two distributions are the distribution of human body posture images and the distribution of real human body images, and the generator is the human body image generation network proposed by the invention. Further details can be found in "Progressive pose attention transfer for person image generation".

The perceptual loss function measures the similarity between the feature maps of the real image and of the generated image; the L1 distance between the two feature maps is generally used as the perceptual loss.

Here φ_i is the output of the i-th layer of a pre-trained network; the feature map output by conv1_2 of a VGG-19 network pre-trained on ImageNet is typically used. For more details see "Perceptual losses for real-time style transfer and super-resolution".
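As a sketch consistent with the description above (an L1 distance between the feature maps φ_i of the generated image and of the real image I_t), the perceptual loss can be written as follows. The symbols \mathcal{L}_{per} and \hat{I}_t (the generated image) are notational assumptions of this sketch, and any normalization factor is omitted.

```latex
\mathcal{L}_{per} = \left\| \phi_i(\hat{I}_t) - \phi_i(I_t) \right\|_1
```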

The loss function based on the human body structural similarity is used for measuring the structural similarity of the real image and each human body part of the generated image. This accurate measurement of the similarity of each human body part can bring clear human body boundaries and detailed texture features to the virtual human body image. It is defined as:

wherein MSSIM(·,·) is the structural similarity of the backgrounds (part 0) of I_t and the generated image, and SSIM^l(·,·) is the structural similarity of the l-th part of the human body images. See "Loss Functions for Person Image Generation" for further details.
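As a non-limiting illustration of measuring structural similarity part by part, the following sketch computes SSIM from global statistics inside each part mask and averages 1 − SSIM over the parts. This is a simplification chosen for brevity: it does not use sliding windows, it treats the background like any other part, and the way the per-part scores are combined is an assumption rather than the loss formula of the invention. The constants 0.01 and 0.03 are the customary SSIM stabilizers; the function names are hypothetical.

```python
import numpy as np

def masked_ssim(x, y, mask, data_range=1.0):
    """SSIM of two grayscale images from global statistics inside `mask`."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    sel = mask.astype(bool)
    px, py = x[sel].astype(np.float64), y[sel].astype(np.float64)
    mu_x, mu_y = px.mean(), py.mean()
    var_x, var_y = px.var(), py.var()
    cov = ((px - mu_x) * (py - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def part_ssim_loss(real, fake, masks):
    """Assumed aggregation: 1 minus the mean SSIM over non-empty parts.

    masks: (L, H, W) part masks; at least one mask must be non-empty.
    """
    scores = [masked_ssim(real, fake, m) for m in masks if m.any()]
    return 1.0 - float(np.mean(scores))

# Usage: two parts covering the upper and lower half of an 8x8 image.
rng = np.random.default_rng(0)
real = rng.random((8, 8))
masks = np.zeros((2, 8, 8))
masks[0, :4] = 1
masks[1, 4:] = 1
loss_same = part_ssim_loss(real, real, masks)
```

Scoring each part separately prevents a well-reconstructed large region from hiding errors in a small one, such as a lower arm.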

4.1 An adversarial loss function is constructed.

The adversarial loss function of "Generative adversarial nets" achieves good results in image generation; the adversarial loss function of that paper is adopted here as one of the loss functions of the convolutional neural network (SAGN) proposed by the invention.

4.2 A perceptual loss function is constructed.

The perceptual loss function of "Perceptual losses for real-time style transfer and super-resolution" achieves good results in style transfer; the perceptual loss function of that paper is adopted as one of the loss functions of the convolutional neural network (SAGN) proposed by the invention.

4.3 A loss function based on human body structural similarity is constructed.

"Loss functions for person image generation" uses a loss function based on human body structural similarity to effectively calculate the structural similarity of each part of the human body and achieves good results in human body image generation; the loss function based on human body structural similarity of that paper is adopted as one of the loss functions of the convolutional neural network (SAGN) proposed by the invention.

The network parameters are optimized for 90k iterations using the Adam optimizer, where β₁ = 0.5 and β₂ = 0.999.
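As a non-limiting illustration, one Adam update with β₁ = 0.5 and β₂ = 0.999 can be sketched as follows. The learning rate `lr` and the stabilizer `eps` are not given in the text and are assumptions of this sketch, as is the function name `adam_step`; the toy objective in the usage lines merely replaces the network loss.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (param, m, v).

    t is the 1-based iteration counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Usage on a toy objective f(x) = (x - 3)^2 instead of the network loss.
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2.0 * (x - 3.0), m, v, t, lr=0.05)
```

In the embodiment the same update would be applied to all network parameters for 90k iterations.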

In summary, for a source human body image and an arbitrary target human body posture image, the method of the invention provides a virtual human body image generation network under posture guidance based on the human body structure. First, pose estimation is performed on the input human body images to obtain their joint point coordinate sequences. Then, a virtual human body image generation network (SAGN) under posture guidance based on the human body structure is constructed, comprising an encoder, structure-based appearance generation modules (SAG-Blk) and a decoder, wherein each SAG-Blk consists of a structure-aware adaptive normalization (SAN) sub-module and a Feature Enhancement (FE) sub-module. Next, the loss function of the convolutional neural network is constructed, comprising an adversarial loss function, a perceptual loss function and a loss function based on human body structural similarity. Finally, the proposed convolutional neural network (SAGN) is jointly optimized with these loss functions to generate the virtual human body image under the guidance of the target posture. The method is compared with existing methods in qualitative and quantitative experiments, and its effectiveness is verified on two public datasets, Market-1501 and DeepFashion.

Tables 1a and 1b show the quantitative experimental results of the invention: Table 1a shows the results of the method on the Market-1501 dataset, and Table 1b shows the results of the method on the DeepFashion dataset.

TABLE 1a experimental results of this method under the Market-1501 dataset

TABLE 1b experimental results of this method under DeepFashion dataset

SSIM, IS and DS are common indices for measuring the quality of generated images; the larger their values, the more realistic the generated image and the higher its quality. FID and LPIPS are also common indices of generation quality; the smaller their values, the more realistic the generated image and the higher its quality. As can be seen from Table 1a, on the Market-1501 dataset the images generated by the invention achieve the best result on all indices, in particular SSIM, which reaches a maximum value of 0.321. As can be seen from Table 1b, on the DeepFashion dataset the images generated by the invention rank second on SSIM, IS, DS and LPIPS, indicating more reliable image generation. Therefore, the quantitative results show that the virtual human body image generation method based on structural similarity can generate more realistic virtual human body images.

Figs. 5 and 6 show the qualitative experimental results of the invention. Fig. 5 shows images generated by the invention on the Market-1501 dataset; it can be seen that the method generates virtual human body images with a clear human body posture and a realistic appearance. In particular, under large posture changes, the virtual human body images generated by the method still maintain the correct human body posture (e.g., rows 2, 4 and 5). Fig. 6 shows images generated by the invention on the DeepFashion dataset; it can be seen that the human body images generated by other methods are prone to artifacts, whereas the method still maintains the correct human body posture and a realistic appearance. Notably, even when the target posture is very complex, the method maintains the correctness and integrity of the posture of the generated virtual human body, which makes the generated images look very realistic. Therefore, the qualitative results show that the virtual human body image generation method under posture guidance based on the human body structure can generate a realistic human body image with the correct posture.

In summary, the invention discloses a new human body image generation method, system and electronic equipment under posture guidance based on the human body structure, belongs to the technical field at the intersection of computer vision and computer graphics, and generates, for a source human body image and a given target posture image, a new human body image in the target posture. The invention constructs a virtual human body image generation network (SAGN) under the guidance of the human body structure, comprising an encoder, structure-based appearance generation modules (SAG-Blk) and a decoder, wherein each SAG-Blk consists of a structure-aware adaptive normalization (SAN) sub-module and a Feature Enhancement (FE) sub-module; it then constructs the loss function of the convolutional neural network, comprising an adversarial loss function, a perceptual loss function and a loss function based on human body structural similarity; finally, the proposed convolutional neural network (SAGN) is jointly optimized with these loss functions to generate the virtual human body image under the guidance of the target posture. The invention can generate a realistic virtual human body image with the correct posture.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, one skilled in the art may make modifications and equivalents to the specific embodiments of the present invention, and any modifications and equivalents not departing from the spirit and scope of the present invention are within the scope of the claims of the present invention.

Claims

1. A virtual human body image generation method, characterized by comprising the steps of:

the encoder is used for inputting a source human body image and a target posture image, encoding to obtain source human body characteristics and target human body characteristics and outputting the source human body characteristics and the target human body characteristics;

the appearance generating module based on the structure is used for inputting and updating the source human body characteristics and the target human body characteristics, obtaining and outputting the updated source human body characteristics and target human body characteristics;

the decoder is used for inputting the target human body characteristics output by the structure-based appearance generating module and decoding to obtain a target posture human body image;

the obtaining step of the trained virtual human body image generating network specifically comprises the following steps:

After the preset iteration times or convergence conditions are reached, a trained virtual human body image generation network is obtained;

the structure-based appearance generation module includes:

the feature enhancement module is used for inputting the generated stylized target gesture features and the source human body features and outputting updated source human body features and target human body features;

in the sample data set, the step of acquiring the source human body posture sample data and the target human body posture sample data in each sample data includes: performing pose estimation on the human body images by using the OpenPose pose estimation method to obtain a sequence of 18 human body joint point coordinates; wherein the pose of the source human body image I_s is expressed as P(I_s) = {p_1, …, p_K}, K = 18, and the pose of the target human body image I_t is expressed as P(I_t) = {p_1, …, p_K}, K = 18; based on the obtained human body joint point coordinate sequences, representing the human body posture information by K heat maps; wherein the source human body posture information is denoted P_s and the target human body posture information is denoted P_t;

in the encoder, the step of obtaining the source human body features and the target human body features by encoding specifically comprises: encoding the target posture information P_t into the target human body feature C_t with 2 downsampling convolution layers; and encoding the source human body image I_s and the posture information P_s into the source human body feature C_s with 2 downsampling convolution layers;

In the structure-based appearance generation module, the steps of inputting and updating the source human body characteristics and the target human body characteristics and obtaining the updated source human body characteristics and target human body characteristics specifically comprise:

applying two convolution layers to the target human body feature C_t to obtain the target human body feature F_t; applying two convolution layers to the source human body feature C_s to obtain the source human body feature F_s; generating a style vector V_sty according to the source human body feature F_s and the part masks M_s of the source human body image, wherein each element of V_sty, the style vector of the l-th of the L parts, is a C-dimensional vector representing the features of the corresponding part of the source human body image; and using mean pooling to obtain the style vector of the l-th part;

according to the correspondence between the parts of the source human body image and the parts of the target human body image, inserting the style vector V_sty into the corresponding parts of the part masks M_t of the target human body image to obtain the style matrix T_sty; wherein the l-th style vector is broadcast into the l-th mask of the target human body image to generate the l-th style matrix, and all L style matrices are added element by element to obtain the final style matrix T_sty;

performing batch normalization on the target posture feature F_t to obtain F_norm, and modulating F_norm with γ and β to obtain the stylized target posture feature F_sty:

F_sty = γ F_norm + β;

Obtaining updated target human body characteristics C' _t ：

2. The virtual human body image generating method according to claim 1, wherein the loss function comprises: an adversarial loss function, a perceptual loss function, and a loss function based on human body structural similarity.

3. The method of claim 1, wherein the structure-based appearance generation module is replaced with an integrated appearance generation module;

4. A virtual human body image generation system, comprising:

the structure-based appearance generation module includes:

in the sample data set, the step of acquiring the source human body posture sample data and the target human body posture sample data in each sample data includes: performing pose estimation on the human body images by using the OpenPose pose estimation method to obtain a sequence of 18 human body joint point coordinates; wherein the pose of the source human body image I_s is expressed as P(I_s) = {p_1, …, p_K}, K = 18, and the pose of the target human body image I_t is expressed as P(I_t) = {p_1, …, p_K}, K = 18; based on the obtained human body joint point coordinate sequences, representing the human body posture information by K heat maps; wherein the source human body posture information is denoted P_s and the target human body posture information is denoted P_t;

dividing the human body image into a plurality of human body parts and 1 background part based on the obtained human body joint point coordinate sequence, and obtaining masks of the L parts; wherein the part masks of the source human body image are denoted M_s and the part masks of the target human body image are denoted M_t;

applying two convolution layers to the target human body feature C_t to obtain the target human body feature F_t; applying two convolution layers to the source human body feature C_s to obtain the source human body feature F_s; generating a style vector V_sty according to the source human body feature F_s and the part masks M_s of the source human body image, wherein each element of V_sty, the style vector of the l-th of the L parts, is a C-dimensional vector representing the features of the corresponding part of the source human body image; and using mean pooling to obtain the style vector of the l-th part;

according to the correspondence between the parts of the source human body image and the parts of the target human body image, inserting the style vector V_sty into the corresponding parts of the part masks M_t of the target human body image to obtain the style matrix T_sty; wherein the l-th style vector is broadcast into the l-th mask of the target human body image to generate the l-th style matrix, and all L style matrices are added element by element to obtain the final style matrix T_sty;

applying two convolution layers to the style matrix T_sty to obtain the modulation parameters γ and β of the normalization operation;

performing batch normalization on the target posture feature F_t to obtain F_norm, and modulating F_norm with γ and β to obtain the stylized target posture feature F_sty:

F_sty = γ F_norm + β;

Obtaining updated target human body characteristics C' _t ：

5. An electronic device comprising a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the virtual human body image generation method of any one of claims 1 to 3.

6. A computer readable storage medium storing at least one instruction which when executed by a processor implements the virtual human body image generation method of any one of claims 1 to 3.