CN113592971A - Virtual human body image generation method, system, equipment and medium


Info

Publication number
CN113592971A
Authority
CN
China
Prior art keywords
human body
target
body image
posture
source
Prior art date
Legal status
Granted
Application number
CN202110865481.2A
Other languages
Chinese (zh)
Other versions
CN113592971B (en)
Inventor
王乐
师皓玥
周三平
陈仕韬
辛景民
郑南宁
Current Assignee
Ningbo Shun'an Artificial Intelligence Research Institute
Xian Jiaotong University
Original Assignee
Ningbo Shun'an Artificial Intelligence Research Institute
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Ningbo Shun'an Artificial Intelligence Research Institute, Xian Jiaotong University filed Critical Ningbo Shun'an Artificial Intelligence Research Institute
Priority to CN202110865481.2A
Publication of CN113592971A
Application granted
Publication of CN113592971B
Active (current legal status)
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a virtual human body image generation method, system, device and medium. The method comprises: inputting a source human body image and a target posture image into a pre-trained virtual human body image generation network to obtain a human body image in the target posture. The virtual human body image generation network is a convolutional neural network comprising: an encoder, which takes the source human body image and the target posture image as input and encodes them into source human body features and target human body features; a structure-based appearance generation module, which takes the source human body features and target human body features as input and updates them to obtain updated source human body features and target human body features; and a decoder, which takes the target human body features output by the structure-based appearance generation module as input and decodes them into the target-posture human body image. The invention uses a human-structure-based, posture-guided virtual human body image generation network to generate realistic human body images with the correct target posture.

Description

Virtual human body image generation method, system, equipment and medium
Technical Field
The invention belongs to the interdisciplinary field of computer vision and computer graphics, and particularly relates to a virtual human body image generation method, system, device and medium.
Background
The posture-guided virtual human body image generation task aims to generate, from a source human body image and a given target posture image, a new human body image in the target posture, where the human body posture of the generated image is consistent with the target posture and the human body appearance is similar to that of the source human body image. The task has many application scenarios, such as film production, virtual reality, and data augmentation for action recognition tasks.
At present, posture-guided virtual image generation methods mainly have the following defect:
generating a virtual human body image in a target posture requires simultaneously maintaining the posture consistency and the appearance consistency of the generated image; the target posture usually differs greatly from the posture in the source human body image and can be very complex. Existing methods usually consider how to deform the source human body image to obtain a human body image in the target posture; because the validity of the deformation cannot be guaranteed, the human body posture of the generated image is often blurred or even inconsistent with the target posture, so the quality of the generated human body image in the specified posture is very low.
In summary, there is a need for a new method, system, device and medium for generating virtual human body images under posture guidance based on human body structures.
Disclosure of Invention
The present invention is directed to a method, system, device and medium for generating a virtual human body image, so as to solve one or more of the above-mentioned problems. The invention utilizes the virtual human body image generation network under the posture guidance based on the human body structure to generate the vivid human body image with the correct target posture.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a virtual human body image generation method, which comprises the following steps:
inputting a source human body image and a target posture image into a pre-trained virtual human body image generation network, and obtaining a target-posture human body image from the output of the network;
wherein, the virtual human body image generation network is a convolution neural network, and comprises:
the encoder is used for inputting a source human body image and a target posture image, and encoding to obtain a source human body characteristic and a target human body characteristic;
the structure-based appearance generation module is used for inputting and updating the source human body characteristics and the target human body characteristics to obtain updated source human body characteristics and target human body characteristics;
and the decoder is used for inputting the target human body characteristics output by the structure-based appearance generation module and decoding to obtain a target posture human body image.
The invention has the further improvement that the step of acquiring the trained virtual human body image generation network specifically comprises the following steps:
acquiring a sample data set; each sample data in the sample data set comprises source human body image sample data, target human body image sample data, source human body posture sample data and target human body posture sample data;
inputting source human body image sample data, source human body posture sample data and target human body posture sample data in selected sample data of the sample data set into the virtual human body image generation network to obtain virtual target human body image data; constructing a loss function based on the virtual target human body image data and target human body image sample data in the selected sample data, and performing iterative optimization on the virtual human body image generation network;
and obtaining the trained virtual human body image generation network after reaching the preset iteration times or convergence conditions.
A further refinement of the invention is that the structure-based appearance generation module comprises:
the structure perception self-adaptive normalization module is used for inputting the source human body characteristics and the target human body characteristics, generating stylized target posture characteristics and outputting the stylized target posture characteristics;
and the characteristic enhancement module is used for inputting the generated stylized target posture characteristic and the source human body characteristic and outputting the updated source human body characteristic and the updated target human body characteristic.
In the sample data set, the step of obtaining the source human body posture sample data and the target human body posture sample data in each sample data comprises: performing posture estimation on the human body images with the OpenPose posture estimation method to obtain coordinate sequences of 18 human body joint points; wherein the joint point coordinate sequence of the source human body image I_s is expressed as P(I_s) = {p_1, ..., p_K}, K = 18, and the joint point coordinate sequence of the target human body image I_t is expressed as P(I_t) = {p_1, ..., p_K}, K = 18; based on the obtained human body joint point coordinate sequences, K heat maps are used to represent the human body posture information; wherein the source human body posture information is denoted P_s and the target human body posture information is denoted P_t.
In the encoder, the step of encoding to obtain the source human body features and the target human body features specifically comprises: encoding the target posture information P_t into the target human body features C_t with 2 downsampling convolutional layers; and encoding the source human body image I_s together with the source posture information P_s into the source human body features C_s with 2 downsampling convolutional layers.
A further improvement of the present invention is that, in the structure-based appearance generation module, the step of inputting and updating the source human body features and the target human body features to obtain updated source human body features and target human body features specifically comprises:
dividing the human body image into several human body parts and 1 background part based on the obtained human body joint point coordinate sequence, and obtaining L part masks; wherein the part masks of the source human body image are denoted M_s and the part masks of the target human body image are denoted M_t;
convolving the target human body features C_t with two convolutional layers to obtain target posture features F_t, and convolving the source human body features C_s with two convolutional layers to obtain source human body features F_s;
generating a style vector matrix V_sty from the source human body features F_s and the part masks M_s of the source human body image; wherein each row v_sty^l of V_sty is a C-dimensional vector representing the features of one part of the source human body image; the style vector of the l-th part is obtained by mean pooling as
v_sty^l = Pool(Resize(M_s^l) ⊙ F_s),
where Resize(·) denotes a scaling operation, ⊙ denotes element-wise multiplication, and Pool(·) denotes a pooling operation;
inserting the style vectors V_sty into the corresponding parts of the part masks M_t of the target human body image, according to the correspondence between the parts of the source human body image and the target human body image, to obtain a style matrix T_sty; wherein the l-th style vector v_sty^l is inserted into the l-th mask of the target human body image by broadcasting to generate the l-th style matrix T_sty^l, and all L style matrices T_sty^l are added element by element to obtain the final style matrix T_sty;
convolving the style matrix T_sty with two convolutional layers to obtain the modulation parameters γ and β used in the normalization operation; performing batch normalization on the target posture features F_t to obtain F_norm; modulating F_norm with γ and β to obtain the stylized target posture features F_sty:
F_sty = γ F_norm + β;
concatenating and fusing the stylized target posture features F_sty and the source human body features F_s, and then enhancing the fused features with a Squeeze-and-Excitation operation to obtain enhanced features F_fuse;
obtaining the updated target human body features C'_t by fusing the enhanced features F_fuse with C_t, F_t and F_s through a feature fusion operation that uses concatenation and addition;
wherein the updated source human body features C'_s are the concatenation of the source human body features F_s and the updated target human body features C'_t.
A further refinement of the invention provides that the loss function comprises: an adversarial loss function, a perceptual loss function, and a loss function based on human body structure similarity.
A further improvement of the present invention is that the structure-based appearance generation module is replaced with an integrated appearance generation module;
the integrated appearance generation module is composed of a plurality of cascaded structure-based appearance generation modules.
The invention relates to a virtual human body image generation system, which comprises:
the image generation module is used for inputting the source human body image and the target posture image into a pre-trained virtual human body image generation network, and obtaining a target-posture human body image from the output of the network;
wherein, the virtual human body image generation network is a convolution neural network, and comprises:
the encoder is used for inputting a source human body image and a target posture image, and encoding to obtain a source human body characteristic and a target human body characteristic;
the structure-based appearance generation module is used for inputting and updating the source human body characteristics and the target human body characteristics to obtain updated source human body characteristics and target human body characteristics;
and the decoder is used for inputting the target human body characteristics output by the structure-based appearance generation module and decoding to obtain a target posture human body image.
An electronic device of the present invention includes a processor and a memory, the processor is configured to execute a computer program stored in the memory to implement the virtual human body image generation method according to any one of the above aspects of the present invention.
A computer-readable storage medium of the present invention stores at least one instruction, which when executed by a processor, implements a virtual human body image generation method as any one of the above.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a novel method for generating a virtual human body image under posture guidance based on a human body structure, which can generate a vivid human body image with a correct target posture. Specifically, aiming at the technical problems that the effectiveness of deformation cannot be guaranteed, the quality of the generated human body image with the specified posture is low, a fuzzy human body posture is easy to generate, and even the consistency of the posture cannot be maintained in the existing method, the invention constructs a human body structure-based posture-guided virtual human body image generation network (SAGN), and carries out iterative optimization on the constructed convolutional neural network (SAGN) to obtain a pre-trained convolutional neural network (SAGN) to realize the posture-guided virtual human body image generation. The virtual human body image generation network (SAGN) under the posture guidance based on the human body structure can directly generate the appearance of the virtual human body image generation network according to the target posture, so that the consistency of the human body posture and the target posture of the generated image can be ensured to the maximum extent, and meanwhile, the virtual human body image generation network also has vivid human body appearance, so that a vivid human body image with a correct posture can be generated, and the virtual human body image generation under the guidance of the target posture is realized; meanwhile, a new idea is provided for solving the difficult task of generating the human body image in the target posture.
In the system, to address the problems that existing methods cannot guarantee the validity of deformation, that the generated human body image in the specified posture is of low quality, and that blurred human body postures are easily produced or the posture consistency cannot even be maintained, a human-structure-based, posture-guided virtual human body image generation network (SAGN) is introduced, which consists of a series of structure-based appearance generation modules (SAG-Blk). Each structure-based appearance generation module consists of a structure-aware adaptive normalization (SAN) sub-module and a Feature Enhancement (FE) sub-module; the structure-aware adaptive normalization (SAN) module generates stylized target posture features using a normalization method, and the Feature Enhancement (FE) module provides rich appearance information for the stylized target posture features, further enhancing their appearance features. The two sub-modules cooperate to gradually generate a realistic human body image with the correct posture, thereby realizing virtual human body image generation under the guidance of the target posture.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flowchart of a method for generating a virtual human body image under posture guidance based on a human body structure according to an embodiment of the present invention;
FIG. 2 is a schematic view of a joint of a human body according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a virtual human body image generation network (SAGN) under the guidance of human body structure-based posture in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a structure-aware adaptive normalization (SAN) module in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a partial result on a Market-1501 data set in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of partial results on the DeepFashion dataset in an embodiment of the present invention.
Detailed Description
In order to make the purpose, technical effect and technical solution of the embodiments of the present invention clearer, the following clearly and completely describes the technical solution of the embodiments of the present invention with reference to the drawings in the embodiments of the present invention; it is to be understood that the described embodiments are only some of the embodiments of the present invention. Other embodiments, which can be derived by one of ordinary skill in the art from the disclosed embodiments without inventive faculty, are intended to be within the scope of the invention.
Example 1
In embodiment 1 of the present invention, a method for generating a virtual human body image under posture guidance based on a human body structure is provided. To address the problems that existing methods cannot guarantee the validity of deformation, that the generated human body image in a specified posture is of low quality, and that blurred human body postures are easily produced or the posture consistency cannot even be maintained, the method directly generates the appearance according to the target posture so as to maximally ensure the consistency between the human body posture of the generated image and the target posture.
SAGN consists of a series of structure-based appearance generation modules (SAG-Blk); each structure-based appearance generation module consists of a structure-aware adaptive normalization (SAN) sub-module and a Feature Enhancement (FE) sub-module. The structure-aware adaptive normalization (SAN) module generates stylized target posture features using a normalization method, and the Feature Enhancement (FE) module provides rich appearance information for the stylized target posture features, further enhancing their appearance features. The two sub-modules cooperate to gradually generate a realistic human body image with the correct posture, thereby realizing virtual human body image generation under the guidance of the target posture.
The method for generating the virtual human body image under the posture guidance based on the human body structure in the embodiment 1 of the invention comprises the following steps:
step 1, acquiring a source human body image and a target human body image, and obtaining posture information of the source human body image and the target human body image from them; the source human body image and the target human body image represent different human body postures of the same appearance;
step 2, constructing a virtual human body image generation network (SAGN) guided by the posture based on the human body structure; wherein the specific steps of the virtual human body image generation network (SAGN) construction include:
constructing an Encoder (Encoder); constructing a structure-based appearance generation module (SAG-Blk); constructing a Decoder (Decoder); wherein, the specific steps of the structure-based appearance generation module construction include: constructing a structure-aware adaptive normalization (SAN) module and a Feature Enhancement (FE) module;
step 3, inputting the source human body image, the source human body posture information and the target human body posture information in the step 1 into the network constructed in the step 2 to obtain a virtual target human body image;
step 4, constructing a loss function based on the virtual target human body image obtained in the step 3 and the target human body image acquired and obtained in the step 1, and performing iterative optimization on the network constructed in the step 2; and after the preset iteration times are reached, obtaining an optimized virtual human body image generation network under the posture guidance based on the human body structure, and generating a vivid human body image with a correct posture by realizing the virtual human body image generation under the target posture guidance.
In embodiment 2 of the present invention, in step 1, the specific steps of obtaining the posture information of the source human body image and the target human body image according to the source human body image and the target human body image include:
step 1.1, carrying out posture estimation on the human body image by using a posture estimation method to obtain joint point coordinate sequences of a preset number of source human body images and joint point coordinate sequences of a target human body image;
and step 1.2, representing the human body posture information by heat map based on the human body joint point coordinate sequence obtained in the step 1.1, and obtaining source human body posture information and target human body posture information.
Illustratively, step 1.1 in the embodiment of the present invention specifically comprises: performing posture estimation on the human body images with the OpenPose posture estimation method to obtain coordinate sequences of 18 human body joint points; wherein the joint point coordinate sequence of the source human body image I_s is expressed as P(I_s) = {p_1, ..., p_K}, K = 18, and the joint point coordinate sequence of the target human body image I_t is expressed as P(I_t) = {p_1, ..., p_K}, K = 18.
In step 1.2, the method specifically comprises: based on the human body joint point coordinate sequences obtained in step 1.1, representing the human body posture information with K heat maps; wherein the source human body posture information is denoted P_s and the target human body posture information is denoted P_t.
In embodiment 3 of the present invention, in step 2, the specific step of constructing a virtual human body image generation network (SAGN) based on the posture guidance of the human body structure includes:
step 2.1, constructing an Encoder (Encoder), encoding the input target posture information, the input source human body image and the input posture information respectively, and encoding the target posture information, the input source human body image and the input posture information into target human body characteristics and source human body characteristics to obtain an Encoder;
step 2.2, constructing a structure-based appearance generation module (SAG-Blk), which updates the target human body features and source human body features from step 2.1 (or the new target human body features and new source human body features output by the previous SAG-Blk) and generates the appearance information of the target human body features from step 2.1 according to the source human body features from step 2.1, obtaining new target human body features and new source human body features; wherein a total of T = 9 cascaded SAG-Blk modules gradually generate the appearance information of the target human body features, finally obtaining 9 cascaded structure-based appearance generation modules;
and 2.3, constructing a Decoder (Decoder), decoding the new target human body characteristics output by the last SAG-Blk in the step 2.2, and generating a human body image of the target posture to obtain the Decoder.
Illustratively, step 2.1 in the embodiment of the present invention specifically comprises: encoding the target posture information P_t into the target human body features C_t with 2 downsampling convolutional layers; and encoding the source human body image I_s together with the source posture information P_s into the source human body features C_s with 2 downsampling convolutional layers.
In embodiment 4 of the present invention, in step 2.2, the specific step of constructing the structure-based appearance generating module (SAG-Blk) includes:
step 2.2.1, a structure-aware adaptive normalization (SAN) module is constructed, the target human body feature in the step 2.1 (or a new target human body feature output by the last SAG-Blk) is updated by using a normalization method, and stylized target posture features are generated; finally obtaining a structure-aware self-adaptive normalization module;
step 2.2.2, constructing a Feature Enhancement (FE) module to provide rich appearance information for the stylized target posture feature obtained in step 2.2.1, namely, the source human body feature in step 2.1 (or a new source human body feature output by the last SAG-Blk) further enhances the appearance feature of the stylized target posture feature; and finally obtaining the characteristic enhancement module.
Illustratively, step 2.2.1 in the embodiment of the present invention specifically comprises: based on the human body joint point coordinate sequence obtained in step 1.1, dividing the human body image into several human body parts and 1 background part, and obtaining L part masks; wherein the part masks of the source human body image are denoted M_s and the part masks of the target human body image are denoted M_t.
Step 2.2.1 further specifically comprises: convolving the target human body features C_t (or the new target human body features C'_t output by the previous SAG-Blk) with two convolutional layers to obtain the target posture features F_t; and convolving the source human body features C_s (or the new source human body features C'_s output by the previous SAG-Blk) with two convolutional layers to obtain the source human body features F_s.
A style vector matrix V_sty is generated from the source human body features F_s and the part masks M_s of the source human body image; wherein each row v_sty^l of V_sty is a C-dimensional vector representing the features of one part of the source human body image; specifically, mean pooling is used here to obtain the style vector of the l-th part:
v_sty^l = Pool(Resize(M_s^l) ⊙ F_s),
where Resize(·) denotes the scaling operation, which scales M_s to the same size as F_s, i.e. H' × W'; ⊙ denotes element-wise multiplication; and Pool(·) denotes the pooling operation, in which all non-zero elements are pooled.
According to the correspondence between the parts of the source human body image and the target human body image, the style vectors V_sty are inserted into the corresponding parts of the part masks M_t of the target human body image to obtain the style matrix T_sty; wherein the l-th style vector v_sty^l is inserted into the l-th mask of the target human body image by broadcasting to generate the l-th style matrix T_sty^l, and all L style matrices T_sty^l are added element by element to obtain the final style matrix T_sty. The style matrix T_sty obtained here has a human body posture consistent with the target human body posture while containing the most critical human body appearance information.
The style matrix T_sty is convolved with two convolutional layers to obtain the modulation parameters γ and β used in the normalization operation; meanwhile, the target posture features F_t are batch-normalized to obtain F_norm; finally, F_norm is modulated with γ and β to obtain the stylized target posture features F_sty:
F_sty = γ F_norm + β;
notably, the stylized target posture features F_sty obtained here preserve the posture information of the target posture while also containing the most critical human body appearance information.
In step 2.2.2 of the embodiment of the present invention, the method specifically comprises: concatenating and fusing the stylized target posture features F_sty and the source human body features F_s, and then enhancing the fused features with a Squeeze-and-Excitation operation to obtain enhanced features F_fuse; a residual module is added to further accelerate network training, yielding the final new target human body features C'_t, where the feature fusion operation fuses C_t, F_t and F_s by concatenation and addition. Finally, the new source human body features C'_s are the concatenation of the source human body features F_s and the new target human body features C'_t.
In step 2.3 of the embodiment of the present invention, the method specifically comprises: decoding the new target human body features C'_t output by the last SAG-Blk with 2 upsampling convolutional layers to generate the human body image of the target posture.
In step 4 of the embodiment of the present invention, the loss function constructed based on the virtual target human body image obtained in step 3 and the target human body image acquired in step 1 specifically comprises: an adversarial loss function, a perceptual loss function, and a loss function based on human body structure similarity.
The system for generating a virtual human body image based on posture guidance of a human body structure in embodiment 4 of the present invention includes:
the sample acquisition module is used for acquiring a source human body image and a target human body image, and obtaining the source human body posture information and the target posture information according to the source human body image and the target human body image;
a network model construction module for constructing a virtual human body image generation network (SAGN) under the guidance of the posture based on the human body structure; wherein the human body image generation network (SAGN) comprises three parts: an Encoder (Encoder), a structure-based appearance generation module (SAG-Blk), a Decoder (Decoder); wherein the structure-based appearance generation module comprises two parts: a structure-aware adaptive normalization (SAN) module and a Feature Enhancement (FE) module;
the training module is used for inputting the source human body image, the source human body posture information and the target human body posture information into a constructed network (SAGN) to obtain a virtual target human body image;
an optimization module for constructing a loss function based on the virtual target human body image and the real target human body image, and performing iterative optimization on the network (SAGN); and after the preset iteration times are reached, obtaining an optimized virtual human body image generation network under the posture guidance based on the human body structure, and generating a vivid human body image with a correct posture by realizing the virtual human body image generation under the target posture guidance.
Addressing the problems that existing methods cannot guarantee the validity of deformation, that the generated human body image in the specified posture is of low quality, and that blurred human body postures are easily produced or the posture consistency cannot even be maintained, the system introduces a human-structure-based, posture-guided virtual human body image generation network (SAGN) composed of a series of structure-based appearance generation modules (SAG-Blk). Each structure-based appearance generation module consists of a structure-aware adaptive normalization (SAN) sub-module and a Feature Enhancement (FE) sub-module. The structure-aware adaptive normalization (SAN) module generates stylized target posture features using a normalization method, and the Feature Enhancement (FE) module provides rich appearance information for the stylized target posture features, further enhancing their appearance features. The two sub-modules cooperate to gradually generate a realistic human body image with the correct posture, thereby realizing virtual human body image generation under the guidance of the target posture.
The method for generating the virtual human body image under the posture guidance based on the human body structure, disclosed by the embodiment 5 of the invention, comprises the following steps of:
step 1, obtaining posture information of a source human body image and a target human body image according to the source human body image and the target human body image:
1.1) carrying out posture estimation on the human body image by using a posture estimation method to obtain joint point coordinate sequences of a preset number of source human body images and joint point coordinate sequences of a target human body image;
1.2) based on the human body joint point coordinate sequence obtained in the step 1.1), representing human body posture information by heat map, and obtaining source human body posture information and target human body posture information.
Step 2, constructing a virtual human body image generation network (SAGN) under the guidance of the posture of the human body structure:
2.1) constructing an Encoder (Encoder), respectively encoding the input target posture information, the input source human body image and the input posture information into target human body characteristics and source human body characteristics to obtain an Encoder;
2.2) constructing a structure-based appearance generation module (SAG-Blk), which updates the target human body features and source human body features from step 2.1) (or the new target human body features and new source human body features output by the previous SAG-Blk) and generates the appearance information of the target human body features from step 2.1) according to the source human body features from step 2.1), obtaining new target human body features and new source human body features; wherein a total of T = 9 cascaded SAG-Blk modules gradually generate the appearance information of the target human body features, finally obtaining 9 cascaded structure-based appearance generation modules;
2.3) constructing a Decoder (Decoder), decoding the new target human body characteristics output by the last SAG-Blk in the step 2.2), and generating a human body image of the target posture to obtain the Decoder.
Step 3, generating a target posture human body image:
1) organizing data input into a convolutional neural network;
2) and generating a target posture human body image by using the convolutional neural network (SAGN) constructed in the step 2.
Step 4, constructing a loss function of a convolutional neural network (SAGN):
4.1) constructing an adversarial loss function;
4.2) constructing a perception loss function;
4.3) constructing a loss function based on the similarity of human body structures;
step 5, optimizing network parameters, and realizing the generation of the virtual human body image under the guidance of the target posture:
5.1) carrying out iterative optimization on the network parameters constructed in the step 2 according to the loss function obtained in the step 4;
5.2) when the preset iteration number is reached, generating the virtual human body image under the guidance of the target posture by using the convolutional neural network (SAGN) constructed in the step 2.
Addressing the problems that existing methods cannot guarantee the validity of deformation, that the generated human body image in the specified posture is of low quality, and that blurred human body postures are easily produced or the posture consistency cannot even be maintained, the human-structure-based, posture-guided virtual human body image generation method introduces a human-structure-based, posture-guided virtual human body image generation network (SAGN) composed of a series of structure-based appearance generation modules (SAG-Blk). Each structure-based appearance generation module consists of a structure-aware adaptive normalization (SAN) sub-module and a Feature Enhancement (FE) sub-module. The structure-aware adaptive normalization (SAN) module generates stylized target posture features using a normalization method, and the Feature Enhancement (FE) module provides rich appearance information for the stylized target posture features, further enhancing their appearance features. The two sub-modules cooperate to gradually generate a realistic human body image with the correct posture, thereby realizing virtual human body image generation under the guidance of the target posture.
Referring to fig. 1, a method for generating a virtual human body image under posture guidance based on a human body structure according to an embodiment of the present invention includes the following steps:
step 1, obtaining posture information of a source human body image and a target human body image according to the source human body image and the target human body image:
1.1) performing posture estimation on the human body images with a posture estimation method to obtain the K = 18 joint point coordinates of the source human body image and the K = 18 joint point coordinates of the target human body image.
In the embodiment of the invention, the OpenPose posture estimation method is used to perform posture estimation on the human body images, obtaining coordinate sequences of 18 human body joint points; wherein the joint point coordinate sequence of the source human body image I_s is expressed as P(I_s) = {p_1, ..., p_K}, K = 18, and the joint point coordinate sequence of the target human body image I_t is expressed as P(I_t) = {p_1, ..., p_K}, K = 18; fig. 2 is a schematic diagram of the 18 joint points.
1.2) based on the human body joint point coordinate sequence obtained in the step 1.1), representing human body posture information by heat map, and obtaining source human body posture information and target human body posture information.
In the embodiment of the invention, in order to exploit the spatial characteristics of the human body joint point coordinates, K = 18 heat maps are used to represent the human body posture information; wherein the source human body posture information is denoted P_s and the target human body posture information is denoted P_t.
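For illustration only (the original disclosure contains no source code), the following is a minimal sketch of how the K = 18 joint point coordinates could be turned into K heat maps; the Gaussian formulation, the sigma value and the function name are assumptions rather than details specified in the patent.
import numpy as np

def joints_to_heatmaps(joints, height, width, sigma=6.0):
    # joints: array of shape (K, 2) holding (x, y) pixel coordinates; negative values mark an undetected joint
    K = joints.shape[0]
    heatmaps = np.zeros((K, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for k, (x, y) in enumerate(joints):
        if x < 0 or y < 0:
            continue  # joint not detected by the pose estimator
        heatmaps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return heatmaps

# Example (assumed image size): P_s = joints_to_heatmaps(joints_s, 256, 176) for the 18 OpenPose joints of the source image.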
Step 2, constructing a virtual human body image generation network (SAGN) under the guidance of the posture of the human body structure:
the virtual human body image generation network (SAGN) guided by the posture based on the human body structure is composed of an encoder, a structure-based appearance generation module and a decoder; fig. 3 is a schematic diagram of a virtual human body image generation network (SAGN) structure under the guidance of a human body structure-based posture.
2.1) constructing an Encoder (Encoder), and respectively encoding the input target attitude information and the input source human body image and attitude information.
In an embodiment of the present invention, the target posture information P_t is encoded into the target human body features C_t with 2 downsampling convolutional layers, and the source human body image I_s together with the source posture information P_s is encoded into the source human body features C_s with 2 downsampling convolutional layers.
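As an illustrative sketch of such a two-branch encoder, the following assumes PyTorch; the channel widths, normalization layers and activation functions are assumptions not specified in the text.
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    # one downsampling convolutional layer (stride 2) with assumed normalization and activation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Encoder(nn.Module):
    def __init__(self, k_joints=18, base=64):
        super().__init__()
        # target branch: K pose heat maps -> target human body features C_t
        self.target_enc = nn.Sequential(down_block(k_joints, base), down_block(base, 2 * base))
        # source branch: 3-channel image concatenated with K pose heat maps -> source human body features C_s
        self.source_enc = nn.Sequential(down_block(3 + k_joints, base), down_block(base, 2 * base))

    def forward(self, p_t, i_s, p_s):
        c_t = self.target_enc(p_t)
        c_s = self.source_enc(torch.cat([i_s, p_s], dim=1))
        return c_t, c_s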
2.2) building a structure-based appearance generation module (SAG-Blk).
In the embodiment of the invention, the human-structure-based, posture-guided virtual human body image generation network (SAGN) contains a total of T = 9 structure-based appearance generation modules (SAG-Blk), each of which consists of a structure-aware adaptive normalization (SAN) module and a Feature Enhancement (FE) module. The structure-aware adaptive normalization (SAN) module generates stylized target posture features using a normalization method, and the Feature Enhancement (FE) module provides rich appearance information for the stylized target posture features, further enhancing their appearance features. The two sub-modules cooperate to gradually generate a realistic human body image with the correct posture, thereby realizing virtual human body image generation under the guidance of the target posture.
2.2.1) constructing a structure-aware adaptive normalization (SAN) module.
In the embodiment of the invention, based on the human body joint point coordinate sequence obtained in step 1.1), the human body image is divided into 10 human body parts and 1 background part, including the head, left (right) upper arm, left (right) lower arm, left (right) thigh, left (right) shank, torso and background, and L part masks are obtained; wherein the part masks of the source human body image are denoted M_s and the part masks of the target human body image are denoted M_t.
Fig. 4 is a schematic diagram of the structure-aware adaptive normalization (SAN) module. In the embodiment of the invention, the target human body features C_t (or the new target human body features C'_t output by the previous SAG-Blk) are convolved with two convolutional layers to obtain the target posture features F_t, and the source human body features C_s (or the new source human body features C'_s output by the previous SAG-Blk) are convolved with two convolutional layers to obtain the source human body features F_s.
A style vector matrix V_sty is generated from the source human body features F_s and the part masks M_s of the source human body image; wherein each row v_sty^l of V_sty is a C-dimensional vector representing the features of one part of the source human body image; specifically, mean pooling is used here to obtain the style vector of the l-th part:
v_sty^l = Pool(Resize(M_s^l) ⊙ F_s),
where Resize(·) denotes the scaling operation, which scales M_s to the same size as F_s, i.e. H' × W'; ⊙ denotes element-wise multiplication; and Pool(·) denotes the pooling operation, in which all non-zero elements are pooled.
According to the correspondence between the parts of the source human body image and the target human body image, the style vectors V_sty are inserted into the corresponding parts of the part masks M_t of the target human body image to obtain the style matrix T_sty; wherein the l-th style vector v_sty^l is inserted into the l-th mask of the target human body image by broadcasting to generate the l-th style matrix T_sty^l, and all L style matrices T_sty^l are added element by element to obtain the final style matrix T_sty. The style matrix T_sty obtained here has a human body posture consistent with the target human body posture while containing the most critical human body appearance information.
The style matrix T_sty is convolved with two convolutional layers to obtain the modulation parameters γ and β used in the normalization operation; meanwhile, the target posture features F_t are batch-normalized to obtain F_norm; finally, F_norm is modulated with γ and β to obtain the stylized target posture features F_sty:
F_sty = γ F_norm + β;
notably, the stylized target posture features F_sty obtained here preserve the posture information of the target posture while also containing the most critical human body appearance information.
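By way of illustration, a minimal sketch of this structure-aware adaptive normalization step is given below; PyTorch is assumed, and the mask resizing mode, the two-convolution heads producing γ and β, and all class and argument names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureAwareNorm(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)   # batch normalization of F_t
        self.to_gamma = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(True),
                                      nn.Conv2d(channels, channels, 3, padding=1))
        self.to_beta = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(True),
                                     nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, f_t, f_s, m_s, m_t):
        # f_t, f_s: (B, C, H', W') features; m_s, m_t: (B, L, H, W) binary part masks as float tensors
        b, c, h, w = f_s.shape
        m_s = F.interpolate(m_s, size=(h, w), mode="nearest")   # Resize(M_s) to the size of F_s
        m_t = F.interpolate(m_t, size=(h, w), mode="nearest")
        # masked mean pooling: one C-dimensional style vector per part (the rows of V_sty)
        area = m_s.sum(dim=(2, 3)).clamp(min=1.0)               # (B, L), avoids division by zero
        v_sty = torch.einsum("blhw,bchw->blc", m_s, f_s) / area.unsqueeze(-1)
        # broadcast each style vector into the matching target part mask and sum over parts (T_sty)
        t_sty = torch.einsum("blhw,blc->bchw", m_t, v_sty)
        gamma, beta = self.to_gamma(t_sty), self.to_beta(t_sty) # modulation parameters
        return gamma * self.norm(f_t) + beta                    # F_sty = gamma * F_norm + beta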
2.2.2) constructing a Feature Enhancement (FE) module.
The stylized target posture features F_sty and the source human body features F_s are concatenated and fused, and the fused features are then enhanced with a Squeeze-and-Excitation operation to obtain the enhanced features F_fuse; a residual module is added to further accelerate network training, yielding the final new target human body features C'_t, where the feature fusion operation fuses C_t, F_t and F_s by concatenation and addition. Finally, the new source human body features C'_s are the concatenation of the source human body features F_s and the new target human body features C'_t.
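For illustration, a minimal sketch of this feature enhancement step is given below, assuming a standard Squeeze-and-Excitation block; the channel reduction ratio, the fusion convolution and the exact residual form are assumptions.
import torch
import torch.nn as nn

class FeatureEnhance(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        fused = 2 * channels                       # concatenation of F_sty and F_s
        self.se = nn.Sequential(                   # Squeeze-and-Excitation on the fused features
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, 1), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(fused, channels, 3, padding=1)

    def forward(self, f_sty, f_s, c_t):
        x = torch.cat([f_sty, f_s], dim=1)         # concatenate and fuse
        f_fuse = self.fuse(x * self.se(x))         # SE-weighted enhanced features F_fuse
        return c_t + f_fuse                        # residual connection yields the new C'_t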
And 2.3) constructing a Decoder (Decoder), decoding the new target human body characteristics output by the last SAG-Blk in the step 2.2), and generating a human body image of the target posture.
In an embodiment of the invention, the new target human body features C'_t output by the last SAG-Blk in step 2.2) are decoded with 2 upsampling convolutional layers to generate the human body image of the target posture.
Step 3, generating a target posture human body image:
1) data input to the convolutional neural network is organized.
The data input into the network is divided into two parts, one part is the target human body posture information expressed by heat map obtained in step 1, and the other part is the source human body image and the source human body posture information expressed by heat map.
2) And generating a target posture human body image by using the convolutional neural network (SAGN) constructed in the step 2.
And inputting the organized data into a network to generate a human body image of the target posture.
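The following sketch shows, for illustration, how the pieces could be assembled end to end, reusing the Encoder, StructureAwareNorm and FeatureEnhance sketches above; the channel width, the per-block convolutions producing F_t and F_s, and the decoder details are assumptions.
import torch
import torch.nn as nn

class SAGN(nn.Module):
    def __init__(self, channels=128, num_blocks=9, k_joints=18):
        super().__init__()
        self.encoder = Encoder(k_joints=k_joints, base=channels // 2)   # from the encoder sketch above
        self.to_ft = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_blocks)])
        self.to_fs = nn.ModuleList([nn.Conv2d(channels if i == 0 else 2 * channels, channels, 3, padding=1)
                                    for i in range(num_blocks)])
        self.san = nn.ModuleList([StructureAwareNorm(channels) for _ in range(num_blocks)])
        self.fe = nn.ModuleList([FeatureEnhance(channels) for _ in range(num_blocks)])
        self.decoder = nn.Sequential(                                   # 2 upsampling convolutional layers
            nn.Upsample(scale_factor=2), nn.Conv2d(channels, channels // 2, 3, padding=1), nn.ReLU(True),
            nn.Upsample(scale_factor=2), nn.Conv2d(channels // 2, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, i_s, p_s, p_t, m_s, m_t):
        c_t, c_s = self.encoder(p_t, i_s, p_s)
        for to_ft, to_fs, san, fe in zip(self.to_ft, self.to_fs, self.san, self.fe):
            f_t, f_s = to_ft(c_t), to_fs(c_s)
            f_sty = san(f_t, f_s, m_s, m_t)        # structure-aware adaptive normalization
            c_t = fe(f_sty, f_s, c_t)              # feature enhancement -> new C'_t
            c_s = torch.cat([f_s, c_t], dim=1)     # new C'_s = concat(F_s, C'_t)
        return self.decoder(c_t)                   # generated target-posture image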
Step 4, constructing a loss function of a convolutional neural network (SAGN):
In the embodiment of the invention, a combination of an adversarial loss function, a perceptual loss function and a loss function based on human body structure similarity is used as the loss function of the convolutional neural network (SAGN) provided by the invention.
The adversarial loss function uses discriminators to measure the distance between the real image distribution and the generated image distribution, and to continuously reduce the distance between the two distributions. The invention constructs two discriminators, an appearance discriminator and a posture discriminator, which are used to ensure the appearance consistency and the posture consistency between the real image I_t and the generated image; the adversarial objective is formulated over the distributions of the human body posture images and the real human body images and over the human body image generation network proposed by the invention. More details can be found in 'Progressive pose attention transfer for person image generation'.
The perceptual loss function is used to measure the similarity between the feature maps of the real image and the generated image; usually the L1 distance between the two feature maps is computed as the perceptual loss, namely
L_per = ||φ_i(I_t) − φ_i(Î_t)||_1,
where Î_t denotes the generated image and φ_i is the output of the i-th layer of a pre-trained network; the feature map output by the conv1_2 layer of a VGG-19 network pre-trained on ImageNet is usually adopted. See 'Perceptual losses for real-time style transfer and super-resolution' for more details.
The loss function based on human body structure similarity is used to measure the structural similarity of each human body part between the real image and the generated image. Accurately measuring the similarity of each human body part brings clear human body boundaries and detailed texture features to the virtual human body image. It is defined in terms of per-part structural similarities, where MSSIM(·,·) is the structural similarity between I_t and the generated image excluding the background (part 0), and SSIM_l(·,·) is the structural similarity of the l-th part of the human body image. See 'Loss Functions for Person Image Generation' for more details.
4.1) constructing an adversarial loss function.
'Generative adversarial networks' achieves good results in image generation by using the adversarial loss function, and the adversarial loss function from that paper is adopted as one of the loss functions of the convolutional neural network (SAGN) proposed by the present invention.
4.2) constructing a perception loss function.
"Perceptual losses for real-time style transfer and super-resolution" achieves better effect in style migration by using the Perceptual loss function, and the Perceptual loss function in the paper is taken as one of the loss functions of the convolutional neural network (SAGN) proposed by the invention for reference.
4.3) constructing a loss function based on the similarity of human body structures.
'Loss functions for person image generation' uses a loss function based on human body structure similarity to effectively compute the structural similarity of each part of the human body, achieving good results in human body image generation.
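For illustration, a minimal sketch of the loss combination described above is given below; the discriminator inputs, the loss weights and the VGG layer slicing are assumptions, and the per-part SSIM term is passed in from an external routine because its exact formulation is given only in the cited reference.
import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    # L1 distance between VGG-19 feature maps (conv1_2), as described above
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg19(pretrained=True).features   # newer torchvision uses weights="IMAGENET1K_V1"
        self.slice = nn.Sequential(*list(vgg.children())[:4]).eval()  # through conv1_2 (plus ReLU)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, fake, real):
        return (self.slice(fake) - self.slice(real)).abs().mean()

def discriminator_loss(d_app, d_pose, real, fake, i_s, p_t):
    # appearance discriminator conditioned on the source image, posture discriminator on the target pose heat maps
    bce = nn.functional.binary_cross_entropy_with_logits
    loss = 0.0
    for d, cond in ((d_app, i_s), (d_pose, p_t)):
        real_pred = d(torch.cat([real, cond], dim=1))
        fake_pred = d(torch.cat([fake, cond], dim=1))
        loss = loss + bce(real_pred, torch.ones_like(real_pred)) + bce(fake_pred, torch.zeros_like(fake_pred))
    return loss

def generator_loss(fake, real, i_s, p_t, d_app, d_pose, perceptual,
                   l_structure=None, w_adv=1.0, w_per=1.0, w_str=1.0):
    bce = nn.functional.binary_cross_entropy_with_logits
    pred_app = d_app(torch.cat([fake, i_s], dim=1))
    pred_pose = d_pose(torch.cat([fake, p_t], dim=1))
    loss = w_adv * (bce(pred_app, torch.ones_like(pred_app)) + bce(pred_pose, torch.ones_like(pred_pose)))
    loss = loss + w_per * perceptual(fake, real)
    if l_structure is not None:                    # per-part SSIM term computed by an external routine
        loss = loss + w_str * l_structure
    return loss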
Step 5, optimizing network parameters, and realizing the generation of the virtual human body image under the guidance of the target posture:
5.1) carrying out iterative optimization on the network parameters constructed in the step 2 according to the loss function obtained in the step 4;
The Adam optimizer is used for 90k iterations, with β_1 = 0.5 and β_2 = 0.999.
5.2) when the preset iteration number is reached, generating the virtual human body image under the guidance of the target posture by using the convolutional neural network (SAGN) constructed in the step 2.
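As an illustrative sketch of the optimization loop, the following reuses the SAGN generator and the loss sketches above; the learning rate, the batch contents and the discriminator networks are assumptions (the text specifies only Adam with β_1 = 0.5, β_2 = 0.999 for 90k iterations).
import itertools
import torch

def train(generator, d_app, d_pose, loader, device="cuda", lr=2e-4, num_iters=90000):
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
    d_opt = torch.optim.Adam(itertools.chain(d_app.parameters(), d_pose.parameters()),
                             lr=lr, betas=(0.5, 0.999))
    perceptual = PerceptualLoss().to(device)        # from the loss sketch above
    data = itertools.cycle(loader)                  # loader is assumed to yield (I_s, P_s, P_t, I_t, M_s, M_t) batches
    for step in range(num_iters):
        i_s, p_s, p_t, i_t, m_s, m_t = (x.to(device) for x in next(data))
        fake = generator(i_s, p_s, p_t, m_s, m_t)

        d_opt.zero_grad()                           # discriminator update (standard GAN alternation assumed)
        d_loss = discriminator_loss(d_app, d_pose, i_t, fake.detach(), i_s, p_t)
        d_loss.backward()
        d_opt.step()

        g_opt.zero_grad()                           # generator update
        g_loss = generator_loss(fake, i_t, i_s, p_t, d_app, d_pose, perceptual)
        g_loss.backward()
        g_opt.step()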
In summary, the method of the invention provides a human-structure-based, posture-guided virtual human body image generation network for a source human body image and an arbitrary target human body posture image. First, posture estimation is performed on the input human body images to obtain the joint point coordinate sequences of the human body images. Then, a human-structure-based, posture-guided virtual human body image generation network (SAGN) is constructed, comprising an encoder, structure-based appearance generation modules (SAG-Blk) and a decoder, where each structure-based appearance generation module (SAG-Blk) consists of a structure-aware adaptive normalization (SAN) sub-module and a Feature Enhancement (FE) sub-module. Next, the loss function of the convolutional neural network is constructed, comprising an adversarial loss function, a perceptual loss function and a loss function based on human body structure similarity. Finally, the proposed convolutional neural network (SAGN) is jointly optimized with the loss functions to generate virtual human body images under the guidance of the target posture. Qualitative and quantitative comparative experiments against existing methods verify the effectiveness of the method on two public datasets, Market-1501 and DeepFashion.
Tables 1a and 1b are the results of the quantitative experiments of the present invention, respectively, with Table 1a being the results of the method under the Market-1501 data set and Table 1b being the results of the method under the DeepFashion data set.
TABLE 1a Experimental results of this method under Market-1501 data set
TABLE 1b Experimental results of this method under the DeepFashion data set
SSIM, IS and DS are common metrics for measuring image generation quality; the larger their values, the more realistic and higher-quality the generated images. FID and LPIPS are also common metrics for measuring image generation quality; the smaller their values, the more realistic and higher-quality the generated images. As can be seen from Table 1a, on the Market-1501 dataset the images generated by this method reach the best value on all metrics, in particular SSIM (structural similarity), which reaches the highest value of 0.321. As can be seen from Table 1b, on the DeepFashion dataset the images generated by this method reach the second-best level on SSIM, IS, DS and LPIPS, achieving a reliable image generation effect. Therefore, from the quantitative results, the virtual human body image generation method based on structural similarity can generate more realistic virtual human body images.
Figs. 5 and 6 show the qualitative experimental results of the present invention. Fig. 5 shows images generated by the present invention on the Market-1501 data set; it can be seen that the method generates virtual human body images with clear human body postures and realistic appearance. Especially in the case of large pose changes, the virtual human body images generated by the method still maintain the correct body pose (e.g., rows 2, 4 and 5). Fig. 6 shows images generated by the present invention on the DeepFashion data set; it can be seen that human body images generated by other methods tend to contain artificial traces, while the present method still maintains the correct human body posture and realistic appearance. Notably, even when the target posture is very complex, the method preserves the correctness and integrity of the generated virtual human body's posture, which makes the generated images look very realistic. Therefore, from the qualitative results, the virtual human body image generation method under posture guidance based on the human body structure can generate realistic human body images with correct postures.
In summary, the invention discloses a new method, system and electronic device for generating a virtual human body image under posture guidance based on the human body structure, belonging to the intersection of computer vision and computer graphics. The invention constructs a posture-guided virtual human body image generation network (SAGN) based on the human body structure, comprising an encoder, a structure-based appearance generation module (SAG-Blk) and a decoder, where the SAG-Blk consists of a structure-aware adaptive normalization (SAN) sub-module and a feature enhancement (FE) sub-module; a loss function of the convolutional neural network is then constructed, comprising an adversarial loss, a perceptual loss and a loss based on human body structure similarity; finally, the proposed convolutional neural network (SAGN) is jointly optimized with these losses to generate the virtual human body image under the guidance of the target posture. The invention can generate realistic virtual human body images with correct postures.
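As a purely illustrative companion to this description, the following PyTorch sketch shows one plausible reading of the SAN sub-module: pooling a style vector per source body part, broadcasting it into the matching target part mask, and predicting modulation parameters for a normalized target feature. Layer sizes, naming and network depth are assumptions and do not reproduce the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureAwareNorm(nn.Module):
    """Sketch of structure-aware adaptive normalization (assumed layout)."""
    def __init__(self, ch):
        super().__init__()
        self.norm = nn.BatchNorm2d(ch, affine=False)
        self.to_gamma = nn.Conv2d(ch, ch, 3, padding=1)
        self.to_beta = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, f_s, f_t, src_masks, tgt_masks):
        # f_s, f_t: (B, C, H, W) source / target features
        # src_masks, tgt_masks: (B, L, H', W') binary part masks
        B, C, H, W = f_s.shape
        m_s = F.interpolate(src_masks, size=(H, W), mode='nearest')   # Resize(.)
        m_t = F.interpolate(tgt_masks, size=(H, W), mode='nearest')
        t_sty = f_s.new_zeros(B, C, H, W)
        for l in range(m_s.shape[1]):
            area = m_s[:, l:l + 1].sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
            # mean-pooled style vector of the l-th source part
            v_l = (f_s * m_s[:, l:l + 1]).sum(dim=(2, 3), keepdim=True) / area
            # broadcast it into the l-th target part mask and accumulate
            t_sty = t_sty + v_l * m_t[:, l:l + 1]
        gamma, beta = self.to_gamma(t_sty), self.to_beta(t_sty)
        return gamma * self.norm(f_t) + beta    # stylized target pose feature
```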
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art can make modifications and equivalents to the embodiments of the present invention without departing from the spirit and scope of the present invention, which is set forth in the claims of the present application.

Claims (10)

1. A virtual human body image generation method is characterized by comprising the following steps:
inputting a source human body image and a target posture image into a pre-trained virtual human body image generation network, and obtaining a target posture human body image output by the virtual human body image generation network;
wherein, the virtual human body image generation network is a convolution neural network, and comprises:
the encoder is used for inputting the source human body image and the target posture image, and encoding to obtain and output source human body characteristics and target human body characteristics;
the structure-based appearance generation module is used for inputting and updating the source human body characteristics and the target human body characteristics, acquiring and outputting the updated source human body characteristics and the updated target human body characteristics;
and the decoder is used for inputting the target human body characteristics output by the structure-based appearance generation module and decoding to obtain a target posture human body image.
2. The virtual human body image generation method according to claim 1, wherein the step of acquiring the trained virtual human body image generation network specifically includes:
acquiring a sample data set; each sample data in the sample data set comprises source human body image sample data, target human body image sample data, source human body posture sample data and target human body posture sample data;
inputting source human body image sample data, source human body posture sample data and target human body posture sample data in selected sample data of the sample data set into the virtual human body image generation network to obtain virtual target human body image data; constructing a loss function based on the virtual target human body image data and target human body image sample data in the selected sample data, and performing iterative optimization on the virtual human body image generation network;
and obtaining the trained virtual human body image generation network after reaching the preset iteration times or convergence conditions.
3. The virtual human body image generation method according to claim 1, wherein the structure-based appearance generation module comprises:
the structure perception self-adaptive normalization module is used for inputting the source human body characteristics and the target human body characteristics, generating stylized target posture characteristics and outputting the stylized target posture characteristics;
and the characteristic enhancement module is used for inputting the generated stylized target posture characteristic and the source human body characteristic and outputting the updated source human body characteristic and the updated target human body characteristic.
4. The virtual human body image generation method according to claim 2, wherein,
in the sample data set, the step of acquiring the source human body posture sample data and the target human body posture sample data in each sample data comprises: carrying out posture estimation on the human body image by adopting the OpenPose posture estimation method to obtain a coordinate sequence of 18 human body joint points; wherein the joint point coordinate sequence of the source human body image Is is expressed as P(Is) = {p1, …, pK}, K = 18, and the joint point coordinate sequence of the target human body image It is expressed as P(It) = {p1, …, pK}, K = 18; based on the obtained coordinate sequences of the human body joint points, K heat maps are used for representing the human body posture information, wherein the source human body posture information is expressed as Ps and the target human body posture information is expressed as Pt;
in the encoder, the step of encoding to obtain the source human body features and the target human body features specifically comprises: encoding the target posture information Pt into the target human body feature Ct with 2 down-sampling convolutional layers, and encoding the source human body image Is together with the source posture information Ps into the source human body feature Cs with 2 down-sampling convolutional layers.
5. The virtual human body image generation method according to claim 4, wherein the step of inputting and updating the source human body features and the target human body features in the structure-based appearance generation module to obtain the updated source human body features and the updated target human body features specifically comprises:
dividing the human body image into a plurality of human body parts and 1 background part based on the obtained human body joint point coordinate sequence to obtain L part masks; wherein the part masks of the source human body image are expressed as Ms and the part masks of the target human body image are expressed as Mt;
convolving the target human body feature Ct with two convolutional layers to obtain the target posture feature Ft, and convolving the source human body feature Cs with two convolutional layers to obtain the source feature Fs;
generating a style vector Vsty according to the source feature Fs and the part masks Ms of the source human body image; wherein each row Vsty^l of Vsty is a C-dimensional vector representing the feature of one part of the source human body image; the style vector of the l-th part is obtained by mean pooling: Vsty^l = Pool(Resize(Ms^l) ⊙ Fs), where Resize(·) represents a scaling operation, ⊙ represents element-by-element multiplication, and Pool(·) represents pooling;
according to the correspondence between the parts of the source human body image and the parts of the target human body image, inserting the style vectors Vsty into the corresponding parts of the part masks Mt of the target human body image to obtain a style matrix Tsty; wherein the l-th style vector Vsty^l is inserted into the l-th mask of the target human body image by broadcasting to generate the l-th style matrix Tsty^l, and all L style matrices Tsty^1, …, Tsty^L are added element by element to obtain the final style matrix Tsty;
convolving the style matrix Tsty with two convolutional layers to obtain the modulation parameters γ and β used in the normalization operation; carrying out batch normalization on the target posture feature Ft to obtain Fnorm; modulating Fnorm with γ and β to obtain the stylized target posture feature Fsty: Fsty = γ·Fnorm + β;
splicing and fusing the stylized target posture feature Fsty and the source feature Fs, and then enhancing the fused feature by a Squeeze-and-Excitation operation to obtain the enhanced feature Ffuse;
obtaining the updated target human body feature C't through a feature fusion operation that fuses Ct, Ft and Fs in a splicing and adding manner;
the updated source human body feature C's is the splicing of the source feature Fs and the updated target human body feature C't.
6. The virtual human body image generation method according to claim 5, wherein the loss function comprises: an adversarial loss function, a perceptual loss function, and a loss function based on human body structure similarity.
7. The virtual human body image generation method according to claim 1, wherein the structure-based appearance generation module is replaced with an integrated appearance generation module;
the integrated appearance generation module is composed of a plurality of the structure-based appearance generation modules connected in cascade.
8. A virtual human body image generation system, comprising:
the image generation module is used for inputting the source human body image and the target posture image into a pre-trained virtual human body image generation network and obtaining a target posture human body image output by the virtual human body image generation network;
wherein, the virtual human body image generation network is a convolution neural network, and comprises:
the encoder is used for inputting a source human body image and a target posture image, and encoding to obtain a source human body characteristic and a target human body characteristic;
the structure-based appearance generation module is used for inputting and updating the source human body characteristics and the target human body characteristics to obtain updated source human body characteristics and target human body characteristics;
and the decoder is used for inputting the target human body characteristics output by the structure-based appearance generation module and decoding to obtain a target posture human body image.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the virtual human body image generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing at least one instruction which, when executed by a processor, implements the virtual human body image generation method according to any one of claims 1 to 7.
CN202110865481.2A 2021-07-29 2021-07-29 Virtual human body image generation method, system, equipment and medium Active CN113592971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110865481.2A CN113592971B (en) 2021-07-29 2021-07-29 Virtual human body image generation method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110865481.2A CN113592971B (en) 2021-07-29 2021-07-29 Virtual human body image generation method, system, equipment and medium

Publications (2)

Publication Number Publication Date
CN113592971A true CN113592971A (en) 2021-11-02
CN113592971B CN113592971B (en) 2024-04-16

Family

ID=78252264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110865481.2A Active CN113592971B (en) 2021-07-29 2021-07-29 Virtual human body image generation method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN113592971B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679046B1 (en) * 2016-11-29 2020-06-09 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. Machine learning systems and methods of estimating body shape from images
WO2020168844A1 (en) * 2019-02-19 2020-08-27 Boe Technology Group Co., Ltd. Image processing method, apparatus, equipment, and storage medium
CN110852941A (en) * 2019-11-05 2020-02-28 中山大学 Two-dimensional virtual fitting method based on neural network
CN112116673A (en) * 2020-07-29 2020-12-22 西安交通大学 Virtual human body image generation method and system based on structural similarity under posture guidance and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张婧; 孙金根; 陈亮; 刘韵婷: "Single-person multi-pose image generation method based on unsupervised learning", 光电技术应用, no. 02 *
陈佳宇; 钟跃崎; 余志才: "Three-dimensional human body model reconstruction based on binary images", 毛纺科技, no. 09 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821811A (en) * 2022-06-21 2022-07-29 平安科技(深圳)有限公司 Method and device for generating person composite image, computer device and storage medium
CN114821811B (en) * 2022-06-21 2022-09-30 Ping An Technology (Shenzhen) Co., Ltd. Method and device for generating person composite image, computer device and storage medium

Also Published As

Publication number Publication date
CN113592971B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN109255831B (en) Single-view face three-dimensional reconstruction and texture generation method based on multi-task learning
CN116977522A (en) Rendering method and device of three-dimensional model, computer equipment and storage medium
CN115914505B (en) Video generation method and system based on voice-driven digital human model
CN113160035A (en) Human body image generation method based on posture guidance, style and shape feature constraints
CN113570685A (en) Image processing method and device, electronic device and storage medium
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN110751733A (en) Method and apparatus for converting 3D scanned object into avatar
CN113362422A (en) Shadow robust makeup transfer system and method based on decoupling representation
CN111462274A (en) Human body image synthesis method and system based on SMP L model
CN115049556A (en) StyleGAN-based face image restoration method
CN115018989B (en) Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment
CN112819951A (en) Three-dimensional human body reconstruction method with shielding function based on depth map restoration
CN117635771A (en) Scene text editing method and device based on semi-supervised contrast learning
CN117237542B (en) Three-dimensional human body model generation method and device based on text
CN113592971B (en) Virtual human body image generation method, system, equipment and medium
CN117593178A (en) Virtual fitting method based on feature guidance
CN112116673B (en) Virtual human body image generation method and system based on structural similarity under posture guidance and electronic equipment
CN116934972B (en) Three-dimensional human body reconstruction method based on double-flow network
CN111311732A (en) 3D human body grid obtaining method and device
CN116863044A (en) Face model generation method and device, electronic equipment and readable storage medium
CN114092610B (en) Character video generation method based on generation of confrontation network
CN116978057A (en) Human body posture migration method and device in image, computer equipment and storage medium
Motegi et al. Human motion generative model using variational autoencoder
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception
CN117893642B (en) Face shape remodelling and facial feature exchanging face changing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant