CN115482062A - Virtual fitting method and device based on image generation

Virtual fitting method and device based on image generation

Info

Publication number
CN115482062A
CN115482062A (Application No. CN202211141675.9A)
Authority
CN
China
Prior art keywords: image, clothes, human body, network, posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211141675.9A
Other languages
Chinese (zh)
Inventor
Song Dan (宋丹)
Jiang Xuejing (蒋雪静)
Liu Anan (刘安安)
Nie Weizhi (聂为之)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202211141675.9A
Publication of CN115482062A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0641: Shopping interfaces
    • G06Q30/0643: Graphical representation of items or shoppers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods


Abstract

The invention discloses a virtual fitting method and device based on image generation. The method comprises the following steps: taking the clothed human body image and the standard clothes image as the inputs of the virtual fitting; for the clothed human body image, constructing a data set by combining a parameterized human body model with a semantic body-shape model, and using a conditional generation network to deform the two-dimensional body shape according to the change in body measurements; for the standard clothes image, constructing a clothes image segmentation data set oriented to the human body pose, parsing the clothes image according to the pose, and deforming the clothes image piece by piece; and fusing the clothes-independent human body representation with the deformed clothes image to train a virtual fitting network that generates a more realistic fitting image. The device comprises a processor and a memory. For the clothed human body image, the invention constructs a data set by combining a parameterized human body model with a semantic body-shape model and uses a conditional generation network to deform the two-dimensional body shape according to the change in body measurements, thereby alleviating the insufficient body-shape diversity and unbalanced distribution of existing fitting image data sets.

Description

Virtual fitting method and device based on image generation
Technical Field
The invention relates to the field of image generation, in particular to a virtual fitting method and device based on image generation.
Background
In recent years, electronic commerce has developed rapidly and more and more people shop online. Clothing tops all commodity categories in online shopping transaction volume thanks to characteristics such as strong product differentiation and convenient transportation. However, its return rate is extremely high, mainly because garments fit poorly or the wearing effect does not meet expectations. Virtual fitting technology can recommend clothes according to body measurements and intuitively present the wearing effect, greatly improving the sense of realism and immersion when users buy clothes online, thereby reducing the return rate and the economic loss caused by returns.
Image-based virtual fitting methods generate the fitting result from a clothes-independent human body representation; two existing works [1][2] represent the human body as pose, body shape, face, and hair. They extract human body pose points following reference [3] and parse the face, hair, clothes, skin and other parts from the clothed human body image using reference [4]. However, this kind of body-shape representation cannot completely remove the influence of the original clothes. For pose-guided garment image deformation, Han et al. [1] and Wang et al. [2] adopt thin-plate spline functions to model the deformation of the clothes image. However, both treat the garment as a whole and deform it in the image-plane (surface) space, so they cannot depict the clothes deformation caused by rigid pose changes of the human body. As for fitting image data sets, because most models on shopping websites are slim, existing fitting image data sets suffer from insufficient body-shape diversity and unbalanced distribution.
In summary, image-based virtual fitting still has several problems to be solved:
1) enhancing the body-measurement diversity of clothed human body image data sets; 2) a garment image deformation method guided by the human body pose; and 3) constructing a virtual fitting system based on image generation.
Disclosure of Invention
The invention provides a virtual fitting method and device based on image generation. A data set is constructed with a semantic body-shape model, and the relation between changes in body measurements and changes in the two-dimensional body shape is learned from this data set, so that the body shape in a clothed image can be changed efficiently. The clothes image is parsed according to the human body pose and each garment piece is deformed independently, so that the clothes deformation adapts to rigid pose changes of the human body and the try-on effect is simulated more realistically. The invention further designs an effective virtual fitting network that fuses the human body information, independent of the user's original clothes, with the deformed target clothes information to generate a realistic fitting image, described in detail below:
A virtual fitting method based on image generation, the method comprising:
taking the clothed human body image and the standard clothes image as the inputs of the virtual fitting;
for the clothed human body image, constructing a data set by combining a parameterized human body model with a semantic body-shape model, and using a conditional generation network to deform the two-dimensional body shape according to the change in body measurements;
for the standard clothes image, constructing a clothes image segmentation data set oriented to the human body pose, parsing the clothes image according to the pose, and deforming the clothes image piece by piece;
and fusing the clothes-independent human body representation with the deformed clothes image to train a virtual fitting network that generates a more realistic fitting image.
The semantic body-shape model describes body-shape change as the offsets of the three-dimensional human mesh vertices relative to the corresponding vertices of a template mesh. The three-dimensional body-shape change ΔS is computed as:

ΔS = Σ_{q=1}^{Q} (λ_q^len · S_q^len + λ_q^cir · S_q^cir) + λ_u · S_u

where Q is the number of human body parts divided according to body shape, S_q^len is the semantic basis explaining the length change of the q-th body part, S_q^cir is the semantic basis explaining the circumference change of the q-th body part, S_u is the non-semantic basis explaining the remaining body-shape changes of the whole human body, and λ_q^len, λ_q^cir, λ_u are the corresponding body-shape parameters. Training samples are generated with the semantic parameterized three-dimensional human body model.
The data set is constructed with the parameterized human body model combined with the semantic body-shape model as follows:
each three-dimensional human body in the database is given a body-shape change vector δS to generate the corresponding deformed three-dimensional human body, and both bodies are projected to obtain the source and target two-dimensional body segmentations I_S and I_t;
based on the training samples (I_S, δS, I_t), a conditional generative adversarial network is trained to generate I_t from the source two-dimensional body segmentation I_S and the body-shape change vector δS; the inputs of the network are I_S and δS, and the output is I_t.
Further, constructing a clothes image segmentation data set oriented to the human body pose for the standard clothes image, parsing the clothes image according to the pose, and deforming the clothes image piece by piece comprises:
parsing the standard clothes image and segmenting each part of the garment, then rigidly rotating and stretching the parsed pieces according to the pose and body shape of the user so that the garment approaches the user's body shape and pose;
slicing the parsed garment and feeding the slices together with the user information into a clothes deformation network, which finally outputs a generated image; an L1 loss is computed between the generated image and the Ground Truth image to constrain the network;
inputting a standard clothes image and automatically parsing the garment parts associated with the torso and limbs by means of human body key-point detection; labeling the clothes images semi-automatically to construct the first clothes image data set with semantic labels oriented to the human body pose; interactively drawing segmentation lines on the clothes image through a user-friendly interface and then labeling the clothes image automatically by computing connected components; and, based on the constructed data set, learning human-pose-guided clothes image parsing with an image parsing framework.
The clothes deformation network is as follows:
the training set consists of image pairs collected from a shopping website, each comprising a standard picture of a garment and a picture of the model wearing the garment; a preprocessing stage prepares the key points corresponding to the standard clothes image, the key points corresponding to the clothed human body image, the two-dimensional body shape of the clothed human body image, and the garment pieces segmented by parsing the clothed human body photo;
in the training stage, the clothes image is first divided into regions corresponding to the human body pose by a clothes parsing network, and each piece is rigidly rotated and shortened/lengthened according to the target pose; the parameters of the thin-plate spline function corresponding to each garment piece are then learned from the human body representation and the garment piece.
Further, the clothes parsing network consists of four parts:
two encoding networks for extracting high-level features; a correlation layer that integrates the two feature streams into a single tensor serving as the input of the following regression network; a regression network for predicting the parameters; and a TPS transformation module that warps the image into the output ĉ = T_θ(c), where c denotes a garment piece and ĉ denotes the generated image after the TPS transformation.
Training uses sample triplets (p, c, c_t), where p denotes the pose and body-shape representation of the user and c_t denotes the Ground Truth image; the pixel-level loss is:

L_pixel(θ) = ‖T_θ(c) − c_t‖_1
The loss function of the virtual fitting network is expressed as:

Î_t = M ⊙ ĉ + (1 − M) ⊙ I′

L = α_L1 ‖Î_t − I_t‖_1 + α_M ‖1 − M‖_1 + α_V Σ_i α_i ‖Ψ_i(Î_t) − Ψ_i(I_t)‖_1

where ĉ denotes the deformed garment image; Î_t denotes the final result; I′ denotes the rendering result; M denotes the combination mask; α_L1 denotes the contribution of the L1 loss; α_V denotes the contribution of the VGG loss; α_i denotes the contribution of the i-th layer of the perception network to the loss; α_M denotes the contribution of the mask loss term; and Ψ_i(·) denotes the i-th layer feature of a visual perception network.
A virtual fitting apparatus based on image generation, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the method steps.
The technical scheme provided by the invention has the beneficial effects that:
1. for the clothed human body image, the invention constructs a data set by combining a parameterized human body model with a semantic body-shape model and uses a conditional generation network to deform the two-dimensional body shape according to the change in body measurements, thereby alleviating the insufficient body-shape diversity and unbalanced distribution of existing fitting image data sets;
2. the invention constructs a clothes image segmentation data set oriented to the human body pose, parses the clothes image according to the pose, and deforms the clothes image piece by piece, so that the clothes deformation caused by rigid pose changes of the human body can be depicted;
3. the invention fuses the clothes-independent human body representation with the deformed clothes image, which helps the body-shape representation get rid of the influence of the original clothes and generates a more realistic virtual fitting image.
Drawings
FIG. 1 is a flow chart of the virtual fitting method based on image generation;
FIG. 2 is the overall flow chart of the virtual fitting method based on image generation;
FIG. 3 is a diagram of human body parsing;
FIG. 4 is a flow chart of the method for reshaping the body in an image;
FIG. 5 is a structure diagram of the human-pose-guided garment image deformation network;
FIG. 6 is a diagram of semi-automatic garment image labeling;
FIG. 7 is a structure diagram of the image-generation-based virtual fitting network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
A virtual fitting method based on image generation, see fig. 1, the method comprising the steps of:
101: take the clothed human body image (i.e., the user photo) and the standard clothes image as the inputs of the virtual fitting;
102: for the clothed human body image, construct a data set by combining a parameterized human body model with a semantic body-shape model, and use a conditional generation network to deform the two-dimensional body shape according to the change in body measurements;
The semantic body-shape model describes body-shape change as the offsets of the three-dimensional human mesh vertices relative to the corresponding vertices of a template mesh. The parameterized human body model is the SMPL parameterized human body model, and the semantic body-shape model is combined with it to build a parameterized human body model with body-shape semantics.
The conditional generation network is trained on the constructed data set, and the deformed body shape guides the deformation of the clothes image; this is not described in further detail in this embodiment.
103: for the standard clothes image, construct a clothes image segmentation data set oriented to the human body pose, parse the clothes image according to the pose, and deform the clothes image piece by piece;
The human-pose-guided clothes image deformation network is trained on the constructed clothes image segmentation data set, which makes it practical.
104: fuse the clothes-independent human body representation with the deformed clothes image, and train a virtual fitting network that generates a more realistic image.
By fusing a clothes-independent human body representation, which is extracted from the user photo, the generated fitting image is not affected by the user's original clothes. The virtual fitting network takes the user's body representation and the pose-deformed clothes image as inputs and generates the fitting image. The network is designed so that the generated image preserves both the user's original pose and body shape and the color and texture information of the clothes, and it finally outputs the fitting image of the user in the current pose.
In summary, the embodiment of the present invention realizes virtual fitting through steps 101 to 104, which enhances the realism of the virtual fitting.
Example 2
The scheme in example 1 is further described below with reference to specific examples and calculation formulas, which are described in detail below:
201: input a user photo and a standard clothes image, on which the subsequent operations are performed;
The overall flow of the image-generation-based virtual fitting method is shown in fig. 2. The input of the whole method is a user photo and a clothes image, and the output is a fitting image of the user wearing the target clothes (the target clothes refers to the input clothes image) in the current pose. The output fitting image needs to meet the following requirements:
(1) the face and hairstyle of the user are unchanged; (2) the pose of the user is unchanged; (3) the body shape of the user is unchanged; and (4) the clothes fit the user naturally.
202: for the clothed human body image, construct a data set by combining a parameterized human body model with a semantic body-shape model, and use a conditional generation network to deform the two-dimensional body shape according to the change in body measurements;
1. Face and hairstyle of the user
For the face and hairstyle of the user, a known human parsing model [4] is adopted to parse the face and hair regions from the input user image; they serve as part of the input of the subsequent virtual fitting network and are mapped identically into the final output image. As shown in fig. 3 [5], the human parsing model parses the human body into different parts (different colors represent different parts), and the head (i.e., face and hairstyle) is retained directly, so the user's head does not change before and after fitting; that is, the face and hairstyle are the same in the input user photo and in the output virtual fitting result.
To keep the fitting result free of the original clothes, a clothes-independent human body representation (preserving pose, body shape, face, and hairstyle) must be extracted from the user photo. Regarding clothes-independent human body representations, previous works [1][2] represent the human body as face, hair, pose, and body shape. They extract human body pose points following reference [6] and parse the face, hair, clothes, skin and other parts from the clothed human body image using reference [4]. The body shape is represented as the human body region excluding the face and hair; to make it as independent of the original clothes as possible, it is further segmented, down-sampled to a low resolution, and enlarged back. Taking the example shown in fig. 3, after the parsed human body image is down-sampled to a low resolution and enlarged back, the shapes of the collar, the skirt, and other garment regions still do not change much, so the body shape cannot be represented clearly and remains affected by the original clothes. Such a body-shape representation therefore cannot completely remove the influence of the original clothes.
To solve the above problem and make the human body representation unaffected by the original clothes, DensePose [7] is used to estimate the body shape of the clothed user. DensePose learns dense correspondences between images and a surface-based human representation. The DensePose result is converted into a 1-channel binary mask, in which the value 1 represents the human body. Meanwhile, a body-shape diversity enhancement operation is performed on the body-shape estimate. Specifically, the clothes-independent human body representation is divided into three parts: body shape, pose, and face plus hairstyle (shown in fig. 2).
2. Pose estimation
The embodiment of the invention adopts a pose estimation model [6] to estimate the 2D coordinates of 18 pose key points in the image. So that each key point retains its semantic positional attribute (e.g., shoulder, elbow, or wrist), an 18-channel pose heatmap is constructed, one channel per key point. Meanwhile, to strengthen the effect of each key point, an 11×11 neighborhood around the key point is filled with the value 1 and the rest of the channel with 0. The resulting pose map serves as part of the input of the subsequent virtual fitting network and constrains it to generate a human body with the corresponding pose.
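A minimal sketch of this heatmap construction follows; the key points are assumed to arrive as (x, y) pixel coordinates with negative values marking undetected joints (the detector's output format is an assumption here):

```python
import numpy as np

def build_pose_heatmap(keypoints, height, width, patch=11):
    """Build an 18-channel pose map: one channel per keypoint, with an
    11x11 neighborhood around each detected keypoint filled with 1."""
    heatmap = np.zeros((len(keypoints), height, width), dtype=np.float32)
    half = patch // 2
    for k, (x, y) in enumerate(keypoints):   # (x, y) in pixel coordinates
        if x < 0 or y < 0:                   # convention: negative = undetected
            continue
        x, y = int(round(x)), int(round(y))
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        x0, x1 = max(0, x - half), min(width, x + half + 1)
        heatmap[k, y0:y1, x0:x1] = 1.0
    return heatmap
```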
3. Body-shape estimation
To keep the user's body shape unchanged before and after fitting (for example, if the input user photo shows a slim person, the output fitting result still shows a slim person, and if the input shows a plump person, the output still shows a plump person), the embodiment of the invention performs a body-shape diversity enhancement operation on the body image data set. In existing data sets the models are mostly slim, so a virtual fitting network trained on them generalizes poorly: when the input user is slim, the trained network outputs a slim person, which meets the target requirement, but when the input user is plump, the output is still a slim person, which does not. The body shapes of the clothed human body images therefore need to be enhanced to enrich the data set with tall, short, plump, and slim bodies, so that the trained virtual fitting network performs better, the user's body shape is unchanged between input and output, and a more realistic fitting effect is obtained.
Enhancing the body-shape diversity of clothed human body images is a key research part of the embodiment of the invention. Most models in the fitting photos of online stores are slim, so existing fitting image data sets suffer from insufficient body-shape diversity and unbalanced distribution. The embodiment of the invention adopts a data-driven method, as shown in fig. 4: a two-dimensional human body region is estimated from the source image, and a two-dimensional human body with a changed body shape is then generated according to a body-measurement change vector δS. Finally, correspondences are established between the contour points of each body part before and after the body-shape change, and the source image is deformed according to these correspondences based on the MLS (Moving Least Squares) method, as sketched below. For the two-dimensional body segmentation of the source image, the embodiment of the invention adopts the method of Omran et al. [8].
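For illustration, here is a compact sketch of the affine variant of Moving Least Squares point warping in the standard Schaefer-style formulation; the control-point correspondences p -> q and the weight exponent are assumptions, and a full image warp would evaluate this on a pixel grid and resample:

```python
import numpy as np

def mls_affine_warp(v, p, q, alpha=1.0, eps=1e-8):
    """Affine Moving Least Squares: warp query points v (N,2) so that
    control points p (M,2) map approximately onto their targets q (M,2)."""
    v = np.asarray(v, dtype=np.float64)
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    out = np.empty_like(v)
    for n, x in enumerate(v):
        d2 = np.sum((p - x) ** 2, axis=1) + eps   # squared distances to controls
        w = 1.0 / d2 ** alpha                     # MLS weights
        p_star = (w[:, None] * p).sum(0) / w.sum()
        q_star = (w[:, None] * q).sum(0) / w.sum()
        ph, qh = p - p_star, q - q_star           # centered control points
        A = (w[:, None] * ph).T @ ph              # 2x2 weighted covariance
        B = (w[:, None] * ph).T @ qh
        M = np.linalg.solve(A, B)                 # best-fit affine matrix
        out[n] = (x - p_star) @ M + q_star
    return out
```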
The key problem here is how to change the original two-dimensional human body according to the change in body measurements. The embodiment of the invention uses the semantic parameterized three-dimensional human body model to generate training samples (the source two-dimensional body segmentation I_S, i.e., the segmentation of the input human body image; the body-measurement change vector δS; and the target two-dimensional body segmentation I_t), and trains a conditional generative adversarial network (CGAN [9]) on these samples.
The semantic body-shape model describes body-shape change as the offsets of the three-dimensional human mesh vertices relative to the corresponding vertices of a template mesh. The three-dimensional body-shape change ΔS (i.e., the offsets of all vertices relative to the corresponding template-mesh vertices) is computed as:

ΔS = Σ_{q=1}^{Q} (λ_q^len · S_q^len + λ_q^cir · S_q^cir) + λ_u · S_u

where Q is the number of human body parts divided according to body shape, S_q^len is the basis explaining the length change of the q-th body part (a semantic basis), S_q^cir is the basis explaining the circumference change of the q-th body part (a semantic basis), and S_u is the basis explaining the remaining body-shape changes of the whole human body (a non-semantic basis). The body-shape bases are obtained by principal component analysis of a three-dimensional human body database with the same pose and different body shapes; the semantic bases realize semantic control over the body shape (e.g., changes in height or in bust, waist, and hip girths), while the non-semantic basis guarantees continuity between body parts. λ_q^len, λ_q^cir, and λ_u are the corresponding body-shape parameters, and the mappings from body measurements such as height and girths to the corresponding parameters are learned from the data set.
To give the training samples pose diversity, the semantic body-shape model is combined with the SMPL (skinned multi-person linear) parameterized human body model. By adjusting the pose parameters and the body measurements of the model, three-dimensional human bodies with various poses and body shapes can be generated. Each three-dimensional human body in the database is given a body-shape change vector δS to generate the corresponding deformed three-dimensional human body. Both bodies are then projected to obtain the two-dimensional body segmentations I_S and I_t. Based on the training samples (I_S, δS, I_t), a conditional generative adversarial network (CGAN [9]) is trained to generate I_t from I_S and δS (i.e., the inputs of the network are I_S and δS, and the output is I_t). The objective function of the conditional GAN is:

min_G max_D V(D, G) = E_{I_t}[log D(I_t | I_S, δS)] + E_z[log(1 − D(G(z | I_S, δS) | I_S, δS))]

where G denotes the generator, D denotes the discriminator, the source two-dimensional body segmentation I_S is the input of the CGAN, the body-shape change vector δS serves as the condition information, and z denotes random noise. The network is trained so that the generator can produce the target two-dimensional body segmentation I_t from I_S and δS.
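To make the training setup concrete, here is a schematic PyTorch sketch of one CGAN training step conditioned on (I_S, δS). The tiny convolutional generator and discriminator, the 8-dimensional δS, and the broadcasting of δS to a spatial map are illustrative placeholders, not the patent's architecture:

```python
import torch
import torch.nn as nn

# Toy stand-ins for G and D; the patent does not specify the architectures.
COND_DIM = 8  # assumed dimensionality of the measurement-change vector deltaS
G = nn.Sequential(nn.Conv2d(1 + 1 + COND_DIM, 32, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())
D = nn.Sequential(nn.Conv2d(1 + 1 + COND_DIM, 32, 3, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(i_s, delta_s, i_t):
    """One CGAN step. i_s, i_t: (B,1,H,W) segmentations; delta_s: (B,COND_DIM)."""
    b, _, h, w = i_s.shape
    cond = delta_s[:, :, None, None].expand(b, COND_DIM, h, w)  # broadcast condition
    z = torch.randn(b, 1, h, w)                                 # random noise
    fake = G(torch.cat([i_s, z, cond], dim=1))
    # Discriminator: real pair vs. fake pair, both conditioned on (I_S, deltaS)
    d_real = D(torch.cat([i_s, i_t, cond], dim=1))
    d_fake = D(torch.cat([i_s, fake.detach(), cond], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: try to fool the discriminator
    d_fake = D(torch.cat([i_s, fake, cond], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```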
203: for the standard clothes image, construct a clothes image segmentation data set oriented to the human body pose, parse the clothes image according to the pose, and deform the clothes image piece by piece.
Clothes in the images provided by shopping websites are usually laid out neatly and are generally called standard clothes images. To produce a good fitting effect, the standard clothes image must be deformed before being synthesized with the user's human body representation. The embodiment of the invention deforms the garment parts associated with the torso and limbs on different image planes (surfaces), which mitigates poor garment deformation and enhances the realism of the fitting image.
The overall flow is shown in fig. 5. First, the standard garment is parsed and each part of the garment is segmented. The parsed pieces are rigidly rotated, stretched, and so on according to the pose and body shape of the user, so that the garment approaches the user's body shape and pose as closely as possible. Then the parsed garment is sliced, and the slices are fed together with the user information into a clothes deformation network (the dashed box in fig. 5), which finally outputs the generated image. An L1 loss is computed between the generated image and the Ground Truth image to constrain the network.
As for the clothes parsing part, a standard clothes image is input and the garment parts associated with the torso and limbs are parsed automatically with the help of human body key-point detection. The clothes images are labeled with a semi-automatic method, constructing the first clothes image data set with semantic labels oriented to the human body pose. As shown in fig. 6, segmentation lines are drawn interactively on the clothes image through a user-friendly interface, after which the clothes image is labeled automatically by computing connected components; a sketch of this labeling step follows. Based on the constructed data set, human-pose-guided clothes image parsing is learned with a common image parsing framework.
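A minimal sketch of the automatic labeling step, assuming the garment mask and the rasterized user-drawn lines are available as boolean arrays (the drawing interface itself is omitted):

```python
import numpy as np
from scipy import ndimage

def label_garment_pieces(garment_mask, line_mask):
    """garment_mask: bool (H,W), True inside the garment.
    line_mask: bool (H,W), True on the user-drawn segmentation lines.
    Returns a label map in which each garment piece has its own integer id."""
    # Removing the drawn lines splits the garment into disconnected regions
    pieces = garment_mask & ~line_mask
    labels, num = ndimage.label(pieces)
    # Assign line pixels to the nearest labeled piece so no gaps remain
    _, (iy, ix) = ndimage.distance_transform_edt(labels == 0, return_indices=True)
    filled = labels[iy, ix]
    filled[~garment_mask] = 0
    return filled, num
```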
The framework of the human-pose-guided garment image deformation network is shown in fig. 5. The training set consists of image pairs collected from a shopping website (a standard picture of a garment and a picture of the model wearing the garment). A preprocessing stage prepares the key points corresponding to the standard clothes image, the key points corresponding to the clothed human body image, the two-dimensional body shape of the clothed human body image, and the garment pieces segmented by parsing the clothed human body photo (the Ground Truth). In the training stage, the clothes image is first divided into regions corresponding to the human body pose by the clothes parsing network, and each piece is then rigidly rotated and shortened/lengthened according to the target pose. The rigid rotation is expressed as follows, where x and y denote the position coordinates in the original image, x′ and y′ the transformed coordinates, and θ the rotation angle:

x′ = x·cos θ − y·sin θ
y′ = x·sin θ + y·cos θ
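As a small illustration, the rotation above can be applied to the pixel coordinates of a garment piece as follows; rotating about a joint position instead of the origin is an assumed practical variant for aligning limb pieces:

```python
import numpy as np

def rotate_piece_points(points, theta, center):
    """Apply the rigid rotation above to (N,2) pixel coordinates (x, y),
    rotating by theta (radians) about a chosen center (e.g., a joint)."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s],
                    [s,  c]])
    pts = np.asarray(points, dtype=np.float64)
    ctr = np.asarray(center, dtype=np.float64)
    return (pts - ctr) @ rot.T + ctr
```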
Finally, the parameters of a Thin-Plate Spline (TPS) function corresponding to each garment piece are learned from the human body representation and the garment piece. This sub-network consists of four parts: (1) two encoding networks for extracting high-level features; (2) a correlation layer (the diamond in fig. 5) that integrates the two feature streams into a single tensor used as the input of the following regression network; (3) a regression network for predicting the parameters; and (4) a TPS transformation module (T for short), a thin-plate spline module that warps the image into the output ĉ = T_θ(c), where c denotes a garment piece (the garment piece in fig. 5) and ĉ denotes the generated image after the TPS transformation (the generated image on the right of fig. 5). The method is end-to-end learnable and is trained with sample triplets (p, c, c_t), where p denotes the pose and body-shape representation of the user (the user information in fig. 5: pose + body shape) and c_t denotes the Ground Truth image (the Ground Truth on the right of fig. 5). The pixel-level loss between the deformed result ĉ and c_t is as follows:

L_pixel(θ) = ‖T_θ(c) − c_t‖_1
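As an illustration of the TPS module's effect (not the learned regression network itself), OpenCV's thin-plate spline shape transformer from opencv-contrib can warp a garment piece once control points are given:

```python
import cv2
import numpy as np

def tps_warp(piece_img, src_pts, dst_pts):
    """Warp a garment-piece image with a thin-plate spline so that the
    control points src_pts (N,2) move to dst_pts (N,2).
    Requires opencv-contrib-python."""
    tps = cv2.createThinPlateSplineShapeTransformer()
    src = np.asarray(src_pts, np.float32).reshape(1, -1, 2)
    dst = np.asarray(dst_pts, np.float32).reshape(1, -1, 2)
    matches = [cv2.DMatch(i, i, 0) for i in range(src.shape[1])]
    # warpImage applies a backward map, hence the (dst, src) argument order
    tps.estimateTransformation(dst, src, matches)
    return tps.warpImage(piece_img)
```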
204: fuse the clothes-independent human body representation with the deformed clothes image, and train a virtual fitting network that generates a more realistic image;
The preceding steps yield the clothes-independent human body representation and the deformed clothes image. They serve as the inputs of the virtual fitting network, which fuses the body information with the pose-aligned clothes information to generate the fitting image, see fig. 7.
Ideally, the training samples would be (I_i, c, I_t), where I_i denotes the input user image, c the clothes image, and I_t the fitting image. However, image pairs (I_i, I_t) with identical pose and body shape but different clothes are difficult to acquire in real life. The training samples are therefore organized as (p, c, I_t), where c denotes the clothes image, I_t the fitting image, and p the clothes-independent human body representation.
First, the clothes-independent human body representation p, consisting of pose, body shape, and head information, is extracted from the clothed human body image I_t. Second, the standard clothes image is deformed by the pose-guided clothes deformation network. Finally, with the human body representation and the deformed clothes image as input, a U-Net [10] simultaneously generates a rendering result and a combination mask. The garment image is combined with the rendering result according to the combination mask as follows, where ⊙ denotes element-wise multiplication:

Î_t = M ⊙ ĉ + (1 − M) ⊙ I′
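This composition is a one-liner in tensor code; a minimal sketch, assuming channel-first tensors with the single-channel mask broadcast over the color channels:

```python
import torch

def compose_tryon(warped_cloth, rendered, mask):
    """I_t_hat = M (*) c_hat + (1 - M) (*) I', element-wise.
    warped_cloth, rendered: (B,3,H,W); mask: (B,1,H,W) in [0,1]."""
    return mask * warped_cloth + (1.0 - mask) * rendered
```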
The loss function of the virtual fitting network is expressed as:

L = α_L1 ‖Î_t − I_t‖_1 + α_M ‖1 − M‖_1 + α_V Σ_i α_i ‖Ψ_i(Î_t) − Ψ_i(I_t)‖_1

where ĉ denotes the deformed garment image; Î_t denotes the final result; I′ denotes the rendering result; M denotes the combination mask; α_L1 denotes the contribution of the L1 loss; α_V denotes the contribution of the VGG loss; α_i denotes the contribution of the i-th layer of the perception network to the loss; α_M denotes the contribution of the mask loss term; and ‖·‖_1 denotes the 1-norm.
In the formula, the first term, α_L1 ‖Î_t − I_t‖_1, constrains the generated image to be as close as possible to the Ground Truth; the second term, α_M ‖1 − M‖_1, is a regularization term that constrains the generated image to use as much of the garment image information as possible; and the third term is the perceptual loss between the generated image and the Ground Truth, where Ψ_i(·) denotes the i-th layer of a visual perception network, here a VGG19 network [12] pretrained on ImageNet [11].
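A hedged sketch of this loss in PyTorch follows. Which VGG19 layers stand in for Ψ_i, the mean-normalized 1-norms, and the default weights are assumptions made for illustration:

```python
import torch
from torchvision import models

# VGG19 pretrained on ImageNet as the visual perception network Psi
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for prm in vgg.parameters():
    prm.requires_grad_(False)

FEATURE_LAYERS = [3, 8, 17, 26]  # which layers stand in for Psi_i: an assumption

def vgg_features(x):
    """Collect feature maps at the chosen layers (x: ImageNet-normalized (B,3,H,W))."""
    feats, h = [], x
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in FEATURE_LAYERS:
            feats.append(h)
    return feats

def fitting_loss(i_hat, i_gt, mask, a_l1=1.0, a_v=1.0, a_m=1.0,
                 a_i=(1.0, 1.0, 1.0, 1.0)):
    """L = a_L1*|I_hat - I_t|_1 + a_M*|1 - M|_1 + a_V*sum_i a_i*|Psi_i(I_hat) - Psi_i(I_t)|_1,
    with mean-normalized 1-norms; the weights here are placeholders."""
    loss = a_l1 * (i_hat - i_gt).abs().mean()
    loss = loss + a_m * (1.0 - mask).abs().mean()
    for wgt, f_hat, f_gt in zip(a_i, vgg_features(i_hat), vgg_features(i_gt)):
        loss = loss + a_v * wgt * (f_hat - f_gt).abs().mean()
    return loss
```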
205: based on the trained virtual fitting network, output the fitting image of the user in the current pose, meeting various requirements in practical applications.
A virtual fitting apparatus based on image generation, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the method steps of:
taking the clothed human body image and the standard clothes image as the inputs of the virtual fitting;
for the clothed human body image, constructing a data set by combining a parameterized human body model with a semantic body-shape model, and using a conditional generation network to deform the two-dimensional body shape according to the change in body measurements;
for the standard clothes image, constructing a clothes image segmentation data set oriented to the human body pose, parsing the clothes image according to the pose, and deforming the clothes image piece by piece;
and fusing the clothes-independent human body representation with the deformed clothes image to train a virtual fitting network that generates a more realistic fitting image.
The semantic body-shape model describes body-shape change as the offsets of the three-dimensional human mesh vertices relative to the corresponding vertices of a template mesh. The three-dimensional body-shape change ΔS is computed as:

ΔS = Σ_{q=1}^{Q} (λ_q^len · S_q^len + λ_q^cir · S_q^cir) + λ_u · S_u

where Q is the number of human body parts divided according to body shape, S_q^len is the semantic basis explaining the length change of the q-th body part, S_q^cir is the semantic basis explaining the circumference change of the q-th body part, S_u is the non-semantic basis explaining the remaining body-shape changes of the whole human body, and λ_q^len, λ_q^cir, λ_u are the corresponding body-shape parameters. Training samples are generated with the semantic parameterized three-dimensional human body model.
The data set is constructed with the parameterized human body model combined with the semantic body-shape model as follows:
each three-dimensional human body in the database is given a body-shape change vector δS to generate the corresponding deformed three-dimensional human body, and both bodies are projected to obtain the source and target two-dimensional body segmentations I_S and I_t;
based on the training samples (I_S, δS, I_t), a conditional generative adversarial network is trained to generate I_t from the source two-dimensional body segmentation I_S and the body-shape change vector δS; the inputs of the network are I_S and δS, and the output is I_t.
Further, constructing a clothes image segmentation data set oriented to the human body pose for the standard clothes image, parsing the clothes image according to the pose, and deforming the clothes image piece by piece comprises:
parsing the standard clothes image and segmenting each part of the garment, then rigidly rotating and stretching the parsed pieces according to the pose and body shape of the user so that the garment approaches the user's body shape and pose;
slicing the parsed garment and feeding the slices together with the user information into a clothes deformation network, which finally outputs a generated image; an L1 loss is computed between the generated image and the Ground Truth image to constrain the network;
inputting a standard clothes image and automatically parsing the garment parts associated with the torso and limbs by means of human body key-point detection; labeling the clothes images semi-automatically to construct the first clothes image data set with semantic labels oriented to the human body pose; interactively drawing segmentation lines on the clothes image through a user-friendly interface and then labeling the clothes image automatically by computing connected components; and, based on the constructed data set, learning human-pose-guided clothes image parsing with an image parsing framework.
The clothes deformation network is as follows:
the training set consists of image pairs collected from a shopping website, each comprising a standard picture of a garment and a picture of the model wearing the garment; a preprocessing stage prepares the key points corresponding to the standard clothes image, the key points corresponding to the clothed human body image, the two-dimensional body shape of the clothed human body image, and the garment pieces segmented by parsing the clothed human body photo;
in the training stage, the clothes image is first divided into regions corresponding to the human body pose by a clothes parsing network, and each piece is rigidly rotated and shortened/lengthened according to the target pose; the parameters of the thin-plate spline function corresponding to each garment piece are then learned from the human body representation and the garment piece.
Further, the clothes parsing network consists of four parts:
two encoding networks for extracting high-level features; a correlation layer that integrates the two feature streams into a single tensor serving as the input of the following regression network; a regression network for predicting the parameters; and a TPS transformation module that warps the image into the output ĉ = T_θ(c), where c denotes a garment piece and ĉ denotes the generated image after the TPS transformation.
Training uses sample triplets (p, c, c_t), where p denotes the pose and body-shape representation of the user and c_t denotes the Ground Truth image; the pixel-level loss is:

L_pixel(θ) = ‖T_θ(c) − c_t‖_1
The loss function of the virtual fitting network is expressed as:

Î_t = M ⊙ ĉ + (1 − M) ⊙ I′

L = α_L1 ‖Î_t − I_t‖_1 + α_M ‖1 − M‖_1 + α_V Σ_i α_i ‖Ψ_i(Î_t) − Ψ_i(I_t)‖_1

where ĉ denotes the deformed garment image; Î_t denotes the final result; I′ denotes the rendering result; M denotes the combination mask; α_L1 denotes the contribution of the L1 loss; α_V denotes the contribution of the VGG loss; α_i denotes the contribution of the i-th layer of the perception network to the loss; α_M denotes the contribution of the mask loss term; and Ψ_i(·) denotes the i-th layer feature of a visual perception network.
References:
[1] Han X, Wu Z, Wu Z, et al. VITON: An image-based virtual try-on network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018: 7543-7552.
[2] Wang B, Zheng H, Liang X, et al. Toward characteristic-preserving image-based virtual try-on network[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 589-604.
[3] Cao Z, Simon T, Wei S E, et al. Realtime multi-person 2D pose estimation using part affinity fields[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017: 7291-7299.
[4] Liang X, Gong K, Shen X, et al. Look into person: Joint body parsing & pose estimation network and a new benchmark[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
[5] Gong K, Liang X, Zhang D, et al. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017: 932-940.
[6] Cao Z, Simon T, Wei S E, Sheikh Y. Realtime multi-person 2D pose estimation using part affinity fields[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 1302-1310.
[7] Neverova N, Alp Guler R, Kokkinos I. Dense pose transfer[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 123-138.
[8] Omran M, Lassner C, Pons-Moll G, et al. Neural body fitting: Unifying deep learning and model based human pose and shape estimation[C]//2018 International Conference on 3D Vision (3DV). 2018: 484-494.
[9] Mirza M, Osindero S. Conditional generative adversarial nets[J]. arXiv preprint arXiv:1411.1784, 2014.
[10] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015: 234-241.
[11] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2009: 248-255.
[12] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
in the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A virtual fitting method based on image generation, characterized by comprising the following steps:
taking the clothed human body image and the standard clothes image as the inputs of the virtual fitting;
for the clothed human body image, constructing a data set by combining a parameterized human body model with a semantic body-shape model, and using a conditional generation network to deform the two-dimensional body shape according to the change in body measurements;
for the standard clothes image, constructing a clothes image segmentation data set oriented to the human body pose, parsing the clothes image according to the pose, and deforming the clothes image piece by piece;
and fusing the clothes-independent human body representation with the deformed clothes image to train a virtual fitting network that generates a more realistic fitting image.
2. The virtual fitting method based on image generation according to claim 1, characterized in that the semantic body-shape model describes body-shape change as the offsets of the three-dimensional human mesh vertices relative to the corresponding vertices of a template mesh, and the three-dimensional body-shape change ΔS is computed as:

ΔS = Σ_{q=1}^{Q} (λ_q^len · S_q^len + λ_q^cir · S_q^cir) + λ_u · S_u

wherein Q is the number of human body parts divided according to body shape, S_q^len is the semantic basis explaining the length change of the q-th body part, S_q^cir is the semantic basis explaining the circumference change of the q-th body part, S_u is the non-semantic basis explaining the remaining body-shape changes of the whole human body, and λ_q^len, λ_q^cir, λ_u are the corresponding body-shape parameters; training samples are generated with the semantic parameterized three-dimensional human body model.
3. The virtual fitting method based on image generation according to claim 2, characterized in that the data set is constructed with the parameterized human body model combined with the semantic body-shape model as follows:
each three-dimensional human body in the database is given a body-shape change vector δS to generate the corresponding deformed three-dimensional human body, and both bodies are projected to obtain the source and target two-dimensional body segmentations I_S and I_t;
based on the training samples (I_S, δS, I_t), a conditional generative adversarial network is trained to generate I_t from the source two-dimensional body segmentation I_S and the body-shape change vector δS; the inputs of the network are I_S and δS, and the output is I_t.
4. The virtual fitting method based on image generation according to claim 1, characterized in that constructing a clothes image segmentation data set oriented to the human body pose for the standard clothes image, parsing the clothes image according to the pose, and deforming the clothes image piece by piece comprises:
parsing the standard clothes image and segmenting each part of the garment, then rigidly rotating and stretching the parsed pieces according to the pose and body shape of the user so that the garment approaches the user's body shape and pose;
slicing the parsed garment and feeding the slices together with the user information into a clothes deformation network, which finally outputs a generated image; an L1 loss is computed between the generated image and the Ground Truth image to constrain the network;
inputting a standard clothes image and automatically parsing the garment parts associated with the torso and limbs by means of human body key-point detection; labeling the clothes images semi-automatically to construct the first clothes image data set with semantic labels oriented to the human body pose; interactively drawing segmentation lines on the clothes image through a user-friendly interface and then labeling the clothes image automatically by computing connected components; and, based on the constructed data set, learning human-pose-guided clothes image parsing with an image parsing framework.
5. The virtual fitting method based on image generation according to claim 4, characterized in that the clothes deformation network is as follows:
the training set consists of image pairs collected from a shopping website, each comprising a standard picture of a garment and a picture of the model wearing the garment; a preprocessing stage prepares the key points corresponding to the standard clothes image, the key points corresponding to the clothed human body image, the two-dimensional body shape of the clothed human body image, and the garment pieces segmented by parsing the clothed human body photo;
in the training stage, the clothes image is first divided into regions corresponding to the human body pose by a clothes parsing network, and each piece is rigidly rotated and shortened/lengthened according to the target pose; the parameters of the thin-plate spline function corresponding to each garment piece are then learned from the human body representation and the garment piece.
6. The virtual fitting method based on image generation according to claim 5, characterized in that the clothes parsing network consists of four parts:
two encoding networks for extracting high-level features; a correlation layer that integrates the two feature streams into a single tensor serving as the input of the following regression network; a regression network for predicting the parameters; and a TPS transformation module that warps the image into the output ĉ = T_θ(c), wherein c denotes a garment piece and ĉ denotes the generated image after the TPS transformation;
training uses sample triplets (p, c, c_t), wherein p denotes the pose and body-shape representation of the user and c_t denotes the Ground Truth image; the pixel-level loss is:

L_pixel(θ) = ‖T_θ(c) − c_t‖_1
7. The virtual fitting method based on image generation according to claim 1, characterized in that the loss function of the virtual fitting network is expressed as:

Î_t = M ⊙ ĉ + (1 − M) ⊙ I′

L = α_L1 ‖Î_t − I_t‖_1 + α_M ‖1 − M‖_1 + α_V Σ_i α_i ‖Ψ_i(Î_t) − Ψ_i(I_t)‖_1

wherein ĉ denotes the deformed garment image; Î_t denotes the final result; I′ denotes the rendering result; M denotes the combination mask; α_L1 denotes the contribution of the L1 loss; α_V denotes the contribution of the VGG loss; α_i denotes the contribution of the i-th layer of the perception network to the loss; α_M denotes the contribution of the mask loss term; and Ψ_i(·) denotes the i-th layer feature of a visual perception network.
8. An image generation based virtual fitting apparatus, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling upon the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-7.
CN202211141675.9A 2022-09-20 2022-09-20 Virtual fitting method and device based on image generation Pending CN115482062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211141675.9A CN115482062A (en) 2022-09-20 2022-09-20 Virtual fitting method and device based on image generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211141675.9A CN115482062A (en) 2022-09-20 2022-09-20 Virtual fitting method and device based on image generation

Publications (1)

Publication Number Publication Date
CN115482062A (en) 2022-12-16

Family

ID=84424248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211141675.9A Pending CN115482062A (en) 2022-09-20 2022-09-20 Virtual fitting method and device based on image generation

Country Status (1)

Country Link
CN (1) CN115482062A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024131594A1 (en) * 2022-12-20 2024-06-27 北京字跳网络技术有限公司 Virtual reoutfitting method and apparatus, and electronic device and storage medium
CN117635883A (en) * 2023-11-28 2024-03-01 广州恒沙数字科技有限公司 Virtual fitting generation method and system based on human skeleton posture
CN117635883B (en) * 2023-11-28 2024-05-24 广州恒沙数字科技有限公司 Virtual fitting generation method and system based on human skeleton posture
CN117422896A (en) * 2023-12-18 2024-01-19 高密市真又美服装有限公司 Intelligent design method and system for clothing process template
CN117422896B (en) * 2023-12-18 2024-03-22 高密市真又美服装有限公司 Intelligent design method and system for clothing process template

Similar Documents

Publication Publication Date Title
Lassner et al. A generative model of people in clothing
Han et al. Viton: An image-based virtual try-on network
Bhatnagar et al. Combining implicit function learning and parametric models for 3d human reconstruction
Yang et al. Physics-inspired garment recovery from a single-view image
Ploumpis et al. Towards a complete 3D morphable model of the human head
Neverova et al. Dense pose transfer
Su et al. Robustfusion: Human volumetric capture with data-driven visual cues using a rgbd camera
Cheng et al. Parametric modeling of 3D human body shape—A survey
Chen et al. gdna: Towards generative detailed neural avatars
Liang et al. Deep human parsing with active template regression
Yang et al. Detailed garment recovery from a single-view image
Cao et al. 3D shape regression for real-time facial animation
CN115482062A (en) Virtual fitting method and device based on image generation
Hu et al. 3DBodyNet: fast reconstruction of 3D animatable human body shape from a single commodity depth camera
CN110796593A (en) Image processing method, device, medium and electronic equipment based on artificial intelligence
Dibra et al. Monocular RGB hand pose inference from unsupervised refinable nets
Kaashki et al. Deep learning-based automated extraction of anthropometric measurements from a single 3-D scan
Zhu et al. Detailed avatar recovery from single image
Liu et al. A semi-supervised data augmentation approach using 3d graphical engines
Wang et al. Dynamic human body reconstruction and motion tracking with low-cost depth cameras
He et al. Autolink: Self-supervised learning of human skeletons and object outlines by linking keypoints
Xu et al. Motion recognition algorithm based on deep edge-aware pyramid pooling network in human–computer interaction
Li et al. Remodeling of mannequins based on automatic binding of mesh to anthropometric parameters
Xie et al. Non-parametric anthropometric graph convolutional network for virtual mannequin reconstruction
He et al. Data-driven 3D human head reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination