CN114724004B - Method for training fitting model, method for generating fitting image and related device

Method for training fitting model, method for generating fitting image and related device

Info

Publication number
CN114724004B
CN114724004B (application CN202210262232.9A)
Authority
CN
China
Prior art keywords
image
fitting
clothes
network
real
Prior art date
Legal status
Active
Application number
CN202210262232.9A
Other languages
Chinese (zh)
Other versions
CN114724004A (en)
Inventor
陈仿雄
Current Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd filed Critical Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202210262232.9A priority Critical patent/CN114724004B/en
Publication of CN114724004A publication Critical patent/CN114724004A/en
Application granted granted Critical
Publication of CN114724004B publication Critical patent/CN114724004B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

The embodiment of the application relates to the technical field of image processing and discloses a method for training a fitting model, a method for generating a fitting image, and a related device. In the training method, a clothes coding network performs feature extraction on a deformed clothes image to obtain a clothes feature code, and an identity coding network performs feature extraction on a mask image to obtain an identity feature map. The clothes feature code and the identity feature map are input into a fitting image generation network and fused to obtain a predicted fitting image, and the fitting network is iteratively trained until convergence according to the differences between the real fitting images and the predicted fitting images in the training set, so as to obtain the fitting model. In this manner, the fitting model enables the try-on clothes and the user to be combined closely, so that the fitting effect is real and natural.

Description

Method for training fitting model, method for generating fitting image and related device
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a method for training a fitting model, a method for generating fitting images and a related device.
Background
With the continuous progress of modern technology, the scale of online shopping keeps growing, and users can purchase clothes on online shopping platforms through mobile phones. However, because the information a user obtains about clothes for sale is generally a two-dimensional display picture, the user cannot know how the clothes would look when worn on his or her own body, and may therefore purchase clothes that are not suitable, resulting in a poor shopping experience.
Online fitting is generally carried out by capturing an image of the user and selecting a target garment provided by the system for automatic replacement. In particular, most online fitting adopts 3D modeling: human body data and information about the garment to be tried on are collected in order to remodel the user's figure and the target garment. However, the combination of the target garment and the user's figure is often not close enough and lacks a natural feel.
Disclosure of Invention
The technical problem to be solved mainly by the embodiment of the application is to provide a method for training a fitting model, a method for generating fitting images and a related device.
In order to solve the technical problems described above, in a first aspect, an embodiment of the present application provides a method for training a fitting model, where a fitting network includes a clothes coding network, an identity coding network, and a fitting image generating network;
The method comprises the following steps:
Acquiring a training set, wherein the training set comprises a plurality of training data, the training data comprises a clothes image and a real fitting image, and the real fitting image is an image of a model wearing the corresponding clothes in the clothes image;
deforming clothes in the clothes image according to the model human body structure in the real fitting image to obtain a deformed clothes image;
carrying out mask shielding treatment on the region corresponding to the clothes of the real fitting image to obtain a mask image;
Performing feature extraction on the deformed clothes image by adopting a clothes coding network to obtain clothes feature codes, and performing feature extraction on the mask image by adopting an identity coding network to obtain an identity feature map;
inputting the clothes feature codes and the identity feature images into a fitting image generation network for fusion to obtain predicted fitting images;
and performing iterative training on the fitting network according to the difference between each real fitting image and each predicted fitting image in the training set until the fitting network converges, so as to obtain a fitting model.
In some embodiments, the foregoing deforming the clothing in the clothing image according to the model human body structure in the real fitting image to obtain a deformed clothing image includes:
performing key point detection on the clothes image by adopting a clothes key point detection algorithm to acquire clothes key point information;
performing key point detection on the real fitting image by adopting a human key point detection algorithm to acquire human key point information;
Calculating an affine transformation matrix between the clothes key point information and the human body key point information;
and performing an affine transformation on the clothes pixels in the clothes image according to the affine transformation matrix to obtain a deformed clothes image.
In some embodiments, the performing mask shielding processing on the region corresponding to the clothes on the real fitting image to obtain a mask image includes:
the human body analysis algorithm is adopted to analyze the parts of the real fitting image, so as to obtain a human body analysis chart;
Determining a mask matrix according to the human body analysis chart and the type of clothes in the clothes image;
Multiplying the real fitting image and the corresponding position of the mask matrix to obtain a mask image.
In some embodiments, the fitting network further includes a texture coding network, and before the clothes feature code and the identity feature map are input into the fitting image generating network to be fused, the method further includes:
extracting texture features of the clothes image by adopting a texture coding network to obtain at least one texture feature map;
Inputting the clothes feature codes and the identity feature map into a fitting image generation network for fusion to obtain a predicted fitting image, wherein the method comprises the following steps of:
inputting the clothes feature codes, the identity feature images and at least one texture feature image into a fitting image generation network for fusion to obtain a predicted fitting image.
In some embodiments, the iterative training of the fitting network according to the differences between the real fitting images and the predicted fitting images in the training set until the fitting network converges, to obtain the fitting model, includes:
Calculating the difference between each real fitting image and each predicted fitting image in the training set by adopting a loss function, wherein the loss function comprises an adversarial loss, a reconstruction loss and a perceptual loss between the real fitting image and the predicted fitting image, and a mask loss between the mask image corresponding to the real fitting image and the mask image corresponding to the predicted fitting image;
And (3) carrying out iterative training on the fitting network according to the difference until the fitting network converges to obtain a fitting model.
In some embodiments, the aforementioned loss function is of the form:

L = L_adv + α·L_rec + β·L_per + δ·L_mask

wherein L_adv is the adversarial loss, L_rec is the reconstruction loss, L_per is the perceptual loss, L_mask is the mask loss, α is the weight of the reconstruction loss, β is the weight of the perceptual loss, δ is the weight of the mask loss, I is the real fitting image, I′ is the predicted fitting image, φ_p(I) is a feature map extracted from the real fitting image, φ_p(I′) is a feature map extracted from the predicted fitting image, P is the number of feature maps φ_p(I) (or φ_p(I′)), M_I is the mask image corresponding to the real fitting image, and M_I′ is the mask image corresponding to the predicted fitting image.
In some embodiments, before the step of extracting the identity feature from the real fitting image to obtain the identity feature map, the method further includes:
Filling the real fitting image so that the resolution of the real fitting image meets a preset proportion.
In order to solve the above technical problem, in a second aspect, an embodiment of the present application provides a method for generating a fitting image, including:
Acquiring an image of clothes to be fitted and an image of a user;
deforming the clothes in the image of the clothes to be fitted according to the human body structure of the user in the user image to obtain a deformed image of the clothes to be fitted;
carrying out mask shielding treatment on the region of the user image corresponding to the clothes to be fitted to obtain a mask image of the user;
Inputting the deformed image of the clothes to be fitted and the mask image of the user into a fitting model to generate the fitting image, wherein the fitting model is trained by adopting the method for training a fitting model in the first aspect.
To solve the foregoing technical problem, in a third aspect, an embodiment of the present application provides a computer device, including:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as in the first aspect above.
To solve the above technical problem, in a fourth aspect, there is provided a computer readable storage medium storing computer executable instructions for causing a computer device to perform the method of the above first aspect.
The embodiment of the application has the beneficial effects that: in contrast to the prior art, in the method for training a fitting model provided by the embodiment of the present application, the fitting network serving as the skeleton structure of the fitting model includes a clothes coding network, an identity coding network and a fitting image generation network. Firstly, a training set is obtained, where the training set includes a plurality of training data, and each training data includes a clothes image and a real fitting image of a model wearing the corresponding clothes in the clothes image, that is, the real fitting image may be used as a real label. Then, for each training data, the clothes in the clothes image are deformed according to the model's human body structure in the real fitting image to obtain a deformed clothes image, in which the clothes are in a three-dimensional state adapted to the human body structure. Mask shielding treatment is performed on the region of the real fitting image corresponding to the clothes to obtain a mask image, which retains the features that need to be preserved, such as the identity features of the model, and occludes the original clothes features that need to be replaced when trying on clothes. The clothes coding network is adopted to perform feature extraction on the deformed clothes image to obtain a clothes feature code, and the identity coding network is adopted to perform feature extraction on the mask image to obtain an identity feature map. Finally, the clothes feature code and the identity feature map are input into the fitting image generation network for fusion to obtain a predicted fitting image, and the whole fitting network is iteratively trained until convergence according to the differences between each real fitting image and each predicted fitting image in the training set, so as to obtain the fitting model.
In this manner, the images that are input into the fitting image generation network for fusion are the deformed clothes image and the mask image. The clothes in the deformed clothes image are in a three-dimensional state adapted to the human body structure, and the mask image retains the features of the model that need to be preserved, such as the identity features, while occluding the original clothes features that need to be replaced when trying on clothes. Therefore, the predicted fitting image obtained after fusion can retain the identity features of the model without distortion, the try-on clothes fit the model closely, and the fitting effect is real and natural. As the fitting network is iteratively trained, the predicted fitting image generated by fusion gets continuously closer to the real fitting image (the real label), and an accurate fitting model is obtained. Therefore, a fitting image of a user generated with the fitting model can keep the identity features of the user without distortion, and the try-on clothes fit the user closely, so that the fitting effect is real and natural.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
FIG. 1 is a schematic diagram of a portion of a generative adversarial network according to some embodiments of the present application;
FIG. 2 is a flow chart of a method of training a fitting model according to some embodiments of the present application;
FIG. 3 is a schematic illustration of a garment deformation in accordance with some embodiments of the present application;
FIG. 4 is a schematic flow chart of a sub-process of step S20 in the method shown in FIG. 2;
FIG. 5 is a schematic illustration of garment keypoints according to some embodiments of the application;
FIG. 6 is a schematic diagram of human keypoints according to some embodiments of the application;
FIG. 7 is a mask image in some embodiments of the application;
FIG. 8 is a schematic flow chart of a sub-process of step S30 in the method shown in FIG. 2;
FIG. 9 is a human body analysis chart in some embodiments of the application;
FIG. 10 is a schematic diagram of the overall structure of a fitting network according to some embodiments of the present application;
FIG. 11 is a schematic flow chart of a sub-process of step S60 in the method of FIG. 2;
FIG. 12 is a flow chart of a method of generating fitting images in some embodiments of the application;
fig. 13 is a schematic structural diagram of a computer device according to some embodiments of the present application.
Detailed Description
The present application will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit the application in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present application.
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that, if not in conflict, the features of the embodiments of the present application may be combined with each other, which is within the protection scope of the present application. In addition, while functional block division is performed in a device diagram and logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. Moreover, the words "first," "second," "third," and the like as used herein do not limit the data and order of execution, but merely distinguish between identical or similar items that have substantially the same function and effect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.
In addition, the technical features of the embodiments of the present application described below may be combined with each other as long as they do not collide with each other.
In order to facilitate understanding of the method provided in the embodiments of the present application, first, terms related to the embodiments of the present application are described:
(1) Neural network
A neural network may be composed of neural units and can be understood as a neural network having an input layer, a hidden layer and an output layer; in general, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. A neural network with many hidden layers is called a deep neural network (DNN). From a physical point of view, the operation of each layer in the neural network can be described by the mathematical expression y = a(W·x + b), and can be understood as completing the transformation of the input space into the output space (i.e., the row space into the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are done by "W·x", operation 4 is done by "+b", and operation 5 is done by "a()". The word "space" is used here because the object to be classified is not a single thing but a class of things, and the space refers to the collection of all individuals of that class. W is the weight matrix of a layer of the neural network, and each value in the matrix represents the weight value of one neuron of that layer. The matrix W determines the spatial transformation of the input space into the output space described above, i.e., the W of each layer of the neural network controls how the space is transformed. The purpose of training the neural network is ultimately to obtain the weight matrices of all layers of the trained neural network. Thus, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
It should be noted that in the embodiments of the present application, the neural network is essentially based on the model employed by the machine learning task. Common components in the neural network comprise a convolution layer, a pooling layer, a normalization layer, a reverse convolution layer and the like, a model is designed and obtained by assembling the common components in the neural network, and when model parameters (weight matrixes of all layers) are determined so that model errors meet preset conditions or the number of adjusted model parameters reaches a preset threshold value, the model converges.
The convolution layer is configured with a plurality of convolution kernels, and each convolution kernel is provided with a corresponding step length so as to carry out convolution operation on the image. The purpose of the convolution operation is to extract different features of the input image, and the first layer of convolution layer may only extract some low-level features such as edges, lines, angles, etc., and the deeper convolution layer may iteratively extract more complex features from the low-level features.
The inverse convolution layer is used to map a low-dimensional space to a high-dimensional space while maintaining the connection/pattern between them (connection here refers to the connection at the time of convolution). The inverse convolution layer is configured with a plurality of convolution kernels, each of which is provided with a corresponding step size, to perform an inverse convolution operation on the image. Typically, a framework library for designing neural networks (e.g., the PyTorch library) has an upsample() function built in, and by calling this upsample() function, a low-dimensional to high-dimensional spatial mapping can be achieved.
The pooling layer simulates the way the human visual system reduces the dimensionality of data or represents an image with higher-level features. Common operations of the pooling layer include maximum pooling, mean pooling, random pooling, median pooling, combined pooling, and the like. Typically, pooling layers are periodically inserted between the convolution layers of a neural network to achieve dimension reduction.
The normalization layer is used for performing normalization operation on all neurons in the middle to prevent gradient explosion and gradient disappearance.
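To make the above components more concrete, the following is a minimal PyTorch-style sketch (the layer sizes and hyperparameters are illustrative assumptions, not part of the embodiments) showing how a convolution layer, a normalization layer, a pooling layer and an upsampling operation can be assembled:

```python
# Minimal PyTorch sketch (for illustration only) of the common components described
# above: convolution, normalization, pooling and upsampling assembled into a small block.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)  # downsampling convolution
        self.norm = nn.BatchNorm2d(out_ch)        # normalization layer
        self.act = nn.ReLU(inplace=True)          # nonlinearity a(.) in y = a(W*x + b)
        self.pool = nn.MaxPool2d(kernel_size=2)   # pooling layer for further dimension reduction
        self.up = nn.Upsample(scale_factor=2, mode="nearest")  # low-to-high dimensional mapping

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.norm(self.conv(x)))
        y = self.pool(y)
        return self.up(y)

if __name__ == "__main__":
    x = torch.randn(1, 3, 512, 512)     # a dummy 512*512 RGB image
    print(TinyBlock(3, 16)(x).shape)    # torch.Size([1, 16, 256, 256])
```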
(2) Loss function
In the process of training the neural network, because the output of the neural network is expected to be as close to the value actually expected, the weight matrix of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the actually expected target value (however, an initialization process is usually performed before the first update, that is, the parameters are preconfigured for each layer in the neural network), for example, if the predicted value of the network is higher, the weight matrix is adjusted to be lower than the predicted value, and the adjustment is continuously performed until the neural network can predict the actually expected target value. Thus, it is necessary to define in advance "how to compare the difference between the predicted value and the target value", which is a loss function (loss function) or an objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function is, the larger the difference is, and the training of the neural network becomes the process of reducing the loss as much as possible.
(3) Generative adversarial network
Referring to fig. 1, fig. 1 is a schematic diagram of a part of a generative adversarial network according to an embodiment of the present application. As shown in fig. 1, the generative adversarial network includes a mapping network S1 and an image generator S2. The mapping network S1 is configured to perform feature decoupling on the composite features contained in a feature latent code, so as to map the feature latent code into a plurality of groups of feature control vectors input to the image generator, and to input the plurality of groups of feature control vectors obtained by mapping into the image generator S2, so as to perform style control on the image generator. The image generator S2 is configured to perform style control and processing on a constant tensor based on the control vectors input from the mapping network S1, thereby generating an image. When the feature latent code reflects try-on clothing, the image generated by the image generator S2 includes the clothing.
As shown in fig. 1, the mapping network S1 includes 8 fully-connected layers, where the 8 fully-connected layers are sequentially connected and are used to perform nonlinear mapping on the feature latent codes to obtain an intermediate vector w. The intermediate vector w can reflect the features in the feature latent code, for example, can reflect the features of the try-on garment.
The image generator S2 comprises N sequentially arranged convolution modules. The first convolution module comprises a constant tensor const and a convolution layer, both of which are followed by an adaptive instance normalization layer. It will be appreciated that the constant tensor const corresponds to initial data for generating an image. Each of the remaining convolution modules, except the first one, includes two convolution layers, each of which is followed by an adaptive instance normalization layer (AdaIN). In the image generator S2, each convolution module outputs a feature map, and this feature map serves as the input of the next convolution module; as the convolution modules proceed layer by layer, the size of the output feature map becomes larger and larger, and the feature map output by the last convolution module is the generated image. It can be understood that the target size of the feature map output by each convolution module is preset; for example, the size of the feature map output by the 1st convolution module in fig. 1 is 4*4, the size of the feature map output by the 2nd convolution module is 8*8, and if the size of the finally generated image is 512*512, the size of the feature map output by the last convolution module is 512*512.
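For illustration, the following is a simplified sketch of such a mapping-network/image-generator structure. It only mirrors the description above (8 fully connected layers, a constant tensor, convolution modules followed by adaptive instance normalization, feature maps growing from 4*4 to 512*512); the channel widths, activation functions and AdaIN details are assumptions, not the exact generator of the embodiments:

```python
# Simplified StyleGAN-style sketch of the mapping network S1 and image generator S2
# described above. Channel widths and AdaIN details are illustrative assumptions.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """8 fully connected layers mapping a feature latent code to an intermediate vector w."""
    def __init__(self, dim: int = 512):
        super().__init__()
        layers = []
        for _ in range(8):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

class AdaIN(nn.Module):
    """Adaptive instance normalization: w controls per-channel scale and bias."""
    def __init__(self, channels: int, w_dim: int = 512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.affine = nn.Linear(w_dim, channels * 2)

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        scale, bias = self.affine(w).chunk(2, dim=1)
        return self.norm(x) * (1 + scale[:, :, None, None]) + bias[:, :, None, None]

class ConvModule(nn.Module):
    """One generator block: upsample, convolve, then style-modulate with AdaIN."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.adain = AdaIN(out_ch)

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        return self.adain(self.conv(self.up(x)), w)

class Generator(nn.Module):
    def __init__(self, w_dim: int = 512):
        super().__init__()
        self.mapping = MappingNetwork(w_dim)
        self.const = nn.Parameter(torch.randn(1, 256, 4, 4))   # constant tensor of the 1st module
        self.blocks = nn.ModuleList(
            ConvModule(c_in, c_out)
            for c_in, c_out in [(256, 256), (256, 128), (128, 128), (128, 64),
                                (64, 64), (64, 32), (32, 16)]   # 4*4 -> 512*512
        )
        self.to_rgb = nn.Conv2d(16, 3, 1)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        w = self.mapping(z)
        x = self.const.expand(z.shape[0], -1, -1, -1)
        for block in self.blocks:
            x = block(x, w)
        return self.to_rgb(x)

if __name__ == "__main__":
    img = Generator()(torch.randn(2, 512))
    print(img.shape)  # torch.Size([2, 3, 512, 512])
```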
Before describing the embodiments of the present application, a simple description of the virtual fitting method known to the inventor of the present application is provided, so that the understanding of the embodiments of the present application is facilitated.
Offline fitting generally adopts an interactive fitting mirror device: a shopper stands in front of the device and selects try-on clothes through its screen; the device acquires an image of the user and displays the overall dressing effect according to the user's basic characteristics and the clothes style, so that the user can freely switch clothes. Online fitting is generally performed by capturing an image of the user and selecting a target garment provided by the system for automatic replacement. However, whether an interactive fitting mirror device or online virtual fitting is used, the user's figure is remodeled through 3D modeling by collecting human body data, that is, a 3D human body model is built from a two-dimensional human body image and the try-on clothing is arranged at the corresponding position of the 3D human body model. The combination of the clothing and the user's figure is often not close enough and lacks a natural feel, leading to a relatively poor user experience, and collecting 3D information of the human body and the clothing is usually costly and cumbersome.
In view of the foregoing, embodiments of the present application provide a method for training a fitting model, and embodiments of the present application are described below with reference to the accompanying drawings. As a person skilled in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is applicable to similar technical problems.
Referring to fig. 2, fig. 2 is a flowchart of a method for training a fitting model according to an embodiment of the present application, where a fitting network as a skeleton structure of the fitting model includes a clothes coding network, an identity coding network, and a fitting image generating network, and the method S100 may specifically include the following steps:
S10: a training set is obtained.
The training set includes a plurality of training data, each training data including an image of a garment and a real fit image including an image of a model wearing a corresponding garment of the aforementioned images of the garment.
It will be appreciated that each piece of training data comprises a pair consisting of a clothes image and a real fitting image. In some embodiments, the number of training data is in the tens of thousands, for example 20000, which is advantageous for training an accurate general-purpose model. The number of training data can be determined by a person skilled in the art according to the actual situation.
In the image pair composed of the clothes image and the real fitting image, the clothes image includes the clothes to be tried on; for example, clothes image 1# includes a green short sleeve. The model in the real fitting image wears the clothes in the corresponding clothes image; for example, the model in the real fitting image corresponding to clothes image 1# wears the green short sleeve.
It will be appreciated that training data including images of clothing and images of real fitting may be gathered in advance by those skilled in the art, for example, on some clothing vending sites the clothing images and corresponding model images with the clothing worn (i.e., real fitting images).
In some embodiments, the real fitting image is a full-body shot of the model, whose image proportion is similar to the aspect ratio of the human body. However, the aspect ratio of the human body does not meet the network training requirements, i.e. the aspect ratio of the real fitting image does not meet the network training requirements. Therefore, the real fitting images in the training set need to be preprocessed so that their proportions meet the network training requirements.
In some embodiments, prior to training, further comprising:
And filling the real fitting image so that the resolution of the real fitting image meets the preset proportion.
Since the real fitting image is a full-body shot of the model and its proportion is similar to the aspect ratio of the human body, in order not to change the model's figure, width-filling processing is performed on the real fitting image: pixels are filled according to the difference between the image length and width, for example pixels with value 0, so that the resolution of the real fitting image meets the preset proportion.
In some embodiments, the preset ratio is 1:1, that is, the length and width of the real fitting image after the filling treatment are consistent. For example, the resolution of the real fitting image may be 512 x 512. In some embodiments, before training, normalization operation can be performed on the real fitting image after the filling processing, so that the training speed of the fitting network is conveniently increased. Normalization based on normalization is a common data processing approach well known to those skilled in the art and is not explained in detail here.
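A possible implementation of this width-filling and normalization step is sketched below, assuming a preset proportion of 1:1 and a target resolution of 512*512; the function name and the use of OpenCV are illustrative choices:

```python
# Sketch of the width padding and normalization described above (assumed 1:1 ratio,
# assumed 512*512 target resolution; details are illustrative).
import numpy as np
import cv2

def pad_and_normalize(image: np.ndarray, size: int = 512) -> np.ndarray:
    h, w = image.shape[:2]
    diff = max(h - w, 0)
    left, right = diff // 2, diff - diff // 2
    # pad the width with 0-valued pixels so the aspect ratio becomes 1:1
    padded = cv2.copyMakeBorder(image, 0, 0, left, right,
                                borderType=cv2.BORDER_CONSTANT, value=0)
    padded = cv2.resize(padded, (size, size))
    # normalize pixel values to [0, 1] to speed up training
    return padded.astype(np.float32) / 255.0
```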
S20: and deforming clothes in the clothes image according to the model human body structure in the real fitting image to obtain a deformed clothes image.
It will be appreciated that in some embodiments, the real fit image herein may be a real fit image after a padding process and a normalization process.
Considering that the clothes in the clothes image are in a two-dimensional plane state, however, the human body structure is three-dimensional, in order to enable the clothes to be suitable for the human body structure when the clothes and the model are fused, the clothes in the clothes image are deformed according to the human body structure of the model in the real fitting image, and the deformed clothes image is obtained. As shown in fig. 3, the clothing image 1# is deformed according to the human body structure in the real fitting image 1# to obtain a deformed clothing image 1#. Therefore, the clothes in the deformed clothes image are in a three-dimensional state and are suitable for the human body structure, the clothes and the human body are combined and attached after being fused, and the fitting effect is real and natural.
In some embodiments, referring to fig. 4, the step S20 specifically includes:
S21: and detecting key points of the clothes image by adopting a clothes key point detection algorithm to acquire clothes key point information.
S22: and carrying out key point detection on the real fitting image by adopting a human key point detection algorithm to acquire human key point information.
S23: an affine change matrix between the clothing key point information and the human body key point information is calculated.
S24: and carrying out affine change on the clothes pixels in the clothes image according to an affine change matrix to obtain a deformed clothes image.
The key point detection algorithm is adopted to detect the key points of the clothes image, so that the key point information of the clothes (namely a plurality of key points on the clothes) can be positioned, and as shown in fig. 5, the key points of the areas including cuffs, necklines, shoulders, lower hem and the like are included. In some embodiments, the neural network implementing the clothing key point detection algorithm may be trained based on a sample image, where the sample image includes clothing and is labeled with reference key point coordinates of the clothing, and the neural network includes a plurality of convolution layers, pooling layers, full-connection layers, or the like, and the structure of the neural network may be an existing deep convolution neural network or may be designed by a person skilled in the art.
The key point detection algorithm is adopted to detect the key points of the real fitting image, so that the key point information of the human body (namely a plurality of key points on the human body) can be positioned, as shown in fig. 6, including the key points of the head, shoulders, arms, legs, trunk and other areas. In some embodiments, the human keypoint detection algorithm may employ a 2D keypoint detection algorithm, such as Convolutional Pose Machine (CPM) or Stacked Hourglass Network (Hourglass), or the like.
It is understood that, since the clothes key point information includes key points of areas such as the cuffs, neckline, shoulders and hem, and the human body key point information includes key points of areas such as the head, shoulders, arms, legs and torso, the areas reflected by the clothes key point information correspond to the areas reflected by the human body key point information; for example, the shoulders of the clothes correspond to the shoulders of the human body, the cuffs of the clothes correspond to the arms of the human body, the hem corresponds to the hips of the human body, and so on.
It can be understood that if the key point information of the deformed clothes is more similar to the corresponding part of the key point information of the human body, for example, the key points of the shoulder area of the clothes are similar to the key points of the shoulder area of the human body, etc., the clothes and the model can be attached to the human body when being fused, and the clothes look true and natural.
In some embodiments, the affine transformation matrix H between the clothes key point information C and the human body key point information R may be calculated by the LM algorithm and a least squares algorithm, where R = H × C. Thus, after the clothes key point information C is multiplied by the affine transformation matrix H, the obtained key point information of the deformed clothes matches the corresponding parts of the human body key point information.
The clothes pixels in the clothes image are then transformed according to the affine transformation matrix to obtain the deformed clothes image. Specifically, the calculation can be performed using the following formula:

(x′_i, y′_i) = H · (x_i, y_i)

where (x_i, y_i) are the coordinates of a clothes pixel in the clothes image, (x′_i, y′_i) are the coordinates of the corresponding clothes pixel in the deformed clothes image, and H is the affine transformation matrix calculated as described above.
In this embodiment, the affine transformation matrix used for the affine transformation is calculated from the clothes key point information in the clothes image and the human body key point information in the real fitting image, and the clothes pixels in the clothes image are then transformed according to this matrix, so that an accurate deformed clothes image can be obtained. This helps the clothes fit the human body when the clothes and the model are fused, so that the clothes look true and natural.
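The following sketch illustrates steps S21 to S24 under assumed key point formats. An ordinary least-squares solve is used here as a stand-in for the LM/least-squares procedure described above, and the function names are illustrative:

```python
# Sketch of steps S21-S24: estimate the affine transformation matrix H from matched
# clothes/human keypoints (least squares on R = H * C) and warp the clothes pixels.
import numpy as np
import cv2

def estimate_affine(clothes_kpts: np.ndarray, body_kpts: np.ndarray) -> np.ndarray:
    """Least-squares solution of R = H * C for matched keypoints in homogeneous coordinates."""
    n = clothes_kpts.shape[0]
    C = np.hstack([clothes_kpts, np.ones((n, 1))])            # n x 3
    H_t, _, _, _ = np.linalg.lstsq(C, body_kpts, rcond=None)  # solves C @ H^T ≈ R
    return H_t.T                                              # 2 x 3 affine transformation matrix

def warp_clothes(clothes_img: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Apply the affine transformation to every clothes pixel."""
    h, w = clothes_img.shape[:2]
    return cv2.warpAffine(clothes_img, H, (w, h))
```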
S30: and carrying out mask shielding treatment on the region corresponding to the clothes on the real fitting image to obtain a mask image.
It will be appreciated that when changing the clothes of the model in the real fitting image, the features of the model, such as its identity features, need to be preserved, while the original clothes features that will be replaced by the try-on clothes need to be occluded. For example, when the model is to try on the green short sleeve, the black short sleeve it is currently wearing needs to be occluded, and the other regions (identity features such as the head and lower limbs) are retained, so as to obtain a mask image as shown in fig. 7.
Before the clothes and the model are fused, mask shielding treatment is applied to the old clothes at the position of the clothes to be tried on. On the one hand, this avoids the interference that the features of the original old clothes would cause during fusion; on the other hand, it preserves the identity features of the model, so that the model is not distorted after the try-on clothes are put on.
In some embodiments, referring to fig. 8, the step S30 specifically includes:
S31: and carrying out part analysis on the real fitting image by adopting a human body analysis algorithm to obtain a human body analysis chart.
S32: and determining a mask matrix according to the human body analysis chart and the types of clothes in the clothes image.
S33: multiplying the real fitting image and the corresponding position of the mask matrix to obtain a mask image.
The human body analysis is to divide each part of the human body, as shown in fig. 9, and different parts, such as hair, face, coat, trousers, arms, hat, shoes, etc., are identified and divided, and expressed with different colors, so as to obtain the human body analysis chart.
In some embodiments, the human body analysis algorithm may be the existing Graphonomy algorithm. The Graphonomy algorithm can divide the image into 20 categories and can segment different parts with different colors. In some embodiments, the 20 categories described above may also be denoted by the reference numerals 0-19, for example 0 for background, 1 for hat, 2 for hair, 3 for glove, 4 for sunglasses, 5 for coat, 6 for dress, 7 for coat, 8 for sock, 9 for trousers, 10 for torso skin, 11 for scarf, 12 for half skirt, 13 for face, 14 for left arm, 15 for right arm, 16 for left leg, 17 for right leg, 18 for left shoe, and 19 for right shoe.
From the human body analysis chart, the category to which each part in the image belongs can be determined. Thus, the mask matrix may be determined based on the human body analysis chart and the type of clothing in the clothing image (i.e., the type of try-on clothing, such as coat or pants, etc.). Wherein the mask matrix reflects the weights of the pixels in the real fitting image. And finally, multiplying the real fitting image by the corresponding position of the mask matrix to obtain a mask image.
For example, if the resolution of the real fitting image is 512*512, the mask matrix is a 512*512 matrix. Suppose the model in the real fitting image wears a black short sleeve and the clothes image is a green short sleeve (i.e. the try-on clothes is an upper garment); the black short sleeve in the real fitting image needs to be replaced by the green short sleeve and the other regions remain unchanged, so the value of the region corresponding to the black short sleeve in the mask matrix may be 0, and the value of the other regions may be 1. After the mask matrix is multiplied with the real fitting image at corresponding positions, in the obtained mask image the pixel values of the region corresponding to the black short sleeve are 0, and the pixel values of the other regions are still the pixel values in the real fitting image.
The corresponding position multiplication can be explained by using a formula M ij=Fij×mij, wherein F ij represents the pixel value of the ith row and the jth column in the real fitting image, M ij represents the value of the ith row and the jth column in the mask matrix, and M ij represents the pixel value of the ith row and the jth column in the mask plate image.
In the embodiment, the mask matrix is constructed by adopting the human body analysis chart and the types of the clothes in the clothes image, and the mask image can be obtained by multiplying the real fitting image and the corresponding position of the mask matrix, so that the identity characteristics of the model can be accurately reserved in the mask image, the original clothes characteristics needing to be replaced are removed, and the model is not distorted after the clothes needing to be tried on are replaced.
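A minimal sketch of steps S31 to S33 is given below, assuming the parse map uses the 0-19 labels listed above and that the try-on clothes is an upper garment (label 5); the names and the label choice are illustrative:

```python
# Sketch of steps S31-S33: build the mask matrix from the human body parse map and
# multiply it with the real fitting image at corresponding positions (M_ij = F_ij * m_ij).
import numpy as np

def build_mask_image(fit_img: np.ndarray, parse_map: np.ndarray,
                     clothes_label: int = 5) -> np.ndarray:
    """Zero out the region of the original garment, keep everything else."""
    mask_matrix = (parse_map != clothes_label).astype(fit_img.dtype)  # 0 on old clothes, 1 elsewhere
    return fit_img * mask_matrix[..., None]                           # element-wise multiplication
```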
S40: and carrying out feature extraction on the deformed clothes image by adopting a clothes coding network to obtain clothes feature codes, and carrying out feature extraction on the mask image by adopting an identity coding network to obtain an identity feature map.
In order to enable the deformed clothes image and the mask image to be fully fused, firstly, the clothes encoding network is adopted to downsample the deformed clothes image, and the characteristics are extracted, so that clothes characteristic codes are obtained. In some embodiments, the clothing feature code may be a 1 x 512 size vector obtained by downsampling the warped clothing image.
In some embodiments, the structure of the garment encoding network is as shown in table 1 below:
Table 1 Structure of the clothes coding network

Layer(s) | Convolution kernel size | Step size | Size of output feature map
Clothing image X | - | - | 512*512*3
Convolutional layer | 1*1 | 2 | 256*256*16
Convolutional layer | 3*3 | 2 | 128*128*32
Convolutional layer | 3*3 | 2 | 64*64*32
Convolutional layer | 3*3 | 2 | 32*32*64
Convolutional layer | 3*3 | 2 | 16*16*64
Convolutional layer | 3*3 | 2 | 8*8*128
Convolutional layer | 3*3 | 2 | 4*4*256
Full connection layer | - | - | 1*1024
Full connection layer | - | - | 1*512
As can be seen from table 1 above, the clothing coding network includes a plurality of convolution layers and a full connection layer, most of the convolution layers are configured with a 3*3-sized convolution kernel, and the step size is set to 2, thereby implementing the downsampling and dimension reduction operations.
The identity coding network is adopted to downsample the mask image and extract features, obtaining the identity feature map. The identity coding network includes a plurality of convolution layers, which may have the same structure as the plurality of convolution layers in the clothes coding network described above. It will be appreciated that the identity coding network outputs a 4*4-sized identity feature map.
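For illustration, the following sketch follows the channel widths and kernel sizes of Table 1 for the clothes coding network (the identity coding network can reuse the same convolution stack but keep the 4*4 feature map instead of flattening it); activations and padding are assumptions:

```python
# Sketch of the clothes coding network following Table 1: seven stride-2 convolutions
# from 512*512*3 down to 4*4*256, then two fully connected layers to a 1*512 code.
import torch
import torch.nn as nn

class ClothesEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 16, 32, 32, 64, 64, 128, 256]
        kernels = [1, 3, 3, 3, 3, 3, 3]
        layers = []
        for i, k in enumerate(kernels):
            layers += [nn.Conv2d(chans[i], chans[i + 1], k, stride=2, padding=k // 2),
                       nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)            # 512*512*3 -> 4*4*256
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(4 * 4 * 256, 1024),
                                nn.ReLU(inplace=True),
                                nn.Linear(1024, 512))  # 1*512 clothes feature code

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.convs(x))

if __name__ == "__main__":
    code = ClothesEncoder()(torch.randn(1, 3, 512, 512))
    print(code.shape)  # torch.Size([1, 512])
```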
S50: and inputting the clothes feature codes and the identity feature map into a fitting image generation network for fusion to obtain a predicted fitting image.
Wherein the fitting image generation network is a generative adversarial network. The structure and operation of the generative adversarial network are described in detail in the foregoing term explanation (3) and are not repeated here. As described above, the fitting image generation network comprises a mapping network and an image generator. In some embodiments, the image generator includes 8 convolution modules, and the sizes of the feature maps output by the 8 convolution modules are 4*4, 8*8, 16*16, 32*32, 64*64, 128*128, 256*256 and 512*512 in order.
The clothes feature code generates a 1*512 intermediate vector after feature decoupling by the mapping network S1 of the fitting image generation network. After the intermediate vector is input into the 1st convolution module and upsampled, a 4*4-sized feature map is generated; this 4*4-sized feature map and the 4*4-sized identity feature map are linearly or nonlinearly fused and then input into the 2nd convolution module for upsampling. After the subsequent convolution modules perform upsampling layer by layer in turn, a predicted fitting image of size 512*512 is obtained, which has both the clothes features and the identity features of the model in the real fitting image.
It should be understood that linear fusion refers to applying a first-order (linear) function to the two images (feature maps) to obtain a fused feature map, and nonlinear fusion refers to applying a second-order or higher-order function to the two images to obtain a fused feature map.
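A minimal illustration of the two kinds of fusion, applied for example to the 4*4 generator feature map and the 4*4 identity feature map, might look as follows (the weights and the particular second-order function are illustrative):

```python
# Illustration of linear (first-order) vs. nonlinear (second-order) fusion of two
# feature maps of the same size; weights and channel counts are assumptions.
import torch

def linear_fusion(f1: torch.Tensor, f2: torch.Tensor, a: float = 0.5, b: float = 0.5):
    # first-order (linear) function of the two feature maps
    return a * f1 + b * f2

def nonlinear_fusion(f1: torch.Tensor, f2: torch.Tensor):
    # one possible second-order function of the two feature maps
    return f1 * f2 + 0.5 * (f1 + f2)

gen_4x4 = torch.randn(1, 256, 4, 4)       # feature map from the 1st convolution module (assumed 256 channels)
identity_4x4 = torch.randn(1, 256, 4, 4)  # 4*4 identity feature map from the identity coding network
fused = linear_fusion(gen_4x4, identity_4x4)  # input to the 2nd convolution module
```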
In some embodiments, the fitting network further comprises a texture coding network, further comprising, prior to the step S50: and adopting the texture coding network to extract texture features of the clothes image to obtain at least one texture feature map.
In order to prevent the texture information of the try-on clothes from being lost in the fusion process of the clothes feature codes and the identity feature images, a texture coding network is constructed, feature extraction is carried out on the clothes images, and the texture features are extracted, so that at least one texture feature image is obtained.
The texture coding network comprises a plurality of convolution layers for downsampling the clothing image to extract texture features, and outputting a texture feature map with a size corresponding to the convolution layers for each convolution layer. In some embodiments, the texture encoding network is structured as shown in table 2 below:
Table 2 Structure of the texture coding network

Layer(s) | Convolution kernel size | Step size | Size of output feature map
Clothing image X | - | - | 512*512*3
Convolutional layer | 1 | 1 | 512*512*16
Convolutional layer | 3 | 1 | 512*512*64
Convolutional layer | 3 | 2 | 256*256*64
Convolutional layer | 1 | 1 | 256*256*128
Convolutional layer | 3 | 2 | 128*128*256
Convolutional layer | 1 | 1 | 128*128*256
As can be seen from table 2 above, the texture coding network outputs a plurality of texture feature maps of different sizes. Different sized texture feature maps can reflect different granularity of clothing texture features.
In this embodiment, the step S50 specifically includes: inputting the clothes feature codes, the identity feature images and at least one texture feature image into a fitting image generation network for fusion to obtain a predicted fitting image.
It will be appreciated that the "at least one texture feature map" here is selected from the texture feature maps output by the texture coding network described above. For example, continuing the embodiment of Table 2, the "at least one texture feature map" to be fused may be the texture feature maps of sizes 128*128 and 256*256.
Referring to fig. 10, after a 128 x 128 size texture feature map is linearly fused or non-linearly fused with a 128 x 128 size feature map output by a 6 th convolution module of an image generator in a fitting image generation network, a 128 x 128 size fused feature map is obtained. Then, the fusion feature map with the size of 128 x 128 is input into a 7 th convolution module for up-sampling, 256 x 256 feature maps are output, and the 256 x 256 feature maps are subjected to linear fusion or nonlinear fusion with the 256 x 256 feature maps output by the 7 th convolution module, so that a 256 x 256 fusion feature map is obtained. As the convolution layer of the image generator advances, the 256×256-size fusion feature map is input to the 8 th convolution module to be up-sampled, and a 512×512-size predicted fitting image is output.
In this embodiment, by constructing the texture coding network, at least one texture feature map is fused with a feature map generated by the fitting image generating network in the up-sampling process, so that the generated predicted fitting image can store texture details of the fitting clothes, and therefore, the fitting network can learn the texture details of the clothes in the clothes image, and the fitting model obtained by training has the capability of retaining the texture details of the fitting clothes, generates a fitting image with the texture details of the clothes, and enables a user to see a real fitting effect of the clothes.
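One possible way to inject a texture feature map into the up-sampling path at the matching resolution is sketched below; the 1*1 projection used to align channel counts and the fusion weights are assumptions for illustration:

```python
# Sketch of injecting a texture feature map into the generator's up-sampling path at
# the matching resolution (here 128*128, before the 7th convolution module).
import torch
import torch.nn as nn

class TextureInjection(nn.Module):
    def __init__(self, tex_ch: int, gen_ch: int, a: float = 0.5, b: float = 0.5):
        super().__init__()
        self.project = nn.Conv2d(tex_ch, gen_ch, kernel_size=1)  # align channel counts
        self.a, self.b = a, b

    def forward(self, gen_feat: torch.Tensor, tex_feat: torch.Tensor) -> torch.Tensor:
        # linear fusion of the generator feature map with the projected texture feature map
        return self.a * gen_feat + self.b * self.project(tex_feat)

gen_128 = torch.randn(1, 64, 128, 128)    # output of the 6th convolution module (assumed 64 channels)
tex_128 = torch.randn(1, 256, 128, 128)   # 128*128*256 texture feature map from Table 2
fused_128 = TextureInjection(256, 64)(gen_128, tex_128)  # fed to the 7th convolution module
```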
S60: and carrying out iterative training on the fitting network according to the difference between each real fitting image and each predicted fitting image in the training set until the fitting network converges, so as to obtain a fitting model.
It can be appreciated that if the difference between each real fitting image and each predicted fitting image in the training set is smaller, the real fitting images and the predicted fitting images are more similar, which means that the predicted fitting images can accurately restore the real fitting images. Therefore, the model parameters of the fitting network can be adjusted according to the difference between each real fitting image and each predicted fitting image in the training set. In some embodiments, the fitting network comprises a clothes coding network, an identity coding network and a fitting image generation network, and the model parameters comprise the model parameters of the clothes coding network, the model parameters of the identity coding network and the model parameters of the fitting image generation network. In some embodiments, the fitting network comprises a clothes coding network, an identity coding network, a fitting image generation network and a texture coding network, and the model parameters comprise the model parameters of the clothes coding network, the model parameters of the identity coding network, the model parameters of the texture coding network and the model parameters of the fitting image generation network. The difference is back-propagated, so that the predicted fitting image output by the fitting network continuously approaches the real fitting image until the fitting network converges, and a fitting model is obtained.
It is to be understood that fitting network convergence here may refer to that under certain model parameters, the sum of differences between real fitting images and predicted fitting images in the training set is less than a preset threshold or fluctuates within a certain range.
In some embodiments, the model parameters are optimized with the Adam algorithm. For example, the number of iterations is set to 100,000, the initial learning rate is set to 0.001, the weight decay is set to 0.0005, and the learning rate decays to 1/10 of its previous value every 1000 iterations. The learning rate and the differences between the real fitting images and the predicted fitting images in the training set are input into the Adam algorithm to obtain adjusted model parameters output by the Adam algorithm; the next round of training is performed with the adjusted model parameters until training is completed, and the model parameters of the converged fitting network are output to obtain the fitting model.
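The following sketch shows this optimizer and learning-rate schedule as stated above (100,000 iterations, initial learning rate 0.001, weight decay 0.0005, learning rate multiplied by 1/10 every 1000 iterations); the fitting_network and loss_fn objects are placeholders, and the adversarial (discriminator) update is omitted for brevity:

```python
# Sketch of the Adam-based training loop described above; network, loss and data
# loader are placeholders, and the discriminator update is omitted.
import torch

def train(fitting_network, loss_fn, data_loader, num_iters: int = 100_000):
    optimizer = torch.optim.Adam(fitting_network.parameters(),
                                 lr=0.001, weight_decay=0.0005)
    # multiply the learning rate by 0.1 every 1000 iterations, as stated in the text
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)
    step = 0
    while step < num_iters:
        for deformed_clothes, mask_img, real_fit_img in data_loader:
            pred_fit_img = fitting_network(deformed_clothes, mask_img)
            loss = loss_fn(real_fit_img, pred_fit_img)  # difference between real and predicted images
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            step += 1
            if step >= num_iters:
                break
    return fitting_network  # the converged parameters form the fitting model
```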
It should be noted that, in the embodiment of the present application, the training set includes a plurality of training data, for example 20000 training data, which covers different models and clothes and can cover most of the characteristics of clothes on the market. Therefore, the trained fitting model is a universal model and can be widely used for virtual fitting and generating fitting images.
In some embodiments, referring to fig. 11, the step S60 specifically includes:
S61: calculating the difference between each real fitting image and each predicted fitting image in the training set by adopting a loss function, wherein the loss function comprises an adversarial loss, a reconstruction loss and a perceptual loss between the real fitting image and the predicted fitting image, and a mask loss between the mask image corresponding to the real fitting image and the mask image corresponding to the predicted fitting image.
S62: performing iterative training on the fitting network according to the difference until the fitting network converges to obtain the fitting model.
The function of the loss function is described in detail in the foregoing "description of noun (2)", and the detailed description thereof is not repeated here. It will be appreciated that the structure of the loss function may be set according to the actual situation, based on the network structure and the training set. In this embodiment, the penalty function reflects the fight penalty, reconstruct penalty, and sense penalty between the real fit image and the predicted fit image, and the mask penalty between the mask image corresponding to the real fit image and the mask image corresponding to the predicted fit image.
The countermeasures loss is the loss of whether the predicted fitting image is the corresponding real fitting image, when the countermeasures loss is large, the distribution difference between the predicted fitting image and the real fitting image is larger, and when the countermeasures loss is small, the distribution difference between the predicted fitting image and the real fitting image is smaller and similar. Here, the distribution of the fitting image means the distribution of each part in the image, for example, the distribution of clothes, head, limbs, and the like.
The reconstruction loss is a pixel-level difference loss between the predicted fitting image and the real fitting image. When the reconstruction loss is large, the difference between the pixel values of the predicted fitting image and those of the real fitting image is large; when the reconstruction loss is small, the difference is small. The reconstruction loss compares the pixels of the predicted fitting image with the pixels of the real fitting image, so that the pixel values of corresponding parts become close.
The perceptual loss compares the feature maps obtained by convolving the real fitting image with the feature maps obtained by convolving the predicted fitting image, so that their high-level information (content and global structure) becomes close.
The mask loss compares the mask image corresponding to the real fitting image with the mask image corresponding to the predicted fitting image, constraining the preserved region of the predicted fitting image to be close to the preserved region of the real fitting image, so that the predicted fitting image is not distorted.
In some embodiments, the loss function is:

L = L_adv + α·L_rec + β·L_per + δ·L_mask

wherein L_adv is the adversarial loss, L_rec is the reconstruction loss, L_per is the perceptual loss, L_mask is the mask loss, α is the weight of the reconstruction loss, β is the weight of the perceptual loss, δ is the weight of the mask loss, I is the real fitting image, I′ is the predicted fitting image, φ_p(I) is the feature map extracted from the real fitting image, φ_p(I′) is the feature map extracted from the predicted fitting image, P is the number of the feature maps φ_p(I) or φ_p(I′), M_I is the mask image corresponding to the real fitting image, and M_I′ is the mask image corresponding to the predicted fitting image. The reconstruction loss is computed from I and I′, the perceptual loss from the P pairs of feature maps φ_p(I) and φ_p(I′), and the mask loss from M_I and M_I′.
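The following sketch illustrates one possible way of combining the four terms with the weights α, β and δ; the L1 norms and the generic discriminator output used here are assumptions made purely for illustration, since the application does not fix these details verbatim:

```python
import torch
import torch.nn.functional as F

def total_loss(disc_fake_score, real_img, pred_img, real_feats, pred_feats,
               real_mask, pred_mask, alpha=1.0, beta=1.0, delta=1.0):
    # Adversarial term: push the discriminator's score on the predicted image towards "real".
    adv = F.binary_cross_entropy_with_logits(disc_fake_score, torch.ones_like(disc_fake_score))
    # Reconstruction term: pixel-level difference between predicted and real fitting images.
    rec = F.l1_loss(pred_img, real_img)
    # Perceptual term: average difference over the P pairs of feature maps.
    per = sum(F.l1_loss(pf, rf) for pf, rf in zip(pred_feats, real_feats)) / len(real_feats)
    # Mask term: constrain the preserved region of the predicted image to match the real one.
    msk = F.l1_loss(pred_mask, real_mask)
    return adv + alpha * rec + beta * per + delta * msk
```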
In some embodiments, a convolutional neural network such as VGG may be used to downsample the real fitting image and extract the P feature maps φ_p(I); similarly, a convolutional neural network such as VGG may be used to downsample the predicted fitting image and extract the P feature maps φ_p(I′).
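A possible way to obtain such feature maps with a pre-trained VGG network is sketched below; the choice of VGG-19 layer indices and the torchvision weights API are illustrative assumptions, not requirements of this application:

```python
import torch
import torchvision

# Frozen, pre-trained VGG-19 used only as a feature extractor.
vgg = torchvision.models.vgg19(weights=torchvision.models.VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract_feature_maps(image, layer_ids=(3, 8, 17, 26)):
    """image: (N, 3, H, W) tensor; returns a list of P = len(layer_ids) downsampled feature maps."""
    feats, x = [], image
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in layer_ids:
            feats.append(x)
    return feats
```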
In some embodiments, after a human body parsing algorithm is used to parse the real fitting image and the predicted fitting image respectively, the region corresponding to the fitting garment is masked according to the parsing results, so as to obtain M_I and M_I′ respectively.
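As a rough illustration, the following sketch masks out the garment region given an integer label map produced by some human parsing model; the label value used for the garment and the shape conventions are assumptions, not specified by this application:

```python
import numpy as np

def garment_mask(parse_map, garment_labels=(5,)):
    """parse_map: (H, W) integer label map from a human parsing model.
    Returns a binary map that is 0 on the garment region and 1 elsewhere."""
    keep = np.ones_like(parse_map, dtype=np.float32)
    for label in garment_labels:
        keep[parse_map == label] = 0.0
    return keep

# Applying the binary map element-wise to an (H, W, 3) fitting image
# yields the mask image M_I (or M_I') with the garment region occluded:
# masked_image = image * garment_mask(parse_map)[..., None]
```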
Therefore, by iteratively training the fitting network based on the difference calculated with a loss function comprising the adversarial loss, the reconstruction loss, the perceptual loss and the mask loss, the predicted fitting image can be constrained to continuously approach the real fitting image in four aspects, namely distribution, pixels, content features and preserved regions, thereby improving the fitting effect of the trained fitting model.
In summary, in the method for training a fitting model provided by the embodiment of the application, the fitting network serving as the skeleton of the fitting model comprises a garment encoding network, an identity encoding network and a fitting image generation network. First, a training set is acquired, which comprises a plurality of training data; each training data comprises a clothes image and a real fitting image of the model wearing the corresponding clothes in the clothes image, that is, the real fitting image can be used as a real label. Then, for each training data, the clothes in the clothes image are deformed according to the model's human body structure in the real fitting image to obtain a deformed clothes image, in which the clothes are in a three-dimensional state and adapted to the human body structure. The region of the real fitting image corresponding to the clothes is masked to obtain a mask image; the mask image retains the features to be preserved, such as the identity features of the model, and blocks the original clothes features to be replaced when fitting the clothes. The garment encoding network is used to extract features from the deformed clothes image to obtain a garment feature code, and the identity encoding network is used to extract features from the mask image to obtain an identity feature map. Finally, the garment feature code and the identity feature map are input into the fitting image generation network for fusion to obtain a predicted fitting image, and the whole fitting network is iteratively trained according to the differences between each real fitting image and each predicted fitting image in the training set until convergence, so as to obtain the fitting model.
In this way, the images input into the fitting image generation network for fusion are the deformed clothes image and the mask image. The clothes in the deformed clothes image are in a three-dimensional state and adapted to the human body structure, while the mask image retains the features of the model that need to be preserved, such as the identity features, and blocks the original clothes features that need to be replaced when the clothes are tried on. Therefore, the predicted fitting image obtained after fusion can retain the identity features of the model without distortion, and the fitting clothes fit closely to the model, so that the fitting effect is real and natural. As the fitting network is iteratively trained, the predicted fitting image generated by fusion continuously approaches the real fitting image (the real label), and an accurate fitting model is obtained. Therefore, a fitting image of a user generated with this fitting model can retain the identity features of the user without distortion, and the fitting clothes fit closely to the user, so that the fitting effect is real and natural.
After the fitting model is obtained through training by the method for training the fitting model, virtual fitting can be carried out by using the fitting model, and fitting images are generated. Referring to fig. 12, fig. 12 is a flowchart of a method for generating a fitting image according to an embodiment of the present application, as shown in fig. 12, the method S200 includes the following steps:
S201: and acquiring an image of the garment to be fitted and an image of the user.
Wherein, the image of the garment to be fitted comprises the garment, and the user image comprises the body of the user.
S202: And deforming the garment in the image of the garment to be fitted according to the human body structure of the user in the user image to obtain the deformed image of the garment to be fitted.
Considering that the garment in the image of the garment to be fitted is in a two-dimensional plane state while the human body structure of the user is three-dimensional, in order to adapt the garment to the human body structure when the two are fused, the garment in the image of the garment to be fitted is deformed according to the human body structure of the user in the user image, so as to obtain the deformed image of the garment to be fitted.
In some embodiments, the garment in the image of the garment to be fitted may be deformed according to the user's human body structure in the user image with reference to the aforementioned steps S21-S24, which will not be described in detail here.
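For illustration only, the sketch below estimates an affine transformation matrix from matched garment and body keypoints and warps the garment image with it; the keypoint detectors themselves are assumed to be available, and OpenCV is used here as a stand-in rather than the application's own implementation:

```python
import cv2
import numpy as np

def warp_garment(garment_img, garment_kpts, body_kpts):
    """garment_kpts / body_kpts: matched (x, y) pairs (at least three) from the
    garment keypoint detector and the human keypoint detector respectively."""
    src = np.asarray(garment_kpts, dtype=np.float32)
    dst = np.asarray(body_kpts, dtype=np.float32)
    affine, _ = cv2.estimateAffine2D(src, dst)           # 2x3 affine transformation matrix
    h, w = garment_img.shape[:2]                          # assumes garment and target share a canvas size
    return cv2.warpAffine(garment_img, affine, (w, h))    # deformed garment image
```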
S203: And carrying out mask shielding treatment on the user image in the region corresponding to the garment to be fitted to obtain a mask image of the user.
It will be appreciated that when changing the clothes of the user in the user image, the user's identity features and the like need to be preserved, while the original clothing features that will be replaced when trying on the garment need to be blocked.
Therefore, before the garment to be fitted and the user image are input into the fitting model for fusion, the old garment corresponding to the garment to be fitted is masked and occluded. On the one hand, this avoids interference from the original old garment features during fusion; on the other hand, it preserves the identity features of the user, so that the user is not distorted after changing into the garment to be fitted.
In some embodiments, the mask masking process may be performed on the user image with reference to the foregoing steps S31-S33 to obtain a mask image of the user, which will not be described in detail herein.
S204: Inputting the deformed image of the garment to be fitted and the mask image of the user into a fitting model to generate a fitting image, wherein the fitting model is trained by the method for training a fitting model in any one of the embodiments described above.
It can be understood that the fitting model is obtained by training with the method for training a fitting model in the above embodiments, and has the same structure and function as the fitting model in the above embodiments, which will not be described in detail here.
In short, the deformed image of the garment to be fitted and the mask image of the user are input into the fitting model; the fitting model encodes the deformed garment image and the mask image respectively to obtain a garment feature code and a user identity feature map, and then fuses the user identity feature map in the process of upsampling the garment feature code to obtain the fitting image.
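A hedged sketch of this inference flow is given below; the attribute names garment_encoder, identity_encoder and generator are placeholders for the corresponding sub-networks and are not taken from this application:

```python
import torch

def generate_fitting_image(model, deformed_garment, user_mask_image):
    """model: a trained fitting model exposing its sub-networks as attributes (assumed names)."""
    model.eval()
    with torch.no_grad():
        garment_code = model.garment_encoder(deformed_garment)    # garment feature code
        identity_map = model.identity_encoder(user_mask_image)    # user identity feature map
        return model.generator(garment_code, identity_map)        # fuse while upsampling to the fitting image
```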
Since the fitting model has the same structure and function as the fitting model in the above embodiments, the garment to be fitted fits closely to the user's body in the generated fitting image, and the fitting effect is real and natural.
Referring to fig. 13, fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application, and the computer device 50 includes a processor 501 and a memory 502. The processor 501 is connected to the memory 502, for example the processor 501 may be connected to the memory 502 via a bus.
The processor 501 is configured to support the computer device 50 in performing the corresponding functions in the methods of fig. 2-11 or the method of fig. 12. The processor 501 may be a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 502, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for training a fitting model or to the method for generating a fitting image in the embodiments of the present application. The processor 501 implements the methods in any of the method embodiments described above by running the non-transitory software programs, instructions, and modules stored in the memory 502.
Memory 502 may include volatile memory (VM), such as random access memory (RAM); the memory 502 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 502 may also include a combination of the above types of memory.
The present application also provides a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a computer, cause the computer to perform a method of training a fitting model or a method of generating a fitting image as in the previous embodiments.
It should be noted that the above-described apparatus embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and where the program may include processes implementing the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-only Memory (ROM), a random-access Memory (Random Access Memory, RAM), or the like.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application and are not limiting. The technical features of the above embodiments or of different embodiments may also be combined within the idea of the application, the steps may be implemented in any order, and there are many other variations of the different aspects of the application as described above which are not provided in detail for the sake of brevity. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (9)

1. A method of training a fitting model, wherein a fitting network of the fitting model comprises a garment encoding network, an identity encoding network, and a fitting image generation network;
The method comprises the following steps:
acquiring a training set, wherein the training set comprises a plurality of training data, the training data comprises a clothes image and a real fitting image, and the real fitting image comprises an image of a model wearing corresponding clothes in the clothes image;
Deforming clothes in the clothes image according to the model's human body structure in the real fitting image to obtain a deformed clothes image; the step of deforming the clothes in the clothes image according to the model's human body structure in the real fitting image to obtain a deformed clothes image comprises the following steps: performing key point detection on the clothes image by adopting a clothes key point detection algorithm to acquire clothes key point information; performing key point detection on the real fitting image by adopting a human key point detection algorithm to acquire human body key point information; calculating an affine transformation matrix between the clothes key point information and the human body key point information; performing an affine transformation on the clothes pixels in the clothes image according to the affine transformation matrix to obtain the deformed clothes image;
Carrying out mask shielding treatment on the real fitting image in the region corresponding to the clothes to obtain a mask image;
Performing feature extraction on the deformed clothes image by adopting the clothes coding network to obtain clothes feature codes, and performing feature extraction on the mask image by adopting the identity coding network to obtain an identity feature map;
inputting the clothes feature codes and the identity feature images into the fitting image generation network for fusion to obtain predicted fitting images;
And carrying out iterative training on the fitting network according to the difference between each real fitting image and each predicted fitting image in the training set until the fitting network converges, so as to obtain the fitting model.
2. The method according to claim 1, wherein the performing mask shielding processing on the real fitting image in the area corresponding to the garment to obtain a mask image includes:
performing part analysis on the real fitting image by adopting a human body analysis algorithm to obtain a human body analysis chart;
Determining a mask matrix according to the human body analysis chart and the types of clothes in the clothes image;
Multiplying the real fitting image and the corresponding position of the mask matrix to obtain the mask image.
3. The method according to any one of claims 1-2, wherein the fitting network further comprises a texture coding network, and wherein before said inputting the clothing feature code and the identity feature map into the fitting image generation network for fusion, further comprising:
Extracting texture features of the clothes image by adopting the texture coding network to obtain at least one texture feature map;
Inputting the clothes feature codes and the identity feature images into the fitting image generation network for fusion to obtain predicted fitting images, wherein the method comprises the following steps of:
inputting the clothes feature codes, the identity feature images and the at least one texture feature image into the fitting image generation network for fusion to obtain the predicted fitting image.
4. A method according to any one of claims 1-3, wherein said iteratively training said fitting network based on differences between each of said real fitting images and each of said predicted fitting images in said training set until said fitting network converges to obtain said fitting model, comprises:
Calculating the difference between each real fitting image and each predicted fitting image in the training set by adopting a loss function, wherein the loss function comprises an adversarial loss, a reconstruction loss and a perceptual loss between the real fitting image and the predicted fitting image, and a mask loss between the mask image corresponding to the real fitting image and the mask image corresponding to the predicted fitting image;
And carrying out iterative training on the fitting network according to the difference until the fitting network converges to obtain the fitting model.
5. The method of claim 4, wherein the loss function comprises:

L = L_adv + α·L_rec + β·L_per + δ·L_mask

wherein L_adv is the adversarial loss, L_rec is the reconstruction loss, L_per is the perceptual loss, L_mask is the mask loss, α is the weight of the reconstruction loss, β is the weight of the perceptual loss, δ is the weight of the mask loss, I is the real fitting image, I′ is the predicted fitting image, φ_p(I) is the feature map extracted from the real fitting image, φ_p(I′) is the feature map extracted from the predicted fitting image, P is the number of the feature maps φ_p(I) or the number of the feature maps φ_p(I′), M_I is the mask image corresponding to the real fitting image, and M_I′ is the mask image corresponding to the predicted fitting image.
6. The method according to claim 1, further comprising, before the step of extracting the identity feature from the real fitting image, the step of:
And padding the real fitting image so that the resolution of the real fitting image meets a preset proportion.
7. A method of generating a fitting image, comprising:
Acquiring an image of clothes to be fitted and an image of a user;
Deforming clothes in the image of the clothing to be fitted according to the human body structure of the user in the user image to obtain the deformed image of the clothing to be fitted;
carrying out mask shielding treatment on the user image in the region corresponding to the garment to be fitted to obtain a mask image of the user;
Inputting the deformed image of the garment to be fitted and the mask image of the user into a fitting model to generate the fitting image, wherein the fitting model is trained by the method for training a fitting model according to any one of claims 1-6.
8. A computer device, comprising:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
9. A computer readable storage medium storing computer executable instructions for causing a computer device to perform the method of any one of claims 1-7.
CN202210262232.9A 2022-03-16 2022-03-16 Method for training fitting model, method for generating fitting image and related device Active CN114724004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210262232.9A CN114724004B (en) 2022-03-16 2022-03-16 Method for training fitting model, method for generating fitting image and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210262232.9A CN114724004B (en) 2022-03-16 2022-03-16 Method for training fitting model, method for generating fitting image and related device

Publications (2)

Publication Number Publication Date
CN114724004A CN114724004A (en) 2022-07-08
CN114724004B true CN114724004B (en) 2024-04-26

Family

ID=82237316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210262232.9A Active CN114724004B (en) 2022-03-16 2022-03-16 Method for training fitting model, method for generating fitting image and related device

Country Status (1)

Country Link
CN (1) CN114724004B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117745990B (en) * 2024-02-21 2024-05-07 虹软科技股份有限公司 Virtual fitting method, device and storage medium


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021008166A1 (en) * 2019-07-17 2021-01-21 北京京东尚科信息技术有限公司 Method and apparatus for virtual fitting
GB202013277D0 (en) * 2019-11-04 2020-10-07 Adobe Inc Cloth warping using multi-scale patch adversarial loss
CN111062777A (en) * 2019-12-10 2020-04-24 中山大学 Virtual fitting method and system capable of reserving example clothes details
CN111275518A (en) * 2020-01-15 2020-06-12 中山大学 Video virtual fitting method and device based on mixed optical flow
CN113361560A (en) * 2021-03-22 2021-09-07 浙江大学 Semantic-based multi-pose virtual fitting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Implementation of a Virtual Fitting APP Based on Convolutional Neural Networks; Wei Xinying; Computer Programming Skills & Maintenance; 2020-08-18 (No. 08); full text *

Also Published As

Publication number Publication date
CN114724004A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
Alldieck et al. imghum: Implicit generative models of 3d human shape and articulated pose
Bertiche et al. Pbns: Physically based neural simulator for unsupervised garment pose space deformation
CN109993819B (en) Virtual character skin method and device and electronic equipment
CN110705448B (en) Human body detection method and device
US20200151963A1 (en) Training data set generation apparatus and method for machine learning
CN109002763B (en) Method and device for simulating human face aging based on homologous continuity
Gundogdu et al. Garnet++: Improving fast and accurate static 3d cloth draping by curvature loss
EP3503023B1 (en) Improved age modelling method
CN114067088A (en) Virtual wearing method, device, equipment, storage medium and program product
CN115439308A (en) Method for training fitting model, virtual fitting method and related device
CN114724004B (en) Method for training fitting model, method for generating fitting image and related device
CN116071619A (en) Training method of virtual fitting model, virtual fitting method and electronic equipment
CN117475258A (en) Training method of virtual fitting model, virtual fitting method and electronic equipment
KR20230153451A (en) An attempt using inverse GANs
Yan et al. Learning anthropometry from rendered humans
Guo et al. Inverse simulation: Reconstructing dynamic geometry of clothed humans via optimal control
Tuan et al. Multiple pose virtual try-on based on 3d clothing reconstruction
CN114821220A (en) Method for training fitting model, method for generating fitting image and related device
CN117593178A (en) Virtual fitting method based on feature guidance
CN115272822A (en) Method for training analytic model, virtual fitting method and related device
CN114119923B (en) Three-dimensional face reconstruction method and device and electronic equipment
KR102475823B1 (en) NFT-based metaverse clothing information generation system and method therefor
CN116168186A (en) Virtual fitting chart generation method with controllable garment length
CN115880748A (en) Face reconstruction and occlusion region identification method, device, equipment and storage medium
CN114913388B (en) Method for training fitting model, method for generating fitting image and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant