CN115439308A - Method for training fitting model, virtual fitting method and related device

Method for training fitting model, virtual fitting method and related device

Info

Publication number
CN115439308A
Authority
CN
China
Prior art keywords
image
clothes
fitting
model
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210961538.3A
Other languages
Chinese (zh)
Inventor
陈仿雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202210961538.3A
Publication of CN115439308A
Legal status: Pending

Classifications

    • G06T 3/04 Geometric image transformations in the plane of the image — Context-preserving transformations, e.g. by using an importance map
    • G06Q 30/0643 Electronic shopping [e-shopping] — Graphical representation of items or shoppers
    • G06T 11/60 2D [Two Dimensional] image generation — Editing figures and text; Combining figures or text
    • G06V 10/28 Image preprocessing — Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components — by matching or filtering
    • G06V 10/774 Processing image or video features in feature spaces — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level — of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning — using neural networks
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Human Computer Interaction (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application relate to the technical field of image processing and disclose a method for training a fitting model, a virtual fitting method, and a related device. The clothes deformation image not only retains the clothes features but also retains the body features of the model. In addition, the clothes attribute vector reflects the attributes that the try-on clothes should have in the fitting result, such as "size S, loose, long sleeve", and can effectively guide the fitting network to edit and modify the attributes of the try-on clothes. As a result, the predicted fitting image shows a realistic fitting effect, and the attributes of the try-on clothes can be edited and modified, for example by changing the size or the length. The fitting model obtained through training can therefore edit the try-on clothes according to the input clothes attributes, achieving a virtual fitting effect in which the clothes attributes can be edited.

Description

Method for training fitting model, virtual fitting method and related device
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a fitting model training method, a virtual fitting method and a related device.
Background
With the continuous progress of modern science and technology, the scale of online shopping keeps growing, and users can purchase clothes on online shopping platforms through their mobile phones. However, because the information about the clothes on sale that a user obtains is generally a two-dimensional display picture, the user cannot know how the clothes would look when worn by the user, may therefore buy clothes that do not suit them, and has a poor shopping experience.
With the continuous development of neural networks, they have been widely applied in the field of image generation. Researchers have therefore applied neural networks to virtual fitting and proposed various fitting algorithms. However, existing virtual fitting algorithms can only try on clothes according to the original attributes of the clothes and cannot edit the attributes of the try-on clothes, such as changing the size, the length or the style.
Disclosure of Invention
The technical problem mainly solved by the embodiments of the present application is to provide a method for training a fitting model, a virtual fitting method and a related device, where the fitting model obtained by training in the training method can edit fitting clothes according to input clothes attributes, so as to achieve a virtual fitting effect capable of editing clothes attributes.
In order to solve the above technical problem, in a first aspect, an embodiment of the present application provides a method for training a fitting model, applied to a fitting network, where the fitting network includes an image generation network, and the image generation network includes a cascaded encoder, fusion module and decoder;
the method comprises the following steps:
acquiring a plurality of image groups, wherein each image group comprises a clothes image and a model image, the clothes image corresponds to editable clothes attribute text information, the model in the model image wears clothes in the clothes image, and the clothes are clothes edited according to the clothes attribute text information;
performing feature coding on the clothes attribute text information to obtain a clothes attribute vector;
preliminarily deforming the clothes in the clothes image according to the human body structure of the model in the model image to obtain a clothes deformation image;
analyzing the human body of the model image to obtain a first analysis image, and extracting a body trunk image from the model image according to the first analysis image;
inputting the clothes deformation image, the body trunk image and the clothes attribute vector into an image generation network to obtain a prediction fitting image, wherein the clothes deformation image and the body trunk image are input into a coder for coding, the obtained coding result and the clothes attribute vector are input into a fusion module for fusion, and the obtained fusion result is input into a decoder for decoding to obtain the prediction fitting image;
and calculating the loss between the predicted fitting image and the model image by using a loss function, and iteratively training the fitting network according to the sum of the losses corresponding to the plurality of image groups until convergence, so as to obtain the fitting model.
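For illustration only, the training steps above can be sketched as the following loop, assuming PyTorch; the network methods (encode_attributes, warp_clothes, extract_torso, generate, loss) are hypothetical placeholders for the sub-networks described above, not an implementation taken from the patent.

```python
import torch

def train_fitting_network(fitting_net, image_groups, num_epochs=10, lr=1e-4):
    """Hypothetical training loop following the steps above; all names are illustrative."""
    optimizer = torch.optim.Adam(fitting_net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        total_loss = 0.0
        for clothes_img, model_img, attr_text in image_groups:
            attr_vec = fitting_net.encode_attributes(attr_text)                  # feature-code the attribute text
            warped = fitting_net.warp_clothes(clothes_img, model_img, attr_vec)  # preliminary deformation
            torso = fitting_net.extract_torso(model_img)                         # human body analysis + trunk extraction
            pred = fitting_net.generate(warped, torso, attr_vec)                 # encoder -> fusion -> decoder
            total_loss = total_loss + fitting_net.loss(pred, model_img)          # loss against the real model image
        optimizer.zero_grad()
        total_loss.backward()          # iterate on the summed loss over the image groups
        optimizer.step()
    return fitting_net                 # the converged network is the fitting model
```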
In some embodiments, the fitting network further comprises a garment deformation network;
the aforesaid clothing carries out preliminary deformation to clothes in the clothes image according to the human body structure of model in the model image, obtains clothes deformation image, includes:
detecting key points of a human body on the model image to obtain a key point image;
and inputting the body trunk image, the key point image, the clothes image and the clothes attribute vector into a clothes deformation network, and outputting the clothes deformation image.
In some embodiments, the extracting a body trunk image from the model image according to the first analytic map includes:
separating a second analytical map from the first analytical map, wherein the second analytical map reflects the pixel region of the body trunk of the model;
performing binarization processing on the second analytical map to obtain a binarized image, wherein pixels corresponding to the trunk region of the body in the binarized image are 1, and pixels corresponding to other regions are 0;
and multiplying the corresponding positions of the pixels in the model image and the pixels in the binary image to obtain a body trunk image.
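A minimal sketch of this extraction, assuming the model image and the second analytic map are NumPy arrays and that non-zero values in the map mark the body trunk; the array shapes are illustrative.

```python
import numpy as np

def extract_torso(model_img, second_parse_map):
    """Binarize the trunk parsing map and mask the model image with it (sketch).

    model_img: H x W x 3 array; second_parse_map: H x W array of the trunk region.
    """
    mask = (second_parse_map > 0).astype(np.uint8)   # trunk pixels -> 1, other regions -> 0
    return model_img * mask[..., None]               # position-wise multiplication per channel
```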
In some embodiments, the fitting network further comprises a multi-layer perceptron module;
the aforementioned feature coding of the clothes attribute text information to obtain a clothes attribute vector includes:
coding each word in the clothes attribute text information by using a bag-of-words model to obtain a text code;
and performing feature extraction on the text codes by adopting a multilayer perceptron module to obtain the clothing attribute vector.
In some embodiments, the fusion module includes a first convolution layer, a second convolution layer, and a fusion layer;
the fusion module performs fusion processing on the encoding result and the clothes attribute vector in the following manner:
respectively performing feature extraction on the coding result through the first convolutional layer and the second convolutional layer to obtain a first intermediate feature map and a second intermediate feature map;
respectively extracting the features of the clothes attribute vectors through the first convolution layer and the second convolution layer to obtain a first attribute feature map and a second attribute feature map;
and carrying out fusion processing on the encoding result, the first intermediate feature map, the second intermediate feature map, the first attribute feature map and the second attribute feature map through a fusion layer.
In some embodiments, the fusion layer performs the fusion using the following formula:
IN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y)
wherein, x is the coding result, μ (x) is the first intermediate feature map, σ (x) is the second intermediate feature map, y is the clothing attribute vector, μ (y) is the first attribute feature map, σ (y) is the second attribute feature map, and IN (x, y) is the fused result.
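Read together with the variable definitions above, the fusion layer can be interpreted as an AdaIN-style modulation of the encoding result by the attribute features. The sketch below assumes that interpretation; the projection of the attribute vector to a feature map, the layer hyperparameters and the small epsilon for numerical stability are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of the first/second convolution layers plus the fusion layer (assumed AdaIN-style)."""
    def __init__(self, channels, attr_dim):
        super().__init__()
        self.conv_mu = nn.Conv2d(channels, channels, kernel_size=3, padding=1)     # first convolution layer
        self.conv_sigma = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # second convolution layer
        self.attr_proj = nn.Linear(attr_dim, channels)  # assumed: lift the attribute vector to a feature map

    def forward(self, x, attr_vec):
        # broadcast the projected attribute vector to the spatial size of the encoding result
        y = self.attr_proj(attr_vec)[..., None, None].expand(-1, -1, x.size(2), x.size(3))
        mu_x, sigma_x = self.conv_mu(x), self.conv_sigma(x)    # first / second intermediate feature maps
        mu_y, sigma_y = self.conv_mu(y), self.conv_sigma(y)    # first / second attribute feature maps
        return sigma_y * (x - mu_x) / (sigma_x + 1e-6) + mu_y  # IN(x, y)
```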
In some embodiments, the loss function includes a conditional adversarial loss, a perceptual loss, and a clothes deformation loss, wherein the clothes deformation loss reflects the difference between the clothes in the predicted fitting image and the clothes in the model image.
In order to solve the above technical problem, in a second aspect, an embodiment of the present application provides a virtual fitting method, including:
acquiring an image of the try-on person, an image of the clothes to be tried on, and attribute text information of the clothes to be tried on;
performing feature coding on the attribute text information of the clothes to be tried on to obtain an attribute vector of the clothes to be tried on;
preliminarily deforming the clothes in the image of the clothes to be tried on according to the human body structure in the image of the try-on person to obtain a deformation image of the clothes to be tried on;
performing human body analysis on the image of the try-on person to obtain a first try-on-person analysis map, and extracting a body trunk image of the try-on person from the image of the try-on person according to the first try-on-person analysis map;
and inputting the deformation image of the clothes to be tried on, the body trunk image of the try-on person and the attribute vector of the clothes to be tried on into a fitting model to generate a fitting image, wherein the fitting model is obtained by training with the method of the first aspect.
In order to solve the foregoing technical problem, in a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as in the first aspect above.
In order to solve the above technical problem, in a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for causing a computer device to perform the method according to the first aspect.
The beneficial effects of the embodiments of the present application are as follows: different from the prior art, in the method for training a fitting model provided in the embodiments of the present application, a plurality of image groups are first obtained, each image group includes a clothes image and a model image, the clothes image corresponds to editable clothes attribute text information, the model in the model image wears the clothes in the clothes image, and the clothes are clothes edited according to the clothes attribute text information. Then, feature coding is performed on the clothes attribute text information to obtain a clothes attribute vector. The clothes in the clothes image are preliminarily deformed according to the human body structure of the model in the model image to obtain a clothes deformation image. Human body analysis is performed on the model image, and a body trunk image is extracted from the model image based on the analysis result. The clothes deformation image, the body trunk image and the clothes attribute vector are input into the image generation network in the fitting network to obtain a predicted fitting image. Finally, a loss function is used to calculate the loss between the predicted fitting image and the model image; after the plurality of image groups have been processed in turn, the fitting network is iteratively trained based on the sum of the losses corresponding to the plurality of image groups until convergence, and the converged fitting network is the fitting model.
In the embodiment, the clothes deformation image not only keeps the clothes characteristics, but also can preliminarily reflect the approximate deformation shape of the clothes, so that the fitting network can be effectively guided to enable the clothes and the human body to be closely combined and the fitting effect to be more vivid when the prediction fitting image is generated. The body trunk image reserves the trunk characteristics of the model, can guide the fitting network to enable the trunk contour to be accurate when generating the prediction fitting image, is not influenced by original clothes on the model body, and is beneficial to predicting the style of the self-adaptive fitting clothes in the clothes area in the fitting image. In addition, the clothes attribute vector reflects attributes which are expected to be edited and modified by the try-on clothes in the try-on effect, such as 'S code, loose and long sleeve', and can effectively guide the fitting network to edit and modify the attributes of the try-on clothes, so that the try-on clothes in the predicted fitting image accord with the expected clothes attributes (namely the clothes attribute vector). The attributes of the try-on clothes are guided to be edited based on the preliminary deformation of the clothes in the clothes deformation image, the trunk characteristics reserved by the body trunk image and the clothes attribute vector, so that the predicted try-on image is vivid in try-on effect, and the attributes of the try-on clothes can be edited and modified. The fitting network is trained by adopting the plurality of image groups in the above manner, the fitting effect in the predicted fitting image corresponding to each image group is constrained to be continuously close to the real wearing effect in the model image based on loss and back propagation, and the attributes of the fitting clothes can be edited and modified according to the clothes attribute vector, such as code number change, length change and the like. Therefore, the fitting model obtained through training can edit the fitting clothes according to the input clothes attributes, and the virtual fitting effect of editing the clothes attributes is achieved.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements and which are not to scale unless otherwise specified.
FIG. 1 is a schematic diagram of an application scenario of a fitting system according to some embodiments of the present application;
FIG. 2 is a schematic diagram of an electronic device in some embodiments of the present application;
FIG. 3 is a schematic flow chart of a method for training a fitting model according to some embodiments of the present application;
FIG. 4 is a schematic diagram of an image generation network according to some embodiments of the present application;
FIG. 5 is a schematic diagram of a fitting network according to some embodiments of the present application;
FIG. 6 is a schematic view of a sub-flow of step S20 of the method shown in FIG. 3;
FIG. 7 is a sub-flowchart of step S30 of the method shown in FIG. 3;
FIG. 8 is a keypoint image in some embodiments of the present application;
FIG. 9 is a first analytical image in some embodiments of the present application;
FIG. 10 is a sub-flowchart of step S40 of the method shown in FIG. 3;
FIG. 11 is a schematic structural diagram of a fusion module according to some embodiments of the present application;
fig. 12 is a schematic flow chart of a virtual fitting method according to some embodiments of the present application.
Detailed Description
The present application will be described in detail below with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the present application, but are not intended to limit the present application in any way. It should be noted that numerous variations and modifications could be made by those skilled in the art without departing from the spirit of the application, all of which fall within the scope of protection of the present application.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to illustrate the present application and are not intended to limit it.
It should be noted that, if there is no conflict, the individual features of the embodiments of the present application may be combined with each other within the scope of protection of the present application. In addition, although functional modules are divided in the device schematic diagrams and logical sequences are shown in the flowcharts, in some cases the steps shown or described may be performed in a different order than the module division in the devices or the order in the flowcharts. Further, the terms "first", "second", "third" and the like used herein do not limit the data or the execution order, but merely distinguish identical or similar items having substantially the same functions and effects.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In addition, the technical features mentioned in the embodiments of the present application described below may be combined with each other as long as they do not conflict with each other.
To facilitate understanding of the method provided in the embodiments of the present application, first, terms referred to in the embodiments of the present application will be described:
(1) Neural network
A neural network may be composed of neural units and can be understood as a network having an input layer, hidden layers and an output layer; generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. A neural network with many hidden layers is called a Deep Neural Network (DNN). The work of each layer in the neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer can be understood as completing a transformation from the input space to the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is realized by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and the space refers to the set of all individuals of that class. W is the weight matrix of each layer of the neural network, and each value in the matrix represents the weight of one neuron of that layer. The matrix W determines the spatial transformation from the input space to the output space described above, i.e., W at each layer of the neural network controls how the space is transformed. The purpose of training the neural network is ultimately to obtain the weight matrices of all layers of the trained network. Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
It should be noted that, in the embodiments of the present application, the model adopted for the machine learning task is essentially a neural network. Common components of a neural network include convolutional layers, pooling layers, normalization layers, deconvolution (transposed convolution) layers and the like; a model is designed by assembling such components, and the model converges when the model parameters (the weight matrices of all layers) are determined such that the model error satisfies a preset condition or the number of parameter-adjustment iterations reaches a preset threshold.
A convolutional layer is configured with a plurality of convolution kernels, and each convolution kernel has a corresponding stride, so as to perform a convolution operation on the image. The purpose of the convolution operation is to extract different features of the input image: the first convolutional layer may only extract low-level features such as edges, lines and corners, while deeper convolutional layers can iteratively extract more complex features from these low-level features.
A deconvolution (transposed convolution) layer is used to map a low-dimensional space to a higher-dimensional space while maintaining the connection relationship/pattern between them (the connection relationship here refers to the connection relationship during convolution). A deconvolution layer is configured with a plurality of convolution kernels, each with a corresponding stride, so as to perform a deconvolution operation on the image. In general, frameworks for designing neural networks (e.g., the PyTorch library) provide built-in upsampling operators, and the low-dimensional to high-dimensional spatial mapping can be realized by calling such an operator (e.g., an upsampling or transposed-convolution layer).
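For example, in the PyTorch library the low-to-high-dimensional mapping can be obtained with nn.Upsample or a transposed convolution; the tensor shapes below are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)                    # low-resolution feature map
up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
deconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
print(up(x).shape)      # torch.Size([1, 64, 32, 32]) -- interpolation-based upsampling
print(deconv(x).shape)  # torch.Size([1, 32, 32, 32]) -- learned transposed convolution
```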
Pooling is a process that mimics the human visual system: it reduces the size of the data or represents an image with higher-level features. Common pooling operations include max pooling, mean pooling, stochastic pooling, median pooling, combined pooling and the like. Typically, pooling layers are periodically inserted between the convolutional layers of a neural network to achieve dimensionality reduction.
The normalization layer is used to normalize all neurons in the middle layer to prevent gradient explosion and gradient disappearance.
(2) Loss function
In the process of training a neural network, because the output of the neural network is expected to be as close as possible to the value that is actually desired to be predicted, the weight matrix of each layer can be updated by comparing the predicted value of the current network with the actually desired target value according to the difference between the two (an initialization process is usually performed before the first update, i.e., parameters are pre-configured for each layer of the neural network). For example, if the predicted value of the network is too high, the weight matrices are adjusted so that the prediction becomes lower, and the adjustment continues until the neural network can predict the actually desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is done by loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the larger the difference, and the training of the neural network then becomes a process of reducing this loss as much as possible.
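A minimal illustration of this predict-compare-update cycle, assuming PyTorch and a mean-squared-error loss; the tiny network and random data are placeholders, not part of the fitting model.

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 1)                         # placeholder network
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                        # "how to compare the prediction and the target"

x, target = torch.randn(8, 4), torch.randn(8, 1)
for _ in range(100):
    pred = net(x)
    loss = loss_fn(pred, target)              # larger loss -> larger difference
    optimizer.zero_grad()
    loss.backward()                           # propagate the difference back through the layers
    optimizer.step()                          # adjust the weights to reduce the loss
```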
(3) Human body analysis
Human body analysis (human parsing) refers to dividing a person captured in an image into multiple semantically consistent regions, for example body parts and clothing, or a finer classification of body parts and clothing. That is, the input image is recognized at the pixel level, and the object category of each pixel in the image is labeled. For example, the elements in a picture containing a human body (including hair, face, limbs, clothing, background and the like) are distinguished by a neural network.
Before the embodiments of the present application are described, a simple description is first given to a virtual fitting method known to the inventors of the present application, so that it is convenient to understand the embodiments of the present application in the following.
Generally, a fitting model is trained using a generative adversarial network (GAN), and the fitting model is then deployed in a terminal for users, so that a virtual fitting image can be generated after a user image and an image of the clothes to be tried on are acquired. However, most fitting models can only perform fitting according to the original attributes of the clothes and cannot edit the attributes of the try-on clothes, such as changing the size, the length or the style.
In order to solve the problems, the application provides a fitting model training method, a virtual fitting method and a related device. The clothes deformation image not only retains the clothes characteristics, but also can preliminarily reflect the approximate deformation shape of the clothes, so that the fitting network can be effectively guided to enable the clothes and the human body to be closely combined and the fitting effect to be more vivid when the predicted fitting image is generated. The body trunk image reserves the trunk characteristics of the model, can guide the fitting network to enable the trunk contour to be accurate when generating the prediction fitting image, is not influenced by the original clothes on the model body, and is beneficial to predicting the style of the self-adaptive fitting clothes in the clothes area in the fitting image. In addition, the clothes attribute vector reflects the attributes of the try-on clothes expected to be edited and modified in the try-on effect, such as the 'S code, loose and long sleeve', and can effectively guide the fitting network to edit and modify the attributes of the try-on clothes, so that the try-on clothes in the predicted fitting image conform to the expected clothes attributes (namely the clothes attribute vector). The attributes of the try-on clothes are guided to be edited based on the preliminary deformation of the clothes in the clothes deformation image, the trunk characteristics reserved by the body trunk image and the clothes attribute vector, so that the predicted try-on image is vivid in try-on effect, and the attributes of the try-on clothes can be edited and modified. The fitting network is trained by adopting the plurality of image groups in the manner, the fitting effect in the predicted fitting image corresponding to each image group is restrained to be continuously close to the real wearing effect in the model image based on loss and back propagation, and the attribute of the fitting clothes can be edited and modified according to the attribute vector of the clothes, such as code number change, length change and the like. Therefore, the fitting model obtained through training can edit the fitting clothes according to the input clothes attributes, and the virtual fitting effect of the editable clothes attributes is achieved.
An exemplary application of the electronic device for training a fitting model or for virtual fitting provided in the embodiment of the present application is described below, and it can be understood that the electronic device may train a fitting model, or may perform virtual fitting using the fitting model to generate a fitting image.
The electronic device provided in the embodiments of the present application may be a server, for example a server deployed in the cloud. When the server is used to train the fitting model, it iteratively trains the fitting network using a training set and a fitting network provided by other devices or by persons skilled in the art, and determines the final model parameters; the fitting network configured with the final model parameters is the fitting model. When the server is used for virtual fitting, it invokes the built-in fitting model, performs the corresponding computation on the try-on-person image, the image of the clothes to be tried on and the attribute text information of the clothes to be tried on provided by other devices or by a user, and generates a fitting image in which the attributes of the clothes to be tried on can be changed.
The electronic device provided by some embodiments of the present application may be various types of terminals such as a notebook computer, a desktop computer, or a mobile device. When the terminal is used for training the fitting model, a person skilled in the art inputs a prepared training set into the terminal, designs a fitting network on the terminal, and iteratively trains the fitting network by the terminal by adopting the training set to determine final model parameters, so that the fitting network configures the final model parameters to obtain the fitting model. When the terminal is used for virtual fitting, a built-in fitting model is called, corresponding calculation processing is carried out on a fitting person image, a to-be-fitted clothes image and to-be-fitted clothes attribute text information input by a user, and a fitting image capable of changing the attribute of the to-be-fitted clothes is generated.
By way of example, referring to fig. 1, fig. 1 is a schematic view of an application scenario of a fitting system provided in an embodiment of the present application, and a terminal 10 is connected to a server 20 through a network, where the network may be a wide area network or a local area network, or a combination of the two.
The terminal 10 may be used to obtain training sets and build a fitting network, for example, a person skilled in the art downloads the prepared training sets on the terminal and builds a network structure of the fitting network. It is understood that the terminal 10 may also be used to obtain the image of the try-on wearer, the image of the clothes to be tried and the attribute text information of the clothes to be tried, for example, the user inputs the image of the try-on wearer, the image of the clothes to be tried and the attribute text information of the clothes to be tried through the input interface, and after the input is completed, the terminal automatically obtains the image of the try-on wearer, the image of the clothes to be tried and the attribute text information of the clothes to be tried; for example, the terminal 10 is provided with a camera, the camera collects an image of a try-on person, a clothes image library is stored in the terminal 10, an image of clothes to be tried can be selected from the clothes image library, and then, the attribute text information of the clothes to be tried is input through an input interface of the terminal 10.
In some embodiments, the terminal 10 locally executes the method for training a fitting model provided in this embodiment to complete training of the designed fitting network by using a training set, and determine a final model parameter, so that the fitting network configures the final model parameter, and a fitting model can be obtained. In some embodiments, the terminal 10 may also send a training set stored on the terminal by a person skilled in the art and a well-constructed fitting network to the server 20 through the network, the server 20 receives the training set and the fitting network, trains the designed fitting network by using the training set, determines a final model parameter, and then sends the final model parameter to the terminal 10, and the terminal 10 stores the final model parameter, so that the fitting network configures the final model parameter, that is, the fitting model can be obtained.
In some embodiments, the terminal 10 locally executes the virtual fitting method provided in this embodiment to provide a virtual fitting service for the user, invokes a built-in fitting model, and performs corresponding calculation processing on the image of the try-on person, the image of the clothes to be tried and the text information of the attribute of the clothes to be tried, which are input by the user, to generate a fitting image capable of changing the attribute of the clothes to be tried. In some embodiments, the terminal 10 may also send the image of the try-on wearer, the image of the clothes to be tried and the attribute text information of the clothes to be tried, which are input by the user on the terminal, to the server 20 through the network, and the server 20 receives the image of the try-on wearer, the image of the clothes to be tried and the attribute text information of the clothes to be tried, invokes a built-in fitting model to perform corresponding calculation processing on the image of the try-on wearer, the image of the clothes to be tried and the attribute text information of the clothes to be tried, generates a fitting image capable of changing the attribute of the clothes to be tried, and then sends the fitting image to the terminal 10. After receiving the fitting image, the terminal 10 displays the fitting image on its own display interface, so that the user can view the fitting effect after editing the clothes attribute.
The structure of the electronic device in the embodiment of the present application is described below, and fig. 2 is a schematic structural diagram of the electronic device 500 in the embodiment of the present application, where the electronic device 500 includes at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.
The processor 510 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including Bluetooth, wireless Fidelity (WiFi), and Universal Serial Bus (USB), among others;
a display module 553 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
As can be understood from the foregoing, the method for training a fitting model and the virtual fitting method provided in the embodiments of the present application may be implemented by various types of electronic devices with computing processing capabilities, such as an intelligent terminal and a server.
The method for training a fitting model provided by the embodiments of the present application is described below with reference to an exemplary application and implementation of the server provided by the embodiments of the present application. Referring to fig. 3, fig. 3 is a schematic flowchart of a method for training a fitting model according to an embodiment of the present application. Referring to fig. 4, fig. 4 is a schematic structural diagram of a fitting network. As shown in fig. 4, the fitting network, which serves as the network structure of the fitting model, includes an image generation network, and the image generation network includes a cascaded encoder, fusion module and decoder. Here, cascading means that the encoding result output by the encoder is input to the fusion module, and the fusion result output by the fusion module is input to the decoder.
The fusion module is used for fusing the at least two feature data, so that the fusion result can reflect the features in the at least two feature data. In the image generation network, the encoding result is fused with other characteristic data in a fusion module, and the obtained fusion result is input into a decoder for decoding.
As will be appreciated by those skilled in the art, the encoder includes a plurality of downsampling convolutional layers; in the encoder, the size of the output feature map becomes smaller as the downsampling convolutional layers proceed. The decoder includes a plurality of upsampling convolutional layers; in the decoder, the size of the output feature map becomes larger as the upsampling convolutional layers proceed. Those skilled in the art can configure parameters such as the convolution kernel size and stride of each downsampling convolutional layer and each upsampling convolutional layer according to actual requirements.
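A sketch of such a cascaded encoder and decoder, assuming PyTorch; the channel counts, layer depths, kernel sizes and the 6-channel input (clothes deformation image plus body trunk image) are illustrative assumptions, and the fusion module described above would sit between the two.

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_ch=6, base=64):
        super().__init__()
        self.net = nn.Sequential(                       # downsampling convolutions: feature maps shrink
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, base=64, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(                       # upsampling convolutions: feature maps grow
            nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )
    def forward(self, x):
        return self.net(x)
```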
It is understood that the fitting network can be constructed on the neural network design platform of the terminal (such as a computer) by the person skilled in the art, and then sent to the server.
Referring to fig. 3 again, the method S100 may specifically include the following steps:
s10: a number of image groups are acquired.
For any one of the image groups, it includes a clothes image and a model image, and each clothes image corresponds to editable clothes attribute text information; that is, the clothes image, the corresponding model image and the corresponding clothes attribute text information constitute one set of training data. The clothes attribute text information is text describing the clothes attributes that are desired to be edited and modified, and the clothes attributes may include size, sleeve length, clothes length, bust, waist circumference, hip circumference, trouser length, version (cut) and the like. In some embodiments, the editable clothes attribute text information includes "version: loose, size: S code, sleeve length: long".
It is understood that the items included in the clothing attribute text message may be set by a person skilled in the art. In some embodiments, the items included in the clothing attributes in the clothing attribute text information may be set according to the user's preference for modifying the clothing attributes.
The model in the model image wears the clothes in the clothes image, and the clothes are clothes edited according to the clothes attribute text information. The clothes image includes the clothes intended to be tried on; for example, clothes image 1# contains an S-code green short-sleeve top, and the clothes attribute text information corresponding to clothes image 1# is "version: loose, size: M code, sleeve length: long". The clothes worn by the model in the model image are then the clothes of clothes image 1# edited and modified according to that clothes attribute text information (version: loose, size: M code, sleeve length: long), namely a loose M-code green long-sleeve top. The M-code loose green long-sleeve top and the S-code green short-sleeve top differ only in size, looseness and sleeve length; the other aspects (such as fabric, pattern and style) are the same.
It will be appreciated that several image sets may be gathered by one skilled in the art on a terminal (e.g., a computer) in advance, for example, on some clothing vending websites, clothing images and corresponding model images wearing the clothing may be crawled. After a plurality of image groups and clothes attribute text information corresponding to each clothes image (or each image group) are prepared, the data for training are uploaded to a server through a terminal.
In some embodiments, the number of image groups is on the order of tens of thousands, for example 20000, which is beneficial for training an accurate and general model. The specific number of image groups can be determined by those skilled in the art according to the actual situation.
S20: and performing feature coding on the clothes attribute text information to obtain a clothes attribute vector.
It is understood that the text information of the clothes attribute includes text information of size, sleeve length, clothes length, chest circumference, waist circumference, hip circumference, trousers length, version type, and the like. The clothing attribute text information is data in a text format. In order to enable the fitting network to learn the attribute characteristics reflected by the clothes attribute text information, the clothes attribute text information is coded to obtain a clothes attribute vector. That is, the clothing attribute text information is digitized, and the converted clothing attribute vector is data in a numerical format.
In some embodiments, referring to fig. 5, the fitting network further includes a multi-layer perceptron module.
In this embodiment, referring to fig. 6, the step S20 specifically includes:
s21: and coding each word in the clothes attribute text information by adopting a word bag model to obtain a text code.
In a bag-of-words model, the words in the selected text data are put into a "bag of words", the number of times each word in the bag appears in the text data is counted, and the counts are expressed in vector form. In this embodiment, the words in all the clothes attribute text information are first collected to construct a dictionary. It is understood that in some embodiments the constructed dictionary may be: clothes version {loose: 0001, straight: 0010, slim: 0100}; clothes size {S code: 0011, M code: 0110, L code: 1100, XL code: 1001}; clothes sleeve length {short: 0111, middle: 1101, long: 1110}. For example, for a loose, L-code, long-sleeve garment, the text code may be "0001 1100 1110".
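A small sketch of this dictionary-based encoding, reusing the example codes above; the way the attribute text is split into version/size/sleeve fields is a simplifying assumption.

```python
# Hypothetical encoder built from the example dictionary above.
DICTIONARY = {
    "version": {"loose": "0001", "straight": "0010", "slim": "0100"},
    "size":    {"S": "0011", "M": "0110", "L": "1100", "XL": "1001"},
    "sleeve":  {"short": "0111", "middle": "1101", "long": "1110"},
}

def encode_attributes(version, size, sleeve):
    """Concatenate the per-attribute codes into one text code."""
    return " ".join([DICTIONARY["version"][version],
                     DICTIONARY["size"][size],
                     DICTIONARY["sleeve"][sleeve]])

print(encode_attributes("loose", "L", "long"))  # "0001 1100 1110"
```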
S22: and performing feature extraction on the text codes by adopting a multilayer perceptron module to obtain the clothing attribute vector.
The multi-layer perceptron module comprises an input layer, a multi-layer hidden layer and an output layer, wherein the input layer comprises N neurons, the hidden layer comprises Q neurons, and the output layer comprises K neurons. The operation of each layer may be described by a functional expression, it being understood that the functional expression differs for each layer.
It will be appreciated that, if the input text code is denoted by x, the input layer passes x to the hidden layer, and the output of the hidden layer may be f(w1·x + b1), where w1 is a weight, b1 is a bias, and the function f may be a commonly used sigmoid or tanh function. The hidden layer to the output layer is equivalent to a multi-class logistic regression, i.e., softmax regression, so the output of the output layer is softmax(w2·x1 + b2), where x1 is the output f(w1·x + b1) of the hidden layer.
Therefore, the multi-layer perceptron module can be represented by the following formula:
MLP(x) = G(W_h · S(W_{h-1} · ( … S(W_1·x + b_1) … ) + b_{h-1}) + b_h)
where G denotes the softmax activation function, h denotes the number of hidden layers, W_i and b_i denote the weights and biases of the i-th hidden layer, x denotes the input text code, W_1 and b_1 denote the weights and biases of the input layer, S denotes the activation function, and MLP(x) denotes the output vector (here, the clothes attribute vector).
In some embodiments, K may be 1024, so that the output layer outputs a one-dimensional vector of length 1024, i.e., a clothes attribute vector of length 1024.
Each layer of the multilayer perceptron module uses an activation function, which introduces nonlinear factors into the neurons so that the module can approximate arbitrary nonlinear functions and can therefore model more complex nonlinear relationships. The multilayer perceptron module has a good feature extraction capability for discrete information, so using it to extract features from the text code yields a clothes attribute vector that fully reflects the features of the clothes attributes.
In the embodiment, the multi-layer perceptron module is adopted to extract the text codes, so that the clothes attribute vector can more fully reflect the characteristics of the clothes attribute, and therefore, the fitting network can better learn the clothes attribute characteristics and is beneficial to model convergence.
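A sketch of such a multilayer perceptron in PyTorch, assuming the 12-bit text code from the example above as input and the 1024-dimensional output mentioned above; the hidden width and number of hidden layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttributeMLP(nn.Module):
    """Input layer -> hidden layers with activation S -> softmax output G (sketch)."""
    def __init__(self, n_in=12, hidden=256, n_hidden_layers=2, n_out=1024):
        super().__init__()
        layers = [nn.Linear(n_in, hidden), nn.Tanh()]              # input layer, activation S
        for _ in range(n_hidden_layers - 1):
            layers += [nn.Linear(hidden, hidden), nn.Tanh()]       # hidden layers
        layers += [nn.Linear(hidden, n_out), nn.Softmax(dim=-1)]   # output layer, G = softmax
        self.net = nn.Sequential(*layers)

    def forward(self, text_code):
        return self.net(text_code)

mlp = AttributeMLP()
# "0001 1100 1110" as a numeric tensor
attr_vec = mlp(torch.tensor([[0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 1., 0.]]))
print(attr_vec.shape)  # torch.Size([1, 1024])
```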
S30: and carrying out primary deformation on the clothes in the clothes image according to the human body structure of the model in the model image to obtain a clothes deformation image.
The step is to deform the clothes in the clothes image according to the human body structure of the model in the model image to obtain a clothes deformation image. It can be understood that the clothes in the original clothes image are in a two-dimensional plane state, and compared with the original clothes image, the outline of the clothes in the clothes deformation image is close to the outline of the corresponding trunk of the model human body.
The deformation method is not limited in this embodiment, and the clothes in the clothes image may be deformed. In some embodiments, thin Plate Splines (TPS) may be used to simulate deformation of the garment. In some embodiments, an optical flow transformed from clothes pixels in the clothes image to clothes pixels in the model image may be calculated based on the clothes image and the model image, and then the clothes pixels in the clothes image may be transformed according to the optical flow to obtain a deformed clothes image.
In order to make the clothes contour in the clothes deformation image closer to the contour of the corresponding part of the model's body, in some embodiments, referring to fig. 5, the fitting network further includes a clothes deformation network, and the clothes in the clothes image are deformed by the clothes deformation network. It can be understood that, with continuous iterative training of the fitting network, the deformation capability of the clothes deformation network becomes stronger, and the clothes contour in the resulting clothes deformation image becomes closer to the contour of the corresponding part of the model's body.
In this embodiment, referring to fig. 7, the step S30 specifically includes:
s31: and detecting key points of the human body on the model image to obtain a key point image.
A human body keypoint detection algorithm is used to detect the human body keypoints of the model image, so that the human body keypoint information (i.e., a number of keypoints on the human body) can be located. As shown in fig. 8, the keypoints may be coordinate points of, for example, the nose, the left eye, the right eye, the left ear, the right ear, the left shoulder, the right elbow, the left wrist, the right hip, the left knee, the right knee and the left ankle. In some embodiments, the human keypoint detection algorithm may be the OpenPose algorithm. In some embodiments, the human keypoint detection algorithm may be a 2D keypoint detection algorithm, such as Convolutional Pose Machines (CPM) or the Stacked Hourglass Network (Hourglass), among others.
It will be appreciated that each keypoint has its own serial number, and the position of each keypoint is represented by coordinates. For example, referring to fig. 8 again, the OpenPose algorithm originally defines 18 keypoints, whose serial numbers represent human joints: 0 (nose), 1 (neck), 2 (right shoulder), 3 (right elbow), 4 (right wrist), 5 (left shoulder), 6 (left elbow), 7 (left wrist), 8 (right hip), 9 (right knee), 10 (right ankle), 11 (left hip), 12 (left knee), 13 (left ankle), 14 (right eye), 15 (left eye), 16 (right ear), 17 (left ear). For different human bodies, the OpenPose algorithm detects these 18 keypoints, and the coordinates of the 18 keypoints differ from model to model according to their different body shapes.
Therefore, the human body key point detection is carried out on the model image to obtain a key point image. The key point image includes a plurality of key points, each of which includes a serial number representing a human joint and a corresponding coordinate.
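As a sketch, the detected keypoints can be stored as (serial number, x, y) triples and rendered into a keypoint image with one channel per joint; whether dots, skeletons or heatmaps are rendered is not specified here, so this rendering and the use of integer pixel coordinates are assumptions.

```python
import numpy as np

def draw_keypoint_image(keypoints, height, width, radius=3):
    """keypoints: list of (serial_number, x, y); returns one channel per keypoint (sketch)."""
    img = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for i, (_, x, y) in enumerate(keypoints):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        img[i, y0:y1, x0:x1] = 1.0                     # mark a small patch around each joint
    return img
```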
S32: inputting the body trunk image, the key point image, the clothes image and the clothes attribute vector into a clothes deformation network, and outputting the clothes deformation image.
The body trunk image is obtained by extracting the body trunk of the model from the model image; the specific extraction process can be found in the description of step S40 and is not repeated here. The clothes attribute vector is obtained by feature coding the clothes attribute text information; the specific coding process can be found in the description of step S20 and is not repeated here.
In some embodiments, the clothes deformation network may be an Appearance Flow Warping Module (AFWM), which includes a Pyramid Feature Extraction Network (PFEN) for feature extraction and a progressive Appearance Flow Estimation Network (AFEN) for generating an optical flow map. Each point in the optical flow map is a two-dimensional vector recording from which point in the clothes image the corresponding pixel in the clothes deformation image is sampled. Since the appearance flow warping module is known in the art, the specific process of calculating the optical flow map is not described in detail here.
Therefore, after the optical flow map is acquired, the clothes deformation image can be calculated from the clothes image and the optical flow map.
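A minimal sketch of this sampling step is given below, assuming that the optical flow map stores, for each output pixel, a two-dimensional offset (in pixels) to the source location in the clothes image; the tensor shapes and the use of torch.nn.functional.grid_sample are illustrative choices and not the exact implementation of the appearance flow warping module.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(clothes, flow):
    """Sample a clothes deformation image from the clothes image and a flow map.

    clothes: (B, 3, H, W) clothes image.
    flow:    (B, 2, H, W) per-pixel (dx, dy) offsets in pixels; each output
             pixel is sampled from position (x + dx, y + dy) of the clothes image.
    """
    b, _, h, w = clothes.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(clothes.device)
    coords = base + flow                               # absolute source coordinates
    # Normalize to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(clothes, grid, align_corners=True)

clothes = torch.rand(1, 3, 256, 192)
flow = torch.zeros(1, 2, 256, 192)                     # zero flow -> identity warp
warped = warp_by_flow(clothes, flow)
print(warped.shape)                                    # torch.Size([1, 3, 256, 192])
```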
In this implementation, the body trunk image and the key point image provide the body contour and structure of the model to the clothes deformation network, guiding the network to deform the clothes toward the model's body contour and structure so that the clothes approximate the contour of the corresponding body parts. In addition, the clothes attribute vector is input into the clothes deformation network, so that the network can take into account the attribute changes reflected by the clothes attribute vector and realize a coarse attribute-aware deformation of the clothes. As a result, the clothes contour in the clothes deformation image is closer to the contour of the corresponding trunk of the model's body, and the clothes attributes are also closer to the attributes reflected by the clothes attribute vector, achieving a preliminary editing of the attributes; this helps the model converge faster and improves the fitting effect of the model.
S40: and carrying out human body analysis on the model image to obtain a first analysis image, and extracting a body trunk image from the model image according to the first analysis image.
After the server receives the plurality of image groups, human body analysis is performed on each model image to obtain a first analysis map, and the body trunk image is then extracted.
It will be appreciated that when changing the garment for the model in the model image, the identity characteristics of the model need to be retained. Before the clothes and the model are fused, the identity characteristics of the model are extracted, that is, the body trunk image is acquired. On the one hand, this prevents the features of the original old clothes from interfering with the fusion; on the other hand, it retains the identity characteristics of the model, so that the model is not distorted after being changed into the clothes to be tried on.
Specifically, as can be seen from the above "introduction (3)", human body analysis divides the human body into its parts. As shown in fig. 9, different parts, for example the hair, face, jacket, trousers, arms, hat and shoes, are identified, segmented and represented by different colors, so that the first analysis map is obtained.
In some embodiments, the human body analysis algorithm may be the existing Graphonomy algorithm. The Graphonomy algorithm divides an image into 20 categories and can distinguish and segment each part using different colors. In some embodiments, the 20 categories may also be numbered 0-19, such as 0 for background, 1 for hat, 2 for hair, 3 for gloves, 4 for sunglasses, 5 for upper clothes, 6 for dress, 7 for coat, 8 for socks, 9 for pants, 10 for torso skin, 11 for scarf, 12 for skirt, 13 for face, 14 for left arm, 15 for right arm, 16 for left leg, 17 for right leg, 18 for left shoe, and 19 for right shoe. From the analysis map, the category to which each part of the image belongs can be determined.
In order to ensure that the identity information of the model is not changed in the process of generating the prediction fitting image, a body trunk image can be extracted from the model image according to the first analysis image.
In some embodiments, referring to fig. 10, the "extracting a body trunk map from a model image according to a first analysis map" includes:
S41: separating a second analysis map from the first analysis map, wherein the second analysis map reflects the pixel region of the body trunk of the model.
In this embodiment, a second analysis map is separated from the first analysis map, the second analysis map reflecting the pixel region of the body trunk of the model. For example, the pixel regions of the first analysis map whose categories are 2 (hair), 10 (torso skin), 13 (face), 14 (left arm), 15 (right arm), 16 (left leg), 17 (right leg), 18 (left shoe) and 19 (right shoe) are extracted as the second analysis map.
The second analysis map represents the pixel region of the body trunk of the model, with interference regions such as the background removed. Therefore, the body trunk image can be extracted from the model image based on the second analysis map. The body trunk image retains the identity information of the model while removing background interference; inputting it into the fitting network for training improves the convergence speed and accuracy of the model.
S42: binarizing the second analysis map to obtain a binarized image, wherein pixels corresponding to the body trunk region in the binarized image are 1 and pixels corresponding to other regions are 0.
For example, in the second analysis map, the pixel regions representing the human identity characteristics, i.e., categories 2 (hair), 10 (torso skin), 13 (face), 14 (left arm), 15 (right arm), 16 (left leg), 17 (right leg), 18 (left shoe) and 19 (right shoe), are set to 1, and the pixels of the other regions are set to 0; that is, binarization is performed, yielding a binarized image M in which pixel categories are characterized only by 0 and 1.
S43: and multiplying the corresponding positions of the pixels in the model image and the pixels in the binary image to obtain a body trunk image.
The binarized image M has the same size as the first analysis map and the model image and contains only the elements 0 and 1. In the binarized image M, the pixels corresponding to the body trunk region of the model are 1, and the pixels corresponding to the other regions are 0. Therefore, multiplying the model image F by the binarized image M position by position yields the body trunk image F' = F × M.
The position-wise multiplication can be expressed by the formula F'_ij = F_ij × M_ij, where F_ij is the pixel value in the i-th row and j-th column of the model image, M_ij is the value in the i-th row and j-th column of the binarized image, and F'_ij is the pixel value in the i-th row and j-th column of the body trunk image.
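The extraction of steps S41 to S43 can be sketched as follows; the parsing labels follow the 20-category scheme described above, while the image shapes and array layout are assumptions made for illustration.

```python
import numpy as np

# Parsing categories treated as the body trunk (identity) region:
# hair, torso skin, face, arms, legs and shoes.
TORSO_LABELS = (2, 10, 13, 14, 15, 16, 17, 18, 19)

def extract_body_trunk(model_image, parsing_map):
    """model_image: (H, W, 3) uint8 image; parsing_map: (H, W) labels in 0..19."""
    # S41/S42: keep only the trunk categories and binarize them into a 0/1 mask M.
    mask = np.isin(parsing_map, TORSO_LABELS).astype(model_image.dtype)
    # S43: position-wise multiplication F' = F x M (mask broadcast over channels).
    return model_image * mask[..., None]

model_image = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
parsing_map = np.random.randint(0, 20, (512, 512))
body_trunk_image = extract_body_trunk(model_image, parsing_map)
```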
In this embodiment, a second analysis map reflecting the identity characteristics is separated from the first analysis map, binarization is performed on the second analysis map, and the resulting binarized image is multiplied position by position with the model image to obtain the body trunk image. In this way, the identity characteristics of the model are accurately kept in the body trunk image while the features of the original clothes to be replaced are removed, so that the model is not distorted after being changed into the clothes to be tried on.
S50: and inputting the clothes deformation image, the body trunk image and the clothes attribute vector into an image generation network to obtain a prediction fitting image.
Referring again to fig. 4, the image generation network includes a cascade of an encoder, a fusion module, and a decoder. The clothes deformation image and the body trunk image are input into an encoder to be encoded, the obtained encoding result and the clothes attribute vector are input into a fusion module to be fused, the obtained fusion result is input into a decoder to be decoded, and the prediction fitting image is obtained.
As will be appreciated by those skilled in the art, the encoder includes a plurality of downsampling convolutional layers, and the size of the output feature map becomes progressively smaller as the downsampling layers are traversed. The decoder includes a plurality of upsampling convolutional layers, and the size of the output feature map becomes progressively larger as the upsampling layers are traversed. Those skilled in the art can configure parameters such as the kernel size and stride of each downsampling and upsampling convolutional layer according to actual requirements. In this embodiment, the structures of the encoder and the decoder are not limited.
In some embodiments, the downsampling convolutional layers of the encoder mainly use 3 × 3 convolution kernels. Feature extraction is performed on the channel-wise concatenation of the clothes deformation image and the body trunk image, pooling layers are used to reduce the spatial dimensions during feature extraction, and the output encoding result is finally a feature map of size 4 × 4 × 256. This feature map and the clothes attribute vector are input into the fusion module for fusion, and the fusion result is input into the decoder for decoding to obtain the predicted fitting image. In some embodiments, the decoder uses upsampling convolutional layers with a residual structure to upsample the fusion result, gradually increasing the spatial size, and outputs a predicted fitting image of size 512 × 512 × 3.
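To make the channel concatenation and the feature-map sizes mentioned above concrete, a simplified encoder/decoder skeleton is sketched below. The layer counts, channel widths, strided convolutions (used here in place of pooling for brevity) and the 512 × 512 input resolution are assumptions consistent with the sizes stated above, not the exact network of this embodiment.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def up(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

# Encoder: 6 input channels (3 for the clothes deformation image + 3 for the
# body trunk image, concatenated along the channel axis) down to 4 x 4 x 256.
encoder = nn.Sequential(down(6, 32), down(32, 64), down(64, 128), down(128, 256),
                        down(256, 256), down(256, 256), down(256, 256))

# Decoder: upsample the fused 4 x 4 x 256 result back to a 512 x 512 x 3 image.
decoder = nn.Sequential(up(256, 256), up(256, 256), up(256, 128), up(128, 64),
                        up(64, 32), up(32, 16), up(16, 8),
                        nn.Conv2d(8, 3, 3, padding=1), nn.Tanh())

clothes_warp = torch.rand(1, 3, 512, 512)
body_trunk = torch.rand(1, 3, 512, 512)
code = encoder(torch.cat([clothes_warp, body_trunk], dim=1))
print(code.shape)        # torch.Size([1, 256, 4, 4])
fitting = decoder(code)  # fusion with the clothes attribute vector is omitted here
print(fitting.shape)     # torch.Size([1, 3, 512, 512])
```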
In order to better fuse the encoding result and the clothes attribute vector and thereby achieve a finer deformation of the clothes attributes, in some embodiments the fusion module comprises a first convolutional layer, a second convolutional layer and a fusion layer. Both the first convolutional layer and the second convolutional layer are provided with convolution kernels for feature extraction.
The fusion module performs fusion processing on the encoding result and the clothes attribute vector in the following manner:
respectively performing feature extraction on the coding result through the first convolutional layer and the second convolutional layer to obtain a first intermediate feature map and a second intermediate feature map; respectively extracting the features of the clothes attribute vectors through the first convolution layer and the second convolution layer to obtain a first attribute feature map and a second attribute feature map; and carrying out fusion processing on the encoding result, the first intermediate feature map, the second intermediate feature map, the first attribute feature map and the second attribute feature map through a fusion layer.
Referring to fig. 11, after the coding result is input into the fusion module, the first convolution layer performs convolution processing on the coding result to extract features, so as to obtain a first intermediate feature map μ (x), and the second convolution layer performs convolution processing on the coding result to extract features, so as to obtain a second intermediate feature map σ (x). It will be appreciated that here the encoding result is a characteristic map of the encoder output. After the clothes attribute vectors are input into the fusion module, the first convolution layer conducts convolution processing on the clothes attribute vectors to extract features, a first attribute feature map mu (y) is obtained, and the second convolution layer conducts convolution processing on the clothes attribute vectors to extract features, a second attribute feature map sigma (y) is obtained.
Then, the encoding result, the first intermediate feature map, the second intermediate feature map, the first attribute feature map and the second attribute feature map are all input into the fusion layer for fusion processing.
In this embodiment, feature extraction is performed on the encoding result and the clothes attribute vector by two different convolution branches (the first convolutional layer and the second convolutional layer) to obtain the first intermediate feature map, the second intermediate feature map, the first attribute feature map and the second attribute feature map. The intermediate feature maps retain features of the encoding result from different perspectives, and the attribute feature maps retain features of the clothes attributes from different perspectives, so that the fusion result (fused feature map) obtained by fusing them reflects the features of the encoding result and the clothes attribute vector more comprehensively without losing features.
In some embodiments, the fusion layer is fused using the following formula:
IN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y)
wherein x is the encoding result, μ(x) is the first intermediate feature map, σ(x) is the second intermediate feature map, y is the clothes attribute vector, μ(y) is the first attribute feature map, σ(y) is the second attribute feature map, and IN(x, y) is the fusion result.
Here, normalization is performed on the encoding result, which reduces the differences in pixel values; the normalized result is then fused with the first attribute feature map and the second attribute feature map. This effectively reduces the influence of pixel-value differences on the fusion, so that the clothes attribute features reflected by the first attribute feature map and the second attribute feature map can be better fused and retained.
In this embodiment, the normalized result (x − μ(x)) / σ(x) is first multiplied by the second attribute feature map σ(y) and then added to the first attribute feature map μ(y); through this multiply-then-add fusion, the clothes attribute features that need to be edited can be better incorporated.
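A sketch of this fusion as an adaptive-instance-normalization-style layer is given below. The kernel sizes, the channel widths, and the use of separate convolution branches for the encoding result and the attribute vector (the text above applies the same first and second convolutional layers to both) are simplifying assumptions of the sketch.

```python
import torch
import torch.nn as nn

class AttributeFusion(nn.Module):
    """Fuse an encoded feature map x with a clothes attribute vector y:
    IN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y)."""

    def __init__(self, channels=256, attr_dim=128):
        super().__init__()
        # Branches producing mu(x) and sigma(x) from the encoding result.
        self.mu_x = nn.Conv2d(channels, channels, 3, padding=1)
        self.sigma_x = nn.Conv2d(channels, channels, 3, padding=1)
        # Branches producing mu(y) and sigma(y) from the attribute vector.
        self.mu_y = nn.Conv2d(attr_dim, channels, 1)
        self.sigma_y = nn.Conv2d(attr_dim, channels, 1)

    def forward(self, x, y):
        y = y[:, :, None, None]                            # (B, attr_dim) -> (B, attr_dim, 1, 1)
        mu_x, sigma_x = self.mu_x(x), self.sigma_x(x)
        mu_y, sigma_y = self.mu_y(y), self.sigma_y(y)
        normalized = (x - mu_x) / (sigma_x.abs() + 1e-5)   # normalize the encoding result
        return sigma_y * normalized + mu_y                 # scale then shift by the attributes

fusion = AttributeFusion()
x = torch.rand(1, 256, 4, 4)   # encoding result
y = torch.rand(1, 128)         # clothes attribute vector
print(fusion(x, y).shape)      # torch.Size([1, 256, 4, 4])
```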
S60: and calculating the loss between the predicted fitting image and the model image by adopting a loss function, and performing iterative training on the fitting network according to the loss sum corresponding to the plurality of image groups until convergence to obtain a fitting model.
Here, the loss function may be configured in the terminal by a person skilled in the art and sent to the server together with the fitting network. After obtaining the predicted fitting images corresponding to the image groups, the server calculates the loss between each model image and its predicted fitting image using the loss function, and iteratively trains the fitting network based on the losses until it converges, thereby obtaining the fitting model.
It can be understood that the smaller the difference between a model image and the corresponding predicted fitting image, the more similar they are, which indicates that the predicted fitting image accurately restores the real fitting effect. Therefore, the model parameters of the fitting network can be adjusted according to the difference between each model image and its predicted fitting image, and the fitting network can be iteratively trained. The difference is back-propagated so that the predicted fitting images output by the fitting network continuously approach the model images, until the fitting network converges and the fitting model is obtained. It will be appreciated that, in some embodiments, the fitting network comprises an image generation network, a clothes deformation network and/or a multi-layer perceptron module, and the model parameters comprise the model parameters of the image generation network, the clothes deformation network and/or the multi-layer perceptron module, thereby enabling end-to-end training.
In some embodiments, the Adam algorithm is used to optimize the model parameters. For example, the number of iterations is set to 100,000, the initial learning rate is set to 0.001, the weight decay is set to 0.0005, and the learning rate is decayed to 1/10 of its value every 1000 iterations. The learning rate and the loss can be input into the Adam algorithm to obtain adjusted model parameters, which are used for the next training iteration; when training ends, the converged model parameters of the fitting network are output, thereby obtaining the fitting model.
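A sketch of this optimization schedule is given below; mapping the stated weight attenuation to Adam's weight_decay argument and using a step learning-rate scheduler are assumptions, and the network shown is only a placeholder.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the composed fitting network (clothes deformation
# network, multi-layer perceptron module and image generation network).
fitting_network = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))

optimizer = torch.optim.Adam(fitting_network.parameters(), lr=0.001, weight_decay=0.0005)
# Decay the learning rate to 1/10 of its current value every 1000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)

for iteration in range(100_000):
    # In practice this is the total loss over a batch of image groups
    # (adversarial + perceptual + clothes deformation losses).
    loss = fitting_network(torch.rand(1, 3, 64, 64)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```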
It is understood that, after the server obtains the converged model parameters of the fitting network (i.e. the final model parameters), the final model parameters may be sent to the terminal, and the fitting network in the terminal is configured with the final model parameters to obtain the fitting model. In some embodiments, the server may also save the fitting network and the final model parameters to obtain a fitting model.
In this embodiment, the fitting network is trained by using the clothes deformation images, the body trunk images and the clothes attribute vectors corresponding to the plurality of image groups to obtain a fitting model. The clothes deformation image not only retains the clothes characteristics, but also can preliminarily reflect the approximate deformation shape of the clothes, so that the fitting network can be effectively guided to enable the clothes and the human body to be closely combined and the fitting effect to be more vivid when the predicted fitting image is generated. The body trunk image reserves the trunk characteristics of the model, can guide the fitting network to enable the trunk contour to be accurate when generating the prediction fitting image, is not influenced by the original clothes on the model body, and is beneficial to predicting the style of the self-adaptive fitting clothes in the clothes area in the fitting image. In addition, the clothes attribute vector reflects the attributes of the try-on clothes in the try-on effect, such as 'S code, loose and long sleeve', and can effectively guide the fitting network to edit and modify the attributes of the try-on clothes, so that the try-on clothes in the predicted fitting image conform to the expected clothes attributes (namely the clothes attribute vector). The attributes of the try-on clothes are guided to be edited based on the preliminary deformation of the clothes in the clothes deformation image, the trunk characteristics of the body trunk image and the clothes attribute vector, so that the predicted try-on image is vivid in try-on effect, and the attributes of the try-on clothes can be edited and modified. The fitting network is trained by adopting the plurality of image groups in the above manner, the fitting effect in the predicted fitting image corresponding to each image group is constrained to be continuously close to the real wearing effect in the model image based on loss and back propagation, and the attributes of the fitting clothes can be edited and modified according to the clothes attribute vector, such as code number change, length change and the like. Therefore, the fitting model obtained through training can edit the fitting clothes according to the input clothes attributes, and the virtual fitting effect of editing the clothes attributes is achieved.
In some embodiments, the loss function includes a conditional adversarial loss, a perceptual loss, and a clothes deformation loss. The clothes deformation loss reflects the difference between the clothes in the predicted fitting image and the clothes in the model image.
The adversarial loss measures whether the predicted fitting image can be distinguished from the corresponding model image. When the adversarial loss is large, the distribution of the predicted fitting image differs greatly from the distribution of the model image; when the adversarial loss is small, the difference is small and the distribution of the predicted fitting image is close to that of the model image. Here, the distribution of an image refers to the distribution of its parts, for example the distribution of the clothes, head, limbs, and so on.
The perceptual loss compares feature maps obtained by convolving the model image with feature maps obtained by convolving the predicted fitting image, so that their high-level information (content and global structure) becomes close.
The clothes deformation loss reflects the difference between the clothes in the predicted fitting image and the clothes in the model image. When the difference between the clothes in the clothes deformation image and the clothes in the model image is large, the deformation reflected in the clothes region of the clothes deformation image deviates from the real fitting effect, and the deformation effect of the clothes deformation algorithm is poor. When the difference between the clothes region in the clothes deformation image and the clothes region in the model image is small, the deformation reflected in the clothes region of the clothes deformation image is close to the real fitting effect, and the deformation effect of the clothes deformation algorithm is better.
In some embodiments, the loss function comprises:

Loss = L_cGAN + λ1 · L_percept + λ2 · L_clother

L_cGAN = E[log D(F)] + E[1 − log D(Y)]

L_percept = Σ_{i=1}^{V} (1/R_i) · ‖F_i(F) − F_i(Y)‖

L_clother = ‖C_T − C_W‖ + ‖C_T_mask − C_W_mask‖

wherein Loss is the loss function, L_cGAN is the adversarial loss, L_percept is the perceptual loss, L_clother is the clothes deformation loss, λ1 and λ2 are settable hyper-parameters, F is the model image (the real fitting image), Y is the predicted fitting image, D is the discriminator, V is the number of feature-map layers of the VGG network, F_i and R_i are respectively the feature map and the number of elements of the i-th layer of the VGG network, where R_i = C_i · W_i · H_i and C_i, W_i, H_i denote the channel number, width and length of the i-th layer feature map; F_i(F) and F_i(Y) are thus the i-th layer feature maps of the model image and the predicted fitting image. C_T is the image of the try-on garment in the model image, C_W is the clothes deformation image of that garment, C_T_mask is the garment region in the model image, and C_W_mask is the garment region in the clothes deformation image.
In some embodiments, a convolutional neural network such as VGG may be used to downsample the model image and extract the V feature maps F_i(F). Similarly, the predicted fitting image can be downsampled with the same convolutional neural network to extract the V feature maps F_i(Y).
In some embodiments, when i = 1, F_i(F) and F_i(Y) have size 8 × 8; when i = 2, 16 × 16; when i = 3, 32 × 32; when i = 4, 64 × 64; when i = 5, 128 × 128; when i = 6, 256 × 256; and when i = 7, 512 × 512.
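A non-authoritative sketch of computing the three loss terms is shown below; the use of torchvision's VGG19, the particular feature layers, the L1 norms, and the literal form of the adversarial term are assumptions, since only the general form of each term is specified above.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

vgg_features = vgg19().features.eval()   # pretrained weights would be loaded in practice
VGG_LAYERS = (2, 7, 12, 21, 30)          # layers whose outputs are compared (assumption)

def vgg_feature_maps(image):
    """Collect intermediate VGG19 feature maps for a (B, 3, H, W) image."""
    feats, x = [], image
    for idx, layer in enumerate(vgg_features):
        x = layer(x)
        if idx in VGG_LAYERS:
            feats.append(x)
    return feats

def perceptual_loss(model_img, pred_img):
    # Mean-reduced L1 already divides by the number of elements of each layer,
    # playing the role of the 1/R_i normalization.
    return sum(F.l1_loss(a, b) for a, b in
               zip(vgg_feature_maps(model_img), vgg_feature_maps(pred_img)))

def garment_deformation_loss(garment_in_model, warped_garment, mask_in_model, warped_mask):
    return F.l1_loss(garment_in_model, warped_garment) + F.l1_loss(mask_in_model, warped_mask)

def total_loss(disc_real, disc_fake, model_img, pred_img,
               garment_in_model, warped_garment, mask_in_model, warped_mask,
               lambda1=1.0, lambda2=1.0):
    # Adversarial term written literally in the form stated above; disc_real and
    # disc_fake are discriminator outputs in (0, 1) for F and Y respectively.
    adv = torch.log(disc_real + 1e-8).mean() + (1.0 - torch.log(disc_fake + 1e-8)).mean()
    return (adv
            + lambda1 * perceptual_loss(model_img, pred_img)
            + lambda2 * garment_deformation_loss(garment_in_model, warped_garment,
                                                 mask_in_model, warped_mask))
```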
Therefore, the fitting network is iteratively trained based on the differences calculated by the loss function comprising the adversarial loss, the perceptual loss and the clothes deformation loss; the predicted fitting images are thereby constrained to approach the model images (i.e., the real fitting images) in terms of distribution, content features and clothes shape, which helps improve the fitting effect of the trained fitting model.
In conclusion, the fitting network is trained by adopting the clothes deformation images, the body trunk images and the clothes attribute vectors corresponding to the plurality of image groups to obtain the fitting model. The clothes deformation image not only retains the clothes characteristics, but also can preliminarily reflect the approximate deformation shape of the clothes, so that the fitting network can be effectively guided to enable the clothes and the human body to be closely combined and the fitting effect to be more vivid when the predicted fitting image is generated. The body trunk image reserves the trunk characteristics of the model, can guide the fitting network to enable the trunk contour to be accurate when generating the prediction fitting image, is not influenced by the original clothes on the model body, and is beneficial to predicting the style of the self-adaptive fitting clothes in the clothes area in the fitting image. In addition, the clothes attribute vector reflects the attributes of the try-on clothes in the try-on effect, such as 'S code, loose and long sleeve', and can effectively guide the fitting network to edit and modify the attributes of the try-on clothes, so that the try-on clothes in the predicted fitting image accord with the expected clothes attributes (namely the clothes attribute vector). The attributes of the try-on clothes are guided to be edited based on the preliminary deformation of the clothes in the clothes deformation image, the trunk characteristics of the body trunk image and the clothes attribute vector, so that the predicted try-on image is vivid in try-on effect, and the attributes of the try-on clothes can be edited and modified. The fitting network is trained by adopting the plurality of image groups in the above manner, the fitting effect in the predicted fitting image corresponding to each image group is constrained to be continuously close to the real wearing effect in the model image based on loss and back propagation, and the attributes of the fitting clothes can be edited and modified according to the clothes attribute vector, such as code number change, length change and the like. Therefore, the fitting model obtained through training can edit the fitting clothes according to the input clothes attributes, and the virtual fitting effect of editing the clothes attributes is achieved.
After the fitting model is obtained through training by the fitting model training method provided by the embodiment of the application, the fitting model can be applied to the virtual fitting to generate a fitting image. The virtual fitting method provided by the embodiment of the application can be implemented by various electronic devices with computing processing capacity, such as an intelligent terminal, a server and the like.
The virtual fitting method provided by the embodiment of the present application is described below with reference to exemplary applications and implementations of the terminal provided by the embodiment of the present application. Referring to fig. 12, fig. 12 is a schematic flowchart of a virtual fitting method provided in the embodiment of the present application. The method S200 includes the steps of:
S201: acquiring an image of the try-on person, an image of the clothes to be tried on, and attribute text information of the clothes to be tried on.
A fitting assistant (application software) installed in a terminal (such as a smartphone or a smart fitting mirror) acquires the try-on person image, the image of the clothes to be tried on, and the attribute text information of the clothes to be tried on. The try-on person image can be captured by the terminal or input into the terminal by the user. The image of the clothes to be tried on may be selected by the user within the fitting assistant, and the attribute text information of the clothes to be tried on may be input into the terminal by the user.
It will be appreciated that the image of the try-on includes the body of the try-on and the image of the garment to be tried includes the garment.
S202: and performing feature coding on the clothes attribute text information to be tested to obtain the clothes attribute vector to be tested.
It is understood that the attribute text information of the clothes to be tried on may include text describing the size, sleeve length, coat length, bust, waist circumference, hip circumference, trouser length, fit (cut), and the like. This attribute text information is data in text format. In order for the fitting model to understand the attribute features reflected by the attribute text information of the clothes to be tried on, the text information is feature-coded to obtain the attribute vector of the clothes to be tried on. For a specific encoding implementation, refer to the encoding scheme of step S20 in the above embodiment of training the fitting model.
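For illustration, a bag-of-words plus multi-layer-perceptron encoding of the kind described for step S20 (and claim 4) might be sketched as follows; the vocabulary, vector sizes and MLP widths are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical attribute vocabulary; in practice it would be built from the
# size / sleeve-length / fit (etc.) terms appearing in the training data.
VOCAB = ["S", "M", "L", "loose", "slim", "long-sleeve", "short-sleeve", "long", "short"]

def bag_of_words(attribute_tokens):
    """Encode attribute words as a multi-hot bag-of-words text code."""
    vec = torch.zeros(len(VOCAB))
    for token in attribute_tokens:
        if token in VOCAB:
            vec[VOCAB.index(token)] = 1.0
    return vec

# Multi-layer perceptron mapping the text code to the clothes attribute vector.
mlp = nn.Sequential(nn.Linear(len(VOCAB), 64), nn.ReLU(), nn.Linear(64, 128))

text_code = bag_of_words(["S", "loose", "long-sleeve"])
attribute_vector = mlp(text_code.unsqueeze(0))   # shape (1, 128)
```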
S203: and (3) preliminarily deforming the clothes in the clothes image to be tested according to the human body structure in the image of the person wearing the clothes to be tested to obtain a deformed image of the clothes to be tested.
In order for the clothes to be tried on to fit the human body structure when they are fused with the try-on person in the fitting model, the clothes in the image of the clothes to be tried on are deformed according to the human body structure of the try-on person to obtain the deformed image of the clothes to be tried on. The clothes in the deformed image take on a draped, three-dimensional appearance adapted to the body structure of the try-on person, which helps the fused clothes fit closely to the try-on person so that the fitting effect is real and natural. For a specific deformation implementation, refer to the deformation scheme of step S30 in the above embodiment of training the fitting model.
S204: and carrying out human body analysis on the image of the try-on wearer to obtain a first try-on wearer analysis chart, and extracting a body trunk chart of the try-on wearer from the image of the try-on wearer according to the first try-on wearer analysis chart.
It is understood that when changing the clothes of the try-on person in the try-on person image, the identity of the try-on person, and the like, are required to be retained.
Before the clothes to be tried on and the try-on person input into the fitting model are fused, the identity characteristics of the try-on person are extracted, that is, the body trunk image of the try-on person is obtained.
The human body analysis, the second analysis chart separation, and the body trunk chart extraction have been described in detail in step S40 in the training fitting model embodiment, and a description thereof will not be repeated.
S205: inputting the deformation image of the clothes to be tried on, the body trunk image of the person to be tried on and the attribute vector of the clothes to be tried on into the fitting model to generate a fitting image.
Wherein, the fitting model is obtained by adopting any one of the methods for training the fitting model.
The fitting assistant arranged in the terminal comprises a fitting model, the fitting model is called to perform virtual fitting, and a fitting image is generated. It can be understood that the fitting model is obtained by training through the method for training the fitting model in the foregoing embodiment, and has the same structure and function as the fitting model in the foregoing embodiment, which is not described in detail herein.
In short, the fitting model is input with the deformation image of the clothes to be fitted, the body trunk image of the fitting person and the attribute vector of the clothes to be fitted, and the fitting model can edit the fitting clothes according to the input clothes attribute, so that the virtual fitting effect of the editable clothes attribute is realized.
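Putting steps S201 to S205 together, a high-level inference sketch might look like the following; every callable name is a hypothetical stand-in for the corresponding step described above, not an actual API of this application.

```python
def virtual_try_on(fitter_image, garment_image, garment_attribute_text,
                   encode_attributes, detect_keypoints, parse_human,
                   extract_torso, deform_garment, fitting_model):
    """Hypothetical end-to-end inference pipeline for steps S201-S205."""
    # S202: encode the attribute text into the clothes attribute vector.
    attribute_vector = encode_attributes(garment_attribute_text)

    # S204: parse the try-on person image and keep only the body trunk region.
    parsing_map = parse_human(fitter_image)
    torso_image = extract_torso(fitter_image, parsing_map)

    # S203: deform the garment toward the try-on person's body structure.
    keypoints = detect_keypoints(fitter_image)
    warped_garment = deform_garment(torso_image, keypoints, garment_image, attribute_vector)

    # S205: generate the fitting image with the trained fitting model.
    return fitting_model(warped_garment, torso_image, attribute_vector)
```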
Embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions for causing an electronic device to perform a method for training a fitting model provided in an embodiment of the present application, for example, a method for training a fitting model as shown in fig. 3 to 11, or a virtual fitting method provided in an embodiment of the present application, for example, a virtual fitting method as shown in fig. 12.
In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of a program, software module, script, or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device (a device including a smart terminal and a server), or on multiple computing devices located at one site, or distributed across multiple sites interconnected by a communication network.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions, and the program instructions, when executed by a computer, cause the computer to execute a method for training a fitting model or a virtual fitting method as in the foregoing embodiments.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, it is obvious to those skilled in the art that the embodiments may be implemented by software plus a general hardware platform, and may also be implemented by hardware. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; within the context of the present application, where technical features in the above embodiments or in different embodiments may also be combined, the steps may be implemented in any order and there are many other variations of the different aspects of the present application described above which are not provided in detail for the sake of brevity; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method for training a fitting model is characterized in that a fitting network comprises an image generation network, wherein the image generation network comprises a coder, a fusion module and a decoder which are cascaded;
the method comprises the following steps:
acquiring a plurality of image groups, wherein each image group comprises a clothes image and a model image, the clothes image corresponds to editable clothes attribute text information, the model in the model image wears the clothes in the clothes image, and the clothes are clothes edited according to the clothes attribute text information;
performing feature coding on the clothes attribute text information to obtain a clothes attribute vector;
preliminarily deforming the clothes in the clothes image according to the human body structure of the model in the model image to obtain a clothes deformation image;
analyzing the human body of the model image to obtain a first analysis image, and extracting a body trunk image from the model image according to the first analysis image;
inputting the clothes deformation image, the body trunk image and the clothes attribute vector into the image generation network to obtain a prediction fitting image, wherein the clothes deformation image and the body trunk image are input into the encoder to be encoded, the obtained encoding result and the clothes attribute vector are input into the fusion module to be fused, and the obtained fusion result is input into the decoder to be decoded to obtain the prediction fitting image;
and calculating the loss between the predicted fitting image and the model image by adopting a loss function, and performing iterative training on the fitting network according to the loss sums corresponding to the plurality of image groups until convergence to obtain the fitting model.
2. The method of claim 1, wherein the fitting network further comprises a garment deformation network;
the preliminary deformation of the clothes in the clothes image according to the human body structure of the model in the model image to obtain a clothes deformation image comprises the following steps:
detecting key points of the human body on the model image to obtain a key point image;
inputting the body trunk image, the key point image, the clothes image and the clothes attribute vector into the clothes deformation network, and outputting the clothes deformation image.
3. The method of claim 1, wherein the extracting the body trunk image from the model image according to the first analysis image comprises:
separating a second analysis map from the first analysis image, wherein the second analysis map reflects a pixel region of the body trunk of the model;
carrying out binarization processing on the second analysis map to obtain a binarized image, wherein pixels corresponding to the body trunk region in the binarized image are 1, and pixels corresponding to other regions are 0;
and multiplying the corresponding positions of the pixels in the model image and the pixels in the binarized image to obtain the body trunk image.
4. The method of claim 1, wherein the fitting network further comprises a multi-tier perceptron module;
the feature coding of the clothes attribute text information to obtain a clothes attribute vector includes:
coding each word in the clothes attribute text information by using a bag-of-words model to obtain a text code;
and performing feature extraction on the text code by adopting the multilayer perceptron module to obtain the clothing attribute vector.
5. The method of claim 1, wherein the fusion module comprises a first convolutional layer, a second convolutional layer, and a fusion layer;
the fusion module performs fusion processing on the coding result and the clothes attribute vector in the following manner:
respectively performing feature extraction on the coding result through the first convolution layer and the second convolution layer to obtain a first intermediate feature map and a second intermediate feature map;
respectively performing feature extraction on the clothes attribute vector through the first convolution layer and the second convolution layer to obtain a first attribute feature map and a second attribute feature map;
and performing fusion processing on the encoding result, the first intermediate feature map, the second intermediate feature map, the first attribute feature map and the second attribute feature map through the fusion layer.
6. The method of claim 5, wherein the fusion layer is fused using the following formula:
IN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y)
wherein x is the encoding result, μ (x) is the first intermediate feature map, σ (x) is the second intermediate feature map, y is the clothing attribute vector, μ (y) is the first attribute feature map, σ (y) is the second attribute feature map, and IN (x, y) is the fused result.
7. The method of claim 1, wherein the loss function comprises a conditional adversarial loss, a perceptual loss, and a clothes deformation loss, wherein the clothes deformation loss reflects a difference between clothes in the predicted fitting image and clothes in the model image.
8. A virtual fitting method, comprising:
acquiring an image of a try-on wearer, an image of clothes to be tested and attribute text information of the clothes to be tested;
performing feature coding on the clothes attribute text information to be tested to obtain a clothes attribute vector to be tested;
primarily deforming the clothes in the clothes image to be tested according to the human body structure in the image of the try-on person to obtain a deformed image of the clothes to be tested;
analyzing the human body of the image of the try-on wearer to obtain a first try-on wearer analysis chart, and extracting a body trunk chart of the try-on wearer from the image of the try-on wearer according to the first try-on wearer analysis chart;
inputting the deformation image of the clothes to be fitted, the body trunk image of the fitting person and the attribute vector of the clothes to be fitted into a fitting model to generate a fitting image, wherein the fitting model is obtained by adopting the method for training the fitting model according to any one of claims 1 to 7.
9. An electronic device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer-readable storage medium having computer-executable instructions stored thereon for causing a computer device to perform the method of any one of claims 1-8.
CN202210961538.3A 2022-08-11 2022-08-11 Method for training fitting model, virtual fitting method and related device Pending CN115439308A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210961538.3A CN115439308A (en) 2022-08-11 2022-08-11 Method for training fitting model, virtual fitting method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210961538.3A CN115439308A (en) 2022-08-11 2022-08-11 Method for training fitting model, virtual fitting method and related device

Publications (1)

Publication Number Publication Date
CN115439308A true CN115439308A (en) 2022-12-06

Family

ID=84242115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210961538.3A Pending CN115439308A (en) 2022-08-11 2022-08-11 Method for training fitting model, virtual fitting method and related device

Country Status (1)

Country Link
CN (1) CN115439308A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861488A (en) * 2022-12-22 2023-03-28 中国科学技术大学 High-resolution virtual reloading method, system, equipment and storage medium
CN117575746A (en) * 2024-01-17 2024-02-20 武汉人工智能研究院 Virtual try-on method and device, electronic equipment and storage medium
CN117575746B (en) * 2024-01-17 2024-04-16 武汉人工智能研究院 Virtual try-on method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination