CN116563425A - Method for training fitting model, virtual fitting method and related device
- Publication number: CN116563425A
- Application number: CN202310470647.XA
- Authority: CN (China)
- Prior art keywords: image, fitting, clothes, fusion, model
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06T 11/60 — Editing figures and text; Combining figures or text (2D [Two Dimensional] image generation)
- G06N 3/0464 — Convolutional networks [CNN, ConvNet]
- G06Q 30/0643 — Graphical representation of items or shoppers (electronic shopping interfaces)
- G06T 3/4046 — Scaling of whole images or parts thereof using neural networks
- G06T 3/4053 — Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06T 3/4076 — Scaling based on super-resolution using the original low-resolution images to iteratively correct the high-resolution images
- Y02P 90/30 — Computing systems specially adapted for manufacturing
Abstract
The embodiments of the present application relate to the technical field of image processing and disclose a method for training a fitting model, a virtual fitting method and a related device. A human body key point image, a body trunk image and a first clothing code are used as inputs to a generation module, which encodes and then decodes them; during decoding, the first clothing code is feature-fused with at least one generated feature map to produce an initial fitting image in which the clothing has a three-dimensional appearance. A fusion module then fuses a second clothing code with the initial fitting image, and the fusion result is adjusted to generate a high-resolution fitting image. The fitting model obtained by training can therefore generate a high-resolution virtual fitting image in which the tried-on clothing has a realistic, close-fitting three-dimensional appearance.
Description
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a method for training a fitting model, a virtual fitting method and a related device.
Background
With the continuous progress of modern technology, the scale of online shopping keeps increasing, and users can purchase clothing on online shopping platforms through their mobile phones. However, because the information available about clothing offered for sale is generally a two-dimensional display picture, users cannot know how the clothing would look when worn on themselves; as a result, they may buy clothing that does not suit them, and the shopping experience is relatively poor.
With the continuous development of neural networks, they have been widely applied to the field of image generation. Accordingly, researchers have applied neural networks to virtual fitting and proposed various fitting algorithms. However, most of the virtual fitting algorithms known to the inventors of the present application deform the clothing and generate a fitting image that simulates wearing it; the resolution of the generated fitting image is relatively low, and the clothing simulated as being worn lacks a three-dimensional appearance, so the final fitting effect looks like a flat texture map and is not sufficiently realistic or natural.
Disclosure of Invention
In view of this, some embodiments of the present application provide a method for training a fitting model, a virtual fitting method, and a related apparatus. The fitting model obtained by training with the method can generate a high-resolution virtual fitting image in which the tried-on clothing has a realistic three-dimensional appearance.
In a first aspect, some embodiments of the present application provide a method for training a fitting model, where a fitting network includes a generation module, an encoding module and a fusion module; the generation module includes an encoder and a decoder and is configured to encode and then decode an input image, the encoding module is configured to perform feature extraction on an input image, and the fusion module is configured to perform feature fusion on at least two input feature maps;
the method comprises the following steps:
acquiring a plurality of image groups, wherein the image groups comprise clothes images and model images, and models in the model images wear clothes in the clothes images;
extracting a body trunk image and a body key point image from the model image;
inputting the body trunk image and the human body key point image into the encoder in the generation module for encoding, and inputting the encoding result and a first clothing code into the decoder in the generation module for decoding, to generate an initial fitting image; at least one feature map generated in the decoding process is feature-fused with the first clothing code, where the first clothing code is obtained by the encoding module through feature extraction on the clothing image and reflects the style features of the clothing;
inputting the initial fitting image and a second clothing code into the fusion module for feature fusion, and adjusting the resolution of the output fusion result to obtain a predicted fitting image, where the second clothing code is obtained by the encoding module through feature extraction on the clothing image and reflects the texture features of the clothing;
and calculating the loss between the predicted fitting image and the model image using a loss function, and iteratively training the fitting network according to the sum of the losses corresponding to the plurality of image groups until convergence, to obtain the fitting model.
In some embodiments, the decoder includes a plurality of up-sampling convolution layers and feature fusion layers arranged alternately in cascade, where a feature fusion layer fuses the feature map output by the preceding up-sampling convolution layer with the first clothing code in the following manner:
performing dimension reduction processing on the first clothing code;
and linearly fusing the dimension-reduced first clothing code with the feature map output by the preceding up-sampling convolution layer to obtain a fused feature map.
In some embodiments, the feature fusion layer performs feature fusion on the feature map output by the preceding up-sampling convolution layer and the first clothing code using the following formula:
IN(V_i, s) = β_{w_i} · V_i + β_{b_i}
where V_i is the feature map output by the i-th up-sampling convolution layer, β_{w_i} is the weight value obtained by transforming the first clothing code input into the i-th feature fusion layer, β_{b_i} is the bias value obtained by transforming the first clothing code input into the i-th feature fusion layer, and IN(V_i, s) is the fused feature map output by the i-th feature fusion layer.
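As an illustration only, the following is a minimal PyTorch-style sketch of such a feature fusion layer. The mapping layers to_weight and to_bias, the code dimension and the channel count are assumptions introduced for the example and are not specified by the patent.

```python
import torch
import torch.nn as nn

class ClothingFusionLayer(nn.Module):
    """Fuses a decoder feature map with the first clothing code (sketch)."""

    def __init__(self, code_dim: int = 1024, channels: int = 512):
        super().__init__()
        # Hypothetical mappings from the 1 x 1024 clothing code to a
        # per-channel weight and bias (the dimension-reduction step).
        self.to_weight = nn.Linear(code_dim, channels)
        self.to_bias = nn.Linear(code_dim, channels)

    def forward(self, feat: torch.Tensor, clothing_code: torch.Tensor) -> torch.Tensor:
        # feat: (N, C, H, W) feature map V_i from the preceding up-sampling layer
        # clothing_code: (N, code_dim) first clothing code
        w = self.to_weight(clothing_code).unsqueeze(-1).unsqueeze(-1)  # beta_w
        b = self.to_bias(clothing_code).unsqueeze(-1).unsqueeze(-1)    # beta_b
        return w * feat + b  # linear (affine) fusion IN(V_i, s)
```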
In some embodiments, the fusion module performs feature fusion on the initial fitting image and the second garment code in the following manner:
performing linear fusion of each row vector of the second clothing code with the initial fitting image respectively;
and summing the linear fusion results corresponding to the respective row vectors.
In some embodiments, the fusion module performs feature fusion on the initial fitting image and the second clothing code using the following formula:
AIN(G1, W) = Σ_i FC(W_i) · G1
where W is the second clothing code, G1 is the initial fitting image, AIN(G1, W) is the fusion result output by the fusion module, FC denotes a fully connected layer, and W_i is the i-th row vector of the second clothing code.
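For illustration, a possible sketch of the fusion module is given below. It assumes the second clothing code is a two-dimensional tensor whose row vectors are each passed through a shared fully connected layer; the shapes and the layer fc are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class ClothingTextureFusion(nn.Module):
    """Fuses the initial fitting image with the second clothing code (sketch)."""

    def __init__(self, row_dim: int = 512, channels: int = 3):
        super().__init__()
        # Hypothetical fully connected layer mapping each row vector of the
        # second clothing code to a per-channel modulation weight.
        self.fc = nn.Linear(row_dim, channels)

    def forward(self, g1: torch.Tensor, w_code: torch.Tensor) -> torch.Tensor:
        # g1: (N, C, H, W) initial fitting image
        # w_code: (N, R, row_dim) second clothing code with R row vectors
        fused = torch.zeros_like(g1)
        for i in range(w_code.shape[1]):
            scale = self.fc(w_code[:, i]).unsqueeze(-1).unsqueeze(-1)  # FC(W_i)
            fused = fused + scale * g1  # linear fusion per row vector, then summation
        return fused
```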
In some embodiments, the fitting network further includes at least one super-division (super-resolution) module, where the super-division module includes at least one residual unit and at least one convolution layer; the residual unit is configured to perform a mapping transformation on an input image to extract features, and the at least one convolution layer is configured to lift the dimension of the feature map output by each residual unit, so as to generate a high-resolution image;
the adjusting the resolution of the output fusion result to obtain a predicted fitting image comprises the following steps:
determining the number R of super-division modules according to the resolution of the fusion result and the resolution of the model image;
and inputting the fusion result into R cascaded super-division modules, performing calculation processing in sequence, and outputting the predicted fitting image.
In some embodiments, the residual unit performs a mapping transform process on the input image in the following manner:
carrying out channel reduction mapping on an input image, and then carrying out channel expansion mapping;
and carrying out fusion processing on the feature map after the channel expansion mapping and the input image to obtain a feature map output by a residual unit.
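The following sketch illustrates one way a residual unit and a super-division module of this kind could look. The channel counts, the number of residual units and the use of pixel shuffling for the dimension-lifting convolution are assumptions for the example only; R such modules would then be cascaded as described above.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Channel-reduce then channel-expand mapping with a residual fusion (sketch)."""

    def __init__(self, channels: int = 64, reduced: int = 16):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.expand(self.act(self.reduce(x)))
        return x + y  # fuse the mapped feature map with the input


class SuperDivisionModule(nn.Module):
    """One super-division module: residual units followed by an up-scaling convolution (sketch)."""

    def __init__(self, channels: int = 64, num_residual: int = 2, scale: int = 2):
        super().__init__()
        self.body = nn.Sequential(*[ResidualUnit(channels) for _ in range(num_residual)])
        # Hypothetical "dimension lifting": convolution followed by pixel shuffle.
        self.up = nn.Sequential(
            nn.Conv2d(channels, channels * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.body(x))
```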
In a second aspect, some embodiments of the present application provide a virtual fitting method, including:
acquiring a fitting person image and an image of the garment to be tried on;
extracting a body trunk image of the fitting person from the fitting person image, and extracting a human body key point image of the fitting person from the fitting person image;
and inputting the body trunk image of the fitting person, the human body key point image of the fitting person and the image of the garment to be tried on into a fitting model to generate a virtual fitting image, wherein the fitting model is trained by the method of the first aspect.
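A minimal inference sketch of this second aspect is shown below; the helper functions extract_torso and extract_keypoints and the loaded fitting_model are hypothetical placeholders, not interfaces defined by the patent.

```python
import torch

def virtual_fitting(person_img, garment_img, fitting_model,
                    extract_torso, extract_keypoints):
    """Inference sketch: helper functions and model interface are assumed."""
    torso = extract_torso(person_img)          # body trunk image of the fitting person
    keypoints = extract_keypoints(person_img)  # human body key point image
    with torch.no_grad():
        fitting_image = fitting_model(torso, keypoints, garment_img)
    return fitting_image
```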
In a third aspect, some embodiments of the present application provide a computer device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, some embodiments of the present application provide a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer device to perform the method of the first aspect.
The beneficial effects of the embodiments of the present application are as follows. Different from the prior art, the fitting network constructed first includes a generation module, an encoding module and a fusion module, where the generation module includes an encoder and a decoder and is configured to encode and then decode an input image; the encoding module is configured to perform feature extraction on an input image, and the fusion module is configured to perform feature fusion on at least two input feature maps. Then, a plurality of image groups are acquired, each image group including a clothing image and a model image, the model in the model image wearing the clothing in the clothing image. A body trunk image and a human body key point image are extracted from the model image. The body trunk image and the human body key point image are input into the encoder in the generation module for encoding, and the encoding result and a first clothing code are input into the decoder in the generation module for decoding, to generate an initial fitting image; at least one feature map generated in the decoding process is feature-fused with the first clothing code, where the first clothing code is obtained by the encoding module through feature extraction on the clothing image and reflects the style features of the clothing. Then, the initial fitting image and a second clothing code are input into the fusion module for feature fusion, and the resolution of the output fusion result is adjusted to obtain a predicted fitting image; the second clothing code is obtained by the encoding module through feature extraction on the clothing image and reflects the texture features of the clothing. Finally, the loss between the predicted fitting image and the model image is calculated using a loss function, and the fitting network is iteratively trained according to the sum of the losses corresponding to the plurality of image groups until convergence, to obtain the fitting model.
In this embodiment, by designing a fitting network with the above structure, the encoding module performs feature extraction on the clothing image to obtain a first clothing code reflecting the clothing style features and a second clothing code reflecting the clothing texture features. The human body key point image reflecting the skeleton structure, the body trunk image retaining the human body characteristics, and the first clothing code are used as inputs of the generation module, which encodes and then decodes them; during decoding, the first clothing code is feature-fused with at least one generated feature map, so that the clothing style features can be fused with the human body features and positioned according to the human body key points, producing an initial fitting image in which the clothing has a three-dimensional appearance. The fusion module then fuses the second clothing code with the initial fitting image, so that the simulated try-on clothing in the fusion result has both a three-dimensional appearance and its original texture. The fusion result is adjusted to generate a high-resolution predicted fitting image. Compared with fusing a deformed two-dimensional clothing image with the body trunk image, the present application fuses the clothing style features and the clothing texture features separately and then adjusts the resolution, which effectively solves the problems that the clothing lacks a three-dimensional appearance in the try-on effect and that the resolution is not high. The fitting network is iteratively trained with the plurality of image groups, and based on the losses and back-propagation, the predicted fitting image corresponding to each image group is constrained to approach the real fitting effect in the model image, so that the three-dimensional appearance of the tried-on clothing better conforms to the contour and posture characteristics of the human body. Therefore, the fitting model obtained by training can generate a high-resolution virtual fitting image in which the tried-on clothing has a realistic, close-fitting three-dimensional appearance.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
FIG. 1 is a schematic illustration of a fitting system according to some embodiments of the present application;
FIG. 2 is a schematic structural diagram of an electronic device according to some embodiments of the present application;
FIG. 3 is a flow chart of a method of training a fitting model in some embodiments of the present application;
FIG. 4 is a schematic structural diagram of a fitting network according to some embodiments of the present application;
FIG. 5 is a schematic diagram of a process for processing an image by a try-on network in some embodiments of the present application;
FIG. 6 is a schematic diagram of a decoder according to some embodiments of the present application;
FIG. 7 is a schematic structural diagram of a super-division module according to some embodiments of the present application;
fig. 8 is a flow chart of a virtual fitting method according to some embodiments of the present application.
Detailed Description
The present application is described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit the present application in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the spirit of the present application. These are all within the scope of the present application.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that, if not conflicting, the various features in the embodiments of the present application may be combined with each other, which is within the protection scope of the present application. In addition, while functional block division is performed in a device diagram and logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. Moreover, the words "first," "second," "third," and the like as used herein do not limit the data and order of execution, but merely distinguish between identical or similar items that have substantially the same function and effect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application in this description is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.
In addition, technical features described below in the various embodiments of the present application may be combined with each other as long as they do not conflict with each other.
To facilitate understanding of the methods provided in the embodiments of the present application, the terms involved in the embodiments of the present application are first described:
(1) Neural network
A neural network is composed of neural units and can be understood in particular as a neural network having an input layer, hidden layers and an output layer; in general, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. A neural network with many hidden layers is called a deep neural network (DNN). From a physical point of view, the operation of each layer of a neural network can be described by the mathematical expression y = a(W·x + b), and can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. dimension raising/lowering; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by "W·x", operation 4 is completed by "+b", and operation 5 is completed by "a()". The word "space" is used here because the object to be classified is not a single thing but a class of things, and the space refers to the collection of all individuals of that class of things. W is the weight matrix of a layer of the neural network, and each value in the matrix represents the weight value of one neuron of that layer. The matrix W determines the spatial transformation from the input space to the output space described above; that is, the W of each layer of the neural network controls how the space is transformed. The purpose of training the neural network is ultimately to obtain the weight matrices of all layers of the trained neural network. Thus, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
It should be noted that the models employed by the machine learning tasks in the embodiments of the present application are in essence neural networks. Common components of a neural network include convolution layers, pooling layers, normalization layers, deconvolution layers and the like; a model is designed and obtained by assembling these common components, and the model converges when the model parameters (the weight matrices of all layers) are determined such that the model error meets a preset condition, or when the number of times the model parameters have been adjusted reaches a preset threshold.
The convolution layer is configured with a plurality of convolution kernels, and each convolution kernel is provided with a corresponding step length so as to carry out convolution operation on the image. The purpose of the convolution operation is to extract different features of the input image, and the first layer of convolution layer may only extract some low-level features such as edges, lines, angles, etc., and the deeper convolution layer may iteratively extract more complex features from the low-level features.
The deconvolution layer is used to map a low-dimensional space to a high-dimensional space while maintaining the connection pattern between them (the connection here refers to the connections established at the time of convolution). The deconvolution layer is configured with a plurality of convolution kernels, each of which is provided with a corresponding step size, to perform a deconvolution operation on the image. In general, an upsampling function is built into the framework libraries used to design neural networks (for example, the PyTorch library), and calling it realizes the mapping from the low-dimensional space to the high-dimensional space.
Pooling is a process by which the human visual system is simulated to reduce the size of the data or to represent the image with higher-level features. Common operations of the pooling layer include maximum pooling, mean pooling, random pooling, median pooling, combined pooling, and the like. Typically, the pooling layers are periodically inserted between convolutional layers of the neural network to achieve dimension reduction.
The normalization layer is used for performing normalization operation on all neurons in the middle to prevent gradient explosion and gradient disappearance.
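Purely as an illustration of how these common components are assembled, the following PyTorch snippet composes a convolution layer, a normalization layer, an activation function, a pooling layer and an upsampling operation; the layer sizes are arbitrary and not taken from this application.

```python
import torch
import torch.nn as nn

# Illustrative composition of the building blocks described above.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),  # convolution layer
    nn.BatchNorm2d(64),                                     # normalization layer
    nn.ReLU(),                                              # activation function
    nn.MaxPool2d(kernel_size=2),                            # pooling layer (dimension reduction)
    nn.Upsample(scale_factor=4, mode="bilinear"),           # low- to high-dimensional mapping
)

x = torch.randn(1, 3, 256, 192)
print(block(x).shape)  # torch.Size([1, 64, 256, 192])
```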
(2) Loss function
In the process of training a neural network, because the output of the neural network is expected to be as close as possible to the value that is actually desired, the weight matrices of the layers of the neural network can be updated according to the difference between the current network's predicted value and the actually desired target value (an initialization process is usually performed before the first update, that is, parameters are preconfigured for each layer of the neural network). For example, if the network's predicted value is too high, the weight matrices are adjusted so that the prediction becomes lower, and the adjustment continues until the neural network can predict the actually desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher its output value (the loss), the larger the difference, so training the neural network becomes a process of reducing this loss as much as possible.
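A minimal sketch of one such update step is given below; the L1 loss and the argument names are placeholders for illustration, since the specific loss function used by the fitting network is not fixed here.

```python
import torch
import torch.nn as nn

def train_step(fitting_network, optimizer, torso, keypoints, garment, model_image):
    """One illustrative update: the loss between prediction and target drives the weights."""
    predicted = fitting_network(torso, keypoints, garment)
    loss = nn.functional.l1_loss(predicted, model_image)  # placeholder loss function
    optimizer.zero_grad()
    loss.backward()   # back-propagate the difference
    optimizer.step()  # adjust the weight matrices to reduce the loss
    return loss.item()
```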
(3) Human body analysis
Human body analysis refers to dividing a person captured in an image into a plurality of semantically uniform regions, for example, a body part and clothing, or a subdivision class of a body part and a subdivision class of clothing, or the like. I.e. the input image is identified at the pixel level and each pixel point in the image is annotated with the object class to which it belongs. For example, individual elements in a picture including a human body (including hair, face, limbs, clothing, and background, etc.) are distinguished by a neural network.
Before describing the embodiments of the present application, a simple description of a virtual fitting method known to the inventor of the present application is provided, so that the embodiments of the present application will be convenient to understand later.
Typically, the fitting model is trained as a generative adversarial network (Generative Adversarial Network, GAN); after the fitting model is deployed in a terminal for use by a user, the user image and the image of the garment to be tried on are acquired, and the virtual fitting image is generated. Specifically, the fitting model first deforms the try-on clothing based on the human body key point image or the human body parsing map, and then fuses the deformed clothing with the torso part of the human body to obtain the fitting image, thereby realizing the fitting function. However, such a 2D fitting algorithm, which fuses a deformed two-dimensional clothing image with the body trunk image, suffers from a lack of three-dimensional appearance of the clothing in the simulated fitting effect and a low resolution of the fitting image.
In view of the above problems, embodiments of the present application provide a method for training a fitting model, a virtual fitting method and a related device. The fitting network constructed first includes a generation module, an encoding module and a fusion module, where the generation module includes an encoder and a decoder and is configured to encode and then decode an input image; the encoding module is configured to perform feature extraction on an input image, and the fusion module is configured to perform feature fusion on at least two input feature maps. The fitting network is trained with the model images and clothing images of a plurality of image groups to obtain the fitting model. In this embodiment, by designing a fitting network with the above structure, the encoding module performs feature extraction on the clothing image to obtain a first clothing code reflecting the clothing style features and a second clothing code reflecting the clothing texture features. The human body key point image reflecting the skeleton structure, the body trunk image retaining the human body characteristics, and the first clothing code are used as inputs of the generation module, which encodes and then decodes them; during decoding, the first clothing code is feature-fused with at least one generated feature map, so that the clothing style features can be fused with the human body features and positioned according to the human body key points, producing an initial fitting image with a three-dimensional appearance. The fusion module then fuses the second clothing code with the initial fitting image, so that the simulated try-on clothing in the fusion result has both a three-dimensional appearance and its original texture. The fusion result is adjusted to generate a high-resolution predicted fitting image. Compared with fusing a deformed two-dimensional clothing image with the body trunk image, the present application fuses the clothing style features and the clothing texture features separately and then adjusts the resolution, which effectively solves the problems that the clothing lacks a three-dimensional appearance in the try-on effect and that the resolution is not high. The fitting network is iteratively trained with the plurality of image groups, and based on the losses and back-propagation, the predicted fitting image corresponding to each image group is constrained to approach the real fitting effect in the model image, so that the three-dimensional appearance of the tried-on clothing better conforms to the contour and posture characteristics of the human body. Therefore, the fitting model obtained by training can generate a high-resolution virtual fitting image in which the tried-on clothing has a realistic, close-fitting three-dimensional appearance.
An exemplary application of the electronic device for training a fitting model or for virtual fitting provided in the embodiments of the present application is described below, and it may be understood that the electronic device may train the fitting model, or may use the fitting model to perform virtual fitting, and generate fitting images.
The electronic device provided by the embodiment of the application may be a server, for example, a server deployed in a cloud. When the server is used for training the fitting model, the fitting network is iteratively trained by adopting the training set according to training sets and fitting networks provided by other equipment or persons skilled in the art, and final model parameters are determined, so that the fitting network configures the final model parameters, and the fitting model can be obtained. The training set comprises a plurality of image groups, each image group comprises a clothes image and a model image, and the models in the model images wear clothes in the clothes images. When the server is used for virtual fitting, the built-in fitting model is called, corresponding calculation processing is carried out on the fitting person image and the fitting image to be fitted provided by other equipment or users, and fitting images which can accord with the human body outline and the body state are generated.
The electronic device provided by some embodiments of the present application may be a notebook computer, a desktop computer, or a mobile device, and other various types of terminals. When the terminal is used for training the fitting model, a person skilled in the art inputs the prepared training set into the terminal, designs the fitting network on the terminal, and adopts the training set to carry out iterative training on the fitting network to determine final model parameters, so that the fitting network configures the final model parameters, and the fitting model can be obtained. When the terminal is used for virtual fitting, the built-in fitting model is called, corresponding calculation processing is carried out on the fitting person image and the fitting image to be fitted which are input by the user, and a virtual fitting image with high resolution and real and close stereoscopic impression of the fitting can be generated.
As an example, referring to fig. 1, fig. 1 is a schematic view of an application scenario of a fitting system provided in an embodiment of the present application, where a terminal 10 is connected to a server 20 through a network, where the network may be a wide area network or a local area network, or a combination of the two.
The terminal 10 may be used to acquire training sets and build a fitting network, for example, by those skilled in the art downloading the ready training sets on the terminal and building a network structure for the fitting network. It will be appreciated that the terminal 10 may also be used to obtain images of the wearer and of the garment to be fitted, for example, the user entering the images of the wearer and of the garment to be fitted via an input interface, the terminal automatically obtaining the images of the wearer and of the garment to be fitted after the input is completed; for example, the terminal 10 is provided with a camera, and the user image is acquired by the camera, and the clothing image library is stored in the terminal 10, so that the user can select the clothing image to be tried on from the clothing image library.
In some embodiments, the terminal 10 locally performs the method for training the fitting model provided in the embodiments of the present application to complete training of the designed fitting network using the training set, and determines the final model parameters, so that the fitting network configures the final model parameters, and a fitting model can be obtained. In some embodiments, the terminal 10 may also send the training set and the built fitting network stored on the terminal by the person skilled in the art to the server 20 through the network, the server 20 receives the training set and the fitting network, trains the designed fitting network with the training set, determines final model parameters, and then sends the final model parameters to the terminal 10, and the terminal 10 stores the final model parameters so that the fitting network configuration can be the final model parameters, i.e. the fitting model can be obtained.
In some embodiments, the terminal 10 locally executes the virtual fitting method provided in the embodiments of the present application to provide a virtual fitting service for a user, invokes a built-in fitting model, and performs corresponding calculation processing on the fitting person image and the fitting image input by the user, so as to generate a virtual fitting image with high resolution and real and close stereoscopic impression of fitting the garment. In some embodiments, the terminal 10 may also send the image of the fitting person and the image of the fitting garment input by the user on the terminal to the server 20 through the network, the server 20 receives the image of the fitting person and the image of the fitting garment, invokes the built-in fitting model to perform corresponding calculation processing on the image of the fitting person and the image of the fitting garment, generates a virtual fitting image with high resolution and real fitting stereoscopic impression of the fitting garment, and then sends the fitting image to the terminal 10. After receiving the fitting image, the terminal 10 displays the fitting image on its own display interface, so that the user can watch the fitting effect.
Next, the structure of the electronic device in the embodiment of the present application is described, and fig. 2 is a schematic structural diagram of the electronic device 500 in the embodiment of the present application, where the electronic device 500 includes at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in electronic device 500 are coupled together by bus system 540. It is appreciated that the bus system 540 is used to enable connected communications between these components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to the data bus. The various buses are labeled as bus system 540 in fig. 2 for clarity of illustration.
The processor 510 may be an integrated circuit chip with signal processing capabilities such as a general purpose processor, such as a microprocessor or any conventional processor, or the like, a digital signal processor (DSP, digital Signal Processor), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
The user interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 550 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory. Memory 550 may optionally include one or more storage devices physically located remote from processor 510.
In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
network communication module 552 for accessing other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 include Bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), among others;
a display module 553 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
the input processing module 554 is configured to detect one or more user inputs or interactions from one of the one or more input devices 532 and translate the detected inputs or interactions.
As can be appreciated from the foregoing, the method for training a fitting model and the virtual fitting method provided in the embodiments of the present application may be implemented by various types of electronic devices having computing processing capabilities, such as an intelligent terminal, a server, and the like.
The method for training the fitting model provided in the embodiments of the present application is described below in connection with exemplary applications and implementations of the server provided in the embodiments of the present application. Referring to fig. 3, fig. 3 is a schematic flow chart of a method for training a fitting model according to an embodiment of the present application. Referring to fig. 4, fig. 4 is a schematic structural diagram of a fitting network. As shown in fig. 4, the fitting network, i.e., the network structure of the fitting model, includes a generation module, an encoding module and a fusion module. The generation module, the encoding module and the fusion module are all preset neural networks and are provided with neural network components (convolution layers, deconvolution layers, pooling layers and the like). The basic structure and principle of the neural network are described in detail in the term introduction "(1) Neural network" above and will not be repeated here. The fitting network may be constructed by those skilled in the art on a neural network design platform on a terminal (e.g., a computer) and then sent to the server.
In some embodiments, the intra-layer structures (convolution kernels, step sizes, etc.), inter-layer connection structures, layer combinations and the like of the convolution layers, deconvolution layers and pooling layers in the generation module can be configured to obtain a specific generation module. The generation module includes an encoder and a decoder. It will be appreciated by those skilled in the art that the encoder includes a plurality of down-sampling convolution layers, and in the encoder the size of the output feature map becomes smaller as the down-sampling convolution layers progress. The decoder includes a plurality of up-sampling convolution layers, and in the decoder the size of the output feature map becomes larger as the up-sampling convolution layers progress. Parameters such as the convolution kernel size and step size of each down-sampling convolution layer and each up-sampling convolution layer can be configured by those skilled in the art according to actual requirements.
Similarly, the structure of the coding module can be set, so that the coding module is used for extracting the characteristics of the clothes image; the structure of the fusion module can also be set, so that the fusion module is used for carrying out fusion processing on at least two feature images, and the fused feature images can comprise features in at least two feature images.
Referring again to fig. 3, the method S100 may specifically include the following steps:
S10: a plurality of image groups are acquired, wherein one image group comprises a clothes image and a model image, and a model in the model image wears clothes in the clothes image.
For any one image group, including a clothing image and a model image, a model in the model image is worn with clothing in the clothing image. The clothing image includes clothing that is intended to be tried on, for example, clothing image 1# includes a piece of green cotta. The model in the model image is worn with the clothing in the corresponding clothing image, for example, the model in the model image corresponding to clothing image 1# is worn with the green cotta.
Based on the requirement that the trained network be able to output high-resolution fitting images, the clothing image and the model image also have a correspondingly high resolution. In some embodiments, the resolution of the clothing image and the model image may be 1024 × 768.
It will be appreciated that several image sets may be gathered by those skilled in the art on a terminal (e.g., a computer) in advance, for example, on some clothing vending sites, the clothing image and corresponding model image with the clothing being worn. After several image groups are prepared, these data for training are uploaded to the server through the terminal.
In some embodiments, the number of the image groups is tens of thousands, for example 20000, which is beneficial for training to obtain an accurate universal model. The number of several image groups can be determined by a person skilled in the art according to the actual situation.
S20: and extracting a body trunk image and a body key point image from the model image.
It will be appreciated that when changing the clothing of the model in a model image, characteristics such as the identity of the model need to be retained. Before the clothing and the model are fused, the identity characteristics of the model are extracted, that is, the body trunk image is obtained. On the one hand, this avoids the interference that the features of the original clothing would cause to the fusion; on the other hand, the identity characteristics of the model are preserved, so that the model is not distorted after the clothing to be tried on is put on.
Specifically, the model image may first be subjected to human body parsing to obtain a parsing map. As can be seen from the description of term (3) above, human body parsing divides each part of the human body and assigns a category to each pixel. In some embodiments, the model image may be parsed using the Graphonomy algorithm to generate the corresponding human body parsing map; from the human body parsing map, the category to which each part of the image belongs can be determined. Then, the body trunk regions (such as the face, neck, hands, feet and the like) in the human body parsing map are retained, the pixel category corresponding to the body trunk regions is set to 1, and the regions of the other categories are set to 0, so as to obtain a binarized image. Finally, the pixels in the model image are multiplied by the pixels at the corresponding positions in the binarized image to obtain the body trunk image.
The multiplication of corresponding positions can be expressed by the formula F′_ij = F_ij × M_ij, where F_ij denotes the pixel value at row i, column j of the model image, M_ij denotes the value at row i, column j of the binarized image, and F′_ij is the pixel value at row i, column j of the body trunk image.
In this embodiment, the identity of the model can be accurately maintained in the body trunk diagram, and the original clothing features to be replaced are removed, so that the identity of the model is not distorted after the model is replaced with the clothing to be tried on.
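A minimal sketch of this masking step, assuming a parsing map in which the body trunk categories are known, is shown below; the class IDs and function name are illustrative only.

```python
import numpy as np

def extract_torso(model_image: np.ndarray, parsing_map: np.ndarray,
                  torso_classes=(1, 2, 3)) -> np.ndarray:
    """Keep only body-trunk pixels; class IDs are illustrative, not from the patent."""
    mask = np.isin(parsing_map, torso_classes).astype(model_image.dtype)  # M_ij in {0, 1}
    return model_image * mask[..., None]  # F'_ij = F_ij * M_ij, applied per channel
```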
It can be understood that by detecting the human body key points of the model image with a human body key point detection algorithm, the human body key point information (i.e., a number of key points on the human body) can be located. As shown in fig. 8, the human body key points may be 18 key points such as the nose, left eye, left ear, right ear, left shoulder, right elbow, left wrist, right wrist, left hip, right knee, left ankle, right ankle, and so on. In some embodiments, the human body key point detection algorithm may be the OpenPose algorithm.
It can be understood that each human body key point has its own serial number, and the position of each human body key point is represented by coordinates. The coordinates of these 18 human keypoints vary from model to model based on model body size. Therefore, human body key point detection is carried out on the model image, and a human body key point image is obtained. The human body key point image includes a plurality of human body key points, each of the plurality of human body key points includes a serial number representing a human body joint and corresponding coordinates.
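One possible way to turn the numbered key points and their coordinates into a key point image is sketched below; the 18-channel, one-pixel-per-point encoding is an assumption, since the patent does not fix the representation.

```python
import numpy as np

def keypoints_to_image(keypoints, height=1024, width=768):
    """Rasterize (x, y) key points into a multi-channel key-point image (one
    possible representation); the channel index plays the role of the serial number."""
    kp_image = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for idx, (x, y) in enumerate(keypoints):
        if 0 <= int(y) < height and 0 <= int(x) < width:
            kp_image[idx, int(y), int(x)] = 1.0
    return kp_image
```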
S30: the body trunk diagram and the human body key point image are input into an encoder in the generation module to be encoded, and the encoding result and the first clothes encoding are input into a decoder in the generation module to be decoded, so that an initial fitting image is generated.
At least one feature map generated in the decoding process is respectively subjected to feature fusion with a first clothes code, wherein the first clothes code is obtained by extracting features from a clothes image by an encoding module and reflects the style features of clothes.
Referring to fig. 5, the torso map, the body keypoint images, and the garment images are input into a fitting network, and a series of calculations are performed according to the configuration of the fitting network to generate a final predicted fitting image.
The clothing image is input into the encoding module for feature extraction, and a first clothing code reflecting the clothing style features is output. It will be appreciated that the encoding module includes a plurality of convolution layers, activation functions and a fully connected layer, which perform down-sampling feature extraction on the input clothing image to output a dimension-reduced first clothing code. In some embodiments, the first clothing code may be of size 1 × 1024, i.e., a one-dimensional vector of length 1024.
In some embodiments, the feature extraction network employed by the encoding module includes a plurality of convolution layers followed by a fully connected layer. The convolution layers use convolution kernels of size 3 × 3 with the step size set to 2 and the activation function set to the ReLU function, and perform convolution operations on the clothing image. Thus, through the multi-layer convolution operations and the fully connected layer, the clothing image is flattened into a first clothing code of size 1 × 1024. The first clothing code reflects the clothing style features; it can be understood that the clothing style features include the layout, shape and other structural features of the clothing. It will be appreciated that in some embodiments the encoding module may employ other forms of feature extraction network, such as a ResNet or a VGG network; the feature extraction network employed by the encoding module is not limited here, as long as it can convert the input image into a code.
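As an illustration only, a sketch of such an encoding module is given below; the number of convolution layers and their channel depths are assumptions, with only the 3 × 3 kernels, stride 2, ReLU activation and the 1 × 1024 output taken from the description above.

```python
import torch
import torch.nn as nn

class GarmentEncoder(nn.Module):
    """Encoding module sketch: 3x3, stride-2 convolutions + ReLU, then a fully
    connected layer flattening the clothing image into a 1 x 1024 code."""

    def __init__(self, in_channels: int = 3, code_dim: int = 1024):
        super().__init__()
        chans = [in_channels, 32, 64, 128, 256, 512, 512]  # depths are illustrative
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        self.fc = nn.LazyLinear(code_dim)  # fully connected layer to the 1 x 1024 code

    def forward(self, garment: torch.Tensor) -> torch.Tensor:
        feat = self.conv(garment)
        return self.fc(feat.flatten(start_dim=1))
```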
The generation module is an image generation network, and may be, for example, a generation countermeasure network (Generative Adversarial Network, GAN) or the like. The generating module comprises an encoder and a decoder, wherein the encoder comprises a plurality of downsampled convolution layers, and the size of the output characteristic map is smaller and smaller along with the progressive downsampling convolution layers. The decoder includes a plurality of upsampling convolutional layers, and in the decoder, the size of the output feature map becomes larger as the upsampling convolutional layers progress.
After the body trunk image and the human body key point image are concatenated along the channel dimension, they are input into the encoder in the generation module for encoding processing. In some embodiments, the down-sampling convolution layers in the encoder use convolution kernels of size 3 × 3 with the step size set to 2 and the activation function set to the ReLU function, and perform convolution operations on the concatenated image of the body trunk image and the human body key point image. Thus, after the multi-layer convolution operations on the concatenated image, a low-dimensional feature map V is obtained. In some embodiments, the dimension of the feature map V may be 512 × 16 × 12.
The low-dimensional feature map V is the encoding result. As shown in fig. 5, the feature map V and the first clothing code are input into the decoder for decoding processing. The decoder includes a plurality of up-sampling convolution layers, and in the decoder the resolution of the output feature map becomes larger as the up-sampling convolution layers progress. At least one feature map generated during the decoding process is feature-fused with the first clothing code; that is, after the feature map output by an up-sampling convolution layer in the decoder is feature-fused with the first clothing code, the result is input into the next up-sampling convolution layer for up-sampling feature extraction.
In some embodiments, the decoder comprises a plurality of cascaded up-sampling convolution layers and feature fusion layers arranged alternately.
Referring to fig. 6, fig. 6 illustrates an example of 3 up-sampling convolution layers and 3 feature fusion layers arranged alternately. The feature map V is input to the 1st up-sampling convolution layer for up-sampling feature extraction, which outputs a higher-resolution feature map 1#. The feature map 1# and the first clothes code are input to the 1st feature fusion layer for feature fusion, and the output feature map 2# is input to the 2nd up-sampling convolution layer for up-sampling feature extraction, which outputs a higher-resolution feature map 3#. The feature map 3# and the first clothes code are input to the 2nd feature fusion layer for feature fusion, and the output feature map 4# is input to the 3rd up-sampling convolution layer for up-sampling feature extraction, which outputs a higher-resolution feature map 5#. The feature map 5# and the first clothes code are input to the 3rd feature fusion layer for feature fusion, and the output feature map is used as the initial fitting image.
The feature fusion layer performs feature fusion on the feature map output by the preceding up-sampling convolution layer and the first clothes code in the following manner:
(1) Performing dimension reduction treatment on the first clothes code;
(2) And linearly fusing the dimension-reduced first clothes code with the feature map output by the preceding up-sampling convolution layer to obtain a fused feature map.
It will be appreciated that the first clothes code already has a low dimension; in some embodiments, the first clothes code is a one-dimensional vector. In this embodiment, the first clothes code is further dimension-reduced, e.g., transformed into a vector of length 2. In some embodiments, the first clothes code may be processed by a fully-connected layer and converted into a vector of length 2. It will be appreciated that, since the first clothes code reflects the clothes style features, the dimension-reduced vector of length 2 can also reflect the clothes style features.
In this embodiment, the dimension-reduced first clothes code is linearly fused with the feature map output by the preceding up-sampling convolution layer to obtain the fused feature map. Linear fusion means performing a weighted combination of the dimension-reduced first clothes code and the feature map using a first-degree (linear) function.
Since the dimension-reduced first clothes code, which reflects the clothes style features, is linearly fused with the feature map, which reflects the body torso features and the human body key points, the fused feature map carries both the body torso features and the clothes style features. The human body key points in the fusion process can effectively guide the positioning of the clothes style features, so that an initial fitting image with three-dimensional clothes is generated.
In some embodiments, the feature fusion layer performs feature fusion on the feature map output by the previous upsampling convolutional layer with the first clothing code using the following formula:
IN(V_i, S) = β_wi · V_i + β_bi

where V_i is the feature map output by the i-th up-sampling convolution layer, β_wi is the weight value obtained by transforming the first clothes code input to the i-th feature fusion layer, β_bi is the bias value obtained by transforming the first clothes code input to the i-th feature fusion layer, and IN(V_i, S) is the fused feature map output by the i-th feature fusion layer.
In this embodiment, a fully-connected layer is used to reduce the dimension of the first clothes code and transform it into a vector of length 2, where one element of the vector is used as the weight value β_wi and the other element as the bias value β_bi. Feature fusion with the feature map V_i is then performed using the above formula, so that the clothes style features and the body torso features are fused, and the human body key points in the fusion process can effectively guide the positioning of the clothes style features, so that an initial fitting image of clothes with a three-dimensional sense is generated.
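The following sketch shows one such feature fusion layer as reconstructed above: a fully-connected layer reduces the first clothes code to a length-2 vector whose two elements serve as the per-layer weight and bias of a first-degree (linear) fusion with the feature map V_i. Whether V_i should additionally be instance-normalized before the fusion (as the IN(.) notation might suggest) is not specified in the text and is left out of this sketch.

```python
# A sketch of the feature fusion layer: FC(first clothes code) -> [beta_w, beta_b],
# then a linear fusion beta_w * V_i + beta_b per the reconstructed formula above.
import torch
import torch.nn as nn

class ClothesStyleFusion(nn.Module):
    def __init__(self, code_dim=1024):
        super().__init__()
        self.fc = nn.Linear(code_dim, 2)          # length-2 vector: [weight, bias]

    def forward(self, v_i, clothes_code):
        wb = self.fc(clothes_code)                # (B, 2)
        beta_w = wb[:, 0].view(-1, 1, 1, 1)       # weight value
        beta_b = wb[:, 1].view(-1, 1, 1, 1)       # bias value
        return beta_w * v_i + beta_b              # linear (first-degree) fusion

fused = ClothesStyleFusion()(torch.randn(1, 128, 64, 48), torch.randn(1, 1024))
print(fused.shape)  # torch.Size([1, 128, 64, 48])
```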
S40: and inputting the initial fitting image and the second clothes code into a fusion module for feature fusion, and adjusting the resolution of the output fusion result to obtain a predicted fitting image.
The second clothes code is obtained by the coding module after feature extraction of the clothes image and reflects texture features of the clothes.
It will be appreciated that, referring again to fig. 5, the clothes image is input to the encoding module for feature extraction, and the encoding module also outputs a second clothes code reflecting the clothes texture features. In some embodiments, the second clothes code is 10 x 1024 in size, which effectively reflects the texture features of the clothes. It is understood that the texture features include features such as the texture or pattern of the fabric of the clothes. In some embodiments, the encoding module may employ the same feature extraction network to perform feature extraction on the clothes image, with convolution layers at different levels outputting the first clothes code and the second clothes code. In some embodiments, the encoding module may also include two feature extraction networks, one of which outputs the first clothes code and the other of which outputs the second clothes code. The form of the encoding module is not limited here, as long as it can convert the input image into two different codes.
The fusion module is also a preset neural network and comprises layers such as convolution layers, deconvolution layers or pooling layers. The fusion module is used for performing fusion processing on at least two feature maps, and the fused feature map can include features from the at least two feature maps.
It will be appreciated that the initial fitting image and the second clothes code are input to the fusion module for feature fusion, and the output fusion result includes the features of the initial fitting image and the clothes texture features of the second clothes code. Therefore, the clothes in the fusion result have a three-dimensional sense and vivid texture features, achieving the effect of realistically simulating the model trying on the clothes.
In order to obtain a high-resolution predicted fitting image, the resolution of the fusion result output by the fusion module is adjusted so that it is consistent with the resolution of the original model image. For example, the resolution of the fusion result is adjusted to 1024 x 768, thereby obtaining a predicted fitting image of size 1024 x 768.
In some embodiments, the fusion module performs feature fusion on the initial fitting image and the second garment code in the following manner:
(1) Respectively carrying out linear fusion on each row vector of the second clothes code and the initial fitting image;
(2) And adding and fusing the linear fusion results corresponding to the row vectors respectively.
The second clothes code is a piece of two-dimensional image data; in some embodiments, the second clothes code has a size of 10 x 1024. When the fusion module performs feature fusion on the initial fitting image and the second clothes code, each row vector of the second clothes code is linearly fused with the initial fitting image. A row vector of the second clothes code is the pixel values of one row of the second clothes code. Linear fusion means performing a weighted combination of a row vector of the second clothes code and the initial fitting image using a linear function.
Then, the linear fusion results corresponding to the respective row vectors are added and fused to obtain the fusion result output by the fusion module. It will be appreciated that each linear fusion result has the same resolution as the initial fitting image. Adding and fusing the linear fusion results means adding the pixel values at corresponding positions to obtain the fusion result output by the fusion module.
In this embodiment, with this fusion approach, the clothes in the fusion result output by the fusion module have a three-dimensional sense and vivid texture features, achieving the effect of realistically simulating the model trying on the clothes.
In some embodiments, the fusion module performs feature fusion on the initial fitting image and the second garment code using the following formula:
AIN(G1, W) = Σ_i ( FC(W[i]) · G1 + FC(W[i]) )

where W is the second clothes code, G1 is the initial fitting image, AIN(G1, W) is the fusion result output by the fusion module, FC denotes the fully-connected layer, and W[i] is the i-th row vector of the second clothes code.
In this embodiment, feature extraction is performed on the i-th row vector W[i] of the second clothes code using a fully-connected layer, the extracted vector is multiplied with the initial fitting image G1 for multiplicative fusion, and the result of the multiplicative fusion and the extracted vector are then added and fused to obtain the fusion result corresponding to W[i]. Finally, the fusion results corresponding to the individual W[i] are accumulated to obtain the fusion result output by the fusion module.
In this embodiment, the fusion module performs feature fusion on the initial fitting image and the second clothes code using the above formula, so that the clothes in the fusion result output by the fusion module have a three-dimensional sense and vivid texture features, achieving the effect of realistically simulating the model trying on the clothes.
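The sketch below follows the AIN operation described above: each of the 10 row vectors of the second clothes code is passed through a fully-connected layer, multiplied with the initial fitting image, the multiplication result and the transformed vector are added, and the per-row results are accumulated. Broadcasting the transformed vector over the channel dimension and sharing one fully-connected layer across the rows are assumptions; the text above does not fix these shapes.

```python
# A sketch of the texture fusion module: sum_i( FC(W[i]) * G1 + FC(W[i]) ),
# with FC(W[i]) broadcast over the spatial dimensions of G1 (an assumption).
import torch
import torch.nn as nn

class TextureFusion(nn.Module):
    def __init__(self, feat_ch, code_dim=1024, num_rows=10):
        super().__init__()
        self.fc = nn.Linear(code_dim, feat_ch)     # shared across the 10 rows
        self.num_rows = num_rows

    def forward(self, g1, clothes_code):           # g1: (B, C, H, W); code: (B, 10, 1024)
        out = torch.zeros_like(g1)
        for i in range(self.num_rows):
            w_i = self.fc(clothes_code[:, i])      # FC(W[i]): (B, C)
            w_i = w_i.unsqueeze(-1).unsqueeze(-1)  # broadcast over H and W
            out = out + (w_i * g1 + w_i)           # multiply-then-add fusion
        return out

g2 = TextureFusion(3)(torch.randn(1, 3, 256, 192), torch.randn(1, 10, 1024))
print(g2.shape)  # torch.Size([1, 3, 256, 192])
```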
In some embodiments, the fitting network further comprises at least one super-division module, the super-division module comprising at least one residual unit and at least one convolution layer, the output of each residual unit being an input of the at least one convolution layer. The residual unit is used for performing mapping transformation processing on the input image to extract features, and the at least one convolution layer is used for up-scaling the feature map output by each residual unit to generate a high-resolution image.
In some embodiments, the super-division module includes 2 residual units and 2 convolution layers. A residual unit is an important component in deep learning; it consists of a series of convolution layers, normalization layers and activation function layers, and further comprises a skip connection (Shortcut Connection). The skip connection connects the input directly to the output, so that the input and the output are fused as the final output and the information of the input is preserved in the final output. Therefore, when the residual unit performs mapping transformation processing on the input image to extract features, the input information can be retained, so that the final output is not distorted.
In some embodiments, as shown in fig. 7, the feature maps output by the 2 residual units are channel-concatenated and then input to a 1*1 convolution layer for convolution processing, so as to make full use of the hierarchical features. The output of the 1*1 convolution layer is input to a 3*3 convolution layer to reduce the channels of the feature map and extract effective information from the fused features.
It will be appreciated that a super-division module can expand the size of the input image by a fixed factor. In some embodiments, one super-division module can double the size of the input image.
In this embodiment, the aforementioned "resolution adjustment of the output fusion result to obtain the predicted fitting image" includes:
and determining the number R of the super-division modules according to the resolution of the fusion result and the resolution of the model image. And inputting the fusion result into the R superdivision modules in cascade, sequentially performing calculation processing, and outputting a predicted fitting image.
Since the factor by which one super-division module expands the size of the input image is fixed, the required expansion factor can be determined from the resolution of the fusion result and the resolution of the predicted fitting image to be output, and the number R of super-division modules is determined accordingly. After the number of super-division modules is determined, the fusion result is input to the R cascaded super-division modules, which perform calculation processing in sequence and output the predicted fitting image. It will be appreciated that in the trained fitting model, the number of super-division modules may be adjusted automatically based on the resolution the user sets for the fitting image, so as to provide fitting images of different resolutions.
In this embodiment, by determining the number of super-division modules, the at least one super-division module can be dynamically extended to provide fitting images of different resolutions.
It can be understood that the input of the residual unit in the 1st super-division module is the fusion result output by the fusion module, and the input of the residual unit in the 2nd super-division module is the output of the 1st super-division module. Specifically, taking a super-division module comprising 2 residual units and 2 cascaded convolution layers as an example, the fusion result output by the fusion module is input to each of the residual units in the 1st super-division module for feature extraction. The outputs of the 1st residual unit and the 2nd residual unit are channel-concatenated and then input to the 2 cascaded convolution layers for convolution processing, and the output of the last convolution layer serves as the input of the 2nd super-division module. The output of the 2nd super-division module is the final predicted fitting image.
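The following sketch shows one way the number R of super-division modules could be chosen and the modules cascaded, assuming each module doubles the spatial size as stated above. The super-division module here is only a stand-in that upscales by 2x; the residual-unit internals are sketched after the residual-unit formula below, and the helper names are assumptions.

```python
# A sketch of choosing R from the resolutions and cascading R 2x modules.
import math
import torch
import torch.nn as nn

class SuperDivisionStub(nn.Module):
    """Stand-in module: 2x bilinear upscale plus a 3x3 convolution."""
    def __init__(self, ch=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1))

    def forward(self, x):
        return self.body(x)

def num_super_modules(fusion_h, target_h, factor=2):
    """R such that fusion_h * factor**R == target_h (assumes exact powers)."""
    return int(round(math.log(target_h / fusion_h, factor)))

R = num_super_modules(fusion_h=256, target_h=1024)        # -> 2
pipeline = nn.Sequential(*[SuperDivisionStub() for _ in range(R)])
y = pipeline(torch.randn(1, 3, 256, 192))
print(R, y.shape)  # 2 torch.Size([1, 3, 1024, 768])
```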
In some embodiments, the residual unit performs a mapping transform process on the input image in the following manner:
carrying out channel reduction mapping on an input image, and then carrying out channel expansion mapping;
and carrying out fusion processing on the feature map after the channel expansion mapping and the input image to obtain a feature map output by a residual unit.
In this embodiment, the residual unit performs spatial mapping on the input image to extract features. In the feature extraction process, the channels of the feature map are first reduced, for example halved, and then expanded and restored, which effectively saves memory and parameters and keeps the network lightweight.
Then, the feature map after channel expansion mapping is fused with the input image through the skip connection of the residual unit, to obtain the feature map output by the residual unit.
In some embodiments, the feature mapping process of the residual unit may be performed using the following formula:
y_ru = λ_res · f_ex(f_rx(S)) + λ_x · S

where y_ru is the output of the residual unit, S is the input of the residual unit, f_rx denotes the channel reduction mapping, f_ex denotes the channel expansion mapping, and λ_res and λ_x are the adaptive weights of the two paths, the main path and the skip connection, respectively.
In some embodiments, f_rx and f_ex use 1*1 convolution layers to change the number of channels, thereby achieving channel reduction and channel expansion.
In this embodiment, the residual unit performs spatial mapping on the input image to extract features. In the feature extraction process, the channels of the feature map are first reduced and then expanded and restored, which, on the one hand, effectively saves memory and parameters and keeps the network lightweight; on the other hand, the skip connection performs a weighted fusion of the input and the output, which effectively retains the information of the input image, i.e., the information of the fusion result output by the fusion module (the three-dimensional sense and texture features of the clothes in the try-on effect), so that the image is not distorted when a convolution layer is subsequently used to enlarge the resolution.
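A sketch of the residual unit described above follows: a 1*1 convolution halves the number of channels (f_rx), a second 1*1 convolution restores it (f_ex), and the main path and the skip connection are combined with the adaptive weights λ_res and λ_x. Placing a ReLU between the two 1*1 convolutions and making the weights learnable scalars are assumptions.

```python
# A sketch of y_ru = lambda_res * f_ex(f_rx(S)) + lambda_x * S with 1x1 convs.
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.f_rx = nn.Conv2d(ch, ch // 2, kernel_size=1)    # channel reduction
        self.f_ex = nn.Conv2d(ch // 2, ch, kernel_size=1)    # channel expansion
        self.act = nn.ReLU(inplace=True)
        self.lambda_res = nn.Parameter(torch.ones(1))         # main-path weight
        self.lambda_x = nn.Parameter(torch.ones(1))           # skip-connection weight

    def forward(self, s):
        main = self.f_ex(self.act(self.f_rx(s)))
        return self.lambda_res * main + self.lambda_x * s     # y_ru

y = ResidualUnit(64)(torch.randn(1, 64, 128, 96))
print(y.shape)  # torch.Size([1, 64, 128, 96])
```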
S50: and calculating the loss between the predicted fitting image and the model image by adopting a loss function, and carrying out iterative training on the fitting network according to the loss sum corresponding to a plurality of image groups until convergence to obtain a fitting model.
Here, the loss function may be configured in a terminal by a person skilled in the art and sent to the server along with the fitting network. After the server obtains the predicted fitting image corresponding to each image group, it calculates the loss between each model image and the corresponding predicted fitting image using the loss function, and performs iterative training on the fitting network based on the losses until the fitting network converges, thereby obtaining the fitting model.
In some embodiments, the loss function includes an adversarial loss, a perceptual loss, and a pixel loss; for example, the loss function may be a weighted sum of the adversarial loss, the perceptual loss, and the pixel loss. The pixel loss reflects the differences between corresponding pixels in the predicted fitting image and the model image.
The adversarial loss constrains the distribution of the predicted fitting image to approach the distribution of the model image. Here, the distribution of a fitting image means the distribution of the parts in the image, for example, the distribution of the clothes, head, limbs, and the like.
The perceptual loss compares the feature maps obtained by convolving the model image with the feature maps obtained by convolving the predicted fitting image, so that the higher-level information (content and global structure) of the two becomes close.
The pixel loss compares each pixel of the predicted fitting image with the corresponding pixel in the model image, so that the predicted fitting image approaches the real model image at the pixel level.
In some embodiments, the loss function includes:
L = L_cGAN + λ_1 · L_percept + λ_2 · L_L1

L_cGAN = E[log D(T)] + E[log(1 − D(Y))]

where L_cGAN is the adversarial loss, L_percept is the perceptual loss, and L_L1 is the pixel loss; λ_1 and λ_2 are hyperparameters, both set to 10; G1 denotes the initially generated fitting image with a three-dimensional sense; G2 denotes the fusion result output by the fusion module, i.e., the low-resolution predicted fitting image; T denotes the model image; Y denotes the finally generated high-resolution predicted fitting image; D denotes the discriminator; V is the number of VGG network layers used; VGG_i and R_i are the activations and the number of elements in the i-th layer of the VGG network, respectively; and h, w, k denote the length, width, and height of the image, respectively.
In some embodiments, a convolutional neural network such as VGG may be used to down-sample the model image and extract V feature maps VGG_i(T). Similarly, a convolutional neural network such as VGG may be used to down-sample the predicted fitting image and extract V feature maps VGG_i(G2).
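A hedged sketch of computing the combined loss L = L_cGAN + 10·L_percept + 10·L_L1 is given below. The exact forms of the perceptual and pixel losses are not reproduced in the text above, so standard choices are assumed: an L1 distance between VGG feature maps of T and the low-resolution prediction G2 (with T resized to G2's resolution first), and a per-pixel L1 distance between T and the final prediction Y. The discriminator D and the vgg_layers callables are assumed to be provided by surrounding code.

```python
# An assumed-form loss sketch; D returns probabilities in (0, 1), vgg_layers is
# an iterable of callables returning intermediate VGG feature maps.
import torch
import torch.nn.functional as F

def fitting_loss(D, vgg_layers, T, G2, Y, lam1=10.0, lam2=10.0):
    eps = 1e-8
    # adversarial term: E[log D(T)] + E[log(1 - D(Y))]
    l_cgan = torch.log(D(T) + eps).mean() + torch.log(1.0 - D(Y) + eps).mean()

    # perceptual term over the V VGG feature maps (assumed L1 form)
    t_low = F.interpolate(T, size=G2.shape[-2:], mode="bilinear", align_corners=False)
    l_percept = sum(F.l1_loss(vgg_i(t_low), vgg_i(G2)) for vgg_i in vgg_layers)

    # pixel term between the model image and the high-resolution prediction
    l_l1 = F.l1_loss(Y, T)

    return l_cgan + lam1 * l_percept + lam2 * l_l1
```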
Therefore, by iteratively training the fitting network based on the differences calculated by the loss function comprising the adversarial loss, the perceptual loss and the pixel loss, the predicted fitting image can be constrained to continuously approach the model image (i.e., the real fitting image) in terms of distribution, content features and pixels, thereby improving the fitting effect of the trained fitting model.
It can be appreciated that the smaller the difference between a model image and the corresponding predicted fitting image, the more similar the two are, which means that the predicted fitting image accurately restores the real fitting effect. Therefore, the model parameters of the fitting network can be adjusted according to the difference between each model image and the corresponding predicted fitting image, so as to iteratively train the fitting network. The differences are back-propagated, so that the predicted fitting image output by the fitting network continuously approaches the model image until the fitting network converges, and the fitting model is obtained. It will be appreciated that in some embodiments the fitting network includes the generation module, the encoding module, the fusion module and the super-division module, and the model parameters include the model parameters of the generation module, the encoding module, the fusion module and/or the super-division module, thereby enabling end-to-end training.
In some embodiments, the Adam algorithm is used to optimize the model parameters. For example, the number of iterations is set to 100,000, the initial learning rate is set to 0.001, the weight decay is set to 0.0005, and every 1000 iterations the learning rate decays to 1/10 of its previous value. The learning rate and the loss are input into the Adam algorithm to obtain the adjusted model parameters output by the Adam algorithm, the adjusted model parameters are used for the next round of training until training is completed, and the model parameters of the converged fitting network are output, thereby obtaining the fitting model.
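The following sketch wires the stated hyperparameters into a training loop: Adam with an initial learning rate of 0.001, weight decay 0.0005, 100,000 iterations, and a step decay of the learning rate every 1000 iterations. The fitting_network, data_loader and compute_loss arguments, and the exact interpretation of the decay schedule, are assumptions.

```python
# A sketch of the optimization schedule described above (interpretation of the
# decay: multiply the learning rate by 0.1 every 1000 iterations).
import torch

def train(fitting_network, data_loader, compute_loss, iterations=100_000):
    opt = torch.optim.Adam(fitting_network.parameters(), lr=0.001, weight_decay=0.0005)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1000, gamma=0.1)
    it = 0
    while it < iterations:
        for batch in data_loader:
            loss = compute_loss(fitting_network, batch)   # loss over one batch of image groups
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()                                  # decay every 1000 iterations
            it += 1
            if it >= iterations:
                break
    return fitting_network
```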
It will be appreciated that after the server obtains the model parameters of the fitting network after convergence (i.e. the final model parameters), the final model parameters may be sent to the terminal, where the fitting network in the terminal is configured with the final model parameters, to obtain the fitting model. In some embodiments, the server may also store the fitting network and final model parameters to arrive at a fitting model.
In this embodiment, the fitting network constructed first includes a generation module, an encoding module and a fusion module. The generation module includes an encoder and a decoder and performs encoding-then-decoding processing on the input image; the encoding module performs feature extraction on the input image; and the fusion module performs feature fusion on at least two input feature maps. With a fitting network of this structure, the encoding module performs feature extraction on the clothes image to obtain a first clothes code reflecting the clothes style features and a second clothes code reflecting the clothes texture features. The human body key point image reflecting the skeleton structure, the body torso map retaining the human body features and the first clothes code serve as inputs of the generation module, which encodes and then decodes; during decoding, the first clothes code is feature-fused with at least one of the generated feature maps, so that the clothes style features are fused with the human body features and positioned according to the human body key points, generating an initial fitting image with three-dimensional clothes. The fusion module then fuses the second clothes code with the initial fitting image, so that the simulated fitted clothes in the fusion result have a three-dimensional sense and their original texture. The fusion result is then adjusted to generate a high-resolution predicted fitting image. Compared with fusing a deformed two-dimensional clothes image with the body torso map, the present application fuses the clothes style features and the clothes texture features separately and then adjusts the resolution, which effectively alleviates the lack of three-dimensional sense of the clothes in the try-on effect and the problem of low resolution. The fitting network is iteratively trained with the plurality of image groups, and based on the losses and back-propagation, the predicted fitting image corresponding to each image group is constrained to continuously approach the real fitting effect in the model image, so that the three-dimensional sense of the fitted clothes better conforms to the contour and posture characteristics of the human body. Therefore, the fitting model obtained by training can generate a high-resolution virtual fitting image whose fitted clothes have a realistic, true-to-life three-dimensional sense.
After the fitting model is obtained through training by the method for training the fitting model, which is provided by the embodiment of the application, the fitting model can be applied to virtual fitting to generate fitting images. The virtual fitting method provided by the embodiment of the application can be implemented by various types of electronic equipment with computing processing capability, such as an intelligent terminal, a server and the like.
The virtual fitting method provided by the embodiment of the application is described below in connection with exemplary applications and implementations of the terminal provided by the embodiment of the application. Referring to fig. 8, fig. 8 is a schematic flow chart of a virtual fitting method according to an embodiment of the present application. The method S200 comprises the steps of:
S201: an image of the try-on wearer and an image of the garment to be fitted are acquired.
A fitting assistant (application software) built into a terminal (e.g., a smartphone or a smart fitting mirror) acquires an image of the try-on wearer and an image of the garment to be fitted. The image of the try-on wearer may be photographed by the terminal or input by the user, and the image of the garment to be fitted may be selected by the user in the fitting assistant. It will be appreciated that the image of the try-on wearer contains the body of the try-on wearer, and the image of the garment to be fitted contains the garment.
S202: a body torso map of the try-on wearer is extracted from the try-on wearer image, and a human body key point image of the try-on wearer is extracted from the try-on wearer image.
To preserve the identity characteristics of the try-on wearer, the body torso map of the try-on wearer is extracted from the try-on wearer image. For a specific extraction embodiment, please refer to the extraction method in step S20 of the above embodiment of training the fitting model.
To retain the skeleton information of the try-on wearer and help position the garment to be fitted, the human body key point image of the try-on wearer is extracted from the try-on wearer image. For a specific extraction embodiment, please refer to the extraction method in step S20 of the above embodiment of training the fitting model.
S203: the body torso map of the try-on wearer, the human body key point image of the try-on wearer and the image of the garment to be fitted are input into a fitting model to generate a virtual fitting image.
The fitting model is obtained by training the fitting model by adopting the method for training the fitting model in any one embodiment.
The fitting assistant arranged in the terminal comprises a fitting model, and the fitting model is called to carry out virtual fitting to generate fitting images. It can be understood that the fitting model is obtained by training the fitting model in the above embodiment, and has the same structure and function as the fitting model in the above embodiment, and will not be described in detail herein.
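A minimal sketch of this inference flow (steps S201-S203) is shown below. The helper functions extract_torso_map and extract_keypoint_map stand for the extraction methods referred to in step S20 of the training embodiment, and fitting_model stands for the trained model; all of these names and the call signature are illustrative assumptions.

```python
# A sketch of virtual try-on inference under the assumptions stated above.
import torch

def virtual_try_on(fitting_model, wearer_image, clothes_image,
                   extract_torso_map, extract_keypoint_map):
    torso = extract_torso_map(wearer_image)          # keep the wearer's body features
    keypoints = extract_keypoint_map(wearer_image)   # keep skeleton info for positioning
    with torch.no_grad():
        fitting_image = fitting_model(torso, keypoints, clothes_image)
    return fitting_image
```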
The present embodiments also provide a computer readable storage medium storing computer executable instructions for causing an electronic device to perform a method of training a fitting model provided by the embodiments of the present application, for example, a method of training a fitting model as shown in fig. 3-7, or a virtual fitting method provided by the embodiments of the present application, for example, a virtual fitting method as shown in fig. 8.
In some embodiments, the storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM, or may be any device including one of the above memories or any combination thereof.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a hypertext markup language (HTML, HyperText Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device (including devices such as smart terminals and servers) or on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The present embodiments also provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a computer, cause the computer to perform a method of training a fitting model or a virtual fitting method as in the previous embodiments.
It should be noted that the above-described apparatus embodiments are merely illustrative, and the units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and where the program may include processes implementing the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; the technical features of the above embodiments or in the different embodiments may also be combined under the idea of the present application, the steps may be implemented in any order, and there are many other variations of the different aspects of the present application as described above, which are not provided in details for the sake of brevity; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A method for training a fitting model, which is characterized in that a fitting network comprises a generating module, an encoding module and a fusion module, wherein the generating module comprises an encoder and a decoder and is used for encoding and decoding an input image; the encoding module is used for extracting features of the input images, and the fusion module is used for carrying out feature fusion on at least two input feature images;
The method comprises the following steps:
acquiring a plurality of image groups, wherein the image groups comprise clothes images and model images, and models in the model images wear clothes in the clothes images;
inputting the body trunk diagram and the key point diagram of the human body of the model into an encoder in the generating module for encoding processing, and inputting the encoding result and the first clothes code into a decoder in the generating module for decoding processing to generate an initial fitting image; at least one feature map generated in the decoding process is respectively subjected to feature fusion with the first clothes codes, wherein the first clothes codes are obtained after the coding module extracts features of the clothes images and are used for reflecting style features of the clothes;
inputting the initial fitting image and the second clothes code into the fusion module for feature fusion, and adjusting the resolution of the output fusion result to obtain the predicted fitting image; the second clothes code is obtained after the coding module extracts the characteristics of the clothes image and is used for reflecting the texture characteristics of the clothes;
and calculating the loss between the predicted fitting image and the model image by adopting a loss function, and carrying out iterative training on the fitting network according to the loss sum corresponding to the image groups until convergence to obtain the fitting model.
2. The method of claim 1, wherein the decoder comprises a plurality of cascaded, spaced-apart up-sampling convolutional layers and a feature fusion layer, wherein the feature fusion layer feature fuses a feature map output by a previous up-sampling convolutional layer with the first clothing code in the following manner:
performing dimension reduction treatment on the first clothes code;
and linearly fusing the first clothes code subjected to dimension reduction treatment with the characteristic image output by the last up-sampling convolution layer to obtain a fused characteristic image.
3. The method of claim 2, wherein the feature fusion layer feature fuses the feature map output by the last upsampled convolutional layer with the first clothing code using the formula:
IN(V_i, S) = β_wi · V_i + β_bi

wherein V_i is the feature map output by the i-th up-sampling convolution layer, β_wi is the weight value obtained by transforming the first clothes code input to the i-th feature fusion layer, β_bi is the bias value obtained by transforming the first clothes code input to the i-th feature fusion layer, and IN(V_i, S) is the fused feature map output by the i-th feature fusion layer.
4. The method of claim 1, wherein the fusing module performs feature fusion of the initial fitting image and the second garment code in the following manner:
Respectively carrying out linear fusion on each row vector of the second clothes code and the initial fitting image;
and adding and fusing the linear fusion results corresponding to the row vectors respectively.
5. The method of claim 4, wherein the fusing module performs feature fusion of the initial fitting image and the second garment code using the formula:
AIN(G1, W) = Σ_i ( FC(W[i]) · G1 + FC(W[i]) )

wherein W is the second clothes code, G1 is the initial fitting image, AIN(G1, W) is the fusion result output by the fusion module, FC denotes the fully-connected layer, and W[i] is the i-th row vector of the second clothes code.
6. The method according to claim 1, wherein the fitting network further comprises at least one super-division module, the super-division module comprising at least one residual unit and at least one convolution layer, the residual unit being configured to perform a mapping transformation process on an input image to extract features, the at least one convolution layer being configured to perform an up-scaling on a feature map output by each residual unit to generate a high resolution image;
and performing resolution adjustment on the output fusion result to obtain the predicted fitting image, wherein the method comprises the following steps of:
Determining the number R of the superdivision modules according to the resolution of the fusion result and the resolution of the model image;
inputting the fusion result into the R super-division modules in cascade, sequentially performing calculation processing, and outputting the predicted fitting image.
7. The method of claim 6, wherein the residual unit performs a mapping transform process on the input image by:
carrying out channel reduction mapping on the input image, and then carrying out channel expansion mapping;
and carrying out fusion processing on the characteristic map after channel expansion mapping and the input image to obtain the characteristic map output by the residual error unit.
8. A virtual fitting method, comprising:
acquiring an image of a wearer and an image of a garment to be fitted;
extracting a body trunk image of a test wearer from the test wearer image, and extracting a human body key point image of the test wearer from the test wearer image;
inputting the body trunk diagram of the try-on wearer, the human body key point image of the try-on wearer and the image of the to-be-fit garment into a fitting model to generate a virtual fitting image; wherein the fitting model is trained by a method for training a fitting model according to any one of claims 1-7.
9. A computer device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer readable storage medium storing computer executable instructions for causing a computer device to perform the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310470647.XA CN116563425A (en) | 2023-04-24 | 2023-04-24 | Method for training fitting model, virtual fitting method and related device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310470647.XA CN116563425A (en) | 2023-04-24 | 2023-04-24 | Method for training fitting model, virtual fitting method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116563425A true CN116563425A (en) | 2023-08-08 |
Family
ID=87501062
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310470647.XA Pending CN116563425A (en) | 2023-04-24 | 2023-04-24 | Method for training fitting model, virtual fitting method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116563425A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |