CN114913388B - Method for training fitting model, method for generating fitting image and related device - Google Patents

Method for training fitting model, method for generating fitting image and related device

Info

Publication number
CN114913388B
CN114913388B (application CN202210433240.5A)
Authority
CN
China
Prior art keywords
fitting
image
clothes
fusion
images
Prior art date
Legal status
Active
Application number
CN202210433240.5A
Other languages
Chinese (zh)
Other versions
CN114913388A (en)
Inventor
陈仿雄
Current Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202210433240.5A
Publication of CN114913388A
Application granted
Publication of CN114913388B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0641Shopping interfaces
    • G06Q30/0643Graphical representation of items or shoppers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application relate to the technical field of image processing and disclose a method for training a fitting model, a method for generating fitting images, and a related device. The first encoder in the generation network encodes the N deformed clothes images to obtain N clothes codes, and the cascaded second encoder, N fusion modules and decoder in the generation network encode, fuse and decode the body trunk map and the fusion analysis map in sequence until the predicted fitting image is output. The analysis feature map of each level constrains the pixel categories of the fitting feature map of the same level. With continuous iterative training of the fitting network, the fusion analysis map is segmented according to the real mixed try-on effect, and the generated predicted fitting image comes ever closer to the real fitting image, so that an accurate fitting model is obtained.

Description

Method for training fitting model, method for generating fitting image and related device
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a method for training a fitting model, a method for generating fitting images and a related device.
Background
With the continuous progress of modern technology, the scale of online shopping keeps growing, and users can purchase clothes on online shopping platforms through their mobile phones. However, because the information a user obtains about the clothes for sale is generally a two-dimensional display picture, the user cannot know how the clothes will look when worn, may therefore purchase clothes that do not suit them, and ends up with a poor shopping experience.
With the continuous development of neural networks, they have been widely applied in the field of image generation. Researchers have therefore applied neural networks to virtual fitting and proposed various fitting algorithms; however, existing virtual fitting algorithms can only realize single-garment try-on and cannot realize mixed try-on of multiple garments.
Disclosure of Invention
The main technical problem to be solved by the embodiments of the present application is to provide a method for training a fitting model, a method for generating fitting images, and a related device.
In order to solve the above technical problems, in a first aspect, an embodiment of the present application provides a method for training a fitting model, in which a fitting network includes a fusion analysis network and a generation network, the generation network includes a first encoder and a cascaded second encoder, N fusion modules and a decoder, and the first encoder is connected with the N fusion modules;
The method comprises the following steps:
Acquiring a training set, wherein the training set comprises a plurality of training data, the training data comprises a real fitting image and N clothes images, the real fitting image comprises an image of a model wearing corresponding clothes in the N clothes images, and N is more than or equal to 2;
Human body analysis is carried out on the real fitting image to obtain a first analysis image, a second analysis image is separated from the first analysis image, and a body trunk image is extracted from the real fitting image according to the second analysis image, wherein the second analysis image reflects a pixel area of a body trunk of the model;
Inputting the N clothes images and the second analysis image into a fusion analysis network to obtain a fusion analysis image, wherein the fusion analysis image comprises pixel areas corresponding to clothes and pixel areas of the trunk of the model body in the N clothes images;
Inputting the body trunk map, the fusion analysis map and N deformed clothes images into the generation network to obtain a predicted fitting image, wherein the N deformed clothes images are obtained by deforming the clothes in the N clothes images according to the model's body structure in the real fitting image; the N deformed clothes images are input into the first encoder to obtain N clothes codes, the N clothes codes are input into the N fusion modules in one-to-one correspondence, and the body trunk map and the fusion analysis map are input into the generation network to obtain the predicted fitting image; in the fusion and decoding process, the analysis feature map of each level constrains the pixel categories of the fitting feature map of the same level;
And carrying out iterative training on the fitting network by adopting a loss function until the fitting network converges to obtain a fitting model, wherein the loss function is used for representing the difference between each real fitting image and each predicted fitting image in the training set.
In some embodiments, extracting the body torso map from the real fitting image from the second analytic map includes:
Performing binarization processing on the second analysis chart to obtain a binarized image, wherein pixels corresponding to a trunk area of the body in the binarized image are 1, and pixels corresponding to other areas are 0;
And multiplying the corresponding positions of the pixels in the real fitting image and the pixels in the binarized image to obtain the body trunk map.
In some embodiments, the fusion module includes a first convolution layer, a second convolution layer, and a fusion layer;
the fusion module fuses the input feature map and the input clothes codes in the following manner:
respectively carrying out feature extraction on the input feature map through a first convolution layer and a second convolution layer to obtain a first intermediate feature map and a second intermediate feature map;
the method comprises the steps that feature extraction is carried out on input clothes codes through a first convolution layer and a second convolution layer respectively, and a first intermediate code and a second intermediate code are obtained;
And carrying out fusion processing on the input feature map, the first intermediate feature map, the second intermediate feature map, the first intermediate code and the second intermediate code through a fusion layer.
In some embodiments, the fusing processing of the input feature map, the first intermediate feature map, the second intermediate feature map, and the first intermediate code, the second intermediate code by the fusion layer includes:
Taking the first intermediate feature map as a mean value and the second intermediate feature map as a variance, and carrying out normalization processing on the input feature map;
And carrying out fusion processing on the result obtained by the normalization processing and the first intermediate code and the second intermediate code.
In some embodiments, the fusion layer performs the fusion processing using a formula in which x is the input feature map, μ(x) is the first intermediate feature map, σ(x) is the second intermediate feature map, y is the input clothes code, μ(y) is the first intermediate code, σ(y) is the second intermediate code, and IN(x, y) is the feature map output by the fusion layer.
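The formula itself is not reproduced in this text. Given the definitions above and the normalize-then-modulate procedure described for the fusion layer (normalize the input feature map with μ(x) as mean and σ(x) as variance, then fuse with μ(y) and σ(y)), a reconstruction consistent with those steps, in the style of adaptive instance normalization, is:

$$\mathrm{IN}(x, y) = \sigma(y)\cdot\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$$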
In some embodiments, the loss function includes an adversarial loss, a perceptual loss and a clothing pixel loss between the real fitting image and the predicted fitting image, wherein the clothing pixel loss reflects the differences between the pixels of each of the N garments in the real fitting image and the corresponding pixels in the predicted fitting image.
In some embodiments, the loss function includes:
L = L_cGAN + λ1·L_percept + λ2·L_L1

where L_cGAN = E[log D(T)] + E[1 − log D(Y)]

where L_cGAN is the adversarial loss, L_percept is the perceptual loss, L_L1 is the clothing pixel loss, λ1 and λ2 are hyperparameters, T is the real fitting image, Y is the predicted fitting image, D is the discriminator, F_i(T) is the i-th feature map extracted from the real fitting image, F_i(Y) is the i-th feature map extracted from the predicted fitting image, R_i is the number of elements in F_i(T) or F_i(Y), V is the number of feature maps F_i(T) or F_i(Y), f_j(T) is the pixels of the j-th garment in the real fitting image, and f_j(Y) is the pixels of the j-th garment in the predicted fitting image.
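The expressions for the perceptual loss and the clothing pixel loss are not reproduced in this text; forms consistent with the symbol definitions above, and with common practice for perceptual and L1 losses, would be (as an assumption):

$$L_{percept} = \sum_{i=1}^{V} \frac{1}{R_i}\,\bigl\lVert F_i(T) - F_i(Y) \bigr\rVert_{1}, \qquad L_{L1} = \sum_{j=1}^{N} \bigl\lVert f_j(T) - f_j(Y) \bigr\rVert_{1}$$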
In order to solve the above technical problem, in a second aspect, an embodiment of the present application provides a method for generating a fitting image, including:
acquiring a user image and N images of clothes to be tried on;
Human body analysis is carried out on the user image to obtain a first analysis chart of the user, a second analysis chart of the user is separated from the first analysis chart of the user, and a body trunk chart of the user is extracted from the user image according to the second analysis chart of the user, wherein the second analysis chart of the user reflects a pixel area of the body trunk of the user;
Inputting the N deformed garment images to be fitted, the body trunk graph of the user and the N garment images to be fitted into a fitting model to generate fitting images, wherein the fitting model is trained by the method for training the fitting model according to any one of claims 1-7, and the N deformed garment images to be fitted are obtained by deforming clothes in the N garment images according to the human body structure of the user in the user image.
In order to solve the above technical problem, in a third aspect, an embodiment of the present application provides an electronic device, including:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as in the first aspect above.
To solve the above technical problem, in a fourth aspect, there is provided a computer readable storage medium storing computer executable instructions for causing a computer device to perform the method of the above first aspect.
The embodiments of the present application have the following beneficial effects. Unlike the prior art, the method for training a fitting model provided by the embodiments of the present application designs the structure of the fitting network so that it includes a fusion analysis network and a generation network. The fusion analysis network performs downsampling encoding and upsampling decoding on the N clothes images and on the second analysis map reflecting the pixel area of the model's body trunk to obtain the fusion analysis map, from which the pixel positions corresponding to the body trunk and to each of the N garments to be tried on can be clearly obtained. The first encoder in the generation network performs downsampling encoding on the N deformed clothes images to obtain N clothes codes, and the cascaded second encoder, N fusion modules and decoder in the generation network perform downsampling encoding, fusion and upsampling decoding on the body trunk map and the fusion analysis map in sequence until the predicted fitting image is output. In the fusion and decoding process, the analysis feature map of each level constrains the pixel categories of the fitting feature map of the same level, that is, it constrains the positions of the multiple garments relative to the body trunk. With continuous iterative training of the fitting network, the fusion analysis map is segmented according to the real mixed try-on effect, and the generated predicted fitting image comes ever closer to the real fitting image (the real label), so that an accurate fitting model is obtained. The fitting model can therefore generate fitting images showing the mixed effect of multiple garments.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
FIG. 1 is a schematic view of an application scenario of a try-on system according to some embodiments of the present application;
FIG. 2 is a schematic diagram of an electronic device according to some embodiments of the application;
FIG. 3 is a flow chart of a method of training a fitting model in some embodiments of the application;
FIG. 4 is a schematic structural diagram of a fitting network according to some embodiments of the present application;
FIG. 5 is a schematic diagram of human body analysis in some embodiments of the application;
FIG. 6 is a schematic diagram illustrating the operation of the fusion analysis network in some embodiments of the present application;
FIG. 7 is a schematic diagram illustrating the operation of the generation network in some embodiments of the application;
FIG. 8 is a schematic diagram illustrating the operation of a fusion module according to some embodiments of the application;
FIG. 9 is a flow chart of a method of generating fitting images in some embodiments of the application.
Detailed Description
The present application will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit the application in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present application.
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that, if not in conflict, the features of the embodiments of the present application may be combined with each other, which is within the protection scope of the present application. In addition, while functional block division is performed in a device diagram and logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. Moreover, the words "first," "second," "third," and the like as used herein do not limit the data and order of execution, but merely distinguish between identical or similar items that have substantially the same function and effect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.
In addition, the technical features of the embodiments of the present application described below may be combined with each other as long as they do not collide with each other.
In order to facilitate understanding of the method provided in the embodiments of the present application, first, terms related to the embodiments of the present application are described:
(1) Neural network
A neural network may be composed of neural units and can be understood as a network having an input layer, hidden layers and an output layer; in general, the first layer is the input layer, the last layer is the output layer, and all intermediate layers are hidden layers. A neural network with many hidden layers is called a deep neural network (DNN). At the physical level, the operation of each layer in the neural network can be described by the mathematical expression y = a(W·x + b) and can be understood as completing the transformation from the input space to the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". The operations 2 and 3 are completed by "W·x", operation 4 by "+b", and operation 5 by "a()". The word "space" is used here because the object to be classified is not a single thing but a class of things, and space refers to the collection of all individuals of that class. W is the weight matrix of a layer of the neural network, and each value in the matrix represents the weight of one neuron of that layer. The matrix W determines the spatial transformation from the input space to the output space described above, i.e., the W of each layer of the neural network controls how the space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained network. Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
It should be noted that, in the embodiments of the present application, the neural network is essentially the model employed for the machine learning task. Common components of a neural network include convolution layers, pooling layers, normalization layers, inverse convolution layers and the like; a model is designed and obtained by assembling these common components, and when the model parameters (the weight matrices of all layers) have been determined such that the model error meets a preset condition, or the number of parameter adjustments reaches a preset threshold, the model converges.
The convolution layer is configured with a plurality of convolution kernels, and each convolution kernel is provided with a corresponding step length so as to carry out convolution operation on the image. The purpose of the convolution operation is to extract different features of the input image, and the first layer of convolution layer may only extract some low-level features such as edges, lines, angles, etc., and the deeper convolution layer may iteratively extract more complex features from the low-level features.
The inverse convolution layer is used to map a low-dimensional space to a high-dimensional space while maintaining the connectivity pattern between them (here, the connectivity established at the time of convolution). The inverse convolution layer is configured with a plurality of convolution kernels, each with a corresponding stride, to perform the inverse convolution operation on the image. Typically, a framework library for designing neural networks (e.g., the PyTorch library) has a built-in upsample() function, and by calling this upsample() function a low-dimensional to high-dimensional spatial mapping can be achieved.
The pooling layer simulates the way the human visual system reduces the dimensionality of data or represents an image with higher-level features. Common pooling operations include max pooling, mean pooling, random pooling, median pooling, combined pooling and the like. Typically, pooling layers are periodically inserted between the convolution layers of a neural network to achieve dimensionality reduction.
The normalization layer is used for performing normalization operation on all neurons in the middle to prevent gradient explosion and gradient disappearance.
(2) Loss function
In the process of training a neural network, because the output of the neural network is expected to be as close as possible to the value actually desired, the weight matrix of each layer can be updated according to the difference between the predicted value of the current network and the actually desired target value (an initialization process is usually performed before the first update, i.e., parameters are preconfigured for each layer of the neural network). For example, if the predicted value of the network is too high, the weight matrices are adjusted so that the prediction becomes lower, and the adjustment continues until the neural network can predict the actually desired target value. It is therefore necessary to define in advance how to compare the difference between the predicted value and the target value; this is the purpose of the loss function (or objective function), an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher its output value (the loss), the larger the difference, so training the neural network becomes the process of reducing this loss as much as possible.
(3) Human body analysis
Human body analysis refers to dividing a person captured in an image into a plurality of semantically uniform regions, for example, a body part and clothing, or a subdivision class of a body part and a subdivision class of clothing, or the like. I.e. the input image is identified at the pixel level and each pixel point in the image is annotated with the object class to which it belongs. For example, individual elements in a picture including a human body (including hair, face, limbs, clothing, and background, etc.) are distinguished by a neural network.
Before describing the embodiments of the present application, a simple description of the virtual fitting method known to the inventor of the present application is provided, so that the understanding of the embodiments of the present application is facilitated.
Generally, a fitting model is trained with a generative adversarial network (GAN) and then deployed on a terminal for the user, so that a virtual fitting image can be generated after a user image and an image of the garment to be tried on are acquired. However, most fitting models can only generate a fitting image for a single garment; if the user wants the try-on effect of mixing and layering several garments, this cannot be achieved.
To address the above problems, the present application provides a method for training a fitting model, a method for generating fitting images, a computing device and a storage medium. By designing the network structure of the fitting model, the training method generates analysis feature maps, and the analysis feature map of each level constrains the pixel categories of the upsampled feature map of the same level, thereby locating the pixel areas of the body trunk and of each garment, so that the trained fitting model can generate fitting images with a mixed try-on effect of multiple garments.
An exemplary application of the electronic device for training a fitting model or for generating fitting images provided by embodiments of the present application is described below, it being understood that the electronic device may be used to train a fitting model or to generate fitting images using the fitting model.
The electronic device provided by the embodiments of the present application can be a server, for example a server deployed in the cloud. When the server is used to train the fitting model, it iteratively trains the fitting network with the training set, based on the training set and fitting network provided by other equipment or by a person skilled in the art, and determines the final model parameters; configuring the fitting network with these final model parameters yields the fitting model. When the server is used to generate fitting images, it invokes the built-in fitting model, performs the corresponding computation on the user image and the multiple to-be-fitted clothes images provided by other equipment or by the user, and fuses them to generate a fitting image showing the mixed effect of multiple garments.
The electronic device provided by some embodiments of the present application may be a notebook computer, a desktop computer, or a mobile device. When the terminal is used for training the fitting model, a person skilled in the art inputs the prepared training set into the terminal, designs the fitting network on the terminal, and adopts the training set to carry out iterative training on the fitting network to determine final model parameters, so that the fitting network configures the final model parameters, and the fitting model can be obtained. When the terminal is used for generating fitting images, a built-in fitting model is called, corresponding calculation processing is carried out on the user images input by the user and the plurality of fitting images to be fitted, and fitting images with mixed fitting effects of the plurality of clothes are generated in a fusion mode.
As an example, referring to fig. 1, fig. 1 is a schematic view of an application scenario of a fitting system provided by an embodiment of the present application, where a terminal 10 is connected to a server 20 through a network, where the network may be a wide area network or a local area network, or a combination of the two.
The terminal 10 may be used to acquire training sets and build a fitting network, for example, by those skilled in the art downloading the ready training sets on the terminal and building a network structure for the fitting network. It will be appreciated that the terminal 10 may also be used to obtain a user image and a plurality of garment images, for example, the user inputs the user image and the plurality of garment images via an input interface, and the terminal automatically obtains the user image and the plurality of garment images after the input is completed; for example, the terminal 10 is provided with a camera, and the user image is collected by the camera, and the clothing image library is stored in the terminal 10, so that the user can select a plurality of clothing images to be tried on from the clothing image library.
In some embodiments, the terminal 10 locally performs the method for training a fitting model provided by the embodiments of the present application to complete training of the designed fitting network using the training set, and determines final model parameters, so that the fitting network configures the final model parameters, and a fitting model can be obtained. In some embodiments, the terminal 10 may also send the training set and the built fitting network stored on the terminal by the person skilled in the art to the server 20 through the network, the server 20 receives the training set and the fitting network, trains the designed fitting network with the training set, determines final model parameters, and then sends the final model parameters to the terminal 10, and the terminal 10 stores the final model parameters so that the fitting network configuration can be the final model parameters, i.e. the fitting model can be obtained.
In some embodiments, the terminal 10 locally executes the method for generating fitting images provided by the embodiments of the present application to provide a virtual fitting service for a user, invokes a built-in fitting model, performs corresponding calculation processing on the user image and a plurality of to-be-fitted clothing images input by the user, and fuses to generate fitting images with multiple clothes mixed effects. In some embodiments, the terminal 10 may also send, to the server 20 through the network, the user image and the multiple fitting images input by the user on the terminal, the server 20 receives the user image and the multiple fitting images, invokes the built-in fitting model to perform corresponding computation processing on the user image and the multiple fitting images, fuses the user image and the multiple fitting images to generate fitting images with multiple clothes mixed effects, and then sends the fitting images to the terminal 10. After receiving the fitting image, the terminal 10 displays the fitting image on its own display interface, so that the user can view the mixing effect.
In the following, the structure of the electronic device according to the embodiment of the present application is described, and fig. 2 is a schematic diagram of the structure of the electronic device 500 according to the embodiment of the present application, where the electronic device 500 includes at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in electronic device 500 are coupled together by bus system 540. It is appreciated that the bus system 540 is used to enable connected communications between these components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to the data bus. The various buses are labeled as bus system 540 in fig. 2 for clarity of illustration.
The processor 510 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (for example a microprocessor or any conventional processor), a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The user interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 550 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 550 described in embodiments of the present application is intended to comprise any suitable type of memory. Memory 550 may optionally include one or more storage devices physically located remote from processor 510.
In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
Network communication module 552 for accessing other computing devices via one or more (wired or wireless) network interfaces 520; exemplary network interfaces 520 include Bluetooth, Wi-Fi, universal serial bus (USB, Universal Serial Bus), and the like;
A display module 553 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
The input processing module 554 is configured to detect one or more user inputs or interactions from one of the one or more input devices 532 and translate the detected inputs or interactions.
From the foregoing, it will be appreciated that the method for training a fitting model and the method for generating fitting images provided by the embodiments of the present application may be implemented by various types of electronic devices having computing processing capabilities, such as intelligent terminals and servers.
The method for training the fitting model provided by the embodiment of the present application is described below in connection with exemplary applications and implementations of the server provided by the embodiment of the present application. Referring to fig. 3, fig. 3 is a schematic flow chart of a method for training a fitting model according to an embodiment of the present application. Referring to fig. 4, a fitting network as a network structure of the fitting model includes a fusion analysis network and a generation network, wherein the generation network includes a first encoder and a cascade of a second encoder, N fusion modules, and a decoder, and the first encoder is connected with the N fusion modules. An exemplary illustration is given in fig. 4 with N being 2.
Here, the cascade refers to end-to-end connection in sequence, taking N equal to 2 as an example, based on the sequential end-to-end connection of the second encoder, the 2 fusion modules and the decoder, the feature map output by the second encoder will be input into the 1 st fusion module, the feature map output by the 1 st fusion module will be input into the 2 nd fusion module, and the feature map output by the 2 nd fusion module will be input into the decoder. Based on the connection of the first encoder and the 2 fusion modules, the feature codes output by the first encoder are input to the fusion modules to be fused with the feature graphs. It can be understood that the first encoder outputs 2 feature codes, and then the 2 feature codes are input in one-to-one correspondence with the 2 fusion modules and are fused with the corresponding feature maps.
As will be appreciated by those skilled in the art, the first encoder and the second encoder each include a plurality of downsampling convolution layers, and as the downsampling convolution layers advance in the first and second encoders, the size of the output feature map becomes smaller and smaller. The decoder includes a plurality of upsampling convolution layers, and as the upsampling convolution layers advance in the decoder, the size of the output feature map becomes larger and larger. The parameters of each downsampling convolution layer and each upsampling convolution layer, such as convolution kernel size and stride, can be configured by a person skilled in the art according to actual requirements.
It will be appreciated that the fitting network may be self-constructed by those skilled in the art on a neural network design platform on a terminal (e.g., a computer) and then sent to a server.
Referring again to fig. 3, the method S100 may specifically include the following steps:
S10: a training set is obtained.
The training set includes a plurality of training data, each of which includes a real fitting image and N clothes images. The real fitting image is an image of a model wearing the clothes in the corresponding N clothes images, where N ≥ 2, i.e., there are at least two clothes images and thus at least two garments to be tried on.

It will be appreciated that each item of training data is an image group consisting of a real fitting image and N clothes images. In some embodiments, the number of training data items is in the tens of thousands, for example 20,000, which is advantageous for training an accurate and general model. The number of training data items can be determined by a person skilled in the art according to the actual situation.

In the image group consisting of N clothes images and a real fitting image, each clothes image contains one garment to be tried on, and having N clothes images means that N garments are tried on simultaneously. The model in the real fitting image wears the clothes shown in the corresponding N clothes images. Taking N = 2 as an example, garment image 1# may contain a green short-sleeve top and garment image 2# a white coat; the model in the real fitting image corresponding to garment images 1# and 2# then wears the green short-sleeve top and the white coat.
It will be appreciated that training data comprising a real fit image and N garment images may be gathered by those skilled in the art on a terminal (e.g. a computer) in advance, for example at least two garment images and corresponding model images with the at least two garments (i.e. real fit images) may be crawled on some garment vending sites. After the training set is prepared, the training set is uploaded to a server through the terminal.
S20: human body analysis is carried out on the real fitting image, a first analysis chart is obtained, a second analysis chart is separated from the first analysis chart, and a body trunk chart is extracted from the real fitting image according to the second analysis chart, wherein the second analysis chart reflects a pixel area of a body trunk of the model.
After the server receives the training set, human body analysis is carried out on each real fitting image in the training set, a first analysis chart is obtained, and then a body trunk chart is extracted.
It will be appreciated that when replacing the clothes on the model in the real fitting image, characteristics such as the identity of the model need to be retained. Extracting the identity characteristics of the model before fusing the clothes and the model, i.e., obtaining the body trunk map, on the one hand avoids interference from the features of the original old clothes during fusion, and on the other hand preserves the identity characteristics of the model, so that the model is not distorted after the clothes to be tried on are put on.
Specifically, as can be seen from the foregoing "introduction of noun (3)", the analysis of the human body is to divide each part of the human body, as shown in fig. 5, and different parts, such as hair, face, coat, trousers, arms, hat, shoes, etc., are identified and divided, and are represented by different colors, so as to obtain the first analysis chart.
In some embodiments, the human body analysis algorithm may be the existing Graphonomy algorithm. The Graphonomy algorithm can divide the image into 20 categories and mark the different parts with different colors. In some embodiments, the 20 categories described above may also be denoted by the reference numerals 0-19, for example 0 for background, 1 for hat, 2 for hair, 3 for glove, 4 for sunglasses, 5 for coat, 6 for dress, 7 for coat, 8 for sock, 9 for trousers, 10 for torso skin, 11 for scarf, 12 for half skirt, 13 for face, 14 for left arm, 15 for right arm, 16 for left leg, 17 for right leg, 18 for left shoe, and 19 for right shoe. From the analysis map, the category to which each part of the image belongs can be determined.
In order to ensure that model identity information is unchanged during the generation of the predictive fit image, in this embodiment, a second analytical map is separated from the first analytical map, the second analytical map reflecting the pixel areas of the torso of the model body. For example, pixel regions of pixel categories 2 (hair), 10 (torso skin), 13 (face), 14 (left arm), 15 (right arm), 16 (left leg), 17 (right leg), 18 (left shoe), and 19 (right shoe) in the first analysis map are extracted as the second analysis map.
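As an illustrative sketch only (not the patent's own code), assuming the first analysis map is a single-channel label image with the 0-19 categories listed above, the second analysis map could be separated by keeping only the identity-related labels:

```python
import numpy as np

# Identity-related categories from the 0-19 list above:
# hair, torso skin, face, arms, legs, shoes.
IDENTITY_LABELS = [2, 10, 13, 14, 15, 16, 17, 18, 19]

def separate_second_analysis_map(first_map: np.ndarray) -> np.ndarray:
    """Keep only identity-related labels; all other pixels become background (0)."""
    keep = np.isin(first_map, IDENTITY_LABELS)
    return np.where(keep, first_map, 0)
```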
The second analytic graph represents the pixel area of the trunk of the model body, and interference areas such as the background are removed. Therefore, the body torso map can be extracted from the real fitting image based on the second analysis map. The identity information of the model is reserved based on the body trunk diagram, background interference is removed, and the body trunk diagram is input into a fitting network for training, so that the convergence speed and accuracy of the model are improved.
In some embodiments, the aforementioned "extracting a body torso map from a real fitting image according to the second analytic map" includes:
and carrying out binarization processing on the second analysis chart to obtain a binarized image, wherein pixels corresponding to the trunk area of the body in the binarized image are 1, and pixels corresponding to other areas are 0. And multiplying the corresponding positions of the pixels in the real fitting image and the pixels in the binarized image to obtain the body trunk map.
For example, for the second analysis map, the pixel categories of the areas that carry the identity features of the human body, namely 2 (hair), 10 (torso skin), 13 (face), 14 (left arm), 15 (right arm), 16 (left leg), 17 (right leg), 18 (left shoe) and 19 (right shoe), are set to 1, and the pixel categories of the other areas are set to 0, thereby realizing the binarization and obtaining a binarized image. In this way, a binarized image M whose pixel values are only 0 and 1 is obtained.
The binarized image M is the same size as the first analysis map and the real fitting image, and includes elements of 0 and 1. In the binarized image M, the pixel corresponding to the body region of the model is 1, and the pixels corresponding to the other regions are 0. Thus, the body trunk map F' =f×m can be obtained by multiplying the real fitting image F by the position corresponding to the binarized image M.
The corresponding position multiplication can be explained by using the formula F 'ij=Fij×Mij, wherein F ij represents the pixel value of the ith row and the jth column in the real fitting image, M ij represents the value of the ith row and the jth column in the binarized image, and F' ij is the pixel value of the ith row and the jth column in the body trunk map.
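A minimal NumPy sketch of these two steps, assuming the real fitting image F is an H×W×3 array and the second analysis map is the single-channel label image described above:

```python
import numpy as np

def extract_body_torso(real_fit_image: np.ndarray, second_map: np.ndarray) -> np.ndarray:
    """Binarize the second analysis map and apply F'_ij = F_ij * M_ij."""
    # Identity/body-trunk pixels (non-zero labels) become 1, all other pixels 0.
    m = (second_map > 0).astype(real_fit_image.dtype)
    # Broadcast the H x W mask over the RGB channels of the H x W x 3 image.
    return real_fit_image * m[..., None]
```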
In this embodiment, the second analysis chart reflecting the identity characteristic is separated from the first analysis chart, binarization processing is performed on the second analysis chart, and the obtained binarization image is multiplied by the corresponding position of the real fitting image, so that the body trunk chart can be obtained, the identity characteristic of the model can be accurately reserved in the body trunk chart, the original clothes characteristic needing to be replaced is removed, and the model is not distorted after being put on clothes needing to be tried on.
S30: and inputting the N clothes images and the second analysis chart into a fusion analysis network to obtain a fusion analysis chart.
And the server adopts a fusion analysis network to perform downsampling encoding and upsampling decoding on the N clothes images and the second analysis image to obtain the fusion analysis image.
The fusion analysis chart comprises pixel areas corresponding to clothes and pixel areas of the trunk of the model body in the N clothes images. For example, as for the trunk of the model body, the green short sleeve and the white coat to be tried on, as the network gradually converges, in the fusion analysis chart, region segmentation is performed according to the actual mixed-fit try-on effect, and the pixel position corresponding to the trunk of the model body, the pixel position corresponding to the green short sleeve and the pixel position corresponding to the white coat can be clearly obtained from the fusion analysis chart.
In some embodiments, referring to fig. 6, the fusion resolution network includes an encoder and a decoder, wherein the encoder includes 6 downsampled convolutional layers, and the last downsampled convolutional layer extracts a feature map having a size of 4×4×256. The decoder includes 6 up-sampling convolution layers, up-samples the feature map of size 4×4×256, and the last up-sampling convolution layer outputs a fused resolution map of size 512×512×3.
As will be appreciated by those skilled in the art, when the RGB three-channel garment image 1#, garment image 2# and the single-channel second analysis map are input into the fusion analysis network, they are combined into a seven-channel input image that is fed to the encoder in the fusion analysis network. The 6 downsampling convolution layers in the encoder perform layer-by-layer downsampling on the seven-channel input image to output a feature map of size 4 × 4 × 256. This 4 × 4 × 256 feature map is then input to the decoder in the fusion analysis network. The 6 upsampling convolution layers in the decoder upsample the feature map, and the decoder finally outputs the fusion analysis map of size 512 × 512 × 1.
In the single-channel fusion analysis chart, the pixel areas corresponding to the clothes and the pixel areas of the trunk of the model body in the N clothes images can be positioned. It can be understood that, with the training of the fitting network, the pixel areas of the N clothes images corresponding to the clothes and the pixel areas of the trunk of the model body approach the real fitting effect, and are divided according to the real fitting effect.
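The following PyTorch-style sketch illustrates the fusion analysis network described above; it is an assumption-laden illustration, not the patent's implementation. Channel widths, padding and the per-layer strides are assumptions (a stride-4 first layer is assumed so that a 512×512 input reaches the stated 4×4×256 bottleneck within 6 layers):

```python
import torch
import torch.nn as nn

class FusionParseNet(nn.Module):
    """Sketch of the fusion analysis network: an encoder of 6 downsampling conv
    layers and a decoder of 6 upsampling conv layers on a 7-channel input."""

    def __init__(self, in_ch: int = 7, out_ch: int = 1):
        super().__init__()
        enc_chs, enc_strides = [32, 64, 128, 256, 256, 256], [4, 2, 2, 2, 2, 2]
        layers, c = [], in_ch
        for c_out, s in zip(enc_chs, enc_strides):
            layers += [nn.Conv2d(c, c_out, 3, stride=s, padding=1), nn.ReLU(inplace=True)]
            c = c_out
        self.encoder = nn.Sequential(*layers)          # 512x512 -> 4x4x256 (assumed strides)

        dec_chs, dec_strides = [256, 256, 128, 64, 32, out_ch], [2, 2, 2, 2, 2, 4]
        layers = []
        for c_out, s in zip(dec_chs, dec_strides):
            layers += [nn.ConvTranspose2d(c, c_out, kernel_size=2 * s, stride=s, padding=s // 2)]
            if c_out != out_ch:
                layers += [nn.ReLU(inplace=True)]
            c = c_out
        self.decoder = nn.Sequential(*layers)          # 4x4 -> 512x512x1

    def forward(self, garments: torch.Tensor, second_map: torch.Tensor) -> torch.Tensor:
        # garments: the N RGB garment images concatenated on the channel axis
        # (e.g. 2 x 3 = 6 channels); second_map: single-channel second analysis map.
        x = torch.cat([garments, second_map], dim=1)   # (B, 7, 512, 512)
        return self.decoder(self.encoder(x))           # (B, 1, 512, 512) fusion analysis map
```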
S40: and inputting the body trunk graph, the fusion analysis graph and the N deformed clothes images into a generating network to obtain a predicted fitting image.
After the server obtains the body trunk map, the fusion analysis map and the N deformed clothes images, the generation network is used to fuse the body trunk map, the fusion analysis map and the N deformed clothes images, so as to obtain the predicted fitting image.
The N deformed clothes images are obtained by deforming the clothes in the N clothes images according to the model's body structure in the real fitting image. Here, considering that the clothes in a clothes image lie in a two-dimensional plane while the human body is three-dimensional, the clothes in the N clothes images are deformed according to the model's body structure in the real fitting image so that, when the clothes and the model are fused in the generation network, the clothes are compatible with the body structure; this yields the deformed clothes images. The clothes in a deformed clothes image are in a three-dimensional state adapted to the body structure, so that after fusion the clothes fit the body and the try-on effect is realistic and natural. In some embodiments, an existing key-point algorithm (e.g., Stacked Hourglass Network) may be used to locate the clothes key points and the human body key points; an affine transformation matrix is then computed from these key points, and the clothes pixels in the clothes image are transformed according to this matrix to obtain an accurate deformed clothes image. Specifically, the calculation can be performed using the following formula:
where (x_i, y_i) are the coordinates of a clothes pixel in the clothes image, (x'_i, y'_i) are the coordinates of the corresponding pixel in the deformed clothes image, and H is the affine transformation matrix calculated as described above.
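The formula image is not reproduced in this text; written in homogeneous coordinates, an affine warp consistent with the definitions above would be (as an assumption):

$$\begin{pmatrix} x'_i \\ y'_i \\ 1 \end{pmatrix} = H \begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix}$$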
In order to extract the features of the clothes to be tried on, the N deformed clothes images are input into the first encoder for downsampling encoding to obtain N clothes codes. In some embodiments, the first encoder includes 7 convolutional layers (Conv); the convolution kernel size, stride and output feature map size of each layer are shown in Table 1 below. The first encoder mainly uses 3*3 convolution kernels (kernel_size) with the stride (S) set to 2 to implement the downsampling operation. The garment image X input to the first encoder has a size of 512*512*3, and the last convolutional layer of the first encoder outputs a clothes code of size 4*4*512.
Table 1 Network structure of the first encoder

| Layer   | kernel_size | S | Output shape |
|---------|-------------|---|--------------|
| Image X | -           | - | 512*512*3    |
| Conv    | 1           | 2 | 256*256*16   |
| Conv    | 3           | 2 | 128*128*32   |
| Conv    | 3           | 2 | 64*64*64     |
| Conv    | 3           | 2 | 32*32*128    |
| Conv    | 3           | 2 | 16*16*256    |
| Conv    | 3           | 2 | 8*8*256      |
| Conv    | 3           | 2 | 4*4*512      |
It will be appreciated that the downsampling encoding of the N deformed clothing images in the first encoder is performed independently, and the obtained N clothing codes are in one-to-one correspondence with the N deformed clothing images, where the size of the clothing codes is 4×4×512.
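A PyTorch-style sketch of the first encoder following Table 1 is given below; the kernel sizes, strides and channel widths are taken from the table, while padding and the activation functions are assumptions:

```python
import torch
import torch.nn as nn

class FirstEncoder(nn.Module):
    """Sketch of the first encoder per Table 1: seven stride-2 conv layers taking
    a 512x512x3 deformed garment image to a 4x4x512 clothes code."""

    def __init__(self):
        super().__init__()
        cfg = [  # (kernel_size, out_channels), as listed in Table 1
            (1, 16), (3, 32), (3, 64), (3, 128), (3, 256), (3, 256), (3, 512),
        ]
        layers, c_in = [], 3
        for k, c_out in cfg:
            layers += [nn.Conv2d(c_in, c_out, kernel_size=k, stride=2, padding=k // 2),
                       nn.ReLU(inplace=True)]
            c_in = c_out
        self.net = nn.Sequential(*layers)

    def forward(self, deformed_garment: torch.Tensor) -> torch.Tensor:
        # (B, 3, 512, 512) -> (B, 512, 4, 4) clothes code
        return self.net(deformed_garment)
```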
Referring to fig. 7, N clothes codes are input to N fusion modules, respectively, i.e., one clothes code is input to one fusion module, to wait for fusion with the feature map output from the second encoder. The body trunk graph and the fusion analysis graph are input into a generating network, and the generating network comprises a cascade second encoder, N fusion modules and a decoder, so that the body trunk graph and the fusion analysis graph are subjected to downsampling encoding, fusion and upsampling decoding in the generating network in sequence to obtain a predicted fitting image. In the fusion decoding process, the analytic feature images of the same level restrict the pixel category of the fitting feature images of the same level.
Specifically, as shown in fig. 7, the body trunk map and the fusion analysis map are input to a second encoder to perform downsampling encoding respectively, so as to obtain a body trunk feature map with a size of 4×4×512 and an analysis feature map with a size of 4×4×512. Taking N as 2 as an example, inputting the body trunk feature map and the analysis feature map into a 1st fusion module, fusing the body trunk feature map and a 1st clothes code with the same size to generate a 1st fitting feature map (4 x 512 size), and restricting the pixel type of pixels in the 1st fitting feature map by the analysis feature map in the fusion process, namely positioning the 1st fitting clothes relative to the body trunk. Here, the body torso feature map, the analysis feature map, and the clothing code input to the 1st fusion module all have a size of 4×4×512, and belong to the same hierarchy.
The 1 st fitting feature map is input into a2 nd fusion module and fused with the 2 nd clothes codes with the same size to generate a2 nd fitting feature map (4 x 512 size), and in the fusion process, the analysis feature map constrains the pixel types of pixels in the 2 nd fitting feature map, namely the 2 nd fitting clothes are positioned relative to the body trunk. Here, the same hierarchy may be understood as having the same resolution and corresponding to the same fusion module or convolution layer.
The 2nd fitting feature map and the analysis feature map are input into the decoder; as the upsampling convolution layers advance layer by layer, upsampling decoding is performed layer by layer to generate fitting feature maps and analysis feature maps of gradually increasing size, and the last upsampling convolution layer of the decoder outputs the predicted fitting image (for example, 512*512*3). In the decoding process, the analysis feature map of each level constrains the pixel categories of the pixels in the fitting feature map of the same level, that is, the analysis feature map input to a given upsampling convolution layer constrains the pixel categories of the corresponding fitting feature map, thereby constraining the positions of the multiple garments relative to the body trunk and achieving a realistic mixed try-on effect.
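To make the data flow concrete, a simplified sketch of the generation network's forward pass as described above (N = 2); the function and tensor names, and the exact interfaces of the sub-networks, are assumptions:

```python
def generator_forward(body_torso, fusion_parse, deformed_garments,
                      first_encoder, second_encoder, fusion_modules, decoder):
    """Simplified forward pass of the generation network for N garments."""
    clothes_codes = [first_encoder(g) for g in deformed_garments]      # each 4x4x512
    torso_feat, parse_feat = second_encoder(body_torso, fusion_parse)  # both 4x4x512
    feat = torso_feat
    for code, fuse in zip(clothes_codes, fusion_modules):
        # The analysis features constrain the pixel categories of the fused
        # (fitting) features at the same level, locating each garment
        # relative to the body trunk.
        feat = fuse(feat, code, parse_feat)
    return decoder(feat, parse_feat)   # upsampled to the 512x512x3 predicted fitting image
```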
In some embodiments, the fusion module includes a first convolution layer, a second convolution layer, and a fusion layer. The convolution kernel parameters of the first convolution layer and the second convolution layer may be configured according to practical needs, and in some embodiments the convolution kernels of both layers may be 3*3, used to extract features by convolution operations. The first convolution layer and the second convolution layer are two different convolution layers whose convolution operations differ.
The fusion layer is used for carrying out fusion processing on two or more input feature graphs and outputting a fused feature graph.
In this embodiment, the fusion module fuses the input signature and the input garment code in the following manner:
And respectively carrying out feature extraction on the input feature map through the first convolution layer and the second convolution layer to obtain a first intermediate feature map and a second intermediate feature map. And respectively extracting the characteristics of the input clothes codes through the first convolution layer and the second convolution layer to obtain a first intermediate code and a second intermediate code. And carrying out fusion processing on the input feature map, the first intermediate feature map, the second intermediate feature map, the first intermediate code and the second intermediate code through a fusion layer.
Referring to fig. 8, after the input feature map is input into the fusion module, the first convolution layer performs convolution on the input feature map to extract features, obtaining a first intermediate feature map μ(x), and the second convolution layer performs convolution on the input feature map to extract features, obtaining a second intermediate feature map σ(x). It will be appreciated that the input feature map may be the feature map output by the second encoder or the feature map output by the previous fusion module. Similarly, when the clothes code is input into the fusion module, the first convolution layer performs convolution on the clothes code to extract features, obtaining a first intermediate code μ(y), and the second convolution layer performs convolution on the clothes code to extract features, obtaining a second intermediate code σ(y).
Then, the input feature map, the first intermediate feature map, the second intermediate feature map, the first intermediate code, and the second intermediate code are all input into the fusion layer for fusion processing.
In this embodiment, the input feature map and the clothes code are first subjected to feature extraction by two different convolution modes (the first convolution layer and the second convolution layer), obtaining the first intermediate feature map, the second intermediate feature map, the first intermediate code, and the second intermediate code. The intermediate feature maps retain the features of the input feature map from different perspectives, and the intermediate codes retain the features of the clothes code from different perspectives, so that the fused feature map obtained by fusing them can reflect the features of the input feature map and the clothes code more comprehensively, without losing features.
In some embodiments, the aforementioned "fusing the input feature map, the first intermediate feature map, the second intermediate feature map, and the first intermediate code, the second intermediate code by the fusion layer" includes: taking the first intermediate feature map as a mean value and the second intermediate feature map as a variance, and carrying out normalization processing on the input feature map; and carrying out fusion processing on the result obtained by the normalization processing and the first intermediate code and the second intermediate code.
Here, the input feature map is normalized so that the differences between pixel values in the normalized result are reduced, and the normalized result is then fused with the first intermediate code and the second intermediate code. This effectively reduces the influence of pixel-value differences on the fusion, so that the clothes features reflected by the first intermediate code and the second intermediate code are better fused and retained.
In some embodiments, the aforementioned fusion layer performs the fusion using the following formula:

IN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y)

wherein x is the input feature map, μ(x) is the first intermediate feature map, σ(x) is the second intermediate feature map, y is the input clothes code, μ(y) is the first intermediate code, σ(y) is the second intermediate code, and IN(x, y) is the feature map output by the fusion layer.
In this embodiment, the result of the normalization process is first multiplied by the second intermediate code σ(y) (multiplication fusion) and then added to the first intermediate code μ(y) (addition fusion); the combination of multiplication fusion and addition fusion better fuses and retains the clothes features of the clothes to be tried on.
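As an illustration only, the fusion rule described above (normalize with μ(x) and σ(x), multiply by σ(y), then add μ(y)) can be sketched as an AdaIN-style module in PyTorch; the class name, the sharing of the two convolution layers between the feature map and the clothes code, and the epsilon term are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Minimal sketch of the fusion module: two conv layers plus a fusion layer."""
    def __init__(self, channels=512):
        super().__init__()
        self.conv_mu = nn.Conv2d(channels, channels, 3, padding=1)     # first convolution layer
        self.conv_sigma = nn.Conv2d(channels, channels, 3, padding=1)  # second convolution layer
        self.eps = 1e-5  # assumed numerical-stability term

    def forward(self, x, y):
        # x: input feature map, y: clothes code, both (B, 512, 4, 4)
        mu_x, sigma_x = self.conv_mu(x), self.conv_sigma(x)  # first / second intermediate feature maps
        mu_y, sigma_y = self.conv_mu(y), self.conv_sigma(y)  # first / second intermediate codes
        # Normalize the input feature map with mu(x) as the mean and sigma(x) as the variance term.
        normalized = (x - mu_x) / (sigma_x + self.eps)
        # Multiplication fusion with sigma(y), then addition fusion with mu(y).
        return sigma_y * normalized + mu_y
```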
S50: and carrying out iterative training on the fitting network by adopting a loss function until the fitting network converges to obtain a fitting model, wherein the loss function is used for representing the difference between each real fitting image and each predicted fitting image in a training set.
Here, the loss function may be configured in the terminal by a person skilled in the art, and the configured loss function is sent to the server together with the fitting network. After the server obtains the predicted fitting image corresponding to each real fitting image in the training set, the server calculates the difference between each real fitting image and its predicted fitting image using the loss function, and iteratively trains the fitting network based on the differences until the fitting network converges, so as to obtain the fitting model.
It can be appreciated that the smaller the difference between each real fitting image and the corresponding predicted fitting image in the training set, the more similar the two images are, which means that the predicted fitting image accurately restores the real fitting image. Therefore, the model parameters of the fitting network can be adjusted according to the differences between the real fitting images and the predicted fitting images in the training set, so as to iteratively train the fitting network. The fitting network comprises the fusion analysis network and the generating network, and the model parameters comprise the model parameters of the fusion analysis network and the model parameters of the generating network. The differences are back-propagated, so that the predicted fitting images output by the fitting network continuously approach the real fitting images until the fitting network converges, thereby obtaining the fitting model.
It is to be understood that fitting network convergence here may mean that, under certain model parameters, the sum of the differences between the real fitting images and the predicted fitting images in the training set is less than a preset threshold or fluctuates within a certain range.
In some embodiments, the model parameters are optimized using the Adam algorithm. For example, the number of iterations is set to 100,000, the initial learning rate is set to 0.001, the weight decay is set to 0.0005, and the learning rate decays to 1/10 of its previous value every 1000 iterations. The learning rate and the differences between each real fitting image and the corresponding predicted fitting image in the training set are input into the Adam algorithm to obtain adjusted model parameters; the next round of training is performed using the adjusted model parameters until training is completed, and the model parameters of the converged fitting network are output to obtain the fitting model.
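A hypothetical PyTorch training-loop configuration with the hyper-parameters mentioned above might look as follows; fitting_net, loss_fn, and data_iter are placeholders, and interpreting the 0.0005 "weight attenuation" as Adam's weight_decay is an assumption.

```python
import torch

def train(fitting_net, loss_fn, data_iter, num_steps=100_000):
    """Sketch of the optimization loop with the hyper-parameters from the text."""
    optimizer = torch.optim.Adam(fitting_net.parameters(),
                                 lr=0.001, weight_decay=0.0005)
    # Learning rate decays to 1/10 of its previous value every 1000 iterations.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)

    for step in range(num_steps):
        batch = next(data_iter)                              # one training sample: real fitting image + N clothes images
        pred = fitting_net(batch)                            # predicted fitting image
        loss = loss_fn(batch["real_fitting_image"], pred)    # difference per the loss function
        optimizer.zero_grad()
        loss.backward()                                      # back-propagate the difference
        optimizer.step()
        scheduler.step()
```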
It will be appreciated that after the server obtains the model parameters of the fitting network after convergence (i.e. the final model parameters), the final model parameters may be sent to the terminal, where the fitting network in the terminal is configured with the final model parameters, to obtain the fitting model. In some embodiments, the server may also store the fitting network and final model parameters to arrive at a fitting model.
It should be noted that, in the embodiment of the present application, the training set includes a plurality of training data, for example 20000 training data, covering different models and clothes and covering most of the characteristics of clothes on the market. Therefore, the trained fitting model is a general-purpose model that can be widely used for virtual fitting and generates fitting images with a realistic mixed try-on effect.
In this embodiment, a loss function is used to calculate the difference between the real fit image and the predicted fit image corresponding to each real fit image in the training set. The loss function is described in detail in the foregoing "description of noun (2)", and the detailed description thereof is not repeated here. It will be appreciated that the structure of the loss function may be set according to the actual situation, based on the network structure and the training mode.
In some embodiments, the loss function includes an adversarial loss, a perceptual loss, and a clothes pixel loss between the real fitting image and the predicted fitting image, wherein the clothes pixel loss reflects the differences between the pixels of the clothes in the N clothes images in the real fitting image and the corresponding pixels in the predicted fitting image.
The adversarial loss measures whether the predicted fitting image can pass for the corresponding real fitting image: when the adversarial loss is large, the distribution difference between the predicted fitting image and the real fitting image is large; when the adversarial loss is small, the distribution difference between the predicted fitting image and the real fitting image is small and the two are similar. Here, the distribution of a fitting image refers to the distribution of the parts in the image, for example, the distribution of clothes, head, limbs, and the like.
The perceptual loss compares the feature maps obtained by convolving the real fitting image with the feature maps obtained by convolving the predicted fitting image, so that their high-level information (content and global structure) is close.
The clothes pixel loss reflects the differences between the pixels of the clothes in the N clothes images in the real fitting image and the corresponding pixels in the predicted fitting image. Comparing the clothes pixels in the predicted fitting image with the clothes pixels in the real fitting image makes the clothes in the predicted fitting image approach the clothes in the real fitting image; on the one hand, this brings the try-on clothes close to the real try-on effect, and on the other hand, it ensures the stability of the predicted fitting image during training and accelerates model convergence.
In some embodiments, the loss function includes:
L = LcGAN + λ1·Lpercept + λ2·LL1
wherein LcGAN = E[log D(T)] + E[1 − log D(Y)]
wherein LcGAN is the adversarial loss, Lpercept is the perceptual loss, LL1 is the clothes pixel loss, λ1 and λ2 are hyper-parameters, T is the real fitting image, Y is the predicted fitting image, D is a discriminator, Fi(T) is the i-th feature map extracted from the real fitting image, Fi(Y) is the i-th feature map extracted from the predicted fitting image, Ri is the number of elements in Fi(T) or Fi(Y), V is the number of feature maps Fi(T) or Fi(Y), fj(T) is the pixels of the j-th piece of clothes in the real fitting image, and fj(Y) is the pixels of the j-th piece of clothes in the predicted fitting image.
In some embodiments, a convolutional neural network such as VGG may be used to downsample the real fitting image and extract V feature maps Fi(T). Similarly, a convolutional neural network such as VGG may be used to downsample the predicted fitting image and extract V feature maps Fi(Y).
In some embodiments, the sizes of F i (T) and F i (Y) are 8 x 8 when i=1; the sizes of F i (T) and F i (Y) are 16 x 16 when i=2; the sizes of F i (T) and F i (Y) are 32 x 32 when i=3; the sizes of F i (T) and F i (Y) are 64 x 64 when i=4; the sizes of F i (T) and F i (Y) are 128 x 128 when i=5; the sizes of F i (T) and F i (Y) are 256×256 when i=6; the sizes of F i (T) and F i (Y) are 512 x 512 when i=7.
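Since the patent text defines the symbols but the explicit formulas for Lpercept and LL1 are not reproduced here, the following sketch assumes the standard element-averaged L1 forms implied by those definitions; D, feature_maps (e.g. a VGG backbone returning the V feature maps), and the per-garment masks are placeholders rather than identifiers from the patent.

```python
import torch
import torch.nn.functional as F

def fitting_loss(T, Y, D, feature_maps, clothes_masks, lambda1=1.0, lambda2=1.0):
    """Sketch of L = L_cGAN + lambda1 * L_percept + lambda2 * L_L1 (assumed forms)."""
    # Adversarial loss, following the written form L_cGAN = E[log D(T)] + E[1 - log D(Y)].
    l_cgan = torch.log(D(T)).mean() + (1 - torch.log(D(Y))).mean()

    # Perceptual loss: element-averaged L1 over the V feature maps of T and Y
    # (the mean reduction plays the role of the 1/R_i factor).
    feats_T, feats_Y = feature_maps(T), feature_maps(Y)
    l_percept = sum(F.l1_loss(ft, fy) for ft, fy in zip(feats_T, feats_Y))

    # Clothes pixel loss: compare the pixels of each of the N garments,
    # here selected by per-garment binary masks of shape (B, 1, H, W).
    l_l1 = sum(F.l1_loss(T * m, Y * m) for m in clothes_masks)

    return l_cgan + lambda1 * l_percept + lambda2 * l_l1
```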
Therefore, iteratively training the fitting network based on the difference calculated by a loss function comprising the adversarial loss, the perceptual loss, and the clothes pixel loss constrains the predicted fitting image to continuously approach the real fitting image in terms of distribution, content features, and clothes pixels, improving the fitting effect of the trained fitting model.
In summary, by designing the structure of the fitting network, the fitting network includes a fusion analysis network and a generating network. The fusion analysis network performs downsampling encoding and upsampling decoding on the N clothes images and the second analysis map reflecting the pixel area of the model's body trunk, so as to obtain a fusion analysis map, from which the pixel positions corresponding to the body trunk and the pixel positions corresponding to the N pieces of try-on clothes can be clearly obtained. The first encoder in the generating network performs downsampling encoding on the N deformed clothes images to obtain N clothes codes, and the cascaded second encoder, N fusion modules, and decoder in the generating network sequentially perform downsampling encoding, fusion, and upsampling decoding on the body trunk map and the fusion analysis map until the predicted fitting image is output. In the fusion decoding process, the analysis feature maps constrain the pixel categories of the fitting feature maps at the same level, i.e., constrain the positions of the plurality of clothes relative to the body trunk. As the fitting network is iteratively trained, the fusion analysis map becomes segmented according to the real mixed try-on effect, and the predicted fitting image generated by fusion continuously approaches the real fitting image (the real label), so that an accurate fitting model is obtained. Therefore, the fitting model can generate fitting images with the mixed try-on effect of a plurality of clothes.
After the fitting model is obtained by training with the method for training a fitting model provided by the embodiment of the application, the fitting model can be applied to virtual fitting to generate fitting images. The method for generating a fitting image provided by the embodiment of the application can be implemented by various types of electronic equipment with computing capability, such as an intelligent terminal or a server.
The method for generating fitting images provided by the embodiment of the application is described below in connection with exemplary applications and implementations of the terminal provided by the embodiment of the application. Referring to fig. 9, fig. 9 is a flowchart of a method for generating a fitting image according to an embodiment of the present application. The method S200 comprises the steps of:
s201: and acquiring a user image and N images of the clothes to be tried on.
A fitting assistant (application software) built into a terminal (e.g., a smartphone or a smart fitting mirror) acquires the user image and the N images of clothes to be tried on. The user image may be captured by the terminal or input by the user. The N images of clothes to be tried on may be selected by the user within the fitting assistant.
It will be appreciated that the user image includes the body of the user and the garment image includes the garment.
S202: human body analysis is carried out on the user image to obtain a first analysis chart of the user, a second analysis chart of the user is separated from the first analysis chart of the user, and a body trunk chart of the user is extracted from the user image according to the second analysis chart of the user, wherein the second analysis chart of the user reflects a pixel area of the body trunk of the user.
It will be appreciated that when changing clothes for the user in the user image, features such as the identity of the user need to be preserved.
Before the clothes to be tried on and the user image are input into the fitting model for fusion, the identity features of the user are extracted, i.e., the body trunk map of the user is obtained. On the one hand, this avoids the interference that the features of the user's original clothes would cause to the fusion; on the other hand, it preserves the identity features of the user, so that the user is not distorted after changing into the clothes to be tried on.
The analysis of the human body, the separation of the second analysis chart, and the extraction of the body trunk chart are described in detail in step S20, and the description thereof will not be repeated.
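For illustration, the torso-map extraction (binarize the second analysis map, then multiply element-wise with the image, as described in claim 2) could be sketched as follows; the NumPy implementation and the torso_labels parameter are assumptions, not identifiers from the patent.

```python
import numpy as np

def extract_torso_map(user_image, second_parse_map, torso_labels):
    """Sketch: keep only body-torso pixels of the user image."""
    # Binarize: 1 for pixels whose parse label belongs to the body torso, 0 elsewhere.
    mask = np.isin(second_parse_map, torso_labels).astype(user_image.dtype)
    # Element-wise multiplication keeps torso pixels and zeroes out everything else.
    return user_image * mask[..., None]   # (H, W, 3) body trunk map
```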
S203: inputting the N deformed fitting clothes images, the body trunk graph of the user and the N fitting clothes images into a fitting model to generate fitting images, wherein the fitting model is trained by adopting the method for training the fitting model in the training embodiment.
The fitting assistant arranged in the terminal comprises a fitting model, and the fitting model is called to generate fitting images.
The N deformed clothes images to be fitted are obtained by deforming clothes in the N clothes images to be fitted according to the human body structure of the user in the user image. In order to enable the clothes to be suitable for the human body structure when the clothes are fused with the user in the fitting model, the clothes in the N to-be-fitted clothes images are deformed according to the human body structure of the user, and the deformed to-be-fitted clothes images are obtained. The clothes in the deformed clothes image to be fitted are in a three-dimensional state and are suitable for the human body structure of the user, so that the clothes and the user are combined and attached after being fused, and the fitting effect is real and natural. For a specific modification, refer to the modification in step S40.
And then, inputting the N deformed fitting clothes images, the body trunk graph of the user and the N fitting clothes images into a fitting model to generate fitting images. It can be understood that the fitting model is obtained by training the fitting model in the above embodiment, and has the same structure and function as the fitting model in the above embodiment, and will not be described in detail herein.
In short, the N deformed clothes images to be tried on and the body trunk map of the user are input into the fitting model, and the fitting model first generates a fusion analysis map of the user. The fusion analysis map is segmented according to the effect of the user actually wearing the N pieces of clothes to be tried on together. Then, the fitting model generates the fitting image; in the process of generating the fitting image, the fusion analysis map constrains the pixel categories of the fitting image, i.e., the positions of the plurality of clothes relative to the body trunk, so that a fitting image with the mixed try-on effect of the plurality of clothes can be generated.
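Putting steps S201–S203 together, an end-to-end usage sketch might look like the following; every function name here (human_parse, separate_torso_parse, deform_clothes, and the reuse of the extract_torso_map sketch above) is a hypothetical placeholder for the corresponding step, not an identifier from the patent.

```python
def try_on(user_image, clothes_images, fitting_model,
           human_parse, separate_torso_parse, deform_clothes, torso_labels):
    """Sketch of the fitting assistant's inference path (S201–S203)."""
    # S202: parse the user image and extract the body trunk map.
    first_parse = human_parse(user_image)
    second_parse = separate_torso_parse(first_parse)
    torso_map = extract_torso_map(user_image, second_parse, torso_labels)

    # Deform each garment to the user's body structure.
    deformed = [deform_clothes(img, user_image) for img in clothes_images]

    # S203: the fitting model fuses everything into the fitting image.
    return fitting_model(deformed, torso_map, clothes_images)
```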
The embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium stores computer executable instructions for causing an electronic device to perform a method for training a fitting model provided by the embodiment of the present application, for example, a method for training a fitting model as shown in fig. 3 to 8, or a method for generating a fitting image provided by the embodiment of the present application, for example, a method for generating a fitting image as shown in fig. 9.
In some embodiments, the storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; it may also be a device including one of, or any combination of, the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device (including devices such as smart terminals and servers) or on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The present application also provides a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a computer, cause the computer to perform a method of training a fitting model or a method of generating a fitting image as in the previous embodiments.
It should be noted that the above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and where the program may include processes implementing the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random-access Memory (Random Access Memory, RAM), or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the application, the steps may be implemented in any order, and there are many other variations of the different aspects of the application as described above, which are not provided in detail for the sake of brevity; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (10)

1. A method for training a fitting model, characterized in that a fitting network comprises a fusion analysis network and a generating network, the generating network comprises a first encoder, a cascaded second encoder, N fusion modules, and a decoder, and the first encoder is connected with the N fusion modules;
the method comprises the following steps:
Acquiring a training set, wherein the training set comprises a plurality of training data, the training data comprises a real fitting image and N clothes images, the real fitting image comprises an image of a model wearing corresponding clothes in the N clothes images, and N is more than or equal to 2;
Performing human body analysis on the real fitting image to obtain a first analysis chart, separating a second analysis chart from the first analysis chart, and extracting a body trunk chart from the real fitting image according to the second analysis chart, wherein the second analysis chart reflects a pixel area of the body trunk of the model;
Inputting the N clothes images and the second analysis chart into a fusion analysis network to obtain a fusion analysis chart, wherein the fusion analysis chart comprises pixel areas corresponding to clothes in the N clothes images and pixel areas of the trunk of the model body;
Inputting the body trunk graph, the fusion analysis graph and N deformed clothes images into a generating network to obtain a predicted fitting image, wherein the N deformed clothes images are obtained by deforming clothes in the N clothes images according to the model human body structure in the real fitting image, the N deformed clothes images are input into the first encoder to obtain N clothes codes, the N clothes codes are respectively and correspondingly input into the N fusion modules, the body trunk graph and the fusion analysis graph are input into the generating network to obtain the predicted fitting image, and in the fusion decoding process, the analysis feature graphs of the same level restrict the pixel types of the fitting feature graphs of the same level;
And carrying out iterative training on the fitting network by adopting a loss function until the fitting network converges to obtain the fitting model, wherein the loss function is used for representing the difference between each real fitting image and each predicted fitting image in the training set.
2. The method of claim 1, wherein the extracting a torso map from the real fitting image according to the second analysis map comprises:
performing binarization processing on the second analysis chart to obtain a binarized image, wherein pixels corresponding to a trunk area of the body in the binarized image are 1, and pixels corresponding to other areas are 0;
and multiplying the corresponding positions of the pixels in the real fitting image and the pixels in the binarized image to obtain the body trunk map.
3. The method of claim 1, wherein the fusion module comprises a first convolution layer, a second convolution layer, and a fusion layer;
The fusion module fuses the input feature map and the input clothes code in the following manner:
The input feature images are respectively subjected to feature extraction through the first convolution layer and the second convolution layer, so that a first intermediate feature image and a second intermediate feature image are obtained;
The input clothing codes are respectively subjected to feature extraction through the first convolution layer and the second convolution layer to obtain a first intermediate code and a second intermediate code;
And carrying out fusion processing on the input feature map, the first intermediate feature map, the second intermediate feature map, the first intermediate code and the second intermediate code through the fusion layer.
4. A method according to claim 3, wherein said fusing, by said fusion layer, of said input feature map, said first intermediate feature map, said second intermediate feature map, and said first intermediate code, said second intermediate code, comprises:
Normalizing the input feature map by taking the first intermediate feature map as a mean value and the second intermediate feature map as a variance;
and fusing the result obtained by the normalization processing with the first intermediate code and the second intermediate code.
5. The method of claim 4, wherein the fusion layer performs the fusion process using the formula:
IN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y)
wherein x is the input feature map, μ(x) is the first intermediate feature map, σ(x) is the second intermediate feature map, y is the input clothes code, μ(y) is the first intermediate code, σ(y) is the second intermediate code, and IN(x, y) is the feature map output by the fusion layer.
6. The method of claim 1, wherein the loss function comprises an adversarial loss, a perceptual loss, and a clothes pixel loss between the real fitting image and the predicted fitting image, wherein the clothes pixel loss reflects the differences between the pixels of the clothes in the N clothes images in the real fitting image and the corresponding pixels in the predicted fitting image.
7. The method of claim 6, wherein the loss function comprises:
L = LcGAN + λ1·Lpercept + λ2·LL1
wherein LcGAN = E[log D(T)] + E[1 − log D(Y)]
wherein LcGAN is the adversarial loss, Lpercept is the perceptual loss, LL1 is the clothes pixel loss, λ1 and λ2 are hyper-parameters, T is the real fitting image, Y is the predicted fitting image, D is a discriminator, Fi(T) is the i-th feature map extracted from the real fitting image, Fi(Y) is the i-th feature map extracted from the predicted fitting image, Ri is the number of elements in Fi(T) or Fi(Y), V is the number of feature maps Fi(T) or Fi(Y), fj(T) is the pixels of the j-th piece of clothes in the real fitting image, and fj(Y) is the pixels of the j-th piece of clothes in the predicted fitting image.
8. A method of generating a fitting image, comprising:
acquiring a user image and N images of clothes to be tried on;
performing human body analysis on the user image to obtain a first analysis chart of the user, separating a second analysis chart of the user from the first analysis chart of the user, and extracting a body trunk chart of the user from the user image according to the second analysis chart of the user, wherein the second analysis chart of the user reflects a pixel area of the body trunk of the user;
Inputting the N deformed clothing images to be fitted, the body trunk graph of the user and the N clothing images to be fitted into a fitting model to generate the fitting images, wherein the fitting model is trained by a method for training the fitting model according to any one of claims 1-7, and the N deformed clothing images to be fitted are obtained by deforming clothes in the N clothing images to be fitted according to the human body structure of the user in the user image.
9. An electronic device, comprising:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer readable storage medium storing computer executable instructions for causing a computer device to perform the method of any one of claims 1-8.
CN202210433240.5A 2022-04-24 2022-04-24 Method for training fitting model, method for generating fitting image and related device Active CN114913388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210433240.5A CN114913388B (en) 2022-04-24 2022-04-24 Method for training fitting model, method for generating fitting image and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210433240.5A CN114913388B (en) 2022-04-24 2022-04-24 Method for training fitting model, method for generating fitting image and related device

Publications (2)

Publication Number Publication Date
CN114913388A CN114913388A (en) 2022-08-16
CN114913388B true CN114913388B (en) 2024-05-31

Family

ID=82765144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210433240.5A Active CN114913388B (en) 2022-04-24 2022-04-24 Method for training fitting model, method for generating fitting image and related device

Country Status (1)

Country Link
CN (1) CN114913388B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106604A (en) * 2013-01-23 2013-05-15 东华大学 Three dimensional (3D) virtual fitting method based on somatosensory technology
GB201613955D0 (en) * 2015-08-14 2016-09-28 Metail Ltd Method and system for generating an image file of a 3d garment model on a 3d body model
WO2021008166A1 (en) * 2019-07-17 2021-01-21 北京京东尚科信息技术有限公司 Method and apparatus for virtual fitting
CN112330580A (en) * 2020-10-30 2021-02-05 北京百度网讯科技有限公司 Method, device, computing equipment and medium for generating human body clothes fusion image
CN113781164A (en) * 2021-08-31 2021-12-10 深圳市富高康电子有限公司 Virtual fitting model training method, virtual fitting method and related device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research and Implementation of a Virtual Fitting APP Based on Convolutional Neural Networks; Wei Xinying; Computer Programming Skills & Maintenance; 2020-08-18 (Issue 08); full text *
Research Progress in Image Semantic Segmentation Based on Deep Convolutional Neural Networks; Qing Chen; Yu Jing; Xiao Chuangbai; Duan Juan; Journal of Image and Graphics; 2020-06-16 (Issue 06); full text *

Also Published As

Publication number Publication date
CN114913388A (en) 2022-08-16

Similar Documents

Publication Publication Date Title
CN110705448B (en) Human body detection method and device
CN110599492B (en) Training method and device for image segmentation model, electronic equipment and storage medium
CN111787242B (en) Method and apparatus for virtual fitting
CN115456160A (en) Data processing method and data processing equipment
CN113496507A (en) Human body three-dimensional model reconstruction method
KR102635777B1 (en) Methods and apparatus, electronic devices and storage media for detecting molecular binding sites
CN111598111B (en) Three-dimensional model generation method, device, computer equipment and storage medium
CN115439308A (en) Method for training fitting model, virtual fitting method and related device
CN116416416A (en) Training method of virtual fitting model, virtual fitting method and electronic equipment
CN110458924B (en) Three-dimensional face model establishing method and device and electronic equipment
CN111383308A (en) Method and electronic equipment for generating animation expression
CN110390259A (en) Recognition methods, device, computer equipment and the storage medium of diagram data
CN116229066A (en) Portrait segmentation model training method and related device
CN116109892A (en) Training method and related device for virtual fitting model
CN113298931B (en) Reconstruction method and device of object model, terminal equipment and storage medium
CN111507259B (en) Face feature extraction method and device and electronic equipment
CN113763440A (en) Image processing method, device, equipment and storage medium
CN114724004B (en) Method for training fitting model, method for generating fitting image and related device
CN114913388B (en) Method for training fitting model, method for generating fitting image and related device
CN115272822A (en) Method for training analytic model, virtual fitting method and related device
CN115439179A (en) Method for training fitting model, virtual fitting method and related device
CN110210314A (en) Method for detecting human face, device, computer equipment and storage medium
CN114821220A (en) Method for training fitting model, method for generating fitting image and related device
CN116563425A (en) Method for training fitting model, virtual fitting method and related device
CN115273140A (en) Analytical model training method, virtual fitting method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant