CN115273140A - Analytical model training method, virtual fitting method and related device


Info

Publication number
CN115273140A
Authority
CN
China
Prior art keywords
image
clothes
analysis
model
clothing
Prior art date
Legal status
Pending
Application number
CN202210785212.XA
Other languages
Chinese (zh)
Inventor
陈仿雄
Current Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd filed Critical Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202210785212.XA
Publication of CN115273140A
Legal status: Pending

Classifications

    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806: Fusion of extracted features, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiments of the present application relate to the technical field of image processing, and in particular to an analytic model training method, a virtual fitting method and a related device. By constraining the clothing category to which the clothing region in the prediction analysis image belongs to keep approaching the real clothing category, the correspondence between the real clothing category and the prediction analysis map is strengthened, so that clothes images of the same clothing category correspond to the same prediction analysis map. Therefore, when the trained analysis model is given clothes images to be tried on of the same clothing category, it can generate the analysis map corresponding to that clothing category. If this analysis model is used in the fitting process, the clothes to be tried on, after being deformed according to the clothes region in the analysis map, look real and vivid, which helps improve the fitting effect.

Description

Analytical model training method, virtual fitting method and related device
Technical Field
The embodiments of the present application relate to the technical field of image processing, and in particular to an analytic model training method, a virtual fitting method and a related device.
Background
With the continuous progress of modern science and technology, the scale of online shopping keeps growing, and users can purchase clothes on online shopping platforms through mobile phones. However, because the information about clothes for sale available to the user is generally a two-dimensional display picture, the user cannot know how the clothes would look when worn. Therefore, the demand for online try-on is becoming stronger, and clothing display is becoming an important direction in the field of modern computer vision. Generally, online try-on is performed by shooting a user image and performing human body analysis on the user image to obtain a human body analysis image. The user selects a try-on garment provided by the system, the try-on garment is deformed according to the corresponding clothes region in the human body analysis image, and the deformed try-on garment automatically replaces the original clothes on the user's body.
However, because the style and cut of the try-on garment differ from those of the garment on the user's body, the contour of the clothes region in the human body analysis image differs greatly from the wearing contour of the try-on garment, and deforming the try-on garment according to the corresponding clothes region in the human body analysis image degrades the try-on effect. For example, if the try-on garment is a loose sweater and the garment on the user's body is a tight sweater, the loose sweater is deformed to the contour of the tight sweater, which makes the loose sweater look like a tight style and results in a poor fitting effect.
Disclosure of Invention
The technical problem mainly solved by the embodiments of the present application is to provide an analytic model training method, a virtual fitting method and a related device. The analytic model obtained with the training method can generate an analysis map adapted to the clothing category, so that the clothes to be tried on, after being deformed according to the clothes region in the analysis map, look real and vivid, which helps improve the fitting effect.
In order to solve the foregoing technical problem, in a first aspect, an embodiment of the present application provides a method for training an analytic model, including:
acquiring a plurality of image groups, wherein each image group comprises a clothes image and a model image and is labeled with a real clothes category; the category of the clothes in the clothes image is the real clothes category, and the category of the clothes worn by the model in the model image is the real clothes category;
analyzing the human body by adopting an analysis network based on the image group to obtain a prediction analysis image;
inputting the prediction analysis image into a clothing classification network to obtain a corresponding prediction clothing category;
calculating the loss corresponding to the image group by adopting a loss function, performing iterative training on the analysis network and the clothing classification network according to the loss sum corresponding to the plurality of image groups until convergence, and taking the converged analysis network as an analysis model, wherein the loss corresponding to the image group reflects the analysis loss of the analysis network and the classification loss of the clothing classification network.
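By way of a non-limiting illustration, the following is a minimal sketch of such a joint training loop, assuming a PyTorch implementation. The stand-in modules, the L1/cross-entropy loss forms, the weights alpha and beta, and the assumption that the real analysis image is a tensor of the same shape as the prediction analysis image are illustrative assumptions rather than the actual design of this application.

```python
import torch
import torch.nn as nn

# Stand-in modules, only to make the loop concrete; the real encoder-decoder analysis
# network and clothing classification network are described in the embodiments below.
parse_net = nn.Conv2d(7, 20, kernel_size=3, padding=1)                             # analysis network stub
cls_net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(20, 10))  # classifier stub

optimizer = torch.optim.Adam(
    list(parse_net.parameters()) + list(cls_net.parameters()), lr=1e-4)
parse_criterion = nn.L1Loss()            # analysis loss (assumed form)
class_criterion = nn.CrossEntropyLoss()  # classification loss (assumed form)
alpha, beta = 1.0, 1.0                   # loss weights (assumed values)

def train_epoch(image_groups):
    """image_groups: iterable of (combined_input, real_parse, true_category) tensors."""
    optimizer.zero_grad()
    total = torch.zeros(())
    for combined_input, real_parse, true_category in image_groups:
        pred_parse = parse_net(combined_input)   # prediction analysis image
        pred_category = cls_net(pred_parse)      # predicted clothing category (logits)
        total = total + alpha * parse_criterion(pred_parse, real_parse) \
                      + beta * class_criterion(pred_category, true_category)
    total.backward()   # back-propagate the loss sum over the image groups
    optimizer.step()
    return float(total)
```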
In some embodiments, the performing human body analysis based on the image group by using the analysis network to obtain a predictive analysis image includes:
detecting key points of a human body on the model images in the image group to obtain key point images;
and the analysis network carries out human body analysis based on the image group and the key point image to obtain a prediction analysis image.
In some embodiments, the method further comprises:
analyzing the human body of the model image in the image group by adopting an analysis algorithm to obtain a real analysis image;
extracting a trunk area of the model from the real analytic image to obtain a local analytic image;
the analysis network analyzes the human body based on the clothes image and the key point image in the image group to obtain a prediction analysis image, and comprises the following steps:
and (4) splicing and combining the local analysis images, the key point images and the clothes images in the image group, inputting a combined image obtained by splicing and combining into an analysis network to analyze the human body, and obtaining a prediction analysis image.
In some embodiments, the analysis network includes an encoder and a decoder, the encoder includes a plurality of cascaded first convolutional layers, each of the first convolutional layers is configured with a plurality of convolutional kernels with different sizes;
the first convolution layer is used for respectively performing downsampling feature extraction on an input image by adopting a plurality of convolution kernels, and performing channel splicing on intermediate feature maps obtained by extracting the plurality of convolution kernels to obtain an output downsampling feature map.
In some embodiments, the garment classification network comprises a plurality of second convolutional layers, fully-connected layers, and classification layers in cascade;
the second convolution layer is used for performing downsampling feature extraction on the input image; the full connection layer is used for carrying out feature classification on the input feature image and outputting a feature vector; and the classification layer is used for carrying out probability conversion on the input feature vectors to obtain the predicted clothing category.
In some embodiments, the loss function includes an analysis loss function and a category loss function, wherein the analysis loss function is used for calculating the difference between the prediction analysis image and a real analysis image, the real analysis image is obtained by performing human body analysis on the model image, and the category loss function is used for calculating the difference between the predicted clothing category and the real clothing category.
In some embodiments, the loss function comprises:
L_loss = α·L_rec + β·L_class, where
L_rec = Σ_i Σ_j |X_{i,j} − Y_{i,j}| (summing over i, j = 1, …, M), and
L_class = −Σ_i T_i·log(P_i) (summing over i = 1, …, N),
wherein L_loss is the loss function, L_rec is the analysis loss function, L_class is the category loss function, α and β are weights, X_{i,j} is the analysis category value at row i and column j of the real analysis image, Y_{i,j} is the analysis category value at row i and column j of the prediction analysis image, M is the size of the real analysis image or the prediction analysis image, T_i is the probability value of the i-th category among the real clothing categories, P_i is the probability value of the i-th category among the predicted clothing categories, and N is the total number of clothing categories.
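The following sketch computes this loss, assuming the analysis loss is the absolute pixel-wise difference written above and the category loss is the cross-entropy term; these exact forms, like the reconstruction of the formula itself, should be read as one possible interpretation of the variables defined above.

```python
import torch

def total_loss(pred_parse, real_parse, pred_probs, true_onehot, alpha=1.0, beta=1.0):
    """pred_parse, real_parse: M x M maps of analysis category values (Y and X);
    pred_probs, true_onehot: length-N predicted probabilities P and real one-hot labels T."""
    l_rec = (real_parse - pred_parse).abs().sum()                   # sum_ij |X_ij - Y_ij|
    l_class = -(true_onehot * torch.log(pred_probs + 1e-8)).sum()   # -sum_i T_i * log(P_i)
    return alpha * l_rec + beta * l_class
```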
In order to solve the above technical problem, in a second aspect, an embodiment of the present application provides a virtual fitting method, including:
acquiring an image of clothes to be tried on and an image of a user;
analyzing the human body by adopting an analytical model based on the clothes image to be tried on and the user image to obtain an analysis image, wherein the analytical model is obtained by training with the method of the first aspect;
and fusing the user image and the clothes image to be tried on, and deforming the clothes image to be tried on based on the clothes region in the analysis image during the fusion process, to obtain a fitting image.
In order to solve the foregoing technical problem, in a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect or the method of the second aspect as described above.
In order to solve the above technical problem, in a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for causing a computer device to perform the method in the first aspect or the method in the second aspect.
The beneficial effects of the embodiment of the application are as follows: different from the situation in the prior art, in the method for training an analytic model provided in the embodiment of the present application, first, a plurality of image groups are obtained, where each image group includes a garment image and a model image, and the image groups are labeled with real garment categories, where a category of a garment in the garment image is the real garment category, and a category of a garment worn by a model in the model image is the real garment category, that is, in one image group, the garment in the garment image and the garment worn by the model in the model image both belong to the real garment category. Then, a human body is analyzed based on the image group by using an analysis network, and a predictive analysis image is obtained. Inputting the prediction analysis image into a clothing classification network, and classifying clothing types of clothing areas in the prediction analysis image to obtain corresponding prediction clothing types. The loss corresponding to the image group is calculated using a loss function (the loss reflects both the analytical loss and the classification loss). After the plurality of image groups are processed in turn, iterative training is carried out on the analysis network and the clothing classification network based on the loss sum corresponding to the plurality of image groups until convergence is reached, and the analysis network after convergence is used as an analysis model.
In this embodiment, a clothing classification network is used to discriminate the clothing category of each prediction analysis image, and then, based on the analysis losses and classification losses corresponding to the plurality of image groups and on back propagation, the clothing category to which the clothes region in the prediction analysis image belongs is constrained to keep approaching the real clothing category. Because the clothes in the clothes image and the clothes worn by the model in the model image of an image group belong to the same real clothes category, constraining the clothing category of the clothes region in the prediction analysis image to approach the real clothes category strengthens the correspondence between the real clothes category and the prediction analysis map, so that clothes images of the same clothes category correspond to the same prediction analysis map. Therefore, when the trained analysis model is given clothes images to be tried on of the same clothes category, it can generate the analysis map corresponding to that clothes category, and the interference of the original clothes on the user's body can be reduced, so that an analysis image adapted to the clothes category can be generated. If this analysis model is used in the fitting process, the clothes to be tried on, after being deformed according to the clothes region in the analysis map, look real and vivid, which helps improve the fitting effect.
In addition, in the training method, the clothes in the clothes image and the clothes worn by the model in the model image of an image group only need to belong to the same real clothes category. For example, if the clothes image contains a loose green short-sleeved shirt, the model in the model image may wear any loose short-sleeved shirt rather than that particular loose green short-sleeved shirt. This reduces the dependency between the clothes in the clothes image and the clothes on the model, and can effectively reduce the difficulty of collecting training data.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not drawn to scale unless otherwise specified.
Fig. 1 is a schematic view of an application scenario of a virtual fitting in some embodiments of the present application;
FIG. 2 is a schematic diagram of an electronic device in some embodiments of the present application;
FIG. 3 is a schematic flow chart diagram illustrating a method for training an analytical model according to some embodiments of the present application;
FIG. 4 is a schematic sub-flowchart of step S20 of the method shown in FIG. 3;
FIG. 5 is a keypoint image in some embodiments of the present application;
FIG. 6 is another sub-flowchart of step S20 of the method shown in FIG. 3;
FIG. 7 is a schematic diagram of a resolution network and a garment classification network in accordance with some embodiments of the present application;
FIG. 8 is a schematic illustration of a process for generating predictive analytic images and predictive garment types in accordance with some embodiments of the present application;
fig. 9 is a schematic flow chart of a virtual fitting method according to some embodiments of the present application.
Detailed Description
The present application will be described in detail with reference to specific examples. The following examples will aid those skilled in the art in further understanding the present application, but are not intended to limit the present application in any way. It should be noted that various changes and modifications can be made by one skilled in the art without departing from the spirit of the application; all such changes and modifications fall within the scope of protection of the present application.
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that, if there is no conflict, the various features of the embodiments of the present application may be combined with each other within the scope of protection of the present application. Additionally, although functional modules are divided in the device schematics and logical sequences are shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the module division or the flowchart order. Further, the terms "first", "second", "third" and the like used herein do not limit the data or the execution order, but merely distinguish identical or similar items having substantially the same functions and effects.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In addition, the technical features mentioned in the embodiments of the present application described below may be combined with each other as long as they do not conflict with each other.
To facilitate understanding of the method provided in the embodiments of the present application, first, terms referred to in the embodiments of the present application will be described:
(1) Neural network
A neural network is composed of neural units and can be understood as a network having an input layer, hidden layers and an output layer: generally the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. A neural network with many hidden layers is called a Deep Neural Network (DNN). The work of each layer in a neural network can be described by the mathematical expression y = a(W · x + b). At the physical level, the work of each layer can be understood as completing a transformation from the input space to the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W · x, operation 4 is completed by + b, and operation 5 is realized by a(). The word "space" is used here because the classified object is not a single thing but a class of things, and the space refers to the set of all individuals of that class of things. W is the weight matrix of each layer of the neural network, and each value in the matrix represents the weight value of one neuron of that layer. The matrix W determines the spatial transformation from the input space to the output space described above, i.e., W of each layer of the neural network controls how the space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained neural network. Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
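As a small illustration of the per-layer operation y = a(W · x + b) described above (with sizes chosen arbitrarily and not tied to any network in this application), a single layer can be written as:

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=4, out_features=3)  # holds the weight matrix W and the bias b
activation = nn.ReLU()                            # the nonlinear function a(.)

x = torch.randn(4)        # an input vector
y = activation(layer(x))  # y = a(W . x + b)
print(y.shape)            # torch.Size([3])
```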
It should be noted that, in the embodiments of the present application, the model adopted for the machine learning task is essentially a neural network. Common components in a neural network include convolution layers, pooling layers, normalization layers, deconvolution layers and the like. A model is designed by assembling these common components, and the model converges when the model parameters (the weight matrices of all layers) are determined such that the model error meets a preset condition, or when the number of parameter adjustments reaches a preset threshold.
A convolution layer is provided with a plurality of convolution kernels, and each convolution kernel is provided with a corresponding stride so as to perform convolution operations on the image. The purpose of the convolution operation is to extract different features of the input image: the first convolution layer may only extract some low-level features such as edges, lines and corners, while deeper convolution layers can iteratively extract more complex features from these low-level features.
The deconvolution layer is used to map a low-dimensional space to a high-dimensional space while maintaining the connectivity/pattern between them (the connectivity here refers to the connectivity during convolution). The deconvolution layer is configured with a plurality of convolution kernels, and each convolution kernel is provided with a corresponding stride to perform a deconvolution operation on the image. In general, framework libraries for designing neural networks (e.g., the PyTorch library) also provide a built-in upsampling function, and the spatial mapping from a low dimension to a high dimension can be implemented by calling that function.
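For instance, assuming a PyTorch implementation, the low-to-high dimensional mapping of a decoder can be realized either with a learned transposed (de-)convolution layer or with a built-in upsampling module; the channel counts and sizes below are arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # a low-resolution feature map

deconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

print(deconv(x).shape)  # torch.Size([1, 32, 64, 64]) -- learned deconvolution
print(up(x).shape)      # torch.Size([1, 64, 64, 64]) -- parameter-free upsampling
```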
Pooling is a process that mimics the human visual system: it reduces the size of the data or represents the image with higher-level features. Common operations of pooling layers include max pooling, mean pooling, random pooling, median pooling, combined pooling, and the like. Typically, pooling layers are periodically inserted between the convolution layers of a neural network to achieve dimensionality reduction.
The normalization layer is used to perform normalization operations on all neurons in the middle layer to prevent gradient explosion and gradient disappearance.
(2) Loss function
In the process of training a neural network, because the output of the neural network is expected to be as close as possible to the value that is really desired to be predicted, the weight matrix of each layer can be updated according to the difference between the predicted value of the current network and the really desired target value (an initialization process is usually carried out before the first update, namely, parameters are configured in advance for each layer of the neural network). For example, if the predicted value of the network is too high, the weight matrices are adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the really desired target value. Therefore, it is necessary to define in advance how to compare the difference between the predicted value and the target value; this is the purpose of the loss function or objective function, which is an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the larger the difference, so training the neural network becomes a process of reducing this loss as much as possible.
(3) Human body analysis
Human body analysis refers to the division of a person captured in an image into multiple semantically consistent regions, for example, a body part and clothing, or a fine classification of a body part and a fine classification of clothing, etc. Namely, the input image is identified in a pixel level, and the object type of each pixel point in the image is marked. For example, elements (e.g., hair, face, limbs, clothing, background, etc.) in a picture including a human body are distinguished by a neural network.
Before the embodiments of the present application are described, a simple description is first given to a virtual fitting method known to the inventors of the present application, so that it is convenient to understand the embodiments of the present application in the following.
In some schemes, after a user image and a try-on clothes image are obtained, the clothes in the try-on clothes image are first deformed according to the human body structure in the user image to obtain a deformed try-on clothes image. Then, mask processing is performed on the corresponding clothes region in the user image, and the masked user image and the deformed try-on clothes image are fused to obtain a fitting image. In the fitting image, the effect of the user wearing the try-on clothes can be seen. In this scheme, the try-on clothes are simply deformed according to key points of the human body structure, which leads to problems such as blurred boundaries, so the try-on effect is poor.
In some schemes, after the user image and the try-on clothes image are obtained, human body analysis is first performed on the user image to obtain a human body analysis image. The try-on clothes are deformed according to the corresponding clothes region in the human body analysis image, and the deformed try-on clothes automatically replace the original clothes on the user's body. In this scheme, the human body analysis image helps divide the boundary information of the try-on clothes, so the try-on clothes in the generated fitting image have clear boundaries, and the displayed fitting effect is more reasonable.
However, because the style and cut of the try-on garment differ from those of the garment on the user's body, the contour of the clothes region in the human body analysis image differs greatly from the wearing contour of the try-on garment, and deforming the try-on garment according to the corresponding clothes region in the human body analysis image degrades the try-on effect. For example, if the try-on garment is a loose sweater and the garment on the user's body is a tight sweater, the loose sweater is deformed to the contour of the tight sweater, which makes the loose sweater look like a tight style and results in a poor fitting effect.
In order to solve the above problems, the present application provides an analytic model training method, a virtual fitting method and a related apparatus. A clothing classification network is used to discriminate the clothing category of each prediction analysis image, and then, based on the analysis losses and classification losses corresponding to the plurality of image groups and on back propagation, the clothing category to which the clothes region in the prediction analysis image belongs is constrained to keep approaching the real clothing category, so that the correspondence between the real clothing category and the prediction analysis map is strengthened and clothes images of the same clothing category correspond to the same prediction analysis map. Therefore, when the trained analysis model is given clothes images to be tried on of the same clothing category, it can generate the analysis map corresponding to that clothing category, and the interference of the original clothes on the user's body can be reduced, so that an analysis map adapted to the clothing category can be generated. If this analysis model is used in the fitting process, the clothes to be tried on, after being deformed according to the clothes region in the analysis map, look real and vivid, which helps improve the fitting effect.
An exemplary application of the electronic device for training the analytic model or for virtual fitting provided in the embodiment of the present application is described below, and it can be understood that the electronic device may train the analytic model, or may analyze the clothes to be fitted and the user image by using the analytic model to generate an analytic image, then deform the clothes in the clothes image to be fitted according to the outline of the clothes area in the analytic image, and fuse the deformed clothes image to be fitted and the user image to obtain a fitting image.
The electronic device provided by the embodiment of the application can be a server, for example, a server deployed in the cloud. When the server is used for training the analytical model, the neural network is subjected to iterative training by adopting the training set according to the training set and the neural network provided by other equipment or technicians in the field, and final model parameters are determined, so that the neural network configures the final model parameters, and the analytical model can be obtained. When the server is used for virtual fitting, a built-in analysis model is called, corresponding calculation processing is carried out on an image of clothes to be fitted and a user image provided by other equipment or a user to obtain an analysis image, then clothes in the image of the clothes to be fitted are deformed according to the outline of a clothes area in the analysis image, and the deformed image of the clothes to be fitted and the user image are fused to obtain a fitting image.
The electronic device provided by some embodiments of the present application may be various types of terminals such as a notebook computer, a desktop computer, or a mobile device. When the terminal is used for training the analytic model, a person skilled in the art inputs a prepared training set into the terminal and designs a neural network on the terminal, and the terminal iteratively trains the neural network with the training set to determine the final model parameters, so that the neural network configured with the final model parameters is the analytic model. When the terminal is used for virtual fitting, a built-in analytic model is called to perform corresponding calculation processing on the user image and the clothes image to be tried on input by the user to obtain an analysis image, and the contour of the clothes region in the analysis image matches the contour of the real try-on effect of the clothes to be tried on. Then the clothes in the clothes image to be tried on are deformed according to the contour of the clothes region in the analysis image, and the deformed clothes image to be tried on is fused with the user image to obtain a fitting image.
By way of example, referring to fig. 1, fig. 1 is a schematic view of an application scenario of a virtual fitting system provided in an embodiment of the present application, and a terminal 10 is connected to a server 20 through a network, where the network may be a wide area network or a local area network, or a combination of the two.
The terminal 10 may be used to obtain a training set and build a neural network, for example, a person skilled in the art downloads the prepared training set on the terminal and builds a network structure of the neural network. It is understood that the terminal 10 may also be used to obtain the user image and the image of the clothes to be tried on, for example, the user inputs the user image and the image of the clothes to be tried on through the input interface, and after the input is completed, the terminal automatically obtains the user image and the image of the clothes to be tried on; for example, the terminal 10 is provided with a camera, the camera captures an image of a user, a clothes image library is stored in the terminal 10, and the user can select an image of clothes to be tried on from the clothes image library.
In some embodiments, the terminal 10 locally executes the method for training an analytic model provided in this embodiment to complete training of the designed neural network by using the training set, and determine final model parameters, so that the neural network configures the final model parameters, and thus the analytic model can be obtained. In some embodiments, the terminal 10 may also send, to the server 20 through the network, a training set stored on the terminal by a person skilled in the art and a constructed neural network, the server 20 receives the training set and the neural network, trains the designed neural network by using the training set, determines final model parameters, and then sends the final model parameters to the terminal 10, and the terminal 10 stores the final model parameters, so that the neural network configuration can obtain the final model parameters, that is, the analytic model can be obtained.
In some embodiments, the terminal 10 locally executes the virtual fitting method provided in this embodiment to provide a virtual fitting clothes service for the user, invokes a built-in analysis model, performs corresponding calculation processing on a user image and a to-be-fitted clothes image input by the user to generate an analysis image, then deforms clothes in the to-be-fitted clothes image according to an outline of a clothes area in the analysis image, and fuses the deformed to-be-fitted clothes image and the user image to obtain a fitting image.
In some embodiments, the terminal 10 may also send, to the server 20 through the network, the user image and the clothes image to be tried on input by the user on the terminal. The server 20 receives the user image and the clothes image to be tried on, invokes a built-in analytic model to perform corresponding calculation processing on them to obtain an analysis image, deforms the clothes in the clothes image to be tried on according to the contour of the clothes region in the analysis image, and fuses the deformed clothes image to be tried on and the user image to obtain a fitting image. Then, the fitting image is transmitted to the terminal 10. After receiving the fitting image, the terminal 10 displays the fitting image on the interface for the user to view.
The structure of the electronic device in the embodiment of the present application is described below, and fig. 2 is a schematic structural diagram of the electronic device 500 in the embodiment of the present application, where the electronic device 500 includes at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.
The Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., wherein the general purpose Processor may be a microprocessor or any conventional Processor, etc.
The user interface 530 includes one or more output devices 531 enabling presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
In some embodiments, memory 550 may be capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including Bluetooth, wireless Fidelity (WiFi), and Universal Serial Bus (USB), among others;
a display module 553 for enabling the presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 for detecting one or more user inputs or interactions from one of the one or more input devices 532 and translating the detected inputs or interactions.
As can be understood from the foregoing, the method for training an analytic model and the virtual fitting method provided in the embodiments of the present application may be implemented by various electronic devices with computing processing capabilities, such as an intelligent terminal and a server.
The method for training the analytic model provided by the embodiment of the present application is described below with reference to an exemplary application and implementation of the server provided by the embodiment of the present application. Referring to fig. 3, fig. 3 is a schematic flowchart of a method for training an analytic model according to an embodiment of the present application.
Referring to fig. 3 again, the method S100 may specifically include the following steps:
s10: several image groups are acquired.
Each image group includes a clothes image and a model image, and each image group is labeled with a real clothes category. The category of the clothes in the clothes image is the real clothes category, and the category of the clothes worn by the model in the model image is the real clothes category.
For example, the image group 1# includes a clothes image 1# and a model image 1#; the clothes in the clothes image 1# are loose short sleeves, the clothes in the model image 1# are also loose short sleeves, and the image group is labeled with the real clothes category (for example, loose short sleeves). It can be understood that the real clothes category may be expressed in one-hot encoding. The clothes in the clothes image 1# and the clothes worn by the model in the model image 1# need not be the same garment; for example, the clothes image 1# may contain a loose green short-sleeved shirt while the model in the model image wears a loose striped short-sleeved shirt. That is, the clothes in the clothes image and the clothes worn by the model in the model image of an image group only need to belong to the same real clothes category, so the one-to-one dependency between the clothes in the clothes image and the clothes on the model can be removed, which can effectively reduce the difficulty of collecting training data.
In some embodiments, to acquire the several image groups, a person skilled in the art may first collect several clothes images and several model images. It can be understood that the clothes in the several clothes images may cover a variety of clothes categories, such as loose short sleeves, tight short sleeves, loose trousers, straight trousers, A-line skirts, and the like. The clothes worn by the models in the several model images may also cover these clothes categories. There are multiple clothes images belonging to the same clothes category, and they may cover multiple garments under that category. There is at least one model image belonging to each clothes category, so that a plurality of clothes images of the same clothes category can correspond to at least one model image; for example, 100 clothes images belonging to the clothes category C1 can correspond to at least one C1 model image (in which the model wears clothes of category C1).
The plurality of clothes images and the plurality of model images are randomly combined according to clothes category to obtain a plurality of image groups. For example, 100 clothes images belonging to the clothes category C1 can correspond to 1 C1 model image, and 50 clothes images belonging to the clothes category C2 can correspond to 2 C2 model images. The 100 C1 clothes images and the 1 C1 model image can be combined to obtain 100 image groups, and these 100 image groups are labeled with the real clothes category C1; the 50 C2 clothes images and the 2 C2 model images can be combined to obtain 100 image groups, and these 100 image groups are labeled with the real clothes category C2. Thus, in this example, 200 image groups can be obtained.
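A sketch of how such image groups could be assembled from category-tagged images; the directory layout, file names and category labels below are hypothetical and only reproduce the counts of the example above.

```python
from itertools import product

# Category-tagged image paths (hypothetical data, for illustration only).
garment_images = {"C1": [f"garments/c1_{i}.jpg" for i in range(100)],
                  "C2": [f"garments/c2_{i}.jpg" for i in range(50)]}
model_images = {"C1": ["models/c1_a.jpg"],
                "C2": ["models/c2_a.jpg", "models/c2_b.jpg"]}

image_groups = []
for category in garment_images:
    # every clothes image of a category is paired with every model image of the same category
    for garment_path, model_path in product(garment_images[category], model_images[category]):
        image_groups.append({"clothes": garment_path, "model": model_path, "label": category})

print(len(image_groups))  # 100 * 1 + 50 * 2 = 200 image groups
```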
In this embodiment, the model image does not need to correspond to the clothes image according to the same clothes, and a plurality of clothes images of the same clothes type can correspond to 1 model image with the same clothes type, so that the collection difficulty and the collection workload can be reduced.
It is understood that the clothing categories may be set according to actual conditions, and the number of the clothing images of each category and the number of the model images may also be set according to actual conditions. The above examples of the garment categories (C1, C2), the number of garment images, and the number of model images are only exemplary, and do not set any limit to the number of garment categories and image groups that can be covered by the image groups of the present application.
In some embodiments, the number of image groups is on the order of tens of thousands, for example 20000, which is beneficial for training an accurate and general model. The number of image groups and the clothes categories covered can be determined by a person skilled in the art according to the actual situation.
It is understood that the plurality of image sets can be gathered by a person skilled in the art on a terminal (e.g., a computer) in advance, for example, several clothes images can be crawled on some clothes selling websites, and each clothes image has at least 1 model image with the same clothes category. The image data can be uploaded to a server through a terminal.
In some embodiments, before the clothes image and the model image are combined, the collected model images are preprocessed so that the clothes image and the model image are uniform in size. For example, the model image is padded with a pixel value, for example 0, so that the model image becomes a square (1:1) image. Similarly, if the clothes image is not a square (1:1) image, it is padded in the same way.
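A sketch of this preprocessing, assuming zero-padding to a square (1:1) image followed by resizing to a common resolution; the 512 × 512 target size is an assumption taken from the resolutions mentioned in later embodiments.

```python
from PIL import Image, ImageOps

def pad_to_square_and_resize(path, size=512, fill=0):
    """Pad the shorter side with a constant pixel value, then resize to size x size."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = max(w, h)
    pad_w, pad_h = side - w, side - h
    # pad evenly on both sides of the shorter dimension
    padding = (pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
    img = ImageOps.expand(img, border=padding, fill=fill)
    return img.resize((size, size), Image.BILINEAR)
```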
S20: and analyzing the human body by adopting an analysis network based on the image group to obtain a prediction analysis image.
The analysis network is a preset neural network, and has components (convolution layer, deconvolution layer, pooling layer, etc.) of the neural network. The basic structure and principle of the neural network are described in detail in the noun introduction (1), and are not described in detail herein. The analysis network may be constructed by those skilled in the art on a neural network design platform on a terminal (e.g., computer) computer, and then sent to the server. In some embodiments, the layer structure (convolution kernel, step size, etc.), the interlayer connection structure, the layer combination condition, etc. of the convolution layer, the deconvolution layer, or the pooling layer, etc. in the analysis network can be set to obtain a specific analysis network.
The analysis network analyzes the human body of the input image to obtain a prediction analysis image. Here, the "image input to the analysis network" is obtained based on the image group. For example, in some embodiments, the "image input to the analysis network" may be a spliced combination of the clothes image and the model image in the image group. In some embodiments, the "image input to the analysis network" may be a fusion of the processed clothes image and the processed model image. That is, the "image input to the analysis network" can include features in the clothes image and features in the model image.
The analysis network performs calculation processing (for example, convolution operation, deconvolution operation, or the like) on the input image to realize human body analysis, thereby obtaining a predictive analysis image. The "human body analysis" is described in detail in the "noun introduction (3)", and is not described in detail here. It is understood that the prediction analysis image can reflect the type to which each pixel belongs in the "image input into the analysis network", and therefore, the shape region (for example, a clothing region, an extremity region, a background, and the like) of each part element in the "image input into the analysis network" can be acquired from the prediction analysis image.
It can be understood that the human body analysis is performed by the analysis network on a plurality of image groups in the same manner as in S20, and a corresponding prediction analysis image is obtained.
In this embodiment, the analysis network performs human body analysis based on the image group to obtain a predictive analysis image. The "image input to the analysis network" can include features in the clothes image and features in the model image, so that the analysis network can learn clothes features in the clothes image and torso features in the model image, and fuse the clothes features and the torso features, so that the obtained prediction analysis image can reflect a torso region and a clothes region, that is, the shape regions of the part elements in the prediction analysis image are obtained by integrating the clothes features and the torso features.
In some embodiments, referring to fig. 4, the step S20 specifically includes:
s21: and detecting key points of the human body on the model images in the image group to obtain key point images.
S22: and the analysis network carries out human body analysis based on the image group and the key point image to obtain a prediction analysis image.
Human body key point detection is performed on the model image by using a human body key point detection algorithm, so that human body key point information (namely a plurality of key points on the human body) can be located; as shown in fig. 5, the key point image includes key points of the head, shoulders, arms, legs, torso and other regions. In some embodiments, the human body key point detection algorithm may use the OpenPose algorithm for detection. In some embodiments, the human body key point detection algorithm may adopt a 2D key point detection algorithm, such as a Convolutional Pose Machine (CPM) or a Stacked Hourglass Network (Hourglass).
The analysis network performs human body analysis based on the image group (clothes image, model image) and the key point image to obtain a prediction analysis image. Here, the "image input to the analysis network" is obtained based on the clothes image, the model image, and the key point image. For example, in some embodiments, the "image input to the analysis network" may be a spliced combination of the clothes image, the model image and the key point image. That is, the "image input to the analysis network" can include features in the clothes image, torso features in the model image, and human body key point information in the key point image.
The analysis network performs calculation processing (for example, convolution operation, deconvolution operation, or the like) on the input image to realize human body analysis, thereby obtaining a predictive analysis image.
In this embodiment, the analysis network performs human body analysis based on the clothes image, the model image, and the key point image, to obtain a predictive analysis image. The "image input to the analytic network" can include features in the garment image, torso features in the model image, and keypoint information in the keypoint image. Therefore, the analysis network can learn clothing features in the clothing images, trunk features in the model images and key point information of human bodies in the key point images, and fuse the clothing features, the trunk features and the key point information of the human bodies, so that the obtained prediction analysis images can reflect trunk areas and clothing areas. In addition, in the process of generating the prediction analysis image, the human body key point information in the key point image can restrict the trunk area and the clothes area, so that the clothes can be accurately positioned in the prediction analysis image.
In some embodiments, referring to fig. 6, the method further comprises:
s23: and carrying out human body analysis on the model image in the image group to obtain a real analysis image.
Here, human body analysis may be performed on the model image by using a conventional human body analysis algorithm, for example the Graphonomy algorithm, to generate a preliminary human body analysis image, and the preliminary human body analysis image is used as the real analysis image. It can be understood that the clothes worn by the model in the model image belong to the real clothes category, and the clothes in the clothes image also belong to the real clothes category, so the preliminary human body analysis image obtained by analyzing the model image can be used as the real analysis image corresponding to wearing the clothes of that real clothes category.
In the real analysis image, the pixels are classified into 20 classes, which can be identified by 0 to 19; for example, 0 represents the background, 1 represents a hat, 2 represents hair, 3 represents a glove, 4 represents sunglasses, 5 represents a jacket, 6 represents a dress, 7 represents a coat, 8 represents socks, 9 represents trousers, 10 represents torso skin, 11 represents a scarf, 12 represents a skirt, 13 represents the face, 14 represents the left arm, 15 represents the right arm, 16 represents the left leg, 17 represents the right leg, 18 represents the left shoe, and 19 represents the right shoe. From the real analysis image, the category to which each region in the model image belongs can be determined.
S24: and extracting the trunk area of the model from the real analytic image to obtain a local analytic image.
The trunk area may include the head area, the arm areas, the foot areas, and the like. Therefore, the trunk area of the model can be separated from the real analysis image according to the pixel classes to obtain the local analysis image. For example, the pixel classes of the trunk area are retained, and the pixel classes of the other regions outside the trunk area are set to 0. In the local analysis image, regions such as the head, arms and feet can be distinguished.
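A sketch of this extraction, assuming the real analysis image is an array of per-pixel class IDs following the 0-19 labelling above; the particular set of IDs kept as the trunk area is an assumption for illustration.

```python
import numpy as np

# class IDs kept as the model's trunk area (hair, torso skin, face, arms, legs, shoes) -- assumed subset
TRUNK_CLASS_IDS = {2, 10, 13, 14, 15, 16, 17, 18, 19}

def extract_trunk(parse_map: np.ndarray) -> np.ndarray:
    """Keep the pixel classes of the trunk area and set all other regions to 0 (background)."""
    local_parse = np.where(np.isin(parse_map, list(TRUNK_CLASS_IDS)), parse_map, 0)
    return local_parse.astype(parse_map.dtype)
```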
In this embodiment, referring to fig. 6 again, the step S22 specifically includes:
s221: and (4) splicing and combining the local analysis image, the key point image and the clothes image in the image group, inputting a combined image obtained by splicing and combining into an analysis network for human body analysis, and obtaining a prediction analysis image.
It can be understood that the local analysis image is a 1-channel grayscale image, the key point image is a 3-channel image, and the clothes image is a 3-channel image. In some embodiments, the resolution of the local analysis image is 512 × 512 × 1, the resolution of the key point image is 512 × 512 × 3, and the resolution of the clothes image is 512 × 512 × 3. In some embodiments, the splicing combination may be performed by channel splicing; for example, the aforementioned 3 images are spliced and combined by channel to obtain a 512 × 512 × 7 combined image. Thus, the combined image includes the torso features of the local analysis image, the key point information of the key point image, and the clothes features of the clothes image.
The combined image is input into the analysis network for human body analysis to obtain the prediction analysis image. Because the combined image comprises the trunk features of the local analysis image, the key point information of the key point image, and the clothing features of the clothes image, the analysis network can learn the clothing features in the clothes image, the trunk features in the local analysis image, and the human body key point information in the key point image, fuse them, and generate the prediction analysis image. The clothes image provides the clothing feature information, and the human body key point information constrains the trunk area and the clothes area, so that the clothes are accurately positioned in the prediction analysis image. Because the local analysis image only comprises the trunk features, it provides a prior condition for the prediction analysis image and ensures that the trunk features to be retained remain unchanged. Compared with directly combining the model image, the clothes image and the key point image, using the local analysis image is not influenced by the clothes originally worn by the model, so the retained area is more accurate.
In some embodiments, the analysis network includes an encoder and a decoder, the encoder including a plurality of cascaded first convolution layers, each configured with a plurality of convolution kernels of different sizes. The first convolution layer is used for performing down-sampling feature extraction on the input image with each of the plurality of convolution kernels, and performing channel splicing on the intermediate feature maps extracted by the plurality of convolution kernels to obtain the output down-sampling feature map.
As will be appreciated by those skilled in the art, the encoding network is used to down-sample the input image: as the image passes through successive convolution layers, the size of the output down-sampling feature map becomes smaller. The decoding network is used to up-sample its input: as the input passes through successive layers (which may include deconvolution layers), the size of the output up-sampling feature map becomes larger.
Referring to fig. 7 and 8, the encoding network includes a plurality of cascaded first convolution layers. Fig. 7 schematically illustrates an example in which the encoding network includes 7 cascaded first convolution layers. Referring to fig. 8, an input combined image (for example, with a resolution of 512 × 512 × 7) is subjected to down-sampling processing by the 1st first convolution layer, the obtained down-sampling feature map (256 × 256) is input into the 2nd first convolution layer for down-sampling, the obtained down-sampling feature map (128 × 128) is input into the 3rd first convolution layer for down-sampling processing, and so on; the down-sampling feature map (4 × 4) output by the 7th first convolution layer is input into the decoding network for decoding processing.
Any one of the first convolution layers in the coding network is configured with a plurality of convolution kernels with different sizes. Therefore, the first convolution layer can respectively perform downsampling feature extraction on the input image by adopting the plurality of convolution kernels, and one convolution kernel correspondingly extracts an intermediate feature map. And performing channel splicing on a plurality of intermediate feature maps obtained by extracting a plurality of convolution kernels to obtain a down-sampling feature map finally output by the first convolution layer.
In some embodiments, the first convolution layer includes convolution kernels of sizes 7 × 7, 5 × 5, and 3 × 3, and the 3 convolution kernels are configured with appropriate strides and paddings so that they output intermediate feature maps of the same size. For example, after the combined image with the size of 512 × 512 is input into the 1st first convolution layer, down-sampling feature extraction by the 7 × 7 convolution kernel yields an intermediate feature map 1# with the size of 256 × 256; down-sampling feature extraction by the 5 × 5 convolution kernel yields an intermediate feature map 2# of 256 × 256; and down-sampling feature extraction by the 3 × 3 convolution kernel yields an intermediate feature map 3# of 256 × 256. Then, channel splicing is performed on intermediate feature map 1#, intermediate feature map 2#, and intermediate feature map 3# to obtain a down-sampling feature map with the size of 256 × 256. In this embodiment, the 2nd first convolution layer outputs a down-sampling feature map of 128 × 128, the 3rd outputs 64 × 64, the 4th outputs 32 × 32, the 5th outputs 16 × 16, the 6th outputs 8 × 8, and the 7th outputs 4 × 4.
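A sketch of one such first convolution layer is given below, assuming PyTorch, stride-2 branches whose paddings are chosen so that the three intermediate feature maps have the same size, and illustrative channel counts; the BatchNorm/ReLU choices are likewise assumptions.

import torch
import torch.nn as nn

class FirstConvLayer(nn.Module):
    """Three parallel branches with 7x7, 5x5 and 3x3 kernels, each halving
    the spatial size; their intermediate feature maps are concatenated
    along the channel axis to form the output down-sampling feature map."""
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, k, stride=2, padding=k // 2),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True))
            for k in (7, 5, 3)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([b(x) for b in self.branches], dim=1)

# Example: the 1st first convolution layer applied to the 7-channel combined image.
x = torch.rand(1, 7, 512, 512)
y = FirstConvLayer(in_ch=7, branch_ch=16)(x)
print(y.shape)  # torch.Size([1, 48, 256, 256])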
It will be appreciated that convolution kernels of different sizes have different receptive fields. Configuring the first convolution layer with a plurality of convolution kernels of different sizes for feature extraction can establish the relation among the clothing features, the key point information, and the trunk features, so that these features are closely associated in the down-sampling feature map and the key point information can play an effective constraint role.
The decoding network shown in fig. 7 includes 7 deconvolution layers, each of which may be configured with a convolution kernel of size 3 × 3 for up-sampling. Referring to fig. 7 and 8, after the down-sampling feature map with the size of 4 × 4 is input into the decoding network, up-sampling feature extraction is performed layer by layer through the 7 deconvolution layers, sequentially generating up-sampling feature maps with sizes of 8 × 8, 16 × 16, 32 × 32, 64 × 64, 128 × 128, 256 × 256, and 512 × 512. The up-sampling feature map with the size of 512 × 512 is taken as the prediction analysis image finally output by the decoding network.
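The decoding network can be sketched in the same spirit; the channel counts, the ReLU activations, and the 20-channel output (one channel per analysis category) are assumptions used only to make the example concrete.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """7 deconvolution layers with 3x3 kernels, each doubling the spatial
    size (4 -> 8 -> ... -> 512)."""
    def __init__(self, in_ch: int = 256, num_classes: int = 20):
        super().__init__()
        chs = [in_ch, 128, 128, 64, 64, 32, 32, num_classes]
        layers = []
        for c_in, c_out in zip(chs[:-1], chs[1:]):
            layers.append(nn.ConvTranspose2d(c_in, c_out, kernel_size=3,
                                             stride=2, padding=1, output_padding=1))
            layers.append(nn.ReLU(inplace=True))
        self.net = nn.Sequential(*layers[:-1])  # no activation after the last layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

feat = torch.rand(1, 256, 4, 4)      # 4 x 4 feature map from the encoding network
parse_out = Decoder()(feat)
print(parse_out.shape)               # torch.Size([1, 20, 512, 512])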
S30: and inputting the prediction analysis image into a clothing classification network to obtain a corresponding prediction clothing category.
Here, the clothing classification network is a preset neural network for classification, having the usual components of a neural network (convolution layers, pooling layers, a fully-connected layer, a classification layer, and the like). The basic structure and principle of the neural network are described in detail in the noun introduction (1) and are not repeated here. The clothing classification network can be constructed by a person skilled in the art on a neural network design platform on a terminal (such as a computer), and the constructed network is then sent to a server. In some embodiments, the convolution kernel sizes, strides, and the like of the convolution layers, pooling layers, and fully-connected layer in the clothing classification network may be set as required.
The real clothing category is labeled for the image group, so the prediction analysis image corresponds to a real clothing category. The clothing classification network learns the correspondence between the real clothing category and the prediction analysis image, and predicts the category to which the prediction analysis image belongs to obtain the corresponding predicted clothing category.
In some embodiments, the clothing classification network includes a plurality of second convolution layers, a fully-connected layer, and a classification layer, which are cascaded. The second convolution layers are used for performing down-sampling feature extraction on the input image. The fully-connected layer is used for performing feature classification on the input feature map and outputting a feature vector. The classification layer is used for performing probability conversion on the input feature vector to obtain the predicted clothing category.
Referring to fig. 7 and 8, fig. 7 schematically illustrates a clothing classification network including 7 second convolution layers, a fully-connected layer, and a classification layer, which are sequentially cascaded. The input prediction analysis image with the size of 512 × 512 undergoes down-sampling feature extraction layer by layer through the 7 second convolution layers to obtain a feature map with the size of 4 × 4; the 4 × 4 feature map is input into the fully-connected layer for feature classification to obtain a feature vector, and the feature vector is input into the classification layer for probability conversion to obtain the predicted clothing category.
In some embodiments, the second convolution layers are configured with convolution kernels of size 3 × 3, with the stride set to 2. The classification layer calculates the probability of each clothing category using a softmax function. In this embodiment, the structure of the clothing classification network is shown in table 1 below, where X is the prediction analysis image and N is the total number of clothing categories.
TABLE 1
Layer kernel_size S Output shape
Image X - - 512*512*3
Conv 1 2 256*256*16
Conv 3 2 128*128*32
Conv 3 2 64*64*32
Conv 3 2 32*32*64
Conv 3 2 16*16*64
Conv 3 2 8*8*128
Conv 3 2 4*4*256
Fc - - 1024*1
softmax - - N*1
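Table 1 can be realized, for example, as the following sketch. The "Conv 1" row is read as a 1 × 1 convolution kernel, the ReLU activations are assumed, and the projection from the 1024-dimensional feature vector to the N categories inside the classification layer is an assumption (Table 1 only lists the softmax output shape).

import torch
import torch.nn as nn

class ClothingClassifier(nn.Module):
    """7 stride-2 convolutions (kernel sizes 1, 3, 3, 3, 3, 3, 3), a
    fully-connected layer producing a 1024-d feature vector, and a softmax
    layer over N clothing categories, following Table 1."""
    def __init__(self, num_classes: int):
        super().__init__()
        cfg = [(3, 16, 1), (16, 32, 3), (32, 32, 3), (32, 64, 3),
               (64, 64, 3), (64, 128, 3), (128, 256, 3)]
        convs = []
        for c_in, c_out, k in cfg:
            convs += [nn.Conv2d(c_in, c_out, k, stride=2, padding=k // 2),
                      nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*convs)          # 512x512x3 -> 4x4x256
        self.fc = nn.Linear(4 * 4 * 256, 1024)
        self.classify = nn.Sequential(nn.Linear(1024, num_classes),
                                      nn.Softmax(dim=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x).flatten(1)
        return self.classify(self.fc(f))

# Example with an arbitrary N = 13 clothing categories.
probs = ClothingClassifier(num_classes=13)(torch.rand(1, 3, 512, 512))
print(probs.shape)  # torch.Size([1, 13])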
S40: and calculating the loss corresponding to the image group by adopting a loss function, performing iterative training on the analysis network and the clothing classification network according to the loss sum corresponding to the plurality of image groups until convergence, and taking the converged analysis network as an analysis model. And the loss corresponding to the image group reflects the analysis loss of the analysis network and the classification loss of the clothing classification network.
Here, the loss function may be configured in the terminal by a person skilled in the art, and the configured loss function is sent to the server along with the parsing network and the clothing classification network. The loss function has been described in detail in the above "noun introduction (2)", and will not be repeated herein. It is understood that the structure of the loss function can be set according to actual situations based on different network structures and training modes.
The server processes the image group to obtain the prediction analysis image corresponding to the image group and the predicted clothing category, and then uses the loss function to calculate the loss corresponding to the image group based on the predicted clothing category and the prediction analysis image (this loss reflects the analysis loss and the classification loss). After the plurality of image groups are processed in turn, iterative training is performed on the analysis network and the clothing classification network based on the loss sum corresponding to the plurality of image groups until convergence, and the converged analysis network is used as the analysis model.
It can be understood that a smaller loss sum means a smaller difference between the predicted clothing category of each prediction analysis image and the corresponding real clothing category, and that the shape of the clothes area in the prediction analysis image is closer to the real try-on shape of clothes of that category. Therefore, the loss corresponding to each image group can be calculated with the loss function, and after the plurality of image groups are processed in turn, the model parameters of the analysis network and the clothing classification network are adjusted based on the loss sum corresponding to the plurality of image groups; that is, the loss sum is back-propagated, so that the shape of the clothes area in the prediction analysis image output by the analysis network continuously approaches the real try-on shape of clothes of that clothing category. The model parameters are adjusted through multiple iterations until the analysis network and the clothing classification network converge as a whole, and the converged analysis network is taken as the analysis model.
It is understood that convergence herein may mean that, under certain model parameters, the loss sum is smaller than a preset threshold or fluctuates within a certain range. In some embodiments, the Adam algorithm is used to optimize the model parameters; for example, the number of iterations is set to 100,000, the initial learning rate is set to 0.001, the weight decay is set to 0.0005, and the learning rate is attenuated to 1/10 of its value every 1000 iterations. The learning rate and the loss sum can be input into the Adam algorithm to obtain adjusted model parameters output by the Adam algorithm, and the adjusted model parameters are used for the next round of training until training is completed. The model parameters of the converged analysis network and clothing classification network are then output; that is, the model parameters of the converged analysis network can be stored to obtain the analysis model.
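A minimal sketch of this optimization setup is shown below, assuming PyTorch's Adam optimizer and a step learning-rate schedule; the placeholder networks and dummy loss only keep the snippet self-contained, and in the actual method the loss would be the loss sum over the image groups.

import torch
import torch.nn as nn

# Tiny placeholders standing in for the analysis network and the clothing
# classification network so that the snippet runs end to end.
parse_net = nn.Linear(8, 8)
cls_net = nn.Linear(8, 8)
params = list(parse_net.parameters()) + list(cls_net.parameters())

optimizer = torch.optim.Adam(params, lr=0.001, weight_decay=0.0005)
# Learning rate decays to 1/10 every 1000 iterations, as stated above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)

for step in range(100_000):
    # In the real method, `loss` is the loss sum over the image groups
    # (analysis loss + classification loss); a dummy loss keeps this runnable.
    loss = (parse_net(torch.rand(1, 8)) + cls_net(torch.rand(1, 8))).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()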
It can be understood that, after the server obtains the converged model parameters of the analysis network (i.e., the final model parameters), the final model parameters may be sent to the terminal, and the analysis network in the terminal is configured with the final model parameters to obtain the analysis model. In some embodiments, the server may also store the analytical network and the final model parameters to obtain the analytical model.
It should be noted that, in the embodiment of the present application, the plurality of image groups covers a plurality of garment categories, and can cover most kinds of clothes on the market. Therefore, the trained analysis model is a general model and can be widely used for human body analysis in virtual fitting.
In this embodiment, a clothing classification network is used to discriminate the clothing category of the prediction analysis image, and then, based on the losses corresponding to the plurality of image groups (which reflect the analysis loss and the classification loss) and back propagation, the clothing category to which the clothes area in the prediction analysis image belongs is constrained to continuously approach the real clothing category. Because the clothes in the clothes image of an image group and the clothes worn by the model in the model image belong to the same real clothing category, constraining the clothing category of the clothes area in the prediction analysis image to approach the real clothing category strengthens the correspondence between the real clothing category and the prediction analysis map, so that clothes images of the same clothing category correspond to the same prediction analysis map. Therefore, when the trained analysis model faces clothes images to be tried on of the same clothing category, it can generate the analysis map corresponding to that clothing category, which reduces the interference of the clothes originally worn by the user and yields an analysis map adapted to the clothing category. If this analysis model is adopted in the fitting process, the deformation effect of the clothes to be tried on, after being deformed according to the clothes area in the analysis map, is real and vivid, which is beneficial to improving the fitting effect.
In some embodiments, the loss function includes an analytical loss function and a categorical loss function, wherein the analytical loss function is used to compute the difference between the predictive analytical image and the real analytical image.
The real analysis image is obtained by analyzing the model image; for example, the model image may be analyzed by an existing human body analysis algorithm, such as the Graphonomy algorithm, to generate a preliminary human body analysis image, which is used as the real analysis image. A larger difference between the prediction analysis image and the real analysis image indicates a larger difference between the shapes of the element parts in the two images, and a smaller difference indicates that those shapes are closer. That is, the analysis loss function compares the prediction analysis image with the real analysis image, so that the shape of each element part in the prediction analysis image continuously approximates the shape of the corresponding element part in the real analysis image.
The class loss function is used to calculate the difference between the predicted clothing category and the real clothing category. It can be understood that the real clothing category may be a one-hot encoded vector: if the total number of clothing categories is N, the real clothing category is an N-dimensional vector in which each element represents the probability of belonging to the corresponding clothing category, the element of the real clothing category is 1, and the elements of the other clothing categories are 0. In the predicted clothing category (also an N-dimensional vector), each element represents the probability of belonging to the corresponding clothing category. A larger difference between the predicted clothing category and the real clothing category indicates a larger difference between the clothes contour in the prediction analysis image and the clothes contour in the real analysis image, and a smaller difference indicates that the contours are closer. The class loss function compares the clothing categories to which the clothes contours in the prediction analysis image and the real analysis image belong, so that the clothing category of the clothes contour in the prediction analysis image continuously approximates the real clothing category, and accordingly the clothes in the prediction analysis image approximate the contour of the real try-on effect of the clothes in the clothes image.
In some embodiments, the aforementioned loss function comprises:
L_loss = α · L_rec + β · L_class
wherein L islossAs a loss function, LrecTo resolve the loss function, LclsssIs a class loss function, alpha and beta are weights, Xi,jFor the analytic class value, Y, of the ith row and jth column of the real analytic imagei,jThe analytic classification value of ith row and jth column of the predictive analytic image, M is the size of the real analytic image or the predictive analytic image, TiProbability value of ith category in real clothing category, PiAnd predicting the probability value of the ith clothing category, wherein N is the total number of the clothing categories.
In this embodiment, the loss of each image group is calculated based on the loss function, the losses corresponding to several image groups are summed (i.e., the loss sum), and iterative training is then performed on the analysis network and the clothing classification network based on the loss sum, which can accelerate network convergence. As a result, when facing clothes of the same category, the converged analysis network (the analysis model) outputs the same human body analysis map.
In summary, the analytic model training method of the embodiment of the present application uses a clothing classification network to discriminate the clothing category of the prediction analysis image, and then, based on the losses corresponding to the plurality of image groups (which reflect the analysis loss and the classification loss) and back propagation, constrains the clothing category to which the clothes area in the prediction analysis image belongs to continuously approach the real clothing category. Because the clothes in the clothes image of an image group and the clothes worn by the model in the model image belong to the same real clothing category, this constraint strengthens the correspondence between the real clothing category and the prediction analysis map, so that clothes images of the same clothing category correspond to the same prediction analysis map. Therefore, when the trained analysis model faces clothes images to be tried on of the same clothing category, it can generate the analysis map corresponding to that clothing category, reducing the interference of the clothes originally worn by the user and yielding an analysis map adapted to the clothing category. If this analysis model is adopted in the fitting process, the deformation effect of the clothes to be tried on, after being deformed according to the clothes area in the analysis map, is real and vivid, which is beneficial to improving the fitting effect.
In addition, in this training method, the clothes in the clothes image of an image group and the clothes worn by the model in the model image only need to belong to the same real clothing category. For example, if the clothes image contains a loose green short-sleeve shirt, the model in the model image only needs to wear clothes of that category and does not have to wear that exact loose green short-sleeve shirt. This reduces the dependency between the clothes in the clothes image and the clothes on the model, and can effectively reduce the difficulty of collecting training data.
After the analytic model is obtained through training by the analytic model training method, the analytic model can be applied to the virtual fitting. The virtual fitting provided by the embodiment of the application can be implemented by various electronic devices with computing processing capacity, such as an intelligent terminal, a server and the like.
The virtual fitting method provided by the embodiment of the present application is described below with reference to exemplary applications and implementations of the terminal provided by the embodiment of the present application. Referring to fig. 9, fig. 9 is a schematic flowchart of a virtual fitting method provided in the embodiment of the present application. The method S200 includes the steps of:
s201: and acquiring an image of the clothes to be tried on and an image of the user.
A fitting assistant (application software) built into a terminal (e.g., a smart phone or a smart fitting mirror) acquires a user image and an image of the clothes to be tried on. The user image may be captured by the terminal or input to the terminal by the user, and the image of the clothes to be tried on may be selected by the user in the fitting assistant.
It will be appreciated that the user image includes the user's body and the image of the clothing to be tried on includes clothing.
S202: and analyzing the human body by adopting an analysis model based on the clothes image to be tried on and the user image to obtain an analysis image.
The analytic model is obtained by training by adopting any method for training the analytic model in the training embodiment.
The fitting assistant built in the terminal comprises an analysis model which can be called to carry out human body analysis. And inputting the clothes image to be tried on and the user image into an analysis model, and carrying out human body analysis on the user image based on the clothes type of the clothes in the clothes image by the analysis model to obtain an analysis image corresponding to the clothes type.
It can be understood that the analytic model is obtained by training through the method for training the analytic model in the foregoing embodiment, and has the same structure and function as the analytic model in the foregoing embodiment, and details are not repeated here.
It can be understood that the shape of the clothing region in the analysis image matches the category of the clothing to be tried on, and thus, the shape of the clothing region in the analysis image conforms to the real fitting effect of the clothing to be tried on.
S203: and fusing the user image and the clothes image to be tried on, and deforming the clothes image to be tried on based on the clothes area in the analysis image in the fusion process to obtain the clothes image to be tried on.
In some embodiments, the analysis image, the clothes image to be tried on, and the user image are input into a generative adversarial network, and the user image and the clothes image to be tried on are fused to generate the fitting image. In the fusion process, the analysis image constrains the boundary information of the tried-on clothes, so that the tried-on clothes deform according to the clothes area in the analysis image. Therefore, the fitting image obtained by the fusion conforms to the real try-on effect of the clothes to be tried on.
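A minimal sketch of this fusion call is shown below; the tiny convolutional generator and the channel layout (1-channel analysis image plus two 3-channel images) are placeholders and assumptions, not the generative adversarial network actually used.

import torch
import torch.nn as nn

class TryOnGenerator(nn.Module):
    """Placeholder generator: takes the concatenated analysis image, clothes
    image and user image and outputs a 3-channel fitting image."""
    def __init__(self, in_ch: int = 7, out_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, out_ch, 3, padding=1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

parse_img = torch.rand(1, 1, 512, 512)   # analysis image (label map as 1 channel)
clothes   = torch.rand(1, 3, 512, 512)   # clothes to be tried on
user_img  = torch.rand(1, 3, 512, 512)   # user image
fitting_img = TryOnGenerator()(torch.cat([parse_img, clothes, user_img], dim=1))
print(fitting_img.shape)  # torch.Size([1, 3, 512, 512])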
Using a generative adversarial network for image generation is a conventional technique known to those skilled in the art and will not be described in detail here.
In this embodiment, when the analysis model faces clothes images to be tried on of the same clothing category, it can generate the analysis image corresponding to that clothing category, so that the interference of the clothes originally worn by the user can be reduced and an analysis image adapted to the clothing category can be generated. The analysis image obtained by the analysis model constrains the boundary information of the clothes to be tried on during the fusion of the clothes image to be tried on and the user image, so that the deformation effect of the clothes to be tried on, after being deformed according to the clothes area in the analysis image, is real and vivid, and the fitting image obtained by the fusion conforms to the real try-on effect of the clothes on the user.
Embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions for causing an electronic device to execute a method for training an analytic model provided in an embodiment of the present application, for example, a method for training an analytic model as shown in fig. 3 to 8, or a method for virtual fitting as provided in an embodiment of the present application, for example, a method for virtual fitting as shown in fig. 9.
In some embodiments, the storage medium may be a memory such as an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device (a device including a smart terminal and a server), or on multiple computing devices located at one site, or distributed across multiple sites interconnected by a communication network.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions, and the program instructions, when executed by a computer, cause the computer to execute a method for training an analytic model or a virtual fitting method as in the foregoing embodiments.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; within the context of the present application, where technical features in the above embodiments or in different embodiments can also be combined, the steps can be implemented in any order and there are many other variations of the different aspects of the present application as described above, which are not provided in detail for the sake of brevity; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and these modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of training an analytical model, comprising:
acquiring a plurality of image groups, wherein the image groups comprise a clothes image and a model image, one image group is marked with a real clothes category, the clothes category in the clothes image is the real clothes category, and the category of clothes worn by a model in the model image is the real clothes category;
analyzing the human body by adopting an analysis network based on the image group to obtain a prediction analysis image;
inputting the predictive analysis image into a clothing classification network to obtain a corresponding predictive clothing category;
and calculating the loss corresponding to the image group by adopting a loss function, performing iterative training on the analysis network and the clothing classification network according to the loss sums corresponding to the plurality of image groups until convergence, and taking the converged analysis network as the analysis model, wherein the loss corresponding to the image group reflects the analysis loss of the analysis network and the classification loss of the clothing classification network.
2. The method of claim 1, wherein performing human body analysis based on the image group using an analysis network to obtain a predictive analysis image comprises:
detecting key points of a human body on the model images in the image group to obtain key point images;
and the analysis network carries out human body analysis based on the image group and the key point image to obtain the prediction analysis image.
3. The method of claim 2, further comprising:
analyzing the human body of the model image in the image group by adopting an analysis algorithm to obtain a real analysis image;
extracting a trunk area of the model from the real analytic image to obtain a local analytic image;
the analyzing network analyzes the human body based on the clothes image and the key point image in the image group to obtain the prediction analysis image, and the analyzing network comprises the following steps:
and splicing and combining the local analysis image, the key point image and the clothes image in the image group, inputting a combined image obtained by splicing and combining into the analysis network for human body analysis, and obtaining the prediction analysis image.
4. The method according to any one of claims 1-3, wherein the parsing network comprises an encoder and a decoder, the encoder comprising a plurality of concatenated first convolutional layers, each of the first convolutional layers configured with a plurality of convolutional kernels of different sizes;
the first convolution layer is used for respectively performing downsampling feature extraction on the input image by adopting the plurality of convolution kernels, and performing channel splicing on the intermediate feature maps obtained by extracting the plurality of convolution kernels to obtain output downsampling feature maps.
5. The method of claim 4, wherein the garment classification network comprises a plurality of second convolutional layers, fully-connected layers, and classification layers in cascade;
the second convolution layer is used for performing downsampling feature extraction on the input image; the full connection layer is used for carrying out feature classification on the input feature image and outputting a feature vector; and the classification layer is used for carrying out probability conversion on the input feature vectors to obtain the predicted clothing category.
6. The method of claim 1, wherein the loss function comprises an analytical loss function and a class loss function, wherein the analytical loss function is used to calculate the difference between the predicted analytical image and a real analytical image, the real analytical image is obtained by human body analysis through the model image, and the class loss function is used to calculate the difference between the predicted clothing class and the real clothing class.
7. The method of claim 6, wherein the loss function comprises:
L_loss = α · L_rec + β · L_class
wherein L islossAs a function of said loss, LrecAs a function of said analytical loss, LclassFor the class loss function, α and β are weights, Xi,jAn analytic class value, Y, of the ith row and jth column of the real analytic imagei,jAn analytic class value of the ith row and the jth column of the predictive analytic image, M is the size of the real analytic image or the predictive analytic image, TiProbability value, P, of ith category in the real clothing categoriesiAnd N is the probability value of the ith type in the predicted clothing types, and is the total number of the clothing types.
8. A virtual fitting method, comprising:
acquiring an image of clothes to be tried on and an image of a user;
analyzing a human body by adopting an analytical model based on the clothes image to be tried on and the user image to obtain an analytical image, wherein the analytical model is obtained by adopting a method for training the analytical model according to any one of claims 1 to 6;
and fusing the user image and the clothes image to be tried on, and deforming the clothes image to be tried on based on the clothes area in the analysis image during the fusion, to obtain a fitting image.
9. An electronic device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer-readable storage medium having stored thereon computer-executable instructions for causing a computer device to perform the method of any one of claims 1-8.
CN202210785212.XA 2022-07-05 2022-07-05 Analytical model training method, virtual fitting method and related device Pending CN115273140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210785212.XA CN115273140A (en) 2022-07-05 2022-07-05 Analytical model training method, virtual fitting method and related device


Publications (1)

Publication Number Publication Date
CN115273140A true CN115273140A (en) 2022-11-01


Country Status (1)

Country Link
CN (1) CN115273140A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination