CN114663552B - Virtual fitting method based on 2D image - Google Patents

Virtual fitting method based on 2D image

Info

Publication number
CN114663552B
CN114663552B
Authority
CN
China
Prior art keywords
image
graph
semantic
network
clothing
Prior art date
Legal status
Active
Application number
CN202210573730.5A
Other languages
Chinese (zh)
Other versions
CN114663552A (en)
Inventor
姜明华
史衍康
余锋
花爱玲
周昌龙
宋坤芳
Current Assignee
Wuhan Textile University
Original Assignee
Wuhan Textile University
Priority date
Filing date
Publication date
Application filed by Wuhan Textile University
Priority to CN202210573730.5A
Publication of CN114663552A
Application granted
Publication of CN114663552B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00 Indexing scheme for image generation or computer graphics
    • G06T2210/16 Cloth

Abstract

The invention discloses a virtual fitting method based on a 2D image, and relates to the technical field of virtual fitting. The method comprises the following steps: inputting the image data required for virtual fitting and generating a fitting region semantic graph, a non-fitting region semantic graph, and a clothing correction graph through a preprocessing operation; generating a post-fitting human body semantic segmentation graph through an end-to-end semantic splicing network under the constraint of an introduced integrated discriminator and multi-level loss function; predicting a deformed clothing image through a clothing deformation network under the constraint of an introduced frequency domain classifier and regular constraint loss function; splicing the image of the try-on person with the deformed clothing image to obtain a rough fitting result graph; and computing and reducing the artifact region through a normalized network to finally obtain a fine fitting result image. The method solves the prior-art problem of artifacts caused by misalignment between the clothing and the human body, so that the generated picture retains more detail and a high-resolution fitting picture is produced.

Description

Virtual fitting method based on 2D image
Technical Field
The present invention relates to the field of virtual fitting technology, and more particularly, to a virtual fitting method based on 2D images.
Background
In recent years, the rapid development of digital technology and the continuous upgrading of artificial intelligence have provided important momentum and support for the development of virtual fitting technology. With the boom of the stay-at-home economy, everyday clothing purchases have also moved online. Virtual fitting enables sellers to display the strengths and weaknesses of clothing more objectively, so that both parties to a transaction can obtain information more intuitively, which facilitates the transaction, reduces unnecessary workload, improves working efficiency, and meets user needs.
At present, many research institutions at home and abroad are conducting related research, but 2D synthesized images are still limited to low resolution, which is a key factor affecting the satisfaction of online consumers. As image resolution increases, artifacts in the misaligned regions between the deformed clothing and the human body become obvious in the final result, and the clarity of the body parts and the clothing texture deteriorates, which seriously affects the fitting effect.
Chinese patent publication No. CN112232914A discloses a four-stage virtual fitting method and apparatus based on 2D images, which assists in generating a fitting composite image by predicting the post-fitting semantic segmentation image and an arm image, alleviating the misalignment artifacts of the prior art. However, that technique cannot realize virtual fitting on high-resolution 2D images, and it preserves the texture clarity of high-resolution images poorly.
Disclosure of Invention
Aiming at the defects or the improvement requirements of the prior art, the invention provides a virtual fitting method based on a 2D image, aiming at solving the problem of artifacts in a high-resolution fitting image, keeping more texture details and improving the fitting effect.
To achieve the above object, according to an aspect of the present invention, there is provided a virtual fitting method based on a 2D image, including the steps of:
(1) inputting a 2D image, a human body semantic segmentation image and a clothing image of a try-on person, dividing the human body semantic segmentation image into a try-on semantic area and a non-try-on semantic area through preprocessing operation, and performing masking and correction processing on the clothing image to generate a clothing mask image and a clothing correction image;
(2) inputting a clothing semantic graph and a non-fitting area semantic graph, and generating a human body semantic segmentation graph after fitting through an end-to-end semantic splicing network under the constraint of introducing an integrated discriminator and a multi-level loss function;
(3) predicting a deformed clothing image under the constraint of introducing a frequency domain classifier and a regular constraint loss function through a clothing deformation network according to the tried-on semantic segmentation image, the human body posture image and the clothing image;
(4) according to the image of the try-on person, coloring the semantic graph of the non-try-on area to obtain a colored graph of the non-try-on area, and splicing the colored graph with the deformed clothing image to obtain a rough try-on result graph;
(5) inputting the rough fitting result image and the semantic segmentation image after fitting into a normalized network, calculating and reducing an artifact area, and finally obtaining a fine fitting result image.
Preferably, in the virtual fitting method based on 2D images, the preprocessing in step (1) comprises two parts. In the first part, the input human body semantic segmentation graph contains 7 kinds of semantic information, in order: hair, face, neck, hand, lower garment, clothing, and arm, each labeled with a different pixel value, which can be expressed as Y = {a1, a2, a3, …, a7}. According to these pixel values, the preprocessing divides the human body semantic segmentation graph into the non-fitting region semantic graph Y1 = {a1, a2, a3, a4, a5}, which contains the semantic information of hair, face, neck, hand, and lower garment, and the fitting region semantic graph Y2 = {a6, a7}, which contains the semantic information of clothing and arm. In the second part, the input clothing image is masked, that is, the pixel value of the clothing region is set to 0 and the pixel value of the background to 255, yielding the clothing mask image; the clothing image is then transformed by an affine transformation, according to the position information of the pixel points in the fitting region semantic graph, so that its outline is consistent with the outline of the fitting region semantic graph. This yields the clothing correction graph and prepares for the subsequent clothing deformation; a minimal sketch of this preprocessing follows.
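The following Python sketch illustrates the two preprocessing parts; the label values a1 to a7 and the near-white background heuristic for the mask are assumptions, since the patent does not fix them:

import numpy as np

# Assumed pixel labels a1..a7 for hair, face, neck, hand,
# lower garment, clothing, arm (illustrative values only).
LABELS = {"hair": 1, "face": 2, "neck": 3, "hand": 4,
          "lower_garment": 5, "clothing": 6, "arm": 7}
NON_FITTING = [LABELS[k] for k in ("hair", "face", "neck", "hand", "lower_garment")]
FITTING = [LABELS["clothing"], LABELS["arm"]]

def split_semantic_map(seg):
    """Split the body semantic map Y into the non-fitting region
    map Y1 and the fitting region map Y2 by pixel value."""
    y1 = np.where(np.isin(seg, NON_FITTING), seg, 0)
    y2 = np.where(np.isin(seg, FITTING), seg, 0)
    return y1, y2

def clothing_mask(clothing_rgb):
    """Clothing pixels -> 0, background -> 255, as stated in the
    patent; a near-white product-photo background is assumed."""
    background = np.all(clothing_rgb >= 250, axis=-1)
    return np.where(background, 255, 0).astype(np.uint8)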
Preferably, the semantic splicing network in the virtual fitting method based on 2D images is an end-to-end network. Its inputs are the clothing semantic graph and the non-fitting region semantic graph; image features are extracted and downsampled by convolution layers, the compressed image is expanded back to its original size by deconvolution layers to obtain a rough tried-on human body semantic segmentation graph, and this rough graph is finally input to an integrated discriminator for similarity calculation. After multiple rounds of semantic splicing, the tried-on human body semantic segmentation graph with the highest similarity is output. A multi-level loss function is introduced during semantic splicing to apply fine-grained optimization to the prediction and encourage the semantic splicing network to generate finer detail.
Preferably, the integrated discriminator is assembled from n deep convolutional networks with identical structure, where n depends on the resolution of the input picture; each deep convolutional network consists of a pooling layer, a convolution layer, a normalization layer, and an activation layer connected in sequence, as sketched below. The inputs of the integrated discriminator are a real picture and a predicted picture, and its output is the feature similarity of the two pictures. The n deep convolutional networks distinguish the real image from the predicted image at n different scales, and the influence of each network's prediction on the final result is controlled by a coefficient, encouraging the semantic splicing network to generate more detail. Each deep convolutional network corresponds to one feature matching loss function, and together these form the multi-level loss function of the integrated discriminator.
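One way to realize such an integrated discriminator is sketched below in PyTorch; the layer widths, the choice of pooling, normalization, and activation operators, and n = 3 are assumptions:

import torch
import torch.nn as nn

class ScaleDiscriminator(nn.Module):
    """One deep convolutional network: pooling, convolution,
    normalization, and activation layers in sequence."""
    def __init__(self, in_ch):
        super().__init__()
        layers, ch = [], in_ch
        for out_ch in (64, 128, 256):          # illustrative widths
            layers += [nn.AvgPool2d(2),
                       nn.Conv2d(ch, out_ch, 3, padding=1),
                       nn.InstanceNorm2d(out_ch),
                       nn.LeakyReLU(0.2)]
            ch = out_ch
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x):
        feats = self.body(x)
        return self.head(feats), feats         # score and features for L_FM

class IntegratedDiscriminator(nn.Module):
    """n structurally identical networks judging the picture at n
    progressively downsampled scales."""
    def __init__(self, in_ch, n=3):
        super().__init__()
        self.nets = nn.ModuleList([ScaleDiscriminator(in_ch) for _ in range(n)])
        self.down = nn.AvgPool2d(2)

    def forward(self, x):
        outputs = []
        for net in self.nets:
            outputs.append(net(x))
            x = self.down(x)                   # next, coarser scale
        return outputs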
Preferably, the multi-level loss function is represented as:
L_multi = Σ_{k=1}^{n} μ_k · L_FM(D_k, G)

where L_FM(D_k, G) denotes the feature matching loss of deep convolutional network D_k for images generated by the semantic splicing network G, and μ_k is a coefficient that controls the influence of the k-th scale's deep convolutional network loss on the multi-level loss function.
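In code, the multi-level loss is then a coefficient-weighted sum of per-scale feature matching losses over the discriminator features; L1 feature matching and equal weights μ_k = 1 are assumptions:

import torch.nn.functional as F

def multi_level_loss(real_feats, fake_feats, mu=(1.0, 1.0, 1.0)):
    """Sum of mu_k * L_FM(D_k, G) over the n scales; the feature
    lists come from the integrated discriminator applied to the
    real and the generated picture."""
    total = 0.0
    for mu_k, fr, ff in zip(mu, real_feats, fake_feats):
        total = total + mu_k * F.l1_loss(ff, fr.detach())
    return total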
Preferably, the clothing deformation network in the 2D-image-based virtual fitting method is an end-to-end network whose inputs are the tried-on semantic segmentation graph, the clothing image, and the human body posture graph. Under the constraint of the 18 key-point positions in the human body posture graph, pixel information at the key-point positions of the tried-on semantic segmentation graph and of the clothing image is extracted through two separate branches; a frequency domain classifier is introduced, the pixel information of the deformed clothing image is predicted by interpolation, a regular constraint loss function is introduced to control the degree of clothing deformation, and finally the deformed clothing image is output.
The regular constraint loss function is expressed as:

L_reg = Σ [ α·(x′ − x)² + β·(y′ − y)² ]

where the sum runs over the clothing control points, x and x′ are the x values of a coordinate point before and after deformation, y and y′ are its y values before and after deformation, (x′ − x)² is the squared change of the coordinate along the x axis, (y′ − y)² is the squared change along the y axis, and α, β are two learnable parameters that control the degree of clothing deformation along the x and y axes;
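A direct PyTorch transcription of this loss, with the control points given as (N, 2) tensors of coordinates before and after deformation:

import torch

def regular_constraint_loss(before, after, alpha, beta):
    """L_reg = sum over points of alpha*(x'-x)^2 + beta*(y'-y)^2;
    alpha and beta would be nn.Parameter scalars during training."""
    dx2 = (after[..., 0] - before[..., 0]) ** 2
    dy2 = (after[..., 1] - before[..., 1]) ** 2
    return (alpha * dx2 + beta * dy2).sum()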
the loss function of the garment deformation network is expressed as:
Figure 185395DEST_PATH_IMAGE010
in the above formula, the first and second carbon atoms are,
Figure 100002_DEST_PATH_IMAGE011
is a conditional-confrontation-loss function,
Figure 333611DEST_PATH_IMAGE012
is a function of the cross-entropy loss of the pixel,
Figure 100002_DEST_PATH_IMAGE013
is a function of the loss of the regular constraint,λ 1λ 2λ 3 is a coefficient for adjusting the three loss functions to the total loss functionThe influence of (c).
Preferably, the frequency domain classifier divides the features in frequency domain space by setting a threshold. On the one hand, more attention is placed on the generated low-frequency feature information to enhance the content information of the image; on the other hand, noise points are distinguished from high-frequency information with the help of the tried-on semantic segmentation graph so that more useful high-frequency information is extracted, and the texture fineness of the high-frequency regions of the image is enhanced by interpolation, improving the quality of the generated image. The specific process of the frequency domain classifier is as follows:
first, the spatial domain feature map of the clothing image is extracted by the convolution layers, and the frequency domain feature map is obtained with the discrete Fourier transform, which can be expressed as:

F_{i,j} = DFT(S_{i,j})

where S_{i,j} is the spatial feature component in row i, column j of the spatial domain feature map, F_{i,j} is the frequency feature component in row i, column j of the frequency domain feature map, and DFT() is the discrete Fourier transform function;
then, the importance weight of each frequency feature component in the frequency domain feature map is obtained by learning through a neural network structure; a frequency feature component whose weight is greater than the threshold is identified as a low-frequency information feature component, and one whose weight is less than the threshold is identified as a high-frequency information feature component, which can be expressed as:

z = 0 if k_{i,j} > λ (low-frequency information feature component)
z = 1 if k_{i,j} < λ (high-frequency information feature component)

where F^z_{i,j} denotes the frequency feature component in row i, column j with category z, k_{i,j} is the importance weight of that component, and λ is the threshold;
and finally, noise information in the high-frequency information feature channel is further screened using the position information in the tried-on semantic segmentation graph; image noise is filtered with a low-pass filter, and image content and texture information are filled in by interpolation, as the sketch below illustrates.
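A minimal sketch of the frequency-domain split with torch.fft; the network producing the importance weights k_{i,j} is assumed to exist, and the semantic-map-guided noise screening is reduced here to the threshold mask:

import torch

def frequency_split(feat, weight, lam):
    """feat: (H, W) spatial-domain feature map S; weight: (H, W)
    learned importance weights k_ij; lam: threshold lambda.
    Splits the frequency components and returns the low- and
    high-frequency parts back in the spatial domain."""
    freq = torch.fft.fft2(feat)                # discrete Fourier transform
    mask = (weight > lam).to(freq.dtype)       # z = 0: low-frequency part
    low = torch.fft.ifft2(freq * mask).real
    high = torch.fft.ifft2(freq * (1 - mask)).real
    return low, high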
Optionally, the spline interpolation method includes: tension spline interpolation, regular spline interpolation, and thin plate spline interpolation.
Preferably, in the virtual fitting method based on 2D images, step (4) comprises: first, the pixel values of the try-on person image are extracted and reassigned to the corresponding regions of the non-fitting region semantic graph, recoloring it into the non-fitting region color graph (a minimal version of this recoloring and splicing is sketched below); then the non-fitting region color graph and the deformed clothing image are input together into the semantic splicing network to obtain the rough fitting result graph. In the semantic splicing network, the features of the non-fitting region color graph and of the deformed clothing image are first extracted and downsampled by convolution layers, the compressed image is then expanded back to its original size by deconvolution layers, and the result is finally input to the integrated discriminator for similarity calculation; after multiple rounds of semantic splicing, the fitting result graph with the highest similarity is output. A multi-level loss function is introduced during semantic splicing to apply fine-grained optimization to the prediction and encourage the semantic splicing network to generate finer detail.
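A minimal numpy sketch of the recoloring and a naive splice; the actual method learns the combination inside the splicing network, so the paste here is only a stand-in:

import numpy as np

def recolor_non_fitting(person_rgb, y1):
    """Assign the try-on person's pixel values to the non-fitting
    semantic regions, giving the non-fitting region color graph."""
    colored = np.zeros_like(person_rgb)
    region = y1 > 0
    colored[region] = person_rgb[region]
    return colored

def naive_splice(colored, warped_clothes, clothes_region):
    """Paste the deformed clothing over the color graph; a stand-in
    for the semantic splicing network's learned combination."""
    out = colored.copy()
    out[clothes_region] = warped_clothes[clothes_region]
    return out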
Preferably, the processing procedure for removing artifacts in the virtual fitting method based on 2D images comprises: first, the image features of the rough fitting result graph are extracted through a deep neural network and upsampled to obtain the semantic segmentation graph of the rough fitting result graph, and its difference from the semantic segmentation graph after fitting is computed, which can be expressed as:
Artifact=R-S
where Artifact denotes the pixel information of the artifact region, R denotes the semantic segmentation graph of the rough fitting result graph, and S denotes the semantic segmentation graph after fitting.

This yields an image in which only a small portion of pixel values is greater than 0; the pixels whose value is greater than 0 are the pixels of the artifact region, and the region is located in the rough fitting result graph according to the position information of the artifact region (see the sketch below). Then, through a multi-layer normalized network, two deformation coefficients α and β are learned, which control the degree of reduction of the artifact region in the x-axis and y-axis directions, respectively, and are used to reduce the size of the artifact region. Finally, the adjusted feature picture is upsampled to the size of the original picture through an upsampling layer, and the lost pixel values are filled in by interpolation to generate a finer fitting result picture.
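The difference step itself is a pixelwise subtraction; a minimal numpy sketch, assuming both segmentations use the same integer labels:

import numpy as np

def artifact_mask(coarse_seg, tryon_seg):
    """Artifact = R - S: pixels whose difference is greater than 0
    form the artifact region."""
    diff = coarse_seg.astype(np.int32) - tryon_seg.astype(np.int32)
    return (diff > 0).astype(np.uint8)     # binary artifact-region mask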
Optionally, the interpolation method includes: nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation.
Preferably, the input of the normalized network is the rough fitting result graph and its output is the fine fitting result graph; the network comprises two stages, acquiring the artifact semantic graph and eliminating the artifact information:
The first stage, acquiring the artifact semantic graph, comprises two sub-networks, denoted network 1-1 and network 1-2. Network 1-1 contains n convolution layers, which extract image features, and n deconvolution layers, which map the image back to its original size; the output of each convolution layer serves as input to the corresponding deconvolution layer, the convolution layers are connected to the deconvolution layers through residual structures, and the result is output after the last deconvolution layer. Network 1-2 contains an image difference module whose inputs are two images of the same size and whose output is the result graph of their difference. The value of n is related to the size of the input picture; in general, for a 1024 × 768 input, n = 3 or 4 gives good experimental results.
The second stage, eliminating the artifact information, comprises one network, 2-1, which consists of two parallel branches: the first branch is a convolution layer, an activation layer, a region regularization layer, a downsampling layer, an upsampling layer, and an addition layer connected in series; the second branch is a single region regularization layer; the final result is output after the addition layer.
The region regularization layer standardizes the artifact region: it first obtains the spatial position of the artifact region, then computes the deformation coefficients α and β through a convolution network, and introduces a planar deformation loss function to control the degree of reduction of the artifact region, preventing the pixel information of the human body from being wrongly modified by excessive shrinking, thereby assisting in reducing the artifact region. The addition layer fuses the output results of the two branches.
Preferably, the planar deformation loss function is expressed as:
L_pd = Σ d( (x, y), (x + α·i, y + β·j) )

where the sum runs over the feature points of the artifact region, (x, y) are the coordinates of a feature point, d(·,·) denotes the Euclidean distance between two nodes, i, j are the deformation amounts, and α, β are the deformation coefficients.
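Under the reconstruction above, the loss can be sketched as follows; this is only one plausible reading of the patent's figure-rendered formula:

import torch

def planar_deformation_loss(points, deform, alpha, beta):
    """points: (N, 2) coordinates (x, y) of artifact feature points;
    deform: (N, 2) deformation amounts (i, j). Sums the Euclidean
    distance between each point and its displaced position
    (x + alpha*i, y + beta*j)."""
    displaced = points + deform * torch.tensor([alpha, beta])
    return torch.linalg.norm(displaced - points, dim=1).sum()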
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) The input picture is preprocessed to distinguish the fitting region from the non-fitting region, so that the pixel information of the non-fitting region is better preserved and the fitting result image retains more detail; the clothing image is spatially corrected, which facilitates its subsequent deformation.
(2) The human body posture graph, which contains more key-point information, assists in generating the semantic segmentation image and the deformed clothing image, which both raises the precision of semantic segmentation and increases the robustness of clothing deformation; the regular constraint loss function and the frequency domain classifier are introduced to solve the problem of clothing texture distortion caused by abnormal twisting during deformation.
(3) The normalized network solves the problem of artifacts caused by mismatch between the clothing and the human body at high resolution, greatly improving the virtual fitting effect of high-resolution 2D images and enhancing the fitting effect and the user experience.
Drawings
Fig. 1 is a schematic flow chart of a virtual fitting method based on a 2D image according to an embodiment of the present invention;
fig. 2 is a normalized network diagram of a virtual fitting method based on a 2D image according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Fig. 1 is a schematic flow chart of a virtual fitting method for a 2D image according to an embodiment; the embodiment provides a virtual fitting method of a 2D image, which comprises the following steps:
(1) inputting a 2D image, a human body semantic segmentation map and a clothing image of a try-on person, dividing the human body semantic segmentation map into a try-on semantic region and a non-try-on semantic region through preprocessing operation, and performing masking and correction processing on the clothing image to generate a clothing mask map and a clothing correction map;
wherein, the pretreatment process comprises two parts: one part, inputting a human body semantic segmentation graph, comprising 7 semantic information which are sequentially: semantic information for hair, face, neck, hand, under-garment, clothing, and arm, and labeling each semantic information with a different pixel value may be expressed as: y = & a 1 , a 2 , a 3 ,…, a 7 Dividing the human body semantic segmentation chart into non-try-on area words according to different semantic information pixel values in the preprocessing processSense graph: y is 1 ={a 1 , a 2 , a 3 , a 4 , a 5 And a semantic map of a try-on area: y is 2 ={a 6 ,a 7 And the semantic graph of the non-fitting area comprises: semantic information of hair, face, neck, hand, lower garment, the fitting region semantic graph includes: semantic information of clothing and arms; and inputting a clothing image, performing mask processing on the clothing image, namely setting the pixel value of a clothing region in the clothing image to be 0 and the pixel value of the background to be 255 to obtain a clothing mask image, and performing affine transformation on the clothing image according to the position information of the pixel points in the semantic graph of the try-on region to enable the outline of the clothing image to be consistent with the outline of the semantic graph of the try-on region, thereby obtaining a clothing correction graph and preparing for subsequent clothing deformation.
(2) Inputting a clothing semantic graph and a non-fitting area semantic graph, generating a fitted human body semantic segmentation graph through an end-to-end semantic splicing network under the constraint of introducing an integrated discriminator and a multi-level loss function, wherein the clothing semantic graph is prepared in advance and is input data of the network;
the semantic splicing network is an end-to-end network, a clothing semantic graph and a non-try-on area semantic graph are input, image features are extracted through a convolution layer and downsampled, a compressed image is expanded to be of the original size through an anti-convolution layer, a rough try-on human body semantic segmentation graph is obtained, the rough try-on human body semantic segmentation graph is input to an integrated discriminator to be subjected to similarity calculation, and a try-on human body semantic segmentation graph with the highest similarity is output after multiple rounds of semantic splicing. And a multi-level loss function is introduced in the semantic splicing process, the prediction result is subjected to fine-grained optimization, and a semantic splicing network is encouraged to generate finer details.
The integrated discriminator is assembled from n deep convolutional networks with identical structure; n depends on the resolution of the input picture, and for a 1024 × 768 input, n = 3 or 4 gives good experimental results. Each deep convolutional network consists of a pooling layer, a convolution layer, a normalization layer, and an activation layer connected in sequence. The inputs of the integrated discriminator are a real picture and a predicted picture, and its output is the feature similarity of the two pictures. The n deep convolutional networks distinguish the real image from the predicted image at n different scales, and the influence of each network's prediction on the final result is controlled by a coefficient, encouraging the semantic splicing network to generate more detail. Each deep convolutional network corresponds to one feature matching loss function, and together these form the multi-level loss function of the integrated discriminator.
Wherein the multi-level loss function is represented as:
L_multi = Σ_{k=1}^{n} μ_k · L_FM(D_k, G)

where L_FM(D_k, G) denotes the feature matching loss (an existing loss) of deep convolutional network D_k for images generated by the semantic splicing network G, and μ_k is a coefficient that controls the influence of the k-th scale's deep convolutional network loss on the multi-level loss function.
(3) Predicting the deformed clothing image through the clothing deformation network, under the constraint of the introduced frequency domain classifier and regular constraint loss function, according to the tried-on semantic segmentation graph, the human body posture graph, and the clothing image; the human body posture graph is prepared in advance and is input data of the network.
The clothing deformation network is an end-to-end network whose inputs are the tried-on semantic segmentation graph, the clothing image, and the human body posture graph. Under the constraint of the 18 key-point positions in the human body posture graph, pixel information at the key-point positions of the tried-on semantic segmentation graph and of the clothing image is extracted through two separate branches; a frequency domain classifier is introduced, the pixel information of the deformed clothing image is predicted by interpolation, a regular constraint loss function is introduced to control the degree of clothing deformation, and finally the deformed clothing image is output. The 18 key points are: nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, left ankle, right eye, left eye, right ear, and left ear.
The frequency domain classifier divides the features in frequency domain space by setting a threshold. On the one hand, more attention is placed on the generated low-frequency feature information to enhance the content information of the image; on the other hand, noise points are distinguished from high-frequency information with the help of the tried-on semantic segmentation graph so that more useful high-frequency information is extracted, and the texture fineness of the high-frequency regions of the image is enhanced by interpolation, improving the quality of the generated image. The specific process of the frequency domain classifier is as follows:
first, the spatial domain feature map of the clothing image is extracted by the convolution layers, and the frequency domain feature map is obtained with the discrete Fourier transform, which can be expressed as:

F_{i,j} = DFT(S_{i,j})

where S_{i,j} is the spatial feature component in row i, column j of the spatial domain feature map, F_{i,j} is the frequency feature component in row i, column j of the frequency domain feature map, and DFT() is the discrete Fourier transform function;
then, the importance weight of each frequency feature component in the frequency domain feature map is obtained by learning through a neural network structure; a frequency feature component whose weight is greater than the threshold is identified as a low-frequency information feature component, and one whose weight is less than the threshold is identified as a high-frequency information feature component, which can be expressed as:

z = 0 if k_{i,j} > λ (low-frequency information feature component)
z = 1 if k_{i,j} < λ (high-frequency information feature component)

where F^z_{i,j} denotes the frequency feature component in row i, column j with category z, k_{i,j} is the importance weight of that component, and λ is the threshold;
and finally, noise information in the high-frequency information feature channel is further screened using the position information in the tried-on semantic segmentation graph; image noise is filtered with a low-pass filter, and image content and texture information are filled in by interpolation.
The spline interpolation methods include: tension spline interpolation, regular spline interpolation, and thin-plate spline interpolation.
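Thin-plate spline interpolation, for instance, is available in SciPy; a small usage sketch with illustrative control points (in the method, the points would come from the key-point branches of the clothing deformation network):

import numpy as np
from scipy.interpolate import RBFInterpolator

# Control points before deformation and their target positions
# (illustrative values only).
src = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
dst = src + np.array([[0.02, 0.0], [0.0, 0.05], [0.03, 0.0],
                      [0.0, 0.02], [0.05, 0.05]])

# Thin-plate spline mapping from source to target coordinates.
tps = RBFInterpolator(src, dst, kernel="thin_plate_spline")
print(tps(np.array([[0.25, 0.75]])))   # deformed position of a query point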
Wherein the regular constraint loss function is represented as:
L_reg = Σ [ α·(x′ − x)² + β·(y′ − y)² ]

where the sum runs over the clothing control points, x and x′ are the x values of a coordinate point before and after deformation, y and y′ are its y values before and after deformation, (x′ − x)² is the squared change of the coordinate along the x axis, (y′ − y)² is the squared change along the y axis, and α, β are two learnable parameters that control the degree of clothing deformation along the x and y axes;
wherein the loss function of the garment deformation network is represented as:
L_warp = λ1·L_cGAN + λ2·L_CE + λ3·L_reg

where L_cGAN is the conditional adversarial loss function (an existing loss), L_CE is the pixel cross-entropy loss function (an existing loss), L_reg is the regular constraint loss function, and λ1, λ2, λ3 are coefficients that adjust the influence of the three loss functions on the total loss function.
(4) According to the image of the try-on person, the non-fitting region semantic graph is colored to obtain the non-fitting region color graph, which is then spliced with the deformed clothing image to obtain the rough fitting result graph.
First, the pixel values of the try-on person image are extracted and reassigned to the corresponding regions of the non-fitting region semantic graph, recoloring it into the non-fitting region color graph; then the non-fitting region color graph and the deformed clothing image are input together into the semantic splicing network to obtain the rough fitting result graph. In the semantic splicing network, the features of the non-fitting region color graph and of the deformed clothing image are first extracted and downsampled by convolution layers, the compressed image is then expanded back to its original size by deconvolution layers, and the result is finally input to the integrated discriminator for similarity calculation; after multiple rounds of semantic splicing, the fitting result graph with the highest similarity is output. A multi-level loss function is introduced during semantic splicing to apply fine-grained optimization to the prediction and encourage the semantic splicing network to generate finer detail.
(5) Inputting the rough fitting result image and the semantic segmentation image after fitting into a normalized network, calculating and reducing an artifact area, and finally obtaining a fine fitting result image.
The processing procedure for removing the artifacts comprises: first, the image features of the rough fitting result graph are extracted through a deep neural network and upsampled to obtain the semantic segmentation graph of the rough fitting result graph, and its difference from the semantic segmentation graph after fitting is computed, a process that can be expressed as:
Artifact=R-S
where Artifact denotes the pixel information of the artifact region, R denotes the semantic segmentation graph of the rough fitting result graph, and S denotes the semantic segmentation graph after fitting.

This yields an image in which only a small portion of pixel values is greater than 0; the pixels whose value is greater than 0 are the pixels of the artifact region, and the region is located in the rough fitting result graph according to the position information of the artifact region. Then, through the normalized network, two deformation coefficients α and β are learned, which control the degree of reduction of the artifact region in the x-axis and y-axis directions, respectively, and are used to reduce the size of the artifact region. Finally, the adjusted feature picture is upsampled to the size of the original picture through an upsampling layer, and the lost pixel values are filled in by interpolation to generate a finer fitting result picture.
Wherein the interpolation method comprises: nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation.
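All three upsampling modes are available in torch.nn.functional.interpolate; for example:

import torch
import torch.nn.functional as F

feat = torch.randn(1, 3, 512, 384)               # adjusted feature picture
for mode in ("nearest", "bilinear", "bicubic"):
    kwargs = {} if mode == "nearest" else {"align_corners": False}
    up = F.interpolate(feat, size=(1024, 768), mode=mode, **kwargs)
    print(mode, tuple(up.shape))                  # -> (1, 3, 1024, 768)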
The normalized network takes the rough fitting result graph as input and outputs the fine fitting result graph; it comprises two stages, acquiring the artifact semantic graph and eliminating the artifact information:
The first stage, acquiring the artifact semantic graph, comprises two sub-networks, denoted network 1-1 and network 1-2. Network 1-1 contains n convolution layers, which extract image features, and n deconvolution layers, which map the image back to its original size; the output of each convolution layer serves as input to the corresponding deconvolution layer, the convolution layers are connected to the deconvolution layers through residual structures, and the result is output after the last deconvolution layer. Network 1-2 contains an image difference module whose inputs are two images of the same size and whose output is the result graph of their difference. The value of n is related to the size of the input picture; in general, for a 1024 × 768 input, n = 3 or 4 gives good experimental results.
The second stage, eliminating the artifact information, comprises one network, 2-1, which consists of two parallel branches: the first branch is a convolution layer, an activation layer, a region regularization layer, a downsampling layer, an upsampling layer, and an addition layer connected in series; the second branch is a single region regularization layer; the final result is output after the addition layer.
The region regularization layer standardizes the artifact region: it first obtains the spatial position of the artifact region, then computes the deformation coefficients α and β through a convolution network, and introduces a planar deformation loss function to control the degree of reduction of the artifact region, preventing the pixel information of the human body from being wrongly modified by excessive shrinking, thereby assisting in reducing the artifact region. The addition layer fuses the output results of the two branches.
Wherein the planar deformation loss function is expressed as:
L_pd = Σ d( (x, y), (x + α·i, y + β·j) )

where the sum runs over the feature points of the artifact region, (x, y) are the coordinates of a feature point, d(·,·) denotes the Euclidean distance between two nodes, i, j are the deformation amounts, and α, β are the deformation coefficients.
Fig. 2 is a normalized network diagram of a virtual fitting method of a 2D image according to an embodiment;
the normalized network is a trial result graph with coarse input and fine output after removing the false shadow. The normalized network consists of two stages: the first stage is used for acquiring an artifact semantic map and comprises 3 convolutional layers, 3 deconvolution layers and a difference calculating module; the second stage is used for eliminating artifact information and comprises two branches, wherein one branch comprises a convolution layer, an activation layer, a region regularization layer, a down-sampling layer and an up-sampling layer, the other branch only comprises the region regularization layer, and finally, the results of the two branches are fused through an addition layer.
The method not only raises the precision of semantic segmentation but also increases the robustness of clothing deformation, allows the fitting result image to retain more detail, and greatly improves the virtual fitting effect of high-resolution 2D images, enhancing the fitting effect and the user experience.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein. The scheme in the embodiments of the present application can be implemented in various computer languages, for example the object-oriented programming language Java and the interpreted scripting language JavaScript.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (5)

1. A virtual fitting method based on 2D images is characterized by comprising the following steps:
step (1), inputting a 2D image, a human body semantic segmentation image, and a clothing image of a try-on person, dividing the human body semantic segmentation image into a fitting region and a non-fitting region through a preprocessing operation, and performing masking and correction processing on the clothing image to generate a clothing mask image and a clothing correction image;
step (2), inputting a clothing semantic graph and a non-fitting area semantic graph, and generating a human body semantic segmentation graph after fitting under the constraint of introducing an integrated discriminator and a multi-level loss function through an end-to-end semantic splicing network;
step (3), predicting a deformed garment image through a garment deformation network, under the constraint of the introduced frequency domain classifier and regular constraint loss function, according to the tried-on semantic segmentation graph, the human body posture graph, and the garment image;
the garment deformation network in step (3) is an end-to-end network: the tried-on semantic segmentation graph, the garment image, and the human body posture graph are input, and under the constraint of a plurality of key-point positions in the human body posture graph, pixel information at the key-point positions of the tried-on semantic segmentation graph and of the garment image is extracted by a deep convolutional network through two separate branches; a frequency domain classifier is introduced, the pixel information of the deformed garment image is predicted by interpolation, a regular constraint loss function is introduced to control the degree of garment deformation, and finally the deformed garment image is output;
the canonical constraint loss function is expressed as:
L_reg = Σ [ α·(x′ − x)² + β·(y′ − y)² ]

where the sum runs over the garment control points, x and x′ are the x values of a coordinate point before and after deformation, y and y′ are its y values before and after deformation, (x′ − x)² is the squared change of the coordinate along the x axis, (y′ − y)² is the squared change along the y axis, and α, β are two learnable parameters that control the degree of garment deformation along the x and y axes;
the loss function of the garment deformation network is expressed as:
L_warp = λ1·L_cGAN + λ2·L_CE + λ3·L_reg

where L_cGAN is the conditional adversarial loss function, L_CE is the pixel cross-entropy loss function, L_reg is the regular constraint loss function, and λ1, λ2, λ3 are coefficients for adjusting the influence of the three loss functions on the total loss function;
in step (3), the frequency domain classifier divides the features in frequency domain space by setting a threshold: on the one hand, more attention is placed on the generated low-frequency feature information to enhance the content information of the image; on the other hand, noise points are distinguished from high-frequency information with the help of the tried-on semantic segmentation graph to extract more useful high-frequency information, and the texture fineness of the high-frequency regions of the image is enhanced by interpolation, improving the quality of the generated image; the specific process of the frequency domain classifier comprises:
first, the spatial domain feature map of the clothing image is extracted by the convolution layers, and the frequency domain feature map is obtained with the discrete Fourier transform, which can be expressed as:

F_{i,j} = DFT(S_{i,j})

where S_{i,j} is the spatial feature component in row i, column j of the spatial domain feature map, F_{i,j} is the frequency feature component in row i, column j of the frequency domain feature map, and DFT() is the discrete Fourier transform function;
then, the importance weight of each frequency feature component in the frequency domain feature map is obtained by learning through a neural network structure; a frequency feature component whose weight is greater than the threshold is identified as a low-frequency information feature component, and one whose weight is less than the threshold is identified as a high-frequency information feature component, which can be expressed as:

z = 0 if k_{i,j} > λ (low-frequency information feature component)
z = 1 if k_{i,j} < λ (high-frequency information feature component)

where F^z_{i,j} denotes the frequency feature component in row i, column j with category z, k_{i,j} is the importance weight of that component, and λ is the threshold;
finally, noise information in the high-frequency information feature channel is further screened using the position information in the tried-on semantic segmentation graph; image noise is filtered with a low-pass filter, and image content and texture information are filled in by interpolation;
step (4), according to the image of the try-on person, coloring the semantic graph of the non-try-on area to obtain a colored graph of the non-try-on area, and then splicing the colored graph with the deformed clothing image to obtain a rough try-on result graph;
and (5) inputting the rough fitting result image and the fitted semantic segmentation image into a normalized network, calculating and reducing an artifact area, and finally obtaining a fine fitting result image.
2. The virtual fitting method based on 2D images as claimed in claim 1, wherein: after the 2D image, the human body semantic segmentation map and the clothing image of the try-on person are input in the step (1), the preprocessing process comprises two parts:
one part: the input human body semantic segmentation graph contains 7 kinds of semantic information, in order: hair, face, neck, hand, lower garment, clothing, and arm, each labeled with a different pixel value, which can be expressed as Y = {a1, a2, a3, …, a7}; according to these pixel values, the preprocessing divides the human body semantic segmentation graph into the non-fitting region semantic graph Y1 = {a1, a2, a3, a4, a5}, containing the semantic information of hair, face, neck, hand, and lower garment, and the fitting region semantic graph Y2 = {a6, a7}, containing the semantic information of clothing and arm;
and inputting a clothing image, performing mask processing on the clothing image, namely setting the pixel value of a clothing region in the clothing image to be 0 and the pixel value of the background to be 255 to obtain a clothing mask image, and performing affine transformation on the clothing image according to the position information of the pixel points in the semantic graph of the try-on region to enable the outline of the clothing image to be consistent with the outline of the semantic graph of the try-on region, thereby obtaining a clothing correction graph and preparing for subsequent clothing deformation.
3. The virtual fitting method based on 2D images as claimed in claim 1, wherein: the semantic splicing network in step (2) is an end-to-end network: the clothing semantic graph and the non-fitting region semantic graph are input, image features are extracted and downsampled by convolution layers, the compressed image is upsampled back to its original size by deconvolution layers to obtain a rough tried-on human body semantic segmentation graph, this rough graph is input to the integrated discriminator for similarity calculation, and after multiple rounds of semantic splicing the tried-on human body semantic segmentation graph with the highest similarity is output; the integrated discriminator is assembled from n deep convolutional networks with identical structure, where n is related to the resolution of the input picture, and each deep convolutional network consists of a pooling layer, a convolution layer, a normalization layer, and an activation layer connected in sequence; the inputs of the integrated discriminator are a real picture and a predicted picture, and its output is the feature similarity of the two pictures; the n deep convolutional networks distinguish the real picture from the predicted picture at n different scales, and the influence of each network's prediction on the final result is controlled by a coefficient, encouraging the semantic splicing network to generate more detail; each deep convolutional network corresponds to one feature matching loss function, and together these form the multi-level loss function of the integrated discriminator, which applies fine-grained optimization to the prediction and encourages the semantic splicing network to generate finer details;
the multi-level loss function is expressed as:
L_multi = Σ_{k=1}^{n} μ_k · L_FM(D_k, G)

where L_FM(D_k, G) denotes the feature matching loss of deep convolutional network D_k for images generated by the semantic splicing network G, and μ_k is a coefficient that controls the influence of the k-th scale's deep convolutional network loss on the multi-level loss function.
4. The virtual fitting method based on 2D images as claimed in claim 1, wherein: the artifact-removal processing in step (5) comprises: first, extracting the image features of the rough fitting result graph through a deep convolutional network, obtaining the semantic segmentation graph of the rough fitting result graph through upsampling, and computing its difference from the semantic segmentation graph after fitting, a process that can be expressed as:
Artifact=R-S
where Artifact denotes the pixel information of the artifact region, R denotes the semantic segmentation graph of the rough fitting result graph, and S denotes the semantic segmentation graph after the try-on;

an image in which only a small portion of pixel values is greater than 0 is thereby obtained; the pixels whose value is greater than 0 are the pixels of the artifact region, and the region is located in the rough fitting result graph according to the position information of the artifact region; then, through the normalized network, two deformation coefficients α and β are learned, which control the degree of reduction of the artifact region in the x-axis and y-axis directions, respectively, and are used to reduce the size of the artifact region; finally, the adjusted feature picture is upsampled to the size of the original picture through an upsampling layer, and the lost pixel values are filled in by interpolation to generate a finer fitting result picture.
5. The virtual fitting method based on 2D images as claimed in claim 1, wherein: in the step (5), the input of the normalization network is a rough fitting result graph, the output is a fine fitting result graph, and the normalization network comprises two stages of acquiring an artifact semantic graph and eliminating artifact information:
the first stage of acquiring an artifact semantic graph comprises two sub-networks which are set as networks 1-1 and 1-2, wherein the network 1-1 comprises n layers of convolution layers, image features are extracted through the convolution layers, n layers of deconvolution layers map images to be in original sizes through deconvolution, the output of each convolution layer is used as the input of each deconvolution layer, the convolution layers are connected with the deconvolution layers through residual error structures, and results are output after the last layer of deconvolution; the network 1-2 comprises an image difference calculating module, two images with the same size are input, and a result graph obtained by calculating the difference of the two images is output; wherein the value of n is related to the size of the input picture;
the second stage of eliminating artifact information comprises a network 2-1, wherein the network 2-1 is composed of two parallel branches, wherein the first branch is formed by sequentially connecting a convolution layer, an activation layer, a region regularization layer, a down sampling layer, an up sampling layer and an addition layer in series, the second branch is formed by a single region regularization layer, and the final result is output after the addition layer;
the region regularization layer is used to normalize the artifact region: it first obtains the spatial position of the artifact region, then computes the deformation coefficients α and β through a convolutional network, and introduces a plane deformation loss function to control the degree of reduction of the artifact region, preventing pixel information of the human body from being wrongly modified by excessive reduction, thereby assisting the shrinking of the artifact region; the addition layer is used to fuse the output results of the two branches;
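The coefficient-prediction part of such a layer could look as follows, under the assumption that α and β are pooled from the artifact-region features and squashed into (0, 1) so that the layer can only shrink, never enlarge, the region; the class name and architecture are illustrative:

```python
import torch.nn as nn

class DeformCoeff(nn.Module):
    """Sketch of the coefficient head inside the region regularization
    layer: a small conv network pools the artifact-region features and
    predicts alpha (x-direction) and beta (y-direction). The sigmoid
    keeps both in (0, 1), i.e. a pure reduction."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(ch, 2),
            nn.Sigmoid())

    def forward(self, feat):
        alpha, beta = self.net(feat).unbind(dim=1)
        return alpha, beta
```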
wherein the plane deformation loss function is defined over the feature points of the artifact region: x and y respectively denote the position coordinates of a feature point of the artifact region, d denotes the Euclidean distance between two nodes, i and j are the deformation amounts, and α and β are the deformation coefficients.
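The formula itself is preserved only as an image in the source. One reconstruction consistent with the quantities just listed, offered purely as a hypothesis and not as the patent's published expression, penalizes how far each feature point is displaced by the scaled deformation:

```latex
% Hypothetical reconstruction (not the patent's published formula):
% sum, over feature points (x, y) of the artifact region, of the
% Euclidean distance d between each point and its deformed position,
% with deformation amounts (i, j) scaled by coefficients (alpha, beta).
\mathcal{L}_{pd} = \sum_{(x,\,y)} d\bigl( (x,\, y),\; (x + \alpha i,\; y + \beta j) \bigr)
```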
CN202210573730.5A 2022-05-25 2022-05-25 Virtual fitting method based on 2D image Active CN114663552B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210573730.5A CN114663552B (en) 2022-05-25 2022-05-25 Virtual fitting method based on 2D image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210573730.5A CN114663552B (en) 2022-05-25 2022-05-25 Virtual fitting method based on 2D image

Publications (2)

Publication Number Publication Date
CN114663552A CN114663552A (en) 2022-06-24
CN114663552B (en) 2022-08-16

Family

ID=82038409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210573730.5A Active CN114663552B (en) 2022-05-25 2022-05-25 Virtual fitting method based on 2D image

Country Status (1)

Country Link
CN (1) CN114663552B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937964B (en) * 2022-06-27 2023-12-15 北京字跳网络技术有限公司 Method, device, equipment and storage medium for estimating gesture
CN115496990B (en) * 2022-11-18 2023-03-24 武汉纺织大学 Deep learning-based garment image analysis method
CN115761143B (en) * 2022-12-07 2023-09-19 武汉纺织大学 3D virtual reloading model generation method and device based on 2D image


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9189886B2 (en) * 2008-08-15 2015-11-17 Brown University Method and apparatus for estimating body shape
DE102015213832B4 (en) * 2015-07-22 2023-07-13 Adidas Ag Method and device for generating an artificial image
US9996763B2 (en) * 2015-09-18 2018-06-12 Xiaofeng Han Systems and methods for evaluating suitability of an article for an individual
CN111066060A (en) * 2017-07-13 2020-04-24 资生堂美洲公司 Virtual face makeup removal and simulation, fast face detection, and landmark tracking
US11321769B2 (en) * 2018-11-14 2022-05-03 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for automatically generating three-dimensional virtual garment model using product description
RU2019125602A (en) * 2019-08-13 2021-02-15 Общество С Ограниченной Ответственностью "Тексел" COMPLEX SYSTEM AND METHOD FOR REMOTE SELECTION OF CLOTHES
US11080817B2 (en) * 2019-11-04 2021-08-03 Adobe Inc. Cloth warping using multi-scale patch adversarial loss

Patent Citations (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103597519A (en) * 2011-02-17 2014-02-19 麦特尔有限公司 Computer implemented methods and systems for generating virtual body models for garment fit visualization
CN104813340A (en) * 2012-09-05 2015-07-29 体通有限公司 System and method for deriving accurate body size measures from a sequence of 2d images
KR20140125507A (en) * 2013-04-19 2014-10-29 정상학 Virtual fitting apparatus and method using digital surrogate
CN103578004A (en) * 2013-11-15 2014-02-12 西安工程大学 Method for displaying virtual fitting effect
JP2016038812A (en) * 2014-08-08 2016-03-22 株式会社東芝 Virtual try-on apparatus, virtual try-on method and a program
CN104156966A (en) * 2014-08-11 2014-11-19 石家庄铁道大学 Pseudo 3D real-time virtual fitting method based on mobile terminal
WO2016109884A1 (en) * 2015-01-05 2016-07-14 Valorbec Limited Partnership Automated recommendation and virtualization systems and methods for e-commerce
EP3091510A1 (en) * 2015-05-06 2016-11-09 Reactive Reality GmbH Method and system for producing output images and method for generating image-related databases
GB201510752D0 (en) * 2015-06-18 2015-08-05 Morris Gary Personalized garment image processing
CN105354876A (en) * 2015-10-20 2016-02-24 何家颖 Mobile terminal based real-time 3D fitting method
WO2019193467A1 (en) * 2018-04-05 2019-10-10 Page International Fz Llc Method and device for the virtual try-on of garments based on augmented reality with multi-detection
JP2020097803A (en) * 2018-12-18 2020-06-25 成衛 貝田 Virtual fitting system
WO2020131518A1 (en) * 2018-12-19 2020-06-25 Seddi, Inc. Learning-based animation of clothing for virtual try-on
CN109740529A (en) * 2018-12-29 2019-05-10 广州二元科技有限公司 A kind of virtual fit method drawn based on neural network
WO2021008166A1 (en) * 2019-07-17 2021-01-21 北京京东尚科信息技术有限公司 Method and apparatus for virtual fitting
CN111787242A (en) * 2019-07-17 2020-10-16 北京京东尚科信息技术有限公司 Method and apparatus for virtual fitting
CN110852941A (en) * 2019-11-05 2020-02-28 中山大学 Two-dimensional virtual fitting method based on neural network
KR20210099353A (en) * 2020-02-04 2021-08-12 엔에이치엔 주식회사 Clothing virtual try-on service method on deep-learning and apparatus thereof
CN111768472A (en) * 2020-05-29 2020-10-13 北京沃东天骏信息技术有限公司 Virtual fitting method and device and computer-readable storage medium
CN111709874A (en) * 2020-06-16 2020-09-25 北京百度网讯科技有限公司 Image adjusting method and device, electronic equipment and storage medium
WO2022002961A1 (en) * 2020-06-29 2022-01-06 L'oréal Systems and methods for improved facial attribute classification and use thereof
CN112233222A (en) * 2020-09-29 2021-01-15 深圳市易尚展示股份有限公司 Human body parametric three-dimensional model deformation method based on neural network joint point estimation
CN112232914A (en) * 2020-10-19 2021-01-15 武汉纺织大学 Four-stage virtual fitting method and device based on 2D image
CN112330580A (en) * 2020-10-30 2021-02-05 北京百度网讯科技有限公司 Method, device, computing equipment and medium for generating human body clothes fusion image
CN112613439A (en) * 2020-12-28 2021-04-06 湖南大学 Novel virtual fitting network
CN113012303A (en) * 2021-03-10 2021-06-22 浙江大学 Multi-variable-scale virtual fitting method capable of keeping clothes texture characteristics
CN113052980A (en) * 2021-04-27 2021-06-29 云南大学 Virtual fitting method and system
CN113781164A (en) * 2021-08-31 2021-12-10 深圳市富高康电子有限公司 Virtual fitting model training method, virtual fitting method and related device
CN114419335A (en) * 2022-01-06 2022-04-29 百果园技术(新加坡)有限公司 Training and texture migration method of texture recognition model and related device

Also Published As

Publication number Publication date
CN114663552A (en) 2022-06-24

Similar Documents

Publication Publication Date Title
CN114663552B (en) Virtual fitting method based on 2D image
Li et al. Low-light image enhancement via progressive-recursive network
CN110097609B (en) Sample domain-based refined embroidery texture migration method
CN112614077B (en) Unsupervised low-illumination image enhancement method based on generation countermeasure network
CN108830818B (en) Rapid multi-focus image fusion method
CN110570377A (en) group normalization-based rapid image style migration method
CN108932536A (en) Human face posture method for reconstructing based on deep neural network
CN110853119B (en) Reference picture-based makeup transfer method with robustness
CN114782298B (en) Infrared and visible light image fusion method with regional attention
CN112232914A (en) Four-stage virtual fitting method and device based on 2D image
CN112163994A (en) Multi-scale medical image fusion method based on convolutional neural network
CN113222875A (en) Image harmonious synthesis method based on color constancy
Li et al. Flexible piecewise curves estimation for photo enhancement
Guo et al. Multifeature extracting CNN with concatenation for image denoising
Guo et al. Deep illumination-enhanced face super-resolution network for low-light images
CN113129236B (en) Single low-light image enhancement method and system based on Retinex and convolutional neural network
CN113920014A (en) Neural-networking-based combined trilateral filter depth map super-resolution reconstruction method
Huang et al. Infrared image super-resolution via heterogeneous convolutional WGAN
CN112686817B (en) Image completion method based on uncertainty estimation
Kim et al. Detail restoration and tone mapping networks for x-ray security inspection
Yao et al. A multi-expose fusion image dehazing based on scene depth information
Zeng Low-light image enhancement algorithm based on lime with pre-processing and post-processing
CN112734655B (en) Low-light image enhancement method for enhancing CRM (customer relationship management) based on convolutional neural network image
Fu et al. Deep Decomposition Network for Image Processing: A Case Study for Visible and Infrared Image Fusion
Yu et al. Decolorization algorithm based on contrast pyramid transform fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant