CN114067088A - Virtual wearing method, device, equipment, storage medium and program product


Info

Publication number
CN114067088A
CN114067088A
Authority
CN
China
Prior art keywords
human body
wearing
target
information
target image
Prior art date
Legal status
Pending
Application number
CN202111356765.5A
Other languages
Chinese (zh)
Inventor
Li An
Li Yule
Xiang Wei
Current Assignee
Bigo Technology Singapore Pte Ltd
Original Assignee
Bigo Technology Singapore Pte Ltd
Priority date
Filing date
Publication date
Application filed by Bigo Technology Singapore Pte Ltd
Priority to CN202111356765.5A
Publication of CN114067088A
Priority to PCT/CN2022/132132 (WO2023088277A1)
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 - Manipulating 3D models or images for computer graphics
    • G06T19/006 - Mixed reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q30/00 - Commerce
    • G06Q30/06 - Buying, selling or leasing transactions
    • G06Q30/0601 - Electronic shopping [e-shopping]
    • G06Q30/0641 - Shopping interfaces
    • G06Q30/0643 - Graphical representation of items or shoppers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00 - Indexing scheme for image generation or computer graphics
    • G06T2210/16 - Cloth

Abstract

The application discloses a virtual wearing method, a virtual wearing device, equipment, a storage medium and a program product, wherein the method comprises the following steps: acquiring a first target image containing a target human body; acquiring a second target image containing a target wearing article; acquiring human body characteristic information based on the first target image, wherein the human body characteristic information comprises target human body key point information related to the target wearing article, a human body analysis result and a wearing article mask image; and inputting the human body characteristic information and the second target image into a pre-trained virtual wearing network, wherein the virtual wearing network determines human body wearing area information based on the human body characteristic information, determines deformation information of the target wearing article according to the human body wearing area information and the second target image, and generates and outputs a wearing effect graph according to the deformation information and the human body characteristic information. The virtual wearing network is a generative adversarial network, which allows the method to balance virtual wearing effect and realism.

Description

Virtual wearing method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a virtual wearing method, a virtual wearing apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
With the development of the internet, online shopping has become increasingly popular, but compared with offline shopping it still suffers from a poorer experience: for example, purchased clothes cannot be tried on and the wearing effect is unknown, which leads to a high return rate. Virtual fitting technology aims to provide a virtual try-on scenario and bring a better experience to users. It is an important technical direction in the field of computer vision and can be widely applied to e-commerce platforms to improve user experience.
The virtual fitting techniques mentioned in the related art mainly reconstruct a 3D human body and then warp 3D clothing onto the reconstructed body. However, 3D clothing is relatively difficult to acquire, and if the reconstructed 3D body is not realistic enough it degrades the fitting result. Therefore, the virtual fitting techniques in the related art find it difficult to achieve both fitting effect and realism.
Disclosure of Invention
The application provides a virtual wearing method, a virtual wearing device, an electronic device, a storage medium and a program product, aiming to solve the problem that virtual fitting techniques in the prior art find it difficult to achieve both fitting effect and realism.
In a first aspect, an embodiment of the present application provides a method for virtual wearing, where the method includes:
acquiring a first target image containing a target human body;
acquiring a second target image containing a target wearing article;
acquiring human body characteristic information based on the first target image, wherein the human body characteristic information comprises target human body key point information related to the target wearing object, a human body analysis result and a wearing object mask image;
inputting the human body characteristic information and the second target image into a pre-trained virtual wearing network, determining human body wearing area information by the virtual wearing network based on the human body characteristic information, determining deformation information of the target wearing object according to the human body wearing area information and the second target image, and generating and outputting a wearing effect graph according to the deformation information and the human body characteristic information; wherein the virtual wearing network is a generative adversarial network.
In a second aspect, an embodiment of the present application further provides a device that is worn virtually, where the device includes:
the first target image acquisition module is used for acquiring a first target image containing a target human body;
the second target image acquisition module is used for acquiring a second target image containing a target wearing article;
the human body characteristic information acquisition module is used for acquiring human body characteristic information based on the first target image, wherein the human body characteristic information comprises target human body key point information related to the target wearing object, a human body analysis result and a wearing object mask image;
the wearing effect graph generating module is used for inputting the human body characteristic information and the second target image into a pre-trained virtual wearing network, determining human body wearing area information based on the human body characteristic information by the virtual wearing network, determining deformation information of the target wearing object according to the human body wearing area information and the second target image, and generating and outputting a wearing effect graph according to the deformation information and the human body characteristic information; wherein the virtual wearing network is a generative adversarial network.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect described above.
In a fourth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method of the first aspect.
In a fifth aspect, the present application further provides a computer program product, which includes computer-executable instructions, and when executed, the computer-executable instructions are configured to implement the method of the first aspect.
The technical solution provided by this application has the following beneficial effects:
in the virtual wearing network, the wearing effect graph is produced in three steps: human body wearing area information describing where the target wearing object sits on the target human body is determined based on the human body characteristic information; deformation information of the target wearing object is then determined by combining the human body wearing area information and the second target image; and finally the wearing effect graph is generated and output according to the deformed target wearing object and the human body characteristic information. Because the virtual wearing network used in this embodiment is a generative adversarial network, this three-step process allows the output wearing effect graph to achieve both wearing effect and realism, further improving user experience. Moreover, the whole process is two-dimensional image processing; compared with a 3D approach, no 3D human body or 3D clothing needs to be reconstructed, which reduces the implementation difficulty and cost.
Drawings
Fig. 1 is a flowchart of an embodiment of a method for virtual wearing according to an embodiment of the present application;
fig. 2 is a flowchart of an embodiment of a method for virtual wearing provided in the second embodiment of the present application;
fig. 3 is a schematic diagram of a first target image including a target human body according to a second embodiment of the present application;
fig. 4 is a schematic diagram of a human body key point obtained after key point detection is performed on a first target image according to a second embodiment of the present application;
fig. 5 is a schematic diagram of a preliminary human body analysis result obtained after human body analysis is performed on a first target image according to a second embodiment of the present application;
fig. 6 is a schematic diagram of a mask image of a wearing article obtained after clothes are removed from a target human body in a first target image according to a second embodiment of the present application;
fig. 7 is a schematic diagram of a human body analysis result obtained after clothes are erased from a preliminary human body analysis result according to the second embodiment of the present application;
fig. 8 is a flowchart of an embodiment of a method for virtual wearing provided in the third embodiment of the present application;
fig. 9 is a schematic diagram of a virtual wearable network architecture according to a third embodiment of the present application;
fig. 10 is a schematic diagram of a wearable region generative model architecture provided in the third embodiment of the present application;
fig. 11 is a schematic view of an input/output implementation scenario of a wearing region generative model provided in the third embodiment of the present application;
fig. 12 is a schematic view of an input/output implementation scenario of a Warp model according to a third embodiment of the present application;
FIG. 13 is a schematic diagram of a generative model architecture provided in the third embodiment of the present application;
fig. 14 is a schematic diagram of a StyleGAN2 model architecture provided in the third embodiment of the present application;
FIG. 15 is a schematic diagram of an input/output implementation scenario of a generative model provided in the third embodiment of the present application;
fig. 16 is a block diagram illustrating an embodiment of a virtual wearable device according to a fourth embodiment of the present disclosure;
fig. 17 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of an embodiment of a virtual wearing method provided in an embodiment of the present application. The method may be implemented by a virtual wearing apparatus, which may be integrated into an APP or a Web page according to a development document, so as to provide a virtual wearing function in the APP or the Web page. The terminal where the APP or the Web page runs may include a mobile phone, a tablet computer, a fitting robot, and the like.
The wearing articles of the virtual wearing of the present embodiment may include clothes, trousers, shoes, socks, jewelry, etc., and for easy understanding, the following embodiments all use clothes as an example to illustrate virtual fitting scenes.
The embodiment can be applied to virtual wearing functions of scenes such as e-commerce platforms, short video entertainment, image processing, movie production, live broadcast, games and the like. For example, in the e-commerce platform, after the user selects clothes, a photo containing a person who wants to try on the clothes can be uploaded, and the user can directly see the dressing effect picture of the selected clothes on the person through the virtual fitting function. For another example, given a piece of video, a person in the video needing to try on clothes and clothes needing to try on clothes are specified, and the clothes of the specified person in the video can be changed into the clothes needing to try on through a virtual fitting function in the video application.
As shown in fig. 1, the present embodiment may include the following steps:
step 110, a first target image including a target human body is acquired.
In one example, the first target image may include: an image imported via the virtual wearing function page. For example, after the user triggers the virtual fitting function and enters the virtual fitting function page, the first target image may be imported through an import interface in the page. The first target image is an image containing the target human body that needs fitting; the target human body may be the user or another person, and the first target image may be a self-portrait or any other image, which is not limited in this embodiment.
In another example, the first target image may further include: the target video includes each image frame of the target human body. For example, in a live scene, when a user triggers a virtual fitting function in a live interface and specifies a person needing to try on clothes, an image frame containing the specified person in the live scene may be used as a first target image.
It should be noted that the target human body in the first target image needs to keep the frontal features of the human body as complete as possible, or at least the frontal features of the human body part related to the target wearing object.
Step 120, a second target image comprising a target wearing article is acquired.
For example, the second target image may be an image uploaded by the user containing the target wearing article; or, the second target image may also be an image selected by the user in the wearing article image sequence displayed by the current APP or Web page; alternatively, the second target image may be an image generated by selecting a person in the video by the user and then extracting the target wearing object from the person, and the embodiment does not limit the manner of obtaining the second target image.
It should be noted that the target wearing article in the second target image needs to retain important features such as texture and shape of the wearing article as much as possible.
After the first target image and the second target image are obtained, the two images may be processed into a uniform size, for example by center cropping at a fixed aspect ratio and proportional scaling.
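Illustratively, such uniform-size preprocessing may be sketched as follows (the sketch is not part of the original disclosure; it assumes Python with Pillow, and the 768×1024 target size, bilinear resampling and file names are illustrative assumptions):

from PIL import Image

def center_crop_and_resize(path, target_w=768, target_h=1024):
    # Center-crop to the target aspect ratio, then scale proportionally (assumed sizes).
    img = Image.open(path).convert("RGB")
    w, h = img.size
    target_ratio = target_w / target_h
    if w / h > target_ratio:  # too wide: crop left and right
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                     # too tall: crop top and bottom
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    return img.resize((target_w, target_h), Image.BILINEAR)

# Both inputs are brought to the same resolution before further processing.
person_img = center_crop_and_resize("person.jpg")    # hypothetical file names
garment_img = center_crop_and_resize("garment.jpg")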
Step 130, obtaining human body feature information based on the first target image, wherein the human body feature information includes: target human body key point information, a human body analysis result and a wearing article mask image related to the target wearing article.
The target human body key point information is a human body part detection result which is obtained by detecting key points of a target human body in the first target image and is related to the target wearing object.
Illustratively, the target human body key point information may be key point information of a human body part related to the target wearing object, which is selected after detecting the key point information of the whole human body; alternatively, the target human body key point information may be key point information obtained by directly performing key point detection on a human body part related to the target wearing object, such as a head, a neck, a shoulder, a hand, and the like, which is not limited in this embodiment.
The human body analysis result is obtained by performing human body analysis on the target human body in the first target image. The human body analysis is to divide each part of the human body, and is a semantic division task with fine granularity. For example, the target human body can be divided into parts such as hair, face, clothes, pants, and limbs by human body analysis.
The clothing mask image is an image obtained by blocking a clothing region related to the target clothing in the target human body in the first target image. For example, if the target wearing object is a garment, the wearing object mask image is an image generated by blocking the garment in the first target image. The wearing article mask image obtained by blocking the wearing article region related to the target wearing article is a human body image unrelated to the target wearing article.
Step 140, inputting the human body characteristic information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing area information based on the human body characteristic information, determines deformation information of the target wearing object according to the human body wearing area information and the second target image, and generates a wearing effect graph according to the deformation information and the human body characteristic information for outputting.
In this step, the virtual wearing network may be a pre-trained model, specifically a Generative Adversarial Network (GAN). "Generative" means that the model learns from data and then produces similar data; for example, after a machine looks at pictures of an animal it can generate a new picture of that animal by itself. A generative adversarial network is a deep learning framework in which (at least) two modules, a Generative Model and a Discriminative Model, learn by playing against each other to produce reasonably good output. The discriminative model takes input variables and makes predictions with a certain model; the generative model randomly generates observed data given some implicit information.
After the human body characteristic information is obtained through the analysis of the first target image, the human body characteristic information and the second target image can be input into the virtual wearing network, the virtual wearing network carries out virtual fitting processing, and a wearing effect picture of a target human body wearing the target wearing object is output.
In the virtual wearing network, first, human body wearing area information may be determined based on the human body characteristic information; this is the area on the target human body, determined in conjunction with the posture of the target human body, where the target wearing object will be worn. Then, deformation information of the target wearing object can be determined by combining the human body wearing area information and the second target image, where the deformation information describes how the target wearing object needs to be warped to match the specific area of the target human body. Finally, according to the deformation information and the human body characteristic information, the deformed (warped) target wearing object is attached to the target human body, the original wearing object on the target human body corresponding to the target wearing object is erased, and a wearing effect graph is generated and output.
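Illustratively, the three-stage flow described above can be summarized in the following sketch (not part of the original disclosure; it assumes PyTorch-style modules named region_model, warp_model and generator, a channel-wise concatenation of the inputs, and a hypothetical apply_warp helper):

import torch

def virtual_try_on(keypoints, parsing, mask_image, garment_image,
                   region_model, warp_model, generator):
    # Three-stage virtual wearing pipeline as described above (interfaces are assumed).
    with torch.no_grad():
        # Stage 1: predict the region of the body that the target wearing object will cover.
        body_features = torch.cat([keypoints, parsing, mask_image], dim=1)
        wear_region = region_model(body_features)

        # Stage 2: estimate how the wearing object must deform to fit that region.
        warp_params = warp_model(garment_image, keypoints, wear_region)
        warped_garment = warp_model.apply_warp(garment_image, warp_params)  # hypothetical helper

        # Stage 3: render the wearing effect image from the deformed wearing object
        # and the person features with the original wearing object erased.
        return generator(warped_garment, keypoints, mask_image)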
In this embodiment, after obtaining a first target image including a target human body and a second target image including a target wearing object, which are provided by a user, human characteristic information related to the target human body, such as target human body key point information, a human body analysis result, a wearing object mask image, and the like, may be extracted from the first target image, and then the human characteristic information and the second target image are processed through a virtual wearing network, and a wearing effect map after the target human body wears the target wearing object is output through the virtual wearing network. For the user, the wearing effect graph of the target wearing object worn by the target human body can be obtained only by specifying the target human body and the target wearing object, the operation is simple and fast, and the use experience of the user is improved.
In the virtual wearing network, the wearing effect graph is produced in three steps: human body wearing area information describing where the target wearing object sits on the target human body is determined based on the human body characteristic information; deformation information of the target wearing object is then determined by combining the human body wearing area information and the second target image; and finally the wearing effect graph is generated and output according to the deformed target wearing object and the human body characteristic information. Because the virtual wearing network used in this embodiment is a generative adversarial network, this three-step process allows the output wearing effect graph to achieve both wearing effect and realism, further improving user experience. Moreover, the whole process is two-dimensional image processing; compared with a 3D approach, no 3D human body or 3D clothing needs to be reconstructed, which reduces the implementation difficulty and cost.
Example two
Fig. 2 is a flowchart of an embodiment of a virtual wearing method provided in the second embodiment of the present application, and the embodiment describes a process of acquiring human body feature information more specifically on the basis of the first embodiment, and may include the following steps:
step 210, a first target image including a target human body is acquired.
Step 220, a second target image comprising a target wearing article is acquired.
Step 230, inputting the first target image to a pre-trained human key point detection model, so that the human key point detection model performs key point detection on the target human body in the first target image, and outputting corresponding human key point information.
The human body key points are the key position points of each part of the human body and are important for describing human body posture and predicting human body behavior. They mainly include three points on each arm (wrist, elbow, shoulder), three points on each leg (ankle, knee, hip), a hip point, head points (eyes, chin, vertex), and so on.
In this step, the key point detection of the target human body may be performed by a human body key point detection model. In particular, human body key point detection, also called human body pose estimation, has the task of locating human body key parts, such as head, neck, shoulders, hands, etc., in a given picture. On different data sets, the specific parts to be detected are different, and the number of detected key points is also different.
In one implementation, the human key point detection model may be a deep learning-based model, and different human key point detection models may be trained according to different wearing articles to extract human key points matching with the wearing articles. The training process of the human body key point detection model is not limited in the embodiment, and a person skilled in the art can perform model fitting by adopting a universal training method of the human body key point detection model according to a training target.
In another implementation, the human body key point detection model can also be a pre-trained, general-purpose key point detection model that detects all key points without distinction, refined through multiple rounds of inference and therefore of higher precision. For example, the first target image is input into an existing key point detection model of this kind to obtain a probability distribution map for each human body key point. The first target image may be subjected to different sampling processes according to the actual processing conditions and network structure; for example, a first target image of size 3×256×256 is preprocessed into an N×32×32 feature map through three downsampling and convolution operations. The N×32×32 feature map is then input into a preset Hourglass network, upsampling and convolution operations are performed to obtain a corresponding thermodynamic diagram (heatmap), and the result corresponding to the heatmap is taken as the human body key point detection result.
For example, as shown in fig. 3, a first target image including a target human body is subjected to a keypoint detection by the keypoint detection model, so as to obtain a human body keypoint diagram shown in fig. 4.
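Illustratively, the downsample / Hourglass / heatmap flow can be sketched as follows (not part of the original disclosure; it assumes PyTorch, takes the Hourglass backbone as given, and the argmax decoding of heatmaps into coordinates is an assumption):

import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    # Sketch: three stride-2 convolutions (256 -> 32), an Hourglass backbone, one heatmap per key point.
    def __init__(self, hourglass: nn.Module, num_keypoints: int = 18, feat: int = 64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.hourglass = hourglass                      # symmetric encoder-decoder, assumed given
        self.head = nn.Conv2d(feat, num_keypoints, 1)   # one heatmap per key point

    def forward(self, x):
        return self.head(self.hourglass(self.stem(x)))

def decode_keypoints(heatmaps):
    # Take the argmax of each heatmap as the (x, y) key point location.
    b, k, h, w = heatmaps.shape
    idx = heatmaps.view(b, k, -1).argmax(dim=-1)
    return torch.stack([idx % w, idx // w], dim=-1)     # shape (B, K, 2)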
Step 240, determining target key point information related to the target wearing article from the human body key point information.
In one implementation, if the human body keypoint information is keypoint information for a specified human body part matched with the target wearing object, all the keypoints may be taken as target keypoints. If the human body key point information is key point information of all human body parts or most human body parts, the key point of the specified human body part matched with the target wearing object can be selected as the target key point.
Step 250, inputting the first target image into a pre-trained human body analysis model, so that the human body analysis model performs human body analysis on the target human body in the first target image, and outputting a corresponding preliminary human body analysis result.
Human body parsing aims at precisely locating the human body and further dividing it into multiple semantic regions at the pixel level. For example, a human body can be divided into body parts and clothes by human body analysis. Illustratively, the Human body Parsing model may include a Human Parsing model, for example, for the first target image of fig. 3, after passing through the Human Parsing model, the output preliminary Human body Parsing result may be as shown in fig. 5.
Step 260, drawing a wearing object mask in the first target image by combining the target key point information and the preliminary human body analysis result, and generating a wearing object mask image.
To reduce the influence of the original clothing, this step may erase the original clothing. In implementation, the wearing object mask can be drawn from the target key point information and the preliminary human body analysis result. For example, in a virtual fitting scenario the target key points include arm key points, so an elliptical mask can be drawn based on the arm key points; the ellipse needs to be larger than the extent of the original arm, and its size can be determined from an empirical value. For the torso, a rectangular mask can be drawn based on the body part in the preliminary human body analysis result. The two arm masks and the torso mask are then connected into one complete mask, and finally the complete mask is processed with dilation and erosion to obtain the result of erasing the clothing from the human body, namely the wearing object mask image. For example, for the first target image of fig. 3, the wearing object mask image generated after occluding the clothing may be as shown in fig. 6.
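Illustratively, this mask-drawing step can be sketched with OpenCV as follows (not part of the original disclosure; the ellipse radii, kernel size, gray fill value and the simplified key-point/parsing formats are assumptions):

import cv2
import numpy as np

def build_garment_mask(image, arm_keypoints, torso_box, dilate_px=15):
    # Draw elliptical masks over the arms and a rectangle over the torso,
    # merge them, then clean the result with dilation and erosion.
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)

    # Elliptical mask around each arm segment, e.g. (wrist, elbow) and (elbow, shoulder) pairs.
    for (x1, y1), (x2, y2) in arm_keypoints:
        center = ((x1 + x2) // 2, (y1 + y2) // 2)
        length = int(np.hypot(x2 - x1, y2 - y1))
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        axes = (length // 2 + dilate_px, dilate_px * 2)   # larger than the arm itself
        cv2.ellipse(mask, center, axes, angle, 0, 360, 255, -1)

    # Rectangular mask over the torso region taken from the preliminary parsing result.
    x, y, bw, bh = torso_box
    cv2.rectangle(mask, (x, y), (x + bw, y + bh), 255, -1)

    # Dilation followed by erosion to connect the parts into one smooth mask.
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    mask = cv2.erode(cv2.dilate(mask, kernel), kernel)

    # Erase the original garment: fill the masked pixels with a neutral gray (assumed value).
    erased = image.copy()
    erased[mask > 0] = 128
    return erased, mask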
Step 270, drawing a wearing object mask based on the target key points in the preliminary human body analysis result to generate a human body analysis result.
Similar to the processing method in step 260, the original wearing object region corresponding to the target wearing object needs to be erased for the preliminary human body analysis result, so as to generate a human body analysis result unrelated to the original wearing object. During processing, the target human body key point information may be superimposed on the preliminary human body analysis result, a corresponding mask is drawn, and then the drawn mask is set as a background color, for example, in a virtual fitting scene, a mask formed by connecting drawn arm masks, body masks, and the like is processed as a background color, and the generated human body analysis result is as shown in fig. 7.
In other embodiments, the masks obtained in step 260 may also be directly superimposed on the preliminary human body analysis result, and processed into a background color.
Step 280, inputting the target key point information, the human body analysis result, the wearing object mask image and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing area information based on the target key point information, the human body analysis result and the wearing object mask image, determines deformation information of the target wearing object according to the human body wearing area information and the second target image, and generates a wearing effect image according to the deformation information, the target key point information and the wearing object mask image for outputting.
After the target key point information, the wearing object mask image and the human body analysis result are obtained, the three and the second target image can be used as input features to be input into a pre-trained virtual wearing network, the virtual wearing network performs virtual wearing processing, and a corresponding wearing effect graph is output.
In this embodiment, the input features of the virtual wearing network include target key point information of a human body part related to the target wearing object, a wearing object mask image obtained by erasing a wearing object region corresponding to the target wearing object for the target human body, and a human body analysis result obtained by erasing a wearing object region corresponding to the target wearing object for the initial human body analysis result, so that the dimension of the input features is expanded, the original features of the target human body and the target wearing object are retained to the maximum extent, and thus, a wearing effect graph output by the virtual wearing network is more real, and a better wearing simulation effect is achieved.
Example three
Fig. 8 is a flowchart of an embodiment of a virtual wearing method provided in the third embodiment of the present application, and the more specific description of the process of performing virtual wearing processing on a virtual wearing network in this embodiment is based on the first embodiment or the second embodiment, where the process includes the following steps:
Step 310, a first target image including a target human body is acquired.
Step 320, a second target image including a target wearing article is acquired.
Step 330, obtaining human body characteristic information based on the first target image, wherein the human body characteristic information comprises target human body key point information related to the target wearing object, a human body analysis result and a wearing object mask image.
Step 340, inputting the human body feature information and the second target image into a pre-trained virtual wearing network, wherein the virtual wearing network comprises a wearing area generation model, a deformation identification model and a generation model.
As shown in fig. 9, the virtual wearing network includes a wearing region generation model, a deformation recognition model, and a generation model. The wearing region generation model is used for determining human body wearing region information based on the human body characteristic information; the deformation identification model is used for determining deformation information of the target wearing object according to the human body wearing area information and the second target image; and the generating model is used for generating a wearing effect graph according to the deformation information and the human body characteristic information and outputting the wearing effect graph. The description of the above three models will be described in detail in the following steps.
Step 350, in the virtual wearing network, inputting the target human body key point information, the human body analysis result and the wearing object mask image into the wearing area generation model, so that the wearing area generation model predicts the human body wearing area of the target human body wearing the target wearing object, and outputs corresponding human body wearing area information.
The wearing area generation model is the first-stage model of the virtual wearing network and may be referred to as a wearing object mask generation network. It may be a model with a U-NET network structure (a symmetric encoder-decoder structure), as shown in fig. 10, with an encoder on the left side and a decoder on the right side.
The input characteristics of the wearing area generating model comprise target human body key point information, a human body analysis result and a wearing object mask image, the wearing area generating model determines human body posture information according to the human body key point information, and then generates and outputs an area (namely human body wearing area information) of a person wearing the target wearing object by combining the human body posture information, the wearing object mask image and the human body analysis result. For example, as shown in fig. 11, the features input to the wearing area generation model include, from top to bottom, target human body key point information, a wearing object mask image, and a human body analysis result, and the wearing area generation model outputs an area where a person wears the target wearing object.
In one implementation, the loss functions used in training the wearing region generation model may include a cross-entropy loss function (CE_Loss) and a Dice loss function (Dice Loss, also known as a set-similarity loss), where
the cross-entropy loss is computed as:
CE_Loss = -(1/N) Σ_i [ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ]
where N is the batch size during training, y_i is the label, and p_i is the model prediction.
The Dice Loss is computed as:
Dice Loss = 1 - 2|X ∩ Y| / (|X| + |Y|)
where X is the label, Y is the prediction, and |X ∩ Y| is the intersection of the prediction and the label.
In one example, in training the wear area generative model, an Adam optimizer may be employed, with a learning rate set to 0.001, and 20 epochs trained.
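Illustratively, the combined cross-entropy + Dice objective and the training settings above can be sketched as follows (not part of the original disclosure; it assumes PyTorch and a binary wearing-region mask, and the data loader and model definition are taken as given):

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # cross-entropy term for a binary wearing-region mask

def dice_loss(logits, target, eps=1.0):
    # Dice loss: 1 - 2|X ∩ Y| / (|X| + |Y|), computed on predicted probabilities.
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def train_region_model(model, loader, epochs=20, lr=1e-3, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # Adam, lr 0.001, 20 epochs as above
    model.train()
    for _ in range(epochs):
        for features, target_mask in loader:
            features, target_mask = features.to(device), target_mask.to(device)
            logits = model(features)
            loss = bce(logits, target_mask) + dice_loss(logits, target_mask)
            opt.zero_grad()
            loss.backward()
            opt.step()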
Step 360, inputting the second target image, the target human body key point information, and the human body wearing area information output by the wearing area generation model into the deformation identification model, so that the deformation identification model generates a first feature map according to the human body wearing area information and the target human body key point information, generates a second feature map according to the second target image, and determines the deformation information of the target wearing object based on the first feature map and the second feature map.
The human body wearing region information output by the wearing region generation model, the second target image and the target human body key point information can be used as input characteristics of the deformation identification model and input into the deformation identification model. The deformation identification model is used as a model of the second stage of the virtual wearing network and can also be called a Warp model.
In one implementation, the Warp model may include two feature extractors (Encoders): a first feature extractor and a second feature extractor. The first feature extractor extracts features related to the target human body from the target human body key point information and the human body wearing region information to generate a first feature map; the second feature extractor extracts the relevant features of the target wearing object and generates a second feature map. The two feature extractors have the same structure but do not share weights.
Illustratively, the structural diagram of the feature extractor may be as shown in table 1 below, and includes an input layer (input) and 6 residual layers (ResBlock):
feature extractor
Input,1024*768*N
ResBlock,512*384*32
ResBlock,256*192*64
ResBlock,128*96*128
ResBlock,64*48*512
ResBlock,32*24*512
ResBlock,16*12*512
TABLE 1
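Illustratively, a residual feature extractor following the layout of Table 1 can be sketched as follows (not part of the original disclosure; that each ResBlock halves the spatial resolution with a stride-2 convolution is an assumption, since the text does not specify the block internals):

import torch.nn as nn

class ResBlock(nn.Module):
    # Stride-2 residual block: halves H and W and changes the channel count.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
        )
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

def make_feature_extractor(in_channels):
    # Six ResBlocks matching the channel progression 32-64-128-512-512-512 of Table 1.
    channels = [32, 64, 128, 512, 512, 512]
    layers, c_prev = [], in_channels
    for c in channels:
        layers.append(ResBlock(c_prev, c))
        c_prev = c
    return nn.Sequential(*layers)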
In addition, the Warp model may further include a Spatial Transformer Network (STN) sub-network. The first feature map extracted by the first feature extractor and the second feature map extracted by the second feature extractor serve as the input features of the STN sub-network. The STN sub-network performs the relevant spatial transformation processing based on the first feature map and the second feature map, where the spatial transformation includes scaling, translation, rotation and other transformations, and outputs the deformation information of the target wearing object, i.e. the Warp parameters. In other words, a Warp operation is performed on the target wearing object to obtain the shape of the target wearing object when worn on the target human body.
For example, as shown in fig. 12, the features of the input Warp model include, from top to bottom, the second target image, the target human body key point information, and the human body wearing area information. The method comprises the steps that feature extraction is carried out on target human body key point information and human body wearing area information through a first feature extractor, feature extraction is carried out on a second target image through a second feature extractor, the first feature extractor and the second feature extractor output extraction results to an STN sub-network, and warp parameters of deformed target clothes are output by the STN sub-network.
In one implementation, the loss functions used in training the Warp model may include a perceptual loss function (Perceptual loss) and an L1 loss function (L1_loss), namely:
Warp Loss=Perceptual loss+L1_loss
where
Perceptual loss = E((VGG(Y) - VGG(W(X)))²)
L1_loss = E(|Y - W(X)|)
where E denotes the mean; X is the input of the Warp model; Y is the second target image; VGG is a VGG model, such as VGG-19 or VGG-16; and W is the Warp model.
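Illustratively, the Warp loss above (perceptual + L1) can be sketched as follows (not part of the original disclosure; it assumes PyTorch with a recent torchvision, and which VGG-19 feature layers are compared is an assumption, since the text only names the VGG model):

import torch
import torch.nn as nn
from torchvision import models

class WarpLoss(nn.Module):
    # Perceptual (VGG feature) loss plus pixel-wise L1 loss.
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)   # the VGG feature extractor is frozen
        self.vgg = vgg
        self.l1 = nn.L1Loss()

    def forward(self, warped, target):
        perceptual = ((self.vgg(target) - self.vgg(warped)) ** 2).mean()
        return perceptual + self.l1(warped, target)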
In an example, an Adam optimizer may also be used for training of the Warp model. When the Warp model is trained, the wearing region generation model is not trained; the learning rate may be set to 0.0005, and 100 epochs are trained.
Step 370, inputting the human body key point information, the wearing object mask image and the deformation information of the target wearing object output by the deformation identification model into the generated model, and processing the generated model to generate a wearing effect graph when the target wearing object is worn on the body of the target human body.
The deformation information of the target wearing object output by the Warp model, the key point information of the target human body and the wearing object mask image can be used as input characteristics of the generation model and input into the generation model. And generating a model as a model of the third stage of the virtual wearing network, and outputting a wearing effect picture of the target wearing object worn by the target human body.
In one implementation, the generation model may include an Encoder and a Decoder, where the Encoder is configured to perform feature extraction, and output a third feature map corresponding to the target human body and style attribute information of the deformed target wearing object to the Decoder; the decoder is used for decoding according to the third characteristic diagram and the style attribute information to generate a wearing effect diagram when the target wearing article is worn on the body of the target human body. As shown in fig. 13, the left-side dashed box portion is an Encoder, and the right-side dashed box portion is a Decoder.
In one embodiment, the structure of the Encoder may include: an input layer, a plurality of residual layers and a fully connected layer. Each residual layer extracts a third feature map related to the target human body and outputs it to the corresponding layer of the decoder, and the fully connected layer extracts the style attribute information of the deformed target wearing object and outputs it to each layer of the decoder. The style attribute information is a latent code.
For example, the structure of the Encoder is shown in Table 2 below. It contains 6 residual layers (ResBlock), and the size of the third feature map output by each residual layer is specified, such as 512×384×32, 256×192×64, and so on; the fully connected layer FC outputs style attribute information of size 18×512.
Table 2 (structure of the Encoder): an input layer, six residual layers (ResBlock) whose output feature-map sizes decrease from 512×384×32 and 256×192×64 down to 16×12×512, and a fully connected layer (FC) outputting 18×512 style attribute information.
As shown in fig. 13, the third feature map extracted for each residual layer is output to the next layer for processing on one hand, and also needs to be output to the corresponding layer of the Decoder on the other hand (except for the last residual layer, the last residual layer only outputs the result to the corresponding layer of the Decoder). The corresponding layer here refers to a decoding layer that matches the size of the currently output third feature map, and for example, if the size of the currently output third feature map is 32 × 24 × 512, the corresponding layer in the Decoder refers to a decoding layer that can process the feature map of 32 × 24 × 512 size.
In fig. 13, of the two rightmost output layers of the Encoder, the upper one is the last residual layer (ResBlock), which outputs a feature map of size 16×12×512; the lower one is the FC layer, which outputs the style attribute information of size 18×512 and passes it to every layer of the Decoder, so that the Decoder can generate the wearing effect graph according to the style attribute information.
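Illustratively, the Encoder's two outputs, the per-layer feature maps fed to the matching Decoder layers and the 18×512 style attribute information from the FC layer, can be sketched as follows (not part of the original disclosure; the stride-2 stand-in block, the global pooling before the FC layer and the channel progression are assumptions):

import torch
import torch.nn as nn

def down_block(c_in, c_out):
    # Simple stand-in for a ResBlock of Table 2: stride-2 convolution + ReLU.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True))

class TryOnEncoder(nn.Module):
    # Outputs (a) one feature map per block for the Decoder's skip connections
    # and (b) an 18x512 style latent produced by the final FC layer.
    def __init__(self, in_channels, channels=(32, 64, 128, 512, 512, 512)):
        super().__init__()
        blocks, c_prev = [], in_channels
        for c in channels:
            blocks.append(down_block(c_prev, c))
            c_prev = c
        self.blocks = nn.ModuleList(blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)             # assumption: pool before the FC layer
        self.fc = nn.Linear(channels[-1], 18 * 512)

    def forward(self, x):
        skips = []
        for block in self.blocks:
            x = block(x)
            skips.append(x)                             # fed to the matching Decoder layer
        style = self.fc(self.pool(x).flatten(1)).view(-1, 18, 512)
        return skips, style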
In one embodiment, the network structure of the Decoder may be the structure of the synthesis network of StyleGAN2. As shown in the StyleGAN2 model architecture of fig. 14, StyleGAN2 consists of two parts: a Mapping Network on the left of fig. 14 and a synthesis network on the right.
The Mapping Network disentangles the input better. As shown in fig. 14, the Mapping Network consists of 8 fully connected layers (FC); Gaussian noise (the latent Z) is input into the Mapping Network, and a hidden variable (W) is obtained through it.
The synthesis network is composed of a learnable affine transformation A, a modulation module Mod-Demod, an Upsample module, and other modules. In addition, the synthesis network includes weights (w), biases (b) and a constant input (c, i.e. Const 4×4×512, a learnable constant); the activation function (leaky ReLU) is always applied immediately after adding the bias.
The learnable affine transformation A can consist of one fully connected layer; the upsampling operation may use deconvolution (also called transposed convolution).
The processing flow of the modulation module Mod-Demod is as follows:
w′_ijk = s_i · w_ijk
where s_i is the scale of the i-th input feature map, w_ijk are the original convolution weights, and w′_ijk are the modulated weights.
After scaling and convolution, the convolution layer weights are demodulated (demod). The standard deviation of the output activations is:
σ_j = sqrt( Σ_{i,k} (w′_ijk)² )
Demodulation rescales the weights to restore the output to unit standard deviation, i.e. the new convolution layer weights are:
w″_ijk = w′_ijk / sqrt( Σ_{i,k} (w′_ijk)² + ε )
In the above formula, ε is added to avoid the denominator being 0.
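Illustratively, the Mod-Demod step can be sketched as a modulated convolution layer (not part of the original disclosure; this is a simplified single-layer sketch assuming PyTorch, and folding the batch into the group dimension is one common way to apply per-sample modulated weights):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    # StyleGAN2-style modulated convolution: scale the weights per sample by a style
    # vector (mod), then normalize them to unit output standard deviation (demod).
    def __init__(self, c_in, c_out, k, style_dim, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)
        self.affine = nn.Linear(style_dim, c_in)   # learnable affine transformation A
        self.eps = eps

    def forward(self, x, style):
        b, c_in, h, w = x.shape
        s = self.affine(style).view(b, 1, c_in, 1, 1)          # s_i per input channel
        weight = self.weight.unsqueeze(0) * s                  # w'_ijk = s_i * w_ijk
        demod = torch.rsqrt((weight ** 2).sum(dim=(2, 3, 4)) + self.eps)
        weight = weight * demod.view(b, -1, 1, 1, 1)           # w''_ijk (unit output std)
        # Fold the batch into the group dimension so each sample uses its own weights.
        weight = weight.view(-1, c_in, *self.weight.shape[2:])
        x = x.view(1, -1, h, w)
        out = F.conv2d(x, weight, padding=self.weight.shape[-1] // 2, groups=b)
        return out.view(b, -1, h, w)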
The rightmost side in fig. 14 is the injection of random noise, B is a learnable noise parameter, and random noise is introduced to make the generated image more realistic and lifelike.
In one embodiment, the loss functions used in training the generative model may include the generative confrontation network loss function GAN _ loss, the Perceptual loss function Perceptual loss, and the L1 loss function L1_ loss, i.e.,
Loss=GAN_loss+Perceptual loss+L1_loss
where
GAN_loss = E[(D(G(X)) - 1)²] + E[D(G(X))²]
Perceptual loss = E((VGG(Y) - VGG(G(X)))²)
L1_loss = E(|Y - G(X)|)
where E denotes the mean; D is the discriminator; G(X) is the wearing effect graph output by the generative model; X is the input of the generative model; and Y is the wearing effect map in the sample.
The GAN loss drives the generative model to produce more realistic results.
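Illustratively, the adversarial objectives above can be sketched as follows (not part of the original disclosure; it assumes PyTorch, a least-squares adversarial term, equal weighting of the three terms, and any perceptual + L1 criterion such as the WarpLoss sketch given earlier):

import torch

def generator_loss(discriminator, fake, real, perceptual_l1):
    # GAN_loss + Perceptual loss + L1_loss for one batch (equal weights assumed).
    adv = ((discriminator(fake) - 1) ** 2).mean()       # least-squares adversarial term
    return adv + perceptual_l1(fake, real)

def discriminator_loss(discriminator, fake, real):
    # Least-squares discriminator objective.
    real_term = ((discriminator(real) - 1) ** 2).mean()
    fake_term = (discriminator(fake.detach()) ** 2).mean()
    return real_term + fake_term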
In an example, an Adam optimizer may also be used for training of the generative model. When the generative model is trained, neither the wearing region generation model nor the Warp model is trained; the learning rate is set to 0.0005, and 100 epochs are trained.
As shown in fig. 15, for the generated model after training, the input features of the generated model may include target human body key point information, a wearing object mask image, and deformation information of the target wearing object output by the deformation recognition model, and output a wearing effect diagram of the target wearing object when the target wearing object is worn on the target human body.
In this embodiment, the virtual wearing network realizes wearing the target wearing object on the target human body through the wearing area generation model, the deformation identification model and the generation model. The wearing area generation model is responsible for predicting a human body wearing area when the target human body wears the target wearing object according to the key point information of the target human body, a human body analysis result obtained after the original wearing object is erased and a wearing object mask image obtained after the original wearing object is erased, and outputting corresponding human body wearing area information. The deformation identification model is responsible for determining deformation information of the target wearing object relative to the posture of the human body according to the information of the human body wearing area, the key point information of the target human body and a second target image containing the target wearing object, and the deformed target wearing object is obtained. And the generation model is responsible for attaching the deformed target wearing object to the body of the target human body with the original wearing object erased according to the deformation information, the key point information of the target human body and the mask image of the wearing object to generate a wearing effect picture. The three models are strong in generalization capability and good in robustness, and the output wearing effect graph can give consideration to both wearing effect and authenticity.
Example four
Fig. 16 is a block diagram of a virtual wearable device according to a fourth embodiment of the present disclosure, which may include the following modules:
a first target image obtaining module 410, configured to obtain a first target image including a target human body;
a second target image obtaining module 420, configured to obtain a second target image including a target wearing object;
a human body feature information obtaining module 430, configured to obtain human body feature information based on the first target image, where the human body feature information includes target human body key point information related to the target wearing object, a human body analysis result, and a wearing object mask image;
the wearing effect map generating module 440 is configured to input the human body feature information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing area information based on the human body feature information, determines deformation information of the target wearing object according to the human body wearing area information and the second target image, and generates a wearing effect map according to the deformation information and the human body feature information for output; wherein the virtual wearing network is a generative adversarial network.
In an embodiment, if the human body feature information is target human body key point information, the human body feature information obtaining module 430 is specifically configured to:
inputting the first target image into a pre-trained human key point detection model, performing key point detection on a target human body in the first target image by using the human key point detection model, and outputting corresponding human key point information;
and determining target key point information related to the target wearing article from the human body key point information.
In an embodiment, if the human body characteristic information is a human body analysis result, the human body characteristic information obtaining module 430 is specifically configured to:
inputting the first target image into a pre-trained human body analysis model, so that the human body analysis model performs human body analysis on a target human body in the first target image, and outputting a corresponding preliminary human body analysis result;
and drawing a wearing object mask based on the target key points in the preliminary human body analysis result to generate a human body analysis result.
In an embodiment, if the human body characteristic information is a mask image of a wearing object, the human body characteristic information obtaining module 430 is specifically configured to:
and drawing a wearing article mask in the first target image by combining the target key point information and the preliminary human body analysis result to generate a wearing article mask image.
In one embodiment, the virtual wearing network includes a wearing area generation model, and the wearing effect map generation module 440 further includes the following sub-modules:
and the wearing area generation model processing submodule is used for inputting the target human body key point information, the human body analysis result and the wearing object mask image into the wearing area generation model, predicting the human body wearing area of the target human body when the target wearing object is worn by the wearing area generation model, and outputting corresponding human body wearing area information.
In one embodiment, the wearing area generation model is a model containing a U-NET network structure; in training the wear region generative model, the loss functions used include a cross entropy loss function and a dice loss function.
In an embodiment, the virtual wearable network further includes a deformation identification model, and the wearable effect map generation module 440 further includes the following sub-modules:
and the deformation identification model processing submodule is used for inputting the second target image, the target human body key point information and the human body wearing region information output by the wearing region generation model into the deformation identification model, generating a first feature map according to the human body wearing region information and the human body key point information by the deformation identification model, generating a second feature map according to the second target image, and determining the deformation information of the target wearing object based on the first feature map and the second feature map.
In one embodiment, the deformation recognition model includes a first feature extractor, a second feature extractor, and a spatial transformation sub-network;
the first feature extractor is used for outputting the first feature map to the spatial transformation subnetwork according to the human body wearing region information and the human body key point information;
the second feature extractor is used for outputting the second feature map to the spatial transformation sub-network according to the second target image;
the spatial transformation sub-network is used for carrying out relevant spatial transformation processing based on the first feature map and the second feature map and outputting deformation information of the target wearing object.
In one embodiment, the loss functions used in training the deformation recognition model include perceptual loss functions and L1 loss functions.
In one embodiment, the virtual wearing network further includes a generation model, and the wearing effect map generation module 440 further includes the following sub-modules:
and the generation model processing submodule is used for inputting the human key point information, the wearing object mask image and the deformation information of the target wearing object output by the deformation identification model into the generation model, and the generation model processes the information to generate a wearing effect graph when the target wearing object is worn on the body of the target human body.
In one embodiment, the generation model comprises an encoder and a decoder, wherein the encoder is used for performing feature extraction and outputting a third feature map corresponding to the target human body, together with the style attribute information of the deformed target wearing object, to the decoder;
the decoder is used for performing decoding according to the third feature map and the style attribute information, so as to generate the wearing effect graph when the target wearing object is worn on the target human body.
In one embodiment, the network structure of the decoder is the structure of the synthesis network of StyleGAN2;
the loss functions used in training the generation model include a generative adversarial network loss function, a perceptual loss function and an L1 loss function.
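As a rough picture of this final generation step, the sketch below pairs an encoder, which produces a feature map of the target human body plus a style vector for the deformed wearing object, with a decoder whose blocks are modulated by that style vector, loosely in the spirit of a StyleGAN2 synthesis network. PyTorch, the channel widths, the two-block depth and the simplified modulation are all assumptions; a real StyleGAN2 synthesis network is considerably more involved.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_ch, feat=64, style_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU())
        self.style = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(feat, style_dim))

    def forward(self, x):                      # x: body inputs concatenated with the deformed garment
        fmap = self.conv(x)                    # third feature map (content of the target human body)
        return fmap, self.style(fmap)          # plus style attribute information of the garment

class ModulatedBlock(nn.Module):
    """Heavily simplified stand-in for a StyleGAN2 modulated convolution block."""
    def __init__(self, ch, style_dim):
        super().__init__()
        self.scale = nn.Linear(style_dim, ch)
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x, w):
        x = x * (1.0 + self.scale(w)[:, :, None, None])   # modulate features by the style vector
        return torch.relu(self.conv(self.up(x)))

class Decoder(nn.Module):
    def __init__(self, feat=64, style_dim=128, out_ch=3):
        super().__init__()
        self.blocks = nn.ModuleList([ModulatedBlock(feat, style_dim) for _ in range(2)])
        self.to_rgb = nn.Conv2d(feat, out_ch, 1)

    def forward(self, fmap, w):
        for blk in self.blocks:
            fmap = blk(fmap, w)
        return torch.tanh(self.to_rgb(fmap))   # wearing effect image in [-1, 1]
```

Training such an encoder/decoder would combine an adversarial loss from a separate discriminator with perceptual and L1 reconstruction terms, matching the loss functions listed above.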
The virtual wearing device provided in this embodiment of the present application can execute the virtual wearing method described in the first to third embodiments of the present application, and has the functional modules corresponding to the executed method as well as the corresponding beneficial effects.
EXAMPLE five
Fig. 17 is a schematic structural diagram of an electronic device according to the fifth embodiment of the present application. As shown in Fig. 17, the electronic device includes a processor 510, a memory 520, an input device 530 and an output device 540. The number of processors 510 in the electronic device may be one or more; one processor 510 is taken as an example in Fig. 17. The processor 510, the memory 520, the input device 530 and the output device 540 in the electronic device may be connected by a bus or in other manners; connection by a bus is taken as an example in Fig. 17.
The memory 520 is a computer-readable storage medium and can be used for storing software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the methods in the first to third embodiments of the present application. The processor 510 runs the software programs, instructions and modules stored in the memory 520 so as to execute various functional applications and data processing of the electronic device, thereby implementing the method described in the first, second or third embodiment.
The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 520 may further include memories remotely arranged relative to the processor 510, and these remote memories may be connected to the device/terminal/server through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network and combinations thereof.
The input device 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 540 may include a display device such as a display screen.
EXAMPLE six
The sixth embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the method described in any one of the first to third embodiments above.
Of course, in the storage medium containing computer-executable instructions provided in the embodiments of the present application, the computer-executable instructions are not limited to the method operations described above, and may also be used to perform related operations in the method provided in any embodiment of the present application.
EXAMPLE seven
The seventh embodiment of the present application further provides a computer program product including computer-executable instructions which, when executed by a computer processor, are used to perform the method described in any one of the first to third embodiments above.
Of course, the computer-executable instructions of the computer program product provided in the embodiments of the present application are not limited to the method operations described above, and may also be used to perform related operations in the method provided in any embodiment of the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented by hardware, although the former is the preferred implementation in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk or an optical disk of a computer, and which includes several instructions for enabling an electronic device (which may be a personal computer, a server or a network device) to execute the methods described in the embodiments of the present application.
It should be noted that, in the embodiment of the apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (16)

1. A virtual wearing method, the method comprising:
acquiring a first target image containing a target human body;
acquiring a second target image containing a target wearing object;
acquiring human body characteristic information based on the first target image, wherein the human body characteristic information comprises target human body key point information related to the target wearing object, a human body analysis result and a wearing object mask image;
inputting the human body characteristic information and the second target image into a pre-trained virtual wearing network, determining human body wearing area information by the virtual wearing network based on the human body characteristic information, determining deformation information of the target wearing object according to the human body wearing area information and the second target image, and generating a wearing effect graph according to the deformation information and the human body characteristic information for output; wherein the virtual wearing network is a generative adversarial network.
2. The method according to claim 1, wherein if the human body characteristic information is the target human body key point information, the obtaining human body characteristic information based on the first target image comprises:
inputting the first target image into a pre-trained human body key point detection model, performing key point detection on the target human body in the first target image by using the human body key point detection model, and outputting corresponding human body key point information;
and determining, from the human body key point information, the target human body key point information related to the target wearing object.
3. The method according to claim 2, wherein if the human body characteristic information is the human body analysis result, the obtaining human body characteristic information based on the first target image comprises:
inputting the first target image into a pre-trained human body analysis model, so that the human body analysis model performs human body analysis on a target human body in the first target image, and outputting a corresponding preliminary human body analysis result;
and drawing a wearing object mask based on the target human body key points in the preliminary human body analysis result, so as to generate the human body analysis result.
4. The method according to claim 3, wherein if the human body characteristic information is the wearing object mask image, the obtaining human body characteristic information based on the first target image comprises:
drawing a wearing object mask in the first target image by combining the target human body key point information and the preliminary human body analysis result, so as to generate the wearing object mask image.
5. The method according to any one of claims 1 to 4, wherein the virtual wearing network includes a wearing area generation model, and the determining human body wearing area information based on the human body characteristic information includes:
inputting the target human body key point information, the human body analysis result and the wearing object mask image into the wearing area generation model, predicting, by the wearing area generation model, a human body wearing area when the target human body wears the target wearing object, and outputting corresponding human body wearing area information.
6. The method of claim 5, wherein the wearing area generation model is a model containing a U-NET network structure; and the loss functions used in training the wearing area generation model include a cross entropy loss function and a Dice loss function.
7. The method of claim 5, wherein the virtual wearing network further comprises a deformation identification model, and the determining deformation information of the target wearing object according to the human body wearing area information and the second target image comprises:
inputting the second target image, the target human body key point information and the human body wearing area information output by the wearing area generation model into the deformation identification model, so that the deformation identification model generates a first feature map according to the human body wearing area information and the target human body key point information, generates a second feature map according to the second target image, and determines the deformation information of the target wearing object based on the first feature map and the second feature map.
8. The method of claim 7, wherein the deformation identification model comprises a first feature extractor, a second feature extractor and a spatial transformation sub-network;
the first feature extractor is used for outputting the first feature map to the spatial transformation sub-network according to the human body wearing area information and the target human body key point information;
the second feature extractor is used for outputting the second feature map to the spatial transformation sub-network according to the second target image;
the spatial transformation sub-network is used for performing spatial transformation processing based on the first feature map and the second feature map, and outputting the deformation information of the target wearing object.
9. The method of claim 8, wherein the loss functions used in training the deformation identification model include a perceptual loss function and an L1 loss function.
10. The method of claim 6, wherein the virtual wearing network further comprises a generation model, and the generating the wearing effect graph according to the deformation information and the human body characteristic information comprises:
inputting the target human body key point information, the wearing object mask image and the deformation information of the target wearing object output by the deformation identification model into the generation model, and processing, by the generation model, the input information to generate a wearing effect graph when the target wearing object is worn on the target human body.
11. The method according to claim 10, wherein the generation model comprises an encoder and a decoder, the encoder is configured to perform feature extraction, and to output a third feature map corresponding to the target human body and style attribute information of the deformed target wearing object to the decoder;
the decoder is used for performing decoding according to the third feature map and the style attribute information, so as to generate the wearing effect graph when the target wearing object is worn on the target human body.
12. The method of claim 11, wherein the network structure of the decoder is the structure of the synthesis network of StyleGAN2;
and the loss functions used in training the generation model include a generative adversarial network loss function, a perceptual loss function and an L1 loss function.
13. A virtual wearing device, the device comprising:
the first target image acquisition module is used for acquiring a first target image containing a target human body;
the second target image acquisition module is used for acquiring a second target image containing a target wearing object;
the human body characteristic information acquisition module is used for acquiring human body characteristic information based on the first target image, wherein the human body characteristic information comprises target human body key point information related to the target wearing object, a human body analysis result and a wearing object mask image;
the wearing effect graph generating module is used for inputting the human body characteristic information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing area information based on the human body characteristic information, determines deformation information of the target wearing object according to the human body wearing area information and the second target image, and generates a wearing effect graph according to the deformation information and the human body characteristic information for output; wherein the virtual wearing network is a generative adversarial network.
14. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-12.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-12.
16. A computer program product comprising computer-executable instructions for implementing the method of any one of claims 1-12 when executed.
CN202111356765.5A 2021-11-16 2021-11-16 Virtual wearing method, device, equipment, storage medium and program product Pending CN114067088A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111356765.5A CN114067088A (en) 2021-11-16 2021-11-16 Virtual wearing method, device, equipment, storage medium and program product
PCT/CN2022/132132 WO2023088277A1 (en) 2021-11-16 2022-11-16 Virtual dressing method and apparatus, and device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111356765.5A CN114067088A (en) 2021-11-16 2021-11-16 Virtual wearing method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN114067088A true CN114067088A (en) 2022-02-18

Family

ID=80273012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111356765.5A Pending CN114067088A (en) 2021-11-16 2021-11-16 Virtual wearing method, device, equipment, storage medium and program product

Country Status (2)

Country Link
CN (1) CN114067088A (en)
WO (1) WO2023088277A1 (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6921768B2 (en) * 2018-02-21 2021-08-18 株式会社東芝 Virtual fitting system, virtual fitting method, virtual fitting program, and information processing device
CN110211196B (en) * 2019-05-28 2021-06-15 山东大学 Virtual fitting method and device based on posture guidance
US11080817B2 (en) * 2019-11-04 2021-08-03 Adobe Inc. Cloth warping using multi-scale patch adversarial loss
CN110852941B (en) * 2019-11-05 2023-08-01 中山大学 Neural network-based two-dimensional virtual fitting method
KR102341763B1 (en) * 2020-02-04 2021-12-22 엔에이치엔 주식회사 Clothing virtual try-on service method on deep-learning and apparatus thereof
CN113269895A (en) * 2020-02-17 2021-08-17 阿里巴巴集团控股有限公司 Image processing method and device and electronic equipment
CN113361560B (en) * 2021-03-22 2023-03-24 浙江大学 Semantic-based multi-pose virtual fitting method
CN114067088A (en) * 2021-11-16 2022-02-18 百果园技术(新加坡)有限公司 Virtual wearing method, device, equipment, storage medium and program product

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023088277A1 (en) * 2021-11-16 2023-05-25 百果园技术(新加坡)有限公司 Virtual dressing method and apparatus, and device, storage medium and program product
CN115937964A (en) * 2022-06-27 2023-04-07 北京字跳网络技术有限公司 Method, device, equipment and storage medium for attitude estimation
CN115937964B (en) * 2022-06-27 2023-12-15 北京字跳网络技术有限公司 Method, device, equipment and storage medium for estimating gesture
CN115174985A (en) * 2022-08-05 2022-10-11 北京字跳网络技术有限公司 Special effect display method, device, equipment and storage medium
CN115174985B (en) * 2022-08-05 2024-01-30 北京字跳网络技术有限公司 Special effect display method, device, equipment and storage medium
CN117575746A (en) * 2024-01-17 2024-02-20 武汉人工智能研究院 Virtual try-on method and device, electronic equipment and storage medium
CN117575746B (en) * 2024-01-17 2024-04-16 武汉人工智能研究院 Virtual try-on method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023088277A1 (en) 2023-05-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination