WO2023088277A1 - Virtual wearing method, apparatus, device, storage medium and program product - Google Patents

Virtual wearing method, apparatus, device, storage medium and program product

Info

Publication number
WO2023088277A1
WO2023088277A1 · PCT/CN2022/132132 · CN2022132132W
Authority
WO
WIPO (PCT)
Prior art keywords
human body
target
information
wearing
wearable
Prior art date
Application number
PCT/CN2022/132132
Other languages
English (en)
French (fr)
Inventor
李安
李玉乐
项伟
Original Assignee
百果园技术(新加坡)有限公司
李安
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司, 李安
Publication of WO2023088277A1 publication Critical patent/WO2023088277A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/006Mixed reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0641Shopping interfaces
    • G06Q30/0643Graphical representation of items or shoppers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00Indexing scheme for image generation or computer graphics
    • G06T2210/16Cloth

Definitions

  • the present application relates to the technical field of data processing, for example, to a virtual wear method, a virtual wear device, an electronic device, a computer-readable storage medium, and a computer program product.
  • The virtual fitting technology mentioned in the related art mainly reconstructs a 3D human body and warps the 3D clothes onto the reconstructed 3D human body.
  • 3D clothes are relatively difficult to obtain, and if the reconstructed 3D human body is not realistic enough, the try-on effect is affected. Therefore, the virtual fitting technology mentioned in the related art can hardly balance the fitting effect and authenticity.
  • The present application provides a virtual wearing method, apparatus, device, storage medium and program product, so as to avoid the situation in the related art in which the virtual fitting technology can hardly balance the fitting effect and authenticity.
  • In a first aspect, the embodiment of the present application provides a virtual wearing method, the method comprising:
  • acquiring a first target image containing a target human body; acquiring a second target image containing a target wearable; acquiring human body feature information based on the first target image, the human body feature information including target human body key point information related to the target wearable, a human body parsing result, and a wearable mask image;
  • inputting the human body feature information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing area information based on the human body feature information, determines deformation information of the target wearable according to the human body wearing area information and the second target image, and generates a wearing effect map for output according to the deformation information and the human body feature information; wherein the virtual wearing network is a generative adversarial network.
  • the embodiment of the present application also provides a virtual wearable device, the device comprising:
  • the first target image acquisition module is configured to acquire the first target image containing the target human body
  • the second target image acquisition module is configured to acquire a second target image including the target wearable
  • the human body characteristic information acquisition module is configured to acquire human body characteristic information based on the first target image, the human body characteristic information includes target human body key point information related to the target wearing object, human body analysis results, and wearing object mask images;
  • the wearing effect map generation module is configured to input the human body characteristic information and the second target image into the pre-trained virtual wearing network, so that the virtual wearing network determines the human body wearing area information based on the human body characteristic information, determines the deformation information of the target wearable according to the human body wearing area information and the second target image, and generates a wearing effect map for output according to the deformation information and the human body characteristic information; wherein the virtual wearing network is a generative adversarial network.
  • the embodiment of the present application further provides an electronic device, the electronic device comprising:
  • one or more processors;
  • storage means configured to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors are made to implement the method of the first aspect above.
  • the embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method in the above-mentioned first aspect is implemented.
  • the embodiment of the present application further provides a computer program product
  • the computer program product includes computer-executable instructions, and when executed, the computer-executable instructions are configured to implement the method in the above-mentioned first aspect.
  • Fig. 1 is a flow chart of an embodiment of a virtual wear method provided by an embodiment of the present application
  • Fig. 2 is a flow chart of an embodiment of a virtual wear method provided by another embodiment of the present application.
  • FIG. 3 is a schematic diagram of a first target image including a target human body provided by an embodiment of the present application
  • Fig. 4 is a schematic diagram of human body key points obtained after key point detection of the first target image provided by an embodiment of the present application
  • Fig. 5 is a schematic diagram of a preliminary human body analysis result obtained after performing human body analysis on the first target image provided by an embodiment of the present application;
  • Fig. 6 is a schematic diagram of a wearing mask image obtained after erasing the clothes of the target human body in the first target image provided by an embodiment of the present application;
  • Fig. 7 is a schematic diagram of human body analysis results obtained after erasing clothes from preliminary human body analysis results provided by an embodiment of the present application;
  • Fig. 8 is a flow chart of an embodiment of a virtual wear method provided by another embodiment of the present application.
  • FIG. 9 is a schematic diagram of a virtual wearable network architecture provided by an embodiment of the present application.
  • Fig. 10 is a schematic diagram of the architecture of a wearing area generation model provided by an embodiment of the present application.
  • Fig. 11 is a schematic diagram of an input and output implementation scenario of a wearing area generation model provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an input and output implementation scenario of a Warp model provided by an embodiment of the present application.
  • Fig. 13 is a schematic diagram of a generation model architecture provided by an embodiment of the present application.
  • Fig. 14 is a schematic diagram of a StyleGAN2 model architecture provided by an embodiment of the present application.
  • Fig. 15 is a schematic diagram of an input and output implementation scenario of a generative model provided by an embodiment of the present application.
  • Fig. 16 is a structural block diagram of an embodiment of a virtual wearable device provided by an embodiment of the present application.
  • Fig. 17 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Fig. 1 is a flow chart of an embodiment of a virtual wear method provided by an embodiment of the present application.
  • The method can be implemented by a virtual wearing device, where the virtual wearing device can be integrated into an APP or a Web page according to its development documentation, so as to implement the virtual wearing function in the APP or Web page.
  • the terminal where the APP or Web page is located may include a mobile phone, a tablet computer, a fitting robot, and the like.
  • the virtual wearing objects in this embodiment may include clothes, trousers, shoes, socks, jewelry, etc.
  • the following embodiments all use clothes as an example to describe the virtual fitting scene.
  • This embodiment can be applied to virtual wearable functions in scenarios such as e-commerce platforms, short video entertainment, image processing, film production, live broadcast, and games.
  • For example, on an e-commerce platform, after a user selects clothes, the user can upload a photo containing the person who is to try on the clothes, and through the virtual fitting function the user can directly see the wearing effect map of the selected clothes on that person.
  • As another example, given a video, specify the person in the video who needs to try on clothes and the clothes to be tried on; through the virtual fitting function in the video application, the clothes of the specified person in the video can then be changed into the clothes to be tried on.
  • this embodiment may include the following steps:
  • Step 110 acquiring a first target image including a target human body.
  • the first target image may include: an image imported via a virtual wear function page.
  • the first target image can be imported according to the import interface in the page.
  • The first target image is an image containing the target human body on which the try-on is to be performed; the target human body can be the user or another person, and the first target image can be a selfie or a non-selfie image. This embodiment does not limit this.
  • the first target image may further include: multiple image frames including the target human body in the target video.
  • For example, in a live-broadcast scene, when the user triggers the virtual fitting function in the live interface and specifies the person who needs to try on clothes, the image frames containing the specified person in the live-broadcast scene can be used as the first target image.
  • the target human body in the first target image needs to preserve the frontal features of the human body as completely as possible, at least the frontal features of the body parts related to the target wearable.
  • Step 120 acquiring a second target image including the target wearable.
  • The second target image may be an image uploaded by the user that contains the target wearable; or the second target image may be an image selected by the user from the sequence of wearable images displayed on the current APP or Web page; or the second target image may be an image generated by the user selecting a person in a video and then extracting the target wearable from that person.
  • This embodiment does not limit the acquisition method of the second target image.
  • the target clothing in the second target image needs to retain important features such as texture and shape of the clothing as much as possible.
  • After the first target image and the second target image are obtained, the two images can be processed into a uniform size, for example by means of central equal-ratio cropping and proportional scaling.
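  • As an illustration only, the following is a minimal sketch of such a preprocessing step using PIL; the target resolution (768*1024) and the helper name to_uniform_size are illustrative assumptions and are not prescribed by this embodiment.

```python
from PIL import Image

def to_uniform_size(img: Image.Image, target_w: int = 768, target_h: int = 1024) -> Image.Image:
    """Center-crop the image to the target aspect ratio, then scale it proportionally.

    The target size (768x1024) is only an illustrative choice; the embodiment does
    not fix a specific resolution for this preprocessing step.
    """
    src_w, src_h = img.size
    target_ratio = target_w / target_h
    src_ratio = src_w / src_h
    if src_ratio > target_ratio:
        # Source is too wide: crop the width around the center.
        new_w = int(src_h * target_ratio)
        left = (src_w - new_w) // 2
        img = img.crop((left, 0, left + new_w, src_h))
    else:
        # Source is too tall: crop the height around the center.
        new_h = int(src_w / target_ratio)
        top = (src_h - new_h) // 2
        img = img.crop((0, top, src_w, top + new_h))
    # Proportional scaling to the uniform size expected by the network.
    return img.resize((target_w, target_h), Image.BILINEAR)
```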
  • Step 130 acquiring human body feature information based on the first target image, the human body feature information including: key point information of the target human body related to the target wearable, a human body analysis result, and a mask image of the wearable.
  • the key point information of the target human body refers to a detection result of human body parts related to the target wearable obtained by detecting the key points of the target human body in the first target image.
  • The target human body key point information can be the key point information of the body parts related to the target wearable, selected after detecting the key point information of the whole human body; or the target human body key point information can be the key point information obtained by directly performing key point detection on the human body parts related to the target wearable, such as the head, neck, shoulders, and hands. This embodiment does not limit this.
  • the human body analysis result refers to a result obtained by performing human body analysis on the target human body in the first target image.
  • Human body parsing is to segment multiple parts of the human body, which is a fine-grained semantic segmentation task. For example, after human body analysis, the target human body can be segmented into hair, face, clothes, pants, limbs and other parts.
  • the clothing mask image refers to an image obtained by blocking the clothing area related to the target clothing in the target human body in the first target image.
  • For example, assuming that the target wearable is clothes, the wearable mask image refers to the image generated after occluding the clothes in the first target image.
  • The wearable mask image obtained after occluding the wearable area related to the target wearable is a human body image unrelated to the target wearable.
  • Step 140: input the human body feature information and the second target image into the pre-trained virtual wearing network, so that the virtual wearing network determines the human body wearing area information based on the human body feature information, determines the deformation information of the target wearable according to the human body wearing area information and the second target image, and generates a wearing effect map for output according to the deformation information and the human body feature information.
  • In this step, the virtual wearing network can be a pre-trained model, which can be a kind of Generative Adversarial Network (GAN).
  • Generation is when a model learns from some data and then generates similar data. For example, let the machine look at some pictures of animals, and then generate pictures of animals by itself, which is generation.
  • A generative adversarial network is a deep learning model.
  • The model produces fairly good output through the mutual game learning of (at least) two modules in the framework: a generative model (Generative Model) and a discriminative model (Discriminative Model).
  • The discriminative model takes input variables and predicts with a certain model; the generative model is given some latent information and randomly generates observation data.
  • After the human body feature information is obtained by analyzing the first target image, both the human body feature information and the second target image can be input into the virtual wearing network, which performs the virtual try-on processing and outputs the wearing effect map of the target human body after wearing the target wearable.
  • In the virtual wearing network, the human body wearing area information can first be determined based on the human body feature information.
  • The human body wearing area information refers to the area of the target human body on which the target wearable is worn, determined in combination with the posture of the target human body.
  • the deformation information of the target wearing object can be determined by combining the information of the wearing area of the human body and the second target image.
  • the deformation information refers to how the target wearing object needs to be twisted and deformed to match the specific area of the target human body.
  • Next, according to the deformation information and the human body feature information, the deformed target wearable can be warped onto the target human body, the original wearable corresponding to the target wearable can be erased, and the wearing effect map can be generated for output.
  • In this embodiment, after obtaining the first target image containing the target human body and the second target image containing the target wearable provided by the user, the human body feature information related to the target human body, such as the target human body key point information, the human body parsing result, and the wearable mask image, can be extracted from the first target image; the human body feature information and the second target image are then processed by the virtual wearing network, which outputs the wearing effect map of the target human body wearing the target wearable.
  • For the user, it is only necessary to specify the target human body and the target wearable to obtain the wearing effect map of the target human body wearing the target wearable. The operation is simple and fast, which improves the user experience.
  • In the virtual wearing network, processing is divided into three steps: first, the human body wearing area information of the target wearable on the target human body is determined based on the human body feature information; then, the deformation information of the target wearable is determined by combining the human body wearing area information and the second target image; finally, a wearing effect map is generated for output according to the deformed target wearable and the human body feature information.
  • The virtual wearing network used in this embodiment is a generative adversarial network; through the above three-step process, the output wearing effect map can take both effect and authenticity into account, improving the user experience. Moreover, the whole process is two-dimensional image processing; compared with 3D approaches, there is no need to reconstruct a 3D human body and 3D clothes, which reduces the difficulty and cost of implementation.
  • Fig. 2 is a flow chart of an embodiment of a virtual wearing method provided by another embodiment of the present application.
  • This embodiment provides a more specific description of the acquisition process of the human body feature information and may include the following steps:
  • Step 210 acquiring a first target image including a target human body.
  • Step 220 acquiring a second target image including the target wearable.
  • Step 230: input the first target image into the pre-trained human body key point detection model, so that the human body key point detection model performs key point detection on the target human body in the first target image and outputs the corresponding human body key point information.
  • the key points of the human body refer to the key positions of multiple parts of the human body.
  • the key points of the human body are crucial for describing the posture of the human body and predicting the behavior of the human body.
  • the key points of the human body mainly include three points of the left and right arms (wrist, elbow, shoulder), three points of the left and right legs (ankle, knee, hip bone), hip, buttocks and head points (eyes, chin, top of head) etc.
  • the key point detection of the target human body can be performed through the human body key point detection model.
  • human key point detection is also called human pose estimation. Its task is to locate the key parts of the human body in a given picture, such as the head, neck, shoulders, hands, etc. On different data sets, the specific parts that need to be detected are different, and the number of key points detected is also different.
  • the human body key point detection model may be a model based on deep learning. According to different wearables, different human body key point detection models may be trained to extract human body key points matching the wearable. This embodiment does not limit the training process of the human body key point detection model, and those skilled in the art can use a general human body key point detection model training method to perform model fitting according to the training objectives.
  • the key point detection model of the human body may also be a pre-trained indiscriminate key point detection model with high precision after multiple inference detections.
  • the first target image is input into a human body key point detection model of indiscriminate detection in the related art, and a probability distribution map corresponding to each human body key point is obtained.
  • According to the actual processing situation and the network structure, different sampling processing can be performed on the first target image.
  • For example, the first target image has a size of 3*256*256, and after three downsampling and convolution operations the first target image is preprocessed into an image of n*32*32; the n*32*32 image is then input into a preset Hourglass network for upsampling and convolution operations to obtain the corresponding heatmap, and the result corresponding to the heatmap is determined as the human body key point detection result.
  • For example, for the first target image containing the target human body shown in FIG. 3, the schematic diagram of human body key points shown in FIG. 4 can be obtained after key point detection by the key point detection model.
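  • The exact post-processing of the heatmaps is not specified here; the following is a minimal sketch, assuming the common convention of one probability map per key point, of how key point coordinates could be read from such heatmaps (the helper name heatmaps_to_keypoints is illustrative).

```python
import torch

def heatmaps_to_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """Convert per-keypoint heatmaps of shape (K, H, W) into (K, 3) rows of (x, y, confidence).

    Assumes one probability map per key point and takes the argmax location as the
    detected coordinate; the detector's real decoding step may be more refined.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.view(k, -1)
    conf, idx = flat.max(dim=1)
    ys = torch.div(idx, w, rounding_mode="floor").float()
    xs = (idx % w).float()
    return torch.stack([xs, ys, conf], dim=1)
```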
  • Step 240 determine target key point information related to the target wearable from the human body key point information.
  • In one implementation, if the human body key point information is key point information of the specified human body parts matching the target wearable, all the key points can be used as target key points. If the human body key point information covers all or most human body parts, the key points of the specified human body parts matching the target wearable can be selected as the target key points.
  • Step 250 Input the first target image into a pre-trained human body analysis model, so that the human body analysis model performs human body analysis on the target human body in the first target image, and outputs a corresponding preliminary human body analysis result.
  • Human body parsing aims at precisely locating human bodies and dividing them into multiple semantic regions at the pixel level.
  • human body parsing can be used to divide the human body into body parts and clothing.
  • the human body parsing model may include a Human Parsing model.
  • the output preliminary human body parsing result may be as shown in FIG. 5 .
  • Step 260: in the first target image, combine the target key point information and the preliminary human body parsing result to draw a wearable mask and generate a wearable mask image.
  • In order to reduce the influence of the original wearable, this step can erase the original wearable.
  • In implementation, the wearable mask can be drawn from the target key point information and the preliminary human body parsing result. For example, in a virtual fitting scene where the target key points include arm key points, an elliptical image mask can be drawn based on the arm key points; the ellipse needs to be larger than the range of the original arm, and its size can be determined from empirical values during implementation. For the body part, a square mask can be drawn based on the body region in the preliminary human body parsing result; the masks of the two arms and the mask of the body part are then connected into one complete mask, which is finally processed by dilation and erosion to obtain the result of erasing the clothing from the human body, that is, the wearable mask image.
  • For example, for the first target image of FIG. 3, the wearable mask image generated after occluding its clothes can be as shown in FIG. 6.
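  • The following is a hedged sketch of this mask-drawing step using OpenCV; the arm/body inputs, the empirical scale factors, and the kernel size are illustrative assumptions, since the embodiment only requires that the ellipse be larger than the original arm.

```python
import cv2
import numpy as np

def draw_wearable_mask(h, w, left_arm_pts, right_arm_pts, body_box, ksize=15):
    """Sketch of the mask-drawing step: elliptical masks around each arm, a square
    mask over the torso taken from the parsing result, then dilation and erosion.

    left_arm_pts / right_arm_pts are (wrist, elbow, shoulder) pixel coordinates and
    body_box is the torso bounding box from the parsing result; all sizes here are
    empirical, illustrative choices.
    """
    mask = np.zeros((h, w), dtype=np.uint8)
    for arm in (left_arm_pts, right_arm_pts):
        arm = np.array(arm, dtype=np.float32)
        center = tuple(arm.mean(axis=0).astype(int))
        # Ellipse axes chosen with a margin so the mask exceeds the original arm.
        length = int(np.linalg.norm(arm[0] - arm[-1]) * 0.7)
        angle = float(np.degrees(np.arctan2(*(arm[-1] - arm[0])[::-1])))
        cv2.ellipse(mask, center, (max(length, 1), max(length // 3, 1)), angle, 0, 360, 255, -1)
    x0, y0, x1, y1 = body_box
    cv2.rectangle(mask, (x0, y0), (x1, y1), 255, -1)
    kernel = np.ones((ksize, ksize), np.uint8)
    mask = cv2.dilate(mask, kernel)   # expand the connected masks
    mask = cv2.erode(mask, kernel)    # then smooth the outline
    return mask
```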
  • Step 270 draw a mask of the wearing object based on the target key points in the preliminary human body analysis result, and generate a human body analysis result.
  • Similar to the processing of step 260, the original wearable area corresponding to the target wearable also needs to be erased from the preliminary human body parsing result, to generate a human body parsing result unrelated to the original wearable.
  • During processing, the target human body key point information can be superimposed on the preliminary human body parsing result and the corresponding mask drawn, and the drawn mask is then set to the background color.
  • For example, in the virtual fitting scene, the masks formed by connecting the drawn arm masks, the body mask, and so on are processed into the background color, and the generated human body parsing result is as shown in FIG. 7.
  • In other embodiments, the masks obtained in step 260 can also be directly superimposed on the preliminary human body parsing result and processed into the background color.
  • Step 280: input the target key point information, the human body parsing result, the wearable mask image, and the second target image into the pre-trained virtual wearing network, so that the virtual wearing network determines the human body wearing area information based on the target key point information, the human body parsing result, and the wearable mask image, determines the deformation information of the target wearable according to the human body wearing area information and the second target image, and generates a wearing effect map for output according to the deformation information, the target key point information, and the wearable mask image.
  • After the target key point information, the wearable mask image, and the human body parsing result are obtained, these three and the second target image can be used as input features and input into the pre-trained virtual wearing network, which performs the virtual wearing processing and outputs the corresponding wearing effect map.
  • In this embodiment, the input features of the virtual wearing network include the target key point information of the human body parts related to the target wearable, the wearable mask image obtained after erasing from the target human body the wearable area corresponding to the target wearable, and the human body parsing result obtained after erasing the wearable area corresponding to the target wearable from the preliminary human body parsing result. This expands the dimensions of the input features and retains the original features of the target human body and the target wearable to the greatest extent, so that the wearing effect map output by the virtual wearing network is more realistic and has a better wearing simulation effect.
  • Fig. 8 is a flow chart of an embodiment of a virtual wearing method provided by another embodiment of the present application. On the basis of the foregoing embodiments, this embodiment provides a more specific description of the virtual wearing processing performed in the virtual wearing network and may include the following steps:
  • Step 310 acquiring a first target image including a target human body.
  • Step 320 acquiring a second target image including the target wearable.
  • Step 330 acquiring human body feature information based on the first target image, the human body feature information including key point information of the target human body related to the target wearable, a human body analysis result, and a mask image of the wearable.
  • Step 340 input the human body characteristic information and the second target image into a pre-trained virtual wearable network, the virtual wearable network includes a wearable region generation model, a deformation recognition model and a generation model.
  • As shown in FIG. 9, the virtual wearing network includes a wearing area generation model, a deformation recognition model, and a generation model.
  • the wearing area generation model is set to determine the human body wearing area information based on the human body feature information
  • the deformation recognition model is set to determine the deformation information of the target wearable object according to the human body wearing area information and the second target image
  • the generation model is set to generate a wearing effect map for output according to the deformation information and the human body feature information.
  • Step 350: in the virtual wearing network, input the target human body key point information, the human body parsing result, and the wearable mask image into the wearing area generation model, so that the wearing area generation model predicts the human body wearing area when the target human body wears the target wearable and outputs the corresponding human body wearing area information.
  • As the first-stage model of the virtual wearing network, the wearing area generation model can also be called the wearable mask generation network. It can be a model containing a U-NET network structure (a symmetrical model structure) which, as shown in FIG. 10, includes an encoder on the left and a decoder on the right.
  • the input features of the wearable area generation model include key point information of the target human body, human body analysis results, and clothing mask images.
  • The wearing area generation model determines the human body posture information from the key point information and then combines it with the wearable mask image and the human body parsing result to generate and output the area where the person wears the target wearable, that is, the human body wearing area information.
  • For example, as shown in FIG. 11, the features input to the wearing area generation model include, from top to bottom, the target human body key point information, the wearable mask image, and the human body parsing result, and the output of the wearing area generation model is the area where the person wears the target wearable.
  • When training the wearing area generation model, the loss functions used may include a cross-entropy loss function (Cross-Entropy loss, CE_loss) and a Dice loss function (Dice Loss, also known as the set similarity measure loss function), where the cross-entropy loss is computed as CE_loss = -(1/N) * Σ_i [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)], with N the training batch size, y_i the label, and p_i the model prediction; and the Dice loss is computed as Dice Loss = 1 - 2|X∩Y| / (|X| + |Y|), where X represents the label result, Y represents the predicted result, and |X∩Y| represents the intersection of the prediction and the label.
  • In one example, when training the wearing area generation model, an Adam optimizer can be used, with the learning rate set to 0.001 and 20 epochs of training.
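  • A minimal sketch of the combined loss, assuming a single-channel mask predicted as logits and an equal weighting of the two terms (the weighting is not specified in this embodiment):

```python
import torch
import torch.nn.functional as F

def mask_generation_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Cross-entropy + Dice loss for the wearing-area (mask) generation model.

    Assumes a single-channel mask predicted as logits; the embodiment states the
    combination of CE_loss and Dice Loss but not their relative weighting.
    """
    ce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter) / (prob.sum() + target.sum() + eps)
    return ce + dice

# Training setup mentioned in the text: Adam optimizer, lr = 0.001, 20 epochs, e.g.
# optimizer = torch.optim.Adam(wearing_area_model.parameters(), lr=0.001)
```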
  • Step 360: input the second target image, the target human body key point information, and the human body wearing area information output by the wearing area generation model into the deformation recognition model, so that the deformation recognition model generates a first feature map according to the human body wearing area information and the target human body key point information, generates a second feature map according to the second target image, and determines the deformation information of the target wearable based on the first feature map and the second feature map.
  • the human body wearing area information output by the wearing area generation model can be input into the deformation recognition model together with the second target image and the key point information of the target human body as input features of the deformation recognition model.
  • the deformation recognition model as the model of the second stage of the virtual wearable network, can also be called the Warp model.
  • the Warp model may include two feature extractors (ie, encoders), which are a first feature extractor and a second feature extractor, respectively.
  • The first feature extractor is set to extract, from the target human body key point information and the human body wearing area information, the features related to the target human body, to generate the first feature map;
  • the second feature extractor is set to extract the relevant features of the target wearable to generate the second feature map.
  • the structure of these two feature extractors is the same, but the weights are not shared.
  • Exemplarily, the structure of the feature extractor can be as shown in Table 1 below, including an input layer (Input) and six residual layers (ResBlock), with the output size of each layer listed:
  • Table 1 (feature extractor): Input, 1024*768*N; ResBlock, 512*384*32; ResBlock, 256*192*64; ResBlock, 128*96*128; ResBlock, 64*48*512; ResBlock, 32*24*512; ResBlock, 16*12*512.
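  • As an illustration, a downsampling residual block consistent with the Table 1 layout could look like the sketch below; the exact internals of the ResBlock used in this embodiment are not specified, so the normalization and activation choices here are assumptions.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Illustrative downsampling residual block: each block halves the spatial size
    and changes the channel count, matching the progression listed in Table 1."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.InstanceNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=2)  # match shape for the residual add
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

# Stacking per Table 1: 1024*768*N -> 512*384*32 -> 256*192*64 -> 128*96*128
# -> 64*48*512 -> 32*24*512 -> 16*12*512.
```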
  • In addition, the Warp model can also include a Spatial Transformer Networks (STN) sub-network. The first feature map extracted by the first feature extractor and the second feature map extracted by the second feature extractor are both used as input features of the STN sub-network; the STN sub-network is set to perform the related spatial transformation processing based on the first feature map and the second feature map, including various scaling, translation, rotation, and other transformations, and to output the deformation information of the target wearable, that is, the warp parameters. In other words, a warp operation is performed on the target wearable to obtain the appearance of the target wearable worn on the target human body.
  • For example, as shown in FIG. 12, the features input to the Warp model include, from top to bottom, the second target image, the target human body key point information, and the human body wearing area information.
  • the key point information of the target human body and the wearing area information of the human body are subjected to feature extraction through the first feature extractor
  • the second target image is subjected to feature extraction through the second feature extractor.
  • Both the first feature extractor and the second feature extractor output their extraction results to the STN sub-network, and the STN sub-network outputs the warp parameters of the deformed target clothes.
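  • The following is a minimal STN-style sketch of this stage, assuming an affine parameterization of the warp for simplicity; the actual Warp model may regress a richer transformation, and the class and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STNWarp(nn.Module):
    """Minimal STN-style sketch: fuse the two feature maps, regress warp parameters,
    and warp the garment image. An affine transform is assumed here purely for
    illustration; the embodiment only says the STN outputs the warp parameters."""

    def __init__(self, feat_channels: int = 512):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Conv2d(2 * feat_channels, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 6),  # 2x3 affine matrix
        )
        # Start from the identity transform so training begins with "no warp".
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, person_feat, garment_feat, garment_img):
        theta = self.localization(torch.cat([person_feat, garment_feat], dim=1))
        theta = theta.view(-1, 2, 3)
        grid = F.affine_grid(theta, garment_img.size(), align_corners=False)
        warped = F.grid_sample(garment_img, grid, align_corners=False)
        return warped, theta
```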
  • When training the Warp model, the loss function used may include a perceptual loss function (Perceptual loss) and an L1 loss function (L1_loss), namely:
  • Warp Loss = Perceptual loss + L1_loss
  • where E denotes the mean, X is the input of the Warp model, Y is the second target image, VGG is the VGG model (such as VGG-19 or VGG-16), and W is the Warp model; the perceptual loss is computed on the VGG features of W(X) and Y, and the L1 loss is computed on W(X) and Y directly.
  • the Adam optimizer can also be used for the training of the Warp model.
  • During this training, the wearing area generation model is not trained; the learning rate can be set to 0.0005, and 100 epochs are trained.
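  • A minimal sketch of the Warp loss described above, assuming VGG-19 features and equal weighting of the perceptual and L1 terms (the layer choice and weighting are assumptions):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class WarpLoss(nn.Module):
    """Perceptual (VGG feature) loss + L1 loss between the warped garment W(X) and
    the garment image Y, as described for training the Warp model. The VGG-19 layer
    cut-off and the equal weighting are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)   # the VGG feature extractor stays frozen
        self.vgg = vgg
        self.l1 = nn.L1Loss()

    def forward(self, warped: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        perceptual = self.l1(self.vgg(warped), self.vgg(target))
        return perceptual + self.l1(warped, target)

# Training setup mentioned in the text: Adam optimizer, lr = 0.0005, 100 epochs.
```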
  • Step 370: input the target human body key point information, the wearable mask image, and the deformation information of the target wearable output by the deformation recognition model into the generation model, which processes them to generate the wearing effect map when the target wearable is worn on the target human body.
  • the deformation information of the target clothing output by the Warp model can be input into the generation model together with the key point information of the target human body and the mask image of the clothing as the input features of the generation model.
  • The generation model, as the third-stage model of the virtual wearing network, is set to output the wearing effect map of the target human body wearing the target wearable.
  • In one implementation, the generation model may include an encoder (Encoder) and a decoder (Decoder), wherein the Encoder is configured to perform feature extraction and output to the Decoder the third feature map corresponding to the target human body and the style attribute information of the deformed target wearable; the Decoder is configured to perform decoding according to the third feature map and the style attribute information, and generate the wearing effect map when the target wearable is worn on the target human body.
  • As shown in FIG. 13, the part in the dashed box on the left is the Encoder, and the part in the dashed box on the right is the Decoder.
  • the structure of the Encoder may include: an input layer, several residual layers, and a fully connected layer, wherein the residual layer is set to extract a third feature map related to the target human body and output it to the corresponding layer of the decoder
  • the fully connected layer is set to extract the style attribute information of the deformed target wearable, and output the style attribute information to multiple layers of the decoder.
  • The style attribute information is a latent variable (latent code).
  • the structure of the Encoder is shown in Table 2 below.
  • In Table 2 there are six residual layers (ResBlock), and the size of the third feature map (Feature map) output by each residual layer is specified, for example 512*384*32, 256*192*64, and so on.
  • the fully connected layer FC outputs style attribute information of size 18*512.
  • The third feature map extracted by each residual layer is, on the one hand, output to the next layer for processing and, on the other hand, also output to the corresponding layer of the Decoder (except for the last residual layer, which only outputs its result to the corresponding layer of the Decoder).
  • The corresponding layer here refers to the decoding layer that matches the size of the currently output third feature map; for example, if the size of the currently output third feature map is 32*24*512, the corresponding layer in the Decoder refers to the decoding layer capable of processing a 32*24*512 feature map.
  • Of the two output layers on the far right of the Encoder, the upper one is the last residual layer (ResBlock), which outputs a feature map of size 16*12*512; the lower one is the FC layer, whose output is the 18*512 style attribute information.
  • the FC layer outputs the style attribute information to each layer of the Decoder, so that the Decoder can generate a wearing effect map according to the style attribute information.
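  • The following sketch illustrates this Encoder layout, assuming a generic ResBlock that halves the spatial size, a 1024*768 input, and the 18*512 style code described above; the input channel count and block internals are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GeneratorEncoder(nn.Module):
    """Sketch of the Encoder described above: a stack of residual blocks whose
    intermediate feature maps are forwarded to the matching Decoder layers, plus a
    fully connected head that emits an 18x512 style code. The ResBlock constructor
    is passed in because its exact internals are not specified here."""

    def __init__(self, res_block, in_ch: int = 6, widths=(32, 64, 128, 512, 512, 512)):
        super().__init__()
        blocks, prev = [], in_ch          # in_ch: assumed channel count of the stacked inputs
        for w in widths:
            blocks.append(res_block(prev, w))   # each ResBlock downsamples by 2
            prev = w
        self.blocks = nn.ModuleList(blocks)
        self.fc = nn.Linear(512 * 16 * 12, 18 * 512)  # style attribute information (18x512)

    def forward(self, x):
        skips = []
        for block in self.blocks:
            x = block(x)
            skips.append(x)               # forwarded to the Decoder layer of matching size
        style = self.fc(x.flatten(1)).view(-1, 18, 512)
        return skips, style
```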
  • the network structure of the Decoder can be the structure of the synthesis network of StyleGAN2.
  • StyleGAN2 consists of two parts, including the mapping network (Mapping Network) on the left in Figure 14 and the synthesis network on the right.
  • The Mapping Network can better disentangle the input.
  • the Mapping NetWork consists of 8 fully connected layers (FC), whose input is Gaussian noise (latent Z), and the hidden variable (W) is obtained through the Mapping NetWork.
  • the synthesis network consists of modules such as learnable affine transformation A, modulation module Mod-Demod, and upsampling Upsample.
  • The synthesis network also includes a weight (w), a bias (b), and a constant input (c, that is, Const 4*4*512, representing a learnable constant), and the activation function (Leaky ReLU) is always applied immediately after adding the bias.
  • the learnable affine transformation A can be composed of a fully connected layer; Upsample can use deconvolution (also called transposed convolution) for upsampling operations.
  • In the modulation module, the convolution weights are modulated as w'_ijk = s_i · w_ijk, where s_i is the scaling of the i-th input feature map; the demodulation (demod) of the weights aims to restore the output to unit standard deviation, that is, the weights of the new convolutional layer are w''_ijk = w'_ijk / sqrt(Σ_{i,k} (w'_ijk)² + ε).
  • the far right in Figure 14 is the injection of random noise, and B is a learnable noise parameter.
  • The purpose of introducing random noise is to make the generated image more realistic and natural.
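  • A minimal sketch of the Mod-Demod step described above, following the standard StyleGAN2 formulation (the grouped convolution over the batch is an implementation convenience, not something prescribed by this embodiment):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """StyleGAN2-style Mod-Demod: scale the convolution weights per sample by the
    style s_i, then demodulate them back toward unit standard deviation."""

    def __init__(self, in_ch: int, out_ch: int, style_dim: int, k: int = 3, eps: float = 1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k))
        self.affine = nn.Linear(style_dim, in_ch)   # learnable affine transform "A"
        self.eps = eps

    def forward(self, x, style):
        b, c, h, w = x.shape
        s = self.affine(style).view(b, 1, c, 1, 1)           # s_i per input channel
        weight = self.weight.unsqueeze(0) * s                # modulation: w' = s_i * w
        demod = torch.rsqrt((weight ** 2).sum(dim=(2, 3, 4)) + self.eps)
        weight = weight * demod.view(b, -1, 1, 1, 1)         # demodulation
        weight = weight.view(b * weight.size(1), c, *self.weight.shape[2:])
        x = x.view(1, b * c, h, w)                           # grouped conv over the batch
        out = F.conv2d(x, weight, padding=self.weight.shape[-1] // 2, groups=b)
        return out.view(b, -1, h, w)
```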
  • When training the generation model, the loss functions used may include the generative adversarial network loss function GAN_loss, the perceptual loss function Perceptual loss, and the L1 loss function L1_loss, where
  • GAN_loss = E[(D(G(x)) - 1)^2] + E[D(G(x))^2]
  • where E represents the mean value, D is the discriminator, G(x) represents the wearing effect map output by the generation model, x represents the input of the generation model, and Y represents the wearing effect map in the sample.
  • The GAN loss is used to make the results generated by the generation model more realistic.
  • the Adam optimizer can also be used for the training of the generative model.
  • the learning rate is set to 0.0005, and 100 Epochs are trained.
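  • A minimal sketch of least-squares-style adversarial terms consistent with the GAN_loss above; splitting them into separate generator and discriminator terms, and the equal weighting with the perceptual and L1 terms, are assumptions.

```python
import torch

def generator_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Least-squares-style adversarial term pushing D(G(x)) toward 1."""
    return ((d_fake - 1) ** 2).mean()

def discriminator_adv_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Least-squares-style discriminator term: real samples toward 1, generated toward 0."""
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

# Total generator objective sketched from the text (weights assumed equal):
#   loss_G = generator_adv_loss(D(G(x))) + perceptual(G(x), Y) + l1(G(x), Y)
# Training setup mentioned in the text: Adam optimizer, lr = 0.0005, 100 epochs.
```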
  • For the generation model as a whole, its input features can include the target human body key point information, the wearable mask image, and the deformation information of the target wearable output by the deformation recognition model, and it outputs the wearing effect map when the target wearable is worn on the target human body.
  • the virtual wearable network realizes wearing the target wearable on the target human body through the wearing area generation model, the deformation recognition model and the generation model.
  • The wearing area generation model is responsible for predicting, according to the target human body key point information, the human body parsing result after erasing the original wearable, and the wearable mask image after erasing the original wearable, the human body wearing area when the target human body wears the target wearable, and for outputting the corresponding human body wearing area information.
  • the deformation recognition model is responsible for determining the deformation information of the target wearing object relative to the human body posture according to the information of the wearing area of the human body, the key point information of the target human body, and the second target image containing the target wearing object, that is, obtaining the deformed target wearing object.
  • the generation model is responsible for pasting the deformed target wearable on the body of the target human body whose original wear has been erased according to the above-mentioned deformation information, key point information of the target human body, and the mask image of the wearable, to generate a wearing rendering.
  • the above three models have strong generalization ability and good robustness, so that the output wearing renderings can take both wearing effect and authenticity into consideration.
  • Fig. 16 is a structural block diagram of an embodiment of a virtual wearable device provided by an embodiment of the present application, which may include the following modules:
  • the first target image acquisition module 410 is configured to acquire the first target image including the target human body
  • the second target image acquisition module 420 is configured to acquire a second target image including the target wearable
  • the human body characteristic information acquisition module 430 is configured to acquire human body characteristic information based on the first target image, the human body characteristic information includes target human key point information related to the target wearable, human body analysis results, and wearable mask images ;
  • the wearing effect map generating module 440 is configured to input the human body characteristic information and the second target image into the pre-trained virtual wearing network, so that the virtual wearing network determines the human body wearing area information based on the human body characteristic information, determines the deformation information of the target wearable according to the human body wearing area information and the second target image, and generates a wearing effect map for output according to the deformation information and the human body characteristic information; wherein the virtual wearing network is a generative adversarial network.
  • In an embodiment, the human body feature information acquisition module 430 is set to: input the first target image into a pre-trained human body key point detection model, so that the human body key point detection model performs key point detection on the target human body in the first target image and outputs the corresponding human body key point information; and
  • determine, from the human body key point information, the target key point information related to the target wearable.
  • In an embodiment, the human body feature information acquisition module 430 is further set to: input the first target image into a pre-trained human body parsing model, so that the human body parsing model performs human body parsing on the target human body in the first target image and outputs a corresponding preliminary human body parsing result; and
  • draw a wearable mask in the preliminary human body parsing result based on the target key points, to generate the human body parsing result.
  • In an embodiment, the human body feature information acquisition module 430 is further set to: combine the target key point information and the preliminary human body parsing result to draw a wearable mask in the first target image, generating the wearable mask image.
  • the virtual wearable network includes a wearable region generation model
  • the wearable effect map generation module 440 may include the following submodules:
  • the wearing area generation model processing submodule is configured to input the target human body key point information, the human body parsing result, and the wearable mask image into the wearing area generation model, so that the wearing area generation model predicts the human body wearing area when the target human body wears the target wearable and outputs the corresponding human body wearing area information.
  • the wearing area generation model is a model including a U-NET network structure; when training the wearing area generation model, the loss function used includes a cross-entropy loss function and a dice loss function.
  • the virtual wear network also includes a deformation recognition model
  • the wear effect map generation module 440 may include the following submodules:
  • the deformation recognition model processing submodule is configured to input the second target image, the target human body key point information, and the human body wearing area information output by the wearing area generation model into the deformation recognition model, so that the deformation recognition model generates a first feature map according to the human body wearing area information and the human body key point information, generates a second feature map according to the second target image, and determines the deformation information of the target wearable based on the first feature map and the second feature map.
  • the deformation recognition model includes a first feature extractor, a second feature extractor, and a space transformation sub-network;
  • the first feature extractor is configured to output the first feature map to the space transformation sub-network according to the human body wearing area information and the human body key point information;
  • the second feature extractor is configured to output the second feature map to the spatial transformation sub-network according to the second target image
  • the space transformation sub-network is configured to perform related space transformation processing based on the first feature map and the second feature map, and output deformation information of the target wearable.
  • the loss functions used when training the deformation recognition model include a perceptual loss function and an L1 loss function.
  • the virtual wearable network also includes a generation model
  • the wearable effect map generating module 440 may include the following submodules:
  • the generation model processing sub-module is configured to input the target human body key point information, the wearable mask image, and the deformation information of the target wearable output by the deformation recognition model into the generation model, which performs processing to generate the wearing effect map when the target wearable is worn on the target human body.
  • the generation model includes an encoder and a decoder; the encoder is configured to perform feature extraction and output to the decoder the third feature map corresponding to the target human body and the style attribute information of the deformed target wearable;
  • the decoder is configured to perform decoding processing according to the third feature map and the style attribute information, and generate a wearing effect map when the target wearable is worn on the target human body.
  • the network structure of the decoder is the structure of the synthesis network of StyleGAN2;
  • the loss functions used when training the generation model include the generative adversarial network loss function, the perceptual loss function, and the L1 loss function.
  • a virtual wearable device provided in an embodiment of the present application can execute a virtual wearable method in the foregoing embodiments of the present application, and has corresponding functional modules and beneficial effects for executing the method.
  • Fig. 17 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in Fig. 17, the electronic device includes a processor 510, a memory 520, an input device 530, and an output device 540; the number of processors 510 in the electronic device can be one or more, and one processor 510 is taken as an example in Fig. 17; the processor 510, the memory 520, the input device 530, and the output device 540 in the electronic device can be connected by a bus or in other ways, and connection by a bus is taken as an example in Fig. 17.
  • the memory 520 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs and modules, such as program instructions/modules corresponding to the above-mentioned embodiments in the embodiments of the present application.
  • the processor 510 executes various functional applications and data processing of the electronic device by running software programs, instructions and modules stored in the memory 520 , that is, implements the methods mentioned in the above method embodiments.
  • the memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the terminal, and the like.
  • the memory 520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices.
  • the memory 520 may include a memory that is remotely located relative to the processor 510, and these remote memories may be connected to a device/terminal/server through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 530 may be configured to receive input numbers or character information, and generate key signal input related to user settings and function control of the electronic device.
  • the output device 540 may include a display device such as a display screen.
  • the embodiment of the present application also provides a storage medium containing computer-executable instructions, and the computer-executable instructions are configured to execute the methods of the above-mentioned method embodiments when executed by a computer processor.
  • the computer readable storage medium may be a non-transitory computer readable storage medium.
  • For a storage medium containing computer-executable instructions provided in the embodiments of the present application, the computer-executable instructions are not limited to the above-mentioned method operations, and can also perform related operations in the methods provided in any embodiment of the present application.
  • Embodiment 7 of the present application also provides a computer program product, the computer program product includes computer-executable instructions, and the computer-executable instructions are used to execute the method in any one of the above-mentioned method embodiments when executed by a computer processor .
  • In the above embodiments, the multiple units and modules included are only divided according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the multiple functional units are only for the convenience of distinguishing them from each other and are not used to limit the protection scope of the present application.

Abstract

The present application discloses a virtual wearing method, apparatus, device, storage medium and program product, wherein the method includes: acquiring a first target image containing a target human body; acquiring a second target image containing a target wearable; acquiring human body feature information based on the first target image, the human body feature information including target human body key point information related to the target wearable, a human body parsing result and a wearable mask image; inputting the human body feature information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing area information based on the human body feature information, determines deformation information of the target wearable according to the human body wearing area information and the second target image, and generates a wearing effect map for output according to the deformation information and the human body feature information; wherein the virtual wearing network is a generative adversarial network.

Description

Virtual wearing method, apparatus, device, storage medium and program product
This application claims priority to the Chinese patent application No. 202111356765.5 filed with the Chinese Patent Office on November 16, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of data processing, for example to a virtual wearing method, a virtual wearing apparatus, an electronic device, a computer-readable storage medium and a computer program product.
Background
With the development of the Internet, online shopping has become more and more popular. Compared with offline shopping, however, online shopping suffers from some poor-experience problems; for example, purchased clothes cannot be tried on and the effect is unknown, which leads to a high return rate. The purpose of virtual fitting technology is to provide a virtual try-on scene and bring users a better experience. Virtual fitting is an important technical direction in the field of computer vision and can be widely used in e-commerce platforms to improve user experience.
The virtual fitting technology mentioned in the related art mainly reconstructs a 3D human body and warps 3D clothes onto the reconstructed 3D human body. However, 3D clothes are relatively difficult to obtain, and if the reconstructed 3D human body is not realistic enough, the try-on effect is affected. Therefore, it is relatively difficult for the virtual fitting technology mentioned in the related art to balance the fitting effect and authenticity.
Summary
The present application provides a virtual wearing method, apparatus, device, storage medium and program product, so as to avoid the situation in the related art in which the virtual fitting technology can hardly balance the fitting effect and authenticity.
In a first aspect, an embodiment of the present application provides a virtual wearing method, the method including:
acquiring a first target image containing a target human body;
acquiring a second target image containing a target wearable;
acquiring human body feature information based on the first target image, the human body feature information including target human body key point information related to the target wearable, a human body parsing result and a wearable mask image;
inputting the human body feature information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing area information based on the human body feature information, determines deformation information of the target wearable according to the human body wearing area information and the second target image, and generates a wearing effect map for output according to the deformation information and the human body feature information; wherein the virtual wearing network is a generative adversarial network.
In a second aspect, an embodiment of the present application further provides a virtual wearing apparatus, the apparatus including:
a first target image acquisition module, configured to acquire a first target image containing a target human body;
a second target image acquisition module, configured to acquire a second target image containing a target wearable;
a human body feature information acquisition module, configured to acquire human body feature information based on the first target image, the human body feature information including target human body key point information related to the target wearable, a human body parsing result and a wearable mask image;
a wearing effect map generation module, configured to input the human body feature information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing area information based on the human body feature information, determines deformation information of the target wearable according to the human body wearing area information and the second target image, and generates a wearing effect map for output according to the deformation information and the human body feature information; wherein the virtual wearing network is a generative adversarial network.
In a third aspect, an embodiment of the present application further provides an electronic device, the electronic device including:
one or more processors;
a storage means configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method of the first aspect above.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method of the first aspect above is implemented.
In a fifth aspect, an embodiment of the present application further provides a computer program product, the computer program product including computer-executable instructions which, when executed, are configured to implement the method of the first aspect above.
Brief Description of the Drawings
Fig. 1 is a flow chart of an embodiment of a virtual wearing method provided by an embodiment of the present application;
Fig. 2 is a flow chart of an embodiment of a virtual wearing method provided by another embodiment of the present application;
Fig. 3 is a schematic diagram of a first target image containing a target human body provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of human body key points obtained after key point detection of the first target image provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of a preliminary human body parsing result obtained after performing human body parsing on the first target image provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of a wearable mask image obtained after erasing the clothes of the target human body in the first target image provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of a human body parsing result obtained after erasing clothes from the preliminary human body parsing result provided by an embodiment of the present application;
Fig. 8 is a flow chart of an embodiment of a virtual wearing method provided by another embodiment of the present application;
Fig. 9 is a schematic diagram of a virtual wearing network architecture provided by an embodiment of the present application;
Fig. 10 is a schematic diagram of the architecture of a wearing area generation model provided by an embodiment of the present application;
Fig. 11 is a schematic diagram of an input/output implementation scenario of a wearing area generation model provided by an embodiment of the present application;
Fig. 12 is a schematic diagram of an input/output implementation scenario of a Warp model provided by an embodiment of the present application;
Fig. 13 is a schematic diagram of a generation model architecture provided by an embodiment of the present application;
Fig. 14 is a schematic diagram of a StyleGAN2 model architecture provided by an embodiment of the present application;
Fig. 15 is a schematic diagram of an input/output implementation scenario of a generation model provided by an embodiment of the present application;
Fig. 16 is a structural block diagram of an embodiment of a virtual wearing apparatus provided by an embodiment of the present application;
Fig. 17 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
Fig. 1 is a flow chart of an embodiment of a virtual wearing method provided by an embodiment of the present application. The method can be implemented by a virtual wearing apparatus, where the virtual wearing apparatus can be integrated into an APP or a Web page according to its development documentation, so as to implement the virtual wearing function in the APP or Web page. The terminal where the APP or Web page is located may include a mobile phone, a tablet computer, a fitting robot, and the like.
The wearables for virtual wearing in this embodiment may include clothes, trousers, shoes, socks, jewelry, and so on. For ease of understanding, the following embodiments all use clothes as an example to describe the virtual fitting scene.
This embodiment can be applied to the virtual wearing function in scenarios such as e-commerce platforms, short-video entertainment, image processing, film production, live broadcast, and games. For example, on an e-commerce platform, after a user selects clothes, the user can upload a photo containing the person who is to try on the clothes, and through the virtual fitting function the user can directly see the wearing effect map of the selected clothes on that person. As another example, given a video, specify the person in the video who needs to try on clothes and the clothes to be tried on; through the virtual fitting function in the video application, the clothes of the specified person in the video can then be changed into the clothes to be tried on.
As shown in Fig. 1, this embodiment may include the following steps:
Step 110: acquire a first target image containing a target human body.
In one example, the first target image may include an image imported via the virtual wearing function page. For example, after the user triggers the virtual fitting function and enters the virtual fitting function page, the first target image can be imported through the import interface in the page. The first target image is an image containing the target human body on which the try-on is to be performed; the target human body can be the user or another person, and the first target image can be a selfie or a non-selfie image, which is not limited in this embodiment.
In another example, the first target image may further include multiple image frames containing the target human body in a target video. For example, in a live-broadcast scene, when the user triggers the virtual fitting function in the live interface and specifies the person who needs to try on clothes, the image frames containing the specified person in the live-broadcast scene can be used as the first target image.
It should be noted that the target human body in the first target image needs to preserve the frontal features of the human body as completely as possible, at least the frontal features of the body parts related to the target wearable.
Step 120: acquire a second target image containing the target wearable.
Exemplarily, the second target image may be an image uploaded by the user that contains the target wearable; or the second target image may be an image selected by the user from the sequence of wearable images displayed on the current APP or Web page; or the second target image may be an image generated by the user selecting a person in a video and then extracting the target wearable from that person. This embodiment does not limit the acquisition method of the second target image.
It should be noted that the target wearable in the second target image needs to retain important features such as the texture and shape of the wearable as much as possible.
After the first target image and the second target image are obtained, the two images can be processed into a uniform size, for example by means of central equal-ratio cropping and proportional scaling.
Step 130: acquire human body feature information based on the first target image, the human body feature information including target human body key point information related to the target wearable, a human body parsing result, and a wearable mask image.
The target human body key point information refers to the detection result, related to the target wearable, of human body parts obtained by performing key point detection on the target human body in the first target image.
Exemplarily, the target human body key point information may be the key point information of the human body parts related to the target wearable, selected after detecting the key point information of the whole human body; or the target human body key point information may be the key point information obtained by directly performing key point detection on the human body parts related to the target wearable, such as the head, neck, shoulders, and hands, which is not limited in this embodiment.
The human body parsing result refers to the result obtained by performing human body parsing on the target human body in the first target image. Human body parsing segments multiple parts of the human body and is a fine-grained semantic segmentation task. For example, after human body parsing, the target human body can be segmented into hair, face, clothes, trousers, limbs, and other parts.
The wearable mask image refers to an image obtained by occluding, in the first target image, the wearable area of the target human body related to the target wearable. For example, assuming that the target wearable is clothes, the wearable mask image refers to the image generated after occluding the clothes in the first target image. The wearable mask image obtained after occluding the wearable area related to the target wearable is a human body image unrelated to the target wearable.
Step 140: input the human body feature information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing area information based on the human body feature information, determines deformation information of the target wearable according to the human body wearing area information and the second target image, and generates a wearing effect map for output according to the deformation information and the human body feature information.
In this step, the virtual wearing network can be a pre-trained model, which can be a kind of Generative Adversarial Network (GAN). Generation means that a model learns from some data and then generates similar data; for example, letting a machine look at some pictures of animals and then produce pictures of animals by itself is generation. A generative adversarial network is a deep learning model that produces fairly good output through the mutual game learning of (at least) two modules in the framework: a generative model (Generative Model) and a discriminative model (Discriminative Model). The discriminative model takes input variables and predicts with a certain model; the generative model is given some latent information and randomly generates observation data.
After the human body feature information is obtained through analysis of the first target image, the human body feature information and the second target image can both be input into the virtual wearing network, which performs the virtual try-on processing and outputs the wearing effect map of the target human body wearing the target wearable.
In the virtual wearing network, the human body wearing area information can first be determined based on the human body feature information; the human body wearing area information refers to the area of the target human body on which the target wearable is worn, determined in combination with the posture of the target human body. Then, the deformation information of the target wearable can be determined by combining the human body wearing area information and the second target image; the deformation information refers to how the target wearable needs to be twisted and deformed to match the specific area of the target human body. Next, according to the deformation information and the human body feature information, the deformed target wearable can be warped onto the target human body, the original wearable corresponding to the target wearable can be erased, and the wearing effect map can be generated for output.
In this embodiment, after obtaining the first target image containing the target human body and the second target image containing the target wearable provided by the user, the human body feature information related to the target human body, such as the target human body key point information, the human body parsing result, and the wearable mask image, can be extracted from the first target image; the human body feature information and the second target image are then processed by the virtual wearing network, which outputs the wearing effect map of the target human body wearing the target wearable. For the user, it is only necessary to specify the target human body and the target wearable to obtain the wearing effect map of the target human body wearing the target wearable; the operation is simple and fast, which improves the user experience.
In the virtual wearing network, processing is divided into three steps: first, the human body wearing area information of the target wearable on the target human body is determined based on the human body feature information; then, the deformation information of the target wearable is determined by combining the human body wearing area information and the second target image; finally, a wearing effect map is generated for output according to the deformed target wearable and the human body feature information. The virtual wearing network used in this embodiment is a generative adversarial network; through the above three-step process, the output wearing effect map can take both effect and authenticity into account, improving the user experience. Moreover, the whole process is two-dimensional image processing; compared with 3D approaches, there is no need to reconstruct a 3D human body and 3D clothes, which reduces the difficulty and cost of implementation.
图2为本申请另一实施例提供的一种虚拟穿戴的方法实施例的流程图,本实施例在前述实施例的基础上,对人体特征信息的获取过程进行更具体的说明,可以包括如下步骤:
步骤210,获取包含目标人体的第一目标图像。
步骤220,获取包含目标穿戴物的第二目标图像。
步骤230,将所述第一目标图像输入至预训练的人体关键点检测模型,以由所述人体关键点检测模型对所述第一目标图像中的目标人体进行关键点检测,输出对应的人体关键点信息。
人体关键点是指人体的多个部位的关键位置点,人体关键点对于描述人体姿态,预测人体行为至关重要。人体关键点主要包括左、右手臂的三个点(手腕、手肘、肩膀),左、右腿的三个点(脚腕、膝盖、胯骨),髋、臀部点以及头部点(眼睛,下巴,头顶)等。
在该步骤中,可以通过人体关键点检测模型来对目标人体进行关键点检测。例如,人体关键点检测也称人体姿态估计,其任务是要在给定的图片中定位人体的身体关键部件,例如头部、颈部、肩部、手部等。在不同数据集上,需要检测的具体部位不同,检测出的关键点的数量也不同。
在一种实现中,人体关键点检测模型可以是一种基于深度学习的模型,根据不同的穿戴物,可以训练不同的人体关键点检测模型,以提取与该穿戴物匹配的人体关键点。本实施例对人体关键点检测模型的训练过程不作限定,本领域技术人员可以根据训练目标采用通用的人体关键点检测模型的训练方法进行模型拟合。
在另一种实现中,人体关键点检测模型还可以是预训练的经过多次推理检测、精度较高的无差别的关键点检测模型。例如,将第一目标图像输入一个相关技术中的无差别检测的人体关键点检测模型,得到每一人体关键点对应的概率分布图。其中,根据实际的处理情况和网络结构,可以对第一目标图像进行不同的采样处理,例如,第一目标图像是大小为3*256*256,经过三次下采样和卷积操作将第一目标图像预处理为n*32*32的图像。接着,将n*32*32的图像输入一个预设的沙漏网络(Hourglass),进行升采样处理和卷积操作,得到对应的热力图(heatmap),将该热力图对应的结果确定为人体关键点检测结果。
For example, for the first target image containing the target human body shown in FIG. 3, the human keypoint diagram shown in FIG. 4 can be obtained after keypoint detection by the keypoint detection model.
Step 240: determining target keypoint information related to the target wearable item from the human keypoint information.
In one implementation, if the human keypoint information is keypoint information of the specified human body parts matching the target wearable item, all of the keypoints may be used as target keypoints. If the human keypoint information covers all or most human body parts, the keypoints of the specified human body parts matching the target wearable item may be selected as the target keypoints.
Step 250: inputting the first target image into a pre-trained human parsing model, so that the human parsing model performs human parsing on the target human body in the first target image and outputs a corresponding preliminary human parsing result.
Human parsing aims to precisely locate the human body and divide it into multiple pixel-level semantic regions. For example, through human parsing the human body can be divided into body parts and clothing. Exemplarily, the human parsing model may include a Human Parsing model; for example, for the first target image of FIG. 3, the preliminary human parsing result output by the Human Parsing model may be as shown in FIG. 5.
Step 260: drawing a wearable item mask in the first target image by combining the target keypoint information and the preliminary human parsing result, to generate a wearable item mask image.
To reduce the influence of the original wearable item, this step may erase the original wearable item. In implementation, the wearable item mask can be drawn from the target keypoint information and the preliminary human parsing result. For example, in a virtual fitting scenario the target keypoints include arm keypoints, and an elliptical image mask can be drawn based on the arm keypoints; the ellipse needs to be larger than the extent of the original arm, and its size can be determined empirically. For the body part, a rectangular mask can be drawn based on the body parts in the preliminary human parsing result. The masks of the two arms are then connected with the body mask into one complete set of masks, which is finally processed by dilation and erosion to obtain the result of erasing the clothes from the human body, i.e., the wearable item mask image. For example, for the first target image of FIG. 3, the wearable item mask image generated by occluding its clothes may be as shown in FIG. 6.
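Purely as an illustrative sketch of this mask-drawing idea (the ellipse padding, kernel size, keypoint layout and function names are all assumptions rather than values from the original text), an OpenCV version could look like this:

```python
import cv2
import numpy as np

def build_garment_mask(img_shape, arm_keypoints, body_box, pad=20):
    """img_shape: (H, W); arm_keypoints: list of integer (x, y) points along the arms;
    body_box: (x0, y0, x1, y1) torso rectangle taken from the parsing result."""
    mask = np.zeros(img_shape[:2], dtype=np.uint8)

    # Elliptical masks around the arm segments, deliberately larger than the arm.
    for (x0, y0), (x1, y1) in zip(arm_keypoints[:-1], arm_keypoints[1:]):
        center = ((x0 + x1) // 2, (y0 + y1) // 2)
        half_len = int(np.hypot(x1 - x0, y1 - y0)) // 2 + pad
        angle = float(np.degrees(np.arctan2(y1 - y0, x1 - x0)))
        cv2.ellipse(mask, center, (half_len, pad), angle, 0, 360, 255, -1)

    # Rectangular mask over the torso region from the parsing result.
    cv2.rectangle(mask, body_box[:2], body_box[2:], 255, -1)

    # Dilation followed by erosion merges the parts into one smooth mask.
    kernel = np.ones((15, 15), np.uint8)
    mask = cv2.erode(cv2.dilate(mask, kernel), kernel)
    return mask
```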
Step 270: drawing a wearable item mask in the preliminary human parsing result based on the target keypoints, to generate a human parsing result.
Similar to the processing of step 260, the original wearable item region corresponding to the target wearable item also needs to be erased from the preliminary human parsing result, to generate a human parsing result independent of the original wearable item. During processing, the target human body keypoint information can be superimposed on the preliminary human parsing result, the corresponding masks are drawn, and the drawn masks are then set to the background color. For example, in a virtual fitting scenario, the masks formed by connecting the drawn arm masks, body mask, etc. are processed into the background color, and the generated human parsing result is shown in FIG. 7.
In other embodiments, the masks obtained in step 260 may also be directly superimposed on the preliminary human parsing result and processed into the background color.
Step 280: inputting the target keypoint information, the human parsing result, the wearable item mask image, and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing region information based on the target keypoint information, the human parsing result, and the wearable item mask image, determines deformation information of the target wearable item according to the human body wearing region information and the second target image, and generates and outputs a wearing effect image according to the deformation information, the target keypoint information, and the wearable item mask image.
After the target keypoint information, the wearable item mask image, and the human parsing result are obtained, these three together with the second target image can be used as input features and fed into the pre-trained virtual wearing network, which performs the virtual wearing processing and outputs the corresponding wearing effect image.
In this embodiment, the input features of the virtual wearing network include the target keypoint information of the human body parts related to the target wearable item, the wearable item mask image obtained by erasing from the target human body the wearable item region corresponding to the target wearable item, and the human parsing result obtained by erasing the corresponding wearable item region from the preliminary human parsing result. This expands the dimensions of the input features and retains the original features of the target human body and the target wearable item to the greatest extent, so that the wearing effect image output by the virtual wearing network is more realistic and provides a better wearing simulation effect.
FIG. 8 is a flowchart of another embodiment of a virtual wearing method provided by another embodiment of the present application. On the basis of the foregoing embodiments, this embodiment describes the virtual wearing processing performed by the virtual wearing network in more detail and may include the following steps:
Step 310: acquiring a first target image containing a target human body.
Step 320: acquiring a second target image containing a target wearable item.
Step 330: acquiring human body feature information based on the first target image, the human body feature information including target human body keypoint information related to the target wearable item, a human parsing result, and a wearable item mask image.
Step 340: inputting the human body feature information and the second target image into a pre-trained virtual wearing network, the virtual wearing network including a wearing region generation model, a deformation recognition model, and a generation model.
As shown in FIG. 9, the virtual wearing network includes a wearing region generation model, a deformation recognition model, and a generation model. The wearing region generation model is configured to determine the human body wearing region information based on the human body feature information; the deformation recognition model is configured to determine the deformation information of the target wearable item according to the human body wearing region information and the second target image; the generation model is configured to generate and output the wearing effect image according to the deformation information and the human body feature information. The three models are described in detail in the subsequent steps.
Step 350: in the virtual wearing network, inputting the target human body keypoint information, the human parsing result, and the wearable item mask image into the wearing region generation model, so that the wearing region generation model predicts the human body wearing region when the target human body wears the target wearable item and outputs corresponding human body wearing region information.
As the first-stage model of the virtual wearing network, the wearing region generation model may also be called the wearable item mask generation network. It may be a model containing a U-NET network structure (the U-NET structure is a symmetric model structure), as shown in FIG. 10, including an encoder on the left and a decoder on the right.
The input features of the wearing region generation model include the target human body keypoint information, the human parsing result, and the wearable item mask image. The wearing region generation model determines human posture information from the human keypoint information, and then combines the human posture information with the wearable item mask image and the human parsing result to generate and output the region where the person wears the target wearable item (i.e., the human body wearing region information). For example, as shown in FIG. 11, the features input to the wearing region generation model, from top to bottom, include the target human body keypoint information, the wearable item mask image, and the human parsing result, and the wearing region generation model outputs the region where the person wears the target wearable item.
In one implementation, when training the wearing region generation model, the loss functions used may include a cross-entropy loss function (Cross-Entropy loss, CE_loss) and a dice loss function (Dice Loss, also called the set similarity metric loss function), where
the cross-entropy loss function is calculated as:
CE_loss = −(1/N) · Σ_{i=1..N} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]
where N denotes the batch size during training; y_i denotes the label; and p_i denotes the model prediction.
Dice Loss is calculated as:
Dice Loss = 1 − 2|X∩Y| / (|X| + |Y|)
where X denotes the label; Y denotes the prediction; and |X∩Y| denotes the intersection of the prediction and the label.
In an example, when training the wearing region generation model, the Adam optimizer may be used, with the learning rate set to 0.001, training for 20 epochs.
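A possible PyTorch rendering of these two losses and the stated optimizer settings is shown below; the model, data loader and tensor layout are placeholders, not part of the original disclosure.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    # pred, target: (B, 1, H, W) tensors with values in [0, 1].
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def region_loss(logits, target):
    pred = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target)   # CE_loss
    return ce + dice_loss(pred, target)                       # CE_loss + Dice Loss

# Training configuration stated in the text: Adam, lr = 0.001, 20 epochs.
# `region_model` and `loader` are assumed to be defined elsewhere.
# optimizer = torch.optim.Adam(region_model.parameters(), lr=1e-3)
# for epoch in range(20):
#     for pose, parsing, masked_person, target_mask in loader:
#         loss = region_loss(region_model(pose, parsing, masked_person), target_mask)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
```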
Step 360: inputting the second target image, the target human body keypoint information, and the human body wearing region information output by the wearing region generation model into the deformation recognition model, so that the deformation recognition model generates a first feature map according to the human body wearing region information and the target human body keypoint information, generates a second feature map according to the second target image, and determines the deformation information of the target wearable item based on the first feature map and the second feature map.
The human body wearing region information output by the wearing region generation model can be input into the deformation recognition model together with the second target image and the target human body keypoint information as its input features. As the second-stage model of the virtual wearing network, the deformation recognition model may also be called the Warp model.
In one implementation, the Warp model may include two feature extractors (i.e., encoders), namely a first feature extractor and a second feature extractor. The first feature extractor is configured to extract, from the target human body keypoint information and the human body wearing region information, the features related to the target human body, generating the first feature map; the second feature extractor is configured to extract the features related to the target wearable item, generating the second feature map. The two feature extractors have the same structure but do not share weights.
Exemplarily, the structure of the feature extractor may be as shown in Table 1 below, including an input layer (Input) and six residual layers (ResBlock):
Feature extractor
Input, 1024*768*N
ResBlock, 512*384*32
ResBlock, 256*192*64
ResBlock, 128*96*128
ResBlock, 64*48*512
ResBlock, 32*24*512
ResBlock, 16*12*512
Table 1
In addition, the Warp model may further include a spatial transformer network (Spatial Transformer Networks, STN) sub-network. The first feature map extracted by the first feature extractor and the second feature map extracted by the second feature extractor are both used as input features of the STN sub-network. The STN sub-network is configured to perform the relevant spatial transformation processing based on the first feature map and the second feature map, including various scaling, translation, rotation, and other transformations, and to output the deformation information of the target wearable item, i.e., the warp parameters; that is, the warp operation is performed on the target wearable item to obtain the appearance of the target wearable item when worn on the target human body.
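The following is a minimal sketch of the STN idea using a single affine transform (the original text mentions scaling, translation, rotation and other transforms, and other parameterizations such as thin-plate splines could equally be used; all layer sizes and names here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    """Predicts 2x3 affine warp parameters from the two feature maps and
    applies them to the garment image (a stand-in for the warp operation)."""
    def __init__(self, feat_channels=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat_channels, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 6))
        # Initialize the regression head to the identity transform.
        self.head[-1].weight.data.zero_()
        self.head[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, person_feat, garment_feat, garment_img):
        # person_feat / garment_feat: feature maps of identical spatial size.
        theta = self.head(torch.cat([person_feat, garment_feat], dim=1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, garment_img.size(), align_corners=False)
        warped = F.grid_sample(garment_img, grid, align_corners=False)
        return warped, theta   # theta plays the role of the warp parameters
```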
For example, as shown in FIG. 12, the features input to the Warp model, from top to bottom, include the second target image, the target human body keypoint information, and the human body wearing region information. The target human body keypoint information and the human body wearing region information pass through the first feature extractor for feature extraction, and the second target image passes through the second feature extractor for feature extraction. Both feature extractors output their results to the STN sub-network, which outputs the warp parameters of the deformed target clothes.
In one implementation, when training the Warp model, the loss functions used may include a perceptual loss function (Perceptual loss) and an L1 loss function (L1_loss), that is:
Warp Loss = Perceptual loss + L1_loss
where
Perceptual loss = E((VGG(Y) − VGG(W(X)))²)
L1_loss = E(|Y − W(X)|)
where E is the mean; X is the input of the Warp model; Y is the second target image; VGG is a VGG model, such as VGG-19 or VGG-16; and W is the Warp model.
In an example, the Adam optimizer may also be used for training the Warp model. When training the Warp model, the wearing region generation model is not trained; the learning rate may be set to 0.0005, training for 100 epochs.
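One way to write this warp-stage loss in PyTorch is sketched below, using torchvision's VGG-19 features as the perceptual backbone (the text lists VGG-19 as one option; the specific feature layer cut-off and the pretrained-weights identifier are assumptions):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG-19 feature extractor for the perceptual term.
vgg_feat = vgg19(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_feat.parameters():
    p.requires_grad_(False)

def warp_loss(warped_garment, target_garment):
    # Perceptual loss: squared difference of VGG features of Y and W(X).
    perc = F.mse_loss(vgg_feat(warped_garment), vgg_feat(target_garment))
    # L1 loss: pixel-wise absolute difference.
    l1 = F.l1_loss(warped_garment, target_garment)
    return perc + l1
# Stated training setup: Adam with lr = 0.0005 for 100 epochs,
# with the stage-1 wearing region generation model kept frozen.
```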
Step 370: inputting the human keypoint information, the wearable item mask image, and the deformation information of the target wearable item output by the deformation recognition model into the generation model, which processes them to generate the wearing effect image of the target wearable item worn on the target human body.
The deformation information of the target wearable item output by the Warp model can be input into the generation model together with the target human body keypoint information and the wearable item mask image as its input features. As the third-stage model of the virtual wearing network, the generation model is configured to output the wearing effect image of the target human body wearing the target wearable item.
In one implementation, the generation model may include an encoder and a decoder, where the encoder is configured to perform feature extraction and to output, to the decoder, a third feature map corresponding to the target human body and style attribute information of the deformed target wearable item; the decoder is configured to perform decoding according to the third feature map and the style attribute information, generating the wearing effect image of the target wearable item worn on the target human body. As shown in FIG. 13, the dashed box on the left is the encoder and the dashed box on the right is the decoder.
In an embodiment, the structure of the encoder may include an input layer, several residual layers, and a fully connected layer. The residual layers are configured to extract the third feature maps related to the target human body and output them to the corresponding layers of the decoder, and the fully connected layer is configured to extract the style attribute information of the deformed target wearable item and output that style attribute information to multiple layers of the decoder. The style attribute information is a latent code.
For example, the structure of the encoder is shown in Table 2 below. In Table 2, there are six residual layers (ResBlock), and the size of the third feature map (Featuremap) output by each residual layer is specified, such as 512*384*32 and 256*192*64 in Table 2. The fully connected layer FC outputs style attribute information of size 18*512.
[Table 2: encoder structure, shown as an image in the original document — an input layer, six residual layers whose output feature maps range from 512*384*32 down to 16*12*512, and a fully connected layer that outputs the 18*512 style attribute information.]
Table 2
As shown in FIG. 13, the third feature map extracted by each residual layer is, on the one hand, passed to the next layer for processing and, on the other hand, also output to the corresponding layer of the decoder (except for the last residual layer, which only outputs its result to the corresponding layer of the decoder). The corresponding layer here refers to the decoding layer whose size matches the currently output third feature map; for example, if the currently output third feature map has a size of 32*24*512, the corresponding layer in the decoder is the decoding layer capable of processing a feature map of size 32*24*512.
In FIG. 13, of the two rightmost output layers of the encoder, the upper one is the last residual layer ResBlock, which outputs a feature map of size 16*12*512; the lower one is the FC layer, which outputs style attribute information of size 18*512. The FC layer feeds the style attribute information to every layer of the decoder, so that the decoder can generate the wearing effect image according to the style attribute information.
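A compact sketch of such an encoder is given below; the residual block implementation is simplified, the input channel count is an assumption, and only the channel widths and the 18*512 style code follow the numbers quoted in the text.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Simplified residual downsampling block."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.LeakyReLU(0.2))
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return self.conv(x) + self.skip(x)

class TryOnEncoder(nn.Module):
    def __init__(self, in_ch=7, widths=(32, 64, 128, 512, 512, 512)):
        super().__init__()
        chans = [in_ch] + list(widths)
        self.blocks = nn.ModuleList(DownBlock(a, b) for a, b in zip(chans, chans[1:]))
        self.fc = nn.Linear(widths[-1], 18 * 512)   # 18x512 style attribute code

    def forward(self, x):
        skips = []
        for block in self.blocks:
            x = block(x)
            skips.append(x)          # feature maps routed to matching decoder layers
        style = self.fc(x.mean(dim=(2, 3))).view(-1, 18, 512)
        return skips, style
```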
In an embodiment, the network structure of the decoder may be the structure of the synthesis network of StyleGAN2. As shown in the StyleGAN2 model architecture of FIG. 14, StyleGAN2 consists of two parts: the left part of FIG. 14 is the mapping network (Mapping Network) and the right part is the synthesis network.
The mapping network can disentangle the input better. As shown in FIG. 14, the mapping network consists of 8 fully connected layers (FC); its input is Gaussian noise (latent Z), and the latent variable (W) is obtained through the mapping network.
The synthesis network consists of modules such as a learnable affine transform A, a modulation module Mod-Demod, and upsampling (Upsample). In addition, the synthesis network also includes weights (w), biases (b), and a constant input (c, i.e., Const 4*4*512, a learnable constant); the activation function (Leaky ReLU) is always applied immediately after adding the bias.
The learnable affine transform A may consist of one fully connected layer; Upsample may use deconvolution (also called transposed convolution) for the upsampling operation.
The processing flow of the modulation module Mod-Demod is as follows:
w′_ijk = s_i · w_ijk
where s_i is the scaling factor of the i-th input feature map;
after scaling and convolution, the convolutional layer weights are demodulated; the standard deviation of the output activations is:
σ_j = sqrt( Σ_{i,k} (w′_ijk)² )
Demodulating the weights aims to restore the outputs to unit standard deviation, i.e., the weights of the new convolutional layer are:
w″_ijk = w′_ijk / sqrt( Σ_{i,k} (w′_ijk)² + ε )
In the above formula, ε is added to avoid a zero denominator.
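A minimal single-layer sketch of this modulation/demodulation step, written directly from the equations above, is shown next; the style dimension and kernel size are assumptions, and upsampling and noise injection are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    def __init__(self, c_in, c_out, k, style_dim, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k))
        self.affine = nn.Linear(style_dim, c_in)   # learnable affine transform A
        self.eps = eps

    def forward(self, x, style):
        b, c_in, h, w = x.shape
        s = self.affine(style)                                   # s_i per input channel
        w1 = self.weight[None] * s[:, None, :, None, None]       # w'_ijk = s_i * w_ijk
        # Demodulation: divide by the standard deviation of the scaled weights.
        demod = torch.rsqrt(w1.pow(2).sum(dim=(2, 3, 4)) + self.eps)
        w2 = w1 * demod[:, :, None, None, None]                  # w''_ijk
        # Grouped convolution applies a per-sample weight to every batch element.
        x = x.reshape(1, b * c_in, h, w)
        out = F.conv2d(x, w2.reshape(-1, c_in, *self.weight.shape[2:]),
                       padding=self.weight.shape[-1] // 2, groups=b)
        return out.reshape(b, -1, h, w)
```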
At the far right of FIG. 14 is the injection of random noise; B is a learnable noise parameter. Random noise is introduced to make the generated images more realistic.
In an embodiment, when training the generation model, the loss functions used may include a generative adversarial network loss function GAN_loss, a perceptual loss function Perceptual loss, and an L1 loss function L1_loss, that is,
Loss = GAN_loss + Perceptual loss + L1_loss
where
GAN_loss = E[(D(G(x)) − 1)²] + E[D(G(x))²]
Perceptual loss = E((VGG(Y) − VGG(G(X)))²)
L1_loss = E(|Y − G(X)|)
where E denotes the mean; D is the discriminator; G(x) denotes the wearing effect image output by the generation model; x denotes the input of the generation model; and Y denotes the wearing effect image in the sample.
The GAN loss makes the results generated by the generation model more realistic.
In an example, the Adam optimizer may also be used for training the generation model. When training the generation model, neither the image mask generation model nor the Warp model is trained; the learning rate is set to 0.0005, training for 100 epochs.
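The three loss terms for the generation model can be combined as in the following PyTorch sketch; the discriminator and the frozen VGG feature extractor are passed in as arguments and are assumed to be defined elsewhere (for example, the VGG backbone sketched earlier).

```python
import torch
import torch.nn.functional as F

def generator_loss(fake, real, disc, vgg):
    """fake: generated try-on image G(x); real: ground-truth try-on image Y.
    disc and vgg are the discriminator and a frozen VGG feature extractor."""
    # GAN_loss, following the expression given in the text.
    gan = ((disc(fake) - 1) ** 2).mean() + (disc(fake) ** 2).mean()
    # Perceptual loss: squared difference of VGG features of Y and G(x).
    perc = F.mse_loss(vgg(fake), vgg(real))
    # L1 loss: pixel-wise absolute difference.
    l1 = F.l1_loss(fake, real)
    return gan + perc + l1
```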
As shown in FIG. 15, for the trained generation model, its input features may include the target human body keypoint information, the wearable item mask image, and the deformation information of the target wearable item output by the deformation recognition model, and it outputs the wearing effect image of the target wearable item worn on the target human body.
In this embodiment, the virtual wearing network puts the target wearable item on the target human body through the wearing region generation model, the deformation recognition model, and the generation model. The wearing region generation model is responsible for predicting, according to the target human body keypoint information, the human parsing result with the original wearable item erased, and the wearable item mask image with the original wearable item erased, the human body wearing region when the target human body wears the target wearable item, and for outputting the corresponding human body wearing region information. The deformation recognition model is responsible for determining, according to the human body wearing region information, the target human body keypoint information, and the second target image containing the target wearable item, the deformation information of the target wearable item relative to the human body posture, i.e., obtaining the deformed target wearable item. The generation model is responsible for warping, according to the above deformation information, the target human body keypoint information, and the wearable item mask image, the deformed target wearable item onto the target human body from which the original wearable item has been erased, generating the wearing effect image. The three models have strong generalization ability and good robustness, so that the output wearing effect image can take both the wearing effect and authenticity into account.
FIG. 16 is a structural block diagram of an embodiment of a virtual wearing apparatus provided by an embodiment of the present application, which may include the following modules:
a first target image acquisition module 410, configured to acquire a first target image containing a target human body;
a second target image acquisition module 420, configured to acquire a second target image containing a target wearable item;
a human body feature information acquisition module 430, configured to acquire human body feature information based on the first target image, the human body feature information including target human body keypoint information related to the target wearable item, a human parsing result, and a wearable item mask image;
a wearing effect image generation module 440, configured to input the human body feature information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing region information based on the human body feature information, determines deformation information of the target wearable item according to the human body wearing region information and the second target image, and generates and outputs a wearing effect image according to the deformation information and the human body feature information; where the virtual wearing network is a generative adversarial network.
In an embodiment, if the human body feature information is the target human body keypoint information, the human body feature information acquisition module 430 is configured to:
input the first target image into a pre-trained human keypoint detection model, so that the human keypoint detection model performs keypoint detection on the target human body in the first target image and outputs corresponding human keypoint information; and
determine target keypoint information related to the target wearable item from the human keypoint information.
In an embodiment, if the human body feature information is the human parsing result, the human body feature information acquisition module 430 is configured to:
input the first target image into a pre-trained human parsing model, so that the human parsing model performs human parsing on the target human body in the first target image and outputs a corresponding preliminary human parsing result; and
draw a wearable item mask in the preliminary human parsing result based on the target keypoints, to generate the human parsing result.
In an embodiment, if the human body feature information is the wearable item mask image, the human body feature information acquisition module 430 is configured to:
draw a wearable item mask in the first target image by combining the target keypoint information and the preliminary human parsing result, to generate the wearable item mask image.
In an embodiment, the virtual wearing network includes a wearing region generation model, and the wearing effect image generation module 440 may include the following sub-module:
a wearing region generation model processing sub-module, configured to input the target human body keypoint information, the human parsing result, and the wearable item mask image into the wearing region generation model, so that the wearing region generation model predicts the human body wearing region when the target human body wears the target wearable item and outputs the corresponding human body wearing region information.
In an embodiment, the wearing region generation model is a model containing a U-NET network structure; when training the wearing region generation model, the loss functions used include a cross-entropy loss function and a dice loss function.
In an embodiment, the virtual wearing network further includes a deformation recognition model, and the wearing effect image generation module 440 may include the following sub-module:
a deformation recognition model processing sub-module, configured to input the second target image, the target human body keypoint information, and the human body wearing region information output by the wearing region generation model into the deformation recognition model, so that the deformation recognition model generates a first feature map according to the human body wearing region information and the human keypoint information, generates a second feature map according to the second target image, and determines the deformation information of the target wearable item based on the first feature map and the second feature map.
In an embodiment, the deformation recognition model includes a first feature extractor, a second feature extractor, and a spatial transformer sub-network;
the first feature extractor is configured to output the first feature map to the spatial transformer sub-network according to the human body wearing region information and the human keypoint information;
the second feature extractor is configured to output the second feature map to the spatial transformer sub-network according to the second target image;
the spatial transformer sub-network is configured to perform the relevant spatial transformation processing based on the first feature map and the second feature map, and output the deformation information of the target wearable item.
In an embodiment, when training the deformation recognition model, the loss functions used include a perceptual loss function and an L1 loss function.
In an embodiment, the virtual wearing network further includes a generation model, and the wearing effect image generation module 440 may include the following sub-module:
a generation model processing sub-module, configured to input the human keypoint information, the wearable item mask image, and the deformation information of the target wearable item output by the deformation recognition model into the generation model, which processes them to generate the wearing effect image of the target wearable item worn on the target human body.
In an embodiment, the generation model includes an encoder and a decoder, the encoder being configured to perform feature extraction and to output, to the decoder, a third feature map corresponding to the target human body and style attribute information of the deformed target wearable item;
the decoder is configured to perform decoding according to the third feature map and the style attribute information, generating the wearing effect image of the target wearable item worn on the target human body.
In an embodiment, the network structure of the decoder is the structure of the synthesis network of StyleGAN2;
when training the generation model, the loss functions used include a generative adversarial network loss function, a perceptual loss function, and an L1 loss function.
The virtual wearing apparatus provided by the embodiments of the present application can execute the virtual wearing method in the foregoing embodiments of the present application, and has the corresponding functional modules and beneficial effects for executing the method.
FIG. 17 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 17, the electronic device includes a processor 510, a memory 520, an input apparatus 530, and an output apparatus 540; the number of processors 510 in the electronic device may be one or more, with one processor 510 taken as an example in FIG. 17; the processor 510, the memory 520, the input apparatus 530, and the output apparatus 540 in the electronic device may be connected by a bus or in other ways, with a bus connection taken as an example in FIG. 17.
As a computer-readable storage medium, the memory 520 may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the foregoing embodiments of the present application. The processor 510 executes the various functional applications and data processing of the electronic device by running the software programs, instructions, and modules stored in the memory 520, i.e., implements the methods mentioned in the foregoing method embodiments.
The memory 520 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the terminal, etc. In addition, the memory 520 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 520 may include memories remotely located relative to the processor 510, and these remote memories may be connected to the device/terminal/server through a network. Examples of such a network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input apparatus 530 may be configured to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output apparatus 540 may include a display device such as a display screen.
The embodiments of the present application further provide a storage medium containing computer-executable instructions, which, when executed by a computer processor, are configured to perform the method of the foregoing method embodiments. The computer-readable storage medium may be a non-transitory computer-readable storage medium.
Of course, for the storage medium containing computer-executable instructions provided by the embodiments of the present application, the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the methods provided by any embodiment of the present application.
The embodiments of the present application further provide a computer program product, which includes computer-executable instructions that, when executed by a computer processor, are used to perform the method of any of the foregoing method embodiments.
Of course, for the computer program product provided by the embodiments of the present application, the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the methods provided by any embodiment of the present application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software and the necessary general-purpose hardware, and of course can also be implemented by hardware. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the related art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disc, and includes several instructions to enable an electronic device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in multiple embodiments of the present application.
It is worth noting that, in the above apparatus embodiments, the included units and modules are only divided according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for the convenience of distinguishing them from each other and are not used to limit the protection scope of the present application.

Claims (16)

  1. A virtual wearing method, comprising:
    acquiring a first target image containing a target human body;
    acquiring a second target image containing a target wearable item;
    acquiring human body feature information based on the first target image, the human body feature information comprising target human body keypoint information related to the target wearable item, a human parsing result, and a wearable item mask image; and
    inputting the human body feature information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing region information based on the human body feature information, determines deformation information of the target wearable item according to the human body wearing region information and the second target image, and generates and outputs a wearing effect image according to the deformation information and the human body feature information; wherein the virtual wearing network is a generative adversarial network.
  2. The method according to claim 1, wherein, in response to determining that the human body feature information comprises the target human body keypoint information, the acquiring human body feature information based on the first target image comprises:
    inputting the first target image into a pre-trained human keypoint detection model, so that the human keypoint detection model performs keypoint detection on the target human body in the first target image and outputs corresponding human keypoint information; and
    determining target keypoint information related to the target wearable item from the human keypoint information.
  3. The method according to claim 2, wherein, in response to determining that the human body feature information comprises the human parsing result, the acquiring human body feature information based on the first target image comprises:
    inputting the first target image into a pre-trained human parsing model, so that the human parsing model performs human parsing on the target human body in the first target image and outputs a corresponding preliminary human parsing result; and
    drawing a wearable item mask in the preliminary human parsing result based on the target keypoint information, to generate the human parsing result.
  4. The method according to claim 3, wherein, in response to determining that the human body feature information comprises the wearable item mask image, the acquiring human body feature information based on the first target image comprises:
    drawing a wearable item mask in the first target image by combining the target keypoint information and the preliminary human parsing result, to generate the wearable item mask image.
  5. The method according to any one of claims 1-4, wherein the virtual wearing network comprises a wearing region generation model, and the determining human body wearing region information based on the human body feature information comprises:
    inputting the target human body keypoint information, the human parsing result, and the wearable item mask image into the wearing region generation model, so that the wearing region generation model predicts the human body wearing region when the target human body wears the target wearable item and outputs the corresponding human body wearing region information.
  6. The method according to claim 5, wherein the wearing region generation model is a model containing a U-NET network structure; and when training the wearing region generation model, the loss functions used comprise a cross-entropy loss function and a dice loss function.
  7. The method according to claim 5, wherein the virtual wearing network further comprises a deformation recognition model, and the determining deformation information of the target wearable item according to the human body wearing region information and the second target image comprises:
    inputting the second target image, the target human body keypoint information, and the human body wearing region information output by the wearing region generation model into the deformation recognition model, so that the deformation recognition model generates a first feature map according to the human body wearing region information and the human keypoint information, generates a second feature map according to the second target image, and determines the deformation information of the target wearable item based on the first feature map and the second feature map.
  8. The method according to claim 7, wherein the deformation recognition model comprises a first feature extractor, a second feature extractor, and a spatial transformer sub-network;
    the first feature extractor is configured to output the first feature map to the spatial transformer sub-network according to the human body wearing region information and the target human body keypoint information;
    the second feature extractor is configured to output the second feature map to the spatial transformer sub-network according to the second target image; and
    the spatial transformer sub-network is configured to perform the relevant spatial transformation processing based on the first feature map and the second feature map, and output the deformation information of the target wearable item.
  9. The method according to claim 8, wherein, when training the deformation recognition model, the loss functions used comprise a perceptual loss function and an L1 loss function.
  10. The method according to claim 6, wherein the virtual wearing network further comprises a generation model, and the generating a wearing effect image according to the deformation information and the human body feature information comprises:
    inputting the target human body keypoint information, the wearable item mask image, and the deformation information of the target wearable item output by the deformation recognition model into the generation model, which processes them to generate the wearing effect image of the target wearable item worn on the target human body.
  11. The method according to claim 10, wherein the generation model comprises an encoder and a decoder, the encoder being configured to perform feature extraction and to output, to the decoder, a third feature map corresponding to the target human body and style attribute information of the deformed target wearable item; and
    the decoder is configured to perform decoding according to the third feature map and the style attribute information, generating the wearing effect image of the target wearable item worn on the target human body.
  12. The method according to claim 11, wherein the network structure of the decoder is the structure of the synthesis network of StyleGAN2; and
    when training the generation model, the loss functions used comprise a generative adversarial network loss function, a perceptual loss function, and an L1 loss function.
  13. A virtual wearing apparatus, comprising:
    a first target image acquisition module, configured to acquire a first target image containing a target human body;
    a second target image acquisition module, configured to acquire a second target image containing a target wearable item;
    a human body feature information acquisition module, configured to acquire human body feature information based on the first target image, the human body feature information comprising target human body keypoint information related to the target wearable item, a human parsing result, and a wearable item mask image; and
    a wearing effect image generation module, configured to input the human body feature information and the second target image into a pre-trained virtual wearing network, so that the virtual wearing network determines human body wearing region information based on the human body feature information, determines deformation information of the target wearable item according to the human body wearing region information and the second target image, and generates and outputs a wearing effect image according to the deformation information and the human body feature information; wherein the virtual wearing network is a generative adversarial network.
  14. An electronic device, comprising:
    one or more processors; and
    a storage apparatus, configured to store one or more programs,
    which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-12.
  15. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-12.
  16. A computer program product comprising computer-executable instructions which, when executed, are configured to implement the method according to any one of claims 1-12.
PCT/CN2022/132132 2021-11-16 2022-11-16 Virtual wearing method, apparatus, device, storage medium and program product WO2023088277A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111356765.5A CN114067088A (zh) 2021-11-16 2021-11-16 Virtual wearing method, apparatus, device, storage medium and program product
CN202111356765.5 2021-11-16

Publications (1)

Publication Number Publication Date
WO2023088277A1 true WO2023088277A1 (zh) 2023-05-25

Family

ID=80273012

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/132132 WO2023088277A1 (zh) 2021-11-16 2022-11-16 Virtual wearing method, apparatus, device, storage medium and program product

Country Status (2)

Country Link
CN (1) CN114067088A (zh)
WO (1) WO2023088277A1 (zh)


Also Published As

Publication number Publication date
CN114067088A (zh) 2022-02-18


Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 22894810
Country of ref document: EP
Kind code of ref document: A1