CN117095128A - Priori-free multi-view human body clothes editing method

Priori-free multi-view human body clothes editing method

Info

Publication number
CN117095128A
CN117095128A
Authority
CN
China
Prior art keywords
view
editing
clothing
feature
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311131853.4A
Other languages
Chinese (zh)
Inventor
王好谦
张弢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202311131853.4A
Publication of CN117095128A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/042 — Knowledge-based neural networks; Logical representations of neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/096 — Transfer learning
    • G06T 19/00 — Manipulating 3D models or images for computer graphics
    • G06T 19/20 — Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts


Abstract

A priori-free multi-view human body clothing editing method comprises the following steps: 1) inputting multi-view human body pictures into a self-supervised human clothing recognition and extraction module to obtain feature maps corresponding to the input pictures; 2) inputting the input pictures and the corresponding feature maps into a feature-distillation neural radiance field module for implicit three-dimensional reconstruction, so that points in the implicit space carry feature information; 3) selecting partial pixels of the clothing to be edited, performing feature matching and localization in the implicit space, and separating the clothing to be edited; 4) inputting the separated clothing to be edited, the target editing conditions and the view-angle information into a diffusion model to generate an edited multi-view result. The method achieves multi-view human clothing editing from simple text or picture input, effectively realizing a virtual outfit-change function.

Description

Priori-free multi-view human body clothes editing method
Technical Field
The application relates to the field of 3D vision in computer vision, in particular to a priori-free multi-view human body clothes editing method.
Background
Three-dimensional reconstruction (3D reconstruction) refers to techniques that recover the three-dimensional information of an object or scene from its two-dimensional projections. It is a key technology for reproducing the objective world in a computer virtual world and, with the rise of the metaverse concept, is finding increasingly wide and important application in fields such as Augmented Reality (AR), Virtual Reality (VR), and virtual characters in games and animation. Three-dimensional reconstruction methods can be divided into active and passive methods according to whether a sensor actively emits signals toward the measured object. Active methods reconstruct using principles such as coded structured light and time of flight (TOF); their overall performance is slightly better, but the equipment is more expensive and the usable scenarios are limited. Passive three-dimensional reconstruction has lower hardware requirements and, based on mature multi-view geometry, can be divided into monocular, binocular and multi-view reconstruction according to the number of views (pictures) used. With the great increase in computing power in recent years and the proposal of a batch of excellent network architectures, deep learning provides new ways of handling various problems in computer vision. The success of deep neural networks on various computer vision tasks demonstrates, to a certain extent, the feasibility of applying deep learning to three-dimensional reconstruction, and researchers expect that reconstruction accuracy and completeness can be improved with its help, so deep-learning-based three-dimensional reconstruction has become one of the research hotspots in recent years.
The neural radiance field (NeRF) provides a brand-new approach to novel-view synthesis in three-dimensional reconstruction: instead of reconstructing the three-dimensional geometry of an object explicitly, the geometry is reconstructed implicitly inside a multi-layer perceptron (MLP). Pictures of the same object from different viewing angles are used as supervision so that the neural network models the object implicitly, and pictures under new viewing angles are then generated by volume rendering. The main flow of the algorithm is as follows: first, camera rays are cast through the scene and a set of three-dimensional points is sampled; then the sampled three-dimensional points, together with the associated two-dimensional viewing directions, are fed into the MLP, which outputs the color and volume density of each sample point; finally, the output colors and volume densities are rendered into a 2D picture by the classical volume rendering method.
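As a concrete illustration of this volume-rendering step, the following minimal sketch (PyTorch; the MLP interface, tensor shapes and uniform sampling are assumptions made for illustration, not code from the application) composites the colors of points sampled along a batch of rays into pixel colors:

```python
import torch

def render_rays(nerf_mlp, origins, dirs, t_near, t_far, n_samples=64):
    """Sample points along rays, query the MLP, and composite colors.

    origins, dirs: (R, 3) ray origins and directions.
    nerf_mlp(points, dirs) is assumed to return (rgb, sigma) with shapes
    (R, S, 3) and (R, S, 1); this interface is illustrative only.
    """
    # Sample S depths per ray between the near and far bounds.
    t = torch.linspace(t_near, t_far, n_samples)                          # (S,)
    points = origins[:, None, :] + t[None, :, None] * dirs[:, None, :]    # (R, S, 3)

    rgb, sigma = nerf_mlp(points, dirs)                                   # (R, S, 3), (R, S, 1)

    # Classical volume rendering: alpha compositing of the samples.
    delta = t[1] - t[0]                                                   # uniform spacing
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)                   # (R, S)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)                    # accumulated transmittance
    trans = torch.roll(trans, shifts=1, dims=-1)
    trans[:, 0] = 1.0
    weights = alpha * trans                                               # (R, S)

    pixel_rgb = (weights[..., None] * rgb).sum(dim=1)                     # (R, 3)
    return pixel_rgb, weights
```

The returned per-sample weights are the same quantities used later when the feature channel is composited alongside the color.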
DINO is a self-supervised vision transformer. Whereas a conventional neural network needs data together with corresponding labels, the DINO model only needs to be given images: it learns to segment objects semantically and to create boundaries, and in doing so learns to understand the visual world around it. DINO is trained by self-distillation without labels, and in the end it automatically attends to the most relevant regions of the input image.
The diffusion model is inspired by non-equilibrium thermodynamics: a Markov chain of diffusion steps is defined that slowly adds random noise to the data, and the model then learns to reverse this diffusion process to construct the desired data samples from noise. It is a generative model that currently surpasses the earlier generative adversarial networks on image generation tasks. A diffusion model mainly comprises a forward process and a backward process. The forward process is the diffusion process, in which noise with a standard Gaussian distribution is continuously added to the input data; as time tends to infinity, the data finally become pure noise. The backward process is a denoising, step-by-step restoration process, i.e. the process of generating the target. Diffusion models typically use a U-Net structure with self-attention.
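The forward and backward processes described here can be summarized with the usual closed-form noising step; the sketch below (PyTorch; the schedule length, the linear schedule and the noise-prediction network `eps_model` are illustrative assumptions) shows one forward draw of a noisy sample and the training target that the reverse process learns to predict:

```python
import torch

T = 1000                                      # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)         # linear noise schedule (an assumption)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def forward_diffuse(x0, t):
    """Forward process: q(x_t | x_0) adds Gaussian noise in closed form."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise

def diffusion_loss(eps_model, x0, cond):
    """Training objective: the network predicts the injected noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    x_t, noise = forward_diffuse(x0, t)
    pred = eps_model(x_t, t, cond)            # in practice a U-Net with self-attention
    return torch.nn.functional.mse_loss(pred, noise)
```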
It should be noted that the information disclosed in the above background section is only for understanding the background of the application and may therefore include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The main purpose of the application is to overcome the deficiencies of the background art and to provide a priori-free multi-view human body clothing editing method.
In order to achieve the above purpose, the present application adopts the following technical solution:
A priori-free multi-view human body clothing editing method comprises the following steps:
1) inputting multi-view human body pictures into a self-supervised human clothing recognition and extraction module to obtain feature maps corresponding to the input pictures;
2) inputting the input pictures and the corresponding feature maps into a feature-distillation neural radiance field module for implicit three-dimensional reconstruction, so that points in the implicit space carry feature information;
3) selecting partial pixels of the clothing to be edited, performing feature matching and localization in the implicit space, and separating the clothing to be edited;
4) inputting the separated clothing to be edited, the target editing conditions and the view-angle information into a diffusion model to generate an edited multi-view result.
Further:
The self-supervised human clothing recognition and extraction module is trained using a data set and a pre-trained DINO model, the data set comprising pictures of various garments worn by different people against a solid-color background, classified according to the garments, so that the trained self-supervised human clothing recognition and extraction module can recognize the clothing parts in human body pictures.
In step 2), the RGB information of the sampled three-dimensional point set and the feature information corresponding to each point are input into the feature-distillation neural radiance field module, which outputs, through fully connected layers, three-dimensional RGB information, the volume density, and a C-dimensional feature, where C is the number of feature channels.
In step 2), the neural radiance field module is set up as a teacher-student network for feature distillation: the network that extracts 2D image features serves as the teacher network and guides the feature learning in the implicit space of the neural radiance field, so that the features of the 2D teacher network are distilled into the 3D student network and extracted together with the 3D geometry by neural rendering.
In step 3), a pixel region (patch) of the garment to be style-edited is selected from an original input picture, the features of the corresponding part are obtained from the feature map and matched against the features of points in the implicit space of the neural radiance field; for each view picture, the distance between the pixel features of the region and the patch features of the selected garment is computed, reduced by principal component analysis (PCA) to a three-dimensional space and visualized as an RGB image of feature distances; a threshold on the feature-matching difference is then set and used to screen the rendered pictures, yielding multi-view pictures of the garment to be edited.
The diffusion model is trained as follows: for the input view-angle and picture information, the implicit-space center of the reconstructed garment is taken as the origin O, the camera direction of the new view is translated to the origin O, and a main view angle θ₀ is defined; the horizontal angle between each view and the main view θ₀ is recorded as the input view angle θ, reflected as a multi-dimensional camera projection matrix, which is concatenated with the latent code corresponding to the 2D picture of that view and fed as a combined input to the diffusion model, and the loss is computed between the code restored in the latent space and the original code.
In step 4), the image encoder of a pre-trained text-image-pair model is used: the input 2D picture is passed through the image encoder to obtain its corresponding latent code, which, after being concatenated with the camera parameters, is restored to the original dimension through several linear layers and used as input to the diffusion model.
In step 4), the code of the separated garment picture to be edited is mixed with the code of the target editing condition to obtain a mixed code; the view-angle information is positionally encoded; after the mixed code is concatenated with the positional code, it is mapped back to the original dimension through several linear layers to obtain a code z, which is concatenated with a latent-space random noise vector and fed into the diffusion model together with a time step t.
In step 4), using the image and text encoders of a pre-trained text-image-pair model, a garment in the target editing style, a pixel region (patch), or a short text description is input as a prompt; the latent code obtained through the encoder is concatenated with the given camera parameters, and the diffusion model module generates the multi-view output that finally realizes the style change.
Multi-view consistency processing is carried out based on the result of squaring and summing the pixel-wise differences between two images.
A computer readable storage medium stores a computer program which, when executed by a processor, implements the priori-free multi-view human body clothing editing method.
The application has the following beneficial effects:
The application provides a multi-view human clothing editing method based on a neural radiance field and a diffusion model which, given only several multi-view human body pictures and without any prior semantics, realizes editing of a designated garment. First, a trained self-supervised vision model DINO for human clothing recognition solves the problem of recognizing and extracting human clothing without any human-semantics prior; then, through prompt input of images or text combined with a diffusion model, a new image is generated after applying a corresponding offset to the latent code of the original garment, completing the editing of the garment into the target style.
The embodiment of the application has the advantages that:
1) The application improves the self-supervised vision transformer model DINO for this task. The adjusted model can perceive the different garments of a human body from an input human image without any prior semantic information and, using the multi-head self-attention mechanism, outputs separate attention maps, for example recognizing the upper garment and the trousers separately while recognizing a dress as a whole, so that human clothing can be perceived and recognized conveniently, accurately and quickly.
2) The application performs feature learning of spatial points in the implicit space of the neural radiance field by feature distillation: feature outputs for points are added after the original RGB output, and the adjusted DINO model serves as the teacher network to supervise and guide the feature learning of points in the implicit space, so that the student network can learn, over multiple views and in all directions, the key-point heat maps output by the teacher network while also capturing deeper implicit feature knowledge in the teacher network; in some training situations the student network can even outperform the teacher network. Learning point features in the implicit space allows the method to locate, in the implicit space, the specific position of a target pixel region (patch) of the 2D image, so that editing can be achieved in subsequent steps.
3) The application provides a solution to the multi-view consistency problem when applying a diffusion model to a three-dimensional scene: the diffusion model is trained with pairs of view-angle and input-image codes as input, so that it learns, to a certain extent, the different information of multiple views and can output different picture information according to the input view.
4) The application can generate the user's target garment from user-friendly inputs such as text or images, and extract and replace the original garment on the multi-view images by self-supervised recognition, thereby realizing multi-view human clothing editing and a virtual outfit-change effect.
Drawings
FIG. 1 is an overall flow chart of the priori-free multi-view human clothing editing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the training of the self-supervised human clothing recognition and extraction module according to an embodiment of the present application;
FIGS. 3.1(a) to 3.3(c) show self-attention outputs of the pre-trained vision model before and after adjustment, wherein FIG. 3.1(a) is a picture from the training set, FIG. 3.1(b) is the feature map output by the pre-trained model for FIG. 3.1(a), and FIGS. 3.2(a) and 3.2(b) are the feature maps output by the two branches of the model after training adjustment, corresponding to the upper garment and the trousers of the person in the picture; FIG. 3.3(a) is a picture of a fashion model from the internet, FIG. 3.3(b) is the feature map output by the pre-trained model for this picture, in which the whole person is identified, and FIG. 3.3(c) is the feature map output by the model after training adjustment, in which the dress worn by the model is extracted;
FIG. 4 is a flow chart of the processing of the feature-distillation neural radiance field module according to an embodiment of the present application;
FIG. 5 is a flow chart of the processing of the diffusion model and editing module according to an embodiment of the present application.
Detailed Description
The following describes embodiments of the present application in detail. It should be emphasized that the following description is merely exemplary in nature and is in no way intended to limit the scope of the application or its applications.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are merely for convenience in describing embodiments of the application and to simplify the description by referring to the figures, rather than to indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus are not to be construed as limiting the application.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present application, the meaning of "plurality" is two or more, unless explicitly defined otherwise.
Referring to FIG. 1, an embodiment of the present application provides a priori-free multi-view human body clothing editing method, comprising the following steps:
1) inputting multi-view human body pictures into a self-supervised human clothing recognition and extraction module to obtain feature maps corresponding to the input pictures;
2) inputting the input pictures and the corresponding feature maps into a feature-distillation neural radiance field module for implicit three-dimensional reconstruction, so that points in the implicit space carry feature information;
3) selecting partial pixels of the clothing to be edited, performing feature matching and localization in the implicit space, and separating the clothing to be edited;
4) inputting the separated clothing to be edited, the target editing conditions and the view-angle information into a diffusion model to generate an edited multi-view result.
In this priori-free multi-view human clothing editing method based on a neural radiance field and a diffusion model, the clothing parts of the input human images are recognized by an unsupervised method and the corresponding feature maps are extracted; the neural radiance field, combined with the original images, completes the implicit reconstruction; the features are also used to locate the person's garment in the output novel-view images; finally, a diffusion model conditioned on view-angle information generates the new target garment. Multi-view human clothing editing is thus completed from simple text or picture input, effectively realizing a virtual outfit-change function.
In some embodiments, the priori-free multi-view human clothing editing method using a neural radiance field and a diffusion model can be divided into the following four modules:
1. Self-supervised human clothing recognition and extraction module:
Collect a data set of pictures of different people wearing various garments against a solid-color background, classify the pictures according to the garments, and train on the organized data set using a DINO model pre-trained on ImageNet (a large visual database for visual object recognition software research).
2. Neural radiance field module with feature distillation:
Input of the neural radiance field
In addition to the input pictures of the original neural radiance field, each picture is passed through the clothing recognition and extraction module 1 to obtain a feature map, so that the three-dimensional point set sampled along the camera rays through the scene carries both the original RGB color information and the feature information corresponding to each point.
Output of the neural radiance field
On top of the original neural radiance field, after the fully connected layers of 9 × 256 neurons, the final fully connected layer of 128 neurons outputs three-dimensional RGB (the three optical primary colors) information, the volume density, and a feature of dimension C (C is the number of feature channels).
Training with feature distillation
Following the teacher-student setup of feature distillation, the network that extracts 2D image features is used as the teacher network to guide the feature learning of points in the implicit space of the neural radiance field; the features of the 2D teacher network are finally distilled into the 3D student network and extracted together with the 3D geometry by neural rendering, improving the consistency and view-independence of the features.
3. Feature matching and implicit-space localization module
A pixel region (patch) of the garment to be style-edited is selected from an original input picture, the features of the corresponding part are obtained from the feature map produced in module 1 and matched against the features of points in the implicit space of the neural radiance field trained in module 2; for each view picture, the distance between the region's pixel features and the selected garment patch features is computed, reduced by principal component analysis (PCA) to an RGB image of feature distances, a threshold on the feature-matching difference is set, and the rendered pictures are screened to obtain multi-view pictures of the garment to be edited.
4. Diffusion model and editing module
Improved diffusion model
A view angle and a picture are added to the conventional diffusion model as input. The implicit-space center of the reconstructed garment is taken as the origin O, the camera direction of the new view is translated to the origin O, and a main view angle θ₀ is defined; the horizontal angle between each view and the main view θ₀ is recorded as the input view angle θ, which in practice is reflected as a 25-dimensional camera projection matrix. This is concatenated with the latent code corresponding to the 2D picture of that view and fed as a combined input to the diffusion model; the loss between the code restored in the latent space and the original code is used to train the improved diffusion model.
Diffusion model input and encoding
The image encoder of the text-image-pair pre-training model (CLIP) is used: the input 2D picture is passed through the image encoder to obtain its corresponding latent code, which, after being concatenated with the camera parameters, is restored to the original dimension through three linear layers and used as the input of the diffusion model.
Editing of clothing style
Likewise, using the image and text encoders of the text-image-pair pre-training model (CLIP), a garment in the target editing style, a pixel region (patch), or a short text description is input as a prompt; the latent code obtained through the encoder is, as above, concatenated with the given camera parameters, and the diffusion model module generates the multi-view output that finally realizes the style change.
The relationship among the four modules is shown in FIG. 1. Multi-view human body pictures are accepted as input; after the trained self-supervised human clothing recognition and extraction module, the input pictures and their corresponding feature maps are obtained and fed together into the feature-distillation neural radiance field module, and after training an implicit reconstruction result is obtained in which the points of the implicit space also carry feature information. Partial pixels of the garment to be edited are selected from the original input picture, and their feature information is used for feature matching and localization in the implicit space, so that the garment to be edited can be extracted under multiple views. Finally, the garment to be edited, the target editing conditions and the view-angle information are input into the improved diffusion model to generate a new multi-view garment with the changed style, completing the editing of the garment into the target style.
Specific embodiments of the present application are described further below.
In a specific embodiment, the priori-free multi-view human clothing editing method based on a neural radiance field and a diffusion model edits a style transformation of human clothing without any semantic prior, as shown in FIG. 1, and mainly comprises the following four processing modules:
Self-supervised human clothing recognition and extraction module:
First, an additional data set of different human garments is collected and classified by label as shown in FIG. 2, and input into the self-supervised vision transformer model DINO for training. The network passes two different random pixel regions (patches) of the input image to the student and teacher networks, which share the same architecture but have different parameters. The output of the teacher network is centered on the mean computed over the batch. Each network outputs a K-dimensional feature normalized with softmax (the normalized exponential function) along the feature dimension, and their similarity is measured with a cross-entropy loss; that is, with the softmax outputs of the student and teacher networks denoted p_1 and p_2, the loss function is computed as:
L_dino = -p_2 · log p_1
Gradient updates are stopped on the teacher network and gradients are back-propagated only through the student network, while the teacher network is updated with an exponential moving average of the student parameters.
In the specific training process, the initial learning rate is set to 1e-4; after training for 10000 epochs, the learning rate is adjusted to 1e-5 and training continues for another 10000 epochs. The original DINO model is trained on ImageNet and can only recognize the whole person in a picture of a model wearing clothes; after the above training steps, the model's loss drops by 60%, and the multiple attention heads can recognize the different garments of a human body separately, yielding highlighted attention heat maps. The outputs of the original DINO and of the adjusted DINO for a human body picture are shown in FIG. 3.
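A minimal sketch of the self-distillation loss and the exponential-moving-average teacher update described above is given below (PyTorch; the temperatures, the centering momentum and the EMA momentum are illustrative assumptions, not parameters stated in the application):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between softmax outputs: L_dino = -p2 * log(p1)."""
    p1 = F.log_softmax(student_out / t_s, dim=-1)            # student log-probabilities
    p2 = F.softmax((teacher_out - center) / t_t, dim=-1)      # centered teacher probabilities
    return -(p2 * p1).sum(dim=-1).mean()

@torch.no_grad()
def update_teacher(student, teacher, center, teacher_out, m=0.996, c_m=0.9):
    """Teacher weights follow an exponential moving average of the student;
    the center follows the batch mean of the teacher outputs."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)
    return center * c_m + teacher_out.mean(dim=0) * (1.0 - c_m)
```

In training, the two random patches of each image are passed through both networks and the loss is applied cross-wise (student on one patch, teacher on the other), so that gradients flow only through the student.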
Neural radiance field module with feature distillation:
As shown in FIG. 4, in the original neural radiance field the penultimate layer of the multi-layer perceptron (MLP) outputs the volume density of the points in the implicit space and the last layer outputs the three-channel color (RGB) information; here, point feature information of C feature dimensions is added to the last-layer output. The volume density of a point x in space is defined as σ(x). For a ray r(t) = o + t·d, where o is the position of the ray origin and d is the ray direction, different values of t correspond to sample points at different positions along the ray direction; the far and near boundaries of the neural radiance field space are t_f and t_n respectively. A suitable interval between the near and far boundaries is selected, points are sampled discretely within it, and the color information of the sample points is summed with weights. Assuming the sampling interval is δ, the weight is:
w_i = T_i · (1 − exp(−σ_i δ)),  where T_i = exp(−Σ_{j<i} σ_j δ)
and σ_j is the volume density of each sample point. With the computed weights, the color of the ray follows the voxel rendering formula:
C(r) = Σ_{i=1}^{N_c} w_i c_i
where N_c is the number of sample points and c_i is the color information of each sample point.
Similarly, the same weights are used to weight the sample points along the ray to obtain the feature corresponding to the ray:
F(r) = Σ_{i=1}^{N_c} w_i f_i
where f_i is the feature information of each sample point.
Further, in addition to the color (RGB) image I_t obtained after voxel rendering, a reconstructed feature map φ(I_t) is obtained, and the loss of the feature map part is defined as:
L_feat = ||φ(I_t) − Φ(I_t)||²
where Φ(I_t) denotes the feature map obtained by passing the image through the self-supervised human clothing recognition and extraction module, which is used to guide the learning of the features in the implicit space. The supervision loss is like that of an image: the square loss, i.e. the result of squaring and summing the pixel-wise differences of the two maps, is used to evaluate the quality of the reconstruction.
During training, only the image reconstruction loss of the original neural radiance field is used at first; after a certain number of iterations, the feature-map reconstruction loss is added with weight λ, so that the final total reconstruction loss function is:
L = L_rgb + λ · L_feat
where L_rgb, i.e. the square loss between the rendered color image and the original image, is the result of squaring and summing the pixel-wise differences of the two images, consistent with the foregoing.
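Under the assumptions of the rendering sketch given in the background section (the per-ray weights w_i), the feature branch and the two-stage loss described above can be sketched as follows; the teacher feature map `feat_teacher`, the warm-up length and the weight λ are placeholders for the quantities described in the text, not values from the application:

```python
import torch
import torch.nn.functional as F

def composite_features(weights, point_feats):
    """F(r) = sum_i w_i * f_i: reuse the color-compositing weights
    to render a C-dimensional feature per ray."""
    return (weights[..., None] * point_feats).sum(dim=1)      # (R, C)

def reconstruction_loss(pixel_rgb, gt_rgb, ray_feats, feat_teacher,
                        step, warmup_steps=50_000, lam=0.04):
    """Photometric loss first; the feature-distillation term is added
    with weight lambda after a warm-up phase (schedule values assumed)."""
    loss = F.mse_loss(pixel_rgb, gt_rgb)                      # squared pixel differences
    if step > warmup_steps:
        loss = loss + lam * F.mse_loss(ray_feats, feat_teacher)
    return loss
```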
Taking the 2D image features as supervision, with the 2D feature extraction network as the teacher network and the feature extraction in the neural radiance field as the student network, this form of distillation learning refines the knowledge of the teacher network into the student network, accurately reflects the 3D geometry of the scene, and makes the trained features more consistent across different viewpoints.
Feature matching and implicit-space localization module
A pixel region (patch) of the garment intended for style-conversion editing, such as the coat or the trousers, is selected in an original input picture, and the average feature value of the selected region is computed:
φ(I_t)_avg = (1/n) Σ_u φ(I_t)_u
where n is the number of pixels in the selected patch. Using this regional average together with the features of the implicit space trained in the previous module, every pixel of a 2D picture rendered by the neural radiance field also has a feature value; for the rendered feature map φ(I_q) of a different view q, the feature distance is computed by taking the difference between each pixel u and the average, and if the error is within a certain threshold τ:
||φ(I_q)_u − φ(I_t)_avg|| ≤ τ
the pixel is considered to belong to a matched region. Through the above steps the target garment part can be extracted in all the rendered new views; in this step the features can also be reduced by principal component analysis (PCA) to a three-dimensional space and visualized as an RGB image of feature distances.
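The patch-average matching and thresholding just described can be sketched as follows (PyTorch; the feature-map shapes and the threshold value are illustrative assumptions):

```python
import torch

def garment_mask(rendered_feats, patch_feats, tau=0.5):
    """Match every rendered pixel against the mean feature of the selected
    garment patch and keep pixels within the threshold tau.

    rendered_feats: (H, W, C) feature map phi(I_q) rendered for one view.
    patch_feats:    (n, C)    features of the n pixels of the selected patch.
    """
    avg = patch_feats.mean(dim=0)                             # phi(I_t)_avg
    dist = torch.linalg.norm(rendered_feats - avg, dim=-1)    # (H, W) feature distances
    return dist <= tau                                        # boolean garment mask

# The boolean mask selects the garment region in every rendered view; the same
# distances, reduced to 3 dimensions with PCA, can be visualized as an RGB image.
```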
Diffusion model and editing module
As shown in FIG. 5, the overall flow of the diffusion model and editing module is as follows. A view angle is added to the conventional diffusion model as a condition input parallel to the text or image. The human body pictures input to the neural radiance field are taken at the same horizontal view angle, and the goal is to render only a view-angle range rotated horizontally by a certain angle, which then serves as the observation result after the clothing style has been edited. Therefore, with the implicit-space center of the reconstructed garment as the origin O, the camera direction of the new view is translated to the origin O and a main view angle θ₀ is defined (typically the view facing the human body); the horizontal angle between each view and the main view θ₀ is recorded as the input view angle θ, and θ together with the latent code of the 2D picture of that view is fed as an additional input to the diffusion model, which is trained with the reconstruction loss:
L = E[ ||ε − ε_θ(z_t, t, τ_c(y))||² ]
In the loss computation of the diffusion model, what is actually reconstructed is the noise: z_t is a random noise vector in the latent space, y is the input condition such as an image or text, and τ_c(y) is the result of encoding the input condition into the latent space. At a certain time step t, the noise recovered by the diffusion model from z_t is ε_θ(z_t, t, τ_c(y)); the loss between it and the actual noise ε is computed as a two-norm, which is in fact also the sum of squared pixel-level differences between the two noise images, and the diffusion model is trained with this loss.
The latent codes input to the diffusion model mentioned above, as well as the codes used for the target-style guidance pictures or text mentioned later, are obtained with the image encoder and text encoder of the text-image-pair pre-training model (CLIP), giving one latent-space code for the image or text. For the view-angle input, although only a rotation in the horizontal direction is considered, a full 25-dimensional vector representing the camera intrinsic and extrinsic parameters is still used, which is more reasonable; it is concatenated with the latent-space code of the image or text and mapped back to the original dimension through three linear layers as the final condition input.
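A minimal sketch of how this condition input could be assembled is given below (PyTorch; the CLIP latent dimension, the three-layer mapping and the U-Net interface `eps_model` are assumptions made to illustrate the description, not code from the application):

```python
import torch
import torch.nn as nn

class ViewConditionEncoder(nn.Module):
    """Concatenate the CLIP latent of the guidance image/text with the
    25-dimensional camera vector and map back to the latent dimension
    through three linear layers, as described above."""

    def __init__(self, clip_dim=768, cam_dim=25):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + cam_dim, clip_dim), nn.SiLU(),
            nn.Linear(clip_dim, clip_dim), nn.SiLU(),
            nn.Linear(clip_dim, clip_dim),
        )

    def forward(self, clip_latent, camera_vec):
        return self.mlp(torch.cat([clip_latent, camera_vec], dim=-1))

def view_conditioned_loss(eps_model, cond_enc, z0, clip_latent, camera_vec, alphas_bar):
    """Epsilon-prediction loss with the view-aware condition (the noise
    schedule alphas_bar is as in the earlier diffusion sketch)."""
    t = torch.randint(0, alphas_bar.shape[0], (z0.shape[0],))
    noise = torch.randn_like(z0)
    a_bar = alphas_bar[t].view(-1, *([1] * (z0.dim() - 1)))
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise
    cond = cond_enc(clip_latent, camera_vec)
    return nn.functional.mse_loss(eps_model(z_t, t, cond), noise)
```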
Finally, the self-recognized clothing region in the output images of the preceding neural radiance field module is replaced with the newly generated multi-view clothing images according to the given conditions, realizing the editing effect on the input human clothing. By default, views within 45 degrees on either side of the main view are output, and a gif image (Graphics Interchange Format) rotating over a 90-degree horizontal view range achieves a virtual outfit-change effect to a certain extent. The related art currently offers consumers virtual fitting services, but only garments in the provided database can be substituted, and a more convenient and free editing mode and effect cannot be provided.
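For the final replacement step, a simple per-view composite of the newly generated garment into the rendered image, using the garment mask obtained from the feature-matching step, could look like the sketch below (an illustrative assumption about the data layout, not the application's own procedure):

```python
import numpy as np

def composite_view(rendered, new_garment, mask):
    """Replace the masked garment region of one rendered view with the
    newly generated garment image of the same size.

    rendered, new_garment: (H, W, 3) uint8 arrays; mask: (H, W) boolean
    garment mask from the feature-matching step.
    """
    out = rendered.copy()
    out[mask] = new_garment[mask]
    return out

# Applying composite_view to every rendered view (e.g. the main view and the
# views within +/-45 degrees of it) yields the edited multi-view output, which
# can then be assembled into a rotating gif with any image I/O library.
```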
The embodiments of the present application also provide a storage medium storing a computer program which, when executed, performs at least the method as described above.
The embodiment of the application also provides a control device, which comprises a processor and a storage medium for storing a computer program; wherein the processor is adapted to perform at least the method as described above when executing said computer program.
The embodiments of the present application also provide a processor executing a computer program, at least performing the method as described above.
The storage medium may be implemented by any type of volatile or non-volatile storage device, or combination thereof. The storage media described in embodiments of the present application are intended to comprise, without being limited to, these and any other suitable types of memory.
In the several embodiments provided by the present application, it should be understood that the disclosed systems and methods may be implemented in other ways. The above described device embodiments are only illustrative, e.g. the division of the units is only one logical function division, and there may be other divisions in practice, such as: multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the various components shown or discussed may be coupled or directly coupled or communicatively coupled to each other via some interface, whether indirectly coupled or communicatively coupled to devices or units, whether electrically, mechanically, or otherwise.
The units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk or an optical disk, or the like, which can store program codes.
Alternatively, the above-described integrated units of the present application may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
The methods disclosed in the method embodiments provided by the application can be arbitrarily combined under the condition of no conflict to obtain a new method embodiment.
The features disclosed in the several product embodiments provided by the application can be combined arbitrarily under the condition of no conflict to obtain new product embodiments.
The features disclosed in the embodiments of the method or the apparatus provided by the application can be arbitrarily combined without conflict to obtain new embodiments of the method or the apparatus.
The foregoing is a further detailed description of the application in connection with the preferred embodiments, and it is not intended that the application be limited to the specific embodiments described. It will be apparent to those skilled in the art that several equivalent substitutions and obvious modifications can be made without departing from the spirit of the application, and the same should be considered to be within the scope of the application.

Claims (10)

1. A priori-free multi-view human body clothing editing method, characterized by comprising the following steps:
1) inputting multi-view human body pictures into a self-supervised human clothing recognition and extraction module to obtain feature maps corresponding to the input pictures;
2) inputting the input pictures and the corresponding feature maps into a feature-distillation neural radiance field module for implicit three-dimensional reconstruction, so that points in the implicit space carry feature information;
3) selecting partial pixels of the clothing to be edited, performing feature matching and localization in the implicit space, and separating the clothing to be edited;
4) inputting the separated clothing to be edited, target editing conditions and view-angle information into a diffusion model to generate an edited multi-view result.
2. The priori-free multi-view human body clothing editing method of claim 1, wherein the self-supervised human clothing recognition and extraction module is trained using a data set and a pre-trained DINO model, the data set comprising pictures of various garments worn by different people against a solid-color background, classified according to the garments, so that the trained self-supervised human clothing recognition and extraction module can recognize the clothing parts in human body pictures.
3. The priori-free multi-view human body clothing editing method of claim 1 or 2, wherein in step 2), the RGB information of the sampled three-dimensional point set and the feature information corresponding to each point are input into the feature-distillation neural radiance field module, which outputs, through fully connected layers, three-dimensional RGB information, the volume density, and a C-dimensional feature, where C is the number of feature channels.
4. The priori-free multi-view human body clothing editing method of any one of claims 1 to 3, wherein in step 2), the neural radiance field module is set up as a teacher-student network for feature distillation: the network that extracts 2D image features serves as the teacher network and guides the feature learning in the implicit space of the neural radiance field, the features of the 2D teacher network are distilled into the 3D student network, and the features are extracted together with the 3D geometry by neural rendering.
5. The priori-free multi-view human body clothing editing method of any one of claims 1 to 4, wherein in step 3), a pixel region (patch) of the garment to be style-edited is selected from an original input picture, the features of the corresponding part are obtained from the feature map and matched against the features of points in the implicit space of the neural radiance field; for each view picture, the distance between the pixel features of the region and the patch features of the selected garment is computed, reduced by principal component analysis (PCA) to a three-dimensional space and visualized as an RGB image of feature distances; a threshold on the feature-matching difference is set and used to screen the rendered pictures, yielding multi-view pictures of the garment to be edited.
6. The priori-free multi-view human body clothing editing method of any one of claims 1 to 5, wherein the diffusion model is trained as follows: for the input view-angle and picture information, the implicit-space center of the reconstructed garment is taken as the origin O, the camera direction of the new view is translated to the origin O, and a main view angle θ₀ is defined; the horizontal angle between each view and the main view θ₀ is recorded as the input view angle θ, reflected as a multi-dimensional camera projection matrix, which is concatenated with the latent code corresponding to the 2D picture of that view and fed as a combined input to the diffusion model, and the loss is computed between the code restored in the latent space and the original code.
7. The priori-free multi-view human body clothing editing method of any one of claims 1 to 6, wherein in step 4), the image encoder of a pre-trained text-image-pair model is used: the input 2D picture is passed through the image encoder to obtain its corresponding latent code, which, after being concatenated with the camera parameters, is restored to the original dimension through several linear layers and used as input to the diffusion model.
8. The priori-free multi-view human body clothing editing method of any one of claims 1 to 7, wherein in step 4), the code of the separated garment picture to be edited is mixed with the code of the target editing condition to obtain a mixed code; the view-angle information is positionally encoded; after the mixed code is concatenated with the positional code, it is mapped back to the original dimension through several linear layers to obtain a code z, which is concatenated with a latent-space random noise vector and fed into the diffusion model together with a time step t.
9. The priori-free multi-view human body clothing editing method of any one of claims 1 to 8, wherein in step 4), using the image and text encoders of a pre-trained text-image-pair model, a garment in the target editing style, a pixel region (patch), or a short text description is input as a prompt; the latent code obtained through the encoder is concatenated with the given camera parameters, and the diffusion model module generates the multi-view output that finally realizes the style change; preferably, multi-view consistency processing is carried out based on the result of squaring and summing the pixel-wise differences between two images.
10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the apriori-free multi-view human apparel editing method of any one of claims 1 to 9.
CN202311131853.4A 2023-09-04 2023-09-04 Priori-free multi-view human body clothes editing method Pending CN117095128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311131853.4A CN117095128A (en) 2023-09-04 2023-09-04 Priori-free multi-view human body clothes editing method


Publications (1)

Publication Number Publication Date
CN117095128A (en) 2023-11-21

Family

ID=88773313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311131853.4A Pending CN117095128A (en) 2023-09-04 2023-09-04 Priori-free multi-view human body clothes editing method

Country Status (1)

Country Link
CN (1) CN117095128A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292041A (en) * 2023-11-27 2023-12-26 广东广物互联网科技有限公司 Semantic perception multi-view three-dimensional human body reconstruction method, device and medium
CN117292041B (en) * 2023-11-27 2024-03-26 广东广物互联网科技有限公司 Semantic perception multi-view three-dimensional human body reconstruction method, device and medium
CN117475067A (en) * 2023-12-28 2024-01-30 江西农业大学 Visual quick field rendering method and device based on nerve radiation field
CN117475067B (en) * 2023-12-28 2024-03-08 江西农业大学 Visual quick field rendering method and device based on nerve radiation field


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination