CN116721200A - Three-dimensional object generation method based on diffusion model and semantic guidance - Google Patents

Three-dimensional object generation method based on diffusion model and semantic guidance

Info

Publication number
CN116721200A
CN116721200A (application CN202310285348.9A)
Authority
CN
China
Prior art keywords
diffusion
vector
point cloud
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310285348.9A
Other languages
Chinese (zh)
Inventor
耿卫东
凌泽宇
付一童
厉向东
梁秀波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310285348.9A priority Critical patent/CN116721200A/en
Publication of CN116721200A publication Critical patent/CN116721200A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional object generation method based on a diffusion model and semantic guidance, comprising the following steps: generating a semantic vector from text data using the text encoder of a CLIP model; generating a shape vector from the semantic vector and a first random noise using a conditional flow model, and splicing the shape vector with a time step vector as the guiding condition; taking a second random noise as the initial inverse diffusion vector, and generating a low-dimensional point cloud vector by inverse diffusion with the diffusion model based on the guiding condition and the initial inverse diffusion vector; and decoding the low-dimensional point cloud vector with a point cloud decoder to obtain a high-dimensional point cloud, from which the three-dimensional object is generated. The method improves the efficiency and quality of three-dimensional object generation and enriches the diversity of the generated objects.

Description

Three-dimensional object generation method based on diffusion model and semantic guidance
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a three-dimensional point cloud generation method based on a diffusion model and semantic guidance.
Background
With the development of computer technology and graphics technology, three-dimensional objects have been widely used in many fields. Games, movies and animation construct virtual worlds with three-dimensional models as their basic resources; industry and construction use three-dimensional objects to accelerate or optimize product design; medicine and biochemistry use three-dimensional objects to represent accurate models; and social and media applications use three-dimensional objects to build richer entertainment products and information dissemination tools.
Currently, three-dimensional object generation mainly depends on modelers or designers with professional skills manually creating objects in professional modeling tools such as 3ds Max and Maya.
With the rapid development of deep learning techniques, a desired three-dimensional object can be obtained with a deep learning model. Deep-learning-based three-dimensional object generation methods can be roughly divided into two classes. One is three-dimensional reconstruction for a specific object: information about a single object or scene (such as an RGB-D image) is taken as input and the corresponding three-dimensional representation is output, a one-to-one mapping. The other class is based on generative models, which learn a data distribution from a large number of three-dimensional object samples and then generate new sample instances; this approach is mostly used to generate large numbers of diverse three-dimensional objects.
The flow model is a generative model. Generation networks based on flow models are widely applied in fields such as image generation and audio/video synthesis. Other generative models, such as variational auto-encoders and generative adversarial networks, use parameterized models to learn an implicit spatial distribution of the data, so an exact likelihood function cannot be computed to optimize training. A flow model instead learns the conversion between two distributions through a series of reversible transformation functions: it starts from a standard normal distribution and transforms it, via those reversible functions, into the probability distribution of the target data, which yields good generation results.
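For illustration only, the following sketch (assuming a PyTorch implementation; all identifiers are hypothetical and not part of the disclosure) shows the core idea of a flow model: a single invertible affine transform with an exact log-determinant term, trained by maximum likelihood via the change-of-variables formula.

```python
import torch
import torch.nn as nn

class AffineFlowStep(nn.Module):
    """One invertible transform y = x * exp(s) + t with exact log-det."""
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))  # log-scale
        self.t = nn.Parameter(torch.zeros(dim))  # shift

    def forward(self, x):
        y = x * torch.exp(self.s) + self.t
        log_det = self.s.sum()                   # log|det dy/dx|
        return y, log_det

    def inverse(self, y):
        return (y - self.t) * torch.exp(-self.s)

# Maximum-likelihood training: map data back toward N(0, I) and apply
# log p(x) = log N(f(x); 0, I) + log|det df/dx|.
flow = AffineFlowStep(dim=128)
x = torch.randn(16, 128)                         # stand-in for real samples
z, log_det = flow(x)
nll = 0.5 * (z ** 2).sum(dim=1).mean() - log_det # negative log-likelihood, up to a constant
```

A practical flow stacks many such reversible steps; the conditional flow model used by the invention is described in the detailed description below.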
The diffusion model is a generative model widely used for various generative tasks. Its core idea is that Gaussian noise is continuously added to a real sample in the training stage, and the noise is predicted by a noise prediction network; in the inference stage, a random noise is fed to the noise prediction network, which restores it to a sample.
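As a minimal, generic sketch of this training-time idea (assuming PyTorch; the schedule and network names are placeholders, and the invention's own noise schedule and network appear in the detailed description):

```python
import torch

def diffusion_training_step(x0, noise_net, T=1000):
    """One DDPM-style step: noise a real sample, ask the network to
    predict the added noise, and penalize the prediction error.
    noise_net is any module taking (noisy sample, timestep)."""
    betas = torch.linspace(1e-4, 0.02, T)               # illustrative schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (x0.shape[0],))             # random step per sample
    eps = torch.randn_like(x0)                          # Gaussian noise
    a = alpha_bar[t].view(-1, *[1] * (x0.dim() - 1))
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * eps         # closed-form forward diffusion
    return ((noise_net(xt, t) - eps) ** 2).mean()       # noise-prediction loss
```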
At present, some works apply diffusion models to the three-dimensional object generation task; compared with non-diffusion methods, the diffusion model can generate three-dimensional objects of higher quality, but these methods still suffer from slow training, slow inference, and the ability to generate only a single class of objects.
Disclosure of Invention
In view of the above, the present invention aims to provide a three-dimensional object generating method based on diffusion model and semantic guidance, which improves the efficiency and quality of three-dimensional object generation and enriches the diversity of generated objects.
To achieve the above object, an embodiment provides a three-dimensional object generating method based on a diffusion model and semantic guidance, including the steps of:
generating a semantic vector according to the text data by using a text encoder of the CLIP model;
generating a shape vector according to the semantic vector and the first random noise by using a conditional flow model, and splicing the shape vector and the time step vector as guiding conditions;
taking the second random noise as an initial inverse diffusion vector, and generating a low-dimensional point cloud vector by inverse diffusion based on a guide condition and the initial inverse diffusion vector by using a diffusion model;
and decoding the low-dimensional point cloud vector by using a point cloud decoder to obtain a high-dimensional point cloud, and generating a three-dimensional object according to the high-dimensional point cloud.
In one embodiment, the conditional flow model and diffusion model need to undergo parameter optimization before being applied, comprising the steps of:
constructing a sample: preparing a high-dimensional point cloud, and rendering the high-dimensional point cloud to obtain an object image;
building a training system: the training system comprises a shape encoder, a CLIP model, a point cloud encoder, a conditional flow model and a diffusion model. The high-dimensional point cloud is encoded by the shape encoder and the point cloud encoder to obtain a shape vector and a low-dimensional point cloud vector, and the object image is encoded by the image encoder of the CLIP model to obtain a semantic vector; the shape vector and the semantic vector generate a third random noise through the conditional flow model. In the forward diffusion process of the diffusion model, the low-dimensional point cloud vector is used as the initial forward diffusion vector, and forward diffusion is realized by adding actual noise at each diffusion step, giving the forward diffusion vector of each step. In the reverse diffusion process of the diffusion model, a fourth random noise randomly sampled from a Gaussian distribution is used as the initial reverse diffusion vector, the splicing result of the shape vector and the time step vector is used as the guiding condition, the accumulated noise is calculated for each step of reverse diffusion based on the guiding condition and the reverse diffusion vector of the previous diffusion step, and the reverse diffusion vector of each diffusion step is calculated according to the accumulated noise;
constructing a loss function: taking the difference between the third random noise and the preset noise label as a supervision loss function of the conditional flow model, and taking the difference between the actual noise added in the forward diffusion process of the same diffusion step and the accumulated noise calculated in the reverse diffusion process as a supervision loss function of the diffusion model;
training the system: training the training system with the loss function and the samples to optimize the parameters of the conditional flow model and the diffusion model.
In one embodiment, the point cloud encoder and the point cloud decoder need to undergo parameter optimization before being applied, comprising the steps of:
the true Gao Weidian cloud is encoded by a point cloud encoder to obtain a low-dimensional point cloud vector, the low-dimensional point cloud vector and the random sphere point cloud are decoded by a point cloud decoder to obtain a reconstructed high-dimensional point cloud, and parameters of the point cloud encoder and the point cloud decoder are optimized by calculating the difference between the true Gao Weidian cloud and the reconstructed Gao Weidian cloud.
In one embodiment, in the back diffusion process of the diffusion model, the noise prediction network predicts accumulated noise based on the guiding condition and the back diffusion vector of the previous diffusion step. The noise prediction network comprises at least two noise prediction units, each comprising a feature fusion module, a feature extraction module and a feature propagation module: the guiding condition and the back diffusion vector of the previous diffusion step are fused into a stitching feature by the feature fusion module, the stitching feature extracts three modal features through the point cloud branch, sampling branch and voxel branch of the feature extraction module respectively, and the feature computed from the three modal features by the feature propagation module serves as the accumulated noise.
In one embodiment, the feature fusion module includes three linear layers; the fusion feature is obtained by dot-multiplying the back diffusion vector of the previous diffusion step after passing through the first linear layer with the result of the guiding condition after passing through the second linear layer, and the stitching feature is obtained by splicing the result of the guiding condition after passing through the second and third linear layers with the fusion feature.
In one embodiment, in the feature extraction module, the stitching feature obtains a first modal feature through a first MLP in the point cloud branch, obtains a second modal feature through a voxelization operation, a second MLP and a devoxelization operation in the voxel branch, and, after being sampled in the sampling branch, is stitched with the second modal feature to obtain a third modal feature.
In one embodiment, the feature propagation module includes an upsampling layer, a splicing layer, a third MLP and a PVConv layer; the second modal feature and the third modal feature, after passing through the upsampling layer, are spliced in the splicing layer with the second modal feature and the first modal feature, and the splicing result is computed sequentially by the third MLP and the PVConv layer to obtain the feature serving as the accumulated noise.
In one embodiment, the PVConv layer includes a point cloud branch and a voxel branch; the feature output by the third MLP obtains one feature by passing through the first MLP in the point cloud branch and obtains another feature by sequentially passing through the voxelization operation, the second MLP and the devoxelization operation in the voxel branch, and the features output by the two branches are spliced to obtain the accumulated noise.
In one embodiment, the time step vector is obtained from a parameter uniformly sampled in [0,1] via an embedding representation.
Compared with the prior art, the invention has the beneficial effects that at least the following steps are included:
the text encoder of the CLIP model is utilized to generate semantic vectors corresponding to the text data on the basis of the text data, the conditional flow model is utilized to generate shape vectors based on the semantic vectors to construct guide conditions, then the inverse diffusion of the diffusion model is utilized to generate low-dimensional point cloud vectors based on the guide conditions, and then the high-dimensional point cloud used for generating the three-dimensional object is constructed based on low-dimensional point cloud vector decoding.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a three-dimensional object generation method based on diffusion model and semantic guidance provided by an embodiment;
FIG. 2 is a schematic diagram of a training system provided by an embodiment;
FIG. 3 is a schematic diagram of a shape encoder provided by an embodiment;
FIG. 4 is a schematic diagram of a conditional flow model structure and a forward calculation process according to an embodiment;
FIG. 5 is a schematic diagram of a point cloud encoder according to an embodiment;
fig. 6 is a schematic structural diagram of a point cloud decoder according to an embodiment;
FIG. 7 is a schematic diagram of a noise prediction network provided by an embodiment;
FIG. 8 is a schematic diagram of a feature fusion module according to an embodiment;
FIG. 9 is a schematic diagram of a feature extraction module according to an embodiment;
FIG. 10 is a schematic diagram of a feature propagation module provided by an embodiment;
fig. 11 is a schematic structural diagram of the PVConv layer provided in the embodiment;
FIG. 12 is a flow chart of generating a three-dimensional object using a conditional flow model and a diffusion model provided by an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
An embodiment provides a three-dimensional object generation method based on a diffusion model and semantic guidance, as shown in fig. 1, comprising the following steps:
and step 1, constructing a training sample.
In an embodiment, a high-dimensional point cloud representing shape and appearance is prepared, and the high-dimensional point cloud is rendered to obtain an object image; the high-dimensional point cloud and the corresponding object image serve as a training sample.
And 2, constructing a training system comprising a shape encoder, a CLIP model, a point cloud encoder, a conditional flow model and a diffusion model.
In an embodiment, as shown in fig. 2, the training system includes a shape encoder, a CLIP model, a point cloud encoder, a conditional flow model, and a diffusion model. The shape encoder is configured to encode the high-dimensional point cloud into a shape vector and uses a multi-layer perceptron structure, as shown in fig. 3: the input high-dimensional point cloud is encoded into features μ and σ, from which the shape vector z is computed by the formula z = μ + ε·exp(0.5·log(σ²)), where ε represents a randomly generated offset value. In fig. 3, Conv1d represents a one-dimensional convolution layer, BatchNorm1d represents one-dimensional batch normalization, ReLU represents the ReLU activation function, MaxPooling represents a max pooling operation, and Linear represents a linear layer.
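A minimal sketch of such a shape encoder follows (assuming PyTorch; the layer sizes are illustrative placeholders, not taken from the patent figures):

```python
import torch
import torch.nn as nn

class ShapeEncoder(nn.Module):
    """Sketch of the MLP-style shape encoder: Conv1d/BatchNorm1d/ReLU stacks,
    max pooling over points, then linear heads for mu and sigma."""
    def __init__(self, z_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(3, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
        )
        self.mu = nn.Linear(256, z_dim)
        self.log_var = nn.Linear(256, z_dim)        # interpreted as log(sigma^2)

    def forward(self, points):                      # points: (B, 3, N)
        h = self.net(points).max(dim=2).values      # global max pooling
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                  # randomly generated offset value
        return mu + eps * torch.exp(0.5 * log_var)  # z = mu + eps * exp(0.5 * log(sigma^2))
```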
The CLIP model is trained on hundreds of millions of text-image pairs by contrastive learning and can map pictures and the words describing the same things to similar feature vectors. It comprises an image encoder and a text encoder, as shown in figs. 2 and 12: in the training stage, the semantic vector c of the object image is extracted by the image encoder of the CLIP model; in the inference stage, the semantic vector c of the text data is extracted by the text encoder of the CLIP model. The semantic vector serves as the semantic condition vector guiding the diffusion model.
The conditional flow model is used to learn a reversible nonlinear transformation f. In the training stage, the third random Gaussian noise g can be estimated from the semantic vector c, serving as the condition, and the shape vector z with channel number D; in the inference stage, a shape vector z is generated from the first random Gaussian noise and the semantic vector c serving as the condition. A conditional flow model and its forward calculation process are shown in fig. 4, where z_{:d} denotes the first d dimensions of the shape vector z, z_{d+1:} denotes the D−d dimensions of z starting from dimension d+1, and scale and shift are each 3-layer MLPs that project the concatenation of z_{:d} and c onto the remaining D−d dimensions. The forward computation f is reversible, so in the inference stage the inverse transformation f⁻¹ can be applied to the input first random Gaussian noise g to yield a shape vector z, i.e., z = f⁻¹(g).
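As an illustrative sketch of one such conditional coupling step (assuming PyTorch; dimensions, widths and names are hypothetical, and a full model would stack several steps with permutations):

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, d_out))           # a 3-layer MLP

class ConditionalCoupling(nn.Module):
    """The first d dims pass through unchanged and, together with the
    semantic vector c, parameterize an affine map of the other D-d dims."""
    def __init__(self, D, d, c_dim):
        super().__init__()
        self.d = d
        self.scale = mlp(d + c_dim, D - d)
        self.shift = mlp(d + c_dim, D - d)

    def forward(self, z, c):                   # training direction: z -> g
        z1, z2 = z[:, :self.d], z[:, self.d:]
        h = torch.cat([z1, c], dim=1)
        g2 = z2 * torch.exp(self.scale(h)) + self.shift(h)
        return torch.cat([z1, g2], dim=1)

    def inverse(self, g, c):                   # inference direction: g -> z
        g1, g2 = g[:, :self.d], g[:, self.d:]
        h = torch.cat([g1, c], dim=1)
        z2 = (g2 - self.shift(h)) * torch.exp(-self.scale(h))
        return torch.cat([g1, z2], dim=1)
```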
Diffusion model inference is very slow, and this is especially pronounced for three-dimensional point cloud data, particularly dense point clouds. The embodiment therefore uses a diffusion model based on a hidden (latent) space: a point cloud encoder is introduced to obtain a low-dimensional point cloud vector of the hidden space as the input of the diffusion model, and a point cloud decoder is introduced to decode the low-dimensional point cloud vector output by the diffusion model into a high-dimensional point cloud.
The point cloud encoder is used to encode the high-dimensional point cloud into a low-dimensional point cloud vector. The point cloud decoder in fig. 12 is used to decode a low-dimensional point cloud vector (either the output of the point cloud encoder, or the diffusion vector generated by the final step of the reverse diffusion process of the diffusion model) together with a sphere point cloud into a high-dimensional point cloud, i.e., to gradually unfold the sphere point cloud, according to the low-dimensional point cloud vector, into the shape it describes. Figs. 5 and 6 show exemplary structures of the point cloud encoder and point cloud decoder, in which Cov denotes computing the covariance matrix of the high-dimensional point cloud, KNN denotes the K nearest neighbors of each point in the high-dimensional point cloud, Graph Layer denotes a module that processes data information based on the graph structure (comprising a series of max pooling layers, fully connected layers, activation functions and convolution layers), and GlobalMaxPooling denotes a global max pooling operation. In the point cloud decoder shown in fig. 6, Concat denotes a splicing operation and Folding denotes a folding operation on point cloud data, consisting of a series of convolution layers and activation functions.
The point cloud encoder and the point cloud decoder need to undergo parameter optimization before being applied, comprising the following steps: the real high-dimensional point cloud is encoded by the point cloud encoder to obtain a low-dimensional point cloud vector, the low-dimensional point cloud vector and a random sphere point cloud are decoded by the point cloud decoder to obtain a reconstructed high-dimensional point cloud, and the parameters of the point cloud encoder and the point cloud decoder are optimized by calculating the difference between the real and reconstructed high-dimensional point clouds. The difference may use a mean square error.
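A hedged sketch of this pre-training loop follows (assuming PyTorch; encoder, decoder and loader are placeholders, and while the patent names mean squared error, a set-aware loss such as Chamfer distance is a common alternative in practice):

```python
import torch

def train_autoencoder(encoder, decoder, loader, epochs=10, lr=1e-3):
    """Encode the real high-dimensional point cloud, decode it together
    with a random sphere point cloud, and minimize reconstruction error."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for real_pc in loader:                               # (B, N, 3)
            sphere = torch.randn(real_pc.shape[0], real_pc.shape[1], 3)
            sphere = sphere / sphere.norm(dim=2, keepdim=True)  # unit sphere points
            latent = encoder(real_pc)                        # low-dimensional vector
            recon = decoder(latent, sphere)                  # folded reconstruction
            loss = ((recon - real_pc) ** 2).mean()           # MSE, per the patent
            opt.zero_grad()
            loss.backward()
            opt.step()
```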
The hidden-space diffusion model likewise comprises a forward diffusion process and a reverse diffusion process. In the forward diffusion process, the low-dimensional point cloud vector output by the point cloud encoder is used as the initial forward diffusion vector, and forward diffusion is realized by adding actual noise at each diffusion step, giving the forward diffusion vector of each step by the formula:

X^(t) = √(1−β^(t)) · X^(t−1) + √(β^(t)) · N^(t−1)

where X^(t) and X^(t−1) respectively denote the forward diffusion vectors at diffusion steps t and t−1, β^(t) is a parameter uniformly sampled in [0,1] according to the total number of steps T, and N^(t−1) is the actual noise added at step t−1, for which Gaussian noise may be used.
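This step can be sketched directly (assuming PyTorch; the schedule for β^(t) is supplied by the caller):

```python
import torch

def forward_diffusion_step(x_prev, beta_t):
    """One forward step: X^(t) = sqrt(1 - beta_t) * X^(t-1) + sqrt(beta_t) * N^(t-1).
    Returns the diffused vector and the actual noise, kept as the training label."""
    noise = torch.randn_like(x_prev)                  # actual Gaussian noise N^(t-1)
    x_t = (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise
    return x_t, noise
```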
In the back diffusion process, the fourth random noise randomly sampled from a Gaussian distribution is taken as the initial back diffusion vector X′^(T) of the back diffusion process, and the splicing result of the shape vector and the time step vector is taken as the guiding condition. For each step of back diffusion, the accumulated noise is calculated based on the guiding condition and the back diffusion vector of the previous diffusion step, and the back diffusion vector of each diffusion step is calculated according to the accumulated noise:

X′^(t−1) = (X′^(t) − ((1−α^(t)) / √(1−ᾱ^(t))) · N′^(t−1)) / √(α^(t)) + √(β^(t)) · ε, with α^(t) = 1−β^(t) and ᾱ^(t) = α^(1)·α^(2)·…·α^(t)

where X′^(t−1) and X′^(t) denote the back diffusion vectors corresponding to diffusion steps t−1 and t, ε is fresh Gaussian noise, and N′^(t−1) is the accumulated noise predicted by the noise prediction network net( ) from the guiding condition (z, s) and the back diffusion vector X′^(t) of the previous diffusion step, formulated as:

N′^(t−1) = net(X′^(t), z, s)
s = embedding(β^(t))

where s denotes the time step vector, obtained from the parameter β^(t) via the embedding representation embedding( ).
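A sketch of the resulting sampling loop follows (assuming PyTorch; the update below is the standard DDPM posterior step, while the patent itself only fixes the interface N′ = net(X′, z, s) with s = embedding(β), so the exact coefficients here are an assumption):

```python
import torch

@torch.no_grad()
def reverse_diffusion(net, embedding, z, shape, betas):
    """Start from Gaussian noise X'^(T) and repeatedly remove the
    accumulated noise predicted by net(X', z, s)."""
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                        # fourth random noise, X'^(T)
    for t in reversed(range(T)):
        s = embedding(betas[t])                   # time step vector
        n_pred = net(x, z, s)                     # accumulated noise N'^(t-1)
        coef = (1.0 - alphas[t]) / (1.0 - alpha_bar[t]).sqrt()
        x = (x - coef * n_pred) / alphas[t].sqrt()
        if t > 0:                                 # no fresh noise on the last step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                      # low-dimensional point cloud vector
```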
The noise prediction network provided by the embodiment is shown in fig. 7 and can improve local detail. It comprises at least two noise prediction units, each comprising a feature fusion module, a feature extraction module and a feature propagation module: the guiding condition (z, s) and the back diffusion vector X′^(t) of the previous diffusion step are fused into a stitching feature by the feature fusion module, the stitching feature extracts three modal features through the point cloud branch, sampling branch and voxel branch of the feature extraction module respectively, and the feature computed from the three modal features by the feature propagation module serves as the accumulated noise N′^(t−1).
As shown in fig. 8, the feature fusion module provided by the embodiment comprises three linear layers: the back diffusion vector X′^(t) of the previous diffusion step is passed through the first linear layer and dot-multiplied with the result of the guiding condition (z, s) passed through the second linear layer to obtain a fusion feature, and the result of the guiding condition passed sequentially through the second and third linear layers is spliced with the fusion feature to obtain the stitching feature M.
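A minimal sketch of this module (assuming PyTorch; all widths are hypothetical placeholders):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Three linear layers: fuse the previous back diffusion vector with
    the guiding condition and splice in the transformed condition."""
    def __init__(self, x_dim, cond_dim, hidden):
        super().__init__()
        self.lin1 = nn.Linear(x_dim, hidden)      # on the back diffusion vector
        self.lin2 = nn.Linear(cond_dim, hidden)   # on the guiding condition
        self.lin3 = nn.Linear(hidden, hidden)

    def forward(self, x_prev, cond):
        a = self.lin1(x_prev)
        b = self.lin2(cond)
        fused = a * b                             # element-wise (dot) multiplication
        c = self.lin3(b)                          # condition through lin2 then lin3
        return torch.cat([c, fused], dim=-1)      # stitching feature M
```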
As shown in fig. 9, the feature extraction module provided in the embodiment includes a point cloud branch (Point-branch), a voxel branch (Voxel-branch) and a sampling branch (Sampling). The stitching feature M output by the feature fusion module passes through a first MLP in the point cloud branch to obtain the first modal feature P, passes through a voxelization operation, a second MLP and a devoxelization operation in the voxel branch to obtain the second modal feature Y, and, after being sampled in the sampling branch, is spliced with the second modal feature to obtain the third modal feature C. Note that the point cloud branch and voxel branch structures in the feature extraction module are the same as those in the PVConv layer shown in fig. 11.
As shown in fig. 10, the feature propagation module provided in the embodiment includes an upsampling layer (UpSampling), a splicing layer (Concat), a third MLP and a PVConv layer. The second modal feature Y and the third modal feature C, after passing through the upsampling layer, are spliced in the splicing layer with the second modal feature Y and the first modal feature P, and the splicing result is computed sequentially by the third MLP and the PVConv layer to obtain the feature serving as the accumulated noise N′^(t−1).
As shown in fig. 11, the PVConv layer provided by the embodiment comprises a point cloud branch (Point-branch) and a voxel branch (Voxel-branch). The feature Q output by the third MLP passes through the first MLP in the point cloud branch to obtain one feature, and sequentially passes through a voxelization operation (Voxelization), a second MLP and a devoxelization operation (trilinear devoxelization) in the voxel branch to obtain another feature; the features output by the two branches are spliced to obtain the accumulated noise N′^(t−1). Here GroupNorm denotes the group normalization function and Swish denotes the Swish activation function.
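For illustration, a simplified sketch of the two-branch PVConv idea follows (assuming PyTorch; the grid resolution, channel widths and axis handling are placeholders chosen here, not the patent's exact implementation, and c_out should be divisible by the GroupNorm group count):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PVConvSketch(nn.Module):
    """Per-point MLP plus a voxel branch: voxelize -> 3D conv with
    GroupNorm/Swish -> trilinear devoxelization, then splice branches."""
    def __init__(self, c_in, c_out, r=16):
        super().__init__()
        self.r = r
        self.point_mlp = nn.Sequential(nn.Linear(c_in, c_out), nn.SiLU())  # SiLU == Swish
        self.voxel_conv = nn.Sequential(
            nn.Conv3d(c_in, c_out, 3, padding=1),
            nn.GroupNorm(8, c_out), nn.SiLU())

    def forward(self, feats, coords):             # feats: (B,N,C), coords in [-1,1]
        B, N, C = feats.shape
        r = self.r
        # voxelize: average point features into an r^3 grid
        idx = ((coords + 1) / 2 * (r - 1)).round().long().clamp(0, r - 1)
        flat = idx[..., 0] * r * r + idx[..., 1] * r + idx[..., 2]        # (B,N)
        grid = feats.new_zeros(B, r * r * r, C)
        cnt = feats.new_zeros(B, r * r * r, 1)
        grid.scatter_add_(1, flat.unsqueeze(-1).expand(-1, -1, C), feats)
        cnt.scatter_add_(1, flat.unsqueeze(-1), torch.ones_like(feats[..., :1]))
        grid = (grid / cnt.clamp(min=1)).view(B, r, r, r, C).permute(0, 4, 1, 2, 3)
        v = self.voxel_conv(grid)                 # (B, c_out, r, r, r)
        # devoxelize: trilinear sampling back at the point coordinates
        samp = coords.flip(-1).view(B, N, 1, 1, 3)  # grid_sample wants xyz = (W,H,D)
        pv = F.grid_sample(v, samp, mode='bilinear', align_corners=True)
        pv = pv.view(B, -1, N).transpose(1, 2)    # (B, N, c_out)
        return torch.cat([self.point_mlp(feats), pv], dim=-1)  # splice both branches
```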
And 3, constructing a loss function.
In an embodiment, the difference between the third random noise and the preset noise label is used as the supervised loss function of the conditional flow model, and the difference between the actual noise added in the forward diffusion process and the accumulated noise calculated in the reverse diffusion process at the same diffusion step is used as the supervised loss function of the diffusion model. The difference between noise quantities may be measured by a mean square error.
And 4, carrying out parameter optimization on the training system by using the loss function and the training sample.
In the embodiment, when the parameters of the training system are optimized with the loss function and the training samples, the parameters of the point cloud encoder, the point cloud decoder, the shape encoder and the CLIP model are kept fixed, and minimizing the loss function is taken as the optimization target for the parameters of the conditional flow model and the diffusion model. The embodiment trains the diffusion model on the encoded, implicit low-dimensional point cloud vectors, thereby accelerating training, reducing network parameters and saving video memory overhead.
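This freezing arrangement can be sketched as follows (assuming PyTorch; all module names are hypothetical handles to the components described above):

```python
import torch

def freeze_and_build_optimizer(point_cloud_encoder, point_cloud_decoder,
                               shape_encoder, clip_model,
                               flow_model, noise_net, lr=1e-4):
    """Freeze the pre-trained modules so that only the conditional flow
    model and the noise prediction network receive gradient updates."""
    for module in (point_cloud_encoder, point_cloud_decoder,
                   shape_encoder, clip_model):
        for p in module.parameters():
            p.requires_grad = False
    trainable = list(flow_model.parameters()) + list(noise_net.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```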
And 5, generating a three-dimensional object by using the parameter optimized conditional flow model and the diffusion model.
In an embodiment, after the parameter optimization is finished, the three-dimensional object is generated using the parameter-optimized conditional flow model and diffusion model, as shown in fig. 12, with the following steps:
step 5-1, generating semantic vectors according to text data by using a text encoder of the CLIP model;
step 5-2, generating a shape vector according to the semantic vector and the first random noise by using a parameter-optimized conditional flow model, and splicing the shape vector and a time step vector to serve as a guide condition;
step 5-3, taking the second random noise as an initial inverse diffusion vector, and generating a low-dimensional point cloud vector by inverse diffusion based on a guide condition and the initial inverse diffusion vector by utilizing a diffusion model with optimized parameters;
and 5-4, decoding the low-dimensional point cloud vector by using a point cloud decoder to obtain a high-dimensional point cloud, and generating a three-dimensional object according to the high-dimensional point cloud.
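The complete inference pipeline of steps 5-1 to 5-4 can be sketched end to end (assuming PyTorch; every name and dimension below is a placeholder standing in for the components described above, not the patent's exact interface):

```python
import torch

@torch.no_grad()
def generate(text, clip_text_encoder, flow, reverse_diffusion_fn, decoder,
             z_dim=128, latent_shape=(1, 256)):
    """Text -> semantic vector -> shape vector -> latent diffusion -> point cloud."""
    c = clip_text_encoder(text)                   # step 5-1: semantic vector
    g = torch.randn(1, z_dim)                     # first random noise
    z = flow.inverse(g, c)                        # step 5-2: shape vector z = f^-1(g)
    x = reverse_diffusion_fn(z, latent_shape)     # step 5-3: low-dim point cloud vector
    sphere = torch.randn(1, 2048, 3)
    sphere = sphere / sphere.norm(dim=2, keepdim=True)
    return decoder(x, sphere)                     # step 5-4: high-dimensional point cloud
```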
In the prior art, three-dimensional point cloud generation schemes based on diffusion models can only generate point clouds of a single class. The method provided by the embodiment introduces object categories as conditional semantic information and uses the CLIP model to match semantic information with the three-dimensional point cloud, realizing the generation of multi-category point cloud data with a single diffusion model.
The method provided by the embodiment introduces a point cloud encoder and decoder, so that the input of the diffusion model is no longer three-dimensional point cloud data but the hidden low-dimensional point cloud vector computed by the point cloud encoder, and the low-dimensional point cloud vector obtained by diffusion model inference is restored to a three-dimensional point cloud by the point cloud decoder. This greatly improves the training speed of the three-dimensional point cloud diffusion model and reduces the computational cost.
The method provided by the embodiment introduces a conditional flow model. In the training stage, the conditional flow model estimates noise according to the condition vector; in the inference stage, it generates, from the input noise, the condition vector used to guide the diffusion model. The condition vector at training time includes the point cloud shape vector extracted by the shape encoder and the semantic vector extracted by the CLIP model from the sample appearance picture.
For the noise prediction network in the diffusion model, the method provided by the embodiment introduces the PVConv (point-voxel convolution) operator and a multi-level structure, enabling the network to extract finer local detail features.
In summary, the method provided by the embodiment can generate three-dimensional objects with more detail and higher quality while greatly reducing training time and memory consumption.
The foregoing is merely a detailed description of preferred embodiments and advantages of the invention and is not intended to limit its scope; all changes, additions, substitutions and equivalents made to those embodiments within the spirit and principles of the invention are intended to be included within its scope.

Claims (9)

1. The three-dimensional object generation method based on the diffusion model and the semantic guidance is characterized by comprising the following steps of:
generating a semantic vector according to the text data by using a text encoder of the CLIP model;
generating a shape vector according to the semantic vector and the first random noise by using a conditional flow model, and splicing the shape vector and the time step vector as guiding conditions;
taking the second random noise as an initial inverse diffusion vector, and generating a low-dimensional point cloud vector by inverse diffusion based on a guide condition and the initial inverse diffusion vector by using a diffusion model;
and decoding the low-dimensional point cloud vector by using a point cloud decoder to obtain a high-dimensional point cloud, and generating a three-dimensional object according to the high-dimensional point cloud.
2. The three-dimensional object generation method based on diffusion model and semantic guidance according to claim 1, characterized in that the conditional flow model and diffusion model need to undergo parameter optimization before being applied, comprising the steps of:
constructing a sample: preparing a high-dimensional point cloud, and rendering the high-dimensional point cloud to obtain an object image;
building a training system: the training system comprises a shape encoder, a CLIP model, a point cloud encoder, a conditional flow model and a diffusion model, the high-dimensional point cloud is encoded by the shape encoder and the point cloud encoder to obtain a shape vector and a low-dimensional point cloud vector, an object image is encoded by the image encoder of the CLIP model to obtain a semantic vector, the shape vector and the semantic vector generate third random noise by the conditional flow model, the low-dimensional point cloud vector is used as an initial forward diffusion vector in a forward diffusion process of the diffusion model, forward diffusion is realized by adding actual noise in each diffusion step, the forward diffusion vector of each diffusion step is obtained, fourth random noise randomly sampled from Gaussian distribution is used as an initial reverse diffusion vector in a reverse diffusion process of the diffusion model, a splicing result of the shape vector and a time step vector is used as a guide condition, accumulated noise is calculated for each step of reverse diffusion based on the guide condition and the reverse diffusion vector of the previous diffusion step, and the reverse diffusion vector of each diffusion step is calculated according to the accumulated noise;
constructing a loss function: taking the difference between the third random noise and the preset noise label as a supervision loss function of the conditional flow model, and taking the difference between the actual noise added in the forward diffusion process of the same diffusion step and the accumulated noise calculated in the reverse diffusion process as a supervision loss function of the diffusion model;
and training a system: training the training system by adopting the loss function and the sample to optimize parameters of the conditional flow model and the diffusion model.
3. The three-dimensional object generation method based on diffusion model and semantic guidance according to claim 2, wherein the point cloud encoder and point cloud decoder need to undergo parameter optimization before being applied, comprising the steps of:
the true Gao Weidian cloud is encoded by a point cloud encoder to obtain a low-dimensional point cloud vector, the low-dimensional point cloud vector and the random sphere point cloud are decoded by a point cloud decoder to obtain a reconstructed high-dimensional point cloud, and parameters of the point cloud encoder and the point cloud decoder are optimized by calculating the difference between the true Gao Weidian cloud and the reconstructed Gao Weidian cloud.
4. The three-dimensional object generating method based on a diffusion model and semantic guidance according to claim 1 or 2, wherein in the back diffusion process of the diffusion model, accumulated noise is predicted by a noise prediction network based on a guiding condition and a back diffusion vector of a previous diffusion step, the noise prediction network comprises at least two noise prediction units, each noise prediction unit comprises a feature fusion module, a feature extraction module and a feature propagation module, the guiding condition and the back diffusion vector of the previous diffusion step are fused into a spliced feature by the feature fusion module, the spliced feature respectively extracts three modal features by a point cloud branch, a sampling branch and a voxel branch of the feature extraction module, and the feature calculated by the feature propagation module of the three modal features is used as the accumulated noise.
5. The three-dimensional object generating method based on the diffusion model and the semantic guidance according to claim 4, wherein the feature fusion module comprises three linear layers, the inverse diffusion vector of the previous diffusion step passes through the first linear layer and is then dot-multiplied with the result of the guidance condition passing through the second linear layer to obtain the fusion feature, and the result of the guidance condition passing through the second linear layer and the third linear layer in sequence is spliced with the fusion feature to obtain the splicing feature.
6. The three-dimensional object generating method based on the diffusion model and the semantic guidance according to claim 4, wherein in the feature extraction module, the stitching feature obtains a first modal feature through a first MLP in the point cloud branch, the stitching feature obtains a second modal feature through a voxelization operation, a second MLP and a devoxelization operation in the voxel branch, and the stitching feature is stitched with the second modal feature after being sampled in the sampling branch to obtain a third modal feature.
7. The three-dimensional object generating method based on the diffusion model and the semantic guidance according to claim 6, wherein the feature propagation module comprises an upsampling layer, a splicing layer, a third MLP and a PVConv layer, the second mode feature and the third mode feature are spliced with the second mode feature and the first mode feature in the splicing layer after being sampled by the upsampling layer, and the splicing result is sequentially calculated by the third MLP and the PVConv layer to obtain the feature as the accumulated noise.
8. The three-dimensional object generating method based on the diffusion model and the semantic guidance according to claim 7, wherein the PVConv layer comprises a point cloud branch and a voxel branch, the feature output by the third MLP obtains one feature by passing through the first MLP in the point cloud branch and obtains another feature by sequentially passing through the voxelization operation, the second MLP and the devoxelization operation in the voxel branch, and the features output by the two branches are spliced to obtain the accumulated noise.
9. The three-dimensional object generation method based on diffusion model and semantic guidance according to claim 1 or 2, wherein the time step vector is obtained from a parameter uniformly sampled in [0,1] via an embedding representation.
CN202310285348.9A 2023-03-22 2023-03-22 Three-dimensional object generation method based on diffusion model and semantic guidance Pending CN116721200A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310285348.9A CN116721200A (en) 2023-03-22 2023-03-22 Three-dimensional object generation method based on diffusion model and semantic guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310285348.9A CN116721200A (en) 2023-03-22 2023-03-22 Three-dimensional object generation method based on diffusion model and semantic guidance

Publications (1)

Publication Number Publication Date
CN116721200A true CN116721200A (en) 2023-09-08

Family

ID=87874025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310285348.9A Pending CN116721200A (en) 2023-03-22 2023-03-22 Three-dimensional object generation method based on diffusion model and semantic guidance

Country Status (1)

Country Link
CN (1) CN116721200A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910572A (en) * 2023-09-13 2023-10-20 浪潮(北京)电子信息产业有限公司 Training method and device for three-dimensional content generation model based on pre-training language model
CN116910572B (en) * 2023-09-13 2024-02-09 浪潮(北京)电子信息产业有限公司 Training method and device for three-dimensional content generation model based on pre-training language model
CN117953180A (en) * 2024-03-26 2024-04-30 厦门大学 Text-to-three-dimensional object generation method based on dual-mode latent variable diffusion
CN118404590A (en) * 2024-07-02 2024-07-30 海信集团控股股份有限公司 Robot action track planning method and device and robot

Similar Documents

Publication Publication Date Title
CN116721200A (en) Three-dimensional object generation method based on diffusion model and semantic guidance
CN113051420B (en) Robot vision man-machine interaction method and system based on text generation video
CN113140020B (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
CN110942512B (en) Indoor scene reconstruction method based on meta-learning
CN112232485B (en) Cartoon style image conversion model training method, image generation method and device
CN116721334B (en) Training method, device, equipment and storage medium of image generation model
Berrahal et al. Optimal text-to-image synthesis model for generating portrait images using generative adversarial network techniques
CN113140023B (en) Text-to-image generation method and system based on spatial attention
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN113627093A (en) Underwater mechanism cross-scale flow field characteristic prediction method based on improved Unet network
Ye et al. Audio-driven stylized gesture generation with flow-based model
CN117422823A (en) Three-dimensional point cloud characterization model construction method and device, electronic equipment and storage medium
CN116306793A (en) Self-supervision learning method with target task directivity based on comparison twin network
CN117456587A (en) Multi-mode information control-based speaker face video generation method and device
CN115995002B (en) Network construction method and urban scene real-time semantic segmentation method
CN116597154A (en) Training method and system for image denoising model
CN116524070A (en) Scene picture editing method and system based on text
Fu et al. Gendds: Generating diverse driving video scenarios with prompt-to-video generative model
CN115577111A (en) Text classification method based on self-attention mechanism
CN115239967A (en) Image generation method and device for generating countermeasure network based on Trans-CSN
CN116503517B (en) Method and system for generating image by long text
CN116805046B (en) Method for generating 3D human body action based on text label
CN117830324B (en) 3D medical image segmentation method based on multi-dimensional and global local combination
CN118470221B (en) Three-dimensional target reconstruction method based on non-calibrated single view
CN117853678B (en) Method for carrying out three-dimensional materialization transformation on geospatial data based on multi-source remote sensing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination