CN116721200A - Three-dimensional object generation method based on diffusion model and semantic guidance - Google Patents

Three-dimensional object generation method based on diffusion model and semantic guidance

Info

Publication number
CN116721200A
CN116721200A (application CN202310285348.9A)
Authority
CN
China
Prior art keywords
diffusion
vector
point cloud
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310285348.9A
Other languages
Chinese (zh)
Inventor
耿卫东
凌泽宇
付一童
厉向东
梁秀波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310285348.9A priority Critical patent/CN116721200A/en
Publication of CN116721200A publication Critical patent/CN116721200A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional object generation method based on a diffusion model and semantic guidance, comprising the following steps: generating a semantic vector from text data using the text encoder of a CLIP model; generating a shape vector from the semantic vector and a first random noise using a conditional flow model, and splicing the shape vector with a time step vector as the guiding condition; taking a second random noise as the initial inverse diffusion vector, and generating a low-dimensional point cloud vector by inverse diffusion with the diffusion model based on the guiding condition and the initial inverse diffusion vector; and decoding the low-dimensional point cloud vector with a point cloud decoder to obtain a high-dimensional point cloud, from which the three-dimensional object is generated. The method improves the efficiency and quality of three-dimensional object generation and enriches the diversity of the generated objects.

Description

Three-dimensional object generation method based on diffusion model and semantic guidance
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a three-dimensional point cloud generation method based on a diffusion model and semantic guidance.
Background
With the development of computer technology and graphics technology, three-dimensional objects have been widely used in many fields. Games, movies and animation construct virtual worlds with three-dimensional models as their basic resources; industry and construction use three-dimensional objects to accelerate or optimize product design; medicine and biochemistry use three-dimensional objects to represent accurate models; and social and media applications use three-dimensional objects to build richer entertainment products and information dissemination tools.
Currently, three-dimensional object generation mainly depends on modelers or designers with professional skills manually creating objects in professional modeling tools such as 3ds Max and Maya.
With the rapid development of deep learning techniques, a desired three-dimensional object can be obtained with a deep learning model. Deep-learning-based three-dimensional object generation methods can be roughly divided into two classes. One is three-dimensional reconstruction for a specific object: information about a single object or scene (such as an RGB-D image) is taken as input and the corresponding three-dimensional representation is output, a one-to-one mapping. The other class is based on generative models, which learn a data distribution from a large number of three-dimensional object samples and then generate new sample instances; this approach is mostly used to generate large numbers of diverse three-dimensional objects.
The flow model is a generative model. Generation networks based on flow models are widely applied in fields such as image generation and audio/video synthesis. Other generative models, such as variational auto-encoders and generative adversarial networks, use parameterized models to learn an implicit spatial distribution of the data, so an exact likelihood function cannot be computed to optimize training. A flow model instead learns the conversion between two distributions through a series of reversible transformation functions: it starts from a standard normal distribution and transforms it, via those reversible functions, into the probability distribution of the target data, which yields good generation results.
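For illustration only, the following sketch (assuming a PyTorch implementation; all identifiers are hypothetical and not part of the disclosure) shows the core idea of a flow model: a single invertible affine transform with an exact log-determinant term, trained by maximum likelihood via the change-of-variables formula.

```python
import torch
import torch.nn as nn

class AffineFlowStep(nn.Module):
    """One invertible transform y = x * exp(s) + t with exact log-det."""
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))  # log-scale
        self.t = nn.Parameter(torch.zeros(dim))  # shift

    def forward(self, x):
        y = x * torch.exp(self.s) + self.t
        log_det = self.s.sum()                   # log|det dy/dx|
        return y, log_det

    def inverse(self, y):
        return (y - self.t) * torch.exp(-self.s)

# Maximum-likelihood training: map data back toward N(0, I) and apply
# log p(x) = log N(f(x); 0, I) + log|det df/dx|.
flow = AffineFlowStep(dim=128)
x = torch.randn(16, 128)                         # stand-in for real samples
z, log_det = flow(x)
nll = 0.5 * (z ** 2).sum(dim=1).mean() - log_det # negative log-likelihood, up to a constant
```

A practical flow stacks many such reversible steps; the conditional flow model used by the invention is described in the detailed description below.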
The diffusion model is a generative model widely used for various generative tasks. Its core idea is that Gaussian noise is continuously added to a real sample in the training stage, and the noise is predicted by a noise prediction network; in the inference stage, a random noise is fed to the noise prediction network, which restores it to a sample.
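As a minimal, generic sketch of this training-time idea (assuming PyTorch; the schedule and network names are placeholders, and the invention's own noise schedule and network appear in the detailed description):

```python
import torch

def diffusion_training_step(x0, noise_net, T=1000):
    """One DDPM-style step: noise a real sample, ask the network to
    predict the added noise, and penalize the prediction error.
    noise_net is any module taking (noisy sample, timestep)."""
    betas = torch.linspace(1e-4, 0.02, T)               # illustrative schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (x0.shape[0],))             # random step per sample
    eps = torch.randn_like(x0)                          # Gaussian noise
    a = alpha_bar[t].view(-1, *[1] * (x0.dim() - 1))
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * eps         # closed-form forward diffusion
    return ((noise_net(xt, t) - eps) ** 2).mean()       # noise-prediction loss
```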
At present, some works apply diffusion models to the three-dimensional object generation task; compared with non-diffusion methods, the diffusion model can generate three-dimensional objects of higher quality, but these methods still suffer from slow training, slow inference, and the ability to generate only a single class of objects.
Disclosure of Invention
In view of the above, the present invention aims to provide a three-dimensional object generating method based on diffusion model and semantic guidance, which improves the efficiency and quality of three-dimensional object generation and enriches the diversity of generated objects.
To achieve the above object, an embodiment provides a three-dimensional object generating method based on a diffusion model and semantic guidance, including the steps of:
generating a semantic vector according to the text data by using a text encoder of the CLIP model;
generating a shape vector according to the semantic vector and the first random noise by using a conditional flow model, and splicing the shape vector and the time step vector as guiding conditions;
taking the second random noise as an initial inverse diffusion vector, and generating a low-dimensional point cloud vector by inverse diffusion based on a guide condition and the initial inverse diffusion vector by using a diffusion model;
and decoding the low-dimensional point cloud vector by using a point cloud decoder to obtain a high-dimensional point cloud, and generating a three-dimensional object according to the high-dimensional point cloud.
In one embodiment, the conditional flow model and diffusion model need to undergo parameter optimization before being applied, comprising the steps of:
constructing a sample: preparing a high-dimensional point cloud, and rendering the high-dimensional point cloud to obtain an object image;
building a training system: the training system comprises a shape encoder, a CLIP model, a point cloud encoder, a conditional flow model and a diffusion model. The high-dimensional point cloud is encoded by the shape encoder and the point cloud encoder to obtain a shape vector and a low-dimensional point cloud vector, and the object image is encoded by the image encoder of the CLIP model to obtain a semantic vector; the shape vector and the semantic vector generate a third random noise through the conditional flow model. In the forward diffusion process of the diffusion model, the low-dimensional point cloud vector is used as the initial forward diffusion vector, and forward diffusion is realized by adding actual noise at each diffusion step, giving the forward diffusion vector of each step. In the reverse diffusion process of the diffusion model, a fourth random noise randomly sampled from a Gaussian distribution is used as the initial reverse diffusion vector, the splicing result of the shape vector and the time step vector is used as the guiding condition, the accumulated noise is calculated for each step of reverse diffusion based on the guiding condition and the reverse diffusion vector of the previous diffusion step, and the reverse diffusion vector of each diffusion step is calculated according to the accumulated noise;
constructing a loss function: taking the difference between the third random noise and the preset noise label as a supervision loss function of the conditional flow model, and taking the difference between the actual noise added in the forward diffusion process of the same diffusion step and the accumulated noise calculated in the reverse diffusion process as a supervision loss function of the diffusion model;
training the system: training the training system with the loss function and the samples to optimize the parameters of the conditional flow model and the diffusion model.
In one embodiment, the point cloud encoder and the point cloud decoder need to undergo parameter optimization before being applied, comprising the steps of:
the true Gao Weidian cloud is encoded by a point cloud encoder to obtain a low-dimensional point cloud vector, the low-dimensional point cloud vector and the random sphere point cloud are decoded by a point cloud decoder to obtain a reconstructed high-dimensional point cloud, and parameters of the point cloud encoder and the point cloud decoder are optimized by calculating the difference between the true Gao Weidian cloud and the reconstructed Gao Weidian cloud.
In one embodiment, in the back diffusion process of the diffusion model, the noise prediction network predicts accumulated noise based on the guiding condition and the back diffusion vector of the previous diffusion step. The noise prediction network comprises at least two noise prediction units, each comprising a feature fusion module, a feature extraction module and a feature propagation module: the guiding condition and the back diffusion vector of the previous diffusion step are fused into a stitching feature by the feature fusion module, the stitching feature extracts three modal features through the point cloud branch, sampling branch and voxel branch of the feature extraction module respectively, and the feature computed from the three modal features by the feature propagation module serves as the accumulated noise.
In one embodiment, the feature fusion module includes three linear layers; the fusion feature is obtained by dot-multiplying the back diffusion vector of the previous diffusion step after passing through the first linear layer with the result of the guiding condition after passing through the second linear layer, and the stitching feature is obtained by splicing the result of the guiding condition after passing through the second and third linear layers with the fusion feature.
In one embodiment, in the feature extraction module, the stitching feature obtains a first modal feature through a first MLP in the point cloud branch, obtains a second modal feature through a voxelization operation, a second MLP and a devoxelization operation in the voxel branch, and, after being sampled in the sampling branch, is stitched with the second modal feature to obtain a third modal feature.
In one embodiment, the feature propagation module includes an upsampling layer, a splicing layer, a third MLP and a PVConv layer; the second modal feature and the third modal feature, after passing through the upsampling layer, are spliced in the splicing layer with the second modal feature and the first modal feature, and the splicing result is computed sequentially by the third MLP and the PVConv layer to obtain the feature serving as the accumulated noise.
In one embodiment, the PVConv layer includes a point cloud branch and a voxel branch; the feature output by the third MLP obtains one feature by passing through the first MLP in the point cloud branch and obtains another feature by sequentially passing through the voxelization operation, the second MLP and the devoxelization operation in the voxel branch, and the features output by the two branches are spliced to obtain the accumulated noise.
In one embodiment, the time step vector is obtained from a parameter uniformly sampled in [0,1] via an embedding representation.
Compared with the prior art, the invention has the beneficial effects that at least the following steps are included:
the text encoder of the CLIP model is utilized to generate semantic vectors corresponding to the text data on the basis of the text data, the conditional flow model is utilized to generate shape vectors based on the semantic vectors to construct guide conditions, then the inverse diffusion of the diffusion model is utilized to generate low-dimensional point cloud vectors based on the guide conditions, and then the high-dimensional point cloud used for generating the three-dimensional object is constructed based on low-dimensional point cloud vector decoding.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a three-dimensional object generation method based on diffusion model and semantic guidance provided by an embodiment;
FIG. 2 is a schematic diagram of a training system provided by an embodiment;
FIG. 3 is a schematic diagram of a shape encoder provided by an embodiment;
FIG. 4 is a schematic diagram of a conditional flow model structure and a forward calculation process according to an embodiment;
FIG. 5 is a schematic diagram of a point cloud encoder according to an embodiment;
fig. 6 is a schematic structural diagram of a point cloud decoder according to an embodiment;
FIG. 7 is a schematic diagram of a noise prediction network provided by an embodiment;
FIG. 8 is a schematic diagram of a feature fusion module according to an embodiment;
FIG. 9 is a schematic diagram of a feature extraction module according to an embodiment;
FIG. 10 is a schematic diagram of a feature propagation module provided by an embodiment;
fig. 11 is a schematic structural diagram of the PVConv layer provided in the embodiment;
FIG. 12 is a flow chart of generating a three-dimensional object using a conditional flow model and a diffusion model provided by an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
An embodiment provides a three-dimensional object generation method based on a diffusion model and semantic guidance, as shown in fig. 1, comprising the following steps:
and step 1, constructing a training sample.
In an embodiment, a high-dimensional point cloud representing shape and appearance is prepared, and the high-dimensional point cloud is rendered to obtain an object image; the high-dimensional point cloud and the corresponding object image serve as a training sample.
And 2, constructing a training system comprising a shape encoder, a CLIP model, a point cloud encoder, a conditional flow model and a diffusion model.
In an embodiment, as shown in fig. 2, the training system includes a shape encoder, a CLIP model, a point cloud encoder, a conditional flow model, and a diffusion model. The shape encoder is configured to encode the high-dimensional point cloud into a shape vector and uses a multi-layer perceptron structure, as shown in fig. 3: the input high-dimensional point cloud is encoded into features μ and σ, from which the shape vector z is computed by the formula z = μ + ε·exp(0.5·log(σ²)), where ε represents a randomly generated offset value. In fig. 3, Conv1d represents a one-dimensional convolution layer, BatchNorm1d represents one-dimensional batch normalization, ReLU represents the ReLU activation function, MaxPooling represents a max pooling operation, and Linear represents a linear layer.
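A minimal sketch of such a shape encoder follows (assuming PyTorch; the layer sizes are illustrative placeholders, not taken from the patent figures):

```python
import torch
import torch.nn as nn

class ShapeEncoder(nn.Module):
    """Sketch of the MLP-style shape encoder: Conv1d/BatchNorm1d/ReLU stacks,
    max pooling over points, then linear heads for mu and sigma."""
    def __init__(self, z_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(3, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
        )
        self.mu = nn.Linear(256, z_dim)
        self.log_var = nn.Linear(256, z_dim)        # interpreted as log(sigma^2)

    def forward(self, points):                      # points: (B, 3, N)
        h = self.net(points).max(dim=2).values      # global max pooling
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                  # randomly generated offset value
        return mu + eps * torch.exp(0.5 * log_var)  # z = mu + eps * exp(0.5 * log(sigma^2))
```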
The CLIP model is trained on hundreds of millions of text-image pairs by contrastive learning and can map pictures and the words describing the same things to similar feature vectors. It comprises an image encoder and a text encoder, as shown in figs. 2 and 12: in the training stage, the semantic vector c of the object image is extracted by the image encoder of the CLIP model; in the inference stage, the semantic vector c of the text data is extracted by the text encoder of the CLIP model. The semantic vector serves as the semantic condition vector guiding the diffusion model.
The conditional flow model is used to learn a reversible nonlinear transformation f. In the training stage, the third random Gaussian noise g can be estimated from the semantic vector c, serving as the condition, and the shape vector z with channel number D; in the inference stage, a shape vector z is generated from the first random Gaussian noise and the semantic vector c serving as the condition. A conditional flow model and its forward calculation process are shown in fig. 4, where z_{:d} denotes the first d dimensions of the shape vector z, z_{d+1:} denotes the D−d dimensions of z starting from dimension d+1, and scale and shift are each 3-layer MLPs that project the concatenation of z_{:d} and c onto the remaining D−d dimensions. The forward computation f is reversible, so in the inference stage the inverse transformation f⁻¹ can be applied to the input first random Gaussian noise g to yield a shape vector z, i.e., z = f⁻¹(g).
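As an illustrative sketch of one such conditional coupling step (assuming PyTorch; dimensions, widths and names are hypothetical, and a full model would stack several steps with permutations):

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, d_out))           # a 3-layer MLP

class ConditionalCoupling(nn.Module):
    """The first d dims pass through unchanged and, together with the
    semantic vector c, parameterize an affine map of the other D-d dims."""
    def __init__(self, D, d, c_dim):
        super().__init__()
        self.d = d
        self.scale = mlp(d + c_dim, D - d)
        self.shift = mlp(d + c_dim, D - d)

    def forward(self, z, c):                   # training direction: z -> g
        z1, z2 = z[:, :self.d], z[:, self.d:]
        h = torch.cat([z1, c], dim=1)
        g2 = z2 * torch.exp(self.scale(h)) + self.shift(h)
        return torch.cat([z1, g2], dim=1)

    def inverse(self, g, c):                   # inference direction: g -> z
        g1, g2 = g[:, :self.d], g[:, self.d:]
        h = torch.cat([g1, c], dim=1)
        z2 = (g2 - self.shift(h)) * torch.exp(-self.scale(h))
        return torch.cat([g1, z2], dim=1)
```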
Diffusion model inference is very slow, and this is especially pronounced for three-dimensional point cloud data, particularly dense point clouds. The embodiment therefore uses a diffusion model based on a hidden (latent) space: a point cloud encoder is introduced to obtain a low-dimensional point cloud vector of the hidden space as the input of the diffusion model, and a point cloud decoder is introduced to decode the low-dimensional point cloud vector output by the diffusion model into a high-dimensional point cloud.
The point cloud encoder is used to encode the high-dimensional point cloud into a low-dimensional point cloud vector. The point cloud decoder in fig. 12 is used to decode a low-dimensional point cloud vector (either the output of the point cloud encoder, or the diffusion vector generated by the final step of the reverse diffusion process of the diffusion model) together with a sphere point cloud into a high-dimensional point cloud, i.e., to gradually unfold the sphere point cloud, according to the low-dimensional point cloud vector, into the shape it describes. Figs. 5 and 6 show exemplary structures of the point cloud encoder and point cloud decoder, in which Cov denotes computing the covariance matrix of the high-dimensional point cloud, KNN denotes the K nearest neighbors of each point in the high-dimensional point cloud, Graph Layer denotes a module that processes data information based on the graph structure (comprising a series of max pooling layers, fully connected layers, activation functions and convolution layers), and GlobalMaxPooling denotes a global max pooling operation. In the point cloud decoder shown in fig. 6, Concat denotes a splicing operation and Folding denotes a folding operation on point cloud data, consisting of a series of convolution layers and activation functions.
The point cloud encoder and the point cloud decoder need to undergo parameter optimization before being applied, comprising the following steps: the real high-dimensional point cloud is encoded by the point cloud encoder to obtain a low-dimensional point cloud vector, the low-dimensional point cloud vector and a random sphere point cloud are decoded by the point cloud decoder to obtain a reconstructed high-dimensional point cloud, and the parameters of the point cloud encoder and the point cloud decoder are optimized by calculating the difference between the real and reconstructed high-dimensional point clouds. The difference may use a mean square error.
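A hedged sketch of this pre-training loop follows (assuming PyTorch; encoder, decoder and loader are placeholders, and while the patent names mean squared error, a set-aware loss such as Chamfer distance is a common alternative in practice):

```python
import torch

def train_autoencoder(encoder, decoder, loader, epochs=10, lr=1e-3):
    """Encode the real high-dimensional point cloud, decode it together
    with a random sphere point cloud, and minimize reconstruction error."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for real_pc in loader:                               # (B, N, 3)
            sphere = torch.randn(real_pc.shape[0], real_pc.shape[1], 3)
            sphere = sphere / sphere.norm(dim=2, keepdim=True)  # unit sphere points
            latent = encoder(real_pc)                        # low-dimensional vector
            recon = decoder(latent, sphere)                  # folded reconstruction
            loss = ((recon - real_pc) ** 2).mean()           # MSE, per the patent
            opt.zero_grad()
            loss.backward()
            opt.step()
```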
The hidden-space diffusion model likewise comprises a forward diffusion process and a reverse diffusion process. In the forward diffusion process, the low-dimensional point cloud vector output by the point cloud encoder is used as the initial forward diffusion vector, and forward diffusion is realized by adding actual noise at each diffusion step, giving the forward diffusion vector of each step by the formula:

X^(t) = √(1−β^(t)) · X^(t−1) + √(β^(t)) · N^(t−1)

where X^(t) and X^(t−1) respectively denote the forward diffusion vectors at diffusion steps t and t−1, β^(t) is a parameter uniformly sampled in [0,1] according to the total number of steps T, and N^(t−1) is the actual noise added at step t−1, for which Gaussian noise may be used.
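This step can be sketched directly (assuming PyTorch; the schedule for β^(t) is supplied by the caller):

```python
import torch

def forward_diffusion_step(x_prev, beta_t):
    """One forward step: X^(t) = sqrt(1 - beta_t) * X^(t-1) + sqrt(beta_t) * N^(t-1).
    Returns the diffused vector and the actual noise, kept as the training label."""
    noise = torch.randn_like(x_prev)                  # actual Gaussian noise N^(t-1)
    x_t = (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise
    return x_t, noise
```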
In the back diffusion process, the fourth random noise randomly sampled from a Gaussian distribution is taken as the initial back diffusion vector X′^(T) of the back diffusion process, and the splicing result of the shape vector and the time step vector is taken as the guiding condition. For each step of back diffusion, the accumulated noise is calculated based on the guiding condition and the back diffusion vector of the previous diffusion step, and the back diffusion vector of each diffusion step is calculated according to the accumulated noise:

X′^(t−1) = (X′^(t) − ((1−α^(t)) / √(1−ᾱ^(t))) · N′^(t−1)) / √(α^(t)) + √(β^(t)) · ε, with α^(t) = 1−β^(t) and ᾱ^(t) = α^(1)·α^(2)·…·α^(t)

where X′^(t−1) and X′^(t) denote the back diffusion vectors corresponding to diffusion steps t−1 and t, ε is fresh Gaussian noise, and N′^(t−1) is the accumulated noise predicted by the noise prediction network net( ) from the guiding condition (z, s) and the back diffusion vector X′^(t) of the previous diffusion step, formulated as:

N′^(t−1) = net(X′^(t), z, s)
s = embedding(β^(t))

where s denotes the time step vector, obtained from the parameter β^(t) via the embedding representation embedding( ).
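A sketch of the resulting sampling loop follows (assuming PyTorch; the update below is the standard DDPM posterior step, while the patent itself only fixes the interface N′ = net(X′, z, s) with s = embedding(β), so the exact coefficients here are an assumption):

```python
import torch

@torch.no_grad()
def reverse_diffusion(net, embedding, z, shape, betas):
    """Start from Gaussian noise X'^(T) and repeatedly remove the
    accumulated noise predicted by net(X', z, s)."""
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                        # fourth random noise, X'^(T)
    for t in reversed(range(T)):
        s = embedding(betas[t])                   # time step vector
        n_pred = net(x, z, s)                     # accumulated noise N'^(t-1)
        coef = (1.0 - alphas[t]) / (1.0 - alpha_bar[t]).sqrt()
        x = (x - coef * n_pred) / alphas[t].sqrt()
        if t > 0:                                 # no fresh noise on the last step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                      # low-dimensional point cloud vector
```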
The noise prediction network provided by the embodiment is shown in fig. 7 and can improve local detail. It comprises at least two noise prediction units, each comprising a feature fusion module, a feature extraction module and a feature propagation module: the guiding condition (z, s) and the back diffusion vector X′^(t) of the previous diffusion step are fused into a stitching feature by the feature fusion module, the stitching feature extracts three modal features through the point cloud branch, sampling branch and voxel branch of the feature extraction module respectively, and the feature computed from the three modal features by the feature propagation module serves as the accumulated noise N′^(t−1).
As shown in fig. 8, the feature fusion module provided by the embodiment comprises three linear layers: the back diffusion vector X′^(t) of the previous diffusion step is passed through the first linear layer and dot-multiplied with the result of the guiding condition (z, s) passed through the second linear layer to obtain a fusion feature, and the result of the guiding condition passed sequentially through the second and third linear layers is spliced with the fusion feature to obtain the stitching feature M.
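A minimal sketch of this module (assuming PyTorch; all widths are hypothetical placeholders):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Three linear layers: fuse the previous back diffusion vector with
    the guiding condition and splice in the transformed condition."""
    def __init__(self, x_dim, cond_dim, hidden):
        super().__init__()
        self.lin1 = nn.Linear(x_dim, hidden)      # on the back diffusion vector
        self.lin2 = nn.Linear(cond_dim, hidden)   # on the guiding condition
        self.lin3 = nn.Linear(hidden, hidden)

    def forward(self, x_prev, cond):
        a = self.lin1(x_prev)
        b = self.lin2(cond)
        fused = a * b                             # element-wise (dot) multiplication
        c = self.lin3(b)                          # condition through lin2 then lin3
        return torch.cat([c, fused], dim=-1)      # stitching feature M
```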
As shown in fig. 9, the feature extraction module provided in the embodiment includes a point cloud branch (Point-branch), a voxel branch (Voxel-branch) and a sampling branch (Sampling). The stitching feature M output by the feature fusion module passes through a first MLP in the point cloud branch to obtain the first modal feature P, passes through a voxelization operation, a second MLP and a devoxelization operation in the voxel branch to obtain the second modal feature Y, and, after being sampled in the sampling branch, is spliced with the second modal feature to obtain the third modal feature C. Note that the point cloud branch and voxel branch structures in the feature extraction module are the same as those in the PVConv layer shown in fig. 11.
As shown in fig. 10, the feature propagation module provided in the embodiment includes an upsampling layer (UpSampling), a splicing layer (Concat), a third MLP and a PVConv layer. The second modal feature Y and the third modal feature C, after passing through the upsampling layer, are spliced in the splicing layer with the second modal feature Y and the first modal feature P, and the splicing result is computed sequentially by the third MLP and the PVConv layer to obtain the feature serving as the accumulated noise N′^(t−1).
As shown in fig. 11, the PVConv layer provided by the embodiment comprises a point cloud branch (Point-branch) and a voxel branch (Voxel-branch). The feature Q output by the third MLP passes through the first MLP in the point cloud branch to obtain one feature, and sequentially passes through a voxelization operation (Voxelization), a second MLP and a devoxelization operation (trilinear devoxelization) in the voxel branch to obtain another feature; the features output by the two branches are spliced to obtain the accumulated noise N′^(t−1). Here GroupNorm denotes the group normalization function and Swish denotes the Swish activation function.
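For illustration, a simplified sketch of the two-branch PVConv idea follows (assuming PyTorch; the grid resolution, channel widths and axis handling are placeholders chosen here, not the patent's exact implementation, and c_out should be divisible by the GroupNorm group count):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PVConvSketch(nn.Module):
    """Per-point MLP plus a voxel branch: voxelize -> 3D conv with
    GroupNorm/Swish -> trilinear devoxelization, then splice branches."""
    def __init__(self, c_in, c_out, r=16):
        super().__init__()
        self.r = r
        self.point_mlp = nn.Sequential(nn.Linear(c_in, c_out), nn.SiLU())  # SiLU == Swish
        self.voxel_conv = nn.Sequential(
            nn.Conv3d(c_in, c_out, 3, padding=1),
            nn.GroupNorm(8, c_out), nn.SiLU())

    def forward(self, feats, coords):             # feats: (B,N,C), coords in [-1,1]
        B, N, C = feats.shape
        r = self.r
        # voxelize: average point features into an r^3 grid
        idx = ((coords + 1) / 2 * (r - 1)).round().long().clamp(0, r - 1)
        flat = idx[..., 0] * r * r + idx[..., 1] * r + idx[..., 2]        # (B,N)
        grid = feats.new_zeros(B, r * r * r, C)
        cnt = feats.new_zeros(B, r * r * r, 1)
        grid.scatter_add_(1, flat.unsqueeze(-1).expand(-1, -1, C), feats)
        cnt.scatter_add_(1, flat.unsqueeze(-1), torch.ones_like(feats[..., :1]))
        grid = (grid / cnt.clamp(min=1)).view(B, r, r, r, C).permute(0, 4, 1, 2, 3)
        v = self.voxel_conv(grid)                 # (B, c_out, r, r, r)
        # devoxelize: trilinear sampling back at the point coordinates
        samp = coords.flip(-1).view(B, N, 1, 1, 3)  # grid_sample wants xyz = (W,H,D)
        pv = F.grid_sample(v, samp, mode='bilinear', align_corners=True)
        pv = pv.view(B, -1, N).transpose(1, 2)    # (B, N, c_out)
        return torch.cat([self.point_mlp(feats), pv], dim=-1)  # splice both branches
```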
And 3, constructing a loss function.
In an embodiment, the difference between the third random noise and the preset noise label is used as the supervised loss function of the conditional flow model, and the difference between the actual noise added in the forward diffusion process and the accumulated noise calculated in the reverse diffusion process at the same diffusion step is used as the supervised loss function of the diffusion model. The difference between noise quantities may be measured by a mean square error.
And 4, carrying out parameter optimization on the training system by using the loss function and the training sample.
In the embodiment, when the parameters of the training system are optimized with the loss function and the training samples, the parameters of the point cloud encoder, the point cloud decoder, the shape encoder and the CLIP model are kept fixed, and minimizing the loss function is taken as the optimization target for the parameters of the conditional flow model and the diffusion model. The embodiment trains the diffusion model on the encoded, implicit low-dimensional point cloud vectors, thereby accelerating training, reducing network parameters and saving video memory overhead.
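This freezing arrangement can be sketched as follows (assuming PyTorch; all module names are hypothetical handles to the components described above):

```python
import torch

def freeze_and_build_optimizer(point_cloud_encoder, point_cloud_decoder,
                               shape_encoder, clip_model,
                               flow_model, noise_net, lr=1e-4):
    """Freeze the pre-trained modules so that only the conditional flow
    model and the noise prediction network receive gradient updates."""
    for module in (point_cloud_encoder, point_cloud_decoder,
                   shape_encoder, clip_model):
        for p in module.parameters():
            p.requires_grad = False
    trainable = list(flow_model.parameters()) + list(noise_net.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```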
And 5, generating a three-dimensional object by using the parameter optimized conditional flow model and the diffusion model.
In an embodiment, after the parameter optimization is finished, the three-dimensional object is generated using the parameter-optimized conditional flow model and diffusion model, as shown in fig. 12, with the following steps:
step 5-1, generating semantic vectors according to text data by using a text encoder of the CLIP model;
step 5-2, generating a shape vector according to the semantic vector and the first random noise by using a parameter-optimized conditional flow model, and splicing the shape vector and a time step vector to serve as a guide condition;
step 5-3, taking the second random noise as an initial inverse diffusion vector, and generating a low-dimensional point cloud vector by inverse diffusion based on a guide condition and the initial inverse diffusion vector by utilizing a diffusion model with optimized parameters;
and 5-4, decoding the low-dimensional point cloud vector by using a point cloud decoder to obtain a high-dimensional point cloud, and generating a three-dimensional object according to the high-dimensional point cloud.
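The complete inference pipeline of steps 5-1 to 5-4 can be sketched end to end (assuming PyTorch; every name and dimension below is a placeholder standing in for the components described above, not the patent's exact interface):

```python
import torch

@torch.no_grad()
def generate(text, clip_text_encoder, flow, reverse_diffusion_fn, decoder,
             z_dim=128, latent_shape=(1, 256)):
    """Text -> semantic vector -> shape vector -> latent diffusion -> point cloud."""
    c = clip_text_encoder(text)                   # step 5-1: semantic vector
    g = torch.randn(1, z_dim)                     # first random noise
    z = flow.inverse(g, c)                        # step 5-2: shape vector z = f^-1(g)
    x = reverse_diffusion_fn(z, latent_shape)     # step 5-3: low-dim point cloud vector
    sphere = torch.randn(1, 2048, 3)
    sphere = sphere / sphere.norm(dim=2, keepdim=True)
    return decoder(x, sphere)                     # step 5-4: high-dimensional point cloud
```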
In the prior art, three-dimensional point cloud generation schemes based on diffusion models can only generate point clouds of a single class. The method provided by the embodiment introduces object categories as conditional semantic information and uses the CLIP model to match semantic information with the three-dimensional point cloud, realizing the generation of multi-category point cloud data with a single diffusion model.
The method provided by the embodiment introduces a point cloud encoder and decoder, so that the input of the diffusion model is no longer three-dimensional point cloud data but the hidden low-dimensional point cloud vector computed by the point cloud encoder, and the low-dimensional point cloud vector obtained by diffusion model inference is restored to a three-dimensional point cloud by the point cloud decoder. This greatly improves the training speed of the three-dimensional point cloud diffusion model and reduces the computational cost.
The method provided by the embodiment introduces a conditional flow model. In the training stage, the conditional flow model estimates noise according to the condition vector; in the inference stage, it generates, from the input noise, the condition vector used to guide the diffusion model. The condition vector at training time includes the point cloud shape vector extracted by the shape encoder and the semantic vector extracted by the CLIP model from the sample appearance picture.
For the noise prediction network in the diffusion model, the method provided by the embodiment introduces the PVConv (point-voxel convolution) operator and a multi-level structure, enabling the network to extract finer local detail features.
In summary, the method provided by the embodiment can generate three-dimensional objects with more detail and higher quality while greatly reducing training time and memory consumption.
The foregoing is merely a detailed description of preferred embodiments and advantages of the invention and is not intended to limit its scope; all changes, additions, substitutions and equivalents made to those embodiments within the spirit and principles of the invention are intended to be included within its scope.

Claims (9)

1. The three-dimensional object generation method based on the diffusion model and the semantic guidance is characterized by comprising the following steps of:
generating a semantic vector according to the text data by using a text encoder of the CLIP model;
generating a shape vector according to the semantic vector and the first random noise by using a conditional flow model, and splicing the shape vector and the time step vector as guiding conditions;
taking the second random noise as an initial inverse diffusion vector, and generating a low-dimensional point cloud vector by inverse diffusion based on a guide condition and the initial inverse diffusion vector by using a diffusion model;
and decoding the low-dimensional point cloud vector by using a point cloud decoder to obtain a high-dimensional point cloud, and generating a three-dimensional object according to the high-dimensional point cloud.
2. The three-dimensional object generation method based on diffusion model and semantic guidance according to claim 1, characterized in that the conditional flow model and diffusion model need to undergo parameter optimization before being applied, comprising the steps of:
constructing a sample: preparing a high-dimensional point cloud, and rendering the high-dimensional point cloud to obtain an object image;
building a training system: the training system comprises a shape encoder, a CLIP model, a point cloud encoder, a conditional flow model and a diffusion model, the high-dimensional point cloud is encoded by the shape encoder and the point cloud encoder to obtain a shape vector and a low-dimensional point cloud vector, an object image is encoded by the image encoder of the CLIP model to obtain a semantic vector, the shape vector and the semantic vector generate third random noise by the conditional flow model, the low-dimensional point cloud vector is used as an initial forward diffusion vector in a forward diffusion process of the diffusion model, forward diffusion is realized by adding actual noise in each diffusion step, the forward diffusion vector of each diffusion step is obtained, fourth random noise randomly sampled from Gaussian distribution is used as an initial reverse diffusion vector in a reverse diffusion process of the diffusion model, a splicing result of the shape vector and a time step vector is used as a guide condition, accumulated noise is calculated for each step of reverse diffusion based on the guide condition and the reverse diffusion vector of the previous diffusion step, and the reverse diffusion vector of each diffusion step is calculated according to the accumulated noise;
constructing a loss function: taking the difference between the third random noise and the preset noise label as a supervision loss function of the conditional flow model, and taking the difference between the actual noise added in the forward diffusion process of the same diffusion step and the accumulated noise calculated in the reverse diffusion process as a supervision loss function of the diffusion model;
and training a system: training the training system by adopting the loss function and the sample to optimize parameters of the conditional flow model and the diffusion model.
3. The three-dimensional object generation method based on diffusion model and semantic guidance according to claim 2, wherein the point cloud encoder and point cloud decoder need to undergo parameter optimization before being applied, comprising the steps of:
the true Gao Weidian cloud is encoded by a point cloud encoder to obtain a low-dimensional point cloud vector, the low-dimensional point cloud vector and the random sphere point cloud are decoded by a point cloud decoder to obtain a reconstructed high-dimensional point cloud, and parameters of the point cloud encoder and the point cloud decoder are optimized by calculating the difference between the true Gao Weidian cloud and the reconstructed Gao Weidian cloud.
4. The three-dimensional object generating method based on a diffusion model and semantic guidance according to claim 1 or 2, wherein in the back diffusion process of the diffusion model, accumulated noise is predicted by a noise prediction network based on a guiding condition and a back diffusion vector of a previous diffusion step, the noise prediction network comprises at least two noise prediction units, each noise prediction unit comprises a feature fusion module, a feature extraction module and a feature propagation module, the guiding condition and the back diffusion vector of the previous diffusion step are fused into a spliced feature by the feature fusion module, the spliced feature respectively extracts three modal features by a point cloud branch, a sampling branch and a voxel branch of the feature extraction module, and the feature calculated by the feature propagation module of the three modal features is used as the accumulated noise.
5. The three-dimensional object generating method based on the diffusion model and the semantic guidance according to claim 4, wherein the feature fusion module comprises three linear layers, the inverse diffusion vector of the previous diffusion step passes through the first linear layer and is then dot-multiplied with the result of the guidance condition passing through the second linear layer to obtain the fusion feature, and the result of the guidance condition passing through the second linear layer and the third linear layer in sequence is spliced with the fusion feature to obtain the splicing feature.
6. The three-dimensional object generating method based on the diffusion model and the semantic guidance according to claim 4, wherein in the feature extraction module, the stitching feature obtains a first modal feature through a first MLP in the point cloud branch, the stitching feature obtains a second modal feature through a voxelization operation, a second MLP and a devoxelization operation in the voxel branch, and the stitching feature is stitched with the second modal feature after being sampled in the sampling branch to obtain a third modal feature.
7. The three-dimensional object generating method based on the diffusion model and the semantic guidance according to claim 6, wherein the feature propagation module comprises an upsampling layer, a splicing layer, a third MLP and a PVConv layer, the second mode feature and the third mode feature are spliced with the second mode feature and the first mode feature in the splicing layer after being sampled by the upsampling layer, and the splicing result is sequentially calculated by the third MLP and the PVConv layer to obtain the feature as the accumulated noise.
8. The three-dimensional object generating method based on the diffusion model and the semantic guidance according to claim 7, wherein the PVConv layer comprises a point cloud branch and a voxel branch, the feature output by the third MLP obtains one feature by passing through the first MLP in the point cloud branch and obtains another feature by sequentially passing through the voxelization operation, the second MLP and the devoxelization operation in the voxel branch, and the features output by the two branches are spliced to obtain the accumulated noise.
9. The three-dimensional object generation method based on diffusion model and semantic guidance according to claim 1 or 2, wherein the time step vector is obtained from a parameter uniformly sampled in [0,1] via an embedding representation.
CN202310285348.9A 2023-03-22 2023-03-22 Three-dimensional object generation method based on diffusion model and semantic guidance Pending CN116721200A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310285348.9A CN116721200A (en) 2023-03-22 2023-03-22 Three-dimensional object generation method based on diffusion model and semantic guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310285348.9A CN116721200A (en) 2023-03-22 2023-03-22 Three-dimensional object generation method based on diffusion model and semantic guidance

Publications (1)

Publication Number Publication Date
CN116721200A true CN116721200A (en) 2023-09-08

Family

ID=87874025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310285348.9A Pending CN116721200A (en) 2023-03-22 2023-03-22 Three-dimensional object generation method based on diffusion model and semantic guidance

Country Status (1)

Country Link
CN (1) CN116721200A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910572A (en) * 2023-09-13 2023-10-20 浪潮(北京)电子信息产业有限公司 Training method and device for three-dimensional content generation model based on pre-training language model
CN116910572B (en) * 2023-09-13 2024-02-09 浪潮(北京)电子信息产业有限公司 Training method and device for three-dimensional content generation model based on pre-training language model
CN117953180A (en) * 2024-03-26 2024-04-30 厦门大学 Text-to-three-dimensional object generation method based on dual-mode latent variable diffusion
CN118404590A (en) * 2024-07-02 2024-07-30 海信集团控股股份有限公司 Robot action track planning method and device and robot

Similar Documents

Publication Publication Date Title
CN116721200A (en) Three-dimensional object generation method based on diffusion model and semantic guidance
CN113051420B (en) Robot vision man-machine interaction method and system based on text generation video
CN113140020B (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
CN110942512B (en) Indoor scene reconstruction method based on meta-learning
CN112232485B (en) Cartoon style image conversion model training method, image generation method and device
CN116721334B (en) Training method, device, equipment and storage medium of image generation model
Berrahal et al. Optimal text-to-image synthesis model for generating portrait images using generative adversarial network techniques
CN113140023B (en) Text-to-image generation method and system based on spatial attention
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN113627093A (en) Underwater mechanism cross-scale flow field characteristic prediction method based on improved Unet network
Ye et al. Audio-driven stylized gesture generation with flow-based model
CN117422823A (en) Three-dimensional point cloud characterization model construction method and device, electronic equipment and storage medium
CN116306793A (en) Self-supervision learning method with target task directivity based on comparison twin network
CN117456587A (en) Multi-mode information control-based speaker face video generation method and device
CN115995002B (en) Network construction method and urban scene real-time semantic segmentation method
CN116597154A (en) Training method and system for image denoising model
CN116524070A (en) Scene picture editing method and system based on text
Fu et al. Gendds: Generating diverse driving video scenarios with prompt-to-video generative model
CN115577111A (en) Text classification method based on self-attention mechanism
CN115239967A (en) Image generation method and device for generating countermeasure network based on Trans-CSN
CN116503517B (en) Method and system for generating image by long text
CN116805046B (en) Method for generating 3D human body action based on text label
CN117830324B (en) 3D medical image segmentation method based on multi-dimensional and global local combination
CN118470221B (en) Three-dimensional target reconstruction method based on non-calibrated single view
CN117853678B (en) Method for carrying out three-dimensional materialization transformation on geospatial data based on multi-source remote sensing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination