CN115994990A - Three-dimensional model automatic modeling method based on text information guidance - Google Patents


Info

Publication number
CN115994990A
CN115994990A (application CN202211533211.2A)
Authority
CN
China
Prior art keywords
dimensional model
text
network
dimensional
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211533211.2A
Other languages
Chinese (zh)
Inventor
聂为之
陈睿东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202211533211.2A priority Critical patent/CN115994990A/en
Publication of CN115994990A publication Critical patent/CN115994990A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a three-dimensional model automatic modeling method guided by text information, comprising the following steps: preprocessing the text-annotated dataset, and using knowledge graph technology to construct a text-three-dimensional model knowledge graph by defining semantic tag entities, text description entities, three-dimensional model entities, and the associations among them; using the constructed knowledge graph as visual and semantic prior knowledge to retrieve related three-dimensional models and semantic tags from the text content input by a user; constructing a text-to-three-dimensional-model generation network, with a feature fusion network based on a multi-layer Transformer network that fuses the cross-modal prior knowledge with the input text information; and constructing a three-dimensional model generation network based on an implicit field, which predicts the shape and color information of the three-dimensional model from the spatial coordinates and the text features fused with the prior information. The method enables a user to automatically build a three-dimensional model with good geometric and color detail directly from a text description.

Description

Three-dimensional model automatic modeling method based on text information guidance
Technical Field
The invention relates to an automatic modeling method for three-dimensional models, and in particular to an automatic three-dimensional model modeling method guided by text information.
Background
In recent years, with the rapid development of computer hardware and related software industries, many new forms of multimedia visual content have come into wide practical use, and users place ever higher demands on the content and form of visual information display. Among these new forms, the use of three-dimensional model data has grown explosively. Three-dimensional model data captures omnidirectional visual information about objects in the real world; its rich expressiveness and visual characteristics closest to the real world have led to its application across industries such as medical imaging, autonomous driving, video games, virtual house viewing, and virtual fitting. Meanwhile, as hardware such as LiDAR cameras, consumer 3D printers, and AR/VR (augmented reality / virtual reality) devices becomes more convenient and affordable, more and more three-dimensional-model-based products and application scenarios are appearing. In particular, the "metaverse" concept promoted in recent years by Google, NVIDIA, and Meta (formerly Facebook) aims to use 3D modeling technology to create a virtual world on the internet offering users a deeply immersive experience. Driven by these demands, obtaining three-dimensional models as quickly as possible has become a major challenge for three-dimensional-model-based applications.
To alleviate the difficulty of acquiring three-dimensional models, a growing body of work studies techniques for generating three-dimensional models with deep learning. A three-dimensional model generation framework is always built for a particular three-dimensional representation: methods such as 3D-VAE-GAN [1], GRASS [2], and SCORES [3] generate three-dimensional voxels; Tree-GAN [4], SP-GAN [5], and PointFlow [6] generate point cloud data; OctreeGAN [7] generates octree data. The main design idea of these methods is to use a classical generative architecture such as GAN [8] or Flow [9] to learn the distribution of the original data in the training set, so that three-dimensional models of good visual quality can be randomly generated from the sampling space. In addition, reconstructing a three-dimensional model from visual information such as views, scenes, or sketches is a common task in this field; related works include 3D-R2N2 [10], Pix2Vox [11][12], and Pixel2Mesh [13][14], which reconstruct three-dimensional models from views, Mem3D [15], which reconstructs them from scenes, and Sketch3D [16], which reconstructs them from sketches.
Because text is easy to acquire and edit, generating three-dimensional models from text descriptions can further and greatly reduce the difficulty of obtaining model data. Text2Shape [18] first attempted this task: following an idea similar to the text-to-image method GAN-INT-CLS [17], it learns the data distribution of colored three-dimensional voxels with a generative adversarial network. The Text-Guided method [19] aligns text and three-dimensional model features in a latent space learned by an autoencoder, and completes cross-modal generation by exploiting the autoencoder's strong ability to recover three-dimensional information from the latent space.
However, existing methods still have limitations; in particular, when faced with text descriptions that are ambiguous or flexibly worded, they struggle to generate three-dimensional models that accurately match the described details. On one hand, the huge modal gap between text and three-dimensional models makes the design of a deep learning network architecture very difficult, a problem that becomes more pronounced in the absence of large-scale text-annotated three-dimensional model datasets; on the other hand, the flexibility and abstraction of natural language make it hard for a generation network to learn the correct patterns for producing a three-dimensional model from a brief text description.
Disclosure of Invention
The technical problem the invention aims to solve is to overcome the deficiencies of the prior art by providing an automatic three-dimensional model modeling method guided by text information.
The technical scheme adopted by the invention is as follows: a three-dimensional model automatic modeling method guided by text information comprises the following steps:
1) Constructing a text feature extraction network and a three-dimensional model feature extraction network;
2) Performing data preprocessing on the text labeling three-dimensional model dataset;
3) Constructing a text-three-dimensional model knowledge graph, comprising defining entities and relations; the entities comprise three types: semantic tag entities, text description entities, and three-dimensional model entities; the relations comprise three types: description relations, attribute relations, and similarity relations;
4) Stacking Transformer networks to form a prior-knowledge fusion network, and using it to fuse the features of the input text with the retrieved related semantic tags;
5) Forming a three-dimensional model decoder network from multi-layer perceptron networks, divided into a shape decoder and a color decoder; taking the fusion features obtained in step 4) as the inputs of the shape decoder and the color decoder respectively, and predicting the shape and color of the three-dimensional model in the implicit occupancy field space;
6) Performing end-to-end training on a three-dimensional model generation network based on text information guidance by using an Adam optimizer; the three-dimensional model generating network based on text information guidance comprises a three-dimensional model feature extraction network, a three-dimensional model decoder network, a text feature extraction network and a priori knowledge fusion network;
7) Using the trained text-information-guided three-dimensional model generation network to predict the shape and color of the three-dimensional model from the text input by a user, and rendering the prediction into a three-dimensional mesh with the Marching Cubes algorithm to obtain the automatic modeling result. The text-information-guided automatic modeling method assists text-to-three-dimensional-model generation by introducing auxiliary information: even when the input text is vague, the additional prior knowledge supplies complementary, rich three-dimensional structure and color information that safeguards the modeling performance of the final method.
The beneficial effects of the invention are as follows:
1. The invention innovatively provides a text-three-dimensional model knowledge graph construction method that, through a series of data preprocessing and graph construction steps, builds high-confidence cross-modal associations between text and three-dimensional models. From a text description, it can effectively obtain two kinds of prior knowledge: related semantic tags and related three-dimensional models.
2. The invention innovatively provides a Transformer-based multi-modal information fusion module for fusing prior knowledge with the input text information, and designs a prior-knowledge-guided text-to-three-dimensional-model generation framework. The introduced prior knowledge supplements the input text description with rich shape and color information, ensuring good generation quality and addressing the difficulty of generating three-dimensional models from vague text descriptions. With the method of the invention, a user can automatically build a three-dimensional model with good geometric and color detail directly from a text description, replacing tedious manual modeling steps and simplifying the acquisition of three-dimensional models.
Drawings
FIG. 1 is a flow chart of the text information guided three-dimensional model automatic modeling method of the present invention;
FIG. 2 is a schematic diagram of a knowledge graph construction step and a three-dimensional model generation flow;
FIG. 3 is a schematic diagram of a network architecture of a text encoder and a three-dimensional model encoder;
FIG. 4 is a schematic diagram of a network architecture of a three-dimensional model decoder;
FIG. 5 is a schematic diagram of a prior knowledge fusion network architecture;
FIG. 6 is a schematic diagram showing the effect of the method of the present invention compared with the prior art.
Detailed Description
The text-information-guided automatic three-dimensional model modeling method of the present invention is described in detail below with reference to the embodiments and the accompanying drawings.
As shown in FIG. 1, the text-information-guided automatic three-dimensional model modeling method of the present invention comprises the following steps:
1) Constructing a text feature extraction network and a three-dimensional model feature extraction network; wherein:
the text feature extraction network is a text encoder built on a pre-trained BERT network; its input is a text description word sequence of length at most 64, global sentence features are obtained by BERT pooling, and a linear layer outputs a 256-dimensional feature vector;
the three-dimensional model feature extraction network is a three-dimensional model encoder formed by stacking three-dimensional convolution blocks; its input is a three-dimensional voxel grid of resolution 64, and its output is a 256-dimensional feature vector.
2) Performing data preprocessing on the text labeling three-dimensional model dataset; comprising the following steps:
extracting keywords from all text description corpus in the text labeling three-dimensional model dataset by adopting a TextRank algorithm;
performing network training on the text feature extraction network and the three-dimensional model feature extraction network by adopting a contrast learning loss function to obtain joint feature representation of the text features and the three-dimensional model features;
and obtaining the cross-modal feature similarity between each text and each three-dimensional model in the text labeling three-dimensional model data set by adopting a cosine similarity method.
3) Constructing a text-three-dimensional model knowledge graph, comprising defining entities and relations; the entities comprise three types: semantic tag entities, text description entities, and three-dimensional model entities; the relations comprise three types: description relations, attribute relations, and similarity relations; wherein:
the semantic tag entity is composed of keywords extracted after data preprocessing; the text description entity and the three-dimensional model entity are composed of text description and three-dimensional models in a text labeling three-dimensional model dataset.
The description relation is constructed from the correspondence between text descriptions and three-dimensional models in the text-annotated dataset, and involves a text description entity and a three-dimensional model entity. The attribute relation is constructed from the association between the text descriptions corresponding to a three-dimensional model and the extracted keywords, and involves a semantic tag entity and a three-dimensional model entity. The similarity relation is constructed using the cross-modal feature similarities obtained in data preprocessing, by selecting, for the text description features corresponding to a three-dimensional model, the three-dimensional models with the highest cross-modal feature similarity; it involves two three-dimensional model entities.
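The three entity types and three relation types above can be sketched minimally with plain Python dictionaries (all identifiers and sample data here are hypothetical, for illustration only; the patent does not prescribe a storage format):

```python
# Entities: semantic tags (A), text descriptions (T), 3D models (S).
graph = {
    "tags": {"red", "chair", "four legs"},
    "texts": {"t1": "a red chair with four legs"},
    "models": {"s1", "s2"},
    # Relations stored as edge lists.
    "describes": [("t1", "s1")],   # T-S: text description -> 3D model
    "attribute": [("s1", "red"), ("s1", "chair"), ("s1", "four legs")],  # S-A
    "similar": [("s1", "s2", 0.87)],  # S-S edge with similarity weight
}

def tags_of(graph, model_id):
    """Retrieve the semantic-tag prior of a model via its S-A edges."""
    return {tag for (m, tag) in graph["attribute"] if m == model_id}

def similar_models(graph, model_id):
    """Retrieve related 3D models via the weighted S-S edges."""
    return [(b, w) for (a, b, w) in graph["similar"] if a == model_id]
```

Given an input text, retrieval would first match tags and descriptions, then follow these edges to collect the two kinds of prior knowledge.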
4) Stacking Transformer networks to form a prior-knowledge fusion network, and using it to fuse the features of the input text with the retrieved related semantic tags; the feature fusion comprises the following steps:
(4.1) retrieving the two kinds of prior knowledge, related semantic tags and related three-dimensional models, from the text-three-dimensional model knowledge graph according to the text input by the user, using a multi-entity retrieval method;
(4.2) using the text feature extraction network to extract features from the input text and the retrieved related semantic tags, forming the feature sequence of the related semantic tags, and using the three-dimensional model feature extraction network to extract features from the related three-dimensional models, forming the feature sequence of the related three-dimensional models;
(4.3) taking the input text feature (256-dimensional), the feature sequence of the related three-dimensional models (256-dimensional each), and the feature sequence of the related semantic tags (256-dimensional each) as the inputs of the prior-knowledge fusion network;
(4.4) computing over these inputs with a multi-head attention mechanism in the prior-knowledge fusion network, fusing the two kinds of prior knowledge into the input text features in stages; specifically: first, the text features are updated by attention computation with the prior knowledge from the related three-dimensional models, yielding updated text features (256-dimensional); then, a further attention computation combines the prior knowledge from the related semantic tags with the spatial information of the implicit occupancy field, and finally the fusion features incorporating the prior knowledge are output.
5) Forming a three-dimensional model decoder network from multi-layer perceptron networks, divided into a shape decoder and a color decoder; taking the fusion features obtained in step 4) as the inputs of the shape decoder and the color decoder respectively, and predicting the shape and color of the three-dimensional model in the implicit occupancy field space.
The training loss function consists of an autoencoder reconstruction loss based on the L2 norm and a text-to-three-dimensional-model feature alignment loss, also based on the L2 norm.
6) Performing end-to-end training on a three-dimensional model generation network based on text information guidance by using an Adam optimizer; the three-dimensional model generating network based on text information guidance comprises a three-dimensional model feature extraction network, a three-dimensional model decoder network, a text feature extraction network and a priori knowledge fusion network;
7) Using the trained text-information-guided three-dimensional model generation network to predict the shape and color of the three-dimensional model from the text input by a user, and rendering the prediction into a three-dimensional mesh with the Marching Cubes algorithm to obtain the automatic modeling result.
An example is given below:
First, a text-three-dimensional model knowledge graph is constructed so that prior knowledge can be obtained from the text description input by the user. The knowledge graph construction steps and the automatic three-dimensional modeling flow proposed by this embodiment of the invention are described in detail below with reference to FIG. 2:
101: and constructing a text feature extraction network and a three-dimensional model feature extraction network. Wherein a text encoder E is constructed t As a text information feature extraction network, which is constructed based on a pre-trained BERT model, a text description word sequence with the length not more than 64 is input, global sentence features are obtained by using BERT pooling, and 256-dimensional feature vectors are output by using a linear layer. Construction of a three-dimensional model encoder E v As a three-dimensional modelA syndrome extraction network based on a three-dimensional convolution block stack, each convolution block comprising a three-dimensional convolution layer, a batch normalization layer, and a LeakyRelu activation function. E (E) v Is three-dimensional voxel with resolution of 64, and outputs 256-dimensional feature vector. Text encoder E t And a three-dimensional model encoder E v The basic network structure of (a) is shown in figure 3.
102: using TextRank [20] The method is characterized in that the method comprises the steps of extracting keywords from all text description corpus in a text labeling three-dimensional model data set by an algorithm, wherein the main idea of the algorithm is to model the structural relation between each word in the text by using a weighted graph model, and calculate the importance of each word by iterative operation of the graph, so that the function of extracting keywords from sentences is realized.
The TextRank computation proceeds as follows. First, each word in a sentence is regarded as a node of a weighted graph, and V denotes the set of all nodes, i.e., words. $V_r$ denotes the r-th word of the sentence, and $In(V_r)$ and $Out(V_r)$ denote, respectively, the nodes pointing to $V_r$ and the nodes $V_r$ points to. The weight of each word is then computed as:

$$WS(V_r) = (1 - d) + d \cdot \sum_{V_h \in In(V_r)} \frac{W_{hr}}{\sum_{V_k \in Out(V_h)} W_{hk}} \, WS(V_h)$$

where d is a damping coefficient controlling the iteration of the algorithm, set to 0.85. The summation on the right-hand side computes the contribution of each adjacent word to the importance of the current word; $W_{hr}$ denotes the frequency with which the words $V_h$ and $V_r$ co-occur within sentences of a certain length. Iterating this computation yields the final weight $WS(V_r)$ of the r-th word $V_r$.
For semantic tag extraction, keyword extraction with the TextRank algorithm is performed on each text description in the dataset. The obtained keywords are manually screened, mainly retaining nouns and descriptive phrases, which are used to construct the semantic tag entities of the final knowledge graph.
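The TextRank iteration above can be sketched as follows (a simplified undirected co-occurrence variant in plain Python; the window size and iteration count are illustrative choices, not taken from the patent):

```python
from collections import defaultdict

def textrank(words, window=2, d=0.85, iters=30):
    """Iteratively compute word weights WS(V_r) over a co-occurrence graph."""
    # Build co-occurrence weights W[h][r] within a sliding window.
    w = defaultdict(lambda: defaultdict(float))
    for i, a in enumerate(words):
        for b in words[max(0, i - window):i]:
            if a != b:
                w[a][b] += 1.0
                w[b][a] += 1.0
    ws = {v: 1.0 for v in w}  # initial weights
    for _ in range(iters):
        new = {}
        for r in w:
            # Each neighbor h contributes W_hr / sum_k(W_hk) * WS(h).
            s = sum(w[h][r] / sum(w[h].values()) * ws[h]
                    for h in w if w[h][r] > 0)
            new[r] = (1 - d) + d * s
        ws = new
    return ws

words = "red chair with four legs red chair".split()
weights = textrank(words)
top = max(weights, key=weights.get)  # highest-scoring candidate keyword
```

In the patent's pipeline the scored words would then be manually screened, keeping nouns and descriptive phrases as semantic tags.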
103: text feature extraction network (text encoder) E employing contrast learning penalty function t And a three-dimensional model feature extraction network (three-dimensional model encoder) E v Training is performed. The optimization of the features is performed by constructing positive and negative pairs of samples within a training batch of the network. Definition (x) t ,x v ) For a text-three-dimensional model data pair, in a training batch having n pairs of text-three-dimensional model data pairs, the ith data pair is defined as
Figure BDA0003976625920000052
It is considered as a positive pair of samples in this training batch. In contrast, the remaining data pairs +.>
Figure BDA0003976625920000053
And +.>
Figure BDA0003976625920000054
Defined as a negative pair of samples, the effect of this loss function is to maximize the feature distance between the negative pair of samples while minimizing the distance between the features of the positive pair of samples, the loss function of which is expressed symmetrically as:
Figure BDA0003976625920000055
Figure BDA0003976625920000056
the loss function is a symmetrical two-part,
Figure BDA0003976625920000057
and->
Figure BDA0003976625920000058
And respectively comparing learning losses of the ith data pair from the text to the three-dimensional model and from the three-dimensional model to the text. In the formula<a,b>Representing cosine similarity between two features a, bThe calculation method comprises the following steps:
Figure BDA0003976625920000059
based on the definition, the calculated losses are accumulated and averaged for all data in a training batch, and the final comparison learning loss function is written as:
Figure BDA0003976625920000061
where β is a parameter of balance versus learning loss direction, set to 0.5.
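A minimal NumPy sketch of a symmetric contrastive loss of this kind (cosine similarities used directly as logits, no temperature parameter; random stand-in features, so this illustrates the computation rather than the trained encoders):

```python
import numpy as np

def contrastive_loss(f_t, f_v, beta=0.5):
    """Symmetric contrastive loss over a batch of n text / 3D-model feature pairs."""
    t = f_t / np.linalg.norm(f_t, axis=1, keepdims=True)
    v = f_v / np.linalg.norm(f_v, axis=1, keepdims=True)
    sim = t @ v.T                     # sim[i, j] = cosine <f_t^i, f_v^j>
    # Text-to-model direction: softmax over each row, pick the diagonal (positive pair).
    l_tv = -np.log(np.exp(np.diag(sim)) / np.exp(sim).sum(axis=1))
    # Model-to-text direction: softmax over each column.
    l_vt = -np.log(np.exp(np.diag(sim)) / np.exp(sim).sum(axis=0))
    return float(np.mean(beta * l_tv + (1 - beta) * l_vt))

rng = np.random.default_rng(0)
f_t = rng.normal(size=(4, 256))
loss_random = contrastive_loss(f_t, rng.normal(size=(4, 256)))
loss_aligned = contrastive_loss(f_t, f_t)  # perfectly aligned pairs
```

As expected, perfectly aligned pairs yield a lower loss than unrelated random features, since the positive-pair similarity dominates each softmax.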
The cross-modal feature similarity between each text description and each three-dimensional model in the dataset is then computed and sorted to obtain preliminary cross-modal retrieval results. By aggregating the cross-modal retrieval results of all text descriptions of the same three-dimensional model, cross-modal associations with higher confidence can be obtained. These associations are further refined and stored in the knowledge graph later.
104: three entity types are defined for the constructed text-three-dimensional model knowledge graph K: a three-dimensional model entity (S) representing each three-dimensional model in the dataset; a text description entity (T) representing a text description corresponding to each three-dimensional model in the dataset; semantic tag entities (a), each corresponding to a particular semantic information of the three-dimensional model, for example, a "comfortable red chair with four legs" which contains four semantic tags "comfort", "red", "chair", "four legs".
In addition, three relation types are defined. The description relation (T-S) stores the original affiliation between a text description and its three-dimensional model; a T-S edge links a text description entity (T) to its originally corresponding three-dimensional model entity (S). The attribute relation (S-A) stores the affiliation between semantic tag entities (A) and three-dimensional model entities (S); when constructing this relation, a three-dimensional model entity is connected to a semantic tag entity if any text description entity associated with the model contains the semantic tag at the word-segmentation level. The similarity relation (S-S) describes the relevance between three-dimensional model entities. When constructing these edges, the text encoder and three-dimensional model encoder obtained by contrastive learning in the data preprocessing stage are used to extract features from every text description and every three-dimensional model, and their cross-modal feature similarity is computed via cosine similarity. For each three-dimensional model, the three-dimensional models closest in feature similarity to each of its corresponding text descriptions are aggregated, the total similarity is computed, a certain number of the most similar three-dimensional models are selected to construct (S-S) edges, and the computed weights are recorded on the edges as additional information.
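The (S-S) edge construction can be sketched as follows: for one model, the cross-modal retrieval results of all of its text descriptions are pooled, per-candidate similarities are accumulated, and the top-k candidates become weighted edges (plain Python, with hypothetical model identifiers and similarity values):

```python
from collections import defaultdict

def build_ss_edges(desc_sims, k=2):
    """desc_sims: for one model, a list (one entry per text description) of
    {candidate_model: cross-modal similarity}. Returns top-k (model, weight) edges."""
    total = defaultdict(float)
    for sims in desc_sims:
        for cand, s in sims.items():
            total[cand] += s  # aggregate similarity across all descriptions
    ranked = sorted(total.items(), key=lambda kv: -kv[1])
    return ranked[:k]

# Hypothetical retrieval results for the two text descriptions of one chair model.
desc_sims = [
    {"s2": 0.9, "s3": 0.4, "s4": 0.2},
    {"s2": 0.7, "s4": 0.6, "s3": 0.3},
]
edges = build_ss_edges(desc_sims, k=2)
```

Aggregating over all descriptions, rather than trusting any single retrieval, is what gives the edges their higher confidence.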
A knowledge graph constructed from these definitions explicitly stores the three data types (three-dimensional models, text descriptions, and semantic tags) together with explicit associations between them. These different types of association information (edges) can be regarded as different types of prior knowledge, and help locate semantic tags or three-dimensional information associated with an input text. The invention uses the similar three-dimensional models provided by the (S-S) relation and the related semantic tags provided by the (S-A) relation as the prior knowledge guiding text-to-three-dimensional-model generation.
Next, the design of the network framework and the training process are introduced in combination with the specific network structures and computation formulas:
106: a priori knowledge fusion network is constructed based on a transducer network stack, and the basic network structure is shown in figure 4. In obtaining n from knowledge graph p A related three-dimensional model P s And n a Personal related semantic tags P a Then, in the process of generating a three-dimensional model by using text, the two priori knowledge needs to be fused with the original text information. Specifically, for the input text T, a text encoder E is utilized t Extracting original text feature f t The method comprises the steps of carrying out a first treatment on the surface of the For a related three-dimensional model P s Related semantic tags P a Using three-dimensional model encoders E, respectively v And text encoder E t Processing it into a feature sequence
Figure BDA0003976625920000062
And->
Figure BDA0003976625920000063
107: and carrying out feature fusion by utilizing the priori knowledge fusion network. The first stage is to use the characteristic sequence F of the related three-dimensional model p Updating original text feature f t . Set F t 1 ={f t The sequence is the initial Query sequence, F p For Key and Value sequences, a stack-l layer Transformer decoder network is input. Definition F t m For the query sequence input by the m-th layer, the calculation process of each layer can be written as follows:
Q=W Q ·F t m-1 ,K=W K ·F p ,V=W V ·F p
F t m =Multihead(Q,K,V)
F t m =FFN(F t m )
wherein Multihead () represents a multi-head attention mechanism, FFN () represents a feed-forward neural network, and after calculation by a layer I transducer, a feature F fused with a three-dimensional prior is obtained t ′=F t l . The second stage is to update F for features using related semantic tags t '. F is firstly carried out according to the number N of implicit sampling points t ' spread and splice with the position p (N×3 dimension) of the implicit sampling point to obtain N×259 dimension feature
Figure BDA0003976625920000071
S is carried out by utilizing one layer of full-connection network t And F is equal to a Conversion to a Low latitude feature (N×32 dimension) for ease of handling>
Figure BDA0003976625920000072
And->
Figure BDA0003976625920000073
Input of a Transformer network of the same stack of layers, definition +.>
Figure BDA0003976625920000074
For the query sequence input by the nth layer, the calculation process at this stage is as follows:
Figure BDA0003976625920000075
Figure BDA0003976625920000076
Figure BDA0003976625920000077
the output characteristic of the final first layer is
Figure BDA0003976625920000078
The features obtained in this step are as follows t Directly splicing to obtain
Figure BDA0003976625920000079
Is a two-dimensional fusion feature of n×291 dimensions.
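One cross-attention fusion layer of the kind used in both stages can be sketched in NumPy as follows (single-head, untrained random weights; the actual network stacks l multi-head Transformer layers with feed-forward sublayers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, prior_seq, d=256, rng=None):
    """One attention layer: queries attend over a prior-knowledge feature sequence."""
    rng = rng if rng is not None else np.random.default_rng(0)
    w_q = rng.normal(scale=d ** -0.5, size=(d, d))
    w_k = rng.normal(scale=d ** -0.5, size=(d, d))
    w_v = rng.normal(scale=d ** -0.5, size=(d, d))
    q, k, v = query @ w_q, prior_seq @ w_k, prior_seq @ w_v
    attn = softmax(q @ k.T / np.sqrt(d))   # (n_query, n_prior) attention weights
    return attn @ v                        # updated query features

f_t = np.random.default_rng(1).normal(size=(1, 256))  # text feature F_t^1
F_p = np.random.default_rng(2).normal(size=(4, 256))  # related-model features
f_t_updated = cross_attention(f_t, F_p)               # stage-1 style update
```

The same pattern applies in stage two, with the expanded per-point features as queries and the (projected) semantic-tag features as keys and values.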
108: and constructing a three-dimensional model decoder network based on the multi-layer perceptron network, and dividing the three-dimensional model decoder network into a shape decoder and a color decoder. The input of the method is the two-dimensional fusion characteristic output by the priori knowledge fusion module, and the shape and the color of the target three-dimensional model are predicted in the implicit occupation field space. Three-dimensional model decoder d= { D s ,D c The network structure of the figure 5 is shown, wherein the shape encoder and the color encoder have similar multi-layer perceptron (MLP) network structure and are composed of seven fully connected layers, the LeakyRelu is used as an activation function at the output of each layer, and the prediction value of each point space occupation probability and RGB color is obtained through a Sigmoid function at the end of the network.
109:End-to-end training is performed on a three-dimensional model generation network based on text information guidance by using an Adam optimizer. The whole framework of the network comprises a text feature extraction network (text encoder), a three-dimensional model feature extraction network (three-dimensional model encoder), a priori knowledge fusion network and a three-dimensional model decoder network. Introducing a real three-dimensional model GT associated with text input in a training set, using E v Extracting the feature f from it s . Feature sequence F of phase Guan Yuyi tag by using priori knowledge fusion module a And feature sequence F of related three-dimensional model p And f s Fusing to obtain a two-dimensional fusion characteristic S, and inputting the two-dimensional fusion characteristic S into a three-dimensional model encoder D= { D s ,D c }. Introducing a self-encoder reconstruction loss L based on L2 norm ae Performing network training:
L_ae = ||D_s(S) - I_s||_2 + ||D_c(S) × I_s - I_c||_2
where I_s and I_c represent the shape and color information of the sampled implicit points, respectively; during training, the network uses these two as ground-truth values to optimize the prediction results of D_s and D_c. In the loss calculation for color information, I_s is used as a filter, i.e., color is optimized only at implicit points that are actually occupied in space.
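The reconstruction loss can be illustrated with a minimal pure-Python sketch (the function names are ours, not the patent's). The key detail is how I_s masks the color term so that only actually occupied points contribute color residuals:

```python
import math

def l2_norm(residuals):
    # Euclidean (L2) norm over a flat list of residuals
    return math.sqrt(sum(r * r for r in residuals))

def reconstruction_loss(pred_occ, pred_rgb, gt_occ, gt_rgb):
    """L_ae = ||D_s(S) - I_s||_2 + ||D_c(S) * I_s - I_c||_2, where
    gt_occ (I_s) also acts as a filter: predicted colors only count
    at implicit points that are actually occupied (I_s == 1)."""
    shape_term = l2_norm([p - g for p, g in zip(pred_occ, gt_occ)])
    color_residuals = [c_p * occ - c_g
                       for rgb_p, rgb_g, occ in zip(pred_rgb, gt_rgb, gt_occ)
                       for c_p, c_g in zip(rgb_p, rgb_g)]
    return shape_term + l2_norm(color_residuals)

# Perfect predictions at two sampled points (one occupied, one empty)
perfect = reconstruction_loss(
    [1.0, 0.0], [[0.5, 0.5, 0.5], [0.9, 0.9, 0.9]],
    [1.0, 0.0], [[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]])
# Note: the wrong color at the empty point costs nothing, by design
```

Here `perfect` evaluates to 0: the shape residuals vanish, and the color term at the empty point is masked out by I_s = 0.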
In addition, with f_s as the reference feature for the text encoder to learn the feature-space mapping and f_t as the input text feature, a text-to-three-dimensional-model feature alignment loss based on the L2 norm is used so that the text encoder learns the ability to map text features into the three-dimensional feature space:
L_reg = ||f_t - f_s||_2
the overall loss function that is ultimately used for network training is:
L=λ 1 L ae2 L reg
where λ_1 and λ_2 are weight parameters controlling the balance between the two loss functions. After network training is completed, user-defined text information is input; the knowledge graph is used to obtain the two kinds of prior knowledge (related semantic tags and related three-dimensional models); the text encoder and the three-dimensional model encoder extract features from the input text and the prior knowledge, which are fed into the fusion network and the three-dimensional model decoder. After the decoder network predicts the shape and color information of the target three-dimensional model, the Marching Cubes algorithm renders the final three-dimensional mesh model, yielding the automatic three-dimensional modeling result guided by the text information.
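How the two losses combine can be sketched as follows; the λ values in the usage line are arbitrary placeholders, since the patent does not disclose its settings:

```python
import math

def l2_norm(residuals):
    # Euclidean (L2) norm over a flat list of residuals
    return math.sqrt(sum(r * r for r in residuals))

def alignment_loss(f_t, f_s):
    # L_reg = ||f_t - f_s||_2: pulls the text feature toward the
    # reference 3D feature so the text encoder learns the mapping
    return l2_norm([t - s for t, s in zip(f_t, f_s)])

def total_loss(l_ae, l_reg, lambda1, lambda2):
    # L = lambda_1 * L_ae + lambda_2 * L_reg
    return lambda1 * l_ae + lambda2 * l_reg

l_reg = alignment_loss([3.0, 0.0], [0.0, 4.0])   # residual (3, -4) -> norm 5
l = total_loss(2.0, l_reg, 0.5, 0.2)             # 0.5*2.0 + 0.2*5.0 = 2.0
```

At inference time only the text branch is used, so L_reg is what lets f_t stand in for f_s when driving the decoder.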
(III) The schemes of Embodiments 1 and 2 are validated in conjunction with a specific dataset (Table 1), as described in detail below:
the Text-tagged three-dimensional model dataset used to evaluate the proposed method was Text2Shape published in 2018 [18] A data set. The data set contains pairs of partial shapenets [22] And carrying out text labeling on the model to obtain a data set subset, and constructing a text-three-dimensional model knowledge graph and carrying out experiments by using the three-dimensional model contained in the data set and corresponding text description in the embodiment. The subset contains two types of three-dimensional models of chairs and tables, providing three-dimensional voxel binvox files with RGBA color information at resolutions 32 and 64. In terms of text labeling, each three-dimensional model in the dataset has an average of 5 text descriptions, for a total of 75344 text descriptions. The specific three-dimensional model of the dataset and the specific category and quantity distribution of the text description are shown in table 1.
Table 1: Text2Shape dataset category and quantity distribution
[Table 1 is provided as an image in the original publication.]
The method is compared in performance with existing related methods. Because deep learning work on generating three-dimensional models from text remains scarce, the comparison experiments select only Text2Shape [18] and Text-Guided [19], the two works that use the same dataset as this work and share the same application direction (generating three-dimensional models with color details from text). The generation-quality quantization method used in prior work is adopted; the compared quantitative indices include:
1) IoU (Intersection over Union): measures the overlap ratio between the generated three-dimensional model and the real three-dimensional model, serving as a similarity measure between the prediction and the ground truth.
2) EMD (Earth Mover's Distance): measures how well the colors of the generated three-dimensional model match those of the real three-dimensional model.
3) Acc (classification accuracy): the generated three-dimensional models are classified with a three-dimensional model classifier pre-trained on the dataset, to evaluate whether the generation method accurately understands the semantics of the input text and generates a model of the corresponding category.
4) IS (Inception Score): evaluates the generative network from both the quality and diversity of the generated content; the same pre-trained three-dimensional model classifier as for Acc serves as the perception network for its computation.
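Two of these indices are simple enough to sketch directly. The implementations below are illustrative pure-Python versions (the patent's actual evaluation code is not disclosed):

```python
def voxel_iou(pred, gt):
    """IoU between two binary voxel grids, given here as flat 0/1 lists:
    |pred AND gt| / |pred OR gt|."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Two empty grids are conventionally treated as a perfect match
    return inter / union if union else 1.0

def classification_accuracy(pred_labels, gt_labels):
    # Acc: fraction of generated models whose category, as judged by a
    # pre-trained classifier, matches the intended category
    correct = sum(1 for p, g in zip(pred_labels, gt_labels) if p == g)
    return correct / len(gt_labels)
```

For example, grids `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one occupied voxel out of three occupied overall, giving an IoU of 1/3.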
Table 2: Quantitative comparison with related methods
[Table 2 is provided as an image in the original publication.]
Table 2 shows the evaluation results of the relevant quantitative indicators under the final settings of each experiment. First, in terms of the quantitative indicators, the method achieves a clear improvement on IoU, EMD, and Acc. Since the dataset used in the experiments has only two categories, the computed IS value is bounded within the interval [0, 2]; the IS values of the four methods, 1.96, 1.97, 1.96, and 1.97, all reach excellent levels.
In addition, the results are analyzed from the viewpoint of the visual quality of the generated three-dimensional models, as shown in Figure 6. The introduction of prior knowledge allows the generated three-dimensional model to carry more of the visual characteristics described by the input text. As can be seen from the figure, the three-dimensional models generated by this method are less constrained by the original real models than those of Text-Guided, yielding models that are more flexible and better matched to the semantics of the input text; among the methods shown, the visual quality of the generated models is the best. This demonstrates the advancement and effectiveness of the method.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments. The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
References:
[1] Wu J, Zhang C, Xue T, et al. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling[J]. Advances in Neural Information Processing Systems, 2016, 29.
[2] Li J, Xu K, Chaudhuri S, et al. GRASS: Generative recursive autoencoders for shape structures[J]. ACM Transactions on Graphics (TOG), 2017, 36(4): 1-14.
[3] Zhu C, Xu K, Chaudhuri S, et al. SCORES: Shape composition with recursive substructure priors[J]. ACM Transactions on Graphics (TOG), 2018, 37(6): 1-14.
[4] Shu D W, Park S W, Kwon J. 3D point cloud generative adversarial network based on tree structured graph convolutions[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 3859-3868.
[5] Li R, Li X, Hui K H, et al. SP-GAN: Sphere-guided 3D shape generation and manipulation[J]. ACM Transactions on Graphics (TOG), 2021, 40(4): 1-12.
[6] Yang G, Huang X, Hao Z, et al. PointFlow: 3D point cloud generation with continuous normalizing flows[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 4541-4550.
[7] Tatarchenko M, Dosovitskiy A, Brox T. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 2088-2096.
[8] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
[9] Dinh L, Krueger D, Bengio Y. NICE: Non-linear independent components estimation[J]. arXiv preprint arXiv:1410.8516, 2014.
[10] Choy C B, Xu D, Gwak J Y, et al. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction[C]. European Conference on Computer Vision. Springer, Cham, 2016: 628-644.
[11] Xie H, Yao H, Sun X, et al. Pix2Vox: Context-aware 3D reconstruction from single and multi-view images[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 2690-2698.
[12] Xie H, Yao H, Zhang S, et al. Pix2Vox++: Multi-scale context-aware 3D object reconstruction from single and multiple images[J]. International Journal of Computer Vision, 2020, 128(12): 2919-2935.
[13] Wang N, Zhang Y, Li Z, et al. Pixel2Mesh: Generating 3D mesh models from single RGB images[C]. Proceedings of the European Conference on Computer Vision (ECCV). 2018: 52-67.
[14] Wen C, Zhang Y, Li Z, et al. Pixel2Mesh++: Multi-view 3D mesh generation via deformation[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 1042-1051.
[15] Yang S, Xu M, Xie H, et al. Single-view 3D object reconstruction from shape priors in memory[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3152-3161.
[16] Lun Z, Gadelha M, Kalogerakis E, et al. 3D shape reconstruction from sketches via multi-view convolutional networks[C]. 2017 International Conference on 3D Vision (3DV). IEEE, 2017: 67-77.
[17] Reed S, Akata Z, Yan X, et al. Generative adversarial text to image synthesis[C]. International Conference on Machine Learning. PMLR, 2016: 1060-1069.
[18] Chen K, Choy C B, Savva M, et al. Text2Shape: Generating shapes from natural language by learning joint embeddings[C]. Asian Conference on Computer Vision. Springer, Cham, 2018: 100-116.
[19] Liu Z, Wang Y, Qi X, et al. Towards implicit text-guided 3D shape generation[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 17896-17906.
[20] Mihalcea R, Tarau P. TextRank: Bringing order into text[C]. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.
[21] Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations[C]. International Conference on Machine Learning. PMLR, 2020: 1597-1607.
[22] Chang A X, Funkhouser T, Guibas L, et al. ShapeNet: An information-rich 3D model repository[J]. arXiv preprint arXiv:1512.03012, 2015.

Claims (8)

1. A three-dimensional model automatic modeling method based on text information guidance, characterized by comprising the following steps:
1) Constructing a text feature extraction network and a three-dimensional model feature extraction network;
2) Performing data preprocessing on the text labeling three-dimensional model dataset;
3) Constructing a text-three-dimensional model knowledge graph, comprising: defining entities and relationships, wherein the entities comprise three types: semantic tag entities, text description entities, and three-dimensional model entities; and the relationships comprise three types: description relationships, attribute relationships, and similarity relationships;
4) Stacking Transformer networks to form a prior-knowledge fusion network, and using the prior-knowledge fusion network to perform feature fusion on the input text and the retrieved related semantic tags;
5) Forming a three-dimensional model decoder network from multi-layer perceptron networks, the three-dimensional model decoder network being divided into a shape decoder and a color decoder; taking the two-dimensional fusion features obtained in step 4) as the input of the shape decoder and the color decoder, respectively, and predicting the shape and the color of the three-dimensional model, respectively, in the implicit occupancy field space;
6) Performing end-to-end training on a three-dimensional model generation network based on text information guidance by using an Adam optimizer; the three-dimensional model generating network based on text information guidance comprises a three-dimensional model feature extraction network, a three-dimensional model decoder network, a text feature extraction network and a priori knowledge fusion network;
7) Using the trained text-information-guided three-dimensional model generation network to predict the shape and color of the three-dimensional model from the text input by the user, and rendering the prediction result into a three-dimensional mesh with the Marching Cubes algorithm to obtain the automatic modeling result.
2. The automatic modeling method of a three-dimensional model based on text information guidance according to claim 1, wherein constructing the text feature extraction network in step 1) comprises building a text encoder based on a pre-trained BERT network as the text feature extraction network; the text feature extraction network takes as input a text description word sequence of length no more than 64, obtains the sentence-level global feature via BERT pooling, and outputs a 256-dimensional feature vector through one linear layer.
3. The text information guidance-based three-dimensional model automatic modeling method according to claim 1, wherein the three-dimensional model feature extraction network constructed in step 1) is a three-dimensional model encoder formed by stacking three-dimensional convolution blocks; it takes three-dimensional voxels of resolution 64 as input and outputs a 256-dimensional feature vector.
4. The text information guidance-based three-dimensional model automatic modeling method according to claim 1, wherein the step 2) includes:
extracting keywords from all text description corpus in the text labeling three-dimensional model dataset by adopting a TextRank algorithm;
performing network training on the text feature extraction network and the three-dimensional model feature extraction network by adopting a contrast learning loss function to obtain joint feature representation of the text features and the three-dimensional model features;
and obtaining the cross-modal feature similarity between each text and each three-dimensional model in the text labeling three-dimensional model data set by adopting a cosine similarity method.
5. The text information guidance-based three-dimensional model automatic modeling method according to claim 1, wherein the semantic tag entity in the step 3) is composed of keywords extracted after data preprocessing; the text description entity and the three-dimensional model entity are composed of text description and three-dimensional models in a text labeling three-dimensional model dataset.
6. The automatic modeling method of a three-dimensional model based on text information guidance according to claim 1, wherein the description relation in the step 3) is constructed by text description in a text-annotated three-dimensional model dataset and a three-dimensional model correspondence relation, and relates to a text description entity and a three-dimensional model entity; the attribute relationship is constructed by the correlation between text description corresponding to a three-dimensional model in a text labeling three-dimensional model dataset and extracted keywords, and relates to a semantic tag entity and a three-dimensional model entity; the similarity relation is constructed by selecting a three-dimensional model with highest cross-modal feature similarity according to text description features corresponding to the three-dimensional model by utilizing the cross-modal feature similarity obtained in data preprocessing, and the similarity relation relates to two three-dimensional model entities.
7. The text-based guided three-dimensional model automatic modeling method of claim 1, wherein the feature fusion of step 4) comprises:
(4.1) using a multi-entity retrieval method to retrieve, from the text-three-dimensional model knowledge graph, the two kinds of prior knowledge (related semantic tags and related three-dimensional models) according to the text input by the user;
(4.2) using the text feature extraction network to extract features from the input text and the retrieved related semantic tags, forming the feature sequence of the related semantic tags, and using the three-dimensional model feature extraction network to extract features from the related three-dimensional models, forming the feature sequence of the related three-dimensional models;
(4.3) taking the input text features, the feature sequence of the related three-dimensional models, and the feature sequence of the related semantic tags as inputs of the prior-knowledge fusion network;
(4.4) computing over the inputs with a multi-head attention mechanism in the prior-knowledge fusion network, fusing the two kinds of prior knowledge into the input text features in stages; specifically: first, the text features are updated with the prior knowledge from the related three-dimensional models via attention computation to obtain updated text features; then, attention is computed with the prior knowledge from the related semantic tags, combined with the spatial information of the implicit occupancy field; finally, the two-dimensional fusion features incorporating the prior knowledge are output.
8. The text-based guided three-dimensional model automatic modeling method according to claim 1, wherein the loss function for the training of step 5) comprises both an autoencoder reconstruction loss based on the L2 norm and a text-three-dimensional-model feature alignment loss based on the L2 norm.
CN202211533211.2A 2022-12-02 2022-12-02 Three-dimensional model automatic modeling method based on text information guidance Pending CN115994990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211533211.2A CN115994990A (en) 2022-12-02 2022-12-02 Three-dimensional model automatic modeling method based on text information guidance

Publications (1)

Publication Number Publication Date
CN115994990A true CN115994990A (en) 2023-04-21

Family

ID=85994578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211533211.2A Pending CN115994990A (en) 2022-12-02 2022-12-02 Three-dimensional model automatic modeling method based on text information guidance

Country Status (1)

Country Link
CN (1) CN115994990A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934970A (en) * 2023-07-24 2023-10-24 天津大学 Medical single view three-dimensional reconstruction device based on priori knowledge guidance
CN117236341A (en) * 2023-09-21 2023-12-15 东方经纬项目管理有限公司 Whole process engineering consultation integrated system
CN117351173A (en) * 2023-12-06 2024-01-05 北京飞渡科技股份有限公司 Three-dimensional building parameterization modeling method and device based on text driving
CN117351173B (en) * 2023-12-06 2024-03-19 北京飞渡科技股份有限公司 Three-dimensional building parameterization modeling method and device based on text driving
CN117853678A (en) * 2024-03-08 2024-04-09 陕西天润科技股份有限公司 Method for carrying out three-dimensional materialization transformation on geospatial data based on multi-source remote sensing
CN117853678B (en) * 2024-03-08 2024-05-17 陕西天润科技股份有限公司 Method for carrying out three-dimensional materialization transformation on geospatial data based on multi-source remote sensing

Similar Documents

Publication Publication Date Title
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
WO2021223567A1 (en) Content processing method and apparatus, computer device, and storage medium
CN115994990A (en) Three-dimensional model automatic modeling method based on text information guidance
CN113657124B (en) Multi-mode Mongolian translation method based on cyclic common attention transducer
Zhang et al. A survey of 3D indoor scene synthesis
CN111079532A (en) Video content description method based on text self-encoder
CN110599592B (en) Three-dimensional indoor scene reconstruction method based on text
Li et al. Residual attention-based LSTM for video captioning
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN111368197B (en) Deep learning-based comment recommendation system and method
CN114840705A (en) Combined commodity retrieval method and system based on multi-mode pre-training model
CN115858847B (en) Combined query image retrieval method based on cross-modal attention reservation
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN115222998B (en) Image classification method
Cheng et al. Stack-VS: Stacked visual-semantic attention for image caption generation
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN114418032A (en) Five-modal commodity pre-training method and retrieval system based on self-coordination contrast learning
CN117635275B (en) Intelligent electronic commerce operation commodity management platform and method based on big data
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
Cao et al. An image caption method based on object detection
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank
CN115292533B (en) Cross-modal pedestrian retrieval method driven by visual positioning
Yuan et al. A survey of recent 3D scene analysis and processing methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination