CN115994990A - Three-dimensional model automatic modeling method based on text information guidance - Google Patents


Info

Publication number
CN115994990A
CN115994990A (application CN202211533211.2A)
Authority
CN
China
Prior art keywords
dimensional model
text
network
dimensional
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211533211.2A
Other languages
Chinese (zh)
Inventor
聂为之
陈睿东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202211533211.2A priority Critical patent/CN115994990A/en
Publication of CN115994990A publication Critical patent/CN115994990A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a three-dimensional model automatic modeling method guided by text information, comprising the following steps: preprocessing the text-annotated dataset, and using knowledge graph technology to construct a text-three-dimensional model knowledge graph by defining semantic tag entities, text description entities, three-dimensional model entities, and the associations among them; using the constructed knowledge graph as visual and semantic prior knowledge to retrieve related three-dimensional models and semantic tags from the text content input by a user; constructing a text-to-three-dimensional-model generation network, with a feature fusion network based on a multi-layer Transformer network that fuses the cross-modal prior knowledge with the input text information; and constructing a three-dimensional model generation network based on an implicit field, which predicts the shape and color information of the three-dimensional model from the spatial coordinates and the text features fused with the prior information. The method enables a user to automatically build a three-dimensional model with good geometric and color detail directly from a text description.

Description

Three-dimensional model automatic modeling method based on text information guidance
Technical Field
The invention relates to an automatic modeling method for three-dimensional models, and in particular to an automatic three-dimensional model modeling method guided by text information.
Background
In recent years, with the rapid development of computer hardware and related software industries, many new forms of multimedia visual content have come into wide practical use, and users place ever higher demands on the content and form of visual information display. Among these new forms, the use of three-dimensional model data has grown explosively. Three-dimensional model data captures omnidirectional visual information about objects in the real world; its rich expressiveness and visual characteristics closest to the real world have led to its application across industries such as medical imaging, autonomous driving, video games, virtual house viewing, and virtual fitting. Meanwhile, as hardware such as LiDAR cameras, consumer 3D printers, and AR/VR (augmented reality / virtual reality) devices becomes more convenient and affordable, more and more three-dimensional-model-based products and application scenarios are appearing. In particular, the "metaverse" concept promoted in recent years by Google, NVIDIA, and Meta (formerly Facebook) aims to use 3D modeling technology to create a virtual world on the internet offering users a deeply immersive experience. Driven by these demands, obtaining three-dimensional models as quickly as possible has become a major challenge for three-dimensional-model-based applications.
To alleviate the difficulty of acquiring three-dimensional models, a growing body of work studies techniques for generating three-dimensional models with deep learning. A three-dimensional model generation framework is always built for a particular three-dimensional representation: methods such as 3D-VAE-GAN [1], GRASS [2], and SCORES [3] generate three-dimensional voxels; Tree-GAN [4], SP-GAN [5], and PointFlow [6] generate point cloud data; OctreeGAN [7] generates octree data. The main design idea of these methods is to use a classical generative architecture such as GAN [8] or Flow [9] to learn the distribution of the original data in the training set, so that three-dimensional models of good visual quality can be randomly generated from the sampling space. In addition, reconstructing a three-dimensional model from visual information such as views, scenes, or sketches is a common task in this field; related works include 3D-R2N2 [10], Pix2Vox [11][12], and Pixel2Mesh [13][14], which reconstruct three-dimensional models from views, Mem3D [15], which reconstructs them from scenes, and Sketch3D [16], which reconstructs them from sketches.
Because text is easy to acquire and edit, generating three-dimensional models from text descriptions can further and greatly reduce the difficulty of obtaining model data. Text2Shape [18] first attempted this task: following an idea similar to the text-to-image method GAN-INT-CLS [17], it learns the data distribution of colored three-dimensional voxels with a generative adversarial network. The Text-Guided method [19] aligns text and three-dimensional model features in a latent space learned by an autoencoder, and completes cross-modal generation by exploiting the autoencoder's strong ability to recover three-dimensional information from the latent space.
However, existing methods still have limitations; in particular, when faced with text descriptions that are ambiguous or flexibly worded, they struggle to generate three-dimensional models that accurately match the described details. On one hand, the huge modal gap between text and three-dimensional models makes the design of a deep learning network architecture very difficult, a problem that becomes more pronounced in the absence of large-scale text-annotated three-dimensional model datasets; on the other hand, the flexibility and abstraction of natural language make it hard for a generation network to learn the correct patterns for producing a three-dimensional model from a brief text description.
Disclosure of Invention
The technical problem the invention aims to solve is to overcome the deficiencies of the prior art by providing an automatic three-dimensional model modeling method guided by text information.
The technical scheme adopted by the invention is as follows: a three-dimensional model automatic modeling method guided by text information comprises the following steps:
1) Constructing a text feature extraction network and a three-dimensional model feature extraction network;
2) Performing data preprocessing on the text labeling three-dimensional model dataset;
3) Constructing a text-three-dimensional model knowledge graph, comprising defining entities and relations; the entities comprise three types: semantic tag entities, text description entities, and three-dimensional model entities; the relations comprise three types: description relations, attribute relations, and similarity relations;
4) Stacking Transformer networks to form a prior-knowledge fusion network, and using it to fuse the features of the input text with the retrieved related semantic tags;
5) Forming a three-dimensional model decoder network from multi-layer perceptron networks, divided into a shape decoder and a color decoder; taking the fusion features obtained in step 4) as the inputs of the shape decoder and the color decoder respectively, and predicting the shape and color of the three-dimensional model in the implicit occupancy field space;
6) Performing end-to-end training on a three-dimensional model generation network based on text information guidance by using an Adam optimizer; the three-dimensional model generating network based on text information guidance comprises a three-dimensional model feature extraction network, a three-dimensional model decoder network, a text feature extraction network and a priori knowledge fusion network;
7) Using the trained text-information-guided three-dimensional model generation network to predict the shape and color of the three-dimensional model from the text input by a user, and rendering the prediction into a three-dimensional mesh with the Marching Cubes algorithm to obtain the automatic modeling result. The text-information-guided automatic modeling method assists text-to-three-dimensional-model generation by introducing auxiliary information: even when the input text is vague, the additional prior knowledge supplies complementary, rich three-dimensional structure and color information that safeguards the modeling performance of the final method.
The beneficial effects of the invention are as follows:
1. The invention innovatively provides a text-three-dimensional model knowledge graph construction method that, through a series of data preprocessing and graph construction steps, builds high-confidence cross-modal associations between text and three-dimensional models. From a text description, it can effectively obtain two kinds of prior knowledge: related semantic tags and related three-dimensional models.
2. The invention innovatively provides a Transformer-based multi-modal information fusion module for fusing prior knowledge with the input text information, and designs a prior-knowledge-guided text-to-three-dimensional-model generation framework. The introduced prior knowledge supplements the input text description with rich shape and color information, ensuring good generation quality and addressing the difficulty of generating three-dimensional models from vague text descriptions. With the method of the invention, a user can automatically build a three-dimensional model with good geometric and color detail directly from a text description, replacing tedious manual modeling steps and simplifying the acquisition of three-dimensional models.
Drawings
FIG. 1 is a flow chart of the text information guided three-dimensional model automatic modeling method of the present invention;
FIG. 2 is a schematic diagram of a knowledge graph construction step and a three-dimensional model generation flow;
FIG. 3 is a schematic diagram of a network architecture of a text encoder and a three-dimensional model encoder;
FIG. 4 is a schematic diagram of a network architecture of a three-dimensional model decoder;
FIG. 5 is a schematic diagram of a prior knowledge fusion network architecture;
FIG. 6 is a schematic diagram showing the effect of the method of the present invention compared with the prior art.
Detailed Description
The text-information-guided automatic three-dimensional model modeling method of the present invention is described in detail below with reference to the embodiments and the accompanying drawings.
As shown in FIG. 1, the text-information-guided automatic three-dimensional model modeling method of the present invention comprises the following steps:
1) Constructing a text feature extraction network and a three-dimensional model feature extraction network; wherein:
the text feature extraction network is a text encoder built on a pre-trained BERT network; its input is a text description word sequence of length at most 64, global sentence features are obtained by BERT pooling, and a linear layer outputs a 256-dimensional feature vector;
the three-dimensional model feature extraction network is a three-dimensional model encoder formed by stacking three-dimensional convolution blocks; its input is a three-dimensional voxel grid of resolution 64, and its output is a 256-dimensional feature vector.
2) Performing data preprocessing on the text labeling three-dimensional model dataset; comprising the following steps:
extracting keywords from all text description corpus in the text labeling three-dimensional model dataset by adopting a TextRank algorithm;
performing network training on the text feature extraction network and the three-dimensional model feature extraction network by adopting a contrast learning loss function to obtain joint feature representation of the text features and the three-dimensional model features;
and obtaining the cross-modal feature similarity between each text and each three-dimensional model in the text labeling three-dimensional model data set by adopting a cosine similarity method.
3) Constructing a text-three-dimensional model knowledge graph, comprising defining entities and relations; the entities comprise three types: semantic tag entities, text description entities, and three-dimensional model entities; the relations comprise three types: description relations, attribute relations, and similarity relations; wherein:
the semantic tag entity is composed of keywords extracted after data preprocessing; the text description entity and the three-dimensional model entity are composed of text description and three-dimensional models in a text labeling three-dimensional model dataset.
The description relation is constructed from the correspondence between text descriptions and three-dimensional models in the text-annotated dataset, and involves a text description entity and a three-dimensional model entity. The attribute relation is constructed from the association between the text descriptions corresponding to a three-dimensional model and the extracted keywords, and involves a semantic tag entity and a three-dimensional model entity. The similarity relation is constructed using the cross-modal feature similarities obtained in data preprocessing, by selecting, for the text description features corresponding to a three-dimensional model, the three-dimensional models with the highest cross-modal feature similarity; it involves two three-dimensional model entities.
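The three entity types and three relation types above can be sketched minimally with plain Python dictionaries (all identifiers and sample data here are hypothetical, for illustration only; the patent does not prescribe a storage format):

```python
# Entities: semantic tags (A), text descriptions (T), 3D models (S).
graph = {
    "tags": {"red", "chair", "four legs"},
    "texts": {"t1": "a red chair with four legs"},
    "models": {"s1", "s2"},
    # Relations stored as edge lists.
    "describes": [("t1", "s1")],   # T-S: text description -> 3D model
    "attribute": [("s1", "red"), ("s1", "chair"), ("s1", "four legs")],  # S-A
    "similar": [("s1", "s2", 0.87)],  # S-S edge with similarity weight
}

def tags_of(graph, model_id):
    """Retrieve the semantic-tag prior of a model via its S-A edges."""
    return {tag for (m, tag) in graph["attribute"] if m == model_id}

def similar_models(graph, model_id):
    """Retrieve related 3D models via the weighted S-S edges."""
    return [(b, w) for (a, b, w) in graph["similar"] if a == model_id]
```

Given an input text, retrieval would first match tags and descriptions, then follow these edges to collect the two kinds of prior knowledge.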
4) Stacking Transformer networks to form a prior-knowledge fusion network, and using it to fuse the features of the input text with the retrieved related semantic tags; the feature fusion comprises the following steps:
(4.1) retrieving the two kinds of prior knowledge, related semantic tags and related three-dimensional models, from the text-three-dimensional model knowledge graph according to the text input by the user, using a multi-entity retrieval method;
(4.2) using the text feature extraction network to extract features from the input text and the retrieved related semantic tags, forming the feature sequence of the related semantic tags, and using the three-dimensional model feature extraction network to extract features from the related three-dimensional models, forming the feature sequence of the related three-dimensional models;
(4.3) taking the input text feature (256-dimensional), the feature sequence of the related three-dimensional models (256-dimensional each), and the feature sequence of the related semantic tags (256-dimensional each) as the inputs of the prior-knowledge fusion network;
(4.4) computing over these inputs with a multi-head attention mechanism in the prior-knowledge fusion network, fusing the two kinds of prior knowledge into the input text features in stages; specifically: first, the text features are updated by attention computation with the prior knowledge from the related three-dimensional models, yielding updated text features (256-dimensional); then, a further attention computation combines the prior knowledge from the related semantic tags with the spatial information of the implicit occupancy field, and finally the fusion features incorporating the prior knowledge are output.
5) Forming a three-dimensional model decoder network from multi-layer perceptron networks, divided into a shape decoder and a color decoder; taking the fusion features obtained in step 4) as the inputs of the shape decoder and the color decoder respectively, and predicting the shape and color of the three-dimensional model in the implicit occupancy field space.
The training loss function consists of an autoencoder reconstruction loss based on the L2 norm and a text-to-three-dimensional-model feature alignment loss, also based on the L2 norm.
6) Performing end-to-end training on a three-dimensional model generation network based on text information guidance by using an Adam optimizer; the three-dimensional model generating network based on text information guidance comprises a three-dimensional model feature extraction network, a three-dimensional model decoder network, a text feature extraction network and a priori knowledge fusion network;
7) Using the trained text-information-guided three-dimensional model generation network to predict the shape and color of the three-dimensional model from the text input by a user, and rendering the prediction into a three-dimensional mesh with the Marching Cubes algorithm to obtain the automatic modeling result.
An example is given below:
First, a text-three-dimensional model knowledge graph is constructed so that prior knowledge can be obtained from the text description input by the user. The knowledge graph construction steps and the automatic three-dimensional modeling flow proposed by this embodiment of the invention are described in detail below with reference to FIG. 2:
101: and constructing a text feature extraction network and a three-dimensional model feature extraction network. Wherein a text encoder E is constructed t As a text information feature extraction network, which is constructed based on a pre-trained BERT model, a text description word sequence with the length not more than 64 is input, global sentence features are obtained by using BERT pooling, and 256-dimensional feature vectors are output by using a linear layer. Construction of a three-dimensional model encoder E v As a three-dimensional modelA syndrome extraction network based on a three-dimensional convolution block stack, each convolution block comprising a three-dimensional convolution layer, a batch normalization layer, and a LeakyRelu activation function. E (E) v Is three-dimensional voxel with resolution of 64, and outputs 256-dimensional feature vector. Text encoder E t And a three-dimensional model encoder E v The basic network structure of (a) is shown in figure 3.
102: using TextRank [20] The method is characterized in that the method comprises the steps of extracting keywords from all text description corpus in a text labeling three-dimensional model data set by an algorithm, wherein the main idea of the algorithm is to model the structural relation between each word in the text by using a weighted graph model, and calculate the importance of each word by iterative operation of the graph, so that the function of extracting keywords from sentences is realized.
The TextRank computation proceeds as follows. First, each word in a sentence is regarded as a node of a weighted graph, and V denotes the set of all nodes, i.e., words. $V_r$ denotes the r-th word of the sentence, and $In(V_r)$ and $Out(V_r)$ denote, respectively, the nodes pointing to $V_r$ and the nodes $V_r$ points to. The weight of each word is then computed as:

$$WS(V_r) = (1 - d) + d \cdot \sum_{V_h \in In(V_r)} \frac{W_{hr}}{\sum_{V_k \in Out(V_h)} W_{hk}} \, WS(V_h)$$

where d is a damping coefficient controlling the iteration of the algorithm, set to 0.85. The summation on the right-hand side computes the contribution of each adjacent word to the importance of the current word; $W_{hr}$ denotes the frequency with which the words $V_h$ and $V_r$ co-occur within sentences of a certain length. Iterating this computation yields the final weight $WS(V_r)$ of the r-th word $V_r$.
For semantic tag extraction, keyword extraction with the TextRank algorithm is performed on each text description in the dataset. The obtained keywords are manually screened, mainly retaining nouns and descriptive phrases, which are used to construct the semantic tag entities of the final knowledge graph.
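The TextRank iteration above can be sketched as follows (a simplified undirected co-occurrence variant in plain Python; the window size and iteration count are illustrative choices, not taken from the patent):

```python
from collections import defaultdict

def textrank(words, window=2, d=0.85, iters=30):
    """Iteratively compute word weights WS(V_r) over a co-occurrence graph."""
    # Build co-occurrence weights W[h][r] within a sliding window.
    w = defaultdict(lambda: defaultdict(float))
    for i, a in enumerate(words):
        for b in words[max(0, i - window):i]:
            if a != b:
                w[a][b] += 1.0
                w[b][a] += 1.0
    ws = {v: 1.0 for v in w}  # initial weights
    for _ in range(iters):
        new = {}
        for r in w:
            # Each neighbor h contributes W_hr / sum_k(W_hk) * WS(h).
            s = sum(w[h][r] / sum(w[h].values()) * ws[h]
                    for h in w if w[h][r] > 0)
            new[r] = (1 - d) + d * s
        ws = new
    return ws

words = "red chair with four legs red chair".split()
weights = textrank(words)
top = max(weights, key=weights.get)  # highest-scoring candidate keyword
```

In the patent's pipeline the scored words would then be manually screened, keeping nouns and descriptive phrases as semantic tags.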
103: text feature extraction network (text encoder) E employing contrast learning penalty function t And a three-dimensional model feature extraction network (three-dimensional model encoder) E v Training is performed. The optimization of the features is performed by constructing positive and negative pairs of samples within a training batch of the network. Definition (x) t ,x v ) For a text-three-dimensional model data pair, in a training batch having n pairs of text-three-dimensional model data pairs, the ith data pair is defined as
Figure BDA0003976625920000052
It is considered as a positive pair of samples in this training batch. In contrast, the remaining data pairs +.>
Figure BDA0003976625920000053
And +.>
Figure BDA0003976625920000054
Defined as a negative pair of samples, the effect of this loss function is to maximize the feature distance between the negative pair of samples while minimizing the distance between the features of the positive pair of samples, the loss function of which is expressed symmetrically as:
Figure BDA0003976625920000055
Figure BDA0003976625920000056
the loss function is a symmetrical two-part,
Figure BDA0003976625920000057
and->
Figure BDA0003976625920000058
And respectively comparing learning losses of the ith data pair from the text to the three-dimensional model and from the three-dimensional model to the text. In the formula<a,b>Representing cosine similarity between two features a, bThe calculation method comprises the following steps:
Figure BDA0003976625920000059
based on the definition, the calculated losses are accumulated and averaged for all data in a training batch, and the final comparison learning loss function is written as:
Figure BDA0003976625920000061
where β is a parameter of balance versus learning loss direction, set to 0.5.
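A minimal NumPy sketch of a symmetric contrastive loss of this kind (cosine similarities used directly as logits, no temperature parameter; random stand-in features, so this illustrates the computation rather than the trained encoders):

```python
import numpy as np

def contrastive_loss(f_t, f_v, beta=0.5):
    """Symmetric contrastive loss over a batch of n text / 3D-model feature pairs."""
    t = f_t / np.linalg.norm(f_t, axis=1, keepdims=True)
    v = f_v / np.linalg.norm(f_v, axis=1, keepdims=True)
    sim = t @ v.T                     # sim[i, j] = cosine <f_t^i, f_v^j>
    # Text-to-model direction: softmax over each row, pick the diagonal (positive pair).
    l_tv = -np.log(np.exp(np.diag(sim)) / np.exp(sim).sum(axis=1))
    # Model-to-text direction: softmax over each column.
    l_vt = -np.log(np.exp(np.diag(sim)) / np.exp(sim).sum(axis=0))
    return float(np.mean(beta * l_tv + (1 - beta) * l_vt))

rng = np.random.default_rng(0)
f_t = rng.normal(size=(4, 256))
loss_random = contrastive_loss(f_t, rng.normal(size=(4, 256)))
loss_aligned = contrastive_loss(f_t, f_t)  # perfectly aligned pairs
```

As expected, perfectly aligned pairs yield a lower loss than unrelated random features, since the positive-pair similarity dominates each softmax.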
The cross-modal feature similarity between each text description and each three-dimensional model in the dataset is then computed and sorted to obtain preliminary cross-modal retrieval results. By aggregating the cross-modal retrieval results of all text descriptions of the same three-dimensional model, cross-modal associations with higher confidence can be obtained. These associations are further refined and stored in the knowledge graph later.
104: three entity types are defined for the constructed text-three-dimensional model knowledge graph K: a three-dimensional model entity (S) representing each three-dimensional model in the dataset; a text description entity (T) representing a text description corresponding to each three-dimensional model in the dataset; semantic tag entities (a), each corresponding to a particular semantic information of the three-dimensional model, for example, a "comfortable red chair with four legs" which contains four semantic tags "comfort", "red", "chair", "four legs".
In addition, three relation types are defined. The description relation (T-S) stores the original affiliation between a text description and its three-dimensional model; a T-S edge links a text description entity (T) to its originally corresponding three-dimensional model entity (S). The attribute relation (S-A) stores the affiliation between semantic tag entities (A) and three-dimensional model entities (S); when constructing this relation, a three-dimensional model entity is connected to a semantic tag entity if any text description entity associated with the model contains the semantic tag at the word-segmentation level. The similarity relation (S-S) describes the relevance between three-dimensional model entities. When constructing these edges, the text encoder and three-dimensional model encoder obtained by contrastive learning in the data preprocessing stage are used to extract features from every text description and every three-dimensional model, and their cross-modal feature similarity is computed via cosine similarity. For each three-dimensional model, the three-dimensional models closest in feature similarity to each of its corresponding text descriptions are aggregated, the total similarity is computed, a certain number of the most similar three-dimensional models are selected to construct (S-S) edges, and the computed weights are recorded on the edges as additional information.
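The (S-S) edge construction can be sketched as follows: for one model, the cross-modal retrieval results of all of its text descriptions are pooled, per-candidate similarities are accumulated, and the top-k candidates become weighted edges (plain Python, with hypothetical model identifiers and similarity values):

```python
from collections import defaultdict

def build_ss_edges(desc_sims, k=2):
    """desc_sims: for one model, a list (one entry per text description) of
    {candidate_model: cross-modal similarity}. Returns top-k (model, weight) edges."""
    total = defaultdict(float)
    for sims in desc_sims:
        for cand, s in sims.items():
            total[cand] += s  # aggregate similarity across all descriptions
    ranked = sorted(total.items(), key=lambda kv: -kv[1])
    return ranked[:k]

# Hypothetical retrieval results for the two text descriptions of one chair model.
desc_sims = [
    {"s2": 0.9, "s3": 0.4, "s4": 0.2},
    {"s2": 0.7, "s4": 0.6, "s3": 0.3},
]
edges = build_ss_edges(desc_sims, k=2)
```

Aggregating over all descriptions, rather than trusting any single retrieval, is what gives the edges their higher confidence.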
A knowledge graph constructed from these definitions explicitly stores the three data types (three-dimensional models, text descriptions, and semantic tags) together with explicit associations between them. These different types of association information (edges) can be regarded as different types of prior knowledge, and help locate semantic tags or three-dimensional information associated with an input text. The invention uses the similar three-dimensional models provided by the (S-S) relation and the related semantic tags provided by the (S-A) relation as the prior knowledge guiding text-to-three-dimensional-model generation.
Next, the design of the network framework and the training process are introduced in combination with the specific network structures and computation formulas:
106: a priori knowledge fusion network is constructed based on a transducer network stack, and the basic network structure is shown in figure 4. In obtaining n from knowledge graph p A related three-dimensional model P s And n a Personal related semantic tags P a Then, in the process of generating a three-dimensional model by using text, the two priori knowledge needs to be fused with the original text information. Specifically, for the input text T, a text encoder E is utilized t Extracting original text feature f t The method comprises the steps of carrying out a first treatment on the surface of the For a related three-dimensional model P s Related semantic tags P a Using three-dimensional model encoders E, respectively v And text encoder E t Processing it into a feature sequence
Figure BDA0003976625920000062
And->
Figure BDA0003976625920000063
107: and carrying out feature fusion by utilizing the priori knowledge fusion network. The first stage is to use the characteristic sequence F of the related three-dimensional model p Updating original text feature f t . Set F t 1 ={f t The sequence is the initial Query sequence, F p For Key and Value sequences, a stack-l layer Transformer decoder network is input. Definition F t m For the query sequence input by the m-th layer, the calculation process of each layer can be written as follows:
Q=W Q ·F t m-1 ,K=W K ·F p ,V=W V ·F p
F t m =Multihead(Q,K,V)
F t m =FFN(F t m )
wherein Multihead () represents a multi-head attention mechanism, FFN () represents a feed-forward neural network, and after calculation by a layer I transducer, a feature F fused with a three-dimensional prior is obtained t ′=F t l . The second stage is to update F for features using related semantic tags t '. F is firstly carried out according to the number N of implicit sampling points t ' spread and splice with the position p (N×3 dimension) of the implicit sampling point to obtain N×259 dimension feature
Figure BDA0003976625920000071
S is carried out by utilizing one layer of full-connection network t And F is equal to a Conversion to a Low latitude feature (N×32 dimension) for ease of handling>
Figure BDA0003976625920000072
And->
Figure BDA0003976625920000073
Input of a Transformer network of the same stack of layers, definition +.>
Figure BDA0003976625920000074
For the query sequence input by the nth layer, the calculation process at this stage is as follows:
Figure BDA0003976625920000075
Figure BDA0003976625920000076
Figure BDA0003976625920000077
the output characteristic of the final first layer is
Figure BDA0003976625920000078
The features obtained in this step are as follows t Directly splicing to obtain
Figure BDA0003976625920000079
Is a two-dimensional fusion feature of n×291 dimensions.
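One cross-attention fusion layer of the kind used in both stages can be sketched in NumPy as follows (single-head, untrained random weights; the actual network stacks l multi-head Transformer layers with feed-forward sublayers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, prior_seq, d=256, rng=None):
    """One attention layer: queries attend over a prior-knowledge feature sequence."""
    rng = rng if rng is not None else np.random.default_rng(0)
    w_q = rng.normal(scale=d ** -0.5, size=(d, d))
    w_k = rng.normal(scale=d ** -0.5, size=(d, d))
    w_v = rng.normal(scale=d ** -0.5, size=(d, d))
    q, k, v = query @ w_q, prior_seq @ w_k, prior_seq @ w_v
    attn = softmax(q @ k.T / np.sqrt(d))   # (n_query, n_prior) attention weights
    return attn @ v                        # updated query features

f_t = np.random.default_rng(1).normal(size=(1, 256))  # text feature F_t^1
F_p = np.random.default_rng(2).normal(size=(4, 256))  # related-model features
f_t_updated = cross_attention(f_t, F_p)               # stage-1 style update
```

The same pattern applies in stage two, with the expanded per-point features as queries and the (projected) semantic-tag features as keys and values.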
108: and constructing a three-dimensional model decoder network based on the multi-layer perceptron network, and dividing the three-dimensional model decoder network into a shape decoder and a color decoder. The input of the method is the two-dimensional fusion characteristic output by the priori knowledge fusion module, and the shape and the color of the target three-dimensional model are predicted in the implicit occupation field space. Three-dimensional model decoder d= { D s ,D c The network structure of the figure 5 is shown, wherein the shape encoder and the color encoder have similar multi-layer perceptron (MLP) network structure and are composed of seven fully connected layers, the LeakyRelu is used as an activation function at the output of each layer, and the prediction value of each point space occupation probability and RGB color is obtained through a Sigmoid function at the end of the network.
109:End-to-end training is performed on a three-dimensional model generation network based on text information guidance by using an Adam optimizer. The whole framework of the network comprises a text feature extraction network (text encoder), a three-dimensional model feature extraction network (three-dimensional model encoder), a priori knowledge fusion network and a three-dimensional model decoder network. Introducing a real three-dimensional model GT associated with text input in a training set, using E v Extracting the feature f from it s . Feature sequence F of phase Guan Yuyi tag by using priori knowledge fusion module a And feature sequence F of related three-dimensional model p And f s Fusing to obtain a two-dimensional fusion characteristic S, and inputting the two-dimensional fusion characteristic S into a three-dimensional model encoder D= { D s ,D c }. Introducing a self-encoder reconstruction loss L based on L2 norm ae Performing network training:
L_ae = ||D_s(S) - I_s||_2 + ||D_c(S) × I_s - I_c||_2
where I_s and I_c represent the shape and color information of the sampled implicit points, respectively; during training, the network uses these two as ground-truth values to optimize the prediction results of D_s and D_c. In the loss calculation for color information, I_s is used as a filter, i.e., color is optimized only at implicit points that are actually occupied in space.
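The reconstruction loss can be illustrated with a minimal pure-Python sketch (the function names are ours, not the patent's). The key detail is how I_s masks the color term so that only actually occupied points contribute color residuals:

```python
import math

def l2_norm(residuals):
    # Euclidean (L2) norm over a flat list of residuals
    return math.sqrt(sum(r * r for r in residuals))

def reconstruction_loss(pred_occ, pred_rgb, gt_occ, gt_rgb):
    """L_ae = ||D_s(S) - I_s||_2 + ||D_c(S) * I_s - I_c||_2, where
    gt_occ (I_s) also acts as a filter: predicted colors only count
    at implicit points that are actually occupied (I_s == 1)."""
    shape_term = l2_norm([p - g for p, g in zip(pred_occ, gt_occ)])
    color_residuals = [c_p * occ - c_g
                       for rgb_p, rgb_g, occ in zip(pred_rgb, gt_rgb, gt_occ)
                       for c_p, c_g in zip(rgb_p, rgb_g)]
    return shape_term + l2_norm(color_residuals)

# Perfect predictions at two sampled points (one occupied, one empty)
perfect = reconstruction_loss(
    [1.0, 0.0], [[0.5, 0.5, 0.5], [0.9, 0.9, 0.9]],
    [1.0, 0.0], [[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]])
# Note: the wrong color at the empty point costs nothing, by design
```

Here `perfect` evaluates to 0: the shape residuals vanish, and the color term at the empty point is masked out by I_s = 0.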
In addition, with f_s as the reference feature for the text encoder to learn the feature-space mapping and f_t as the input text feature, a text-to-three-dimensional-model feature alignment loss based on the L2 norm is used so that the text encoder learns the ability to map text features into the three-dimensional feature space:
L_reg = ||f_t - f_s||_2
the overall loss function that is ultimately used for network training is:
L=λ 1 L ae2 L reg
where λ_1 and λ_2 are weight parameters controlling the balance between the two loss functions. After network training is completed, user-defined text information is input; the knowledge graph is used to obtain the two kinds of prior knowledge (related semantic tags and related three-dimensional models); the text encoder and the three-dimensional model encoder extract features from the input text and the prior knowledge, which are fed into the fusion network and the three-dimensional model decoder. After the decoder network predicts the shape and color information of the target three-dimensional model, the Marching Cubes algorithm renders the final three-dimensional mesh model, yielding the automatic three-dimensional modeling result guided by the text information.
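How the two losses combine can be sketched as follows; the λ values in the usage line are arbitrary placeholders, since the patent does not disclose its settings:

```python
import math

def l2_norm(residuals):
    # Euclidean (L2) norm over a flat list of residuals
    return math.sqrt(sum(r * r for r in residuals))

def alignment_loss(f_t, f_s):
    # L_reg = ||f_t - f_s||_2: pulls the text feature toward the
    # reference 3D feature so the text encoder learns the mapping
    return l2_norm([t - s for t, s in zip(f_t, f_s)])

def total_loss(l_ae, l_reg, lambda1, lambda2):
    # L = lambda_1 * L_ae + lambda_2 * L_reg
    return lambda1 * l_ae + lambda2 * l_reg

l_reg = alignment_loss([3.0, 0.0], [0.0, 4.0])   # residual (3, -4) -> norm 5
l = total_loss(2.0, l_reg, 0.5, 0.2)             # 0.5*2.0 + 0.2*5.0 = 2.0
```

At inference time only the text branch is used, so L_reg is what lets f_t stand in for f_s when driving the decoder.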
(III) The schemes of Embodiments 1 and 2 are validated in conjunction with a specific dataset (Table 1), as described in detail below:
the Text-tagged three-dimensional model dataset used to evaluate the proposed method was Text2Shape published in 2018 [18] A data set. The data set contains pairs of partial shapenets [22] And carrying out text labeling on the model to obtain a data set subset, and constructing a text-three-dimensional model knowledge graph and carrying out experiments by using the three-dimensional model contained in the data set and corresponding text description in the embodiment. The subset contains two types of three-dimensional models of chairs and tables, providing three-dimensional voxel binvox files with RGBA color information at resolutions 32 and 64. In terms of text labeling, each three-dimensional model in the dataset has an average of 5 text descriptions, for a total of 75344 text descriptions. The specific three-dimensional model of the dataset and the specific category and quantity distribution of the text description are shown in table 1.
Table 1: Text2Shape dataset category and quantity distribution
[Table 1 is provided as an image in the original publication.]
The method is compared in performance with existing related methods. Because deep learning work on generating three-dimensional models from text remains scarce, the comparison experiments select only Text2Shape [18] and Text-Guided [19], the two works that use the same dataset as this work and share the same application direction (generating three-dimensional models with color details from text). The generation-quality quantization method used in prior work is adopted; the compared quantitative indices include:
1) IoU (Intersection over Union): measures the overlap ratio between the generated three-dimensional model and the real three-dimensional model, serving as a similarity measure between the prediction and the ground truth.
2) EMD (Earth Mover's Distance): measures how well the colors of the generated three-dimensional model match those of the real three-dimensional model.
3) Acc (classification accuracy): the generated three-dimensional models are classified with a three-dimensional model classifier pre-trained on the dataset, to evaluate whether the generation method accurately understands the semantics of the input text and generates a model of the corresponding category.
4) IS (Inception Score): evaluates the generative network from both the quality and diversity of the generated content; the same pre-trained three-dimensional model classifier as for Acc serves as the perception network for its computation.
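Two of these indices are simple enough to sketch directly. The implementations below are illustrative pure-Python versions (the patent's actual evaluation code is not disclosed):

```python
def voxel_iou(pred, gt):
    """IoU between two binary voxel grids, given here as flat 0/1 lists:
    |pred AND gt| / |pred OR gt|."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Two empty grids are conventionally treated as a perfect match
    return inter / union if union else 1.0

def classification_accuracy(pred_labels, gt_labels):
    # Acc: fraction of generated models whose category, as judged by a
    # pre-trained classifier, matches the intended category
    correct = sum(1 for p, g in zip(pred_labels, gt_labels) if p == g)
    return correct / len(gt_labels)
```

For example, grids `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one occupied voxel out of three occupied overall, giving an IoU of 1/3.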
Table 2: Quantitative comparison with related methods
[Table 2 is provided as an image in the original publication.]
Table 2 shows the evaluation results of the relevant quantitative indicators under the final settings of each experiment. First, in terms of the quantitative indicators, the method achieves a clear improvement on IoU, EMD, and Acc. Since the dataset used in the experiments has only two categories, the computed IS value is bounded within the interval [0, 2]; the IS values of the four methods, 1.96, 1.97, 1.96, and 1.97, all reach excellent levels.
In addition, the results are analyzed from the viewpoint of the visual quality of the generated three-dimensional models, as shown in Figure 6. The introduction of prior knowledge allows the generated three-dimensional model to carry more of the visual characteristics described by the input text. As can be seen from the figure, the three-dimensional models generated by this method are less constrained by the original real models than those of Text-Guided, yielding models that are more flexible and better matched to the semantics of the input text; among the methods shown, the visual quality of the generated models is the best. This demonstrates the advancement and effectiveness of the method.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments. The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
References:
[1] Wu J, Zhang C, Xue T, et al. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling[J]. Advances in Neural Information Processing Systems, 2016, 29.
[2] Li J, Xu K, Chaudhuri S, et al. GRASS: Generative recursive autoencoders for shape structures[J]. ACM Transactions on Graphics (TOG), 2017, 36(4): 1-14.
[3] Zhu C, Xu K, Chaudhuri S, et al. SCORES: Shape composition with recursive substructure priors[J]. ACM Transactions on Graphics (TOG), 2018, 37(6): 1-14.
[4] Shu D W, Park S W, Kwon J. 3D point cloud generative adversarial network based on tree structured graph convolutions[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 3859-3868.
[5] Li R, Li X, Hui K H, et al. SP-GAN: Sphere-guided 3D shape generation and manipulation[J]. ACM Transactions on Graphics (TOG), 2021, 40(4): 1-12.
[6] Yang G, Huang X, Hao Z, et al. PointFlow: 3D point cloud generation with continuous normalizing flows[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 4541-4550.
[7] Tatarchenko M, Dosovitskiy A, Brox T. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 2088-2096.
[8] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
[9] Dinh L, Krueger D, Bengio Y. NICE: Non-linear independent components estimation[J]. arXiv preprint arXiv:1410.8516, 2014.
[10] Choy C B, Xu D, Gwak J Y, et al. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction[C]. European Conference on Computer Vision. Springer, Cham, 2016: 628-644.
[11] Xie H, Yao H, Sun X, et al. Pix2Vox: Context-aware 3D reconstruction from single and multi-view images[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 2690-2698.
[12] Xie H, Yao H, Zhang S, et al. Pix2Vox++: Multi-scale context-aware 3D object reconstruction from single and multiple images[J]. International Journal of Computer Vision, 2020, 128(12): 2919-2935.
[13] Wang N, Zhang Y, Li Z, et al. Pixel2Mesh: Generating 3D mesh models from single RGB images[C]. Proceedings of the European Conference on Computer Vision (ECCV). 2018: 52-67.
[14] Wen C, Zhang Y, Li Z, et al. Pixel2Mesh++: Multi-view 3D mesh generation via deformation[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 1042-1051.
[15] Yang S, Xu M, Xie H, et al. Single-view 3D object reconstruction from shape priors in memory[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3152-3161.
[16] Lun Z, Gadelha M, Kalogerakis E, et al. 3D shape reconstruction from sketches via multi-view convolutional networks[C]. 2017 International Conference on 3D Vision (3DV). IEEE, 2017: 67-77.
[17] Reed S, Akata Z, Yan X, et al. Generative adversarial text to image synthesis[C]. International Conference on Machine Learning. PMLR, 2016: 1060-1069.
[18] Chen K, Choy C B, Savva M, et al. Text2Shape: Generating shapes from natural language by learning joint embeddings[C]. Asian Conference on Computer Vision. Springer, Cham, 2018: 100-116.
[19] Liu Z, Wang Y, Qi X, et al. Towards implicit text-guided 3D shape generation[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 17896-17906.
[20] Mihalcea R, Tarau P. TextRank: Bringing order into text[C]. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.
[21] Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations[C]. International Conference on Machine Learning. PMLR, 2020: 1597-1607.
[22] Chang A X, Funkhouser T, Guibas L, et al. ShapeNet: An information-rich 3D model repository[J]. arXiv preprint arXiv:1512.03012, 2015.

Claims (8)

1. A three-dimensional model automatic modeling method based on text information guidance, characterized by comprising the following steps:
1) Constructing a text feature extraction network and a three-dimensional model feature extraction network;
2) Performing data preprocessing on the text labeling three-dimensional model dataset;
3) Constructing a text-three-dimensional model knowledge graph, comprising: defining entities and relationships, wherein the entities comprise three types: semantic tag entities, text description entities, and three-dimensional model entities; and the relationships comprise three types: description relationships, attribute relationships, and similarity relationships;
4) Stacking Transformer networks to form a prior-knowledge fusion network, and using the prior-knowledge fusion network to perform feature fusion on the input text and the retrieved related semantic tags;
5) Forming a three-dimensional model decoder network from multi-layer perceptron networks, the three-dimensional model decoder network being divided into a shape decoder and a color decoder; taking the two-dimensional fusion features obtained in step 4) as the input of the shape decoder and the color decoder, respectively, and predicting the shape and the color of the three-dimensional model, respectively, in the implicit occupancy field space;
6) Performing end-to-end training on a three-dimensional model generation network based on text information guidance by using an Adam optimizer; the three-dimensional model generating network based on text information guidance comprises a three-dimensional model feature extraction network, a three-dimensional model decoder network, a text feature extraction network and a priori knowledge fusion network;
7) Using the trained text-information-guided three-dimensional model generation network to predict the shape and color of the three-dimensional model from the text input by the user, and rendering the prediction result into a three-dimensional mesh with the Marching Cubes algorithm to obtain the automatic modeling result.
2. The automatic modeling method of a three-dimensional model based on text information guidance according to claim 1, wherein constructing the text feature extraction network in step 1) comprises building a text encoder based on a pre-trained BERT network as the text feature extraction network; the text feature extraction network takes as input a text description word sequence of length no more than 64, obtains the sentence-level global feature via BERT pooling, and outputs a 256-dimensional feature vector through one linear layer.
3. The text information guidance-based three-dimensional model automatic modeling method according to claim 1, wherein the three-dimensional model feature extraction network constructed in step 1) is a three-dimensional model encoder formed by stacking three-dimensional convolution blocks; it takes three-dimensional voxels of resolution 64 as input and outputs a 256-dimensional feature vector.
4. The text information guidance-based three-dimensional model automatic modeling method according to claim 1, wherein the step 2) includes:
extracting keywords from all text description corpus in the text labeling three-dimensional model dataset by adopting a TextRank algorithm;
performing network training on the text feature extraction network and the three-dimensional model feature extraction network by adopting a contrast learning loss function to obtain joint feature representation of the text features and the three-dimensional model features;
and obtaining the cross-modal feature similarity between each text and each three-dimensional model in the text labeling three-dimensional model data set by adopting a cosine similarity method.
5. The text information guidance-based three-dimensional model automatic modeling method according to claim 1, wherein the semantic tag entity in the step 3) is composed of keywords extracted after data preprocessing; the text description entity and the three-dimensional model entity are composed of text description and three-dimensional models in a text labeling three-dimensional model dataset.
6. The automatic modeling method of a three-dimensional model based on text information guidance according to claim 1, wherein the description relation in the step 3) is constructed by text description in a text-annotated three-dimensional model dataset and a three-dimensional model correspondence relation, and relates to a text description entity and a three-dimensional model entity; the attribute relationship is constructed by the correlation between text description corresponding to a three-dimensional model in a text labeling three-dimensional model dataset and extracted keywords, and relates to a semantic tag entity and a three-dimensional model entity; the similarity relation is constructed by selecting a three-dimensional model with highest cross-modal feature similarity according to text description features corresponding to the three-dimensional model by utilizing the cross-modal feature similarity obtained in data preprocessing, and the similarity relation relates to two three-dimensional model entities.
7. The text-based guided three-dimensional model automatic modeling method of claim 1, wherein the feature fusion of step 4) comprises:
(4.1) using a multi-entity retrieval method to retrieve, from the text-three-dimensional model knowledge graph, the two kinds of prior knowledge (related semantic tags and related three-dimensional models) according to the text input by the user;
(4.2) using the text feature extraction network to extract features from the input text and the retrieved related semantic tags, forming the feature sequence of the related semantic tags, and using the three-dimensional model feature extraction network to extract features from the related three-dimensional models, forming the feature sequence of the related three-dimensional models;
(4.3) taking the input text features, the feature sequence of the related three-dimensional models, and the feature sequence of the related semantic tags as inputs of the prior-knowledge fusion network;
(4.4) computing over the inputs with a multi-head attention mechanism in the prior-knowledge fusion network, fusing the two kinds of prior knowledge into the input text features in stages; specifically: first, the text features are updated with the prior knowledge from the related three-dimensional models via attention computation to obtain updated text features; then, attention is computed with the prior knowledge from the related semantic tags, combined with the spatial information of the implicit occupancy field; finally, the two-dimensional fusion features incorporating the prior knowledge are output.
8. The text-based guided three-dimensional model automatic modeling method according to claim 1, wherein the loss function for the training of step 5) comprises both an autoencoder reconstruction loss based on the L2 norm and a text-three-dimensional-model feature alignment loss based on the L2 norm.
CN202211533211.2A 2022-12-02 2022-12-02 Three-dimensional model automatic modeling method based on text information guidance Pending CN115994990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211533211.2A CN115994990A (en) 2022-12-02 2022-12-02 Three-dimensional model automatic modeling method based on text information guidance

Publications (1)

Publication Number Publication Date
CN115994990A true CN115994990A (en) 2023-04-21

Family

ID=85994578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211533211.2A Pending CN115994990A (en) 2022-12-02 2022-12-02 Three-dimensional model automatic modeling method based on text information guidance

Country Status (1)

Country Link
CN (1) CN115994990A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934970A (en) * 2023-07-24 2023-10-24 天津大学 Medical single view three-dimensional reconstruction device based on priori knowledge guidance
CN117236341A (en) * 2023-09-21 2023-12-15 东方经纬项目管理有限公司 Whole process engineering consultation integrated system
CN117351173A (en) * 2023-12-06 2024-01-05 北京飞渡科技股份有限公司 Three-dimensional building parameterization modeling method and device based on text driving
CN117351173B (en) * 2023-12-06 2024-03-19 北京飞渡科技股份有限公司 Three-dimensional building parameterization modeling method and device based on text driving
CN117853678A (en) * 2024-03-08 2024-04-09 陕西天润科技股份有限公司 Method for carrying out three-dimensional materialization transformation on geospatial data based on multi-source remote sensing
CN117853678B (en) * 2024-03-08 2024-05-17 陕西天润科技股份有限公司 Method for carrying out three-dimensional materialization transformation on geospatial data based on multi-source remote sensing

Similar Documents

Publication Publication Date Title
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
WO2021223567A1 (en) Content processing method and apparatus, computer device, and storage medium
CN115994990A (en) Three-dimensional model automatic modeling method based on text information guidance
CN113657124B (en) Multi-mode Mongolian translation method based on cyclic common attention transducer
Zhang et al. A survey of 3D indoor scene synthesis
CN111079532A (en) Video content description method based on text self-encoder
CN110599592B (en) Three-dimensional indoor scene reconstruction method based on text
Li et al. Residual attention-based LSTM for video captioning
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN111368197B (en) Deep learning-based comment recommendation system and method
CN114840705A (en) Combined commodity retrieval method and system based on multi-mode pre-training model
CN115858847B (en) Combined query image retrieval method based on cross-modal attention reservation
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN115222998B (en) Image classification method
Cheng et al. Stack-VS: Stacked visual-semantic attention for image caption generation
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN114418032A (en) Five-modal commodity pre-training method and retrieval system based on self-coordination contrast learning
CN117635275B (en) Intelligent electronic commerce operation commodity management platform and method based on big data
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
Cao et al. An image caption method based on object detection
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank
CN115292533B (en) Cross-modal pedestrian retrieval method driven by visual positioning
Yuan et al. A survey of recent 3D scene analysis and processing methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination