CN117152363A - Three-dimensional content generation method, device and equipment based on pre-training language model

Info

Publication number: CN117152363A
Application number: CN202311413646.8A
Authority: CN
Inventors: 范宝余; 杜国光; 赵雅倩; 王丽; 郭振华; 李仁刚
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2023-10-30
Filing date: 2023-10-30
Publication date: 2023-12-01
Anticipated expiration: 2043-10-30
Also published as: CN117152363B

Abstract

The invention discloses a three-dimensional content generation method, device and equipment based on a pre-trained language model, applied to the technical field of artificial intelligence. The method comprises: generating part name text data and three-dimensional part point cloud data based on three-dimensional content description data carried by a three-dimensional content generation request; determining the spatial position of each three-dimensional part according to the three-dimensional content description data, the part name text data and the three-dimensional part point cloud data; calculating offset information of each point of each three-dimensional part according to the three-dimensional content description data and the fused point cloud data; adjusting the spatial position of the corresponding three-dimensional part based on the offset information; and generating three-dimensional content from the adjusted three-dimensional parts. The invention addresses the problem that the related art cannot obtain high-quality three-dimensional content results, and effectively improves the quality of three-dimensional content generation.

Description

Three-dimensional content generation method, device and equipment based on pre-training language model

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a device for generating three-dimensional content based on a pre-training language model.

Background

With the rapid development and increasing popularity of artificial intelligence technology, AIGC (Artificial Intelligence Generated Content) technology, which uses artificial intelligence to automatically generate digital content in various modalities, has emerged. In particular, generating high-quality, diverse 3D (three-dimensional) content as 3D digital assets through AIGC is widely applied in technical fields such as virtual reality and augmented reality.

In the process of generating 3D content based on text conditions, the 3D content generated by the related art is clearly lacking in detail and cannot meet users' high quality requirements for 3D content results.

In view of this, generating high quality 3D content is a technical problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The invention provides a three-dimensional content generation method and device based on a pre-training language model, electronic equipment and a readable storage medium, which can generate high-quality 3D content.

In order to solve the technical problems, the invention provides the following technical scheme:

in one aspect, the present invention provides a method for generating three-dimensional content based on a pre-training language model, including:

generating part name text data and three-dimensional part point cloud data based on three-dimensional content description data carried by the three-dimensional content generation request;

Determining the space position of each three-dimensional part according to the three-dimensional content description data, the part name text data and the three-dimensional part point cloud data, and moving each three-dimensional part to a corresponding space position to obtain initial fusion point cloud data;

sampling the initial fusion point cloud data in a fixed number to obtain fusion point cloud data;

calculating offset information of each point of each three-dimensional component according to the three-dimensional content description data and the fusion point cloud data;

adjusting the space positions of the corresponding three-dimensional components based on the offset information, and generating three-dimensional contents according to the adjusted three-dimensional components;

wherein the text features of the three-dimensional content description data and of the part name text data are extracted using a text feature extraction model; the text feature extraction model is a network model obtained by fine-tuning a pre-trained language model on a text feature extraction task using a three-dimensional content sample data set.

In a first exemplary embodiment, the generating, based on the three-dimensional content description data carried by the three-dimensional content generation request, the part name text data and the three-dimensional part point cloud data includes:

Generating part name text data based on the three-dimensional content description data carried by the three-dimensional content generation request;

the target noise point cloud data, the three-dimensional content description data and the part name text data are used as data to be processed together;

and obtaining the three-dimensional part point cloud data carrying the color parameters in a mode of removing noise in the data to be processed in each iteration process.
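The iterative denoising in this embodiment can be sketched as follows. This is a minimal illustration only: it assumes a standard DDPM-style ancestral sampler with a linear noise schedule, which the text does not specify. The names `make_schedule`, `predict_noise` and `denoise`, the step count, and the six-channel point layout (x, y, z, r, g, b) are hypothetical. The `predict_noise` stub stands in for the trained component generation network conditioned on the description and the part name.

```python
import math
import random

def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) with cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (steps - 1) for i in range(steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars

def predict_noise(points, t, description, part_name):
    """Hypothetical stand-in for the trained network: predicts zero noise."""
    return [[0.0] * len(p) for p in points]

def denoise(num_points, description, part_name, steps=50, seed=0):
    """Start from Gaussian noise and remove predicted noise at every iteration."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(steps)
    # Each point carries xyz coordinates plus rgb color channels (assumed layout).
    x = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(num_points)]
    for t in reversed(range(steps)):
        eps = predict_noise(x, t, description, part_name)
        coef = (1.0 - alphas[t]) / math.sqrt(1.0 - alpha_bars[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no fresh noise at the last step
        x = [
            [(xi - coef * ei) / math.sqrt(alphas[t]) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(p, e)]
            for p, e in zip(x, eps)
        ]
    return x
```

With a real noise predictor in place of the stub, the returned list would be the colored point cloud of one part.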

In a second exemplary embodiment, before the removing noise in the data to be processed in each iteration process to obtain the three-dimensional component point cloud data carrying the color parameters, the method further includes:

pre-building a two-channel condition control component generation network;

training the two-channel condition control component generation network using a three-dimensional content sample data set;

wherein the two-channel condition control component generation network comprises an overall text descriptor extraction network, a component text descriptor extraction network, a diffusion time processing network, a point cloud data processing network, a sequence data processing neural network and a predicted point cloud noise output network;

the overall text descriptor extraction network is used for generating overall control descriptor data of the three-dimensional content description data; the component text descriptor extraction network is used for generating component control descriptor data of the part name text data; the diffusion time processing network is used for generating diffusion time descriptor data; the point cloud data processing network is used for generating point cloud data descriptor data; the sequence data processing neural network is used for learning feature representations of the overall control descriptor data, the component control descriptor data, the diffusion time descriptor data and the point cloud data descriptor data; and the predicted point cloud noise output network is used for predicting the point cloud noise data at the current moment based on the feature representations; the sequence data processing neural network is a pre-trained language model.

In a third exemplary embodiment, the overall text descriptor extraction network includes an overall text input, an overall text feature extraction model, and an overall control sub-generation model;

the overall text input is used for inputting the three-dimensional content description data; the overall text feature extraction model is used for extracting text features of the three-dimensional content description data as overall initial descriptor data; the overall control sub-generation model is used for refining the features in the overall initial descriptor data to obtain fine-grained overall text features, which serve as the overall control descriptor data; the text feature extraction model includes the trained overall text feature extraction model.

In a fourth exemplary embodiment, the component text descriptor extraction network includes a component text input, a component text feature extraction model, and a component control sub-generation model;

the component text input is used for inputting the part name text data; the component text feature extraction model is used for extracting text features of the part name text data as component initial descriptor data; the component control sub-generation model is used for refining the features in the component initial descriptor data to obtain fine-grained component text features, which serve as the component control descriptor data; the text feature extraction model includes the trained component text feature extraction model.

In a fifth exemplary embodiment, the predicted point cloud noise output network includes a noise point cloud data generation model;

and the noise point cloud data generation model is used for obtaining point cloud noise data added by each diffusion through carrying out regression analysis on the diffusion sub-output description sub-data output by the sequence data processing neural network.

In a sixth exemplary embodiment, the training of the two-channel condition control component generation network includes:

calculating according to the three-dimensional point cloud data without noise in the three-dimensional content sample data set to obtain sample noise point cloud data;

acquiring training sample data corresponding to the three-dimensional content description data and training sample data corresponding to the part name text data according to the three-dimensional content sample data set;

setting the overall text descriptor extraction network to an inactive state, setting the component text descriptor extraction network to an active state, and jointly training the component text descriptor extraction network using the sample noise point cloud data, the diffusion time, and the training sample data corresponding to the part name text data;

freezing the trained component text descriptor extraction network, setting the overall text descriptor extraction network to an active state, and jointly training the overall text descriptor extraction network using the sample noise point cloud data, the diffusion time, and the training sample data corresponding to the three-dimensional content description data;

and unfreezing the trained component text descriptor extraction network, and retraining the component text descriptor extraction network and the overall text descriptor extraction network using the sample noise point cloud data, the diffusion time, the training sample data corresponding to the three-dimensional content description data, and the training sample data corresponding to the part name text data.
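The three training stages above can be written down as a small schedule table. The sketch below only records the state of the two text branches in each stage as the text describes them; the `Stage` type, the stage names and the state strings are hypothetical, and no actual training is performed.

```python
from dataclasses import dataclass

# Allowed branch states in this sketch (hypothetical labels):
# "inactive" = switched off, "frozen" = weights fixed, "training" = weights updated.

@dataclass(frozen=True)
class Stage:
    name: str
    overall_branch: str    # state of the overall text branch
    component_branch: str  # state of the component text branch

def training_stages():
    """Stage order as described: component first, then overall, then both."""
    return [
        Stage("component-only", overall_branch="inactive", component_branch="training"),
        Stage("overall-only", overall_branch="training", component_branch="frozen"),
        Stage("joint-retraining", overall_branch="training", component_branch="training"),
    ]

def trainable_branches(stage):
    """List the branches whose weights are updated in the given stage."""
    names = []
    if stage.overall_branch == "training":
        names.append("overall")
    if stage.component_branch == "training":
        names.append("component")
    return names
```

In a deep learning framework, "frozen" would typically be realized by disabling gradient updates for that branch; that mapping is an assumption and is not stated in the text.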

In a seventh exemplary embodiment, the calculating sample noise point cloud data according to the three-dimensional point cloud data without noise in the three-dimensional content sample data set includes:

invoking a noise point cloud data calculation relation to calculate the sample noise point cloud data; the noise point cloud data calculation relation is:

[formula not preserved in the source text];

where $x_t$ is the sample noise point cloud data at time $t$; $x_0$ is the noise-free three-dimensional point cloud data in the three-dimensional content sample data set; the weighting term is a preset noise weighting factor that varies with time $t$; and the noise term is noise sampled from a standard normal distribution.
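The relation itself is not legible in this text. As an illustration only, the sketch below assumes it takes the widely used DDPM forward form, x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps, where a_t plays the role of the time-dependent noise weighting factor and eps is standard normal noise. The symbols a_t and eps and the function name `add_noise` are introduced here for illustration and are not taken from the source.

```python
import math
import random

def add_noise(x0, alpha_bar_t, rng):
    """Return (x_t, eps) for a noise-free cloud x0 under the assumed DDPM form."""
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in (0, 1]")
    signal = math.sqrt(alpha_bar_t)        # weight of the clean data
    noise = math.sqrt(1.0 - alpha_bar_t)   # weight of the Gaussian noise
    eps = [[rng.gauss(0.0, 1.0) for _ in p] for p in x0]
    xt = [[signal * v + noise * e for v, e in zip(p, ep)] for p, ep in zip(x0, eps)]
    return xt, eps
```

Under this assumption a weighting factor of 1 leaves the cloud unchanged, and values close to 0 make the sample almost pure noise.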

In an eighth exemplary embodiment, the training of the two-channel condition control component generation network includes:

taking the sample noise point cloud data, the diffusion time, the training sample data corresponding to the three-dimensional content description data and the training sample data corresponding to the part name text data as input, taking the predicted point cloud noise data added at each diffusion step as output, and invoking a component generation loss function relation to train the two-channel condition control component generation network;

wherein the component generation loss function relation is:

[formula not preserved in the source text];

where the first noise term is noise sampled from a standard normal distribution; the second noise term is the point cloud noise added at each diffusion step; $x_t$ is the sample noise point cloud data at time $t$; $C_{full}$ is the overall three-dimensional content description data; and $C_{part}$ is the part name text description data.
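The loss relation is likewise not legible in this text. A common choice that is consistent with the listed symbols is the noise-prediction objective, in which the network receives x_t, the diffusion time, C_full and C_part and is penalized by the squared difference between the sampled noise and the predicted noise. The sketch below assumes that form; the function name and the use of a mean rather than a sum are assumptions.

```python
def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between sampled and predicted per-point noise."""
    if len(eps_true) != len(eps_pred):
        raise ValueError("noise arrays must have the same number of points")
    total, count = 0.0, 0
    for p_true, p_pred in zip(eps_true, eps_pred):
        for a, b in zip(p_true, p_pred):
            total += (a - b) ** 2
            count += 1
    return total / count
```

Here `eps_pred` would come from the generation network evaluated on the noisy cloud and both text conditions.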

In a ninth exemplary embodiment, before the training of the two-channel condition control component generation network, the method further includes:

aligning all three-dimensional part point clouds of each three-dimensional content item in the three-dimensional content sample data set to a common target coordinate system, and normalizing the size of each three-dimensional content item to a preset standard size.
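One simple way to realize such a size normalization is sketched below. It assumes that the target coordinate system is centered on the bounding box of the content and that "standard size" means the longest bounding-box side; both are assumptions, as is the function name `normalize_cloud`.

```python
def normalize_cloud(points, standard_size=1.0):
    """Center a point cloud at the origin and scale its longest extent to standard_size."""
    dims = range(len(points[0]))
    mins = [min(p[d] for p in points) for d in dims]
    maxs = [max(p[d] for p in points) for d in dims]
    centers = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    extent = max(hi - lo for lo, hi in zip(mins, maxs))
    scale = standard_size / extent if extent > 0 else 1.0  # guard degenerate clouds
    return [[(p[d] - centers[d]) * scale for d in dims] for p in points]
```

The function is meant to be applied to the points of a whole content item, with all parts together, so that the relative placement of the parts is preserved.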

In a tenth exemplary embodiment, the determining the spatial position of each three-dimensional component according to the three-dimensional content description data, the component name text data, and the three-dimensional component point cloud data includes:

acquiring high-dimensional overall text features of the three-dimensional content description data;

acquiring high-dimensional part text features of the part name text data;

acquiring three-dimensional part global features of the three-dimensional part point cloud data;

combining the high-dimensional overall text features, the high-dimensional part text features and the three-dimensional part global features, and successively reducing the dimension of the combined feature to obtain the three-dimensional displacement and the three-dimensional scale of each three-dimensional part;

and determining the spatial position of the corresponding three-dimensional part according to the three-dimensional scale and the three-dimensional displacement of each three-dimensional part.
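Once a displacement and a scale are available for each part, moving the parts into place and merging them can be sketched as follows. The order of operations (scale first, then translate), the packing of the six values as (dx, dy, dz, sx, sy, sz) and the function names are assumptions made for illustration.

```python
def place_part(points, scale, displacement):
    """Apply a per-axis scale, then a translation, to one part's xyz points."""
    return [[c * s + d for c, s, d in zip(p, scale, displacement)] for p in points]

def coarse_fuse(parts, poses):
    """Concatenate all placed parts into one initial fused point cloud.

    parts: list of point lists, one per part.
    poses: list of 6-vectors (dx, dy, dz, sx, sy, sz), one per part (assumed order).
    """
    fused = []
    for points, pose in zip(parts, poses):
        displacement, scale = pose[:3], pose[3:]
        fused.extend(place_part(points, scale, displacement))
    return fused
```

The result corresponds to the initial fused point cloud that is later resampled to a fixed number of points.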

In an eleventh exemplary embodiment, the determining a spatial position of each three-dimensional part according to the three-dimensional content description data, the part name text data, and the three-dimensional part point cloud data includes:

training a coarse-granularity fusion network of the three-dimensional part in advance; the three-dimensional part coarse granularity fusion network comprises a text feature extraction sub-network, a feature joint layer and a joint feature processing sub-network;

inputting the three-dimensional content description data, the part name text data and the three-dimensional part point cloud data as input data into the three-dimensional part coarse granularity fusion network;

and taking the output data of the coarse-granularity fusion network of the three-dimensional components as the spatial position of each three-dimensional component.

In a twelfth exemplary embodiment, the text feature extraction sub-network includes a global text high-dimensional feature extraction model, a component text high-dimensional feature extraction model, and a global feature extraction model;

the overall text high-dimensional feature extraction model is used for extracting high-dimensional overall text features of the three-dimensional content description data; the part text high-dimensional feature extraction model is used for extracting high-dimensional part text features of the part name text data; the global feature extraction model comprises a first, a second, a third and a fourth multilayer perceptron layer connected in sequence, and the output features of the fourth multilayer perceptron layer are pooled to obtain the three-dimensional part global features of the three-dimensional part point cloud data; the text feature extraction model includes the trained overall text high-dimensional feature extraction model and the trained part text high-dimensional feature extraction model.
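The global feature extraction model described above resembles a shared per-point network followed by pooling. The sketch below illustrates that pattern with untrained random weights. The use of max pooling, the ReLU activation and any layer widths are assumptions, since the text only states that four layers are connected in sequence and that the output is pooled.

```python
import random

def make_layer(in_dim, out_dim, rng):
    """Random weights for one shared per-point layer (untrained, for illustration)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases

def apply_layer(layer, vec):
    """Linear map followed by ReLU."""
    weights, biases = layer
    return [max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, biases)]

def global_feature(points, layers):
    """Run every point through the same layers, then max-pool over all points."""
    feats = []
    for p in points:
        h = p
        for layer in layers:
            h = apply_layer(layer, h)
        feats.append(h)
    return [max(col) for col in zip(*feats)]
```

Because the pooling takes a maximum over points, the resulting feature does not depend on the order in which the points are stored.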

In a thirteenth exemplary embodiment, the joint feature processing sub-network includes a fifth multilayer perceptron layer, a sixth multilayer perceptron layer, a seventh multilayer perceptron layer, and an eighth multilayer perceptron layer;

the joint feature processing sub-network is used for successively reducing the dimension of the combined feature to 1×6 through the fifth, sixth, seventh and eighth multilayer perceptron layers, so as to obtain the three-dimensional displacement and the three-dimensional scale of the three-dimensional part.

In a fourteenth exemplary embodiment, the calculating offset information of each point of each three-dimensional part according to the three-dimensional content description data and the fused point cloud data includes:

sampling each three-dimensional part individually, based on the total number of three-dimensional parts, the actual number of points of each three-dimensional part, and the total number of points across the three-dimensional part point clouds, to obtain fused point cloud data with a fixed number of points.

In a fifteenth exemplary embodiment, the individually sampling each three-dimensional part based on the total number of three-dimensional parts according to the actual point cloud data of each three-dimensional part and the total point cloud number of the three-dimensional part point clouds includes:

Calculating the current point cloud number according to the total number of the three-dimensional parts and the total point cloud number of the three-dimensional part point clouds;

calculating the number of sampling point clouds based on a preset sampling factor and the total number of the three-dimensional parts;

determining a sampling mode adopted by each three-dimensional component by comparing the current point cloud number with the sampling point cloud number;

wherein the sampling modes include an up-sampling mode, a down-sampling mode, and a no-sampling mode.

In a sixteenth exemplary embodiment, the determining the sampling pattern adopted by each three-dimensional component by comparing the current point cloud number and the sampling point cloud number includes:

sampling the points of each three-dimensional part from its actual number of points to $n_{part} \cdot m / k$ points, where $n_{part}$ is the actual number of points of the three-dimensional part, $m$ is the preset sampling factor, and $k$ is the total number of three-dimensional parts;

if $k = m$, the corresponding three-dimensional part is not sampled;

if $k > m$, the corresponding three-dimensional part is randomly down-sampled;

if $k < m$, the corresponding three-dimensional part is randomly up-sampled.
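The sampling rule can be sketched as follows. The target count n_part * m / k and the three cases follow the text. The integer rounding, up-sampling by duplicating randomly chosen points, and the function names are assumptions.

```python
import random

def target_count(n_part, m, k):
    """Number of points a part should have after sampling: n_part * m / k (rounded down)."""
    return n_part * m // k

def resample_part(points, m, k, rng):
    """Resample one part according to the comparison between k and m."""
    n_target = target_count(len(points), m, k)
    if k == m:
        return list(points)                     # no sampling
    if k > m:
        return rng.sample(points, n_target)     # random down-sampling
    extra = [rng.choice(points) for _ in range(n_target - len(points))]
    return list(points) + extra                 # random up-sampling by duplication
```

If every part starts with the same number of points n, the k resampled parts together hold about n * m points, which is how a fixed total can be reached regardless of the number of parts.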

In a seventeenth exemplary embodiment, the calculating offset information of each point of each three-dimensional part according to the three-dimensional content description data and the fused point cloud data includes:

Training a three-dimensional part fine granularity fusion network in advance; the three-dimensional component fine granularity fusion network comprises a point cloud global feature extraction sub-network, an overall text feature extraction sub-network and an offset prediction sub-network; the text feature extraction model comprises a trained integral text feature extraction sub-network;

inputting the three-dimensional content description data and the fusion point cloud data as input data into the three-dimensional component fine granularity fusion network;

and taking the output data of the three-dimensional component fine granularity fusion network as offset information of each point of each three-dimensional component.

In an eighteenth exemplary embodiment, the point cloud global feature extraction sub-network includes a tenth, an eleventh, a twelfth and a thirteenth multilayer perceptron layer connected in sequence, and the output features of the thirteenth multilayer perceptron layer are pooled to obtain the global features of the fused point cloud data of the three-dimensional parts.

In a nineteenth exemplary embodiment, the offset prediction sub-network includes a feature mixing layer, a fourteenth multilayer perceptron layer, a fifteenth multilayer perceptron layer, a sixteenth multilayer perceptron layer, and a seventeenth multilayer perceptron layer;

the feature mixing layer is used for mixing the copied output features of the twelfth multilayer perceptron layer, the copied three-dimensional part global features, and the copied overall text features output by the overall text feature extraction sub-network;

the mixed features are reduced in dimension through the fourteenth, fifteenth, sixteenth and seventeenth multilayer perceptron layers to obtain the offset information of each point of each three-dimensional part.
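One plausible reading of the feature mixing step is that the global point cloud feature and the overall text feature are copied onto every point and concatenated with that point's own feature. The sketch below assumes this reading; the concatenation order and the name `mix_features` are hypothetical.

```python
def mix_features(point_feats, global_feat, text_feat):
    """Copy the global and text vectors onto every point and concatenate them."""
    return [list(f) + list(global_feat) + list(text_feat) for f in point_feats]
```

Each mixed row would then be reduced by the subsequent layers to a three-value offset for that point.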

In a twentieth exemplary embodiment, the training process of the three-dimensional component fine-grained fusion network includes:

invoking a fine granularity loss function relation, and training the three-dimensional part fine granularity fusion network; the fine grain loss function relationship is:

$L = \alpha \, \lVert f_{description\_gt} - f_{description\_pred} \rVert^2 + \beta \, \lVert \Delta T \rVert^2$;

where $L$ denotes the fine-grained loss function, $\alpha$ is a first weight parameter, $\beta$ is a second weight parameter, $f_{description\_gt}$ is the descriptor data of the overall text features of the three-dimensional content description data, $f_{description\_pred}$ is the descriptor data of the predicted category features, and $\Delta T$ denotes the offset information;

the three-dimensional part fine-grained fusion network further includes a point cloud recognition sub-network used during training; the point cloud recognition sub-network is used for performing point cloud category recognition on the corresponding three-dimensional point cloud sample data and extracting the descriptor data of the predicted category features, which serves as the true value for the per-point offsets of each three-dimensional part.
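The fine-grained loss can be evaluated directly from its two terms. The sketch below assumes that both norms are L2 norms and that the offset norm is taken over all points together; the default weights and the function name are placeholders.

```python
def fine_grained_loss(f_gt, f_pred, offsets, alpha=1.0, beta=1.0):
    """L = alpha * ||f_gt - f_pred||^2 + beta * ||dT||^2 with squared L2 norms."""
    feature_term = sum((a - b) ** 2 for a, b in zip(f_gt, f_pred))
    offset_term = sum(c ** 2 for row in offsets for c in row)  # all points, all axes
    return alpha * feature_term + beta * offset_term
```

The first term pulls the descriptor of the adjusted cloud toward the text descriptor, while the second term keeps the per-point offsets small.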

In a twenty-first exemplary embodiment, the generating, based on the three-dimensional content description data carried by the three-dimensional content generation request, part name text data and three-dimensional part point cloud data includes:

fine-tuning a pre-trained language model in advance on a question-answering task, using a three-dimensional content question-answer sample set, to obtain a text question-answering model;

acquiring three-dimensional content description data according to the three-dimensional content generation request;

acquiring the name of the three-dimensional content according to the three-dimensional content description data;

and obtaining the text data of the part names described in the text form by using the text question-answering model.

In a twenty-second exemplary embodiment, the obtaining text data of part names described in text form using the text question-answering model includes:

according to the three-dimensional content part question and the auxiliary question input by the user, generating a question set and an auxiliary question set;

inputting the auxiliary question set into the text question-answering model, and obtaining auxiliary question answers corresponding to the auxiliary question set by using the text question-answering model;

and inputting the question set and the auxiliary question answers to the text question-answer model to obtain the part name text data.
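The two-pass questioning can be sketched as below. The `qa_model` argument is a hypothetical callable that maps a prompt string to an answer string and stands in for the fine-tuned text question-answering model. The question templates and the comma-separated answer format are also assumptions.

```python
def get_part_names(qa_model, content_name, part_question, auxiliary_questions):
    """Answer the auxiliary questions first, then ask the part question with that context."""
    aux_answers = [qa_model(q.format(name=content_name)) for q in auxiliary_questions]
    context = " ".join(aux_answers)
    prompt = (context + " " + part_question.format(name=content_name)).strip()
    answer = qa_model(prompt)
    # Assume the model lists part names separated by commas.
    return [part.strip() for part in answer.split(",") if part.strip()]
```

For example, with the auxiliary question "What is a {name} used for?" and the part question "Which parts does a {name} consist of?", the answers to the former are prepended to the latter before the model is queried again.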

In another aspect, the present invention provides a three-dimensional content generating apparatus based on a pre-training language model, including:

the data generation module is used for generating part name text data and three-dimensional part point cloud data based on the three-dimensional content description data carried by the three-dimensional content generation request; the text features of the three-dimensional content description data and of the part name text data are extracted using a text feature extraction model; the text feature extraction model is a network model obtained by fine-tuning a pre-trained language model on a text feature extraction task using a three-dimensional content sample data set;

the space position determining module is used for determining the space position of each three-dimensional part according to the three-dimensional content description data, the part name text data and the three-dimensional part point cloud data, and moving each three-dimensional part to the corresponding space position so as to obtain initial fusion point cloud data;

the sampling module is used for sampling the initial fusion point cloud data in a fixed number to obtain fusion point cloud data;

the offset calculation module is used for calculating offset information of each point of each three-dimensional component according to the three-dimensional content description data and the fusion point cloud data;

And the content generation module is used for adjusting the space position of the corresponding three-dimensional component based on the offset information and generating three-dimensional content according to each adjusted three-dimensional component.

The invention also provides an electronic device comprising a processor for implementing the steps of the three-dimensional content generation method based on a pre-trained language model as described in any one of the preceding claims when executing a computer program stored in a memory.

The invention finally provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the three-dimensional content generation method based on a pre-trained language model as described in any of the preceding claims.

The technical scheme provided by the invention has the following advantages. The part name text data contained in the three-dimensional content is obtained by analyzing the three-dimensional content description data input by the user, and the three-dimensional part point cloud data is then generated. After the spatial position of each part in the three-dimensional content is predicted, coarse fusion yields the approximate position of each three-dimensional part, and fine fusion then yields the final high-quality, complete three-dimensional content. As a result, the three-dimensional parts in the fused content retain better detail, and overlaps or gaps at the boundary between two three-dimensional parts are avoided, which effectively improves the quality of the generated three-dimensional content.

In addition, the invention also provides a corresponding implementation device, electronic equipment and a readable storage medium for the three-dimensional content generation method based on the pre-training language model, so that the method has more practicability, and the device, the electronic equipment and the readable storage medium have corresponding advantages.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.

Drawings

For a clearer description of the present invention or of the technical solutions related thereto, the following brief description will be given of the drawings used in the description of the embodiments or of the related art, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained from these drawings without the inventive effort of a person skilled in the art.

FIG. 1 is a schematic flow chart of a three-dimensional content generation method based on a pre-training language model provided by the invention;

FIG. 2 is a schematic diagram of an exemplary architecture of a two-channel condition control component generation network provided by the present invention;

FIG. 3 is a schematic diagram of a training process of a two-channel condition control component generation network provided by the present invention;

FIG. 4 is a schematic diagram of an exemplary architecture of a three-dimensional component coarse-grained fusion network provided by the present invention;

FIG. 5 is a schematic diagram of an exemplary architecture of a three-dimensional component fine-grained fusion network provided by the invention;

FIG. 6 is a schematic diagram of a hardware framework of an exemplary application scenario provided by the present invention;

FIG. 7 is a flow chart of another method for generating three-dimensional content based on a pre-training language model according to the present invention;

FIG. 8 is a flow chart of a method for generating three-dimensional content based on a pre-training language model according to the present invention;

FIG. 9 is a block diagram of an embodiment of a three-dimensional content generating device based on a pre-training language model according to the present invention;

FIG. 10 is a block diagram of an embodiment of an electronic device according to the present invention.

Detailed Description

In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terms "first", "second", "third", "fourth" and the like in the description, in the claims and in the above drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have", as well as any variations of the two, are intended to cover a non-exclusive inclusion. The term "exemplary" means "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

Three-dimensional content can be classified into two types, unconditional generation (Unconditional Generation) and conditional generation (Conditional Generation) according to whether or not the usage condition is satisfied. Unconditional generation refers to direct learning using a deep neural network to obtain a data distribution of the 3D content, and then sampling from the data distribution to generate new 3D content. Conditional generation refers to the generation of 3D content consistent with condition requirements by controlling a deep neural network through a reasonable condition introduction mode given condition input. Because conditionally generated 3D content can meet more application requirements, it is widely used in the field of artificial intelligence than unconditionally generated approaches.

Conditional generation of three-dimensional content generates 3D content intelligently from a language description, that is, a text description or similar input, which is a condition control mode that matches the way humans interact. With the development of generative artificial intelligence technology and of large-scale image-text pre-training models represented by CLIP (Contrastive Language-Image Pre-training), the capability of generating 3D content based on text conditions has greatly improved. After the text condition is encoded by the large-scale pre-training model, the generation model can be guided to generate the target 3D content by means of pre-generation control or post-generation guidance, and a good 3D content generation result is obtained.

However, in the related art, 3D content is generated from text as a whole, which leaves obvious defects in the generated details, so high-quality 3D content results cannot be obtained. In view of this, the invention follows a divide-and-conquer approach: each three-dimensional component of the 3D content is generated separately, rough fusion is performed to obtain the approximate position of each three-dimensional component, and fine fusion is then performed to obtain the final complete, high-quality three-dimensional content, which significantly improves the precision of the 3D content and yields a high-quality generation result. Various non-limiting embodiments of the present invention are described in detail below. Numerous specific details are set forth in the following description in order to provide a better understanding of the invention. It will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.

Referring first to fig. 1, fig. 1 is a flow chart of a three-dimensional content generating method based on a pre-training language model according to the present invention, where the method may include the following steps:

S101: Generate part name text data and three-dimensional part point cloud data based on the three-dimensional content description data carried by the three-dimensional content generation request.

The three-dimensional content generation request is issued by a user through a client and indicates to the system the user's three-dimensional content generation requirement. The request carries three-dimensional content description data input by the user, which can be any data that lets the system determine unambiguously which three-dimensional content is to be generated. The simplest three-dimensional content description data is a three-dimensional content name; it can also be three-dimensional content text description data, three-dimensional content voice description data and the like. For example, the user can input by text a text-type request to generate a server, or input by voice a voice-type request to generate a server. For different types of three-dimensional content description data, the description can be extracted with a method suited to the corresponding data format. The part name text data consists of the names of all parts contained in the three-dimensional content to be generated, obtained by analyzing the three-dimensional content description data. A part name is the name of a component, part or unit structure that makes up the three-dimensional entity corresponding to the three-dimensional content to be generated; for example, if the three-dimensional content to be generated is a server, the part names are indicator lamp, main board, shell, display and host. Any method that can derive the names of the parts contained in the three-dimensional content from its text description may be used. The set of part names described in text is the part name text data of this step.
For example, if the user description in the three-dimensional content generation request is to generate "an airplane", the corresponding part name text data includes fuselage, wings, tail and the like.

Since the same part may differ between objects, the generation of a single three-dimensional part is affected by the overall three-dimensional content; for example, a cat's leg differs from a horse's leg. This step therefore combines the three-dimensional content description data and the part name text data to generate each three-dimensional part. After the part name text data of the three-dimensional content has been obtained, three-dimensional part point cloud data corresponding to each part name can be determined according to the three-dimensional content description data and the part name text data. This step may use any method in the related art that can generate three-dimensional point cloud data from text description data, such as a diffusion model or Point-E (a text-conditioned three-dimensional point cloud generation model), which is not limited in this embodiment.

The text features of the three-dimensional content description data and the part name text data are extracted by a text feature extraction model. The text feature extraction model in this embodiment is a network model obtained by fine-tuning a pre-training language model on a text feature extraction task using a three-dimensional content sample data set. A pre-training language model is obtained by designing a language model training task on a large-scale corpus (language training material such as sentences and paragraphs) and training a large-scale neural network structure on that task; the resulting network structure and parameters constitute the pre-training language model, on which other tasks can perform feature extraction or task fine-tuning to serve a specific purpose. Pre-training means training on one task to obtain a set of model parameters, initializing a network model with those parameters, and then training the initialized network model on other tasks to obtain a model adapted to them. By pre-training on a large-scale corpus, a neural language representation model can learn a powerful language representation capability and extract rich syntactic and semantic information from text.
The neural network structure trained as the pre-training language model includes, but is not limited to, CNN (Convolutional Neural Network), RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory network); it may also be a model built on an attention network, such as LLM (Large Language Model), Transformer, BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer) or CLIP (Contrastive Language-Image Pre-training), which is not limited herein. An attention network is a network model trained with an attention mechanism: the model assigns different weights to the parts of an input sequence, so that the more important feature information is extracted and the model produces a more accurate output.

Because the pre-training language model can provide downstream tasks with token-level and sentence-level features rich in semantic information, and can also be fine-tuned directly for a downstream task, a model dedicated to the downstream task can be obtained conveniently and quickly. Fine-tuning means small-scale training for a specific task target (the downstream task) on task data (the downstream data), starting from the pre-trained model, so that the parameters of the pre-trained model are adjusted slightly and a model adapted to the specific task and data is obtained. To further improve the accuracy of text feature extraction, this embodiment uses the three-dimensional content sample data set as task data and fine-tunes any one of the above pre-training language models on a text feature extraction task to obtain the text feature extraction model. The text feature extraction model of this embodiment covers the network structures that extract text features of the three-dimensional content description data and the part name text data in subsequent method steps, such as the whole text feature extraction model and the component text feature extraction model of the trained two-channel condition control component generation network, the whole text feature extraction model and the component text feature extraction model of the trained three-dimensional component coarse-granularity fusion network, and the whole text feature extraction sub-network of the trained three-dimensional component fine-granularity fusion network.

S102: Determine the spatial position of each three-dimensional part according to the three-dimensional content description data, the part name text data and the three-dimensional part point cloud data, and move each three-dimensional part to the corresponding spatial position to obtain initial fusion point cloud data.

After the spatial position of each three-dimensional part has been determined, the three-dimensional part corresponding to each part name in the part name text data is moved to its spatial position for initial fusion, which yields the initial fusion point cloud data.

S103: Perform fixed-number sampling on the initial fusion point cloud data to obtain fusion point cloud data.

Because the number of parts differs from one three-dimensional content to another, the merged point cloud of the three-dimensional parts obtained in S102 has a varying number of points. It is therefore sampled to a fixed number of points to obtain the fusion point cloud data.
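Steps S102 and S103 can be illustrated with a minimal NumPy sketch. It assumes that each predicted spatial position is a translation vector applied to the part's coordinates and that the fixed-number sampling is uniform random sampling; the text fixes neither choice, and the function name `fuse_and_sample` and its parameters are hypothetical.

```python
import numpy as np

def fuse_and_sample(parts, positions, num_points=2048, seed=0):
    """Move each part to its position, merge all parts, sample a fixed number of points.

    parts:     list of (K_i, 6) arrays with columns x, y, z, R, G, B
    positions: list of (3,) translation vectors, one per part (assumed form)
    """
    rng = np.random.default_rng(seed)
    moved = []
    for pts, pos in zip(parts, positions):
        pts = np.array(pts, dtype=float)   # copy so the caller's data is untouched
        pts[:, :3] += pos                  # shift coordinates only, keep colors
        moved.append(pts)
    merged = np.concatenate(moved, axis=0)  # initial fusion point cloud
    # Sample without replacement when enough points exist, otherwise with replacement.
    replace = merged.shape[0] < num_points
    idx = rng.choice(merged.shape[0], size=num_points, replace=replace)
    return merged[idx]
```

Other samplers, such as farthest point sampling, would fit the same interface; uniform sampling is used here only because it is the simplest option.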

S104: Calculate offset information of each point of each three-dimensional component according to the three-dimensional content description data and the fusion point cloud data.

In the previous steps, each three-dimensional component is placed at a roughly accurate spatial position. Two three-dimensional components may still overlap, or gaps may remain at their boundary. To obtain high-quality three-dimensional content, the spatial position of each point of each three-dimensional component therefore needs to be adjusted using the offset information calculated in this step, so that the finally generated three-dimensional content as a whole conforms to the structure and semantics of the three-dimensional content.

S105: Adjust the spatial positions of the corresponding three-dimensional components based on the offset information, and generate the three-dimensional content from the adjusted three-dimensional components.

After the offset information of each three-dimensional point of each three-dimensional component has been calculated in the previous step, the spatial position of the corresponding three-dimensional point is adjusted based on the offset information. Because each three-dimensional component was generated independently in the earlier steps, its details are good, and fusing the finally adjusted three-dimensional components yields complete, high-quality three-dimensional content.
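As a small illustration of S105, the sketch below applies per-point offsets to a colored point cloud. It assumes the offset information is an additive (dx, dy, dz) displacement for every point; the text does not state the exact form, and the name `apply_offsets` is hypothetical.

```python
import numpy as np

def apply_offsets(points, offsets):
    """Shift the (x, y, z) of every point by its offset; colors are left unchanged.

    points:  (N, 6) colored point cloud
    offsets: (N, 3) per-point displacements (assumed to be additive)
    """
    adjusted = np.array(points, dtype=float)  # copy of the input
    adjusted[:, :3] += offsets
    return adjusted
```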

According to the technical scheme provided by the invention, the part name text data contained in the three-dimensional content is obtained by analyzing the three-dimensional content description data input by the user, and three-dimensional part point cloud data is generated from it. After the spatial position of each part in the three-dimensional content has been predicted, rough fusion gives the approximate position of each three-dimensional part, and fine fusion then gives the final complete, high-quality three-dimensional content. The three-dimensional parts in the fused content thus keep good details, and overlaps or gaps at the boundary between two parts are avoided, which effectively improves the generation quality of the three-dimensional content.

In the above embodiment, how to generate the three-dimensional component point cloud data is not limited, and an exemplary generation manner of the three-dimensional component point cloud data in this embodiment may include the following:

Part name text data is generated based on the three-dimensional content description data carried by the three-dimensional content generation request; target noise point cloud data, the three-dimensional content description data and the part name text data are taken together as the data to be processed; and three-dimensional part point cloud data carrying color parameters is obtained by removing noise from the data to be processed in each iteration.

In this embodiment, three-dimensional point cloud data is generated based on a diffusion model. DDPM (Denoising Diffusion Probabilistic Model) improves content generation quality step by step by predicting the noise added at each diffusion step and denoising many times, and has achieved high-quality results on two-dimensional image generation tasks. To apply DDPM to high-quality generation of three-dimensional content, this embodiment uses an explicit point cloud representation: during model training, the three-dimensional component point cloud data in the paired text and three-dimensional component data set used to train the model is represented as colored 3D point clouds. The point cloud data of each three-dimensional component can be represented as a tensor of dimension K×6, where K is the number of points, three of the six dimensions are the (x, y, z) coordinates and the other three are the (R, G, B) color of each point. Correspondingly, the three-dimensional part point cloud data obtained directly from the diffusion model is colored point cloud data. The target noise point cloud data used in the reasoning process is a noise point cloud obtained by random sampling from a Gaussian distribution. This noise point cloud is used as input data together with the three-dimensional content description data and the part name text data, and the final three-dimensional point cloud data is obtained by predicting the noise added at each diffusion step and denoising many times.
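The K×6 representation and the Gaussian starting point can be made concrete with a short NumPy sketch. The value of K, the coordinate range and the color range are hypothetical choices for illustration only.

```python
import numpy as np

K = 1024  # number of points, a hypothetical choice
rng = np.random.default_rng(0)

# A colored point cloud: columns 0-2 are (x, y, z), columns 3-5 are (R, G, B).
xyz = rng.uniform(-0.5, 0.5, size=(K, 3))
rgb = rng.uniform(0.0, 1.0, size=(K, 3))
part_cloud = np.concatenate([xyz, rgb], axis=1)

# Target noise point cloud for inference: every entry drawn from a standard Gaussian.
noise_cloud = rng.standard_normal(size=(K, 6))
```

Both arrays share the same shape, which is what lets the denoising process start from `noise_cloud` and end at a colored point cloud of the same layout.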

The invention also provides a high-efficiency and accurate generation mode of the three-dimensional point cloud data, which comprises the following steps:

A two-channel condition control component generation network is constructed in advance based on a diffusion model and is trained using the three-dimensional content sample data set. If the three-dimensional content description data is not text data, it is converted into three-dimensional content description data in text form. The three-dimensional content description data, the part name text data, the diffusion time and the noise point cloud data are then input as input data to the two-channel condition control component generation network to obtain the final three-dimensional point cloud data.

As shown in fig. 2, the two-channel condition control component generation network comprises a whole text description sub-extraction network, a component text description sub-extraction network, a diffusion time processing network, a point cloud data processing network, a sequence data processing neural network and a predicted point cloud noise output network. The whole text description sub-extraction network generates whole control description sub-data from the three-dimensional content description data in text form; the component text description sub-extraction network generates component control description sub-data from the component name text data; the diffusion time processing network generates diffusion time description sub-data; the point cloud data processing network generates point cloud data description sub-data; the sequence data processing neural network learns a feature representation of the whole control description sub-data, the component control description sub-data, the diffusion time description sub-data and the point cloud data description sub-data; and the predicted point cloud noise output network predicts the point cloud noise data at the current moment based on that feature representation. It can be understood that if the whole text description sub-extraction network and the component text description sub-extraction network are removed from the two-channel condition control component generation network, the remaining network structure is suitable for generating three-dimensional point cloud data in unconditional three-dimensional content generation.

In this embodiment, considering that extracting text features from the three-dimensional content description data and the part name text data as a single piece of text leads to poor performance of the final model, feature extraction is performed separately on the two text-description channels, the whole and the component. The whole text description sub-extraction network can be any network model, or combination of network models, capable of extracting high-dimensional text features of the three-dimensional content description data in text form; the component text description sub-extraction network can be any network model, or combination of network models, capable of extracting high-dimensional text features of the component name text data in text form. Neither choice affects the implementation of the present invention.

Considering that text features extracted directly in a single pass lead to poor quality of the conditionally generated three-dimensional content, text feature extraction for the three-dimensional content description data and the part name text data can be realized by combining several models. For example, the whole text description sub-extraction network may include a whole text input end, a whole text feature extraction model and a whole control sub-generation model. The whole text input end is used for inputting three-dimensional content description data in text form; the whole text feature extraction model extracts text features of the three-dimensional content description data as whole initial description sub-data; and the whole control sub-generation model subdivides the features in the whole initial description sub-data to obtain fine-grained whole text features, which serve as the whole control description sub-data. For example, as shown in fig. 2, this embodiment may first perform feature extraction on the three-dimensional content description data in text form by means of CLIP (Contrastive Language-Image Pre-training) to obtain fine-grained initial descriptors of dimension 256×d', that is, the whole initial description sub-data of the three-dimensional content description data. CLIP is a large-scale pre-training model for semantic alignment of text and images, and using its extracted initial descriptors directly for conditional generation gives a poor effect. An MLP (Multilayer Perceptron) is therefore used to convert the whole initial description sub-data from 256×d' into 256×d-dimensional fine-grained control descriptors, that is, the whole control description sub-data. The MLP network structure may be set flexibly according to the actual application scenario; for example, two layers MLP(d', 2d) and MLP(2d, d) may be set, where d may be 512, 1024 or 2048, and the larger d is, the larger the number of network parameters.

Likewise, the component text description sub-extraction network may include a component text input end, a component text feature extraction model and a component control sub-generation model. The component text input end is used for inputting component name text data; the component text feature extraction model extracts text features of the component name text data as component initial description sub-data; and the component control sub-generation model subdivides the features in the component initial description sub-data to obtain fine-grained component text features, which serve as the component control description sub-data. For example, as shown in fig. 2, this embodiment may first perform feature extraction on the part name text data by means of CLIP to obtain fine-grained initial descriptors of dimension 256×d', that is, the component initial description sub-data of the part name text data, and then use an MLP to convert the component initial description sub-data from 256×d' into 256×d-dimensional fine-grained control descriptors, that is, the component control description sub-data. The component text description sub-extraction network may or may not have the same internal network structure as the whole text description sub-extraction network, which does not affect the implementation of the present invention.
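The conversion from 256×d' initial descriptors to 256×d control descriptors can be sketched as a two-layer MLP in NumPy. The random weights, the ReLU activation and the value d' = 512 are assumptions made only for illustration, and the random `initial` array stands in for the output of the text encoder; `init_mlp` and `control_descriptors` are hypothetical names.

```python
import numpy as np

def init_mlp(d_in, d, rng):
    """Random weights for MLP(d_in, 2d) followed by MLP(2d, d)."""
    return {
        "w1": rng.standard_normal((d_in, 2 * d)) * 0.02,
        "b1": np.zeros(2 * d),
        "w2": rng.standard_normal((2 * d, d)) * 0.02,
        "b2": np.zeros(d),
    }

def control_descriptors(initial, params):
    """Map (256, d') initial descriptors to (256, d) control descriptors."""
    # ReLU between the two layers is an assumption; the text names no activation.
    hidden = np.maximum(initial @ params["w1"] + params["b1"], 0.0)
    return hidden @ params["w2"] + params["b2"]

rng = np.random.default_rng(0)
d_prime, d = 512, 512                           # d' = 512 is a hypothetical encoder width
initial = rng.standard_normal((256, d_prime))   # stand-in for the encoder output
control = control_descriptors(initial, init_mlp(d_prime, d, rng))
```

The same two functions could serve both the whole channel and the component channel, each with its own parameter set.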

In order to input the diffusion time into the two-channel condition control component generation network, this embodiment uses the diffusion time processing network to convert the diffusion time from one dimension into a 1×d-dimensional generation descriptor. Similarly, in order to input the noise point cloud into the network, the point cloud data processing network converts the input noise point cloud from K×6 into K×d-dimensional point cloud generation descriptors. The diffusion time and the noise point cloud data have little influence on the conditional generation effect, so to reduce the scale of the whole model, the diffusion time processing network and the point cloud data processing network may each perform feature extraction only once to generate the corresponding descriptor data. For example, as shown in fig. 2, the diffusion time processing network may include a diffusion time input end and a time feature extraction model; the time feature extraction model may be an MLP model, which outputs the generation descriptor of the diffusion time. Similarly, the point cloud data processing network may include a noise point cloud data input end and a noise feature extraction model; the noise feature extraction model may be an MLP model, which outputs the generation descriptors of the point cloud. The sequence data processing neural network may adopt any pre-training language model capable of processing language sequence data, for example a Transformer model. The input of the Transformer model is then data of dimension (256×2+1+K)×d, and the features at the last K positions output by the Transformer model are taken as the diffusion output descriptors, where K is the number of points of the three-dimensional point cloud and may be 512, 1024 or 2048.
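The shape bookkeeping of the sequence input can be checked with a short NumPy sketch. Random arrays stand in for the four kinds of descriptors, and the order of the first three blocks is an assumption; the text only implies that the K point tokens come last. The function names are hypothetical.

```python
import numpy as np

def build_sequence(whole_desc, part_desc, time_desc, point_desc):
    """Stack all descriptors along the token axis; point tokens go last."""
    return np.concatenate([whole_desc, part_desc, time_desc, point_desc], axis=0)

def diffusion_output_descriptors(sequence_output, num_points):
    """Keep the features at the last K positions."""
    return sequence_output[-num_points:]

d, K = 512, 1024
rng = np.random.default_rng(0)
sequence = build_sequence(
    rng.standard_normal((256, d)),  # whole control descriptors
    rng.standard_normal((256, d)),  # component control descriptors
    rng.standard_normal((1, d)),    # diffusion time descriptor
    rng.standard_normal((K, d)),    # point cloud descriptors
)
# The sequence itself stands in for the model output here, since a
# sequence model of this kind preserves the number of tokens.
point_features = diffusion_output_descriptors(sequence, K)
```

With d = 512 and K = 1024 the sequence has 256×2+1+1024 = 1537 tokens, of which the last 1024 are passed on to the noise output network.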

For example, the predicted point cloud noise output network in the above embodiment may include a noise point cloud data generation model, which performs regression analysis on the diffusion output description sub-data output by the sequence data processing neural network to obtain the point cloud noise data added at each diffusion step. For example, when the sequence data processing neural network is a Transformer model, the features at the last K positions output by the Transformer model are used as the diffusion output descriptors, and an MLP network converts them into K×6-dimensional predicted point cloud noise, that is, the noise predicted for each point of the three-dimensional point cloud, which is to be removed.

When the two-channel condition control component generation network is built, the network model is trained with the diffusion time $t$, the noise point cloud $x_t$, sample data of the three-dimensional content description data $C_{full}$ and sample data of the part name text data $C_{part}$ as input data. The output of the network is the predicted noise added at each diffusion step, $\epsilon_\theta(x_t, t, C_{full}, C_{part})$, i.e. the noise that needs to be removed at diffusion time $t$. The diffusion time $t$ is sampled randomly from $t \in [1, T]$, and $T$ may be set to 1024. The noise point cloud data input to the point cloud data processing network is sample noise point cloud data calculated from the noise-free three-dimensional point cloud data in the three-dimensional content sample data set, for example by calling a noise point cloud data calculation relation, which in the standard DDPM form is expressed as:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $x_t$ is the sample noise point cloud data at time $t$, $x_0$ is the noise-free three-dimensional point cloud data in the three-dimensional content sample data set, $\bar{\alpha}_t$ is a preset noise weighting factor that varies with time $t$, and $\epsilon \sim \mathcal{N}(0, 1)$ is noise sampled from a standard normal distribution. One sample noise point cloud can be input in each round of training; the predicted point cloud noise output by the predicted point cloud noise output network is used as the predicted value, and the model training process is guided by the real noise and the predicted noise of the currently input sample noise point cloud data.
In other words, the sample noise point cloud data, the diffusion time, the training sample data corresponding to the three-dimensional content description data and the training sample data corresponding to the part name text data are used as input data, and the predicted point cloud noise data added at each diffusion step is used as output. After the input and output of the network structure are obtained, the component generation loss function relation is called and the two-channel condition control component generation network is trained by minimizing this loss function. The component generation loss function relation can be expressed as:

$$L = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\left\| \epsilon - \epsilon_\theta(x_t, t, C_{full}, C_{part}) \right\|^2\right]$$
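The noising relation and the noise-prediction loss can be sketched in NumPy as follows. The sketch assumes the standard DDPM form, in which the clean point cloud is scaled by the square root of a cumulative weighting factor and the noise by the square root of its complement. The linear schedule of the per-step variances is also an assumption, since the text only says the weighting factor is preset; the names `add_noise` and `noise_loss` are hypothetical.

```python
import numpy as np

T = 1024
betas = np.linspace(1e-4, 0.02, T)       # assumed linear variance schedule
alpha_bar = np.cumprod(1.0 - betas)      # alpha_bar[t - 1] is the factor for time t

def add_noise(x0, t, rng):
    """Return the noised point cloud x_t and the noise that was added."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def noise_loss(eps_true, eps_pred):
    """Mean squared error between the real noise and the predicted noise."""
    return float(np.mean((eps_true - eps_pred) ** 2))

rng = np.random.default_rng(0)
x0 = rng.uniform(-0.5, 0.5, size=(1024, 6))   # stand-in for a clean colored point cloud
t = int(rng.integers(1, T + 1))               # diffusion time sampled from [1, T]
x_t, eps = add_noise(x0, t, rng)
```

In training, `eps` would be compared with the network's prediction for `(x_t, t)` and the two text conditions; a perfect prediction gives a loss of zero.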

Further, to improve the training precision of the two-channel condition control component generation network, this embodiment also provides a training mode adapted to the whole and component two-channel condition control part of the network structure: the descriptor extraction network of each single channel is trained step by step, so that the whole two-channel condition control component generation network is trained stably and efficiently. As shown in fig. 3, the method may include the following steps:

First, training samples are obtained: sample noise point cloud data is calculated from the noise-free three-dimensional point cloud data in the three-dimensional content sample data set, and training sample data corresponding to the three-dimensional content description data and training sample data corresponding to the part name text data are acquired from the three-dimensional content sample data set.

Next, the whole text description sub-extraction network is set to an inactive state and the component text description sub-extraction network is set to an active state, and the component text description sub-extraction network is trained jointly using the sample noise point cloud data, the diffusion time and the training sample data corresponding to the part name text data.

Then, the trained component text description sub-extraction network is frozen, the whole text description sub-extraction network is set to an active state, and the whole text description sub-extraction network is trained jointly using the sample noise point cloud data, the diffusion time and the training sample data corresponding to the three-dimensional content description data.

Finally, the trained component text description sub-extraction network is unfrozen, and the component text description sub-extraction network and the whole text description sub-extraction network are trained again for joint fine-tuning, using the sample noise point cloud data, the diffusion time, the training sample data corresponding to the three-dimensional content description data and the training sample data corresponding to the part name text data.
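The three training stages can be written down as a small schedule, shown here as a hedged Python sketch. It records only the state of the two text channels in each stage; the class `Stage`, the state names and the helper `trainable_channels` are hypothetical, and how the shared sub-networks are treated in each stage is not specified in the text and is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    whole: str  # state of the whole text channel: "inactive", "frozen" or "trainable"
    part: str   # state of the component text channel, same three values

SCHEDULE = [
    Stage("train component channel", whole="inactive", part="trainable"),
    Stage("train whole channel", whole="trainable", part="frozen"),
    Stage("joint fine-tuning", whole="trainable", part="trainable"),
]

def trainable_channels(stage):
    """Names of the channels whose parameters are updated in this stage."""
    states = {"whole": stage.whole, "part": stage.part}
    return [name for name, state in states.items() if state == "trainable"]
```

A training loop would iterate over `SCHEDULE` and, in each stage, enable gradient updates only for the channels returned by `trainable_channels`.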

After the two-channel condition control component generation network has been trained, in the reasoning process a noise point cloud is obtained by random sampling from a Gaussian distribution, and the whole text description sub-extraction network and the component text description sub-extraction network extract the corresponding control description sub-data from the text description data of the two channels, whole and component, that is, from the three-dimensional content description data and the part name text data. For time $t$, the noise $\epsilon_\theta$ at that time is predicted, and subtracting it from the noise point cloud $x_t$ at time $t$ gives $x_{t-1}$; that is, the noise point cloud data fed to the point cloud data processing network in the next iteration is the noise point cloud data of the previous round with the point cloud noise data predicted by the predicted point cloud noise output network removed. As $t$ runs from $T$ to 1, the noise is removed step by step and $x_0$ is finally obtained, which is the colored three-dimensional point cloud data generated by the two-channel condition control component generation network.
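The iterative denoising can be sketched as a loop that follows the simplified description in the text, where the predicted noise is subtracted from the current noise point cloud at every step. The callable `predict_noise` is a hypothetical stand-in for the trained network, and all names and default values are assumptions. A standard DDPM sampler additionally rescales each step with schedule-dependent coefficients, which this sketch does not show.

```python
import numpy as np

def denoise(predict_noise, whole_desc, part_desc, num_points=1024, num_steps=1024, seed=0):
    """Start from Gaussian noise x_T and remove the predicted noise for t = T, ..., 1.

    predict_noise(x, t, whole_desc, part_desc) must return an array shaped like x.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_points, 6))   # x_T: random colored point cloud
    for t in range(num_steps, 0, -1):
        eps_hat = predict_noise(x, t, whole_desc, part_desc)
        x = x - eps_hat                        # x_{t-1} from x_t, as described in the text
    return x                                   # x_0
```

The output of each round becomes the noise point cloud input of the next round, so the loop carries a single array through all T steps.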

From the above, the two-channel condition control component generation network can generate the three-dimensional point cloud data of each three-dimensional component efficiently and accurately. The model training method adapted to the whole and component channels effectively avoids an uncontrollable training process, accelerates the convergence of the whole network model, realizes effective fine-grained conditional control for each channel, and completes the training of the whole network model efficiently.

Further, since each three-dimensional component may be located at a different portion of the complete three-dimensional content, its coordinates are not aligned with the world coordinate system, which greatly affects the training effect of the whole two-channel condition control component generation network. Therefore, to keep the generation process controllable and improve the generation precision of the whole three-dimensional point cloud data, before the two-channel condition control component generation network is trained with the three-dimensional content sample data set, the method may further include the following:

Uniformly aligning all three-dimensional component points of each three-dimensional content in the three-dimensional content sample data set with the target coordinate system, and normalizing the size of each three-dimensional content to a preset standard size.

In this embodiment, two preprocessing operations, coordinate system alignment and scale normalization, are performed on each three-dimensional content in the three-dimensional content sample data set. Coordinate system alignment means uniformly aligning all points of the three-dimensional content with the world coordinate system: the center of the current three-dimensional content is calculated first, and the new coordinates of each point are then calculated from the center and the original coordinates of that point, for example, original coordinates of each three-dimensional point - center coordinates = new coordinates. Scale normalization means normalizing the three-dimensional content to a standard scale, for example, scaling the object corresponding to the three-dimensional content into a cube with a side length of 1: the difference between the maximum and minimum values of the three-dimensional content is calculated on the x-axis, the y-axis and the z-axis, the largest of the three differences is taken, its reciprocal is used as the scaling factor, and the coordinates of each point of the three-dimensional content are multiplied by the scaling factor.
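The two preprocessing operations can be sketched as below. The sketch assumes that "center" means the centroid (mean) of the points, which the text does not specify; the function name and the returned tuple are illustrative choices.

```python
def normalize_content(points):
    """Center a point cloud at its centroid and scale its longest extent to 1.

    Returns the normalized points together with the center and scale factor,
    so that the transformation can be inverted later.
    """
    n = len(points)
    center = [sum(p[a] for p in points) / n for a in range(3)]
    centered = [[p[a] - center[a] for a in range(3)] for p in points]
    # Extent (max - min) on each axis; the largest one defines the scale.
    extents = [max(p[a] for p in centered) - min(p[a] for p in centered)
               for a in range(3)]
    longest = max(extents)
    if longest == 0:
        raise ValueError("degenerate point cloud: all points coincide")
    scale = 1.0 / longest
    normalized = [[c * scale for c in p] for p in centered]
    return normalized, center, scale

cloud = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 4.0, 0.0], [0.0, 4.0, 1.0]]
normalized, center, scale = normalize_content(cloud)
```

Keeping `center` and `scale` is convenient because later training steps use the displacement and scale obtained during this normalization as supervision.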

The above embodiment does not limit how the spatial position of each three-dimensional component is determined. Based on the above embodiment, the present invention further provides an exemplary implementation of determining the spatial position of each three-dimensional component according to the three-dimensional content description data, the component name text data and the three-dimensional component point cloud data, which may include the following:

Acquiring the high-dimensional overall text feature of the three-dimensional content description data; acquiring the high-dimensional part text feature of the part name text data; acquiring the three-dimensional part global feature of the three-dimensional part point cloud data;

combining the high-dimensional overall text feature, the high-dimensional part text feature and the three-dimensional part global feature, and continuously reducing the dimension of the combined feature to obtain the three-dimensional displacement and the three-dimensional scale of each three-dimensional part; and determining the spatial position of the corresponding three-dimensional part according to the three-dimensional scale and the three-dimensional displacement of each three-dimensional part.

In this embodiment, after the point cloud data of each three-dimensional component has been generated separately in S102, the multiple three-dimensional components contained in the three-dimensional content need to be effectively fused to obtain the final complete three-dimensional content. The features of the three-dimensional content description data, the part name text data and the three-dimensional part point cloud data can be extracted by any network model capable of extracting the corresponding features; for the text data, for example, a Bert model (Bidirectional Encoder Representations from Transformers) can be adopted, and of course other types of neural network models can also be adopted. For the dimension reduction of the joint feature, any network structure capable of reducing a multi-dimensional feature to the required dimension, such as 6 dimensions, can be adopted, and the invention is not limited in this respect. After the three-dimensional scale and the three-dimensional displacement of each three-dimensional component are obtained, each point of the three-dimensional component is multiplied by the three-dimensional scale and then translated, which gives the spatial position where the three-dimensional component should be located. For example, let the point cloud coordinates of a target point of a three-dimensional part be $(x, y, z)$, and let the coordinates of that point after the part has been moved to its spatial position be $(x', y', z')$. If $(s_x, s_y, s_z)$ is the three-dimensional scale and $(t_x, t_y, t_z)$ is the three-dimensional displacement, the point cloud coordinates before and after the transformation are related by: $x' = s_x x + t_x$, $y' = s_y y + t_y$, $z' = s_z z + t_z$.
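As a small worked example of this scale-then-translate step, the sketch below applies a per-axis scale and displacement to the xyz coordinates of each point and leaves any extra channels (for example color) untouched. The function name and the handling of extra channels are assumptions made for illustration.

```python
def place_part(points, scale, displacement):
    """Move a part to its spatial position: p' = s * p + t on each axis.

    Only the first three values of each point (xyz) are transformed;
    any remaining channels, such as rgb, are copied unchanged.
    """
    placed = []
    for p in points:
        xyz = [s * c + t for c, s, t in zip(p[:3], scale, displacement)]
        placed.append(xyz + list(p[3:]))
    return placed

part = [[1.0, 2.0, 3.0, 255, 0, 0], [0.0, 0.0, 0.0, 0, 255, 0]]
placed = place_part(part, scale=(2.0, 2.0, 2.0), displacement=(0.5, 0.0, -1.0))
```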

Based on the above embodiment, the present invention further provides an efficient manner of determining the spatial position of each three-dimensional component, which may include the following:

Training a three-dimensional part coarse-granularity fusion network in advance; inputting the three-dimensional content description data, the part name text data and the three-dimensional part point cloud data into the three-dimensional part coarse-granularity fusion network as input data; and determining the spatial position of each three-dimensional part from the output data of the three-dimensional part coarse-granularity fusion network.

The three-dimensional part coarse-granularity fusion network comprises a text feature extraction sub-network, a feature joint layer and a joint feature processing sub-network. Its output data are the three-dimensional scale and the three-dimensional displacement of each three-dimensional component, from which the spatial position of each three-dimensional component is determined. The text feature extraction sub-network is used for extracting features of the three-dimensional content description data, the part name text data and the three-dimensional part point cloud data, and can comprise an overall text high-dimensional feature extraction model, a part text high-dimensional feature extraction model and a global feature extraction model. The overall text high-dimensional feature extraction model may extract the high-dimensional overall text feature of the three-dimensional content description data directly based on any pre-training language model as described above, for example a Bert model, as shown in fig. 4; of course, a fine-tuned pre-training language model may also be employed. Likewise, the part text high-dimensional feature extraction model may extract the high-dimensional part text feature of the part name text data directly based on any pre-training language model as described above, or based on a fine-tuned pre-training language model. By relying on the strong understanding capability of the large-scale pre-training language model, the text feature extraction sub-network can extract salient descriptors, which improves the training effect of the whole model.

The global feature extraction model comprises a first, a second, a third and a fourth multi-layer perceptron (MLP) layer connected in sequence. These four layers successively raise the dimension of the input three-dimensional component point cloud data, and the output features of the fourth MLP layer are pooled to obtain the three-dimensional component global feature of the three-dimensional component point cloud data, as shown in fig. 4. The feature joint layer concatenates the features output by the text feature extraction sub-network and outputs the joint feature to the joint feature processing sub-network, which can comprise a fifth, a sixth, a seventh and an eighth MLP layer; these layers continuously reduce the dimension of the joint feature to 1×6 to obtain the three-dimensional displacement and the three-dimensional scale of the three-dimensional component.

Taking fig. 4 as an example, let the Bert model extract a 1×768-dimensional feature from the three-dimensional content description data and from the part name text data respectively, and let the number of points of the three-dimensional part be n, so that the input point cloud is n×6-dimensional. The n×6-dimensional features are gradually raised to n×2048 dimensions by 4 MLP layers and pooled to obtain a 1×2048-dimensional vector, which is a high-dimensional abstraction of the n point features. The features of the three-dimensional content description data and the part name text data and the global feature of the three-dimensional part point cloud data are concatenated to obtain a 3584-dimensional joint feature (768 + 768 + 2048), which is gradually reduced to 1×6 dimensions through five MLP layers; the six values represent the three-dimensional scale and the three-dimensional displacement respectively.
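The dimension bookkeeping of the fig. 4 example can be checked with the short sketch below. It uses random placeholder features instead of real Bert or MLP outputs, assumes max pooling (the text only says "pooling"), and replaces the stack of dimension-reducing MLP layers with a single random linear map, so it illustrates the shapes only and not the trained network.

```python
import random

def max_pool(per_point_features):
    """Column-wise max over n per-point rows, giving one global feature row."""
    return [max(col) for col in zip(*per_point_features)]

def linear(vec, weights, bias):
    """Single dense layer; stands in for the stack of dimension-reducing MLP layers."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

rng = random.Random(0)
n = 8  # number of points in the part (kept small for the sketch)

whole_text = [rng.uniform(-1, 1) for _ in range(768)]    # placeholder for the Bert feature
part_text = [rng.uniform(-1, 1) for _ in range(768)]     # placeholder for the Bert feature
per_point = [[rng.uniform(-1, 1) for _ in range(2048)] for _ in range(n)]

part_global = max_pool(per_point)                        # 1 x 2048
joint = whole_text + part_text + part_global             # 1 x 3584

weights = [[rng.uniform(-0.01, 0.01) for _ in range(len(joint))] for _ in range(6)]
out6 = linear(joint, weights, [0.0] * 6)                 # 1 x 6: scale and displacement
```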

After the three-dimensional part coarse-granularity fusion network is built, it needs to be trained. In order to ensure the precision of three-dimensional content generation, the three-dimensional content sample data set has been subjected to scale preprocessing, that is, each complete three-dimensional content is located in a cube with a side length of 1, so only the three-dimensional position and the three-dimensional geometric scale of each three-dimensional component in space need to be predicted. In other words, the three-dimensional part coarse-granularity fusion network predicts the three-dimensional displacement and the three-dimensional scale required to transform a generated three-dimensional part back into the original three-dimensional content. Supervised training is performed with the three-dimensional position and the three-dimensional geometric scale obtained in the scaling process of the three-dimensional content sample data set as the true values of each three-dimensional component relative to the corresponding three-dimensional content. The true value can be expressed as a six-element vector whose elements respectively represent the three-dimensional displacement values and the three-dimensional scale values of the current three-dimensional component on the x, y and z coordinate axes, and the predicted value can be expressed as a six-element vector of the same form. The three-dimensional part coarse-granularity fusion network is trained under supervision by minimizing a loss function between the predicted value and the true value.

After the spatial position of each three-dimensional component has been predicted in the above embodiment, two three-dimensional components may still overlap, or a gap may remain at their boundary, which lowers the quality of the final three-dimensional content. To avoid this problem, this embodiment further provides an exemplary implementation of fine-tuning the spatial positions of the three-dimensional components so as to obtain the final high-quality complete three-dimensional content, which may include the following:

Before the offset information of each point of each three-dimensional component is predicted from the three-dimensional content description data and the fusion point cloud data, the fused point cloud needs to be sampled to a fixed number of points. Because the number of points in the point cloud of each three-dimensional part is fixed, three-dimensional parts with a small volume tend to have denser points, that is, richer detail. In order to improve the generation quality of the whole three-dimensional content, a suitable sampling mode needs to be selected so that a small three-dimensional part still has richer detail than a large three-dimensional part after sampling, which ensures that the final complete three-dimensional content also has rich detail. Based on this, the present embodiment may calculate the current point cloud number according to the total number of three-dimensional components and the point cloud number of each three-dimensional component; calculate the sampling point cloud number based on a preset sampling factor and the point cloud number of each three-dimensional component; and compare the current point cloud number with the sampling point cloud number to determine whether each three-dimensional component is up-sampled, down-sampled or not sampled. To ensure high accuracy, the preset sampling factor may be 2 or 4. Let the complete three-dimensional content contain $k$ three-dimensional components, and let the number of points of each three-dimensional part be $n_{part}$. The total number of points obtained in S103 is then $n_{merge} = k \cdot n_{part}$, the sampling point cloud number is $n_{sample} = m \cdot n_{part}$, and $n_{merge}$ needs to be sampled to $n_{sample}$.

That is, the point cloud of each three-dimensional part is sampled from its actual number of points $n_{part}$ to $n_{part} \cdot m / k$, where $m$ is the preset sampling factor and $k$ is the total number of three-dimensional parts. For each three-dimensional part there are three sampling cases, depending on the relationship between $k$ and $m$. If $k = m$, the three-dimensional part is not sampled. If $k > m$, downsampling is needed; to improve algorithm robustness and sampling speed, random downsampling is adopted, that is, 1 point is randomly selected and removed, and this is repeated until the number of remaining points equals $n_{part} \cdot m / k$. If $k < m$, upsampling is needed; to improve algorithm robustness and sampling speed, random upsampling can be adopted, that is, 1 point and its 2 nearest neighboring points are randomly selected, their barycenter is obtained by interpolation and added as a new point, and this is repeated until the number of points reaches $n_{part} \cdot m / k$.
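The three sampling cases can be sketched as follows. The function names are illustrative, the per-part target is rounded down when $n_{part} \cdot m / k$ is not an integer (an assumption; the text does not say how to round), and the neighbor search is a brute-force scan that is adequate only for small clouds.

```python
import random

def target_count(n_part, m, k):
    """Points each part keeps so that k parts total about m * n_part points."""
    return n_part * m // k

def random_downsample(points, target, rng):
    """Remove one randomly chosen point at a time until `target` points remain."""
    pts = list(points)
    while len(pts) > target:
        pts.pop(rng.randrange(len(pts)))
    return pts

def random_upsample(points, target, rng):
    """Add the barycenter of a random point and its 2 nearest neighbors until `target`."""
    pts = [list(p) for p in points]
    while len(pts) < target:
        i = rng.randrange(len(pts))
        anchor = pts[i]
        others = sorted(
            (q for j, q in enumerate(pts) if j != i),
            key=lambda q: sum((a - b) ** 2 for a, b in zip(anchor, q)),
        )
        group = [anchor] + others[:2]
        pts.append([sum(c) / len(group) for c in zip(*group)])
    return pts

def resample_part(points, m, k, rng=None):
    """Choose no sampling, downsampling or upsampling by comparing k with m."""
    rng = rng or random.Random(0)
    target = target_count(len(points), m, k)
    if k > m:
        return random_downsample(points, target, rng)
    if k < m:
        return random_upsample(points, target, rng)
    return list(points)

part = [[float(i), float(i % 3), 0.0] for i in range(8)]
down = resample_part(part, m=2, k=4)   # 8 -> 4 points
up = resample_part(part, m=4, k=2)     # 8 -> 16 points
same = resample_part(part, m=2, k=2)   # unchanged
```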

As an efficient way of calculating the offset information, the invention can construct a three-dimensional component fine-granularity fusion network in combination with a large-scale pre-training language model and calculate the offset information of the points of each three-dimensional component through this network, which can include the following:

Training a three-dimensional part fine-granularity fusion network in advance; inputting the three-dimensional content description data and the fusion point cloud data into the three-dimensional part fine-granularity fusion network as input data; and taking the output data of the three-dimensional part fine-granularity fusion network as the offset information of each point of each three-dimensional part.

In this embodiment, as shown in fig. 5, the three-dimensional component fine-granularity fusion network may include a point cloud global feature extraction sub-network, an overall text feature extraction sub-network and an offset prediction sub-network. The overall text feature extraction sub-network is used for extracting text features of the three-dimensional content description data, and any network structure capable of extracting high-dimensional text features can be used. By way of example, this embodiment may use the strong understanding capability of a large-scale pre-training language model, for example a Bert model as shown in fig. 5, to extract the high-dimensional overall text feature of the three-dimensional content description data, so that salient descriptors are extracted and the training effect and performance of the whole model are improved. The point cloud global feature extraction sub-network is used for extracting the global feature of the three-dimensional component point cloud data after the initial fusion in S103. For example, considering model structure and training cost, the point cloud global feature extraction sub-network can comprise a tenth, an eleventh, a twelfth and a thirteenth multi-layer perceptron (MLP) layer connected in sequence; these layers raise the dimension of the input three-dimensional component point cloud data, and the output features of the thirteenth MLP layer are then pooled to obtain the global feature of the three-dimensional component fusion point cloud data.

As shown in fig. 5, if the total number of points of the three-dimensional component point cloud data after initial fusion is n, the input data is n×3-dimensional. The features are gradually raised to n×2048 dimensions by 4 connected MLP layers and pooled to obtain a 1×2048-dimensional vector, which can be used as the global descriptor data. To supplement detail features, the output of any intermediate layer can be used as local descriptor data; for example, the intermediate n×1024 data can be used as the local descriptor data. The offset prediction sub-network can be used for fusing the features extracted by the point cloud global feature extraction sub-network and the overall text feature extraction sub-network with the local detail features, and continuously reducing the dimension of the fused features until six-dimensional output features are obtained. For example, the offset prediction sub-network may include a feature mixing layer and a fourteenth, a fifteenth, a sixteenth and a seventeenth MLP layer. The feature mixing layer mixes the output features of the twelfth MLP layer, which serve as the local features, with multiple copies of the three-dimensional component global feature and multiple copies of the overall text feature output by the overall text feature extraction sub-network; the mixed features are reduced in dimension by the fourteenth to seventeenth MLP layers to obtain the offset information of each point of each three-dimensional component.

Taking fig. 5 as an example, the three-dimensional component global feature is copied n times, the overall text feature is copied n times, and both are concatenated with the local descriptor to obtain an n×(1024+2048+768)-dimensional mixed descriptor. The final n×3-dimensional offset is then obtained through a four-layer MLP network. Finally, the offset is added to the point cloud coordinates after the coarse fusion in S103 to obtain the final fine-granularity fused three-dimensional content.
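The feature mixing and the final offset addition in the fig. 5 example can be sketched as below. The features are constant placeholders rather than network outputs, and the four-layer MLP that maps the mixed descriptor to offsets is not modeled; the offsets are supplied directly so that only the tiling, concatenation and addition are shown.

```python
def mix_features(local_feats, global_feat, text_feat):
    """Tile the global and text features once per point and concatenate with the local features."""
    return [list(loc) + list(global_feat) + list(text_feat) for loc in local_feats]

def apply_offsets(points, offsets):
    """Add the predicted per-point offset to the coarsely fused coordinates."""
    return [[c + d for c, d in zip(p, o)] for p, o in zip(points, offsets)]

n = 5
local_feats = [[0.1 * i] * 1024 for i in range(n)]   # placeholder n x 1024 local descriptors
global_feat = [1.0] * 2048                           # placeholder 1 x 2048 global descriptor
text_feat = [2.0] * 768                              # placeholder 1 x 768 text feature

mixed = mix_features(local_feats, global_feat, text_feat)   # n x 3840

coarse_points = [[float(i), 0.0, 0.0] for i in range(n)]
offsets = [[0.0, 0.25, -0.5] for _ in range(n)]      # stand-in for the MLP output (n x 3)
fine_points = apply_offsets(coarse_points, offsets)
```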

After the three-dimensional component fine-granularity fusion network is built, it needs to be trained. Because true values for the offsets of the three-dimensional points of each three-dimensional component are lacking, the three-dimensional component fine-granularity fusion network further comprises a point cloud identification sub-network that is used in the training process. The point cloud identification sub-network performs point cloud category identification on the corresponding three-dimensional point cloud sample data, and the descriptor data of the predicted category feature is extracted and used for supervision in place of the missing true value of each point offset. For example, as shown in fig. 5, this embodiment may use a supervised training method based on a pre-trained point cloud classification model: three-dimensional recognition is performed on the final fine-granularity fusion result of the three-dimensional component fine-granularity fusion network, and whether the result approaches the overall text description of the three-dimensional content is used to guide training. For example, a point cloud classification model pre-trained on a large-scale three-dimensional content sample data set such as ShapeNet classifies the complete three-dimensional point cloud after final fine-granularity fusion to obtain category text information; a large-scale pre-training language model such as Bert then extracts descriptors, and the L2-norm distance (Euclidean distance) between the descriptors is calculated as the training loss. Compared with comparing pure text categories, measuring the error in descriptor space makes the three-dimensional content after fine-granularity fusion conform better to the user's overall text description.

Let $f_{description\_gt}$ be the descriptor data of the overall text feature of the three-dimensional content description data and $f_{description\_pred}$ be the descriptor data of the predicted category feature; the classification supervision loss function is then $\lVert f_{description\_gt} - f_{description\_pred} \rVert^2$. Because the three-dimensional content after the coarse fusion in S103 is fused from several independently generated three-dimensional components, the offsets of the three-dimensional components need to be suppressed in order to better preserve the originally generated result and keep the detail of the fine-tuned components; that is, the predicted offset should be as small as possible, and the point-by-point offset loss function is $L_{translation} = \lVert \Delta T \rVert^2$. Therefore, the final loss function for training the three-dimensional component fine-granularity fusion network can be a fine-granularity loss function relation, and the network can be trained by directly calling this relation. The fine-granularity loss function relation can be expressed as:

$L = \alpha \lVert f_{description\_gt} - f_{description\_pred} \rVert^2 + \beta \lVert \Delta T \rVert^2$;

where $L$ represents the fine-granularity loss function relation, $\alpha$ is a first weight parameter, $\beta$ is a second weight parameter, $\Delta T$ represents the offset information, $f_{description\_gt}$ is the descriptor data of the overall text feature of the three-dimensional content description data, and $f_{description\_pred}$ is the descriptor data of the predicted category feature.
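A direct numerical reading of this relation is sketched below. The default weights of 1.0 are placeholders rather than values from the text, and the descriptors and offsets are tiny hand-made vectors chosen so that the result can be checked by hand.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fine_grained_loss(f_gt, f_pred, offsets, alpha=1.0, beta=1.0):
    """L = alpha * ||f_gt - f_pred||^2 + beta * ||dT||^2, with dT the n x 3 offsets."""
    description_term = squared_l2(f_gt, f_pred)
    translation_term = sum(d * d for off in offsets for d in off)
    return alpha * description_term + beta * translation_term

f_gt = [1.0, 0.0, 0.0]
f_pred = [0.0, 0.0, 0.0]
offsets = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
loss = fine_grained_loss(f_gt, f_pred, offsets, alpha=1.0, beta=2.0)
# description term = 1.0, translation term = 0.5, so loss = 1.0 + 2.0 * 0.5 = 2.0
```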

From the above, it is clear that higher quality three-dimensional content can be generated by the three-dimensional component fine-grained fusion network of the present embodiment.

The above embodiment does not limit how the part name text data is obtained from the three-dimensional content generation request based on the pre-training language model, and the present invention also provides an exemplary implementation, which may include the following:

Fine-tuning the pre-training language model on a question-answering task in advance, using a three-dimensional content question-answer sample set, to obtain a text question-answering model; acquiring the three-dimensional content description data according to the three-dimensional content generation request; acquiring the name of the three-dimensional content according to the three-dimensional content description data; and obtaining the part name text data, described in text form, by using the text question-answering model.

The pre-training language model has strong cognitive ability and can give high-quality replies to questions input in text form. In this embodiment, the task target is a question-answering task and the task data is the three-dimensional content question-answer sample set, which includes multiple training samples, each consisting of a three-dimensional content name and the component names it contains. Any pre-training language model is fine-tuned with the data set of the current downstream application task, namely the three-dimensional content question-answer sample set of the question-answering task, so that the pre-training language model is adapted to the downstream application task; the result is the text question-answering model of this step. Thus, this embodiment takes the name of each three-dimensional content as the text-form input and relies on the strong cognitive ability of the pre-training language model to obtain the set of component names contained in the three-dimensional content. If the large-scale pre-training language model is questioned directly, however, the quality of the answers may be low, that is, the obtained part name information may be inaccurate. Therefore, a question set and an auxiliary question set are generated according to the three-dimensional content part questions and the auxiliary questions input by the user; the auxiliary question set is input into the text question-answering model to obtain the auxiliary question answers corresponding to the auxiliary question set; and the question set and the auxiliary question answers are then input into the text question-answering model to obtain the part name text data.

In this embodiment, a question set is first constructed according to the questions input by the user. Since any three-dimensional content can be structurally divided into one or more parts, the part names of any three-dimensional content can be obtained by asking the large-scale pre-training language model. The question set includes multiple questions, for example, "Which component names are contained in the three-dimensional content?", "Into which parts can the three-dimensional content be divided?" and "Which three-dimensional parts can make up the three-dimensional content?". An auxiliary question set is then constructed according to the auxiliary questions input by the user. To help the large-scale pre-training language model answer the questions better, auxiliary questions can be set that concern basic principles related to the three-dimensional content; that is, the large-scale pre-training language model is given intermediate thinking steps, answering the auxiliary questions first and the questions afterwards. The auxiliary question set may include, for example, "What is the function of the three-dimensional content?", "What are the design principles of the three-dimensional content?" and "What is the three-dimensional content essentially?". Each auxiliary question in the auxiliary question set is then put to the large-scale pre-training language model to obtain a series of auxiliary question answers. Finally, the questions and the auxiliary question answers are input into the large-scale pre-training language model to obtain the final answer, namely the set of part names of the three-dimensional content.
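The two-stage questioning flow can be sketched as follows. The `ask` argument is a hypothetical callable that sends a prompt to the text question-answering model and returns its reply; the prompt layout, the `{name}` templates and the stub model are illustrative assumptions and not part of the described method.

```python
def ask_part_names(content_name, ask, questions, auxiliary_questions):
    """Answer the auxiliary questions first, then ask the main questions with those answers as context."""
    aux_pairs = []
    for template in auxiliary_questions:
        question = template.format(name=content_name)
        aux_pairs.append((question, ask(question)))
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in aux_pairs)
    main = "\n".join(t.format(name=content_name) for t in questions)
    return ask(f"{context}\n{main}")

QUESTIONS = [
    "Which component names are contained in a {name}?",
    "Into which parts can a {name} be divided?",
]
AUXILIARY = [
    "What is the function of a {name}?",
    "What are the design principles of a {name}?",
]

prompts = []

def stub_model(prompt):
    """Stand-in for the fine-tuned model; records prompts and returns canned replies."""
    prompts.append(prompt)
    return "seat, backrest, legs" if "component names" in prompt else "to sit on"

answer = ask_part_names("chair", stub_model, QUESTIONS, AUXILIARY)
```

The final prompt carries every auxiliary question and answer ahead of the main questions, which is the ordering the text describes.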

As can be seen from the above, the chain-of-thought approach based on the auxiliary question answers can greatly improve the answer accuracy of the pre-training language model and yields reasonable part name text data that better matches common understanding.

In order to make the technical solution of the present invention clearer to those skilled in the art, the present invention provides an exemplary embodiment that describes, by way of example, some possible application scenarios of the technical solution. As shown in fig. 6, which is a schematic diagram of a hardware composition framework to which the three-dimensional content generation method based on a pre-training language model provided by the present invention is applicable, the embodiment may include the following:

The hardware composition framework may include a server 61 and a client 62, which are connected through a network 63. The server 61 is configured to deploy a processor for executing the three-dimensional content generation method based on the pre-training language model according to any of the above embodiments. The client 62 has a human-computer interaction interface and is used to input various control instructions and text descriptions to the server 61, such as a three-dimensional content generation request and the questions in the question set and the auxiliary question set.

The server 61 performs model training based on the flow shown in fig. 7 and, after the model training is completed, performs inference as shown in fig. 8. In the training stage, first, a data set of correspondences between text and 3D parts is generated based on a large-scale pre-training language model, giving a large-scale text and 3D part correspondence data set; second, the two-channel condition control component generation network is designed and trained; third, the three-dimensional part coarse-granularity fusion network is designed and trained; fourth, the three-dimensional part fine-granularity fusion network is designed and trained. During inference, the part name text data corresponding to the three-dimensional content description data input by the user is first acquired based on the large-scale pre-training language model; the two-channel condition control component generation network is then called to generate the three-dimensional part point cloud data; the three-dimensional part coarse-granularity fusion network is called to determine the spatial position of each three-dimensional part for coarse-granularity fusion; finally, the three-dimensional part fine-granularity fusion network is called to calculate the offset information of each point of each three-dimensional part for fine-granularity fusion, and the final 3D content generation result is obtained from the fine-granularity fusion result.
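The inference order described above can be summarized in a short orchestration sketch. All four callables are hypothetical stand-ins for the trained models, the stub implementations exist only to make the example run, and the fixed-number sampling step between coarse and fine fusion is left out for brevity.

```python
def generate_3d_content(description, get_part_names, generate_part, coarse_fuse, fine_fuse):
    """Part names -> per-part point clouds -> coarse placement -> per-point offsets."""
    part_names = get_part_names(description)
    parts = [generate_part(description, name) for name in part_names]
    placed = [coarse_fuse(description, name, pts) for name, pts in zip(part_names, parts)]
    merged = [p for pts in placed for p in pts]      # fused point cloud of all parts
    offsets = fine_fuse(description, merged)         # one offset per fused point
    return [[c + d for c, d in zip(p, o)] for p, o in zip(merged, offsets)]

# Stub models, for illustration only.
def stub_names(description):
    return ["seat", "leg"]

def stub_generate(description, name):
    return [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

def stub_coarse(description, name, pts):
    shift = 1.0 if name == "leg" else 0.0
    return [[0.5 * c + shift for c in p] for p in pts]   # scale 0.5, then translate

def stub_fine(description, merged):
    return [[0.01, 0.0, 0.0] for _ in merged]

result = generate_3d_content("a chair", stub_names, stub_generate, stub_coarse, stub_fine)
```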

The method comprises the steps of generating a text and 3D component corresponding data set based on a large-scale pre-training language model, and obtaining a component name text data set generating part, a three-dimensional content multi-view rendering part, a multi-view pixel and component name corresponding part and a multi-view pixel and component name corresponding fusion part in the process of obtaining the large-scale text and 3D component corresponding data set. Wherein the three-dimensional content multi-view rendering section is: calculating pose information of the virtual camera according to the position and the orientation of the virtual camera; the orientation of the virtual camera is fixedly directed to each three-dimensional content of the three-dimensional content sample data set, and the virtual camera comprises a plurality of positions so as to cover and collect data under a plurality of view angles of each three-dimensional content; and respectively rendering each three-dimensional content according to each pose information of the virtual camera to obtain two-dimensional content images of the same three-dimensional content under different viewing angles. The multi-view pixel and part name corresponding part is: combining each part name in the part name text data and each two-dimensional content image obtained by rendering respectively to obtain a plurality of image-text combination results; and based on the part names in each image-text combination result, carrying out image segmentation processing on the two-dimensional content images in the corresponding image-text combination result by using a segmentation all-model to obtain the part name corresponding to each pixel in each two-dimensional content image. 
The multi-view pixel and component name correspondence fusion part includes: constructing a multi-view pixel and component name corresponding result fusion network structure; and inputting the multidimensional data information determined based on the three-dimensional content sample data set, together with the component name corresponding to each pixel, into the multi-view pixel and component name corresponding result fusion network structure to obtain all three-dimensional points corresponding to each component name. The multi-view pixel and component name corresponding result fusion network structure comprises a first input end, a second input end, a component category identification network, a data fusion network and an output end; the first input end is connected with the component category identification network, the component category identification network and the second input end are both connected with the data fusion network, and the data fusion network is connected with the output end; the first input end is used for inputting the multidimensional data information, and the second input end is used for inputting the component name corresponding to each pixel. In the training process, the fusion network structure further comprises a third input end, which is connected with the data fusion network and is used for inputting the component category corresponding to each pixel in each two-dimensional content image. It should be noted that the above application scenario is shown only to facilitate understanding of the idea and principle of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the invention may be applied to any applicable scenario.

As can be seen from the above, in this embodiment, the three-dimensional components included in the three-dimensional content are generated separately, and coarse granularity and fine granularity are fused respectively, so that the quality of the conditional three-dimensional content generation result can be greatly improved, high-quality three-dimensional content with more lifelike details can be generated, and a large number of personalized three-dimensional contents can be generated by controlling different component combinations, so that the diversity of three-dimensional content creation is greatly improved.

It should be noted that, in the present invention, there is no strict sequence of execution among the steps, so long as the sequence accords with the logic sequence, the steps may be executed simultaneously, or may be executed according to a certain preset sequence, and the above flowchart is only a schematic manner, and does not represent only such execution sequence.

The invention also provides a device corresponding to the three-dimensional content generation method based on the pre-training language model, which makes the method more practical. The device may be described from the perspective of functional modules and from the perspective of hardware. The three-dimensional content generating device based on a pre-training language model described below is used to implement the three-dimensional content generation method based on a pre-training language model provided by the present invention. In this embodiment, the device may include, or be divided into, one or more program modules, which are stored in a storage medium and executed by one or more processors to complete the three-dimensional content generation method based on a pre-training language model disclosed in the first embodiment. A program module in the present invention refers to a series of computer program instruction segments capable of performing a specific function, and is more suitable than the program itself for describing the execution of the three-dimensional content generating device in the storage medium. The functions of each program module of the present embodiment are described below; the device described below and the method described above may be referred to in correspondence with each other.

From the perspective of functional modules, referring to fig. 9, fig. 9 is a block diagram of a three-dimensional content generating device based on a pre-training language model according to an embodiment of the present invention, where the device may include:

The data generating module 901 is configured to generate part name text data and three-dimensional part point cloud data based on the three-dimensional content description data carried by the three-dimensional content generation request; the text features of the three-dimensional content description data and of the part name text data are extracted by using a text feature extraction model; the text feature extraction model is a network model obtained by fine-tuning a pre-training language model on a text feature extraction task using a three-dimensional content sample data set.

The spatial position determining module 902 is configured to determine a spatial position of each three-dimensional component according to the three-dimensional content description data, the component name text data, and the three-dimensional component point cloud data, and move each three-dimensional component to a corresponding spatial position to obtain initial fusion point cloud data.

The sampling module 903 is configured to sample the initial fusion point cloud data by a fixed number to obtain fusion point cloud data.

The offset calculating module 904 is configured to calculate offset information of each point of each three-dimensional component according to the three-dimensional content description data and the fusion point cloud data.

The content generating module 905 is configured to adjust the spatial position of the corresponding three-dimensional component based on the offset information, and generate three-dimensional content according to each adjusted three-dimensional component.
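The data flow through modules 901 to 905 can be sketched as a simple composition of callables. This is only an illustration under stated assumptions: the class name `ContentGenerator`, the callable signatures, the representation of a spatial position as a (displacement, scale) pair, and the stand-in lambdas are all hypothetical, and the real modules are the trained networks described in this specification.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class ContentGenerator:
    generate_parts: Callable    # role of module 901: description -> (names, part clouds)
    locate_parts: Callable      # role of module 902: -> one (displacement, scale) per part
    resample: Callable          # role of module 903: fixed-number sampling of the parts
    predict_offsets: Callable   # role of module 904: per-point offsets for each part

    def run(self, description):
        names, clouds = self.generate_parts(description)
        placements = self.locate_parts(description, names, clouds)
        # Move each part to its spatial position (initial fusion point cloud).
        placed = [cloud * scale + shift
                  for cloud, (shift, scale) in zip(clouds, placements)]
        fused = self.resample(placed)
        offsets = self.predict_offsets(description, fused)
        # Role of module 905: adjust every point by its offset and merge the parts.
        return np.concatenate([c + o for c, o in zip(fused, offsets)], axis=0)

# Stand-in callables so the sketch runs end to end; they are not trained models.
generator = ContentGenerator(
    generate_parts=lambda d: (["seat", "leg"], [np.zeros((4, 3)), np.ones((4, 3))]),
    locate_parts=lambda d, names, clouds: [
        (np.zeros(3), np.ones(3)),
        (np.array([0.0, 0.0, -1.0]), np.ones(3)),
    ],
    resample=lambda parts: [p[:2] for p in parts],
    predict_offsets=lambda d, parts: [np.full(p.shape, 0.1) for p in parts],
)
content = generator.run("a chair")
```

Keeping each stage behind a plain callable makes it easy to swap a stand-in for the corresponding trained network without changing the flow.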

Illustratively, in some implementations of the present embodiment, the data generating module 901 may further be configured to:

As an exemplary implementation of the foregoing embodiment, the foregoing data generating module 901 may further be configured to:

pre-building a two-channel condition control component generation network; training the two-channel condition control component generation network by utilizing the three-dimensional content sample data set; the two-channel condition control component generation network comprises a whole text description sub-extraction network, a component text description sub-extraction network, a diffusion time processing network, a point cloud data processing network, a sequence data processing neural network and a predicted point cloud noise output network; wherein the whole text description sub-extraction network is used for generating whole control description sub-data of the three-dimensional content description data, the component text description sub-extraction network is used for generating component control description sub-data of the component name text data, the diffusion time processing network is used for generating diffusion time description sub-data, the point cloud data processing network is used for generating point cloud data description sub-data, the sequence data processing neural network is used for learning feature representations of the whole control description sub-data, the component control description sub-data, the diffusion time description sub-data and the point cloud data description sub-data, and the predicted point cloud noise output network is used for predicting the point cloud noise data at the current moment based on the feature representations; the sequence data processing neural network is a pre-training language model.

Illustratively, the whole text description sub-extraction network comprises a whole text input end, a whole text feature extraction model and a whole control sub-generation model; the whole text input end is used for inputting the three-dimensional content description data, and the whole text feature extraction model is used for extracting text features of the three-dimensional content description data as whole initial description sub-data; the whole control sub-generation model is used for subdividing the features in the whole initial description sub-data to obtain fine-grained whole text features, which serve as the whole control description sub-data; the text feature extraction model includes the trained whole text feature extraction model.

Illustratively, the component text description sub-extraction network includes a component text input, a component text feature extraction model, and a component control sub-generation model; the component text input end is used for inputting component name text data, and the component text feature extraction model is used for extracting text features of the component name text data and used as component initial description sub-data; the component control sub-generation model is used for conducting subdivision processing on features in the component initial description sub-data to obtain fine-grained component text features serving as component control description sub-data; the text feature extraction model includes a trained component text feature extraction model.

Illustratively, the predicted point cloud noise output network includes a noise point cloud data generation model; the noise point cloud data generation model is used for carrying out regression analysis on the diffusion sub-output description sub-data output by the sequence data processing neural network to obtain point cloud noise data added by each diffusion.

As another exemplary implementation of the above embodiment, the above data generating module 901 may further be configured to:

calculating sample noise point cloud data from the noise-free three-dimensional point cloud data in the three-dimensional content sample data set; acquiring training sample data corresponding to the three-dimensional content description data and training sample data corresponding to the component name text data from the three-dimensional content sample data set; setting the whole text description sub-extraction network to an inactive state, setting the component text description sub-extraction network to an active state, and jointly training the component text description sub-extraction network by using the sample noise point cloud data, the diffusion time and the training sample data corresponding to the component name text data; freezing the trained component text description sub-extraction network, setting the whole text description sub-extraction network to an active state, and jointly training the whole text description sub-extraction network by using the sample noise point cloud data, the diffusion time and the training sample data corresponding to the three-dimensional content description data; and unfreezing the trained component text description sub-extraction network, and retraining the component text description sub-extraction network and the whole text description sub-extraction network by using the sample noise point cloud data, the diffusion time, the training sample data corresponding to the three-dimensional content description data and the training sample data corresponding to the component name text data.
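The three training stages described above can be written down as a small schedule table. The sketch below only records which branch is trainable and which text inputs are used in each stage; the `Stage` class, its field names and the branch labels `"part"` and `"whole"` are hypothetical, and the actual optimisation loop is omitted.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Stage:
    name: str
    whole_branch_active: bool          # is the whole-text branch switched on?
    trainable_branches: Tuple[str, ...]
    text_inputs: Tuple[str, ...]       # besides noisy point cloud and diffusion time

def training_schedule():
    return [
        # Stage 1: whole-text branch inactive, component-text branch trained alone.
        Stage("part branch alone", False, ("part",), ("part_names",)),
        # Stage 2: component-text branch frozen, whole-text branch activated and trained.
        Stage("whole branch, part frozen", True, ("whole",), ("description",)),
        # Stage 3: component-text branch unfrozen, both branches retrained together.
        Stage("joint retraining", True, ("part", "whole"), ("description", "part_names")),
    ]

schedule = training_schedule()
for stage in schedule:
    print(stage.name, "| trainable:", stage.trainable_branches, "| text:", stage.text_inputs)
```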

invoking a noise point cloud data calculation relation, and calculating sample noise point cloud data; the noise point cloud data calculation relation is as follows:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

wherein $x_t$ is the sample noise point cloud data at time $t$, $x_0$ is the noise-free three-dimensional point cloud data in the three-dimensional content sample data set, $\bar{\alpha}_t$ is the preset noise weighting factor that varies with time $t$, and $\epsilon$ is the noise sampled from the standard normal distribution.
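A minimal sketch of computing the sample noise point cloud data follows. It assumes the relation has the usual denoising-diffusion forward form, x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps, where a_t is the time-dependent weighting factor; the linear beta schedule, its end points, and the names `make_alpha_bar` and `add_noise` are assumptions chosen for illustration, not values taken from this specification.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative noise weighting factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Return the noisy point cloud x_t and the noise eps that was added."""
    eps = rng.standard_normal(x0.shape)            # noise from N(0, I)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(num_steps=1000)
x0 = rng.standard_normal((2048, 3))                # stand-in for a clean part point cloud
x_t, eps = add_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Because the weighting factor decreases with `t`, later time steps keep less of the clean point cloud and more of the sampled noise.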

taking the sample noise point cloud data, the diffusion time, the training sample data corresponding to the three-dimensional content description data and the training sample data corresponding to the part name text data as input data, taking the predicted point cloud noise data added by each diffusion as output, and calling the component generation loss function relation to train the two-channel condition control component generation network;

wherein the component generation loss function relationship is:

$$L_{part} = \left\| \epsilon - \epsilon_\theta\left(x_t, t, C_{full}, C_{part}\right) \right\|^2$$

wherein $\epsilon$ is the noise sampled from the standard normal distribution, $\epsilon_\theta$ is the predicted point cloud noise added by each diffusion, $x_t$ is the sample noise point cloud data at time $t$, $C_{full}$ is the three-dimensional content description data as a whole, and $C_{part}$ is the part name text data.
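As a hedged illustration of the component generation loss, the sketch below compares the sampled noise with a predicted noise using a squared L2 distance averaged over points. It assumes the loss is a standard noise-prediction objective; the function name `noise_prediction_loss` and the averaging over points are assumptions, and the predicted noise would in practice come from the two-channel condition control component generation network.

```python
import numpy as np

def noise_prediction_loss(eps_true, eps_pred):
    """Mean over points of the squared L2 distance between true and predicted noise."""
    diff = np.asarray(eps_true, dtype=float) - np.asarray(eps_pred, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

eps_true = np.zeros((4, 3))      # stand-in for noise sampled from N(0, I)
eps_pred = np.ones((4, 3))       # stand-in for the network's predicted noise
loss = noise_prediction_loss(eps_true, eps_pred)
```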

As still another exemplary implementation of the foregoing embodiment, the foregoing data generating module 901 may further be configured to:

Illustratively, in some implementations of this embodiment, the above-described spatial location determination module 902 may further be configured to:

acquiring high-dimensional overall text features of the three-dimensional content description data; acquiring high-dimensional part text features of the part name text data; acquiring three-dimensional part global features of the three-dimensional part point cloud data; combining the high-dimensional overall text features, the high-dimensional part text features and the three-dimensional part global features into a joint feature, and successively reducing the dimension of the joint feature to obtain the three-dimensional displacement and the three-dimensional scale of each three-dimensional part; and determining the spatial position of the corresponding three-dimensional part according to the three-dimensional scale and the three-dimensional displacement of each three-dimensional part.

As an exemplary implementation of the above embodiment, the above spatial location determination module 902 may further be configured to:

Training a coarse-granularity fusion network of the three-dimensional part in advance; the three-dimensional part coarse-granularity fusion network comprises a text feature extraction sub-network, a feature joint layer and a joint feature processing sub-network; three-dimensional content description data, part name text data and three-dimensional part point cloud data are used as input data, and a three-dimensional part coarse granularity fusion network is input; and taking the output data of the coarse-grained fusion network of the three-dimensional components as the spatial position of each three-dimensional component.

Illustratively, the text feature extraction sub-network comprises an overall text high-dimensional feature extraction model, a component text high-dimensional feature extraction model and a global feature extraction model; the overall text high-dimensional feature extraction model is used for extracting high-dimensional overall text features of the three-dimensional content description data; the component text high-dimensional feature extraction model is used for extracting high-dimensional component text features of the component name text data; the global feature extraction model comprises a first multilayer perceptron layer, a second multilayer perceptron layer, a third multilayer perceptron layer and a fourth multilayer perceptron layer which are sequentially connected, and the output features of the fourth multilayer perceptron layer are pooled to obtain the global features of the three-dimensional component point cloud data; the text feature extraction model comprises the trained overall text high-dimensional feature extraction model and the trained component text high-dimensional feature extraction model.
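The global feature extraction model, four perceptron layers followed by pooling, can be sketched in NumPy as below. The layer widths, the ReLU activation, the random weights and the choice of max pooling are assumptions made for illustration; the description only states that four sequentially connected layers are followed by a pooling step.

```python
import numpy as np

def shared_mlp(points, weights, biases):
    """Apply the same stack of linear + ReLU layers to every point independently."""
    h = points
    for w, b in zip(weights, biases):
        h = np.maximum(h @ w + b, 0.0)
    return h

def global_feature(points, weights, biases):
    """Pool per-point features into one vector (max pooling is an assumed choice)."""
    return shared_mlp(points, weights, biases).max(axis=0)

rng = np.random.default_rng(0)
dims = [3, 64, 128, 256, 512]                     # hypothetical widths of the four layers
weights = [rng.standard_normal((i, o)) * 0.1 for i, o in zip(dims[:-1], dims[1:])]
biases = [np.zeros(o) for o in dims[1:]]
cloud = rng.standard_normal((1024, 3))            # stand-in for a part point cloud
feat = global_feature(cloud, weights, biases)
```

A symmetric pooling operation such as the maximum makes the global feature independent of the order in which the points are stored.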

Illustratively, the joint feature processing sub-network includes a fifth multilayer perceptron layer, a sixth multilayer perceptron layer, a seventh multilayer perceptron layer and an eighth multilayer perceptron layer; the joint feature processing sub-network is used for successively reducing the dimension of the joint feature to 1×6 through the fifth, sixth, seventh and eighth multilayer perceptron layers, so as to obtain the three-dimensional displacement and the three-dimensional scale of the three-dimensional component.
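How a 1×6 output could be applied to a part is sketched below. The ordering of the six values (displacement first, then scale), scaling about the origin, and the function name `place_part` are assumptions for illustration only; the description does not fix these conventions.

```python
import numpy as np

def place_part(points, placement):
    """Scale a part point cloud and move it to its spatial position.

    `placement` is a 1x6 vector, assumed here to be laid out as
    [dx, dy, dz, sx, sy, sz].
    """
    placement = np.asarray(placement, dtype=float).reshape(6)
    displacement, scale = placement[:3], placement[3:]
    return np.asarray(points, dtype=float) * scale + displacement

corner = np.array([[1.0, 1.0, 1.0]])
moved = place_part(corner, [1.0, 2.0, 3.0, 2.0, 2.0, 2.0])
```

Applying `place_part` to every part with its own 1×6 vector yields the initial fusion point cloud data.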

Illustratively, in other implementations of this embodiment, the sampling module 903 may be further configured to:

As an exemplary implementation of the above embodiment, the sampling module 903 may be further configured to:

calculating the current point cloud number according to the total number of the three-dimensional parts and the total point cloud number of the three-dimensional part point clouds; calculating the number of sampling point clouds based on a preset sampling factor and the total number of three-dimensional components; determining a sampling mode adopted by each three-dimensional component by comparing the current point cloud number with the sampling point cloud number; the sampling modes comprise an up-sampling mode, a down-sampling mode and a non-sampling mode.

As another exemplary implementation of the above embodiment, the sampling module 903 may be further configured to:

resampling the point cloud of each three-dimensional part from the actual number of points of that part to $n_{part} \cdot m / k$ points, wherein $n_{part}$ is the actual number of points of the three-dimensional part, $m$ is the preset sampling factor, and $k$ is the total number of three-dimensional parts;

if $k = m$, the corresponding three-dimensional part is not sampled;

if $k > m$, the corresponding three-dimensional part is randomly down-sampled;
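The sampling rule can be sketched as follows. The cases k = m and k > m follow the text above; the k < m branch (up-sampling by duplicating randomly chosen points) is an assumption added so that the function covers all three sampling modes, and the function name `resample_part` and the rounding of the target count are likewise hypothetical.

```python
import numpy as np

def resample_part(points, m, k, rng):
    """Resample one part from n_part points to about n_part * m / k points."""
    n_part = points.shape[0]
    target = max(1, round(n_part * m / k))
    if k == m:
        return points                                  # no sampling
    if k > m:
        keep = rng.choice(n_part, size=target, replace=False)
        return points[keep]                            # random down-sampling
    # k < m: assumed up-sampling by repeating randomly chosen points
    extra = rng.choice(n_part, size=target - n_part, replace=True)
    return np.concatenate([points, points[extra]], axis=0)

rng = np.random.default_rng(0)
part = rng.standard_normal((100, 3))
same = resample_part(part, m=4, k=4, rng=rng)
fewer = resample_part(part, m=2, k=4, rng=rng)
more = resample_part(part, m=8, k=4, rng=rng)
```

If every part starts with the same number of points n, the k resampled parts together hold about n·m points, which is consistent with sampling the fused point cloud to a fixed number.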

As yet another exemplary implementation of the foregoing embodiment, the foregoing offset calculation module 904 may be further configured to:

training a three-dimensional part fine granularity fusion network in advance; the three-dimensional component fine granularity fusion network comprises a point cloud global feature extraction sub-network, an overall text feature extraction sub-network and an offset prediction sub-network; the three-dimensional content description data and fusion point cloud data are used as input data, and a three-dimensional component fine granularity fusion network is input; taking the output data of the three-dimensional component fine-granularity fusion network as offset information of each point of each three-dimensional component; the text feature extraction model comprises a trained integral text feature extraction sub-network.

Illustratively, the point cloud global feature extraction sub-network comprises a tenth multilayer perceptron layer, an eleventh multilayer perceptron layer, a twelfth multilayer perceptron layer and a thirteenth multilayer perceptron layer which are sequentially connected, and the output features of the thirteenth multilayer perceptron layer are pooled to obtain the global features of the three-dimensional component fusion point cloud data.

Illustratively, the offset prediction sub-network includes a feature mixing layer, a fourteenth multilayer perceptron layer, a fifteenth multilayer perceptron layer, a sixteenth multilayer perceptron layer and a seventeenth multilayer perceptron layer; the feature mixing layer is used for mixing the copied output features of the twelfth multilayer perceptron layer, the copied global features of the three-dimensional component and the copied overall text features output by the overall text feature extraction sub-network; the mixed features are reduced in dimension through the fourteenth, fifteenth, sixteenth and seventeenth multilayer perceptron layers to obtain the offset information of each point of each three-dimensional component.
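One plausible reading of the feature mixing layer is sketched below: the global point cloud feature and the overall text feature are replicated once per point and concatenated with the per-point features, so that every point sees the same global context. This interpretation of the "copy features", the feature widths, and the function name `mix_features` are assumptions for illustration.

```python
import numpy as np

def mix_features(point_feats, cloud_global, text_global):
    """Concatenate per-point features with tiled global cloud and text features."""
    n = point_feats.shape[0]
    tiled_cloud = np.tile(cloud_global, (n, 1))    # same global cloud feature for each point
    tiled_text = np.tile(text_global, (n, 1))      # same overall text feature for each point
    return np.concatenate([point_feats, tiled_cloud, tiled_text], axis=1)

rng = np.random.default_rng(0)
point_feats = rng.standard_normal((5, 4))          # hypothetical per-point features
cloud_global = rng.standard_normal(3)              # hypothetical pooled global feature
text_global = rng.standard_normal(2)               # hypothetical overall text feature
mixed = mix_features(point_feats, cloud_global, text_global)
```

The mixed matrix has one row per point and would then be passed through the four dimension-reducing layers to produce a three-value offset for each point.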

calling a fine granularity loss function relation, and training a fine granularity fusion network of the three-dimensional part; the fine grain loss function relationship is:

$$L = \alpha \left\| f_{description\_gt} - f_{description\_pred} \right\|^2 + \beta \left\| \Delta T \right\|^2$$

wherein $L$ represents the fine-granularity loss function relation, $\alpha$ is a first weight parameter, $\beta$ is a second weight parameter, $f_{description\_gt}$ is the description sub-data of the overall text feature of the three-dimensional content description data, and $f_{description\_pred}$ is the description sub-data of the predicted category feature. The three-dimensional part fine-granularity fusion network further comprises a point cloud identification sub-network used in the training process; the point cloud identification sub-network is used for identifying the point cloud category of the corresponding three-dimensional point cloud sample data and extracting the description sub-data of the predicted category feature, which serves as the true value for the offset of each point of each three-dimensional part of the three-dimensional point cloud sample data.
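The fine-granularity loss can be evaluated numerically as in the sketch below. It treats ΔT as the array of predicted per-point offsets and uses squared L2 norms for both terms; that reading of ΔT, the default weights, and the function name `fine_grained_loss` are assumptions, since this passage does not define ΔT explicitly.

```python
import numpy as np

def fine_grained_loss(f_gt, f_pred, delta_t, alpha=1.0, beta=0.1):
    """alpha * ||f_gt - f_pred||^2 + beta * ||delta_t||^2 with squared L2 norms."""
    f_gt = np.asarray(f_gt, dtype=float)
    f_pred = np.asarray(f_pred, dtype=float)
    delta_t = np.asarray(delta_t, dtype=float)
    feature_term = np.sum((f_gt - f_pred) ** 2)    # descriptor mismatch
    offset_term = np.sum(delta_t ** 2)             # penalises large offsets
    return float(alpha * feature_term + beta * offset_term)

value = fine_grained_loss(
    f_gt=[1.0, 0.0],
    f_pred=[0.0, 0.0],
    delta_t=[[1.0, 1.0, 1.0]],
    alpha=2.0,
    beta=0.5,
)
```

Under this reading, the second term acts as a regulariser that keeps the fine-grained adjustment from moving points far from their coarse positions.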

fine-tuning the pre-training language model in advance on a question-answering task by using a three-dimensional content question-answer sample set, to obtain a text question-answering model; acquiring the three-dimensional content description data according to the three-dimensional content generation request based on the pre-training language model; acquiring the name of the three-dimensional content according to the three-dimensional content description data; and obtaining the part name text data, described in text form, by using the text question-answering model.

According to the three-dimensional content part question and the auxiliary question input by the user, generating a question set and an auxiliary question set; inputting the auxiliary question set into a text question-answering model, and obtaining auxiliary question answers corresponding to the auxiliary question set by using the text question-answering model; and inputting the question set and the auxiliary question answers into a text question-answer model to obtain the text data of the part names.
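The two-step questioning described above (auxiliary questions first, then the part question with the auxiliary answers as context) can be sketched as follows. The `ask` callable stands in for the text question-answering model; the prompt wording, the comma-separated answer format, and the names `query_part_names` and `fake_ask` are all hypothetical.

```python
from typing import Callable, List, Sequence

def query_part_names(content_name: str,
                     ask: Callable[[str], str],
                     auxiliary_questions: Sequence[str]) -> List[str]:
    # Step 1: answer the auxiliary question set.
    aux_answers = [ask(q.format(name=content_name)) for q in auxiliary_questions]
    # Step 2: ask the part question with the auxiliary answers supplied as context.
    context = "\n".join(aux_answers)
    question = f"List the parts of a {content_name}, separated by commas."
    answer = ask(f"{context}\n{question}")
    return [p.strip() for p in answer.split(",") if p.strip()]

def fake_ask(prompt: str) -> str:
    """Stand-in for the fine-tuned question-answering model (not a real model)."""
    if "separated by commas" in prompt:
        return "seat, backrest, legs"
    return "A chair is a piece of furniture for sitting."

parts = query_part_names("chair", fake_ask, ["What is a {name}?"])
```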

The functions of each functional module of the three-dimensional content generating device based on the pre-training language model can be specifically realized according to the method in the method embodiment, and the specific implementation process can refer to the related description of the method embodiment, which is not repeated here.

From the above, the present embodiment can effectively improve the quality of three-dimensional content generation.

The three-dimensional content generating device based on the pre-training language model is described from the perspective of a functional module, and further, the invention also provides electronic equipment, which is described from the perspective of hardware. Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 10, the electronic device comprises a memory 100 for storing a computer program; a processor 101 for implementing the steps of the method for generating three-dimensional content based on a pre-trained language model as mentioned in any of the embodiments above when executing a computer program.

Processor 101 may include one or more processing cores, such as a 4-core processor or an 8-core processor; the processor 101 may also be a controller, microcontroller, microprocessor, or other data processing chip. The processor 101 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 101 may also include a main processor and a coprocessor; the main processor, also called a CPU (Central Processing Unit), is a processor for processing data in the awake state, and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 101 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 101 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.

Memory 100 may include one or more computer-readable storage media, which may be non-transitory. Memory 100 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the memory 100 may be an internal storage unit of the electronic device, such as a hard disk of a server. In other embodiments, the memory 100 may be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on a server. Further, the memory 100 may include both an internal storage unit and an external storage device of the electronic device. The memory 100 may be used not only to store application software installed in the electronic device and various types of data, such as the code of the program executed during the three-dimensional content generation method based on the pre-training language model, but also to temporarily store data that has been output or is to be output. In this embodiment, the memory 100 is at least used to store a computer program 1001 which, when loaded and executed by the processor 101, implements the relevant steps of the three-dimensional content generation method based on a pre-training language model disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 100 may further include an operating system 1002, data 1003, and the like, and the storage may be transient or permanent. The operating system 1002 may include Windows, Unix, Linux, and the like. The data 1003 may include, but is not limited to, data corresponding to the three-dimensional content generation results based on the pre-training language model.

In some embodiments, the electronic device may further include a display 102, an input/output interface 103, a communication interface 104 (also referred to as a network interface), a power supply 105, and a communication bus 106. The display 102 and the input/output interface 103, such as a keyboard, belong to the user interface, which may optionally also include standard wired interfaces, wireless interfaces, and the like. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, and is used for displaying information processed in the electronic device and for displaying a visual user interface. The communication interface 104 may optionally include a wired interface and/or a wireless interface, such as a Wi-Fi interface or a Bluetooth interface, and is typically used to establish a communication connection between the electronic device and other electronic devices. The communication bus 106 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 10, but this does not mean that there is only one bus or one type of bus.

Those skilled in the art will appreciate that the configuration shown in fig. 10 is not limiting of the electronic device and may include more or fewer components than shown, for example, may also include sensors 107 to perform various functions.

The functions of each functional module of the electronic device according to the present invention may be specifically implemented according to the method in the above method embodiment, and the specific implementation process may refer to the relevant description of the above method embodiment, which is not repeated herein.

It will be appreciated that if the three-dimensional content generation method based on the pre-training language model in the above embodiment is implemented in the form of a software functional unit and sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the related art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and performs all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, registers, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk, or an optical disk.

Based on this, the present invention also provides a readable storage medium storing a computer program which, when executed by a processor, performs the steps of the three-dimensional content generating method based on a pre-trained language model according to any one of the embodiments above.

In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device and the electronic equipment disclosed in the embodiments correspond to the method disclosed in the embodiments, their description is relatively brief, and the relevant points may be found in the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The three-dimensional content generation method, device, electronic equipment and readable storage medium based on a pre-training language model provided by the invention have been described in detail above. The principles and embodiments of the present invention have been described herein with reference to specific examples, and the description of these examples is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that those skilled in the art can make various modifications and adaptations to the invention without departing from its principles, and these modifications and adaptations are intended to fall within the scope of the invention as defined in the following claims.

Claims

1. A method for generating three-dimensional content based on a pre-trained language model, comprising:

2. The method for generating three-dimensional content based on a pre-training language model according to claim 1, wherein generating the part name text data and the three-dimensional part point cloud data based on the three-dimensional content description data carried by the three-dimensional content generation request comprises:

3. The method for generating three-dimensional content based on a pre-training language model according to claim 2, wherein before the step of obtaining three-dimensional component point cloud data carrying color parameters by removing noise in the data to be processed in each iteration process, the method further comprises:

pre-building a two-channel condition control part generation network;

training the two-channel condition control part generation network using the three-dimensional content sample data set;

4. The method for generating three-dimensional content based on a pre-training language model according to claim 3, wherein the whole text description sub-extraction network comprises a whole text input end, a whole text feature extraction model and a whole control sub-generation model;

5. The method for generating three-dimensional content based on a pre-training language model according to claim 3, wherein the component text description sub-extraction network comprises a component text input end, a component text feature extraction model and a component control sub-generation model;

6. The method for generating three-dimensional content based on a pre-training language model according to claim 3, wherein the predicted point cloud noise output network comprises a noise point cloud data generation model;

7. The method for generating three-dimensional content based on a pre-training language model according to claim 3, wherein the training the two-channel condition control part generation network comprises:

8. The method for generating three-dimensional content based on a pre-training language model according to claim 7, wherein the calculating sample noise point cloud data according to the three-dimensional point cloud data without noise in the three-dimensional content sample data set comprises:

；

wherein $x_t$ is the sample noise point cloud data at time $t$; $x_0$ is the noise-free three-dimensional point cloud data in the three-dimensional content sample data set; the weighting factor in the relationship is a preset noise weighting factor that varies with time $t$; and the noise term is noise sampled from the standard normal distribution.
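As an illustration of the noising step recited in claim 8, the following minimal sketch assumes the forward relation commonly used in diffusion models, x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps, where a_t stands for the time-dependent noise weighting factor and eps is standard normal noise. The formula itself is not reproduced in this text, so this form, the linear schedule, and the names `alpha_bar` and `add_noise` are assumptions made for illustration rather than the claimed relationship.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative weighting factor at step t, assuming a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(x0, t, rng=random, **schedule):
    """Return (x_t, eps) for a point cloud x0 given as a list of (x, y, z) tuples."""
    a = alpha_bar(t, **schedule)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    # One standard normal sample per coordinate of every point.
    eps = [tuple(rng.gauss(0.0, 1.0) for _ in p) for p in x0]
    xt = [tuple(signal * c + noise * e for c, e in zip(p, n))
          for p, n in zip(x0, eps)]
    return xt, eps
```

Under these assumptions, a small t leaves the point cloud almost unchanged, while a t close to `num_steps` yields nearly pure noise. Returning `eps` alongside `x_t` is convenient because a denoising network is typically trained to predict that noise.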

9. The method for generating three-dimensional content based on a pre-training language model according to claim 7, wherein the training the two-channel condition control part generation network comprises:

wherein the part generation loss function relationship is:

；

10. The method for generating three-dimensional content based on a pre-training language model according to claim 7, wherein before the training the two-channel condition control part generation network, the method further comprises:

11. The three-dimensional content generation method based on a pre-training language model according to any one of claims 1 to 10, wherein the determining the spatial position of each three-dimensional part from the three-dimensional content description data, the part name text data, and the three-dimensional part point cloud data comprises:

acquiring the global features of the three-dimensional part point cloud data;

12. The method for generating three-dimensional content based on a pre-training language model according to claim 11, wherein the determining a spatial position of each three-dimensional part from the three-dimensional content description data, the part name text data, and the three-dimensional part point cloud data comprises:

13. The method for generating three-dimensional content based on a pre-training language model according to claim 12, wherein the text feature extraction sub-network comprises a whole text high-dimensional feature extraction model, a part text high-dimensional feature extraction model and a global feature extraction model;

the overall text high-dimensional feature extraction model is used for extracting high-dimensional overall text features of the three-dimensional content description data; the part text high-dimensional feature extraction model is used for extracting high-dimensional part text features of the part name text data; the global feature extraction model comprises a first multi-layer perceptron layer, a second multi-layer perceptron layer, a third multi-layer perceptron layer and a fourth multi-layer perceptron layer which are sequentially connected, and the output features of the fourth multi-layer perceptron layer are pooled to obtain the three-dimensional part global features of the three-dimensional part point cloud data;

The text feature extraction model comprises a trained overall text high-dimensional feature extraction model and a trained part text high-dimensional feature extraction model.
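The global feature extraction model recited in claim 13 (four sequentially connected multi-layer perceptron (MLP) layers followed by pooling) resembles a PointNet-style encoder. The sketch below is a minimal pure-Python illustration of that structure. The layer widths, the ReLU activation, the untrained random weights, the xyz-only input, and the choice of max pooling are all assumptions, since the claim specifies only the layer order and that pooling is applied; the function names are hypothetical.

```python
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one shared per-point layer (untrained)."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def apply_layer(layer, feats):
    """Apply the same linear map plus ReLU to the features of a single point."""
    w, b = layer
    return [max(0.0, sum(wi * f for wi, f in zip(row, feats)) + bi)
            for row, bi in zip(w, b)]

def global_feature(points, layers):
    """Run every point through the layers in order, then max-pool over points."""
    per_point = []
    for p in points:
        feats = list(p)
        for layer in layers:
            feats = apply_layer(layer, feats)
        per_point.append(feats)
    # Max pooling per feature channel makes the result independent of point order.
    return [max(col) for col in zip(*per_point)]

rng = random.Random(0)
widths = [3, 16, 32, 64, 128]  # assumed sizes: xyz input, then four layers
layers = [make_layer(a, b, rng) for a, b in zip(widths, widths[1:])]
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(50)]
feature = global_feature(cloud, layers)  # one vector describing the whole part
```

Because the same layers are applied to every point and the pooling is symmetric, the resulting global feature does not depend on the order in which the points are stored, which is the usual reason for choosing this design for point clouds.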

14. The method for generating three-dimensional content based on a pre-training language model according to claim 12, wherein the joint feature processing sub-network comprises a fifth multi-layer perceptron layer, a sixth multi-layer perceptron layer, a seventh multi-layer perceptron layer, and an eighth multi-layer perceptron layer;

15. The method for generating three-dimensional content based on a pre-training language model according to claim 1, wherein the sampling the initial fusion point cloud data by a fixed number to obtain fusion point cloud data comprises:

16. The method for generating three-dimensional content based on a pre-training language model according to claim 15, wherein the individually sampling each three-dimensional part based on the actual point cloud data of each three-dimensional part and the total point cloud number of the three-dimensional part point clouds based on the total number of three-dimensional parts comprises:

17. The method for generating three-dimensional content based on a pre-training language model according to claim 16, wherein the determining the sampling pattern adopted by each three-dimensional component by comparing the current point cloud number and the sampling point cloud number comprises:

the number of points of each three-dimensional part is sampled from the current point cloud number of the actual point cloud data of the three-dimensional part to the sampling point cloud number $n_{part} \cdot m / k$, where $n_{part}$ is the number of points in the actual point cloud data of the three-dimensional part, $m$ is a preset sampling factor, and $k$ is the total number of three-dimensional parts;

if $k = m$, the corresponding three-dimensional component is not sampled;

if $k > m$, the corresponding three-dimensional component is randomly downsampled;
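The per-part sampling rule above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `sample_part` is hypothetical, rounding the target count to the nearest integer is an assumption, and the branch for k < m is not covered by the text shown here, so filling up with randomly duplicated points is also an assumption.

```python
import random

def sample_part(points, m, k, rng=random):
    """Resample one part's points to round(n_part * m / k) points."""
    n_part = len(points)
    target = round(n_part * m / k)  # assumed rounding of n_part * m / k
    if target == n_part:
        # k == m: the part is left unchanged.
        return list(points)
    if target < n_part:
        # k > m: random downsampling without replacement.
        return rng.sample(points, target)
    # k < m: not specified in the text shown; duplicate random points (assumption).
    extra = [rng.choice(points) for _ in range(target - n_part)]
    return list(points) + extra
```

For example, a part with 100 points, a sampling factor m = 2, and k = 4 parts in total would be reduced to 50 randomly chosen points, while the same part with m = k would be passed through unchanged.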

18. The method for generating three-dimensional content based on a pre-training language model according to claim 1, wherein the calculating offset information of each point of each three-dimensional part from the three-dimensional content description data and the fused point cloud data comprises:

19. The method for generating three-dimensional content based on a pre-training language model according to claim 18, wherein the point cloud global feature extraction sub-network comprises a tenth multi-layer perceptron layer, an eleventh multi-layer perceptron layer, a twelfth multi-layer perceptron layer and a thirteenth multi-layer perceptron layer which are sequentially connected, and the output features of the thirteenth multi-layer perceptron layer are pooled to obtain the global features of the three-dimensional component fusion point cloud data.

20. The method of claim 19, wherein the offset prediction sub-network comprises a feature mix layer, a fourteenth multi-layer perceptron layer, a fifteenth multi-layer perceptron layer, a sixteenth multi-layer perceptron layer, and a seventeenth multi-layer perceptron layer;

21. The method for generating three-dimensional content based on a pre-training language model according to claim 18, wherein the training process of the three-dimensional component fine-granularity fusion network comprises:

$$L = \alpha \, \lVert f_{description\_gt} - f_{description\_pred} \rVert^{2} + \beta \, \lVert \Delta T \rVert^{2};$$
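The loss relationship above can be evaluated as in the following minimal sketch. The symbol definitions are not included in the text shown here, so the reading used below is an assumption: `f_gt` and `f_pred` are taken to be ground-truth and predicted description feature vectors, `delta_t` the predicted per-point offsets, and `alpha` and `beta` weighting coefficients whose default values are placeholders. The function names are hypothetical.

```python
def squared_l2(values):
    """Sum of squares of an iterable of numbers (squared L2 norm)."""
    return sum(v * v for v in values)

def fusion_loss(f_gt, f_pred, delta_t, alpha=1.0, beta=0.1):
    """L = alpha * ||f_gt - f_pred||^2 + beta * ||delta_t||^2."""
    # Feature term: distance between the two description feature vectors.
    feature_term = squared_l2(g - p for g, p in zip(f_gt, f_pred))
    # Offset term: squared norm over all coordinates of all per-point offsets.
    offset_term = squared_l2(c for offset in delta_t for c in offset)
    return alpha * feature_term + beta * offset_term
```

Read this way, the first term pulls the predicted description feature toward its target, while the second term penalizes large offsets so that the parts are only adjusted slightly during fine-grained fusion.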

22. The method for generating three-dimensional content based on a pre-training language model according to any one of claims 1 to 10, wherein generating the part name text data and the three-dimensional part point cloud data based on the three-dimensional content description data carried by the three-dimensional content generation request comprises:

23. The method for generating three-dimensional content based on a pre-trained language model according to claim 22, wherein said obtaining text data of part names described in text form using said text question-answering model comprises:

24. A three-dimensional content generation device based on a pre-trained language model, comprising:

25. An electronic device comprising a processor and a memory, the processor being configured to implement the steps of the pre-trained language model based three-dimensional content generation method of any one of claims 1 to 23 when executing a computer program stored in the memory.

26. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the three-dimensional content generation method based on a pre-trained language model according to any one of claims 1 to 23.