CN116932803A - Data set generation method and training method based on multi-modal pre-training model

Data set generation method and training method based on multi-modal pre-training model

Info

Publication number
CN116932803A
Authority
CN
China
Prior art keywords
three-dimensional content
training
attribute
description
Prior art date
Legal status
Granted
Application number
CN202311177091.1A
Other languages
Chinese (zh)
Other versions
CN116932803B (en)
Inventor
杜国光
范宝余
王丽
郭振华
赵雅倩
李仁刚
Current Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN202311177091.1A
Publication of CN116932803A
Application granted
Publication of CN116932803B
Status: Active

Classifications

    • G06F16/5866 — Information retrieval of still image data; retrieval characterised by using metadata generated manually, e.g. tags, keywords, comments, location and time information
    • G06F16/3329 — Information retrieval of unstructured textual data; natural language query formulation or dialogue systems
    • G06N3/045 — Neural network architectures; combinations of networks
    • G06N3/0475 — Neural network architectures; generative networks
    • G06N3/094 — Neural network learning methods; adversarial learning
    • G06T19/20 — Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06T2200/04 — Indexing scheme for image data processing involving 3D image data
    • G06T2219/2016 — Indexing scheme for editing of 3D models; rotation, translation, scaling
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a data set generation method and a training method based on a multi-modal pre-training model, applied to the technical field of three-dimensional content generation, comprising the following steps: rendering each three-dimensional content in a three-dimensional content set into a two-dimensional image; constructing a question set, where the question set comprises questions corresponding to a plurality of attributes; for each two-dimensional image, querying an image-text question-answering pre-trained model based on the question set to obtain an answer corresponding to each question, and determining a text description of each attribute based on the answer corresponding to that attribute; and determining description information of each attribute of each three-dimensional content based on the text descriptions to obtain a three-dimensional content description of each three-dimensional content, so as to generate a three-dimensional content description data set, where the three-dimensional content description contains the description information of a plurality of attributes. In this way, the quality of the data set can be improved, the performance of the three-dimensional content generation model can be guaranteed, and the accuracy of three-dimensional content generation can be improved.

Description

Data set generation method and training method based on multi-modal pre-training model
Technical Field
The present invention relates to the field of three-dimensional content generation, and in particular to a data set generation method and apparatus based on a multi-modal pre-training model, a training method, a three-dimensional content generation method, an electronic device, and a computer-readable storage medium.
Background
AIGC (Artificial Intelligence Generated Content) refers to the automatic production of digital content in modalities including text, audio, and images using artificial intelligence technology. AIGC is also applied to the generation of 3D (i.e., three-dimensional) content, i.e., intelligent 3D content generation technology: by generating high-quality, diversified 3D content as 3D digital assets, it is widely used in industries such as virtual reality and augmented reality.
In current text-based three-dimensional content generation schemes, the text descriptions of three-dimensional data sets are of poor quality, so a three-dimensional content generation model with good performance cannot be obtained, and the generated three-dimensional content has low accuracy.
Disclosure of Invention
Therefore, the object of the invention is to provide a data set generation method and a training method based on a multi-modal pre-training model, which can improve the quality of the data set, thereby guaranteeing the performance of the three-dimensional content generation model and improving the accuracy of three-dimensional content generation. The specific scheme is as follows:
In a first aspect, the invention discloses a data set generation method based on a multi-modal pre-training model, comprising the following steps:
rendering each three-dimensional content in a three-dimensional content set into a two-dimensional image;
constructing a question set, where the question set comprises questions corresponding to a plurality of attributes;
for each two-dimensional image, querying an image-text question-answering pre-trained model based on the question set to obtain an answer corresponding to each question, and determining a text description of each attribute based on the answer corresponding to that attribute;
determining description information of each attribute of each three-dimensional content based on the text descriptions, and obtaining a three-dimensional content description of each three-dimensional content so as to generate a three-dimensional content description data set; the three-dimensional content description includes the description information of a plurality of the attributes.
Optionally, rendering each three-dimensional content in the three-dimensional content set into a two-dimensional image includes:
rendering each three-dimensional content in the three-dimensional content set into two-dimensional images at a plurality of viewing angles.
Optionally, rendering each three-dimensional content in the three-dimensional content set into two-dimensional images at a plurality of viewing angles includes:
calculating a plurality of virtual camera positions based on a spherical coordinate system;
rendering each three-dimensional content in the three-dimensional content set based on the plurality of virtual camera positions to obtain two-dimensional images at the plurality of viewing angles.
Optionally, rendering each three-dimensional content in the three-dimensional content set into a two-dimensional image includes:
converting each three-dimensional content in the three-dimensional content set to a world coordinate system;
multiplying each point of the three-dimensional content in the world coordinate system by a scaling factor to complete scaling;
rendering the scaled three-dimensional content into a two-dimensional image.
Optionally, before multiplying each point of the three-dimensional content in the world coordinate system by the scaling factor to complete the scaling, the method further comprises:
calculating, for each coordinate axis, the difference between the maximum and minimum values of the three-dimensional content in the world coordinate system;
taking the reciprocal of the maximum of the differences corresponding to the coordinate axes to obtain the scaling factor.
Optionally, determining the description information of each attribute of each three-dimensional content based on the text descriptions includes:
for each attribute, determining scores of the text descriptions at multiple viewing angles using an evaluation network model, and fusing the text descriptions whose scores are greater than a preset threshold to obtain the description information of the attribute; the text descriptions at the multiple viewing angles are text descriptions corresponding to the same three-dimensional content.
Optionally, determining, for each attribute, the scores of the text descriptions at multiple viewing angles using the evaluation network model includes:
for each attribute, inputting the text descriptions at multiple viewing angles into the evaluation network model, extracting local features and global features, constructing joint features based on the local features and the global features, and outputting the scores of the text descriptions at multiple viewing angles based on the joint features.
Optionally, for each attribute, inputting the text descriptions at multiple viewing angles into the evaluation network model and extracting local features and global features includes:
for each attribute, inputting the text descriptions at multiple viewing angles into the evaluation network model, obtaining the local features sequentially through a bidirectional-encoder representation structure and a multi-layer perceptron layer, and obtaining the global features through a multi-layer perceptron layer and a pooling layer.
Optionally, outputting the scores of the text descriptions at multiple viewing angles based on the joint features includes:
outputting the scores of the text descriptions at multiple viewing angles using a preset number of multi-layer perceptron layers based on the joint features.
Optionally, the training process of the evaluation network model includes:
constructing a training data set, where the training data set comprises training samples and label information corresponding to the training samples, and the training samples are text descriptions corresponding to different attributes at a plurality of viewing angles;
inputting the training samples into an initial model to obtain scores of the text descriptions at multiple viewing angles;
calculating a training loss based on the scores and the label information;
updating parameters of the initial model based on the training loss to obtain a parameter-updated model;
continuing to train and iterate the parameter-updated model until a training stop condition is met, and determining the current parameter-updated model as the evaluation network model.
Optionally, the method further comprises:
if none of the scores of the text descriptions at multiple viewing angles output by the evaluation network model is greater than the preset threshold, querying the image-text question-answering pre-trained model again for the text description of the attribute based on the question corresponding to the attribute, determining the scores of the text descriptions at multiple viewing angles using the evaluation network model, and fusing the text descriptions whose scores are greater than the preset threshold to obtain the description information of the attribute.
Optionally, constructing the question set includes:
setting questions for a plurality of attributes respectively to obtain the question set;
where the plurality of attributes includes at least two of a conceptual attribute, a geometric attribute, a color attribute, and a material attribute.
Optionally, querying the image-text question-answering pre-trained model based on the question set to obtain an answer corresponding to each question, and determining the text description of each attribute based on the answer corresponding to each attribute, includes:
when each attribute in the question set corresponds to one question, querying the image-text question-answering pre-trained model multiple times with the question of each attribute to obtain multiple answers, and determining the answer most similar to the two-dimensional image among the multiple answers as the text description of the attribute.
Optionally, determining the answer most similar to the two-dimensional image among the multiple answers as the text description of the attribute includes:
calculating the similarity between each of the multiple answers and the two-dimensional image using an image-text contrastive pre-trained model, and determining the answer with the highest similarity as the text description of the attribute.
Optionally, querying the image-text question-answering pre-trained model based on the question set to obtain an answer corresponding to each question, and determining the text description of each attribute based on the answer corresponding to each attribute, includes:
when each attribute in the question set corresponds to a plurality of questions, querying the image-text question-answering pre-trained model with the plurality of questions of each attribute to obtain multiple answers, and determining the answer most similar to the two-dimensional image among the multiple answers as the text description of the attribute.
In a second aspect, the invention discloses a training method for a three-dimensional content generation model, comprising the following steps:
training a generative adversarial network based on the three-dimensional content description data set, where the three-dimensional content description data set is generated according to the aforementioned data set generation method based on the multi-modal pre-training model, and the generative adversarial network includes a discriminator, a generator, and a condition controller;
when the discriminator cannot distinguish the content generated by the generator, determining the network structure formed by the generator and the condition controller in the current generative adversarial network as the three-dimensional content generation model.
Optionally, training the generative adversarial network based on the three-dimensional content description data set includes:
alternately training the generator and the discriminator based on the three-dimensional content description data set: first freezing the generator and training the discriminator, then freezing the discriminator and training the generator, and performing this alternation multiple times.
Optionally, the training process of the generator includes:
acquiring a three-dimensional content description from the three-dimensional content description data set, and generating a multi-attribute coding descriptor using the condition controller;
transforming initial noise to obtain a noise coding descriptor, and constructing a joint descriptor based on the noise coding descriptor and the multi-attribute coding descriptor;
inputting the joint descriptor into the generator to obtain the three-dimensional content generated by the generator;
inputting the three-dimensional content into the discriminator to obtain a predicted value corresponding to the three-dimensional content;
calculating a first training loss based on the predicted value and a first loss function, and updating parameters of the generator based on the first training loss.
Optionally, the training process of the discriminator includes:
inputting ground-truth data and the three-dimensional content generated by the generator into the discriminator to obtain a first predicted value corresponding to the ground-truth data and a second predicted value corresponding to the three-dimensional content;
calculating a second training loss based on the first and second predicted values and a second loss function, and updating parameters of the discriminator based on the second training loss.
In a third aspect, the present invention discloses a three-dimensional content generation method, including:
acquiring a target description and target noise, and inputting them into a three-dimensional content generation model, where the target description is description information of a plurality of attributes, and the three-dimensional content generation model is obtained by training according to the aforementioned three-dimensional content generation model training method;
generating a multi-attribute coding descriptor using the condition controller, and transforming the target noise to obtain a noise coding descriptor;
constructing a joint descriptor based on the multi-attribute coding descriptor and the noise coding descriptor, and generating the target three-dimensional content based on the joint descriptor using the generator.
In a fourth aspect, the present invention discloses a data set generating apparatus, including:
a two-dimensional image rendering module, configured to render each three-dimensional content in a three-dimensional content set into a two-dimensional image;
a question set construction module, configured to construct a question set, where the question set comprises questions corresponding to a plurality of attributes;
an attribute description determining module, configured to query, for each two-dimensional image, an image-text question-answering pre-trained model based on the question set to obtain an answer corresponding to each question, and determine a text description of each attribute based on the answer corresponding to each attribute;
a data set determining module, configured to determine description information of each attribute of each three-dimensional content based on the text descriptions, and obtain a three-dimensional content description of each three-dimensional content so as to generate a three-dimensional content description data set, where the three-dimensional content description includes the description information of a plurality of the attributes.
In a fifth aspect, the present invention discloses an electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the foregoing data set generation method based on the multi-modal pre-training model, and/or the foregoing three-dimensional content generation model training method, and/or the foregoing three-dimensional content generation method.
In a sixth aspect, the present invention discloses a computer-readable storage medium for storing a computer program, where the computer program, when executed by a processor, implements the foregoing data set generation method based on the multi-modal pre-training model, and/or the foregoing three-dimensional content generation model training method, and/or the foregoing three-dimensional content generation method.
In the invention, each three-dimensional content in a three-dimensional content set is first rendered into a two-dimensional image, and a question set comprising questions corresponding to a plurality of attributes is constructed; then, for each two-dimensional image, an image-text question-answering pre-trained model is queried based on the question set to obtain an answer corresponding to each question, and a text description of each attribute is determined based on the answer corresponding to that attribute; finally, description information of each attribute of each three-dimensional content is determined based on the text descriptions, and a three-dimensional content description of each three-dimensional content is obtained so as to generate a three-dimensional content description data set, where the three-dimensional content description includes the description information of a plurality of the attributes. That is, the invention constructs a question set comprising questions for a plurality of attributes, queries the image-text question-answering pre-trained model with the two-dimensional images corresponding to the three-dimensional content to obtain an answer to each question, determines the text description of each attribute based on the answers, and further determines the description information of the plurality of attributes corresponding to each three-dimensional content to obtain the three-dimensional content description data set.
The invention has the following beneficial effects: since each three-dimensional content description contains description information for every one of the plurality of attributes, the resulting three-dimensional content description data set is richer and more accurate; the quality of the data set can thus be improved, the performance of the three-dimensional content generation model can be guaranteed, and the accuracy of three-dimensional content generation can be improved.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a data set generation method based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of generating a specific data set according to an embodiment of the present invention;
FIG. 3 is a diagram of an evaluation network model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a three-dimensional content generation model training method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a generative adversarial network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of training a condition controller according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a data set generating device according to an embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In current text-based three-dimensional content generation schemes, the text descriptions of three-dimensional data sets are of poor quality, so a three-dimensional content generation model with good performance cannot be obtained, and the generated three-dimensional content has low accuracy. The invention therefore provides a three-dimensional content generation scheme that can improve the quality of the data set, thereby guaranteeing the performance of the three-dimensional content generation model and improving the accuracy of three-dimensional content generation.
Referring to fig. 1, an embodiment of the invention discloses a data set generation method based on a multi-modal pre-training model, comprising the following steps:
Step S11: rendering each three-dimensional content in a three-dimensional content set into a two-dimensional image.
In a specific embodiment, each three-dimensional content in the three-dimensional content set may be converted to a world coordinate system; each point of the three-dimensional content in the world coordinate system is multiplied by a scaling factor to complete scaling; and the scaled three-dimensional content is rendered into a two-dimensional image. Specifically, the difference between the maximum and minimum values of the three-dimensional content along each coordinate axis in the world coordinate system may be calculated, and the reciprocal of the maximum of the differences corresponding to the coordinate axes is taken as the scaling factor.
It should be noted that, in the embodiment of the present invention, the three-dimensional content is first preprocessed. Since the position of the three-dimensional point cloud of each three-dimensional content in space is not fixed, in order to ensure that the rendering process is controllable, the embodiment of the present invention performs two preprocessing steps: coordinate-system alignment and scale scaling. Coordinate-system alignment refers to unifying all points of the three-dimensional content to the world coordinate system: first, the center o_center of the current object in the world coordinate system is calculated, and the new coordinates of each point are computed as p_new = p_ori − o_center, where p_ori denotes the original coordinates of the point; this completes the coordinate-system alignment. Scale scaling refers to normalizing the model to a standard scale, i.e., scaling the object into a cube with a side length of 1: first, the differences between the maximum and minimum values of the three-dimensional content along the x, y and z axes are calculated, the maximum of these differences is taken, and its reciprocal serves as the scaling factor, i.e., s = 1 / max((max_x − min_x), (max_y − min_y), (max_z − min_z)); the coordinates of each point of the three-dimensional content are then multiplied by the scaling factor to complete the scaling.
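As a concrete illustration, the following minimal NumPy sketch performs these two preprocessing steps under stated assumptions: the object center o_center is taken to be the point-cloud centroid (the embodiment does not specify how the center is computed), and the function and variable names are illustrative.

```python
import numpy as np

def normalize_point_cloud(points: np.ndarray) -> np.ndarray:
    """Align a point cloud to the world origin, then scale it into a
    cube of side length 1, following the two preprocessing steps above."""
    # Coordinate-system alignment: p_new = p_ori - o_center
    # (assumption: o_center is the centroid of the point cloud)
    points = points - points.mean(axis=0)
    # Scale: s = 1 / max((max_x - min_x), (max_y - min_y), (max_z - min_z))
    extents = points.max(axis=0) - points.min(axis=0)
    return points * (1.0 / extents.max())

# Usage: an arbitrary cloud ends up inside a unit cube.
cloud = normalize_point_cloud(np.random.rand(1024, 3) * 50.0)
assert (cloud.max(axis=0) - cloud.min(axis=0)).max() <= 1.0 + 1e-6
```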
Also, in a specific embodiment, each three-dimensional content in the three-dimensional content set may be rendered into two-dimensional images at a plurality of viewing angles. Specifically, a plurality of virtual camera positions may be calculated based on a spherical coordinate system, and each three-dimensional content in the three-dimensional content set is rendered based on the plurality of virtual camera positions to obtain two-dimensional images at the plurality of viewing angles.
The embodiment of the invention can calculate the pose of the virtual camera, which consists of a position and an orientation. In order to comprehensively acquire information at all viewing angles of the 3D content, the embodiment of the invention sets a large number of virtual cameras surrounding the upper half of the 3D content; the camera positions differ, while each camera's orientation is fixed to point from its position toward the center of the object. The position (p_camera-x, p_camera-y, p_camera-z) of a virtual camera is set in the spherical coordinate system according to the formulas p_camera-x = r·sinθ·cosφ, p_camera-y = r·sinθ·sinφ, p_camera-z = r·cosθ, where r is the sphere radius, θ is the polar angle in the vertical direction, and φ is the azimuth angle in the horizontal direction. The radius, polar angle and azimuth angle can each take several values: if the radius r takes n_r values, θ takes n_θ values, and φ takes n_φ values, then images at n = n_r · n_θ · n_φ viewing angles can be rendered. For example, with a fixed radius r of 3, the polar angle θ may take the two values 60° and 90°, and the azimuth angle φ may take 12 values from 0° in steps of 30° up to 360°, giving 1 × 2 × 12 = 24 virtual camera positions in total and 24 rendered images. Finally, the multi-view 2D (i.e., two-dimensional) images are rendered: using a rendering engine such as Blender and the calculated virtual camera positions, the 3D content is rendered into n 2D images. Likewise, for each 3D content in the 3D content data set, the n rendered 2D images are obtained by the same procedure.
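The camera-position formulas can be checked with a short sketch; the defaults below reproduce the 24-view example (fixed radius r = 3, polar angles 60° and 90°, azimuths from 0° in steps of 30°), and the function name is illustrative rather than part of the embodiment.

```python
import numpy as np

def virtual_camera_positions(r=3.0, thetas_deg=(60.0, 90.0),
                             phis_deg=tuple(range(0, 360, 30))):
    """Camera positions on a sphere: x = r*sin(theta)*cos(phi),
    y = r*sin(theta)*sin(phi), z = r*cos(theta)."""
    positions = []
    for theta in np.deg2rad(thetas_deg):       # polar angle (vertical)
        for phi in np.deg2rad(phis_deg):       # azimuth angle (horizontal)
            positions.append((r * np.sin(theta) * np.cos(phi),
                              r * np.sin(theta) * np.sin(phi),
                              r * np.cos(theta)))
    return np.array(positions)

print(virtual_camera_positions().shape)  # (24, 3): 1 radius * 2 polar * 12 azimuth
```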
Step S12: constructing a question set; the question set includes questions corresponding to a plurality of attributes.
In a specific embodiment, questions may be set for the plurality of attributes respectively to obtain the question set, where the plurality of attributes includes, but is not limited to, at least two of a conceptual attribute, a geometric attribute, a color attribute, and a material attribute. For each attribute, one or more questions may be set. It is understood that an attribute describes a feature of the three-dimensional content.
It should be noted that 3D content has multiple attributes, such as concept, geometry, color and material, which together characterize the final form of the 3D content at fine granularity. For example, the conceptual attribute indicates what class of object it is, the geometric attribute indicates its geometry, the color attribute indicates its color style, and the material attribute indicates its material. For each attribute, one or more questions may be set to obtain as detailed a description of that attribute as possible; for example, for the conceptual attribute, "What is in the picture?"; for the geometric attribute, "What is the geometry of the object in the picture?"; for the color attribute, "What is the color style of the object in the picture?"; and so on. The number of attributes is not fixed and may be chosen as appropriate.
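A minimal question set might look as follows; this is an illustrative sketch, and the material-attribute question is an assumed example not quoted in the embodiment.

```python
# Hypothetical question set: attribute -> one or more questions.
QUESTION_SET = {
    "concept":  ["What is in the picture?"],
    "geometry": ["What is the geometry of the object in the picture?"],
    "color":    ["What is the color style of the object in the picture?"],
    "material": ["What material is the object in the picture made of?"],  # assumed
}
```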
Step S13: for each two-dimensional image, querying the image-text question-answering pre-trained model based on the question set to obtain an answer corresponding to each question, and determining the text description of each attribute based on the answer corresponding to that attribute.
The image-text question-answering pre-trained model may be a large-scale pre-trained model.
In one embodiment, when each attribute in the question set corresponds to a single question, the image-text question-answering pre-trained model is queried multiple times with the question of each attribute to obtain multiple answers, and the answer most similar to the two-dimensional image among the multiple answers is determined as the text description of the attribute. Specifically, the similarity between each of the multiple answers and the two-dimensional image may be calculated using an image-text contrastive pre-trained model, and the answer with the highest similarity is determined as the text description of the attribute. The image-text contrastive pre-trained model may likewise be a large-scale pre-trained model.
In another embodiment, when each attribute in the question set corresponds to a plurality of questions, the image-text question-answering pre-trained model is queried with the plurality of questions of each attribute to obtain multiple answers, and the answer most similar to the two-dimensional image among the multiple answers is determined as the text description of the attribute. Specifically, the similarity between each of the multiple answers and the two-dimensional image is calculated using the image-text contrastive pre-trained model, and the answer with the highest similarity is determined as the text description of the attribute.
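The checking mechanism shared by both embodiments can be sketched as below. ask_vqa and clip_score are placeholders standing in for the image-text question-answering and image-text contrastive pre-trained models respectively (neither API is specified by the embodiment), and the default w = 5 is an assumption.

```python
def describe_attribute(image, questions, ask_vqa, clip_score, w=5):
    """Return the text description of one attribute for one image.

    ask_vqa(image, question) -> str queries the image-text
    question-answering pre-trained model; clip_score(image, text) -> float
    measures image-text similarity (e.g., with a CLIP-like model)."""
    answers = []
    for question in questions:
        # One question per attribute: query it w times for w answers.
        # Several questions per attribute: each contributes one answer.
        repeats = w if len(questions) == 1 else 1
        answers += [ask_vqa(image, question) for _ in range(repeats)]
    # Checking mechanism: keep the answer most similar to the image.
    return max(answers, key=lambda a: clip_score(image, a))
```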
Step S14: determining the description information of each attribute of each three-dimensional content based on the text descriptions, and obtaining the three-dimensional content description of each three-dimensional content so as to generate the three-dimensional content description data set.
That is, the three-dimensional content description of a three-dimensional content is the description information corresponding to each of the plurality of attributes.
In one embodiment, each three-dimensional content in the three-dimensional content set is rendered into a single two-dimensional image. Accordingly, determining the description information of each attribute of each three-dimensional content based on the text descriptions is specifically: determining the text description corresponding to each attribute as the description information of that attribute of the three-dimensional content.
In another embodiment, each three-dimensional content in the three-dimensional content set is rendered into a plurality of two-dimensional images; then, for each attribute, the description information of that attribute of the three-dimensional content is determined based on the text descriptions of the attribute corresponding to the plurality of two-dimensional images. In a specific embodiment, the plurality of two-dimensional images are two-dimensional images at a plurality of viewing angles; for each attribute, the scores of the text descriptions at the multiple viewing angles are determined using the evaluation network model, and the text descriptions whose scores are greater than a preset threshold are fused to obtain the description information of the attribute, where the text descriptions at the multiple viewing angles are text descriptions corresponding to the same three-dimensional content.
Moreover, for any attribute, if none of the scores of the text descriptions at the multiple viewing angles output by the evaluation network model is greater than the preset threshold, the image-text question-answering pre-trained model is queried again for the text description of the attribute based on the question corresponding to the attribute, and the steps of determining the scores of the text descriptions at multiple viewing angles using the evaluation network model and fusing the text descriptions whose scores are greater than the preset threshold are executed to obtain the description information of the attribute.
Specifically, for each attribute, the text descriptions at the multiple viewing angles are input into the evaluation network model, local features and global features are extracted, joint features are constructed based on the local features and the global features, and the scores of the text descriptions at the multiple viewing angles are output based on the joint features.
In a specific embodiment, for each attribute, the text descriptions at the multiple viewing angles may be input into the evaluation network model, with the local features obtained sequentially through a bidirectional-encoder representation structure and a multi-layer perceptron layer, and the global features obtained through a multi-layer perceptron layer and a pooling layer. Joint features are then constructed based on the local features and the global features, and the scores of the text descriptions at the multiple viewing angles are output based on the joint features using a preset number of multi-layer perceptron layers; the preset number may be four.
Further, the training process of the evaluation network model includes: constructing a training data set, where the training data set comprises training samples and label information corresponding to the training samples, and the training samples are text descriptions corresponding to different attributes at a plurality of viewing angles; inputting the training samples into an initial model to obtain scores of the text descriptions at the multiple viewing angles; calculating a training loss based on the scores and the label information; updating the parameters of the initial model based on the training loss to obtain a parameter-updated model; and continuing to train and iterate the parameter-updated model until a training stop condition is met, and determining the current parameter-updated model as the evaluation network model. That is, each training sample consists of the text descriptions at multiple viewing angles corresponding to a certain attribute; a text description at a viewing angle that is semantically clear and complementary is labeled 1, and otherwise 0.
Referring to fig. 2, fig. 2 is a schematic diagram of a specific data set generation process according to an embodiment of the present invention.
First, multi-view rendering is performed on each three-dimensional content in the three-dimensional content set to obtain two-dimensional images at multiple viewing angles; assume each three-dimensional content is rendered into two-dimensional images at n viewing angles. In addition, a question set is constructed; assuming m attributes, each attribute may have one question or multiple questions.
Second, a large-scale pre-trained model is used for visual question answering. For example, an image-text large-scale pre-trained model such as GPT-4 (Generative Pre-trained Transformer 4) may be used, which has strong cognitive reasoning and picture question-answering capabilities. Thus, for each rendered image of the 3D content, the large-scale pre-trained model may be queried using the constructed question set, and the answer to each question serves as the text description of the corresponding attribute for that image. In order to improve the accuracy of each attribute description, the invention adopts a checking mechanism: the large-scale pre-trained model is queried w times for each attribute to obtain w answers; then an image-text large-scale pre-trained model dedicated to similarity comparison, such as CLIP (Contrastive Language-Image Pre-training), is used to judge which of the w answers is most similar to the picture, and that answer is used as the image's text description for the attribute. The same operation is then performed for all m attributes to obtain m fine-grained text descriptions for each rendered image. It should be noted that image-text large-scale pre-trained models have strong cognitive ability and can produce reasonable results for questions posed in the form of input images and text; they can therefore be used to generate fine-grained attribute descriptions of 3D content.
Further, to obtain the fine-grained description of a 3D content, the fine-grained descriptions of its n rendered images need to be fused. However, the fine-grained descriptions at these n viewing angles are of varying quality, and poor descriptions exist among them. Therefore, the invention designs a multi-view fine-grained description fusion method with a feedback mechanism, based on a multi-view fine-grained description evaluation network, specifically as follows:
First, for a certain granularity, i.e., a certain attribute, the multi-view fine-grained description evaluation network is used to obtain the scores of the fine-grained descriptions at the n viewing angles. In order to complete the quality evaluation of the n per-view fine-grained descriptions and obtain de-duplicated, high-quality fine-grained descriptions, a multi-view fine-grained description evaluation network model is designed, as shown in fig. 3; fig. 3 is a structural diagram of the evaluation network model disclosed by an embodiment of the invention. The evaluation network takes the fine-grained descriptions at the n viewing angles for a certain granularity, extracts local features and global features, and then jointly predicts the scores of the fine-grained descriptions at the n viewing angles. By extracting both local and global features, the network model gains strong discrimination capability and can judge which descriptions are of poorer quality, achieving the goal of removing low-quality descriptions. The specific steps are as follows:
a. Constructing the data set: for a large number of unscored text descriptions at n viewing angles under different granularities, manual screening is performed in combination with the corresponding 3D content; one or more semantically clear and complementary descriptions are selected and given a score of 1, the other descriptions are given 0, and all fine-grained descriptions with a score of 1 together serve as the accurate and complete description of the 3D content for that granularity;
b. Designing the network structure: first, a large language pre-trained model, such as BERT (Bidirectional Encoder Representations from Transformers), is used to extract high-dimensional features of the text descriptions; each viewing angle yields a 768-dimensional feature, so the n viewing angles form an n×768-dimensional feature. Second, two MLP (multi-layer perceptron) layers raise the feature dimension to n×2048, and pooling produces a 1×2048-dimensional vector; this 2048-dimensional vector can be regarded as a high-dimensional abstraction of the text description features of the n viewing angles, and is therefore the global feature. Finally, the global feature is combined with the n×1024-dimensional local features from the previous step to form n×3072-dimensional joint features, and the scores of the fine-grained descriptions at the n viewing angles are predicted through 4 MLP layers (a code sketch of this structure follows item c below). Because this network structure integrates the local and global features of the n viewing angles, it can screen out low-quality descriptions and predict which descriptions are complementary, together forming the final description of the 3D content for that granularity;
c. Training and reasoning: training uses the mean square error (MSE) loss, with the specific formula MSE = (1/n) · Σ_i (f(x_i) − y_i)², where f(x_i) is the score predicted by the network for the i-th viewing angle and y_i is the ground-truth (GT) value for the i-th viewing angle. At inference time, the text descriptions of the n viewing angles are input, and scores in the range [0, 1] are obtained for the fine-grained descriptions at the n viewing angles.
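The following PyTorch sketch mirrors the structure of items b and c under stated assumptions: the per-view 768-dimensional BERT features are precomputed, the hidden widths of the four scoring MLP layers and the use of max-pooling are guesses (the embodiment fixes neither), and a sigmoid bounds scores to [0, 1] as the inference description requires.

```python
import torch
import torch.nn as nn

class MultiViewDescriptionScorer(nn.Module):
    """n x 768 per-view features -> n x 1024 local features; a further MLP
    to n x 2048 plus pooling gives a 1 x 2048 global feature; local and
    broadcast global features are concatenated into n x 3072 joint
    features, and four MLP layers predict one score per viewing angle."""
    def __init__(self):
        super().__init__()
        self.local_mlp = nn.Sequential(nn.Linear(768, 1024), nn.ReLU())
        self.global_mlp = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU())
        self.head = nn.Sequential(                 # 4 MLP layers (widths assumed)
            nn.Linear(3072, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())        # scores in [0, 1]

    def forward(self, bert_feats):                 # (n, 768)
        local = self.local_mlp(bert_feats)         # (n, 1024) local features
        pooled = self.global_mlp(local).max(dim=0).values  # (2048,) global
        glob = pooled.unsqueeze(0).expand(local.size(0), -1)
        joint = torch.cat([local, glob], dim=1)    # (n, 3072) joint features
        return self.head(joint).squeeze(-1)        # (n,) per-view scores

scores = MultiViewDescriptionScorer()(torch.randn(24, 768))
```

Training this sketch against the 0/1 labels of item a with nn.MSELoss corresponds to the MSE formula in item c.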
Second, fine-grained description fusion with a feedback mechanism is performed. For a certain granularity i, if among the n fine-grained descriptions there exist descriptions with scores greater than a threshold δ, qualified fine-grained descriptions exist, and all qualified fine-grained descriptions are fused to serve as the 3D content's fine-grained description for granularity i. If no fine-grained description has a score greater than the threshold δ, no qualified fine-grained description exists; fine-grained description generation based on the large-scale pre-trained model is then re-executed and the multi-view fine-grained description evaluation is performed again, until fine-grained descriptions with scores greater than the threshold δ exist and the fine-grained description generation for granularity i is completed. This feedback mechanism is executed for all m granularities until the text descriptions of the 3D content at all m granularities are completed.
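A sketch of the feedback loop for one granularity follows. regenerate and score_views are placeholders for the pre-trained-model querying step and the evaluation network; the threshold delta, the max_rounds safeguard (the embodiment loops without a bound), and the fusion-by-concatenation step are all assumptions.

```python
def fuse_with_feedback(image_views, questions, regenerate, score_views,
                       delta=0.8, max_rounds=5):
    """Fuse the multi-view fine-grained descriptions for one attribute.

    regenerate(image_views, questions) -> list[str] re-queries the
    large-scale pre-trained model for the n per-view descriptions;
    score_views(descriptions) -> list[float] is the evaluation network."""
    for _ in range(max_rounds):
        descriptions = regenerate(image_views, questions)
        scores = score_views(descriptions)
        kept = [d for d, s in zip(descriptions, scores) if s > delta]
        if kept:  # qualified descriptions exist: fuse and de-duplicate
            return ", ".join(dict.fromkeys(kept))
    raise RuntimeError("no qualifying description after max_rounds rounds")
```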
The embodiment of the invention performs the above steps for each 3D model in the 3D content data set to complete the construction of a 3D content database with fine-grained descriptions.
That is, in the embodiment of the invention: first, for each 3D content in the 3D content data set, multi-view rendering is performed by setting different virtual camera positions to obtain a large number of 2D images of the content at different viewing angles; second, a question set is created for different attributes, where the number of attributes is not limited and attributes such as concept, geometry, color and material can generally be used; third, for each image rendered from each 3D content, the image-text large-scale pre-trained model is queried multiple times about the image content, and the checking mechanism selects the best answer as the text description of the attribute; fourth, the text descriptions of the different attributes of the many rendered 2D images are fused and de-duplicated to obtain the text descriptions of the 3D content for the different attributes; finally, the previous steps are executed for each 3D model in the 3D content data set to obtain a 3D content data set with fine-grained attribute text descriptions.
It can be seen that, in the embodiment of the present invention, each three-dimensional content in a three-dimensional content set is rendered into a two-dimensional image, and a question set comprising questions corresponding to a plurality of attributes is constructed; then, for each two-dimensional image, the image-text question-answering pre-trained model is queried based on the question set to obtain an answer corresponding to each question, the text description of each attribute is determined based on the answer corresponding to that attribute, and the description information of each attribute of each three-dimensional content is then determined based on the text descriptions to obtain the three-dimensional content description of each three-dimensional content and generate the three-dimensional content description data set, where the three-dimensional content description includes the description information of a plurality of the attributes. Since the three-dimensional content description contains description information corresponding to every one of the plurality of attributes, the obtained three-dimensional content description data set is richer and more accurate; the quality of the data set can thus be improved, the performance of the three-dimensional content generation model can be guaranteed, and the accuracy of three-dimensional content generation can be improved.
Referring to fig. 4, the embodiment of the invention discloses a three-dimensional content generation model training method, which comprises the following steps:
Step S21: training a generative adversarial network based on the three-dimensional content description data set; the three-dimensional content description data set is generated according to the data set generation method based on the multi-modal pre-training model disclosed in the previous embodiment; the generative adversarial network includes a discriminator, a generator, and a condition controller.
Step S22: when the discriminator cannot distinguish the content generated by the generator, determining the network structure formed by the generator and the condition controller in the current generative adversarial network as the three-dimensional content generation model.
That is, when the discriminator cannot distinguish whether the input content is content generated by the generator, the current generative adversarial network is determined to yield the three-dimensional content generation model.
In a specific embodiment, the generator and the discriminator may be trained alternately based on the three-dimensional content description data set: first the generator is frozen and the discriminator is trained, then the discriminator is frozen and the generator is trained, with this alternation performed multiple times.
The training process of the generator includes: acquiring a three-dimensional content description from the three-dimensional content description data set, and generating a multi-attribute coding descriptor using the condition controller; transforming initial noise to obtain a noise coding descriptor, and constructing a joint descriptor based on the noise coding descriptor and the multi-attribute coding descriptor; inputting the joint descriptor into the generator to obtain the three-dimensional content generated by the generator; inputting the three-dimensional content into the discriminator to obtain a predicted value corresponding to the three-dimensional content; and calculating a first training loss based on the predicted value and a first loss function, and updating the parameters of the generator based on the first training loss.
The training process of the discriminator includes: inputting ground-truth data and the three-dimensional content generated by the generator into the discriminator to obtain a first predicted value corresponding to the ground-truth data and a second predicted value corresponding to the three-dimensional content; and calculating a second training loss based on the first and second predicted values and a second loss function, and updating the parameters of the discriminator based on the second training loss.
For example, referring to fig. 5, fig. 5 is a schematic diagram of a generative adversarial network disclosed by an embodiment of the present invention. A generative adversarial network (GAN) controls content generation through a generator and a discriminator, and has achieved high-quality generation results on many tasks such as image generation. For the 3D content generation task with multi-channel fine-grained conditional control, the GAN network contains three parts: a condition controller (Controller), a generator (Generator), and a discriminator (Discriminator). Based on the constructed 3D content data set containing fine-grained descriptions, the invention provides a 3D content generation network structure with multi-channel fine-grained condition control based on a generative adversarial network, as follows:
1) The condition controller module, used to generate the multi-granularity coding descriptors:
For the fine-grained conditional text descriptions of the m channels, a 256×d′ initial descriptor is first extracted based on the CLIP large-scale pre-trained model, which semantically aligns text and images; second, a multi-layer perceptron (MLP) network is designed to convert the initial descriptor from 256×d′ to a 256×d-dimensional 3D content control descriptor; finally, the MLP network is used to obtain the 1×256-dimensional coding descriptors of the m channels.
2) The generator module, used to generate the target point cloud:
The generator part starts from Gaussian noise z and generates the target content through the generator network. However, the original noise may not be suitable for the 3D content generation task; therefore, the invention transforms the original noise z based on an MLP network to obtain a k-dimensional noise coding descriptor. Together with the m-channel 1×256-dimensional coding descriptors, this forms the w-dimensional input of the generator, i.e., the w = (m×256 + k)-dimensional joint descriptor. The generator network comprises a multi-layer MLP network; the w-dimensional joint descriptor is transformed to obtain a 6·g-dimensional descriptor, and a reshape operation yields a g×6-dimensional output, i.e., the target colored 3D point cloud.
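A minimal PyTorch sketch of this generator module is given below; m, k, g, the noise dimension, and the hidden MLP widths are all assumed values, since the embodiment leaves them open.

```python
import torch
import torch.nn as nn

class PointCloudGenerator(nn.Module):
    """Noise z -> k-dim noise coding descriptor; concatenated with the m
    1x256 condition descriptors into a w = m*256 + k joint descriptor;
    MLPs map it to 6*g values, reshaped to a g x 6 colored point cloud."""
    def __init__(self, m=4, k=128, g=2048, z_dim=128):
        super().__init__()
        self.g = g
        self.noise_mlp = nn.Sequential(nn.Linear(z_dim, k), nn.ReLU())
        self.body = nn.Sequential(                  # multi-layer MLP network
            nn.Linear(m * 256 + k, 1024), nn.ReLU(),
            nn.Linear(1024, 4096), nn.ReLU(),
            nn.Linear(4096, 6 * g))

    def forward(self, z, cond_descriptors):          # z: (z_dim,), cond: (m, 256)
        noise_code = self.noise_mlp(z)                # (k,) noise coding descriptor
        joint = torch.cat([cond_descriptors.flatten(), noise_code])  # (w,)
        return self.body(joint).reshape(self.g, 6)    # g x 6: xyz + rgb per point

cloud = PointCloudGenerator()(torch.randn(128), torch.randn(4, 256))
```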
3) The discriminator module, used to judge the plausibility of the target point cloud:
The discriminator module should output a prediction of 0 for a generated point cloud and a prediction of 1 for a ground-truth point cloud. Its network structure comprises a multi-layer MLP network that outputs g×2048-dimensional descriptors; a pooling operation yields a 1×2048-dimensional global descriptor, and a 1-layer MLP network produces the 1-dimensional predicted value.
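A matching sketch of the discriminator module, with the same caveats (per-point MLP widths and max-pooling are assumptions):

```python
import torch
import torch.nn as nn

class PointCloudDiscriminator(nn.Module):
    """g x 6 point cloud -> g x 2048 per-point descriptors -> pooled
    1 x 2048 global descriptor -> 1-dimensional prediction in [0, 1]
    (1 = ground-truth point cloud, 0 = generated point cloud)."""
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(6, 256), nn.ReLU(),
            nn.Linear(256, 2048), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2048, 1), nn.Sigmoid())

    def forward(self, cloud):                       # (g, 6)
        feats = self.point_mlp(cloud)               # (g, 2048)
        global_desc = feats.max(dim=0).values       # (2048,) pooled descriptor
        return self.head(global_desc)               # (1,) predicted value

pred = PointCloudDiscriminator()(torch.randn(2048, 6))
```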
During network training, the GAN adopts an alternating training strategy: the generator is first frozen and the discriminator is trained; then the discriminator parameters are frozen and the generator is trained; then the generator is frozen again and the discriminator is trained; this alternation continues until the discriminator cannot tell whether the content generated by the generator is real.
For the discriminator, the following loss function is used: loss_D = log(D(x_gt)) + log(1 − D(x_generate)), where D(x_gt) denotes the predicted value for the ground-truth 3D point cloud (when D(x_gt) = 1, the first term is 0) and D(x_generate) denotes the predicted value for the generated point cloud (when D(x_generate) = 0, the second term is 0). Since the log values are negative whenever the predicted values lie strictly between 0 and 1, maximizing this loss function trains the discriminator to correctly discriminate real point clouds from generated ones. An output of 0.5 indicates that the discriminator cannot tell them apart.
For the generator, the following loss function is employed: loss_G = log(1 − D(x_generate)). That is, for a trained discriminator, when the probability it assigns to the generated content being real approaches 1, the generated content is able to fool the discriminator, and the loss function approaches minus infinity. Therefore, by minimizing this loss function, the generator can be trained to generate sufficiently realistic and reasonable 3D content.
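One alternation of this training strategy, reusing the PointCloudGenerator and PointCloudDiscriminator sketches above, might look as follows; the eps guard against log(0) and the use of separate optimizers are assumptions, and "freezing" is realized simply by stepping only the relevant optimizer.

```python
import torch

def train_step(G, D, opt_G, opt_D, z, cond, real_cloud, eps=1e-8):
    """One discriminator update followed by one generator update, using
    loss_D = log(D(x_gt)) + log(1 - D(x_generate)) (maximized, so its
    negative is minimized) and loss_G = log(1 - D(x_generate)) (minimized)."""
    # Train the discriminator; detach() freezes the generator here.
    fake = G(z, cond).detach()
    loss_D = torch.log(D(real_cloud) + eps) + torch.log(1 - D(fake) + eps)
    opt_D.zero_grad(); (-loss_D).backward(); opt_D.step()
    # Train the generator; only opt_G steps, so D's parameters stay fixed.
    loss_G = torch.log(1 - D(G(z, cond)) + eps)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```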
For the condition controller part, in order to accelerate convergence, a multi-channel fine-grained adaptive model training strategy may be adopted: the single-channel coding descriptor extraction modules are trained in sequence first, and overall fine-tuning is performed afterwards. Referring to fig. 6, fig. 6 is a schematic diagram of training the condition controller disclosed by an embodiment of the present invention. In the specific training process, the generator and the condition controller may be trained together, with the parameters of both updated based on the first training loss. For the condition controller, one channel is activated in turn while the other channels remain inactive, and training proceeds channel by channel.
During inference, only the condition controller and the generator are used; the discriminator is not. First, noise z is sampled from a Gaussian distribution; second, the multi-channel fine-grained conditions are extracted separately to obtain the fine-grained coding descriptors; third, the joint descriptor is constructed and the generator infers the final colored 3D point cloud, completing the generation task.
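A sketch of this inference path follows; `controller` is assumed to be a callable mapping the multi-attribute target description to the (1, m×256) conditioning vector, an interface assumed purely for illustration.

```python
# A sketch of inference: condition controller + generator only.
import torch

@torch.no_grad()
def generate(controller, G, description, k=128):
    cond = controller(description)  # fine-grained coding descriptors
    z = torch.randn(1, k)           # noise sampled from a Gaussian
    return G(z, cond)               # (1, g, 6) colored 3D point cloud
```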
Further, the embodiment of the invention discloses a three-dimensional content generation method, which comprises the following steps:
acquiring a target description and target noise, and inputting them into a three-dimensional content generation model; the target description is description information of a plurality of attributes; the three-dimensional content generation model is trained according to the three-dimensional content generation model training method disclosed in the foregoing embodiment; the target description may be a text description of a plurality of attributes determined based on user input.
Generating a multi-attribute coding descriptor by using a condition controller, and transforming the target noise to obtain a noise coding descriptor;
constructing a joint descriptor based on the multi-attribute coding descriptor and the noise coding descriptor, and generating target three-dimensional content based on the joint descriptor by using a generator.
Referring to fig. 7, an embodiment of the present invention discloses a data set generating apparatus, including:
a two-dimensional image rendering module 11 for rendering each three-dimensional content in the three-dimensional content set as a two-dimensional image;
a question set construction module 12 for constructing a question set; the question set comprises questions corresponding to a plurality of attributes;
an attribute description determining module 13, configured to query, for each of the two-dimensional images, a graph-text question-answer pre-training model based on the question set to obtain an answer corresponding to each question, and to determine a text description of each attribute based on the answer corresponding to each attribute;
A data set determining module 14 for determining description information of each attribute of each of the three-dimensional contents based on the text description, to obtain a three-dimensional content description of each of the three-dimensional contents, to generate a three-dimensional content description data set; the three-dimensional content description includes the description information of a plurality of the attributes.
It can be seen that, in the embodiment of the present invention, each three-dimensional content in a three-dimensional content set is rendered into two-dimensional images, and a question set containing questions for a plurality of attributes is constructed; for each two-dimensional image, a graph-text question-answer pre-training model is queried with the question set to obtain an answer to each question, a text description of each attribute is determined from the answers, and the description information of each attribute of each three-dimensional content is then determined from the text descriptions, yielding a three-dimensional content description for each three-dimensional content and thus a three-dimensional content description data set. Because each three-dimensional content description contains description information for every one of the plurality of attributes, the resulting data set is richer and more accurate; this improves data set quality, which in turn safeguards the performance of the three-dimensional content generation model and the accuracy of the generated three-dimensional content.
In one embodiment, the two-dimensional image rendering module 11 is specifically configured to render each three-dimensional content in the three-dimensional content set into two-dimensional images under multiple viewing angles.
In one embodiment, the two-dimensional image rendering module 11 specifically includes:
a virtual camera position calculation sub-module for calculating a plurality of virtual camera positions based on a spherical coordinate system;
and the two-dimensional image rendering sub-module is used for rendering each three-dimensional content in the three-dimensional content set based on the plurality of virtual camera positions to obtain two-dimensional images under a plurality of viewing angles (the camera-position computation is sketched after this list).
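As referenced above, a sketch of computing virtual camera positions on a sphere around the object; the radius, the azimuth count, and the elevation angles are illustrative assumptions rather than values from the embodiment.

```python
# A sketch of sampling virtual camera positions from a spherical coordinate system.
import math

def camera_positions(radius=2.0, n_azimuth=8, elevations_deg=(20.0, -20.0)):
    positions = []
    for elev in elevations_deg:
        phi = math.radians(elev)
        for i in range(n_azimuth):
            theta = 2 * math.pi * i / n_azimuth
            # spherical -> Cartesian conversion
            x = radius * math.cos(phi) * math.cos(theta)
            y = radius * math.cos(phi) * math.sin(theta)
            z = radius * math.sin(phi)
            positions.append((x, y, z))
    return positions
```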
In a specific embodiment, the two-dimensional image rendering module 11 specifically includes:
the coordinate system conversion sub-module is used for converting each three-dimensional content in the three-dimensional content set into a world coordinate system;
a scaling sub-module for multiplying each point in the three-dimensional content in the world coordinate system by a scaling factor to complete scaling;
and the two-dimensional image rendering sub-module is used for rendering the three-dimensional content subjected to the scale scaling into a two-dimensional image.
Further, the two-dimensional image rendering module 11 further includes:
a scaling factor calculation module, configured to calculate, before each point in the three-dimensional content in the world coordinate system is multiplied by the scaling factor, the difference between the maximum and minimum values of the three-dimensional content along each coordinate axis in the world coordinate system, and to take the reciprocal of the largest of these per-axis differences as the scaling factor.
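A short sketch of this scaling-factor computation; representing the three-dimensional content as an (N, 3) NumPy array of points is an assumption.

```python
# Per-axis extents, then the reciprocal of the largest extent.
import numpy as np

def scale_to_unit(points):
    # per-axis extent: max - min along each coordinate axis
    extents = points.max(axis=0) - points.min(axis=0)
    # scaling factor: reciprocal of the largest per-axis extent
    factor = 1.0 / extents.max()
    return points * factor, factor
```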
The data set determining module 14 specifically includes:
the description fusion sub-module is used for determining, for each attribute, the scores of the text descriptions under a plurality of viewing angles by using an evaluation network model, and for fusing the text descriptions whose scores are greater than a preset threshold to obtain the description information of the attribute; the text descriptions under the multiple viewing angles are the text descriptions corresponding to the same three-dimensional content.
The description fusion sub-module is specifically used for: for each attribute, inputting the text descriptions under multiple viewing angles into the evaluation network model, extracting local features and global features, constructing joint features based on the local and global features, and outputting scores for the multi-view text descriptions based on the joint features. Specifically, for each attribute, the text descriptions under the multiple viewing angles are input into the evaluation network model; local features are obtained sequentially through a bi-directional encoder representation structure and a multi-layer perceptron layer, global features are obtained through a multi-layer perceptron layer and a pooling layer, and the scores of the multi-view text descriptions are output from the joint features by a preset number of multi-layer perceptron layers.
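A minimal sketch of such an evaluation network; `token_embeddings` is assumed to come from a BERT-style bi-directional encoder (not shown), and the pooling choices and layer widths are illustrative assumptions.

```python
# A sketch of the evaluation network: local + global features -> joint -> score.
import torch
import torch.nn as nn

class Evaluator(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.local_mlp = nn.Linear(hidden, 256)   # per-token local features
        self.global_mlp = nn.Linear(hidden, 256)
        # a preset number of MLP layers producing the final score
        self.score_head = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, token_embeddings):
        # token_embeddings: (B, T, hidden) for B multi-view text descriptions
        local = self.local_mlp(token_embeddings).mean(dim=1)        # (B, 256)
        glob = self.global_mlp(token_embeddings).max(dim=1).values  # pooled (B, 256)
        joint = torch.cat([local, glob], dim=1)                     # joint feature
        return self.score_head(joint)                               # (B, 1) scores
```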
The training process of the evaluation network model comprises the following steps (a sketch follows the list):
constructing a training data set, wherein the training data set comprises training samples and label information corresponding to the training samples, and the training samples are text descriptions corresponding to different attributes under a plurality of view angles;
inputting the training sample into an initial model to obtain scores of text descriptions under multiple view angles;
calculating a training loss based on the score and the tag information;
updating the initial model parameters based on the training loss to obtain a model with updated parameters;
and training and iterating the model after parameter updating until the training stopping condition is met, and determining the current model after parameter updating as an evaluation network model.
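As noted above, a generic sketch of this training loop; the mean-squared-error loss against the label scores and the optimizer settings are assumptions.

```python
# A generic sketch of the evaluation-model training loop.
import torch
import torch.nn as nn

def train_evaluator(model, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):  # iterate until the stop condition is met
        for token_embeddings, labels in loader:
            scores = model(token_embeddings).squeeze(-1)  # multi-view scores
            loss = mse(scores, labels)                    # training loss vs labels
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```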
Further, the apparatus is further configured to:
if none of the scores output by the evaluation network model for the text descriptions under the multiple viewing angles is greater than the preset threshold, query the graph-text question-answer pre-training model again for a text description of the attribute based on the question corresponding to the attribute, then determine the scores of the text descriptions under the multiple viewing angles with the evaluation network model and fuse the text descriptions whose scores are greater than the preset threshold to obtain the description information of the attribute.
The question set construction module 12 is specifically configured to set questions for a plurality of attributes respectively to obtain the question set; wherein the plurality of attributes includes at least two of a conceptual attribute, a geometric attribute, a color attribute, and a material attribute.
In one embodiment, when each attribute in the question set corresponds to one question, the attribute description determining module 13 is configured to query the graph-text question-answer pre-training model multiple times with the question of each attribute to obtain multiple answers, and to determine the answer most similar to the two-dimensional image among them as the text description of the attribute. Specifically, the similarity between each answer and the two-dimensional image is calculated with a graph-text contrast pre-training model, and the answer with the highest similarity is taken as the text description of the attribute.
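A hedged sketch of this answer selection; `clip_score(image, text)` stands in for the graph-text contrast pre-training model's similarity function, and its name and signature are assumptions.

```python
# Pick the candidate answer most similar to the rendered image.
def best_answer(image, answers, clip_score):
    # keep the answer whose image-text similarity is highest
    return max(answers, key=lambda a: clip_score(image, a))
```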
In another embodiment, when each attribute in the question set corresponds to a plurality of questions, the attribute description determining module 13 is configured to query the graph-text question-answer pre-training model with the plurality of questions of each attribute to obtain a plurality of answers, and to determine the answer most similar to the two-dimensional image among them as the text description of the attribute.
Referring to fig. 8, an embodiment of the present invention discloses an electronic device 20 comprising a processor 21 and a memory 22, wherein the memory 22 is used for storing a computer program, and the processor 21 is configured to execute the computer program to implement the data set generation method based on the multi-modal pre-training model and/or the three-dimensional content generation model training method and/or the three-dimensional content generation method disclosed in the foregoing embodiments.
For the specific processes of the data set generation method based on the multi-modal pre-training model, the three-dimensional content generation model training method, and the three-dimensional content generation method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
The memory 22 may be any carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the storage may be transient or permanent.
In addition, the electronic device 20 further includes a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The power supply 23 provides an operating voltage for each hardware device on the electronic device 20; the communication interface 24 creates a data transmission channel between the electronic device 20 and an external device, following any communication protocol applicable to the technical solution of the present invention, which is not specifically limited herein; the input/output interface 25 acquires external input data or outputs data to the outside, and its specific interface type may be selected according to the application requirements, which is likewise not limited herein.
Further, an embodiment of the invention also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the data set generation method based on the multi-modal pre-training model disclosed in the foregoing embodiments, and/or the three-dimensional content generation model training method, and/or the three-dimensional content generation method.
For the specific processes of these methods, reference may likewise be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The data set generation method and the training method based on a multi-modal pre-training model provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the invention, and the above description of the embodiments is intended only to help in understanding the method of the invention and its core ideas. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope according to the ideas of the present invention. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (23)

1. A method for generating a data set based on a multi-modal pre-training model, comprising:
rendering each three-dimensional content in the three-dimensional content set into a two-dimensional image;
constructing a question set; the question set comprises questions corresponding to a plurality of attributes;
for each two-dimensional image, querying a graph-text question-answer pre-training model based on the question set to obtain an answer corresponding to each question, and determining a text description of each attribute based on the answer corresponding to each attribute;
determining description information of each attribute of each three-dimensional content based on the text description, and obtaining three-dimensional content description of each three-dimensional content to generate a three-dimensional content description data set; the three-dimensional content description includes the description information of a plurality of the attributes.
2. The method for generating a dataset based on a multimodal pre-training model as claimed in claim 1, wherein the rendering each three-dimensional content in the set of three-dimensional content as a two-dimensional image comprises:
each three-dimensional content in the set of three-dimensional content is rendered as a two-dimensional image at a plurality of perspectives.
3. The method for generating a dataset based on a multimodal pre-training model as claimed in claim 2, wherein the rendering each three-dimensional content in the set of three-dimensional content as a two-dimensional image at a plurality of perspectives comprises:
calculating a plurality of virtual camera positions based on the spherical coordinate system;
and rendering each three-dimensional content in the three-dimensional content set based on the plurality of virtual camera positions to obtain two-dimensional images under a plurality of view angles.
4. The method for generating a dataset based on a multimodal pre-training model as claimed in claim 1, wherein the rendering each three-dimensional content in the set of three-dimensional content as a two-dimensional image comprises:
converting each three-dimensional content in the three-dimensional content set to a world coordinate system;
multiplying each point in the three-dimensional content in the world coordinate system by a scaling factor to accomplish scaling;
Rendering the scaled three-dimensional content into a two-dimensional image.
5. The method of generating a multi-modal pre-training model based dataset of claim 4, further comprising, prior to said multiplying each point in the three-dimensional content in the world coordinate system by a scaling factor to complete scaling:
calculating the difference value between the maximum value and the minimum value of the three-dimensional content in each coordinate axis in the world coordinate system;
and taking the reciprocal of the maximum value in the difference values corresponding to the coordinate axes to obtain the scaling factor.
6. The method for generating a data set based on a multimodal pre-training model according to claim 2, wherein the determining description information of each attribute of each of the three-dimensional contents based on the text description comprises:
determining the scores of the text descriptions under a plurality of view angles by using an evaluation network model aiming at each attribute, and fusing the text descriptions with the scores larger than a preset threshold value to obtain the description information of the attribute; the text descriptions under the multiple view angles are text descriptions corresponding to the same three-dimensional content.
7. The method for generating a multimodal pre-training model based data set according to claim 6, wherein the determining the scores of the text descriptions at a plurality of perspectives using the evaluation network model for each attribute comprises:
For each attribute, inputting text descriptions under multiple views into an evaluation network model, extracting local features and global features, constructing joint features based on the local features and the global features, and outputting scores of the text descriptions under multiple views based on the joint features.
8. The method for generating a data set based on a multi-modal pre-training model according to claim 7, wherein the inputting text descriptions at a plurality of perspectives into the evaluation network model for each attribute, extracting local features and global features, comprises:
and inputting text description under multiple view angles into an evaluation network model aiming at each attribute, sequentially obtaining local characteristics through a bi-directional encoder representation structure and a multi-layer perceptron layer, and obtaining global characteristics through the multi-layer perceptron layer and a pooling layer.
9. The method for generating a dataset based on a multimodal pre-training model as claimed in claim 7, wherein the outputting of the scores of the text descriptions at a plurality of perspectives based on the joint features comprises:
and outputting scores of text descriptions under multiple view angles by using a preset number of multi-layer perceptron layers based on the joint characteristics.
10. The method for generating a data set based on a multi-modal pre-training model according to claim 6, wherein the training process of evaluating the network model comprises:
constructing a training data set, wherein the training data set comprises training samples and label information corresponding to the training samples, and the training samples are text descriptions corresponding to different attributes under a plurality of view angles;
inputting the training sample into an initial model to obtain scores of text descriptions under multiple view angles;
calculating a training loss based on the score and the tag information;
updating parameters of the initial model based on the training loss to obtain a model with updated parameters;
and training and iterating the model after parameter updating until the training stopping condition is met, and determining the current model after parameter updating as an evaluation network model.
11. The multi-modal pre-training model based dataset generation method as claimed in claim 6, further comprising:
if none of the scores of the text descriptions under the multiple viewing angles output by the evaluation network model is greater than the preset threshold, querying the graph-text question-answer pre-training model for a text description of the attribute based on the question corresponding to the attribute, and performing the steps of determining the scores of the text descriptions under the multiple viewing angles by using the evaluation network model and fusing the text descriptions whose scores are greater than the preset threshold to obtain the description information of the attribute.
12. The method for generating a dataset based on a multimodal pre-training model as claimed in claim 1, wherein the constructing a question set comprises:
setting questions for a plurality of attributes respectively to obtain the question set;
wherein the plurality of attributes includes at least two of a conceptual attribute, a geometric attribute, a color attribute, and a material attribute.
13. The method for generating a data set based on a multi-modal pre-training model according to claim 1, wherein the querying a graph-text question-answer pre-training model based on the question set to obtain an answer corresponding to each question, and determining a text description of each attribute based on the answer corresponding to each attribute, comprises:
when each attribute in the question set corresponds to one question, querying the graph-text question-answer pre-training model multiple times with the question of each attribute to obtain a plurality of answers, and determining the answer most similar to the two-dimensional image among the plurality of answers as the text description of the attribute.
14. The method of claim 13, wherein determining an answer of the plurality of answers that is most similar to the two-dimensional image as a textual description of the attribute comprises:
calculating the similarity between each of the plurality of answers and the two-dimensional image by using a graph-text contrast pre-training model, and determining the answer with the highest similarity as the text description of the attribute.
15. The method for generating a data set based on a multi-modal pre-training model according to claim 1, wherein the querying a graph-text question-answer pre-training model based on the question set to obtain an answer corresponding to each question, and determining a text description of each attribute based on the answer corresponding to each attribute, comprises:
when each attribute in the question set corresponds to a plurality of questions, querying the graph-text question-answer pre-training model with the plurality of questions of each attribute to obtain a plurality of answers, and determining the answer most similar to the two-dimensional image among the plurality of answers as the text description of the attribute.
16. A three-dimensional content generation model training method, comprising:
training a generative adversarial network based on a three-dimensional content description dataset; the three-dimensional content description dataset is generated according to the multi-modal pre-training model-based dataset generation method of any one of claims 1 to 15; the generative adversarial network includes a discriminator, a generator, and a condition controller;
and when the discriminator cannot distinguish whether the content generated by the generator is real, determining the network structure formed by the generator and the condition controller in the current generative adversarial network as a three-dimensional content generation model.
17. The three-dimensional content generation model training method of claim 16, wherein the training a generative adversarial network based on the three-dimensional content description dataset comprises:
alternately training the generator and the discriminator based on the three-dimensional content description dataset: freezing the generator and training the discriminator, then freezing the discriminator and training the generator, and performing this alternation a plurality of times.
18. The three-dimensional content generation model training method of claim 17, wherein the training process of the generator comprises:
acquiring three-dimensional content description from the three-dimensional content description data set, and generating a multi-attribute coding descriptor by using the condition controller;
transforming the initial noise to obtain a noise coding descriptor, and constructing a joint descriptor based on the noise coding descriptor and the multi-attribute coding descriptor;
inputting the joint descriptors into the generator to obtain three-dimensional content generated by the generator;
Inputting the three-dimensional content into the discriminator to obtain a predicted value corresponding to the three-dimensional content;
calculating a first training loss based on the predicted value and a first loss function, and updating parameters of the generator based on the first training loss.
19. The three-dimensional content generation model training method of claim 17, wherein the training process of the discriminator comprises:
inputting the true value data and the three-dimensional content generated by the generator into the discriminator to obtain a first predicted value corresponding to the true value data and a second predicted value corresponding to the three-dimensional content;
calculating a second training loss based on the first predicted value, the second predicted value, and a second loss function, and updating parameters of the discriminator based on the second training loss.
20. A three-dimensional content generation method, comprising:
acquiring target description and target noise, and inputting a three-dimensional content generation model; the target description is description information of a plurality of attributes; wherein the three-dimensional content generation model is trained according to the three-dimensional content generation model training method of any one of claims 16 to 19;
generating a multi-attribute coding descriptor by using a condition controller, and transforming the target noise to obtain a noise coding descriptor;
Constructing a joint descriptor based on the multi-attribute coding descriptor and the noise coding descriptor, and generating target three-dimensional content based on the joint descriptor by using a generator.
21. A data set generating apparatus, comprising:
the two-dimensional image rendering module is used for rendering each three-dimensional content in the three-dimensional content set into a two-dimensional image;
the question set construction module is used for constructing a question set; the question set comprises questions corresponding to a plurality of attributes;
the attribute description determining module is used for querying, for each two-dimensional image, a graph-text question-answer pre-training model based on the question set to obtain an answer corresponding to each question, and determining the text description of each attribute based on the answer corresponding to each attribute;
a data set determining module, configured to determine description information of each attribute of each three-dimensional content based on the text description, and obtain a three-dimensional content description of each three-dimensional content, so as to generate a three-dimensional content description data set; the three-dimensional content description includes the description information of a plurality of the attributes.
22. An electronic device comprising a memory and a processor, wherein:
The memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the data set generation method based on a multi-modal pre-training model according to any one of claims 1 to 15, and/or the three-dimensional content generation model training method according to any one of claims 16 to 19, and/or the three-dimensional content generation method according to claim 20.
23. A computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements a multi-modal pre-training model based dataset generation method as claimed in any of claims 1 to 15 and/or a three-dimensional content generation model training method as claimed in any of claims 16 to 19 and/or a three-dimensional content generation method as claimed in claim 20.
CN202311177091.1A 2023-09-13 2023-09-13 Data set generation method and training method based on multi-mode pre-training model Active CN116932803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311177091.1A CN116932803B (en) 2023-09-13 2023-09-13 Data set generation method and training method based on multi-mode pre-training model


Publications (2)

Publication Number Publication Date
CN116932803A true CN116932803A (en) 2023-10-24
CN116932803B CN116932803B (en) 2024-01-26

Family

ID=88377353


Country Status (1)

Country Link
CN (1) CN116932803B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069656A (en) * 2019-03-28 2019-07-30 天津大学 A method of threedimensional model is retrieved based on the two-dimension picture for generating confrontation network
WO2021217935A1 (en) * 2020-04-29 2021-11-04 深圳壹账通智能科技有限公司 Method for training question generation model, question generation method, and related device
CN113806587A (en) * 2021-08-24 2021-12-17 西安理工大学 Multi-mode feature fusion video description text generation method
CN114996502A (en) * 2022-06-23 2022-09-02 天津理工大学 Multi-task learning model combining image-text matching and visual reasoning, visual common sense reasoning method and computer equipment
CN116682140A (en) * 2023-05-29 2023-09-01 北京新清泰克科技有限公司 Three-dimensional human body posture estimation algorithm based on attention mechanism multi-mode fusion
CN116721221A (en) * 2023-08-08 2023-09-08 浪潮电子信息产业股份有限公司 Multi-mode-based three-dimensional content generation method, device, equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152370A (en) * 2023-10-30 2023-12-01 碳丝路文化传播(成都)有限公司 AIGC-based 3D terrain model generation method, system, equipment and storage medium
CN117152370B (en) * 2023-10-30 2024-02-02 碳丝路文化传播(成都)有限公司 AIGC-based 3D terrain model generation method, system, equipment and storage medium
CN117235534A (en) * 2023-11-13 2023-12-15 支付宝(杭州)信息技术有限公司 Method and device for training content understanding model and content generating model
CN117235534B (en) * 2023-11-13 2024-02-20 支付宝(杭州)信息技术有限公司 Method and device for training content understanding model and content generating model
CN117475089A (en) * 2023-12-27 2024-01-30 浪潮电子信息产业股份有限公司 Three-dimensional scene generation method based on pre-training language model and related components
CN117475089B (en) * 2023-12-27 2024-03-29 浪潮电子信息产业股份有限公司 Three-dimensional scene generation method based on pre-training language model and related components
CN117473105A (en) * 2023-12-28 2024-01-30 浪潮电子信息产业股份有限公司 Three-dimensional content generation method based on multi-mode pre-training model and related components
CN117473105B (en) * 2023-12-28 2024-04-05 浪潮电子信息产业股份有限公司 Three-dimensional content generation method based on multi-mode pre-training model and related components

Also Published As

Publication number Publication date
CN116932803B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
CN116932803B (en) Data set generation method and training method based on multi-mode pre-training model
CN107767384B (en) Image semantic segmentation method based on countermeasure training
Shan et al. Ptt: Point-track-transformer module for 3d single object tracking in point clouds
CN111754596A (en) Editing model generation method, editing model generation device, editing method, editing device, editing equipment and editing medium
CN112580694B (en) Small sample image target recognition method and system based on joint attention mechanism
CN111967272A (en) Visual dialog generation system based on semantic alignment
CN112819080B (en) High-precision universal three-dimensional point cloud identification method
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN113343974A (en) Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
Zhao et al. Zero-shot object goal visual navigation
CN115661246A (en) Attitude estimation method based on self-supervision learning
Lathrop et al. Exploring the functional advantages of spatial and visual cognition from an architectural perspective
CN114780777B (en) Cross-modal retrieval method and device based on semantic enhancement, storage medium and terminal
CN116091823A (en) Single-feature anchor-frame-free target detection method based on fast grouping residual error module
CN115424013A (en) Model training method, image processing apparatus, and medium
Mi et al. Intention-related natural language grounding via object affordance detection and intention semantic extraction
CN115187839B (en) Image-text semantic alignment model training method and device
CN116109649A (en) 3D point cloud instance segmentation method based on semantic error correction
Zhang et al. An improved YOLOv5s algorithm for emotion detection
Panesar et al. Improving visual question answering by leveraging depth and adapting explainability
CN117473105B (en) Three-dimensional content generation method based on multi-mode pre-training model and related components
CN116863482B (en) Mutual inductor detection method, device, equipment and storage medium
CN110826726A (en) Object processing method, object processing apparatus, object processing device, and medium
CN116911268B (en) Table information processing method, apparatus, processing device and readable storage medium
Kriegler et al. PrimitivePose: 3D Bounding Box Prediction of Unseen Objects via Synthetic Geometric Primitives

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant