CN117473105A - Three-dimensional content generation method based on multimodal pre-training model and related components

Three-dimensional content generation method based on multimodal pre-training model and related components

Info

Publication number
CN117473105A
CN117473105A
Authority
CN
China
Prior art keywords
descriptor
three-dimensional content
text description
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311827111.5A
Other languages
Chinese (zh)
Other versions
CN117473105B (en)
Inventor
杜国光
范宝余
赵雅倩
王丽
郭振华
李仁刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202311827111.5A
Publication of CN117473105A
Application granted
Publication of CN117473105B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/438Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a three-dimensional content generation method based on a multimodal pre-training model, and related components, relating to the field of data processing and addressing the problem of slow three-dimensional content generation. The scheme acquires a target text description input by a user; searches in a three-dimensional content database based on the target text description and the multimodal pre-training model to determine a first three-dimensional content and its corresponding third text description; determines the text description difference between the target text description and the third text description; and drives the first three-dimensional content to deform based on the text description difference to obtain the target three-dimensional content. By searching the three-dimensional content database with the multimodal pre-training model, the first three-dimensional content can be determined more quickly, and deforming it based on the target text description yields the target three-dimensional content corresponding to the target text description. Compared with generating three-dimensional content from scratch, this produces a target three-dimensional content meeting the requirements more quickly, improving generation efficiency and speed.

Description

Three-dimensional content generation method based on multimodal pre-training model and related components
Technical Field
The application relates to the field of data processing, and in particular to a three-dimensional content generation method based on a multimodal pre-training model and related components.
Background
Artificial Intelligence Generated Content (AIGC) uses artificial intelligence technology to automatically generate digital content in the form of text, audio, images and so on. It plays an important role in the film, television, entertainment and media industries, improves the working efficiency and quality of content creators, and promotes the digitization and intelligentization of enterprises.
Intelligent generation of 3D (three-dimensional) content is one of the important applications of AIGC. With the development of generative AI technology and image-text multimodal pre-training models, the generation of high-quality, diversified 3D content can be guided by text descriptions. After a text description is encoded by a multimodal pre-training model, the generation model can be guided to produce the target 3D content through pre-generation control or post-generation guidance, with good results.
However, existing methods for generating 3D content from text descriptions have problems: in general, they generate the 3D content from scratch, resulting in slow generation and low efficiency. For example, one common method uses a denoising diffusion probabilistic model; starting from pure noise, a lengthy denoising process is required to obtain the final 3D content, which takes considerable time.
Disclosure of Invention
The object of the present application is to provide a three-dimensional content generation method based on a multimodal pre-training model, and related components. The method retrieves in a three-dimensional content database using the multimodal pre-training model, so that a first three-dimensional content can be determined more quickly, and then deforms the first three-dimensional content based on a target text description to obtain the target three-dimensional content corresponding to the target text description.
In order to solve the above technical problems, the present application provides a three-dimensional content generation method based on a multimodal pre-training model, comprising:
acquiring a target text description input by a user;
searching in a three-dimensional content database based on the target text description and the multimodal pre-training model to determine first three-dimensional content;
acquiring a third text description corresponding to the first three-dimensional content;
determining a text description difference between the target text description and the third text description;
and driving the first three-dimensional content to deform based on the text description difference to obtain the target three-dimensional content.
In one embodiment, retrieving in a three-dimensional content database based on the target text description and a multimodal pre-training model, determining the first three-dimensional content includes:
searching in the three-dimensional content database based on the target text description and the multimodal pre-training model, and determining a target category corresponding to the target text description;
the first three-dimensional content is determined from each three-dimensional content corresponding to the target category.
In one embodiment, retrieving in a three-dimensional content database based on the target text description and a multimodal pre-training model, determining a target category corresponding to the target text description includes:
acquiring a first descriptor of the target text description, and acquiring a second descriptor corresponding to each category name in the three-dimensional content database;
determining a second descriptor with the smallest cosine distance from the first descriptor as a first target descriptor according to the first descriptor and each second descriptor;
and determining the category corresponding to the first target descriptor as the target category.
In one embodiment, obtaining the first descriptor of the target text description and obtaining the second descriptor corresponding to each category name in the three-dimensional content database includes:
obtaining a third descriptor corresponding to the target text description and a fourth descriptor corresponding to each category name through an image-text contrastive pre-training model;
obtaining a fifth descriptor corresponding to the target text description and a sixth descriptor corresponding to each category name through a pre-training language model;
superposing the third descriptor and the fifth descriptor to obtain the first descriptor;
and superposing the fourth descriptor and the sixth descriptor to obtain the second descriptor.
In one embodiment, determining the first three-dimensional content from among the respective three-dimensional content corresponding to the target category includes:
acquiring a seventh descriptor of each three-dimensional content in the target category;
determining a seventh descriptor with the smallest cosine distance from the first descriptor as a second target descriptor according to the first descriptor and the seventh descriptor;
and determining the three-dimensional content corresponding to the second target descriptor as the first three-dimensional content.
In one embodiment, obtaining a seventh descriptor of each three-dimensional content in the target class includes:
rendering the three-dimensional contents in the target category at multiple angles to obtain a first two-dimensional image at multiple angles;
processing each first two-dimensional image based on a bootstrapping image-text pre-training model to obtain a corresponding first text description;
and acquiring first descriptors corresponding to the first text descriptions, and taking the first descriptors as the seventh descriptors.
In one embodiment, obtaining a first descriptor corresponding to each of the first text descriptions, and taking the first descriptor as the seventh descriptor includes:
acquiring a first text descriptor corresponding to the first text description, and acquiring a first image descriptor corresponding to each first two-dimensional image;
and superposing the first text descriptor and the first image descriptor to obtain a first mixed descriptor serving as the seventh descriptor.
In one embodiment, before determining that the seventh descriptor having the smallest cosine distance from the first descriptor is the second target descriptor according to the first descriptor and the seventh descriptor, the method further includes:
obtaining a second two-dimensional image corresponding to the target text description based on the target text description;
acquiring a second image descriptor corresponding to the second two-dimensional image;
superposing the first descriptor and the second image descriptor to obtain a second mixed descriptor;
Determining, from the first descriptor and the seventh descriptor, that a seventh descriptor having a smallest cosine distance from the first descriptor is a second target descriptor, including:
and determining a first mixed descriptor with the smallest cosine distance from the second mixed descriptor as the second target descriptor according to the second mixed descriptor and each first mixed descriptor.
In one embodiment, further comprising:
constructing a mixed descriptor extraction network model, and optimizing the mixed descriptor extraction network model by using a contrastive loss function;
judging a mixed descriptor extraction network model for which the output value of the contrastive loss function is smaller than a first threshold to be a mixed descriptor extraction network model meeting a first iteration end condition, and taking the mixed descriptor extraction network model meeting the first iteration end condition as the final mixed descriptor extraction network model;
acquiring a first text descriptor corresponding to the first text description, and acquiring a first image descriptor corresponding to each first two-dimensional image; superposing the first text descriptor and the first image descriptor to obtain a first mixed descriptor serving as the seventh descriptor, wherein the method comprises the following steps:
inputting the first text description and each first two-dimensional image into the final mixed descriptor extraction network model to obtain the first mixed descriptor;
acquiring a second image descriptor corresponding to the second two-dimensional image, and superposing the first descriptor and the second image descriptor to obtain a second mixed descriptor, wherein the method comprises the following steps:
and inputting the target text description and the second two-dimensional image into the final mixed descriptor extraction network model to obtain the second mixed descriptor.
In one embodiment, optimizing the mixed descriptor extraction network model using a contrastive loss function includes:
inputting a third two-dimensional image, rendered from the preset three-dimensional content at a preset viewing angle, and the corresponding second text description into the mixed descriptor extraction network model, and calculating the output value of the contrastive loss function through the mixed descriptor extraction network model;
when the output value of the contrastive loss function is larger than a second threshold, optimizing the mixed descriptor extraction network model using a first negative sample and a preset positive sample, the second threshold being larger than the first threshold;
when the output value of the contrastive loss function is not larger than the second threshold, optimizing the mixed descriptor extraction network model using a second negative sample and the preset positive sample;
wherein the preset positive sample is a two-dimensional image rendered from the preset three-dimensional content at a viewing angle other than the preset viewing angle, together with its corresponding text description; a negative sample is a two-dimensional image rendered at any viewing angle from a three-dimensional content of the same category as, but other than, the preset three-dimensional content, together with its corresponding text description; and the cosine distance between the first negative sample and the pair formed by the third two-dimensional image and the second text description is larger than the cosine distance between the second negative sample and that pair.
In one embodiment, further comprising:
extracting a third mixed descriptor corresponding to the two-dimensional image and the text description of each negative sample;
extracting a fourth mixed descriptor corresponding to the third two-dimensional image and the second text description;
calculating cosine distances to be compared between the third mixed descriptors and the fourth mixed descriptors;
taking a negative sample with the cosine distance to be compared being greater than a third threshold value as the first negative sample;
and taking a negative sample with the cosine distance to be compared not larger than the third threshold value as the second negative sample.
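As an illustration of this split, the following Python sketch partitions candidate negatives by the cosine distance of their mixed descriptors to the anchor pair; the threshold value, descriptor shapes and function name are assumptions for illustration, not values given by the patent.

```python
import numpy as np

def split_negatives(anchor_desc: np.ndarray,
                    negative_descs: list,
                    third_threshold: float = 0.5):
    """Partition negatives: distance > threshold -> first (far) negatives,
    otherwise -> second (near) negatives."""
    first_negatives, second_negatives = [], []
    for desc in negative_descs:
        cos_dist = 1.0 - np.dot(anchor_desc, desc) / (
            np.linalg.norm(anchor_desc) * np.linalg.norm(desc))
        if cos_dist > third_threshold:   # far from the anchor pair
            first_negatives.append(desc)
        else:                            # close to the anchor pair
            second_negatives.append(desc)
    return first_negatives, second_negatives
```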
In one embodiment, obtaining a third text description corresponding to the first three-dimensional content includes:
rendering the first three-dimensional content at multiple viewing angles to obtain a fourth two-dimensional image at multiple viewing angles;
processing each fourth two-dimensional image based on a bootstrapping image-text pre-training model to obtain a corresponding fourth text description;
and integrating the fourth text descriptions to obtain the third text description.
In one embodiment, determining a text description difference between the target text description and the third text description includes:
acquiring a first descriptor corresponding to the target text description;
acquiring an eighth descriptor corresponding to the third text description;
obtaining a difference descriptor between the first descriptor and the eighth descriptor based on the first descriptor and the eighth descriptor;
driving the first three-dimensional content to deform based on the text description difference to obtain the target three-dimensional content, wherein the method comprises the following steps:
and driving the first three-dimensional content to deform based on the difference descriptor to obtain the target three-dimensional content.
In one embodiment, driving the first three-dimensional content to deform based on the difference descriptor to obtain the target three-dimensional content includes:
acquiring first point cloud data of the first three-dimensional content;
extracting global descriptors and local descriptors of the first three-dimensional content according to the first point cloud data;
determining an offset according to the difference descriptor, the global descriptor and the local descriptor, wherein the offset comprises a color offset and a position offset;
determining second point cloud data of the target three-dimensional content according to the first point cloud data of the first three-dimensional content and the offset;
and obtaining the target three-dimensional content according to the second point cloud data.
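A minimal sketch of this offset-driven deformation, assuming the point cloud is stored as (N, 3) position and color arrays and that the offsets have already been predicted by the network; the array layout and the RGB clipping are assumptions.

```python
import numpy as np

def deform_point_cloud(points, colors, pos_offset, color_offset):
    """Apply predicted per-point offsets: first point cloud -> second point cloud.

    points, pos_offset:   (N, 3) xyz coordinates and predicted position offset
    colors, color_offset: (N, 3) rgb values in [0, 1] and predicted color offset
    """
    new_points = points + pos_offset
    new_colors = np.clip(colors + color_offset, 0.0, 1.0)  # keep rgb valid
    return new_points, new_colors
```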
In one embodiment, further comprising:
pre-constructing a deformation network model, wherein the deformation network model comprises a text description difference descriptor extraction structure, a three-dimensional content descriptor extraction structure and a three-dimensional content offset prediction structure;
optimizing the deformation network model, and taking the deformation network model meeting a second preset iteration end condition as the final deformation network model;
determining a text description difference between the target text description and the third text description, comprising:
inputting the target text description and the third text description into the text description difference descriptor extraction structure in the final deformation network model to obtain the difference descriptor;
Extracting global descriptors and local descriptors of the first three-dimensional content according to the first point cloud data, wherein the extracting comprises the following steps:
inputting the first point cloud data into a three-dimensional content descriptor extraction structure in the final deformation network model to obtain a global descriptor and a local descriptor of the first three-dimensional content;
determining an offset from the difference descriptor, the global descriptor, and the local descriptor, comprising:
and determining the offset according to the difference descriptor, the global descriptor, the local descriptor and the three-dimensional content offset prediction structure in the final deformation network model.
In one embodiment, optimizing the deformation network model and taking the deformation network model meeting the second preset iteration end condition as the final deformation network model includes:
acquiring actual three-dimensional content obtained by deforming the first three-dimensional content based on the offset;
calculating an output value of a total loss function between the actual three-dimensional content and the target three-dimensional content;
when the output value of the total loss function is smaller than a fourth threshold value, judging that the second preset iteration ending condition is met;
And taking the deformation network model with the output value of the total loss function smaller than the fourth threshold value as a final deformation network model.
In one embodiment, calculating the output value of the total loss function between the actual three-dimensional content and the target three-dimensional content includes:
calculating the output value of the total loss function between the actual three-dimensional content and the target three-dimensional content by using a preset formula:
$L = \lambda_1 L_{dist} + \lambda_2 L_{percep} + \lambda_3 L_{align}$
wherein $L$ is the total loss function; $\lambda_1$, $\lambda_2$ and $\lambda_3$ represent different weights; $L_{dist}$ is the distance loss function; $L_{percep}$ is the point cloud perceptual loss function; $L_{align}$ is the alignment loss function; $G_1$ is the first set of sampling points in the actual three-dimensional content and $p$ is a sampling point in the first set; $G$ is the second set of sampling points in the target three-dimensional content and $q$ is a sampling point in the second set; $f_{actual}$ is the descriptor corresponding to the point cloud data of the actual three-dimensional content and $f_{target}$ is the descriptor corresponding to the point cloud data of the target three-dimensional content; $a_{actual}$ is the alignment descriptor of the actual three-dimensional content and $a_{target}$ is the alignment descriptor of the target three-dimensional content.
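The published text does not preserve the concrete forms of the three loss terms. A common instantiation consistent with the symbols above, assuming a bidirectional Chamfer distance for the distance term and squared-error comparisons for the perceptual and alignment terms, would be:

```latex
L_{dist}   = \sum_{p \in G_1} \min_{q \in G} \lVert p - q \rVert_2^2
           + \sum_{q \in G}   \min_{p \in G_1} \lVert p - q \rVert_2^2 \\
L_{percep} = \lVert f_{actual} - f_{target} \rVert_2^2 \qquad
L_{align}  = \lVert a_{actual} - a_{target} \rVert_2^2
```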
In order to solve the above technical problem, the present application further provides a three-dimensional content generation system based on a multimodal pre-training model, including:
an acquisition unit for acquiring a target text description input by a user;
a retrieval unit for retrieving in a three-dimensional content database based on the target text description and the multimodal pre-training model to determine first three-dimensional content;
a text acquisition unit for acquiring a third text description corresponding to the first three-dimensional content;
a difference determining unit for determining a text description difference between the target text description and the third text description;
and a driving deformation unit for driving the first three-dimensional content to deform based on the text description difference to obtain the target three-dimensional content.
In order to solve the above technical problem, the present application further provides a three-dimensional content generation device based on a multimodal pre-training model, including:
a memory for storing a computer program;
a processor for implementing the steps of the three-dimensional content generation method based on the multimodal pre-training model as described above when executing the computer program.
In order to solve the above technical problem, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the three-dimensional content generation method based on the multimodal pre-training model as described above.
The application provides a three-dimensional content generation method based on a multimodal pre-training model, and related components, relating to the field of data processing and aiming to solve the problem of slow three-dimensional content generation. In the scheme, a target text description input by a user is acquired; a search is performed in a three-dimensional content database based on the target text description and the multimodal pre-training model to determine first three-dimensional content, and a corresponding third text description is determined; a text description difference between the target text description and the third text description is determined; and the first three-dimensional content is driven to deform based on the text description difference to obtain the target three-dimensional content. By searching the three-dimensional content database with the multimodal pre-training model, the first three-dimensional content can be determined more quickly, and deforming it based on the target text description yields the target three-dimensional content corresponding to the target text description.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments and in the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of the three-dimensional content generation method based on a multimodal pre-training model provided by the present application;
FIG. 2 is a specific flowchart of the three-dimensional content generation method based on a multimodal pre-training model provided in the present application;
FIG. 3 is a schematic view of category retrieval of three-dimensional content provided in the present application;
FIG. 4 is a schematic diagram of multi-view rendering and image text description generation provided herein;
FIG. 5 is a schematic structural diagram of the mixed descriptor extraction network model provided in the present application;
FIG. 6 is a schematic diagram of target text description image generation provided herein;
FIG. 7 is a schematic diagram of the negative sample selection strategy provided herein;
FIG. 8 is a schematic diagram of text integration provided herein;
FIG. 9 is a schematic diagram of the deformation network provided herein;
FIG. 10 is a schematic diagram of inference based on the deformation network provided herein;
FIG. 11 is a schematic diagram of the three-dimensional content generation system based on a multimodal pre-training model provided in the present application.
Detailed Description
The core of the application is to provide a three-dimensional content generation method based on a multimodal pre-training model, and related components. The multimodal pre-training model is used to search a three-dimensional content database so that a first three-dimensional content can be determined more quickly; the first three-dimensional content is then deformed based on the target text description to obtain the target three-dimensional content corresponding to the target text description.
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
As shown in fig. 1, the present application provides a three-dimensional content generation method based on a multimodal pre-training model, including:
S11: acquiring a target text description input by a user;
This step obtains the target text description input by the user, which may be a textual description of the three-dimensional content to be generated, such as "a chair and a table scene in sunlight". By entering the target text description, the user directs the system to generate the corresponding target three-dimensional content. The description may include information about the scene, objects, context and so on, guiding the generated three-dimensional content to match the user's expectations.
S12: searching in a three-dimensional content database based on the target text description and the multimodal pre-training model to determine first three-dimensional content;
This step makes use of the multimodal pre-training model's ability to understand both text descriptions and three-dimensional content. By searching the three-dimensional content database, a first three-dimensional content related to the target text description can be found quickly as initial content, which subsequent steps deform to obtain the target three-dimensional content corresponding to the target text description. This step therefore finds candidate three-dimensional content efficiently with the help of the multimodal pre-training model and provides a basis for the subsequent steps.
S13: acquiring a third text description corresponding to the first three-dimensional content;
S14: determining a text description difference between the target text description and the third text description;
S15: driving the first three-dimensional content to deform based on the text description difference to obtain the target three-dimensional content.
In this step, after the first three-dimensional content most closely matching the target text description has been retrieved in S12 through the multimodal pre-training model and the three-dimensional content database, the first three-dimensional content is deformed based on the target text description to obtain the final target three-dimensional content corresponding to the target text description.
To achieve this, the target text description is used as a guide to adjust the properties, shape or other characteristics of the first three-dimensional content. In particular, various morphing techniques, such as model morphing, geometric transformation or texture transformation, may be applied to alter the appearance and structure of the first three-dimensional content so that it conforms to the characteristics required by the target text description.
In this embodiment, the specific steps of deforming the first three-dimensional content are as follows: first, a third text description corresponding to the first three-dimensional content is acquired; then, the text description difference between the target text description and the third text description is determined; finally, deformation of the first three-dimensional content is driven based on the text description difference to obtain the target three-dimensional content. The key point of this embodiment is that the difference between the target text description and the text description of the first three-dimensional content is determined by analyzing the text descriptions, so that the first three-dimensional content can be deformed in a targeted manner to better conform to the target text description, improving the accuracy and efficiency of three-dimensional content generation.
Through such a morphing operation, a target three-dimensional content can be generated that matches what the target text description requires in shape, visual characteristics and other aspects. Compared with generating three-dimensional content from scratch, the deformation-based method can produce the required target three-dimensional content more quickly, saving computing resources and time.
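Taken together, S11-S15 amount to a retrieve-then-deform pipeline. The Python sketch below summarizes it; every function name is a hypothetical placeholder for the components detailed in the embodiments that follow, not an API defined by the patent.

```python
def generate_3d_content(target_text, content_database):
    # S12: multimodal retrieval of the closest existing 3D content
    first_content = retrieve_first_content(target_text, content_database)
    # S13: caption the retrieved content from multiple rendered views
    third_text = describe_content(first_content)
    # S14: descriptor-level difference between the two descriptions
    difference = text_description_difference(target_text, third_text)
    # S15: offset-driven deformation toward the target description
    return deform(first_content, difference)
```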
Based on the above embodiments:
In one embodiment, retrieving in the three-dimensional content database based on the target text description and the multimodal pre-training model to determine the first three-dimensional content comprises:
searching in the three-dimensional content database based on the target text description and the multimodal pre-training model, and determining a target category corresponding to the target text description;
a first three-dimensional content is determined from among the respective three-dimensional contents corresponding to the target category.
In this embodiment, determining the first three-dimensional content based on the target text description and the multimodal pre-training model includes two steps: a target category is determined and a first three-dimensional content is determined.
First, a target text description and a multimodal pre-training model are used to retrieve in a three-dimensional content database to determine a target category corresponding to the target text description. This means that the category that best matches the target text description is found by the multimodal pre-training model.
Then, after the target category is determined, one three-dimensional content is selected as the first three-dimensional content from among the three-dimensional contents corresponding to the target category. This may be achieved by looking up three-dimensional content related to the target category in the three-dimensional content database and selecting the most matching or most similar three-dimensional content as the first three-dimensional content, based on the output of the multimodal pre-training model and the characteristics related to the target category.
In this way, the target category is determined first, and the first three-dimensional content is then selected from the three-dimensional content related to that category. The process uses the classification capability of the multimodal pre-training model to match against the contents of the three-dimensional content database, ensuring that the selected first three-dimensional content is relevant to the target text description.
In one embodiment, searching in the three-dimensional content database based on the target text description and the multimodal pre-training model, and determining the target category corresponding to the target text description, includes:
acquiring a first descriptor of the target text description, and acquiring a second descriptor corresponding to each category name in the three-dimensional content database;
determining a second descriptor with the smallest cosine distance from the first descriptor as a first target descriptor according to the first descriptor and each second descriptor;
and determining the category corresponding to the first target descriptor as a target category.
This embodiment describes a specific implementation of determining the target category based on the target text description and the category descriptors in the three-dimensional content database. First, the first descriptor of the target text description is obtained: the target text description is used as input and processed by the multimodal pre-training model into a vector representation. At the same time, the second descriptor corresponding to each category name in the three-dimensional content database is obtained: each category name is used as input, and its vector representation, which can be generated by the multimodal pre-training model, is stored in the database in advance.
Then, the similarity between the first descriptor of the target text description and the second descriptor corresponding to each category name is calculated. Specifically, the cosine distance can be used as the similarity measure: by calculating the cosine distance between the first descriptor and each second descriptor, the second descriptor with the smallest cosine distance, i.e., the descriptor of the category name that best matches the target text description, can be found.
Finally, the category corresponding to that best-matching descriptor is determined as the target category. In this way, the target category can be determined quickly and accurately from the target text description, providing a basis for subsequent three-dimensional content generation and deformation.
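A minimal Python sketch of this category-matching step; the encoder that produces the descriptors is assumed to run upstream, and only the cosine-distance argmin mirrors the text.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; 0 when the two descriptors coincide."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_target_category(first_descriptor, category_descriptors):
    """category_descriptors: {category_name: second_descriptor}."""
    return min(category_descriptors,
               key=lambda name: cosine_distance(first_descriptor,
                                                category_descriptors[name]))
```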
As shown in fig. 2, the overall steps of the present application include: receiving the target text description input by the user; determining the target category closest to the target text description by category retrieval of three-dimensional content based on the multimodal pre-training model; determining the closest first three-dimensional content from the target category by three-dimensional content retrieval based on the multimodal pre-training model; and deforming the first three-dimensional content by three-dimensional content deformation based on the multimodal pre-training model to obtain the target three-dimensional content.
In one embodiment, obtaining the first descriptor of the target text description and obtaining the second descriptor corresponding to each category name in the three-dimensional content database includes:
obtaining a third descriptor corresponding to the target text description and a fourth descriptor corresponding to each category name through an image-text contrastive pre-training model;
obtaining a fifth descriptor corresponding to the target text description and a sixth descriptor corresponding to each category name through a pre-trained language model;
superposing the third descriptor and the fifth descriptor to obtain a first descriptor;
and superposing the fourth descriptor and the sixth descriptor to obtain a second descriptor.
In this embodiment, first, a third descriptor corresponding to the target text description and a fourth descriptor corresponding to each category name are obtained through an image-text contrastive pre-training model, e.g., the CLIP (Contrastive Language-Image Pre-training) model. An image-text contrastive pre-training model can compare and match text descriptions with images; by inputting the target text description into the model, a vector representation, namely the third descriptor, is obtained. At the same time, each category name in the three-dimensional content database is used as input to obtain the corresponding vector representation, namely the fourth descriptor. These vector representations reflect the semantic similarity between text descriptions and category names.
Next, a fifth descriptor corresponding to the target text description and a sixth descriptor corresponding to each category name are obtained through a pre-trained language model, such as BERT (Bidirectional Encoder Representations from Transformers). A pre-trained language model can understand and generate natural language; by inputting the target text description into the model, a vector representation, i.e., the fifth descriptor, is obtained. At the same time, each category name in the three-dimensional content database is used as input to obtain the corresponding vector representation, i.e., the sixth descriptor, which reflects the semantic association between the text description and the category name.
Finally, in an alternative embodiment, the third descriptor and the fifth descriptor are superimposed to obtain the first descriptor of the target text description, and the fourth descriptor and the sixth descriptor are superimposed to obtain the second descriptor corresponding to each category name in the three-dimensional content database. In this way, a vector representation of the target text description and of each category name is obtained, providing a basis for subsequent target category matching.
As shown in fig. 3, the target text description is input into the image-text contrastive pre-training model CLIP for text-image alignment and into the BERT language model for text understanding, obtaining a 256-dimensional alignment descriptor and a 768-dimensional understanding descriptor respectively. Each category name in the three-dimensional content dataset is likewise input into the CLIP model and the BERT model to obtain its 256-dimensional alignment descriptor and 768-dimensional understanding descriptor. Based on the 256-dimensional alignment descriptor of the target text description, the 256-dimensional alignment descriptors of the 3D content categories in the dataset are searched, and the cosine distance is used to measure the distance between the target text description and the alignment descriptor of each category; the cosine distance is smallest when the two descriptors are identical. Based on the 768-dimensional understanding descriptor of the target text description, the 768-dimensional understanding descriptors of the 3D content categories are likewise searched with the cosine distance measure. For each category name in the 3D content dataset, the two cosine distances are added to obtain the total distance between the user text description and that category, and the category name with the smallest total distance is selected as the target category conforming to the user text description. That is, in another alternative embodiment, the cosine distance between the third descriptor and each fourth descriptor and the cosine distance between the fifth descriptor and each sixth descriptor can be calculated and added to obtain a summed distance, and the category with the smallest summed cosine distance selected as the target category.
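The two-channel variant of fig. 3 can be sketched as follows, reusing cosine_distance from the sketch above; the 256/768 dimensions come from the text, while treating the encoders as precomputed arrays is our simplification.

```python
def pick_category_two_channel(query_clip, query_bert, categories):
    """categories: {name: (clip_vec_256d, bert_vec_768d)}; returns the
    category whose summed CLIP + BERT cosine distance is smallest."""
    def total_distance(name):
        clip_vec, bert_vec = categories[name]
        return (cosine_distance(query_clip, clip_vec)
                + cosine_distance(query_bert, bert_vec))
    return min(categories, key=total_distance)
```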
In one embodiment, determining a first three-dimensional content from among the respective three-dimensional content corresponding to the target category includes:
acquiring a seventh descriptor of each three-dimensional content in the target category;
determining a seventh descriptor with the smallest cosine distance from the first descriptor as a second target descriptor according to the first descriptor and the seventh descriptor;
and determining the three-dimensional content corresponding to the second target descriptor as the first three-dimensional content.
This embodiment describes a specific implementation of determining the first three-dimensional content from the three-dimensional contents corresponding to the target category. First, the seventh descriptor of each three-dimensional content in the target category is obtained. Then, according to the first descriptor of the target text description and the seventh descriptors, the seventh descriptor that best matches the target text description is determined as the second target descriptor. This process is similar to the target category matching in the above embodiment, except that the matched object changes from a category name to the descriptor of a three-dimensional content. Specifically, the cosine distance between the first descriptor and each seventh descriptor can be calculated, and the seventh descriptor with the smallest cosine distance, i.e., the second target descriptor, found. Finally, the three-dimensional content corresponding to the second target descriptor is determined as the first three-dimensional content. This descriptor represents the characteristics of the three-dimensional content that best matches the target text description, so its corresponding three-dimensional content is taken as the retrieval result.
In one embodiment, obtaining a seventh descriptor of each three-dimensional content in the target class includes:
rendering each three-dimensional content in the target category in a multi-view manner to obtain a first two-dimensional image with multiple view angles;
processing each first two-dimensional image based on a bootstrapping image-text pre-training model to obtain a corresponding first text description;
and acquiring first descriptors corresponding to the first text descriptions, and taking the first descriptors as seventh descriptors.
Because each 3D content lacks a corresponding text description, the closest 3D content cannot be retrieved directly from the user's text description. Therefore, as shown in fig. 4, this embodiment performs multi-view rendering of each three-dimensional content in the target category to obtain first two-dimensional images at multiple viewing angles; a rendering engine can be used to generate the corresponding two-dimensional images. Next, each first two-dimensional image is processed by the bootstrapping image-text pre-training model (BLIP, Bootstrapping Language-Image Pre-training) to obtain the corresponding first text description. BLIP is a multimodal pre-training model over images and text that aligns image and text features in order to understand and learn the semantic relations between them. Finally, the first descriptor corresponding to each first text description is acquired and taken as the seventh descriptor. This descriptor represents the characteristics of the three-dimensional content corresponding to the first text description and can therefore be used to describe and match three-dimensional content.
In order to perform multi-view rendering of different 3D contents, on the one hand, 3D content normalization preprocessing is required to handle 3D contents whose positions and scales in space are inconsistent; on the other hand, the position of the virtual camera performing the rendering needs to be set to ensure that all angles of the 3D content are covered. First, the 3D content is preprocessed, including coordinate system alignment and scaling. Coordinate system alignment unifies all points of the 3D content with the world coordinate system: the center of the current object is calculated, and the coordinates of each point relative to that center are taken as the new coordinates. Scaling normalizes the model to a standard scale, i.e., the object is scaled into a cube with a side length of 1: the differences between the maximum and minimum values of the 3D content on the x, y and z axes are calculated, the largest of these extents is taken and its reciprocal is used as the scaling factor, and the coordinates of each point of the 3D model are multiplied by this factor to complete the scaling. Second, the pose of the virtual camera is set, including position and orientation. This embodiment places a large number of virtual cameras surrounding the upper half of the 3D content; the cameras are positioned differently, and the orientation of each camera is fixed to point from the camera position to the center of the object. The position $(p_{camera\_x}, p_{camera\_y}, p_{camera\_z})$ of the virtual camera is set in a spherical coordinate system according to the formula:
$p_{camera\_x} = r \sin\theta \cos\varphi, \quad p_{camera\_y} = r \sin\theta \sin\varphi, \quad p_{camera\_z} = r \cos\theta$
where $r$ is the sphere radius, $\theta$ is the polar angle in the vertical direction and $\varphi$ is the azimuth angle in the horizontal direction. If the radius $r$ takes $n_r$ values, $\theta$ takes $n_\theta$ values and $\varphi$ takes $n_\varphi$ values, rendering finally yields $n_r \times n_\theta \times n_\varphi$ images at the corresponding viewing angles. For example, the radius $r$ can be fixed to the three values 3, 2.5 and 2, the polar angle to two values, and the azimuth angle to eight values from 0 degrees in steps of 45 degrees up to 360 degrees, giving $3 \times 2 \times 8 = 48$ virtual camera positions and 48 rendered images in total. Likewise, for each 3D content in the 3D content dataset, its rendered 2D images are obtained by the same steps.
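The 48-pose enumeration can be reproduced with the sketch below, assuming the standard spherical-to-Cartesian conversion; the exact polar-angle values are not preserved in the published text, so the two values used here are placeholders.

```python
import numpy as np

def camera_positions(radii=(3.0, 2.5, 2.0),
                     polar_angles_deg=(45.0, 60.0),  # placeholders: true values not preserved
                     azimuth_step_deg=45.0):
    """Enumerate virtual camera positions on spheres around the object center."""
    positions = []
    for r in radii:
        for theta in np.deg2rad(polar_angles_deg):
            for phi in np.deg2rad(np.arange(0.0, 360.0, azimuth_step_deg)):
                positions.append((r * np.sin(theta) * np.cos(phi),
                                  r * np.sin(theta) * np.sin(phi),
                                  r * np.cos(theta)))
    return positions

assert len(camera_positions()) == 48  # 3 radii x 2 polar angles x 8 azimuths
```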
For text description generation from the rendered images, this embodiment obtains a corresponding text description for each input image by means of a multimodal pre-training model, such as BLIP (Bootstrapping Language-Image Pre-training), and thus obtains, for each 3D content, a text description of each of its rendered images.
In summary, this embodiment achieves accurate description and matching of three-dimensional content by converting the multi-view images of the three-dimensional content into text descriptions and performing feature extraction and matching with the bootstrapping image-text pre-training model.
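One way to caption each rendered view is the publicly released BLIP checkpoint on Hugging Face; the checkpoint name and decoding settings below are our choices, not the patent's.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image_path: str) -> str:
    """Return a text description for one rendered view."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output[0], skip_special_tokens=True)
```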
In one embodiment, obtaining the first descriptor corresponding to each first text description, and taking the first descriptor as the seventh descriptor includes:
acquiring a first text descriptor corresponding to the first text description, and acquiring first image descriptors corresponding to each first two-dimensional image;
and superposing the first text descriptor and the first image descriptor to obtain a first mixed descriptor serving as a seventh descriptor.
As shown in fig. 5, since information is lost in the process of obtaining a text description from an image, this embodiment proposes a retrieval strategy that jointly uses the text modality and the image modality in order to make full use of the image information, improving the accuracy of 3D content retrieval. In this embodiment, the first text descriptor corresponding to the first text description and the first image descriptor corresponding to the first two-dimensional image are superimposed to obtain the first mixed descriptor. The superposition can be implemented by vector concatenation, weighted addition and the like, yielding a comprehensive mixed descriptor. Finally, the first mixed descriptor is taken as the seventh descriptor, which can be used for further description and matching of three-dimensional content.
In summary, this embodiment obtains the seventh descriptor by acquiring the first text descriptor and the first image descriptor and superimposing them into the first mixed descriptor. This descriptor can be used to describe and match three-dimensional content, improving the accuracy and diversity of three-dimensional content generation.
In one embodiment, before determining, according to the first descriptor and the seventh descriptor, that the seventh descriptor having the smallest cosine distance from the first descriptor is the second target descriptor, the method further includes:
obtaining a second two-dimensional image corresponding to the target text description based on the target text description;
acquiring a second image descriptor corresponding to a second two-dimensional image;
superposing the first descriptor and the second image descriptor to obtain a second mixed descriptor;
determining, from the first descriptor and the seventh descriptor, that a seventh descriptor having a smallest cosine distance from the first descriptor is a second target descriptor, including:
and determining the first mixed descriptor with the smallest cosine distance from the second mixed descriptor as the second target descriptor according to the second mixed descriptor and each first mixed descriptor.
As shown in fig. 6, in this process the target text description is to be matched against the first mixed descriptor of each three-dimensional content, so a second two-dimensional image corresponding to the target text description is first generated from the target text description, and the descriptor of that image is extracted. Next, the first descriptor and the second image descriptor are superimposed to form the second mixed descriptor. Finally, the cosine distance between the second mixed descriptor and each first mixed descriptor is calculated, and the first mixed descriptor with the smallest cosine distance is determined as the second target descriptor.
Implementation of this series of steps may help generate a second target descriptor corresponding to the target text description, thereby enabling more accurate three-dimensional content generation.
In one embodiment, further comprising:
constructing a mixed descriptor extraction network model, and optimizing the mixed descriptor extraction network model by using a contrastive loss function;
judging a mixed descriptor extraction network model for which the output value of the contrastive loss function is smaller than the first threshold to be one meeting the first iteration end condition, and taking the mixed descriptor extraction network model meeting the first iteration end condition as the final mixed descriptor extraction network model;
acquiring a first text descriptor corresponding to the first text description, and acquiring first image descriptors corresponding to each first two-dimensional image; superposing the first text descriptor and the first image descriptor to obtain a first mixed descriptor serving as a seventh descriptor, wherein the method comprises the following steps:
inputting the first text description and each first two-dimensional image into a final mixed descriptor extraction network model to obtain a first mixed descriptor;
acquiring a second image descriptor corresponding to the second two-dimensional image, and superposing the first descriptor and the second image descriptor to obtain a second mixed descriptor, wherein the method comprises the following steps:
and inputting the target text description and the second two-dimensional image into the final mixed descriptor extraction network model to obtain the second mixed descriptor.
To implement the extraction of mixed descriptors in the above embodiments, a mixed descriptor extraction network model needs to be constructed first. The network model should contain an appropriate structure and parameters to efficiently extract a mixed descriptor from the input text description and image. To improve the quality and accuracy of the mixed descriptors, the mixed descriptor extraction network model can be optimized with a contrast loss function, which can be calculated based on the similarity between mixed descriptors and used as the training objective of the network model.
As shown in fig. 5, for descriptor extraction from the text description of the rendered image, this embodiment first extracts a 768-dimensional text descriptor using a Bidirectional Encoder Representations from Transformers model focused on text understanding, that is, a BERT model; secondly, a transformed 768-dimensional text descriptor is obtained through one multi-layer perceptron (MLP) layer and concatenated with the original 768-dimensional text descriptor to form a 1536-dimensional text descriptor; finally, a 768-dimensional text characterization descriptor is obtained through one further MLP layer.
For descriptor extraction from the rendered two-dimensional image, this embodiment first uses an image-text contrastive pre-training model focused on image-text alignment, namely a CLIP model, to extract a 256-dimensional image descriptor; secondly, a transformed 256-dimensional image descriptor is obtained through one MLP layer and concatenated with the original 256-dimensional image descriptor to form a 512-dimensional image descriptor; finally, a 256-dimensional image characterization descriptor is obtained through one further MLP layer.
For mixed descriptor extraction, this embodiment first combines the 768-dimensional text characterization descriptor of the text channel with the 256-dimensional image characterization descriptor of the image channel to form a 1024-dimensional joint descriptor; then one MLP layer produces the final 512-dimensional mixed descriptor.
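A minimal PyTorch sketch of this architecture; the BERT and CLIP encoders are stubbed out as assumptions (their descriptors are taken as inputs), and only the stated layer dimensions follow the embodiment:

```python
import torch
import torch.nn as nn

class MixedDescriptorNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_mlp1 = nn.Linear(768, 768)    # transforms the 768-d BERT descriptor
        self.text_mlp2 = nn.Linear(1536, 768)   # 768 original + 768 transformed -> 768
        self.img_mlp1 = nn.Linear(256, 256)     # transforms the 256-d CLIP descriptor
        self.img_mlp2 = nn.Linear(512, 256)     # 256 original + 256 transformed -> 256
        self.fuse_mlp = nn.Linear(1024, 512)    # 768 text + 256 image -> 512-d mixed

    def forward(self, bert_desc: torch.Tensor, clip_desc: torch.Tensor) -> torch.Tensor:
        t = torch.cat([bert_desc, torch.relu(self.text_mlp1(bert_desc))], dim=-1)
        t = torch.relu(self.text_mlp2(t))               # text characterization descriptor
        i = torch.cat([clip_desc, torch.relu(self.img_mlp1(clip_desc))], dim=-1)
        i = torch.relu(self.img_mlp2(i))                # image characterization descriptor
        return self.fuse_mlp(torch.cat([t, i], dim=-1)) # 512-d mixed descriptor
```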
In the training process, whether the mixed descriptor extraction network model meets the first iteration end condition can be judged by setting a threshold. If the output value of the contrast loss function is smaller than the set threshold, the network model is judged to meet the first iteration end condition. Once a mixed descriptor extraction network model meets this condition, it is taken as the final mixed descriptor extraction network model, meaning the model is considered to have high performance and accuracy and can be used in the subsequent target three-dimensional content generation process.
In one embodiment, optimizing the hybrid descriptor extraction network model using a contrast loss function includes:
inputting a third two-dimensional image rendered from the preset three-dimensional content under a preset viewing angle and the corresponding second text description into the mixed descriptor extraction network model, and calculating the output value of the contrast loss function through the mixed descriptor extraction network model;
when the output value of the contrast loss function is larger than a second threshold value, the first negative sample and a preset positive sample are used for optimizing the mixed descriptor extraction network model, and the second threshold value is larger than the first threshold value;
when the output value of the contrast loss function is not greater than a second threshold value, a second negative sample and a preset positive sample are used for optimizing the mixed descriptor extraction network model;
the positive sample is a two-dimensional image and a corresponding text description which are obtained by rendering the preset three-dimensional content at other visual angles except the preset visual angle, the negative sample is a two-dimensional image and a corresponding text description which are obtained by rendering the other three-dimensional content except the preset three-dimensional content at any visual angle under the same category as the preset three-dimensional content, and the cosine distance between the two-dimensional image and the text description corresponding to the first negative sample and the third two-dimensional image and the corresponding second text description is larger than the cosine distance between the two-dimensional image and the text description corresponding to the second negative sample and the third two-dimensional image and the corresponding second text description.
To accelerate training and ensure faster convergence, this embodiment provides an easy-first, hard-later training strategy: simple negative samples are selected first, and difficult negative samples later. In the early stage of training the network cannot yet extract sufficiently discriminative descriptors, and supplying confusable negative samples at that point would greatly interfere with training; supplying clearly different negative samples instead makes it easier for the network to distinguish positive from negative samples. After training for a period of time, the network has acquired a certain ability to extract salient descriptors; supplying a number of easily confused negative samples at that stage further strengthens the network, so that it learns to extract clearly different descriptor vectors for the difficult negative samples.
As shown in fig. 7, in the optimization process of the contrast loss function, a third two-dimensional image rendered from the preset three-dimensional content under the preset viewing angle and the corresponding second text description are input into the mixed descriptor extraction network model, and the output value of the contrast loss function is calculated through the mixed descriptor extraction network model. When the output value of the contrast loss function is larger than the set second threshold, that is, in the initial stage of training, the first negative sample and the preset positive sample are used to optimize the mixed descriptor extraction network model. When the output value of the contrast loss function is not greater than the second threshold, the second negative sample and the preset positive sample are used to optimize the mixed descriptor extraction network model.
The positive sample is a two-dimensional image and a corresponding text description which are obtained by rendering the preset three-dimensional content at other visual angles except the preset visual angle, the negative sample is a two-dimensional image and a corresponding text description which are obtained by rendering the other three-dimensional content except the preset three-dimensional content at any visual angle under the same category as the preset three-dimensional content, and the cosine distance between the two-dimensional image and the text description corresponding to the first negative sample and the third two-dimensional image and the corresponding second text description is larger than the cosine distance between the two-dimensional image and the text description corresponding to the second negative sample and the third two-dimensional image and the corresponding second text description. The samples are chosen and used to introduce varying degrees of difficulty into the optimization process to help the network learn better about the variability between samples.
In this way, the mixed descriptor extraction network model can be guided to learn more accurately, so that the accuracy and quality of extracting the mixed descriptor are improved.
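A hedged sketch of this threshold-switched curriculum; the InfoNCE-style loss form is an assumption, and only the threshold-based switching follows the embodiment:

```python
import torch
import torch.nn.functional as F

def contrast_loss(anchor, positive, negatives, tau: float = 0.07):
    """InfoNCE-style contrastive loss over one anchor, one positive and K negatives."""
    pos = F.cosine_similarity(anchor, positive, dim=-1) / tau
    neg = torch.stack([F.cosine_similarity(anchor, n, dim=-1) for n in negatives]) / tau
    return -pos + torch.logsumexp(torch.cat([pos.unsqueeze(0), neg]), dim=0)

def pick_negatives(loss_value: float, second_threshold: float, easy_negs, hard_negs):
    # early stage (loss above the second threshold): clearly different negatives;
    # later stage: easily confused negatives to further strengthen the network
    return easy_negs if loss_value > second_threshold else hard_negs
```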
In one embodiment, further comprising:
extracting a third mixed descriptor corresponding to the two-dimensional image and the text description of each negative sample;
extracting a fourth mixed descriptor corresponding to the third two-dimensional image and the second text description;
calculating a cosine distance to be compared between each third mixed descriptor and the fourth mixed descriptor;
taking a negative sample with the cosine distance to be compared being greater than a third threshold value as a first negative sample;
and taking the negative sample with the cosine distance to be compared not greater than a third threshold value as a second negative sample.
This embodiment describes a specific way of distinguishing simple negative samples (first negative samples) from difficult negative samples (second negative samples). Specifically, a third mixed descriptor is first extracted from the two-dimensional image and text description of each negative sample; a fourth mixed descriptor is extracted from the third two-dimensional image and the second text description; and the cosine distance to be compared between each third mixed descriptor and the fourth mixed descriptor is calculated. A negative sample whose cosine distance to be compared is greater than the third threshold is taken as a first negative sample, and a negative sample whose cosine distance to be compared is not greater than the third threshold is taken as a second negative sample.
Through the training strategy, the mixed descriptor extraction network model can be optimized according to the similarity between samples, so that the network can better distinguish positive and negative samples and generate descriptor vectors conforming to the description information.
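The split can be sketched as follows (helper names are assumptions):

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def split_negatives(fourth_desc: np.ndarray, third_descs: list, third_threshold: float):
    """Partition negatives into simple (far, first) and difficult (close, second)."""
    first_negatives, second_negatives = [], []
    for idx, d in enumerate(third_descs):
        if cosine_distance(fourth_desc, d) > third_threshold:
            first_negatives.append(idx)   # far away -> easy to distinguish
        else:
            second_negatives.append(idx)  # close -> easily confused, hard
    return first_negatives, second_negatives
```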
In one embodiment, obtaining a third text description corresponding to the first three-dimensional content includes:
rendering the first three-dimensional content at multiple angles to obtain a fourth two-dimensional image at multiple angles;
processing each fourth two-dimensional image based on the bootstrapping image-text pre-training model to obtain a corresponding fourth text description;
and integrating the fourth text descriptions to obtain a third text description.
As shown in fig. 8, specifically, for each 3D content the present embodiment can obtain a fourth text description corresponding to the fourth two-dimensional image rendered at each angle, while the 3D content itself lacks an overall, refined text description. Therefore, this embodiment proposes a text description fusion method based on a pre-trained language model: by setting an appropriate context instruction, the fourth text descriptions of the rendered images of the same 3D content under all viewing angles are input into a large pre-trained language model, obtaining an overall text description (i.e. the third text description) of the 3D content in which errors and duplicates are removed and all detail descriptions are retained.
In one embodiment, determining a text description difference between the target text description and the third text description includes: acquiring a first descriptor corresponding to the target text description; acquiring an eighth descriptor corresponding to the third text description; obtaining a difference descriptor between the first descriptor and the eighth descriptor based on the first descriptor and the eighth descriptor; driving the first three-dimensional content to deform based on the text description difference to obtain target three-dimensional content, including: and driving the first three-dimensional content to deform based on the difference descriptor to obtain the target three-dimensional content.
For example, one context instruction set by the present embodiment for the large pre-trained language model is as follows: "Given a 3D object, different text descriptions are obtained from different angles, which are { fourth text description corresponding to view 1 }, { fourth text description corresponding to view 2 }, …, { fourth text description corresponding to view n }. Please combine the above results to obtain a unified, complete text description of the 3D object; remove erroneous descriptions, remove duplicate descriptions, and preserve all details." Here { fourth text description corresponding to view 1 } is the text description, obtained by the foregoing steps, of the rendered image of the 3D content under view 1. The above operation is performed for each 3D content in the 3D content dataset, yielding an overall text description of each 3D content.
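A small sketch of assembling that context instruction; `query_llm` is a hypothetical stand-in for whatever large-model interface is actually used:

```python
def build_fusion_prompt(view_captions: list) -> str:
    """Assemble the context instruction from the per-view fourth text descriptions."""
    listed = ", ".join(f"{{view {i + 1}: {c}}}" for i, c in enumerate(view_captions))
    return (
        "Given a 3D object, different text descriptions are obtained from "
        f"different angles, which are {listed}. Please combine the above results "
        "to obtain a unified, complete text description of the 3D object; remove "
        "erroneous descriptions, remove duplicate descriptions, and preserve all details."
    )

# third_text_description = query_llm(build_fusion_prompt(fourth_text_descriptions))
```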
In one embodiment, driving the first three-dimensional content to deform based on the difference descriptor to obtain the target three-dimensional content includes:
acquiring first point cloud data of first three-dimensional content;
extracting global descriptors and local descriptors of the first three-dimensional content according to the first point cloud data;
determining an offset according to the difference descriptor, the global descriptor and the local descriptor, wherein the offset comprises a color offset and a position offset;
Determining second point cloud data of the target three-dimensional content according to the first point cloud data of the first three-dimensional content and the offset;
and obtaining the target three-dimensional content according to the second point cloud data.
In order to drive the first three-dimensional content to deform into the target three-dimensional content, a deformation network driven by the text description difference is provided: a difference descriptor is extracted from the text description of the first three-dimensional content and the text description of the target three-dimensional content and is input, as a condition, into a 3D content deformation prediction network to predict the offset of the first three-dimensional content, thereby obtaining the deformed 3D content. Specifically, the offset, which comprises a color offset and a position offset, is determined according to the difference descriptor and the global and local descriptors of the first three-dimensional content.
This text-description-difference-driven approach can effectively predict the difference between two 3D contents; compared with generating 3D content directly from a text description, it greatly reduces the difficulty of the problem and can improve the quality of 3D generation.
In one embodiment, further comprising:
pre-constructing a deformed network model, wherein the deformed network model comprises a text description difference descriptor extraction structure, a three-dimensional content descriptor extraction structure and a three-dimensional content offset prediction structure;
Optimizing the deformed network model, and taking the deformed network model meeting the second preset iteration ending condition as a final deformed network model;
determining a text description difference between the target text description and the third text description, comprising:
inputting the target text description and the third text description into a text description difference descriptor extraction structure in the final deformed network model to obtain a difference descriptor;
extracting global descriptors and local descriptors of the first three-dimensional content according to the first point cloud data, including:
inputting the first point cloud data into a three-dimensional content descriptor extraction structure in a final deformation network model to obtain a global descriptor and a local descriptor of the first three-dimensional content;
determining an offset from the difference descriptor, the global descriptor, and the local descriptor, comprising:
the offset is determined from the difference descriptor, the global descriptor and the local descriptor, using the offset prediction structure of the three-dimensional content in the final deformed network model.
As shown in fig. 9, the deformed network structure includes three parts, which are a text description difference descriptor extraction structure, a three-dimensional content descriptor extraction structure, and an offset prediction structure of three-dimensional content, respectively.
For the text description difference descriptor extraction structure, a BERT model is used to extract the 768-dimensional corresponding third descriptor and first descriptor from the text description of the first three-dimensional content and the text description of the target three-dimensional content respectively; the two are then combined into a 1536-dimensional joint descriptor, and a 768-dimensional difference descriptor is obtained through one MLP layer.
For the three-dimensional content descriptor extraction structure, let the first three-dimensional content be an n×6-dimensional colored three-dimensional point cloud; four MLP layers gradually increase the feature dimension to n×2048, which is pooled to 1×2048. This 2048-dimensional vector can be regarded as an overall abstraction of the n point features and represents the global descriptor of the first three-dimensional content; meanwhile, the intermediate n×1024-dimensional descriptor serves as a local abstraction of the n point features and represents the local descriptors of the first three-dimensional content;
for the offset prediction structure of the three-dimensional content, the global descriptor is first copied n times and the difference descriptor is copied n times, and they are connected with the local descriptors to obtain an n×(2048+768+1024) = n×3840-dimensional mixed descriptor; next, the final n×6-dimensional position and color offset is obtained through four MLP layers; finally, the offset is added to the first three-dimensional content to obtain the final deformed actual 3D content, as shown in fig. 10.
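A PyTorch sketch of the difference descriptor extraction and offset prediction structures; the intermediate MLP widths and the use of ReLU are assumptions, while the stated endpoint dimensions follow the embodiment:

```python
import torch
import torch.nn as nn

class DiffDescriptorExtractor(nn.Module):
    """Concatenates the two 768-d BERT text descriptors into 1536 dimensions and
    maps them to a 768-d difference descriptor with one MLP layer."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Linear(1536, 768)

    def forward(self, src_desc: torch.Tensor, tgt_desc: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.mlp(torch.cat([src_desc, tgt_desc], dim=-1)))

class OffsetPredictor(nn.Module):
    """Copies the 2048-d global and 768-d difference descriptors n times, joins
    them with the n x 1024 local descriptors (n x 3840 in total), and regresses
    an n x 6 position-and-color offset through four MLP layers."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(               # intermediate widths are assumptions
            nn.Linear(3840, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 6),
        )

    def forward(self, points, local_desc, global_desc, diff_desc):
        n = points.shape[0]                     # points: n x 6 colored point cloud
        mixed = torch.cat([global_desc.expand(n, -1),
                           diff_desc.expand(n, -1),
                           local_desc], dim=-1) # n x 3840 mixed descriptor
        return points + self.mlp(mixed)         # deformed n x 6 point cloud
```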
In one embodiment, optimizing the deformed network model, taking the deformed network model meeting the second preset iteration end condition as a final deformed network model includes: acquiring actual three-dimensional content obtained by deforming the first three-dimensional content based on the offset; calculating an output value of a total loss function between the actual three-dimensional content and the target three-dimensional content; when the output value of the total loss function is smaller than a fourth threshold value, judging that a second preset iteration ending condition is met; and taking the deformation network model with the output value of the total loss function smaller than the fourth threshold value as a final deformation network model.
In one embodiment, calculating the output value of the total loss function between the actual three-dimensional content and the target three-dimensional content includes:
calculating the output value of the total loss function between the actual three-dimensional content and the target three-dimensional content by using a preset formula, wherein the preset formula is as follows:
$L = \lambda_1 L_{dist} + \lambda_2 L_{perc} + \lambda_3 L_{align}$;
$L_{dist} = \sum_{p \in G_1} \min_{q \in G} \lVert p - q \rVert_2^2 + \sum_{q \in G} \min_{p \in G_1} \lVert q - p \rVert_2^2$;
$L_{perc} = \lVert \phi(G_1) - \phi(G) \rVert_2^2$;
$L_{align} = 1 - \cos(\psi_1, \psi_2)$;
wherein $L$ is the total loss function; $\lambda_1$, $\lambda_2$ and $\lambda_3$ respectively represent different weights; $L_{dist}$ is the distance loss function; $L_{perc}$ is the point cloud perceptual loss function; $L_{align}$ is the alignment loss function; $G_1$ is the first set of sampling points in the actual three-dimensional content and $p$ is a sampling point in the first set; $G$ is the second set of sampling points in the target three-dimensional content and $q$ is a sampling point in the second set; $\phi(G_1)$ and $\phi(G)$ are the descriptors corresponding to the point cloud data of the actual and target three-dimensional content respectively; and $\psi_1$ and $\psi_2$ are the alignment descriptors of the actual and target three-dimensional content respectively.
By minimizing the distance loss function, the predicted offset is driven very close to the true offset; by minimizing the perceptual loss function, the deformed source 3D content and the target 3D content are kept close in the descriptor space, which, compared with an absolute point-distance loss, provides semantic-level supervision; and by minimizing the alignment loss function, the CLIP descriptor of the image rendered at any viewing angle is kept close to that of the target text description, supervising the deformed actual three-dimensional content to conform to the user's target text description.
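A hedged sketch of this three-term objective, a minimal reconstruction under the formula assumptions stated above; the weights and the descriptor inputs are placeholders:

```python
import torch
import torch.nn.functional as F

def chamfer(G1: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Chamfer-style distance loss between the two sampling sets (n x 3, m x 3)."""
    d = torch.cdist(G1, G)                            # pairwise distances, (n, m)
    return (d.min(dim=1).values ** 2).mean() + (d.min(dim=0).values ** 2).mean()

def total_loss(G1, G, desc_actual, desc_target, align_actual, align_target,
               lam1: float = 1.0, lam2: float = 0.1, lam3: float = 0.1):
    l_dist = chamfer(G1, G)                           # point-level distance loss
    l_perc = F.mse_loss(desc_actual, desc_target)     # point cloud perceptual loss
    l_align = 1.0 - F.cosine_similarity(align_actual, align_target, dim=-1).mean()
    return lam1 * l_dist + lam2 * l_perc + lam3 * l_align
```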
In order to solve the above technical problems, as shown in fig. 11, the present application further provides a three-dimensional content generating system based on a multi-mode pre-training model, including:
an acquisition unit 1 for acquiring a target text description input by a user;
the retrieval unit 2 is used for retrieving in the three-dimensional content database based on the target text description and the multi-mode pre-training model to determine a first three-dimensional content;
A text obtaining unit 3, configured to obtain a third text description corresponding to the first three-dimensional content;
a difference determining unit 4 for determining a text description difference between the target text description and the third text description;
and the driving deformation unit 5 is used for driving the first three-dimensional content to deform based on the text description difference to obtain the target three-dimensional content.
In one embodiment, the retrieval unit comprises:
the category retrieval unit is used for retrieving in the three-dimensional content database based on the target text description and the multi-mode pre-training model, and determining a target category corresponding to the target text description;
and a content retrieval unit for determining a first three-dimensional content from among the respective three-dimensional contents corresponding to the target category.
In one embodiment, the category retrieval unit includes:
the first extraction unit is used for acquiring a first descriptor of the target text description and acquiring a second descriptor corresponding to each category name in the three-dimensional content data;
the first comparing unit is used for determining the second descriptor with the smallest cosine distance from the first descriptor as the first target descriptor according to the first descriptor and each second descriptor;
and the first determining unit is used for determining the category corresponding to the first target descriptor as a target category.
In one embodiment, the first extraction unit comprises:
the first sub-extraction unit is used for obtaining a third descriptor corresponding to the target text description and a fourth descriptor corresponding to each category name through an image-text contrastive pre-training model;
the second sub-extraction unit is used for acquiring a fifth descriptor corresponding to the target text description and a sixth descriptor corresponding to each category name through the pre-training language model;
the first superposition unit is used for superposing the third descriptor and the fifth descriptor to obtain a first descriptor;
and the second superposition unit is used for superposing the fourth descriptor and the sixth descriptor to obtain a second descriptor.
In one embodiment, a content retrieval unit includes:
a third extraction unit, configured to obtain a seventh descriptor of each three-dimensional content in the target class;
the second comparing unit is used for determining a seventh descriptor with the smallest cosine distance from the first descriptor as a second target descriptor according to the first descriptor and the seventh descriptor;
and the second determining unit is used for determining the three-dimensional content corresponding to the second target descriptor as the first three-dimensional content.
In one embodiment, the third extraction unit comprises:
The first rendering unit is used for rendering each three-dimensional content in the target category at multiple viewing angles to obtain first two-dimensional images at multiple viewing angles;
the first image processing unit is used for processing each first two-dimensional image based on the bootstrapping image-text pre-training model to obtain corresponding first text description;
and the first descriptor extraction unit is used for acquiring first descriptors corresponding to the first text descriptions and taking the first descriptors as seventh descriptors.
In one embodiment, the first descriptor extraction unit is specifically configured to obtain a first text descriptor corresponding to the first text description, and obtain first image descriptors corresponding to each of the first two-dimensional images; and superposing the first text descriptor and the first image descriptor to obtain a first mixed descriptor serving as a seventh descriptor.
In one embodiment, further comprising:
a second image processing unit, configured to obtain a second two-dimensional image corresponding to the target text description based on the target text description;
a fourth extraction unit, configured to obtain a second image descriptor corresponding to a second two-dimensional image;
the third superposition unit is used for superposing the first descriptor and the second image descriptor to obtain a second mixed descriptor;
The second comparing unit is specifically configured to determine, according to the second mixed descriptor and each of the first mixed descriptors, the first mixed descriptor having the smallest cosine distance from the second mixed descriptor as the second target descriptor.
In one embodiment, further comprising:
the first model construction unit is used for constructing a mixed descriptor extraction network model and optimizing the mixed descriptor extraction network model by using a contrast loss function;
the first model optimizing unit is used for judging the mixed descriptor extraction network model with the output value of the contrast loss function smaller than a first threshold value as a mixed descriptor extraction network model meeting a first iteration ending condition, and taking the mixed descriptor extraction network model meeting the first iteration ending condition as a final mixed descriptor extraction network model;
acquiring a first text descriptor corresponding to the first text description, and acquiring first image descriptors corresponding to each first two-dimensional image; superposing the first text descriptor and the first image descriptor to obtain a first mixed descriptor serving as a seventh descriptor, wherein the method comprises the following steps:
inputting the first text description and each first two-dimensional image into a final mixed descriptor extraction network model to obtain a first mixed descriptor;
Acquiring a second image descriptor corresponding to the second two-dimensional image, and superposing the first descriptor and the second image descriptor to obtain a second mixed descriptor, wherein the method comprises the following steps:
and inputting the target text description and the second two-dimensional image into the final mixed descriptor extraction network model to obtain the second mixed descriptor.
In one embodiment, the first model optimizing unit is specifically configured to input a third two-dimensional image obtained by rendering the preset three-dimensional content under a preset viewing angle and the corresponding second text description into the mixed descriptor extraction network model, and calculate the output value of the contrast loss function through the mixed descriptor extraction network model; when the output value of the contrast loss function is larger than a second threshold value, a first negative sample and a preset positive sample are used for optimizing the mixed descriptor extraction network model, the second threshold value being larger than the first threshold value; and when the output value of the contrast loss function is not greater than the second threshold value, a second negative sample and the preset positive sample are used for optimizing the mixed descriptor extraction network model;
the positive sample is a two-dimensional image and a corresponding text description which are obtained by rendering the preset three-dimensional content at other visual angles except the preset visual angle, the negative sample is a two-dimensional image and a corresponding text description which are obtained by rendering the other three-dimensional content except the preset three-dimensional content at any visual angle under the same category as the preset three-dimensional content, and the cosine distance between the two-dimensional image and the text description corresponding to the first negative sample and the third two-dimensional image and the corresponding second text description is larger than the cosine distance between the two-dimensional image and the text description corresponding to the second negative sample and the third two-dimensional image and the corresponding second text description.
In one embodiment, further comprising:
the negative sample dividing unit is used for extracting a third mixed descriptor corresponding to the two-dimensional image and text description of each negative sample; extracting a fourth mixed descriptor corresponding to the third two-dimensional image and the second text description; calculating the cosine distance to be compared between each third mixed descriptor and the fourth mixed descriptor; taking a negative sample whose cosine distance to be compared is greater than a third threshold value as a first negative sample; and taking a negative sample whose cosine distance to be compared is not greater than the third threshold value as a second negative sample.
In one embodiment, a text acquisition unit includes:
the second rendering unit is used for rendering the first three-dimensional content from multiple views to obtain a fourth two-dimensional image under multiple views;
the third image processing unit is used for processing each fourth two-dimensional image based on the bootstrapping image-text pre-training model to obtain a corresponding fourth text description;
and the text integrating unit is used for integrating the fourth text descriptions to obtain a third text description.
In one embodiment, the difference determining unit is specifically configured to obtain a first descriptor corresponding to the target text description; acquiring an eighth descriptor corresponding to the third text description; obtaining a difference descriptor between the first descriptor and the eighth descriptor based on the first descriptor and the eighth descriptor;
The driving deformation unit is specifically used for driving the first three-dimensional content to deform based on the difference descriptor to obtain the target three-dimensional content.
In one embodiment, the driving deformation unit is specifically configured to acquire first point cloud data of the first three-dimensional content; extract global descriptors and local descriptors of the first three-dimensional content according to the first point cloud data; determine an offset according to the difference descriptor, the global descriptor and the local descriptor, wherein the offset comprises a color offset and a position offset; determine second point cloud data of the target three-dimensional content according to the first point cloud data of the first three-dimensional content and the offset; and obtain the target three-dimensional content according to the second point cloud data.
In one embodiment, further comprising:
the second model construction unit is used for constructing a deformed network model in advance, wherein the deformed network model comprises a text description difference descriptor extraction structure, a three-dimensional content descriptor extraction structure and a three-dimensional content offset prediction structure;
the second model optimizing unit is used for optimizing the deformed network model and taking the deformed network model meeting the second preset iteration ending condition as a final deformed network model;
the difference determining unit is specifically configured to input the target text description and the third text description into a text description difference descriptor extraction structure in the final deformed network model to obtain a difference descriptor;
Extracting global descriptors and local descriptors of the first three-dimensional content according to the first point cloud data, including:
inputting the first point cloud data into a three-dimensional content descriptor extraction structure in a final deformation network model to obtain a global descriptor and a local descriptor of the first three-dimensional content;
determining an offset from the difference descriptor, the global descriptor, and the local descriptor, comprising:
the offset is determined from the difference descriptor, the global descriptor and the local descriptor, using the offset prediction structure of the three-dimensional content in the final deformed network model.
In one embodiment, the second model optimization unit is specifically configured to obtain an actual three-dimensional content obtained by deforming the first three-dimensional content based on the offset; calculating an output value of a total loss function between the actual three-dimensional content and the target three-dimensional content; when the output value of the total loss function is smaller than a fourth threshold value, judging that a second preset iteration ending condition is met; and taking the deformation network model with the output value of the total loss function smaller than the fourth threshold value as a final deformation network model.
In one embodiment, calculating the output value of the total loss function between the actual three-dimensional content and the target three-dimensional content includes: calculating the output value of the total loss function between the actual three-dimensional content and the target three-dimensional content by using a preset formula, wherein the preset formula is as follows:
$L = \lambda_1 L_{dist} + \lambda_2 L_{perc} + \lambda_3 L_{align}$;
$L_{dist} = \sum_{p \in G_1} \min_{q \in G} \lVert p - q \rVert_2^2 + \sum_{q \in G} \min_{p \in G_1} \lVert q - p \rVert_2^2$;
$L_{perc} = \lVert \phi(G_1) - \phi(G) \rVert_2^2$;
$L_{align} = 1 - \cos(\psi_1, \psi_2)$;
wherein $L$ is the total loss function; $\lambda_1$, $\lambda_2$ and $\lambda_3$ respectively represent different weights; $L_{dist}$ is the distance loss function; $L_{perc}$ is the point cloud perceptual loss function; $L_{align}$ is the alignment loss function; $G_1$ is the first set of sampling points in the actual three-dimensional content and $p$ is a sampling point in the first set; $G$ is the second set of sampling points in the target three-dimensional content and $q$ is a sampling point in the second set; $\phi(G_1)$ and $\phi(G)$ are the descriptors corresponding to the point cloud data of the actual and target three-dimensional content respectively; and $\psi_1$ and $\psi_2$ are the alignment descriptors of the actual and target three-dimensional content respectively.
For the description of the three-dimensional content generating system based on the multi-mode pre-training model, refer to the above embodiment, and the description is omitted herein.
In order to solve the above technical problem, the present application further provides a three-dimensional content generating device based on a multi-mode pre-training model, including:
a memory for storing a computer program;
a processor for implementing the steps of the method for generating three-dimensional content based on a multi-modal pre-training model as described above when executing a computer program.
For the description of the three-dimensional content generating device based on the multi-mode pre-training model, refer to the above embodiment, and the description is omitted herein.
In order to solve the technical problem, the application further provides a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and the computer program realizes the steps of the three-dimensional content generation method based on the multi-mode pre-training model when being executed by a processor.
For the description of the computer-readable storage medium, refer to the above embodiments, and the description is omitted herein.
It should also be noted that in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (20)

1. The three-dimensional content generation method based on the multi-mode pre-training model is characterized by comprising the following steps of:
acquiring a target text description input by a user;
searching in a three-dimensional content database based on the target text description and the multi-mode pre-training model to determine first three-dimensional content;
acquiring a third text description corresponding to the first three-dimensional content;
determining a text description difference between the target text description and the third text description;
and driving the first three-dimensional content to deform based on the text description difference to obtain target three-dimensional content.
2. The method for generating three-dimensional content based on a multi-modal pre-training model as set forth in claim 1, wherein determining the first three-dimensional content based on the target text description and the retrieval of the multi-modal pre-training model in the three-dimensional content database comprises:
Searching in a three-dimensional content database based on the target text description and a multi-mode pre-training model, and determining a target category corresponding to the target text description;
the first three-dimensional content is determined from each three-dimensional content corresponding to the target category.
3. The method for generating three-dimensional content based on a multi-modal pre-training model as claimed in claim 2, wherein determining the target category corresponding to the target text description based on the target text description and searching in a three-dimensional content database of the multi-modal pre-training model comprises:
acquiring a first descriptor of the target text description and acquiring a second descriptor corresponding to each category name in the three-dimensional content data;
determining a second descriptor with the smallest cosine distance from the first descriptor as a first target descriptor according to the first descriptor and each second descriptor;
and determining the category corresponding to the first target descriptor as the target category.
4. The method for generating three-dimensional content based on a multi-modal pre-training model as claimed in claim 3, wherein obtaining the first descriptor of the target text description and obtaining the second descriptor corresponding to each category name in the three-dimensional content data comprises:
Obtaining a third descriptor corresponding to the target text description and a fourth descriptor corresponding to each category name through an image-text comparison pre-training model;
obtaining a fifth descriptor corresponding to the target text description and a sixth descriptor corresponding to each category name through a pre-training language model;
superposing the third descriptor and the fifth descriptor to obtain the first descriptor;
and superposing the fourth descriptor and the sixth descriptor to obtain the second descriptor.
5. A multi-modal pre-training model based three-dimensional content generation method as claimed in claim 3 wherein determining the first three-dimensional content from each three-dimensional content corresponding to the target class comprises:
acquiring a seventh descriptor of each three-dimensional content in the target category;
determining a seventh descriptor with the smallest cosine distance from the first descriptor as a second target descriptor according to the first descriptor and the seventh descriptor;
and determining the three-dimensional content corresponding to the second target descriptor as the first three-dimensional content.
6. The method for generating three-dimensional content based on a multi-modal pre-training model as claimed in claim 5, wherein obtaining the seventh descriptor of each three-dimensional content in the target class comprises:
Rendering the three-dimensional contents in the target category at multiple angles to obtain a first two-dimensional image at multiple angles;
processing each first two-dimensional image based on a bootstrapping image-text pre-training model to obtain a corresponding first text description;
and acquiring first descriptors corresponding to the first text descriptions, and taking the first descriptors as the seventh descriptors.
7. The method for generating three-dimensional content based on a multi-modal pre-training model as set forth in claim 6, wherein obtaining first descriptors corresponding to the respective first text descriptions and using the first descriptors as the seventh descriptors includes:
acquiring a first text descriptor corresponding to the first text description, and acquiring a first image descriptor corresponding to each first two-dimensional image;
and superposing the first text descriptor and the first image descriptor to obtain a first mixed descriptor serving as the seventh descriptor.
8. The method for generating three-dimensional content based on a multi-modal pre-training model as set forth in claim 7, further comprising, before determining, from the first descriptor and the seventh descriptor, a seventh descriptor having a smallest cosine distance from the first descriptor as a second target descriptor:
Obtaining a second two-dimensional image corresponding to the target text description based on the target text description;
acquiring a second image descriptor corresponding to the second two-dimensional image;
superposing the first descriptor and the second image descriptor to obtain a second mixed descriptor;
determining, from the first descriptor and the seventh descriptor, that a seventh descriptor having a smallest cosine distance from the first descriptor is a second target descriptor, including:
and determining a first mixed descriptor with the smallest cosine distance from the second mixed descriptor as the second target descriptor according to the second mixed descriptor and each first mixed descriptor.
9. The multi-modal pre-training model based three-dimensional content generation method as claimed in claim 8 further comprising:
constructing a mixed descriptor extraction network model, and optimizing the mixed descriptor extraction network model by using a contrast loss function;
judging the mixed descriptor extraction network model for which the output value of the contrast loss function is smaller than a first threshold value as a mixed descriptor extraction network model meeting a first iteration ending condition, and taking the mixed descriptor extraction network model meeting the first iteration ending condition as a final mixed descriptor extraction network model;
Acquiring a first text descriptor corresponding to the first text description, and acquiring a first image descriptor corresponding to each first two-dimensional image; superposing the first text descriptor and the first image descriptor to obtain a first mixed descriptor serving as the seventh descriptor, wherein the method comprises the following steps:
inputting the first text description and each first two-dimensional image into the final mixed descriptor extraction network model to obtain the first mixed descriptor;
acquiring a second image descriptor corresponding to the second two-dimensional image, and superposing the first descriptor and the second image descriptor to obtain a second mixed descriptor, wherein the method comprises the following steps:
and inputting the target text description and the second two-dimensional image into the final mixed descriptor extraction network model to obtain the second mixed descriptor.
10. The multi-modal pre-training model based three-dimensional content generation method of claim 9 wherein optimizing the hybrid descriptor extraction network model using a contrast loss function comprises:
inputting a third two-dimensional image rendered from the preset three-dimensional content under a preset viewing angle and a corresponding second text description into the mixed descriptor extraction network model, and calculating an output value of the contrast loss function through the mixed descriptor extraction network model;
When the output value of the contrast loss function is larger than a second threshold value, a first negative sample and a preset positive sample are used for optimizing the mixed descriptor extraction network model, and the second threshold value is larger than the first threshold value;
when the output value of the contrast loss function is not greater than the second threshold value, optimizing the mixed descriptor extraction network model by using a second negative sample and the preset positive sample;
the positive sample is a two-dimensional image and a corresponding text description which are rendered by the preset three-dimensional content at other visual angles except the preset visual angle, the negative sample is a two-dimensional image and a corresponding text description which are rendered by the other three-dimensional content except the preset three-dimensional content at any visual angle under the same category as the preset three-dimensional content, and the cosine distance between the two-dimensional image and the text description corresponding to the first negative sample and the third two-dimensional image and the corresponding second text description is larger than the cosine distance between the two-dimensional image and the text description corresponding to the second negative sample and the third two-dimensional image and the corresponding second text description.
11. The multi-modal pre-training model based three-dimensional content generation method as claimed in claim 10 further comprising:
Extracting a third mixed descriptor corresponding to the two-dimensional image and the text description of each negative sample;
extracting a fourth mixed descriptor corresponding to the third two-dimensional image and the second text description;
calculating cosine distances to be compared between the third mixed descriptors and the fourth mixed descriptors;
taking a negative sample with the cosine distance to be compared being greater than a third threshold value as the first negative sample;
and taking a negative sample with the cosine distance to be compared not larger than the third threshold value as the second negative sample.
12. The method for generating three-dimensional content based on a multi-modal pre-training model according to any one of claims 1 to 11, wherein obtaining a third text description corresponding to the first three-dimensional content comprises:
rendering the first three-dimensional content at multiple viewing angles to obtain a fourth two-dimensional image at multiple viewing angles;
processing each fourth two-dimensional image based on a bootstrapping image-text pre-training model to obtain a corresponding fourth text description;
and integrating the fourth text descriptions to obtain the third text description.
13. The multi-modal pre-training model based three-dimensional content generation method of claim 1 wherein determining a text description difference between the target text description and the third text description comprises:
Acquiring a first descriptor corresponding to the target text description;
acquiring an eighth descriptor corresponding to the third text description;
obtaining a difference descriptor between the first descriptor and the eighth descriptor based on the first descriptor and the eighth descriptor;
driving the first three-dimensional content to deform based on the text description difference to obtain target three-dimensional content, including:
and driving the first three-dimensional content to deform based on the difference descriptor to obtain the target three-dimensional content.
14. The method for generating three-dimensional content based on a multi-modal pre-training model according to claim 13, wherein driving the first three-dimensional content to be deformed based on the difference descriptor to obtain the target three-dimensional content comprises:
acquiring first point cloud data of the first three-dimensional content;
extracting global descriptors and local descriptors of the first three-dimensional content according to the first point cloud data;
determining an offset according to the difference descriptor, the global descriptor and the local descriptor, wherein the offset comprises a color offset and a position offset;
determining second point cloud data of the target three-dimensional content according to the first point cloud data of the first three-dimensional content and the offset;
And obtaining the target three-dimensional content according to the second point cloud data.
15. The multi-modal pre-training model based three-dimensional content generation method as claimed in claim 14 further comprising:
pre-constructing a deformed network model, wherein the deformed network model comprises a text description difference descriptor extraction structure, a three-dimensional content descriptor extraction structure and a three-dimensional content offset prediction structure;
optimizing the deformed network model, and taking the deformed network model meeting the second preset iteration ending condition as a final deformed network model;
determining a text description difference between the target text description and the third text description, comprising:
inputting the target text description and the third text description into a text description difference descriptor extraction structure in the final deformed network model to obtain the difference descriptor;
extracting global descriptors and local descriptors of the first three-dimensional content according to the first point cloud data, wherein the extracting comprises the following steps:
inputting the first point cloud data into a three-dimensional content descriptor extraction structure in the final deformation network model to obtain a global descriptor and a local descriptor of the first three-dimensional content;
Determining an offset from the difference descriptor, the global descriptor, and the local descriptor, comprising:
and determining the offset according to the difference descriptors, the global descriptors, the local descriptors and the offset prediction structure of the three-dimensional content in the final deformation network model.
16. The method for generating three-dimensional content based on a multi-mode pre-training model according to claim 15, wherein optimizing the deformed network model and taking the deformed network model satisfying the second preset iteration end condition as a final deformed network model comprises:
acquiring actual three-dimensional content obtained by deforming the first three-dimensional content based on the offset;
calculating an output value of a total loss function between the actual three-dimensional content and the target three-dimensional content;
when the output value of the total loss function is smaller than a fourth threshold value, judging that the second preset iteration ending condition is met;
and taking the deformation network model with the output value of the total loss function smaller than the fourth threshold value as a final deformation network model.
17. The multi-modal pre-training model based three-dimensional content generation method as claimed in claim 16 wherein calculating the output value of the total loss function between the actual three-dimensional content and the target three-dimensional content comprises:
Calculating an output value of a total loss function between the actual three-dimensional content and the target three-dimensional content by using a preset formula, wherein the preset formula is as follows:
$L = \lambda_1 L_{dist} + \lambda_2 L_{perc} + \lambda_3 L_{align}$;
$L_{dist} = \sum_{p \in G_1} \min_{q \in G} \lVert p - q \rVert_2^2 + \sum_{q \in G} \min_{p \in G_1} \lVert q - p \rVert_2^2$;
$L_{perc} = \lVert \phi(G_1) - \phi(G) \rVert_2^2$;
$L_{align} = 1 - \cos(\psi_1, \psi_2)$;
wherein $L$ is the total loss function; $\lambda_1$, $\lambda_2$ and $\lambda_3$ respectively represent different weights; $L_{dist}$ is the distance loss function; $L_{perc}$ is the point cloud perceptual loss function; $L_{align}$ is the alignment loss function; $G_1$ is a first set of sampling points in the actual three-dimensional content and $p$ is a sampling point in the first set; $G$ is a second set of sampling points in the target three-dimensional content and $q$ is a sampling point in the second set; $\phi(G_1)$ and $\phi(G)$ are the descriptors corresponding to the point cloud data of the actual three-dimensional content and of the target three-dimensional content respectively; and $\psi_1$ and $\psi_2$ are the alignment descriptors of the actual three-dimensional content and of the target three-dimensional content respectively.
18. A multi-modal pre-training model-based three-dimensional content generation system, comprising:
the acquisition unit is used for acquiring the target text description input by the user;
the retrieval unit is used for retrieving in a three-dimensional content database based on the target text description and the multi-mode pre-training model to determine first three-dimensional content;
the text acquisition unit is used for acquiring a third text description corresponding to the first three-dimensional content;
A difference determining unit configured to determine a text description difference between the target text description and the third text description;
and the driving deformation unit is used for driving the first three-dimensional content to deform based on the text description difference to obtain the target three-dimensional content.
19. A three-dimensional content generation device based on a multi-modal pre-training model, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the three-dimensional content generation method based on a multimodal pre-training model according to any of claims 1-17 when executing a computer program.
20. A computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the method for generating three-dimensional content based on a multimodal pre-training model according to any of the claims 1-17.
CN202311827111.5A 2023-12-28 2023-12-28 Three-dimensional content generation method based on multi-mode pre-training model and related components Active CN117473105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311827111.5A CN117473105B (en) 2023-12-28 2023-12-28 Three-dimensional content generation method based on multi-mode pre-training model and related components

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311827111.5A CN117473105B (en) 2023-12-28 2023-12-28 Three-dimensional content generation method based on multi-mode pre-training model and related components

Publications (2)

Publication Number Publication Date
CN117473105A 2024-01-30
CN117473105B 2024-04-05

Family

ID=89638294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311827111.5A Active CN117473105B (en) 2023-12-28 2023-12-28 Three-dimensional content generation method based on multi-mode pre-training model and related components

Country Status (1)

Country Link
CN (1) CN117473105B (en)

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200104318A1 (en) * 2017-03-07 2020-04-02 Selerio Limited Multi-modal image search
US20200388071A1 (en) * 2019-06-06 2020-12-10 Qualcomm Technologies, Inc. Model retrieval for objects in images using field descriptors
CN111368123A (en) * 2020-02-17 2020-07-03 同济大学 Three-dimensional model sketch retrieval method based on cross-mode guide network
WO2021232941A1 (en) * 2020-05-18 2021-11-25 商汤集团有限公司 Three-dimensional model generation method and apparatus, and computer device and storage medium
WO2022088676A1 (en) * 2020-10-29 2022-05-05 平安科技(深圳)有限公司 Three-dimensional point cloud semantic segmentation method and apparatus, and device and medium
WO2022127916A1 (en) * 2020-12-17 2022-06-23 虹软科技股份有限公司 Image processing method, descriptor extraction method and apparatus, and electronic device
CN112927353A (en) * 2021-02-25 2021-06-08 电子科技大学 Three-dimensional scene reconstruction method based on two-dimensional target detection and model alignment, storage medium and terminal
CN113240012A (en) * 2021-05-14 2021-08-10 天津大学 Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device
US20230076092A1 (en) * 2021-08-24 2023-03-09 Beijing University Of Civil Engineering And Architecture Full-automatic classification method for three-dimensional point cloud and deep neural network model
WO2023040609A1 (en) * 2021-09-14 2023-03-23 北京字跳网络技术有限公司 Three-dimensional model stylization method and apparatus, and electronic device and storage medium
CN115344727A (en) * 2022-08-22 2022-11-15 南京航空航天大学 Large-scale three-dimensional point cloud retrieval method, device and system based on normal spherical harmonics
CN116089639A (en) * 2023-01-04 2023-05-09 入迷(成都)信息技术有限公司 Auxiliary three-dimensional modeling method, system, device and medium
CN115757857A (en) * 2023-01-09 2023-03-07 吉林大学 Underwater three-dimensional cross-modal combined retrieval method, storage medium and electronic equipment
CN116401385A (en) * 2023-03-01 2023-07-07 特赞(上海)信息科技有限公司 Semantic retrieval method and device for three-dimensional model
CN116258835A (en) * 2023-05-04 2023-06-13 武汉大学 Point cloud data three-dimensional reconstruction method and system based on deep learning
CN116310148A (en) * 2023-05-17 2023-06-23 山东捷瑞数字科技股份有限公司 Digital twin three-dimensional scene construction method, device, equipment and medium
CN116958957A (en) * 2023-07-27 2023-10-27 网易(杭州)网络有限公司 Training method of multi-mode feature extraction network and three-dimensional feature representation method
CN116910572A (en) * 2023-09-13 2023-10-20 浪潮(北京)电子信息产业有限公司 Training method and device for three-dimensional content generation model based on pre-training language model
CN116932803A (en) * 2023-09-13 2023-10-24 浪潮(北京)电子信息产业有限公司 Data set generation method and training method based on multi-mode pre-training model
CN117216591A (en) * 2023-09-21 2023-12-12 蚂蚁区块链科技(上海)有限公司 Training method and device for three-dimensional model matching and multi-modal feature mapping model
CN117152363A (en) * 2023-10-30 2023-12-01 浪潮电子信息产业股份有限公司 Three-dimensional content generation method, device and equipment based on pre-training language model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
KIM, HYUNGKI et al.: "Deep-learning-based retrieval of piping component catalogs for plant 3D CAD model reconstruction", Computers in Industry, 31 December 2020, pages 1-10 *
史魁洋: "An improved local geometric feature matching method for point clouds", Modern Computer (Professional Edition), no. 08, 15 March 2018, pages 61-65 *
周泩朴; 耿国华; 李康; 王飘: "A multi-view geometry 3D reconstruction method based on the AKAZE algorithm", Computer Science, no. 2, 15 November 2018, pages 180-184 *
杜雨佳; 李海生; 姚春莲; 蔡强: "Single-image 3D model retrieval based on triplet networks", Journal of Beijing University of Aeronautics and Astronautics, no. 09, 17 April 2020, pages 1691-1700 *
熊风光; 蔡晋茹; 况立群; 韩燮: "Research on feature point descriptors and matching algorithms for 3D point cloud models", Journal of Chinese Computer Systems, no. 03, 15 March 2017, pages 640-644 *
黄冰洋; 张承乾: "Research on a multi-modal 3D retrieval system based on sparse coding", Electronic Measurement Technology, no. 06, 15 June 2016, pages 9-14 *

Also Published As

Publication number Publication date
CN117473105B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN110069656B (en) Method for retrieving a three-dimensional model from a two-dimensional image based on a generative adversarial network
Cheraghian et al. Zero-shot learning of 3d point cloud objects
CN110288665A (en) Image description method based on convolutional neural networks, computer-readable storage medium, and electronic device
CN111179419A (en) Three-dimensional key point prediction and deep learning model training method, device and equipment
CN113361636B (en) Image classification method, system, medium and electronic device
CN111651635B (en) Video retrieval method based on natural language description
CN116775922A (en) Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
Kim et al. Self-supervised keypoint detection based on multi-layer random forest regressor
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN117370498A (en) Unified modeling method for 3D open vocabulary detection and closed caption generation
CN117473105B (en) Three-dimensional content generation method based on multi-mode pre-training model and related components
CN112613451A (en) Modeling method of cross-modal text picture retrieval model
CN111597367A (en) Three-dimensional model retrieval method based on view and Hash algorithm
CN116485943A (en) Image generation method, electronic device and storage medium
CN115858756A (en) Emotion-sharing human-machine dialogue system based on perceived emotional tendency
CN114898457A (en) Dynamic gesture recognition method and system based on hand key points and transform
CN115098646A (en) Multilevel relation analysis and mining method for image-text data
CN115169448A (en) Three-dimensional description generation and visual positioning unified method based on deep learning
CN110826726B (en) Target processing method, target processing device, target processing apparatus, and medium
CN107133284A (en) View-based three-dimensional model retrieval method based on manifold learning
Yang et al. RnR: retrieval and reprojection learning model for camera localization
Wang et al. Map matching navigation method based on scene information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant