CN117372607A - Three-dimensional model generation method and device and electronic equipment

Publication number: CN117372607A
Authority: CN (China)
Prior art keywords: dimensional model, model, text, view, target
Legal status: Pending
Application number: CN202311177612.3A
Original language: Chinese (zh)
Inventors: 赵敏达, 李林橙, 赵超逸, 刘柏, 范长杰, 胡志鹏
Assignee: Netease Hangzhou Network Co., Ltd.
Application filed by Netease Hangzhou Network Co., Ltd.


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 15/00: 3D [Three-dimensional] image rendering
    • G06T 15/005: General purpose rendering architectures
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a method and a device for generating a three-dimensional model, and an electronic device. The method comprises the following steps: acquiring a multi-view image group corresponding to an initial three-dimensional model and text description information corresponding to a target three-dimensional model to be generated, the multi-view image group comprising view images of the initial three-dimensional model at different surround view angles under the same pitch angle; determining micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group; and generating the target three-dimensional model according to the micro-renderable parameters corresponding to the target three-dimensional model. The target three-dimensional model generated by the method has high accuracy and matches the text description information well.

Description

Three-dimensional model generation method and device and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method and apparatus for generating a three-dimensional model, an electronic device, and a computer readable storage medium.
Background
With the rapid development of computer technology, three-dimensional models have become ubiquitous in daily life. Three-dimensional models help people better understand and analyze complex objects or scenes, and they are applicable to a wide range of scenarios such as games, virtual reality and robot simulation, so the demand for three-dimensional models keeps increasing.
With text-to-3D generation techniques, a three-dimensional model conforming to a text description can be generated automatically and efficiently from given text. However, in conventional text-to-3D techniques, the generation process has low sensitivity to view-angle information, so the generated three-dimensional model often fails to accurately express the information conveyed by the text, and the deviation between the text information and the generated three-dimensional model is large. Therefore, how to improve the sensitivity of the three-dimensional model generation process to view-angle information, and thereby generate three-dimensional models with high accuracy and good text matching, is a technical problem to be solved.
Disclosure of Invention
The embodiments of the application provide a three-dimensional model generation method, a device, an electronic device and a computer-readable storage medium, so as to solve the problems existing in the prior art.
An embodiment of the application provides a method for generating a three-dimensional model, comprising: acquiring a multi-view image group corresponding to an initial three-dimensional model and text description information corresponding to a target three-dimensional model to be generated, wherein the multi-view image group comprises view images of the initial three-dimensional model at different surround view angles under the same pitch angle; determining micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group; and generating the target three-dimensional model according to the micro-renderable parameters corresponding to the target three-dimensional model.
An embodiment of the application also provides a device for generating a three-dimensional model, comprising: an acquisition unit, a processing unit and a generation unit;
the acquisition unit is configured to acquire a multi-view image group corresponding to an initial three-dimensional model and text description information corresponding to a target three-dimensional model to be generated, wherein the multi-view image group comprises view images corresponding to different surrounding view angles of the initial three-dimensional model under the same pitch angle;
the processing unit is configured to determine micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group;
the generation unit is configured to generate the target three-dimensional model according to the micro-renderable parameters corresponding to the target three-dimensional model.
The embodiment of the application also provides electronic equipment, which comprises a processor and a memory; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method described above.
Embodiments of the present application also provide a computer-readable storage medium having stored thereon one or more computer instructions that are executed by a processor to implement the above-described methods.
Compared with the prior art, the embodiment of the application has the following advantages:
according to the three-dimensional model generation method, as the multi-view image group comprises view images corresponding to different surrounding view angles of the initial three-dimensional model under the same pitch angle, the multi-view image group is used as input data in the three-dimensional model generation process, and compared with the input data in the three-dimensional model generation process which are common images, the sensitivity of the three-dimensional model generation process to view angle information can be improved; further, when the micro-renderable parameters corresponding to the target three-dimensional model are determined, the accuracy of the micro-renderable parameters can be improved because the micro-renderable parameters are determined by the text description information and the multi-view image group together, and further the target three-dimensional model generated according to the micro-renderable parameters is ensured, and the target three-dimensional model is high in accuracy and good in matching with the text description information.
Further, the view angle interval diagram shows view angle diagrams corresponding to different surrounding view angles of the initial three-dimensional model under the same pitch angle; the view angle interval diagram and the multi-view angle diagram group are used as input data of a three-dimensional model generating process together, and compared with the multi-view angle diagram group which is used as input data of the three-dimensional model generating process, effective supervision is carried out between the view angle interval diagram and the multi-view angle diagram group; when the micro-renderable parameters corresponding to the target three-dimensional model are determined, the accuracy and the reliability of the micro-renderable parameters are guaranteed because the micro-renderable parameters are determined by the text description information, the view angle interval diagram and the multi-view angle diagram group together, and further the target three-dimensional model generated according to the micro-renderable parameters is realized, and the target three-dimensional model is high in accuracy and good in matching with the text description information.
Drawings
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating a three-dimensional model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a pitch angle and a surround view angle provided by an embodiment of the present application;
FIG. 4 is a diagram illustrating the screening of a predetermined number of view images corresponding to an initial three-dimensional model provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a view-joint image corresponding to an initial three-dimensional model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a three-dimensional model generation principle provided in an embodiment of the present application;
FIG. 7 is a training schematic diagram of the second text-to-image model provided in an embodiment of the present application;
FIG. 8 is a flowchart of a training method for a text-to-image model according to an embodiment of the present application;
FIG. 9 is a block diagram of the units of a three-dimensional model generation device provided in an embodiment of the present application;
FIG. 10 is a block diagram of a training device for a text-to-image model provided in an embodiment of the present application;
fig. 11 is a schematic logic structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application may, however, be embodied in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific embodiments disclosed below.
First, some technical terms related to the present application will be explained:
a text-to-image model uses a deep learning algorithm to convert text descriptions into image pixel values to effect text-to-image conversion. The meridional graph model can be used for many applications such as image generation, image restoration, and AI mapping.
The obj ase dataset is a 3D model dataset that contains a large amount of three-dimensional model information, such as geometry and topology information of the 3D model. The above data sets are typically used in the fields of computer vision, natural language processing, robotics, etc., for training models to identify, understand, and manipulate 3D objects and scenes. The 3D model dataset may be made up of real world objects, virtual world scenes, or synthetic models.
The Cap3D data set is a 3D model annotation data set, and the data set contains annotation information of three-dimensional models in the obj everse data set, including text description information, rendering graphs and the like of each three-dimensional model. The 3D model annotation dataset described above is typically used to train machine learning and deep learning models, and the like.
A stablistiffusion model, also called a stable diffusion model, is a model for generating images in the AIGC field, which uses a technique called "stable diffusion" to generate high-quality images. The model works on the principle that the model takes as input an initial noise image and generates the final image by constantly modifying the pixel level. In the modification process, the model determines the modification direction of each pixel point according to the content of the input image, so as to generate a more real image.
In order to facilitate understanding of the methods provided by the embodiments of the present application, the background of the embodiments of the present application is described before the embodiments of the present application are described.
At present, text-based three-dimensional model generation is a popular research direction in artificial intelligence, applicable to scenarios such as games, virtual reality and robot simulation. The technique refers to the process of automatically generating, from given text content, a three-dimensional model that conforms to the text description. The process involves a text-to-image model which, as described above, can convert a text description into image pixel values to realize text-to-image conversion.
Existing text-based three-dimensional model generation typically proceeds as follows: first, the three-dimensional model is represented using a method such as NeRF; second, camera parameters are set randomly and a picture containing the three-dimensional model is rendered from that view angle; then, the picture and the text description are taken as input to a text-to-image model, the model outputs the picture estimated from the text description, and the inconsistency between the two is computed; finally, gradient parameters are computed from the inconsistency and used to optimize the representation of the three-dimensional model. However, the text-to-image models used in the prior art, such as the Stable Diffusion model and the DeepFloyd IF model, all suffer from low sensitivity to view-angle information. Even if view-angle information (such as front-view or top-view information) is present in the text description, the text-to-image model still cannot accurately output a picture matching that view-angle description, so the generated three-dimensional model has large errors and a low degree of matching with the text. For example, generated three-dimensional animal models may have multiple heads or multiple feet, and generated chairs may have extra legs, contradicting common-sense knowledge.
Aiming at the problems in the prior art, the three-dimensional model generation method provided by the application can improve the sensitivity of the text-to-image model to view-angle information, so that three-dimensional models with high accuracy and good text matching can be generated with the text-to-image model.
Through the background description of the foregoing, those skilled in the art can understand the problems existing in the prior art, and the following details of the application scenario of the method for generating a three-dimensional model of the present application are described. The three-dimensional model generation method provided by the embodiment of the application can be applied to the field of artificial intelligence or other related technical fields with the requirement of generating the three-dimensional model.
In the following, first, an application scenario of the method for generating a three-dimensional model according to the embodiment of the present application will be described by way of example.
Fig. 1 is an application scenario schematic diagram of a method for generating a three-dimensional model according to a first embodiment of the present application.
As shown in fig. 1, the application scenario includes a client 101 and a server 102, connected through network communication. The client 101 is configured to obtain text description information, and the server 102 is configured to generate the target three-dimensional model according to the text description information. The server 102 is the side on which the text-to-image model is deployed.
Taking fig. 1 as an example for describing in detail, fig. 1 is a schematic application scenario diagram of a three-dimensional model generating method provided in an embodiment of the present application, where the embodiment of the present application does not limit the devices included in fig. 1 and does not limit the number of clients 101 and servers 102. For example, in the application scenario shown in fig. 1, a data storage device may be further included, where the data storage device may be an external memory with respect to the client 101 and the server 102, or may be an internal memory integrated with the client 101 and the server 102. The client 101 may be a smart phone, a smart bracelet, a tablet computer, a wearable device, a multimedia player, an electronic reader, or other devices with communication functions, and an Application (APP) with a text input function is correspondingly installed on the device. The server 102 may be a server or a cluster formed by a plurality of servers, or may be a cloud computing service center.
In the embodiment of the present application, the number of devices of the client 101 and the server 102 in fig. 1 may vary. The specific implementation process of the application scenario can be described with reference to the schemes of the following embodiments.
In the application scenario of the embodiments of the present application, the present application further provides a three-dimensional model generation method, and a device, an electronic device and a computer-readable storage medium corresponding to the method. Embodiments are provided below to describe the method, device, electronic device and computer-readable storage medium in detail.
The second embodiment of the application provides a method for generating a three-dimensional model. Fig. 2 is a flowchart of a method for generating a three-dimensional model according to an embodiment of the present application, and the method provided in this embodiment is described in detail below with reference to fig. 2. The embodiments referred to in the following description are intended to illustrate the method principles and not to limit the practical use.
As shown in fig. 2, the method for generating a three-dimensional model provided in this embodiment includes the following steps S201 to S203:
s201, acquiring a multi-view image group corresponding to an initial three-dimensional model and text description information corresponding to a target three-dimensional model to be generated, wherein the multi-view image group comprises view images corresponding to different surrounding view angles of the initial three-dimensional model under the same pitch angle.
This step is used for obtaining the multi-view image group of the initial three-dimensional model and the text description information corresponding to the target three-dimensional model to be generated.
The initial three-dimensional model refers to a virtual three-dimensional model for describing an object or a scene; the initial three-dimensional model is used as an application model base, and a target three-dimensional model corresponding to the text description information can be generated according to the initial three-dimensional model.
It should be understood that the three-dimensional model generation method of this embodiment can be applied to improving an existing three-dimensional model (as the initial three-dimensional model) to obtain the target three-dimensional model, or to generating the target three-dimensional model directly from an initialized three-dimensional model (as the initial three-dimensional model). For example, the initial three-dimensional model may be a virtual animal three-dimensional model, from which model-related content information such as animal type, size, color and proportion can be observed. Alternatively, the initial three-dimensional model may be any initialized three-dimensional geometric model such as a sphere or a cube; no content information can be observed from such an initialized geometric model, which serves only as a blank for the three-dimensional model.
The initial three-dimensional model is expressed in a rendering representation, such as NeuS, NeRF or DMTet. Taking the NeuS volume rendering representation as an example, the rendering representation of the initial three-dimensional model depends on micro-renderable (differentiable rendering) parameters. For example, the NeuS volume rendering representation of the initial three-dimensional model, denoted g(ψ), depends on the micro-renderable NeuS parameter ψ, where the symbol g denotes a differentiable rendering expression. In implementation, if the NeuS parameter ψ of the initial three-dimensional model changes, the initial three-dimensional model expressed by the NeuS rendering representation changes accordingly. It can also be understood that, by adjusting the micro-renderable parameters (the NeuS parameter ψ), the initial three-dimensional model expressed by the rendering representation (the NeuS volume rendering representation g(ψ)) is adjusted accordingly.
Of course, in this embodiment, the rendering representation of models other than the initial three-dimensional model, such as the target three-dimensional model, is consistent with that of the initial three-dimensional model. That is, the rendering representation corresponding to the target three-dimensional model and that corresponding to the initial three-dimensional model are the same representation. For example, if the initial three-dimensional model uses the NeuS volume rendering representation, the target three-dimensional model likewise uses the NeuS volume rendering representation; the difference between the two lies in the specific values of the NeuS parameter ψ, e.g., the NeuS parameter of the initial three-dimensional model is ψ0 and that of the target three-dimensional model is ψ1. If the initial three-dimensional model uses the surface mesh representation of the DMTet method, the target three-dimensional model also uses the surface mesh representation of the DMTet method. Since the surface mesh representation of DMTet is not the focus of this embodiment, it is not described further here.
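For illustration, the following is a minimal PyTorch sketch (all class and function names are hypothetical, not from this application) of what model expression through micro-renderable parameters means: the model is fully determined by a parameter tensor ψ, and rendering is differentiable in ψ, so a loss on a rendered image back-propagates into the model itself.

```python
# Hypothetical toy stand-in for a NeuS-style representation g(psi); a real
# implementation would volume-render an SDF + color network along camera rays.
import torch

class TinyRenderableModel(torch.nn.Module):
    def __init__(self, res: int = 64):
        super().__init__()
        # psi: the micro-renderable parameters (in NeuS, network weights;
        # here simply an image-sized tensor for demonstration)
        self.psi = torch.nn.Parameter(torch.zeros(3, res, res))

    def render(self, surround_angle_rad: float) -> torch.Tensor:
        # g(psi): a function of psi and the camera that stays differentiable
        angle = torch.tensor(surround_angle_rad)
        return torch.sigmoid(self.psi) * (0.5 + 0.5 * torch.cos(angle))

model = TinyRenderableModel()
image = model.render(0.0)
image.sum().backward()          # gradients flow back into psi
print(model.psi.grad.shape)     # torch.Size([3, 64, 64])
```

Because render() is differentiable, optimizing ψ optimizes the three-dimensional model itself; this is the property the later gradient-update steps rely on.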
In this embodiment, before determining the micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group, the method further comprises: obtaining a view-interval image corresponding to the initial three-dimensional model, where the view-interval image presents the views of the initial three-dimensional model at different surround view angles under the same pitch angle.
The view-interval image presents the views of the initial three-dimensional model at different surround view angles under the same pitch angle. It can also be understood that the view-interval image is a single composite image obtained by performing noise-addition processing on the views of the initial three-dimensional model at different surround view angles under the same pitch angle. Obtaining the view-interval image corresponding to the initial three-dimensional model comprises: arranging and combining the view images to form a view-joint image; and performing noise-addition processing on the view-joint image according to a preset sampled noise value, to obtain the view-interval image corresponding to the initial three-dimensional model. In other words, the view-interval image corresponding to the initial three-dimensional model is the noise image obtained by a first noise-addition processing of the view images, where the first noise-addition processing arranges and combines the view images in order into a view-joint image and then adds noise according to the sampled noise value; noise addition according to the sampled noise value is described in the subsequent content.
In this embodiment, several view images captured by a virtual camera at the same pitch angle and different surround view angles are screened, giving a predetermined number of view screening images, which are arranged and combined in order into a single view-joint image. The view-interval image is the single image obtained by performing noise-addition processing on the view-joint image according to the sampled noise value. In this embodiment, the view-joint image, which may also be called the joint rendering image, is denoted $I_1$; the view-interval image is denoted $\tilde{I}_1$.
Here, the different surround view angles at the same pitch angle of the initial three-dimensional model are illustrated. Referring to fig. 3, fig. 3 is a schematic diagram of the pitch angle and the surround view angle in a coordinate system. As shown in fig. 3, with the origin o of the x-y-z coordinate system as the placement point of the initial three-dimensional model, the pitch angle may be the angle between the camera's line of sight L2 and the x-o-y plane, i.e., the angle v1 between the line of sight L2 and the ray L1 (the projection of L2 onto the x-o-y plane). If the horizontal plane of the camera is x'-o'-y' and the camera orbits within that plane so that the line of sight rotates from its initial position L2 to L3, then the angle η1 between L3 and L2 is taken as the surround view angle.
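As a concrete illustration of this camera set-up (a sketch under the geometry just described, with hypothetical function names), the camera position on a sphere of a given rendering radius can be computed from the pitch angle and the surround view angle as follows:

```python
import math

def camera_position(radius: float, pitch_deg: float, surround_deg: float):
    """Camera location for the set-up of fig. 3: the model sits at the origin,
    pitch is the elevation above the x-o-y plane, and surround is the azimuth
    swept within the camera's horizontal plane."""
    pitch = math.radians(pitch_deg)
    azimuth = math.radians(surround_deg)
    x = radius * math.cos(pitch) * math.cos(azimuth)
    y = radius * math.cos(pitch) * math.sin(azimuth)
    z = radius * math.sin(pitch)
    return x, y, z

# Four surround view angles at 90-degree intervals under one shared
# 15-degree pitch, matching the four camera groups used in this embodiment:
positions = [camera_position(1.75, 15.0, a) for a in (0.0, 90.0, 180.0, 270.0)]
```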
The multi-view image group comprises view images of the initial three-dimensional model at different surround view angles under the same pitch angle. It can also be understood that the multi-view image group is the image group formed by performing noise-addition processing on several view images of the initial three-dimensional model at different surround view angles under the same pitch angle and combining the results. Obtaining the multi-view image group corresponding to the initial three-dimensional model comprises: performing noise-addition processing on each view image according to the preset sampled noise value, to obtain the multi-view image group corresponding to the initial three-dimensional model. In other words, the multi-view image group corresponding to the initial three-dimensional model is the noise image group obtained by a second noise-addition processing of the view images.
In this embodiment, several view images captured by a virtual camera at the same pitch angle and different surround view angles are screened, giving a predetermined number of view screening images; the multi-view image group is the image group formed by performing noise-addition processing on the predetermined number of view screening images according to the sampled noise value and combining the results. In this embodiment, the predetermined number of view screening images are denoted $I_2$; the multi-view image group is denoted $\tilde{I}_2$.
For ease of understanding, the view images, the predetermined number of view screening images, the view-joint image, the view-interval image and the multi-view image group are illustrated, taking a virtual cartoon-character three-dimensional model as the initial three-dimensional model. A camera radius (rendering radius) is randomly selected from [1.5, 2.0], a pitch angle, e.g. 15°, is randomly selected from [-10°, 45°], the horizontal direction of the camera's horizontal plane is initialized with a random angle, and the camera captures several view images. In addition, based on the angle values of the surround view angles, one surround view angle is selected every 90° (4 surround view angles in total), and these settings form four groups of camera positions. NeuS volume rendering is performed on the virtual cartoon-character three-dimensional model from these camera positions to obtain 4 view images (the predetermined number of view screening images), with the resolution of each view image set to 512x512. Of course, the predetermined number of view screening images may also be obtained by directly screening the several view images at 90° intervals of the surround view angle. These 4 view images of the virtual cartoon-character three-dimensional model are shown in fig. 4, a screening illustration of the predetermined number of view images corresponding to the initial three-dimensional model: (a), (b), (c) and (d) in fig. 4 are pictures rendered by the virtual camera at different surround angles, and the 4 resulting view images are taken as the predetermined number of view screening images corresponding to the virtual cartoon-character three-dimensional model.
Further, the 4 view images (the predetermined number of view screening images) are arranged 2x2, with the arrangement order rotating clockwise through the views, and the combined rendering is then resized to 512x512 resolution, giving 1 view-joint image (the joint rendering image $I_1$). The view-joint image is shown in fig. 5, a schematic diagram of the view-joint image corresponding to the initial three-dimensional model: the 4 view images (a), (b), (c) and (d) of fig. 4 are combined and arranged to obtain the single view-joint image of fig. 5.
Still further, the 4 view images are directly combined to form the predetermined number of view screening images $I_2$. That is, $I_2$ is the image group formed by combining the 4 view images themselves, at 4 different surround view angles and a shared 15° pitch angle.
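The tiling just described can be sketched as follows (NumPy, hypothetical names; nearest-neighbour resizing is used only to keep the sketch dependency-free):

```python
import numpy as np

def make_joint_image(views, out_res: int = 512) -> np.ndarray:
    """Tile 4 view images of shape (H, W, 3) into one 2x2 joint image,
    arranged clockwise: top-left, top-right, bottom-right, bottom-left."""
    assert len(views) == 4
    a, b, c, d = views
    top = np.concatenate([a, b], axis=1)
    bottom = np.concatenate([d, c], axis=1)
    joint = np.concatenate([top, bottom], axis=0)
    # nearest-neighbour resize back to out_res x out_res
    h, w = joint.shape[:2]
    ys = np.arange(out_res) * h // out_res
    xs = np.arange(out_res) * w // out_res
    return joint[ys][:, xs]

views = [np.random.rand(512, 512, 3).astype(np.float32) for _ in range(4)]
joint_512 = make_joint_image(views)   # the joint rendering image I_1
```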
The text description information refers to information describing the target three-dimensional model. The text description information comprises view description information, which describes the appearance of the target three-dimensional model at different view angles. If the method is applied to improving an existing three-dimensional model (the initial three-dimensional model), the text description information may further include description information of the initial three-dimensional model.
In this embodiment, the text description information is denoted y; if the text description information corresponding to the three-dimensional model is blank (null), it is denoted y'. In a specific implementation, the text description information of an object can be obtained from the Cap3D dataset.
Next, the noise-addition processing based on the sampled noise value is described in detail.
As described above, the view-interval image $\tilde{I}_1$ is derived from the view-joint image $I_1$: performing noise-addition processing on $I_1$ according to the sampled noise value yields the view-interval image $\tilde{I}_1$. The multi-view image group $\tilde{I}_2$ is derived from the predetermined number of view screening images $I_2$: performing noise-addition processing on $I_2$ according to the sampled noise value yields the multi-view image group $\tilde{I}_2$. In this embodiment, the preset sampled noise value is denoted $\epsilon$.
In a specific implementation, the noise-addition processing comprises the following steps:
1) Randomly sample a step $t$ from the integers $[1, \dots, T]$, where $T = 1000$;
2) Sample a noise $\epsilon \sim \mathcal{N}(0, I)$ from a standard Gaussian distribution, where $\epsilon$ denotes the sampled noise value;
3) Add noise to the picture $x_0$:

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$,  where  $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$,  $\alpha_t = 1 - \beta_t$,  $\beta_t = 0.02 - \frac{0.02 - 0.0001}{T}\, t$.

Here $\beta_t$ denotes the noise-intensity setting of the randomly chosen $t$-th sampling; $\alpha_t$ denotes the picture-noise value corresponding to a single (the $t$-th) sampling; $\bar{\alpha}_t$ denotes the cumulative picture-noise value corresponding to the $t$-th sampling; and $\sqrt{1 - \bar{\alpha}_t}\, \epsilon$ denotes the noise value added to the picture $x_0$. In the above process, the picture $x_0$ may be the view-joint image $I_1$, the predetermined number of view screening images $I_2$, or any picture to which noise is to be added; this embodiment is illustrative and not limiting.
Through the above process, the view-interval image $\tilde{I}_1$ is obtained from the view-joint image $I_1$ as follows:

$\tilde{I}_1 = \sqrt{\bar{\alpha}_t}\, I_1 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$   (Formula 1.1)

Likewise, the multi-view image group $\tilde{I}_2$ is obtained from the predetermined number of view screening images $I_2$ as follows:

$\tilde{I}_2 = \sqrt{\bar{\alpha}_t}\, I_2 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$   (Formula 1.2)

where the meanings of the symbols in Formulas 1.1 and 1.2, including $\epsilon$, are as described above and are not repeated here.
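A minimal sketch of this forward noising (PyTorch; the schedule is written exactly as in step 3 above, and all names are illustrative):

```python
import torch

T = 1000
t_all = torch.arange(1, T + 1, dtype=torch.float32)
beta = 0.02 - (0.02 - 0.0001) / T * t_all   # beta_t as written in step 3
alpha = 1.0 - beta                          # alpha_t
alpha_bar = torch.cumprod(alpha, dim=0)     # cumulative noise value

def add_noise(x0: torch.Tensor, t: int, eps: torch.Tensor) -> torch.Tensor:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps  (Formulas 1.1/1.2)
    a_bar = alpha_bar[t - 1]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

x0 = torch.rand(3, 512, 512)            # e.g. the view-joint image I_1
t = int(torch.randint(1, T + 1, ()))    # random step t in [1, T]
eps = torch.randn_like(x0)              # sampled noise value epsilon
x_t = add_noise(x0, t, eps)             # e.g. the view-interval image
```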
Through the steps of this embodiment, the view-interval image $\tilde{I}_1$ and the multi-view image group $\tilde{I}_2$ corresponding to the initial three-dimensional model, and the text description information y corresponding to the target three-dimensional model to be generated, are obtained.
S202, determining micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group.
This step is used for determining the micro-renderable parameters corresponding to the target three-dimensional model.
In this embodiment, determining the micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group comprises: determining the micro-renderable parameters corresponding to the target three-dimensional model according to the text description information, the multi-view image group and the view-interval image.
The micro-renderable parameters are parameters used for rendering the three-dimensional model and are applied in the rendering representation of the three-dimensional model; for example, if the target three-dimensional model adopts the NeuS volume rendering representation, denoted g(ψ), the micro-renderable parameter is the NeuS parameter ψ. Of course, different three-dimensional models may be expressed using different rendering representations; this embodiment is illustrative and not limiting.
As a possible implementation, determining the micro-renderable parameters corresponding to the target three-dimensional model according to the text description information, the view-interval image and the multi-view image group in this embodiment comprises the following steps S202-1 to S202-3:
S202-1, calculating a first consistency result between the view-interval image and the text description information based on a first text-to-image model, where the first text-to-image model is used for generating a view-interval output image of a three-dimensional model according to text description information of the three-dimensional model;
S202-2, calculating a second consistency result between the multi-view image group and the text description information based on a second text-to-image model, where the second text-to-image model is used for generating individual view output images of a three-dimensional model according to text description information of the three-dimensional model;
S202-3, determining the micro-renderable parameters corresponding to the target three-dimensional model according to the first consistency result and the second consistency result.
The first consistency result is the matching result between the view-interval image and the text description information, computed based on the first text-to-image model. The first text-to-image model differs from an ordinary text-to-image model in that it has the capability of generating a view-interval output image of a three-dimensional model according to the model's text description information, so its sensitivity to view-angle information is high. The matching result between the view-interval image and the text description information calculated based on the first text-to-image model therefore has high accuracy and reliability. It should be understood that, for a text-to-image model, the model input data are a picture and text description information, and the output data is a model-predicted picture: during operation, the model performs image-feature prediction and estimation according to the input text description information and image, and synthesizes the predicted picture. The view-interval output picture is the picture predicted and estimated by the first text-to-image model; its style can be seen in the schematic view-joint image of fig. 5. Likewise, the view output pictures are the pictures predicted and estimated by the second text-to-image model. For ease of distinction, in this embodiment the first text-to-image model is denoted $Z_\pi$.
The second consistency result is the matching result between the multi-view image group and the text description information, computed based on the second text-to-image model. The second text-to-image model differs from an ordinary text-to-image model in that it has the capability of generating the individual view output images of a three-dimensional model according to the model's text description information; it is also sensitive to view-angle information, though less so than the first text-to-image model. It should be understood that, in this embodiment, the purpose of computing the first consistency result based on the first text-to-image model and the second consistency result based on the second text-to-image model is to let the two results supervise each other, which can also be understood as a mutual-supervision optimization strategy between the first and second text-to-image models, ensuring that the determined micro-renderable parameters corresponding to the target three-dimensional model have high accuracy and good reliability. For ease of distinction, in this embodiment the second text-to-image model is denoted $Z_\theta$.
In step S202-1, calculating the first consistency result between the view-interval image and the text description information comprises: performing noise prediction on the view-interval image according to the text description information to obtain a first noise prediction result, and determining the first consistency result between the view-interval image and the text description information according to the first noise prediction result.
In step S202-2, calculating the second consistency result between the multi-view image group and the text description information comprises: performing noise prediction on the multi-view image group according to the text description information to obtain a second noise prediction result, and determining the second consistency result between the multi-view image group and the text description information according to the second noise prediction result.
It should be understood that noise prediction for a picture can be performed based on a text-to-image model. In this embodiment, the first noise prediction result is denoted $\hat{\epsilon}_\pi$ and the second noise prediction result is denoted $\hat{\epsilon}_\theta$; the first noise prediction result $\hat{\epsilon}_\pi$ is obtained based on the first text-to-image model $Z_\pi$, and the second noise prediction result $\hat{\epsilon}_\theta$ is obtained based on the second text-to-image model $Z_\theta$.
In a specific implementation, the first text-to-image model, i.e., the Stable Diffusion text-to-image model $Z_\pi$, predicts the first noise prediction result $\hat{\epsilon}_\pi$ corresponding to the view-interval image $\tilde{I}_1$. The specific calculation formula is:

$\hat{\epsilon}_\pi = Z_\pi(\tilde{I}_1; t, y') + \lambda \big( Z_\pi(\tilde{I}_1; t, y) - Z_\pi(\tilde{I}_1; t, y') \big)$   (Formula 2.1)

where $\lambda = 100$; $\lambda$ denotes the proportion parameter between the text description information and the blank text description information for the first text-to-image model $Z_\pi$; $\tilde{I}_1$ denotes the view-interval image; $t$ denotes the random sampling value; $y$ denotes the text description information; and $y'$ denotes the blank text description information.
Similarly, the second text-to-image model, i.e., the original open-source Stable Diffusion text-to-image model $Z_\theta$, predicts the second noise prediction result $\hat{\epsilon}_\theta$ corresponding to the multi-view image group $\tilde{I}_2$. The specific calculation formula is:

$\hat{\epsilon}_\theta = Z_\theta(\tilde{I}_2; t, y') + \lambda \big( Z_\theta(\tilde{I}_2; t, y) - Z_\theta(\tilde{I}_2; t, y') \big)$   (Formula 2.2)

where $\lambda = 100$; $\lambda$ denotes the proportion parameter between the text description information and the blank text description information for the second text-to-image model $Z_\theta$; $\tilde{I}_2$ denotes the multi-view image group; $t$ denotes the random sampling value; $y$ denotes the text description information; and $y'$ denotes the blank text description information.
Through the above process, the first consistency result, i.e., the first noise prediction result $\hat{\epsilon}_\pi$ representing the consistency between the view-interval image and the text description information, and the second consistency result, i.e., the second noise prediction result $\hat{\epsilon}_\theta$ representing the consistency between the multi-view image group and the text description information, are obtained.
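Formulas 2.1 and 2.2 have the usual classifier-free-guidance form; the following is a minimal PyTorch sketch (a toy denoiser stands in for $Z_\pi$ or $Z_\theta$, and all names are illustrative, not from this application):

```python
import torch

def guided_noise_prediction(denoiser, x_t, t, y_emb, y_blank_emb, lam=100.0):
    # eps_hat = Z(x_t; t, y') + lam * (Z(x_t; t, y) - Z(x_t; t, y'))
    eps_cond = denoiser(x_t, t, y_emb)          # prediction with text y
    eps_uncond = denoiser(x_t, t, y_blank_emb)  # prediction with blank text y'
    return eps_uncond + lam * (eps_cond - eps_uncond)

# Toy denoiser so the sketch runs end to end (a real Z would be a U-Net):
toy = lambda x, t, emb: 0.1 * x + emb.mean()
x_t = torch.randn(1, 3, 64, 64)
eps_hat = guided_noise_prediction(toy, x_t, 500, torch.randn(8), torch.zeros(8))
```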
Through the above steps, the first noise prediction result $\hat{\epsilon}_\pi$ obtained based on the first text-to-image model $Z_\pi$ and the second noise prediction result $\hat{\epsilon}_\theta$ obtained based on the second text-to-image model $Z_\theta$ are available. Next, according to the first noise prediction result $\hat{\epsilon}_\pi$ and the second noise prediction result $\hat{\epsilon}_\theta$, the micro-renderable parameter ψ1 corresponding to the target three-dimensional model can be determined. The specific process is as follows:
acquiring the preset sampled noise value and the micro-renderable parameter corresponding to the initial three-dimensional model;
and determining the micro-renderable parameter corresponding to the target three-dimensional model according to the sampled noise value, the first noise prediction result, the second noise prediction result and the micro-renderable parameter corresponding to the initial three-dimensional model.
The sampled noise value is the noise value $\epsilon$ randomly sampled during the noise-addition process, as described above. Since a text-to-image model is essentially a neural network model, its operating principle involves a forward process and a reverse (inference) process. The forward process can be understood as the image noise-addition process: directly removing pixels would lose information, whereas adding noise lets the model learn the picture features. This process may involve multi-step noise addition, producing the noise images used as model input data. In the multi-step process, the image noise added at the current step t is related to the noise added at step t-1. Moreover, the sampled noise value chosen at each step may differ, and the random noise also increases the model's flexibility.
The reverse (inference) process can be understood as the diffusion denoising inference that yields the micro-renderable parameters corresponding to the target three-dimensional model. Guided by the text description information, the model repeatedly predicts and restores pictures until the original picture is obtained; controlling this process step by step improves the stability of denoising. Therefore, the sampled noise value $\epsilon$ used in the noise-addition processing of the view-joint image $I_1$ and the predetermined number of view screening images $I_2$ is also applied in the reverse (inference) process.
The micro-renderable parameters corresponding to the initial three-dimensional model are the parameters with which the initial three-dimensional model is expressed in its rendering representation. As described above, the NeuS volume rendering representation of the initial three-dimensional model, denoted g(ψ), depends on the micro-renderable NeuS parameter. The micro-renderable parameters of the initial and target three-dimensional models differ in the specific values of the NeuS parameter ψ in the NeuS volume rendering representation; for example, the NeuS parameter of the initial three-dimensional model is ψ0 and that of the target three-dimensional model is ψ1.
In a specific implementation, the update gradient of the micro-renderable parameter corresponding to the target three-dimensional model is determined according to the sampled noise value, the first noise prediction result, the second noise prediction result and the micro-renderable parameter corresponding to the initial three-dimensional model. The calculation formula is:

$\nabla_{\psi} \mathcal{L} = \mathbb{E}_{t, \epsilon} \left[ w(t) \left( (\hat{\epsilon}_\pi - \epsilon)\, \frac{\partial \tilde{I}_1}{\partial \psi} + (\hat{\epsilon}_\theta - \epsilon)\, \frac{\partial \tilde{I}_2}{\partial \psi} \right) \right]$   (Formula 3.1)

where the total number of training steps is set to L, with the formula corresponding to the i-th training step; $\hat{\epsilon}_\pi$ denotes the first noise prediction result; $\hat{\epsilon}_\theta$ denotes the second noise prediction result; $\epsilon$ denotes the sampled noise value; $\psi$ denotes the micro-renderable parameter corresponding to the initial three-dimensional model; $\tilde{I}_1$ denotes the view-interval image; $\tilde{I}_2$ denotes the multi-view image group; $\bar{\alpha}_t$ denotes the cumulative picture-noise value corresponding to the t-th sampling; $\beta(i)$ denotes the noise value corresponding to the random i-th training step; $w(t)$ denotes the weight parameter corresponding to the t-th sampling; and $\nabla_{\psi}\mathcal{L}$ denotes the update gradient of the micro-renderable parameter ψ corresponding to the target three-dimensional model. From this update gradient, the micro-renderable parameter ψ1 corresponding to the target three-dimensional model is calculated.
Specifically, assuming the value of the micro-renderable parameter ψ at step i-1 is ψ0, one iteration of the update yields the parameter value at step i; through multi-step iterative calculation, the micro-renderable parameter corresponding to the target three-dimensional model is obtained and, for ease of distinction, denoted ψ1.
Through the above process, the micro-renderable parameter ψ1 corresponding to the target three-dimensional model is calculated. The calculation process can also be understood as optimizing the micro-renderable parameter ψ corresponding to the initial three-dimensional model according to the gradient formula. To facilitate understanding of obtaining the micro-renderable parameter ψ1 corresponding to the target three-dimensional model, reference may also be made to fig. 6, a schematic diagram of the three-dimensional model generation principle provided in an embodiment of the present application.
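One way to realize this update in code is a score-distillation-style step: the residual between each predicted noise and the sampled noise is injected as a gradient on the rendered image and back-propagated into ψ. The following PyTorch sketch is illustrative only (names and the plain gradient-descent step are assumptions, not the claimed method):

```python
import torch

def joint_sds_step(psi, rendered_1, rendered_2, eps, eps_hat_pi, eps_hat_theta,
                   w_t=1.0, lr=0.01):
    """Inject w(t) * (eps_hat - eps) from both text-to-image models as the
    image-space gradient and back-propagate through the renderer into psi."""
    grad_1 = w_t * (eps_hat_pi - eps)       # supervision from Z_pi on I_1
    grad_2 = w_t * (eps_hat_theta - eps)    # supervision from Z_theta on I_2
    rendered_1.backward(gradient=grad_1, retain_graph=True)
    rendered_2.backward(gradient=grad_2)
    with torch.no_grad():                   # plain gradient-descent update
        psi -= lr * psi.grad
        psi.grad = None

psi = torch.nn.Parameter(torch.zeros(3, 64, 64))
r1, r2 = torch.sigmoid(psi), torch.tanh(psi)   # stand-ins for two renders of g(psi)
eps = torch.randn(3, 64, 64)
joint_sds_step(psi, r1, r2, eps, eps + 0.1, eps - 0.05)
```

Note that the denoisers themselves are not updated here; only ψ is, which matches the description of optimizing the initial model's micro-renderable parameter toward ψ1.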
Next, the relationship between the first and second text-to-image models is described in detail.
As a possible implementation, the first text-to-image model is obtained by model training based on the second text-to-image model. That is, the relationship between the first and second text-to-image models is: the first text-to-image model is obtained by performing model training on the second text-to-image model. For example, the Stable Diffusion text-to-image model $Z_\pi$ is obtained by model training based on the original open-source Stable Diffusion text-to-image model $Z_\theta$. Of course, in this embodiment the first text-to-image model may also be a directly acquired model, not necessarily trained from the second text-to-image model.
It should be understood that, in the three-dimensional model generation process of this embodiment, the first text-to-image model (the Stable Diffusion text-to-image model $Z_\pi$) and the second text-to-image model (the original open-source Stable Diffusion text-to-image model $Z_\theta$) jointly perform noise supervision, so as to ensure that the calculated micro-renderable parameter corresponding to the target three-dimensional model, the NeuS parameter ψ, has high accuracy and reliability.
In this embodiment, the first text-to-image model is trained in the following manner:
acquiring a sample view-interval image and sample description information for model training, where the sample view-interval image presents the views of a three-dimensional model at different surround view angles under the same pitch angle, and the sample description information describes the appearance of the three-dimensional model at different view angles;
based on the second text-to-image model, acquiring text-to-image model parameter gradient information according to the sample view-interval image and the sample description information, where the text-to-image model parameter gradient information represents the update direction of the second text-to-image model's parameters;
and obtaining the first text-to-image model according to the text-to-image model parameter gradient information.
Obtaining the first text-to-image model according to the text-to-image model parameter gradient information comprises: acquiring initial model parameters corresponding to the first text-to-image model; and updating the initial model parameters corresponding to the first text-to-image model according to the text-to-image model parameter gradient information, to obtain the first text-to-image model.
For ease of understanding, the above model training process of this embodiment is described in detail. Reference may also be made to fig. 7, a training schematic diagram of the second text-to-image model provided in an embodiment of the present application.
The sample view-interval image is essentially a single composite image obtained by performing noise-addition processing on the views of a three-dimensional model at different surround view angles under the same pitch angle.
In the implementation process, surround rendering with fixed pitch angles is performed for each three-dimensional model in the Objaverse dataset, so as to expand the dataset as much as possible. The pitch-angle sequence may be set to [0°, 15°, 30°, 45°]. During the surround rendering, one view is rendered every 30° for each pitch angle; a full circle around the three-dimensional model thus yields 12 views in total. The resolution of the 12 views is set to 512x512 to ensure that each view contains the full appearance of the three-dimensional model. To avoid the three-dimensional model occupying too small a proportion of the view, the virtual camera rendering radius is set to 1.75.
Through the above process, the views of the three-dimensional model are obtained. Next, the acquisition of the sample view-interval image is described. A pitch angle is randomly selected from the pitch-angle sequence (0°, 15°, 30°, 45°), and 4 views are randomly selected from the 12 views at that pitch angle, ensuring that the surround view angles of the 4 views are spaced 90° apart. The 4 views spaced 90° apart in surround view angle are arranged 2x2, with the arrangement order rotating clockwise through the views, and the combined rendering is then resized to 512x512 resolution, giving 1 sample view-joint image. In this embodiment, the sample view-joint image is denoted $S_1$; performing noise-addition processing on $S_1$ yields the sample view-interval image $S_2$.
In a specific implementation, the noise-addition processing comprises the following steps:
1) Randomly sample a step $t$ from the integers $[1, \dots, T]$, where $T = 1000$;
2) Sample a noise $\epsilon \sim \mathcal{N}(0, I)$ from a standard Gaussian distribution, where $\epsilon$ denotes the sampled noise value;
3) Add noise to the picture $x_0$:

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$,  where  $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$,  $\alpha_t = 1 - \beta_t$,  $\beta_t = 0.02 - \frac{0.02 - 0.0001}{T}\, t$.

Here $\beta_t$ denotes the noise-intensity setting of the randomly chosen $t$-th sampling; $\alpha_t$ denotes the picture-noise value corresponding to a single (the $t$-th) sampling; $\bar{\alpha}_t$ denotes the cumulative picture-noise value corresponding to the $t$-th sampling; and $\sqrt{1 - \bar{\alpha}_t}\, \epsilon$ denotes the noise value added to the picture $x_0$. In this process, the picture $x_0$ is the sample view-joint image $S_1$. For details, reference may also be made to the description of the view-joint image corresponding to the initial three-dimensional model, which is not repeated here.
The sample description information also includes: three-dimensional model view angle description information, which is used to describe the appearance of the three-dimensional model under different viewing angles. In a specific implementation, the text description information of each three-dimensional model can be obtained from the Cap3D dataset.
In a specific implementation, for the second text-to-image model, the original open-source Stable Diffusion text-to-image model $Z_\theta$ is loaded with pre-trained parameters. The sample view angle interval map and the sample description information are then used to perform fine-tuning training on the open-source Stable Diffusion text-to-image model $Z_\theta$. To improve the quality and speed of model training, the number of samples selected for one training step of the open-source Stable Diffusion text-to-image model $Z_\theta$ can be set to 256 (batch size = 256).
As one possible implementation, the criterion for judging that training of the open-source Stable Diffusion text-to-image model $Z_\theta$ is complete may be a preset number of training steps; for example, training is complete when a preset number of 200,000 steps is reached. As another possible implementation, a three-dimensional model generated with the trained open-source Stable Diffusion text-to-image model $Z_\theta$ can be rendered into a two-dimensional picture, the similarity between this picture and the sample description information can be calculated with a CLIP model (an image-text matching model), and this similarity can be used as the criterion for judging that model training is complete.
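For illustration, the CLIP-based stopping criterion could be checked as sketched below; the specific CLIP checkpoint and the similarity threshold are assumptions, since the patent names neither:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the patent does not name a specific CLIP variant.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(rendered: Image.Image, description: str) -> float:
    """Cosine similarity between a rendered view and its sample description."""
    inputs = processor(text=[description], images=rendered,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

# Training could stop once the similarity exceeds a chosen threshold,
# e.g. clip_similarity(render, caption) > 0.3 (the threshold is an assumption).
```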
After the noised sample view angle interval map $S_2$ is obtained, the text-to-image model parameter gradient information can be calculated according to the following formula:

$$\nabla_\theta \mathcal{L} = \nabla_\theta\, \mathbb{E}_{t,\epsilon}\Big[\big\lVert Z_\theta(S_2;\, y,\, t) - \epsilon \big\rVert^2\Big], \qquad S_2 = \sqrt{\bar\alpha_t}\,S_1 + \sqrt{1-\bar\alpha_t}\,\epsilon$$

where $\nabla_\theta \mathcal{L}$ represents the text-to-image model parameter gradient information; $\epsilon$ represents the sampled noise value; $Z_\theta$ represents the second text-to-image model; $\bar\alpha_t$ represents the noise accumulation value corresponding to the $t$-th sampling; and $y$ represents the sample description information.
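A minimal sketch of how this gradient information could be computed in practice is given below, assuming a generic noise-prediction interface `z_theta(s2, t, y_emb)`; the call signature is an assumption, as the patent does not specify one:

```python
import torch
import torch.nn.functional as F

def ttoi_grad_step(z_theta, s2, t, y_emb, eps):
    """Compute the text-to-image model parameter gradient information.

    z_theta : the noise-prediction network Z_theta (a torch.nn.Module)
    s2      : the noised sample view angle interval map S2, shape (B, C, H, W)
    t       : the sampled timesteps, shape (B,)
    y_emb   : the encoded sample description information y
    eps     : the noise that was added when constructing S2
    """
    pred = z_theta(s2, t, y_emb)     # predicted noise Z_theta(S2; y, t)
    loss = F.mse_loss(pred, eps)     # || Z_theta(S2; y, t) - eps ||^2
    loss.backward()                  # gradients w.r.t. theta = the update info
    return loss
```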
The initial model parameters corresponding to the first text-to-image model are updated according to the text-to-image model parameter gradient information, thereby obtaining the first text-to-image model, i.e., the Stable Diffusion text-to-image model $Z_\pi$. It should be understood that a text-to-image model is essentially a neural network model, which depends on its model parameters for model expression; the initial model parameters can also be understood as the open-source pre-trained model parameters. Compared with the second text-to-image model, the first text-to-image model has the capability of generating a view angle interval output map of a three-dimensional model according to the sample description information of the three-dimensional model, and therefore has higher sensitivity to view angle information.
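Continuing the sketch above, the parameter update itself could look like the following; `batches` stands for a hypothetical iterator over prepared training batches, and the AdamW optimizer and learning rate are assumptions:

```python
import torch

# Z_pi starts from the pre-trained parameters of Z_theta and is fine-tuned;
# `z_theta` and `ttoi_grad_step` come from the sketch above.
z_pi = z_theta  # fine-tuning updates the loaded pre-trained weights in place
optimizer = torch.optim.AdamW(z_pi.parameters(), lr=1e-5)  # lr is an assumption

for step in range(200_000):               # the preset number of training steps
    s2, t, y_emb, eps = next(batches)     # batch of 256 noised joint maps
    optimizer.zero_grad()
    ttoi_grad_step(z_pi, s2, t, y_emb, eps)  # populates the gradient information
    optimizer.step()                      # applies the update direction
```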
After training of the open-source Stable Diffusion text-to-image model $Z_\theta$ is completed, the first text-to-image model, the Stable Diffusion text-to-image model $Z_\pi$, is obtained. Taking the sample description information as input, the picture output by the model contains content conforming to the sample description information, and the picture is also a view angle interval output map corresponding to different surround viewing angles of the three-dimensional model at the same pitch angle.
Through the above steps of this embodiment, the micro-renderable parameter $\psi$ corresponding to the target three-dimensional model is determined according to the text description information $y$, the view angle interval map, and the multi-view image group.
S203, generating the target three-dimensional model according to the micro-renderable parameters corresponding to the target three-dimensional model.
The target three-dimensional model is a three-dimensional model generated corresponding to the text description information.
In this embodiment, the generating the target three-dimensional model according to the micro-renderable parameters includes the following steps S203-1 to S203-2:
S203-1, obtaining a rendering representation mode corresponding to the target three-dimensional model;
S203-2, generating the target three-dimensional model according to the micro-renderable parameters corresponding to the target three-dimensional model and the rendering representation mode corresponding to the target three-dimensional model.
In this embodiment, the rendering representation corresponding to the target three-dimensional model is as described above: a rendering representation is used for model expression, for example the NeuS volume rendering representation. Likewise, the rendering representation corresponding to the target three-dimensional model is a representation that depends on the micro-renderable parameters; for example, the NeuS volume rendering representation of the target three-dimensional model is denoted $g(\psi)$. After the micro-renderable parameter $\psi$ corresponding to the target three-dimensional model is acquired, the target three-dimensional model can be rendered and generated according to the rendering representation $g(\psi)$ of the model. It can also be understood that, after the optimization steps guided by the first text-to-image model (the Stable Diffusion text-to-image model $Z_\pi$) and the second text-to-image model (the original open-source Stable Diffusion text-to-image model $Z_\theta$) are completed, the NeuS volume rendering representation contains the three-dimensional model representation corresponding to the text description $y$.
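As a rough illustration of this final generation step (not the patent's implementation), a NeuS-style signed-distance representation parameterized by $\psi$ can be turned into explicit geometry by sampling it on a grid and extracting the zero level set; `sdf_net` below is a hypothetical stand-in for the network behind $g(\psi)$:

```python
import torch
from skimage import measure

def extract_mesh(sdf_net, resolution=128, bound=1.0):
    """Sketch of turning the learned representation g(psi) into geometry.

    `sdf_net` stands in for the NeuS-style network whose weights are the
    micro-renderable parameters psi; it maps 3D points to signed distances.
    """
    xs = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1)
    with torch.no_grad():
        sdf = sdf_net(grid.reshape(-1, 3)).reshape(
            resolution, resolution, resolution)
    # The zero level set of the signed distance field is the model surface.
    verts, faces, _, _ = measure.marching_cubes(sdf.numpy(), level=0.0)
    return verts, faces
```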
According to the three-dimensional model generation method provided above, since the multi-view image group includes view images corresponding to different surround viewing angles of the initial three-dimensional model at the same pitch angle, using the multi-view image group as input data in the three-dimensional model generation process improves the sensitivity of the generation process to view angle information, compared with using ordinary single images as input. Further, when the micro-renderable parameters corresponding to the target three-dimensional model are determined, the accuracy of the micro-renderable parameters is improved because they are determined jointly by the text description information and the multi-view image group, thereby ensuring that the target three-dimensional model generated from these parameters has high accuracy and matches the text description information well.
Corresponding to the above embodiment, a second embodiment of the present application provides a training method for a text-to-image model. Fig. 8 is a flowchart of the training method of the text-to-image model provided in this embodiment, and the method is described below with reference to fig. 8. For content that is the same as in the first embodiment, please refer to the first embodiment; the repeated description is omitted here.
The embodiments referred to in the following description are intended to illustrate the method principles and not to limit the practical use.
As shown in fig. 8, the training method of the text-to-graph model provided in this embodiment includes the following steps S801 to S803:
S801, acquiring a sample view angle interval map and sample description information for model training; the sample view angle interval map shows view images corresponding to different surround viewing angles of a three-dimensional model at the same pitch angle; the sample description information includes information for describing the appearance of the three-dimensional model under different viewing angles;
S802, obtaining text-to-image model parameter gradient information according to the sample view angle interval map and the sample description information based on a second text-to-image model; the text-to-image model parameter gradient information is used for representing the update direction of the parameters of the second text-to-image model;
S803, obtaining a first text-to-image model corresponding to the second text-to-image model according to the text-to-image model parameter gradient information.
Optionally, the obtaining the first text-to-image model according to the text-to-image model parameter gradient information includes: acquiring initial model parameters corresponding to the first text-to-image model; and updating the initial model parameters corresponding to the first text-to-image model according to the text-to-image model parameter gradient information to obtain the first text-to-image model.
For details of this embodiment, reference may be made to the description of the first embodiment, and details thereof are not repeated here.
According to the training method for the text-to-image model provided in this embodiment, the trained first text-to-image model, compared with the second text-to-image model, has the capability of generating a view angle interval output map of a three-dimensional model according to the sample description information of the three-dimensional model, so that the first text-to-image model is more sensitive to view angle information.
The first embodiment provides a method for generating a three-dimensional model; correspondingly, an embodiment of the present application further provides an apparatus for generating a three-dimensional model. Since the apparatus embodiment is substantially similar to the method embodiment, the description is relatively brief; for details of the relevant technical features, refer to the corresponding description of the method embodiment provided above. The following description of the apparatus embodiment is merely illustrative. As shown in fig. 9, the unit block diagram of the three-dimensional model generation apparatus provided in this embodiment includes:
an obtaining unit 901, configured to obtain a multi-view image group corresponding to an initial three-dimensional model and text description information corresponding to a target three-dimensional model to be generated, where the multi-view image group includes view images corresponding to different surrounding view angles of the initial three-dimensional model under the same pitch angle;
a processing unit 902, configured to determine micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group;
the generating unit 903 is configured to generate the target three-dimensional model according to the micro-renderable parameters corresponding to the target three-dimensional model.
The second embodiment provides a training method for a text-to-image model; correspondingly, an embodiment of the present application further provides a training apparatus for a text-to-image model. Since the apparatus embodiment is substantially similar to the method embodiment, the description is relatively brief; for details of the relevant technical features, refer to the corresponding description of the method embodiment provided above. The following description of the apparatus embodiment is merely illustrative. As shown in fig. 10, the unit block diagram of the training apparatus 1000 for a text-to-image model provided in this embodiment includes:
an acquisition unit 1001, configured to acquire a sample view angle interval map and sample description information for model training; the sample view angle interval map shows view images corresponding to different surround viewing angles of a three-dimensional model at the same pitch angle; the sample description information includes information for describing the appearance of the three-dimensional model under different viewing angles;
a processing unit 1002, configured to obtain, based on a second text-to-image model, text-to-image model parameter gradient information according to the sample view angle interval map and the sample description information; the text-to-image model parameter gradient information is used for representing the update direction of the parameters of the second text-to-image model;
and a generating unit 1003, configured to obtain a first text-to-image model corresponding to the second text-to-image model according to the text-to-image model parameter gradient information.
The foregoing embodiments provide apparatus embodiments corresponding to the respective method embodiments. In addition, the embodiments of the present application further provide an electronic device. Since the electronic device embodiment is substantially similar to the method embodiments, the description is relatively brief; for details of the relevant technical features, refer to the corresponding descriptions of the foregoing method embodiments. The following description of the electronic device embodiment is merely illustrative. Fig. 11 is a schematic diagram of the electronic device provided in this embodiment.
In this embodiment, an optional hardware structure of the electronic device 1100 may be as shown in fig. 11, including: at least one processor 1101, at least one memory 1102, and at least one communication bus 1105; the memory 1102 includes a program 1103 and data 1104.
Bus 1105 may be a communication device that transfers data between components within electronic device 1100, such as an internal bus (e.g., a bus between the CPU (central processing unit) and the memory) or an external bus (e.g., a universal serial bus port, a peripheral component interconnect express port), etc.
In addition, the electronic device further includes: at least one network interface 1106 and at least one peripheral interface 1107. The network interface 1106 provides wired or wireless communication with an external network 1108 (e.g., the Internet, an intranet, a local area network, a mobile communication network, etc.). In some embodiments, the network interface 1106 may include any number of network interface controllers (NICs), radio frequency (RF) modules, transponders, transceivers, modems, routers, gateways, wired network adapters, wireless network adapters, Bluetooth adapters, infrared adapters, near field communication (NFC) adapters, cellular network chips, and the like, in any combination.
The peripheral interface 1107 is used to connect with peripherals, shown in the figure as peripheral 1 (1109 in fig. 11), peripheral 2 (1110 in fig. 11), and peripheral 3 (1111 in fig. 11). Peripherals, i.e., peripheral devices, may include, but are not limited to, cursor control devices (e.g., a mouse, touchpad, or touchscreen), keyboards, displays (e.g., a cathode ray tube display, a liquid crystal display, or a light emitting diode display), video input devices (e.g., a video camera or an input interface communicatively coupled to a video archive), etc.
The processor 1101 may be a CPU, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The memory 1102 may include high-speed RAM (Random Access Memory), and may also include non-volatile memory, such as at least one disk memory.
The processor 1101 calls programs and data stored in the memory 1102 to execute the method of the first embodiment or the second embodiment of the present application.
Corresponding to the methods of the first and second embodiments of the present application, the embodiments of the present application also provide a computer storage medium storing a computer program to be executed by a processor to perform the methods of the first or second embodiments of the present application.
While the preferred embodiment has been described, it is not intended to limit the invention thereto, and any person skilled in the art may make variations and modifications without departing from the spirit and scope of the invention, so that the scope of the invention shall be defined by the claims.
Embodiments of the present application may involve the use of user data. In practical applications, user-specific personal data may be used in the schemes described herein within the scope permitted by the applicable laws and regulations of the country where the scheme is implemented, and under conditions that meet those requirements (for example, with the explicit consent of the user, after practical notification to the user, etc.). The above embodiments provide a three-dimensional model generation method, an apparatus corresponding to the method, and an electronic device; in addition, the embodiments of the present application further provide a computer-readable storage medium for implementing the three-dimensional model generation method. The description of the computer-readable storage medium embodiment is relatively brief; for details, refer to the corresponding descriptions of the above method embodiments. The embodiment described below is merely illustrative.
The computer readable storage medium provided in this embodiment stores computer instructions that, when executed by a processor, implement the steps of the method embodiments described above.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
While the invention has been described in terms of preferred embodiments, it is not intended to be limiting, but rather, it will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties; the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and a corresponding operation entry is provided for the user to choose to authorize or refuse.

Claims (14)

1. A method for generating a three-dimensional model, comprising:
acquiring a multi-view image group corresponding to an initial three-dimensional model and text description information corresponding to a target three-dimensional model to be generated, wherein the multi-view image group comprises view images corresponding to different surrounding view angles of the initial three-dimensional model under the same pitch angle;
determining micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group;
and generating the target three-dimensional model according to the micro-renderable parameters corresponding to the target three-dimensional model.
2. The method of generating a three-dimensional model according to claim 1, further comprising: obtaining a view angle interval map corresponding to the initial three-dimensional model, wherein the view angle interval map shows view images corresponding to different surround viewing angles of the initial three-dimensional model at the same pitch angle.
3. The method for generating a three-dimensional model according to claim 2, wherein the view angle interval map corresponding to the initial three-dimensional model comprises: a noise map obtained by performing first noise adding processing according to the view images; the first noise adding processing is to sequentially arrange and combine the view images into a view angle joint map and then perform noise adding processing according to a sampling noise value;
the multi-view image group corresponding to the initial three-dimensional model comprises: a noise image group obtained by performing second noise adding processing according to the view images; the second noise adding processing is to perform noise adding processing on the view images according to the sampling noise value.
4. The method of generating a three-dimensional model according to claim 1, wherein the text description information includes view angle description information; the view angle description information is used for describing the appearance of the target three-dimensional model under different viewing angles.
5. The method for generating a three-dimensional model according to claim 2, wherein the determining the micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group includes:
calculating a first consistency result of the view angle interval map and the text description information based on a first text-to-image model; the first text-to-image model is used for generating a view angle interval output map of a three-dimensional model according to text description information of the three-dimensional model;
calculating a second consistency result of the multi-view image group and the text description information based on a second text-to-image model; the second text-to-image model is used for generating output maps of each viewing angle of a three-dimensional model according to text description information of the three-dimensional model;
and determining the micro-renderable parameters corresponding to the target three-dimensional model according to the first consistency result and the second consistency result.
6. The method of generating a three-dimensional model according to claim 5, wherein the first text-to-image model is trained by:
acquiring a sample view angle interval map and sample description information for model training; the sample view angle interval map shows view images corresponding to different surround viewing angles of a three-dimensional model at the same pitch angle; the sample description information includes information for describing the appearance of the three-dimensional model under different viewing angles;
based on the second text-to-image model, acquiring text-to-image model parameter gradient information according to the sample view angle interval map and the sample description information; the text-to-image model parameter gradient information is used for representing the update direction of the parameters of the second text-to-image model;
and obtaining the first text-to-image model according to the text-to-image model parameter gradient information.
7. The method for generating a three-dimensional model according to claim 6, wherein the obtaining the first text-to-image model according to the text-to-image model parameter gradient information comprises:
acquiring initial model parameters corresponding to the first text-to-image model;
and updating the initial model parameters corresponding to the first text-to-image model according to the text-to-image model parameter gradient information to obtain the first text-to-image model.
8. The method of generating a three-dimensional model according to claim 5, wherein said calculating a first consistency result of the view angle interval map and the text description information comprises:
Carrying out noise prediction on the view angle interval diagram according to the text description information to obtain a first noise prediction result, and determining a first consistency result of the view angle interval diagram and the text description information according to the first noise prediction result;
the calculating a second consistency result of the multi-view image group and the text description information comprises:
and carrying out noise prediction on the multi-view image group according to the text description information to obtain a second noise prediction result, and determining a second consistency result of the multi-view image group and the text description information according to the second noise prediction result.
9. The method of generating a three-dimensional model according to claim 5, wherein the determining the micro-renderable parameters corresponding to the target three-dimensional model according to the first consistency result and the second consistency result comprises:
acquiring a preset sampling noise value and micro-renderable parameters corresponding to the initial three-dimensional model;
determining micro-renderable parameters corresponding to the target three-dimensional model according to the sampling noise value, the first noise prediction result, the second noise prediction result and the micro-renderable parameters corresponding to the initial three-dimensional model; the first noise prediction result is used for determining the first consistency result, and the second noise prediction result is used for determining the second consistency result.
10. The method of generating a three-dimensional model according to claim 1, wherein the generating the target three-dimensional model from the micro-renderable parameters includes:
acquiring a rendering representation mode corresponding to the target three-dimensional model;
and generating the target three-dimensional model according to the micro-renderable parameters corresponding to the target three-dimensional model and the rendering representation mode corresponding to the target three-dimensional model.
11. The method according to claim 9, wherein the rendering representation corresponding to the target three-dimensional model is the same as the rendering representation corresponding to the initial three-dimensional model.
12. A three-dimensional model generation device, comprising:
the acquisition unit is configured to acquire a multi-view image group corresponding to an initial three-dimensional model and text description information corresponding to a target three-dimensional model to be generated, wherein the multi-view image group comprises view images corresponding to different surrounding view angles of the initial three-dimensional model under the same pitch angle;
the processing unit is configured to determine micro-renderable parameters corresponding to the target three-dimensional model according to the text description information and the multi-view image group;
And the generating unit is configured to generate the target three-dimensional model according to the micro-renderable parameters corresponding to the target three-dimensional model.
13. An electronic device comprising a processor and a memory; wherein,
the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of any of claims 1-11.
14. A computer readable storage medium having stored thereon one or more computer instructions executable by a processor to implement the method of any of claims 1-11.
CN202311177612.3A 2023-09-12 2023-09-12 Three-dimensional model generation method and device and electronic equipment Pending CN117372607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311177612.3A CN117372607A (en) 2023-09-12 2023-09-12 Three-dimensional model generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311177612.3A CN117372607A (en) 2023-09-12 2023-09-12 Three-dimensional model generation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117372607A true CN117372607A (en) 2024-01-09

Family

ID=89399188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311177612.3A Pending CN117372607A (en) 2023-09-12 2023-09-12 Three-dimensional model generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117372607A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118298127A (en) * 2024-06-03 2024-07-05 淘宝(中国)软件有限公司 Three-dimensional model reconstruction and image generation method, device, storage medium and program product



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination