CN116228959A - Object generation method, device and system

Object generation method, device and system

Info

Publication number
CN116228959A
CN116228959A
Authority
CN
China
Prior art keywords
dimensional
text
pictures
view
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211515098.5A
Other languages
Chinese (zh)
Inventor
卢冠松
徐航
韩建华
张维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202211515098.5A priority Critical patent/CN116228959A/en
Publication of CN116228959A publication Critical patent/CN116228959A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/08 Projecting images onto non-planar surfaces, e.g. geodetic screens
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval of still image data
    • G06F 16/53 Querying
    • G06F 16/538 Presentation of query results
    • G06F 16/55 Clustering; Classification
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval using metadata automatically derived from the content
    • G06F 16/5838 Retrieval using metadata automatically derived from the content, using colour
    • G06F 16/5846 Retrieval using metadata automatically derived from the content, using extracted text
    • G06F 16/5854 Retrieval using metadata automatically derived from the content, using shape and object relationship
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Geometry (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Generation (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides an object generation method, which comprises: inputting a text into a two-dimensional picture generation model, and outputting two-dimensional pictures of an object at multiple view angles; the text is used to describe characteristics of the object, including the object's category, color, and shape; the two-dimensional picture generation model is used for generating two-dimensional pictures of a plurality of view angles according to the text; a view angle is the spatial angle at which the object is presented; calculating similarity values between the multi-view two-dimensional pictures and the text, and enhancing the multi-view two-dimensional pictures according to the similarity values; inputting the enhanced multi-view two-dimensional pictures into a three-dimensional object generation model, which renders two-dimensional pictures at other angles based on the enhanced multi-view pictures and outputs a three-dimensional object conforming to the text description. According to the method and the device, the two-dimensional picture generation model generates two-dimensional pictures of the corresponding object according to the text, and the three-dimensional object generation model then generates the corresponding 3D object according to those pictures, so that the quality of the generated 3D object can be improved and the generation speed of the 3D object can be increased.

Description

Object generation method, device and system
Technical Field
The present application relates to the field of artificial intelligence (AI), and in particular, to a method, apparatus, and system for object generation.
Background
The goal of the text-guided three-dimensional object generation (text-guided 3D object generation) task is: given a piece of text describing features of an object, such as its category, color, and shape, generate a 3D object that conforms to the description. As shown in the example of FIG. 1, the text "charcoal leather loveseat" is on the left, and on the right is the generated 3D object together with rendered 2D pictures.
Text-guided three-dimensional object generation techniques may be used for: (1) data enhancement: generating training data for scenes such as autonomous driving; (2) entertainment and content creation: a user generates the corresponding 3D object from text, which improves user experience and creation efficiency.
The 3D objects generated by the prior art are not realistic enough and lack real textures; in addition, each text requires a separate optimization, which consumes large resources and takes a long time, so the 3D object generation speed is very slow.
How to improve the quality of the generated 3D object and speed up the generation of the 3D object is a problem to be solved.
Disclosure of Invention
The embodiment of the application provides a method, a device, a system and a computing device for generating an object, which can improve the quality of the generated 3D object and accelerate the generation speed of the 3D object.
In a first aspect, an embodiment of the present application provides a method for generating an object, including: inputting a text into a two-dimensional picture generation model, and outputting two-dimensional pictures of an object at multiple view angles; the text is used to describe characteristics of the object, including the object's category, color, and shape; the two-dimensional picture generation model is used for generating two-dimensional pictures of a plurality of view angles according to the text; a view angle is the spatial angle at which the object is presented; calculating similarity values between the multi-view two-dimensional pictures and the text, and enhancing the multi-view two-dimensional pictures according to the similarity values; inputting the enhanced multi-view two-dimensional pictures into a three-dimensional object generation model, which renders two-dimensional pictures at other angles based on the enhanced multi-view pictures and outputs a three-dimensional object conforming to the text description. In this way, the two-dimensional picture generation model generates two-dimensional pictures of the corresponding object at a plurality of view angles from the text, and the three-dimensional object generation model generates the corresponding 3D object from those multi-view pictures, so that the quality of the generated 3D object can be improved and the generation speed of the 3D object can be increased.
In some implementations, the two-dimensional picture generation model includes a text tokenizer, a Transformer structure, and a first image tokenizer. The text is input into the text tokenizer and segmented to output a first set of tokens; the first set of tokens indicates the category, color, and shape characteristics of the object, and contains a plurality of tokens. The first set of tokens is input into the Transformer structure and decoded to obtain a second set of tokens; the number of tokens in the second set is n. The second set of tokens is input into the first image tokenizer, which outputs the two-dimensional pictures of the object at multiple view angles. In this way, two-dimensional pictures of multiple view angles of the corresponding object can be generated from the text through the text tokenizer, the Transformer structure, and the first image tokenizer, thereby improving the quality of the generated 3D object and increasing the generation speed of the 3D object.
In some implementations, the Transformer structure uses an autoregressive generation method to generate the n-th token of the second set of tokens with the first n-1 tokens of the second set as input. This improves the quality of the multi-view two-dimensional pictures.
In some implementations, the input of the two-dimensional picture generation model further includes camera parameters that indicate the view angles at which the two-dimensional pictures are generated; the camera parameters are mapped to a corresponding third set of tokens, whose number of tokens is k; the third set of tokens is input in order into the Transformer structure for decoding to obtain the second set of tokens. By controlling the generation of multi-view two-dimensional pictures of the corresponding object with camera parameters and text, the quality of the generated 3D object can be improved and the generation speed of the 3D object can be increased.
In some implementations, the view angles indicated by the camera parameters further include a preceding view angle, and the two-dimensional picture generation model further includes a second image tokenizer; the two-dimensional picture of the preceding view angle is obtained and converted into a fourth set of tokens by the second image tokenizer; the number of tokens in the fourth set is plural; the fourth set of tokens is input into the Transformer structure for decoding to obtain the second set of tokens. In this way, guidance from the preceding view angle can improve consistency among the different view angles generated from the same text.
In some implementations, m two-dimensional pictures are generated for each view angle among the multi-view two-dimensional pictures; the semantic similarity between the m two-dimensional pictures of each view angle and the text is calculated to obtain m similarity values; the m similarity values are sorted, and the s two-dimensional pictures whose similarity meets the threshold requirement are determined as the enhanced two-dimensional pictures of that view angle, where m > s and m and s are natural numbers. In this way, the semantic consistency between the generated view pictures and the text can be enhanced, improving the generation quality of the 3D object.
In some implementations, the text is input into a text encoder to output a first feature vector; the m two-dimensional pictures of each view angle are input into an image encoder to output m second feature vectors; and the inner products between the first feature vector and the m second feature vectors are calculated to obtain the similarity values between the two-dimensional pictures of each view angle and the text.
In some implementations, a view angle outside the plurality of view angles is a second view angle, and the three-dimensional object generation model is a pixelNeRF network. The enhanced multi-view two-dimensional pictures are input into the pixelNeRF network; for a query point x along the target ray d of each view angle, corresponding image features are extracted from the enhanced two-dimensional pictures by projection and interpolation; each image feature is input into the NeRF network together with the spatial coordinates, and volume rendering is performed on the output RGB and density values to obtain a NeRF model of the object. The NeRF model of the object is an implicit model. Images of view angles other than the plurality of view angles are rendered based on the NeRF model of the object to obtain a 3D object model conforming to the text description; the 3D object model is a mesh model of the object. In this way, NeRF can be used as the 3D object representation, which improves the generation quality of the 3D object, and no time-consuming optimization is needed in the inference stage, thereby improving the generation speed of the 3D object.
In some implementations, the training step of the two-dimensional picture generation model includes: inputting the text and the training-set ground-truth (GT) pictures corresponding to the text into the two-dimensional picture generation model; the training-set pictures may be captured by a device or rendered from a 3D model; the loss function is optimized so that the similarity between the generated multi-view two-dimensional pictures and the training-set GT pictures converges, obtaining a trained two-dimensional picture generation model. In this way, the quality of the 2D pictures generated by the two-dimensional picture generation model can be improved by training with an optimized loss function.
In some implementations, one view angle of the camera parameters and the preceding-view picture of that view angle are input into the two-dimensional picture generation model; the loss function is optimized so that the similarity between the generated multi-view two-dimensional pictures and the multi-view training-set pictures corresponding to the text converges, obtaining a trained two-dimensional picture generation model. In this way, training with preceding-view guidance and an optimized loss function improves consistency among the different view angles generated from the same text, optimizes the two-dimensional picture generation model, and improves the quality of the generated 2D pictures.
In some implementations, the semantic similarity between the multi-view two-dimensional pictures and the text is calculated, and when the semantic similarity converges, an optimized semantic loss function L is obtained:
L = -I · T
where I is the feature vector of the two-dimensional picture and T is the feature vector of the text. In this way, training with an optimized semantic loss function enhances the semantic consistency between the generated view pictures and the text, yields an optimized two-dimensional picture generation model, and improves the quality of the generated 2D pictures.
In some implementations, the input of the two-dimensional picture generation model further includes a current-view picture and its preceding-view picture, and the two-dimensional picture generation model further includes an image tokenizer; a cross-entropy loss function is calculated over the plurality of tokens output by decoding with the Transformer structure, so that the Transformer model is trained at the token level. In this way, the two-dimensional picture generation model can be optimized and the quality of the 2D pictures improved.
In some implementations, on top of the cross-entropy loss function, an L1 loss function is calculated at the pixel level between the output second 2D pictures and the training-set GT pictures, resulting in an optimized detail loss function. This optimizes the two-dimensional picture generation model, enhances the details of the multi-view 2D pictures, and yields better 3D object generation quality.
In some implementations, a view-contrast loss function is optimized, pulling closer the 2D pictures of different view angles generated by the same text and pushing apart the 2D pictures of different view angles generated by different texts. In this way, consistency among the different view angles generated by the same text can be improved, and the 3D object generation quality is better.
In some implementations, the view-contrast loss function is L_contrastive:
L_contrastive = -log( exp(sim(f_enc(x_i), f_enc(x_j))/τ) / ( exp(sim(f_enc(x_i), f_enc(x_j))/τ) + Σ_k exp(sim(f_enc(x_i), f_enc(x_k⁻))/τ) ) )
where x_i and x_j are 2D pictures of different view angles generated by the same text, and x_k⁻ is a 2D picture of a different view angle generated by a different text; sim(·,·) is a similarity function that obtains the similarity as the inner product of the feature vectors of the two pictures; f_enc(·) is a feature extraction function for extracting the feature vectors of the pictures; τ is a temperature coefficient: the smaller the value of τ, the larger the distance between x_i and x_k⁻, and the larger the value of τ, the smaller the distance between x_i and x_k⁻; exp() is used so that, when τ is smallest, one term approaches 1 and the others approach 0. In this way, consistency among the different view angles generated by the same text can be improved, and the 3D object generation quality is better.
In a second aspect, an embodiment of the present application provides an apparatus for object generation, configured to perform the method of any one of the first aspect, at least including: a two-dimensional picture generation model, configured to take text as input and output two-dimensional pictures of an object at multiple view angles, where the text is used to describe characteristics of the object, including its category, color, and shape, the model generates the multi-view two-dimensional pictures according to the text, and a view angle is the spatial angle at which the object is presented; an enhancement module, configured to increase the similarity between the multi-view two-dimensional pictures and the text to obtain enhanced multi-view two-dimensional pictures; and a three-dimensional object generation model, configured to take the enhanced multi-view two-dimensional pictures as input, render two-dimensional pictures at other angles based on them, and output a three-dimensional object conforming to the text description.
In a third aspect, embodiments of the present application provide a system for object generation, the system comprising the apparatus for object generation of the second aspect, wherein the apparatus is configured to perform the method according to any one of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer device, comprising: at least one memory for storing a program; at least one processor for executing a memory-stored program, the processor being adapted to perform the method of any one of the first aspects when the memory-stored program is executed.
In a fifth aspect, embodiments of the present application provide a computer storage medium having instructions stored therein, which when run on a computer, cause the computer to perform a method according to any one of the first aspects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in the present specification, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only examples of the embodiments disclosed in the present specification, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
The drawings that accompany the detailed description can be briefly described as follows.
FIG. 1 is a flow chart of a three-dimensional object generation method provided in the background art;
FIG. 2a is a schematic diagram of the CLIP-Forge technical solution of the first prior-art scheme;
FIG. 2b is a schematic diagram of the texture of the first prior-art scheme;
FIG. 3 is a schematic diagram of the DreamFields technical solution of the second prior-art scheme;
FIG. 4 is a flow chart of a method for object generation according to an embodiment of the present application;
fig. 5 is a schematic diagram of a two-dimensional image generation model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of semantic enhancement in a method for object generation according to an embodiment of the present application;
FIG. 7 is a flowchart of training a two-dimensional image generation model in the method for generating an object according to the embodiment of the present application;
FIG. 8 is a schematic view of contrast of view angles in a method of object generation according to an embodiment of the present application;
FIG. 9 is a schematic diagram comparing the performance of the method for generating an object according to the embodiment of the present application with that of the conventional algorithm;
fig. 10 is a schematic view of a visual effect of a method for generating an object to control and generate a category of a 3D object according to an embodiment of the present application;
FIG. 11 is a schematic view of a visual effect of controlling the color of a 3D object generated by the object generating method according to the embodiment of the present application;
FIG. 12 is a schematic diagram of a method for generating objects to control the shape of a generated 3D object according to an embodiment of the present application;
FIG. 13 is a schematic diagram of an apparatus for object generation provided in an embodiment of the present application;
FIG. 14 is a schematic diagram of an object generation system according to an embodiment of the present application;
FIG. 15 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 16 is a schematic diagram of a computing device cluster according to an embodiment of the present disclosure;
fig. 17 is a schematic diagram of one possible implementation of a computing device cluster provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be described below with reference to the accompanying drawings.
In the description of embodiments of the present application, words such as "exemplary," "such as" or "for example," are used to indicate by way of example, illustration, or description. Any embodiment or design described herein as "exemplary," "such as" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary," "such as" or "for example," etc., is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, the term "and/or" is merely an association relationship describing an association object, and indicates that three relationships may exist, for example, a and/or B may indicate: a alone, B alone, and both A and B. In addition, unless otherwise indicated, the term "plurality" means two or more. For example, a plurality of systems means two or more systems, and a plurality of terminals means two or more terminals.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating an indicated technical feature. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In the description of the embodiments of the present application, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" may be the same subset or a different subset of all possible embodiments and may be combined with each other without conflict.
In the description of the embodiments of the present application, the terms "first\second\third, etc." or module a, module B, module C, etc. are used merely to distinguish similar objects and do not represent a particular ordering for the objects, it being understood that particular orders or precedence may be interchanged as allowed so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
In the description of the embodiment of the present application, reference numerals indicating steps, such as S110, S120, … …, etc., do not necessarily indicate that the steps are performed in this order, and the order of the steps may be interchanged or performed simultaneously as allowed.
Token: a token in the sense of lexical analysis. A tokenizer is, in computer science, a component that converts a sequence of characters into a sequence of tokens. The process of generating tokens from text is called tokenization; during this process the tokenizer also classifies the tokens.
pixelNeRF is a learning framework that predicts a neural radiance field (NeRF) representation from a single image or from multiple images. PixelNeRF can be trained on a set of multi-view 2D pictures, allowing it to generate plausible novel view synthesis from very few input images without test-time optimization.
pixelNeRF takes 2D pictures of known multiple view angles as input for neural rendering. Specifically, for a query point x along a target camera ray in the viewing direction d, a corresponding image feature is extracted from the feature volume W (the object feature map) by projection and interpolation; the feature is then input into the NeRF network together with the spatial coordinates, the output RGB and density values are volume rendered, and the result is compared with the target pixel values. The coordinates x and d are in the camera coordinate system of the input view.
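As an illustrative sketch of this conditioning step (not the actual pixelNeRF implementation; the function names, tensor shapes, and pinhole-camera projection here are assumptions introduced for illustration only), the following Python/PyTorch fragment shows how an image feature could be sampled for a query point x by projecting it into a source view and bilinearly interpolating the feature map W:

import torch
import torch.nn.functional as F

def sample_image_feature(feature_map, query_xyz, K, cam_T_world):
    """Project 3D query points into one source view and bilinearly interpolate
    its 2D feature map (a sketch of pixelNeRF-style conditioning).
    feature_map: [C, H, W] features W extracted from one source picture
    query_xyz:   [N, 3] query points x in world coordinates
    K:           [3, 3] camera intrinsics (assumed pinhole model)
    cam_T_world: [4, 4] world-to-camera extrinsics
    returns:     [N, C] per-point image features
    """
    N = query_xyz.shape[0]
    homo = torch.cat([query_xyz, torch.ones(N, 1, dtype=query_xyz.dtype)], dim=-1)  # [N, 4]
    cam = (cam_T_world @ homo.T).T[:, :3]                  # points in the camera frame
    uv = (K @ cam.T).T                                     # perspective projection
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)            # pixel coordinates
    H, W = feature_map.shape[1:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,        # normalize to [-1, 1]
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(1, N, 1, 2)
    feats = F.grid_sample(feature_map.unsqueeze(0), grid,  # bilinear interpolation
                          mode="bilinear", align_corners=True)
    return feats.squeeze(0).squeeze(-1).T                  # [N, C]

In pixelNeRF these per-point features are concatenated with the (positionally encoded) coordinates x and direction d before being passed to the NeRF network.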
As used herein, 2D pictures are short for two-dimensional pictures, and 3D objects are short for three-dimensional objects.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
FIG. 2a is a schematic diagram of the CLIP-Forge technical solution of the first prior-art scheme. As shown in FIG. 2a, the training phase mainly includes the following steps:
S11, training an autoencoder (auto-encoder) 11 to encode 3D objects (3D shapes) 12 into 3D embeddings (3D embedding) 13; the autoencoder 11 can also be used to decode 3D embeddings back into 3D objects;
S12, training a normalizing flow model (normalizing flow) 14 to generate 3D embeddings based on CLIP image embeddings (CLIP-image embedding) 16 of 2D pictures (images) 15 rendered from the 3D objects 12, where the picture embeddings are extracted using a CLIP model.
The reasoning stage mainly comprises the following steps:
S13, given a text, extracting a text embedding (CLIP-text embedding) using the CLIP model.
S14, generating a 3D embedding from the text embedding of S13 using the normalizing flow model. Since the CLIP model is trained to align picture embeddings with text embeddings, the text embedding is used in the inference phase in place of the picture embedding used in the training phase.
S15, decoding the 3D embedding using the autoencoder 11 to generate the 3D object.
As shown in FIG. 2b, in order to be able to encode and decode the 3D object 12 using the autoencoder 11, the first scheme employs an explicit 3D representation of the 3D object, shown as voxels, which results in a 3D object that is not realistic enough and lacks real textures.
FIG. 3 is a schematic diagram of the DreamFields technical solution of the second prior-art scheme. As shown in FIG. 3, a neural radiance field (Neural Radiance Fields, NeRF) model is employed as the 3D object representation. For each text, DreamFields runs an optimization to obtain a NeRF model. A CLIP model is used to compute the semantic similarity between the text and 2D pictures rendered by the NeRF model from different view angles as the loss function, while the sparsity of the NeRF model is also constrained. The experimental results are shown in the example on the right of FIG. 3.
The disadvantage of the second scheme is that a separate optimization is required for each text to obtain the final 3D object. In experiments with the publicly released code, each optimization took over 70 minutes when run on 8 TPUs, which makes the generation speed very slow.
Fig. 4 is a flowchart of a method for generating an object according to an embodiment of the present application. As shown in fig. 4, the following steps S21 to S24 are included.
S21, determining a text. The text describes characteristics of the object, including the class, color, and shape of the object. The text may be Chinese or English.
Illustratively, the text is determined to be chinese "charcoal colored leather twin sofas".
Illustratively, text is determined to be English "charcoal leather loveseat"
S22, inputting the text into the two-dimensional picture generation model, and outputting two-dimensional (2D) pictures of the object at multiple view angles; the text is used to describe the characteristics of the object, including its category, color, and shape. The view angle is the viewing angle of the object. The multi-view two-dimensional (2D) pictures may be denoted as first 2D pictures.
Illustratively, "charcoal leather loveseat" is input into the picture generation model, and the output is pictures of a sofa at a plurality of view angles, including front, back, side, top, bottom, rotated 10 degrees to the left, rotated 20 degrees to the right, etc.
In some embodiments, the 2D picture of each view in the 2D pictures of the plurality of views includes a plurality of 2D pictures of the view, the plurality of 2D pictures having randomness.
Illustratively, 20 2D pictures of each view of the sofa are output by the two-dimensional picture generation model.
In some embodiments, the input of the two-dimensional picture generation model further includes camera parameters, a current-view picture, and a preceding-view picture, where the current-view picture is used only to train the model. The camera parameters are used to control the multiple view angles at which two-dimensional pictures are generated.
In some embodiments, camera parameters may be set to control the view angles of the generated two-dimensional pictures, and the output order of the multi-view 2D pictures is determined according to the ordering of the multiple camera parameters.
For example, 9 view angles are set for generating two-dimensional pictures, and the 9 view angles are determined according to the order of 9 camera parameters. The two-dimensional picture generation model may sequentially generate 2D pictures of the 9 different view angles for each object described by the text. In the order of the 9 camera parameters, each view angle has a preceding view angle, i.e., the view angle before the current one. When the model generates the 2D picture of a certain view angle, the 2D picture of the preceding view angle is taken as input; its role is to improve the consistency of the 2D pictures across different view angles.
FIG. 5 is a schematic diagram of a two-dimensional picture generation model according to an embodiment of the present application. As shown in FIG. 5, the two-dimensional picture generation model includes a Transformer structure 51, a text Tokenizer 52, an image Tokenizer 53, and an image Tokenizer 54. The image Tokenizer 54 may be referred to as the first image tokenizer, the image Tokenizer 53 as the second image tokenizer, and the text Tokenizer 52 as the text tokenizer.
The text Tokenizer 52 converts the input text and outputs a text token token1; token1 indicates the category, color, and shape characteristics of the object. Token1 may be denoted as the first set of tokens. The number of tokens in the first set is plural.
The Transformer structure 51 is used to sequentially decode and generate token4 based on the input token1; token4 may be denoted as the second set of tokens. The number of tokens in the second set is plural, and the tokens in the second set are different from those in the first set.
The Transformer structure 51 uses an autoregressive generation method: taking the first n-1 tokens of token4 as input, it generates the n-th token of token4, where n is a natural number greater than 0.
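A minimal sketch of this autoregressive decoding loop is given below (the Transformer interface and token shapes are assumptions, not the structure 51 itself; it only illustrates that the n-th token of token4 is predicted from the conditioning tokens plus the first n-1 already generated tokens):

import torch

@torch.no_grad()
def autoregressive_decode(transformer, cond_tokens, num_image_tokens, temperature=1.0):
    """Generate image tokens one by one: the model predicts the n-th token
    conditioned on the text/camera/preceding-view tokens and the first n-1
    generated tokens. `transformer` is assumed to return next-token logits
    of shape [1, vocab_size]."""
    generated = torch.empty(1, 0, dtype=torch.long)
    for _ in range(num_image_tokens):
        logits = transformer(cond_tokens, generated)           # condition on the prefix
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample the n-th token
        generated = torch.cat([generated, next_token], dim=1)
    return generated                                            # token4: the n image tokens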
The image Tokenizer 54 takes token4 as input and outputs a two-dimensional picture at a given view angle.
The view angle immediately before a given view angle is referred to as the preceding view angle.
The image Tokenizer 53 takes the two-dimensional picture of the preceding view angle as input and outputs token3. Token3 may be denoted as the fourth set of tokens. The number of tokens in the fourth set is plural.
In some embodiments, step S22 includes the following steps S221 to S226.
S221, obtaining the text, and converting the text into the text token token1 through the text Tokenizer 52. Token1 is used to indicate the category, color, and shape characteristics of the object; the number of tokens in token1 is plural.
S222, obtaining the camera parameters; a plurality of camera parameters can be set, and each camera parameter is mapped to a corresponding parameter token token2 in a specified order. Token2 is denoted as the third set of tokens. The number of camera parameters is k, and the number of tokens in the third set is k.
For example, the number of camera parameters may be 9, with the 9 camera parameters set to control the 9 view angles of the generated two-dimensional pictures.
S223, obtaining the preceding-view picture, and converting the preceding-view picture into token3 through the image Tokenizer 53. Token3 may be denoted as the fourth set of tokens. The number of tokens in the fourth set is plural.
The 2D picture of the (i-1)-th view angle that has already been generated may be denoted as the preceding-view picture, and the 2D picture of the i-th view angle to be generated is denoted as the current-view picture, where 1 < i ≤ k and i is a natural number.
In some embodiments, when i is a natural number greater than 1, the preceding-view picture may be obtained and converted into the token3 of the preceding view through the image Tokenizer 53.
S224, inputting token1, token2, and token3 into the Transformer structure 51, and decoding to obtain token4; the number of tokens in token4 is n, where n is a natural number greater than 0.
In some embodiments, the Transformer structure 51 uses an autoregressive generation method, taking the first n-1 tokens in token4 as input to generate the n-th token in token4.
In some embodiments, when i = 1, token1 and token2 are input into the Transformer structure 51 for autoregressive decoding; during decoding, the first n-1 tokens in token4 are added as input, the n-th token in token4 is generated, and token4 is output.
In some embodiments, when i > 1, token1, token2, and token3 are input into the Transformer structure 51 for autoregressive decoding; during decoding, the first n-1 tokens in token4 are added as input, the n-th token in token4 is generated, and token4 is output.
S225, inputting token4 into the image Tokenizer 54, and converting it to obtain the 2D picture of the i-th view angle.
S226, when i ≤ k, repeatedly executing steps S221-S225 to obtain the 2D pictures of the k view angles in view-angle order.
Illustratively, when i = 1, there is no preceding-view picture, and the 2D picture generation model generates the 2D picture of the 1st view angle from the 1st camera parameter and the text.
Illustratively, when i = 2, the preceding-view picture is the 2D picture of the 1st view angle, and the 2D picture generation model generates the 2D picture of the 2nd view angle from the 2nd camera parameter, the preceding-view picture, and the text.
Illustratively, when i = 3, the preceding-view picture is the 2D picture of the 2nd view angle, and the 2D picture generation model generates the 2D picture of the 3rd view angle from the 3rd camera parameter, the preceding-view picture, and the text. And so on until i = k.
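The per-view generation loop just described can be sketched as follows (the model interface is an assumption introduced for illustration; the point is that each view angle is generated in camera-parameter order and the already generated picture is fed back as the preceding view):

def generate_multiview_pictures(model, text, camera_params):
    """Sketch of the view-by-view generation loop: the 2D picture of each view
    angle is generated in camera-parameter order, feeding the generated
    preceding-view picture back in as input (the `model` call signature is assumed)."""
    pictures = []
    preceding = None                      # i = 1: no preceding-view picture yet
    for cam in camera_params:             # k camera parameters in their set order
        picture = model(text=text, camera=cam, preceding_view=preceding)
        pictures.append(picture)
        preceding = picture               # becomes the preceding view for i + 1
    return pictures                       # 2D pictures of the k view angles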
In some embodiments, the generation of 2D pictures of text to multiple perspectives may employ an image generation model based on a convolutional neural network structure.
In addition to the image generation model based on the transducer structure and the image generation model based on the convolutional neural network structure used in the present embodiment, other image generation modules that are easily conceivable by those skilled in the art are within the scope of the present application.
For the same group of inputs, since randomness exists in the 2D picture generation model, the 2D pictures of each view angle include a plurality of random 2D pictures of that view angle; therefore, multiple different groups of multi-view 2D pictures can be obtained by inputting the same camera parameters and text multiple times.
S23, determining the semantic similarity between the multi-view 2D pictures and the text, and obtaining the enhanced multi-view 2D pictures according to the semantic-similarity values. An enhanced 2D picture is a 2D picture whose semantic similarity with the text meets the threshold requirement.
FIG. 6 is a schematic diagram of semantic enhancement in the object generation method according to an embodiment of the present application. As shown in FIG. 6, the semantic similarity between the text and the multiple random 2D pictures of each view angle among the multi-view 2D pictures may be calculated to obtain multiple similarity values for each view angle; these similarity values are sorted, and the 2D picture with the highest value is determined as the enhanced 2D picture of that view angle, thereby strengthening the semantic consistency between the multi-view 2D pictures and the text.
In some embodiments, there are m 2D pictures for each view angle among the multi-view 2D pictures; the semantic similarity between each 2D picture and the text is calculated, including the following steps S231-S232.
S231, calculating semantic similarity between m 2D pictures and texts of each view angle, and obtaining m similarity values.
In some embodiments, step S231 includes the following steps S2311-S2313.
S2311, the text is input to the text encoder 61 to output the first feature vector.
S2312, the m 2D pictures of each view are input to the image encoder 62 to output m second feature vectors.
Illustratively, the text is passed through the text encoder 61 to obtain a first feature F_t, and each 2D picture is passed through the image encoder 62 to obtain a second feature F_i.
S2313, calculating an inner product between the first feature vector and each second feature vector to obtain m semantic similarity values of each view angle.
Illustratively, the inner product s = F_t · F_i between the first feature F_t and the second feature F_i may be calculated to obtain the value of the semantic similarity between the text and the 2D picture.
In some embodiments, the cosine similarity between the first feature F_t and the second feature F_i may be calculated to obtain the semantic-similarity value between the text and the 2D picture.
S232, sorting the m semantic-similarity values, and determining the n 2D pictures whose similarity values meet the threshold requirement as the enhanced 2D pictures of each view angle, where m > n.
In some embodiments, the n 2D pictures with the highest semantic-similarity values may be selected as the enhanced 2D pictures according to the similarity ranking.
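A minimal sketch of this semantic-enhancement step is shown below (CLIP-style encoders and the optional normalization are assumptions; it ranks the m candidate pictures of one view angle by their inner-product similarity with the text feature and keeps the top n):

import torch

def select_enhanced_pictures(text_feat, image_feats, n):
    """Rank the m candidate 2D pictures of one view angle by similarity to the
    text and keep the top n (m > n).
    text_feat:   [D] first feature vector F_t
    image_feats: [m, D] second feature vectors F_i"""
    text_feat = text_feat / text_feat.norm()                              # optional normalization:
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)    # inner product becomes cosine similarity
    sims = image_feats @ text_feat                                        # [m] similarity values s = F_t . F_i
    top = torch.topk(sims, k=n).indices                                   # indices of the n best pictures
    return top, sims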
S24, inputting the enhanced multi-view 2D pictures into a three-dimensional (3D) object generation model, which renders based on the enhanced multi-view 2D pictures and outputs a 3D object conforming to the text description.
In some embodiments, the three-dimensional object generation model is a pixelNeRF network, and step S24 includes the steps of:
S241, taking the enhanced multi-view 2D pictures of the object as the input of the pixelNeRF network, and learning its internal parameters to output a NeRF model of the object.
The pixelNeRF network is used to perform neural rendering from 2D pictures of known multiple view angles as input. Its principle is that, for a query point x along a target camera ray in the viewing direction d, a corresponding image feature is extracted from the feature volume W (the object feature map) by projection and interpolation; the feature is then input into the neural radiance field (neural radiance fields, NeRF) network together with the spatial coordinates, the output RGB and density values are volume rendered, and the result is compared with the target pixel values. The coordinates x and d are in the camera coordinate system of the input view. Volume rendering is the process of integrating colors along the rays of a picture to form a complete visualized picture.
In some embodiments, the enhanced multi-view 2D pictures may be input into the pixelNeRF network; for the query point x along the target ray d of each view angle, the corresponding image features are extracted from the enhanced 2D picture of each view angle by projection and interpolation; each image feature is then input into the NeRF network together with the spatial coordinates, and the output RGB and density values are volume rendered to obtain the NeRF model of the object.
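For illustration, a minimal single-ray volume-rendering step consistent with the description above could look as follows (a sketch only, not the pixelNeRF code; the sampling scheme along the ray and the tensor shapes are assumptions):

import torch

def volume_render(rgb, sigma, t_vals):
    """NeRF-style volume rendering along one ray: integrates the per-sample
    colours weighted by density to produce the rendered pixel colour.
    rgb: [S, 3] colours, sigma: [S] densities, t_vals: [S] sample depths along the ray."""
    deltas = torch.cat([t_vals[1:] - t_vals[:-1],
                        torch.tensor([1e10])])                            # distance between samples
    alpha = 1.0 - torch.exp(-sigma * deltas)                              # opacity of each sample
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                               # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)                       # rendered pixel colour [3]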
The NeRF model of the object is a parameterized object model obtained through parameter-learning optimization in the NeRF network and is an implicit model.
S242, rendering images of view angles other than the plurality of view angles based on the NeRF model of the object to obtain the 3D object conforming to the text description.
In some embodiments, the 3D object is a mesh (mesh) model. Mesh (mesh) is a collection of points, normal vectors, faces, which defines the 3D shape of an object.
FIG. 7 is a flowchart of training the 2D picture generation model in the object generation method according to an embodiment of the present application. As shown in FIG. 7, in the training phase, the inputs of the 2D picture generation model are the input text t, the camera parameters P, the two-dimensional picture of the current view angle, and the two-dimensional picture of the preceding view angle.
The input text t is a text of the training set, the two-dimensional picture of the current view angle is the corresponding current-view picture of the training set, and the two-dimensional picture of the preceding view angle corresponds to the training-set picture of the view angle preceding the current-view picture. The current-view pictures of the training set can be captured by a camera device or rendered from a 3D model. The camera parameters P are the set camera parameters, and the camera parameters input in the inference stage are the same group of parameters.
The training phase comprises the following steps:
S31, inputting the camera parameters, the training-set text, and the corresponding current-view pictures of the training set into the 2D picture generation model.
In some embodiments, S31 includes the following steps S311-S313.
S311, mapping the set camera parameters to the corresponding token token2; token2 is input in order into the Transformer structure 51, which decodes and outputs the parameter token token5.
S312, converting the training-set text through the text Tokenizer 52 to obtain token1; token1 is input into the Transformer structure 51, which decodes and outputs token4.
S313, converting the current-view picture of the training set through the image Tokenizer 53 to obtain token7; token7 is input into the Transformer structure 51, which decodes and outputs the image token token8.
In some embodiments, to enhance continuity between 2D pictures of different view angles generated by the same training text, the 2D pictures of two consecutive view angles in the training set may be taken together as input to the 2D picture generation model. Of the 2D pictures of the two consecutive view angles, the one ordered first is denoted as the preceding-view picture and the one ordered second is denoted as the current-view picture. This includes the following step S314.
S314, converting the preceding-view picture of the training set through the image Tokenizer 53 to obtain token3; token3 is input into the Transformer structure 51, which decodes and outputs token6.
S32, synthesizing token4, token5, token6, and token8 through the image Tokenizer 54 to obtain the 2D picture of the training view angle. The two-dimensional (2D) picture of the training view angle may be denoted as the second 2D picture.
S33, optimizing a semantic loss function. Semantic consistency between the 2D pictures of the training view angles and the text can be enhanced by the optimized semantic loss function.
In some embodiments, the semantic similarity between the plurality of second 2D pictures and the training-set text is calculated, and the semantic loss function L is optimized so that the semantic-similarity value converges. The semantic loss function L is:
L = -I · T
where I is the feature vector of the second 2D picture and T is the feature vector of the training-set text.
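A minimal sketch of this semantic loss is shown below (the encoders producing the feature vectors I and T are assumptions, e.g. CLIP-style image and text encoders as in FIG. 6):

import torch

def semantic_loss(image_feat, text_feat):
    """Semantic loss L = -I . T: minimizing it maximizes the inner product between
    the picture feature vector I and the text feature vector T."""
    return -(image_feat * text_feat).sum(dim=-1).mean()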
S34, optimizing a detail loss function.
The details of the generated result can be enhanced by the optimized detail loss function. The method comprises the following steps:
S341, calculating a cross-entropy loss function on the tokens token4, token5, token6, and token8 output by the Transformer structure, thereby training the Transformer model at the token level (token-level).
Experiments show that this token-level loss function alone is insufficient to guide the network model to generate enough detail.
S342, calculating a detail loss function between the plurality of second 2D pictures and the training-set GT pictures at the pixel level (pixel-level).
In some embodiments, on top of the token-level cross-entropy loss function, an L1 loss function is calculated at the pixel level between the output second 2D pictures and the training-set GT pictures, thereby providing a finer-grained training signal for the Transformer model.
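The combined token-level and pixel-level training signal described above can be sketched as follows (the loss weighting lam and the tensor shapes are assumptions introduced for illustration, not values from the application):

import torch
import torch.nn.functional as F

def generation_loss(pred_token_logits, gt_tokens, pred_picture, gt_picture, lam=1.0):
    """Token-level cross-entropy on the Transformer outputs plus a pixel-level L1
    detail loss between the generated second 2D picture and the training-set GT picture.
    pred_token_logits: [B, n, vocab], gt_tokens: [B, n], pictures: [B, 3, H, W]."""
    ce = F.cross_entropy(pred_token_logits.view(-1, pred_token_logits.size(-1)),
                         gt_tokens.view(-1))                 # token-level loss
    l1 = F.l1_loss(pred_picture, gt_picture)                 # pixel-level detail loss
    return ce + lam * l1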
Those skilled in the art will readily recognize that other image generation modules may be used instead, such as superparamellos, as a viable solution.
S35, optimizing a view-contrast loss function.
In some embodiments, by optimizing the view-contrast loss function, the distance between 2D pictures of different view angles generated by the same text may be pulled closer, and the distance between 2D pictures generated by different texts may be pushed apart.
FIG. 8 is a schematic diagram of view contrast in the object generation method according to an embodiment of the present application. As shown in FIG. 8, x_i and x_j are 2D pictures of different view angles generated by the same text, and x_k⁻ is a 2D picture of a different view angle generated by a different text. Optimizing the view-contrast loss function comprises the following steps:
S351, using the feature extraction network f_enc to extract the feature vector of x_i and the feature vector of x_j; the feature vector of x_i is denoted as the third feature vector, and the feature vector of x_j is denoted as the fourth feature vector.
S352, calculating the inner product of the third and fourth feature vectors to obtain the similarity value sim(·,·), and calculating the view-contrast loss function L_contrastive according to the similarity values:
L_contrastive = -log( exp(sim(f_enc(x_i), f_enc(x_j))/τ) / ( exp(sim(f_enc(x_i), f_enc(x_j))/τ) + Σ_k exp(sim(f_enc(x_i), f_enc(x_k⁻))/τ) ) )
where sim(·,·) is a similarity function that obtains the similarity from the inner product of the feature vectors of the two pictures; f_enc(·) is a feature extraction function for extracting the feature vectors of the pictures; x_i and x_j are 2D pictures of different view angles generated by the same text, and x_k⁻ is a 2D picture of a different view angle generated by a different text; τ is a temperature coefficient: the smaller the value of τ, the larger the distance between x_i and x_k⁻, and the larger the value of τ, the smaller the distance between x_i and x_k⁻; exp() is used so that, when τ is smallest, one term approaches 1 and the others approach 0.
By optimizing the view-contrast loss function, the similarity value between 2D pictures of different view angles generated by the same text is increased, pulling x_i and x_j closer together while pushing x_i and x_k⁻ farther apart.
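A minimal sketch of this view-contrast loss, consistent with the formula above, is given below (the batching of the negatives and the default value of τ are assumptions introduced for illustration):

import torch

def view_contrastive_loss(feat_i, feat_j, neg_feats, tau=0.07):
    """Pulls together features of 2D pictures of different view angles generated by
    the same text (feat_i, feat_j) and pushes away features of pictures generated
    by different texts (neg_feats).
    feat_i, feat_j: [D] feature vectors from f_enc; neg_feats: [K, D]."""
    pos = torch.exp(torch.dot(feat_i, feat_j) / tau)          # sim() as the inner product
    neg = torch.exp(neg_feats @ feat_i / tau).sum()           # negatives from other texts
    return -torch.log(pos / (pos + neg))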
The pixelNeRF network can be trained on a set of multi-view 2D pictures, allowing it to generate plausible novel view synthesis from very few input images without test-time optimization.
To compare with the performance of existing algorithms, the embodiment of the present application is compared with the baseline models Text-NeRF and DreamFields on the open dataset ABO.
FIG. 9 is a schematic diagram comparing the performance of the object generation method according to an embodiment of the present application with existing algorithms. As shown in FIG. 9, the embodiment of the present application (Ours in FIG. 9) is compared in two aspects: the generation quality of the 3D objects and the semantic consistency between the 3D objects and the text.
A total of 6 numerical indicators are used in the embodiments of this application. Among them, PSNR, SSIM, and LPIPS are automatic indicators for measuring the 3D object generation quality, and CLIP-Score is an automatic indicator for measuring the semantic consistency between the generated 3D object and the text.
In addition, the embodiment of the application also adopts a human evaluation method to compare the results of the different methods, where Object Fidelity is a human evaluation indicator for measuring the generation quality of the 3D object, and Caption Similarity is a human evaluation indicator for measuring the semantic consistency between the generated 3D object and the text.
As can be seen from FIG. 9, the embodiment of the present application outperforms the existing Text-NeRF algorithm and DreamFields on all metrics.
In addition, in terms of generation speed, compared with DreamFields, which can also generate better textures, the embodiment of the present application only needs inference in the inference stage after the neural network model has been trained in the training stage, because no per-text optimization is required. The embodiment of the present application needs only one V100 GPU and takes only 6 minutes.
FIG. 10 is a schematic diagram of the visual effect of controlling the category of the generated 3D object with the object generation method according to an embodiment of the present application. As shown in FIG. 10, the input text is English, and the embodiment of the present application can control the category attribute of the generated 3D object through the input English text.
FIG. 11 is a schematic diagram of the visual effect of controlling the color of the generated 3D object with the object generation method according to an embodiment of the present application. As shown in FIG. 11, the input text is English, and the embodiment of the present application can control the color attribute of the generated 3D object through the input English text.
FIG. 12 is a schematic diagram of controlling the shape of the generated 3D object with the object generation method according to an embodiment of the present application. As shown in FIG. 12, the input text is English, and the embodiment of the present application can control the shape attribute of the generated 3D object through the input English text.
According to the object generation method, firstly, a text-to-2D view angle generation module is used for generating 2D pictures of a plurality of views of a corresponding object according to the text, and then, a view angle-to-3D object generation module is used for generating a new method of the corresponding 3D object according to the plurality of views.
The view-to-3D object generation module provided by the embodiment of the application supports NeRF as the representation of the 3D object and can generate 3D objects of higher quality. No time-consuming optimization is needed at the inference stage, so a faster generation speed is achieved.
The object generation method provided by the embodiment of the application can be applied to different scenarios to generate corresponding 3D objects from text.
In some embodiments, a method for generating an object according to an embodiment of the present application may be applied to a data enhancement scene.
By way of example, text meeting the task requirements can be designed and a corresponding 3D object generated; the 3D object is used directly, or fused with other data, to obtain new training data. This saves data acquisition cost, improves the performance of the machine learning model, and realizes data enhancement.
In machine learning, more training data will generally facilitate training of the machine learning model.
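A hypothetical sketch of how generated 3D objects might be folded into a training set is given below; generate_3d_object and render_views are placeholder names standing in for the trained pipeline and a renderer, not real APIs.

    # Hypothetical sketch of using text-to-3D generation for data enhancement.
    # generate_3d_object() and render_views() are placeholders for the trained
    # pipeline and a renderer; they are assumptions, not real library calls.
    prompts = ["a red office chair", "a blue wooden chair", "a tall metal stool"]

    augmented_samples = []
    for text in prompts:
        obj = generate_3d_object(text)               # 3D asset produced from the text
        for view in render_views(obj, num_views=8):  # rendered 2D pictures of the asset
            augmented_samples.append({"image": view, "label": "chair"})

    # augmented_samples can now be mixed with real captured data to enlarge the
    # training set of a downstream detector or classifier.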
In some embodiments, a method for generating an object according to an embodiment of the present application may be applied to an entertainment scene.
In some embodiments, the object generation method provided by the embodiment of the application can be deployed on different intelligent devices to interact with users so as to achieve the effect of mass entertainment. The intelligent devices include mobile phones, tablets (PADs), PCs, and the like.
For example, the user enters text on a mobile phone; the mobile phone processes the text to obtain a generated result, and then displays the generated result or allows the user to edit it a second time, thereby achieving the effect of mass entertainment.
In some embodiments, a method for text-oriented 3D object generation as proposed by embodiments of the present application may be applied to content authoring.
In 3D object design, a corresponding 3D object is generated from the text and then edited and refined a second time according to actual requirements, which improves authoring efficiency and can broaden the creator's ideas.

In commodity design, a corresponding 3D object is likewise generated from the text and then edited and refined a second time according to actual requirements, which improves authoring efficiency and can broaden the creator's ideas.
The object generation method provided by the embodiment of the application can be applied to cloud products, terminal equipment and the like.
The object generation method provided by the embodiment of the application can be deployed on the computing nodes of the related devices, and the generation quality and generation speed of text-driven 3D object generation can be improved through software modification.
Fig. 13 is a schematic diagram of an apparatus for object generation according to an embodiment of the present application. The object generation apparatus provided in the embodiment of the present application performs any of the methods described above and includes a 2D picture generation model 131, an enhancement module 132, and a three-dimensional object generation model 133. The 2D picture generation model 131 takes text as input and outputs 2D pictures of multiple view angles of the object; the text is used to describe the characteristics of the object, including object category, color, and shape; the 2D picture generation model is used to generate 2D pictures of multiple view angles from the text; a view angle is the spatial angle at which the object is presented. The enhancement module 132 increases the similarity between the 2D pictures of the multiple view angles and the text to obtain view-enhanced 2D pictures. The three-dimensional object generation model 133 takes the view-enhanced 2D pictures as input, renders 2D pictures of other angles based on them, and outputs a three-dimensional object conforming to the text description.
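How the three modules might be wired together can be pictured with the following sketch; the class name ObjectGenerationPipeline, the injected model objects, and the m/s defaults are illustrative assumptions rather than the actual implementation of the apparatus.

    import torch

    class ObjectGenerationPipeline:
        """Illustrative wiring of the apparatus: text -> multi-view 2D pictures
        -> similarity-based enhancement -> 3D object. All components are
        injected placeholders, not real library objects."""

        def __init__(self, picture_model, clip_model, nerf_model, views, m=8, s=2):
            self.picture_model = picture_model  # 2D picture generation model (131)
            self.clip_model = clip_model        # used by the enhancement module (132)
            self.nerf_model = nerf_model        # three-dimensional object generation model (133)
            self.views = views                  # camera parameters of the desired view angles
            self.m, self.s = m, s               # m candidates per view, keep the top-s

        def enhance(self, text, pictures):
            """Keep the s pictures (out of m) most similar to the text for one view."""
            scores = self.clip_model.similarity(text, pictures)  # [m] similarity values
            keep = torch.topk(scores, self.s).indices
            return [pictures[i] for i in keep]

        def __call__(self, text):
            enhanced = []
            for view in self.views:
                candidates = self.picture_model(text, view, num_samples=self.m)
                enhanced.extend(self.enhance(text, candidates))
            # Render novel angles from the enhanced views and return the 3D object.
            return self.nerf_model(enhanced)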
Fig. 14 shows an object generation system according to an embodiment of the present application. As shown in fig. 14, the system includes:
an object generation apparatus 141, configured to input text into the 2D picture generation model and output 2D pictures of multiple view angles of the object; the text is used to describe the characteristics of the object, including object category, color, and shape; the 2D picture generation model is used to generate 2D pictures of multiple view angles from the text; a view angle is the spatial angle at which the object is presented; to calculate similarity values between the 2D pictures of the multiple view angles and the text and enhance the 2D pictures of the multiple view angles according to the similarity values; and to input the view-enhanced 2D pictures into the three-dimensional object generation model, which renders 2D pictures of other angles based on the view-enhanced 2D pictures and outputs a three-dimensional object conforming to the text description. In this way, the 2D picture generation model can generate 2D pictures of multiple view angles of the corresponding object from the text, and the three-dimensional object generation model can then generate the corresponding 3D object from those pictures, so that both the quality and the speed of 3D object generation can be improved.
a training apparatus 142, configured to input the text of the training set and the corresponding training-set GT pictures into the 2D picture generation model; the pictures of the training set can be captured by a device or rendered from a 3D model; and to optimize the loss function so that the similarity between the generated 2D pictures of the multiple view angles and the training-set GT pictures converges, obtaining a trained 2D picture generation model. In this way, the quality of the 2D pictures generated by the 2D picture generation model can be improved by training and optimizing the loss function.
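A minimal sketch of one optimization step of the training apparatus is shown below, assuming a token-level cross-entropy objective as described in the claims; the model interface, the image tokenizer, and the function name train_step are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def train_step(model, image_tokenizer, optimizer, text_tokens, view, gt_picture):
        """One optimization step: the Transformer is teacher-forced to predict the
        discrete image tokens of the training-set GT picture for the requested view."""
        target = image_tokenizer.encode(gt_picture)      # [n] discrete image tokens of the GT picture
        # The (assumed) model returns one logit vector per target token, shape [n, vocab],
        # conditioned on the text tokens, the camera view, and the preceding image tokens.
        logits = model(text_tokens, view, target[:-1])
        loss = F.cross_entropy(logits, target)           # token-level cross entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()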
The object generation apparatus 141 and the training apparatus 142 may each be implemented by software or by hardware. Illustratively, an implementation of the object generation apparatus 141 is described next; the implementation of the training apparatus 142 may refer to that of the object generation apparatus 141.
As an example of a software functional unit, the object generation apparatus 141 may comprise code running on a computing instance. The computing instance may be at least one of a physical host (computing device), a virtual machine, a container, and the like, and there may be one or more computing devices. For example, the object generation apparatus 141 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions, and may be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers, and a region typically includes multiple AZs.
Likewise, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same virtual private cloud (VPC) or across multiple VPCs, where one VPC is typically placed within one region. Communication between two VPCs in the same region, or between VPCs in different regions (inter-region communication), requires a communication gateway to be set up in each VPC, and interconnection between the VPCs is realized through the communication gateways.
As an example of a hardware functional unit, the object generation apparatus 141 may comprise at least one computing device, such as a server. Alternatively, the object generation apparatus 141 may be a device implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD can be a complex programmable logic device (CPLD), a field programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Where the object generation apparatus 141 includes multiple computing devices, these computing devices may be distributed in the same region or in different regions, in the same AZ or in different AZs, and likewise in the same VPC or across multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
The present application also provides a computing device 100. As shown in fig. 15, the computing device 100 includes: bus 102, processor 104, memory 106, and communication interface 108. Communication between the processor 104, the memory 106, and the communication interface 108 is via the bus 102. Computing device 100 may be a server or a terminal device. It should be understood that the present application is not limited to the number of processors, memories in computing device 100.
Bus 102 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one line is shown in fig. 15, but this does not mean there is only one bus or only one type of bus. Bus 102 may include a path for transferring information between the components of computing device 100 (e.g., memory 106, processor 104, communication interface 108).
The processor 104 may include any one or more of a central processing unit (central processing unit, CPU), a graphics processor (graphics processing unit, GPU), a Microprocessor (MP), or a digital signal processor (digital signal processor, DSP).
The memory 106 may include volatile memory, such as random access memory (RAM). The memory 106 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a mechanical hard disk drive (HDD), or a solid state drive (SSD).
The memory 106 has stored therein executable program code that is executed by the processor 104 to implement the functions of the aforementioned 2D picture generation model, enhancement module, and three-dimensional object generation model, respectively, to thereby implement the object generation method. That is, the memory 106 has stored thereon instructions for performing the object generation method.
Alternatively, the memory 106 has stored therein executable code that is executed by the processor 104 to implement the functions of the aforementioned object generation means 141 and training means 142, respectively, to implement the object generation method. That is, the memory 106 has stored thereon instructions for performing the object generation method.
Communication interface 108 enables communication between computing device 100 and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card, transceiver, or the like.
The embodiment of the application also provides a computing device cluster. The cluster of computing devices includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop, notebook, or smart phone.
As shown in fig. 16, the cluster of computing devices includes at least one computing device 100. The same instructions for performing the object generation method may be stored in the memory 106 in one or more computing devices 100 in the cluster of computing devices.
In some possible implementations, the memory 106 of one or more computing devices 100 in the computing device cluster may also each have stored therein a portion of instructions for performing the object generation method. In other words, a combination of one or more computing devices 100 may collectively execute instructions for performing an object generation method.
It should be noted that the memories 106 in different computing devices 100 in the computing device cluster may store different instructions for performing part of the functions of the object generating apparatus, respectively. That is, the instructions stored by the memory 106 in the different computing devices 100 may implement the functionality of one or more of the 2D picture generation model, the enhancement module, and the three-dimensional object generation model.
In some possible implementations, one or more computing devices in a cluster of computing devices may be connected through a network. Wherein the network may be a wide area network or a local area network, etc.
Fig. 17 shows one possible implementation. As shown in fig. 17, two computing devices 100A and 100B are connected by a network; specifically, each computing device connects to the network through its communication interface. In this implementation, instructions for performing the functions of the 2D picture generation model and the enhancement module are stored in the memory 106 of computing device 100A, while instructions for performing the functions of the three-dimensional object generation model are stored in the memory 106 of computing device 100B.
The connection between the computing devices shown in fig. 17 takes into account that the three-dimensional object generation in the method provided in the present application requires a large amount of data and computation, so the functions implemented by the three-dimensional object generation model are assigned to computing device 100B.
It should be appreciated that the functionality of computing device 100A shown in fig. 17 may also be performed by multiple computing devices 100. Likewise, the functionality of computing device 100B may also be performed by multiple computing devices 100.
The embodiment of the application also provides another computing device cluster. The connection between computing devices in the computing device cluster may be similar to the connection of the computing device cluster described with reference to fig. 16 and 17. In contrast, the same instructions for performing the object generation method may be stored in the memory 106 in one or more computing devices 100 in the cluster of computing devices.
In some possible implementations, the memory 106 of one or more computing devices 100 in the computing device cluster may also each have stored therein a portion of instructions for performing the object generation method. In other words, a combination of one or more computing devices 100 may collectively execute instructions for performing an object generation method.
It should be noted that the memory 106 in different computing devices 100 in a cluster of computing devices may store different instructions for performing part of the functions of the object generation system. That is, the instructions stored by the memory 106 in the different computing devices 100 may implement the functionality of one or more of the object-generating means 141, training means 142.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be software or a program product containing instructions capable of running on a computing device or stored in any useful medium. The computer program product, when run on at least one computing device, causes the at least one computing device to perform an object generation method, or a training method.
Embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium may be any available medium that a computing device can store, or a data storage device such as a data center containing one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc. The computer-readable storage medium includes instructions that instruct a computing device to perform the object generation method or instruct a computing device to perform the training method.
It is to be appreciated that the processor in embodiments of the present application may be a central processing unit (central processing unit, CPU), but may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), field programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. The general purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
The method steps in the embodiments of the present application may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may be comprised of corresponding software modules that may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.

Claims (19)

1. A method of object generation, the method comprising:
inputting the text into a two-dimensional picture generation model, and outputting two-dimensional pictures of multiple visual angles of an object; the text is used for describing characteristics of the object, wherein the characteristics comprise the category, the color and the shape of the object; the two-dimensional picture generation model is used for generating two-dimensional pictures of a plurality of view angles according to the text; the visual angle is a spatial angle presented by the object;
calculating the similarity values of the two-dimensional pictures of the multiple view angles and the text, and enhancing the two-dimensional pictures of the multiple view angles according to the similarity values;
inputting the two-dimensional pictures enhanced at the multiple view angles into a three-dimensional object generation model, wherein the three-dimensional object generation model renders two-dimensional pictures of other angles based on the enhanced two-dimensional pictures of the multiple view angles and outputs a three-dimensional object conforming to the text description.
2. The method of claim 1, wherein the two-dimensional picture generation model comprises a text tokenizer, a Transformer structure, and a first image tokenizer, and wherein inputting text into the two-dimensional picture generation model and outputting two-dimensional pictures of multiple view angles of an object comprises:

inputting the text into the text tokenizer for conversion and outputting a first group of tokens; the first group of tokens is used for indicating the category, color and shape characteristics of the object; the number of tokens in the first group is a plurality;

inputting the first group of tokens into the Transformer structure and decoding to obtain a second group of tokens; the number of tokens in the second group is n, where n is a natural number greater than 0;

inputting the second group of tokens into the first image tokenizer and outputting two-dimensional pictures of the object at multiple view angles.
3. The method of claim 2, wherein the Transformer structure uses autoregressive generation, taking the first n-1 tokens of the second group of tokens as input to generate the n-th token of the second group of tokens.
4. The method according to claim 2 or 3, wherein the input of the two-dimensional picture generation model further comprises camera parameters used to indicate the view angles at which the two-dimensional pictures are generated; and inputting the text into the two-dimensional picture generation model and outputting two-dimensional pictures of multiple view angles of the object comprises:

mapping the camera parameters to a corresponding third group of tokens; the number of tokens in the third group is k;

sequentially inputting the third group of tokens into the Transformer structure for decoding to obtain the second group of tokens.
5. The method of any of claims 2-4, wherein the view angles indicated by the camera parameters further comprise a preceding view angle, the two-dimensional picture generation model further comprises a second image tokenizer, and inputting text into the two-dimensional picture generation model and outputting two-dimensional pictures of multiple view angles of the object comprises:

obtaining a two-dimensional picture of the preceding view angle, and converting the two-dimensional picture of the preceding view angle into a fourth group of tokens through the second image tokenizer; the number of tokens in the fourth group is a plurality;

inputting the fourth group of tokens into the Transformer structure for decoding to obtain the second group of tokens.
6. The method of any of claims 1-5, wherein the number of two-dimensional pictures for each of the multiple view angles is m; and improving the similarity between the two-dimensional pictures of the multiple view angles and the text to obtain view-enhanced two-dimensional pictures comprises:

calculating the semantic similarity between the m two-dimensional pictures of each view angle and the text to obtain m similarity values;

sorting the m similarity values, and determining the s two-dimensional pictures whose similarity values meet a threshold requirement as the enhanced two-dimensional pictures of each view angle, where m > s.
7. The method according to claim 6, wherein calculating the semantic similarity between the m two-dimensional pictures of each view angle and the text to obtain m similarity values comprises:

inputting the text into a text encoder and outputting a first feature vector;

inputting the m two-dimensional pictures of each view angle into an image encoder and outputting m second feature vectors;

calculating the inner product between the first feature vector and the m second feature vectors to obtain the m similarity values of each view angle.
8. The method of any of claims 1-7, wherein a view angle outside the multiple view angles is a second view angle, the three-dimensional object generation model is a PixelNeRF network, and rendering by the three-dimensional object generation model based on the view-enhanced two-dimensional pictures and outputting a three-dimensional image of the object conforming to the text description comprises:

inputting the view-enhanced two-dimensional pictures into the PixelNeRF network; for each query point x along a target ray d of each view angle, extracting corresponding image features from the view-enhanced two-dimensional pictures through projection and interpolation; inputting each image feature together with the spatial coordinates into the NeRF network, and volume-rendering the output RGB and density values to obtain a NeRF model of the object; the NeRF model of the object is an implicit model;

rendering images of view angles other than the multiple view angles based on the NeRF model of the object to obtain a three-dimensional object model conforming to the text description; the three-dimensional object model is a mesh model of the object.
9. The method according to any of claims 1-8, further comprising a step of training the two-dimensional picture generation model, the step comprising:
inputting the text and the training set GT picture corresponding to the text into a two-dimensional picture generation model; the pictures of the training set can be acquired through equipment or rendered through a three-dimensional model;
optimizing a loss function, and converging the similarity between the generated two-dimensional pictures of the multiple view angles and the training set GT picture to obtain a trained two-dimensional picture generation model.
10. The method of claim 9, wherein the training step of the two-dimensional picture generation model further comprises:
inputting one view angle of the camera parameters and a picture of the preceding view angle of that view angle into the two-dimensional picture generation model;
and optimizing the loss function to enable the similarity between the generated two-dimensional pictures of the multiple view angles and the multiple view angle pictures of the training set corresponding to the text to be converged, and obtaining the trained two-dimensional picture generation model.
11. The method according to claim 9 or 10, characterized in that the optimizing the loss function comprises:
calculating semantic similarity between the two-dimensional pictures of the multiple view angles and the text, and obtaining an optimized semantic loss function L under the condition that the semantic similarity is converged:
L=-I·T
wherein I is the feature vector of the two-dimensional picture, and T is the feature vector of the text.
12. The method according to any of claims 9-11, wherein the input of the two-dimensional picture generation model further comprises a current-view image and its preceding-view image, the two-dimensional picture generation model further comprises an image analyzer, and optimizing the loss function comprises:

calculating a cross-entropy loss function over the plurality of tokens output by decoding of the Transformer structure, and training the Transformer model at the token level.
13. The method according to any of claims 9-12, wherein optimizing the loss function comprises:
calculating an L1 loss function at the pixel level between the output 2D picture and the training-set GT picture to obtain an optimized detail loss function.
14. The method according to any of claims 9-13, wherein optimizing the loss function comprises: optimizing a view-angle contrastive loss function, reducing the distance between 2D pictures of different view angles generated by the same text, and increasing the distance between 2D pictures of different view angles generated by different texts.
15. The method of claim 14, wherein the view-angle contrastive loss function is L_contrastive:

    L_contrastive = -log( exp(sim(f_enc(I_i), f_enc(I_j)) / τ) / Σ_k exp(sim(f_enc(I_i), f_enc(I_k)) / τ) )

where I_i and I_j are 2D pictures of different view angles generated by the same text, and I_k is a 2D picture of a different view angle generated by a different text; sim() is a similarity function that obtains the similarity as the inner product of the feature vectors of its two inputs; f_enc() is a feature extraction function used to extract the feature vectors of I_i, I_j and I_k; τ is a temperature coefficient: the smaller the value of τ, the greater the distance between I_i and I_k, and the greater the value of τ, the smaller the distance between I_i and I_k; exp() is used to push one value toward 1 and the other toward 0.
16. An object generating apparatus for performing the method according to any of claims 1-15, comprising at least:
the two-dimensional picture generation model is used for taking texts as input and outputting two-dimensional pictures of multiple visual angles of an object; the text is used for describing characteristics of the object, wherein the characteristics comprise the category, the color and the shape of the object; the two-dimensional picture generation model is used for generating two-dimensional pictures of a plurality of view angles according to the text; the visual angle is a spatial angle presented by the object;
the enhancement module is used for improving the similarity between the two-dimensional pictures of the multiple view angles and the text to obtain two-dimensional pictures enhanced by the multiple view angles;
And the three-dimensional object generation model is used for taking the two-dimensional pictures with the enhanced multiple visual angles as input, rendering the two-dimensional pictures with other angles based on the two-dimensional pictures with the enhanced multiple visual angles, and outputting the three-dimensional object conforming to the text description.
17. A system for object generation, characterized in that the system comprises the object generation apparatus, wherein the apparatus is configured to perform the method according to any of claims 1-15.
18. A computer device, comprising:
at least one memory for storing a program;
at least one processor for executing the memory-stored program, which processor is adapted to perform the method of any of claims 1-15 when the memory-stored program is executed.
19. A computer storage medium having instructions stored therein which, when executed on a computer, cause the computer to perform the method of any of claims 1-15.
CN202211515098.5A 2022-11-30 2022-11-30 Object generation method, device and system Pending CN116228959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211515098.5A CN116228959A (en) 2022-11-30 2022-11-30 Object generation method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211515098.5A CN116228959A (en) 2022-11-30 2022-11-30 Object generation method, device and system

Publications (1)

Publication Number Publication Date
CN116228959A true CN116228959A (en) 2023-06-06

Family

ID=86573726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211515098.5A Pending CN116228959A (en) 2022-11-30 2022-11-30 Object generation method, device and system

Country Status (1)

Country Link
CN (1) CN116228959A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437365A (en) * 2023-12-20 2024-01-23 中国科学院深圳先进技术研究院 Medical three-dimensional model generation method and device, electronic equipment and storage medium
CN117437365B (en) * 2023-12-20 2024-04-12 中国科学院深圳先进技术研究院 Medical three-dimensional model generation method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination