CN116051729A - Three-dimensional content generation method and device and electronic equipment

Info

Publication number: CN116051729A
Application number: CN202211612954.9A
Authority: CN (China)
Prior art keywords: rendering, three-dimensional, initial, three-dimensional content, three-dimensional model
Legal status: Granted; currently active
Other languages: Chinese (zh)
Other versions: CN116051729B (granted publication)
Inventors: 吴进波, 刘星
Original and current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Publication of application: CN116051729A
Publication of grant: CN116051729B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 — Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 — 3D [Three Dimensional] image rendering
    • G06T 15/005 — General purpose rendering architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure provides a three-dimensional content generation method, a three-dimensional content generation device and electronic equipment, relates to the technical field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning and the like, and can be applied to scenarios such as the metaverse and virtual digital humans. The implementation scheme is as follows: acquiring a description, provided by a user, of three-dimensional content intended to be generated, and an initial three-dimensional model for generating the three-dimensional content; extracting semantic information representing the three-dimensional content as target feature information based on the description of the three-dimensional content; rendering based on the initial three-dimensional model to obtain a plurality of rendered images; extracting image feature information of each of the plurality of rendered images respectively; respectively calculating differences between the image feature information and the target feature information of each of the plurality of rendered images; and adjusting the initial three-dimensional model to correspond to the three-dimensional content based on the differences.

Description

Three-dimensional content generation method and device and electronic equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning and the like, can be applied to scenarios such as the metaverse and virtual digital humans, and in particular relates to a method, an apparatus, an electronic device, a computer-readable storage medium and a computer program product for generating three-dimensional content.
Background
Artificial intelligence is the discipline that studies how to make a computer mimic certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning), and it encompasses both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
In recent years, with the gradual rise of technologies such as the metaverse, users have placed higher demands on three-dimensional content generation (i.e., generation of a three-dimensional model of an object) in these scenarios. Three-dimensional content generation is often achieved through three-dimensional reconstruction of objects. In some cases, generating a three-dimensional model of an object in this way is slow, and such three-dimensional reconstruction generates three-dimensional content from real-world objects, that is, it is limited to reproducing objects that already exist.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a method, apparatus, electronic device, computer-readable storage medium, and computer program product for three-dimensional content generation.
According to an aspect of the present disclosure, there is provided a three-dimensional content generation method including: acquiring a description, provided by a user, of three-dimensional content intended to be generated, and an initial three-dimensional model for generating the three-dimensional content; extracting semantic information representing the three-dimensional content as target feature information based on the description of the three-dimensional content; rendering based on the initial three-dimensional model to obtain a plurality of rendered images; extracting image feature information of each of the plurality of rendered images respectively; respectively calculating differences between the image feature information and the target feature information of each of the plurality of rendered images; and adjusting the initial three-dimensional model to correspond to the three-dimensional content based on the differences.
According to another aspect of the present disclosure, there is provided a three-dimensional content generating apparatus including: an acquisition module configured to acquire a description of the three-dimensional content intended to be generated by the user, and an initial three-dimensional model for generating the three-dimensional content; the semantic module is configured to extract semantic information representing the three-dimensional content as target feature information based on the description of the three-dimensional content; a rendering module configured to render based on the initial three-dimensional model to obtain a plurality of rendered images; a feature extraction module configured to extract image feature information of each of the plurality of rendered images, respectively; a difference calculation module configured to calculate differences between image feature information and target feature information of each of the plurality of rendered images, respectively; and a model adjustment module configured to adjust the initial three-dimensional model to correspond to the three-dimensional content based on the differences.
According to another aspect of the present disclosure, there is provided an electronic device comprising at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the present disclosure as provided above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the present disclosure as provided above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the present disclosure as provided above.
According to one or more embodiments of the present disclosure, three-dimensional content conforming to a user description can be quickly and accurately generated.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a three-dimensional content generation method according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of an initial three-dimensional model according to an embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a three-dimensional content generation method according to another embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a process of rendering based on an initial three-dimensional model in accordance with an embodiment of the present disclosure;
FIG. 6 shows a block diagram of an apparatus for three-dimensional content generation according to an embodiment of the present disclosure;
FIG. 7 shows a block diagram of an apparatus for three-dimensional content generation according to another embodiment of the present disclosure;
fig. 8 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another element. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In the related art, three-dimensional content generation refers to generating a three-dimensional model of an object and often involves three-dimensional reconstruction techniques, such as MVS (Multi-View Stereo) techniques. Existing three-dimensional reconstruction mostly generates three-dimensional content from real-world objects, that is to say, such three-dimensional reconstruction is limited to restoring objects that already exist. However, in many emerging areas, such as AR gaming, social networking, and content authoring, three-dimensional reconstruction that merely restores real-world objects may no longer meet user needs, and there is an increasing demand for interactive three-dimensional content generation.
In view of the above technical problems, according to one aspect of the present disclosure, a three-dimensional content generation method is provided.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the method of three-dimensional content generation.
In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 to generate three-dimensional content. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-range servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications, such as applications for services such as object detection and recognition, signal conversion, etc., based on data such as images, video, voice, text, digital signals, etc., to process task requests such as voice interactions, text classification, image recognition, or keypoint detection received from client devices 101, 102, 103, 104, 105, and/or 106. The server can train the neural network model by using training samples according to specific deep learning tasks, test each sub-network in the super-network module of the neural network model, and determine the structure and parameters of the neural network model for executing the deep learning tasks according to the test results of each sub-network. Various data may be used as training sample data for a deep learning task, such as image data, audio data, video data, or text data. After training of the neural network model is completed, the server 120 may also automatically search out the optimal model structure through a model search technique to perform a corresponding task.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. The cloud server is a host product in a cloud computing service system, intended to overcome the drawbacks of difficult management and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure. The three-dimensional content generation method according to the embodiments of the present disclosure is described in detail below.
Fig. 2 shows a flow chart of a three-dimensional content generation method 200 according to an embodiment of the present disclosure. As shown in fig. 2, the method 200 includes steps S201, S202, S203, S204, S205, and S206.
In step S201, a description of the three-dimensional content intended to be generated by the user and an initial three-dimensional model for generating the three-dimensional content are acquired.
In an example, the user's description of the three-dimensional content intended to be generated may include content related to non-real-world objects. In application scenarios such as AR (Augmented Reality) games, online social networking, and online content authoring, a user may wish to obtain corresponding three-dimensional content through a description of the three-dimensional content intended to be generated. For example, in some AR games, a player may instruct, through language, that a three-dimensional model of a prop be generated.
In an example, the user's description of the intended three-dimensional content may be one or more complete sentences that depict or express the three-dimensional content (e.g., "generate an orange-colored winged flame dragon"), may be a single phrase including adjectives, nouns, etc. (e.g., "orange," "winged," "flame dragon"), or may be just one word (e.g., "dragon"). It should be appreciated that the more detailed the user's description of the three-dimensional content intended to be generated, the more consistent the generated three-dimensional content is with the user's expectations.
In an example, the user's description of the three-dimensional content intended to be generated may be a text description entered, for example, through a keyboard, or may be a speech description entered, for example, through a microphone.
In an example, the initial three-dimensional model used to generate the three-dimensional content may be a rotating body, which may be, for example, a sphere, cone, cylinder, or the like, or a polyhedron, which may be, for example, a cube, octahedron, or the like. The initial three-dimensional model may also be, for example, a polyhedron of approximately spherical shape whose surface is defined by a plurality of points.
In step S202, semantic information characterizing the three-dimensional content is extracted as target feature information based on the description of the three-dimensional content.
In an example, in a case where the user's description of the three-dimensional content intended to be generated is a text description input through a keyboard, for example, semantic information characterizing the three-dimensional content may be directly extracted from the text description as target feature information based on the text description of the three-dimensional content. In the case where the user's description of the three-dimensional content intended to be generated is, for example, a voice description input through a microphone, the voice description may be converted into text information based on the voice description of the three-dimensional content, and then semantic information characterizing the three-dimensional content may be extracted from the text information as target feature information.
In an example, the semantic information characterizing the three-dimensional content may take the form of a feature vector, and thus the target feature information may also take the form of a feature vector.
In an example, semantic information may be extracted as target feature information based on a neural network for natural language processing or a cross-modal neural network such as CLIP (Contrastive Language-Image Pre-training).
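As an illustrative sketch of this step (not part of the claimed method), the snippet below encodes a user's description into a normalized feature vector using the open-source CLIP package; the model variant, the example description, and the float32 conversion are assumptions made only for illustration.
```python
import torch
import clip  # OpenAI's open-source CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # model variant chosen for illustration
model = model.float()  # keep weights in float32 so the features can later be used with gradients

description = "an orange-colored winged flame dragon"     # the user's description
tokens = clip.tokenize([description]).to(device)

with torch.no_grad():
    target_features = model.encode_text(tokens)                                  # semantic feature vector
    target_features = target_features / target_features.norm(dim=-1, keepdim=True)
```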
In step S203, rendering is performed based on the initial three-dimensional model to acquire a plurality of rendered images.
In an example, rendering may be performed at a random or predetermined plurality of perspectives (e.g., 8 or 16 perspectives) based on an initial three-dimensional model to generate a plurality of rendered images corresponding to the plurality of perspectives, respectively.
In an example, the acquired plurality of rendered images may correspond to a front view, a rear view, a left view, a right view, a top view, and a bottom view, respectively, and may be named accordingly; for example, one of the plurality of rendered images may be named the "rendered front view".
In an example, multiple renderings may be performed based on the initial three-dimensional model. In performing the multiple renderings, rendering may first be carried out at a lower resolution, and each subsequent rendering may be performed at a higher resolution than the one used in the previous rendering. Alternatively, a predetermined number of renderings may be performed at the lower resolution before switching to a higher resolution than that used for the previous rendering.
In an example, rendering may be performed based on a neural network for differentiable rendering.
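A minimal sketch of rendering from several viewpoints follows. The `differentiable_render` callable is a hypothetical wrapper around whatever differentiable renderer is used (for example, a PyTorch3D- or nvdiffrast-style renderer); the viewpoint spacing and fixed elevation are likewise assumptions made only for illustration.
```python
import math
import torch

def render_views(mesh, differentiable_render, num_views=8, resolution=128):
    """Render `mesh` from `num_views` azimuth angles spaced evenly around the object.

    `differentiable_render(mesh, azimuth, elevation, resolution)` must be supplied by the
    caller and is assumed to return an image tensor of shape (3, resolution, resolution)
    with values in [0, 1], keeping gradients flowing from pixels back to the mesh.
    """
    images = []
    for i in range(num_views):
        azimuth = 2.0 * math.pi * i / num_views   # evenly spaced viewing angles
        elevation = math.radians(20.0)            # fixed elevation, chosen for illustration
        images.append(differentiable_render(mesh, azimuth, elevation, resolution))
    return torch.stack(images)                    # (num_views, 3, resolution, resolution)
```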
In step S204, image feature information of each of the plurality of rendering images is extracted.
In an example, the image feature information of a rendered image may take the form of a feature vector. The form of the image feature information may be consistent with the form of the semantic information characterizing the three-dimensional content, for example, both may be feature vectors. This facilitates comparing the image feature information with the semantic information to calculate differences.
In an example, the image feature information of the rendered image may include the image resolution of the rendered image, the color of each pixel (e.g., represented in an RGB (Red, Green, Blue) color mode), or the shape of each pixel.
In an example, image feature information may be extracted based on a neural network used to extract image features or a cross-modal neural network such as CLIP.
In step S205, differences between the image feature information and the target feature information of each of the plurality of rendering images are calculated, respectively.
In an example, since the form of the image feature information of each of the plurality of rendered images may be consistent with the form of the semantic information characterizing the three-dimensional content, for example, both may be feature vectors, the difference between the image feature information of each of the plurality of rendered images and the target feature information may be calculated by comparing the image feature information with the semantic information.
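One common way to realize such a comparison is the cosine distance between CLIP features; the sketch below assumes the CLIP model loaded in the earlier snippet, renders supplied as a tensor with values in [0, 1], and CLIP's published normalization statistics — all of which are illustrative assumptions rather than requirements of the method.
```python
import torch
import torch.nn.functional as F

# Normalization statistics published with the open-source CLIP models.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def clip_difference(rendered_images, target_features, model):
    """Mean (1 - cosine similarity) between rendered images and the target text features.

    `rendered_images` is a tensor of shape (N, 3, H, W) with values in [0, 1]; the
    preprocessing uses differentiable tensor ops so gradients can flow back to the
    three-dimensional model during optimization.
    """
    images = F.interpolate(rendered_images, size=224, mode="bilinear", align_corners=False)
    images = (images - CLIP_MEAN.to(images.device)) / CLIP_STD.to(images.device)
    image_features = model.encode_image(images)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    similarity = (image_features * target_features).sum(dim=-1)   # cosine similarity per view
    return (1.0 - similarity).mean()                              # scalar difference
```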
In step S206, the initial three-dimensional model is adjusted to correspond to the three-dimensional content based on the difference.
In an example, when rendering is performed based on a neural network, the differences may include a loss function for training the neural network, which may indicate an adjustment direction for adjusting the initial three-dimensional model to correspond to the three-dimensional content.
In an example, the individual differences calculated in step S205 for each of the rendered images may be summed or linearly combined into an overall difference, and the adjustment of the three-dimensional model may be performed based on this overall difference.
In an example, where the initial three-dimensional model is a polyhedron of approximately spherical shape whose surface is determined by a plurality of points, the positions or offsets of the points on the initial three-dimensional model may be iteratively adjusted based on the loss function, thereby ultimately enabling the initial three-dimensional model to be adjusted to correspond to the three-dimensional content.
According to the three-dimensional content generation method of the embodiment of the disclosure, three-dimensional content corresponding to user intention can be interactively generated based on the user intention, wherein the initial three-dimensional model can be finally adjusted to the three-dimensional content conforming to the user intention by representing the user intention as target characteristic information and determining a gap between a rendering result and the target characteristic information. In addition, since the gap between the rendering result and the target feature information is determined by means of the image feature information of each of the plurality of rendering images of the rendering result in the rendering process, it is possible to ensure accurate generation of three-dimensional content corresponding to the user's intention.
Various aspects of the three-dimensional content generation method according to embodiments of the present disclosure are described further below.
According to some embodiments, the description of the three-dimensional content acquired in step S201 may include at least one of a text description or a voice description.
In an example, when the user's description of the three-dimensional content intended to be generated is a text description, the user may input the description of the three-dimensional content by way of a keyboard key or handwriting. When the user's description of the three-dimensional content intended to be generated is a voice description, the user may input the description of the three-dimensional content by way of microphone input.
In an example, the user's description of the three-dimensional content intended to be generated may be, for example, "an orange-colored winged flame dragon," or simply "dragon." It should be appreciated that the more detailed the user's description of the three-dimensional content intended to be generated, the more consistent the generated three-dimensional content is with the user's expectations.
In an example, the user may describe the three-dimensional content intended to be generated in any language from which semantic information can be extracted.
According to the embodiments of the present disclosure, the way in which a user describes the three-dimensional content intended to be generated is not limited, which makes the method well suited to various application scenarios such as AR games, social networking, and content authoring; the description of the intention can be provided through either text input or voice input, bringing convenience to interactive three-dimensional content generation.
According to some embodiments, the initial three-dimensional model acquired in step S201 may include a rotating body or a polyhedron.
In an example, the initial three-dimensional model may be a sphere formed by rotating a semicircle with its diameter as an axis, a cylinder formed by rotating a rectangle with its one side as an axis, or a regular tetrahedron or regular dodecahedron.
Fig. 3 shows a schematic diagram of an initial three-dimensional model 300 according to an embodiment of the present disclosure.
In an example, as shown in FIG. 3, the initial three-dimensional model 300 may be a nearly spherical polyhedron whose surface is defined by a plurality of points.
In an example, points on the surface of the initial three-dimensional model 300 are connected to form a plurality of triangles that do not overlap each other, and the set of triangles forms the surface of the polyhedron of the initial three-dimensional model 300.
In an example, the position of a point, or the direction and distance of the offset of the point, of the initial three-dimensional model 300 surface may be adjusted. By adjusting the positions of the points, or the direction and distance of the offset of the points, of the surface of the initial three-dimensional model 300, the triangle of the surface of the initial three-dimensional model 300 is also changed, thereby enabling the surface profile and shape of the initial three-dimensional model 300 to be changed.
For example, the point 301 of the surface of the initial three-dimensional model 300 may be adjusted to the position of the point 302, and the triangles 311, 312, 313, 314, 315, and 316 that have the point 301 as a vertex then change as well. Specifically, triangles 311, 312, 313, 314, 315, and 316 are each deformed into new triangles (as shown by the dashed lines in the figure) in which the point 301 is replaced with the point 302 as a new vertex. The initial three-dimensional model 300 is then no longer approximately spherical in shape; in particular, the initial three-dimensional model 300 forms a depression around the point 302.
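The following toy example, with made-up coordinates, shows why moving a single vertex reshapes every triangle that references it: the faces index into a shared vertex array, so an offset applied to one entry changes all incident triangles, as described above.
```python
import numpy as np

# A toy mesh: vertex positions plus triangular faces that index into the vertex array.
vertices = np.array([
    [0.0, 0.0, 1.0],    # vertex 0 -- plays the role of point 301 in the figure
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
], dtype=np.float32)
faces = np.array([[0, 1, 2], [0, 2, 3]])    # two triangles sharing vertex 0

offset = np.array([0.0, 0.0, -0.4], dtype=np.float32)   # illustrative adjustment toward "point 302"
vertices[0] += offset                                    # move the shared vertex

# Both triangles now have new shapes, since they still reference vertex 0.
for f in faces:
    print("triangle", f.tolist(), "->", vertices[f].tolist())
```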
According to the embodiments of the present disclosure, it is possible to facilitate subsequent adjustment of a three-dimensional model by using a structurally simple rotator or polyhedron as an initial three-dimensional model for generating three-dimensional content.
According to some embodiments, rendering based on the initial three-dimensional model in step S203 to obtain a plurality of rendered images may include: rendering at random or a predetermined plurality of viewing angles based on the initial three-dimensional model; and generating a plurality of rendered images corresponding to the plurality of perspectives, respectively.
In an example, the plurality of perspectives for rendering may be different from each other, the same as each other, or some of the plurality of perspectives may be the same.
In an example, the number and angles of the viewing angles may be randomly determined, and rendering may then be performed at the randomly determined number of angles to generate the corresponding rendered images. Some of the randomly determined angles may coincide, in which case rendered images at the same viewing angle may also exist among the generated rendered images.
In an example, the number and angles of the viewing angles may be preset, and rendering may be performed at the preset number of angles to generate the corresponding rendered images. For example, six viewing angles to be rendered, including front, back, left, right, top, and bottom, may be preset, and rendering may then be performed at each of the six viewing angles to generate the corresponding six rendered images, i.e., a front view, a rear view, a left view, a right view, a top view, and a bottom view.
In an example, with continued reference to fig. 3, directions M1, M2, M3, and M4 may be set to predetermined four viewing angles. Rendering may then be performed at the respective perspectives of directions M1, M2, M3, and M4 to generate four rendered images corresponding to the respective perspectives of directions M1, M2, M3, and M4, respectively.
In an example, a rendered image generated by rendering at a perspective of direction M1 may be named "rendered top view"; the rendered image generated by rendering at the perspective of direction M2 may be named "rendered right view"; the rendered image generated by rendering at the perspective of direction M3 may be named "rendered bottom view"; the rendered image generated by rendering at the perspective of direction M4 may be named "rendered left view".
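As an illustration of such viewpoint handling, the preset views can be kept as named (azimuth, elevation) pairs and random viewpoints drawn on demand; the angle convention below is an assumption made only for this sketch.
```python
import random

# Six preset viewpoints as (azimuth, elevation) in degrees, following the naming above.
PRESET_VIEWS = {
    "rendered front view":  (0.0,    0.0),
    "rendered rear view":   (180.0,  0.0),
    "rendered left view":   (90.0,   0.0),
    "rendered right view":  (270.0,  0.0),
    "rendered top view":    (0.0,   90.0),
    "rendered bottom view": (0.0,  -90.0),
}

def random_views(num_views):
    """Sample viewpoints at random; repeated angles are allowed, as noted above."""
    return [(random.uniform(0.0, 360.0), random.uniform(-90.0, 90.0))
            for _ in range(num_views)]
```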
According to the above process of acquiring a plurality of rendered images, by using rendered images carrying viewing-angle information as the rendering result, the effect of the current rendering can be accurately reflected and adjusted towards the user's intention, thereby ensuring accurate generation of three-dimensional content conforming to the user's intention.
Fig. 4 shows a flow chart of a three-dimensional content generation method 400 according to another embodiment of the present disclosure.
As shown in fig. 4, the three-dimensional content generation method 400 may include steps S401 to S406. Steps S401 to S406 may correspond to steps S201 to S206 shown in fig. 2, and thus the details thereof will not be described here.
The three-dimensional content generation method 400 may further include steps S407, S408, and S409.
In step S407, the plurality of rendered images may be projectively transformed to determine an overlapping region between the plurality of rendered images.
In an example, the plurality of rendered images may be projectively transformed based on the depth information of each of the plurality of rendered images and the pose (i.e., position and orientation information) of the camera at the time the rendered image was acquired. Based on the depth information of each of the plurality of rendered images and the pose of the camera, the scale of the transformation for each of the plurality of rendered images can be calculated.
In an example, one of the plurality of rendered images may be preset as a reference; that is, no projective transformation is performed on that rendered image, and the other rendered images are instead projected to the viewing angle at which it was rendered, so as to determine the overlapping area between each of the other rendered images and the reference rendered image.
In an example, the plurality of rendered images may be grouped in pairs, and overlapping areas between each two rendered images may be determined, respectively.
In step S408, it may be determined whether the plurality of rendered images coincide in the overlapping region.
In an example, whether the plurality of rendered images are consistent in the overlapping region may be determined based on a comparison of characteristics such as the color and/or shape of the plurality of rendered images in the overlapping region. For example, the parameter values corresponding to the colors may be expressed in an RGB color mode and then compared, so as to compare the colors of the plurality of rendered images in the overlapping region.
In step S409, information indicating inconsistency may be added to the difference in response to determining that the plurality of rendered images are inconsistent in the overlapping area.
In an example, the difference may include a loss function for training the neural network, which may indicate an adjustment direction for adjusting the initial three-dimensional model to correspond to the three-dimensional content.
In an example, if the plurality of rendered images are not identical in the overlapping area, the plurality of rendered images rendered from the same three-dimensional model differ from one another, that is, the effect of the current rendering is not yet satisfactory; therefore, such inconsistency needs to be added to the difference so that further adjustment proceeds toward the user's intention. If the plurality of rendered images are consistent in the overlapping area, the current rendering effect can be considered to meet a certain requirement, and at this point the rendering resolution may be increased in the next rendering iteration for finer three-dimensional model adjustment.
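A simplified sketch of this consistency check is given below. The projective transformation is abstracted into a hypothetical `warp_to_reference` helper that uses per-view depth and camera pose and returns the warped image together with an overlap mask; the squared-RGB error and its averaging are illustrative choices rather than the only possible ones.
```python
import torch

def overlap_inconsistency(rendered_views, reference_view, warp_to_reference):
    """Extra difference term penalizing color disagreement in overlapping regions.

    `rendered_views` has shape (N, 3, H, W) and `reference_view` shape (3, H, W).
    `warp_to_reference(view)` is a hypothetical helper that projectively transforms a
    rendered view into the reference viewpoint using its depth map and camera pose,
    returning (warped_image, overlap_mask) where the mask marks overlapping pixels.
    """
    inconsistency = rendered_views.new_zeros(())
    for view in rendered_views:
        warped, overlap_mask = warp_to_reference(view)
        color_error = ((warped - reference_view) ** 2).sum(dim=0)   # per-pixel squared RGB error
        if overlap_mask.any():
            inconsistency = inconsistency + color_error[overlap_mask].mean()
    return inconsistency   # added to the overall difference when the views disagree
```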
According to the three-dimensional content generation method, it can be checked whether the rendering results of the plurality of rendered images of the same three-dimensional model differ from one another, and the current rendering effect can be indirectly verified, ensuring adjustment toward the user's intention so that three-dimensional content conforming to the user's intention can be accurately generated.
According to some embodiments, the overlapping region may include at least one pixel, and the information indicating the inconsistency in step S409 may be associated with at least one of a color or a shape of the pixel.
In an example, parameter values corresponding to the colors of pixels may be represented in an RGB color mode and then compared, so as to compare the colors of the plurality of rendered images in the overlapping region and determine whether the plurality of rendered images are consistent in the overlapping region. When the colors of the pixels represented in the RGB color mode are not consistent, the rendered images of the same three-dimensional model differ from one another, that is, the effect of the current rendering is not yet satisfactory, and this color inconsistency therefore needs to be added to the difference for further adjustment toward the user's intention. When there is color inconsistency, shape inconsistency may also be present.
According to the embodiments of the present disclosure, by judging the consistency of pixels by means of the consistency of the colors or shapes of the pixels, the effect of the current rendering can be verified in a simple manner, thereby facilitating adjustment of rendering.
According to some embodiments, rendering based on the initial three-dimensional model in step S403 to obtain a plurality of rendered images may include: rendering at at least two increasing resolutions, such that the plurality of rendered images acquired at any one of the at least two resolutions contain three-dimensional model details corresponding to that resolution.
In an example, the three-dimensional model details may relate to texture maps of the three-dimensional model.
In an example, multiple renderings may be performed based on the initial three-dimensional model. In performing the multiple renderings, rendering may first be carried out at a lower resolution, and each subsequent rendering may be performed at a higher resolution than the one used for the previous rendering. Alternatively, a predetermined number of renderings may be performed at the lower resolution before switching to a higher resolution. For example, upon verifying that there is no pixel inconsistency between the current plurality of rendered images, the rendering resolution may be increased in the next rendering iteration for finer three-dimensional model adjustment.
Fig. 5 shows a schematic diagram of a process of rendering based on an initial three-dimensional model according to an embodiment of the present disclosure.
In an example, as shown in fig. 5, two renderings may be performed based on an initial three-dimensional model 500. A rendered image 510 may first be obtained by rendering at a lower first resolution (the rendered image 510 may be one of a plurality of rendered images corresponding to a plurality of viewing angles). Then, in a subsequent second rendering, the initial three-dimensional model 500 is rendered at a second resolution higher than the first resolution, resulting in a rendered image 520 (the rendered image 520 may be one of a plurality of rendered images corresponding to a plurality of viewing angles). Thus, the rendered image 520 obtained at the higher second resolution may contain more three-dimensional model details than the rendered image 510 obtained at the lower first resolution.
In an example, when rendering at a lower resolution, the neural network may converge quickly as it contains less three-dimensional model details, thereby enabling a coarse adjustment of the initial three-dimensional model. Further fine tuning is achieved as more detail is included when rendering at a higher resolution next.
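A minimal coarse-to-fine schedule could look like the sketch below; the particular resolutions and iteration counts are assumptions chosen only to illustrate the idea of rendering coarsely first and finely later.
```python
# Coarse-to-fine schedule: early iterations render at a low resolution for fast, rough
# adjustment of the model; later iterations render at a higher resolution for fine detail.
RESOLUTION_SCHEDULE = [
    (64, 200),    # (rendering resolution, number of iterations) -- coarse stage
    (256, 400),   # fine stage
]

def resolution_for_iteration(step):
    """Return the rendering resolution to use at optimization step `step`."""
    remaining = step
    for resolution, num_iters in RESOLUTION_SCHEDULE:
        if remaining < num_iters:
            return resolution
        remaining -= num_iters
    return RESOLUTION_SCHEDULE[-1][0]   # stay at the finest resolution afterwards
```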
According to this process for acquiring the plurality of rendered images, combining coarse and fine adjustment allows the neural network to converge rapidly, which reduces computational complexity and improves the throughput of the neural network.
According to some embodiments, the rendering may be performed based on a neural network for differentiable rendering, and the differences may include a loss function for training the neural network.
In an example, multiple renderings may be performed based on a neural network for differentiable rendering, i.e., multiple iterations for training the neural network. After the current rendering (i.e., the current iteration), the difference between the effect of the current rendering and the target feature information may be back-propagated to the currently rendered three-dimensional model in the form of a loss function. In this way, after multiple renderings (i.e., multiple iterations), the initial three-dimensional model can be adjusted to conform to the user's intention. In other words, training of the neural network is completed at this point, and three-dimensional content conforming to the user's intention can therefore be rendered, which means that inference of the neural network is achieved at the same time.
In an example, the loss function may indicate an adjustment direction for adjusting the initial three-dimensional model toward the three-dimensional content intended by the user. The initial three-dimensional model may be a polyhedron whose surface is determined by a plurality of points, as shown in fig. 3; the positions of the points on the initial three-dimensional model, or the directions and distances of their offsets, may then be iteratively adjusted based on the loss function, thereby finally adjusting the initial three-dimensional model to correspond to the three-dimensional content.
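Putting the earlier sketches together, one possible optimization loop is outlined below. It reuses the hypothetical `differentiable_render` wrapper, `render_views`, `clip_difference`, and `resolution_for_iteration` helpers, and parameterizes the adjustment as per-vertex offsets optimized with Adam — all assumptions for illustration, not a statement of how the disclosed method must be implemented.
```python
import torch

def generate_3d_content(initial_vertices, faces, target_features, model,
                        differentiable_render, num_views=8, steps=600, lr=1e-2):
    """Iteratively deform the initial mesh until its renderings match the target features.

    `initial_vertices` is a (V, 3) tensor of the initial polyhedron's surface points and
    `faces` indexes triangles into it; the per-vertex offsets are the trainable parameters.
    """
    offsets = torch.zeros_like(initial_vertices, requires_grad=True)
    optimizer = torch.optim.Adam([offsets], lr=lr)

    for step in range(steps):
        vertices = initial_vertices + offsets                      # currently deformed mesh
        resolution = resolution_for_iteration(step)                # coarse-to-fine schedule
        images = render_views((vertices, faces), differentiable_render,
                              num_views=num_views, resolution=resolution)
        loss = clip_difference(images, target_features, model)     # difference vs. user intent
        optimizer.zero_grad()
        loss.backward()                                            # back-propagate the loss
        optimizer.step()                                           # adjust the vertex offsets

    return initial_vertices + offsets.detach()                     # adjusted three-dimensional model
```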
According to the embodiments of the present disclosure, by performing rendering based on a neural network for differentiable rendering and iterating with the difference obtained from each rendering as a loss function, three-dimensional content conforming to the user's intention can be rendered once training of the neural network is complete; that is, inference of the neural network is achieved at the same time.
According to some embodiments, adjusting the initial three-dimensional model to correspond to the three-dimensional content based on the differences in step S406 may include: based on the loss function, the offset of the points on the initial three-dimensional model is iteratively adjusted to deform the initial three-dimensional model to correspond to the three-dimensional content.
In an example, the process of adjusting the offsets of points on the initial three-dimensional model can be seen in fig. 3: in the current iteration, the point 301 of the surface of the initial three-dimensional model 300 may be adjusted to the position of the point 302 based on the loss function, and the triangles 311, 312, 313, 314, 315, and 316 are each deformed into new triangles, as shown by the dashed lines in the figure.
According to the initial three-dimensional model adjustment process of the embodiment of the disclosure, the rendering effect can be controlled by means of the loss function of the neural network to be towards the user intention, so that the three-dimensional model corresponding to the three-dimensional content expected by the user can be obtained synchronously when the neural network completes training.
According to some embodiments, extracting semantic information as target feature information, and extracting image feature information may be performed based on cross-modal neural networks.
In an example, a modality may refer to a presentation form of data, such as a file format of text, audio, images, video, and so forth. The form of the data may be different, but may describe the same thing or event. The cross-modal neural network (such as CLIP) may match the feature information of the text and image, that is, the cross-modal neural network may convert the semantic information and the image feature information into the same form, for example, may be in the form of feature vectors, to facilitate comparison of the two to calculate the difference.
According to embodiments of the present disclosure, utilizing a single cross-modal neural network may facilitate interactive three-dimensional content generation such that multiple neural networks for natural language processing and for image feature extraction, respectively, are not required, thereby simplifying the operability of the overall method.
According to the three-dimensional content generation method of the embodiments of the present disclosure, the three-dimensional content intended by the user can be generated quickly and accurately within a time frame on the order of minutes.
According to another aspect of the present disclosure, there is also provided a three-dimensional content generating apparatus.
Fig. 6 shows a block diagram of an apparatus 600 for three-dimensional content generation according to an embodiment of the present disclosure.
As shown in fig. 6, the three-dimensional content generating apparatus 600 includes: an acquisition module 610 configured to acquire a user's description of the three-dimensional content intended to be generated, and an initial three-dimensional model for generating the three-dimensional content; a semantic module 620 configured to extract semantic information characterizing the three-dimensional content as target feature information based on the description of the three-dimensional content; a rendering module 630 configured to render based on the initial three-dimensional model to obtain a plurality of rendered images; a feature extraction module 640 configured to extract image feature information of each of the plurality of rendered images, respectively; a difference calculation module 650 configured to calculate differences between the image feature information and the target feature information of each of the plurality of rendered images, respectively; and a model adjustment module 660 configured to adjust the initial three-dimensional model to correspond to the three-dimensional content based on the differences.
Since the acquisition module 610, the semantic module 620, the rendering module 630, the feature extraction module 640, the difference calculation module 650, and the model adjustment module 660 in the three-dimensional content generating apparatus 600 may correspond to steps S201 to S206 as described in fig. 2, respectively, details of their various aspects will not be described herein.
In addition, the three-dimensional content generating apparatus 600 and the modules included therein may also include further sub-modules, which will be described in detail below in connection with fig. 7.
According to the embodiments of the present disclosure, three-dimensional content corresponding to a user intention can be interactively generated based on the user intention, wherein an initial three-dimensional model can be finally adjusted to the three-dimensional content conforming to the user intention by representing the user intention as target feature information and determining a gap between a rendering result and the target feature information. In addition, since the gap between the rendering result and the target feature information is determined by means of the image feature information of each of the plurality of rendering images of the rendering result in the rendering process, it is possible to ensure accurate generation of three-dimensional content corresponding to the user's intention.
Fig. 7 shows a block diagram of an apparatus 700 for three-dimensional content generation according to another embodiment of the present disclosure.
As shown in fig. 7, the apparatus 700 for three-dimensional content generation may include an acquisition module 710, a semantic module 720, a rendering module 730, a feature extraction module 740, a difference calculation module 750, and a model adjustment module 760. The acquisition module 710, the semantic module 720, the rendering module 730, the feature extraction module 740, the difference calculation module 750, and the model adjustment module 760 may correspond to the acquisition module 610, the semantic module 620, the rendering module 630, the feature extraction module 640, the difference calculation module 650, and the model adjustment module 660 shown in fig. 6, and thus their details will not be repeated herein.
In an example, the description of the three-dimensional content may include at least one of a textual description or a phonetic description.
Therefore, the way in which the user describes the three-dimensional content intended to be generated is not limited, which makes the apparatus well suited to various application scenarios such as AR games, social networking, and content authoring; the description of the intention can be provided through either text input or voice input, bringing convenience to interactive three-dimensional content generation.
In an example, the initial three-dimensional model may include a rotating body or a polyhedron.
Thus, by using a structurally simple rotator or polyhedron as an initial three-dimensional model for generating three-dimensional contents, it is possible to facilitate subsequent adjustment of the three-dimensional model.
In an example, rendering module 730 may include: a perspective rendering module 731 configured to render at a random or predetermined plurality of perspectives based on the initial three-dimensional model; and a rendering generation module 732 configured to generate a plurality of rendered images corresponding to the plurality of perspectives, respectively.
By using rendered images carrying viewing-angle information as the rendering result, the effect of the current rendering can be accurately reflected and adjusted towards the user's intention, thereby ensuring accurate generation of three-dimensional content conforming to the user's intention.
In an example, the apparatus 700 for three-dimensional content generation may further include: a projective transformation module 770 configured to projectively transform the plurality of rendered images to determine overlapping areas between the plurality of rendered images; a consistency determination module 780 configured to determine whether the plurality of rendered images are consistent in the overlapping region; and a difference adding module 790 configured to add information indicative of the inconsistency to the difference in response to determining that the plurality of rendered images are inconsistent in the overlapping area.
Therefore, it can be checked whether the rendering results of the plurality of rendered images of the same three-dimensional model differ from one another, and the current rendering effect can be indirectly verified, ensuring adjustment toward the user's intention so that three-dimensional content conforming to the user's intention can be accurately generated.
In an example, the overlapping region may include at least one pixel, and information indicating the inconsistency is associated with at least one of a color or a shape of the pixel.
Thus, by judging consistency through the color or shape of pixels, the current rendering result can be verified in a simple manner, making it easier to adjust the rendering.
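A minimal sketch of the color part of this consistency check is given below. It assumes the rendered images have already been projectively transformed into a common frame and that a boolean overlap mask is available; the tolerance value is an illustrative assumption, and the shape comparison mentioned above (for example of silhouettes) is not shown.

```python
# Hedged sketch: measure how strongly two already-aligned rendered images disagree in
# color within their overlap region. Any positive value could be added to the overall
# difference as information indicating the inconsistency.
import numpy as np

def overlap_color_inconsistency(img_a: np.ndarray, img_b: np.ndarray,
                                overlap_mask: np.ndarray, tol: float = 0.05) -> float:
    """Mean per-pixel color discrepancy of overlap pixels that differ by more than `tol`."""
    diff = np.abs(img_a.astype(np.float32) - img_b.astype(np.float32)).mean(axis=-1)
    diff = np.where(overlap_mask, diff, 0.0)
    inconsistent = diff > tol
    if not inconsistent.any():
        return 0.0  # the views agree in the overlap region
    return float(diff[inconsistent].mean())
```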
In an example, rendering module 730 may include: the resolution incrementing module 733 is configured to perform rendering with at least two resolutions incremented such that a plurality of rendered images acquired based on any one of the at least two resolutions contain three-dimensional model details corresponding to the resolution.
Thus, by combining coarse and fine adjustment in this way, the neural network can converge quickly, the computational complexity is reduced, and the throughput of the neural network is improved.
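One hypothetical way to schedule such coarse-to-fine rendering is sketched below; the specific resolutions and the linear staging are assumptions made for illustration, not values taken from the embodiments.

```python
# A minimal sketch of a coarse-to-fine resolution schedule: early optimization steps
# render small images (coarse overall shape), later steps render larger images
# (finer three-dimensional model details). The resolutions used here are illustrative.
def resolution_for_step(step: int, total_steps: int,
                        resolutions=(64, 128, 256)) -> int:
    """Pick the rendering resolution for the current optimization step."""
    stage = min(int(step / total_steps * len(resolutions)), len(resolutions) - 1)
    return resolutions[stage]

# e.g. with 300 total steps: steps 0-99 render at 64, 100-199 at 128, 200-299 at 256
```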
In an example, rendering may be performed based on a neural network for differentiable rendering, and the differences may include a loss function for training the neural network.
Thus, by performing rendering with a neural network for differentiable rendering and iterating with the difference obtained at each rendering as the loss function, three-dimensional content matching the user's intent can be rendered once training of the neural network is complete; in other words, inference of the neural network is accomplished at the same time.
In an example, the model adjustment module 760 may include: an offset adjustment module 761 configured to iteratively adjust an offset of points on the initial three-dimensional model based on the loss function to deform the initial three-dimensional model to correspond to the three-dimensional content.
Thus, the rendering result can be steered toward the user's intent by means of the neural network's loss function, so that when training of the neural network is complete, a three-dimensional model corresponding to the three-dimensional content desired by the user is obtained at the same time.
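The sketch below illustrates the general pattern of optimizing per-vertex offsets by gradient descent on a loss computed from a differentiable rendering. To stay self-contained it replaces the differentiable renderer with a toy soft projection of vertices to a 2D occupancy map and uses a mean-squared-error loss as a stand-in for the feature difference; all names, shapes, and hyperparameters are assumptions, not the implementation of this disclosure.

```python
# Heavily simplified, runnable sketch of iteratively adjusting vertex offsets.
# A real pipeline would use a differentiable rasterizer and the image/text feature
# difference as the loss; here both are replaced by toy stand-ins.
import torch

base_verts = torch.randn(500, 3) * 0.5                  # stand-in for the initial model's vertices
offsets = torch.zeros_like(base_verts, requires_grad=True)
optimizer = torch.optim.Adam([offsets], lr=1e-2)

def toy_soft_render(verts: torch.Tensor, res: int = 64) -> torch.Tensor:
    """Splat vertices into a soft res x res occupancy image (differentiable)."""
    grid = torch.stack(torch.meshgrid(torch.linspace(-1, 1, res),
                                      torch.linspace(-1, 1, res),
                                      indexing="ij"), dim=-1)
    xy = verts[:, :2]                                    # orthographic projection to 2D
    d2 = ((grid.view(-1, 1, 2) - xy.view(1, -1, 2)) ** 2).sum(-1)
    return torch.exp(-d2 / 0.01).sum(dim=1).view(res, res)

target = toy_soft_render(base_verts * 1.5).detach()      # toy stand-in for the target

for step in range(100):
    optimizer.zero_grad()
    image = toy_soft_render(base_verts + offsets)        # render the deformed model
    loss = torch.nn.functional.mse_loss(image, target)   # stand-in for the feature difference
    loss.backward()                                      # gradients flow back to the offsets
    optimizer.step()
```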
In an example, the extraction of semantic information as target feature information and the extraction of image feature information may both be performed based on a cross-modal neural network.
Thus, using a single cross-modal neural network facilitates interactive three-dimensional content generation, since separate neural networks for natural language processing and for image feature extraction are no longer required, which simplifies the overall method.
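As one possible illustration (the present disclosure does not name a specific model), the sketch below uses an open-source CLIP checkpoint via the Hugging Face transformers library as the cross-modal network, embedding a text description and a rendered image into the same feature space and scoring their difference with cosine distance.

```python
# Hedged example: a single cross-modal model extracts both the target feature
# information (from text) and the image feature information (from a rendered image).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def feature_difference(text: str, rendered: Image.Image) -> float:
    """Return 1 - cosine similarity between the text features and the image features."""
    inputs = processor(text=[text], images=rendered, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                            attention_mask=inputs["attention_mask"])
        image_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    sim = torch.nn.functional.cosine_similarity(text_feat, image_feat)
    return float(1.0 - sim)
```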
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information comply with the relevant laws and regulations and do not violate public order and good morals.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the embodiments described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method in the above-described embodiments.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method in the above embodiments.
According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
Referring to fig. 8, a block diagram of an electronic device 800, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 807 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 808 may include, but is not limited to, magnetic disks and optical disks. The communication unit 809 allows the device 800 to exchange information/data with other devices over computer networks such as the internet and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, 802.11 devices, Wi-Fi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, a three-dimensional content generation method. For example, in some embodiments, the three-dimensional content generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the three-dimensional content generation method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the three-dimensional content generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples, but is defined only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. It should be understood that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (23)

1. A three-dimensional content generation method, comprising:
acquiring a description of three-dimensional content that a user intends to generate, and an initial three-dimensional model for generating the three-dimensional content;
extracting semantic information representing the three-dimensional content as target feature information based on the description of the three-dimensional content;
rendering based on the initial three-dimensional model to obtain a plurality of rendered images;
extracting image feature information of each of the plurality of rendered images respectively;
respectively calculating differences between the image feature information of each of the plurality of rendered images and the target feature information; and
adjusting, based on the differences, the initial three-dimensional model to correspond to the three-dimensional content.
2. The method of claim 1, wherein the description of the three-dimensional content comprises at least one of a text description or a voice description.
3. The method of claim 1 or 2, wherein the initial three-dimensional model comprises a rotating body or a polyhedron.
4. A method according to any one of claims 1 to 3, wherein the rendering based on the initial three-dimensional model to obtain a plurality of rendered images comprises:
rendering at a random or predetermined plurality of perspectives based on the initial three-dimensional model; and
the plurality of rendered images respectively corresponding to the plurality of perspectives are generated.
5. The method of any one of claims 1 to 4, further comprising:
performing projective transformation on the plurality of rendered images to determine overlapping areas between the plurality of rendered images;
determining whether the plurality of rendered images are consistent in the overlapping region; and
in response to determining that the plurality of rendered images are inconsistent in the overlapping region, adding information indicative of the inconsistency to the differences.
6. The method of claim 5, wherein the overlap region comprises at least one pixel, and the information indicative of the inconsistency is associated with at least one of a color or a shape of the pixel.
7. The method of any of claims 1-6, wherein the rendering based on the initial three-dimensional model to obtain a plurality of rendered images comprises:
the rendering is performed with at least two increasing resolutions such that the plurality of rendered images acquired based on any one of the at least two resolutions contain three-dimensional model details corresponding to that resolution.
8. The method of any one of claims 1 to 7, wherein the rendering is performed based on a neural network for differentiable rendering, and the difference comprises a loss function for training the neural network.
9. The method of claim 8, wherein the adjusting the initial three-dimensional model to correspond to the three-dimensional content based on the differences comprises:
iteratively adjusting an offset of points on the initial three-dimensional model based on the loss function to deform the initial three-dimensional model to correspond to the three-dimensional content.
10. The method of any of claims 1 to 9, wherein the extracting semantic information as target feature information and the extracting image feature information are performed based on a cross-modal neural network.
11. A three-dimensional content generating apparatus comprising:
an acquisition module configured to acquire a description of a three-dimensional content intended to be generated by a user, and an initial three-dimensional model for generating the three-dimensional content;
a semantic module configured to extract semantic information characterizing the three-dimensional content as target feature information based on the description of the three-dimensional content;
a rendering module configured to render based on the initial three-dimensional model to obtain a plurality of rendered images;
a feature extraction module configured to extract image feature information of each of the plurality of rendered images, respectively;
a difference calculation module configured to calculate differences between the image feature information and the target feature information of each of the plurality of rendered images, respectively; and
a model adjustment module configured to adjust the initial three-dimensional model to correspond to the three-dimensional content based on the differences.
12. The apparatus of claim 11, wherein the description of the three-dimensional content comprises at least one of a text description or a voice description.
13. The apparatus of claim 11 or 12, wherein the initial three-dimensional model comprises a rotating body or a polyhedron.
14. The apparatus of any of claims 11 to 13, wherein the rendering module comprises:
a perspective rendering module configured to render at a random or predetermined plurality of perspectives based on the initial three-dimensional model; and
a rendering generation module configured to generate the plurality of rendered images respectively corresponding to the plurality of perspectives.
15. The apparatus of any of claims 11 to 14, further comprising:
a projective transformation module configured to projectively transform the plurality of rendered images to determine overlapping areas between the plurality of rendered images;
a consistency determination module configured to determine whether the plurality of rendered images are consistent in the overlapping region; and
a difference adding module configured to, in response to determining that the plurality of rendered images are inconsistent in the overlapping region, add information indicative of the inconsistency to the difference.
16. The apparatus of claim 15, wherein the overlap region comprises at least one pixel, and the information indicative of the inconsistency is associated with at least one of a color or a shape of the pixel.
17. The apparatus of any of claims 11 to 16, wherein the rendering module comprises:
a resolution incrementing module configured to perform the rendering with at least two increasing resolutions such that the plurality of rendered images acquired based on any one of the at least two resolutions contain three-dimensional model details corresponding to that resolution.
18. The apparatus of any one of claims 11 to 17, wherein the rendering is performed based on a neural network for differentiable rendering, and the difference comprises a loss function for training the neural network.
19. The apparatus of claim 18, wherein the model adjustment module comprises:
an offset adjustment module configured to iteratively adjust an offset of points on the initial three-dimensional model based on the loss function to deform the initial three-dimensional model to correspond to the three-dimensional content.
20. The apparatus of any of claims 11 to 19, wherein the extracting semantic information as target feature information and the extracting image feature information are performed based on a cross-modal neural network.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any of claims 1-10.
CN202211612954.9A 2022-12-15 2022-12-15 Three-dimensional content generation method and device and electronic equipment Active CN116051729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211612954.9A CN116051729B (en) 2022-12-15 2022-12-15 Three-dimensional content generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN116051729A true CN116051729A (en) 2023-05-02
CN116051729B CN116051729B (en) 2024-02-13

Family

ID=86112378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211612954.9A Active CN116051729B (en) 2022-12-15 2022-12-15 Three-dimensional content generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116051729B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6549200B1 (en) * 1997-06-17 2003-04-15 British Telecommunications Public Limited Company Generating an image of a three-dimensional object
JP2016001386A (en) * 2014-06-11 2016-01-07 日本電信電話株式会社 Image generation method, image generation device, computer program and recording medium
CN109961517A (en) * 2019-03-01 2019-07-02 浙江大学 A kind of triangle gridding weight parametric method for parametric surface fitting
WO2021114479A1 (en) * 2019-12-11 2021-06-17 清华大学 Three-dimensional display system and method for sound control building information model
US10930066B1 (en) * 2020-09-11 2021-02-23 Mythical, Inc. Systems and methods for using natural language processing (NLP) to automatically generate three-dimensional objects in a virtual space
CN112417539A (en) * 2020-11-16 2021-02-26 杭州群核信息技术有限公司 Method, device and system for designing house type based on language description
CN112419487A (en) * 2020-12-02 2021-02-26 网易(杭州)网络有限公司 Three-dimensional hair reconstruction method and device, electronic equipment and storage medium
CN113220251A (en) * 2021-05-18 2021-08-06 北京达佳互联信息技术有限公司 Object display method, device, electronic equipment and storage medium
CN113781640A (en) * 2021-09-27 2021-12-10 华中科技大学 Three-dimensional face reconstruction model establishing method based on weak supervised learning and application thereof
CN115082639A (en) * 2022-06-15 2022-09-20 北京百度网讯科技有限公司 Image generation method and device, electronic equipment and storage medium
CN115100339A (en) * 2022-06-15 2022-09-23 北京百度网讯科技有限公司 Image generation method and device, electronic equipment and storage medium
CN114842123A (en) * 2022-06-28 2022-08-02 北京百度网讯科技有限公司 Three-dimensional face reconstruction model training and three-dimensional face image generation method and device
CN115330936A (en) * 2022-08-02 2022-11-11 荣耀终端有限公司 Method and device for synthesizing three-dimensional image and electronic equipment
CN115147558A (en) * 2022-08-31 2022-10-04 北京百度网讯科技有限公司 Training method of three-dimensional reconstruction model, three-dimensional reconstruction method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310148A (en) * 2023-05-17 2023-06-23 山东捷瑞数字科技股份有限公司 Digital twin three-dimensional scene construction method, device, equipment and medium
CN116310148B (en) * 2023-05-17 2023-09-12 山东捷瑞数字科技股份有限公司 Digital twin three-dimensional scene construction method, device, equipment and medium
CN116843833A (en) * 2023-06-30 2023-10-03 北京百度网讯科技有限公司 Three-dimensional model generation method and device and electronic equipment
CN117315149A (en) * 2023-09-26 2023-12-29 北京智象未来科技有限公司 Three-dimensional object generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN116051729B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN116051729B (en) Three-dimensional content generation method and device and electronic equipment
CN113313650B (en) Image quality enhancement method, device, equipment and medium
CN115147558A (en) Training method of three-dimensional reconstruction model, three-dimensional reconstruction method and device
CN115239888B (en) Method, device, electronic equipment and medium for reconstructing three-dimensional face image
CN117274491A (en) Training method, device, equipment and medium for three-dimensional reconstruction model
EP3855386B1 (en) Method, apparatus, device and storage medium for transforming hairstyle and computer program product
CN116245998B (en) Rendering map generation method and device, and model training method and device
CN115661375B (en) Three-dimensional hair style generation method and device, electronic equipment and storage medium
CN115482325B (en) Picture rendering method, device, system, equipment and medium
CN115578515B (en) Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device
CN116030185A (en) Three-dimensional hairline generating method and model training method
CN115393514A (en) Training method of three-dimensional reconstruction model, three-dimensional reconstruction method, device and equipment
CN114049472A (en) Three-dimensional model adjustment method, device, electronic apparatus, and medium
CN115761855B (en) Face key point information generation, neural network training and three-dimensional face reconstruction method
CN115511779B (en) Image detection method, device, electronic equipment and storage medium
CN115331077B (en) Training method of feature extraction model, target classification method, device and equipment
CN115797455B (en) Target detection method, device, electronic equipment and storage medium
CN116030191B (en) Method, device, equipment and medium for displaying virtual object
CN116385641B (en) Image processing method and device, electronic equipment and storage medium
CN115131562B (en) Three-dimensional scene segmentation method, model training method, device and electronic equipment
CN115345981B (en) Image processing method, image processing device, electronic equipment and storage medium
CN116580212B (en) Image generation method, training method, device and equipment of image generation model
CN115937430B (en) Method, device, equipment and medium for displaying virtual object
CN115359194B (en) Image processing method, image processing device, electronic equipment and storage medium
CN114120412B (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant