CN116245998B - Rendering map generation method and device, and model training method and device - Google Patents

Rendering map generation method and device, and model training method and device

Info

Publication number
CN116245998B
CN116245998B
Authority
CN
China
Prior art keywords
feature
sub
information
map
network
Prior art date
Legal status
Active
Application number
CN202310519942.XA
Other languages
Chinese (zh)
Other versions
CN116245998A (en)
Inventor
李�杰
陈睿智
张岩
赵晨
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310519942.XA
Publication of CN116245998A
Application granted
Publication of CN116245998B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/005: General purpose rendering architectures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/04: Texture mapping
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/54: Extraction of image or video features relating to texture
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/50: Maintenance of biometric data or enrolment thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure provides a rendering map generation method and device, and a model training method and device. It relates to the technical field of artificial intelligence, in particular to computer vision, augmented reality, virtual reality, and deep learning, and can be applied to scenarios such as the metaverse and digital humans. The implementation scheme is as follows: performing first feature extraction on identification information of a target object to obtain a first feature map, wherein the identification information contains original texture feature information of the target object; performing second feature extraction on surface normal information of the target object and scene information of a virtual scene to obtain a second feature map, wherein the scene information includes illumination information; and generating a rendering map of the target object in the virtual scene based on the first feature map and the second feature map. In this way, the amount of computation can be reduced and computational efficiency improved while ensuring a hyper-realistic rendering effect for the target object.

Description

Rendering map generation method and device, and model training method and device
Technical Field
The present disclosure relates to the technical field of artificial intelligence, in particular to computer vision, augmented reality, virtual reality, deep learning, and the like, and can be applied to scenarios such as the metaverse and digital humans. More specifically, it relates to a rendering map generation method and device, a model training method and device, an electronic device, and a computer-readable storage medium.
Background
Artificial intelligence is the discipline that studies how to make a computer mimic certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Virtual digital humans are one of the key elements for creating a metaverse virtual world. Depending on business requirements, digital humans can be classified as two-dimensional, three-dimensional, cartoon, realistic, hyper-realistic, and so on. At present, designing a common high-quality virtual digital human (avatar) requires professional animators to perform geometric modeling, texture mapping, illumination mapping, and other specialized optimization work on the avatar, so as to build a basic avatar that meets business requirements.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a rendering map generation method and apparatus, a model training method and apparatus, an electronic device, and a computer-readable storage medium.
According to an aspect of the present disclosure, there is provided a rendering map generation method, including: performing first feature extraction on identification information of a target object to obtain a first feature map, wherein the identification information contains original texture feature information of the target object; performing second feature extraction on surface normal information of the target object and scene information of a virtual scene to obtain a second feature map, wherein the scene information includes illumination information; and generating a rendering map of the target object in the virtual scene based on the first feature map and the second feature map.
According to another aspect of the present disclosure, there is provided a model training method, the model including a first feature extraction network, a second feature extraction network, and a generation network, the model training method including: acquiring sample data, wherein the sample data includes identification information of a sample object, surface normal information, scene information of a sample scene, and a sample rendering map of the sample object in the sample scene, the identification information includes original texture feature information of the sample object, and the scene information includes illumination information; inputting the identification information into the first feature extraction network to obtain a first feature map output by the first feature extraction network; inputting the surface normal information and the scene information into the second feature extraction network to obtain a second feature map output by the second feature extraction network; inputting at least the first feature map and the second feature map into the generation network to obtain a rendering map prediction result of the sample object in the sample scene output by the generation network; and adjusting parameters of the model based on the rendering map prediction result and the sample rendering map.
According to another aspect of the present disclosure, there is provided a rendering map generation apparatus, including: a first acquisition unit configured to perform first feature extraction on identification information of a target object to obtain a first feature map, wherein the identification information includes original texture feature information of the target object; a second acquisition unit configured to perform second feature extraction on surface normal information of the target object and scene information of a virtual scene to obtain a second feature map, the scene information including illumination information; and a generation unit configured to generate a rendering map of the target object in the virtual scene based on the first feature map and the second feature map.
According to another aspect of the present disclosure, there is provided a model training apparatus, the model including a first feature extraction network, a second feature extraction network, and a generation network, the model training apparatus including: a third acquisition unit configured to acquire sample data, the sample data including identification information of a sample object, surface normal information, scene information of a sample scene, and a sample rendering map of the sample object in the sample scene, the identification information including original texture feature information of the sample object, and the scene information including illumination information; a fourth acquisition unit configured to input the identification information into the first feature extraction network to obtain a first feature map output by the first feature extraction network; a fifth acquisition unit configured to input the surface normal information and the scene information into the second feature extraction network to obtain a second feature map output by the second feature extraction network; a sixth acquisition unit configured to input at least the first feature map and the second feature map into the generation network to obtain a rendering map prediction result of the sample object in the sample scene output by the generation network; and an adjustment unit configured to adjust parameters of the model based on the rendering map prediction result and the sample rendering map.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the rendering map generation method or the model training method.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the above-described rendering map generation method or the above-described model training method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described rendering map generation method or the above-described model training method.
According to one or more embodiments of the present disclosure, the amount of computation can be reduced and computational efficiency can be improved while ensuring a hyper-realistic rendering effect for the target object.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a rendering map generation method according to an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of a method of obtaining a second feature map in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a block diagram of a first feature extraction network, a third feature extraction network, and a generation network in a rendering map generation model according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a block diagram of a second feature extraction network in a rendering map generation model in accordance with an exemplary embodiment of the present disclosure;
FIG. 6 illustrates a flow chart of a model training method according to an embodiment of the present disclosure;
FIG. 7 shows a block diagram of a rendering map generating apparatus according to an embodiment of the present disclosure;
FIG. 8 shows a block diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 9 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another element. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, server 120 may run one or more services or software applications that enable execution of the rendering map generation method described above or the model training method described above.
In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use the client devices 101, 102, 103, 104, 105, and/or 106 to apply the above-described rendering map generation method for real-time rendering of avatars in virtual scenes. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architectures that involve virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The databases 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and communicate with the server 120 via a network-based or dedicated connection. The databases 130 may be of different types. In some embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the databases in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 2, there is provided a rendering map generating method, including:
step S201, performing first feature extraction on identification information of a target object to obtain a first feature map, wherein the identification information includes original texture feature information of the target object;
step S202, performing second feature extraction on surface normal information of the target object and scene information of a virtual scene to obtain a second feature map, wherein the scene information includes illumination information; and
step S203, generating a rendering map of the target object in the virtual scene based on the first feature map and the second feature map.
In this method, first feature extraction is performed on the pre-stored identification information of the target object to obtain a first feature map; second feature extraction is performed on the surface normal information and the scene information to obtain a second feature map containing the state information of the current target object (carried by the surface normal information) and the scene information; and a rendering map is then generated based on the first feature map and the second feature map. In this way, the amount of computation can be reduced and computational efficiency improved while ensuring a hyper-realistic rendering effect for the target object.
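As an orientation aid, the following is a minimal sketch of how these three steps chain together, assuming PyTorch; first_net, second_net, and gen_net are hypothetical stand-ins for the first feature extraction network, the second feature extraction network, and the generation network described later, not the patent's reference implementation.

```python
import torch

def generate_rendering_map(identification_info: torch.Tensor,
                           surface_normals: torch.Tensor,
                           scene_info: torch.Tensor,
                           first_net, second_net, gen_net) -> torch.Tensor:
    # identification_info: pre-stored parameter matrix of the target object (original texture features)
    # surface_normals: normal information of the driven three-dimensional model
    # scene_info: illumination (and optionally camera/pose) information of the virtual scene
    first_feature_map = first_net(identification_info)             # step S201
    second_feature_map = second_net(surface_normals, scene_info)   # step S202
    return gen_net(first_feature_map, second_feature_map)          # step S203
```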
In some embodiments, the target object may be an avatar controlled by a user in the virtual scene through a virtual reality device.
In some embodiments, the identification information may be used to represent original features of the three-dimensional model of the target object, such as the shape of the facial features, the shape of the limbs, and the texture information of the three-dimensional model before scene information is introduced.
In some embodiments, the identification information may be stored in the form of a parameter matrix.
In some embodiments, each row or each column in the parameter matrix of the identification information may correspond to corresponding semantic information (e.g., face shape, facial feature shape, face texture, limb texture, etc.).
In some embodiments, the identification information is obtained by feature encoding an original texture map of the target object.
Therefore, the original texture map of the target object can be pre-stored in a parameterized form through feature coding, so that the calculation cost is further saved, and the calculation efficiency is improved.
In some embodiments, the feature encoding matrix output by the feature encoding model may be obtained by inputting the original texture map of the target object into the trained feature encoding model.
In some embodiments, the dimension of the feature encoding matrix may be 512×512, for example.
In some embodiments, the feature encoding model described above may apply a Pixel2Style2Pixel (pSp) model or a 3DMM parameterized model.
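A hedged sketch of this pre-computation step is shown below; TextureEncoder is an illustrative stand-in for a pSp- or 3DMM-style encoder, and its layers, the single-channel output, and the 512×512 size are assumptions used only to show how the identification information can be produced once and then stored.

```python
import torch
import torch.nn as nn

class TextureEncoder(nn.Module):
    # Illustrative encoder that turns the original texture map into a parameter matrix.
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),  # collapse to a single 512x512 feature encoding matrix
        )

    def forward(self, texture_map: torch.Tensor) -> torch.Tensor:
        # texture_map: (1, 3, 512, 512) original texture map of the target object in UV space
        return self.net(texture_map)

encoder = TextureEncoder()
identification_info = encoder(torch.rand(1, 3, 512, 512))  # computed once, then pre-stored
```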
In some embodiments, the identification information may be first subjected to feature extraction, thereby obtaining a first feature map.
In some embodiments, performing the first feature extraction on the identification information of the target object to obtain the first feature map may include: performing fourth sub-feature extraction on the identification information to obtain a plurality of fourth sub-feature maps, wherein the plurality of fourth sub-feature maps correspond to a plurality of different resolutions respectively; and performing feature fusion based on the plurality of fourth sub-feature maps to obtain the first feature map.
In this way, fourth sub-feature extraction is performed on the identification information to obtain fourth sub-feature maps at different resolutions, so that feature maps of different sizes capture both the detail information and the global information in the identification information, and this information is integrated into the first feature map. Richer information about the original shape, original texture, and the like of the target object can therefore be extracted, further improving the quality of the generated rendering map.
In some embodiments, the fourth sub-feature extraction may be a multi-level feature extraction and pooling operation on the feature encoding matrix corresponding to the identification information, where, after every one or several feature extraction steps, the feature map is downsampled by a pooling operation and feature extraction continues on the downsampled feature map, so as to obtain multiple fourth sub-feature maps with different resolutions.
Subsequently, feature fusion may be performed on the plurality of fourth sub-feature maps, thereby obtaining the first feature map.
In some embodiments, the fourth sub-feature map with the smallest resolution may be upsampled to restore it to the resolution of the previous stage, so as to obtain a first intermediate feature map; the fourth sub-feature map with the matching resolution is then superimposed on the first intermediate feature map, feature extraction is performed one or more times on the superimposed features, and the resulting fused features are upsampled again, and so on until the fused feature corresponding to the largest of the resolutions, namely the first feature map, is obtained.
In some embodiments, each of the one or more feature extraction or feature fusion operations may be performed by convolving the feature map with a 3×3 convolution kernel and then outputting the features through a ReLU activation function.
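The sketch below illustrates this multi-resolution extraction and fusion for the identification information, assuming PyTorch; the channel count, the number of levels, and the pooling operator are illustrative assumptions, and unlike the exemplary embodiment described later (which stops fusion at 256×256), this simplified version fuses back to the input resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_relu(in_ch: int, out_ch: int) -> nn.Module:
    # One feature extraction / fusion step: 3x3 convolution followed by a ReLU activation.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

class FirstFeatureExtractor(nn.Module):
    def __init__(self, ch: int = 8, levels: int = 4):
        super().__init__()
        self.encoders = nn.ModuleList(
            [conv_relu(1 if i == 0 else ch, ch) for i in range(levels)])
        self.fusers = nn.ModuleList([conv_relu(2 * ch, ch) for _ in range(levels - 1)])

    def forward(self, identification_info: torch.Tensor) -> torch.Tensor:
        # Fourth sub-feature maps at progressively lower resolutions.
        sub_maps, x = [], identification_info
        for i, encoder in enumerate(self.encoders):
            x = encoder(x)
            sub_maps.append(x)
            if i < len(self.encoders) - 1:
                x = F.avg_pool2d(x, 2)          # downsample before the next extraction stage
        # Fuse back up: upsample, superimpose the matching-resolution sub-feature map, conv + ReLU.
        fused = sub_maps[-1]
        for skip, fuse in zip(reversed(sub_maps[:-1]), self.fusers):
            fused = F.interpolate(fused, scale_factor=2, mode="nearest")
            fused = fuse(torch.cat([fused, skip], dim=1))
        return fused                            # first feature map

first_feature_map = FirstFeatureExtractor()(torch.rand(1, 1, 512, 512))
```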
In some embodiments, the state (e.g., motion, expression, etc.) of the target object is changed in real-time based on user real-time control.
In some embodiments, the motion driving coefficients and expression driving coefficients of the user can be obtained through the virtual reality device, the three-dimensional model of the target object in its current state can then be obtained based on these coefficients, and the surface normal information of the three-dimensional model can in turn be obtained from that model. In some embodiments, a normal map of the three-dimensional model of the target object may further be obtained based on the above normal information.
In some embodiments, the normal map may be, for example, a normal map in a world coordinate system. It will be appreciated that the surface normal information described above may also be stored in other forms (e.g., in single or multiple channels), without limitation.
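As a hedged illustration of where the surface normal information can come from, the sketch below computes per-vertex normals of the driven mesh with a cross product; rasterizing them into a UV-space (or world-space) normal map is omitted, and the function is an assumption rather than the patent's exact procedure.

```python
import torch
import torch.nn.functional as F

def vertex_normals(vertices: torch.Tensor, faces: torch.Tensor) -> torch.Tensor:
    # vertices: (V, 3) positions of the driven three-dimensional model; faces: (F, 3) vertex indices.
    v0, v1, v2 = vertices[faces[:, 0]], vertices[faces[:, 1]], vertices[faces[:, 2]]
    face_normals = torch.cross(v1 - v0, v2 - v0, dim=1)   # un-normalized per-face normals
    normals = torch.zeros_like(vertices)
    for k in range(3):                                     # accumulate each face normal to its vertices
        normals.index_add_(0, faces[:, k], face_normals)
    return F.normalize(normals, dim=1)                     # unit surface normals per vertex

# A triangle lying in the XY plane yields the normal (0, 0, 1) at every vertex.
normals = vertex_normals(torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]]),
                         torch.tensor([[0, 1, 2]]))
```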
In some embodiments, the virtual scene includes a virtual camera for observing the target object, and the scene information further includes at least one of: perspective information of the virtual camera relative to the target object, position information and pose information of the target object relative to the virtual camera.
In some embodiments, the perspective information of the virtual camera with respect to the target object, the position information and the attitude information of the target object with respect to the virtual camera may be represented by position coordinates or rotation coordinates.
In some embodiments, the position coordinates or rotation coordinates may be stored in three channels, respectively.
In this way, the perspective information of the virtual camera and the pose information of the target object can further be introduced, and the quality of the generated rendering map is improved through richer scene information (for example, the influence of light at different viewing angles on the texture color of the target object can be further reflected).
In some embodiments, the illumination information may include at least one of illumination intensity information, light source color information, and light source position information.
The illumination intensity information may include the illumination intensity produced by the light source in the virtual scene at each point on the surface of the three-dimensional model of the target object in its current state. The illumination intensity may be determined based on the illuminance of the light source itself and the distance of the light source from each point on the surface of the three-dimensional model. The light source color information may be represented by pixel values (e.g., RGB values), and the light source position information may be represented by the coordinates of the light source in the world coordinate system of the virtual scene.
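As one illustrative way of combining these two factors (an assumption chosen for concreteness; the patent does not prescribe a particular attenuation model), an inverse-square falloff could be written as

I(p) = \frac{I_0}{\lVert p - p_{\mathrm{light}} \rVert^{2}},

where I_0 is the illuminance of the light source itself, p is a point on the surface of the three-dimensional model, and p_{\mathrm{light}} is the position of the light source.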
In some embodiments, the illumination intensity may be saved as single channel information, and the light source color information and the light source position information may be saved as three channel information, respectively. It will be appreciated that the above information may also be stored in other forms (e.g., in single or multiple channels), without limitation.
Therefore, the generation effect of the rendering map can be further improved by introducing illumination information such as illumination intensity information, light source color information, light source position information and the like.
In some embodiments, the illumination information may only include illumination intensity information, so that by selecting the single channel information to perform feature extraction and rendering map generation, the calculation amount can be further reduced and the calculation efficiency can be improved while the rendering effect is ensured.
In some embodiments, the virtual scene may include a plurality of light sources, each light source of the plurality of light sources includes corresponding first illumination information, and the rendering map generating method may further include: clustering the plurality of light sources based on the first position information of each of the plurality of light sources to obtain at least one central light source; and determining at least one second illumination information corresponding to the at least one central light source based on the first illumination information corresponding to each of the plurality of light sources as illumination information.
Therefore, when a plurality of light sources are arranged in the virtual scene, the light sources can be clustered first, and feature extraction is performed based on illumination information of the central light source, so that the computing resource is further saved while the rendering quality is ensured, and the computing efficiency is improved.
In some embodiments, when the virtual scene contains a plurality of light sources within a certain range around the target object, the plurality of light sources may first be clustered, thereby obtaining several cluster centers. The respective averages of the first illumination information (e.g., one or more of illumination intensity information, light source color information, and light source position information) of the light sources in each cluster may then be computed and used as the second illumination information of that cluster's central light source.
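A minimal sketch of this clustering step is given below, assuming PyTorch and plain k-means on the light positions; the clustering algorithm, the number of clusters, and the per-cluster averaging are illustrative assumptions.

```python
import torch

def cluster_light_sources(positions: torch.Tensor, first_info: torch.Tensor,
                          k: int = 2, iters: int = 10):
    # positions: (N, 3) first position information; first_info: (N, C) first illumination information.
    centers = positions[torch.randperm(positions.shape[0])[:k]].clone()
    for _ in range(iters):                                   # a few Lloyd iterations
        assign = torch.cdist(positions, centers).argmin(dim=1)
        for c in range(k):
            if (assign == c).any():
                centers[c] = positions[assign == c].mean(dim=0)
    # Average the first illumination information inside each non-empty cluster to obtain the
    # second illumination information of that cluster's central light source.
    valid = [c for c in range(k) if (assign == c).any()]
    second_info = torch.stack([first_info[assign == c].mean(dim=0) for c in valid])
    return centers[valid], second_info

centers, second_info = cluster_light_sources(torch.rand(8, 3), torch.rand(8, 4))
```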
In some embodiments, the second feature extraction may be performed based on the surface normal information of the target object and scene information of the virtual scene, thereby obtaining a second feature map.
In some embodiments, as shown in fig. 3, performing the second feature extraction on the surface normal information of the target object and the scene information of the virtual scene to obtain the second feature map may include:
step S301, extracting first sub-features of surface normal information to obtain a first sub-feature map;
step S302, extracting second sub-features of the scene information to obtain a second sub-feature map; and
step S303, obtaining the second feature map through fusion based on the first sub-feature map and the second sub-feature map.
In this way, by separately extracting features from the surface normal information and the scene information and then performing feature fusion, richer state information of the target object and richer scene information can be extracted, further improving the quality of the generated rendering map.
In some embodiments, feature extraction may be performed first on the surface normal information and the scene information separately, and then feature fusion may be performed to obtain the second feature map.
In some embodiments, a first sub-feature extraction may be performed on surface normal information (e.g., a normal map) to obtain a first sub-feature map.
In some embodiments, the rendering map generation method may further include: performing third sub-feature extraction on the surface normal information to obtain at least one third sub-feature map, wherein the first sub-feature map and each of the at least one third sub-feature map correspond to different resolutions; and obtaining the second feature map through fusion based on the first sub-feature map and the second sub-feature map includes: acquiring a first fusion feature based on the first sub-feature map and the second sub-feature map; and performing feature fusion on the first fusion feature and the at least one third sub-feature map to obtain the second feature map.
In this way, third sub-feature maps at resolutions different from that of the second sub-feature map are further obtained through third sub-feature extraction on the surface normal information, so that detail information and global information in the surface normal information are both captured by feature maps of different sizes and integrated into the second feature map. Richer state information of the target object can therefore be extracted, further improving the quality of the generated rendering map.
In some embodiments, a third sub-feature map may be obtained by performing one or more feature extractions on the surface normal information (normal map).
In some embodiments, the third sub-feature extraction may be a multi-level feature extraction and pooling operation on the surface normal information (normal map), where, after every one or several feature extraction steps, the feature map is downsampled by a pooling operation and feature extraction continues on the downsampled feature map, so as to obtain multiple third sub-feature maps with different resolutions.
In some embodiments, the first sub-feature map may be obtained by performing one or more feature extractions after further downsampling the third sub-feature map with the smallest resolution obtained by the above method.
In some embodiments, the various items of scene information of the virtual scene may be superimposed, and the combined information may be input together into a feature extraction network to perform second sub-feature extraction, thereby obtaining a second sub-feature map.
In some embodiments, the second sub-feature map and the first sub-feature map may have the same resolution by setting up the feature extraction network.
In some embodiments, obtaining the second feature map based on the first sub-feature map and the second sub-feature map may be a weighted summation of the first sub-feature map and the second sub-feature map to obtain the second feature map.
In some embodiments, in response to acquiring the at least one third sub-feature map, a first fusion feature may be obtained by first performing a weighted summation of the first sub-feature map and the second sub-feature map; the first fusion feature is then upsampled to restore its resolution to that of the previous stage, so as to obtain a second intermediate feature map; the third sub-feature map with the matching resolution is then superimposed on the second intermediate feature map, feature fusion is performed one or more times on the superimposed features, and the resulting fused features are upsampled again, and so on until the fused feature corresponding to the largest of the resolutions, namely the second feature map, is obtained.
In some embodiments, each of the one or more feature extraction or feature fusion operations may be performed by convolving the feature map with a 3×3 convolution kernel and then outputting the features through a ReLU activation function.
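The following sketch, assuming PyTorch, shows one way the pieces described above can fit together for a single third sub-feature level: the MLP output for the scene information is broadcast to the spatial size of the first sub-feature map, the two are combined by a weighted sum into the first fusion feature, and the result is upsampled and fused with the third sub-feature map using 3×3 convolution and ReLU. The channel counts, the single skip level, and the 0.5/0.5 weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondFeatureExtractor(nn.Module):
    def __init__(self, ch: int = 8, scene_dim: int = 7):
        super().__init__()
        self.first_sub = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())  # on the normal map
        self.third_sub = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())  # higher-resolution skip
        self.scene_mlp = nn.Sequential(nn.Linear(scene_dim, ch), nn.ReLU())        # second sub-features
        self.fuse = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, normal_map: torch.Tensor, scene_info: torch.Tensor) -> torch.Tensor:
        third = self.third_sub(normal_map)                      # third sub-feature map (full resolution)
        first = self.first_sub(F.avg_pool2d(normal_map, 2))     # first sub-feature map (lower resolution)
        second = self.scene_mlp(scene_info)[:, :, None, None]   # second sub-feature map, broadcast spatially
        fused = 0.5 * first + 0.5 * second                      # first fusion feature (weighted summation)
        fused = F.interpolate(fused, scale_factor=2, mode="nearest")
        return self.fuse(torch.cat([fused, third], dim=1))      # second feature map

second_feature_map = SecondFeatureExtractor()(torch.rand(1, 3, 64, 64), torch.rand(1, 7))
```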
In some embodiments, the first feature map and the second feature map may be input together into a trained generation network to obtain the rendering map generated by the generation network. The three-dimensional model may then be texture-mapped based on the rendering map to obtain a rendering result of the target object (avatar).
In some embodiments, the rendering map generation method may further include: performing third feature extraction on the identification information to obtain a third feature map, wherein the resolution of the third feature map is greater than that of the first feature map; and generating the rendering map of the target object in the virtual scene based on the first feature map and the second feature map further includes: generating the rendering map based on the first feature map, the second feature map, and the third feature map.
In this way, a third feature map with a resolution greater than that of the first feature map is introduced into the rendering map generation process, which further avoids losing original shape, texture, and other data of the target object during downsampling and feature extraction, and further improves the quality of the generated rendering map.
In some embodiments, a third feature extraction may also be performed on the identification information to obtain a third feature map having a resolution greater than that of the first feature map. After the first feature map and the second feature map are obtained, the two feature maps may each first be upsampled to the same resolution as the third feature map to obtain two intermediate features; the two intermediate features and the third feature map are then superimposed, at least one feature fusion is further performed, and the result is output through an output layer to obtain the rendering map.
In some embodiments, each feature fusion in the at least one feature fusion may be performed by convolving the feature map with a 3×3 convolution kernel, and then performing feature output through a ReLU activation function.
In some embodiments, the identification information, the surface normal information, the scene information, and the like may be stored in UV texture space, the feature extraction and feature fusion performed by the method are carried out in UV texture space, and the generated rendering map is likewise a texture map in UV texture space. By executing the method in UV texture space, the amount of computation can be further reduced and computational efficiency improved without affecting the rendering effect.
Fig. 4 shows a block diagram of the first feature extraction network, the third feature extraction network, and the generation network in a rendering map generation model according to an exemplary embodiment of the present disclosure.
Fig. 5 shows a block diagram of a second feature extraction network in a rendering map generation model according to an exemplary embodiment of the present disclosure.
In some exemplary embodiments, as shown in fig. 4 and 5, the rendering map generation model may include a first feature extraction network 410, a second feature extraction network 420, a third feature extraction network 430, and a generation network 440, wherein the first feature extraction network 410 may include a fourth sub-feature extraction network 411 and a second feature fusion network 412, the second feature extraction network 420 may include a first sub-feature extraction network 421, a second sub-feature extraction network 422, a third sub-feature extraction network 423, and a first feature fusion network 424, and the first feature fusion network 424 includes a first fusion sub-network 424-1 and a second fusion sub-network 424-2.
In some exemplary embodiments, the rendering map generation method of the present disclosure may include: first, a 256×256 normal map 401 is input into the first sub-feature extraction network 421, three rounds of feature extraction and global pooling are performed on the normal map 401 by the first sub-feature extraction network 421, and one further feature extraction is then performed on the resulting first intermediate feature, so as to obtain a first sub-feature map with a resolution of 32×32.
Each of the three rounds of feature extraction and global pooling, as well as the feature extraction applied to the first intermediate feature, may be implemented by performing intermediate feature extraction twice on the input feature map, where each intermediate feature extraction convolves the image with a 3×3 convolution kernel and then outputs the feature channels through a ReLU activation function. Global pooling is performed on the feature map obtained from each intermediate feature extraction, so as to obtain a lower-resolution feature map on which the next feature extraction is carried out.
In some exemplary embodiments, the third sub-feature extraction network 423 may reuse the first three feature extraction stages of the first sub-feature extraction network 421 to obtain three third sub-feature maps with progressively lower resolutions (e.g., 256×256, 128×128, and 64×64, respectively).
In some exemplary embodiments, the scene information may be input into the second sub-feature extraction network 422 to perform hidden-layer feature extraction of the scene information through the second sub-feature extraction network 422, thereby obtaining a second sub-feature map output by the second sub-feature extraction network 422. Wherein the resolution of the second sub-feature map may be the same as the first sub-feature map.
In some embodiments, the second sub-feature extraction network 422 may be constructed based on a Multi-Layer Perceptron (MLP).
In some exemplary embodiments, the first sub-feature map and the second sub-feature map may be input together into the first fusion sub-network 424-1 of the first feature fusion network 424 to obtain the first fusion feature output by the first fusion sub-network 424-1.
In some embodiments, the first fusion sub-network 424-1 may be configured to perform weighted summation on the first sub-feature map and the second sub-feature map, so as to obtain the first fusion feature, thereby saving the calculation amount and improving the calculation efficiency while completing feature fusion. It will be appreciated that the weighting coefficients of the first sub-feature map and the second sub-feature map in the weighted summation may be determined according to actual needs, and are not limited herein.
In some exemplary embodiments, the first fusion feature and the three third sub-feature maps described above may be input into the second fusion sub-network 424-2 to obtain the fused second feature map. The first fusion feature may first be upsampled to restore its resolution to match that of the lowest-resolution third sub-feature map (e.g., 64×64), so as to obtain a second intermediate feature; the third sub-feature map of the same resolution is then superimposed on the second intermediate feature, and intermediate feature fusion is performed twice on the superimposed third intermediate feature to obtain a fourth intermediate feature; the process of upsampling, superimposing the third sub-feature map of matching resolution, and intermediate feature fusion may then be repeated based on the fourth intermediate feature until a feature fusion result at 256×256 resolution is obtained as the second feature map 402 output by the second fusion sub-network 424-2.
The above intermediate feature fusion may convolve the image with a 3×3 convolution kernel and then output the feature channels through a ReLU activation function.
In some exemplary embodiments, identification information 403 (e.g., a 512 x 512 feature encoding matrix) may be input into the first feature extraction network 410 to obtain a first feature map output by the first feature extraction network 410.
Here, the identification information 403 may first be subjected to five rounds of feature extraction by the fourth sub-feature extraction network 411, so as to obtain 5 fourth sub-feature maps with progressively decreasing resolutions (for example, 512×512, 256×256, 128×128, 64×64, and 32×32, respectively). Each of the five feature extraction rounds may be implemented by performing intermediate feature extraction twice on the input feature map, where each intermediate feature extraction convolves the image with a 3×3 convolution kernel and then activates the feature channels with a ReLU function. Between two rounds of feature extraction, the obtained feature map may be globally pooled, so as to obtain a lower-resolution feature map on which the next round of feature extraction is performed.
The 5 fourth sub-feature maps may then be input into the second feature fusion network 412 and fused to obtain the first feature map. The fourth sub-feature map with the lowest resolution may first be upsampled so that its resolution is restored to match the fourth sub-feature map of the previous stage (for example, 64×64), so as to obtain a fifth intermediate feature; the fourth sub-feature map of the same resolution is then superimposed on the fifth intermediate feature, and intermediate feature fusion is performed twice on the superimposed sixth intermediate feature to obtain a seventh intermediate feature; the above process may then be repeated based on the seventh intermediate feature until a feature fusion result with a resolution of 256×256 is obtained as the first feature map output by the second feature fusion network 412.
The above intermediate feature fusion may convolve the image with a 3×3 convolution kernel and then output the feature channels through a ReLU activation function.
In some exemplary embodiments, the third feature extraction network 430 may reuse the first feature extraction stage of the fourth sub-feature extraction network 411 to obtain a third feature map with a resolution of 512×512.
In some exemplary embodiments, the first feature map, the second feature map 402, and the third feature map obtained by the above method may be input together into the generation network 440 to obtain the rendering map output by the generation network 440. The generation network 440 may first superimpose the first feature map and the second feature map 402, then upsample the result, further superimpose the upsampled feature map with the third feature map, perform intermediate feature fusion twice, and input the fused feature map into the output layer, thereby obtaining the rendering map output by the output layer.
In some embodiments, the rendering map generation model of the present disclosure can be built based on a U-Net network.
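Under the exemplary resolutions above (first and second feature maps at 256×256, third feature map at 512×512), the fusion path of the generation network 440 can be sketched as follows, assuming PyTorch; the channel counts, the use of channel concatenation for superimposing, and the 3-channel output are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerationNetwork(nn.Module):
    def __init__(self, ch: int = 8):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * ch, ch, 3, padding=1), nn.ReLU(),   # first intermediate feature fusion
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),       # second intermediate feature fusion
        )
        self.output_layer = nn.Conv2d(ch, 3, 3, padding=1)    # rendering map in UV texture space

    def forward(self, first_fm, second_fm, third_fm):
        x = torch.cat([first_fm, second_fm], dim=1)           # superimpose first and second feature maps
        x = F.interpolate(x, size=third_fm.shape[-2:], mode="nearest")  # upsample to 512x512
        x = torch.cat([x, third_fm], dim=1)                   # superimpose with the third feature map
        return self.output_layer(self.fuse(x))

rendering_map = GenerationNetwork()(torch.rand(1, 8, 256, 256),
                                    torch.rand(1, 8, 256, 256),
                                    torch.rand(1, 8, 512, 512))
```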
In some embodiments, as shown in fig. 6, there is provided a model training method, the model including a first feature extraction network, a second feature extraction network, and a generation network, the model training method including:
step S601, acquiring sample data, wherein the sample data includes identification information of a sample object, surface normal information, scene information of a sample scene, and a sample rendering map of the sample object in the sample scene, the identification information includes original texture feature information of the sample object, and the scene information includes illumination information;
step S602, inputting the identification information into the first feature extraction network to obtain a first feature map output by the first feature extraction network;
step S603, inputting the surface normal information and the scene information into the second feature extraction network to obtain a second feature map output by the second feature extraction network;
step S604, inputting at least the first feature map and the second feature map into the generation network to obtain a rendering map prediction result of the sample object in the sample scene output by the generation network; and
step S605, adjusting parameters of the model based on the rendering map prediction result and the sample rendering map.
Thus, the model obtained by this training method can perform first feature extraction on the pre-stored identification information (for example, in the form of a parameter matrix) of a target object to obtain a first feature map; perform second feature extraction on the surface normal information and the scene information to obtain a second feature map containing the state information of the current target object (such as actions and expressions, introduced through the surface normal information) and the scene information; and then generate a rendering map based on the first feature map and the second feature map, so that the amount of computation can be reduced and computational efficiency improved while ensuring a hyper-realistic rendering effect for the target object.
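A hedged sketch of one training step (steps S601 through S605) follows, assuming PyTorch; first_net, second_net, and gen_net are hypothetical callables standing in for the networks of the model, and the L1 loss and optimizer are assumptions, since the patent does not fix them at this level.

```python
import torch
import torch.nn.functional as F

def train_step(first_net, second_net, gen_net, optimizer, batch) -> float:
    # batch: sample data acquired in step S601.
    identification_info, surface_normals, scene_info, sample_rendering_map = batch
    first_fm = first_net(identification_info)                 # step S602
    second_fm = second_net(surface_normals, scene_info)       # step S603
    prediction = gen_net(first_fm, second_fm)                 # step S604
    loss = F.l1_loss(prediction, sample_rendering_map)        # compare with the sample rendering map
    optimizer.zero_grad()
    loss.backward()                                           # step S605: adjust model parameters
    optimizer.step()
    return loss.item()
```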
In some embodiments, the second feature extraction network may include a first sub-feature extraction network, a second sub-feature extraction network, and a first feature fusion network, and inputting the surface normal information and the scene information into the second feature extraction network to obtain the second feature map output by the second feature extraction network may include: inputting the surface normal information into the first sub-feature extraction network to obtain a first sub-feature map output by the first sub-feature extraction network; inputting the scene information into the second sub-feature extraction network to obtain a second sub-feature map output by the second sub-feature extraction network; and inputting at least the first sub-feature map and the second sub-feature map into the first feature fusion network to obtain the second feature map output by the first feature fusion network.
In some embodiments, the second feature extraction network may further include a third sub-feature extraction network, the first feature fusion network may include a first fusion sub-network and a second fusion sub-network, and the model training method may further include: inputting the surface normal information into the third sub-feature extraction network to obtain at least one third sub-feature map output by the third sub-feature extraction network, wherein the first sub-feature map and each of the at least one third sub-feature map correspond to different resolutions; and inputting at least the first sub-feature map and the second sub-feature map into the first feature fusion network to obtain the second feature map output by the first feature fusion network may include: inputting the first sub-feature map and the second sub-feature map into the first fusion sub-network to obtain a first fusion feature output by the first fusion sub-network; and inputting the first fusion feature and the at least one third sub-feature map into the second fusion sub-network to obtain the second feature map output by the second fusion sub-network.
In some embodiments, a virtual camera for observing a sample object may be included in the sample scene, and the scene information may further include at least one of the following information: perspective information of the virtual camera relative to the sample object, position information and pose information of the sample object relative to the virtual camera.
In some embodiments, the illumination information may include at least one of illumination intensity information, light source color information, and light source position information.
In some embodiments, the sample scene may include a plurality of light sources, each light source of the plurality of light sources including respective first illumination information, and the model training method further includes: clustering the plurality of light sources based on the first position information of each of the plurality of light sources to obtain at least one central light source; and determining at least one second illumination information corresponding to the at least one central light source based on the first illumination information corresponding to each of the plurality of light sources as illumination information.
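The clustering algorithm is not specified in this embodiment; the sketch below uses k-means over light positions as one plausible choice and averages the per-cluster illumination attributes, so the function name and the averaging rule are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_light_sources(positions, intensities, colors, k=2):
    # positions: (N, 3) first position information; intensities: (N,); colors: (N, 3).
    # Returns k central light sources with aggregated (second) illumination information.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(positions)
    central = []
    for c in range(k):
        idx = labels == c
        central.append({
            "position": positions[idx].mean(axis=0),   # cluster centre as light position
            "intensity": intensities[idx].mean(),      # averaged illumination intensity
            "color": colors[idx].mean(axis=0),         # averaged light source color
        })
    return central

rng = np.random.default_rng(0)
centres = cluster_light_sources(rng.normal(size=(100, 3)),   # 100 lights -> 2 central lights
                                rng.uniform(size=100),
                                rng.uniform(size=(100, 3)))
```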
In some embodiments, the first feature extraction network may include a fourth sub-feature extraction network and a second feature fusion network, and inputting the identification information into the first feature extraction network to obtain the first feature map output by the first feature extraction network may include: inputting the identification information into the fourth sub-feature extraction network to obtain a plurality of fourth sub-feature maps output by the fourth sub-feature extraction network, wherein the plurality of fourth sub-feature maps respectively correspond to a plurality of different resolutions; and inputting the plurality of fourth sub-feature maps into the second feature fusion network to obtain the first feature map output by the second feature fusion network.
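As with the other branches, the concrete layers are not dictated by the disclosure; the following sketch maps the identification vector to fourth sub-feature maps at several assumed resolutions and fuses them with a 1x1 convolution standing in for the second feature fusion network.

```python
import torch
import torch.nn as nn

class FirstFeatureExtraction(nn.Module):
    # Sketch: one head per resolution, then fusion; sizes are illustrative.
    def __init__(self, id_dim=256, c=64, resolutions=(8, 16, 32)):
        super().__init__()
        self.heads = nn.ModuleList([                       # fourth sub-feature extraction
            nn.Sequential(nn.Linear(id_dim, c * r * r), nn.Unflatten(1, (c, r, r)))
            for r in resolutions])
        self.out_size = max(resolutions)
        self.fuse = nn.Conv2d(c * len(resolutions), c, 1)  # second feature fusion network

    def forward(self, identity):
        maps = [nn.functional.interpolate(h(identity), size=self.out_size)
                for h in self.heads]                       # fourth sub-feature maps
        return self.fuse(torch.cat(maps, dim=1))           # first feature map

first_map = FirstFeatureExtraction()(torch.randn(1, 256))  # -> (1, 64, 32, 32)
```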
In some embodiments, the model may further include a third feature extraction network, and the model training method may further include: inputting the identification information into the third feature extraction network to obtain a third feature map output by the third feature extraction network, wherein the resolution of the third feature map is greater than that of the first feature map; and inputting at least the first feature map and the second feature map into the generating network to obtain a rendering map prediction result of the sample object in the sample scene output by the generating network may include: inputting the first feature map, the second feature map and the third feature map into the generating network to obtain the rendering map prediction result output by the generating network.
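One plausible way to exploit the higher-resolution third feature map is to inject it as a skip input after upsampling the fused lower-resolution features; the generator below is only such a sketch, with assumed channel counts.

```python
import torch
import torch.nn as nn

class GenerationNetwork(nn.Module):
    # Sketch: upsample fused first/second maps, then inject the high-res third map.
    def __init__(self, c=64):
        super().__init__()
        self.up = nn.Sequential(nn.Upsample(scale_factor=2),
                                nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(2 * c, 3, 3, padding=1)

    def forward(self, f1, f2, f3):
        x = self.up(torch.cat([f1, f2], dim=1))   # low-resolution first + second maps
        x = torch.cat([x, f3], dim=1)             # higher-resolution third feature map
        return torch.sigmoid(self.head(x))        # rendering map prediction

gen = GenerationNetwork()
f1, f2 = torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16)
f3 = torch.randn(1, 64, 32, 32)                   # resolution greater than f1
prediction = gen(f1, f2, f3)                      # (1, 3, 32, 32)
```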
In some embodiments, the identification information is obtained by feature encoding an original texture map of the sample object.
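The disclosure only states that the identification information results from feature encoding the original texture map; a minimal convolutional encoder such as the one below (hypothetical layer sizes) illustrates how a UV texture could be compressed into a reusable identification vector.

```python
import torch
import torch.nn as nn

# Assumed encoder: compresses the object's original texture map into a compact
# identification vector that can be stored and reused at inference time.
texture_encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 256))

texture_map = torch.rand(1, 3, 256, 256)   # original texture map in UV space
identity = texture_encoder(texture_map)    # identification information, shape (1, 256)
```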
In some exemplary embodiments, the above model may be the rendering map generation model described above.
In some embodiments, as shown in fig. 7, there is provided a rendering map generating apparatus 700, including:
a first obtaining unit 710 configured to perform a first feature extraction on the identification information of the target object to obtain a first feature map, where the identification information includes original texture feature information of the target object;
a second obtaining unit 720 configured to perform second feature extraction on surface normal information of the target object and scene information of the virtual scene to obtain a second feature map, the scene information including illumination information; and
a generating unit 730 configured to generate a rendering map of the target object in the virtual scene based on the first feature map and the second feature map.
The operations performed by the units 710 to 730 in the apparatus 700 are similar to those of steps S201 to S203 in the rendering map generation method described above, and are not repeated here.
In some embodiments, the second acquisition unit may include: a first acquisition subunit configured to perform first sub-feature extraction on the surface normal information to obtain a first sub-feature map; a second obtaining subunit configured to perform second sub-feature extraction on the scene information to obtain a second sub-feature map; and a first fusion subunit configured to obtain the second feature map by fusion based on the first sub-feature map and the second sub-feature map.
In some embodiments, the rendering map generating apparatus may further include: a third obtaining unit configured to perform third sub-feature extraction on the surface normal information to obtain at least one third sub-feature map, each of the first sub-feature map and the at least one third sub-feature map corresponding to a different resolution, respectively; and, the first fusion subunit may be further configured to: acquiring a first fusion feature based on the first sub-feature map and the second sub-feature map; and performing feature fusion on the first fusion feature and at least one third sub-feature map to obtain a second feature map.
In some embodiments, the virtual scene includes a virtual camera for observing the target object, and the scene information may further include at least one of the following information: perspective information of the virtual camera relative to the target object, position information and pose information of the target object relative to the virtual camera.
In some embodiments, the illumination information includes at least one of illumination intensity information, light source color information, and light source position information.
In some embodiments, the virtual scene includes a plurality of light sources, each light source of the plurality of light sources includes corresponding first illumination information, and the rendering map generating apparatus may further include: a first clustering unit configured to cluster the plurality of light sources based on first position information of each of the plurality of light sources to obtain at least one central light source; and a first determination unit configured to determine, as illumination information, at least one second illumination information corresponding to the at least one center light source based on the first illumination information corresponding to each of the plurality of light sources.
In some embodiments, the first acquisition unit may include: a third acquisition subunit configured to perform fourth sub-feature extraction on the identification information to obtain a plurality of fourth sub-feature maps, wherein the plurality of fourth sub-feature maps respectively correspond to a plurality of different resolutions; and a second fusion subunit configured to perform feature fusion based on the plurality of fourth sub-feature maps to obtain the first feature map.
In some embodiments, the rendering map generating apparatus may further include: a fourth obtaining unit configured to perform third feature extraction on the identification information to obtain a third feature map, where a resolution of the third feature map is greater than a resolution of the first feature map; and the generating unit may be further configured to generate the rendering map based on the first feature map, the second feature map, and the third feature map.
In some embodiments, the identification information is obtained by feature encoding an original texture map of the target object.
In some embodiments, as shown in fig. 8, there is provided a model training apparatus 800, the model including a first feature extraction network, a second feature extraction network, and a generation network, the model training apparatus 800 including:
a fifth obtaining unit 810 configured to obtain sample data, the sample data including identification information of a sample object, surface normal information, scene information of the sample scene, and a sample rendering map of the sample object in the sample scene, the identification information including original texture feature information of the sample object, the scene information including illumination information;
a sixth acquisition unit 820 configured to input the identification information into the first feature extraction network to obtain a first feature map output by the first feature extraction network;
a seventh acquisition unit 830 configured to input the surface normal information and the scene information into the second feature extraction network to obtain a second feature map output by the second feature extraction network;
an eighth obtaining unit 840 configured to input at least the first feature map and the second feature map into the generating network, to obtain a rendering map prediction result of the sample objects in the sample scene output by the generating network; and
an adjustment unit 850 configured to adjust parameters of the model based on the rendering map prediction result and the sample rendering map.
The operations performed by the units 810 to 850 in the apparatus 800 are similar to those of steps S601 to S605 in the model training method described above, and are not repeated here.
In some embodiments, the second feature extraction network may include a first sub-feature extraction network, a second sub-feature extraction network, and a first feature fusion network, and the seventh acquisition unit may include: a fourth acquisition subunit configured to input the surface normal information into the first sub-feature extraction network to obtain a first sub-feature map output by the first sub-feature extraction network; a fifth acquisition subunit configured to input the scene information into the second sub-feature extraction network to obtain a second sub-feature map output by the second sub-feature extraction network; and a third fusion subunit configured to input at least the first sub-feature map and the second sub-feature map into the first feature fusion network to obtain the second feature map output by the first feature fusion network.
In some embodiments, the second feature extraction network may further include a third sub-feature extraction network, the first feature fusion network may include a first fusion sub-network and a second fusion sub-network, and the model training apparatus may further include: a ninth acquisition unit configured to input the surface normal information into the third sub-feature extraction network to obtain at least one third sub-feature map output by the third sub-feature extraction network, the first sub-feature map and each of the at least one third sub-feature map corresponding to different resolutions; and the third fusion subunit may be further configured to: input the first sub-feature map and the second sub-feature map into the first fusion sub-network to obtain a first fusion feature output by the first fusion sub-network; and input the first fusion feature and the at least one third sub-feature map into the second fusion sub-network to obtain the second feature map output by the second fusion sub-network.
In some embodiments, the sample scene includes a virtual camera for observing the sample object, and the scene information may further include at least one of the following information: perspective information of the virtual camera relative to the sample object, position information and pose information of the sample object relative to the virtual camera.
In some embodiments, the illumination information may include at least one of illumination intensity information, light source color information, and light source position information.
In some embodiments, the sample scene may include a plurality of light sources, each light source of the plurality of light sources including corresponding first illumination information, and the model training apparatus may further include: a second clustering unit configured to cluster the plurality of light sources based on the first position information of each of the plurality of light sources to obtain at least one central light source; and a second determination unit configured to determine, as illumination information, at least one second illumination information corresponding to the at least one center light source based on the first illumination information corresponding to each of the plurality of light sources.
In some embodiments, the first feature extraction network may include a fourth sub-feature extraction network and a second feature fusion network, and the sixth acquisition unit may include: a sixth obtaining subunit configured to input the identification information into the fourth sub-feature extraction network, so as to obtain a plurality of fourth sub-feature graphs output by the fourth sub-feature extraction network, where the plurality of fourth sub-feature graphs respectively correspond to a plurality of different resolutions; and a fourth merging subunit configured to input the plurality of fourth sub-feature maps into the second feature fusion network to obtain the first feature map output by the second feature fusion network.
In some embodiments, the model may further include a third feature extraction network, and the model training apparatus may further include: a tenth acquisition unit configured to input the identification information into the third feature extraction network to obtain a third feature map output by the third feature extraction network, wherein the resolution of the third feature map is greater than the resolution of the first feature map; and the eighth acquisition unit may be further configured to input the first feature map, the second feature map and the third feature map into the generating network to obtain the rendering map prediction result output by the generating network.
In some embodiments, the identification information may be obtained by feature encoding an original texture map of the sample object.
According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
Referring to fig. 9, a block diagram of an electronic device 900, which may be a server or a client of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The input unit 906 may be any type of device capable of inputting information to the electronic device 900; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 908 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as the above-described rendering map generation method or the above-described model training method. For example, in some embodiments, the above-described rendering map generation method or the above-described model training method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described rendering map generation method or the above-described model training method may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the above-described rendering map generation method or the above-described model training method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, which is not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples, but is defined only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements. Furthermore, the steps may be performed in an order different from that described in the present disclosure, and various elements of the embodiments or examples may be combined in various ways. It should be noted that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (34)

1. A rendering map generation method implemented based on a rendering map generation model, wherein the rendering map generation model includes a first feature extraction network, a second feature extraction network, and a generation network, the method comprising:
performing first feature extraction on identification information of a target object by using the first feature extraction network to obtain a first feature map, wherein the identification information comprises original texture feature information of the target object;
performing second feature extraction on surface normal information of the target object and scene information of a virtual scene by using the second feature extraction network to obtain a second feature map, wherein the scene information comprises illumination information of a light source in the virtual scene, the illumination information comprises at least one of illumination intensity information, light source color information and light source position information, and the second feature extraction comprises performing feature extraction on the surface normal information and the scene information respectively and performing feature fusion on the obtained features; and
and generating a rendering map of the target object in the virtual scene based on the first feature map and the second feature map by using the generation network.
2. The method of claim 1, wherein the performing a second feature extraction on the surface normal information of the target object and the scene information of the virtual scene to obtain a second feature map comprises:
Extracting first sub-features from the surface normal information to obtain a first sub-feature map;
extracting second sub-features of the scene information to obtain a second sub-feature map; and
and based on the first sub-feature map and the second sub-feature map, fusing to obtain the second feature map.
3. The method of claim 2, further comprising:
extracting third sub-features from the surface normal information to obtain at least one third sub-feature map, wherein the first sub-feature map and each of the at least one third sub-feature map correspond to different resolutions; and
the fusing to obtain the second feature map based on the first sub-feature map and the second sub-feature map includes:
acquiring a first fusion feature based on the first sub-feature map and the second sub-feature map; and
and carrying out feature fusion on the first fusion feature and the at least one third sub-feature map to obtain the second feature map.
4. A method according to any one of claims 1 to 3, wherein a virtual camera for observing the target object is included in the virtual scene, the scene information further including at least one of: perspective information of the virtual camera relative to the target object, position information and attitude information of the target object relative to the virtual camera.
5. A method according to any of claims 1 to 3, wherein a plurality of light sources are included in the virtual scene, each light source of the plurality of light sources including respective first illumination information, the method further comprising:
clustering the plurality of light sources based on the first location information of each of the plurality of light sources to obtain at least one central light source; and
and determining at least one piece of second illumination information corresponding to the at least one central light source based on the first illumination information corresponding to each light source in the plurality of light sources, and taking the second illumination information as the illumination information.
6. A method according to any one of claims 1 to 3, wherein the first feature extraction of the identification information of the target object to obtain a first feature map comprises:
extracting fourth sub-features from the identification information to obtain a plurality of fourth sub-feature maps, wherein the plurality of fourth sub-feature maps respectively correspond to a plurality of different resolutions; and
and carrying out feature fusion based on the plurality of fourth sub-feature maps to obtain the first feature map.
7. A method according to any one of claims 1 to 3, further comprising:
extracting third features from the identification information to obtain a third feature map, wherein the resolution of the third feature map is greater than that of the first feature map; and
The generating a rendering map of the target object in the virtual scene based on the first feature map and the second feature map includes:
generating the rendering map based on the first feature map, the second feature map, and the third feature map.
8. A method according to any one of claims 1 to 3, wherein the identification information is obtained by feature encoding an original texture map of the target object.
9. A model training method, wherein the model comprises a first feature extraction network, a second feature extraction network, and a generation network, the method comprising:
obtaining sample data, wherein the sample data comprises identification information of a sample object, surface normal information, scene information of a sample scene and sample rendering mapping of the sample object in the sample scene, the identification information comprises original texture characteristic information of the sample object, the scene information comprises illumination information of a light source in the sample scene, and the illumination information comprises at least one of illumination intensity information, light source color information and light source position information;
inputting the identification information into the first feature extraction network to obtain a first feature map output by the first feature extraction network;
Inputting the surface normal information and the scene information into the second feature extraction network to obtain a second feature map output by the second feature extraction network, wherein the second feature extraction network performs the operations of respectively extracting features of the surface normal information and the scene information and performing feature fusion on the obtained features;
inputting at least the first feature map and the second feature map into the generation network to obtain a rendering map prediction result of the sample object in the sample scene output by the generation network; and
and adjusting parameters of the model based on the rendering map prediction result and the sample rendering map.
10. The method of claim 9, wherein the second feature extraction network comprises a first sub-feature extraction network, a second sub-feature extraction network, and a first feature fusion network, the inputting the surface normal information and the scene information into the second feature extraction network to obtain a second feature map output by the second feature extraction network comprising:
inputting the surface normal information into the first sub-feature extraction network to obtain a first sub-feature map output by the first sub-feature extraction network;
inputting the scene information into the second sub-feature extraction network to obtain a second sub-feature map output by the second sub-feature extraction network; and
inputting at least the first sub-feature map and the second sub-feature map into the first feature fusion network to obtain the second feature map output by the first feature fusion network.
11. The method of claim 10, the second feature extraction network further comprising a third sub-feature extraction network, the first feature fusion network comprising a first fusion sub-network and a second fusion sub-network, the method further comprising:
inputting the surface normal information into the third sub-feature extraction network to obtain at least one third sub-feature map output by the third sub-feature extraction network, wherein the first sub-feature map and each of the at least one third sub-feature map correspond to different resolutions; and
the inputting at least the first sub-feature map and the second sub-feature map into the first feature fusion network to obtain the second feature map output by the first feature fusion network includes:
inputting the first sub-feature map and the second sub-feature map into the first fusion sub-network to obtain a first fusion feature output by the first fusion sub-network; and
and inputting the first fusion feature and the at least one third sub-feature map into the second fusion sub-network to obtain the second feature map output by the second fusion sub-network.
12. The method of any of claims 9 to 11, wherein a virtual camera for observing the sample object is included in the sample scene, the scene information further including at least one of: perspective information of the virtual camera relative to the sample object, position information and pose information of the sample object relative to the virtual camera.
13. The method of any of claims 9 to 11, wherein a plurality of light sources are included in the sample scene, each light source of the plurality of light sources including respective first illumination information, the method further comprising:
clustering the plurality of light sources based on the first location information of each of the plurality of light sources to obtain at least one central light source; and
and determining at least one piece of second illumination information corresponding to the at least one central light source based on the first illumination information corresponding to each light source in the plurality of light sources, and taking the second illumination information as the illumination information.
14. The method of any of claims 9 to 11, wherein the first feature extraction network comprises a fourth sub-feature extraction network and a second feature fusion network, the inputting the identification information into the first feature extraction network to obtain a first feature map output by the first feature extraction network comprising:
inputting the identification information into the fourth sub-feature extraction network to obtain a plurality of fourth sub-feature maps output by the fourth sub-feature extraction network, wherein the plurality of fourth sub-feature maps respectively correspond to a plurality of different resolutions; and
and inputting the plurality of fourth sub-feature maps into the second feature fusion network to obtain the first feature map output by the second feature fusion network.
15. The method of any of claims 9 to 11, the model further comprising a third feature extraction network, the method further comprising:
inputting the identification information into the third feature extraction network to obtain a third feature map output by the third feature extraction network, wherein the resolution of the third feature map is greater than that of the first feature map; and
The inputting at least the first feature map and the second feature map into the generating network to obtain a rendering map prediction result of the sample object in the sample scene output by the generating network includes:
and inputting the first feature map, the second feature map and the third feature map into the generation network to obtain the rendering map prediction result output by the generation network.
16. The method according to any of claims 9 to 11, wherein the identification information is obtained by feature encoding an original texture map of the sample object.
17. A rendering map generation apparatus implemented based on a rendering map generation model, the rendering map generation model comprising a first feature extraction network, a second feature extraction network, and a generation network, the apparatus comprising:
a first obtaining unit configured to perform first feature extraction on identification information of a target object by using the first feature extraction network to obtain a first feature map, wherein the identification information includes original texture feature information of the target object;
a second obtaining unit configured to perform second feature extraction on surface normal information of the target object and scene information of a virtual scene using the second feature extraction network to obtain a second feature map, the scene information including illumination information of a light source in the virtual scene, the illumination information including at least one of illumination intensity information, light source color information, and light source position information, the second feature extraction including feature extraction of the surface normal information and the scene information, respectively, and feature fusion of the obtained features; and
A generation unit configured to generate a rendering map of the target object in the virtual scene based on the first feature map and the second feature map using the generation network.
18. The apparatus of claim 17, wherein the second acquisition unit comprises:
a first acquisition subunit configured to perform a first sub-feature extraction on the surface normal information to obtain a first sub-feature map;
a second obtaining subunit configured to perform second sub-feature extraction on the scene information to obtain a second sub-feature map; and
and the first fusion subunit is configured to fuse and obtain the second characteristic diagram based on the first sub-characteristic diagram and the second sub-characteristic diagram.
19. The apparatus of claim 18, further comprising:
a third obtaining unit configured to perform third sub-feature extraction on the surface normal information to obtain at least one third sub-feature map, where the first sub-feature map and each of the at least one third sub-feature map correspond to different resolutions, respectively; and
the first fusion subunit is further configured to:
acquiring a first fusion feature based on the first sub-feature map and the second sub-feature map; and
And carrying out feature fusion on the first fusion feature and the at least one third sub-feature map to obtain the second feature map.
20. The apparatus of any of claims 17-19, wherein a virtual camera for observing the target object is included in the virtual scene, the scene information further including at least one of: perspective information of the virtual camera relative to the target object, position information and attitude information of the target object relative to the virtual camera.
21. The apparatus of any of claims 17 to 19, wherein a plurality of light sources are included in the virtual scene, each light source of the plurality of light sources including respective first illumination information, the apparatus further comprising:
a first clustering unit configured to cluster the plurality of light sources based on first position information of each of the plurality of light sources to obtain at least one central light source; and
and a first determining unit configured to determine, as the illumination information, at least one second illumination information corresponding to the at least one central light source based on the first illumination information corresponding to each of the plurality of light sources.
22. The apparatus of any of claims 17 to 19, wherein the first acquisition unit comprises:
a third acquisition subunit configured to perform fourth sub-feature extraction on the identification information to obtain a plurality of fourth sub-feature maps, wherein the plurality of fourth sub-feature maps respectively correspond to a plurality of different resolutions; and
and a second fusion subunit configured to perform feature fusion based on the plurality of fourth sub-feature maps to obtain the first feature map.
23. The apparatus of any of claims 17 to 19, further comprising:
a fourth obtaining unit configured to perform third feature extraction on the identification information to obtain a third feature map, where a resolution of the third feature map is greater than a resolution of the first feature map; and
the generating unit is further configured to generate the rendering map based on the first feature map, the second feature map, and the third feature map.
24. The apparatus of any of claims 17 to 19, wherein the identification information is obtained by feature encoding an original texture map of the target object.
25. A model training apparatus, wherein the model comprises a first feature extraction network, a second feature extraction network, and a generation network, the apparatus comprising:
A fifth acquisition unit configured to acquire sample data including identification information of a sample object, surface normal information, scene information of a sample scene, and a sample rendering map of the sample object in the sample scene, the identification information including original texture feature information of the sample object, the scene information including illumination information of a light source in the sample scene, the illumination information including at least one of illumination intensity information, light source color information, and light source position information;
a sixth acquisition unit configured to input the identification information into the first feature extraction network to obtain a first feature map output by the first feature extraction network;
a seventh acquisition unit configured to input the surface normal information and the scene information into the second feature extraction network to obtain a second feature map output by the second feature extraction network, the second feature extraction network performing operations of feature extraction of the surface normal information and the scene information, respectively, and feature fusion of the obtained features;
an eighth obtaining unit configured to input at least the first feature map and the second feature map into the generating network to obtain a rendering map prediction result of the sample object in the sample scene output by the generating network; and
And an adjustment unit configured to adjust parameters of the model based on the rendering map prediction result and the sample rendering map.
26. The apparatus of claim 25, wherein the second feature extraction network comprises a first sub-feature extraction network, a second sub-feature extraction network, and a first feature fusion network, the seventh acquisition unit comprising:
a fourth acquisition subunit configured to input the surface normal information into the first sub-feature extraction network to obtain a first sub-feature map output by the first sub-feature extraction network;
a fifth acquisition subunit configured to input the scene information into the second sub-feature extraction network to obtain a second sub-feature map output by the second sub-feature extraction network; and
and a third fusion subunit configured to input at least the first sub-feature map and the second sub-feature map into the first feature fusion network, so as to obtain the second feature map output by the first feature fusion network.
27. The apparatus of claim 26, the second feature extraction network further comprising a third sub-feature extraction network, the first feature fusion network comprising a first fusion sub-network and a second fusion sub-network, the apparatus further comprising:
a ninth obtaining unit configured to input the surface normal information into the third sub-feature extraction network to obtain at least one third sub-feature map output by the third sub-feature extraction network, each of the first sub-feature map and the at least one third sub-feature map corresponding to a different resolution, respectively; and
the third fusion subunit is further configured to:
inputting the first sub-feature map and the second sub-feature map into the first fusion sub-network to obtain a first fusion feature output by the first fusion sub-network; and
and inputting the first fusion feature and the at least one third sub-feature map into the second fusion sub-network to obtain the second feature map output by the second fusion sub-network.
28. The apparatus of any of claims 25 to 27, wherein a virtual camera for observing the sample object is included in the sample scene, the scene information further including at least one of: perspective information of the virtual camera relative to the sample object, position information and pose information of the sample object relative to the virtual camera.
29. The apparatus of any of claims 25 to 27, wherein a plurality of light sources are included in the sample scene, each light source of the plurality of light sources including respective first illumination information, the apparatus further comprising:
a second clustering unit configured to cluster the plurality of light sources based on the first position information of each of the plurality of light sources to obtain at least one central light source; and
and a second determining unit configured to determine, as the illumination information, at least one second illumination information corresponding to the at least one center light source based on the first illumination information corresponding to each of the plurality of light sources.
30. The apparatus of any of claims 25 to 27, wherein the first feature extraction network comprises a fourth sub-feature extraction network and a second feature fusion network, the sixth acquisition unit comprising:
a sixth obtaining subunit configured to input the identification information into the fourth sub-feature extraction network to obtain a plurality of fourth sub-feature maps output by the fourth sub-feature extraction network, where the plurality of fourth sub-feature maps respectively correspond to a plurality of different resolutions; and
and a fourth merging subunit configured to input the plurality of fourth sub-feature maps into the second feature fusion network so as to obtain the first feature map output by the second feature fusion network.
31. The apparatus of any of claims 25 to 27, the model further comprising a third feature extraction network, the apparatus further comprising:
a tenth acquisition unit configured to input the identification information into the third feature extraction network to obtain a third feature map output by the third feature extraction network, wherein a resolution of the third feature map is greater than a resolution of the first feature map; and
the eighth acquisition unit is further configured to input the first feature map, the second feature map and the third feature map into the generation network to obtain the rendering map prediction result output by the generation network.
32. The apparatus of any of claims 25 to 27, wherein the identification information is obtained by feature encoding an original texture map of the sample object.
33. An electronic device, the electronic device comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein the method comprises the steps of
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-16.
34. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-16.
CN202310519942.XA 2023-05-09 2023-05-09 Rendering map generation method and device, and model training method and device Active CN116245998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310519942.XA CN116245998B (en) 2023-05-09 2023-05-09 Rendering map generation method and device, and model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310519942.XA CN116245998B (en) 2023-05-09 2023-05-09 Rendering map generation method and device, and model training method and device

Publications (2)

Publication Number Publication Date
CN116245998A CN116245998A (en) 2023-06-09
CN116245998B true CN116245998B (en) 2023-08-29

Family

ID=86629868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310519942.XA Active CN116245998B (en) 2023-05-09 2023-05-09 Rendering map generation method and device, and model training method and device

Country Status (1)

Country Link
CN (1) CN116245998B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843808B (en) * 2023-06-30 2024-06-18 北京百度网讯科技有限公司 Rendering, model training and virtual image generating method and device based on point cloud

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10346950B2 (en) * 2016-10-05 2019-07-09 Hidden Path Entertainment, Inc. System and method of capturing and rendering a stereoscopic panorama using a depth buffer

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871339A (en) * 2017-11-08 2018-04-03 太平洋未来科技(深圳)有限公司 The rendering intent and device of virtual objects color effect in video
CN108176048A (en) * 2017-11-30 2018-06-19 腾讯科技(深圳)有限公司 The treating method and apparatus of image, storage medium, electronic device
CN108564646A (en) * 2018-03-28 2018-09-21 腾讯科技(深圳)有限公司 Rendering intent and device, storage medium, the electronic device of object
CN109410310A (en) * 2018-10-30 2019-03-01 安徽虚空位面信息科技有限公司 A kind of real-time lighting Rendering algorithms based on deep learning network
CN109993823A (en) * 2019-04-11 2019-07-09 腾讯科技(深圳)有限公司 Shading Rendering method, apparatus, terminal and storage medium
CN114529690A (en) * 2020-10-30 2022-05-24 北京字跳网络技术有限公司 Augmented reality scene presenting method and device, terminal equipment and storage medium
CN112587921A (en) * 2020-12-16 2021-04-02 成都完美时空网络技术有限公司 Model processing method and device, electronic equipment and storage medium
CN112700528A (en) * 2020-12-21 2021-04-23 南京理工大学 Virtual object shadow rendering method for head-mounted augmented reality equipment
CN114288664A (en) * 2021-12-28 2022-04-08 完美世界(北京)软件科技发展有限公司 Game scene generation method and device, storage medium and electronic device
CN115100339A (en) * 2022-06-15 2022-09-23 北京百度网讯科技有限公司 Image generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116245998A (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN116051729B (en) Three-dimensional content generation method and device and electronic equipment
CN114972958B (en) Key point detection method, neural network training method, device and equipment
CN116228867B (en) Pose determination method, pose determination device, electronic equipment and medium
CN115511779B (en) Image detection method, device, electronic equipment and storage medium
CN115482325B (en) Picture rendering method, device, system, equipment and medium
CN116245998B (en) Rendering map generation method and device, and model training method and device
CN117274491A (en) Training method, device, equipment and medium for three-dimensional reconstruction model
CN112967356A (en) Image filling method and device, electronic device and medium
CN114119935B (en) Image processing method and device
CN115578515A (en) Training method of three-dimensional reconstruction model, and three-dimensional scene rendering method and device
CN114550313A (en) Image processing method, neural network, and training method, device, and medium thereof
CN115661375B (en) Three-dimensional hair style generation method and device, electronic equipment and storage medium
CN115601555A (en) Image processing method and apparatus, device and medium
CN115393514A (en) Training method of three-dimensional reconstruction model, three-dimensional reconstruction method, device and equipment
CN114120448B (en) Image processing method and device
CN114119154A (en) Virtual makeup method and device
CN114049472A (en) Three-dimensional model adjustment method, device, electronic apparatus, and medium
CN114429678A (en) Model training method and device, electronic device and medium
CN114327718A (en) Interface display method and device, equipment and medium
CN115423827B (en) Image processing method, image processing device, electronic equipment and storage medium
CN115331077B (en) Training method of feature extraction model, target classification method, device and equipment
CN115359194B (en) Image processing method, image processing device, electronic equipment and storage medium
CN116246026B (en) Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device
CN116740764B (en) Image processing method and device for virtual image and electronic equipment
CN115797455B (en) Target detection method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant