WO2021151380A1 - Method for rendering virtual object based on illumination estimation, method for training neural network, and related products - Google Patents


Info

Publication number
WO2021151380A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2021/073937
Other languages
French (fr)
Inventor
Celong LIU
Yi Xu
Zhong Li
Shuxue Quan
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd. filed Critical Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180010833.4A (published as CN115039137A)
Publication of WO2021151380A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/006 Mixed reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/50 Lighting effects

Definitions

  • This disclosure relates to the field of augmented reality (AR) technology, and more particularly to a method for rendering a virtual object based on illumination estimation, a method for training a neural network, and related products.
  • AR applications aim to provide realistic blending between the real world and virtual objects.
  • An augmented-reality device may be configured to display augmented-reality images to provide an illusion that virtual objects are present in a real-world physical space.
  • One of the important factors for realistic AR is correct illumination estimation of the real world.
  • Illumination estimation of the real world is a challenging problem.
  • Typical solutions to this problem rely on inserting an object (e.g., a light probe) with known geometry and/or reflectance properties into the real world.
  • However, having to insert a known object into the real world is limiting and thus not easily amenable to practical applications.
  • Implementations provide a method for rendering a virtual object based on illumination estimation, a method for training a neural network, and related products, which can estimate a real-world lighting condition in real-time and render a virtual object with estimated illumination.
  • According to a first aspect, a method for rendering a virtual object based on illumination estimation is provided. The method includes the following. An image in which at least one object is located on at least one plane is captured. A foreground of the image is extracted. Illumination corresponding to the extracted foreground of the image is estimated. A virtual object is rendered with the estimated illumination.
  • According to a second aspect, a method for training a neural network is provided. The method includes the following.
  • A training image is obtained from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset.
  • A foreground of the training image is extracted.
  • The extracted foreground of the training image is input to a neural network to obtain predicted parameters.
  • An image is rendered with the predicted parameters.
  • A loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image. The neural network is trained based on the loss.
  • According to a third aspect, an apparatus for rendering a virtual object based on illumination estimation is provided. The apparatus includes a capturing unit, an extracting unit, an estimating unit, and a rendering unit.
  • the capturing unit is configured to capture an image in which at least one object is located on at least one plane.
  • the extracting unit is configured to extract a foreground of the image.
  • the estimating unit is configured to estimate illumination corresponding to the extracted foreground of the image.
  • the rendering unit is configured to render a virtual object with the estimated illumination.
  • According to a fourth aspect, an apparatus for training a neural network is provided. The apparatus includes an obtaining unit, an extracting unit, an inputting unit, a rendering unit, a calculating unit, and a training unit.
  • the obtaining unit is configured to obtain a training image from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset.
  • the extracting unit is configured to extract a foreground of the training image.
  • the inputting unit is configured to input the extracted foreground of the training image to a neural network to obtain predicted parameters.
  • the rendering unit is configured to render an image with the predicted parameters.
  • the calculating unit is configured to calculate a loss of the neural network based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image.
  • the training unit is configured to train the neural network based on the loss.
  • According to a fifth aspect, a terminal device is provided. The terminal device includes a processor and a memory configured to store one or more programs.
  • the one or more programs are configured to be executed by the processor and the one or more programs include instructions for performing some or all operations of the method described in the first or second aspect.
  • According to a sixth aspect, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium is configured to store computer programs for electronic data interchange (EDI).
  • the computer programs include instructions for performing some or all operations of the method described in the first or second aspect.
  • According to a seventh aspect, a computer program product is provided. The computer program product includes a non-transitory computer-readable storage medium that stores computer programs.
  • the computer programs are operable with a computer to execute some or all operations of the method described in the first or second aspect.
  • According to implementations, an image in which at least one object is located on at least one plane in a real scene is captured in real time, then a foreground of the image is extracted, thereafter illumination corresponding to the extracted foreground of the image is estimated, and finally a virtual object is rendered in the real scene with the estimated illumination.
  • As such, a real-world lighting condition can be estimated in real time and the virtual object can be rendered in the real scene with the estimated illumination, thereby improving rendering quality and the sense of reality of rendering the virtual object in the real scene.
  • FIG. 1 is a diagram of an example operating environment according to implementations.
  • FIG. 2 is a diagram of an exemplary structure of a framework of a neural network according to implementations.
  • FIG. 3 is a diagram of rendering virtual objects in a real scene using estimated lighting.
  • FIG. 4 is a schematic flowchart of a method for rendering a virtual object based on illumination estimation according to implementations.
  • FIG. 5 (a) to FIG. 5 (d) are diagrams of an exemplary foreground extraction process according to implementations.
  • FIG. 6 (a) and FIG. 6 (b) are diagrams of an exemplary design of a neural network according to implementations.
  • FIG. 7 is a diagram of an exemplary generation process of ground truth data.
  • FIG. 8 is a diagram illustrating comparison of illumination and normal map predictions of a network trained with a normal map of a plane and a network trained without a normal map of a plane according to implementations.
  • FIG. 9 is a diagram illustrating comparison of illumination and normal map predictions of a network (middle) trained with a normal map of a plane and a network (top) trained without a normal map of a plane according to other implementations.
  • FIG. 10 is a diagram illustrating a comparison between illumination estimation in the related art and in the present disclosure.
  • FIG. 11 is a diagram illustrating an example of rendering a virtual object in a real scene.
  • FIG. 12 is a schematic flowchart of a method for training a neural network according to implementations.
  • FIG. 13 is a schematic structural diagram of an apparatus for rendering a virtual object based on illumination estimation according to implementations.
  • FIG. 14 is a schematic structural diagram of an apparatus for training a neural network according to implementations.
  • FIG. 15 is a schematic structural diagram of a terminal device according to implementations.
  • FIG. 16 is a schematic structural diagram of a terminal device according to other implementations.
  • A terminal device referred to herein may include various handheld devices, in-vehicle devices, wearable devices, computing devices that have wireless communication functions or other processing devices connected to a wireless modem, as well as various forms of user equipment (UE), mobile stations (MS), mobile terminals, and the like.
  • The above-mentioned devices are collectively referred to as a terminal device.
  • Illumination estimation has been a long-standing problem in both computer vision and graphics.
  • A direct way of estimating illumination of an environment is to capture light radiance at a target location using a physical light probe. Photographs of a mirrored sphere with different exposures can be used to compute illumination at the sphere's location. Beyond mirrored spheres, it is also possible to estimate illumination using hybrid spheres, known 3D objects, objects with known surface material, or even human faces as proxies for light probes.
  • the process of physically capturing high quality illumination maps can be expensive and difficult to scale, especially when the goal is to obtain training data for a dense set of visible locations in a large variety of environments.
  • Another approach to estimating illumination is to jointly optimize geometry, reflectance properties, and lighting models of a scene in order to find a set of values that can best explain the input image.
  • However, directly optimizing all scene parameters is often a highly under-constrained problem, that is, an error in one parameter estimation can easily propagate into another. Therefore, to ease the optimization process, many prior methods either assume additional user-provided ground truth information as input or make strong assumptions about the lighting models.
  • Deep learning has recently shown promising results on a number of computer vision tasks, including depth estimation and intrinsic image decomposition.
  • Recently, LeGendre et al. proposed to formulate an illumination estimation function as an end-to-end neural network. They use a Google Pixel phone to capture pictures with balls of different bidirectional reflectance distribution functions (BRDFs) as light probes. In order to achieve real-time performance, they compress the environment map to a tiny size (32x32). This makes the predicted map unstable and over-sensitive to lighting intensity, which breaks temporal consistency.
  • Shuran et al. propose an end-to-end network that directly maps an LDR image to an HDR environment map using geometric warping and an adversarial network.
  • However, predicting an HDR map is computationally heavy and is hard to perform in real time.
  • Using an HDR map as the lighting will also add to the cost of rendering.
  • For AR applications, real-time or near real-time performance is required.
  • Therefore, in implementations of the present disclosure, 5th-order spherical harmonic (SH) lighting is used to approximate an environment map. Without losing too much accuracy, a small number of parameters can be used instead of the HDR map, which lowers prediction and rendering cost.
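  • As a rough illustration of why this helps, 5th-order SH lighting uses bands l = 0 to 5, i.e., (5 + 1)^2 = 36 coefficients per color channel, or 108 scalars for RGB, compared with 32 x 32 x 3 = 3072 values for even a heavily compressed environment map. A minimal Python sketch of this count (variable names are illustrative only):

```python
# Minimal sketch: parameter count of 5th-order spherical harmonic (SH)
# lighting versus a small compressed environment map.
L_MAX = 5  # 5th-order SH lighting as described above


def num_sh_coefficients(l_max: int) -> int:
    """Number of SH basis functions for bands 0..l_max."""
    return (l_max + 1) ** 2


per_channel = num_sh_coefficients(L_MAX)  # 36 coefficients per color channel
total_rgb = per_channel * 3               # 108 scalars for an RGB light
tiny_env_map = 32 * 32 * 3                # 3072 values for a 32x32 RGB map
print(per_channel, total_rgb, tiny_env_map)
```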
  • FIG. 1 is a diagram of an example operating environment according to implementations.
  • the example operating environment 100 includes at least two computing devices interconnected through one or more networks 11.
  • the one or more networks 11 allow a computing device to connect to and communicate with another computing device.
  • the at least two computing devices include a terminal device 12 and a server 13.
  • the at least two computing devices may include other computing devices not shown, which is not limited herein.
  • the one or more networks 11 may include a secure network such as an enterprise private network, an unsecure network such as a wireless open network, a local area network (LAN) , a wide area network (WAN) , and the Internet. In the following, for ease of explanation, one network 11 is used.
  • the terminal device 12 includes a network interface 121, a processor 122, a memory 123, a camera 124, sensors 125, and a display 126 which are in communication with each other.
  • the network interface 121 allows the terminal device 12 to connect to the network 11.
  • the network interface 121 may include at least one of: a wireless network interface, a modem, or a wired network interface.
  • the processor 122 allows the terminal device 12 to execute computer-readable instructions stored in the memory 123 to perform processes discussed herein.
  • the camera 124 may capture color images and/or depth images of an environment.
  • the terminal device12 may include outward facing cameras that capture images of the environment and inward facing cameras that capture images of the end user of the terminal device.
  • the sensors 125 may generate motion and/or orientation information associated with the terminal device 12.
  • the sensors 125 may include an inertial measurement unit (IMU) .
  • the display 126 may display digital images and/or videos.
  • the display 126 may include a see-through display.
  • Display 126 may include a light emitting diode (LED) or organic LED (OLED) display.
  • various components of the terminal device 12 may be integrated on a single chip substrate.
  • the network interface 121, the processor 122, the memory 123, the camera 124, and the sensors 125 may be integrated as a system on a chip (SOC) .
  • the network interface 121, the processor 122, the memory 123, the camera 124, and the sensors 125 may be integrated within a single package.
  • the terminal device 12 may provide a natural user interface (NUI) by employing the camera 124, the sensors 125, and gesture recognition software running on the processor 122.
  • a person's body parts and movements may be detected, interpreted, and used to control various aspects of computing applications in the terminal.
  • a computing device utilizing a natural user interface may infer intent of a person interacting with the computing device (e.g., that the end user has performed a particular gesture in order to control the computing device) .
  • the terminal device 12 includes a helmet-mounted display (HMD) that provides an augmented reality, mixed reality, or virtual reality environment to an end user of the HMD.
  • the HMD may include a video see-through and/or an optical see-through system.
  • An optical see-through HMD worn by an end user may allow actual direct viewing of a real-world environment (e.g., via transparent lenses) and may, at the same time, project images of a virtual object into the visual field of the end user thereby augmenting the real-world environment perceived by the end user with the virtual object.
  • an end user may move around a real-world environment (e.g., a living room) wearing the HMD and perceive views of the real-world overlaid with images of virtual objects.
  • the virtual objects may appear to maintain coherent spatial relationship with the real-world environment (i.e., as the end user turns their head or moves within the real-world environment, the images displayed to the end user will change such that the virtual objects appear to exist within the real-world environment as perceived by the end user) .
  • the virtual objects may also appear fixed with respect to the end user's point of view (e.g., a virtual menu that always appears in the top right corner of the end user's point of view regardless of how the end user turns their head or moves within the real-world environment) .
  • environmental mapping of the real-world environment may be performed by the server 13 (i.e., on the server side) while camera localization may be performed on the terminal device 12 (i.e., on the client side).
  • the virtual objects may include a text description associated with a real-world object.
  • A terminal device, such as the terminal device 12, may provide location information and/or image information to the server 13, and the server may transmit to the terminal device one or more virtual objects based upon the location information and/or image information provided to the server.
  • the terminal device 12 may specify a particular file format for receiving the one or more virtual objects and the server 13 may transmit to the terminal device 12 the one or more virtual objects embodied within a file of the particular file format.
  • FIG. 2 is a diagram of an example structure of a framework of a neural network according to implementations.
  • As illustrated in FIG. 2, an initial foreground and an augmented foreground are extracted.
  • The augmented foreground is fed to an initial encoder, then decoded into a normal map, an albedo, and a roughness, respectively, and thereafter 5th-order spherical harmonic lighting is regressed.
  • A mask of a target object is multiplied with the predicted normal map, albedo, and roughness to remove the planar region.
  • A screen-space renderer takes the normal map, albedo, roughness, and SH lighting to generate a re-rendered scene image.
  • FIG. 3 is a diagram of rendering virtual objects into a real scene using estimated lighting.
  • On the left is an indoor image taken by a mobile device, such as a phone; the lighting condition is estimated from this captured image and virtual objects are rendered into the real scene.
  • Some detailed rendering effects such as soft shadowing and glossy surfaces are shown in the zoomed-in figures.
  • FIG. 4 is a schematic flowchart of a method for rendering a virtual object based on illumination estimation according to implementations.
  • the method for rendering a virtual object based on illumination estimation can be applicable to the terminal device 12 illustrated in FIG. 1.
  • In the method, an image in which at least one object is located on at least one plane in a real scene is captured in real time, then a foreground of the image is extracted, thereafter illumination corresponding to the extracted foreground of the image is estimated, and finally a virtual object is rendered in the real scene with the estimated illumination.
  • As such, a real-world lighting condition can be estimated in real time and the virtual object can be rendered in the real scene with the estimated illumination, thereby improving rendering quality and the sense of reality of rendering the virtual object in the real scene.
  • the method for rendering a virtual object based on illumination estimation includes the following.
  • an image in which at least one object is located on at least one plane is captured.
  • the illumination estimation is based on a visual appearance of the at least one object in a real scene, and the image can be captured by a monocular RGB camera of the terminal device.
  • a foreground of the image is extracted.
  • Background contents of an image are usually too small and regions near a boundary of the image may even be distorted by camera projection. Hence, the foreground of the image is extracted to reduce computation complexity.
  • the foreground of the image can be extracted as follows. Existence of at least one preset object is detected in the image. An object which is in the center of the image is selected from the at least one preset object as a target object. The foreground of the image is determined and extracted based on the target object and a plane on which the target object is located.
  • FIG. 5 (a) to FIG. 5 (d) illustrate an example foreground extraction process according to implementations, where FIG. 5 (a) is a diagram of an example captured image, FIG. 5 (b) is a diagram of a target object in the image, FIG. 5 (c) is a diagram of an initial extraction (e.g., segmentation), and FIG. 5 (d) is a diagram of a final extraction.
  • In the image, there may be one or more preset objects (e.g., banana, vase, etc.) of common categories, such as the categories defined in the COCO (Common Objects in Context) dataset. FIG. 5 (a) gives such an example.
  • In implementations, Detectron2 is used to detect the at least one preset object and the target object.
  • the at least one preset object detected in the image may include a sofa, a table, and a flowerpot. After the at least one preset object is detected, only an object which is in the center of the image is selected from the at least one preset object as the target object.
  • the object which is in the center of the image is selected from the at least one preset object as the target object as follows.
  • An object for which 95% of the segmented pixels are located in the central 70% of the image is selected as the target object.
  • FIG. 5 (b) gives such an example. As illustrated in FIG. 5 (b) , only the sofa is selected as the target object.
  • the foreground of the image is determined based on the target object and the plane on which the target object is located as follows.
  • a bounding box is determined according to the target object.
  • the bounding box is divided into an upper bounding box and a lower bounding box.
  • the lower bounding box is expanded with a magnification to include part of the plane on which the target object is located.
  • a part of the image framed by the upper bounding box and the expanded lower bounding box is determined as the foreground of the image.
  • In other words, the part of the plane which is in the expanded lower bounding box, together with the target object, is determined as the foreground of the image.
  • In implementations, a bounding box is determined according to a mask of the target object. As illustrated in FIG. 5 (c), the lower half of the bounding box is localized and extended by a preset magnification (e.g., 1.3) in the x and y directions. As such, the part of the plane on which the target object is located is mostly included in this extended region. The final augmented segmentation, that is, the foreground of the image, is shown in FIG. 5 (d).
  • An input of the neural network is the extracted foreground I_A.
  • A mask of the target object (FIG. 5 (c)) is denoted as M and an augmented mask (FIG. 5 (d)) is denoted as M_A.
  • A mask of the plane is denoted as M_P, where M_P = M_A - M.
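  • As an illustrative sketch only (NumPy arrays are assumed, and the helper name and exact box-expansion details are illustrative rather than the exact procedure of the disclosure), the foreground augmentation and mask arithmetic described above can be written as follows:

```python
# Illustrative sketch: given a binary object mask M, build the augmented
# foreground mask M_A by keeping the upper half of the object's bounding box
# and expanding the lower half by a magnification (e.g., 1.3) so that part of
# the supporting plane is included; the plane mask is then M_P = M_A - M.
import numpy as np


def augment_foreground(M: np.ndarray, magnification: float = 1.3):
    """M: HxW binary mask of the target object. Returns (M_A, M_P)."""
    M = (M > 0).astype(np.uint8)
    H, W = M.shape
    ys, xs = np.nonzero(M)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()

    y_mid = (y0 + y1) // 2                       # split the box into upper/lower halves
    lower_h, lower_w = y1 - y_mid, x1 - x0
    # Expand the lower half of the box in the x and y directions.
    y1e = min(H - 1, y_mid + int(lower_h * magnification))
    cx = (x0 + x1) // 2
    half_w = int(lower_w * magnification / 2)
    x0e, x1e = max(0, cx - half_w), min(W - 1, cx + half_w)

    M_A = np.zeros_like(M)
    M_A[y0:y_mid + 1, x0:x1 + 1] = 1             # upper half of the original box
    M_A[y_mid:y1e + 1, x0e:x1e + 1] = 1          # expanded lower half (plane region)
    M_A = np.maximum(M_A, M)                     # always keep the object itself

    M_P = np.clip(M_A.astype(np.int16) - M, 0, 1).astype(np.uint8)  # plane-only mask
    return M_A, M_P
```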
  • SceneParser(·) represents the neural network, which consists of encoder-decoder blocks. The predicted normal map, albedo, roughness, and spherical harmonics (SH) coefficients (differentiated from the ground truth parameters by a hat symbol) are obtained by applying SceneParser(·) to the extracted foreground I_A.
  • N_P is provided from an output of an off-the-shelf AR framework (e.g., ARCore) on the input image. All N_P discussed here are in screen-space coordinates.
  • the mask of the target object M is applied to get the predicted normal map, the predicted albedo, and the predicted roughness. In FIG. 2, they are (e), (g), and (i), respectively.
  • FIG. 6 (a) and FIG. 6 (b) are diagrams of an example design of a neural network according to implementations.
  • the neural network may include a lighting prediction module named SHEst(·) and an initial encoder named InitEncoder(·).
  • SHEst(·) is connected after InitEncoder(·).
  • SHEst(·) contains two fully connected layers to regress 36 spherical harmonics coefficients for each color channel.
  • the predicted spherical harmonics (SH) coefficients can be obtained via SHEst(InitEncoder(I_A)).
  • the neural network may further include three sub-autoencoders for normal, albedo, and roughness, and the three sub-autoencoders share the same encoder InitEncoder(·) and have their own decoders named NormalDecoder(·), AlbedoDecoder(·), and RoughDecoder(·), respectively.
  • A detailed architecture design can be found in FIG. 6 (a).
  • the predicted normal map, albedo, and roughness can be obtained via NormalDecoder(InitEncoder(I_A)), AlbedoDecoder(InitEncoder(I_A)), and RoughDecoder(InitEncoder(I_A)), respectively.
  • InitEncoder(·) may have 6 convolutional layers with stride 2, so that each pixel of an output can be influenced by the whole image.
  • A detailed structure of such encoder-decoder pairs is shown in FIG. 6 (a) and the detailed structure of SHEst(·) is shown in FIG. 6 (b).
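  • As an illustrative PyTorch sketch of this layout (only the stated facts are kept: a shared InitEncoder with 6 stride-2 convolutional layers, three decoders for normal, albedo, and roughness, and an SHEst head with two fully connected layers regressing 36 SH coefficients per color channel; layer widths, decoder details, and the absence of skip connections are assumptions):

```python
# Illustrative sketch of the encoder, decoders, and SH regression head.
import torch
import torch.nn as nn


class InitEncoder(nn.Module):
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        chs = [in_ch] + [base * 2 ** min(i, 4) for i in range(6)]
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 4, stride=2, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(6))                       # 6 convolutional layers with stride 2

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x                                     # bottleneck feature map


class Decoder(nn.Module):
    """Shared shape for NormalDecoder / AlbedoDecoder / RoughDecoder."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(6):                           # mirror the 6 downsampling steps
            nxt = max(ch // 2, 16)
            layers += [nn.ConvTranspose2d(ch, nxt, 4, 2, 1), nn.ReLU(inplace=True)]
            ch = nxt
        layers.append(nn.Conv2d(ch, out_ch, 3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


class SHEst(nn.Module):
    """Two fully connected layers regressing 36 SH coefficients per channel."""
    def __init__(self, in_ch, hidden=512):
        super().__init__()
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                nn.Linear(in_ch, hidden), nn.ReLU(inplace=True),
                                nn.Linear(hidden, 36 * 3))

    def forward(self, x):
        return self.fc(x).view(-1, 3, 36)            # (batch, RGB, 36)


class SceneParser(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = InitEncoder()
        bottleneck = 32 * 2 ** 4                     # 512 with the widths above
        self.normal_dec = Decoder(bottleneck, 3)
        self.albedo_dec = Decoder(bottleneck, 3)
        self.rough_dec = Decoder(bottleneck, 1)
        self.sh_est = SHEst(bottleneck)

    def forward(self, foreground):
        code = self.encoder(foreground)
        return (self.normal_dec(code), self.albedo_dec(code),
                self.rough_dec(code), self.sh_est(code))
```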
  • a virtual object is rendered with the estimated illumination.
  • the illumination corresponding to the extracted foreground of the image can be estimated as follows.
  • The extracted foreground is input to the neural network.
  • Predicted SH coefficients output by the neural network are obtained.
  • the illumination corresponding to the image is determined based on the predicted SH coefficients.
  • the virtual object can be rendered on a plane of the real scene with the estimated illumination.
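  • As an illustrative end-to-end sketch of the estimation steps above (it assumes the SceneParser module and the augment_foreground helper sketched earlier; the hand-off to the rendering engine is illustrative and is not ARCore's or the disclosure's actual API):

```python
# Illustrative inference sketch: extract the foreground, run the network,
# and return the predicted SH coefficients used to light the virtual object.
import torch


@torch.no_grad()
def estimate_illumination(model, image, object_mask):
    """image: HxWx3 float RGB in [0, 1] (NumPy); object_mask: HxW binary mask."""
    M_A, _ = augment_foreground(object_mask)               # augmented foreground mask
    foreground = image * M_A[..., None]                    # mask out the background
    x = torch.from_numpy(foreground).permute(2, 0, 1)[None].float()
    _, _, _, sh = model(x)                                 # predicted SH coefficients
    return sh[0]                                           # (3, 36): 5th-order SH per RGB channel

# The (3, 36) coefficient block is then handed to the rendering engine as the
# scene lighting when drawing the virtual object on the detected plane.
```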
  • In this way, an image in which at least one object is located on at least one plane in a real scene is captured in real time, then a foreground of the image is extracted, thereafter illumination corresponding to the extracted foreground of the image is estimated, and finally a virtual object is rendered in the real scene with the estimated illumination.
  • As such, a real-world lighting condition can be estimated in real time and the virtual object can be rendered in the real scene with the estimated illumination, thereby improving rendering quality and the sense of reality of rendering the virtual object in the real scene.
  • the method further includes the following.
  • the neural network is trained with a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
  • the method further includes constructing the first dataset. Specifically, images each containing an object and a plane on which the object is located are selected from a second dataset, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images. For each of the selected images: a ground truth normal map of the object and a ground truth normal map of the plane are calculated; ground truth SH coefficients corresponding to the image are extracted by applying a spherical convolution to the ground truth HDR illumination map; and a ground truth albedo and a ground truth roughness corresponding to the image are calculated.
  • The first dataset is constructed based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
  • the first dataset can be constructed based on an off-the-shelf second dataset, that is, the Matterport3D dataset.
  • Matterport3D contains 194,400 registered HDR RGB-D images arranged in 10,800 panoramas within 90 different building-scale indoor scenes.
  • multiple HDR RGB-D images from the Matterport3D dataset are leveraged to generate the first dataset.
  • The panoramas can provide ground truth HDR illumination maps, and a spherical convolution can be applied to extract 5th-order SH coefficients of these maps as the ground truth SH coefficients L_gt.
  • images that contain an object and a plane on which the object is located are selected from the Matterport3D dataset.
  • the plane is defined by: (a) a horizontal surface (n_z > cos(π/8)); (b) a semantic label such as floor or furniture; and (c) one or more objects located on it.
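  • As an illustrative sketch of this selection rule (the label set and the data layout are assumptions; only the stated criteria are kept):

```python
# Illustrative sketch: keep only planes that are roughly horizontal, carry a
# floor/furniture-like semantic label, and support at least one object.
import numpy as np

HORIZONTAL_THRESHOLD = np.cos(np.pi / 8)       # n_z > cos(pi/8)
SUPPORT_LABELS = {"floor", "furniture"}        # example label set


def is_support_plane(normal_z: float, label: str, objects_above: int) -> bool:
    return (normal_z > HORIZONTAL_THRESHOLD
            and label in SUPPORT_LABELS
            and objects_above >= 1)
```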
  • the object is transformed to screen space (e.g., the camera coordinate system) and the normal maps for the object and the plane are calculated.
  • a spherical convolution is applied to extract 5th-order SH coefficients of the ground truth HDR illumination maps as the ground truth SH coefficients L_gt. In this way, the SH lighting is rotated to screen space and the SH coefficients are transformed accordingly.
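  • As an illustrative sketch of this spherical-convolution step (real_sh_basis is a hypothetical helper returning the 36 real SH basis values for a direction; it is not an API of the disclosure or of any specific library):

```python
# Illustrative sketch: project an equirectangular HDR illumination map onto
# 5th-order SH coefficients to obtain the ground truth L_gt.
import numpy as np


def project_to_sh(env_map: np.ndarray, l_max: int = 5) -> np.ndarray:
    """env_map: HxWx3 HDR equirectangular panorama. Returns (36, 3) coefficients."""
    H, W, _ = env_map.shape
    theta = (np.arange(H) + 0.5) / H * np.pi           # polar angle per row
    phi = (np.arange(W) + 0.5) / W * 2.0 * np.pi       # azimuth per column
    # Solid angle of each pixel on the sphere: sin(theta) * dtheta * dphi.
    d_omega = np.outer(np.sin(theta), np.ones(W)) * (np.pi / H) * (2.0 * np.pi / W)

    n_coeff = (l_max + 1) ** 2                         # 36 for l_max = 5
    coeffs = np.zeros((n_coeff, 3))
    for i, t in enumerate(theta):
        for j, p in enumerate(phi):
            basis = real_sh_basis(t, p, l_max)         # hypothetical: (36,) basis values
            coeffs += np.outer(basis, env_map[i, j]) * d_omega[i, j]
    return coeffs
```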
  • As such, the ground truth SH coefficients L_gt, the ground truth normal map of the object N_gt, and the ground truth normal map of the plane N_p are obtained.
  • FIG. 7 shows an exemplary generation process of ground truth data.
  • Ground truth SH coefficients and normal maps are generated for images each of which contains a support plane in the Matterport3D dataset.
  • In the Matterport3D dataset, the ground truth lighting is a panorama environment map, and a ground truth SH lighting is generated based on 5th-order SH coefficients.
  • Here, the normal map of the plane N_p is extracted from ground truth data in the Matterport3D dataset directly, while in practical applications, a normal map of a plane N_p can be provided from an output of an AR framework (e.g., ARCore) on an input image. All N_p discussed here can be in screen-space coordinates.
  • the neural network is trained with the first dataset as follows.
  • A training image is obtained from the first dataset, where the training image is any of the images in the first dataset.
  • A foreground of the training image is extracted.
  • The extracted foreground of the training image is input to the neural network to obtain predicted parameters, where the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness.
  • An image is rendered with the predicted parameters.
  • A loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image. The neural network is trained based on the loss.
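  • As an illustrative PyTorch training-loop sketch for the procedure above (the dataset keys, batch size, optimizer settings, and the render_screen_space and compute_losses helpers are assumptions; sketches of the rendering and loss terms appear later in this description):

```python
# Illustrative training loop: predict parameters, re-render the image, and
# minimize the combined loss against the ground truth parameters.
import torch
from torch.utils.data import DataLoader


def train(model, dataset, epochs=10, lr=1e-4, device="cuda"):
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=8, shuffle=True)

    for _ in range(epochs):
        for batch in loader:
            foreground = batch["foreground"].to(device)     # extracted foreground I_A
            # Predicted parameters: normal map, albedo, roughness, SH coefficients.
            n_hat, a_hat, r_hat, sh_hat = model(foreground)
            # Re-render the scene image from the predicted parameters.
            rendered = render_screen_space(n_hat, a_hat, r_hat, sh_hat)
            # Combined rendering / SH / normal-map / planar-normal-map / albedo loss.
            gt = {k: batch[k].to(device)
                  for k in ("N_gt", "N_p", "L_gt", "mask", "plane_mask")}
            loss = compute_losses(rendered, foreground, n_hat, a_hat, sh_hat, gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```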
  • FIG. 8 is a diagram illustrating a comparison of illumination and normal map predictions of a network (middle) trained with a normal map of a plane and a network (top) trained without a normal map of a plane according to implementations.
  • the qualitative comparison in FIG. 8 shows that, without the guidance of N_P, texture on the objects may mislead the normal map estimation and further produce incorrect illumination estimation.
  • FIG. 9 is a diagram illustrating a comparison of illumination and normal map predictions of a network (middle) trained with a normal map of a plane and a network (top) trained without a normal map of a plane according to other implementations.
  • In one example, the normal map output from S-P is not correct; the part marked by a block indicates that the normal map in that region is very different, which is not correct.
  • In the other example, the normal map output from S-P is too smooth. In both examples, the lighting is too dark for the S-P results.
  • the image can be rendered with the predicted parameters as follows.
  • The image is rendered according to the rendering equation L_o(x, ω_o) = ∫_Ω f_r(ω_i, ω_o) L_i(ω_i) (n·ω_i) dω_i, where L_o represents the total spectral radiance directed outward along the eye's direction ω_o from a particular position x on the object, L_i represents the incident radiance at x from direction ω_i, and f_r is the bidirectional reflectance distribution function (BRDF).
  • the lighting is parameterized as 5th-order spherical harmonic lighting. Since it is global lighting, the radiance is dependent only on the direction. More specifically, L_i can be represented as the SH expansion L_i(ω_i) = Σ_l Σ_m L_lm Y_lm(ω_i), where Y_lm are the SH basis functions and L_lm are the SH coefficients for bands l = 0 to 5.
  • the BRDF model is defined as a microfacet model whose specular term is D·F·G / (4 (n·ω_i)(n·ω_o)), where D, F, and G represent the normal distribution, Fresnel, and geometric terms, respectively.
  • (θ_i, φ_i) and (θ_o, φ_o) represent the altitude and azimuth of the incident light direction and the eye direction in local coordinates, respectively.
  • the BRDF is radially symmetric, that is, it is solely dependent on θ′_i when θ′_o is fixed. This property can simplify the integral in eqn. (4).
  • L_i(θ_i, φ_i) is converted to L_i(θ′_i, φ′_i) by rotating the SH lighting into the local coordinate frame, where N_lm′ is a normalization factor and P_lm′ is the associated Legendre function.
  • For each SH band l, one of the resulting terms is a constant number which can be analytically calculated. The other term involves a more complicated integral that does not seem possible to solve in a closed form. Hence, it is expanded in a Taylor series in terms of cos θ′_o. Since cos θ′_o ≤ 1, the high-order terms can be neglected. A polynomial with a degree of 5 is found to be sufficient for this problem. Hence, the term can be approximated via a 5-degree polynomial in terms of cos θ′_o. Since the camera's field of view (FOV) is available, cos θ′_o can be determined by the pixel's position in the image, and hence the term can be written as a function of the pixel's position.
  • the rendering process, which produces the re-rendered image of the foreground, uses only the screen-space attributes (the predicted normal map, albedo, and roughness) together with the SH lighting, and it can be formulated as a low-cost linear combination of polynomials of these attributes; it is differentiable.
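  • As an illustrative sketch of a differentiable screen-space shading step (diffuse term only; the specular microfacet term and its polynomial approximation described above are omitted, and sh_basis_at_normals is a hypothetical helper evaluating the 36 SH basis functions at each pixel's normal):

```python
# Illustrative sketch: shade each pixel from the predicted normal map, albedo,
# and SH lighting coefficients, keeping the whole operation differentiable.
import torch

# Lambertian cosine-lobe coefficients A_l for bands l = 0..5 (standard values:
# pi, 2*pi/3, pi/4, 0, -pi/24, 0).
A_L = torch.tensor([3.141593, 2.094395, 0.785398, 0.0, -0.130900, 0.0])


def render_diffuse(normal, albedo, sh_coeffs):
    """normal: (B,3,H,W) unit normals; albedo: (B,3,H,W); sh_coeffs: (B,3,36)."""
    basis = sh_basis_at_normals(normal)                  # hypothetical: (B,36,H,W)
    # Convolve the lighting with the cosine lobe: scale each coefficient by A_l.
    band_of = torch.tensor([l for l in range(6) for _ in range(2 * l + 1)])
    scaled = sh_coeffs * A_L[band_of]                    # (B,3,36)
    irradiance = torch.einsum("bck,bkhw->bchw", scaled, basis)
    return albedo / torch.pi * irradiance.clamp(min=0)   # diffuse radiance per pixel
```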
  • the loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and the ground truth parameters corresponding to the training image as follows.
  • A rendering loss is calculated based on the rendered image and the training image.
  • An SH loss is calculated based on the predicted SH coefficients and ground truth SH coefficients.
  • An object-normal-map loss is calculated based on the predicted normal map of the object and a ground truth normal map of the object.
  • A planar-normal-map loss is calculated based on the predicted normal map of the plane and a ground truth normal map of the plane.
  • An albedo loss is calculated based on the predicted albedo and a ground truth albedo. A weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss is calculated as the loss.
  • In implementations, labeled training data is used to supervise intermediate components to make the estimated illumination more determinable.
  • The ground truth SH coefficients and the ground truth normal map of the object can be extracted from the first dataset; as such, the predicted SH coefficients and the predicted normal map of the object can be supervised.
  • the ground truth normal map of the plane can also be provided by ARCore, thereforethe predicted normal map of the plane can be supervised.
  • L_r represents a pixel-wise l1 difference between the input image I and the rendered image of the foreground.
  • Using the l1 norm as a loss function helps with robustness to outliers, such as self-shadowing or extreme mirror reflections in I that are ignored in the rendered image.
  • L_S represents a mean square error (MSE) loss on the SH coefficients; it is defined as the MSE between the predicted SH coefficients and the ground truth SH coefficients L_gt.
  • L_N measures a pixel-wise l2 difference between the ground truth normal map of the object N_gt and the predicted normal map of the object on the extracted foreground.
  • For the planar-normal-map loss, a subset of the plane-region pixels is used, namely the top β% of pixels whose predicted normals have the smallest distance to N_P.
  • That is, the top β% of pixels that are closest to N_P are selected and their l2 distance to N_P is minimized.
  • β can be set to 80 empirically.
  • L_a is based on a similarity of chromaticity and intensity between pixels. This term is inspired by the multi-scale shading smoothness property. It is defined as a weighted l2 term over neighboring pixels, where the weights are negative gradient magnitudes.
  • Here, nb(i) denotes the 8-connected neighborhood around pixel i, and the weights are computed from the gradient of the image I.
  • A final loss function can be defined as a weighted sum of the rendering loss L_r, the SH loss L_S, the object-normal-map loss L_N, the planar-normal-map loss, and the albedo loss L_a.
  • an extra regularizer term can be added to further constrain the optimization using statistical regularization on the estimated SH coefficients.
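  • As an illustrative PyTorch sketch of these loss terms (the loss weights, the value of beta, and the tensor layouts are assumptions; the albedo smoothness term here uses only horizontal neighbors as a simplification of the 8-connected neighborhood):

```python
# Illustrative sketch of the combined loss: rendering, SH, object normal map,
# planar normal map, and albedo smoothness terms.
import torch
import torch.nn.functional as F


def compute_losses(rendered, image, n_hat, a_hat, sh_hat, gt,
                   weights=(1.0, 1.0, 1.0, 1.0, 1.0), beta=0.8):
    w_r, w_s, w_n, w_p, w_a = weights
    mask = gt["mask"]                                           # (B,1,H,W) object mask

    L_r = (rendered - image).abs().mean()                       # pixel-wise l1 rendering loss
    L_s = F.mse_loss(sh_hat, gt["L_gt"])                        # MSE on SH coefficients
    L_n = (((n_hat - gt["N_gt"]) ** 2).sum(1) * mask[:, 0]).mean()   # object normal-map loss

    # Planar loss: keep the top beta fraction of plane pixels closest to N_p.
    plane = gt["plane_mask"][:, 0].bool()                       # (B,H,W) plane mask
    d2 = ((n_hat - gt["N_p"]) ** 2).sum(1)[plane]               # squared distances on the plane
    k = max(1, int(beta * d2.numel()))
    L_p = torch.topk(d2, k, largest=False).values.mean()

    # Albedo smoothness: penalize albedo differences where the image gradient is small.
    img_grad = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    alb_diff = ((a_hat[..., :, 1:] - a_hat[..., :, :-1]) ** 2).mean(1, keepdim=True)
    L_a = (torch.exp(-img_grad) * alb_diff).mean()

    return w_r * L_r + w_s * L_s + w_n * L_n + w_p * L_p + w_a * L_a
```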
  • FIG. 10 compares the normal map estimation using two existing estimation methods and the method in the present disclosure. As illustrated in FIG. 10, in both cases, the method in the present disclosure is significantly better. The two existing estimation methods tend to over-smooth the normal map due to handcrafted regularization. In contrast, the method in the present disclosure recovers high-frequency details for surface normal maps, even in specular and shadowed regions.
  • FIG. 11 is a diagram illustrating an example of rendering a virtual object in a real scene. As illustrated in FIG. 11, the left shows device-taken pictures and the right shows composed pictures obtained by rendering virtual objects with the estimated illumination. The top row is a lighter scene while the bottom row is a darker one. By means of implementations of the present disclosure, the lighting condition can be estimated accurately.
  • FIG. 12 is a schematic flowchart of a method for training a neural network according to implementations. As illustrated in FIG. 12, the method for training a neural network includes the following.
  • a training image is obtained from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset.
  • Each of the images contains an object and a plane on which the object is located
  • the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
  • a foreground of the training image is extracted.
  • the extracted foreground of the training image is input to a neural network to obtain predicted parameters.
  • the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness.
  • an image is rendered with the predicted parameters.
  • a loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image.
  • the neural network is trained based on the loss.
  • the loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and the ground truth parameters corresponding to the training image as follows.
  • a rendering loss is calculated based on the rendered image and the training image.
  • An SH loss is calculated based on the predicted SH coefficients and ground truth SH coefficients.
  • An object-normal-map loss is calculated based on the predicted normal map of the object and a ground truth normal map of the object.
  • A planar-normal-map loss is calculated based on the predicted normal map of the plane and a ground truth normal map of the plane.
  • An albedo loss is calculated based on the predicted albedo and a ground truth albedo.
  • A weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss is calculated as the loss.
  • the method further includes the following.
  • Images each containing an object and a plane on which the object is located are selected from a second dataset, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images.
  • For each of the selected images: a ground truth normal map of the object and a ground truth normal map of the plane are calculated; ground truth SH coefficients corresponding to the image are extracted by applying a spherical convolution to the ground truth HDR illumination map; and a ground truth albedo and a ground truth roughness corresponding to the image are calculated.
  • The first dataset is constructed based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
  • the electronic device includes hardware structures and/or software modules corresponding to the respective functions.
  • the present disclosure can be implemented in hardware or a combination of the hardware and computer software. Whether a function is implemented by way of the hardware or hardware driven by the computer software depends on the particular application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered as beyond the scope of the present disclosure.
  • each functional unit may be divided according to each function, and two or more functions may be integrated in one processing unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional units. It should be noted that the division of units in the implementations is schematic, and is merely a logical function division, and there may be other division manners in actual implementation.
  • FIG. 13 is a schematic structural diagram of an apparatus for rendering a virtual object based on illumination estimation according to implementations.
  • the apparatus for rendering a virtual object based on illumination estimation includes a capturing unit 131, an extracting unit 132, an estimating unit 133, and a rendering unit 134.
  • the capturing unit 131 is configured to capture an image in which at least one object is located on at least one plane.
  • the extracting unit 132 is configured to extract a foreground of the image.
  • the estimating unit 133 is configured to estimate illumination corresponding to the extracted foreground of the image.
  • the rendering unit 134 is configured to render a virtual object with the estimated illumination.
  • With the apparatus, an image in which at least one object is located on at least one plane in a real scene is captured in real time, then a foreground of the image is extracted, thereafter illumination corresponding to the extracted foreground of the image is estimated, and finally a virtual object is rendered in the real scene with the estimated illumination.
  • As such, a real-world lighting condition can be estimated in real time and the virtual object can be rendered in the real scene with the estimated illumination, thereby improving rendering quality and the sense of reality of rendering the virtual object in the real scene.
  • the estimating unit 133 is configured to: input the extracted foreground to a neural network; obtain predicted spherical harmonics (SH) coefficients output by the neural network; and determine the illumination corresponding to the image based on the predicted SH coefficients.
  • the extracting unit 132 is configured to: detect existence of at least one preset object in the image; select, from the at least one preset object, an object which is in the center of the image as a target object; and determine and extract the foreground of the image based on the target object and a plane on which the target object is located.
  • the extracting unit 132 is configured to: select, from the at least one preset object, an object which is at least partially located approximately in a center of the image, as the target object.
  • the extracting unit 132 is configured to: determine a bounding box according to the target object; divide the bounding box into an upper bounding box and a lower bounding box; expand the lower bounding box with a magnification to include part of the plane on which the target object is located; and determine a part of the image framed by the upper bounding box and the expanded lower bounding box as the foreground of the image.
  • the apparatus further includes a training unit 135.
  • the training unit 135 is configured to: train the neural network with a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
  • the apparatus further includes a constructing unit 136.
  • the constructing unit 136 is configured to: select, from a second dataset, images each containing an object and a plane on which the object is located, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images; for each of the selected images: calculate a ground truth normal map of the object and a ground truth normal map of the plane, extract ground truth SH coefficients corresponding to the image by applying a spherical convolution to the ground truth HDR illumination map, and calculate a ground truth albedo and a ground truth roughness corresponding to the image; and construct the first dataset based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
  • the training unit 135 is configured to: obtain a training image from the first dataset, where the training image is any of the images in the first dataset; extract a foreground of the training image; input the extracted foreground of the training image to the neural network to obtain predicted parameters, where the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness; render an image with the predicted parameters; calculate a loss of the neural network based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image; and train the neural network based on the loss.
  • the training unit 135 is configured to: calculate a rendering loss based on the rendered image and the training image; calculate an SH loss based on the predicted SH coefficients and ground truth SH coefficients; calculate an object-normal-map loss based on the predicted normal map of the object and a ground truth normal map of the object; calculate a planar-normal-map loss based on the predicted normal map of the plane and a ground truth normal map of the plane; calculate an albedo loss based on the predicted albedo and a ground truth albedo; and calculate a weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss as the loss.
  • FIG. 14 is a schematic structural diagram of an apparatus for training a neural network according to implementations.
  • the apparatus for training a neural network includes an obtaining unit 141, an extracting unit 142, an inputting unit 143, a rendering unit 144, a calculating unit 145, and a training unit 146.
  • the obtaining unit 141 is configured to obtain a training image from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset.
  • Each of the images contains an object and a plane on which the object is located
  • the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
  • the extracting unit 142 is configured to extract a foreground of the training image.
  • the inputting unit 143 is configured to input the extracted foreground of the training image to a neural network to obtain predicted parameters.
  • the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness.
  • the rendering unit 144 is configured to render an image with the predicted parameters
  • the calculating unit 145 is configured to calculate a loss of the neural network based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image.
  • the training unit 146 is configured to train the neural network based on the loss.
  • the calculating unit 145 is configured to: calculate a rendering loss based on the rendered image and the training image; calculate an SH loss based on the predicted SH coefficients and ground truth SH coefficients; calculate an object-normal-map loss based on the predicted normal map of the object and a ground truth normal map of the object; calculate a planar-normal-map loss based on the predicted normal map of the plane and a ground truth normal map of the plane; calculate an albedo loss based on the predicted albedo and a ground truth albedo; and calculate a weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss as the loss.
  • the apparatus further includes a constructing unit 147.
  • the constructing unit 147 is configured to: select, from a second dataset, images each containing an object and a plane on which the object is located, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images; for each of the selected images: calculate a ground truth normal map of the object and a ground truth normal map of the plane, extract ground truth SH coefficients corresponding to the image by applying a spherical convolution to the ground truth HDR illumination map, and calculate a ground truth albedo and a ground truth roughness corresponding to the image; and construct the first dataset based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
  • FIG. 15 is a schematic structural diagram of a terminal device according to implementations.
  • the terminal device 150 includes a processor 151, a memory 152, a communication interface 153, and one or more programs 154 stored in the memory 152 and executed by the processor 151.
  • the one or more programs 154 include instructions for performing the following operations.
  • An image in which at least one object is located on at least one plane is captured.
  • a foreground of the image is extracted.
  • Illumination corresponding to the extracted foreground of the image is estimated.
  • A virtual object is rendered with the estimated illumination.
  • the one or more programs 154 include instructions for performing the following operations.
  • The extracted foreground is input to a neural network.
  • Predicted spherical harmonics (SH) coefficients output by the neural network are obtained.
  • The illumination corresponding to the image is determined based on the predicted SH coefficients.
  • the one or more programs 154 include instructions for performing the following operations.
  • Existence of at least one preset object in the image is detected.
  • An object which is in the center of the image is selected from the at least one preset object as a target object.
  • The foreground of the image is determined and extracted based on the target object and a plane on which the target object is located.
  • the one or more programs 154 include instructions for performing the following operations.
  • The object which is at least partially located approximately in a center of the image is selected from the at least one preset object as the target object.
  • In terms of determining the foreground of the image based on the target object and the plane on which the target object is located, the one or more programs 154 include instructions for performing the following operations.
  • A bounding box is determined according to the target object.
  • The bounding box is divided into an upper bounding box and a lower bounding box.
  • The lower bounding box is expanded with a magnification to include part of the plane on which the target object is located.
  • A part of the image framed by the upper bounding box and the expanded lower bounding box is determined as the foreground of the image.
  • the one or more programs 154 further include instructions for performing the following operations.
  • The neural network is trained with a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
  • the one or more programs 154 further include instructions for performing the following operations. Images each containing an object and a plane on which the object is located are selected from a second dataset, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images. For each of the selected images: a ground truth normal map of the object and a ground truth normal map of the plane are calculated, ground truth SH coefficients corresponding to the image are extracted by applying a spherical convolution to the ground truth HDR illumination map, and a ground truth albedo and a ground truth roughness corresponding to the image are calculated.
  • The first dataset is constructed based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
  • the one or more programs 154 include instructions for performing the following operations.
  • A training image is obtained from the first dataset, where the training image is any of the images in the first dataset.
  • A foreground of the training image is extracted.
  • the extracted foreground of the training image is input to the neural network to obtain predicted parameters, where the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness.
  • An image is rendered with the predicted parameters.
  • a loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image.
  • the neural network is trained based on the loss.
  • the one or more programs 154 include instructions for performing the following operations.
  • A rendering loss is calculated based on the rendered image and the training image.
  • An SH loss is calculated based on the predicted SH coefficients and ground truth SH coefficients.
  • An object-normal-map loss is calculated based on the predicted normal map of the object and a ground truth normal map of the object.
  • A planar-normal-map loss is calculated based on the predicted normal map of the plane and a ground truth normal map of the plane.
  • An albedo loss is calculated based on the predicted albedo and a ground truth albedo.
  • A weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss is calculated as the loss.
  • FIG. 16 is a schematic structural diagram of a terminal device according to implementations.
  • the terminal device 160 includes a processor 161, a memory 162, a communication interface 163, and one or more programs 164 stored in the memory 162 and executed by the processor 161.
  • the one or more programs 164 include instructions for performing the following operations.
  • A training image is obtained from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset.
  • A foreground of the training image is extracted.
  • The extracted foreground of the training image is input to a neural network to obtain predicted parameters.
  • A loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image. The neural network is trained based on the loss.
  • each of the images contains an object and a plane on which the object is located
  • the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
  • the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness.
  • An image is rendered with the predicted parameters.
  • the one or more programs 164 include instructions for performing the following operations.
  • A rendering loss is calculated based on the rendered image and the training image.
  • An SH loss is calculated based on the predicted SH coefficients and ground truth SH coefficients.
  • An object-normal-map loss is calculated based on the predicted normal map of the object and a ground truth normal map of the object.
  • A planar-normal-map loss is calculated based on the predicted normal map of the plane and a ground truth normal map of the plane.
  • An albedo loss is calculated based on the predicted albedo and a ground truth albedo.
  • A weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss is calculated as the loss.
  • the one or more programs 164 further include instructions for performing the following operations. Images each containing an object and a plane on which the object is located are selected from a second dataset, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images. For each of the selected images: a ground truth normal map of the object and a ground truth normal map of the plane are calculated, ground truth SH coefficients corresponding to the image are extracted by applying a spherical convolution to the ground truth HDR illumination map, and a ground truth albedo and a ground truth roughness corresponding to the image are calculated.
  • The first dataset is constructed based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
  • a non-transitory computer storage medium is also provided.
  • the non-transitory computer storage medium is configured to store programs which, when executed, are operable to execute some or all operations of the method for rendering a virtual object based on illumination estimation or some or all operations of the method for training a neural network as described in the above-described method implementations.
  • A computer program product includes a non-transitory computer-readable storage medium that stores computer programs.
  • the computer programs are operable with a computer to execute some or all operations of the method for rendering a virtual object based on illumination estimation or some or all operations of the method for training a neural network as described in the above-described method implementations.
  • the apparatus disclosed in implementations provided herein may be implemented in other manners.
  • the device/apparatus implementations described above are merely illustrative; for instance, the division of the unit is only a logical function division and there can be other manners of division during actual implementations; for example, multiple units or components may be combined or may be integrated into another system, or some features may be ignored, omitted, or not performed.
  • coupling or communication connection between each illustrated or discussed component may be direct coupling or communication connection, may be indirect coupling or communication among devices or units via some interfaces, and may be electrical connection, mechanical connection, or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components illustrated as units may or may not be physical units, that is, they may be in the same place or may be distributed to multiple network elements. All or part of the units may be selected according to actual needs to achieve the purpose of the technical solutions of the implementations.
  • the functional units in various implementations of the present disclosure may be integrated into one processing unit, or each unit may be physically present, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or a software function unit.
  • the integrated unit may be stored in a computer-readable memory when it is implemented in the form of a software functional unit and is sold or used as a separate product.
  • the technical solutions of the present disclosure essentially, or the part of the technical solutions that contributes to the related art, or all or part of the technical solutions, may be embodied in the form of a software product which is stored in a memory and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, and so on) to perform all or part of the steps described in the various implementations of the present disclosure.
  • the memory includes various media capable of storing program codes, such as a USB (universal serial bus) flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, a compact disc (CD), or the like.
  • a program to instruct associated hardware may be stored in a computer-readable memory, which may include a flash memory, a read-only memory (ROM) , a random-access memory (RAM) , a disk or a compact disc (CD) , and so on.


Abstract

A method for rendering a virtual object based on illumination estimation and related products are provided. The method for rendering a virtual object based on illumination estimation includes the following. An image in which at least one object is located on at least one plane is captured. A foreground of the image is extracted. Illumination corresponding to the extracted foreground image is estimated. A virtual object is rendered with the estimated illumination. By means of implementations of the present disclosure, a real-world lighting condition can be estimated in real-time and a virtual object can be rendered with estimated illumination.

Description

METHOD FOR RENDERING VIRTUAL OBJECT BASED ON ILLUMINATION ESTIMATION, METHOD FOR TRAINING NEURAL NETWORK, AND RELATED PRODUCTS
CROSS-REFERENCE TO RELATED APPLICATION (S)
This application claims priority to and the benefit of U.S. Provisional Application Patent Serial No. 62/967,739, filed January 30, 2020, the entire disclosure of which is hereby incorporated by reference.
TECHNICAL FIELD
This disclosure relates to the field of augmented reality (AR) technology, and more particularly to a method for rendering a virtual object based on illumination estimation, a method for training a neural network, and related products.
BACKGROUND
AR applications aim to provide realistic blending between a real world and virtual objects. An augmented-reality device may be configured to display augmented-reality images to provide an illusion that virtual objects are present in a real-world physical space. One of the important factors for realistic AR is correct illumination estimation of the real world.
Illumination estimation of the real world is a challenging problem. Currently, typical solutions to this problem rely on inserting an object (e.g., a light probe) with known geometry and/or reflectance properties in the real world. Unfortunately, having to insert a known object in the real world is limiting and thus not easily amenable to practical applications.
SUMMARY
Implementations provide a method for rendering a virtual object based on illumination estimation, a method for training a neural network, and related products, which can estimate a real-world lighting condition in real-time and render a virtual object with estimated illumination.
In a first aspect, a method for rendering a virtual object based on illumination estimation is provided. The method for rendering a virtual object based on illumination estimation includes the following. An image in which at least one object is located on at least one plane is captured. A foreground of the image is extracted. Illumination corresponding to the extracted foreground image is estimated. A virtual object is rendered with the estimated illumination.
In a second aspect, a method for training a neural network is provided. The method for training a neural network includes the following. A training image is obtained from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset. A foreground of the training image is extracted. The extracted foreground of the training image is input to a neural network to obtain predicted parameters. An image is rendered with the predicted parameters. A loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image. The neural network is trained based on the loss.
In a third aspect, an apparatus for rendering a virtual object based on illumination estimation is provided. The apparatus for rendering a virtual object based on illumination estimation includes a capturing unit, an extracting unit, an estimating unit, and a rendering unit. The capturing unit is configured to capture an image in which at least one object is located on at least one plane. The extracting unit is configured to extract a foreground of the image. The estimating unit is configured to estimate illumination corresponding to the extracted foreground image. The rendering unit is configured to render a virtual object with the estimated illumination.
In a fourth aspect, an apparatus for training a neural network is provided. The apparatus for training a neural network includes an obtaining unit, an extracting unit, an inputting unit, a rendering unit, a calculating unit, and a training unit. The obtaining unit is configured to obtain a training image from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset. The extracting unit is configured to extract a foreground of the training image. The inputting unit is configured to input the extracted foreground of the training image to a neural network to obtain predicted parameters. The rendering unit is configured to render an image with the predicted parameters. The calculating unit is configured to calculate a loss of the neural network based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image. The training unit is configured to train the neural network based on the loss.
In a fifth aspect, a terminal device is provided. The terminal device includes a processor and a memory configured to store one or more programs. The one or more programs are configured to be executed by the processor and include instructions for performing some or all operations of the method described in the first or second aspect.
In a sixth aspect, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium is configured to store computer programs for electronic data interchange (EDI). The computer programs include instructions for performing some or all operations of the method described in the first or second aspect.
In a seventh aspect, a computer program product is provided. The computer program product includes a non-transitory computer-readable storage medium that stores computer programs. The computer programs are operable with a computer to execute some or all operations of the method described in the first or second aspect.
In implementations of the present disclosure, an image in which at least one object is located on at least one plane in a real scene is captured in real time, then a foreground of the image is extracted, thereafter illumination corresponding to the extracted foreground image is estimated, and finally a virtual object is rendered in the real scene with the estimated illumination. As such, with a trained neural network and an input of an image that contains one or more objects placed on a planar region captured in the real scene, a real-world lighting condition can be estimated in real-time and the virtual object can be rendered in the real scene with the estimated illumination, thereby improving the rendering quality and the sense of reality of rendering the virtual object in the real scene.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to describe technical solutions of implementations more clearly, the following will give a brief description of accompanying drawings used for describing the implementations. Apparently, accompanying drawings described below are merely some implementations. Those of ordinary skill in the art can also obtain other accompanying drawings based on the accompanying drawings described below without creative efforts.
FIG. 1 is a diagram of an example operating environment according to implementations.
FIG. 2 is a diagram of an exemplary structure of a framework of a neural network according to implementations.
FIG. 3 is a diagram of rendering virtual objects in a real scene using estimated lighting.
FIG. 4 is a schematic flowchart of a method for rendering a virtual object based on illumination estimation according to implementations.
FIG. 5 (a) -FIG. 5 (d) are diagrams of an exemplary foreground extraction process according to implementations.
FIG. 6 (a) and FIG. 6 (b) are diagrams of an exemplary design of a neural network according to implementations.
FIG. 7 is a diagram of an exemplary generation process of ground truth data.
FIG. 8 is a diagram illustrating comparison of illumination and normal map predictions of a network trained with a normal map of a plane and a network trained without a normal map of a plane according to implementations.
FIG. 9 is a diagram illustrating comparison of illumination and normal map predictions of a network (middle) trained with a normal map of a plane and a network (top) trained without a normal map of a plane according to other implementations.
FIG. 10 is a diagram illustrating comparison between illumination estimation in the related art and in the present disclosure.
FIG. 11 is a diagram illustrating an example of rendering a virtual object in a real scene.
FIG. 12 is a schematic flowchart of a method for training a neural network according to implementations.
FIG. 13 is a schematic structural diagram of an apparatus for rendering a virtual object based on illumination estimation according to implementations.
FIG. 14 is a schematic structural diagram of an apparatus for training a neural network according to implementations.
FIG. 15 is a schematic structural diagram of a terminal device according to implementations.
FIG. 16 is a schematic structural diagram of a terminal device according to other implementations.
DETAILED DESCRIPTION
In order for those skilled in the art to better understand technical solutions of implementations, technical solutions of implementations will be described clearly and completely with reference to accompanying drawings in the implementations. Apparently, implementations hereinafter described are merely some implementations, rather than all implementations, of the disclosure. All other implementations obtained by those of ordinary skill in the art based on the implementations herein without creative efforts shall fall within the protection scope of the disclosure.
The terms "first", "second", "third", and the like used in the specification, the claims, and the accompanying drawings of the disclosure are used to distinguish different objects rather than describe a particular order. The terms "include" and "have" as well as variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus including a series of steps or units is not limited to the listed steps or units. Instead, it can optionally include other steps or units that are not listed; alternatively, other steps or units inherent to the process, method, product, or apparatus can also be included.
The term "implementation" referred to herein means that a particular feature, structure, or character described in conjunction with the implementation may be contained in at least one implementation of the disclosure. The phrase appearing in various places in the specification does not necessarily refer to the same implementation, nor does it refer to an independent or alternative implementation that is mutually exclusive with other implementations. It is explicitly and implicitly understood by those skilled in the art that an implementation described herein may be combined with other implementations.
A terminal device referred to herein may include various handheld devices, in-vehicle devices, wearable devices, and computing devices that have wireless communication functions, other processing devices connected to a wireless modem, as well as various forms of user equipment (UE), mobile stations (MS), mobile terminals, and the like. For ease of description, the above-mentioned devices are collectively referred to as a terminal device.
For ease of better understanding of implementations of the present disclosure, related technology involved in the present disclosure will be briefly introduced below.
Illumination estimation has been a long-standing problem in both computer vision and graphics. A direct way of estimating illumination of an environment is to capture light radiance at a target location using a physical light probe. Photographs of a mirrored sphere with different exposures can be used to compute illumination at the sphere's location. Beyond mirrored spheres, it is also possible to estimate illumination using hybrid spheres, known 3D objects, objects with known surface material, or even human faces as proxies for light probes. However, the process of physically capturing high quality illumination maps can be expensive and difficult to scale, especially when the goal is to obtain training data for a dense set of visible locations in a large variety of environments.
Another approach to estimating illumination is to jointly optimize the geometry, reflectance properties, and lighting models of a scene in order to find a set of values that can best explain the input image. However, directly optimizing all scene parameters is often a highly under-constrained problem, that is, an error in one parameter estimation can easily propagate into another. Therefore, to ease the optimization process, many prior methods either assume additional user-provided ground truth information as input or make strong assumptions about the lighting models.
Deep learning has recently shown promising results on a number of computer vision tasks, including depth estimation and intrinsic image decomposition. Recently, LeGendre et al. propose to formulate an illumination estimation function as an end-to-end neural network; they use a Google Pixel phone to capture pictures with balls of different bidirectional reflectance distribution functions (BRDFs) as a light probe. In order to achieve real-time performance, they compress the environment to a tiny size (32x32). This makes the predicted map unstable and over-sensitive to lighting intensity, which will break temporal consistency. On the high accuracy end, Shuran et al. propose an end-to-end network that directly maps an LDR image to an HDR environment map using geometric warping and an adversarial network. However, obtaining the HDR environment map is computationally heavy and is hard to perform in real-time. Moreover, using the HDR map as the lighting will also add to the cost of rendering. For AR applications, real-time or near real-time performance is required. In implementations of the present disclosure, 5th-order spherical harmonic lighting is used to approximate an environment map. Without losing too much accuracy, a few parameters can be used instead of the HDR map, which will lower the prediction and rendering cost.
As the deep learning community grows, many tasks are bridging computer vision (2D) and computer graphics (3D), and lighting estimation belongs to them as well. Differentiable renderers have been widely used in tasks such as joint shape and BRDF reconstruction, and joint illumination and BRDF estimation. To relate changes in a captured image with those in a 3D shape manipulation, a number of existing techniques have utilized derivatives of rendering. However, they are designed based on modern complex rendering systems (rasterization or ray tracing) and are too heavy in a deep learning system. Another problem is that they require all 3D geometric properties (coordinates, normals, BRDFs) to perform accurate rendering. In such systems, the reconstruction tasks are always intertwined, and to reconstruct one property, all other properties also need to be reconstructed as a "side effect". However, simultaneously reconstructing all properties is not a good idea. In the view of supervised learning, 3D labeling is too expensive and often not accurate. And in the view of unsupervised learning, the amount of current data is not enough to learn an intrinsic manifold between properties. Hence, in implementations of the present disclosure, a lightweight differentiable screen-space renderer is developed. Although it still needs several properties (lighting (e.g., SH coefficients), normal, albedo, and roughness) to do the rendering, it needs far fewer. All these properties are defined per-pixel and the cost of rendering will be lower; therefore, a more computationally efficient solution is provided.
Hereinafter, detailed description of implementations of the present disclosure will be given below.
FIG. 1 is a diagram of an example operating environment according to implementations. As illustrated in FIG. 1, the example operating environment 100 includes at least two computing devices interconnected through one or more networks 11. The one or more networks 11 allow a computing device to connect to and communicate with another computing device. In some implementations, the at least two computing devices include a terminal device 12 and a server 13. The at least two computing devices may include other computing devices not shown, which is not limited herein. The one or more networks 11 may include a secure network such as an enterprise private network, an unsecure network such as a wireless open network, a local area network (LAN), a wide area network (WAN), and the Internet. In the following, for ease of explanation, one network 11 is used.
In some implementations, the terminal device 12 includes a network interface 121, a processor 122, a memory 123, a camera 124, sensors 125, and a display 126 which are in communication with each other. The network interface 121 allows the terminal device 12 to connect to the network 11. The network interface 121 may include at least one of: a wireless network interface, a modem, or a wired network interface. The processor 122 allows the terminal device 12 to execute computer readable instructions stored in the memory 123 to perform processes discussed herein. The camera 124 may capture color images and/or depth images of an environment. The terminal device 12 may include outward facing cameras that capture images of the environment and inward facing cameras that capture images of the end user of the terminal device. The sensors 125 may generate motion and/or orientation information associated with the terminal device 12. In some cases, the sensors 125 may include an inertial measurement unit (IMU). The display 126 may display digital images and/or videos. The display 126 may include a see-through display. The display 126 may include a light emitting diode (LED) or organic LED (OLED) display.
In some implementations, various components of the terminal device 12, such as the network interface 121, the processor 122, the memory 123, the camera 124, and the sensors 125, may be integrated on a single chip substrate. In one example, the network interface 121, the processor 122, the memory 123, the camera 124, and the sensors 125 may be integrated as a system on a chip (SOC). In other implementations, the network interface 121, the processor 122, the memory 123, the camera 124, and the sensors 125 may be integrated within a single package.
In some implementations, the terminal device 12 may provide a natural user interface (NUI) by employing the camera 124, the sensors 125, and gesture recognition software running on the processor 122. With the natural user interface, a person's body parts and movements may be detected, interpreted, and used to control various aspects of computing applications in the terminal. In one example, a computing device utilizing a natural user interface may infer the intent of a person interacting with the computing device (e.g., that the end user has performed a particular gesture in order to control the computing device).
In one example, the terminal device 12 includes a helmet-mounted display (HMD) that provides an augmented, mixed, or virtual reality environment to an end user of the HMD. In the context of an augmented or mixed reality environment, the HMD may include a video see-through and/or an optical see-through system. An optical see-through HMD worn by an end user may allow actual direct viewing of a real-world environment (e.g., via transparent lenses) and may, at the same time, project images of a virtual object into the visual field of the end user, thereby augmenting the real-world environment perceived by the end user with the virtual object.
Utilizing an HMD, an end user may move around a real-world environment (e.g., a living room) wearing the HMD and perceive views of the real world overlaid with images of virtual objects. The virtual objects may appear to maintain a coherent spatial relationship with the real-world environment (i.e., as the end user turns their head or moves within the real-world environment, the images displayed to the end user will change such that the virtual objects appear to exist within the real-world environment as perceived by the end user). The virtual objects may also appear fixed with respect to the end user's point of view (e.g., a virtual menu that always appears in the top right corner of the end user's point of view regardless of how the end user turns their head or moves within the real-world environment). In one implementation, environmental mapping of the real-world environment may be performed by the server 13 (i.e., on the server side) while camera localization may be performed on the terminal device 12 (i.e., on the client side). The virtual objects may include a text description associated with a real-world object.
In some implementations, a terminal device, such as the terminal device 12, may be in communication with a server in the cloud, such as the server 13, and may provide to the server location information (e.g., the location of the terminal device via GPS coordinates) and/or image information (e.g., information regarding objects detected within a field of view of the terminal device) associated with the terminal device. In response, the server may transmit to the terminal device one or more virtual objects based upon the location information and/or image information provided to the server. In one implementation, the terminal device 12 may specify a particular file format for receiving the one or more virtual objects and the server 13 may transmit to the terminal device 12 the one or more virtual objects embodied within a file of the particular file format.
FIG. 2 is a diagram of an example structure of a framework of a neural network according to implementations. As illustrated in FIG. 2, for a single input image, an initial foreground and an augmented foreground are extracted. The augmented foreground is fed to an initial encoder, then is decoded to normal, albedo, and roughness respectively, and thereafter a 5th-order spherical harmonic lighting is regressed. Then a mask of a target object is multiplied with the predicted normal, albedo, and roughness to remove a planar region. A Screen-Space Renderer will take the normal, albedo, roughness, and SH lighting to generate a re-rendered scene image.
FIG. 3 is a diagram of rendering virtual objects into a real scene using estimated lighting. The left is an indoor image taken by a mobile device, such as a phone; the lighting condition is then estimated from this captured image and virtual objects are rendered into the real scene. Some detailed rendering effects such as soft shadowing and glossy surfaces are shown in the zoom-in figures.
FIG. 4 is a schematic flowchart of a method for rendering a virtual object based on illumination estimation according to implementations. The method for rendering a virtual object based on illumination estimation can be applicable to the terminal device 12 illustrated in FIG. 1. In this implementation, an image in which at least one object is located on at least one plane in a real scene is captured in real time, then a foreground of the image is extracted, thereafter illumination corresponding to the extracted foreground image is estimated, and finally a virtual object is rendered in the real scene with the estimated illumination. As such, with a trained neural network and an input of an image that contains one or more objects placed on a planar region captured in the real scene, a real-world lighting condition can be estimated in real-time and the virtual object can be rendered in the real scene with the estimated illumination, thereby improving the rendering quality and the sense of reality of rendering the virtual object in the real scene.
As illustrated in FIG. 4, the method for rendering a virtual object based on illumination estimation includes the following.
At block 402, an image in which at least one object is located on at least one plane is captured.
Specifically, the illumination estimation is based on a visual appearance of the at least one object in a real scene, and the image can be captured by a monocular RGB camera of the terminal device.
At block 404, a foreground of the image is extracted.
Background contents of an image are usually too small, and regions near a boundary of the image may even be distorted by camera projection. Hence, the foreground of the image is extracted to reduce computation complexity.
As an implementation, the foreground of the image can be extracted as follows. Existence of at least one preset object is detected in the image. An object which is in the center of the image is selected from the at least one preset object as a target object. The foreground of the image is determined and extracted based on the target object and a plane on which the target object is located. FIG. 5 (a) -FIG. 5 (d) illustrate an example foreground extraction process according to implementations, where FIG. 5 (a) is a diagram of an example captured image, FIG. 5 (b) is a diagram of a target object in the image, FIG. 5 (c) is a diagram of an initial extraction (e.g., segmentation), and FIG. 5 (d) is a diagram of a final extraction.
Specifically, preset objects (e.g., banana, vase, etc.) which are most likely to be found on an indoor table or ground are collected from the common objects in context (COCO) dataset. FIG. 5 (a) gives such an example. In order to do foreground extraction, Detectron 2 is used to detect the at least one preset object and the target object. As illustrated in FIG. 5 (a), the at least one preset object detected in the image may include a sofa, a table, and a flowerpot. After the at least one preset object is detected, only an object which is in the center of the image is selected from the at least one preset object as the target object.
As an implementation, the object which is in the center of the image is selected from the at least one preset object as the target object as follows. An object which is at least partially located approximately in the center of the image is selected from the at least one preset object as the target object. In other words, an object of which a first percentage of pixels are located in a second percentage of the center of the image is selected from the at least one preset object as the target object. For example, an object for which 95% of the segmented pixels are located in the center 70% of the image is selected as the target object. FIG. 5 (b) gives such an example. As illustrated in FIG. 5 (b), only the sofa is selected as the target object.
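The selection rule above can be made concrete with a short sketch. The following Python snippet is a minimal illustration, assuming binary instance masks from the detector and interpreting "the center 70% of the image" as a central window covering 70% of each dimension; the function names (is_center_object, select_target_object) and the exact windowing are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def is_center_object(mask, pixel_ratio=0.95, center_ratio=0.70):
    """Check whether a binary instance mask qualifies as the target object,
    i.e. at least `pixel_ratio` of its pixels fall inside a central window
    covering `center_ratio` of the image in each dimension."""
    h, w = mask.shape
    top, bottom = int(h * (1 - center_ratio) / 2), int(h * (1 + center_ratio) / 2)
    left, right = int(w * (1 - center_ratio) / 2), int(w * (1 + center_ratio) / 2)
    total = mask.sum()
    if total == 0:
        return False
    inside = mask[top:bottom, left:right].sum()
    return inside / total >= pixel_ratio

def select_target_object(masks):
    """Return the index of the first detected mask lying in the image center,
    or -1 if no detected object qualifies."""
    for idx, mask in enumerate(masks):
        if is_center_object(mask):
            return idx
    return -1
```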
As an implementation, the foreground of the image is determined based on the target object and the plane on which the target object is located as follows.
A bounding box is determined according to the target object. The bounding box is divided into an upper bounding box and a lower bounding box. The lower bounding box is expanded with a magnification to include part of the plane on which the target object is located. A part of the image framed by the upper bounding box and the expanded lower bounding box is determined as the foreground of the image. In other words, the part of the plane which is in the expanded lower bounding box and the target object are determined as the foreground of the image.
Specifically, a bounding box is determined according to a mask of the target object. As illustrated in FIG. 5 (c), the lower half of the bounding box is localized and extended by a preset magnification (e.g., 1.3) in the x and y directions. As such, the part of the plane on which the target object is located is mostly included in this extended region. The final augmented segmentation, that is, the foreground of the image, is shown in FIG. 5 (d).
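As a companion sketch, hedged in the same way, the bounding-box augmentation described above might look as follows in Python with NumPy. The choice to expand the lower half around its own center is an assumption, since the patent only states that the lower bounding box is expanded by a magnification such as 1.3 in the x and y directions.

```python
import numpy as np

def augment_foreground_mask(object_mask, magnification=1.3):
    """Build the augmented foreground mask: the object's bounding box plus a
    lower half expanded by `magnification` so that part of the supporting
    plane is included."""
    ys, xs = np.nonzero(object_mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    h, w = object_mask.shape

    # Split the bounding box into an upper and a lower half.
    y_mid = (y0 + y1) // 2

    # Expand the lower half in both x and y around its own center.
    cy, cx = (y_mid + y1) / 2.0, (x0 + x1) / 2.0
    half_h = (y1 - y_mid) / 2.0 * magnification
    half_w = (x1 - x0) / 2.0 * magnification
    ly0, ly1 = max(0, int(cy - half_h)), min(h - 1, int(cy + half_h))
    lx0, lx1 = max(0, int(cx - half_w)), min(w - 1, int(cx + half_w))

    augmented = np.zeros(object_mask.shape, dtype=bool)
    augmented[y0:y_mid + 1, x0:x1 + 1] = True      # upper bounding box
    augmented[ly0:ly1 + 1, lx0:lx1 + 1] = True     # expanded lower bounding box
    return augmented
```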
At block 406, illumination corresponding to the extracted foreground image is estimated.
Specifically, the input of the neural network is the extracted foreground I_A. As illustrated in FIG. 3, a mask of the target object (FIG. 5 (c)) is denoted as M and an augmented mask (FIG. 5 (d)) is denoted as M_A. Then a mask of the plane can be denoted as M_P and M_P = M_A - M. SceneParser (·) represents the neural network which consists of encoder-decoder blocks. The predicted normal map \hat{N}, albedo \hat{A}, roughness \hat{R}, and spherical harmonics (SH) coefficients \hat{L} (differentiated from the ground truth parameters by the hat symbol) are given by:

\hat{N}, \hat{A}, \hat{R}, \hat{L} = SceneParser (I_A, N_P)        (1)

N_P is provided from an output of an off-the-shelf AR framework (e.g., ARCore) on the input image. All N_P discussed here are in screen-space coordinates.
Since the direct outputs from the decoders of normal, albedo, and roughness have a plane attached, the mask of the target object M is applied to get the predicted normal map \hat{N}, the predicted albedo \hat{A}, and the predicted roughness \hat{R}. In FIG. 2, they are (e), (g), and (i), respectively.
FIG. 6 (a) and FIG. 6 (b) are diagrams of an example design of a neural network according to implementations. As an example, the neural network may include a lighting prediction module named SHEst (·) and an initial encoder named InitEncoder (·). SHEst (·) is connected after InitEncoder (·). SHEst (·) contains two fully connected layers to regress 36 spherical harmonics coefficients for each color channel. The predicted spherical harmonics (SH) coefficients \hat{L} can be obtained via:

\hat{L} = SHEst (InitEncoder (I_A))        (2)

In addition, the neural network may further include three sub-autoencoders for normal, albedo, and roughness. The three sub-autoencoders share the same encoder InitEncoder (·) and have their own decoders named NormalDecoder (·), AlbedoDecoder (·), and RoughDecoder (·), respectively. A detailed architecture design can be found in FIG. 6 (a). The predicted normal map \hat{N}, albedo \hat{A}, and roughness \hat{R} can be obtained via:

\hat{N} = M ⊙ NormalDecoder (InitEncoder (I_A), N_P)
\hat{A} = M ⊙ AlbedoDecoder (InitEncoder (I_A))        (3)
\hat{R} = M ⊙ RoughDecoder (InitEncoder (I_A))

where ⊙ represents an inner product of two images. By an inner product of M and an output of a decoder, the plane can be removed.
In order to predict illumination in the real scene, it is necessary to have a large receptive field. Thus, for example, InitEncoder (·) may have 6 convolutional layers with stride 2, so that each pixel of an output can be influenced by the whole image. NormalDecoder (·), AlbedoDecoder (·), and RoughDecoder (·) use transposed convolutions for decoding and add skip links to recover greater details, except that NormalDecoder (·) has an extra input N_P.
The detailed structure of such encoder-decoder pairs is shown in FIG. 6 (a) and the detailed structure of SHEst (·) is shown in FIG. 6 (b).
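For concreteness, a minimal PyTorch sketch of such an encoder-decoder design is given below. Layer widths, kernel sizes, the way N_P is injected into NormalDecoder, and the omission of the skip links shown in FIG. 6 (a) are all simplifying assumptions; the class names mirror the modules named above, but this is not the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InitEncoder(nn.Module):
    """Shared encoder: six stride-2 convolutions give a large receptive field."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(6):
            out = base * min(2 ** i, 8)
            layers += [nn.Conv2d(ch, out, 4, stride=2, padding=1),
                       nn.BatchNorm2d(out), nn.ReLU(inplace=True)]
            ch = out
        self.net, self.out_ch = nn.Sequential(*layers), ch

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Generic decoder using transposed convolutions (skip links omitted in this sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(6):
            nxt = out_ch if i == 5 else max(ch // 2, 32)
            layers.append(nn.ConvTranspose2d(ch, nxt, 4, stride=2, padding=1))
            if i < 5:
                layers += [nn.BatchNorm2d(nxt), nn.ReLU(inplace=True)]
            ch = nxt
        self.net = nn.Sequential(*layers)

    def forward(self, feat):
        return self.net(feat)

class SHEst(nn.Module):
    """Two fully connected layers regressing 36 SH coefficients per RGB channel."""
    def __init__(self, in_ch, hidden=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(in_ch, hidden), nn.ReLU(inplace=True),
                                nn.Linear(hidden, 3 * 36))

    def forward(self, feat):
        return self.fc(self.pool(feat).flatten(1)).view(-1, 3, 36)

class SceneParser(nn.Module):
    """Encoder-decoder network predicting normal, albedo, roughness, and SH lighting."""
    def __init__(self):
        super().__init__()
        self.encoder = InitEncoder(in_ch=3)
        c = self.encoder.out_ch
        self.normal_dec = Decoder(c + 3, 3)   # NormalDecoder takes N_P as an extra input
        self.albedo_dec = Decoder(c, 3)
        self.rough_dec = Decoder(c, 1)
        self.sh_est = SHEst(c)

    def forward(self, foreground, plane_normal, object_mask):
        # Input spatial size should be divisible by 64 (six stride-2 stages).
        feat = self.encoder(foreground)
        # Downsample N_P to the feature resolution and feed it to NormalDecoder only.
        np_small = F.interpolate(plane_normal, size=feat.shape[-2:])
        normal = object_mask * self.normal_dec(torch.cat([feat, np_small], dim=1))
        albedo = object_mask * self.albedo_dec(feat)
        rough = object_mask * self.rough_dec(feat)
        sh = self.sh_est(feat)                # predicted 5th-order SH coefficients
        return normal, albedo, rough, sh
```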
At block 408, a virtual object is rendered with the estimated illumination.
As an implementation, the illumination corresponding to the extracted foreground image can be estimated as follows. The extracted foreground is input to the neural network. Predicted SH coefficients output by the neural network are obtained. The illumination corresponding to the image is determined based on the predicted SH coefficients.
Specifically, advanced AR frameworks such as ARCore or ARKit are usually used to provide robust and accurate plane detection. The virtual object can be rendered on a plane of the real scene with the estimated illumination.
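A hedged end-to-end usage sketch, reusing the SceneParser class from the sketch above (an assumption) with placeholder tensors, shows how the extracted foreground and the plane normal N_P would be turned into predicted SH coefficients at inference time.

```python
import torch

# Placeholder tensors standing in for the extracted foreground I_A, the plane
# normal N_P from the AR framework, and the object mask M (all 256x256 here).
model = SceneParser().eval()
foreground = torch.rand(1, 3, 256, 256)
plane_normal = torch.rand(1, 3, 256, 256)
object_mask = torch.ones(1, 1, 256, 256)

with torch.no_grad():
    normal, albedo, rough, sh = model(foreground, plane_normal, object_mask)

# `sh` holds 36 fifth-order SH coefficients per RGB channel; these parameterize
# the estimated illumination used to light the virtual object on the detected plane.
print(sh.shape)  # torch.Size([1, 3, 36])
```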
As can be seen, in implementations of the present disclosure, an image in which at least one object is located on at least one plane in a real scene is captured in real time, then a foreground of the image is extracted, thereafter illumination corresponding to the extracted foreground image is estimated, and finally a virtual object is rendered in the real scene with the estimated illumination. As such, with a trained neural network and an input of an image that contains one or more objects placed on a planar region captured in the real scene, a real-world lighting condition can be estimated in real-time and the virtual object can be rendered in the real scene with the estimated illumination, thereby improving the rendering quality and the sense of reality of rendering the virtual object in the real scene.
As an implementation, the method further includes the following. The neural network is trained with a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
As an implementation, the method further includes constructing the first dataset. Specifically, images each containing an object and a plane on which the object is located are selected from a second dataset, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images. For each of the selected images: a ground truth normal map of the object and a ground truth normal map of the plane are calculated; ground truth SH coefficients corresponding to the image are extracted by applying a spherical convolution to the ground truth HDR illumination map; a ground truth albedo and a ground truth roughness corresponding to the image are calculated. The first dataset is constructed based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
In order to support the training of the neural network, a large database of indoor images and their corresponding normal maps is needed. However, obtaining a large dataset of ground truth normal maps for training is challenging. On the one hand, using a physically based rendering engine can directly synthesize rendered images and normal maps simultaneously, but neural networks trained with synthesized data usually have bad performance on real data. On the other hand, existing normal map datasets provide normal maps which are reconstructed under expensive photometric stereo settings, and therefore the amount of data is too small for sufficient network training.
In this implementation, the first dataset can be constructed based on an off-the-shelf second dataset, that is, the Matterport3D dataset. Matterport3D contains 194,400 registered HDR RGB-D images arranged in 10,800 panoramas within 90 different building-scale indoor scenes. Specifically, multiple HDR RGB-D images from the Matterport3D dataset are leveraged to generate the first dataset. The panoramas can provide ground truth HDR illumination maps, and a spherical convolution can be applied to extract 5th-order SH coefficients of these maps as the ground truth SH coefficients L_gt.
First, images that contain an object and a plane on which the object is located are selected from the Matterport3D dataset. The plane is defined by: (a) a horizontal surface (n_z > cos(π/8)); (b) a semantic label such as floor or furniture; (c) there are one or more objects above it. Then for each image I, the object is transferred to screen space (e.g., the camera coordinate) and the normal maps for the object and the plane are calculated. Additionally, a spherical convolution is applied to extract 5th-order SH coefficients of the ground truth HDR illumination maps as the ground truth SH coefficients L_gt. In this way, the SH light is rotated to screen space and the SH coefficients are transformed accordingly. To this end, the necessary labels are constructed for an image I: ground truth SH coefficients L_gt, a ground truth normal map of the object N_gt, and a ground truth normal map of the plane N_P. Overall, about 109,042 samples of {I, L_gt, N_gt, N_P} are obtained. FIG. 7 shows an exemplary generation process of ground truth data. As illustrated in FIG. 7, ground truth SH coefficients and normal maps are generated for images each of which contains a support plane in the Matterport3D dataset. In the Matterport3D dataset, the ground truth lighting is a panorama environment map, and a ground truth SH lighting is generated based on 5th-order SH coefficients.
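The extraction of ground truth SH coefficients from an HDR illumination map can be illustrated with the following Python sketch, which numerically projects an equirectangular environment map onto a real spherical harmonic basis up to order 5 (36 coefficients per color channel). The real-SH sign convention and the equirectangular parameterization are assumptions; the patent only states that a spherical convolution is applied.

```python
import numpy as np
from scipy.special import sph_harm

def real_sh_basis(l, m, azimuth, polar):
    """Real spherical harmonic Y_{l,m} on a grid.
    scipy's sph_harm takes (order m, degree l, azimuthal angle, polar angle)."""
    if m == 0:
        return np.real(sph_harm(0, l, azimuth, polar))
    if m > 0:
        return np.sqrt(2.0) * (-1) ** m * np.real(sph_harm(m, l, azimuth, polar))
    return np.sqrt(2.0) * (-1) ** m * np.imag(sph_harm(-m, l, azimuth, polar))

def project_sh(env_map, order=5):
    """Project an equirectangular HDR environment map (H, W, 3) onto real SH
    up to `order`; order 5 gives 36 coefficients per color channel."""
    h, w, _ = env_map.shape
    polar = (np.arange(h) + 0.5) / h * np.pi            # [0, pi]
    azimuth = (np.arange(w) + 0.5) / w * 2.0 * np.pi    # [0, 2*pi]
    azimuth, polar = np.meshgrid(azimuth, polar)
    d_omega = np.sin(polar) * (np.pi / h) * (2.0 * np.pi / w)  # per-pixel solid angle

    coeffs = np.zeros(((order + 1) ** 2, 3))
    idx = 0
    for l in range(order + 1):
        for m in range(-l, l + 1):
            basis = real_sh_basis(l, m, azimuth, polar)
            for c in range(3):
                coeffs[idx, c] = np.sum(env_map[..., c] * basis * d_omega)
            idx += 1
    return coeffs  # shape (36, 3) for order 5
```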
In a training stage, the normal map of the plane N_P is extracted from the ground truth data in the Matterport3D dataset directly. In practical application, a normal map of a plane N_P can be provided from an output of an AR framework (e.g., ARCore) on an input image. All N_P discussed here can be in screen-space coordinates.
As an implementation, the neural network is trained with the first dataset as follows.
A training image is obtained from the first dataset, where the training image is any of the images in the first dataset. A foreground of the training image is extracted. The extracted foreground of the training image is input to the neural network to obtain predicted parameters, where the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness. An image is rendered with the predicted parameters. A loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image. The neural network is trained based on the loss.
In this implementation, the normal map of the plane is added to illumination estimation and neural network training, which is helpful for improving the accuracy of the illumination estimation. The neural network has been trained with and without the normal map of the plane. As illustrated in Table 1, column S-P (trained without N_P) reports recovering errors, which are clearly larger than those in column S (trained with N_P) for the same test image. Thus, the normal map of the plane has a significant impact on lighting. FIG. 8 is a diagram illustrating comparison of illumination and normal map predictions of a network (middle) trained with a normal map of a plane and a network (top) trained without a normal map of a plane according to implementations. The qualitative comparison in FIG. 8 shows that, without the guidance of N_P, texture on the objects may mislead the normal map estimation and further produce incorrect illumination estimation.
Table 1: Quantitative comparison on image

                              S         S-P
  5th-order SH coef (10^-2)   9.833     3.746
  Normal map (10^-2)          6.591     4.184
On multiple images captured by a mobile phone, how N_P affects the lighting or illumination estimation has been demonstrated. FIG. 9 is a diagram illustrating comparison of illumination and normal map predictions of a network (middle) trained with a normal map of a plane and a network (top) trained without a normal map of a plane according to other implementations. As illustrated in FIG. 9, in the top example, the normal map output from S-P is not correct; the marked block indicates that the normal map in that region is very different, which is incorrect. In the bottom example, the normal map output from S-P is too smooth. In both examples, the lighting is too dark in the S-P results.
As an example, the image can be rendered with the predicted parameters as follows.
Since transmission effects and self-luminous objects are not included here, the rendering equation can be simplified to a reflection equation to render with the output of the neural network:

L_0(\vec{x}, \vec{ω}_0) = \int_{Ω} f_r(\vec{x}, \vec{ω}_i, \vec{ω}_0) \, L_i(\vec{x}, \vec{ω}_i) \, (\vec{ω}_i \cdot \hat{N}(\vec{x})) \, d\vec{ω}_i        (4)

L_0 represents a total spectral radiance directed outward along the eye's direction \vec{ω}_0 from a particular position \vec{x} on the object, L_i represents an incident radiance at \vec{x} from direction \vec{ω}_i, and f_r represents a bidirectional reflectance distribution function (BRDF) at \vec{x}. This integral is over a normal-oriented hemisphere Ω towards the normal map \hat{N}(\vec{x}) at \vec{x}.
Lighting.
The lighting is parameterized as 5th-order spherical harmonic lighting. Since it is a global lighting, the radiance is only dependent on the direction. More specifically, L_i can be represented as:

L_i(θ_i, φ_i) = \sum_{l=0}^{5} \sum_{m=-l}^{l} \hat{L}_{lm} Y_{lm}(θ_i, φ_i)        (5)

(θ_i, φ_i) represent the altitude and azimuth respectively in camera coordinates, \hat{L}_{lm} represents the predicted SH coefficient, and Y_{lm} represents the spherical harmonic basis.
BRDF.
A microfacet BRDF model f_r(θ′_i, φ′_i; θ′_0, φ′_0) is adopted, and the BRDF model is defined as:

f_r = \frac{D \, F \, G}{4 \cos θ′_i \cos θ′_0}        (6)

D, F, and G represent the normal distribution, Fresnel, and geometric term respectively. (θ′_i, φ′_i) and (θ′_0, φ′_0) represent the altitude and azimuth of the incident light and eye direction in local coordinates, respectively. As can be seen from the above equation of the BRDF model, the BRDF is radially symmetric, or it is solely dependent on θ′_i when θ′_0 is fixed. This property can simplify the integral in eqn. (4).
Integral.
In order to do the integral in local coordinates, L_i(θ_i, φ_i) is converted to L_i(θ′_i, φ′_i) by:

L_i(θ′_i, φ′_i) = \sum_{l=0}^{5} \sum_{m=-l}^{l} \sum_{m′=-l}^{l} \hat{L}_{lm} D^{l}_{mm′} Y_{lm′}(θ′_i, φ′_i)        (7)

D^{l}_{mm′} represents the (2l+1)-dimensional representation of the rotation group SO (3). Plugging eqn. (6) and (7) in (4), the following equation can be obtained:

L_0 = \sum_{l=0}^{5} \sum_{m=-l}^{l} \sum_{m′=-l}^{l} \hat{L}_{lm} D^{l}_{mm′} \int_{Ω} f_r \, Y_{lm′}(θ′_i, φ′_i) \cos θ′_i \, dω′_i        (8)

According to the radially symmetric property and the fact:

\int_{0}^{2π} Y_{lm′}(θ′_i, φ′_i) \, dφ′_i = 2π N_{lm′} P_{lm′}(\cos θ′_i) \, δ_{0m′}        (9)

where N_{lm′} is a normalization factor, P_{lm′} is the associated Legendre function, and δ_{0m′}=1 if m′=0, otherwise δ_{0m′}=0, eqn. (8) can be simplified as:

L_0 = \sum_{l=0}^{5} \sum_{m=-l}^{l} \hat{L}_{lm} D^{l}_{m0} \, Λ_l \, Θ_l        (10)

Λ_l is a constant number which can be analytically calculated. Θ_l is a more complicated integral and it seems not possible to solve it in a closed form. Hence, f_r is expanded in a Taylor series in terms of the predicted roughness \hat{R} and cos θ′_0. Since \hat{R} and cos θ′_0 are smaller than 1, the high order terms can be neglected. A polynomial with a degree of 5 is found to be sufficient for this problem. Hence, Θ_l can be approximated via a 5-degree polynomial in terms of \hat{R} and cos θ′_0. Since the camera's field of view (fov) is available, \vec{ω}_0 and cos θ′_0 can be determined by the pixel's position in the image, hence L_0 can be written as L_0(\vec{p}), where \vec{p} is the pixel's position. The rendered image is denoted as \hat{I}. To this end, the rendering process uses only the screen-space attributes \hat{N}, \hat{A}, \hat{R}, and \hat{L}, and can be formulated as a low-cost linear combination of polynomials of these attributes, and it is differentiable:

\hat{I} = ScreenSpaceRenderer (\hat{N}, \hat{A}, \hat{R}, \hat{L})        (11)
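The full screen-space renderer above uses a microfacet BRDF and a degree-5 polynomial approximation of Θ_l. As a much simpler, hedged illustration of how SH lighting coefficients drive per-pixel shading, the following Python sketch evaluates diffuse irradiance from the first nine SH coefficients (the standard order-2 clamped-cosine approximation); it is not the renderer described above.

```python
import numpy as np

# Band attenuation factors of the clamped-cosine kernel (Ramamoorthi & Hanrahan).
A0, A1, A2 = np.pi, 2.0 * np.pi / 3.0, np.pi / 4.0

def sh9(n):
    """First nine real SH basis functions evaluated at unit normals n of shape (..., 3)."""
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ], axis=-1)

def diffuse_shade(normal_map, albedo_map, sh_coeffs):
    """Diffuse screen-space shading from SH lighting.
    normal_map: (H, W, 3) unit normals; albedo_map: (H, W, 3);
    sh_coeffs: (9, 3) lighting coefficients per RGB channel."""
    basis = sh9(normal_map)                                    # (H, W, 9)
    band = np.array([A0] + [A1] * 3 + [A2] * 5)                # per-band attenuation
    irradiance = np.einsum('hwk,k,kc->hwc', basis, band, sh_coeffs)
    return albedo_map / np.pi * np.clip(irradiance, 0.0, None)
```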
In this implementation, without losing too much accuracy, a few parameters can be used for rendering and predicting, which will lower prediction and rendering cost.
As an implementation, the loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and the ground truth parameters corresponding to the training image as follows.
A rendering loss is calculated based on the rendered image and the training image. An SH loss is calculated based on the predicted SH coefficients and ground truth SH coefficients.
An object-normal-map loss is calculated based on the predicted normal map of the object and a ground truth normal map of the object. A planar-normal-map loss is calculated based on the predicted normal map of the plane and a ground truth normal map of the plane. An albedo loss is calculated based on the predicted albedo and a ground truth albedo. A weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss is calculated as the loss.
In this implementation, without a huge amount of data for modeling an intrinsic mapping from image to illumination, labeled training data has been used to supervise intermediate components to make the estimated illumination more determinable.
The ground truth SH coefficients and the ground truth normal map of the object can be extracted from the first dataset, such that the predicted SH coefficients and the predicted normal map of the object can be supervised. In addition, the ground truth normal map of the plane can also be provided by ARCore, therefore the predicted normal map of the plane can be supervised.
Rendering Loss L_r.
L_r represents a pixel-wise ℓ_1 difference between the input image I and the rendered image \hat{I} of the foreground:

L_r = || I − \hat{I} ||_1

Using the ℓ_1 norm as a loss function helps with robustness to outliers, such as self-shadowing or extreme mirror reflections in I that are ignored in \hat{I}.
SH Loss L_S.
L_S represents a mean square error (MSE) loss on the SH coefficients, and it is defined as:

L_S = \sum_{c} || L_{gt}^{c} − \hat{L}^{c} ||_2^2

L_{gt}^{c} represents the ground truth SH coefficients for a c-th color channel (in RGB). \hat{L}^{c} represents the predicted SH coefficients for a c-th color channel (in RGB).
Object-normal-map loss L_N.
L_N measures a pixel-wise ℓ_2 difference between the ground truth normal map of the object N_gt and the predicted normal map of the object \hat{N} on the extracted foreground:

L_N = || N_gt − \hat{N} ||_2^2
Planar-normal-map loss L_{N_P}.
It is not feasible for L_{N_P} to enforce that all pixels in the extended region (M_A − M) have the same normal map as N_P, because besides the target object and the plane on which the target object is located, the extracted foreground may still contain parts of other non-plane objects. One observation is that most of the pixels in (M_A − M) will be located on the plane. Thus, it is reasonable to require most, but not all, of the pixels in (M_A − M) to be close to N_P. The definition of L_{N_P} is:

L_{N_P} = \sum_{i ∈ \{ || \hat{N}_P(i) − N_P(i) ||_2 \}_{η%}} || \hat{N}_P(i) − N_P(i) ||_2

{·}_{η%} denotes a subset of {·}, which contains the top η% smallest elements in {·}. In this implementation, the top η% of pixels that are closest to N_P are selected and the ℓ_2 distance to N_P is minimized. For example, η can be set to 80 empirically.
Albedo loss L_a.
L_a is based on a similarity of chromaticity and intensity between pixels. This term is inspired by the multi-scale shading smoothness property. It is defined as a weighted ℓ_2 term over neighboring pixels, where the weights are negative gradient magnitudes:

L_a = \sum_{i} \sum_{j ∈ nb(i)} w_{ij} || \hat{A}_i − \hat{A}_j ||_2^2, with the weight w_{ij} determined by the negative gradient magnitude of I between pixels i and j

nb(i) denotes the 8-connected neighborhood around pixel i, and ∇I denotes the gradient of the image I.
To this end, by combining the above five terms, a final loss function can be defined as:

L = λ_r L_r + λ_s L_S + λ_n L_N + λ_{N_P} L_{N_P} + λ_a L_a

To reduce an over-fitting effect during training, an extra regularizer term with a weight λ_reg can be added to the above final loss function to further constrain the optimization using statistical regularization on the estimated SH coefficients. As an example, the values of the weighting coefficients in the above final loss function may be λ_r = 1.92, λ_a = 0.074, λ_s = 2.14, λ_n = 1.01, and λ_reg = 2.9×10^-5, together with a corresponding weight λ_{N_P} for the planar-normal-map loss. With these losses during training, accurate SH illumination, normal maps, and albedo can be produced with the aid of the neural network.
FIG. 10 compares the normal map estimation using two existing estimating methods and the method in the present disclosure. As illustrated in FIG. 10, in both cases, the method in the present disclosure is significantly better. The two existing estimating methods tend to over-smooth the normal map due to handcrafted regularization. In contrast, the method in the present disclosure recovers high-frequency details for surface normal maps, even in specular and shadowed regions.
FIG. 11 is a diagram illustrating an example of rendering a virtual object in a real scene. As illustrated in FIG. 11, the left shows device-taken pictures and the right shows composed pictures obtained by rendering virtual objects with the estimated illumination. The top row is a lighter scene while the bottom row is a darker one. By means of implementations of the present disclosure, the lighting condition can be estimated accurately.
FIG. 12 is a schematic flowchart of a method for training a neural network according to implementations. As illustrated in FIG. 12, the method for training a neural network includes the following.
At block 1202, a training image is obtained from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset.
Each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
At block 1204, a foreground of the training image is extracted.
At block 1206, the extracted foreground of the training image is input to a neural network to obtain predicted parameters.
The predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness.
At block 1208, an image is rendered with the predicted parameters.
At block 1210, a loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image.
At block 1212, the neural network is trained based on the loss.
As an implementation, the loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and the ground truth parameters corresponding to the training image as follows. A rendering loss is calculated based on the rendered image and the training image. An SH loss is calculated based on the predicted SH coefficients and ground truth SH coefficients. An object-normal-map loss is calculated based on the predicted normal map of the object and a ground truth normal map of the object. A planar-normal-map loss is calculated based on the predicted normal map of the plane and a ground truth normal map of the plane. An albedo loss is calculated based on the predicted albedo and a ground truth albedo. A weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss is calculated as the loss.
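A minimal PyTorch-style training step corresponding to blocks 1202 to 1212 and the weighted-sum loss is sketched below. The `model` and `renderer` callables, the dictionary-based tensor layout, and the use of plain mean-squared-error surrogates for the individual terms are assumptions made for illustration; the actual loss terms are defined as described above.

```python
import torch

def train_step(model, renderer, optimizer, foreground, gt, weights):
    """One training iteration (sketch): forward pass, re-render, weighted loss, update."""
    pred = model(foreground)            # dict: obj_normal, plane_normal, sh, albedo, roughness
    rendered = renderer(pred)           # re-render an image from the predicted parameters

    loss = (weights["render"] * torch.mean((rendered - foreground) ** 2)
            + weights["sh"] * torch.mean((pred["sh"] - gt["sh"]) ** 2)
            + weights["obj_normal"] * torch.mean((pred["obj_normal"] - gt["obj_normal"]) ** 2)
            + weights["plane_normal"] * torch.mean((pred["plane_normal"] - gt["plane_normal"]) ** 2)
            + weights["albedo"] * torch.mean((pred["albedo"] - gt["albedo"]) ** 2))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The `weights` argument can reuse the placeholder table from the earlier sketch.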
As an implementation, the method further includes the following.
Images each containing an object and a plane on which the object is located are selected from a second dataset, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images. For each of the selected images: a ground truth normal map of the object and a ground truth normal map of the plane are calculated; ground truth SH coefficients corresponding to the image are extracted by applying a spherical convolution to the ground truth HDR illumination map; and ground truth albedo and ground truth roughness corresponding to the image are calculated. The first dataset is constructed based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
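For the SH-coefficient step of the dataset construction, the sketch below projects an equirectangular HDR illumination map onto second-order real spherical harmonics (nine coefficients per color channel). The equirectangular layout and the particular real-SH convention are assumptions for illustration; the text above refers to the step only as a spherical convolution applied to the ground truth HDR illumination map.

```python
import numpy as np

def sh_basis(dirs):
    """Real SH basis up to order 2 (9 functions) for unit directions of shape (N, 3)."""
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=1)

def hdr_to_sh(env_map):
    """Project an equirectangular HDR illumination map (H, W, 3) onto 9 SH
    coefficients per colour channel; returns a (9, 3) array."""
    h, w, _ = env_map.shape
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    theta = v / h * np.pi                      # polar angle per pixel row
    phi = u / w * 2.0 * np.pi                  # azimuth per pixel column
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1).reshape(-1, 3)
    d_omega = np.sin(theta).reshape(-1) * (np.pi / h) * (2.0 * np.pi / w)  # per-pixel solid angle
    basis = sh_basis(dirs)                                                 # (H*W, 9)
    radiance = env_map.reshape(-1, 3)
    return basis.T @ (radiance * d_omega[:, None])
```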
For details of the above operations, reference can be made to the descriptions of the network training operations in the method for rendering a virtual object based on illumination estimation, which will not be repeated herein.
The foregoing solution of the implementations of the present disclosure is mainly described from the viewpoint of execution process of the method. It can be understood that, in order to implement the above functions, the electronic device includes hardware structures  and/or software modules corresponding to the respective functions. Those skilled in the art should readily recognize that, in combination with the example units and scheme steps described in the implementations disclosed herein, the present disclosure can be implemented in hardware or a combination of the hardware and computer software. Whether a function is implemented by way of the hardware or hardware driven by the computer software depends on the particular application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered as beyond the scope of the present disclosure.
According to the implementations of the present disclosure, functional units may be divided for the electronic device in accordance with the foregoing method examples. For example, each functional unit may be divided according to each function, or two or more functions may be integrated in one processing unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional units. It should be noted that the division of units in the implementations is schematic and is merely a logical function division, and there may be other division manners in actual implementation.
FIG. 13 is a schematic structural diagram of an apparatus for rendering a virtual object based on illumination estimation according to implementations. As illustrated in FIG. 13, the apparatus for rendering a virtual object based on illumination estimation includes a capturing unit 131, an extracting unit 132, an estimating unit 133, and a rendering unit 134.
The capturing unit 131 is configured to capture an image in which at least one object is located on at least one plane.
The extracting unit 132 is configured to extract a foreground of the image.
The estimating unit 133 is configured to estimate illumination corresponding to the extracted foreground image.
The rendering unit 134 is configured to render a virtual object with the estimated illumination.
As can be seen, in implementations of the present disclosure, an image in which at least one object is located on at least one plane in a real scene is captured in real time, then a foreground of the image is extracted, thereafter illumination corresponding to the extracted foreground image is estimated, and finally a virtual object is rendered in the real scene with the estimated illumination. As such, with a trained neural network and an input image that contains one or more objects placed on a planar region captured in the real scene, a real-world lighting condition can be estimated in real time and the virtual object can be rendered in the real scene with the estimated illumination, thereby improving the rendering quality and the sense of reality of rendering the virtual object in the real scene.
As an implementation, the estimating unit 133 is configured to: input the extracted foreground to a neural network; obtain predicted spherical harmonics (SH) coefficients output by the neural network; and determine the illumination corresponding to the image based on the predicted SH coefficients.
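For illustration, one way the estimating unit could turn nine predicted SH coefficients per color channel into usable diffuse shading is sketched below. The real-SH basis constants and the clamped-cosine band factors follow a common convention; using them here is an assumption about the rendering convention rather than a statement of the exact method of this disclosure.

```python
import numpy as np

# Band attenuation factors for convolving SH radiance with the clamped-cosine kernel
# (assumed convention for turning predicted SH coefficients into diffuse irradiance).
A0, A1, A2 = np.pi, 2.0 * np.pi / 3.0, np.pi / 4.0

def irradiance_from_sh(sh_coeffs, normal):
    """Diffuse irradiance for a surface normal from 9 SH coefficients per channel.

    sh_coeffs: (9, 3) array (one column per RGB channel), ordered by band l = 0, 1, 2.
    normal   : (3,) surface normal (need not be normalized).
    """
    x, y, z = normal / np.linalg.norm(normal)
    basis = np.array([
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ])
    attenuation = np.array([A0, A1, A1, A1, A2, A2, A2, A2, A2])  # per-coefficient band factor
    return np.maximum((basis * attenuation) @ sh_coeffs, 0.0)     # (3,) RGB irradiance
```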
As an implementation, the extracting unit 132 is configured to: detect existence of at least one preset object in the image; select, from the at least one preset object, an object which is in the center of the image as a target object; and determine and extract the foreground of the image based on the target object and a plane on which the target object is located.
As an implementation, in terms of selecting, from the at least one preset object, the object which is in the center of the image as the target object, the extracting unit 132 is configured to: select, from the at least one preset object, an object which is at least partially located approximately in a center of the image, as the target object.
As an implementation, in terms of determining the foreground of the image based on the target object and the plane on which the target object is located, the extracting unit 132 is configured to: determine a bounding box according to the target object; divide the bounding box into an upper bounding box and a lower bounding box; expand the lower bounding box with a magnification to include part of the plane on which the target object is located; and determine a part of the image framed by the upper bounding box and the expanded lower bounding box as the foreground of the image.
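A minimal sketch of this bounding-box handling is given below. The 1.5x magnification default, the mid-height split, and the clamping to the image borders are illustrative assumptions; the description only requires that the lower box be expanded far enough to cover part of the supporting plane.

```python
def expand_foreground_box(box, magnification=1.5, image_h=None, image_w=None):
    """Split the target object's box into upper and lower halves, then enlarge
    the lower half so it also covers part of the supporting plane.

    box: (x_min, y_min, x_max, y_max) in pixel coordinates, y growing downward.
    Returns (upper_box, expanded_lower_box).
    """
    x_min, y_min, x_max, y_max = box
    y_mid = (y_min + y_max) / 2.0                      # split into upper / lower boxes
    upper = (x_min, y_min, x_max, y_mid)

    w = x_max - x_min
    lower_h = y_max - y_mid
    cx = (x_min + x_max) / 2.0
    new_w, new_h = w * magnification, lower_h * magnification
    lower = [cx - new_w / 2.0, y_mid, cx + new_w / 2.0, y_mid + new_h]

    if image_w is not None:                            # keep the expanded box inside the image
        lower[0], lower[2] = max(0.0, lower[0]), min(float(image_w), lower[2])
    if image_h is not None:
        lower[3] = min(float(image_h), lower[3])
    return upper, tuple(lower)
```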
As an implementation, the apparatus further includes a training unit 135.
The training unit 135 is configured to train the neural network with a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
As an implementation, the apparatus further includes a constructing unit 136.
The constructing unit 136 is configured to: select, from a second dataset, images each containing an object and a plane on which the object is located, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images; for each of the selected images, calculate a ground truth normal map of the object and a ground truth normal map of the plane, extract ground truth SH coefficients corresponding to the image by applying a spherical convolution to the ground truth HDR illumination map, and calculate ground truth albedo and ground truth roughness corresponding to the image; and construct the first dataset based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
As an implementation, the training unit 135 is configured to: obtain a training image from the first dataset, where the training image is any of the images in the first dataset; extract a foreground of the training image; input the extracted foreground of the training image to the neural network to obtain predicted parameters, where the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, predicted albedo, and predicted roughness; render an image with the predicted parameters; calculate a loss of the neural network based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image; and train the neural network based on the loss.
As an implementation, in terms of calculating the loss of the neural network based on the predicted parameters, the rendered image, the training image, and the ground truth parameters corresponding to the training image, the training unit 135 is configured to: calculate a rendering loss based on the rendered image and the training image; calculate an SH loss based on the predicted SH coefficients and ground truth SH coefficients; calculate an object-normal-map loss based on the predicted normal map of the object and a ground truth normal map of the object; calculate a planar-normal-map loss based on the predicted normal map of the plane and a ground truth normal map of the plane; calculate an albedo loss based on the predicted albedo and a ground truth albedo; and calculate a weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss as the loss.
FIG. 14 is a schematic structural diagram of an apparatus for training a neural network according to implementations. As illustrated in FIG. 14, the apparatus for training a neural network includes an obtaining unit 141, an extracting unit 142, an inputting unit 143, a rendering unit 144, a calculating unit 145, and a training unit 146.
The obtaining unit 141 is configured to obtain a training image from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset.
Each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
The extracting unit 142 is configured to extract a foreground of the training image.
The inputting unit 143 is configured to input the extracted foreground of the training image to a neural network to obtain predicted parameters.
The predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness.
The rendering unit 144 is configured to render an image with the predicted parameters.
The calculating unit 145 is configured to calculate a loss of the neural network based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image.
The training unit 146 is configured to train the neural network based on the loss.
As an implementation, the calculating unit 145 is configured to: calculate a rendering loss based on the rendered image and the training image; calculate an SH loss based on the predicted SH coefficients and ground truth SH coefficients; calculate an object-normal-map loss based on the predicted normal map of the object and a ground truth normal map of the object; calculate a planar-normal-map loss based on the predicted normal map of the plane and a ground truth normal map of the plane; calculate an albedo loss based on the predicted albedo and a ground truth albedo; and calculate a weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss as the loss.
As an implementation, the apparatus further includes a constructing unit 147. The constructing unit 147 is configured to: select, from a second dataset, images each containing an object and a plane on which the object is located, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images; for each of the selected images, calculate a ground truth normal map of the object and a ground truth normal map of the plane, extract ground truth SH coefficients corresponding to the image by applying a spherical convolution to the ground truth HDR illumination map, and calculate ground truth albedo and ground truth roughness corresponding to the image; and construct the first dataset based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
FIG. 15 is a schematic structural diagram of a terminal device according to implementations. As illustrated in FIG. 15, the terminal device 150 includes a processor 151, a memory 152, a communication interface 153, and one or more programs 154 stored in the memory 152 and executed by the processor 151. The one or more programs 154 include instructions for performing the following operations.
An image in which at least one object is located on at least one plane is captured. A foreground of the image is extracted. Illumination corresponding to the extracted foreground image is estimated. A virtual object is rendered with the estimated illumination.
As an implementation, in terms of estimating the illumination corresponding to the extracted foreground image, the one or more programs 154 include instructions for performing the following operations. The extracted foreground is input to a neural network. Predicted spherical harmonics (SH) coefficients output by the neural network are obtained. The illumination corresponding to the image is determined based on the predicted SH coefficients.
As an implementation, in terms of extracting the foreground of the image, the one or more programs 154 include instructions for performing the following operations. Existence of at least one preset object in the image is detected. An object which is in the center of the image is selected from the at least one preset object as a target object. The foreground of the image is determined and extracted based on the target object and a plane on which the target object is located.
As an implementation, in terms of selecting, from the at least one preset object, the object which is in the center of the image as the target object, the one or more programs 154 include instructions for performing the following operations. The object which is at least partially located approximately in a center of the image is selected from the at least one preset object as the target object.
As an implementation, in terms of determining the foreground of the image based on the target object and the plane on which the target object is located, the one or more programs 154 include instructions for performing the following operations. A bounding box is determined according to the target object. The bounding box is divided into an upper bounding box and a lower bounding box. The lower bounding box is expanded with a magnification to include part of the plane on which the target object is located. A part of the image framed by the upper bounding box and the expanded lower bounding box is determined as the foreground of the image.
As an implementation, the one or more programs 154 further include instructions for performing the following operations. The neural network is trained with a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
As an implementation, the one or more programs 154 further include instructions for performing the following operations. Images each containing an object and a plane on which the object is located are selected from a second dataset, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images. For each of the selected images, a ground truth normal map of the object and a ground truth normal map of the plane are calculated, ground truth SH coefficients corresponding to the image are extracted by applying a spherical convolution to the ground truth HDR illumination map, and ground truth albedo and ground truth roughness corresponding to the image are calculated. The first dataset is constructed based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
As an implementation, in terms of training the neural network with the first dataset, the one or more programs 154 include instructions for performing the following operations. A training image is obtained from the first dataset, where the training image is any of the images in the first dataset. A foreground of the training image is extracted. The extracted foreground of the training image is input to the neural network to obtain predicted parameters, where the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, predicted albedo, and predicted roughness. An image is rendered with the predicted parameters. A loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image. The neural network is trained based on the loss.
As an implementation, in terms of calculating the loss of the neural network based on the predicted parameters, the rendered image, the training image, and the ground truth parameters corresponding to the training image, the one or more programs 154 include instructions for performing the following operations. A rendering loss is calculated based on the rendered image and the training image. An SH loss is calculated based on the predicted SH coefficients and ground truth SH coefficients. An object-normal-map loss is calculated based on the predicted normal map of the object and a ground truth normal map of the object. A planar-normal-map loss is calculated based on the predicted normal map of the plane and a ground truth normal map of the plane. An albedo loss is calculated based on the predicted albedo and a ground truth albedo. A weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss is calculated as the loss.
FIG. 16 is a schematic structural diagram of a terminal device according to implementations. As illustrated in FIG. 16, the terminal device 160 includes a processor 161, a memory 162, a communication interface 163, and one or more programs 164 stored in the memory 162 and executed by the processor 161. The one or more programs 164 include instructions for performing the following operations.
A training image is obtained from a first dataset, where the first dataset includes images and ground truth parameters corresponding to each of the images, and the training image is any of the images in the first dataset. A foreground of the training image is extracted. The extracted foreground of the training image is input to a neural network to obtain predicted parameters. An image is rendered with the predicted parameters. A loss of the neural network is calculated based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image. The neural network is trained based on the loss.
As an implementation, each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images include a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
As an implementation, the predicted parameters include a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness.
As an implementation, in terms of calculating the loss of the neural network based on the predicted parameters, the rendered image, the training image, and the ground truth parameters corresponding to the training image, the one or more programs 164 include instructions for performing the following operations. A rendering loss is calculated based on the rendered image and the training image. An SH loss is calculated based on the predicted SH coefficients and ground truth SH coefficients. An object-normal-map loss is calculated based on the predicted normal map of the object and a ground truth normal map of the object. A planar-normal-map loss is calculated based on the predicted normal map of the plane and a ground truth normal map of the plane. An albedo loss is calculated based on the predicted albedo and a ground truth albedo. A weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss is calculated as the loss.
As an implementation, the one or more programs 164 further include instructions for performing the following operations. Images each containing an object and a plane on which the object is located are selected from a second dataset, where the second dataset includes multiple panoramic images and a ground truth HDR illumination map corresponding to each of the multiple panoramic images. For each of the selected images, a ground truth normal map of the object and a ground truth normal map of the plane are calculated, ground truth SH coefficients corresponding to the image are extracted by applying a spherical convolution to the ground truth HDR illumination map, and ground truth albedo and ground truth roughness corresponding to the image are calculated. The first dataset is constructed based on the selected images and ground truth parameters corresponding to the selected images, where the ground truth parameters corresponding to each of the selected images include the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
A non-transitory computer storage medium is also provided. The non-transitory computer storage medium is configured to store programs which, when executed, are operable to execute some or all operations of the method for rendering a virtual object based on illumination estimation or some or all operations of the method for training a neural network as described in the above method implementations.
A computer program product is also provided. The computer program product includes a non-transitory computer-readable storage medium that stores computer programs. The computer programs are operable with a computer to execute some or all operations of the method for rendering a virtual object based on illumination estimation or some or all operations of the method for training a neural network as described in the above method implementations.
It is to be noted that, for the sake of simplicity, the foregoing method implementations are described as a series of action combinations. However, it will be appreciated by those skilled in the art that the present disclosure is not limited by the sequence of actions described. According to the present disclosure, certain steps or operations may be performed in other order or simultaneously. Besides, it will be appreciated by those skilled in the art that the implementations described in the specification are exemplary implementations and the actions and modules involved are not necessarily essential to the present disclosure.
In the foregoing implementations, the description of each implementation has its own emphasis. For the parts not described in detail in an implementation, reference may be made to related descriptions in other implementations.
In the implementations of the present disclosure, it is to be noted that, the apparatus disclosed in implementations provided herein may be implemented in other manners. For example, the device/apparatus implementations described above are merely illustrative; for instance, the division of the unit is only a logical function division and there can be other manners of division during actual implementations; for example, multiple units or components may be combined or may be integrated into another system, or some features may be ignored, omitted, or not performed. In addition, coupling or communication connection between each illustrated or discussed component may be direct coupling or communication connection, may be indirect coupling or communication among devices or units via some interfaces, and may be electrical connection, mechanical connection, or other forms of connection.
The units described as separate components may or may not be physically separated, and the components illustrated as units may or may not be physical units, that is, they may be in the same place or may be distributed to multiple network elements. All or part of the units may be selected according to actual needs to achieve the purpose of the technical solutions of the implementations.
In addition, the functional units in various implementations of the present disclosure may be integrated into one processing unit, or each unit may be physically present, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or a software function unit.
The integrated unit may be stored in a computer-readable memory when it is implemented in the form of a software functional unit and is sold or used as a separate product. Based on such understanding, the technical solutions of the present disclosure essentially, or the part of the technical solutions that contributes to the related art, or all or part of the technical solutions, may be embodied in the form of a software product which is stored in a memory and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, and so on) to perform all or part of the steps described in the various implementations of the present disclosure. The memory includes various media capable of storing program codes, such as a USB (universal serial bus) flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, a compact disc (CD), or the like.
It will be noted by those of ordinary skill in the art that all or a part of the various  methods of the implementations described above may be accomplished by means of a program to instruct associated hardware, where the program may be stored in a computer-readable memory, which may include a flash memory, a read-only memory (ROM) , a random-access memory (RAM) , a disk or a compact disc (CD) , and so on.
The implementations of the present disclosure are described in detail above, specific examples are used herein to describe the principle and implementation manners of the present disclosure. The description of the above implementations is merely used to help understand the method and the core idea of the present disclosure. Meanwhile, those skilled in the art may make modifications to the specific implementation manners and the application scope according to the idea of the present disclosure. In summary, the contents of the specification should not be construed as limiting the present disclosure.

Claims (20)

  1. A method for rendering a virtual object based on illumination estimation, comprising:
    capturing an image in which at least one object is located on at least one plane;
    extracting a foreground of the image;
    estimating illumination corresponding to the extracted foreground image; and
    rendering a virtual object with the estimated illumination.
  2. The method of claim 1, wherein estimating the illumination corresponding to the extracted foreground image comprises:
    inputting the extracted foreground to a neural network;
    obtaining predicted spherical harmonics (SH) coefficients output by the neural network; and
    determining the illumination corresponding to the image based on the predicted SH coefficients.
  3. The method of claim 1 or 2, wherein extracting the foreground of the image comprises:
    detecting existence of at least one preset object in the image;
    selecting, from the at least one preset object, an object which is in the center of the image as a target object; and
    determining and extracting the foreground of the image based on the target object and a plane on which the target object is located.
  4. The method of claim 3, wherein selecting, from the at least one preset object, the object which is in the center of the image as the target object comprises:
    selecting, from the at least one preset object, an object which is at least partially located approximately in the center of the image, as the target object.
  5. The method of claim 3 or 4, wherein determining the foreground of the image based on the target object and the plane on which the target object is located comprises:
    determining a bounding box according to the target object;
    dividing the bounding box into an upper bounding box and a lower bounding box;
    expanding the lower bounding box to include part of the plane on which the target object is located; and
    determining a part of the image framed by the upper bounding box and the expanded lower bounding box as the foreground of the image.
  6. The method of any of claims 1 to 5, further comprising:
    training the neural network with a first dataset, wherein the first dataset comprises images and ground truth parameters corresponding to each of the images, each of the images contains an object and a plane on which the object is located, and the ground truth parameters corresponding to each of the images comprise a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness.
  7. The method of claim 6, further comprising:
    selecting, from a second dataset, images each containing an object and a plane on which the object is located, wherein the second dataset comprises a plurality of panoramic images and a ground truth HDR illumination map corresponding to each of the plurality of panoramic images;
    for each of the selected images:
    calculating a ground truth normal map of the object and a ground truth normal map of the plane;
    extracting ground truth SH coefficients corresponding to the image by applying a spherical convolution to the ground truth HDR illumination map;
    calculating ground truth albedo and ground truth roughness corresponding to the image; and
    constructing the first dataset based on the selected images and ground truth parameters corresponding to the selected images, wherein the ground truth parameters corresponding to each of the selected images comprise the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
  8. The method of claim 7, wherein training the neural network with the first dataset comprises:
    obtaining a training image from the first dataset, wherein the training image is any of the images in the first dataset;
    extracting a foreground of the training image;
    inputting the extracted foreground of the training image to the neural network to obtain predicted parameters, wherein the predicted parameters comprise a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, predicted albedo, and predicted roughness;
    rendering an image with the predicted parameters;
    calculating a loss of the neural network based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image; and
    training the neural network based on the loss.
  9. The method of claim 8, wherein calculating the loss of the neural network based on the predicted parameters, the rendered image, the training image, and the ground truth parameters corresponding to the training image comprises:
    calculating a rendering loss based on the rendered image and the training image;
    calculating an SH loss based on the predicted SH coefficients and ground truth SH coefficients;
    calculating an object-normal-map loss based on the predicted normal map of the object and a ground truth normal map of the object;
    calculating a planar-normal-map loss based on the predicted normal map of the plane and a ground truth normal map of the plane;
    calculating an albedo loss based on the predicted albedo and a ground truth albedo; and
    calculating a weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss as the loss.
  10. A method for training a neural network, comprising:
    obtaining a training image from a first dataset, wherein the first dataset comprises images and ground truth parameters corresponding to each of the images, and the training image is any of the images comprised in the first dataset;
    extracting a foreground of the training image;
    inputting the extracted foreground of the training image to a neural network to obtain predicted parameters;
    rendering an image with the predicted parameters;
    calculating a loss of the neural network based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image; and
    training the neural network based on the loss.
  11. The method of claim 10, wherein each of the images contains an object and a plane on which the object is located, the ground truth parameters corresponding to each of the images comprise a ground truth normal map of the object, a ground truth normal map of the plane, ground truth SH coefficients, a ground truth albedo, and a ground truth roughness; and the predicted parameters comprise a predicted normal map of an object in the training image, a predicted normal map of a plane in the training image, predicted SH coefficients, a predicted albedo, and a predicted roughness.
  12. The method of claim 11, wherein calculating the loss of the neural network based on the predicted parameters, the rendered image, the training image, and the ground truth parameters corresponding to the training image comprises:
    calculating a rendering loss based on the rendered image and the training image;
    calculating an SH loss based on the predicted SH coefficients and ground truth SH coefficients;
    calculating an object-normal-map loss based on the predicted normal map of the object and a ground truth normal map of the object;
    calculating a planar-normal-map loss based on the predicted normal map of the plane and a ground truth normal map of the plane;
    calculating an albedo loss based on the predicted albedo and a ground truth albedo; and
    calculating a weighted sum of the rendering loss, the SH loss, the object-normal-map loss, the planar-normal-map loss, and the albedo loss as the loss.
  13. The method of claim 11 or 12, further comprising:
    selecting, from a second dataset, images each containing an object and a plane on which the object is located, wherein the second dataset comprises a plurality of panoramic images and a ground truth HDR illumination map corresponding to each of the plurality of panoramic images;
    for each of the selected images:
    calculating a ground truth normal map of the object and a ground truth normal map of the plane;
    extracting ground truth SH coefficients corresponding to the image by applying a spherical convolution to the ground truth HDR illumination map;
    calculating ground truth albedo and ground truth roughness corresponding to the image; and
    constructing the first dataset based on the selected images and ground truth parameters corresponding to the selected images, wherein the ground truth parameters corresponding to each of the selected images comprise the ground truth normal map of the object, the ground truth normal map of the plane, the ground truth SH coefficients, the ground truth albedo, and the ground truth roughness.
  14. An apparatus for rendering a virtual object based on illumination estimation, comprising:
    a capturing unit configured to capture an image in which at least one object is located on at least one plane;
    an extracting unit configured to extract a foreground of the image;
    an estimating unit configured to estimate illumination corresponding to the extracted foreground image; and
    a rendering unit configured to render a virtual object with the estimated illumination.
  15. The apparatus of claim 14, wherein the estimating unit is configured to: input the extracted foreground to a neural network; obtain predicted spherical harmonics (SH) coefficients output by the neural network; and determine the illumination corresponding to the image based on the predicted SH coefficients.
  16. The apparatus of claim 14 or 15, wherein the extracting unit is configured to: detect existence of at least one preset object in the image; select, from the at least one preset object, an object which is in the center of the image as a target object; and determine and extract the foreground of the image based on the target object and a plane on which the target object is located.
  17. The apparatus of claim 16, wherein in terms of determining the foreground of the image based on the target object and the plane on which the target object is located, the extracting unit is configured to: determine a bounding box according to the target object; divide the bounding box into an upper bounding box and a lower bounding box; expand the lower bounding box to include part of the plane on which the target object is located; and determine a part of the image framed by the upper bounding box and the expanded lower bounding box and the target object as the foreground of the image.
  18. An apparatus for training a neural network, comprising:
    an obtaining unit configured to obtain a training image from a first dataset, wherein the first dataset comprises images and ground truth parameters corresponding to each of the images, and the training image is any of the images comprised in the first dataset;
    an extracting unit configured to extract a foreground of the training image;
    an inputting unit configured to input the extracted foreground of the training image to a neural network to obtain predicted parameters;
    a rendering unit configured to render an image with the predicted parameters;
    a calculating unit configured to calculate a loss of the neural network based on the predicted parameters, the rendered image, the training image, and ground truth parameters corresponding to the training image; and
    a training unit configured to train the neural network based on the loss.
  19. A terminal device comprising a processor and a memory configured to store one or more programs, wherein the one or more programs are configured to be executed by the processor and comprise instructions for performing the method of any of claims 1 to 9 and the method of any of claims 10 to 13.
  20. A non-transitory computer-readable storage medium configured to store computer programs for electronic data interchange (EDI) which, when executed, are operable with a computer to perform the method of any of claims 1 to 9 and the method of any of claims 10 to 13.