WO2024119997A1 - Illumination estimation method and apparatus - Google Patents

Illumination estimation method and apparatus

Info

Publication number
WO2024119997A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
illumination
lighting
dimensional model
light source
Prior art date
Application number
PCT/CN2023/123617
Other languages
French (fr)
Chinese (zh)
Inventor
Zhang Huan (张欢)
Matthias Niessner (尼斯纳马蒂亚斯)
Peter Kocsis (彼得柯西斯)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2024119997A1


Definitions

  • the embodiments of the present application relate to the field of media technology, and in particular to a method and device for illumination estimation.
  • augmented reality (AR) and other image technologies have gradually entered people's daily lives.
  • the popularity of AR and other image technologies has greatly improved people's efficiency and experience in obtaining information.
  • AR and other image technologies are widely used in education and training, military, medical, entertainment, and manufacturing industries.
  • Highly realistic virtual-real fusion is the core requirement of AR and other image technologies.
  • the light source of the actual scene is often unknown, and it is necessary to estimate the light source in the scene based on information such as scene pictures to obtain a more realistic AR experience. Accurate lighting estimation can improve the similarity between the lighting model and the real scene.
  • the embodiments of the present application provide a method and device for illumination estimation, which can improve the accuracy of illumination estimation. To achieve the above-mentioned purpose, the embodiments of the present application adopt the following technical solutions:
  • an embodiment of the present application provides a method for estimating illumination, the method comprising: obtaining a three-dimensional model of a scene; determining target information of the scene based on the three-dimensional model; and determining the illumination intensity of the scene based on the target information.
  • the target information includes light source position information, and the light source position information is used to indicate the position of the light source in the scene.
  • the lighting estimation method provided in the embodiment of the present application additionally introduces light source position information describing the position of the light source in the scene when performing lighting estimation, because lights often have similar geometric features in different scenes.
  • the three-dimensional model may be input into a first network model to obtain the light source position information, wherein the first network model is used to output the corresponding light source position information according to the input three-dimensional model.
  • the method provided in the embodiment of the present application can obtain light source position information describing the position of the light source in the scene by inputting the three-dimensional model of the scene into the first network model. Since lights often have similar geometric features in different scenes, introducing light source position information during illumination estimation imposes more constraints on the solution space, improving the accuracy and robustness of illumination estimation and narrowing the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
  • the method may further include: inputting the three-dimensional model into a first coding network to obtain a first coding block, wherein the first coding block includes coding point features; inputting the three-dimensional model into a second coding network to obtain a second coding block, wherein the second coding block includes target text features; and training a first network model based on the coding point features and the target text features, wherein the first network model is used to output corresponding light source position information based on the input three-dimensional model.
  • the three-dimensional model can be input into a U-Net (Unet) network for encoding to obtain a first encoding block including encoded point features.
  • the three-dimensional model is input into a target text encoder (such as a contrastive language-image pre-training (CLIP) encoder) for encoding to obtain a second encoding block including target text features (class features).
  • the encoded point features and the target text features are then placed in a joint space, and the Euclidean distance (L2 distance) between the two features is optimized by gradient descent to train the first network model.
  • the target text can include light sources and non-light sources.
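  • For illustration, the following is a minimal sketch of this alignment step, assuming per-point features from a point encoder and the two class prompts ("light source" / "non-light source") embedded by a CLIP text encoder; the encoder names, feature dimensions, and training loop here are illustrative assumptions, not the implementation of this application.

```python
import torch

def alignment_loss(point_feats, text_feats, point_labels):
    # point_feats:  (N, D) encoded point features projected into the joint space
    # text_feats:   (2, D) target text features, row 0 = "light source", row 1 = "non-light source"
    # point_labels: (N,) class index (0 or 1) for each point
    targets = text_feats[point_labels]                                # text feature each point should match
    return torch.linalg.norm(point_feats - targets, dim=-1).mean()   # mean Euclidean (L2) distance

# One gradient-descent step over the joint space (sketch):
# optimizer = torch.optim.Adam(point_encoder.parameters(), lr=1e-4)
# loss = alignment_loss(point_encoder(points), text_encoder(prompts), labels)
# loss.backward(); optimizer.step()
```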
  • through the coding point features and the target text features of the scene's three-dimensional model, the method provided in the embodiment of the present application can train a first network model for outputting corresponding light source position information according to the input three-dimensional model.
  • the three-dimensional model of the scene is input into the first network model to obtain light source position information describing the position of the light source in the scene. Since lights often have similar geometric features in different scenes, introducing light source position information during illumination estimation imposes more constraints on the solution space, improving the accuracy and robustness of illumination estimation and narrowing the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
  • the light source position information, the position code and the illumination feature vector may be input into a second network model to obtain the illumination intensity of the scene; and the illumination expression may be determined according to the illumination intensity and the illumination color.
  • the illumination feature vector of the scene describes the distribution of illumination in the scene.
  • the method provided in the embodiment of the present application introduces position coding during illumination estimation, which supplements more high-frequency components during illumination estimation, so that illumination estimation can depict illumination conditions with high-frequency changes.
  • the introduction of light source position information during illumination estimation imposes more constraints on the solution space, improves the accuracy and robustness of illumination estimation, and reduces the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
  • the target information further includes position coding information, and the position coding information is used to indicate a representation of the scene in a high-dimensional space.
  • the lighting estimation method introduces position coding information describing the representation of the scene in high-dimensional space when performing lighting estimation, supplements more high-frequency components, so that the lighting estimation process will not be too smooth, and it is easier to characterize the lighting conditions of the scene under high-frequency changes, thereby improving the accuracy of lighting estimation.
  • the target information further includes a lighting feature vector, and the lighting feature vector is used to indicate the distribution of lighting in the scene.
  • the illumination estimation method provided in the embodiment of the present application introduces an illumination feature vector that describes the distribution of illumination in the above-mentioned scene when performing illumination estimation, which gives more constraints to the solution space, thereby improving the accuracy of illumination estimation.
  • the target information further includes the lighting color of the scene.
  • the lighting expression of the scene can be determined according to the lighting intensity and the lighting color, and the lighting expression is used to indicate the lighting color and lighting intensity of the scene.
  • the illumination estimation method provided in the embodiment of the present application additionally introduces light source position information describing the position of the light source in the scene when performing illumination estimation, because lights often have similar geometric features in different scenes.
  • the target information may be input into a second network model to obtain the light intensity, and the second network model is used to output the corresponding light intensity according to the input target information.
  • the illumination estimation method provided in the embodiment of the present application additionally introduces the light source position information describing the position of the light source in the scene when performing illumination estimation, and then inputs the light source position information into the second network model to obtain the illumination intensity of the scene. Since lights often have similar geometric features in different scenes, by introducing the light source position information during illumination estimation, more constraints are imposed on the solution space, the accuracy and robustness of illumination estimation are improved, and the domain gap between the rendered scene CG data and the actual scene data is reduced, thereby improving the accuracy of illumination estimation.
  • the method may further include: determining a rendered image of the scene according to a lighting expression of the scene, the lighting expression being used to indicate the lighting color and lighting intensity of the scene. Training a second network model according to the rendered image of the scene and a reference image of the scene, the second network model being used to output a corresponding lighting intensity according to the input target information.
  • the second network model can be trained by gradient descent based on the difference between the rendered image of the scene and the reference image of the scene as a loss function.
  • the light intensity output by the second network model can be closer to the light intensity of the real scene, thereby further improving the accuracy of light estimation.
  • the method may further include: dividing the three-dimensional model into a plurality of voxels, and determining illumination feature vectors of the plurality of voxels according to an illumination feature vector of the scene, wherein the illumination feature vector of the scene is used to indicate the distribution of illumination in the scene.
  • the method may further include: segmenting the three-dimensional model into a plurality of voxels; and determining the illumination colors of the plurality of voxels according to the illumination color of the scene.
  • the three-dimensional model cannot express the complex, spatially varying lighting model well. After the three-dimensional model is voxelized, it can better express the complex, spatially varying lighting model. By voxelizing the three-dimensional model, the lighting color of each voxel of the three-dimensional model can be obtained, so that the obtained voxelized lighting model can support lighting editing and relighting.
  • an embodiment of the present application provides a lighting estimation device, the device comprising: a transceiver unit and a processing unit.
  • the transceiver unit is used to obtain a three-dimensional model of a scene.
  • the processing unit is used to determine target information of the scene based on the three-dimensional model, the target information includes light source position information, and the light source position information is used to indicate the position of the light source in the scene.
  • the processing unit is also used to determine the lighting intensity of the scene based on the target information.
  • the processing unit is specifically used to: input the three-dimensional model into a first network model to obtain the light source position information, wherein the first network model is used to output the corresponding light source position information according to the input three-dimensional model.
  • the processing unit is also used to: input the three-dimensional model into a first coding network to obtain a first coding block, wherein the first coding block includes coding point features; input the three-dimensional model into a second coding network to obtain a second coding block, wherein the second coding block includes target text features; train a first network model based on the coding point features and the target text features, wherein the first network model is used to output corresponding light source position information based on the input three-dimensional model.
  • the target information further includes position coding information, and the position coding information is used to indicate a representation of the scene in a high-dimensional space.
  • the target information further includes a lighting feature vector, and the lighting feature vector is used to indicate the distribution of lighting in the scene.
  • the target information also includes the lighting color of the scene.
  • the processing unit is further used to determine the lighting expression of the scene according to the lighting intensity and the lighting color, wherein the lighting expression is used to indicate the lighting color and lighting intensity of the scene.
  • the processing unit is specifically used to: input the target information into a second network model to obtain the light intensity, and the second network model is used to output the corresponding light intensity according to the input target information.
  • the processing unit is also used to: determine a rendered image of the scene based on a lighting expression of the scene, wherein the lighting expression is used to indicate the lighting color and lighting intensity of the scene; train a second network model based on the rendered image of the scene and a reference image of the scene, wherein the second network model is used to output corresponding lighting intensity based on input target information.
  • the processing unit is further used to: segment the three-dimensional model into multiple voxels; determine the illumination feature vectors of the multiple voxels based on the illumination feature vector of the scene, wherein the illumination feature vector of the scene is used to indicate the distribution of illumination in the scene.
  • the processing unit is further configured to: segment the three-dimensional model into a plurality of voxels; and determine the illumination colors of the plurality of voxels according to the illumination color of the scene.
  • an embodiment of the present application further provides a lighting estimation device, which includes at least one processor; when the at least one processor executes program code or instructions, the method described in the above first aspect or any possible implementation thereof is implemented.
  • the illumination estimation device may further include at least one memory, and the at least one memory is used to store the program code or instructions.
  • an embodiment of the present application further provides a chip, comprising: an input interface, an output interface, and at least one processor.
  • the chip further comprises a memory.
  • the at least one processor is used to execute the code in the memory, and when the at least one processor executes the code, the chip implements the method described in the first aspect or any possible implementation thereof.
  • the above chip may also be an integrated circuit.
  • an embodiment of the present application further provides a computer-readable storage medium for storing a computer program, wherein the computer program includes instructions for implementing the method described in the above-mentioned first aspect or any possible implementation thereof.
  • an embodiment of the present application further provides a computer program product comprising instructions, which, when executed on a computer, enables the computer to implement the method described in the first aspect or any possible implementation thereof.
  • the illumination estimation device, computer storage medium, computer program product and chip provided in this embodiment are all used to execute the method provided above. Therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the method provided above, which will not be repeated here.
  • FIG1 is a schematic diagram of the structure of an image processing system provided in an embodiment of the present application.
  • FIG2 is a schematic flow chart of a method for estimating illumination provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of a process for determining a lighting expression of a scene provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of a training process of a light source position information extraction network provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of a lighting estimation device provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the structure of another illumination estimation device provided in an embodiment of the present application.
  • "A and/or B" in this document merely describes an association relationship between associated objects, indicating that three relationships may exist.
  • "A and/or B" can mean: A exists alone, both A and B exist, or B exists alone.
  • "first", "second" and the like in the description and drawings of the embodiments of the present application are used to distinguish different objects, or to distinguish different processing of the same object, rather than to describe a specific order of objects.
  • Voxel is the abbreviation of volume pixel.
  • as the name suggests, a voxel is the smallest unit of digital data in the segmentation of three-dimensional space, and voxels are used for three-dimensional imaging. A volume containing voxels can be displayed by volume rendering or by extracting polygonal isosurfaces with a given threshold contour.
  • Three-dimensional reconstruction refers to the process of using computer and/or mathematical methods to establish, from collected data of three-dimensional objects in the real world, a three-dimensional mathematical model suitable for computer representation and processing.
  • 3D model is a polygonal representation of an object, usually displayed by a computer or other video device.
  • the displayed object can be a real-world entity or a fictional object. Anything that exists in the physical world can be represented by a 3D model.
  • the pose is the position of the camera in space and the attitude of the camera, which can be regarded as the translation and rotation transformation of the camera from the original reference position to the current position.
  • the pose of an object in this application is the position of the object in space and the attitude of the object.
  • Position encoding refers to mapping the original low-dimensional position coordinates through high-frequency functions to obtain high-dimensional position coordinates, so that the position coordinates carry more high-frequency information.
  • Camera extrinsic parameters are the external parameters of the camera: the conversion relationship between the world coordinate system and the camera coordinate system, including displacement (translation) parameters and rotation parameters.
  • the camera pose can be determined based on the camera extrinsic parameters.
  • Trilinear interpolation is a method of linear interpolation on a tensor product grid of three-dimensional discrete sampled data.
  • This tensor product grid may have arbitrary, non-overlapping grid points in each dimension, but it is not a triangulated grid as used in finite element analysis.
  • This method approximates the value of a point (x, y, z) linearly on a local rectangular prism by using the data points on the grid.
  • Trilinear interpolation is often used in numerical analysis, data analysis, and computer graphics.
  • Semantic segmentation is a basic research direction in the field of computer vision (CV). It can give a specific category to each pixel in an image. For example, it can analyze objects in a picture or a video stream and mark their categories pixel by pixel. Semantic segmentation is widely used in many fields such as autonomous driving, smart cities and medical image processing.
  • Deep learning can be used to perform image recognition and identify the categories of objects in the image, that is, object classification.
  • Object categories can be, for example: table, chair, cat, dog, car, etc.
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x_s (i.e., input data) and an intercept of 1 as input; the output of the operation unit can be h_{W,b}(x) = f(Σ_{s=1}^{n} W_s·x_s + b), where:
  • n is a natural number greater than 1;
  • W_s is the weight of x_s;
  • b is the bias of the neural unit;
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting multiple single neural units mentioned above, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • Convolutional neural network contains a feature extractor consisting of a convolution layer and a subsampling layer, which can be regarded as a filter.
  • Convolutional layer refers to the neuron layer in the convolutional neural network that performs convolution processing on the input signal.
  • a neuron can only be connected to some neurons in the adjacent layers.
  • a convolutional layer usually contains several feature planes, each of which can be composed of some rectangularly arranged neural units.
  • the neural units in the same feature plane share weights, and the shared weights here are the convolution kernels. Sharing weights can be understood as meaning that the way features are extracted is independent of position.
  • convolution kernels can be initialized in the form of matrices of random size, and convolution kernels can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of shared weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • Loss function: in the process of training a deep neural network, because we hope that the output of the deep neural network is as close as possible to the value we really want to predict, the predicted value of the current network can be compared with the target value we really want, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make it predict a lower value, and the adjustment continues until the deep neural network can predict the target value we really want, or a value very close to it.
  • a multi-layer perceptron is a feed-forward artificial neural network model that maps multiple input data sets to a single output data set. Its main feature is that it has multiple layers of neurons, so it is also called a deep neural network (DNN). A perceptron is a single-neuron model and is the predecessor of larger neural networks. The power of neural networks lies in their ability to learn representations of the training data and to relate them to the output variables to be predicted. Mathematically, they are able to learn any mapping function and have been proven to be universal approximators. The predictive power of neural networks comes from the hierarchical or multi-layered structure of the network.
  • a multi-layer perceptron is a neural network with at least three layers of nodes, an input layer, some intermediate layers, and an output layer. Each node in a given layer is connected to every node in the adjacent layer.
  • the input layer receives data
  • the intermediate layers calculate the data
  • the output layer outputs the results.
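  • As a concrete illustration, the following is a minimal sketch of such a multi-layer perceptron in PyTorch; the layer sizes and activation are arbitrary choices for the sketch, not values taken from this application.

```python
import torch.nn as nn

# A minimal multi-layer perceptron: an input layer, two intermediate (hidden) layers,
# and an output layer; every node in one layer is connected to every node in the next.
mlp = nn.Sequential(
    nn.Linear(3, 64),   # input layer -> first intermediate layer
    nn.ReLU(),
    nn.Linear(64, 64),  # second intermediate layer
    nn.ReLU(),
    nn.Linear(64, 1),   # output layer
)
```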
  • RGB red, green, blue
  • augmented reality (AR) and other image technologies have gradually entered people's daily lives.
  • the popularity of AR and other image technologies has greatly improved people's efficiency and experience in obtaining information.
  • AR and other image technologies are widely used in education and training, military, medical, entertainment, and manufacturing industries.
  • Highly realistic virtual-real fusion is the core requirement of AR and other image technologies. Keeping the light source and lighting model of virtual objects consistent with the real scene can obtain a realistic rendering effect.
  • the light source of the actual scene is often unknown, and it is necessary to estimate the light source in the scene based on information such as scene pictures to obtain a more realistic AR experience. Accurate lighting estimation can improve the similarity between the lighting model and the real scene.
  • an embodiment of the present application provides a method for illumination estimation, which can improve the accuracy of illumination estimation.
  • the method is applicable to an image processing system, and FIG1 shows a possible existence form of the image processing system.
  • the image processing system includes a terminal 10 and a server 20 .
  • the terminal 10 in the embodiment of the present application can be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), etc., and the embodiment of the present application does not impose any limitation on this.
  • the terminal 10 may include a sensor unit 11, a computing unit 12, a storage unit 13, a network transmission unit 14, and an interaction unit 15.
  • the sensor unit 11 usually includes a visual sensor (such as a camera) for acquiring two-dimensional (2D) image information of the scene, and an inertial navigation module (such as an inertial measurement unit (IMU)) for acquiring the relative pose relationship of the mobile device at different times, which is subsequently used for acquiring an initial geometric model.
  • the computing unit 12 may include a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU) and other processors as well as buffers and registers, and is mainly used to run the mobile terminal system and process the various algorithm modules designed in the embodiments of the present application, such as the illumination estimation method.
  • the storage unit 13 mainly includes a memory and an external storage module (such as a hard disk, etc.), and is mainly used for storing and reading and writing algorithm data, user local and temporary data, etc.
  • the network transmission unit 14 mainly includes upload and download modules, encoding and decoding modules for images, videos, three-dimensional models and lighting information, etc.
  • the interaction unit 15 mainly includes a display, a touch panel, a vibrator, an audio device (such as a speaker, a microphone and other audio devices), etc., and is mainly used to interact with the user, obtain user input, and present the algorithm effect to the user.
  • the server 20 may include a computing unit 21, a storage unit 22, and a network transmission unit 23.
  • the computing unit 21 may include various processors such as a central processing unit, an application processor, a modem processor, a graphics processor, an image signal processor, a video codec, a digital signal processor, a baseband processor, and/or a neural network processor, as well as buffers and registers, and is mainly used to run a server operating system and process various algorithm modules designed in the embodiments of the present application, such as an illumination estimation method.
  • the storage unit 22 mainly includes a memory and an external storage module (such as a hard disk, etc.), and is mainly used to store network models and parameters.
  • the network transmission unit 23 mainly includes upload and download modules, encoding and decoding of images, videos, three-dimensional models and lighting information, etc.
  • FIG2 shows a method for estimating illumination provided by an embodiment of the present application, the method comprising:
  • S201 Obtain a three-dimensional model of a scene.
  • the scene may include one or more objects, such as people, plants, animals, buildings, etc.
  • a three-dimensional model of a scene can be obtained based on the image of the scene.
  • a three-dimensional reconstruction application can be installed on the terminal 10, or a webpage related to three-dimensional reconstruction or downstream tasks based on three-dimensional reconstruction results can be opened.
  • the above application and webpage can provide an interface, and the terminal 10 can receive the relevant parameters entered by the user on the three-dimensional reconstruction or downstream task interface based on the three-dimensional reconstruction results, and send the above parameters to the server 20.
  • the server 20 can obtain the processing results based on the received parameters and return the processing results to the terminal 10.
  • the terminal 10 can also complete the data processing results based on the received parameters by itself without the need for the server to cooperate in the implementation, and the embodiments of the present application are not limited.
  • the data can be collected by a collection device (terminal, SLR camera, surveillance camera, stereo vision camera or other collection device).
  • the scene is photographed to obtain multi-view images of the scene (but not limited to RGB images of various formats).
  • the scene can be photographed around to obtain multi-view images.
  • the captured images need to have a certain common viewing area between perspectives, and blind spots should be avoided as much as possible.
  • the camera should be kept stable during shooting to avoid image blur.
  • the multi-view images of the scene are used as input, and a 3D model of the scene is reconstructed through a 3D reconstruction algorithm (such as the structure from motion (SFM) algorithm).
  • a multi-view image of a room may be collected by a collection device, and then a three-dimensional reconstruction algorithm is used to perform three-dimensional reconstruction using the collected multi-view images of the room as reference images to obtain a three-dimensional model of the room.
  • S202 Determine target information of the scene according to the three-dimensional model.
  • the target information includes light source position information, and the light source position information is used to indicate the position of the light source in the scene.
  • the light sources indicated by the light source position information include activated light sources and inactivated light sources.
  • activated light sources include but are not limited to light sources that are emitting light (such as turned-on lights, the sun, and turned-on displays)
  • inactivated light sources include but are not limited to light sources that are not emitting light (such as turned-off lights and turned-off displays) and objects with the geometric shape of light sources (such as lamp-shaped ornaments, lamp-shaped sculptures, etc.).
  • the three-dimensional model may be input into the first network model to obtain the light source position information.
  • the first network model is used to output the corresponding light source position information according to the input three-dimensional model.
  • the first network model can be trained by three-dimensional models of multiple scenes and coding features corresponding to each three-dimensional model.
  • the coding features include coding point features and target text features.
  • the target information may further include position coding information, where the position coding information is used to indicate a representation of the scene in a high-dimensional space.
  • the lighting estimation method introduces position coding information describing the representation of the scene in high-dimensional space when performing lighting estimation, supplements more high-frequency components, so that the lighting estimation process will not be too smooth, and it is easier to characterize the lighting conditions of the scene under high-frequency changes, thereby improving the accuracy of lighting estimation.
  • the position coding of the scene may be determined according to the position of the scene and a position coding coefficient.
  • the position coding of the scene may be determined according to the three-dimensional representation of the points in the scene and the position coding coefficients according to the position coding formula.
  • the pe(x, y, z) component represents the position of the point used to represent the scene in the high-dimensional space
  • the x, y, z components represent the position of the point used to represent the scene in the three-dimensional space.
  • n is a positive number, and the value of n can be related to the resolution of the voxel grid that divides the scene (that is, it can be related to the number of voxels in the scene). In the embodiment of the present application, n can be 9.
  • w_n is a position coding coefficient; for example, w_n can be 2 to the power of n, so that w_0 is 2 to the power of 0 and w_5 is 2 to the power of 5.
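  • The exact position coding formula is not reproduced here; as an illustration, the following sketch shows a typical frequency-based position encoding consistent with the components described above (coefficients w_n = 2 to the power of n). The sine/cosine form and the number of frequency bands are assumptions borrowed from common practice, not necessarily the formula used in this application.

```python
import numpy as np

def position_encoding(xyz, n_freqs=10):
    """Map a 3D point (x, y, z) to a higher-dimensional representation.

    xyz: array-like of shape (3,); returns an array of shape (3 * 2 * n_freqs,).
    Uses coefficients w_k = 2**k for k = 0..n_freqs-1, as described above;
    the sin/cos form is an illustrative assumption.
    """
    xyz = np.asarray(xyz, dtype=np.float32)
    freqs = 2.0 ** np.arange(n_freqs)          # w_0 = 1, w_1 = 2, ..., w_9 = 512
    scaled = xyz[None, :] * freqs[:, None]     # (n_freqs, 3)
    return np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1).reshape(-1)

# Example: encode a point of the scene to obtain its position code z_x
z_x = position_encoding([0.3, -1.2, 0.8])
```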
  • the target information further includes a lighting feature vector, and the lighting feature vector is used to indicate the distribution of lighting in the scene.
  • the illumination estimation method provided in the embodiment of the present application introduces an illumination feature vector that describes the distribution of illumination in the above-mentioned scene when performing illumination estimation, which gives more constraints to the solution space, thereby improving the accuracy of illumination estimation.
  • the three-dimensional model may be input into a third network model to obtain the illumination feature vector.
  • the third network model can be obtained by training three-dimensional models of multiple scenes and the illumination feature vector corresponding to each three-dimensional model.
  • the target information further includes the lighting color of the scene.
  • the lighting color is used to indicate the color of the light source in the scene, that is, the material information of the scene.
  • the three-dimensional model may be input into a third network model to obtain the lighting color.
  • the third network model can be obtained by training three-dimensional models of multiple scenes and the lighting color corresponding to each three-dimensional model.
  • the three-dimensional model may be input into a third network model to obtain the illumination feature vector and illumination color.
  • the third network model can be obtained by training three-dimensional models of multiple scenes and the illumination feature vector and illumination color corresponding to each three-dimensional model.
  • the three-dimensional model may be segmented into a plurality of voxels; and the illumination feature vectors of the plurality of voxels may be determined according to the illumination feature vectors.
  • the three-dimensional model of the scene can be input into the third network model to obtain the illumination feature vector of the three-dimensional model.
  • the three-dimensional model of the scene is divided into a plurality of voxels (for example, the three-dimensional model of the scene can be divided into a plurality of voxels at a resolution of 10 cm).
  • the illumination feature vector stored in each voxel is determined according to the illumination feature vector of the three-dimensional model.
  • the illumination feature vector of each voxel can be determined by trilinear interpolation: for each voxel, the illumination feature vectors of the eight surrounding voxels are obtained and trilinearly interpolated, as sketched below.
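  • As an illustration, the following is a minimal sketch of trilinear interpolation of per-voxel illumination feature vectors at a query point, assuming the voxel grid is stored as a dense array; the grid layout and names are hypothetical. The same interpolation can be applied to per-voxel illumination colors.

```python
import numpy as np

def interpolate_voxel_feature(grid, point, voxel_size=0.1):
    """Trilinearly interpolate a feature vector at `point` from a voxel grid.

    grid: array of shape (X, Y, Z, D) holding one feature vector per voxel.
    point: (3,) position in the same units as voxel_size (e.g. meters, 10 cm voxels).
    """
    idx = np.asarray(point) / voxel_size        # continuous voxel coordinates
    i0 = np.floor(idx).astype(int)              # index of the lower corner voxel
    t = idx - i0                                # fractional offsets in [0, 1)
    out = np.zeros(grid.shape[-1])
    # accumulate the weighted contributions of the eight surrounding voxels
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((1 - t[0]) if dx == 0 else t[0]) \
                  * ((1 - t[1]) if dy == 0 else t[1]) \
                  * ((1 - t[2]) if dz == 0 else t[2])
                out += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return out
```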
  • the three-dimensional model may be segmented into a plurality of voxels; and the illumination colors of the plurality of voxels may be determined according to the illumination colors.
  • the three-dimensional model of the scene can be input into the third network model to obtain the illumination color of the three-dimensional model.
  • the three-dimensional model of the scene is divided into a plurality of voxels (e.g., the three-dimensional model of the scene can be divided into a plurality of voxels at a resolution of 10 cm).
  • the illumination color stored in each voxel is determined according to the illumination color (diffuse) of the three-dimensional model.
  • the illumination color of each voxel can likewise be determined by trilinear interpolation: the illumination colors of the eight surrounding voxels are obtained and trilinearly interpolated.
  • S203 Determine the illumination intensity of the scene according to the target information of the scene.
  • the light intensity is used to indicate the intensity of the light in the scene, and the range of the light intensity may be between 0 and 1. The higher the light intensity of the scene, the stronger the light in the scene.
  • the target information may be input into a second network model to obtain the light intensity, and the second network model is used to output the corresponding light intensity according to the input target information.
  • the second network model may be trained by a rendered image of a scene and a reference image of the corresponding scene.
  • the illumination intensity can be written as I_e = Φ_l(w_0, z_s, z_l, z_x), where:
  • I_e is the illumination intensity of the scene or voxel;
  • Φ_l is the second network model (also called the neural field or illumination neural field);
  • w_0 is the illumination emission direction, which can be determined by the rendering angle;
  • z_s is the light source position information of the scene;
  • z_l is the illumination feature vector of the scene or scene voxel;
  • z_x is the position code of the scene or scene voxel.
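  • For illustration, a minimal sketch of such an illumination neural field Φ_l as a small MLP is shown below; the feature dimensions, layer sizes, and the sigmoid used to keep the intensity within [0, 1] are assumptions of the sketch, not details taken from this application.

```python
import torch
import torch.nn as nn

class IlluminationField(nn.Module):
    """Sketch of the second network model: (w_0, z_s, z_l, z_x) -> I_e."""
    def __init__(self, dim_zs=32, dim_zl=32, dim_zx=60):
        super().__init__()
        in_dim = 3 + dim_zs + dim_zl + dim_zx       # emission direction + the three features
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),        # intensity in [0, 1], as described above
        )

    def forward(self, w0, zs, zl, zx):
        return self.mlp(torch.cat([w0, zs, zl, zx], dim=-1))
```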
  • the lighting expression of the scene may also be determined according to the lighting intensity and the lighting color, where the lighting expression is used to indicate the lighting color and lighting intensity of the scene.
  • in the lighting expression, L_e is the lighting expression of the scene or scene voxel, I_e is the lighting intensity of the scene or scene voxel, and c_e is the lighting color of the scene or scene voxel.
  • the three-dimensional model of the scene is input into the third network model to obtain the illumination information f_l of the scene, which includes the illumination feature vector z_l and the illumination color c_e of the scene.
  • the three-dimensional model of the scene is input into the first network model to obtain the light source position information z_s of the scene.
  • the three-dimensional model of the scene can obtain the position code z_x of the scene through position coding.
  • the illumination intensity I_e of the scene (not shown in the figure) can then be obtained, and the obtained illumination intensity I_e and the illumination color c_e of the scene determine the lighting expression L_e of the scene.
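  • Putting the pieces together, the following sketch shows how the quantities above could be combined end to end, reusing the illumination field sketched earlier; treating the lighting expression as the product of intensity and color is an assumption of this sketch, and the network names are placeholders.

```python
def estimate_lighting(scene_model, scene_points, w0,
                      third_network, first_network, illumination_field,
                      position_encoding):
    """Sketch of the flow described above: model -> (z_l, c_e, z_s, z_x) -> I_e -> L_e."""
    z_l, c_e = third_network(scene_model)        # illumination feature vector and illumination color
    z_s = first_network(scene_model)             # light source position information
    z_x = position_encoding(scene_points)        # position code of the scene
    I_e = illumination_field(w0, z_s, z_l, z_x)  # second network model (see sketch above)
    L_e = I_e * c_e                              # lighting expression (product form assumed)
    return L_e
```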
  • when performing illumination estimation, the illumination estimation method provided in the embodiment of the present application additionally introduces light source position information describing the position of the light source in the scene, because lights often have similar geometric features in different scenes. By introducing the light source position information during illumination estimation, more constraints are imposed on the solution space, improving the accuracy and robustness of illumination estimation and narrowing the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
  • the method provided in the embodiment of the present application may further include:
  • S204 Input the three-dimensional model into a first coding network to obtain a first coding block.
  • the above-mentioned first coding block includes coding point features.
  • the above three-dimensional model can be input into the Unet network for encoding to obtain a first encoding block containing encoded point features.
  • S205 Input the above three-dimensional model into a second coding network to obtain a second coding block.
  • the second encoding block includes target text features.
  • the three-dimensional model may be input into a target text encoder (such as a CLIP encoder) for encoding to obtain a second encoding block including target text features (class features).
  • the target text may include a light source and a non-light source.
  • S206 Train a first network model based on the coding point features and the target text features.
  • the first network model is used to output the corresponding light source position information according to the input three-dimensional model.
  • the first network model can also be called a light source position information extraction network.
  • the three-dimensional model can be input into Unet for encoding to obtain a first encoding block including encoded point features.
  • the three-dimensional model is input into the target text encoder for encoding to obtain a second encoding block including target text features.
  • the encoded point features and the target text features are then placed in a joint space, and the Euclidean distance (L2 distance) between the two features is optimized by gradient descent to train the first network model.
  • the first network model mentioned above can adopt a residual network (deep residual network, ResNet) or UNet network structure (such as Res16UNet34D).
  • the pre-training data set of the first network model may be a ScanNet or S3DIS data set.
  • the pre-training data set of the first network model may be fine-tuned to reduce the classification categories in the pre-training data set to two categories: light source and non-light source.
  • the method provided in the embodiment of the present application can obtain, through the coding point features and target text features of the scene three-dimensional model, a first network model for outputting corresponding light source position information according to the input three-dimensional model.
  • the three-dimensional model of the scene is input into the first network model to obtain light source position information describing the position of the light source in the scene. Since lights often have similar geometric features in different scenes, introducing light source position information during illumination estimation gives more constraints to the solution space, improves the accuracy and robustness of illumination estimation, and narrows the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
  • a rendered image of a scene may be obtained according to a lighting expression through a rendering equation.
  • taking any pixel in a rendered image as an example, its color value is determined by the light incident on the pixel; through ray tracing, the light source corresponding to the incident light (which may have been reflected multiple times) can be found. Then, the color and brightness of the light source can be obtained from the model through the above lighting expression; combined with the material information, the color value of this ray after being projected onto the pixel can be calculated through the rendering equation.
  • by sampling, that is, by sampling multiple rays, repeating the above tracing process for each ray, and superimposing the obtained color values, the final color value of the pixel can be obtained.
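  • As an illustration of this sampling step, the following minimal sketch averages the contributions of several traced rays for one pixel; sample_ray, trace_ray and shade are placeholder callables standing in for the ray generation, ray tracing and rendering-equation evaluation described above.

```python
def pixel_color(pixel, num_samples, sample_ray, trace_ray, shade):
    """Average the shaded contributions of several rays through one pixel."""
    color = 0.0
    for _ in range(num_samples):
        ray = sample_ray(pixel)     # a camera ray through the pixel
        hit = trace_ray(ray)        # light source / hit information for that ray
        color += shade(hit, ray)    # contribution via the lighting expression and rendering equation
    return color / num_samples
```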
  • the above-mentioned second network model is used to output the corresponding light intensity according to the light source position information, position coding and lighting feature vector of the scene.
  • the second network model can be trained by gradient descent based on the difference between the rendered image of the scene and the reference image of the scene as a loss function.
  • the light intensity output by the second network model can be closer to the light intensity of the real scene, thereby further improving the accuracy of light estimation.
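  • A minimal sketch of this training step is shown below, assuming a differentiable render function and using the mean squared difference between the rendered image and the reference image as the loss; the optimizer choice and function names are assumptions of the sketch.

```python
import torch

def train_step(second_network, render, target_info, reference_image, optimizer):
    """One gradient-descent step on the second network model.

    render(intensity, target_info) -> rendered image of the scene (differentiable placeholder);
    reference_image                -> captured reference image of the same scene.
    """
    intensity = second_network(*target_info)                # predicted illumination intensity
    rendered = render(intensity, target_info)               # rendered image of the scene
    loss = torch.mean((rendered - reference_image) ** 2)    # image difference as the loss function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```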
  • the following will introduce an illumination estimation device for executing the above illumination estimation method in conjunction with FIG. 5 .
  • the illumination estimation device includes hardware and/or software modules that perform the corresponding functions. Whether a function is executed in hardware or in a computer software-driven hardware manner depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application in combination with the embodiments, but such implementation should not be considered to be beyond the scope of the embodiments of this application.
  • the embodiment of the present application can divide the functional modules of the illumination estimation device according to the above method example.
  • each functional module can be divided according to each function, or two or more functions can be integrated into one processing module.
  • the above integrated module can be implemented in the form of hardware. It should be noted that the division of modules in this embodiment is schematic and is only a logical function division. There may be other division methods in actual implementation.
  • FIG5 shows a possible composition diagram of the illumination estimation device involved in the above embodiment.
  • the illumination estimation device 500 may include: a transceiver unit 501 and a processing unit 502 .
  • the above-mentioned transceiver unit 501 is used to obtain a three-dimensional model of the scene.
  • the processing unit 502 is used to determine target information of the scene according to the three-dimensional model.
  • the target information includes light source position information.
  • the light source position information is used to indicate the position of the light source in the scene.
  • the processing unit 502 is further configured to determine the illumination intensity of the scene according to the target information.
  • the processing unit 502 is specifically used to: input the three-dimensional model into a first network model to obtain the light source position information, wherein the first network model is used to output the corresponding light source position information according to the input three-dimensional model.
  • the processing unit 502 is also used to: input the three-dimensional model into a first coding network to obtain a first coding block, wherein the first coding block includes coding point features; input the three-dimensional model into a second coding network to obtain a second coding block, wherein the second coding block includes target text features; train a first network model based on the coding point features and the target text features, wherein the first network model is used to output corresponding light source position information based on the input three-dimensional model.
  • the target information further includes position coding information, and the position coding information is used to indicate a representation of the scene in a high-dimensional space.
  • the target information further includes a lighting feature vector, and the lighting feature vector is used to indicate the distribution of lighting in the scene.
  • the target information also includes the lighting color of the scene.
  • the processing unit is further used to determine the lighting expression of the scene according to the lighting intensity and the lighting color, wherein the lighting expression is used to indicate the lighting color and lighting intensity of the scene.
  • the processing unit 502 is specifically used to: input the target information into a second network model to obtain the light intensity, and the second network model is used to output the corresponding light intensity according to the input target information.
  • the processing unit 502 is further used to: determine a rendered image of the scene based on a lighting expression of the scene, wherein the lighting expression is used to indicate the lighting color and lighting intensity of the scene; train a second network model based on the rendered image of the scene and a reference image of the scene, wherein the second network model is used to output corresponding lighting intensity based on input target information.
  • the processing unit 502 is further used to: divide the three-dimensional model into multiple voxels; determine the illumination feature vectors of the multiple voxels based on the illumination feature vector of the scene, wherein the illumination feature vector of the scene is used to indicate the distribution of illumination in the scene.
  • the processing unit 502 is further configured to: segment the three-dimensional model into a plurality of voxels; and determine the illumination colors of the plurality of voxels according to the illumination color of the scene.
  • FIG6 shows a schematic diagram of the structure of a chip 600.
  • the chip 600 includes one or more processors 601 and an interface circuit 602.
  • the chip 600 may also include a bus 603.
  • the processor 601 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned illumination estimation method may be completed by an integrated logic circuit of hardware in the processor 601 or by instructions in the form of software.
  • the processor 601 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the interface circuit 602 can be used to send or receive data, instructions or information.
  • the processor 601 can use the data, instructions or other information received by the interface circuit 602 to process, and can send the processing completion information through the interface circuit 602.
  • the chip also includes a memory, which may include a read-only memory and a random access memory, and provides operation instructions and data to the processor.
  • a portion of the memory may also include a non-volatile random access memory (NVRAM).
  • the memory stores executable software modules or data structures
  • the processor can perform corresponding operations by calling operation instructions stored in the memory (the operation instructions can be stored in the operating system).
  • the chip can be used in the illumination estimation device involved in the embodiment of the present application.
  • the interface circuit 602 can be used to output the execution result of the processor 601.
  • for the illumination estimation method provided by one or more embodiments of the present application, reference can be made to the aforementioned embodiments, which will not be repeated here.
  • the processor 601 and the interface circuit 602 can be implemented through hardware design, software design, or a combination of hardware and software, and there is no limitation here.
  • the electronic device 100 may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), an illumination estimation device, or a chip or functional module in an illumination estimation device.
  • FIG7 is a schematic diagram of the structure of an electronic device 100 provided in an embodiment of the present application.
  • the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, and a subscriber identification module (SIM) card interface 195, etc.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown in the figure, or combine some components, or split some components, or arrange the components differently.
  • the components shown in the figure may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processor (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • Different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 100.
  • the controller may generate an operation control signal according to the instruction operation code and the timing signal to complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus.
  • the processor 110 can couple the touch sensor 180K through the I2C interface, so that the processor 110 and the touch sensor 180K communicate through the I2C bus interface to realize the touch function of the electronic device 100.
  • the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193.
  • the MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), etc.
  • the processor 110 and the camera 193 communicate through the CSI interface to realize the shooting function of the electronic device 100.
  • the processor 110 and the display screen 194 communicate through the DSI interface to realize the display function of the electronic device 100.
  • the interface connection relationship between the modules illustrated in the embodiment of the present application is only a schematic illustration and does not constitute a structural limitation on the electronic device 100.
  • the electronic device 100 may also adopt an interface connection mode different from that in the above embodiments, or a combination of multiple interface connection modes.
  • the charging management module 140 is used to receive charging input from a charger.
  • the charger can be a wireless charger or a wired charger.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and provides power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, and the wireless communication module 160.
  • the electronic device 100 implements the display function through a GPU, a display screen 194, and an application processor.
  • the GPU is a microprocessor for image processing, which connects the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, etc.
  • the display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a mini LED, a micro LED, a micro OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the electronic device 100 can realize the shooting function through ISP, camera 193, touch sensor, video codec, GPU, display screen 194 and application processor.
  • ISP is used to process the data fed back by camera 193.
  • the shutter is opened, and the light is transmitted to the camera photosensitive element through the lens.
  • the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, which converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on the noise, brightness, and skin color of the image.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • ISP can be set in camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and projects it onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to be converted into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • the DSP converts the digital image signal into an image signal in a standard RGB, YUV or other format. It should be understood that in the description of the embodiments of the present application, an image in RGB format is used as an example for introduction, and the embodiments of the present application do not limit the image format.
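To make the format conversion above concrete, the following is a minimal sketch of one common YUV-to-RGB mapping (full-range BT.601 coefficients). The embodiments do not specify which conversion the DSP uses, so the coefficients, the full-range assumption and the function name are illustrative only.

```python
import numpy as np

def yuv_to_rgb(y, u, v):
    """Illustrative full-range BT.601 YUV -> RGB conversion for one pixel.

    y, u, v are 8-bit values (0-255); u and v are centered around 128.
    """
    r = y + 1.402 * (v - 128)
    g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128)
    b = y + 1.772 * (u - 128)
    return tuple(int(np.clip(c, 0, 255)) for c in (r, g, b))

print(yuv_to_rgb(128, 128, 128))  # mid-gray -> (128, 128, 128)
```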
  • the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • the digital signal processor is used to process digital signals, and can process not only digital image signals but also other digital signals. For example, when the electronic device 100 is selecting a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
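As a hedged illustration of the frequency-point energy computation attributed to the digital signal processor above, the sketch below computes the energy spectrum of a made-up signal with an FFT; the signal, sampling rate and variable names are assumptions and not taken from the embodiments.

```python
import numpy as np

rate = 1000.0                                   # samples per second (assumed)
t = np.arange(0, 1.0, 1.0 / rate)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1.0 / rate)
energy = np.abs(spectrum) ** 2                  # energy at each frequency point

print(freqs[np.argmax(energy)])                 # strongest frequency point, ~50 Hz
```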
  • Video codecs are used to compress or decompress digital videos.
  • the electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record videos in a variety of coding formats, such as moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, etc.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the internal memory 121 can be used to store computer executable program codes, which include instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by running the instructions stored in the internal memory 121.
  • the internal memory 121 may include a program storage area and a data storage area.
  • the electronic device 100 can implement audio functions such as music playing and recording through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone jack 170D, and the application processor.
  • the button 190 includes a power button, a volume button, etc.
  • the button 190 can be a mechanical button. It can also be a touch button.
  • the electronic device 100 can receive button input and generate key signal input related to the user settings and function control of the electronic device 100.
  • the motor 191 can generate a vibration prompt.
  • the motor 191 can be used for incoming call vibration prompts, and can also be used for touch vibration feedback. For example, touch operations acting on different applications (such as taking pictures, audio playback, etc.) can correspond to different vibration feedback effects. For touch operations acting on different areas of the display screen 194, the motor 191 can also correspond to different vibration feedback effects.
  • the indicator 192 can be an indicator light, which can be used to indicate the charging status, power changes, and can also be used to indicate messages, missed calls, notifications, etc.
  • the SIM card interface 195 is used to connect a SIM card.
  • the electronic device 100 may be a chip system or a device with a similar structure as shown in FIG. 7.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the actions, terms, etc. involved in the various embodiments of the present application may refer to each other without limitation.
  • the message name or parameter name in the message exchanged between the various devices in the embodiments of the present application is only an example, and other names may be used in the specific implementation without limitation.
  • the composition structure shown in FIG. 7 does not constitute a specific limitation on the electronic device 100; the electronic device 100 may include more or fewer components than those shown in FIG. 7, or combine certain components, or arrange the components differently.
  • the processor and transceiver described in the present application can be implemented in an integrated circuit (IC), an analog IC, a radio frequency integrated circuit, a mixed signal IC, an application specific integrated circuit (ASIC), a printed circuit board (PCB), an electronic device, etc.
  • the processor and transceiver can also be manufactured using various IC process technologies, such as complementary metal oxide semiconductor (CMOS), N-type metal oxide semiconductor (NMOS), P-type metal oxide semiconductor (positive channel metal oxide semiconductor, PMOS), bipolar junction transistor (BJT), bipolar CMOS (BiCMOS), silicon germanium (SiGe), gallium arsenide (GaAs), etc.
  • FIG8 is a schematic diagram of the structure of a lighting estimation device provided in an embodiment of the present application.
  • the lighting estimation device can be applied to the scenario shown in the above method embodiment.
  • FIG8 only shows the main components of the lighting estimation device, including a processor 801, a memory 802, a control circuit 803, and an input-output device 804.
  • the processor 801 is mainly used to process communication protocols and communication data, execute software programs, and process data of software programs.
  • the memory 802 is mainly used to store software programs and data.
  • the control circuit 803 is mainly used for power supply and transmission of various electrical signals.
  • the input-output device 804 is mainly used to receive data input by a user and output data to the user.
  • the control circuit 803 can be a mainboard; the memory 802 includes a hard disk, RAM, ROM and other media with storage functions; the processor 801 can include a baseband processor and a central processing unit, where the baseband processor is mainly used to process the communication protocol and communication data, and the central processing unit is mainly used to control the entire illumination estimation device, execute software programs, and process the data of the software programs; and the input-output device 804 includes a display screen, a keyboard, a mouse, etc. The control circuit 803 can further include or be connected to a transceiver circuit or a transceiver, such as a network cable interface, for sending or receiving data or signals, for example for data transmission and communication with other devices. Further, it can also include an antenna for sending and receiving wireless signals and for data/signal transmission with other devices.
  • An embodiment of the present application also provides a lighting estimation device, which includes: at least one processor, when the at least one processor executes program code or instructions, the above-mentioned related method steps are implemented to implement the lighting estimation method in the above-mentioned embodiment.
  • the device may further include at least one memory, and the at least one memory is used to store the program code or instruction.
  • An embodiment of the present application further provides a computer storage medium, in which computer instructions are stored.
  • when the computer instructions are run on the lighting estimation device, the lighting estimation device executes the above-mentioned related method steps to implement the lighting estimation method in the above-mentioned embodiment.
  • the embodiment of the present application also provides a computer program product.
  • when the computer program product is run on a computer, the computer is enabled to execute the above-mentioned related steps to implement the illumination estimation method in the above-mentioned embodiment.
  • the embodiment of the present application also provides a lighting estimation device, which can be a chip, an integrated circuit, a component or a module.
  • the device may include a connected processor and a memory for storing instructions, or the device includes at least one processor for obtaining instructions from an external memory.
  • the processor can execute instructions so that the chip executes the lighting estimation method in the above-mentioned method embodiments.
  • the sequence numbers of the above-mentioned processes do not imply an order of execution.
  • the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the above units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
  • the units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • if the above functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, or the part thereof that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium, including several instructions to enable a computer device (which can be a personal computer, server, or network device, etc.) to execute all or part of the steps of the above methods in each embodiment of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application relate to the technical field of media, and can improve the accuracy of illumination estimation. Disclosed are an illumination estimation method and apparatus. The method comprises: acquiring a three-dimensional model of a scene; determining target information of the scene according to the three-dimensional model; and determining illumination intensity in the scene according to the target information, wherein the target information comprises light source position information, which is used for indicating the position of a light source in the scene. By means of introducing light source position information during illumination estimation, more constraints are provided to a solution space, the accuracy and robustness of illumination estimation are improved, and the domain difference between scene CG data obtained by means of rendering and actual scene data is reduced, thereby improving the accuracy of illumination estimation.

Description

光照估计方法和装置Lighting estimation method and device
本申请要求于2022年12月09日提交中国专利局、申请号为202211581949.6、申请名称为“一种光照估计方法”的中国专利申请的优先权,以及于2023年06月26日提交中国知识产权局、申请号为202310767724.8、申请名称为“光照估计方法和装置”的中国专利申请的优先权,它们的全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed with the China Patent Office on December 9, 2022, with application number 202211581949.6 and application name “A method for illumination estimation”, and the priority of the Chinese patent application filed with the China Intellectual Property Office on June 26, 2023, with application number 202310767724.8 and application name “Illumination estimation method and device”, the entire contents of which are incorporated by reference in this application.
技术领域Technical Field
本申请实施例涉及媒体技术领域,尤其涉及光照估计方法和装置。The embodiments of the present application relate to the field of media technology, and in particular to a method and device for illumination estimation.
Background
随着科技的进步,增强现实(augmented reality,AR)等图像技术已经逐渐走进人们的日常生活。AR等图像技术的普及,极大提升了人们获取信息的效率和体验。AR等图像技术在教育培训、军事、医疗、娱乐和生产制造等行业中被广泛应用。高度真实感的虚实融合是AR等图像技术的核心需求。将虚拟物体的光源和光照模型与真实场景保持一致,可以获得具有真实感的渲染效果。With the advancement of science and technology, augmented reality (AR) and other image technologies have gradually entered people's daily lives. The popularity of AR and other image technologies has greatly improved people's efficiency and experience in obtaining information. AR and other image technologies are widely used in education and training, military, medical, entertainment, and manufacturing industries. Highly realistic virtual-real fusion is the core requirement of AR and other image technologies. By keeping the light source and lighting model of virtual objects consistent with the real scene, a realistic rendering effect can be obtained.
然而,实际场景的光源往往是未知的,需要根据场景图片等信息来估计场景中的光源,进而获得较为真实的AR体验,准确的光照估计可以提高光照模型与真实场景的相似度。However, the light source of the actual scene is often unknown, and it is necessary to estimate the light source in the scene based on information such as scene pictures to obtain a more realistic AR experience. Accurate lighting estimation can improve the similarity between the lighting model and the real scene.
为此,如何提升光照估计的准确性是本领域技术人员亟需解决的问题之一。Therefore, how to improve the accuracy of illumination estimation is one of the problems that technical personnel in this field need to solve urgently.
发明内容Summary of the invention
本申请实施例提供了光照估计方法和装置,能够提升光照估计的准确性。为达到上述目的,本申请实施例采用如下技术方案:The embodiments of the present application provide a method and device for illumination estimation, which can improve the accuracy of illumination estimation. To achieve the above-mentioned purpose, the embodiments of the present application adopt the following technical solutions:
第一方面,本申请实施例提供了一种光照估计方法,该方法包括:获取场景的三维模型。根据上述三维模型确定上述场景的目标信息。根据上述目标信息确定上述场景的光照强度。其中,上述目标信息包括光源位置信息,上述光源位置信息用于指示上述场景中光源的位置。In a first aspect, an embodiment of the present application provides a method for estimating illumination, the method comprising: obtaining a three-dimensional model of a scene. Determining target information of the scene based on the three-dimensional model. Determining illumination intensity of the scene based on the target information. The target information includes light source position information, and the light source position information is used to indicate the position of the light source in the scene.
相比于仅通过场景的图像进行光照估计,本申请实施例提供的光照估计方法,在进行光照估计时额外引入了描述场景中光源位置的光源位置信息,由于灯光在不同场景下往往具有类似的几何特征。通过在光照估计时引入光源位置信息,这对解空间给出了更多约束,提高了光照估计的准确性和鲁棒性,并且缩小了渲染得到的场景计算机图形学(computer graphics,CG)数据和场景实际数据之间的域差距(domain gap),由此提高了光照估计的准确性。Compared with lighting estimation only through the image of the scene, the lighting estimation method provided in the embodiment of the present application additionally introduces light source position information describing the position of the light source in the scene when performing lighting estimation, because lights often have similar geometric features in different scenes. By introducing light source position information during lighting estimation, more constraints are imposed on the solution space, the accuracy and robustness of lighting estimation are improved, and the domain gap between the rendered scene computer graphics (CG) data and the actual scene data is reduced, thereby improving the accuracy of lighting estimation.
在一种可能的实现方式中,可以将上述三维模型输入第一网络模型以得到上述光源位置信息。其中,上述第一网络模型用于根据输入的三维模型输入对应的光源位置信息。In a possible implementation, the three-dimensional model may be input into a first network model to obtain the light source position information, wherein the first network model is used to input the corresponding light source position information according to the input three-dimensional model.
可以看出,本申请实施例提供的方法可以通过将场景的三维模型输入第一网络模型得到描述场景中光源位置的光源位置信息。由于灯光在不同场景下往往具有类似的几何特征。通过在光照估计时引入光源位置信息,这对解空间给出了更多约束,提高了光照估计的准确性和鲁棒性,并且缩小了渲染得到的场景CG数据和场景实际数据之间的domain gap,由此提高了光照估计的准确性。It can be seen that the method provided in the embodiment of the present application can obtain light source position information describing the position of the light source in the scene by inputting the three-dimensional model of the scene into the first network model. Since lights often have similar geometric features in different scenes. By introducing light source position information during illumination estimation, more constraints are imposed on the solution space, improving the accuracy and robustness of illumination estimation, and narrowing the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
在一种可能的实现方式中,该方法还可以包括:将上述三维模型输入第一编码网络得到第一编码块,上述第一编码块包括编码点特征。将上述三维模型输入第二编码网络得到第二编码块,上述第二编码块包括目标文本特征。根据上述编码点特征和上述目标文本特征训练第一网络模型,上述第一网络模型用于根据输入的三维模型输入对应的光源位置信息。In a possible implementation, the method may further include: inputting the three-dimensional model into a first coding network to obtain a first coding block, wherein the first coding block includes coding point features. Inputting the three-dimensional model into a second coding network to obtain a second coding block, wherein the second coding block includes target text features. Training a first network model based on the coding point features and the target text features, wherein the first network model is used to input corresponding light source position information based on the input three-dimensional model.
示例性地,可以将上述三维模型输入通用网络(Unet)进行编码以得到包含编码点特征(encoded point features)的第一编码块。将上述三维模型输入目标文本编码器(如语言图像预训练(contrastive language-image pre-training,CLIP)编码器)进行编码以得到包括目标文本特征(class features)的第二编码块。之后编码点特征和目标文本特征将放在一个联合空间中,通过梯度下降优化两种特征之间的欧氏距离(L2距离),以训练第一网络模型。其中,上述目标文本可以包括光源和非光源。Exemplarily, the three-dimensional model can be input into a universal network (Unet) for encoding to obtain a first encoding block including encoded point features. The three-dimensional model is input into a target text encoder (such as a contrastive language-image pre-training (CLIP) encoder) for encoding to obtain a second encoding block including target text features (class features). The encoded point features and the target text features are then placed in a joint space, and the Euclidean distance (L2 distance) between the two features is optimized by gradient descent to train the first network model. The target text can include light sources and non-light sources.
It can be seen that the method provided in the embodiment of the present application can train, from the encoded point features of the scene's three-dimensional model and the target text features, a first network model for outputting corresponding light source position information according to the input three-dimensional model. The three-dimensional model of the scene is input into the first network model to obtain light source position information describing the position of the light source in the scene. Since lights often have similar geometric features in different scenes, introducing light source position information during illumination estimation imposes more constraints on the solution space, improves the accuracy and robustness of illumination estimation, and narrows the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
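The following is a minimal sketch, under stated assumptions, of the joint-space training objective described above: encoded point features are pulled toward the text feature of their class ("light source" or "non-light source") by minimizing the Euclidean (L2) distance with gradient descent. The placeholder PointEncoder stands in for the Unet-style point encoder, the random text_features stand in for frozen CLIP text embeddings, and all names, shapes and per-point labels are illustrative rather than the embodiments' actual implementation.

```python
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Placeholder for the U-Net-style point feature encoder (assumption)."""
    def __init__(self, in_dim=3, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, points):                  # points: (N, 3)
        return self.net(points)                 # encoded point features: (N, 512)

encoder = PointEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

text_features = torch.randn(2, 512)             # [non-light-source, light-source] text embeddings (placeholder)
points = torch.randn(1024, 3)                    # points of the scene's 3-D model (placeholder)
labels = torch.randint(0, 2, (1024,))            # 1 where a point belongs to a light source (placeholder)

for _ in range(100):
    point_feats = encoder(points)
    target = text_features[labels]                # matching text feature for each point
    loss = ((point_feats - target) ** 2).sum(dim=-1).mean()   # squared L2 distance in the joint space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```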
在一种可能的实现方式中,可以将上述光源位置信息、上述位置编码和上述光照特征向量输入第二网络模型以得到上述场景的光照强度;根据上述光照强度和上述光照颜色确定上述光照表达。其中,场景的光照特征向量描述了光照在场景中的分布情况。In a possible implementation, the light source position information, the position code and the illumination feature vector may be input into a second network model to obtain the illumination intensity of the scene; and the illumination expression may be determined according to the illumination intensity and the illumination color. The illumination feature vector of the scene describes the distribution of illumination in the scene.
可以理解的是,本申请实施例提供的方法一方面在光照估计时引入了位置编码,位置编码在光照估计过程中补充了更多的高频分量,使得光照估计时能够刻画高频变化的光照情况。另一方面,在光照估计时引入光源位置信息,这对解空间给出了更多约束,提高了光照估计的准确性和鲁棒性,并且缩小了渲染得到的场景CG数据和场景实际数据之间的domain gap,由此提高了光照估计的准确性。It can be understood that the method provided in the embodiment of the present application introduces position coding during illumination estimation, which supplements more high-frequency components during illumination estimation, so that illumination estimation can depict illumination conditions with high-frequency changes. On the other hand, the introduction of light source position information during illumination estimation imposes more constraints on the solution space, improves the accuracy and robustness of illumination estimation, and reduces the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
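As a sketch only, the second network model can be pictured as a small multilayer perceptron that concatenates the three inputs named above (position encoding, light source position information and lighting feature vector) and regresses a non-negative illumination intensity; the architecture, layer sizes and input dimensions below are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class IntensityNet(nn.Module):
    """Illustrative stand-in for the second network model (assumed to be an MLP)."""
    def __init__(self, pe_dim=63, light_dim=1, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pe_dim + light_dim + feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Softplus(),     # illumination intensity is non-negative
        )

    def forward(self, pos_enc, light_pos, light_feat):
        x = torch.cat([pos_enc, light_pos, light_feat], dim=-1)
        return self.mlp(x)

net = IntensityNet()
pos_enc = torch.randn(4096, 63)      # position encoding of sampled scene points (placeholder)
light_pos = torch.rand(4096, 1)      # per-point light source position information (placeholder)
light_feat = torch.randn(4096, 32)   # lighting feature vectors of the scene (placeholder)
intensity = net(pos_enc, light_pos, light_feat)   # (4096, 1)
```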
在一种可能的实现方式中,上述目标信息还包括位置编码信息,上述位置编码信息用于指示上述场景在高维空间中的表示。In a possible implementation, the target information further includes position coding information, and the position coding information is used to indicate a representation of the scene in a high-dimensional space.
可以理解的是,由于场景中的点在三维空间中仅需要x、y、z三个分量表示,其无法表达高频分量。而位置编码信息指示了场景在高维空间中的表示,具备表达高频分量能力。因此相比于仅通过场景的三维位置进行光照估计,本申请实施例提供的光照估计方法,通过在进行光照估计时引入了描述场景中在高维空间中的表示的位置编码信息,补充了更多的高频分量,使得光照估计过程不会过于平滑,更易于刻画场景在高频变化情况下的光照情况,由此提高了光照估计的准确性。It is understandable that since the points in the scene only need three components x, y, and z to represent in three-dimensional space, it is unable to express high-frequency components. The position coding information indicates the representation of the scene in high-dimensional space and has the ability to express high-frequency components. Therefore, compared to lighting estimation only through the three-dimensional position of the scene, the lighting estimation method provided in the embodiment of the present application introduces position coding information describing the representation of the scene in high-dimensional space when performing lighting estimation, supplements more high-frequency components, so that the lighting estimation process will not be too smooth, and it is easier to characterize the lighting conditions of the scene under high-frequency changes, thereby improving the accuracy of lighting estimation.
在一种可能的实现方式中,上述目标信息还包括光照特征向量,上述光照特征向量用于指示光照在上述场景中的分布情况。In a possible implementation manner, the target information further includes a lighting feature vector, and the lighting feature vector is used to indicate the distribution of lighting in the scene.
可以理解的是,本申请实施例提供的光照估计方法,通过在进行光照估计时引入了描述光照在上述场景中的分布情况的光照特征向量,这对解空间给出了更多约束,由此提高了光照估计的准确性。It can be understood that the illumination estimation method provided in the embodiment of the present application introduces an illumination feature vector that describes the distribution of illumination in the above-mentioned scene when performing illumination estimation, which gives more constraints to the solution space, thereby improving the accuracy of illumination estimation.
在一种可能的实现方式中,上述目标信息还包括上述场景的光照颜色。可以根据上述光照强度和上述光照颜色确定上述场景的光照表达,上述光照表达用于指示上述场景的光照颜色和光照强度。In a possible implementation, the target information further includes the lighting color of the scene. The lighting expression of the scene can be determined according to the lighting intensity and the lighting color, and the lighting expression is used to indicate the lighting color and lighting intensity of the scene.
可以理解的是,相比于仅通过场景的图像进行光照估计,本申请实施例提供的光照估计方法,在进行光照估计时额外引入了描述场景中光源位置的光源位置信息,由于灯光在不同场景下往往具有类似的几何特征。通过在光照估计时引入光源位置信息,这对解空间给出了更多约束,提高了光照估计的准确性和鲁棒性,并且缩小了渲染得到的场景CG数据和场景实际数据之间的domain gap,由此提高了光照估计的准确性。It is understandable that, compared to performing illumination estimation only through the image of the scene, the illumination estimation method provided in the embodiment of the present application additionally introduces light source position information describing the position of the light source in the scene when performing illumination estimation, because lights often have similar geometric features in different scenes. By introducing light source position information during illumination estimation, more constraints are imposed on the solution space, the accuracy and robustness of illumination estimation are improved, and the domain gap between the rendered scene CG data and the actual scene data is reduced, thereby improving the accuracy of illumination estimation.
在一种可能的实现方式中,可以将上述目标信息输入第二网络模型以得到上述光照强度,上述第二网络模型用于根据输入的目标信息输出对应的光照强度。In a possible implementation, the target information may be input into a second network model to obtain the light intensity, and the second network model is used to output the corresponding light intensity according to the input target information.
可以理解的是,相比于仅通过场景的图像进行光照估计,本申请实施例提供的光照估计方法,在进行光照估计时额外引入了描述场景中光源位置的光源位置信息,然后通过将光源位置信息输入第二网络模型以得到场景的光照强度,由于灯光在不同场景下往往具有类似的几何特征。通过在光照估计时引入光源位置信息,这对解空间给出了更多约束,提高了光照估计的准确性和鲁棒性,并且缩小了渲染得到的场景CG数据和场景实际数据之间的domain gap,由此提高了光照估计的准确性。It can be understood that, compared with the illumination estimation only through the image of the scene, the illumination estimation method provided in the embodiment of the present application additionally introduces the light source position information describing the position of the light source in the scene when performing illumination estimation, and then inputs the light source position information into the second network model to obtain the illumination intensity of the scene. Since lights often have similar geometric features in different scenes, by introducing the light source position information during illumination estimation, more constraints are imposed on the solution space, the accuracy and robustness of illumination estimation are improved, and the domain gap between the rendered scene CG data and the actual scene data is reduced, thereby improving the accuracy of illumination estimation.
在一种可能的实现方式中,该方法还可以包括:根据上述场景的光照表达确定上述场景的渲染图像,上述光照表达用于指示上述场景的光照颜色和光照强度。根据上述场景的渲染图像和上述场景的参考图像训练第二网络模型,上述第二网络模型用于根据输入的目标信息输出对应的光照强度。In a possible implementation, the method may further include: determining a rendered image of the scene according to a lighting expression of the scene, the lighting expression being used to indicate the lighting color and lighting intensity of the scene. Training a second network model according to the rendered image of the scene and a reference image of the scene, the second network model being used to output a corresponding lighting intensity according to the input target information.
具体地,可以根据上述场景的渲染图像和上述场景的参考图像之间的差值作为损失函数,通过梯度下降以训练第二网络模型。Specifically, the second network model can be trained by gradient descent based on the difference between the rendered image of the scene and the reference image of the scene as a loss function.
可以理解的是,通过根据上述场景的渲染图像和上述场景的参考图像训练第二网络模型,可以使得第二网络模型输出的光照强度更加接近真实场景的光照强度,由此进一步提高了光照估计的准确性。It can be understood that by training the second network model based on the rendered image of the above scene and the reference image of the above scene, the light intensity output by the second network model can be closer to the light intensity of the real scene, thereby further improving the accuracy of light estimation.
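A minimal sketch of that training step, assuming a hypothetical differentiable render() function, is given below; only the loss and the gradient-descent update are shown, and the mean squared per-pixel difference is one possible choice of the "difference between the rendered image and the reference image".

```python
import torch

def training_step(render, lighting_params, reference_image, optimizer):
    """One gradient-descent step on the rendered-vs-reference difference.

    `render` is a hypothetical differentiable renderer taking the current
    lighting parameters and returning an (H, W, 3) image tensor.
    """
    rendered = render(lighting_params)
    loss = torch.mean((rendered - reference_image) ** 2)   # difference used as the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```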
在一种可能的实现方式中,该方法还可以包括:将上述三维模型分割为多个体素。根据上述场景的光照特征向量确定上述多个体素的光照特征向量,上述场景的光照特征向量用于指示光照在上述场景中的分布情况。In a possible implementation, the method may further include: dividing the three-dimensional model into a plurality of voxels, and determining illumination feature vectors of the plurality of voxels according to an illumination feature vector of the scene, wherein the illumination feature vector of the scene is used to indicate the distribution of illumination in the scene.
It is understandable that a three-dimensional model alone cannot express a complex, spatially varying lighting model well, whereas a voxelized three-dimensional model can express such a lighting model better. By voxelizing the three-dimensional model, the lighting feature vector of each voxel is obtained, so that the resulting voxelized lighting model can support lighting editing and relighting.
在一种可能的实现方式中,该方法还可以包括:将上述三维模型分割为多个体素;根据上述场景的光照颜色确定上述多个体素的光照颜色。In a possible implementation manner, the method may further include: segmenting the three-dimensional model into a plurality of voxels; and determining the illumination colors of the plurality of voxels according to the illumination color of the scene.
可以理解的,三维模型无法很好表达复杂、空间变化的光照模型,而将三维模型体素化后,可以较好的表达复杂、空间变化的光照模型,通过将三维模型体素化可以得到三维模型体素化的各个体素的光照颜色,从而可以使得到的体素化的光照模型支持光照编辑和重打光。It is understandable that the three-dimensional model cannot express the complex, spatially varying lighting model well. After the three-dimensional model is voxelized, it can better express the complex, spatially varying lighting model. By voxelizing the three-dimensional model, the lighting color of each voxel of the three-dimensional model can be obtained, so that the obtained voxelized lighting model can support lighting editing and relighting.
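The sketch below illustrates, under assumed grid bounds and sizes, how the scene's three-dimensional model can be split into voxels that each carry a lighting feature vector and a lighting color; how the scene-level quantities are actually distributed to the voxels is not specified here, so only the voxel grid and the point-to-voxel lookup are shown.

```python
import numpy as np

resolution, feat_dim = 32, 16                                 # assumed grid resolution and feature size
grid_min = np.array([-1.0, -1.0, -1.0])                       # assumed scene bounds
grid_max = np.array([1.0, 1.0, 1.0])

voxel_features = np.zeros((resolution, resolution, resolution, feat_dim))  # per-voxel lighting feature vectors
voxel_colors = np.zeros((resolution, resolution, resolution, 3))           # per-voxel lighting colors (RGB)

def voxel_index(point):
    """Map a 3-D point of the model to the index of the voxel containing it."""
    rel = (point - grid_min) / (grid_max - grid_min)          # normalize to [0, 1]
    idx = np.clip((rel * resolution).astype(int), 0, resolution - 1)
    return tuple(int(v) for v in idx)

print(voxel_index(np.array([0.0, 0.0, 0.0])))                 # -> (16, 16, 16)
```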
第二方面,本申请实施例提供了一种光照估计装置,上述装置包括:收发单元和处理单元。上述收发单元,用于获取场景的三维模型。上述处理单元,用于根据上述三维模型确定上述场景的目标信息,上述目标信息包括光源位置信息,上述光源位置信息用于指示上述场景中光源的位置。上述处理单元,还用于根据上述目标信息确定上述场景的光照强度。In a second aspect, an embodiment of the present application provides a lighting estimation device, the device comprising: a transceiver unit and a processing unit. The transceiver unit is used to obtain a three-dimensional model of a scene. The processing unit is used to determine target information of the scene based on the three-dimensional model, the target information includes light source position information, and the light source position information is used to indicate the position of the light source in the scene. The processing unit is also used to determine the lighting intensity of the scene based on the target information.
在一种可能的实现方式中,上述处理单元具体用于:将上述三维模型输入第一网络模型以得到上述光源位置信息,上述第一网络模型用于根据输入的三维模型输入对应的光源位置信息。In a possible implementation, the processing unit is specifically used to: input the three-dimensional model into a first network model to obtain the light source position information, and the first network model is used to input the corresponding light source position information according to the input three-dimensional model.
在一种可能的实现方式中,上述处理单元还用于:将上述三维模型输入第一编码网络得到第一编码块,上述第一编码块包括编码点特征;将上述三维模型输入第二编码网络得到第二编码块,上述第二编码块包括目标文本特征;根据上述编码点特征和上述目标文本特征训练第一网络模型,上述第一网络模型用于根据输入的三维模型输入对应的光源位置信息。In a possible implementation, the processing unit is also used to: input the three-dimensional model into a first coding network to obtain a first coding block, wherein the first coding block includes coding point features; input the three-dimensional model into a second coding network to obtain a second coding block, wherein the second coding block includes target text features; train a first network model based on the coding point features and the target text features, wherein the first network model is used to input corresponding light source position information based on the input three-dimensional model.
在一种可能的实现方式中,上述目标信息还包括位置编码信息,上述位置编码信息用于指示上述场景在高维空间中的表示。In a possible implementation, the target information further includes position coding information, and the position coding information is used to indicate a representation of the scene in a high-dimensional space.
在一种可能的实现方式中,上述目标信息还包括光照特征向量,上述光照特征向量用于指示光照在上述场景中的分布情况。In a possible implementation, the target information further includes a lighting feature vector, and the lighting feature vector is used to indicate the distribution of lighting in the scene.
在一种可能的实现方式中,上述目标信息还包括上述场景的光照颜色,上述处理单元还用于:根据上述光照强度和上述光照颜色确定上述场景的光照表达,上述光照表达用于指示上述场景的光照颜色和光照强度。In a possible implementation, the target information also includes the lighting color of the scene, and the processing unit is further used to determine the lighting expression of the scene according to the lighting intensity and the lighting color, wherein the lighting expression is used to indicate the lighting color and lighting intensity of the scene.
在一种可能的实现方式中,上述处理单元具体用于:将上述目标信息输入第二网络模型以得到上述光照强度,上述第二网络模型用于根据输入的目标信息输出对应的光照强度。In a possible implementation, the processing unit is specifically used to: input the target information into a second network model to obtain the light intensity, and the second network model is used to output the corresponding light intensity according to the input target information.
在一种可能的实现方式中,上述处理单元还用于:根据上述场景的光照表达确定上述场景的渲染图像,上述光照表达用于指示上述场景的光照颜色和光照强度;根据上述场景的渲染图像和上述场景的参考图像训练第二网络模型,上述第二网络模型用于根据输入的目标信息输出对应的光照强度。In a possible implementation, the processing unit is also used to: determine a rendered image of the scene based on a lighting expression of the scene, wherein the lighting expression is used to indicate the lighting color and lighting intensity of the scene; train a second network model based on the rendered image of the scene and a reference image of the scene, wherein the second network model is used to output corresponding lighting intensity based on input target information.
在一种可能的实现方式中,上述处理单元还用于:将上述三维模型分割为多个体素;根据上述场景的光照特征向量确定上述多个体素的光照特征向量,上述场景的光照特征向量用于指示光照在上述场景中的分布情况。In a possible implementation, the processing unit is further used to: segment the three-dimensional model into multiple voxels; determine the illumination feature vectors of the multiple voxels based on the illumination feature vector of the scene, wherein the illumination feature vector of the scene is used to indicate the distribution of illumination in the scene.
在一种可能的实现方式中,上述处理单元还用于:将上述三维模型分割为多个体素;根据上述场景的光照颜色确定上述多个体素的光照颜色。In a possible implementation, the processing unit is further configured to: segment the three-dimensional model into a plurality of voxels; and determine the illumination colors of the plurality of voxels according to the illumination color of the scene.
第三方面,本申请实施例还提供一种光照估计装置,该光照估计装置包括:至少一个处理器,当所述至少一个处理器执行程序代码或指令时,实现上述第一方面或其任意可能的实现方式中所述的方法。In a third aspect, an embodiment of the present application further provides a lighting estimation device, which includes: at least one processor, when the at least one processor executes program code or instructions, implements the method described in the above first aspect or any possible implementation method thereof.
可选地,该光照估计装置还可以包括至少一个存储器,该至少一个存储器用于存储该程序代码或指令。Optionally, the illumination estimation device may further include at least one memory, and the at least one memory is used to store the program code or instruction.
第四方面,本申请实施例还提供一种芯片,包括:输入接口、输出接口、至少一个处理器。可选的,该芯片还包括存储器。该至少一个处理器用于执行该存储器中的代码,当该至少一个处理器执行该代码时,该芯片实现上述第一方面或其任意可能的实现方式中所述的方法。In a fourth aspect, an embodiment of the present application further provides a chip, comprising: an input interface, an output interface, and at least one processor. Optionally, the chip further comprises a memory. The at least one processor is used to execute the code in the memory, and when the at least one processor executes the code, the chip implements the method described in the first aspect or any possible implementation thereof.
可选地,上述芯片还可以为集成电路。Optionally, the above chip may also be an integrated circuit.
第五方面,本申请实施例还提供一种计算机可读存储介质,用于存储计算机程序,该计算机程序包括用于实现上述第一方面或其任意可能的实现方式中所述的方法。In a fifth aspect, an embodiment of the present application further provides a computer-readable storage medium for storing a computer program, wherein the computer program includes methods for implementing the method described in the above-mentioned first aspect or any possible implementation thereof.
第六方面,本申请实施例还提供一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机实现上述第一方面或其任意可能的实现方式中所述的方法。In a sixth aspect, an embodiment of the present application further provides a computer program product comprising instructions, which, when executed on a computer, enables the computer to implement the method described in the first aspect or any possible implementation thereof.
The illumination estimation device, computer storage medium, computer program product and chip provided in this embodiment are all used to execute the method provided above; therefore, for the beneficial effects that they can achieve, reference may be made to the beneficial effects of the method provided above, which will not be repeated here.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单的介绍,显而易见地,下面描述中的附图仅仅是本申请实施例的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present application, the following briefly introduces the drawings required for use in the description of the embodiments. Obviously, the drawings described below are only some embodiments of the embodiments of the present application. For ordinary technicians in this field, other drawings can be obtained based on these drawings without creative work.
图1为本申请实施例提供的一种图像处理系统的结构示意图;FIG1 is a schematic diagram of the structure of an image processing system provided in an embodiment of the present application;
图2为本申请实施例提供的一种光照估计方法的流程示意图;FIG2 is a schematic diagram of a flow chart of a method for estimating illumination provided in an embodiment of the present application;
图3为本申请实施例提供的一种确定场景的光照表达的流程示意图;FIG3 is a schematic diagram of a process for determining a lighting expression of a scene provided in an embodiment of the present application;
图4为本申请实施例提供的一种光源位置信息提取网络的训练流程示意图;FIG4 is a schematic diagram of a training process of a light source position information extraction network provided in an embodiment of the present application;
图5为本申请实施例提供的一种光照估计装置的结构示意图;FIG5 is a schematic diagram of the structure of a lighting estimation device provided in an embodiment of the present application;
图6为本申请实施例提供的一种芯片的结构示意图;FIG6 is a schematic diagram of the structure of a chip provided in an embodiment of the present application;
图7为本申请实施例提供的一种电子设备的结构示意图;FIG7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application;
图8为本申请实施例提供的另一种光照估计装置的结构示意图。FIG. 8 is a schematic diagram of the structure of another illumination estimation device provided in an embodiment of the present application.
Detailed Description
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整的描述,显然,所描述的实施例仅仅是本申请实施例一部分实施例,而不是全部的实施例。基于本申请实施例中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请实施例保护的范围。The following will be combined with the drawings in the embodiments of the present application to clearly and completely describe the technical solutions in the embodiments of the present application. Obviously, the described embodiments are only part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the embodiments of the present application, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of the embodiments of the present application.
本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。The term "and/or" in this article is merely a description of the association relationship of associated objects, indicating that three relationships may exist. For example, A and/or B can mean: A exists alone, A and B exist at the same time, and B exists alone.
本申请实施例的说明书以及附图中的术语“第一”和“第二”等是用于区别不同的对象,或者用于区别对同一对象的不同处理,而不是用于描述对象的特定顺序。The terms "first" and "second" and the like in the description and drawings of the embodiments of the present application are used to distinguish different objects, or to distinguish different processing of the same object, rather than to describe a specific order of objects.
此外,本申请实施例的描述中所提到的术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选的还包括其他没有列出的步骤或单元,或可选的还包括对于这些过程、方法、产品或设备固有的其他步骤或单元。In addition, the terms "including" and "having" and any variations thereof mentioned in the description of the embodiments of the present application are intended to cover non-exclusive inclusions. For example, a process, method, system, product or device including a series of steps or units is not limited to the listed steps or units, but may optionally include other steps or units that are not listed, or may optionally include other steps or units that are inherent to these processes, methods, products or devices.
需要说明的是,本申请实施例的描述中,“示例性地”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性地”或者“例如”的任何实施例或设计方案不应被解释为比其他实施例或设计方案更优先或更具优势。确切而言,使用“示例性地”或者“例如”等词旨在以具体方式呈现相关概念。It should be noted that, in the description of the embodiments of the present application, words such as "exemplarily" or "for example" are used to indicate examples, illustrations or descriptions. Any embodiment or design described as "exemplarily" or "for example" in the embodiments of the present application should not be interpreted as having priority or advantage over other embodiments or designs. Specifically, the use of words such as "exemplarily" or "for example" is intended to present related concepts in a specific way.
首先对本申请实施例涉及的一些术语进行解释说明。First, some terms involved in the embodiments of the present application are explained.
体素,是体积元素(volume pixel)的简称,包含体素的立体可以通过立体渲染或者提取给定阈值轮廓的多边形等值面表现出来。一如其名,是数字数据于三维空间分割上的最小单位,体素用于三维成像。Voxel is the abbreviation of volume pixel. A volume containing voxels can be displayed by stereo rendering or extracting polygonal isosurfaces with a given threshold contour. As the name suggests, it is the smallest unit of digital data in three-dimensional space segmentation. Voxel is used for three-dimensional imaging.
三维重建,是指基于采集到的地理世界中三维物体的数据,利用计算机方法和/或数学方法将该三维物体建立为适合计算机表示和处理的三维数学模型的过程。Three-dimensional reconstruction refers to the process of using computer and/or mathematical methods to establish a three-dimensional mathematical model suitable for computer representation and processing based on the data of three-dimensional objects in the geographical world collected.
三维模型:三维模型是物体的多边形表示,通常用计算机或者其它视频设备进行显示。显示的物体可以是现实世界的实体,也可以是虚构的物体。任何物理自然界存在的东西都可以用三维模型表示。三维模型的数据存储形式有多种,例如以三维点云、网格或体元等形式表示,具体此处不做限定3D model: A 3D model is a polygonal representation of an object, usually displayed by a computer or other video device. The displayed object can be a real-world entity or a fictional object. Anything that exists in the physical world can be represented by a 3D model. There are many forms of data storage for 3D models, such as 3D point clouds, meshes, or voxels, which are not limited here.
相机位姿:位姿即相机在空间中的位置和相机的姿态,可以分别看作相机从原始参考位置到当前位置的平移变换和旋转变换。类似的,本申请中物体的位姿即,物体在空间中的位置和物体的姿态。Camera pose: The pose is the position of the camera in space and the attitude of the camera, which can be regarded as the translation and rotation transformation of the camera from the original reference position to the current position. Similarly, the pose of an object in this application is the position of the object in space and the attitude of the object.
位置编码(position embedding,PE):指通过高频函数对原始的低维位置坐标进行映射得到高维位置坐标,从而使位置坐标带有更多的高频信息。Position encoding (PE): refers to mapping the original low-dimensional position coordinates through high-frequency functions to obtain high-dimensional position coordinates, so that the position coordinates carry more high-frequency information.
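A minimal sketch of one common sinusoidal position encoding follows; the number of frequency bands (10, giving a 63-dimensional encoding for a 3-D point) is an assumption for illustration, not a value taken from the embodiments.

```python
import numpy as np

def position_encoding(p, num_bands=10):
    """Map a low-dimensional point p (shape (3,)) to a higher-dimensional
    vector of shape (3 + 3 * 2 * num_bands,) carrying high-frequency terms."""
    out = [p]
    for i in range(num_bands):
        freq = (2.0 ** i) * np.pi
        out.append(np.sin(freq * p))
        out.append(np.cos(freq * p))
    return np.concatenate(out)

print(position_encoding(np.array([0.1, -0.2, 0.3])).shape)   # (63,)
```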
相机外参:即相机的外参数,世界坐标系与相机坐标系之间的转换关系,包括位移参数和旋转参数,根据相机外参可以确定相机位姿。Camera extrinsic parameters: the external parameters of the camera, the conversion relationship between the world coordinate system and the camera coordinate system, including displacement parameters and rotation parameters. The camera pose can be determined based on the camera extrinsic parameters.
Trilinear interpolation is a method of linear interpolation on a tensor product grid of three-dimensional discrete sampled data. This tensor product grid may have arbitrary, non-overlapping grid points in each dimension, but it is not a triangulated finite element analysis mesh. The method linearly approximates the value of a point (x, y, z) within a local rectangular prism from the data points on the grid. Trilinear interpolation is often used in numerical analysis, data analysis, and computer graphics.
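The following sketch shows trilinear interpolation of a scalar stored at the eight grid corners surrounding a query point; the same weights apply unchanged to vector-valued data such as per-voxel lighting feature vectors.

```python
import numpy as np

def trilinear(grid, x, y, z):
    """grid: (X, Y, Z) array; x, y, z: continuous coordinates in grid units."""
    x0, y0, z0 = int(np.floor(x)), int(np.floor(y)), int(np.floor(z))
    dx, dy, dz = x - x0, y - y0, z - z0
    value = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                w = ((1 - dx) if i == 0 else dx) * \
                    ((1 - dy) if j == 0 else dy) * \
                    ((1 - dz) if k == 0 else dz)
                value += w * grid[x0 + i, y0 + j, z0 + k]
    return value

grid = np.arange(27, dtype=float).reshape(3, 3, 3)
print(trilinear(grid, 0.5, 0.5, 0.5))   # average of the 8 surrounding corners -> 6.5
```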
语义分割(semantic segmentation),是计算机视觉(computer vision,CV)领域的一个基础研究方向,它可以对图像中的每个像素给出具体的类别,例如它可以分析一张图片或者一段视频流中的物体,并逐像素标记出来其所属类别。语义分割被广泛应用于自动驾驶、智慧城市和医疗图像处理等诸多领域。Semantic segmentation is a basic research direction in the field of computer vision (CV). It can give a specific category to each pixel in an image. For example, it can analyze objects in a picture or a video stream and mark their categories pixel by pixel. Semantic segmentation is widely used in many fields such as autonomous driving, smart cities and medical image processing.
类别:通过深度学习可以进行图像识别,识别图像中物体的类别,即物体分类,物体类别例如可以是:桌子、椅子、猫、狗、汽车等等。Category: Deep learning can be used to perform image recognition and identify the categories of objects in the image, that is, object classification. Object categories can be, for example: table, chair, cat, dog, car, etc.
Neural network: a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes xs (i.e., input data) and an intercept 1 as inputs; the output of the operation unit can be:

h_{W,b}(x) = f(W^T x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。Where s=1, 2, ...n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into the output signal. The output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function. A neural network is a network formed by connecting multiple single neural units mentioned above, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the characteristics of the local receptive field. The local receptive field can be an area composed of several neural units.
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取特征的方式与位置无关。卷积核可以以随机大小的矩阵的形式化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。Convolutional neural network (CNN) is a deep neural network with a convolutional structure. Convolutional neural network contains a feature extractor consisting of a convolution layer and a subsampling layer, which can be regarded as a filter. Convolutional layer refers to the neuron layer in the convolutional neural network that performs convolution processing on the input signal. In the convolutional layer of the convolutional neural network, a neuron can only be connected to some neurons in the adjacent layers. A convolutional layer usually contains several feature planes, each of which can be composed of some rectangularly arranged neural units. The neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract features is independent of position. Convolution kernels can be formalized as matrices of random sizes, and convolution kernels can obtain reasonable weights through learning during the training process of convolutional neural networks. In addition, the direct benefit of shared weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
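A one-layer illustration of the weight sharing described above (one small kernel, the shared weights, slid over the whole image to produce feature planes), with arbitrary channel counts chosen only for this sketch:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
image = torch.randn(1, 3, 64, 64)        # one RGB image (placeholder data)
features = conv(image)                   # feature planes: (1, 8, 64, 64)
print(features.shape)
```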
损失函数:在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。Loss function: In the process of training a deep neural network, because we hope that the output of the deep neural network is as close as possible to the value we really want to predict, we can compare the predicted value of the current network with the target value we really want, and then update the weight vector of each layer of the neural network according to the difference between the two (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer in the deep neural network). For example, if the predicted value of the network is high, adjust the weight vector to make it predict a lower value, and keep adjusting until the deep neural network can predict the target value we really want or a value very close to the target value we really want. Therefore, it is necessary to pre-define "how to compare the difference between the predicted value and the target value", which is the loss function or objective function, which are important equations used to measure the difference between the predicted value and the target value. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so the training of the deep neural network becomes a process of minimizing this loss as much as possible.
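A tiny worked example of a loss function, using mean squared error as one possible choice: the larger the difference between the predicted values and the target values, the larger the loss that training then tries to shrink.

```python
import torch

target = torch.tensor([1.0, 2.0, 3.0])
loss_fn = torch.nn.MSELoss()
print(loss_fn(torch.tensor([1.1, 2.1, 2.9]), target))   # small difference -> small loss (0.01)
print(loss_fn(torch.tensor([3.0, 0.0, 6.0]), target))   # large difference -> large loss (~5.67)
```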
多层感知器(multi layer perceptron,MLP)是一种前馈人工神经网络模型,其将输入的多个数据集映射到单一的输出的数据集上。它最主要的特点是有多个神经元层,因此也叫深度神经网络(deep neural networks,DNN)。感知机是单个神经元模型,是较大神经网络的前身。神经网络的强大之处在于它们能够学习训练数据中的表示,以及如何将其与想要预测的输出变量联系起来。从数学上讲,它们能够学习任何映射函数,并且已经被证明是一种通用的近似算法。神经网络的预测能力来自网络的分层或多层结构。而多层感知机是指具有至少三层节点,输入层,一些中间层和输出层的神经网络。给定层中的每个节点都连接到相邻层中的每个节点。输入层接收数据,中间层计算数据,输出层输出结果。A multi-layer perceptron (MLP) is a feed-forward artificial neural network model that maps multiple input data sets to a single output data set. Its main feature is that it has multiple layers of neurons, so it is also called deep neural networks (DNN). Perceptrons are single neuron models and are the predecessors of larger neural networks. The power of neural networks lies in their ability to learn representations in the training data and how to relate them to the output variables that you want to predict. Mathematically, they are able to learn any mapping function and have been proven to be a universal approximation algorithm. The predictive power of neural networks comes from the hierarchical or multi-layered structure of the network. A multi-layer perceptron is a neural network with at least three layers of nodes, an input layer, some intermediate layers, and an output layer. Each node in a given layer is connected to every node in the adjacent layer. The input layer receives data, the intermediate layers calculate the data, and the output layer outputs the results.
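A minimal sketch (for illustration only) of the multi-layer perceptron structure described above, with an input layer, one hidden layer and an output layer; the layer sizes and random weight initialization are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, params):
    # Each node of a layer is connected to every node of the adjacent layer,
    # i.e. a sequence of fully connected (dense) layers.
    h = x
    for w, b in params[:-1]:
        h = np.maximum(0.0, h @ w + b)   # hidden layers with ReLU activation
    w, b = params[-1]
    return h @ w + b                     # output layer (linear)

# Illustrative MLP: 4 inputs -> 8 hidden units -> 2 outputs.
params = [
    (rng.standard_normal((4, 8)) * 0.1, np.zeros(8)),
    (rng.standard_normal((8, 2)) * 0.1, np.zeros(2)),
]
x = rng.standard_normal(4)
print(mlp_forward(x, params))
```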
红绿蓝(red green blue,RGB)色彩模式是工业界的一种颜色标准,是通过对红(R)、绿(G)、蓝(B)三个颜色通道的变化以及它们相互之间的叠加来得到各式各样的颜色的,RGB即是代表红、绿、蓝三个通道的颜色。The red, green, blue (RGB) color model is a color standard in the industry. A variety of colors are obtained by changing the three color channels of red (R), green (G), and blue (B) and superimposing them on each other. RGB represents the colors of the three channels: red, green, and blue.
三线性插值,是在三维离散采样数据的张量积网格上进行线性插值的方法。这个张量积网格可能在每一维度上都有任意不重叠的网格点,但并不是三角化的有限元分析网格。这种方法通过网格上数据点在局部的矩形棱柱上线性地近似计算点(x,y,z)的值。三线性插值经常用于数值分析、数据分析以及计算机图形学等领域。Trilinear interpolation is a method of linear interpolation on a tensor product grid of three-dimensional discrete sampled data. This tensor product grid may have arbitrary, non-overlapping grid points in each dimension, but it is not a triangulated finite element analysis grid. This method linearly approximates the value of a point (x, y, z) on a local rectangular prism from the data points on the grid. Trilinear interpolation is often used in numerical analysis, data analysis, and computer graphics.
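A small illustrative sketch of trilinear interpolation as described above: the value at a query point (x, y, z) is a weighted combination of the eight surrounding grid points; the example grid values are arbitrary.

```python
import numpy as np

def trilinear_interp(grid, x, y, z):
    # grid: values sampled on an integer-indexed 3D grid (shape [X, Y, Z]).
    # (x, y, z): query point in grid coordinates; the result is linearly
    # interpolated from the 8 surrounding grid points.
    x0, y0, z0 = int(np.floor(x)), int(np.floor(y)), int(np.floor(z))
    x1, y1, z1 = x0 + 1, y0 + 1, z0 + 1
    dx, dy, dz = x - x0, y - y0, z - z0
    c = 0.0
    for (i, wx) in ((x0, 1 - dx), (x1, dx)):
        for (j, wy) in ((y0, 1 - dy), (y1, dy)):
            for (k, wz) in ((z0, 1 - dz), (z1, dz)):
                c = c + wx * wy * wz * grid[i, j, k]
    return c

# Example: a 2x2x2 grid with arbitrary values, queried at its centre.
grid = np.arange(8, dtype=float).reshape(2, 2, 2)
print(trilinear_interp(grid, 0.5, 0.5, 0.5))  # 3.5, the average of the 8 corners
```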
随着科技的进步,增强现实AR等图像技术已经逐渐走进人们的日常生活。AR等图像技术的普及,极大提升了人们获取信息的效率和体验。AR等图像技术在教育培训、军事、医疗、娱乐和生产制造等行业中被广泛应用。高度真实感的虚实融合是AR等图像技术的核心需求。将虚拟物体的光源和光照模型与真实场景保持一致,可以获得具有真实感的渲染效果。With the advancement of science and technology, augmented reality (AR) and other image technologies have gradually entered people's daily lives. The popularity of AR and other image technologies has greatly improved people's efficiency and experience in obtaining information. AR and other image technologies are widely used in education and training, military, medical, entertainment, and manufacturing industries. Highly realistic virtual-real fusion is the core requirement of AR and other image technologies. Keeping the light source and lighting model of virtual objects consistent with the real scene can obtain a realistic rendering effect.
然而,实际场景的光源往往是未知的,需要根据场景图片等信息来估计场景中的光源,进而获得较为真实的AR体验,准确的光照估计可以提高光照模型与真实场景的相似度。However, the light source of the actual scene is often unknown, and it is necessary to estimate the light source in the scene based on information such as scene pictures to obtain a more realistic AR experience. Accurate lighting estimation can improve the similarity between the lighting model and the real scene.
为此,本申请实施例提供了一种光照估计方法,能够提高光照估计的准确性。该方法适用于图像处理系统,图1示出了该图像处理系统的一种可能的存在形式。To this end, an embodiment of the present application provides an illumination estimation method, which can improve the accuracy of illumination estimation. The method is applicable to an image processing system, and FIG. 1 shows a possible form of the image processing system.
如图1所示该图像处理系统包括:终端10和服务器20。As shown in FIG. 1 , the image processing system includes a terminal 10 and a server 20 .
本申请实施例中的终端10可以为手机、平板电脑、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)等,本申请实施例对此不作任何限制。The terminal 10 in the embodiment of the present application can be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), etc., and the embodiment of the present application does not impose any limitation on this.
在一种可能的实现方式中,终端10可以包括传感器单元11、计算单元12、存储单元13、网络传输单元14和交互单元15。In a possible implementation, the terminal 10 may include a sensor unit 11 , a computing unit 12 , a storage unit 13 , a network transmission unit 14 , and an interaction unit 15 .
传感器单元11,通常包括视觉传感器(如相机等),用于获取场景的二维(2D)图像信息;惯性导航模块(如惯性测量单元(inertial measurement unit,IMU)等),用于获取移动设备不同时刻的相对位姿关系,用于后续获取初始几何模型。The sensor unit 11 usually includes a visual sensor (such as a camera) for acquiring two-dimensional (2D) image information of the scene, and an inertial navigation module (such as an inertial measurement unit (IMU)) for acquiring the relative pose relationship of the mobile device at different times, which is used for subsequently acquiring an initial geometric model.
计算单元12,可以包括中央处理器(central processing unit,CPU),应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等各类处理器以及缓冲和寄存器,主要用于运行移动端操作系统,并处理本申请实施例所设计的各算法模块,如光照估计方法等。The computing unit 12 may include a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU) and other processors as well as buffers and registers, and is mainly used to run the mobile operating system and process the various algorithm modules designed in the embodiments of the present application, such as the illumination estimation method.
存储单元13,主要包括内存和外部存储模块(如硬盘等),主要用于算法数据、用户本地和临时数据的存储和读写等。The storage unit 13 mainly includes a memory and an external storage module (such as a hard disk, etc.), and is mainly used for storing and reading and writing algorithm data, user local and temporary data, etc.
网络传输单元14,主要包括上传和下载模块,图像、视频、三维模型和光照信息的编解码模块等。The network transmission unit 14 mainly includes upload and download modules, encoding and decoding modules for images, videos, three-dimensional models and lighting information, etc.
交互单元15,主要包括显示器、触摸板、振动器、音频设备(如扬声器、麦克风等音频设备)等,主要用于和用户进行交互,获取用户输入,并向用户呈现算法效果等。The interaction unit 15 mainly includes a display, a touch panel, a vibrator, an audio device (such as a speaker, a microphone and other audio devices), etc., and is mainly used to interact with the user, obtain user input, and present the algorithm effect to the user.
在一种可能的实现方式中,服务器20可以包括计算单元21、存储单元22和网络传输单元23。In a possible implementation, the server 20 may include a computing unit 21 , a storage unit 22 , and a network transmission unit 23 .
计算单元21,可以包括中央处理器,应用处理器,调制解调处理器,图形处理器,图像信号处理器,视频编解码器,数字信号处理器,基带处理器,和/或神经网络处理器等各类处理器以及缓冲和寄存器,主要用于运行服务器操作系统,并处理本申请实施例所设计的各算法模块,如光照估计方法等。The computing unit 21 may include various processors such as a central processing unit, an application processor, a modem processor, a graphics processor, an image signal processor, a video codec, a digital signal processor, a baseband processor, and/or a neural network processor, as well as buffers and registers, and is mainly used to run a server operating system and process various algorithm modules designed in the embodiments of the present application, such as an illumination estimation method.
存储单元22,主要包括内存和外部存储模块(如硬盘等),主要用于存储网络模型和参数。The storage unit 22 mainly includes a memory and an external storage module (such as a hard disk, etc.), and is mainly used to store network models and parameters.
网络传输单元23,主要包括上传和下传模块,图像、视频、三维模型和光照信息的编解码等。The network transmission unit 23 mainly includes upload and download modules, encoding and decoding of images, videos, three-dimensional models and lighting information, etc.
图2示出了本申请实施例提供的一种光照估计方法,该方法包括:FIG2 shows a method for estimating illumination provided by an embodiment of the present application, the method comprising:
S201、获取场景的三维模型。S201: Obtain a three-dimensional model of a scene.
其中,场景可以包括一种或多种对象。例如,场景可以包括人物、植物、动物、建筑物等一种或多种对象。The scene may include one or more objects, such as people, plants, animals, buildings, etc.
在一种可能的实现方式中,可以根据场景的图像得到场景的三维模型。例如,终端10上可以安装有三维重建类应用程序,或者打开与三维重建或者基于三维重建结果进行下游任务相关的网页,上述应用程序和网页可以提供一个界面,终端10可以接收用户在三维重建或者基于三维重建结果进行下游任务界面上输入的相关参数,并将上述参数发送至服务器20,服务器20可以基于接收到的参数,得到处理结果,并将处理结果返回至终端10。应理解,在一些可选的实现中,终端10也可以由自身完成基于接收到的参数,得到数据处理结果,而不需要服务器配合实现,本申请实施例并不限定。In a possible implementation, a three-dimensional model of a scene can be obtained based on the image of the scene. For example, a three-dimensional reconstruction application can be installed on the terminal 10, or a webpage related to three-dimensional reconstruction or downstream tasks based on three-dimensional reconstruction results can be opened. The above application and webpage can provide an interface, and the terminal 10 can receive the relevant parameters entered by the user on the three-dimensional reconstruction or downstream task interface based on the three-dimensional reconstruction results, and send the above parameters to the server 20. The server 20 can obtain the processing results based on the received parameters and return the processing results to the terminal 10. It should be understood that in some optional implementations, the terminal 10 can also complete the data processing results based on the received parameters by itself without the need for the server to cooperate in the implementation, and the embodiments of the present application are not limited.
示例性地,可以通过采集设备(终端、单反相机、监控相机、立体视觉相机或其他采集设备)对场景进行拍摄以得到场景的多视角图像(但不限于各种格式的RGB图像)。具体可以对场景四周进行环视拍摄,获得多视角图像。为保证后续的三维重建质量,拍摄的图像需要各视角之间具有一定的共视区域,且尽量没有死角。拍摄时保证相机平稳,避免图像产生模糊等现象。将场景的多视角图像作为输入,通过三维重建算法(如运动恢复结构(structure from motion,SFM)算法),重建得到场景的三维模型。Exemplarily, the scene can be photographed by a collection device (a terminal, an SLR camera, a surveillance camera, a stereo vision camera, or another collection device) to obtain multi-view images of the scene (but not limited to RGB images in various formats). Specifically, the scene can be photographed all around to obtain the multi-view images. To ensure the quality of subsequent 3D reconstruction, the captured images need to have a certain common viewing area between the viewpoints, and blind spots should be avoided as much as possible. The camera should be kept stable during shooting to avoid image blur. The multi-view images of the scene are used as input, and a 3D model of the scene is reconstructed through a 3D reconstruction algorithm (such as the structure from motion (SFM) algorithm).
例如,可以通过采集设备采集房间的多视角图像。然后通过三维重建算法以采集的房间多视角图像作为参考图像进行三维重建以得到房间的三维模型。For example, a multi-view image of a room may be collected by a collection device, and then a three-dimensional reconstruction algorithm is used to perform three-dimensional reconstruction using the collected multi-view images of the room as reference images to obtain a three-dimensional model of the room.
S202、根据上述三维模型确定上述场景的目标信息。S202: Determine target information of the scene according to the three-dimensional model.
其中,上述目标信息包括光源位置信息,上述光源位置信息用于指示上述场景中光源的位置。The target information includes light source position information, and the light source position information is used to indicate the position of the light source in the scene.
需要说明的是,光源位置信息指示的光源包括已激活光源和未激活光源。其中,已激活光源包括但不限于正在发光的光源(如打开的灯、太阳、打开的显示器),未激活光源包括但不限于未发光的光源(如关闭的灯、关闭的显示器)和具备光源几何形状的物体(如灯形状的装饰品、灯形状的雕塑等)。It should be noted that the light sources indicated by the light source position information include activated light sources and inactivated light sources. Activated light sources include, but are not limited to, light sources that are emitting light (such as turned-on lights, the sun, and turned-on displays); inactivated light sources include, but are not limited to, light sources that are not emitting light (such as turned-off lights and turned-off displays) and objects with the geometric shape of a light source (such as lamp-shaped ornaments, lamp-shaped sculptures, etc.).
在一种可能的实现方式中,可以将上述三维模型输入第一网络模型以得到上述光源位置信息。In a possible implementation, the three-dimensional model may be input into the first network model to obtain the light source position information.
其中,上述第一网络模型用于根据输入的三维模型输出对应的光源位置信息。第一网络模型可以通过多个场景的三维模型和每个三维模型对应的编码特征训练得到。编码特征包括编码点特征和目标文本特征。The first network model is used to output the corresponding light source position information according to the input three-dimensional model. The first network model can be obtained by training with three-dimensional models of multiple scenes and the coding features corresponding to each three-dimensional model. The coding features include encoded point features and target text features.
在一种可能的实现方式中,上述目标信息还可以包括位置编码信息,上述位置编码信息用于指示上述场景在高维空间中的表示。In a possible implementation, the target information may further include position coding information, where the position coding information is used to indicate a representation of the scene in a high-dimensional space.
可以理解的是,由于场景中的点在三维空间中仅需要x、y、z三个分量表示,其无法表达高频分量。而位置编码信息指示了场景在高维空间中的表示,具备表达高频分量能力。因此相比于仅通过场景的三维位置进行光照估计,本申请实施例提供的光照估计方法,通过在进行光照估计时引入了描述场景中在高维空间中的表示的位置编码信息,补充了更多的高频分量,使得光照估计过程不会过于平滑,更易于刻画场景在高频变化情况下的光照情况,由此提高了光照估计的准确性。It is understandable that since the points in the scene only need three components x, y, and z to represent in three-dimensional space, it is unable to express high-frequency components. The position coding information indicates the representation of the scene in high-dimensional space and has the ability to express high-frequency components. Therefore, compared to lighting estimation only through the three-dimensional position of the scene, the lighting estimation method provided in the embodiment of the present application introduces position coding information describing the representation of the scene in high-dimensional space when performing lighting estimation, supplements more high-frequency components, so that the lighting estimation process will not be too smooth, and it is easier to characterize the lighting conditions of the scene under high-frequency changes, thereby improving the accuracy of lighting estimation.
在一种可能的实现方式中,可以根据场景的位置和位置编码系数确定场景的位置编码。In a possible implementation manner, the position coding of the scene may be determined according to the position of the scene and a position coding coefficient.
示例性地,可以根据场景中的点的三维表示和位置编码系数根据位置编码公式确定场景的位置编码。Exemplarily, the position coding of the scene may be determined according to the three-dimensional representation of the points in the scene and the position coding coefficients according to the position coding formula.
例如,位置编码公式可以满足:For example, the position encoding formula can satisfy:

pe(x,y,z) = [sin(w0·x), cos(w0·x), …, sin(wn·x), cos(wn·x),
             sin(w0·y), cos(w0·y), …, sin(wn·y), cos(wn·y),
             sin(w0·z), cos(w0·z), …, sin(wn·z), cos(wn·z)]
其中,pe(x,y,z)分量表示用于表征场景的点在高维空间中的位置,x,y,z分量表示用于表征场景的点在三维空间中的位置。n为正数,n的取值可以与划分场景的体素网格分辨率相关(即可以与场景的体素数量相关)。在本申请实施例中,n可以为9。The pe(x, y, z) components represent the position, in the high-dimensional space, of a point used to characterize the scene, and the x, y, z components represent the position of that point in three-dimensional space. n is a positive number, and the value of n can be related to the resolution of the voxel grid into which the scene is divided (that is, to the number of voxels of the scene). In the embodiment of the present application, n can be 9.
wn为位置编码系数,在本申请实施例中wn可以为2的n次方,相应的,w0可以为2的0次方,w5可以为2的5次方。w n is a position coding coefficient. In the embodiment of the present application, w n can be 2 to the power of n. Correspondingly, w 0 can be 2 to the power of 0, and w 5 can be 2 to the power of 5.
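An illustrative sketch of the position encoding formula above, using n = 9 and wk = 2^k as stated in this embodiment; the exact function name and vector layout are assumptions made only for the example.

```python
import numpy as np

def positional_encoding(p, n=9):
    # p: (x, y, z) coordinates of a point of the scene.
    # Returns pe(x, y, z) as defined above, with w_k = 2**k for k = 0..n.
    x, y, z = p
    feats = []
    for coord in (x, y, z):
        for k in range(n + 1):
            w = 2.0 ** k
            feats.append(np.sin(w * coord))
            feats.append(np.cos(w * coord))
    return np.array(feats)

print(positional_encoding((0.1, 0.2, 0.3)).shape)  # (60,) for n = 9
```

For n = 9 this maps the three-dimensional point to a 60-dimensional representation, which is the high-dimensional representation zx referred to below.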
在一种可能的实现方式中,上述目标信息还包括光照特征向量,上述光照特征向量用于指示光照在上述场景中的分布情况。In a possible implementation, the target information further includes a lighting feature vector, and the lighting feature vector is used to indicate the distribution of lighting in the scene.
可以理解的是,本申请实施例提供的光照估计方法,通过在进行光照估计时引入了描述光照在上述场景中的分布情况的光照特征向量,这对解空间给出了更多约束,由此提高了光照估计的准确性。It can be understood that the illumination estimation method provided in the embodiment of the present application introduces an illumination feature vector that describes the distribution of illumination in the above-mentioned scene when performing illumination estimation, which gives more constraints to the solution space, thereby improving the accuracy of illumination estimation.
在一种可能的实现方式中,可以将上述三维模型输入第三网络模型以得到上述光照特征向量。In a possible implementation, the three-dimensional model may be input into a third network model to obtain the illumination feature vector.
其中,第三网络模型可以通过多个场景的三维模型和每个三维模型对应的光照特征向量训练得到的。The third network model can be obtained by training three-dimensional models of multiple scenes and the illumination feature vector corresponding to each three-dimensional model.
在一种可能的实现方式中,上述目标信息还包括上述场景的光照颜色。光照颜色用于指示场景中光源的颜色,即场景的材质信息。 In a possible implementation, the target information further includes the lighting color of the scene. The lighting color is used to indicate the color of the light source in the scene, that is, the material information of the scene.
在一种可能的实现方式中,可以将上述三维模型输入第三网络模型以得到上述光照颜色。In a possible implementation, the three-dimensional model may be input into a third network model to obtain the lighting color.
其中,第三网络模型可以通过多个场景的三维模型和每个三维模型对应的光照颜色训练得到的。Among them, the third network model can be obtained by training three-dimensional models of multiple scenes and the lighting color corresponding to each three-dimensional model.
在一种可能的实现方式中,可以将上述三维模型输入第三网络模型以得到上述光照特征向量和光照颜色。In a possible implementation, the three-dimensional model may be input into a third network model to obtain the illumination feature vector and illumination color.
其中,第三网络模型可以通过多个场景的三维模型和每个三维模型对应的光照特征向量和光照颜色训练得到的。Among them, the third network model can be obtained by training three-dimensional models of multiple scenes and the illumination feature vector and illumination color corresponding to each three-dimensional model.
在一种可能的实现方式中,可以将上述三维模型分割为多个体素;根据上述光照特征向量确定上述多个体素的光照特征向量。In a possible implementation manner, the three-dimensional model may be segmented into a plurality of voxels; and the illumination feature vectors of the plurality of voxels may be determined according to the illumination feature vectors.
示例性地,可以将场景的三维模型输入第三网络模型以得到三维模型的光照特征向量。将场景的三维模型分割为多个体素(例如,可以按照10厘米的分辨率将场景的三维模型划分为多个体素)。根据三维模型的光照特征向量确定每个体素中存储的光照特征向量。通过三线性插值确定每个体素的光照特征向量。对于每一体素的光照特征向量,可以通过获取其周围的8个体素的光照特征向量进行三线性插值获得。Exemplarily, the three-dimensional model of the scene can be input into the third network model to obtain the illumination feature vector of the three-dimensional model. The three-dimensional model of the scene is divided into a plurality of voxels (for example, the three-dimensional model of the scene can be divided into a plurality of voxels at a resolution of 10 cm). The illumination feature vector stored in each voxel is determined according to the illumination feature vector of the three-dimensional model. The illumination feature vector of each voxel is determined by trilinear interpolation. For each illumination feature vector of a voxel, it can be obtained by obtaining the illumination feature vectors of the eight voxels around it and performing trilinear interpolation.
在一种可能的实现方式中,可以将上述三维模型分割为多个体素;根据上述光照颜色确定上述多个体素的光照颜色。In a possible implementation manner, the three-dimensional model may be segmented into a plurality of voxels; and the illumination colors of the plurality of voxels may be determined according to the illumination colors.
示例性地,可以将场景的三维模型输入第三网络模型以得到三维模型的光照颜色。将场景的三维模型分割为多个体素(如可以按照10厘米分辨率将场景的三维模型划分为多个体素)。根据三维模型的光照颜色(diffuse)确定每个体素中存储的光照颜色。通过三线性插值确定每个体素的光照颜色。对于每一体素的光照颜色,可以通过获取其周围的8个体素的光照颜色进行三线性插值获得。Exemplarily, the three-dimensional model of the scene can be input into the third network model to obtain the illumination color of the three-dimensional model. The three-dimensional model of the scene is divided into a plurality of voxels (e.g., the three-dimensional model of the scene can be divided into a plurality of voxels at a resolution of 10 cm). The illumination color stored in each voxel is determined according to the illumination color (diffuse) of the three-dimensional model. The illumination color of each voxel is determined by trilinear interpolation. For the illumination color of each voxel, the illumination color can be obtained by obtaining the illumination colors of the eight voxels around it and performing trilinear interpolation.
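The following is a simplified sketch (for illustration only) of how the illumination feature vectors and illumination colors stored in the voxel grid might be queried for an arbitrary point by trilinear interpolation over the eight surrounding voxels; the 10 cm resolution follows the example above, while the grid contents, origin handling and helper name are assumptions.

```python
import numpy as np

VOXEL_SIZE = 0.1  # 10 cm voxel resolution, as in the example above

def query_voxel_grid(grid, point, origin):
    # grid: per-voxel data of shape [X, Y, Z, C], e.g. C-dimensional illumination
    # feature vectors z_l, or C = 3 for the illumination color c_e.
    # point/origin: 3D positions in metres; the value at `point` is obtained by
    # trilinear interpolation over the 8 surrounding voxels.
    g = (np.asarray(point, dtype=float) - np.asarray(origin, dtype=float)) / VOXEL_SIZE
    i0 = np.floor(g).astype(int)
    d = g - i0
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((dx * d[0] + (1 - dx) * (1 - d[0])) *
                     (dy * d[1] + (1 - dy) * (1 - d[1])) *
                     (dz * d[2] + (1 - dz) * (1 - d[2])))
                out += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return out

# Example: a tiny 4x4x4 grid of 8-dimensional illumination feature vectors.
features = np.random.rand(4, 4, 4, 8)
print(query_voxel_grid(features, point=(0.15, 0.22, 0.08), origin=(0, 0, 0)).shape)
```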
S203、根据场景的目标信息确定场景的光照强度。S203: Determine the illumination intensity of the scene according to the target information of the scene.
其中,光照强度用于指示场景中光照的强弱,光照强度的范围可以为0~1之间。场景的光照强度越高则场景中光照越强。The light intensity is used to indicate the intensity of the light in the scene, and the range of the light intensity may be between 0 and 1. The higher the light intensity of the scene, the stronger the light in the scene.
在一种可能的实现方式中,可以将上述目标信息输入第二网络模型以得到上述光照强度,上述第二网络模型用于根据输入的目标信息输出对应的光照强度。第二网络模型可以由场景的渲染图像和对应场景的参考图像训练得到。In a possible implementation, the target information may be input into a second network model to obtain the light intensity, and the second network model is used to output the corresponding light intensity according to the input target information. The second network model may be trained by a rendered image of a scene and a reference image of the corresponding scene.
在一种可能的实现方式中,场景的光照强度可以满足:In one possible implementation, the illumination intensity of the scene can satisfy:

Ie = Θl(w0, zs, zl, zx)
其中,Ie为场景或体素的光照强度,Θl为第二网络模型(也可称为神经场或光照神经场),w0为光照发射方向,w0可以由渲染角度确定,zs为场景的场景光源位置信息,zl为场景或场景体素的光照特征向量,zx为场景或场景体素的位置编码。Among them, Ie is the illumination intensity of the scene or voxel, Θl is the second network model (also called the neural field or illumination neural field), w0 is the illumination emission direction, w0 can be determined by the rendering angle, zs is the scene light source position information of the scene, zl is the illumination feature vector of the scene or scene voxel, and zx is the position code of the scene or scene voxel.
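A minimal sketch of how the second network model Θl could be realized as a small MLP mapping the concatenated inputs (w0, zs, zl, zx) to an illumination intensity Ie in [0, 1]; the use of PyTorch, the layer sizes and the feature dimensions (zx assumed to be the 60-dimensional encoding from the earlier sketch) are illustrative assumptions rather than the claimed implementation.

```python
import torch
import torch.nn as nn

class IlluminationField(nn.Module):
    # Θ_l: maps (w_0, z_s, z_l, z_x) to the illumination intensity I_e.
    def __init__(self, dim_w0=3, dim_zs=16, dim_zl=8, dim_zx=60, hidden=128):
        super().__init__()
        in_dim = dim_w0 + dim_zs + dim_zl + dim_zx
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # intensity constrained to [0, 1]
        )

    def forward(self, w0, zs, zl, zx):
        return self.net(torch.cat([w0, zs, zl, zx], dim=-1)).squeeze(-1)

# Example query with assumed feature dimensions.
model = IlluminationField()
I_e = model(torch.rand(1, 3), torch.rand(1, 16), torch.rand(1, 8), torch.rand(1, 60))
print(I_e)  # illumination intensity in [0, 1]
```

The final Sigmoid keeps the predicted intensity within the 0~1 range mentioned above for the illumination intensity.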
在一种可能的实现方式中,还可以根据上述光照强度和上述光照颜色确定上述场景的光照表达,上述光照表达用于指示上述场景的光照颜色和光照强度。In a possible implementation manner, the lighting expression of the scene may also be determined according to the lighting intensity and the lighting color, where the lighting expression is used to indicate the lighting color and lighting intensity of the scene.
在一种可能的实现方式中,光照表达可以满足:In one possible implementation, the lighting expression can satisfy:

Le = Ie * ce
其中,Le为场景或场景体素的光照表达,Ie为场景或场景体素的光照强度,ce为场景或场景体素的光照颜色。Among them, Le is the lighting expression of the scene or scene voxel, Ie is the lighting intensity of the scene or scene voxel, and ce is the lighting color of the scene or scene voxel.
示例性地,如图3所示,场景的三维模型输入第三网络模型可以得到场景的光照信息fl,场景的光照信息包括场景的光照特征向量zl和光照颜色ce。场景的三维模型输入第一网络模型可以得到场景的光源位置信息zs。场景的三维模型通过位置编码可以得到场景的位置编码zx。通过将得到的场景的光照特征向量zl、光源位置信息zs和位置编码zx,以及由渲染角度确定光照发射方向w0输入第二网络模型(光照神经场)可以得到场景的光照强度Ie(图中未示出),得到的场景的光照强度Ie与场景的光照颜色ce可以确定场景的光照表达Le。Exemplarily, as shown in FIG. 3, inputting the three-dimensional model of the scene into the third network model yields the illumination information fl of the scene, and the illumination information of the scene includes the illumination feature vector zl and the illumination color ce of the scene. Inputting the three-dimensional model of the scene into the first network model yields the light source position information zs of the scene. Position encoding of the three-dimensional model of the scene yields the position code zx of the scene. By inputting the obtained illumination feature vector zl, light source position information zs and position code zx of the scene, together with the illumination emission direction w0 determined by the rendering angle, into the second network model (illumination neural field), the illumination intensity Ie of the scene (not shown in the figure) can be obtained; the obtained illumination intensity Ie and the illumination color ce of the scene then determine the lighting expression Le of the scene.
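Putting the pieces of FIG. 3 together, a highly simplified, hypothetical driver could look like the following; it reuses the positional_encoding and query_voxel_grid sketches given earlier, theta_l stands in for whatever trained second network model is used, zs stands in for the output of the first network model, and only the data flow (zl, ce, zs, zx, w0 → Ie → Le) mirrors the description above.

```python
def estimate_lighting(point, w0, zs, feature_grid, color_grid, origin, theta_l):
    # Composition of the earlier sketches; all inputs except `point` and `w0`
    # are assumed to be precomputed (trained models, voxel grids, z_s).
    zx = positional_encoding(point)                      # position code z_x
    zl = query_voxel_grid(feature_grid, point, origin)   # illumination feature vector z_l
    ce = query_voxel_grid(color_grid, point, origin)     # illumination color c_e
    Ie = theta_l(w0, zs, zl, zx)                         # I_e = Θ_l(w_0, z_s, z_l, z_x)
    return Ie * ce                                       # lighting expression L_e = I_e * c_e
```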
相比于仅通过场景的图像进行光照估计,本申请实施例提供的光照估计方法在进行光照估计时额外引入了描述场景中光源位置的光源位置信息。由于灯光在不同场景下往往具有类似的几何特征,通过在光照估计时引入光源位置信息,这对解空间给出了更多约束,提高了光照估计的准确性和鲁棒性,并且缩小了渲染得到的场景CG数据和场景实际数据之间的域差距(domain gap),由此提高了光照估计的准确性。Compared with performing illumination estimation only from an image of the scene, the illumination estimation method provided in the embodiment of the present application additionally introduces, during illumination estimation, light source position information describing the position of the light source in the scene. Since lights often have similar geometric features in different scenes, introducing the light source position information during illumination estimation imposes more constraints on the solution space, improves the accuracy and robustness of illumination estimation, and narrows the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
可选地,本申请实施例提供的方法还可以包括:Optionally, the method provided in the embodiment of the present application may further include:
S204、将上述三维模型输入第一编码网络得到第一编码块。S204: Input the three-dimensional model into a first coding network to obtain a first coding block.
其中,上述第一编码块包括编码点特征。Wherein, the above-mentioned first coding block includes coding point features.
示例性地,可以将上述三维模型输入Unet网络进行编码以得到包含编码点特征(encoded point features)的第一编码块。Exemplarily, the above three-dimensional model can be input into the Unet network for encoding to obtain a first encoding block containing encoded point features.
S205、将上述三维模型输入第二编码网络得到第二编码块。S205: Input the above three-dimensional model into a second coding network to obtain a second coding block.
其中,上述第二编码块包括目标文本特征。Wherein, the second encoding block includes target text features.
示例性地,可以将上述三维模型输入目标文本编码器(如CLIP编码器)进行编码以得到包括目标文本特征(class features)的第二编码块。其中,上述目标文本可以包括光源和非光源。Exemplarily, the three-dimensional model may be input into a target text encoder (such as a CLIP encoder) for encoding to obtain a second encoding block including target text features (class features). The target text may include a light source and a non-light source.
S206、根据上述编码点特征和上述目标文本特征训练第一网络模型。S206: Train a first network model according to the encoded point features and the target text features.
其中,上述第一网络模型用于根据输入的三维模型输出对应的光源位置信息。第一网络模型也可称为光源位置信息提取网络。The first network model is used to output the corresponding light source position information according to the input three-dimensional model. The first network model can also be called a light source position information extraction network.
示例性地,如图4所示,可以将上述三维模型输入Unet进行编码以得到包含编码点特征(encoded point features)的第一编码块。将上述三维模型输入目标文本编码器进行编码以得到包括目标文本特征(class features)的第二编码块。之后编码点特征和目标文本特征将放在一个联合空间中,通过梯度下降优化两种特征之间的欧氏距离(L2距离),以训练第一网络模型。Exemplarily, as shown in FIG4 , the three-dimensional model can be input into Unet for encoding to obtain a first encoding block including encoded point features. The three-dimensional model is input into the target text encoder for encoding to obtain a second encoding block including target text features. The encoded point features and the target text features are then placed in a joint space, and the Euclidean distance (L2 distance) between the two features is optimized by gradient descent to train the first network model.
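The joint-space alignment described above (encoded point features versus target text class features, optimized with an L2 distance by gradient descent) might be sketched roughly as follows; the two encoders are simplified stand-ins rather than the actual UNet-style point encoder and text encoder, and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Stand-ins for the two encoders; in the embodiment these would be a UNet-style
# point encoder and a frozen text encoder producing the "light source" /
# "non light source" class features.
point_encoder = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 32))
text_features = torch.randn(2, 32)  # frozen class features: [light source, non light source]

optimizer = torch.optim.Adam(point_encoder.parameters(), lr=1e-3)

def training_step(points, labels):
    # points: [N, 6] per-point attributes (e.g. position + color); labels: [N] in {0, 1}.
    encoded = point_encoder(points)              # encoded point features
    target = text_features[labels]               # matching target text features
    loss = ((encoded - target) ** 2).sum(dim=-1).mean()  # Euclidean (L2) distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(training_step(torch.randn(16, 6), torch.randint(0, 2, (16,))))
```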
可选地,上述第一网络模型可以采用残差网络(deep residual network,ResNet)与UNet相结合的网络结构(如Res16UNet34D)。Optionally, the first network model may adopt a network structure combining a deep residual network (ResNet) with UNet (such as Res16UNet34D).
可选地,上述第一网络模型的预训练数据集可以为ScanNet或S3DIS数据集。Optionally, the pre-training data set of the first network model may be a ScanNet or S3DIS data set.
在一种可能的实现方式中,可以对上述第一网络模型的预训练数据集进行微调(finetune),将预训练数据集中的分类类别减少为光源和非光源两类。In a possible implementation, the pre-training data set of the first network model may be fine-tuned to reduce the classification categories in the pre-training data set to two categories: light source and non-light source.
可以看出,本申请实施例提供的方法可以通过场景三维模型的编码点特征和目标文本特征训练得到用于根据输入的三维模型输出对应的光源位置信息的第一网络模型。将场景的三维模型输入第一网络模型得到描述场景中光源位置的光源位置信息。由于灯光在不同场景下往往具有类似的几何特征,通过在光照估计时引入光源位置信息,这对解空间给出了更多约束,提高了光照估计的准确性和鲁棒性,并且缩小了渲染得到的场景CG数据和场景实际数据之间的domain gap,由此提高了光照估计的准确性。It can be seen that the method provided in the embodiment of the present application can train, from the encoded point features and target text features of the three-dimensional model of a scene, a first network model that outputs the corresponding light source position information according to the input three-dimensional model. Inputting the three-dimensional model of the scene into the first network model yields light source position information describing the position of the light source in the scene. Since lights often have similar geometric features in different scenes, introducing the light source position information during illumination estimation imposes more constraints on the solution space, improves the accuracy and robustness of illumination estimation, and narrows the domain gap between the rendered scene CG data and the actual scene data, thereby improving the accuracy of illumination estimation.
S207、根据上述光照表达确定上述场景的渲染图像。S207: Determine a rendered image of the scene according to the lighting expression.
示例性地,可以通过渲染方程根据光照表达得到场景的渲染图像。Exemplarily, a rendered image of a scene may be obtained according to a lighting expression through a rendering equation.
例如,以渲染图像中的任一像素为例,它的颜色值是由入射到该像素上的光线决定的;而通过光线追踪,我们可以找到入射光线对应的光源(有可能会经过多次反射)。然后,可以通过上述光照表达从模型中获得该光源的颜色和亮度;再结合材质信息,通过渲染方程就可以计算这根光线投射到像素点后的颜色值。实际上,入射到某一点的光线会有无数条,因此我们通过采样的方式,即采样多条光线,每条光线重复上述追踪过程,把得到的颜色值进行叠加,就可以得到该像素最终的颜色值。For example, taking any pixel in a rendered image as an example, its color value is determined by the light incident on the pixel; and through ray tracing, we can find the light source corresponding to the incident light (which may have been reflected multiple times). Then, the color and brightness of the light source can be obtained from the model through the above lighting expression; combined with the material information, the color value of this ray after being projected onto the pixel can be calculated through the rendering equation. In fact, there are countless rays incident on a certain point, so we use sampling, that is, sampling multiple rays, repeating the above tracing process for each ray, and superimposing the obtained color values to obtain the final color value of the pixel.
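A toy illustration (only) of the sampling idea described above: the color of a pixel is estimated by averaging the contributions of several sampled rays, each of which would return a radiance value derived from the lighting expression; the ray tracer itself is abstracted into a placeholder function, so the numbers here are meaningless dummies.

```python
import numpy as np

rng = np.random.default_rng(0)

def trace_ray(pixel, sample_index):
    # Placeholder for a real ray tracer: it would follow the ray (possibly through
    # several reflections) to a light source and return the radiance obtained from
    # the lighting expression L_e = I_e * c_e, modulated by the material information.
    return rng.random(3)  # dummy RGB radiance for illustration

def shade_pixel(pixel, num_samples=64):
    # Monte-Carlo-style accumulation: sample several rays per pixel and average.
    color = np.zeros(3)
    for s in range(num_samples):
        color += trace_ray(pixel, s)
    return color / num_samples

print(shade_pixel(pixel=(120, 45)))
```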
S208、根据上述场景的渲染图像和上述场景的参考图像训练第二网络模型。S208: Train a second network model according to the rendered image of the scene and the reference image of the scene.
其中,上述第二网络模型用于根据场景的光源位置信息、位置编码和光照特征向量输出对应的光照强度。Among them, the above-mentioned second network model is used to output the corresponding light intensity according to the light source position information, position coding and lighting feature vector of the scene.
示例性地,可以根据上述场景的渲染图像和上述场景的参考图像之间的差值作为损失函数,通过梯度下降以训练第二网络模型。Exemplarily, the second network model can be trained by gradient descent based on the difference between the rendered image of the scene and the reference image of the scene as a loss function.
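The training objective described above (the difference between the rendered image and the reference image, minimized by gradient descent) could be sketched as follows; `render_image` is a placeholder for a differentiable renderer driven by the second network model, and the smoke test at the end uses arbitrary stand-in shapes.

```python
import torch
import torch.nn.functional as F

def train_illumination_field(model, render_image, reference_images, scene_batches, steps=1000):
    # model: the second network model (illumination neural field).
    # render_image(scene, model): differentiable rendering using the predicted intensities.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for step in range(steps):
        scene = scene_batches[step % len(scene_batches)]
        rendered = render_image(scene, model)
        reference = reference_images[step % len(reference_images)]
        loss = F.mse_loss(rendered, reference)  # image difference used as the loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

# Tiny smoke test with dummy stand-ins (all shapes are arbitrary).
dummy_model = torch.nn.Linear(4, 3)
dummy_render = lambda scene, m: m(scene).view(1, 3)
trained = train_illumination_field(dummy_model, dummy_render,
                                   reference_images=[torch.rand(1, 3)],
                                   scene_batches=[torch.rand(1, 4)], steps=10)
```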
可以理解的是,通过根据上述场景的渲染图像和上述场景的参考图像训练第二网络模型,可以使得第二网络模型输出的光照强度更加接近真实场景的光照强度,由此进一步提高了光照估计的准确性。It can be understood that by training the second network model based on the rendered image of the above scene and the reference image of the above scene, the light intensity output by the second network model can be closer to the light intensity of the real scene, thereby further improving the accuracy of light estimation.
下面将结合图5介绍用于执行上述光照估计方法的光照估计装置。The following will introduce an illumination estimation device for executing the above illumination estimation method in conjunction with FIG. 5 .
可以理解的是,光照估计装置为了实现上述功能,其包含了执行各个功能相应的硬件和/或软件模块。结合本文中所公开的实施例描述的各示例的算法步骤,本申请实施例能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。本领域技术人员可以结合实施例对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。It is understandable that, in order to realize the above functions, the illumination estimation device includes hardware and/or software modules that perform the corresponding functions. In combination with the algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed in hardware or in a computer software-driven hardware manner depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application in combination with the embodiments, but such implementation should not be considered to be beyond the scope of the embodiments of this application.
本申请实施例可以根据上述方法示例对光照估计装置进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块可以采用硬件的形式实现。需要说明的是,本实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。The embodiment of the present application can divide the functional modules of the illumination estimation device according to the above method example. For example, each functional module can be divided according to each function, or two or more functions can be integrated into one processing module. The above integrated module can be implemented in the form of hardware. It should be noted that the division of modules in this embodiment is schematic and is only a logical function division. There may be other division methods in actual implementation.
在采用对应各个功能划分各个功能模块的情况下,图5示出了上述实施例中涉及的光照估计装置的一种可能的组成示意图,如图5所示,该光照估计装置500可以包括:收发单元501和处理单元502。In the case of dividing each functional module according to each function, FIG5 shows a possible composition diagram of the illumination estimation device involved in the above embodiment. As shown in FIG5 , the illumination estimation device 500 may include: a transceiver unit 501 and a processing unit 502 .
上述收发单元501,用于获取场景的三维模型。The above-mentioned transceiver unit 501 is used to obtain a three-dimensional model of the scene.
上述处理单元502,用于根据上述三维模型确定上述场景的目标信息,上述目标信息包括光源位置信息,上述光源位置信息用于指示上述场景中光源的位置。The processing unit 502 is used to determine target information of the scene according to the three-dimensional model. The target information includes light source position information. The light source position information is used to indicate the position of the light source in the scene.
上述处理单元502,还用于根据上述目标信息确定上述场景的光照强度。The processing unit 502 is further configured to determine the illumination intensity of the scene according to the target information.
在一种可能的实现方式中,上述处理单元502具体用于:将上述三维模型输入第一网络模型以得到上述光源位置信息,上述第一网络模型用于根据输入的三维模型输出对应的光源位置信息。In a possible implementation, the processing unit 502 is specifically used to: input the three-dimensional model into a first network model to obtain the light source position information, where the first network model is used to output the corresponding light source position information according to the input three-dimensional model.
在一种可能的实现方式中,上述处理单元502还用于:将上述三维模型输入第一编码网络得到第一编码块,上述第一编码块包括编码点特征;将上述三维模型输入第二编码网络得到第二编码块,上述第二编码块包括目标文本特征;根据上述编码点特征和上述目标文本特征训练第一网络模型,上述第一网络模型用于根据输入的三维模型输出对应的光源位置信息。In a possible implementation, the processing unit 502 is further used to: input the three-dimensional model into a first coding network to obtain a first coding block, where the first coding block includes encoded point features; input the three-dimensional model into a second coding network to obtain a second coding block, where the second coding block includes target text features; and train a first network model according to the encoded point features and the target text features, where the first network model is used to output the corresponding light source position information according to the input three-dimensional model.
在一种可能的实现方式中,上述目标信息还包括位置编码信息,上述位置编码信息用于指示上述场景在高维空间中的表示。In a possible implementation, the target information further includes position coding information, and the position coding information is used to indicate a representation of the scene in a high-dimensional space.
在一种可能的实现方式中,上述目标信息还包括光照特征向量,上述光照特征向量用于指示光照在上述场景中的分布情况。In a possible implementation, the target information further includes a lighting feature vector, and the lighting feature vector is used to indicate the distribution of lighting in the scene.
在一种可能的实现方式中,上述目标信息还包括上述场景的光照颜色,上述处理单元还用于:根据上述光照强度和上述光照颜色确定上述场景的光照表达,上述光照表达用于指示上述场景的光照颜色和光照强度。In a possible implementation, the target information also includes the lighting color of the scene, and the processing unit is further used to determine the lighting expression of the scene according to the lighting intensity and the lighting color, wherein the lighting expression is used to indicate the lighting color and lighting intensity of the scene.
在一种可能的实现方式中,上述处理单元502具体用于:将上述目标信息输入第二网络模型以得到上述光照强度,上述第二网络模型用于根据输入的目标信息输出对应的光照强度。In a possible implementation, the processing unit 502 is specifically used to: input the target information into a second network model to obtain the light intensity, and the second network model is used to output the corresponding light intensity according to the input target information.
在一种可能的实现方式中,上述处理单元502还用于:根据上述场景的光照表达确定上述场景的渲染图像,上述光照表达用于指示上述场景的光照颜色和光照强度;根据上述场景的渲染图像和上述场景的参考图像训练第二网络模型,上述第二网络模型用于根据输入的目标信息输出对应的光照强度。In a possible implementation, the processing unit 502 is further used to: determine a rendered image of the scene based on a lighting expression of the scene, wherein the lighting expression is used to indicate the lighting color and lighting intensity of the scene; train a second network model based on the rendered image of the scene and a reference image of the scene, wherein the second network model is used to output corresponding lighting intensity based on input target information.
在一种可能的实现方式中,上述处理单元502还用于:将上述三维模型分割为多个体素;根据上述场景的光照特征向量确定上述多个体素的光照特征向量,上述场景的光照特征向量用于指示光照在上述场景中的分布情况。In a possible implementation, the processing unit 502 is further used to: divide the three-dimensional model into multiple voxels; determine the illumination feature vectors of the multiple voxels based on the illumination feature vector of the scene, wherein the illumination feature vector of the scene is used to indicate the distribution of illumination in the scene.
在一种可能的实现方式中,上述处理单元502还用于:将上述三维模型分割为多个体素;根据上述场景的光照颜色确定上述多个体素的光照颜色。In a possible implementation, the processing unit 502 is further configured to: segment the three-dimensional model into a plurality of voxels; and determine the illumination colors of the plurality of voxels according to the illumination color of the scene.
本申请实施例还提供了一种芯片。图6示出了一种芯片600的结构示意图。芯片600包括一个或多个处理器601以及接口电路602。可选的,上述芯片600还可以包含总线603。The embodiment of the present application further provides a chip. FIG6 shows a schematic diagram of the structure of a chip 600. The chip 600 includes one or more processors 601 and an interface circuit 602. Optionally, the chip 600 may also include a bus 603.
处理器601可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述光照估计方法的各步骤可以通过处理器601中的硬件的集成逻辑电路或者软件形式的指令完成。The processor 601 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned illumination estimation method may be completed by an integrated logic circuit of hardware in the processor 601 or by instructions in the form of software.
可选地,上述的处理器601可以是通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件,可以实现或者执行本申请实施例中公开的各方法、步骤。通用处理器可以是微处理器,或者该处理器也可以是任何常规的处理器等。Optionally, the processor 601 may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods and steps disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc.
接口电路602可以用于数据、指令或者信息的发送或者接收,处理器601可以利用接口电路602接收的数据、指令或者其他信息,进行加工,可以将加工完成信息通过接口电路602发送出去。 The interface circuit 602 can be used to send or receive data, instructions or information. The processor 601 can use the data, instructions or other information received by the interface circuit 602 to process, and can send the processing completion information through the interface circuit 602.
可选的,芯片还包括存储器,存储器可以包括只读存储器和随机存取存储器,并向处理器提供操作指令和数据。存储器的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。Optionally, the chip also includes a memory, which may include a read-only memory and a random access memory, and provides operation instructions and data to the processor. A portion of the memory may also include a non-volatile random access memory (NVRAM).
可选的,存储器存储了可执行软件模块或者数据结构,处理器可以通过调用存储器存储的操作指令(该操作指令可存储在操作系统中),执行相应的操作。Optionally, the memory stores executable software modules or data structures, and the processor can perform corresponding operations by calling operation instructions stored in the memory (the operation instructions can be stored in the operating system).
可选的,芯片可以使用在本申请实施例涉及的光照估计装置中。可选地,接口电路602可用于输出处理器601的执行结果。关于本申请实施例的一个或多个实施例提供的光照估计方法可参考前述各个实施例,这里不再赘述。Optionally, the chip can be used in the illumination estimation device involved in the embodiment of the present application. Optionally, the interface circuit 602 can be used to output the execution result of the processor 601. The illumination estimation method provided by one or more embodiments of the embodiment of the present application can refer to the aforementioned embodiments, which will not be repeated here.
需要说明的,处理器601、接口电路602各自对应的功能既可以通过硬件设计实现,也可以通过软件设计来实现,还可以通过软硬件结合的方式来实现,这里不做限制。It should be noted that the functions corresponding to the processor 601 and the interface circuit 602 can be implemented through hardware design, software design, or a combination of hardware and software, and there is no limitation here.
图7为本申请实施例提供的一种电子设备的结构示意图,电子设备100可以为手机、平板电脑、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)、光照估计装置或者光照估计装置中的芯片或者功能模块。7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application. The electronic device 100 may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), an illumination estimation device, or a chip or functional module in an illumination estimation device.
示例性的,图7是本申请实施例提供的一例电子设备100的结构示意图。电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。Exemplarily, FIG7 is a schematic diagram of the structure of an electronic device 100 provided in an embodiment of the present application. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, and a subscriber identification module (SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
可以理解的是,本申请实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以硬件,软件或软件和硬件的组合实现。It is to be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown in the figure, or combine some components, or split some components, or arrange the components differently. The components shown in the figure may be implemented in hardware, software, or a combination of software and hardware.
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。The processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processor (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices or integrated in one or more processors.
其中,控制器可以是电子设备100的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。The controller may be the nerve center and command center of the electronic device 100. The controller may generate an operation control signal according to the instruction operation code and the timing signal to complete the control of fetching and executing instructions.
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
其中,I2C接口是一种双向同步串行总线,处理器110可以通过I2C接口耦合触摸传感器180K,使处理器110与触摸传感器180K通过I2C总线接口通信,实现电子设备100的触摸功能。MIPI接口可以被用于连接处理器110与显示屏194,摄像头193等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI),显示屏串行接口(display serial interface,DSI)等。在一些实施例中,处理器110和摄像头193通过CSI接口通信,实现电子设备100的拍摄功能。处理器110和显示屏194通过DSI接口通信,实现电子设备100的显示功能。Among them, the I2C interface is a bidirectional synchronous serial bus. The processor 110 can couple the touch sensor 180K through the I2C interface, so that the processor 110 and the touch sensor 180K communicate through the I2C bus interface to realize the touch function of the electronic device 100. The MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), etc. In some embodiments, the processor 110 and the camera 193 communicate through the CSI interface to realize the shooting function of the electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to realize the display function of the electronic device 100.
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备100的结构限定。在本申请另一些实施例中,电子设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。It is understandable that the interface connection relationships between the modules illustrated in the embodiments of the present application are only schematic illustrations and do not constitute a structural limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt an interface connection manner different from those in the above embodiments, or a combination of multiple interface connection manners.
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,外部存储器,显示屏194,摄像头193,和无线通信模块160等供电。The charging management module 140 is used to receive charging input from a charger. The charger can be a wireless charger or a wired charger. The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and provides power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, and the wireless communication module 160.
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。The electronic device 100 implements the display function through a GPU, a display screen 194, and an application processor. The GPU is a microprocessor for image processing, which connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。The display screen 194 is used to display images, videos, etc. The display screen 194 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diodes (QLED), etc. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
电子设备100可以通过ISP,摄像头193,触摸传感器、视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。The electronic device 100 can realize the shooting function through ISP, camera 193, touch sensor, video codec, GPU, display screen 194 and application processor.
其中,ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将上述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。Among them, ISP is used to process the data fed back by camera 193. For example, when taking a photo, the shutter is opened, and the light is transmitted to the camera photosensitive element through the lens. The light signal is converted into an electrical signal, and the camera photosensitive element transmits the above electrical signal to ISP for processing and converts it into an image visible to the naked eye. ISP can also perform algorithm optimization on the noise, brightness, and skin color of the image. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, ISP can be set in camera 193.
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号,应理解,在本申请实施例的描述中,以RGB格式的图像为例进行介绍,本申请实施例对图像格式不作限定。在一些实施例中,电子设备100可以包括1个或N个摄像头193,N为大于1的正整数。The camera 193 is used to capture still images or videos. The object generates an optical image through the lens and projects it onto the photosensitive element. The photosensitive element can be a charge coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV or other format. It should be understood that in the description of the embodiments of the present application, an image in RGB format is used as an example for introduction, and the embodiments of the present application do not limit the image format. In some embodiments, the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。The digital signal processor is used to process digital signals, and can process not only digital image signals but also other digital signals. For example, when the electronic device 100 is selecting a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。Video codecs are used to compress or decompress digital videos. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record videos in a variety of coding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备100的存储能力。内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器110通过运行存储在内部存储器121的指令,从而执行电子设备100的各种功能应用以及数据处理。内部存储器121可以包括存储程序区和存储数据区。The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The internal memory 121 can be used to store computer executable program codes, which include instructions. The processor 110 executes various functional applications and data processing of the electronic device 100 by running the instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area.
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。The electronic device 100 can implement audio functions such as music playing and recording through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone jack 170D, and the application processor.
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。电子设备100可以接收按键输入,产生与电子设备100的用户设置以及功能控制有关的键信号输入。马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反馈。例如,作用于不同应用(例如拍照,音频播放等)的触摸操作,可以对应不同的振动反馈效果。作用于显示屏194不同区域的触摸操作,马达191也可对应不同的振动反馈效果。指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。SIM卡接口195用于连接SIM卡。The button 190 includes a power button, a volume button, etc. The button 190 can be a mechanical button. It can also be a touch button. The electronic device 100 can receive button input and generate key signal input related to the user settings and function control of the electronic device 100. The motor 191 can generate a vibration prompt. The motor 191 can be used for incoming call vibration prompts, and can also be used for touch vibration feedback. For example, touch operations acting on different applications (such as taking pictures, audio playback, etc.) can correspond to different vibration feedback effects. For touch operations acting on different areas of the display screen 194, the motor 191 can also correspond to different vibration feedback effects. The indicator 192 can be an indicator light, which can be used to indicate the charging status, power changes, and can also be used to indicate messages, missed calls, notifications, etc. The SIM card interface 195 is used to connect a SIM card.
需要指出的是,电子设备100可以是芯片系统或具有图7中类似结构的设备。其中,芯片系统可以由芯片构成,也可以包括芯片和其他分立器件。本申请的各实施例之间涉及的动作、术语等均可以相互参考,不予限制。本申请的实施例中各个设备之间交互的消息名称或消息中的参数名称等只是一个示例,具体实现中也可以采用其他的名称,不予限制。此外,图7中示出的组成结构并不构成对该电子设备100的限定,除图7所示部件之外,该电子设备100可以包括比图7所示更多或更少的部件,或者组合某些部件,或者不同的部件布置。It should be noted that the electronic device 100 may be a chip system or a device with a structure similar to that in FIG. 7. The chip system may be composed of chips, or may include chips and other discrete devices. The actions, terms, etc. involved in the various embodiments of the present application may refer to each other without limitation. The message names or the parameter names in the messages exchanged between the devices in the embodiments of the present application are only examples, and other names may also be used in specific implementations without limitation. In addition, the composition structure shown in FIG. 7 does not constitute a limitation on the electronic device 100; in addition to the components shown in FIG. 7, the electronic device 100 may include more or fewer components than those shown in FIG. 7, or combine certain components, or have a different arrangement of components.
本申请中描述的处理器和收发器可实现在集成电路(integrated circuit,IC)、模拟IC、射频集成电路、混合信号IC、专用集成电路(application specific integrated circuit,ASIC)、印刷电路板(printed circuit board,PCB)、电子设备等上。该处理器和收发器也可以用各种IC工艺技术来制造,例如互补金属氧化物半导体(complementary metal oxide semiconductor,CMOS)、N型金属氧化物半导体(nMetal-oxide-semiconductor,NMOS)、P型金属氧化物半导体(positive channel metal oxide semiconductor,PMOS)、双极结型晶体管(Bipolar Junction Transistor,BJT)、双极CMOS(BiCMOS)、硅锗(SiGe)、砷化镓(GaAs)等。The processor and transceiver described in the present application can be implemented in an integrated circuit (IC), an analog IC, a radio frequency integrated circuit, a mixed signal IC, an application specific integrated circuit (ASIC), a printed circuit board (PCB), an electronic device, etc. The processor and transceiver can also be manufactured using various IC process technologies, such as complementary metal oxide semiconductor (CMOS), N-type metal oxide semiconductor (NMOS), P-type metal oxide semiconductor (positive channel metal oxide semiconductor, PMOS), bipolar junction transistor (BJT), bipolar CMOS (BiCMOS), silicon germanium (SiGe), gallium arsenide (GaAs), etc.
FIG. 8 is a schematic diagram of the structure of an illumination estimation apparatus provided in an embodiment of the present application. The illumination estimation apparatus is applicable to the scenarios shown in the foregoing method embodiments. For ease of description, FIG. 8 shows only the main components of the illumination estimation apparatus: a processor 801, a memory 802, a control circuit 803, and an input-output device 804. The processor 801 is mainly used to process communication protocols and communication data, execute software programs, and process the data of the software programs. The memory 802 is mainly used to store software programs and data. The control circuit 803 is mainly used for power supply and the transfer of various electrical signals. The input-output device 804 is mainly used to receive data input by a user and to output data to the user.
When the illumination estimation apparatus is the processor 801, the control circuit 803 may be a mainboard; the memory 802 includes media with a storage function such as a hard disk, RAM, and ROM; the processor 801 may include a baseband processor and a central processing unit, where the baseband processor is mainly used to process communication protocols and communication data, and the central processing unit is mainly used to control the entire illumination estimation apparatus, execute software programs, and process the data of the software programs; and the input-output device 804 includes a display screen, a keyboard, a mouse, and the like. The control circuit 803 may further include or be connected to a transceiver circuit or a transceiver, for example a network cable interface, for sending or receiving data or signals, for example for data transmission and communication with other devices. Further, an antenna may be included for sending and receiving wireless signals and for data/signal transmission with other devices.
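Purely as an illustration of this division of responsibilities, the following sketch models the blocks of FIG. 8 in Python. The class names, method names, and the decision to give the control circuit no software-visible role are assumptions made for readability only; they do not correspond to any software interface disclosed in this application.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class Memory802:
    """Stands in for memory 802: stores software programs and their data."""
    programs: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    data: Dict[str, Any] = field(default_factory=dict)

class Processor801:
    """Stands in for processor 801: runs stored programs and processes their data."""
    def __init__(self, memory: Memory802) -> None:
        self.memory = memory

    def run(self, program_name: str, *args: Any) -> Any:
        result = self.memory.programs[program_name](*args)
        self.memory.data[program_name] = result
        return result

class IlluminationEstimationApparatus:
    """Wires processor 801 and memory 802 together; input-output device 804 is
    modelled as the public entry point, and control circuit 803 (power supply
    and signal routing) is omitted because it has no software-visible behaviour."""
    def __init__(self) -> None:
        self.memory = Memory802()
        self.processor = Processor801(self.memory)

    def install(self, name: str, program: Callable[..., Any]) -> None:
        self.memory.programs[name] = program

    def handle_user_input(self, name: str, *user_input: Any) -> Any:
        return self.processor.run(name, *user_input)
```

An illumination estimation routine, such as the method set out in the claims below, could then be installed with `install` and invoked through `handle_user_input` in this toy model.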
An embodiment of the present application further provides an illumination estimation apparatus, including at least one processor. When the at least one processor executes program code or instructions, the related method steps above are carried out to implement the illumination estimation method in the foregoing embodiments.
Optionally, the apparatus may further include at least one memory, and the at least one memory is used to store the program code or instructions.
An embodiment of the present application further provides a computer storage medium storing computer instructions. When the computer instructions are run on an illumination estimation apparatus, the apparatus is caused to perform the related method steps above to implement the illumination estimation method in the foregoing embodiments.
An embodiment of the present application further provides a computer program product. When the computer program product is run on a computer, the computer is caused to perform the related steps above to implement the illumination estimation method in the foregoing embodiments.
An embodiment of the present application further provides an illumination estimation apparatus, which may specifically be a chip, an integrated circuit, a component, or a module. Specifically, the apparatus may include a processor and a connected memory for storing instructions, or the apparatus may include at least one processor that obtains instructions from an external memory. When the apparatus runs, the processor may execute the instructions so that the chip performs the illumination estimation method in the foregoing method embodiments.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered to be beyond the scope of the present application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical functional division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
If the foregoing functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of the present application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above describes only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (23)

  1. An illumination estimation method, characterized in that the method comprises:
    obtaining a three-dimensional model of a scene;
    determining target information of the scene according to the three-dimensional model, wherein the target information comprises light source position information, and the light source position information is used to indicate a position of a light source in the scene;
    determining an illumination intensity of the scene according to the target information.
  2. The method according to claim 1, characterized in that determining the target information of the scene according to the three-dimensional model comprises:
    inputting the three-dimensional model into a first network model to obtain the light source position information, wherein the first network model is used to output corresponding light source position information according to an input three-dimensional model.
  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    inputting the three-dimensional model into a first encoding network to obtain a first encoding block, wherein the first encoding block comprises encoded point features;
    inputting the three-dimensional model into a second encoding network to obtain a second encoding block, wherein the second encoding block comprises target text features;
    training a first network model according to the encoded point features and the target text features, wherein the first network model is used to output corresponding light source position information according to an input three-dimensional model.
  4. The method according to any one of claims 1 to 3, characterized in that the target information further comprises position encoding information, and the position encoding information is used to indicate a representation of the scene in a high-dimensional space.
  5. The method according to any one of claims 1 to 4, characterized in that the target information further comprises an illumination feature vector, and the illumination feature vector is used to indicate a distribution of illumination in the scene.
  6. The method according to any one of claims 1 to 5, characterized in that the target information further comprises an illumination color of the scene, and the method further comprises:
    determining an illumination expression of the scene according to the illumination intensity and the illumination color, wherein the illumination expression is used to indicate the illumination color and the illumination intensity of the scene.
  7. The method according to any one of claims 1 to 6, characterized in that determining the illumination intensity of the scene according to the target information comprises:
    inputting the target information into a second network model to obtain the illumination intensity, wherein the second network model is used to output corresponding illumination intensity according to input target information.
  8. The method according to any one of claims 1 to 7, characterized in that the method further comprises:
    determining a rendered image of the scene according to an illumination expression of the scene, wherein the illumination expression is used to indicate an illumination color and the illumination intensity of the scene;
    training a second network model according to the rendered image of the scene and a reference image of the scene, wherein the second network model is used to output corresponding illumination intensity according to input target information.
  9. The method according to any one of claims 1 to 8, characterized in that the method further comprises:
    segmenting the three-dimensional model into a plurality of voxels;
    determining illumination feature vectors of the plurality of voxels according to an illumination feature vector of the scene, wherein the illumination feature vector of the scene is used to indicate a distribution of illumination in the scene.
  10. The method according to any one of claims 1 to 9, characterized in that the method further comprises:
    segmenting the three-dimensional model into a plurality of voxels;
    determining illumination colors of the plurality of voxels according to an illumination color of the scene.
  11. An illumination estimation apparatus, characterized in that the apparatus comprises a transceiver unit and a processing unit, wherein:
    the transceiver unit is configured to obtain a three-dimensional model of a scene;
    the processing unit is configured to determine target information of the scene according to the three-dimensional model, wherein the target information comprises light source position information, and the light source position information is used to indicate a position of a light source in the scene;
    the processing unit is further configured to determine an illumination intensity of the scene according to the target information.
  12. The apparatus according to claim 11, characterized in that the processing unit is specifically configured to:
    input the three-dimensional model into a first network model to obtain the light source position information, wherein the first network model is used to output corresponding light source position information according to an input three-dimensional model.
  13. The apparatus according to claim 11 or 12, characterized in that the processing unit is further configured to:
    input the three-dimensional model into a first encoding network to obtain a first encoding block, wherein the first encoding block comprises encoded point features;
    input the three-dimensional model into a second encoding network to obtain a second encoding block, wherein the second encoding block comprises target text features;
    train a first network model according to the encoded point features and the target text features, wherein the first network model is used to output corresponding light source position information according to an input three-dimensional model.
  14. The apparatus according to any one of claims 11 to 13, characterized in that the target information further comprises position encoding information, and the position encoding information is used to indicate a representation of the scene in a high-dimensional space.
  15. The apparatus according to any one of claims 11 to 14, characterized in that the target information further comprises an illumination feature vector, and the illumination feature vector is used to indicate a distribution of illumination in the scene.
  16. The apparatus according to any one of claims 11 to 15, characterized in that the target information further comprises an illumination color of the scene, and the processing unit is further configured to:
    determine an illumination expression of the scene according to the illumination intensity and the illumination color, wherein the illumination expression is used to indicate the illumination color and the illumination intensity of the scene.
  17. The apparatus according to any one of claims 11 to 16, characterized in that the processing unit is specifically configured to:
    input the target information into a second network model to obtain the illumination intensity, wherein the second network model is used to output corresponding illumination intensity according to input target information.
  18. The apparatus according to any one of claims 11 to 17, characterized in that the processing unit is further configured to:
    determine a rendered image of the scene according to an illumination expression of the scene, wherein the illumination expression is used to indicate an illumination color and the illumination intensity of the scene;
    train a second network model according to the rendered image of the scene and a reference image of the scene, wherein the second network model is used to output corresponding illumination intensity according to input target information.
  19. The apparatus according to any one of claims 11 to 18, characterized in that the processing unit is further configured to:
    segment the three-dimensional model into a plurality of voxels;
    determine illumination feature vectors of the plurality of voxels according to an illumination feature vector of the scene, wherein the illumination feature vector of the scene is used to indicate a distribution of illumination in the scene.
  20. The apparatus according to any one of claims 11 to 19, characterized in that the processing unit is further configured to:
    segment the three-dimensional model into a plurality of voxels;
    determine illumination colors of the plurality of voxels according to an illumination color of the scene.
  21. An illumination estimation apparatus, comprising at least one processor and a memory, characterized in that the at least one processor executes a program or instructions stored in the memory, so that the illumination estimation apparatus implements the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, configured to store a computer program, characterized in that, when the computer program is run on a computer or a processor, the computer or the processor is enabled to implement the method according to any one of claims 1 to 10.
  23. A computer program product, comprising instructions, characterized in that, when the instructions are run on a computer or a processor, the computer or the processor is enabled to implement the method according to any one of claims 1 to 10.
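For readers who want to connect the claims to a concrete data flow, the following is a minimal, non-authoritative sketch of the two-network pipeline of claims 1 to 10 in PyTorch. The class names, tensor shapes, and layer choices (a PointNet-style point encoder, small MLP heads, sinusoidal positional encoding) are illustrative assumptions rather than the disclosed implementation; the sketch only mirrors the claimed flow in which a three-dimensional model of the scene yields light source position information, the target information is mapped to an illumination intensity, and the intensity is combined with an illumination color into an illumination expression.

```python
import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, num_freqs: int = 4) -> torch.Tensor:
    """Map 3D coordinates to a higher-dimensional representation (cf. claim 4)."""
    freqs = (2.0 ** torch.arange(num_freqs, device=x.device)) * torch.pi
    angles = x.unsqueeze(-1) * freqs                    # (..., 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

class FirstNetwork(nn.Module):
    """Predicts light source positions from the scene's 3D model (cf. claims 1 and 2)."""
    def __init__(self, feat_dim: int = 128, max_lights: int = 4) -> None:
        super().__init__()
        # Per-point MLP followed by max pooling, PointNet-style (an assumption).
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, max_lights * 3)
        self.max_lights = max_lights

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) point cloud sampled from the 3D model of the scene
        scene_feat = self.point_mlp(points).max(dim=1).values      # (B, feat_dim)
        return self.head(scene_feat).view(-1, self.max_lights, 3)  # (B, L, 3) positions

class SecondNetwork(nn.Module):
    """Maps target information to illumination intensity (cf. claims 1 and 7)."""
    def __init__(self, target_dim: int, light_feat_dim: int = 32) -> None:
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(target_dim + light_feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1), nn.Softplus())  # non-negative intensity

    def forward(self, target_info: torch.Tensor, light_feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([target_info, light_feat], dim=-1))

def estimate_lighting(points, query_point, light_feat, light_color, first_net, second_net):
    """Data flow of claims 1 and 4-7: positions -> target information -> intensity -> expression."""
    light_pos = first_net(points)                                  # light source positions
    target_info = torch.cat([light_pos.flatten(1),                 # light source position information
                             positional_encoding(query_point)],    # high-dimensional scene representation
                            dim=-1)
    intensity = second_net(target_info, light_feat)                # (B, 1) illumination intensity
    return {"intensity": intensity,
            "color": light_color,
            "expression": intensity * light_color}                 # illumination expression (claim 6)
```

With `num_freqs = 4` and `max_lights = 4`, `target_info` has 4·3 + 3·2·4 = 36 dimensions, so `SecondNetwork(target_dim=36)` would be the matching instantiation. Training, per claims 3 and 8, would supervise the first network using encoded point features and target text features, and the second network with a loss between an image rendered under the predicted illumination expression and a reference image of the scene; those encoders and the renderer are omitted from this sketch.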
PCT/CN2023/123617 2022-12-09 2023-10-09 Illumination estimation method and apparatus WO2024119997A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202211581949 2022-12-09
CN202211581949.6 2022-12-09
CN202310767724.8 2023-06-26
CN202310767724.8A CN118172476A (en) 2022-12-09 2023-06-26 Illumination estimation method and device

Publications (1)

Publication Number Publication Date
WO2024119997A1 true WO2024119997A1 (en) 2024-06-13

Family

ID=91345987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/123617 WO2024119997A1 (en) 2022-12-09 2023-10-09 Illumination estimation method and apparatus

Country Status (2)

Country Link
CN (1) CN118172476A (en)
WO (1) WO2024119997A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833430A (en) * 2019-04-10 2020-10-27 上海科技大学 Illumination data prediction method, system, terminal and medium based on neural network
US10665011B1 (en) * 2019-05-31 2020-05-26 Adobe Inc. Dynamically estimating lighting parameters for positions within augmented-reality scenes based on global and local features
CN110598333A (en) * 2019-09-16 2019-12-20 广东三维家信息科技有限公司 Method and device for determining light source position and electronic equipment
CN115088019A (en) * 2020-02-25 2022-09-20 Oppo广东移动通信有限公司 System and method for visualizing rays in a scene
CN115499576A (en) * 2021-06-18 2022-12-20 华为技术有限公司 Light source estimation method, device and system
CN115439595A (en) * 2022-11-07 2022-12-06 四川大学 AR-oriented indoor scene dynamic illumination online estimation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, ZIAN ET AL.: "Learning Indoor Inverse Rendering with 3D Spatially-Varying Lighting", 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 17 October 2021 (2021-10-17), pages 12518 - 12527, XP034093112, DOI: 10.1109/ICCV48922.2021.01231 *

Also Published As

Publication number Publication date
CN118172476A (en) 2024-06-11

Similar Documents

Publication Publication Date Title
WO2020010979A1 (en) Method and apparatus for training model for recognizing key points of hand, and method and apparatus for recognizing key points of hand
US20200265234A1 (en) Electronic device for providing shooting mode based on virtual character and operation method thereof
US10681287B2 (en) Apparatus and method for displaying AR object
CN112562019A (en) Image color adjusting method and device, computer readable medium and electronic equipment
US11526704B2 (en) Method and system of neural network object recognition for image processing
CN116569213A (en) Semantical refinement of image regions
CN108701355B (en) GPU optimization and online single Gaussian-based skin likelihood estimation
US10929961B2 (en) Electronic device and method for correcting images using external electronic device
CN113688907B (en) A model training and video processing method, which comprises the following steps, apparatus, device, and storage medium
CN115699082A (en) Defect detection method and device, storage medium and electronic equipment
WO2022073282A1 (en) Motion recognition method based on feature interactive learning, and terminal device
CN111950570B (en) Target image extraction method, neural network training method and device
US11144197B2 (en) Electronic device performing function according to gesture input and operation method thereof
US11748913B2 (en) Modeling objects from monocular camera outputs
CN114445562A (en) Three-dimensional reconstruction method and device, electronic device and storage medium
WO2022133944A1 (en) Image processing method and image processing apparatus
CN113538227A (en) Image processing method based on semantic segmentation and related equipment
WO2022083118A1 (en) Data processing method and related device
WO2023216957A1 (en) Target positioning method and system, and electronic device
WO2024119997A1 (en) Illumination estimation method and apparatus
WO2023045724A1 (en) Image processing method, electronic device, storage medium, and program product
WO2022143314A1 (en) Object registration method and apparatus
WO2023003642A1 (en) Adaptive bounding for three-dimensional morphable models
CN113196279B (en) Facial attribute identification method and electronic equipment
CN112036487A (en) Image processing method and device, electronic equipment and storage medium