CN116843551A - Image processing method and device, electronic equipment and storage medium - Google Patents

Image processing method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN116843551A
CN116843551A CN202310827441.8A
Authority
CN
China
Prior art keywords
target
sampling point
camera
color
pose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310827441.8A
Other languages
Chinese (zh)
Inventor
庄放望
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd, China Mobile Communications Group Co Ltd filed Critical Migu Cultural Technology Co Ltd
Priority to CN202310827441.8A priority Critical patent/CN116843551A/en
Publication of CN116843551A publication Critical patent/CN116843551A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides an image processing method and device, electronic equipment and a storage medium, wherein the method comprises the following steps: for each of M target poses of a camera, performing spatial point sampling on N rays emitted by the camera under the target pose to obtain N spatial sampling point set coordinates corresponding to the target pose; acquiring first voxel densities and first feature vectors of the N spatial sampling point sets corresponding to the target pose based on the N spatial sampling point set coordinates corresponding to the target pose; determining first colors, second voxel densities and second colors of the N spatial sampling point sets corresponding to the target pose based on the first feature vectors corresponding to the target pose and the viewing angle directions of the N spatial sampling point sets corresponding to the target pose; rendering based on the first voxel densities, the second voxel densities, the first colors and the second colors to generate target rendering images corresponding to the M target poses; and splicing the target rendering images to obtain a spliced image, thereby improving the image splicing effect.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to an image processing method, an image processing device, an electronic device, and a computer readable storage medium.
Background
Currently, in the process of producing a high dynamic range image (High Dynamic Range Image, HDRI), a professional 360-degree panoramic camera is generally adopted, in which a plurality of wide-angle cameras or a plurality of 180-degree fisheye lenses are used to shoot a plurality of low dynamic range images (Low Dynamic Range Images, LDRIs) with different exposure times; the plurality of images generally need to share a common splicing area, and are then spliced and fused into a high dynamic range panoramic picture (HDRI) by an algorithm. Alternatively, a single camera shoots a group of low dynamic range photos (LDRIs) with different exposure times from different poses, a common splicing area needs to exist between the different poses, and the photos are then spliced and fused into a high dynamic range panoramic photo (HDRI) by an algorithm. The principle of the splicing algorithm is to detect corner point information (feature points) of the joint area between pictures, calculate a homography matrix by using the feature point information, warp and splice the plurality of images, and finally merge the exposures to form a complete panoramic image.
However, during shooting, the camera and/or objects in the scene may move, which affects the computation of feature points between the shot images and easily leads to a large error in the computed homography matrix, thereby resulting in a poor image splicing effect.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, electronic equipment and a computer readable storage medium, which are used for solving the problem of poor image stitching effect.
In order to solve the technical problems, the application is realized as follows:
in a first aspect, an embodiment of the present application provides an image processing method, including:
for each of M target poses of a camera, performing spatial point sampling on N rays emitted by the camera under the target poses to obtain N spatial sampling point set coordinates corresponding to the target poses, wherein M, N is an integer greater than 1;
based on the coordinates of N spatial sampling point sets corresponding to the target pose, obtaining first voxel density and first feature vectors of the N spatial sampling point sets corresponding to the target pose;
based on a first feature vector corresponding to the target pose and the view angles of N space sampling point sets corresponding to the target pose, determining a first color of the N space sampling point sets corresponding to the target pose, and determining a second voxel density and a second color of the N space sampling point sets corresponding to the target pose;
rendering based on the first voxel density, the second voxel density, the first color and the second color, and generating target rendering images corresponding to the M target poses;
And splicing the target rendering images corresponding to the M target poses to obtain a spliced image.
In a second aspect, an embodiment of the present application provides an image processing apparatus including:
the first sampling module is used for carrying out space point sampling on N rays sent out by the camera in each of M target poses to obtain N space sampling point set coordinates corresponding to the target poses, wherein M, N is an integer larger than 1;
the first acquisition module is used for acquiring first voxel densities and first feature vectors of N space sampling point sets corresponding to the target pose based on the N space sampling point set coordinates corresponding to the target pose;
the second acquisition module is used for determining a first color of the N spatial sampling point sets corresponding to the target pose based on the first feature vector corresponding to the target pose and the view angle direction of the N spatial sampling point sets corresponding to the target pose, and determining a second voxel density and a second color of the N spatial sampling point sets corresponding to the target pose;
the rendering module is used for rendering based on the first voxel density, the second voxel density, the first color and the second color, and generating target rendering images corresponding to the M target poses;
And the splicing module is used for splicing the target rendering images corresponding to the M target poses to obtain spliced images.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the image processing method described above when executing the computer program.
In a fourth aspect, embodiments of the present application also provide a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in an image processing method as described above.
In the image processing method of the embodiment of the application, images shot by the camera under the M target poses do not need to be spliced. Instead, spatial point sampling is performed on rays emitted by the camera, the first voxel densities and first feature vectors of the spatial sampling point sets are determined by using the spatial sampling point set coordinates, and the first colors, second voxel densities and second colors of the spatial sampling point sets are determined by using the first feature vectors and the viewing angle directions of the spatial sampling point sets. In this way, rendering can be performed by using the first voxel densities, second voxel densities, first colors and second colors of the N spatial sampling point sets corresponding to each of the M target poses to generate the target rendering images corresponding to the M target poses, and the target rendering images are then spliced to obtain a spliced image. This avoids the situation in which movement of the camera and/or of objects in the shooting scene causes large errors in the feature point calculation between shot images and thus a poor image splicing effect, so that the image splicing effect is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is one of the flowcharts of an image processing method provided in an embodiment of the present application;
FIG. 2 is a second flowchart of an image processing method according to an embodiment of the present application;
FIG. 3 is an application scene diagram of an image processing method according to an embodiment of the present application;
FIG. 4 is a block diagram of an original neural radiance field model;
FIG. 5 is a diagram of a neural radiance field model provided by an embodiment of the present application;
FIG. 6 is one of the rendered image schematics provided by an embodiment of the present application;
FIG. 7 is a second schematic diagram of a rendered image according to an embodiment of the present application;
FIG. 8 is a schematic view of a panoramic image provided by an embodiment of the present application;
fig. 9 is a schematic block diagram of an image processing apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a flowchart of an image processing method provided by an embodiment of the present application, which is applicable to an electronic device, where the electronic device may be a terminal device or a server device, and the terminal device may be a mobile device or a non-mobile device. As shown in fig. 1, the image processing method provided by the embodiment of the application includes the following steps:
step 101: and for each of M target poses of the camera, performing spatial point sampling on N rays emitted under the target poses to obtain N spatial sampling point set coordinates corresponding to the target poses.
The N space sampling point set coordinates corresponding to the target pose comprise space sampling point set coordinates corresponding to each ray in the N rays under the target pose, and M, N is an integer greater than 1.
It should be noted that the camera may emit rays from the optical center, and a certain ray passes through a certain pixel, in which case the ray corresponds to that pixel; that is, each ray has a corresponding pixel point. Rays passing from the optical center through different pixel points may have different viewing angle directions (also referred to as observation directions), and the viewing angle direction of any spatial point on a ray is the viewing angle direction of that ray. For a camera, the camera parameters may at least include image parameters (for example, the image size, that is, image length (a positive integer) × image width (a positive integer)), a focal length, a pose, and the like. The camera may emit N rays, where N may be the image length × the image width. In the image rendering process, the rendering color of a certain ray can be understood as the color of the pixel point corresponding to that ray, and the image rendering process can also be understood as the process of assigning the corresponding color to each pixel point.
For example, in practical application, if a rendered image of size 1920×1080 needs to be generated, each pixel point corresponds to a ray in the world coordinate system; by rendering the color of each pixel point (i.e., the rendering color of the corresponding ray), the corresponding rendered image can be obtained, where the color of a pixel point can be obtained by rendering based on the first voxel density, the second voxel density, the first color and the second color of the spatial sampling point set on the corresponding ray. Therefore, spatial point sampling can be performed on the N rays emitted by the camera under the target pose to obtain the N spatial sampling point set coordinates corresponding to the target pose, that is, the spatial sampling point set coordinates corresponding to each of the N rays under the target pose.
The camera can emit rays in different directions under different poses. In this embodiment, M target poses may be understood as M predetermined/preset poses of a camera, in the process of generating a target rendering image corresponding to a certain target pose, spatial point sampling may be performed on each of N rays sent by the camera under the target pose, that is, spatial point sampling may be performed on each ray, for each ray, multiple spatial points may be sampled, N spatial sampling point set coordinates corresponding to the target pose may be obtained, N spatial sampling point sets corresponding to the target pose may be in one-to-one correspondence with N rays of the target pose, each spatial sampling point set may include multiple spatial sampling points, and the spatial sampling point set coordinates may be coordinates of the spatial sampling points in the spatial sampling point set. In one example, the coordinates are three-dimensional space coordinates, which are space coordinates in the world coordinate system. In one example, the coordinates may be further coordinates after encoding (e.g., high frequency encoding, such as encoding with a sine and cosine periodic function).
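As a rough illustration of step 101, the following NumPy sketch builds one ray per pixel for a given camera-to-world pose and performs stratified spatial point sampling along each ray; the focal length, image size and near/far bounds are illustrative assumptions, and the helper names are hypothetical rather than taken from the application.

```python
import numpy as np

def generate_rays(c2w, height=1080, width=1920, focal=1000.0):
    """Build one ray per pixel (N = height * width) for a camera-to-world matrix c2w (3x4 or 4x4).
    The focal length and image size here are illustrative assumptions."""
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    # Per-pixel directions in the camera frame (pinhole model; the camera looks along -z).
    dirs = np.stack([(i - width * 0.5) / focal,
                     -(j - height * 0.5) / focal,
                     -np.ones_like(i, dtype=np.float64)], axis=-1)
    rays_d = dirs @ c2w[:3, :3].T                          # rotate directions into the world frame
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)     # every ray starts at the optical centre
    return rays_o.reshape(-1, 3), rays_d.reshape(-1, 3)

def sample_points(rays_o, rays_d, near=0.1, far=6.0, n_samples=64):
    """Stratified spatial point sampling along each ray r(t) = o + t * d."""
    bins = np.linspace(near, far, n_samples)
    jitter = np.random.rand(rays_o.shape[0], n_samples) * (far - near) / n_samples
    t_vals = bins[None, :] + jitter                        # one jittered depth per sample and ray
    pts = rays_o[:, None, :] + t_vals[..., None] * rays_d[:, None, :]
    return pts, t_vals                                     # pts: (N, n_samples, 3) world coordinates
```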
Step 102: obtaining the first voxel densities and the first feature vectors of the N spatial sampling point sets corresponding to the target pose based on the N spatial sampling point set coordinates corresponding to the target pose.
For each of the M target poses, the coordinates of the N spatial sampling point sets of the target pose can be used to determine the first voxel densities and first feature vectors of the N spatial sampling point sets of the target pose, so that the first voxel densities and first feature vectors of the N spatial sampling point sets corresponding to each of the M target poses can be obtained. It should be noted that the first feature vector may also be referred to as a first spatial feature vector, and the first feature vector of a certain spatial sampling point may be used to represent spatial feature information of that spatial sampling point. In addition, the voxel density of a certain spatial sampling point may represent the probability that the ray on which the spatial sampling point lies terminates at an infinitesimal particle at that spatial sampling point, which can also be understood as the opacity at the spatial sampling point.
Step 103: based on the first feature vector corresponding to the target pose and the view angles of the N spatial sampling point sets corresponding to the target pose, determining a first color of the N spatial sampling point sets corresponding to the target pose, and determining a second voxel density and a second color of the N spatial sampling point sets corresponding to the target pose.
In this embodiment, in addition to acquiring the coordinates of the N spatial sampling point sets corresponding to the target pose, the viewing angle directions of the N spatial sampling point sets corresponding to the target pose (i.e., the viewing angle directions of the spatial sampling points in the spatial sampling point sets) may also be acquired. It should be noted that the viewing angle direction of a spatial sampling point is the viewing angle direction of the ray on which the spatial sampling point is located; once the ray emitted by the camera from the optical center under the target pose is determined, the viewing angle direction of the ray is determined, and thus the viewing angle directions of all spatial sampling points on the ray can be determined. The first feature vectors corresponding to the target pose can be understood as the first feature vectors of the N spatial sampling point sets corresponding to the target pose.
In this embodiment, the first color, the second voxel density, and the second color of the N spatial sampling point sets corresponding to the target pose may be determined by using the first feature vector corresponding to the target pose and the viewing angle directions of the N spatial sampling point sets corresponding to the target pose. The color of a spatial sampling point may represent the color reflected by the particles at that spatial sampling point, as seen in the direction of the ray in which it is located. In one example, the first color may be understood as static color information, the second color may be understood as dynamic color information, and the second voxel density may be understood as dynamic density information.
Step 104: rendering is performed based on the first voxel density, the second voxel density, the first color and the second color, and target rendering images corresponding to M target poses are generated.
For each of the M target poses, rendering can be performed by using the first voxel densities, first colors, second voxel densities and second colors of the N spatial sampling point sets corresponding to the target pose, so as to generate the target rendering image corresponding to that target pose; in this way, a target rendering image corresponding to each of the M target poses can be generated, that is, the M target rendering images corresponding to the M target poses are obtained.
In one example, the color of each pixel point in the target rendering image under the target pose is a rendering color corresponding to one ray of the N rays under the target pose, wherein the rendering color of any ray is obtained by rendering based on the first voxel density, the second voxel density, the first color and the second color of the spatial sampling point set corresponding to any ray.
The rendering color of a ray is obtained by rendering based on the first voxel density, the second voxel density, the first color and the second color of the spatial sampling point set corresponding to that ray, which gives the color of the pixel point corresponding to that ray. For each ray under the target pose, the corresponding rendering color is determined in a similar manner, so that the colors of the pixel points corresponding to the respective rays under the target pose are obtained, that is, the target rendering image corresponding to the target pose is obtained; the rendering colors of the N rays under the target pose correspond one to one to the colors of the pixel points of the target rendering image.
Step 105: and splicing the target rendering images corresponding to the M target poses to obtain a spliced image.
After the target rendering images under M different target poses are obtained, the M target rendering images can be spliced to obtain a spliced image.
In the image processing method of the embodiment, the images shot by the camera under the M target poses are not needed to be spliced, but the space point sampling is performed on the rays emitted by the camera, the first pixel density and the first feature vector of the space sampling point set are determined by using the space sampling point set coordinates, and the first color, the second pixel density and the second color of the space sampling point set are rendered by using the first feature vector and the view angle direction of the space sampling point set, so that the first pixel density, the second pixel density and the first color of the N space sampling point sets corresponding to the M target poses respectively can be utilized to render, the target rendering images corresponding to the M target poses respectively are generated, then the object rendering images corresponding to the M target poses obtained by rendering are spliced to obtain the spliced image, the situation that the camera moves and/or the object in the shooting scene moves, and the calculation error of the feature points among the shot images is larger, so that the image splicing effect is poor can be avoided, and the target image splicing effect is improved.
In one example, the rendering color of a ray may be obtained by rendering as follows:

$$\hat{C}_i(r)=\sum_{k=1}^{K} T_i(t_k)\Big(\alpha\big(\sigma_i(t_k)\,\delta_{ik}\big)\,c_i(t_k)+\alpha\big(\sigma_i^{(\tau)}(t_k)\,\delta_{ik}\big)\,c_i^{(\tau)}(t_k)\Big),\qquad T_i(t_k)=\exp\Big(-\sum_{k'=1}^{k-1}\big(\sigma_i(t_{k'})+\sigma_i^{(\tau)}(t_{k'})\big)\,\delta_{ik'}\Big)$$

where r represents any ray emitted under the i-th target pose of the M target poses, $\hat{C}_i(r)$ represents the rendering color of the ray r emitted under the i-th target pose, $t_k$ represents the k-th spatial sampling point on the ray r, $t_{k'}$ represents the k'-th spatial sampling point on the ray r, K represents the total number of spatial sampling points on the ray r, $T_i(t_k)$ represents the cumulative transmittance at $t_k$, $\sigma_i(t_k)$ represents the first voxel density of the spatial sampling point $t_k$, $\delta_{ik}=t_{k+1}-t_k$ represents the distance between the spatial sampling points $t_{k+1}$ and $t_k$ (with $t_{k+1}$ the (k+1)-th spatial sampling point on the ray r), $\sigma_i(t_{k'})$ represents the first voxel density of the spatial sampling point $t_{k'}$, $\delta_{ik'}=t_{k'+1}-t_{k'}$ represents the distance between the spatial sampling points $t_{k'+1}$ and $t_{k'}$, $c_i(t_k)$ represents the first color of the spatial sampling point $t_k$, $\sigma_i^{(\tau)}(t_k)$ and $\sigma_i^{(\tau)}(t_{k'})$ represent the second voxel densities of the spatial sampling points $t_k$ and $t_{k'}$, $c_i^{(\tau)}(t_k)$ represents the second color of the spatial sampling point $t_k$, and $\alpha(y)=1-\exp(-y)$. In addition, the smaller the value of k, the closer the spatial sampling point $t_k$ is to the optical center of the camera; the larger the value of k', the farther the spatial sampling point $t_{k'}$ is from the optical center of the camera.
It can be appreciated that in this embodiment, in determining the rendering color of the ray, not only the first voxel density and the first color of all the spatial sampling points on the ray are considered, but also the second voxel density and the second color of all the spatial sampling points on the ray are considered, so as to improve the accuracy of the rendering color of the obtained ray.
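As a rough illustration of this composite rendering rule, the following NumPy sketch accumulates the static and dynamic contributions of one ray under a shared cumulative transmittance; the array and function names are hypothetical, and the per-sample densities and colors are assumed to be already predicted.

```python
import numpy as np

def composite_ray_color(sigma_s, color_s, sigma_d, color_d, t_vals):
    """Render one ray from its K samples.
    sigma_s/color_s: first (static) voxel densities (K,) and colors (K, 3);
    sigma_d/color_d: second (dynamic) voxel densities and colors;
    t_vals: sample depths t_1..t_K along the ray."""
    delta = np.diff(t_vals, append=t_vals[-1] + 1e10)       # delta_k = t_{k+1} - t_k
    alpha_s = 1.0 - np.exp(-sigma_s * delta)                # alpha(sigma * delta), static part
    alpha_d = 1.0 - np.exp(-sigma_d * delta)                # alpha(sigma * delta), dynamic part
    # Cumulative transmittance uses both densities of the samples in front of t_k.
    acc = np.cumsum((sigma_s + sigma_d) * delta)
    trans = np.exp(-np.concatenate([[0.0], acc[:-1]]))
    return np.sum(trans[:, None] * (alpha_s[:, None] * color_s
                                    + alpha_d[:, None] * color_d), axis=0)
```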
In one embodiment, obtaining a first voxel density and a first feature vector of N sets of spatial sampling points corresponding to a target pose includes: acquiring first voxel densities and first feature vectors of N space sampling point sets corresponding to target pose by adopting a first target perception network;
determining a first color of N spatial sampling point sets corresponding to target pose, including: determining a first color of N space sampling point sets corresponding to the target pose by adopting a target static sensing network;
determining a second voxel density and a second color of the N sets of spatial sampling points corresponding to the target pose, including: and determining a second voxel density and a second color of the N space sampling point sets corresponding to the target pose by adopting a target dynamic perception network.
It should be noted that the sensing networks according to the embodiments of the present application may be multi-layer perceptron networks. The spatial sampling point coordinates are input into the first target sensing network, and the first voxel density and the first feature vector of the spatial sampling point are output through the first target sensing network; the first feature vector of the spatial sampling point and the viewing angle direction of the spatial sampling point are input into the target static sensing network, and the first color of the spatial sampling point is output through the target static sensing network; the first feature vector of the spatial sampling point and the viewing angle direction of the spatial sampling point are input into the target dynamic sensing network, and the second color and the second voxel density of the spatial sampling point are output through the target dynamic sensing network.
That is, in this embodiment, a plurality of sensing networks may be constructed, each of which outputs corresponding information for generating the target rendering image: based on the coordinates of a spatial sampling point, the first voxel density and the first feature vector of the spatial sampling point are obtained by using the first target sensing network; based on the first feature vector and the viewing angle direction of the spatial sampling point, the first color is obtained by using the target static sensing network, and the second color and the second voxel density are obtained by using the target dynamic sensing network; image rendering is then performed by using the first voxel density obtained by the first target sensing network, the first color obtained by the target static sensing network, and the second color and the second voxel density obtained by the target dynamic sensing network, so as to obtain the target rendering image.
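To make the three-network arrangement concrete, a compact PyTorch sketch is given below; the layer counts, widths and output heads are illustrative assumptions and do not reproduce the exact architecture of fig. 4 and fig. 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionMLP(nn.Module):
    """First perception network: encoded coordinates -> (voxel density, feature vector)."""
    def __init__(self, in_dim=60, width=256, feat_dim=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, width), nn.Softplus(),
                                  nn.Linear(width, width), nn.Softplus(),
                                  nn.Linear(width, width), nn.Softplus())
        self.sigma_head = nn.Linear(width, 1)        # first voxel density
        self.feat_head = nn.Linear(width, feat_dim)  # feature vector z(t)

    def forward(self, x_enc):
        h = self.body(x_enc)
        return F.softplus(self.sigma_head(h)), self.feat_head(h)

class StaticMLP(nn.Module):
    """Static perception network: (feature vector, encoded view direction) -> first color."""
    def __init__(self, feat_dim=256, dir_dim=24, width=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + dir_dim, width), nn.Softplus(),
                                 nn.Linear(width, 3), nn.Sigmoid())

    def forward(self, z, d_enc):
        return self.net(torch.cat([z, d_enc], dim=-1))

class DynamicMLP(nn.Module):
    """Dynamic perception network: -> (second color, second density, uncertainty coefficient)."""
    def __init__(self, feat_dim=256, dir_dim=24, width=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(feat_dim + dir_dim, width), nn.Softplus())
        self.color = nn.Sequential(nn.Linear(width, 3), nn.Sigmoid())
        self.sigma = nn.Sequential(nn.Linear(width, 1), nn.Softplus())
        self.beta = nn.Sequential(nn.Linear(width, 1), nn.Softplus())

    def forward(self, z, d_enc):
        h = self.body(torch.cat([z, d_enc], dim=-1))
        return self.color(h), self.sigma(h), self.beta(h)
```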
In one embodiment, the target static sensing network and the target dynamic sensing network are trained in the following manner:
obtaining K first images, wherein the K first images are images shot by a camera for a first scene according to K camera poses, and K is an integer larger than 1;
Determining K camera poses based on the K first images, wherein the K camera poses comprise camera poses corresponding to the K first images respectively;
for each of the K camera poses, performing spatial point sampling on L first rays emitted by the camera under the camera pose to obtain L spatial sampling point set coordinates of the camera pose, wherein L is an integer greater than 1;
for each camera pose in the K camera poses, acquiring first predicted voxel densities and second feature vectors of L spatial sampling point sets corresponding to the camera poses, wherein the first predicted voxel densities and the second feature vectors of the L spatial sampling point sets corresponding to the camera poses are as follows: based on the L space sampling point set coordinates corresponding to the camera pose, determining by adopting a first initial perception network;
based on the K first images, the first predicted voxel density, the second feature vector and the view angle direction of the L spatial sampling point sets corresponding to the K camera poses respectively, training the initial static sensing network to obtain a target static sensing network, and training the initial dynamic sensing network to obtain the target dynamic sensing network.
It should be noted that the L spatial sampling point set coordinates of a camera pose may include the spatial sampling point set coordinates corresponding to each of the L first rays under the camera pose, where L may be the same as N. The viewing angle direction of any one of the L spatial sampling point sets is the direction of the ray on which the spatial sampling points are located. It can be understood that, in this embodiment, the initial static sensing network and the initial dynamic sensing network may be trained together, and the K first images may be understood as training sample images. In this embodiment, the camera poses used by the camera to capture the K first images may be determined according to the K first images, and there are various manners of determining a camera pose from images. After the camera poses are determined, the rays of the camera under each camera pose can be determined, and spatial point sampling can be performed on the rays emitted under each camera pose to obtain the spatial sampling point set coordinates corresponding to each of the L first rays corresponding to each camera pose.
It should be noted that the second feature vector may also be referred to as a second spatial feature vector, and the second feature vector of a certain spatial sampling point may be used to represent spatial feature information of the spatial sampling point.
The first predicted voxel density and the second feature vector of the L spatial sampling point sets corresponding to the camera pose may be obtained by the first initial perceptual network based on the L spatial sampling point set coordinates corresponding to the camera pose. The initial static sensing network can be trained to obtain a target static sensing network through the K first images, the first predicted voxel density, the second feature vector and the view angle direction of the L spatial sampling point sets corresponding to the K camera poses respectively, and the initial dynamic sensing network can be trained to obtain the target dynamic sensing network.
In this embodiment, by sampling spatial points of rays emitted under the camera poses of the K first images, the initial static sensing network and the initial dynamic sensing network are trained by using the K first images, the first predicted voxel densities, the second feature vectors and the view angles of the L spatial sampling point sets corresponding to the camera poses of the K first images, respectively, so as to improve the performance of the sensing network obtained by training.
In one example, the first initial sensing network may obtain, in advance, the first predicted voxel densities and second feature vectors of the L spatial sampling point sets corresponding to a camera pose based on the L spatial sampling point set coordinates under that camera pose, and the first predicted voxel densities and second feature vectors of the L spatial sampling point sets corresponding to the camera pose are then called during the training of the initial static sensing network and the initial dynamic sensing network. In another example, once the L spatial sampling point set coordinates under the camera pose are obtained by sampling, the L spatial sampling point set coordinates under the camera pose may be input into the first initial sensing network, so as to obtain the first predicted voxel densities and second feature vectors of the L spatial sampling point sets corresponding to the camera pose output by the first initial sensing network.
In one embodiment, training the initial static sensing network to obtain a target static sensing network and training the initial dynamic sensing network to obtain a target dynamic sensing network based on the K first images, the first predicted voxel density, the second feature vector and the view angle direction of the L spatial sampling point sets corresponding to the K camera poses respectively, includes:
Inputting a second feature vector corresponding to each camera pose of the K camera poses and the view angle direction of the L spatial sampling point sets of the camera poses into an initial static sensing network, and performing color prediction through the initial static sensing network to obtain first predicted colors of the L spatial sampling point sets of the camera poses;
inputting a second feature vector corresponding to each camera pose in the K camera poses and the view angle direction of the L spatial sampling point sets of the camera poses into an initial dynamic sensing network, and predicting voxel density and color through the initial dynamic sensing network to obtain second predicted voxel density and second predicted color of the L spatial sampling point sets of the camera poses;
rendering based on the first prediction voxel density, the second prediction voxel density, the first prediction color and the second prediction color of the L spatial sampling point sets corresponding to the K camera poses respectively, and generating K prediction rendering images;
training the initial static sensing network and the initial dynamic sensing network according to the K first images and the K predicted rendering images to obtain a target static sensing network and a target dynamic sensing network.
The second feature vectors corresponding to a camera pose are the second feature vectors of the L spatial sampling point sets corresponding to that camera pose. In the training process of the initial static sensing network, the second feature vectors corresponding to the camera pose and the viewing angle directions of the L spatial sampling point sets of the camera pose can be input into the initial static sensing network, and the initial static sensing network performs color prediction based on them, so that the first predicted colors of the L spatial sampling point sets of the camera pose can be obtained. In the training process of the initial dynamic sensing network, the second feature vectors corresponding to the camera pose and the viewing angle directions of the L spatial sampling point sets of the camera pose can be input into the initial dynamic sensing network, and the initial dynamic sensing network predicts the voxel density and the color based on them, so that the second predicted voxel densities and second predicted colors of the L spatial sampling point sets of the camera pose can be obtained. For each of the K camera poses, the first predicted voxel densities, second predicted voxel densities, first predicted colors and second predicted colors of the L spatial sampling point sets corresponding to that camera pose can be obtained through the above process.
In this embodiment, the first prediction voxel density, the second prediction voxel density, the first prediction color and the second prediction color of the L spatial sampling point sets corresponding to the K camera poses respectively may be rendered, so as to generate prediction rendering images corresponding to the K camera poses (i.e., generate K prediction rendering images), where it may be understood that the colors of the pixels in the prediction rendering images are predicted colors, and the colors of the pixels in the first image may be understood as real colors, so that the initial static sensing network and the initial dynamic sensing network may be trained based on the K first images and the K prediction rendering images, to obtain the target static sensing network and the target dynamic sensing network.
The color of each pixel point in the predicted rendering image under a camera pose is the predicted rendering color corresponding to one of the L first rays under the camera pose. In addition, it should be noted that the predicted rendering color of each of the L first rays under the camera pose may be obtained by rendering based on the first predicted voxel density, the second predicted voxel density, the first predicted color and the second predicted color of the spatial sampling point set corresponding to that first ray. The manner of rendering to obtain the predicted rendering color of a first ray is similar to the manner of rendering to obtain the rendering color of a ray; the difference is that the former is obtained by rendering based on the first predicted voxel density, the second predicted voxel density, the first predicted color and the second predicted color of the spatial sampling point set corresponding to the first ray under the camera pose, while the latter is obtained by rendering based on the first voxel density, the second voxel density, the first color and the second color of the spatial sampling point set corresponding to the ray under the target pose.
In this embodiment, during the network training process, the initial static sensing network performs color prediction to obtain the first predicted colors, and the initial dynamic sensing network performs voxel density and color prediction to obtain the second predicted voxel densities and second predicted colors; rendering is performed using the first predicted voxel densities, second predicted voxel densities, first predicted colors and second predicted colors to generate the K predicted rendering images, so that network training can be performed using the predicted rendering images, in which the colors of the pixel points are predicted colors, and the first images, in which the colors of the pixel points are true colors, thereby improving the performance of the trained sensing networks.
In one example, the predicted rendering color of a first ray may be obtained by rendering as follows:

$$\hat{C}_j(r')=\sum_{k=1}^{K'} T_j(t'_k)\Big(\alpha\big(\sigma_j(t'_k)\,\delta'_{jk}\big)\,c_j(t'_k)+\alpha\big(\sigma_j^{(\tau)}(t'_k)\,\delta'_{jk}\big)\,c_j^{(\tau)}(t'_k)\Big),\qquad T_j(t'_k)=\exp\Big(-\sum_{k'=1}^{k-1}\big(\sigma_j(t'_{k'})+\sigma_j^{(\tau)}(t'_{k'})\big)\,\delta'_{jk'}\Big)$$

where r' represents any first ray emitted under the j-th camera pose of the K camera poses, $\hat{C}_j(r')$ represents the predicted rendering color of the first ray r' emitted under the j-th camera pose, $t'_k$ represents the k-th spatial sampling point on the first ray r', $t'_{k'}$ represents the k'-th spatial sampling point on the first ray r', K' represents the total number of spatial sampling points on the first ray r', $T_j(t'_k)$ represents the cumulative transmittance at $t'_k$, $\sigma_j(t'_k)$ represents the first predicted voxel density of the spatial sampling point $t'_k$, $\delta'_{jk}=t'_{k+1}-t'_k$ represents the distance between the spatial sampling points $t'_{k+1}$ and $t'_k$, $\sigma_j(t'_{k'})$ represents the first predicted voxel density of the spatial sampling point $t'_{k'}$, $\delta'_{jk'}=t'_{k'+1}-t'_{k'}$ represents the distance between the spatial sampling points $t'_{k'+1}$ and $t'_{k'}$, $c_j(t'_k)$ represents the first predicted color of the spatial sampling point $t'_k$, $\sigma_j^{(\tau)}(t'_k)$ and $\sigma_j^{(\tau)}(t'_{k'})$ represent the second predicted voxel densities of the spatial sampling points $t'_k$ and $t'_{k'}$, $c_j^{(\tau)}(t'_k)$ represents the second predicted color of the spatial sampling point $t'_k$, and $\alpha(y)=1-\exp(-y)$. In addition, j may also denote the index of the image among the K first images, and the spatial sampling points $t'_1,\ldots,t'_{K'}$ form the set of spatial sampling points on the ray r' from the near plane to the far plane of the camera.
In one embodiment, in the training process of the initial static sensing network and the initial dynamic sensing network, the initial dynamic sensing network further outputs uncertainty coefficients of L spatial sampling point sets of the camera pose, the training loss value adopted is a loss value determined at least through K predicted rendering images, K first images and the uncertainty coefficients, and the uncertainty coefficient of any spatial sampling point is used for representing the probability that any spatial sampling point is a dynamic scene.
It should be noted that, the uncertainty coefficient of a spatial sampling point may represent the probability that the spatial sampling point is a dynamic scene (it may also be understood that the greater the uncertainty coefficient is, the greater the probability that the spatial sampling point is a dynamic object). The network training process can be understood as a process of continuously adjusting parameters of the network to be trained through the training loss value, and under the condition that training is completed, parameters of the network are determined, so that the trained network is obtained. In this embodiment, the training loss value used may be determined at least by the K predicted rendered images, the K first images, and the uncertainty coefficient, that is, the training loss value is not only related to the K predicted rendered images and the K first images, but also related to the uncertainty coefficient of the probability of being characterized as a dynamic scene, which may improve the rendering capability of the network to the dynamic scene, and improve the performance of the network obtained by training.
In one example, the training loss value is the sum of K sub-loss values, and the j-th sub-loss value of the K sub-loss values may be determined as follows:

$$L_j=\sum_{r'\in R'}\left[\frac{\big\|\hat{C}_j(r')-C_j(r')\big\|_2^2}{2\,\beta_j(r')^2}+\frac{\log \beta_j(r')^2}{2}+\frac{\lambda_u}{K_{r'}}\sum_{k=1}^{K_{r'}}\sigma_j^{(\tau)}(t'_k)\right]$$

where $L_j$ is the j-th sub-loss value and may represent the loss value between the predicted rendering image corresponding to the j-th camera pose and the first image corresponding to the j-th camera pose (i.e., the j-th first image of the K first images), R' represents the set of first rays emitted under the j-th camera pose, r' represents one first ray emitted under the j-th camera pose, $\hat{C}_j(r')$ represents the predicted rendering color of the first ray r' emitted under the j-th camera pose, $C_j(r')$ represents the true color of the first ray r' emitted under the j-th camera pose (i.e., the color of the pixel point corresponding to the first ray r' in the first image corresponding to the j-th camera pose), $\beta_j(r')$ represents the sum of the uncertainty coefficients of the spatial sampling points in the spatial sampling point set corresponding to the first ray r', $K_{r'}$ represents the total number of spatial sampling points corresponding to the first ray r', $\lambda_u$ represents a balance coefficient, which is a constant factor preset according to the actual scene or historical experience, and $\sigma_j^{(\tau)}(t'_k)$ represents the second predicted voxel density of the first ray r' at the spatial sampling point $t'_k$.
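A minimal PyTorch sketch of this per-ray sub-loss is given below, assuming the predicted colors, true colors, per-ray uncertainty coefficients and second predicted voxel densities have already been computed for a batch of rays; the function name and the value of the balance coefficient are illustrative assumptions.

```python
import torch

def sub_loss(pred_color, true_color, beta_ray, sigma_dynamic, lambda_u=0.01):
    """Sub-loss over the rays of one first image (the j-th image).
    pred_color/true_color: (R, 3) predicted and ground-truth pixel colors;
    beta_ray: (R,) accumulated uncertainty coefficient of each ray;
    sigma_dynamic: (R, K) second predicted voxel densities of each ray's samples;
    lambda_u: balance coefficient (illustrative value)."""
    color_term = ((pred_color - true_color) ** 2).sum(-1) / (2.0 * beta_ray ** 2)
    beta_term = torch.log(beta_ray ** 2) / 2.0
    density_term = lambda_u * sigma_dynamic.mean(-1)   # (lambda_u / K_r) * sum_k sigma^tau(t'_k)
    return (color_term + beta_term + density_term).sum()
```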
In one embodiment, the first target sensing network is trained in the following manner:
acquiring L space sampling point set coordinates of a camera in each of the K camera poses, wherein L, K is an integer greater than 1;
and inputting the L space sampling point set coordinates corresponding to the K camera poses respectively into a first initial perception network for training to obtain a first target perception network.
The coordinates of the L spatial sampling point sets under the pose of the camera are as follows: and carrying out space point sampling on L first rays sent by the camera under the camera pose, wherein K camera poses are camera poses corresponding to K first images shot by the camera.
In the training process of the first initial perception network, L space sampling point set coordinates corresponding to the K different camera poses respectively can be input into the first initial perception network to train the first initial perception network, so that the network training effect is improved. In the first target sensing network training process, the determining process of the obtained L space sampling point set coordinates under the camera pose may be the same as the determining process of the L space sampling point set coordinates under the camera pose in the initial static sensing network and the initial dynamic sensing network training process.
In one example, in a first initial perceptual network training process, a first initial perceptual network is configured to obtain a first predicted voxel density and a second feature vector of L spatial sampling point sets of a camera pose based on coordinates of the L spatial sampling point sets of the camera pose. For each of the K camera poses, a first predicted voxel density and a second feature vector of the corresponding L spatial sampling point sets can be obtained through a first initial perception network, namely the first predicted voxel density and the second feature vector of the L spatial sampling point sets corresponding to the K camera poses respectively can be obtained. It should be noted that, the first predicted voxel density and the second feature vector of the L spatial sampling point sets corresponding to the K camera poses obtained by the first initial sensing network may be used for training the target static sensing network and the target dynamic sensing network.
Additionally, in one example, the loss value employed in the training process of the first initial sensing network may be the training loss value employed in the training processes of the initial static sensing network and the initial dynamic sensing network described above.
The procedure of the above-described image processing method will be specifically described in the following with specific examples.
As shown in fig. 2, the overall flow of the scheme of this embodiment is as follows: firstly, an image set shot by the camera in a first scene under K camera poses is acquired, where the image set includes K first images; then a neural radiance field model (the perception networks, which may be multi-layer perceptron networks) is trained based on the K first images and the K camera poses of the camera; target rendering images under the specified M different poses are generated by using the trained neural radiance field model; and the obtained M target rendering images are combined into a spliced image to obtain a panoramic image. The specific process is as follows:
Firstly, an image set shot by the camera under K camera poses in the first scene is acquired:
as shown in fig. 3, in the image capturing stage, a first scene is captured by a camera under predetermined K camera poses, so as to obtain a set of sRGB images, where K first images may be included and correspond to the K camera poses one by one. For example, the camera shoots around a shooting scene, and the camera shooting points can be arranged according to actual requirements.
Since the human eye is able to discern small relative differences between dark and bright areas in an image, gamma compression in the sRGB color space more closely fits what the human eye can resolve: it optimizes the final image encoding by clipping values outside [0, 1] and applying a non-linear curve to the signal, using more bits for dark areas at the cost of compressing bright highlights. In addition to gamma compression, a tone mapping algorithm can be used to better preserve the contrast of a high dynamic range scene when the sRGB color space image is quantized to 8 bits. Therefore, in this embodiment, sRGB images are employed, and in the case where the captured original images are not sRGB images, the original images can be converted into sRGB images;
for example, the conversion from the camera's original RGB space to the sRGB space can be obtained by the following formula:
$$\mathrm{sRGB}=M_Z\cdot \mathrm{RGB}_{\mathrm{raw}}$$
where $\mathrm{RGB}_{\mathrm{raw}}$ in the formula is the original RGB image acquired by the camera, and sRGB is the image obtained by converting $\mathrm{RGB}_{\mathrm{raw}}$ into the sRGB color space;
where $M_Z$ is the color conversion matrix corresponding to the camera under a D65 light source; under a D50 light source, the camera corresponds to a different $M_Z$ matrix.
if the original image captured by the camera is RGB raw Space, then the above formula can be directly used to convert directly to sRGB space. If the camera acquires an image in the Adobe DNG format, an exiftool tool can be used for inquiring EXIF metadata data information to acquire RGB raw space-to-Adobe DNG color space conversion matrix C ccm Then to convert to sRGB space, it is converted by the following formula:
$$\mathrm{sRGB}=M_Z\cdot C_{ccm}^{-1}\cdot \mathrm{RGB}_{\mathrm{DNG}}$$
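As a rough illustration of the two conversion paths above, the NumPy sketch below applies the matrix transforms; the matrices M_Z and C_ccm are camera- and illuminant-specific and are not reproduced here, and the clipping step only mimics the [0, 1] range limitation (the sRGB non-linear curve is omitted).

```python
import numpy as np

def raw_to_srgb(rgb_raw, M_Z):
    """Linear conversion from camera-raw RGB to the sRGB colour space: sRGB = M_Z @ RGB_raw.
    rgb_raw: (..., 3) image; M_Z: 3x3 camera/illuminant-specific matrix (not reproduced here)."""
    return np.clip(rgb_raw @ M_Z.T, 0.0, 1.0)

def dng_to_srgb(rgb_dng, M_Z, C_ccm):
    """Adobe DNG path: undo the DNG colour matrix first, then map to sRGB,
    i.e. sRGB = M_Z @ inv(C_ccm) @ RGB_DNG."""
    return np.clip(rgb_dng @ np.linalg.inv(C_ccm).T @ M_Z.T, 0.0, 1.0)
```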
Secondly, a neural radiance field model is trained based on the K first images and the K camera poses of the camera:
1) Information preprocessing: firstly, the K camera poses of the camera are obtained, that is, the camera pose corresponding to each first image is obtained by a related algorithm, for example Structure-from-Motion (SfM), which reconstructs the 3D geometry from pictures shot at different angles and optimizes the camera poses by bundle adjustment to calculate the camera pose (a parameter of the camera); other parameters of the camera (including but not limited to the image size, the focal length and the like) are also obtained;
2) Training the neural radiance field (NeRF):
it should be noted that the original NeRF constructs a multi-layer perceptron (MLP) network, and the model expression is as follows
$$F_\Theta:(\mathbf{x},\mathbf{d})\rightarrow(\mathbf{c},\sigma)$$
where (x, d) represents the coordinates (3D coordinate positions) and viewing angle directions of the spatial sampling points, and (c, σ) represents the colors and voxel densities corresponding to the spatial sampling points;
to calculate the color of each pixel, neRF approximates the volume rendering integral using numerical integration, and emits a ray from the camera view along each pixel, where the ray may be represented by r (t) =o+td, where o represents the optical center of the camera (also understood as the origin of the camera/camera position), d represents the view direction, t is a real number greater than zero, and may represent a spatial point on the ray, and then the pixel color of a ray r' may be approximated as:
The embodiment of the application performs network structure optimization on the basis of the original NeRF theory, as shown in fig. 5; the specific network training flow is as follows:
S-2-1: The original NeRF compresses the LDR values of captured images to the [0, 1] range, which results in a loss of bright detail and also biases the noise distribution of the pixels. Therefore, the embodiment of the application adopts images in the sRGB color space;
S-2-2: A first initial MLP network (i.e., the first initial perception network), also referred to as the reconstruction network, is constructed. The structure of an exemplary first initial MLP network is shown in fig. 4; it should be noted that the network structure shown in fig. 4 is merely illustrative, and in practice the number of layers of the network and the like do not have to follow the figure. The coordinates (three-dimensional (3D) coordinates) of a spatial sampling point t are input into the first initial MLP network after high-frequency encoding, and the first predicted voxel density σ(t) corresponding to the spatial sampling point t and a second feature vector z(t) are output. For example, as shown in fig. 4, the first initial MLP network may include 10 fully connected layers connected in sequence; the encoded coordinates of the spatial sampling point t are input into the first initial MLP network and, after being processed by the first 8 fully connected layers, the first predicted voxel density σ(t) corresponding to the spatial sampling point t is output (which may be implemented by a (256, 1) fully connected layer) together with the second feature vector z(t) (which may be, for example, a 256-dimensional feature vector). There are various methods for high-frequency encoding, such as, but not limited to, encoding with a sine and cosine periodic function, as follows:
$$\gamma(p)=\big(\sin(2^0\pi p),\ \cos(2^0\pi p),\ \ldots,\ \sin(2^{L-1}\pi p),\ \cos(2^{L-1}\pi p)\big)$$
where p can represent the coordinate value of each dimension of a spatial sampling point coordinate, or the value of each dimension of the viewing angle direction, so that both the sampling point coordinates and the viewing angle direction can be encoded by the above formula. For example, each of the three dimensions of the spatial sampling point coordinates may be encoded in the above manner, where L is a settable hyper-parameter, that is, a preset positive integer whose value can be set according to the actual situation; for example, L may be 10 when encoding the spatial sampling point coordinates and 4 when encoding the viewing angle direction. In fig. 4, γ(x) (also written enc(x)) may represent the result of encoding the coordinate x of a spatial sampling point: x is a three-dimensional coordinate, each dimension is encoded into 2×L values, and the encoded coordinate therefore has 2×L×3 dimensions; for example, when L is 10, the 60 in fig. 4 represents the dimension of the encoded coordinate γ(x). Similarly, γ(d) may represent the result of encoding the viewing angle direction d corresponding to the coordinate x, and the 24 in fig. 4 represents the dimension of the encoded viewing angle direction γ(d).
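A short NumPy sketch of this high-frequency encoding (applied independently to every dimension of a coordinate or viewing direction) might look as follows; the function name is illustrative.

```python
import numpy as np

def positional_encoding(p, L=10):
    """High-frequency encoding gamma(p) applied to each dimension of p independently.
    p: (..., D) coordinates (D=3) or view directions; L=10 for coordinates and
    L=4 for view directions in the example above. Output dimension is 2*L*D."""
    freqs = 2.0 ** np.arange(L) * np.pi                              # 2^0*pi ... 2^(L-1)*pi
    scaled = p[..., None] * freqs                                    # (..., D, L)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (..., D, 2L)
    return enc.reshape(*p.shape[:-1], -1)                            # (..., 2*L*D)
```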
In addition, in order to improve the network effect, the ReLU activation function adopted in the original NeRF is replaced, and the embodiment of the application can adopt softplus as the activation function, so that a smoother optimization effect can be obtained;
S-2-3: A second initial MLP network MLP_static (i.e., the initial static perception network), which may also be referred to as the static network, is constructed and used to identify static objects in the scene. The viewing angle direction corresponding to a spatial sampling point is encoded (for example, with the sine and cosine periodic function above) to obtain the encoded viewing direction enc(d), which is combined with the second feature vector z(t) output in the middle of the reconstruction network as the input of the static network; a predicted color (which may be an RGB color) of the spatial sampling point in the scene is then obtained by using a sigmoid activation function according to the following formula, and may be used to represent the color of static objects:
$$c_j(t)=\mathrm{MLP}_{static}\big(z(t),\ \mathrm{enc}(d)\big)$$
where j represents the index of the first image in the image set during training, and $c_j(t)$ represents the first predicted color corresponding to the spatial sampling point t for the j-th first image. The first predicted colors of all spatial sampling points on a ray are used to determine the predicted rendering color of that ray, i.e., the predicted color of the corresponding pixel point, and the predicted color of every pixel point is obtained through a similar process;
S-2-4: A third initial MLP network MLP_dynamic (i.e., the initial dynamic perception network), called the dynamic network, is constructed and used to identify dynamically moving objects during shooting. The encoded viewing angle direction enc(d) corresponding to a spatial sampling point and the second feature vector z(t) output by the reconstruction network are combined as the input of the dynamic network, and the second predicted color $c_j^{(\tau)}(t)$ and the second predicted voxel density $\sigma_j^{(\tau)}(t)$ of the spatial sampling point t for the j-th image are obtained according to the following formula; the dynamic network also outputs a corresponding uncertainty coefficient $\beta_j(t)$, which weights the characterization of dynamically moving objects:

$$\big(c_j^{(\tau)}(t),\ \sigma_j^{(\tau)}(t),\ \beta_j(t)\big)=\mathrm{MLP}_{dynamic}\big(z(t),\ \mathrm{enc}(d)\big)$$
s-2-5: redefining a volume rendering equation, defining a loss function, combining the MLP networks, and training a nerve radiation field model of the whole process:
for example, the volume rendering equation is defined as follows:
For the j-th camera pose, each first ray emitted under that pose is rendered through this rendering equation to obtain the predicted rendering color of the ray, so that the predicted rendering image corresponding to the j-th camera pose is obtained;
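The application's own rendering equation is the one referred to above; purely as an illustration of how a static branch and a dynamic branch are commonly composited along a ray (NeRF-W-style alpha compositing), a hedged sketch follows. It is an assumption, not the patent's definition:

```python
import torch

def composite_ray(sigma_s, rgb_s, sigma_d, rgb_d, deltas):
    """Alpha-composite static and dynamic contributions along one ray.

    sigma_s, sigma_d: (S,) voxel densities of the S samples on the ray.
    rgb_s, rgb_d:     (S, 3) predicted colors of those samples.
    deltas:           (S,) distances between consecutive samples.
    Returns the rendered (predicted) RGB color of the corresponding pixel.
    """
    alpha_s = 1.0 - torch.exp(-sigma_s * deltas)
    alpha_d = 1.0 - torch.exp(-sigma_d * deltas)
    # Transmittance accumulates the combined density of both branches.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(deltas[:1]),
                   torch.exp(-(sigma_s + sigma_d) * deltas)]), dim=0
    )[:-1]
    weighted = trans[:, None] * (alpha_s[:, None] * rgb_s + alpha_d[:, None] * rgb_d)
    return weighted.sum(dim=0)
```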
In the training process, the sub-loss value corresponding to the j-th predicted rendering image is defined as follows:
It should be noted that, during network inference, the uncertainty coefficient β_j(t) is used as the basis for removing object color values with high uncertainty, which helps to address problems such as ghosting and occlusion;
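The sub-loss formula itself is likewise not reproduced here; as a hedged example of how an uncertainty coefficient of this kind typically enters a photometric loss (pixels dominated by dynamic content are down-weighted, while a log term keeps β from growing without bound), one possible form is:

```python
import torch

def uncertainty_weighted_subloss(pred_rgb, gt_rgb, beta, eps=1e-3, reg=0.01):
    """pred_rgb, gt_rgb: (P, 3) predicted / ground-truth pixel colors of image j.
    beta: (P,) per-pixel uncertainty aggregated from the sampled points on each ray."""
    beta = beta + eps                                    # keep the denominator positive
    recon = ((pred_rgb - gt_rgb) ** 2).sum(dim=-1) / (2.0 * beta ** 2)
    return (recon + reg * torch.log(beta)).mean()
```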
Through training in the above manner, the corresponding first target perception network, target static perception network and target dynamic perception network can be obtained;
3) Generating new view angle images (i.e. target rendered images) for 6 specified orientations from the trained neural radiance field model:
Six target poses of the camera are selected, namely Front, Back, Left, Right, Top (Up) and Down, and the images under these six poses are rendered through the three trained perception networks, obtaining 6 target rendering images as shown in figs. 6-7, which together form a set of cube map faces (i.e. a cube map);
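As a sketch of how the six orientations could be specified, the snippet below builds one rotation per cube face; the axis convention, up vectors and 90-degree field of view are assumptions, since the exact camera convention is not stated in the text:

```python
import numpy as np

# Illustrative forward and up vectors for the six cube-map faces (y is treated as "up").
CUBE_FACES = {
    "front": (np.array([0.0, 0.0,  1.0]), np.array([0.0, 1.0,  0.0])),
    "back":  (np.array([0.0, 0.0, -1.0]), np.array([0.0, 1.0,  0.0])),
    "left":  (np.array([-1.0, 0.0, 0.0]), np.array([0.0, 1.0,  0.0])),
    "right": (np.array([ 1.0, 0.0, 0.0]), np.array([0.0, 1.0,  0.0])),
    "top":   (np.array([0.0,  1.0, 0.0]), np.array([0.0, 0.0, -1.0])),
    "down":  (np.array([0.0, -1.0, 0.0]), np.array([0.0, 0.0,  1.0])),
}

def look_at_rotation(forward, up):
    """World-from-camera rotation whose third column is the viewing direction."""
    f = forward / np.linalg.norm(forward)
    r = np.cross(up, f)
    r = r / np.linalg.norm(r)
    u = np.cross(f, r)
    return np.stack([r, u, f], axis=1)   # columns: right, up, forward

# Rendering one 90-degree-FOV image per rotation (with a shared camera position)
# yields the six target rendered images that make up the cube map.
rotations = {name: look_at_rotation(fwd, up) for name, (fwd, up) in CUBE_FACES.items()}
```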
4) Combining the obtained 6 target rendering images into a complete spherical panorama (i.e. a spliced image);
As shown in fig. 8, within the space formed by the cube map, a spherical polar coordinate representation can be used to re-project the faces and obtain the equivalent spherical panorama.
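A minimal sketch of this re-projection, assuming an equirectangular target panorama and one particular cube-face sign convention (other conventions differ only in signs): each panorama pixel is mapped to a longitude and latitude, converted to a unit direction, and the dominant axis of that direction selects the cube face and the in-face sampling coordinates.

```python
import numpy as np

def equirect_to_direction(u, v, width, height):
    """Pixel (u, v) of a width x height spherical panorama -> unit view direction."""
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi      # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi     # latitude in [pi/2, -pi/2]
    return np.array([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)])

def cube_face_and_uv(d):
    """Pick the cube-map face hit by direction d and the in-face coordinates in [0, 1]."""
    ax, ay, az = np.abs(d)
    if ax >= ay and ax >= az:
        face, u, v, m = ("right" if d[0] > 0 else "left"), -d[2] * np.sign(d[0]), -d[1], ax
    elif ay >= ax and ay >= az:
        face, u, v, m = ("top" if d[1] > 0 else "down"), d[0], d[2] * np.sign(d[1]), ay
    else:
        face, u, v, m = ("front" if d[2] > 0 else "back"), d[0] * np.sign(d[2]), -d[1], az
    return face, 0.5 * (u / m + 1.0), 0.5 * (v / m + 1.0)

# Filling every panorama pixel by sampling the selected face of the cube map
# (e.g. bilinearly) produces the stitched spherical panorama.
```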
The embodiment of the application provides a network structure and method that trains a neural radiance field from images in the sRGB space and renders an HDRI: three MLP networks are attached to the neural radiance field at the same time, and the HDR image is inferred from the static network, which avoids the failure cases of traditional HDRI synthesis; at the same time, the dynamic network is combined to address the dynamic ghosting caused by movement during shooting and to correct occluded colors, so that the panoramic image is generated in an optimized way. The image processing method provided by the embodiment of the application can therefore be used to produce the HDRI images required in AR/VR generation and similar scenarios.
The embodiment of the application exploits the characteristics of the neural radiance field and generates rendered images under arbitrary poses through post-processing inference, which reduces the failures of HDRI synthesis in the related art and alleviates problems such as ghosting and occlusion caused by motion during HDRI capture; at the same time, because the neural radiance field can generate an image under any pose, the embodiment of the application can generate an HDRI image under any specified pose to meet special requirements.
Referring to fig. 9, fig. 9 is a schematic block diagram of an image processing apparatus 900 according to an embodiment of the present application, which is applicable to an electronic device; as shown in fig. 9, the apparatus includes:
the first sampling module 901 is configured to, for each target pose of the M target poses of the camera, perform spatial point sampling on the N rays emitted by the camera under the target pose to obtain N spatial sampling point set coordinates corresponding to the target pose, where M and N are integers greater than 1;
the first obtaining module 902 is configured to obtain a first voxel density and a first feature vector of N spatial sampling point sets corresponding to the target pose based on N spatial sampling point set coordinates corresponding to the target pose;
the second obtaining module 903 is configured to determine a first color of the N spatial sampling point sets corresponding to the target pose based on the first feature vector corresponding to the target pose and a view direction of the N spatial sampling point sets corresponding to the target pose, and determine a second voxel density and a second color of the N spatial sampling point sets corresponding to the target pose;
The rendering module 904 is configured to perform rendering based on the first voxel density, the second voxel density, the first color, and the second color, and generate target rendered images corresponding to M target poses;
and the stitching module 905 is configured to stitch the target rendering images corresponding to the M target poses to obtain a stitched image.
In one embodiment, the color of each pixel point in the target rendering image under the target pose is a rendering color corresponding to one ray of the N rays under the target pose, wherein the rendering color of any ray is obtained by rendering based on the first voxel density, the second voxel density, the first color and the second color of the spatial sampling point set corresponding to any ray.
In one embodiment, obtaining a first voxel density and a first feature vector of N sets of spatial sampling points corresponding to a target pose includes: acquiring first voxel densities and first feature vectors of N space sampling point sets corresponding to target pose by adopting a first target perception network;
determining a first color of N spatial sampling point sets corresponding to target pose, including: determining a first color of N space sampling point sets corresponding to the target pose by adopting a target static sensing network;
Determining a second voxel density and a second color of the N sets of spatial sampling points corresponding to the target pose, including: and determining a second voxel density and a second color of the N space sampling point sets corresponding to the target pose by adopting a target dynamic perception network.
In one embodiment, the target static-aware network and the target dynamic-aware network are trained by:
obtaining K first images, wherein the K first images are images shot by a camera for a first scene according to K camera poses, and K is an integer larger than 1;
determining K camera poses based on the K first images, wherein the K camera poses comprise camera poses corresponding to the K first images respectively;
for each of the K camera poses, performing spatial point sampling on L first rays emitted by the camera under the camera pose to obtain L spatial sampling point set coordinates of the camera pose, wherein L is an integer greater than 1 (a sketch of this sampling step is given after these training steps);
for each camera pose in the K camera poses, acquiring first predicted voxel densities and second feature vectors of the L spatial sampling point sets corresponding to the camera pose, wherein the first predicted voxel densities and the second feature vectors of the L spatial sampling point sets corresponding to the camera pose are determined by using a first initial perception network based on the L spatial sampling point set coordinates corresponding to the camera pose;
Based on the K first images, the first predicted voxel density, the second feature vector and the view angle direction of the L spatial sampling point sets corresponding to the K camera poses respectively, training the initial static sensing network to obtain a target static sensing network, and training the initial dynamic sensing network to obtain the target dynamic sensing network.
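The spatial point sampling on rays referred to in these steps can be sketched as stratified sampling along each ray between a near and a far bound; the bounds and sample count below are illustrative assumptions:

```python
import torch

def sample_points_on_rays(origins, directions, near=0.1, far=6.0, num_samples=64, perturb=True):
    """Stratified sampling of spatial points along rays (illustrative).

    origins, directions: (R, 3) ray origins and unit directions for R rays.
    Returns points of shape (R, num_samples, 3) and their depths (R, num_samples).
    """
    t = torch.linspace(near, far, num_samples).expand(origins.shape[0], num_samples).clone()
    if perturb:  # jitter every sample inside its stratum for better coverage
        mids = 0.5 * (t[:, 1:] + t[:, :-1])
        lower = torch.cat([t[:, :1], mids], dim=-1)
        upper = torch.cat([mids, t[:, -1:]], dim=-1)
        t = lower + (upper - lower) * torch.rand_like(t)
    points = origins[:, None, :] + t[..., None] * directions[:, None, :]
    return points, t
```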
In one example, the L spatial sampling point set coordinates of the camera pose include spatial sampling point set coordinates corresponding to each of the L first rays under the camera pose.
In one embodiment, training the initial static sensing network to obtain a target static sensing network and training the initial dynamic sensing network to obtain a target dynamic sensing network based on the K first images, the first predicted voxel density, the second feature vector and the view angle direction of the L spatial sampling point sets corresponding to the K camera poses respectively, includes:
inputting a second feature vector corresponding to each camera pose of the K camera poses and the view angle direction of the L spatial sampling point sets of the camera poses into an initial static sensing network, and performing color prediction through the initial static sensing network to obtain first predicted colors of the L spatial sampling point sets of the camera poses;
Inputting a second feature vector corresponding to each camera pose in the K camera poses and the view angle direction of the L spatial sampling point sets of the camera poses into an initial dynamic sensing network, and predicting voxel density and color through the initial dynamic sensing network to obtain second predicted voxel density and second predicted color of the L spatial sampling point sets of the camera poses;
rendering based on the first prediction voxel density, the second prediction voxel density, the first prediction color and the second prediction color of the L spatial sampling point sets corresponding to the K camera poses respectively, and generating K prediction rendering images;
training the initial static sensing network and the initial dynamic sensing network according to the K first images and the K predicted rendering images to obtain a target static sensing network and a target dynamic sensing network.
In one embodiment, in the training process of the initial static sensing network and the initial dynamic sensing network, the initial dynamic sensing network further outputs uncertainty coefficients of L spatial sampling point sets of the camera pose, the training loss value adopted is a loss value determined at least through K predicted rendering images, K first images and the uncertainty coefficients, and the uncertainty coefficient of any spatial sampling point is used for representing the probability that any spatial sampling point is a dynamic scene.
In one embodiment, the first target awareness network is trained by:
acquiring L spatial sampling point set coordinates of the camera in each of the K camera poses, wherein L and K are integers greater than 1;
and inputting the L space sampling point set coordinates corresponding to the K camera poses respectively into a first initial perception network for training to obtain a first target perception network.
In one example, the L spatial sampling point set coordinates under the camera pose are obtained by performing spatial point sampling on the L first rays emitted by the camera under the camera pose, wherein the K camera poses are the camera poses corresponding to the K first images shot by the camera.
In one embodiment, in the training process of the first initial perception network, the first initial perception network is used for obtaining a first prediction voxel density and a second feature vector of the L spatial sampling point sets of the camera pose based on the L spatial sampling point set coordinates of the camera pose.
The technical features of the image processing apparatus 900 correspond to those of the image processing method, and the image processing apparatus 900 implements each process of the image processing method, and the same effects can be obtained, so that repetition is avoided and no further description is given here.
The embodiment of the application also provides an electronic device, which comprises a processor and a memory, wherein the memory stores a computer program capable of running on the processor, and the computer program realizes each process in the embodiment of the image processing method when being executed by the processor, and can achieve the same technical effect, so that repetition is avoided and redundant description is omitted.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the processes of the above-mentioned image processing method embodiment, and can achieve the same technical effects, so that repetition is avoided, and no further description is given here.
The computer readable storage medium may be, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing an electronic device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (11)

1. An image processing method, comprising:
for each of M target poses of a camera, performing spatial point sampling on N rays emitted by the camera under the target poses to obtain N spatial sampling point set coordinates corresponding to the target poses, wherein M, N is an integer greater than 1;
based on the coordinates of N spatial sampling point sets corresponding to the target pose, obtaining first voxel density and first feature vectors of the N spatial sampling point sets corresponding to the target pose;
based on a first feature vector corresponding to the target pose and the view angles of N space sampling point sets corresponding to the target pose, determining a first color of the N space sampling point sets corresponding to the target pose, and determining a second voxel density and a second color of the N space sampling point sets corresponding to the target pose;
rendering based on the first voxel density, the second voxel density, the first color and the second color to generate target rendering images corresponding to the M target poses;
and splicing the target rendering images corresponding to the M target poses to obtain a spliced image.
2. The method of claim 1, wherein the color of each pixel point in the target rendered image in the target pose is a rendering color corresponding to one of N rays in the target pose, and wherein the rendering color of any ray is rendered based on a first voxel density, a second voxel density, a first color, and a second color of the spatial sampling point set corresponding to the any ray.
3. The method according to claim 1, wherein obtaining the first voxel density and the first eigenvector of the N sets of spatial sampling points corresponding to the target pose comprises: acquiring first voxel densities and first feature vectors of N space sampling point sets corresponding to the target pose by adopting a first target perception network;
the determining the first color of the N spatial sampling point sets corresponding to the target pose includes: determining a first color of N space sampling point sets corresponding to the target pose by adopting a target static sensing network;
the determining the second voxel density and the second color of the N spatial sampling point sets corresponding to the target pose includes: and determining a second voxel density and a second color of the N space sampling point sets corresponding to the target pose by adopting a target dynamic perception network.
4. A method according to claim 3, wherein the target static-aware network and the target dynamic-aware network are trained by:
obtaining K first images, wherein the K first images are images shot by the camera for a first scene according to K camera poses, and K is an integer greater than 1;
Determining the K camera poses based on the K first images, wherein the K camera poses comprise camera poses corresponding to the K first images respectively;
for each camera pose of the K camera poses, performing spatial point sampling on L first rays sent by the camera under the camera pose to obtain L spatial sampling point set coordinates of the camera pose, wherein L is an integer greater than 1;
for each camera pose of the K camera poses, acquiring first predicted voxel densities and second feature vectors of L spatial sampling point sets corresponding to the camera poses, wherein the first predicted voxel densities and the second feature vectors of the L spatial sampling point sets corresponding to the camera poses are as follows: based on the L space sampling point set coordinates corresponding to the camera pose, determining by adopting a first initial perception network;
based on the K first images, the first predicted voxel density, the second feature vector and the view angle direction of the L spatial sampling point sets corresponding to the K camera poses respectively, training an initial static sensing network to obtain the target static sensing network, and training an initial dynamic sensing network to obtain the target dynamic sensing network.
5. The method of claim 4, wherein training the initial static perceptual network to obtain the target static perceptual network and training the initial dynamic perceptual network to obtain the target dynamic perceptual network based on a first predicted voxel density, a second feature vector, and a view direction of L sets of spatial sampling points corresponding to the K first images and the K camera poses, respectively, comprises:
inputting a second feature vector corresponding to each camera pose and the view angle direction of the L spatial sampling point sets of the camera poses into the initial static sensing network aiming at each camera pose in the K camera poses, and carrying out color prediction through the initial static sensing network to obtain first predicted colors of the L spatial sampling point sets of the camera poses;
inputting a second feature vector corresponding to each camera pose and the view angle direction of the L spatial sampling point sets of the camera poses into the initial dynamic sensing network aiming at each camera pose in the K camera poses, and carrying out voxel density and color prediction through the initial dynamic sensing network to obtain second prediction voxel density and second prediction color of the L spatial sampling point sets of the camera poses;
Rendering based on a first prediction voxel density, a second prediction voxel density, the first prediction color and the second prediction color of L spatial sampling point sets respectively corresponding to the K camera poses, and generating K prediction rendering images;
training the initial static sensing network and the initial dynamic sensing network according to the K first images and the K predicted rendering images to obtain the target static sensing network and the target dynamic sensing network.
6. The method of claim 5, wherein during the initial static sensing network and the initial dynamic sensing network training process, the initial dynamic sensing network further outputs uncertainty coefficients of L sets of spatial sampling points of the camera pose, the training loss value used is a loss value determined at least by the K predicted rendered images, the K first images, and the uncertainty coefficients, and the uncertainty coefficient of any spatial sampling point is used to represent a probability that the any spatial sampling point is a dynamic scene.
7. A method according to claim 3, wherein the first target-aware network is trained by:
Acquiring L space sampling point set coordinates of the camera in each of the K camera poses, wherein L, K is an integer greater than 1;
and inputting the L space sampling point set coordinates corresponding to the K camera poses respectively into a first initial perception network for training to obtain the first target perception network.
8. The method of claim 7, wherein during the first initial perceptual network training process, the first initial perceptual network is configured to obtain a first predicted voxel density and a second feature vector for the L sets of spatial sampling points of the camera pose based on the L sets of spatial sampling points coordinates of the camera pose.
9. An image processing apparatus, comprising:
the first sampling module is used for carrying out space point sampling on N rays sent out by the camera in each of M target poses to obtain N space sampling point set coordinates corresponding to the target poses, wherein M, N is an integer larger than 1;
the first acquisition module is used for acquiring first voxel densities and first feature vectors of N space sampling point sets corresponding to the target pose based on the N space sampling point set coordinates corresponding to the target pose;
The second acquisition module is used for determining a first color of the N spatial sampling point sets corresponding to the target pose based on the first feature vector corresponding to the target pose and the view angle direction of the N spatial sampling point sets corresponding to the target pose, and determining a second voxel density and a second color of the N spatial sampling point sets corresponding to the target pose;
the rendering module is used for rendering based on the first voxel density, the second voxel density, the first color and the second color, and generating target rendering images corresponding to the M target poses;
and the splicing module is used for splicing the target rendering images corresponding to the M target poses to obtain spliced images.
10. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, which processor, when executing the computer program, implements the steps of the image processing method according to any of the preceding claims 1-8.
11. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the image processing method of any of claims 1-8.
CN202310827441.8A 2023-07-06 2023-07-06 Image processing method and device, electronic equipment and storage medium Pending CN116843551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310827441.8A CN116843551A (en) 2023-07-06 2023-07-06 Image processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310827441.8A CN116843551A (en) 2023-07-06 2023-07-06 Image processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116843551A true CN116843551A (en) 2023-10-03

Family

ID=88172300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310827441.8A Pending CN116843551A (en) 2023-07-06 2023-07-06 Image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116843551A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332840A (en) * 2023-12-01 2024-01-02 深圳市其域创新科技有限公司 Training method of nerve radiation field, method and device for acquiring target scene image
CN117332840B (en) * 2023-12-01 2024-03-19 深圳市其域创新科技有限公司 Training method of nerve radiation field, method and device for acquiring target scene image

Similar Documents

Publication Publication Date Title
CN110910486B (en) Indoor scene illumination estimation model, method and device, storage medium and rendering method
Wang et al. Deep learning for hdr imaging: State-of-the-art and future trends
KR102227583B1 (en) Method and apparatus for camera calibration based on deep learning
CN112465955B (en) Dynamic human body three-dimensional reconstruction and visual angle synthesis method
CN111292264A (en) Image high dynamic range reconstruction method based on deep learning
KR102141319B1 (en) Super-resolution method for multi-view 360-degree image and image processing apparatus
JP2023056056A (en) Data generation method, learning method and estimation method
CN113344773B (en) Single picture reconstruction HDR method based on multi-level dual feedback
CN111986084A (en) Multi-camera low-illumination image quality enhancement method based on multi-task fusion
CN113902657A (en) Image splicing method and device and electronic equipment
US11961266B2 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN114463230A (en) Image processing method and related equipment
CN118302788A (en) High dynamic range view synthesis from noisy raw images
CN114067007A (en) Image processing method and device and neural network training method and device
CN116843551A (en) Image processing method and device, electronic equipment and storage medium
CN116993826A (en) Scene new view generation method based on local space aggregation nerve radiation field
CN116957931A (en) Method for improving image quality of camera image based on nerve radiation field
CN115661403A (en) Explicit radiation field processing method, device and storage medium
EP4292059A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN117274501B (en) Drivable digital person modeling method, device, equipment and medium
WO2021057091A1 (en) Viewpoint image processing method and related device
CN116912393A (en) Face reconstruction method and device, electronic equipment and readable storage medium
CN115116468A (en) Video generation method and device, storage medium and electronic equipment
CN113763524A (en) Physical optical model and neural network-based dual-flow shot rendering method and system
CN115699073A (en) Neural network supported camera image or video processing pipeline

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination