CN116012515A - Neural radiation field network training method and related equipment

Info

Publication number: CN116012515A
Application number: CN202211716270.3A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: sample point, radiation field, image, field network, ray
Inventors: 白东峰, 王环宇, 刘冰冰
Current assignee: Huawei Technologies Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202211716270.3A
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes

  • Image Analysis (AREA)

Abstract

In the method, in the i-th iteration of training the neural radiation field network, the volume density and color information corresponding to each sample point in the i-th iteration can be obtained through the neural radiation field network of the i-th iteration according to the context information corresponding to each of a plurality of sample points corresponding to a first view angle; then, according to the volume density and color information corresponding to each sample point in the i-th iteration, a first output image corresponding to the first view angle in the i-th iteration is obtained through volume rendering; and based on the first output image and the first image, it is determined whether training of the neural radiation field network is complete. In this way, the trained neural radiation field network can fuse the context information among the sample points and learn more accurate feature information of the sample points, such as volume density and color information, thereby producing high-quality generated images.

Description

Neural radiation field network training method and related equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a neural radiation field network training method and related equipment.
Background
In a variety of application scenarios, such as autonomous driving, gaming, virtual reality, and augmented reality, it is often necessary to render images of a particular scene from new view angles.
Currently, one conventional way to obtain a new-view image is to model a three-dimensional scene based on computer graphics and render the three-dimensional scene model with a rendering engine to obtain image data at a specific view angle. In this scheme, the image quality of the new view depends on the precision of the three-dimensional model and the capability of the rendering engine, and generating a large number of new-view images requires a large amount of resources, so the conventional new-view image generation process is complex and inefficient.
Disclosure of Invention
The application provides a neural radiation field network training method; the trained neural radiation field network can accurately and efficiently obtain images from a new view angle in a designated scene. The application also provides corresponding apparatus, devices, computer-readable storage media, computer program products, and the like.
The first aspect of the application provides a neural radiation field network training method, which comprises the following steps: training the neural radiation field network according to a plurality of images and the view angle direction corresponding to each image, to obtain a trained neural radiation field network. In the i-th iteration of training the neural radiation field network: the volume density and color information corresponding to each sample point in the i-th iteration are obtained, through the neural radiation field network of the i-th iteration, according to the context information corresponding to each of a plurality of sample points corresponding to a first view angle, where the first view angle is determined based on the view angle direction of a first image, the plurality of sample points are obtained by sampling at least one ray corresponding to the first view angle, the first image is contained in the plurality of images, and i is a positive integer; according to the volume density and color information corresponding to each sample point in the i-th iteration, a first output image corresponding to the first view angle in the i-th iteration is obtained through volume rendering; and based on the first output image and the first image, it is determined whether training of the neural radiation field network is complete.
In the first aspect, through this training process, the trained neural radiation field network can fuse the context information among the sample points and learn more accurate feature information of the sample points, such as volume density and color information, obtaining accurate output results; this improves the quality of the finally generated image and reduces the influence of noise.
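Illustratively, the per-iteration procedure of the first aspect may be sketched as follows. This is a minimal, non-authoritative sketch in PyTorch-style Python; nerf_net and volume_render are hypothetical callables, the uniform sampling and MSE loss are assumptions, and the network of this application additionally fuses the context information of the sample points as described below.

```python
import torch

def train_iteration(nerf_net, volume_render, rays_o, rays_d, first_image,
                    optimizer, n_samples=64, near=2.0, far=6.0):
    """One (i-th) training iteration following the steps of the first aspect.

    nerf_net maps sample points (with their context) to per-sample volume
    density and color; volume_render accumulates them into per-ray colors.
    """
    # Sample points on the rays corresponding to the first view angle.
    t = torch.linspace(near, far, n_samples)                           # [S]
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t[None, :, None]   # [N, S, 3]

    # Obtain per-sample volume density and color information through the
    # neural radiation field network of the i-th iteration.
    sigma, rgb = nerf_net(pts, rays_d)                                  # [N, S], [N, S, 3]

    # Obtain the first output image (here, one pixel per ray) through
    # volume rendering.
    first_output = volume_render(sigma, rgb, t)                         # [N, 3]

    # Determine, based on the first output image and the first image,
    # whether training is complete (here via an assumed MSE loss).
    loss = torch.mean((first_output - first_image.reshape(-1, 3)) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```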
In a possible implementation manner of the first aspect, the context information corresponding to each sample point is obtained based on neighborhood sample points of the corresponding sample point among the plurality of sample points corresponding to the first view angle.
In a possible implementation manner of the first aspect, the first image includes a plurality of image blocks, any sample point corresponds to a plurality of rays, and the plurality of rays are obtained by emitting rays from the camera corresponding to the first image to a plurality of pixel points in the image block corresponding to the sample point; the context information corresponding to the sample point is obtained based on neighborhood sample points of that sample point on neighborhood rays, where the neighborhood rays are contained in the plurality of rays and differ from the ray on which the sample point is located.
In this possible implementation manner, it is considered that, in practice, the pixel colors, depths and semantics of an image of a scene rendered from any angle have a certain continuity. An image used for training can therefore be divided into a plurality of image blocks, each containing a plurality of pixels. During training and inference, a plurality of rays can be obtained based on the image block of the image corresponding to the current iteration, these rays being emitted from the camera corresponding to the first image to the pixel points in the image block. In this way, through convolution operations and similar processing during training and inference, the receptive field of the ray corresponding to a given sample point can be expanded to the range of the corresponding image block, so that the context information on the neighborhood rays of that ray is associated with the sample point in the depth dimension, and the neural radiation field network can fuse the information of the sample point with the context information on the neighborhood rays to obtain the color information and volume density of the sample point.
It can be seen that, with this possible implementation, the receptive field of each ray is enlarged, so that the neural radiation field network can associate a given sample point with the information of its neighborhood sample points on neighborhood rays.
In a possible implementation manner of the first aspect, the context information corresponding to each sample point is obtained based on the corresponding neighborhood sample points on the ray on which the sample point is located.
In this possible implementation manner, it is considered that, in practice, for a given view direction, the color and volume density of a series of sample points sampled through three-dimensional space have a certain continuity; the context information of one or more neighborhood sample points on the same ray is therefore associated through a convolution operation, so as to obtain a smoother estimate for the corresponding sample point.
In a possible implementation manner of the first aspect, obtaining, through the neural radiation field network of the i-th iteration, the volume density and color information corresponding to each sample point in the i-th iteration according to the context information corresponding to each of the plurality of sample points corresponding to the first view angle includes: fusing, through the neural radiation field network of the i-th iteration, the information of any sample point with the information of the corresponding neighborhood sample points on the ray on which that sample point is located, according to the weights of those neighborhood sample points, to obtain the volume density and color information corresponding to that sample point in the i-th iteration, where the weight of each neighborhood sample point is determined based on the distance between that neighborhood sample point and the sample point.
In this possible implementation manner, for a given sample point, the distances to its different neighborhood sample points on the ray may differ, and the degree of influence of different neighborhood sample points therefore also differs. Accordingly, the distances between sample points can serve as weights for the neighborhood sample points during information fusion, after which fusion operations such as convolution can be performed.
In a possible implementation manner of the first aspect, the neural radiation field network includes at least one convolution layer, where the at least one convolution layer is configured to fuse, through a convolution operation, information of each sample point with context information corresponding to the corresponding sample point.
In one possible implementation manner of the first aspect, the at least one convolution layer includes a plurality of convolution layers, the plurality of convolution layers are in a serial structure, and a size of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a size of an output of a corresponding next convolution layer in at least one dimension.
In this possible implementation manner, the coverage of the plurality of rays corresponding to the first view angle can be enlarged, thereby enlarging the extraction range of the context information; this makes it easier for the neural radiation field network to learn scene information from large images and improves its performance in large scenes.
In a possible implementation manner of the first aspect, determining whether training of the neural radiation field network is complete according to the first output image and the first image includes: obtaining a weight corresponding to each pixel point in the first output image from the output of the k-th layer of the neural radiation field network in the i-th iteration, where k is a positive integer smaller than a preset threshold; obtaining a target pixel value of at least one target pixel point in the first output image according to the first output image and the weights corresponding to its pixel points; and determining whether training of the neural radiation field network is complete according to the target pixel value of the at least one target pixel point and the first image.
In a possible implementation manner of the first aspect, determining whether training of the neural radiation field network is complete according to the first output image and the first image includes: acquiring, for each of a plurality of feature mapping networks corresponding to the i-th iteration, first feature information about the first output image, where any feature mapping network maps its input to a corresponding feature space and different feature mapping networks correspond to different feature spaces; acquiring, for each of the plurality of feature mapping networks, second feature information about the first image; and determining whether training of the neural radiation field network is complete according to the first feature information and the second feature information.
In the training of conventional neural radiation field networks, the peak signal-to-noise ratio (PSNR) is generally used alone as the loss function, which ignores local and global similarity between images and may make the evaluation of the loss function one-sided and less accurate. However, if the loss function of the neural radiation field is instead constructed from several performance indexes such as peak signal-to-noise ratio, structural similarity (SSIM) and perceptual similarity (LPIPS), these indexes are spatially heterogeneous and do not converge in the same direction during network optimization, so the effect of a loss function built directly from such indexes is currently not ideal.
In this possible implementation manner, during training, the predicted value and the true value of the image are mapped to different feature spaces through a plurality of feature mapping networks, so that the accuracy of the generated image can be evaluated in several feature spaces, including high-dimensional ones, and the trained neural radiation field network performs better.
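Illustratively, such a multi-feature-space comparison may be sketched as follows. The module names and the use of an L2 distance per feature space are assumptions for illustration, not the exact construction of this application.

```python
import torch
import torch.nn as nn

def multi_space_loss(feature_nets: nn.ModuleList,
                     output_img: torch.Tensor,
                     gt_img: torch.Tensor) -> torch.Tensor:
    """Compare the first output image with the first (ground-truth) image
    in several feature spaces, one per feature mapping network."""
    loss = output_img.new_zeros(())
    for net in feature_nets:
        first_feat = net(output_img)   # "first feature information"
        second_feat = net(gt_img)      # "second feature information"
        loss = loss + torch.mean((first_feat - second_feat) ** 2)
    return loss / len(feature_nets)
```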
In a possible implementation manner of the first aspect, training the neural radiation field network according to the plurality of images and the view angle direction corresponding to each image to obtain a trained neural radiation field network includes: alternately training the neural radiation field network and a plurality of initial feature mapping networks according to each image and its corresponding view angle direction, to obtain a plurality of trained initial feature mapping networks and a trained neural radiation field network, the plurality of trained initial feature mapping networks being used as the plurality of feature mapping networks.
In this possible implementation manner, alternately training the neural radiation field network and the plurality of initial feature mapping networks may mean first fixing the plurality of initial feature mapping networks and performing one or more training iterations on the neural radiation field network; then fixing the newly updated neural radiation field network and performing one or more training iterations on the plurality of initial feature mapping networks; then again fixing the newly updated initial feature mapping networks and performing one or more training iterations on the newly updated neural radiation field network, and so on, until a preset number of iterations is reached, or until the neural radiation field network and the initial feature mapping networks each converge to their expected states.
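Illustratively, this alternation may be sketched as follows; nerf_step and feature_step are hypothetical callables that each run one training iteration of the respective network(s) on a batch, and the step counts are assumed hyperparameters.

```python
def alternate_training(nerf_net, feature_nets, nerf_step, feature_step,
                       data_iter, rounds=100, nerf_iters=5, feat_iters=1):
    """Alternating-training sketch: freeze one side while updating the other."""
    for _ in range(rounds):
        # Fix the initial feature mapping networks; train the NeRF network.
        for p in feature_nets.parameters():
            p.requires_grad_(False)
        for _ in range(nerf_iters):
            nerf_step(nerf_net, feature_nets, next(data_iter))

        # Fix the newly updated NeRF network; train the feature mapping networks.
        for p in feature_nets.parameters():
            p.requires_grad_(True)
        for p in nerf_net.parameters():
            p.requires_grad_(False)
        for _ in range(feat_iters):
            feature_step(nerf_net, feature_nets, next(data_iter))
        for p in nerf_net.parameters():
            p.requires_grad_(True)
```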
In a possible implementation manner of the first aspect, alternately training the neural radiation field network and the plurality of initial feature mapping networks according to each image and its corresponding view angle direction, to obtain a plurality of trained initial feature mapping networks and a trained neural radiation field network, and taking the plurality of trained initial feature mapping networks as the plurality of feature mapping networks, includes: in the j-th iteration of training the plurality of initial feature mapping networks: acquiring a second output image in the j-th iteration through the neural radiation field network of the j-th iteration, where j is a positive integer; acquiring, for each of the plurality of initial feature mapping networks corresponding to the j-th iteration, third feature information about the second output image; and determining, based on the differences between the pieces of third feature information, whether training of the plurality of initial feature mapping networks is complete.
In this possible implementation, whether training of the plurality of initial feature mapping networks is complete may be determined based on the differences between the pieces of third feature information and a second loss function. The second loss function is used to keep the pieces of third feature information as diverse as possible, that is, to drive the differences between them to increase. In this way, the feature spaces corresponding to the feature mapping networks obtained by training the initial feature mapping networks differ significantly from one another, maintaining the diversity of the feature spaces corresponding to the plurality of feature mapping networks.
Therefore, in this possible implementation manner, the trained initial feature mapping networks can accurately evaluate the image similarity between the output image of the neural radiation field network and the corresponding ground-truth image over clearly differentiated feature spaces, and the finally obtained trained neural radiation field network performs better and generates images of higher quality.
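Illustratively, the second loss function may be sketched as follows; the use of a negative mean pairwise squared-L2 distance is an assumption for illustration.

```python
import torch

def diversity_loss(third_feats: list) -> torch.Tensor:
    """Second-loss-function sketch for the j-th iteration: push apart the
    third feature information produced by the different initial feature
    mapping networks for the same second output image. Minimizing the
    negative mean pairwise distance increases their differences."""
    total = third_feats[0].new_zeros(())
    pairs = 0
    for a in range(len(third_feats)):
        for b in range(a + 1, len(third_feats)):
            total = total + torch.mean((third_feats[a] - third_feats[b]) ** 2)
            pairs += 1
    return -total / max(pairs, 1)
```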
A second aspect of the present application provides an image generation method, the method comprising: obtaining, through a neural radiation field network, the volume density and color information corresponding to each of a plurality of sample points corresponding to a target view angle according to the context information corresponding to each sample point, where the plurality of sample points are obtained by sampling at least one ray corresponding to the target view angle; and obtaining, through volume rendering, an output image corresponding to the target view angle according to the volume density and color information corresponding to each sample point.
In a possible implementation manner of the second aspect, the context information corresponding to each sample point is obtained based on neighborhood sample points of the corresponding sample point among the plurality of sample points corresponding to the target view angle.
In a possible implementation manner of the second aspect, any sample point corresponds to a plurality of rays; the context information corresponding to the sample point is obtained based on neighborhood sample points of that sample point on neighborhood rays, where the neighborhood rays are contained in the plurality of rays and differ from the ray on which the sample point is located.
In a possible implementation manner of the second aspect, the context information corresponding to each sample point is obtained based on the corresponding neighborhood sample points on the ray on which the sample point is located.
In a possible implementation manner of the second aspect, obtaining, through the neural radiation field network, the volume density and color information corresponding to each sample point according to the context information corresponding to each of the plurality of sample points corresponding to the target view angle includes: fusing, through the neural radiation field network, the information of any sample point with the information of the corresponding neighborhood sample points on the ray on which that sample point is located, according to the weights of those neighborhood sample points, to obtain the volume density and color information corresponding to that sample point, where the weight of each neighborhood sample point is determined based on the distance between that neighborhood sample point and the sample point.
In a possible implementation manner of the second aspect, the neural radiation field network includes at least one convolution layer, where the at least one convolution layer is configured to fuse, through a convolution operation, information of each sample point with context information corresponding to the corresponding sample point.
In one possible implementation manner of the second aspect, the at least one convolution layer includes a plurality of convolution layers, the plurality of convolution layers are in a serial structure, and a size of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a size of an output of a corresponding next convolution layer in at least one dimension.
A third aspect of the present application provides a neural radiation field network training device having the functionality to implement the method of the first aspect or any one of its possible implementations. The functions can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above, such as a training module.
A fourth aspect of the present application provides an image generating apparatus having the functionality to implement the method of the second aspect or any one of its possible implementations. The functions can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above, such as a neural radiation field network module and a volume rendering module.
A fifth aspect of the present application provides an electronic device comprising at least one processor, a memory, and computer-executable instructions stored in the memory and executable on the processor; when the instructions are executed by the processor, the processor performs the method of the first aspect or any one of its possible implementations and/or the method of the second aspect or any one of its possible implementations.
A sixth aspect of the present application provides a computer-readable storage medium storing one or more computer-executable instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any one of its possible implementations and/or the method of the second aspect or any one of its possible implementations.
A seventh aspect of the present application provides a computer program product comprising one or more computer-executable instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any one of its possible implementations and/or the method of the second aspect or any one of its possible implementations.
An eighth aspect of the present application provides a chip system comprising a processor configured to support an electronic device in implementing the functions referred to in the first aspect or any one of its possible implementations and/or the functions referred to in the second aspect or any one of its possible implementations. In one possible design, the chip system may further include a memory for holding the program instructions and data necessary for the electronic device. The chip system may consist of chips, or may include chips and other discrete devices.
For the technical effects of the second to eighth aspects or any of their possible implementations, reference may be made to the technical effects of the first aspect or its relevant possible implementations, which are not repeated here.
Drawings
FIG. 1 is an exemplary schematic diagram of a neural radiation field network provided in an embodiment of the present application;
FIG. 2 is an exemplary schematic diagram of rays in volume rendering provided by embodiments of the present application;
FIG. 3 is a schematic diagram of an embodiment of a neural radiation field network training method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an embodiment of a neural radiation field network training method according to an embodiment of the present application;
FIG. 5a is an exemplary schematic diagram of a receptive field corresponding to a certain ray of light in a conventional neural radiation field network according to embodiments of the present application;
FIG. 5b is an exemplary diagram of context information between neighborhood rays provided by embodiments of the present application;
FIG. 6 is an exemplary diagram of fusing context information between neighborhood sample points on the same ray, provided by embodiments of the present application;
FIG. 7 is an exemplary schematic diagram of a neural radiation field network provided in an embodiment of the present application;
FIG. 8 is an exemplary schematic diagram of a plurality of feature mapping networks provided by embodiments of the present application;
FIG. 9a is an exemplary training schematic of a first stage provided by an embodiment of the present application;
FIG. 9b is an exemplary training schematic of a second stage provided by embodiments of the present application;
FIG. 10 is a schematic diagram of an exemplary iterative process provided by embodiments of the present application;
FIG. 11 is a schematic diagram of an embodiment of an image generating method according to an embodiment of the present application;
FIG. 12 is a schematic diagram of an embodiment of a neural radiation field network training device provided in an embodiment of the present application;
FIG. 13 is a schematic view of an embodiment of an image generating apparatus according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings. As a person of ordinary skill in the art will appreciate, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are likewise applicable to similar technical problems.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following" or similar expressions mean any combination of these items, including any combination of single items or plural items. The terms "first", "second" and the like in the description, the claims and the above figures are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that terms so used are interchangeable under appropriate circumstances and are merely a manner of distinguishing objects of the same nature in describing the embodiments of the application. Furthermore, the terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article or apparatus.
In many application scenarios, such as autonomous driving, gaming, virtual reality, and augmented reality, it is often necessary to render images of a particular scene from new view angles.
At present, one conventional way to obtain a new-view image is to model a three-dimensional scene based on computer graphics and render the three-dimensional scene model with a rendering engine to obtain image data at a specific view angle. In this scheme, the image quality of the new view depends on the precision of the three-dimensional model and the capability of the rendering engine, and generating a large number of new-view images requires a large amount of resources, so the conventional new-view image generation process is complex and inefficient.
With the rapid rise of neural radiance fields (NeRF) in computer vision over the last two years, new-view image generation methods based on the neural radiation field have emerged.
The neural radiation field is an emerging scene representation and image rendering method: it records the scene representation implicitly in a deep neural network, which implicitly learns the information of the three-dimensional scene and completes tasks such as three-dimensional reconstruction of the scene and generation of new-view images.
Referring to the schematic diagram shown in FIG. 1, the input of a neural radiation field network F_Ω may include the spatial position (x, y, z) of a spatial point and the viewing direction (θ, φ), where θ may correspond to the camera pose for the viewing direction and φ may correspond to the camera intrinsics for the viewing direction. Through the neural radiation field network F_Ω, a corresponding output (r, g, b, σ) can be obtained, in which (r, g, b) represents the color information of the corresponding spatial point and σ represents the volume density of that point. The volume density σ(x) can be understood as the probability that a ray r terminates at an infinitesimal particle at position x; this probability is differentiable. The volume density of a spatial point is similar to the opacity of that point.
After the color information and volume density of a number of spatial points are obtained, the corresponding image data and spatial depth information can be inferred and rendered from any new viewpoint through volume rendering, according to the color information and volume density of those spatial points.
Volume rendering means that, for any ray emitted from the camera view into the scene, the volume density and color information of all points on the ray are accumulated by integration, finally yielding the pixel color and depth information corresponding to the ray. As shown in FIG. 2, in practice this can be implemented by sampling a finite number of sample points on a ray and modeling the discrete sample space by accumulation, to obtain the corresponding image data and its depth information.
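Illustratively, the discrete accumulation can be sketched as follows. This is a minimal, non-authoritative illustration of the standard NeRF-style quadrature (alpha compositing with accumulated transmittance); variable names are assumptions.

```python
import torch

def volume_render(sigma: torch.Tensor, rgb: torch.Tensor, t: torch.Tensor):
    """Accumulate per-sample density and color along each ray.

    sigma: [N, S] volume densities of S sample points on N rays.
    rgb:   [N, S, 3] color information of the sample points.
    t:     [S] depths of the sample points along the rays.
    Returns per-ray pixel colors [N, 3] and expected depths [N].
    """
    # Distances between adjacent sample points (last interval open-ended).
    delta = torch.cat([t[1:] - t[:-1], t.new_tensor([1e10])])     # [S]
    # Opacity contributed by each interval.
    alpha = 1.0 - torch.exp(-sigma * delta)                       # [N, S]
    # Transmittance: probability the ray survives up to each sample.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alpha                                       # [N, S]
    color = (weights[..., None] * rgb).sum(dim=-2)                # [N, 3]
    depth = (weights * t).sum(dim=-1)                             # [N]
    return color, depth
```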
At present, the direct input of a common neural radiation field network is the information of individual sample points on a ray, and the network is trained and evaluated on each sample point separately; as a result, the scene reconstructed through the neural radiation field network is prone to noise in depth, the noise of the finally synthesized image is larger, and the image quality is affected.
On this basis, the embodiments of the present application provide a neural radiation field network that can obtain accurate color information and volume density for any sample point according to the sample point's feature information and the corresponding context information, thereby obtaining high-quality generated images.
The specific structure of the neural radiation field network in the embodiments of the present application is not particularly limited here. Illustratively, the neural radiation field network may include a multilayer perceptron (MLP), which learns a series of images of a specified three-dimensional scene to form an implicit representation of that scene, so that images of the specified scene from new view angles can be generated.
The neural radiation field network of the embodiments of the present application can be applied in various fields requiring image generation, for example, autonomous driving, virtual reality, augmented reality, high-precision three-dimensional mapping, and other fields requiring the construction of three-dimensional scenes.
Embodiments of the present application involve a training phase and an inference phase of the neural radiation field network, which are described below in turn.
First, the operation of the training phase of the neural radiation field network is described.
In the embodiments of the present application, the relevant operations of the training phase of the neural radiation field network can be performed by an electronic device. The electronic device performing the operations of the training phase may be the same as or different from the electronic device performing the operations of the inference phase.
The particular type of electronic device performing the operations of the training phase is not limited here. For example, the electronic device may be a single server, a server cluster, or a terminal device, or may be a virtual machine (VM) or a container.
For example, if the electronic device is a terminal device, the terminal device may be a mobile phone, a tablet computer (pad), a computer with wireless transceiving capability, a virtual reality (VR) terminal, an augmented reality (AR) terminal, a terminal in industrial control, a terminal in self-driving, a terminal in remote medicine, a terminal in a smart grid, a terminal in transportation safety, a terminal in a smart city, a terminal in a smart home, a terminal in the internet of things (IoT), and the like.
As shown in FIG. 3, in an embodiment of the present application, the neural radiation field network training method performed by the electronic device may include step 301.
Step 301: train the neural radiation field network according to the plurality of images and the view angle direction corresponding to each image, to obtain a trained neural radiation field network.
In the embodiments of the present application, the plurality of images may be of various specific types and may be acquired in various ways, neither of which is limited here.
Illustratively, any image may be a red-green-blue (RGB) image or a YUV image, where "Y" in a YUV image represents luminance (luma), that is, the gray-scale value, and "U" and "V" represent chrominance (chroma). Alternatively, any image may be a depth image acquired by a depth camera, an image acquired by an event camera, or the like. In general, all images in the plurality of images may be of the same specific type.
The plurality of images may be obtained by image acquisition of the same scene.
The view angle direction of any image may include the camera pose corresponding to the image and camera parameter information such as the focal length.
At present, in conventional image generation methods based on the neural radiation field network, the direct input of the network consists of discrete samples on rays emitted through image pixels that are treated independently of each other, and the context information between the samples is ignored; the finally rendered image is therefore easily affected by random noise, which degrades the quality of the finally generated image.
In the embodiments of the present application, through the training process described here, the trained neural radiation field network can fuse the context information among the sample points and learn more accurate feature information of the sample points, such as volume density and color information, obtaining accurate output results; this improves the quality of the finally generated image and reduces the influence of noise.
The training process is described in detail below.
Specifically, as shown in FIG. 4, the process of training the neural radiation field network may include multiple iterations, and the i-th iteration of training the neural radiation field network may include steps 3011-3013.
Step 3011: obtain, through the neural radiation field network of the i-th iteration, the volume density and color information corresponding to each sample point in the i-th iteration according to the context information corresponding to each of the plurality of sample points corresponding to the first view angle.
The first view angle is determined based on the view angle direction of a first image, the plurality of sample points are obtained by sampling on at least one ray corresponding to the first view angle, the first image is contained in the plurality of images, and i is a positive integer.
In the embodiments of the present application, the i-th iteration may be any one of the multiple iterations used to train the neural radiation field network; the operations of the other iterations may be the same as, similar to, or different from those of the i-th iteration.
The first view angle may be regarded as the observation angle corresponding to the first image in the world coordinate system, and may specifically be determined according to the pose of the camera that captured the first image and internal parameters such as the camera's focal length.
In the i-th iteration, the rays corresponding to the first view angle are determined based on the first image corresponding to the first view angle. The number and form of the rays corresponding to the first view angle in the i-th iteration are not limited here.
In one example, to make it easier for the neural radiation field network to acquire the context information between sample points, the first image corresponding to the first view angle may be divided into a plurality of image blocks, each containing a plurality of pixel points.
In the i-th iteration, an image block can be selected from the plurality of image blocks, and a plurality of rays corresponding to the i-th iteration are obtained according to the pixel points in the image block. Any such ray is emitted from the optical center of the camera corresponding to the first image toward a pixel point in the image block. In this way, the plurality of rays corresponding to the image block serve as the plurality of rays at the first view angle for the i-th iteration.
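Illustratively, generating the per-block rays may be sketched as follows, assuming a standard pinhole camera model; the intrinsics and pose conventions are assumptions for illustration.

```python
import torch

def rays_for_block(K: torch.Tensor, c2w: torch.Tensor,
                   u0: int, v0: int, h: int, w: int):
    """Emit one ray from the camera optical center through each pixel of an
    h x w image block whose top-left pixel is (u0, v0).

    K:   [3, 3] camera intrinsics; c2w: [3, 4] camera-to-world pose.
    Returns ray origins and directions, each of shape [h, w, 3].
    """
    v, u = torch.meshgrid(torch.arange(v0, v0 + h, dtype=torch.float32),
                          torch.arange(u0, u0 + w, dtype=torch.float32),
                          indexing="ij")
    # Pixel -> camera-space direction (OpenCV-style convention assumed).
    dirs = torch.stack([(u - K[0, 2]) / K[0, 0],
                        (v - K[1, 2]) / K[1, 1],
                        torch.ones_like(u)], dim=-1)              # [h, w, 3]
    # Rotate into world space; all rays share the camera optical center.
    rays_d = dirs @ c2w[:3, :3].T
    rays_o = c2w[:3, 3].expand_as(rays_d)
    return rays_o, rays_d
```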
In this embodiment, the plurality of sample points corresponding to the first view angle are obtained by sampling on at least one ray corresponding to the first view angle. The sampling method is not limited here; for example, uniform or non-uniform sampling may be used.
In one example, the neural radiation field network may include a "coarse-to-fine" network structure comprising a coarse network and a fine network. The coarse network and the fine network may have the same structure and may, for example, each include a multilayer perceptron. The input to the coarse network may be information such as the spatial positions and view directions of uniformly sampled sample points. According to the coarse network's output for the uniformly sampled points (such as the volume density and/or color information of those points), the weight corresponding to each sample point is obtained, and the input of the fine network is then obtained by sampling non-uniformly according to these weights.
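Illustratively, such weight-driven non-uniform sampling may be sketched as follows (inverse transform sampling over the coarse weights, as in common NeRF practice); this is an illustration under assumed conventions, not the exact procedure of this application.

```python
import torch

def importance_sample(t_coarse: torch.Tensor, weights: torch.Tensor,
                      n_fine: int) -> torch.Tensor:
    """Draw n_fine extra depths per ray, concentrated where the coarse
    network assigned high weights.

    t_coarse: [S] uniformly sampled depths; weights: [N, S] coarse weights.
    Returns new depths of shape [N, n_fine].
    """
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)    # [N, S]
    cdf = torch.cumsum(pdf, dim=-1)                               # [N, S]
    u = torch.rand(weights.shape[0], n_fine, device=weights.device)
    idx = torch.searchsorted(cdf, u).clamp(max=cdf.shape[-1] - 1)
    return t_coarse[idx]                                          # [N, n_fine]
```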
In addition, in some examples, considering that directly using the spatial position and view direction of each sample point as the input of the neural radiation field network yields relatively blurry results, the spatial position and view direction of the sample points can be mapped to a high-dimensional space by positional encoding before being input into the network, so that the network can better fit data containing high-frequency variation; this effectively improves the performance of the network and the sharpness of the synthesized image.
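Illustratively, the frequency-based positional encoding commonly used with neural radiance fields may be sketched as follows; the number of frequency bands is an assumed hyperparameter.

```python
import torch

def positional_encoding(x: torch.Tensor, n_freqs: int = 10) -> torch.Tensor:
    """Map inputs of shape [..., D] to [..., 2 * n_freqs * D] using sin/cos
    at exponentially growing frequencies, so the MLP can fit high-frequency
    variation in spatial position and view direction."""
    out = []
    for k in range(n_freqs):
        freq = (2.0 ** k) * torch.pi
        out.append(torch.sin(freq * x))
        out.append(torch.cos(freq * x))
    return torch.cat(out, dim=-1)
```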
In the embodiments of the present application, for each of the plurality of sample points, the neural radiation field network of the i-th iteration can fuse the context information corresponding to the sample point, thereby obtaining the volume density and color information corresponding to that sample point in the i-th iteration.
In some embodiments, the context information corresponding to each sample point is derived based on neighborhood sample points of the corresponding sample point among the plurality of sample points corresponding to the first view angle.
In this way, the neural radiation field network can fuse the feature information of a given sample point with the feature information of its neighborhood sample points, so as to obtain the volume density and color information of the sample point according to its context information.
Among the plurality of sample points corresponding to the first view angle, the neighborhood sample points of a given sample point may include one or more neighboring sample points on the ray on which the sample point is located, and/or sample points located at the same or a similar depth on one or more neighborhood rays of that ray. A sample point at the same or a similar depth on a neighborhood ray can also be understood as a neighbor of the given sample point in the image-plane dimension of the first image.
In the embodiments of the present application, the neural radiation field network may fuse the context information of the sample points in various ways. Illustratively, the network may include a convolution layer, so that the context information of the sample points is fused by a convolution operation. Alternatively, the context information may be fused by a context-extraction structure similar to that of a transformer network.
In one embodiment, the neural radiation field network includes at least one convolution layer, and the at least one convolution layer is configured to fuse information of each sample point with context information corresponding to the corresponding sample point through a convolution operation.
The input of the at least one convolution layer may be derived from the information of the sample points and the context information corresponding to each sample point. Illustratively, if the at least one convolution layer is a stack of serial convolution layers, the input of the first layer may be a tensor built from the spatial position and view direction of each sample point together with those of its neighborhood sample points, so that the serial convolution layers can fuse the context information of each sample point.
The number of convolution layers and their parameters in the neural radiation field network are not limited here. For example, the size of the convolution kernel in the at least one convolution layer may be set based on the actual scene and experience; if there are multiple convolution layers, their kernel sizes may be the same, for example all 5×5, or they may differ.
In some embodiments, the at least one convolution layer comprises a plurality of convolution layers, the plurality of convolution layers being in a serial configuration, and a magnitude of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a magnitude of an output of a corresponding next convolution layer in at least one dimension.
The output of any convolution layer typically has two or more dimensions; for example, it may be a three-dimensional matrix with length, width and depth. In the embodiments of the present application, the size of the output of any one of the multiple convolution layers in at least one dimension is not greater than the size of the output of the corresponding next convolution layer in that dimension, where the at least one dimension may include the length and/or the width. That is, the output size of the convolution stack may gradually increase with network depth. This raises the upper limit on the coverage of the plurality of rays corresponding to the first view angle, enlarges the extraction range of the context information, makes it easier for the neural radiation field network to learn scene information from large images, and improves the performance of the neural radiation field network in large scenes.
In addition, in some examples, other layers may be included in the serial structure formed by the multiple convolution layers, e.g., in one example, an activation function layer may be included between any two adjacent convolution layers.
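Illustratively, such a serial convolution stack may be sketched as follows. The channel counts and kernel sizes are assumptions, and the use of a transposed convolution is only one way of realizing non-decreasing spatial output sizes; it is not stated in this application.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Serial convolution stack fusing each sample point's features with its
    context (neighborhood rays in the block plane, neighbors along depth).
    Input: [B, C, h, w, S] - h x w rays of an image block, S samples per ray.
    Spatial output sizes are non-decreasing with depth, per the text above."""
    def __init__(self, c_in: int = 63, c_mid: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            # 3D convolution mixes the block plane (h, w) and the depth (S).
            nn.Conv3d(c_in, c_mid, kernel_size=5, padding=2),            # size kept
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(c_mid, c_mid, kernel_size=2, stride=2),   # size doubled
            nn.ReLU(inplace=True),
            nn.Conv3d(c_mid, 4, kernel_size=3, padding=1),               # -> (r, g, b, sigma)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```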
In the embodiments of the present application, the specific content of the context information fused through the neural radiation field can take several forms, each described by way of example below.
1. Fusing context information between neighborhood rays.
In some embodiments, the first image includes a plurality of image blocks, any sample point corresponds to a plurality of rays, and the plurality of rays are obtained by emitting rays from the camera corresponding to the first image to a plurality of pixel points in the image block corresponding to the sample point.
The context information corresponding to the sample point is obtained based on neighborhood sample points of that sample point on neighborhood rays, where the neighborhood rays are contained in the plurality of rays and differ from the ray on which the sample point is located.
As shown in FIG. 5a, in a conventional neural radiation field network, the receptive field corresponding to a given ray during each training and inference pass is the range of a single corresponding pixel point.
In the embodiments of the present application, it is considered that, in practice, the pixel colors, depths and semantics of an image of a scene rendered from any angle have a certain continuity. An image used for training can therefore be divided into a plurality of image blocks, each containing a plurality of pixels; during training and inference, a plurality of rays are obtained based on the image block of the image corresponding to the current iteration, the rays being emitted from the camera corresponding to the first image to the pixel points in the image block. In this way, through convolution operations and similar processing during training and inference, the receptive field of the ray corresponding to a given sample point can be expanded to the range of the corresponding image block, so that the context information on the neighborhood rays of that ray is associated with the sample point in the depth dimension, and the neural radiation field network can fuse the information of the sample point with the context information on the neighborhood rays to obtain the color information and volume density of the sample point.
The i-th iteration is taken as an example below.
In the i-th iteration, a neighborhood sample point of a given sample point on a neighborhood ray may be a sample point located at the same or a similar depth on one or more neighborhood rays of the ray on which the sample point lies; equivalently, it is a neighbor of the sample point in the image-plane dimension of the first image. In one example, the rays corresponding to the first view angle carry the same number of sample points; a neighborhood sample point on a neighborhood ray can then also be determined from the ordering of the sample points along each ray. For example, if a sample point is the 10th sample point on its ray, the 10th sample point on a corresponding neighborhood ray is a neighborhood sample point of that sample point.
For example, in the example shown in FIG. 5b, plane A and plane B may be regarded as planes coincident with or parallel to the image plane of the first image, and every point on plane A can be regarded as having essentially the same depth relative to the camera. Sample point a corresponds to 5 rays at the first view angle, comprising the ray on which sample point a lies and 4 neighborhood rays. Sample point a thus has 4 neighborhood sample points on the neighborhood rays, and these 4 neighborhood sample points and sample point a may all lie on plane A. The neural radiation field network can then fuse the information of the 4 neighborhood sample points with that of sample point a to obtain the volume density and color information corresponding to sample point a.
Therefore, according to the embodiments of the present application, the receptive field of each ray can be enlarged, so that the neural radiation field network can associate a given sample point with the information of its neighborhood sample points on neighborhood rays.
At this time, even if the direction of a ray is disturbed by noise or the like, as long as the intersection of the disturbed ray with the corresponding image still falls inside the same pixel, the receptive field of the ray is unchanged, the color information and volume density obtained for the corresponding sample points are stable, and the final rendering result should be the same. Therefore, during training in the embodiments of the present application, the neural radiation field network can effectively aggregate the context information among neighborhood rays, which benefits the consistency and continuity of training over the whole scene.
In addition, since the receptive field of a ray in a conventional neural radiation field network is the single ray itself, one volume-rendering pass over the color information and volume densities of the sample points on that single ray yields the information of exactly one pixel of the corresponding first output image. Therefore, to render an image of H×W pixels, a conventional image generation method based on the neural radiation field network requires H×W rendering operations.
In the embodiments of the present application, the receptive field of a single ray is extended to the range of the corresponding image block through convolution operations during training and inference, so that after the color information and volume densities of the sample points on the multiple rays are obtained through the neural radiation field network, the first output image block corresponding to those rays is obtained in one volume-rendering pass. If the size of an image block is h×w, rendering an image of H×W pixels then requires (H×W)/(h×w) rendering operations; for example, for an 800×800 image and 16×16 image blocks, 640000 per-pixel operations reduce to 2500 per-block operations.
2. Fusing context information between neighborhood sample points on the same ray.
In some embodiments, the context information corresponding to each sample point is obtained based on the corresponding neighborhood sample points on the ray on which the sample point is located.
In current conventional neural radiation field networks, during each training and inference pass, each sample point on each ray is input into the network separately to estimate its volume density and color information.
In the embodiments of the present application, considering that in practice, for a given view direction, the color and volume density of a series of sample points sampled through three-dimensional space have a certain continuity, the context information of one or more neighborhood sample points on the same ray is associated through means such as convolution operations, so as to obtain a smoother estimate for the corresponding sample point.
For example, in the example shown in FIG. 6, the information of sample point B and the information of its neighborhood sample points A and C on the same ray may be fused by a convolution operation to obtain the volume density and color information of sample point B.
In addition, as illustrated in FIG. 6, associating the neighborhood sample points A and C of sample point B on its ray during training allows the neural radiation field network to obtain a smoother volume density for sample point B.
In some embodiments, step 3011 above includes:
and fusing, through the neural radiation field network in the ith iteration process, the information of a given sample point with the information of the corresponding neighborhood sample points on the ray where that sample point is located, according to the weights of those neighborhood sample points, to obtain the volume density and color information corresponding to that sample point in the ith iteration process, wherein the weight of each neighborhood sample point on the ray is determined based on the distance between that neighborhood sample point and the sample point in question.
In this embodiment of the present application, for a given sample point, the distances to its different neighborhood sample points on the ray may differ, and accordingly the degrees of mutual influence between sample points differ. Therefore, in the embodiment of the application, the distances between sample points can serve as the basis of the weights of the relevant neighborhood sample points during information fusion, and the fusion operation is then performed accordingly.
For instance, in the scenario shown in fig. 6, where the neural radiation field network correlates the context information of sample point B on the ray by convolution operations, the convolution may be performed based on the following convolution operator f(B):
$$f(B) \;=\; \sum_{e} W_e(B)\, x_e(B)\, \cos\!\big(\mathrm{norm}(d_e(B))\big)$$
wherein, in the convolution operator f(B) of the sample point B, W_e(B) is the weight of element e in the corresponding convolution kernel, x_e(B) is the corresponding element of the context feature tensor of the sample point B that it multiplies in the convolution operation, the context feature tensor comprising the feature information of the neighborhood sample point A and/or the feature information of the neighborhood sample point C, and d_e(B) denotes the distance between the neighborhood sample point corresponding to that element and the sample point B; in the example shown in fig. 6, the distance between the neighborhood sample point A and the sample point B is d1, and the distance between the neighborhood sample point C and the sample point B is d2. norm(·) denotes normalization and cos(·) denotes the cosine function; the normalization and cosine processing make the resulting modulation factors smoother and keep them within [0, 1].
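A minimal sketch of such distance-modulated fusion follows; the mapping of distances into [0, π/2] before the cosine, and all tensor shapes and values, are assumptions made for illustration on top of the operator reconstructed above:

```python
import math
import torch

def f_B(weights, feats, dists):
    # weights: (k,) convolution-kernel weights; feats: (k, d) rows of the context
    # feature tensor for B and its on-ray neighbours; dists: (k,) distances to B
    # (distance 0 for B itself, so its modulation factor is cos(0) = 1)
    norm_d = dists / (dists.max() + 1e-8) * (math.pi / 2)   # map into [0, pi/2]
    mod = torch.cos(norm_d)                                  # smooth factor in [0, 1]
    return (weights.unsqueeze(-1) * feats * mod.unsqueeze(-1)).sum(dim=0)

w = torch.tensor([0.3, 0.4, 0.3])       # kernel weights for A, B, C
x = torch.randn(3, 32)                  # feature vectors of A, B, C
d = torch.tensor([1.5, 0.0, 0.7])       # d1, 0, d2 as in the fig. 6 example
print(f_B(w, x, d).shape)               # torch.Size([32]): fused features for B
```

Nearer sample points thus contribute more strongly, matching the intuition that the degree of influence falls off with distance along the ray.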
Therefore, in the embodiment of the application, during training and even during reasoning, the neural radiation field network can fuse the context information between neighborhood rays to enlarge the receptive field of each ray, and can also fuse the context information between neighborhood sample points on the same ray to improve the continuity of the estimated color information and volume density of the sample points. As a result, the scene information learned by the trained neural radiation field network suffers less noise interference, and the quality of the finally generated images is improved.
It should be noted that, in the embodiment of the present application, the neural radiation field network may fuse either or both of the context information between neighborhood rays and the context information between neighborhood sample points on the same ray, which is not limited in the embodiment of the present application.
Step 3012, according to the volume density and color information corresponding to each sample point in the ith iteration process, obtaining a first output image corresponding to the first view angle in the ith iteration process through volume rendering.
In this embodiment of the present application, after the volume density and the color information corresponding to each sample point in the ith iteration process are obtained, pixel information corresponding to each light ray under the first view angle may be obtained through volume rendering according to the volume density and the color information corresponding to each sample point in the ith iteration process, so that a first output image corresponding to the first view angle in the ith iteration process is obtained according to each pixel information.
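For reference, volume rendering is commonly implemented with the standard discrete approximation below (a well-known formulation in neural radiation field work, not a formula quoted from this application), where σ_k and c_k are the volume density and color of the k-th sample point on a ray and δ_k is the spacing to the next sample point:

$$\hat{C}(\mathbf{r}) = \sum_{k=1}^{N} T_k \left(1 - e^{-\sigma_k \delta_k}\right) \mathbf{c}_k, \qquad T_k = \exp\!\Big(-\sum_{j=1}^{k-1} \sigma_j \delta_j\Big)$$

The pixel information of each ray under the first view angle is then the composited color Ĉ(r).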
The first output image may be an image corresponding to a first view angle estimated by a neural radiation field network and a volume rendering technique in an ith iteration.
Step 3013, determining whether training of the neural radiation field network is completed according to the first output image and the first image.
In this embodiment of the present application, the difference between the first output image and the first image may be evaluated through a preset loss function, so as to determine whether training of the neural radiation field network is completed.
Specifically, if it is determined based on the loss function that the loss value of the ith iteration process has converged to the desired state, it may be determined that training of the neural radiation field network is complete; or, if the number of training iterations reaches a preset number, it may be determined that training of the neural radiation field network is complete.
If it is determined that training is not complete, the neural radiation field network can be updated, for example by back propagation, based on the corresponding loss value, so that the updated network serves as the neural radiation field network in the (i+1)th iteration process and training continues with reference to the ith iteration process.
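The following sketch shows how such an iteration might be wired up; the stand-in network, the random data, and the loss-delta convergence test are assumptions, and any of the loss functions discussed below could take the place of the MSE loss:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 3))  # stand-in
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
loss_fn = nn.MSELoss()

prev_loss, eps = float("inf"), 1e-6
for i in range(1000):                          # preset number of training iterations
    rays = torch.randn(1024, 5)                # position + view direction (stand-in)
    first_image = torch.rand(1024, 3)          # ground-truth pixels of the first image
    first_output = model(rays)                 # stands in for network + volume render
    loss = loss_fn(first_output, first_image)
    if abs(prev_loss - loss.item()) < eps:     # loss converged to the desired state
        break                                  # training is determined complete
    optimizer.zero_grad()
    loss.backward()                            # update by back propagation
    optimizer.step()                           # network for the (i+1)th iteration
    prev_loss = loss.item()
```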
In this embodiment of the present application, there may be various ways to evaluate the difference between the first output image and the first image.
For example, in one example, differences between each pixel in the first output image and each pixel in the first image may be compared.
In yet another example, the weights corresponding to the pixels in the first output image may be determined based on intermediate layer information of the neural radiation field network, thereby determining target pixel values for a portion of the pixels in the first output image. Then, an image similarity between the first output image and the first image may be determined based on the target pixel value of the portion of pixels and the pixel value of the corresponding portion of pixels in the first image.
Specifically, in one embodiment, determining whether training of the neural radiation field network is completed based on the first output image and the first image includes:
obtaining a weight corresponding to each pixel point in a first output image through the output of a kth layer in a neural radiation field network in an ith iteration process, wherein k is a positive integer smaller than a preset threshold value;
obtaining a target pixel value of at least one target pixel point in the first output image according to the first output image and the weight corresponding to each pixel point in the first output image;
and determining whether training of the neural radiation field network is finished according to the target pixel value of at least one target pixel point in the first output image and the first image.
In the embodiments of the present application, considering that the dimension of the weights is typically low, the kth layer is, based on experience, typically a shallow layer of the neural radiation field network. For example, if the neural radiation field network includes a plurality of convolution layers, the kth layer may be one of the plurality of convolution layers.
The embodiments of the present application are illustrated by way of one example below.
As shown in fig. 7, the neural radiation field network includes multiple convolution layers, and a weight corresponding to each pixel point in the first output image can be obtained according to the output of the first convolution layer of the multiple convolution layers.
Then, for a certain sub-region in the first output image (for example, an image region of size 3×3 shown in fig. 7), the pixel values of the pixel points in the sub-region may be weighted and summed with their weights, and the result used as the target pixel value of the central pixel point of the sub-region. The features of the sub-region are thereby aggregated onto the corresponding central pixel point, and that central pixel point is used as a target pixel point. The target pixel value of the target pixel point is then used to calculate the loss value, instead of calculating the loss value from the color information of the entire sub-region.
At this time, the image similarity between the first output image and the first image may be determined based on the target pixel values of the respective target pixels in the first output image and the pixel values of the pixels corresponding to the respective target pixels in the first image.
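A minimal sketch of this aggregation (the 1×1 weight head, the sigmoid, and the non-overlapping 3×3 tiling are assumptions made for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, W = 9, 9
first_output = torch.rand(1, 3, H, W)       # rendered first output image (B, C, H, W)
layer_k_out = torch.rand(1, 8, H, W)        # shallow kth-layer features (stand-in)
weight_head = nn.Conv2d(8, 1, kernel_size=1)        # assumed head: one weight per pixel
weights = torch.sigmoid(weight_head(layer_k_out))   # (1, 1, H, W) pixel weights

weighted = first_output * weights                   # broadcast weights over channels
# weighted sum over each non-overlapping 3x3 sub-region -> one target pixel value
target = F.avg_pool2d(weighted, kernel_size=3, stride=3) * 9   # sum, not mean
print(target.shape)                         # (1, 3, 3, 3): one target pixel per block
```

The loss is then computed only between these target pixel values and the corresponding pixels of the first image.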
In the embodiment of the present application, there may be various setting manners of the loss function for evaluating the difference between the first output image and the first image.
In one example, the error between pixel information in the first output image and pixel information of the corresponding first image may be estimated based on the peak signal-to-noise ratio (PSNR).
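For reference, PSNR is conventionally derived from the mean squared error between the two images (the standard definition, not a formula given by this application), with MAX the maximum possible pixel value:

$$\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right), \qquad \mathrm{MSE} = \frac{1}{HW}\sum_{u,v}\big(\hat{I}(u,v) - I(u,v)\big)^2$$

Here Î denotes the first output image and I the first image.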
In another example, considering that using PSNR alone as the loss function of the neural radiation field network ignores the local and global similarity between images, the evaluation result of such a loss function may be one-sided and less accurate.
Based on this, it is contemplated that the loss function of the neural radiation field network may be constructed from a plurality of performance indexes.
For example, in addition to PSNR, the difference between the first output image and the first image may be evaluated by one or more of performance indicators such as structural similarity (structural similarity, SSIM), perceptual similarity (perceptual similarity, LPIPS), and the like.
Wherein, the structural similarity measures the similarity of the local areas of the image. In a general implementation process, the whole image is divided into local small blocks by using a sliding window, the small blocks of the image are used as units to compare the similarity between a true value (for example, a pixel value in a first image) and a predicted value (for example, a pixel value in a first output image), and finally, an average value of the similarity of all the small blocks is used as the structural similarity of the whole image. Structural similarity can measure similarity between images in terms of brightness, contrast, and structure. The specific calculation formula of the structural similarity may refer to the related art, and will not be described herein.
Perceptual similarity, in turn, measures image similarity through deep features. For example, a predicted value (e.g., the first output image in this example) and a true value (e.g., the first image in this example) are input into a feature extraction network (e.g., an AlexNet or VGG16 network), their similarities are compared across different feature layers of the feature extraction network, and the results are then averaged over the feature space and summed over the channels, yielding the perceptual similarity between the predicted value and the true value.
However, the inventors of the present application found that the above performance indexes (peak signal-to-noise ratio, structural similarity, and perceptual similarity) are heterogeneous in hypothesis space and do not converge in the same direction during network optimization. Therefore, the embodiment of the present application provides a method for calculating the similarity between the true value and the predicted value of an image, so that the corresponding loss value can effectively combine these performance indexes during training of the neural radiation field network, accurately evaluating the similarity between the true value and the predicted value during training and thereby accurately guiding the training of the corresponding neural radiation field network.
In some embodiments, step 3013 above includes:
acquiring, for each of a plurality of feature mapping networks corresponding to the ith iteration process, first feature information about the first output image, wherein any feature mapping network is used for mapping its input to a corresponding feature space, and different feature mapping networks correspond to different feature spaces;
acquiring, for each of the plurality of feature mapping networks corresponding to the ith iteration process, second feature information about the first image;
and determining whether training of the neural radiation field network is finished according to the first feature information and the second feature information.
The specific structure of any feature mapping network in the embodiments of the present application is not limited herein. Illustratively, any of the feature mapping networks may include a multi-layer perceptron.
The specific function of each feature mapping network is to map the respective input to the corresponding feature space. The specific situation of the feature space corresponding to each feature mapping network is not limited herein. In one example, each feature mapping network is a machine learning network, and then the feature space corresponding to each feature mapping network is determined based on machine learning. In another example, the feature space corresponding to any feature mapping network may refer to the feature space corresponding to the performance indexes such as peak signal-to-noise ratio, structural similarity, or perceived similarity when the image similarity is evaluated.
In this way, by the plurality of feature map networks, a plurality of first feature information about the first output image and a plurality of second feature information about the first image can be obtained, so that it is possible to evaluate a difference between the first output image and the first image from the plurality of first feature information and the plurality of second feature information and determine whether training of the neural radiation field network is completed or not from the difference.
Illustratively, as shown in fig. 8, the plurality of feature mapping networks may include feature mapping network a, feature mapping network B, and feature mapping network C.
At this time, first feature information a1 about the first output image and second feature information a2 about the first image may be obtained through feature mapping network A, first feature information b1 and second feature information b2 through feature mapping network B, and first feature information c1 and second feature information c2 through feature mapping network C.
Then, a loss value of the i-th iterative process may be obtained from the difference between the first feature information a1 and the second feature information a2, the difference between the first feature information b1 and the second feature information b2, and the difference between the first feature information c1 and the second feature information c2, thereby evaluating whether the training is completed.
In this example, the loss value may reflect the similarity between the first output image and the first image in the multiple feature spaces, and if the similarity between the first output image and the first image in the multiple feature spaces is higher, it may generally be stated that the similarity between the first output image and the first image in both local and global aspects is higher, so that the training progress may be evaluated more accurately to determine whether the corresponding network converges.
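A minimal sketch of such a multi-feature-space loss; the MLP mappers, their dimensions, and the squared-error feature distance are assumptions made for illustration:

```python
import torch
import torch.nn as nn

def make_mapper(out_dim):               # assumed multi-layer perceptron mapper
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 128),
                         nn.ReLU(), nn.Linear(128, out_dim))

mappers = [make_mapper(d) for d in (32, 64, 128)]   # feature mapping networks A, B, C

first_output = torch.rand(1, 3, 16, 16)   # predicted image from volume rendering
first_image = torch.rand(1, 3, 16, 16)    # corresponding truth image

# sum of per-feature-space differences: (a1 vs a2) + (b1 vs b2) + (c1 vs c2)
loss = sum((m(first_output) - m(first_image)).pow(2).mean() for m in mappers)
print(loss.item())
```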
In addition, in the embodiment of the present application, the complete first output image and the complete first image may be input into the feature mapping network respectively, so as to obtain first feature information and second feature information; alternatively, a part of the first output image and a corresponding part of the first image may be input into the feature mapping network, respectively, to obtain the first feature information and the second feature information. For example, an image formed by each target pixel point in the first output image is input into a feature mapping network, and an image formed by the pixel points corresponding to each target pixel point in the first image is input into a feature mapping network.
Therefore, in the embodiment of the application, by mapping the predicted value and the true value of the image to different feature spaces through a plurality of feature mapping networks during training, the accuracy of the generated image can be evaluated in a plurality of feature spaces, including high-dimensional feature spaces, so that the trained neural radiation field network performs better.
The specific manner of acquisition of each feature map network is not limited herein.
In one example, each feature mapping network may be deployed into an electronic device after pre-training is completed for training of the neural radiation field network.
In yet another example, the plurality of feature mapping networks may be trained alternately with the neural radiation field network and after the end of the alternate training, a trained neural radiation field network and a trained plurality of feature mapping networks are obtained.
In some embodiments, the above step of training the neural radiation field network according to the plurality of images and the view angle direction corresponding to each image to obtain a trained neural radiation field network includes:
according to each image and the corresponding view angle direction of each image, alternately training the neural radiation field network and the plurality of initial feature mapping networks to obtain a plurality of trained initial feature mapping networks and trained neural radiation field networks, and taking the plurality of trained initial feature mapping networks as a plurality of feature mapping networks.
In the embodiment of the application, alternately training the neural radiation field network and the plurality of initial feature mapping networks may proceed as follows: first, the plurality of initial feature mapping networks are fixed and the neural radiation field network undergoes one or more training iterations; then, the latest updated neural radiation field network is fixed and the plurality of initial feature mapping networks undergo one or more training iterations; then, the latest updated initial feature mapping networks are fixed again and the latest updated neural radiation field network is trained for one or more iterations; and so on, until a preset number of iterations is reached, or until the neural radiation field network and the initial feature mapping networks each converge to their respective expected states.
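The alternation schedule might look like the following sketch; the stand-in networks, the per-stage iteration counts, and the simplified stage losses are assumptions, with the actual first and second loss functions described in the two stages below:

```python
import itertools
import torch
import torch.nn as nn

nerf = nn.Linear(5, 4)                              # stand-in radiation field network
mappers = [nn.Linear(4, d) for d in (8, 16, 32)]    # stand-in initial mapping networks
opt_nerf = torch.optim.Adam(nerf.parameters(), lr=5e-4)
opt_map = torch.optim.Adam(
    itertools.chain(*(m.parameters() for m in mappers)), lr=1e-4)

for cycle in range(10):                             # one alternating period per loop
    # Stage 1: fix the mapping networks, train the neural radiation field network
    for m in mappers:
        m.requires_grad_(False)
    for _ in range(100):
        pred, truth = nerf(torch.randn(16, 5)), torch.rand(16, 4)
        loss1 = sum((m(pred) - m(truth)).pow(2).mean() for m in mappers)
        opt_nerf.zero_grad()
        loss1.backward()                            # stands in for the first loss
        opt_nerf.step()
    # Stage 2: fix the latest radiation field network, train the mapping networks
    for m in mappers:
        m.requires_grad_(True)
    for _ in range(20):
        with torch.no_grad():
            pred = nerf(torch.randn(16, 5))         # second output image (stand-in)
        feats = [m(pred) for m in mappers]          # third feature information
        loss2 = -sum((feats[a].mean() - feats[b].mean()).pow(2)
                     for a in range(3) for b in range(a + 1, 3))  # push apart
        opt_map.zero_grad()
        loss2.backward()                            # stands in for the second loss
        opt_map.step()
```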
In some embodiments, according to each image and the view angle direction corresponding to each image, alternately training the neural radiation field network and the plurality of initial feature mapping networks to obtain a plurality of trained initial feature mapping networks and trained neural radiation field networks, and taking the plurality of trained initial feature mapping networks as the plurality of feature mapping networks, including:
during the jth iteration of training the plurality of initial feature mapping networks:
acquiring a second output image in the jth iteration process through a neural radiation field network in the jth iteration process, wherein j is a positive integer;
acquiring third characteristic information of each initial characteristic mapping network about a second output image in a plurality of initial characteristic mapping networks corresponding to a jth iteration process;
based on the differences between each of the third feature information, it is determined whether training of the plurality of initial feature mapping networks is completed.
The iterative process of training the plurality of initial feature mapping networks and the iterative process of training the neural radiation field network are alternately performed, so that the neural radiation field network and the plurality of initial feature mapping networks are alternately trained.
The specific period of alternating training of the neural radiation field network and the plurality of initial feature mapping networks is not limited herein.
For example, the plurality of initial feature mapping networks and the neural radiation field network may be trained alternately with a fixed period; within each alternating period, the numbers of iterations allocated to the initial feature mapping networks and to the neural radiation field network may be the same or different. Alternatively, during the alternating training, if one of the networks converges early, the number of its iterations can be reduced, or its training stopped.
Of course, the period of the alternate training may have other situations, and is not limited herein.
The training process of the plurality of initial feature mapping networks and the neural radiation field network is illustrated below in an alternating cycle.
In one alternating cycle, the training of the plurality of initial feature mapping networks and the neural radiation field network can be divided into two phases.
1. The first stage: the neural radiation field network is trained while the plurality of initial feature mapping networks are fixed.
As shown in fig. 9a, in the first stage, the plurality of initial feature mapping networks (e.g., initial feature mapping network A, initial feature mapping network B, and initial feature mapping network C in the figure) may be fixed while the neural radiation field network is trained for one or more iterations. For an iterative process of training the neural radiation field network, reference may be made to any of the above related embodiments. For example, in the example shown in fig. 9a, in one iteration of training the neural radiation field network, first feature information a1 about the first output image and second feature information a2 about the first image are obtained through initial feature mapping network A, first feature information b1 and second feature information b2 are obtained through initial feature mapping network B, and first feature information c1 and second feature information c2 are obtained through initial feature mapping network C. A first loss function is set so that the output image and the corresponding truth image become as close as possible in each corresponding feature space, for example, so that a1 and a2, b1 and b2, and c1 and c2 are each as close as possible. Because there are a plurality of initial feature mapping networks, the first loss function can make the output image and the corresponding truth image as close as possible in a plurality of feature spaces.
2. The second stage: the neural radiation field network is fixed while the plurality of initial feature mapping networks are trained.
In the second stage, as shown in fig. 9b, a second output image in the jth iteration process is acquired through the neural radiation field network in the jth iteration process; then, the third feature information of each of the plurality of initial feature mapping networks corresponding to the jth iteration process about the second output image (for example, third feature information a3, third feature information b3, and third feature information c3 in fig. 9b) is acquired, so that the plurality of initial feature mapping networks are updated based on the differences between the pieces of third feature information and a second loss function.
The second loss function is used to keep the pieces of third feature information as diverse as possible, that is, to make the differences among third feature information a3, third feature information b3, and third feature information c3 tend to increase.
In this way, the feature spaces corresponding to the plurality of feature mapping networks obtained by training the plurality of initial feature mapping networks are significantly differentiated, so that the diversity of the feature spaces corresponding to the plurality of feature mapping networks is maintained.
Therefore, with the plurality of trained initial feature mapping networks, the image similarity between the output image of the neural radiation field network and the corresponding truth image can be accurately evaluated over a plurality of clearly differentiated feature spaces, so that the finally obtained trained neural radiation field network performs better and the quality of the generated images is higher.
An exemplary iterative process diagram according to an embodiment of the present application is shown in fig. 10.
In the example shown in fig. 10, the neural radiation field network may fuse the context information between the plurality of rays corresponding to an image block in an image and fuse the context information between neighborhood sample points on the same ray, and the differences between an output image obtained based on the neural radiation field network and the corresponding truth image may be evaluated through a plurality of feature mapping networks, so as to guide the training process of the neural radiation field network according to those differences, thereby improving the performance of the neural radiation field network after training.
After training of the neural radiation field network is completed, a corresponding inference task can be performed based on the trained neural radiation field network.
The following describes the relevant operation of the inference phase of the neural radiation field network.
In the embodiment of the application, the related operation of the reasoning stage of the neural radiation field network can be performed through electronic equipment. The electronic device performing the operations related to the training phase of the neural radiation field network in any of the above embodiments may be the same as or different from the electronic device performing the operations related to the reasoning phase.
The specific type of electronic device performing the relevant operations of the inference phase is not limited herein. The electronic device may be a single server, a server cluster, a terminal device, or the like, or may be a Virtual Machine (VM) or a container, for example.
The neural radiation field network deployed by the electronic device in the reasoning stage may be the neural radiation field network obtained after training in the training stage. For ease of description, the trained neural radiation field networks deployed in the inference phase are all referred to as neural radiation field networks.
As shown in fig. 11, in an embodiment of the present application, a method for generating images related to a neural radiation field network performed by the electronic device may include steps 1101-1102.
Step 1101, obtaining volume density and color information corresponding to each sample point according to context information corresponding to each sample point in a plurality of sample points corresponding to a target view angle through a neural radiation field network.
The plurality of sample points are obtained by sampling at least one ray corresponding to the target visual angle.
In step 1102, according to the volume density and color information corresponding to each sample point, an output image corresponding to the target viewing angle is obtained through volume rendering.
In the embodiment of the application, the target view angle may be the view angle at which an output image needs to be generated in the reasoning process. At this time, the spatial position (x, y, z) and the viewing direction (θ, φ) corresponding to the target view angle can be input, a plurality of rays corresponding to the target view angle are then determined through the neural radiation field network, and sampling is performed on the plurality of rays to obtain a plurality of sample points corresponding to the target view angle. In the reasoning stage, the specific manner of sampling on the plurality of rays corresponding to the target view angle may refer to the sampling manner of the sample points of the neural radiation field network in the training stage. For example, the weights of uniformly sampled sample points may be obtained through a coarse network in the neural radiation field network, and non-uniform sampling may be performed based on these weights to obtain a plurality of sample points serving as the input of a fine network in the neural radiation field network.
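A minimal sketch of such weight-guided non-uniform sampling (the shapes and bounds are assumptions; inverse-CDF sampling of this kind is common practice in neural radiation field work):

```python
import torch

def sample_fine(bins, weights, n_fine):
    # bins: (n_coarse + 1,) depth boundaries along the ray;
    # weights: (n_coarse,) per-bin weights from the coarse network
    pdf = weights / (weights.sum() + 1e-8)
    cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])  # (n_coarse + 1,)
    u = torch.rand(n_fine)                                       # uniform in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(bins) - 1)
    lo, hi = cdf[idx - 1], cdf[idx]
    t = (u - lo) / (hi - lo + 1e-8)                              # position in the bin
    return bins[idx - 1] + t * (bins[idx] - bins[idx - 1])       # non-uniform depths

bins = torch.linspace(2.0, 6.0, 65)          # near/far bounds split into 64 bins
weights = torch.rand(64)                     # from the coarse network (stand-in)
print(sample_fine(bins, weights, 128).shape) # 128 fine sample points on the ray
```

Bins with larger coarse weights receive more fine sample points, concentrating capacity where the scene content is.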
According to the embodiment of the application, the volume density and color information corresponding to each sample point are obtained through the neural radiation field network according to the context information corresponding to each of the plurality of sample points corresponding to the target view angle; then, through the volume rendering technique, the output image of the scene rendered at the target view angle and the depth information of the output image are inferred, so that the output image can be displayed based on its depth information. This achieves the purpose of generating an image at a new view angle of the scene, and the three-dimensional space of the whole scene can be explicitly restored through the output image.
In some embodiments, the context information corresponding to each sample point is derived based on a neighborhood sample point of the respective sample point in the plurality of sample points corresponding to the target perspective.
In some embodiments, any sample point corresponds to a plurality of ray lines, the context information corresponding to the respective sample point is obtained based on a neighboring sample point of the respective sample point on a neighboring ray line, the neighboring ray line is included in the plurality of ray lines, and the neighboring ray line is different from the ray line in which the respective sample point is located.
In some embodiments, the context information corresponding to each sample point is obtained based on a corresponding neighbor sample point on the ray of light where the respective sample point is located.
In some embodiments, step 1101 comprises: and fusing information of the corresponding sample point and information of the corresponding neighborhood sample point on the ray of the corresponding sample point according to the weight of the corresponding neighborhood sample point on the ray of any sample point through a neural radiation field network to obtain volume density and color information corresponding to the corresponding sample point, wherein the weight of each neighborhood sample point corresponding to the ray of the corresponding sample point is determined based on the distance between the corresponding neighborhood sample point and the corresponding sample point.
In some embodiments, the neural radiation field network includes at least one convolution layer for fusing information of each sample point with context information corresponding to the respective sample point through a convolution operation.
In some embodiments, the at least one convolution layer comprises a plurality of convolution layers, the plurality of convolution layers being in a serial configuration, and a magnitude of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a magnitude of an output of a corresponding next convolution layer in at least one dimension.
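One way such a stack could be arranged is sketched below; this is speculative, and in particular the use of transposed convolutions to grow the sample/width dimension is an assumption, the constraint being only that each layer's output, in at least one dimension, is no larger than the next layer's output:

```python
import torch
import torch.nn as nn

stack = nn.Sequential(
    nn.Conv1d(32, 32, kernel_size=3, padding=1),           # width n  -> n
    nn.ConvTranspose1d(32, 32, kernel_size=2, stride=2),   # width n  -> 2n
    nn.ConvTranspose1d(32, 32, kernel_size=2, stride=2),   # width 2n -> 4n
)
x = torch.randn(1, 32, 16)
print([m(x).shape for m in (stack[0], stack[:2], stack)])  # widths 16, 32, 64
```

Later layers thus operate over progressively larger outputs, widening the context available for fusion.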
Features and steps in each embodiment of the inference phase may refer to relevant content in any embodiment of the training phase, for example, specific content and fusion manner of context information corresponding to a sample point in the inference phase, and content such as a structure of a neural radiation field network may be similar to relevant schemes in any embodiment of the training phase, which are not described herein in detail.
Having described various method embodiments from various aspects, reference is made to the accompanying drawings, which illustrate a neural radiation field network training device and an image generating device in embodiments of the present application.
As shown in fig. 12, an embodiment of the present application provides a neural radiation field network training device 120.
The apparatus 120 includes:
the training module 1201 is configured to train the neural radiation field network according to the plurality of images and the view angle direction corresponding to each image, and obtain a trained neural radiation field network;
wherein, in the ith iteration process of training the neural radiation field network, the training module is used for:
obtaining volume density and color information corresponding to each sample point in the ith iteration process according to context information corresponding to each sample point in a plurality of sample points corresponding to a first view angle through a neural radiation field network in the ith iteration process, wherein the first view angle is determined based on the view angle direction of a first image, the plurality of sample points are obtained by sampling at least one ray corresponding to the first view angle, the first image is contained in a plurality of images, and i is a positive integer;
according to the volume density and the color information corresponding to each sample point in the ith iteration process, obtaining a first output image corresponding to a first visual angle in the ith iteration process through volume rendering;
Based on the first output image and the first image, it is determined whether training of the neural radiation field network is complete.
Optionally, the context information corresponding to each sample point is obtained based on a neighboring sample point of the plurality of sample points corresponding to the first viewing angle.
Optionally, the first image includes a plurality of image blocks, any sample point corresponds to a plurality of rays, and the plurality of rays are obtained by the camera corresponding to the first image sending rays to a plurality of pixel points in the image block corresponding to the respective sample point;
the context information corresponding to the corresponding sample point is obtained based on a neighborhood sample point of the corresponding sample point on a neighborhood ray, the neighborhood ray is contained in a plurality of rays, and the neighborhood ray is different from the ray in which the corresponding sample point is located.
Optionally, the context information corresponding to each sample point is obtained based on a corresponding neighbor sample point on the ray of light where the corresponding sample point is located.
Optionally, the training module 1201 is configured to:
and fusing information of the corresponding sample point and information of the corresponding neighborhood sample point on the ray of the corresponding sample point according to the weight of the corresponding neighborhood sample point on the ray of any sample point through a neural radiation field network in the ith iteration process to obtain volume density and color information corresponding to the corresponding sample point in the ith iteration process, wherein the weight of each corresponding neighborhood sample point on the ray of the corresponding sample point is determined based on the distance between the corresponding neighborhood sample point and the corresponding sample point.
Optionally, the neural radiation field network includes at least one convolution layer, and the at least one convolution layer is used for fusing the information of each sample point with the context information corresponding to the corresponding sample point through convolution operation.
Optionally, the at least one convolution layer comprises a plurality of convolution layers, the plurality of convolution layers being in a serial structure, and a magnitude of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a magnitude of an output of a corresponding next convolution layer in at least one dimension.
Optionally, the training module 1201 is configured to:
obtaining a weight corresponding to each pixel point in a first output image through the output of a kth layer in a neural radiation field network in an ith iteration process, wherein k is a positive integer smaller than a preset threshold value;
obtaining a target pixel value of at least one target pixel point in the first output image according to the first output image and the weight corresponding to each pixel point in the first output image;
and determining whether training of the neural radiation field network is finished according to the target pixel value of at least one target pixel point in the first output image and the first image.
Optionally, the training module 1201 is configured to:
acquiring first feature information of each feature mapping network about a first output image in a plurality of feature mapping networks corresponding to an ith iteration process, wherein any feature mapping network is used for mapping input to a corresponding feature space, and the feature spaces corresponding to different feature mapping networks are different;
Acquiring second characteristic information of each characteristic mapping network about the first image in a plurality of characteristic mapping networks corresponding to the ith iteration process;
and determining whether training of the neural radiation field network is finished according to the first characteristic information and the second characteristic information.
Optionally, the training module 1201 is configured to:
according to each image and the corresponding view angle direction of each image, alternately training the neural radiation field network and the plurality of initial feature mapping networks to obtain a plurality of trained initial feature mapping networks and trained neural radiation field networks, and taking the plurality of trained initial feature mapping networks as a plurality of feature mapping networks.
Optionally, the training module 1201 is configured to:
during the jth iteration of training the plurality of initial feature mapping networks:
acquiring a second output image in the jth iteration process through a neural radiation field network in the jth iteration process, wherein j is a positive integer;
acquiring third characteristic information of each initial characteristic mapping network about a second output image in a plurality of initial characteristic mapping networks corresponding to a jth iteration process;
based on the differences between each of the third feature information, it is determined whether training of the plurality of initial feature mapping networks is completed.
As shown in fig. 13, an embodiment of the present application provides an image generating apparatus 130.
The apparatus 130 includes:
the neural radiation field network module 1301 is configured to obtain, through a neural radiation field network, volume density and color information corresponding to each sample point according to context information corresponding to each sample point in a plurality of sample points corresponding to a target viewing angle, where the plurality of sample points are obtained by sampling on at least one ray corresponding to the target viewing angle;
the volume rendering module 1302 is configured to obtain an output image corresponding to the target viewing angle through volume rendering according to the volume density and the color information corresponding to each sample point.
Optionally, the context information corresponding to each sample point is obtained based on a neighborhood sample point of the corresponding sample point among the plurality of sample points corresponding to the target viewing angle.
Optionally, any sample point corresponds to a plurality of ray lines, the context information corresponding to the corresponding sample point is obtained based on a neighborhood sample point of the corresponding sample point on a neighborhood ray line, the neighborhood ray line is included in the plurality of ray lines, and the neighborhood ray line is different from the ray line where the corresponding sample point is located.
Optionally, the context information corresponding to each sample point is obtained based on a corresponding neighbor sample point on the ray of light where the corresponding sample point is located.
Optionally, the neural radiation field network module 1301 is configured to:
and fusing information of the corresponding sample point and information of the corresponding neighborhood sample point on the ray of the corresponding sample point according to the weight of the corresponding neighborhood sample point on the ray of any sample point through a neural radiation field network to obtain volume density and color information corresponding to the corresponding sample point, wherein the weight of each neighborhood sample point corresponding to the ray of the corresponding sample point is determined based on the distance between the corresponding neighborhood sample point and the corresponding sample point.
Optionally, the neural radiation field network includes at least one convolution layer, and the at least one convolution layer is used for fusing the information of each sample point with the context information corresponding to the corresponding sample point through convolution operation.
Optionally, the at least one convolution layer comprises a plurality of convolution layers, the plurality of convolution layers being in a serial structure, and a magnitude of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a magnitude of an output of a corresponding next convolution layer in at least one dimension.
Fig. 14 is a schematic diagram of a possible logic structure of the electronic device 140 according to the embodiment of the present application. The electronic device 140 is configured to implement the functions of the electronic device of the neural radiation field network training method embodiment and/or the image generation method embodiment related to any of the foregoing embodiments. The electronic device 140 includes: memory 1401, processor 1402, communication interface 1403, and bus 1404. Wherein the memory 1401, the processor 1402, and the communication interface 1403 are communicatively coupled to each other via a bus 1404.
The memory 1401 may be a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a random access memory (random access memory, RAM). The memory 1401 may store a program which, when the program stored in the memory 1401 is executed by the processor 1402, the processor 1402 and the communication interface 1403 are adapted to carry out one or more steps of the above-described neural radiation field network training method embodiment and/or image generating method embodiment.
The processor 1402 may employ a central processing unit (central processing unit, CPU), a microprocessor, an application specific integrated circuit (application specific integrated circuit, ASIC), a graphics processor (graphics processing unit, GPU), a digital signal processor (digital signal processing, DSP), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, or any combination thereof, for executing the associated programs to perform the functions required by the training module in the neural radiation field network training device in the above embodiments, or the functions required by the neural radiation field network module and the volume rendering module in the image generation device in the above embodiments, or one or more steps of the various method embodiments of the present application. The steps of a method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1401, and the processor 1402 reads the information in the memory 1401 and, in combination with its hardware, performs one or more steps of the neural radiation field network training method embodiments and/or the image generation method embodiments described above.
Communication interface 1403 enables communication between electronic device 140 and other devices or communication networks using transceiving means such as, but not limited to, a transceiver.
The bus 1404 provides a path for transferring information between the components of the electronic device 140 (e.g., the memory 1401, the processor 1402, and the communication interface 1403). The bus 1404 may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one thick line is shown in fig. 14, but this does not mean that there is only one bus or only one type of bus.
In another embodiment of the present application, there is also provided a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor of a device, perform the steps performed by the processor of fig. 14 described above.
In another embodiment of the present application, there is also provided a computer program product comprising computer-executable instructions stored in a computer-readable storage medium; the steps performed by the processor in fig. 14 described above are performed by the device when the computer-executable instructions are executed by the device's processor.
In another embodiment of the present application, there is also provided a chip system including a processor for implementing the steps performed by the processor of fig. 14 described above. In one possible design, the chip system may further include a memory to hold program instructions and data necessary for the electronic device. The chip system can be composed of chips, and can also comprise chips and other discrete devices.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk.
The above is merely a specific implementation of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto.

Claims (38)

1. A neural radiation field network training method, comprising:
training the neural radiation field network according to a plurality of images and the view angle direction corresponding to each image to obtain a trained neural radiation field network;
Wherein, in the ith iteration process of training the neural radiation field network:
obtaining volume density and color information corresponding to each sample point in an ith iteration process according to context information corresponding to each sample point in a plurality of sample points corresponding to a first view angle through a neural radiation field network in the ith iteration process, wherein the first view angle is determined based on a view angle direction of a first image, the plurality of sample points are obtained by sampling at least one ray corresponding to the first view angle, the first image is contained in the plurality of images, and i is a positive integer;
according to the volume density and the color information corresponding to each sample point in the ith iteration process, obtaining a first output image corresponding to the first visual angle in the ith iteration process through volume rendering;
and determining whether training of the neural radiation field network is finished according to the first output image and the first image.
2. The method of claim 1, wherein the context information for each sample point is derived based on a neighborhood sample point of the respective sample point among the plurality of sample points corresponding to the first view angle.
3. The method according to claim 2, wherein the first image includes a plurality of image blocks, any sample point corresponds to a plurality of ray lines, and the plurality of ray lines are obtained by the camera corresponding to the first image sending rays to a plurality of pixel points in the image block corresponding to the corresponding sample point;
the context information corresponding to the corresponding sample point is obtained based on a neighborhood sample point of the corresponding sample point on a neighborhood ray, the neighborhood ray is included in the plurality of ray lines, and the neighborhood ray line is different from the ray line in which the corresponding sample point is located.
4. A method according to claim 2 or claim 3, wherein the context information for each sample point is derived based on a corresponding neighborhood sample point on the ray where the respective sample point is located.
5. The method according to claim 4, wherein the obtaining, by the neural radiation field network in the ith iteration, the volume density and color information corresponding to each sample point in the ith iteration according to the context information corresponding to each sample point in the plurality of sample points corresponding to the first view angle, includes:
and fusing information of the corresponding sample point with information of a corresponding neighborhood sample point on a ray line of the corresponding sample point according to the weight of the corresponding neighborhood sample point on the ray line of any sample point through a neural radiation field network in the ith iteration process to obtain volume density and color information corresponding to the corresponding sample point in the ith iteration process, wherein the weight of each corresponding neighborhood sample point on the ray line of the corresponding sample point is determined based on the distance between the corresponding neighborhood sample point and the corresponding sample point.
6. The method of any one of claims 1-5, wherein at least one convolution layer is included in the neural radiation field network, the at least one convolution layer being configured to fuse information of each of the sample points with context information corresponding to the respective sample point by a convolution operation.
7. The method of claim 6, wherein the at least one convolution layer comprises a plurality of convolution layers, the plurality of convolution layers being in a serial configuration, and wherein a magnitude of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a magnitude of an output of a corresponding next convolution layer in the at least one dimension.
8. The method of any of claims 1-7, wherein determining whether training of the neural radiation field network is complete based on the first output image and the first image comprises:
obtaining a weight corresponding to each pixel point in the first output image through the output of a kth layer in a neural radiation field network in the ith iteration process, wherein k is a positive integer smaller than a preset threshold value;
obtaining a target pixel value of at least one target pixel point in the first output image according to the first output image and the weight corresponding to each pixel point in the first output image;
and determining whether training of the neural radiation field network is finished or not according to the target pixel value of at least one target pixel point in the first output image and the first image.
9. The method of any of claims 1-8, wherein determining whether training of the neural radiation field network is complete based on the first output image and the first image comprises:
acquiring, for each of a plurality of feature mapping networks corresponding to the ith iteration process, first feature information of the feature mapping network about the first output image, wherein any one of the feature mapping networks is used for mapping an input to a corresponding feature space, and the feature spaces corresponding to different feature mapping networks are different;
acquiring, for each of the plurality of feature mapping networks corresponding to the ith iteration process, second feature information of the feature mapping network about the first image;
and determining, according to the first feature information and the second feature information, whether training of the neural radiation field network is complete.
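For illustration only: a minimal sketch of the multi-feature-space comparison in claim 9. The projector architecture is an assumption; the claim requires only several mapping networks whose feature spaces differ.

```python
import torch
import torch.nn as nn

def make_feature_mappers(num_spaces: int = 4, dim: int = 64) -> nn.ModuleList:
    """Each mapper projects an image into its own feature space."""
    return nn.ModuleList(
        nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
        for _ in range(num_spaces)
    )

def feature_space_loss(mappers: nn.ModuleList, output_img: torch.Tensor,
                       target_img: torch.Tensor) -> torch.Tensor:
    """output_img, target_img: (1, 3, H, W)."""
    loss = output_img.new_zeros(())
    for mapper in mappers:
        first_feat = mapper(output_img)    # first feature information
        second_feat = mapper(target_img)   # second feature information
        loss = loss + torch.mean((first_feat - second_feat) ** 2)
    return loss
```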
10. The method according to claim 9, wherein the training the neural radiation field network according to the plurality of images and the view angle direction corresponding to each image to obtain a trained neural radiation field network comprises:
alternately training the neural radiation field network and a plurality of initial feature mapping networks according to each image and the view angle direction corresponding to each image, to obtain a plurality of trained initial feature mapping networks and a trained neural radiation field network, and taking the plurality of trained initial feature mapping networks as the plurality of feature mapping networks.
11. The method according to claim 10, wherein the alternately training the neural radiation field network and the plurality of initial feature mapping networks according to each image and the view angle direction corresponding to each image, to obtain the plurality of trained initial feature mapping networks and the trained neural radiation field network, and taking the plurality of trained initial feature mapping networks as the plurality of feature mapping networks comprises:
in the jth iteration process of training the plurality of initial feature mapping networks:
acquiring a second output image in the jth iteration process through a neural radiation field network in the jth iteration process, wherein j is a positive integer;
acquiring, for each of a plurality of initial feature mapping networks corresponding to the jth iteration process, third feature information of the initial feature mapping network about the second output image;
and determining, based on differences between the pieces of third feature information, whether training of the plurality of initial feature mapping networks is complete.
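For illustration only: a hedged sketch of the alternating phase in claim 11, reusing the mapper networks sketched under claim 9. Maximising the pairwise differences between the mappers' features of the same rendered image (the "third feature information") is one plausible reading; the exact mapper objective is an assumption.

```python
import torch

def mapper_diversity_loss(mappers, second_output_img: torch.Tensor) -> torch.Tensor:
    """second_output_img: (1, 3, H, W) rendered by the (frozen) radiance field."""
    third_feats = [m(second_output_img).flatten() for m in mappers]
    loss = second_output_img.new_zeros(())
    for a in range(len(third_feats)):
        for b in range(a + 1, len(third_feats)):
            # Larger pairwise differences -> more mutually distinct feature spaces.
            loss = loss - torch.mean((third_feats[a] - third_feats[b]) ** 2)
    return loss
```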
12. An image generation method, comprising:
obtaining, through a neural radiation field network, volume density and color information corresponding to each sample point in a plurality of sample points corresponding to a target view angle according to context information corresponding to each sample point, wherein the plurality of sample points are obtained by sampling at least one ray corresponding to the target view angle;
and obtaining, according to the volume density and the color information corresponding to each sample point, an output image corresponding to the target view angle through volume rendering.
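For illustration only: the volume-rendering step of claim 12 written as the usual NeRF alpha-compositing quadrature. This compositing rule is standard practice; the variable names and tensor shapes here are assumptions.

```python
import torch

def volume_render(sigma: torch.Tensor, rgb: torch.Tensor,
                  depths: torch.Tensor) -> torch.Tensor:
    """sigma: (num_rays, N) volume densities; rgb: (num_rays, N, 3) colours;
    depths: (num_rays, N) sample depths. Returns (num_rays, 3) pixel colours."""
    deltas = depths[:, 1:] - depths[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)            # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)  # transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)     # composited colour
```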
13. The method according to claim 12, wherein the context information corresponding to each sample point is obtained based on a neighborhood sample point of the respective sample point among the plurality of sample points corresponding to the target view angle.
14. The method according to claim 13, wherein any sample point corresponds to a plurality of rays, the context information corresponding to the respective sample point is obtained based on a neighborhood sample point of the respective sample point on a neighborhood ray, the neighborhood ray is included in the plurality of rays, and the neighborhood ray is different from the ray on which the respective sample point is located.
15. The method according to claim 13 or 14, wherein the context information corresponding to each sample point is obtained based on a corresponding neighborhood sample point on the ray on which the respective sample point is located.
16. The method according to claim 15, wherein the obtaining, through the neural radiation field network, the volume density and color information corresponding to each sample point according to the context information corresponding to each sample point in the plurality of sample points corresponding to the target view angle comprises:
fusing, through the neural radiation field network, information of any sample point with information of the corresponding neighborhood sample points on the ray on which that sample point is located, according to the weights of those neighborhood sample points, to obtain the volume density and color information corresponding to that sample point, wherein the weight of each neighborhood sample point is determined based on the distance between the neighborhood sample point and the sample point.
17. The method of any one of claims 12-16, wherein at least one convolution layer is included in the neural radiation field network, the at least one convolution layer being configured to fuse information of each of the sample points with context information corresponding to the respective sample point by a convolution operation.
18. The method of claim 17, wherein the at least one convolution layer comprises a plurality of convolution layers, the plurality of convolution layers being in a serial configuration, and wherein a magnitude of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a magnitude of an output of a corresponding next convolution layer in the at least one dimension.
19. A neural radiation field network training device, comprising:
a training module, used for training a neural radiation field network according to a plurality of images and a view angle direction corresponding to each image, to obtain a trained neural radiation field network;
wherein, in the ith iteration process of training the neural radiation field network, the training module is configured to:
obtaining volume density and color information corresponding to each sample point in an ith iteration process according to context information corresponding to each sample point in a plurality of sample points corresponding to a first view angle through a neural radiation field network in the ith iteration process, wherein the first view angle is determined based on a view angle direction of a first image, the plurality of sample points are obtained by sampling at least one ray corresponding to the first view angle, the first image is contained in the plurality of images, and i is a positive integer;
obtaining, according to the volume density and the color information corresponding to each sample point in the ith iteration process, a first output image corresponding to the first view angle in the ith iteration process through volume rendering;
and determining, according to the first output image and the first image, whether training of the neural radiation field network is complete.
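For illustration only: how one training iteration of claim 19 might be driven end to end, reusing the volume_render sketch above. The network signature, the mean-squared-error objective, and the loss-threshold completion test are all assumptions; the claims leave the completion criterion open.

```python
import torch

def train_step(nerf, optimizer, ray_samples, depths,
               first_image_pixels, tol: float = 1e-4) -> bool:
    sigma, rgb = nerf(ray_samples)                   # density + colour per sample
    rendered = volume_render(sigma, rgb, depths)     # first output image pixels
    loss = torch.mean((rendered - first_image_pixels) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item() < tol                         # True -> training complete
```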
20. The apparatus of claim 19, wherein the context information for each sample point is derived based on a neighborhood sample point of the respective sample point among the plurality of sample points corresponding to the first view angle.
21. The apparatus according to claim 20, wherein the first image includes a plurality of image blocks, any sample point corresponds to a plurality of rays, and the plurality of rays are obtained by emitting rays from a camera corresponding to the first image to a plurality of pixels in the image block corresponding to the respective sample point;
the context information corresponding to the respective sample point is obtained based on a neighborhood sample point of the respective sample point on a neighborhood ray, wherein the neighborhood ray is included in the plurality of rays and is different from the ray on which the respective sample point is located.
22. The apparatus according to claim 20 or 21, wherein the context information corresponding to each sample point is obtained based on a corresponding neighborhood sample point on the ray on which the respective sample point is located.
23. The apparatus according to claim 22, wherein
the training module is used for:
fusing, through the neural radiation field network in the ith iteration process, information of any sample point with information of the corresponding neighborhood sample points on the ray on which that sample point is located, according to the weights of those neighborhood sample points, to obtain the volume density and color information corresponding to that sample point in the ith iteration process, wherein the weight of each neighborhood sample point is determined based on the distance between the neighborhood sample point and the sample point.
24. The apparatus of any one of claims 19-23, wherein at least one convolution layer is included in the neural radiation field network, the at least one convolution layer configured to fuse information of each of the sample points with context information corresponding to the respective sample point by a convolution operation.
25. The apparatus of claim 24, wherein the at least one convolution layer comprises a plurality of convolution layers, the plurality of convolution layers being in a serial configuration, and wherein a magnitude of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a magnitude of an output of a corresponding next convolution layer in the at least one dimension.
26. The apparatus according to any one of claims 19-25, wherein
the training module is used for:
obtaining a weight corresponding to each pixel point in the first output image through the output of a kth layer of the neural radiation field network in the ith iteration process, wherein k is a positive integer smaller than a preset threshold value;
obtaining a target pixel value of at least one target pixel point in the first output image according to the first output image and the weight corresponding to each pixel point in the first output image;
and determining, according to the target pixel value of the at least one target pixel point in the first output image and the first image, whether training of the neural radiation field network is complete.
27. The apparatus according to any one of claims 19-26, wherein
the training module is used for:
acquiring, for each of a plurality of feature mapping networks corresponding to the ith iteration process, first feature information of the feature mapping network about the first output image, wherein any one of the feature mapping networks is used for mapping an input to a corresponding feature space, and the feature spaces corresponding to different feature mapping networks are different;
acquiring, for each of the plurality of feature mapping networks corresponding to the ith iteration process, second feature information of the feature mapping network about the first image;
and determining, according to the first feature information and the second feature information, whether training of the neural radiation field network is complete.
28. The apparatus according to claim 27, wherein
the training module is used for:
alternately training the neural radiation field network and a plurality of initial feature mapping networks according to each image and the view angle direction corresponding to each image, to obtain a plurality of trained initial feature mapping networks and a trained neural radiation field network, and taking the plurality of trained initial feature mapping networks as the plurality of feature mapping networks.
29. The apparatus according to claim 28, wherein
the training module is used for:
in the jth iteration process of training the plurality of initial feature mapping networks:
acquiring a second output image in the jth iteration process through a neural radiation field network in the jth iteration process, wherein j is a positive integer;
acquiring, for each of a plurality of initial feature mapping networks corresponding to the jth iteration process, third feature information of the initial feature mapping network about the second output image;
and determining, based on differences between the pieces of third feature information, whether training of the plurality of initial feature mapping networks is complete.
30. An image generating apparatus, comprising:
a neural radiation field network module, used for obtaining volume density and color information corresponding to each sample point in a plurality of sample points corresponding to a target view angle according to context information corresponding to each sample point, wherein the plurality of sample points are obtained by sampling at least one ray corresponding to the target view angle;
and a volume rendering module, used for obtaining, according to the volume density and the color information corresponding to each sample point, an output image corresponding to the target view angle through volume rendering.
31. The apparatus according to claim 30, wherein the context information corresponding to each sample point is obtained based on a neighborhood sample point of the respective sample point among the plurality of sample points corresponding to the target view angle.
32. The apparatus according to claim 31, wherein any sample point corresponds to a plurality of rays, the context information corresponding to the respective sample point is obtained based on a neighborhood sample point of the respective sample point on a neighborhood ray, the neighborhood ray is included in the plurality of rays, and the neighborhood ray is different from the ray on which the respective sample point is located.
33. The apparatus according to claim 31 or 32, wherein the context information corresponding to each sample point is obtained based on a corresponding neighborhood sample point on the ray on which the respective sample point is located.
34. The apparatus of claim 33, wherein the neural radiation field network module is configured to:
fusing, through the neural radiation field network, information of any sample point with information of the corresponding neighborhood sample points on the ray on which that sample point is located, according to the weights of those neighborhood sample points, to obtain the volume density and color information corresponding to that sample point, wherein the weight of each neighborhood sample point is determined based on the distance between the neighborhood sample point and the sample point.
35. The apparatus of any one of claims 30-34, wherein the neural radiation field network includes at least one convolution layer configured to fuse information of each of the sample points with context information corresponding to the respective sample point by a convolution operation.
36. The apparatus of claim 35, wherein the at least one convolution layer comprises a plurality of convolution layers, the plurality of convolution layers being in a serial configuration, and wherein a magnitude of an output of any one of the plurality of convolution layers in at least one dimension is not greater than a magnitude of an output of a corresponding next convolution layer in the at least one dimension.
37. An electronic device comprising at least one processor, a memory, and instructions stored on the memory and executable by the at least one processor, the at least one processor executing the instructions to perform the steps of the method of any one of claims 1-18.
38. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1-18.
CN202211716270.3A 2022-12-29 2022-12-29 Neural radiation field network training method and related equipment Pending CN116012515A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211716270.3A CN116012515A (en) 2022-12-29 2022-12-29 Neural radiation field network training method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211716270.3A CN116012515A (en) 2022-12-29 2022-12-29 Neural radiation field network training method and related equipment

Publications (1)

Publication Number Publication Date
CN116012515A true CN116012515A (en) 2023-04-25

Family

ID=86027894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211716270.3A Pending CN116012515A (en) 2022-12-29 2022-12-29 Neural radiation field network training method and related equipment

Country Status (1)

Country Link
CN (1) CN116012515A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433822A (en) * 2023-04-28 2023-07-14 北京数原数字化城市研究中心 Neural radiation field training method, device, equipment and medium
CN116433822B (en) * 2023-04-28 2023-11-07 北京数原数字化城市研究中心 Neural radiation field training method, device, equipment and medium
CN117372602A (en) * 2023-12-05 2024-01-09 成都索贝数码科技股份有限公司 Heterogeneous three-dimensional multi-object fusion rendering method, equipment and system
CN117372602B (en) * 2023-12-05 2024-02-23 成都索贝数码科技股份有限公司 Heterogeneous three-dimensional multi-object fusion rendering method, equipment and system
CN117710583A (en) * 2023-12-18 2024-03-15 中铁第四勘察设计院集团有限公司 Space-to-ground image three-dimensional reconstruction method, system and equipment based on nerve radiation field

Similar Documents

Publication Publication Date Title
CN113643378B (en) Active rigid body pose positioning method in multi-camera environment and related equipment
CN116012515A (en) Neural radiation field network training method and related equipment
CN109829849B (en) Training data generation method and device and terminal
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
DE112019002589T5 (en) DEPTH LEARNING SYSTEM
CN109640066B (en) Method and device for generating high-precision dense depth image
CN111860695A (en) Data fusion and target detection method, device and equipment
US20170278302A1 (en) Method and device for registering an image to a model
CN110443186B (en) Stereo matching method, image processing chip and mobile carrier
CN114758337B (en) Semantic instance reconstruction method, device, equipment and medium
WO2018179532A1 (en) System and method for representing point cloud of scene
CN109003297A (en) A kind of monocular depth estimation method, device, terminal and storage medium
CN111489394A (en) Object posture estimation model training method, system, device and medium
CN111738265A (en) Semantic segmentation method, system, medium, and electronic device for RGB-D image
CN110619660A (en) Object positioning method and device, computer readable storage medium and robot
CN113689578A (en) Human body data set generation method and device
CN115035235A (en) Three-dimensional reconstruction method and device
CN116250021A (en) Training method of image generation model, new view angle image generation method and device
CN105574844B (en) Rdaiation response Function Estimation method and apparatus
CN117274514A (en) Remote sensing image generation method and device based on ground-air visual angle geometric transformation
CN111369435A (en) Color image depth up-sampling method and system based on self-adaptive stable model
CN115861145A (en) Image processing method based on machine vision
CN113066165B (en) Three-dimensional reconstruction method and device for multi-stage unsupervised learning and electronic equipment
CN115205487A (en) Monocular camera face reconstruction method and device
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination