WO2024099564A1 - Learnable point spread functions for image rendering - Google Patents

Learnable point spread functions for image rendering

Info

Publication number
WO2024099564A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
point spread
pixel
spread function
depth
Application number
PCT/EP2022/081411
Other languages
French (fr)
Inventor
Richard Shaw
Eduardo PEREZ PELLITERO
Sibi CATLEY-CHANDAR
Ales LEONARDIS
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2022/081411
Publication of WO2024099564A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00: Image enhancement or restoration
    • G06T5/20: Image enhancement or restoration using local operators
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10004: Still image; Photographic image
    • G06T2207/10028: Range image; Depth image; 3D point clouds
    • G06T2207/20: Special algorithmic details
    • G06T2207/20004: Adaptive image processing
    • G06T2207/20012: Locally adaptive
    • G06T2207/20084: Artificial neural networks [ANN]

Definitions

  • This disclosure relates to image processing, in particular to rendering images.
  • Neural Radiance Fields (NeRF) can enable the rendering of high-quality images and depth maps of novel views of a three-dimensional (3D) scene.
  • Given a set of 2D images of a 3D scene with known camera poses (for example, focal length, rotation and translation), NeRF can learn an implicit mapping from spatial coordinates (x, y, z) to volume density and view-dependent colour (RGB).
  • a process known as volume rendering accumulates samples of this scene representation along camera rays to render the scene from any viewpoint.
  • FIG 1 schematically illustrates a visualization of a typical NeRF framework.
  • NeRF enables the rendering of high-quality images and depth maps of novel views of a scene.
  • Multi-view input images of a 3D scene, along with their corresponding camera poses, are processed by the NeRF optimization framework, learning a scene representation enabling novel views (RGB and depth) of the scene to be rendered.
  • NeRF can obtain impressive results for the task of novel view synthesis, but relies on a simplified pinhole camera model, which is a theoretical lens-less model where all rays pass through a single point known as the camera’s optical centre, as shown in Figure 2a. Under such lens-less model assumptions, all rays pass through a single point, which results in images that are entirely in focus (i.e. all parts of the images are equally sharp). However, this is unrealistic, since real cameras typically use complex multi-lens systems, with one or more moving groups designed for, for example, different focal lengths, as shown schematically in Figure 2b. This can result in a complex optical response governed by physical properties such as aperture, focus distance, focal length and depth. Furthermore, all real lenses will exhibit some amount of depth-of-field blur (DoF) and will never be completely in focus.
  • Figure 4a illustrates an example of real lens PSFs as measured scientifically in the lab using specialist equipment.
  • the PSFs are shown for a number of locations across the image sensor.
  • Figure 4b shows a close-up visualization of the PSFs at the corner of the sensor and in the centre for different aperture values: f2.8, f4 and f5.6, as described in M. Bauer et al., “Automatic Estimation of Modulation Transfer Functions”, 2018 IEEE International Conference on Computational Photography (ICCP), 1-12 [Bauer et al. 2018]. As the size of the aperture is increased, and towards the corner of the image sensor, the size of the PSF blur increases.
  • PSFs typically have a sharp edge, as the lens aperture occludes rays, as schematically illustrated in Figure 5a.
  • Figure 5b illustrates other variations in lens blur, for example caused by vignetting of the lens barrel and imperfections of the lens elements.
  • FIGS. 6a to 6c show examples of results from a NeRF trained on shallow DoF images.
  • Figures 6a and 6b show a rendered RGB image and its corresponding depth map respectively.
  • An example of an inaccurate depth produced by NeRF when trained with shallow DoF images is shown in Figure 6c.
  • If NeRF is trained with shallow DoF images, then this DoF blur is “baked” into the NeRF, i.e. the amount of DoF blur or the focus point cannot be changed after optimizing the NeRF. Therefore, although the rendered image is quite accurate, the NeRF fails to accurately reconstruct the depth map of the blurry background. This is due to the NeRF’s pinhole camera model, which assumes the images to be all-in-focus. To compensate, the NeRF distorts the background depth such that the resulting rays render the RGB image to match the training images.
  • a first category of approaches is synthetic blur and discretized depth plane (i.e. post-processing methods to add blur). These methods typically apply depth-varying blur as a post-process to rendered images and depth maps. Given an RGB image and its corresponding depth map, the RGB image is split into segments based on a number of discretized depth planes, and each plane is convolved with a 2D blur kernel.
  • the blur kernel is usually chosen to be a simple circular or Gaussian blur kernel, where the radius or standard deviation of the blur kernel increases with distance from the chosen focus plane, i.e. the further from the focus point, the more blurry the image and thus the larger the radius of the blur kernel. Since these methods are synthetic, they are not able to reconstruct realistic blur from real lenses. Furthermore, if they are to be incorporated into an end-to-end NeRF framework, the discretization of the depth map into planes can cause training instability.
  • a final category is thin-lens modelling.
  • Such methods model out-of-focus blur by explicitly computing the geometry of rays passing through a thin-lens model.
  • These methods are more complex and computationally expensive to train, since many rays need to be rendered to generate out-of-focus blur.
  • They are also trained using synthetically generated blur (for example, Gaussian) due to lack of real paired sharp-blurry data with camera lens parameter labels.
  • These methods are controllable and based on understood lens optics. However, they are still an approximation to real lenses. They are also computationally complex, as many rays need to be rendered.
  • An example of this method is described in Wang et al., "NeRFocus: Neural Radiance Field for 3D Synthetic Defocus", arXiv:2203.05189, 2022.
  • an image processing apparatus for forming an enhanced image, the image processing apparatus being configured to: receive data for an input image, the data comprising respective colour and depth values for each pixel of multiple pixels of the input image; receive a trained point spread function model comprising a respective learned point spread function for each pixel of the multiple pixels of the input image; modify each learned point spread function in dependence on the respective depth value of the respective pixel to form a respective modified point spread function for each pixel of the multiple pixels of the input image; and apply the respective modified point spread function to the respective colour value for one or more pixels comprising the respective pixel to form the enhanced image.
  • the apparatus may be configured to form the enhanced image so as to render depth-of-field blur in the enhanced image. This may allow images with accurate 3D geometry to be rendered.
  • the apparatus may be configured to modify each learned point spread function in dependence on a respective depth mask weight for the respective pixel, each depth mask weight being determined based on a pixel-wise depth difference from a central pixel in the multiple pixels of the input image.
  • the point spread function model described herein is therefore fully spatial- and depth-varying, resulting in a more expressive and realistic model compared to the pinhole camera model used in standard rendering frameworks.
  • Each learned point spread function may comprise a matrix of weights corresponding to the respective pixel and multiple neighboring pixels.
  • Each learned point spread function may be applied to a patch of pixels comprising a central pixel and one or more neighboring pixels. This may allow the rendered blur to be more realistic.
  • Each learned point spread function may be further modifiable in dependence on one or more parameters of an image sensor and camera lens that captured the input image. This can enable the model to be generalizable across scenes, as it is specific to the lens and is scene-agnostic.
  • the one or more parameters may comprise one or more of focal length, focus distance and aperture value. This can allow the point spread functions to be conditioned on such camera parameters.
  • the learned point spread function model may be specific to a particular lens of an image sensor used to capture the input image.
  • the learned point spread functions may vary spatially across the image sensor. This can enable the model to be generalizable across scenes, as it is specific to a lens.
  • the apparatus may be further configured to convert spatial locations in the input image to another coordinate system and apply one or more known properties of an image sensor used to capture the input image to the converted spatial locations. Once the spatial locations have been converted, the apparatus may be configured to apply prior knowledge of an image sensor of the camera to aid with training. For example, the coordinate transform module may apply prior knowledge that the sensor has symmetric properties, thus reducing the learnable space. This may improve the efficiency of training.
  • the trained point spread function model may be scene-agnostic. This may allow the approach to be generalizable across different scenes.
  • the received data may be an output of an image rendering model. This may allow the learned point spread function model to be used in a rendering framework, such as a neural rendering pipeline.
  • the input image may correspond to a novel view output from the image rendering model.
  • the image rendering model may be, for example a Neural Radiance Fields (NeRF) model.
  • the trained point spread function model may be a multi-layer perceptron neural network. This may allow the model to represent a neural field.
  • the trained point spread function model may be trained end-to-end with the image rendering model.
  • When trained end-to-end within a NeRF framework, a sharp all-in-focus internal representation of the 3D scene is learnt, which when rendered using standard volume rendering techniques can enable controllable depth-of-field blur given novel camera parameters.
  • the trained point spread function model may be trained using paired sharp-blurry images with labelled lens parameters. This may allow the model to take the lens parameters into account during learning. The ambiguity between aperture and focus distance can be mitigated when the point spread function model is pre-trained with labelled paired data.
  • a method for forming an enhanced image comprising: receiving data for an input image, the data comprising respective colour and depth values for each pixel of multiple pixels of the input image; receiving a trained point spread function model comprising a respective learned point spread function for each pixel of the multiple pixels of the input image; modifying each learned point spread function in dependence on the respective depth value of the respective pixel to form a modified point spread function for each pixel of the multiple pixels of the input image; and applying the respective modified point spread function to the respective colour value for one or more pixels comprising the respective pixel to form the enhanced image.
  • a computer-readable storage medium having stored thereon computer-readable instructions that, when executed at a computer system, cause the computer system to perform the steps set out above.
  • the computer system may comprise the one or more processors.
  • the computer-readable storage medium may be a non- transitory computer-readable storage medium.
  • Figure 1 schematically illustrates Neural Radiance Fields (NeRF), which can enable the rendering of high-quality images and depth maps of novel views of a scene.
  • Figure 2a schematically illustrates a pinhole camera model.
  • Figure 2b schematically illustrates a complex multi-lens design.
  • Figure 3a illustrates images captured with lenses operating with large apertures and high focal lengths, which inherently have a shallow depth-of-field resulting in photos with progressively- blurry backgrounds.
  • Figure 3b shows an image captured using a portrait mode on a smartphone, which can simulate background blur synthetically.
  • Figure 4a shows examples of real lens point spread functions, with spatial variation across an image sensor [from Bauer et al. 2018].
  • Figure 4b shows examples of real lens point spread functions (PSF) for different apertures (f values) [from Bauer et al. 2018].
  • Figure 5a schematically illustrates how PSFs typically have a hard edge due to rays being occluded by the physical lens aperture.
  • Figure 5b illustrates examples of other variations in lens blur.
  • Figures 6a to 6c show results from a NeRF trained on shallow depth-of-field images: Figure 6a is a rendered RGB image, Figure 6b is a foreground depth map, and Figure 6c is a background depth map.
  • Figure 7a schematically illustrates an artificial intelligence-based lens modelling neural rendering framework.
  • Figure 7b schematically illustrates one implementation of an end-to-end rendering pipeline shown in greater detail.
  • Figure 8 schematically illustrates a network design for a point spread function neural field model.
  • Figure 9 shows an example of a transformation of sensor coordinates from Cartesian to Polar coordinates.
  • Figure 10 schematically illustrates an example of a point spread function modulation function.
  • Figure 11 schematically illustrates the application of a soft weight mask which uses a continuous weighting function.
  • Figure 12 schematically illustrates an example of a point spread function application module.
  • Figure 13 shows the steps of a method of forming an enhanced image in accordance with embodiments of the present invention.
  • Figure 14 schematically illustrates an image processing apparatus in accordance with embodiments of the present invention and some of its associated components.
  • Figures 15a-15c show examples of results on synthetic data.
  • Figure 15a shows an all-in-focus input image
  • Figure 15b shows a ground truth blurry image
  • Figure 15c shows a predicted blurry image with the learnt point spread function model.
  • Figures 16a-16c show examples of results obtained by incorporating a learned lens point spread function model into the NeRF framework and training end-to-end.
  • Figure 16a shows a ground truth sharp image
  • Figure 16b shows a blurry input image
  • Figure 16c shows a recovered sharp image.
  • Figures 17a-17e show some further qualitative results on real world data.
  • Figure 17a shows the all-in-focus input
  • Figure 17b shows the ground truth blur
  • Figure 17c shows the result of fitting the PSF blur model to a real 3D scene captured with the all-in-focus and blurry image pair of Figures 17a and 17b.
  • Figures 17d and 17e show novel views which can be rendered with the learned PSF blur, and this blur can be modified by controlling the blur kernels.
  • Embodiments of the present invention introduce an artificial intelligence (AI) system implementing a learnable lens model that can learn point spread functions (also referred to as kernels).
  • the PSFs can be learned, for example, using neural rendering from multi-view images and depth maps in an image rendering framework.
  • Camera lens-specific PSFs can be learned, which can allow images to be rendered with realistic out-of-focus blur. This can allow for reconstruction and synthesis of images via neural rendering with shallow depth-of-field input images, which can particularly address the difficulties in reconstructing accurate 3D geometry.
  • Figure 7a shows an example of an image rendering pipeline 700.
  • the pipeline comprises a number of separate modules, including a differential rendering model 705, a volume rendering module 707 and a PSF model 711.
  • the PSF model 711 can be inserted into the rendering pipeline 700, as shown schematically in Figure 7a, and can enable the pipeline to both reconstruct and render image blur.
  • the differential rendering model 705 can be trained separately to the PSF model 711 using any suitable known method, or the PSF model 711 may be trained end-to-end with the differential rendering model 705, as will be described later.
  • the pipeline 700 processes multi-view input images 701 of a scene, which may be sharp and/or have a shallow depth-of-field, with known camera parameters (such as aperture, focus and distance).
  • the pipeline 700 reconstructs these images using neural rendering to output rendered blur images 717.
  • the rendered blur images 717 may each have a novel view compared to the input images 701.
  • the pipeline 700 can perform one or more operations on the inputs 701 including ray casting 702, point sampling 703 and frequency encoding 704 before inputting data derived from the input images 701 to the differential rendering model 705.
  • the output of the differential rendering model 705 is (c_k, σ_k), where c_k is a colour and σ_k is a density for a 3D space coordinate k, which is input to the volume rendering module 707 to output colour, C(x,y), and depth, Depth(x,y), values for each pixel of the input image at a pixel location (x, y) of the image sensor, as shown at 708.
  • a patch sampler 709 can sample patches of pixels N(x,y). Each patch may comprise a central pixel and multiple neighbouring pixels.
  • For training the PSF model 711 for a particular lens, multi-view image training data can be used.
  • the data to train the PSF lens model 711 preferably comprises paired multi-view image data of a number of 3D scenes, comprising sharp all-in-focus images and corresponding images with depth-of-field blur.
  • the training data may advantageously span a number of different lens apertures (for example, f1.8 to f16) and focus distances (with camera parameter labels, for example extracted from the camera's Exchangeable Image File Format (EXIF)), as well as encompassing a range of different scene depths.
  • the captured paired image data with camera parameter labels and depth maps can therefore be used to train the lens-specific PSF model 711.
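  • As an illustration of how such camera parameter labels might be collected during data preparation, the sketch below reads the relevant fields from an image's EXIF data using Pillow. The tag IDs follow the Exif standard; the function name and the choice of library are assumptions made for illustration rather than part of the described pipeline.

```python
from PIL import Image

# Exif tag IDs defined by the Exif standard.
FOCAL_LENGTH, F_NUMBER, SUBJECT_DISTANCE = 0x920A, 0x829D, 0x9206
EXIF_IFD_POINTER = 0x8769  # pointer to the photographic sub-IFD

def read_lens_params(path):
    """Return (focal length, aperture f-number, focus distance) labels for a
    training image, or None for fields the camera did not record."""
    exif = Image.open(path).getexif()
    ifd = exif.get_ifd(EXIF_IFD_POINTER)
    return ifd.get(FOCAL_LENGTH), ifd.get(F_NUMBER), ifd.get(SUBJECT_DISTANCE)
```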
  • a neural rendering pipeline can be trained to render sharp depth maps by comparing its output with sharp ground truth depth maps. As all modules are differentiable, this enables end-to-end training of the complete system.
  • a NeRF pipeline is trained end-to-end with the PSF lens model.
  • the learnable PSF lens model 711 is trained to learn PSF kernel weights K_ij 712 for each location on an image sensor of the camera (i.e. for each pixel, and optionally also applied across one or more neighbouring pixels in a patch N(x,y)), given depth at each location, camera lens focal length, focus distance and lens aperture for the lens, as shown at 718.
  • the PSF model 711 for a lens can be fixed and used together with any neural rendering method for end-to-end reconstruction and novel-view rendering of new scenes.
  • the coordinate transform module shown at 710 converts image sensor pixel locations (x,y) to another coordinate system (e.g. polar coordinates) and can apply sensor priors to aid with training of the AI system.
  • the coordinate transform module may apply prior knowledge that the sensor has symmetric properties, thus reducing the learnable space to one half or one quarter of the complete space accordingly.
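  • A minimal sketch of such a coordinate transform is given below. The four-fold folding is one example of a symmetry prior consistent with the half- or quarter-space reduction mentioned above, and the normalisation choices are assumptions made for illustration.

```python
import numpy as np

def sensor_to_polar(x, y, width, height, fold_symmetry=True):
    """Convert pixel coordinates to polar coordinates about the sensor centre.

    If fold_symmetry is True, the angle is folded into one quadrant, encoding
    an assumed four-fold symmetry of the lens response and reducing the space
    the PSF model has to learn."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    dx, dy = x - cx, y - cy
    r = np.hypot(dx, dy) / np.hypot(cx, cy)       # radius normalised to [0, 1]
    theta = np.arctan2(dy, dx)                    # angle in (-pi, pi]
    if fold_symmetry:
        theta = np.abs(theta)                     # fold upper/lower halves
        theta = np.minimum(theta, np.pi - theta)  # fold left/right halves
    return r, theta
```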
  • the pipeline also comprises a PSF depth modulation mechanism, which determines a function 713 that modulates PSF kernels K_ij 712 with continuous depth mask weights 714 based on pixel-wise depth differences from a central pixel in a patch N(x,y). Therefore, the continuous depth mask weights are a function that modulates PSF kernels based on pixel-wise relative depth differences. This is used to prevent blurring across parts of the image at different depths.
  • the PSF application module shown at 716 is a mechanism of applying the transformations of spatial and depth-varying point spread functions to rendered images. This module applies the learnt PSF kernel to the RGB colour values C(x,y) of a pixel or a patch of pixels centred on a particular sensor location (x,y), modulated by the continuous depth mask, to generate images 717 rendered with depth-of-field blur.
  • A more detailed embodiment of the pipeline is schematically illustrated in Figure 7b.
  • the pipeline 800 can perform one or more operations on each of the input images 801 including ray casting 802, point sampling 803 and frequency encoding 804 before inputting data derived from the input images to the differential rendering model 805, which in this example is a NeRF model.
  • Given a set of 2D images of a 3D scene with known camera poses (for example, specifying focal length, rotation and translation), the NeRF model 805 can learn an implicit mapping from spatial coordinates (x, y, z) to volume density and view-dependent colour (RGB).
  • the output of the NeRF model 805 is (c_i, σ_i), shown at 806.
  • the volume rendering module 807 accumulates samples of each scene representation along camera rays to render the scene from any viewpoint.
  • the output of the volume rendering module 807 is colour, C(x,y), and depth, Depth(x,y), values for each pixel of the input image at a pixel location (x,y) of the image sensor, as shown at 808.
  • the patch sampler 809 operates as for the patch sampler 709 of Figure 7a.
  • the coordinate transform module 810 converts image sensor pixel locations (x, y) from Cartesian coordinates to Polar coordinates and can apply sensor priors, as described above.
  • the output of the coordinate transform module is frequency encoded at 811.
  • the PSF model 812 in this example is a multi-layer perceptron (MLP) model.
  • the learnable PSF lens model 812 is trained to learn PSF kernel weights K_ij 813 for each location on an image sensor of the camera (i.e. for each pixel and optionally one or more neighbouring pixels in a patch N(x,y)), given depth at each location, camera lens focal length, focus distance and lens aperture for the lens, as shown at 819.
  • these parameters are frequency encoded at 820 before being input to the model 812.
  • the PSF kernels K_ij 813 output from the PSF model 812 are modulated with continuous depth mask weights 815 based on pixel-wise depth differences from a central pixel in a patch N(x,y) from a continuous depth mask, shown at 814, which in this example uses a Gaussian function.
  • this modifies each learned point spread function in dependence on the respective depth value of the respective pixel to form a modified point spread function for each pixel. This results in a spatial and depth varying kernel at 816.
  • the transformations of spatial and depth-varying point spread functions are then applied to rendered images.
  • this is performed using a convolution (dot product), as will be described in more detail with reference to Figure 12.
  • the module 817 applies the learnt PSF kernel to the RGB colour values C(x,y) of a pixel or a patch of pixels centred on a particular sensor location (x,y), modulated by the continuous depth mask, to generate images 818 rendered with depth-of-field blur.
  • the learnable PSF model for a particular camera lens learns to generate PSFs for each pixel.
  • the PSFs may also be termed blur kernels.
  • the PSF for a respective pixel may be applied to the respective pixel and one or more neighbouring pixels of a patch comprising the respective pixel.
  • the PSF for a respective pixel may be applied to a patch of 9 pixels, 16 pixels, 25 pixels, or so on (with the respective pixel at the center of the patch). This may be performed for all pixels of the image.
  • the PSF model is a neural field represented by a 3-layer fully connected multilayer perceptron neural network (MLP) 807.
  • the model takes as input a sensor location (x,y) and scene depth d(x,y) at that location and outputs the corresponding blur kernel weights k_ij.
  • Parameters 806 of a camera 805 can also be fed into the PSF model 807.
  • all inputs to the MLP 807 are encoded to a higher dimensionality using a coordinate transform 803 (for example, from Cartesian to Polar coordinates) and frequency encoding 804, following the standard neural rendering procedure.
  • This frequency encoding can increase the input dimensionality.
  • Alternative neural network architectures may be used.
  • the output of the MLP model 807 is the PSF blur kernel weights.
  • the output is a kernel 808 in vector form, which is reshaped into an s x s matrix 809 containing s² elements. 810 is an example of how such a point spread function may look for real data.
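  • A possible implementation of this neural field is sketched below in PyTorch. The hidden width, the number of frequency bands and the softmax normalisation of the kernel weights are assumptions made for illustration; only the overall structure (frequency-encoded inputs, a small fully connected MLP, an s x s kernel output) follows the description above. In use, the kernels returned for a batch of pixels would then be modulated by the depth mask and applied to the corresponding colour patches, as described below.

```python
import math
import torch
import torch.nn as nn

def freq_encode(x, num_bands=6):
    """NeRF-style frequency (positional) encoding of each input dimension."""
    out = [x]
    for k in range(num_bands):
        out += [torch.sin((2.0 ** k) * math.pi * x),
                torch.cos((2.0 ** k) * math.pi * x)]
    return torch.cat(out, dim=-1)

class PSFField(nn.Module):
    """3-layer MLP mapping encoded sensor coordinates, depth and lens
    parameters to an s x s PSF kernel for that sensor location."""

    def __init__(self, s=5, in_dim=6, num_bands=6, hidden=128):
        super().__init__()
        self.s, self.num_bands = s, num_bands
        enc_dim = in_dim * (2 * num_bands + 1)
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, s * s),
        )

    def forward(self, coords, depth, lens_params):
        # coords: (B, 2) polar sensor coordinates, depth: (B, 1),
        # lens_params: (B, 3) = (focal length, focus distance, aperture).
        x = torch.cat([coords, depth, lens_params], dim=-1)
        weights = self.mlp(freq_encode(x, self.num_bands))
        weights = torch.softmax(weights, dim=-1)   # non-negative, sums to one
        return weights.view(-1, self.s, self.s)
```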
  • the model 807 is fully differentiable so can be trained either separately (pretrained), or end-to-end within a neural rendering framework.
  • the PSF modulation mechanism implements a function that generates a continuous weight mask based on the respective depths of each pixel of a patch of pixels to modulate the learned PSF kernel.
  • This mechanism can therefore output a soft weight mask for a patch of pixels to prevent blurring of the PSF kernels across parts of the scene at different depths, instead of splitting the depth into discontinuous planes.
  • a Gaussian function is used to produce weights W_ij for a patch of pixels 1000 based on the depth difference of a respective pixel 1001 of the patch 1000 from the centre pixel 1002 of the patch of pixels (i.e. d_i'j' - d_ij, where d is depth).
  • the weight mask is used to prevent blurring across parts of the scene at different depths by applying the weights to the kernel values of a patch of pixels.
  • the hyperparameter of the Gaussian’s standard deviation σ can be pre-defined and fixed, or could be a learnt parameter.
  • the function could be represented by a neural network.
  • this mechanism is equivalent to splitting the scene into depth planes, but instead of discretized planes, the soft weight depth mask transforms this into a continuous weighting function for each patch 1-4 shown in the figure, which is better for optimization purposes and model expressivity.
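  • A minimal sketch of such a soft depth mask is given below; the Gaussian form follows the description above, while the function name and the default bandwidth value are arbitrary assumptions.

```python
import numpy as np

def depth_weight_mask(patch_depth, centre_depth, sigma=0.1):
    """Continuous (soft) depth mask for one s x s patch.

    Weights fall off smoothly with the depth difference from the central
    pixel, so the learned PSF is not blurred across large depth
    discontinuities."""
    diff = patch_depth - centre_depth
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2))
```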
  • the PSF application module applies the corresponding PSF at each sensor location (i.e. for each respective pixel and one or more neighbouring pixels in a patch comprising a respective pixel at the center of the patch) to produce the final RGB output image C with rendered depth-of-field blur.
  • This module generates the final image pixel values with rendered depth-of-field blur. It fuses a patch of rendered colour pixels, with each pixel having a respective colour value c_ij, the learned PSF kernel at that sensor location, and the modulating weight mask produced by the PSF modulation mechanism.
  • a simple dot product of the colour values with the kernel weights and the PSF modulation weights, summed over the patch, is performed.
  • Other implementations may alternatively be used, for example using a dictionary of learned kernels and applying them in the fast Fourier Transform (FFT) domain.
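  • As a sketch, the fusion described above reduces to an element-wise product summed over the patch. The function name and the renormalisation step (which keeps overall brightness roughly constant) are illustrative assumptions.

```python
import numpy as np

def apply_psf(patch_rgb, kernel, depth_mask):
    """Fuse one colour patch with its learned PSF kernel and depth mask.

    patch_rgb:  s x s x 3 rendered colours centred on the target pixel
    kernel:     s x s learned PSF weights for this sensor location
    depth_mask: s x s soft depth-modulation weights
    Returns the blurred RGB value of the central pixel."""
    w = kernel * depth_mask
    w = w / (w.sum() + 1e-8)
    return (w[..., None] * patch_rgb).sum(axis=(0, 1))
```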
  • the loss (L1 loss) between the ground truth image I and the predicted image Î can be minimized according to the following loss function:
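  • A per-pixel L1 reconstruction loss of the standard form, summed over pixel locations, is assumed here:

\[
\mathcal{L} = \sum_{(x,y)} \left\| \hat{I}(x,y) - I(x,y) \right\|_1
\]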
  • Figure 13 shows the steps of an exemplary method for forming an enhanced image.
  • the method comprises receiving data for an input image, the data comprising respective colour and depth values for each pixel of multiple pixels of the input image.
  • the data for the input image may comprise respective colour and depth values for every pixel of the input image.
  • the method comprises receiving a trained point spread function model comprising a respective learned point spread function for each pixel of the multiple pixels of the input image.
  • the method comprises modifying each learned point spread function in dependence on the respective depth value of the respective pixel to form a respective modified point spread function for each pixel of the multiple pixels of the input image.
  • the method comprises applying the respective modified point spread function to the respective colour value for one or more pixels comprising the respective pixel to form the enhanced image.
  • the respective modified point spread function may be applied to a patch of pixels centred on the respective pixel. This may be performed for each pixel of the multiple pixels of the input image. This may be performed for every pixel of the input image to form the enhanced image.
  • This method may be implemented on-device (for example, on a smartphone) or externally, such as on cloud-based services.
  • FIG 14 shows an example of an image processing apparatus for performing the methods described herein.
  • the apparatus 1400 may comprise at least one processor, such as processors 1401, and at least one memory, such as memory 1402.
  • the memory stores in a non-transient way code that is executable by the processor(s) to implement the apparatus in the manner described herein.
  • the apparatus may also comprise one or more image sensors 1403 configured to capture an image which can then be input to a rendering pipeline implemented by the processor 1401 or used to train the rendering pipeline in the manner described herein.
  • This AI system can enable depth-of-field blur to be rendered in images rendered by a novel view synthesis method such as NeRF.
  • the PSF lens blur model is represented by a neural field, which can be conditioned on sensor location, depth and captured camera parameters (such as focal length, focus distance and aperture).
  • the depth weighting mechanism modulates the learned blur kernels to prevent blurring across parts of the scene at significantly different depths.
  • the learnt lens blur kernels and depth weighting are then applied to image patches rendered by a neural rendering framework to generate images with depth-of-field blur.
  • the implementation of the learned PSF model in the rendering pipeline can lead to a more expressive camera model, capable of generating images with realistic depth-of-field blur, rather than the sharp all-in-focus images normally generated by standard NeRF models.
  • the PSF model can also lead to better geometry (depth) reconstruction when the learnt lens-specific PSF model is incorporated into a NeRF framework and trained end-to-end. This is because the NeRF has to update the depth map to produce sharp colours before the blur model is applied.
  • lens-specific (scene-agnostic) blur can be learned from real images.
  • the learned PSFs are generalizable across scenes, and novel views can be rendered and the blur controlled a priori.
  • This approach therefore provides for a controllable and learnable system that generates DoF blur by taking colour and depth values from volumetric rendering and applying point spread functions.
  • the neural field can learn spatially-varying kernel weights K_ij based on real camera parameters (focal length, focus distance, aperture value), transformed sensor coordinates, and depth values.
  • the present approach can enable learning of spatial- and depth-varying PSF kernels for each location on the sensor, leading to the realistic reconstruction of images containing DoF blur.
  • Sensor coordinates can be transformed, for example to polar coordinates in a quotient space.
  • Conditioning the neural field on camera parameters enables view synthesis with novel camera parameters.
  • the continuous depth mask provides a smooth function that modulates the PSF kernels based on pixel-wise depth differences, preventing blurring across large depth discontinuities and avoiding problems associated with discretized depth planes. This can enable the application of learnt PSF kernels to colours and depth provided by neural rendering, resulting in renders with realistic DoF blur.
  • Figures 15a-15c show qualitative results of fitting a lens PSF model to a synthetic 3D scene.
  • the present model is able to render the out-of-focus blur accurately, with a PSNR of approximately 50 dB when compared to the ground truth blurry image.
  • Figure 15a shows an all-in-focus input image
  • Figure 15b shows a ground truth blurry image
  • Figure 15c shows a predicted blurry image obtained using the present approach, with an example of a PSF blur kernel learned through optimization of the model.
  • Figures 17a-17e show some further qualitative results on real world data.
  • Figure 17a shows the all-in-focus input
  • Figure 17b shows the ground truth blur
  • Figure 17c shows the result of fitting the PSF blur model to a real 3D scene captured with the all-in-focus and blurry image pair of Figures 17a and 17b.
  • Figures 17d and 17e show novel views which can be rendered with the learned PSF blur, and this blur can be modified by controlling the blur kernels.
  • the lens model is conditioned on real camera parameters and thus enables rendering of images with novel camera parameters.
  • the ambiguity between aperture and focus distance is mitigated since the lens PSF model is pretrained with labelled paired data, i.e. the lens model is conditioned on camera parameters.
  • the PSF lens model can be pre-trained, for example in the lab, using paired sharp and blurry images with known camera parameters as input. This enables the lens model to be generalizable across scenes, as it is specific to the lens and is scene-agnostic.
  • the trained point spread function model can advantageously be learned based on camera parameters of an image sensor.
  • the model is therefore also controllable and conditioned on lens parameters, enabling novel views to be rendered with novel camera parameters, facilitating the depth-of-field to be changed after capture and NeRF reconstruction (whereas in most prior NeRF methods, the depth-of-field is baked into the NeRF model and cannot be changed).
  • the PSF model can advantageously be conditioned on known camera settings (such as focus distance and aperture from the camera’s EXIF data) which are fed into the model; this enables easy control of the blur by changing the input camera parameters. Therefore, the method can learn to disambiguate between focus distance and aperture value and enables novel view synthesis with new camera parameters (such as aperture and focus distance).
  • Embodiments of the present invention can therefore provide a general lens model that is scene-agnostic (and camera and lens dependent instead) and thus generalizable across different scenes.
  • the PSF model described herein can therefore learn arbitrary blur PSF kernels from real data and is fully spatial- and depth-varying, resulting in a more expressive and realistic model compared to the pinhole camera model used in standard NeRF frameworks.
  • By incorporating a PSF model for a camera lens into a neural rendering framework, the behaviour of real lenses can be better modelled, and images can be reconstructed with realistic depth-of-field blur.
  • a sharp all-in-focus internal representation of the 3D scene is learnt which, when rendered using standard volume rendering techniques, enables controllable depth-of-field blur given novel camera parameters.


Abstract

An image processing apparatus (1400) for forming an enhanced image (717, 818), the apparatus (1400) being configured to: receive (1301) data for an input image, the data comprising respective colour and depth values for each pixel of multiple pixels; receive (1302) a trained point spread function model (711, 812) comprising a respective learned point spread function (712, 813) for each pixel of the multiple pixels of the input image; modify (1303) each learned point spread function (712, 813) in dependence on the respective depth value of the respective pixel to form a modified point spread function (715, 816) for each pixel of the multiple pixels of the input image; and apply (1304) the respective modified point spread function (715, 816) to the respective colour value for one or more pixels comprising the respective pixel to form the enhanced image (717, 818). By incorporating a point spread function model, the behaviour of real lenses can be better modelled and images can be reconstructed with realistic depth-of-field blur.

Description

LEARNABLE POINT SPREAD FUNCTIONS FOR IMAGE RENDERING
FIELD OF THE INVENTION
This disclosure relates to image processing, in particular to rendering images.
BACKGROUND
Neural Radiance Fields (NeRF), as described in B. Mildenhall et al., “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, The 2020 European Conference on Computer Vision (ECCV 2020), can enable the rendering of high-quality images and depth maps of novel views of a three-dimensional (3D) scene. Given a set of two-dimensional (2D) images of a 3D scene with known camera poses (for example, focal length, rotation and translation), NeRF can learn an implicit mapping from spatial coordinates (x, y, z) to volume density and view-dependent colour (RGB). A process known as volume rendering accumulates samples of this scene representation along camera rays to render the scene from any viewpoint.
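For reference, the volume rendering step can be written in the standard quadrature form used in the NeRF literature, where c_k and σ_k are the sampled colour and density along a ray r, δ_k is the spacing between adjacent samples and T_k is the accumulated transmittance; this formulation is reproduced here only as background:

\[
\hat{C}(\mathbf{r}) = \sum_{k=1}^{K} T_k \left(1 - e^{-\sigma_k \delta_k}\right) c_k, \qquad T_k = \exp\!\left(-\sum_{j=1}^{k-1} \sigma_j \delta_j\right)
\]

A depth map can be rendered analogously by replacing the sample colours c_k with the sample distances along the ray.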
Figure 1 schematically illustrates a visualization of a typical NeRF framework. NeRF enables the rendering of high-quality images and depth maps of novel views of a scene. Multi-view input images of a 3D scene, along with their corresponding camera poses, are processed by the NeRF optimization framework, learning a scene representation enabling novel views (RGB and depth) of the scene to be rendered.
NeRF can obtain impressive results for the task of novel view synthesis, but relies on a simplified pinhole camera model, which is a theoretical lens-less model where all rays pass through a single point known as the camera’s optical centre, as shown in Figure 2a. Under such lens-less model assumptions, all rays pass through a single point, which results in images that are entirely in focus (i.e. all parts of the images are equally sharp). However, this is unrealistic, since real cameras typically use complex multi-lens systems, with one or more moving groups designed for, for example, different focal lengths, as shown schematically in Figure 2b. This can result in a complex optical response governed by physical properties such as aperture, focus distance, focal length and depth. Furthermore, all real lenses will exhibit some amount of depth-of-field blur (DoF) and will never be completely in focus.
In non-pinhole camera models, the spread of rays onto the image sensor creates image blur in out-of-focus regions. This shallow depth-of-field (DoF) blur is often exploited for aesthetic photographic or video purposes, such as to focus the viewer's attention on a particular subject or isolate the subject from the background, to create a more "cinematic" large sensor or aperture look, or to create aesthetically pleasing "bokeh" effects, as shown in Figures 3a and 3b. As illustrated in Figure 3a, lenses operating with large apertures and high focal lengths inherently have a shallow depth-of-field resulting in photos with progressively-blurry backgrounds. The wider the aperture (the smaller the f-stop) the greater the out-of-focus blur. As illustrated in Figure 3b, portrait mode on smartphones can simulate background blur synthetically, but such modes are often error-prone and fail to consider spatially varying behaviour.
This depth-of-field blur can be modelled by a point spread function (PSF), which describes the response of a focused optical imaging system to a point source or point object. A more general term for the PSF is the system's impulse response: the PSF is the impulse response or impulse response function (IRF) of a focused optical imaging system. PSFs are camera and lens-specific and vary spatially across the image sensor. The PSF of a lens system depends on a number of factors, such as lens aperture, focal length, focus distance, sensor position and scene depth.
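In discrete form, the effect of a spatially varying PSF on an ideal sharp image can be written as the following sum, where k_{x,y} denotes the PSF centred on sensor location (x, y) and N(x, y) its support; this standard image-formation model is included here only for context:

\[
I_{\mathrm{blur}}(x,y) = \sum_{(i,j) \in \mathcal{N}(x,y)} k_{x,y}(i,j)\, I_{\mathrm{sharp}}(i,j)
\]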
Figure 4a illustrates an example of real lens PSFs as measured scientifically in the lab using specialist equipment. The PSFs are shown for a number of locations across the image sensor. Figure 4b shows a close-up visualization of the PSFs at the corner of the sensor and in the centre for different aperture values: f2.8, f4 and f5.6, as described in M. Bauer et al., “Automatic Estimation of Modulation Transfer Functions”, 2018 IEEE International Conference on Computational Photography (ICCP), 1-12 [Bauer et al. 2018]. As the size of the aperture is increased, and towards the corner of the image sensor, the size of the PSF blur increases.
PSFs typically have a sharp edge, as the lens aperture occludes rays, as schematically illustrated in Figure 5a. Figure 5b illustrates other variations in lens blur, for example caused by vignetting of the lens barrel and imperfections of the lens elements.
Due to the simplified pinhole camera model used by most neural rendering frameworks, such frameworks cannot handle input images containing shallow DoF blur, since the model inherently assumes all-in-focus input images. To be able to reconstruct images containing DoF blur, when trained with shallow DoF input images, neural rendering approaches distort the learnt geometry (depth maps) such that rays rendered from the scene representation generate blur matching the training images. The resulting poor-quality geometry reconstruction typically degrades the ability to perform novel view synthesis. Figures 6a to 6c show examples of results from a NeRF trained on shallow DoF images. Figures 6a and 6b show a rendered RGB image and its corresponding depth map respectively. An example of an inaccurate depth produced by NeRF when trained with shallow DoF images is shown in Figure 6c. If NeRF is trained with shallow DoF images, then this DoF blur is “baked” into the NeRF, i.e. the amount of DoF blur or the focus point cannot be changed after optimizing the NeRF. Therefore, although the rendered image is quite accurate, the NeRF fails to accurately reconstruct the depth map of the blurry background. This is due to the NeRF’s pinhole camera model, which assumes the images to be all-in-focus. To compensate, the NeRF distorts the background depth such that the resulting rays render the RGB image to match the training images.
Several methods have been developed to try to overcome the above issues. A first category of approaches is synthetic blur and discretized depth plane (i.e. post-processing methods to add blur). These methods typically apply depth-varying blur as a post-process to rendered images and depth maps. Given an RGB image and its corresponding depth map, the RGB image is split into segments based on a number of discretized depth planes, and each plane is convolved with a 2D blur kernel. The blur kernel is usually chosen to be a simple circular or Gaussian blur kernel, where the radius or standard deviation of the blur kernel increases with distance from the chosen focus plane, i.e. the further from the focus point, the more blurry the image and thus the larger the radius of the blur kernel. Since these methods are synthetic, they are not able to reconstruct realistic blur from real lenses. Furthermore, if they are to be incorporated into an end-to-end NeRF framework, the discretization of the depth map into planes can cause training instability.
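A minimal sketch of this class of post-processing methods is given below. The number of planes, the Gaussian-radius schedule and the helper names are illustrative assumptions rather than any specific published implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def depth_plane_blur(rgb, depth, focus_depth, num_planes=8, blur_gain=2.0):
    """Prior-art style post-process: discretize depth into planes and blur
    each plane with a Gaussian whose radius grows with distance from the
    chosen focus plane.

    rgb:   H x W x 3 all-in-focus image (float)
    depth: H x W depth map
    """
    edges = np.linspace(depth.min(), depth.max(), num_planes + 1)
    out = np.zeros_like(rgb)
    for p in range(num_planes):
        mask = (depth >= edges[p]) & (depth <= edges[p + 1])
        plane_depth = 0.5 * (edges[p] + edges[p + 1])
        sigma = blur_gain * abs(plane_depth - focus_depth)
        blurred = gaussian_filter(rgb, sigma=(sigma, sigma, 0))
        out[mask] = blurred[mask]
    return out
```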
These approaches generally have low computational complexity and blur is easily controllable a priori, using a simple algorithm. However, as mentioned above, the blur is generally not realistic (not learned from data and is synthetically generated a priori). These methods assume all-in-focus input images, which may not be available. Blur modelling is a post-process (it does not affect the NeRF reconstruction). Discretized depth planes can also cause issues, such as haloing around objects, discontinuities that can affect training stability, and the need to decide how many planes to use. Examples of these methods are described in B. Mildenhall et al., “NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images”, Computer Vision and Pattern Recognition Conference (CVPR) 2022 and Busam et al., “Sterefo: Efficient Image Refocusing with Stereo Vision”, International Conference on Computer Vision Workshops (ICCVW) 2019.
Another category of prior approaches involves using blur models that fit to the scene. This family of algorithms provides a mechanism to blur the observations. However, this blur is overfitted to the scene and is thus non-generalizable. The models are learnt from real data and are usually more expressive, allowing them to better reconstruct real blur. However, they are generally not easily controllable. The blur model is fitted to each scene separately and is not generalizable (i.e. it is a model of the scene, not of the camera plus lens). Also, most methods cannot handle large blur sizes. These models are usually learnt in a self-supervised way and therefore fail with consistent blur (when all training images are blurred in the same way). They also cannot disentangle focus distance and aperture. Examples of these methods include Deblur-NeRF, described in Li et al., “Deblur-NeRF: Neural Radiance Fields from Blurry Images”, Computer Vision and Pattern Recognition Conference (CVPR) 2022, and DoF-NeRF, described in Wu et al., “DoF-NeRF: Depth-of-Field Meets Neural Radiance Fields”, ACM International Conference on Multimedia (ACMMM) 2022.
A final category is thin-lens modelling. Such methods model out-of-focus blur by explicitly computing the geometry of rays passing through a thin-lens model. These methods are more complex and computationally expensive to train, since many rays need to be rendered to generate out-of-focus blur. They are also trained using synthetically generated blur (for example, Gaussian) due to the lack of real paired sharp-blurry data with camera lens parameter labels. These methods are controllable and based on understood lens optics. However, they are still an approximation to real lenses. They are also computationally complex, as many rays need to be rendered. An example of this method is described in Wang et al., "NeRFocus: Neural Radiance Field for 3D Synthetic Defocus", arXiv:2203.05189, 2022.
It is desirable to develop a method that can overcome at least some of the above issues.
SUMMARY OF THE INVENTION
According to one aspect, there is provided an image processing apparatus for forming an enhanced image, the image processing apparatus being configured to: receive data for an input image, the data comprising respective colour and depth values for each pixel of multiple pixels of the input image; receive a trained point spread function model comprising a respective learned point spread function for each pixel of the multiple pixels of the input image; modify each learned point spread function in dependence on the respective depth value of the respective pixel to form a respective modified point spread function for each pixel of the multiple pixels of the input image; and apply the respective modified point spread function to the respective colour value for one or more pixels comprising the respective pixel to form the enhanced image. By incorporating a point spread function model for a camera lens into a rendering framework, the behaviour of real lenses can be better modelled, and images can be reconstructed with realistic depth-of-field blur. This can allow for reconstruction and synthesis of images via neural rendering with shallow depth-of-field input images, which can particularly address the difficulties in reconstructing accurate 3D geometry.
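The core of this processing can be illustrated with a short sketch. The psf_model callable, the patch size s and the mask bandwidth sigma_d are assumptions made for illustration; the sketch simply shows a per-pixel learned kernel being modulated by a depth-difference weight and applied to a colour patch.

```python
import numpy as np

def render_dof(rgb, depth, psf_model, s=5, sigma_d=0.1):
    """Apply learned, depth-modulated PSFs to a rendered RGB/depth pair.

    rgb:       H x W x 3 colours from the renderer
    depth:     H x W depth map from the renderer
    psf_model: callable (x, y, d) -> s*s learned PSF kernel weights
    """
    H, W, _ = rgb.shape
    r = s // 2
    out = rgb.copy()  # border pixels keep their rendered values
    for y in range(r, H - r):
        for x in range(r, W - r):
            patch_rgb = rgb[y - r:y + r + 1, x - r:x + r + 1]
            patch_d = depth[y - r:y + r + 1, x - r:x + r + 1]
            k = np.asarray(psf_model(x, y, depth[y, x])).reshape(s, s)
            # Depth modulation: suppress kernel taps at very different depths.
            w = k * np.exp(-((patch_d - depth[y, x]) ** 2) / (2 * sigma_d ** 2))
            w /= w.sum() + 1e-8
            out[y, x] = (w[..., None] * patch_rgb).sum(axis=(0, 1))
    return out
```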
The apparatus may be configured to form the enhanced image so as to render depth-of-field blur in the enhanced image. This may allow images with accurate 3D geometry to be rendered.
The apparatus may be configured to modify each learned point spread function in dependence on a respective depth mask weight for the respective pixel, each depth mask weight being determined based on a pixel-wise depth difference from a central pixel in the multiple pixels of the input image. The point spread function model described herein is therefore fully spatial- and depth-varying, resulting in a more expressive and realistic model compared to the pinhole camera model used in standard rendering frameworks.
Each learned point spread function may comprise a matrix of weights corresponding to the respective pixel and multiple neighboring pixels. Each learned point spread function may be applied to a patch of pixels comprising a central pixel and one or more neighboring pixels. This may allow the rendered blur to be more realistic.
Each learned point spread function may be further modifiable in dependence on one or more parameters of an image sensor and camera lens that captured the input image. This can enable the model to be generalizable across scenes, as it is specific to the lens and is scene-agnostic.
The one or more parameters may comprise one or more of focal length, focus distance and aperture value. This can allow the point spread functions to be conditioned on such camera parameters.
The learned point spread function model may be specific to a particular lens of an image sensor used to capture the input image. The learned point spread functions may vary spatially across the image sensor. This can enable the model to be generalizable across scenes, as it is specific to a lens. The apparatus may be further configured to convert spatial locations in the input image to another coordinate system and apply one or more known properties of an image sensor used to capture the input image to the converted spatial locations. Once the spatial locations have been converted, the apparatus may be configured to apply prior knowledge of an image sensor of the camera to aid with training. For example, the coordinate transform module may apply prior knowledge that the sensor has symmetric properties, thus reducing the learnable space. This may improve the efficiency of training.
The trained point spread function model may be scene-agnostic. This may allow the approach to be generalizable across different scenes.
The received data may be an output of an image rendering model. This may allow the learned point spread function model to be used in a rendering framework, such as a neural rendering pipeline.
The input image may correspond to a novel view output from the image rendering model. The image rendering model may be, for example a Neural Radiance Fields (NeRF) model.
The trained point spread function model may be a multi-layer perceptron neural network. This may allow the model to represent a neural field.
The trained point spread function model may be trained end-to-end with the image rendering model. When trained end-to-end within a NeRF framework, a sharp all-in-focus internal representation of the 3D scene is learnt, which when rendered using standard volume rendering techniques can enable controllable depth-of-field blur given novel camera parameters.
The trained point spread function model may be trained using paired sharp-blurry images with labelled lens parameters. This may allow the model to take the lens parameters into account during learning. The ambiguity between aperture and focus distance can be mitigated when the point spread function model is pre-trained with labelled paired data.
According to a second aspect there is provided a method for forming an enhanced image, the method comprising: receiving data for an input image, the data comprising respective colour and depth values for each pixel of multiple pixels of the input image; receiving a trained point spread function model comprising a respective learned point spread function for each pixel of the multiple pixels of the input image; modifying each learned point spread function in dependence on the respective depth value of the respective pixel to form a modified point spread function for each pixel of the multiple pixels of the input image; and applying the respective modified point spread function to the respective colour value for one or more pixels comprising the respective pixel to form the enhanced image.
By incorporating a point spread function model for a camera lens into a rendering method, the behaviour of real lenses can be better modelled, and images can be reconstructed with realistic depth-of-field blur. This can allow for reconstruction and synthesis of images via neural rendering with shallow depth-of-field input images, which can particularly address the difficulties in reconstructing accurate 3D geometry.
According to a further aspect, there is provided a computer-readable storage medium having stored thereon computer-readable instructions that, when executed at a computer system, cause the computer system to perform the steps set out above. The computer system may comprise the one or more processors. The computer-readable storage medium may be a non- transitory computer-readable storage medium.
BRIEF DESCRIPTION OF THE FIGURES
The present disclosure will now be described by way of example with reference to the accompanying drawings. In the drawings:
Figure 1 schematically illustrates Neural Radiance Fields (NeRF), which can enable the rendering of high-quality images and depth maps of novel views of a scene.
Figure 2a schematically illustrates a pinhole camera model.
Figure 2b schematically illustrates a complex multi-lens design.
Figure 3a illustrates images captured with lenses operating with large apertures and high focal lengths, which inherently have a shallow depth-of-field resulting in photos with progressively- blurry backgrounds. The wider the aperture (the smaller the f-stop) the greater the out-of-focus blur.
Figure 3b shows an image captured using a portrait mode on a smartphone, which can simulate background blur synthetically.
Figure 4a shows examples of real lens point spread functions, with spatial variation across an image sensor [from Bauer et al. 2018].
Figure 4b shows examples of real lens point spread functions (PSF) for different apertures (f values) [from Bauer et al. 2018].
Figure 5a schematically illustrates how PSFs typically have a hard edge due to rays being occluded by the physical lens aperture.
Figure 5b illustrates examples of other variations in lens blur.
Figures 6a to 6c show results from a NeRF trained on shallow depth-of-field images: Figure 6a is a rendered RGB image, Figure 6b is a foreground depth map, and Figure 6c is a background depth map.
Figure 7a schematically illustrates an artificial intelligence-based lens modelling neural rendering framework.
Figure 7b schematically illustrates one implementation of an end-to-end rendering pipeline shown in greater detail.
Figure 8 schematically illustrates a network design for a point spread function neural field model.
Figure 9 shows an example of a transformation of sensor coordinates from Cartesian to Polar coordinates.
Figure 10 schematically illustrates an example of a point spread function modulation function.
Figure 11 schematically illustrates the application of a soft weight mask which uses a continuous weighting function.
Figure 12 schematically illustrates an example of a point spread function application module.
Figure 13 shows the steps of a method of forming an enhanced image in accordance with embodiments of the present invention.
Figure 14 schematically illustrates an image processing apparatus in accordance with embodiments of the present invention and some of its associated components.
Figures 15a-15c show examples of results on synthetic data. Figure 15a shows an all-in-focus input image, Figure 15b shows a ground truth blurry image and Figure 15c shows a predicted blurry image with the learnt point spread function model.
Figures 16a-16c show examples of results obtained by incorporating a learned lens point spread function model into the NeRF framework and training end-to-end. Figure 16a shows a ground truth sharp image, Figure 16b shows a blurry input image and Figure 16c shows a recovered sharp image.
Figures 17a-17e show some further qualitative results on real world data. Figure 17a shows the all-in-focus input, Figure 17b shows the ground truth blur, and Figure 17c shows the result of fitting the PSF blur model to a real 3D scene captured with the all-in-focus and blurry image pair of Figures 17a and 17b. Figures 17d and 17e show novel views which can be rendered with the learned PSF blur; this blur can be modified by controlling the blur kernels.
DETAILED DESCRIPTION
Embodiments of the present invention introduce an artificial intelligence (AI) system implementing a learnable lens model that can learn point spread functions (also referred to as kernels). The PSFs can be learned, for example, using neural rendering from multi-view images and depth maps in an image rendering framework. Camera lens-specific PSFs can be learned, which can allow images to be rendered with realistic out-of-focus blur. This can allow for reconstruction and synthesis of images via neural rendering from shallow depth-of-field input images, which can particularly address the difficulties in reconstructing accurate 3D geometry.
Figure 7a shows an example of an image rendering pipeline 700. The pipeline comprises a number of separate modules, including a differential rendering model 705, a volume rendering module 707 and a PSF model 711. The PSF model 711 can be inserted into the rendering pipeline 700, as shown schematically in Figure 7a, and can enable the pipeline to both reconstruct and render image blur. The differential rendering model 705 can be trained separately from the PSF model 711 using any suitable known method, or the PSF model 711 may be trained end-to-end with the differential rendering model 705, as will be described later.
In this example, the pipeline 700 processes multi-view input images 701 of a scene, which may be sharp and/or have a shallow depth-of-field, with known camera parameters (such as aperture and focus distance). The pipeline 700 reconstructs these images using neural rendering to output rendered blur images 717. The rendered blur images 717 may each have a novel view compared to the input images 701. The pipeline 700 can perform one or more operations on the inputs 701, including ray casting 702, point sampling 703 and frequency encoding 704, before inputting data derived from the input images 701 to the differential rendering model 705.
As shown at 706, the output of the differential rendering model 705 is (c_k, σ_k), where c_k is a colour and σ_k is a density for a 3D space coordinate k. This output is input to the volume rendering module 707, which outputs colour, C(x,y), and depth, Depth(x,y), values for each pixel of the input image at a pixel location (x,y) of the image sensor, as shown at 708. A patch sampler 709 can sample patches of pixels N(x,y). Each patch may comprise a central pixel and multiple neighbouring pixels.
For training the PSF model 711 for a particular lens, multi-view image training data can be used. The data to train the PSF lens model 711 preferably comprises paired multi-view image data of a number of 3D scenes, comprising sharp all-in-focus images and corresponding images with depth-of-field blur. The training data may advantageously span a number of different lens apertures (for example, f1.8 to f16) and focus distances (with camera parameter labels, for example extracted from the camera's Exchangeable Image File Format (EXIF)), as well as encompassing a range of different scene depths. The captured paired image data with camera parameter labels and depth maps can therefore be used to train the lens-specific PSF model 711.
Using the sharp all-in-focus images, a neural rendering pipeline can be trained to render sharp depth maps by comparing its output with sharp ground truth depth maps. As all modules are differentiable, this enables end-to-end training of the complete system. In a preferred implementation, a NeRF pipeline is trained end-to-end with the PSF lens model.
The learnable PSF lens model 711 is trained to learn PSF kernel weights K_ij 712 for each location on an image sensor of the camera (i.e. for each pixel, optionally also applied across one or more neighbouring pixels in a patch N(x,y)), given the depth at each location and the camera lens focal length, focus distance and lens aperture, as shown at 718.
After pre-training with a per-lens paired dataset, as described above, the PSF model 711 for a lens can be fixed and used together with any neural rendering method for end-to-end reconstruction and novel-view rendering of new scenes.
The coordinate transform module shown at 710 converts image sensor pixel locations (x,y) to another coordinate system (e.g. polar coordinates) and can apply sensor priors to aid with training of the AI system. For example, the coordinate transform module may apply prior knowledge that the sensor has symmetric properties, thus reducing the learnable space to one half or one quarter of the complete space accordingly.
The pipeline also comprises a PSF depth modulation mechanism, which determines a function 713 that modulates the PSF kernels K_ij 712 with continuous depth mask weights W_ij 714 based on pixel-wise depth differences from a central pixel in a patch N(x,y). The continuous depth mask weights therefore form a function that modulates the PSF kernels based on pixel-wise relative depth differences. This is used to prevent blurring across parts of the image at different depths.
By applying the depth mask weights W_ij to the kernel weights K_ij, each learned point spread function is modified in dependence on the respective depth value of the respective pixel to form a modified point spread function for each pixel. This results in a spatial- and depth-varying kernel at 715. The PSF application module shown at 716 is a mechanism for applying the transformations of spatial- and depth-varying point spread functions to rendered images. This module applies the learnt PSF kernel to the RGB colour values C(x,y) of a pixel or a patch of pixels centred on a particular sensor location (x,y), modulated by the continuous depth mask, to generate images 717 rendered with depth-of-field blur.
A more detailed embodiment of the pipeline is schematically illustrated in Figure 7b.
The pipeline 800 can perform one or more operations on each of the input images 801 including ray casting 802, point sampling 803 and frequency encoding 804 before inputting data derived from the input images to the differential rendering model 805, which in this example is a NeRF model.
Given a set of 2D images of a 3D scene with known camera poses (for example, specifying focal length, rotation and translation), the NeRF model 805 can learn an implicit mapping from spatial coordinates (x, y, z) to volume density and view-dependent colour (RGB). The output of the NeRF model 805 is (c_k, σ_k), shown at 806. The volume rendering module 807 accumulates samples of each scene representation along camera rays to render the scene from any viewpoint. The output of the volume rendering module 807 is colour, C(x,y), and depth, Depth(x,y), values for each pixel of the input image at a pixel location (x,y) of the image sensor, as shown at 808. The patch sampler 809 operates as for the patch sampler 709 of Figure 7a.
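As a rough illustration only (the function and variable names below are not taken from the source), volume rendering along a single camera ray can be sketched with the standard NeRF quadrature, accumulating sample colours and densities into a pixel colour and an expected depth:

```python
import numpy as np

def volume_render(colours, sigmas, t_vals):
    """Accumulate per-sample colours c_k and densities sigma_k along one camera ray.

    colours: (S, 3) array, sigmas: (S,) array, t_vals: (S,) increasing sample distances.
    Returns the rendered pixel colour C(x, y) and the expected depth Depth(x, y).
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)   # spacing between consecutive samples
    alphas = 1.0 - np.exp(-sigmas * deltas)              # per-sample opacity
    trans = np.cumprod(1.0 - alphas + 1e-10)             # transmittance up to and including each sample
    trans = np.concatenate(([1.0], trans[:-1]))          # shift: transmittance *before* each sample
    weights = alphas * trans                             # contribution of each sample to the pixel
    colour = (weights[:, None] * colours).sum(axis=0)    # C(x, y)
    depth = (weights * t_vals).sum()                     # Depth(x, y)
    return colour, depth
```

The same per-sample weights yield both the pixel colour and the depth value later consumed by the PSF model.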
The coordinate transform module 810 converts image sensor pixel locations (x, y) from Cartesian coordinates to Polar coordinates and can apply sensor priors, as described above. In this example, the output of the coordinate transform module is frequency encoded at 811.
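A minimal sketch of the frequency encoding step, assuming a NeRF-style sinusoidal encoding (the number of bands and the exact formulation are assumptions, not taken from the source):

```python
import numpy as np

def frequency_encode(values, num_bands=6):
    """Map each scalar input v to [sin(2^l * pi * v), cos(2^l * pi * v)] for l = 0..num_bands-1,
    lifting the inputs to a higher dimensionality before they are fed to the MLP."""
    v = np.asarray(values, dtype=np.float64).ravel()
    freqs = (2.0 ** np.arange(num_bands)) * np.pi        # 2^l * pi
    angles = np.outer(v, freqs)                          # (num_inputs, num_bands)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()
```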
The PSF model 812 in this example is a multi-layer perceptron (MLP) model. The learnable PSF lens model 812 is trained to learn PSF kernel weights K_ij 813 for each location on an image sensor of the camera (i.e. for each pixel and optionally one or more neighbouring pixels in a patch N(x,y)), given the depth at each location and the camera lens focal length, focus distance and lens aperture, as shown at 819. In this example, these parameters are frequency encoded at 820 before being input to the model 812.
The PSF kernels K_ij 813 output from the PSF model 812 are modulated with continuous depth mask weights 815, based on pixel-wise depth differences from a central pixel in a patch N(x,y), using a continuous depth mask shown at 814, which in this example uses a Gaussian function.
By applying the depth mask weights W_ij to the kernel weights K_ij, each learned point spread function is modified in dependence on the respective depth value of the respective pixel to form a modified point spread function for each pixel. This results in a spatial- and depth-varying kernel at 816.
The transformations of spatial and depth-varying point spread functions are then applied to rendered images. In this example, this is performed using a convolution (dot product), as will be described in more detail with reference to Figure 12. The module 817 applies the learnt PSF kernel to the RGB colour values C(x,y) of a pixel or a patch of pixels centred on a particular sensor location (x,y), modulated by the continuous depth mask, to generate images 818 rendered with depth-of-field blur.
As described above, the learnable PSF model for a particular camera lens learns to generate PSFs for each pixel. The PSFs may also be termed blur kernels. The PSF for a respective pixel may be applied to the respective pixel and one or more neighbouring pixels of a patch comprising the respective pixel. For example, the PSF for a respective pixel may be applied to a patch of 9 pixels, 16 pixels, 25 pixels, or so on (with the respective pixel at the center of the patch). This may be performed for all pixels of the image.
One implementation of the network design for a PSF neural field model is shown in Figure 8. In this example, the PSF model is a neural field represented by a 3-layer fully connected multi-layer perceptron neural network (MLP) 807. From an RGB image 801 and a corresponding depth map 802, comprising RGB colour and depth data for multiple pixels respectively, the model takes as input a sensor location (x,y) and the scene depth d(x,y) at that location and outputs the corresponding blur kernel weights k_ij. Parameters 806 of a camera 805 (such as focal length, focus distance and aperture) can also be fed into the PSF model 807. In this implementation, all inputs to the MLP 807 are encoded to a higher dimensionality using a coordinate transform 803 (for example, from Cartesian to Polar coordinates) and frequency encoding 804, following the standard neural rendering procedure. This frequency encoding increases the input dimensionality. Alternative neural network architectures may be used. The output of the MLP model 807 is the PSF blur kernel weights. In this example, the output is a kernel 808 in vector form, which is reshaped into an s × s matrix 809 containing s^2 elements. 810 is an example of how such a point spread function may look for real data. The model 807 is fully differentiable so can be trained either separately (pre-trained) or end-to-end within a neural rendering framework.
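A minimal sketch of such a PSF neural field is given below. The hidden width, activation function, kernel size and the softmax normalisation of the output weights are all assumptions made for illustration; only the overall structure (encoded inputs in, s × s kernel weights out) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden=64, kernel_size=5):
    """Random initialisation of a 3-layer fully connected MLP; in practice the weights are learned."""
    dims = [in_dim, hidden, hidden, kernel_size * kernel_size]
    return [(rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def psf_mlp(params, encoded_inputs, kernel_size=5):
    """Map encoded sensor location, depth and camera parameters to an s x s blur kernel."""
    h = encoded_inputs
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)                       # ReLU on the hidden layers
    k = np.exp(h - h.max())                              # softmax so the weights are positive
    k /= k.sum()                                         # and sum to one (normalisation is an assumption)
    return k.reshape(kernel_size, kernel_size)
```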
The transformation of the coordinates (x,y) of the image sensor 900 from Cartesian coordinates to Polar coordinates (r, θ) before being input to the MLP is schematically illustrated in Figure 9. This transformation constrains the learned PSF kernel weights to a single quadrant 901 of the image sensor 900, exploiting the symmetric properties of the image sensor. Therefore, symmetry of the learned PSFs is enforced by restricting inputs to a single quadrant 901 of the sensor 900. This can reduce the learnable space to one quarter of the total sensor space.
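One possible realisation of this coordinate transform, assuming the angle is folded into a single quadrant by taking absolute offsets from the sensor centre (the exact folding used in the embodiment is not specified here):

```python
import numpy as np

def to_symmetric_polar(x, y, width, height):
    """Convert a sensor pixel location (x, y) to polar coordinates (r, theta) about the sensor
    centre, folding the angle into one quadrant to exploit the sensor's assumed symmetry."""
    dx = x - (width - 1) / 2.0
    dy = y - (height - 1) / 2.0
    r = np.hypot(dx, dy)                    # radial distance from the sensor centre
    theta = np.arctan2(abs(dy), abs(dx))    # folded into [0, pi/2], i.e. a single quadrant
    return r, theta
```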
As described above, the PSF modulation mechanism implements a function that generates a continuous weight mask based on the respective depths of each pixel of a patch of pixels to modulate the learned PSF kernel. This mechanism can therefore output a soft weight mask for a patch of pixels to prevent blurring of the PSF kernels across parts of the scene at different depths, instead of splitting the depth into discontinuous planes.
One example of such a function is shown in Figure 10. In this implementation, a Gaussian function is used to produce weights W_ij for a patch of pixels 1000 based on the depth difference of a respective pixel 1001 of the patch 1000 from the centre pixel 1002 of the patch of pixels (i.e. d_i'j' − d_ij, where d is depth). The weight mask is used to prevent blurring across parts of the scene at different depths by applying the weights to the kernel values of a patch of pixels. Using a simple Gaussian function, the larger the depth discontinuity from the centre pixel of a given patch, the less weight is applied. The hyperparameter of the Gaussian's standard deviation σ can be pre-defined and fixed, or could be a learnt parameter. Alternatively, the function could be represented by a neural network.
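A minimal sketch of this Gaussian depth mask, with the standard deviation treated as a fixed hyperparameter (it could equally be learned, as noted above):

```python
import numpy as np

def depth_mask_weights(depth_patch, sigma=0.1):
    """Soft weight mask W_ij for one patch: the larger a pixel's depth difference from the
    central pixel, the smaller its weight, preventing blur across depth discontinuities."""
    centre = depth_patch[depth_patch.shape[0] // 2, depth_patch.shape[1] // 2]
    diff = depth_patch - centre                         # d_i'j' - d_ij
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2))    # Gaussian fall-off with depth difference
```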
As illustrated in Figure 11, this mechanism is equivalent to splitting the scene into depth planes, but instead of discretized planes, the soft weight depth mask transforms this into a continuous weighting function for each patch 1-4 shown in the figure, which is better for optimization purposes and model expressivity.
As schematically illustrated in Figure 12, the PSF application module applies the corresponding PSF at each sensor location (i.e. for each respective pixel and one or more neighbouring pixels in a patch comprising the respective pixel at the centre of the patch) to produce the final RGB output image C with rendered depth-of-field blur. This module generates the final image pixel values with rendered depth-of-field blur. It fuses a patch of rendered colour pixels, with each pixel having a respective colour value c_i'j', the learned PSF kernel K_ij at that sensor location, and the modulating weight mask produced by the PSF modulation mechanism. In this exemplary implementation, a simple dot product of the colour values with the kernel weights and PSF modulation weights, summed over the patch, is performed. Other implementations may alternatively be used, for example using a dictionary of learned kernels and applying them in the fast Fourier Transform (FFT) domain.
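A minimal sketch of this fusion step for a single patch; the renormalisation of the masked kernel is an assumption added so that flat regions keep their brightness, not something stated in the source:

```python
import numpy as np

def apply_psf(colour_patch, kernel, mask):
    """Fuse a rendered colour patch (s, s, 3) with its learned PSF kernel (s, s) and depth
    mask (s, s): a weighted sum over the patch gives the blurred output pixel."""
    weights = kernel * mask
    weights = weights / (weights.sum() + 1e-8)          # renormalise after masking (assumption)
    return (weights[..., None] * colour_patch).sum(axis=(0, 1))
```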
To train the pixel-wise reconstruction, the loss (L1 loss) between the ground truth image I and the predicted image Î can be minimized according to the following loss function:
L = ||I − Î||_1
Figure 13 shows the steps of an exemplary method for forming an enhanced image. At step 1301, the method comprises receiving data for an input image, the data comprising respective colour and depth values for each pixel of multiple pixels of the input image. In some implementations, the data for the input image may comprise respective colour and depth values for every pixel of the input image. At step 1302, the method comprises receiving a trained point spread function model comprising a respective learned point spread function for each pixel of the multiple pixels of the input image. At step 1303, the method comprises modifying each learned point spread function in dependence on the respective depth value of the respective pixel to form a respective modified point spread function for each pixel of the multiple pixels of the input image. At step 1304, the method comprises applying the respective modified point spread function to the respective colour value for one or more pixels comprising the respective pixel to form the enhanced image. The respective modified point spread function may be applied to a patch of pixels centred on the respective pixel. This may be performed for each pixel of the multiple pixels of the input image. This may be performed for every pixel of the input image to form the enhanced image.
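Pulling the pieces together, steps 1301 to 1304 can be sketched as below, assuming the helper functions from the earlier sketches (frequency_encode, to_symmetric_polar, init_mlp/psf_mlp, depth_mask_weights and apply_psf) are in scope; the names, feature layout and patch handling are illustrative assumptions rather than the claimed implementation.

```python
import numpy as np

def form_enhanced_image(colours, depths, mlp_params, cam_params, s=5):
    """Sketch of steps 1301-1304: for every pixel, query the learned PSF model, modulate the
    kernel with the local depth mask and apply it to the surrounding colour patch."""
    H, W, _ = colours.shape
    pad = s // 2
    c_pad = np.pad(colours, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    d_pad = np.pad(depths, pad, mode="edge")
    out = np.zeros_like(colours)
    for y in range(H):
        for x in range(W):
            r, theta = to_symmetric_polar(x, y, W, H)               # coordinate transform
            feats = np.concatenate([frequency_encode([r, theta]),   # encoded sensor location
                                    frequency_encode([depths[y, x]]),
                                    frequency_encode(cam_params)])  # focal length, focus distance, aperture
            kernel = psf_mlp(mlp_params, feats, kernel_size=s)      # learned PSF (step 1302)
            mask = depth_mask_weights(d_pad[y:y + s, x:x + s])      # depth modulation (step 1303)
            out[y, x] = apply_psf(c_pad[y:y + s, x:x + s], kernel, mask)  # blurred pixel (step 1304)
    return out
```

With six frequency bands and three camera parameters, the encoded feature vector has 72 elements, so mlp_params = init_mlp(72, kernel_size=5) would match this sketch.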
This method may be implemented on-device (for example, on a smartphone) or externally, such as on cloud-based services.
Figure 14 shows an example of an image processing apparatus for performing the methods described herein. The apparatus 1400 may comprise at least one processor, such as processors 1401, and at least one memory, such as memory 1402. The memory stores, in a non-transient way, code that is executable by the processor(s) to implement the apparatus in the manner described herein. The apparatus may also comprise one or more image sensors 1403 configured to capture an image which can then be input to a rendering pipeline implemented by the processor 1401 or used to train the rendering pipeline in the manner described herein.
This AI system can enable depth-of-field blur to be rendered in images produced by a novel view synthesis method such as NeRF. The PSF lens blur model is represented by a neural field, which can be conditioned on sensor location, depth and captured camera parameters (such as focal length, focus distance and aperture). The depth weighting mechanism modulates the learned blur kernels to prevent blurring across parts of the scene at significantly different depths. The learnt lens blur kernels and depth weighting are then applied to image patches rendered by a neural rendering framework to generate images with depth-of-field blur.
The implementation of the learned PSF model in the rendering pipeline can lead to a more expressive camera model, capable of generating images with realistic depth-of-field blur, rather than the sharp all-in-focus images normally generated by standard NeRF models. The PSF model can also lead to better geometry (depth) reconstruction when the learnt lens-specific PSF model is incorporated into a NeRF framework and trained end-to-end. This is because the NeRF has to update the depth map to produce sharp colours before the blur model is applied. Furthermore, by incorporating a learnable lens model into a neural rendering framework, lens-specific (scene-agnostic) blur can be learned from real images. Thus, the learned PSFs are generalizable across scenes, and novel views can be rendered and the blur controlled a priori.
This approach therefore provides a controllable and learnable system that generates DoF blur by taking colour and depth values from volumetric rendering and applying point spread functions. The neural field can learn spatially-varying kernel weights K_ij based on real camera parameters (focal length, focus distance, aperture value), transformed sensor coordinates, and depth values.
The present approach can enable learning of spatial- and depth-varying PSF kernels for each location on the sensor, leading to the realistic reconstruction of images containing DoF blur. Transforming sensor coordinates (e.g. polar coordinates in a quotient space) enables priors such as PSF symmetry. Conditioning the neural field on camera parameters enables view synthesis with novel camera parameters.
The continuous depth mask provides a smooth function that modulates the PSF kernels based on pixel-wise depth differences, preventing blurring across large depth discontinuities and avoiding problems associated with discretized depth planes. This enables the application of the learnt PSF kernels to the colours and depths provided by neural rendering, resulting in renders with realistic DoF blur.
Figures 15a-15c show qualitative results of fitting a lens PSF model to a synthetic 3D scene. The present model is able to render the out-of-focus blur accurately, with a PSNR of ~50 dB when compared to the ground truth blurry image. Figure 15a shows an all-in-focus input image, Figure 15b shows a ground truth blurry image and Figure 15c shows a predicted blurry image obtained using the present approach, with an example of a PSF blur kernel learned through optimization of the model.
Incorporating the learned PSF lens model (pre-trained on a dataset of sharp-blurry image pairs) into an end-to-end NeRF framework improves the depth geometry reconstruction and enables sharp images to be recovered from blurry input images. This is because the PSF model enables the NeRF to learn an internal scene representation that is sharp before the volume-rendered pixels are convolved with their corresponding blur kernels. This result is shown in Figures 16a-16c. Figure 16a shows a ground truth sharp image, Figure 16b shows a blurry input image and Figure 16c shows a recovered sharp image. Incorporating the learned lens PSF model into the NeRF framework and training end-to-end results in sharper depth reconstruction and enables the sharp image to be recovered from blurry inputs.
Figures 17a-17e show some further qualitative results on real world data. Figure 17a shows the all-in-focus input, Figure 17b shows the ground truth blur, and Figure 17c shows the result of fitting the PSF blur model to a real 3D scene captured with the all-in-focus and blurry image pair of Figures 17a and 17b. Figures 17d and 17e show novel views which can be rendered with the learned PSF blur; this blur can be modified by controlling the blur kernels.
Some further advantages of this solution are that the lens model is conditioned on real camera parameters and thus enables rendering of images with novel camera parameters. The ambiguity between aperture and focus distance is mitigated since the lens PSF model is pre-trained with labelled paired data, i.e. the lens model is conditioned on camera parameters. As discussed above, the PSF lens model can be pre-trained, for example in the lab, using paired sharp and blurry images with known camera parameters as input. This enables the lens model to be generalizable across scenes, as it is specific to the lens and is scene-agnostic. The trained point spread function model can advantageously be learned based on camera parameters of an image sensor. The model is therefore also controllable and conditioned on lens parameters, enabling novel views to be rendered with novel camera parameters and allowing the depth-of-field to be changed after capture and NeRF reconstruction (whereas in most prior NeRF methods, the depth-of-field is baked into the NeRF model and cannot be changed). As the PSF model can advantageously be conditioned on known camera settings (such as focus distance and aperture from the camera's EXIF data), which are fed into the model, this enables easy control of the blur by changing the input camera parameters. Therefore, the method can learn to disambiguate between focus distance and aperture value and enables novel view synthesis with new camera parameters (such as aperture and focus distance).
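As a hypothetical usage example (the parameter values and the form_enhanced_image helper from the sketch above are illustrative, not from the source), re-rendering with a different aperture only requires changing the camera parameters fed to the learned PSF model:

```python
import numpy as np

# Re-render the same reconstruction with two different apertures by changing only the
# camera parameters passed to the learned PSF model.
cam_f1_8 = np.array([50.0, 1.5, 1.8])    # focal length (mm), focus distance (m), f-number
cam_f16 = np.array([50.0, 1.5, 16.0])    # stopping down should shrink the learned blur kernels

# shallow_dof = form_enhanced_image(colours, depths, mlp_params, cam_f1_8)
# deep_dof = form_enhanced_image(colours, depths, mlp_params, cam_f16)
```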
In contrast to previous methods, dense PSF kernels are learned, where the PSF function is applied to a central pixel and multiple neighbouring pixels (i.e. a patch of pixels). The blur rendering is lens-specific (i.e. scene-agnostic blur, where the blur rendering does not depend on the scene being captured). Embodiments of the present invention can therefore provide a general lens model that is scene-agnostic (and camera and lens dependent instead) and thus generalizable across different scenes.
The PSF model described herein can therefore learn arbitrary blur PSF kernels from real data and is fully spatial- and depth-varying, resulting in a more expressive and realistic model compared to the pinhole camera model used in standard NeRF frameworks.
Thus, by incorporating a PSF model for a camera lens into a neural rendering framework, the behaviour of real lenses can be better modelled, and images can be reconstructed with realistic depth-of-field blur. When trained end-to-end within a NeRF framework, a sharp all-in-focus internal representation of the 3D scene is learnt which, when rendered using standard volume rendering techniques, enables controllable depth-of-field blur given novel camera parameters.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. An image processing apparatus (1400) for forming an enhanced image (717, 818), the image processing apparatus (1400) being configured to: receive (1301) data for an input image, the data comprising respective colour and depth values for each pixel of multiple pixels of the input image; receive (1302) a trained point spread function model (711, 812) comprising a respective learned point spread function (712, 813) for each pixel of the multiple pixels of the input image; modify (1303) each learned point spread function (712, 813) in dependence on the respective depth value of the respective pixel to form a respective modified point spread function (715, 816) for each pixel of the multiple pixels of the input image; and apply (1304) the respective modified point spread function (715, 816) to the respective colour value for one or more pixels comprising the respective pixel to form the enhanced image (717, 818).
2. The image processing apparatus (1400) as claimed in claim 1, wherein the apparatus is configured to form the enhanced image so as to render depth-of-field blur in the enhanced image (717, 818).
3. The image processing apparatus (1400) as claimed in any preceding claim, wherein the apparatus is configured to modify each learned point spread function (712, 813) in dependence on a respective depth mask weight (714, 815) for the respective pixel, each depth mask weight (714, 815) being determined based on a pixel-wise depth difference from a central pixel in the multiple pixels of the input image.
4. The image processing apparatus (1400) as claimed in any preceding claim, wherein each learned point spread function (712, 813) comprises a matrix of weights (813) corresponding to the respective pixel and multiple neighboring pixels.
5. The image processing apparatus (1400) as claimed in any preceding claim, wherein each learned point spread function (712, 813) is further modifiable in dependence on one or more parameters (718, 819) of an image sensor and camera lens that captured the input image.
6. The image processing apparatus (1400) as claimed in claim 5, wherein the one or more parameters (718, 819) comprise one or more of focal length, focus distance and aperture value.
7. The image processing apparatus (1400) as claimed in any preceding claim, wherein the learned point spread function model (711, 812) is specific to a particular lens of an image sensor used to capture the input image and wherein the learned point spread functions (712, 813) vary spatially across the image sensor.
8. The image processing apparatus (1400) as claimed in any preceding claim, wherein the apparatus is further configured to convert spatial locations in the input image to another coordinate system and apply one or more known properties of an image sensor used to capture the input image to the converted spatial locations.
9. The image processing apparatus (1400) as claimed in any preceding claim, wherein the trained point spread function model (711, 812) is scene-agnostic.
10. The image processing apparatus (1400) as claimed in any preceding claim, wherein the received data is an output of an image rendering model (705, 805).
11. The image processing apparatus (1400) as claimed in claim 10, wherein the input image corresponds to a novel view output from the image rendering model (805).
12. The image processing apparatus (1400) as claimed in any preceding claim, wherein the trained point spread function model (812) is a multi-layer perceptron neural network.
13. The image processing apparatus (1400) as claimed in any preceding claim, wherein the trained point spread function model (711, 812) is trained end-to-end with the image rendering model (705, 805).
14. The image processing apparatus (1400) as claimed in any preceding claim, wherein the trained point spread function model (711, 812) is trained using paired sharp-blurry images with labelled lens parameters.
15. A method (1300) for forming an enhanced image (717, 818), the method comprising: receiving (1301) data for an input image, the data comprising respective colour and depth values for each pixel of multiple pixels of the input image; receiving (1302) a trained point spread function model (711, 812) comprising a respective learned point spread function (712, 813) for each pixel of the multiple pixels of the input image; modifying (1303) each learned point spread function (712, 813) in dependence on the respective depth value of the respective pixel to form a respective modified point spread function (715, 816) for each pixel of the multiple pixels of the input image; and applying (1304) the respective modified point spread function (715, 816) to the respective colour value for one or more pixels comprising the respective pixel to form the enhanced image (717, 818).

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party

B. Mildenhall et al., "NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images", Computer Vision and Pattern Recognition Conference, 2022
B. Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", European Conference on Computer Vision, 2020
M. Bauer et al., "Automatic Estimation of Modulation Transfer Functions", IEEE International Conference on Computational Photography (ICCP), 2018, pages 1-12, XP033352260, DOI: 10.1109/ICCPHOT.2018.8368467 *
Busam et al., "Sterefo: Efficient Image Refocusing with Stereo Vision", International Conference on Computer Vision Workshops, 2019
Li et al., "Deblur-NeRF: Neural Radiance Fields from Blurry Images", Computer Vision and Pattern Recognition Conference, 2022
M. Bauer et al., "Automatic Estimation of Modulation Transfer Functions", IEEE International Conference on Computational Photography, 2018, pages 1-12
T. Hach et al., "Cinematic Bokeh Rendering for Real Scenes", Proceedings of the 12th European Conference on Visual Media Production (CVMP '15), 2015, pages 1-10, XP055632848, ISBN: 978-1-4503-3560-7, DOI: 10.1145/2824840.2824842 *
Wang et al., "NeRFocus: Neural Radiance Field for 3D Synthetic Defocus", arXiv:2203.05189, 2022
Wu et al., "DoF-NeRF: Depth-of-Field Meets Neural Radiance Fields", ACM International Conference on Multimedia, 2022
