WO2024138236A2 - Improving sharpness of image of scene - Google Patents

Improving sharpness of image of scene

Info

Publication number
WO2024138236A2
WO2024138236A2 (PCT/US2024/026657)
Authority
WO
WIPO (PCT)
Prior art keywords
pixel
image
rays
blurry image
rendered
Prior art date
Application number
PCT/US2024/026657
Other languages
French (fr)
Inventor
Achleshwar LUTHRA
Xiyun Song
Shiva Souhith GANTHA
Zongfang LIN
Liang Peng
Hong Heather Yu
Original Assignee
Futurewei Technologies, Inc.
Application filed by Futurewei Technologies, Inc. filed Critical Futurewei Technologies, Inc.
Publication of WO2024138236A2 publication Critical patent/WO2024138236A2/en

Definitions

  • Embodiments described herein generally relate to improving sharpness of images of a scene captured using a camera, but are not limited thereto.
  • Smartphones have become the dominant type of camera for capturing images and video clips worldwide. In fact, it was reported that in 2022 most people (i.e., more than 90% of people) capture images and video clips using smartphones rather than traditional digital cameras, and that percentage is expected to continue to grow in the coming years.
  • One aspect of the present disclosure includes a computer implemented method for improving sharpness of a ground truth blurry image of a scene.
  • the method includes using a first neural network and a second neural network to respectively produce a first output and a second output, rendering a rendered blurry image based on the first and the second outputs, and comparing the rendered blurry image to the ground truth blurry image to thereby determine a difference between the rendered blurry image and the ground truth blurry image.
  • the method further includes training the first and the second neural networks by updating respective trainable parameters of each of the first and the second neural networks based on the difference between the rendered blurry image and the ground truth blurry image.
  • the method also includes iteratively repeating the using, the rendering, the comparing, and the training until a specified criterion is satisfied. Additionally, the method includes rendering, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, an image of the scene that has improved sharpness compared to the ground truth blurry image.
  • the rendered blurry image which is rendered based on the first and the second outputs of the first and the second neural networks, includes a plurality of pixels. Each of the plurality of pixels has corresponding color data and corresponds to a position in image space.
  • the rendering of the rendered blurry image based on the first and the second outputs of the first and the second neural networks, during the training of the first and the second neural networks, includes performing multi-ray-per-pixel tracing and weighted averaging for each of at least some of a plurality of pixels in the rendered blurry image.
  • the first neural network comprises a neural radiance field network and the first output comprises trainable parameters of radiance fields of the neural radiance field network.
  • the second neural network comprises a kernel estimation network and the second output comprises estimated blur kernels and associated weights for each of a plurality of kernel positions.
  • the first neural network is configured to iteratively learn to approach a ground truth sharp image, which corresponds to the ground truth blurry image with a blurring process removed.
  • the second neural network is configured to iteratively generate a plurality of trained blur kernels that simulate the blurring process.
  • the multi-ray-per-pixel tracing for a pixel, of the plurality of pixels for which the multi-ray-per-pixel tracing is performed, results in a plurality N of rays being traced for the pixel.
  • each of the plurality of trained blur kernels is generated for a respective one of the plurality of pixels for which the multi-ray-per-pixel tracing is performed.
  • each of the plurality of trained blur kernels corresponds to a K x K kernel window, where K is an odd integer that is at least 3, and N < K².
  • N < K², where K is an odd integer that is at least 3.
  • the performing multi-ray-per-pixel tracing and weighted averaging for a pixel of the plurality of pixels comprises: tracing a respective ray for the pixel and for each of a plurality of neighboring pixels to thereby trace a plurality N of rays for the pixel, wherein each ray that is traced, of the plurality N of rays for the pixel, originates from a same camera position and passes through one of the plurality of different respective positions in image space and has respective color data; determining a weight for each of the plurality N of rays for the pixel to thereby determine a respective plurality N of weights for the plurality N of rays for the pixel; and using the plurality N of weights for the plurality N of rays for the pixel to combine the color data for the plurality N of rays for the pixel to thereby produce combined color data for the pixel.
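  • For illustration only, below is a minimal NumPy sketch of the weighted averaging just described: the N ray colors obtained for a pixel are combined using N per-ray weights. The function name, the normalization of the weights, and the example values are assumptions of this sketch, not details taken from the disclosure.

```python
import numpy as np

def blend_pixel_color(ray_colors: np.ndarray, ray_weights: np.ndarray) -> np.ndarray:
    """Combine the colors of the N rays traced for one pixel into a single
    blurry pixel color using a weighted average.

    ray_colors:  (N, 3) color data (e.g., RGB) obtained for each traced ray.
    ray_weights: (N,)   weight determined for each ray; normalized here so the
                        weights sum to 1 (an assumption of this sketch).
    """
    w = ray_weights / ray_weights.sum()           # normalize the N weights
    return (w[:, None] * ray_colors).sum(axis=0)  # weighted average over the N rays

# Example: N = 5 rays traced through the pixel center and 4 neighboring centers.
colors = np.random.rand(5, 3)                      # stand-in for per-ray rendered colors
weights = np.array([0.4, 0.15, 0.15, 0.15, 0.15])  # stand-in for learned weights
blurry_rgb = blend_pixel_color(colors, weights)
```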
  • the color data for each said pixel comprises a respective plurality of color values (e.g., comprises respective red, green, and blue (RGB) color values), and each of the plurality N of rays traced for a said pixel has a respective plurality of color values (e.g., respective RGB color values); and the plurality N of weights determined for the plurality N of rays, which are traced for the pixel, are used to combine the plurality of color values (e.g., the RGB color values) corresponding to the plurality N of rays traced for the pixel.
  • RGB red, green, and blue
  • the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel are respective centers of the pixel and the neighboring pixels.
  • at least some of the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel are offset from respective centers of the pixel and the neighboring pixels.
  • the rendering of the image of the scene that has improved sharpness compared to the ground truth blurry image, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied is performed using single-ray-per-pixel tracing for each of a plurality of pixels in the image of the scene.
  • Another aspect of the present disclosure includes a system for improving sharpness of a ground truth blurry image of a scene.
  • the system includes a first neural network and a second neural network configured to respectively produce a first output and a second output, during each of a plurality of iterations of the first and the second neural networks being trained until a specified criterion is satisfied.
  • the system also includes a rendering engine configured to receive the first and second outputs and to render a rendered blurry image based on the first and the second outputs, during each of the plurality of iterations until the specified criterion is satisfied.
  • the first and second neural networks are configured to be trained based on a difference between the rendered blurry image and the ground truth blurry image during each of the plurality of iterations until the specified criterion is satisfied, by updating respective trainable parameters of each of the first and the second neural networks based on the difference between the rendered blurry image and the ground truth blurry image during each of the plurality of iterations until the specified criterion is satisfied.
  • the rendering engine is further configured to render, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, an image of the scene that has improved sharpness compared to the ground truth blurry image.
  • the rendered blurry image which is rendered based on the first and the second outputs of the first and the second neural networks during each of the plurality of iterations until the specified criterion is satisfied, includes a plurality of pixels. Each of the plurality of pixels has corresponding color data and corresponds to a position in image space.
  • the rendering engine is configured to render the rendered blurry image based on the first and the second outputs of the first and the second neural networks, during each of the plurality of iterations until the specified criterion is satisfied, by performing multi-ray-per-pixel tracing and weighted averaging for each of at least some of a plurality of pixels in the rendered blurry image.
  • the first neural network comprises a neural radiance field network and the first output comprises trainable parameters of radiance fields of the neural radiance field network.
  • the second neural network comprises a kernel estimation network and the second output comprises estimated blur kernels and associated weights for each of a plurality of kernel positions.
  • the first neural network is configured to iteratively learn to approach a ground truth sharp image, which corresponds to the ground truth blurry image with a blurring process removed.
  • the second neural network is configured to generate a plurality of trained blur kernels that simulate the blurring process.
  • the multi-ray-per-pixel tracing for a pixel, of the plurality of pixels for which the multi-ray-per-pixel tracing is performed, results in a plurality N of rays being traced for the pixel; each of the plurality of trained blur kernels is generated for a respective one of the plurality of pixels for which the multi-ray-per-pixel tracing is performed; and each of the plurality of trained blur kernels corresponds to a K x K kernel window, where K is an odd integer that is at least 3, and N < K².
  • N < K² and thus, each of the plurality of trained blur kernels comprises a sparse blur kernel.
  • the multi-ray-per-pixel tracing and weighted averaging for a pixel of the plurality of pixels includes: tracing a respective ray for the pixel and for each of a plurality of neighboring pixels to thereby trace a plurality N of rays for the pixel, wherein each ray that is traced, of the plurality N of rays for the pixel, originates from a same camera position and passes through one of a plurality of different respective positions in image space and has respective color data; determining a weight for each of the plurality N of rays for the pixel to thereby determine a respective plurality N of weights for the plurality N of rays for the pixel; and using the plurality N of weights for the plurality N of rays for the pixel to combine the color data for the plurality N of rays for the pixel to thereby produce combined color data for the pixel.
  • the color data for each said pixel comprises a respective plurality of color values (e.g., comprises respective red, green, and blue (RGB) color values), and each of the plurality N of rays traced for a said pixel has a respective plurality of color values (e.g., respective RGB color values); and the plurality N of weights determined for the plurality N of rays, which are traced for the pixel, are used to combine the plurality of color values (e.g., the RGB color values) corresponding to the plurality N of rays traced for the pixel.
  • RGB red, green, and blue
  • the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel are respective centers of the pixel and the neighboring pixels.
  • at least some of the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel are offset from respective centers of the pixel and the neighboring pixels.
  • the rendering engine is configured to render the image of the scene that has improved sharpness compared to the ground truth blurry image, which said image is rendered based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, using single-ray-per-pixel tracing for each of a plurality of pixels in the image of the scene.
  • a further aspect of the present disclosure includes one or more non-transitory computer-readable media storing computer instructions for improving sharpness of a ground truth blurry image of a scene, that when executed by one or more processors, cause the one or more processors to perform operations comprising: using a first neural network and a second neural network to respectively produce a first output and a second output; rendering a rendered blurry image based on the first and the second outputs; comparing the rendered blurry image to the ground truth blurry image to thereby determine a difference between the rendered blurry image and the ground truth blurry image; training the first and the second neural networks by updating respective trainable parameters of each of the first and the second neural networks based on the difference between the rendered blurry image and the ground truth blurry image; iteratively repeating the using, the rendering, the comparing, and the training until a specified criterion is satisfied; and rendering, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, an image of the scene that has improved sharpness compared to the ground truth blurry image.
  • FIG. 1 is a high level block diagram representing a system or pipeline according to an embodiment of the present technology.
  • FIG. 2 illustrates an example of a ground truth blurry image and how rays can be rendered and optimized for each pixel of the image and how those optimized rays can be determined according to a predicted blur kernel determined for each pixel, in accordance with an embodiment of the present technology.
  • FIG. 3 illustrates how a kernel estimation network can be trained to learn generalizable spatially varying and/or temporal varying blur kernels, in accordance with an embodiment of the present technology.
  • FIG. 4A illustrates a sparse 3 x 3 blur kernel for a pixel for which rays are traced from a same camera position through a center of the pixel and through centers of some of its neighboring pixels.
  • FIG. 4B illustrates a dense 3 x 3 blur kernel for a pixel for which rays are traced from a same camera position through a center of the pixel and through centers of all of its neighboring pixels.
  • FIG. 4C illustrates a sparse 3 x 3 blur kernel for a pixel for which rays are traced from a same camera position through a position offset from the center of the pixel and through positions offset from centers of some of its neighboring pixels.
  • FIG. 4D illustrates a sparse 5 x 5 blur kernel for a pixel for which rays are traced from a same camera position through a center of the pixel and through centers of some of its neighboring pixels
  • FIGS. 5A and 5B illustrate how embodiments of the present technology can be used to improve the sharpness of images included in volumetric video delivery at a remote side where original images are captured or alternatively at an edge side where the images are rendered for viewing.
  • FIG. 6 illustrates an example of a computing system with which embodiments disclosed herein may be implemented.
  • FIG. 7 illustrates a schematic diagram of a general-purpose network component or computer system with which embodiments of the present technology may be implemented.
  • Certain embodiments of the present technology can be used to improve sharpness of a ground truth blurry image of a scene.
  • a ground truth blurry image of a scene can be, for example, a frame of a video clip captured using a camera of a smartphone or other type of video recorder, where the blurriness in the ground truth blurry image of the scene is due to camera motion, due to object motion, and/or due to images of the video clip being out of the focal plane of the camera.
  • FIG. 1 is a high level block diagram representing a system or pipeline according to an embodiment of the present technology.
  • the system or pipeline shown therein includes a first neural network 110, a second neural network 120, and a rendering engine 130.
  • the first neural network 110, which can be a neural radiance field network, but is not limited thereto, is responsible for generating a sharp representation of an original blurry image 150.
  • the original blurry image 150, which can also be referred to herein as the ground truth blurry image 150, may have been captured using a camera of a mobile phone or some other type of camera. It is also possible that a ground truth blurry image is generated based on low quality meshes for a three-dimensional (3D) scene, as can be appreciated from FIG. 5B, discussed below.
  • the blurring in the ground truth blurry image may have been caused, for example, by one or more moving objects in a dynamic scene, camera shake (aka camera movement), and/or one or more objects in the scene being out of a focal plane of the camera.
  • the second neural network 120 which can be a multilayer perceptron (MLP), but is not limited thereto, is responsible for generating (e.g., estimating) blur kernels that simulate a blurring process.
  • MLP multilayer perceptron
  • the rendered blurry image 140 is compared to the ground truth blurry image 150 at a comparison stage 160, to thereby produce a difference 170 between the two images, which difference 170 can also be referred to as an error.
  • This difference 170 is fed back to (i.e., provided as feedback to) the first and the second neural networks 110, 120.
  • each of the first and the second neural networks 110, 120 update their respective trainable parameters in a manner that causes the rendered blurry image 140 to become more similar to the ground truth blurry image 150 during the next iteration of the above described process, which has the effect of reducing the difference 170 (aka error) between the two images during that next iteration.
  • the above described iterative process is repeated for a predetermined number of iterations (e.g., 10,000 iterations), until the difference 170 is below a predetermined threshold difference, until a change in the difference 170 from one iteration to the next iteration is below a predetermined threshold change, until any one of the above criteria is met, until at least two of the above criteria are met, or until some other specified criterion is met. More generally, the above described iterative process is repeated until a specified criterion is met. When the above described iterative process stops being repeated because the specified criterion (examples of which were just described above) has been met, it can be said that the training (and associated learning) is completed (aka done).
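  • As a rough illustration of the iterative procedure described above, the toy PyTorch loop below jointly updates two small stand-in networks until either an iteration budget, a loss threshold, or a loss-change threshold is reached. The two tiny MLPs, the crude blur model, and all thresholds are assumptions of this sketch and are far simpler than the NeRF and kernel estimation networks of the actual embodiments.

```python
import torch

torch.manual_seed(0)
H, W = 8, 8
# Stand-ins for the first network 110 (sharp representation) and second network 120 (blur).
first_net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
second_net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

coords = torch.stack(torch.meshgrid(torch.linspace(0, 1, H),
                                    torch.linspace(0, 1, W), indexing="ij"), dim=-1).reshape(-1, 2)
ground_truth_blurry = torch.rand(H * W, 3)        # placeholder for the captured blurry frame 150

opt = torch.optim.Adam(list(first_net.parameters()) + list(second_net.parameters()), lr=1e-3)
prev_loss = float("inf")
for it in range(10_000):                           # example criterion: fixed iteration budget
    sharp = first_net(coords)                      # "first output": per-pixel sharp color
    blur_w = torch.sigmoid(second_net(coords))     # "second output": per-pixel blur amount
    rendered_blurry = blur_w * sharp + (1 - blur_w) * sharp.mean(dim=0)  # crude blur stand-in
    loss = torch.nn.functional.mse_loss(rendered_blurry, ground_truth_blurry)  # difference 170
    if loss.item() < 1e-4 or abs(prev_loss - loss.item()) < 1e-8:
        break                                      # example criteria: threshold / small change
    opt.zero_grad()
    loss.backward()
    opt.step()                                     # update trainable parameters of both networks
    prev_loss = loss.item()

sharp_image = first_net(coords).detach()           # after training, render from the first network alone
```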
  • a predetermined number of iterations, e.g., 10,000 iterations
  • the first and the second neural networks 110, 120 are jointly trained.
  • the first neural network 110 e.g., a neural radiance field network
  • the second neural network 120 gradually learns to approach the blurring process.
  • the NeRF network gradually learns a sharp time-conditioned NeRF representation
  • the second neural network gradually learns spatially-varying and/or temporally-varying blur kernels to simulate the blurring process.
  • the trainable parameters in the first neural network provide a representation of the ground truth sharp image (which is the ground truth blurry image with the blurring process removed), which can be a three-dimensional (3D) scene, but is not limited thereto. Additionally, when the training (and associated learning) is done, the trainable parameters in the first neural network can be used to render an image of the scene included in the ground truth blurry image that has improved sharpness (compared to the ground truth blurry image), which rendered image can be (but need not be) a novel view of the scene.
  • the ground truth sharp image which is the ground truth blurry image with the blurring process removed
  • 3D three-dimensional
  • the rendering of the image of the scene that has improved sharpness (compared to the ground truth blurry image), based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, is performed using single-ray-per-pixel tracing (as opposed to multi-ray-per-pixel tracing and weighted averaging) for each of a plurality of pixels in the image of the scene.
  • NeRF enables rendering of photorealistic views from novel viewpoints.
  • NeRF has since gained significant attention for its potential applications in computer graphics and content creation.
  • the NeRF algorithm represents a scene as a radiance field parametrized by a deep neural network (DNN).
  • DNN deep neural network
  • the network predicts a volume density and view-dependent emitted radiance given the spatial location (x, y, z) and viewing direction in Euler angles (θ, φ) of the camera.
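  • A minimal sketch of such a network is shown below: a coordinate MLP that maps an encoded 3D position and viewing direction to a volume density and a view-dependent color. The positional-encoding frequencies and layer widths are assumptions of this sketch, not values specified by the NeRF algorithm as described here.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, n_freqs: int) -> torch.Tensor:
    """NeRF-style sinusoidal encoding of each coordinate of x."""
    out = [x]
    for k in range(n_freqs):
        out += [torch.sin((2.0 ** k) * math.pi * x), torch.cos((2.0 ** k) * math.pi * x)]
    return torch.cat(out, dim=-1)

class TinyNeRF(nn.Module):
    """Maps (x, y, z) and a viewing direction to (volume density, RGB radiance)."""
    def __init__(self, pos_freqs: int = 10, dir_freqs: int = 4, width: int = 128):
        super().__init__()
        pos_dim = 3 * (1 + 2 * pos_freqs)
        dir_dim = 3 * (1 + 2 * dir_freqs)
        self.pos_freqs, self.dir_freqs = pos_freqs, dir_freqs
        self.trunk = nn.Sequential(nn.Linear(pos_dim, width), nn.ReLU(),
                                   nn.Linear(width, width), nn.ReLU())
        self.sigma_head = nn.Linear(width, 1)  # volume density
        self.rgb_head = nn.Sequential(nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
                                      nn.Linear(width // 2, 3), nn.Sigmoid())  # view-dependent color

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor):
        h = self.trunk(positional_encoding(xyz, self.pos_freqs))
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.rgb_head(torch.cat([h, positional_encoding(view_dir, self.dir_freqs)], dim=-1))
        return sigma, rgb

# Query the radiance field at a batch of sample points along camera rays.
model = TinyNeRF()
sigma, rgb = model(torch.rand(4, 3), torch.nn.functional.normalize(torch.rand(4, 3), dim=-1))
```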
  • a NeRF model is retrained for each unique scene.
  • a first step is to collect images of the scene from different angles and their respective camera poses, wherein such images are standard 2D images and do not require a specialized camera or software. Accordingly, any camera that is able to generate datasets can be used, provided the settings and capture method meet the requirements for Structure from Motion (SfM).
  • SfM Structure from Motion
  • SLAM Simultaneous Localization and Mapping
  • GPS Global Positioning System
  • inertial estimation
  • For each sparse viewpoint (image and camera pose) provided, camera rays are marched through the scene, generating a set of 3D points with a given radiance direction (into the camera). For these points, volume density and emitted radiance are predicted using an MLP. An image is then generated through classical volume rendering. Because this process is fully differentiable, the error between the predicted image and the original image can be minimized with gradient descent over multiple viewpoints, encouraging the MLP to develop a coherent model of the scene.
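  • A minimal sketch of the classical volume rendering step mentioned above is shown below; it composites the per-sample density and radiance along each marched ray into one pixel color. The discretization follows the standard NeRF quadrature, which is an assumption about the exact formulation used.

```python
import torch

def volume_render(rgb: torch.Tensor, sigma: torch.Tensor, z_vals: torch.Tensor) -> torch.Tensor:
    """Composite per-sample (radiance, density) along each ray into a pixel color.

    rgb:    (R, S, 3) predicted radiance at S samples along each of R rays.
    sigma:  (R, S)    predicted volume density at the same samples.
    z_vals: (R, S)    sample depths along each ray, ordered near to far.
    """
    deltas = z_vals[:, 1:] - z_vals[:, :-1]                        # spacing between samples
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                       # per-segment opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]  # transmittance
    weights = trans * alpha
    return (weights[..., None] * rgb).sum(dim=-2)                  # (R, 3) composited colors

# Because the compositing is differentiable, the error between the rendered image and
# the original image can be minimized with gradient descent, as described above.
pixel_colors = volume_render(torch.rand(4, 16, 3), torch.rand(4, 16), torch.linspace(2.0, 6.0, 16).expand(4, 16))
```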
  • While NeRF variants have achieved remarkable success in space-time view synthesis, these methods require carefully captured videos or images with well-calibrated camera parameters, hence limiting them to controlled environments. In the past, only a few methods have sufficiently addressed the problem of novel view synthesis from images subject to motion or out-of-focus blur. To the best of the inventors’ knowledge, the paper by Ma et al.
  • embodiments of the present technology described herein can construct a sharp NeRF while also handling dynamic scenes.
  • Certain embodiments of the present technology take the best of both worlds, by providing a method that learns a continuous volumetric function mapping a 3D location, direction and time to reflectance, density and 3D scene motion.
  • a sharp Neural Scene Flow Fields (NSFF) model is trained by maintaining temporal consistency using scene flow fields warping loss and by simulating the blurring process using a Kernel Estimation Network.
  • the algorithm is able to generate sharp novel space-time renderings from blurry input frames.
  • Such embodiments tackle a novel problem that has not yet been covered in previous research.
  • NeRFs can disentangle deformation from geometry and appearance by modelling a separate canonical model to map from deformed space to canonical space, but cannot capture larger ranges of motion.
  • Certain embodiments of the present technology generate novel view and time synthesis of dynamic scenes with recovered sharpness.
  • Certain embodiments of the present technology utilize a special deep learning method, wherein the method learns a continuous volumetric function mapping a three-dimensional (3D) location, direction and time to reflectance, density, and 3D scene motion. More specifically, this deep learning method trains a sharp model by maintaining temporal consistency using scene flow fields warping loss and by learning spatially varying blur kernels to simulate the blurring process using a Kernel Estimation Network. Then the model optimized from the training is a representation of time-conditioned neural radiance field and can be used to generate sharp novel space-time renderings from blurry input frames.
  • trainable sparse spatially-varying blur kernels are used to render sharp novel space-time view synthesis of blurred images (e.g., of a video) capturing dynamic scenes.
  • This learned blur kernel can be a general K x K sized kernel, where K is an integer that is greater than or equal to 3, and the K x K sized kernel can be used to learn a sharp representation of a time conditioned neural radiance field.
  • the images might be subject to blur due to camera motion or object motion or out-of-focus blur, as was noted above.
  • a dense blur kernel can be used to get improvement in results and produce sharper novel renderings.
  • Certain embodiments of the present technology facilitate separation of dynamic and static objects and help to sharpen those regions in the scene. If the dynamic region is subject to motion blur or if the static background region is subject to out-of-focus blur, embodiments of the present technology can be used to sharpen both the regions simultaneously without specifically modelling different types of blurs.
  • Kernel Estimation Network that is conditioned on learnable (aka trainable) time embedding along with a view embedding.
  • the Kernel Estimation Network is implemented as a fully connected Multi-Layer Perceptron (MLP) neural network, which takes time as an input, but instead of directly passing in time, a trainable embedded vector corresponding to time is passed to the MLP neural network.
  • MLP Multi-Layer Perceptron
  • the above mentioned K x K sized kernel can also be referred to herein as a two-dimensional (2D) trainable kernel.
  • the 2D trainable kernel is replaced with a 3D trainable kernel, where the associated weights are 3-dimensional corresponding to RGB channels.
  • the 2D trainable kernel is replaced with a 3D trainable kernel, where 2 dimensions correspond to position, and 1 dimension corresponds to time.
  • Other variations are also possible and within the scope of the embodiments described herein.
  • Neural Scene Flow Fields extend the idea of Neural Radiance Fields (NeRFs) by modelling 3D motion as dense scene flow fields, and more specifically, by learning a combination of static and dynamic NeRFs.
  • a static NeRF is a time-independent Multi-Layer Perceptron (MLP), denoted by F_θ, that takes as input a position (x) and a viewing direction (d), and outputs RGB color (c), volumetric density (σ), and unsupervised 3D blending weights (v) that determine how to blend RGBσ (i.e., RGB and volumetric density) from the static and dynamic representations, in accordance with the following equation:
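  • The equation itself is not legible in this text; based on the symbol definitions in the bullets that follow, it presumably has the form shown below (the exact notation and the label EQ1 are assumptions):

$$(c,\ \sigma,\ v) = F_{\theta}(x,\ d) \qquad \text{(EQ1)}$$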
  • F_θ is a static NeRF
  • x is position
  • d is viewing direction
  • c is RGB color
  • σ is volumetric density
  • v is unsupervised 3D blending weight.
  • a dynamic NeRF denoted by F_θ^dy
  • F_θ^dy explicitly models a view-dependent as well as time-dependent MLP that takes an additional input, i.e., time (t), along with position (x) and viewing direction (d).
  • time (t) along with position (x) and viewing direction (d).
  • F_θ^dy is a dynamic NeRF
  • x is position
  • d viewing direction
  • t time
  • c_t is RGB color at the time t
  • σ_t is volumetric density at the time t
  • disocclusion weights
  • a final color value for a pixel is estimated using the blending weights as per a rendering equation of the following form (reconstructed here from the surrounding symbol definitions): $$C_t^{b}(r_t) = \int_{z_n}^{z_f} T_t(z)\, \sigma_t(z)\, c_t(z)\, dz \qquad \text{(EQ3)}$$ where r_t is a ray at the time t,
  • C_t^{b}(r_t) is a final color value for a pixel at the time t,
  • T_t denotes transmittance at the time t, and
  • z_n and z_f denote near depth and far depth along the ray r_t.
  • NSFF is used to render a single-ray-per-pixel in an image frame that is being deblurred.
  • a plurality (N) of rays is rendered per pixel of each image frame during training, where N is, for example, 5, but is not limited thereto, as will be described in more detail below.
  • b(x, y) = c(x, y) ⊗ h (EQ6), where h is a blur kernel, c(x, y) is a sharp image that the neural network (e.g., MLP) is to learn, b(x, y) is the blurred image, and
  • ⊗ denotes the convolution operator.
  • the blur kernel h is a K x K window (which can also be referred to as a K x K matrix).
  • the number of rays N rendered per pixel would equal K².
  • K = 3
  • N < K²
  • the kernel is a sparse kernel.
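  • For illustration, below is a minimal sketch of the blur model of EQ6, contrasting a dense K x K kernel (all K² taps used, as in FIG. 4B) with a sparse kernel (only N < K² non-zero taps, as in FIG. 4A). The use of scipy.signal.convolve2d and the example tap values are assumptions of the sketch; in the embodiments the kernel weights are learned, not fixed.

```python
import numpy as np
from scipy.signal import convolve2d

K = 3
# Dense K x K blur kernel: all K^2 = 9 taps are non-zero (cf. FIG. 4B).
dense_h = np.full((K, K), 1.0 / (K * K))

# Sparse K x K blur kernel: only N = 5 taps (pixel center plus 4 neighbors) are
# non-zero (cf. FIG. 4A); these tap values are illustrative, not trained weights.
sparse_h = np.array([[0.00, 0.15, 0.00],
                     [0.15, 0.40, 0.15],
                     [0.00, 0.15, 0.00]])

sharp = np.random.rand(64, 64)                     # stand-in for the sharp image c(x, y)
blurry = convolve2d(sharp, sparse_h, mode="same")  # b(x, y) = c(x, y) ⊗ h, per EQ6
```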
  • a Kernel Estimation Network, or more generally the second neural network 120 introduced above with reference to FIG. 1, is used to optimize N rays for each of the pixels (or for each of at least some of the pixels) in an image frame.
  • the final color value for a pixel, i.e., C_t^{b}(r_t), is estimated for all elements e from 0 to N, i.e., ∀ e ∈ [0, N), using equation EQ3 introduced above.
  • B_t(r_t) is the rendered blurry image, and B_gt(r_t) is the ground truth blurry image.
  • the dynamic NeRF F_θ^dy
  • this is used to determine forward and backward flows for each sampled point on all the optimized rays.
  • Using these flows, each point in 3D space can be offset to a neighboring frame t' (where t' ∈ {t−1, t+1}) and volume rendered with associated color (c_t') and volumetric density (σ_t'). This provides a rendered image frame of time t' warped to time t, denoted as:
  • N rays are rendered and optimized for each pixel of an image and those optimized rays are determined according to the predicted blur kernel h determined for each pixel.
  • an example ground truth blurry image 210 which is an example of the ground truth blurry image 150 introduced in FIG. 1
  • Shown just to the right of the example ground truth blurry image 210 (denoted B_gt) is a zoomed-in image patch 220 (of the example ground truth blurry image 210), as well as a representation of an optimized blur kernel 240 for a pixel 230 of the image patch 220.
  • a weight w_i is determined for each of the plurality N of rays 242 for the pixel 230 to thereby determine a respective plurality N of weights for the plurality N of rays 242 for the pixel 230.
  • the plurality N of weights that are determined for the plurality N of rays for the pixel are used to combine the color data for the plurality N of rays for the pixel (e.g., 230) to thereby produce combined color data for the pixel 230.
  • the color data for a pixel (e.g., 230) comprises respective red, green, and blue (RGB) color values, and each of the plurality N of rays traced for the pixel has respective RGB color values.
  • the plurality N of weights determined for the plurality N of rays, which are traced for the pixel, are used to combine the RGB color values corresponding to the plurality N of rays traced for the pixel. Referring to the right side of FIG. 2:
  • c_0 represents the color data for the pixel 230
  • c_1, c_2, c_3, and c_4 represent the color data for the neighboring pixels 232
  • w_0, w_1, w_2, w_3, and w_4 represent the weights that are used to combine the RGB color values corresponding to the plurality N of rays traced for the pixel. Convolution of the color data and the weights, for each of the pixels, results in a rendered blurry pixel of the rendered blurry image.
  • the color data for a pixel (e.g., 230) can alternatively comprise other types of color values, besides RGB color values, if another color space besides RGB color space is used.
  • the YUV color space may alternatively be used, in which case the color data values for a pixel can comprise a respective Y' component (luma) value, a respective chroma component U value, and a respective chroma component V value.
  • the HSV or HSL color space (also known as the HSB color space) may alternatively be used
  • the color data values for a pixel can comprise a hue data value, a saturation data value, and a brightness data value.
  • the CMYK color space may be used, in which case the color data values for a pixel can include respective cyan (C), magenta (M), yellow (Y), and black (K for “key”) color data values.
  • Other color spaces that can be used include, but are not limited to, sRGB and CIE L*A*B* color spaces.
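  • As a concrete illustration of one such alternative representation, the sketch below converts RGB color data to Y'UV components using BT.601-style coefficients; the choice of BT.601 is an example convention, not one prescribed by the embodiments.

```python
import numpy as np

def rgb_to_yuv(rgb: np.ndarray) -> np.ndarray:
    """Convert (..., 3) RGB values in [0, 1] to Y' (luma) plus U and V (chroma),
    using BT.601-style coefficients as one example convention."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return np.stack([y, u, v], axis=-1)

pixel_rgb = np.array([0.8, 0.4, 0.2])  # example color data for one pixel
pixel_yuv = rgb_to_yuv(pixel_rgb)
```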
  • an MLP 260 is shown as being used to determine the aforementioned color data and the weights.
  • the MLP 260 in FIG. 2 is an example of the second neural network 120 introduced above with reference to FIG. 1.
  • other types of neural networks can be used instead of an MLP to implement the second neural network 120, while still being within the scope of the embodiments described herein.
  • the second neural network 120 is implemented as an MLP as a design choice.
  • the second neural network can also be referred to herein as the Kernel Estimation Network.
  • Such a blur kernel is obtained using the Kernel Estimation Network, or more generally the second neural network 120, as visually illustrated in FIG. 3.
  • ground truth blurry image 310, which is an example of the ground truth blurry image 150 introduced in FIG. 1, and is the same as the ground truth blurry image 210 discussed above with reference to FIG. 2.
  • a respective blur kernel is determined using a Kernel Estimation Network 320, which is the neural network 120 introduced above in the discussion of FIG. 1.
  • the output 330 of the Kernel Estimation Network 320 in FIG. 3 is shown as including positions p_i and associated weights w_i for each of 5 rays traced (and more generally, N rays traced) for a pixel 312 corresponding to a position in image space.
  • the Kernel Estimation Network 320 is denoted as G_φ
  • takes a query pixel p, a canonical kernel h, and a view embedding l (labeled 312, 314, and 316, respectively, in FIG. 3) as inputs, and outputs
  • Δp_i, the offset for each p_i, which are pre-defined positions on the canonical kernel h
  • w_i is their associated weight.
  • the network 320 learns an optimized blur kernel for each pixel, as blurring is a spatially-varying process that also usually varies with the viewing direction, which justifies the use of a view embedding.
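  • A minimal sketch of such a kernel estimation MLP is shown below: for a query pixel it predicts, from the canonical kernel positions and an embedding, a 2D offset and a weight for each of the N rays. The layer widths, embedding size, and softmax normalization of the weights are assumptions of this sketch rather than details of the disclosed network.

```python
import torch
import torch.nn as nn

class KernelEstimationMLP(nn.Module):
    """For one query pixel, predicts per-ray offsets and weights for the N sample
    positions of a canonical blur kernel (in the spirit of the network G above)."""
    def __init__(self, n_rays: int = 5, embed_dim: int = 32, width: int = 64):
        super().__init__()
        # Inputs: query pixel (2), the N canonical kernel positions (2N values),
        # and a learnable view (and optionally time) embedding (embed_dim values).
        in_dim = 2 + 2 * n_rays + embed_dim
        self.n_rays = n_rays
        self.mlp = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                                 nn.Linear(width, width), nn.ReLU(),
                                 nn.Linear(width, 3 * n_rays))  # (dx, dy, weight) per ray

    def forward(self, pixel_xy, canonical_positions, embedding):
        x = torch.cat([pixel_xy, canonical_positions.flatten(-2), embedding], dim=-1)
        out = self.mlp(x).view(*pixel_xy.shape[:-1], self.n_rays, 3)
        offsets = out[..., :2]                        # offset for each canonical position p_i
        weights = torch.softmax(out[..., 2], dim=-1)  # weight w_i for each ray, summing to 1
        return canonical_positions + offsets, weights

net = KernelEstimationMLP()
pixel = torch.tensor([[120.0, 45.0]])                 # example query pixel position
canonical = torch.tensor([[[0., 0.], [0., 1.], [0., -1.], [1., 0.], [-1., 0.]]])  # center + 4 neighbors
positions, weights = net(pixel, canonical, torch.zeros(1, 32))
```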
  • K = 3.
  • the value of K can be increased to deal with different types of blurring effects.
  • N may, for example, equal 10, but is not limited thereto.
  • the positions through which the N rays are traced are instead trainable locations that are offset from the centers of the square regions of a sparse blur kernel, or a dense blur kernel, depending upon the specific implementation. Examples of these various types of blur kernels are illustrated in FIGS. 4A through 4D, discussed below.
  • FIG. 4A illustrates a sparse 3 x 3 blur kernel for a pixel (shown in the center) for which rays are traced from a same camera position through a center of the pixel and through centers of some of its neighboring pixels.
  • the positions (aka locations) that are offset from the centers of the various pixels can be determined during training of the neural networks 110 and 120, and thus, the positions can be referred to as trainable positions.
  • rays are traced from a same camera position through a center of the pixel and through centers of some of its neighboring pixels.
  • the rays can alternatively be traced through a position offset from the center of the pixel and through positions offset from centers of neighboring pixels.
  • the first neural network 110 e.g., a neural radiance field network
  • the ground truth sharp image (which is the ground truth blurry image 150 with the blurring process removed)
  • the second neural network 120, which can be referred to as a Kernel Estimation Network
  • the NeRF network gradually learns a sharp time-conditioned NeRF representation
  • the second neural network gradually learns spatially-varying blur kernels to simulate the blurring process.
  • Once the first and second neural networks 110 and 120 are done being trained, during inference, sharp novel views of the scene that was represented in the original blurry image 150 (which, as noted above, can also be referred to herein as the ground truth blurry image 150) can be generated.
  • the second neural network 120, aka the Kernel Estimation Network
  • the trained sharp NeRFs of the first neural network 110 can be used to generate novel space-time views, without any use of the second neural network 120.
  • a splatting-based plane-sweep volume rendering approach is used to render novel views at fixed times, novel times at fixed views, and space-time interpolation.
  • consistency of a scene at a time t with the adjacent times t' is ensured after accounting for motion due to 3D scene flow.
  • the scene is warped from time t' to t using the 3D scene flow estimation output from the Dynamic NeRF to ensure that any motion that occurred in that time period is undone.
  • the warped point locations x_t' on the ray r_t' are used to query the associated color (c_t') and density at the time t'.
  • the warped image is rendered in accordance with equation EQ8.
  • the blur kernel for the ray r_t corresponding to time t is then applied to get the blurry frame, which is also referred to herein as the rendered blurry image.
  • the Dynamic NeRF (F_θ^dy) also outputs disocclusion weights in [0, 1]. The weights decide the contribution of the temporal photometric consistency loss at each location in the scene to the total loss.
  • volume rendering of the weights along a ray r_t is first performed using the density values from time t'. This accumulated weight is then used for each 2D pixel as the weightage of the temporal photometric consistency loss, as shown in the following equations:
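  • The equations referenced above are not legible in this text; the sketch below illustrates the described idea under standard assumptions: the per-sample disocclusion weights are accumulated along each ray with the volume-rendering weights, and the accumulated value scales the per-pixel photometric error between the warped and non-warped renderings. The function and argument names are illustrative.

```python
import torch

def weighted_temporal_photometric_loss(rendered_warped, rendered_t,
                                       disocclusion_weights, compositing_weights):
    """Weight the photometric error between the frame rendered at time t and the frame
    warped from a neighboring time t' by an accumulated per-pixel disocclusion weight.

    rendered_warped, rendered_t: (R, 3) colors for R rays (one ray per pixel here).
    disocclusion_weights:        (R, S) per-sample disocclusion weights in [0, 1].
    compositing_weights:         (R, S) volume-rendering weights computed from the
                                 densities at time t' (as in the compositing sketch above).
    """
    per_pixel_weight = (compositing_weights * disocclusion_weights).sum(dim=-1)  # accumulate along each ray
    per_pixel_error = ((rendered_warped - rendered_t) ** 2).sum(dim=-1)          # squared color error
    return (per_pixel_weight * per_pixel_error).mean()

# Example call with random stand-in tensors (4 rays, 16 samples per ray).
loss = weighted_temporal_photometric_loss(torch.rand(4, 3), torch.rand(4, 3),
                                          torch.rand(4, 16), torch.rand(4, 16))
```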
  • a model e.g., a NeRF model
  • the model e.g., NeRF model
  • the optimized kernels might learn to map well to the blurry ground truth image (e.g., 150, 210), but during inference, there may be some unexpected distortions because renderings are produced using the model (e.g., NeRF model) generated by the first neural network 110 (e.g., a neural radiance field network) without the second neural network 120 (aka, the Kernel Estimation Network).
  • the sparse kernel can be initialized such that the optimized rays are close to the input ray and the kernel weights corresponding to the optimized rays are similar to a Gaussian kernel representation. This will ensure that when the training is started, all the kernel points are near the pixel centers. Further, one of the optimized rays is forced to be close to the input ray as shown in the equation below, where q_0 corresponds to a fixed index in the output from the Kernel Estimation Network.
  • the overall loss function can be represented using the following equation, where α_align can be kept at 0.1, but is not limited thereto.
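  • Since the two equations referenced above are not legible in this text, the sketch below illustrates the described combination: a reconstruction loss between the rendered and ground truth blurry images plus an alignment term, scaled by a weight of about 0.1, that keeps the designated optimized ray (index q_0, taken to be index 0 here by assumption) close to the input ray. The exact form of each term is an assumption of the sketch.

```python
import torch

def total_loss(rendered_blurry, ground_truth_blurry,
               optimized_ray_positions, input_ray_positions, alpha_align=0.1):
    """Reconstruction loss plus an alignment penalty on one designated optimized ray.

    rendered_blurry, ground_truth_blurry: (P, 3) per-pixel colors.
    optimized_ray_positions:              (P, N, 2) image-space positions of the N optimized rays.
    input_ray_positions:                  (P, 2) image-space position of the original input ray.
    """
    recon = torch.nn.functional.mse_loss(rendered_blurry, ground_truth_blurry)
    align = ((optimized_ray_positions[:, 0, :] - input_ray_positions) ** 2).sum(dim=-1).mean()
    return recon + alpha_align * align

# Example call with random stand-in tensors (6 pixels, N = 5 rays per pixel).
loss = total_loss(torch.rand(6, 3), torch.rand(6, 3), torch.rand(6, 5, 2), torch.rand(6, 2))
```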
  • FIGS. 5A and 5B illustrate how embodiments of the present technology can be used to improve the sharpness of images included in volumetric video delivery at a remote side where original images are captured or at an edge side where 3D views are rendered for viewing.
  • the block labeled 512 which performs image reconstruction with increased sharpness, is used to implement an embodiment of the present technology described above with reference to FIGS. 1-3.
  • low quality blurry images 502 are captured using a camera.
  • the sharpness of the images is increased at block 512 using an embodiment of the present technology.
  • the images with increased sharpness are used to produce high quality meshes for a 3D scene 522.
  • the high quality meshes for the 3D scene 522 are transferred (e.g., transmitted) from the remote side to an edge side over one or more networks, which can include wired and/or wireless networks, where high quality rendered sharp views 542 are rendered based on the high quality meshes for the 3D scene 522.
  • the rendering at the edge side can be performed using an augmented reality (AR) display, a virtual reality (VR) display, or a 3D display, but is not limited thereto.
  • FIG. 5B instead shows image reconstruction being performed at block 510 without increased sharpness, based upon which low quality meshes for a 3D scene 520 are produced.
  • the low quality meshes for the 3D scene 520 are transferred (e.g., transmitted) from the remote side to the edge side over one or more networks, which can include wired and/or wireless networks, as is at least part of the original low quality blurry images 502’. Based thereon, on the edge side the sharpness of the images is increased at block 512 using an embodiment of the present technology, and used to produce high quality rendered sharp views 542.
  • FIGS. 6 and 7 illustrate example hardware devices suitable for implementing such embodiments.
  • FIG. 6 shows an example of a computing system 600 with which embodiments disclosed herein may be implemented.
  • the computing system 600 can be used to implement the neural networks 110 and 120, and the rendering engine 130, but is not limited thereto.
  • the computing system 600 includes at least one processor 602, which could be a single central processing unit (CPU) or an arrangement of multiple processor cores of a multi-core architecture.
  • the processor 602 includes a pipeline 604, an instruction cache 606, and a data cache 608 (and other circuitry, not shown).
  • the processor 602 is connected to a processor bus 610, which enables communication with an external memory system 612 and an input/output (I/O) bridge 614.
  • I/O input/output
  • the I/O bridge 614 enables communication over an I/O bus 616, with various different I/O devices 618A-618D (e.g., disk controller, network interface, display adapter, and/or user input devices such as a keyboard or mouse).
  • I/O devices 618A-618D e.g., disk controller, network interface, display adapter, and/or user input devices such as a keyboard or mouse.
  • device 600 may include both a processor 602 and a dedicated graphics processing unit (GPU, not shown).
  • GPU graphics processing unit
  • a GPU is a type of processing unit that enables very efficient parallel processing of data.
  • While GPUs may be used in a video card or the like for computer graphics, GPUs have found much broader applications.
  • the external memory system 612 is part of a hierarchical memory system that includes multi-level caches, including the first level (L1 ) instruction cache 606 and data cache 608, and any number of higher level (L2, L3, . . . ) caches within the external memory system 612.
  • Other circuitry (not shown) in the processor 602 supporting the caches 606 and 608 includes a translation lookaside buffer (TLB) and various other circuitry for handling a miss in the TLB or the caches 606 and 608.
  • TLB translation lookaside buffer
  • the TLB is used to translate an address of an instruction being fetched or data being referenced from a virtual address to a physical address, and to determine whether a copy of that address is in the instruction cache 606 or data cache 608, respectively.
  • the external memory system 612 also includes a main memory interface 620, which is connected to any number of memory modules (not shown) serving as main memory (e.g., Dynamic Random Access Memory modules).
  • FIG. 7 illustrates a schematic diagram of a general-purpose network component or computer system 700 with which embodiments of the present technology may be implemented.
  • general-purpose network component or computer system 700 can be used to implement the neural networks 110 and 120, and the rendering engine 130, but is not limited thereto.
  • the general-purpose network component or computer system 700 includes a processor 702 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 704, and memory, such as ROM 706 and RAM 708, input/output (I/O) devices 710, and a network 712, such as the Internet or any other well-known type of network, that may include network connectivity devices, such as a network interface.
  • processor 702 which may be referred to as a central processor unit or CPU
  • memory devices such as ROM 706 and RAM 708, input/output (I/O) devices 710
  • a network 712, such as the Internet or any other well-known type of network, that may include network connectivity devices, such as a network interface
  • the processor 702 is not so limited and may comprise multiple processors.
  • the processor 702 may be implemented as one or more CPU chips, cores (e.g., a multi-core processor), FPGAs, ASICs, and/or DSPs, and/or may be part of one or more ASICs.
  • the processor 702 may be configured to implement any of the schemes described herein.
  • the processor 702 may be implemented using hardware, software, or both.
  • the secondary storage 704 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over-flow data storage device if the RAM 708 is not large enough to hold all working data.
  • the secondary storage 704 may be used to store programs that are loaded into the RAM 708 when such programs are selected for execution.
  • the ROM 706 is used to store instructions and perhaps data that are read during program execution.
  • the ROM 706 is a non-volatile memory device that typically has a small memory capacity relative to the larger memory capacity of the secondary storage 704.
  • the RAM 708 is used to store volatile data and perhaps to store instructions. Access to both the ROM 706 and the RAM 708 is typically faster than to the secondary storage 704.
  • At least one of the secondary storage 704 or RAM 708 may be configured to store routing tables, forwarding tables, or other tables or information disclosed herein.
  • device 700 may include both a processor 702 and a dedicated GPU, not shown.
  • the technology described herein can be implemented using hardware, firmware, software, or a combination of these.
  • the software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein.
  • the processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media.
  • computer readable media may comprise computer readable storage media and communication media.
  • Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
  • a computer readable medium or media does (do) not include propagated, modulated or transitory signals.
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
  • some or all of the software can be replaced by dedicated hardware logic components.
  • illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc.
  • FPGAs Field-programmable Gate Arrays
  • ASICs Application-specific Integrated Circuits
  • ASSPs Application-specific Standard Products
  • SOCs System-on-a-chip systems
  • CPLDs Complex Programmable Logic Devices
  • special purpose computers etc.
  • software stored on a storage device
  • the one or more processors can be in communication with one or more computer readable media/ storage devices, peripherals and/or communication interfaces.
  • each process associated with the disclosed technology may be performed continuously and by one or more computing devices.
  • Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

Computer implemented methods and systems for improving sharpness of a ground truth blurry image of a scene are disclosed. First and second neural networks respectively produce first and second outputs based upon which a blurry image is rendered. A difference between the rendered blurry image and the ground truth blurry image is used to train the first and the second neural networks. This process is iteratively repeated until a specified criterion is satisfied. An image of the scene that has improved sharpness is then rendered based on respective trainable parameters of the first neural network after the specified criterion is satisfied. During the training, the rendering of the rendered blurry image based on the first and the second outputs of the first and the second neural networks, includes performing multi-ray-per-pixel tracing and weighted averaging for each of at least some of the plurality of pixels in the rendered blurry image.

Description

IMPROVING SHARPNESS OF IMAGE OF SCENE
Inventors:
Achleshwar Luthra Xiyun Song Shiva Souhith Gantha Zongfang Lin Liang Peng Hong Heather Yu
CLAIM OF PRIORITY
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/578,727, entitled “Resolution Recovery of Blurry Videos from Mobile Device Cameras”, filed August 25, 2023, which is incorporated by reference herein in its entirety.
FIELD
[0002] Embodiments described herein generally relate to improving sharpness of images of a scene captured using a camera, but are not limited thereto.
BACKGROUND
[0003] Smartphones have become the dominant type of camera for capturing images and video clips worldwide. In fact, it was reported that in 2022 most people (i.e., more than 90% of people) capture images and video clips using smartphones rather than traditional digital cameras, and that percentage is expected to continue to grow in the coming years.
[0004] While capturing images and video clips with smartphones brings significant convenience (it is fast, easy, and inexpensive) to our daily life when capturing special moments, events, birthdays and more, it can be quite challenging to take high quality images and video clips in some scenarios. For example, moving objects in dynamic scenes such as running kids can be out of the focal plane of smartphone cameras, which results in blurring of these objects in some or all video frames. In another example, the quality of video clips could also suffer from blurring artifacts due to camera shake, especially when the person who holds the camera is also in motion to follow the one or more objects of interest. In many scenarios, there is no chance to re-take videos of those special moments despite the degraded image quality due to blurring. Therefore, it would be desirable to have the capability of recovering sharpness from blurry video clips, and more generally, from blurry images.
SUMMARY
[0005] One aspect of the present disclosure includes a computer implemented method for improving sharpness of a ground truth blurry image of a scene. The method includes using a first neural network and a second neural network to respectively produce a first output and a second output, rendering a rendered blurry image based on the first and the second outputs, and comparing the rendered blurry image to the ground truth blurry image to thereby determine a difference between the rendered blurry image and the ground truth blurry image. The method further includes training the first and the second neural networks by updating respective trainable parameters of each of the first and the second neural networks based on the difference between the rendered blurry image and the ground truth blurry image. The method also includes iteratively repeating the using, the rendering, the comparing, and the training until a specified criterion is satisfied. Additionally, the method includes rendering, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, an image of the scene that has improved sharpness compared to the ground truth blurry image. The rendered blurry image, which is rendered based on the first and the second outputs of the first and the second neural networks, includes a plurality of pixels. Each of the plurality of pixels has corresponding color data and corresponds to a position in image space. The rendering of the rendered blurry image based on the first and the second outputs of the first and the second neural networks, during the training of the first and the second neural networks, includes performing multi-ray-per-pixel tracing and weighted averaging for each of at least some of a plurality of pixels in the rendered blurry image.
[0006] Optionally, in any preceding aspect, the first neural network comprises a neural radiance field network and the first output comprises trainable parameters of radiance fields of the neural radiance field network. Optionally, in any preceding aspect, the second neural network comprises a kernel estimation network and the second output comprises estimated blur kernels and associated weights for each of a plurality of kernel positions.
[0007] Optionally, in any preceding aspect, the first neural network is configured to iteratively learn to approach a ground truth sharp image, which corresponds to the ground truth blurry image with a blurring process removed. Optionally, in any preceding aspect, the second neural network is configured to iteratively generate a plurality of trained blur kernels that simulate the blurring process.
[0008] Optionally, in any preceding aspect, the multi-ray-per-pixel tracing for a pixel, of the plurality of pixels for which the multi-ray-per-pixel tracing is performed, results in a plurality N of rays being traced for the pixel. Optionally, in any preceding aspect, each of the plurality of trained blur kernels is generated for a respective one of the plurality of pixels for which the multi-ray-per-pixel tracing is performed. Optionally, in any preceding aspect, each of the plurality of trained blur kernels corresponds to a K x K kernel window, where K is an odd integer that is at least 3, and N < K². Optionally, in any preceding aspect, N < K², and thus, each of the plurality of trained blur kernels comprises a sparse blur kernel.
[0009] Optionally, in any preceding aspect, the performing multi-ray-per-pixel tracing and weighted averaging for a pixel of the plurality of pixels comprises: tracing a respective ray for the pixel and for each of a plurality of neighboring pixels to thereby trace a plurality N of rays for the pixel, wherein each ray that is traced, of the plurality N of rays for the pixel, originates from a same camera position and passes through one of the plurality of different respective positions in image space and has respective color data; determining a weight for each of the plurality N of rays for the pixel to thereby determine a respective plurality N of weights for the plurality N of rays for the pixel; and using the plurality N of weights for the plurality N of rays for the pixel to combine the color data for the plurality N of rays for the pixel to thereby produce combined color data for the pixel. [0010] Optionally, in any preceding aspect, the color data for each said pixel comprises a respective plurality of color values (e.g., comprises respective red, green, and blue (RGB) color values), and each of the plurality N of rays traced for a said pixel has a respective plurality of color values (e.g., respective RGB color values); and the plurality N of weights determined for the plurality N of rays, which are traced for the pixel, are used to combine the plurality of color values (e.g., the RGB color values) corresponding to the plurality N of rays traced for the pixel.
[0011] Optionally, in any preceding aspect, the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel, are respective centers of the pixel and the neighboring pixels. Alternatively, at least some of the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel, are offset from respective centers of the pixel and the neighboring pixels.
[0012] Optionally, in any preceding aspect, the rendering of the image of the scene that has improved sharpness compared to the ground truth blurry image, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, is performed using single-ray-per-pixel tracing for each of a plurality of pixels in the image of the scene.
[0013] Another aspect of the present disclosure includes a system for improving sharpness of a ground truth blurry image of a scene. The system includes a first neural network and a second neural network configured to respectively produce a first output and a second output, during each of a plurality of iterations of the first and the second neural networks being trained until a specified criterion is satisfied. The system also includes a rendering engine configured to receive the first and second outputs and to render a rendered blurry image based on the first and the second outputs, during each of the plurality of iterations until the specified criterion is satisfied. The first and second neural networks are configured to be trained based on a difference between the rendered blurry image and the ground truth blurry image during each of the plurality of iterations until the specified criterion is satisfied, by updating respective trainable parameters of each of the first and the second neural networks based on the difference between the rendered blurry image and the ground truth blurry image during each of the plurality of iterations until the specified criterion is satisfied. The rendering engine is further configured to render, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, an image of the scene that has improved sharpness compared to the ground truth blurry image. The rendered blurry image, which is rendered based on the first and the second outputs of the first and the second neural networks during each of the plurality of iterations until the specified criterion is satisfied, includes a plurality of pixels. Each of the plurality of pixels has corresponding color data and corresponds to a position in image space. The rendering engine is configured to render the rendered blurry image based on the first and the second outputs of the first and the second neural networks, during each of the plurality of iterations until the specified criterion is satisfied, by performing multi-ray-per-pixel tracing and weighted averaging for each of at least some of a plurality of pixels in the rendered blurry image.
[0014] Optionally, in any preceding aspect, the first neural network comprises a neural radiance field network and the first output comprises trainable parameters of radiance fields of the neural radiance field network. Optionally, in any preceding aspect, the second neural network comprises a kernel estimation network and the second output comprises estimated blur kernels and associated weights for each of a plurality of kernel positions.
[0015] Optionally, in any preceding aspect, the first neural network is configured to iteratively learn to approach a ground truth sharp image, which corresponds to the ground truth blurry image with a blurring process removed. Optionally, in any preceding aspect, the second neural network is configured to generate a plurality of trained blur kernels that simulate the blurring process.
[0016] Optionally, in any preceding aspect, the multi-ray-per-pixel tracing for a pixel, of the plurality of pixels for which the multi-ray-per-pixel tracing is performed, results in a plurality N of rays being traced for the pixel; each of the plurality of trained blur kernels is generated for a respective one of the plurality of pixels for which the multi-ray-per-pixel tracing is performed; and each of the plurality of trained blur kernels corresponds to a K x K kernel window, where K is an odd integer that is at least 3, and N ≤ K². Optionally, in any preceding aspect, N < K², and thus, each of the plurality of trained blur kernels comprises a sparse blur kernel.
[0017] Optionally, in any preceding aspect, the multi-ray-per-pixel tracing and weighted averaging for a pixel of the plurality of pixels, that the rendering engine is configured to perform, includes: tracing a respective ray for the pixel and for each of a plurality of neighboring pixels to thereby trace a plurality N of rays for the pixel, wherein each ray that is traced, of the plurality N of rays for the pixel, originates from a same camera position and passes through one of a plurality of different respective positions in image space and has respective color data; determining a weight for each of the plurality N of rays for the pixel to thereby determine a respective plurality N of weights for the plurality N of rays for the pixel; and using the plurality N of weights for the plurality N of rays for the pixel to combine the color data for the plurality N of rays for the pixel to thereby produce combined color data for the pixel.
[0018] Optionally, in any preceding aspect, the color data for each said pixel comprises a respective plurality of color values (e.g., comprises respective red, green, and blue (RGB) color values), and each of the plurality N of rays traced for a said pixel has a respective plurality of color values (e.g., respective RGB color values); and the plurality N of weights determined for the plurality N of rays, which are traced for the pixel, are used to combine the plurality of color values (e.g., the RGB color values) corresponding to the plurality N of rays traced for the pixel.
[0019] Optionally, in any preceding aspect, the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel, are respective centers of the pixel and the neighboring pixels. Alternatively, at least some of the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel, are offset from respective centers of the pixel and the neighboring pixels.
[0020] Optionally, in any preceding aspect, the rendering engine is configured to render the image of the scene that has improved sharpness compared to the ground truth blurry image, which said image is rendered based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, using single-ray-per-pixel tracing for each of a plurality of pixels in the image of the scene.
[0021] A further aspect of the present disclosure includes one or more non-transitory computer-readable media storing computer instructions for improving sharpness of a ground truth blurry image of a scene, that when executed by one or more processors, cause the one or more processors to perform operations comprising: using a first neural network and a second neural network to respectively produce a first output and a second output; rendering a rendered blurry image based on the first and the second outputs; comparing the rendered blurry image to the ground truth blurry image to thereby determine a difference between the rendered blurry image and the ground truth blurry image; training the first and the second neural networks by updating respective trainable parameters of each of the first and the second neural networks based on the difference between the rendered blurry image and the ground truth blurry image; iteratively repeating the using, the rendering, the comparing, and the training until a specified criterion is satisfied; and rendering, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, an image of the scene that has improved sharpness compared to the ground truth blurry image; wherein the rendered blurry image, which is rendered based on the first and the second outputs of the first and the second neural networks, includes a plurality of pixels; wherein each of the plurality of pixels has corresponding color data and corresponds to a position in image space; and wherein the rendering of the rendered blurry image based on the first and the second outputs of the first and the second neural networks, during the training of the first and the second neural networks, includes performing multi-ray-per-pixel tracing and weighted averaging for each of at least some of the plurality of pixels in the rendered blurry image.
[0022] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying Figures (FIGS.) for which like references indicate elements. [0024] FIG. 1 is a high level block diagram representing a system or pipeline according to an embodiment of the present technology.
[0025] FIG. 2 illustrates an example of a ground truth blurry image and how rays can be rendered and optimized for each pixel of the image and how those optimized rays can be determined according to a predicted blur kernel determined for each pixel, in accordance with an embodiment of the present technology.
[0026] FIG. 3 illustrates how a kernel estimation network can be trained to learn generalizable spatially varying and/or temporal varying blur kernels, in accordance with an embodiment of the present technology.
[0027] FIG. 4A illustrates a sparse 3 x 3 blur kernel for a pixel for which rays are traced from a same camera position through a center of the pixel and through centers of some of its neighboring pixels.
[0028] FIG. 4B illustrates a dense 3 x 3 blur kernel for a pixel for which rays are traced from a same camera position through a center of the pixel and through centers of all of its neighboring pixels.
[0029] FIG. 4C illustrates a sparse 3 x 3 blur kernel for a pixel for which rays are traced from a same camera position through a position offset from the center of the pixel and through positions offset from centers of some of its neighboring pixels.
[0030] FIG. 4D illustrates a sparse 5 x 5 blur kernel for a pixel for which rays are traced from a same camera position through a center of the pixel and through centers of some of its neighboring pixels.
[0031] FIGS. 5A and 5B illustrate how embodiments of the present technology can be used to improve the sharpness of images included in volumetric video delivery at a remote side where original images are captured or alternatively at an edge side where the images are rendered for viewing.
[0032] FIG. 6 illustrates an example of a computing system with which embodiments disclosed herein may be implemented. [0033] FIG. 7 illustrates a schematic diagram of a general-purpose network component or computer system with which embodiments of the present technology may be implemented.
DETAILED DESCRIPTION
[0034] Certain embodiments of the present technology can be used to improve sharpness of a ground truth blurry image of a scene. Such a ground truth blurry image of a scene can be, for example, a frame of a video clip captured using a camera of a smartphone or other type of video recorder, where the blurriness in the ground truth blurry image of the scene is due to camera motion, due to object motion, and/or due to images of the video clip being out of the focal plane of the camera.
[0035] FIG. 1 is a high level block diagram representing a system or pipeline according to an embodiment of the present technology. Referring to FIG. 1, the system or pipeline shown therein includes a first neural network 110, a second neural network 120, and a rendering engine 130. The first neural network 110, which can be a neural radiance field network, but is not limited thereto, is responsible for generating a sharp representation of an original blurry image 150. The original blurry image 150, which can also be referred to herein as the ground truth blurry image 150, may have been captured using a camera of a mobile phone or some other type of camera. It is also possible that a ground truth blurry image is generated based on low quality meshes for a three-dimensional (3D) scene, as can be appreciated from FIG. 5B, discussed below. The blurring in the ground truth blurry image may have been caused, for example, by one or more moving objects in a dynamic scene, camera shake (aka camera movement), and/or one or more objects in the scene being out of a focal plane of the camera. The second neural network 120, which can be a multilayer perceptron (MLP), but is not limited thereto, is responsible for generating (e.g., estimating) blur kernels that simulate a blurring process.
[0036] Still referring to FIG. 1, an output of the first neural network 110 and an output of the second neural network 120 are both provided to a rendering engine 130. During an iterative training process, the rendering engine 130, according to an embodiment of the present technology, uses multi-ray-per-pixel tracing and weighted averaging for each of a plurality of pixels of an image frame to render a blurry image 140. Example details of such multi-ray-per-pixel tracing and weighted averaging, according to certain embodiments of the present technology, are described further below.
[0037] As shown in FIG. 1, the rendered blurry image 140 is compared to the ground truth blurry image 150 at a comparison stage 160, to thereby produce a difference 170 between the two images, which difference 170 can also be referred to as an error. This difference 170 is fed back to (i.e., provided as feedback to) the first and the second neural networks 110, 120. Based on the feedback, each of the first and the second neural networks 110, 120 updates its respective trainable parameters in a manner that causes the rendered blurry image 140 to become more similar to the ground truth blurry image 150 during the next iteration of the above described process, which has the effect of reducing the difference 170 (aka error) between the two images during that next iteration. The above described iterative process is repeated for a predetermined number of iterations (e.g., 10,000 iterations), until the difference 170 is below a predetermined threshold difference, until a change in the difference 170 from one iteration to the next iteration is below a predetermined threshold change, until any one of the above criteria is met, until at least two of the above criteria are met, or until some other specified criterion is met. More generally, the above described iterative process is repeated until a specified criterion is met. When the above described iterative process stops being repeated because the specified criterion (examples of which were just described above) has been met, it can be said that the training (and associated learning) is completed (aka done).
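The joint training loop described above can be summarized with the following minimal sketch in Python with PyTorch. The names sharp_model (the first neural network 110), kernel_net (the second neural network 120), render_blurry (the rendering engine 130), and the chosen learning rate and thresholds are illustrative assumptions and are not taken from the present disclosure.

import torch
import torch.nn.functional as F

def train_jointly(sharp_model, kernel_net, render_blurry, gt_blurry,
                  max_iters=10000, diff_threshold=1e-4):
    # Jointly update the trainable parameters of both neural networks.
    params = list(sharp_model.parameters()) + list(kernel_net.parameters())
    optimizer = torch.optim.Adam(params, lr=5e-4)
    for _ in range(max_iters):
        # Multi-ray-per-pixel tracing and weighted averaging produce the rendered blurry image 140.
        rendered_blurry = render_blurry(sharp_model, kernel_net)
        # Compare to the ground truth blurry image 150 to obtain the difference 170.
        difference = F.mse_loss(rendered_blurry, gt_blurry)
        optimizer.zero_grad()
        difference.backward()   # feed the difference back to both networks
        optimizer.step()        # update the trainable parameters of both networks
        if difference.item() < diff_threshold:   # one example of a specified criterion
            break
    return sharp_model   # its trained parameters are later used for sharp, single-ray-per-pixel rendering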
[0038] In the system or pipeline described with reference to FIG. 1, the first and the second neural networks 110, 120 are jointly trained. During the training, the first neural network 110 (e.g., a neural radiance field network) gradually learns to approach a ground truth sharp image (which is the ground truth blurry image with the blurring process removed), and the second neural network 120 (which can be referred to as a Kernel Estimation Network) gradually learns to approach the blurring process. In specific embodiments, where the first neural network 110 is implemented using a neural radiance field (NeRF) network, the NeRF network gradually learns a sharp time-conditioned NeRF representation, and the second neural network gradually learns spatially-varying and/or temporally-varying blur kernels to simulate the blurring process.
[0039] When the training (and associated learning) is done, the trainable parameters in the first neural network provide a representation of the ground truth sharp image (which is the ground truth blurry image with the blurring process removed), which can be a three-dimensional (3D) scene, but is not limited thereto. Additionally, when the training (and associated learning) is done, the trainable parameters in the first neural network can be used to render an image of the scene included in the ground truth blurry image that has improved sharpness (compared to the ground truth blurry image), which rendered image can be (but need not be) a novel view of the scene. More specifically, when the training (and associated learning) is done, the rendering of the image of the scene that has improved sharpness (compared to the ground truth blurry image), based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, is performed using single-ray-per-pixel tracing (as opposed to multi-ray-per-pixel tracing and weighted averaging) for each of a plurality of pixels in the image of the scene.
[0040] Before providing additional details of the embodiments of the present technology introduced above, it is useful to describe some details of technology upon which the embodiments of the present technology build and improve. There has been a recent wave of NeRF-based methods for free-viewpoint rendering of static 3D scenes with impressive results, wherein such NeRF methods utilize deep learning for reconstructing a 3D representation of a scene from sparse two-dimensional (2D) images. More specifically, a NeRF model enables learning of novel view synthesis, scene geometry, and the reflectance properties of the scene. Additional scene properties such as camera poses may also be jointly learned. NeRF enables rendering of photorealistic views from novel viewpoints. First introduced in 2020, NeRF has since gained significant attention for its potential applications in computer graphics and content creation. The NeRF algorithm represents a scene as a radiance field parametrized by a deep neural network (DNN). The network predicts a volume density and view-dependent emitted radiance given the spatial location (x, y, z) and viewing direction in Euler angles (θ, φ) of the camera. By sampling many points along camera rays, traditional volume rendering techniques can produce an image. A NeRF model is retrained for each unique scene. A first step is to collect images of the scene from different angles and their respective camera pose, wherein such images are standard 2D images and do not require a specialized camera or software. Accordingly, any camera that is able to generate such datasets can be used, provided the settings and capture method meet the requirements for Structure from Motion (SfM). This can utilize the tracking of the camera position and orientation, often through some combination of Simultaneous Localization and Mapping (SLAM), Global Positioning System (GPS), or inertial estimation. For each sparse viewpoint (image and camera pose) provided, camera rays are marched through the scene, generating a set of 3D points with a given radiance direction (into the camera). For these points, volume density and emitted radiance are predicted using an MLP. An image is then generated through classical volume rendering. Because this process is fully differentiable, the error between the predicted image and the original image can be minimized with gradient descent over multiple viewpoints, encouraging the MLP to develop a coherent model of the scene.
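For readers unfamiliar with the classical volume rendering step referred to above, the following sketch shows, for a single camera ray, how sampled densities and radiances are accumulated into a pixel color. It is a simplified illustration; the function field_mlp, the sample count, and the near and far bounds are assumptions rather than details of any particular NeRF implementation.

import torch

def render_ray(field_mlp, origin, direction, z_near, z_far, n_samples=64):
    # Sample points along the camera ray r(z) = origin + z * direction.
    z = torch.linspace(z_near, z_far, n_samples)
    points = origin + z[:, None] * direction                       # (n_samples, 3)
    rgb, sigma = field_mlp(points, direction.expand_as(points))    # predicted radiance and density
    delta = torch.cat([z[1:] - z[:-1], z[-1:] - z[-2:-1]])          # spacing between samples
    alpha = 1.0 - torch.exp(-sigma * delta)                         # opacity of each segment
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]  # transmittance
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(dim=0)                      # accumulated pixel color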
[0041] Even more recently, the research community has shifted its focus towards novel space-time synthesis of dynamic scenes that include moving objects such as people or pets. This is sometimes also referred to as free-viewpoint video and this enables plenty of applications such as cinematic effects from monocular videos, free-viewpoint selfies, background-foreground separation (where foreground refers to objects in motion), virtual 3D teleportation, and so on. Novel view synthesis for a dynamic scene is a challenging task. This requires expensive and arduous setups of multiple-camera capturing rigs which are impractical to scale. There can be ambiguous solutions as multiple scene settings can lead to the same observed image sequences, and additionally, moving objects also add to the difficulty of this problem statement. Recent methods have proposed unique solutions to deal with the task of dynamic novel view-time synthesis but they still have many limitations. Methods like Nerfies and HyperNeRF learn a static canonical model and handle deformations via a separate ray bending network whereas methods like NSFF and DVS learn correspondences over time (or temporal consistency) via 3D scene flow and display the ability to handle larger motion (compared to ray bending network based methods) in the scene. However, these methods are not designed to handle unstable camera motion, hence preventing their deployment in real world scenarios where blur is very common.
[0042] Although NeRF variants have achieved remarkable success in space-time view synthesis, these methods require carefully captured videos or images with well-calibrated camera parameters, hence limiting them to controlled environments. In the past, only a few methods have sufficiently addressed the problem of novel view synthesis from images subject to motion or out-of-focus blur. To the best of the inventors' knowledge, the paper by Ma et al. titled "Deblur-nerf: Neural radiance fields from blurry images," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), which is incorporated herein by reference, is the first paper that described a method that can reconstruct sharp NeRF from blurry inputs, and more specifically, proposed a deformable sparse kernel estimation module to simulate the blurring process while training a NeRF model. More recently, others introduced a rigid blurring kernel that utilizes physical scene priors, and an adaptive weight refinement scheme that considers the relationship between depth and blur to render realistic results. While these methods may achieve impressive results, they are limited to static scenes and cannot handle moving objects in the scene.
[0043] In contrast, embodiments of the present technology described herein can construct a sharp NeRF while also handling dynamic scenes. Certain embodiments of the present technology take the best of both worlds by providing a method that learns a continuous volumetric function mapping a 3D location, direction and time to reflectance, density and 3D scene motion. In certain such embodiments, a sharp Neural Scene Flow Fields (NSFF) model is trained by maintaining temporal consistency using a scene flow fields warping loss and by simulating the blurring process using a Kernel Estimation Network. During inference time, the algorithm is able to generate sharp novel space-time renderings from blurry input frames. Such embodiments tackle a novel problem that has not yet been covered in previous research.
[0044] There have been numerous methods that build the 3D scene geometry using point clouds, parametric surfaces and meshes and use it to render novel views. Recently, a neural network was introduced that takes a 5D input to model the continuous 3D scene by using differentiable volumetric rendering techniques. Success of NeRF inspired many subsequent works on improving upon different aspects and limitations of NeRF including the photometric and geometric quality, training and rendering efficiency and reducing the number of input views used. Most of these works, however, assume ideal input views that are perfectly captured. There are few works that have explored non-ideal inputs, such as the case where the camera poses are not available. Other works disentangle the lighting with the geometry which helps when the input views have some inconsistencies, and help in removing aliasing artifacts as the input images are scaled.
[0045] Compared to static scenes, this is an even more challenging problem because along with capturing view-dependent ambiguities in geometry and appearances, a network trained for a dynamic scene also needs to model deformations in geometry with varying time. There have been some recent NeRF-based successful attempts for the task of 4D reconstruction in space and time, given real monocular RGB videos or synthetic videos. These approaches can be categorized into: (a) time-varying NeRFs; and (b) controllable NeRFs that use a ray-bending technique to handle deformations. Time-varying NeRFs are conditioned on encoded time-input and assume no prior knowledge of the 3D scene. These methods rely on learning from different data modalities such as depth, optical flow, segmentation masks, and camera poses. In addition to data-based learning, these methods also take advantage of geometric regularization losses to maintain consistency across time. Although these methods have shown impressive results and can empirically handle larger motion, they entangle geometry, appearance and deformation, whereas controllable NeRFs can disentangle deformation from geometry and appearance by modelling a separate canonical model to map from deformed space to canonical space but cannot capture a larger range of motion.
[0046] Certain embodiments of the present technology generate novel view and time synthesis of dynamic scenes with recovered sharpness. Certain embodiments of the present technology utilize a special deep learning method, wherein the method learns a continuous volumetric function mapping a three-dimensional (3D) location, direction and time to reflectance, density, and 3D scene motion. More specifically, this deep learning method trains a sharp model by maintaining temporal consistency using a scene flow fields warping loss and by learning spatially varying blur kernels to simulate the blurring process using a Kernel Estimation Network. The model optimized from the training is then a representation of a time-conditioned neural radiance field and can be used to generate sharp novel space-time renderings from blurry input frames.
[0047] In accordance with certain embodiments, trainable sparse spatially-varying blur kernels are used to render sharp novel space-time view synthesis of blurred images (e.g., of a video) capturing dynamic scenes. This learned blur kernel can be a general K x K sized kernel, where K is an integer that is greater than or equal to 3, and the K x K sized kernel can be used to learn a sharp representation of a time-conditioned neural radiance field. The images might be subject to blur due to camera motion or object motion or out-of-focus blur, as was noted above. Instead of using a sparse blur kernel, a dense blur kernel can be used to get an improvement in results and produce sharper novel renderings. However, there is a trade-off between speed and quality, and using a dense blur kernel may be subject to technological advancements in hardware in the future. In certain embodiments, K = 3. However, with enhanced hardware equipment, the value of K can be increased to deal with different types of blurring effects.
[0048] Certain embodiments of the present technology facilitate separation of dynamic and static objects and help to sharpen those regions in the scene. If the dynamic region is subject to motion blur or if the static background region is subject to out-of-focus blur, embodiments of the present technology can be used to sharpen both the regions simultaneously without specifically modelling different types of blurs.
[0049] Certain embodiments of the present technology train a Kernel Estimation Network that is conditioned on learnable (aka trainable) time embedding along with a view embedding. In accordance with certain embodiments, the Kernel Estimation Network is implemented as a fully connected Multi-Layer Perceptron (MLP) neural network, which takes time as an input, but instead of directly passing in time, a trainable embedded vector corresponding to time is passed to the MLP neural network.
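A minimal sketch of such a time- and view-conditioned Kernel Estimation Network is shown below in Python with PyTorch. The layer sizes, the softmax normalization of the weights, and the use of nn.Embedding for the trainable time and view embeddings are assumptions made for illustration only.

import torch
import torch.nn as nn

class KernelEstimationMLP(nn.Module):
    def __init__(self, num_frames, num_views, embed_dim=32, hidden=64, n_rays=5):
        super().__init__()
        self.n_rays = n_rays
        # Time is not passed in directly; a trainable embedded vector corresponding
        # to time (and to the view) is passed to the MLP instead.
        self.time_embedding = nn.Embedding(num_frames, embed_dim)
        self.view_embedding = nn.Embedding(num_views, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 + 2 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_rays * 3),   # per ray: a 2D position offset and a weight
        )

    def forward(self, pixel_xy, frame_idx, view_idx):
        t_emb = self.time_embedding(frame_idx)   # trainable time embedding
        v_emb = self.view_embedding(view_idx)    # trainable view embedding
        out = self.mlp(torch.cat([pixel_xy, t_emb, v_emb], dim=-1))
        offsets = out[..., : 2 * self.n_rays].reshape(*out.shape[:-1], self.n_rays, 2)
        weights = torch.softmax(out[..., 2 * self.n_rays:], dim=-1)  # kernel weights summing to 1
        return offsets, weights

For example, a network created as KernelEstimationMLP(num_frames=120, num_views=120) could be queried once per pixel during training to obtain the per-ray offsets and weights.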
[0050] The above mentioned K x K sized kernel can also be referred to herein as a two-dimensional (2D) trainable kernel. In accordance with certain embodiments, the 2D trainable kernel is replaced with a 3D trainable kernel, where the associated weights are 3-dimensional corresponding to RGB channels. In accordance with certain embodiments, the 2D trainable kernel is replaced with a 3D trainable kernel, where 2 dimensions correspond to position, and 1 dimension corresponds to time. Other variations are also possible and within the scope of the embodiments described herein.
[0051] To represent a dynamic scene, Neural Scene Flow Fields (NSFF) extend the idea of Neural Radiance Fields (NeRFs) by modelling 3D motion as dense scene flow fields, and more specifically, by learning a combination of static and dynamic NeRFs. In accordance with certain embodiments, a static NeRF is a time-independent Multi-Layer Perceptron (MLP), denoted by F_θ^st, that takes as input a position (x) and a viewing direction (d), and outputs RGB color (c), volumetric density (σ), and unsupervised 3D blending weights (v) that determine how to blend RGBσ (i.e., RGB color and volumetric density) from the static and dynamic representations, in accordance with the following equation:

(c, σ, v) = F_θ^st(x, d)    (EQ1)

where F_θ^st is a static NeRF, x is position, d is viewing direction, c is RGB color, σ is volumetric density, and v is an unsupervised 3D blending weight.
[0052] In the below equation EQ2, a dynamic NeRF, denoted by F_θ^dy, explicitly models a view-dependent as well as time-dependent MLP that takes an additional input, i.e., time (t), along with position (x) and viewing direction (d). On top of color and density, the model also predicts forward and backward 3D scene flow (f_{t→t+1}, f_{t→t-1}) and disocclusion weights (w_{t→t+1}, w_{t→t-1}) to tackle motion disocclusions in 3D space, in accordance with the following equation:

(c_t, σ_t, f_{t→t+1}, f_{t→t-1}, w_{t→t+1}, w_{t→t-1}) = F_θ^dy(x, d, t)    (EQ2)

where F_θ^dy is a dynamic NeRF, x is position, d is viewing direction, t is time, c_t is RGB color at the time t, σ_t is volumetric density at the time t, f_{t→t+1} and f_{t→t-1} represent forward and backward 3D scene flow, and w_{t→t+1} and w_{t→t-1} represent disocclusion weights.
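To make the interface implied by EQ2 concrete, the sketch below splits a single per-point feature vector into the outputs of the dynamic NeRF. The feature dimension and the activation functions are assumptions; the point is only that color, density, forward and backward scene flow, and disocclusion weights are produced jointly.

import torch
import torch.nn as nn

class DynamicNeRFHead(nn.Module):
    # Maps a per-point feature vector to the outputs listed in EQ2.
    def __init__(self, feat_dim=128):
        super().__init__()
        self.out = nn.Linear(feat_dim, 3 + 1 + 3 + 3 + 2)

    def forward(self, features):
        o = self.out(features)
        c_t = torch.sigmoid(o[..., 0:3])     # RGB color at time t
        sigma_t = torch.relu(o[..., 3:4])    # volumetric density at time t
        f_fwd = o[..., 4:7]                  # forward 3D scene flow f_{t -> t+1}
        f_bwd = o[..., 7:10]                 # backward 3D scene flow f_{t -> t-1}
        w = torch.sigmoid(o[..., 10:12])     # disocclusion weights in [0, 1]
        return c_t, sigma_t, f_fwd, f_bwd, w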
[0053] In accordance with certain embodiments, a final color value for a pixel is estimated using the blending weights as per the following rendering equations:

Ĉ_t^cb(r_t) = ∫_{z_n}^{z_f} T_t^cb(z) ( v(z) σ(z) c(z) + (1 − v(z)) σ_t(z) c_t(z) ) dz    (EQ3)

T_t^cb(z) = exp( − ∫_{z_n}^{z} ( v(s) σ(s) + (1 − v(s)) σ_t(s) ) ds )

where r_t is a ray at the time t, Ĉ_t^cb(r_t) is the final (combined) color value for the pixel at the time t, v(z), σ(z), and c(z) are the blending weight, volumetric density, and color from the static NeRF, σ_t(z) and c_t(z) are the volumetric density and color from the dynamic NeRF at the time t, T_t^cb denotes transmittance at the time t, and z_n and z_f denote near depth and far depth along the ray r_t.
[0054] The final color output Ĉ_t^cb(r_t) is trained against C_t(r_t), which is the ground truth RGB color values at the pixel corresponding to the ray r_t, which is represented by the following equation:

ℒ = Σ_{r_t} ‖ Ĉ_t^cb(r_t) − C_t(r_t) ‖₂²
[0055] In the above described equations, during training NSFF is used to render a single ray per pixel in an image frame that is being deblurred. In accordance with certain embodiments of the present technology described below, rather than rendering a single ray per pixel during training, a plurality (N) of rays is rendered per pixel of each image frame during training, where N is, for example, 5, but is not limited thereto, as will be described in more detail below. This is done in order to simulate the blurring process represented in the following equation:

b(x, y) = c(x, y) ⊗ h    (EQ6)

where h is a blur kernel, c(x, y) is a sharp image that the neural network (e.g., MLP) is to learn, b(x, y) is the blurred image, and ⊗ denotes the convolution operator.
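A direct NumPy illustration of the blurring model of EQ6, for a single-channel floating-point image and a K x K kernel, might look as follows; the uniform example kernel is an assumption used only for illustration.

import numpy as np

def blur(sharp, kernel):
    # Discrete 2D convolution b = c ⊗ h of a sharp image with a K x K blur kernel (EQ6).
    # (The kernel flip of a strict convolution is omitted; it has no effect for symmetric kernels.)
    K = kernel.shape[0]
    pad = K // 2
    padded = np.pad(sharp, pad, mode="edge")
    blurred = np.zeros_like(sharp, dtype=float)
    for y in range(sharp.shape[0]):
        for x in range(sharp.shape[1]):
            blurred[y, x] = np.sum(padded[y:y + K, x:x + K] * kernel)
    return blurred

# Example: a uniform 3 x 3 kernel whose weights sum to 1.
h = np.full((3, 3), 1.0 / 9.0)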
[0056] In accordance with certain embodiments, the blur kernel h is a K x K window (which can also be referred to as a K x K matrix). Ideally, the number of rays N rendered per pixel would equal K². However, where a NeRF model is being trained to obtain the sharp image c(x, y), it may be impractical to render K² rays per pixel, because of the memory requirements and processing time that would be needed. For example, if K = 3, and N = K², then N would equal 9. For the following discussion, it is assumed that K = 3 (meaning the kernel window, aka kernel matrix, is 3 x 3) and N = 5 (meaning 5 rays are rendered per pixel in a frame), in which case the kernel is a sparse kernel.
[0057] In accordance with certain embodiments, a Kernel Estimation Network, or more generally the second neural network 120 introduced above with reference to FIG. 1, is used to optimize N rays for each of the pixels (or for each of at least some of the pixels) in an image frame. Once the N rays for each of the pixels in a frame are optimized, the final color value for a pixel, i.e., Ĉ_t^cb(r_t^i), is estimated for all elements from 0 to N, i.e., ∀ i ∈ [0, N), using equation EQ3 introduced above. This results in each pixel having N associated color values and a sparse blur kernel h with weights w_i, ∀ i ∈ [0, N), predicted using the Kernel Estimation Network, or more generally the second neural network 120 introduced above with reference to FIG. 1. Using these values, the blurring process is simulated using equation EQ6, introduced above, to obtain B̂_t(r_t), where r_t corresponds to a pixel at time t (and more specifically, corresponds to a ray traced through a pixel at the time t) and B̂_t denotes the output blurry image at time t. This is further supervised using the ground truth blurry image B_gt, which process can be expressed as the following equation:
ℒ_mse-blur = Σ_{r_t} ‖ B̂_t(r_t) − B_gt(r_t) ‖₂²

where ℒ_mse-blur is a mean-squared error loss function, B̂_t(r_t) is the rendered blurry image, and B_gt(r_t) is the ground truth blurry image.
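In code, the per-pixel simulation of the blurring process and its supervision against the ground truth blurry image can be sketched as below. The tensor shapes and the assumption that the kernel weights have already been normalized are illustrative choices, not requirements of the present disclosure.

import torch

def blurry_pixel(ray_colors, kernel_weights):
    # ray_colors:     (N, 3) colors rendered with EQ3 for the N optimized rays of one pixel.
    # kernel_weights: (N,)   sparse blur-kernel weights w_i predicted for that pixel.
    return (kernel_weights[:, None] * ray_colors).sum(dim=0)   # simulated blurry color

def mse_blur_loss(rendered_blurry, gt_blurry):
    # Mean-squared error between the rendered blurry pixels and the ground truth blurry pixels.
    return torch.mean((rendered_blurry - gt_blurry) ** 2)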
[0058] As per equation EQ2, the dynamic NeRF, F_θ^dy, can be used to output the forward and backward 3D scene flow (f_{t→t+1}, f_{t→t-1}) for all the points in 3D space. In accordance with certain such embodiments, this is used to determine forward and backward flows for each sampled point on all the optimized rays. Using the scene flow, each point in 3D space can be offset to a neighboring frame t' (where t' ∈ {t−1, t+1}) and volume rendered with the associated color (c_t') and volumetric density (σ_t'). This provides a rendered image frame of time t' warped to time t, denoted as:

Ĉ_{t'→t}(r_t) = ∫_{z_n}^{z_f} T_{t'}(z) σ_{t'}(z) c_{t'}(z) dz    (EQ8)
[0059] In accordance with certain embodiments of the present technology, instead of obtaining Ĉ_{t'→t}(r_t) for a single ray, Ĉ_{t'→t}(r_t^i), ∀ i ∈ [0, N), is obtained and the blurring process is repeated using the same blur kernel h for the pixel corresponding to the ray r_t. This gives a warped rendered blurry view B̂_{t'→t}(r_t), and then the goal is to minimize the following temporal photometric consistency loss equation:

ℒ_pho = Σ_{r_t} ‖ B̂_{t'→t}(r_t) − B_gt(r_t) ‖₂²    (EQ10)

where ℒ_pho is a temporal photometric consistency loss, B̂_{t'→t}(r_t) is the warped rendered blurry image, and B_gt(r_t) is the ground truth blurry image.
[0060] More details on the temporal photometric consistency loss equation EQ10 are discussed below.
[0061] As can be appreciated from FIG. 2, in accordance with certain embodiments of the present technology, N rays are rendered and optimized for each pixel of an image and those optimized rays are determined according to the predicted blur kernel h determined for each pixel. Referring to FIG. 2, shown at the left is an example ground truth blurry image 210 (which is an example of the ground truth blurry image 150 introduced in FIG. 1), which can be expressed as B_gt. Shown just to the right of the example ground truth blurry image 210 is a zoomed-in image patch 220 (of the example ground truth blurry image 210), as well as a representation of an optimized blur kernel 240 for a pixel 230 of the image patch 220. The optimized blur kernel 240 is represented as a 3 x 3 sparse matrix through which N rays 242, which all originate at a same camera position 250, are traced, where N = 5 in this example. More specifically, in FIG. 2, a respective ray 242 is traced for the pixel 230 and for each of a plurality of neighboring pixels 232 to thereby trace a plurality N of rays for the pixel. In other words, for each pixel, instead of tracing a single ray, the pixel's neighboring rays are also considered based on optimized location information in the kernel estimated for the pixel, thus resulting in a total of N rays traced. Each ray 242 that is traced, of the plurality N of rays 242 for the pixel 230, originates from the same camera position 250 and passes through one of a plurality of different respective positions in image space and has respective color data.
[0062] In accordance with certain embodiments of the present technology, a weight w_i is determined for each of the plurality N of rays 242 for the pixel 230 to thereby determine a respective plurality N of weights for the plurality N of rays 242 for the pixel 230. After being determined, the plurality N of weights that are determined for the plurality N of rays for the pixel are used to combine the color data for the plurality N of rays for the pixel (e.g., 230) to thereby produce combined color data for the pixel 230. In specific embodiments, the color data for a pixel (e.g., 230) comprises respective red, green, and blue (RGB) color values, and each of the plurality N of rays traced for the pixel has respective RGB color values. In such embodiments, the plurality N of weights determined for the plurality N of rays, which are traced for the pixel, are used to combine the RGB color values corresponding to the plurality N of rays traced for the pixel. Referring to the right side of FIG. 2, c_0 represents the color data for the pixel 230, and c_1, c_2, c_3, and c_4 represent the color data for the neighboring pixels 232, and w_0, w_1, w_2, w_3, and w_4 represent the weights that are used to combine the RGB color values corresponding to the plurality N of rays traced for the pixel. Convolution of the color data and the weights, for each of the pixels, results in a rendered blurry pixel of the rendered blurry image. The color data for a pixel (e.g., 230) can alternatively comprise other types of color values, besides RGB color values, if another color space besides RGB color space is used. For example, the YUV color space may alternatively be used, in which case the color data values for a pixel can comprise a respective Y' component (luma) value, a respective chroma component U value, and a respective chroma component V value. For another example, the HSV or HSL color space (also known as the HSB color space) may be used, in which case the color data values for a pixel can comprise a hue data value, a saturation data value, and a brightness data value. For still another example, the CMYK color space may be used, in which case the color data values for a pixel can include respective cyan (C), magenta (M), yellow (Y), and black (K for "key") color data values. Other color spaces that can be used include, but are not limited to, sRGB and CIE L*A*B* color spaces.
[0063] In FIG. 2, an MLP 260 is shown as being used to determine the aforementioned color data and the weights. The MLP 260 in FIG. 2 is an example of the second neural network 120 introduced above with reference to FIG. 1. As was noted in the above discussion of FIG. 1 , other types of neural networks can be used instead of an MLP to implement the second neural network 120, while still being within the scope of the embodiments described herein. However, for much of the remaining discussion, it is assumed that the second neural network 120 is implemented as an MLP as a design choice. Further, as was noted above, because the second neural network 120 is used to estimate blur kernels, the second neural network can also be referred to herein as the Kernel Estimation Network. Such a blur kernel is obtained using the Kernel Estimation Network, or more generally the second neural network 120, as visually illustrated in FIG. 3.
[0064] Referring to FIG. 3, shown therein at the left is an example ground truth blurry image 310, which is an example of the ground truth blurry image 150 introduced in FIG. 1, and is the same as the ground truth blurry image 210 discussed above with reference to FIG. 2. For each pixel in the ground truth blurry image 310, a respective blur kernel is determined using a Kernel Estimation Network 320, which is the neural network 120 introduced above in the discussion of FIG. 1. The output 330 of the Kernel Estimation Network 320 in FIG. 3 is shown as including positions p_i and associated weights w_i for each of the 5 rays traced (and more generally, the N rays traced) for a pixel 312 corresponding to a position in image space.
[0065] The Kernel Estimation Network 320 is denoted as G_φ in the below discussion and the below described equations.
[0066] In accordance with certain embodiments, G_φ takes a query pixel p, a canonical kernel h, and a view embedding l (labeled 312, 314, and 316, respectively, in FIG. 3) as inputs, and outputs

(Δp_i, w_i) = G_φ(p, h, l), ∀ i ∈ [0, N)
where Δp_i is the offset for each p_i, which are pre-defined positions on the canonical kernel h, and w_i is their associated weight. In specific embodiments, the network 320 learns an optimized blur kernel for each pixel, as blurring is a spatially-varying process and usually also varies with the viewing direction, which justifies the use of a view embedding. In specific embodiments, to confine the solution, a boundary constant is defined (e.g., a boundary constant ε = 0.1 can be used) and multiplied with the outputs of the Kernel Estimation Network 320. This helps keep the neighboring points, that will be considered in producing the final blur color for each pixel, closer to the pixel in consideration. Still referring to FIG. 3, in accordance with certain embodiments, in addition to (or instead of) having the view embedding as one of the inputs to the Kernel Estimation Network 320, a time embedding 318 can be one of the inputs to the Kernel Estimation Network 320. [0067] In FIG. 3, the blur kernel 316 is shown as being a sparse 3 x 3 blur kernel, where the kernel width K = 3, and N < K², because N = 5, which is less than 3² = 9. Instead of the blur kernel being a sparse kernel, the blur kernel can alternatively be a dense kernel where N = K², which in this example would mean that N = 9. However, as noted above, there is a trade-off between speed and quality, and using a dense blur kernel may be subject to technological advancements in hardware in the future. In certain embodiments, K = 3. However, with enhanced hardware equipment, the value of K can be increased to deal with different types of blurring effects. In other words, the width of the blur kernel can be increased, e.g., by having K = 5. In an embodiment where K = 5, and the blur kernel is a sparse blur kernel, then N may, for example, equal 10, but is not limited thereto. In an embodiment where K = 5, and the blur kernel is a dense blur kernel, then N = 25, since 5² = 25. In FIG. 3, the positions (aka locations) of the pixels in the sparse blur kernel through which the N rays are traced (where N = 5 in this example), which positions are labeled p0, p1, p2, p3, and p4, are each within the centers of each of the square regions of the sparse blur kernel. In alternative embodiments, the positions through which the N rays are traced are instead trainable locations that are offset from the centers of the square regions of a sparse blur kernel, or a dense blur kernel, depending upon the specific implementation. Examples of these various types of blur kernels are illustrated in FIGS. 4A through 4D, discussed below.
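Continuing the Kernel Estimation Network sketch given earlier, the boundary constant can be applied to the predicted offsets as shown below. The canonical positions, the value ε = 0.1, and the absence of any additional squashing nonlinearity are assumptions made for illustration.

import torch

def optimized_ray_positions(canonical_positions, offsets, eps=0.1):
    # canonical_positions: (N, 2) pre-defined positions p_i on the canonical kernel h
    #                      (e.g., the pixel center and four of its neighbors for N = 5).
    # offsets:             (N, 2) offsets predicted by the Kernel Estimation Network.
    # The boundary constant eps keeps the optimized positions close to the query pixel.
    return canonical_positions + eps * offsets

Rays are then traced from the camera position through these optimized positions in image space rather than through the pixel centers.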
[0068] FIG. 4A illustrates a sparse 3 x 3 blur kernel for a pixel (shown in the center) for which rays are traced from a same camera position through a center of the pixel and through centers of some of its neighboring pixels. The sparse blur kernel illustrated in FIG. 4A is the same as the sparse blur kernel 316 illustrated in FIG. 3 discussed above, and thus, K = 3 and N = 5. FIG. 4B illustrates a dense 3 x 3 blur kernel for a pixel for which rays are traced from a same camera position through a center of the pixel and through centers of all of its neighboring pixels, where K = 3 and N = 9. FIG. 4C illustrates a sparse 3 x 3 blur kernel for a pixel for which rays are traced from a same camera position through a position offset from the center of the pixel and through positions offset from centers of some of its neighboring pixels, where K = 3 and N = 5. The positions (aka locations) that are offset from the centers of the various pixels can be determined during training of the neural networks 110 and 120, and thus, the positions can be referred to as trainable positions. FIG. 4D illustrates a sparse 5 x 5 blur kernel for a pixel for which rays are traced from a same camera position through a center of the pixel and through centers of some of its neighboring pixels, where K = 5 and N = 9. The density of the 5 x 5 blur kernel can be increased, e.g., so N = 13, or N = 25, but not limited thereto. In FIG. 4D, rays are traced from a same camera position through a center of the pixel and through centers of some of its neighboring pixels. However, as can be appreciated from the above discussion of FIG. 4C, the rays can alternatively be traced through a position offset from the center of the pixel and through positions offset from centers of neighboring pixels.
[0069] As explained above in the discussion of FIG. 1, during the training process, the first neural network 110 (e.g., a neural radiance field network) gradually learns to approach a ground truth sharp image (which is the ground truth blurry image 150 with the blurring process removed), and the second neural network 120 (which can be referred to as a Kernel Estimation Network) gradually learns to approach the blurring process. In specific embodiments, where the first neural network 110 is implemented using a neural radiance field (NeRF) network, the NeRF network gradually learns a sharp time-conditioned NeRF representation, and the second neural network gradually learns spatially-varying blur kernels to simulate the blurring process. In such embodiments, once the first and second neural networks 110 and 120 are done being trained, during inference, sharp novel views of the scene that was represented in the original blurry image 150 (which as noted above, can also be referred to herein as the ground truth blurry image 150) can be generated. Explained another way, in accordance with certain embodiments, while the second neural network 120 (aka the Kernel Estimation Network) is responsible for simulating the blurring process, static and dynamic NeRFs of the first neural network 110 learn a sharp scene. This implies that following training, i.e., once the training of the first and second neural networks 110, 120 is done, the trained sharp NeRFs of the first neural network 110 can be used to generate novel space-time views, without any use of the second neural network 120. In certain embodiments, a splatting-based plane-sweep volume rendering approach is used to render novel views at fixed times, novel times at fixed views, and space-time interpolation.
[0070] In accordance with certain embodiments, consistency of a scene at a time t with the adjacent times t' is ensured after accounting for motion due to 3D scene flow. The scene is warped from time t' to t using the 3D scene flow estimation output from the Dynamic NeRF F_θ^dy to ensure that any motion that occurred in that time period is undone. The warped point locations x_t' on the ray r_t' are used to query the associated color (c_t') and density (σ_t') at the time t'. Using the color and density information, the image with color Ĉ_{t'→t}(r_t) is rendered in accordance with equation EQ8. The blur kernel for the ray r_t corresponding to time t is then applied to get the blurry frame B̂_{t'→t}(r_t), which is also referred to herein as the rendered blurry image. The calculation of loss over regions which get disoccluded due to motion in the scene is ambiguous. To help with this, the Dynamic NeRF (F_θ^dy) also outputs disocclusion weights w_{t'→t} ∈ [0, 1]. The weights w_{t'→t} decide the contribution of the temporal photometric consistency loss at each location in the scene to the total loss. To calculate Ŵ_{t'→t}(r_t), volume rendering of the weights along a ray r_t is first performed using the density values from time t'. This accumulated weight Ŵ_{t'→t}(r_t) is then used for each 2D pixel as the weightage of the temporal photometric consistency loss, as shown in the following equations:

Ŵ_{t'→t}(r_t) = ∫_{z_n}^{z_f} T_{t'}(z) σ_{t'}(z) w_{t'→t}(z) dz

ℒ_pho = Σ_{r_t} Ŵ_{t'→t}(r_t) ‖ B̂_{t'→t}(r_t) − B_gt(r_t) ‖₂²
[0071] To avoid the trivial solution, an ℓ1 regularization is added to force the disocclusion weights to be near 1, resulting in the following equation for the loss:

ℒ_pho = Σ_{r_t} ( Ŵ_{t'→t}(r_t) ‖ B̂_{t'→t}(r_t) − B_gt(r_t) ‖₂² + λ ‖ 1 − Ŵ_{t'→t}(r_t) ‖₁ )

where λ is a regularization parameter which is kept as 0.1.
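A sketch of the disocclusion-weighted temporal photometric consistency loss with the ℓ1 regularization term is given below, under the assumption that the accumulated weights have already been volume rendered per pixel; the reduction choices are illustrative.

import torch

def weighted_photometric_loss(warped_blurry, gt_blurry, accum_weights, lam=0.1):
    # warped_blurry: (P, 3) warped rendered blurry pixels.
    # gt_blurry:     (P, 3) ground truth blurry pixels at time t.
    # accum_weights: (P,)   accumulated disocclusion weights in [0, 1].
    per_pixel = ((warped_blurry - gt_blurry) ** 2).sum(dim=-1)
    data_term = (accum_weights * per_pixel).mean()
    reg_term = lam * torch.abs(1.0 - accum_weights).mean()   # l1 term pushing the weights toward 1
    return data_term + reg_term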
[0072] To account for potential alignment loss, further regularization can be added to train the neural networks 110 and 120 in order to gear a model (e.g., a NeRF model) generated by the neural network 110 towards learning a sharp representation of the scene. Without any constraints, the model (e.g., NeRF model) together with the optimized kernels might learn to map well to the blurry ground truth image (e.g., 150, 210), but during inference, there may be some unexpected distortions because renderings are produced using the model (e.g., NeRF model) generated by the first neural network 110 (e.g., a neural radiance field network) without the second neural network 120 (aka, the Kernel Estimation Network). To avoid this, the sparse kernel can be initialized such that the optimized rays are close to the input ray and the kernel weights corresponding to the optimized rays are similar to a Gaussian kernel representation. This will ensure that when the training is started, all the kernel points are near the pixel centers. Further, one of the optimized rays is forced to be close to the input ray as shown in the equation below:
ℒ_align = ‖ p_{q_0} − p ‖₂²

where p is the position of the input ray (i.e., of the query pixel), p_{q_0} is the optimized ray position at the index q_0, and q_0 corresponds to a fixed index in the output from the Kernel Estimation Network.
[0073] In summary, the overall loss function can be represented using the following equation:

ℒ = ℒ_mse-blur + ℒ_pho + λ_align ℒ_align

where λ_align can be kept at 0.1, but is not limited thereto.
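Putting the pieces together, the overall objective can be sketched as a simple weighted sum; the function name and arguments refer to the sketches above and are assumptions.

def total_loss(l_mse_blur, l_pho, l_align, lam_align=0.1):
    # Overall training objective: blur reconstruction loss, temporal photometric
    # consistency loss, and the alignment regularization scaled by lambda_align.
    return l_mse_blur + l_pho + lam_align * l_align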
[0074] FIGS. 5A and 5B illustrate how embodiments of the present technology can be used to improve the sharpness of images included in volumetric video delivery at a remote side where original images are captured or at an edge side where 3D views are rendered for viewing. In FIGS. 5A and 5B, the block labeled 512, which performs image reconstruction with increased sharpness, is used to implement an embodiment of the present technology described above with reference to FIGS. 1-3. Initially referring to FIG. 5A, low quality blurry images 502 are captured using a camera. The sharpness of the images is increased at block 512 using an embodiment of the present technology. The images with increased sharpness are used to produce high quality meshes 522 for a 3D scene. The high quality meshes 522 for the 3D scene are transferred (e.g., transmitted) from the remote side to an edge side over one or more networks, which can include wired and/or wireless networks, where high quality rendered sharp views 542 are rendered based on the high quality meshes 522 for the 3D scene. The rendering at the edge side can be performed using an augmented reality (AR) display, a virtual reality (VR) display, or a 3D display, but is not limited thereto. FIG. 5B instead shows image reconstruction being performed at block 510 without increased sharpness, based upon which low quality meshes 520 for a 3D scene are produced. The low quality meshes 520 for the 3D scene are transferred (e.g., transmitted) from the remote side to the edge side over one or more networks, which can include wired and/or wireless networks, as are the original low quality blurry images 502'. Based thereon, on the edge side the sharpness of the images is increased at block 512 using an embodiment of the present technology, and used to produce high quality rendered sharp views 542.
[0075] The various embodiments of the present technology described above can be implemented using one or more processing devices. FIGS. 6 and 7 illustrate example hardware devices suitable for implementing such embodiments.
[0076] FIG. 6 shows an example of a computing system 600 with which embodiments disclosed herein may be implemented. For example, the computing system 600 can be used to implement the neural networks 110 and 120, and the rendering engine 130, but not limited thereto. The computing system 600 includes at least one processor 602, which could be a single central processing unit (CPU) or an arrangement of multiple processor cores of a multi-core architecture. In the depicted example, the processor 602 includes a pipeline 604, an instruction cache 606, and a data cache 608 (and other circuitry, not shown). The processor 602 is connected to a processor bus 610, which enables communication with an external memory system 612 and an input/output (I/O) bridge 614. The I/O bridge 614 enables communication over an I/O bus 616, with various different I/O devices 618A-618D (e.g., disk controller, network interface, display adapter, and/or user input devices such as a keyboard or mouse). Optionally, in embodiments, the computing system 600 may include both a processor 602 and a dedicated graphics processing unit (GPU, not shown). As understood, a GPU is a type of processing unit that enables very efficient parallel processing of data. Although GPUs may be used in a video card or the like for computer graphics, GPUs have found much broader applications.
[0077] The external memory system 612 is part of a hierarchical memory system that includes multi-level caches, including the first level (L1) instruction cache 606 and data cache 608, and any number of higher level (L2, L3, . . . ) caches within the external memory system 612. Other circuitry (not shown) in the processor 602 supporting the caches 606 and 608 includes a translation lookaside buffer (TLB) and various other circuitry for handling a miss in the TLB or the caches 606 and 608. For example, the TLB is used to translate an address of an instruction being fetched or data being referenced from a virtual address to a physical address, and to determine whether a copy of that address is in the instruction cache 606 or data cache 608, respectively. If so, that instruction or data can be obtained from the L1 cache. If not, that miss is handled by miss circuitry so that it may be executed from the external memory system 612. It is appreciated that the division between which level caches are within the processor 602 and which are in the external memory system 612 can differ in various examples. For example, an L1 cache and an L2 cache may both be internal and an L3 (and higher) cache could be external. The external memory system 612 also includes a main memory interface 620, which is connected to any number of memory modules (not shown) serving as main memory (e.g., Dynamic Random Access Memory modules).
[0078] FIG. 7 illustrates a schematic diagram of a general-purpose network component or computer system 700 with which embodiments of the present technology may be implemented. For example, the general-purpose network component or computer system 700 can be used to implement the neural networks 110 and 120, and the rendering engine 130, but not limited thereto. The general-purpose network component or computer system 700 includes a processor 702 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 704, and memory, such as ROM 706 and RAM 708, input/output (I/O) devices 710, and a network 712, such as the Internet or any other well-known type of network, that may include network connectivity devices, such as a network interface. Although illustrated as a single processor, the processor 702 is not so limited and may comprise multiple processors. The processor 702 may be implemented as one or more CPU chips, cores (e.g., a multi-core processor), FPGAs, ASICs, and/or DSPs, and/or may be part of one or more ASICs. The processor 702 may be configured to implement any of the schemes described herein. The processor 702 may be implemented using hardware, software, or both.
[0079] The secondary storage 704 typically comprises one or more disk drives or tape drives and is used for non-volatile storage of data and as an overflow data storage device if the RAM 708 is not large enough to hold all working data. The secondary storage 704 may be used to store programs that are loaded into the RAM 708 when such programs are selected for execution. The ROM 706 is used to store instructions and perhaps data that are read during program execution. The ROM 706 is a non-volatile memory device that typically has a small memory capacity relative to the larger memory capacity of the secondary storage 704. The RAM 708 is used to store volatile data and perhaps to store instructions. Access to both the ROM 706 and the RAM 708 is typically faster than access to the secondary storage 704. At least one of the secondary storage 704 or the RAM 708 may be configured to store routing tables, forwarding tables, or other tables or information disclosed herein.
[0080] It is understood that by programming and/or loading executable instructions onto the node 700, at least one of the processor 702, the ROM 706, and the RAM 708 are changed, transforming the node 700 in part into a particular machine or apparatus, e.g., a router, having the novel functionality taught by the present disclosure. Optionally, in embodiments, the node 700 may include both the processor 702 and a dedicated GPU (not shown).
[0081] It is fundamental to the electrical engineering and software engineering arts that functionality that can be implemented by loading executable software into a computer can be converted to a hardware implementation by well-known design rules. Decisions between implementing a concept in software versus hardware typically hinge on considerations of stability of the design and the number of units to be produced, rather than on any issues involved in translating from the software domain to the hardware domain. Generally, a design that is still subject to frequent change may preferably be implemented in software, because re-spinning a hardware implementation is more expensive than re-spinning a software design. Generally, a stable design that will be produced in large volume may preferably be implemented in hardware, for example in an ASIC, because for large production runs the hardware implementation may be less expensive than the software implementation. Often a design may be developed and tested in a software form and later transformed, by well-known design rules, to an equivalent hardware implementation in an application specific integrated circuit that hardwires the instructions of the software. In the same manner as a machine controlled by a new ASIC is a particular machine or apparatus, likewise a computer that has been programmed and/or loaded with executable instructions may be viewed as a particular machine or apparatus.
[0082] The technology described herein can be implemented using hardware, firmware, software, or a combination of these. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated or transitory signals.
[0083] Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
[0084] In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces.
[0085] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details.
[0086] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0087] The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.
[0088] For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.
[0089] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

What is claimed is:
1. A computer implemented method for improving sharpness of a ground truth blurry image of a scene, the method comprising: using a first neural network and a second neural network to respectively produce a first output and a second output; rendering a rendered blurry image based on the first and the second outputs; comparing the rendered blurry image to the ground truth blurry image to thereby determine a difference between the rendered blurry image and the ground truth blurry image; training the first and the second neural networks by updating respective trainable parameters of each of the first and the second neural networks based on the difference between the rendered blurry image and the ground truth blurry image; iteratively repeating the using, the rendering, the comparing, and the training until a specified criterion is satisfied; and rendering, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, an image of the scene that has improved sharpness compared to the ground truth blurry image; wherein the rendered blurry image, which is rendered based on the first and the second outputs of the first and the second neural networks, includes a plurality of pixels; wherein each of the plurality of pixels has corresponding color data and corresponds to a position in image space; and wherein the rendering of the rendered blurry image based on the first and the second outputs of the first and the second neural networks, during the training of the first and the second neural networks, includes performing multi-ray-per-pixel tracing and weighed averaging for each of at least some of the plurality of pixels in the rendered blurry image.
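By way of illustration only, the following non-limiting Python (PyTorch) sketch shows one way the training loop recited in claim 1 could be realized. The network classes, layer sizes, ray count, loss, and stopping criterion below are hypothetical stand-ins introduced purely for this sketch; they are not the claimed implementation of the radiance field network, the kernel estimation network, or the rendering engine.

    import torch
    import torch.nn as nn

    class NeRFStandIn(nn.Module):
        # Toy stand-in for the first (radiance field) network: maps a 2-D
        # image-space position to an RGB color.
        def __init__(self):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                                     nn.Linear(64, 3), nn.Sigmoid())

        def forward(self, xy):
            return self.mlp(xy)

    class KernelNetStandIn(nn.Module):
        # Toy stand-in for the second (kernel estimation) network: predicts,
        # per pixel, N ray offsets in image space and N mixing weights.
        def __init__(self, n_rays=5):
            super().__init__()
            self.n_rays = n_rays
            self.mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                                     nn.Linear(64, n_rays * 3))

        def forward(self, xy):
            out = self.mlp(xy).view(-1, self.n_rays, 3)
            offsets, logits = out[..., :2], out[..., 2]
            weights = torch.softmax(logits, dim=-1)   # per-pixel weights sum to 1
            return offsets, weights

    def render_blurry(nerf, kernel_net, xy):
        # Multi-ray-per-pixel rendering followed by weighted averaging.
        offsets, weights = kernel_net(xy)             # shapes (P, N, 2) and (P, N)
        rays = xy.unsqueeze(1) + offsets              # N image-space positions per pixel
        rgb = nerf(rays.reshape(-1, 2)).view(rays.shape[0], rays.shape[1], 3)
        return (weights.unsqueeze(-1) * rgb).sum(dim=1)   # weighted average -> blurry color

    nerf, kernel_net = NeRFStandIn(), KernelNetStandIn()
    optimizer = torch.optim.Adam(list(nerf.parameters()) + list(kernel_net.parameters()), lr=1e-3)

    xy = torch.rand(1024, 2)          # sampled pixel positions (toy data)
    gt_blurry = torch.rand(1024, 3)   # ground-truth blurry colors at those pixels (toy data)

    for step in range(2000):          # "specified criterion" here: a fixed iteration count
        rendered = render_blurry(nerf, kernel_net, xy)
        loss = torch.mean((rendered - gt_blurry) ** 2)   # difference vs. ground-truth blurry image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    sharp = nerf(xy)                  # after training: one query per pixel yields the sharper image

In this sketch the specified criterion is simply a fixed number of iterations; a convergence threshold on the loss could serve the same role.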
2. The method of claim 1, wherein: the first neural network comprises a neural radiance field network and the first output comprises trainable parameters of radiance fields of the neural radiance field network; and the second neural network comprises a kernel estimation network and the second output comprises estimated blur kernels and associated weights for each of a plurality of kernel positions.
3. The method of any one of claims 1 or 2, wherein: the first neural network is configured to iteratively learn to approach a ground truth sharp image, which corresponds to the ground truth blurry image with a blurring process removed; and the second neural network is configured to iteratively generate a plurality of trained blur kernels that simulate the blurring process.
4. The method of claim 3, wherein: the multi-ray-per-pixel tracing performed for a pixel, of the plurality of pixels for which the multi-ray-per-pixel tracing is performed, results in a plurality N of rays being traced for the pixel; each of the plurality of trained blur kernels is generated for a respective one of the plurality of pixels for which the multi-ray-per-pixel tracing is performed; and each of the plurality of trained blur kernels corresponds to a K x K kernel window, where K is an odd integer that is at least 3, and N ≤ K^2.
5. The method of claim 4, wherein: N < K^2, and thus, each of the plurality of trained blur kernels comprises a sparse blur kernel.
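For illustration only, the following non-limiting Python sketch shows one way N ray positions could be selected inside a K x K kernel window with N smaller than K^2, yielding a sparse blur kernel of the kind recited in claims 4 and 5. The function name, the random selection strategy, and the example values K = 5 and N = 9 are assumptions made for this sketch, not features of the claims.

    import itertools
    import random

    def sparse_kernel_offsets(K=5, N=9, seed=0):
        # Return N distinct (dx, dy) offsets inside a K x K window centered on (0, 0).
        # With N < K*K only a subset of the window positions is traced, which is
        # what makes the resulting blur kernel "sparse".
        assert K % 2 == 1 and K >= 3, "K must be an odd integer that is at least 3"
        assert N <= K * K
        half = K // 2
        window = list(itertools.product(range(-half, half + 1), repeat=2))  # K*K candidate offsets
        return random.Random(seed).sample(window, N)

    offsets = sparse_kernel_offsets(K=5, N=9)   # e.g., 9 ray positions out of 25 possible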
6. The method of any one of claims 1 through 5, wherein the performing multi-ray-per-pixel tracing and weighed averaging for a pixel of the plurality of pixels comprises: tracing a respective ray for the pixel and for each of a plurality of neighboring pixels to thereby trace a plurality N of rays for the pixel, wherein each ray that is traced, of the plurality N of rays for the pixel, originates from a same camera position and passes through one of a plurality of different respective positions in image space and has respective color data; determining a weight for each of the plurality N of rays for the pixel to thereby determine a respective plurality N of weights for the plurality N of rays for the pixel; and using the plurality N of weights for the plurality N of rays for the pixel to combine the color data for the plurality N of rays for the pixel to thereby produce combined color data for the pixel.
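For illustration only, the following non-limiting Python (NumPy) sketch shows how the weighted averaging recited in claim 6 could combine the color data of N rays traced for a pixel and its neighboring positions. The trace_ray placeholder, the choice of five rays, and the example weights are hypothetical; in the claimed arrangement a renderer would supply the per-ray colors and the kernel estimation network would supply the weights.

    import numpy as np

    def trace_ray(camera_pos, image_xy):
        # Hypothetical placeholder: return the RGB color seen along the ray that
        # originates at camera_pos and passes through image-space position image_xy.
        seed = abs(hash((round(float(image_xy[0]), 6), round(float(image_xy[1]), 6)))) % (2**32)
        return np.random.default_rng(seed).random(3)

    def blurry_pixel_color(camera_pos, pixel_xy, offsets, weights):
        # Trace one ray for the pixel and one for each neighboring position,
        # then combine their RGB values using the supplied weights.
        positions = [np.asarray(pixel_xy, dtype=float) + np.asarray(o, dtype=float) for o in offsets]
        colors = np.stack([trace_ray(camera_pos, p) for p in positions])   # (N, 3) RGB values
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                                                    # normalize the N weights
        return (w[:, None] * colors).sum(axis=0)                           # weighted-average RGB

    # Example: N = 5 rays -- the pixel center plus its four neighboring pixel centers.
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    weights = [0.4, 0.15, 0.15, 0.15, 0.15]
    rgb = blurry_pixel_color(camera_pos=(0.0, 0.0, 0.0), pixel_xy=(10.0, 12.0),
                             offsets=offsets, weights=weights)

Using offsets of (0, 0) and the four unit offsets corresponds to rays through pixel centers (claim 8); fractional offsets would correspond to positions offset from the centers (claim 9).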
7. The method of claim 6, wherein: the color data for each said pixel comprises respective red, green, and blue (RGB) color values; each of the plurality N of rays traced for a said pixel has respective RGB color values; and the plurality N of weights determined for the plurality N of rays, which are traced for the pixel, are used to combine the RGB color values corresponding to the plurality N of rays traced for the pixel.
8. The method of any one of claims 6 or 7, wherein: the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel, are respective centers of the pixel and the neighboring pixels.
9. The method of any one of claims 6 or 7, wherein: at least some of the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel, are offset from respective centers of the pixel and the neighboring pixels.
10. The method of any one of claims 1 through 9, wherein: the rendering the image of the scene that has improved sharpness compared to the ground truth blurry image, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, is performed using single-ray-per-pixel tracing for each of a plurality of pixels in the image of the scene.
11. A system for improving sharpness of a ground truth blurry image of a scene, the system comprising: a first neural network and a second neural network configured to respectively produce a first output and a second output, during each of a plurality of iterations of the first and the second neural networks being trained until a specified criterion is satisfied; and a rendering engine configured to receive the first and second outputs and to render a rendered blurry image based on the first and the second outputs, during each of the plurality of iterations until the specified criterion is satisfied; the first and second neural networks configured to be trained based on a difference between the rendered blurry image and the ground truth blurry image during each of the plurality of iterations until the specified criterion is satisfied, by updating respective trainable parameters of each of the first and the second neural networks based on the difference between the rendered blurry image and the ground truth blurry image during each of the plurality of iterations until the specified criterion is satisfied; and the rendering engine further configured to render, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, an image of the scene that has improved sharpness compared to the ground truth blurry image; wherein the rendered blurry image, which is rendered based on the first and the second outputs of the first and the second neural networks during each of the plurality of iterations until the specified criterion is satisfied, includes a plurality of pixels; wherein each of the plurality of pixels has corresponding color data and corresponds to a position in image space; and wherein the rendering engine is configured to render the rendered blurry image based on the first and the second outputs of the first and the second neural networks, during each of the plurality of iterations until the specified criterion is satisfied, by performing multi-ray-per-pixel tracing and weighed averaging for each of at least some of the plurality of pixels in the rendered blurry image.
12. The system of claim 11, wherein: the first neural network comprises a neural radiance field network and the first output comprises trainable parameters of radiance fields of the neural radiance field network; and the second neural network comprises a kernel estimation network and the second output comprises estimated blur kernels and associated weights for each of a plurality of kernel positions.
13. The system of any one of claims 11 or 12, wherein: the first neural network is configured to iteratively learn to approach a ground truth sharp image, which corresponds to the ground truth blurry image with a blurring process removed; and the second neural network is configured to generate a plurality of trained blur kernels that simulate the blurring process.
14. The system of claim 13, wherein: the multi-ray-per-pixel tracing performed for a pixel, of the plurality of pixels for which the multi-ray-per-pixel tracing is performed, results in a plurality N of rays being traced for the pixel; each of the plurality of trained blur kernels is generated for a respective one of the plurality of pixels for which the multi-ray-per-pixel tracing is performed; and each of the plurality of trained blur kernels corresponds to a K x K kernel window, where K is an odd integer that is at least 3, and N ≤ K^2.
15. The system of claim 14, wherein: N < K^2, and thus, each of the plurality of trained blur kernels comprises a sparse blur kernel.
16. The system of any one of claims 11 through 15, wherein the multi-ray-per-pixel tracing and weighed averaging for a pixel of the plurality of pixels, that the rendering engine is configured to perform, comprises: tracing a respective ray for the pixel and for each of a plurality of neighboring pixels to thereby trace a plurality N of rays for the pixel, wherein each ray that is traced, of the plurality N of rays for the pixel, originates from a same camera position and passes through one of a plurality of different respective positions in image space and has respective color data; determining a weight for each of the plurality N of rays for the pixel to thereby determine a respective plurality N of weights for the plurality N of rays for the pixel; and using the plurality N of weights for the plurality N of rays for the pixel to combine the color data for the plurality N of rays for the pixel to thereby produce combined color data for the pixel.
17. The system of claim 16, wherein: the color data for each said pixel comprises respective red, green, and blue (RGB) color values; each of the plurality N of rays traced for a said pixel has respective RGB color values; and the plurality N of weights determined for the plurality N of rays, which are traced for the pixel, are used to combine the RGB color values corresponding to the plurality N of rays traced for the pixel.
18. The system of any one of claims 16 or 17, wherein: the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel, are respective centers of the pixel and the neighboring pixels.
19. The system of any one of claims 16 or 17, wherein: at least some of the different respective positions in image space, through which the plurality N of rays originating from the same camera position are traced for a said pixel, are offset from respective centers of the pixel and the neighboring pixels.
20. The system of any one of claims 11 through 19, wherein: the rendering engine is configured to render the image of the scene that has improved sharpness compared to the ground truth blurry image, which said image is rendered based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, using single-ray-per-pixel tracing for each of a plurality of pixels in the image of the scene.
21. One or more non-transitory computer-readable media storing computer instructions for improving sharpness of a ground truth blurry image of a scene, that when executed by one or more processors, cause the one or more processors to perform operations comprising: using a first neural network and a second neural network to respectively produce a first output and a second output; rendering a rendered blurry image based on the first and the second outputs; comparing the rendered blurry image to the ground truth blurry image to thereby determine a difference between the rendered blurry image and the ground truth blurry image; training the first and the second neural networks by updating respective trainable parameters of each of the first and the second neural networks based on the difference between the rendered blurry image and the ground truth blurry image; iteratively repeating the using, the rendering, the comparing, and the training until a specified criterion is satisfied; and rendering, based on the respective trainable parameters of the first neural network after the specified criterion is satisfied, an image of the scene that has improved sharpness compared to the ground truth blurry image; wherein the rendered blurry image, which is rendered based on the first and the second outputs of the first and the second neural networks, includes a plurality of pixels; wherein each of the plurality of pixels has corresponding color data and corresponds to a position in image space; and wherein the rendering of the rendered blurry image based on the first and the second outputs of the first and the second neural networks, during the training of the first and the second neural networks, includes performing multi-ray-per-pixel tracing and weighed averaging for each of at least some of the plurality of pixels in the rendered blurry image.
PCT/US2024/026657 2023-08-25 2024-04-26 Improving sharpness of image of scene WO2024138236A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363578727P 2023-08-25 2023-08-25
US63/578,727 2023-08-25

Publications (1)

Publication Number Publication Date
WO2024138236A2 true WO2024138236A2 (en) 2024-06-27

Family

ID=91186642

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/026657 WO2024138236A2 (en) 2023-08-25 2024-04-26 Improving sharpness of image of scene

Country Status (1)

Country Link
WO (1) WO2024138236A2 (en)
