WO2022216309A1 - Novel view synthesis with neural light field - Google Patents

Novel view synthesis with neural light field

Info

Publication number
WO2022216309A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
ray
mapping function
training images
color value
Prior art date
Application number
PCT/US2021/046695
Other languages
French (fr)
Inventor
Liu CELONG
Xu Yi
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Publication of WO2022216309A1 publication Critical patent/WO2022216309A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates, in general, to methods, systems, and apparatuses for machine learning solutions for novel view synthesis.
  • AR augmented reality
  • VR virtual reality
  • Traditional computer vision approaches to novel view synthesis have relied on explicitly building a geometric model of a scene or object.
  • image-based rendering has been employed where no geometric modeling is needed.
  • traditional light field rendering relies on a dense sampling of views (e.g., images) of a scene or object.
  • volume-based implicit representation techniques have been developed, in which a neural network is used to render scene and object representations.
  • volume-based implicit representations are computationally intensive, typically require ground-truth depth for training, and have large data storage requirements. This has thus far limited the use of these techniques for real-time applications.
  • a method may include obtaining a set of training images, the set of training images comprising two or more training images, generating, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learning one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value.
  • the method may further include rendering a novel view, wherein the novel view is a synthesized image with N number of pixels.
  • Rendering the novel view may further include querying the mapping function for each ray R 1 through R N, projecting from a center of projection to every pixel of the synthesized image, and obtaining, via the mapping function, a respective color value (R, G, B) for each ray R 1 through R N .
  • An apparatus may include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions.
  • the set of instructions may be executed by the processor to perform various functions.
  • the instructions may be executed by the processor to receive a set of training images, the set of training images comprising two or more training images, generate, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value.
  • the instructions may further be executed by the processor to render a novel view, wherein the novel view is a synthesized image with N number of pixels.
  • Rendering the novel view may further include evaluating the mapping function for each ray R 1 through R N, projecting from a center of projection to every pixel of the synthesized image; and generating for each ray R 1 through R N a respective color value (R, G, B) based on the mapping function.
  • a system may include a neural network configured to relate a ray of an image to a respective color value, and a novel view synthesizer subsystem.
  • the novel view synthesizer subsystem may further include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions.
  • the set of instructions may be executed by the processor to receive a set of training images, the set of training images comprising two or more training images; generate, via the neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value.
  • the instructions may further be executed by the processor to render a novel view, wherein the novel view is a synthesized image with N number of pixels.
  • Rendering the novel view may further include evaluating the mapping function for each ray R 1 through RN , projecting from a center of projection to every pixel of the synthesized image, and generating for each ray R 1 through RN a respective color value (R, G, B) based on the mapping function.
  • FIG. 1 is a schematic diagram of a system for novel view synthesis using neural light field, in accordance with various embodiments
  • FIG. 2 is a functional block diagram of a system for training a mapping function using neural light field, in accordance with various embodiments
  • FIG. 3 is a schematic diagram of 4D light field parameterization of rays, in accordance with various embodiments;
  • Fig. 4 is a schematic diagram of a camera array for training a mapping function, in accordance with various embodiments;
  • FIG. 5 is a flow diagram of a method for novel view synthesis using neural light field, in accordance with various embodiments
  • FIG. 6 is a schematic block diagram of a computer system for novel view synthesis using neural light field, in accordance with various embodiments.
  • a method for novel view synthesis using neural light field may include obtaining a set of training images, the set of training images comprising two or more training images, generating, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learning one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value.
  • the method may further include rendering a novel view, wherein the novel view is a synthesized image with N number of pixels.
  • Rendering the novel view may further include querying the mapping function for each ray R 1 through R N, projecting from a center of projection to every pixel of the synthesized image, and obtaining, via the mapping function, a respective color value (R, G, B) for each ray R 1 through R N .
  • each ray R 1 through R N may be parameterized as a four-dimensional (4D) coordinate (u i , v i , s i , t i ) defined with respect to a first parameterization surface and a second parameterization surface.
  • the 4D coordinate (u i , v i , s i , t i ) may be of a two-plane model, wherein the first parameterization surface is a first plane, wherein (u i , v i ) represents two-dimensional coordinates of the point at which R i intersects the first plane, and wherein the second parameterization surface is a second plane, wherein (s i , t i ) represents two-dimensional coordinates of the point at which R i intersects the second plane.
  • the first plane, the second plane, and a camera plane may be parallel to each other, wherein a camera is positioned on the camera plane.
  • the mapping function may be a multilayer perceptron.
  • Training the one or more training parameters may include minimizing for photometric loss (L p ), wherein minimizing L p includes determining a Euclidean distance between a predicted color value for a first known ray of a training image of the set of training images and a ground truth color value of the first known ray.
  • Minimizing for L p may further include selecting a random set of rays of the set of training images, passing the random set of rays through the mapping function to produce a set of respective predicted values, and determining the L p for the set of respective predicted values as compared against the set of training images.
  • the method may further include back-propagating L p to update the one or more training parameters of the mapping function.
  • training the one or more training parameters may include minimizing Fourier sparsity loss (L s ) of the mapping function, wherein minimizing L s further includes determining a difference of a frequency distribution of a predicted image of a random viewpoint and a frequency distribution of one or more training images of the set of training images, wherein the difference includes a Euclidean distance.
  • training the one or more training parameters may include minimizing ray bundle loss (L r ) of the mapping function, wherein minimizing L r further includes determining a difference between the color values of a bundle of rays originating from a common origin, wherein the difference includes a Euclidean distance between respective color values of each ray of the bundle of rays.
  • training the one or more training parameters comprises minimizing a loss function (L), the loss function comprising a weighted sum of photometric loss, Fourier sparsity loss, and ray bundle loss.
  • an apparatus for novel view synthesis with neural light field may include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions.
  • the set of instructions may be executed by the processor to perform various functions.
  • the instructions may be executed by the processor to receive a set of training images, the set of training images comprising two or more training images, generate, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value.
  • the instructions may further be executed by the processor to render a novel view, wherein the novel view is a synthesized image with N number of pixels.
  • Rendering the novel view may further include evaluating the mapping function for each ray R 1 through R N, projecting from a center of projection to every pixel of the synthesized image; and generating for each ray R 1 through R N a respective color value (R, G, B) based on the mapping function.
  • each ray R 1 through R N may be parameterized as a four-dimensional (4D) coordinate (u, v, s, t).
  • the mapping function may be a multilayer perceptron.
  • Training the one or more training parameters may include minimizing for photometric loss ( L p ), wherein minimizing L p includes determining a Euclidean distance between a predicted color value for a first known ray of a training image of the set of training images and a ground truth color value of the first known ray.
  • training the one or more training parameters may include minimizing Fourier sparsity loss (L s ) of the mapping function, wherein minimizing L s further includes determining a difference of a frequency distribution of a predicted image of a random viewpoint and a frequency distribution of one or more training images of the set of training images, wherein the difference includes a Euclidean distance.
  • training the one or more training parameters may include minimizing ray bundle loss (L r ) of the mapping function, wherein minimizing L r further includes determining a difference between the color values of a bundle of rays originating from a common origin, wherein the difference includes a Euclidean distance between respective color values of each ray of the bundle of rays.
  • a system for novel view synthesis with neural light field may include a neural network configured to relate a ray of an image to a respective color value, and a novel view synthesizer subsystem.
  • the novel view synthesizer subsystem may further include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions.
  • the set of instructions may be executed by the processor to receive a set of training images, the set of training images comprising two or more training images; generate, via the neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value.
  • the instructions may further be executed by the processor to render a novel view, wherein the novel view is a synthesized image with N number of pixels.
  • Rendering the novel view may further include evaluating the mapping function for each ray R 1 through R N, projecting from a center of projection to every pixel of the synthesized image, and generating for each ray R 1 through R N a respective color value (R, G, B) based on the mapping function.
  • learning the one or more training parameters may include minimizing for photometric loss ( L p ).
  • the set of instructions is further executable by the processor to select a random set of rays of the set of training images, pass the random set of rays through the mapping function to produce a set of respective predicted values; determine the L p for the set of respective predicted values as compared against the set of training images; and back-propagate L p to update the one or more training parameters of the mapping function.
  • learning the one or more training parameters may further include minimizing Fourier sparsity loss (L s ) of the mapping function, wherein minimizing L s further includes determining a difference of a frequency distribution of a predicted image of a random viewpoint and a frequency distribution of one or more training images of the set of training images, wherein the difference includes a Euclidean distance.
  • learning the one or more training parameters may include minimizing ray bundle loss (L r ) of the mapping function, wherein minimizing L r further includes determining a difference between the color values of a bundle of rays originating from a common origin, wherein the difference includes a Euclidean distance between respective color values of each ray of the bundle of rays.
  • the various embodiments include, without limitation, methods, systems, apparatuses, and/or software products.
  • a method might comprise one or more procedures, any or all of which may be executed by a computer system.
  • an embodiment might provide a computer system configured with instructions to perform one or more procedures in accordance with methods provided by various other embodiments.
  • a computer program might comprise a set of instructions that are executable by a computer system (and/or a processor therein) to perform such operations.
  • such software programs are encoded on physical, tangible, and/or non-transitory computer readable media (such as, to name but a few examples, optical media, magnetic media, and/or the like).
  • Various embodiments described herein, embodying software products and computer-performed methods, represent tangible, concrete improvements to existing technological areas, including, without limitation, multimedia coding and signal processing. It is to be understood that the term "coding” may be used to refer to encoding, decoding, or both encoding and decoding.
  • implementations of various embodiments improve the functioning of multimedia coding systems / subsystems by enabling more efficient dependent quantization and residual coding techniques.
  • view synthesis using 4D neural light fields improves upon traditional light field view synthesis, which relies on large amounts of input data to achieve high image fidelity and therefore has greater storage requirements and is less memory-efficient. Further, view synthesis using 4D neural light fields provides an efficient way to synthesize novel views with high fidelity and computational efficiency. 4D neural light field view synthesis further improves computational performance and efficiency, and eliminates redundancies present in view synthesis approaches relying on 5D radiance fields.
  • Fig. 1 is a schematic diagram of a system 100 for novel view synthesis using neural light field.
  • the system 100 includes training logic 105, mapping function 110, and view synthesizer logic 115. It should be noted that the various components of the system 100 are schematically illustrated in Fig. 1, and that modifications to the arrangement of the system 100 may be possible in accordance with the various embodiments.
  • the training logic 105 may be configured to obtain training data from a set of training images. As will be described in greater detail below, the training logic 105 may further be configured to train a mapping function to determine one or more trainable parameters of the mapping function.
  • the mapping function logic 110 may thus be coupled to the training logic 105 and configured to determine a color value for a given input ray.
  • View synthesizer logic 115 may be configured to generate a synthesized view based on the color values determined by the mapping function logic 110.
  • mapping function logic 110 may be incorporated as part of a novel view synthesizer subsystem, which may, in some examples, be separate from a neural network and/or mapping function itself, or alternatively comprise all or part of the neural network.
  • training logic 105 may be implemented as hardware, software, or both hardware and software.
  • the training logic 105 may include one or more of a dedicated integrated circuit (IC), such as an application specific IC (ASIC), field programmable gate array (FPGA), single board computer, programmable logic controller (PLC), or custom hardware.
  • training logic 105 may further be implemented on a general computer or server, and may further include a physical machine, one or more virtual machine instances, or both physical and virtual machine implementations.
  • Features and logic implemented in software may be stored in corresponding non-transitory computer-readable media and/or storage devices. This is discussed in greater detail below with respect to Fig. 6.
  • the training logic 105 may be configured to obtain a set of training images.
  • the set of training images may, in turn, include one or more training images.
  • training logic 105 may be configured to obtain a set of training images from a user. The user may thus, in some instances, provide a selected set of training images on which to generate a mapping function.
  • the training logic 105 may be configured to obtain a non-specific set of images.
  • the training logic 105 may obtain a plurality of sets of images. The images within a respective set of images may be related to each other, while the sets of images are unrelated.
  • images within a respective set may depict a common scene or object from different camera positions.
  • the camera positions may be arranged in a known arrangement. In other examples, the camera positions may be arranged randomly or otherwise without a positioning strategy.
  • the set of training images may be represented as {I 1 , I 2 , ..., I M }, where M is the total number of training images.
  • the training logic 105 may be configured to obtain training data from the set of training images.
  • training data may be obtained from the set of training images by determining a respective color value for each ray of each image of the set of training images.
  • each pixel of an image may be represented as a ray.
  • the ray in one example, may be parameterized as a 4-dimensional (4D) set of coordinates, as will be described in greater detail below with respect to Fig. 3. For each ray, a corresponding color value of the pixel may be learned.
  • the color value, in one example, may be represented as a red, green, blue (RGB) value given by a 3-dimensional set of values (R, G, B), as known to those skilled in the art.
  • each ray may be represented as a 4D coordinate (u, v, s, t).
  • all 4D coordinates for all rays for all images in the set of images may be determined by the training logic 105.
  • the set of all rays of the set of training images may be parameterized in 4D light field representation as a set of 4D coordinates {(u, v, s, t)}.
  • the color value for each of the rays is then known and obtained from the respective training images. From the training data, a collection of sample mappings {R_i^k -> c_i^k} may be obtained, where c_i^k is the color value of the i-th pixel of the k-th image. This training data may be passed to the mapping function logic 110 for further processing.
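  • By way of a non-limiting illustration, the collection of sample mappings may be assembled as in the following Python/PyTorch sketch; the function name build_training_samples, the pinhole-camera inputs, and the helper ray_to_4d (a two-plane parameterization routine, a sketch of which appears later in the discussion of Fig. 3) are illustrative assumptions rather than part of the disclosure:

        import torch

        def build_training_samples(images, ray_origins, ray_dirs, ray_to_4d):
            """Collect sample mappings {R_i^k -> c_i^k} from a set of training images.

            images:      (M, H, W, 3) float tensor of RGB values in [0, 1]
            ray_origins: (M, 3) camera centers, one per image (pinhole assumption)
            ray_dirs:    (M, H, W, 3) per-pixel ray directions
            ray_to_4d:   callable mapping (origins, dirs) -> (N, 4) light-field coords
            """
            M, H, W, _ = images.shape
            dirs = ray_dirs.reshape(M * H * W, 3)
            origins = ray_origins[:, None, :].expand(M, H * W, 3).reshape(M * H * W, 3)
            coords = ray_to_4d(origins, dirs)        # 4D (u, v, s, t) per pixel ray
            colors = images.reshape(M * H * W, 3)    # ground-truth color per ray
            return coords, colors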
  • the set of training images may be captured by one or more cameras of a camera array.
  • the term "camera” may refer to both physical cameras, used to capture images of a scene and/or object in the real world, and virtual cameras representing a point in space (e.g., virtual 3D space) from which a computer-generated image (e.g., a synthetic image) is captured and/or viewed.
  • 4D parameterization of a ray may include parameterization via one or more parameterization surfaces, as will be described below with respect to Figs. 2 & 3.
  • a two-plane model may be utilized, in which two planes (e.g., uv plane and st plane) are parallel to each other.
  • the rays of a training image may be sampled, or "re-sampled" depending on the notation adopted, to determine the 4D coordinate representation of each ray.
  • the set of training images may form an image plane, which may be parallel to the parameterization planes, which may further be parallel to the plane of the camera array (e.g., a viewing plane).
  • the camera acts as a center of projection from which all rays are projected through a given training image (e.g., image plane), and on through each of the two parameterization planes (e.g., a first plane or uv plane, and second plane or st plane).
  • the camera may further be positioned such that a ray passing through the center pixel of an image from the camera may be orthogonal to the image plane.
  • This arrangement simplifies 4D parameterization of the rays.
  • In other embodiments, the camera plane (e.g., the plane of the camera array), image plane, and/or parameterization plane(s) may be unstructured, in which case any one or more of the above planes may or may not be parallel to each other.
  • the cameras viewing and/or capturing the training images may be positioned in an unstructured pattern relative to one another, such that any four or more of the cameras may not lie on the same plane.
  • the mapping function logic 110 may be implemented as hardware, software, or both hardware and software, and similarly include both physical and virtual machine instances.
  • the mapping function logic 110 may include a neural network comprising one or more neural nodes.
  • the neural network may comprise a multilayer perceptron (MLP).
  • the mapping function logic 110 may be configured to generate a mapping function, f Θ, where Θ represents one or more trainable parameters.
  • the mapping function logic 110 may further evaluate, and/or provide for evaluation of, the mapping function for a given input.
  • the mapping function f Θ may be configured to generate a color value for a given input ray.
  • f Θ may be configured to determine a color value for a given set of 4D coordinates in a 4D light field representation of a ray.
  • the mapping function logic 110 may be configured to generate the mapping function f Θ based on the rays of the set of training images in relation to their respective color values.
  • f Θ may be generated based on the collection of sample mappings generated by the training logic, as described above.
  • a typical MLP, also referred to as a "vanilla" MLP, may not be able to encode high frequency information of a scene.
  • the mapping function logic 110 may be configured to utilize high dimension embedding to recover high frequency information.
  • the mapping function logic 110 may implement a high dimension embedder.
  • the mapping function logic 110 may further pass a set of 4D coordinates to an embedder, and then feed the embedded vector into the network (e.g., neural network, such as the MLP).
  • periodic activation functions may be utilized to recover the high frequency information.
  • the mapping function logic 110 may be configured to implement a periodic activation function, such as sinusoidal representation networks (SIREN).
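  • As a non-limiting sketch of such a high dimension embedder, the 4D coordinate may be lifted with sines and cosines at several frequencies before being fed to the MLP; the function name embed_4d and the frequency scheme below are assumptions (a SIREN-style network would instead use sine activations inside the MLP itself):

        import torch

        def embed_4d(x, num_freqs=10):
            """High dimension embedding of 4D ray coordinates.

            x: (N, 4) tensor of (u, v, s, t). Each coordinate is lifted with sin/cos
            at exponentially spaced frequencies so that the MLP can recover
            high-frequency scene content; the number and spacing of frequencies are
            illustrative choices.
            """
            freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)
            angles = x[..., None] * freqs                                 # (N, 4, num_freqs)
            emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
            return torch.cat([x, emb.reshape(x.shape[0], -1)], dim=-1)    # (N, 4 + 8*num_freqs)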
  • the mapping function f Θ may comprise one or more trainable parameters.
  • Training logic 105 may be configured to train the one or more trainable parameters Θ of the mapping function.
  • the training logic 105 may be configured to select a set of training rays from the set of training images.
  • the training logic 105 may be configured to select one or more rays at random from the set of training images.
  • one or more rays of the set of training images may be selected according to a known arrangement.
  • some or all of the one or more rays may be selected from the same image of the set of training images.
  • some or all of the rays may be selected from different images of the set of training images.
  • the training logic 105 may further provide each training ray of the set of training rays to the mapping function logic 110.
  • the mapping function logic 110 may evaluate the mapping function for each respective training ray to determine a respective color value. This may be referred to as the predicted color value of the training ray.
  • the training logic 105 may further determine a respective actual color value of each training ray from the training image itself. This may be referred to as the ground-truth color value. For each iteration of training, the training logic 105 may be configured to compare each predicted color value to its respective ground-truth color value.
  • the training logic 105 may be configured to determine one or more of a photometric loss (L p ), ray bundle loss (L r ), and/or Fourier sparsity loss (L s ) of the predicted color value.
  • a loss function (L) may be evaluated for the predicted color value.
  • the loss function, in some examples, may be a weighted sum of one or more of L p , L r , and L s .
  • each of L p , L r , and L s may represent an error for the predicted color value.
  • a respective prediction error for each of the one or more training rays may be provided as feedback to be minimized, reduced, or otherwise optimized by the neural network (e.g., MLP).
  • the prediction error may be optimized by adjustment of the one or more trainable parameters of the mapping function based on the prediction error.
  • the one or more trainable parameters Θ may be trained by minimizing for photometric loss, L p .
  • photometric loss may be given by the following formula: L_p = \sum_{k=1}^{M} \sum_{i=1}^{N_k} \| f_{\Theta}(R_i^k) - c_i^k \|_2^2 (Eq. 4), where M is the total number of images, N_k is the total number of pixels in the k-th image, and c_i^k is the ground truth color value of the i-th pixel of the k-th image.
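  • A minimal sketch of one such training step, assuming the embed_4d helper above, an arbitrary mapping-function model, and batch averaging (illustrative choices, not taken from the disclosure), is:

        import torch

        def photometric_loss(model, embed, coords, colors, batch_size=32768):
            """One training batch worth of photometric loss L_p.

            coords: (N, 4) 4D ray coordinates collected from the training images
            colors: (N, 3) ground-truth RGB values for those rays
            A random subset of rays is passed through the mapping function and the
            squared Euclidean distance to the ground truth is averaged.
            """
            idx = torch.randint(0, coords.shape[0], (batch_size,))
            pred = model(embed(coords[idx]))                        # predicted (R, G, B)
            return ((pred - colors[idx]) ** 2).sum(dim=-1).mean()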
  • the training logic 105 may be configured to utilize Fourier sparsity loss L s (e.g., sparsity loss in the Fourier domain) of light fields to regularize learning. Specifically, an assumption is made that for a given scene, nearby images and/or views will have a similar frequency distribution in the frequency domain. In other words, when the network synthesizes a novel view, an assumption is made that the rendered image will have a similar frequency distribution as the input data (e.g., one or more training images).
  • training the mapping function may further include minimizing for Fourier sparsity loss.
  • the frequency distribution may be obtained by applying a Fourier transform to the image.
  • the frequency distributions of the set of random novel viewpoints may be compared to the frequency distribution of the set of training images.
  • the frequency distributions of the random set of novel viewpoints may be compared against the frequency distributions of all training images.
  • the frequency distributions of the random set of novel viewpoints may be compared against a subset of the set of training images.
  • comparison to frequency distributions may be restricted, for example, based on proximity (e.g., restricting comparison to neighboring views / training images).
  • Fourier sparsity loss L s may be formulated as follows: L_s = \sum_{i=1}^{J} \sum_{j=1}^{M} \| \mathcal{F}(\hat{I}_i) - \mathcal{F}(I_j) \|_2^2 (Eq. 5), where J is the total number of sampled novel viewpoints, M is the total number of images in the set of training images (or alternatively a subset of the training images), \mathcal{F} denotes the Fourier transform, and \hat{I}_i is the rendered image of the i-th novel viewpoint.
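  • A minimal sketch of computing L s in this manner follows; the use of the magnitude spectrum and the mean-square-error normalization are assumptions consistent with the comparison of frequency distributions described below with respect to Fig. 2:

        import torch

        def fourier_sparsity_loss(rendered, training_images):
            """Fourier sparsity loss L_s between a rendered view and training views.

            rendered:        (H, W, 3) image synthesized at a random viewpoint
            training_images: (M, H, W, 3) neighboring (or all) training images
            The frequency distributions are compared with a mean squared error; the
            magnitude spectrum is used here so the comparison ignores phase.
            """
            f_pred = torch.fft.fft2(rendered.permute(2, 0, 1)).abs()          # (3, H, W)
            f_gt = torch.fft.fft2(training_images.permute(0, 3, 1, 2)).abs()  # (M, 3, H, W)
            return ((f_pred[None] - f_gt) ** 2).mean()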
  • the training logic 105 may further be configured to train the mapping function (e.g., MLP) to minimize, reduce, or otherwise optimize L s .
  • the MLP will further learn information about color changes (i.e., the first-order gradient of colors) in 4D space based on the spatial frequency components of the training images.
  • In some embodiments, the training logic 105 may be configured to address aliasing. In light field rendering, aliasing is a common issue. To reduce aliasing, the network (e.g., the MLP) implementing the mapping function may be trained to minimize a ray bundle loss (L r ) (Eq. 6), which penalizes, via a Euclidean distance, differences between the color values predicted for a bundle of rays cast from a common origin, where <R, R i > denotes the angle between a base ray R and a cast ray R i , and θ is the weighting parameter.
  • Experimental data has shown that a θ of 1.5° was effective for most scenes tested. This value is correlated to the size of a pixel in the imaging setting, and thus, in some examples, may be determined based on the size of the pixel in a particular image. For example, when θ is selected for a specific image, a θ that is too small will produce aliasing artifacts, whereas a θ that is too large will produce images that are over-smoothed.
  • grid search tuning, manual tuning, or a combination of grid search tuning and manual tuning may be used to find suitable weighting parameters for a specific scene.
  • a set of random rays from the training images may be selected and passed through the MLP during training.
  • the mapping function may thus be evaluated for each respective ray of the set of random rays to determine a color value.
  • the color value may be used by the training logic 105 and/or mapping function logic 110 to calculate L p and back-propagate the error.
  • the training logic 105 and/or mapping function logic 110 may be configured to generate a camera with a random position within the convex hull of all cameras of the training images.
  • the viewing direction of the randomly generated camera may have a random offset from the z-axis (i.e., the viewing direction may not be perpendicular to the two light slabs), where the z-axis is perpendicular to the two light slabs.
  • the range of this offset is estimated by arctan(d/h), where d is the smallest distance from the selected camera to the convex hull of the training cameras, and h is the distance from the selected camera to the st-plane. This ensures the selected camera will be pointed at the scene of interest.
  • the training logic 105 and/or mapping function logic 110 may sample a first set (or batch) of camera rays from the training set randomly and generate a second set (or batch) of rays using the same rules as the camera generation for L s .
  • a T number of rays may be cast from the origin of R at various angles, where the angle between R and each cast ray is sampled from N(0, θ^2), and where the negative angles are skipped.
  • the angles between R and the cast rays may be distributed evenly, while in other embodiments, the angles may be randomly determined.
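  • The ray bundle construction and the resulting loss might be sketched as follows; the orthonormal-frame construction, the |N(0, θ^2)| angle sampling, and the helper names model, embed, and ray_to_4d are illustrative assumptions, and the disclosed Eq. 6 may weight or aggregate the color differences differently:

        import math
        import torch
        import torch.nn.functional as F

        def ray_bundle_loss(model, embed, ray_to_4d, origins, dirs, theta_deg=1.5, T=8):
            """Ray bundle loss L_r for a batch of base rays.

            origins, dirs: (B, 3) ray origins and unit directions (assumed not parallel
            to the world up vector). For each base ray, T extra rays are cast from the
            same origin with an angular offset drawn from |N(0, theta^2)| (negative
            angles skipped) and a random azimuth, and the squared color difference to
            the base ray's prediction is penalized.
            """
            B = dirs.shape[0]
            theta = math.radians(theta_deg)
            # Orthonormal frame (side, up2, dirs) around each base direction.
            up = torch.tensor([0.0, 1.0, 0.0], dtype=dirs.dtype, device=dirs.device).expand_as(dirs)
            side = F.normalize(torch.cross(dirs, up, dim=-1), dim=-1)
            up2 = torch.cross(side, dirs, dim=-1)
            ang = (torch.randn(B, T, device=dirs.device) * theta).abs()
            azi = torch.rand(B, T, device=dirs.device) * 2 * math.pi
            bundle = (dirs[:, None] * torch.cos(ang)[..., None]
                      + (side[:, None] * torch.cos(azi)[..., None]
                         + up2[:, None] * torch.sin(azi)[..., None]) * torch.sin(ang)[..., None])
            base_rgb = model(embed(ray_to_4d(origins, dirs)))                    # (B, 3)
            bund_rgb = model(embed(ray_to_4d(origins.repeat_interleave(T, 0),
                                             bundle.reshape(-1, 3))))            # (B*T, 3)
            return ((bund_rgb.reshape(B, T, 3) - base_rgb[:, None]) ** 2).mean()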
  • the embedded input vector (v) may be passed through 6 fully-connected rectified linear unit (ReLU) layers, each of which has 256 channels, with the final layer including a sigmoid activation to output the predicted color.
  • the ray batch size in each iteration may be set to 32768 for both L p and L r .
  • the MLP may be trained for 200,000 iterations using an Adam optimizer; the initial learning rate is 1x10^-3 and is reduced by half every 20,000 iterations.
  • Experimental data has shown that training the neural light field (NeuLF) on a scene with 25 input images with a resolution of 1536 x 1280 took approximately 8 hours using four commercially available Nvidia ® V100 cards.
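  • For illustration only, an MLP matching the configuration described above (six fully-connected ReLU layers of 256 channels, a final sigmoid layer, and Adam with the learning rate halved every 20,000 iterations) might be sketched as follows; the class name NeuLFMLP and the input dimension, which assumes the embed_4d sketch above, are assumptions:

        import torch
        import torch.nn as nn

        class NeuLFMLP(nn.Module):
            """MLP mapping an embedded 4D ray coordinate to an RGB color."""

            def __init__(self, in_dim, hidden=256, depth=6):
                super().__init__()
                layers, d = [], in_dim
                for _ in range(depth):
                    layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
                    d = hidden
                layers += [nn.Linear(d, 3), nn.Sigmoid()]   # predicted (R, G, B)
                self.net = nn.Sequential(*layers)

            def forward(self, x):
                return self.net(x)

        model = NeuLFMLP(in_dim=4 + 8 * 10)   # input size matches the embed_4d sketch above
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        # Halve the learning rate every 20,000 iterations, as described above.
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20000, gamma=0.5)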
  • a novel view may be rendered by the view synthesizer logic 115.
  • the view synthesizer logic 115 may include, without limitation, hardware, software, or both hardware and software, and similarly include both physical and virtual machine instances.
  • rendering the novel view may include evaluating the learned mapping function f Θ for each ray of the viewpoint V. For example, in some embodiments, rendering the viewpoint V may include determining a desired camera pose (e.g., position and/or orientation of a camera for an image), and a desired rendering resolution {W v , H v }.
  • each color value of a given ray may thus be mapped back to a pixel position of the rendered image of the novel view.
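  • A minimal rendering sketch under these assumptions (a pinhole camera model with an assumed focal length, chunked evaluation to bound memory, and the helper names from the earlier sketches) is:

        import torch

        @torch.no_grad()
        def render_view(model, embed, ray_to_4d, cam_origin, cam_rot, W, H, focal, chunk=32768):
            """Render a novel view of resolution (W, H) from a desired camera pose.

            cam_origin: (3,) camera center; cam_rot: (3, 3) camera-to-world rotation.
            One ray is cast through every pixel, parameterized as (u, v, s, t), and
            the mapping function is queried for its (R, G, B) value.
            """
            j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                                  torch.arange(W, dtype=torch.float32), indexing="ij")
            # Pinhole camera assumption; axis conventions are illustrative only.
            dirs_cam = torch.stack([(i - W / 2) / focal, -(j - H / 2) / focal,
                                    torch.ones_like(i)], dim=-1)
            dirs = torch.nn.functional.normalize(dirs_cam.reshape(-1, 3) @ cam_rot.T, dim=-1)
            origins = cam_origin.expand_as(dirs)
            pixels = []
            for k in range(0, dirs.shape[0], chunk):     # chunked to bound peak memory
                coords = ray_to_4d(origins[k:k + chunk], dirs[k:k + chunk])
                pixels.append(model(embed(coords)))
            return torch.cat(pixels).reshape(H, W, 3)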
  • Fig. 2 is a functional block diagram of a system 200 for training a mapping function using neural light field, in accordance with various embodiments.
  • the system 200 may include a set of training images 205, a first camera 210, a second camera 215, a bundle of rays 220, a scene (or object) 225, a first plane 230, a second plane 235, a multilayer perceptron (MLP) 240, a rendered image 245, a Fourier image 250, ground truth Fourier images 255, and predicted color values 260.
  • Fig. 2 provides an overview of the NeuLF process for novel view synthesis.
  • training data from a set of training images 205 of a scene 225 may be provided to the MLP 240.
  • the MLP 240 may then be configured such that each ray of the set of training images is related to a color value based on the ground-truth color values (e.g., the training images themselves).
  • the MLP 240 may thus perform a mapping function configured to map a ray to its respective color value based on the set of training images 205.
  • a ray may be parameterized into a 4D set of coordinates, based on points at which the ray intersects one or more parameterization surfaces.
  • a two-plane model may be utilized.
  • a first plane 230 (uv plane), and a second plane 235 ( st plane) may be used to characterize each ray based on a point at which a given ray intersects the uv plane and st plane.
  • a color value may be represented as an RGB value.
  • Each ray may be parameterized as a 4D coordinate (u, v, s, t ), where the coordinates (u, v) indicate where the ray intersects the first plane 230, and ( s , t ) indicates where the ray intersects the second plane 235.
  • one or more trainable parameters of the mapping function may be trained.
  • the one or more trainable parameters may include, for example, one or more parameters of the "hidden" layers of the MLP 240.
  • a ray from the set of training images may be passed to the MLP 240, and the mapping function evaluated (e.g., passed through the MLP 240) to determine a predicted color value for the ray.
  • the 4D coordinates of a known ray (u, v, s, t ) of an image of the set of training images 205 may be passed to the MLP 240, which may then evaluate the mapping function (e.g., pass the 4D coordinate through the neural network), to determine a predicted color value corresponding to the input ray.
  • a difference between the predicted color value and ground truth color value may be determined.
  • the difference may be a squared Euclidean norm (squared l2 norm) of the difference between the predicted RGB value and ground truth RGB value for the input ray.
  • the difference may be fed back, and one or more trainable parameters of the mapping function (e.g., layers of the MLP 240) may be adjusted to minimize the difference.
  • the l 2 norm may be expressed as set forth above with respect to Eq. 4.
  • other forms of Euclidean distance may be utilized to characterize the difference between predicted color value and ground-truth color values.
  • the MLP 240 may be trained to minimize for Fourier sparsity loss, L s .
  • the one or more trainable parameters of mapping function may be trained by minimizing for L s .
  • the frequency distribution of a rendered view may be compared to a frequency distribution of a neighboring view (e.g., a training image of the set of training images) to enforce a similarity to the neighboring view.
  • a random viewpoint may be rendered.
  • rendering an image (e.g., a predicted image) of a random viewpoint may include determining a set of rays corresponding to the random viewpoint.
  • Each of the rays of the random viewpoint may be evaluated by the MLP 240 to generate a rendered image 245 of the random viewpoint.
  • a Fourier transform F may be performed on the rendered image to produce a frequency distribution 250 of the rendered image 245.
  • the frequency distribution 250 of the rendered image 245 may be compared against one or more ground truth Fourier images 255, which provide frequency distributions of one or more images of the set of training images 205.
  • the one or more ground truth Fourier images 255 may include a Fourier transformed image from the set of training images 205.
  • the images used for ground truth Fourier images 255 may include one or more images of the set of training images 205 with camera positions nearest to the camera of the random viewpoint.
  • Fourier sparsity loss may, thus, in some embodiments, be a mean square error (MSE) between the frequency distribution 250 of the rendered image 245 and the one or more ground truth Fourier images 255.
  • the MSE may be formulated as set forth above with respect to Eq. 5.
  • the MSE may further include a root mean square error (RMSE), or other Euclidean distance, as appropriate.
  • the MLP 240 may be trained to optimize for ray bundle loss, L r , thereby training the one or more trainable parameters of the mapping function (e.g., one or more layers of the MLP 240). Specifically, the one or more trainable parameters of the mapping function may be trained by minimizing for L r . The color value of rays in-between neighboring rays may not transition smoothly, which can lead to aliasing in an image.
  • To determine L r , the predicted color values of a bundle of rays projected from a first ray may be compared for smoothness. It is assumed that the closer in angle a projected ray is to the original ray, the closer the predicted color values will be.
  • L r may be formulated as set forth above with respect to Eq. 6. Thus, by minimizing L r , smoothness may be ensured between neighboring rays, and aliasing effects reduced.
  • the MLP 240 may be trained on one or more of the above photometric loss, Fourier sparsity loss, and ray bundle loss, or any combination thereof.
  • the MLP 240 may be trained on an overall loss function, as set forth above with respect to Eq. 7.
  • each of the photometric loss, Fourier sparsity loss, and ray bundle loss may be weighted differently in the overall loss function, as appropriate.
  • the weights may be manually and/or automatically tuned based on experimental results.
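  • A sketch of such a weighted overall loss follows; the weights w_s and w_r are illustrative placeholders to be tuned per scene, and the disclosed Eq. 7 may combine the terms differently:

        def total_loss(l_p, l_s, l_r, w_s=0.1, w_r=0.1):
            """Overall loss L as a weighted sum of the three loss terms."""
            return l_p + w_s * l_s + w_r * l_r

        # Typical use inside a training iteration:
        #   loss = total_loss(photometric_loss(...), fourier_sparsity_loss(...),
        #                     ray_bundle_loss(...))
        #   optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()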
  • Fig. 3 is a schematic diagram of a two-plane model 300 for 4D light field parameterization of rays, in accordance with various embodiments.
  • the 4D light field parameterization two-plane model 300 includes camera 305, image plane 310, first plane 315, and second plane 320. It should be noted that while the various parts of the two-plane model 300 for 4D light field parameterization are schematically illustrated in Fig. 3, modifications to the arrangement may be possible in accordance with the various embodiments.
  • the first plane 315 and second plane 320 may also be referred to as the uv plane and st plane, respectively, parameterization planes, and/or parameterization light slabs.
  • rays parameterized using the two-plane model may be represented as a 4D set of coordinates.
  • a first ray R 1 and second ray R 2 may originate from the camera 305, which may be a center of projection for all rays passing through an image plane 310.
  • a given ray may be represented by a 4D set of coordinates.
  • the 4D set of coordinates may comprise two sets of coordinates, a first set of coordinates (u, v ) indicating a position on the first plane 315 which the ray intersects, and a second set of coordinates ( s , t ) indicating a position of the second plane 320 through which the ray passes (intersects).
  • all rays from an object to one side of the two planes 315, 320 may be uniquely determined by a 4D coordinate.
  • the first ray R 1 may thus be represented by 4D coordinate (u 1 , v 1 , s 1 , t 1 ), where ray R 1 passes through the first plane 315 at coordinate (u 1 , v 1 ), and the second plane at coordinate (s 1 , t 1 ).
  • the second ray R 2 may thus be represented by 4D coordinate (u 2 , v 2 , s 2 , t 2 ).
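  • A non-limiting sketch of this two-plane parameterization, placing the uv plane and st plane at assumed positions orthogonal to the z-axis consistent with the arrangement described above, is:

        import torch

        def ray_to_4d(origins, dirs, z_uv=0.0, z_st=1.0):
            """Parameterize rays as (u, v, s, t) using a two-plane model.

            origins, dirs: (N, 3) ray origins and directions. The uv plane is placed
            at z = z_uv and the st plane at z = z_st, both orthogonal to the z-axis,
            matching the parallel-plane arrangement described above; the plane
            positions are illustrative.
            """
            t_uv = (z_uv - origins[:, 2]) / dirs[:, 2]          # distance along ray to uv plane
            t_st = (z_st - origins[:, 2]) / dirs[:, 2]          # distance along ray to st plane
            uv = origins[:, :2] + t_uv[:, None] * dirs[:, :2]   # (u, v) intersection
            st = origins[:, :2] + t_st[:, None] * dirs[:, :2]   # (s, t) intersection
            return torch.cat([uv, st], dim=-1)                  # (N, 4)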
  • one or more parameterization surfaces may be utilized.
  • a first parameterization surface and a second parameterization surface may be used to parametrize the ray.
  • the first set of coordinates (u, v ) may indicate a point on the first parameterization surface through which the ray intersects the first parameterization surface
  • the second set of coordinates (s, t ) may indicate a point on the second parameterization surface through which the ray intersects the second parameterization surface.
  • parameterization surfaces may include planar and non-planar surfaces.
  • the parameterization surfaces may be non-planar surfaces, which may include, without limitation, spherical surfaces, cylindrical surfaces, polygonal surfaces, or irregularly shaped surfaces.
  • the parameterization surfaces may or may not intersect.
  • rendering a novel view may include querying all rays originating from the camera 305 to every pixel on the camera's 305 image plane 310.
  • 4D light field representation significantly improves computational performance and eliminates redundancies found in 5D radiance field representation.
  • novel viewpoints may lie outside of the convex hull of a scene and/or object.
  • 5D radiance field representation utilizes complex ray marching to learn the color of a ray at all points in free space of a scene, including points inside the convex hull of a scene / object.
  • Fig. 4 is a schematic diagram of a camera array 400 for generating a training image set for training of a mapping function, in accordance with various embodiments.
  • the camera array 400 may include a set of training images 405 and scene 410, and first training image 415a, second training image 415b, and third training image 415c.
  • the camera array 400 may include cameras positioned on a camera plane parallel to the parameterization light slabs (e.g., first and second plane of the two-plane model) as described above with respect to Fig. 3.
  • an image plane 405 may be formed that is also parallel to the parameterization planes (e.g., all planes are orthogonal to a z-axis).
  • the set of training images 405 depict viewpoints along the image plane, and are schematically illustrated positioned in space on the image plane.
  • Fig. 5 is a flow diagram of a method 500 for novel view synthesis using neural light field, in accordance with various embodiments.
  • the method 500 begins, at block 505, by obtaining a set of training images.
  • a set of training images may be captured, generated, and/or received by a system for novel view synthesis using neural light fields.
  • the set of training images may be captured by a camera array, and include images of a scene or object.
  • the camera array may include a plurality of cameras. In some examples, the plurality of cameras may be arranged on a camera plane.
  • the system, or subcomponent of the system may be configured to extract training data from the set of training images. Extracting training data may include, for example, determining all rays of each of the training images of the set of training images, and the corresponding color values for each of the respective rays.
  • the method 500 may continue, at block 510, by generating a mapping function.
  • the training data obtained from the set of training images may be provided to a neural network, such as an MLP, to generate a mapping function.
  • the MLP itself may act as the mapping function, in which one or more layers of the MLP are trainable parameters of the mapping function.
  • generating the mapping function may further include learning of one or more trainable parameters.
  • the method 500 further includes, at block 515, learning one or more parameters (e.g., trainable parameters) of the mapping function.
  • the MLP may learn the one or more parameters of the mapping function based on the training data.
  • learning the one or more parameters may further include minimizing, reducing, or otherwise optimizing error in the mapping function. Optimization may, for example, include bringing error (including photometric loss, Fourier sparsity loss, and ray bundle loss) within an acceptable range or threshold.
  • minimizing error may include, for example, at block 520, minimizing for photometric loss.
  • minimizing photometric loss may include first determining a color value for one or more rays of the set of training images, and predicting a color value for the one or more rays via a neural network, such as the MLP. To determine photometric loss ( L p ), a difference between the predicted color value and ground truth color value may then be determined.
  • the MLP may predict a color value for a ray from the set of training images, and a difference determined between the predicted color value and the actual color value of the corresponding pixel in the training image of the set of training images.
  • taking a difference between the two values may include a squared Euclidean norm (squared l2 norm) of the difference between the predicted RGB value and ground truth RGB value for the input ray.
  • the l2 norm may be expressed as set forth above with respect to Eq. 4.
  • the photometric loss may be fed back to the mapping function and one or more parameters of the mapping function (e.g., layers of the MLP) may be adjusted to minimize photometric loss, and thus trained / learned.
  • minimizing error may further include, at block 525, minimizing Fourier sparsity loss.
  • the neural network such as an MLP, may be trained to minimize for Fourier sparsity loss, L s .
  • the frequency distribution of a rendered view may be compared to a frequency distribution of a neighboring view (e.g., a training image of the set of training images) to enforce a similarity to the neighboring view.
  • a random viewpoint may be rendered. Rendering an image (e.g., a predicted image) of a random viewpoint may include determining a set of rays corresponding to the random viewpoint.
  • Each of the rays of the random viewpoint may be evaluated by a neural network, such as an MLP, to generate a rendered image of the random viewpoint.
  • a Fourier transform F may be performed on the rendered image to produce a frequency distribution of the rendered image.
  • the frequency distribution of the rendered image may be compared against the frequency distribution of one or more ground truth images (e.g., one or more images of the set of training images).
  • the frequency distribution of one or more ground truth images may include one or more Fourier transformed images from the set of training images.
  • Fourier sparsity loss may, thus, in some embodiments, be a mean square error (MSE) between the frequency distribution of the rendered image and the one or more ground truth images.
  • the MSE may be formulated as set forth above with respect to Eq. 5.
  • the MSE may further include a root mean square error (RMSE), or other Euclidean distance, as appropriate.
  • minimizing error may further include, at block 530, minimizing ray bundle loss.
  • the color value of rays in-between neighboring rays may not transition smoothly. This can lead to aliasing in an image.
  • the mapping function may be trained to optimize for ray bundle loss, L r .
  • the one or more trainable parameters of the mapping function may be trained by minimizing for L r and determining the proper parameters for calculating ray bundle loss.
  • L r may be formulated as set forth in Eq. 6.
  • θ may be selected for a specific image, where a θ that is too small will produce aliasing artifacts, whereas a θ that is too large will produce images that are over-smoothed.
  • To determine L r , the predicted color values of a bundle of rays projected from a first ray may be compared for smoothness. It is assumed that the closer in angle a projected ray is to the original ray, the closer the predicted color values will be.
  • L r may be formulated as set forth above with respect to Eq. 6. Thus, by minimizing L r , smoothness may be ensured between neighboring rays, and aliasing effects reduced.
  • the method 500 further includes, at block 535, determining a set of rays of a novel view.
  • a novel view may be rendered by first determining all rays of the novel view.
  • rendering the viewpoint may include determining a desired camera pose (e.g., position and/or orientation of a camera for an image), and a desired rendering resolution {W v , H v }.
  • the set of rays of the novel view may comprise W v x H v number of rays, where W v x H v is the desired resolution of the rendered image of the novel view.
  • the view synthesis logic 115 may further be configured to determine the 4D coordinates for each ray of the novel view.
  • the method 500 continues, at block 540, by evaluating the mapping function for each ray of the novel view. By evaluating the learned mapping function f Θ for each ray of the viewpoint, the novel view may be rendered.
  • the rendering process may thus be formulated as N v evaluations of the mapping function f Θ, as set forth above with respect to Eq. 8.
  • the method 500 comprises rendering the novel view.
  • the color values of the rays may be mapped back to pixels / respective pixel positions of the rendered image.
  • the novel view may be rendered directly based on the color values of each ray, with each ray corresponding to a respective pixel of the rendered novel view.
  • Fig. 6 is a schematic block diagram of a computer system 600 for novel view synthesis using neural light field, in accordance with various embodiments.
  • Fig. 6 provides a schematic illustration of one embodiment of a computer system 600, such as the system for novel view synthesis 100, or subsystems thereof, such as the training logic, mapping function logic, MLP, neural network, view synthesis logic, or combinations thereof, which may perform the methods provided by various other embodiments, as described herein.
  • Fig. 6 only provides a generalized illustration of various components, of which one or more of each may be utilized as appropriate.
  • Fig. 6, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
  • the computer system 600 includes multiple hardware elements that may be electrically coupled via a bus 605 (or may otherwise be in communication, as appropriate).
  • the hardware elements may include one or more processors 610, including, without limitation, one or more general-purpose processors and/or one or more special- purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and microcontrollers); one or more input devices 615, which include, without limitation, a mouse, a keyboard, one or more sensors, and/or the like; and one or more output devices 620, which can include, without limitation, a display device, and/or the like.
  • processors 610 including, without limitation, one or more general-purpose processors and/or one or more special- purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and microcontrollers); one or more input devices 615, which include, without limitation, a mouse, a keyboard, one or more sensors, and/or the like; and one or more output devices 620, which can include,
  • the computer system 600 may further include (and/or be in communication with) one or more storage devices 625, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random-access memory (“RAM”) and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like.
  • RAM random-access memory
  • ROM read-only memory
  • Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
  • the computer system 600 might also include a communications subsystem 630.
  • the communications subsystem 630 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, between data centers or different cloud platforms, and/or with any other devices described herein.
  • the computer system 600 further comprises a working memory 635, which can include a RAM or ROM device, as described above.
  • the computer system 600 also may comprise software elements, shown as being currently located within the working memory 635, including an operating system 640, device drivers, executable libraries, and/or other code, such as one or more application programs 645, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein.
  • code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
  • a set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 625 described above.
  • the storage medium might be incorporated within a computer system, such as the system 600.
  • the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon.
  • These instructions might take the form of executable code, which is executable by the computer system 600 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 600 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
  • some embodiments may employ a computer or hardware system (such as the computer system 600) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer system 600 in response to processor 610 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 640 and/or other code, such as an application program 645) contained in the working memory 635. Such instructions may be read into the working memory 635 from another computer readable medium, such as one or more of the storage device(s) 625. Merely by way of example, execution of the sequences of instructions contained in the working memory 635 might cause the processor(s) 610 to perform one or more procedures of the methods described herein.
  • the terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in a specific fashion.
  • various computer readable media might be involved in providing instructions/code to processor(s) 610 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals).
  • a computer readable medium is a non-transitory, physical, and/or tangible storage medium.
  • a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like.
  • Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 625.
  • Volatile media includes, without limitation, dynamic memory, such as the working memory 635.
  • a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 605, as well as the various components of the communication subsystem 630 (and/or the media by which the communications subsystem 630 provides communication with other devices).
  • transmission media can also take the form of waves (including, without limitation, radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications).
  • Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
  • Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 610 for execution.
  • the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer.
  • a remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer system 600.
  • These signals which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
  • the communications subsystem 630 (and/or components thereof) generally receives the signals, and the bus 605 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 635, from which the processor(s) 610 retrieves and executes the instructions.
  • the instructions received by the working memory 635 may optionally be stored on a storage device 625 either before or after execution by the processor(s) 610.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Image Processing (AREA)

Abstract

Novel tools and techniques are provided for novel view synthesis using neural light field. A system includes a neural network configured to relate a ray of an image to a respective color value, and a novel view synthesizer subsystem configured to receive a set of training images, generate a mapping function, and learn one or more parameters of the mapping function based on training data captured from the set of training images. The system further renders a novel view by evaluating the mapping function for each ray R1 through RN and generating a respective color value based on the mapping function.

Description

NOVEL VIEW SYNTHESIS WITH NEURAL LIGHT FIELD
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application Ser.
No. 63/170,506, filed April 4, 2021 by Celong Liu et al., entitled “PRACTICAL NOVEL VIEW SYNTHESIS WITH NEURAL LIGHT FIELD,” and U.S. Provisional Patent Application Ser. No. 63/188,832, filed May 14, 2021 by Celong Liu et al., entitled “PRACTICAL NOVEL VIEW SYNTHESIS WITH NEURAL LIGHT FIELD,” the disclosures of which are incorporated herein by reference in their entirety for all purposes.
COPYRIGHT STATEMENT
[0002] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
FIELD
[0003] The present disclosure relates, in general, to methods, systems, and apparatuses for machine learning solutions for novel view synthesis.
BACKGROUND
[0004] Novel view synthesis is increasingly being adopted for augmented reality
(AR) and virtual reality (VR) applications. Traditional computer vision approaches to novel view synthesis have relied on explicitly building a geometric model of a scene or object. Alternatively, image-based rendering has been employed where no geometric modeling is needed. However, traditional light field rendering relies on a dense sampling of views (e.g., images) of a scene or object.
[0005] More recently, volume-based implicit representation techniques have been developed, in which a neural network is used to render scene and object representations. However, volume-based implicit representations are computationally intensive, typically require ground-truth depth for training, and have large data storage requirements. This has thus far limited the use of these techniques for real-time applications.
[0006] Therefore, methods, systems, and apparatuses for improving the performance of novel view synthesis are provided.
SUMMARY
[0007] Novel tools and techniques for novel view synthesis with neural light field are provided.
[0008] A method may include obtaining a set of training images, the set of training images comprising two or more training images, generating, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learning one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value. The method may further include rendering a novel view, wherein the novel view is a synthesized image with N number of pixels. Rendering the novel view may further include querying the mapping function for each ray R1 through RN, projecting from a center of projection to every pixel of the synthesized image, and obtaining, via the mapping function, a respective color value (R, G, B) for each ray R1 through RN.
[0009] An apparatus may include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. For example, the set of instructions may be executed by the processor to perform various functions. Thus, the instructions may be executed by the processor to receive a set of training images, the set of training images comprising two or more training images, generate, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value. The instructions may further be executed by the processor to render a novel view, wherein the novel view is a synthesized image with N number of pixels. Rendering the novel view may further include evaluating the mapping function for each ray R1 through RN, projecting from a center of projection to every pixel of the synthesized image; and generating for each ray R1 through RN a respective color value (R, G, B) based on the mapping function.
[0010] A system may include a neural network configured to relate a ray of an image to a respective color value, and a novel view synthesizer subsystem. The novel view synthesizer subsystem may further include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. For example, the set of instructions may be executed by the processor to receive a set of training images, the set of training images comprising two or more training images; generate, via the neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value. The instructions may further be executed by the processor to render a novel view, wherein the novel view is a synthesized image with N number of pixels. Rendering the novel view may further include evaluating the mapping function for each ray R1 through RN, projecting from a center of projection to every pixel of the synthesized image, and generating for each ray R1 through RN a respective color value (R, G, B) based on the mapping function.
[0011] These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided therein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sub-label is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components.
[0013] Fig. 1 is a schematic diagram of a system for novel view synthesis using neural light field, in accordance with various embodiments;
[0014] Fig. 2 is a functional block diagram of a system for training a mapping function using neural light field, in accordance with various embodiments;
[0015] Fig. 3 is a schematic diagram of 4D light field parameterization of rays, in accordance with various embodiments;
[0016] Fig. 4 is a schematic diagram of a camera array for training a mapping function, in accordance with various embodiments;
[0017] Fig. 5 is a flow diagram of a method for novel view synthesis using neural light field, in accordance with various embodiments;
[0018] Fig. 6 is a schematic block diagram of a computer system for novel view synthesis using neural light field, in accordance with various embodiments.
DETAILED DESCRIPTION OF EMBODIMENTS
[0019] Various embodiments provide tools and techniques for novel view synthesis using neural light field. In some embodiments, a method for novel view synthesis using neural light field is provided. The method may include obtaining a set of training images, the set of training images comprising two or more training images, generating, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learning one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value. The method may further include rendering a novel view, wherein the novel view is a synthesized image with N number of pixels. Rendering the novel view may further include querying the mapping function for each ray R1 through RN, projecting from a center of projection to every pixel of the synthesized image, and obtaining, via the mapping function, a respective color value (R, G, B) for each ray R1 through RN.
[0020] In some embodiments, each ray R1 through RN may be parameterized as a
4-dimensional (4D) coordinate (u, v, s, t), wherein the 4D coordinate representation for a given ray, Ri, is (ui, vi, si, ti), where i is an integer 1-N, wherein (ui, vi) represents two-dimensional coordinates of a first parameterization surface through which Ri intersects the first parameterization surface, and wherein (si, ti) represents two-dimensional coordinates of a second parameterization surface through which Ri intersects the second parameterization surface. In some examples, the 4D coordinate (u, v, s, t) may be of a two-plane model, wherein the first parameterization surface is a first plane, wherein (ui, vi) represents two-dimensional coordinates of the first plane through which Ri intersects the first plane, and wherein the second parameterization surface is a second plane, wherein (si, ti) represents two-dimensional coordinates of the second plane through which Ri intersects the second plane. In some further examples, the first plane, the second plane, and a camera plane may be parallel to each other, wherein a camera is positioned on the camera plane.
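By way of a hedged illustration only (not part of the original disclosure), the following sketch shows how a ray might be reduced to such a 4D coordinate under the two-plane model, assuming for concreteness that the first (uv) plane sits at z = 0 and the second (st) plane at z = 1; the plane offsets, function name, and ray representation are illustrative assumptions.

```python
import numpy as np

def ray_to_4d(origin, direction, z_uv=0.0, z_st=1.0):
    """Parameterize a ray by its intersections with two parallel planes.

    origin, direction: 3-vectors; the ray is origin + t * direction.
    z_uv, z_st: assumed z-offsets of the uv plane and the st plane.
    Returns the 4D light-field coordinate (u, v, s, t).
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    # Solve origin.z + t * direction.z = plane_z for each plane.
    t_uv = (z_uv - origin[2]) / direction[2]
    t_st = (z_st - origin[2]) / direction[2]
    u, v = (origin + t_uv * direction)[:2]
    s, t = (origin + t_st * direction)[:2]
    return np.array([u, v, s, t])

# Example: a ray cast from a camera placed at z = -1 toward the scene.
print(ray_to_4d(origin=[0.0, 0.0, -1.0], direction=[0.1, 0.05, 1.0]))
```

Any pair of parallel parameterization planes would serve here; only the two intersection points are retained as the ray's coordinate.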
[0021] In further embodiments, the mapping function may be a multilayer perceptron. Training the one or more training parameters may include minimizing for photometric loss (Lp), wherein minimizing Lp includes determining a Euclidean distance between a predicted color value for a first known ray of a training image of the set of training images and a ground truth color value of the first known ray. Minimizing for Lp may further include selecting a random set of rays of the set of training images, passing the random set of rays through the mapping function to produce a set of respective predicted values, and determining the Lp for the set of respective predicted values as compared against the set of training images. The method may further include back-propagating Lp to update the one or more training parameters of the mapping function.
[0022] In some embodiments, training the one or more training parameters may include minimizing Fourier sparsity loss (Ls) of the mapping function, wherein minimizing Ls further includes determining a difference between a frequency distribution of a predicted image of a random viewpoint and a frequency distribution of one or more training images of the set of training images, wherein the difference includes a Euclidean distance. In some further embodiments, training the one or more training parameters may include minimizing ray bundle loss (Lr) of the mapping function, wherein minimizing Lr further includes determining a difference between the color values of a bundle of rays originating from a common origin, wherein the difference includes a Euclidean distance between respective color values of each ray of the bundle of rays. In yet further examples, training the one or more training parameters comprises minimizing a loss function (L), the loss function comprising a weighted sum of photometric loss, Fourier sparsity loss, and ray bundle loss.
[0023] In some embodiments, an apparatus for novel view synthesis with neural light field is provided. The apparatus may include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. For example, the set of instructions may be executed by the processor to perform various functions. Thus, the instructions may be executed by the processor to receive a set of training images, the set of training images comprising two or more training images, generate, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value. The instructions may further be executed by the processor to render a novel view, wherein the novel view is a synthesized image with N number of pixels. Rendering the novel view may further include evaluating the mapping function for each ray R1 through RN, projecting from a center of projection to every pixel of the synthesized image; and generating for each ray R1 through RN a respective color value (R, G, B) based on the mapping function.
[0024] In some embodiments, each ray R1 through RN may be parameterized as a
4-dimensional (4D) coordinate (u, v, s, t), wherein the 4D coordinate representation for a given ray, Ri, is (ui, vi, si, ti), where i is an integer 1-N, wherein (ui, vi) represents two-dimensional coordinates of a first parameterization surface through which Ri intersects the first parameterization surface, and wherein (si, ti) represents two-dimensional coordinates of a second parameterization surface through which Ri intersects the second parameterization surface.
[0025] In some examples, the mapping function may be a multilayer perceptron.
Training the one or more training parameters may include minimizing for photometric loss ( Lp ), wherein minimizing Lp includes determining a Euclidean distance between a predicted color value for a first known ray of a training image of the set of training images and a ground truth color value of the first known ray.
[0026] In further embodiments, training the one or more training parameters may include minimizing Fourier sparsity loss (Ls) of the mapping function, wherein minimizing Ls further includes determining a difference between a frequency distribution of a predicted image of a random viewpoint and a frequency distribution of one or more training images of the set of training images, wherein the difference includes a Euclidean distance. In other embodiments, training the one or more training parameters may include minimizing ray bundle loss (Lr) of the mapping function, wherein minimizing Lr further includes determining a difference between the color values of a bundle of rays originating from a common origin, wherein the difference includes a Euclidean distance between respective color values of each ray of the bundle of rays.
[0027] In yet further embodiments, a system for novel view synthesis with neural light field is provided. The system may include a neural network configured to relate a ray of an image to a respective color value, and a novel view synthesizer subsystem. The novel view synthesizer subsystem may further include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. For example, the set of instructions may be executed by the processor to receive a set of training images, the set of training images comprising two or more training images; generate, via the neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images, and learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value. The instructions may further be executed by the processor to render a novel view, wherein the novel view is a synthesized image with N number of pixels. Rendering the novel view may further include evaluating the mapping function for each ray R1 through RN, projecting from a center of projection to every pixel of the synthesized image, and generating for each ray R1 through RN a respective color value (R, G, B) based on the mapping function.
[0028] In some embodiments, learning the one or more training parameters may include minimizing for photometric loss (Lp). In some examples, the set of instructions is further executable by the processor to select a random set of rays of the set of training images; pass the random set of rays through the mapping function to produce a set of respective predicted values; determine the Lp for the set of respective predicted values as compared against the set of training images; and back-propagate Lp to update the one or more training parameters of the mapping function.
[0029] In some embodiments, learning the one or more training parameters may further include minimizing Fourier sparsity loss (Ls) of the mapping function, wherein minimizing Ls further includes determining a difference between a frequency distribution of a predicted image of a random viewpoint and a frequency distribution of one or more training images of the set of training images, wherein the difference includes a Euclidean distance. In further embodiments, learning the one or more training parameters may include minimizing ray bundle loss (Lr) of the mapping function, wherein minimizing Lr further includes determining a difference between the color values of a bundle of rays originating from a common origin, wherein the difference includes a Euclidean distance between respective color values of each ray of the bundle of rays.
[0030] In the following description, for the purposes of explanation, numerous details are set forth to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments may be practiced without some of these details. In other instances, structures and devices are shown in block diagram form. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features.
[0031] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or" unless otherwise indicated. Moreover, the use of the term "including," as well as other forms, such as "includes" and "included," should be considered non-exclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise.
[0032] The various embodiments include, without limitation, methods, systems, apparatuses, and/or software products. Merely by way of example, a method might comprise one or more procedures, any or all of which may be executed by a computer system. Correspondingly, an embodiment might provide a computer system configured with instructions to perform one or more procedures in accordance with methods provided by various other embodiments. Similarly, a computer program might comprise a set of instructions that are executable by a computer system (and/or a processor therein) to perform such operations. In many cases, such software programs are encoded on physical, tangible, and/or non-transitory computer readable media (such as, to name but a few examples, optical media, magnetic media, and/or the like).
[0033] Various embodiments described herein, embodying software products and computer-performed methods, represent tangible, concrete improvements to existing technological areas, including, without limitation, multimedia coding and signal processing. It is to be understood that the term "coding" may be used to refer to encoding, decoding, or both encoding and decoding.
[0034] In some aspects, implementations of various embodiments improve the functioning of multimedia coding systems / subsystems by enabling more efficient dependent quantization and residual coding techniques. In some examples, view synthesis using 4D neural light fields improves upon traditional light field view synthesis, which relies on large amounts of input data to improve image fidelity, which has greater storage requirements and is less memory-efficient. Further, view synthesis using 4D neural light fields provides an efficient way to synthesize novel views with high fidelity and computational efficiency. 4D neural light field view synthesis further improves computational performance and efficiency, and eliminates redundancies present in view synthesis approaches relying on 5D radiance fields.
[0035] To the extent any abstract concepts are present in the various embodiments, those concepts can be implemented as described herein by devices, software, systems, and methods that involve novel functionality (e.g., steps or operations), such as using an algorithm to determine quantization modes in a memory device and/or memory subsystem.
[0036] Fig. 1 is a schematic diagram of a system 100 for novel view synthesis using neural light field. The system 100 includes training logic 105, mapping function logic 110, and view synthesizer logic 115. It should be noted that the various components of the system 100 are schematically illustrated in Fig. 1, and that modifications to the arrangement of the system 100 may be possible in accordance with the various embodiments.
[0037] In various embodiments, the training logic 105 may be configured to obtain training data from a set of training images. As will be described in greater detail below, the training logic 105 may further be configured to train a mapping function to determine one or more trainable parameters of the mapping function. The mapping function logic 110 may thus be coupled to the training logic 105 and configured to determine a color value for a given input ray. View synthesizer logic 115 may be configured to generate a synthesized view based on the color values determined by the mapping function logic 110. In some embodiments, the functions of one or more of the training logic 105, mapping function logic 110, and view synthesizer logic 115 may be incorporated as part of a novel view synthesizer subsystem, which may, in some examples, be separate from a neural network and/or mapping function itself, or alternatively comprise all or part of the neural network.
[0038] In various embodiments, training logic 105 may be implemented as hardware, software, or both hardware and software. For example, the training logic 105 may include one or more of a dedicated integrated circuit (IC), such as an application specific IC (ASIC), field programmable gate array (FPGA), single board computer, programmable logic controller (PLC), or custom hardware. In further embodiments, training logic 105 may further be implemented on a general computer or server, and may further include a physical machine, one or more virtual machine instances, or both physical and virtual machine implementations. Features and logic implemented in software may be stored in corresponding non-transitory computer-readable media and/or storage devices. This is discussed in greater detail below with respect to Fig. 6.
[0039] In some embodiments, the training logic 105 may be configured to obtain a set of training images. The set of training images may, in turn, include one or more training images. For example, in some instances, training logic 105 may be configured to obtain a set of training images from a user. The user may thus, in some instances, provide a selected set of training images on which to generate a mapping function. In further embodiments, the training logic 105 may be configured to obtain a non-specific set of images. For example, the training logic 105 may obtain a plurality of sets of images. The images within a respective set of images may be related to each other, while the sets of images are unrelated. For example, images within a respective set may depict a common scene or object from different camera positions. In some examples the camera positions may be arranged in a known arrangement. In other examples, the camera positions may be arranged randomly or otherwise without a positioning strategy.
[0040] In some examples, the set of training images may be represented as {I1, I2, . . . , IM}, where M is the total number of images. In some embodiments, the training logic 105 may be configured to obtain training data from the set of training images. In some examples, training data may be obtained from the set of training images by determining a respective color value for each ray of each image of the set of training images. For example, each pixel of an image may be represented as a ray. The ray, in one example, may be parameterized as a 4-dimensional (4D) set of coordinates, as will be described in greater detail below with respect to Fig. 3. For each ray, a corresponding color value of the pixel may be learned. The color value, in one example, may be represented as a red, green, blue (RGB) value given by a 3-dimensional set of values (R, G, B), as known to those skilled in the art. Thus, in some embodiments, training data may comprise each ray and corresponding color value for each training image in the set of training images. For example, for each image Ik (where k = 1, 2, . . . , M), based on a known camera position, each pixel may be traversed to determine a set of respective rays $\{R^k_1, R^k_2, \ldots, R^k_{N_k}\}$, where Nk is the total number of pixels in the k-th image. In one example, based on a 4D light field representation, each ray may be represented as a 4D coordinate (u, v, s, t). Thus, all 4D coordinates for all rays for all images in the set of images may be determined by the training logic 105. In one example, the set of all rays of the set of training images may be parameterized in 4D light field representation as $\{(u^k_i, v^k_i, s^k_i, t^k_i)\}$, for k = 1, . . . , M and i = 1, . . . , Nk.
[0041] The color value for each of the rays is then known and obtained from the respective training images. From the training data, a collection of sample mappings may be obtained, $\{(u^k_i, v^k_i, s^k_i, t^k_i) \rightarrow C^k_i\}$, where $C^k_i$ is the color value of the i-th pixel of the k-th image. This training data may be passed to the mapping function logic 110 for further processing.
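As a rough illustration of how such sample mappings might be assembled in practice, the sketch below flattens a set of images into (ray, color) pairs, assuming the per-pixel (u, v, s, t) coordinates have already been computed from the known camera positions (for example, with a helper like the one sketched above); the array shapes and function name are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def build_training_pairs(images, rays_per_image):
    """Flatten a set of training images into (4D ray, RGB color) samples.

    images: list of H x W x 3 arrays with values in [0, 1].
    rays_per_image: list of H x W x 4 arrays of (u, v, s, t) coordinates,
        one per image (e.g., precomputed from the known camera poses).
    Returns (rays, colors) with shapes (sum Nk, 4) and (sum Nk, 3).
    """
    rays, colors = [], []
    for img, uvst in zip(images, rays_per_image):
        rays.append(uvst.reshape(-1, 4))     # one ray per pixel
        colors.append(img.reshape(-1, 3))    # its ground-truth RGB value
    return np.concatenate(rays, axis=0), np.concatenate(colors, axis=0)
```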
[0042] As will be described in greater detail below with respect to Fig. 5, the set of training images may be captured by one or more cameras of a camera array. As used herein, the term "camera" may refer to both physical cameras, used to capture images of a scene and/or object in the real world, and virtual cameras representing a point in space (e.g., virtual 3D space) from which a computer-generated image (e.g., a synthetic image) is captured and/or viewed.
[0043] Accordingly, in some examples, 4D parameterization of a ray may include parameterization via one or more parameterization surfaces, as will be described below with respect to Figs. 2 & 3. In some examples, a two-plane model may be utilized, in which two planes (e.g., uv plane and st plane) are parallel to each other. Thus, when an image is viewed, the rays of a training image may be sampled, or "re-sampled" depending on the notation adopted, to determine the 4D coordinate representation. In one arrangement, the set of training images may form an image plane, which may be parallel to the parameterization planes, which may further be parallel to the plane of the camera array (e.g., a viewing plane). Thus, in this configuration, the camera acts as a center of projection from which all rays are projected through a given training image (e.g., image plane), and on through each of the two parameterization planes (e.g., a first plane or uv plane, and second plane or st plane). The camera may further be positioned such that a ray passing through the center pixel of an image from the camera may be orthogonal to the image plane. This arrangement simplifies 4D parameterization of the rays. It is to be understood that in other embodiments, the camera plane (e.g., plane of the camera array), image plane, and/or parameterization plane(s) may be unstructured, in which any one or more of the above planes may be parallel and/or may not be parallel to each other. In yet further embodiments, the cameras viewing and/or capturing the training images may be positioned in an unstructured pattern relative to one another, such that any four or more of the cameras may not lie on the same plane.
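The following sketch illustrates one way the per-pixel 4D coordinates could be generated under the parallel-plane arrangement just described, with the camera as the center of projection and the central ray orthogonal to the image plane; the pinhole-camera convention, plane offsets, and focal-length parameter are assumptions made for illustration only.

```python
import numpy as np

def camera_rays_4d(cam_pos, width, height, focal, z_uv=0.0, z_st=1.0):
    """Per-pixel (u, v, s, t) coordinates for a camera looking along +z.

    cam_pos: 3D camera center (the center of projection), assumed to sit
        at z < z_uv so every ray crosses both parameterization planes.
    focal: focal length in pixels; the central ray is orthogonal to the
        image plane, matching the parallel-plane arrangement above.
    Returns an (H, W, 4) array of 4D ray coordinates.
    """
    cam_pos = np.asarray(cam_pos, dtype=float)
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    # Per-pixel ray directions in camera space (camera looks along +z).
    dirs = np.stack([(i - width / 2) / focal,
                     (j - height / 2) / focal,
                     np.ones_like(i, dtype=float)], axis=-1)
    # Intersect each ray with the uv plane and the st plane.
    t_uv = (z_uv - cam_pos[2]) / dirs[..., 2]
    t_st = (z_st - cam_pos[2]) / dirs[..., 2]
    uv = cam_pos[:2] + t_uv[..., None] * dirs[..., :2]
    st = cam_pos[:2] + t_st[..., None] * dirs[..., :2]
    return np.concatenate([uv, st], axis=-1)
```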
[0044] Like the training logic 105, in various embodiments, the mapping function logic 110 may be implemented as hardware, software, or both hardware and software, and similarly include both physical and virtual machine instances. In various embodiments, the mapping function logic 110 may include a neural network comprising one or more neural nodes. In one example, the neural network may comprise a multilayer perceptron (MLP).
[0045] In some embodiments, the mapping function logic 110 may be configured to generate a mapping function, fΘ, where Θ represents one or more trainable parameters. The mapping function logic 110 may further provide for evaluation and/or evaluate the mapping function for a given input. Accordingly, in various embodiments, the mapping function fΘ may be configured to generate a color value for a given input ray. For example, fΘ may be configured to determine a color value for a given set of 4D coordinates in a 4D light field representation of a ray. Thus, the mapping function logic 110 may be configured to generate the mapping function fΘ based on the rays of the set of training images in relation to their respective color values. In some embodiments, fΘ may be generated based on the collection of sample mappings generated by the training logic, as described above.
[0046] A typical MLP, also referred to as a "vanilla" MLP, may not be able to encode high frequency information of a scene. Thus, in some embodiments, the mapping function logic 110 may be configured to utilize high dimension embedding to recover high frequency information.
[0047] In one example, the mapping function logic 110 may implement a high dimension embedder. For example, the mapping function logic 110 may further pass a set of 4D coordinates to an embedder, and then feed the embedded vector into the network (e.g., neural network, such as the MLP). Specifically, for a 4D coordinate v = (u, v, s, t), the embedder may be configured to map the 4D coordinate (v) to a higher-dimensional embedded vector, where each element of the embedded vector is determined by embedding parameters σ and L, which are chosen for each scene based on hyper-parameter grid searching. Experimental results have found that σ = 16 and L = 256 work best for most scenes tested.
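The exact embedding formula is not reproduced here, so the following is only a plausible sketch of a Gaussian Fourier-feature embedder governed by σ and L, consistent with the σ = 16 and L = 256 values quoted above; the sine/cosine form and the way σ enters are assumptions rather than the disclosed definition.

```python
import numpy as np

class FourierEmbedder:
    """High-dimensional embedding of a 4D coordinate (assumed form).

    Maps v in R^4 to [cos(2*pi*B v), sin(2*pi*B v)], where B is an L x 4
    matrix with entries drawn from N(0, sigma^2). The embedder used in the
    disclosure may differ; sigma and L are the scene-dependent
    hyper-parameters mentioned in the text.
    """
    def __init__(self, sigma=16.0, L=256, seed=0):
        rng = np.random.default_rng(seed)
        self.B = rng.normal(0.0, sigma, size=(L, 4))

    def __call__(self, v):
        proj = 2.0 * np.pi * (np.asarray(v, dtype=float) @ self.B.T)
        return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

embed = FourierEmbedder()
print(embed(np.zeros(4)).shape)   # (512,) embedded vector for one ray
```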
[0048] Alternatively, periodic activation functions may be utilized to recover the high frequency information. In one example, the mapping function logic 110 may be configured to implement a periodic activation function, such as sinusoidal representation networks (SIREN).
[0049] As described above, the mapping function fΘ may comprise one or more trainable parameters. Training logic 105 may be configured to train the one or more trainable parameters Θ of the mapping function. In one example, the training logic 105 may be configured to select a set of training rays from the set of training images. For example, the training logic 105 may be configured to select one or more rays at random from the set of training images. Alternatively, one or more rays of the set of training images may be selected according to a known arrangement. In some embodiments, some or all of the one or more rays may be selected from the same image of the set of training images. In other embodiments, some or all of the rays may be selected from different images of the set of training images. The training logic 105 may further provide each training ray of the set of training rays to the mapping function logic 110. The mapping function logic 110 may evaluate the mapping function for each respective training ray to determine a respective color value. This may be referred to as the predicted color value of the training ray. The training logic 105 may further determine a respective actual color value of each training ray from the training image itself. This may be referred to as the ground-truth color value. For each iteration of training, the training logic 105 may be configured to compare each predicted color value to its respective ground-truth color value. In some embodiments, the training logic 105 may be configured to determine one or more of a photometric loss (Lp), ray bundle loss (Lr), and Fourier sparsity loss (Ls) of the predicted color value. In some embodiments, a loss function (L) may be evaluated for the predicted color value. The loss function, in some examples, may be a weighted sum of one or more of Lp, Lr, and Ls. In some embodiments, the one or more of Lp, Lr, and Ls may be an error for the predicted color value. Thus, a respective prediction error for each of the one or more training rays may be provided as feedback to be minimized, reduced, or otherwise optimized by the neural network (e.g., MLP). In some examples, the prediction error may be optimized by adjustment of the one or more trainable parameters of the mapping function based on the prediction error.
[0050] Thus, in some embodiments, the one or more trainable parameters, Θ, may be trained by minimizing for photometric loss, Lp. In some examples, photometric loss may be given by the following formula (Eq. 4):

$$L_p = \sum_{k=1}^{M} \sum_{i=1}^{N_k} \left\| f_\Theta\left(u^k_i, v^k_i, s^k_i, t^k_i\right) - C^k_i \right\|_2^2$$

where M is the total number of images, and Nk is the total number of pixels in the k-th image.
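A minimal sketch of such a photometric term follows, assuming the mapping function is exposed as a callable that maps a batch of rays (or their embeddings) to predicted RGB values; averaging over the sampled batch rather than summing over every ray is a common minibatch convention and an assumption here, not the disclosed formulation.

```python
import torch

def photometric_loss(f, rays, gt_colors):
    """Squared Euclidean distance between predicted and ground-truth colors.

    f: callable mapping a (B, 4) batch of rays (or their embeddings) to
       (B, 3) predicted RGB values.
    rays: (B, 4) tensor of (u, v, s, t) coordinates sampled from the
       training images; gt_colors: (B, 3) tensor of their known colors.
    """
    pred = f(rays)
    return ((pred - gt_colors) ** 2).sum(dim=-1).mean()
```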
[0051] In learning the mapping function (e.g., MLP), solely optimizing for photometric loss Lp may cause the neural network to overfit the data and relationships between 4D coordinates and respective color values in the training data. Accordingly, in various embodiments, the training logic 105 may be configured to utilize Fourier sparsity loss Ls (e.g., sparsity loss in the Fourier domain) of light fields to regularize learning. Specifically, an assumption is made that for a given scene, nearby images and/or views will have a similar frequency distribution in the frequency domain. In other words, when the network synthesizes a novel view, an assumption is made that the rendered image will have a similar frequency distribution as the input data (e.g., one or more training images).
[0052] Thus, in some embodiments, training the mapping function may further include minimizing for Fourier sparsity loss. In some examples, the mapping function logic 110 may further be configured to obtain a frequency distribution $\mathcal{F}(I_k)$ for each training image Ik (where k = 1, 2, . . . , M). For example, the frequency distribution $\mathcal{F}(I_k)$ may be obtained by applying a Fourier transform $\mathcal{F}$ to the image.
[0053] A set of random novel viewpoints {V1, V2, . . . , VJ}, where J is the number of novel viewpoints, may be sampled and frequency distributions $\mathcal{F}(\hat{I}_i)$ obtained from the rendered image $\hat{I}_i$ of each novel viewpoint of the set of random novel viewpoints (where i = 1, . . . , J). The frequency distributions $\mathcal{F}(\hat{I}_i)$ of the set of random novel viewpoints may be compared to the frequency distributions of the set of training images $\mathcal{F}(I_k)$. In some examples, the frequency distributions of the random set of novel viewpoints may be compared against the frequency distributions of all training images. Alternatively, in some examples, the frequency distributions of the random set of novel viewpoints $\mathcal{F}(\hat{I}_i)$ may be compared against a subset of the set of training images.
In some further embodiments, comparison to frequency distributions may be restricted, for example, based on proximity (e.g., restricting comparison to neighboring views / training images).
[0054] In various embodiments, Fourier sparsity loss Ls may be formulated as follows (Eq. 5):

$$L_s = \frac{1}{J M} \sum_{i=1}^{J} \sum_{k=1}^{M} \left\| \mathcal{F}(\hat{I}_i) - \mathcal{F}(I_k) \right\|_2^2$$

where J is the total number of sampled novel viewpoints, M is the total number of images in the set of training images (or alternatively subset of training images), and $\hat{I}_i$ is the rendered image of the i-th novel viewpoint.
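A hedged sketch of this comparison follows, using the magnitude of a 2D FFT as the frequency distribution; the exact quantity compared and the normalization used in the disclosure may differ, so the function below is illustrative only.

```python
import torch

def fourier_sparsity_loss(rendered, training_images):
    """Compare the frequency distribution of a rendered view to training views.

    rendered: (H, W, 3) rendered image of a random viewpoint.
    training_images: (M, H, W, 3) tensor of (neighboring) training images.
    The magnitude of the 2D FFT stands in for the "frequency distribution".
    """
    freq_pred = torch.fft.fft2(rendered, dim=(0, 1)).abs()
    freq_gt = torch.fft.fft2(training_images, dim=(1, 2)).abs()
    return ((freq_pred.unsqueeze(0) - freq_gt) ** 2).mean()
```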
[0055] Thus, the training logic 105 may further be configured to train the mapping function (e.g., MLP) to minimize, reduce, or otherwise optimize Ls.
Specifically, in addition to learning the respective color values of a set of rays, the MLP will further learn the information of color changes, i.e., the first-order gradient of colors, in 4D space based on the spatial frequency components of the training images.
[0056] In a further embodiment, the training logic 105 may be configured to address aliasing. In light field rendering, aliasing is a common issue. With photometric loss Lp and Fourier sparsity loss Ls, the network (e.g., MLP) may learn color and frequency distribution of images. However, the color value of rays in-between neighboring rays may not transition smoothly. This can lead to aliasing (e.g., fuzziness artifacts) in an image. Thus, in some embodiments, the mapping function (e.g., MLP) may be trained to minimize, reduce, or otherwise optimize Ray Bundle Loss (Lr). Optimization of Lr may, for example, include bringing Lr within an acceptable range or threshold.
[0057] Ray Bundle Loss Lr may be based on the following: for a ray, R, cast from a viewpoint, T number of rays {R1, R2, . . . , RT} may be cast from the same origin but with slightly different directions. The rays with smaller angles to R are more likely to have the same color as R. Therefore, Lr may be defined as the weighted sum of l2 terms over surrounding rays as follows (Eq. 6):

$$L_r = \sum_{i=1}^{T} w\left(\langle R, R_i \rangle; \theta\right) \left\| f_\Theta(u, v, s, t) - f_\Theta(u_i, v_i, s_i, t_i) \right\|_2^2$$

where (u, v, s, t) and (ui, vi, si, ti) are the 4D coordinates for rays R and Ri (i = 1, . . . , T) respectively; <R, Ri> denotes the angle between R and Ri; w(·; θ) is a weight that decreases as the angle <R, Ri> grows; and θ is the weighting parameter. Experimental data has shown that a θ of 1.5° was effective for most scenes tested. This value is correlated to the size of a pixel in the imaging setting, and thus, in some examples, may be determined based on the size of the pixel in a particular image. For example, when θ is selected for a specific image, a θ that is too small will produce aliasing artifacts, whereas a θ that is too large will produce images that are over-smoothed.
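A sketch of such a ray bundle term is shown below; the Gaussian falloff exp(-(angle/θ)²) is an assumed weighting, since the text only states that rays at smaller angles are weighted more heavily, with θ as the weighting parameter.

```python
import torch

def ray_bundle_loss(f, ray, bundle, angles, theta_deg=1.5):
    """Penalize color differences within a bundle of nearby rays.

    ray: (4,) 4D coordinate of the central ray R.
    bundle: (T, 4) coordinates of rays cast from the same origin.
    angles: (T,) angles (degrees) between R and each bundled ray.
    The Gaussian weighting exp(-(angle/theta)^2) is an assumed form.
    """
    weights = torch.exp(-(angles / theta_deg) ** 2)
    diff = f(bundle) - f(ray.unsqueeze(0))          # (T, 3) color differences
    return (weights * (diff ** 2).sum(dim=-1)).mean()
```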
[0058] In further embodiments, as previously described, the mapping function / MLP may be trained on an overall loss function (L), which is given as follows (Eq. 7):

$$L = L_p + \lambda_s L_s + \lambda_r L_r$$

where λs and λr are weighting coefficients. In experimental results, λs = 1.92 and λr = 0.074 have been found to be effective for most scenes tested. In some embodiments, grid search tuning, manual tuning, or a combination of grid search tuning and manual tuning may be used to find the optimal weighting parameters for a specific scene.
[0059] As previously described, according to some embodiments, a set of random rays from the training images may be selected and passed through the MLP during training. The mapping function may thus be evaluated for each respective ray of the set of random rays to determine a color value. The color value may be used by the training logic 105 and/or mapping function logic 110 to calculate Lp and back-propagate the error.
[0060] To determine Ls, in each iteration, the training logic 105 and/or mapping function logic 110 may be configured to generate a camera with a random position within the convex hull of all cameras of the training images. In some examples, the viewing direction of the randomly generated camera may have a random offset from the z-axis, where the z-axis is perpendicular to the two light slabs, such that the viewing direction may not be perpendicular to the two light slabs. The range of this offset is estimated by arctan(d/h), where d is the smallest distance from the selected camera to the convex hull of the training cameras, and h is the distance from the selected camera to the st-plane. This ensures the selected camera will be pointed at the scene of interest.
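As a small illustration of the arctan(d/h) bound, the helper below estimates the maximum angular offset for a sampled camera; approximating the distance to the convex hull by the distance to its vertices, along with the function and argument names, is a simplifying assumption.

```python
import numpy as np

def max_view_offset(cam_pos, hull_points, st_plane_z):
    """Upper bound (radians) on the random angular offset from the z-axis.

    d: smallest distance from the sampled camera to the convex hull of the
       training cameras (approximated here by the hull vertices).
    h: distance from the sampled camera to the st plane.
    """
    cam_pos = np.asarray(cam_pos, dtype=float)
    d = np.min(np.linalg.norm(np.asarray(hull_points) - cam_pos, axis=1))
    h = abs(st_plane_z - cam_pos[2])
    return np.arctan2(d, h)   # arctan(d / h)
```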
[0061] To determine Lr, in each iteration, the training logic 105 and/or mapping function logic 110 may sample a first set (or batch) of camera rays from the training set randomly and generate a second set (or batch) of rays using the same rules as the camera generation for Ls. For each ray R in the two sets (or batches) of rays, a T number of rays may be cast from the origin of R at various angles, where the angle between R and each cast ray is sampled from (0, θ²), where the negative angles are skipped. In some examples, T = 16 may be utilized, which has been shown to yield good experimental results. In some embodiments, the angles between R and the cast rays may be distributed evenly, while in other embodiments, the angles may be randomly determined.
[0062] In some examples, during training, the embedded input vector (v) may be passed through 6 fully-connected rectified linear unit (ReLU) layers, each of which has 256 channels, with the final layer including a sigmoid activation to output the predicted color. In some examples, for model training, the ray batch size in each iteration may be set to 32768 for both Lp and Lr. The MLP may be trained for 200,000 iterations using an Adam optimizer; the initial learning rate is 1x10^-3 and is reduced by half every 20,000 iterations. Experimental data has shown that training the neural light field (NeuLF) on a scene with 25 input images with a resolution of 1536 x 1280 took approximately 8 hours using four commercially available Nvidia® V100 cards.
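A hedged sketch of a network and optimizer matching the figures quoted above (6 fully-connected ReLU layers of 256 channels, a sigmoid output, and Adam with the learning rate halved every 20,000 iterations) follows; the class name, input dimension, and the commented training step are illustrative assumptions rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

class NeuLFMLP(nn.Module):
    """Six fully-connected ReLU layers (256 channels) with a sigmoid output."""
    def __init__(self, in_dim=512, hidden=256):
        super().__init__()
        layers = []
        dim = in_dim
        for _ in range(6):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers += [nn.Linear(hidden, 3), nn.Sigmoid()]   # predicted RGB
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = NeuLFMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate every 20,000 iterations, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20_000, gamma=0.5)

# One training iteration (sketch): rays/colors come from the training set;
# lambda_s = 1.92 and lambda_r = 0.074 are the weights quoted above.
# loss = photometric_loss(model, rays, colors) \
#        + 1.92 * fourier_sparsity_loss(rendered, neighbors) \
#        + 0.074 * ray_bundle_loss(model, ray, bundle, angles)
# optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```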
[0063] In various embodiments, given a viewpoint V, a novel view $\hat{I}_V$ may be rendered by the view synthesizer logic 115. Like the training logic 105 and mapping function logic 110, the view synthesizer logic 115 may include, without limitation, hardware, software, or both hardware and software, and similarly include both physical and virtual machine instances. In various embodiments, rendering the novel view may include evaluating the learned mapping function ƒ for each ray of the viewpoint V. For example, in some embodiments, rendering the viewpoint V may include determining a desired camera pose (e.g., position and/or orientation of a camera for an image), and a desired rendering resolution {Wv, Hv}.
[0064] Based on the camera pose, the view synthesis logic 115 may sample all rays {R1, R2, . . . , RNv}, where Nv = Wv x Hv (e.g., the total number of pixels to be rendered). Thus, the view synthesis logic 115 may further be configured to determine the 4D coordinates (ui, vi, si, ti) for each ray Ri (i = 1, . . . , Nv). The rendering process may thus be formulated as Nv evaluations of the mapping function ƒ (Eq. 8):

$$C_i = f_\Theta(u_i, v_i, s_i, t_i), \quad i = 1, \ldots, N_v$$

where Ci is the color value of the i-th ray.
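Putting the pieces together, a rendering sketch under the same assumptions as the earlier helpers (camera_rays_4d and the Fourier-feature embedder) follows; it performs exactly one mapping-function evaluation per pixel ray, as in Eq. 8, though the helper names and camera conventions remain assumptions.

```python
import numpy as np
import torch

def render_novel_view(model, cam_pos, width, height, focal, embedder):
    """Render a novel view with one mapping-function evaluation per pixel."""
    uvst = camera_rays_4d(np.asarray(cam_pos, dtype=float), width, height, focal)
    embedded = torch.as_tensor(embedder(uvst.reshape(-1, 4)), dtype=torch.float32)
    with torch.no_grad():
        colors = model(embedded)                 # (W*H, 3) RGB values
    # Map each ray's color back to its pixel position in the rendered image.
    return colors.reshape(height, width, 3).numpy()
```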
[0065] From each evaluation of the mapping function, the view synthesis logic
115 may thus map each color value of a given ray back to a pixel position of the rendered image of the novel view.
[0066] Fig. 2 is a functional block diagram of a system 200 for training a mapping function using neural light field, in accordance with various embodiments. For purposes of the functional block diagram, the system 200 may include a set of training images 205, a first camera 210, a second camera 215, a bundle of rays 220, a scene (or object) 225, a first plane 230, a second plane 235, a multilayer perceptron (MLP) 240, a rendered image 245, a Fourier image 250, ground truth Fourier images 255, and predicted color values 260. It should be noted that the various components of the system 200 are schematically illustrated in Fig. 2, and that modifications to the arrangement of the system 200 may be possible in accordance with the various embodiments.
[0067] In various embodiments, Fig. 2 provides an overview of the NeuLF process for novel view synthesis. For example, to train the MLP 240, training data from a set of training images 205 of a scene 225 may be provided to the MLP 240. The MLP 240 may then be configured such that each ray of the set of training images is related to a color value based on the ground-truth color values (e.g., the training images themselves). The MLP 240 may thus perform a mapping function configured to map a ray to its respective color value based on the set of training images 205.
[0068] In some embodiments, as described above, a ray may be parameterized into a 4D set of coordinates, based on points at which the ray intersects one or more parameterization surfaces. In one example, a two-plane model may be utilized. In the two-plane model, a first plane 230 (uv plane), and a second plane 235 ( st plane) may be used to characterize each ray based on a point at which a given ray intersects the uv plane and st plane. Thus, for each image of the set of training images, each pixel may correspond to a unique ray, and each ray may have a respective color value. In various embodiments, a color value may be represented as an RGB value. Each ray may be parameterized as a 4D coordinate (u, v, s, t ), where the coordinates (u, v) indicate where the ray intersects the first plane 230, and ( s , t ) indicates where the ray intersects the second plane 235.
[0069] Next, to further train the mapping function (e.g., MLP 240), one or more trainable parameters of the mapping function may be trained. For example, the one or more trainable parameters may include, for example, one or more parameters of the "hidden" layers of the MLP 240. A ray from the set of training images may be passed to the MLP 240, and the mapping function evaluated (e.g., passed through the MLP 240) to determine a predicted color value for the ray. Thus, in some examples, the 4D coordinates of a known ray (u, v, s, t ) of an image of the set of training images 205 may be passed to the MLP 240, which may then evaluate the mapping function (e.g., pass the 4D coordinate through the neural network), to determine a predicted color value corresponding to the input ray.
[0070] As previously described, to determine photometric loss (Lp), a difference between the predicted color value and ground truth color value may be determined. In one example, the difference may be a squared Euclidean norm (squared l2 norm) of the difference between the predicted RGB value and ground truth RGB value for the input ray. Based on the above, one or more trainable parameters of the mapping function (e.g., layers of the MLP 240) may be adjusted to minimize photometric loss. The l2 norm may be expressed as set forth above with respect to Eq. 4. In other embodiments, other forms of Euclidean distance may be utilized to characterize the difference between predicted color value and ground-truth color values.
[0071] In some examples, the MLP 240 may be trained to minimize for Fourier sparsity loss, Ls. Here, the one or more trainable parameters of the mapping function may be trained by minimizing for Ls. To determine Ls, the frequency distribution of a rendered view may be compared to a frequency distribution of a neighboring view (e.g., a training image of the set of training images) to enforce a similarity to the neighboring view.
[0072] In some examples, to train the MLP 240 on Fourier sparsity loss, a random viewpoint may be rendered. In some embodiments, as previously described, rendering an image (e.g., a predicted image) of a random viewpoint may include determining a set of rays corresponding to the random viewpoint. Each of the rays of the random viewpoint may be evaluated by the MLP 240 to generate a rendered image 245 of the random viewpoint. A Fourier transform F may be performed on the rendered image to produce a frequency distribution 250 of the rendered image 245. The frequency distribution 250 of the rendered image 245 may be compared against one or more ground truth Fourier images 255, which provide frequency distributions of one or more images of the set of training images 205. Thus, the one or more ground truth Fourier images 255 may include a Fourier transformed image from the set of training images 205. In some examples, the images used for ground truth Fourier images 255 may include one or more images of the set of training images 205 with camera positions nearest to the camera of the random viewpoint.
[0073] Fourier sparsity loss may thus, in some embodiments, be a mean square error (MSE) between the frequency distribution 250 of the rendered image 245 and the one or more ground truth Fourier images 255. In one example, the MSE may be formulated as set forth above with respect to Eq. 5. In further embodiments, a root mean square error (RMSE), or another Euclidean distance, may be used instead, as appropriate. By minimizing the MSE between the frequency distribution 250 of the rendered image 245 and the one or more ground truth Fourier images 255, a smooth frequency transition may be enforced between neighboring views, thus avoiding overfitting of the rendered image to data in the set of ground truth images.
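For illustration only (not part of the original disclosure), an MSE between frequency distributions in the spirit of Eq. 5 could be sketched as below; comparing magnitude spectra is an assumption, since the disclosure only specifies an MSE between frequency distributions.

```python
# Illustrative sketch (not from the disclosure): Fourier sparsity term as an
# MSE between the frequency magnitudes of a rendered view and a nearby view.
import torch

def fourier_sparsity_loss(rendered, neighbor):
    # rendered, neighbor: (H, W, 3) images
    f_pred = torch.fft.fft2(rendered, dim=(0, 1)).abs()
    f_gt = torch.fft.fft2(neighbor, dim=(0, 1)).abs()
    return torch.mean((f_pred - f_gt) ** 2)
```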
[0074] In further embodiments, the MLP 240 may be trained to optimize for ray bundle loss, Lr, thereby training the one or more trainable parameters of the mapping function (e.g., one or more layers of the MLP 240). Specifically, the one or more trainable parameters of the mapping function may be trained by minimizing Lr. The color value of rays in-between neighboring rays may not transition smoothly, which can lead to aliasing in an image.
[0075] To determine Lr, the predicted color values of a bundle of rays projected from a first ray may be compared for smoothness. It is assumed that the closer in angle a projected ray is to the original ray, the closer its predicted color value will be to that of the original ray. In one example, Lr may be formulated as set forth above with respect to Eq. 6. Thus, by minimizing Lr, smoothness may be ensured between neighboring rays, and aliasing effects reduced.
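For illustration only (not part of the original disclosure), a smoothness term in the spirit of Eq. 6 could compare a ray's predicted color with colors of a small bundle of slightly perturbed rays; the perturbation radius (a Q-like parameter) and the bundle size below are assumptions.

```python
# Illustrative sketch (not from the disclosure): ray bundle smoothness term.
import torch

def ray_bundle_loss(model, rays_4d, radius=0.01, bundle_size=4):
    base = model(rays_4d)                                        # (N, 3)
    loss = 0.0
    for _ in range(bundle_size):
        jitter = torch.zeros_like(rays_4d)
        jitter[:, 2:] = radius * torch.randn_like(rays_4d[:, 2:])  # perturb (s, t)
        loss = loss + ((model(rays_4d + jitter) - base) ** 2).sum(dim=-1).mean()
    return loss / bundle_size
```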
[0076] In yet further embodiments, the MLP 240 may be trained on one or more of the above photometric loss, Fourier sparsity loss, and ray bundle loss, or any combination thereof. In some examples, the MLP 240 may be trained on an overall loss function, as set forth above with respect to Eq. 7, in which the photometric loss, Fourier sparsity loss, and ray bundle loss may each be weighted differently, as appropriate. For example, the weights may be manually and/or automatically tuned based on experimental results.
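For illustration only (not part of the original disclosure), an overall objective of this form could be assembled as a weighted sum of the three terms; the weight values below are placeholders to be tuned, not values taken from the disclosure.

```python
# Illustrative sketch (not from the disclosure): weighted sum in the spirit of Eq. 7.
def total_loss(lp, ls, lr, w_p=1.0, w_s=0.1, w_r=0.1):
    return w_p * lp + w_s * ls + w_r * lr
```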
[0077] Fig. 3 is a schematic diagram of a two-plane model 300 for 4D light field parameterization of rays, in accordance with various embodiments. The 4D light field parameterization two-plane model 300 includes camera 305, image plane 310, first plane 315, and second plane 320. It should be noted that while the various parts of the two-plane model 300 for 4D light field parameterization are schematically illustrated in Fig. 3, modifications to the arrangement of the two-plane model 300 may be possible in accordance with the various embodiments. Moreover, in other embodiments, it is to be noted that other types of parameterization may be utilized in light field parameterization, including, without limitation, points on a curved surface or plane, or points on the surface of a sphere.
[0078] In various embodiments, the first plane 315 and second plane 320 may also be referred to as the uv plane and the st plane, respectively, as parameterization planes, and/or as parameterization light slabs. As illustrated, rays parameterized using the two-plane model may be represented as a 4D set of coordinates. For example, a first ray R1 and a second ray R2 may originate from the camera 305, which may be a center of projection for all rays passing through an image plane 310.
[0079] According to the two-plane model 300, a given ray may be represented by a 4D set of coordinates (u, v, s, t). The 4D set of coordinates may comprise two sets of coordinates: a first set of coordinates (u, v) indicating a position on the first plane 315 which the ray intersects, and a second set of coordinates (s, t) indicating a position on the second plane 320 through which the ray passes (intersects). Using the two-plane model to represent rays, all rays from an object to one side of the two planes 315, 320 may be uniquely determined by a 4D coordinate. Accordingly, the first ray R1 may thus be represented by the 4D coordinate (u1, v1, s1, t1), where the ray R1 passes through the first plane 315 at coordinate (u1, v1) and the second plane at coordinate (s1, t1). Similarly, the second ray R2 may be represented by the 4D coordinate (u2, v2, s2, t2).
[0080] It is to be understood that in other embodiments, one or more parameterization surfaces may be utilized. For example, a first parameterization surface and a second parameterization surface may be used to parameterize the ray. Specifically, the first set of coordinates (u, v) may indicate a point on the first parameterization surface through which the ray intersects the first parameterization surface, and the second set of coordinates (s, t) may indicate a point on the second parameterization surface through which the ray intersects the second parameterization surface. Thus, parameterization surfaces may include planar and non-planar surfaces. In some examples, the parameterization surfaces may be non-planar surfaces, which may include, without limitation, spherical surfaces, cylindrical surfaces, polygonal surfaces, or irregularly shaped surfaces. Furthermore, in some embodiments, the parameterization surfaces may or may not intersect. Using a 4D light field to represent rays, rendering a novel view may include querying all rays originating from the camera 305 to every pixel on the camera's 305 image plane 310.
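As a hedged illustration of the non-planar case (not part of the original disclosure), a ray could instead be parameterized by where it pierces a sphere, expressed as azimuth and elevation; the sphere radius and the assumption that the ray originates inside the sphere are choices made for the example, and two concentric spheres would give a 4D coordinate analogous to (u, v, s, t).

```python
# Illustrative sketch (not from the disclosure): spherical parameterization of
# a ray's intersection with a sphere of the given radius.
import numpy as np

def ray_sphere_params(origin, direction, radius):
    d = direction / np.linalg.norm(direction)
    b = np.dot(origin, d)
    disc = b * b - (np.dot(origin, origin) - radius ** 2)
    t_hit = -b + np.sqrt(disc)                           # far intersection
    p = origin + t_hit * d
    theta = np.arctan2(p[1], p[0])                       # azimuth
    phi = np.arcsin(np.clip(p[2] / radius, -1.0, 1.0))   # elevation
    return theta, phi
```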
[0081] In contrast with known approaches utilizing radiance fields, which rely on a 5D radiance field representation, the 4D light field representation significantly improves computational performance and eliminates redundancies found in the 5D radiance field representation. Using the 4D light field representation, novel viewpoints may lie outside of the convex hull of a scene and/or object. Thus, because rays are projected from the convex hull of a scene/object, a color value of a ray may be learned directly, whereas the 5D radiance field representation utilizes complex ray marching to learn the color of a ray at all points in free space of a scene, including points inside the convex hull of a scene/object.
[0082] Fig. 4 is a schematic diagram of a camera array 400 for generating a training image set for training of a mapping function, in accordance with various embodiments. The camera array 400 arrangement may include a set of training images 405 of a scene 410, including a first training image 415a, a second training image 415b, and a third training image 415c. In various embodiments, the camera array 400 may include cameras positioned on a camera plane parallel to the parameterization light slabs (e.g., the first and second planes of the two-plane model) as described above with respect to Fig. 3. Thus, an image plane may be formed that is also parallel to the parameterization planes (e.g., all planes are orthogonal to a z-axis). Thus, the set of training images 405 depict viewpoints along the image plane, and are schematically illustrated positioned in space on the image plane. Using a set of training images 405 parallel with the parameterization planes simplifies ray parameterization, and has been used for purposes of explanation. It is to be understood that in other examples, camera positioning need not follow this strategy, and other strategies, such as non-planar camera positioning, may be utilized.
[0083] Traditional light field representation requires a large number of views, with many cameras. Often, these approaches rely on larger amounts of input data to improve the accuracy and fidelity of synthesized views. Thus, as set forth in the examples above, the various embodiments leverage machine learning, for example, in the form of a neural network like an MLP as previously described, to learn an implicit function (e.g., the mapping function) to predict and reconstruct light fields (also referred to as "implicit representations") from sparse inputs like the set of training images 405. This approach reduces storage requirements for view synthesis. Thus, an implicit-function-based approach is proposed, which provides memory-efficient alternatives to conventional approaches.
[0084] Fig. 5 is a flow diagram of a method 500 for novel view synthesis using neural light field, in accordance with various embodiments. The method 500 begins, at block 505, by obtaining a set of training images. As previously described, in some embodiments, a set of training images may be captured, generated, and/or received by a system for novel view synthesis using neural light fields. The set of training images may be captured by a camera array, and include images of a scene or object. The camera array may include a plurality of cameras. In some examples, the plurality of cameras may be arranged on a camera plane.
[0085] In some embodiments, the system, or subcomponent of the system, such as the training logic as previously described, may be configured to extract training data from the set of training images. Extracting training data may include, for example, determining all rays of each of the training images of the set of training images, and the corresponding color values for each of the respective rays.
[0086] The method 500 may continue, at block 510, by generating a mapping function. In some embodiments, the training data obtained from the set of training images may be provided to a neural network, such as an MLP, to generate a mapping function. In some examples, the MLP itself may act as the mapping function, in which one or more layers of the MLP are trainable parameters of the mapping function. Thus, generating the mapping function may further include learning of one or more trainable parameters.
Thus, the method 500 further includes, at block 515, learning one or more parameters (e.g., trainable parameters) of the mapping function. In some embodiments, the MLP may learn the one or more parameters of the mapping function based on the training data. In some further embodiments, learning the one or more parameters may further include minimizing, reducing, or otherwise optimizing error in the mapping function. Optimization may, for example, include bringing error (including photometric loss, Fourier sparsity loss, and ray bundle loss) within an acceptable range or threshold.
[0087] Thus, in some embodiments, minimizing error may include, for example, at block 520, minimizing for photometric loss. As previously described, minimizing photometric loss may include first determining a color value for one or more rays of the set of training images, and predicting a color value for the one or more rays via a neural network, such as the MLP. To determine photometric loss (Lp), a difference between the predicted color value and the ground truth color value may then be determined. Thus, the MLP may predict a color value for a ray from the set of training images, and a difference may be determined between the predicted color value and the actual color value of the corresponding pixel in the training image of the set of training images. In one example, taking a difference between the two values may include a squared Euclidean norm (squared l2 norm) of the difference between the predicted RGB value and the ground truth RGB value for the input ray. The l2 norm may be expressed as set forth above with respect to Eq. 4. In some examples, the photometric loss may be fed back to the mapping function, and one or more parameters of the mapping function (e.g., layers of the MLP) may be adjusted to minimize photometric loss, and thus trained/learned.
[0088] In some embodiments, minimizing error may further include, at block 525, minimizing Fourier sparsity loss. In some examples, the neural network, such as an MLP, may be trained to minimize Fourier sparsity loss, Ls. In various embodiments, to determine Ls, the frequency distribution of a rendered view may be compared to a frequency distribution of a neighboring view (e.g., a training image of the set of training images) to enforce a similarity to the neighboring view. As previously described, in some examples, a random viewpoint may be rendered. Rendering an image (e.g., a predicted image) of a random viewpoint may include determining a set of rays corresponding to the random viewpoint. Each of the rays of the random viewpoint may be evaluated by a neural network, such as an MLP, to generate a rendered image of the random viewpoint. A Fourier transform F may be performed on the rendered image to produce a frequency distribution of the rendered image. The frequency distribution of the rendered image may be compared against the frequency distribution of one or more ground truth images (e.g., one or more images of the set of training images). Thus, the frequency distribution of the one or more ground truth images may include Fourier transformed images from the set of training images.
[0089] Fourier sparsity loss may thus, in some embodiments, be a mean square error (MSE) between the frequency distribution of the rendered image and that of the one or more ground truth images. In one example, the MSE may be formulated as set forth above with respect to Eq. 5. In further embodiments, a root mean square error (RMSE), or another Euclidean distance, may be used instead, as appropriate. By minimizing the MSE between the frequency distribution of the rendered image and that of the one or more ground truth images, a smooth frequency transition may be enforced between neighboring views, thus avoiding overfitting of the rendered image to data in the set of ground truth images.
[0090] In some embodiments, minimizing error may further include, at block 530, minimizing ray bundle loss. The color value of rays in-between neighboring rays may not transition smoothly, which can lead to aliasing in an image. As previously described, the mapping function may be trained to optimize for ray bundle loss, Lr. Specifically, the one or more trainable parameters of the mapping function may be trained by minimizing Lr and determining the proper parameters for calculating ray bundle loss. For example, Lr may be formulated as set forth in Eq. 6. Here, Q may be selected for a specific image, where a Q that is too small will produce aliasing artifacts, whereas a Q that is too large will produce images that are over-smoothed.
[0091] To determine Lr, the predicted color values of a bundle of rays projected from a first ray may be compared for smoothness. It is assumed that the closer in angle a projected ray is to the original ray, the closer its predicted color value will be to that of the original ray. In one example, Lr may be formulated as set forth above with respect to Eq. 6. Thus, by minimizing Lr, smoothness may be ensured between neighboring rays, and aliasing effects reduced.
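For illustration only (not part of the original disclosure), one training iteration combining the three terms could be sketched as below, reusing the photometric_loss, fourier_sparsity_loss, and ray_bundle_loss sketches above; the optimizer, loss weights, and batching scheme are assumptions for the example.

```python
# Illustrative sketch (not from the disclosure): one combined training step.
import torch

def train_step(model, optimizer, rays_4d, gt_rgb, rand_view_rays, neighbor_img,
               w_p=1.0, w_s=0.1, w_r=0.1):
    # rays_4d: (N, 4) rays sampled from the training images; gt_rgb: (N, 3)
    # rand_view_rays: (H*W, 4) rays of a randomly chosen viewpoint
    # neighbor_img: (H, W, 3) nearest training image for the Fourier term
    H, W, _ = neighbor_img.shape
    rendered = model(rand_view_rays).reshape(H, W, 3)
    loss = (w_p * photometric_loss(model(rays_4d), gt_rgb)
            + w_s * fourier_sparsity_loss(rendered, neighbor_img)
            + w_r * ray_bundle_loss(model, rays_4d))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```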
[0092] The method 500 further includes, at block 535, determining a set of rays of a novel view. In various embodiments, given a viewpoint, a novel view may be rendered by first determining all rays of the novel view. For example, in some embodiments, rendering the viewpoint may include determining a desired camera pose (e.g., a position and/or orientation of a camera for an image) and a desired rendering resolution {Wv, Hv}. Thus, the novel view may comprise Nv rays, where Nv = Wv x Hv (e.g., the total number of pixels to be rendered). Based on the camera pose, the view synthesis logic may sample all rays R1 through RNv of the novel view. The view synthesis logic 115 may further be configured to determine the 4D coordinates (ui, vi, si, ti) for each ray Ri.
[0093] The method 500 continues, at block 540, by evaluating the mapping function for each ray of the novel view. By evaluating the learned mapping function f for each ray of the viewpoint, the novel view may be rendered. The rendering process may thus be formulated as Nv evaluations of the mapping function f, as set forth above with respect to Eq. 8. Once all color values for each ray have been determined, at block 545, the method 500 comprises rendering the novel view. Thus, in some embodiments, the color values of the rays may be mapped back to the respective pixel positions of the rendered image. Thus, the novel view may be rendered directly based on the color values of each ray, with each ray corresponding to a respective pixel of the rendered novel view.
[0094] The process of novel view synthesis using neural light fields may be performed by a computer system. Fig. 6 is a schematic block diagram of a computer system 600 for novel view synthesis using neural light field, in accordance with various embodiments. Fig. 6 provides a schematic illustration of one embodiment of a computer system 600, such as the system for novel view synthesis 100, or subsystems thereof, such as the training logic, mapping function logic, MLP, neural network, view synthesis logic, or combinations thereof, which may perform the methods provided by various other embodiments, as described herein. It should be noted that Fig. 6 only provides a generalized illustration of various components, of which one or more of each may be utilized as appropriate. Fig. 6, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
[0095] The computer system 600 includes multiple hardware elements that may be electrically coupled via a bus 605 (or may otherwise be in communication, as appropriate). The hardware elements may include one or more processors 610, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and microcontrollers); one or more input devices 615, which include, without limitation, a mouse, a keyboard, one or more sensors, and/or the like; and one or more output devices 620, which can include, without limitation, a display device, and/or the like.
[0096] The computer system 600 may further include (and/or be in communication with) one or more storage devices 625, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random-access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
[0097] The computer system 600 might also include a communications subsystem 630, which may include, without limitation, a modem, a network card (wireless or wired), an IR communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, a Z-Wave device, a ZigBee device, cellular communication facilities, etc.), and/or an LP wireless device. The communications subsystem 630 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, between data centers or different cloud platforms, and/or with any other devices described herein. In many embodiments, the computer system 600 further comprises a working memory 635, which can include a RAM or ROM device, as described above.
[0098] The computer system 600 also may comprise software elements, shown as being currently located within the working memory 635, including an operating system 640, device drivers, executable libraries, and/or other code, such as one or more application programs 645, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
[0099] A set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 625 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 600. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer system 600 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 600 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
[0100] It will be apparent to those skilled in the art that substantial variations may be made in accordance with specific requirements. For example, customized hardware (such as programmable logic controllers, single board computers, FPGAs, ASICs, and SoCs) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.
[0101] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer system 600) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer system 600 in response to processor 610 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 640 and/or other code, such as an application program 645) contained in the working memory 635. Such instructions may be read into the working memory 635 from another computer readable medium, such as one or more of the storage device(s) 625. Merely by way of example, execution of the sequences of instructions contained in the working memory 635 might cause the processor(s) 610 to perform one or more procedures of the methods described herein.
[0102] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in a specific fashion. In an embodiment implemented using the computer system 600, various computer readable media might be involved in providing instructions/code to processor(s) 610 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 625. Volatile media includes, without limitation, dynamic memory, such as the working memory 635. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 605, as well as the various components of the communication subsystem 630 (and/or the media by which the communications subsystem 630 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including, without limitation, radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications).
[0103] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
[0104] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 610 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer system 600. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
[0105] The communications subsystem 630 (and/or components thereof) generally receives the signals, and the bus 605 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 635, from which the processor(s) 610 retrieves and executes the instructions. The instructions received by the working memory 635 may optionally be stored on a storage device 625 either before or after execution by the processor(s) 610.
[0106] While some features and aspects have been described with respect to the embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while some functionality is ascribed to one or more system components, unless the context dictates otherwise, this functionality can be distributed among various other system components in accordance with the several embodiments.
[0107] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with or without some features for ease of description and to illustrate aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method comprising: obtaining a set of training images, the set of training images comprising two or more training images; generating, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images; learning one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value; rendering a novel view, wherein the novel view is a synthesized image with N number of pixels, wherein N is an integer, and wherein rendering the novel view comprises: querying the mapping function for each ray R1 through RN, projecting from a center of projection to every pixel of the synthesized image; and obtaining, via the mapping function, a respective color value (R, G, B) for each ray R1 through RN.
2. The method of claim 1, wherein each ray R1 through RN is parameterized as a 4-dimensional (4D) coordinate (u, v, s, t), wherein the 4D coordinate representation for a given ray, Ri, is (ui, vi, si, ti), where i is an integer 1-N, wherein (ui, vi) represents two-dimensional coordinates of a first parameterization surface through which Ri intersects the first parameterization surface, and wherein (si, ti) represents two-dimensional coordinates of a second parameterization surface through which Ri intersects the second parameterization surface.
3. The method of claim 1, wherein the 4D coordinate (u, v, s, t) is of a two-plane model, wherein the first parameterization surface is a first plane, wherein (ui, vi) represents two-dimensional coordinates of the first plane through which Ri intersects the first plane, and wherein the second parameterization surface is a second plane, wherein (si, ti) represents two-dimensional coordinates of the second plane through which Ri intersects the second plane.
4. The method of claim 3, wherein the first plane, the second plane, and a camera plane are parallel to each other, wherein a camera is positioned on the camera plane.
5. The method of claim 1, wherein the mapping function is a multilayer perceptron.
6. The method of claim 1, wherein training the one or more training parameters comprises minimizing for photometric loss (Lp), wherein minimizing Lp includes determining a Euclidean distance between a predicted color value for a first known ray of a training image of the set of training images and a ground truth color value of the first known ray.
7. The method of claim 6, wherein minimizing for Lp further includes: selecting a random set of rays of the set of training images; passing the random set of rays through the mapping function to produce a set of respective predicted values; determining the Lp for the set of respective predicted values as compared against the set of training images; and back-propagating Lp to update the one or more training parameters of the mapping function.
8. The method of claim 1, wherein training the one or more training parameters comprises minimizing Fourier sparsity loss (Ls) of the mapping function, wherein minimizing Ls further includes determining a difference of a frequency distribution of a predicted image of a random viewpoint and a frequency distribution of one or more training images of the set of training images, wherein the difference includes a Euclidean distance.
9. The method of claim 1, wherein training the one or more training parameters comprises minimizing ray bundle loss (Lr) of the mapping function, wherein minimizing Lr further includes determining a difference between the color values of a bundle of rays originating from a common origin, wherein the difference includes a Euclidean distance between respective color values of each ray of the bundle of rays.
10. The method of claim 1, wherein training the one or more training parameters comprises minimizing a loss function (L), the loss function comprising a weighted sum of photometric loss, Fourier sparsity loss, and ray bundle loss.
11. An apparatus, comprising: a processor; and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: receive a set of training images, the set of training images comprising two or more training images; generate, via a neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images; learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value; render a novel view, wherein the novel view is a synthesized image with N number of pixels, wherein N is an integer, and wherein rendering the novel view further comprises: evaluating the mapping function for each ray R1 through RN, projecting from a center of projection to every pixel of the synthesized image; and generating for each ray R1 through RN a respective color value (R, G, B) based on the mapping function.
12. The apparatus of claim 11, wherein each ray R1 through RN is parameterized as a 4-dimensional (4D) coordinate (u, v, s, t), wherein the 4D coordinate representation for a given ray, Ri, is (ui, vi, si, ti), where i is an integer 1-N, wherein (ui, vi) represents two-dimensional coordinates of a first parameterization surface through which Ri intersects the first parameterization surface, and wherein (si, ti) represents two-dimensional coordinates of a second parameterization surface through which Ri intersects the second parameterization surface.
13. The apparatus of claim 11, wherein the mapping function is a multilayer perceptron.
14. The apparatus of claim 11, wherein training the one or more training parameters comprises minimizing for photometric loss ( Lp ), wherein minimizing Lp includes determining a Euclidean distance between a predicted color value for a first known ray of a training image of the set of training images and a ground truth color value of the first known ray.
15. The apparatus of claim 14, wherein training the one or more training parameters comprises minimizing Fourier sparsity loss (Ls) of the mapping function, wherein minimizing Ls further includes determining a difference of a frequency distribution of a predicted image of a random viewpoint and a frequency distribution of one or more training images of the set of training images, wherein the difference includes a Euclidean distance.
16. The apparatus of claim 11, wherein training the one or more training parameters comprises minimizing ray bundle loss (Lr) of the mapping function, wherein minimizing Lr further includes determining a difference between the color values of a bundle of rays originating from a common origin, wherein the difference includes a Euclidean distance between respective color values of each ray of the bundle of rays.
17. A system for novel view synthesis utilizing neural light field comprising: a neural network configured to relate a ray of an image to a respective color value; a novel view synthesizer subsystem further comprising: a processor; and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: receive a set of training images, the set of training images comprising two or more training images; generate, via the neural network, a mapping function, wherein the mapping function is configured to map a ray, corresponding to a respective pixel of an image, to a color value based on the set of training images; learn one or more parameters of the mapping function based on training data captured from the set of training images, wherein the one or more parameters of the mapping function determines a relationship between a respective ray and its respective color value; render a novel view, wherein the novel view is a synthesized image with N number of pixels, wherein N is an integer, and wherein rendering the novel view further comprises: evaluating the mapping function for each ray R1 through RN, projecting from a center of projection to every pixel of the synthesized image; and generating for each ray R1 through RN a respective color value (R, G, B) based on the mapping function.
18. The system of claim 17, wherein learning the one or more training parameters comprises minimizing for photometric loss (Lp), wherein the set of instructions is further executable by the processor to: select a random set of rays of the set of training images; pass the random set of rays through the mapping function to produce a set of respective predicted values; determine the Lp for the set of respective predicted values as compared against the set of training images; and back-propagate Lp to update the one or more training parameters of the mapping function.
19. The system of claim 17, wherein learning the one or more training parameters comprises minimizing Fourier sparsity loss (Ls) of the mapping function, wherein minimizing Ls further includes determining a difference of a frequency distribution of a predicted image of a random viewpoint and a frequency distribution of one or more training images of the set of training images, wherein the difference includes a Euclidean distance.
20. The system of claim 17, wherein learning the one or more training parameters comprises minimizing ray bundle loss (Lr) of the mapping function, wherein minimizing Lr further includes determining a difference between the color values of a bundle of rays originating from a common origin, wherein the difference includes a Euclidean distance between respective color values of each ray of the bundle of rays.
PCT/US2021/046695 2021-04-04 2021-08-19 Novel view synthesis with neural light field WO2022216309A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163170506P 2021-04-04 2021-04-04
US63/170,506 2021-04-04
US202163188832P 2021-05-14 2021-05-14
US63/188,832 2021-05-14

Publications (1)

Publication Number Publication Date
WO2022216309A1 true WO2022216309A1 (en) 2022-10-13

Family

ID=83545913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/046695 WO2022216309A1 (en) 2021-04-04 2021-08-19 Novel view synthesis with neural light field

Country Status (1)

Country Link
WO (1) WO2022216309A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200279387A1 (en) * 2017-11-20 2020-09-03 Shanghaitech University Light field image rendering method and system for creating see-through effects
US20200014904A1 (en) * 2018-07-03 2020-01-09 Raxium, Inc. Display processing circuitry

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BENJAMIN ATTAL; SELENA LING; AARON GOKASLAN; CHRISTIAN RICHARDT; JAMES TOMPKIN: "MatryODShka: Real-time 6DoF Video View Synthesis using Multi-Sphere Images", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 14 August 2020 (2020-08-14), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081742047 *
CHEN BIN BINORCHEN@GMAIL.COM; RUAN LINGYAN LYRUANRUAN@GMAIL.COM; LAM MIU-LING MIU.LAM@CITYU.EDU.HK: "LFGAN", ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS ANDAPPLICATIONS, ASSOCIATION FOR COMPUTING MACHINERY,, US, vol. 16, no. 1, 17 February 2020 (2020-02-17), US , pages 1 - 20, XP058687506, ISSN: 1551-6857, DOI: 10.1145/3366371 *
PRATUL P. SRINIVASAN; TONGZHOU WANG; ASHWIN SREELAL; RAVI RAMAMOORTHI; REN NG: "Learning to Synthesize a 4D RGBD Light Field from a Single Image", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 10 August 2017 (2017-08-10), 201 Olin Library Cornell University Ithaca, NY 14853 , XP080952365 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21936225

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21936225

Country of ref document: EP

Kind code of ref document: A1