US12494013B2 - Autodecoding latent 3D diffusion models - Google Patents
- Publication number
- US12494013B2 (application US18/211,149)
- Authority
- US
- United States
- Prior art keywords
- training
- autodecoder
- latent
- volumetric
- dataset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional [3D] objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/40—Three-dimensional [3D] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—Three-dimensional [3D] image rendering
- G06T15/08—Volume rendering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—Three-dimensional [3D] image rendering
- G06T15/10—Geometric effects
- G06T15/20—Perspective computation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three-dimensional [3D] modelling for computer graphics
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating three-dimensional [3D] models or images for computer graphics
- G06T19/20—Editing of three-dimensional [3D] images, e.g. changing shapes or colours, aligning objects or positioning parts
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/001—Model-based coding, e.g. wire frame
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2210/00—Indexing scheme for image generation or computer graphics
- G06T2210/21—Collision detection, intersection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2219/00—Indexing scheme for manipulating 3D models or images for computer graphics
- G06T2219/20—Indexing scheme for editing of 3D models
- G06T2219/2016—Rotation, translation, scaling
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2219/00—Indexing scheme for manipulating 3D models or images for computer graphics
- G06T2219/20—Indexing scheme for editing of 3D models
- G06T2219/2021—Shape modification
Definitions
- Examples set forth herein generally relate to the generation of static and articulated three-dimensional (3D) images and, in particular, to methods and systems for autodecoding latent 3D diffusion models to embed properties learned from a target dataset in the latent space into a volumetric representation for rendering.
- Photorealistic generation is undergoing rapid improvement. Recent improvements in quality, composition, stylization, resolution, scale, and manipulation capabilities of images were unimaginable just over a year ago. The abundance of online images, often enriched with text, labels, tags, and sometimes per-pixel segmentation, has significantly accelerated such progress. The emergence and development of denoising diffusion probabilistic models (DDPMs) have propelled these advances in image synthesis and other domains such as audio and video.
- FIG. 1 A is a diagram illustrating the full pipeline of a two-stage approach to autodecoding latent 3D diffusion models in a sample configuration.
- FIG. 1 B is a flow chart illustrating the method of autodecoding latent three-dimensional (3D) diffusion models to embed properties learned from a target dataset in a latent space into a volumetric representation of an object for rendering in a sample configuration.
- FIG. 2 is a chart illustrating the impact of diffusion resolution and the number of sampling steps on sample quality and inference time for sample methods.
- FIG. 3 is an illustration depicting generated samples from a model trained using monocular videos from MVImgNet.
- FIG. 4 is an illustration depicting generated samples from a model trained using rendered images from Objaverse.
- FIG. 5 is a graph comparing three diffusion models that were trained with the same time, resources, and number of parameters, for diffusion at three resolutions in the autodecoder: $4^3$, $8^3$, and $16^3$, with the $8^3$ model showing the best trade-off between quality and training speed, rendering it the best option for training on large-scale 3D datasets.
- FIG. 6 is a block diagram of a machine within which instructions (e.g., software, a program, an application, an applet, an app, or other executable code) for causing the machine to perform one or more of the methodologies discussed herein may be executed.
- FIG. 7 is a block diagram showing a software architecture within which examples described herein may be implemented.
- a 3D autodecoder framework embeds properties learned from a target dataset in latent space for use in generating static and articulated 3D assets.
- the 3D autodecoder framework can be decoded into a volumetric representation for rendering view-consistent appearance and geometry.
- the appropriate intermediate volumetric latent space is then identified and robust normalization and de-normalization operations are implemented to learn a 3D diffusion from 2D images or monocular videos of rigid or articulated objects.
- the results are shown to outperform state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects.
- the following description is directed to designing and training denoising diffusion probabilistic models (DDPMs) for 3D-aware content suitable for efficient usage with datasets of various scales.
- the described systems and methods are generic enough to handle both rigid and articulated objects and are versatile enough to learn diverse 3D geometry and appearance from multi-view images and monocular videos of both static and dynamic objects. Recognizing the poses of objects in such data is shown to be important to learning useful 3D representations.
- the systems and methods described herein are thus designed to be robust to the use of ground-truth poses, poses estimated using structure-from-motion, or using no input pose information at all, but rather learning the poses effectively during training.
- the described systems and methods are scalable enough to train on single- or multi-category datasets of large numbers of diverse objects suitable for synthesizing a wide range of realistic content.
- the systems and methods described herein use a volumetric autodecoder to learn the latent space for diffusion sampling.
- an autodecoder maps a one-dimensional (1D) vector to each object in the training set, and thus does not require 3D supervision.
- the autodecoder learns 3D representations from two-dimensional (2D) observations, using rendering consistency as supervision.
- UVA unsupervised volumetric animation
- the resulting 3D representation supports the articulated parts necessary to model non-rigid objects.
- the autodecoders do not have a clear “bottleneck.” Starting with a 1D embedding, the autodecoders upsample the latent features at many resolutions, until finally reaching the output radiance and density volumes. Here, each intermediate volumetric representation could potentially serve as a “bottleneck.”
- autoencoder-based methods typically regularize the bottleneck by imposing a KL-Divergence constraint, meaning diffusion is performed in a normalized space.
- the systems and methods described herein use robust normalization and denormalization operations that can be applied to any layers of a pre-trained and fixed autodecoder. These operations compute robust statistics to perform layer normalization and thus allow the diffusion process to be trained at any intermediate resolution of the autodecoder.
- the inventors have discovered that at fairly low resolutions, the space is compact and provides the necessary regularization for geometry, allowing the training data to contain only sparse observations of each object. The deeper layers, on the other hand, operate more as upsamplers.
- the versatility and scalability of the systems and methods described herein are demonstrated on various tasks involving rigid and articulated 3D object synthesis.
- the model is initially trained using multi-view images and cameras.
- the model is then scaled to hundreds of thousands of diverse objects trained using the real-world MVImgNet dataset, which is beyond the capacity of prior 3D diffusion methods.
- the model is trained on a subset of CelebV-Text, consisting of approximately 44,000 sequences of high-quality videos of human motion.
- Neural radiance fields enable high-quality novel view synthesis (NVS) of rigid scenes learned from 2D images.
- the NeRF approach to volumetric neural rendering has been successfully applied to various tasks, including generating objects suitable for 3D-aware NVS.
- GANs generative adversarial models
- Such works have shown promising results for this task, yet they suffer from limited multi-view consistency from arbitrary viewpoints and experience difficulty in generalizing to multi-category image datasets.
- Existing techniques for estimating gradients for waveform generation, known as Pi-GAN, employ neural rendering with periodic activation functions for generation with view-consistent rendering.
- EG3D and EpiGRAF use tri-plane representations of 3D scenes created by a generator-discriminator framework based on StyleGAN2.
- these techniques require pose estimation from keypoints (e.g., facial features) for training, again limiting the viewpoint range.
- Denoising diffusion probabilistic models represent the generation process as the learned denoising of data progressively corrupted by a sequence of diffusion steps.
- Conventional techniques improving the training objectives, architecture, and sampling process have demonstrated rapid advances in high-quality data generation on various data domains.
- such techniques have primarily shown results for tasks in which samples from the target domain are fully observable, rather than operating in those with only partial observations of the dataset content.
- DiffRF proposes reconstructing per-object NeRF volumes for synthetic datasets, then applying diffusion training on them within a U-Net framework.
- DiffRF requires the reconstruction of many object volumes and is limited to low-resolution volumes due to the diffusion training's high computational cost.
- the framework described below instead operates in the latent space of the autodecoder and effectively shares the learned knowledge from all training data, thus enabling low-resolution, latent 3D diffusion.
- a 3D autoencoder has been previously used for generating 3D shapes, but such methods require ground-truth 3D supervision and only focus on shape generation, with textures added using an off-the-shelf method.
- the framework described below learns to generate the surface appearance and corresponding geometry without such ground-truth 3D supervision.
- the present disclosure addresses the above and other limitations in the prior art by providing systems, methods, and instructions on computer readable media to implement methods of training a three-dimensional (3D) diffusion model to embed properties from two-dimensional (2D) images learned from a target dataset in a latent space using an autodecoder.
- the system includes a volumetric autodecoder (G) that learns embedding vectors of a library of embedding vectors corresponding to objects in a training dataset to generate a latent 3D feature volume and decodes the latent 3D feature volume into a 3D voxel grid for density and radiance representative of an object's shape and appearance.
- the autodecoder includes a first part G 1 and a second part G 2 .
- the system also trains the 3D diffusion model using the volumetric autodecoder.
- the 3D diffusion model operates in a 3D latent space obtained from the first part G 1 using volumetric rendering of the voxel grid with two-dimensional (2D) reconstruction supervision from training images in a training dataset to extract structure and appearance properties from the training dataset.
- the second part G 2 of the volumetric autodecoder generates a 3D representation of the object from the structure and appearance properties extracted from the training dataset.
- the volumetric autodecoder may progressively upsample the latent 3D feature volume to a desired resolution before decoding the upsampled latent 3D feature volume into the 3D voxel grid.
- the volumetric autodecoder also may perform robust normalization on the latent 3D feature volume before training the 3D diffusion model.
- the robust normalization may include taking a median m as a center of distribution of the latent 3D feature volume and approximating a scale of the latent 3D feature volume using a Normalized InterQuartile Range (IQR).
- the features F may be denormalized from the structure and appearance properties extracted from the training dataset by the second part G 2 as $\hat{F} \cdot \mathrm{IQR} + m$ prior to generating the 3D representation of the object.
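The robust normalization and denormalization described above can be sketched in Python with NumPy. This is a minimal illustration, not the patented implementation: it assumes the statistics are computed over all values of the feature volume at once, and it uses the plain interquartile range because the text does not give the constant that makes the IQR "normalized". All function names are hypothetical.

```python
import numpy as np

def robust_stats(features):
    """Return (median, IQR) computed over every value of the feature volume."""
    m = np.median(features)
    q1, q3 = np.percentile(features, [25.0, 75.0])
    return m, q3 - q1

def robust_normalize(features, m, iqr):
    """Center on the median and scale by the IQR (assumed form of the normalization)."""
    return (features - m) / iqr

def robust_denormalize(normalized, m, iqr):
    """Inverse of robust_normalize: F_hat * IQR + m."""
    return normalized * iqr + m

# Long-tailed stand-in for latent features (illustrative data only).
rng = np.random.default_rng(0)
latent = rng.standard_t(df=3, size=(4, 8, 8, 8))

m, iqr = robust_stats(latent)
normalized = robust_normalize(latent, m, iqr)
restored = robust_denormalize(normalized, m, iqr)
```

Median and IQR are far less sensitive to extreme values than mean and standard deviation, which is why they suit a long-tailed feature distribution.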
- the embedding vectors are learned by the volumetric autodecoder and the volumetric autodecoder may provide at least four residual blocks at each resolution in the autodecoder and use self-attention layers in a second level of resolution $8^3$ and in a third level of resolution $16^3$ of the autodecoder.
- the object may be in a canonical pose and the voxel grid may be trained using ground truth poses, poses estimated using structure from motion, or poses learned from the training dataset during training.
- the canonical pose may include a canonical voxel representation of a density grid that is a discrete representation of a density field and a canonical representation that represents a red, green, blue (RGB) radiance field.
- the density values and RGB values may be tri-linearly interpolated from the 3D voxel grid after decoding.
- a background of the training images in the training dataset may be removed prior to training the 3D diffusion model to improve the training efficiency.
- a shape of the object and local motion may be modeled from dynamic poses as well as a corresponding non-rigid deformation of a local region.
- a differentiable Perspective-n-Point algorithm may estimate camera poses for each component of the non-rigid object and progressively refine estimated camera poses during training using a combination of learned 3D keypoints for each component of the non-rigid object and corresponding 2D projections predicted in each image.
- the components may be combined with plausible deformations using a learned volumetric linear blend skinning (LBS) algorithm having skinning weights for each component of the non-rigid object that are estimated during training of the 3D diffusion model.
- each object in the training set may be represented by an embedding vector including a concatenation of smaller embedding vectors.
- a deterministic mapping from each training object index to its corresponding concatenated embedding vector may be implemented using a hashing function where for object index k, the corresponding embedding index is:
- a target non-rigid object also may be decomposed into $N_p$ regions, each containing $N_k$ 3D points $K_p^{3D}$ and corresponding 2D projections $K_p^{2D}$ per image.
- the $N_k$ 3D points $K_p^{3D}$ may be shared across all non-rigid objects and the non-rigid objects may be aligned in a learned canonical space to allow for motion transfer between the non-rigid objects.
- the training may include extracting a text description of an object in the training dataset by providing a hint and a first view of the object along with a question requesting a description of a shape and color of the object for use in an inference stage to identify the object.
- The methodology for autodecoding latent 3D diffusion models will now be described in detail with reference to FIGS. 1-7.
- Although this description provides a detailed description of possible implementations, it should be noted that these details are intended to be exemplary and in no way delimit the scope of the inventive subject matter.
- the methods and systems for autodecoding latent 3D diffusion models in exemplary configurations implement a two-stage approach.
- an autodecoder G containing a library of embedding vectors corresponding to the objects in the training dataset is learned. These vectors are first processed to create a low-resolution, latent 3D feature volume, which is then progressively upsampled and finally decoded into a voxelized representation of the generated object's shape and appearance.
- the resulting voxel grid is trained using volumetric rendering techniques on this volume, with 2D reconstruction supervision from the training images.
- the autodecoder is then employed to train a 3D diffusion model operating in the compact, 3D latent space obtained from G 1 .
- this 3D diffusion process allows the voxel grid to be used to efficiently generate diverse and realistic 3D content.
- FIG. 1 A illustrates the full pipeline 100 of the two-stage approach in a sample configuration.
- In Stage 1 (volumetric autodecoding), the autodecoder 110 learns to assign each training set object a 1D embedding 120 that is processed by G 1 into a latent volumetric space 130.
- Features of the latent volumes generated by G 1 112 may be normalized into the same range by robust normalization 140 to allow for a uniform set of diffusion hyper-parameters in the sampled volume for all datasets and all trained autodecoders 110 .
- G 2 114 decodes these volumes in latent volumetric space 130 into larger radiance volumes 150 suitable for rendering.
- In Stage 2, the autodecoder 110 parameters are frozen.
- the uniform set of diffusion hyper-parameters in the sampled volume 160 are then used to train the 3D denoising diffusion process (e.g., 3D U-Net) 170 to generate a 3D noise volume 180 from which the object is rendered by G 2 at inference time.
- At inference time, G 1 112 is not used. Instead, the 3D noise volume 180 is randomly sampled and iteratively denoised using the 3D U-Net 170 at 185, robust denormalization is performed at 190, and decoding is performed by G 2 114 for rendering to produce the generated radiance volume 195.
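The order of operations at inference (sample noise, denoise iteratively, denormalize, decode) can be sketched as follows. This shows control flow only, under loud assumptions: `denoise_step` and `decode` are hypothetical callables standing in for the trained 3D U-Net and for G 2, the toy denoiser below is not a real diffusion sampler, and the median and IQR values are placeholders.

```python
import numpy as np

def generate(denoise_step, decode, shape, n_steps, median, iqr, rng):
    """Sample a noise volume, denoise it, undo the robust normalization, then decode."""
    x = rng.standard_normal(shape)        # randomly sampled 3D noise volume
    for step in range(n_steps):
        x = denoise_step(x, step)         # one denoising update (hypothetical callable)
    features = x * iqr + median           # robust denormalization
    return decode(features)               # stand-in for the second autodecoder part

def toy_denoise_step(x, step):
    """Placeholder that merely shrinks the volume; a trained network would go here."""
    return 0.5 * x

def toy_decode(features):
    """Placeholder decoder that returns the features unchanged."""
    return features

volume = generate(toy_denoise_step, toy_decode, (2, 8, 8, 8),
                  n_steps=3, median=1.0, iqr=2.0, rng=np.random.default_rng(0))
```

Note that no embedding vector and no first-stage network appears anywhere in this loop, which matches the statement that G 1 is not used during sampling.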
- the volumetric autodecoding architecture, the training procedure and reconstruction losses for the autodecoder 110, and the training and sampling strategies for 3D diffusion in the latent space 130 of the autodecoder 110 are described below.
- a 3D voxel grid is used to represent the 3D structure and appearance of an object. It is assumed that the objects are in their canonical pose, such that the 3D representation is decoupled from the camera poses. This decoupling is desirable for learning compact representations of objects, and also serves as a constraint to learn meaningful 3D structure from 2D images without direct 3D supervision.
- the canonical voxel representation may consist of a density grid $V_{\text{Density}}$ that is a discrete representation of the density field with resolution $S^3$, and $V_{\text{RGB}}$ that represents the RGB radiance field. Volumetric rendering may be employed to integrate the radiance and opacity values along each view ray, similar to NeRFs. In contrast to the original NeRF, however, rather than computing these local values using a multilayer perceptron (MLP), the density and RGB values may be tri-linearly interpolated from the decoded voxel grids.
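As an illustration of rendering directly from decoded voxel grids, the sketch below tri-linearly interpolates density and RGB values at sample points along one ray and composites them with standard NeRF-style alpha compositing. It is a hedged example, not the patented renderer: points are given in voxel-index coordinates, samples outside the grid are clamped to its border, samples are evenly spaced, and all function names are hypothetical.

```python
import numpy as np

def trilinear(grid, pts):
    """Interpolate grid of shape (S, S, S, C) at pts of shape (N, 3) in voxel coordinates."""
    S = grid.shape[0]
    pts = np.clip(pts, 0.0, S - 1.0)                   # clamp to the grid (assumption)
    i0 = np.minimum(np.floor(pts).astype(int), S - 2)  # lower corner of each cell
    f = pts - i0                                       # fractional offset in [0, 1]
    out = np.zeros((pts.shape[0], grid.shape[-1]))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wx = f[:, 0] if dx else 1.0 - f[:, 0]
                wy = f[:, 1] if dy else 1.0 - f[:, 1]
                wz = f[:, 2] if dz else 1.0 - f[:, 2]
                corner = grid[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
                out += (wx * wy * wz)[:, None] * corner
    return out

def render_ray(density_grid, rgb_grid, origin, direction, t_far, n_samples=64):
    """Return (color, occupancy) for one ray through grids (S, S, S) and (S, S, S, 3)."""
    t = np.linspace(0.0, t_far, n_samples)
    pts = np.asarray(origin, float) + t[:, None] * np.asarray(direction, float)
    sigma = np.maximum(trilinear(density_grid[..., None], pts)[:, 0], 0.0)
    rgb = trilinear(rgb_grid, pts)
    alpha = 1.0 - np.exp(-sigma * (t[1] - t[0]))       # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights.sum()

S = 8
dense = np.full((S, S, S), 50.0)                       # opaque toy volume
red = np.zeros((S, S, S, 3)); red[..., 0] = 1.0
color, occupancy = render_ray(dense, red, (0.0, 3.5, 3.5), (1.0, 0.0, 0.0), t_far=7.0)
```

The summed weights give the rendered occupancy for the ray, which is the quantity a silhouette or foreground loss would compare against a mask.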
- the 3D voxel grids for density and radiance, $V_{\text{Density}}$ and $V_{\text{RGB}}$, may be generated by a volumetric autodecoder G 110 that is trained using rendering supervision from 2D images.
- the method directly generates $V_{\text{Density}}$ and $V_{\text{RGB}}$, rather than intermediate representations such as feature volumes or tri-planes, as it is more efficient to render and ensures consistency across multiple views. It is noted that feature volumes and tri-planes involve running an MLP pass for each sampled point, which may incur significant computational cost and memory use during training and inference.
- the autodecoder 110 is trained in the manner of large scale generative networks across various object categories from large-scale multi-view or monocular video datasets.
- the architecture of the autodecoder 110 is adapted so that the framework supports large-scale datasets, which make it challenging to design an autodecoder 110 capable of generating high-quality 3D content across various categories.
- a very high-capacity autodecoder 110 is desired.
- a relatively basic decoder is extended to support the diverse shapes and appearances in the target datasets by increasing the length of the embedding vectors learned by the decoder from 64 to 1024 for the autodecoder 110 .
- the number of residual blocks at each resolution in the autodecoder 110 also may be increased from 1 to 4.
- self-attention layers may be introduced in the second and third levels (resolutions $8^3$ and $16^3$).
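The capacity changes listed above can be collected in a small configuration object. This is only an illustrative sketch: the class and field names are hypothetical, and the only values taken from the text are the embedding length (64 raised to 1024), the residual blocks per resolution (1 raised to 4), and the self-attention resolutions ($8^3$ and $16^3$).

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class AutodecoderConfig:
    """Hypothetical container for the architecture settings named in the text."""
    embedding_length: int = 1024
    residual_blocks_per_resolution: int = 4
    # Side lengths of the cubic levels that receive self-attention layers.
    self_attention_resolutions: Tuple[int, ...] = (8, 16)

    def uses_self_attention(self, side: int) -> bool:
        """True if the level with resolution side^3 has a self-attention layer."""
        return side in self.self_attention_resolutions

# The "relatively basic decoder" described in the text, for comparison.
basic = AutodecoderConfig(embedding_length=64,
                          residual_blocks_per_resolution=1,
                          self_attention_resolutions=())
extended = AutodecoderConfig()
```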
- the autodecoder 110 is trained from image data through analysis-by-synthesis, with the primary objective of minimizing the difference between the rendered images of the autodecoder 110 and the training images.
- RGB color image C is rendered using volumetric rendering. Additionally, in order to supervise silhouettes of the objects, a 2D occupancy mask O is also rendered.
- a pyramidal perceptual loss is employed on the rendered images as the primary reconstruction loss:
- the background is removed.
- When the color of the object is black (which corresponds to the absence of density), the network can make the object semi-transparent.
- To address this, a foreground supervision loss may be used.
- Using binary foreground masks (estimated by an off-the-shelf matting method such as Segment Anything, or synthetic ground-truth masks, depending on the dataset), an L1 loss is applied on the rendered occupancy map to match that of the mask corresponding to the image as follows:
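The exact loss expression is not reproduced in this text. A generic L1 loss between a rendered occupancy map and a binary foreground mask could look like the sketch below, which assumes a mean reduction over pixels; the function name and the toy values are illustrative only.

```python
import numpy as np

def foreground_l1_loss(rendered_occupancy, mask):
    """Mean absolute difference between rendered occupancy and a binary mask."""
    rendered_occupancy = np.asarray(rendered_occupancy, dtype=float)
    mask = np.asarray(mask, dtype=float)
    return float(np.mean(np.abs(rendered_occupancy - mask)))

# Toy 2x2 example: one foreground pixel is rendered as half transparent.
occupancy = np.array([[1.0, 0.5],
                      [0.0, 0.0]])
mask = np.array([[1.0, 1.0],
                 [0.0, 0.0]])
loss = foreground_l1_loss(occupancy, mask)
```

A semi-transparent pixel inside the mask contributes a nonzero penalty, which is how such a loss discourages the semi-transparency described above.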
- a subject's shape and local motion may be modeled from dynamic poses, as well as the corresponding non-rigid deformation of local regions. It is assumed that these sequences can be decomposed into a set of $N_p$ smaller, rigid components (e.g., 10) whose poses can be estimated for consistent alignment in the canonical 3D space.
- the camera poses for each component are estimated and progressively refined during training, using a combination of learned 3D keypoints for each component of the depicted subject and the corresponding 2D projections predicted in each image. This estimation may be performed via a differentiable Perspective-n-Point (PnP) algorithm.
- a learned volumetric linear blend skinning (LBS) algorithm may be employed.
- a voxel grid $V_{\text{LBS}}$ may be introduced to represent the skinning weights for each deformation component.
- the skinning weights for each component are also estimated during training.
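Linear blend skinning in its generic form moves each point by a weighted sum of per-component rigid transforms. The sketch below shows that generic form only. It assumes the skinning weights have already been looked up for each point (in the described system they would come from the learned voxel grid), that they sum to one per point, and that the transforms map into the target pose; the direction of the mapping is not stated in this text, and all names are hypothetical.

```python
import numpy as np

def blend_skinning(points, weights, rotations, translations):
    """Deform points (N, 3) with P rigid parts.

    weights:      (N, P) skinning weights, each row summing to 1
    rotations:    (P, 3, 3) rotation matrix per part
    translations: (P, 3) translation per part
    """
    # transformed[p, n] = R_p @ x_n + t_p
    transformed = np.einsum('pij,nj->pni', rotations, points) + translations[:, None, :]
    # Blend the per-part results with the per-point weights.
    return np.einsum('np,pni->ni', weights, transformed)

points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 2.0, 3.0]])
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0],
                         [1.0, 0.0, 0.0]])   # second part shifts along x
weights = np.array([[1.0, 0.0],              # first point follows part 0 only
                    [0.5, 0.5]])             # second point is split evenly
deformed = blend_skinning(points, weights, rotations, translations)
```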
- the diffusion model architecture used by the systems and methods described herein extends prior art diffusion techniques in a 2D space to the latent 3D space. Its 2D operations, including convolutions and self-attention layers, are implemented in the 3D decoder space. In text-conditioning experiments, a similar cross-attention layer is used after the self-attention layer.
- the features F in the latent space of the 3D autodecoder 110 have a bell-shaped distribution, which eliminates the need to enforce any form of prior on it. Operating in the latent space without a prior enables training a single autodecoder 110 for each of the possible latent diffusion resolutions.
- the feature distribution F has very long tails. It is hypothesized that this occurs because the final density values inferred by the network do not have any natural bounds and thus can fall within any range. In fact, the network is encouraged to make such predictions, as they have the sharpest boundaries between the surface and empty regions.
- the methods may rely upon the sampling method from Expressing Diffusion Models (EDM) in a common framework as described by Karras, et al. in “Elucidating the Design Space of Diffusion-Based Generative Models,” (Arxiv.org/abs/2206.00364) with several slight modifications.
- the EDM's hyper-parameter matching the dataset's distribution is fixed to 0.5 regardless of the experiment, and the feature statistics are modified in the feature processing step.
- Classifier free guidance is also introduced for text-conditioning experiments. It was found that setting the weight equal to 3 yields good results across all datasets.
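Classifier-free guidance combines a conditional and an unconditional prediction of the same network. The sketch below uses one common formulation, in which the guidance weight scales the difference between the two predictions. The text does not state which convention its weight of 3 refers to, so this form is an assumption, and the function name is hypothetical.

```python
import numpy as np

def guided_prediction(conditional, unconditional, weight=3.0):
    """Blend predictions: unconditional + weight * (conditional - unconditional)."""
    conditional = np.asarray(conditional, dtype=float)
    unconditional = np.asarray(unconditional, dtype=float)
    return unconditional + weight * (conditional - unconditional)

cond = np.array([2.0, 0.0])
uncond = np.array([1.0, 0.0])
guided = guided_prediction(cond, uncond, weight=3.0)
```

In this formulation a weight of 1 reproduces the conditional prediction and a weight of 0 reproduces the unconditional one; larger weights push the sample further toward the text condition.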
- FIG. 1 B is a flow chart illustrating the method 200 of autodecoding latent three-dimensional (3D) diffusion models to embed properties learned from a target dataset in a latent space into a volumetric representation of an object for rendering in a sample configuration.
- the method includes processing at 210 one dimensional (1D) embedding vectors 120 of an autodecoder (G) 110 comprising a library of embedding vectors corresponding to objects in a training dataset to generate a latent 3D feature volume 130 .
- the latent 3D feature volume 130 is progressively upsampled to a desired resolution.
- robust normalization 140 is performed on the latent 3D feature volume 130 to normalize the features from the latent 3D feature volume 130 before decoding, at 240 , the latent 3D feature volume 130 into a 3D voxel grid 150 for density and radiance representative of an object's shape and appearance.
- the second part G 2 of the autodecoder 110 uses the structure and appearance properties extracted from the training dataset to generate a 3D representation of the object.
- the features F of the object are denormalized using robust denormalization at 280 to complete the generation of the 3D representation of the object.
- the datasets used for evaluations of the methods are listed below.
- the methods were mostly evaluated on datasets of synthetic renderings of 3D objects.
- results are also provided for a challenging video dataset of dynamic human subjects and a dataset of static object videos.
- the datasets include:
- ABO Tables. The methods were evaluated on renderings of objects from the Tables subset of the Amazon Berkeley Objects (ABO) dataset, consisting of 1,676 training sequences with 91 renderings per sequence, for a total of 152,516 renderings.
- PhotoShape Chairs. Images from the Chairs subset of the PhotoShape dataset were used, totaling 3,115,200 frames, with 200 renderings for each of 15,576 chair models.
- Objaverse. This dataset contains approximately 800,000 publicly available 3D models. As the object geometry and appearance vary, a manually filtered subset of approximately 300,000 unique objects was used. Six images were rendered per training object, for a total of approximately 1.8 million frames.
- MVImgNet. For this dataset, approximately 6.5 million frames were used from 219,188 videos of real-world objects from 239 categories, with an average of 30 frames each. Grounded Segment Anything was used for background removal, then filtering was applied to remove objects with failed segmentation. This process resulted in 206,990 usable objects.
- CelebV-Text. The CelebV-Text dataset consists of approximately 70,000 sequences of high-quality videos of celebrities captured in in-the-wild environments, lighting, motion, and poses. They generally depict the head, neck, and upper-torso region, but contain more challenging pose and motion variation than prior datasets, e.g., VoxCeleb.
- a robust video matting framework was used to obtain the masks for foreground supervision. Some sample filtering was used for sufficient video quality and continuity for training. This produced approximately 44,400 unique videos, with an average of approximately 373 frames each, totaling approximately 16.6 million frames.
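The frame counts quoted for the datasets follow from the per-object figures. The short check below multiplies them out; the variable names are illustrative, and the averages for MVImgNet and CelebV-Text are the approximate values stated in the text, so those products are estimates rather than exact totals.

```python
# Exact totals stated in the text: (objects, renderings per object, stated total).
exact = {
    "ABO Tables": (1_676, 91, 152_516),
    "PhotoShape Chairs": (15_576, 200, 3_115_200),
}
exact_ok = {name: n * per == total for name, (n, per, total) in exact.items()}

# Approximate totals: objects or videos times the stated (average) frames each.
estimates = {
    "Objaverse": 300_000 * 6,      # stated as approximately 1.8 million
    "MVImgNet": 219_188 * 30,      # stated as approximately 6.5 million
    "CelebV-Text": 44_400 * 373,   # stated as approximately 16.6 million
}

for name, ok in exact_ok.items():
    print(f"{name}: stated total matches product = {ok}")
for name, frames in estimates.items():
    print(f"{name}: about {frames / 1e6:.2f} million frames")
```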
- Table 3 below provides the peak signal to noise ratio (PSNR) and the learned perceptual image patch similarity (LPIPS) reconstruction metrics.
- FIG. 2 illustrates the impact of diffusion resolution and the number of sampling steps on sample quality and inference time for the methods of sample configurations.
- A clear distinction can be seen in FIG. 2 between the results obtained from diffusion at the earlier or later autodecoder stages and the results of the disclosed methods at resolution $8^3$.
- the quality degrades significantly.
- Training at a higher resolution requires substantial resources, limiting the convergence seen in a reasonable amount of time.
- the number of sampling steps has a smaller, more variable impact. Going from 16 to 32 steps improves the results with a reasonable increase in inference time, but at 64 steps, the largest improvement is at the $16^3$ resolution, which requires more than 30 seconds per sample.
- the chosen diffusion resolution of $8^3$ achieves the best results, allowing for high sample quality at 64 steps with only approximately 8 seconds of computation, while still providing reasonable results with 32 steps in approximately 4 seconds.
- FIG. 3 shows generated samples from a model trained using monocular videos from MVImgNet.
- FIG. 4 shows generated samples of the model trained using rendered images from Objaverse.
- In FIGS. 3 and 4, three views are shown for each object, along with the normals for each view. Depth is also shown for the right-most view as well as text-conditioned results.
- Ground-truth captions are generated by MiniGPT-4. It can be seen that in all cases, the disclosed methods generate objects with reasonable geometry that generally follow the prompt. However, some details can be missing. The disclosed model appears to learn to ignore certain details from text prompts, as MiniGPT-4 often hallucinates details inconsistent with the object's appearance. Better captioning systems should help alleviate this issue in the future.
- the methods described herein demonstrate this is possible with the right approach.
- the disclosed methods learn representations of the structure and appearance of diverse and complex content suitable for generating high-fidelity 3D objects using only 2D supervision.
- the latent volumetric representation is conducive to 3D diffusion modeling for both conditional and unconditional generation, while enabling view-consistent rendering of the synthesized objects.
- this generalizes well to various types of domains and datasets, from relatively small, single-category, synthetic renderings to large-scale, multi-category real-world datasets. It also supports the challenging task of generating articulated moving objects from videos.
- the disclosed methods focus on images and videos with foregrounds depicting one key person or object.
- the generation or composition of more complex, multi-object scenes is a challenging task and an interesting direction for future developments.
- because the methods use multi-view or video sequences of each object in the dataset for training, single-image datasets are not supported. Learning the appearance and geometry of diverse content for controllable 3D generation and animation from such limited data is quite challenging, especially for articulated objects.
- using general knowledge about shape, motion, and appearance extracted from datasets like those used herein to reduce or remove the multi-image requirement when learning to generate additional object categories may be feasible for such purposes. This would allow the generation of content learned from image datasets of potentially unbounded size and diversity.
- the 2D occupancy map O is also rendered using the volumetric equation:
- poses may be estimated during training, using a set of learnable 3D keypoints $K^{3D}$ and their predicted 2D projections $K^{2D}$ in each image in an extended version of the Perspective-n-Point (PnP) algorithm.
- the target subjects can be decomposed into $N_p$ regions, each containing $N_k$ 3D points $K_p^{3D}$ and their corresponding 2D projections $K_p^{2D}$ per image. These points are shared across all subjects, and are aligned in the learned canonical space, allowing for realistic generation and motion transfer between these subjects. This allows for learning $N_p$ poses per frame defining the pose of each region $p$ relative to its pose in the learned canonical pose.
- Successful reconstruction of the training images for each subject includes learning the appropriate canonical locations for each region's 3D keypoints, to predict the 2D projections of these keypoints in each frame, and the pose best matching the 3D points and 2D projections for these regions.
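The relationship between 3D keypoints, their 2D projections, and a pose can be illustrated with the reprojection error that PnP-style solvers minimize. The sketch below is only an illustration under stated assumptions: it uses a simple pinhole camera with a hypothetical focal length and the principal point at the origin, and it does not reproduce the extended PnP algorithm referred to above. All function and variable names are hypothetical.

```python
def project(point_3d, rotation, translation, focal=1.0):
    """Project a 3D point to 2D with a pinhole camera (assumed model)."""
    # Rigidly move the point into the camera frame: cam = R * x + t.
    cam = [
        sum(rotation[i][j] * point_3d[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division.
    return (focal * cam[0] / cam[2], focal * cam[1] / cam[2])


def reprojection_error(points_3d, points_2d, rotation, translation, focal=1.0):
    """Mean squared distance between projected 3D keypoints and observed 2D keypoints."""
    total = 0.0
    for p3, p2 in zip(points_3d, points_2d):
        u, v = project(p3, rotation, translation, focal)
        total += (u - p2[0]) ** 2 + (v - p2[1]) ** 2
    return total / len(points_3d)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical canonical keypoints for one region and the pose that generated the observations.
keypoints_3d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
true_translation = [0.0, 0.0, 2.0]
observed_2d = [project(p, IDENTITY, true_translation) for p in keypoints_3d]

error_at_true_pose = reprojection_error(keypoints_3d, observed_2d, IDENTITY, true_translation)
error_at_wrong_pose = reprojection_error(keypoints_3d, observed_2d, IDENTITY, [0.1, 0.0, 2.0])
```

In this toy setup the error vanishes at the pose that generated the observations and grows as the candidate pose moves away from it, which is the signal a pose estimate would be fitted against.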
- This information is then used in the volumetric rendering framework to sample appropriately from the canonical space such that the subject's appearance and pose are consistent and appropriate throughout their video sequence.
- this information can be learned along with the autodecoder parameters for articulated objects using the reconstruction and foreground supervision losses used for the rigid object datasets.
- volumetric linear blend skinning may be employed. This allows the weight each component p in the canonical space contributes to a sampled point in the deformed space based on the spatial correspondence between these two spaces to be learned:
- $x_d$ is the 3D point deformed to correspond to the current pose
- $x_c$ is its corresponding point when aligned in the canonical volume
- $w_p^c(x_c)$ is the learned LBS weight for component $p$, sampled at position $x_c$ in the volume, used to define this correspondence. It is noted that an approximate solution may be computed using the inverse LBS weights following Human NeRF to avoid the excessive computation used by the direct solution.
- the autodecoder 110 learns to produce a volume $V^{LBS}$ containing the LBS weights for each of the $N_p$ locally rigid regions constituting the subject.
- the deformation can be completely described by establishing correspondences between each point x d in the deformed space and points x c in the canonical space. Such correspondence is established through Linear Blend Skinning (LBS) as follows:
- $w_p^c(x)$ is a weight assigned to each part $p$.
- LBS weights segment the object into different parts. As an example, a point with LBS weight equal to 1.0 for the left hand will always move according to the transformation for the left hand.
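The behavior of LBS weights can be sketched with the commonly used linear blend skinning form, in which a deformed point is the weighted sum of per-part rigid transforms applied to the canonical point. This form, and the per-part rotation and translation names used below, are assumptions made for illustration; the sketch does not reproduce the disclosure's own equation.

```python
def transform(rotation, translation, point):
    """Apply a rigid transform (3x3 rotation as rows, 3-vector translation) to a 3D point."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def blend_skinning(x_c, weights, rotations, translations):
    """Deform canonical point x_c as the weighted sum of per-part rigid transforms."""
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("LBS weights are assumed to sum to 1")
    x_d = [0.0, 0.0, 0.0]
    for w, rot, trans in zip(weights, rotations, translations):
        moved = transform(rot, trans, x_c)
        x_d = [acc + w * m for acc, m in zip(x_d, moved)]
    return x_d


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Two hypothetical parts: part 0 stays in place, part 1 shifts by +2 along x.
rotations = [IDENTITY, IDENTITY]
translations = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]

# Weight 1.0 on part 1: the point follows part 1 exactly.
rigid = blend_skinning([1.0, 1.0, 1.0], [0.0, 1.0], rotations, translations)
# Equal weights: the point moves halfway between the two part transforms.
blended = blend_skinning([1.0, 1.0, 1.0], [0.5, 0.5], rotations, translations)
```

A point whose weight is 1.0 for a single part moves rigidly with that part, while intermediate weights interpolate between part motions, which is what lets the weights act as a soft segmentation of the object.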
- Eq. (6) for $x_c$. This procedure is prohibitively expensive, so an approximate solution may be used that defines inverse LBS weights $w_p^d$ such that:
- For the volumetric autodecoder 110 architecture described above, given an embedding vector e of size 1024, a fully-connected layer may be used followed by a reshape operation to transform it into a $4^3$ volume with 512 features per cell. This is followed by a series of four 3D residual blocks, each of which upsamples the volume resolution in each dimension and halves the features per cell, to a final resolution of $64^3$ and 32 features. These blocks consist of two 3×3×3 convolution blocks, each followed by batch normalization, in the main path, while the residual path consists of four 1×1×1 convolutions, with a rectified linear activation unit (ReLU) function applied after these operations.
- the $8^3$ volume may be obtained with 256 features per cell used for training the diffusion network.
- self-attention layers are applied.
- a final batch normalization is applied followed by a 1×1×1 convolution to produce the final 1+3 density $V^{Density}$ and RGB color features $V^{RGB}$ used in the volumetric renderer.
- the unsupervised 2D keypoint predictor uses the U-Net architecture, which operates on a downsampled 64×64 input image to predict the locations of the keypoints corresponding to each of the 3D keypoints used to determine the pose of the camera relative to each region of the subject when it is aligned in the canonical volumetric space.
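The resolution and feature counts stated for the autodecoder can be checked with a short shape trace. The sketch below assumes that each residual block doubles the resolution per dimension, which is the factor implied by going from 4 to 64 in four blocks. It tracks shapes only and is not an implementation of the network; the function name is hypothetical.

```python
def autodecoder_shapes(start_resolution=4, start_features=512, num_blocks=4):
    """Trace (resolution per dimension, features per cell) through the residual blocks."""
    shapes = [(start_resolution, start_features)]
    for _ in range(num_blocks):
        resolution, features = shapes[-1]
        # Assumed: each block doubles resolution and halves the feature count.
        shapes.append((resolution * 2, features // 2))
    return shapes


for resolution, features in autodecoder_shapes():
    cells = resolution ** 3
    print(f"{resolution}^3 volume, {features} features per cell, {cells * features} values")
```

Under this assumption the trace passes through an 8-per-dimension volume with 256 features per cell after the first block and ends at 64 per dimension with 32 features, consistent with the figures given in the text.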
- the Ablated Diffusion Model (ADM) may be used, which is a U-Net architecture originally designed for 2D image synthesis. Preconditioning enhancements to this model may be incorporated. As this architecture was originally designed for 2D, all convolutions and normalizations operations, as well as the attention mechanisms, are adapted to 3D. For the cross-attention mechanism used for the conditioning experiments, the latent-space cross-attention mechanism is likewise extended to the 3D latent space.
- the model was trained for text-conditioned image generation on three datasets: CelebV-Text, MVImgNet, and Objaverse.
- the two latter datasets provide the object category of each sample, but they do not provide text descriptions.
- Using MiniGPT-4, a description was extracted by providing a hint and the first view of each object along with the question: “<Img><ImageHere></Img> Describe this <hint> in one sentence. Describe its shape and color. Be concise, use only a single sentence.”
- for MVImgNet, this hint is the “class name”, while for Objaverse it is the “asset name”.
- the assigned description may then be used in the inference stage to identify the object to be rendered.
- the 11-billion parameter T5 model was used to extract a sequence of text-embedding vectors.
- the dimensionality of these vectors was 1024.
- the length of the embedding sequence was fixed to 32 elements. Longer sentences were trimmed and shorter sentences were padded with zeros.
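The trimming and zero-padding step can be sketched as follows. This is a minimal illustration that represents the embedding sequence as a list of equal-length vectors. The function name and the choice to pad at the end of the sequence are assumptions, since the text does not state where padding is placed.

```python
def fix_sequence_length(embeddings, target_length=32, dim=1024):
    """Trim or zero-pad a sequence of embedding vectors to exactly target_length entries."""
    for vector in embeddings:
        if len(vector) != dim:
            raise ValueError("every embedding vector must have length dim")
    # Keep at most target_length vectors (longer sequences are trimmed).
    trimmed = [list(vector) for vector in embeddings[:target_length]]
    # Append all-zero vectors until the target length is reached (assumed end padding).
    padding = [[0.0] * dim for _ in range(target_length - len(trimmed))]
    return trimmed + padding


# Hypothetical short and long sequences with a small dimension for readability.
short_sequence = [[1.0] * 4 for _ in range(5)]
long_sequence = [[1.0] * 4 for _ in range(40)]
padded = fix_sequence_length(short_sequence, target_length=32, dim=4)
trimmed = fix_sequence_length(long_sequence, target_length=32, dim=4)
```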
- Each object in the training set was encoded by an embedding vector.
- As multi-view datasets of various scales were employed, up to approximately 300,000 unique targets from multiple categories, storing a separate embedding vector for each object depicted in the training images is burdensome. As such, a technique may be used enabling the effective use of a significantly reduced number of embeddings (no more than approximately 32,000 needed for any of the evaluations), while allowing effective content generation from large-scale datasets. Concatenations of smaller embedding vectors were employed to create more combinations of unique embedding vectors used during training.
- the input embedding vector $H_k$ used for an object to be decoded is a concatenation of smaller embedding vectors $h_i^j$, where each vector is selected from an ordered codebook with $n_c$ entries, with each entry containing a collection of $n_h$ embedding vectors of length $l_v / n_c$:
- $H_k = \left[ h_1^{k_1}, h_2^{k_2}, \ldots, h_{n_c}^{k_{n_c}} \right], \quad (9)$ where $k_i \in \{1, 2, \ldots, n_h\}$ is the set of indices used to select from the $n_h$ possible codebook entries for position $i$ in the final vector.
- This method allows for exponentially more combinations of embedding vectors to be provided during training than must be stored in the learned embedding vector library.
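The storage saving from concatenated codebook entries can be sketched as below. The example uses randomly initialized placeholder vectors where a trained system would use learned parameters, 0-based indices where the text uses 1-based indices, and small hypothetical sizes. All names are illustrative assumptions.

```python
import random


def build_codebook(n_c, n_h, l_v, seed=0):
    """Create n_c positions, each holding n_h sub-vectors of length l_v / n_c."""
    if l_v % n_c != 0:
        raise ValueError("l_v must be divisible by n_c")
    rng = random.Random(seed)
    sub_length = l_v // n_c
    # Placeholder values; in training these would be learned embeddings.
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(sub_length)] for _ in range(n_h)]
        for _ in range(n_c)
    ]


def concat_embedding(codebook, indices):
    """Concatenate one sub-vector per position, selected by the given indices."""
    if len(indices) != len(codebook):
        raise ValueError("one index is required per codebook position")
    embedding = []
    for position, index in enumerate(indices):
        embedding.extend(codebook[position][index])
    return embedding


n_c, n_h, l_v = 4, 8, 16  # hypothetical sizes
codebook = build_codebook(n_c, n_h, l_v)
H_k = concat_embedding(codebook, [0, 3, 7, 2])

stored_vectors = n_c * n_h           # sub-vectors actually stored
distinct_embeddings = n_h ** n_c     # distinct concatenations available
```

With these toy sizes, 32 stored sub-vectors yield 4,096 distinct concatenated embeddings, and the gap widens exponentially as the number of positions grows.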
- while the index $j$ for the vector $h_i^j$ at position $i$ may be randomly selected for each position to access its corresponding codebook entry, the inventors instead used a deterministic mapping from each training object index to its corresponding concatenated embedding vector.
- This function is implemented using a hashing function employing a multiplication method for fast indexing using efficient bitwise operations.
- for object index $k$, the corresponding embedding index is:
- $m(k) = \left[ (a \cdot k) \bmod 2^{w} \right] \gg (w - r), \quad (10)$
- $w$ and $a$ are heuristic hashing parameters used to reduce the number of collisions while maintaining an appropriate table size. 32 was used for $w$, while $a$ must be an odd integer between $2^{w-1}$ and $2^{w}$.
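The multiplication-method hash can be sketched in a few lines. In the standard form of this method the result is an r-bit value, so the table has 2^r slots; that reading of r, the default r of 10, and the particular multiplier used below are assumptions for illustration and are not values given in the text.

```python
def multiplicative_hash(k, a, w=32, r=10):
    """Map object index k to an r-bit table index using the multiplication method."""
    if a % 2 == 0 or not (2 ** (w - 1) < a < 2 ** w):
        raise ValueError("a must be an odd integer between 2^(w-1) and 2^w")
    # Keep the low w bits of a*k, then take the top r of those bits with a right shift.
    return ((a * k) % (1 << w)) >> (w - r)


# Hypothetical multiplier: an odd 32-bit constant chosen only for this example.
A = 2654435769
indices = [multiplicative_hash(k, A) for k in range(5)]
```

Because the modulus is a power of two, the `% (1 << w)` step is equivalent to a bitwise mask, which is what makes this mapping inexpensive to evaluate for every training object.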
- Each smaller codebook was given its own a value:
- FIG. 6 is a diagrammatic representation of the machine 600 within which instructions 610 (e.g., software, a program, an application, an applet, an app, or other executable code) for causing the machine 600 to perform one or more of the methodologies discussed herein may be executed.
- the instructions 610 may cause the machine 600 to execute one or more of the methods described herein.
- the instructions 610 transform the general, non-programmed machine 600 into a particular machine 600 programmed to carry out the described and illustrated functions in the manner described.
- the machine 600 may operate as a standalone device or may be coupled (e.g., networked) to other machines.
- the machine 600 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
- the machine 600 may include, but not be limited to, a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, a netbook, a set-top box (STB), a personal digital assistant (PDA), an entertainment media system, a cellular telephone, a smartphone, a mobile device, a wearable device (e.g., a smartwatch), a smart home device (e.g., a smart appliance), other smart devices, a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 610 , sequentially or otherwise, that specify actions to be taken by the machine 600 .
- machine 600 may implement the two-stage pipeline 100 of FIG. 1 A .
- the machine 600 may also include both client and server systems, with certain operations of a particular method or algorithm being performed on the server-side and with certain operations of the particular method or algorithm being performed on the client-side.
- the machine 600 may include processors 604 , memory 606 , and input/output I/O components 602 , which may be configured to communicate with each other via a bus 640 .
- the processors 604 (e.g., a Central Processing Unit (CPU), a Reduced Instruction Set Computing (RISC) Processor, a Complex Instruction Set Computing (CISC) Processor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Radio-Frequency Integrated Circuit (RFIC), another processor, or any suitable combination thereof) may include, for example, a processor 608 and a processor 612 that execute the instructions 610.
- processor is intended to include multi-core processors that may include two or more independent processors (sometimes referred to as “cores”) that may execute instructions contemporaneously.
- Although FIG. 6 shows multiple processors 604, the machine 600 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor), multiple processors with a single core, multiple processors with multiple cores, or any combination thereof.
- the memory 606 includes a main memory 614, a static memory 616, and a storage unit 618, all accessible to the processors 604 via the bus 640.
- the main memory 614, the static memory 616, and the storage unit 618 store the instructions 610 for one or more of the methodologies or functions described herein.
- the instructions 610 may also reside, completely or partially, within the main memory 614, within the static memory 616, within machine-readable medium 620 within the storage unit 618, within at least one of the processors 604 (e.g., within the processor's cache memory), or any suitable combination thereof, during execution thereof by the machine 600.
- the I/O components 602 may include a wide variety of components to receive input, provide output, produce output, transmit information, exchange information, capture measurements, and so on.
- the specific I/O components 602 that are included in a particular machine will depend on the type of machine. For example, portable machines such as mobile phones may include a touch input device or other such input mechanisms, while a headless server machine will likely not include such a touch input device. It will be appreciated that the I/O components 602 may include many other components that are not shown in FIG. 6 .
- the I/O components 602 may include user output components 626 and user input components 628 .
- the user output components 626 may include visual components (e.g., a display such as a plasma display panel (PDP), a light-emitting diode (LED) display, a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)), acoustic components (e.g., speakers), haptic components (e.g., a vibratory motor, resistance mechanisms), other signal generators, and so forth.
- the user input components 628 may include alphanumeric input components (e.g., a keyboard, a touch screen configured to receive alphanumeric input, a photo-optical keyboard, or other alphanumeric input components), point-based input components (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or another pointing instrument), tactile input components (e.g., a physical button, a touch screen that provides location and force of touches or touch gestures, or other tactile input components), audio input components (e.g., a microphone), and the like.
- the I/O components 602 may include biometric components 630 , motion components 632 , environmental components 634 , or position components 636 , among a wide array of other components.
- the biometric components 630 include components to detect expressions (e.g., hand expressions, facial expressions, vocal expressions, body gestures, or eye-tracking), measure biosignals (e.g., blood pressure, heart rate, body temperature, perspiration, or brain waves), identify a person (e.g., voice identification, retinal identification, facial identification, fingerprint identification, or electroencephalogram-based identification), and the like.
- the motion components 632 include acceleration sensor components (e.g., accelerometer), gravitation sensor components, and rotation sensor components (e.g., gyroscope).
- biometric data collected by the biometric components 630 is captured and stored with only user approval and deleted on user request. Further, such biometric data may be used for very limited purposes, such as identification verification. To ensure limited and authorized use of biometric information and other personally identifiable information (PII), access to this data is restricted to authorized personnel only, if at all. Any use of biometric data may strictly be limited to identification verification purposes, and the biometric data is not shared or sold to any third party without the explicit consent of the user. In addition, appropriate technical and organizational measures are implemented to ensure the security and confidentiality of this sensitive information.
- the environmental components 634 include, for example, one or more cameras (with still image/photograph and video capabilities), illumination sensor components (e.g., photometer), temperature sensor components (e.g., one or more thermometers that detect ambient temperature), humidity sensor components, pressure sensor components (e.g., barometer), acoustic sensor components (e.g., one or more microphones that detect background noise), proximity sensor components (e.g., infrared sensors that detect nearby objects), gas sensors (e.g., gas detection sensors to detect concentrations of hazardous gases for safety or to measure pollutants in the atmosphere), or other components that may provide indications, measurements, or signals corresponding to a surrounding physical environment.
- the position components 636 include location sensor components (e.g., a GPS receiver component), altitude sensor components (e.g., altimeters or barometers that detect air pressure from which altitude may be derived), orientation sensor components (e.g., magnetometers), and the like.
- the I/O components 602 further include communication components 638 operable to couple the machine 600 to a network 622 or devices 624 via respective coupling or connections.
- the communication components 638 may include a network interface component or another suitable device to interface with the network 622.
- the communication components 638 may include wired communication components, wireless communication components, cellular communication components, Near Field Communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components to provide communication via other modalities.
- the devices 624 may be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via a USB).
- the communication components 638 may detect identifiers or include components operable to detect identifiers.
- the communication components 638 may include Radio Frequency Identification (RFID) tag reader components, NFC smart tag detection components, optical reader components (e.g., an optical sensor to detect one-dimensional bar codes such as Universal Product Code (UPC) bar code, multi-dimensional bar codes such as Quick Response (QR) code, Aztec code, Data Matrix, Dataglyph, MaxiCode, PDF417, Ultra Code, UCC RSS-2D bar code, and other optical codes), or acoustic detection components (e.g., microphones to identify tagged audio signals).
- a variety of information may be derived via the communication components 638, such as location via Internet Protocol (IP) geolocation, location via Wi-Fi® signal triangulation, location via detecting an NFC beacon signal that may indicate a particular location, and so forth.
- the various memories may store one or more sets of instructions and data structures (e.g., software) embodying or used by one or more of the methodologies or functions described herein. These instructions (e.g., the instructions 610 ), when executed by processors 604 , cause various operations to implement the disclosed examples.
- the instructions 610 may be transmitted or received over the network 622 , using a transmission medium, via a network interface device (e.g., a network interface component included in the communication components 638 ) and using any one of several well-known transfer protocols (e.g., hypertext transfer protocol (HTTP)). Similarly, the instructions 610 may be transmitted or received using a transmission medium via a coupling (e.g., a peer-to-peer coupling) to the devices 624 .
- FIG. 7 is a block diagram 700 illustrating a software architecture 704 , which can be installed on one or more of the devices described herein.
- the software architecture 704 is supported by hardware such as a machine 702 (see FIG. 6 ) that includes processors 720 , memory 726 , and I/O components 738 .
- the software architecture 704 can be conceptualized as a stack of layers, where each layer provides a particular functionality.
- the software architecture 704 includes layers such as an operating system 712 , libraries 710 , frameworks 708 , and applications 706 .
- the applications 706 invoke API calls 750 through the software stack and receive messages 752 in response to the API calls 750 .
- the operating system 712 manages hardware resources and provides common services.
- the operating system 712 includes, for example, a kernel 714 , services 716 , and drivers 722 .
- the kernel 714 acts as an abstraction layer between the hardware and the other software layers.
- the kernel 714 provides memory management, processor management (e.g., scheduling), component management, networking, and security settings, among other functionality.
- the services 716 can provide other common services for the other software layers.
- the drivers 722 are responsible for controlling or interfacing with the underlying hardware.
- the drivers 722 can include display drivers, camera drivers, BLUETOOTH® or BLUETOOTH® Low Energy drivers, flash memory drivers, serial communication drivers (e.g., USB drivers), WI-FI® drivers, audio drivers, power management drivers, and so forth.
- the libraries 710 provide a common low-level infrastructure used by the applications 706 .
- the libraries 710 can include system libraries 718 (e.g., C standard library) that provide functions such as memory allocation functions, string manipulation functions, mathematic functions, and the like.
- the libraries 710 can include API libraries 724 such as media libraries (e.g., libraries to support presentation and manipulation of various media formats such as Moving Picture Experts Group-4 (MPEG4), Advanced Video Coding (H.264 or AVC), Moving Picture Experts Group Layer-3 (MP3), Advanced Audio Coding (AAC), Adaptive Multi-Rate (AMR) audio codec, Joint Photographic Experts Group (JPEG or JPG), or Portable Network Graphics (PNG)), graphics libraries (e.g., an OpenGL framework used to render in two dimensions (2D) and three dimensions (3D) in a graphic content on a display), database libraries (e.g., SQLite to provide various relational database functions), web libraries (e.g., WebKit to provide web browsing functionality), and the
- the frameworks 708 provide a common high-level infrastructure that is used by the applications 706 .
- the frameworks 708 provide various graphical user interface (GUI) functions, high-level resource management, and high-level location services.
- the frameworks 708 can provide a broad spectrum of other APIs that can be used by the applications 706 , some of which may be specific to a particular operating system or platform.
- the applications 706 may include a home application 736 , a contacts application 730 , a browser application 732 , a book reader application 734 , a location application 742 , a media application 744 , a messaging application 746 , a game application 748 , and a broad assortment of other applications such as a third-party application 740 .
- the applications 706 are programs that execute functions defined in the programs.
- Various programming languages can be employed to generate one or more of the applications 706 , structured in a variety of manners, such as object-oriented programming languages (e.g., Objective-C, Java, or C++) or procedural programming languages (e.g., C or assembly language).
- the third-party application 740 may be mobile software running on a mobile operating system such as IOS™, ANDROID™, WINDOWS® Phone, or another mobile operating system.
- the third-party application 740 can invoke the API calls 750 provided by the operating system 712 to facilitate functionality described herein.
- Carrier signal refers to any intangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible media to facilitate communication of such instructions. Instructions may be transmitted or received over a network using a transmission medium via a network interface device.
- Client device refers to any machine that interfaces to a communications network to obtain resources from one or more server systems or other client devices.
- a client device may be, but is not limited to, a mobile phone, desktop computer, laptop, portable digital assistants (PDAs), smartphones, tablets, ultrabooks, netbooks, laptops, multi-processor systems, microprocessor-based or programmable consumer electronics, game consoles, set-top boxes, or any other communication device that a user may use to access a network.
- Communication network refers to one or more portions of a network that may be an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), the Internet, a portion of the Internet, a portion of the Public Switched Telephone Network (PSTN), a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, a Wi-Fi® network, another type of network, or a combination of two or more such networks.
- a network or a portion of a network may include a wireless or cellular network and the coupling may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile communications (GSM) connection, or other types of cellular or wireless coupling.
- the coupling may implement any of a variety of types of data transfer technology, such as Single Carrier Radio Transmission Technology (1×RTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, Third Generation Partnership Project (3GPP) including 3G, fourth generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), Long Term Evolution (LTE) standard, others defined by various standard-setting organizations, other long-range protocols, or other data transfer technology.
- Component refers to a device, physical entity, or logic having boundaries defined by function or subroutine calls, branch points, APIs, or other technologies that provide for the partitioning or modularization of particular processing or control functions. Components may be combined via their interfaces with other components to carry out a machine process.
- a component may be a packaged functional hardware unit designed for use with other components and a part of a program that usually performs a particular function of related functions. Components may constitute either software components (e.g., code embodied on a machine-readable medium) or hardware components.
- a “hardware component” is a tangible unit capable of performing operations and may be configured or arranged in a certain physical manner.
- one or more computer systems may be configured by software (e.g., an application or application portion) as a hardware component that operates to perform certain operations as described herein.
- a hardware component may also be implemented mechanically, electronically, or any suitable combination thereof.
- a hardware component may include dedicated circuitry or logic that is permanently configured to perform certain operations.
- a hardware component may be a special-purpose processor, such as a field-programmable gate array (FPGA) or an application specific integrated circuit (ASIC).
- a hardware component may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations.
- a hardware component may include software executed by a general-purpose processor or other programmable processor. Once configured by such software, hardware components become specific machines (or specific components of a machine) uniquely tailored to perform the configured functions and are no longer general-purpose processors. It will be appreciated that the decision to implement a hardware component mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software), may be driven by cost and time considerations.
- the phrase “hardware component” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein.
- in examples in which hardware components are temporarily configured (e.g., programmed), each of the hardware components need not be configured or instantiated at any one instance in time.
- the general-purpose processor may be configured as respectively different special-purpose processors (e.g., including different hardware components) at different times.
- Hardware components can provide information to, and receive information from, other hardware components. Accordingly, the described hardware components may be regarded as being communicatively coupled. Where multiple hardware components exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) between or among two or more of the hardware components. In examples in which multiple hardware components are configured or instantiated at different times, communications between such hardware components may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware components have access.
- one hardware component may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware component may then, at a later time, access the memory device to retrieve and process the stored output. Hardware components may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).
- the various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented components that operate to perform one or more operations or functions described herein.
- processor-implemented component refers to a hardware component implemented using one or more processors.
- the methods described herein may be at least partially processor-implemented, with a particular processor or processors being an example of hardware.
- the operations of a method may be performed by one or more processors or processor-implemented components.
- the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS).
- at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an API).
- the performance of certain of the operations may be distributed among the processors, not only residing within a single machine, but deployed across a number of machines.
- the processors or processor-implemented components may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other examples, the processors or processor-implemented components may be distributed across a number of geographic locations.
- Computer-readable storage medium refers to both machine-storage media and transmission media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals.
- the terms “machine-readable medium,” “computer-readable medium,” and “device-readable medium” mean the same thing and may be used interchangeably in this disclosure.
- Machine storage medium refers to a single or multiple storage devices and media (e.g., a centralized or distributed database, and associated caches and servers) that store executable instructions, routines and data.
- the term shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media, including memory internal or external to processors.
- machine-storage media examples include non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), FPGA, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- Non-transitory computer-readable storage medium refers to a tangible medium that is capable of storing, encoding, or carrying the instructions for execution by a machine.
- Signal medium refers to any intangible medium that is capable of storing, encoding, or carrying the instructions for execution by a machine and includes digital or analog communications signals or other intangible media to facilitate communication of software or data.
- signal medium shall be taken to include any form of a modulated data signal, carrier wave, and so forth.
- the term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- the terms “transmission medium” and “signal medium” mean the same thing and may be used interchangeably in this disclosure.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Computer Graphics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Multimedia (AREA)
- Geometry (AREA)
- Architecture (AREA)
- Computer Hardware Design (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Generation (AREA)
- Image Analysis (AREA)
Abstract
Description
Then, during inference, the features $F$ may be denormalized from the structure and appearance properties extracted from the training dataset by the second part $G_2$ as $\hat{F} \times \mathrm{IQR} + m$ prior to generating the 3D representation of the object.
for a table having $2^r$ entries, where $w$ and $a$ are heuristic hashing parameters used to reduce the number of collisions while maintaining an appropriate table size. A target non-rigid object also may be decomposed into $N_p$ regions, each containing $N_k$ 3D points $K_p^{3D}$ and corresponding 2D projections $K_p^{2D}$ per image. The 3D points $K_p^{3D}$ may be shared across all non-rigid objects, and the non-rigid objects may be aligned in a learned canonical space to allow for motion transfer between the non-rigid objects.
where $\hat{C}, C \in [0,1]^{H \times W \times 3}$ are the RGB rendered and training images of resolution $H \times W$, respectively; $\mathrm{VGG}_i$ is the $i$-th layer of a pre-trained VGG-19 network; and the operator $D_l$ downsamples images to the resolution for pyramid level $l$.
where $\hat{O}, O \in [0,1]^{H \times W}$ are the inferred and ground-truth occupancy masks, respectively.
During inference, when producing the final volumes, the features are de-normalized as $\hat{F} \times \mathrm{IQR} + m$. This method is referred to herein as robust normalization.
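The de-normalization expression above can be illustrated with a minimal NumPy sketch. It assumes that $m$ is the median of the feature values and that the forward normalization is $(F - m)/\mathrm{IQR}$; the text states only the inverse step, so the forward step, the function names, and the toy tensor shape are illustrative assumptions.

```python
import numpy as np

def fit_robust_stats(features):
    """Return (m, iqr): assumed median and interquartile range of all feature values."""
    q1, m, q3 = np.percentile(features, [25, 50, 75])
    return m, q3 - q1

def normalize(features, m, iqr):
    # Assumed forward step: center by the median, scale by the IQR.
    return (features - m) / iqr

def denormalize(normalized, m, iqr):
    # Matches the expression in the text: F_hat * IQR + m.
    return normalized * iqr + m

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 8, 8, 8))  # toy stand-in for latent feature volumes
F[0, 0, 0, 0] = 1e3                    # a single extreme outlier

m, iqr = fit_robust_stats(F)
F_hat = normalize(F, m, iqr)
F_rec = denormalize(F_hat, m, iqr)
```

Because the median and IQR depend on order statistics rather than on the extreme values, the single outlier barely shifts the normalization constants, which is the usual motivation for this kind of scaling.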
TABLE 1
| Method | PhotoShape Chairs FID ↓ | PhotoShape Chairs KID ↓ | ABO Tables FID ↓ | ABO Tables KID ↓ |
|---|---|---|---|---|
| π-GAN [7] | 52.71 | 13.64 | 41.67 | 13.81 |
| EG3D [6] | 16.54 | 8.412 | 31.18 | 11.67 |
| DiffRF [44] | 15.95 | 7.935 | 27.06 | 10.03 |
| Disclosed Method | 11.28 | 4.714 | 18.44 | 6.854 |
TABLE 2
| Method | CelebV-Text FID ↓ | CelebV-Text KID ↓ | MVImgNet FID ↓ | MVImgNet KID ↓ | Objaverse FID ↓ | Objaverse KID ↓ |
|---|---|---|---|---|---|---|
| Direct Latent Sampling | 69.21 | 73.74 | 97.51 | 69.22 | 72.76 | 53.68 |
| Disclosed Method, 16 Steps | 48.01 | 49.49 | 62.21 | 39.94 | 47.49 | 32.44 |
| Disclosed Method, 32 Steps | 49.74 | 46.2 | 51.26 | 28.45 | 43.68 | 31.7 |
| Disclosed Method, 64 Steps | 50.27 | 47.72 | 43.85 | 23.91 | 40.49 | 29.37 |
TABLE 3
| Model Variant | PSNR ↑ | LPIPS ↓ |
|---|---|---|
| Disclosed Method | 27.719 | 6.255 |
| Multi-Frame Training | 27.176 | 6.855 |
| Self-Attention | 27.335 | 6.738 |
| Increased Depth | 27.24 | 6.924 |
| Embedding Length (1024→64) | 25.985 | 8.332 |
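where $\delta$ and $c$ are the density and RGB values from the radiance field volumes sampled along these rays, and $T(t)=\exp\left(-\int_{t_n}^{t}\delta(r(s))\,ds\right)$ is the transmittance accumulated along the ray $r$ from its near bound $t_n$.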
128 points are sampled across these rays for radiance field rendering during training and inference.
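A minimal sketch of how 128 samples along one ray could be composited into a pixel color. It assumes the standard discrete quadrature used for neural radiance fields (alpha compositing of per-sample density and color); the function name, the uniform sample spacing, and the toy inputs are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

NUM_SAMPLES = 128  # number of samples per ray, as stated in the text

def render_ray(density, rgb, t_vals):
    """Composite per-sample density (N,) and color (N, 3) at sorted depths t_vals (N,)."""
    # Interval lengths between samples; the last interval reuses the previous length.
    deltas = np.append(np.diff(t_vals), t_vals[-1] - t_vals[-2])
    # Opacity contributed by each interval.
    alpha = 1.0 - np.exp(-density * deltas)
    # Transmittance: fraction of light that reaches sample i without being absorbed earlier.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    color = (weights[:, None] * rgb).sum(axis=0)
    return color, weights

t_vals = np.linspace(0.0, 1.0, NUM_SAMPLES)
red = np.tile(np.array([1.0, 0.0, 0.0]), (NUM_SAMPLES, 1))

# Empty space: no density, so nothing is accumulated.
empty_color, empty_weights = render_ray(np.zeros(NUM_SAMPLES), red, t_vals)
# Dense red medium: the ray saturates almost immediately.
solid_color, solid_weights = render_ray(np.full(NUM_SAMPLES, 1e3), red, t_vals)
```

The weights always sum to at most one, so the rendered color stays inside the valid range whenever the per-sample colors do.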
Articulated Animation
where $T_p=[R_p, t_p]=[R^{-1}, -R^{-1}t]$ is the estimated pose of part $p$ relative to the camera (where $T=[R, t]\in\mathbb{R}^{3\times 4}$ is the estimated camera pose with respect to the canonical volume); $x_d$ is the 3D point deformed to correspond to the current pose; $x_c$ is its corresponding point when aligned in the canonical volume; and $w_p^c(x_c)$ is the learned LBS weight for component $p$, sampled at position $x_c$ in the volume, used to define this correspondence. It is noted that an approximate solution may be computed using the inverse LBS weights following HumanNeRF to avoid the excessive computation used by the direct solution.
where $w_p^c(x)$ is a weight assigned to each part $p$. Intuitively, LBS weights segment the object into different parts. As an example, a point with LBS weight equal to 1.0 for the left hand will always move according to the transformation for the left hand. Unfortunately, during volumetric rendering, canonical points are typically queried using points in the deformed space, which requires solving Eq. (6) for $x_c$. This procedure is prohibitively expensive, so an approximate solution may be used that defines inverse LBS weights $w_p^d$ such that:
where the weights $w_p^d$ are defined as follows:
This approximation has an intuitive explanation: given the deformed point, it is projected to the canonical pose using the inverse of $T_p$ and then checked for whether it corresponds to part $p$ in the canonical pose.
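The forward LBS deformation and its approximate inverse can be sketched as follows. The equations themselves are not reproduced in the text above, so this sketch assumes the HumanNeRF-style form: $x_d=\sum_p w_p^c(x_c)(R_p x_c + t_p)$ for the forward direction, and inverse weights obtained by evaluating $w_p^c$ at $R_p^{-1}(x_d - t_p)$ and normalizing over parts. The soft-assignment weight function, the part centers, and all names are hypothetical stand-ins for the learned quantities.

```python
import numpy as np

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def canonical_weights(x, centers, tau=0.05):
    """Hypothetical stand-in for learned weights w_p^c(x): softmax over distance to part centers."""
    logits = -((x[None, :] - centers) ** 2).sum(axis=1) / tau
    w = np.exp(logits - logits.max())
    return w / w.sum()

def deform(x_c, rotations, translations, centers):
    """Forward LBS: blend the per-part rigid transforms of a canonical point."""
    w = canonical_weights(x_c, centers)
    moved = np.einsum("pij,j->pi", rotations, x_c) + translations
    return (w[:, None] * moved).sum(axis=0)

def approx_canonicalize(x_d, rotations, translations, centers, eps=1e-12):
    """Approximate inverse: map x_d back with each part's inverse transform, then
    weight each candidate by how strongly it belongs to that part in canonical space."""
    # R_p^{-1} (x_d - t_p), using R^{-1} = R^T for rotation matrices.
    cands = np.einsum("pji,pj->pi", rotations, x_d[None, :] - translations)
    w_c = np.array([canonical_weights(c, centers)[p] for p, c in enumerate(cands)])
    w_d = w_c / (w_c.sum() + eps)
    return (w_d[:, None] * cands).sum(axis=0)

centers = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])        # two toy parts
rotations = np.stack([np.eye(3), rot_z(np.pi / 2)])            # part 1 rotates 90 degrees
translations = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])

x_c = np.array([1.0, 0.1, 0.0])                                # lies firmly inside part 1
x_d = deform(x_c, rotations, translations, centers)
x_rec = approx_canonicalize(x_d, rotations, translations, centers)
```

For points that belong almost entirely to one part, as here, the approximation recovers the canonical point closely; near part boundaries it is only approximate, which is the trade-off accepted to avoid solving the forward equation for $x_c$ directly.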
where $k_i\in\{1, 2, \ldots, n_h\}$ is the set of indices used to select from the $n_h$ possible codebook entries for position $i$ in the final vector. This method allows for exponentially more combinations of embedding vectors to be provided during training than must be stored in the learned embedding vector library.
where the table has $2^r$ entries, and $w$ and $a$ are heuristic hashing parameters used to reduce the number of collisions while maintaining an appropriate table size. A value of 32 was used for $w$, while $a$ must be an odd integer between $2^{w-1}$ and $2^w$. Each smaller codebook was given its own $a$ value:
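A minimal sketch of the compositional embedding idea. It assumes the final vector is the concatenation of one entry chosen from each of several small codebooks; the number of codebooks, the entry length, and the names below are hypothetical, and indices are zero-based here whereas the text counts from 1.

```python
import numpy as np

n_c = 4   # number of positions / small codebooks (hypothetical)
n_h = 8   # entries available per codebook
d = 16    # length of each codebook entry (hypothetical)

rng = np.random.default_rng(0)
codebooks = rng.standard_normal((n_c, n_h, d))  # stand-in for learned parameters

def compose_embedding(indices):
    """Concatenate entry indices[i] of codebook i for every position i."""
    return np.concatenate([codebooks[i, k] for i, k in enumerate(indices)])

embedding = compose_embedding([0, 1, 2, 3])

stored_entries = n_c * n_h          # vectors that must actually be stored
possible_embeddings = n_h ** n_c    # distinct vectors that can be composed
```

With these toy sizes, 32 stored entries yield 4096 distinct embeddings, which illustrates why the number of available combinations grows exponentially in the number of positions while storage grows only linearly.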
where i is the index of the codebook. It was found that employing this approach had negligible impact on the overall speed and quality of the training and synthesis process. Using this hash embedding approach was found to reduce the model storage needs by approximately 75% for this dataset.
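The hash function itself is not reproduced in the text above. A minimal sketch, assuming the multiplicative hashing scheme from the cited MIT 6.006 lecture, $h(k)=[(a\cdot k) \bmod 2^w] \gg (w-r)$, with $w=32$ as stated. The per-codebook multipliers below are arbitrary odd constants chosen for illustration; they are not the values used in the disclosure, and the function names are hypothetical.

```python
W = 32  # word size w, as stated in the text

def mult_hash(key, a, r):
    """Multiplicative hash of an integer key into a table with 2**r slots."""
    if a % 2 == 0 or not (1 << (W - 1)) < a < (1 << W):
        raise ValueError("a must be an odd integer between 2**(w-1) and 2**w")
    return ((a * key) % (1 << W)) >> (W - r)

# Hypothetical multipliers, one per small codebook (not taken from the source).
A_VALUES = [2654435769, 2246822519, 3266489917]

def hashed_indices(object_id, r):
    """One slot index per codebook for a given training-object identifier."""
    return [mult_hash(object_id, a, r) for a in A_VALUES]

r = 10  # each codebook then has 2**10 = 1024 slots
slots = hashed_indices(123456, r)
```

Giving each codebook a different multiplier means two objects that collide in one codebook are unlikely to collide in all of them, so their composed embeddings remain distinct.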
Processing Platform
Claims (19)
Priority Applications (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/211,149 US12494013B2 (en) | 2023-06-16 | 2023-06-16 | Autodecoding latent 3D diffusion models |
| EP24736239.5A EP4728492A1 (en) | 2023-06-16 | 2024-06-03 | Autodecoding latent 3d diffusion models |
| KR1020267001009A KR20260023031A (en) | 2023-06-16 | 2024-06-03 | Automatically decoded latent 3D diffusion models |
| CN202480040291.9A CN121399676A (en) | 2023-06-16 | 2024-06-03 | Automatic decoding of potential 3D diffusion models |
| PCT/US2024/032242 WO2024258661A1 (en) | 2023-06-16 | 2024-06-03 | Autodecoding latent 3d diffusion models |
| US19/374,069 US20260057606A1 (en) | 2023-06-16 | 2025-10-30 | Autodecoding latent 3d diffusion models |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/211,149 US12494013B2 (en) | 2023-06-16 | 2023-06-16 | Autodecoding latent 3D diffusion models |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US19/374,069 Continuation US20260057606A1 (en) | 2023-06-16 | 2025-10-30 | Autodecoding latent 3d diffusion models |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20240420407A1 US20240420407A1 (en) | 2024-12-19 |
| US12494013B2 true US12494013B2 (en) | 2025-12-09 |
Family
ID=91664753
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/211,149 Active 2044-02-22 US12494013B2 (en) | 2023-06-16 | 2023-06-16 | Autodecoding latent 3D diffusion models |
| US19/374,069 Pending US20260057606A1 (en) | 2023-06-16 | 2025-10-30 | Autodecoding latent 3d diffusion models |
Family Applications After (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US19/374,069 Pending US20260057606A1 (en) | 2023-06-16 | 2025-10-30 | Autodecoding latent 3d diffusion models |
Country Status (5)
| Country | Link |
|---|---|
| US (2) | US12494013B2 (en) |
| EP (1) | EP4728492A1 (en) |
| KR (1) | KR20260023031A (en) |
| CN (1) | CN121399676A (en) |
| WO (1) | WO2024258661A1 (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11995854B2 (en) * | 2018-12-19 | 2024-05-28 | Nvidia Corporation | Mesh reconstruction using data-driven priors |
| US20250078392A1 (en) * | 2023-08-28 | 2025-03-06 | Lemon Inc. | Multi-view 3d diffusion |
| US20250117970A1 (en) * | 2023-10-06 | 2025-04-10 | Adobe Inc. | Encoding image values through attribute conditioning |
| US20250232505A1 (en) * | 2024-01-16 | 2025-07-17 | Nvidia Corporation | Machine learning models for generative human motion simulation |
| CN121305004B (en) * | 2025-12-12 | 2026-02-27 | 立安智通(北京)科技有限公司 | Single-image 3D portrait generation method and system based on mixed prior and noise resampling |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190370965A1 (en) * | 2017-02-22 | 2019-12-05 | The United States Of America, As Represented By The Secretary, Department Of Health And Human Servic | Detection of prostate cancer in multi-parametric mri using random forest with instance weighting & mr prostate segmentation by deep learning with holistically-nested networks |
| US20200058137A1 (en) * | 2015-06-24 | 2020-02-20 | Sergi PUJADES | Skinned Multi-Person Linear Model |
| US20230130281A1 (en) * | 2021-10-21 | 2023-04-27 | Google Llc | Figure-Ground Neural Radiance Fields For Three-Dimensional Object Category Modelling |
| US20240005590A1 (en) * | 2020-11-16 | 2024-01-04 | Google Llc | Deformable neural radiance fields |
| US20240013497A1 (en) * | 2020-12-21 | 2024-01-11 | Google Llc | Learning Articulated Shape Reconstruction from Imagery |
| US20240096001A1 (en) * | 2021-11-16 | 2024-03-21 | Google Llc | Geometry-Free Neural Scene Representations Through Novel-View Synthesis |
| US20240221258A1 (en) * | 2022-12-28 | 2024-07-04 | Menglei Chai | Unsupervised volumetric animation |
| US20240371081A1 (en) * | 2021-11-03 | 2024-11-07 | Google Llc | Neural Radiance Field Generative Modeling of Object Classes from Single Two-Dimensional Views |
| US20240371096A1 (en) * | 2023-05-04 | 2024-11-07 | Nvidia Corporation | Synthetic data generation using morphable models with identity and expression embeddings |
-
2023
- 2023-06-16 US US18/211,149 patent/US12494013B2/en active Active
-
2024
- 2024-06-03 WO PCT/US2024/032242 patent/WO2024258661A1/en not_active Ceased
- 2024-06-03 EP EP24736239.5A patent/EP4728492A1/en active Pending
- 2024-06-03 KR KR1020267001009A patent/KR20260023031A/en active Pending
- 2024-06-03 CN CN202480040291.9A patent/CN121399676A/en active Pending
-
2025
- 2025-10-30 US US19/374,069 patent/US20260057606A1/en active Pending
Patent Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200058137A1 (en) * | 2015-06-24 | 2020-02-20 | Sergi PUJADES | Skinned Multi-Person Linear Model |
| US20190370965A1 (en) * | 2017-02-22 | 2019-12-05 | The United States Of America, As Represented By The Secretary, Department Of Health And Human Servic | Detection of prostate cancer in multi-parametric mri using random forest with instance weighting & mr prostate segmentation by deep learning with holistically-nested networks |
| US20240005590A1 (en) * | 2020-11-16 | 2024-01-04 | Google Llc | Deformable neural radiance fields |
| US20240013497A1 (en) * | 2020-12-21 | 2024-01-11 | Google Llc | Learning Articulated Shape Reconstruction from Imagery |
| US20230130281A1 (en) * | 2021-10-21 | 2023-04-27 | Google Llc | Figure-Ground Neural Radiance Fields For Three-Dimensional Object Category Modelling |
| US20240371081A1 (en) * | 2021-11-03 | 2024-11-07 | Google Llc | Neural Radiance Field Generative Modeling of Object Classes from Single Two-Dimensional Views |
| US20240096001A1 (en) * | 2021-11-16 | 2024-03-21 | Google Llc | Geometry-Free Neural Scene Representations Through Novel-View Synthesis |
| US20240221258A1 (en) * | 2022-12-28 | 2024-07-04 | Menglei Chai | Unsupervised volumetric animation |
| US20240371096A1 (en) * | 2023-05-04 | 2024-11-07 | Nvidia Corporation | Synthetic data generation using morphable models with identity and expression embeddings |
Non-Patent Citations (198)
| Title |
|---|
| Achlioptas et al.: Learning Representations and Generative Models for 3D Point Clouds. In Proceedings of the International Conference on Machine Learning, 2018. |
| Binkowski et al.: Demystifying MMD GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. |
| Bojanowski et al.: Optimizing the latent space of generative networks. In arXiv, 2017. |
| Brock et al.: Large scale gan training for high fidelity natural image synthesis. In arXiv, 2018. |
| Chan et al.: Efficient Geometry-aware 3D Generative Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Chan et al: pi-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. |
| Chang et al.: ShapeNet: an Information-Rich 3D Model Repository. In arXiv, 2015. |
| Chen et al.: Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation. In arXiv, 2023. |
| Chen et al.: WaveGrad: Estimating Gradients for Waveform Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. |
| Cheng et al.: SDFusion: Multimodal 3d shape completion, reconstruction, and generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. |
| Collins et al.: ABO: Dataset and Benchmarks for Real-World 3D Object Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Deitke et al.: Objaverse: a Universe of Annotated 3D Objects. In arXiv, 2022. |
| Deng et al.: Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009. |
| Devadas et al. MIT 6.006, Lecture 5: Hashing I: Chaining, Hash Functions, 2009. |
| Devadas, Daskalakis. "Lecture 5: Hashing I: Chaining, Hash Functions." MIT, 2009, courses.csail.mit.edu/6.006/fall09/lecture_notes/lecture05.pdf. (Year: 2009). * |
| Dhariwal et al.: Diffusion Models Beat Gans on Image Synthesis. In Proceedings of the Neural Information Processing Systems Conference, 2021. |
| Dockhorn et al.: Score-based generative modeling with critically-damped langevin diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Falcon et al. PyTorch Lightning. GitHub. Note: https://github.com/PyTorchLightning/pytorch-lightning, 3, 2019. |
| Forsgren et al.: Riffusion—Stable diffusion for real-time music generation, 2022. URL https://riffusion.com/about. |
| Goodfellow et al.: Generative adversarial nets. In Proceedings of the Neural Information Processing Systems Conference, 2014. |
| Gupta, Anchit, et al. "3dgen: Triplane latent diffusion for textured mesh generation." arXiv preprint arXiv:2303.05371 (2023). (Year: 2023). * |
| Harvey et al.: Flexible Diffusion Modeling of Long Videos. In Proceedings of the Neural Information Processing Systems Conference, 2022. |
| He et al.: Latent video diffusion models for high-fidelity long video generation. In arXiv, 2023. |
| Heusel et al.: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Neural Information Processing Systems Conference, 2017. |
| Ho et al.: Classifier-free diffusion guidance. In arXiv, 2022. |
| Ho et al.: Denoising diffusion probabilistic models. In Proceedings of the Neural Information Processing Systems Conference, 2020. |
| Ho et al.: Video Diffusion Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Hore et al.: Image quality metrics: Psnr vs. ssim. In Proceedings of the International Conference on Pattern Recognition, 2010. |
| International Search Report and Written Opinion for PCT/U2024/032242 dated Sep. 19, 2024 (Sep. 19, 2024), 9 pages. |
| Johnson et al.: Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the European Conference on Computer Vision, 2016. |
| Karras et al.: A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. |
| Karras et al.: Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. |
| Karras et al.: Elucidating the Design Space of Diffusion-Based Generative Models. In Proceedings of the Neural Information Processing Systems Conference, 2022. |
| Karras et al.: Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. |
| Karras et al.: Training generative adversarial networks with limited data. In arXiv, 2020. |
| Kingma et al.: Adam: a Method for Stochastic Optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. |
| Kingma et al.: Auto-encoding variational bayes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014. |
| Kirillov et al.: Segment Anything. In arXiv, 2023. |
| Lepetit et al.: EPnP: an Accurate O(n) Solution to the PnP Problem. In International Journal of Computer Vision, 2009. |
| Lewis et al.: Pose Space Deformation: a Unified Approach to Shape Interpolation and Skeleton-Driven Deformation. In ACM Transactions on Graphics, 2000. |
| Lin et al.: BARF: Bundle-Adjusting Neural Radiance Fields. In Proceedings of the IEEE International Conference on Computer Vision, 2021. |
| Lin et al.: Magic3D: High-Resolution Text-to-3D Content Creation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. |
| Lin et al.: Robust High-Resolution Video Matting with Temporal Guidance. In Proceedings of the Winter Conference on Applications of Computer Vision, 2022. |
| Liu et al.: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In arXiv, 2023. |
| Lorensen et al.: Marching Cubes: a High Resolution 3D Surface Construction Algorithm. In ACM Transactions on Graphics, 1987. |
| Mei et al.: VIDM: Video Implicit Diffusion Models. In Association for the Advancement of Artificial Intelligence Conference, 2023. |
| Mildenhall et al.: NeRF: Representing scenes as Neural Radiance Fields for View Synthesis. In Proceedings of the European Conference on Computer Vision, 2020. |
| Müller et al.: DiffRF: Rendering-Guided 3D Radiance Field Diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. |
| Nagrani et al.: VoxCeleb: Large-scale speaker verification in the wild. Computer Science and Language, 2019. |
| Nam et al: "3D-LDM: Neural Implicit 3D Shape Generation with Latent Diffusion Models", Arxiv.org, Cornell University Library, 201 Olin Library Cornell University Ithaca, NY 14853, Dec. 15, 2022 (Dec. 15, 2022), XP091384293. |
| Nam, Gimin, et al. "3d-Idm: Neural implicit 3d shape generation with latent diffusion models." arXiv preprint arXiv:2212.00842 (2022). (Year: 2022). * |
| Nguyen-Phuoc et al.: HoloGAN: Unsupervised Learning of 3D Representations From Natural Images. In Proceedings of the IEEE International Conference on Computer Vision, 2019. |
| Nguyen-Phuoc et al.; Blockgan: Learning 3d object-aware scene representations from unlabelled images. In arXiv, 2020. |
| Nichol et al.: Improved denoising diffusion probabilistic models. In ICML, 2021. |
| Niemeyer et al.: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. |
| Ntavelis et al.: StyleGenes: Discrete and Efficient Latent Distributions for GANs. In arXiv, 2023. |
| Ntavelis, Evangelos et al. "Autodecoding Latent 3D Diffusion Models," arXiv:2307.05445v1 [cs.CV] Jul. 7, 2023. |
| Ntavelis, Evangelos, et al. "Autodecoding latent 3d diffusion models." Advances in Neural Information Processing Systems 36 (2023): 67021-67047. (Year: 2023). * |
| Obukhov et al.: High-fidelity performance metrics for generative models in PyTorch, 2020. URL https://github.com/toshas/torch-fidelity. Version: 0.3.0, DOI: 10.5281/zenodo.4957738. |
| Park et al.: PhotoShape: Photorealistic Materials for Large-Scale Shape Collections. In ACM Transactions on Graphics, 2018. |
| Park et al: "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Jun. 15, 2019 (Jun. 15, 2019), pp. 165-174, XP033687152. |
| Paszke et al.: Automatic Differentiation in PyTorch, 2017. |
| Paszke et al.: PyTorch: an Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Neural Information Processing Systems Conference, 2019. |
| Poole et al.: Dreamfusion: Text-to-3d using 2d diffusion. In arXiv, 2022. |
| Raffel et al.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In The Journal of Machine Learning Research, 2020. |
| Ravi et al.: Accelerating 3D Deep Learning with PyTorch3D. In arXiv, 2020. |
| Rombach et al.: High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Schönberger et al.: Structure-from-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. |
| Schwarz et al.: GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis. In Proceedings of the Neural Information Processing Systems Conference, 2020. |
| Siarohin et al.: First Order Motion Model for Image Animation. In Proceedings of the Neural Information Processing Systems Conference, 2019. |
| Siarohin et al.: Unsupervised Volumetric Animation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. |
| Siarohin et al: "Unsupervised Volumetric Animation", Arxiv.org, Cornell University Library, 201 Olin Library Cornell University Ithaca, NY 14853, Jan. 26, 2023 (Jan. 26, 2023), XP091421870. |
| Simonyan et al.: Very deep convolutional networks for large-scale image recognition. In arXiv, 2014. |
| Skorokhodov et al.: 3D Generation on ImageNet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. |
| Skorokhodov et al.: EpiGRAF: Rethinking Training of 3D GANs. In Proceedings of the Neural Information Processing Systems Conference, 2022. |
| Skorokhodov et al.: Stylegan-v: a continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Sohl-Dickstein et al.: Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015. |
| Song et al.: Denoising Diffusion Implicit Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. |
| Song et al.: Generative modeling by estimating gradients of the data distribution. In Proceedings of the Neural Information Processing Systems Conference, 2019. |
| Song et al.: Score-based generative modeling through stochastic differential equations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. |
| Tian et al.: A good image generator is what you need for high-resolution video synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. |
| Vahdat et al.: Score-based generative modeling in latent space. In Proceedings of the Neural Information Processing Systems Conference, 2021. |
| Vaswani et al.: Attention is all you need. In Proceedings of the Neural Information Processing Systems Conference, 2017. |
| Voleti et al.: MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation. In Proceedings of the Neural Information Processing Systems Conference, 2022. |
| Wang et al.: NeRF—: Neural Radiance Fields Without Known Camera Parameters. In arXiv, 2021. |
| Wang, Tengfei, et al. "Rodin: a Generative Model for Sculpting 3D Digital Avatars Using Diffusion." arXiv preprint arXiv:2212.06135 (2022). (Year: 2022). * |
| Weng et al.: HumanNeRF: Free-Viewpoint Rendering of Moving People from Monocular Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Whaley: The Interquartile Range: Theory and Estimation. PhD thesis, East Tennessee State University, 2005. |
| Xiao et al.: Tackling the generative learning trilemma with denoising diffusion GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Xu et al.: Pose for Everything: Towards Category-Agnostic Pose Estimation. In Proceedings of the European Conference on Computer Vision, 2022. |
| Xue et al.: GIRAFFE HD: a High-Resolution 3D-aware Generative Model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Yin et al.: NUWA-XL: Diffusion over Diffusion for extremely Long Video Generation. In arXiv, 2023. |
| Yu et al.: CelebV-Text: a Large-Scale Facial Text-Video Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. |
| Yu et al.: Generating videos with dynamics-aware implicit generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. |
| Yu et al.: MVImgNet: a Large-scale Dataset of Multi-view Images. In arXiv, 2023. |
| Zhang et al.: The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. |
| Zhu et al.: Discrete contrastive diffusion for cross-modal music and image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. |
| Zhu et al.: MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In arXiv, 2023. |
| Zhu, K. P., et al. "3D CAD model search: a regularized manifold learning approach." 2009 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2009. (Year: 2009). * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4728492A1 (en) | 2026-04-22 |
| US20240420407A1 (en) | 2024-12-19 |
| KR20260023031A (en) | 2026-02-20 |
| CN121399676A (en) | 2026-01-23 |
| US20260057606A1 (en) | 2026-02-26 |
| WO2024258661A1 (en) | 2024-12-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12494013B2 (en) | | Autodecoding latent 3D diffusion models |
| US12361512B2 (en) | | Generating digital images utilizing high-resolution sparse attention and semantic layout manipulation neural networks |
| US11710248B2 (en) | | Photometric-based 3D object modeling |
| US11769227B2 (en) | | Generating synthesized digital images utilizing a multi-resolution generator neural network |
| US12380640B2 (en) | | 3D generation of diverse categories and scenes |
| US11604963B2 (en) | | Feedback adversarial learning |
| US12400388B2 (en) | | Unsupervised volumetric animation |
| US12475639B2 (en) | | Spatially disentangled generative radiance fields for controllable 3D-aware scene synthesis |
| US20230316454A1 (en) | | Vector-quantized transformable bottleneck networks |
| CN119600206B (en) | | A multi-view 3D reconstruction method and system based on GRU and 3DCNN |
| CN117974992A (en) | | Matting processing method, device, computer equipment and storage medium |
| CN119211552A (en) | | Video encoding method, computer device, storage medium and computer program product |
| Hassani-Vasmejani et al. | | Transforming Face Sketches into Realistic Images Using Hierarchical Attention in Swin Transformers |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | AS | Assignment | Owner name: SNAP INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: NTAVELIS, EVANGELOS; OLSZEWSKI, KYLE; SIAROHIN, ALIAKSANDR; AND OTHERS; REEL/FRAME: 067586/0469; Effective date: 20230619 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: ALLOWED -- NOTICE OF ALLOWANCE NOT YET MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |