WO2023086398A1 - 3D rendering networks based on refractive neural radiance fields - Google Patents

3D rendering networks based on refractive neural radiance fields

Info

Publication number
WO2023086398A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
network
medium
refractive
image
Prior art date
Application number
PCT/US2022/049424
Other languages
English (en)
Inventor
Yuqi DING
Zhong Li
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc.
Publication of WO2023086398A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T15/50 Lighting effects
    • G06T15/506 Illumination models
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from stereo images
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/10012 Stereo images
    • G06T2207/10024 Color image
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2210/00 Indexing scheme for image generation or computer graphics
    • G06T2210/56 Particle system, point based geometry or rendering

Definitions

  • This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for rendering images of an existing underwater scene.
  • Underwater imaging is a useful and important method for exploring unknown underwater environments, which contain rich natural resources and abundant marine organisms.
  • Traditional computer vision methods apply structure from motion and can be used to recover the three-dimensional (3D) shape of underwater objects.
  • However, refraction is one of the biggest problems in underwater imaging, since light rays bend (i.e., become non-linear) at an air-water interface.
  • Traditional computer vision methods usually assume linear light transmission, so direct application of these methods cannot address the nonlinear light transmission in such a physical environment. It would be beneficial to have an efficient and accurate image rendering method to render undistorted images for underwater scenes.
  • Various embodiments of this application are directed to generating underwater images.
  • This application implements a neural radiance field (NeRF) based neural network.
  • the methods and systems disclosed herein take a set of images of an underwater object as input, adjust a refractive effect of the light rays at the interface between air and water, and recover a point cloud and/or synthesize new views of the underwater object. Compared with some traditional computer vision methods, which could only obtain a sparse point cloud result, the methods and systems disclosed herein may generate new views of the underwater object and a dense depth result.
  • the NeRF based neural network uses a deep learning network, which is trained to implement neural volumetric rendering utilizing a set of input images and rendering output images.
  • a method is implemented at an electronic device for reconstructing views of a three-dimensional (3D) scene.
  • the method includes identifying input pixel information of a plurality of pixels corresponding to a particular view of the 3D scene from a set of views of the 3D scene in a first medium, wherein each of the set of views is formed from a perspective of a camera located in the first medium and includes objects existing in a second medium distinct from the first medium.
  • the method further includes adjusting the input pixel information of the particular view of the 3D scene by a refractive network to generate adjusted pixel information based on at least a refractive index of the first medium.
  • the method further includes processing the adjusted pixel information by a neural radiance field (NeRF) network to generate an image of the particular view of the 3D scene based on the adjusted pixel information, the image at least partially compensating image distortion caused by refraction.
  • NeRF neural radiance field
  • the refractive network and NeRF network are collectively called a refractive NeRF network.
  • In some embodiments, the first medium is air and the second medium is water.
  • some implementations include a computer system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
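  • As an illustration only, the two-stage method summarized above (identify per-pixel ray information, adjust it with a refractive network, and render it with a NeRF network) can be sketched in Python as follows; refractive_network, nerf_network, and volume_render are hypothetical placeholder callables rather than names defined in this application, and 1.33 is merely a nominal refractive index for water.
```python
# Minimal sketch of the two-stage rendering method summarized above.
# The three callables are hypothetical placeholders, not part of this application.

def render_view(pixel_rays, refractive_network, nerf_network, volume_render,
                refractive_index_water=1.33):
    """Render one view of an underwater scene from rays cast by an in-air camera.

    pixel_rays holds the per-pixel 5D inputs (x, y, z, theta, phi) of one view.
    """
    # Step 1: bend each in-air ray at the air/water interface so that the
    # downstream NeRF sees refraction-compensated ray samples.
    adjusted_rays = refractive_network(pixel_rays, refractive_index_water)

    # Step 2: query the NeRF for volume density and view-dependent color along
    # the adjusted rays, then composite them into an output image.
    density, color = nerf_network(adjusted_rays)
    return volume_render(density, color)
```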
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network based (NN-based) data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • NN-based neural network based
  • Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments.
  • NN neural network
  • Figure 4B is an example node in the neural network (NN), in accordance with some embodiments.
  • Figure 5 is a block diagram of a refractive neural radiance field (NeRF) network that corrects image data captured in a first medium concerning one or more objects in a second medium, in accordance with some embodiments.
  • NeRF refractive neural radiance field
  • Figure 6 is a diagram illustrating a refraction process for an underwater environment, in accordance with some embodiments.
  • Figure 7 is a flowchart of an example method for reconstructing views of a three-dimensional (3D) scene, in accordance with some embodiments.
  • Systems and methods for reconstructing 3D underwater photos or videos are disclosed herein. Refraction from one medium to another changes a direction of an incident light ray, thereby distorting the 3D underwater photos captured above an interface of two media.
  • the systems and methods disclosed herein solve this distortion issue by integrating a refractive network to a NeRF network to provide a refractive NeRF network.
  • the refractive incident ray is modeled to enforce a NeRF to find pixels correcting the distorted underwater photos.
  • some sparse images are captured as input to reduce the complexity of underwater imaging.
  • the systems and methods disclosed herein generate new views and estimate a reasonable refractive index. Additionally, in some embodiments, the systems and methods disclosed herein are used in many marine research and imaging applications.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a camera).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs are processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • storage 106 may store image/video content for training a machine learning model (e.g., deep learning network) and/or image/video content obtained by a user to which a trained machine learning model is applied to determine one or more actions associated with the image/video content.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console that executes an interactive online gaming application.
  • the game console (e.g., the HMD 104D) receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video/image data based on the user instruction and user data and provides the stream of video/image data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data/image to identify image, motion, or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera remotely and in real time.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communication links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.
  • Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C).
  • the client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • the client device 104C or HMD 104D obtains the content data (e.g., captures image or video data via an internal camera) and processes the content data using the training data processing models locally.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D).
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the mobile phone 104C and HMD 104D).
  • the server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the mobile phone 104C and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by a client device 104.
  • 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model.
  • Visual content is optionally generated using a second data processing model.
  • training of the first or second data processing models is optionally implemented by the server 102, while inference of the device poses and visual content is implemented by the client device 104.
  • the second data processing model includes an image processing model for image or video content rendering, and is implemented in a user application (e.g., a social networking application, a social media application, a short video application, and a media play application).
  • the visual content reflects an underwater scene captured by a camera staying out of the underwater scene, and the second data processing model corrects distortion of the underwater scene in the visual content and optionally adds a virtual object into the underwater scene.
  • FIG. 2 is a block diagram illustrating an electronic system 200, in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
  • the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a camera 260, or other input buttons or controls.
  • the camera 260 is located in a first medium (e.g., air) and configured to capture one or more images of a field of view, e.g., including an object located in a second medium (e.g., water).
  • the client device 104 of the electronic system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (Global Positioning System) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data (e.g., training data 238) and establishing a data processing model (e.g., data processing module 228) for processing content data (e.g., video data, visual data, audio data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
    o Training data 238 for training one or more data processing models 240;
    o Data processing model(s) 240 for processing content data (e.g., video data, visual data, audio data) using deep learning techniques; and
    o Content data and results 242 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206 optionally stores a subset of the modules and data structures identified above.
  • memory 206 optionally stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video data, visual data, audio data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 238 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 238 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • the training data 238 is consistent with the type of the content data, and a data pre-processing module 308 is applied to process the training data 238 consistently with the type of the content data.
  • a video pre-processing module 308 is configured to process video training data 238 to a predefined image format, e.g., group frames (e.g., video frames, visual frames) of the video content into video segments.
  • the data pre-processing module 308 may also extract a region of interest (ROI) in each frame or separate a frame into foreground and background components, and crop each frame to a predefined image size.
  • the model training engine 310 receives pre-processed training data provided by the data preprocessing module(s) 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criteria (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316.
  • Examples of the content data include one or more of: video data, visual data (e.g., image data), audio data, textual data, and other types of data. For example, each video is pre-processed to group frames in the video into video segments.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre- processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that is derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments.
  • Figure 4B is an example node 420 in the neural network 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w₁, w₂, w₃, and w₄ according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
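  • As a concrete illustration (not the patented implementation), a single node of this kind can be written in a few lines of Python; the tanh activation and the numeric values below are arbitrary choices for the example, and the bias term b anticipates the discussion later in this section.
```python
import numpy as np

def node_output(inputs, weights, bias=0.0, activation=np.tanh):
    """One node 420: a linear weighted combination of the node inputs plus a
    bias term, passed through a non-linear activation function."""
    z = np.dot(weights, inputs) + bias   # weighted combination w1*x1 + ... + w4*x4 + b
    return activation(z)                 # non-linear activation

# Example node with four inputs combined by weights w1..w4.
x = np.array([0.2, -0.5, 1.0, 0.3])
w = np.array([0.7, 0.1, -0.4, 0.9])
print(node_output(x, w, bias=0.05))
```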
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input layer 402 and the output layer 406.
  • a deep neural network has more than one hidden layer 404 between the input layer 402 and the output layer 406.
  • each layer may be only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected neural network layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
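  • A minimal sketch of the max-pooling step described above, assuming non-overlapping windows of two nodes (the window size and values are illustrative only):
```python
import numpy as np

def max_pool(values, window=2):
    """Down-sample a layer by keeping the maximum over each window of nodes."""
    trimmed = values[: len(values) // window * window]   # drop any ragged tail
    return trimmed.reshape(-1, window).max(axis=1)

print(max_pool(np.array([0.1, 0.8, 0.3, 0.5, 0.9, 0.2])))  # -> [0.8 0.5 0.9]
```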
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • a recurrent neural network is applied in the data processing model 240 to process content data (particularly, visual data and audio data).
  • Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • a generative neural network is applied in the data processing model 240 to process content data (particularly, visual data and audio data).
  • a generative neural network is trained by providing it with a large amount of data (e.g., millions of images, sentences, or sounds) and training it to generate data similar to the input data.
  • the generative neural networks have significantly fewer parameters than the amount of input training data, so the generative neural networks are forced to find and efficiently internalize the essence of the data in order to generate new data.
  • two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
  • the training process is a process for calibrating all of the weights w_i for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • during forward propagation, the set of weights for different layers is applied to the input data and intermediate results from the previous layers.
  • during backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the neural network 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
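  • The forward/backward training loop described above can be illustrated with a deliberately tiny example: a single linear layer with a mean-squared-error loss and plain gradient descent. This is a generic sketch of the training procedure, not the network or loss actually used in this application, and all numeric values are arbitrary.
```python
import numpy as np

def train(x, y_true, lr=0.1, loss_threshold=1e-4, max_epochs=1000):
    """Repeat forward and backward propagation until the loss satisfies a
    predefined convergence condition (here, a simple loss threshold)."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=x.shape[1])          # weights to be calibrated
    b = 0.0                                  # network bias term
    loss = np.inf
    for _ in range(max_epochs):
        y_pred = x @ w + b                   # forward propagation
        error = y_pred - y_true
        loss = np.mean(error ** 2)           # margin of error (loss function)
        if loss < loss_threshold:
            break
        w -= lr * 2 * x.T @ error / len(x)   # backward propagation: adjust weights
        b -= lr * 2 * error.mean()           # ... and the bias to decrease the error
    return w, b, loss

x = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
y = x @ np.array([2.0, -1.0]) + 0.3
print(train(x, y))
```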
  • FIG. 5 is a block diagram of a refractive neural radiance field (NeRF) network 500 that corrects image data captured in a first medium concerning one or more objects 506 in a second medium, in accordance with some embodiments.
  • the image data 508 includes one or more input images 508 (e.g., a first input image 508A and a second input image 508B).
  • the one or more input images 508 are captured by a camera 260 located in the first medium.
  • Each input image 508 records respective objects 506 existing in the second medium.
  • the first medium is air
  • the second medium is water (e.g., lake water, sea water, swimming pool water).
  • the refractive NeRF network 500 is used to process the one or more input images 508, and has two components, a refractive network 502 and a NeRF network 504.
  • the refractive NeRF network 500 is applied to generate new views of the same object 506 or scene from the one or more input images captured by the camera 260 located in the first medium.
  • the new views include one or more output images 510 (e.g., a first output image 510A and a second output image 510B).
  • the object 506 or scene is distorted in the one or more input images 508 due to light refraction, and recovered in the one or more output images 510 with no or little distortion.
  • the refractive NeRF network 500 is applied to generate a dense depth map of the object 506 or scene recorded in the one or more input images 508.
  • the dense depth map includes a plurality of pixels, and each pixel corresponds to a respective object in a field of view of the camera 260. Each pixel has a depth value corresponding to the respective object in a field of view of the camera 260.
  • the refractive NeRF network 500 uses a set of underwater images of an underwater object or scene to train the refractive network 502 and the NeRF network 504. Further, in some situations, the NeRF network 504 is trained using the underwater images. Alternatively, in some situations, the refractive network 502 and the NeRF network 504 are trained using the set of underwater images in an end-to-end manner. Training and application of the refractive NeRF network 500 only needs a limited number of underwater images (e.g., ⁇ 5 underwater images captured by an underwater camera), which controls a level of underwater operation involved in this image processing process.
  • the refractive index of the second medium (e.g., water) may be unknown, and the refractive NeRF network 500 is applied to learn the unknown refractive index of the second medium.
  • an underwater camera 260 captures one or more ground truth images in the second medium.
  • the one or more input images 508 are captured by the camera 260 located in the first medium, and processed by the refractive NeRF network 500 including the refractive network 502 and NeRF 504 to generate the one or more output images 510.
  • the one or more output images 510 are compared with the one or more ground truth images to determine the refractive index of the second medium.
  • the refractive NeRF network 500 is applied to generate a dense point cloud of the one or more objects 506 from the one or more input images 508 for accurate measurement of the one or more objects 506. In some embodiments, the refractive NeRF network 500 is applied to estimate physical parameters of the second medium, e.g., for marine research.
  • the one or more input images 508 are captured from a single camera 260 located in the first medium, e.g., above the water.
  • the camera 260 is located within the first medium (e.g., air) with a first refractive index, and the objects or scene is located in a second medium (e.g., water) with a second refractive index.
  • more than one camera 260 is used to capture the one or more input images 508.
  • the one or more input images 508 are taken of the same objects 506 or scene from different angles corresponding to different camera locations and/or orientations. For each input image 508, intrinsic and extrinsic parameters (rotation and translation) of a respective camera 260 are known.
  • image pixels of the one or more input images 508 captured by the camera 260 are transformed to light rays according to a projection matrix, computed based on intrinsic parameters of the camera 260.
  • Each image pixel has an image location on a respective image 508 and a spatial location (x, y, z) in the field of view of the camera 260, and corresponds to a viewing angle and/or direction.
  • each of the rays corresponds to a respective image pixel of an input image 508 and is defined by a combination of the respective image pixel's spatial position (x, y, z) and associated camera viewing direction (θ, φ).
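  • A minimal sketch of this pixel-to-ray transformation under a standard pinhole camera model; the intrinsic matrix K, rotation, and camera origin below are assumed example values, not parameters from this application.
```python
import numpy as np

def pixels_to_rays(pixel_uv, K, cam_rotation, cam_origin):
    """Back-project image pixels to world-space viewing rays (pinhole model).

    pixel_uv:     (N, 2) pixel coordinates.
    K:            (3, 3) intrinsic matrix of the camera.
    cam_rotation: (3, 3) camera-to-world rotation (extrinsic).
    cam_origin:   (3,)   camera position in world space.
    Returns ray origins and unit ray directions.
    """
    ones = np.ones((pixel_uv.shape[0], 1))
    pix_h = np.hstack([pixel_uv, ones])              # homogeneous pixel coords
    dirs_cam = pix_h @ np.linalg.inv(K).T            # directions in camera frame
    dirs_world = dirs_cam @ cam_rotation.T           # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    origins = np.broadcast_to(cam_origin, dirs_world.shape)
    return origins, dirs_world

# Example with assumed intrinsics (focal length 500 px, 640x480 image).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
uv = np.array([[320.0, 240.0], [100.0, 50.0]])
print(pixels_to_rays(uv, K, np.eye(3), np.zeros(3)))
```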
  • the refractive NeRF network 500 includes an integrated neural network that combines the refractive network 502 and the NeRF network 504.
  • An input of the integrated refractive NeRF network 500 is represented by a five-dimensional (5D) coordinate of each image pixel, and the 5D coordinate is represented by a spatial position (x, y, z) and an associated camera viewing direction (θ, φ).
  • (θ, φ) is an angular representation in a spherical coordinate system for a three-dimensional space.
  • the corresponding spherical coordinate system has an origin set at the camera 260.
  • an output of the integrated refractive NeRF network 500 includes a volume density σ and view-dependent RGB color values of the one or more output images 510.
  • each input image 508 corresponds to a view that is synthesized using the 5D coordinates along a plurality of distinct camera rays.
  • volume rendering techniques are applied to project the volume density σ and view-dependent RGB color values of the one or more output images 510.
  • a static scene is represented in a view as continuous 5D coordinates having radiance emitted in each direction (θ, φ) at each point (x, y, z) in space.
  • a volume density σ at each point controls how much radiance is accumulated by a ray passing through each point (x, y, z) in space.
  • the refractive NeRF system 500 optimizes a deep neural network to determine the volume density σ and view-dependent RGB color values of the one or more output images 510 from the 5D coordinate values (x, y, z, θ, φ) of image pixels of the one or more input images 508.
  • the refractive NeRF network 500 is optionally a fully connected network.
  • the refractive NeRF network 500 is optimized by regressing from the 5D coordinates corresponding to the one or more input images 508 to the volume density σ and view-dependent RGB color values of the one or more output images 510.
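  • A minimal PyTorch sketch of such a fully connected mapping from a 5D coordinate to a volume density σ and an RGB color; the class name, layer sizes, and activations are illustrative assumptions, and real NeRF implementations additionally use positional encoding and a separate view-direction branch, which are omitted here.
```python
import torch
from torch import nn

class TinyNeRF(nn.Module):
    """Minimal fully connected network mapping a 5D coordinate
    (x, y, z, theta, phi) to a volume density sigma and an RGB color."""

    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),           # sigma + RGB
        )

    def forward(self, coords_5d):
        out = self.mlp(coords_5d)
        sigma = torch.relu(out[..., 0])     # density is non-negative
        rgb = torch.sigmoid(out[..., 1:])   # colors in [0, 1]
        return sigma, rgb

# Example: query 1024 sampled points along camera rays.
samples = torch.rand(1024, 5)
sigma, rgb = TinyNeRF()(samples)
print(sigma.shape, rgb.shape)
```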
  • the refractive NeRF network 500 includes the refractive network 502 and NeRF network 504 that is distinct from (i.e., not integrated with) the refractive network 502.
  • the NeRF network 504 uses refractive light rays.
  • the refractive network 502 simulates refraction at an interface between the first and second media, and generates a refractive light ray from an original light ray.
  • the NeRF network 504 represents a radiance field of underwater points.
  • each of the refractive network 502 and the NeRF network 504 has a respective neural network.
  • the refractive network 502 and the NeRF network 504 are trained separately.
  • the refractive network 502 and the NeRF network 504 are trained sequentially. These two networks 502 and 504 have different scales, and different learning rates are set to train the refractive network 502 and the NeRF network 504. As such, in some embodiments, the refractive NeRF network 500 is applied to estimate the refractive index of the water and generate an output image 510 including a new view of the underwater object 506.
  • a set of input scene views are captured surrounding the scene in the first medium.
  • the set of input scene views includes more than 100 input images 508 captured by one or more cameras 260.
  • a particular viewpoint of the camera 260 corresponds to a camera viewing direction (θ, φ), and the light rays are directed through the scene to generate one or more input images 508 on the camera 260 having the particular viewpoint.
  • Each input image 508 includes a plurality of image pixels.
  • Each image pixel has a 5D coordinate represented by a spatial position (x, y, z) and an associated camera viewing direction (θ, φ).
  • the plurality of image pixels with the 5D coordinate are used as an input to the refractive network 502.
  • the refractive network 502 produces a sample set of points each having a refractive 5D coordinate after the light rays are directed from the first medium, such as air, through the second medium having a refractive index n_w.
  • the refractive index n_w is a known value.
  • alternatively, the refractive index n_w is an unknown value which may be estimated from the refractive network 502 and/or the NeRF network 504 through training.
  • the sample set of points with refractive 5D coordinates is then used as an input to the NeRF network 504 to produce, at the output, the volume density σ and view-dependent RGB color values.
  • the volume density σ and view-dependent RGB color values are composed into a 2D image (e.g., an output image 510).
  • volume rendering is implemented to compose or accumulate the volume density σ and view-dependent RGB color values of the one or more output images 510.
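  • A sketch of a standard NeRF-style compositing rule that accumulates per-sample densities and colors along each ray into pixel colors; the application describes volume rendering only at a general level, so the exact quadrature below is an assumption borrowed from common NeRF practice.
```python
import torch

def composite(sigma, rgb, deltas):
    """Composite per-sample densities and colors along each ray into a pixel
    color, using a standard NeRF quadrature rule.

    sigma:  (R, S) densities for R rays and S samples per ray.
    rgb:    (R, S, 3) view-dependent colors.
    deltas: (R, S) distances between adjacent samples along each ray.
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)              # opacity per sample
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)       # (R, 3) pixel colors

rays, samples = 4, 64
print(composite(torch.rand(rays, samples),
                torch.rand(rays, samples, 3),
                torch.full((rays, samples), 0.05)).shape)
```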
  • the refractive network 502 and NeRF network 504 are differentiable. Differences are minimized between images captured by one or more cameras (e.g., underwater cameras) in the second medium and output images 510 rendered based on the refractive NeRF network 500.
  • FIG. 6 is a diagram illustrating a refraction process 600 for an underwater environment 606, in accordance with some embodiments.
  • the underwater environment 606 corresponds to an interface 602 between a first medium 604 and a second medium 606.
  • the interface 602 is modeled by a flat plane that is disposed in front of a camera 260 located in the first medium 604.
  • the first medium 604 is air and the second medium 606 is water (e.g., in a river, in sea, in a swimming pool).
  • Spatial points (x, y, z) of the flat plane in a 3D space are represented by a plane equation (e.g., of the general form ax + by + cz + d = 0).
  • angles of the incident ray 608 and the refractive ray 610 are related by Snell's law, n_a sin θ₁ = n_w sin θ₂, where n_a is the refractive index of the first medium 604 (e.g., 1 for air), θ₁ is an incident angle of the incident ray 608 measured with respect to a vertical axis 612 perpendicular to the interface 602, and θ₂ is a refractive angle of the refractive ray 610 measured with respect to the vertical axis 612.
  • the refractive ray 610 follows a first direction d_w, the original ray 608 follows a second direction d_a, and the refractive ray 610 intersects the flat plane at an intersection point 614.
  • a refractive index n_w, which can be the refractive index of the water in some embodiments, is added to the refractive network 502 as a known or unknown parameter.
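  • The change of direction at the interface 602 can be computed with the vector form of Snell's law, sketched below; the coordinate convention (interface normal along the vertical axis 612) and the value 1.33 for n_w are illustrative assumptions, not parameters fixed by this application.
```python
import numpy as np

def refract(d_a, normal, n_air=1.0, n_w=1.33):
    """Refract an incident ray direction d_a at a flat air/water interface
    using the vector form of Snell's law.

    d_a:    unit direction of the incident ray 608 (pointing toward the water).
    normal: unit normal of the interface 602, pointing back into the air.
    Returns the unit direction d_w of the refractive ray 610, or None for
    total internal reflection (not expected when going from air into water).
    """
    eta = n_air / n_w
    cos_i = -np.dot(d_a, normal)
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:
        return None
    return eta * d_a + (eta * cos_i - np.sqrt(1.0 - sin2_t)) * normal

# Ray entering the water at 30 degrees from the vertical axis 612.
d_a = np.array([np.sin(np.radians(30)), 0.0, -np.cos(np.radians(30))])
print(refract(d_a, normal=np.array([0.0, 0.0, 1.0])))
```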
  • Figure 7 is a flowchart of an example method 700 for reconstructing views of a three-dimensional (3D) scene, in accordance with some embodiments.
  • the method 700 is implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof).
  • An example of the client device 104 is a mobile phone 104C or an HMD 104D.
  • Method 700 is, in some embodiments, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 7 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 700 may be combined and/or the order of some operations may be changed.
  • the computer system identifies (710) input pixel information of a plurality of pixels corresponding to a particular view of the 3D scene from a set of views of the 3D scene in a first medium.
  • the plurality of pixels belongs to one or more input images 508 (Figure 5).
  • each of the set of views is formed from a perspective of a camera 260 located in the first medium and includes one or more objects 506 existing in a second medium distinct from the first medium.
  • the camera 260 includes a camera of a mobile phone 104C.
  • the camera 260 includes a camera mounted on the HMD 104D.
  • the computer system adjusts (720) the input pixel information of the particular view of the 3D scene by a refractive network 502 to generate adjusted pixel information based on at least a refractive index of the second medium.
  • the computer system processes (730) the adjusted pixel information by a neural radiance field (NeRF) network 504 to generate an output image 510 (Figure 5) of the particular view of the 3D scene based on the adjusted pixel information, and the output image 510 at least partially compensates image distortion caused by refraction.
  • the computer system automatically corrects image distortion caused by refraction between two distinct media materials and provides high quality images 510 of underwater objects.
  • the particular view of the 3D scene is different from each remaining view of the set of views of the 3D scene.
  • the NeRF network 504 is configured to provide an occupancy field from the adjusted pixel information, and the occupancy field is configured to generate a scene geometry of the 3D scene.
  • the 3D scene is associated with the second medium (e.g., water).
  • the refractive network 502 reduces or removes an impact of refraction and converts the one or more input images 508 captured from a perspective of the camera 260 located in the first medium to adjusted input images 508’ in the absence of the impact of refraction.
  • the NeRF network 504 is a fully-connected neural network that can generate novel views of the 3D scene based on a partial set of 2D images (e.g., the adjusted input images 508’ in Figure 5).
  • such a NeRF network 504 is trained to use a rendering loss to reproduce input views of the 3D scene.
  • the refractive network 502 is configured to partially compensate the image distortion caused by refraction, the NeRF network 504 is configured to provide a dense point cloud of the 3D scene, and the image of the 3D scene is generated based on the dense point cloud.
  • a point cloud is a set of data points in space.
  • the points may represent a 3D shape or object 506.
  • Each point position has its set of Cartesian coordinates (X, Y, Z).
  • Point clouds are generally produced by 3D scanners or by photogrammetry software, which measure many points on the external surfaces of objects around them.
  • the one or more input images 508 include the shape or object 506, and each point of the shape or object 506 is represented with 5D coordinate values (x, y, z, θ, φ) from the perspective of the camera 260.
  • the Cartesian coordinates (X, Y, Z) are determined by the refractive network 502 and NeRF network 504, and have compensated an impact of refraction compared with the 5D coordinate values (x, y, z, θ, φ) indicated by the one or more input images 508.
  • the first medium is air and the second medium is water.
  • identifying (710) the input pixel information of the plurality of pixels further includes: identifying a plurality of pixel values of the plurality of pixels corresponding to the particular view of the 3D scene; and identifying a respective pixel location (x, y, z) associated with each of the plurality of pixel values.
  • the perspective of the camera 260 corresponds to a camera pose in the 3D scene, and each of the plurality of pixels is associated with a respective ray direction, and identifying (710) the input pixel information of the plurality of pixels further comprises: based on the camera pose, determining the respective ray direction (for example, θ and φ of each of the plurality of pixels).
  • the refraction network 502 and NeRF network 504 form a refractive NeRF network 500, which is configured to use a set of underwater images as input.
  • the intrinsic and extrinsic parameters (rotation and translation) of each camera 260 are also known.
  • the output images 510 of the refractive NeRF network 500 are compared with the set of underwater images to determine the refractive index of the second medium or to train the refractive NeRF network 500.
  • the set of underwater images are captured by the camera 260 located in the first medium and outside of a glass box containing the second medium. A thickness of the glass box is controlled to be small (e.g., negligible without impacting the set of underwater images).
  • adjusting (720) the input pixel information of the particular view of the 3D scene by the refractive network 502 to generate the adjusted pixel information based on at least the refractive index of the second medium includes: identifying a plurality of directional rays for the plurality of pixels corresponding to the particular view of the 3D scene, and projecting the plurality of directional rays through an interface 602 in front of the camera 260.
  • the interface 602 is a flat plane between the first medium and the second medium.
  • each image pixel of the one or more input images 508 is transformed to a respective ray according to a projection matrix which is computed by the intrinsic parameters.
  • the underwater environment is modeled by placing the flat plane in front of the camera 260.
  • this flat plane is an interface 602 between air and the water.
  • each projected directional ray changes a respective direction after passing through the interface 602. For example, when a ray which comes from the camera 260 passes the flat plane, a direction of the ray is changed by refraction.
  • the method 700 for reconstructing views of a 3D scene further includes training the refractive network 502 with a set of sparse training images by, for each sparse training image captured at the 3D scene: adjusting input pixel information of the sparse training image by the refractive network 502; measuring a change of a respective direction of a projected directional ray corresponding to each of a subset of pixels of the respective sparse image; estimating a ratio of the refractive indexes of the first medium and the second medium based on the change of the respective direction; and training the refractive network based on the change of the respective direction of the projected directional ray corresponding to each of at least the subset of pixels.
  • the refractive index of the first medium is known (e.g., 1 for air).
  • the refractive index of the second medium (e.g., water) is not known in advance and is estimated during training.
  • a difference between the estimated refractive index and a ground truth refractive index of the second medium is monitored to train the refractive network.
  • training is implemented in a self-supervised manner without knowing the ground truth refractive index of the second medium.
  • the input pixel information includes a 3D location and 2D camera viewing direction for each pixel.
  • the output image 510 is synthesized from 3D color values and a volume density for each image pixel (see the volume-rendering sketch after this list).
  • the method 700 for reconstructing views of a 3D scene further includes: training the refractive network 502 and the NeRF network 504 end-to-end by reducing a difference between the generated image 510 of the 3D scene and a corresponding ground truth image of the 3D scene (see the training-loop sketch after this list).
  • the method 700 for reconstructing views of a 3D scene further includes: training the NeRF network 504 separately based on a plurality of views of the 3D scene by the camera 260, including generating new views of the 3D scene from the NeRF network 504 and reducing a difference between the new views and corresponding ground truth views of the 3D scene.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.
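
The transformation of an image pixel into a directional ray from the camera's intrinsic and extrinsic parameters, as referenced above, can be sketched as follows. This is only an illustration of the generic pinhole-camera relationship; the function name pixel_to_ray and the example intrinsic values are assumptions introduced here, not code or values from the disclosure.

```python
import numpy as np

def pixel_to_ray(u, v, K, R, t):
    """Turn pixel coordinates (u, v) into a world-space ray (origin, direction)
    for a pinhole camera with intrinsic matrix K and camera-to-world pose (R, t)."""
    pixel_h = np.array([u, v, 1.0])          # homogeneous pixel coordinates
    d_cam = np.linalg.inv(K) @ pixel_h       # ray direction in camera coordinates
    d_world = R @ d_cam                      # rotate the direction into world coordinates
    origin = t                               # camera center in world coordinates
    return origin, d_world / np.linalg.norm(d_world)

# Example: a 640x480 pinhole camera at the world origin looking along its +z axis.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
origin, direction = pixel_to_ray(100.0, 200.0, K, R, t)
print(origin, direction)
```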
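
To make the ray-bending step above concrete (projecting a directional ray through the flat interface 602 so that its direction changes by refraction), the following is a minimal sketch of Snell's law applied to a single camera ray at a flat air-water interface. It is an illustration under stated assumptions, not the disclosed refraction network 502: the function name refract_ray, the plane z = interface_z, and the index values n_air and n_water are introduced here.

```python
import numpy as np

def refract_ray(origin, direction, interface_z=0.0, n_air=1.0, n_water=1.33):
    """Bend a camera ray at a flat air-water interface using Snell's law.

    The interface is assumed to be the plane z = interface_z with its normal
    pointing back toward the camera (+z). Returns the intersection point on
    the interface and the refracted unit direction.
    """
    d = direction / np.linalg.norm(direction)
    normal = np.array([0.0, 0.0, 1.0])

    # Intersect the ray with the plane z = interface_z.
    t = (interface_z - origin[2]) / d[2]
    hit = origin + t * d

    # Vector form of Snell's law: n_air * sin(theta_i) = n_water * sin(theta_t).
    eta = n_air / n_water
    cos_i = -np.dot(normal, d)
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:
        return hit, None  # total internal reflection (cannot occur going from air into water)
    cos_t = np.sqrt(1.0 - sin2_t)
    refracted = eta * d + (eta * cos_i - cos_t) * normal
    return hit, refracted / np.linalg.norm(refracted)

# Example: a ray leaving a camera 0.5 m above the water, tilted 30 degrees off the normal.
origin = np.array([0.0, 0.0, 0.5])
direction = np.array([np.sin(np.pi / 6), 0.0, -np.cos(np.pi / 6)])
hit, bent = refract_ray(origin, direction)
print(hit, bent)  # the refracted ray bends toward the normal (about 22 degrees in water)
```

Sampling the radiance field along the bent segment below the interface, rather than along the original straight ray, is what allows a later rendering stage to compensate for refraction-induced distortion.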
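
The synthesis of an output image from per-sample colors and volume densities follows the standard NeRF volume-rendering formulation. The sketch below alpha-composites samples along one (already refracted) ray into a single pixel color; the callable radiance_field and the sampling bounds are assumptions for illustration and do not correspond to the trained networks 502 and 504 of the disclosure.

```python
import torch

def render_ray(radiance_field, origin, direction, near=0.0, far=4.0, n_samples=64):
    """Composite one pixel color from samples along a single ray (standard NeRF
    volume rendering). radiance_field(points, dirs) is assumed to return
    per-sample (rgb, sigma); it stands in for a generic radiance field."""
    t_vals = torch.linspace(near, far, n_samples)             # sample depths along the ray
    points = origin + t_vals[:, None] * direction             # (n_samples, 3) sample locations
    dirs = direction.expand(n_samples, 3)                     # same viewing direction per sample

    rgb, sigma = radiance_field(points, dirs)                 # colors (n, 3) and densities (n,)

    # Distances between adjacent samples; the last interval is left open-ended.
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-sigma * deltas)                  # per-sample opacity
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): how much light survives to sample i.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)                # composited pixel color

# Toy radiance field (uniform gray fog) just to exercise the renderer.
def toy_field(points, dirs):
    return torch.full((points.shape[0], 3), 0.5), torch.full((points.shape[0],), 0.2)

pixel = render_ray(toy_field, torch.zeros(3), torch.tensor([0.0, 0.0, -1.0]))
print(pixel)
```

In the refractive setting, origin and direction would be the interface hit point and the bent direction from the previous sketch, so the samples fed to the radiance field lie on the true underwater path of the ray.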
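
The end-to-end training and the self-supervised estimation of the water's refractive index described above can be pictured with the loop below. This is a sketch under assumptions: the refractive index is exposed as a single learnable scalar, render_view is a caller-supplied function that combines the ray bending and volume rendering of the previous sketches, and none of these names are taken from the disclosure.

```python
import torch

def train_refractive_nerf(nerf_model, render_view, training_views, epochs=100, lr=5e-4):
    """Jointly optimize a radiance-field model and a learnable water refractive
    index by minimizing the photometric error against captured images.

    render_view(nerf_model, camera, n_water) is assumed to bend each pixel ray at
    the flat interface and volume-render it, returning an image tensor."""
    log_n_water = torch.nn.Parameter(torch.log(torch.tensor(1.3)))   # rough initial guess
    optimizer = torch.optim.Adam(list(nerf_model.parameters()) + [log_n_water], lr=lr)

    for _ in range(epochs):
        for camera, ground_truth_image in training_views:
            n_water = torch.exp(log_n_water)          # keep the estimated index positive
            rendered = render_view(nerf_model, camera, n_water)
            loss = torch.mean((rendered - ground_truth_image) ** 2)

            optimizer.zero_grad()
            loss.backward()   # with a differentiable render_view, gradients reach log_n_water
            optimizer.step()  # so the index is refined without ground-truth supervision

    return torch.exp(log_n_water).item()              # final refractive-index estimate
```

Because only captured images are compared against rendered ones, a loop of this kind corresponds to the self-supervised setting in which the ground-truth refractive index of the second medium is never required.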

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A method is implemented at an electronic device for reconstructing or generating a view of a three-dimensional (3D) scene. The electronic device identifies input pixel information of a plurality of pixels corresponding to a particular view of the 3D scene from a set of views of the 3D scene, where the set of views is formed in a first medium from the perspective of one or more cameras located in the first medium and the 3D scene includes one or more objects existing in a second medium distinct from the first medium; adjusts the input pixel information of the particular view of the 3D scene by a refraction network to generate adjusted pixel information based on at least a refractive index of the second medium; and processes the adjusted pixel information by a neural radiance field (NeRF) rendering network to generate an image of the particular view of the 3D scene based on the adjusted pixel information. The image at least partially compensates for the image distortion caused by refraction.
PCT/US2022/049424 2021-11-09 2022-11-09 Réseaux de rendu 3d basés sur des champs de radiance neurale de réfraction WO2023086398A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163277503P 2021-11-09 2021-11-09
US63/277,503 2021-11-09

Publications (1)

Publication Number Publication Date
WO2023086398A1 (fr)

Family

ID=86336687

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/049424 WO2023086398A1 (fr) 2021-11-09 2022-11-09 Réseaux de rendu 3d basés sur des champs de radiance neurale de réfraction

Country Status (1)

Country Link
WO (1) WO2023086398A1 (fr)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100002222A1 (en) * 1999-09-03 2010-01-07 Arete Associates Lidar with streak-tube imaging, including hazard detection in marine applications; related optics
US20140285655A1 (en) * 2013-03-20 2014-09-25 Electronics And Telecommunications Research Institute Apparatus and method for measuring shape of underwater object
US20160253795A1 (en) * 2015-02-24 2016-09-01 NextVR, Inc. Calibration for immersive content systems
US20200264416A1 (en) * 2019-02-15 2020-08-20 S Chris Hinnah Methods and systems for automated imaging of three-dimensional objects
US20200404243A1 (en) * 2019-06-24 2020-12-24 Align Technology, Inc. Intraoral 3d scanner employing multiple miniature cameras and multiple miniature pattern projectors
US20210044795A1 (en) * 2019-08-09 2021-02-11 Light Field Lab, Inc. Light Field Display System Based Digital Signage System

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958453A (zh) * 2023-09-20 2023-10-27 成都索贝数码科技股份有限公司 基于神经辐射场的三维模型重建方法、设备和介质
CN116958453B (zh) * 2023-09-20 2023-12-08 成都索贝数码科技股份有限公司 基于神经辐射场的三维模型重建方法、设备和介质

Similar Documents

Publication Publication Date Title
KR102319177B1 (ko) 이미지 내의 객체 자세를 결정하는 방법 및 장치, 장비, 및 저장 매체
US11288857B2 (en) Neural rerendering from 3D models
WO2021249401A1 (fr) Procédé et appareil de génération de modèle, procédé et appareil de détermination de perspective d'image, dispositif et support
WO2023102223A1 (fr) Apprentissage multitâche en couplage croisé pour cartographie de profondeur et segmentation sémantique
US20240203152A1 (en) Method for identifying human poses in an image, computer system, and non-transitory computer-readable medium
CN116391209A (zh) 现实的音频驱动的3d化身生成
WO2022052782A1 (fr) Procédé de traitement d'image et dispositif associé
WO2023101679A1 (fr) Récupération inter-modale d'image de texte sur la base d'une expansion de mots virtuels
CN111368733B (zh) 一种基于标签分布学习的三维手部姿态估计方法、存储介质及终端
WO2022148248A1 (fr) Procédé d'entraînement de modèle de traitement d'image, procédé et appareil de traitement d'image, dispositif électronique et produit programme informatique
WO2023086398A1 (fr) Réseaux de rendu 3d basés sur des champs de radiance neurale de réfraction
CN112149528A (zh) 一种全景图目标检测方法、系统、介质及设备
WO2023133285A1 (fr) Anticrénelage de bordures d'objet comportant un mélange alpha de multiples surfaces 3d segmentées
WO2023027712A1 (fr) Procédés et systèmes permettant de reconstruire simultanément une pose et des modèles humains 3d paramétriques dans des dispositifs mobiles
WO2023277877A1 (fr) Détection et reconstruction de plan sémantique 3d
WO2023091131A1 (fr) Procédés et systèmes pour récupérer des images sur la base de caractéristiques de plan sémantique
CN115496859A (zh) 基于散乱点云交叉注意学习的三维场景运动趋势估计方法
WO2023069085A1 (fr) Systèmes et procédés de synthèse d'images de main
US20240153184A1 (en) Real-time hand-held markerless human motion recording and avatar rendering in a mobile platform
CN116310408B (zh) 一种建立事件相机与帧相机数据关联的方法及装置
US20230274403A1 (en) Depth-based see-through prevention in image fusion
WO2023023162A1 (fr) Détection et reconstruction de plan sémantique 3d à partir d'images stéréo multi-vues (mvs)
WO2024123343A1 (fr) Mise en correspondance stéréo pour une estimation de profondeur à l'aide de paires d'images avec des configurations de pose relative arbitraires
CN117813626A (zh) 从多视图立体(mvs)图像重建深度信息
WO2024123372A1 (fr) Sérialisation et désérialisation d'images de profondeur en couches pour rendu 3d

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22893565

Country of ref document: EP

Kind code of ref document: A1