WO2023277877A1 - 3D semantic plane detection and reconstruction - Google Patents

3D semantic plane detection and reconstruction

Info

Publication number
WO2023277877A1
Authority
WO
WIPO (PCT)
Prior art keywords
plane
neural network
image frames
data
network model
Prior art date
Application number
PCT/US2021/039535
Other languages
English (en)
Inventor
Pan JI
Yuliang GUO
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/039535 priority Critical patent/WO2023277877A1/fr
Publication of WO2023277877A1 publication Critical patent/WO2023277877A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/006Mixed reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/579Depth or shape recovery from multiple images from motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30232Surveillance

Definitions

  • the present application generally relates to artificial intelligence, and more specifically to methods and systems for using deep learning techniques in augmented reality (AR) devices and applications.
  • Augmented reality enhances images as viewed on a screen or other display by overlaying computer-generated images, sounds, or other data on a real-world environment.
  • AR applications are now commonly found on smart mobile devices such as cellphones and tablets.
  • Simultaneous Localization and Mapping (SLAM) is a computational technique used in AR applications to construct or update a map of an environment while simultaneously keeping track of a device’s pose.
  • the AR applications have been able to rely on the SLAM technique and advanced sensors (e.g., LiDAR sensors, time-of-flight (TOF) sensors, etc.) of mobile devices to detect and reconstruct three-dimensional (3D) planes, e.g., by estimating a depth or distance of each pixel associated with a 3D plane.
  • Current practices apply deep neural networks (DNN) to either regress the depth of planar regions or estimate the 3D plane parameters directly; however, these practices require training the DNN in a fully supervised manner, i.e., annotating the 3D planes with a full set of ground truth plane parameters that are often hard to obtain. It would be beneficial to develop systems and methods for detecting and reconstructing 3D planes using deep learning techniques while demanding less or no annotation of the ground truth plane parameters.
  • Various embodiments disclosed herein describe systems, devices, and methods that detect planar regions in 3D space and reconstruct 3D plane parameters in a semi-supervised (or self-supervised or unsupervised) manner, without using 3D ground truth annotations. Self-supervised losses are leveraged to supervise 3D plane reconstruction and bypass the need for 3D annotations.
  • a resulting neural network model is scalable and applicable for augmented or mixed reality applications executed in different types of electronic devices.
  • a 3D plane reconstruction method is implemented. The method includes obtaining a neural network model.
  • the neural network model includes a plurality of layers. Each layer has a respective number of filters.
  • the method further includes obtaining a sequence of consecutive image frames and processing the sequence of consecutive image frames using the neural network model to obtain one or more 3D plane parameters and one or more plane depth maps without 3D ground truth labels.
  • the method further includes determining a loss function combining the one or more 3D plane parameters and the one or more plane depth maps of the sequence of consecutive image frames.
  • the method further includes based on the loss function, training the neural network model in an unsupervised manner for the one or more 3D plane parameters and the one or more plane depth maps.
  • the method further includes providing the trained neural network model to a client device to process input images.
  • the method further includes obtaining two- dimensional (2D) ground truth labels of a set of 2D outputs with the sequence of consecutive image frames.
  • the set of 2D outputs are distinct from the one or more 3D plane parameters and the one or more plane depth maps.
  • the sequence of consecutive image frames are processed using the neural network model to obtain the set of 2D outputs, and the loss function combines the 3D plane parameters and plane depth maps with the set of 2D outputs and the 2D ground truth labels.
  • a computer system includes one or more processors and memory storing instructions, which when executed by the one or more processors cause the processors to perform any of the methods disclosed herein.
  • a non-transitory computer readable storage medium stores instructions, which when executed by the one or more processors cause the processors to perform any of the methods disclosed herein.
  • Figure 1 A is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments
  • Figure 1B is a pair of AR glasses that can be communicatively coupled in a data processing environment, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments.
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • Figure 5 is a flow diagram of a semi-supervised method of detecting and reconstructing a 3D semantic plane, in accordance with some embodiments.
  • Figure 6 is a flow diagram of a process of rendering virtual objects in an AR application, in accordance with some embodiments.
  • Figure 7 is a flow diagram of an example 3D plane reconstruction method, in accordance with some embodiments.
  • Figure 1A is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104 A, tablet computers 104B, mobile phones 104C, augmented reality (AR) glasses 150, or intelligent, multi-sensing, network-connected home devices (e.g., a camera).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provides system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the head-mounted display 150) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data, and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • the content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C and head-mounted display 150).
  • the client device 104C or head-mounted display 150 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • the client device 104C or head-mounted display 150 obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A and head-mounted display 150).
  • the server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the client device 104A or head-mounted display 150 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized or predicted device poses) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104A or head-mounted display 150 itself implements no or little data processing on the content data prior to sending them to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B and head-mounted display 150), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B or head-mounted display 150.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104B or head-mounted display 150 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • FIG. 1B illustrates a pair of augmented reality (AR) glasses 150 (also called a head-mounted display) that can be communicatively coupled to a data processing environment 100, in accordance with some embodiments.
  • the AR glasses 150 include a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the AR glasses 150, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the AR glasses 150.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 150 is processed by the AR glasses 150, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and AR glasses 150 jointly to recognize and predict the device poses.
  • the device poses are used to control the AR glasses 150 itself or interact with an application (e.g., a gaming application) executed by the AR glasses 150.
  • the display of the AR glasses 150 displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items on the user interface.
  • deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 150.
  • Device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a data processing model. Training of the data processing model is optionally implemented by the server 102 or AR glasses 150. Inference of the device poses is implemented by each of the server 102 and AR glasses 150 independently or by both of the server 102 and AR glasses 150 jointly.
  • FIG 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • the data processing system 200 includes a server 102, a client device 104 (e.g., AR glasses 150 in Figure 1B), a storage 106, or a combination thereof.
  • the data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices.
  • Memory 206 optionally includes one or more storage devices remotely located from the one or more processing units 202.
  • Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium.
  • memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
    o Training data 238 for training one or more data processing models 240;
    o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques; and
    o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • the above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206 optionally stores a subset of the modules and data structures identified above.
  • memory 206 optionally stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
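  • As an illustration of the pre-processing described above, the following sketch (hypothetical helper names; a fixed 224×224 output size is assumed) crops an image to a region of interest and converts an audio sequence to the frequency domain with a Fourier transform:

```python
import numpy as np

def preprocess_image(image: np.ndarray, roi: tuple, out_size: tuple = (224, 224)) -> np.ndarray:
    """Crop an H x W x C image to a region of interest (top, left, height, width),
    then resample it to a predefined image size by nearest-neighbor indexing."""
    top, left, height, width = roi
    crop = image[top:top + height, left:left + width]
    rows = np.linspace(0, height - 1, out_size[0]).astype(int)
    cols = np.linspace(0, width - 1, out_size[1]).astype(int)
    return crop[rows][:, cols]

def preprocess_audio(waveform: np.ndarray) -> np.ndarray:
    """Convert a 1-D training sequence to the frequency domain using a Fourier transform."""
    return np.abs(np.fft.rfft(waveform))
```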
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is accepted by the inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • a weight w associated with each link 412 is applied to the node output.
  • the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
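  • A minimal sketch of a single node, assuming a ReLU activation and an optional bias term (common choices; the specific activation is not mandated above):

```python
import numpy as np

def node_output(inputs: np.ndarray, weights: np.ndarray, bias: float = 0.0) -> float:
    """Propagation function of one node: a non-linear activation applied to a
    linear weighted combination of the node inputs plus an optional bias."""
    z = float(np.dot(weights, inputs)) + bias  # linear weighted combination, e.g. w1*x1 + ... + w4*x4
    return max(0.0, z)                         # rectified linear unit (ReLU) activation

# Example with four inputs and weights w1..w4.
y = node_output(np.array([0.5, -1.0, 2.0, 0.1]), np.array([0.2, 0.4, 0.1, 0.3]))
```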
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down-sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
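  • The following sketch, using PyTorch with illustrative layer sizes, shows how such a CNN abstracts a pre-processed image into feature maps while each output node only covers a small receptive area of the previous layer:

```python
import torch
import torch.nn as nn

# Each convolutional layer abstracts the input into a feature map; every output node
# only "sees" a 5x5 receptive area of the previous layer, and pooling down-samples the maps.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=5, padding=2),
    nn.ReLU(),
)

image = torch.randn(1, 3, 224, 224)  # pre-processed RGB frame in a predefined format
feature_map = cnn(image)             # shape: (1, 32, 112, 112)
```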
  • a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for hand
  • the training process is a process for calibrating all of the weights w_i for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • During forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers.
  • During backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
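  • A minimal sketch of this forward/backward propagation loop, assuming a toy fully connected model, a mean-squared-error loss, and stochastic gradient descent (all illustrative choices, not specified above):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # weights w and biases b per layer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def training_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    outputs = model(inputs)           # forward propagation: apply the weights layer by layer
    loss = loss_fn(outputs, targets)  # measure the margin of error of the output
    loss.backward()                   # backward propagation: gradients of the loss w.r.t. the weights
    optimizer.step()                  # adjust weights (and biases) to decrease the error
    return loss.item()
```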
  • FIG. 5 is a flow diagram of a semi-supervised method 500 of detecting and reconstructing a 3D semantic plane, in accordance with some embodiments.
  • the method 500 includes a deep neural network (DNN) model 510 configured to extract information of the 3D semantic plane from a sequence of image frames 502.
  • the DNN model 510 is trained using a semi-supervised training pipeline that does not use 3D ground truth annotations (e.g., labels).
  • the method 500 is implemented by a computer system (e.g., a server 102, a client device 104).
  • the training pipeline is implemented at the server 102 to train the DNN model 510, and the trained DNN model 510 is provided to a client device 104, allowing the client device to use the DNN model 510 to process input images.
  • the training pipeline is implemented at the server 102 to train the DNN model 510, and the trained DNN model 510 is also applied at the server 102 to process input images that are optionally received from a client device 104.
  • the training pipeline is implemented at a client device 104 to train the DNN model 510, and the trained DNN model 510 is applied by the same client device 104 or provided to a distinct client device 104 to process input images.
  • the computer system configured to train the DNN model 510 obtains a sequence of consecutive image frames 502 (e.g., two or more successive image frames).
  • the image frames 502 are from a video clip.
  • the image frames 502 are provided to the DNN model 510.
  • the DNN model 510 includes a plurality of layers each of which includes a respective number of filters.
  • the DNN model 510 outputs a plurality of plane parameters including, but not limited to, detection bounding boxes 512, plane masks 514, semantic classes 516, 3D plane parameters 518, and depth maps 520.
  • Each detection bounding box 512 is a rectangle that fits tightly to a respective object detected in the image frames 502.
  • Each plane mask 514 corresponds to a non-rectangular shape identifying an area of a respective type of surface or plane detected in the image frames 502 with pixel-level accuracy.
  • Each semantic class 516 identifies a type of an object or surface detected in the image frames 502. Examples of the semantic classes 516 include a door, a window, a person, a table surface, a wall surface, and the like.
  • the 3D plane parameters 518 describe each surface or plane identified in the image frames 502 in terms of a 3D surface normal and an offset.
  • Each depth map 520 has a resolution equal to or less than a resolution of the image frames 502.
  • Each pixel of a depth map 520 represents a depth value of a pixel or a set of pixels in a corresponding image frame 502.
  • the pixel or set of pixels represents a region in a field of view of a camera, and the respective depth value indicates a distance from the region represented by the pixel or set of pixels to a center of the camera.
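  • A sketch of the relation between a depth map, the camera intrinsics, and 3D locations, assuming a pinhole camera model with z-depth along the optical axis (a common convention, not stated above); the helper name is hypothetical:

```python
import numpy as np

def backproject(depth_map: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift every pixel of an H x W depth map to a 3D point in the camera frame,
    interpreting each depth value as the distance of the imaged region along the
    ray from the camera center through that pixel."""
    h, w = depth_map.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # homogeneous pixel coords
    rays = np.linalg.inv(K) @ pixels                                      # K^-1 * x_tilde
    points = rays * depth_map.reshape(1, -1)                              # scale each ray by its depth
    return points.T.reshape(h, w, 3)
```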
  • the DNN model 510 has a structure consistent with a mask regional convolutional neural network (Mask R-CNN).
  • the DNN model has a greater number of branches than the Mask R-CNN and is, for example, a combination of the Mask R-CNN and additional branches.
  • the additional branches of the DNN model 510 are configured to predict the 3D plane parameters 518 and the plane depth maps 520.
  • the DNN model 510 includes a first neural network and a second neural network.
  • the first neural network is configured to detect the bounding boxes 512, plane masks 514, and semantic classes 516, which are 2D plane parameters.
  • Training data include both the image frames 502 and a first set of 2D ground truth labels associated with the bounding boxes 512, plane masks 514 and semantic classes 516.
  • the detected bounding boxes 512, plane masks 514, and semantic classes 516 are compared with the first set of 2D ground truth labels. In other words, the first neural network is trained in a supervised manner to predict the 2D plane parameters including the bounding boxes 512, plane masks 514, and semantic classes 516.
  • the second neural network is configured to detect the 3D plane parameters 518 and plane depth maps 520 without supervision, i.e., in an unsupervised or self-supervised manner.
  • the 3D plane parameters 518 and plane depth maps 520 are 3D plane parameters. No counterpart 3D ground truth labels are provided with the training data to train the second neural network. By these means, the image frames 502 of the training data do not need to be annotated with the 3D ground truth labels, thereby making the training pipeline more efficient and easier to implement.
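  • The following sketch gives a rough picture of such a two-part model; the backbone, head shapes, and plane count are placeholders and greatly simplified relative to an actual Mask R-CNN-based design:

```python
import torch
import torch.nn as nn

class PlaneDNN(nn.Module):
    """Simplified sketch: a first network predicting 2D outputs (boxes, masks, classes)
    and additional branches predicting 3D plane parameters and a plane depth map."""
    def __init__(self, num_classes: int = 10, num_planes: int = 20):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.box_head = nn.Conv2d(64, num_planes * 4, 1)    # bounding boxes (2D, supervised)
        self.mask_head = nn.Conv2d(64, num_planes, 1)       # plane masks (2D, supervised)
        self.class_head = nn.Conv2d(64, num_classes, 1)     # semantic classes (2D, supervised)
        self.plane_head = nn.Conv2d(64, num_planes * 4, 1)  # 3D normal + offset per plane (unsupervised)
        self.depth_head = nn.Conv2d(64, 1, 1)               # plane depth map (unsupervised)

    def forward(self, frames: torch.Tensor) -> dict:
        feats = self.backbone(frames)
        return {
            "boxes": self.box_head(feats),
            "masks": self.mask_head(feats),
            "classes": self.class_head(feats),
            "planes": self.plane_head(feats),
            "depth": self.depth_head(feats),
        }
```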
  • In some embodiments, the DNN model 510 relies on a Simultaneous Localization and Mapping (SLAM) system and/or a Visual Inertial Odometry (VIO) system to obtain camera poses for the image frames 502.
  • SLAM is a computational technique of constructing or updating a map of an environment while simultaneously keeping track of a camera’s location within the environment.
  • the VIO system is implemented via a process of determining the position and orientation of the camera, in which associated image and inertial measurement unit (IMU) data are analyzed.
  • IMU image and inertial measurement unit
  • SLAM algorithms are based on a VIO system involving both the images and IMU data, and generate two camera poses T_t and T_t' associated with the two successive image frames I_t and I_t'.
  • a relative camera pose change is denoted as T_t→t' for the two successive image frames I_t and I_t'.
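  • Assuming the SLAM/VIO poses are 4×4 camera-to-world matrices (a common but unstated convention), the relative pose change between the two frames can be computed as follows:

```python
import numpy as np

def relative_pose(T_t: np.ndarray, T_t_prime: np.ndarray) -> np.ndarray:
    """Relative camera pose change T_{t->t'} between two 4x4 camera-to-world poses
    returned by the SLAM/VIO system for successive frames I_t and I_t'."""
    return np.linalg.inv(T_t_prime) @ T_t
```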
  • training of the DNN model 510 is implemented by using a photometric loss L_p, a depth smoothness loss L_s, and/or a plane consistency loss L_c.
  • the DNN model 510 predicts a depth map (e.g., performs a depth estimation) D_t.
  • the depth map includes a depth value of each pixel in the current image frame I_t, or a distance of an object that is recorded at each pixel, where the distance is measured from a location of a camera capturing the current image frame I_t.
  • the photometric loss between the current image frame I_t and the subsequent image frame I_t' can be computed as L_p = ρ(I_t, I'_t), where I'_t is a synthesized image warped from the current image frame I_t using the depth map D_t and the relative camera pose change T_t→t', and ρ(·) is a loss function combining an L1 norm and a structured similarity of the image frames I_t and I'_t.
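  • A sketch of such a photometric loss, assuming the warped image I'_t has already been synthesized and using a simplified 3×3-window SSIM with an assumed balancing weight alpha:

```python
import torch
import torch.nn.functional as F

def photometric_loss(I_t: torch.Tensor, I_t_warped: torch.Tensor, alpha: float = 0.85) -> torch.Tensor:
    """rho(.) combining an L1 photometric difference and a structured-similarity (SSIM)
    term between frame I_t and the synthesized frame I'_t, both of shape (B, 3, H, W)."""
    l1 = (I_t - I_t_warped).abs().mean(1, keepdim=True)

    # Simplified SSIM computed with 3x3 average-pooling windows.
    mu_x = F.avg_pool2d(I_t, 3, 1, 1)
    mu_y = F.avg_pool2d(I_t_warped, 3, 1, 1)
    sigma_x = F.avg_pool2d(I_t ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(I_t_warped ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(I_t * I_t_warped, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    dssim = ((1 - ssim) / 2).clamp(0, 1).mean(1, keepdim=True)

    return (alpha * dssim + (1 - alpha) * l1).mean()
```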
  • a plane-depth consistency loss L_c is used to supervise plane parameter regression and is computed from the predicted depth map D_t, a predicted vector n containing a normal and an offset of a plane, the 2D coordinates x of the image pixels, and a camera intrinsic matrix K.
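  • The exact plane parameterization is not given above, so the following sketch assumes a common encoding in which n is the plane normal divided by the plane offset (so that n^T X = 1 for 3D points X on the plane) and penalizes disagreement between the plane-induced depth and the predicted depth inside a plane mask:

```python
import torch

def plane_depth_consistency_loss(n: torch.Tensor, depth: torch.Tensor,
                                 mask: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """n: (3,) plane vector (normal / offset, assumed encoding), depth and mask: (H, W),
    K: (3, 3) camera intrinsic matrix; all float tensors."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)  # homogeneous pixel coords x
    rays = torch.linalg.inv(K) @ pix                                     # K^-1 x
    # For a plane n^T X = 1, the plane-induced depth along each ray is 1 / (n^T K^-1 x).
    plane_depth = 1.0 / (n.view(1, 3) @ rays).clamp(min=1e-6)
    residual = (plane_depth.reshape(h, w) - depth).abs() * mask
    return residual.sum() / mask.sum().clamp(min=1.0)
```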
  • a comprehensive unsupervised loss L_UNS is a weighted combination of the photometric loss L_p, depth smoothness loss L_s, and/or plane consistency loss L_c, and no annotations of 3D ground truth labels are needed for the 3D plane parameters 518 and plane depth maps 520.
  • the training pipeline also includes computing a supervised loss L_SP for the bounding boxes 512, plane masks 514, and/or semantic classes 516.
  • a comprehensive semi-supervised loss L_SMS is a weighted combination of the photometric loss L_p, depth smoothness loss L_s, plane consistency loss L_c, and supervised loss L_SP. Annotations of the 2D ground truth labels are needed to determine the supervised loss L_SP for a subset or all of the bounding boxes 512, plane masks 514, and semantic classes 516.
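  • A sketch of these weighted combinations, with assumed (illustrative) weights:

```python
def combined_loss(L_p, L_s, L_c, L_sp=None, w_p=1.0, w_s=0.1, w_c=0.5, w_sp=1.0):
    """Weighted combination of the losses described above; the weights are assumed hyper-parameters."""
    L_uns = w_p * L_p + w_s * L_s + w_c * L_c  # unsupervised loss L_UNS, no 3D ground truth needed
    if L_sp is None:
        return L_uns                           # self-supervised training of the 3D branches only
    return L_uns + w_sp * L_sp                 # semi-supervised loss L_SMS with 2D supervision
```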
  • the sequence of consecutive image frames 502 includes a first image frame, a second image frame, and a third image frame.
  • a first pixel difference map is determined between the first and second image frames
  • a second pixel difference map is determined between the second and third image frames.
  • the first and the second pixel difference maps are combined on a pixel level into a comprehensive difference map.
  • the comprehensive difference map is processed using the DNN model 510 to obtain the one or more 3D plane parameters 518 and one or more plane depth maps 520 without the 3D ground truth labels.
  • the sequence of consecutive image frames 502 includes a first image frame, a second image frame, and a third image frame.
  • a first pixel difference map is determined between the first and second image frames
  • a second pixel difference map is determined between the second and third image frames.
  • For each pixel, the smaller of the corresponding values in the first and the second pixel difference maps is selected to form a comprehensive difference map.
  • the comprehensive difference map is processed using the DNN model 510 to obtain the one or more 3D plane parameters 518 and one or more plane depth maps 520 without the 3D ground truth labels.
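  • A sketch of combining the two per-pixel difference maps for three consecutive frames, assuming absolute intensity differences as the pixel difference measure:

```python
import torch

def comprehensive_difference_map(frame1: torch.Tensor, frame2: torch.Tensor,
                                 frame3: torch.Tensor) -> torch.Tensor:
    """Each frame: (3, H, W). Returns an (H, W) map combining the two difference maps per pixel."""
    diff_12 = (frame1 - frame2).abs().mean(0)  # first pixel difference map
    diff_23 = (frame2 - frame3).abs().mean(0)  # second pixel difference map
    return torch.minimum(diff_12, diff_23)     # keep the smaller value at each pixel
```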
  • Figure 6 is a flow diagram of a process 600 of rendering virtual objects in an AR application 602, in accordance with some embodiments.
  • the AR application 602 is optionally an application (e.g., user application 224) executed by the data processing system 200 of Figure 2.
  • the data processing system 200 is a client device 104, a server 102, or a combination thereof.
  • the hand-held or wearable device identifies (606) a set of 3D planes in a field of view of a camera, determines (606) corresponding semantic labels in the images, and applies SLAM techniques to determine (608) a pose of the hand-held or wearable device.
  • Virtual objects are seamlessly placed (610) on top of one of the set of 3D planes.
  • the virtual object moves around or is fixed at a particular location in accordance with different AR applications.
  • the user application 224 includes a mixed reality (MR) application.
  • MR mixed reality
  • a virtual object is not just overlaid but is fixed in the real world, and the user is enabled to interact with combined virtual/real objects. Since a 3D structure is a reconstruction of 3D planes, occlusion and dis-occlusion relationships are enabled between real and virtual objects, as the pose of the hand-held or wearable device varies.
  • a virtual object is placed on a 3D plane based on a semantic class of the 3D plane.
  • For example, in an AR shopping application, floor lamps can be automatically placed on planes identified as “floors” and not on planes identified as “table tops.”
  • semantic classes associated with different 3D planes provide better semantic understanding of a user’s environment, allowing user experience with extended reality to be greatly improved.
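  • A minimal sketch of such a semantic placement rule, with a hypothetical plane representation:

```python
def choose_placement_plane(planes: list, required_class: str = "floor"):
    """Pick the first reconstructed 3D plane whose semantic class matches the placement
    rule for the virtual object (e.g., a floor lamp goes on a 'floor', not a 'table top')."""
    for plane in planes:  # each plane: {"class": ..., "parameters": ..., "depth": ...}
        if plane["class"] == required_class:
            return plane
    return None
```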
  • FIG. 7 is a flow diagram of an example three-dimensional (3D) plane reconstruction method 700, in accordance with some embodiments.
  • the method 700 is implemented by a computer system including an electronic device 104 (e.g., an HMD 150, a mobile phone 104C), a server 102, or a combination thereof.
  • a DNN model 510 is applied in the method 700.
  • the DNN model 510 is trained locally at the electronic device using training data or downloaded from a remote server system 102.
  • the trained DNN model 510 is used by the electronic device 104 to determine 3D plane parameters 518 and depth maps 520A during inference.
  • the DNN model 510 is trained remotely in the remote server system 102.
  • the electronic device 104 submits image frames to the server system 102, and the server system 102 processes the image frames using the DNN model 510 and returns the result plane parameters 518 and depth maps 520A to the electronic device 104.
  • the computer system obtains (710) a neural network model (e.g., the DNN model 510 in Figure 5).
  • the computer system also obtains (720) a sequence of consecutive image frames (e.g., frames 502 in Figure 5) and processes (730) the sequence of consecutive image frames using the neural network model to obtain one or more 3D plane parameters 518 and one or more plane depth maps 520 without 3D ground truth labels.
  • the computer system determines (740) a loss function combining the one or more 3D plane parameters 518 and the one or more plane depth maps 520 of the sequence of consecutive image frames. Based on the loss function, the computer system trains (750) the neural network model in an unsupervised manner for the one or more 3D plane parameters 518 and the one or more plane depth maps 520.
  • the trained neural network model is provided (760) to a client device to process input images.
  • the client device is optionally the computer system or a distinct device.
  • the computer system obtains two-dimensional (2D) ground truth labels of a set of 2D outputs with the sequence of consecutive image frames.
  • the set of 2D outputs is distinct from the one or more 3D plane parameters and the one or more plane depth maps.
  • the sequence of consecutive image frames are processed using the neural network model to obtain the set of 2D outputs.
  • the loss function combines the 3D plane parameters and plane depth maps with the set of 2D outputs and the 2D ground truth labels.
  • the neural network model is trained using the loss function and the 2D ground truth labels in a supervised manner for the set of 2D outputs.
  • the set of 2D outputs includes at least one of: one or more bounding boxes 512, one or more semantic classes 516, and one or more plane masks 514 of the sequence of consecutive image frames.
  • the 2D ground truth labels include one or more detection box locations and one or more semantic labels.
  • the loss function includes a photometric loss L_p, a depth smoothness loss L_s, and a depth consistency loss L_c. During training, the loss function is minimized to determine the 3D plane parameters and plane depth maps of the sequence of consecutive image frames. Further, in some embodiments, the photometric loss L_p, depth smoothness loss L_s, and depth consistency loss L_c are defined as described above with reference to Figure 5.
  • the sequence of consecutive image frames includes a first image frame, a second image frame, and a third image frame.
  • the computer system determines a first pixel difference map between the first and second image frames, and a second pixel difference map between the second and third image frames.
  • the first and the second pixel difference maps are combined on a pixel level into a comprehensive difference map.
  • the comprehensive difference map is processed using the neural network model to obtain the one or more 3D plane parameters 518 and one or more plane depth maps 520 without the 3D ground truth labels.
  • the sequence of consecutive image frames includes a first image frame, a second image frame, and a third image frame.
  • the computer system determines a first pixel difference map between the first and second image frames, and a second pixel difference map between the second and third image frames. For each pixel, the smaller of the corresponding values in the first and the second pixel difference maps is selected to form a comprehensive difference map.
  • the computer system processes the comprehensive difference map using the neural network model to obtain the one or more 3D plane parameters 518 and one or more plane depth maps 520 without the 3D ground truth labels.
  • the client device obtains the neural network model and executes an extended reality application (e.g., an AR application).
  • the extended reality application obtains the input images, identifies the one or more 3D plane parameters and the one or more plane depth maps using the trained neural network model, and reconstructs one or more 3D semantic planes with one or more semantic labels from the one or more 3D plane parameters and the one or more plane depth maps.
  • the extended reality application of the client device identifies a device pose of the client device, and renders a virtual object on a first 3D semantic plane based on a respective semantic label of the first 3D semantic plane and the device pose of the client device.
  • the neural network model includes a first neural network configured to provide a set of 2D outputs.
  • the neural network also includes a second neural network configured to provide the one or more 3D plane parameters and the one or more plane depth maps.
  • the set of 2D outputs includes one or more of: one or more detection boxes, one or more semantic classes, and one or more plane masks of the sequence of consecutive image frames.
  • the loss function includes a first loss function associated with the set of 2D outputs and a second loss function associated with the one or more 3D plane parameters and the one or more plane depth maps.
  • training the neural network model includes training the first and second neural networks separately.
  • the first neural network is trained for the set of 2D outputs in a supervised manner
  • the second neural network is trained for the one or more 3D plane parameters and the one or more plane depth maps in an unsupervised (e.g., self-supervised) manner.
  • the neural network model is trained in a semi-supervised manner.
  • the first neural network is trained for the set of 2D outputs in an unsupervised manner, so both the first and second neural networks are trained in an unsupervised manner.
  • The order in which the operations in Figure 7 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed.
  • One of ordinary skill in the art would recognize various ways to apply the details described above with reference to Figures 1-6 to the 3D plane reconstruction method 700 described in Figure 7. For brevity, these details are not repeated here.
  • Computer- readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.
  • computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the embodiments described in the present application.
  • a computer program product may include a computer- readable medium.
  • Although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
  • a first network could be termed a second network, and, similarly, a second network could be termed a first network, without departing from the scope of the embodiments.
  • the first network and the second network are both networks, but they are not the same network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to reconstructing three-dimensional (3D) planes from image frames without supervision. A sequence of image frames is processed using a neural network model to provide one or more 3D plane parameters and one or more plane depth maps without 3D ground truth labels. A loss function combines the one or more 3D plane parameters and the one or more plane depth maps of the sequence of image frames. Based on the loss function, the neural network model is trained in an unsupervised manner for the one or more 3D plane parameters and the one or more plane depth maps. The trained neural network model is provided to a client device to process input images. Optionally, the loss function includes a photometric loss, a depth smoothness loss, and a depth consistency loss, and is minimized during training to determine the 3D plane parameters and plane depth maps.
PCT/US2021/039535 2021-06-29 2021-06-29 3D semantic plane detection and reconstruction WO2023277877A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/039535 WO2023277877A1 (fr) 2021-06-29 2021-06-29 3D semantic plane detection and reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/039535 WO2023277877A1 (fr) 2021-06-29 2021-06-29 3D semantic plane detection and reconstruction

Publications (1)

Publication Number Publication Date
WO2023277877A1 true WO2023277877A1 (fr) 2023-01-05

Family

ID=84692024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/039535 WO2023277877A1 (fr) 2021-06-29 2021-06-29 3D semantic plane detection and reconstruction

Country Status (1)

Country Link
WO (1) WO2023277877A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200084427A1 (en) * 2018-09-12 2020-03-12 Nvidia Corporation Scene flow estimation using shared features
US20200137380A1 (en) * 2018-10-31 2020-04-30 Intel Corporation Multi-plane display image synthesis mechanism

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965736A (zh) * 2023-03-16 2023-04-14 Tencent Technology (Shenzhen) Co., Ltd. Image processing method, apparatus, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21948630

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE