WO2023009113A1 - Interactive guidance for mapping and relocalization - Google Patents

Interactive guidance for mapping and relocalization

Info

Publication number
WO2023009113A1
Authority
WO
WIPO (PCT)
Prior art keywords
pose
view
score
sample
client device
Prior art date
Application number
PCT/US2021/043465
Other languages
English (en)
Inventor
Yuan Tian
Xiang Li
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/043465
Publication of WO2023009113A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/006Mixed reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/012Head tracking input arrangements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/207Image signal generators using stereoscopic image cameras using a single 2D image sensor
    • H04N13/221Image signal generators using stereoscopic image cameras using a single 2D image sensor using the relative movement between cameras and objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/271Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/296Synchronisation thereof; Control thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/004Annotating, labelling
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/111Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N13/117Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/261Image signal generators with monoscopic-to-stereoscopic image conversion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/275Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
    • H04N13/279Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals the virtual viewpoint locations being selected by the viewers or determined by tracking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/30Image reproducers
    • H04N13/332Displays for viewing with the aid of special glasses or head-mounted displays [HMD]
    • H04N13/344Displays for viewing with the aid of special glasses or head-mounted displays [HMD] with head-mounted left-right displays
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/30Image reproducers
    • H04N13/366Image reproducers using viewer tracking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N2013/0074Stereoscopic image analysis
    • H04N2013/0081Depth or disparity estimation from stereoscopic image signals

Definitions

  • This application relates generally to display technology including, but not limited to, methods, systems, and non-transitory computer-readable media for generating dense maps and performing relocalization to facilitate display of virtual content on an electronic system.
  • Augmented reality (AR) applications overlay virtual content on top of a view of real world objects or surfaces to provide enhanced perceptual information to a user.
  • Position, orientation, and scale of virtual objects are adjusted with respect to the real world objects or surfaces in the image, such that the virtual content can be seamlessly merged with the image.
  • Such adjustment is particularly critical to head-mounted displays, in which stereoscopic vision enables depth perception and exposes virtual-real alignment errors.
  • Most head mounted displays rely on accurate mappings of an environment to perform relocalization of the displays. Inaccuracies in mapping and/or relocalization require users to provide additional inputs.
  • Existing systems do not provide users with detailed instructions or receive the user inputs needed to enhance the accuracy of mapping and relocalization. As such, there is a need for an interactive process that implements mapping and/or relocalization in an efficient and cost-effective manner.
  • Various embodiments of this application are directed to methods and systems of interactive guidance for generating dense maps and performing relocalization to facilitate display of virtual content on artificial reality systems.
  • interactive guidance is implemented in an optical see-through head-mounted display (OST-HMD), and provides instructions to users based on additional inputs (e.g., captured images).
  • the instructions help generate dense maps of an environment where the OST-HMD is located and/or perform relocalization of the OST-HMD in the environment.
  • Examples of an artificial reality system include, but are not limited to: non-immersive, semi-immersive, and fully-immersive virtual-reality (VR) systems; marker-based, markerless, location-based, and projection-based AR systems; hybrid reality systems; and other types of mixed reality systems.
  • the disclosed methods and systems herein provide user guidance for both environment mapping and device localization based on a current device pose (e.g., position and orientation).
  • the user guidance includes an instruction describing in which direction the user should move or orient a client device.
  • the user guidance is computed to improve reconstruction quality, explore more unknown spaces of the scene, and increase device relocalization accuracy in an environment where the client device is located. This allows the user to generate accurate dense maps and determine device poses of artificial reality systems with little or no guesswork.
  • a method for interactively rendering AR content is provided.
  • the method includes obtaining a dense map of a physical space.
  • the dense map is generated based on a plurality of images captured by an imaging device at a plurality of device poses.
  • the method includes selecting a sample device pose of the imaging device and synthesizing a sample device view corresponding to the sample device pose based on the dense map.
  • the method further includes determining a view score of the sample device view, and in accordance with a determination that the view score satisfies an instruction criterion, generating a user instruction to adjust the imaging device to the sample device pose.
  • the method further includes capturing the plurality of images by the imaging device.
  • the method includes detecting the plurality of device poses corresponding to the plurality of images, and reconstructing the physical space to the dense map based on the images and device poses.
  • the method also includes identifying one or more regions of interest (ROIs) in the physical space based on the dense map.
  • the dense map includes a plurality of surface elements (surfels) and a plurality of volume elements (voxels).
  • Each ROI includes a respective subset of surfels and a respective subset of voxels.
  • identifying each of the ROIs further includes merging the respective subset of surfels and the respective subset of voxels to form the respective ROI.
  • the method includes, upon detecting that the imaging device is adjusted in response to the user instruction, capturing a next image by the imaging device and determining a next device pose of the imaging device. The method also includes updating the dense map based on the next image and the next device pose. This method is optionally applied for environment mapping or image rendering. In some embodiments, the method also includes capturing, by the imaging device, a current image corresponding to a current device pose in the physical space, estimating the current device pose of the imaging device, and rendering virtual content in the physical space and on top of the current image based on the current device pose and the dense map. Further, in some embodiments, the method further includes determining a pose estimation quality for the current device pose, and comparing the pose estimation quality with a quality threshold.
  • the user instruction is generated to adjust the imaging device to the sample device pose.
  • the current device pose is estimated periodically. For example, the pose can be estimated once every 5 seconds.
  • the sample device pose is within a predefined range of a current device pose of the imaging device, and the method further includes adding a variance of displacement or rotation to the current device pose of the imaging device to obtain the first sample device pose.
  • the sample device pose includes a first sample device pose
  • the sample device view includes a first sample device view.
  • the view score includes a first view score.
  • the method further includes selecting one or more second sample device poses of the imaging device and synthesizing one or more second sample device views based on the dense map. Each second sample device view corresponds to a respective second sample device pose.
  • the method also includes determining one or more second view scores of the one or more second sample device views, and in accordance with a determination that the first view score is greater than the one or more second view scores, selecting the first sample device pose to generate the user instruction to adjust the imaging device.
  • the first view score satisfies the instruction criterion when the first view score is greater than the one or more second view scores.
  • each of the first and second sample device poses is within a predefined range of a current device pose of the imaging device.
  • the method further includes for each of the first and second sample device poses, adding a respective variance of displacement or rotation to the current device pose of the imaging device.
  • determining the view score of the sample device view further includes one or more of: determining a first score term from an array of score elements, each score element representing an uncertainty value of a respective pixel of the sample device view corresponding to the sample device pose; determining a second score term from an area percentage of one or more visible frontier regions in the sample device view, each visible frontier region located between an unknown region and a scanned region of the dense map; and determining a third score term from a percentage of active bins in a spherical coordinate system having an origin in a corresponding predefined ROI.
  • the view score of the sample device view is a weighted sum of the first, second, and third score terms.
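  • As a rough illustration of how these terms may be combined (a minimal sketch, not the claimed implementation), the weight values below are assumptions; the disclosure only states that the view score is a weighted sum of the three terms.

```python
# Minimal sketch: the view score as a weighted sum of the three score terms.
# The weight values are illustrative assumptions.
def view_score(uncertainty_term: float,
               frontier_term: float,
               relocalization_term: float,
               weights=(1.0, 1.0, 1.0)) -> float:
    w1, w2, w3 = weights
    return w1 * uncertainty_term + w2 * frontier_term + w3 * relocalization_term
```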
  • the dense map includes a plurality of surface elements (surfels) and a plurality of volume elements (voxels), and each visible frontier region is located between an unknown surfel or voxel and a scanned surfel or voxel.
  • some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1A is an example guidance system having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 1B illustrates a pair of AR glasses that are communicatively coupled in a guidance system, in accordance with some embodiments.
  • Figure 2 illustrates a dense mapping guidance process, in accordance with some embodiments.
  • Figure 3 illustrates a relocalization process implemented by a client device, in accordance with some embodiments.
  • Figure 4 is a diagram illustrating pose sampling and view synthesis, in accordance with some embodiments.
  • Figure 5 is a diagram illustrating a score determination process, in accordance with some embodiments.
  • Figure 6 is a diagram illustrating a method of determining a score term related to relocalization quality, in accordance with some embodiments.
  • Figures 7A-7D are flowcharts of a method for interactively rendering AR content, in accordance with some embodiments.
  • Figure 8A is a block diagram illustrating a server system, in accordance with some embodiments.
  • Figure 8B is a block diagram illustrating a client device, in accordance with some embodiments.
  • FIG. 1A is an example guidance system 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera).
  • the one or more client devices 104 include a head-mounted display (HMD) 150.
  • the HMD 150 is an optical see-through Head-Mounted Display (OST-HMD), in which virtual objects are displayed on top of the real world.
  • Each client device 104 is configured to collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs are optionally processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the guidance system 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data for training a machine learning model (e.g., deep learning network).
  • storage 106 may also store video content, static visual content, and/or inertial sensor data obtained by a client device 104 to which a trained machine learning model can be applied to determine one or more poses associated with the video content, static visual content, and/or inertial sensor data.
  • the one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 150) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data, and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera remotely and in real time.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the guidance system 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G/5G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other electronic systems that route data and messages.
  • deep learning techniques are applied in the guidance system 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • the content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C and HMD 150).
  • the client device 104C or HMD 150 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequent to model training, the client device 104C or HMD 150 obtains the content data (e.g., captures video data via an internal and/or external image sensor, such as a camera) and processes the content data using the trained data processing models locally. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 150).
  • the server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the client device 104A or HMD 150 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized or predicted device poses) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104A or HMD 150 itself implements no or little data processing on the content data prior to sending them to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 150), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B or HMD 150.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104B or HMD 150 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • FIG. 1B illustrates a pair of augmented reality (AR) glasses 150 (also called a head-mounted display (HMD)) that are communicatively coupled in a guidance system 100, in accordance with some embodiments.
  • the AR glasses 150 includes one or more of an image sensor, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the image sensor and microphone are configured to capture video and audio data from a scene of the AR glasses 150, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the image sensor captures hand gestures of a user wearing the AR glasses 150.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the image sensor and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 150 is processed by the AR glasses 150, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and AR glasses 150 jointly to recognize and predict the device poses.
  • the device poses are used to control the AR glasses 150 itself or interact with an application (e.g., a gaming application) executed by the AR glasses 150.
  • the display of the AR glasses 150 displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items on the user interface.
  • deep learning techniques are applied in the guidance system to process video data, static image data, or inertial sensor data captured by the AR glasses 150.
  • Device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a data processing model. Training of the data processing model is optionally implemented by the server 102 or AR glasses 150. Inference of the device poses is implemented by each of the server 102 and AR glasses 150 independently or by both of the server 102 and AR glasses 150 jointly.
  • The processes shown in Figures 2-7D are described as being implemented by an electronic system (e.g., a client device 104, AR glasses 150, a server 102, or a combination thereof; Figures 1A-1B).
  • the processes and operations described below in reference to Figures 2-7D are, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic system.
  • Each of the operations shown in the processes and operations described below in reference to Figures 2-7D corresponds to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 806 of the server 102 in Figure 8A or memory 866 of the client device 104 in Figure 8B).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • FIG. 2 illustrates a dense mapping guidance process 200, in accordance with some embodiments.
  • the dense mapping guidance process 200 is controlled by a user 230 using a client device 104, such as a mobile phone 104C, AR glasses 150, and/or any other client device described above in reference to Figures 1A and 1B.
  • the dense mapping guidance process 200 scans an environment where the client device 104 is located via a camera of the client device 104, and generates a dense map of the environment for use in AR applications (e.g., games, tele-presence, visualization, and/or other applications described above in reference to Figures 1A and 1B).
  • the dense mapping guidance process 200 includes one or more of a preprocess operation 202, a pose estimation operation 204, a reconstruction or update reconstruction operation 206, a surface prediction operation 208, and a dense mapping guidance operation 220. Each operation of the dense mapping guidance process 200 is discussed below in turn.
  • the dense mapping guidance process 200 includes receiving an input 210 by a client device 104, such as captured and/or obtained image data.
  • the client device 104 obtains a plurality of images captured by an imaging device (e.g., a camera) of the client device 104 and/or by an imaging device communicatively coupled to the client device 104.
  • the plurality of images are obtained via a network 108 or local network 110 (Figure 1A).
  • the client device 104 obtains the plurality of images from a storage 106 (Figure 1A) communicatively coupled to the client device 104 or from a webpage visited by the client device 104.
  • the plurality of images include video, still images, streamed content, etc. Examples of the different types of imaging devices are described below in reference to Figures 8A and 8B.
  • After receiving an input 210, the client device 104 preprocesses the input 210 in the preprocess operation 202. In some embodiments, the client device 104 extracts sparse features from the input 210, and each sparse feature includes a vector or array that contains mostly zero values. Alternatively, the client device extracts dense features from the input 210, and each dense feature includes a vector or array that includes mostly non-zero values.
  • the dense mapping guidance process 200 includes performing the pose estimation operation 204.
  • the client device 104 determines a device pose (e.g., a position and/or orientation) of the client device 104 and/or identifies one or more objects in the environment.
  • the client device 104 determines or detects a 6 degrees-of-freedom (DoF) device pose based on the input 210.
  • the 6 DoF include rotation (e.g., roll, pitch, and yaw) and three-dimensional (3D) translation of the client device 104 with respect to the world and/or the plurality of images (i.e. input 210).
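  • For illustration only, a 6 DoF device pose of this kind can be packed into a 4x4 homogeneous transform from roll, pitch, yaw, and a 3D translation; the sketch below uses SciPy and is not part of this disclosure.

```python
# Illustrative 6 DoF pose: rotation (roll, pitch, yaw) plus 3D translation,
# assembled into a 4x4 homogeneous transform.
import numpy as np
from scipy.spatial.transform import Rotation

def pose_matrix(roll: float, pitch: float, yaw: float,
                translation: np.ndarray) -> np.ndarray:
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", [roll, pitch, yaw]).as_matrix()
    T[:3, 3] = translation
    return T

# Example: device yawed by 10 degrees and moved 0.5 m along z.
T = pose_matrix(0.0, 0.0, np.deg2rad(10.0), np.array([0.0, 0.0, 0.5]))
```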
  • the pose estimation operation 204 detects a plurality of device poses (for the client device 104) corresponding to the plurality of images. Further, in some embodiments, the one or more objects identified by the client device in the environment correspond to a plurality of feature points (e.g., a table corner, a window corner) having a plurality of feature positions in the environment. That said, the feature positions of the feature points and the device pose of the client device 104 are determined based on the sparse or dense features extracted from the input 210 (i.e., images).
  • the dense mapping guidance process 200 includes performing the reconstruction or update reconstruction operation 206.
  • the client device 104 reconstructs and updates the environment where the client device 104 is located based on the device pose and the one or more objects in the environment.
  • the client device 104 creates a virtual space corresponding to the environment based on the input 210.
  • the virtual space includes a plurality of feature points positioned in the virtual space based on the device poses of the client device 104 tracked by the client device 104.
  • Each image of the input 210 corresponds to a device pose of the client device 104 in the environment, and is used to determine a subset of the plurality of feature points in the virtual space.
  • the subset of the plurality of feature points may have been already identified in the virtual space, and virtual points of the subset of feature points are optionally updated according to this respective image or not updated at all.
  • one or more feature points have not been previously identified in the virtual space, and the one or more virtual points are updated according to this respective image.
  • the device poses generated from operation 204 and/or the extracted sparse and/or dense features generated from operation 202 are combined to reconstruct or update the virtual space of the environment where the client device 104 is located.
  • the reconstructed virtual space provides a 3D dense map of the scene and/or objects associated with the received input 210, and enables device localization, object visibility, and physical simulation in the AR applications executed on the client device 104.
  • the surface prediction operation 208 uses the reconstructed physical space and/or the input 210 to determine a plurality of surface elements (surfels) and/or a plurality of volume elements (voxels) for the dense map. For example, one or more of the plurality of images are scanned to determine the plurality of surfels and/or the plurality of voxels. Surfels are used to render complex geometric objects at interactive frame rates.
  • a surfel includes information corresponding to a surface element of a scene or object in the received input 210, such as depth, texture color, normal, etc.
  • a voxel is a value on a regular grid in three-dimensional space that represents at least a portion of a scene or object in the received input 210.
  • a voxel does not include position (i.e. coordinates) information. Instead, a position of a voxel can be determined based on its position relative to other voxels that also represent other portions of a scene or object in the received input 210.
  • the plurality of surfels and the plurality of voxels are used to generate at least one mesh for the dense map generated by the dense mapping guidance process 200.
  • the generated meshes provide at least a continuous surface and/or volume representation of a scene and/or object in the received input 210 for the dense map generated by the dense mapping guidance process 200.
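  • The surfel and voxel attributes described above can be captured by simple records such as the hypothetical sketch below; the field names (and the distance value stored per voxel) are assumptions, not definitions from this disclosure.

```python
# Hypothetical records for the surfel and voxel attributes described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray  # 3D location of the surface element
    normal: np.ndarray    # surface normal
    color: np.ndarray     # texture color (e.g., RGB)
    depth: float          # depth at which the surfel was observed

@dataclass
class Voxel:
    # A voxel's position is implied by its (i, j, k) index on the regular
    # grid rather than stored as explicit coordinates.
    index: tuple
    value: float          # e.g., a truncated signed distance value
    color: np.ndarray     # accumulated color
```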
  • In the dense mapping guidance operation 220, the client device 104 generates a user instruction to guide the user to adjust the client device 104 through the dense mapping guidance process 200 and improve the dense map generated thereby.
  • the dense mapping guidance operation 220 is performed after an update to a 3D reconstruction.
  • Based on the current pose of the client device 104, the dense mapping guidance operation 220 gives users guidance on adjustments to the client device 104 for capturing image data that is provided as input 210.
  • the dense mapping guidance operation 220 provides the users with instructions indicating in which direction the client device 104 should be moved (e.g., closer to a scene, further from the scene, to the left of a scene, to the right of a scene, adjustments to an angle, etc.).
  • the guidance is computed to improve reconstruction quality, explore more unknown spaces of the scene, and increase future relocalization accuracy. For example, an initial input 210 provided by the user 230, via the client device 104, that includes a limited representation of objects and/or a scene will result in an inaccurate or incomplete dense map; the user 230, via the dense mapping guidance operation 220, is provided with guidance on improving the generated dense map by providing additional inputs 210 (as discussed below).
  • the dense mapping guidance operation 220 helps make the mapping process easier, more efficient, and more intuitive for users to perform, resulting in accurate and detailed 3D reconstructions, or dense maps. As such, the dense mapping guidance operation 220 resolves difficulties that users experience in the 3D reconstruction of a scene and/or objects, and enables fast and reliable relocalization in which a precise device position and orientation of the client device 104 with respect to a known scene is estimated based on each given image obtained by the client device 104.
  • In the ROI generation process 222, the client device 104 identifies one or more ROIs in the physical space based on the dense map. ROIs are samples within a data set identified for a particular purpose.
  • an ROI includes one or more objects, points of interest, contours, surfaces, or other aspects of a scene and/or object.
  • each ROI includes a respective subset of surfels and a respective subset of voxels.
  • one or more ROIs are identified by merging the respective subset of surfels and the respective subset of voxels to form the respective ROI.
  • the ROI generation process 222 is occasionally performed to save computational resources. For example, ROIs are generated only once for the dense mapping guidance process 200. In another example, ROIs are generated at every other or third iteration of the dense mapping guidance process 200.
  • incremental ROIs are generated based on the plurality of surfels and the plurality of voxels in the ROI generation process 222.
  • a surfel representation is generated based on the plurality of surfels. Specifically, a connection graph is built for the surfel representation, and edges are built between one surfel and its neighboring surfels. Each voxel or surfel is treated as one cluster, and voxels and surfels are optionally merged by pairs into new clusters. Every two clusters are optionally merged by pairs as well. The energy for each clustering procedure is defined as:
  • E_v = E_L + λ · E_c
  • where E_L is a squared positional difference between the centers of the two clusters to be merged (e.g., two voxels), and
  • E_c is a color difference (e.g., in a CIELAB color space) of the two clusters to be merged.
  • The regularization weight λ is initialized with an empirical value.
  • In some embodiments, λ is set to 1 to make the positional difference E_L and the color difference E_c equally important.
  • the clustering procedure is similar to K-means in supervoxel algorithms. After merging two clusters, a new cluster with an averaged center and an averaged color replaces the two merged clusters, and has an energy level represented by E_v.
  • the merging procedure on two clusters is performed only if the energy of the new cluster is less than the total energy of the two clusters that are merged.
  • Large clusters are generated as a result of merging. In some embodiments, when such a large cluster has energy above a predetermined cluster energy threshold, it is not merged with any other cluster (e.g., a voxel, surfel, or cluster merged from voxels or surfels).
  • Each large cluster forms an ROI cluster that is treated as a single unit and is also called an incremental ROI.
  • In some embodiments, the energy E_v of a cluster is defined based on geometric features and/or semantic features, if enough computational resources are available.
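  • A minimal sketch of this merge test is shown below. It assumes λ = 1, approximates the CIELAB color difference by a Euclidean distance in Lab space, and initializes unmerged voxels/surfels with a hypothetical per-element energy budget, since the disclosure does not specify how leaf energies are set.

```python
# Sketch of the pairwise merge test E_v = E_L + lambda * E_c.
# Assumptions: lambda = 1, Euclidean CIELAB difference, and a hypothetical
# starting energy E0_LEAF for unmerged voxels/surfels.
from dataclasses import dataclass
import numpy as np

E0_LEAF = 1.0  # hypothetical energy budget of an unmerged voxel or surfel

@dataclass
class Cluster:
    center: np.ndarray       # 3D center of the cluster
    color: np.ndarray        # mean color in CIELAB space
    energy: float = E0_LEAF  # energy E_v of the cluster

def merge_energy(a: Cluster, b: Cluster, lam: float = 1.0) -> float:
    e_l = float(np.sum((a.center - b.center) ** 2))  # squared positional difference
    e_c = float(np.linalg.norm(a.color - b.color))   # CIELAB color difference
    return e_l + lam * e_c

def try_merge(a: Cluster, b: Cluster, lam: float = 1.0):
    """Merge only if the new cluster's energy is below the pair's total energy."""
    e_new = merge_energy(a, b, lam)
    if e_new < a.energy + b.energy:
        return Cluster(center=(a.center + b.center) / 2.0,
                       color=(a.color + b.color) / 2.0,
                       energy=e_new)
    return None  # merge rejected
```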
  • ROIs are initially generated from one or more images
  • newly added surfels and/or voxels are generated from additional inputs 210 provided by the user and treated as new clusters.
  • the cluster merging procedure is performed to merge new clusters with each other or with existing or previously generated clusters.
  • the ROIs are generated at a faster rate than in current practice (e.g., practice using depth image frames), because the number of surfels and/or voxels used in ROI generation is much smaller than the number of 3D points used in the current practice.
  • the client device 104 determines one or more potential device poses (e.g., which are sampled device poses) based on variances (e.g., displacements, rotations, and other spatial adjustments). The variances are added to the device poses of the client device 104 determined in the pose estimation operation 204. For each potential device pose, a simulated view for the client device 104 is synthesized. In some embodiments, ray casting is used to render a synthesized or simulated view for a truncated signed distance function (TSDF) representation. In some embodiments, surfels are rendered as points for a new synthesized or simulated view for a surfel representation.
  • For purposes of this disclosure, the terms synthesized view, sampled view, simulated view, synthesized device view, sample device view, and simulated device view are used interchangeably.
  • the sampled device poses for the client device 104 and the simulated views for the client device 104 cover scanned areas (i.e., scenes and/or objects captured in input 210), unknown areas (i.e., scenes and/or objects not captured in input 210), and/or both. More details on the sample poses and simulated views are provided below with reference to Figure 4.
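  • The pose sampling step might look like the sketch below, which adds random displacement and rotation variances to the current pose; the ±10 cm and ±5° bounds follow the example given later in this disclosure, and the view-synthesis step (ray casting a TSDF or rendering surfels as points) is left as a placeholder.

```python
# Sketch of sampling candidate device poses around the current pose.
# Bounds follow the +/-10 cm and +/-5 degree example given later on;
# view synthesis itself is omitted.
import numpy as np
from scipy.spatial.transform import Rotation

def sample_poses(current_pose: np.ndarray, n: int = 16,
                 max_disp: float = 0.10, max_rot_deg: float = 5.0,
                 rng=None):
    """Return n candidate 4x4 poses obtained by perturbing current_pose."""
    rng = rng or np.random.default_rng()
    candidates = []
    for _ in range(n):
        disp = rng.uniform(-max_disp, max_disp, size=3)                      # displacement variance
        angles = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg, size=3))  # rotation variance
        delta = np.eye(4)
        delta[:3, :3] = Rotation.from_euler("xyz", angles).as_matrix()
        delta[:3, 3] = disp
        candidates.append(current_pose @ delta)
    return candidates
```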
  • the client device 104 scores the sample device poses for the client device 104 and the simulated views for the client device 104, and identifies the highest-scoring sample device pose and simulated view on which a user instruction to adjust the client device 104 is based.
  • the highest scoring sample device pose and the simulated view for the client device 104 are presented with the user instruction to the user 230.
  • the user 230 enables a subsequent input 210 that can explore as much of the unknown area as possible, eliminate reconstruction uncertainty, increase the number of view angles, and improve distances and positions represented in the reconstruction for dense mapping.
  • 3D reconstruction of the environment where the client device 104 is located takes into account occlusion, collision, and physics simulation, and a high-quality 3D dense map is created and updated for subsequent AR interactions.
  • the client device 104 scores the sample device pose and simulated view using at least three different scoring criteria or terms.
  • the score terms are defined for each pixel (e.g., a position in a two-dimensional (2D) image) on a synthesized view determined in the pose sampling and view synthesis operation 224.
  • a first score term is related to the uncertainty of reconstruction
  • the second score term is related to a frontier of a simulated view
  • the third score term is related to relocalization.
  • the client device 104 determines, based on a Poisson-based reconstruction quality measure, the uncertainty of reconstruction for each pixel of a synthesized view determined in the pose sampling and view synthesis operation 224.
  • the first score term is determined from an array of score elements, each score element representing an uncertainty value of a respective pixel of the synthesized view corresponding to the sample device pose (e.g., a sum of all uncertainty values of the first score elements).
  • the Poisson-based reconstruction quality measure results in a confidence map defined on an estimated surface.
  • the first score term includes an uncertainty of semantic segmentation if the dense map and/or the synthesized view includes semantic segmentation.
  • the uncertainty of semantic segmentation is also determined using the Poisson-based reconstruction quality measure.
  • semantic segmentation is the identification and/or classification of one or more objects or regions in an image.
  • the uncertainty value for each pixel of the synthesized view is saved in each voxel and/or surfel.
  • the first score term is determined after ray casting projection or surfel-based rendering is performed.
  • the first score term is the sum of uncertainty values of all pixels on the synthesized view.
  • the Poisson-based reconstruction quality measure is an example of a method for determining the uncertainty of reconstruction.
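  • Given a per-pixel uncertainty image rendered from the stored per-voxel or per-surfel values, the first score term reduces to a sum over pixels, as in this minimal sketch.

```python
# First score term: sum of per-pixel reconstruction-uncertainty values of the
# synthesized view (the rendering step is assumed to have happened already).
import numpy as np

def uncertainty_score(uncertainty_image: np.ndarray) -> float:
    """uncertainty_image: HxW array of per-pixel uncertainty values."""
    return float(np.sum(uncertainty_image))
```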
  • the second score term is based on an area percentage of visible frontier region on a simulated view.
  • a “frontier” is defined as a region between unknown and seen surfaces (e.g., scanned and unscanned surfaces as shown in Figure 4).
  • each visible frontier region is located between an unknown surfel or voxel and a scanned surfel or voxel.
  • the second score term is determined from the area percentage of one or more visible frontier regions in the simulated view, each visible frontier region being located between an unknown region and a scanned region of the dense map.
  • each voxel and/or surfel belongs to one of three categories: unknown, seen, and frontier.
  • each pixel of a simulated view is assigned a category value, and the area of the frontier is determined by counting the frontier pixels.
  • the second score term is determined as the area percentage of visible frontier region on the simulated view.
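  • A minimal sketch of the second score term, assuming each pixel of the simulated view carries one of the three category labels; the label codes are hypothetical.

```python
# Second score term: area percentage of visible frontier pixels in the
# simulated view. The category codes are illustrative assumptions.
import numpy as np

UNKNOWN, SEEN, FRONTIER = 0, 1, 2  # hypothetical category codes

def frontier_score(category_image: np.ndarray) -> float:
    """category_image: HxW array of per-pixel category labels."""
    frontier_pixels = np.count_nonzero(category_image == FRONTIER)
    return frontier_pixels / category_image.size
```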
  • the third score term represents a quality of relocalization based on a spherical coordinates sampling.
  • an ROI cluster is taken as one unit and sampled in a spherical coordinates system, and the ROI cluster sampled in the spherical coordinates system is scored using the third score term.
  • predefined bins are used to discretize the space in the spherical coordinates system. Each bin is assigned as an occupied or active bin based on spherical coordinates determined for a vector connecting locations of client device 104 and ROI cluster. A percentage of activated bins is determined and assigned to each voxel and/or surfel in the corresponding ROI cluster.
  • the number of active bins is assigned to pixels of a simulated view. After the simulated view is generated, a sum of the activated bins of all pixels is treated as a relocalization score (i.e., the third score term).
  • the third score term is determined from a percentage of active bins in the spherical coordinate system having an origin fixed in a corresponding predefined ROI. The higher the third score term, the greater the possibility that the client device 104 observed the scene from more angles and distances. More details on the third score term and spherical coordinates are explained below with reference to Figure 6.
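  • One way to realize this binning (a sketch under assumed bin resolutions) is to convert the vector from the device to the ROI center into azimuth, elevation, and distance, discretize each into bins, and score the percentage of bins activated so far.

```python
# Third score term: percentage of active bins in a spherical coordinate
# system centered on an ROI. The bin resolutions are assumptions.
import numpy as np

def spherical_bin(device_pos, roi_center,
                  n_azimuth=12, n_elevation=6, n_distance=4, max_dist=5.0):
    """Return the (azimuth, elevation, distance) bin index for one observation."""
    v = np.asarray(device_pos, float) - np.asarray(roi_center, float)
    r = np.linalg.norm(v)
    azimuth = np.arctan2(v[1], v[0])              # [-pi, pi]
    elevation = np.arcsin(v[2] / max(r, 1e-9))    # [-pi/2, pi/2]
    ia = int((azimuth + np.pi) / (2 * np.pi) * n_azimuth) % n_azimuth
    ie = min(int((elevation + np.pi / 2) / np.pi * n_elevation), n_elevation - 1)
    ir = min(int(r / max_dist * n_distance), n_distance - 1)
    return ia, ie, ir

def relocalization_score(active_bins: set,
                         n_azimuth=12, n_elevation=6, n_distance=4) -> float:
    """Fraction of bins activated so far for the ROI."""
    return len(active_bins) / (n_azimuth * n_elevation * n_distance)
```
  • Each observation of the ROI adds its bin to active_bins; a higher fraction suggests the ROI has been observed from more angles and distances, matching the interpretation above.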
  • the client device 104 determines a final score, e.g., by determining a weighted average from at least the first score term, the second score term, and the third score term for a simulated view of a sample pose.
  • the sample pose (of all of the sampled poses determined by the pose sampling and view synthesis operation 224) having the highest final score is selected, and guidance (e.g., a user instruction) is generated and presented to the user 230 based on the sample pose having the highest final score.
  • the guidance in the dense mapping guidance process 200 includes a vector from the current client device 104 position to the selected sample pose for the client device 104.
  • the guidance includes a 3D model of a client device 104, such that the user 230 knows a target location and orientation of the client device 104 leading to a subsequent input 210.
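  • The guidance itself can then be as simple as the translation and relative rotation from the current device pose to the selected sample pose, as in this hypothetical sketch (both poses as 4x4 matrices).

```python
# Hypothetical guidance: where to move and how to re-orient the client device
# to reach the selected sample pose. Both poses are 4x4 matrices.
import numpy as np
from scipy.spatial.transform import Rotation

def guidance(current_pose: np.ndarray, target_pose: np.ndarray):
    move = target_pose[:3, 3] - current_pose[:3, 3]      # translation vector to follow
    rel = current_pose[:3, :3].T @ target_pose[:3, :3]   # relative rotation
    turn_deg = Rotation.from_matrix(rel).as_euler("xyz", degrees=True)
    return move, turn_deg
```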
  • After the client device 104 adjusts the device pose in response to the user instruction, the client device 104 captures a next image in a subsequent input 210.
  • the subsequent input 210 is used by the dense mapping guidance process 200 to determine a next device pose of the client device 104.
  • the dense map is updated based on the subsequent input 210 and next device pose.
  • each iteration of the dense mapping guidance process 200 updates the 3D reconstruction of the environment using the dense map.
  • the dense map is stored in a map database (e.g., map database 330; Figure 3) to be continuously updated in subsequent iterations of the dense mapping guidance process 200 or for future use in a relocalization process (e.g., 300 in Figure 3).
  • the dense mapping guidance process 200 reconstructs a consistent 3D virtual representation of a scene and/or objects, which enables more accurate localization, visibility, and physical simulation in associated AR applications.
  • the dense mapping guidance process 200 provides users with an easy and approachable method of controlling the client device 104 to provide input needed to generate accurate dense maps.
  • FIG. 3 illustrates a relocalization process 300 implemented by a client device 104 (e.g., AR glasses 150, mobile device 104C), in accordance with some embodiments.
  • the relocalization process 300 uses a dense map generated by the dense mapping guidance process 200 in AR applications (e.g., games, telepresence, visualization, and/or other applications described above in reference to Figures 1A and 1B).
  • the client device 104 obtains a dense map of a physical space (e.g., an environment or a scene where the client device 104 is located) from a map database 330.
  • the dense map is generated based on a plurality of images captured by an imaging device of a client device 104 at a plurality of device poses.
  • the user 230 obtains a dense map from a local map database 330 stored in memory of the client device 104 or from a remote map database 330 communicatively coupled to the client device 104. That said, the map database 330 is optionally located locally in or remotely from the client device 104.
  • the relocalization process 300 includes using the obtained dense map to perform the pose estimation operation 204.
  • the current device pose of the client device 104 is estimated periodically. For example, in some embodiments, the current device pose of the client device 104 is estimated at predetermined periods (e.g., every 5 seconds, 10 seconds, 30 seconds, etc.).
  • the pose estimation operation 204 determines a position and/or orientation of the client device 104 in the scene of the client device represented by the dense map extracted from the map database 330.
  • the client device 104 captures and provides an input 210 (e.g., current image data corresponding to a current device pose in the physical space), which is used to estimate the current device pose of the client device 104.
  • the input 210 is preprocessed via the preprocess operation 202 as described above in reference to Figure 2.
  • the client device 104 identifies one or more feature points from an image captured in association with the current device pose in the preprocess operation 202.
  • the one or more feature points are compared with a plurality of feature points of the dense map of the scene in the pose estimation operation 204 to determine where the client device 104 is in the scene, i.e., the current device pose including a current device position and a current device orientation.
  • a pose estimation quality is determined for the current device pose of the client device 104.
  • the client device 104 further determines whether the pose estimation quality satisfies an instruction criterion, e.g., which requires the pose estimation quality to be below a quality threshold to issue a user instruction. For example, the pose estimation quality is compared with the quality threshold to determine whether the pose estimation quality is less than the quality threshold. If the pose estimation quality generated for the current device pose is equal to or greater than the quality threshold required in the instruction criterion, the current device pose has a high fidelity (i.e., is suitable for use in the associated AR applications), and relocalization is optionally enhanced by a refinement operation 306.
  • If the pose estimation quality generated for the current device pose is below the quality threshold required in the instruction criterion, the current device pose has a limited fidelity and is not suitable for use in the associated AR applications.
  • a user instruction is generated for adjusting the device pose of the client device 104 to a sample device pose, e.g., via the relocalization guidance operation 320.
  • the pose estimation quality of the current device pose is represented by a view score indicating how the image associated with the current device pose is matched to the dense map.
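  • As a control-flow sketch of this decision (threshold value and helper functions are hypothetical, not from the disclosure): compare the pose estimation quality against the quality threshold, then either refine the pose or issue a guidance instruction.

```python
# Sketch: refine the pose when the estimate is good enough, otherwise guide
# the user to a better sample pose. Threshold and helpers are hypothetical.
QUALITY_THRESHOLD = 0.8  # assumed value

def relocalize_step(pose_quality, current_pose, dense_map,
                    refine_pose, generate_guidance):
    if pose_quality >= QUALITY_THRESHOLD:
        # High-fidelity pose: optionally refine, e.g., via PNP or ICP.
        return refine_pose(current_pose, dense_map)
    # Limited-fidelity pose: instruct the user to adjust the device.
    return generate_guidance(current_pose, dense_map)
```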
  • the refinement operation 306 is based on one of a perspective-n-point (PNP) method and an iterative closest point (ICP) method.
  • the PNP method estimates the pose of a calibrated client device 104 given a set of 3D points in the scene of the client device 104 and their corresponding 2D projections in the image or dense map.
  • the ICP method reconstructs 2D or 3D surfaces from different scans (e.g., image captures), to localize a device pose and/or achieve optimal path planning, and/or make other pose estimation.
  • Both the PNP method and the ICP method are used to refine the dense map and/or client device pose for relocalization.
  • the determined refinements are applied to the obtained dense map and/or current client device pose for accurate relocalization.
  • a dense map is refined and updated in the map database 330.
  • the current device pose of the client device 104 is within a predefined range of a refined device pose (or optimal device pose), and a variance of displacement or rotation is added to the current device pose of the client device 104 to obtain an updated device pose that is accurate for relocalization.
  • the refined dense map and/or current device pose is used to apply one or more AR effects (e.g., changes to the scene, one or more objects, device pose and/or view change).
  • the relocalization process 300 includes rendering virtual content in the physical space and on top of the input 210 (e.g., a current image) based on the current device pose and the dense map.
  • the relocalization guidance operation 320 resolves difficulties in the use of 3D reconstructed scenes and/or objects, allowing the current device pose of the client device 104 to be identified in the reconstructed dense map of the scene in a prompt and reliable manner.
  • the relocalization guidance operation 320 includes a pose sampling and view synthesis operation 224 and/or a scoring operation 226.
  • the pose sampling and view synthesis operation 224 and/or the scoring operation 226 determine a sample pose and/or simulated view having the highest score, and generate guidance (e.g., a user instruction) to guide the client device 104 to a subsequent pose and/or view.
  • the sample device pose of the client device 104 is selected and the simulated view corresponding to the sample device pose is synthesized based on the dense map.
  • a score of the simulated view is determined, and in accordance with a determination that the score satisfies an instruction criterion, the relocalization guidance operation 320 generates user instructions to adjust the client device 104 to the sample device pose. More details on the pose sampling and view synthesis operation 224 and the scoring operation 226 are explained above with reference to Figure 2.
  • FIG. 4 is a diagram 400 illustrating pose sampling and view synthesis, in accordance with some embodiments.
  • a client device 104 has a current device pose 415 and a current view 417.
  • Examples of the client device 104 include a mobile phone 104C, AR glasses 150, and/or any other client device described above in reference to Figures 1A and 1B.
  • the current device pose 415 of the client device 104 has a device position and a device orientation.
  • the current device pose 415 is determined from a current image having one or more feature points, and the one or more features points of the current image are compared with a plurality of feature points mapped for a scene to determine the current device pose 415 within the scene of the client device 104.
  • the client device 104 has sample device poses 425 and simulated views 427 derived based on the current device pose 415 of the client device 104, e.g., by the pose sampling and view synthesis operation 224 ( Figures 2-3).
  • the current view 417 and the simulated views 427 of the client device 104 cover a portion of an unknown space 410.
  • the unknown space 410 is part of the scene where the client device is located, and includes one or more objects in a plurality of images captured by the client device 104.
  • one or more ROIs 413 have been identified within the current view 417 of the client device 104, e.g., by ROI generation process 222 in Figure 2.
  • each of the sample device poses 425 is within a predefined range of the current device pose 415 of the client device 104, and includes a respective variance of displacement or rotation added to the current device pose 415 of the client device 104.
  • the predefined range of the current device pose 415 is optionally bounded by a variance limit of displacement (e.g., ±10 cm) and a variance limit of rotation (e.g., ±5°), and the respective sample device pose 425 varies from the current device pose 415 within the variance limit of displacement and the variance limit of rotation (a minimal sampling sketch follows this discussion of pose variances).
  • the respective variances of displacements or orientations are distinct for different sample device poses 425.
  • a first sample device pose 425-a includes a first variance of displacement and a first variance of rotation with respect to the current device pose 415 of the client device 104, and a second sample device pose 425-b includes a second variance of displacement and a second variance of rotation with respect to the current device pose 415 of the client device 104, such that the first sample device pose 425-a and the second sample device pose 425-b are distinct.
  • At least one of the first variances of displacement and rotation is distinct from a respective one of the second variances of displacement and rotation.
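A minimal sketch of this pose sampling, assuming the example variance limits above (±10 cm of displacement, ±5° of rotation) and a simple position-plus-Euler-angle pose representation; the limits, sample count, and representation are placeholders rather than values fixed by the application:

```python
import numpy as np

# Placeholder variance limits mirroring the example bounds in the text.
MAX_DISPLACEMENT_M = 0.10   # +/- 10 cm
MAX_ROTATION_DEG = 5.0      # +/- 5 degrees

def sample_candidate_poses(current_position, current_rotation_deg, num_samples=16, rng=None):
    """Derive sample device poses by adding a bounded variance of displacement and
    rotation to the current device pose (position in metres, rotation as Euler angles)."""
    rng = np.random.default_rng() if rng is None else rng
    current_position = np.asarray(current_position, dtype=float)
    current_rotation_deg = np.asarray(current_rotation_deg, dtype=float)
    poses = []
    for _ in range(num_samples):
        dp = rng.uniform(-MAX_DISPLACEMENT_M, MAX_DISPLACEMENT_M, size=3)  # x, y, z offset
        dr = rng.uniform(-MAX_ROTATION_DEG, MAX_ROTATION_DEG, size=3)      # yaw, pitch, roll offset
        poses.append((current_position + dp, current_rotation_deg + dr))
    return poses
```

Each sampled pose would then be passed to the view synthesis step to produce the corresponding simulated view 427.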
  • each simulated view 427 covers a respective portion of the unknown space 410, a known space (e.g., identified ROIs 413), or both.
  • each simulated view 427 covers a distinct portion of the unknown space 410.
  • each simulated view 427 covers a different portion of the unknown space 410.
  • each simulated view 427 is scored. Different scores are determined for the plurality of simulated views 427, and one of the plurality of simulated views 427 is determined to have the highest score, e.g., by the scoring operation 226 in Figures 2 and 3.
  • the simulated view 427 having the highest score is selected to generate a user instruction for adjusting the current device pose 415 of the client device 104. The user instruction, if followed by the user, improves accuracy levels for scene mapping and device relocalization.
  • FIG. 5 is a diagram illustrating a score determination process 500, in accordance with some embodiments.
  • the client device 104 corresponds to a sample device pose 425 and a simulated view 427 of an environment where the client device 104 is disposed.
  • the simulated view 427 is represented as a simulated view image 510.
  • the simulated view image 510 includes at least one of a TSDF representation and a surfel representation of the simulated view 427.
  • each simulated view image 510 is scored using a plurality of different score terms. Examples of the different score terms are described above in reference to Figures 2 and 3.
  • a final score 515 is determined for the simulated view image 510 by a weighted average from at least three score terms including a first score term related to uncertainty of reconstruction, a second score term related to a frontier of the simulated view 427, and a third score term related to relocalization quality.
  • a plurality of simulated views 427 are derived for a plurality of sample device poses 425 of the client device 104 using the score determination process 500, and one of the plurality of simulated views 427 has the highest final score 515. After the one of the plurality of simulated views 427 is identified, a corresponding device pose 425 is identified and used to generate a user instruction.
  • the user instruction includes information concerning the corresponding sample device pose 425 having the highest final score 515, e.g., a movement direction leading to the sample device pose 425. More details on the determination of each sample device pose 425 and simulated view 427 are discussed above with reference to Figures 2 and 3.
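A minimal sketch of the final-score computation and pose selection, assuming the three score terms have already been computed for each simulated view; the weights are placeholders, since the application describes a weighted average without fixing particular values:

```python
def final_score(uncertainty_term, frontier_term, relocalization_term, weights=(0.4, 0.3, 0.3)):
    """Combine the three score terms of a simulated view into a final score 515."""
    w1, w2, w3 = weights
    return w1 * uncertainty_term + w2 * frontier_term + w3 * relocalization_term

def select_best_pose(scored_views):
    """scored_views: iterable of (sample_pose, final_score). Returns the sample pose
    with the highest final score, which is then used to build the user instruction
    (e.g., a movement direction leading toward that pose)."""
    best_pose, _ = max(scored_views, key=lambda entry: entry[1])
    return best_pose
```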
  • FIG. 6 is a diagram illustrating a method 600 of determining a score term related to relocalization quality, in accordance with some embodiments.
  • a spherical coordinate 650 is used in conjunction with a scene element 610 to determine the score term (specifically, the third score term related to relocalization quality described above).
  • the scene element 610 can be one of a voxel in a TSDF representation and a surfel in a surfel representation.
  • the ROIs 413 include a scene element 610.
  • the scene element 610 is associated with a spherical coordinate 650 with the scene element 610 at the origin.
  • the spherical coordinate 650 includes three indices r, θ, and φ, where r is a distance between a new viewpoint of the client device 104 (e.g., a sample pose of the client device 104) and the scene element 610, and θ and φ are two angles that represent an orientation of the client device 104 viewpoint with respect to the origin.
  • Predefined bins are used to discretize a space.
  • a bin is defined by the vector index (r_i, θ_i, φ_i); a first unit size of a bin along the r axis can be set to the size of a TSDF unit or the average distance between two surfels, and two additional unit sizes of a bin along the θ and φ axes are set to a predefined angle (e.g., 30 degrees).
  • each AR application has a respective working range.
  • the working range includes minimum and maximum distance thresholds (r_min and r_max). In some embodiments, any physical location that is outside the working range of the AR application is ignored.
  • An example AR application applies to an indoor environment, and has a working range of 0.4-6 m corresponding to a minimum distance threshold of 0.4 m and a maximum distance threshold of 6 m. Locations in the transformed ROIs 610 having the distance r outside the working range of 0.4-6 m are ignored.
  • a vector connects a device location associated with a sample device pose (e.g., 425-a) of the client device 104 to a scene element 610.
  • the vector is determined in the spherical coordinate 650, and represented by three variables r, θ, and φ. Based on values of the variables representing the vector, a corresponding bin is identified as occupied or active. After all the previous viewpoints that have observed the scene element 610 are projected to the bins based on their locations in the spherical coordinate 650, a percentage of activated bins associated with the scene element 610 is determined. This process is performed for all scene elements within the simulated views 427. In some embodiments, the number of active bins is assigned to pixels during view synthesis.
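The following sketch illustrates this binning, assuming a 30-degree angular bin size, a placeholder radial bin size standing in for the TSDF unit, and the 0.4-6 m indoor working range used in the example above; none of the constants or function names are prescribed by the application.

```python
import numpy as np

R_BIN = 0.05                # metres, placeholder standing in for the TSDF unit size
N_THETA, N_PHI = 6, 12      # 180/30 and 360/30 angular bins
R_MIN, R_MAX = 0.4, 6.0     # example indoor working range from the text

def bin_index(viewpoint, scene_element):
    """Map the vector from a viewpoint to a scene element into an (r_i, theta_i, phi_i)
    bin of the spherical coordinate centred on the scene element, or None if the
    distance falls outside the AR application's working range."""
    v = np.asarray(viewpoint, dtype=float) - np.asarray(scene_element, dtype=float)
    r = np.linalg.norm(v)
    if not (R_MIN <= r <= R_MAX):
        return None
    theta = np.arccos(np.clip(v[2] / r, -1.0, 1.0))   # polar angle in [0, pi]
    phi = np.arctan2(v[1], v[0]) % (2 * np.pi)        # azimuth in [0, 2*pi)
    r_i = int((r - R_MIN) // R_BIN)
    theta_i = min(int(theta // np.radians(30.0)), N_THETA - 1)
    phi_i = min(int(phi // np.radians(30.0)), N_PHI - 1)
    return (r_i, theta_i, phi_i)

def active_bin_percentage(viewpoints, scene_element):
    """Project all viewpoints that observed the scene element into bins and return the
    fraction of bins activated, i.e., the relocalization-quality score term."""
    n_r = int((R_MAX - R_MIN) / R_BIN) + 1
    total_bins = n_r * N_THETA * N_PHI
    active = {b for vp in viewpoints if (b := bin_index(vp, scene_element)) is not None}
    return len(active) / total_bins
```

The resulting percentage serves as the third score term; in some embodiments it is assigned to pixels during view synthesis, as noted above.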
  • FIGS. 7A-7D are flowcharts of a method 700 for interactively rendering AR content as described above in Figures 1-6, in accordance with some embodiments.
  • the method 700 is described as being implemented by an electronic system (e.g., a client device 104 including AR glasses 150, a server 102, or a combination thereof).
  • Method 700 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic system.
  • Each of the operations shown in Figures 7A-7D corresponds to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 806 of the server 102 in Figure 8A or memory 866 of the client device 104 in Figure 8B).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • the electronic system captures (702-a) a plurality of images by an imaging device.
  • the electronic system includes and/or is communicatively coupled (e.g., via a network 108) to an imaging device, such as a camera, webcam, or other imaging device described herein.
  • the electronic system detects (702-b) a plurality of device poses corresponding to the plurality of images, and reconstructs (702-c) a physical space into a dense map based on the images and device poses (e.g., electronic system poses).
  • the electronic system obtains (704) a dense map of a physical space.
  • the dense map is generated based on a plurality of images captured by the imaging device at a plurality of device poses.
  • the electronic system identifies (706) one or more regions of interest (ROIs) in the physical space based on the dense map.
  • the dense map includes (708-a) a plurality of surface elements (surfels) and a plurality of volume elements (voxels), and each ROI includes a respective subset of surfels and a respective subset of voxels.
  • identifying each of the ROIs further includes merging (708-b) the respective subset of surfels and the respective subset of voxels to form the respective ROI.
  • the electronic system selects (710) a sample device pose of the imaging device and synthesizes (712) a sample device view corresponding to the sample device pose based on the dense map.
  • the electronic system determines (714) a view score of the sample device view.
  • determining the view score of the sample device view further includes (716-a) one or more of: determining (716-b) a first score term from an array of score elements, each score element representing an uncertainty value of a respective pixel of the sample device view corresponding to the sample device pose; determining (716-c) a second score term from an area percentage of one or more visible frontier regions in the sample device view, each visible frontier region being located between an unknown region and a scanned region of the dense map; and determining (716-d) a third score term from a percentage of active bins in a spherical coordinate having an origin in a corresponding predefined ROI.
  • the view score of the sample device view is (718) a weighted sum of the first, second, and third score terms.
  • the dense map includes (720) a plurality of surface elements (surfels) and a plurality of volume elements (voxels), and each visible frontier region is located between an unknown surfel or voxel and a scanned surfel or voxel. Additional information on the selection of the sample device view and the view score is provided above in Figures 4 and 5.
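As an illustration of the second score term, the sketch below assumes the synthesized view has been rasterized into a label image in which 0 marks unknown pixels and 1 marks scanned pixels; this per-pixel labeling is an assumption chosen for illustration, not a representation mandated by the application.

```python
import numpy as np

def frontier_area_percentage(view_labels):
    """Fraction of pixels in a synthesized view that lie on a visible frontier,
    i.e., scanned pixels with at least one unknown 4-neighbour."""
    view_labels = np.asarray(view_labels)
    scanned = view_labels == 1
    unknown = view_labels == 0
    # Pad with False so border pixels are compared only against in-view neighbours.
    pad = np.pad(unknown, 1, constant_values=False)
    neighbour_unknown = pad[:-2, 1:-1] | pad[2:, 1:-1] | pad[1:-1, :-2] | pad[1:-1, 2:]
    frontier = scanned & neighbour_unknown
    return frontier.sum() / view_labels.size
```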
  • in accordance with a determination that the view score satisfies an instruction criterion, the electronic system generates (722) a user instruction to adjust the imaging device to the sample device pose.
  • the sample device pose includes (724-a) a first sample device pose, the sample device view includes a first sample device view, and the view score includes a first view score.
  • the electronic system is configured to select (724-b) one or more second sample device poses of the imaging device and synthesize (724-c) one or more second sample device views based on the dense map, each second sample device view corresponding to a respective second sample device pose.
  • the electronic system is further configured to determine (724-d) one or more second view scores of the one or more second sample device views.
  • the electronic system selects (724-e) the first sample device pose to generate the user instruction to adjust the imaging device, the first view score satisfying the instruction criteria when the first view score is greater than the one or more second view scores.
  • each of the first and second sample device poses is (726-a) within a predefined range of a current device pose of the imaging device, and the electronic system is further configured to, for each of the first and second sample device poses, add (726-b) a respective variance of displacement or rotation to the current device pose of the imaging device.
  • the electronic system captures (728-a), by the imaging device, a current image corresponding to a current device pose in the physical space.
  • the electronic system estimates (728-b) the current device pose of the imaging device, and renders (728-c) virtual content in the physical space and on top of the current image based on the current device pose and the dense map.
  • the electronic system determines (730-a) a pose estimation quality for the current device pose, and compares (730-b) the pose estimation quality with a quality threshold. In accordance with a determination that the pose estimation quality is less than the quality threshold and that the view score satisfies the instruction criteria, the user instruction is generated (730-c) to adjust the imaging device to the sample device pose.
  • the current device pose is (732) estimated periodically.
  • upon detecting that the imaging device is adjusted in response to the user instruction, the electronic system captures (734-a) a next image by the imaging device and determines a next device pose of the imaging device, and updates (734-b) the dense map based on the next image and the next device pose.
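Pulling the steps of method 700 together, here is a minimal sketch of one guidance pass. The `system` object and its method names (estimate_pose, best_sample_view, satisfies_instruction_criteria, prompt_user, capture_image, update_dense_map) are hypothetical stand-ins for the modules described above, and the quality threshold is a placeholder value.

```python
def guidance_step(system, quality_threshold=0.6):
    """One pass of the guidance flow: estimate the current pose, and if its quality is
    low, score sample views, instruct the user toward the best sample pose, then fold
    the next capture back into the dense map."""
    pose, quality = system.estimate_pose()                       # 728-b / 730-a
    if quality < quality_threshold:                              # 730-b
        sample_pose, view_score = system.best_sample_view(pose)  # 710-718
        if system.satisfies_instruction_criteria(view_score):    # 722 / 730-c
            system.prompt_user(sample_pose)                      # user instruction
            image, next_pose = system.capture_image()            # 734-a (after adjustment)
            system.update_dense_map(image, next_pose)            # 734-b
    return pose
```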
  • The order in which the operations in Figures 7A-7D have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed.
  • One of ordinary skill in the art would recognize various ways to identify device poses as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 1-6 are also applicable in an analogous manner to method 700 described above with respect to Figures 7A-7D. For brevity, these details are not repeated here.
  • FIG. 8A is a block diagram illustrating a server system 102, in accordance with some embodiments.
  • the server system 102 typically includes one or more processing units (CPUs) 802, one or more network interfaces 804, memory 806, and one or more communication buses 808 for interconnecting these components (sometimes called a chipset).
  • the server system 102 includes one or more input devices 810 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the server system 102 also includes one or more output devices 812 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the server system 102 is communicatively coupled, via the one or more network interfaces 804 or communication buses 808, to one or more of a client device 104 (including HMDs 150), a storage 106, or a combination thereof.
  • Memory 806 includes high-speed random access memory, such as DRAM, SRAM, or other random access solid state memory devices, and optionally includes non-volatile memory, such as flash memory or other non-volatile solid state storage devices.
  • Memory 806, optionally, includes one or more storage devices remotely located from one or more processing units 802.
  • Memory 806, or alternatively the non-volatile memory within memory 806, includes a non-transitory computer readable storage medium.
  • memory 806, or the non- transitory computer readable storage medium of memory 806, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 814 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 816 for connecting the server 102 and other devices (e.g., client devices 104, HMDs 150, and/or storage 106) via one or more network interfaces 804 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 818 for enabling presentation of information (e.g., a graphical user interface for application(s) 824, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 and/or HMDs 150 via their respective output devices (e.g., displays, speakers, etc.);
  • Input processing module 820 for detecting one or more user inputs or interactions from one of the one or more input devices 810 and interpreting the detected input or interaction;
  • Web browser module 822 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • Model training module 826 for receiving training data (e.g., training data 842) and establishing a data processing model (e.g., data processing module 828) for processing content data (e.g., video data, visual data, audio data, sensor data) collected or obtained by a client device 104 and/or HMD 150;
  • Data processing module 828 for processing content data using data processing models 844, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 828 is associated with one of the user applications 824 to process the content data in response to a user instruction received from the user application 824;
  • Mapping and localization module 830 for mapping a scene where an imaging device is located and localizing the imaging device within the scene, where the mapping and localization module 830 is optionally executed jointly with the data processing module 828, and includes one or more of:
    o Dense mapping module 831 for creating a 2D or 3D dense map of the scene, the dense map optionally including a plurality of feature points that are associated with objects in the scene and determined and updated based on the content data (e.g., images) obtained by one or more of the client devices 104 and/or HMDs 150, as well as performing one or more operations described above in reference to Figure 2; and
    o Relocalization module 832 for determining and predicting a 6 degree of freedom device pose (position and orientation) of the client device 104 and/or HMD 150 with respect to the dense map, and/or refining an estimated pose of a client device 104 and/or HMD 150 based on content data obtained by one or more of the client devices 104 and/or HMDs 150, as well as performing one or more of the operations described above;
  • Reconstruction module 834 for generating virtual content based on content data (e.g., video data, visual data, audio data, sensor data) collected or obtained by one or more of a client device 104 and/or HMD 150, as well as the dense maps generated by the dense mapping module 831 and the relocalization data generated by the relocalization module 832, and rendering the virtual content on a display or on top of a view of the real world; and
  • the one or more databases 835 are stored in one of the server 102, client device 104, and storage 106 of the guidance system 100.
  • the one or more databases 835 are distributed in more than one of the server 102, client device 104, and storage 106 of the guidance system.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of display settings are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 806, optionally, stores a subset of the modules and data structures identified above.
  • memory 806, optionally, stores additional modules and data structures not described above.
  • FIG. 8B is a block diagram illustrating a client device 104 (such as desktop computers 104A, tablet computers 104B, mobile phones 104C, HMDs 150, and/or other computing devices), in accordance with some embodiments.
  • the client device 104 typically includes one or more processing units (CPUs) 852, one or more network interfaces 854, memory 866, and one or more communication buses 864 for interconnecting these components (sometimes called a chipset).
  • the client device 104 includes one or more input devices 856 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing image sensor, or other input buttons or controls.
  • the client device 104 uses a microphone and voice recognition or one or more image sensors 858 and gesture recognition to supplement or replace the keyboard.
  • the one or more image sensors 858 (e.g., tracking cameras, infrared sensors, CMOS sensors, LiDAR sensors, monocular monochrome cameras, RGB cameras, RGBD cameras, scanners, or photo sensor units) capture images or video, detect users or interesting objects, and/or detect environmental conditions (e.g., background scenery or objects).
  • the client device 104 also includes one or more output devices 862 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
  • the client device 104 includes an inertial measurement unit (IMU) integrating multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space.
  • the one or more inertial sensors include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
  • the client device 104 is communicatively coupled, via the one or more network interfaces 854 or communication buses 864, to one or more of a server system 102, other client devices 104 (including HMDs 150), a storage 106, or a combination thereof.
  • Memory 866 includes high-speed random access memory, such as DRAM, SRAM, or other random access solid state memory devices, and optionally includes non-volatile memory, such as flash memory or other non-volatile solid state storage devices.
  • Memory 866 optionally, includes one or more storage devices remotely located from one or more processing units 852.
  • Memory 866, or alternatively the non-volatile memory within memory 866 includes a non-transitory computer readable storage medium.
  • memory 866, or the non- transitory computer readable storage medium of memory 866 stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 874 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 876 for connecting the client device 104 and other devices (e.g., other client devices 104, the server 102, and/or storage 106) via one or more network interfaces 854 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 878 for enabling presentation of information (e.g., a graphical user interface for application(s) 884, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 and/or HMD 150 via their respective output devices (e.g., displays, speakers, etc.);
  • Input processing module 880 for detecting one or more user inputs or interactions from one of the one or more input devices 856 and interpreting the detected input or interaction;
  • Web browser module 882 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with the client device 104 or another electronic device, controlling the client device 104 or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 884 for execution by the client device 104 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Data processing module 828 for processing content data using data processing models 844, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 828 is associated with one of the user applications 884 to process the content data in response to a user instruction received from the user application 884;
  • Mapping and localization module 830 for mapping a scene where an imaging device is located and localizing the imaging device within the scene, where the mapping and localization module 830 is optionally executed jointly with the data processing module 828, and includes one or more of:
    o Dense mapping module 831 for creating a 2D or 3D dense map of the scene, the dense map optionally including a plurality of feature points that are associated with objects in the scene and determined and updated based on the content data (e.g., images) obtained by one or more of the client devices 104 and/or HMDs 150, as well as performing one or more operations described above in reference to Figure 2; and
    o Relocalization module 832 for determining and predicting a 6 degree of freedom device pose (position and orientation) of the client device 104 and/or HMD 150 with respect to the dense map, and/or refining an estimated pose of a client device 104 and/or HMD 150 based on content data obtained by one or more of the client devices 104 and/or HMDs 150, as well as performing one or more of the operations described above;
  • Reconstruction module 834 for generating virtual content based on content data (e.g., video data, visual data, audio data, sensor data) collected or obtained by one or more of a client device 104 and/or HMD 150, as well as the dense maps generated by the dense mapping module 831 and the relocalization data generated by the relocalization module 832, and rendering the virtual content on a display or on top of a view of the real world; and
  • the one or more databases 835 are stored in one of the server 102, client device 104, and storage 106 of the guidance system 100.
  • the one or more databases 835 are distributed in more than one of the server 102, client device 104, and storage 106 of the guidance system.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of display settings are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 866 optionally, stores a subset of the modules and data structures identified above.
  • memory 866 optionally, stores additional modules and data structures not described above.
  • the application utilizes images captured by a client device (or other computing device) to generate a dense map of a scene and/or one or more objects, performs relocalization based on the generated dense map and/or captured images, and provides the user with guidance during map generation and relocalization, enabling the user to generate the dense map and perform relocalization with improved accuracy and finer detail.
  • the user guidance system provides the user with easily understandable and actionable instructions that enable the user to provide additional images in an efficient manner to improve the generation of the dense map and the relocalization, thus improving the user's augmented-reality experience.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

This application relates to electronic devices, methods, and non-transitory computer-readable media for interactively mapping an environment, localizing a device, and rendering augmented reality (AR) content. An electronic system obtains a dense map of a physical space. The dense map is generated based on a plurality of images captured by an imaging device at a plurality of device poses. The electronic system selects a sample device pose of the imaging device, and synthesizes a sample device view corresponding to the sample device pose based on the dense map. A view score is determined for the sample device view. In accordance with a determination that the view score satisfies an instruction criterion, the electronic system generates a user instruction to adjust the imaging device to the sample device pose. In some embodiments, the dense map includes a plurality of surface elements (surfels) and a plurality of volume elements (voxels).
PCT/US2021/043465 2021-07-28 2021-07-28 Guidage interactif pour cartographie et relocalisation WO2023009113A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/043465 WO2023009113A1 (fr) 2021-07-28 2021-07-28 Guidage interactif pour cartographie et relocalisation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/043465 WO2023009113A1 (fr) 2021-07-28 2021-07-28 Guidage interactif pour cartographie et relocalisation

Publications (1)

Publication Number Publication Date
WO2023009113A1 true WO2023009113A1 (fr) 2023-02-02

Family

ID=85088062

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/043465 WO2023009113A1 (fr) 2021-07-28 2021-07-28 Guidage interactif pour cartographie et relocalisation

Country Status (1)

Country Link
WO (1) WO2023009113A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015192117A1 (fr) * 2014-06-14 2015-12-17 Magic Leap, Inc. Procédés et systèmes de création d'une réalité virtuelle et d'une réalité augmentée
US20200273190A1 (en) * 2018-03-14 2020-08-27 Dalian University Of Technology Method for 3d scene dense reconstruction based on monocular visual slam
US20210004979A1 (en) * 2018-10-04 2021-01-07 Google Llc Depth from motion for augmented reality for handheld user devices
WO2021092600A2 (fr) * 2020-12-14 2021-05-14 Innopeak Technology, Inc. Réseau pose-over-parts pour estimation de pose multi-personnes


Similar Documents

Publication Publication Date Title
JP6768156B2 (ja) 仮想的に拡張された視覚的同時位置特定及びマッピングのシステム及び方法
CN112771539B (zh) 采用使用神经网络从二维图像预测的三维数据以用于3d建模应用
US11776142B2 (en) Structuring visual data
CN106896925A (zh) 一种虚拟现实与真实场景融合的装置
WO2023056544A1 (fr) Système de localisation et procédé de localisation d'objet et de caméra pour la cartographie du monde réel
US11748998B1 (en) Three-dimensional object estimation using two-dimensional annotations
US20230245373A1 (en) System and method for generating a three-dimensional photographic image
WO2018080849A1 (fr) Simulation de profondeur de champ
WO2021146449A1 (fr) Historique d'objet visuel
US20240203152A1 (en) Method for identifying human poses in an image, computer system, and non-transitory computer-readable medium
JP2023532285A (ja) アモーダル中心予測のためのオブジェクト認識ニューラルネットワーク
US11741671B2 (en) Three-dimensional scene recreation using depth fusion
WO2023086398A1 (fr) Réseaux de rendu 3d basés sur des champs de radiance neurale de réfraction
US20220254008A1 (en) Multi-view interactive digital media representation capture
WO2023009113A1 (fr) Guidage interactif pour cartographie et relocalisation
US11417063B2 (en) Determining a three-dimensional representation of a scene
CN115309113A (zh) 一种零件装配的引导方法及相关设备
KR102299902B1 (ko) 증강현실을 제공하기 위한 장치 및 이를 위한 방법
WO2023027712A1 (fr) Procédés et systèmes permettant de reconstruire simultanément une pose et des modèles humains 3d paramétriques dans des dispositifs mobiles
WO2023277877A1 (fr) Détection et reconstruction de plan sémantique 3d
CN115729250A (zh) 一种无人机的飞行控制方法、装置、设备及存储介质
WO2023003558A1 (fr) Étalonnage d'affichage stéréoscopique interactif
US20240185511A1 (en) Information processing apparatus and information processing method
WO2023063937A1 (fr) Procédés et systèmes de détection de régions planes à l'aide d'une profondeur prédite
WO2024123343A1 (fr) Mise en correspondance stéréo pour une estimation de profondeur à l'aide de paires d'images avec des configurations de pose relative arbitraires

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21952068

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE