EP3861533A1 - A cross reality system - Google Patents
A cross reality system
- Publication number
- EP3861533A1 EP19868457.3A
- Authority
- EP
- European Patent Office
- Prior art keywords
- map
- persistent
- maps
- frame
- coordinate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/006—Mixed reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0481—Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
- G06F3/04815—Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
Definitions
- This patent application also claims priority to and the benefit of U.S. Provisional Patent Application No. 62/868,786, filed on June 28, 2019 and entitled “RANKING AND MERGING A PLURALITY OF
- This application relates generally to a cross reality system.
- Computers may control human user interfaces to create an X Reality (XR or cross reality) environment in which some or all of the XR environment, as perceived by the user, is generated by the computer.
- XR environments may be virtual reality (VR), augmented reality (AR), and mixed reality (MR) environments, in which some or all of an XR environment may be generated by computers using, in part, data that describes the
- This data may describe, for example, virtual objects that may be rendered in a way that users sense or perceive as a part of a physical world, such that users can interact with the virtual objects.
- the user may experience these virtual objects as a result of the data being rendered and presented through a user interface device, such as, for example, a head-mounted display device.
- the data may be displayed to the user to see, or may control audio that is played for the user to hear, or may control a tactile (or haptic) interface, enabling the user to experience touch sensations that the user senses or perceives as feeling the virtual object.
- XR systems may be useful for many applications, spanning the fields of scientific visualization, medical training, engineering design and prototyping, tele-manipulation and tele-presence, and personal entertainment.
- AR and MR, in contrast to VR, include one or more virtual objects in relation to real objects of the physical world.
- the experience of virtual objects interacting with real objects greatly enhances the user’s enjoyment in using the XR system, and also opens the door for a variety of applications that present realistic and readily understandable information about how the physical world might be altered.
- an XR system may build a representation of the physical world around a user of the system.
- This representation may be constructed by processing images acquired with sensors on a wearable device that forms a part of the XR system.
- a user might perform an initialization routine by looking around a room or other physical environment in which the user intends to use the XR system until the system acquires sufficient information to construct a representation of that environment.
- the sensors on the wearable devices might acquire additional information to expand or update the representation of the physical world.
- aspects of the present application relate to methods and apparatus for providing X reality (cross reality or XR) scenes. Techniques as described herein may be used together, separately, or in any suitable combination.
- Some embodiments relate to an electronic system including one or more sensors configured to capture information about a three-dimensional (3D) environment.
- the captured information includes a plurality of images.
- the electronic system includes at least one processor configured to execute computer executable instructions to generate a map of at least a portion of the 3D environment based on the plurality of images.
- the computer executable instructions further include instructions for: identifying a plurality of features in the plurality of images; selecting a plurality of key frames from among the plurality of images based, at least in part, on the plurality of features of the selected key frames; generating one or more coordinate frames based, at least in part, on the identified features of the selected key frames, and storing, in association with the map of the 3D environment, the one or more coordinate frames as one or more persistent coordinate frames.
- the one or more sensors comprise a plurality of pixel circuits arranged in a two-dimensional array such that each image of the plurality of images comprises a plurality of pixels. Each feature corresponds to a plurality of pixels.
- identifying a plurality of features in the plurality of images comprises selecting as the identified features a number, less than a predetermined maximum, of groups of the pixels based on a measure of similarity to groups of pixels depicting portions of persistent objects.
- storing the one or more coordinate frames comprises storing for each of the one or more coordinate frames: descriptors representative of at least a subset of the features in a selected key frame from which the coordinate frame was generated.
- storing the one or more coordinate frames comprises storing, for each of the one or more coordinate frames, at least a subset of the features in a selected key frame from which the coordinate frame was generated.
- storing the one or more coordinate frames comprises storing, for each of the one or more coordinate frames, a transformation between a coordinate frame of the map of the 3D environment and the persistent coordinate frame; and geographic information indicating a location within the 3D environment of a selected key frame from which the coordinate frame was generated.
- the geographic information comprises a WiFi fingerprint of the location.
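The stored items listed above (descriptors, a transformation relative to the map, and geographic information such as a WiFi fingerprint) can be pictured as one record per persistent coordinate frame. The following is a minimal sketch only: the class names, field names, the row-major 4x4 matrix layout, and the representation of a WiFi fingerprint as a BSSID-to-signal-strength mapping are all assumptions made for illustration and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical representation: a row-major 4x4 rigid transform.
Matrix4 = Tuple[Tuple[float, float, float, float], ...]

IDENTITY: Matrix4 = (
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 0.0, 1.0),
)


@dataclass
class PersistentCoordinateFrame:
    """Illustrative record for one persistent coordinate frame (PCF)."""

    pcf_id: str
    # Transformation between the map's coordinate frame and this PCF.
    map_from_pcf: Matrix4 = IDENTITY
    # Descriptors for a subset of features in the originating key frame.
    feature_descriptors: List[Tuple[float, ...]] = field(default_factory=list)
    # Assumed fingerprint format: BSSID -> received signal strength in dBm.
    wifi_fingerprint: Dict[str, int] = field(default_factory=dict)


@dataclass
class StoredMap:
    """Illustrative map that keeps its PCFs keyed by identifier."""

    map_id: str
    pcfs: Dict[str, PersistentCoordinateFrame] = field(default_factory=dict)

    def store_pcf(self, pcf: PersistentCoordinateFrame) -> None:
        # Store the frame in association with this map.
        self.pcfs[pcf.pcf_id] = pcf
```

Keeping the transformation and the fingerprint on the same record means a later lookup by location can return everything needed to relate the frame to the map in one step.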
- the computer executable instructions comprise instructions for computing feature descriptors for individual features with an artificial neural network.
- the artificial neural network is a first artificial neural network.
- the computer executable instructions comprise instructions for implementing a second artificial neural network configured to compute a frame descriptor to represent a key frame based, at least in part, on the computed feature descriptors for the identified features in the key frame.
- the computer executable instructions further comprise an application programming interface configured to provide to an application, executing on the portable electronic system, information characterizing a persistent coordinate frame of the one or more persistent coordinate frames; instructions for refining the map of the 3D environment based on a second plurality of images; adjusting one or more of the persistent coordinate frames based, at least in part, on the second plurality of images; instructions for providing through the application programming interface, notification of the adjusted persistent coordinate frames.
- adjusting the one or more persistent coordinate frames comprises adjusting a translation and rotation of the one or more persistent coordinate frames relative to an origin of the map of the 3D environment.
- the electronic system comprises a wearable device and the one or more sensors are mounted on the wearable device.
- the map is a tracking map computed on the wearable device. The origin of the map is determined based on a location where the device is powered on.
- the computer executable instructions further comprise instructions for tracking motion of the portable device; and controlling the timing of execution of the instructions for generating one or more coordinate frames and/or the instructions for storing one or more persistent coordinate frames based on the tracked motion indicating motion of the wearable device exceeding a threshold distance, wherein the threshold distance is between two and twenty meters.
- Some embodiments relate to a method of operating an electronic system to render virtual content in a 3D environment comprising a portable device.
- the method includes, with one or more processors: maintaining on the portable device a coordinate frame local to the portable device based on output of one or more sensors on the portable device; obtaining a stored coordinate frame from stored spatial information about the 3D environment; computing a transformation between the coordinate frame local to the portable device and the obtained stored coordinate frame; receiving a specification of a virtual object having a coordinate frame local to the virtual object and a location of the virtual object with respect to the selected stored coordinate frame; and rendering the virtual object on a display of the portable device at a location determined, at least in part, based on the computed transformation and the received location of the virtual object.
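The rendering step of this method amounts to composing rigid transformations. The sketch below is illustrative only and assumes that poses are expressed as 4x4 homogeneous matrices, that localization has already produced the pose of the stored frame in the device's local frame, and that rotation is restricted to yaw to keep the example short. All function names are hypothetical.

```python
import math


def rigid(yaw_rad, tx, ty, tz):
    """Build a 4x4 rigid transform: rotation about z, then translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(a, b):
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform using R^T and -R^T * p."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[0] + [new_p[0]], r_t[1] + [new_p[1]], r_t[2] + [new_p[2]],
            [0.0, 0.0, 0.0, 1.0]]


def apply(t, point):
    """Apply a 4x4 transform to a 3D point."""
    x, y, z = point
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3]
                 for i in range(3))


def object_in_device_frame(local_from_device, local_from_pcf, object_in_pcf):
    """Express a point given in the stored frame in the device's own frame."""
    device_from_pcf = matmul(invert_rigid(local_from_device), local_from_pcf)
    return apply(device_from_pcf, object_in_pcf)


# Usage: a stored frame rotated 90 degrees and offset 1 m along x,
# with the device located at (1, 1, 0) in its local frame.
local_from_pcf = rigid(math.pi / 2, 1.0, 0.0, 0.0)
local_from_device = rigid(0.0, 1.0, 1.0, 0.0)
position = object_in_device_frame(local_from_device, local_from_pcf,
                                  (1.0, 0.0, 0.0))
```

Because the virtual object's location is given relative to the stored frame, only the device-specific transform changes from one device to another, while the object specification stays the same.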
- obtaining the stored coordinate frame comprises obtaining the coordinate frame through an application programming interface (API).
- the portable device comprises a first portable device comprising a first processor of the one or more processors.
- the system further comprises a second portable device comprising a second processor of the one or more processors.
- the processor on each of the first and second devices obtains a same, stored coordinate frame; computes a transformation between a coordinate frame local to a respective device and the obtained same stored coordinate frame; receives the specification of the virtual object; and renders the virtual object on a respective display.
- each of the first and second devices comprises: a camera configured to output a plurality of camera images; a key frame generator configured to transform a plurality of camera images to a plurality of key frames; a persistent pose calculator configured to generate a persistent pose by averaging the plurality of key frames; a tracking map and persistent pose transformer configured to transform a tracking map to the persistent pose to determine the persistent pose relative to an origin of the tracking map; a persistent pose and persistent coordinate frame (PCF) transformer configured to transform the persistent pose to a PCF; and a map publisher, configured to transmit spatial information, including the PCF, to a server.
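The text describes a persistent pose calculator that averages key frames but does not say how the averaging is performed. One possible reading, sketched below purely as an assumption, is to average key frame positions arithmetically and to average orientations by summing sign-aligned unit quaternions and renormalizing, an approximation that is reasonable only when the orientations are close together. The pose representation and function name are hypothetical.

```python
import math
from typing import Sequence, Tuple

Position = Tuple[float, float, float]
Quaternion = Tuple[float, float, float, float]  # assumed order: (w, x, y, z)
Pose = Tuple[Position, Quaternion]


def average_poses(poses: Sequence[Pose]) -> Pose:
    """Average key frame poses into a single pose (illustrative only)."""
    if not poses:
        raise ValueError("at least one pose is required")
    n = len(poses)
    # Arithmetic mean of the positions.
    position = tuple(sum(p[0][i] for p in poses) / n for i in range(3))
    # q and -q encode the same rotation, so flip each quaternion into the
    # hemisphere of the first one before summing.
    reference = poses[0][1]
    acc = [0.0, 0.0, 0.0, 0.0]
    for _, q in poses:
        dot = sum(a * b for a, b in zip(reference, q))
        sign = -1.0 if dot < 0.0 else 1.0
        for i in range(4):
            acc[i] += sign * q[i]
    norm = math.sqrt(sum(c * c for c in acc))
    if norm == 0.0:
        raise ValueError("orientations cancel out; cannot average")
    orientation = tuple(c / norm for c in acc)
    return position, orientation
```

For widely spread orientations a more careful rotation average would be needed; this sketch only covers the clustered case.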
- the method further comprises executing an application to generate the specification of the virtual object and the location of the virtual object with respect to the selected stored coordinate frame.
- maintaining on the portable device a coordinate frame local to the portable device comprises, for each of the first and second portable devices: capturing a plurality of images about the 3D environment from the one or more sensors of the portable device, computing one or more persistent poses based, at least in part, on the plurality of images, and generating spatial information about the 3D environment based, at least in part, on the computed one or more persistent poses.
- the method further comprises, for each of the first and second portable devices transmitting to a remote server the generated spatial information; and obtaining the stored coordinate frame comprises receiving the stored coordinate frame from the remote server.
- computing the one or more persistent poses based, at least in part, on the plurality of images comprises: extracting one or more features from each of the plurality of images; generating a descriptor for each of the one or more features; generating a key frame for each of the plurality of images based, at least in part, on the descriptors; and generating the one or more persistent poses based, at least in part, on the one or more key frames.
- generating the one or more persistent poses comprises selectively generating a persistent pose based on the portable device traveling a predetermined distance from a location of other persistent poses.
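The distance-based trigger can be sketched as a simple check against the positions of existing persistent poses. This is a minimal illustration under stated assumptions: positions are 3D points in meters in a common frame, distance is Euclidean, and the default of 3.0 meters is a hypothetical value chosen for the example (the text elsewhere mentions a threshold distance between two and twenty meters).

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


def should_create_persistent_pose(
    device_position: Point,
    existing_pose_positions: Iterable[Point],
    min_distance_m: float = 3.0,  # hypothetical default, not from the source
) -> bool:
    """Return True when the device is far enough from every existing pose."""
    return all(
        math.dist(device_position, existing) >= min_distance_m
        for existing in existing_pose_positions
    )


# Usage: create poses only as the device moves away from earlier ones.
poses = [(0.0, 0.0, 0.0)]
for position in [(1.0, 0.0, 0.0), (3.5, 0.0, 0.0), (4.0, 0.0, 0.0)]:
    if should_create_persistent_pose(position, poses):
        poses.append(position)
```

Spacing persistent poses this way keeps their number proportional to the area explored rather than to the time spent in one place.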
- each of the first and second devices comprises a download system configured to download the stored coordinate frame from a server.
- Some embodiments relate to an electronic system for maintaining persistent spatial information about a 3D environment for rendering virtual content on each of a plurality of portable devices.
- the electronic system includes a networked computing device.
- the networked computing device includes at least one processor; at least one storage device connected to the processor; a map storing routine, executable with the at least one processor, to receive from portable devices of the plurality of portable devices, a plurality of maps and store map information on the at least one storage device, wherein each of the plurality of received maps comprises at least one coordinate frame; and a map transmitter, executable with the at least one processor, to: receive location information from a portable device of the plurality of portable devices; select one or more maps from among the stored maps; and transmit to the portable device of the plurality of portable devices information from the selected one or more maps, wherein the transmitted information comprises a coordinate frame of a map of the selected one or more maps.
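The map transmitter's selection step can be illustrated with a small ranking function. The sketch below assumes that the location information sent by a device is a set of observed Wi-Fi BSSIDs and that each stored map carries the set of BSSIDs seen when it was built. The Jaccard score, the `top_k` cutoff, and all names are assumptions for illustration and are not the selection method of the application.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two sets as |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_maps(
    stored_maps: Dict[str, Set[str]],
    device_bssids: Set[str],
    top_k: int = 3,  # hypothetical cutoff
) -> List[str]:
    """Return identifiers of the stored maps that best match the device."""
    scored = [
        (jaccard(bssids, device_bssids), map_id)
        for map_id, bssids in stored_maps.items()
    ]
    # Discard maps with no overlap, then sort by score, best first.
    scored = [(score, map_id) for score, map_id in scored if score > 0.0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [map_id for _, map_id in scored[:top_k]]


# Usage with made-up access point identifiers.
stored = {
    "office": {"aa", "bb", "cc"},
    "lobby": {"cc", "dd"},
    "garage": {"zz"},
}
candidates = select_maps(stored, {"cc", "dd"})
```

The server would then transmit information from the selected maps, including a coordinate frame, back to the requesting device.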
- the coordinate frame comprises a computer data structure.
- the computer data structure comprises a coordinate frame comprising information characterizing a plurality of features of objects in the 3D environment.
- the information characterizing the plurality of features comprises descriptors characterizing regions of the 3D environment.
- each coordinate frame of the at least one coordinate frame comprises persistent points characterized by features detected in sensor data representing the 3D environment.
- each coordinate frame of the at least one coordinate frame comprises a persistent pose.
- each coordinate frame of the at least one coordinate frame comprises a persistent coordinate frame.
- Figure 1 is a sketch illustrating an example of a simplified augmented reality (AR) scene, according to some embodiments
- Figure 2 is a sketch of an exemplary simplified AR scene, showing exemplary use cases of an XR system, according to some embodiments;
- Figure 3 is a schematic diagram illustrating data flow for a single user in an AR system configured to provide an experience to the user of AR content interacting with a physical world, according to some embodiments;
- Figure 4 is a schematic diagram illustrating an exemplary AR display system, displaying virtual content for a single user, according to some embodiments
- Figure 5A is a schematic diagram illustrating a user wearing an AR display system rendering AR content as the user moves through a physical world environment, according to some embodiments;
- Figure 5B is a schematic diagram illustrating a viewing optics assembly and attendant components, according to some embodiments.
- Figure 6A is a schematic diagram illustrating an AR system using a world reconstruction system, according to some embodiments.
- Figure 6B is a schematic diagram illustrating components of an AR system that maintain a model of a passable world, according to some embodiments.
- Figure 7 is a schematic illustration of a tracking map formed by a device traversing a path through a physical world.
- Figure 8 is a schematic diagram illustrating a user of a cross reality (XR) system, perceiving virtual content, according to some embodiments;
- Figure 9 is a block diagram of components of a first XR device of the XR system of Figure 8 that transform between coordinate systems, according to some embodiments;
- Figure 10 is a schematic diagram illustrating an exemplary transformation of origin coordinate frames into destination coordinate frames in order to correctly render local XR content, according to some embodiments
- Figure 11 is a top plan view illustrating pupil-based coordinate frames, according to some embodiments.
- Figure 12 is a top plan view illustrating a camera coordinate frame that includes all pupil positions, according to some embodiments.
- Figure 13 is a schematic diagram of the display system of Figure 9, according to some embodiments.
- Figure 14 is a block diagram illustrating the creation of a persistent coordinate frame (PCF) and the attachment of XR content to the PCF, according to some embodiments;
- Figure 15 is a flow chart illustrating a method of establishing and using a PCF, according to some embodiments.
- Figure 16 is a block diagram of the XR system of Figure 8, including a second XR device, according to some embodiments;
- Figure 17 is a schematic diagram illustrating a room and key frames that are established for various areas in the room, according to some embodiments;
- Figure 18 is a schematic diagram illustrating the establishment of persistent poses based on the key frames, according to some embodiments;
- Figure 19 is a schematic diagram illustrating the establishment of a persistent coordinate frame (PCF) based on the persistent poses, according to some embodiments;
- Figures 20A to 20C are schematic diagrams illustrating an example of creating PCFs, according to some embodiments.
- Figure 21 is a block diagram illustrating a system for generating global descriptors for individual images and/or maps, according to some embodiments;
- Figure 22 is a flow chart illustrating a method of computing an image descriptor, according to some embodiments.
- Figure 23 is a flow chart illustrating a method of localization using image descriptors, according to some embodiments.
- Figure 24 is a flow chart illustrating a method of training a neural network, according to some embodiments.
- Figure 25 is a block diagram illustrating a method of training a neural network, according to some embodiments.
- Figure 26 is a schematic diagram illustrating an AR system configured to rank and merge a plurality of environment maps, according to some embodiments
- Figure 27 is a simplified block diagram illustrating a plurality of canonical maps stored on a remote storage medium, according to some embodiments.
- Figure 28 is a schematic diagram illustrating a method of selecting canonical maps to, for example, localize a new tracking map in one or more canonical maps and/or obtain PCF’s from the canonical maps, according to some embodiments;
- Figure 29 is a flow chart illustrating a method of selecting a plurality of ranked environment maps, according to some embodiments;
- Figure 30 is a schematic diagram illustrating an exemplary map rank portion of the AR system of Figure 26, according to some embodiments.
- Figure 31A is a schematic diagram illustrating an example of area attributes of a tracking map (TM) and environment maps in a database, according to some embodiments;
- Figure 31B is a schematic diagram illustrating an example of determining a geographic location of a tracking map (TM) for geolocation filtering of Figure 29, according to some embodiments;
- Figure 32 is a schematic diagram illustrating an example of geolocation filtering of Figure 29, according to some embodiments.
- Figure 33 is a schematic diagram illustrating an example of Wi-Fi BSSID filtering of Figure 29, according to some embodiments.
- Figure 34 is a schematic diagram illustrating an example of localization of Figure 29, according to some embodiments.
- Figures 35 and 36 are block diagrams of an XR system configured to rank and merge a plurality of environment maps, according to some embodiments.
- Figure 37 is a block diagram illustrating a method of creating environment maps of a physical world, in a canonical form, according to some embodiments.
- Figures 38A and 38B are schematic diagrams illustrating an environment map created in a canonical form by updating the tracking map of Figure 7 with a new tracking map, according to some embodiments.
- Figures 39A to 39F are schematic diagrams illustrating an example of merging maps, according to some embodiments.
- Figure 40 is a two-dimensional representation of a three-dimensional first local tracking map (Map 1), which may be generated by the first XR device of Figure 9, according to some embodiments;
- Figure 41 is a block diagram illustrating uploading Map 1 from the first XR device to the server of Figure 9, according to some embodiments;
- Figure 42 is a schematic diagram illustrating the XR system of Figure 16, showing the second user has initiated a second session using a second XR device of the XR system after the first user has terminated a first session, according to some embodiments;
- Figure 43A is a block diagram illustrating a new session for the second XR device of Figure 42, according to some embodiments.
- Figure 43B is a block diagram illustrating the creation of a tracking map for the second XR device of Figure 42, according to some embodiments.
- Figure 43C is a block diagram illustrating downloading a canonical map from the server to the second XR device of Figure 42, according to some embodiments;
- Figure 44 is a schematic diagram illustrating a localization attempt to localize to a canonical map a second tracking map (Map 2), which may be generated by the second XR device of Figure 42, according to some embodiments;
- Figure 45 is a schematic diagram illustrating a localization attempt to localize to a canonical map the second tracking map (Map 2) of Figure 44, which may be further developed and with XR content associated with PCFs of Map 2, according to some embodiments;
- Figures 46A-46B are schematic diagrams illustrating a successful localization of Map 2 of Figure 45 to the canonical map, according to some embodiments.
- Figure 47 is a schematic diagram illustrating a canonical map generated by including one or more PCFs from the canonical map of Figure 46A into Map 2 of Figure 45, according to some embodiments;
- Figure 48 is a schematic diagram illustrating the canonical map of Figure 47 with further expansion of Map 2 on the second XR device, according to some embodiments;
- Figure 49 is a block diagram illustrating uploading Map 2 from the second XR device to the server, according to some embodiments.
- Figure 50 is a block diagram illustrating merging Map 2 with the canonical map, according to some embodiments.
- Figure 51 is a block diagram illustrating transmission of a new canonical map from the server to the first and second XR devices, according to some embodiments;
- Figure 52 is a block diagram illustrating a two-dimensional representation of Map 2 and a head coordinate frame of the second XR device that is referenced to Map 2, according to some embodiments;
- Figure 53 is a block diagram illustrating, in two-dimensions, adjustment of the head coordinate frame which can occur in six degrees of freedom, according to some embodiments;
- Figure 54 is a block diagram illustrating a canonical map on the second XR device wherein sound is localized relative to PCFs of Map 2, according to some embodiments;
- Figures 55 and 56 are a perspective view and a block diagram illustrating use of the XR system when the first user has terminated a first session and the first user has initiated a second session using the XR system, according to some embodiments;
- Figures 57 and 58 are a perspective view and a block diagram illustrating use of the XR system when three users are simultaneously using the XR system in the same session, according to some embodiments;
- Figure 59 is a flow chart illustrating a method of recovering and resetting a head pose, according to some embodiments.
- Figure 60 is a block diagram of a machine in the form of a computer that can find application in the present invention system, according to some embodiments.
- Described herein are methods and apparatus for providing X reality (XR or cross reality) scenes.
- an XR system must know the users’ physical surroundings in order to correctly correlate locations of virtual objects in relation to real objects.
- An XR system may build an environment map of a scene, which may be created from image and/or depth information collected with sensors that are part of XR devices worn by users of the XR system.
- each XR device develops a local map of its physical environment by integrating information from one or more images collected during a scan at a point in time.
- the coordinate system of that map is tied to the orientation of the device when the scan was initiated. That orientation may change from instant to instant as a user interacts with the XR system, whether different instances in time are associated with different users, each with their own wearable device with sensors that scan the environment, or the same user who uses the same device at different times.
- the inventors have realized and appreciated techniques for operating XR systems based on persistent spatial information that overcome limitations of an XR system in which each user device relies only on spatial information that it collects relative to an orientation that is different for different user instances (e.g., snapshot in time) or sessions (e.g., the time between being turned on and off) of the system.
- the techniques may provide XR scenes for a more
- the persistent spatial information may be represented by a persistent map, which may enable one or more functions that enhance an XR experience.
- the persistent map may be stored in a remote storage medium (e.g., a cloud).
- the wearable device worn by a user, after being turned on, may retrieve from persistent storage, such as from cloud storage, an appropriate stored map that was previously created and stored. That previously stored map may have been based on data about the environment collected with sensors on the user’s wearable device during prior sessions. Retrieving a stored map may enable use of the wearable device without a scan of the physical world with the sensors on the wearable device.
- the system/device, upon entering a new region of the physical world, may similarly retrieve an appropriate stored map.
- the stored map may be represented in a canonical form that each XR device may relate to its local frame of reference.
- the stored map accessed by one device may have been created and stored by another device and/or may have been constructed by aggregating data about the physical world collected by sensors on multiple wearable devices that were previously present in at least a portion of the physical world represented by the stored map.
- sharing data about the physical world among multiple devices may enable shared user experiences of virtual content. Two XR devices that have access to the same stored map, for example, may both localize with respect to the stored map.
- a user device may render virtual content that has a location specified by reference to the stored map by translating that location to a frame of reference maintained by the user device.
- the user device may use this local frame of reference to control the display of the user device to render the virtual content in the specified location.
- the XR system may include components that, based on data about the physical world collected with sensors on user devices, develop, maintain, and use persistent spatial information, including one or more stored maps. These components may be distributed across the XR system, with some operating, for example, on a head mounted portion of a user device. Other components may operate on a computer associated with the user and coupled to the head mounted portion over a local or personal area network. Yet others may operate at a remote location, such as at one or more servers accessible over a wide area network.
- These components may include components that can identify from information about the physical world collected by one or more user devices information that is of sufficient quality to be stored as or in a persistent map.
- An example of such a component, described in greater detail below, is a map merge component.
- Such a component may receive inputs from a user device and determine the suitability of parts of the inputs to be used to update a persistent map.
- a map merge component may split a local map created by a user device into parts, determine mergibility of one or more of the parts to a persistent map, and merge the parts that meet qualified mergibility criteria to the persistent map.
- a map merge component may also promote a part that is not merged with a persistent map to be a separate persistent map.
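The split, merge, and promote behavior described for the map merge component can be sketched as follows. This is a minimal illustration, not the system's implementation: the `MapPart` and `PersistentMap` types, the `quality` and `overlap` scores, and both thresholds are hypothetical stand-ins for whatever mergibility criteria an actual system applies.

```python
from dataclasses import dataclass, field


@dataclass
class MapPart:
    """A fragment of a local (tracking) map. All fields are illustrative."""
    part_id: str
    quality: float  # hypothetical quality score in [0, 1]
    overlap: float  # hypothetical overlap with the persistent map in [0, 1]


@dataclass
class PersistentMap:
    map_id: str
    parts: list = field(default_factory=list)


def merge_local_map(parts, persistent, min_quality=0.6, min_overlap=0.3):
    """Merge qualifying parts into `persistent` and promote the rest.

    Returns the list of newly promoted persistent maps.
    The thresholds are assumptions made for this sketch.
    """
    promoted = []
    for part in parts:
        if part.quality < min_quality:
            continue  # not of sufficient quality to persist at all
        if part.overlap >= min_overlap:
            persistent.parts.append(part)  # mergeable: extend the existing map
        else:
            # Good quality but not mergeable: becomes its own persistent map.
            promoted.append(PersistentMap(f"promoted-{part.part_id}", [part]))
    return promoted
```

In this sketch a part is either merged, promoted, or dropped, so the persistent map only ever grows from parts that pass both checks.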
- these components may include components that may aid in determining an appropriate persistent map that may be retrieved and used by a user device.
- a map rank component may receive inputs from a user device and identify one or more persistent maps that are likely to represent the region of the physical world in which that device is operating.
- a map rank component may aid in selecting a persistent map to be used by that local device as it renders virtual content, gathers data about the environment, or performs other actions.
- a map rank component, alternatively or additionally, may aid in identifying persistent maps to be updated as additional information about the physical world is collected by one or more user devices.
- Yet other components may determine transformations that transform information captured or described in relation to one reference frame into another reference frame.
- sensors may be attached to a head mounted display such that the data read from that sensor indicates locations of objects in the physical world with respect to the head pose of the wearer.
- One or more transformations may be applied to relate that location information to the coordinate frame associated with a persistent environment map.
- data indicating where a virtual object is to be rendered when expressed in a coordinate frame of a persistent environment map may be put through one or more transformations to be in a frame of reference of the display on the user’s head.
- there may be multiple such transformations.
- These transformations may be partitioned across the components of an XR system such that they may be efficiently updated and/or applied in a distributed system.
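A chain of such transformations can be illustrated with 4x4 rigid transforms. The sketch below is an assumption-laden example: the frame names (sensor, head, tracking map, persistent map), the numeric values, and the restriction to rotation about a single axis are all chosen for brevity and are not taken from the system described here.

```python
import math


def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_transform(yaw_rad, tx, ty, tz):
    """Build a 4x4 rigid transform: rotation about the z axis, then translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def invert_rigid(t):
    """Invert a rigid transform using R^T and -R^T * translation."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def apply(t, point):
    """Apply a 4x4 transform to a 3D point."""
    x, y, z = point
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3]
                 for i in range(3))


# Hypothetical frames: sensor -> head -> tracking map -> persistent map.
head_from_sensor = rigid_transform(0.0, 0.0, 0.1, 0.0)
tracking_from_head = rigid_transform(math.pi / 2, 1.0, 0.0, 0.0)
persistent_from_tracking = rigid_transform(0.0, 5.0, 5.0, 0.0)

# Composing the chain yields one transform from sensor data to the persistent map.
persistent_from_sensor = mat_mul(
    persistent_from_tracking, mat_mul(tracking_from_head, head_from_sensor))
```

Keeping each link as a separate matrix means that only the link that changes (for example, the head pose) has to be recomputed, while the inverse of the composed transform carries a location in the persistent map back to the sensor or display frame.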
- the persistent maps may be constructed from information collected by multiple user devices.
- the XR devices may capture local spatial information and construct separate tracking maps with information collected by sensors of each of the XR devices at various locations and times.
- Each tracking map may include points, each of which may be associated with a feature of a real object that may include multiple features.
- the tracking maps may be used to track users’ motions in a scene, enabling an XR system to estimate respective users’ head poses based on a tracking map.
- the inventors have realized and appreciated techniques for operating XR systems to provide XR scenes for a more immersive user experience such as estimating head pose at a frequency of 1 kHz, with low usage of computational resources in connection with an XR device, that may be configured with, for example, four video graphic array (VGA) cameras operating at 30 Hz, one inertial measurement unit (IMU) operating at 1 kHz, compute power of a single advanced RISC machine (ARM) core, memory less than 1 GB, and network bandwidth less than 100 Mbps.
- These techniques may include hybrid tracking such that an XR system can leverage both (1) patch-based tracking of distinguishable points between successive images (e.g., frame-to-frame tracking) of the environment, and (2) matching of points of interest of a current image with a descriptor-based map of known real-world locations of corresponding points of interest (e.g., map-to-frame tracking).
- the XR system may track particular points of interest (e.g., salient points), such as corners, between captured images of the real-world environment.
- the display system may identify locations of visual points of interest in a current image, which were included in (e.g., located in) a previous image.
- This identification may be accomplished using, e.g., photometric error minimization processes.
- the XR system may access map information indicating real-world locations of points of interest, and match points of interest included in a current image to the points of interest indicated in the map information.
- Information regarding the points of interest may be stored as descriptors in the map database.
- the XR system may calculate its pose based on the matched visual features.
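The map-to-frame half of hybrid tracking can be sketched as a descriptor lookup. The example assumes binary descriptors encoded as integers, brute-force nearest-neighbor search, and a Hamming-distance cutoff; none of these choices is stated in the text above, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary descriptors."""
    return bin(a ^ b).count("1")


def match_to_map(frame_descriptors, map_points, max_distance=2):
    """Match each frame descriptor to its nearest map point.

    `map_points` maps a stored descriptor to a known 3D world location.
    Returns a list of (frame_index, world_location) pairs.
    """
    if not map_points:
        return []
    matches = []
    for idx, desc in enumerate(frame_descriptors):
        best = min(map_points, key=lambda stored: hamming(desc, stored))
        if hamming(desc, best) <= max_distance:  # reject weak matches
            matches.append((idx, map_points[best]))
    return matches
```

The resulting 2D-to-3D correspondences are the input a pose solver would consume; the pose computation itself is outside the scope of this sketch.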
- U.S. Patent Application No. 16/221,065 describes hybrid tracking and is hereby incorporated herein by reference in its entirety.
- These techniques may include reducing the amount of data that is processed when constructing maps, such as by constructing sparse maps with a collection of mapped points and keyframes and/or dividing the maps into blocks to enable updates by blocks.
- a mapped point may be associated with a point of interest in the environment.
- a keyframe may include selected information from camera-captured data.
- persistent spatial information may be represented in a way that may be readily shared among users and among the distributed components, including applications.
- Information about the physical world may be represented as persistent coordinate frames (PCFs).
- PCFs may be defined based on one or more points that represent features recognized in the physical world. The features may be selected such that they are likely to be the same from user session to user session of the XR system.
- PCFs may exist sparsely, providing less than all of the available information about the physical world, such that they may be efficiently processed and transferred.
- Techniques for processing persistent spatial information may include creating dynamic maps based on one or more coordinate systems in real space across one or more sessions, and generating persistent coordinate frames (PCF) over the sparse maps, which may be exposed to XR applications via, for example, an application programming interface (API). These capabilities may be supported by techniques for ranking and merging multiple maps created by one or more XR devices. Persistent spatial information may also enable quickly recovering and resetting head poses on each of one or more XR devices in a computationally efficient way.
- an image frame may be represented by a numeric descriptor. That descriptor may be computed via a transformation that maps a set of features identified in the image to the descriptor. That transformation may be performed in a trained neural network.
- the set of features that is supplied as an input to the neural network may be a filtered set of features, extracted from the image using techniques, for example, that preferentially select features that are likely to be persistent.
- the representation of image frames as a descriptor enables, for example, efficient matching of new image information to stored image information.
- An XR system may store in conjunction with persistent maps descriptors of one or more frames underlying the persistent map.
- a local image frame acquired by a user device may similarly be converted to such a descriptor.
- the descriptor may be computed for key frames in the local map and the persistent map, further reducing processing when comparing maps.
- Such an efficient comparison may be used, for example, to simplify finding a persistent map to load in a local device or to find a persistent map to update based on image information acquired with a local device.
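As an illustration of comparing frame descriptors to find a persistent map, the sketch below ranks stored maps by the distance between a query descriptor and each map's closest key frame descriptor. The descriptors are assumed to be fixed-length numeric vectors already produced elsewhere (for example by a trained network); the Euclidean metric, the map identifiers, and the function names are assumptions of the sketch.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_persistent_maps(query, stored, top_k=1):
    """Rank persistent maps by their closest key frame descriptor to `query`.

    `stored` maps a map id to the list of descriptors of its key frames.
    """
    scored = [
        (min(l2(query, d) for d in descriptors), map_id)
        for map_id, descriptors in stored.items()
        if descriptors
    ]
    scored.sort()  # smallest distance first
    return [map_id for _, map_id in scored[:top_k]]
```

Because only one short vector per key frame is compared, the search cost scales with the number of key frames rather than with the number of features or pixels.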
- Techniques as described herein may be used together or separately with many types of devices and for many types of scenes, including wearable or portable devices with limited computational resources that provide an augmented or mixed reality scene.
- the techniques may be implemented by one or more services that form a portion of an XR system.
- Figures 1 and 2 illustrate scenes with virtual content displayed in conjunction with a portion of the physical world.
- an AR system is used as an example of an XR system.
- Figures 3-6B illustrate an exemplary AR system, including one or more processors, memory, sensors and user interfaces that may operate according to the techniques described herein.
- an outdoor AR scene 354 is depicted in which a user of an AR technology sees a physical world park-like setting 356, featuring people, trees, buildings in the background, and a concrete platform 358.
- the user of the AR technology also perceives that they "see" a robot statue 357 standing upon the physical world concrete platform 358, and a cartoon-like avatar character 352 flying by which seems to be a personification of a bumble bee, even though these elements (e.g., the avatar character 352, and the robot statue 357) do not exist in the physical world.
- Due to the extreme complexity of the human visual perception and nervous system it is challenging to produce an AR technology that facilitates a comfortable, natural-feeling, rich presentation of virtual image elements amongst other virtual or physical world imagery elements.
- Such an AR scene may be achieved with a system that builds maps of the physical world based on tracking information, enables users to place AR content in the physical world, determines locations in the maps of the physical world where AR content is placed, preserves the AR scenes such that the placed AR content can be reloaded to display in the physical world during, for example, a different AR experience session, and enables multiple users to share an AR experience.
- the system may build and update a digital representation of the physical world surfaces around the user. This representation may be used to render virtual content so as to appear fully or partially occluded by physical objects between the user and the rendered location of the virtual content, to place virtual objects, in physics based interactions, and for virtual character path planning and navigation, or for other operations in which information about the physical world is used.
- FIG. 2 depicts another example of an indoor AR scene 400, showing exemplary use cases of an XR system, according to some embodiments.
- the exemplary scene 400 is a living room having walls, a bookshelf on one side of a wall, a floor lamp at a corner of the room, a floor, a sofa, and coffee table on the floor.
- the user of the AR technology also perceives virtual objects such as images on the wall behind the sofa, birds flying through the door, a deer peeking out from the book shelf, and a decoration in the form of a windmill placed on the coffee table.
- the AR technology requires information about not only the surfaces of the wall but also objects and surfaces in the room that occlude the images, such as the shape of the lamp, to render the virtual objects correctly.
- the AR technology requires information about all the objects and surfaces around the room for rendering the birds with realistic physics to avoid the objects and surfaces or bounce off them if the birds collide.
- the AR technology requires information about the surfaces such as the floor or coffee table to compute where to place the deer.
- the system may identify that the windmill is an object separate from the table and may determine that it is movable, whereas corners of shelves or corners of the wall may be determined to be stationary. Such a distinction may be used in determinations as to which portions of the scene are used or updated in each of various operations.
- the virtual objects may be placed in a previous AR experience session.
- the AR technology requires that the virtual objects be accurately displayed at the locations where they were previously placed and be realistically visible from different viewpoints.
- the windmill should be displayed as standing on the books rather than drifting above the table at a different location without the books. Such drifting may happen if the locations of the users of the new AR experience sessions are not accurately localized in the living room.
- the AR technology requires that the corresponding sides of the windmill be displayed.
- a scene may be presented to the user via a system that includes multiple components, including a user interface that can stimulate one or more user senses, such as sight, sound, and/or touch.
- the system may include one or more sensors that may measure parameters of the physical portions of the scene, including position and/or motion of the user within the physical portions of the scene.
- the system may include one or more computing devices, with associated computer hardware, such as memory. These components may be integrated into a single device or may be distributed across multiple interconnected devices. In some embodiments, some or all of these components may be integrated into a wearable device.
- FIG. 3 depicts an AR system 502 configured to provide an experience of AR contents interacting with a physical world 506, according to some embodiments.
- the AR system 502 may include a display 508.
- the display 508 may be worn by the user as part of a headset such that a user may wear the display over their eyes like a pair of goggles or glasses. At least a portion of the display may be transparent such that a user may observe a see-through reality 510.
- the see-through reality 510 may correspond to portions of the physical world 506 that are within a present viewpoint of the AR system 502, which may correspond to the viewpoint of the user in the case that the user is wearing a headset incorporating both the display and sensors of the AR system to acquire information about the physical world.
- AR contents may also be presented on the display 508, overlaid on the see-through reality 510.
- the AR system 502 may include sensors 522 configured to capture information about the physical world 506.
- the sensors 522 may include one or more depth sensors that output depth maps 512.
- Each depth map 512 may have multiple pixels, each of which may represent a distance to a surface in the physical world 506 in a particular direction relative to the depth sensor.
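The relationship between a depth-map pixel and a location in the physical world can be sketched as a back-projection. The example assumes a pinhole camera model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and assumes that the stored value is the distance along the ray through the pixel rather than the distance along the optical axis; a real sensor may use either convention.

```python
import math


def pixel_to_point(u, v, distance, fx, fy, cx, cy):
    """Convert a depth-map pixel to a 3D point in the sensor frame.

    Assumes a pinhole model and that `distance` is measured along the ray
    through pixel (u, v), not along the optical axis.
    """
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)  # direction through the pixel
    norm = math.sqrt(sum(c * c for c in ray))
    return tuple(distance * c / norm for c in ray)
```

With this convention the length of the returned vector always equals the measured distance, and the pixel at the principal point maps to a point straight ahead of the sensor.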
- Raw depth data may come from a depth sensor to create a depth map.
- Such depth maps may be updated as fast as the depth sensor can form a new image, which may be hundreds or thousands of times per second.
- that data may be noisy and incomplete, and have holes shown as black pixels on the illustrated depth map.
- the system may include other sensors, such as image sensors.
- the image sensors may acquire monocular or stereoscopic information that may be processed to represent the physical world in other ways.
- the images may be processed in world reconstruction component 516 to create a mesh, representing connected portions of objects in the physical world. Metadata about such objects, including for example, color and surface texture, may similarly be acquired with the sensors and stored as part of the world reconstruction.
- the system may also acquire information about the headpose (or "pose") of the user with respect to the physical world.
- a head pose tracking component of the system may be used to compute headposes in real time.
- the head pose tracking component may represent a headpose of a user in a coordinate frame with six degrees of freedom including, for example, translation in three perpendicular axes (e.g., forward/backward, up/down, left/right) and rotation about the three perpendicular axes (e.g., pitch, yaw, and roll).
- sensors 522 may include inertial measurement units that may be used to compute and/or determine a headpose 514.
- a headpose 514 for a depth map may indicate a present viewpoint of a sensor capturing the depth map with six degrees of freedom, for example, but the headpose 514 may be used for other purposes, such as to relate image information to a particular portion of the physical world or to relate the position of the display worn on the user’s head to the physical world.
- the headpose information may be derived in other ways than from an IMU, such as from analyzing objects in an image.
- the head pose tracking component may compute relative position and orientation of an AR device to physical objects based on visual information captured by cameras and inertial information captured by IMUs. The head pose tracking component may then compute a headpose of the AR device by, for example, comparing the computed relative position and orientation of the AR device to the physical objects with features of the physical objects.
- that comparison may be made by identifying features in images captured with one or more of the sensors 522 that are stable over time such that changes of the position of these features in images captured over time can be associated with a change in headpose of the user.
- the AR device may construct a map from the feature points recognized in successive images in a series of image frames captured as a user moves throughout the physical world with the AR device. Though each image frame may be taken from a different pose as the user moves, the system may adjust the orientation of the features of each successive image frame to match the orientation of the initial image frame by matching features of the successive image frames to previously captured image frames. Translations of the successive image frames so that points representing the same features will match corresponding feature points from previously collected image frames, can be used to align each successive image frame to match the orientation of previously processed image frames.
- the frames in the resulting map may have a common orientation established when the first image frame was added to the map.
- This map with sets of feature points in a common frame of reference, may be used to determine the user’s pose within the physical world by matching features from current image frames to the map. In some embodiments, this map may be called a tracking map.
- this map may enable other components of the system, such as world reconstruction component 516, to determine the location of physical objects with respect to the user.
- the world reconstruction component 516 may receive the depth maps 512 and headposes 514, and any other data from the sensors, and integrate that data into a reconstruction 518.
- the reconstruction 518 may be more complete and less noisy than the sensor data.
- the world reconstruction component 516 may update the reconstruction 518 using spatial and temporal averaging of the sensor data from multiple viewpoints over time.
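One simple way to picture spatial and temporal averaging is a grid in which every cell keeps a running mean of the observations that fall into it. The sketch below is only an illustration: the `AveragingGrid` class, the cell size, and the meaning of `value` (for example an occupancy or distance estimate) are hypothetical, and an actual reconstruction may weight or fuse observations differently.

```python
from collections import defaultdict


class AveragingGrid:
    """Keeps a running mean of observed values per grid cell."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.mean = defaultdict(float)
        self.count = defaultdict(int)

    def _cell(self, point):
        # Nearby points share a cell, which gives the spatial averaging.
        return tuple(int(c // self.cell_size) for c in point)

    def integrate(self, point, value):
        """Fold one observation at `point` into its cell's running mean."""
        key = self._cell(point)
        self.count[key] += 1
        # Incremental mean update, which gives the temporal averaging.
        self.mean[key] += (value - self.mean[key]) / self.count[key]
        return self.mean[key]
```

Each new observation moves a cell's estimate by a shrinking fraction, so isolated noisy readings have less influence as more viewpoints are integrated.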
- the reconstruction 518 may include representations of the physical world in one or more data formats including, for example, voxels, meshes, planes, etc.
- the different formats may represent alternative representations of the same portions of the physical world or may represent different portions of the physical world.
- on the left side of the reconstruction 518, portions of the physical world are presented as a global surface; on the right side of the reconstruction 518, portions of the physical world are presented as meshes.
- the map maintained by headpose component 514 may be sparse relative to other maps that might be maintained of the physical world. Rather than providing information about locations, and possibly other characteristics, of surfaces, the sparse map may indicate locations of interest points and/or structures, such as corners or edges.
- the map may include image frames as captured by the sensors 522. These frames may be reduced to features, which may represent the interest points and/or structures. In conjunction with each frame, information about a pose of a user from which the frame was acquired may also be stored as part of the map. In some embodiments, every image acquired by the sensor may or may not be stored.
- the system may process images as they are collected by sensors and select subsets of the image frames for further computation.
- the selection may be based on one or more criteria that limits the addition of information yet ensures that the map contains useful information.
- the system may add a new image frame to the map, for example, based on overlap with a prior image frame already added to the map or based on the image frame containing a sufficient number of features determined as likely to represent stationary objects.
- the selected image frames, or groups of features from selected image frames may serve as key frames for the map, which are used to provide spatial information.
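Key frame selection along these lines can be sketched as a filter over incoming frames. The text lists overlap with a prior frame and the number of stationary features as example criteria; requiring both at once, measuring overlap as the fraction of shared feature identifiers, and the two thresholds are assumptions made for this sketch.

```python
def select_key_frames(frames, max_overlap=0.5, min_stationary=3):
    """Pick key frames from `frames`, a list of (frame_id, stationary_feature_ids).

    A frame is kept when it has enough stationary features and shares at most
    `max_overlap` of them with the most recently kept key frame.
    """
    key_frames = []
    last_features = set()
    for frame_id, features in frames:
        features = set(features)
        if len(features) < min_stationary:
            continue  # too little stable information to be useful
        overlap = len(features & last_features) / len(features)
        if not key_frames or overlap <= max_overlap:
            key_frames.append(frame_id)
            last_features = features
    return key_frames
```

Lowering `max_overlap` keeps the map smaller at the cost of coverage, which is the trade-off between limiting added information and keeping the map useful.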
- the AR system 502 may integrate sensor data over time from multiple viewpoints of a physical world.
- because the pose (e.g., position and orientation) of each sensor frame is known, along with how it relates to the other poses, each of these multiple viewpoints of the physical world may be fused together into a single, combined reconstruction of the physical world, which may serve as an abstract layer for the map and provide spatial information.
- the reconstruction may be more complete and less noisy than the original sensor data by using spatial and temporal averaging (i.e. averaging data from multiple viewpoints over time), or any other suitable method.
- a map represents the portion of the physical world in which a user of a single, wearable device is present.
- headpose associated with frames in the map may be represented as a local headpose, indicating orientation relative to an initial orientation for a single device at the start of a session.
- the headpose may be tracked relative to an initial headpose when the device was turned on or otherwise operated to scan an environment to build a representation of that environment.
- the map may include metadata.
- the metadata may indicate time of capture of the sensor information used to form the map.
- Metadata alternatively or additionally may indicate location of the sensors at the time of capture of information used to form the map.
- Location may be expressed directly, such as with information from a GPS chip, or indirectly, such as with a Wi-Fi signature indicating strength of signals received from one or more wireless access points while the sensor data was being collected and/or with the BSSIDs of wireless access points to which the user device connected while the sensor data was collected.
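Indirect location metadata of this kind can be compared between a stored map and a device's current surroundings. The sketch below is a hypothetical use of such metadata, not a method stated above: it assumes each Wi-Fi signature is a mapping from BSSID to received signal strength in dBm and scores two signatures by the Jaccard similarity of the access points they contain, ignoring the signal strengths.

```python
def bssid_similarity(sig_a, sig_b):
    """Jaccard similarity of the access points seen in two Wi-Fi signatures.

    Each signature maps a BSSID string to a received signal strength in dBm.
    """
    a, b = set(sig_a), set(sig_b)
    if not a and not b:
        return 0.0  # nothing observed on either side, so no evidence of a match
    return len(a & b) / len(a | b)
```

A score near 1 suggests the two recordings were made within range of the same access points, which could serve as an inexpensive first filter before any image-based comparison.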
- the reconstruction 518 may be used for AR functions, such as producing a surface representation of the physical world for occlusion processing or physics-based processing. This surface representation may change as the user moves or objects in the physical world change. Aspects of the reconstruction 518 may be used, for example, by a component 520 that produces a changing global surface representation in world coordinates, which may be used by other components.
- the AR content may be generated based on this information, such as by AR applications 504.
- An AR application 504 may be a game program, for example, that performs one or more functions based on information about the physical world, such as visual occlusion, physics-based interactions, and environment reasoning. It may perform these functions by querying data in different formats from the reconstruction 518 produced by the world reconstruction component 516.
- component 520 may be configured to output updates when a representation in a region of interest of the physical world changes. That region of interest, for example, may be set to approximate a portion of the physical world in the vicinity of the user of the system, such as the portion within the view field of the user, or is projected (predicted/determined) to come within the view field of the user.
- the AR applications 504 may use this information to generate and update the AR contents.
- the virtual portion of the AR contents may be presented on the display 508 in combination with the see-through reality 510, creating a realistic user experience.
- an AR experience may be provided to a user through an XR device, which may be a wearable display device, which may be part of a system that may include remote processing and/or remote data storage and/or, in some embodiments, other wearable display devices worn by other users.
- Figure 4 illustrates an example of system 580 (hereinafter referred to as "system 580") including a single wearable device for simplicity of illustration.
- the system 580 includes a head mounted display device 562 (hereinafter referred to as "display device 562"), and various mechanical and electronic modules and systems to support the functioning of the display device 562.
- the display device 562 may be coupled to a frame 564, which is wearable by a display system user or viewer 560 (hereinafter referred to as "user 560") and configured to position the display device 562 in front of the eyes of the user 560.
- the display device 562 may be a sequential display.
- the display device 562 may be monocular or binocular.
- the display device 562 may be an example of the display 508 in Figure 3.
- a speaker 566 is coupled to the frame 564 and positioned proximate an ear canal of the user 560.
- another speaker, not shown, is positioned adjacent another ear canal of the user 560 to provide for stereo/shapeable sound control.
- the display device 562 is operatively coupled, such as by a wired lead or wireless connectivity 568, to a local data processing module 570 which may be mounted in a variety of configurations, such as fixedly attached to the frame 564, fixedly attached to a helmet or hat worn by the user 560, embedded in headphones, or otherwise removably attached to the user 560 (e.g., in a backpack-style configuration, in a belt-coupling style configuration).
- the local data processing module 570 may include a processor, as well as digital memory, such as non-volatile memory (e.g., flash memory), both of which may be utilized to assist in the processing, caching, and storage of data.
- the data include data a) captured from sensors (which may be, e.g., operatively coupled to the frame 564) or otherwise attached to the user 560, such as image capture devices (such as cameras), microphones, inertial measurement units, accelerometers, compasses, GPS units, radio devices, and/or gyros; and/or b) acquired and/or processed using remote processing module 572 and/or remote data repository 574, possibly for passage to the display device 562 after such processing or retrieval.
- the wearable device may communicate with remote components.
- the local data processing module 570 may be operatively coupled by communication links 576, 578, such as via wired or wireless communication links, to the remote processing module 572 and remote data repository 574, respectively, such that these remote modules 572, 574 are operatively coupled to each other and available as resources to the local data processing module 570.
- the head pose tracking component described above may be at least partially implemented in the local data processing module 570.
- the world reconstruction component 516 in Figure 3 may be at least partially implemented in the local data processing module 570.
- the local data processing module 570 may be configured to execute computer executable instructions to generate the map and/or the physical world representations based at least in part on at least a portion of the data.
- processing may be distributed across local and remote processors.
- local processing may be used to construct a map on a user device (e.g. tracking map) based on sensor data collected with sensors on that user’s device.
- a map may be used by applications on that user’s device.
- previously created maps e.g., canonical maps
- a tracking map may be localized to the stored map, such that a correspondence is established between a tracking map, which might be oriented relative to a position of the wearable device at the time a user turned the system on, and the canonical map, which may be oriented relative to one or more persistent features.
- the persistent map might be loaded on the user device to allow the user device to render virtual content without a delay associated with scanning a location to build a tracking map of the user’s full environment from sensor data acquired during the scan.
- the user device may access a remote persistent map (e.g., stored on a cloud) without the need to download the persistent map on the user device.
- the tracking map may be merged with previously stored maps to extend or improve the quality of those maps.
- the processing to determine whether a suitable previously created environment map is available and/or to merge a tracking map with one or more stored environment maps may be done in local data processing module 570 or remote processing module 572.
- the local data processing module 570 may include one or more processors (e.g., a graphics processing unit (GPU)) configured to analyze and process data and/or image information.
- the local data processing module 570 may include a single processor (e.g., a single-core or multi-core ARM processor), which would limit the local data processing module 570’s compute budget but enable a more miniature device.
- the world reconstruction component 516 may use a compute budget less than a single Advanced RISC Machine (ARM) core to generate physical world representations in real-time on a non-predefined space such that the remaining compute budget of the single ARM core can be accessed for other uses such as, for example, extracting meshes.
- the remote data repository 574 may include a digital data storage facility, which may be available through the Internet or other networking
- all data is stored and all computations are performed in the local data processing module 570, allowing fully autonomous use from a remote module.
- all data is stored and all or most computations are performed in the remote data repository 574, allowing for a smaller device.
- a world reconstruction for example, may be stored in whole or in part in this repository 574.
- data may be shared by multiple users of an augmented reality system.
- user devices may upload their tracking maps to augment a database of environment maps.
- the tracking map upload occurs at the end of a user session with a wearable device.
- the tracking map uploads may occur continuously, semi-continuously, intermittently, at a pre-defined time, after a pre-defined period from the previous upload, or when triggered by an event.
- a tracking map uploaded by any user device may be used to expand or improve a previously stored map, whether based on data from that user device or any other user device.
- a persistent map downloaded to a user device may be based on data from that user device or any other user device. In this way, high quality environment maps may be readily available to users to improve their experiences with the AR system.
- the local data processing module 570 is operatively coupled to a battery 582.
- the battery 582 is a removable power source, such as over the counter batteries.
- the battery 582 is a lithium-ion battery.
- the battery 582 includes both an internal lithium-ion battery chargeable by the user 560 during non-operation times of the system 580 and removable batteries such that the user 560 may operate the system 580 for longer periods of time without having to be tethered to a power source to charge the lithium-ion battery or having to shut the system 580 off to replace batteries.
- Figure 5A illustrates a user 530 wearing an AR display system rendering AR content as the user 530 moves through a physical world environment 532 (hereinafter referred to as "environment 532").
- the information captured by the AR system along the movement path of the user may be processed into one or more tracking maps.
- the user 530 positions the AR display system at positions 534, and the AR display system records ambient information of a passable world (e.g., a digital representation of the real objects in the physical world that can be stored and updated with changes to the real objects in the physical world) relative to the positions 534. That information may be stored as poses in combination with images, features, directional audio inputs, or other desired data.
- the positions 534 are aggregated to data inputs 536, for example, as part of a tracking map, and processed at least by a passable world module 538, which may be implemented, for example, by processing on a remote processing module 572 of Figure 4.
- the passable world module 538 may include the head pose component 514 and the world reconstruction component 516, such that the processed information may indicate the location of objects in the physical world in combination with other information about physical objects used in rendering virtual content.
- the passable world module 538 determines, at least in part, where and how AR content 540 can be placed in the physical world as determined from the data inputs 536.
- the AR content is “placed” in the physical world by presenting via the user interface both a representation of the physical world and the AR content, with the AR content rendered as if it were interacting with objects in the physical world and the objects in the physical world presented as if the AR content were, when appropriate, obscuring the user’s view of those objects.
- the AR content may be placed by appropriately selecting portions of a fixed element 542 (e.g., a table) from a reconstruction (e.g., the reconstruction 518) to determine the shape and position of the AR content 540.
- the fixed element may be a table and the virtual content may be positioned such that it appears to be on that table.
- the AR content may be placed within structures in a field of view 544, which may be a present field of view or an estimated future field of view.
- the AR content may be persisted relative to a model 546 of the physical world (e.g. a mesh).
- the fixed element 542 serves as a proxy (e.g. digital copy) for any fixed element within the physical world which may be stored in the passable world module 538 so that the user 530 can perceive content on the fixed element 542 without the system having to map to the fixed element 542 each time the user 530 sees it.
- the fixed element 542 may, therefore, be a mesh model from a previous modeling session or determined from a separate user but nonetheless stored by the passable world module 538 for future reference by a plurality of users. Therefore, the passable world module 538 may recognize the environment 532 from a previously mapped environment and display AR content without a device of the user 530 mapping all or part of the environment 532 first, saving computation process and cycles and avoiding latency of any rendered AR content.
- the mesh model 546 of the physical world may be created by the AR display system and appropriate surfaces and metrics for interacting and displaying the AR content 540 can be stored by the passable world module 538 for future retrieval by the user 530 or other users without the need to completely or partially recreate the model.
- the data inputs 536 are inputs such as geolocation, user identification, and current activity to indicate to the passable world module 538 which fixed element 542 of one or more fixed elements are available, which AR content 540 has last been placed on the fixed element 542, and whether to display that same content (such AR content being "persistent" content regardless of user viewing a particular passable world model).
- the passable world module 538 may update those objects in a model of the physical world from time to time to account for the possibility of changes in the physical world.
- the model of fixed objects may be updated with a very low frequency.
- Other objects in the physical world may be moving or otherwise not regarded as fixed (e.g. kitchen chairs).
- the AR system may update the position of these non-fixed objects with a much higher frequency than is used to update fixed objects.
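The two update rates described above can be sketched as a simple scheduling check. This is a minimal illustration only: the `TrackedObject` type, the function name, and both interval values are hypothetical and do not come from the source, which says only that fixed objects are updated at a very low frequency and non-fixed objects at a much higher one.

```python
from dataclasses import dataclass

# Hypothetical refresh intervals; the source gives no concrete values.
FIXED_INTERVAL_S = 3600.0      # fixed objects (e.g. a table): very low frequency
NON_FIXED_INTERVAL_S = 1.0     # non-fixed objects (e.g. kitchen chairs): much higher


@dataclass
class TrackedObject:
    name: str
    is_fixed: bool
    last_update_s: float = 0.0


def objects_due_for_update(objects, now_s):
    """Return the objects whose model entry is older than its refresh interval."""
    due = []
    for obj in objects:
        interval = FIXED_INTERVAL_S if obj.is_fixed else NON_FIXED_INTERVAL_S
        if now_s - obj.last_update_s >= interval:
            due.append(obj)
    return due


scene = [TrackedObject("table", is_fixed=True), TrackedObject("chair", is_fixed=False)]
due_names = [obj.name for obj in objects_due_for_update(scene, now_s=5.0)]
```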
- an AR system may draw information from multiple sensors, including one or more image sensors.
- Figure 5B is a schematic illustration of a viewing optics assembly 548 and attendant components.
- two eye tracking cameras 550, directed toward user eyes 549, detect metrics of the user eyes 549, such as eye shape, eyelid occlusion, pupil direction and glint on the user eyes 549.
- one of the sensors may be a depth sensor 551, such as a time of flight sensor, emitting signals to the world and detecting reflections of those signals from nearby objects to determine distance to given objects.
- a depth sensor may quickly determine whether objects have entered the field of view of the user, either as a result of motion of those objects or a change of pose of the user.
- information about the position of objects in the field of view of the user may alternatively or additionally be collected with other sensors.
- Depth information, for example, may be obtained from stereoscopic visual image sensors or plenoptic sensors.
- world cameras 552 record a greater-than-peripheral view to map and/or otherwise create a model of the environment 532 and detect inputs that may affect AR content.
- the world camera 552 and/or camera 553 may be grayscale and/or color image sensors, which may output grayscale and/or color image frames at fixed time intervals. Camera 553 may further capture physical world images within a field of view of the user at a specific time. Pixels of a frame-based image sensor may be sampled repetitively even if their values are unchanged.
- Each of the world cameras 552, the camera 553 and the depth sensor 551 has a respective field of view 554, 555, and 556 to collect data from and record a physical world scene, such as the physical world environment 532 depicted in Figure 5A.
- Inertial measurement units 557 may determine movement and orientation of the viewing optics assembly 548.
- each component is operatively coupled to at least one other component.
- the depth sensor 551 is operatively coupled to the eye tracking cameras 550 as a confirmation of measured accommodation against actual distance the user eyes 549 are looking at.
- a viewing optics assembly 548 may include some of the components illustrated in Figure 5B and may include components instead of or in addition to the components illustrated.
- a viewing optics assembly 548 may include two world cameras 552 instead of four. Alternatively or additionally, cameras 552 and 553 need not capture a visible light image of their full field of view.
- a viewing optics assembly 548 may include other types of components.
- a viewing optics assembly 548 may include one or more dynamic vision sensors (DVS), whose pixels may respond asynchronously to relative changes in light intensity exceeding a threshold.
- DVS dynamic vision sensor
- a viewing optics assembly 548 may not include the depth sensor 551 based on time of flight information.
- a viewing optics assembly 548 may include one or more plenoptic cameras, whose pixels may capture light intensity and an angle of the incoming light, from which depth information can be determined.
- a plenoptic camera may include an image sensor overlaid with a transmissive diffraction mask (TDM).
- TDM transmissive diffraction mask
- a plenoptic camera may include an image sensor containing angle-sensitive pixels and/or phase-detection auto-focus pixels (PDAF) and/or micro-lens array (MLA). Such a sensor may serve as a source of depth information instead of or in addition to depth sensor 551.
- PDAF phase-detection auto-focus pixels
- MLA micro-lens array
- a viewing optics assembly 548 may include components with any suitable configuration, which may be set to provide the user with the largest field of view practical for a particular set of components. For example, if a viewing optics assembly 548 has one world camera 552, the world camera may be placed in a center region of the viewing optics assembly instead of at a side.
- Information from the sensors in viewing optics assembly 548 may be coupled to one or more processors in the system.
- the processors may generate data that may be rendered so as to cause the user to perceive virtual content interacting with objects in the physical world. That rendering may be implemented in any suitable way, including generating image data that depicts both physical and virtual objects.
- physical and virtual content may be depicted in one scene by modulating the opacity of a display device that a user looks through at the physical world. The opacity may be controlled so as to create the appearance of the virtual object and also to block the user from seeing objects in the physical world that are occluded by the virtual objects.
- the image data may only include virtual content that may be modified such that the virtual content is perceived by a user as realistically interacting with the physical world (e.g. clip content to account for occlusions), when viewed through the user interface.
- the location on the viewing optics assembly 548 at which content is displayed to create the impression of an object at a particular location may depend on the physics of the viewing optics assembly. Additionally, the pose of the user’s head with respect to the physical world and the direction in which the user’s eyes are looking may impact where in the physical world content displayed at a particular location on the viewing optics assembly will appear. Sensors as described above may collect this information, and/or supply information from which this information may be calculated, such that a processor receiving sensor inputs may compute where objects should be rendered on the viewing optics assembly 548 to create a desired appearance for the user.
- a model of the physical world may be used so that characteristics of the virtual objects, which can be impacted by physical objects, including the shape, position, motion, and visibility of the virtual object, can be correctly computed.
- the model may include the reconstruction of a physical world, for example, the reconstruction 518.
- That model may be created from data collected from sensors on a wearable device of the user. Though, in some embodiments, the model may be created from data collected by multiple users, which may be aggregated in a computing device remote from all of the users (and which may be “in the cloud”).
- the model may be created, at least in part, by a world reconstruction system such as, for example, the world reconstruction component 516 of Figure 3 depicted in more detail in Figure 6A.
- the world reconstruction component 516 may include a perception module 660 that may generate, update, and store representations for a portion of the physical world.
- the perception module 660 may represent the portion of the physical world within a reconstruction range of the sensors as multiple voxels.
- Each voxel may correspond to a 3D cube of a predetermined volume in the physical world, and include surface information, indicating whether there is a surface in the volume represented by the voxel.
- Voxels may be assigned values indicating whether their corresponding volumes have been determined to include surfaces of physical objects, have been determined to be empty, or have not yet been measured with a sensor, so that their value is unknown.
- values for voxels that are determined to be empty or unknown need not be explicitly stored, as the values of voxels may be stored in computer memory in any suitable way, including storing no information for voxels that are determined to be empty or unknown.
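The storage choice described above, where nothing is stored for empty or unknown voxels, can be sketched with a sparse set of surface voxel indices. This is a minimal sketch under stated assumptions: the class name, method names, and the 0.1 m voxel edge are hypothetical and are not taken from the source.

```python
import math


class SparseVoxelGrid:
    """Stores only voxels known to contain a surface.

    Voxels that are empty or not yet measured are simply absent, so they
    consume no memory (illustrative layout, not the source's format).
    """

    def __init__(self, voxel_size_m):
        self.voxel_size_m = voxel_size_m
        self._surface_voxels = set()

    def index_of(self, x, y, z):
        """Map a physical-world point to the integer index of its cubic voxel."""
        s = self.voxel_size_m
        return (math.floor(x / s), math.floor(y / s), math.floor(z / s))

    def mark_surface(self, x, y, z):
        self._surface_voxels.add(self.index_of(x, y, z))

    def has_surface(self, x, y, z):
        """True only for stored voxels; False means empty or unknown."""
        return self.index_of(x, y, z) in self._surface_voxels

    def stored_count(self):
        return len(self._surface_voxels)


grid = SparseVoxelGrid(voxel_size_m=0.1)   # assumed 10 cm voxel edge
grid.mark_surface(0.25, 0.0, -0.05)
```

A design consequence of this layout is that empty and unknown voxels are indistinguishable on lookup; a system that needs to tell them apart would have to store one of the two states explicitly.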
- the perception module 660 may identify and output indications of changes in a region around a user of an AR system. Indications of such changes may trigger updates to volumetric data stored as part of the persisted world, or trigger other functions, such as triggering components 604 that generate AR content to update the AR content.
- the perception module 660 may identify changes based on a signed distance function (SDF) model.
- the perception module 660 may be configured to receive sensor data such as, for example, depth maps 660a and headposes 660b, and then fuse the sensor data into an SDF model 660c.
- Depth maps 660a may provide SDF information directly, and images may be processed to arrive at SDF information.
- the SDF information represents distance from the sensors used to capture that information. As those sensors may be part of a wearable unit, the SDF information may represent the physical world from the perspective of the wearable unit and therefore the perspective of the user.
- the headposes 660b may enable the SDF information to be related to a voxel in the physical world.
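One way to picture how a headpose relates sensor-relative distance information to voxels is a truncated signed distance update along a single depth ray. The sketch below is an assumption-laden illustration, not the method of the source: the running-average fusion rule, the function names, the voxel size, and the truncation distance are all hypothetical choices.

```python
import math


def transform_point(rotation, translation, point):
    """Apply a rigid headpose (3x3 rotation, 3-vector translation) to a point."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def fuse_depth_sample(sdf, weights, headpose, ray_dir, depth_m,
                      voxel_size_m=0.1, truncation_m=0.3):
    """Fuse one depth reading into per-voxel signed distances.

    Positive values lie between the sensor and the measured surface,
    negative values lie behind it. Each voxel keeps a running average.
    """
    rotation, translation = headpose
    steps = round(truncation_m / voxel_size_m)
    for k in range(-steps, steps + 1):
        dist_along_ray = depth_m + k * voxel_size_m
        if dist_along_ray <= 0:
            continue
        # Sample point in the sensor frame, then move it into the world frame.
        p_sensor = tuple(dist_along_ray * d for d in ray_dir)
        p_world = transform_point(rotation, translation, p_sensor)
        voxel = tuple(math.floor(c / voxel_size_m) for c in p_world)
        signed = max(-truncation_m, min(truncation_m, depth_m - dist_along_ray))
        w = weights.get(voxel, 0)
        sdf[voxel] = (sdf.get(voxel, 0.0) * w + signed) / (w + 1)
        weights[voxel] = w + 1


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
headpose = (identity, (1.05, 0.05, 0.0))   # sensor position in the world frame
sdf, weights = {}, {}
fuse_depth_sample(sdf, weights, headpose, ray_dir=(0.0, 0.0, 1.0), depth_m=1.05)
```

In this example the voxel containing the measured surface receives a signed distance of zero, voxels nearer the sensor receive positive values, and voxels beyond the surface receive negative values.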
- the perception module 660 may generate, update, and store representations for the portion of the physical world that is within a perception range.
- the perception range may be determined based, at least in part, on a sensor’s reconstruction range, which may be determined based, at least in part, on the limits of a sensor’s observation range.
- an active depth sensor that operates using active IR pulses may operate reliably over a range of distances, creating the observation range of the sensor, which may be from a few centimeters or tens of centimeters to a few meters.
- the world reconstruction component 516 may include additional modules that may interact with the perception module 660.
- a persisted world module 662 may receive representations for the physical world based on data acquired by the perception module 660.
- the persisted world module 662 also may include various formats of representations of the physical world. For example, volumetric metadata 662b such as voxels may be stored as well as meshes 662c and planes 662d. In some embodiments, other information, such as depth maps, could be saved.
- representations of the physical world may provide relatively dense information about the physical world in comparison to sparse maps, such as a tracking map based on feature points as described above.
- the perception module 660 may include modules that generate representations for the physical world in various formats including, for example, meshes 660d, planes and semantics 660e.
- the representations for the physical world may be stored across local and remote storage mediums.
- the representations for the physical world may be described in different coordinate frames depending on, for example, the location of the storage medium.
- a representation for the physical world stored in the device may be described in a coordinate frame local to the device.
- the representation for the physical world may have a counterpart stored in a cloud.
- the counterpart in the cloud may be described in a coordinate frame shared by all devices in an XR system.
- these modules may generate representations based on data within the perception range of one or more sensors at the time the representation is generated as well as data captured at prior times and information in the persisted world module 662.
- these components may operate on depth information captured with a depth sensor.
- the AR system may include vision sensors and may generate such representations by analyzing monocular or binocular vision information.
- these modules may operate on regions of the physical world. Those modules may be triggered to update a subregion of the physical world, when the perception module 660 detects a change in the physical world in that subregion. Such a change, for example, may be detected by detecting a new surface in the SDF model 660c or other criteria, such as changing the value of a sufficient number of voxels representing the subregion.
- the world reconstruction component 516 may include components 664 that may receive representations of the physical world from the perception module 660. Information about the physical world may be pulled by these components according to, for example, a use request from an application.
- information may be pushed to the use components, such as via an indication of a change in a pre-identified region or a change of the physical world representation within the perception range.
- the components 664 may include, for example, game programs and other components that perform processing for visual occlusion, physics-based interactions, and environment reasoning.
- the perception module 660 may send representations for the physical world in one or more formats. For example, when the component 664 indicates that the use is for visual occlusion or physics-based interactions, the perception module 660 may send a representation of surfaces. When the component 664 indicates that the use is for environmental reasoning, the perception module 660 may send meshes, planes and semantics of the physical world.
- the perception module 660 may include components that format information to provide the component 664.
- An example of such a component may be raycasting component 660f.
- a use component e.g., component 664
- Raycasting component 660f may select from one or more representations of the physical world data within a field of view from that point of view.
- the perception module 660 may process data to create 3D representations of portions of the physical world.
- Data to be processed may be reduced by culling parts of a 3D reconstruction volume based at least in part on a camera frustum and/or depth image, extracting and persisting plane data, capturing, persisting, and updating 3D reconstruction data in blocks that allow local update while maintaining neighbor consistency, providing occlusion data to applications generating such scenes, where the occlusion data is derived from a combination of one or more depth data sources, and/or performing a multi-stage mesh simplification.
- the reconstruction may contain data of different levels of sophistication including, for example, raw data such as live depth data, fused volumetric data such as voxels, and computed data such as meshes.
- components of a passable world model may be distributed, with some portions executing locally on an XR device and some portions executing remotely, such as on a network connected server, or otherwise in the cloud.
- the allocation of the processing and storage of information between the local XR device and the cloud may impact functionality and user experience of an XR system. For example, reducing processing on a local device by allocating processing to the cloud may enable longer battery life and reduce heat generated on the local device. But, allocating too much processing to the cloud may create undesirable latency that causes an unacceptable user experience.
- FIG. 6B depicts a distributed component architecture 600 configured for spatial computing, according to some embodiments.
- the distributed component architecture 600 may include a passable world component 602 (e.g., PW 538 in FIG. 5A), a Lumin OS 604, APIs 606, SDK 608, and Application 610.
- the Lumin OS 604 may include a Linux-based kernel with custom drivers compatible with an XR device.
- the APIs 606 may include application programming interfaces that grant XR applications (e.g., Applications 610) access to the spatial computing features of an XR device.
- the SDK 608 may include a software development kit that allows the creation of XR applications.
- One or more components in the architecture 600 may create and maintain a model of a passable world.
- sensor data is collected on a local device. Processing of that sensor data may be performed in part locally on the XR device and partially in the cloud.
- PW 538 may include environment maps created based, at least in part, on data captured by AR devices worn by multiple users. During sessions of an AR experience, individual AR devices (such as wearable devices described above in connection with Figure 4) may create tracking maps, which are one type of map.
- the device may include components that construct both sparse maps and dense maps.
- a tracking map may serve as a sparse map and may include headposes of the AR device scanning an environment as well as information about objects detected within that environment at each headpose. Those headposes may be maintained locally for each device. For example, the headpose on each device may be relative to an initial headpose when the device was turned on for its session. As a result, each tracking map may be local to the device creating it.
- the dense map may include surface information, which may be represented by a mesh or depth information. Alternatively or additionally, a dense map may include higher level information derived from surface or depth information, such as the location and/or characteristics of planes and/or other objects.
- Creation of the dense maps may be independent of the creation of sparse maps, in some embodiments.
- the creation of dense maps and sparse maps may be performed in separate processing pipelines within an AR system. Separating processing, for example, may enable generation or processing of different types of maps to be performed at different rates. Sparse maps, for example, may be refreshed at a faster rate than dense maps. In some embodiments, however, the processing of dense and sparse maps may be related, even if performed in different pipelines. Changes in the physical world revealed in a sparse map, for example, may trigger updates of a dense map, or vice versa. Further, even if independently created, the maps might be used together. For example, a coordinate system derived from a sparse map may be used to define position and/or orientation of objects in a dense map.
- the sparse map and/or dense map may be persisted for re-use by the same device and/or sharing with other devices. Such persistence may be achieved by storing information in the cloud.
- the AR device may send the tracking map to a cloud to, for example, merge with environment maps selected from persisted maps previously stored in the cloud.
- the selected persisted maps may be sent from the cloud to the AR device for merging.
- the persisted maps may be oriented with respect to one or more persistent coordinate frames.
- Such maps may serve as canonical maps, as they can be used by any of multiple devices.
- a model of a passable world may comprise or be created from one or more canonical maps. Devices, even though they perform some operations based on a coordinate frame local to the device, may nonetheless use the canonical map by determining a transformation between their coordinate frame local to the device and the canonical map.
- a canonical map may originate as a tracking map (TM) (e.g., TM 1102 in Figure 31A), which may be promoted to a canonical map.
- the canonical map may be persisted such that devices that access the canonical map may, once determining a transformation between their local coordinate system and a coordinate system of the canonical map, use the information in the canonical map to determine locations of objects represented in the canonical map in the physical world around the device.
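The use of a transformation between a device's local coordinate frame and the canonical map can be sketched with 4x4 rigid transforms. This is an illustrative sketch only: the function names are hypothetical, and the transform is assumed to consist of a rotation about a gravity-aligned z axis plus a translation, which is a simplification not stated in the source.

```python
import math


def make_transform(yaw_rad, translation):
    """Build a 4x4 rigid transform: rotation about z (assumed up) plus translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def apply_transform(T, point):
    """Apply a 4x4 rigid transform to a 3D point."""
    x, y, z = point
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


def invert_rigid(T):
    """Invert a rigid transform: rotation becomes R^T, translation becomes -R^T t."""
    R_t = [[T[c][r] for c in range(3)] for r in range(3)]
    t = [T[r][3] for r in range(3)]
    new_t = [-sum(R_t[r][c] * t[c] for c in range(3)) for r in range(3)]
    return [R_t[r] + [new_t[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


# Hypothetical result of localizing the device against the canonical map.
canonical_from_local = make_transform(math.pi / 2, (2.0, 0.0, 0.0))
local_from_canonical = invert_rigid(canonical_from_local)

# An object stored in the canonical map, expressed in the device's local frame.
object_in_canonical = (2.0, 1.0, 0.0)
object_in_local = apply_transform(local_from_canonical, object_in_canonical)
```

Once the device holds `local_from_canonical`, every object location in the canonical map can be converted into the local frame with the same call, which is why a single transformation is enough to reuse the shared map.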
- a TM may be a headpose sparse map created by an XR device.
- the canonical map may be created when an XR device sends one or more TMs to a cloud server for merging with additional TMs captured by the XR device at a different time or by other XR devices.
- the canonical maps, or other maps may provide information about the portions of the physical world represented by the data processed to create respective maps.
- Figure 7 depicts an exemplary tracking map 700, according to some embodiments.
- the tracking map 700 may provide a floor plan 706 of physical objects in a corresponding physical world, represented by points 702.
- a map point 702 may represent a feature of a physical object that may include multiple features. For example, each corner of a table may be a feature that is represented by a point on a map.
- the features may be derived from processing images, such as may be acquired with the sensors of a wearable device in an augmented reality system.
- the features may be derived by processing an image frame output by a sensor to identify features based on large gradients in the image or other suitable criteria. Further processing may limit the number of features in each frame. For example, processing may select features that likely represent persistent objects. One or more heuristics may be applied for this selection.
- the tracking map 700 may include data on points 702 collected by a device.
- a pose may be stored.
- the pose may represent the orientation from which the image frame was captured, such that the feature points within each image frame may be spatially correlated.
- the pose may be determined by positioning information, such as may be derived from the sensors, such as an IMU sensor, on the wearable device.
- the pose may be determined from matching image frames to other image frames that depict overlapping portions of the physical world. By finding such positional correlation, which may be accomplished by matching subsets of feature points in two frames, the relative pose between the two frames may be computed.
- a relative pose may be adequate for a tracking map, as the map may be relative to a coordinate system local to a device established based on the initial pose of the device when construction of the tracking map was initiated.
- Not all of the feature points and image frames collected by a device may be retained as part of the tracking map, as much of the information collected with the sensors is likely to be redundant. Rather, only certain frames may be added to the map. Those frames may be selected based on one or more criteria, such as degree of overlap with image frames already in the map, the number of new features they contain, or a quality metric for the features in the frame. Image frames not added to the tracking map may be discarded or may be used to revise the location of features. As a further alternative, all or most of the image frames, represented as a set of features may be retained, but a subset of those frames may be designated as key frames, which are used for further processing.
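The frame selection criteria above can be sketched as a small predicate over sets of feature identifiers. This is a minimal sketch, assuming features can be compared by identifier; the function name and both thresholds are hypothetical, and the source does not specify how overlap or novelty is measured.

```python
def should_add_key_frame(frame_features, map_features,
                         max_overlap=0.8, min_new_features=20):
    """Decide whether an image frame adds enough new information to the map.

    frame_features and map_features are sets of feature identifiers.
    The thresholds are illustrative values, not values from the source.
    """
    if not frame_features:
        return False
    new_features = frame_features - map_features
    overlap = 1.0 - len(new_features) / len(frame_features)
    return overlap <= max_overlap and len(new_features) >= min_new_features


map_features = set(range(100))
novel_frame = set(range(90, 130))      # 30 of its 40 features are new
redundant_frame = set(range(0, 40))    # every feature is already in the map
```

Here `novel_frame` would be added, while `redundant_frame` would be discarded or used only to refine existing feature locations.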
- the key frames may be processed to produce keyrigs 704.
- the key frames may be processed to produce three dimensional sets of feature points and saved as keyrigs 704. Such processing may entail, for example, comparing image frames derived simultaneously from two cameras to stereoscopically determine the 3D position of feature points. Metadata may be associated with these keyframes and/or keyrigs, such as poses.
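Stereoscopic determination of a feature point's 3D position can be sketched with the standard rectified-stereo relation, depth = focal length x baseline / disparity. The sketch assumes two rectified pinhole cameras with identical intrinsics; the function name and all numeric values are hypothetical.

```python
def triangulate_feature(u_left, u_right, v, focal_px, baseline_m, cx, cy):
    """Recover the 3D position of a feature seen in a rectified stereo pair.

    u_left and u_right are the feature's pixel columns in each image, v is
    its shared pixel row, and (cx, cy) is the principal point. The result is
    expressed in the left camera's coordinate frame.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("feature must have positive disparity")
    z = focal_px * baseline_m / disparity
    x = (u_left - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)


# Assumed camera: 500 px focal length, 10 cm baseline, 640x480 image.
point = triangulate_feature(u_left=330.0, u_right=320.0, v=240.0,
                            focal_px=500.0, baseline_m=0.1, cx=320.0, cy=240.0)
```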
- the environment maps may have any of multiple formats depending on, for example, the storage locations of an environment map including, for example, local storage of AR devices and remote storage.
- a map in remote storage may have higher resolution than a map in local storage on a wearable device where memory is limited.
- the map may be down sampled or otherwise converted to an appropriate format, such as by reducing the number of poses per area of the physical world stored in the map and/or the number of feature points stored for each pose.
- a slice or portion of a high resolution map from remote storage may be sent to local storage, where the slice or portion is not down sampled.
- a database of environment maps may be updated as new tracking maps are created.
- updating may include efficiently selecting one or more environment maps stored in the database relevant to the new tracking map. The selected one or more environment maps may be ranked by relevance and one or more of the highest ranking maps may be selected for processing to merge higher ranked selected environment maps with the new tracking map to create one or more updated environment maps.
- when a new tracking map represents a portion of the physical world for which there is no preexisting environment map to update, that tracking map may be stored in the database as a new environment map.
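The select, rank, and merge flow for updating the map database can be sketched as follows. This is only an illustration: maps are modeled as sets of feature identifiers, relevance is assumed to be Jaccard similarity, merging is a set union, and the names, thresholds, and new-map naming scheme are all hypothetical. The source does not define the relevance measure or the merge procedure.

```python
def jaccard(a, b):
    """Similarity of two feature sets, used here as an assumed relevance score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def update_map_database(database, tracking_map, top_k=1, min_score=0.1):
    """Merge a new tracking map into the most relevant stored maps.

    database maps a name to a set of feature identifiers. If no stored map
    is relevant enough, the tracking map is stored as a new environment map.
    Returns the names of the maps that were updated or created.
    """
    scored = [(jaccard(features, tracking_map), name)
              for name, features in database.items()]
    ranked = sorted((pair for pair in scored if pair[0] >= min_score), reverse=True)
    if not ranked:
        new_name = f"map_{len(database)}"
        database[new_name] = set(tracking_map)
        return [new_name]
    updated = []
    for _, name in ranked[:top_k]:
        database[name] |= tracking_map   # stand-in for a real map merge
        updated.append(name)
    return updated


database = {"kitchen": {1, 2, 3, 4}, "office": {10, 11, 12}}
merged_into = update_map_database(database, {3, 4, 5})
created = update_map_database(database, {100, 101})
```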
- Described herein are methods and apparatus for providing virtual contents using an XR system, independent of locations of eyes viewing the virtual content.
- a virtual content is re-rendered upon any motion of the displaying system. For example, if a user wearing a display system views a virtual representation of a three-dimensional (3D) object on the display and walks around the area where the 3D object appears, the 3D object should be re-rendered for each viewpoint such that the user has the perception that he or she is walking around an object that occupies real space.
- the re-rendering consumes significant computational resources of a system and causes artifacts due to latency.
- head pose e.g., the location and orientation of a user wearing an XR system
- dynamic maps of a scene may be generated based on multiple coordinate frames in real space across one or more sessions such that virtual contents interacting with the dynamic maps may be rendered robustly, independent of eye rotations within the head of the user and/or independent of sensor deformations caused by, for example, heat generated during high-speed, computation intensive operation.
- the configuration of multiple coordinate frames may enable a first XR device worn by a first user and a second XR device worn by a second user to recognize a common location in a scene.
- the configuration of multiple coordinate frames may enable users wearing XR devices to view a virtual content in a same location of a scene.
- a tracking map may be built in a world coordinate frame, which may have a world origin.
- the world origin may be the first pose of an XR device when the XR device is powered on.
- the world origin may be aligned to gravity such that a developer of an XR application can get gravity alignment without extra work.
- Different tracking maps may be built in different world coordinate frames because the tracking maps may be captured by a same XR device at different sessions and/or different XR devices worn by different users.
- a session of an XR device may span from powering on to powering off the device.
- an XR device may have a head coordinate frame, which may have a head origin.
- the head origin may be the current pose of an XR device when an image is taken. The difference between head pose of a world coordinate frame and of a head coordinate frame may be used to estimate a tracking route.
- an XR device may have a camera coordinate frame, which may have a camera origin.
- the camera origin may be the current pose of one or more sensors of an XR device.
- the inventors have recognized and appreciated that the configuration of a camera coordinate frame enables robust displaying of virtual contents independent of eye rotation within a head of a user. This configuration also enables robust displaying of virtual contents independent of sensor deformation due to, for example, heat generated during operation.
- an XR device may have a head unit with a head-mountable frame that a user can secure to their head and may include two waveguides, one in front of each eye of the user.
- the waveguides may be transparent so that ambient light from real-world objects can transmit through the waveguides and the user can see the real-world objects.
- Each waveguide may transmit projected light from a projector to a respective eye of the user.
- the projected light may form an image on the retina of the eye.
- the retina of the eye thus receives the ambient light and the projected light.
- the user may simultaneously see real-world objects and one or more virtual objects that are created by the projected light.
- XR devices may have sensors that detect real-world objects around a user. These sensors may, for example, be cameras that capture images that may be processed to identify the locations of real-world objects.
- an XR system may assign a coordinate frame to a virtual content, as opposed to attaching the virtual content in a world coordinate frame.
- a virtual content may be described without regard to where it is rendered for a user, but it may be attached to a more persistent frame position such as a persistent coordinate frame (PCF) described in relation to, for example, Figures 14-20C, to be rendered in a specified location.
- the XR device may detect the changes in the environment map and determine movement of the head unit worn by the user relative to real-world objects.
- Figure 8 illustrates a user experiencing virtual content, as rendered by an XR system 10, in a physical environment, according to some embodiments.
- the XR system may include a first XR device 12.1 that is worn by a first user 14.1, a network 18 and a server 20.
- the user 14.1 is in a physical environment with a real object in the form of a table 16.
- the first XR device 12.1 includes a head unit 22, a belt pack 24 and a cable connection 26.
- the first user 14.1 secures the head unit 22 to their head and the belt pack 24 remotely from the head unit 22 on their waist.
- the cable connection 26 connects the head unit 22 to the belt pack 24.
- the head unit 22 includes technologies that are used to display a virtual object or objects to the first user 14.1 while the first user 14.1 is permitted to see real objects such as the table 16.
- the belt pack 24 includes primarily processing and communications capabilities of the first XR device 12.1.
- the processing and communication capabilities may reside entirely or partially in the head unit 22 such that the belt pack 24 may be removed or may be located in another device such as a backpack.
- the belt pack 24 is connected via a wireless connection to the network 18.
- the server 20 is connected to the network 18 and holds data representative of local content.
- the belt pack 24 downloads the data representing the local content from the server 20 via the network 18.
- the belt pack 24 provides the data via the cable connection 26 to the head unit 22.
- the head unit 22 may include a display that has a light source, for example, a laser light source or a light emitting diode (LED), and a waveguide that guides the light.
- the first user 14.1 may mount the head unit 22 to their head and the belt pack 24 to their waist.
- the belt pack 24 may download image data representing virtual content over the network 18 from the server 20.
- the first user 14.1 may see the table 16 through a display of the head unit 22.
- a projector forming part of the head unit 22 may receive the image data from the belt pack 24 and generate light based on the image data.
- the light may travel through one or more of the waveguides forming part of the display of the head unit 22.
- the light may then leave the waveguide and propagate onto a retina of an eye of the first user 14.1.
- the projector may generate the light in a pattern that is replicated on a retina of the eye of the first user 14.1.
- the light that falls on the retina of the eye of the first user 14.1 may have a selected field of depth so that the first user 14.1 perceives an image at a preselected depth behind the waveguide.
- both eyes of the first user 14.1 may receive slightly different images so that a brain of the first user 14.1 perceives a three-dimensional image or images at selected distances from the head unit 22.
- the first user 14.1 perceives a virtual content 28 above the table 16.
- the proportions of the virtual content 28 and its location and distance from the first user 14.1 are determined by the data representing the virtual content 28 and various coordinate frames that are used to display the virtual content 28 to the first user 14.1.
- the virtual content 28 is not visible from the perspective of the drawing and is visible to the first user 14.1 through using the first XR device 12.1.
- the virtual content 28 may initially reside as data structures within vision data and algorithms in the belt pack 24. The data structures may then manifest themselves as light when the projectors of the head unit 22 generate light based on the data structures. It should be appreciated that although the virtual content 28 has no existence in three-dimensional space in front of the first user 14.1, the virtual content 28 is still represented in Figure 1 in three-dimensional space for illustration of what a wearer of head unit 22 perceives.
- visualization of computer data in three-dimensional space may be used in this description to illustrate how the data structures that facilitate the renderings perceived by one or more users relate to one another within the data structures in the belt pack 24.
- Figure 9 illustrates components of the first XR device 12.1, according to some embodiments.
- the first XR device 12.1 may include the head unit 22, and various components forming part of the vision data and algorithms including, for example, a rendering engine 30, various coordinate systems 32, various origin and destination coordinate frames 34, and various origin to destination coordinate frame transformers 36.
- the various coordinate systems may be based on intrinsics of the XR device or may be determined by reference to other information, such as a persistent pose or a persistent coordinate system, as described herein.
- the head unit 22 may include a head-mountable frame 40, a display system 42, a real object detection camera 44, a movement tracking camera 46, and an inertial measurement unit 48.
- the head-mountable frame 40 may have a shape that is securable to the head of the first user 14.1 in Figure 8.
- the display system 42, real object detection camera 44, movement tracking camera 46, and inertial measurement unit 48 may be mounted to the head-mountable frame 40 and therefore move together with the head-mountable frame 40.
- the coordinate systems 32 may include a local data system 52, a world frame system 54, a head frame system 56, and a camera frame system 58.
- the local data system 52 may include a data channel 62, a local frame determining routine 64 and a local frame storing instruction 66.
- the data channel 62 may be an internal software routine, a hardware component such as an external cable or a radio frequency receiver, or a hybrid component such as a port that is opened up.
- the data channel 62 may be configured to receive image data 68 representing a virtual content.
- the local frame determining routine 64 may be connected to the data channel 62.
- the local frame determining routine 64 may be configured to determine a local coordinate frame 70.
- the local frame determining routine may determine the local coordinate frame based on real world objects or real world locations.
- the local coordinate frame may be based on a top edge relative to a bottom edge of a browser window, head or feet of a character, a node on an outer surface of a prism or bounding box that encloses the virtual content, or any other suitable location to place a coordinate frame that defines a facing direction of a virtual content and a location (e.g. a node, such as a placement node or PCF node) with which to place the virtual content, etc.
- the local frame storing instruction 66 may be connected to the local frame determining routine 64.
- the local frame storing instruction 66 may store the local coordinate frame 70 as a local coordinate frame 72 within the origin and destination coordinate frames 34.
- the origin and destination coordinate frames 34 may be one or more coordinate frames that may be manipulated or transformed in order for a virtual content to persist between sessions.
- a session may be the period of time between a boot-up and shut-down of an XR device. Two sessions may be two start-up and shut-down periods for a single XR device, or may be a start-up and shut-down for two different XR devices.
- the origin and destination coordinate frames 34 may be the coordinate frames involved in one or more transformations required in order for a first user’s XR device and a second user’s XR device to recognize a common location.
- the destination coordinate frame may be the output of a series of computations and transformations applied to the target coordinate frame in order for a first and second user to view a virtual content in the same location.
- the rendering engine 30 may be connected to the data channel 62.
- the rendering engine 30 may receive the image data 68 from the data channel 62 such that the rendering engine 30 may render virtual content based, at least in part, on the image data 68.
- the display system 42 may be connected to the rendering engine 30.
- the display system 42 may include components that transform the image data 68 into visible light.
- the visible light may form two patterns, one for each eye.
- the visible light may enter eyes of the first user 14.1 in Figure 8 and may be detected on retinas of the eyes of the first user 14.1.
- the real object detection camera 44 may include one or more cameras that may capture images from different sides of the head-mountable frame 40.
- the movement tracking camera 46 may include one or more cameras that capture images on sides of the head-mountable frame 40.
- One set of one or more cameras may be used instead of the two sets of one or more cameras representing the real object detection camera(s) 44 and the movement tracking camera(s) 46.
- the cameras 44, 46 may capture images. As described above, these cameras may collect data that is used to construct a tracking map.
- the inertial measurement unit 48 may include a number of devices that are used to detect movement of the head unit 22.
- the inertial measurement unit 48 may include a gravitation sensor, one or more accelerometers and one or more gyroscopes.
- the sensors of the inertial measurement unit 48, in combination, track movement of the head unit 22 in at least three orthogonal directions and about at least three orthogonal axes.
- the world frame system 54 includes a world surface determining routine 78, a world frame determining routine 80, and a world frame storing instruction 82.
- the world surface determining routine 78 is connected to the real object detection camera 44.
- the world surface determining routine 78 receives images and/or key frames based on the images that are captured by the real object detection camera 44 and processes the images to identify surfaces in the images.
- a depth sensor (not shown) may determine distances to the surfaces.
- the surfaces are thus represented by data in three dimensions including their sizes, shapes, and distances from the real object detection camera.
- a world coordinate frame 84 may be based on the origin at the initialization of the head pose session.
- the world coordinate frame may be located where the device was booted up, or could be somewhere new if head pose was lost during the boot session.
- the world coordinate frame may be the origin at the start of a head pose session.
- the world frame determining routine 80 is connected to the world surface determining routine 78 and determines a world coordinate frame 84 based on the locations of the surfaces as determined by the world surface determining routine 78.
- the world frame storing instruction 82 is connected to the world frame determining routine 80 to receive the world coordinate frame 84 from the world frame determining routine 80.
- the world frame storing instruction 82 stores the world coordinate frame 84 as a world coordinate frame 86 within the origin and destination coordinate frames 34.
- the head frame system 56 may include a head frame determining routine 90 and a head frame storing instruction 92.
- the head frame determining routine 90 may be connected to the movement tracking camera 46 and the inertial measurement unit 48.
- the head frame determining routine 90 may use data from the movement tracking camera 46 and the inertial measurement unit 48 to calculate a head coordinate frame 94.
- the inertial measurement unit 48 may have a gravitation sensor that determines the direction of gravitational force relative to the head unit 22.
- the movement tracking camera 46 may continually capture images that are used by the head frame determining routine 90 to refine the head coordinate frame 94.
- the head unit 22 moves when the first user 14.1 in Figure 8 moves their head.
- the movement tracking camera 46 and the inertial measurement unit 48 may continuously provide data to the head frame determining routine 90 so that the head frame determining routine 90 can update the head coordinate frame 94.
- the head frame storing instruction 92 may be connected to the head frame determining routine 90 to receive the head coordinate frame 94 from the head frame determining routine 90.
- the head frame storing instruction 92 may store the head coordinate frame 94 as a head coordinate frame 96 among the origin and destination coordinate frames 34.
- the head frame storing instruction 92 may repeatedly store the updated head coordinate frame 94 as the head coordinate frame 96 when the head frame determining routine 90 recalculates the head coordinate frame 94.
- the head coordinate frame may be the location of the wearable XR device 12.1 relative to the local coordinate frame 72.
- the camera frame system 58 may include camera intrinsics 98.
- the camera intrinsics 98 may include dimensions of the head unit 22 that are features of its design and manufacture.
- the camera intrinsics 98 may be used to calculate a camera coordinate frame 100 that is stored within the origin and destination coordinate frames 34.
- the camera coordinate frame 100 may include all pupil positions of a left eye of the first user 14.1 in Figure 8. When the left eye moves from left to right or up and down, the pupil positions of the left eye are located within the camera coordinate frame 100. In addition, the pupil positions of a right eye are located within a camera coordinate frame 100 for the right eye. In some embodiments, the camera coordinate frame 100 may include the location of the camera relative to the local coordinate frame when an image is taken.
- the origin to destination coordinate frame transformers 36 may include a local-to-world coordinate transformer 104, a world-to-head coordinate transformer 106, and a head-to-camera coordinate transformer 108.
- the local-to-world coordinate transformer 104 may receive the local coordinate frame 72 and transform the local coordinate frame 72 to the world coordinate frame 86.
- the transformation of the local coordinate frame 72 to the world coordinate frame 86 may be represented as a local coordinate frame transformed to world coordinate frame 110 within the world coordinate frame 86.
- the world-to-head coordinate transformer 106 may transform from the world coordinate frame 86 to the head coordinate frame 96.
- the world-to-head coordinate transformer 106 may transform the local coordinate frame transformed to world coordinate frame 110 to the head coordinate frame 96.
- the transformation may be represented as a local coordinate frame transformed to head coordinate frame 112 within the head coordinate frame 96.
- the head-to-camera coordinate transformer 108 may transform from the head coordinate frame 96 to the camera coordinate frame 100.
- the head-to-camera coordinate transformer 108 may transform the local coordinate frame transformed to head coordinate frame 112 to a local coordinate frame transformed to camera coordinate frame 114 within the camera coordinate frame 100.
- the local coordinate frame transformed to camera coordinate frame 114 may be entered into the rendering engine 30.
- the rendering engine 30 may render the image data 68 representing the local content 28 based on the local coordinate frame transformed to camera coordinate frame 114.
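The chain of transformers 104, 106, and 108 described above can be sketched as a composition of rigid transforms. This is a minimal illustration, assuming 4x4 homogeneous matrices and translation-only poses; the helper names (`matmul`, `translation`, `apply`) and all numeric values are hypothetical and are not taken from the source.

```python
# Minimal sketch: chain local -> world -> head -> camera with 4x4 homogeneous
# transforms. Helper names and all pose values are illustrative assumptions.

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(tx, ty, tz):
    """Return a 4x4 transform that only translates (no rotation)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def apply(transform, point):
    """Apply a 4x4 transform to a 3D point and return a 3-tuple."""
    v = [point[0], point[1], point[2], 1.0]
    out = [sum(transform[i][k] * v[k] for k in range(4)) for i in range(4)]
    return tuple(out[:3])


# Hypothetical stand-ins for the outputs of transformers 104, 106, and 108.
local_to_world = translation(2.0, 0.0, 1.0)
world_to_head = translation(-1.0, -1.5, 0.0)
head_to_camera = translation(0.03, 0.0, 0.0)

# Compose right to left: the local-to-world transform is applied first.
local_to_camera = matmul(head_to_camera, matmul(world_to_head, local_to_world))

# Origin of the local coordinate frame expressed in the camera coordinate frame.
content_origin_in_camera = apply(local_to_camera, (0.0, 0.0, 0.0))
```

Composing the three matrices once and reusing the result avoids transforming every vertex of the virtual content three times; a real system would also carry rotations in the upper-left 3x3 block.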
- Figure 10 is a spatial representation of the various origin and destination coordinate frames 34.
- the local coordinate frame 72, world coordinate frame 86, head coordinate frame 96, and camera coordinate frame 100 are represented in the figure.
- the local coordinate frame associated with the XR content 28 may have a position and rotation (e.g. may provide a node and facing direction) relative to a local and/or world coordinate frame and/or PCF when the virtual content is placed in the real world so the virtual content may be viewed by the user.
- Each camera may have its own camera coordinate frame 100 encompassing all pupil positions of one eye.
- Reference numerals 104A, 106A, and 108A represent the transformations that are made by the local-to-world coordinate transformer 104, world-to-head coordinate transformer 106, and head-to-camera coordinate transformer 108 in Figure 9, respectively.
- Figure 11 depicts a camera render protocol for transforming from a head coordinate frame to a camera coordinate frame, according to some embodiments.
- a pupil for a single eye moves from position A to B.
- a virtual object that is meant to appear stationary will project onto a depth plane at one of the two positions A or B depending on the position of the pupil (assuming that the camera is configured to use a pupil-based coordinate frame).
- using a pupil coordinate frame transformed to a head coordinate frame will cause jitter in a stationary virtual object as the eye moves from position A to position B. This situation is referred to as view dependent display or projection.
- a camera coordinate frame e.g., CR
- the head coordinate frame transforms to the CR frame, which is referred to as view independent display or projection.
- An image reprojection may be applied to the virtual content to account for a change in eye position; however, because the rendering is still in the same position, jitter is minimized.
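The difference between view dependent and view independent projection can be illustrated with a simplified two-dimensional pinhole model. This is a sketch under stated assumptions: the projection center sits at z = 0, the depth plane is at a fixed z, and the function name, pupil offsets, and distances are all made up for illustration rather than taken from the source.

```python
# Minimal 2D sketch (x, z only) of view dependent vs. view independent
# projection. All names and numbers are illustrative assumptions.

def project_to_depth_plane(center_x, point_x, point_z, plane_z):
    """Intersect the ray from a projection center at (center_x, 0) through
    the point (point_x, point_z) with the depth plane z = plane_z, and
    return the x coordinate of the intersection."""
    return center_x + (point_x - center_x) * plane_z / point_z


POINT_X, POINT_Z = 0.10, 2.0      # stationary virtual object (metres)
PLANE_Z = 1.0                     # depth plane distance (metres)
PUPIL_A, PUPIL_B = -0.002, 0.002  # two pupil positions of one eye
CR_ORIGIN = 0.0                   # fixed origin of the CR frame

# View dependent: the projection center follows the pupil, so the object
# lands at a different place on the depth plane for positions A and B.
view_dependent = [
    project_to_depth_plane(pupil, POINT_X, POINT_Z, PLANE_Z)
    for pupil in (PUPIL_A, PUPIL_B)
]

# View independent: the projection center stays at the CR origin whatever
# the pupil does, so the projected position does not change.
view_independent = [
    project_to_depth_plane(CR_ORIGIN, POINT_X, POINT_Z, PLANE_Z)
    for _pupil in (PUPIL_A, PUPIL_B)
]
```

In this toy model the view dependent result shifts when the pupil moves, which corresponds to the jitter described above, while the view independent result is identical for both pupil positions.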
- FIG. 13 illustrates the display system 42 in more detail.
- the display system 42 includes a stereoscopic analyzer 144 that is connected to the rendering engine 30 and forms part of the vision data and algorithms.
- the display system 42 further includes left and right projectors 166A and 166B and left and right waveguides 170A and 170B.
- the left and right projectors 166A and 166B are connected to power supplies.
- Each projector 166A and 166B has a respective input for image data to be provided to the respective projector 166A or 166B.
- the respective projector 166A or 166B, when powered, generates light in two-dimensional patterns and emanates the light therefrom.
- the left and right waveguides 170A and 170B are positioned to receive light from the left and right projectors 166A and 166B, respectively.
- the left and right waveguides 170A and 170B are transparent waveguides.
- a user mounts the head mountable frame 40 to their head.
- Components of the head mountable frame 40 may, for example, include a strap (not shown) that wraps around the back of the head of the user.
- the left and right waveguides 170A and 170B are then located in front of left and right eyes 220A and 220B of the user.
- the rendering engine 30 enters the image data that it receives into the stereoscopic analyzer 144.
- the image data is three-dimensional image data of the local content 28 in Figure 8.
- the image data is projected onto a plurality of virtual planes.
- the stereoscopic analyzer 144 analyzes the image data to determine left and right image data sets based on the image data for projection onto each depth plane.
- the left and right image data sets are data sets that represent two-dimensional images that are projected in three-dimensions to give the user a perception of a depth.
- the stereoscopic analyzer 144 enters the left and right image data sets into the left and right projectors 166A and 166B.
- the left and right projectors 166A and 166B then create left and right light patterns.
- the components of the display system 42 are shown in plan view, although it should be understood that the left and right patterns are two-dimensional patterns when shown in front elevation view.
- Each light pattern includes a plurality of pixels. For purposes of illustration, light rays 224A and 226A from two of the pixels are shown leaving the left projector 166A and entering the left waveguide 170A. The light rays 224A and 226A reflect from sides of the left waveguide 170A.
- the light rays 224A and 226A propagate through internal reflection from left to right within the left waveguide 170A, although it should be understood that the light rays 224A and 226A also propagate in a direction into the paper using refractive and reflective systems.
- the light rays 224A and 226A exit the left light waveguide 170A through a pupil 228A and then enter a left eye 220A through a pupil 230A of the left eye 220A.
- the light rays 224A and 226A then fall on a retina 232A of the left eye 220A.
- the left light pattern falls on the retina 232A of the left eye 220A.
- the user is given the perception that the pixels that are formed on the retina 232A are pixels 234A and 236A that the user perceives to be at some distance on a side of the left waveguide 170A opposing the left eye 220A. Depth perception is created by manipulating the focal length of the light.
- the stereoscopic analyzer 144 enters the right image data set into the right projector 166B.
- the right projector 166B transmits the right light pattern, which is represented by pixels in the form of light rays 224B and 226B.
- the light rays 224B and 226B reflect within the right waveguide 170B and exit through a pupil 228B.
- the light rays 224B and 226B then enter through a pupil 230B of the right eye 220B and fall on a retina 232B of the right eye 220B.
- the pixels of the light rays 224B and 226B are perceived as pixels 234B and 236B behind the right waveguide 170B.
- the patterns that are created on the retinas 232A and 232B are individually perceived as left and right images.
- the left and right images differ slightly from one another due to the functioning of the stereoscopic analyzer 144.
- the left and right images are perceived in a mind of the user as a three-dimensional rendering.
- the left and right waveguides 170A and 170B are transparent. Light from a real-life object such as the table 16 on a side of the left and right waveguides 170A and 170B opposing the eyes 220A and 220B can project through the left and right waveguides 170A and 170B and fall on the retinas 232A and 232B.
- Described herein are methods and apparatus for providing spatial persistence across user instances within a shared space. Without spatial persistence, virtual content placed in the physical world by a user in a session may not exist or may be misplaced in the user’s view in a different session. Without spatial persistence, virtual content placed in the physical world by one user may not exist or may be out of place in a second user’s view, even if the second user is intended to be sharing an experience of the same physical space with the first user.
- PCFs persistent coordinate frames
- a PCF may be defined based on one or more points, representing features recognized in the physical world (e.g., corners, edges). The features may be selected such that they are likely to be the same from one user instance to another user instance of an XR system.
- drift during tracking, which causes the computed tracking path (e.g., camera trajectory) to deviate from the actual tracking path, can cause the location of virtual content, when rendered with respect to a local map that is based solely on a tracking map, to appear out of place.
- a tracking map for the space may be refined to correct the drift as an XR device collects more information about the scene over time.
- the virtual content may appear displaced, as if the real object has been moved during the map refinement.
- PCFs may be updated according to map refinement because the PCFs are defined based on the features and are updated as the features move during map refinements.
- a PCF may comprise six degrees of freedom with translations and rotations relative to a map coordinate system.
- a PCF may be stored in a local and/or remote storage medium.
- the translations and rotations of a PCF may be computed relative to a map coordinate system depending on, for example, the storage location.
- a PCF used locally by a device may have translations and rotations relative to a world coordinate frame of the device.
- a PCF in the cloud may have translations and rotations relative to a canonical coordinate frame of a canonical map.
- PCFs may provide a sparse representation of the physical world, providing less than all of the available information about the physical world, such that they may be efficiently processed and transferred.
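A PCF with six degrees of freedom relative to a map coordinate system can be sketched as a small record holding a translation and a rotation. This is only an illustration: the `PCF` class, its field names, the `frame` label, and the use of a unit quaternion for the rotation are assumptions, not the representation used by the source.

```python
# Minimal sketch of a PCF as a 6-DoF pose (translation plus rotation)
# relative to a named map coordinate system. The class, its fields, and the
# quaternion convention (w, x, y, z) are illustrative assumptions.
from dataclasses import dataclass
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


@dataclass(frozen=True)
class PCF:
    pcf_id: str
    frame: str          # e.g. "world" on a device, "canonical" in the cloud
    translation: tuple  # (x, y, z) in the map coordinate system
    rotation: tuple     # unit quaternion (w, x, y, z)

    def to_map(self, point):
        """Express a point given in PCF coordinates in the map frame:
        rotate by the quaternion, then add the translation."""
        w, *u = self.rotation
        t = tuple(2.0 * c for c in cross(u, point))
        ut = cross(u, t)
        return tuple(p + w * tc + uc + tr
                     for p, tc, uc, tr in zip(point, t, ut, self.translation))


# A PCF rotated 90 degrees about the z axis and offset by (1, 2, 0).
half = math.sqrt(0.5)
pcf = PCF("pcf-1", "world", (1.0, 2.0, 0.0), (half, 0.0, 0.0, half))

# Virtual content placed 1 m along the PCF's x axis, in map coordinates.
content_in_map = pcf.to_map((1.0, 0.0, 0.0))
```

Because the content is stored relative to the PCF, updating the PCF's translation or rotation after a map refinement moves the content with it, without touching the content's own coordinates.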
- Techniques for processing persistent spatial information may include creating dynamic maps based on one or more coordinate systems in real space across one or more sessions, and generating persistent coordinate frames (PCFs) over the sparse maps, which may be exposed to XR applications via, for example, an application programming interface (API).
- FIG. 14 is a block diagram illustrating the creation of a persistent coordinate frame (PCF) and the attachment of XR content to the PCF, according to some embodiments.
- Each block may represent digital information stored in a computer memory.
- the data may represent computer-executable instructions.
- the digital information may define a virtual object, as specified by the application 1180, for example.
- the digital information may characterize some aspect of the physical world.
- one or more PCFs are created from images captured with sensors on a wearable device.
- the sensors are visual image cameras. These cameras may be the same cameras used for forming a tracking map. Accordingly, some of the processing suggested by FIG. 14 may be performed as part of updating a tracking map. However, FIG. 14 illustrates that information that provides persistence is generated in addition to the tracking map.
- FIG. 14 illustrates an Image 1 and an Image 2, each derived from one of the cameras. A single image from each camera is illustrated for simplicity. However, each camera may output a stream of image frames and the processing illustrated in FIG. 14 may be performed for multiple image frames in the stream.
- Image 1 and Image 2 may each be one frame in a sequence of image frames. Processing as depicted in FIG. 14 may be repeated on successive image frames in the sequence until an image frame containing feature points that provide a suitable basis for forming persistent spatial information is processed. Alternatively or additionally, the processing of FIG. 14 might be repeated as a user moves such that the user is no longer close enough to a previously identified PCF to reliably use that PCF for determining positions with respect to the physical world. For example, an XR system may maintain a current PCF for a user. When the distance between the user and the current PCF exceeds a threshold, the system may switch to a new current PCF, closer to the user, which may be generated according to the process of FIG. 14, using image frames acquired in the user’s current location.
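The threshold-based switch of the current PCF can be sketched as follows. This is a simplified illustration: it assumes a non-empty dictionary of already known PCF positions and picks the nearest one, whereas the text above notes that a new PCF may instead be generated from fresh image frames. The function name and the data layout are hypothetical.

```python
# Minimal sketch: keep the current PCF while the user stays within a
# threshold distance, otherwise switch to the nearest known PCF.
# The function name and data layout are illustrative assumptions.
import math


def choose_current_pcf(user_pos, current_id, pcf_positions, threshold):
    """Return the id of the PCF to use for the user's present position.

    pcf_positions maps a PCF id to its (x, y, z) position and must not be
    empty; threshold is the maximum usable distance in the same units.
    """
    if (current_id in pcf_positions
            and math.dist(user_pos, pcf_positions[current_id]) <= threshold):
        return current_id
    # Too far from the current PCF (or none set): pick the closest one.
    return min(pcf_positions,
               key=lambda pid: math.dist(user_pos, pcf_positions[pid]))


pcfs = {"pcf-a": (0.0, 0.0, 0.0), "pcf-b": (5.0, 0.0, 0.0)}
near_start = choose_current_pcf((1.0, 0.0, 0.0), "pcf-a", pcfs, threshold=3.0)
after_walk = choose_current_pcf((4.0, 0.0, 0.0), "pcf-a", pcfs, threshold=3.0)
```

Keeping the current PCF until the threshold is crossed, rather than always choosing the nearest one, avoids flipping back and forth between two PCFs when the user stands roughly midway between them.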
- a stream of image frames may be processed to identify image frames depicting content in the physical world that is likely stable and can be readily identified by a device in the vicinity of the region of the physical world depicted in the image frame.
- this processing begins with the identification of features 1120 in the image.
- Features may be identified, for example, by finding locations of gradients in the image above a threshold or other characteristics, which may correspond to a corner of an object, for example.
- the features are points, but other recognizable features, such as edges, may alternatively or additionally be used.
- a fixed number, N, of features 1120 are selected for further processing.
- Those feature points may be selected based on one or more criteria, such as magnitude of the gradient, or proximity to other feature points.
- the feature points may be selected heuristically, such as based on characteristics that suggest the feature points are persistent.
- heuristics may be defined based on the characteristics of feature points that likely correspond to a corner of a window or a door or a large piece of furniture. Such heuristics may take into account the feature point itself and what surrounds it.
- the number of feature points per image may be between 100 and 500 or between 150 and 250, such as 200.
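One way to select a fixed number N of feature points using gradient magnitude and proximity is a greedy pass over the candidates, strongest first. This is a sketch of one plausible reading of those criteria, not the selection method of the source; the function name, the tuple layout, and the minimum-spacing rule are assumptions.

```python
# Minimal sketch: pick up to n feature points, strongest gradient first,
# skipping candidates that lie too close to an already selected point.
# The function name, tuple layout, and spacing rule are assumptions.

def select_features(candidates, n, min_spacing):
    """candidates is a list of (x, y, gradient_magnitude) tuples.
    Returns at most n tuples, ordered by descending gradient magnitude."""
    chosen = []
    for x, y, grad in sorted(candidates, key=lambda c: c[2], reverse=True):
        far_enough = all(
            (x - cx) ** 2 + (y - cy) ** 2 >= min_spacing ** 2
            for cx, cy, _ in chosen
        )
        if far_enough:
            chosen.append((x, y, grad))
        if len(chosen) == n:
            break
    return chosen


candidates = [(0, 0, 9.0), (1, 0, 8.0), (10, 0, 7.0), (20, 0, 1.0)]
selected = select_features(candidates, n=2, min_spacing=5)
```

With `n=200` this would match the example count given above; the spacing rule spreads the selected points across the image instead of letting one highly textured region supply all of them.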
- descriptors 1130 may be computed for the feature points.
- a descriptor is computed for each selected feature point, but a descriptor may be computed for groups of feature points or for a subset of the feature points or for all features within an image.
- the descriptor characterizes a feature point such that feature points representing the same object in the physical world are assigned similar descriptors.
- the descriptors may facilitate alignment of two frames, such as may occur when one map is localized with respect to another. Rather than searching for a relative orientation of the frames that minimizes the distance between feature points of the two images, an initial alignment of the two frames may be made by identifying feature points with similar descriptors. Alignment of the image frames may be based on aligning points with similar descriptors, which may entail less processing than computing an alignment of all the feature points in the images.
- the descriptors may be computed as a mapping of the feature points or, in some embodiments a mapping of a patch of an image around a feature point, to a descriptor.
- the descriptor may be a numeric quantity.
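The use of descriptors to find an initial alignment can be sketched as nearest-neighbour matching between the descriptors of two frames. This is an illustration only: it assumes each descriptor is a short tuple of numbers compared by Euclidean distance, and the function name and distance threshold are hypothetical.

```python
# Minimal sketch: pair feature points of two frames whose descriptors are
# similar. Descriptors are assumed to be equal-length numeric tuples and
# similarity is assumed to be Euclidean distance.
import math


def match_descriptors(desc_a, desc_b, max_distance):
    """Return (i, j) index pairs where desc_b[j] is the nearest descriptor
    to desc_a[i] and lies within max_distance. desc_b must not be empty."""
    matches = []
    for i, da in enumerate(desc_a):
        j, dist = min(
            ((j, math.dist(da, db)) for j, db in enumerate(desc_b)),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            matches.append((i, j))
    return matches


frame_a = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0)]
frame_b = [(1.1, 0.9), (0.1, 0.0)]
pairs = match_descriptors(frame_a, frame_b, max_distance=0.5)
```

Only the matched pairs then need to be fed to the alignment step, which is the saving described above compared with searching over all feature points in both images.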
- a descriptor 1130 is computed for each feature point in each image frame. Based on the descriptors and/or the feature points and/or the image itself, the image frame may be identified as a key frame 1140.
- a key frame is an image frame meeting certain criteria that is then selected for further processing.
- image frames that add meaningful information to the map may be selected as key frames that are integrated into the map.
- image frames that substantially overlap a region for which an image frame has already been integrated into the map may be discarded such that they do not become key frames.
- key frames may be selected based on the number and/or type of feature points in the image frame.
- key frames 1150 selected for inclusion in a tracking map may also be treated as key frames for determining a PCF, but different or additional criteria for selecting key frames for generation of a PCF may be used.
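Key frame selection based on overlap and feature count can be sketched as a filter over the stream of image frames. This is a toy model under stated assumptions: each frame is reduced to a set of hypothetical feature identifiers, overlap is measured as the fraction of a frame's features already present in an existing key frame, and the thresholds are arbitrary.

```python
# Minimal sketch: promote an image frame to a key frame only if it has
# enough features and does not substantially overlap an existing key frame.
# Frames are modelled as sets of feature ids; thresholds are assumptions.

def is_key_frame(frame_features, key_frames, max_overlap=0.7, min_features=3):
    """Return True if the frame adds meaningful new information."""
    if len(frame_features) < min_features:
        return False
    for existing in key_frames:
        overlap = len(frame_features & existing) / len(frame_features)
        if overlap > max_overlap:
            return False
    return True


def build_key_frames(stream, max_overlap=0.7, min_features=3):
    """Walk the frame stream in order and collect the selected key frames."""
    key_frames = []
    for frame_features in stream:
        if is_key_frame(frame_features, key_frames, max_overlap, min_features):
            key_frames.append(frame_features)
    return key_frames


stream = [
    {1, 2, 3, 4, 5},     # first view of a region: kept
    {1, 2, 3, 4, 6},     # 80% overlap with the first frame: discarded
    {7, 8, 9, 10, 11},   # new region: kept
    {7, 8},              # too few features: discarded
]
key_frames = build_key_frames(stream)
```

Stricter or additional criteria for PCF generation could be added as a second predicate applied to the frames that pass this filter.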
- although FIG. 14 shows that a key frame is used for further processing, information acquired from an image may be processed in other forms.
- the feature points such as in a key rig
- although a key frame is described as being derived from a single image frame, it is not necessary that there be a one-to-one relationship between a key frame and an acquired image frame.
- a key frame may be acquired from multiple image frames, such as by stitching together or aggregating the image frames such that only features appearing in multiple images are retained in the key frame.
- a key frame may include image information and/or metadata associated with the image information.
- images captured by the cameras 44, 46 may be computed into one or more key frames (e.g., key frames 1, 2).
- a key frame may include a camera pose.
- a key frame may include one or more camera images captured at the camera pose.
- an XR system may determine that a portion of the camera images captured at the camera pose is not useful and thus not include that portion in a key frame. Therefore, using key frames to align new images with earlier knowledge of a scene reduces the use of computational resources of the XR system.
- a key frame may include an image, and/or image data, at a location with a direction / angle. In some embodiments, a key frame may include a location and a direction from which one or more map points may be observed. In some embodiments, a key frame may include a coordinate frame with an ID.
- Some or all of the key frames 1140 may be selected for further processing, such as the generation of a persistent pose 1150 for the key frame.
- the selection may be based on the characteristics of all, or a subset of, the feature points in the image frame.
- characteristics may be determined from processing the descriptors, features, and/or image frame, itself.
- the selection may be based on a cluster of feature points identified as likely to relate to a persistent object.
- Each key frame is associated with a pose of the camera at which that key frame was acquired.
- that pose information may be saved along with other metadata about the key frame, such as a WiFi fingerprint and/or GPS coordinates at the time of acquisition and/or at the location of acquisition.
- the persistent poses are a source of information that a device may use to orient itself relative to previously acquired information about the physical world. For example, if the key frame from which a persistent pose was created is incorporated into a map of the physical world, a device may orient itself relative to that persistent pose using a sufficient number of feature points in the key frame that are associated with the persistent pose. The device may align a current image that it takes of its surroundings to the persistent pose. This alignment may be based on matching the current image to the image 1110, the features 1120, and/or the descriptors 1130 that gave rise to the persistent pose, or any subset of that image or those features or descriptors. In some embodiments, the current image frame that is matched to the persistent pose may be another key frame that has been incorporated into the device’s tracking map.
- Information about a persistent pose may be stored in a format that facilitates sharing among multiple applications, which may be executing on the same or different devices.
- some or all of the persistent poses may be reflected as persistent coordinate frames (PCFs) 1160.
- a PCF may be associated with a map and may comprise a set of features, or other information, that a device can use to determine its orientation with respect to that PCF.
- the PCF may include a transformation that defines its pose with respect to the origin of its map, such that, by correlating its position to a PCF, the device can determine its position with respect to any objects in the physical world reflected in the map.
- an application, such as applications 1180, may define positions of virtual objects with respect to one or more PCFs, which serve as anchors for the virtual content 1170.
- FIG. 14 illustrates, for example, that App 1 has associated its virtual content 2 with PCF 1,2.
- App 2 has associated its virtual content 3 with PCF 1,2.
- App 1 is also shown associating its virtual content 1 to PCF 4,5
- App 2 is shown associating its virtual content 4 with PCF 3.
- PCF 3 may be based on Image 3 (not shown)
- PCF 4,5 may be based on Image 4 and Image 5 (not shown), analogously to how PCF 1,2 is based on Image 1 and Image 2.
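The associations shown in FIG. 14 can be pictured as a small registry keyed by PCF. The sketch below is only an illustration of that bookkeeping; the `attach` and `content_for` helpers and the string identifiers are hypothetical and are not an API described in the source.

```python
from collections import defaultdict

# PCF identifier -> list of (application, virtual content) anchored to it.
anchors = defaultdict(list)


def attach(app, content, pcf_id):
    """Record that an application anchors a piece of virtual content to a PCF."""
    anchors[pcf_id].append((app, content))


def content_for(pcf_id):
    """Return all virtual content anchored to the given PCF, across applications."""
    return [content for _, content in anchors.get(pcf_id, [])]


# The associations illustrated in FIG. 14.
attach("App 1", "virtual content 2", "PCF 1,2")
attach("App 2", "virtual content 3", "PCF 1,2")
attach("App 1", "virtual content 1", "PCF 4,5")
attach("App 2", "virtual content 4", "PCF 3")
```

Keying by PCF rather than by application reflects the point that several applications may share the same anchor.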
- a device may apply one or more transformations to compute information, such as the location of the virtual content with respect to the display of the device and/or the location of physical objects with respect to the desired location of the virtual content.
- a persistent pose may be a coordinate location and/or direction that has one or more associated key frames. In some embodiments, a persistent pose may be automatically created after the user has traveled a certain distance, e.g., three meters. In some embodiments, the persistent poses may act as reference points during localization. In some embodiments, the persistent poses may be stored in a passable world (e.g., the passable world module 538).
- a new PCF may be determined based on a pre-defined distance allowed between adjacent PCFs.
- one or more persistent poses may be computed into a PCF when a user travels a pre-determined distance, e.g. five meters.
- PCFs may be associated with one or more world coordinate frames and/or canonical coordinate frames, e.g., in the passable world.
- PCFs may be stored in a local and/or remote database depending on, for example, security settings.
- Figure 15 illustrates a method 4700 of establishing and using a persistent coordinate frame, according to some embodiments.
- the method 4700 may start from capturing (Act 4702) images (e.g., Image 1 and Image 2 in FIG. 14) about a scene using one or more sensors of an XR device. Multiple cameras may be used and one camera may generate multiple images, for example, in a stream.
- the method 4700 may include extracting (Act 4704) interest points (e.g., map points 702 in FIG. 7, features 1120 in FIG. 14) from the captured images, generating (Act 4706) descriptors (e.g., descriptors 1130 in FIG. 14) for the extracted interest points, and generating (Act 4708) key frames (e.g., key frames 1140) based on the descriptors.
- the method may compare interest points in the key frames, and form pairs of key frames that share a predetermined amount of interest points.
- the method may reconstruct parts of the physical world using individual pairs of key frames. Mapped parts of the physical world may be saved as 3D features (e.g., keyrig 704 in FIG. 7).
- a selected portion of the pairs of key frames may be used to build 3D features.
- results of the mapping may be selectively saved.
- Key frames not used for building 3D features may be associated with the 3D features through poses, for example, representing distances between key frames with a covariance matrix between poses of keyframes.
- pairs of key frames may be selected to build the 3D features such that distances between each of the built 3D features are within a predetermined distance, which may be determined to balance the amount of computation needed and the level of accuracy of a resulting model.
- a covariance matrix of two images may include covariances between poses of the two images (e.g., six degree of freedom).
- the method 4700 may include generating (Act 4710) persistent poses based on the key frames.
- the method may include generating the persistent poses based on the 3D features reconstructed from pairs of key frames.
- a persistent pose may be attached to a 3D feature.
- the persistent pose may include a pose of a key frame used to construct the 3D feature.
- the persistent pose may include an average pose of key frames used to construct the 3D feature.
- persistent poses may be generated such that distances between neighboring persistent poses are within a predetermined value, for example, in the range of one meter to five meters, any value in between, or any other suitable value.
- the distances between neighboring persistent poses may be represented by a covariance matrix of the neighboring persistent poses.
- the method 4700 may include generating (Act 4712) PCFs based on the persistent poses.
- a PCF may be attached to a 3D feature.
- a PCF may be associated with one or more persistent poses.
- a PCF may include a pose of one of the associated persistent poses.
- a PCF may include an average pose of the poses of the associated persistent poses.
- PCFs may be generated such that distances between neighboring PCFs are within a predetermined value, for example, in the range of three meters to ten meters, any value in between, or any other suitable value.
- the distances between neighboring PCFs may be represented by a covariance matrix of the neighboring PCFs.
- PCFs may be exposed to XR applications via, for example, an application programming interface (API) such that the XR applications can access a model of the physical world through the PCFs without accessing the model itself.
- the method 4700 may include associating (Act 4714) image data of a virtual object to be displayed by the XR device to at least one of the PCFs.
- the method may include computing translations and orientations of the virtual object with respect to the associated PCF. It should be appreciated that it is not necessary to associate a virtual object to a PCF generated by the device placing the virtual object. For example, a device may retrieve saved PCFs in a canonical map in a cloud and associate a virtual object to a retrieved PCF. It should be appreciated that the virtual object may move with the associated PCF as the PCF is adjusted over time.
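One step of method 4700, forming pairs of key frames that share a predetermined amount of interest points, might look like the sketch below. It assumes, purely for illustration, that each key frame is represented by a set of interest-point IDs and that `min_shared` stands in for the predetermined amount; the source does not specify either.

```python
from itertools import combinations


def pair_key_frames(key_frames, min_shared):
    """Return pairs of key frames that share at least min_shared interest points.

    key_frames maps a key frame name to the set of interest-point IDs it contains.
    """
    pairs = []
    for a, b in combinations(sorted(key_frames), 2):
        if len(key_frames[a] & key_frames[b]) >= min_shared:
            pairs.append((a, b))
    return pairs


key_frames = {
    "KF1": {1, 2, 3, 4},
    "KF2": {3, 4, 5},
    "KF3": {9, 10},
}
pairs = pair_key_frames(key_frames, min_shared=2)
```

Each resulting pair would then be a candidate for reconstructing 3D features; comparing every pair is quadratic in the number of key frames, so a practical system would presumably restrict the candidates first.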
- Figure 16 illustrates the first XR device 12.1 and vision data and algorithms of a second XR device 12.2 and the server 20, according to some embodiments.
- the components illustrated in FIG. 16 may operate to perform some or all of the operations associated with generating, updating, and/or using spatial information, such as persistent poses, persistent coordinate frames, tracking maps, or canonical maps, as described herein.
- the first XR device 12.1 may be configured the same as the second XR device 12.2.
- the server 20 may have a map storing routine 118, a canonical map 120, a map transmitter 122, and a map merge algorithm 124.
- the second XR device 12.2, which may be in the same scene as the first XR device 12.1, may include a persistent coordinate frame (PCF) integration unit 1300, an application 1302 that generates the image data 68 that may be used to render a virtual object, and a frame embedding generator 308 (see FIG. 21).
- a map download system 126, PCF identification system 128, Map 2, localization module 130, canonical map incorporator 132, canonical map 133, and map publisher 136 may be grouped into a passable world unit 1304.
- the PCF integration unit 1300 may be connected to the passable world unit 1304 and other components of the second XR device 12.2 to allow for the retrieval, generation, use, upload, and download of PCFs.
- a map comprising PCFs may enable more persistence in a changing world.
- localizing a tracking map including, for example, matching features for images may include selecting features that represent persistent content from the map constituted by PCFs, which enables fast matching and/or localizing. For example, in a world where people move into and out of the scene and objects such as doors move relative to the scene, a map constituted by PCFs requires less storage space and lower transmission rates, and enables the use of individual PCFs and their relationships relative to one another (e.g., an integrated constellation of PCFs) to map a scene.
- the PCF integration unit 1300 may include PCFs 1306 that were previously stored in a data store on a storage unit of the second XR device 12.2, a PCF tracker 1308, a persistent pose acquirer 1310, a PCF checker 1312, a PCF generation system 1314, a coordinate frame calculator 1316, a persistent pose calculator 1318, and three transformers, including a tracking map and persistent pose transformer 1320, a persistent pose and PCF transformer 1322, and a PCF and image data transformer 1324.
- the PCF tracker 1308 may have an on-prompt and an off-prompt that are selectable by the application 1302.
- the application 1302 may be executable by a processor of the second XR device 12.2 to, for example, display a virtual content.
- the application 1302 may have a call that switches the PCF tracker 1308 on via the on-prompt.
- the PCF tracker 1308 may generate PCFs when the PCF tracker 1308 is switched on.
- the application 1302 may have a subsequent call that can switch the PCF tracker 1308 off via the off-prompt.
- the PCF tracker 1308 terminates PCF generation when the PCF tracker 1308 is switched off.
- the server 20 may include a plurality of persistent poses 1332 and a plurality of PCFs 1330 that have previously been saved in association with a canonical map 120.
- the map transmitter 122 may transmit the canonical map 120 together with the persistent poses 1332 and/or the PCFs 1330 to the second XR device 12.2.
- the persistent poses 1332 and PCFs 1330 may be stored in association with the canonical map 133 on the second XR device 12.2.
- Map 2 localizes to the canonical map 133
- the persistent poses 1332 and the PCFs 1330 may be stored in association with Map 2.
- the persistent pose acquirer 1310 may acquire the persistent poses for Map 2.
- the PCF checker 1312 may be connected to the persistent pose acquirer 1310.
- the PCF checker 1312 may retrieve PCFs from the PCFs 1306 based on the persistent poses retrieved by the persistent pose acquirer 1310.
- the PCFs retrieved by the PCF checker 1312 may form an initial group of PCFs that are used for image display based on PCFs.
- the application 1302 may require additional PCFs to be generated. For example, if a user moves to an area that has not previously been mapped, the application 1302 may switch the PCF tracker 1308 on.
- the PCF generation system 1314 may be connected to the PCF tracker 1308 and begin to generate PCFs based on Map 2 as Map 2 begins to expand.
- the PCFs generated by the PCF generation system 1314 may form a second group of PCFs that may be used for PCF-based image display.
- the coordinate frame calculator 1316 may be connected to the PCF checker 1312. After the PCF checker 1312 has retrieved PCFs, the coordinate frame calculator 1316 may invoke the head coordinate frame 96 to determine a head pose of the second XR device 12.2. The coordinate frame calculator 1316 may also invoke the persistent pose calculator 1318.
- the persistent pose calculator 1318 may be directly or indirectly connected to the frame embedding generator 308. In some embodiments, an image/frame may be designated a key frame after a threshold distance from the previous key frame, e.g. 3 meters, is traveled.
- the persistent pose calculator 1318 may generate a persistent pose based on a plurality, for example three, key frames. In some embodiments, the persistent pose may be essentially an average of the coordinate frames of the plurality of key frames.
- the tracking map and persistent pose transformer 1320 may be connected to Map 2 and the persistent pose calculator 1318.
- the tracking map and persistent pose transformer 1320 may transform Map 2 to the persistent pose to determine the persistent pose at an origin relative to Map 2.
- the persistent pose and PCF transformer 1322 may be connected to the tracking map and persistent pose transformer 1320 and further to the PCF checker 1312 and the PCF generation system 1314.
- the persistent pose and PCF transformer 1322 may transform the persistent pose (to which the tracking map has been transformed) to the PCFs from the PCF checker 1312 and the PCF generation system 1314 to determine the PCFs relative to the persistent pose.
- the PCF and image data transformer 1324 may be connected to the persistent pose and PCF transformer 1322 and to the data channel 62.
- the PCF and image data transformer 1324 transforms the PCFs to the image data 68.
- the rendering engine 30 may be connected to the PCF and image data transformer 1324 to display the image data 68 to the user relative to the PCFs.
- the PCF integration unit 1300 may store the additional PCFs that are generated with the PCF generation system 1314 within the PCFs 1306.
- the PCFs 1306 may be stored relative to persistent poses.
- the map publisher 136 may retrieve the PCFs 1306 and the persistent poses associated with the PCFs 1306. When the map publisher 136 transmits Map 2 to the server 20, the map publisher 136 also transmits the PCFs and persistent poses associated with Map 2 to the server 20.
- the map storing routine 118 of the server 20 stores Map 2
- the map storing routine 118 may also store the persistent poses and PCFs generated by the second XR device 12.2.
- the map merge algorithm 124 may create the canonical map 120 with the persistent poses and PCFs of Map 2 associated with the canonical map 120 and stored within the persistent poses 1332 and PCFs 1330, respectively.
- the first XR device 12.1 may include a PCF integration unit similar to the PCF integration unit 1300 of the second XR device 12.2.
- the map transmitter 122 may transmit the persistent poses 1332 and PCFs 1330 associated with the canonical map 120 and originating from the second XR device 12.2.
- the first XR device 12.1 may store the PCFs and the persistent poses within a data store on a storage device of the first XR device 12.1.
- the first XR device 12.1 may then make use of the persistent poses and the PCFs originating from the second XR device 12.2 for image display relative to the PCFs. Additionally or alternatively, the first XR device 12.1 may retrieve, generate, make use, upload, and download PCFs and persistent poses in a manner similar to the second XR device 12.2 as described above.
- the first XR device 12.1 generates a local tracking map (referred to hereinafter as “Map 1”) and the map storing routine 118 receives Map 1 from the first XR device 12.1.
- the map storing routine 118 then stores Map 1 on a storage device of the server 20 as the canonical map 120.
- the second XR device 12.2 includes a map download system 126, an anchor identification system 128, a localization module 130, a canonical map incorporator 132, a local content position system 134, and a map publisher 136.
- the map transmitter 122 sends the canonical map 120 to the second XR device 12.2 and the map download system 126 downloads and stores the canonical map 120 as a canonical map 133 from the server 20.
- the anchor identification system 128 is connected to the world surface determining routine 78.
- the anchor identification system 128 identifies anchors based on objects detected by the world surface determining routine 78.
- the anchor identification system 128 generates a second map (Map 2) using the anchors.
- Map 2 maps the anchors.
- the anchor identification system 128 continues to identify anchors and continues to update Map 2.
- the locations of the anchors are recorded as three-dimensional data based on data provided by the world surface determining routine 78.
- the world surface determining routine 78 receives images from the real object detection camera 44 and depth data from depth sensors 135 to determine the locations of surfaces and their relative distance from the depth sensors 135.
- the localization module 130 is connected to the canonical map 133 and Map 2.
- the localization module 130 repeatedly attempts to localize Map 2 to the canonical map 133.
- the canonical map incorporator 132 is connected to the canonical map 133 and Map 2.
- Map 2 is then updated with missing data that is included in the canonical map.
- the local content position system 134 is connected to Map 2.
- the local content position system 134 may, for example, be a system wherein a user can locate local content in a particular location within a world coordinate frame. The local content then attaches itself to one anchor of Map 2.
- the local-to-world coordinate transformer 104 transforms the local coordinate frame to the world coordinate frame based on the settings of the local content position system 134.
- the functioning of the rendering engine 30, display system 42, and data channel 62 have been described with reference to Figure 2.
- the map publisher 136 uploads Map 2 to the server 20.
- the map storing routine 118 of the server 20 then stores Map 2 within a storage medium of the server 20.
- the map merge algorithm 124 merges Map 2 with the canonical map 120.
- the map merge algorithm 124 merges all the maps into the canonical map 120 to render a new canonical map 120.
- the map transmitter 122 then transmits the new canonical map 120 to any and all devices 12.1 and 12.2 that are in an area represented by the new canonical map 120.
- when the devices 12.1 and 12.2 localize their respective maps to the canonical map 120, the canonical map 120 becomes the promoted map.
- Figure 17 illustrates an example of generating key frames for a map of a scene, according to some embodiments.
- a first key frame, KF1, is generated for a door on a left wall of the room.
- a second key frame, KF2, is generated for an area in a corner where a floor, the left wall, and a right wall of the room meet.
- a third key frame, KF3, is generated for an area of a window on the right wall of the room.
- a fourth key frame, KF4, is generated for an area at a far end of a rug on a floor of the room.
- a fifth key frame, KF5, is generated for an area of the rug closest to the user.
- Figure 18 illustrates an example of generating persistent poses for the map of Figure 17, according to some embodiments.
- a new persistent pose is created when the device measures a threshold distance traveled, and/or when an application requests a new persistent pose (PP).
- the threshold distance may be 3 meters, 5 meters, 20 meters, or any other suitable distance. Selecting a smaller threshold distance (e.g., 1 m) may result in an increase in compute load since a larger number of PPs may be created and managed compared to larger threshold distances. Selecting a larger threshold distance (e.g., 40 m) may result in increased virtual content placement error since a smaller number of PPs would be created, which would result in fewer PCFs being created, which means the virtual content attached to the PCF could be a relatively large distance (e.g., 30 m) away from the PCF, and error increases with increasing distance from a PCF to the virtual content.
- a PP may be created at the start of a new session. This initial PP may be thought of as zero, and can be visualized as the center of a circle that has a radius equal to the threshold distance. When the device reaches the perimeter of the circle, and, in some embodiments, an application requests a new PP, a new PP may be placed at the current location of the device (at the threshold distance). In some embodiments, a new PP will not be created at the threshold distance if the device is able to find an existing PP within the threshold distance from the device’s new position. In some embodiments, when a new PP (e.g., PP 1150) is created, the device attaches one or more of the closest key frames to the PP.
- the location of the PP relative to the key frames may be based on the location of the device at the time a PP is created.
- a PP will not be created when the device travels a threshold distance unless an application requests a PP.
- an application may request a PCF from the device when the application has virtual content to display to the user.
- the PCF request from the application may trigger a PP request, and a new PP would be created after the device travels the threshold distance.
- Figure 18 illustrates a first persistent pose PP1, which may have the closest key frames (e.g., KF1, KF2, and KF3) attached by, for example, computing relative poses between the key frames and the persistent pose.
- Figure 18 also illustrates a second persistent pose PP2 which may have the closest key frames (e.g. KF4 and KF5) attached.
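The distance-based creation rule described for Figure 18 can be sketched as below. This is an illustrative reading under stated assumptions: positions are plain `(x, y, z)` tuples, "within the threshold distance" is taken as strictly less than the threshold, and the `PersistentPoseManager` class and its `max_attached` limit are hypothetical names introduced for the example.

```python
import math


class PersistentPoseManager:
    def __init__(self, threshold_m=3.0, max_attached=3):
        self.threshold_m = threshold_m
        self.max_attached = max_attached
        self.poses = []  # list of (position, attached key frame names)

    def update(self, device_pos, key_frames):
        """Create a PP at device_pos unless an existing PP is within the threshold.

        key_frames maps a key frame name to its (x, y, z) position.
        Returns the new PP, or None when an existing PP is close enough.
        """
        for pos, _ in self.poses:
            if math.dist(pos, device_pos) < self.threshold_m:
                return None  # reuse the existing PP instead of creating one
        # Attach the key frames closest to the device's current location.
        nearest = sorted(key_frames, key=lambda k: math.dist(key_frames[k], device_pos))
        pp = (tuple(device_pos), nearest[: self.max_attached])
        self.poses.append(pp)
        return pp


key_frames = {"KF1": (0, 0, 1), "KF2": (0, 0, 2), "KF4": (4, 0, 0)}
manager = PersistentPoseManager(threshold_m=3.0, max_attached=2)
first = manager.update((0, 0, 0), key_frames)   # session start: initial PP
second = manager.update((1, 0, 0), key_frames)  # still inside the circle
third = manager.update((3, 0, 0), key_frames)   # reached the perimeter
```

The sketch omits the variant in which a PP is created only after an application requests one; that would add a flag checked before the distance test.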
- Figure 19 illustrates an example of generating a PCF for the map of Figure 17, according to some embodiments.
- PCF 1 may include PP1 and PP2.
- the PCF may be used for displaying image data relative to the PCF.
- each PCF may have coordinates in another coordinate frame (e.g., a world coordinate frame) and a PCF descriptor, for example, uniquely identifying the PCF.
- the PCF descriptor may be computed based on feature descriptors of features in frames associated with the PCF.
- various constellations of PCFs may be combined to represent the real world in a persistent manner that requires less data and less transmission of data.
- Figures 20A to 20C are schematic diagrams illustrating an example of establishing and using a persistent coordinate frame.
- Figure 20A shows two users 4802A, 4802B with respective local tracking maps 4804A, 4804B that have not localized to a canonical map.
- the origins 4806A, 4806B for individual users are depicted by the coordinate system (e.g., a world coordinate system) in their respective areas. These origins of each tracking map may be local to each user, as the origins are dependent on the orientation of their respective devices when tracking was initiated.
- the device may capture images that, as described above in connection with FIG. 14, may contain features representing persistent objects such that those images may be classified as key frames, from which a persistent pose may be created.
- the tracking map 4804A includes a persistent pose (PP) 4808A; the tracking map 4804B includes a PP 4808B.
- some of the PPs may be classified as PCFs, which are used to determine the orientation of virtual content for rendering it to the user.
- Figure 20B shows that XR devices worn by respective users 4802A, 4802B may create local PCFs 4810A, 4810B based on the PP 4808A, 4808B.
- Figure 20C shows that persistent content 4812A, 4812B (e.g., a virtual content) may be attached to the PCFs 4810A, 4810B by respective XR devices.
- virtual content may have a virtual content coordinate frame that may be used by an application generating virtual content, regardless of how the virtual content should be displayed.
- the virtual content, for example, may be specified as surfaces, such as triangles of a mesh, at particular locations and angles with respect to the virtual content coordinate frame. To render that virtual content to a user, the locations of those surfaces may be determined with respect to the user that is to perceive the virtual content.
- Attaching virtual content to the PCFs may simplify the computation involved in determining locations of the virtual content with respect to the user.
- the location of the virtual content with respect to a user may be determined by applying a series of transformations. Some of those transformations may change, and may be updated frequently. Others of those transformations may be stable and may be updated infrequently or not at all. Regardless, the transformations may be applied with relatively low computational burden such that the location of the virtual content can be updated with respect to the user frequently, providing a realistic appearance to the rendered virtual content.
- User 1’s device has a coordinate system that can be related to the coordinate system that defines the origin of the map by the transformation rig1_T_w1.
- User 2’s device has a similar transformation rig2_T_w2.
- These transformations may be expressed as 6 degree-of-freedom transformations, specifying translation and rotation to align the device coordinate systems with the map coordinate systems.
- the transformation may be expressed as two separate transformations, one specifying translation and the other specifying rotation. Accordingly, it should be appreciated that the transformations may be expressed in a form that simplifies computation or otherwise provides an advantage.
- Transformations between the origins of the tracking maps and the PCFs identified by the respective user devices are expressed as pcf1_T_w1 and pcf2_T_w2.
- the PCF and the PP are identical, such that the same transformation also characterizes the PPs.
- the virtual content is located with respect to the PCFs, with a transformation of obj1_T_pcf1.
- This transformation may be set by an application generating the virtual content that may receive information from a world reconstruction system describing physical objects with respect to the PCF.
- This transformation may then be related to the user’s device through the further transformation rig1_T_w1.
- the location of the virtual content may change, based on output from an application generating the virtual content.
- the end-to-end transformation from a source coordinate system to a destination coordinate system may be recomputed.
- the location and/or head pose of the user may change as the user moves.
- the transformation rig1_T_w1 may change, as would any end-to-end transformation that depends on the location or head pose of the user.
- the transformation rig1_T_w1 may be updated with motion of the user based on tracking the position of the user with respect to stationary objects in the physical world. Such tracking may be performed by a head pose tracking component processing a sequence of images, as described above, or other component of the system. Such updates may be made by determining the pose of the user with respect to a stationary frame of reference, such as a PP.
- the location and orientation of a user device may be determined relative to the nearest persistent pose, or, in this example, a PCF, as the PP is used as a PCF.
- a system may determine and apply transformations in an order that is
- measurement yielding rig1_T_pcf1 might be avoided by tracking both user pose and defining the location of virtual content relative to the PP or a PCF built on a persistent pose. In this way the transformation from a source coordinate system of the virtual content to the destination coordinate system of the user’s device may be based on the measured
- the end-to-end transformation may relate the virtual object coordinate system to the PCF coordinate system based on a further transformation between the map coordinates and the PCF coordinates.
- a transformation between the two may be applied. Such a transformation may be fixed and may be determined, for example, from a map in which both appear.
- a transform-based approach may be implemented, for example, in a device with components that process sensor data to build a tracking map. As part of that process, those components may identify feature points that may be used as persistent poses, which in turn may be turned into PCFs. Those components may limit the number of persistent poses generated for the map, to provide a suitable spacing between persistent poses, while allowing the user, regardless of location in the physical environment, to be close enough to a persistent pose location to accurately compute the user’s pose, as described above in connection with FIGs. 17-19.
- any of the transformations that are used to compute the location of virtual content relative to the user that depend on the location of the PP (or PCF if being used) may be updated and stored for use, at least until the user moves away from that persistent pose. Nonetheless, by computing and storing transformations, the computational burden each time the location of virtual content is updated may be relatively low, such that it may be performed with relatively low latency.
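The chain of transformations between the virtual content, the PCF, the map origin, and the user’s device can be sketched as a composition of rigid transforms. The example assumes a convention the source does not state: that a_T_b maps coordinates expressed in frame b into frame a, so that the content-to-device transform is rig1_T_w1 composed with the inverses of pcf1_T_w1 and obj1_T_pcf1. The helper functions and the numeric values are hypothetical.

```python
# A rigid transform is a pair (R, t): a 3x3 rotation (list of rows) and a translation.

def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec3(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def compose(x, y):
    """Return the transform that applies y first and then x."""
    (rx, tx), (ry, ty) = x, y
    return matmul3(rx, ry), [p + q for p, q in zip(matvec3(rx, ty), tx)]


def invert(x):
    """Invert a rigid transform: (R, t) -> (R^T, -R^T t)."""
    r, t = x
    rt = [[r[j][i] for j in range(3)] for i in range(3)]
    return rt, [-c for c in matvec3(rt, t)]


def apply(x, point):
    r, t = x
    return [p + q for p, q in zip(matvec3(r, point), t)]


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
YAW_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]

# Assumed example values (not from the source).
pcf1_T_w1 = (IDENTITY, [-2, 0, 0])   # PCF sits at (2, 0, 0) in map coordinates
obj1_T_pcf1 = (IDENTITY, [0, -1, 0])  # content sits at (0, 1, 0) in PCF coordinates
rig1_T_w1 = (YAW_90, [1, 0, 0])      # changes frequently as the user moves

# Stable part: computed once and stored until the PCF or content placement changes.
w1_T_obj1 = compose(invert(pcf1_T_w1), invert(obj1_T_pcf1))

# Frequently updated part: only this composition is redone when the head pose changes.
rig1_T_obj1 = compose(rig1_T_w1, w1_T_obj1)
content_origin_in_rig = apply(rig1_T_obj1, [0, 0, 0])
```

Caching `w1_T_obj1` mirrors the observation that stable transformations can be stored, so each head-pose update costs a single composition.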
- FIGs. 20A-20C illustrate positioning with respect to a tracking map, and each device had its own tracking map.
- transformations may be generated with respect to any map coordinate system.
- Persistence of content across user sessions of an XR system may be achieved by using a persistent map.
- Shared experiences of users may also be facilitated by using a map to which multiple user devices may be oriented.
- the location of virtual content may be specified in relation to coordinates in a canonical map, formatted such that any of multiple devices may use the map.
- Each device might maintain a tracking map and may determine the change of pose of the user with respect to the tracking map.
- a transformation between the tracking map and the canonical map may be determined through a process of “localization”, which may be performed by matching structures in the tracking map (such as one or more persistent poses) to one or more structures of the canonical map (such as one or more PCFs).
- a new image may be captured with sensors worn by the user and an XR system may search, in a set of images that were used to create the tracking map, images that share at least a predetermined amount of interest points with the new image.
- a tracking map might be localized to a canonical map by first finding image frames associated with a persistent pose in the tracking map that is similar to an image frame associated with a PCF in the canonical map.
- a transformation between two canonical maps may be computed by first finding similar image frames in the two maps.
- Deep key frames provide a way to reduce the amount of processing required to identify similar image frames.
- the comparison may be between image features in a new 2D image (e.g., “2D features”) and 3D features in the map.
- Such a comparison may be made in any suitable way, such as by projecting the 3D images into a 2D plane.
- a conventional method such as Bag of Words (BoW) searches the 2D features of a new image in a database including all 2D features in a map, which may require significant computing resources especially when a map represents a large area.
- the conventional method locates the images that share at least one of the 2D features with the new image, which may include images that are not useful for locating meaningful 3D features in the map.
- the conventional method locates 3D features that are not meaningful with respect to the 2D features in the new image.
- the inventors have recognized and appreciated techniques to retrieve images in the map using less memory resource (e.g., a quarter of the memory resource used by BoW), higher efficiency (e.g., 2.5 ms processing time for each key frame, 100 µs for comparing against 500 key frames), and higher accuracy (e.g., 20% better retrieval recall than BoW for a 1024-dimensional model, 5% better retrieval recall than BoW for a 256-dimensional model).
- a descriptor may be computed for an image frame that may be used to compare an image frame to other image frames.
- the descriptors may be stored instead of or in addition to the image frames and feature points.
- the descriptor of the image frame or frames from which each persistent pose or PCF was generated may be stored as part of the persistent pose and/or PCF.
- the descriptor may be computed as a function of feature points in the image frame.
- a neural network is configured to compute a unique frame descriptor to represent an image.
- the image may have a resolution higher than 1 Megabyte such that enough details of a 3D environment within a field-of-view of a device worn by a user is captured in the image.
- the frame descriptor may be much shorter, such as a string of numbers, for example, in the range of 128 Bytes to 512 Bytes or any number in between.
- the neural network is trained such that the computed frame descriptors indicate similarity between images.
- Images in a map may be located by identifying, in a database comprising images used to generate the map, the nearest images that may have frame descriptors within a predetermined distance to a frame descriptor for a new image.
- the distances between images may be represented by a difference between the frame descriptors of the two images.
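The retrieval step described above can be sketched as a thresholded nearest-descriptor search. This is a minimal illustration, assuming Euclidean distance as the "difference" between descriptors and a plain dictionary as the database. The function names, frame ids, and threshold are hypothetical.

```python
import math

def descriptor_distance(a, b):
    """Euclidean (L2) distance between two equal-length frame descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_frames(query, database, max_distance):
    """Return (frame_id, distance) pairs within max_distance, closest first.

    `database` maps a hypothetical frame id to its stored descriptor.
    """
    hits = [(fid, descriptor_distance(query, d)) for fid, d in database.items()]
    return sorted((h for h in hits if h[1] <= max_distance), key=lambda h: h[1])

# Toy 3-number descriptors; real descriptors would be far longer.
database = {
    "kf_a": [0.0, 0.0, 1.0],
    "kf_b": [0.9, 0.1, 0.0],
    "kf_c": [1.0, 0.0, 0.0],
}
matches = nearest_frames([1.0, 0.0, 0.0], database, max_distance=0.5)
```

Because only short descriptors are compared, the search does not need to touch the stored images or their feature points.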
- Figure 21 is a block diagram illustrating a system for generating a descriptor for an individual image, according to some embodiments.
- a frame embedding generator 308 is shown.
- embodiments may be used within the server 20, but may alternatively or additionally execute in whole or in part within one of the XR devices 12.1 and 12.2, or any other device processing images for comparison to other images.
- the frame embedding generator may be configured to generate a reduced data representation of an image from an initial size (e.g., 76,800 bytes) to a final size (e.g., 256 bytes) that is nonetheless indicative of the content in the image despite a reduced size.
- the frame embedding generator may be used to generate a data representation for an image which may be a key frame or a frame used in other ways.
- the frame embedding generator 308 may be configured to convert an image at a particular location and orientation into a unique string of numbers (e.g., 256 bytes).
- an image 320 taken by an XR device may be processed by feature extractor 324 to detect interest points 322 in the image 320.
- Interest points may be or may not be derived from feature points identified as described above for features 1120 (FIG. 14) or as otherwise described herein.
- interest points may be represented by descriptors as described above for descriptors 1130 (FIG. 14), which may be generated using a deep sparse feature method.
- each interest point 322 may be represented by a string of numbers (e.g., 32 bytes). There may, for example, be n features (e.g., 100) and each feature is represented by a string of 32 bytes.
- the frame embedding generator 308 may include a neural network 326.
- the neural network 326 may include a multi-layer perceptron unit 312 and a maximum (max) pool unit 314.
- the multi-layer perceptron (MLP) unit 312 may comprise a multi-layer perceptron, which may be trained.
- the interest points 322 (e.g., descriptors for the interest points) may be reduced by the multi-layer perceptron 312, which may output weighted combinations 310 of the descriptors.
- the MLP may reduce n features to m features, where m is less than n.
- the MLP unit 312 may be configured to perform matrix multiplication.
- the multi-layer perceptron unit 312 receives the plurality of interest points 322 of an image 320 and converts each interest point to a respective string of numbers (e.g., 256).
- a matrix, in this example, may be created having 100 horizontal rows and 256 vertical columns. Each row may have a series of 256 numbers that vary in magnitude, with some being smaller and others being larger.
- the output of the MLP may be an n x 256 matrix, where n represents the number of interest points extracted from the image.
- the output of the MLP may be an m x 256 matrix, where m is the number of interest points reduced from n.
- the MLP 312 may have a training phase, during which model parameters for the MLP are determined, and a use phase.
- the MLP may be trained as illustrated in Figure 25.
- the input training data may comprise data in sets of three, the set of three comprising 1) a query image, 2) a positive sample, and 3) a negative sample.
- the query image may be considered the reference image.
- the positive sample may comprise an image that is similar to the query image.
- similar may be having the same object in both the query and positive sample image but viewed from a different angle.
- similar may be having the same object in both the query and positive sample images but having the object shifted (e.g. left, right, up, down) relative to the other image.
- the negative sample may comprise an image that is dissimilar to the query image.
- a dissimilar image may not contain any objects that are prominent in the query image or may contain only a small portion of a prominent object in the query image (e.g. <10%, 1%).
- a similar image, in contrast, may have most of an object (e.g. >50%, or >75%) in the query image, for example.
- interest points may be extracted from the images in the input training data and may be converted to feature descriptors. These descriptors may be computed both for the training images as shown in FIG. 25 and for extracted features in operation of frame embedding generator 308 of FIG. 21.
- a deep sparse feature (DSF) process may be used to generate the descriptors (e.g., DSF descriptors) as described in US Patent Application 16/190,948.
- DSF descriptors are of dimension n x 32.
- the descriptors may then be passed through the model / MLP to create a 256 byte output.
- the model / MLP may have the same structure as MLP 312 such that once the model parameters are set through training, the resulting trained MLP may be used as MLP 312.
- the feature descriptors may then be sent to a triplet margin loss module (which may only be used during the training phase, not during use phase of the MLP neural network).
- the triplet margin loss module may be configured to select parameters for the model so as to reduce the difference between the 256 byte output from the query image and the 256 byte output from the positive sample, and to increase the difference between the 256 byte output from the query image and the 256 byte output from the negative sample.
- the training phase may comprise feeding a plurality of triplet input images into the learning process to determine model parameters. This training process may continue, for example, until the difference for positive images is minimized and the difference for negative images is maximized, or until other suitable exit criteria are reached.
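The objective of the triplet margin loss module can be written in its common form, max(d(q, p) - d(q, n) + margin, 0). The sketch below assumes that form with Euclidean distance. The margin of 1.0 and the two-number "descriptors" are illustrative assumptions, not values from the source.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(query, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the query
    than the positive is; positive otherwise."""
    return max(l2(query, positive) - l2(query, negative) + margin, 0.0)

q = [0.0, 0.0]
# Positive close (distance 1), negative far (distance 5): nothing to correct.
well_separated = triplet_margin_loss(q, [0.0, 1.0], [3.0, 4.0])
# Positive far (distance 5), negative close (distance 1): large penalty.
poorly_separated = triplet_margin_loss(q, [3.0, 4.0], [0.0, 1.0])
```

Minimizing this quantity over many triplets pulls descriptors of similar images together and pushes descriptors of dissimilar images apart, which is the behavior the training phase aims for.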
- the frame embedding generator 308 may include a pooling layer, here illustrated as maximum (max) pool unit 314.
- the max pool unit 314 may analyze each column to determine a maximum number in the respective column.
- the max pool unit 314 may combine the maximum value of each column of numbers of the output matrix of the MLP 312 into a global feature string 316 of, for example, 256 numbers.
- the global feature string 316 is a relatively small number that takes up relatively little memory and is easily searchable compared to an image (e.g., with a resolution higher than 1 Megabyte). It is thus possible to search for images without analyzing each original frame from the camera and it is also cheaper to store 256 bytes instead of complete frames.
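The data flow through the frame embedding generator (per-point descriptors, a shared MLP, then column-wise max pooling) can be sketched as follows. This is only an illustration of the shapes involved: it uses a single untrained layer with random weights and a ReLU, which are assumptions, and the function names are hypothetical.

```python
import random

def mlp_layer(descriptors, weights, bias):
    """Apply one shared linear layer plus ReLU to every row (interest point)."""
    return [
        [max(0.0, sum(x * w for x, w in zip(row, unit)) + b)
         for unit, b in zip(weights, bias)]
        for row in descriptors
    ]

def max_pool_columns(matrix):
    """Keep the maximum of each column, collapsing n rows into one vector."""
    return [max(column) for column in zip(*matrix)]

rng = random.Random(0)
n_points, in_dim, out_dim = 100, 32, 256  # sizes taken from the examples in the text
descriptors = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(n_points)]
# weights[j] holds the in_dim weights feeding output unit j (random, untrained).
weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
bias = [0.0] * out_dim

per_point = mlp_layer(descriptors, weights, bias)   # 100 x 256 matrix
global_feature = max_pool_columns(per_point)        # 256 numbers
```

One property of the max pool is that the result does not depend on the order in which interest points are listed, so the same set of points yields the same global feature string.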
- Figure 22 is a flow chart illustrating a method 2200 of computing an image descriptor, according to some embodiments.
- the method 2200 may start from receiving (Act 2202) a plurality of images captured by an XR device worn by a user.
- the method 2200 may include determining (Act 2204) one or more key frames from the plurality of images.
- Act 2204 may be skipped and/or may occur after Act 2210 instead.
- the method 2200 may include identifying (Act 2206) one or more interest points in the plurality of images with an artificial neural network, and computing (Act 2208) feature descriptors for individual interest points with the artificial neural network.
- the method may include computing (Act 2210), for each image, a frame descriptor to represent the image based, at least in part, on the computed feature descriptors for the identified interest points in the image with the artificial neural network.
- Figure 23 is a flow chart illustrating a method 2300 of localization using image descriptors, according to some embodiments.
- a new image frame, depicting the current location of the XR device may be compared to image frames stored in connection with points in a map (such as a persistent pose or a PCF as described above).
- the method 2300 may start from receiving (Act 2302) a new image captured by an XR device worn by a user.
- the method 2300 may include identifying (Act 2304) one or more nearest key frames in a database comprising key frames used to generate one or more maps.
- a nearest key frame may be identified based on coarse spatial information and/or previously determined spatial information. For example, coarse spatial information may indicate that the XR device is in a geographic region represented by a 50mx50m area of a map. Image matching may be performed only for points within that area.
- the XR system may know that an XR device was previously proximate a first persistent pose in the map and was moving in a direction of a second persistent pose in the map. That second persistent pose may be considered the nearest persistent pose and the key frame stored with it may be regarded as the nearest key frame.
- other metadata such as GPS data or WiFi fingerprints, may be used to select a nearest key frame or set of nearest key frames.
- frame descriptors may be used to determine whether the new image matches any of the frames selected as being associated with a nearby persistent pose. The determination may be made by comparing a frame descriptor of the new image with frame descriptors of the closest key frames, or a subset of key frames in the database selected in any other suitable way, and selecting key frames with frame descriptors that are within a predetermined distance of the frame descriptor of the new image. In some embodiments, a distance between two frame descriptors may be computed by obtaining the difference between two strings of numbers that may represent the two frame descriptors. In embodiments in which the strings are processed as strings of multiple quantities, the difference may be computed as a vector difference.
- the method 2300 may include performing (Act 2306) feature matching against 3D features in the maps that correspond to the identified nearest key frames, and computing (Act 2308) pose of the device worn by the user based on the feature matching results. In this way, the computationally intensive matching of feature points in two images may be performed for as few as one image that has already been determined to be a likely match for the new image.
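The staged narrowing in method 2300 can be sketched as a shortlist function: a cheap coarse-area test first, a descriptor comparison second, and only the survivors passed on to full feature matching (which is not shown). The record fields `id`, `xy`, and `descriptor`, the rectangular area, and the threshold are hypothetical choices made for this illustration.

```python
def shortlist_key_frames(new_descriptor, key_frames, area, max_distance):
    """Return ids of key frames worth running full feature matching against.

    key_frames: list of dicts with hypothetical fields 'id', 'xy', 'descriptor'.
    area: (x_min, y_min, x_max, y_max) derived from coarse spatial information.
    """
    x_min, y_min, x_max, y_max = area
    shortlist = []
    for kf in key_frames:
        x, y = kf["xy"]
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # outside the coarse area: skip without comparing descriptors
        dist_sq = sum((a - b) ** 2 for a, b in zip(new_descriptor, kf["descriptor"]))
        if dist_sq <= max_distance ** 2:
            shortlist.append((dist_sq, kf["id"]))
    return [kf_id for _, kf_id in sorted(shortlist)]

key_frames = [
    {"id": "near_match", "xy": (10.0, 20.0), "descriptor": [0.9, 0.1]},
    {"id": "near_other", "xy": (30.0, 40.0), "descriptor": [0.0, 1.0]},
    {"id": "far_match", "xy": (500.0, 20.0), "descriptor": [1.0, 0.0]},
]
# A 50 m x 50 m area, as in the example in the text.
candidates = shortlist_key_frames([1.0, 0.0], key_frames, (0.0, 0.0, 50.0, 50.0), 0.5)
```

Here `far_match` is never compared even though its descriptor is identical to the query, because the coarse area test already excludes it.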
- Figure 24 is a flow chart illustrating a method 2400 of training a neural network, according to some embodiments.
- the method 2400 may start from generating (Act 2402) a dataset comprising a plurality of image sets.
- Each of the plurality of image sets may include a query image, a positive sample image, and a negative sample image.
- the plurality of image sets may include synthetic recording pairs configured to, for example, teach the neural network basic information such as shapes.
- the plurality of image sets may include real recording pairs, which may be recorded from a physical world.
- inliers may be computed by fitting a fundamental matrix between two images.
- sparse overlap may be computed as the intersection over union (IoU) of interest points seen in both images.
- a positive sample may include at least twenty interest points, serving as inliers, that are the same as in the query image.
- a negative sample may include less than ten inlier points.
- a negative sample may have less than half of the sparse points overlapping with the sparse points of the query image.
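The selection criteria for training samples can be sketched as below. The thresholds (at least twenty inliers for a positive, fewer than ten inliers and less than half overlap for a negative) come from the examples in the text. Combining the two negative criteria with a logical AND, and the "unused" label for in-between cases, are assumptions made here. Inlier counting via fundamental-matrix fitting is not shown.

```python
def sparse_overlap(points_a, points_b):
    """Intersection over union of the interest-point ids seen in each image."""
    a, b = set(points_a), set(points_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def label_sample(inlier_count, overlap):
    """Label a candidate image relative to a query image."""
    if inlier_count >= 20:
        return "positive"
    if inlier_count < 10 and overlap < 0.5:  # AND is an assumption, see lead-in
        return "negative"
    return "unused"  # ambiguous pairs are left out in this sketch

# Hypothetical point ids: images share ids 20..29 out of 50 distinct points.
overlap = sparse_overlap(range(0, 30), range(20, 50))
```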
- the method 2400 may include computing (Act 2404), for each image set, a loss by comparing the query image with the positive sample image and the negative sample image.
- the method 2400 may include modifying (Act 2406) the artificial neural network based on the computed loss such that a distance between a frame descriptor generated by the artificial neural network for the query image and a frame descriptor for the positive sample image is less than a distance between the frame descriptor for the query image and a frame descriptor for the negative sample image.
- a map may include a plurality of key frames, each of which may have a frame descriptor as described above.
- a max pool unit may analyze the frame descriptors of the map’s key frames and combine the frame descriptors into a unique map descriptor for the map.
- Map merging may enable maps representing overlapping portions of the physical world to be combined to represent a larger area.
- Ranking maps may enable efficiently performing techniques as described herein, including map merging, that involve selecting a map from a set of maps based on similarity.
- a set of canonical maps formatted in a way that they may be accessed by any of a number of XR devices may be maintained by the system.
- These canonical maps may be formed by merging selected tracking maps from those devices with other tracking maps or previously stored canonical maps.
- the canonical maps may be ranked, for example, for use in selecting one or more canonical maps to merge with a new tracking map and/or to select one or more canonical maps from the set to use within a device.
- To provide realistic XR experiences to users, the XR system must know the user’s physical surroundings in order to correctly correlate locations of virtual objects in relation to real objects. Information about a user’s physical surroundings may be obtained from an environment map for the user’s location.
- an XR system could provide an enhanced XR experience to multiple users sharing a same world, comprising real and/or virtual content, by enabling efficient sharing of environment maps of the real / physical world collected by multiple users, whether those users are present in the world at the same or different times.
- Such a system may store multiple maps generated by multiple users and/or the system may store multiple maps generated at different times. For operations that might be performed with a previously generated map, such as localization, for example as described above, substantial processing may be required to identify a relevant environment map of a same world (e.g. same real world location) from all the environment maps collected in an XR system.
- the inventors have recognized and appreciated techniques to quickly and accurately rank the relevance of environment maps out of all possible environment maps, such as the universe of all canonical maps 120 in Figure 28, for example.
- a high ranking map may then be selected for further processing, such as to render virtual objects on a user display realistically interacting with the physical world around the user or merging map data collected by that user with stored maps to create larger or more accurate maps.
- a stored map that is relevant to a task for a user at a location in the physical world may be identified by filtering stored maps based on multiple criteria. Those criteria may indicate comparisons of a tracking map, generated by the wearable device of the user in the location, to candidate environment maps stored in a database. The comparisons may be performed based on metadata associated with the maps, such as a Wi-Fi fingerprint detected by the device generating the map and/or set of BSSID’s to which the device was connected while forming the map. The comparisons may also be performed based on compressed or uncompressed content of the map. Comparisons based on a compressed representation may be performed, for example, by comparison of vectors computed from map content.
- Comparisons based on un-compressed maps may be performed, for example, by localizing the tracking map within the stored map, or vice versa. Multiple comparisons may be performed in an order based on computation time needed to reduce the number of candidate maps for consideration, with comparisons involving less computation being performed earlier in the order than other comparisons requiring more computation.
- FIG. 26 depicts an AR system 800 configured to rank and merge one or more environment maps, according to some embodiments.
- the AR system may include a passable world model 802 of an AR device.
- Information to populate the passable world model 802 may come from sensors on the AR device, which may include computer executable instructions stored in a processor 804 (e.g., a local data processing module 570 in FIG. 4), which may perform some or all of the processing to convert sensor data into a map.
- a map may be a tracking map, as it can be built as sensor data is collected as the AR device operates in a region.
- area attributes may be supplied so as to indicate the area that the tracking map represents.
- area attributes may be a geographic location identifier, such as coordinates presented as latitude and longitude or an ID used by the AR system to represent a location. Alternatively or additionally, the area attributes may be measured characteristics that have a high likelihood of being unique for that area.
- the area attributes may be derived from parameters of wireless networks detected in the area.
- the area attribute may be associated with a unique address of an access-point the AR system is nearby and/or connected to. For example, the area attribute may be associated with a MAC address or basic service set identifiers (BSSIDs) of a 5G base station / router, a Wi-Fi router, and the like.
- the tracking maps may be merged with other maps of the environment.
- a map rank portion 806 receives tracking maps from the device PW 802 and communicates with a map database 808 to select and rank environment maps from the map database 808. Higher ranked, selected maps are sent to a map merge portion 810.
- the map merge portion 810 may perform merge processing on the maps sent from the map rank portion 806. Merge processing may entail merging the tracking map with some or all of the ranked maps and transmitting the new, merged maps to a passable world model 812.
- the map merge portion may merge maps by identifying maps that depict overlapping portions of the physical world. Those overlapping portions may be aligned such that information in both maps may be aggregated into a final map. Canonical maps may be merged with other canonical maps and/or tracking maps.
- the aggregation may entail extending one map with information from another map.
- aggregation may entail adjusting the representation of the physical world in one map, based on information in another map.
- a later map, for example, may reveal that objects giving rise to feature points have moved, such that the map may be updated based on later information.
- two maps may characterize the same region with different feature points and aggregating may entail selecting a set of feature points from the two maps to better represent that region.
- PCF’s from all maps that are merged may be retained, such that applications positioning content with respect to them may continue to do so.
- merging of maps may result in redundant persistent poses, and some of the persistent poses may be deleted.
- merging maps may entail modifying the PCF to be associated with a persistent pose remaining in the map after merging.
- maps may be refined.
- refinement may entail computation to reduce internal inconsistency between feature points that likely represent the same object in the physical world.
- Inconsistency may result from inaccuracies in the poses associated with key frames supplying feature points that represent the same objects in the physical world.
- Such inconsistency may result, for example, from an XR device computing poses relative to a tracking map, which in turn is built based on estimating poses, such that errors in pose estimation accumulate, creating a “drift” in pose accuracy over time.
- the map may be refined.
- the location of a persistent point relative to the origin of a map may change. Accordingly, the transformation associated with that persistent point, such as a persistent pose or a PCF, may change.
- the XR system in connection with map refinement (whether as part of a merge operation or performed for other reasons) may re-compute transformations associated with any persistent points that have changed. These transformations might be pushed from a component computing the transformations to a component using the transformation such that any uses of the transformations may be based on the updated location of the persistent points.
- Passable world model 812 may be a cloud model, which may be shared by multiple AR devices. Passable world model 812 may store or otherwise have access to the
- the prior version of that map may be deleted so as to remove out of date maps from the database.
- the prior version of that map may be archived enabling retrieving/viewing prior versions of an environment.
- permissions may be set such that only AR systems having certain read/write access may trigger prior versions of maps being deleted/archived.
- map rank portion 806 also may be used in supplying environment maps to an AR device.
- the AR device may send a message requesting an environment map for its current location, and map rank portion 806 may be used to select and rank environment maps relevant to the requesting device.
- the AR system 800 may include a downsample portion 814 configured to receive the merged maps from the cloud PW 812.
- the received merged maps from the cloud PW 812 may be in a storage format for the cloud, which may include high resolution information, such as a large number of PCFs per square meter or multiple image frames or a large set of feature points associated with a PCF.
- the downsample portion 814 may be configured to downsample the cloud format maps to a format suitable for storage on AR devices.
- the device format maps may have less data, such as fewer PCF’s or less data stored for each PCF to accommodate the limited local computing power and storage space of AR devices.
- FIG. 27 is a simplified block diagram illustrating a plurality of canonical maps 120 that may be stored in a remote storage medium, for example, a cloud.
- Each canonical map 120 may include a plurality of canonical map identifiers indicating the canonical map’s location within a physical space, such as somewhere on the planet earth.
- These canonical map identifiers may include one or more of the following identifiers: area identifiers represented by a range of longitudes and latitudes, frame descriptors (e.g., global feature string 316 in FIG. 21), Wi-Fi fingerprints, feature descriptors (e.g., feature descriptors 310 in FIG. 21), and device identities indicating one or more devices that contributed to the map.
- the canonical maps 120 are disposed geographically in a two-dimensional pattern as they may exist on a surface of the earth.
- the canonical maps 120 may be uniquely identifiable by corresponding longitudes and latitudes because any canonical maps that have overlapping longitudes and latitudes may be merged into a new canonical map.
- Figure 28 is a schematic diagram illustrating a method of selecting canonical maps, which may be used for localizing a new tracking map to one or more canonical maps, according to some embodiments.
- the method may start from accessing (Act 120) a universe of canonical maps 120, which may be stored, as an example, in a database in a passable world (e.g., the passable world module 538).
- the universe of canonical maps may include canonical maps from all previously visited locations.
- An XR system may filter the universe of all canonical maps to a small subset or just a single map. It should be appreciated that, in some embodiments, it may not be possible to send all the canonical maps to a viewing device due to bandwidth restrictions. Selecting a subset selected as being likely candidates for matching the tracking map to send to the device may reduce bandwidth and latency associated with accessing a remote database of maps.
- the method may include filtering (Act 300) the universe of canonical maps based on areas with predetermined size and shapes.
- each square may represent an area.
- Each square may cover 50 m x 50 m.
- Each square may have six neighboring areas.
- Act 300 may select at least one matching canonical map 120 covering longitude and latitude that include the longitude and latitude of the position identifier received from an XR device, as long as at least one map exists at that longitude and latitude.
- the Act 300 may select at least one neighboring canonical map covering longitude and latitude that are adjacent the matching canonical map.
- the Act 300 may select a plurality of matching canonical maps and a plurality of neighboring canonical maps.
- the Act 300 may, for example, reduce the number of canonical maps approximately ten times, for example, from thousands to hundreds to form a first filtered selection.
- criteria other than latitude and longitude may be used to identify neighboring maps.
- An XR device, for example, may have previously localized with a canonical map in the set as part of the same session.
- a cloud service may retain information about the XR device, including maps previously localized to.
- the maps selected at Act 300 may include those that cover an area adjacent to the map to which the XR device localized.
- the method may include filtering (Act 302) the first filtered selection of canonical maps based on Wi-Fi fingerprints.
- the Act 302 may determine a latitude and longitude based on a Wi-Fi fingerprint received as part of the position identifier from an XR device.
- the Act 302 may compare the latitude and longitude from the Wi-Fi fingerprint with latitude and longitude of the canonical maps 120 to determine one or more canonical maps that form a second filtered selection.
- the Act 302 may reduce the number of canonical maps
- a first filtered selection may include 130 canonical maps and the second filtered selection may include 50 of the 130 canonical maps and may not include the other 80 of the 130 canonical maps.
- the method may include filtering (Act 304) the second filtered selection of canonical maps based on key frames.
- the Act 304 may compare data representing an image captured by an XR device with data representing the canonical maps 120.
- the data representing the image and/or maps may include feature descriptors (e.g., DSF descriptors in FIG. 25) and/or global feature strings (e.g., 316 in FIG. 21).
- the Act 304 may provide a third filtered selection of canonical maps.
- the output of Act 304 may only be five of the 50 canonical maps identified following the second filtered selection, for example.
- the map transmitter 122 then transmits the one or more canonical maps based on the third filtered selection to the viewing device.
- the Act 304 may reduce the number of canonical maps by approximately ten times, for example, from tens to single digits of canonical maps (e.g., 5) that form a third selection.
- an XR device may receive canonical maps in the third filtered selection, and attempt to localize into the received canonical maps.
- the Act 304 may filter the canonical maps 120 based on the global feature strings 316 of the canonical maps 120 and the global feature string 316 that is based on an image that is captured by the viewing device (e.g. an image that may be part of the local tracking map for a user).
- Each one of the canonical maps 120 in Figure 27 thus has one or more global feature strings 316 associated therewith.
- the global feature strings 316 may be acquired when an XR device submits images or feature details to the cloud and the cloud processes the image or feature details to generate global feature strings 316 for the canonical maps 120.
- the cloud may receive feature details of a live/new/current image captured by a viewing device, and the cloud may generate a global feature string 316 for the live image. The cloud may then filter the canonical maps 120 based on the live global feature string 316.
- the global feature string may be generated on the local viewing device.
- the global feature string may be generated remotely, for example on the cloud.
- a cloud may transmit the filtered canonical maps to an XR device together with the global feature strings 316 associated with the filtered canonical maps.
- when the viewing device localizes its tracking map to the canonical map, it may do so by matching the global feature strings 316 of the local tracking map with the global feature strings of the canonical map.
- an operation of an XR device may not perform all of the Acts (300, 302, 304). For example, if a universe of canonical maps is relatively small (e.g., 500 maps), an XR device attempting to localize may filter the universe of canonical maps based on Wi-Fi fingerprints (e.g., Act 302) and key frames (e.g., Act 304), but omit filtering based on areas (e.g., Act 300). Moreover, it is not necessary that maps in their entireties be compared. In some embodiments, for example, a comparison of two maps may result in identifying common persistent points, such as persistent poses or PCFs that appear in both the new map and the selected map from the universe of maps. In that case, descriptors may be associated with persistent points, and those descriptors may be compared.
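The successive filtering, including the option to omit a stage, can be sketched as a list of predicates applied in order. Each predicate below is a deliberately simplified stand-in written in the spirit of Acts 300, 302, and 304. In particular, the Wi-Fi stage tests for a shared BSSID rather than deriving a latitude and longitude, and all field names and values are hypothetical.

```python
def filter_maps(maps, query, stages):
    """Apply each enabled stage in order, narrowing the candidate list."""
    candidates = list(maps)
    for stage in stages:
        candidates = [m for m in candidates if stage(m, query)]
    return candidates

def by_area(m, q):  # in the spirit of Act 300
    lat_min, lat_max, lon_min, lon_max = m["bounds"]
    return lat_min <= q["lat"] <= lat_max and lon_min <= q["lon"] <= lon_max

def by_wifi(m, q):  # in the spirit of Act 302 (simplified to BSSID overlap)
    return bool(set(m["bssids"]) & set(q["bssids"]))

def by_descriptor(m, q):  # in the spirit of Act 304
    limit_sq = q["max_distance"] ** 2
    return any(sum((a - b) ** 2 for a, b in zip(d, q["descriptor"])) <= limit_sq
               for d in m["descriptors"])

maps = [
    {"id": "cm1", "bounds": (0, 1, 0, 1), "bssids": ["aa"], "descriptors": [[1.0, 0.0]]},
    {"id": "cm2", "bounds": (0, 1, 0, 1), "bssids": ["aa"], "descriptors": [[0.0, 1.0]]},
    {"id": "cm3", "bounds": (0, 1, 0, 1), "bssids": ["bb"], "descriptors": [[1.0, 0.0]]},
    {"id": "cm4", "bounds": (5, 6, 5, 6), "bssids": ["aa"], "descriptors": [[1.0, 0.0]]},
]
query = {"lat": 0.5, "lon": 0.5, "bssids": ["aa", "cc"],
         "descriptor": [0.9, 0.1], "max_distance": 0.5}

full = [m["id"] for m in filter_maps(maps, query, [by_area, by_wifi, by_descriptor])]
# For a small universe, the area stage may be omitted.
no_area = [m["id"] for m in filter_maps(maps, query, [by_wifi, by_descriptor])]
```

Placing the cheapest test first means the more expensive descriptor comparison runs on fewer candidates.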
- FIG. 29 is a flow chart illustrating a method 900 of selecting one or more ranked environment maps, according to some embodiments.
- the ranking is performed for a user’s AR device that is creating a tracking map.
- the tracking map is available for use in ranking environment maps.
- some or all portions of the selection and ranking of environment maps that do not expressly rely on the tracking map may be used.
- the method 900 may start at Act 902, where a set of maps from a database of environment maps (which may be formatted as canonical maps) that are in the neighborhood of the location where the tracking map was formed may be accessed and then filtered for ranking. Additionally, at Act 902, at least one area attribute for the area in which the user’s AR device is operating is determined. In scenarios in which the user’s AR device is constructing a tracking map, the area attributes may correspond to the area over which the tracking map was created. As a specific example, the area attributes may be computed based on received signals from access points to computer networks while the AR device was computing the tracking map.
- Figure 30 depicts an exemplary map rank portion 806 of the AR system 800, according to some embodiments.
- the map rank portion 806 may be executing in a cloud computing environment, as it may include portions executing on AR devices and portions executing on a remote computing system such as a cloud.
- the map rank portion 806 may be configured to perform at least a portion of the method 900.
- FIG. 31A depicts an example of area attributes AA1-AA8 of a tracking map (TM) 1102 and environment maps CM1-CM4 in a database, according to some embodiments.
- an environment map may be associated with multiple area attributes.
- the area attributes AA1-AA8 may include parameters of wireless networks detected by the AR device computing the tracking map 1102, for example, basic service set identifiers (BSSIDs) of networks to which the AR device is connected and/or the strength of the received signals of the access points to the wireless networks through, for example, a network tower 1104.
- the parameters of the wireless networks may comply with protocols including Wi-Fi and 5G NR.
- the area attributes are a fingerprint of the area in which the user AR device collected sensor data to form the tracking map.
- Figure 31B depicts an example of the determined geographic location 1106 of the tracking map 1102, according to some embodiments.
- the determined geographic location 1106 includes a centroid point 1110 and an area 1108 circling around the centroid point. It should be appreciated that the determination of a geographic location of the present application is not limited to the illustrated format.
- a determined geographic location may have any suitable formats including, for example, different area shapes.
- the geographic location is determined from area attributes using a database relating area attributes to geographic locations. Databases are commercially available, for example, databases that relate Wi-Fi fingerprints to locations expressed as latitude and longitude and may be used for this operation.
- a map database containing environment maps may also include location data for those maps, including latitude and longitude covered by the maps.
- Processing at Act 902 may entail selecting from that database a set of environment maps that covers the same latitude and longitude determined for the area attributes of the tracking map.
- Act 904 is a first filtering of the set of environment maps accessed in Act 902.
- environment maps are retained in the set based on proximity to the geolocation of the tracking map. This filtering step may be performed by comparing the latitude and longitude associated with the tracking map and the environment maps in the set.
- Figure 32 depicts an example of Act 904, according to some embodiments.
- Each area attribute may have a corresponding geographic location 1202.
- the set of environment maps may include the environment maps with at least one area attribute that has a geographic location overlapping with the determined geographic location of the tracking map.
- the set of identified environment maps includes environment maps CM1, CM2, and CM4, each of which has at least one area attribute that has a geographic location overlapping with the determined geographic location of the tracking map 1102.
- the environment map CM3 associated with the area attribute AA6 is not included in the set because it is outside the determined geographic location of the tracking map.
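The proximity filtering of Act 904 can be illustrated with a short example. This is a minimal sketch, assuming each environment map carries a single latitude/longitude pair as metadata and that the determined geographic location of the tracking map is a centroid plus a radius in meters. The function names, coordinates, and the 200 m radius are hypothetical and do not come from the source.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius used by the haversine formula

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def filter_by_proximity(centroid, radius_m, map_locations):
    """Keep the ids of maps whose stored location lies within radius_m of the centroid."""
    lat0, lon0 = centroid
    return [map_id for map_id, (lat, lon) in map_locations.items()
            if haversine_m(lat0, lon0, lat, lon) <= radius_m]

# Hypothetical metadata: CM3 is placed roughly 11 km from the tracking map.
map_locations = {
    "CM1": (37.0000, -122.0000),
    "CM2": (37.0003, -122.0002),
    "CM3": (37.1000, -122.0000),
    "CM4": (36.9998, -122.0001),
}
nearby = filter_by_proximity((37.0, -122.0), 200.0, map_locations)
print(nearby)
```

With these made-up coordinates, CM1, CM2, and CM4 are retained and CM3 is dropped, mirroring the outcome shown for Figure 32.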
- the method 900 may include filtering (Act 906) the set of environment maps based on similarity of one or more identifiers of network access points associated with the tracking map and the environment maps of the set of environment maps.
- a device collecting sensor data to generate the map may be connected to a network through a network access point, such as through Wi-Fi or similar wireless communication protocol.
- the access points may be identified by BSSID.
- the user device may connect to multiple different access points as it moves through an area collecting data to form a map.
- the devices may have connected through different access points, so there may be multiple access points used in forming the map for that reason too. Accordingly, there may be multiple access points associated with a map, and the set of access points may be an indication of location of the map.
- Strength of signals from an access point, which may be reflected as an RSSI value, may provide further geographic information.
- a list of BSSID and RSSI values may form the area attribute for a map.
- filtering the set of environment maps based on similarity of the one or more identifiers of the network access points may include retaining in the set of environment maps environment maps with the highest Jaccard similarity to the at least one area attribute of the tracking map based on the one or more identifiers of network access points.
- Figure 33 depicts an example of Act 906, according to some embodiments.
- a network identifier associated with the area attribute AA7 may be determined as the identifier for the tracking map 1102.
- the set of environment maps after Act 906 includes environment map CM2, which may have area attributes with higher Jaccard similarity to AA7, and environment map CM4, which also includes the area attribute AA7.
- the environment map CM1 is not included in the set because it has the lowest Jaccard similarity to AA7.
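The Jaccard-based filtering of Act 906 can be sketched as follows. This is a minimal illustration, assuming the area attribute of each map is reduced to a set of BSSID strings (RSSI values are ignored here) and that the top `k` maps are retained. The BSSID values and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_k_by_jaccard(tracking_bssids, map_bssids, k):
    """Return the k map ids whose BSSID sets are most similar to the tracking map's."""
    scored = sorted(map_bssids.items(),
                    key=lambda item: jaccard(tracking_bssids, item[1]),
                    reverse=True)
    return [map_id for map_id, _ in scored[:k]]

# Hypothetical BSSIDs observed while building the tracking map and each candidate map.
tracking = {"aa:01", "aa:02", "aa:03"}
candidates = {
    "CM1": {"bb:01", "bb:02"},
    "CM2": {"aa:01", "aa:02", "cc:09"},
    "CM4": {"aa:01", "aa:02", "aa:03"},
}
kept = top_k_by_jaccard(tracking, candidates, k=2)
print(kept)
```

In this made-up data CM1 shares no access points with the tracking map, so it has the lowest similarity and is filtered out, as in the Figure 33 example.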
- Processing at Acts 902-906 may be performed based on metadata associated with maps and without actually accessing the content of the maps stored in a map database. Other processing may involve accessing the content of the maps.
- Act 908 indicates accessing the environment maps remaining in the subset after filtering based on metadata. It should be appreciated that this act may be performed either earlier or later in the process, if subsequent operations can be performed with accessed content.
- the method 900 may include filtering (Act 910) the set of environment maps based on similarity of metrics representing content of the tracking map and the environment maps of the set of environment maps.
- the metrics representing content of the tracking map and the environment maps may include vectors of values computed from the contents of the maps.
- the Deep Key Frame descriptor, as described above, computed for one or more key frames used in forming a map may provide a metric for comparison of maps, or portions of maps.
- the metrics may be computed from the maps retrieved at Act 908 or may be precomputed and stored as metadata associated with those maps.
- filtering the set of environment maps based on similarity of metrics representing content of the tracking map and the environment maps of the set of environment maps may include retaining in the set of environment maps environment maps with the smallest vector distance between a vector of characteristics of the tracking map and vectors representing environment maps in the set of environment maps.
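The content-metric filtering of Act 910 can be sketched as a nearest-neighbor selection over descriptor vectors. This is a minimal sketch, assuming one descriptor vector per map and Euclidean distance as the vector distance; the source does not state which distance is used, and the vectors and function names below are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_maps(tracking_vec, map_vecs, k):
    """Return the k map ids whose vectors are closest to the tracking map's vector."""
    ranked = sorted(map_vecs, key=lambda m: euclidean(tracking_vec, map_vecs[m]))
    return ranked[:k]

# Hypothetical three-element descriptors; real descriptors would be much longer.
tracking_vec = (0.1, 0.9, 0.3)
map_vecs = {
    "CM2": (0.8, 0.1, 0.5),
    "CM4": (0.1, 0.8, 0.3),
}
closest = nearest_maps(tracking_vec, map_vecs, k=1)
print(closest)
```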
- the method 900 may include further filtering (Act 912) the set of environment maps based on degree of match between a portion of the tracking map and portions of the environment maps of the set of environment maps.
- the degree of match may be determined as a part of a localization process.
- localization may be performed by identifying critical points in the tracking map and the environment map that are
- the critical points may be features, feature descriptors, key frames, key rigs, persistent poses, and/or PCFs.
- the set of critical points in the tracking map might then be aligned to produce a best fit with the set of critical points in the environment map.
- a mean square distance between the corresponding critical points might be computed and, if below a threshold for a particular region of the tracking map, used as an indication that the tracking map and the environment map represent the same region of the physical world.
- filtering the set of environment maps based on degree of match between a portion of the tracking map and portions of the environment maps of the set of environment maps may include computing a volume of a physical world represented by the tracking map that is also represented in an environment map of the set of environment maps, and retaining in the set of environment maps environment maps with a larger computed volume than environment maps filtered out of the set.
- Figure 34 depicts an example of Act 912, according to some embodiments.
- the set of environment maps after Act 912 includes environment map CM4, which has an area 1402 matched with an area of the tracking map 1102.
- the environment map CM1 is not included in the set because it has no area matched with an area of the tracking map 1102.
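The mean square distance test described for Act 912 can be illustrated as below. This is a simplified sketch, assuming point correspondences are already known and that alignment is translation-only (matching centroids); a full best fit would also estimate rotation. The threshold, point values, and function names are hypothetical.

```python
def centroid(points):
    """Average position of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def mean_square_distance(track_pts, env_pts):
    """Align the two point sets by their centroids, then average the squared
    distances between corresponding points."""
    ct, ce = centroid(track_pts), centroid(env_pts)
    total = 0.0
    for t, e in zip(track_pts, env_pts):
        total += sum(((t[i] - ct[i]) - (e[i] - ce[i])) ** 2 for i in range(3))
    return total / len(track_pts)

def same_region(track_pts, env_pts, threshold):
    """True if the aligned point sets are close enough to indicate the same region."""
    return mean_square_distance(track_pts, env_pts) < threshold

track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
env_shifted = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0)]  # same shape, offset origin
env_other = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]    # different geometry

print(same_region(track, env_shifted, 0.01))
print(same_region(track, env_other, 0.01))
```

The first comparison passes because the two point sets differ only by an offset between map origins; the second fails because the geometry itself differs.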
- the set of environment maps may be filtered in the order of Act 906, Act 910, and Act 912. In some embodiments, the set of environment maps may be filtered based on Act 906, Act 910, and Act 912, which may be performed in an order based on processing required to perform the filtering, from lowest to highest.
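Ordering the filters from lowest to highest processing cost can be sketched as a small pipeline. This is a minimal illustration, assuming each filter is a (cost, name, predicate) tuple and that per-map scores have already been computed; the cost figures, score fields, and thresholds are hypothetical.

```python
def run_filters(candidates, filters):
    """Apply filters in order of increasing estimated cost.

    Returns the surviving candidates and the order in which filters ran."""
    applied = []
    for _cost, name, keep in sorted(filters, key=lambda f: f[0]):
        candidates = [c for c in candidates if keep(c)]
        applied.append(name)
    return candidates, applied

# Hypothetical per-map scores standing in for the results of Acts 906, 910, and 912.
maps = [
    {"id": "CM1", "jaccard": 0.0, "vec_dist": 0.2, "matched": False},
    {"id": "CM2", "jaccard": 0.5, "vec_dist": 0.9, "matched": False},
    {"id": "CM4", "jaccard": 1.0, "vec_dist": 0.1, "matched": True},
]
filters = [
    (100, "Act 912", lambda m: m["matched"]),          # localization-based match
    (1, "Act 906", lambda m: m["jaccard"] >= 0.5),     # metadata only
    (10, "Act 910", lambda m: m["vec_dist"] <= 0.5),   # needs content descriptors
]
survivors, order = run_filters(maps, filters)
print([m["id"] for m in survivors], order)
```

Running the cheap metadata filter first means the expensive localization-based check is evaluated on as few maps as possible.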
- the method 900 may include loading (Act 914) the set of environment maps and data.
- a user database stores area identities indicating areas that AR devices were used in.
- the area identities may be area attributes, which may include parameters of wireless networks detected by the AR devices when in use.
- a map database may store multiple environment maps constructed from data supplied by the AR devices and associated metadata.
- the associated metadata may include area identities derived from the area identities of the AR devices that supplied data from which the environment maps were constructed.
- An AR device may send a message to a PW module indicating a new tracking map is created or being created.
- the PW module may compute area identifiers for the AR device and update the user database based on the received parameters and/or the computed area identifiers.
- the PW module may also determine area identifiers associated with the AR device requesting the environment maps, identify sets of environment maps from the map database based on the area identifiers, filter the sets of environment maps, and transmit the filtered sets of environment maps to the AR devices.
- the PW module may filter the sets of environment maps based on one or more criteria including, for example, a geographic location of the tracking map, similarity of one or more identifiers of network access points associated with the tracking map and the environment maps of the set of environment maps, similarity of metrics representing contents of the tracking map and the environment maps of the set of environment maps, and degree of match between a portion of the tracking map and portions of the environment maps of the set of environment maps.
- inventions are described in connection with devices, such as wearable devices. It should be appreciated that some or all of the techniques described herein may be implemented via networks (such as cloud), discrete applications, and/or any suitable combinations of devices, networks, and discrete applications.
- Figure 29 provides examples of criteria that may be used to filter candidate maps to yield a set of high ranking maps. Other criteria may be used instead of or in addition to the described criteria. For example, if multiple candidate maps have similar values of a metric used for filtering out less desirable maps, characteristics of the candidate maps may be used to determine which maps are retained as candidate maps or filtered out. For example, larger or more dense candidate maps may be prioritized over smaller candidate maps.
- Figures 27-28 may describe all or part of the systems and methods described in Figures 29-34.
- Figures 35 and 36 are schematic diagrams illustrating an XR system configured to rank and merge a plurality of environment maps, according to some embodiments.
- a passable world may determine when to trigger ranking and/or merging the maps.
- determining a map to be used may be based at least partly on deep key frames described above in relation to FIGs. 21-25, according to some embodiments.
- FIG. 37 is a block diagram illustrating a method 3700 of creating environment maps of a physical world, according to some embodiments.
- the method 3700 may start from localizing (Act 3702) a tracking map captured by an XR device worn by a user to a group of canonical maps (e.g., canonical maps selected by the method of FIG. 28 and/or the method 900 of FIG. 29).
- the Act 3702 may include localizing keyrigs of the tracking map into the group of canonical maps.
- the localization result of each keyrig may include the keyrig’s localized pose and a set of 2D-to-3D feature correspondences.
- the method 3700 may include splitting (Act 3704) a tracking map into connected components, which may enable merging maps robustly by merging connected pieces.
- Each connected component may include keyrigs that are within a predetermined distance.
- the method 3700 may include merging (Act 3706) the connected components that are larger than a predetermined threshold into one or more canonical maps, and removing the merged connected components from the tracking map.
- the method 3700 may include merging (Act 3708) canonical maps of the group that are merged with the same connected components of the tracking map. In some embodiments, the method 3700 may include promoting (Act 3710) the remaining connected components of the tracking map that have not been merged with any canonical maps to be a canonical map. In some embodiments, the method 3700 may include merging (Act 3712) persistent poses and/or PCFs of the tracking maps and the canonical maps that are merged with at least one connected component of the tracking map. In some embodiments, the method 3700 may include finalizing (Act 3714) the canonical maps by, for example, fusing map points and pruning redundant keyrigs.
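Splitting a tracking map into connected components (Act 3704) and selecting the larger ones for merging (Act 3706) can be sketched with a union-find structure. This is a minimal sketch, assuming each keyrig has a 3D position and that two keyrigs are linked when they lie within a distance `max_dist` of each other; the positions, thresholds, and function name are hypothetical.

```python
import math

def connected_components(keyrig_positions, max_dist):
    """Group keyrig ids so that keyrigs within max_dist of each other share a component."""
    ids = list(keyrig_positions)
    parent = {k: k for k in ids}

    def find(k):
        # Follow parent links to the component root, shortening the path on the way.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if math.dist(keyrig_positions[a], keyrig_positions[b]) <= max_dist:
                parent[find(a)] = find(b)

    groups = {}
    for k in ids:
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical keyrig positions in meters; k4 is isolated from the others.
keyrigs = {"k1": (0, 0, 0), "k2": (1, 0, 0), "k3": (2, 0, 0), "k4": (10, 0, 0)}
components = connected_components(keyrigs, max_dist=1.5)
to_merge = [c for c in components if len(c) > 2]      # larger than the size threshold
remaining = [c for c in components if len(c) <= 2]    # candidates for promotion
print(to_merge, remaining)
```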
- Figures 38A and 38B illustrate an environment map 3800 created by updating a canonical map 700, which may be promoted from the tracking map 700 (FIG. 7) with a new tracking map, according to some embodiments.
- the canonical map 700 may provide a floor plan 706 of reconstructed physical objects in a corresponding physical world, represented by points 702.
- a map point 702 may represent a feature of a physical object that may include multiple features.
- a new tracking map may be captured about the physical world and uploaded to a cloud to merge with the map 700.
- the new tracking map may include map points 3802, and keyrigs 3804, 3806.
- keyrigs 3804 represent keyrigs that are successfully localized to the canonical map by, for example, establishing a correspondence with a keyrig 704 of the map 700 (as illustrated in FIG. 38B).
- keyrigs 3806 represent keyrigs that have not been localized to the map 700. Keyrigs 3806 may be promoted to a separate canonical map in some embodiments.
- Figures 39A to 39F are schematic diagrams illustrating an example of a cloud-based persistent coordinate system providing a shared experience for users in the same physical space.
- Figure 39A shows that a canonical map 4814, for example, from a cloud, is received by the XR devices worn by the users 4802A and 4802B of FIGs. 20A-20C.
- the canonical map 4814 may have a canonical coordinate frame 4806C.
- the canonical map 4814 may have a PCF 4810C with a plurality of associated PPs (e.g., 4818A, 4818B in FIG. 39C).
- Figure 39B shows that the XR devices established relationships between their respective world coordinate systems 4806A, 4806B and the canonical coordinate frame 4806C. This may be done, for example, by localizing to the canonical map 4814 on the respective devices. Localizing the tracking map to the canonical map may result, for each device, in a transformation between its local world coordinate system and the coordinate system of the canonical map.
- FIG. 39C shows that, as a result of localization, a transformation (e.g., transformations 4816A, 4816B) can be computed between a local PCF (e.g., PCFs 4810A, 4810B) on the respective device and a respective persistent pose (e.g., PPs 4818A, 4818B) on the canonical map.
- each device may use its local PCFs, which can be detected locally on the device by processing images detected with sensors on the device, to determine where with respect to the local device to display virtual content attached to the PPs 4818A, 4818B or other persistent points of the canonical map.
- Such an approach may accurately position virtual content with respect
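The relationship between a device's local world frame and the canonical frame can be illustrated with a small planar example. This is a minimal sketch, assuming 2D rigid transforms (rotation plus translation) and hypothetical helpers `make_transform`, `apply`, and `invert`; the source does not specify how the transformations 4816A, 4816B are represented.

```python
import math

def make_transform(theta_deg, tx, ty):
    """Rigid 2D transform stored as (cos, sin, tx, ty)."""
    t = math.radians(theta_deg)
    return (math.cos(t), math.sin(t), tx, ty)

def apply(T, p):
    """Rotate point p by the transform's angle, then translate it."""
    c, s, tx, ty = T
    x, y = p
    return (c * x - s * y + tx, s * x + c * y + ty)

def invert(T):
    """Inverse transform: transposed rotation and translation -R^T t."""
    c, s, tx, ty = T
    return (c, -s, -(c * tx + s * ty), -(-s * tx + c * ty))

# Hypothetical localization result: the local frame is rotated 90 degrees and
# offset by 2 m along x relative to the canonical frame.
canonical_from_local = make_transform(90.0, 2.0, 0.0)
local_from_canonical = invert(canonical_from_local)

content_in_canonical = (2.0, 1.0)  # content attached to a persistent point
content_in_local = apply(local_from_canonical, content_in_canonical)
print(content_in_local)
```

Two devices with different local frames would each hold their own transform, yet both would resolve the same canonical position for the shared content.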
- Figure 39D shows a persistent pose snapshot from the canonical map to the local tracking maps.
- the local tracking maps are connected to one another via the persistent poses.
- Figure 39E shows that the PCF 4810A on the device worn by the user 4802A is accessible in the device worn by the user 4802B through PPs 4818A.
- Figure 39F shows that the tracking maps 4804A, 4804B and the canonical map 4814 may merge. In some embodiments, some PCFs may be removed as a result of merging.
- the merged map includes the PCF 4810C of the canonical map 4814 but not the PCFs 4810A, 4810B of the tracking maps 4804A, 4804B.
- the PPs previously associated with the PCFs 4810A, 4810B may be associated with the PCF 4810C after the maps merge.
- Figures 40 and 41 illustrate an example of generating a tracking map by the first XR device 12.1 of Figure 9.
- Figure 40 is a two-dimensional representation of a three-dimensional first local tracking map (Map 1), which may be generated by the first XR device of Figure 9, according to some embodiments.
- Figure 41 is a block diagram illustrating uploading Map 1 from the first XR device to the server of Figure 9, according to some embodiments.
- Figure 40 illustrates Map 1 and virtual content (Content123 and Content456) on the first XR device 12.1.
- Map 1 has an origin (Origin 1).
- Map 1 includes a number of PCFs (PCF a to PCF d).
- PCF a is located at the origin of Map 1 and has X, Y, and Z coordinates of (0,0,0) and PCF b has X, Y, and Z coordinates (-1,0,0).
- Content123 is associated with PCF a.
- Content123 has an X, Y, and Z relationship relative to PCF a of (1,0,0).
- Content456 has a relationship relative to PCF b.
- Content456 has an X, Y, and Z relationship of (1,0,0) relative to PCF b.
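The coordinates above can be combined to find where each piece of content sits in Map 1. This is a minimal worked example, assuming the relationships are pure translations expressed along the axes of Map 1 (no rotation); the helper names are hypothetical.

```python
def add(a, b):
    """Component-wise sum of two 3D tuples."""
    return tuple(x + y for x, y in zip(a, b))

# PCF positions in Map 1, as given in the text.
pcfs = {"PCF a": (0, 0, 0), "PCF b": (-1, 0, 0)}

# Each content item stores the PCF it is attached to and its offset from that PCF.
content = {
    "Content123": ("PCF a", (1, 0, 0)),
    "Content456": ("PCF b", (1, 0, 0)),
}

def content_in_map(name):
    """Position of a content item in Map 1 coordinates."""
    pcf_name, offset = content[name]
    return add(pcfs[pcf_name], offset)

print(content_in_map("Content123"))
print(content_in_map("Content456"))
```

Under these assumptions Content123 lands at (1,0,0) and Content456 at the origin of Map 1, even though both share the same offset from their respective PCFs.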
- the first XR device 12.1 uploads Map 1 to the server 20.
- the server 20 now has a canonical map based on Map 1.
- the first XR device 12.1 has a canonical map that is empty at this stage.
- the first XR device 12.1 also transmits its Wi-Fi signature data to the server 20.
- the server 20 may use the Wi-Fi signature data to determine a rough location of the first XR device 12.1 based on intelligence gathered from other devices that have, in the past, connected to the server 20 or other servers together with the GPS locations of such other devices that have been recorded.
- the first XR device 12.1 may now end the first session (see Figure 8) and may disconnect from the server 20.
- Figure 42 is a schematic diagram illustrating the XR system of Figure 16, showing the second user 14.2 has initiated a second session using a second XR device of the XR system after the first user 14.1 has terminated a first session, according to some embodiments.
- Figure 43A is a block diagram showing the initiation of a second session by a second user 14.2.
- the first user 14.1 is shown in phantom lines because the first session by the first user 14.1 has ended.
- the second XR device 12.2 begins to record objects.
- Various systems with varying degrees of granulation may be used by the server 20 to determine that the second session by the second XR device 12.2 is in the same vicinity of the first session by the first XR device 12.1.
- For example, Wi-Fi signature data, global positioning system (GPS) positioning data, GPS data based on Wi-Fi signature data, or any other data that indicates a location may be included in the first and second XR devices 12.1 and 12.2 to record their locations.
- the PCFs that are identified by the second XR device 12.2 may show a similarity to the PCFs of Map 1.
- the second XR device boots up and begins to collect data, such as images 1110 from one or more cameras 44, 46.
- an XR device (e.g., the second XR device 12.2) may collect one or more images 1110 and perform image processing to extract one or more features / interest points 1120.
- Each feature may be converted to a descriptor 1130.
- the descriptors 1130 may be used to describe a key frame 1140, which may have the position and direction of the associated image attached.
- One or more key frames 1140 may correspond to a single persistent pose 1150, which may be automatically generated after a threshold distance from the previous persistent pose 1150, e.g., 3 meters.
- One or more persistent poses 1150 may correspond to a single PCF 1160, which may be automatically generated after a predetermined distance, e.g., every 5 meters.
- additional PCFs (e.g., PCF 3 and PCF 4,5) may be created.
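The distance-triggered creation of persistent poses and PCFs can be sketched as below. This is a simplified illustration, assuming a marker is created at the first position and then whenever the device is at least the threshold distance from the most recent marker. It also treats PCFs as spawned directly from device positions instead of from groups of persistent poses, which is an assumption made only to keep the example short.

```python
import math

def spawn_markers(path, threshold_m):
    """Return the positions at which a new marker is created along the path."""
    markers = []
    for pos in path:
        if not markers or math.dist(pos, markers[-1]) >= threshold_m:
            markers.append(pos)
    return markers

# Hypothetical walk of 10 m along the x axis, sampled once per meter.
path = [(float(x), 0.0, 0.0) for x in range(0, 11)]

persistent_poses = spawn_markers(path, 3.0)  # example threshold from the text
pcfs = spawn_markers(path, 5.0)              # example spacing from the text
print(len(persistent_poses), len(pcfs))
```

For this straight 10 m walk the sketch yields persistent poses at 0, 3, 6, and 9 m and PCFs at 0, 5, and 10 m.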
- An application or two 1180 may run on the XR device and provide virtual content 1170 to the XR device for presentation to the user.
- the virtual content may have an associated content coordinate frame which may be placed relative to one or more PCFs.
- the second XR device 12.2 creates three PCFs.
- the second XR device 12.2 may try to localize into one or more canonical maps stored on the server 20.
- the second XR device 12.2 may download the canonical map 120 from the server 20.
- Map 1 on the second XR device 12.2 includes PCFs a to d and Origin 1.
- the server 20 may have multiple canonical maps for various locations and may determine that the second XR device 12.2 is in the same vicinity as the vicinity of the first XR device 12.1 during the first session and sends the second XR device 12.2 the canonical map for that vicinity.
- Figure 44 shows the second XR device 12.2 beginning to identify PCFs for purposes of generating Map 2.
- the second XR device 12.2 has only identified a single PCF, namely PCF 1,2.
- the X, Y, and Z coordinates of PCF 1,2 for the second XR device 12.2 may be (1,1,1).
- Map 2 has its own origin (Origin 2), which may be based on the head pose of device 2 at device start-up for the current head pose session.
- the second XR device 12.2 may immediately attempt to localize Map 2 to the canonical map.
- Map 2 may not be able to localize into Canonical Map (Map 1) (i.e.
- the system may localize based on PCF comparison between the local and canonical maps. In some embodiments, the system may localize based on persistent pose comparison between the local and canonical maps. In some embodiments, the system may localize based on key frame comparison between the local and canonical maps.
- Figure 45 shows Map 2 after the second XR device 12.2 has identified further PCFs (PCF 1,2, PCF 3, PCF 4,5) of Map 2.
- the second XR device 12.2 again attempts to localize Map 2 to the canonical map. Because Map 2 has expanded to overlap with at least a portion of the Canonical Map, the localization attempt will succeed.
- the overlap between the local tracking map, Map 2, and the Canonical Map may be represented by PCFs, persistent poses, key frames, or any other suitable intermediate or derivative construct.
- the second XR device 12.2 has associated Content123 and Content456 to PCF 1,2 and PCF 3 of Map 2. Content123 has X, Y, and Z coordinates relative to PCF 1,2 of (1,0,0). Similarly, the X, Y, and Z coordinates of Content456 relative to PCF 3 in Map 2 are (1,0,0).
- Figures 46A and 46B illustrate a successful localization of Map 2 to the canonical map.
- the overlapping area / volume / section of the maps 1410 represent the common parts to Map 1 and the canonical map. Since Map 2 created PCFs 3 and 4,5 before localizing, and the Canonical map created PCFs a and c before Map 2 was created, different PCFs were created to represent the same volume in real space (e.g., in different maps).
- the second XR device 12.2 expands Map 2 to include PCFs a-d from the Canonical Map.
- the inclusion of PCFs a-d represents the localization of Map 2 to the Canonical Map.
- the XR system may perform an optimization step to remove duplicate PCFs from overlapping areas, such as the PCFs in 1410, PCF 3 and PCF 4,5.
- when Map 2 localizes, the placement of virtual content, such as Content456 and Content123, will be relative to the closest updated PCFs in the updated Map 2.
- the virtual content appears in the same real-world location relative to the user, despite the changed PCF attachment for the content, and despite the updated PCFs for Map 2.
- the second XR device 12.2 continues to expand Map 2 as further PCFs (e.g., PCFs e, f, g, and h) are identified by the second XR device 12.2, for example as the user walks around the real world. It can also be noted that Map 1 has not expanded in Figures 47 and 48.
- the second XR device 12.2 uploads Map 2 to the server 20.
- the server 20 stores Map 2 together with the canonical map.
- Map 2 may upload to the server 20 when the session ends for the second XR device 12.2.
- the canonical map within the server 20 now includes PCF i which is not included in Map 1 on the first XR device 12.1.
- the canonical map on the server 20 may have expanded to include PCF i when a third XR device (not shown) uploaded a map to the server 20 and such a map included PCF i.
- the server 20 merges Map 2 with the canonical map to form a new canonical map.
- the server 20 determines that PCFs a to d are common to the canonical map and Map 2.
- the server expands the canonical map to include PCFs e to h and PCF 1,2 from Map 2 to form a new canonical map.
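At the level of PCF identifiers, the merge described above amounts to an intersection (the common PCFs) and a union (the new canonical map). This is a minimal sketch that tracks only PCF labels and ignores geometry; the function name is hypothetical, and the canonical set includes PCF i as stated in the text.

```python
def merge_pcf_ids(canonical_pcfs, uploaded_pcfs):
    """Return (common PCF ids, PCF ids of the new canonical map)."""
    common = canonical_pcfs & uploaded_pcfs
    merged = canonical_pcfs | uploaded_pcfs
    return common, merged

canonical = {"a", "b", "c", "d", "i"}
map2 = {"a", "b", "c", "d", "e", "f", "g", "h", "1,2"}

common, new_canonical = merge_pcf_ids(canonical, map2)
print(sorted(common))
print(sorted(new_canonical))
```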
- the canonical maps on the first and second XR devices 12.1 and 12.2 are based on Map 1 and are outdated.
- the server 20 transmits the new canonical map to the first and second XR devices 12.1 and 12.2. In some embodiments, this may occur when the first XR device 12.1 and second device 12.2 try to localize during a different or new or subsequent session.
- the first and second XR devices 12.1 and 12.2 proceed as described above to localize their respective local maps (Map 1 and Map 2 respectively) to the new canonical map.
- the head coordinate frame 96 or “head pose” is related to the PCFs in Map 2.
- the origin of the map, Origin 2 is based off of the head pose of second XR device 12.2 at the start of the session.
- the PCFs are placed relative to the world coordinate frame, Origin 2.
- the PCFs of Map 2 serve as persistent coordinate frames relative to a canonical coordinate frame, where the world coordinate frame may be a previous session’s world coordinate frame (e.g., Map 1’s Origin 1 in Figure 40).
- the transformation from the world coordinate frame to the head coordinate frame 96 has been previously discussed with reference to Figure 9.
- the head coordinate frame 96 shown in Figure 52 only has two orthogonal axes that are in a particular coordinate position relative to the PCFs of Map 2, and at particular angles relative to Map 2.
- the head coordinate frame 96 is in a three-dimensional location relative to the PCFs of Map 2 and has three orthogonal axes within three-dimensional space.
- the head coordinate frame 96 has moved relative to the PCFs of Map 2.
- the head coordinate frame 96 has moved because the second user 14.2 has moved their head.
- the user can move their head in six degrees of freedom (6dof).
- the head coordinate frame 96 can thus move in 6dof, namely in three-dimensions from its previous location in Figure 52 and about three orthogonal axes relative to the PCFs of Map 2.
- the head coordinate frame 96 is adjusted when the real object detection camera 44 and inertial measurement unit 48 in Figure 9 respectively detect real objects and motion of the head unit 22. More information regarding head pose tracking is disclosed in U.S. Patent Application No. 16/221,065 entitled “Enhanced Pose Determination for Display Device” and is hereby incorporated by reference in its entirety.
- Figure 54 shows that sound may be associated with one or more PCFs.
- a user may, for example, wear headphones or earphones with stereoscopic sound.
- the location of sound through headphones can be simulated using conventional techniques.
- the location of sound may be located in a stationary position so that, when the user rotates their head to the left, the location of sound rotates to the right so that the user perceives the sound coming from the same location in the real world.
- location of sound is represented by Sound123 and Sound456.
- Figure 54 is similar to Figure 48 in its analysis. When the first and second users 14.1 and 14.2 are located in the same room at the same or different times, they perceive Sound123 and Sound456 coming from the same locations within the real world.
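The counter-rotation of a world-fixed sound source can be illustrated with a one-line angle computation. This is a minimal sketch, assuming yaw-only head rotation with positive angles meaning "to the left"; the function name and angle convention are hypothetical and are not taken from the source.

```python
def relative_azimuth_deg(source_azimuth_deg, head_yaw_deg):
    """Azimuth of a world-fixed source in the head frame, wrapped to [-180, 180)."""
    return (source_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

# A source straight ahead in the world; the user turns 30 degrees to the left.
print(relative_azimuth_deg(0.0, 30.0))
```

The negative result means the source is now rendered 30 degrees to the right of the user's nose, so it is perceived at the same real-world location.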
- Figures 55 and 56 illustrate a further implementation of the technology described above.
- the first user 14.1 has initiated a first session as described with reference to Figure 8.
- the first user 14.1 has terminated the first session as indicated by the phantom lines.
- the first XR device 12.1 uploaded Map 1 to the server 20.
- the first user 14.1 has now initiated a second session at a later time than the first session.
- the first XR device 12.1 does not download Map 1 from the server 20 because Map 1 is already stored on the first XR device 12.1. If Map 1 is lost, then the first XR device 12.1 downloads Map 1 from the server 20.
- the first XR device 12.1 then proceeds to build PCFs for Map 2, localizes to Map 1, and further develops a canonical map as described above.
- Map 2 of the first XR device 12.1 is then used for relating local content, a head coordinate frame, local sound, etc. as described above.
- referring to FIGs. 57 and 58, it may also be possible that more than one user interacts with the server in the same session.
- the first user 14.1 and the second user 14.2 are joined by a third user 14.3 with a third XR device 12.3.
- Each XR device 12.1, 12.2, and 12.3 begins to generate its own map, namely Map 1, Map 2, and Map 3, respectively.
- Map 1, Map 2, and Map 3 are incrementally uploaded to the server 20.
- the server 20 merges Maps 1, 2, and 3 to form a canonical map.
- the canonical map is then transmitted from the server 20 to each one of the XR devices 12.1, 12.2 and 12.3.
- Figure 59 illustrates aspects of a viewing method to recover and/or reset head pose, according to some embodiments.
- the viewing device is powered on.
- a new session is initiated.
- a new session may include establishing head pose.
- One or more capture devices on a head-mounted frame secured to a head of a user capture surfaces of an environment by first capturing images of the environment and then determining the surfaces from the images.
- surface data may be combined with data from a gravitational sensor to establish head pose. Other suitable methods of establishing head pose may be used.
- a processor of the viewing device enters a routine for tracking of head pose.
- the capture devices continue to capture surfaces of the environment as the user moves their head to determine an orientation of the head-mounted frame relative to the surfaces.
- the processor determines whether head pose has been lost.
- Head pose may become lost due to “edge” cases, such as too many reflective surfaces, low light, blank walls, being outdoors, etc., that may result in low feature acquisition, or because of dynamic cases such as a crowd that moves and forms part of the map.
- the routine at 1430 allows for a certain amount of time, for example 10 seconds, to pass to allow enough time to determine whether head pose has been lost. If head pose has not been lost, then the processor returns to 1420 and again enters tracking of head pose.
- the processor enters a routine at 1440 to recover head pose. If head pose is lost due to low light, then a message such as the following message is displayed to the user through a display of the viewing device:
- the system will continue to monitor whether there is sufficient light available and whether head pose can be recovered.
- the system may alternatively determine that low texture of surfaces is causing head pose to be lost, in which case the user is given the following prompt in the display as a suggestion to improve capturing of surfaces:
- the processor enters a routine to determine whether head pose recovery has failed. If head pose recovery has not failed (i.e. head pose recovery has succeeded), then the processor returns to Act 1420 by again entering tracking of head pose. If head pose recovery has failed, the processor returns to Act 1410 to establish a new session.
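- The control flow of Figure 59 can be read as a small state machine. The following is a minimal sketch under that reading; the `State` enum, the `next_state` function, and its boolean inputs are hypothetical names introduced here, and only the act numbers 1410, 1420, and 1440 come from the description above.

```python
from enum import Enum, auto


class State(Enum):
    NEW_SESSION = auto()  # Act 1410: establish a new session and head pose
    TRACKING = auto()     # Act 1420: track head pose
    RECOVERING = auto()   # Act 1440: attempt to recover head pose


def next_state(state: State, pose_lost: bool = False,
               recovery_failed: bool = False) -> State:
    """Return the next state given the outcome of the current routine."""
    if state is State.NEW_SESSION:
        # Once head pose is established, tracking begins.
        return State.TRACKING
    if state is State.TRACKING:
        # The check at 1430 decides whether head pose has been lost.
        return State.RECOVERING if pose_lost else State.TRACKING
    if state is State.RECOVERING:
        # Failed recovery restarts the session; success resumes tracking.
        return State.NEW_SESSION if recovery_failed else State.TRACKING
    raise ValueError(f"unknown state: {state!r}")
```

- In this sketch the user prompts (low light, low texture) would be issued while the machine is in `State.RECOVERING`; how long the device waits before declaring recovery a failure is left outside the sketch.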
- Figure 60 shows a diagrammatic representation of a machine in the exemplary form of a computer system 1900 within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed, according to some embodiments.
- the machine operates as a standalone device or may be connected (e.g., networked) to other machines.
- the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
- the exemplary computer system 1900 includes a processor 1902 (e.g., a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 1904 (e.g., read only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), and a static memory 1906 (e.g., flash memory, static random access memory (SRAM), etc.), which communicate with each other via a bus 1908.
- the computer system 1900 may further include a disk drive unit 1916, and a network interface device 1920.
- the disk drive unit 1916 includes a machine-readable medium 1922 on which is stored one or more sets of instructions 1924 (e.g., software) embodying any one or more of the methodologies or functions described herein.
- the software may also reside, completely or at least partially, within the main memory 1904 and/or within the processor 1902 during execution thereof by the computer system 1900, the main memory 1904 and the processor 1902 also constituting machine-readable media.
- the software may further be transmitted or received over a network 18 via the network interface device 1920.
- the computer system 1900 includes a driver chip 1950 that is used to drive projectors to generate light.
- the driver chip 1950 includes its own data store 1960 and its own processor 1962.
- While the machine-readable medium 1922 is shown in an exemplary embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
- the term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention.
- the term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
- embodiments are described in connection with an augmented reality (AR) environment. It should be appreciated that some or all of the techniques described herein may be applied in an MR environment or more generally in other XR environments, and in VR environments.
- embodiments are described in connection with devices, such as wearable devices. It should be appreciated that some or all of the techniques described herein may be implemented via networks (such as cloud), discrete applications, and/or any suitable combinations of devices, networks, and discrete applications.
- FIG. 29 provides examples of criteria that may be used to filter candidate maps to yield a set of high ranking maps. Other criteria may be used instead of or in addition to the described criteria. For example, if multiple candidate maps have similar values of a metric used for filtering out less desirable maps, characteristics of the candidate maps may be used to determine which maps are retained as candidate maps or filtered out. For example, larger or more dense candidate maps may be prioritized over smaller candidate maps.
- Some embodiments relate to a portable electronic system including a sensor configured to capture information about a three-dimensional (3D) environment and output images, wherein each image comprises a plurality of pixels; at least one processor configured to execute computer executable instructions to process the images output by the sensor.
- the computer executable instructions comprise instructions for: receiving a plurality of images captured by the sensor; for at least a subset of the plurality of images: identifying one or more features in the plurality of pixels for each image of the subset of images, wherein each feature corresponds to one or more pixels; computing feature descriptors for each feature of the one or more features; and for each of the images of the subset, computing a frame descriptor to represent the image based, at least in part, on the computed feature descriptors in the image.
- the sensor comprises at least one million pixel circuits.
- the frame descriptor for each of the plurality of images comprises 512 or fewer numbers.
- the computer executable instructions comprise further instructions for: constructing a map of at least a portion of the 3D environment; and associating the feature descriptors for respective frames with portions of the map generated, at least in part, from the respective frames.
- the computer executable instructions comprise instructions for selecting as the subset of the plurality of images one or more key frames from the plurality of images based, at least in part, on location of the image with respect to the 3D environment and the plurality of pixels of the plurality of images.
- the computer executable instructions comprise instructions for identifying, for a key frame of the one or more key frames, one or more frames associated with a map of the 3D environment, the one or more frames having frame descriptors less than a threshold distance from the frame descriptor for the key frame.
- the computer executable instructions for computing the frame descriptor comprise an artificial neural network.
- the artificial neural network comprises a multi-layer perceptron unit trained based on similar and dissimilar images and configured to receive as inputs a plurality of values representative of features in images and to provide as outputs weighted combinations of the plurality of values representative of features, and a max pooling unit configured to select a subset of the outputs of the multi-layer perceptron unit as the frame descriptor.
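- A minimal sketch of the frame-descriptor computation described above: each feature descriptor is passed through a perceptron layer that outputs weighted combinations of its values, and max pooling then reduces the per-feature outputs to one fixed-length frame descriptor. The function names, the single ReLU layer, the element-wise reading of max pooling, and the toy weights are assumptions for illustration; a real network would use trained weights and more layers.

```python
from typing import List, Sequence


def perceptron_layer(feature: Sequence[float],
                     weights: Sequence[Sequence[float]],
                     biases: Sequence[float]) -> List[float]:
    """One fully connected layer with ReLU: weighted combinations of the inputs."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, feature)) + b)
        for row, b in zip(weights, biases)
    ]


def frame_descriptor(feature_descriptors: Sequence[Sequence[float]],
                     weights: Sequence[Sequence[float]],
                     biases: Sequence[float]) -> List[float]:
    """Max-pool the layer outputs over all features of one image."""
    projected = [perceptron_layer(f, weights, biases) for f in feature_descriptors]
    # Element-wise maximum across features (assumed pooling scheme).
    return [max(column) for column in zip(*projected)]


# Toy, untrained weights: 2 input values per feature, 3 output values.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_biases = [0.0, 0.0, -1.0]
descriptor = frame_descriptor([[1.0, 2.0], [3.0, 0.0]], toy_weights, toy_biases)
```

- The descriptor length equals the number of layer outputs and is independent of how many features the image contains, which is what makes it usable as a compact per-frame signature.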
- Some embodiments relate to a method of operating a computing system to generate a map of at least a portion of a three-dimensional (3D) environment based on sensor data collected by a device worn by a user.
- the method includes receiving a plurality of images captured by the device worn by the user; determining one or more key frames from the plurality of images; identifying one or more interest points in the one or more key frames with a first artificial neural network; computing feature descriptors for individual interest points with the first artificial neural network; and for each of the one or more key frames, computing a frame descriptor to represent the key frame based, at least in part, on the computed feature descriptors for the identified interest points in the key frame with a second artificial neural network.
- the first and second artificial neural networks are sub networks of an artificial neural network.
- the frame descriptors are unique for individual key frames.
- each of the one or more key frames has a resolution higher than 1 Megabyte.
- the frame descriptor for each of the one or more key frames is a string that is less than 512 numbers.
- each feature descriptor is a string of 32 bytes.
- the frame descriptor is generated by max pooling the feature descriptors.
- the method includes receiving a new image captured by the device worn by the user, and identifying one or more nearest key frames in a database comprising key frames used to generate the map, the one or more nearest key frames having frame descriptors within a predetermined distance of the frame descriptor for the new image.
- the method includes performing feature matching against 3D map points of the map that correspond to the identified one or more nearest key frames; and computing pose of the device worn by the user based on feature matching results.
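- The key-frame lookup described above can be sketched as a distance-thresholded nearest-neighbour search over stored frame descriptors. The function names, the dictionary layout of the database, and the use of Euclidean distance are assumptions; the source only requires descriptors "within a predetermined distance".

```python
import math
from typing import Dict, List, Sequence


def descriptor_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two frame descriptors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_key_frames(query: Sequence[float],
                       database: Dict[str, Sequence[float]],
                       max_distance: float) -> List[str]:
    """Return ids of key frames within max_distance of the query, nearest first."""
    scored = [(descriptor_distance(query, desc), kf_id)
              for kf_id, desc in database.items()]
    return [kf_id for dist, kf_id in sorted(scored) if dist <= max_distance]


# Hypothetical database of key-frame descriptors.
key_frame_db = {"kf_a": [0.0, 0.0], "kf_b": [3.0, 4.0], "kf_c": [0.0, 1.0]}
candidates = nearest_key_frames([0.0, 0.0], key_frame_db, max_distance=2.0)
```

- Only the returned candidates would then go through the more expensive feature matching against 3D map points.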
- determining the one or more key frames from the plurality of images comprises comparing pixels of a first image with pixels of a second image that is taken immediately after the first image, and identifying the second image as a key frame when the difference between the pixels of the first image and the pixels of the second image is above or below a threshold value.
- the method includes training the second artificial neural network by: generating a dataset comprising a plurality of image sets, wherein each of the plurality of image sets includes a query image, a positive sample image, and a negative sample image; for each image set of the plurality of image sets in the dataset, computing a loss by comparing the query image with the positive sample image and the negative sample image; and modifying the second artificial neural network based on the computed loss such that a distance between a frame descriptor generated by the second artificial neural network for the query image and a frame descriptor for the positive sample image is less than a distance between the frame descriptor for the query image and a frame descriptor for the negative sample image.
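- Training on query, positive, and negative images is commonly done with a triplet margin loss, which is zero once the query descriptor is closer to the positive descriptor than to the negative descriptor by at least a margin. The sketch below shows that loss on plain lists; the function names, the squared Euclidean distance, and the margin value are assumptions, since the source does not name a specific loss.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two frame descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(query: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Hinge-style triplet loss (assumed form).

    Penalises the network whenever the positive sample is not closer
    to the query than the negative sample by at least `margin`.
    """
    return max(0.0,
               squared_distance(query, positive)
               - squared_distance(query, negative)
               + margin)


# Well-ordered triplet: positive is near, negative is far, so loss is zero.
good = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
# Badly ordered triplet: the "positive" is farther than the "negative".
bad = triplet_loss([0.0, 0.0], [3.0, 0.0], [0.0, 1.0])
```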
- the computing environment includes a database storing a plurality of maps. Each map comprises information representing regions of a 3D environment. The information representing each region comprises a frame descriptor representing an image of the region; and non-transitory computer storage media storing computer-executable instructions that, when executed by at least one processor in the computing environment: processes an image captured by a portable device by identifying a plurality of features in the image; computing a feature descriptor for each of the plurality of features; computing a frame descriptor to represent the image based, at least in part, on the computed feature descriptors for the one or more identified interest points in the image; and selecting a map in the database based on a comparison between the computed frame descriptor and frame descriptors stored in the database of maps.
- the frame descriptors are unique for the frames stored in the database.
- the image has a resolution higher than 1 Megabyte.
- the frame descriptor computed to represent the image is a string that is less than 512 numbers.
- the computer executable instructions comprise an artificial neural network trained by: processing a dataset comprising a plurality of image sets, wherein each of the plurality of image sets includes a query image, a positive sample image, and a negative sample image by, computing a loss for an image set of the plurality of image sets in the dataset by comparing the query image with the positive sample image and the negative sample image; and modifying the artificial neural network based on the computed loss such that a distance between a frame descriptor generated by the artificial neural network for the query image and a frame descriptor for the positive sample image is less than a distance between the frame descriptor for the query image and a frame descriptor for the negative sample image.
- modifying the artificial neural network comprises modifying copies of the artificial neural network on portable devices in the computing environment.
- the computing environment comprises a cloud platform and a plurality of portable devices in communication with the cloud platform.
- the cloud platform comprises the database and the computer executable instructions for selecting the map.
- the computer executable instructions for processing an image captured by a portable device are stored on the portable device.
- Some embodiments relate to an XR system including a first XR device that includes a first processor, a first computer-readable medium connected to the first processor, a first origin coordinate frame stored on the first computer-readable medium, a first destination coordinate frame stored on the computer-readable medium, a first data channel to receive data representing local content, a first coordinate frame transformer executable by the first processor to transform a positioning of the local content from the first origin coordinate frame to the first destination coordinate frame, and a first display system adapted to display the local content to a first user after transforming the positioning of the local content from the first origin coordinate frame to the first destination coordinate frame.
- Some embodiments relate to a viewing method including storing a first origin coordinate frame, storing a first destination coordinate frame, receiving data representing local content, transforming a positioning of local content from the first origin coordinate frame to the first destination coordinate frame, and displaying the local content to a first user after transforming the positioning of the local content from the first origin coordinate frame to the first destination coordinate frame.
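- Transforming the positioning of local content between coordinate frames is typically expressed with 4x4 homogeneous matrices. The sketch below applies one such matrix to a content position; the helper names and the example frames are hypothetical, and a pure translation is used only to keep the arithmetic easy to follow.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Point = Tuple[float, float, float]


def translation(tx: float, ty: float, tz: float) -> Matrix:
    """Homogeneous transform that shifts points by (tx, ty, tz)."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(matrix: Matrix, point: Point) -> Point:
    """Map a point from the origin frame into the destination frame."""
    homogeneous = (point[0], point[1], point[2], 1.0)
    rows = [sum(m * v for m, v in zip(row, homogeneous)) for row in matrix]
    return (rows[0], rows[1], rows[2])


# Hypothetical example: content sits at (1, 1, 0) in the origin frame, and the
# origin frame is offset by 2 units along x within the destination frame.
origin_to_destination = translation(2.0, 0.0, 0.0)
content_in_destination = transform_point(origin_to_destination, (1.0, 1.0, 0.0))
```

- Chaining such matrices (for example local-to-world followed by world-to-head) gives the position at which the display system would render the content.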
- Some embodiments relate to an XR system including a map storing routine to store a first map, being a canonical map, having a plurality of persistent coordinate frames (PCFs), each PCF of the first map having a set of coordinates, a real object detection device positioned to detect locations of real objects, a PCF identification system connected to the real object detection device to detect, based on the locations of the real objects, PCFs of a second map, each PCF of the second map having a set of coordinates, and a localization module connected to the canonical map and the second map and executable to localize the second map to the canonical map by matching a first PCF of the second map to a first PCF of the canonical map and matching a second PCF of the second map to a second PCF of the canonical map.
- the real object detection device is a real object detection camera.
- the XR system further comprises a canonical map incorporator connected to the canonical map and the second map and executable to
- the XR system further comprises an XR device that includes: a head unit comprising: a head-mountable frame, wherein the real object detection device is mounted to the head-mountable frame; a data channel to receive image data of local content; a local content position system connected to the data channel and executable to relate the local content to one PCF of the canonical map; and a display system connected to the local content position system to display the local content.
- the XR system further comprises a local-to-world coordinate transformer that transforms a local coordinate frame of the local content to a world coordinate frame of the second map.
- the XR system further comprises a first world frame determining routine to calculate a first world coordinate frame based on the PCFs of the second map; a first world frame storing instruction to store the world coordinate frame; a head frame determining routine to calculate a head coordinate frame that changes upon movement of the head-mountable frame; a head frame storing instruction to store the first head coordinate frame; and a world-to-head coordinate transformer that transforms the world coordinate frame to the head coordinate frame.
- the head coordinate frame changes relative to the world coordinate frame when the head-mountable frame moves.
- the XR system further comprises at least one sound element that is related to at least one PCF of the second map.
- the first and second maps are created by the XR device.
- the XR system further comprises first and second XR devices.
- Each XR device includes: a head unit comprising: a head-mountable frame, wherein the real object detection device is mounted to the head-mountable frame; a data channel to receive image data of local content; a local content position system connected to the data channel and executable to relate the local content to one PCF of the canonical map; and a display system connected to the local content position system to display the local content.
- the first XR device creates PCFs for the first map and the second XR device creates PCFs for the second map and the localization module forms part of the second XR device.
- the first and second maps are created in first and second sessions respectively.
- the XR system further comprises a server; and a map download system, forming part of the XR device, that downloads the first map over a network from a server.
- the localization module repeatedly attempts to localize the second map to the canonical map.
- the XR system further comprises a map publisher that uploads the second map over the network to the server.
- Some embodiments relate to a viewing method including storing a first map, being a canonical map, having a plurality of PCFs, each PCF of the canonical map having a set of coordinates, detecting locations of real objects, detecting, based on the locations of the real objects, PCFs of a second map, each PCF of the second map having a set of coordinates and localizing the second map to the canonical map by matching a first PCF of the second map to a first PCF of the first map and matching a second PCF of the second map to a second PCF of the canonical map.
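- Once two PCFs of the second map have been matched to two PCFs of the canonical map, a transform between the maps can be estimated. The sketch below does this for a simplified planar case; the source does not state how the transform is computed, so the 2D model, the rigid (no scaling) assumption, and all function names are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Point2 = Tuple[float, float]
Transform = Tuple[float, Point2]  # (rotation angle in radians, translation)


def localize_2d(second_pcfs: Sequence[Point2],
                canonical_pcfs: Sequence[Point2]) -> Transform:
    """Estimate the rigid transform taking second-map PCFs onto canonical PCFs."""
    a1, a2 = second_pcfs
    b1, b2 = canonical_pcfs
    # Rotation: difference between the headings of the two matched segments.
    angle = (math.atan2(b2[1] - b1[1], b2[0] - b1[0])
             - math.atan2(a2[1] - a1[1], a2[0] - a1[0]))
    c, s = math.cos(angle), math.sin(angle)
    # Translation: whatever remains after rotating the first matched PCF.
    tx = b1[0] - (c * a1[0] - s * a1[1])
    ty = b1[1] - (s * a1[0] + c * a1[1])
    return angle, (tx, ty)


def apply_transform(transform: Transform, point: Point2) -> Point2:
    """Express a second-map point in canonical-map coordinates."""
    angle, (tx, ty) = transform
    c, s = math.cos(angle), math.sin(angle)
    return (c * point[0] - s * point[1] + tx,
            s * point[0] + c * point[1] + ty)


# Hypothetical matched PCF coordinates in each map.
second_to_canonical = localize_2d([(0.0, 0.0), (1.0, 0.0)],
                                  [(2.0, 3.0), (2.0, 4.0)])
```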
- an XR system including a server that may have a processor, a computer-readable medium connected to the processor, a plurality of canonical maps on the computer-readable medium, a respective canonical map identifier on the computer-readable medium associated with each respective canonical map, the canonical map identifiers differing from one another to uniquely identify the canonical maps, a position detector on the computer-readable medium and executable by the processor to receive and store a position identifier from an XR device, a first filter on the computer-readable medium and executable by the processor to compare the position identifier with the canonical map identifiers to determine one or more canonical maps that form a first filtered selection, and a map transmitter on the computer-readable medium and executable by the processor to transmit one or more of the canonical maps to the XR device based on the first filtered selection.
- the canonical map identifiers each include longitude and latitude and the position identifier includes longitude and latitude.
- the first filter is a neighboring areas filter that selects at least one matching canonical map covering longitude and latitude that include the longitude and latitude of the position identifier and at least one neighboring map covering longitude and latitude that are adjacent the first matching canonical map.
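- A minimal sketch of a neighboring areas filter, assuming each canonical map identifier carries one latitude/longitude pair and that areas are modelled as cells of a regular grid. The grid model, the cell size, and the function names are assumptions made for illustration; the source does not specify how map coverage or adjacency is represented.

```python
import math
from typing import Dict, List, Tuple

LatLon = Tuple[float, float]


def cell_of(lat: float, lon: float, cell_deg: float) -> Tuple[int, int]:
    """Index of the grid cell (assumed area model) containing a position."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def neighboring_areas_filter(position: LatLon,
                             canonical_maps: Dict[str, LatLon],
                             cell_deg: float = 1.0) -> List[str]:
    """Keep maps in the device's cell and in the eight adjacent cells."""
    row, col = cell_of(position[0], position[1], cell_deg)
    selected = []
    for map_id, (lat, lon) in canonical_maps.items():
        r, c = cell_of(lat, lon, cell_deg)
        if abs(r - row) <= 1 and abs(c - col) <= 1:
            selected.append(map_id)
    return sorted(selected)


# Hypothetical canonical map identifiers with latitude and longitude.
maps = {"A": (40.2, -74.8), "B": (41.7, -74.1), "C": (45.0, -74.5)}
first_selection = neighboring_areas_filter((40.5, -74.5), maps)
```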
- the position identifier includes a WiFi fingerprint.
- the XR system further comprises a second filter, being a WiFi fingerprint filter, on the computer-readable medium and executable by the processor to: determine latitude and longitude based on the WiFi fingerprint; compare latitude and longitude from the WiFi fingerprint filter with latitude and longitude of the canonical maps to determine one or more canonical maps that form a second filtered selection within the first filtered selection, the map transmitter transmitting one or more canonical maps based on the second selection and not canonical maps based on the first selection outside of the second selection.
- the first filter is a WiFi fingerprint filter, on the computer-readable medium and executable by the processor to: determine latitude and longitude based on the WiFi fingerprint; compare latitude and longitude from the WiFi fingerprint filter with latitude and longitude of the canonical maps to determine one or more canonical maps that form the first filtered selection.
- the XR system further comprises a multilayer perception unit on the computer-readable medium and executable by the processor, that receives a plurality of features of an image and converts each feature to a respective string of numbers; a max pool unit on the computer-readable medium and executable by the processor, that combines a maximum value of each string of numbers into a global feature string representing the image, wherein each canonical map has at least one of said global feature strings and the position identifier received from the XR device includes features of an image captured by the XR device that are processed by the multilayer perception unit and the max pool unit to determine a global feature string of the image; and a key frame filter that compares the global feature string of the image to the global feature strings of the canonical maps to determine one or more canonical maps that form a third filtered selection within the second filtered selection, the map transmitter transmitting one or more canonical maps based on the third selection and not canonical maps based on the second selection outside of the third selection.
- the XR system comprises a multilayer perception unit on the computer-readable medium and executable by the processor, that receives a plurality of features of an image and converts each feature to a respective string of numbers; a max pool unit on the computer-readable medium and executable by the processor, that combines a maximum value of each string of numbers into a global feature string representing the image, wherein each canonical map has at least one of said global feature strings and the position identifier received from the XR device includes features of an image captured by the XR device that are processed by the multilayer perception unit and the max pool unit to determine a global feature string of the image; and wherein the first filter is a key frame filter that compares the global feature string of the image to the global feature strings of the canonical maps to determine one or more canonical maps.
- the XR system comprises an XR device that includes: a head unit comprising: a head-mountable frame, wherein the real object detection device is mounted to the head-mountable frame; a data channel to receive image data of local content; a local content position system connected to the data channel and executable to relate the local content to one PCF of the canonical map; and a display system connected to the local content position system to display the local content.
- the XR device includes: a map storing routine to store a first map, being a canonical map, having a plurality of PCFs, each PCF of the first map having a set of coordinates; a real object detection device positioned to detect locations of real objects; a PCF identification system connected to the real object detection device to detect, based on the locations of the real objects, PCFs of a second map, each PCF of the second map having a set of coordinates; and a localization module connected to the canonical map and the second map and executable to localize the second map to the canonical map by matching a first PCF of the second map to a first PCF of the canonical map and matching a second PCF of the second map to a second PCF of the canonical map.
- the real object detection device is a real object detection camera.
- the XR system comprises a canonical map incorporator connected to the canonical map and the second map and executable to incorporate a third PCF of the canonical map into the second map.
- Some embodiments relate to a viewing method including storing a plurality of canonical maps on a computer-readable medium, each canonical map having a respective canonical map identifier associated with the respective canonical map, the canonical map identifiers differing from one another to uniquely identify the canonical maps, receiving and storing, with a processor connected to the computer-readable medium, a position identifier from an XR device, comparing, with the processor, the position identifier with the canonical map identifiers to determine one or more canonical maps that form a first filtered selection, and transmitting, with the processor, a plurality of the canonical maps to the XR device based on the first filtered selection.
- Some embodiments relate to an XR system including a processor, a computer-readable medium connected to the processor, a multilayer perception unit, on the computer-readable medium and executable by the processor, that receives a plurality of features of an image and converts each feature to a respective string of numbers, and a max pool unit, on the computer-readable medium and executable by the processor, that combines a maximum value of each string of numbers into a global feature string representing the image.
- the XR system comprises a plurality of canonical maps on the computer-readable medium, each canonical map having at least one of said global feature strings associated therewith; a position detector on the computer-readable medium and executable by the processor, to receive features of an image captured by an XR device from the XR device that are processed by the multilayer perception unit and the max pool unit to determine a global feature string of the image; a key frame filter that compares the global feature string of the image to the global feature string of the canonical maps to determine one or more canonical maps that form part of a filtered selection; and a map transmitter on the computer-readable medium and executable by the processor to transmit one or more of the canonical maps to the XR device based on the filtered selection.
- the XR system comprises an XR device that includes: a head unit comprising: a head-mountable frame, wherein the real object detection device is mounted to the head-mountable frame; a data channel to receive image data of local content; a local content position system connected to the data channel and executable to relate the local content to one PCF of the canonical map; and a display system connected to the local content position system to display the local content.
- the XR system comprises an XR device that includes: a head unit comprising: a head-mountable frame, wherein the real object detection device is mounted to the head-mountable frame; a data channel to receive image data of local content; a local content position system connected to the data channel and executable to relate the local content to one PCF of the canonical map; and a display system connected to the local content position system to display the local content, wherein the matching is executed by matching said global feature strings of the second map to the said global feature strings of the canonical map.
- Some embodiments relate to a viewing method, including receiving, with a processor, a plurality of features of an image, converting, with the processor, each feature to a respective string of numbers, and combining, with the processor, a maximum value of each string of numbers into a global feature string representing the image.
- Some embodiments relate to a method of operating a computing system to identify one or more environment maps stored in a database to merge with a tracking map computed based on sensor data collected by a device worn by a user, wherein the device received signals of access points to computer networks while computing the tracking map, the method including determining at least one area attribute of the tracking map based on characteristics of communications with the access points, determining a geographic location of the tracking map based on the at least one area attribute, identifying a set of environment maps stored in the database corresponding to the determined geographic location, filtering the set of environment maps based on similarity of one or more identifiers of network access points associated with the tracking map and the environment maps of the set of environment maps, filtering the set of environment maps based on similarity of metrics representing contents of the tracking map and the environment maps of the set of environment maps, and filtering the set of environment maps based on degree of match between a portion of the tracking map and portions of the environment maps of the set of environment maps.
- filtering the set of environment maps based on similarity of the one or more identifiers of the network access points comprises retaining in the set of environment maps environment maps with the highest Jaccard similarity to the at least one area attribute of the tracking map based on the one or more identifiers of network access points.
- filtering the set of environment maps based on similarity of metrics representing content of the tracking map and the environment maps of the set of environment maps comprises retaining in the set of environment maps environment maps with the smallest vector distance between a vector of characteristics of the tracking map and vectors representing environment maps in the set of environment maps.
- the metrics representing contents of the tracking map and the environment maps comprise vectors of values computed from the contents of the maps.
- filtering the set of environment maps based on degree of match between the portion of the tracking map and portions of the environment maps of the set of environment maps comprises computing a volume of a physical world represented by the tracking map that is also represented in an environment map of the set of environment maps; and retaining in the set of environment maps environment maps with larger computed volume than environment maps filtered out of the set.
- the set of environment maps is filtered: first based on the similarity of the one or more identifiers; subsequently based on the similarity of the metrics representing content; and subsequently based on the degree of match between the portion of the tracking map and portions of the environment maps.
- filtering of the set of environment maps based on: the similarity of the one or more identifiers; the similarity of the metrics representing content; and the degree of match between the portion of the tracking map and portions of the environment maps, is performed in an order based on processing required to perform the filtering.
- an environment map is selected based on the filtering of the set of environment maps based on: the similarity of the one or more identifiers; the similarity of the metrics representing content; and the degree of match between the portion of the tracking map and portions of the environment maps, and information is loaded on the user device from the selected environment map.
- an environment map is selected based on the filtering of the set of environment maps based on: the similarity of the one or more identifiers; the similarity of the metrics representing content; and the degree of match between the portion of the tracking map and portions of the environment maps, and the tracking map is merged with the selected environment map.
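The three filters described above can be sketched as a cascade that runs the cheapest test first. This is a minimal illustration, not the claimed implementation: the `MapRecord` type, the `filter_maps` function, the per-stage cut-offs, and the precomputed `overlap_volume` field (a stand-in for the volume of the physical world shared with the tracking map) are all assumptions introduced here.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class MapRecord:
    name: str
    bssids: Set[str]          # identifiers of network access points
    descriptor: List[float]   # vector of values computed from the map contents
    overlap_volume: float     # hypothetical precomputed shared volume


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def keep_best(maps: List[MapRecord], score: Callable[[MapRecord], float],
              k: int, largest: bool) -> List[MapRecord]:
    """Retain the k maps with the largest (or smallest) score."""
    return sorted(maps, key=score, reverse=largest)[:k]


def filter_maps(tracking: MapRecord, candidates: List[MapRecord],
                keep: Tuple[int, int, int] = (3, 2, 1)) -> List[MapRecord]:
    # Stage 1: highest Jaccard similarity of access point identifiers.
    stage1 = keep_best(candidates, lambda m: jaccard(tracking.bssids, m.bssids),
                       keep[0], largest=True)
    # Stage 2: smallest vector distance between content descriptors.
    stage2 = keep_best(stage1, lambda m: math.dist(tracking.descriptor, m.descriptor),
                       keep[1], largest=False)
    # Stage 3: largest shared volume (the most expensive test in practice).
    return keep_best(stage2, lambda m: m.overlap_volume, keep[2], largest=True)


tracking = MapRecord("tracking", {"aa", "bb", "cc"}, [1.0, 0.0], 0.0)
candidates = [
    MapRecord("office", {"aa", "bb", "cc"}, [0.9, 0.1], 12.0),
    MapRecord("lobby", {"aa", "bb", "dd"}, [0.8, 0.1], 20.0),
    MapRecord("lab", {"aa", "ee"}, [0.0, 1.0], 5.0),
    MapRecord("cafe", {"zz"}, [1.0, 0.0], 50.0),
]
selected = filter_maps(tracking, candidates)
```

In this sample the "cafe" map has the largest shared volume but never reaches the third stage, because it shares no access point identifiers with the tracking map. Ordering the stages by cost keeps the expensive comparison limited to a small candidate set.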
- Some embodiments relate to a cloud computing environment for an augmented reality system configured for communication with a plurality of user devices comprising sensors, including a user database storing area identities indicating areas that the plurality of user devices were used in, the area identities comprising parameters of wireless networks detected by the user devices when in use, a map database storing a plurality of environment maps constructed from data supplied by the plurality of user devices and associated metadata, the associated metadata comprising area identities derived from area identities of the plurality of user devices that supplied data from which the maps were constructed, the area identities comprising parameters of wireless networks detected by the user devices that supplied data from which the maps were constructed, non-transitory computer storage media storing computer-executable instructions that, when executed by at least one processor in the cloud computing environment, receives messages from the plurality of user devices comprising parameters of wireless networks detected by the user devices, computes area identifiers for the user devices and updates the user database based on the received parameters and/or the computed area identifiers, and receives requests for environment maps from the pluralit
- the computer-executable instructions are further configured to, when executed by at least one processor in the cloud computing environment, receive a tracking map from a user device requesting environment maps; and filtering a set of environment maps is further based on similarity of metrics representing contents of the tracking map and the environment maps of the set of environment maps.
- the computer-executable instructions are further configured to, when executed by at least one processor in the cloud computing environment, receive a tracking map from a user device requesting environment maps; and filtering a set of environment maps is further based on degree of match between a portion of the tracking map and portions of the environment maps of the set of environment maps.
- the parameters of the wireless networks comprise basic service set identifiers (BSSIDs) of networks to which the user devices are connected.
- filtering the set of environment maps based on similarity of parameters of wireless networks comprises computing a similarity of a plurality of BSSIDs stored in the user database associated with the user device requesting the environment maps to BSSIDs stored in the map database associated with environment maps of the set of environment maps.
- the area identifiers indicate geographic locations by longitude and latitude.
- determining area identifiers comprises accessing the area identifiers from the user database.
- determining area identifiers comprises receiving the area identifiers in the received messages from the plurality of user devices.
- the parameters of the wireless networks comply with protocols comprising Wi-Fi and 5G NR.
- the computer-executable instructions are further configured to, when executed by at least one processor in the cloud computing environment, receive a tracking map from a user device; and filtering the set of environment maps is further based on degree of match between a portion of the tracking map and portions of the environment maps of the set of environment maps.
- the computer-executable instructions are further configured to, when executed by at least one processor in the cloud computing environment: receive a tracking map from a user device and determine area identifiers associated with the tracking map based on the user device supplying the tracking map; select a second set of environment maps from the map database based, at least in part, on the area identifiers associated with the tracking map; and updating the map database based on the received tracking map, wherein the updating comprises merging the received tracking map with one or more environment maps in the second set of environment maps.
- the computer-executable instructions are further configured to, when executed by at least one processor in the cloud computing environment, filter the second set of environment maps based on degree of match between a portion of the received tracking map and portions of the environment maps of the second set of environment maps; and merging the tracking map with one or more environment maps in the second set of environment maps comprises merging the tracking map with one or more environment maps in the filtered second set of environment maps.
- Some embodiments relate to an XR system including a real object detection device to detect a plurality of surfaces of real-world objects, a PCF identification system connected to the real object detection device to generate a map based on the real-world objects, a persistent coordinate frame (PCF) generation system to generate a first PCF based on the map and associate the first PCF with the map, first and second storage mediums on first and second XR devices, respectively, and at least first and second processors of the first and second XR devices, to store the first PCF in first and second storage mediums of the first and second XR devices respectively.
- the XR system comprises a key frame generator, executable by the at least one processor, to transform a plurality of camera images to a plurality of respective key frames; a persistent pose calculator, executable by the at least one processor, to generate a persistent pose by averaging the plurality of key frames; a tracking map and persistent pose transformer, executable by the at least one processor, to transform a tracking map to the persistent pose to determine the persistent pose at an origin relative to the tracking map; a persistent pose and PCF transformer, executable by the at least one processor, to transform the persistent pose to the first PCF to determine the first PCF relative to the persistent pose; a PCF and image data transformer, executable by the at least one processor, to transform the first PCF to image data; and a display device to display the image data to the user relative to the first PCF.
- the detection device is a detection device of the first XR device connected to the first XR device processor.
- the map is a first map on the first XR device and the processor generating the first map is the first XR device processor of the first XR device.
- the processor generating the first PCF is the first XR device processor of the first XR device.
- the processor associating the first PCF with the first map is the first XR device processor of the first XR device.
- the XR system comprises an application, executable by the first XR device processor; and a first PCF tracker, executable by the first XR device processor, and including an on-prompt to switch the first PCF tracker on from the application, wherein the first PCF tracker generates the first PCF only if the first PCF tracker is switched on.
- the first PCF tracker has an off-prompt to switch the first PCF tracker off from the application wherein the first PCF tracker terminates first PCF generation when the first PCF tracker is switched off.
- the XR system comprises a map publisher, executable by the first XR device processor, to transmit the first PCF to a server; a map storing routine, executable by a server processor of the server, to store the first PCF on a storage device of the server; and transmitting, with the server processor of the server, the first PCF to the second XR device; and a map download system, executable by a second XR device processor of the second XR device, to download the first PCF from the server.
- the XR system comprises an application, executable by the second XR device processor; and a second PCF tracker, executable by the second XR device processor, and including an on-prompt to switch the second PCF tracker on from the application, wherein the second PCF tracker generates a second PCF only if the second PCF tracker is switched on.
- the second PCF tracker has an off-prompt to switch the second PCF tracker off from the application wherein the second PCF tracker terminates second PCF generation when the second PCF tracker is switched off.
- the XR system comprises a map publisher, executable by the second XR device processor, to transmit the second PCF to the server.
- the XR system comprises a persistent pose acquirer, executable by the first XR device processor, to download persistent poses from the server; a PCF checker, executable by the first XR device processor, to retrieve PCF's from a first storage device of the first XR device based on the persistent poses; and a coordinate frame calculator, executable by the first XR device processor, to calculate a coordinate frame based on the PCF's retrieved from the first storage device.
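The on-prompt and off-prompt behaviour of a PCF tracker described above can be sketched as a simple gate. The class `PCFTracker`, its method names, and the dictionary used to represent a PCF are hypothetical; the source does not specify an interface.

```python
from typing import List, Optional, Tuple


class PCFTracker:
    """Generates PCFs only while switched on by the application."""

    def __init__(self) -> None:
        self.enabled = False
        self.pcfs: List[dict] = []

    def on_prompt(self) -> None:
        # Called by the application to switch the tracker on.
        self.enabled = True

    def off_prompt(self) -> None:
        # Called by the application; PCF generation stops from here on.
        self.enabled = False

    def maybe_generate(self, map_id: str,
                       position: Tuple[float, float, float]) -> Optional[dict]:
        if not self.enabled:
            return None  # tracker is off: no PCF is generated
        pcf = {"map_id": map_id, "position": tuple(position)}
        self.pcfs.append(pcf)  # associate the PCF with its map
        return pcf


tracker = PCFTracker()
before = tracker.maybe_generate("map-1", (0.0, 0.0, 0.0))   # off, so None
tracker.on_prompt()
during = tracker.maybe_generate("map-1", (1.0, 0.0, 0.0))   # generated
tracker.off_prompt()
after = tracker.maybe_generate("map-1", (2.0, 0.0, 0.0))    # off again, so None
```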
- Some embodiments relate to a viewing method including detecting, with at least one detection device a plurality of surfaces of real-world objects, generating, with at least one processor, a map based on the real-world objects, generating, with at least one processor, a first PCF based on the map, associating, with the at least one processor, the first PCF with the map, and storing, with at least first and second processors of first and second XR devices, the first PCF in first and second storage mediums of the first and second XR devices respectively.
- the viewing method comprises transforming, with the at least one processor, a plurality of camera images to a plurality of respective key frames;
- the detection device is a detection device of the first XR device connected to the first XR device processor.
- the map is a first map on the first XR device and the processor generating the first map is the first XR device processor of the first XR device.
- the processor generating the first PCF is the first XR device processor of the first XR device.
- the processor associating the first PCF with the first map is the first XR device processor of the first XR device.
- the viewing method comprises executing, with the first XR device processor, an application; and switching, with the first XR device processor, a first PCF tracker on with an on-prompt from the application wherein the first PCF tracker generates the first PCF only if the first PCF tracker is switched on.
- the viewing method comprises switching, with the first XR device processor, the first PCF tracker off with an off-prompt from the application wherein the first PCF tracker terminates first PCF generation when the first PCF tracker is switched off.
- the viewing method comprises transmitting, with the first XR device processor, the first PCF to a server; storing, with a server processor of the server, the first PCF on a storage device of the server; and transmitting, with the server processor of the server, the first PCF to the second XR device; and receiving, with a second XR device processor of the second XR device, the first PCF from the server.
- the viewing method comprises executing, with the second XR device processor, an application; and switching, with the second XR device processor, a second PCF tracker on with an on-prompt from the application wherein the second PCF tracker generates a second PCF only if the second PCF tracker is switched on.
- the viewing method comprises switching, with the first XR device processor, the second PCF tracker off with an off-prompt from the application wherein the second PCF tracker terminates second PCF generation when the second PCF tracker is switched off.
- the viewing method comprises uploading, with the second XR device processor, the second PCF to the server.
- the viewing method comprises determining, with the first XR device processor, persistent poses from the server; retrieving, with the first XR device processor, PCF's from a first storage device of the first XR device based on the persistent poses; and calculating, with the first XR device processor, a coordinate frame based on the PCF's retrieved from the first storage device.
- Some embodiments relate to an XR system including a first XR device that may include a first XR device processor, a first XR device storage device connected to the first XR device processor, a set of instructions on the first XR device processor, including a download system, executable by the first XR device processor, to download persistent poses from a server, a PCF retriever, executable by the first XR device processor, to retrieve PCF's from the first storage device of the first XR device based on the persistent poses, and a coordinate frame calculator, executable by the first XR device processor, to calculate a coordinate frame based on the PCF's retrieved from the first storage device.
- Some embodiments relate to a viewing method including downloading, with a first XR device processor of a first XR device, persistent poses from a server, retrieving, with the first XR device processor, PCF's from the first storage device of the first XR device based on the persistent poses, and calculating, with the first XR device processor, a coordinate frame based on the PCF's retrieved from the first storage device.
- an XR device including a server that may include a server processor, a server storage device connected to the server processor, a map storing routine, executable with the server processor, to store the first PCF in association with a map on the server storage device of the server, and a map transmitter, executable with the server processor, to transmit the map and the first PCF to a first XR device.
- a server may include a server processor, a server storage device connected to the server processor, a map storing routine, executable with the server processor, to store the first PCF in association with a map on the server storage device of the server, and a map transmitter, executable with the server processor, to transmit the map and the first PCF to a first XR device.
- Some embodiments relate to a viewing method including storing, with a server processor of the server, a first PCF in association with a map on a server storage device of the server, and transmitting, with the server processor of the server, the map and the first PCF to a first XR device.
- Some embodiments relate to a viewing method including entering, by a processor of an XR device, tracking of head pose by capturing an environment with a capture device on a head-mounted frame secured to a head of a user and determining an orientation of the head-mounted frame, determining, by the processor, whether head pose is lost due to an inability to determine the orientation of the head-mounted frame, and if head pose is lost, then, by the processor, entering pose recovery mode to establish the head pose by determining an orientation of the head-mounted frame.
- pose recovery includes: displaying, by the processor, a message to the user with a suggestion to improve capturing of the environment.
- the suggestion is at least one of increasing light and refining texture.
- the viewing method comprises determining, by the processor, whether recovery has failed; and if recovery has failed, starting, by the processor, a new session including establishing head pose.
- the viewing method comprises displaying, by a processor, a message to the user that a new session will be started.
- the viewing method comprises, if head pose is not lost, entering, by the processor, tracking of head pose.
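The tracking, recovery, and new-session behaviour described above can be sketched as a small state machine. The names `Mode`, `run_session`, and `message_for`, the message wording, and the rule that recovery fails after a fixed number of consecutive frames are assumptions made for illustration; the source does not say how failure of recovery is detected.

```python
from enum import Enum, auto
from typing import Iterable, List, Optional


class Mode(Enum):
    TRACKING = auto()
    RECOVERY = auto()
    NEW_SESSION = auto()


def message_for(mode: Mode) -> Optional[str]:
    """Hypothetical user-facing messages for each mode."""
    if mode is Mode.RECOVERY:
        return "Try increasing the light or looking at a more textured area."
    if mode is Mode.NEW_SESSION:
        return "A new session will be started."
    return None


def run_session(pose_ok: Iterable[bool], max_recovery_frames: int = 3) -> List[Mode]:
    """pose_ok yields True when the frame orientation could be determined."""
    mode = Mode.TRACKING
    failed = 0
    history: List[Mode] = []
    for ok in pose_ok:
        if ok:
            # Head pose is available: (re-)enter tracking.
            mode, failed = Mode.TRACKING, 0
        elif mode is not Mode.RECOVERY:
            # Head pose was just lost: enter pose recovery mode.
            mode, failed = Mode.RECOVERY, 1
        else:
            failed += 1
            if failed > max_recovery_frames:
                # Recovery has failed: start a new session.
                mode, failed = Mode.NEW_SESSION, 0
        history.append(mode)
    return history


short_loss = run_session([True, False, False, True])
long_loss = run_session([False] * 5, max_recovery_frames=3)
```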
- Some embodiments relate to a method of operating a computing system to render a virtual object in a scene comprising one or more physical objects.
- the method includes capturing a plurality of images about the scene from one or more sensors of a first device worn by a user, computing one or more persistent poses based, at least in part, on the plurality of images, and generating a persistent coordinate frame based, at least in part, on the computed one or more persistent poses such that information of the plurality of images can be accessed at a different time by one or more applications running on the first device and/or a second device via the persistent coordinate frame.
- computing the one or more persistent poses based, at least in part, on the plurality of images comprises extracting one or more features from each of the plurality of images, generating a descriptor for each of the one or more features, generating a key frame for each of the plurality of images based, at least in part, on the descriptors, and generating the one or more persistent poses based, at least in part, on the one or more key frames.
- generating the persistent coordinate frame based, at least in part, on the computed one or more persistent poses comprises: generating the persistent coordinate frame when the first device travels a pre-determined distance from a location of a previous persistent coordinate frame.
- the pre-determined distance is between two and twenty meters and is based on both the consumption of computational resources of the device and the placement error of the virtual object.
- the method comprises generating an initial persistent pose when the first device is powered on, and when the first device reaches a perimeter of a circle with the initial persistent pose as a center of the circle and a radius equal to a threshold distance, generating a first persistent pose at a current location of the first device.
- the circle is a first circle.
- the method further comprises, when the device reaches a perimeter of a second circle with the first persistent pose as a center of the circle and a radius equal to the threshold distance, generating a second persistent pose at a current location of the first device.
- the first persistent pose is not generated when the first device finds an existing persistent pose within the threshold distance from the initial persistent pose.
- the first device attaches to the first persistent pose one or more of the plurality of key frames that are within a predetermined distance to the first persistent pose.
- the first persistent pose is not generated unless an application running on the first device requests a persistent pose.
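The circle rule for placing persistent poses can be sketched for a sampled device path. This is a simplified illustration under stated assumptions: positions are 2D points, "reaching the perimeter" is approximated as the first sample whose distance from the most recent persistent pose is at least the threshold, and both the reuse of existing persistent poses and the gating on application requests are omitted. The function name `place_persistent_poses` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def place_persistent_poses(path: List[Point], threshold: float) -> List[Point]:
    """Return the locations at which persistent poses would be generated."""
    if not path:
        return []
    poses = [path[0]]  # initial persistent pose when the device is powered on
    for location in path[1:]:
        # The current circle is centred on the most recent persistent pose.
        if math.dist(location, poses[-1]) >= threshold:
            poses.append(location)  # perimeter reached: generate a new pose here
    return poses


# Device walks 7 m along the x axis, sampled every metre; threshold of 3 m.
walk = [(float(x), 0.0) for x in range(8)]
poses = place_persistent_poses(walk, threshold=3.0)
```

A larger threshold produces fewer persistent poses and so less computation, at the cost of virtual content sitting farther from its nearest anchor, which matches the trade-off the text names for choosing the distance.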
- Some embodiments relate to an electronic system portable by a user.
- the electronic system includes one or more sensors configured to capture images about one or more physical objects in a scene; an application configured to execute computer executable instructions to render virtual content in the scene; and at least one processor configured to execute computer executable instructions to provide image data about the virtual content to the application, wherein the computer executable instructions comprise instructions for: generating a persistent coordinate frame based, at least in part, on the captured images.
- the above-described embodiments of the present disclosure can be implemented in any of numerous ways.
- the embodiments may be implemented using hardware, software or a combination thereof.
- the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
- processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component, including commercially available integrated circuit components known in the art by names such as CPU chips, GPU chips, microprocessor, microcontroller, or co-processor.
- a processor may be implemented in custom circuitry, such as an ASIC, or semicustom circuitry resulting from configuring a programmable logic device.
- a processor may be a portion of a larger circuit or semiconductor device, whether commercially available, semi-custom or custom.
- some commercially available microprocessors have multiple cores such that one or a subset of those cores may constitute a processor.
- a processor may be implemented using circuitry in any suitable format.
- a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.
- a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format. In the embodiment illustrated, the input/output devices are illustrated as physically separate from the computing device. In some embodiments, however, the input and/or output devices may be physically integrated into the same unit as the processor or other elements of the computing device. For example, a keyboard might be implemented as a soft keyboard on a touch screen. In some embodiments, the input/output devices may be entirely disconnected from the computing device, and functionally integrated through a wireless connection.
- Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet.
- networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
- the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
- the disclosure may be embodied as a computer readable storage medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs (CD), optical discs, digital video disks (DVD), magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments of the disclosure discussed above.
- a computer readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form.
- Such a computer readable storage medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present disclosure as discussed above.
- the term "computer-readable storage medium" encompasses only a computer-readable medium that can be considered to be a manufacture (i.e., article of manufacture) or a machine.
- the disclosure may be embodied as a computer readable medium other than a computer-readable storage medium, such as a propagating signal.
- the terms "program" or "software" are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the present disclosure as discussed above. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the present disclosure need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the present disclosure.
- Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices.
- program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
- functionality of the program modules may be combined or distributed as desired in various embodiments.
- data structures may be stored in computer-readable media in any suitable form.
- data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that conveys relationship between the fields.
- any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.
- the disclosure may be embodied as a method, of which an example has been provided.
- the acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862742237P | 2018-10-05 | 2018-10-05 | |
US201962812935P | 2019-03-01 | 2019-03-01 | |
US201962815955P | 2019-03-08 | 2019-03-08 | |
US201962868786P | 2019-06-28 | 2019-06-28 | |
US201962870954P | 2019-07-05 | 2019-07-05 | |
US201962884109P | 2019-08-07 | 2019-08-07 | |
PCT/US2019/054819 WO2020072972A1 (en) | 2018-10-05 | 2019-10-04 | A cross reality system |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3861533A1 (en) | 2021-08-11 |
EP3861533A4 (en) | 2022-12-21 |
Family
ID=70055505
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP19868457.3A Pending EP3861533A4 (en) | 2018-10-05 | 2019-10-04 | A cross reality system |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP3861533A4 (en) |
JP (2) | JP7526169B2 (en) |
CN (1) | CN113544748A (en) |
WO (1) | WO2020072972A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021238145A1 (en) * | 2020-05-26 | 2021-12-02 | 北京市商汤科技开发有限公司 | Generation method and apparatus for ar scene content, display method and apparatus therefor, and storage medium |
US11556961B2 (en) * | 2020-07-23 | 2023-01-17 | At&T Intellectual Property I, L.P. | Techniques for real-time object creation in extended reality environments |
CN112465890A (en) * | 2020-11-24 | 2021-03-09 | 深圳市商汤科技有限公司 | Depth detection method and device, electronic equipment and computer readable storage medium |
US11200754B1 (en) * | 2020-12-22 | 2021-12-14 | Accenture Global Solutions Limited | Extended reality environment generation |
CN113313809A (en) * | 2021-06-03 | 2021-08-27 | 中国建设银行股份有限公司 | Rendering method and device |
WO2023043607A1 (en) * | 2021-09-16 | 2023-03-23 | Chinook Labs Llc | Aligning scanned environments for multi-user communication sessions |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7542034B2 (en) * | 2004-09-23 | 2009-06-02 | Conversion Works, Inc. | System and method for processing video images |
US20080090659A1 (en) * | 2006-10-12 | 2008-04-17 | Maximino Aguilar | Virtual world event notification from a persistent world game server in a logically partitioned game console |
JP4292426B2 (en) * | 2007-05-15 | 2009-07-08 | ソニー株式会社 | Imaging apparatus and imaging data correction method |
GB2506338A (en) * | 2012-07-30 | 2014-04-02 | Sony Comp Entertainment Europe | A method of localisation and mapping |
KR20230173231A (en) | 2013-03-11 | 2023-12-26 | 매직 립, 인코포레이티드 | System and method for augmented and virtual reality |
US10025486B2 (en) * | 2013-03-15 | 2018-07-17 | Elwha Llc | Cross-reality select, drag, and drop for augmented reality systems |
US10262462B2 (en) * | 2014-04-18 | 2019-04-16 | Magic Leap, Inc. | Systems and methods for augmented and virtual reality |
US9196022B2 (en) * | 2014-03-10 | 2015-11-24 | Omnivision Technologies, Inc. | Image transformation and multi-view output systems and methods |
EP3699736B1 (en) | 2014-06-14 | 2023-03-29 | Magic Leap, Inc. | Methods and systems for creating virtual and augmented reality |
US10185775B2 (en) | 2014-12-19 | 2019-01-22 | Qualcomm Technologies, Inc. | Scalable 3D mapping system |
US10217231B2 (en) | 2016-05-31 | 2019-02-26 | Microsoft Technology Licensing, Llc | Systems and methods for utilizing anchor graphs in mixed reality environments |
US10007868B2 (en) * | 2016-09-19 | 2018-06-26 | Adobe Systems Incorporated | Font replacement based on visual similarity |
2019
- 2019-10-04: CN application CN201980080054.4A (publication CN113544748A), status: Pending
- 2019-10-04: EP application EP19868457.3A (publication EP3861533A4), status: Pending
- 2019-10-04: JP application JP2021518528A (publication JP7526169B2), status: Active
- 2019-10-04: WO application PCT/US2019/054819 (publication WO2020072972A1), status: Application Filing
2024
- 2024-05-29: JP application JP2024087120A (publication JP2024103610A), status: Pending
Also Published As
Publication number | Publication date |
---|---|
EP3861533A4 (en) | 2022-12-21 |
WO2020072972A8 (en) | 2021-09-23 |
JP2024103610A (en) | 2024-08-01 |
JP2022509731A (en) | 2022-01-24 |
CN113544748A (en) | 2021-10-22 |
WO2020072972A1 (en) | 2020-04-09 |
JP7526169B2 (en) | 2024-07-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11978159B2 (en) | | Cross reality system |
US11995782B2 (en) | | Cross reality system with localization service |
US11257294B2 (en) | | Cross reality system supporting multiple device types |
US11410395B2 (en) | | Cross reality system with accurate shared maps |
US11551430B2 (en) | | Cross reality system with fast localization |
US20210256766A1 (en) | | Cross reality system for large scale environments |
US20210112427A1 (en) | | Cross reality system with wireless fingerprints |
EP3837674A1 (en) | | A cross reality system |
EP4104001A1 (en) | | Cross reality system with map processing using multi-resolution frame descriptors |
US11694394B2 (en) | | Cross reality system for large scale environment reconstruction |
JP7526169B2 (en) | | Cross Reality System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| 17P | Request for examination filed | Effective date: 20210504 |
| AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| DAV | Request for validation of the european patent (deleted) | |
| DAX | Request for extension of the european patent (deleted) | |
| RIC1 | Information provided on ipc code assigned before grant | Ipc: G06V 20/40 20220101ALI20220817BHEP Ipc: G06V 20/20 20220101ALI20220817BHEP Ipc: G06V 10/46 20220101ALI20220817BHEP Ipc: G06V 10/82 20220101ALI20220817BHEP Ipc: G06K 9/62 20060101ALI20220817BHEP Ipc: G06F 3/04815 20220101ALI20220817BHEP Ipc: G06F 3/0486 20130101ALI20220817BHEP Ipc: G06F 3/0482 20130101ALI20220817BHEP Ipc: G06T 19/00 20110101AFI20220817BHEP |
| A4 | Supplementary search report drawn up and despatched | Effective date: 20221123 |
| RIC1 | Information provided on ipc code assigned before grant | Ipc: G06V 20/40 20220101ALI20221117BHEP Ipc: G06V 20/20 20220101ALI20221117BHEP Ipc: G06V 10/46 20220101ALI20221117BHEP Ipc: G06V 10/82 20220101ALI20221117BHEP Ipc: G06K 9/62 20060101ALI20221117BHEP Ipc: G06F 3/04815 20220101ALI20221117BHEP Ipc: G06F 3/0486 20130101ALI20221117BHEP Ipc: G06F 3/0482 20130101ALI20221117BHEP Ipc: G06T 19/00 20110101AFI20221117BHEP |