WO2021178366A1 - Efficient localization based on multiple feature types - Google Patents

Efficient localization based on multiple feature types

Info

Publication number
WO2021178366A1
WO2021178366A1 PCT/US2021/020403 US2021020403W WO2021178366A1 WO 2021178366 A1 WO2021178366 A1 WO 2021178366A1 US 2021020403 W US2021020403 W US 2021020403W WO 2021178366 A1 WO2021178366 A1 WO 2021178366A1
Authority
WO
WIPO (PCT)
Prior art keywords
map
pose
correspondences
images
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2021/020403
Other languages
English (en)
French (fr)
Inventor
Lipu ZHOU
Ashwin Swaminathan
Frank Thomas STEINBRUECKER
Daniel Esteban KOPPEL
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Magic Leap Inc
Original Assignee
Magic Leap Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Magic Leap Inc
Priority to EP21765469.8A (EP4115329A4, en)
Priority to CN202180018922.3A (CN115349140A, zh)
Priority to JP2022552439A (JP7701932B2, ja)
Publication of WO2021178366A1 (en)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30244 Camera pose

Definitions

  • This application relates generally to a machine vision system, such as a cross reality system.
  • Localization is performed in some machine vision systems to relate the location of a device, equipped with a camera to capture images of a 3D environment, to locations in a map of the 3D environment.
  • a new image captured by the device may be matched to a portion of the map.
  • a spatial transformation between the new image and the matching portion of the map may indicate the “pose” of the device with respect to the map.
  • a form of localization may be performed while creating the map.
  • the location of new images with respect to existing portions of the map may enable those new images to be integrated into the map.
  • New images may be used to extend the map to represent portions of the 3D environment not previously mapped or to update the representation of portions of the 3D environment that were previously mapped.
  • results of localization may be used in various ways in various machine vision systems.
  • locations of goals or obstacles may be specified with respect to the coordinates of the map.
  • Once a robotic device is localized with respect to the map, it may be guided towards the goals along a route that avoids the obstacles.
  • localization may be used in an XR system.
  • computers may control human user interfaces to create a cross reality environment in which some or all of the XR environment, as perceived by the user, is generated by a computer.
  • These XR environments may be virtual reality (VR), augmented reality (AR), and/or mixed reality (MR) environments, in which some or all of an XR environment may be generated by computers.
  • Data generated by a computer may describe, for example, virtual objects that may be rendered in a way that users perceive as part of a physical world such that users can interact with the virtual objects.
  • the user may experience these virtual objects as a result of the data being rendered through a user interface device, such as a head-mounted display device that enables the user to simultaneously see both the virtual content and objects in the physical world.
  • an XR system may build a representation of the physical world around a user of the system.
  • This representation may be constructed by processing images acquired with sensors on wearable devices that form a part of the XR system.
  • the locations of both physical and virtual objects may be expressed with respect to a map to which a user device in the XR system may localize. Localization enables the user devices to render virtual objects so as to take into account the locations of physical objects. It also enables multiple user devices to render virtual content so that their respective users share the same experience of that virtual content in the 3D environment.
  • a conventional approach to localization is to store, in conjunction with a map, collections of feature points derived from images of the 3D environment.
  • Feature points may be selected for inclusion in the map based on how readily identifiable they are and the likelihood that they represent persistent objects, such as corners of rooms or large furniture. Localization entails selecting feature points from new images and identifying matching feature points in the map. The identification is based on finding a transformation that aligns a collection of feature points from a new image with matching feature points in the map.
  • Finding a suitable transformation is computationally intensive and is often performed by selecting a group of feature points in the new image and attempting to compute a transformation that aligns that group of feature points against each of multiple groups of feature points from the map.
  • Attempts to compute a transformation may use a non-linear least squares approach, which may entail computing a Jacobian matrix that is used to iteratively arrive at a transformation. This computation may be repeated for multiple groups of feature points in the map and possibly multiple groups of feature points in one or more new images to arrive at a transformation accepted as providing a suitable match.
  • RANSAC is a process in which the matching process is performed in two stages. In a first stage, a coarse transformation between a new image and a map might be identified based on processing of multiple groups, each with a small number of feature points. The coarse alignment is used as a starting point for computing a more refined transformation that achieves suitable alignment between larger groups of feature points.
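The two-stage matching described above can be illustrated with a deliberately simplified sketch. It is not the method of this application: it assumes a toy 2D rigid alignment (one rotation angle plus a translation) instead of a full camera pose, and the function names `fit_rigid_2d`, `residual`, and `ransac_rigid_2d`, as well as the iteration count and inlier threshold, are hypothetical choices made for illustration.

```python
import math
import random


def fit_rigid_2d(pairs):
    """Least-squares 2D rotation angle and translation mapping each p onto q."""
    n = len(pairs)
    pcx = sum(p[0] for p, _ in pairs) / n
    pcy = sum(p[1] for p, _ in pairs) / n
    qcx = sum(q[0] for _, q in pairs) / n
    qcy = sum(q[1] for _, q in pairs) / n
    s_cross = s_dot = 0.0
    for (px, py), (qx, qy) in pairs:
        ax, ay, bx, by = px - pcx, py - pcy, qx - qcx, qy - qcy
        s_cross += ax * by - ay * bx
        s_dot += ax * bx + ay * by
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the target centroid.
    tx = qcx - (c * pcx - s * pcy)
    ty = qcy - (s * pcx + c * pcy)
    return theta, tx, ty


def residual(model, pair):
    """Distance between the transformed point p and its claimed match q."""
    theta, tx, ty = model
    (px, py), (qx, qy) = pair
    c, s = math.cos(theta), math.sin(theta)
    return math.hypot(c * px - s * py + tx - qx, s * px + c * py + ty - qy)


def ransac_rigid_2d(pairs, iterations=200, threshold=0.05, seed=0):
    rng = random.Random(seed)
    best_inliers = []
    # Stage 1: coarse hypotheses from small random groups (two pairs each).
    for _ in range(iterations):
        model = fit_rigid_2d(rng.sample(pairs, 2))
        inliers = [pr for pr in pairs if residual(model, pr) < threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no consensus set found")
    # Stage 2: refine the transformation on the larger consensus group.
    return fit_rigid_2d(best_inliers), best_inliers
```

Each coarse hypothesis is cheap because it uses only two correspondences, but many hypotheses are tried, which is the repeated cost that the background paragraphs describe.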
  • Some aspects relate to a method of determining a pose of a camera with respect to a map based on one or more images captured with the camera, wherein the pose is represented as a rotation matrix and a translation matrix.
  • the method may comprise developing correspondences between a combination of points and/or lines in the one or more images and the map, transforming the correspondences into a set of three second-order polynomial equations, solving the set of equations for the rotation matrix, and computing the translation matrix based on the rotation matrix.
  • the combination of points and/or lines may be determined dynamically based on characteristics of the one or more images.
  • the method may further comprise refining the pose by minimizing a cost function.
  • the method may further comprise refining the pose by using a damped Newton step.
  • transforming the correspondences into a set of three second-order polynomial equations comprises deriving a set of constraints from the correspondences, forming a closed-form expression of the translation matrix, and using a 3D vector to form a parametrization of the rotation matrix.
  • transforming the correspondences into a set of three second-order polynomial equations further comprises denoising by rank approximation.
  • solving the set of equations for the rotation matrix comprises using a hidden variable method.
  • using a 3D vector to form the parametrization of the rotation matrix comprises using Cayley-Gibbs-Rodriguez (CGR) parametrization.
  • forming a closed-form expression of the translation matrix comprises forming a linear equation system using the set of constraints.
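As a hedged illustration of the CGR parametrization named above, the following sketch builds a rotation matrix from a 3D vector s using the standard Cayley form R = ((1 - sᵀs) I + 2[s]× + 2ssᵀ) / (1 + sᵀs). The function name `cgr_to_rotation` is hypothetical, and the sign convention shown is one common choice that may differ from the convention used in the application.

```python
def cgr_to_rotation(s):
    """Rotation matrix from a CGR 3-vector s = (s1, s2, s3).

    Every numerator below is a second-order polynomial in s1, s2, s3;
    only the common scale factor 1 / (1 + s.s) is not polynomial.
    """
    s1, s2, s3 = s
    k = 1.0 + s1 * s1 + s2 * s2 + s3 * s3
    return [
        [(1 + s1 * s1 - s2 * s2 - s3 * s3) / k,
         2 * (s1 * s2 - s3) / k,
         2 * (s1 * s3 + s2) / k],
        [2 * (s1 * s2 + s3) / k,
         (1 - s1 * s1 + s2 * s2 - s3 * s3) / k,
         2 * (s2 * s3 - s1) / k],
        [2 * (s1 * s3 - s2) / k,
         2 * (s2 * s3 + s1) / k,
         (1 - s1 * s1 - s2 * s2 + s3 * s3) / k],
    ]
```

For example, `cgr_to_rotation((0.0, 0.0, 1.0))` is a 90 degree rotation about the z axis, since s equals the rotation axis scaled by tan(θ/2). A known limitation of this parametrization is that rotations of exactly 180 degrees correspond to an unbounded s and cannot be represented directly.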
  • Some aspects relate to a method of determining the pose of a camera with respect to a map based on one or more images captured with the camera, wherein the pose is represented as a rotation matrix and a translation matrix.
  • the method may comprise developing a plurality of correspondences between a combination of points and/or lines in the one or more images and the map, expressing the correspondences as an over-determined set of equations in a plurality of variables, formatting the over-determined set of equations as a minimal set of equations of meta-variables, in which each of the meta-variables represents a group of the plurality of variables, computing values of the meta-variables based on the minimal set of equations, and computing the pose from the meta-variables.
  • the combination of points and/or lines may be determined dynamically based on characteristics of the one or more images.
  • computing the pose from the meta-variables comprises computing the rotation matrix, and computing the translation matrix based on the rotation matrix.
  • computing the translation matrix based on the rotation matrix comprises computing the translation matrix from an equation that expresses the plurality of correspondences based on the rotation matrix and is linear with respect to the translation matrix.
  • computing the translation matrix comprises deriving a set of constraints from the correspondences, forming a closed-form expression of the translation matrix, and forming a linear equation system using the set of constraints.
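The idea that the translation follows from a linear system once the rotation is known can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulation of the application: image points are assumed to be given in normalized homogeneous coordinates x = (u, v, 1), an image line is assumed to be given as the normal l of the plane through the camera center and that line, and the constraints used are x × (RP + t) = 0 for points and l · (RP + t) = 0 for points P on a map line. All function names are hypothetical.

```python
def cross_matrix(x):
    """Skew-symmetric matrix [x]x such that [x]x v equals the cross product x × v."""
    a, b, c = x
    return [[0.0, -c, b], [c, 0.0, -a], [-b, a, 0.0]]


def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def det3(a):
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def solve3(m, v):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("constraints do not determine the translation")
    return [det3([[v[i] if j == k else m[i][j] for j in range(3)]
                  for i in range(3)]) / d for k in range(3)]


def translation_from_rotation(R, point_corrs, line_corrs):
    """Least-squares translation t for a known rotation R.

    point_corrs: list of (x, P) with x a normalized image point and P a 3D map point.
    line_corrs: list of (l, P) with l an image-line plane normal and P a 3D point
    on the corresponding map line.
    """
    rows, rhs = [], []
    for x, P in point_corrs:
        C = cross_matrix(x)
        b = mat_vec(C, mat_vec(R, P))
        for i in range(3):          # [x]x t = -[x]x R P
            rows.append(C[i])
            rhs.append(-b[i])
    for l, P in line_corrs:         # l . t = -l . (R P)
        RP = mat_vec(R, P)
        rows.append(list(l))
        rhs.append(-sum(l[i] * RP[i] for i in range(3)))
    # Normal equations (A^T A) t = A^T b of the stacked linear system.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    v = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(M, v)
```

Because every constraint is linear in t, points and lines can be mixed freely in the same system, and no iteration is needed for this step.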
  • Some aspects relate to a non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a method.
  • the method may comprise developing correspondences between a combination of points and/or lines in one or more images and a map, transforming the correspondences into a set of three second-order polynomial equations, solving the set of equations for the rotation matrix, and computing the translation matrix based on the rotation matrix.
  • the points and/or lines in the one or more images may be two-dimensional features and corresponding features in the map may be three-dimensional features.
  • Some aspects relate to a non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a method.
  • the method may comprise developing a plurality of correspondences between a combination of points and/or lines in the one or more images and the map, expressing the correspondences as an over-determined set of equations in a plurality of variables, formatting the over-determined set of equations as a minimal set of equations of meta-variables, in which each of the meta-variables represents a group of the plurality of variables, computing values of the meta-variables based on the minimal set of equations, and computing the pose from the meta-variables.
  • Some aspects relate to a portable electronic device comprising: a camera configured to capture one or more images of a 3D environment; and at least one processor configured to execute computer executable instructions.
  • the computer executable instructions may comprise instructions for determining a pose of the camera with respect to a map based on the one or more images, comprising: determining information about a combination of points and/or lines in the one or more images of the 3D environment; sending, to a localization service, the information about the combination of points and/or lines in the one or more images to determine a pose of the camera with respect to the map; and receiving, from the localization service, the pose of the camera with respect to the map represented as a rotation matrix and a translation matrix.
  • the localization service is implemented on the portable electronic device.
  • the localization service is implemented on a server remote from the portable electronic device, wherein the information about the combination of points and/or lines in the one or more images is sent to the localization service over a network.
  • determining the pose of the camera with respect to the map comprises: developing correspondences between the combination of points and/or lines in the one or more images and the map; transforming the correspondences into a set of three second-order polynomial equations; solving the set of equations for the rotation matrix; and computing the translation matrix based on the rotation matrix.
  • the combination of points and/or lines is determined dynamically based on characteristics of the one or more images.
  • determining the pose of the camera with respect to the map further comprises refining the pose by minimizing a cost function.
  • determining the pose of the camera with respect to the map further comprises refining the pose by using a damped Newton step.
  • transforming the correspondences into a set of three second-order polynomial equations comprises: deriving a set of constraints from the correspondences; forming a closed-form expression of the translation matrix; and using a 3D vector to form a parametrization of the rotation matrix.
  • transforming the correspondences into a set of three second-order polynomial equations further comprises denoising by rank approximation.
  • solving the set of equations for the rotation matrix comprises using a hidden variable method.
  • using a 3D vector to form the parametrization of the rotation matrix comprises using Cayley-Gibbs-Rodriguez (CGR) parametrization.
  • forming a closed-form expression of the translation matrix comprises forming a linear equation system using the set of constraints.
  • determining the pose of the camera with respect to the map comprises: developing correspondences between the combination of points and/or lines in the one or more images and the map; expressing the correspondences as an over-determined set of equations in a plurality of variables; formatting the over-determined set of equations as a minimal set of equations of meta-variables, in which each of the meta-variables represents a group of the plurality of variables; computing values of the meta-variables based on the minimal set of equations; and computing the pose from the meta-variables.
  • the combination of points and/or lines is determined dynamically based on characteristics of the one or more images.
  • computing the pose from the meta-variables comprises: computing the rotation matrix; and computing the translation matrix based on the rotation matrix.
  • computing the translation matrix based on the rotation matrix comprises computing the translation matrix from an equation that expresses the plurality of correspondences based on the rotation matrix and is linear with respect to the translation matrix.
  • computing the translation matrix comprises: deriving a set of constraints from the correspondences; forming a closed-form expression of the translation matrix; and forming a linear equation system using the set of constraints.
  • the points and lines in the one or more images are two-dimensional features; and corresponding features in the map are three-dimensional features.
  • Some aspects relate to a method for determining a pose of a camera with respect to a map based on one or more images of a 3D environment captured by the camera, comprising: determining information about a combination of points and/or lines in the one or more images of the 3D environment; sending, to a localization service, the information about the combination of points and/or lines in the one or more images to determine a pose of the camera with respect to the map; and receiving, from the localization service, the pose of the camera with respect to the map represented as a rotation matrix and a translation matrix.
  • Some aspects relate to a non-transitory computer readable medium comprising computer executable instructions for execution by at least one processor, wherein the computer executable instructions comprise instructions for determining a pose of a camera with respect to a map based on one or more images of a 3D environment captured by the camera, comprising: determining information about a combination of points and/or lines in the one or more images of the 3D environment; sending, to a localization service, the information about the combination of points and/or lines in the one or more images to determine a pose of the camera with respect to the map; and receiving, from the localization service, the pose of the camera with respect to the map represented as a rotation matrix and a translation matrix.
  • Figure 1 is a sketch illustrating an example of a simplified augmented reality (AR) scene, according to some embodiments;
  • Figure 2 is a sketch of an exemplary simplified AR scene, showing exemplary use cases of an XR system, according to some embodiments;
  • Figure 3 is a schematic diagram illustrating data flow for a single user in an AR system configured to provide an experience to the user of AR content interacting with a physical world, according to some embodiments;
  • Figure 4 is a schematic diagram illustrating an exemplary AR display system, displaying virtual content for a single user, according to some embodiments
  • Figure 5A is a schematic diagram illustrating a user wearing an AR display system rendering AR content as the user moves through a physical world environment, according to some embodiments;
  • Figure 5B is a schematic diagram illustrating a viewing optics assembly and attendant components, according to some embodiments.
  • Figure 6A is a schematic diagram illustrating an AR system using a world reconstruction system, according to some embodiments.
  • Figure 6B is a schematic diagram illustrating components of an AR system that maintain a model of a passable world, according to some embodiments.
  • Figure 7 is a schematic illustration of a tracking map formed by a device traversing a path through a physical world, according to some embodiments.
  • Figure 8 is a schematic diagram of an example XR system in which any of multiple devices may access a localization service, according to some embodiments;
  • Figure 9 is an example process flow for operation of a portable device as part of an XR system that provides cloud-based localization, according to some embodiments;
  • Figure 10 is a flowchart of an exemplary process for localization in a system configured to compute a pose using features with a mix of feature types, according to some embodiments;
  • Figure 11 is a sketch of an exemplary environment for which point-based localization is likely to fail, according to some embodiments.
  • Figure 12 is an exemplary schematic of 2D-3D point correspondence and 2D-3D line correspondence, according to some embodiments.
  • Figure 13 is a flow chart illustrating a method of efficient localization, according to some embodiments.
  • Figure 14A shows median rotation errors of different PnPL algorithms, according to some embodiments.
  • Figure 14B shows median translation errors of different PnPL algorithms, according to some embodiments.
  • Figure 14C shows mean rotation errors of different PnPL algorithms, according to some embodiments.
  • Figure 14D shows mean translation errors of different PnPL algorithms, according to some embodiments.
  • Figure 15A is a diagram of computational time of different PnPL algorithms, according to some embodiments.
  • Figure 15B is a diagram of computational time of different PnPL algorithms, according to some embodiments.
  • Figure 16A shows the number of instances of errors of a certain range versus the log error of a PnPL solution, according to some embodiments described herein, for a PnP problem compared to a P3P and UPnP solution;
  • Figure 16B shows a box plot of a PnPL solution, according to some embodiments described herein, for a PnP problem compared to a P3P and UPnP solution
  • Figure 16C shows the mean rotational error in radians of a PnPL solution, according to some embodiments described herein, for a PnP problem compared to a P3P and UPnP solution
  • Figure 16D shows the mean positional error in meters of a PnPL solution, according to some embodiments described herein, for a PnP problem compared to a P3P and UPnP solution;
  • Figure 17A shows median rotation errors of different PnL algorithms, according to some embodiments.
  • Figure 17B shows median translation errors of different PnL algorithms, according to some embodiments.
  • Figure 17C shows mean rotation errors of different PnL algorithms, according to some embodiments.
  • Figure 17D shows mean translation errors of different PnL algorithms, according to some embodiments.
  • Figure 18 is a flowchart of an alternative embodiment of an exemplary process for localization in a system configured to compute a pose using features with a mix of feature types;
  • Figure 19 is a schematic of constraints from l_i ↔ L_i, according to some embodiments.
  • Figure 20A is a boxplot figure showing rotation error of hidden variable (HV) polynomial solver compared to other solvers, according to some embodiments;
  • Figure 20B is a boxplot figure showing translation error of hidden variable (HV) polynomial solver compared to other solvers, according to some embodiments;
  • Figure 21A is a figure showing rotation error compared to other solvers, according to some embodiments.
  • Figure 21B is a figure showing translation error compared to other solvers, according to some embodiments;
  • Figure 22A is a plot of rotation error of an embodiment of an algorithm described herein and previous algorithms AlgP3L, RP3L and SRP3L, according to some embodiments;
  • Figure 22B is a box plot of translation error of an embodiment of an algorithm described herein and previous algorithms AlgP3L, RP3L and SRP3L, according to some embodiments;
  • Figure 23A shows a comparison of mean rotational error in degrees between different P3L algorithms, according to some embodiments.
  • Figure 23B shows a comparison of mean translational error in degrees between different P3L algorithms, according to some embodiments.
  • Figure 24A is a plot showing mean rotation errors of different PnL algorithms, according to some embodiments.
  • Figure 24B is a plot showing mean translation errors of different PnL algorithms, according to some embodiments.
  • Figure 24C is a plot showing median rotation errors of different PnL algorithms, according to some embodiments.
  • Figure 24D is a plot showing median translation errors of different PnL algorithms, according to some embodiments.
  • Figure 25A is a plot showing mean rotation errors of different PnL algorithms, according to some embodiments.
  • Figure 25B is a plot showing mean translation errors of different PnL algorithms, according to some embodiments.
  • Figure 25C is a plot showing median rotation errors of different PnL algorithms, according to some embodiments.
  • Figure 25D is a plot showing median translation errors of different PnL algorithms, according to some embodiments.
  • Figure 26A is a plot showing mean rotation errors of different PnL algorithms, according to some embodiments;
  • Figure 26B is a plot showing mean translation errors of different PnL algorithms, according to some embodiments.
  • Figure 26C is a plot showing median rotation errors of different PnL algorithms, according to some embodiments.
  • Figure 26D is a plot showing median translation errors of different PnL algorithms, according to some embodiments.
  • Figure 27A is a plot showing mean rotation errors of different PnL algorithms, according to some embodiments.
  • Figure 27B is a plot showing mean translation errors of different PnL algorithms, according to some embodiments.
  • Figure 27C is a plot showing median rotation errors of different PnL algorithms, according to some embodiments.
  • Figure 27D is a plot showing median translation errors of different PnL algorithms, according to some embodiments.
  • Figure 28 is an exemplary diagram of experimental results of real data, according to some embodiments.
  • Figure 29A is a diagram of computational time of many algorithms, according to some embodiments.
  • Figure 29B is a diagram of computational time of an embodiment of an algorithm described herein as compared to computational times of algorithms involving polynomial system;
  • Figure 29C is a diagram of computational time of an embodiment of an algorithm described herein as compared to computational times of algorithms based on linear transformation;
  • Figure 30 is a flow chart illustrating a method 3000 of efficient localization, according to some embodiments;
  • Figure 31 is a pseudo code implementation of an exemplary algorithm for solving the PnL problem, according to some embodiments.
  • Figure 32 is a block diagram of a machine in the form of a computer that can find application in the present invention system, according to some embodiments.
  • the other image information may act as a map, such that determining pose localizes the device with respect to the map.
  • the map for example may represent a 3D environment.
  • the device containing a camera may be, for example, an XR system, an autonomous vehicle, or a smart phone. Localizing these devices relative to a map enables the devices to perform location-based functions, such as rendering virtual content registered with the physical world, navigation, or rendering content based on location.
  • Pose may be computed by finding correspondences between at least one set of features extracted from an image acquired with the camera and features stored in the map. Correspondences may be based, for example, on a determination that the corresponding features likely represent the same structure in the physical world. Once corresponding features in the image and the map are identified, an attempt is made to compute a transformation that aligns the corresponding features with little or no error.
  • Such a transformation indicates the pose between the image and a frame of reference of the features supplied by the map.
  • the computed pose also indicates the pose of the camera, and by extension the device containing the camera, relative to the frame of reference of the map.
  • Computation of pose conventionally requires large amounts of computational resources, such as processing power or, for a portable device, battery power. Each pair of corresponding features may provide a constraint on the computed pose. However, to account for noise or other errors, sets of features conventionally contain enough features that there are more constraints than there are degrees of freedom in the transformation to be computed. Finding a solution in this case may involve solving an over-determined system of equations. Conventional techniques for solving an over-determined system may employ a least squares approach, a known iterative approach that yields a transformation with a low overall squared error in satisfying all the constraints.
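A small sketch may help make the iterative least squares approach concrete. It assumes a toy problem with a single unknown, a 2D rotation angle, constrained by several point correspondences, so there are more constraints than degrees of freedom. The Gauss-Newton update and the function name `gauss_newton_rotation_2d` are illustrative assumptions and are not taken from the application.

```python
import math


def gauss_newton_rotation_2d(pairs, theta0=0.0, iterations=20, tol=1e-12):
    """Estimate the angle theta minimizing sum |R(theta) p - q|^2 over all pairs.

    Each pair contributes two residuals, so N pairs give 2N constraints
    on one unknown: an over-determined system.
    """
    theta = theta0
    for _ in range(iterations):
        c, s = math.cos(theta), math.sin(theta)
        jtj = 0.0  # J^T J, a scalar because there is one unknown
        jtr = 0.0  # J^T r
        for (px, py), (qx, qy) in pairs:
            # Residual r = R(theta) p - q.
            rx = c * px - s * py - qx
            ry = s * px + c * py - qy
            # Jacobian of the residual with respect to theta.
            jx = -s * px - c * py
            jy = c * px - s * py
            jtj += jx * jx + jy * jy
            jtr += jx * rx + jy * ry
        if jtj == 0.0:
            break
        step = jtr / jtj
        theta -= step
        if abs(step) < tol:
            break
    return theta
```

Even in this one-parameter case, every iteration revisits all correspondences to rebuild the Jacobian terms. A full pose has six degrees of freedom, and the process may be repeated for many candidate feature sets.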
  • the computational burden is compounded by the fact that finding a pose may require attempts to compute a transformation between multiple corresponding sets of features. For example, two structures in the physical world might give rise to two similar sets of features, which may seemingly correspond. However, a computed transformation may have a relatively high error such that those seemingly corresponding features are ignored for computing pose. The computation might be repeated for other sets of seemingly corresponding features until a transformation is computed with relatively low error.
  • a computed transformation may not be accepted as a solution unless there is sufficient similarity of the transformations computed for multiple sets of features, which may be taken from different portions of an image or from different images.
  • the computational burden may be reduced by reformatting the over-determined set of equations into a minimal set of equations, which may be solved with a lower computational burden than solving a least squares problem.
  • the minimal set of equations may be expressed in terms of meta-variables that each represent a group of variables in the over-determined set of equations.
  • the elements of the transformation between feature sets may be computed from the meta-variables.
  • the elements of the transformation may be, for example, a rotation matrix and translation vector.
  • Use of meta-variables may enable the problem to be expressed as a small set of low-order polynomials, which can be solved more efficiently than a full least squares problem. Some or all of the polynomials may have an order as low as two. In some embodiments, there may be as few as three such polynomials, enabling a solution to be arrived at with relatively low computation.
  • Image features used for computing pose are frequently image points, representing a small area of an image.
  • a feature point for example, may be represented as a rectangular region with sides that extend three or four pixels of the image.
  • using points as the features may lead to an adequate solution in many scenarios.
  • using lines as features may be more likely to lead to an adequate solution, which, in comparison to using points as features, may require fewer attempts to compute a suitable transformation.
  • the overall computational burden may be less by using lines as features.
  • a technique as described herein may be used to efficiently compute a pose when lines are used as features.
  • an efficient solution may be more likely to result from using features that are a combination of points and lines.
  • the number or proportion of each type of feature that leads to an efficient solution may vary based on scenario.
  • a system configured to compute a pose based on corresponding sets of features, with an arbitrary mix of feature types, may enable the mix of feature types to be selected so as to increase the likelihood of finding a solution with reduced computational burden from multiple attempts to find a solution.
  • a technique as described herein may be used to efficiently compute a pose when an arbitrary mix of points and lines are used as features.
  • the localization techniques described herein may be used for providing XR scenes.
  • An XR system therefore provides a useful example of how computationally efficient pose computation techniques may be applied in practice.
  • To provide realistic XR experiences to multiple users, an XR system must know the users’ location within the physical world in order to correctly correlate locations of virtual objects to real objects.
  • the inventors have recognized and appreciated methods and apparatus that are computationally efficient and quick in localizing XR devices, even in large and very large scale environments (e.g., a neighborhood, a city, a country, the globe).
  • An XR system may build a map of an environment in which user devices may operate.
  • the environment map may be created from image information collected with sensors that are part of XR devices worn by users of the XR system.
  • Each XR device may develop a local map of its physical environment by integrating information from one or more images collected as the device operates.
  • the coordinate system of the local map is tied to the position and/or orientation of the device when the device first initiates scanning the physical world (e.g. starts a new session).
  • That position and/or orientation of the device may change from session to session as a user interacts with the XR system, whether different sessions are associated with different users, each with their own wearable device with sensors that scan the environment, or the same user who uses the same device at different times.
  • the XR system may implement one or more techniques to enable persistent operation across sessions based on persistent spatial information.
  • the techniques may provide XR scenes for a more computationally efficient and immersive experience for a single user or multiple users by enabling persistent spatial information to be created, stored, and retrieved by any of multiple users of an XR system.
  • persistent spatial information provides a more immersive experience as it enables multiple users to experience virtual content in the same location with respect to the physical world. Even when used by a single user, persistent spatial information may enable quickly recovering and resetting headposes on an XR device in a computationally efficient way.
  • the persistent spatial information may be represented by a persistent map.
  • the persistent map may be stored in a remote storage medium (e.g., a cloud).
  • a wearable device worn by a user, after being turned on, may retrieve from persistent storage an appropriate map that was previously created and stored. That previously stored map may have been based on data about the environment collected with sensors on the user’s wearable device during prior sessions. Retrieving a stored map may enable use of the wearable device without completing a scan of the physical world with the sensors on the wearable device.
  • the device, upon entering a new region of the physical world, may similarly retrieve an appropriate stored map.
  • the stored map may be represented in a canonical form to which a local frame of reference on each XR device may be related.
  • the stored map accessed by one device may have been created and stored by another device and/or may have been constructed by aggregating data about the physical world collected by sensors on multiple wearable devices that were previously present in at least a portion of the physical world represented by the stored map.
  • persistent spatial information may be represented in a way that may be readily shared among users and among the distributed components, including applications.
  • Canonical maps may provide information about the physical world, which may be formatted, for example, as persistent coordinate frames (PCFs).
  • PCF may be defined based on a set of features recognized in the physical world. The features may be selected such that they are likely to be the same from user session to user session of the XR system.
  • PCFs may be sparse, providing less than all of the available information about the physical world, such that they may be efficiently processed and transferred.
  • Techniques for processing persistent spatial information also may include creating dynamic maps based on the local coordinate systems of one or more devices. These maps may be sparse maps, representing the physical world with features, such as points or edges or other structures that appear as lines, detected in images used in forming the maps. Canonical maps may be formed by merging multiple such maps created by one or more XR devices.
  • the relationship between a canonical map and a local map for each device may be determined through a localization process. That localization process may be performed on each XR device based on a set of canonical maps selected and sent to the device. Alternatively or additionally, a localization service may be provided on remote processors, such as might be implemented in the cloud.
  • Two XR devices that have access to the same stored map may both localize with respect to the stored map.
  • a user device may render virtual content that has a location specified by reference to the stored map by translating that location to a frame of reference maintained by the user device.
  • the user device may use this local frame of reference to control the display of the user device to render the virtual content in the specified location.
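  • As a minimal sketch of the translation described above, and not the implementation of any particular embodiment, the following Python fragment assumes that localization has produced a rigid transform from the device's local frame to the stored map's frame. The names `rigid_transform`, `invert_rigid`, `apply`, and `canonical_from_local`, and the restriction to a rotation about a single axis, are illustrative assumptions.

```python
import math

def rigid_transform(yaw_rad, translation):
    """Build a 4x4 rigid transform: rotation about the z axis, then translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def invert_rigid(T):
    """Invert a rigid transform using R^T and -R^T t (no general matrix inverse)."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def apply(T, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    return tuple(sum(T[i][k] * p[k] for k in range(3)) + T[i][3] for i in range(3))

# Hypothetical localization result: maps local (tracking-map) coordinates
# to stored-map (canonical) coordinates.
canonical_from_local = rigid_transform(math.pi / 2, (2.0, 0.0, 0.0))
local_from_canonical = invert_rigid(canonical_from_local)

# Virtual content whose location is specified with respect to the stored map.
content_in_canonical = (2.0, 1.0, 0.5)

# The same location expressed in the frame of reference maintained by the device.
content_in_local = apply(local_from_canonical, content_in_canonical)
```

  • In this sketch the device would render the content at `content_in_local`; a second device with a different localization result would obtain a different local location for the same stored-map location.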
  • the XR system may be configured to create, share, and use persistent spatial information with low usage of computational resources and/or low latency to provide a more immersive user experience.
  • the system may use techniques for efficient comparison of spatial information. Such comparisons may arise, for example, as part of localization in which a collection of features from a local device is matched to a collection of features in a canonical map. Similarly, in map merge, attempts may be made to match one or more collections of features in a tracking map from a device to corresponding features in a canonical map.
  • Techniques as described herein may be used together or separately with many types of devices and for many types of scenes, including wearable or portable devices with limited computational resources that provide an augmented or mixed reality scene.
  • the techniques may be implemented by one or more services that form a portion of an XR system.
  • Figures 1 and 2 illustrate scenes with virtual content displayed in conjunction with a portion of the physical world.
  • an AR system is used as an example of an XR system.
  • Figures 3-6B illustrate an exemplary AR system, including one or more processors, memory, sensors and user interfaces that may operate according to the techniques described herein.
  • an outdoor AR scene 354 is depicted in which a user of an AR technology sees a physical world park-like setting 356, featuring people, trees, buildings in the background, and a concrete platform 358.
  • the user of the AR technology also perceives that they "see" a robot statue 357 standing upon the physical world concrete platform 358, and a cartoon-like avatar character 352 flying by which seems to be a personification of a bumble bee, even though these elements (e.g., the avatar character 352, and the robot statue 357) do not exist in the physical world.
  • Due to the extreme complexity of the human visual perception and nervous system, it is challenging to produce an AR technology that facilitates a comfortable, natural-feeling, rich presentation of virtual image elements amongst other virtual or physical world imagery elements.
  • Such an AR scene may be achieved with a system that builds maps of the physical world based on tracking information, enables users to place AR content in the physical world, determines locations in the maps of the physical world where AR content is placed, preserves the AR scenes such that the placed AR content can be reloaded to display in the physical world during, for example, a different AR experience session, and enables multiple users to share an AR experience.
  • the system may build and update a digital representation of the physical world surfaces around the user. This representation may be used to render virtual content so as to appear fully or partially occluded by physical objects between the user and the rendered location of the virtual content, to place virtual objects, in physics based interactions, and for virtual character path planning and navigation, or for other operations in which information about the physical world is used.
  • FIG. 2 depicts another example of an indoor AR scene 400, showing exemplary use cases of an XR system, according to some embodiments.
  • the exemplary scene 400 is a living room having walls, a bookshelf on one side of a wall, a floor lamp at a corner of the room, a floor, a sofa, and a coffee table on the floor.
  • the user of the AR technology also perceives virtual objects such as images on the wall behind the sofa (i.e. as in 402), birds flying through the door (i.e. as in 404), a deer peeking out from the book shelf, and a decoration in the form of a windmill placed on the coffee table (i.e. as in 406).
  • the AR technology requires information about not only surfaces of the wall but also objects and surfaces in the room such as lamp shape, which are occluding the images to render the virtual objects correctly.
  • the AR technology requires information about all the objects and surfaces around the room for rendering the birds with realistic physics to avoid the objects and surfaces or bounce off them if the birds collide.
  • the AR technology requires information about the surfaces such as the floor or coffee table to compute where to place the deer.
  • the system may identify that is an object separate from the table and may determine that it is movable, whereas corners of shelves or corners of the wall may be determined to be stationary. Such a distinction may be used in determinations as to which portions of the scene are used or updated in each of various operations.
  • the virtual objects may be placed in a previous AR experience session.
  • the AR technology requires that the virtual objects be accurately displayed at the locations where they were previously placed and be realistically visible from different viewpoints.
  • the windmill should be displayed as standing on the books rather than drifting above the table at a different location without the books. Such drifting may happen if the locations of the users of the new AR experience sessions are not accurately localized in the living room.
  • the AR technology requires that corresponding sides of the windmill be displayed.
  • a scene may be presented to the user via a system that includes multiple components, including a user interface that can stimulate one or more user senses, such as sight, sound, and/or touch.
  • the system may include one or more sensors that may measure parameters of the physical portions of the scene, including position and/or motion of the user within the physical portions of the scene.
  • the system may include one or more computing devices, with associated computer hardware, such as memory. These components may be integrated into a single device or may be distributed across multiple interconnected devices. In some embodiments, some or all of these components may be integrated into a wearable device.
  • FIG. 3 is a schematic diagram 300 that depicts an AR system 502 configured to provide an experience of AR contents interacting with a physical world 506, according to some embodiments.
  • the AR system 502 may include a display 508.
  • the display 508 may be worn by the user as part of a headset such that a user may wear the display over their eyes like a pair of goggles or glasses. At least a portion of the display may be transparent such that a user may observe a see-through reality 510.
  • the see-through reality 510 may correspond to portions of the physical world 506 that are within a present viewpoint of the AR system 502, which may correspond to the viewpoint of the user in the case that the user is wearing a headset incorporating both the display and sensors of the AR system to acquire information about the physical world.
  • AR contents may also be presented on the display 508, overlaid on the see-through reality 510.
  • the AR system 502 may include sensors 522 configured to capture information about the physical world 506.
  • the sensors 522 may include one or more depth sensors that output depth maps 512.
  • Each depth map 512 may have multiple pixels, each of which may represent a distance to a surface in the physical world 506 in a particular direction relative to the depth sensor.
  • Raw depth data may come from a depth sensor to create a depth map.
  • Such depth maps may be updated as fast as the depth sensor can form a new image, which may be hundreds or thousands of times per second. However, that data may be noisy and incomplete, and have holes shown as black pixels on the illustrated depth map.
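  • To illustrate how a depth map pixel may relate to a surface location, the following Python sketch converts pixels to 3D points in the sensor frame. It assumes a pinhole model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, assumes each pixel stores a distance along its viewing ray, and treats a non-positive value as a hole. None of these choices is specified above.

```python
import math

def pixel_to_point(u, v, distance, fx, fy, cx, cy):
    """Convert one depth-map pixel to a 3D point in the sensor frame.

    Assumes a pinhole model (hypothetical intrinsics fx, fy, cx, cy) and that
    the stored value is the distance along the pixel's viewing ray.
    Returns None for a hole (no valid measurement).
    """
    if distance <= 0.0:
        return None
    # Direction of the viewing ray through pixel (u, v), before normalization.
    dx, dy, dz = (u - cx) / fx, (v - cy) / fy, 1.0
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    scale = distance / norm
    return (dx * scale, dy * scale, dz * scale)

def depth_map_to_points(depth_map, fx, fy, cx, cy):
    """Convert a depth map (list of rows of distances) to a list of 3D points,
    skipping holes."""
    points = []
    for v, row in enumerate(depth_map):
        for u, distance in enumerate(row):
            point = pixel_to_point(u, v, distance, fx, fy, cx, cy)
            if point is not None:
                points.append(point)
    return points
```

  • Skipping holes rather than interpolating them leaves the gaps to be filled by later integration of data from other viewpoints.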
  • the system may include other sensors, such as image sensors.
  • the image sensors may acquire monocular or stereoscopic information that may be processed to represent the physical world in other ways.
  • the images may be processed in world reconstruction component 516 to create a mesh, representing connected portions of objects in the physical world. Metadata about such objects, including for example, color and surface texture, may similarly be acquired with the sensors and stored as part of the world reconstruction.
  • the system may also acquire information about the headpose of the user with respect to the physical world.
  • a headpose tracking component of the system may be used to compute headposes in real time.
  • the headpose tracking component may represent a headpose of a user in a coordinate frame with six degrees of freedom including, for example, translation in three perpendicular axes (e.g., forward/backward, up/down, left/right) and rotation about the three perpendicular axes (e.g., pitch, yaw, and roll).
  • sensors 522 may include inertial measurement units that may be used to compute and/or determine a headpose 514.
  • a headpose 514 for a depth map may indicate a present viewpoint of a sensor capturing the depth map with six degrees of freedom, for example, but the headpose 514 may be used for other purposes, such as to relate image information to a particular portion of the physical world or to relate the position of the display worn on the user’s head to the physical world.
  • the headpose information may be derived in other ways than from an IMU, such as from analyzing objects in an image captured with a camera worn on the user’s head.
  • the headpose tracking component may compute relative position and orientation of an AR device to physical objects based on visual information captured by cameras and inertial information captured by IMUs.
  • the headpose tracking component may then compute a pose of the AR device by, for example, comparing the computed relative position and orientation of the AR device to the physical objects with features of the physical objects.
  • that comparison may be made by identifying features in images captured with one or more of the sensors 522 that are stable over time such that changes of the position of these features in images captured over time can be associated with a change in headpose of the user.
  • the inventors have realized and appreciated techniques for operating XR systems to provide XR scenes for a more immersive user experience such as estimating headpose at a frequency of 1 kHz, with low usage of computational resources in connection with an XR device, that may be configured with, for example, four video graphics array (VGA) cameras operating at 30 Hz, one inertial measurement unit (IMU) operating at 1 kHz, compute power of a single advanced RISC machine (ARM) core, memory less than 1 GB, and network bandwidth less than 100 Mbps.
  • the XR system may calculate its pose based on the matched visual features.
  • the AR device may construct a map from the features, such as points and/or lines recognized in successive images in a series of image frames captured as a user moves throughout the physical world with the AR device. Though each image frame may be taken from a different pose as the user moves, the system may adjust the orientation of the features of each successive image frame to match the orientation of the initial image frame by matching features of the successive image frames to previously captured image frames. Translations of the successive image frames so that points and lines representing the same features will match corresponding feature points and feature lines from previously collected image frames, can be used to align each successive image frame to match the orientation of previously processed image frames.
  • the frames in the resulting map may have a common orientation established when the first image frame was added to the map.
  • This map, with sets of feature points and lines in a common frame of reference, may be used to determine the user’s pose within the physical world by matching features from current image frames to the map. In some embodiments, this map may be called a tracking map.
  • this map may enable other components of the system, such as world reconstruction component 516, to determine the location of physical objects with respect to the user.
  • the world reconstruction component 516 may receive the depth maps 512 and headposes 514, and any other data from the sensors, and integrate that data into a reconstruction 518.
  • the reconstruction 518 may be more complete and less noisy than the sensor data.
  • the world reconstruction component 516 may update the reconstruction 518 using spatial and temporal averaging of the sensor data from multiple viewpoints over time.
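  • The noise-reducing effect of spatial and temporal averaging can be sketched as follows. This is an illustrative simplification, not the reconstruction method of any embodiment: it assumes a 2D grid of cells that each hold a running mean of a measured surface height, and the class name `AveragedGrid` and its methods are hypothetical.

```python
import math

class AveragedGrid:
    """Grid of cells, each holding a running mean of the observations that
    fall within it (spatial averaging) across all calls over time
    (temporal averaging)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = {}  # (ix, iy) -> (mean height, observation count)

    def _key(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def integrate(self, x, y, height):
        """Fold one noisy height observation into the cell containing (x, y)."""
        key = self._key(x, y)
        mean, count = self.cells.get(key, (0.0, 0))
        count += 1
        # Incremental mean: avoids storing every past observation.
        mean += (height - mean) / count
        self.cells[key] = (mean, count)

    def height_at(self, x, y):
        """Return the averaged height for the cell containing (x, y),
        or None if that cell has never been observed."""
        entry = self.cells.get(self._key(x, y))
        return None if entry is None else entry[0]

grid = AveragedGrid(cell_size=0.5)
# Three noisy observations of the same surface patch from different viewpoints.
grid.integrate(1.1, 2.1, 1.0)
grid.integrate(1.2, 2.2, 1.2)
grid.integrate(1.3, 2.4, 0.8)
```

  • Each additional observation of a cell reduces the influence of any single noisy measurement, and cells missed from one viewpoint may be filled from another.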
  • the reconstruction 518 may include representations of the physical world in one or more data formats including, for example, voxels, meshes, planes, etc.
  • the different formats may represent alternative representations of the same portions of the physical world or may represent different portions of the physical world.
  • portions of the physical world are presented as a global surface; on the right side of the reconstruction 518, portions of the physical world are presented as meshes.
  • the map maintained by headpose component 514 may be sparse relative to other maps that might be maintained of the physical world. Rather than providing information about locations, and possibly other characteristics, of surfaces, the sparse map may indicate locations of interest, which may be reflected as points and/or lines in the images, that arise from visually distinctive structures, such as corners or edges.
  • the map may include image frames as captured by the sensors 522. These frames may be reduced to features, which may represent the locations of interest. In conjunction with each frame, information about a pose of a user from which the frame was acquired may also be stored as part of the map. In some embodiments, every image acquired by the sensor may or may not be stored.
  • the system may process images as they are collected by sensors and select subsets of the image frames for further computation.
  • the selection may be based on one or more criteria that limits the addition of information yet ensures that the map contains useful information.
  • the system may add a new image frame to the map, for example, based on overlap with a prior image frame already added to the map or based on the image frame containing a sufficient number of features determined as likely to represent stationary objects.
  • the selected image frames, or groups of features from selected image frames may serve as key frames for the map, which are used to provide spatial information.
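  • One way the selection criteria described above might be combined is sketched below. The function name, the representation of features as sets of identifiers, the thresholds, and the choice to require both criteria together are all assumptions made for illustration.

```python
def should_add_keyframe(frame_features, keyframe_features, stationary_features,
                        max_overlap=0.7, min_stationary=20):
    """Decide whether an image frame should be added to the map as a key frame.

    frame_features:      set of feature IDs seen in the candidate frame
    keyframe_features:   set of feature IDs in the most recent key frame
    stationary_features: set of feature IDs judged likely to be stationary
    The thresholds are hypothetical tuning parameters.
    """
    if not frame_features:
        return False
    # Require enough features that are likely to represent stationary objects.
    stationary_count = len(frame_features & stationary_features)
    if stationary_count < min_stationary:
        return False
    # With no prior key frame, any sufficiently stable frame is useful.
    if not keyframe_features:
        return True
    # Limit redundant information: skip frames that mostly repeat the prior key frame.
    overlap = len(frame_features & keyframe_features) / len(frame_features)
    return overlap <= max_overlap
```

  • A lower `max_overlap` yields a sparser map with fewer key frames, while a higher value retains more redundant frames.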
  • the amount of data that is processed when constructing maps may be reduced, such as by constructing sparse maps with a collection of mapped points and keyframes and/or dividing the maps into blocks to enable updates by blocks.
  • a mapped point and/or line may be associated with a point and/or line of interest in the environment.
  • a keyframe may include selected information from camera-captured data.
  • U.S. Patent Application No. 16/520,582 (published as application 2020/0034624) describes determining and/or evaluating localization maps and is hereby incorporated herein by reference in its entirety.
  • the AR system 502 may integrate sensor data over time from multiple viewpoints of a physical world.
  • the poses of the sensors may be tracked as a device including the sensors is moved.
  • each of these multiple viewpoints of the physical world may be fused together into a single, combined reconstruction of the physical world, which may serve as an abstract layer for the map and provide spatial information.
  • the reconstruction may be more complete and less noisy than the original sensor data by using spatial and temporal averaging (i.e. averaging data from multiple viewpoints over time), or any other suitable method.
  • a map represents the portion of the physical world in which a user of a single, wearable device is present.
  • headpose associated with frames in the map may be represented as a local headpose, indicating orientation relative to an initial orientation for a single device at the start of a session.
  • the headpose may be tracked relative to an initial headpose when the device was turned on or otherwise operated to scan an environment to build a representation of that environment.
  • the map may include metadata.
  • the metadata may indicate time of capture of the sensor information used to form the map.
  • Metadata alternatively or additionally may indicate location of the sensors at the time of capture of information used to form the map.
  • Location may be expressed directly, such as with information from a GPS chip, or indirectly, such as with a wireless (e.g. Wi-Fi) signature indicating strength of signals received from one or more wireless access points while the sensor data was being collected and/or with identifiers, such as BSSID’s, of wireless access points to which the user device connected while the sensor data was collected.
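  • The metadata described above could be held in a structure such as the following Python sketch. The class name, field names, and units are hypothetical and are chosen only to mirror the kinds of information listed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class MapMetadata:
    """Illustrative container for map metadata (all field names are assumptions)."""
    # Time of capture of the sensor information, in seconds since the epoch.
    capture_time: float
    # Direct location, e.g. (latitude, longitude) from a GPS chip, if available.
    gps: Optional[Tuple[float, float]] = None
    # Indirect location: access point BSSID -> received signal strength (dBm).
    wifi_signature: Dict[str, float] = field(default_factory=dict)
    # BSSIDs of access points the device connected to during capture.
    connected_bssids: Tuple[str, ...] = ()

    def has_location(self) -> bool:
        """True if any direct or indirect location information is present."""
        return (self.gps is not None
                or bool(self.wifi_signature)
                or bool(self.connected_bssids))
```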
  • the reconstruction 518 may be used for AR functions, such as producing a surface representation of the physical world for occlusion processing or physics-based processing. This surface representation may change as the user moves or objects in the physical world change. Aspects of the reconstruction 518 may be used, for example, by a component 520 that produces a changing global surface representation in world coordinates, which may be used by other components.
  • the AR content may be generated based on this information, such as by AR applications 504.
  • An AR application 504 may be a game program, for example, that performs one or more functions based on information about the physical world, such as visual occlusion, physics-based interactions, and environment reasoning.
  • component 520 may be configured to output updates when a representation in a region of interest of the physical world changes. That region of interest, for example, may be set to approximate a portion of the physical world in the vicinity of the user of the system, such as the portion within the view field of the user, or is projected (predicted/determined) to come within the view field of the user.
  • the AR applications 504 may use this information to generate and update the AR contents.
  • the virtual portion of the AR contents may be presented on the display 508 in combination with the see-through reality 510, creating a realistic user experience.
  • an AR experience may be provided to a user through an XR device, which may be a wearable display device, which may be part of a system that may include remote processing and/or remote data storage and/or, in some embodiments, other wearable display devices worn by other users.
  • Figure 4 illustrates an example of system 580 (hereinafter referred to as "system 580") including a single wearable device for simplicity of illustration.
  • the system 580 includes a head mounted display device 562 (hereinafter referred to as "display device 562"), and various mechanical and electronic modules and systems to support the functioning of the display device 562.
  • the display device 562 may be coupled to a frame 564, which is wearable by a display system user or viewer 560 (hereinafter referred to as "user 560") and configured to position the display device 562 in front of the eyes of the user 560.
  • the display device 562 may be a sequential display.
  • the display device 562 may be monocular or binocular.
  • the display device 562 may be an example of the display 508 in Figure 3.
  • a speaker 566 is coupled to the frame 564 and positioned proximate an ear canal of the user 560.
  • another speaker, not shown, is positioned adjacent another ear canal of the user 560 to provide for stereo/shapeable sound control.
  • the display device 562 is operatively coupled, such as by a wired lead or wireless connectivity 568, to a local data processing module 570 which may be mounted in a variety of configurations, such as fixedly attached to the frame 564, fixedly attached to a helmet or hat worn by the user 560, embedded in headphones, or otherwise removably attached to the user 560 (e.g., in a backpack-style configuration, in a belt-coupling style configuration).
  • the local data processing module 570 may include a processor, as well as digital memory, such as non-volatile memory (e.g., flash memory), both of which may be utilized to assist in the processing, caching, and storage of data.
  • the data include data a) captured from sensors (which may be, e.g., operatively coupled to the frame 564) or otherwise attached to the user 560, such as image capture devices (such as cameras), microphones, inertial measurement units, accelerometers, compasses, GPS units, radio devices, and/or gyros; and/or b) acquired and/or processed using remote processing module 572 and/or remote data repository 574, possibly for passage to the display device 562 after such processing or retrieval.
  • the wearable device may communicate with remote components.
  • the local data processing module 570 may be operatively coupled by communication links 576, 578, such as via wired or wireless communication links, to the remote processing module 572 and remote data repository 574, respectively, such that these remote modules 572, 574 are operatively coupled to each other and available as resources to the local data processing module 570.
  • the wearable device, in addition or as an alternative to remote data repository 574, can access cloud-based remote data repositories and/or services.
  • the headpose tracking component described above may be at least partially implemented in the local data processing module 570.
  • the world reconstruction component 516 in Figure 3 may be at least partially implemented in the local data processing module 570.
  • the local data processing module 570 may be configured to execute computer executable instructions to generate the map and/or the physical world representations based at least in part on at least a portion of the data.
  • processing may be distributed across local and remote processors.
  • local processing may be used to construct a map on a user device (e.g. tracking map) based on sensor data collected with sensors on that user’s device.
  • a map may be used by applications on that user’s device.
  • previously created maps e.g., canonical maps
  • a tracking map may be localized to the stored map, such that a correspondence is established between a tracking map, which might be oriented relative to a position of the wearable device at the time a user turned the system on, and the canonical map, which may be oriented relative to one or more persistent features.
  • the persistent map might be loaded on the user device to allow the user device to render virtual content without a delay associated with scanning a location to build a tracking map of the user’s full environment from sensor data acquired during the scan.
  • the user device may access a remote persistent map (e.g., stored on a cloud) without the need to download the persistent map on the user device.
  • spatial information may be communicated from the wearable device to remote services, such as a cloud service that is configured to localize a device to stored maps maintained on the cloud service.
  • the localization processing can take place in the cloud matching the device location to existing maps, such as canonical maps, and return transforms that link virtual content to the wearable device location.
  • the system can avoid communicating maps from remote resources to the wearable device.
  • Other embodiments can be configured for both device-based and cloud-based localization, for example, to enable functionality where network connectivity is not available or a user opts not to enable cloud-based localization.
  • the tracking map may be merged with previously stored maps to extend or improve the quality of those maps.
  • the processing to determine whether a suitable previously created environment map is available and/or to merge a tracking map with one or more stored environment maps may be done in local data processing module 570 or remote processing module 572.
  • the local data processing module 570 may include one or more processors (e.g., a graphics processing unit (GPU)) configured to analyze and process data and/or image information.
  • the local data processing module 570 may include a single processor (e.g., a single-core or multi-core ARM processor), which would limit the local data processing module 570’s compute budget but enable a more miniature device.
  • the world reconstruction component 516 may use a compute budget less than a single Advanced RISC Machine (ARM) core to generate physical world representations in real-time on a non-predefined space such that the remaining compute budget of the single ARM core can be accessed for other uses such as, for example, extracting meshes.
  • the remote data repository 574 may include a digital data storage facility, which may be available through the Internet or other networking configuration in a "cloud" resource configuration.
  • all data is stored and all computations are performed in the local data processing module 570, allowing fully autonomous use from a remote module.
  • all data is stored and all or most computations are performed in the remote data repository 574, allowing for a smaller device.
  • a world reconstruction, for example, may be stored in whole or in part in this repository 574.
  • data may be shared by multiple users of an augmented reality system.
  • user devices may upload their tracking maps to augment a database of environment maps.
  • the tracking map upload occurs at the end of a user session with a wearable device.
  • the tracking map uploads may occur continuously, semi-continuously, intermittently, at a pre-defined time, after a pre-defined period from the previous upload, or when triggered by an event.
  • a tracking map uploaded by any user device may be used to expand or improve a previously stored map, whether based on data from that user device or any other user device.
  • a persistent map downloaded to a user device may be based on data from that user device or any other user device. In this way, high quality environment maps may be readily available to users to improve their experiences with the AR system.
  • persistent map downloads can be limited and/or avoided based on localization executed on remote resources (e.g., in the cloud).
  • a wearable device or other XR device communicates to the cloud service feature information coupled with pose information (e.g., positioning information for the device at the time the features represented in the feature information were sensed).
  • One or more components of the cloud service may match the feature information to respective stored maps (e.g., canonical maps) and generate transforms between a tracking map maintained by the XR device and the coordinate system of the canonical map.
  • Each XR device that has its tracking map localized with respect to the canonical map may accurately render virtual content in locations specified with respect to the canonical map based on its own tracking.
  • the local data processing module 570 is operatively coupled to a battery 582.
  • the battery 582 is a removable power source, such as over-the-counter batteries.
  • the battery 582 is a lithium-ion battery.
  • the battery 582 includes both an internal lithium-ion battery chargeable by the user 560 during non-operation times of the system 580 and removable batteries such that the user 560 may operate the system 580 for longer periods of time without having to be tethered to a power source to charge the lithium-ion battery or having to shut the system 580 off to replace batteries.
  • Figure 5A illustrates a user 530 wearing an AR display system rendering AR content as the user 530 moves through a physical world environment 532 (hereinafter referred to as "environment 532").
  • the information captured by the AR system along the movement path of the user may be processed into one or more tracking maps.
  • the user 530 positions the AR display system at positions 534, and the AR display system records ambient information of a passable world (e.g., a digital representation of the real objects in the physical world that can be stored and updated with changes to the real objects in the physical world) relative to the positions 534. That information may be stored as poses in combination with images, features, directional audio inputs, or other desired data.
  • the positions 534 are aggregated to data inputs 536, for example, as part of a tracking map, and processed at least by a passable world module 538, which may be implemented, for example, by processing on a remote processing module 572 of Figure 4.
  • the passable world module 538 may include the headpose component 514 and the world reconstruction component 516, such that the processed information may indicate the location of objects in the physical world in combination with other information about physical objects used in rendering virtual content.
  • the passable world module 538 determines, at least in part, where and how AR content 540 can be placed in the physical world as determined from the data inputs 536.
  • the AR content is “placed” in the physical world by presenting via the user interface both a representation of the physical world and the AR content, with the AR content rendered as if it were interacting with objects in the physical world and the objects in the physical world presented as if the AR content were, when appropriate, obscuring the user’s view of those objects.
  • the AR content may be placed by appropriately selecting portions of a fixed element 542 (e.g., a table) from a reconstruction (e.g., the reconstruction 518) to determine the shape and position of the AR content 540.
  • the fixed element may be a table and the virtual content may be positioned such that it appears to be on that table.
  • the AR content may be placed within structures in a field of view 544, which may be a present field of view or an estimated future field of view.
  • the AR content may be persisted relative to a model 546 of the physical world (e.g. a mesh).
  • the fixed element 542 serves as a proxy (e.g. digital copy) for any fixed element within the physical world which may be stored in the passable world module 538 so that the user 530 can perceive content on the fixed element 542 without the system having to map to the fixed element 542 each time the user 530 sees it.
  • the fixed element 542 may, therefore, be a mesh model from a previous modeling session or determined from a separate user but nonetheless stored by the passable world module 538 for future reference by a plurality of users. Therefore, the passable world module 538 may recognize the environment 532 from a previously mapped environment and display AR content without a device of the user 530 mapping all or part of the environment 532 first, saving computation process and cycles and avoiding latency of any rendered AR content.
  • the mesh model 546 of the physical world may be created by the AR display system and appropriate surfaces and metrics for interacting and displaying the AR content 540 can be stored by the passable world module 538 for future retrieval by the user 530 or other users without the need to completely or partially recreate the model.
  • the data inputs 536 are inputs such as geolocation, user identification, and current activity to indicate to the passable world module 538 which fixed element 542 of one or more fixed elements are available, which AR content 540 has last been placed on the fixed element 542, and whether to display that same content (such AR content being "persistent" content regardless of user viewing a particular passable world model).
  • the passable world module 538 may update those objects in a model of the physical world from time to time to account for the possibility of changes in the physical world.
  • the model of fixed objects may be updated with a very low frequency.
  • Other objects in the physical world may be moving or otherwise not regarded as fixed (e.g. kitchen chairs).
  • the AR system may update the position of these non-fixed objects with a much higher frequency than is used to update fixed objects.
  • an AR system may draw information from multiple sensors, including one or more image sensors.
  • Figure 5B is a schematic illustration of a viewing optics assembly 548 and attendant components.
  • two eye tracking cameras 550, directed toward user eyes 549, detect metrics of the user eyes 549, such as eye shape, eyelid occlusion, pupil direction and glint on the user eyes 549.
  • one of the sensors may be a depth sensor 551, such as a time of flight sensor, emitting signals to the world and detecting reflections of those signals from nearby objects to determine distance to given objects.
  • a depth sensor may quickly determine whether objects have entered the field of view of the user, either as a result of motion of those objects or a change of pose of the user.
  • information about the position of objects in the field of view of the user may alternatively or additionally be collected with other sensors.
  • Depth information, for example, may be obtained from stereoscopic visual image sensors or plenoptic sensors.
  • world cameras 552 record a greater-than-peripheral view to map and/or otherwise create a model of the environment 532 and detect inputs that may affect AR content.
  • the world camera 552 and/or camera 553 may be grayscale and/or color image sensors, which may output grayscale and/or color image frames at fixed time intervals. Camera 553 may further capture physical world images within a field of view of the user at a specific time. Pixels of a frame-based image sensor may be sampled repetitively even if their values are unchanged.
  • Each of the world cameras 552, the camera 553 and the depth sensor 551 has a respective field of view 554, 555, and 556 to collect data from and record a physical world scene, such as the physical world environment 532 depicted in Figure 5A.
  • Inertial measurement units 557 may determine movement and orientation of the viewing optics assembly 548. In some embodiments, inertial measurement units 557 may provide an output indicating a direction of gravity. In some embodiments, each component is operatively coupled to at least one other component. For example, the depth sensor 551 is operatively coupled to the eye tracking cameras 550 as a confirmation of measured accommodation against actual distance the user eyes 549 are looking at.
  • a viewing optics assembly 548 may include some of the components illustrated in Figure 5B and may include components instead of or in addition to the components illustrated.
  • a viewing optics assembly 548 may include two world cameras 552 instead of four. Alternatively or additionally, cameras 552 and 553 need not capture a visible light image of their full field of view.
  • a viewing optics assembly 548 may include other types of components.
  • a viewing optics assembly 548 may include one or more dynamic vision sensors (DVS), whose pixels may respond asynchronously to relative changes in light intensity exceeding a threshold.
  • a viewing optics assembly 548 may not include the depth sensor 551 based on time of flight information.
  • a viewing optics assembly 548 may include one or more plenoptic cameras, whose pixels may capture light intensity and an angle of the incoming light, from which depth information can be determined.
  • a plenoptic camera may include an image sensor overlaid with a transmissive diffraction mask (TDM).
  • a plenoptic camera may include an image sensor containing angle-sensitive pixels and/or phase-detection auto-focus pixels (PDAF) and/or micro-lens array (MLA). Such a sensor may serve as a source of depth information instead of or in addition to depth sensor 551.
  • a viewing optics assembly 548 may include components with any suitable configuration, which may be set to provide the user with the largest field of view practical for a particular set of components. For example, if a viewing optics assembly 548 has one world camera 552, the world camera may be placed in a center region of the viewing optics assembly instead of at a side.
  • Information from the sensors in viewing optics assembly 548 may be coupled to one or more processors in the system.
  • the processors may generate data that may be rendered so as to cause the user to perceive virtual content interacting with objects in the physical world. That rendering may be implemented in any suitable way, including generating image data that depicts both physical and virtual objects.
  • physical and virtual content may be depicted in one scene by modulating the opacity of a display device that a user looks through at the physical world.
  • the opacity may be controlled so as to create the appearance of the virtual object and also to block the user from seeing objects in the physical world that are occluded by the virtual objects.
  • the image data may only include virtual content that may be modified such that the virtual content is perceived by a user as realistically interacting with the physical world (e.g. clip content to account for occlusions), when viewed through the user interface.
  • the location on the viewing optics assembly 548 at which content is displayed to create the impression of an object at a particular location may depend on the physics of the viewing optics assembly. Additionally, the pose of the user’s head with respect to the physical world and the direction in which the user’s eyes are looking may impact where in the physical world content displayed at a particular location on the viewing optics assembly will appear. Sensors as described above may collect this information, and/or supply information from which this information may be calculated, such that a processor receiving sensor inputs may compute where objects should be rendered on the viewing optics assembly 548 to create a desired appearance for the user.
  • a model of the physical world may be used so that characteristics of the virtual objects, which can be impacted by physical objects, including the shape, position, motion, and visibility of the virtual object, can be correctly computed.
  • the model may include the reconstruction of a physical world, for example, the reconstruction 518.
  • That model may be created from data collected from sensors on a wearable device of the user. Though, in some embodiments, the model may be created from data collected by multiple users, which may be aggregated in a computing device remote from all of the users (and which may be “in the cloud”).
  • The model may be created, at least in part, by a world reconstruction system such as, for example, the world reconstruction component 516 of Figure 3 depicted in more detail in Figure 6A.
  • the world reconstruction component 516 may include a perception module 660 that may generate, update, and store representations for a portion of the physical world. In some embodiments, the perception module 660 may represent the portion of the physical world within a reconstruction range of the sensors as multiple voxels.
  • Each voxel may correspond to a 3D cube of a predetermined volume in the physical world, and include surface information, indicating whether there is a surface in the volume represented by the voxel.
  • Voxels may be assigned values indicating whether their corresponding volumes have been determined to include surfaces of physical objects, have been determined to be empty, or have not yet been measured with a sensor, so that their value is unknown. It should be appreciated that values indicating that voxels are empty or unknown need not be explicitly stored, as the values of voxels may be stored in computer memory in any suitable way, including storing no information for voxels that are determined to be empty or unknown.
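  • As a minimal sketch of the storage option described above, the following Python keeps an entry only for voxels found to contain a surface, so that empty and unknown voxels consume no memory. The class name, method names, and the default voxel size are hypothetical and chosen for illustration only.

```python
import math


class SparseVoxelGrid:
    """Hypothetical sparse store: only voxels containing a surface are kept."""

    def __init__(self, voxel_size_m=0.05):
        self.voxel_size_m = voxel_size_m
        self._surface_voxels = {}  # (i, j, k) index -> surface information

    def index_of(self, x, y, z):
        # Map a physical-world position to the integer index of its voxel.
        s = self.voxel_size_m
        return (math.floor(x / s), math.floor(y / s), math.floor(z / s))

    def mark_surface(self, x, y, z, info=True):
        self._surface_voxels[self.index_of(x, y, z)] = info

    def clear(self, x, y, z):
        # Empty or unknown voxels are represented by the absence of an entry.
        self._surface_voxels.pop(self.index_of(x, y, z), None)

    def has_surface(self, x, y, z):
        return self.index_of(x, y, z) in self._surface_voxels

    def stored_count(self):
        return len(self._surface_voxels)
```

  • A plain dictionary is used here because lookups by voxel index are constant time and absent keys cost nothing; other layouts, such as hashed blocks or octrees, would serve the same purpose.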
  • the perception module 660 may identify and output indications of changes in a region around a user of an AR system. Indications of such changes may trigger updates to volumetric data stored as part of the persisted world, or trigger other functions, such as triggering components 604 that generate AR content to update the AR content.
  • the perception module 660 may identify changes based on a signed distance function (SDF) model.
  • the perception module 660 may be configured to receive sensor data such as, for example, depth maps 660a and headposes 660b, and then fuse the sensor data into an SDF model 660c.
  • Depth maps 660a may provide SDF information directly, and images may be processed to arrive at SDF information.
  • the SDF information represents distance from the sensors used to capture that information. As those sensors may be part of a wearable unit, the SDF information may represent the physical world from the perspective of the wearable unit and therefore the perspective of the user.
  • the headposes 660b may enable the SDF information to be related to a voxel in the physical world.
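  • The fusion of depth measurements into an SDF model can be illustrated with a simplified sketch. The code below assumes a truncated SDF with a running weighted average per voxel, which is one common approach and is not stated to be the method of the perception module 660. The sensor position is assumed to come from a headpose, the voxel is assumed to lie on the ray along which the depth was measured, and the truncation distance and function names are hypothetical.

```python
import math

TRUNCATION_M = 0.1  # hypothetical truncation distance, in meters


def projective_sdf(sensor_pos, voxel_center, measured_depth_m):
    """Signed distance of a voxel along a measurement ray.

    Positive when the voxel is in front of the measured surface,
    negative when it is behind the surface.
    """
    return measured_depth_m - math.dist(sensor_pos, voxel_center)


def fuse(tsdf, weight, sdf_obs, obs_weight=1.0):
    """Blend a new observation into a voxel's stored (tsdf, weight) pair."""
    clipped = max(-TRUNCATION_M, min(TRUNCATION_M, sdf_obs))
    new_weight = weight + obs_weight
    new_tsdf = (tsdf * weight + clipped * obs_weight) / new_weight
    return new_tsdf, new_weight
```

  • Averaging repeated observations in this way reduces the effect of noise in any single depth map, at the cost of reacting more slowly to real changes in the scene.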
  • the perception module 660 may generate, update, and store representations for the portion of the physical world that is within a perception range.
  • the perception range may be determined based, at least in part, on a sensor’s reconstruction range, which may be determined based, at least in part, on the limits of a sensor’s observation range.
  • an active depth sensor that operates using active IR pulses may operate reliably over a range of distances, creating the observation range of the sensor, which may be from a few centimeters or tens of centimeters to a few meters.
  • the world reconstruction component 516 may include additional modules that may interact with the perception module 660.
  • a persisted world module 662 may receive representations for the physical world based on data acquired by the perception module 660.
  • the persisted world module 662 also may include various formats of representations of the physical world.
  • the module may include volumetric information 662a.
  • volumetric metadata 662b such as voxels may be stored as well as meshes 662c and planes 662d.
  • other information such as depth maps could be saved.
  • representations of the physical world may provide relatively dense information about the physical world in comparison to sparse maps, such as a tracking map based on feature points and/or lines as described above.
  • the perception module 660 may include modules that generate representations for the physical world in various formats including, for example, meshes 660d, planes and semantics 660e.
  • the representations for the physical world may be stored across local and remote storage mediums.
  • the representations for the physical world may be described in different coordinate frames depending on, for example, the location of the storage medium.
  • a representation for the physical world stored in the device may be described in a coordinate frame local to the device.
  • the representation for the physical world may have a counterpart stored in a cloud.
  • the counterpart in the cloud may be described in a coordinate frame shared by all devices in an XR system.
  • these modules may generate representations based on data within the perception range of one or more sensors at the time the representation is generated as well as data captured at prior times and information in the persisted world module 662.
  • these components may operate on depth information captured with a depth sensor.
  • the AR system may include vision sensors and may generate such representations by analyzing monocular or binocular vision information.
  • these modules may operate on regions of the physical world. Those modules may be triggered to update a subregion of the physical world, when the perception module 660 detects a change in the physical world in that subregion. Such a change, for example, may be detected by detecting a new surface in the SDF model 660c or other criteria, such as changing the value of a sufficient number of voxels representing the subregion.
  • the world reconstruction component 516 may include components 664 that may receive representations of the physical world from the perception module 660.
  • Components 664 may include visual occlusion 664a, physics-based interactions 664b, and/or environment reasoning 664c.
  • Information about the physical world may be pulled by these components according to, for example, a use request from an application.
  • information may be pushed to the use components, such as via an indication of a change in a pre-identified region or a change of the physical world representation within the perception range.
  • the components 664 may include, for example, game programs and other components that perform processing for visual occlusion, physics-based interactions, and environment reasoning.
  • the perception module 660 may send representations for the physical world in one or more formats. For example, when the component 664 indicates that the use is for visual occlusion or physics-based interactions, the perception module 660 may send a representation of surfaces. When the component 664 indicates that the use is for environmental reasoning, the perception module 660 may send meshes, planes and semantics of the physical world.
  • the perception module 660 may include components that format information to provide the component 664.
  • An example of such a component may be raycasting component 660f.
  • a use component e.g., component 664
  • Raycasting component 660f may select from one or more representations of the physical world data within a field of view from that point of view.
  • components of a passable world model may be distributed, with some portions executing locally on an XR device and some portions executing remotely, such as on a network connected server, or otherwise in the cloud.
  • the allocation of the processing and storage of information between the local XR device and the cloud may impact functionality and user experience of an XR system. For example, reducing processing on a local device by allocating processing to the cloud may enable longer battery life and reduce heat generated on the local device. But, allocating too much processing to the cloud may create undesirable latency that causes an unacceptable user experience.
  • FIG. 6B depicts a distributed component architecture 600 configured for spatial computing, according to some embodiments.
  • the distributed component architecture 600 may include a passable world component 602 (e.g., PW 538 in Figure 5A), a Lumin OS 604, API’s 606, SDK 608, and Application 610.
  • the Lumin OS 604 may include a Linux-based kernel with custom drivers compatible with an XR device.
  • the API’s 606 may include application programming interfaces that grant XR applications (e.g., Applications 610) access to the spatial computing features of an XR device.
  • the SDK 608 may include a software development kit that allows the creation of XR applications.
  • One or more components in the architecture 600 may create and maintain a model of a passable world.
  • sensor data is collected on a local device. Processing of that sensor data may be performed in part locally on the XR device and partially in the cloud.
  • PW 538 may include environment maps created based, at least in part, on data captured by AR devices worn by multiple users. During sessions of an AR experience, individual AR devices (such as wearable devices described above in connection with Figure 4) may create tracking maps, which are one type of map.
  • the device may include components that construct both sparse maps and dense maps.
  • a tracking map may serve as a sparse map.
  • the dense map may include surface information, which may be represented by a mesh or depth information. Alternatively or additionally, a dense map may include higher level information derived from surface or depth information, such as the location and/or characteristics of planes and/or other objects.
  • the sparse map and/or dense map may be persisted for re-use by the same device and/or sharing with other devices. Such persistence may be achieved by storing information in the cloud.
  • the AR device may send the tracking map to a cloud to, for example, merge with environment maps selected from persisted maps previously stored in the cloud.
  • the selected persisted maps may be sent from the cloud to the AR device for merging.
  • the persisted maps may be oriented with respect to one or more persistent coordinate frames.
  • Such maps may serve as canonical maps, as they can be used by any of multiple devices.
  • a model of a passable world may comprise or be created from one or more canonical maps. Devices, even though they perform some operations based on a coordinate frame local to the device, may nonetheless use the canonical map by determining a transformation between their coordinate frame local to the device and the canonical map.
  • a canonical map may originate as a tracking map (TM).
  • the tracking map, for example, may be persisted such that the frame of reference of the tracking map becomes a persisted coordinate frame. Thereafter, devices that access the canonical map may, once they have determined a transformation between their local coordinate system and a coordinate system of the canonical map, use the information in the canonical map to determine locations of objects represented in the canonical map in the physical world around the device.
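  • To illustrate how a transformation between a local coordinate system and a canonical map might be used, the sketch below represents the transformation as a 3x3 rotation R and a translation t that map local coordinates to canonical coordinates, and inverts it to bring a canonical-map location into the device frame. The representation and the function names are assumptions made for illustration.

```python
def apply_transform(R, t, p):
    """Return R * p + t for a 3x3 rotation R, translation t and point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def invert_transform(R, t):
    """Invert a rigid transform: the inverse rotation is the transpose of R."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    t_inv = tuple(-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3))
    return Rt, t_inv


# Hypothetical local-to-canonical transform: 90 degree turn about z, then a shift.
R_local_to_canonical = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
t_local_to_canonical = (1, 2, 0)

# A location specified in the canonical map, expressed in the device frame.
R_inv, t_inv = invert_transform(R_local_to_canonical, t_local_to_canonical)
content_in_local = apply_transform(R_inv, t_inv, (1, 3, 0))
```

  • With such a transform in hand, the device can continue to track in its own frame and convert canonical locations only when they are needed.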
  • FIG. 7 depicts an exemplary tracking map 700, according to some embodiments.
  • the tracking map represents features of interest as points.
  • lines may be used instead of or in addition to points.
  • the tracking map 700 may provide a floor plan 706 of physical objects in a corresponding physical world, represented by points 702.
  • a map point 702 may represent a feature of a physical object that may include multiple features. For example, each corner of a table may be a feature that is represented by a point on a map.
  • the features may be derived from processing images, such as may be acquired with the sensors of a wearable device in an augmented reality system.
  • the features may be derived by processing an image frame output by a sensor to identify features based on large gradients in the image or other suitable criteria. Further processing may limit the number of features in each frame. For example, processing may select features that likely represent persistent objects. One or more heuristics may be applied for this selection.
  • the tracking map 700 may include data on points 702 collected by a device.
  • a pose may be stored.
  • the pose may represent the orientation from which the image frame was captured, such that the feature points within each image frame may be spatially correlated to the tracking map.
  • the pose may be determined by positioning information, such as may be derived from the sensors, such as an IMU sensor, on the wearable device.
  • the pose may be determined by matching a subset of features in the image frame to features already in the tracking map. A transformation between matching subsets of features may be computed, which indicates the relative pose between the image frame and the tracking map.
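  • Computing a transformation between matched feature subsets can be sketched in a reduced setting. The example below is a two-dimensional simplification that assumes the correspondences are already known and fits a least-squares rigid transform (rotation angle and translation) in closed form. This is a standard Procrustes-style alignment substituted for illustration; the specification does not state that this particular method is used, and the function name is hypothetical.

```python
import math


def align_2d(frame_pts, map_pts):
    """Least-squares rigid fit so that rotate(frame_pt, theta) + t ~= map_pt."""
    n = len(frame_pts)
    cax = sum(p[0] for p in frame_pts) / n
    cay = sum(p[1] for p in frame_pts) / n
    cbx = sum(q[0] for q in map_pts) / n
    cby = sum(q[1] for q in map_pts) / n
    cross = dot = 0.0
    for (ax, ay), (bx, by) in zip(frame_pts, map_pts):
        ax, ay, bx, by = ax - cax, ay - cay, bx - cbx, by - cby
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation moves the rotated frame centroid onto the map centroid.
    tx = cbx - (c * cax - s * cay)
    ty = cby - (s * cax + c * cay)
    return theta, (tx, ty)
```

  • A practical system would work in three dimensions and would typically reject mismatched correspondences before fitting.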
  • Not all of the feature points and image frames collected by a device may be retained as part of the tracking map, as much of the information collected with the sensors is likely to be redundant.
  • a relatively small subset of features from an image frame may be processed. Those features may be distinctive, such as may result from a sharp corner or edge.
  • features from only certain frames may be added to the map. Those frames may be selected based on one or more criteria, such as degree of overlap with image frames already in the map, the number of new features they contain or a quality metric for the features in the frame.
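  • The frame selection criteria named above can be combined in many ways. The following sketch shows one hypothetical rule in which a frame is added when its features are of sufficient quality and it either overlaps little with frames already in the map or contributes enough new features. All thresholds and parameter names are assumptions for illustration.

```python
def should_add_frame(overlap_ratio, new_feature_count, mean_feature_quality,
                     max_overlap=0.8, min_new_features=20, min_quality=0.5):
    """Decide whether features from an image frame are added to the map.

    overlap_ratio: fraction of the frame already covered by frames in the map.
    new_feature_count: number of features not yet present in the map.
    mean_feature_quality: quality metric for the frame's features, 0 to 1.
    """
    if mean_feature_quality < min_quality:
        return False  # low-quality frames are never added
    return overlap_ratio < max_overlap or new_feature_count >= min_new_features
```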
  • Image frames not added to the tracking map may be discarded or may be used to revise the location of features.
  • data from multiple image frames, represented as a set of features, may be retained, but only a subset of those frames may be designated as key frames, which are used for further processing.
  • the key frames may be processed to produce keyrigs 704.
  • the key frames may be processed to produce three dimensional sets of feature points and saved as keyrigs 704. Such processing may entail, for example, comparing image frames derived simultaneously from two cameras to stereoscopically determine the 3D position of feature points. Metadata may be associated with these keyframes and/or keyrigs, such as poses. Keyrigs may subsequently be used when localizing a device to the map based on a newly acquired image from the device.
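  • As an illustration of stereoscopically determining the 3D position of a feature point, the sketch below assumes a rectified pinhole stereo pair, so that a feature appears on the same image row in both cameras and depth follows from the horizontal disparity. The function name and the camera parameters are hypothetical.

```python
def triangulate_rectified(u_left, v, u_right, focal_px, baseline_m, cx, cy):
    """Return (x, y, z) in the left camera frame for a matched feature.

    u_left, u_right: column of the feature in the left and right images.
    v: shared row of the feature; cx, cy: principal point in pixels.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity  # depth from similar triangles
    x = (u_left - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)
```

  • Because depth is inversely proportional to disparity, distant features with small disparities are located less precisely than nearby ones.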
  • Environment maps may have any of multiple formats depending on, for example, the storage locations of an environment map including, for example, local storage of AR devices and remote storage.
  • a map in remote storage may have higher resolution than a map in local storage on a wearable device where memory is limited.
  • the map may be down sampled or otherwise converted to an appropriate format, such as by reducing the number of poses per area of the physical world stored in the map and/or the number of feature points stored for each pose.
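  • A down sampling step of the kind described above might be sketched as follows. The example assumes each pose is a dictionary with a two-dimensional position and a list of features already ordered by priority, and it limits both the number of poses per grid cell and the number of features per pose. The data layout, function name, and limits are hypothetical.

```python
import math
from collections import defaultdict


def downsample_map(poses, cell_size_m, max_poses_per_cell, max_features_per_pose):
    """Return a reduced copy of a map given as a list of pose dictionaries.

    Each pose has "position" (x, y) in meters and "features", a list assumed
    to be ordered from most to least important.
    """
    kept_per_cell = defaultdict(int)
    reduced = []
    for pose in poses:
        x, y = pose["position"]
        cell = (math.floor(x / cell_size_m), math.floor(y / cell_size_m))
        if kept_per_cell[cell] >= max_poses_per_cell:
            continue  # this area of the physical world already has enough poses
        kept_per_cell[cell] += 1
        reduced.append({
            "position": pose["position"],
            "features": pose["features"][:max_features_per_pose],
        })
    return reduced
```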
  • a slice or portion of a high resolution map from remote storage may be sent to local storage, where the slice or portion is not down sampled.
  • a database of environment maps may be updated as new tracking maps are created.
  • updating may include efficiently selecting one or more environment maps stored in the database relevant to the new tracking map.
  • the selected one or more environment maps may be ranked by relevance and one or more of the highest ranking maps may be selected for processing to merge higher ranked selected environment maps with the new tracking map to create one or more updated environment maps.
  • if a new tracking map represents a portion of the physical world for which there is no preexisting environment map to update, that tracking map may be stored in the database as a new environment map.
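  • The update flow above (rank stored maps by relevance, merge the highest ranked with the new tracking map, or store the tracking map as a new environment map when nothing is relevant) can be sketched as below. The relevance and merge functions are supplied by the caller, and the function name, threshold, and defaults are hypothetical.

```python
def update_database(database, tracking_map, relevance, merge,
                    top_k=1, min_relevance=0.5):
    """Merge a tracking map into the most relevant stored maps, in place.

    relevance(stored_map, tracking_map) returns a score; higher is better.
    merge(stored_map, tracking_map) returns the updated environment map.
    """
    scored = [(relevance(m, tracking_map), i) for i, m in enumerate(database)]
    ranked = sorted((s for s in scored if s[0] >= min_relevance), reverse=True)
    selected = ranked[:top_k]
    if not selected:
        # No preexisting environment map covers this part of the world.
        database.append(tracking_map)
        return database
    for _, index in selected:
        database[index] = merge(database[index], tracking_map)
    return database
```

  • In the usage below, maps are modeled as sets of area identifiers purely for demonstration: relevance is the fraction of the tracking map's areas already covered, and merging is a set union.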
  • Various embodiments may utilize remote resources to facilitate persistent and consistent cross reality experiences between individual and/or groups of users.
  • Benefits of operation of an XR device with canonical maps as described herein can be achieved without downloading a set of canonical maps.
  • the benefit for example, may be achieved by sending feature and pose information to a remote service that maintains a set of canonical maps.
  • a device seeking to use a canonical map to position virtual content in locations specified relative to the canonical map may receive from the remote service one or more transformations between the features and the canonical maps.
  • spatial information is captured by an XR device and communicated to a remote service, such as a cloud based service, which uses the spatial information to localize the XR device to a canonical map used by applications or other components of an XR system to specify the location of virtual content with respect to the physical world. Once localized, transforms that link a tracking map maintained by the device to the canonical map can be communicated to the device.
  • a camera and/or a portable electronic device comprising a camera may be configured to capture and/or determine information about features (e.g. a combination of points and/or lines) and send the information to a remote service, such as a cloud based device.
  • the remote service may use the information to determine a pose of the camera.
  • the pose of the camera may be determined, for example, using the methods and techniques described herein.
  • the pose may include a rotation matrix and/or a translation matrix.
  • the pose of the camera may be represented with respect to any of the maps described herein.
  • the transforms may be used, in conjunction with the tracking map, to determine a position in which to render virtual content specified with respect to the canonical map, or otherwise identify locations in the physical world that are specified with respect to the canonical map.
  • the results returned to the device from the localization service may be one or more transformations that relate the uploaded features to portions of a matching canonical map. Those transformations may be used within the XR device, in conjunction with its tracking map, for identifying locations of virtual content or otherwise identifying locations in the physical world.
  • the localization service may download to the device transformations between the features and one or more PCFs after a successful localization.
  • the localization service may further return to the device a pose of the camera.
  • the result returned to the device from the localization service may relate the pose of the camera in relation to a canonical map.
  • network bandwidth consumed by communications between an XR device and a remote service for performing localization may be low.
  • the system may therefore support frequent localization, enabling each device interacting with the system to quickly obtain information for positioning virtual content or performing other location-based functions.
  • a device may repeat requests for updated localization information.
  • a device may frequently obtain updates to the localization information, such as when the canonical maps change, for example through merging of additional tracking maps to expand the maps or increase their accuracy.
  • FIG. 8 is a schematic diagram of an XR system 6100.
  • the user devices that display cross reality content during user sessions can come in a variety of forms.
  • a user device can be a wearable XR device (e.g., 6102) or a handheld mobile device (e.g., 6104).
  • these devices can be configured with software, such as applications or other components, and/or hardwired to generate local position information (e.g., a tracking map) that can be used to render virtual content on their respective displays.
  • Virtual content positioning information may be specified with respect to global location information, which may be formatted as a canonical map containing one or more persistent coordinate frames (PCFs), for example.
  • a PCF may be a collection of features in a map that may be used when localizing with respect to that map.
  • a PCF may be selected, for example, based on processing that identifies that set of features as readily recognizable and likely to be persistent across user sessions.
  • the system 6100 is configured with cloud-based services that support the functioning and display of the virtual content on the user device for which a location is specified relative to a PCF in a canonical map.
  • localization functions are provided as a cloud-based service 6106.
  • Cloud-based service 6106 may be implemented on any of multiple computing devices, from which computing resources may be allocated to one or more services executing in the cloud. Those computing devices may be interconnected with each other and accessible to devices, such as a wearable XR device 6102 and handheld device 6104. Such connections may be provided over one or more networks.
  • the cloud-based service 6106 is configured to accept descriptor information from respective user devices and “localize” the device to a matching canonical map or maps. For example, the cloud-based localization service matches descriptor information received to descriptor information for respective canonical map(s).
  • the canonical maps may be created using techniques as described above that create canonical maps by merging maps provided by one or more devices that have image sensors or other sensors that acquire information about a physical world.
  • maps need not be created by the devices that access them; such maps may instead be created by a map developer, for example, who may publish the maps by making them available to localization service 6106.
  • Figure 9 is an example process flow that can be executed by a device to use a cloud-based service to localize the device’s position with canonical map(s) and receive transform information specifying one or more transformations between the device local coordinate system and the coordinate system of a canonical map.
  • process 6200 can begin at 6202 with a new session.
  • Starting a new session on the device may initiate capture of image information to build a tracking map for the device. Additionally, the device may send a message, registering with a server of a localization service, prompting the server to create a session for that device.
  • process 6200 may continue at 6204 with capture of new frames of the device’s environment. Each frame can be processed to select features from the captured frame at 6206.
  • Features may be of one or more types, such as feature points and/or feature lines.
  • Feature extraction at 6206 may include appending pose information to the extracted features.
  • the pose information may be a pose in the device’s local coordinate system. In some embodiments, the pose may be relative to a reference point in the tracking map, which may be the origin of a tracking map of the device. Regardless of the format, the pose information may be appended to each feature or each set of features, such that the localization service may use the pose information for computing a transformation that can be returned to the device upon matching the features to features in a stored map.
  • the process 6200 may continue to decision block 6207 where a decision is made whether to request localization.
  • localization accuracy is enhanced by performing localization for each of multiple image frames. A localization is considered successful only when there is a sufficient correspondence between the results computed for a sufficient number of the multiple image frames. Accordingly, a localization request may be sent only when sufficient data has been captured to achieve a successful localization.
  • One or more criteria may be applied to determine whether to request localization.
  • the criteria may include passage of time, such that a device may request localization after some threshold amount of time. For example, if localization has not been attempted within a threshold amount of time, the process may continue from decision block 6207 to act 6208 where localization is requested from the cloud. That threshold amount of time may be between ten and thirty seconds, such as twenty-five seconds, for example. Alternatively or additionally, localization may be triggered by motion of a device. A device executing the process 6200 may track its motion using an IMU and/or its tracking map, and initiate localization upon detecting motion exceeding a threshold distance from the location where the device last requested localization. The threshold distance may be between one and ten meters, such as between three and five meters, for example.
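  • By way of a non-limiting illustration, the decision at block 6207 might be sketched as follows. The function name, parameter names, and default values (25 seconds and 4 meters, picked from within the ranges stated above) are assumptions made for this sketch and are not taken from any particular implementation.

```python
import math


def should_request_localization(now_s, last_request_s,
                                position, last_request_position,
                                time_threshold_s=25.0,
                                distance_threshold_m=4.0):
    """Return True if either the elapsed-time or the motion criterion is met.

    Positions are (x, y, z) tuples in meters in the device's local frame,
    e.g. as tracked with an IMU and/or the tracking map.
    """
    elapsed_s = now_s - last_request_s
    moved_m = math.dist(position, last_request_position)
    return elapsed_s >= time_threshold_s or moved_m > distance_threshold_m
```

  • In such a sketch, either criterion alone is sufficient to trigger a request, which corresponds to the "alternatively or additionally" language above.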
  • process 6200 may proceed to act 6208 where the device sends a request to the localization service, including data used by the localization service to perform localization.
  • data from multiple image frames may be provided for a localization attempt.
  • the localization service may not deem localization successful unless features in multiple image frames yield consistent localization results.
  • process 6200 may include saving sets of features and appended pose information into a buffer.
  • the buffer may, for example, be a circular buffer, storing sets of features extracted from the most recently captured frames. Accordingly, the localization request may be sent with a number of sets of features accumulated in the buffer.
  • the device may transfer the contents of the buffer to the localization service as part of a localization request.
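  • As a minimal sketch of the buffering described above, a fixed-capacity circular buffer may be built on a double-ended queue so that the oldest set of features is discarded when a new one arrives. The class name, the dictionary layout, and the capacity of five are hypothetical choices for illustration only.

```python
from collections import deque


class FeatureBuffer:
    """Circular buffer holding the most recent sets of features with poses."""

    def __init__(self, capacity=5):
        # A deque with maxlen silently drops the oldest entry when full.
        self._sets = deque(maxlen=capacity)

    def add(self, features, pose):
        """Store one frame's extracted features together with its pose."""
        self._sets.append({"features": features, "pose": pose})

    def build_request(self):
        """Package the buffered sets as the payload of a localization request."""
        return {"feature_sets": list(self._sets)}
```

  • For example, after seven frames have been added to a buffer of capacity five, only the five most recent sets of features would be included in the request.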
  • Other information may be transmitted in conjunction with the feature points and appended pose information.
  • geographic information may be transmitted, which may aid in selecting a map against which to attempt localization.
  • the geographic information may include, for example, GPS coordinates or a wireless signature associated with the device’s tracking map or current persistent pose.
  • a cloud localization service may process the sets of features to localize the device into a canonical map or other persistent map maintained by the service. For example, the cloud-based localization service may generate a transform based on the pose of feature sets sent from the device relative to matching features of the canonical maps. The localization service may return the transform to the device as the localization result. This result may be received at block 6210.
  • the device may use these transforms to compute the location at which to render virtual content for which a location has been specified by an application or other component of the XR system relative to any of the PCFs. This information may alternatively or additionally be used on the device to perform any location based operation in which a location is specified based on the PCFs.
  • the localization service may be unable to match features sent from a device to any stored canonical map or may not be able to match a sufficient number of the sets of features communicated with the request for the localization service to deem a successful localization occurred.
  • the localization service may indicate to the device that localization failed.
  • the process 6200 may branch at decision block 6209 to act 6230, where the device may take one or more actions for failure processing. These actions may include increasing the size of the buffer holding feature sets sent for localization.
  • the buffer size may be increased from five to six, increasing the chances that three of the transmitted sets of features can be matched to a canonical map maintained by the localization service.
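  • The failure processing described above might, as one hedged illustration, enlarge the buffer while retaining the sets of features already collected. The function name is hypothetical, and the sketch assumes the buffer is a bounded double-ended queue.

```python
from collections import deque


def grow_buffer(buffer, new_capacity):
    """Return a larger circular buffer that keeps the entries already held."""
    if buffer.maxlen is not None and new_capacity <= buffer.maxlen:
        return buffer  # nothing to do; capacity is already sufficient
    return deque(buffer, maxlen=new_capacity)
```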
  • canonical maps maintained by the localization service may contain PCFs that have been previously identified and stored. Each PCF may be represented by multiple features, which, as for each image frame processed at 6206, may include a mix of feature points and feature lines. Accordingly, the localization service may identify a canonical map with sets of features that match sets of features sent with the localization request and may compute a transformation between the coordinate frame represented by the poses sent with the request for localization and the one or more PCFs.
  • a localization result may be expressed as a transformation that aligns the coordinate frame of extracted sets of features with respect to the selected map.
  • This transformation may be returned to the user device where it may be applied, as either a forward or inverse transformation, to relate locations specified with respect to the shared map to the coordinate frame used by the user device, or vice versa.
  • the transformation may allow the device to render virtual content for its user in a location with respect to the physical world that is specified in a coordinate frame of the map to which the device localized.
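  • The forward and inverse use of such a transformation may be illustrated with a rigid transform stored as a 4 x 4 homogeneous matrix. This is a sketch under the assumption that the transformation is rigid (rotation plus translation); the function names and the convention that the matrix maps device coordinates to map coordinates are hypothetical.

```python
import numpy as np


def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def invert_transform(T):
    """Invert a rigid transform without a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    return make_transform(R.T, -R.T @ t)


def apply_transform(T, points):
    """Apply T to an (N, 3) array of points and return an (N, 3) array."""
    return points @ T[:3, :3].T + T[:3, 3]
```

  • Applying the matrix relates device-frame locations to the map frame, and applying its inverse relates map-frame locations, such as those of virtual content, back to the device frame.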
  • a pose of a set of features relative to other image information may be computed in many scenarios, including in an XR system to localize a device with respect to a map.
  • Figure 10 illustrates a method 1000 that may be implemented to compute such a pose.
  • method 1000 computes a pose for any mix of feature types.
  • the features, for example, may be all feature points or all feature lines or a combination of feature points and feature lines.
  • Method 1000, for example, may be performed as part of the processing illustrated in Figure 9 in which the computed pose is used to localize a device with respect to a map.
  • Processing for method 1000 may begin once an image frame is captured for processing.
  • a mix of feature types may be determined.
  • the features extracted may be points and/or lines.
  • the device may be configured to select a certain mix of feature types.
  • the device, for example, may be programmed to select a set percentage of the features as points and the remaining features as lines.
  • pre-configuration may be based on ensuring at least a certain number of points and a certain number of lines in the set of features from the image.
  • Such a selection may be guided by one or more metrics, indicating, for example, the likelihood that a feature would be recognized in a subsequent image of the same scene.
  • a metric may be based, for example, on the characteristics of the physical structure giving rise to the feature and/or the location of such a structure within the physical environment.
  • a corner of a window or a picture frame mounted on a wall, for example, may yield feature points with high scores.
  • a corner of a room or an edge of a step may yield feature lines with high scores.
  • Such metrics may be used to select the best features in an image, or may be used to select images for which further processing is performed, for example with further processing being performed only for images in which the number of features with a high score exceeds a threshold.
  • selection of features may be done in such a way that the same number or same mix of points and lines is selected for all images. Image frames that do not supply the specified mix of features might be discarded, for example. In other scenarios, the selection may be dynamic based on the visual characteristics of the physical environment.
  • the selection may be guided, for example, based on the magnitude of metrics assigned to detected features. For example, in a small room with monochrome walls and few furnishings, there may be few physical structures that give rise to feature points with large metrics.
  • Figure 11 illustrates an environment in which a localization attempt based on feature points is likely to fail. A similar result may occur in an environment with structures that give rise to numerous similar feature points. In those environments, the mix of selected features may include more lines than points. Conversely, in a large or outdoor space, there may be many structures that give rise to feature points, with few straight edges, such that the mix of features will be biased towards points.
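  • One way such a dynamic selection could be sketched is to guarantee a minimum number of points and lines and then fill the remainder of the set with whichever features score highest, so that the mix drifts toward lines in point-poor environments and toward points in line-poor ones. The function name, the (identifier, score) tuple layout, and the minimum counts are assumptions for illustration.

```python
def select_feature_mix(points, lines, total, min_points=0, min_lines=0):
    """Pick up to `total` features, each given as an (identifier, score) tuple.

    The best `min_points` points and `min_lines` lines are always kept; the
    remaining slots go to the highest-scoring leftover features of either type.
    """
    pts = sorted(points, key=lambda f: f[1], reverse=True)
    lns = sorted(lines, key=lambda f: f[1], reverse=True)
    chosen_pts, rest_pts = pts[:min_points], pts[min_points:]
    chosen_lns, rest_lns = lns[:min_lines], lns[min_lines:]

    remaining = max(0, total - len(chosen_pts) - len(chosen_lns))
    pool = [("point", f) for f in rest_pts] + [("line", f) for f in rest_lns]
    pool.sort(key=lambda kf: kf[1][1], reverse=True)
    for kind, feat in pool[:remaining]:
        (chosen_pts if kind == "point" else chosen_lns).append(feat)
    return chosen_pts, chosen_lns
```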
  • features of the determined mix may be extracted from an image frame to be processed. It should be appreciated that blocks 1010 and 1020 need not be performed in the order illustrated, as the processing may be dynamic such that processing to select features and determine a mix may occur concurrently. Techniques that process an image to identify points and/or lines may be applied in block 1020 to extract features. Moreover, one or more criteria may be applied to limit the number of features extracted. Criteria may include a total number of features or a quality metric for the features included in the set of extracted features.
  • Processing may then proceed to block 1030, at which correspondences between the extracted features from an image and other image information, such as a previously stored map, are determined. Correspondences may be determined, for example, based on visual similarity and/or descriptor information associated with the features. These correspondences may be used to generate a set of constraints on a transformation that defines the pose of the extracted features with respect to the features from the other image information. In the localization example, these correspondences are between the selected set of features in an image taken with a camera on a device and a stored map.
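  • As a hedged illustration of descriptor-based correspondence, the sketch below pairs each extracted feature with a stored feature when the two are mutual nearest neighbors in descriptor space. Mutual nearest-neighbor matching with a Euclidean distance threshold is one common technique and is an assumption here; it is not stated to be the matching used at block 1030, and the function name and threshold are hypothetical.

```python
import numpy as np


def match_descriptors(query, stored, max_distance=0.5):
    """Return (query_index, stored_index) pairs that are mutual nearest neighbors.

    query:  (Nq, D) array of descriptors for features extracted from an image.
    stored: (Ns, D) array of descriptors for features in a stored map.
    """
    # Pairwise Euclidean distances, shape (Nq, Ns).
    dist = np.linalg.norm(query[:, None, :] - stored[None, :, :], axis=2)
    nearest_stored = dist.argmin(axis=1)  # best stored match per query feature
    nearest_query = dist.argmin(axis=0)   # best query match per stored feature
    return [(i, int(j)) for i, j in enumerate(nearest_stored)
            if nearest_query[j] == i and dist[i, j] <= max_distance]
```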
  • the image used as the input for pose estimate is a two-dimensional image.
  • the image features are 2D.
  • the other image information may represent features in three-dimensions.
  • a keyrig as described above may have three dimensional features built up from multiple two dimensional images. Even though of different dimensions, correspondences may nonetheless be determined.
  • processing proceeds to block 1040, where a pose is computed.
  • This pose may serve as the result of a localization attempt in an XR system, as described above.
  • any or all of the steps of the method 1000 may be performed on devices described herein, and/or on remote services such as those described herein.
  • the processing at block 1040 may be selected based on the mix of feature types extracted from the image frame.
  • the processing may be universal, such that the same software may be executed, for example, for an arbitrary mix of points and lines.
  • Estimating the pose of a camera using 2D/3D point or line correspondences, called the PnPL problem, is a fundamental problem in computer vision with many applications, such as Simultaneous Localization and Mapping (SLAM), Structure from Motion (SfM) and Augmented Reality.
  • a PnPL algorithm as described herein may be complete, robust and efficient.
  • a “complete” algorithm can mean that the algorithm can handle all potential inputs and may be applied in any scenario regardless of mix of feature types, such that the same processing may be applied in any scenario.
  • universal processing may be achieved by programming a system to compute a pose from a set of correspondences by converting a least-squares problem into a minimal problem.
  • a method of localization may include using a complete, accurate and efficient solution for the PnPL problem.
  • the method can also be able to solve the PnP and the PnL problems as specific cases of the PnPL problem.
  • the method may be able to solve multiple types of problems, including minimal problems (e.g. P3L, P3P, and/or PnL) and/or least-squares problems (e.g. PnL, PnP, PnPL).
  • the method may be capable of solving any of the P3L, P3P, PnL, PnP and PnPL problems.
  • Figure 13 is an example of processing that may be universal and may result in conversion of a problem, conventionally solved as a least-squares problem, into a minimal problem.
  • Figure 13 is a flow chart illustrating a method 1300 of efficient pose estimation, according to some embodiments.
  • the method 1300 may be performed, for example, on the correspondences determined in block 1030 in Figure 10.
  • the method may start with, given a number n of 2D/3D point correspondences and m 2D/3D line correspondences, obtaining 2 x (m + n) constraints (Act 1310).
  • the method 1300 may include reconfiguring (Act 1320) the set of constraints and using a partial linearization method to obtain an equation system.
  • the method further includes solving the equation system to obtain the rotation matrix (Act 1330) and obtaining t, a translation vector, using the rotation matrix and the closed form of t (Act 1340).
  • the rotation matrix and translation vector may together define the pose.
  • any or all of the steps of the method 1300 may be performed on devices described herein, and/or on remote services such as those described herein.
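  • To make the count of 2 x (m + n) constraints concrete, the sketch below evaluates two residuals per point correspondence and two per line correspondence for a candidate pose. The particular form of the constraints is an assumption: it uses the standard reprojection equations with denominators cleared for points, and the condition that each of two 3D points on a line projects onto the observed 2D line. All names are hypothetical.

```python
import numpy as np


def pnpl_residuals(R, t, pts3d, pix, line_pts3d, lines2d):
    """Evaluate 2 * (n + m) constraint residuals for a candidate pose (R, t).

    pts3d:      (n, 3) 3D points.
    pix:        (n, 2) corresponding normalized pixel coordinates (u, v).
    line_pts3d: (m, 2, 3) two 3D points on each 3D line.
    lines2d:    (m, 3) homogeneous coefficients of each observed 2D line.
    """
    residuals = []
    for P, (u, v) in zip(pts3d, pix):
        Pc = R @ P + t                       # point in the camera frame
        residuals.append(Pc[0] - u * Pc[2])  # x - u * z = 0
        residuals.append(Pc[1] - v * Pc[2])  # y - v * z = 0
    for (Q1, Q2), l in zip(line_pts3d, lines2d):
        residuals.append(l @ (R @ Q1 + t))   # first point lies on the 2D line
        residuals.append(l @ (R @ Q2 + t))   # second point lies on the 2D line
    return np.array(residuals)
```

  • For noise-free correspondences and the true pose, every residual in such a sketch is zero, and each residual is linear in t for a fixed R.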
  • solving the PnPL problem can mean estimating the camera pose (i.e. R and t) using N 2D/3D point correspondences (i.e. $\{P_i \leftrightarrow p_i\}_{i=1}^{N}$) and M 2D/3D line correspondences (i.e. $\{L_i \leftrightarrow l_i\}_{i=1}^{M}$).
  • $P_i$ may represent a 3D point and $p_i = [u_i, v_i]^T$ may represent the corresponding 2D pixel in the image.
  • similarly, $L_i$ may represent a 3D line and $l_i$ may represent the corresponding 2D line.
  • two 3D points (such as $Q_i^1$ and $Q_i^2$) can be used to represent the 3D line $L_i$, and two pixels (such as $q_i^1$ and $q_i^2$) can be used to represent the corresponding 2D line $l_i$.
  • to simplify the notation, the normalized pixel coordinate may be used.
  • obtaining 2 x (m + n) constraints in Act 1310 of method 1300 further includes multiplying the denominators in (1) to both sides of the equations, to yield the following:
  • reconfiguring the set of constraints in Act 1320 of method 1300 may include generating a quadratic system using the constraints, a representation of R using Cayley-Gibbs-Rodriguez parametrization, and the closed form of t, obtained given n 2D/3D point correspondences and m 2D/3D line correspondences.
  • for the i-th constraint, the following may be defined:
  • may be a scalar.
  • equation (8) may be solved by adopting QR, SVD, or Cholesky decomposition.
  • the linear system of equation (8) may be solved using the normal equation.
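  • The solution strategies named above may be compared on a generic overdetermined linear system. The sketch assumes equation (8) can be arranged as C t = d for a known matrix C and vector d; the symbols C and d and the function names are hypothetical and stand in for whatever coefficients the constraints produce.

```python
import numpy as np


def solve_qr(C, d):
    """Least-squares solution via a reduced QR factorization of C."""
    Q, Rm = np.linalg.qr(C)
    return np.linalg.solve(Rm, Q.T @ d)


def solve_svd(C, d):
    """Least-squares solution via NumPy's SVD-based lstsq routine."""
    return np.linalg.lstsq(C, d, rcond=None)[0]


def solve_normal_cholesky(C, d):
    """Solve the normal equations (C^T C) t = C^T d with a Cholesky factor."""
    L = np.linalg.cholesky(C.T @ C)
    y = np.linalg.solve(L, C.T @ d)
    return np.linalg.solve(L.T, y)
```

  • The normal-equation route is typically the cheapest but squares the condition number of C, whereas the QR and SVD routes are generally more robust to ill-conditioning.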
  • the representation of R using Cayley-Gibbs-Rodriguez parametrization may be calculated by back-substituting t into (7), to get the following
  • a solution for R may then be determined.
  • Cayley-Gibbs-Rodriguez (CGR) parametrization, a 3-dimensional vector s, may be used to represent R as in the following EQ. (10).
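  • For illustration, the standard CGR mapping from a 3-dimensional vector s to a rotation matrix may be sketched as below. The formula used is the commonly published form of the parametrization and is assumed, not confirmed, to agree with EQ. (10); the function names are hypothetical.

```python
import numpy as np


def skew(s):
    """Return the 3x3 skew-symmetric matrix [s]_x such that [s]_x v = s x v."""
    x, y, z = s
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])


def cgr_to_rotation(s):
    """Map a CGR vector s to R = ((1 - s.s) I + 2 [s]_x + 2 s s^T) / (1 + s.s)."""
    s = np.asarray(s, dtype=float)
    ss = s @ s
    R_bar = (1.0 - ss) * np.eye(3) + 2.0 * skew(s) + 2.0 * np.outer(s, s)
    return R_bar / (1.0 + ss)
```

  • Every entry of the numerator is quadratic in s, which is what allows the pose constraints to be written as a polynomial system. A rotation of exactly 180 degrees corresponds to an unbounded s and cannot be represented in this form.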
  • Equation (13) is a homogeneous linear system, and r, with 9 elements, is a non-trivial solution of (13). Thus H should be singular; otherwise this homogeneous system would have only the zero (or trivial) solution, which contradicts the fact that r is a solution of (13). Theorem 1: The rank of A in (11) is smaller than 9 for data without noise.
  • rank approximation may be used to denoise.
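  • A rank approximation of this kind may be sketched with a truncated singular value decomposition, which gives the closest matrix of a given rank in the Frobenius norm. The function name is hypothetical, and the target rank to use for a particular matrix is left to the caller, since it depends on the matrix dimensions.

```python
import numpy as np


def low_rank_approx(A, rank):
    """Return the best rank-`rank` approximation of A in the Frobenius norm."""
    U, sv, Vt = np.linalg.svd(A, full_matrices=False)
    sv[rank:] = 0.0          # discard the smallest singular values
    return (U * sv) @ Vt     # rebuild the matrix from the retained components
```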
  • using a partial linearization method to obtain an equation system in Act 1320 of method 1300 may include using the partial linearization method to convert the PnPL problem into an Essential Minimal Formulation (EMF) and generating an equation system.
  • the partial linearization method may include splitting ⁇ into two parts, where a first part may include 3 monomials, and a remaining
  • Partial linearization may also include, according to some embodiments, dividing matrix A in (11) into based on the division of ⁇ , accordingly and rewriting (11) as
  • Equation (17) may be rewritten as
  • solving the equation system to obtain the rotation matrix may include obtaining the rotation matrix by solving the equation system where equations are of form (19).
  • obtaining t using the rotation matrix and the closed form of t may include obtaining t from (8) after solving for s.
  • Figures 14-17 are diagrams of experimental results of embodiments of the method of efficient localization compared to other known PnPL solvers.
  • Figures 14A-14D show mean and median rotation and translation errors of different PnPL solvers, including OPnPL and cvxpnpl, described respectively in “Accurate and linear time pose estimation from points and lines: European Conference on Computer Vision”, Alexander Vakhitov, Jan Funke, and Francesc Moreno Noguer, Springer, 2016 and “CvxPnPL: A unified convex solution to the absolute pose estimation problem from point and line correspondences” by Agostinho, Sergio, Joao Gomes, and Alessio Del Bue, 2019, both of which are hereby incorporated by reference herein in their entirety.
  • Figure 14A shows median rotation errors of different PnPL algorithms in degrees.
  • Figure 14B shows median translation errors of different PnPL algorithms in percentages.
  • Figure 14C shows mean rotation errors of different PnPL algorithms in degrees.
  • Figure 14D shows mean translation errors of different PnPL algorithms in percentages.
  • the pnpl curves 40100A-D show the error in rotation and translation using the method described herein, according to some embodiments.
  • the OPnPL curves 40200A-D and the cvxpnpl curves 40300A-D show error in percentage and degrees that is consistently higher than those of the pnpl curves 40100A-D.
  • Figure 15A is a diagram of computational time of different PnPL algorithms.
  • Figure 15B is a diagram of computational time of different PnPL algorithms.
  • the computational time of solving a PnPL problem using a method described herein is represented by curves 50100A-B; the OPnPL curves 50200A-B and the cvxpnpl curves 50300A-B show consistently higher computational times than a method including embodiments of an algorithm described herein.
  • Figure 16A shows the number of instances of errors of a certain range versus the log error of a PnPL solution, according to some embodiments described herein, for a PnP problem compared to a P3P and UPnP solution, according to some embodiments.
  • Figure 16B shows a box plot of a PnPL solution, according to some embodiments described herein, for a PnP problem compared to a P3P and UPnP solution.
  • Figure 16C shows the mean rotational error in radians of a PnPL solution, according to some embodiments described herein, for a PnP problem compared to a P3P and UPnP solution.
  • the PnPL solution, according to some embodiments described herein, for a PnP problem has error 60100C, which can be seen to be less than the error for the UPnP solution 60200C.
  • Figure 16D shows the mean positional error in meters of a PnPL solution, according to some embodiments described herein, for a PnP problem compared to a P3P and UPnP solution.
  • the PnPL solution, according to some embodiments described herein, for a PnP problem has error 60100D, which can be seen to be less than the error for the UPnP solution 60200D.
  • Figures 17A-D show mean and median rotation and translation errors of different PnL algorithms including OAPnL, DLT, LPnL, Ansar, Mirzaei, OPnPL, and ASPnL.
  • OAPnL is described in "A Robust and Efficient Algorithm for the PnL problem Using Algebraic Distance to Approximate the Reprojection Distance," by Zhou, Lipu, et al., 2019, and is hereby incorporated by reference herein in its entirety.
  • DLT is described in “Absolute pose estimation from line correspondences using direct linear transformation.
  • LPnL is described in “Pose estimation from line correspondences: A complete analysis and a series of solutions” by Xu, C., Zhang, L., Cheng, L., and Koch, R., 2017, and is hereby incorporated by reference herein in its entirety.
  • Ansar is described in “Linear pose estimation from points or lines” by Ansar, A., and Daniilidis, K., 2003 and is hereby incorporated by reference herein in its entirety.
  • Mirzaei is described in “Globally optimal pose estimation from line correspondences” by Mirzaei, F.
  • OPnPL is addressed in “Accurate and linear time pose estimation from points and lines: European Conference on Computer Vision”. As described herein, aspects of ASPnL are described in “Pose estimation from line correspondences: A complete analysis and a series of solutions”.
  • Figure 17A shows median rotation errors of the different PnL algorithms in degrees.
  • Figure 17B shows median translation errors of the different PnL algorithms in percentages.
  • Figure 17C shows mean rotation errors of the different PnL algorithms in degrees.
  • Figure 17D shows mean translation errors of the different PnL algorithms in percentages.
  • Curves 70100A-D show the median and mean rotation and translation error of a PnPL solution using the method described herein.
  • Figure 18 illustrates a method 1800 that may be an alternative to method 1000 in Figure 10. As in method 1000, method 1800 may begin with determining a feature mix and extracting features with that mix at blocks 1810 and 1820. In processing at block 1810, the feature mix may include only lines. For example, only lines may be selected in an environment as illustrated in Figure 11.
  • correspondences may be determined as described above. From these correspondences, a pose may be computed at subprocess 1835. In this example, processing may branch dependent on whether the features include at least one point. If so, pose may be estimated with a technique that may solve for pose based on a set of features including at least one point.
  • the universal algorithm as described above, for example, may be applied, at box 1830.
  • processing may be performed by an algorithm that delivers accurate and efficient results in that case.
  • processing branches to block 3000.
  • Block 3000 may solve the Perspective-n-Line (PnL) problem, as described below.
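  • The branch within subprocess 1835 may be sketched, purely for illustration, as a dispatch on whether the extracted feature set contains at least one point. The function name, the dictionary layout of a feature, and the returned labels are hypothetical.

```python
def choose_pose_solver(features):
    """Select a solver label based on the feature types present.

    Each feature is a dict with a "type" key of either "point" or "line".
    """
    has_point = any(f["type"] == "point" for f in features)
    # With at least one point, use the universal PnPL processing;
    # with lines only, use the dedicated PnL processing of block 3000.
    return "pnpl" if has_point else "pnl"
```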
  • because lines are often present, and may serve as readily recognizable features, in environments in which pose estimation may be desired, providing a solution specifically for a feature set using only lines may provide an efficiency or accuracy advantage for devices operating in such environments.
  • any or all of the steps of the method 1800 may be performed on devices described herein, and/or on remote services such as those described herein.
  • the PnL problem can be described as the line counterpart of the PnP problem such as is described in “A direct least-squares (dls) method for pnp” by Hesch, J.A., Roumeliotis, S.I., International Conference on Computer Vision, “Upnp: An optimal o (n) solution to the absolute pose problem with universal applicability.
  • the PnL problem is a fundamental problem in computer vision and robotics with many applications, including Simultaneous Localization and Mapping (SLAM), Structure from Motion (SfM) and Augmented Reality (AR).
  • the camera pose can be determined from a number of N 2D-3D line correspondences, where N ≥ 3.
  • when N = 3, the problem may be called the minimal problem, also known as the P3L problem.
  • when N > 3, the problem may be known as a least-squares problem.
  • the minimal (P3L) problem generally requires solving an eighth-order univariate equation and thus has at most 8 solutions, except in the case of some specific geometric configurations (e.g. as described in “Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE transactions on pattern analysis and machine intelligence” by Xu, C., Zhang, L., Cheng, L., Koch, R.).
  • One widely adopted strategy for the minimum (P3L) problem is to simplify the problem by some geometrical transformations (e.g. such as described in “Determination of the attitude of 3d objects from a single perspective view.
  • CVGIP Image understanding 60(3), 313-342 (1994)' by Kumar, R., Hanson, A.R., hereby incorporated by reference in its entirety) proposed to jointly optimize rotation and translation in the iterative method. They presented a sampling-based method to get an initial estimation. Later works (e.g. as described in 'Pose estimation using point and line correspondences. Real-Time Imaging 5(3), 215-230 (1999)' by Dornaika, F., Garcia, C. and 'Iterative pose computation from line correspondences (1999)', which are both hereby incorporated by reference in their entirety) proposed to start the iteration from a pose estimated by a weak perspective or paraperspective camera model.
  • the accuracy of the iterative algorithm depends on the quality of the initial solution and the parameters of the iterative algorithm. There is no guarantee that the iterative method will converge.
  • linear formulation plays an important role (e.g. as described in 'Multiple view geometry in computer vision. Cambridge university press (2003)' by Hartley, R., Zisserman, A., which is hereby incorporated by reference in its entirety).
  • Direct Linear Transformation (DLT) provides a straightforward way to compute the pose (e.g. as described in 'Multiple view geometry in computer vision. Cambridge university press (2003)' by Hartley, R., Zisserman, A.). This method requires at least 6 line correspondences.
  • Pribyl et al. e.g.
  • the EPnP algorithm is extended to solve the PnL problem (e.g. as described in 'Accurate and linear time pose estimation from points and lines. In: European Conference on Computer Vision pp. 583-599. Springer (2016)' and “Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE transactions on pattern analysis and machine intelligence” by Xu, C., Zhang, L., Cheng, L., Koch, R.).
  • Hichem e.g. as described in 'A direct least-squares solution to multi-view absolute and relative pose from 2d-3d perspective line pairs.
  • a direct Least-Squares solution for the PnL problem of a multi-camera system.
  • the vertical direction is known from a certain sensor (e.g. an IMU).
  • a desirable property of a PnL solution is that it is accurate and efficient for any possible input.
  • algorithms based on linear formulation are generally unstable or infeasible for a small N and need specific treatment or even do not work for the planar case.
  • algorithms based on polynomial formulation could achieve better accuracy and are applicable to broader PnL inputs but are more computationally demanding.
  • a method of localization may include a complete, accurate and efficient solution for the Perspective-n-Line (PnL) problem.
  • a least-squares problem may be transformed into a General Minimal Formulation (GMF), which can have the same form as the minimal problem, by a novel hidden variable method.
  • the Gram-Schmidt process may be used to avoid the singular case in the transformation.
  • Figure 30 is a flow chart illustrating a method 3000 of efficient localization, according to some embodiments.
  • the method may start with determining a set of correspondences of extracted features (Act 3010) and, given a number N of 2D/3D line correspondences, obtaining 2N constraints (Act 3020).
  • the method 3000 may include reconfiguring (Act 3030) the set of constraints and using a partial linearization method to obtain an equation system.
  • the method further includes solving an equation system to obtain the rotation matrix (Act 3040) and obtaining t using the rotation matrix and the closed form of t (Act 3050).
  • any or all of the steps of the method 3000 may be performed on devices described herein, and/or on remote services such as those described herein.
  • this is described further in conjunction with Figure 19.
  • Figure 19 is an exemplary schematic of constraints from $l_i \leftrightarrow L_i$, according to some embodiments.
  • the PnL problem may include estimating the camera pose including rotation R
  • K T l L may be computed first. The notation may be simplified by using to represent K r ⁇ ,.
  • the PnL problem may include estimating the camera pose including rotation R and translation t.
  • the rotation R and translation t may have a total of 6 degrees of freedom.
  • each line correspondence l_i ↔ L_i may yield 2 constraints, indexed by j = 1, 2, which may be written as in EQ. (1’).
  • Equation (2’) is the General Minimal Formulation (GMF) for the P3L problem.
  • reconfiguring the set of constraints in Act 3030 of method 3000 may include generating a quadratic system using the constraints, a representation of R using Cayley-Gibbs-Rodriguez (CGR) parametrization, and the closed form of t.
  • the CGR may be used to represent R, for example as discussed in “A robust and efficient algorithm for the pnl problem using algebraic distance to approximate the reprojection distance”.
  • the representation of R using CGR parametrization may be in the form described by the following equations (3’): R = \bar{R} / (1 + s^T s), with \bar{R} = (1 − s^T s) I_3 + 2 [s]_\times + 2 s s^T. EQ. (3’)
  • I_3 may be the 3 × 3 identity matrix and [s]_\times is the skew-symmetric matrix of the three-dimensional vector s.
  • each element of \bar{R} is a quadratic in the three-dimensional vector s.
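  • as a minimal sketch of the CGR parametrization, the following Python code builds R from s using the commonly used Cayley form R = ((1 − s^T s) I_3 + 2 [s]_× + 2 s s^T) / (1 + s^T s). It is assumed here that this form corresponds to equations (3’); the function names `skew` and `cgr_to_rotation` are hypothetical and chosen only for illustration.

```python
def skew(s):
    """Return the 3x3 skew-symmetric matrix [s]_x of a 3-vector s."""
    s1, s2, s3 = s
    return [[0.0, -s3, s2],
            [s3, 0.0, -s1],
            [-s2, s1, 0.0]]


def cgr_to_rotation(s):
    """Map a CGR vector s to a 3x3 rotation matrix (list of rows).

    Assumed form: R = ((1 - s.s) I_3 + 2 [s]_x + 2 s s^T) / (1 + s.s).
    """
    ss = sum(c * c for c in s)
    sx = skew(s)
    R = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            identity = 1.0 if i == j else 0.0
            # Every numerator entry is a quadratic polynomial in s1, s2, s3.
            numerator = (1.0 - ss) * identity + 2.0 * sx[i][j] + 2.0 * s[i] * s[j]
            R[i][j] = numerator / (1.0 + ss)
    return R


# s equals the rotation axis scaled by tan(theta / 2),
# so s = (0, 0, 1) is a 90-degree rotation about the z axis.
R = cgr_to_rotation((0.0, 0.0, 1.0))
```

  • in this sketch the numerator is quadratic in s, which is why multiplying the constraints by (1 + s^T s) leaves polynomial equations; a rotation of exactly 180° has no finite s in this parametrization.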
  • the closed-form of t may be derived by first substituting (3’) in (1’) and multiplying both sides by the term (1 + s^T s) to yield
  • Equations (5’) may be simplified by defining
  • A r + B t = 0_{2N \times 1}. EQ. (8’) Equation (8’) can be treated as a linear equation system in t to get a closed-form solution for t.
  • a quadratic system of Act 3030 may be a quadratic system in s_1, s_2, and s_3 and may be in the following form:
  • using a partial linearization method to obtain an equation system in Act 3030 of method 3000 may include using the partial linearization method to convert the PnL problem into a General Minimal Formulation (GMF) and generating an equation system.
  • r_3 may be treated as individual unknowns.
  • the method may require that the matrix K_3 for r_3 be full rank.
  • a closed-form solution for r_3 with respect to r_7 may be written as the following:
  • −(K_3^T K_3)^{−1} K_3^T K_7 of equation (13’) may represent a 3 × 7 matrix.
  • when K of (10’) is of full rank, r_3 may be chosen arbitrarily.
  • the matrix K_q (i.e., K of (10’)) may be rank deficient for arbitrary numbers of 2D-3D line correspondences for data without noise.
  • when K_q (i.e., K of (10’)) is rank deficient, a certain input may make K_3 for a fixed choice of r_3 be, or approximate, rank deficient.
  • the Gram-Schmidt process with column pivoting may be used to select 3 independent columns from K to generate K_3, with the column indices i, j, and k each chosen by an argmax over the columns k_n of K.
  • the equations (16’) may be used, wherein the ith, jth, and kth columns of K are selected as K_3, and the corresponding monomials may form r_3. The remaining columns may be selected to form K_7 and the corresponding monomials may form r_7. According to some embodiments, the equations (16’) may be solved using other polynomial solvers.
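  • the column selection described above can be sketched in Python as follows. This is one common greedy variant of Gram-Schmidt with column pivoting (pick the column with the largest residual norm, project its direction out of all columns, repeat) and is an assumption rather than the exact selection rule of the embodiments; the name `select_independent_columns` and the small example matrix are hypothetical.

```python
import math


def select_independent_columns(K, count=3, tol=1e-12):
    """Pick `count` column indices of K by Gram-Schmidt with column pivoting.

    K is a list of rows. At each step the column with the largest residual
    norm is selected, then its direction is projected out of every column.
    """
    n_rows, n_cols = len(K), len(K[0])
    # Work on a column-major copy so the caller's matrix is left untouched.
    cols = [[K[r][c] for r in range(n_rows)] for c in range(n_cols)]
    selected = []
    for _ in range(count):
        norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
        for idx in selected:
            norms[idx] = -1.0  # never pick the same column twice
        pivot = max(range(n_cols), key=lambda c: norms[c])
        if norms[pivot] <= tol:
            raise ValueError("fewer than `count` independent columns")
        selected.append(pivot)
        q = [v / norms[pivot] for v in cols[pivot]]  # unit pivot direction
        for c in range(n_cols):
            dot = sum(a * b for a, b in zip(q, cols[c]))
            cols[c] = [a - dot * b for a, b in zip(cols[c], q)]
    return selected


# Hypothetical 4 x 5 example: column 1 is parallel to column 0.
K = [[1.0, 2.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 3.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0]]
picked = select_independent_columns(K)
```

  • in this sketch the returned indices play the role of i, j, and k: the picked columns would form K_3 and the remaining columns K_7. Pivoting on the residual norm avoids choosing nearly parallel columns, which is the singular case a fixed choice can run into.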
  • the above equation system includes 3 second-order equations in s_1, s_2, and s_3.
  • Each of the 3 second-order equations may have the following form:
  • solving the equation system to obtain the rotation matrix may include obtaining the rotation matrix by solving the equation system where equations are of form (15’).
  • the equation system may be solved using the Gröbner basis approach.
  • the equation system may be solved using methods and approaches described in Kukelova et al.
  • a hidden variable method may be used to solve the equation system (14’).
  • a customized hidden variable method may be used to solve the equation system.
  • customized hidden variable methods are described in “Using algebraic geometry, vol. 185. Springer Science & Business Media (2006)”.
  • the customized hidden variable method may be implemented by treating one unknown in (15’) as a constant.
  • s_3 may be treated as a constant while s_1 and s_2 are treated as unknowns such that equation system (15’) may be written in the following manner:
  • an auxiliary variable s_0 may be used to make (15’) a homogeneous quadratic equation such that all monomials in (15’) have degree 2. This generates the following system:
  • J can be a third-order homogeneous equation in s_0, s_1, and s_2 whose coefficients are polynomials in s_3.
  • after obtaining s_3, s_3 can be back-substituted into (21’) to derive a linear homogeneous equation system with respect to u.
  • obtaining the rotation matrix (Act 3040) in method 3000 may comprise computing R with (3’) once s_1, s_2, and s_3 are obtained.
  • t may be calculated by (6’).
  • obtaining (Act 3050) t may include obtaining t using equation (9’).
  • an iterative method may be used to refine the solution, for example as described in “A robust and efficient algorithm for the pnl problem using algebraic distance to approximate the reprojection distance”, “Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE transactions on pattern analysis and machine intelligence”, and “Camera pose estimation from lines: a fast, robust and general method. Machine Vision and Applications 30”.
  • the solution may be refined by minimizing the cost function (e.g., as described in “A robust and efficient algorithm for the pnl problem using algebraic distance to approximate the reprojection distance”), which is a sixth-order polynomial in s and t.
  • the damped Newton step may be used to refine the solution (e.g. as described in “Revisiting the pnp problem: A fast, general and optimal solution. In: Proceedings of the IEEE International Conference on Computer Vision” by Zheng, Y., Kuang, Y., Sugimoto, S., Astrom, K., Okutomi, M. which is hereby incorporated by reference in its entirety, and “A robust and efficient algorithm for the pnl problem using algebraic distance to approximate the reprojection distance”).
  • the PnL solution described herein is applicable to N ≥ 3 2D/3D line correspondences.
  • the method of solving a PnL problem may include 4 steps.
  • the first step may include compressing the 2N constraints (4’) into 3 equations (15’).
  • the system of 3 equations (15’) may be solved by the hidden variable method to recover rotation R and translation t.
  • the PnL solution may further be refined by the damped Newton step.
  • Figure 31 shows an exemplary algorithm 3100 for solving the PnL problem, according to some embodiments.
  • the computational complexity of step 2 (Act 3120) and step 3 (Act 3130) of algorithm 3100 is O(1), as it is independent of the number of correspondences.
  • the main computational cost of step 1 is to solve the linear least-squares problem (9’) and (13’).
  • the main computational cost of step 4 is to calculate the summation of squared distance functions. The computational complexity of these steps increases linearly with respect to N.
  • “MinPnL” denotes a component of the algorithm for solving the PnL problem described herein.
  • Figures 24-27 show comparisons of the MinPnL algorithm, according to some embodiments, and previous P3L and least-squares PnL algorithms.
  • the compared algorithms for solving the P3L and least-squares PnL problems include, for the P3L problem, three recent works: AlgP3L (e.g., as described in “A stable algebraic camera pose estimation for minimal configurations of 2d/3d point and line correspondences. In: Asian Conference on Computer Vision”), RP3L (e.g., as described in “Pose estimation from line correspondences: A complete analysis and a series of solutions.
  • LPnL Bar LS e.g. as described in “Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE transactions on pattern analysis and machine intelligence”
  • LPnL Bar ENull e.g. as described in “Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE transactions on pattern analysis and machine intelligence”
  • cvxPnPL e.g. as described in “Cvxpnpl: A unified convex solution to the absolute pose estimation problem from point and line correspondences”
  • OPnPL and EPnPL Planar e.g. as described in “Accurate and linear time pose estimation from points and lines. In: European Conference on Computer Vision.”.
  • Pattern recognition “Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE transactions on pattern analysis and machine intelligence”, and “Camera pose estimation from lines: a fast, robust and general method. Machine Vision and Applications 30,” which are hereby incorporated by reference herein.
  • the camera resolution may be set to 640 x 480 pixels and the focal length to 800.
  • Euler angles α, β, γ may be used to generate the rotation matrix. For each trial, the camera is randomly placed within a [-10m; 10m]^3 cube and the Euler angles are uniformly sampled from α, γ ∈ [0°, 360°] and β ∈ [0°, 180°]. Then N 2D/3D line correspondences are randomly generated.
  • the endpoints of the 2D lines are first randomly generated, then the 3D endpoints are generated by projecting the 2D endpoints into 3D space.
  • the depths of the 3D endpoints are within [4m; 10m]. Then these 3D endpoints are transformed to the world frame.
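  • the synthetic data generation described above can be sketched in Python as follows. Several details are assumptions not stated in the text: a Z-Y-Z Euler convention, a principal point at the image centre, a focal length expressed in pixels, and the convention p_cam = R (p_world − centre). The names `euler_zyz`, `make_trial`, and `project` are hypothetical.

```python
import math
import random

WIDTH, HEIGHT, FOCAL = 640, 480, 800.0
CX, CY = WIDTH / 2.0, HEIGHT / 2.0  # assumed principal point (image centre)


def euler_zyz(alpha, beta, gamma):
    """Rotation matrix Rz(alpha) Ry(beta) Rz(gamma); angles in radians."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [ca * cb * cg - sa * sg, -ca * cb * sg - sa * cg, ca * sb],
        [sa * cb * cg + ca * sg, -sa * cb * sg + ca * cg, sa * sb],
        [-sb * cg, sb * sg, cb],
    ]


def mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(M):
    return [[M[j][i] for j in range(3)] for i in range(3)]


def make_trial(n_lines, rng):
    """Generate one synthetic trial: a pose plus n_lines 2D/3D line pairs."""
    centre = [rng.uniform(-10.0, 10.0) for _ in range(3)]
    R = euler_zyz(math.radians(rng.uniform(0.0, 360.0)),
                  math.radians(rng.uniform(0.0, 180.0)),
                  math.radians(rng.uniform(0.0, 360.0)))
    Rt = transpose(R)
    lines = []
    for _ in range(n_lines):
        pair = []
        for _ in range(2):  # two endpoints per line
            u, v = rng.uniform(0.0, WIDTH), rng.uniform(0.0, HEIGHT)
            depth = rng.uniform(4.0, 10.0)
            # Back-project the pixel into the camera frame at the chosen depth.
            p_cam = [(u - CX) / FOCAL * depth, (v - CY) / FOCAL * depth, depth]
            # Camera frame to world frame: p_world = R^T p_cam + centre.
            p_world = [a + b for a, b in zip(mat_vec(Rt, p_cam), centre)]
            pair.append(((u, v), p_world))
        lines.append(pair)
    return R, centre, lines


def project(R, centre, p_world):
    """Project a world point to pixel coordinates with the pinhole model."""
    x, y, z = mat_vec(R, [a - b for a, b in zip(p_world, centre)])
    return FOCAL * x / z + CX, FOCAL * y / z + CY


R, centre, lines = make_trial(5, random.Random(0))
```

  • generating the 2D endpoints first and back-projecting them guarantees that every line is visible in the image; noise, where desired, would be added to the 2D endpoints afterwards.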
  • Histograms and boxplots may be used to compare the estimation errors.
  • the histogram is used to present the major distribution of the errors while the boxplot may be used to better show the large errors.
  • the central mark of each box indicates the median, and the bottom and top edges indicate the 25th and 75th percentiles, respectively.
  • the whiskers extend to ±2.7 standard deviations, and errors outside this range are plotted individually using the “+” symbol.
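  • the figure of roughly 2.7 standard deviations is consistent with the widely used boxplot convention of drawing whiskers at 1.5 times the interquartile range beyond the quartiles, evaluated for normally distributed data. That the plots use this convention is an assumption; the short Python check below only illustrates the arithmetic.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)
q1 = std_normal.inv_cdf(0.25)   # 25th percentile of a standard normal
q3 = std_normal.inv_cdf(0.75)   # 75th percentile of a standard normal
iqr = q3 - q1                   # interquartile range
upper_whisker = q3 + 1.5 * iqr  # whisker limit in units of sigma
lower_whisker = q1 - 1.5 * iqr
# Fraction of normally distributed samples that fall inside the whiskers.
coverage = std_normal.cdf(upper_whisker) - std_normal.cdf(lower_whisker)
```

  • under these assumptions the whisker limits come out close to ±2.7 standard deviations, so only a small fraction of normally distributed errors would be drawn as individual “+” marks.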
  • the numeric stability of the hidden variable (HV) polynomial solver is compared with the Gröbner, E3Q3, and RE3Q3 algorithms (e.g., as described in “A robust and efficient algorithm for the pnl problem using algebraic distance to approximate the reprojection distance”) using 10,000 trials.
  • Figures 20 A-B show the results. The hidden variable solver is more stable than the other algorithms.
  • K_3 r_3 = −K_7 r_7
  • Figures 23 A-B demonstrate the results.
  • Figure 23 A shows a comparison of mean rotational error in degrees between different P3L algorithms.
  • Figure 23 B shows a boxplot of rotational error between different P3L algorithms.
  • a fixed choice of r_3 may encounter numerical problems when K_3 approximates a singular matrix.
  • the Gram-Schmidt process used in some embodiments of the solution to the algorithm described herein can solve this problem, thus generating more stable results.
  • MinP3L, a solution to the P3L problem as described herein, may be compared with previous P3L algorithms including AlgP3L (e.g., as described in “A stable algebraic camera pose estimation for minimal configurations of 2d/3d point and line correspondences. In:
  • Figures 22 A-B show the results.
  • Figure 22 A shows a box plot of rotation error of an embodiment of an algorithm described herein and algorithms AlgP3L, RP3L and SRP3L.
  • Figure 22 B shows a box plot of translation error of an embodiment of an algorithm described herein and previous algorithms AlgP3L, RP3L and SRP3L.
  • the rotation and translation errors of MinP3L, which is implemented using methods and techniques described herein, are smaller than 10^−5.
  • Other algorithms all yield large errors as shown by the longer tail in the boxplot figures of Figure 22. Then the behavior of the P3L algorithms is considered under varying noise level.
  • Figures 23 A-B show the results.
  • Figure 23 A shows mean rotation errors of an embodiment of an algorithm described herein and previous algorithms AlgP3L, RP3L and SRP3L.
  • Figure 23 B shows mean translation errors of an embodiment of an algorithm described herein and previous algorithms AlgP3L, RP3L and SRP3L.
  • the MinP3L algorithm, implemented using techniques described herein, shows stability. Similar to the noise-free case, the compared algorithms (e.g., as described in “A stable algebraic camera pose estimation for minimal configurations of 2d/3d point and line correspondences. In: Asian Conference on Computer Vision”, “Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE transactions on pattern analysis and machine intelligence”) each have longer tails than algorithms developed using the techniques described herein. This may be caused by the numerically unstable operations in these algorithms.
  • Figures 24 A-D and 25 A-D show the mean and median errors.
  • Figure 24 A shows mean rotation errors of different PnL algorithms.
  • Figure 24 B shows mean translation errors of different PnL algorithms.
  • Figure 24 C shows median rotation errors of different PnL algorithms.
  • Figure 24 D shows median translation errors of different PnL algorithms.
  • Figure 25 A shows mean rotation errors of different PnL algorithms.
  • Figure 25 B shows mean translation errors of different PnL algorithms.
  • Figure 25 C shows median rotation errors of different PnL algorithms.
  • Figure 25 D shows median translation errors of different PnL algorithms.
  • Figure 26 D shows median translation errors of different PnL algorithms.
  • Figure 27 A shows mean rotation errors of different PnL algorithms.
  • Figure 27 B shows mean translation errors of different PnL algorithms.
  • Figure 27 C shows median rotation errors of different PnL algorithms.
  • Figure 27 D shows median translation errors of different PnL algorithms.
  • MinPnL, implemented using techniques and methods described herein, achieves the best results.
  • cvxPnPL and ASPnL (e.g., as described in “Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE transactions on pattern analysis and machine intelligence”) generate large errors which are out of the scope.
  • Some methods and techniques described herein for finding the pose of a camera using features may work even when the feature points and feature lines exist on the same plane.
  • MinPnL achieves the best result among the compared algorithms, except on the MC2 dataset, where its result is slightly worse than that of OAPnL. However, the MinPnL algorithm is much faster, as shown in the next section.
  • Figure 29 A is a diagram of computational time of many algorithms.
  • Figure 29 B is a diagram of computational time of an embodiment of an algorithm described herein as compared to computational times of algorithms involving a polynomial system.
  • Figure 29 C is a diagram of computational time of an embodiment of an algorithm described herein as compared to computational times of algorithms based on linear transformation.
  • Figure 32 shows a diagrammatic representation of a machine in the exemplary form of a computer system 1900 within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed, according to some embodiments.
  • the machine operates as a standalone device or may be connected (e.g., networked) to other machines.
  • the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
  • the exemplary computer system 1900 includes a processor 1902 (e.g., a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 1904 (e.g., read only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), and a static memory 1906 (e.g., flash memory, static random access memory (SRAM), etc.), which communicate with each other via a bus 1908.
  • the computer system 1900 may further include a disk drive unit 1916, and a network interface device 1920.
  • the disk drive unit 1916 includes a machine-readable medium 1922 on which is stored one or more sets of instructions 1924 (e.g., software) embodying any one or more of the methodologies or functions described herein.
  • the software may also reside, completely or at least partially, within the main memory 1904 and/or within the processor 1902 during execution thereof by the computer system 1900, the main memory 1904 and the processor 1902 also constituting machine-readable media.
  • the software may further be transmitted or received over a network 18 via the network interface device 1920.
  • the computer system 1900 includes a driver chip 1950 that is used to drive projectors to generate light.
  • the driver chip 1950 includes its own data store 1960 and its own processor 1962.
  • while the machine-readable medium 1922 is shown in an exemplary embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
  • the term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention.
  • the term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
  • communication network 1928 may be a local area network (LAN), a cell phone network, a Bluetooth network, the internet, or any other such network.
  • the above-described embodiments of the present disclosure can be implemented in any of numerous ways.
  • the embodiments may be implemented using hardware, software or a combination thereof.
  • the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
  • processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component, including commercially available integrated circuit components known in the art by names such as CPU chips, GPU chips, microprocessor, microcontroller, or co-processor.
  • a processor may be implemented in custom circuitry, such as an ASIC, or semicustom circuitry resulting from configuring a programmable logic device.
  • a processor may be a portion of a larger circuit or semiconductor device, whether commercially available, semi-custom or custom.
  • some commercially available microprocessors have multiple cores such that one or a subset of those cores may constitute a processor.
  • a processor may be implemented using circuitry in any suitable format.
  • a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.
  • a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format. In the embodiment illustrated, the input/output devices are illustrated as physically separate from the computing device. In some embodiments, however, the input and/or output devices may be physically integrated into the same unit as the processor or other elements of the computing device. For example, a keyboard might be implemented as a soft keyboard on a touch screen. In some embodiments, the input/output devices may be entirely disconnected from the computing device, and functionally integrated through a wireless connection.
  • Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet.
  • networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
  • the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
  • the disclosure may be embodied as a computer readable storage medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs (CD), optical discs, digital video disks (DVD), magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments of the disclosure discussed above.
  • a computer readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form.
  • Such a computer readable storage medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present disclosure as discussed above.
  • the term “computer-readable storage medium” encompasses only a computer-readable medium that can be considered to be a manufacture (i.e., article of manufacture) or a machine.
  • the disclosure may be embodied as a computer readable medium other than a computer-readable storage medium, such as a propagating signal.
  • the terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the present disclosure as discussed above.
  • one or more computer programs that when executed perform methods of the present disclosure need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the present disclosure.
  • Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • functionality of the program modules may be combined or distributed as desired in various embodiments.
  • data structures may be stored in computer-readable media in any suitable form.
  • data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that conveys relationship between the fields.
  • any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.
  • the disclosure may be embodied as a method, of which an example has been provided.
  • the acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
PCT/US2021/020403 2020-03-03 2021-03-02 Efficient localization based on multiple feature types Ceased WO2021178366A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21765469.8A EP4115329A4 (en) 2020-03-03 2021-03-02 EFFICIENT LOCALIZATION BASED ON MULTIPLE FEATURE TYPES
CN202180018922.3A CN115349140A (zh) 2020-03-03 2021-03-02 基于多种特征类型的有效定位
JP2022552439A JP7701932B2 (ja) 2020-03-03 2021-03-02 複数の特徴タイプに基づく効率的位置特定

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202062984688P 2020-03-03 2020-03-03
US62/984,688 2020-03-03
US202063085994P 2020-09-30 2020-09-30
US63/085,994 2020-09-30

Publications (1)

Publication Number Publication Date
WO2021178366A1 true WO2021178366A1 (en) 2021-09-10

Family

ID=77554890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/020403 Ceased WO2021178366A1 (en) 2020-03-03 2021-03-02 Efficient localization based on multiple feature types

Country Status (5)

Country Link
US (2) US11748905B2 (en)
EP (1) EP4115329A4 (en)
JP (1) JP7701932B2 (en)
CN (1) CN115349140A (en)
WO (1) WO2021178366A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11748905B2 (en) 2020-03-03 2023-09-05 Magic Leap, Inc. Efficient localization based on multiple feature types
IL300043A (en) * 2020-07-29 2023-03-01 Magic Leap Inc External calibration of a camera using ray cuts
US12008740B2 (en) * 2020-08-12 2024-06-11 Niantic, Inc. Feature matching using features extracted from perspective corrected image
WO2023052264A1 (en) * 2021-09-29 2023-04-06 Sony Group Corporation Light-field camera, vision system for a vehicle, and method for operating a vision system for a vehicle
US20230252109A1 (en) * 2022-01-17 2023-08-10 Vmware, Inc Methods and systems that continuously optimize sampling rates for metric data in distributed computer systems by preserving metric-data-sequence patterns and characteristics
US12217450B2 (en) 2022-02-08 2025-02-04 Ford Global Technologies, Llc Vehicle localization
CN120912797B (zh) * 2024-11-12 2026-02-27 京海盛大(上海)科技股份有限公司 一种快速半直接slam地图构建算法

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130120600A1 (en) * 2010-09-14 2013-05-16 Hailin Jin Methods and Apparatus for Subspace Video Stabilization
US20140316698A1 (en) * 2013-02-21 2014-10-23 Regents Of The University Of Minnesota Observability-constrained vision-aided inertial navigation

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6435750B2 (ja) 2014-09-26 2018-12-12 富士通株式会社 3次元座標算出装置、3次元座標算出方法および3次元座標算出プログラム
US9930315B2 (en) 2015-04-29 2018-03-27 Lucid VR, Inc. Stereoscopic 3D camera for virtual reality experience
WO2017022033A1 (ja) 2015-07-31 2017-02-09 富士通株式会社 画像処理装置、画像処理方法および画像処理プログラム
WO2018134686A2 (en) * 2017-01-19 2018-07-26 Mindmaze Holding Sa Systems, methods, device and apparatuses for performing simultaneous localization and mapping
CN111630435B (zh) 2017-12-15 2025-03-14 奇跃公司 用于显示装置的增强的姿势确定
CN108242079B (zh) * 2017-12-30 2021-06-25 北京工业大学 一种基于多特征视觉里程计和图优化模型的vslam方法
US10964053B2 (en) * 2018-07-02 2021-03-30 Microsoft Technology Licensing, Llc Device pose estimation using 3D line clouds
US10948297B2 (en) * 2018-07-09 2021-03-16 Samsung Electronics Co., Ltd. Simultaneous location and mapping (SLAM) using dual event cameras
US11182614B2 (en) 2018-07-24 2021-11-23 Magic Leap, Inc. Methods and apparatuses for determining and/or evaluating localizing maps of image display devices
US10839556B2 (en) * 2018-10-23 2020-11-17 Microsoft Technology Licensing, Llc Camera pose estimation using obfuscated features
US11417017B2 (en) * 2019-04-22 2022-08-16 Texas Instruments Incorporated Camera-only-localization in sparse 3D mapped environments
US11748905B2 (en) 2020-03-03 2023-09-05 Magic Leap, Inc. Efficient localization based on multiple feature types

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4115329A4 *

Also Published As

Publication number Publication date
EP4115329A1 (en) 2023-01-11
JP2023516656A (ja) 2023-04-20
US11748905B2 (en) 2023-09-05
US12106514B2 (en) 2024-10-01
JP7701932B2 (ja) 2025-07-02
US20240029301A1 (en) 2024-01-25
CN115349140A (zh) 2022-11-15
EP4115329A4 (en) 2024-04-24
US20210279909A1 (en) 2021-09-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21765469

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022552439

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021765469

Country of ref document: EP

Effective date: 20221004