WO2023280274A1 - Geometric structure aided visual localization method and system - Google Patents

Geometric structure aided visual localization method and system

Info

Publication number
WO2023280274A1
Authority
WO
WIPO (PCT)
Prior art keywords
real
landmarks
keyframe
geometric structure
map
Prior art date
Application number
PCT/CN2022/104407
Other languages
French (fr)
Inventor
Huaiyang HUANG
Haoyang YE
Jianhao JIAO
Ming Liu
Original Assignee
The Hong Kong University Of Science And Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Hong Kong University Of Science And Technology filed Critical The Hong Kong University Of Science And Technology
Priority to CN202280048273.6A priority Critical patent/CN117616459A/en
Publication of WO2023280274A1 publication Critical patent/WO2023280274A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A geometric structure aided visual localization method and a system for implementing the same are provided. The method comprises: retrieving a 3D map based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks (S301); obtaining a sequence of real-time image frames by a camera of a mobile system (S302); for each real-time image frame, extracting local features from the real-time image frame (S303); predicting a camera pose corresponding to the real-time image frame (S304); creating a keyframe by tracking the predicted camera pose in the 3D map (S305); identifying temporal visible landmarks with respect to the created keyframe (S306); obtaining 2D-3D correspondences between the local features and the temporal visible landmarks (S307); and locating the mobile system by estimating a state for the real-time image frame based on the 2D-3D correspondences (S308).

Description

GEOMETRIC STRUCTURE AIDED VISUAL LOCALIZATION METHOD AND SYSTEM Field of the Invention:
The present invention generally relates to visual localization, and more specifically, to a geometric structure aided visual localization method and system.
Background of the Invention:
Localization systems play a significant role in autonomous navigation and virtual/augmented reality, and several existing techniques based on different sensor configurations are available. Techniques based on a global navigation satellite system and inertial navigation system (GNSS/INS) are widely applied for locating autonomous systems. However, they occasionally fail in urban canyons and are entirely disabled in indoor environments. An alternative solution is to utilize a three-dimensional (3D) light detection and ranging (LiDAR) device. A 3D LiDAR device scans the scene structure, which can be matched against a point cloud map built in advance. However, it is not easy to recover the global location by recognizing a place from an existing database with LiDAR scans, because the appearance information provided by a single scan is limited. In addition, GNSS/INS or LiDAR-based systems capable of recovering centimeter-level sensor poses are typically too expensive to be widely applied for navigation or other applications.
Traditional vision-based methods match the query image against a sparse 3D scene model enriched with visual features. The camera pose can then be estimated from three-dimension to two-dimension (3D-2D) correspondences. In contrast to GNSS/INS or LiDAR-based systems, visual localization systems can be far less expensive. However, this comes at a price: visual perception is sensitive to varying conditions, and metric information cannot be obtained directly with vision sensors. These two drawbacks make common visual localization systems less robust and limit their application scenarios.
To alleviate these issues in visual localization systems, some methods introduce information from a dense map generated beforehand, which helps constrain the visual structure and thus leads to better pose estimation results. Yet, compared to traditional vision-based pipelines, these systems generally require more computational resources, which can only be supplied by a high-end desktop. The computational requirement limits the large-scale application of these methods, especially on embedded devices. Therefore, developing a low-cost localization system that still ensures robustness, accuracy and scalability remains an open problem. It is also desirable to have a visual localization method and system that is computationally efficient so as to have the potential to be applied on edge computing devices.
Summary of the Invention:
According to one aspect of the present invention, a geometric structure aided visual localization method is provided for navigating a mobile system. The method comprises: retrieving a 3D map 
Figure PCTCN2022104407-appb-000001
based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks
Figure PCTCN2022104407-appb-000002
and obtaining a sequence of real-time image frames by a camera of a mobile system; for each real-time image frame, extracting one or more local features from the real-time image frame; predicting a camera pose corresponding to the real-time image frame; creating a keyframe by tracking the predicted camera pose in the 3D map; identifying one or more temporal visible landmarks with respect to the created keyframe; obtaining one or more two dimensions to three dimensions (2D-3D) correspondences between the local features and the temporal visible landmarks; and locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
According to another aspect of the present invention, a geometric structure aided visual localization system is provided for navigating a mobile system. The system comprises a camera configured to obtain a sequence of real-time image frames; a first processor including a mapping module configured to retrieve a 3D map 
Figure PCTCN2022104407-appb-000003
based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks 
Figure PCTCN2022104407-appb-000004
and a second processor including a localization module configured to process each real-time image frame by: extracting one or more local features from the real-time image frame; predicting a camera pose corresponding to the real-time image frame; creating a keyframe by tracking the predicted camera pose in the 3D map; identifying one or more temporal visible landmarks with respect to the created keyframe; obtaining one or more 2D-3D correspondences between the local features and the temporal visible landmarks; and locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
Preferably, the camera pose is predicted through a visual-inertial odometry process comprising: obtaining real-time inertial measurement data by an inertial measurement unit of the mobile system; pre-integrating the real-time inertial measurement data into a single relative motion constraint; and estimating the camera pose based on the single relative motion constraint.
Preferably, the one or more temporal visible landmarks are identified by: dividing the 3D map into a set of voxels to form a voxelized map, wherein each mapped landmark being intersected with one or more voxels; casting a plurality of rays from an optical center of the created keyframe into the voxelized map; for each casted ray, locating one or more voxels along the casted ray; and identifying one or more mapped landmarks which are intersected with the located voxels as temporal visible landmarks with respect to the created keyframe.
The provided visual localization method and system can take input data from a low-cost sensor suite, such as a visual-inertial sensor, and leverage other modalities, such as a high-definition map generated by an expensive LiDAR device and an integrated navigation system. As such, the advantages of both geometric structure and visual structure can be exploited without an initial pose for bootstrapping the system. Localization accuracy and computational efficiency can thus be well balanced.
Brief Description of the Drawings:
Aspects of the present disclosure may be readily understood from the following detailed description with reference to the accompanying figures. The illustrations may not necessarily be drawn to scale. Common reference numerals may be used throughout the drawings and the detailed description to indicate the same or similar components.
FIG. 1 shows a block diagram of a geometric structure aided visual localization system according to one embodiment of the present invention;
FIG. 2 shows exemplary sensor setup of the geometric structure aided visual localization system of FIG. 1;
FIG. 3 shows a flowchart of a localization method according to one embodiment of the present invention;
FIG. 4 shows an exemplary visual-inertial bundle adjustment pipeline according to one embodiment of the present invention; and
FIG. 5 shows a factor graph representation for the visual-inertial bundle adjustment pipeline of FIG. 4.
Detailed Description:
In the following description, preferred examples of the present disclosure will be set forth as embodiments which are to be regarded as illustrative rather than restrictive. Specific details may  be omitted so as not to obscure the present disclosure; however, the disclosure is written to enable one skilled in the art to practice the teachings herein without undue experimentation.
FIG. 1 shows a block diagram of a geometric structure aided visual localization system according to one embodiment of the present invention. FIG. 2 shows exemplary sensor setup for the visual localization system of FIG. 1.
As shown, the visual localization system may be fully decoupled into a mapping module 11 and a localization module 12. The mapping module 11 may be operated offline in advance to generate an accurate localizable map, which is expected to be fully reusable for long-term localization. The prior localizable map can be built offline with more sophisticated sensors. To this end, the visual localization system may take input trajectories estimated from an integrated navigation system combining information from a laser scanner, a camera system and GNSS/INS (in functional block 102: integrated navigation). In addition, both geometric and visual features are extracted from the laser and visual data, respectively (in functional block 104: geometrical/visual feature extraction). Finally, the geometric and visual features serve as map elements for constructing the localizable map. The features under local frames are transformed with the estimated poses and aggregated to form the localizable map (in functional block 106: map aggregation).
With the prior localizable map, the localization module 12 can be operated separately. The localization module 12 takes as input a query image and/or inertial measurement data from a visual-inertial sensor suite. The visual-inertial sensor suite may include a camera system and an inertial measurement unit (IMU). For each input query image, local features are extracted to represent the current frame (in functional block 108: local feature extraction). The inertial measurement data since the last frame is pre-integrated into a relative transform and corresponding covariance (in functional block 110: pre-integration) to estimate a prior pose. The local features are first matched and the last frame is tracked with the prior pose as the initial guess (in functional block 112: inter-frame tracking). With a relatively accurate pose estimated, the local features are then associated with the map elements obtained from the localizable map to generate landmarks (in functional block 114: local feature association). Finally, the optimizable parameters are jointly adjusted to refine the poses and inertial parameters (in functional block 116: joint optimization).
As shown in FIG. 2, the mapping module 11 may include a high-standard sensor suite including, but not limited to, laser scanners 22, a camera system 24 and a GNSS/INS 26 for obtaining high-accuracy and high-density data to construct the prior localizable and reusable map. The localization module 12 may include an inexpensive visual-inertial (VI) sensor 28 (e.g., the VI sensors equipped on a smart phone) to obtain visual data and inertial measurement data such that the cost can be reduced to the maximum extent.
The mapping module 11 and the localization module 12 may be implemented with one or more processors (not shown) which may be CPU, MCU, application specific integrated circuits (ASIC) , field programmable gate arrays (FPGA) or any suitable programmable logic devices configured or programmed to be a processor for performing the mapping and localization. The localization system may further include a memory unit (not shown) which may include a volatile memory unit (such as RAM) , a non-volatile unit (such as ROM, EPROM, EEPROM and flash memory) or both, or any type of media or devices suitable for storing instructions, codes, and/or data.
FIG. 3 shows a flowchart of a visual localization method according to one embodiment of the present invention. As shown, the visual localization method may comprise:
S301: retrieving a 3D map 
Figure PCTCN2022104407-appb-000005
based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks
Figure PCTCN2022104407-appb-000006
S302: obtaining a sequence of real-time image frames by a camera of a mobile system;
S303: for each real-time image frame, extracting one or more local features from the real-time image frame;
S304: predicting a camera pose corresponding to the real-time image frame;
S305: creating a keyframe by tracking the predicted camera pose in the 3D map;
S306: identifying one or more temporal visible landmarks with respect to the created keyframe;
S307: obtaining one or more 2D-3D correspondences between the local features and the temporal visible landmarks; and
S308: locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
In Step S301, the visual structure may be constructed by leveraging a simultaneous localization and mapping (SLAM) or structure from motion (SfM) method, where the landmarks are filtered by the number of frames that can observe them. The geometric structure may be constructed using a point cloud representing shapes of landmarks in the 3D map.
The Gaussian mixture model is expressed and parameterized as:
Figure PCTCN2022104407-appb-000007
where x_i is the 3D position of the i-th mapped landmark,
Figure PCTCN2022104407-appb-000008
is the likelihood of the landmark position x_i when the j-th Gaussian distribution
Figure PCTCN2022104407-appb-000009
is given; w_j is the weight of the j-th Gaussian distribution for each landmark, and μ_j and Σ_j are the mean and covariance of the j-th Gaussian distribution, respectively.
Each Gaussian distribution of a mapped landmark is parameterized as [μ, Σ] with Σ = R S R^T, where S = diag(λ_1, λ_2, λ_3), λ_1, λ_2, λ_3 are the singular values satisfying λ_1 < λ_2 < λ_3, R = [e_1, e_2, e_3], and e_1, e_2, e_3 are the eigen-bases.
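As a minimal NumPy sketch (assuming, for illustration only, that a component is fitted to a local point-cloud patch; the actual map-building procedure of the disclosure may differ), such a Gaussian component [μ, Σ] and its decomposition Σ = R S R^T with λ_1 < λ_2 < λ_3 could be obtained as follows:

    import numpy as np

    def fit_gaussian_component(points):
        """Parameterize a local point-cloud patch (Nx3 array) as a Gaussian
        component [mu, Sigma] and decompose Sigma = R S R^T."""
        mu = points.mean(axis=0)
        centered = points - mu
        sigma = centered.T @ centered / max(len(points) - 1, 1)   # 3x3 covariance
        # Order the singular values ascending (lambda_1 < lambda_2 < lambda_3),
        # so that e_1 spans the thinnest direction of the structure
        # (e.g. the normal direction of a near-planar patch).
        u, s, _ = np.linalg.svd(sigma)      # SVD returns s in descending order
        order = np.argsort(s)
        S = np.diag(s[order])
        R = u[:, order]                     # columns are the eigen-bases e_1, e_2, e_3
        return mu, sigma, R, S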
In Step S304, the camera pose may be predicted through a visual-inertial odometry process including: obtaining real-time inertial measurement data by an inertial measurement unit of the mobile system; pre-integrating the real-time inertial measurement data into a single relative motion constraint; and estimating the camera pose based on the single relative motion constraint.
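A simplified sketch of such pre-integration is given below. It folds the raw gyroscope/accelerometer samples between two frames into a single relative rotation, velocity and position change; the bias Jacobians and noise-covariance propagation that a complete visual-inertial odometry front end would also maintain are omitted here and are assumptions outside this disclosure:

    import numpy as np

    def so3_exp(w):
        """Rodrigues formula: map a rotation vector to a rotation matrix."""
        theta = np.linalg.norm(w)
        if theta < 1e-9:
            return np.eye(3)
        k = w / theta
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

    def preintegrate(imu_samples, gyro_bias, accel_bias):
        """Fold IMU samples between two frames into one relative motion constraint.
        Each sample is a tuple (gyro [rad/s], accel [m/s^2], dt [s])."""
        dR = np.eye(3)       # pre-integrated relative rotation
        dv = np.zeros(3)     # pre-integrated velocity change
        dp = np.zeros(3)     # pre-integrated position change
        for gyro, accel, dt in imu_samples:
            a = dR @ (np.asarray(accel) - accel_bias)
            dp += dv * dt + 0.5 * a * dt * dt
            dv += a * dt
            dR = dR @ so3_exp((np.asarray(gyro) - gyro_bias) * dt)
        return dR, dv, dp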
Alternatively, the camera pose may be predicted through a place recognition process including: retrieving a plurality of keyframe candidates with a global descriptor based on the initialization position data; grouping the keyframe candidates into clusters by their covisibility with respect to one or more landmarks; matching the local features against each cluster to obtain correspondences between the real-time image frame and the keyframe candidates; recovering the camera pose via a perspective-n-point (PnP) scheme with a random sample consensus (RANSAC) method; and refining the camera pose with inlier matches.
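The final two steps of this place recognition process (pose recovery via PnP with RANSAC and refinement with the inlier matches) can be sketched with OpenCV as below; candidate retrieval and covisibility clustering are assumed to be handled elsewhere, and the threshold values are illustrative only:

    import numpy as np
    import cv2

    def recover_pose_pnp(points_3d, points_2d, camera_matrix):
        """Recover a camera pose from 2D-3D matches with PnP + RANSAC, then refine
        it with the inlier matches. points_3d: Nx3, points_2d: Nx2 (undistorted)."""
        obj = np.ascontiguousarray(points_3d, dtype=np.float64)
        img = np.ascontiguousarray(points_2d, dtype=np.float64)
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            obj, img, camera_matrix, None,
            reprojectionError=3.0, iterationsCount=100)
        if not ok or inliers is None:
            return None
        idx = inliers.ravel()
        # Refine the pose with inlier matches only (Levenberg-Marquardt refinement).
        rvec, tvec = cv2.solvePnPRefineLM(obj[idx], img[idx],
                                          camera_matrix, None, rvec, tvec)
        R_cw, _ = cv2.Rodrigues(rvec)
        return R_cw, tvec.ravel()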
In Step S306, the temporal visible landmarks may be identified with respect to the created keyframe by: projecting the mapped landmarks onto the created keyframe; and filtering the projected landmarks based on one or more visibility criteria to identify one or more projected landmarks which pass the visibility criteria as the one or more temporal visible landmarks with respect to the created keyframe.
Alternatively, the temporal visible landmarks may be identified with respect to the created keyframe by: dividing the 3D map into a set of voxels to form a voxelized map, wherein each mapped landmark is intersected with one or more voxels; casting a plurality of rays from an optical center of the created keyframe into the voxelized map; locating voxels along each casted ray; and identifying mapped landmarks which are intersected with the located voxels as temporal visible landmarks with respect to the created keyframe.
Each casted ray may be expressed as [v, t_wc], where t_wc is the position of the optical center under the world coordinate and v is the normalized direction vector of the casted ray. v may be computed by v = R_wc π^(-1)(u), where u is the pixel location of each local feature, π denotes the camera projection function, and R_wc is the rotation matrix from the world coordinate to the camera coordinate. The position of each temporal visible landmark is parameterized as p = t_wc + (1/ρ) v, where p represents the positions of the located voxels and ρ is the inverse depth of the temporal visible landmark.
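A possible sketch of this ray casting is shown below. A pinhole model with intrinsic matrix K, the voxel hashing and the fixed-step traversal are assumptions for illustration (an exact voxel traversal such as a 3D digital differential analyzer could be substituted), and voxel_map is a hypothetical dictionary mapping an integer voxel index to the Gaussian components intersecting that voxel:

    import numpy as np

    def cast_ray(u, K, R_wc, t_wc, voxel_map, voxel_size, max_depth=50.0, step=0.5):
        """Back-project pixel u into a world-frame ray [v, t_wc] and collect the
        Gaussian components stored in the voxels marched through by the ray."""
        # v = R_wc * pi^(-1)(u): rotate the normalized camera ray into the world frame.
        ray_cam = np.linalg.inv(K) @ np.array([u[0], u[1], 1.0])
        v = R_wc @ (ray_cam / np.linalg.norm(ray_cam))
        candidates = []
        depth = step
        while depth < max_depth:
            p = t_wc + depth * v                               # point along the ray
            key = tuple(np.floor(p / voxel_size).astype(int))  # voxel index of p
            candidates.extend(voxel_map.get(key, []))
            depth += step
        return v, candidates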
After casting the ray into the voxelized map, each voxel in the retrieved map is examined to determine whether a geometric component (or a Gaussian distribution of a landmark) contained in the voxel can be associated with the local feature. Assume a geometric structure parameterized as a 3D Gaussian distribution (μ, Σ), which can be decomposed using singular value decomposition (SVD): Σ = R S R^T, S = diag(λ_1, λ_2, λ_3), R = [e_1, e_2, e_3]. The ray parameters [v, t_wc] can then be whitened by:
Figure PCTCN2022104407-appb-000010
Figure PCTCN2022104407-appb-000011
The Mahalanobis distance between the whitened ray parameters [v, t_wc] and the Gaussian component can then be used to determine whether the geometric component is intersected by the ray.
Applying the whitening transform, the ray parameters become:
Figure PCTCN2022104407-appb-000012
Figure PCTCN2022104407-appb-000013
The Mahalanobis distance is then given by:
Figure PCTCN2022104407-appb-000014
If the Mahalanobis distance is within a threshold, the inverse depth can be obtained through:
Figure PCTCN2022104407-appb-000015
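The exact whitening, distance and inverse-depth expressions are given by the formulas referenced above; the sketch below uses one common formulation as an assumption (whiten the ray with S^(-1/2) R^T, take the perpendicular distance of the whitened ray to the component centre as the Mahalanobis distance, and read the inverse depth off the closest point along the ray):

    import numpy as np

    def associate_ray_with_gaussian(v, t_wc, mu, R, S, gate=3.0):
        """Test whether a ray [v, t_wc] intersects a Gaussian component (mu, Sigma)
        with Sigma = R S R^T; if so, return the inverse depth of the intersection."""
        W = np.diag(1.0 / np.sqrt(np.diag(S))) @ R.T   # whitening transform S^(-1/2) R^T
        v_w = W @ v                                    # whitened ray direction
        t_w = W @ (t_wc - mu)                          # whitened ray origin (component at 0)
        s_star = -(t_w @ v_w) / (v_w @ v_w)            # ray parameter of the closest point
        closest = t_w + s_star * v_w
        mahalanobis = np.linalg.norm(closest)
        if mahalanobis > gate or s_star <= 0.0:        # reject distant or behind-camera hits
            return None
        return 1.0 / s_star                            # inverse depth rho, p = t_wc + (1/rho) v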
Optionally, the positions of the identified temporal visible landmarks may be further optimized through an iterative closest point (ICP) scheme including: selecting, for each temporal visible landmark, a set of nearest projected Gaussian distributions as association candidates; optimizing the position of the temporal visible landmark by iterating over the association candidates; and selecting a final association candidate which leads to the minimum reprojection error.
In Step S308, the state for the real-time image frame may be estimated by solving a least-squares optimization problem that minimizes a total energy function E_total given by:
E_total = E_visual + E_structure + E_prior,
where E_visual, E_structure and E_prior are the visual term, structure term and prior term of the total energy function, respectively.
The visual term E_visual is given by:
Figure PCTCN2022104407-appb-000016
where ρ(·) is a probability density function; e_proj(x_i, ξ_k) is the reprojection residual term for projecting the i-th mapped landmark onto the k-th image frame, which depends on the predicted camera pose ξ_k for the k-th image frame and the 3D position x_i of the i-th mapped landmark; and
Figure PCTCN2022104407-appb-000017
Figure PCTCN2022104407-appb-000018
is the covariance of the 2D Gaussian distribution for the i-th mapped landmark projected on the k-th image frame, 
Figure PCTCN2022104407-appb-000019
is the association set for the visual term.
In general, the reprojection residual term e_proj is given by:
Figure PCTCN2022104407-appb-000020
where u_ik is the 2D location of the visual feature corresponding to the i-th temporal visible landmark projected in the k-th image frame,
Figure PCTCN2022104407-appb-000021
and
Figure PCTCN2022104407-appb-000022
are rotation and translation matrices for projecting the landmark from the world coordinate of the 3D map to the k-th image frame respectively.
When the camera pose is predicted through a visual-inertial odometry process, the reprojection residual term e_proj may be expressed as:
Figure PCTCN2022104407-appb-000023
where
Figure PCTCN2022104407-appb-000024
and
Figure PCTCN2022104407-appb-000025
are rotation and translation matrices for transforming from the world coordinate of the 3D map to the body coordinate of the inertial measurement unit with respect to the k-th image frame; 
Figure PCTCN2022104407-appb-000026
and
Figure PCTCN2022104407-appb-000027
are rotation and translation matrices for transforming from the body coordinate of the inertial measurement unit to the k-th image frame.
The structure term E_structure may be given by:
Figure PCTCN2022104407-appb-000028
where
Figure PCTCN2022104407-appb-000029
is the association set for the structure term, 
Figure PCTCN2022104407-appb-000030
is the j-th Gaussian distribution for the i-th mapped landmark, e_str is a residual term, e_str_deg is a residual term for a degenerate case, and
Figure PCTCN2022104407-appb-000031
is a function indicating whether the i-th mapped landmark is degenerate or not, defined as
Figure PCTCN2022104407-appb-000032
For a given likelihood of the i-th landmark, the structure residual term e_str is given by:
Figure PCTCN2022104407-appb-000033
where x_i is the 3D position of the i-th mapped landmark, and μ_j and Σ_j are the mean and covariance of the j-th Gaussian distribution, respectively.
The structure residual term for the degenerate case, e_str_deg, is given by:
Figure PCTCN2022104407-appb-000034
where x_i is the 3D position of the i-th mapped landmark, μ_j is the mean of the j-th Gaussian distribution, e_j1 is the first eigen-base of the j-th Gaussian distribution, and
Figure PCTCN2022104407-appb-000035
where σ_str is the pre-defined variance of the geometric structure.
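The exact expressions of e_str and e_str_deg are given by the referenced formulas; the sketch below only illustrates one plausible reading of them, stated as an assumption: a whitened (Mahalanobis-type) residual against the full Gaussian for the regular case, and a deviation along the first eigen-base e_j1 scaled by σ_str for the degenerate case:

    import numpy as np

    def structure_residual(x_i, mu_j, sigma_j, e_j1, sigma_str, degenerate):
        """Assumed structure residual: whitened residual against (mu_j, Sigma_j),
        or a point-to-plane style residual along e_j1 in the degenerate case."""
        d = x_i - mu_j
        if degenerate:
            return np.atleast_1d((e_j1 @ d) / sigma_str)
        # Whiten d by the inverse covariance so that ||residual||^2 equals the
        # Mahalanobis distance d^T Sigma_j^{-1} d.
        L = np.linalg.cholesky(np.linalg.inv(sigma_j))
        return L.T @ d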
The prior term E_prior may be defined as an initial residual term and given by:
Figure PCTCN2022104407-appb-000036
where
Figure PCTCN2022104407-appb-000037
is a preset camera pose based on the initialization position data and
Figure PCTCN2022104407-appb-000038
is an actual pose.
Therefore, the optimized landmark position may be formulated as:
Figure PCTCN2022104407-appb-000039
where e_proj(x_i, ξ_k) is the reprojection residual term, ξ_k is the pose parameterization of the k-th image frame, and Σ_ik is the pre-defined variance for the feature corresponding to the i-th mapped landmark.
The least-squares optimization problem may be solved by the Levenberg-Marquardt method, which gives a system as follows:
H = J^T W J + εI,
b = -J^T W r
where J and r are the stacked Jacobian matrices and residuals, respectively, W is the weight matrix formed by stacking the inverses of the covariances of the residual terms, and ε is the damping factor for the Levenberg-Marquardt method.
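A direct NumPy transcription of this Levenberg-Marquardt system is given below (a dense solve is used for brevity; a practical implementation would exploit the sparsity of J):

    import numpy as np

    def lm_step(J, r, W, damping):
        """One Levenberg-Marquardt step: H = J^T W J + eps*I, b = -J^T W r,
        then solve H dx = b for the state update dx."""
        H = J.T @ W @ J + damping * np.eye(J.shape[1])
        b = -J.T @ W @ r
        return np.linalg.solve(H, b)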
In the case that some of the states cannot be properly estimated, especially when inertial measurements are incorporated, a batched yet efficient optimization problem may be formulated, which can be solved within 10 ms on average. Firstly, the batched optimization problem may be treated as a bundle adjustment (BA), which can be decoupled into a visual-inertial (or motion-only or structureless) BA and a structure-only BA.
FIG. 4 shows an exemplary visual-inertial BA pipeline 400 according to one embodiment of the present invention. FIG. 5 shows a factor graph representation for the visual-inertial BA pipeline. The bundle adjustment 400 may include a frame tracking module 401, a marginalization module 402, a joint optimization module 403 and a parameter updating module 404. Although the decoupled scheme accelerates the optimization, iteratively re-linearizing the visual factors can be computationally inefficient, while the instant localization can already provide a relatively accurate estimation. As shown in FIGS. 4 and 5, the visual factors obtained via instant localization may be marginalized as a pose prior instead of using the original visual factors. In other words, after optimizing the state for the real-time image frame, the visual factors may be "marginalized out", which forms a prior term for the pose of the real-time image frame. This eliminates all the visual factors in the optimization, which significantly reduces the computational resources required for visual-inertial bundle adjustment.
More specifically, the estimated value and corresponding covariance preserved in the instant localization serve as a prior term in the bundle adjustment. The joint optimization adjusts the camera poses along with the inertial parameters, which includes the prior term in the energy function as follows:
Figure PCTCN2022104407-appb-000040
Figure PCTCN2022104407-appb-000041
where e_rot and e_trans are the rotational and translational terms for the visual-inertial bundle adjustment.
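A sketch of such a marginalized pose prior is given below; the choice of e_rot as the axis-angle of the relative rotation, e_trans as the translation difference, and the whitening by the preserved 6x6 covariance are assumptions for illustration rather than the exact parameterization of the disclosure:

    import numpy as np

    def pose_prior_residual(R_est, t_est, R_prior, t_prior, cov_prior):
        """Whitened pose-prior residual built from the pose (R_prior, t_prior) and
        covariance preserved by the instant localization."""
        dR = R_prior.T @ R_est
        angle = np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
        if angle < 1e-9:
            e_rot = np.zeros(3)
        else:
            # Axis-angle (SO(3) log map) of the relative rotation.
            e_rot = angle / (2.0 * np.sin(angle)) * np.array(
                [dR[2, 1] - dR[1, 2], dR[0, 2] - dR[2, 0], dR[1, 0] - dR[0, 1]])
        e_trans = t_est - t_prior
        e = np.concatenate([e_rot, e_trans])
        # Whiten by the preserved covariance so the prior is weighted consistently
        # with the other factors in the bundle adjustment.
        L = np.linalg.cholesky(np.linalg.inv(cov_prior))
        return L.T @ e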
Accordingly, the camera pose may be optimized by minimizing an objective function:
Figure PCTCN2022104407-appb-000042
where F is the set of keyframes in the current window, whose cardinality |F| may be set to 15.
Assuming the pose estimation reaches the local optimum, the update of the state is
Figure PCTCN2022104407-appb-000043
and thus the prior error state can be formulated as:
Figure PCTCN2022104407-appb-000044
Then the total objective function may be changed to:
Figure PCTCN2022104407-appb-000045
The problem can then be solved by the Levenberg-Marquardt method. The advantage of this scheme is that there is no need to re-linearize the visual factors, which is more efficient in computation.
The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the  invention for various embodiments and with various modifications that are suited to the particular use contemplated. While the methods disclosed herein have been described with reference to particular operations performed in a particular order, it will be understood that these operations may be combined, sub-divided, or re-ordered to form an equivalent method without departing from the teachings of the present disclosure. Accordingly, unless specifically indicated herein, the order and grouping of the operations are not limitations. While the apparatuses disclosed herein have been described with reference to particular structures, shapes, materials, composition of matter and relationships…etc., these descriptions and illustrations are not limiting. Modifications may be made to adapt a particular situation to the objective, spirit and scope of the present disclosure. All such modifications are intended to be within the scope of the claims appended hereto.

Claims (20)

  1. A geometric structure aided visual localization method, comprising:
    retrieving a 3D map
    Figure PCTCN2022104407-appb-100001
    based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks
    Figure PCTCN2022104407-appb-100002
    obtaining a sequence of real-time image frames by a camera of a mobile system;
    for each real-time image frame:
    extracting one or more local features from the real-time image frame;
    predicting a camera pose corresponding to the real-time image frame;
    creating a keyframe by tracking the predicted camera pose in the 3D map;
    identifying one or more temporal visible landmarks with respect to the created keyframe;
    obtaining one or more 2D-3D correspondences between the local features and the temporal visible landmarks; and
    locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
  2. The geometric structure aided visual localization method according to claim 1, wherein the camera pose is predicted through a place recognition process comprising:
    retrieving a plurality of keyframe candidates with a global descriptor based on the initialization position data;
    grouping the keyframe candidates into clusters by their covisibility;
    matching the local features against each cluster to obtain a correspondence between the real-time image frame and the keyframe candidates;
    recovering the camera pose via a perspective-n-point (PnP) scheme with a random sample consensus (RANSAC) method; and
    refining the camera pose with inlier matches.
  3. The geometric structure aided visual localization method according to claim 1, wherein the camera pose is predicted through a visual-inertial odometry process comprising:
    obtaining real-time inertial measurement data by an inertial measurement unit of the mobile system;
    pre-integrating the real-time inertial measurement data into a single relative motion constraint; and
    estimating the camera pose based on the single relative motion constraint.
  4. The geometric structure aided visual localization method according to claims 2 or 3, wherein the step of identifying the one or more temporal visible landmarks includes:
    projecting the mapped landmarks onto the created keyframe; and
    filtering the projected landmarks based on one or more visibility criteria to identify one or more projected landmarks which pass the visibility criteria as the one or more temporal visible landmarks with respect to the created keyframe.
  5. The geometric structure aided visual localization method according to claims 2 or 3, wherein the step of identifying the one or more temporal visible landmarks includes:
    dividing the 3D map into a set of voxels to form a voxelized map, wherein each mapped landmark is intersected with one or more voxels;
    casting a plurality of rays from an optical center of the created keyframe into the voxelized map;
    for each casted ray:
    locating one or more voxels along the casted ray; and
    identifying one or more mapped landmarks which are intersected with the located voxels as temporal visible landmarks with respect to the created keyframe.
  6. The geometric structure aided visual localization method according to claims 4 or 5, wherein the step of estimating the state for the real-time image frame includes solving a least-squares optimization problem and minimizing a total energy function E_total given by:
    E_total = E_visual + E_structure + E_prior,
    where E_visual, E_structure and E_prior are the visual term, structure term and prior term of the total energy function, respectively.
  7. The geometric structure aided visual localization method according to claim 6, wherein positions of the temporal visible landmarks are optimized through an iterative closest point (ICP) scheme.
  8. The geometric structure aided visual localization method according to claim 7, wherein the ICP scheme includes:
    selecting, for each temporal visible landmark, a set of nearest projected Gaussian distributions as association candidates;
    optimizing the position of the temporal visible landmark by iterating over the association candidates; and
    selecting a final association candidate which leads to the minimum reprojection error.
  9. The geometric structure aided visual localization method according to claim 1, further comprising constructing the 3D map with a point cloud which is generated by a LiDAR.
  10. The geometric structure aided visual localization method according to claim 1, wherein the initialization position data is obtained through a GNSS/INS unit.
  11. A geometric structure aided visual localization system, comprising:
    a camera configured to obtain a sequence of real-time image frames;
    a first processor including a mapping module configured to retrieve a 3D map
    Figure PCTCN2022104407-appb-100003
    based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks
    Figure PCTCN2022104407-appb-100004
    and
    a second processor including a localization module configured to process each real-time image frame by:
    extracting one or more local features from the real-time image frame;
    predicting a camera pose corresponding to the real-time image frame;
    creating a keyframe by tracking the predicted camera pose in the 3D map;
    identifying one or more temporal visible landmarks with respect to the created keyframe;
    obtaining one or more 2D-3D correspondences between the local features and the temporal visible landmarks; and
    locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
  12. The geometric structure aided visual localization system according to claim 11, wherein the localization module is further configured to predict the camera pose corresponding to the real-time image frame by:
    retrieving a plurality of keyframe candidates with a global descriptor based on the initialization position data;
    grouping the keyframe candidates into clusters by their covisibility;
    matching the local features against each cluster to obtain a correspondence between the real-time image frame and the keyframe candidates;
    recovering the camera pose via a perspective-n-point (PnP) scheme with a random sample consensus (RANSAC) method; and
    refining the camera pose with inlier matches.
  13. The geometric structure aided visual localization system according to claim 11, further comprising an inertial measurement unit configured to obtain real-time inertial measurement data; and
    wherein the localization module is further configured to predict the camera pose corresponding to the real-time image frame by:
    pre-integrating the real-time inertial measurement data into a single relative motion constraint; and
    estimating the camera pose based on the single relative motion constraint.
  14. The geometric structure aided visual localization system according to claims 12 or 13, wherein the localization module is further configured to identify the one or more temporal visible landmarks by:
    projecting the mapped landmarks onto the created keyframe; and
    filtering the projected landmarks based on one or more visibility criteria to identify one or more projected landmarks which pass the visibility criteria as the one or more temporal visible landmarks with respect to the created keyframe.
  15. The geometric structure aided visual localization system according to claims 12 or 13, wherein the localization module is further configured to identify the one or more temporal visible landmarks by:
    dividing the 3D map into a set of voxels to form a voxelized map, wherein each mapped landmark is intersected with one or more voxels;
    casting a plurality of rays from an optical center of the created keyframe into the voxelized map;
    for each casted ray:
    locating one or more voxels along the casted ray; and
    identifying one or more mapped landmarks which are intersected with the located voxels as temporal visible landmarks with respect to the created keyframe.
  16. The geometric structure aided visual localization system according to claims 14 or 15, wherein the localization module is further configured to estimate the state for the real-time image frame by solving a least-squares optimization problem and minimizing a total energy function E_total given by:
    E_total = E_visual + E_structure + E_prior,
    where E_visual, E_structure and E_prior are the visual term, structure term and prior term of the total energy function, respectively.
  17. The geometric structure aided visual localization system according to claim 16, wherein the localization module is further configured to optimize positions of the temporal visible landmarks through an iterative closest point (ICP) scheme.
  18. The geometric structure aided visual localization system according to claim 17, wherein the localization module is further configured to perform the ICP scheme by:
    selecting, for each temporal visible landmark, a set of nearest projected Gaussian distributions as association candidates;
    optimizing the position of the temporal visible landmark by iterating over the association candidates; and
    selecting a final association candidate which leads to the minimum reprojection error.
  19. The geometric structure aided visual localization system according to claim 11, comprising a LiDAR configured to generate a point cloud for constructing the 3D map.
  20. The geometric structure aided visual localization system according to claim 11, comprising a GNSS/INS configured to obtain the initialization position data.
PCT/CN2022/104407 2021-07-07 2022-07-07 Geometric structure aided visual localization method and system WO2023280274A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280048273.6A CN117616459A (en) 2021-07-07 2022-07-07 Geometry-aided visual localization method and system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163218925P 2021-07-07 2021-07-07
US63/218,925 2021-07-07
US202163227352P 2021-07-30 2021-07-30
US63/227,352 2021-07-30

Publications (1)

Publication Number Publication Date
WO2023280274A1 true WO2023280274A1 (en) 2023-01-12

Family

ID=84800336

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104407 WO2023280274A1 (en) 2021-07-07 2022-07-07 Geometric structure aided visual localization method and system

Country Status (1)

Country Link
WO (1) WO2023280274A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140333741A1 (en) * 2013-05-08 2014-11-13 Regents Of The University Of Minnesota Constrained key frame localization and mapping for vision-aided inertial navigation
CN108717710A (en) * 2018-05-18 2018-10-30 京东方科技集团股份有限公司 Localization method, apparatus and system under indoor environment
CN108986037A (en) * 2018-05-25 2018-12-11 重庆大学 Monocular vision odometer localization method and positioning system based on semi-direct method
WO2021004416A1 (en) * 2019-07-05 2021-01-14 杭州海康机器人技术有限公司 Method and apparatus for establishing beacon map on basis of visual beacons
CN110782494A (en) * 2019-10-16 2020-02-11 北京工业大学 Visual SLAM method based on point-line fusion
CN112596064A (en) * 2020-11-30 2021-04-02 中科院软件研究所南京软件技术研究院 Laser and vision integrated indoor robot global positioning method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YE HAOYANG; HUANG HUAIYANG; HUTTER MARCO; SANDY TIMOTHY; LIU MING: "3D Surfel Map-Aided Visual Relocalization with Learned Descriptors", 2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE, 30 May 2021 (2021-05-30), pages 5574 - 5581, XP033989857, DOI: 10.1109/ICRA48506.2021.9561005 *
YE HAOYANG; HUANG HUAIYANG; LIU MING: "Monocular Direct Sparse Localization in a Prior 3D Surfel Map", 2020 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE, 31 May 2020 (2020-05-31), pages 8892 - 8898, XP033826358, DOI: 10.1109/ICRA40945.2020.9197022 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475092A (en) * 2023-12-27 2024-01-30 安徽蔚来智驾科技有限公司 Pose optimization method, pose optimization equipment, intelligent equipment and medium
CN117475092B (en) * 2023-12-27 2024-03-19 安徽蔚来智驾科技有限公司 Pose optimization method, pose optimization equipment, intelligent equipment and medium

Similar Documents

Publication Publication Date Title
US10824862B2 (en) Three-dimensional object detection for autonomous robotic systems using image proposals
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
Yang et al. Monocular object and plane slam in structured environments
Menze et al. Object scene flow
US11030525B2 (en) Systems and methods for deep localization and segmentation with a 3D semantic map
CN112602116A (en) Mapping object instances using video data
CN111060115B (en) Visual SLAM method and system based on image edge features
US9989969B2 (en) Visual localization within LIDAR maps
US9420265B2 (en) Tracking poses of 3D camera using points and planes
Wang et al. Accurate georegistration of point clouds using geographic data
Gao et al. Ground and aerial meta-data integration for localization and reconstruction: A review
US20200226392A1 (en) Computer vision-based thin object detection
WO2023280274A1 (en) Geometric structure aided visual localization method and system
CN113822996A (en) Pose estimation method and device for robot, electronic device and storage medium
GB2581957A (en) Image processing to determine object thickness
EP4001965A1 (en) Lidar localization using optical flow
Liu et al. Accurate real-time visual SLAM combining building models and GPS for mobile robot
Li et al. Indoor layout estimation by 2d lidar and camera fusion
Ulaş et al. A fast and robust scan matching algorithm based on ML-NDT and feature extraction
WO2023165817A1 (en) Semantic slam framework for improved object pose estimation
Lin et al. Dense 3D surface reconstruction of large-scale streetscape from vehicle-borne imagery and LiDAR
Velasco-Sánchez et al. LiLO: Lightweight and low-bias LiDAR Odometry method based on spherical range image filtering
Pöschmann et al. Synthesized semantic views for mobile robot localization
CN115880690A (en) Method for quickly marking object in point cloud under assistance of three-dimensional reconstruction
CN117616459A (en) Geometry-aided visual localization method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22837019

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280048273.6

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE