WO2023280274A1 - Geometric structure aided visual localization method and system - Google Patents

Geometric structure aided visual localization method and system

Info

Publication number
WO2023280274A1
Authority
WO
WIPO (PCT)
Prior art keywords
real
landmarks
keyframe
geometric structure
map
Prior art date
Application number
PCT/CN2022/104407
Other languages
French (fr)
Inventor
Huaiyang HUANG
Haoyang YE
Jianhao JIAO
Ming Liu
Original Assignee
The Hong Kong University Of Science And Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Hong Kong University Of Science And Technology filed Critical The Hong Kong University Of Science And Technology
Priority to CN202280048273.6A priority Critical patent/CN117616459A/en
Publication of WO2023280274A1 publication Critical patent/WO2023280274A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A geometric structure aided visual localization method and a system for implementing the same are provided. The method comprises: retrieving a 3D map based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks (S301); obtaining a sequence of real-time image frames by a camera of a mobile system (S302); for each real-time image frame, extracting local features from the real-time image frame (S303); predicting a camera pose corresponding to the real-time image frame (S304); creating a keyframe by tracking the predicted camera pose in the 3D map (S305); identifying temporal visible landmarks with respect to the created keyframe (S306); obtaining 2D-3D correspondences between the local features and the temporal visible landmarks (S307); and locating the mobile system by estimating a state for the real-time image frame based on the 2D-3D correspondences (S308).

Description

GEOMETRIC STRUCTURE AIDED VISUAL LOCALIZATION METHOD AND SYSTEM Field of the Invention:
The present invention generally relates to visual localization, and more specifically, to a geometric structure aided visual localization method and system.
Background of the Invention:
Localization systems play a significant role in autonomous navigation and virtual/augmented reality, and several existing techniques based on different sensor configurations are available. Techniques based on a global navigation satellite system and inertial navigation system (GNSS/INS) are widely applied for locating autonomous systems. However, they occasionally fail in urban canyons and are entirely disabled in indoor environments. An alternative solution is to utilize a three-dimensional (3D) light detection and ranging (LiDAR) device. A 3D LiDAR device scans the scene structure, which can be matched against a point cloud map built in advance. However, it is not easy to recover the global location by recognizing a place from an existing database with LiDAR scans, because the appearance information provided by a single scan is limited. In addition, GNSS/INS or LiDAR-based systems capable of recovering centimeter-level sensor poses are typically too expensive to be widely applied for navigation or other applications.
Traditional vision-based methods match the query image against a sparse 3D scene model enriched with visual features. The camera pose can then be estimated from three-dimension to two-dimension (3D-2D) correspondences. In contrast to GNSS/INS or LiDAR-based systems, visual localization systems can be far less expensive. However, this comes at a price: visual perception is sensitive to varying conditions, and metric information cannot be obtained directly with vision sensors. These two drawbacks make common visual localization systems less robust and limit their application scenarios.
To alleviate these issues in visual localization systems, some methods introduce information from a dense map generated beforehand, which helps constrain the visual structure and thus leads to better pose estimation results. Yet, compared to traditional vision-based pipelines, these systems generally require more computational resources, which can only be supplied by a high-end desktop. The computational requirement limits the large-scale application of these methods, especially on embedded devices. Therefore, developing a low-cost localization system that still ensures robustness, accuracy and scalability remains an open problem. It is also desirable to have a visual localization method and system that is computationally efficient so as to have the potential to be applied on edge computing devices.
Summary of the Invention:
According to one aspect of the present invention, a geometric structure aided visual localization method is provided for navigating a mobile system. The method comprises: retrieving a 3D map 
Figure PCTCN2022104407-appb-000001
based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks
Figure PCTCN2022104407-appb-000002
and obtaining a sequence of real-time image frames by a camera of a mobile system; for each real-time image frame, extracting one or more local features from the real-time image frame; predicting a camera pose corresponding to the real-time image frame; creating a keyframe by tracking the predicted camera pose in the 3D map; identifying one or more temporal visible landmarks with respect to the created keyframe; obtaining one or more two dimensions to three dimensions (2D-3D) correspondences between the local features and the temporal visible landmarks; and locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
According to another aspect of the present invention, a geometric structure aided visual localization system is provided for navigating a mobile system. The system comprises a camera configured to obtain a sequence of real-time image frames; a first processor including a mapping module configured to retrieve a 3D map 
Figure PCTCN2022104407-appb-000003
based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks 
Figure PCTCN2022104407-appb-000004
and a second processor including a localization module configured to process each real-time image frame by: extracting one or more local features from the real-time image frame; predicting a camera pose corresponding to the real-time image frame; creating a keyframe by tracking the predicted camera pose in the 3D map; identifying one or more temporal visible landmarks with respect to the created keyframe; obtaining one or more 2D-3D correspondences between the local features and the temporal visible landmarks; and locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
Preferably, the camera pose is predicted through a visual-inertial odometry process comprising: obtaining real-time inertial measurement data by an inertial measurement unit of the mobile system; pre-integrating the real-time inertial measurement data into a single relative motion constraint; and estimating the camera pose based on the single relative motion constraint.
Preferably, the one or more temporal visible landmarks are identified by: dividing the 3D map into a set of voxels to form a voxelized map, wherein each mapped landmark being intersected with one or more voxels; casting a plurality of rays from an optical center of the created keyframe into the voxelized map; for each casted ray, locating one or more voxels along the casted ray; and identifying one or more mapped landmarks which are intersected with the located voxels as temporal visible landmarks with respect to the created keyframe.
The provided visual localization method and system can take input data from a low-cost sensor suite, such as a visual-inertial sensor, and leverage other modalities, such as a high-definition map generated by an expensive LiDAR device and an integrated navigation system. As such, the advantages of both geometric structure and visual structure can be exploited without an initial pose for bootstrapping the system. Localization accuracy and computational efficiency can thus be well balanced.
Brief Description of the Drawings:
Aspects of the present disclosure may be readily understood from the following detailed description with reference to the accompanying figures. The illustrations may not necessarily be drawn to scale. Common reference numerals may be used throughout the drawings and the detailed description to indicate the same or similar components.
FIG. 1 shows a block diagram of a geometric structure aided visual localization system according to one embodiment of the present invention;
FIG. 2 shows exemplary sensor setup of the geometric structure aided visual localization system of FIG. 1;
FIG. 3 shows a flowchart of a localization method according to one embodiment of the present invention;
FIG. 4 shows an exemplary visual-inertial bundle adjustment pipeline according to one embodiment of the present invention; and
FIG. 5 shows a factor graph representation for the visual-inertial bundle adjustment pipeline of FIG. 4.
Detailed Description:
In the following description, preferred examples of the present disclosure will be set forth as embodiments which are to be regarded as illustrative rather than restrictive. Specific details may  be omitted so as not to obscure the present disclosure; however, the disclosure is written to enable one skilled in the art to practice the teachings herein without undue experimentation.
FIG. 1 shows a block diagram of a geometric structure aided visual localization system according to one embodiment of the present invention. FIG. 2 shows exemplary sensor setup for the visual localization system of FIG. 1.
As shown, the visual localization system may be fully decoupled into a mapping module 11 and a localization module 12. The mapping module 11 may be operated offline in advance to generate an accurate localizable map, which is expected to be fully reusable for long-term localization. The prior localizable map can be built offline with more sophisticated sensors. To this end, the visual localization system may take input trajectories estimated from an integrated navigation system combining information from a laser scanner, a camera system and GNSS/INS (in functional block 102: integrated navigation). In addition, both geometric and visual features are extracted from the laser and visual data, respectively (in functional block 104: geometrical/visual feature extraction). Finally, the geometric and visual features serve as map elements for constructing the localizable map. The features under local frames are transformed with the estimated poses and aggregated to form the localizable map (in functional block 106: map aggregation).
With the prior localizable map, the localization module 12 can be operated separately. The localization module 12 takes as input a query image and/or inertial measurement data from a visual-inertial sensor suite. The visual-inertial sensor suite may include a camera system and an inertial measurement unit (IMU). For each input query image, local features are extracted to represent the current frame (in functional block 108: local feature extraction). The inertial measurement data since the last frame is pre-integrated into a relative transform and corresponding covariance (in functional block 110: pre-integration) to estimate a prior pose. The local features are first matched and the last frame is tracked with the prior pose as the initial guess (in functional block 112: inter-frame tracking). With a relatively accurate pose estimated, the local features are then associated with the map elements obtained from the localizable map to generate landmarks (in functional block 114: local feature association). Finally, the optimizable parameters are jointly adjusted to refine the poses and inertial parameters (in functional block 116: joint optimization).
As shown in FIG. 2, the mapping module 11 may include a high-standard sensor suite including, but not limited to, laser scanners 22, a camera system 24 and a GNSS/INS 26 for obtaining high-accuracy and high-density data to construct the prior localizable and reusable map. The localization module 12 may include an inexpensive visual-inertial (VI) sensor 28 (e.g., the VI sensors equipped on a smart phone) to obtain visual data and inertial measurement data such that the cost can be reduced to the maximum extent.
The mapping module 11 and the localization module 12 may be implemented with one or more processors (not shown) which may be CPU, MCU, application specific integrated circuits (ASIC) , field programmable gate arrays (FPGA) or any suitable programmable logic devices configured or programmed to be a processor for performing the mapping and localization. The localization system may further include a memory unit (not shown) which may include a volatile memory unit (such as RAM) , a non-volatile unit (such as ROM, EPROM, EEPROM and flash memory) or both, or any type of media or devices suitable for storing instructions, codes, and/or data.
FIG. 3 shows a flowchart of a visual localization method according to one embodiment of the present invention. As shown, the visual localization method may comprise:
S301: retrieving a 3D map 
Figure PCTCN2022104407-appb-000005
based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks
Figure PCTCN2022104407-appb-000006
S302: obtaining a sequence of real-time image frames by a camera of a mobile system;
S303: for each real-time image frame, extracting one or more local features from the real-time image frame;
S304: predicting a camera pose corresponding to the real-time image frame;
S305: creating a keyframe by tracking the predicted camera pose in the 3D map;
S306: identifying one or more temporal visible landmarks with respect to the created keyframe;
S307: obtaining one or more 2D-3D correspondences between the local features and the temporal visible landmarks; and
S308: locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
In Step S301, the visual structure may be constructed by leveraging a simultaneous localization and mapping (SLAM) or structure from motion (SfM) method, where the landmarks are filtered by the number of frames that can observe them. The geometric structure may be constructed using a point cloud representing shapes of landmarks in the 3D map.
The Gaussian mixture model is expressed and parameterized as:
Figure PCTCN2022104407-appb-000007
where x_i is the 3D position of the i-th mapped landmark,
Figure PCTCN2022104407-appb-000008
is the likelihood of the landmark position x_i when the j-th Gaussian distribution
Figure PCTCN2022104407-appb-000009
is given; w_j is the weight of the j-th Gaussian distribution for each landmark, and μ_j and Σ_j are the mean and covariance of the j-th Gaussian distribution, respectively.
Each Gaussian distribution of a mapped landmark is parameterized as [μ, Σ] with Σ = R S R^T, where S = diag(λ_1, λ_2, λ_3), λ_1, λ_2, λ_3 are the singular values satisfying λ_1 < λ_2 < λ_3, R = [e_1, e_2, e_3], and e_1, e_2, e_3 are the eigen-bases.
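As a minimal NumPy sketch (assuming, for illustration only, that a component is fitted to a local point-cloud patch; the actual map-building procedure of the disclosure may differ), such a Gaussian component [μ, Σ] and its decomposition Σ = R S R^T with λ_1 < λ_2 < λ_3 could be obtained as follows:

    import numpy as np

    def fit_gaussian_component(points):
        """Parameterize a local point-cloud patch (Nx3 array) as a Gaussian
        component [mu, Sigma] and decompose Sigma = R S R^T."""
        mu = points.mean(axis=0)
        centered = points - mu
        sigma = centered.T @ centered / max(len(points) - 1, 1)   # 3x3 covariance
        # Order the singular values ascending (lambda_1 < lambda_2 < lambda_3),
        # so that e_1 spans the thinnest direction of the structure
        # (e.g. the normal direction of a near-planar patch).
        u, s, _ = np.linalg.svd(sigma)      # SVD returns s in descending order
        order = np.argsort(s)
        S = np.diag(s[order])
        R = u[:, order]                     # columns are the eigen-bases e_1, e_2, e_3
        return mu, sigma, R, S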
In Step S304, the camera pose may be predicted through a visual-inertial odometry process including: obtaining real-time inertial measurement data by an inertial measurement unit of the mobile system; pre-integrating the real-time inertial measurement data into a single relative motion constraint; and estimating the camera pose based on the single relative motion constraint.
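A simplified sketch of such pre-integration is given below. It folds the raw gyroscope/accelerometer samples between two frames into a single relative rotation, velocity and position change; the bias Jacobians and noise-covariance propagation that a complete visual-inertial odometry front end would also maintain are omitted here and are assumptions outside this disclosure:

    import numpy as np

    def so3_exp(w):
        """Rodrigues formula: map a rotation vector to a rotation matrix."""
        theta = np.linalg.norm(w)
        if theta < 1e-9:
            return np.eye(3)
        k = w / theta
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

    def preintegrate(imu_samples, gyro_bias, accel_bias):
        """Fold IMU samples between two frames into one relative motion constraint.
        Each sample is a tuple (gyro [rad/s], accel [m/s^2], dt [s])."""
        dR = np.eye(3)       # pre-integrated relative rotation
        dv = np.zeros(3)     # pre-integrated velocity change
        dp = np.zeros(3)     # pre-integrated position change
        for gyro, accel, dt in imu_samples:
            a = dR @ (np.asarray(accel) - accel_bias)
            dp += dv * dt + 0.5 * a * dt * dt
            dv += a * dt
            dR = dR @ so3_exp((np.asarray(gyro) - gyro_bias) * dt)
        return dR, dv, dp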
Alternatively, the camera pose may be predicted through a place recognition process including: retrieving a plurality of keyframe candidates with a global descriptor based on the initialization position data; grouping the keyframe candidates into clusters by their covisibility with respect to one or more landmarks; matching the local features against each cluster to obtain correspondences between the real-time image frame and the keyframe candidates; recovering the camera pose via a perspective-n-point (PnP) scheme with a random sample consensus (RANSAC) method; and refining the camera pose with inlier matches.
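The final two steps of this place recognition process (pose recovery via PnP with RANSAC and refinement with the inlier matches) can be sketched with OpenCV as below; candidate retrieval and covisibility clustering are assumed to be handled elsewhere, and the threshold values are illustrative only:

    import numpy as np
    import cv2

    def recover_pose_pnp(points_3d, points_2d, camera_matrix):
        """Recover a camera pose from 2D-3D matches with PnP + RANSAC, then refine
        it with the inlier matches. points_3d: Nx3, points_2d: Nx2 (undistorted)."""
        obj = np.ascontiguousarray(points_3d, dtype=np.float64)
        img = np.ascontiguousarray(points_2d, dtype=np.float64)
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            obj, img, camera_matrix, None,
            reprojectionError=3.0, iterationsCount=100)
        if not ok or inliers is None:
            return None
        idx = inliers.ravel()
        # Refine the pose with inlier matches only (Levenberg-Marquardt refinement).
        rvec, tvec = cv2.solvePnPRefineLM(obj[idx], img[idx],
                                          camera_matrix, None, rvec, tvec)
        R_cw, _ = cv2.Rodrigues(rvec)
        return R_cw, tvec.ravel()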
In Step S306, the temporal visible landmarks may be identified with respect to the created keyframe by: projecting the mapped landmarks onto the created keyframe; and filtering the projected landmarks based on one or more visibility criteria to identify one or more projected landmarks which pass the visibility criteria as the one or more temporal visible landmarks with respect to the created keyframe.
Alternatively, the temporal visible landmarks may be identified with respect to the created keyframe by: dividing the 3D map into a set of voxels to form a voxelized map, wherein each mapped landmark is intersected with one or more voxels; casting a plurality of rays from an optical center of the created keyframe into the voxelized map; locating voxels along each casted ray; and identifying mapped landmarks which are intersected with the located voxels as temporal visible landmarks with respect to the created keyframe.
Each casted ray may be expressed as [v, t_wc], where t_wc is the position of the optical center under the world coordinate and v is the normalized direction vector of the casted ray. v may be computed by v = R_wc π^(-1)(u), where u is the pixel location of each local feature, π denotes the camera projection function, and R_wc is the rotation matrix from the world coordinate to the camera coordinate. The position of each temporal visible landmark is parameterized as p = t_wc + (1/ρ) v, where p represents the positions of the located voxels and ρ is the inverse depth of the temporal visible landmark.
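A possible sketch of this ray casting is shown below. A pinhole model with intrinsic matrix K, the voxel hashing and the fixed-step traversal are assumptions for illustration (an exact voxel traversal such as a 3D digital differential analyzer could be substituted), and voxel_map is a hypothetical dictionary mapping an integer voxel index to the Gaussian components intersecting that voxel:

    import numpy as np

    def cast_ray(u, K, R_wc, t_wc, voxel_map, voxel_size, max_depth=50.0, step=0.5):
        """Back-project pixel u into a world-frame ray [v, t_wc] and collect the
        Gaussian components stored in the voxels marched through by the ray."""
        # v = R_wc * pi^(-1)(u): rotate the normalized camera ray into the world frame.
        ray_cam = np.linalg.inv(K) @ np.array([u[0], u[1], 1.0])
        v = R_wc @ (ray_cam / np.linalg.norm(ray_cam))
        candidates = []
        depth = step
        while depth < max_depth:
            p = t_wc + depth * v                               # point along the ray
            key = tuple(np.floor(p / voxel_size).astype(int))  # voxel index of p
            candidates.extend(voxel_map.get(key, []))
            depth += step
        return v, candidates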
After casting the ray into the voxelized map, each voxel in the retrieved map is examined to determine whether a geometric component (or a Gaussian distribution of a landmark) contained in the voxel can be associated with the local feature. Assume a geometric structure parameterized as a 3D Gaussian distribution (μ, Σ), which can be decomposed using singular value decomposition (SVD): Σ = R S R^T, S = diag(λ_1, λ_2, λ_3), R = [e_1, e_2, e_3]. The ray parameters [v, t_wc] can then be whitened by:
Figure PCTCN2022104407-appb-000010
Figure PCTCN2022104407-appb-000011
The Mahalanobis distance between the whitened ray parameters [v, t_wc] and the Gaussian component can then be used to determine whether the geometric component is intersected by the ray.
Applying the whitening transform, the ray parameters become:
Figure PCTCN2022104407-appb-000012
Figure PCTCN2022104407-appb-000013
The Mahalanobis distance is then given by:
Figure PCTCN2022104407-appb-000014
If the Mahalanobis distance is within a threshold, the inverse depth can be obtained through:
Figure PCTCN2022104407-appb-000015
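The exact whitening, distance and inverse-depth expressions are given by the formulas referenced above; the sketch below uses one common formulation as an assumption (whiten the ray with S^(-1/2) R^T, take the perpendicular distance of the whitened ray to the component centre as the Mahalanobis distance, and read the inverse depth off the closest point along the ray):

    import numpy as np

    def associate_ray_with_gaussian(v, t_wc, mu, R, S, gate=3.0):
        """Test whether a ray [v, t_wc] intersects a Gaussian component (mu, Sigma)
        with Sigma = R S R^T; if so, return the inverse depth of the intersection."""
        W = np.diag(1.0 / np.sqrt(np.diag(S))) @ R.T   # whitening transform S^(-1/2) R^T
        v_w = W @ v                                    # whitened ray direction
        t_w = W @ (t_wc - mu)                          # whitened ray origin (component at 0)
        s_star = -(t_w @ v_w) / (v_w @ v_w)            # ray parameter of the closest point
        closest = t_w + s_star * v_w
        mahalanobis = np.linalg.norm(closest)
        if mahalanobis > gate or s_star <= 0.0:        # reject distant or behind-camera hits
            return None
        return 1.0 / s_star                            # inverse depth rho, p = t_wc + (1/rho) v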
Optionally, the positions of the identified temporal visible landmarks may be further optimized through an iterative closest point (ICP) scheme including: selecting, for each temporal visible landmark, a set of nearest projected Gaussian distributions as association candidates; optimizing the position of the temporal visible landmark by iterating over the association candidates; and selecting a final association candidate which leads to the minimum reprojection error.
In Step S308, the state for the real-time image frame may be estimated by solving a least-squares optimization problem that minimizes a total energy function E_total given by:
E_total = E_visual + E_structure + E_prior,
where E_visual, E_structure and E_prior are the visual term, structure term and prior term of the total energy function, respectively.
The visual term E_visual is given by:
Figure PCTCN2022104407-appb-000016
where ρ(·) is a probability density function; e_proj(x_i, ξ_k) is the reprojection residual term for projecting the i-th mapped landmark onto the k-th image frame, which depends on the predicted camera pose ξ_k for the k-th image frame and the 3D position x_i of the i-th mapped landmark; and
Figure PCTCN2022104407-appb-000017
Figure PCTCN2022104407-appb-000018
is the covariance of the 2D Gaussian distribution for the i-th mapped landmark projected on the k-th image frame, 
Figure PCTCN2022104407-appb-000019
is the association set for the visual term.
In general, the reprojection residual term e_proj is given by:
Figure PCTCN2022104407-appb-000020
where u_ik is the 2D location of the visual feature corresponding to the i-th temporal visible landmark projected in the k-th image frame,
Figure PCTCN2022104407-appb-000021
and
Figure PCTCN2022104407-appb-000022
are rotation and translation matrices for projecting the landmark from the world coordinate of the 3D map to the k-th image frame respectively.
When the camera pose is predicted through a visual-inertial odometry process, the reprojection residual term e_proj may be expressed as:
Figure PCTCN2022104407-appb-000023
where
Figure PCTCN2022104407-appb-000024
and
Figure PCTCN2022104407-appb-000025
are rotation and translation matrices for transforming from the world coordinate of the 3D map to the body coordinate of the inertial measurement unit with respect to the k-th image frame; 
Figure PCTCN2022104407-appb-000026
and
Figure PCTCN2022104407-appb-000027
are rotation and translation matrices for transforming from the body coordinate of the inertial measurement unit to the k-th image frame.
The structure term E_structure may be given by:
Figure PCTCN2022104407-appb-000028
where
Figure PCTCN2022104407-appb-000029
is the association set for the structure term, 
Figure PCTCN2022104407-appb-000030
is the j-th Gaussian distribution for the i-th mapped landmark, e_str is a residual term, e_str_deg is a residual term for a degenerate case, and
Figure PCTCN2022104407-appb-000031
is a function indicating whether the i-th mapped landmark is degenerate or not, defined as
Figure PCTCN2022104407-appb-000032
For a given likelihood of the i-th landmark, the structure residual term e_str is given by:
Figure PCTCN2022104407-appb-000033
where x_i is the 3D position of the i-th mapped landmark, and μ_j and Σ_j are the mean and covariance of the j-th Gaussian distribution, respectively.
The structure residual term for the degenerate case, e_str_deg, is given by:
Figure PCTCN2022104407-appb-000034
where x_i is the 3D position of the i-th mapped landmark, μ_j is the mean of the j-th Gaussian distribution, e_j1 is the first eigen-base of the j-th Gaussian distribution, and
Figure PCTCN2022104407-appb-000035
where σ_str is the pre-defined variance of the geometric structure.
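The exact expressions of e_str and e_str_deg are given by the referenced formulas; the sketch below only illustrates one plausible reading of them, stated as an assumption: a whitened (Mahalanobis-type) residual against the full Gaussian for the regular case, and a deviation along the first eigen-base e_j1 scaled by σ_str for the degenerate case:

    import numpy as np

    def structure_residual(x_i, mu_j, sigma_j, e_j1, sigma_str, degenerate):
        """Assumed structure residual: whitened residual against (mu_j, Sigma_j),
        or a point-to-plane style residual along e_j1 in the degenerate case."""
        d = x_i - mu_j
        if degenerate:
            return np.atleast_1d((e_j1 @ d) / sigma_str)
        # Whiten d by the inverse covariance so that ||residual||^2 equals the
        # Mahalanobis distance d^T Sigma_j^{-1} d.
        L = np.linalg.cholesky(np.linalg.inv(sigma_j))
        return L.T @ d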
The prior term E_prior may be defined as an initial residual term and given by:
Figure PCTCN2022104407-appb-000036
where
Figure PCTCN2022104407-appb-000037
is a preset camera pose based on the initialization position data and
Figure PCTCN2022104407-appb-000038
is an actual pose.
Therefore, the optimized landmark position may be formulated as:
Figure PCTCN2022104407-appb-000039
where e_proj(x_i, ξ_k) is the reprojection residual term, ξ_k is the pose parameterization of the k-th image frame, and Σ_ik is the pre-defined variance for the feature corresponding to the i-th mapped landmark.
The least-squares optimization problem may be solved by the Levenberg-Marquardt method, which gives a system as follows:
H = J^T W J + εI,
b = -J^T W r
where J and r are the stacked Jacobian matrices and residuals, respectively, W is the weight matrix formed by stacking the inverses of the covariances of the residual terms, and ε is the damping factor for the Levenberg-Marquardt method.
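A direct NumPy transcription of this Levenberg-Marquardt system is given below (a dense solve is used for brevity; a practical implementation would exploit the sparsity of J):

    import numpy as np

    def lm_step(J, r, W, damping):
        """One Levenberg-Marquardt step: H = J^T W J + eps*I, b = -J^T W r,
        then solve H dx = b for the state update dx."""
        H = J.T @ W @ J + damping * np.eye(J.shape[1])
        b = -J.T @ W @ r
        return np.linalg.solve(H, b)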
In the case that some of the states cannot be properly estimated, especially when inertial measurements are incorporated, a batched yet efficient optimization problem may be formulated, which can be solved within 10 ms on average. Firstly, the batched optimization problem may be treated as a bundle adjustment (BA), which can be decoupled into a visual-inertial (or motion-only or structureless) BA and a structure-only BA.
FIG. 4 shows an exemplary visual-inertial BA pipeline 400 according to one embodiment of the present invention. FIG. 5 shows a factor graph representation for the visual-inertial BA pipeline. The bundle adjustment 400 may include a frame tracking module 401, a marginalization module 402, a joint optimization module 403 and a parameter updating module 404. Although the decoupled scheme accelerates the optimization, iteratively re-linearizing the visual factors can be computationally inefficient, while the instant localization can already provide a relatively accurate estimation. As shown in FIGS. 4 and 5, the visual factors obtained via instant localization may be marginalized as a pose prior instead of using the original visual factors. In other words, after optimizing the state for the real-time image frame, the visual factors may be "marginalized out", which forms a prior term for the pose of the real-time image frame. This eliminates all the visual factors in the optimization, which significantly reduces the computational resources required for visual-inertial bundle adjustment.
More specifically, the estimated value and corresponding covariance preserved in the instant localization serve as a prior term in the bundle adjustment. The joint optimization adjusts the camera poses along with the inertial parameters, which includes the prior term in the energy function as follows:
Figure PCTCN2022104407-appb-000040
Figure PCTCN2022104407-appb-000041
where e_rot and e_trans are the rotational and translational terms for the visual-inertial bundle adjustment.
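A sketch of such a marginalized pose prior is given below; the choice of e_rot as the axis-angle of the relative rotation, e_trans as the translation difference, and the whitening by the preserved 6x6 covariance are assumptions for illustration rather than the exact parameterization of the disclosure:

    import numpy as np

    def pose_prior_residual(R_est, t_est, R_prior, t_prior, cov_prior):
        """Whitened pose-prior residual built from the pose (R_prior, t_prior) and
        covariance preserved by the instant localization."""
        dR = R_prior.T @ R_est
        angle = np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
        if angle < 1e-9:
            e_rot = np.zeros(3)
        else:
            # Axis-angle (SO(3) log map) of the relative rotation.
            e_rot = angle / (2.0 * np.sin(angle)) * np.array(
                [dR[2, 1] - dR[1, 2], dR[0, 2] - dR[2, 0], dR[1, 0] - dR[0, 1]])
        e_trans = t_est - t_prior
        e = np.concatenate([e_rot, e_trans])
        # Whiten by the preserved covariance so the prior is weighted consistently
        # with the other factors in the bundle adjustment.
        L = np.linalg.cholesky(np.linalg.inv(cov_prior))
        return L.T @ e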
Accordingly, the camera pose may be optimized by minimizing an objective function:
Figure PCTCN2022104407-appb-000042
where F is the set of keyframes in the current window, whose cardinality |F| may be set to 15.
Assuming the pose estimation reaches the local optimum, the update of the state is
Figure PCTCN2022104407-appb-000043
and thus the prior error state can be formulated as:
Figure PCTCN2022104407-appb-000044
Then the total objective function may be changed to:
Figure PCTCN2022104407-appb-000045
The problem can then be solved by the Levenberg-Marquardt method. The advantage of this scheme is that there is no need to re-linearize the visual factors, which is more efficient in computation.
The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the  invention for various embodiments and with various modifications that are suited to the particular use contemplated. While the methods disclosed herein have been described with reference to particular operations performed in a particular order, it will be understood that these operations may be combined, sub-divided, or re-ordered to form an equivalent method without departing from the teachings of the present disclosure. Accordingly, unless specifically indicated herein, the order and grouping of the operations are not limitations. While the apparatuses disclosed herein have been described with reference to particular structures, shapes, materials, composition of matter and relationships…etc., these descriptions and illustrations are not limiting. Modifications may be made to adapt a particular situation to the objective, spirit and scope of the present disclosure. All such modifications are intended to be within the scope of the claims appended hereto.

Claims (20)

  1. A geometric structure aided visual localization method, comprising:
    retrieving a 3D map
    Figure PCTCN2022104407-appb-100001
    based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks
    Figure PCTCN2022104407-appb-100002
    obtaining a sequence of real-time image frames by a camera of a mobile system;
    for each real-time image frame:
    extracting one or more local features from the real-time image frame;
    predicting a camera pose corresponding to the real-time image frame;
    creating a keyframe by tracking the predicted camera pose in the 3D map;
    identifying one or more temporal visible landmarks with respect to the created keyframe;
    obtaining one or more 2D-3D correspondences between the local features and the temporal visible landmarks; and
    locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
  2. The geometric structure aided visual localization method according to claim 1, wherein the camera pose is predicted through a place recognition process comprising:
    retrieving a plurality of keyframe candidates with a global descriptor based on the initialization position data;
    grouping the keyframe candidates into clusters by their covisibility;
    matching the local features against each cluster to obtain a correspondence between the real-time image frame and the keyframe candidates;
    recovering the camera pose via a perspective-n-point (PnP) scheme with a random sample consensus (RANSAC) method; and
    refining the camera pose with inlier matches.
  3. The geometric structure aided visual localization method according to claim 1, wherein the camera pose is predicted through a visual-inertial odometry process comprising:
    obtaining real-time inertial measurement data by an inertial measurement unit of the mobile system;
    pre-integrating the real-time inertial measurement data into a single relative motion constraint; and
    estimating the camera pose based on the single relative motion constraint.
  4. The geometric structure aided visual localization method according to claims 2 or 3, wherein the step of identifying the one or more temporal visible landmarks includes:
    projecting the mapped landmarks onto the created keyframe; and
    filtering the projected landmarks based on one or more visibility criteria to identify one or more projected landmarks which pass the visibility criteria as the one or more temporal visible landmarks with respect to the created keyframe.
  5. The geometric structure aided visual localization method according to claims 2 or 3, wherein the step of identifying the one or more temporal visible landmarks includes:
    dividing the 3D map into a set of voxels to form a voxelized map, wherein each mapped landmark is intersected with one or more voxels;
    casting a plurality of rays from an optical center of the created keyframe into the voxelized map;
    for each casted ray:
    locating one or more voxels along the casted ray; and
    identifying one or more mapped landmarks which are intersected with the located voxels as temporal visible landmarks with respect to the created keyframe.
  6. The geometric structure aided visual localization method according to claims 4 or 5, wherein the step of estimating the state for the real-time image frame includes solving a least-squares optimization problem and minimizing a total energy function E_total given by:
    E_total = E_visual + E_structure + E_prior,
    where E_visual, E_structure and E_prior are the visual term, structure term and prior term of the total energy function, respectively.
  7. The geometric structure aided visual localization method according to claim 6, wherein positions of the temporal visible landmarks are optimized through an iterative closest point (ICP) scheme.
  8. The geometric structure aided visual localization method according to claim 7, wherein the ICP scheme includes:
    selecting, for each temporal visible landmark, a set of nearest projected Gaussian distributions as association candidates;
    optimizing the position of the temporal visible landmark by iterating over the association candidates; and
    selecting a final association candidate which leads to the minimum reprojection error.
  9. The geometric structure aided visual localization method according to claim 1, further comprising constructing the 3D map with a point cloud which is generated by a LiDAR.
  10. The geometric structure aided visual localization method according to claim 1, wherein the initialization position data is obtained through a GNSS/INS unit.
  11. A geometric structure aided visual localization system, comprising:
    a camera configured to obtain a sequence of real-time image frames;
    a first processor including a mapping module configured to retrieve a 3D map
    Figure PCTCN2022104407-appb-100003
    based on an initialization position data, the 3D map being constrained with visual structure and geometric structure and modelled by a Gaussian mixture model with a set of Gaussian distributions of mapped landmarks
    Figure PCTCN2022104407-appb-100004
    and
    a second processor including a localization module configured to process each real-time image frame by:
    extracting one or more local features from the real-time image frame;
    predicting a camera pose corresponding to the real-time image frame;
    creating a keyframe by tracking the predicted camera pose in the 3D map;
    identifying one or more temporal visible landmarks with respect to the created keyframe;
    obtaining one or more 2D-3D correspondences between the local features and the temporal visible landmarks; and
    locating the mobile system by estimating a state for the real-time image frame based on the one or more 2D-3D correspondences.
  12. The geometric structure aided visual localization system according to claim 11, wherein the localization module is further configured to predict the camera pose corresponding to the real-time image frame by:
    retrieving a plurality of keyframe candidates with a global descriptor based on the initialization position data;
    grouping the keyframe candidates into clusters by their covisibility;
    matching the local features against each cluster to obtain a correspondence between the real-time image frame and the keyframe candidates;
    recovering the camera pose via a perspective-n-point (PnP) scheme with a random sample consensus (RANSAC) method; and
    refining the camera pose with inlier matches.
  13. The geometric structure aided visual localization system according to claim 11, further comprising an inertial measurement unit configured to obtain real-time inertial measurement data; and
    wherein the localization module is further configured to predict the camera pose corresponding to the real-time image frame by:
    pre-integrating the real-time inertial measurement data into a single relative motion constraint; and
    estimating the camera pose based on the single relative motion constraint.
  14. The geometric structure aided visual localization system according to claims 12 or 13, wherein the localization module is further configured to identify the one or more temporal visible landmarks by:
    projecting the mapped landmarks onto the created keyframe; and
    filtering the projected landmarks based on one or more visibility criteria to identify one or more projected landmarks which pass the visibility criteria as the one or more temporal visible landmarks with respect to the created keyframe.
  15. The geometric structure aided visual localization system according to claims 12 or 13, wherein the localization module is further configured to identify the one or more temporal visible landmarks by:
    dividing the 3D map into a set of voxels to form a voxelized map, wherein each mapped landmark is intersected with one or more voxels;
    casting a plurality of rays from an optical center of the created keyframe into the voxelized map;
    for each casted ray:
    locating one or more voxels along the casted ray; and
    identifying one or more mapped landmarks which are intersected with the located voxels as temporal visible landmarks with respect to the created keyframe.
  16. The geometric structure aided visual localization system according to claims 14 or 15, wherein the localization module is further configured to estimate the state for the real-time image frame by solving a least-squares optimization problem and minimizing a total energy function E_total given by:
    E_total = E_visual + E_structure + E_prior,
    where E_visual, E_structure and E_prior are the visual term, structure term and prior term of the total energy function, respectively.
  17. The geometric structure aided visual localization system according to claim 16, wherein the localization module is further configured to optimize positions of the temporal visible landmarks through an iterative closest point (ICP) scheme.
  18. The geometric structure aided visual localization system according to claim 17, wherein the localization module is further configured to perform the ICP scheme by:
    selecting, for each temporal visible landmark, a set of nearest projected Gaussian distributions as association candidates;
    optimizing the position of the temporal visible landmark by iterating over the association candidates; and
    selecting a final association candidate which leads to the minimum reprojection error.
  19. The geometric structure aided visual localization system according to claim 11, comprising a LiDAR configured to generate a point cloud for constructing the 3D map.
  20. The geometric structure aided visual localization system according to claim 11, comprising a GNSS/INS configured to obtain the initialization position data.
PCT/CN2022/104407 2021-07-07 2022-07-07 Geometric structure aided visual localization method and system WO2023280274A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280048273.6A CN117616459A (en) 2021-07-07 2022-07-07 Geometry-aided visual localization method and system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163218925P 2021-07-07 2021-07-07
US63/218,925 2021-07-07
US202163227352P 2021-07-30 2021-07-30
US63/227,352 2021-07-30

Publications (1)

Publication Number Publication Date
WO2023280274A1 true WO2023280274A1 (en) 2023-01-12

Family

ID=84800336

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104407 WO2023280274A1 (en) 2021-07-07 2022-07-07 Geometric structure aided visual localization method and system

Country Status (1)

Country Link
WO (1) WO2023280274A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140333741A1 (en) * 2013-05-08 2014-11-13 Regents Of The University Of Minnesota Constrained key frame localization and mapping for vision-aided inertial navigation
CN108717710A (en) * 2018-05-18 2018-10-30 京东方科技集团股份有限公司 Localization method, apparatus and system under indoor environment
CN108986037A (en) * 2018-05-25 2018-12-11 重庆大学 Monocular vision odometer localization method and positioning system based on semi-direct method
WO2021004416A1 (en) * 2019-07-05 2021-01-14 杭州海康机器人技术有限公司 Method and apparatus for establishing beacon map on basis of visual beacons
CN110782494A (en) * 2019-10-16 2020-02-11 北京工业大学 Visual SLAM method based on point-line fusion
CN112596064A (en) * 2020-11-30 2021-04-02 中科院软件研究所南京软件技术研究院 Laser and vision integrated indoor robot global positioning method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YE HAOYANG; HUANG HUAIYANG; HUTTER MARCO; SANDY TIMOTHY; LIU MING: "3D Surfel Map-Aided Visual Relocalization with Learned Descriptors", 2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE, 30 May 2021 (2021-05-30), pages 5574 - 5581, XP033989857, DOI: 10.1109/ICRA48506.2021.9561005 *
YE HAOYANG; HUANG HUAIYANG; LIU MING: "Monocular Direct Sparse Localization in a Prior 3D Surfel Map", 2020 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE, 31 May 2020 (2020-05-31), pages 8892 - 8898, XP033826358, DOI: 10.1109/ICRA40945.2020.9197022 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475092A (en) * 2023-12-27 2024-01-30 安徽蔚来智驾科技有限公司 Pose optimization method, pose optimization equipment, intelligent equipment and medium
CN117475092B (en) * 2023-12-27 2024-03-19 安徽蔚来智驾科技有限公司 Pose optimization method, pose optimization equipment, intelligent equipment and medium

Similar Documents

Publication Publication Date Title
US10824862B2 (en) Three-dimensional object detection for autonomous robotic systems using image proposals
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
Yang et al. Monocular object and plane slam in structured environments
Menze et al. Object scene flow
US11030525B2 (en) Systems and methods for deep localization and segmentation with a 3D semantic map
CN112602116A (en) Mapping object instances using video data
CN111060115B (en) Visual SLAM method and system based on image edge features
US9989969B2 (en) Visual localization within LIDAR maps
US9420265B2 (en) Tracking poses of 3D camera using points and planes
Wang et al. Accurate georegistration of point clouds using geographic data
Gao et al. Ground and aerial meta-data integration for localization and reconstruction: A review
US20200226392A1 (en) Computer vision-based thin object detection
WO2023280274A1 (en) Geometric structure aided visual localization method and system
CN113822996A (en) Pose estimation method and device for robot, electronic device and storage medium
GB2581957A (en) Image processing to determine object thickness
EP4001965A1 (en) Lidar localization using optical flow
Liu et al. Accurate real-time visual SLAM combining building models and GPS for mobile robot
Li et al. Indoor layout estimation by 2d lidar and camera fusion
Ulaş et al. A fast and robust scan matching algorithm based on ML-NDT and feature extraction
WO2023165817A1 (en) Semantic slam framework for improved object pose estimation
Lin et al. Dense 3D surface reconstruction of large-scale streetscape from vehicle-borne imagery and LiDAR
Velasco-Sánchez et al. LiLO: Lightweight and low-bias LiDAR Odometry method based on spherical range image filtering
Pöschmann et al. Synthesized semantic views for mobile robot localization
CN115880690A (en) Method for quickly marking object in point cloud under assistance of three-dimensional reconstruction
CN117616459A (en) Geometry-aided visual localization method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22837019

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280048273.6

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE