CN114972501B - Visual positioning method based on prior semantic map structure information and semantic information - Google Patents


Info

Publication number
CN114972501B
Authority
CN
China
Prior art keywords
semantic
visual
map
prior
landmark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210423500.0A
Other languages
Chinese (zh)
Other versions
CN114972501A (en
Inventor
张云洲
梁世文
田瑞
杨凌昊
曹振中
Original Assignee
东北大学 (Northeastern University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东北大学 (Northeastern University)
Priority to CN202210423500.0A
Publication of CN114972501A
Application granted
Publication of CN114972501B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/20 Analysis of motion
    • G06T 7/215 Motion-based segmentation
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30244 Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of visual SLAM and provides a visual positioning method based on the structural information and semantic information of a prior semantic map. A prior-semantic-map factor is introduced into visual positioning through a hybrid constraint that fuses the semantic and structural information of the prior map. Data associations between the visual landmarks and the prior semantic map are then established, and an expectation-maximization algorithm jointly optimizes the data associations and the camera pose, improving the accuracy and robustness of visual positioning. The algorithm effectively limits the drift error of the visual odometry and improves positioning accuracy for application scenarios such as navigation, achieving high positioning precision while meeting real-time requirements.

Description

Visual positioning method based on prior semantic map structure information and semantic information
Technical Field
The invention relates to the field of visual SLAM (Simultaneous Localization and Mapping), and in particular to a visual positioning algorithm based on the structural information and semantic information of a prior semantic map.
Background
With the ever wider application of mobile robots and autonomous vehicles, providing accurate and robust poses for such platforms is an urgent problem. Visual SLAM estimates the camera's own pose and builds a map from images captured by a camera. As one of the mainstream positioning technologies, visual SLAM has attracted a great deal of research because cameras are low-cost and lightweight. However, compared with lidar-based SLAM algorithms, traditional visual SLAM relies mainly on point, line and plane features in the environment to localize itself and build maps, and is therefore easily affected by factors such as ambient illumination and viewpoint changes. In autonomous driving and mobile robotics, one possible approach is to use a high-precision map built offline to provide constraints that limit the cumulative drift error of the visual odometry. Such a method can ensure the accuracy of visual positioning while improving its robustness, so that visual positioning can meet the requirements of practical applications.
The method in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4588-4594, 2020 extracts the 3D line features in a prior semantic map and matches them against 2D line features extracted from images. Based on this 2D-3D line-feature data association, a prior semantic map constraint is introduced into the visual positioning system to limit the accumulated drift error of the visual odometry. Journal of Field Robotics, 1003-1026, 2020 proposes the ProW-NDT point-cloud matching algorithm to estimate the coordinate transformation between a local visual map and a prior laser map, and fuses the point-cloud registration results with the visual odometry by optimizing the keyframe poses in a sliding window with a local pose-graph optimization algorithm.
The visual localization algorithms of IROS 2020 (4588-4594) and Journal of Field Robotics 2020 (1003-1026) use only the structural information of the prior semantic map and ignore its semantic information. When visual localization is performed in large outdoor scenes, using structural information alone may reduce robustness or even cause localization to fail. To ensure both accuracy and robustness, a joint constraint over the semantic and structural information of the prior semantic map needs to be introduced.
Disclosure of Invention
The invention aims to provide a visual positioning algorithm based on the structural information and semantic information of a prior semantic map, improving the accuracy and robustness of visual positioning.
The technical scheme of the invention is as follows. A visual positioning algorithm based on prior semantic map structural information and semantic information comprises the following steps:
(1) Acquire structural information and semantic information from the prior semantic map M, and take the visual image features U and the semantic segmentation images C as the system observation O = {U, C}. Let the camera pose be T and the coordinate transformation between the visual image and the prior semantic map be S; the system state X = {T, S} and the semantic landmark coordinates P are estimated from the prior semantic map M and the system observation O. The posterior probability of the camera state and semantic landmarks given the prior semantic map and the system observation is maximized, yielding a visual observation factor, a semantic tracking factor and a prior semantic map factor:
p(X, P | M, O) ∝ p(U | P, T) · p(C | Z) · p(P | M, S)    (1)
where Z denotes the semantic label of a semantic landmark, p(U | P, T) is the visual observation factor, p(C | Z) is the semantic tracking factor, and p(P | M, S) is the prior semantic map factor;
(2) Perform continuous visual tracking of the semantic landmarks by matching visual images with visual feature descriptors;
Let the camera pose of the k-th image frame be T_k and the coordinates of the i-th semantic landmark be P_i. The reprojection error of the feature-point method is used as the residual term E_visual of the visual observation factor p(U | P, T), and the initial pose T_k of each visual frame is estimated;
where π(·) denotes the reprojection function, u_{i,k} the image feature of the i-th semantic landmark on the k-th frame, Σ_{i,k} the covariance matrix of the corresponding visual data association, and Ω the set of visual data associations;
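Although the residual equation itself is not reproduced in this text, E_visual is the standard Mahalanobis-weighted reprojection error. A minimal sketch, assuming a pinhole camera with intrinsics K and a 4x4 world-to-camera pose T_k (the function names are illustrative, not from the patent):

```python
import numpy as np

def reproject(K, T, P):
    """Project a 3D landmark P (world frame) into the image.

    T is a 4x4 world-to-camera transform, K the 3x3 pinhole intrinsics.
    """
    P_cam = T[:3, :3] @ P + T[:3, 3]     # world -> camera frame
    uv = K @ (P_cam / P_cam[2])          # perspective division
    return uv[:2]

def visual_residual(K, T_k, P_i, u_ik, Sigma_ik):
    """Squared Mahalanobis reprojection error for one landmark observation."""
    r = u_ik - reproject(K, T_k, P_i)
    return float(r @ np.linalg.solve(Sigma_ik, r))   # r^T Sigma^{-1} r
```

Summing this residual over all associations in Ω gives the E_visual term minimized when estimating T_k.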
(3) Perform continuous semantic tracking of the semantic landmarks on the semantic segmentation images based on the Dirichlet distribution, estimating the semantic label Z of each landmark from the semantic tracking factor p(C | Z) and the segmentation images C;
Define g_i = [p(z_i = 1), p(z_i = 2), ..., p(z_i = H)]^T as the discrete probability of the semantic label z_i of the i-th semantic landmark; define the observation of the i-th landmark on the k-th segmentation image as c_{i,k}, and update the distribution of g_i with these observations;
The discrete probability distribution of z_i, i.e. the semantic state D_i, is estimated from the multi-frame semantic observations C_i = {c_{i,k} | k ∈ K}, where K denotes the set of segmentation images in which the i-th landmark is observed;
D_i = p(g_i | C_i) ∝ p(C_i | g_i) · Dir(g_i | α_i)    (3)
where Dir(·) denotes the Dirichlet distribution and α_i = [α_1, α_2, ..., α_H]^T its parameter vector; p(C_i | g_i) follows a multinomial distribution, so updating the semantic state of a landmark reduces to updating the Dirichlet parameters:
α_i^new = α_i^old + c_{i,k}    (4)
where α_i^new denotes α_i after the update and α_i^old before it;
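The conjugate update of formula (4) can be sketched as follows; a minimal illustration, where treating c_{i,k} as a one-hot count vector is an assumption for the case of a hard segmentation label:

```python
import numpy as np

def update_semantic_state(alpha, c_ik):
    """Formula (4): alpha_new = alpha_old + c_ik (conjugate Dirichlet update).

    alpha: current Dirichlet parameters over the H semantic classes.
    c_ik:  per-class observation counts from the k-th segmentation image
           (assumed one-hot here when the segmenter gives a hard label).
    """
    return alpha + c_ik

def label_distribution(alpha):
    """Posterior mean of Dir(alpha): the discrete label distribution g_i."""
    return alpha / alpha.sum()
```

Starting from a uniform prior alpha = np.ones(H), repeated observations of the same class concentrate the distribution on that class.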
(4) In the localization thread, the prior semantic map M is further divided into several sub-maps M_l according to its semantic labels, and a data association between each semantic landmark p_i and the sub-maps M_l is established. Let S_k be the coordinate transformation between the k-th visual frame and the prior semantic map. From the prior semantic map factor p(P | M, S) in formula (1), the structural information term and the semantic information term p(z_i | M, S_k, p_i) are extracted to construct a hybrid constraint, which serves as the prior constraint of the algorithm:
(4.1) The residual of the structural information term in formula (5) is defined as:
where the associated information matrix represents the probability distribution of the local point cloud; the structural information of the prior semantic map is divided into planar and non-planar structures;
For planar structures, the residual term of formula (6) is modeled as the point-to-plane distance, i.e.:
where n_l and q_l denote, respectively, the surface normal vector and the 3D point coordinates of the associated map element; for planar structures the information matrix is taken to be the identity matrix;
for non-planar structures, the point-to-point residual form of the ICP algorithm is adopted, i.e.:
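The two residual forms (the point-to-plane term of formula (7) and the ICP-style point-to-point term) can be sketched as follows. This is an illustrative reading of the text, with n_l and q_l the associated map element's normal and point as defined above:

```python
import numpy as np

def point_to_plane_residual(n_l, q_l, p):
    """Planar-structure residual: signed distance of landmark p to the
    plane through q_l with unit normal n_l (information matrix = identity)."""
    return float(n_l @ (p - q_l))

def point_to_point_residual(q_l, p, info):
    """Non-planar residual in ICP form: squared Mahalanobis distance of p
    to the matched map point q_l, weighted by the information matrix of
    the local point-cloud distribution."""
    d = p - q_l
    return float(d @ info @ d)
```

In the hybrid constraint, each landmark contributes one of the two residuals depending on whether its associated sub-map element is planar.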
(4.2) p(z_i | M, S_k, p_i) in formula (5) is the semantic weight w_i of the i-th semantic landmark, which is defined to follow a Dirichlet distribution: w_i ~ Dir(β_i), where β_i = [β_1, β_2, ..., β_H]^T is the Dirichlet parameter vector; it fuses the semantic states of the landmarks tracked in the localization thread into the optimization;
where γ is the balance coefficient;
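The definition of β (the patent's formula for it is not reproduced in this text) can be illustrated under an assumption: the sub-map's own label l is boosted by the balance coefficient γ on top of the tracked semantic state α, and w_i is taken as the Dirichlet expectation at class l:

```python
import numpy as np

def semantic_weight(alpha, l, gamma):
    """Illustrative weight of associating landmark i with sub-map label l.

    Assumption (the patent's exact beta definition is not shown here):
    beta fuses the tracked semantic state alpha with the sub-map's label
    via the balance coefficient gamma; w_i is E[Dir(beta)] at class l.
    """
    beta = np.asarray(alpha, dtype=float).copy()
    beta[l] += gamma                   # boost the sub-map's own class
    return float(beta[l] / beta.sum()) # Dirichlet expectation of class l
```

Landmarks whose tracked semantics agree with a sub-map's label thus receive larger weights in the hybrid constraint.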
(5) In the localization thread, based on the proposed hybrid constraint, the coordinate transformation S_k between the prior semantic map and the world coordinate system is solved with an expectation-maximization (EM) algorithm; from the solved S_k, the 6-degree-of-freedom pose of the camera in the prior semantic map is estimated.
The flow of estimating the coordinate transformation between the initial visual frame and the prior semantic map with the EM algorithm is as follows:
1) E-step: for each semantic landmark, build the data association between the landmark and the prior semantic map of label Z by nearest-neighbour search, and estimate the weight w_i of each association with formula (9);
2) M-step: given the data associations and the weights w_i estimated by formula (9), solve S_k from the following optimization model:
where S_k is the transformation to be solved; the coordinate transformation S_0 between the prior semantic map and the initial visual frame is initialized as the identity matrix.
The E-step and the M-step are repeated until convergence or until a set number of iterations is reached; finally, the pose of the current camera in the prior semantic map is computed from the solved S_k.
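The E/M alternation above can be sketched with a weighted closed-form rigid alignment in the M-step. This is a simplification: the patent optimizes the full hybrid constraint, whereas this sketch uses point-to-point terms only, and all function names are illustrative:

```python
import numpy as np

def weighted_kabsch(P, Q, w):
    """Closed-form rigid (R, t) minimizing sum_i w_i ||R p_i + t - q_i||^2."""
    w = np.asarray(w, float)
    w = w / w.sum()
    mu_p, mu_q = w @ P, w @ Q
    H = (P - mu_p).T @ ((Q - mu_q) * w[:, None])   # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_q - R @ mu_p

def em_align(landmarks, submaps, labels, weight_of, iters=10):
    """E-step: nearest neighbour in the sub-map sharing the landmark's label;
    M-step: weighted rigid alignment. Returns the map alignment (R, t)."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        P, Q, w = [], [], []
        for p, lbl in zip(landmarks, labels):
            m = submaps[lbl]                                  # same-label sub-map
            j = np.argmin(np.linalg.norm(m - (R @ p + t), axis=1))
            P.append(p); Q.append(m[j]); w.append(weight_of(lbl))
        R, t = weighted_kabsch(np.asarray(P), np.asarray(Q), np.asarray(w))
    return R, t
```

Restricting the nearest-neighbour search to the sub-map of the landmark's own label is what injects the semantic information into the data association.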
From the solved coordinate transformation S_k, the 3D coordinates of the semantic landmarks are further refined; let the landmark coordinates before optimization be p_i^old; the refined coordinates are:
where S^old denotes the transformation between the world coordinate system and the prior semantic map coordinate system before the expectation-maximization optimization; the system then further optimizes the camera pose T and the landmark coordinates P with a local bundle adjustment (BA) algorithm.
The planar structures include roads, sidewalks and walls.
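The landmark refinement can be illustrated as follows, under the assumption (not spelled out in the text) that S maps world coordinates into the prior-map frame and that the refined landmark is p_new = S_new^{-1} · S_old · p_old:

```python
import numpy as np

def refine_landmark(p_old, S_old, S_new):
    """Re-express a landmark under the refined map alignment.

    S_old, S_new: 4x4 world->map transforms before and after the EM
    refinement. The exact composition in the patent's refinement formula
    is not reproduced here, so p_new = S_new^{-1} @ S_old @ p_old is an
    illustrative assumption.
    """
    p_h = np.append(p_old, 1.0)                    # homogeneous coordinates
    return (np.linalg.inv(S_new) @ S_old @ p_h)[:3]
```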
The beneficial effects of the invention are as follows. Addressing the shortcoming that existing prior-map-based algorithms exploit only the geometric structure of the prior semantic map, the invention provides a visual positioning algorithm based on both the structural and the semantic information of the prior semantic map, extracting both as prior constraints. Unlike other approaches that constrain the camera pose with geometric features alone, the invention introduces the prior-semantic-map factor into visual positioning through a hybrid constraint fusing semantic and structural information. Data associations between the visual landmarks and the prior semantic map are then established, and an Expectation-Maximization (EM) algorithm jointly optimizes the data associations and the camera pose, improving the accuracy and robustness of visual positioning. The algorithm effectively limits the drift error of the visual odometry and improves positioning accuracy for application scenarios such as navigation, achieving high positioning precision while meeting real-time requirements.
Drawings
FIG. 1 is a flow chart of the visual localization algorithm based on prior semantic map structural information and semantic information.
Detailed Description
Fig. 1 shows the main flow of the technical scheme of the invention. The proposed visual positioning algorithm takes the prior semantic map M as input to acquire structural and semantic information, and takes the visual features U and the semantic segmentation images C as the system observation O = {U, C}. Let the camera pose be T and the coordinate transformation between the visual image and the prior semantic map be S; the system estimates the camera state X = {T, S} and the visual landmarks P from the prior semantic map M and the system observations. The whole problem is modeled as a maximum a posteriori estimate, which serves as the prior constraint of the visual positioning algorithm and limits the drift error of the visual odometry:
p(X, P | M, O) ∝ p(U | P, T) · p(C | Z) · p(P | M, S),    (1)
where Z denotes the semantic label of a visual landmark; p(U | P, T) is the visual observation factor, p(C | Z) the semantic tracking factor, and p(P | M, S) the prior semantic map factor.
As shown in fig. 1, the visual positioning algorithm based on prior semantic map structure information and semantic information provided by the invention comprises the following steps:
(1) In the tracking thread, the visual images and the semantic segmentation images are used as system observations, and the semantic landmarks are tracked continuously based on visual feature descriptors and the Dirichlet distribution.
In the front end, data associations between the visual landmarks and ORB (Oriented FAST and Rotated BRIEF) features are established by feature matching, and the initial pose T_k of each frame is estimated with a PnP (Perspective-n-Point) algorithm. For the visual observation factor p(U | P, T) in formula (1), the reprojection error of the feature-point method is used as the residual term:
where π(·) denotes the reprojection function; u_{i,k} denotes the image feature of the i-th semantic landmark on the k-th frame, Σ_{i,k} the covariance matrix of the corresponding visual data association, and Ω the set of visual data associations.
For the semantic tracking factor p(C | Z) in formula (1), the semantic label Z of each visual landmark is estimated from the semantic segmentation result, i.e. the segmentation image C. The system does not quantify the distribution p(C | Z) directly; instead, it estimates the probability distribution of the semantic attribute Z. We define g_i = [p(z_i = 1), p(z_i = 2), ..., p(z_i = H)]^T as the discrete probability of z_i and update the distribution of g_i with the corresponding semantic observations c_{i,k}. Based on this analysis, the system estimates the discrete probability distribution of the semantic label z_i of each landmark, i.e. the semantic state D_i, from the multi-frame semantic observations C_i. The semantic state D_i of each visual landmark is therefore represented as:
D_i = p(g_i | C_i) ∝ p(C_i | g_i) · Dir(g_i | α_i),    (3)
where Dir(·) denotes the Dirichlet distribution and α_i = [α_1, α_2, ..., α_H]^T its parameter vector. Assuming p(C_i | g_i) is a multinomial distribution, estimating the semantic state of a landmark reduces to updating the Dirichlet parameters:
α_i^new = α_i^old + c_{i,k},    (4)
where α_i^new denotes α_i after the update and α_i^old before it.
(2) In the localization thread, the structural and semantic information of the prior semantic map are extracted to construct a hybrid constraint, which serves as the prior constraint:
From the prior semantic map factor p(P | M, S) in formula (1), a hybrid constraint is derived that fuses the semantic information p(z_i | M, S, p_i) of the prior semantic map with its structural information, i.e.:
The prior semantic map M is further divided into several sub-maps M_l according to semantic labels. Then, for each semantic landmark, a data association between the landmark coordinates p_i and each sub-map M_l is established. Based on the data associations between the local semantic landmarks and the prior semantic map, the hybrid constraint E_hybrid is introduced.
On the other hand, since a large number of planar features exist in real outdoor scenes, the structural information of the prior semantic map is further divided into planar and non-planar structures.
For planar structures of the prior semantic map, such as roads and sidewalks, the structural information term in formula (5) is modeled as the point-to-plane distance, i.e.:
where n_l and q_l denote, respectively, the surface normal vector and the 3D point coordinates of the associated map element.
For the remaining non-planar structures, the point-to-point residual form of the ICP algorithm is adopted, i.e.:
(3) In the localization thread, the 6-degree-of-freedom pose of the camera in the prior semantic map is solved with an EM (Expectation-Maximization) algorithm based on the proposed hybrid constraint.
Based on the hybrid constraint of formula (5), for the k-th frame the EM algorithm solves the coordinate transformation S_k between the prior semantic map and the world coordinate system of the visual image; from the solved S_k, the 6-degree-of-freedom pose of the camera in the prior semantic map is estimated. p(z_i | M, S_k, p_i) in formula (5) can be regarded as the semantic weight of each landmark. Since p(z_i | M, S_k, p_i) is a discrete probability distribution, the system further defines the weight w_i and assumes it follows a Dirichlet distribution: w_i ~ Dir(β_i), where β_i = [β_1, β_2, ..., β_H]^T is the Dirichlet parameter vector. To fuse the semantic states of the front-end visual landmarks into the localization-thread optimization, we define β_l as:
where γ is the balance coefficient.
The weight w_i can thus be modeled as the expectation of the Dirichlet distribution. Finally, for the k-th frame, the flow of estimating the coordinate transformation S_k between the prior semantic map and the world coordinate system of the visual image with the EM algorithm is as follows:
1) E-step: for each visual landmark, build the data association between the landmark and the prior semantic map of label Z by nearest-neighbour search, and estimate the weight w_i of each association with formula (8).
2) M-step: given the data associations and the estimated weights, S can be solved from the following optimization model:
where the information matrix represents the probability distribution of the local point cloud for non-planar structures and is taken as the identity matrix for planar structures. The E-step and the M-step are repeated until convergence or until a set number of iterations is reached. Finally, the pose of the current camera in the prior semantic map is computed from the solved S_k.
In addition, to improve the accuracy of pose estimation in the tracking thread, the system further refines the 3D coordinates of the semantic landmarks based on the solved coordinate transformation S. Let the landmark coordinates before optimization be p_i^old; the refined coordinates are:
where S^old denotes the transformation between the world coordinate system and the prior semantic map coordinate system before the expectation-maximization optimization; the system then further optimizes the camera pose T and the landmark coordinates P with a local BA (Bundle Adjustment) algorithm.
The invention was tested with both stereo and monocular configurations on 9 sequences of the outdoor KITTI dataset. On this dataset, the average positioning error of the stereo system was 0.5216 m and that of the monocular system was 2.1838 m. Positioning time was also measured experimentally: localizing a single frame takes about 78.77 ms. These results show that the proposed system achieves high positioning precision while meeting real-time requirements.
Table 1 ATE error test results in meters for the system

Claims (10)

1. A visual positioning method based on prior semantic map structure information and semantic information is characterized by comprising the following specific steps:
(1) Acquiring structural information and semantic information from a prior semantic map M, and taking visual image features U and semantic segmentation images C as a system observation O = {U, C}; letting the camera pose be T and the coordinate transformation between the visual image and the prior semantic map be S, and estimating the system state X = {T, S} and the semantic landmark coordinates P from the prior semantic map M and the system observation O; maximizing the posterior probability of the camera state and semantic landmarks given the prior semantic map and the system observation to obtain a visual observation factor p(U | P, T), a semantic tracking factor p(C | Z) and a prior semantic map factor p(P | M, S);
(2) Performing continuous visual tracking of the semantic landmarks by matching visual images with visual feature descriptors;
letting the camera pose of the k-th image frame be T_k and the coordinates of the i-th semantic landmark be P_i, using the reprojection error of the feature-point method as the residual term E_visual of the visual observation factor p(U | P, T), and estimating the initial pose T_k of each visual frame;
(3) Performing continuous semantic tracking of the semantic landmarks on the semantic segmentation images based on the Dirichlet distribution, and estimating the semantic label Z of each landmark from the semantic tracking factor p(C | Z) and the segmentation images C;
defining g_i = [p(z_i = 1), p(z_i = 2), ..., p(z_i = H)]^T as the discrete probability of the semantic label z_i of the i-th semantic landmark; defining the observation of the i-th landmark on the k-th segmentation image as c_{i,k}, and updating the distribution of g_i with these observations;
estimating the discrete probability distribution of z_i, i.e. the semantic state D_i, from multi-frame semantic observations C_i = {c_{i,k} | k ∈ K}, where K denotes the set of segmentation images in which the i-th landmark is observed;
(4) In the localization thread, dividing the prior semantic map M into several sub-maps M_l according to its semantic labels; establishing a data association between the semantic landmark coordinates p_i and each sub-map M_l; letting S_k be the coordinate transformation between the k-th visual frame and the prior semantic map; and extracting, from the prior semantic map factor p(P | M, S) in formula (1), the structural information term and the semantic information term p(z_i | M, S_k, p_i) to construct a hybrid constraint as the prior constraint of the method:
(5) In the localization thread, based on the proposed hybrid constraint, solving the coordinate transformation S_k between the prior semantic map and the world coordinate system of the visual image with an expectation-maximization algorithm; and estimating, from the solved S_k, the 6-degree-of-freedom pose of the camera in the prior semantic map.
2. The visual positioning method based on prior semantic map structural information and semantic information according to claim 1, wherein in step (1) the posterior probability estimate over the camera state and semantic landmarks given the prior semantic map and the system observation is:
p(X,P|M,O)∝p(U|P,T)·p(C|Z)·p(P|M,S) (1)
Wherein Z represents the semantic label of the semantic landmark, P (U|P, T) is a visual observation factor, P (C|Z) is a semantic tracking factor, and P (P|M, S) is a priori semantic map factor.
3. The visual localization method based on prior semantic map structure information and semantic information according to claim 1, wherein in step (2) the residual term E_visual of the visual observation factor p(U | P, T) is:
where π(·) denotes the reprojection function, u_{i,k} the image feature of the i-th semantic landmark on the k-th frame, Σ_{i,k} the covariance matrix of the corresponding visual data association, and Ω the set of visual data associations.
4. The visual localization method based on prior semantic map structure information and semantic information according to claim 1, wherein the semantic state D_i in step (3) is
D_i = p(g_i | C_i) ∝ p(C_i | g_i) · Dir(g_i | α_i)    (3)
where Dir(·) denotes the Dirichlet distribution and α_i = [α_1, α_2, ..., α_H]^T its parameter vector; p(C_i | g_i) follows a multinomial distribution, and updating the semantic state of a landmark reduces to updating the Dirichlet parameters:
α_i^new = α_i^old + c_{i,k}    (4)
where α_i^new denotes α_i after the update and α_i^old before it.
5. The visual localization method based on prior semantic map structural information and semantic information according to claim 1, wherein the hybrid constraint in step (4) is:
6. The visual positioning method based on prior semantic map structural information and semantic information according to claim 5, wherein in step (4) the residual of the structural information term in formula (5) is defined as:
where the associated information matrix represents the probability distribution of the local point cloud; the structural information of the prior semantic map is divided into planar and non-planar structures;
for planar structures, including roads, sidewalks and walls, the residual term of formula (6) is modeled as the point-to-plane distance, i.e.:
where n_l and q_l denote, respectively, the surface normal vector and the 3D point coordinates of the associated map element; for planar structures the information matrix is taken to be the identity matrix;
for non-planar structures, the point-to-point residual form of the ICP algorithm is adopted, i.e.:
7. The visual localization method based on prior semantic map structure information and semantic information according to claim 5 or 6, wherein in step (4), p(z_i | M, S_k, p_i) in formula (5) is the semantic weight w_i of the i-th semantic landmark, defined to follow a Dirichlet distribution: w_i ~ Dir(β_i), where β_i = [β_1, β_2, ..., β_H]^T is the Dirichlet parameter vector; it fuses the semantic states of the landmarks tracked in the localization thread into the optimization;
where γ is the balance coefficient.
8. The visual positioning method based on prior semantic map structure information and semantic information according to claim 1, wherein the flow of solving the coordinate transformation S_k between the prior semantic map and the world coordinate system of the visual image with the expectation-maximization algorithm is as follows:
1) E-step: for each semantic landmark, building the data association between the landmark and the prior semantic map of label Z by nearest-neighbour search, and estimating the weight w_i of each association with formula (9);
2) M-step: given the data associations and the weights w_i estimated by formula (9), solving S_k from the following optimization model:
where S_k is the transformation to be solved;
and repeating the E-step and the M-step until convergence or until a set number of iterations is reached.
9. The visual positioning method based on prior semantic map structure information and semantic information according to claim 8, wherein in the coordinate transformation between the prior semantic map and the world coordinate system corresponding to the visual image, the coordinate transformation S 0 corresponding to the visual image of the initial frame is an identity matrix.
10. The visual positioning method based on prior semantic map structure information and semantic information according to claim 8, wherein the coordinate transformation based on solvingFurther refining the 3D coordinates of the semantic signpost; setting semantic landmark point coordinates before optimization asThe coordinates of the waypoints after refinement are:
wherein the transformation is between the world coordinate system before the expectation-maximization optimization and the prior semantic map coordinate system; the system then further optimizes the local camera pose T and the semantic landmark coordinates P using a local bundle adjustment (BA) optimization algorithm.
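The refinement formula of claim 10 is elided in the extracted text (it was an image), so the following is only an assumed formulation: each landmark keeps its prior-map coordinate fixed, and its world-frame estimate is re-expressed through the old and newly solved world-to-map transforms, P_new = S_new⁻¹ · S_old · P_old in homogeneous coordinates. The helper names `se3` and `refine_landmarks` are illustrative, not from the patent.

```python
import numpy as np


def se3(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    S = np.eye(4)
    S[:3, :3], S[:3, 3] = R, t
    return S


def refine_landmarks(P_old, S_old, S_new):
    """Re-express landmark points after an updated world->map transform.

    Assumed formulation: P_new = inv(S_new) @ S_old @ P_old, so each landmark
    keeps its prior-map coordinate while its world-frame estimate is corrected
    by the newly solved transform.
    """
    P_h = np.hstack([P_old, np.ones((len(P_old), 1))])  # Nx4 homogeneous points
    P_new = (np.linalg.inv(S_new) @ S_old @ P_h.T).T
    return P_new[:, :3]


# toy check: if the new transform adds a +1 m x-shift, world points shift by -1 m
S_old = se3(np.eye(3), np.zeros(3))
S_new = se3(np.eye(3), np.array([1.0, 0.0, 0.0]))
refined = refine_landmarks(np.array([[2.0, 0.0, 0.0]]), S_old, S_new)
```

The refined points would then seed the local BA step mentioned in the claim, which jointly polishes camera poses T and landmark coordinates P.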
CN202210423500.0A 2022-04-21 2022-04-21 Visual positioning method based on prior semantic map structure information and semantic information Active CN114972501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210423500.0A CN114972501B (en) 2022-04-21 2022-04-21 Visual positioning method based on prior semantic map structure information and semantic information

Publications (2)

Publication Number Publication Date
CN114972501A CN114972501A (en) 2022-08-30
CN114972501B true CN114972501B (en) 2024-07-02

Family

ID=82979847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210423500.0A Active CN114972501B (en) 2022-04-21 2022-04-21 Visual positioning method based on prior semantic map structure information and semantic information

Country Status (1)

Country Link
CN (1) CN114972501B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738673A (en) * 2019-10-21 2020-01-31 哈尔滨理工大学 Visual SLAM method based on example segmentation
CN112132897A (en) * 2020-09-17 2020-12-25 中国人民解放军陆军工程大学 Visual SLAM method based on deep learning semantic segmentation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10452927B2 (en) * 2017-08-09 2019-10-22 Ydrive, Inc. Object localization within a semantic domain
CN114200481A (en) * 2020-08-28 2022-03-18 华为技术有限公司 Positioning method, positioning system and vehicle
CN114202579B (en) * 2021-11-01 2024-07-16 东北大学 Dynamic scene-oriented real-time multi-body SLAM system

Similar Documents

Publication Publication Date Title
CN108242079B (en) VSLAM method based on multi-feature visual odometer and graph optimization model
CN108225327B (en) Construction and positioning method of top mark map
KR20190082071A (en) Method, apparatus, and computer readable storage medium for updating electronic map
CN108615246B (en) Method for improving robustness of visual odometer system and reducing calculation consumption of algorithm
CN114323033B (en) Positioning method and equipment based on lane lines and feature points and automatic driving vehicle
CN109615698A (en) Multiple no-manned plane SLAM map blending algorithm based on the detection of mutual winding
CN108010081A (en) A kind of RGB-D visual odometry methods based on Census conversion and Local map optimization
CN110032965A (en) Vision positioning method based on remote sensing images
CN115420276A (en) Outdoor scene-oriented multi-robot cooperative positioning and mapping method
CN115031744A (en) Cognitive map positioning method and system based on sparse point cloud-texture information
CN116429116A (en) Robot positioning method and equipment
Tao et al. Automated processing of mobile mapping image sequences
CN113379915B (en) Driving scene construction method based on point cloud fusion
Zhang et al. Cross-modal monocular localization in prior lidar maps utilizing semantic consistency
CN113838129A (en) Method, device and system for obtaining pose information
CN113932796A (en) High-precision map lane line generation method and device and electronic equipment
CN114972501B (en) Visual positioning method based on prior semantic map structure information and semantic information
CN114323038B (en) Outdoor positioning method integrating binocular vision and 2D laser radar
CN114708321B (en) Semantic-based camera pose estimation method and system
CN116202487A (en) Real-time target attitude measurement method based on three-dimensional modeling
CN115031735A (en) Pose estimation method of monocular vision inertial odometer system based on structural features
Hu et al. Efficient Visual-Inertial navigation with point-plane map
CN113487741B (en) Dense three-dimensional map updating method and device
Sun et al. Accurate deep direct geo-localization from ground imagery and phone-grade gps
Wang et al. Improvement and experimental evaluation based on orb-slam-vi algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant