CN114399547A - Monocular SLAM robust initialization method based on multiple frames - Google Patents


Info

Publication number: CN114399547A
Application number: CN202111499604.1A
Authority: CN (China)
Prior art keywords: representing, view, points, feature points, global
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN114399547B (en)
Inventors: 胡德文, 葛杨冰
Current and original assignee: National University of Defense Technology
Priority date / filing date: 2021-12-09
Application filed by National University of Defense Technology
Publication of CN114399547A: 2022-04-26
Application granted; publication of CN114399547B: 2024-01-02

Classifications

    • G06T7/73 — Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06T3/60 — Geometric image transformation in the plane of the image; rotation of a whole image or part thereof
    • G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/579 — Depth or shape recovery from multiple images, from motion
    • G06T2207/30244 — Indexing scheme for image analysis or image enhancement; camera pose

Abstract

The invention discloses a multi-frame-based monocular SLAM robust initialization method comprising the following steps: extracting the feature points of the image frames in the initial video stream, matching them against one another, and screening out the matching points to obtain initial matching point pairs; screening out three-view pairs from the initial matching point pairs, further screening the matching points within each three-view pair with a trifocal-tensor-based random sample consensus algorithm, and constructing a three-frame matching graph; solving the relative rotation between image frames according to two-view geometry; solving the global rotations from the relative rotations between image frames; solving the global displacements based on the global rotations; combining the global rotations and displacements into the initial pose of each frame and applying nonlinear optimization adjustment according to the initial poses; and computing the depth of the feature points and recovering their three-dimensional coordinates. The method improves convergence speed and reduces the occurrence of scattered points, thereby improving the accuracy of the initial map.

Description

Monocular SLAM robust initialization method based on multiple frames
Technical Field
The invention belongs to the technical field of monocular SLAM initialization, and particularly relates to a monocular SLAM robust initialization method based on multiple frames.
Background
The goal of simultaneous localization and mapping (SLAM) is to reconstruct an unknown environment while estimating the motion trajectory of the camera. The technology is now widely applied in fields such as augmented reality and autonomous driving, and can run in real time without relying on external infrastructure.
Initialization is key to monocular SLAM: through initialization, the initial pose of the camera is obtained and an initial map is generated, providing support for the subsequent tracking stage.
At present, the academic community mainly initializes monocular SLAM systems with incremental structure from motion (SfM): the epipolar geometric constraint or planar structure constraint between two frames is used to construct a fundamental matrix or a homography matrix, the initial camera pose is obtained by decomposing that matrix, and the initial map is obtained by triangulation. This technique places high demands on the initial camera pose and on feature-point matching; the initialization process depends on the initial motion of the camera and cannot converge quickly. In the later stage of initialization, bundle adjustment (BA) is performed to further optimize the initial pose and the initial map, but "scattered points" remain after optimization, whose distance to their nearest three-dimensional feature points far exceeds the average distance between ordinary three-dimensional feature points. These points introduce errors into the tracking process and, especially when image observation quality is poor, degrade the subsequent pose tracking.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, namely slow convergence and the presence of scattered points, and provides a SLAM initialization method that improves convergence speed and reduces scattered points, thereby improving the accuracy of the initial map; in particular, a multi-frame-based monocular SLAM robust initialization method.
The invention provides a multi-frame-based monocular SLAM robust initialization method, comprising the following steps:
S1: extracting feature points from each image frame in the initial video stream, matching the image frames pairwise according to the feature points, and screening the matching points to obtain initial matching point pairs, the matching points being matched feature points;
S2: screening out three-view pairs, i.e. triples of image frames with enough co-visible feature points, according to the initial matching point pairs; further screening the matching points within each three-view pair with a trifocal-tensor-based random sample consensus algorithm; and constructing a three-frame matching graph, i.e. a topological graph describing the co-visibility relations among the image frames;
S3: solving the relative rotation between image frames according to two-view geometry;
S4: solving the global rotations from the relative rotations between image frames by iteratively reweighted least squares;
S5: solving the global displacements based on the global rotation of each image frame and the linear constraint relationship between the scene structure and the global displacement;
S6: combining the global rotations and the global displacements into the initial pose of each frame and, on this basis, optimizing the pose of each frame with a pose-only nonlinear optimization adjustment strategy;
S7: computing the depth of the feature points and recovering their three-dimensional coordinates.
Preferably, in S1, feature points of each image frame in the initial video stream are extracted, the image frames are matched pairwise according to the feature points, and the matching points are screened through a random sample consensus algorithm to obtain the initial matching point pairs, the matching points being matched feature points.
Preferably, the specific steps of screening the matching points in a three-view pair are as follows:
S2.1: for a sample set P with minimum sample number n, extract n samples from P to form a sample subset S, and compute an initial trifocal tensor from the essential matrices between views as the initialization model M;
S2.2: obtain the projection matrices P1, P2 and P3 from the initial trifocal tensor, compute the coordinates of the feature point by least squares, and obtain three estimates of the feature point through the projection matrices P1, P2 and P3 respectively:

$$\hat{x}_1 = P_1 X, \qquad \hat{x}_2 = P_2 X, \qquad \hat{x}_3 = P_3 X$$

where $\hat{x}_1$, $\hat{x}_2$ and $\hat{x}_3$ denote the estimates of the feature point under the projection matrices $P_1$, $P_2$ and $P_3$, and $X$ denotes the three-dimensional coordinates of the feature point;
the reprojection error is computed over the three-view pair, i.e.:

$$\omega = d^2(x_1, \hat{x}_1) + d^2(x_2, \hat{x}_2) + d^2(x_3, \hat{x}_3)$$

where $\omega$ denotes the reprojection error, $x_1$, $x_2$ and $x_3$ denote the measured values of the feature point in views 1, 2 and 3, and $d^2(\cdot,\cdot)$ denotes the squared Euclidean distance between two elements;
taking the reprojection error $\omega$ as the error measure of the initialization model M, the samples in P whose error with respect to M is smaller than a set threshold th form, together with the subset S, the inlier set S*;
S2.3: computing a new model M by least squares from the inlier set S*;
S2.4: repeating S2.1, S2.2 and S2.3 until the maximum consensus set is obtained, removing the outliers, and recording the inliers and the trifocal tensor of the current iteration, the inliers being the matching points.
Preferably, in S5, the global displacement is solved based on the global rotation of each image frame and the linear constraint relationship between the scene structure and the global displacement; the linear relation is:

$$B t_l + C t_i + D t_r = 0$$

where

$$B = [X_i]_\times R_{r,i} X_r \left([R_{r,l} X_r]_\times X_l\right)^T [X_l]_\times R_l,$$

$$C = \left\|[X_l]_\times R_{r,l} X_r\right\|^2 [X_i]_\times R_i,$$

$$D = -(B + C),$$

$$R_{r,i} = R_i R_r^T,$$

and where $t_l$, $t_i$ and $t_r$ denote the global displacements of views l, i and r; $X_l$, $X_i$ and $X_r$ denote the normalized image coordinates of the feature point in views l, i and r; $[\cdot]_\times$ denotes the skew-symmetric matrix of a vector; $R_l$, $R_i$ and $R_r$ denote the global rotations of views l, i and r; $R_{r,i}$ and $R_{r,l}$ denote the relative rotations between views r, i and views r, l; and T denotes the matrix transpose.
Taking all feature points into account and stacking all the linear constraints yields:

$$F \cdot t = 0$$

where F is the coefficient matrix formed from B, C and D, and $t = (t_1^T, t_2^T, \ldots, t_n^T)^T$ denotes the global displacement of all n views.
Solving this linear homogeneous equation yields the optimal value of the global displacement $\hat{t}$.
Preferably, in S6, the pose-only nonlinear optimization adjustment comprises the steps of:
S6.1: based on each three-view pair, computing the reprojection vector, i.e.:

$$\hat{x}_i = \frac{\hat{Z}_r R_{r,i} X_r + t_{r,i}}{e_3^T\left(\hat{Z}_r R_{r,i} X_r + t_{r,i}\right)}$$

where $\hat{x}_i$ denotes the reprojection vector, $e_3$ denotes the vector $(0,0,1)^T$, $\hat{Z}_r$ is the depth of the feature point computed from the reference view, $X_r$ denotes the normalized image coordinates of the feature point in view r, and $t_{r,i}$ denotes the relative displacement between views r and i, namely:

$$t_{r,i} = R_j\left(t_r - t_i\right)$$

where $R_j$ denotes the global rotation of view j, and $t_r$ and $t_i$ denote the global displacements of views r and i;
S6.2: computing and summing the reprojection errors of all feature points to obtain the error term ε:

$$\varepsilon = \sum_i \left(x_i - \hat{x}_i\right)^T\left(x_i - \hat{x}_i\right)$$

where $\hat{x}_i$ denotes the reprojection vector, $x_i$ denotes the measured value of the feature point in view i, and T denotes the matrix transpose;
S6.3: optimizing through a general graph optimization library, taking the pose of each frame as a node of the graph and the reprojection error of each feature point as an edge of the graph.
Preferably, in S7, $\theta_{(r,j)}$ is taken as the criterion of the recovery quality of a feature point, and the weighted depth $Z_r$ of view r is computed from $\theta_{(r,j)}$ as:

$$Z_r = \sum_{1 \le j \le n} \omega_{(r,j)} \hat{Z}_r^{(j)}$$

$$\omega_{(r,j)} = \theta_{(r,j)} \Big/ \sum_{1 \le j \le n} \theta_{(r,j)}$$

$$\theta_{(r,j)} = \left\|[X_j]_\times R_{r,j} X_r\right\|$$

where j denotes the j-th view, $\hat{Z}_r^{(j)}$ denotes the depth recovered from views r and j, $\omega_{(r,j)}$ denotes the weight, $\theta_{(r,j)}$ denotes the recovery quality of the feature point, $R_{r,j}$ denotes the relative rotation between views r and j, $X_r$ denotes the normalized image coordinates of the feature point in view r, $X_j$ denotes the coordinates of the feature point in view j, and $[\cdot]_\times$ denotes the skew-symmetric matrix of a vector;
the initial map is reconstructed by weighting according to the weighted depth $Z_r$, and the three-dimensional coordinates of the feature points are recovered as:

$$X_W = Z_r R_r X_r + t_r$$

where $X_W$ denotes the recovered three-dimensional coordinates of the feature point, $Z_r$ denotes the weighted depth of view r, $R_r$ denotes the global rotation of view r, $X_r$ denotes the normalized image coordinates of the feature point in view r, and $t_r$ denotes the global displacement of view r.
Preferably, in S6.3, the method further includes increasing robustness of the optimizer by setting a kernel function.
Advantageous effects:
1. Compared with traditional systems that initialize from two frames of information, by introducing global structure-from-motion techniques the method can exploit more of the video stream when solving the initial pose, and obtains a high-precision initial pose by averaging.
2. During triangulation of the initial map, weighted reconstruction comprehensively exploits the multiple observations of each feature point, improving feature-point accuracy.
3. A nonlinear optimization strategy that takes only the initial poses as optimization variables is adopted in the optimization; compared with the traditional nonlinear optimization strategy, it reduces the number of scattered points in the initial map and thus further improves the accuracy of the initial map.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the multi-frame-based monocular SLAM robust initialization method in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, this embodiment provides a monocular SLAM robust initialization method based on multiple frames, comprising the steps of:
s1: extracting feature points of each image frame in the initial video stream, matching the image frames pairwise according to the feature points, screening matching points through a Random Sample Consensus (RANSAC) algorithm, and acquiring an initial matching point pair, wherein the matching points are matched feature points; judging whether enough characteristic points are possessed, if so, performing S2; otherwise, S1 is repeated.
S2: three-view pairs, i.e. triples of image frames with enough co-visible feature points, are screened out according to the initial matching point pairs; the matching points within each three-view pair are further screened with a trifocal-tensor-based random sample consensus algorithm; and a three-frame matching graph, i.e. a topological graph describing the co-visibility relations among the image frames, is constructed.
Between two frames, a random sample consensus algorithm with the essential matrix E as the initialization model M cannot eliminate all mismatched points; between three frames, a random sample consensus algorithm with the trifocal tensor as the initialization model M can further eliminate mismatched points and improve matching accuracy. The specific steps of screening the matching points in a three-view pair are therefore:
S2.1: for a sample set P with minimum sample number n, n samples are extracted from P to form a sample subset S, and an initial trifocal tensor is computed from the essential matrices between pairs of views as the initialization model M.
S2.2: obtaining projection matrixes P1, P2 and P3 of the trifocal tensor according to the initial trifocal tensor, calculating coordinates of the characteristic points by using a least square method, and respectively obtaining three estimated values of the characteristic points through the projection matrixes P1, P2 and P3, wherein the three estimated values are as follows:
Figure RE-GDA0003497279450000061
wherein the content of the first and second substances,
Figure RE-GDA0003497279450000062
representing the estimated values of the feature points under the action of the P1 projection matrix,
Figure RE-GDA0003497279450000063
representing the estimated values of the feature points under the action of the P2 projection matrix,
Figure RE-GDA0003497279450000064
representing the estimated value of the characteristic point under the action of a P3 projection matrix, wherein X represents the three-dimensional coordinate of the characteristic point;
the reprojection error is calculated from the three-view pairs, i.e.:
Figure RE-GDA0003497279450000065
where ω denotes the reprojection error, x1Representing the measured value, x, of the characteristic point in view 12Represents the measured value, x, of the feature point in view 23Representing the measured value of the characteristic point in view 3, d2(-) represents the square of the euclidean distance between two elements;
The reprojection error $\omega$ is taken as the error measure of the initialization model M; the samples in P whose error with respect to M is smaller than a set threshold th form, together with the subset S, the inlier set S*.
S2.3: if the inlier set accounts for more than 75% of the sample set, the model parameters are deemed correct, and a new model M is computed from the inlier set S* by least squares.
S2.4: S2.1, S2.2 and S2.3 are repeated until the maximum consensus set is obtained; the outliers are removed, and the inliers and the trifocal tensor of the current iteration are recorded, the inliers being the matching points.
S3: according to the double-view geometrical principle, the relative rotation between the image frames is solved.
S4: the global rotation is calculated by an iterative weighted Least squares (IRLS) method based on the relative rotation between the image frames.
S5: under the accurate global rotation background, a linear relation exists between the scene structure and the global displacement, and a linear global translation constraint can be constructed, so that the global displacement can be directly solved based on the global rotation of each image frame and the linear constraint relation between the scene structure and the global displacement;
wherein, the linear relation expression is:
Btl+Cti+Dtr=0
wherein the content of the first and second substances,
B=[Xi]×Rr,iXr([Rr,lXr]xXl)T[Xl]xRl
C=||[Xl]×Rr,lXr||2[Xi]×Ri
D=-(B+C)
Rr,i=RiRr T
wherein, tlRepresenting the global displacement, t, of view liRepresenting the global displacement, t, of view irRepresenting a global displacement, X, of view riNormalized image coordinates representing feature points in View i [ ·]×Representing an inverse-symmetric matrix of vector correspondences, Rr,iRepresenting a relative rotation between views r and i, XrRepresenting the normalized image coordinates, R, of the feature points in view RlRepresenting a global rotation of view l, RiRepresenting a global rotation, R, of view irRepresenting a global rotation, X, of the view rlRepresenting the normalized image coordinates of the feature points in view l, T representing the transpose of the matrix, Rr,lRepresenting relative rotation between views r and l;
considering all feature points, and combining all linear constraints, the following equation can be obtained:
F·t=0
where F is a coefficient matrix formed by B, C, D, and t is (t)1 T,t2 T,...,tn T)TRepresenting the global displacement of all n views.
Solving the linear homogeneous equation can obtain the optimal value of the global displacement
Figure RE-GDA0003497279450000071
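As a numerical illustration of S5 (a sketch with invented function names, not the patent's implementation), the stacked homogeneous system F·t = 0 can be solved by taking the right singular vector of F associated with its smallest singular value; the resulting t is determined only up to the usual gauge freedoms (overall scale and a common shift):

```python
# Sketch of S5: assemble the per-point constraints B t_l + C t_i + D t_r = 0
# into F and solve F t = 0 by SVD.
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x of a 3-vector."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def solve_global_translation(constraints, R, n_views):
    """constraints: list of (l, i, r, Xl, Xi, Xr) with the normalized image
    coordinates of one feature point in views l, i and reference view r;
    R: list of global rotations."""
    rows = []
    for (l, i, r, Xl, Xi, Xr) in constraints:
        R_ri = R[i] @ R[r].T                     # relative rotation r -> i
        R_rl = R[l] @ R[r].T                     # relative rotation r -> l
        B = (skew(Xi) @ R_ri @ Xr).reshape(3, 1) @ \
            (skew(R_rl @ Xr) @ Xl).reshape(1, 3) @ skew(Xl) @ R[l]
        C = np.linalg.norm(skew(Xl) @ R_rl @ Xr) ** 2 * skew(Xi) @ R[i]
        D = -(B + C)
        row = np.zeros((3, 3 * n_views))
        row[:, 3*l:3*l+3] = B
        row[:, 3*i:3*i+3] = C
        row[:, 3*r:3*r+3] = D
        rows.append(row)
    F = np.vstack(rows)
    _, _, Vt = np.linalg.svd(F)          # F t = 0: smallest singular vector
    return Vt[-1].reshape(n_views, 3)    # unit-norm stack of t_1 .. t_n
```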
S6: the global rotation and the global displacement are integrated to obtain the initial pose of each frame, and on the basis, the pose of each frame is optimized by using a pose-only nonlinear optimization adjustment strategy, so that the reprojection error is minimum;
wherein the pose-only nonlinear optimization adjustment comprises the steps of:
S6.1: based on each three-view pair, the reprojection vector is computed, i.e.:

$$\hat{x}_i = \frac{\hat{Z}_r R_{r,i} X_r + t_{r,i}}{e_3^T\left(\hat{Z}_r R_{r,i} X_r + t_{r,i}\right)}$$

where $\hat{x}_i$ denotes the reprojection vector, $e_3$ denotes the vector $(0,0,1)^T$, $\hat{Z}_r$ is the depth of the feature point computed from the reference view, $X_r$ denotes the normalized image coordinates of the feature point in view r, and $t_{r,i}$ denotes the relative displacement between views r and i, namely:

$$t_{r,i} = R_j\left(t_r - t_i\right)$$

where $R_j$ denotes the global rotation of view j, and $t_r$ and $t_i$ denote the global displacements of views r and i;
S6.2: the reprojection errors of all feature points are computed and summed to obtain the error term ε:

$$\varepsilon = \sum_i \left(x_i - \hat{x}_i\right)^T\left(x_i - \hat{x}_i\right)$$

where $\hat{x}_i$ denotes the reprojection vector, $x_i$ denotes the measured value of the feature point in view i, and T denotes the matrix transpose;
S6.3: optimization is carried out through a general graph optimization library, taking the pose of each frame as a node of the graph and the reprojection error of each feature point as an edge of the graph, and the robustness of the optimizer is improved by setting a kernel function.
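A numpy sketch of the residual of S6.1-S6.2 that such a graph edge would evaluate is given below; the closed form of the reprojection vector is reconstructed from the symbols defined above, and the use of the rotation of view i in t_{r,i} (the text writes R_j) is an assumption of the sketch:

```python
# Sketch of the pose-only reprojection error of S6.1-S6.2 for one feature
# point observed at x_i in view i, transferred from reference view r.
import numpy as np

E3 = np.array([0.0, 0.0, 1.0])                  # the vector (0, 0, 1)^T

def reprojection_error(Z_r, R, t, r, i, X_r, x_i):
    """R, t: lists of global rotations/displacements; Z_r: depth of the
    feature point in the reference view; X_r, x_i: normalized coordinates."""
    R_ri = R[i] @ R[r].T                        # relative rotation r -> i
    t_ri = R[i] @ (t[r] - t[i])                 # relative displacement
    p = Z_r * (R_ri @ X_r) + t_ri               # point expressed in view i
    x_hat = p / (E3 @ p)                        # reprojection vector
    e = x_i - x_hat
    return e @ e                                # one summand of epsilon
```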
S7: the depth of the feature points is computed by triangulation, and their three-dimensional coordinates are recovered;
specifically: $\theta_{(r,j)}$ is taken as the criterion of the recovery quality of a feature point, and the weighted depth $Z_r$ of view r is computed from $\theta_{(r,j)}$ as:

$$Z_r = \sum_{1 \le j \le n} \omega_{(r,j)} \hat{Z}_r^{(j)}$$

$$\omega_{(r,j)} = \theta_{(r,j)} \Big/ \sum_{1 \le j \le n} \theta_{(r,j)}$$

$$\theta_{(r,j)} = \left\|[X_j]_\times R_{r,j} X_r\right\|$$

where j denotes the j-th view, $\hat{Z}_r^{(j)}$ denotes the depth recovered from views r and j, $\omega_{(r,j)}$ denotes the weight, $\theta_{(r,j)}$ denotes the recovery quality of the feature point, $R_{r,j}$ denotes the relative rotation between views r and j, $X_r$ denotes the normalized image coordinates of the feature point in view r, $X_j$ denotes the coordinates of the feature point in view j, and $[\cdot]_\times$ denotes the skew-symmetric matrix of a vector;
the initial map is reconstructed by weighting according to the weighted depth $Z_r$, and the three-dimensional coordinates of the feature points are recovered as:

$$X_W = Z_r R_r X_r + t_r$$

where $X_W$ denotes the recovered three-dimensional coordinates of the feature point, $Z_r$ denotes the weighted depth of view r, $R_r$ denotes the global rotation of view r, $X_r$ denotes the normalized image coordinates of the feature point in view r, and $t_r$ denotes the global displacement of view r.
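A sketch of the weighted reconstruction of S7 follows. The per-pair depth Ẑ_r^(j) is obtained here from the standard two-view relation $Z_r [X_j]_\times R_{r,j} X_r + [X_j]_\times t_{r,j} = 0$ solved in least squares; this closed form, the camera-convention details, and the helper names are assumptions of the sketch:

```python
# Sketch of S7: weighted-depth triangulation and map-point recovery.
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def weighted_point(r, views, R, t, X):
    """r: reference view; views: other views j observing the point;
    X[j]: normalized image coordinates of the point in view j."""
    thetas, depths = [], []
    for j in views:
        R_rj = R[j] @ R[r].T
        t_rj = R[j] @ (t[r] - t[j])
        a = skew(X[j]) @ R_rj @ X[r]
        theta = np.linalg.norm(a)               # recovery quality theta_(r,j)
        Z_j = -(a @ (skew(X[j]) @ t_rj)) / theta ** 2   # least-squares depth
        thetas.append(theta)
        depths.append(Z_j)
    w = np.array(thetas) / np.sum(thetas)       # weights omega_(r,j)
    Z_r = float(w @ np.array(depths))           # weighted depth of view r
    return Z_r * (R[r] @ X[r]) + t[r]           # X_W = Z_r R_r X_r + t_r
```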
The multi-frame-based monocular SLAM robust initialization method provided by this embodiment has the following beneficial effects:
1. Compared with traditional systems that initialize from two frames of information, this embodiment introduces global structure-from-motion techniques into the monocular initialization system, exploits more of the video stream when solving the initial pose, and obtains a high-precision initial pose by averaging.
2. During triangulation of the initial map, weighted reconstruction comprehensively exploits the multiple observations of each feature point, improving the accuracy of the initial map.
3. A nonlinear optimization strategy that takes only the initial poses as optimization variables is adopted in the optimization; compared with the traditional nonlinear optimization strategy, it reduces the number of scattered points in the initial map and thus further improves the accuracy of the initial map.
The present invention is not limited to the above preferred embodiments, and any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A monocular SLAM robust initialization method based on multiple frames, characterized by comprising the following steps:
S1: extracting feature points from each image frame in the initial video stream, matching the image frames pairwise according to the feature points, and screening the matching points to obtain initial matching point pairs, the matching points being matched feature points;
S2: screening out three-view pairs, i.e. triples of image frames with enough co-visible feature points, according to the initial matching point pairs, further screening the matching points within each three-view pair with a trifocal-tensor-based random sample consensus algorithm, and constructing a three-frame matching graph, i.e. a topological graph describing the co-visibility relations among the image frames;
S3: solving the relative rotation between image frames according to two-view geometry;
S4: solving the global rotations from the relative rotations between image frames by iteratively reweighted least squares;
S5: solving the global displacements based on the global rotation of each image frame and the linear constraint relationship between the scene structure and the global displacement;
S6: combining the global rotations and the global displacements into the initial pose of each frame and, on this basis, optimizing the pose of each frame with a pose-only nonlinear optimization adjustment strategy;
S7: computing the depth of the feature points and recovering their three-dimensional coordinates.
2. The multi-frame-based monocular SLAM robust initialization method as recited in claim 1, wherein in S1, feature points of each image frame in the initial video stream are extracted, the image frames are matched pairwise according to the feature points, and the matching points are screened through a random sample consensus algorithm to obtain the initial matching point pairs, the matching points being matched feature points.
3. The multi-frame-based monocular SLAM robust initialization method according to claim 2, wherein the specific steps of screening the matching points in a three-view pair are:
S2.1: for a sample set P with minimum sample number n, extracting n samples from P to form a sample subset S, and computing an initial trifocal tensor from the essential matrices between views as the initialization model M;
S2.2: obtaining the projection matrices P1, P2 and P3 from the initial trifocal tensor, computing the coordinates of the feature point by least squares, and obtaining three estimates of the feature point through the projection matrices P1, P2 and P3 respectively:

$$\hat{x}_1 = P_1 X, \qquad \hat{x}_2 = P_2 X, \qquad \hat{x}_3 = P_3 X$$

where $\hat{x}_1$, $\hat{x}_2$ and $\hat{x}_3$ denote the estimates of the feature point under the projection matrices $P_1$, $P_2$ and $P_3$, and $X$ denotes the three-dimensional coordinates of the feature point;
computing the reprojection error over the three-view pair, i.e.:

$$\omega = d^2(x_1, \hat{x}_1) + d^2(x_2, \hat{x}_2) + d^2(x_3, \hat{x}_3)$$

where $\omega$ denotes the reprojection error, $x_1$, $x_2$ and $x_3$ denote the measured values of the feature point in views 1, 2 and 3, and $d^2(\cdot,\cdot)$ denotes the squared Euclidean distance between two elements;
taking the reprojection error $\omega$ as the error measure of the initialization model M, the samples in P whose error with respect to M is smaller than a set threshold th forming, together with the subset S, the inlier set S*;
S2.3: computing a new model M by least squares from the inlier set S*;
S2.4: repeating S2.1, S2.2 and S2.3 until the maximum consensus set is obtained, removing the outliers, and recording the inliers and the trifocal tensor of the current iteration, the inliers being the matching points.
4. The method of claim 3, wherein in S5 the global displacement is solved based on the global rotation of each image frame and the linear constraint relationship between the scene structure and the global displacement, the linear relation being:

$$B t_l + C t_i + D t_r = 0$$

where

$$B = [X_i]_\times R_{r,i} X_r \left([R_{r,l} X_r]_\times X_l\right)^T [X_l]_\times R_l,$$

$$C = \left\|[X_l]_\times R_{r,l} X_r\right\|^2 [X_i]_\times R_i,$$

$$D = -(B + C),$$

$$R_{r,i} = R_i R_r^T,$$

and where $t_l$, $t_i$ and $t_r$ denote the global displacements of views l, i and r; $X_l$, $X_i$ and $X_r$ denote the normalized image coordinates of the feature point in views l, i and r; $[\cdot]_\times$ denotes the skew-symmetric matrix of a vector; $R_l$, $R_i$ and $R_r$ denote the global rotations of views l, i and r; $R_{r,i}$ and $R_{r,l}$ denote the relative rotations between views r, i and views r, l; and T denotes the matrix transpose;
taking all feature points into account and stacking all the linear constraints yields:

$$F \cdot t = 0$$

where F is the coefficient matrix formed from B, C and D, and $t = (t_1^T, t_2^T, \ldots, t_n^T)^T$ denotes the global displacement of all n views;
solving this linear homogeneous equation yields the optimal value of the global displacement $\hat{t}$.
5. The multi-frame-based monocular SLAM robust initialization method of claim 4, wherein in S6 the pose-only nonlinear optimization adjustment comprises:
S6.1: based on each three-view pair, computing the reprojection vector, i.e.:

$$\hat{x}_i = \frac{\hat{Z}_r R_{r,i} X_r + t_{r,i}}{e_3^T\left(\hat{Z}_r R_{r,i} X_r + t_{r,i}\right)}$$

where $\hat{x}_i$ denotes the reprojection vector, $e_3$ denotes the vector $(0,0,1)^T$, $\hat{Z}_r$ is the depth of the feature point computed from the reference view, $X_r$ denotes the normalized image coordinates of the feature point in view r, and $t_{r,i}$ denotes the relative displacement between views r and i, namely:

$$t_{r,i} = R_j\left(t_r - t_i\right)$$

where $R_j$ denotes the global rotation of view j, and $t_r$ and $t_i$ denote the global displacements of views r and i;
S6.2: computing and summing the reprojection errors of all feature points to obtain the error term ε:

$$\varepsilon = \sum_i \left(x_i - \hat{x}_i\right)^T\left(x_i - \hat{x}_i\right)$$

where $\hat{x}_i$ denotes the reprojection vector, $x_i$ denotes the measured value of the feature point in view i, and T denotes the matrix transpose;
S6.3: optimizing through a general graph optimization library, taking the pose of each frame as a node of the graph and the reprojection error of each feature point as an edge of the graph.
6. The method of claim 5, wherein in S7, $\theta_{(r,j)}$ is taken as the criterion of the recovery quality of a feature point, and the weighted depth $Z_r$ of view r is computed from $\theta_{(r,j)}$ as:

$$Z_r = \sum_{1 \le j \le n} \omega_{(r,j)} \hat{Z}_r^{(j)}$$

$$\omega_{(r,j)} = \theta_{(r,j)} \Big/ \sum_{1 \le j \le n} \theta_{(r,j)}$$

$$\theta_{(r,j)} = \left\|[X_j]_\times R_{r,j} X_r\right\|$$

where j denotes the j-th view, $\hat{Z}_r^{(j)}$ denotes the depth recovered from views r and j, $\omega_{(r,j)}$ denotes the weight, $\theta_{(r,j)}$ denotes the recovery quality of the feature point, $R_{r,j}$ denotes the relative rotation between views r and j, $X_r$ denotes the normalized image coordinates of the feature point in view r, $X_j$ denotes the coordinates of the feature point in view j, and $[\cdot]_\times$ denotes the skew-symmetric matrix of a vector;
the initial map is reconstructed by weighting according to the weighted depth $Z_r$, and the three-dimensional coordinates of the feature points are recovered as:

$$X_W = Z_r R_r X_r + t_r$$

where $X_W$ denotes the recovered three-dimensional coordinates of the feature point, $Z_r$ denotes the weighted depth of view r, $R_r$ denotes the global rotation of view r, $X_r$ denotes the normalized image coordinates of the feature point in view r, and $t_r$ denotes the global displacement of view r.
7. The method of claim 5, wherein S6.3 further comprises enhancing the robustness of the optimizer by setting a kernel function.
Application CN202111499604.1A | priority date 2021-12-09 | filing date 2021-12-09 | Monocular SLAM robust initialization method based on multiple frames | Active | granted as CN114399547B (en)

Priority Applications (1)

Application Number: CN202111499604.1A | Priority Date: 2021-12-09 | Filing Date: 2021-12-09 | Title: Monocular SLAM robust initialization method based on multiple frames

Publications (2)

Publication Number | Publication Date
CN114399547A | 2022-04-26
CN114399547B | 2024-01-02

Family ID: 81227588

Family Applications (1): CN202111499604.1A | filed 2021-12-09 | granted | Monocular SLAM robust initialization method based on multiple frames

Country Status (1): CN 114399547 B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US20120121161A1 | 2010-09-24 | 2012-05-17 | Evolution Robotics, Inc. | Systems and methods for VSLAM optimization
US20180180733A1 | 2016-12-27 | 2018-06-28 | Gerard Dirk Smits | Systems and methods for machine perception
CN108090958A | 2017-12-06 | 2018-05-29 | 上海阅面网络科技有限公司 | Robot simultaneous localization and map construction method and system
WO2020168668A1 | 2019-02-22 | 2020-08-27 | 广州小鹏汽车科技有限公司 | SLAM mapping method and system for vehicle

Non-Patent Citations (1)

刘勇, 基于影像的运动平台自定位测姿 (Image-based self-localization and attitude measurement of a motion platform), 中国博士学位论文全文数据库 基础科学辑 (China Doctoral Dissertations Full-text Database, Basic Sciences)

Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
CN117314735A | 2023-09-26 | 2023-12-29 | 长光辰英(杭州)科学仪器有限公司 | Global optimization coordinate mapping conversion method based on minimized reprojection error
CN117314735B | 2023-09-26 | 2024-04-05 | 长光辰英(杭州)科学仪器有限公司 | Global optimization coordinate mapping conversion method based on minimized reprojection error (granted version)

Also Published As

CN114399547B (en), published 2024-01-02

Similar Documents

CN107301654B — Multi-sensor high-precision simultaneous localization and mapping method
CN106204574B — Camera pose self-calibration method based on object-plane motion features
US8953847B2 — Method and apparatus for solving position and orientation from correlated point features in images
CN109166149A — Positioning and three-dimensional wireframe reconstruction method and system fusing a binocular camera and an IMU
CN110443836A — Point cloud data automatic registration method and device based on planar features
CN110807809B — Lightweight monocular visual localization method based on point-line features and a depth filter
US10755139B2 — Random sample consensus for groups of data
CN101826206B — Camera self-calibration method
CN113256698B — Monocular 3D reconstruction method with depth prediction
CN110796694A — Real-time fruit three-dimensional point cloud acquisition method based on Kinect V2
CN103440659B — Starry-sky image distortion detection and estimation method based on star pattern matching
CN112419497A — Monocular-vision SLAM method combining the feature-based and direct methods
CN110009745B — Method for extracting planes from point clouds according to planar primitives and model driving
CN111899290A — Three-dimensional reconstruction method combining polarization and binocular vision
Nieto et al. — Non-linear optimization for robust estimation of vanishing points
CN112652020A — Visual SLAM method based on the AdaLAM algorithm
Yuan et al. — SDV-LOAM: Semi-direct visual-LiDAR odometry and mapping
CN103824294A — Method for aligning electronic cross-sectional image sequences
CN114399547A — Monocular SLAM robust initialization method based on multiple frames (this document)
Zhang et al. — Efficient Pairwise 3-D Registration of Urban Scenes via Hybrid Structural Descriptors
CN113506342B — SLAM omnidirectional loop correction method based on multi-camera panoramic vision
CN111079826A — Real-time construction-progress identification method fusing SLAM and image processing
Yao et al. — Registering oblique SAR images based on complementary integrated filtering and multilevel matching
CN111160362B — FAST feature homogenized extraction and inter-frame feature mismatch removal method
CN112288814A — Three-dimensional tracking registration method for augmented reality

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant