CN114399547B - Monocular SLAM robust initialization method based on multiframe - Google Patents


Info

Publication number
CN114399547B
CN114399547B
Authority: CN (China)
Prior art keywords: representing, view, points, global, initial
Legal status: Active
Application number: CN202111499604.1A
Other languages: Chinese (zh)
Other versions: CN114399547A
Inventors: 胡德文, 葛杨冰
Current Assignee: National University of Defense Technology
Original Assignee: National University of Defense Technology
Priority date: 2021-12-09
Filing date: 2021-12-09
Application filed by: National University of Defense Technology
Publication of CN114399547A: 2022-04-26
Publication of CN114399547B (grant): 2024-01-02

Classifications

    • G: Physics; G06: Computing, calculating or counting; G06T: Image data processing or generation, in general
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T3/60: Rotation of whole images or parts thereof
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/579: Depth or shape recovery from multiple images, from motion
    • G06T2207/30244: Camera pose (indexing scheme for image analysis or image enhancement; subject or context of image processing)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-frame-based monocular SLAM robust initialization method, which comprises the following steps: extracting feature points from the image frames of an initial video stream, matching the frames pairwise, and screening the matches to obtain initial matching point pairs; screening out three-view pairs according to the initial matching point pairs, further screening the matching points within the three-view pairs with a trifocal-tensor-based random sample consensus algorithm, and constructing a three-frame matching graph; solving the relative rotation between the image frames according to the principle of two-view geometry; solving the global rotations from the relative rotations between the image frames; solving the global displacements based on the global rotations; combining the global rotations and global displacements to obtain the initial pose of each frame, and performing nonlinear optimization adjustment from the initial pose; and calculating the depth of field of the feature points to recover their three-dimensional coordinates. The method improves the convergence speed and reduces the occurrence of scattered points, thereby improving the accuracy of the initial map.

Description

Monocular SLAM robust initialization method based on multiframe
Technical Field
The invention belongs to the technical field of monocular SLAM initialization and particularly relates to a multi-frame-based monocular SLAM robust initialization method.
Background
The goal of simultaneous localization and mapping (SLAM) is to reconstruct an unknown environment while estimating the motion trajectory of the camera. The technique is now widely applied in fields such as augmented reality and autonomous driving, and can run in real time without relying on external infrastructure.
Initialization is the key to monocular SLAM: it yields the initial pose of the camera and generates the initial map, providing support for the subsequent tracking phase.
The academic community currently initializes monocular SLAM systems mainly with incremental structure-from-motion (SfM): epipolar geometric constraints or planar structure constraints between two frames are used to construct a fundamental matrix or a homography matrix, the initial camera pose is obtained by decomposing that matrix, and the initial map is obtained by triangulation. This approach places high demands on the initial camera pose and on feature point matching; the initialization process depends on the camera's initial motion and cannot converge quickly. Late in initialization, bundle adjustment (BA) is performed to further optimize the initial pose and the initial map, but even after optimization there remain "scattered points" whose distance to their nearest three-dimensional feature point far exceeds the average spacing between ordinary three-dimensional feature points. These cause errors during tracking, and especially when image observation quality is poor they degrade subsequent pose tracking.
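For reference, the conventional two-frame initialization described above can be sketched as follows, assuming OpenCV (cv2) and NumPy; pts1/pts2 are matched pixel coordinates, K is the camera intrinsic matrix, and the homography branch and model selection are omitted for brevity, so this is an illustrative sketch rather than the exact prior-art pipeline.

```python
# A minimal sketch of two-frame initialization: essential-matrix estimation,
# pose recovery by decomposition, and triangulation of an initial map.
import cv2
import numpy as np

def two_frame_init(pts1, pts2, K):
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Decompose E into the relative pose (R, t) of frame 2 w.r.t. frame 1
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    # Triangulate an initial map from the two views
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # 4xN homogeneous
    X = (X_h[:3] / X_h[3]).T                             # Nx3 initial map
    return R, t, X
```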
Disclosure of Invention
The invention aims to overcome the defects of the prior art, namely slow convergence and the presence of scattered points, and provides a SLAM initialization method that improves the convergence speed and reduces scattered points so as to improve the accuracy of the initial map, specifically a multi-frame-based monocular SLAM robust initialization method.
The invention provides a multi-frame-based monocular SLAM robust initialization method, which comprises the following steps:
S1: extracting feature points from each image frame in the initial video stream, matching the image frames pairwise according to the feature points, screening out matching points, and obtaining initial matching point pairs, wherein the matching points are matched feature points;
S2: screening out three-view pairs, namely triples of image frames with enough co-visible feature points, according to the initial matching point pairs, further screening the matching points within each three-view pair based on a trifocal-tensor random sample consensus algorithm, and constructing a three-frame matching graph, namely a topological graph describing the co-visibility relations among the image frames;
S3: solving the relative rotation between the image frames according to the principle of two-view geometry;
S4: solving the global rotations by an iteratively reweighted least squares method according to the relative rotations between the image frames;
S5: solving the global displacements based on the linear constraint relating the global rotation of each image frame, the scene structure and the global displacement;
S6: combining the global rotations and global displacements to obtain an initial pose for each frame, and, starting from the initial pose, optimizing the pose of each frame with a pose-only nonlinear optimization adjustment strategy;
S7: calculating the depth of field of the feature points and recovering the three-dimensional coordinates of the feature points.
Preferably, in S1, feature points of each image frame in the initial video stream are extracted, the image frames are matched pairwise according to the feature points, and the matching points are screened by a random sample consensus algorithm to obtain the initial matching point pairs, wherein the matching points are matched feature points.
Preferably, the specific steps of screening the matching points within a three-view pair are as follows:
S2.1: setting a sample set P whose minimum sample number is n, extracting n samples from the sample set P to form a sample subset S, and calculating an initial trifocal tensor from the inter-view essential matrices as the initialization model M;
S2.2: obtaining the projection matrices P1, P2 and P3 of the three views from the initial trifocal tensor, calculating the feature point coordinates by least squares, and obtaining three estimates of each feature point through the projection matrices P1, P2 and P3 respectively, namely:

$\hat{x}_1 = P_1 X, \qquad \hat{x}_2 = P_2 X, \qquad \hat{x}_3 = P_3 X$

wherein $\hat{x}_1$, $\hat{x}_2$ and $\hat{x}_3$ represent the estimates of the feature point under the projection matrices P1, P2 and P3 respectively, and X represents the three-dimensional coordinates of the feature point;
calculating the reprojection error over the three views, namely:

$\omega = d^2(x_1, \hat{x}_1) + d^2(x_2, \hat{x}_2) + d^2(x_3, \hat{x}_3)$

wherein ω represents the reprojection error, x_1, x_2 and x_3 represent the measured values of the feature point in views 1, 2 and 3, and d²(·,·) represents the square of the Euclidean distance between two elements;
taking the reprojection error ω as the error measure of the initialization model M, and forming an inlier set S* from the sample subset S together with those samples of the sample set P whose error with respect to the initialization model M is smaller than a set threshold th;
S2.3: calculating a new model M by least squares from the inlier set S*;
S2.4: repeating S2.1, S2.2 and S2.3 until the maximum consensus set is obtained, removing the outliers, and recording the inliers and the trifocal tensor of the current iteration, wherein the inliers are the matching points.
Preferably, in S5, the global displacement is solved based on the global rotation of each image frame and the linear constraint between the scene structure and the global displacement; the linear constraint is:

$B t_l + C t_i + D t_r = 0$

wherein

$B = [X_i]_\times R_{r,i} X_r \left([R_{r,l} X_r]_\times X_l\right)^T [X_l]_\times R_l$

$C = \left\|[X_l]_\times R_{r,l} X_r\right\|^2 [X_i]_\times R_i$

$D = -(B + C)$

$R_{r,i} = R_i R_r^T$

wherein t_l, t_i and t_r represent the global displacements of views l, i and r; X_l, X_i and X_r represent the normalized image coordinates of the feature point in views l, i and r; [·]_× represents the antisymmetric matrix corresponding to a vector; R_{r,i} and R_{r,l} represent the relative rotations between view r and views i and l; R_l, R_i and R_r represent the global rotations of views l, i and r; and T represents the transpose of a matrix;
considering all feature points and stacking all the linear constraints yields:

$F \cdot t = 0$

wherein F is the coefficient matrix composed of the blocks B, C and D, and $t = (t_1^T, t_2^T, \ldots, t_n^T)^T$ represents the global displacements of all n views;
solving this linear homogeneous equation yields the optimal value of the global displacement $\hat{t}$.
Preferably, in S6, the steps of the pose-only nonlinear optimization adjustment are:
S6.1: based on each three-view pair, calculating the reprojection vector, namely:

$\hat{x}_i = \dfrac{\hat{Z}_r R_{r,i} X_r + t_{r,i}}{e_3^T \left(\hat{Z}_r R_{r,i} X_r + t_{r,i}\right)}$

wherein $\hat{x}_i$ represents the reprojection vector, $e_3$ represents the vector $(0, 0, 1)^T$, $\hat{Z}_r$ is the depth of field of the feature point calculated from the reference view, X_r represents the normalized image coordinates of the feature point in view r, and t_{r,i} represents the relative displacement between views r and i, expressed as:

$t_{r,i} = R_i (t_r - t_i)$

wherein R_i represents the global rotation of view i, t_r represents the global displacement of view r, and t_i represents the global displacement of view i;
S6.2: calculating and summing the reprojection errors of all feature points to obtain an error term ε, whose expression is:

$\varepsilon = \sum_i \left(\hat{x}_i - x_i\right)^T \left(\hat{x}_i - x_i\right)$

wherein $\hat{x}_i$ represents the reprojection vector, x_i represents the measured value of the feature point in view i, and T represents the transpose of a matrix;
S6.3: optimizing through a general graph optimization library, taking the pose of each frame as a node of the graph optimization and the reprojection error of each feature point as an edge of the graph optimization.
Preferably, in S7, θ_{(r,j)} is taken as the criterion of feature point recovery quality, and the weighted depth Z_r of view r is calculated according to θ_{(r,j)}; the expression of the weighted depth Z_r is:

$Z_r = \sum_{1 \le j \le n} \omega_{(r,j)} Z_r^{(j)}$

$\omega_{(r,j)} = \theta_{(r,j)} \Big/ \sum_{1 \le j \le n} \theta_{(r,j)}$

$\theta_{(r,j)} = \left\|[X_j]_\times R_{r,j} X_r\right\|$

wherein j represents the j-th view, $Z_r^{(j)}$ represents the depth of field calculated with view j, ω_{(r,j)} represents the weight, θ_{(r,j)} represents the feature point recovery quality, R_{r,j} represents the relative rotation between views r and j, X_r represents the normalized image coordinates of the feature point in view r, X_j represents the coordinates of the feature point in view j, and [·]_× represents the antisymmetric matrix of a vector;
according to the weighted depth Z_r, the initial map is reconstructed with weighting and the three-dimensional coordinates of the feature points are recovered, the recovered three-dimensional coordinates being expressed as:

$X_W = Z_r R_r X_r + t_r$

wherein X_W represents the recovered three-dimensional coordinates of the feature point, Z_r represents the weighted depth of view r, R_r represents the global rotation of view r, X_r represents the normalized image coordinates of the feature point in view r, and t_r represents the global displacement of view r.
Preferably, in S6.3, the method further comprises improving the robustness of the optimizer by setting a kernel function.
The beneficial effects are:
1. Compared with traditional systems that initialize from two frames of information, the method, by introducing a global structure-from-motion technique, makes fuller use of the video stream information when solving the initial pose and obtains a high-precision initial pose by averaging.
2. In triangulating the initial map, a weighted reconstruction technique comprehensively exploits multiple observations of each feature point, improving the accuracy of the feature points.
3. In the optimization, a nonlinear optimization strategy that takes only the initial poses as optimization variables reduces the number of scattered points in the initial map, further improving the accuracy of the initial map compared with the traditional nonlinear optimization strategy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of the multi-frame-based monocular SLAM robust initialization method in an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be described clearly and completely below with reference to the drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort fall within the scope of the present invention.
As shown in Fig. 1, the present embodiment provides a multi-frame-based monocular SLAM robust initialization method, which includes the following steps.
S1: Extract feature points from each image frame in the initial video stream, match the image frames pairwise according to the feature points, screen the matching points with a random sample consensus (RANSAC) algorithm, and obtain the initial matching point pairs, the matching points being matched feature points. Then judge whether the feature points are sufficient: if so, proceed to S2; if not, repeat S1. A minimal sketch of this step is given below.
S2: Screen out three-view pairs, namely triples of image frames with enough co-visible feature points, according to the initial matching point pairs; further screen the matching points within each three-view pair with a trifocal-tensor-based random sample consensus algorithm; and construct the three-frame matching graph, namely a topological graph describing the co-visibility relations among the image frames.
With the essential matrix E as the initialization model M between two frames, a random sample consensus algorithm cannot remove all mismatched points; a random sample consensus algorithm that instead uses the trifocal tensor as the initialization model M across three frames removes the remaining mismatches and improves the matching precision. The specific steps of screening the matching points within a three-view pair are as follows:
S2.1: Set a sample set P whose minimum sample number is n, extract n samples from the sample set P to form a sample subset S, and calculate an initial trifocal tensor from the essential matrices between each pair of views as the initialization model M.
S2.2: Obtain the projection matrices P1, P2 and P3 of the three views from the initial trifocal tensor, calculate the feature point coordinates by least squares, and obtain three estimates of each feature point through the projection matrices P1, P2 and P3 respectively, namely:

$\hat{x}_1 = P_1 X, \qquad \hat{x}_2 = P_2 X, \qquad \hat{x}_3 = P_3 X$

where $\hat{x}_1$, $\hat{x}_2$ and $\hat{x}_3$ are the estimates of the feature point under the projection matrices P1, P2 and P3 respectively, and X is the three-dimensional coordinates of the feature point.
The reprojection error is then calculated over the three views, namely:

$\omega = d^2(x_1, \hat{x}_1) + d^2(x_2, \hat{x}_2) + d^2(x_3, \hat{x}_3)$

where ω is the reprojection error, x_1, x_2 and x_3 are the measured values of the feature point in views 1, 2 and 3, and d²(·,·) is the square of the Euclidean distance between two elements.
Taking the reprojection error ω as the error measure of the initialization model M, the inlier set S* is formed from the sample subset S together with those samples of the sample set P whose error with respect to the initialization model M is smaller than a set threshold th.
S2.3: If the inlier set accounts for more than 75% of the sample set, the correct model parameters are considered obtained, and a new model M is calculated by least squares from the inlier set S*.
S2.4: Repeat S2.1, S2.2 and S2.3 until the maximum consensus set is obtained, remove the outliers, and record the inliers and the trifocal tensor of the current iteration; the inliers are the matching points. A sketch of the S2.2 scoring is given below.
S3: Solve the relative rotation between the image frames according to the principle of two-view geometry.
S4: Calculate the global rotations by an iteratively reweighted least squares method (IRLS) based on the relative rotations between the image frames, as sketched below.
S5: Given accurate global rotations, a linear relation exists between the scene structure and the global displacements, and a linear global translation constraint can be constructed; the global displacements can therefore be solved directly based on the global rotation of each image frame and the linear constraint between the scene structure and the global displacement.
The linear constraint is:

$B t_l + C t_i + D t_r = 0$

where

$B = [X_i]_\times R_{r,i} X_r \left([R_{r,l} X_r]_\times X_l\right)^T [X_l]_\times R_l$

$C = \left\|[X_l]_\times R_{r,l} X_r\right\|^2 [X_i]_\times R_i$

$D = -(B + C)$

$R_{r,i} = R_i R_r^T$

in which t_l, t_i and t_r are the global displacements of views l, i and r; X_l, X_i and X_r are the normalized image coordinates of the feature point in views l, i and r; [·]_× is the antisymmetric matrix corresponding to a vector; R_{r,i} and R_{r,l} are the relative rotations between view r and views i and l; R_l, R_i and R_r are the global rotations of views l, i and r; and T denotes the transpose of a matrix.
Considering all feature points and stacking all the linear constraints gives:

$F \cdot t = 0$

where F is the coefficient matrix composed of the blocks B, C and D, and $t = (t_1^T, t_2^T, \ldots, t_n^T)^T$ is the global displacement of all n views.
Solving this linear homogeneous equation, e.g. by taking the right singular vector of F associated with its smallest singular value, yields the optimal value of the global displacement $\hat{t}$.
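A sketch of assembling and solving the stacked constraint F·t = 0, assuming NumPy; flattening the observations into (l, i, r, X_l, X_i, X_r, R_l, R_i, R_r) tuples is a bookkeeping assumption of this sketch, and skew() builds the antisymmetric matrix [v]_×.

```python
# S5 sketch: build the coefficient matrix F from the B, C, D blocks and
# solve F t = 0 up to scale.
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0.0]])

def solve_global_translations(n_views, observations):
    rows = []
    for (l, i, r, X_l, X_i, X_r, R_l, R_i, R_r) in observations:
        R_ri = R_i @ R_r.T                              # relative rotations
        R_rl = R_l @ R_r.T
        a = skew(R_rl @ X_r) @ X_l                      # ([R_rl X_r]_x X_l)
        B = (skew(X_i) @ (R_ri @ X_r))[:, None] @ a[None, :] @ skew(X_l) @ R_l
        C = np.linalg.norm(skew(X_l) @ (R_rl @ X_r)) ** 2 * (skew(X_i) @ R_i)
        D = -(B + C)
        row = np.zeros((3, 3 * n_views))                # B t_l + C t_i + D t_r = 0
        row[:, 3*l:3*l+3] = B
        row[:, 3*i:3*i+3] = C
        row[:, 3*r:3*r+3] = D
        rows.append(row)
    F = np.vstack(rows)
    # fixing t_0 = 0 removes the trivial equal-translation null direction;
    # the remaining global scale is resolved up to norm by the SVD
    _, _, Vt = np.linalg.svd(F[:, 3:], full_matrices=False)
    return np.vstack([np.zeros((1, 3)), Vt[-1].reshape(n_views - 1, 3)])
```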
S6: Combine the global rotations and global displacements to obtain the initial pose of each frame, and, starting from the initial pose, optimize the pose of each frame with a pose-only nonlinear optimization adjustment strategy so that the reprojection error is minimized.
The pose-only nonlinear optimization adjustment comprises the following steps:
S6.1: Based on each three-view pair, calculate the reprojection vector, namely:

$\hat{x}_i = \dfrac{\hat{Z}_r R_{r,i} X_r + t_{r,i}}{e_3^T \left(\hat{Z}_r R_{r,i} X_r + t_{r,i}\right)}$

where $\hat{x}_i$ is the reprojection vector, $e_3$ is the vector $(0, 0, 1)^T$, $\hat{Z}_r$ is the depth of field of the feature point calculated from the reference view, X_r is the normalized image coordinates of the feature point in view r, and t_{r,i} is the relative displacement between views r and i, expressed as:

$t_{r,i} = R_i (t_r - t_i)$

where R_i is the global rotation of view i, t_r is the global displacement of view r, and t_i is the global displacement of view i.
S6.2: Calculate and sum the reprojection errors of all feature points to obtain the error term ε, whose expression is:

$\varepsilon = \sum_i \left(\hat{x}_i - x_i\right)^T \left(\hat{x}_i - x_i\right)$

where $\hat{x}_i$ is the reprojection vector, x_i is the measured value of the feature point in view i, and T denotes the transpose of a matrix.
S6.3: Optimize through a general graph optimization library, taking the pose of each frame as a node of the graph optimization and the reprojection error of each feature point as an edge of the graph optimization; the robustness of the optimizer is improved by setting a kernel function.
S7: Calculate the depth of field of the feature points by triangulation and recover the three-dimensional coordinates of the feature points.
Specifically, θ_{(r,j)} is taken as the criterion of feature point recovery quality, and the weighted depth Z_r of view r is calculated according to θ_{(r,j)}; the expression of the weighted depth Z_r is:

$Z_r = \sum_{1 \le j \le n} \omega_{(r,j)} Z_r^{(j)}$

$\omega_{(r,j)} = \theta_{(r,j)} \Big/ \sum_{1 \le j \le n} \theta_{(r,j)}$

$\theta_{(r,j)} = \left\|[X_j]_\times R_{r,j} X_r\right\|$

where j denotes the j-th view, $Z_r^{(j)}$ is the depth of field calculated with view j, ω_{(r,j)} is the weight, θ_{(r,j)} is the feature point recovery quality, R_{r,j} is the relative rotation between views r and j, X_r is the normalized image coordinates of the feature point in view r, X_j is the coordinates of the feature point in view j, and [·]_× is the antisymmetric matrix of a vector.
According to the weighted depth Z_r, the initial map is reconstructed with weighting and the three-dimensional coordinates of the feature points are recovered as:

$X_W = Z_r R_r X_r + t_r$

where X_W is the recovered three-dimensional coordinates of the feature point, Z_r is the weighted depth of view r, R_r is the global rotation of view r, X_r is the normalized image coordinates of the feature point in view r, and t_r is the global displacement of view r.
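A sketch of this weighted reconstruction, assuming NumPy; the per-view depth $Z_r^{(j)} = \|[X_j]_\times t_{r,j}\| / \|[X_j]_\times R_{r,j} X_r\|$ is an assumption of the sketch (a standard two-view depth consistent with the θ_{(r,j)} weighting), since the patent fixes only θ_{(r,j)} and the weights ω_{(r,j)}.

```python
# S7 sketch: theta-weighted depth and 3D point recovery.
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0.0]])

def weighted_depth(X_r, obs):
    """obs: list of (X_j, R_rj, t_rj) over all views j co-observing the point."""
    thetas, depths = [], []
    for X_j, R_rj, t_rj in obs:
        theta = np.linalg.norm(skew(X_j) @ (R_rj @ X_r))   # recovery quality
        Z_rj = np.linalg.norm(skew(X_j) @ t_rj) / max(theta, 1e-12)  # assumed
        thetas.append(theta)
        depths.append(Z_rj)
    w = np.array(thetas) / np.sum(thetas)                  # omega_{(r,j)}
    return float(w @ np.array(depths))                     # weighted depth Z_r

def recover_point(Z_r, R_r, X_r, t_r):
    # X_W = Z_r R_r X_r + t_r, per the text
    return Z_r * (R_r @ X_r) + t_r
```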
The multi-frame-based monocular SLAM robust initialization method provided by this embodiment has the following beneficial effects:
1. Compared with traditional systems that initialize from two frames of information, this embodiment, by introducing a global structure-from-motion technique into the monocular initialization system, makes fuller use of the video stream information when solving the initial pose and obtains a high-precision initial pose by averaging.
2. In triangulating the initial map, a weighted reconstruction technique comprehensively exploits multiple observations of each feature point, improving the accuracy of the initial map.
3. In the optimization, a nonlinear optimization strategy that takes only the initial poses as optimization variables reduces the number of scattered points in the initial map, further improving the accuracy of the initial map compared with the traditional nonlinear optimization strategy.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, or alternatives falling within the spirit and principles of the invention.

Claims (5)

1. A monocular SLAM robust initialization method based on multiple frames, characterized by comprising the following steps:
S1: extracting feature points from each image frame in an initial video stream, matching the image frames pairwise according to the feature points, screening out matching points, and obtaining initial matching point pairs, wherein the matching points are matched feature points;
specifically, extracting feature points from each image frame in the initial video stream, matching the image frames pairwise according to the feature points, screening the matching points through a random sample consensus algorithm, and obtaining the initial matching point pairs, wherein the matching points are matched feature points;
S2: screening out three-view pairs, namely triples of image frames with enough co-visible feature points, according to the initial matching point pairs, further screening the matching points within each three-view pair based on a trifocal-tensor random sample consensus algorithm, and constructing a three-frame matching graph, namely a topological graph describing the co-visibility relations among the image frames;
wherein the specific steps of screening the matching points within a three-view pair are as follows:
S2.1: setting a sample set P whose minimum sample number is n, extracting n samples from the sample set P to form a sample subset S, and calculating an initial trifocal tensor from the inter-view essential matrices as the initialization model M;
S2.2: obtaining the projection matrices P1, P2 and P3 of the three views from the initial trifocal tensor, calculating the feature point coordinates by least squares, and obtaining three estimates of each feature point through the projection matrices P1, P2 and P3 respectively, namely:

$\hat{x}_1 = P_1 X, \qquad \hat{x}_2 = P_2 X, \qquad \hat{x}_3 = P_3 X$

wherein $\hat{x}_1$, $\hat{x}_2$ and $\hat{x}_3$ represent the estimates of the feature point under the projection matrices P1, P2 and P3 respectively, and X represents the three-dimensional coordinates of the feature point;
calculating the reprojection error over the three views, namely:

$\omega = d^2(x_1, \hat{x}_1) + d^2(x_2, \hat{x}_2) + d^2(x_3, \hat{x}_3)$

wherein ω represents the reprojection error, x_1, x_2 and x_3 represent the measured values of the feature point in views 1, 2 and 3, and d²(·,·) represents the square of the Euclidean distance between two elements;
taking the reprojection error ω as the error measure of the initialization model M, and forming an inlier set S* from the sample subset S together with those samples of the sample set P whose error with respect to the initialization model M is smaller than a set threshold th;
S2.3: calculating a new model M by least squares from the inlier set S*;
S2.4: repeating S2.1, S2.2 and S2.3 until the maximum consensus set is obtained, removing the outliers, and recording the inliers and the trifocal tensor of the current iteration, wherein the inliers are the matching points;
S3: solving the relative rotation between the image frames according to the principle of two-view geometry;
S4: solving the global rotations by an iteratively reweighted least squares method according to the relative rotations between the image frames;
S5: solving the global displacements based on the linear constraint relating the global rotation of each image frame, the scene structure and the global displacement;
S6: combining the global rotations and global displacements to obtain an initial pose for each frame, and, starting from the initial pose, optimizing the pose of each frame with a pose-only nonlinear optimization adjustment strategy;
S7: calculating the depth of field of the feature points and recovering the three-dimensional coordinates of the feature points.
2. The monocular SLAM robust initialization method based on multiple frames according to claim 1, wherein in S5 the global displacement is solved based on the global rotation of each image frame and the linear constraint between the scene structure and the global displacement; the linear constraint is:

$B t_l + C t_i + D t_r = 0$

wherein

$B = [X_i]_\times R_{r,i} X_r \left([R_{r,l} X_r]_\times X_l\right)^T [X_l]_\times R_l$

$C = \left\|[X_l]_\times R_{r,l} X_r\right\|^2 [X_i]_\times R_i$

$D = -(B + C)$

$R_{r,i} = R_i R_r^T$

wherein t_l, t_i and t_r represent the global displacements of views l, i and r; X_l, X_i and X_r represent the normalized image coordinates of the feature point in views l, i and r; [·]_× represents the antisymmetric matrix corresponding to a vector; R_{r,i} and R_{r,l} represent the relative rotations between view r and views i and l; R_l, R_i and R_r represent the global rotations of views l, i and r; and T represents the transpose of a matrix;
considering all feature points and stacking all the linear constraints yields:

$F \cdot t = 0$

wherein F is the coefficient matrix composed of the blocks B, C and D, and $t = (t_1^T, t_2^T, \ldots, t_n^T)^T$ represents the global displacements of all n views;
solving this linear homogeneous equation yields the optimal value of the global displacement $\hat{t}$.
3. The monocular SLAM robust initialization method based on multiple frames according to claim 2, wherein in S6 the pose-only nonlinear optimization adjustment comprises the steps of:
S6.1: based on each three-view pair, calculating the reprojection vector, namely:

$\hat{x}_i = \dfrac{\hat{Z}_r R_{r,i} X_r + t_{r,i}}{e_3^T \left(\hat{Z}_r R_{r,i} X_r + t_{r,i}\right)}$

wherein $\hat{x}_i$ represents the reprojection vector, $e_3$ represents the vector $(0, 0, 1)^T$, $\hat{Z}_r$ is the depth of field of the feature point calculated from the reference view, X_r represents the normalized image coordinates of the feature point in view r, and t_{r,i} represents the relative displacement between views r and i, expressed as:

$t_{r,i} = R_i (t_r - t_i)$

wherein R_i represents the global rotation of view i, t_r represents the global displacement of view r, and t_i represents the global displacement of view i;
S6.2: calculating and summing the reprojection errors of all feature points to obtain an error term ε, whose expression is:

$\varepsilon = \sum_i \left(\hat{x}_i - x_i\right)^T \left(\hat{x}_i - x_i\right)$

wherein $\hat{x}_i$ represents the reprojection vector, x_i represents the measured value of the feature point in view i, and T represents the transpose of a matrix;
S6.3: optimizing through a general graph optimization library, taking the pose of each frame as a node of the graph optimization and the reprojection error of each feature point as an edge of the graph optimization.
4. The monocular SLAM robust initialization method based on multiple frames according to claim 3, wherein in S7, θ_{(r,j)} is taken as the criterion of feature point recovery quality, and the weighted depth Z_r of view r is calculated according to θ_{(r,j)}; the expression of the weighted depth Z_r is:

$Z_r = \sum_{1 \le j \le n} \omega_{(r,j)} Z_r^{(j)}$

$\omega_{(r,j)} = \theta_{(r,j)} \Big/ \sum_{1 \le j \le n} \theta_{(r,j)}$

$\theta_{(r,j)} = \left\|[X_j]_\times R_{r,j} X_r\right\|$

wherein j represents the j-th view, $Z_r^{(j)}$ represents the depth of field calculated with view j, ω_{(r,j)} represents the weight, θ_{(r,j)} represents the feature point recovery quality, R_{r,j} represents the relative rotation between views r and j, X_r represents the normalized image coordinates of the feature point in view r, X_j represents the coordinates of the feature point in view j, and [·]_× represents the antisymmetric matrix of a vector;
according to the weighted depth Z_r, the initial map is reconstructed with weighting and the three-dimensional coordinates of the feature points are recovered, the recovered three-dimensional coordinates being expressed as:

$X_W = Z_r R_r X_r + t_r$

wherein X_W represents the recovered three-dimensional coordinates of the feature point, Z_r represents the weighted depth of view r, R_r represents the global rotation of view r, X_r represents the normalized image coordinates of the feature point in view r, and t_r represents the global displacement of view r.
5. The monocular SLAM robust initialization method based on multiple frames according to claim 3, further comprising, in S6.3, improving the robustness of the optimizer by setting a kernel function.

Priority Applications (1)

CN202111499604.1A, priority date 2021-12-09, filed 2021-12-09: Monocular SLAM robust initialization method based on multiframe


Publications (2)

CN114399547A: 2022-04-26
CN114399547B: 2024-01-02

Family

ID: 81227588

Family Applications (1)

CN202111499604.1A (granted), filed 2021-12-09: Monocular SLAM robust initialization method based on multiframe, granted as CN114399547B

Country Status (1)

CN: CN114399547B

Families Citing this family (1)

CN117314735B (priority 2023-09-26, granted 2024-04-05), 长光辰英（杭州）科学仪器有限公司: Global optimization coordinate mapping conversion method based on minimized reprojection error

Citations (2)

CN108090958A (priority 2017-12-06, published 2018-05-29), 上海阅面网络科技有限公司: Robot simultaneous localization and map construction method and system
WO2020168668A1 (priority 2019-02-22, published 2020-08-27), 广州小鹏汽车科技有限公司: SLAM mapping method and system for vehicle

Family Cites Families (2)

US9286810B2 (priority 2010-09-24, published 2016-03-15), iRobot Corporation: Systems and methods for VSLAM optimization
CN110226184B (priority 2016-12-27, published 2023-07-14), 杰拉德·迪尔克·施密茨: System and method for machine perception


Non-Patent Citations (1)

刘勇: 基于影像的运动平台自定位测姿 (Image-based self-localization and attitude measurement for a moving platform), China Doctoral Dissertations Full-text Database, Basic Sciences series, full text.



Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant