CN112734839A - Monocular vision SLAM initialization method for improving robustness - Google Patents

Monocular vision SLAM initialization method for improving robustness

Info

Publication number
CN112734839A
CN112734839A (application CN202011633399.9A)
Authority
CN
China
Prior art keywords
line
dimensional
image
point
key frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011633399.9A
Other languages
Chinese (zh)
Other versions
CN112734839B (en)
Inventor
项志宇
钱琦
白庭铭
闭云峰
钟文祺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Zhejiang University ZJU
Original Assignee
Huawei Technologies Co Ltd
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd, Zhejiang University ZJU filed Critical Huawei Technologies Co Ltd
Priority to CN202011633399.9A priority Critical patent/CN112734839B/en
Publication of CN112734839A publication Critical patent/CN112734839A/en
Application granted granted Critical
Publication of CN112734839B publication Critical patent/CN112734839B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C 21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C 21/28 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C 21/30 Map- or contour-matching
    • G01C 21/32 Structuring or formatting of map data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30244 Camera pose
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30248 Vehicle exterior or interior
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular vision SLAM initialization method with improved robustness. A sliding window is established, and several consecutive frame images are pre-added as initial key frames, with the first frame serving as the reference frame. Point features and line features are extracted from each key frame, and feature matching is performed between every pair of key frames to obtain a point feature matching pair set and a line feature matching pair set. For every pair of key frames, the relative pose between the two key frames is calculated using the feature matching pair sets, and three-dimensional points and three-dimensional straight lines are obtained by further processing; all key frames, three-dimensional points, and three-dimensional straight lines form an initial map. More accurate results for the pose of each key frame and for the three-dimensional coordinates of the three-dimensional points and three-dimensional straight lines are then obtained through global BA optimization, establishing an accurate initial map. The invention effectively improves the robustness and accuracy of monocular vision SLAM initialization, can quickly complete initialization in challenging scenes while ensuring accuracy, and has strong real-time performance.

Description

Monocular vision SLAM initialization method for improving robustness
Technical Field
The invention relates to navigation and positioning techniques for intelligent vehicles, and in particular to a monocular vision SLAM initialization method with improved robustness.
Background
Visual SLAM (simultaneous localization and mapping) is an important direction in the field of computer vision. It is widely used in unmanned driving, robotics, and AR/VR, and is receiving more and more attention. Monocular SLAM is the most important category, not only because a single camera is inexpensive, but also because monocular SLAM is applicable in a wider range of scenarios. However, monocular SLAM also has drawbacks, such as scale uncertainty and the accumulation of drift errors. In addition, monocular SLAM is harder to initialize than SLAM with stereo or depth sensors, and initialization failure or low initialization accuracy occurs more easily.
Initialization is a very important module in monocular SLAM. It mainly solves two problems: first, setting the initial scale; second, creating an initial map for subsequent tracking and optimization. The performance of the whole SLAM system is therefore closely related to the accuracy and robustness of initialization. At present, point-feature-based SLAM initialization uses the information of the first two frames of a sequence: features are extracted from the two frames and matched. Some visual odometry methods assume that a planar scene is observed during initialization and initialize the pose from a homography matrix, while other methods use an essential matrix for more general scenes. If no two frames can be found that effectively construct the initial map and three-dimensional points, the SLAM system cannot start running. This typically occurs in low-parallax scenes, which can significantly increase the time needed to initialize the SLAM system, or even make initialization impossible.
In addition, current feature-based monocular SLAM systems only use point features during initialization and do not involve line features. However, in some weakly textured structured scenes there are more line features than point features, and if the essential matrix and the homography matrix are computed from point features alone, the number of matched features is too small and initialization fails.
Disclosure of Invention
In order to solve the problems in the background art, the invention aims to provide a robust monocular vision SLAM initialization method suitable for all visual positioning applications, such as unmanned vehicles and robots.
On the basis of traditional monocular vision SLAM, a sliding window is used to select several key frames for initialization, and line feature matching is used for optimization when computing the initial pose between every two key frames. BA optimization is then performed on the whole initial map: the accurate poses of the key frames in the initial map and the accurate three-dimensional coordinates of the three-dimensional points and three-dimensional straight lines are computed by simultaneously minimizing the reprojection errors of the three-dimensional points and three-dimensional straight lines.
The technical scheme adopted by the invention comprises the following steps:
(1) Establish a sliding window and pre-add a fixed number of consecutive frame images as initial key frames; all frames in the sliding window are key frames, and the first key frame is the reference frame. The sliding window is maintained over all key frames, one image per frame.
(2) Extracting point characteristics and line characteristics from each key frame, and performing characteristic matching between every two key frames to obtain a point characteristic matching pair set and a line characteristic matching pair set;
(3) for every two key frames, calculating the relative pose between the two key frames by using a feature matching pair set obtained by matching, processing to obtain three-dimensional coordinates of three-dimensional points and three-dimensional straight lines corresponding to point features and line features in a space under a reference frame coordinate system, and forming an initial map by all the key frames, the three-dimensional points and the three-dimensional straight lines;
The reference frame coordinate system is a three-dimensional coordinate system whose origin is the camera optical center of the first key frame in the sliding window.
(4) For all key frames, three-dimensional points, and three-dimensional straight lines contained in the initial map, more accurate results for the pose of each key frame in the reference frame coordinate system and for the three-dimensional coordinates of the three-dimensional points and three-dimensional straight lines in the reference frame coordinate system are obtained through global BA (bundle adjustment) optimization; an accurate initial map is established and the monocular vision SLAM initialization process is completed. The pose is represented by a rotation matrix and a translation matrix.
The method extracts line features between the images of every two key frames and, when line features can be matched, optimizes the initial rotation matrix and translation matrix between the two frames by maximizing the overlap length of the line feature matching pairs. The sliding window increases the number of frames used for initialization, and by exploiting the information of multiple frames and their mutual constraints, global BA optimization makes the initial map and the estimated camera key frame poses more accurate.
In step (1), when consecutive frame images are added to the sliding window, the number of point features and line features that can be extracted from each frame image is counted; if the number reaches a preset threshold, the frame image is added to the sliding window; otherwise, the frame image is discarded and not added to the sliding window.
In a specific implementation, if 6 newly acquired consecutive frames are all discarded, all existing key frames in the sliding window are deleted and the sliding window is re-initialized.
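Purely as an illustration (the patent contains no code), the keyframe admission and re-initialization logic described above could be organized as in the Python sketch below; the class name, the feature-count thresholds, and the window size are assumptions of this sketch, not values required by the method.

```python
class SlidingWindowInitializer:
    """Sliding-window keyframe admission (illustrative sketch, not the patent's code)."""

    def __init__(self, min_points=100, min_lines=30, max_rejects=6):
        self.window = []               # accepted keyframes
        self.min_points = min_points   # feature-count thresholds (assumed values)
        self.min_lines = min_lines
        self.max_rejects = max_rejects # 6 consecutive rejections trigger a restart
        self.rejects = 0

    def add_frame(self, image, point_features, line_features):
        """Admit the frame as a keyframe only if enough features were extracted."""
        if len(point_features) >= self.min_points and len(line_features) >= self.min_lines:
            self.window.append((image, point_features, line_features))
            self.rejects = 0
            return True
        self.rejects += 1
        if self.rejects >= self.max_rejects:
            self.window.clear()        # delete all existing keyframes, re-initialize
            self.rejects = 0
        return False
```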
In step (2), the point features extracted from the key frame images are ORB features, and the line features are extracted with an LSD detector. Point features and line features are two different types of features, and each frame generally contains many features of each type. Both types are matched in the same way, as follows:
Every two key frames in the sliding window form a key frame pair. For each key frame pair, feature matching is performed by computing the descriptor distance between features and judging as follows:
if two features of the same type in the two key frames are each other's nearest neighbor in descriptor distance, the match succeeds and the two features form a feature matching pair;
if a feature and its candidate are not each other's nearest neighbor, the pair is skipped and not processed;
here, the nearest feature of a given feature is the feature, among all features of the same type in the other key frame, whose descriptor distance to it is smallest.
By continuously traversing in this way, the feature matching pairs of all point features are found and form the point feature matching pair set, and the feature matching pairs of all line features form the line feature matching pair set.
When point features are matched, the results are point feature matching pairs; when line features are matched, the results are line feature matching pairs.
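For illustration only, the mutual nearest-neighbor rule above can be prototyped with OpenCV's brute-force matcher; the function name and the choice of Hamming distance (suitable for binary descriptors such as ORB or LBD) are assumptions of this sketch.

```python
import cv2

def match_features(desc1, desc2, norm=cv2.NORM_HAMMING):
    """Mutual nearest-neighbor matching of binary descriptors between two keyframes.

    crossCheck=True keeps a pair only when the two descriptors are each
    other's nearest neighbor, matching the rule described above.
    """
    matcher = cv2.BFMatcher(norm, crossCheck=True)
    matches = matcher.match(desc1, desc2)
    # Sort by descriptor distance so the best matches come first.
    return sorted(matches, key=lambda m: m.distance)
```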
In step (3), the pose between the two key frames and the three-dimensional coordinates of the features are calculated as follows:
(3.1) an essential matrix or a homography matrix is calculated from the matched point feature pair set, and the rotation matrix R and translation matrix t between the two key frames are then recovered from the essential matrix or the homography matrix;
In step (3.1), the essential matrix E and the homography matrix are both solved through the epipolar geometric constraint of the point features; the solving methods are the same, and the essential matrix E is taken as an example, specifically as follows:
In a point feature matching pair between two key frames, the pair consists of an image point p1 in the image of the first key frame and an image point p2 in the image of the second key frame, which are the projections of the same three-dimensional point in space into the two images. According to the pinhole camera model, the epipolar constraint of image points p1 and p2 with respect to that three-dimensional point is established by the following formulas:

$$s_1 p_1 = K P$$
$$s_2 p_2 = K (R P + t)$$
$$s_2 K^{-1} p_2 = R\, s_1 K^{-1} p_1 + t$$
$$p_2^{T} K^{-T} [t]_{\times} R K^{-1} p_1 = p_2^{T} K^{-T} E K^{-1} p_1 = 0, \qquad E = [t]_{\times} R$$

where p1 is the homogeneous pixel coordinate of the image point of the three-dimensional point in the image of the first of the two key frames, p2 is the homogeneous pixel coordinate of its image point in the image of the second key frame, K is the camera intrinsic matrix, P is the three-dimensional coordinate of the three-dimensional point, R is the rotation matrix between the two key frames, t is the translation matrix between the two key frames, s1 is the depth of the three-dimensional point in the first key frame, s2 is its depth in the second key frame, E is the essential matrix, $[t]_{\times}$ is the antisymmetric matrix of t, and T denotes matrix transposition.
Each point feature matching pair yields one equation in the essential matrix E, and E has 8 degree-of-freedom parameters, so all parameters of E can be found by combining the eight equations given by eight point feature matching pairs. Because errors exist in point feature detection and matching, RANSAC (random sample consensus) is used to select the optimal eight point feature matching pairs from the point feature matching pair set to calculate the essential matrix E.
In step (3.2), the rotation matrix R and the translation matrix t are solved from the essential matrix or the homography matrix through singular value decomposition.
The rotation matrix R and the translation matrix t are the pose parameters of the camera.
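As an illustration only (not part of the claimed method), the essential-matrix estimation with RANSAC and the subsequent recovery of R and t can be prototyped with OpenCV; the function name and the use of cv2.findEssentialMat / cv2.recoverPose are assumptions of this sketch.

```python
import cv2
import numpy as np

def estimate_pose_from_points(pts1, pts2, K):
    """Estimate R, t between two keyframes from matched point features.

    pts1, pts2: (N, 2) arrays of matched pixel coordinates.
    K: 3x3 camera intrinsic matrix.
    """
    # Essential matrix from the matched point set, with RANSAC rejecting outliers.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # Decompose E (internally via SVD) and keep the (R, t) giving positive depths.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t, inliers
```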
(3.2) The rotation matrix R and translation matrix t obtained in step (3.1) are optimized using the matched line feature set, specifically as follows:
(3.2.1) In a line feature matching pair between two key frames, the two end points of the line feature detected in the image of the first key frame are each mapped, through the rotation matrix R and the translation matrix t, to an epipolar line in the image of the second key frame; the direction vectors of the epipolar lines are computed with the following formulas:

$$l'_s = K^{-T} [t]_{\times} R K^{-1} \tilde{s}_1$$
$$l'_e = K^{-T} [t]_{\times} R K^{-1} \tilde{e}_1$$

where $[t]_{\times}$ denotes the antisymmetric matrix of the translation matrix t, R is the rotation matrix, $\tilde{s}_1$ and $\tilde{e}_1$ are the homogeneous coordinates of the two end points of the line feature detected on the image of the first key frame, K is the camera intrinsic matrix, and l's and l'e are the direction vectors of the two epipolar lines in the image of the second key frame;
Both epipolar lines pass through the epipole in the image of the second key frame, and their positions in the image are determined by the direction vectors and the epipole; the epipole is the intersection point, on the imaging plane of the key frame image, of the line connecting the camera optical centers of the two key frames.
The two epipolar lines intersect the straight line on which the line feature in the image of the second key frame lies; the line segment between the two intersection points on that straight line is taken as the projection line segment, on the second key frame, of the line feature of the first key frame.
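For illustration (not taken from the patent text), the following numpy sketch computes the projected segment described in step (3.2.1); the fundamental-matrix form F = K^{-T}[t]xRK^{-1} and all function and variable names are assumptions of this sketch.

```python
import numpy as np

def skew(v):
    """Antisymmetric (cross-product) matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def project_line_endpoints(s1, e1, line2_s, line2_e, R, t, K):
    """Project the endpoints of a line feature from keyframe 1 onto the straight
    line containing its matched line feature in keyframe 2.

    s1, e1: pixel endpoints of the line feature in image 1.
    line2_s, line2_e: pixel endpoints of the matched line feature in image 2.
    Returns the two intersection points s1p, e1p in image 2 (pixels).
    """
    Kinv = np.linalg.inv(K)
    F = Kinv.T @ skew(t) @ R @ Kinv                              # fundamental matrix
    line2 = np.cross(np.append(line2_s, 1.0), np.append(line2_e, 1.0))  # line in image 2
    out = []
    for p in (s1, e1):
        epi = F @ np.append(p, 1.0)      # epipolar line of p in image 2
        x = np.cross(epi, line2)         # homogeneous intersection of the two lines
        out.append(x[:2] / x[2])         # back to pixel coordinates
    return out[0], out[1]
```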
(3.2.2) If the projected line segment and the line feature in the image of the second key frame have an overlapping portion, the length of that overlapping portion is taken as the overlap length l^o_2 of the current line feature matching pair on the image of the second key frame; if the projected line segment and the line feature in the image of the second key frame do not overlap, the overlap length l^o_2 of the current line feature matching pair on the second key frame is instead computed from the positions of the two non-overlapping segments.
Here s_2 and e_2 are the pixel coordinates of the start and end points of the line feature in the image of the second key frame, s'_1 is the pixel coordinate of the intersection of the epipolar line corresponding to the start point of the line feature in the image of the first key frame with the straight line on which the line feature in the image of the second key frame lies, and e'_1 is the pixel coordinate of the intersection of the epipolar line corresponding to the end point of the line feature in the image of the first key frame with that straight line.
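A minimal sketch of the overlap computation between the two collinear segments, assuming (as is common in overlap-maximization formulations, though not stated explicitly in the patent text) that a negative gap is returned when the segments do not overlap:

```python
import numpy as np

def overlap_length(s2, e2, s1p, e1p):
    """Overlap length between the detected segment (s2, e2) and the projected
    segment (s1p, e1p), both lying on the same image line (all numpy 2-vectors).

    Returns the length of the overlapping portion when the segments overlap;
    a negative gap is returned otherwise (an assumed convention).
    """
    d = e2 - s2
    d = d / np.linalg.norm(d)                                    # unit direction of the line
    # 1-D coordinates of all four endpoints along the line, measured from s2
    a = np.sort([0.0, np.dot(e2 - s2, d)])
    b = np.sort([np.dot(s1p - s2, d), np.dot(e1p - s2, d)])
    return min(a[1], b[1]) - max(a[0], b[0])                     # >0: overlap, <0: gap
```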
(3.2.3) Steps (3.2.1)-(3.2.2) are repeated to obtain, for each pair of line feature matching pairs between the two key frames, the overlap lengths l^o_1 and l^o_2 on the two key frames.
(3.2.4) From the overlap lengths, a cost function C is established over the line feature matching pair set consisting of n pairs of line feature matching pairs; its terms are smaller the closer the overlap lengths l^o_{i1} and l^o_{i2} are to the corresponding line feature lengths l_i and l'_i. Because the overlap lengths depend on s'_1 and e'_1, and the coordinates of s'_1 and e'_1 in turn depend on the rotation matrix R and the translation matrix t, a more accurate rotation matrix R and translation matrix t are obtained by minimizing this cost function:

$$(R, t) = \arg\min_{R,\, t} C$$

where l^o_{i1} is the overlap length of the i-th line feature matching pair on the first key frame, l^o_{i2} is its overlap length on the second key frame, l_i is the length of the line feature of the first key frame in the i-th pair, l'_i is the length of the line feature of the second key frame in the i-th pair, and n is the total number of line feature matching pairs between the two key frames.
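A rough, illustrative sketch of the pose refinement in step (3.2.4), built on the two helper functions sketched earlier. The normalized overlap penalty, the one-sided use of the second keyframe only, the rotation-vector parameterization, and the derivative-free optimizer are all assumptions of this sketch; in practice the translation is typically kept at unit norm because its scale is unobservable from two views.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def refine_pose_with_lines(R0, t0, line_matches, K):
    """Refine the initial (R0, t0) by maximizing line-feature overlap (sketch only).

    line_matches: list of (s1, e1, s2, e2) matched segment endpoints (pixels, numpy arrays).
    Uses project_line_endpoints() and overlap_length() from the earlier sketches.
    """
    def cost(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        t = x[3:]
        c = 0.0
        for s1, e1, s2, e2 in line_matches:
            s1p, e1p = project_line_endpoints(s1, e1, s2, e2, R, t, K)
            l2 = np.linalg.norm(e2 - s2)
            ov = overlap_length(s2, e2, s1p, e1p)
            c += ((l2 - ov) / l2) ** 2          # assumed normalized overlap penalty
        return c

    x0 = np.concatenate([Rotation.from_matrix(R0).as_rotvec(), np.asarray(t0, float).ravel()])
    res = minimize(cost, x0, method="Nelder-Mead")
    return Rotation.from_rotvec(res.x[:3]).as_matrix(), res.x[3:]
```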
(3.3) The three-dimensional coordinates of the point features and line features are reconstructed using the solved pose between the two key frames; specifically, the following method is adopted:
(3.3.1) For a point feature matching pair, the homogeneous pixel coordinates $\tilde{p}_1$ of the image point in the first key frame and $\tilde{p}_2$ of the image point in the second key frame are known, as are the rotation matrix R and translation matrix t between the two key frames. The depth s_1, in the first key frame coordinate system, of the three-dimensional point in space corresponding to the point feature matching pair is obtained through triangulation from the constraint

$$s_1 \, [K^{-1}\tilde{p}_2]_{\times} \, R \, K^{-1}\tilde{p}_1 + [K^{-1}\tilde{p}_2]_{\times} \, t = 0$$

where $[K^{-1}\tilde{p}_2]_{\times}$ is the antisymmetric matrix of $K^{-1}\tilde{p}_2$.
Solving this overdetermined system of equations gives the least-squares solution s_1; the three-dimensional coordinate P_1 of the three-dimensional point in the first key frame coordinate system is then obtained by

$$P_1 = s_1 K^{-1} \tilde{p}_1$$

where K is the camera intrinsic matrix.
The coordinates of the three-dimensional point are then converted into the reference frame coordinate system using the rotation matrix R and translation matrix t between the first key frame and the reference frame.
(3.3.2) repeating the step (3.3.1) to obtain the coordinates of the corresponding three-dimensional points of all the point feature matching pairs in the space under a reference frame coordinate system;
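A small numpy sketch of the least-squares triangulation in step (3.3.1); the helper name and the use of np.linalg.lstsq are choices of this sketch, not prescribed by the patent.

```python
import numpy as np

def triangulate_depth(p1, p2, R, t, K):
    """Least-squares depth s1 of the point seen at pixel p1 (frame 1) and p2 (frame 2).

    Solves  s1 * [x2]_x R x1 = -[x2]_x t  in the least-squares sense,
    where x = K^{-1} [u, v, 1]^T are normalized coordinates.
    """
    Kinv = np.linalg.inv(K)
    x1 = Kinv @ np.append(p1, 1.0)
    x2 = Kinv @ np.append(p2, 1.0)
    x2_skew = np.array([[0, -x2[2], x2[1]],
                        [x2[2], 0, -x2[0]],
                        [-x2[1], x2[0], 0]])
    A = (x2_skew @ R @ x1).reshape(3, 1)          # coefficient of the unknown s1
    b = -(x2_skew @ np.asarray(t, float).ravel())
    s1, *_ = np.linalg.lstsq(A, b, rcond=None)    # overdetermined 3x1 system
    P1 = s1[0] * x1                               # 3-D point in the first keyframe frame
    return s1[0], P1
```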
(3.3.3) For a line feature matching pair, a first plane is determined by the camera origin of the first key frame and the two end points of the line feature detected in the image of the first key frame, and a second plane is determined by the camera origin of the second key frame and the two end points of the line feature detected in the image of the second key frame. Using the rotation matrix R and translation matrix t between the two key frames, the plane equation of the second plane is expressed in the first key frame coordinate system, and the three-dimensional coordinates, in the first key frame coordinate system, of the three-dimensional straight line corresponding to the line feature matching pair are obtained from the intersection of the two planes;
(3.3.4) repeating the step (3.3.3) to obtain the three-dimensional coordinates of the corresponding three-dimensional straight lines in the space of all the line feature matching pairs in the reference frame coordinate system.
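For illustration, a numpy sketch of the two-plane intersection in step (3.3.3), assuming the convention X_2 = R X_1 + t used in the formulas above; the (point, direction) line representation and all names are assumptions of this sketch.

```python
import numpy as np

def triangulate_line(s1, e1, s2, e2, R, t, K):
    """3-D line from a matched line feature, as the intersection of two back-projected planes.

    s1, e1 / s2, e2: pixel endpoints of the line feature in keyframe 1 / keyframe 2.
    Returns the line as a (point, unit direction) pair in the first keyframe frame.
    """
    Kinv = np.linalg.inv(K)
    t = np.asarray(t, float).ravel()
    # Plane 1: through the origin of frame 1 and the two back-projected endpoint rays.
    n1 = np.cross(Kinv @ np.append(s1, 1.0), Kinv @ np.append(e1, 1.0))
    pi1 = np.append(n1, 0.0)                      # n.X + d = 0 with d = 0 (through origin)
    # Plane 2 defined in frame 2, then expressed in frame 1 (X_2 = R X_1 + t).
    n2_f2 = np.cross(Kinv @ np.append(s2, 1.0), Kinv @ np.append(e2, 1.0))
    n2 = R.T @ n2_f2                              # normal rotated into frame 1
    pi2 = np.append(n2, n2_f2 @ t)
    # Intersection line of the two planes.
    direction = np.cross(pi1[:3], pi2[:3])
    A = np.vstack([pi1[:3], pi2[:3]])
    b = -np.array([pi1[3], pi2[3]])
    point, *_ = np.linalg.lstsq(A, b, rcond=None) # one point satisfying both plane equations
    return point, direction / np.linalg.norm(direction)
```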
(3.4) All key frames, three-dimensional points, and three-dimensional straight lines form the initial map.
In step (4):
(4.1) For the initial map, the reprojection error of each three-dimensional point and each three-dimensional straight line in the initial map is obtained on all key frames in which it was observed:
(4.1A) The reprojection error of a three-dimensional point is calculated by the following formula:

$$e^{p}_{ij} = u_{ij} - \hat{u}_{ij}$$

where u_{ij} is the pixel coordinate of the point feature of the j-th three-dimensional point detected on the image of the i-th key frame, and $\hat{u}_{ij}$ is the pixel coordinate of the reprojected point of the j-th three-dimensional point on the image of the i-th key frame.
The reprojected point of the j-th three-dimensional point on the image of the i-th key frame is the pixel point obtained by converting the three-dimensional coordinate of the j-th three-dimensional point from the reference frame coordinate system into the i-th key frame coordinate system, through the rotation matrix R and translation matrix t between the reference frame and the i-th key frame, and projecting it onto the two-dimensional image of the i-th key frame.
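A minimal sketch of the point reprojection error, assuming the pose convention P_i = R_i P_ref + t_i implied by the description; names are illustrative.

```python
import numpy as np

def point_reprojection_error(P_ref, u_obs, R_i, t_i, K):
    """Reprojection error of a 3-D point on keyframe i.

    P_ref: 3-D point in the reference frame coordinate system.
    u_obs: detected pixel coordinate of the matching point feature in keyframe i.
    R_i, t_i: pose of keyframe i relative to the reference frame.
    """
    P_i = R_i @ P_ref + t_i            # point in keyframe-i coordinates
    uvw = K @ P_i                      # pinhole projection
    u_hat = uvw[:2] / uvw[2]           # reprojected pixel
    return u_obs - u_hat
```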
(4.1B) The reprojection error of a three-dimensional straight line is defined as the distances from the two end points of the line feature detected in the image of the key frame to the reprojected line, in that key frame, of the three-dimensional straight line corresponding to the line feature; it is calculated by the following formula:

$$e^{l}_{ik} = \begin{bmatrix} d(\tilde{s}_{ik},\, l'_{ik}) \\ d(\tilde{e}_{ik},\, l'_{ik}) \end{bmatrix}, \qquad d(x,\, l'_{ik}) = \frac{x^{T} l'_{ik}}{\sqrt{l_1^2 + l_2^2}}$$

where $\tilde{s}_{ik}$ and $\tilde{e}_{ik}$ are the homogeneous pixel coordinates of the end points of the line feature of the k-th three-dimensional straight line detected on the image of the i-th key frame, l'_{ik} is the normal (coefficient) vector of the reprojected line of the k-th three-dimensional straight line on the image of the i-th key frame, and l_1 and l_2 are the first and second parameters of l'_{ik}.
The reprojected line of the k-th three-dimensional straight line on the image of the i-th key frame is the straight line obtained by converting the three-dimensional coordinates of the k-th three-dimensional straight line from the reference frame coordinate system into the i-th key frame coordinate system, through the rotation matrix R and translation matrix t between the reference frame and the i-th key frame, and projecting it onto the two-dimensional image of the i-th key frame.
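A small sketch of the line reprojection error above, given the reprojected line in the coefficient form l1*x + l2*y + l3 = 0; the function name is illustrative.

```python
import numpy as np

def line_reprojection_error(s_obs, e_obs, line_coeffs):
    """Reprojection error of a 3-D line on a keyframe.

    s_obs, e_obs: pixel endpoints of the detected line feature.
    line_coeffs: (l1, l2, l3) coefficients of the reprojected line l1*x + l2*y + l3 = 0.
    """
    l1, l2, l3 = line_coeffs
    norm = np.hypot(l1, l2)
    d_s = (l1 * s_obs[0] + l2 * s_obs[1] + l3) / norm   # signed distance of the start point
    d_e = (l1 * e_obs[0] + l2 * e_obs[1] + l3) / norm   # signed distance of the end point
    return np.array([d_s, d_e])
```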
(4.2) The following global cost function C for global BA optimization is established in the first frame (reference frame) coordinate system:

$$C = \sum_{i}\sum_{j} \rho\!\left( (e^{p}_{ij})^{T}\, \Omega_{ij}\, e^{p}_{ij} \right) + \sum_{i}\sum_{k} \rho\!\left( (e^{l}_{ik})^{T}\, \Omega_{ik}\, e^{l}_{ik} \right)$$

where $e^{p}_{ij}$ is the reprojection error of the j-th three-dimensional point on the image of the i-th key frame, $e^{l}_{ik}$ is the reprojection error of the k-th three-dimensional straight line on the image of the i-th key frame, $\Omega_{ij}$ is the information matrix of $e^{p}_{ij}$, $\Omega_{ik}$ is the information matrix of $e^{l}_{ik}$, and $\rho$ is a robust kernel function.
The value of a reprojection error depends on the coordinates of the reprojected point or reprojected line, which in turn depend on the three-dimensional coordinates of the three-dimensional points and three-dimensional straight lines in the reference frame coordinate system and on the poses of the key frames in the reference frame coordinate system. The pose of each key frame in the reference frame coordinate system and the three-dimensional coordinates of each three-dimensional point and each three-dimensional straight line in the reference frame coordinate system are therefore obtained by minimizing the global cost function C.
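For illustration, the sketch below only assembles the robust global cost from precomputed point and line residuals; the Huber kernel is an assumed choice (the patent only requires a robust kernel), and in practice the minimization itself would be carried out by a graph optimizer such as g2o or Ceres that recomputes the residuals from the current poses and 3-D features.

```python
import numpy as np

def huber(r2, delta=1.0):
    """Huber robust kernel applied to a squared, information-weighted error r2."""
    return r2 if r2 <= delta ** 2 else 2 * delta * np.sqrt(r2) - delta ** 2

def global_ba_cost(point_obs, line_obs):
    """Assemble the global BA cost C.

    point_obs: iterable of (e_p, Omega_p), e_p the 2-vector point reprojection error.
    line_obs:  iterable of (e_l, Omega_l), e_l the 2-vector line reprojection error.
    """
    C = 0.0
    for e, omega in point_obs:
        C += huber(float(e @ omega @ e))
    for e, omega in line_obs:
        C += huber(float(e @ omega @ e))
    return C
```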
The invention adopts a sliding-window form, initializes the map using the information of multiple frames, and optimizes the initial pose between frames using line features.
The invention optimizes the initial map and completes initialization using multi-frame information and BA optimization. In some weakly textured scenes, to address the problem of too few point features at initialization time, line features are first used to optimize the initial pose between two frames: the initial rotation and translation between the two frames are optimized by maximizing the overlap length of the projected line features. After the number of key frames in the sliding window reaches a preset threshold, the whole initial map is optimized by minimizing the reprojection errors of the three-dimensional features, which improves the robustness and accuracy of monocular SLAM initialization and lays a foundation for subsequent tracking and mapping.
Compared with the background art, the invention has the following beneficial effects:
(1) The invention improves the robustness of monocular SLAM initialization and can quickly complete initialization in low-parallax scenes;
(2) The method uses line feature matching pairs to optimize the initial pose between frames, so the initialization process can be completed in weakly textured scenes;
(3) The global BA optimization of the initial map improves the initialization accuracy of the monocular SLAM system.
In summary, compared with the traditional monocular SLAM initialization method, the method uses multi-frame information and incorporates line features into the initial pose calculation, which improves the robustness and accuracy of monocular SLAM initialization. In particular, initialization can be completed quickly in challenging scenes such as low parallax and weak texture while accuracy is ensured and real-time performance is not reduced.
Drawings
FIG. 1 is an epipolar geometry schematic for point features;
FIG. 2 is an epipolar geometry schematic for line features;
FIG. 3 is a schematic diagram of the overlapping portion of a projected line feature matching pair;
FIG. 4 is a diagram of the co-visibility relationship between key frames and three-dimensional features.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The embodiment and the implementation process of the complete method according to the invention content are as follows:
(1) establishing a sliding window, adding a fixed number of continuous frame images as initial key frames in advance, wherein the frames in the sliding window are all key frames, and the first key frame is called a reference frame; a sliding window is maintained for all key frames.
(2) Extracting point characteristics and line characteristics from each key frame, and performing characteristic matching between every two key frames to obtain a point characteristic matching pair set and a line characteristic matching pair set;
(3) for every two key frames, calculating the relative pose between the two key frames by using a feature matching pair set obtained by matching, processing to obtain three-dimensional coordinates of three-dimensional points and three-dimensional straight lines corresponding to point features and line features in a space under a reference frame coordinate system, and forming an initial map by all the key frames, the three-dimensional points and the three-dimensional straight lines; the pose between two key frames is calculated in two steps:
(3.1) The essential matrix or the homography matrix is calculated from the point feature matching pair set, and the rotation matrix and translation matrix are then recovered from it; the essential matrix and the homography matrix are solved through the epipolar geometric constraint of point features, as shown in FIG. 1. The optimal eight point feature matching pairs in the point feature matching pair set are selected with RANSAC (random sample consensus) to calculate the essential matrix E, the optimal four non-collinear point feature matching pairs are selected in the same way to calculate the homography matrix H, and the rotation matrix and translation matrix are then calculated through singular value decomposition;
(3.2) The obtained R and t are optimized using the matched line features, specifically as follows:
The two end points s_1 and e_1 of the line feature detected on the image of the first key frame are converted, through the rotation matrix R and translation matrix t between the two key frames, into two epipolar lines l'_s and l'_e in the image of the second key frame, as shown in FIG. 2. Intersecting the two epipolar lines with the straight line on the image of the second key frame gives s'_1 and e'_1, and the line segment between these two intersection points on the straight line containing the line feature of the second key frame is taken as the projection line segment, on the image of the second key frame, of the line feature on the image of the first key frame. If the projection line segment and the line feature in the image of the second key frame have an overlapping portion, the length of that overlapping portion is taken as the overlap length l^o_2 of the current line feature matching pair on the image of the second key frame; otherwise the overlap length l^o_2 on the second key frame is computed from the positions of the two non-overlapping segments. Then, for the n pairs of line feature matching pairs, the pose parameters R and t between the two key frames are optimized by minimizing the cost function;
(4) All key frames, three-dimensional points, and three-dimensional straight lines contained in the initial map are optimized by global bundle adjustment (BA) to obtain more accurate results for the poses of the key frames in the reference frame coordinate system and for the three-dimensional points and three-dimensional straight lines in the reference frame coordinate system; an accurate initial map is established and the monocular vision SLAM initialization process is completed.
The initialization method of the invention is described below with a simple example, assuming the sliding window is set to 3 frames. First, every two key frames are matched using ORB point features, and the essential matrix and homography matrix between every two key frames are roughly calculated with the eight-point method; then line features are extracted with LSD (Line Segment Detector) and matched with LBD (Line Band Descriptor) descriptors, the initial pose is optimized with the line feature matching pair set, and the three-dimensional coordinates of the three-dimensional points and three-dimensional lines in space corresponding to the point features and line features are reconstructed. Taking the camera center of the first key frame as the origin, the features are classified into four classes according to the co-visibility relationship between key frames and three-dimensional features, as shown in FIG. 4. F0 features are observed only by the first and second key frames and help optimize the pose of the second key frame; F1 features are observed by the first and third key frames and constrain the pose of the third key frame; F2 features are observed by all three key frames, so they constrain the poses among the three key frames and provide the most information; F3 features appear only in the second and third key frames and constrain the relative pose of the second and third key frames. Global BA optimization is then performed on all key frames and map points to obtain accurate poses and coordinates, completing the initialization process.
The effect of the method is evaluated on the TUM dataset and the OpenLORIS dataset: with monocular image sequences as input, the absolute translation errors of a conventional monocular SLAM system and of the method are compared against the ground truth provided by the two datasets.
Testing is first performed on the TUM dataset; the positioning errors are detailed in Table 1 below, where "failed" indicates that initialization never succeeded.
TABLE 1. Positioning error (cm) of the conventional initialization method vs. the present method on the TUM dataset

TUM RGB-D sequence    Conventional initialization method    The present method
f1_xyz                1.38                                   1.00
f2_xyz                0.98                                   0.97
f2_rpy                failed                                 1.48
f2_nstr_tex_far       failed                                 3.65
f3_sitting            2.76                                   2.65
f3_walking            failed                                 2.45
Testing is then performed on the OpenLORIS dataset. With a nearly straight-line path as ground truth, the initial map built by the traditional initialization method drifts and the pose errors of the initial key frames are large, whereas the method builds an accurate initial map, which plays a key role in subsequent positioning and tracking. A comparison of the positioning errors on the OpenLORIS dataset is detailed in Table 2 below.
TABLE 2. Positioning error (cm) of the conventional initialization method vs. the present method on the OpenLORIS dataset (the table content is provided as images in the original document)
The method can effectively improve the robustness and the precision of the monocular vision SLAM system initialization, enables the subsequent tracking and mapping process to be more precise, can quickly complete the initialization in challenging scenes, ensures the precision and has strong real-time performance.

Claims (9)

1. A monocular vision SLAM initialization method for improving robustness is characterized in that: the method comprises the following steps:
(1) establishing a sliding window, adding a fixed number of continuous frame images in advance as initial key frames, wherein the first key frame is a reference frame;
(2) extracting point characteristics and line characteristics from each key frame, and performing characteristic matching between every two key frames to obtain a point characteristic matching pair set and a line characteristic matching pair set;
(3) for every two key frames, calculating the relative pose between the two key frames by using a feature matching pair set obtained by matching, processing to obtain three-dimensional coordinates of three-dimensional points and three-dimensional straight lines corresponding to point features and line features in a space under a reference frame coordinate system, and forming an initial map by all the key frames, the three-dimensional points and the three-dimensional straight lines;
(4) and obtaining more accurate results of the pose, the three-dimensional points and the three-dimensional coordinates of the three-dimensional straight lines of each key frame in the reference frame coordinate system through global BA optimization, establishing an accurate initial map, and finishing the initialization process of monocular vision SLAM.
2. The method of initializing monocular vision SLAM to increase robustness of claim 1, wherein: in the step (1), when continuous frame images are added into a sliding window, counting the number of point features and line features which can be extracted from each frame image, and if the number reaches a preset threshold value, adding the frame image into the sliding window; otherwise, the frame image is discarded without adding the sliding window.
3. The method of initializing monocular vision SLAM to increase robustness of claim 1, wherein: in step (2), the point features extracted from the key frame images are ORB features and the line features are extracted with an LSD detector; point features and line features are two different types of features, and both types are matched in the same manner, as follows:
every two key frames in the sliding window form a key frame pair; for each key frame pair, feature matching is performed by computing the descriptor distance between features and judging as follows:
if two features of the same type in the two key frames are each other's nearest neighbor in descriptor distance, the match succeeds and the two features form a feature matching pair;
if a feature and its candidate are not each other's nearest neighbor, the pair is skipped and not processed;
by continuously traversing in this way, the feature matching pairs of all point features are found and form the point feature matching pair set, and the feature matching pairs of all line features form the line feature matching pair set.
4. The method of initializing monocular vision SLAM to increase robustness of claim 1, wherein: in the step (3), the pose between the two key frames and the three-dimensional coordinates of the features are calculated by adopting the following processes:
(3.1) calculating an essential matrix or a homography matrix from the matched point feature pair set, and then recovering the rotation matrix R and translation matrix t between the two key frames from the essential matrix or the homography matrix;
(3.2) optimizing the rotation matrix R and translation matrix t obtained in step (3.1) using the matched line feature pair set;
(3.3) reconstructing the three-dimensional coordinates of the point features and line features using the solved pose between the two key frames;
and (3.4) all the key frames, the three-dimensional points and the three-dimensional straight lines form an initial map.
5. The method of claim 4, wherein: in step (3.1), the essential matrix E and the homography matrix are both solved through the epipolar geometric constraint of the point features; the solving methods are the same, and the essential matrix E is taken as an example, specifically:
in a point feature matching pair between two key frames, the pair consists of an image point p1 in the image of the first key frame and an image point p2 in the image of the second key frame, corresponding to the same three-dimensional point in space; the epipolar constraint of p1 and p2 with respect to that three-dimensional point is established by the following formulas:

$$s_1 p_1 = K P$$
$$s_2 p_2 = K (R P + t)$$
$$s_2 K^{-1} p_2 = R\, s_1 K^{-1} p_1 + t$$
$$p_2^{T} K^{-T} [t]_{\times} R K^{-1} p_1 = p_2^{T} K^{-T} E K^{-1} p_1 = 0, \qquad E = [t]_{\times} R$$

where p1 is the homogeneous pixel coordinate of the image point of the three-dimensional point in the image of the first of the two key frames, p2 is the homogeneous pixel coordinate of its image point in the image of the second key frame, K is the camera intrinsic matrix, P is the three-dimensional coordinate of the three-dimensional point, R is the rotation matrix between the two key frames, t is the translation matrix between the two key frames, s1 is the depth of the three-dimensional point in the first key frame, s2 is its depth in the second key frame, E is the essential matrix, and T denotes matrix transposition;
each point feature matching pair yields one equation in the essential matrix E, and E has 8 degree-of-freedom parameters, so all parameters of E can be found by combining the eight equations given by eight point feature matching pairs; because errors exist in point feature detection and matching, RANSAC is used to select the optimal eight point feature matching pairs from the point feature matching pair set to calculate the essential matrix E.
6. The method of claim 4, wherein: in step (3.2), the rotation matrix R and the translation matrix t are solved from the essential matrix or the homography matrix through singular value decomposition.
7. The method of claim 4, wherein step (3.2) is specifically as follows:
(3.2.1) in a line feature matching pair between two key frames, the two end points of the line feature detected in the image of the first key frame are each mapped, through the rotation matrix R and the translation matrix t, to an epipolar line in the image of the second key frame, and the direction vectors of the epipolar lines are computed with the following formulas:

$$l'_s = K^{-T} [t]_{\times} R K^{-1} \tilde{s}_1$$
$$l'_e = K^{-T} [t]_{\times} R K^{-1} \tilde{e}_1$$

where $[t]_{\times}$ denotes the antisymmetric matrix of the translation matrix t, R is the rotation matrix, $\tilde{s}_1$ and $\tilde{e}_1$ are the homogeneous coordinates of the two end points of the line feature detected on the image of the first key frame, K is the camera intrinsic matrix, and l'_s and l'_e are the direction vectors of the two epipolar lines in the image of the second key frame;
the two epipolar lines pass through the epipole in the image of the second key frame, and their positions in the image are determined by the direction vectors and the epipole; the epipole is the intersection point, on the imaging plane of the key frame image, of the line connecting the camera optical centers of the two key frames;
the two epipolar lines intersect the straight line on which the line feature in the image of the second key frame lies, and the line segment between the two intersection points on that straight line is taken as the projection line segment of the line feature of the first key frame on the second key frame;
(3.2.2) if the projection line segment and the line feature in the image of the second key frame have an overlapping portion, the length of that overlapping portion is taken as the overlap length l^o_2 of the current line feature matching pair on the image of the second key frame; if they do not overlap, the overlap length l^o_2 of the current line feature matching pair on the second key frame is computed from the positions of the two non-overlapping segments;
here s_2 and e_2 are the pixel coordinates of the start and end points of the line feature in the image of the second key frame, s'_1 is the pixel coordinate of the intersection of the epipolar line corresponding to the start point of the line feature in the image of the first key frame with the straight line on which the line feature in the image of the second key frame lies, and e'_1 is the pixel coordinate of the intersection of the epipolar line corresponding to the end point of the line feature in the image of the first key frame with that straight line;
(3.2.3) steps (3.2.1)-(3.2.2) are repeated to obtain the overlap lengths, on the two key frames, of each pair of line feature matching pairs between the two key frames;
(3.2.4) from the overlap lengths, a cost function C is established over the line feature matching pair set consisting of n pairs of line feature matching pairs, whose terms are smaller the closer the overlap lengths l^o_{i1} and l^o_{i2} are to the corresponding line feature lengths l_i and l'_i; the rotation matrix R and the translation matrix t are obtained by minimizing the cost function:

$$(R, t) = \arg\min_{R,\, t} C$$

where l^o_{i1} is the overlap length of the i-th line feature matching pair on the first key frame, l^o_{i2} is its overlap length on the second key frame, l_i is the length of the line feature of the first key frame in the i-th pair, l'_i is the length of the line feature of the second key frame in the i-th pair, and n is the total number of line feature matching pairs between the two key frames.
8. The method of claim 4, wherein step (3.3) is specifically as follows:
(3.3.1) for a point feature matching pair, the homogeneous pixel coordinates $\tilde{p}_1$ of the image point in the first key frame and $\tilde{p}_2$ of the image point in the second key frame are known, as are the rotation matrix R and translation matrix t between the two key frames; the depth s_1, in the first key frame coordinate system, of the three-dimensional point in space corresponding to the point feature matching pair is obtained through triangulation from the constraint

$$s_1 \, [K^{-1}\tilde{p}_2]_{\times} \, R \, K^{-1}\tilde{p}_1 + [K^{-1}\tilde{p}_2]_{\times} \, t = 0$$

where $[K^{-1}\tilde{p}_2]_{\times}$ is the antisymmetric matrix of $K^{-1}\tilde{p}_2$;
solving this overdetermined system of equations gives the least-squares solution s_1, and the three-dimensional coordinate P_1 of the three-dimensional point in the first key frame coordinate system is then obtained by

$$P_1 = s_1 K^{-1} \tilde{p}_1$$

where K is the camera intrinsic matrix;
converting the coordinates of the three-dimensional points into coordinates of a reference frame coordinate system by using a rotation matrix R and a translation matrix t between the first key frame and the reference frame;
(3.3.2) repeating the step (3.3.1) to obtain the coordinates of the corresponding three-dimensional points of all the point feature matching pairs in the space under a reference frame coordinate system;
(3.3.3) for a line feature matching pair, determining a plane through the camera origin of a first key frame and two end points of the line feature detected on the image of the first key frame, determining a second plane through the camera origin of a second key frame and two end points of the line feature detected on the image of the second key frame, solving a plane equation of the second plane under the coordinate system of the first key frame by using a rotation matrix R and a translation matrix t between the two key frames, and solving three-dimensional coordinates of a corresponding three-dimensional straight line of the line feature matching pair in space under the coordinate system of the first key frame by intersecting the two planes;
(3.3.4) repeating the step (3.3.3) to obtain the three-dimensional coordinates of the corresponding three-dimensional straight lines in the space of all the line feature matching pairs in the reference frame coordinate system.
9. The robust monocular vision SLAM initialization method of claim 1, wherein: the step (4) is specifically as follows:
(4.1) for the initial map, the process obtains the reprojection error of each three-dimensional point and each three-dimensional straight line in the initial map on all keyframes in which they were observed:
(4.1A) the reprojection error of a three-dimensional point is calculated by the following formula:

$$e^{p}_{ij} = u_{ij} - \hat{u}_{ij}$$

where u_{ij} is the pixel coordinate of the point feature of the j-th three-dimensional point detected on the image of the i-th key frame, and $\hat{u}_{ij}$ is the pixel coordinate of the reprojected point of the j-th three-dimensional point on the image of the i-th key frame;
the reprojected point of the j-th three-dimensional point on the image of the i-th key frame is the pixel point obtained by converting the three-dimensional coordinate of the j-th three-dimensional point from the reference frame coordinate system into the i-th key frame coordinate system, through the rotation matrix R and translation matrix t between the reference frame and the i-th key frame, and projecting it onto the two-dimensional image of the i-th key frame;
(4.1B) the reprojection error of a three-dimensional straight line is calculated by the following formula:

$$e^{l}_{ik} = \begin{bmatrix} d(\tilde{s}_{ik},\, l'_{ik}) \\ d(\tilde{e}_{ik},\, l'_{ik}) \end{bmatrix}, \qquad d(x,\, l'_{ik}) = \frac{x^{T} l'_{ik}}{\sqrt{l_1^2 + l_2^2}}$$

where $\tilde{s}_{ik}$ and $\tilde{e}_{ik}$ are the homogeneous pixel coordinates of the end points of the line feature of the k-th three-dimensional straight line detected on the image of the i-th key frame, l'_{ik} is the normal vector of the reprojected line of the k-th three-dimensional straight line on the image of the i-th key frame, and l_1 and l_2 are the first and second parameters of the normal vector l'_{ik};
the reprojected line of the k-th three-dimensional straight line on the image of the i-th key frame is the straight line obtained by converting the three-dimensional coordinates of the k-th three-dimensional straight line from the reference frame coordinate system into the i-th key frame coordinate system, through the rotation matrix R and translation matrix t between the reference frame and the i-th key frame, and projecting it onto the two-dimensional image of the i-th key frame;
(4.2) the following global cost function C for global BA optimization is established in the first frame coordinate system:

$$C = \sum_{i}\sum_{j} \rho\!\left( (e^{p}_{ij})^{T}\, \Omega_{ij}\, e^{p}_{ij} \right) + \sum_{i}\sum_{k} \rho\!\left( (e^{l}_{ik})^{T}\, \Omega_{ik}\, e^{l}_{ik} \right)$$

where $e^{p}_{ij}$ is the reprojection error of the j-th three-dimensional point on the image of the i-th key frame, $e^{l}_{ik}$ is the reprojection error of the k-th three-dimensional straight line on the image of the i-th key frame, $\Omega_{ij}$ is the information matrix of $e^{p}_{ij}$, $\Omega_{ik}$ is the information matrix of $e^{l}_{ik}$, and $\rho$ is a robust kernel function;
the pose of each key frame in the reference frame coordinate system and the three-dimensional coordinates of each three-dimensional point and each three-dimensional straight line in the reference frame coordinate system are obtained by minimizing the global cost function C.
CN202011633399.9A 2020-12-31 2020-12-31 Monocular vision SLAM initialization method for improving robustness Active CN112734839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011633399.9A CN112734839B (en) 2020-12-31 2020-12-31 Monocular vision SLAM initialization method for improving robustness

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011633399.9A CN112734839B (en) 2020-12-31 2020-12-31 Monocular vision SLAM initialization method for improving robustness

Publications (2)

Publication Number Publication Date
CN112734839A true CN112734839A (en) 2021-04-30
CN112734839B CN112734839B (en) 2022-07-08

Family

ID=75608592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011633399.9A Active CN112734839B (en) 2020-12-31 2020-12-31 Monocular vision SLAM initialization method for improving robustness

Country Status (1)

Country Link
CN (1) CN112734839B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610175A (en) * 2017-08-04 2018-01-19 华南理工大学 The monocular vision SLAM algorithms optimized based on semi-direct method and sliding window
CN108090958A (en) * 2017-12-06 2018-05-29 上海阅面网络科技有限公司 A kind of robot synchronously positions and map constructing method and system
CN108682027A (en) * 2018-05-11 2018-10-19 北京华捷艾米科技有限公司 VSLAM realization method and systems based on point, line Fusion Features
CN110631554A (en) * 2018-06-22 2019-12-31 北京京东尚科信息技术有限公司 Robot posture determining method and device, robot and readable storage medium
CN109579840A (en) * 2018-10-25 2019-04-05 中国科学院上海微系统与信息技术研究所 A kind of close coupling binocular vision inertia SLAM method of dotted line Fusion Features
CN110782494A (en) * 2019-10-16 2020-02-11 北京工业大学 Visual SLAM method based on point-line fusion
CN110853100A (en) * 2019-10-24 2020-02-28 东南大学 Structured scene vision SLAM method based on improved point-line characteristics
CN111445526A (en) * 2020-04-22 2020-07-24 清华大学 Estimation method and estimation device for pose between image frames and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HAILAN KUANG et al.: "Monocular SLAM Algorithm Based on Improved Depth Map Estimation and Keyframe Selection", 《2018 10TH INTERNATIONAL CONFERENCE ON MEASURING TECHNOLOGY AND MECHATRONICS AUTOMATION (ICMTMA)》 *
SANG JUN LEE et al.: "Elaborate Monocular Point and Line SLAM With Robust Initialization", 《2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
ZAIHONG XIAO et al.: "Monocular ORB SLAM based on initialization by marker pose estimation", 《2017 IEEE INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION (ICIA)》 *
八千米落空: "Principles and Comparison of the SLAM Initialization Problem", 《HTTPS://WWW.JIANSHU.COM/P/52EFF860504A?UTM_CAMPAIGN=MALESKINE&UTM_CONTENT=NOTE&UTM_MEDIUM=SEO_NOTES&UTM_SOURCE=RECOMMENDATION》 *
彭清漪 et al.: "Semi-direct Monocular Visual Localization Algorithm Based on Fusion of Photometric and Point-Line Features", 《传感器与微系统》 *
蒋林 et al.: "A Binocular Simultaneous Localization and Mapping Method with Point-Line Feature Fusion", 《科学技术与工程》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313112A (en) * 2021-05-31 2021-08-27 浙江商汤科技开发有限公司 Image processing method and device, computer equipment and storage medium
WO2023279867A1 (en) * 2021-07-07 2023-01-12 北京字跳网络技术有限公司 Simultaneous localization and mapping rear-end optimization method and apparatus, and storage medium
WO2023279868A1 (en) * 2021-07-07 2023-01-12 北京字跳网络技术有限公司 Simultaneous localization and mapping initialization method and apparatus and storage medium
CN113790728A (en) * 2021-09-29 2021-12-14 佛山市南海区广工大数控装备协同创新研究院 Loosely-coupled multi-sensor fusion positioning algorithm based on visual odometer
CN113850218A (en) * 2021-09-30 2021-12-28 哈尔滨理工大学 Monocular vision based SLAM initialization method
CN117112043A (en) * 2023-10-20 2023-11-24 深圳市智绘科技有限公司 Initialization method and device of visual inertial system, electronic equipment and medium
CN117112043B (en) * 2023-10-20 2024-01-30 深圳市智绘科技有限公司 Initialization method and device of visual inertial system, electronic equipment and medium

Also Published As

Publication number Publication date
CN112734839B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN112734839B (en) Monocular vision SLAM initialization method for improving robustness
CN109631855B (en) ORB-SLAM-based high-precision vehicle positioning method
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
CN103559711B (en) Based on the method for estimating of three dimensional vision system characteristics of image and three-dimensional information
CN111862673B (en) Parking lot vehicle self-positioning and map construction method based on top view
CN114862949A (en) Structured scene vision SLAM method based on point, line and surface characteristics
CN113658337B (en) Multi-mode odometer method based on rut lines
CN104539928A (en) Three-dimensional printing image synthesizing method for optical grating
Usenko et al. Reconstructing street-scenes in real-time from a driving car
CN108776989A (en) Low texture plane scene reconstruction method based on sparse SLAM frames
CN110570474B (en) Pose estimation method and system of depth camera
Zhang et al. Hand-held monocular SLAM based on line segments
CN104182968A (en) Method for segmenting fuzzy moving targets by wide-baseline multi-array optical detection system
CN111998862A (en) Dense binocular SLAM method based on BNN
CN110599545A (en) Feature-based dense map construction system
Fei et al. Ossim: An object-based multiview stereo algorithm using ssim index matching cost
CN113393524A (en) Target pose estimation method combining deep learning and contour point cloud reconstruction
CN116128966A (en) Semantic positioning method based on environmental object
Zhang et al. Structure-aware SLAM with planes and lines in man-made environment
Zhou et al. Monocular visual odometry initialization with points and line segments
CN116843754A (en) Visual positioning method and system based on multi-feature fusion
CN115830116A (en) Robust visual odometer method
CN114608558A (en) SLAM method, system, device and storage medium based on feature matching network
Ershov et al. Stereovision algorithms applicability investigation for motion parallax of monocular camera case

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant