WO2023184968A1 - Structured scene visual SLAM method based on point line surface features


Info

Publication number: WO2023184968A1
Authority: WO (WIPO PCT)
Application number: PCT/CN2022/128826
Prior art keywords: features, point, line, plane, coordinate system
Other languages: French (fr), Chinese (zh)
Inventor: 裴海龙; 翁卓荣
Original Assignee: 华南理工大学 (South China University of Technology)
Application filed by 华南理工大学
Publication of WO2023184968A1

Classifications

    • G06T 7/73 — Image analysis: determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/97 — Image analysis: determining parameters from multiple pictures
    • G06T 2207/10024 — Image acquisition modality: color image
    • G06T 2207/10028 — Image acquisition modality: range image; depth image; 3D point clouds
    • Y02T 10/40 — Climate change mitigation technologies related to transportation: engine management systems


Abstract

Disclosed is a structured scene visual SLAM method based on point, line and plane features. The method comprises: first, inputting a color image and the corresponding depth image, extracting point, line and plane features from the images, and performing feature matching; then using the plane normal vectors to detect a Manhattan world coordinate system and, if a Manhattan world coordinate system exists and has already been observed in the Manhattan world map, solving for the camera attitude from it and tracking the point, line and plane features to estimate the translation, otherwise tracking the point, line and plane features to estimate the full pose; then judging whether the current frame is a key frame and, if so, inserting it into the local map; subsequently maintaining the map information and performing joint optimization on the current key frame, its adjacent key frames and the three-dimensional features; and finally performing loop closure detection and, if a loop closure frame is detected, closing the loop and performing global optimization. The present invention provides a visual SLAM method with high precision and strong robustness, and solves the problem of reduced precision and even system failure that arises when visual SLAM relies only on point features in low-texture structured scenes.

Description

A structured scene visual SLAM method based on point, line and plane features

Technical field

The invention belongs to the technical field of simultaneous localization and mapping (SLAM) for robots, and specifically relates to a structured scene visual SLAM method based on point, line and plane features.
Background art

In recent years, intelligent vehicle positioning systems have been used more and more widely in urban transportation. Traditional positioning methods, such as outdoor positioning systems based on GNSS technology, are already very mature. However, high-performance real-time positioning remains difficult in indoor environments where GNSS signals are blocked. Although indoor positioning systems based on wireless signals, such as Bluetooth, WiFi, UWB and RFID, have made great progress, their equipment deployment costs are high and they are easily disturbed indoors by occlusion and multipath effects, so they are difficult to use effectively for indoor positioning and mapping.

To address these problems, simultaneous localization and mapping (SLAM) methods have been proposed. SLAM is a technique for localizing a platform in an unknown environment while building a map at the same time, and is mainly applied to mobile robots, autonomous driving, virtual reality and augmented reality. Mature laser-based SLAM systems such as Cartographer, Hector mapping and GMapping have been proposed for indoor environments. However, since lidar is still relatively expensive, low-cost visual sensors such as monocular, stereo and depth cameras are generally preferred; such systems take images as input and output the camera trajectory together with a reconstructed three-dimensional map, and methods such as DSO use cameras to achieve high-precision indoor positioning. Existing vision-based SLAM systems, however, rely heavily on the extraction of low-level point features from images. For example, the state-of-the-art point-feature-based ORB-SLAM2 (ORB-SLAM2: An Open Source SLAM System for Monocular, Stereo and RGB-D Cameras) estimates the pose only from point features: although it achieves high-precision positioning in texture-rich environments, point features are hard to extract and unstable in man-made indoor environments with sparse texture and large illumination changes, so its pose estimation accuracy degrades and tracking may even fail. Such man-made indoor environments are rich in regular geometric structures such as walls, floors and ceilings; structural features represented by lines and planes are easier to obtain there than point features and are less affected by illumination changes. It is therefore particularly important to use visual information to find structural features suitable for SLAM positioning in indoor, geometrically structured scenes that lack texture. By exploiting these indoor geometric structural elements, the present invention proposes a structured scene visual SLAM method based on point, line and plane features, which solves the problem of low pose estimation accuracy and poor robustness in man-made structured indoor scenes with little texture and changing illumination.
Summary of the invention

In view of the above problems in the prior art, the purpose of the present invention is to provide a structured scene visual SLAM method based on point, line and plane features, so as to solve the problem that current visual SLAM technology suffers from low pose estimation accuracy and poor robustness in man-made structured indoor scenes with little texture and changing illumination.

The present invention is realized through at least one of the following technical solutions.

A structured scene visual SLAM method based on point, line and plane features, including the following steps:

S1. Input a color image, extract point features and line features from the color image and perform feature matching.

S2. Input the corresponding depth image, convert it into an organized point cloud structure, extract the image planes, and then match the extracted image planes with the map planes.

S3. Detect whether a Manhattan world coordinate system exists in the image. If it exists and the coordinate system has been observed in a historical key frame in the Manhattan world map, estimate the camera attitude from the extracted Manhattan world coordinate system and track the point, line and plane features to optimize the camera translation; otherwise, track the point, line and plane features to optimize the full camera pose.

S4. Judge whether the current frame is a key frame. If it is, add it to the local map and update the three-dimensional points, three-dimensional lines and the Manhattan world map; jointly optimize the current key frame and its adjacent key frames, optimizing the camera poses and the three-dimensional point and line features, and remove some outliers and redundant key frames.

S5. Perform loop closure detection on the key frames. If a loop closure is detected, close the loop and perform global optimization to reduce the accumulated error.
Further, step S1 is specifically: input a color image; first use the ORB algorithm to extract point features and match them according to their descriptors, then remove mismatches between point features with the PROSAC method; afterwards use the EDLine algorithm to extract line features, merge fragmented line segments using distance, angle and descriptor information as the screening criteria, and match the line features according to their LBD descriptors.
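A minimal sketch of this point-feature step using OpenCV is shown below; it is an illustration, not the implementation of the invention. ORB extraction and brute-force Hamming matching are standard OpenCV calls, and geometric outlier rejection is shown here with a RANSAC fundamental-matrix fit (the patent prefers PROSAC, which may be substituted where available). All function and variable names are illustrative.

```python
import cv2
import numpy as np

def extract_and_match_points(img1, img2, n_features=1000):
    """ORB point features + descriptor matching + robust outlier rejection."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Hamming-distance brute-force matching on the binary ORB descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Reject mismatches with a robust epipolar-geometry fit
    # (RANSAC here; PROSAC can be substituted in the same place).
    _, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    if inlier_mask is None:
        return kp1, kp2, matches
    inliers = [m for m, keep in zip(matches, inlier_mask.ravel()) if keep]
    return kp1, kp2, inliers
```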
Further, step S2 is specifically: first, compute the corresponding three-dimensional point for every pixel with a valid depth, forming an organized point cloud structure; next, use a hierarchical clustering algorithm to fuse small planar patches into larger planes and refine the segmentation of the fused planes; finally, for each segmented plane compute its three-dimensional center point P_C = [X_C, Y_C, Z_C]^T and its unit normal vector n = [n_x, n_y, n_z]^T, where X_C, Y_C, Z_C are the three-dimensional coordinates of the center point P_C and n_x, n_y, n_z are the components of the normal vector n. The distance from the camera optical center to the plane is then d = -n^T P_C (so that every point X on the plane satisfies n^T X + d = 0), and the plane feature is expressed as π = [n^T, d]^T. The planes in the image are then projected into the world coordinate system, and each projected plane is matched against the map planes according to the angle between their normals and the difference between their distances from the origin of the world coordinate system.
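As an illustration, the plane parameters described above can be computed from a segmented set of 3D points as sketched below (plain NumPy). The sign convention n^T X + d = 0 and the orientation flip are assumptions consistent with the distance definition above, not something stated explicitly in the original.

```python
import numpy as np

def plane_parameters(points_3d):
    """points_3d: (N, 3) array of points belonging to one segmented plane.
    Returns the centre P_C, the unit normal n and the plane vector pi = [n, d]."""
    P_C = points_3d.mean(axis=0)                      # 3D centre point
    # Unit normal = eigenvector of the covariance with the smallest eigenvalue.
    cov = np.cov((points_3d - P_C).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    n = eigvecs[:, 0]
    d = -float(n @ P_C)                               # distance term so that n.X + d = 0
    if d < 0:                                         # keep a consistent plane orientation
        n, d = -n, -d
    return P_C, n, np.append(n, d)                    # pi = [n_x, n_y, n_z, d]^T
```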
Further, step S3 is specifically:

S31. Traverse the normal vectors of the planes extracted in step S2 to find whether there is a combination of three mutually orthogonal normal vectors; if only two mutually orthogonal normal vectors exist, the third direction is obtained as the cross product of the two mutually orthogonal normal vectors.
S32. If the normal vector combination sought in step S31 exists, query the Manhattan world map to check whether mutually orthogonal planes consistent with this set of orthogonal normal vectors have been observed in a historical key frame. If so, they constitute a Manhattan world coordinate system; take the Manhattan world coordinate system formed by the normal vector combination whose corresponding planes contain the largest total number of points, and compute the rotation matrix from this Manhattan world coordinate system to the camera coordinate system of the current frame as:

R^{c_i}_{m_k} = [n_1, n_2, n_3]

where n_1, n_2, n_3 are the mutually orthogonal normal vectors that were found, c_i is the id of the current frame, and m_k is the id of the Manhattan world coordinate system. Apply an SVD decomposition to R^{c_i}_{m_k} to obtain the orthogonalized rotation matrix \hat{R}^{c_i}_{m_k} from the Manhattan world coordinate system to the camera coordinate system. The rotation matrix from the current camera coordinate system to the world coordinate system is then:

R^{w}_{c_i} = (R^{c_j}_{w})^T · R^{c_j}_{m_k} · (\hat{R}^{c_i}_{m_k})^T

where c_j is the id of the historical key frame that was found, R^{c_j}_{m_k} is the rotation matrix from the Manhattan world coordinate system m_k to the camera coordinate system at frame c_j, and R^{c_j}_{w} is the rotation matrix from the world coordinate system to the camera coordinate system at frame c_j.
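The following NumPy sketch illustrates the SVD orthogonalization and the composition of rotations described above. The matrix names follow the notation in the text; the composition order is inferred from the coordinate-frame definitions (the exact equation is given as an image in the original), so treat this as an assumption rather than the definitive form.

```python
import numpy as np

def manhattan_rotation(n1, n2, n3):
    """Orthogonalise the stacked normals [n1 n2 n3] with an SVD to get the rotation
    from the Manhattan world frame m_k to the current camera frame c_i."""
    M = np.column_stack([n1, n2, n3])        # approximately orthogonal plane normals
    U, _, Vt = np.linalg.svd(M)
    R_ci_mk = U @ Vt                          # nearest rotation matrix to M
    if np.linalg.det(R_ci_mk) < 0:            # guard against a reflection
        U[:, -1] *= -1
        R_ci_mk = U @ Vt
    return R_ci_mk

def camera_to_world_rotation(R_ci_mk, R_cj_mk, R_cj_w):
    """Compose the rotation from the current camera frame c_i to the world frame,
    via the key frame c_j in which the same Manhattan frame m_k was observed."""
    # chain: camera_i -> m_k -> camera_j -> world
    return R_cj_w.T @ R_cj_mk @ R_ci_mk.T
```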
The point, line and plane features are then tracked to optimize the camera translation t. The error model e_t for the translation is of the form:

e_t = Σ Φ_p(p) · ρ_p(e_p^T Λ_p e_p) + Σ Φ_l(l) · ρ_l(e_l^T Λ_l e_l) + Σ Φ_π(π) · ρ_π(e_π^T Λ_π e_π)

where e_p, e_l, e_π are the reprojection errors of the point, line and plane features respectively, with the specific forms:

e_p = p - (K R_cw P_w + t_cw)
e_l = l^T (K R_cw P_L + t_cw)
e_π = q(π_c) - q(T_cw^{-T} π_w)

Here K is the camera intrinsic matrix, R_cw is the rotation matrix from the world coordinate system to the camera coordinate system, t_cw is the translation from the world coordinate system to the camera coordinate system, and T_cw is the pose transformation matrix from the world coordinate system to the camera coordinate system; p is the pixel coordinate of a point feature detected in the frame and P_w is the corresponding three-dimensional map point; l is a line feature detected in the image and L is the corresponding three-dimensional line, with P_L a three-dimensional endpoint of that line; q(π) is the minimal parameterization of the plane feature π = [n_x, n_y, n_z, d]^T, where n_x, n_y, n_z are the components of the normal vector of plane π and d is the distance from the camera optical center to plane π; π_c is the plane detected in the frame and π_w is the corresponding map plane. Λ_p, Λ_l, Λ_π are the information matrices of the point, line and plane features, ρ_p, ρ_l, ρ_π are the Huber robust kernel functions of the point, line and plane features, and Φ_p(p), Φ_l(l), Φ_π(π) are the confidence coefficients of the point, line and plane features (their explicit formulas are given as images in the original), in which n_p, n_l, n_π are the numbers of times the corresponding point, line and plane features have been observed, t_p, t_l, t_π are the weight coefficients of the point, line and plane features, level_i is the level of the ORB pyramid in which the point feature lies, α is a weight coefficient with α ∈ [0.5, 1], and θ_i is the angle between the camera viewing ray in frame i and the map line.
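The sketch below evaluates the three residual terms defined above in a normalised form. It is illustrative only: the line residual is written as the algebraic distance of a projected endpoint to the normalised 2D line, and the plane residual uses a spherical parameterisation q(·); both conventions are assumptions where the original formulas appear as images.

```python
import numpy as np

def point_residual(p, P_w, K, R_cw, t_cw):
    """e_p: observed pixel p minus the projection of the map point P_w."""
    uvw = K @ (R_cw @ P_w + t_cw)
    return p - uvw[:2] / uvw[2]

def line_residual(l, P_L, K, R_cw, t_cw):
    """e_l: algebraic distance of a projected line endpoint P_L to the 2D line
    l = [a, b, c], with l normalised so that a^2 + b^2 = 1."""
    uvw = K @ (R_cw @ P_L + t_cw)
    uv1 = np.array([uvw[0] / uvw[2], uvw[1] / uvw[2], 1.0])
    return l @ uv1

def plane_residual(pi_c, pi_w, T_cw):
    """e_pi: difference of minimal plane parameterisations q(.) between the detected
    plane and the map plane transformed into the camera frame."""
    pi_w_in_c = np.linalg.inv(T_cw).T @ pi_w          # planes transform as pi' = T^{-T} pi

    def q(pi):
        scale = np.linalg.norm(pi[:3])
        n, d = pi[:3] / scale, pi[3] / scale
        return np.array([np.arctan2(n[1], n[0]),              # azimuth
                         np.arcsin(np.clip(n[2], -1.0, 1.0)), # elevation
                         d])

    return q(pi_c) - q(pi_w_in_c)
```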
S33. If the Manhattan world coordinate system sought in step S31 does not exist, or no corresponding normal vector combination has been observed in the historical key frames in the Manhattan world map, then track the point, line and plane features to optimize the full camera pose R, t. The camera pose error model e_{R,t} is:

e_{R,t} = Σ Φ_p(p) · ρ_p(e_p^T Λ_p e_p) + Σ Φ_l(l) · ρ_l(e_l^T Λ_l e_l) + Σ Φ_π(π) · ρ_π(e_π^T Λ_π e_π)

with the same reprojection errors, confidence coefficients, information matrices and Huber robust kernels as above, but with both the rotation R and the translation t as optimization variables.
Further, step S4 is specifically:

S41. Judge whether the current frame should be set as a key frame according to the number of matched point and line feature pairs, whether a new plane has been detected, the feature tracking situation and the key frames already in the local map; if it is a key frame, add it to the local map, otherwise return to step S1.

S42. For a newly inserted key frame, update the covisibility graph and the spanning tree and add the new key frame node; remove point, line and plane features that have not been continuously and reliably observed since they were created, according to observation consistency; for point and line features of the new key frame that have no match, back-project them using the depth information to generate new map points and map lines and insert them into the map; record the association between combinations of perpendicular planes and the current key frame according to the perpendicular plane relations in the image, and update the Manhattan world map.

S43. After the complete map update, jointly optimize the current key frame and the key frames associated with it, removing outliers during the optimization so as to optimize the camera poses as far as possible. The optimization variables are the camera poses R, t of the related key frames and the three-dimensional feature parameters P, L, and the reprojection error e used in the optimization is:

e = Σ Φ_p(p) · ρ_p(e_p^T Λ_p e_p) + Σ Φ_l(l) · ρ_l(e_l^T Λ_l e_l)

where e_p and e_l are the reprojection errors of the point and line features, Φ_p(p) and Φ_l(l) are the confidence coefficients of the point and line features, Λ_p and Λ_l are the information matrices of the point and line features, and ρ_p and ρ_l are the Huber robust kernel functions of the point and line features.

S44. Remove key frames whose features overlap with those of other key frames by more than 90%.
Further, the conditions for judging a key frame are specifically as follows (a code sketch is given after the list):

(1) more than 10 frames have been processed since the last key frame was inserted and the local mapping thread is currently idle, in which case the frame is judged to be a key frame;

(2) more than 20 frames have been processed since the last key frame was inserted, in which case the frame is judged to be a key frame;

(3) the total number of matched point, line and plane features in the current image is not less than 20, otherwise the frame cannot be used as a key frame;

(4) the features tracked in the current image amount to less than 90% of the features tracked by the most recent key frame, in which case the frame is judged to be a key frame;

(5) a new plane is extracted from the image, in which case the frame is judged to be a key frame.
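A minimal sketch of this decision rule is given below; the argument names and the order in which the conditions are evaluated are illustrative assumptions.

```python
def is_keyframe(frames_since_last_kf, mapping_idle, n_matched_features,
                tracked_ratio_vs_last_kf, new_plane_detected):
    """Keyframe decision following the five conditions listed above (illustrative)."""
    if n_matched_features < 20:                        # condition (3): enough matched features
        return False
    if new_plane_detected:                             # condition (5): a new plane was extracted
        return True
    if tracked_ratio_vs_last_kf < 0.9:                 # condition (4): tracking has decayed
        return True
    if frames_since_last_kf > 20:                      # condition (2): long gap since last keyframe
        return True
    if frames_since_last_kf > 10 and mapping_idle:     # condition (1): shorter gap, mapper idle
        return True
    return False
```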
Further, step S5 is specifically: based on a point-line feature dictionary model, first look up the word types in the dictionary K-ary tree for the descriptors of all point and line features of the current key frame and compute the weight of each word, obtaining the word vector of the current key frame; compute the preselected similarity between the word vector of the current key frame and the word vectors of the other key frames to obtain the frame-to-frame similarity, and determine the loop closure frame of the current key frame from this similarity; finally, build the pose graph of all key frames inside the loop and perform global pose graph optimization to reduce the accumulated error.
Further, the preselected similarity is:

s(v_c, v_o) [formula given as an image in the original]

where v_c is the word vector of the current key frame and v_o is the word vector of another key frame. The word vector v has the specific form:

v = {(w_1, η_1), (w_2, η_2), …, (w_k, η_k)}

where w_i is the i-th word in the visual dictionary and η_i is the word weight of w_i, computed as:

η_i = (n_i / n) · log(N / N_i)

where n_i is the number of features in the image belonging to word w_i, n is the total number of point and line features in the image, N is the number of all features in the key frame database, and N_i is the total number of point and line features in the database belonging to word w_i.
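The sketch below computes the TF-IDF word weight as defined above and a bag-of-words similarity. The similarity formula in the original is given only as an image, so the commonly used L1-based score is assumed here purely for illustration.

```python
import numpy as np

def word_weight(n_i, n, N, N_i):
    """TF-IDF weight of word w_i: frequency in the image times inverse frequency
    in the key-frame database (matching the definitions of n_i, n, N, N_i above)."""
    return (n_i / n) * np.log(N / N_i)

def bow_similarity(v_c, v_o):
    """Similarity of two bag-of-words vectors given as dense weight vectors over the
    vocabulary. The exact score in the original is an image; the usual L1 score is assumed."""
    a = v_c / np.linalg.norm(v_c, ord=1)
    b = v_o / np.linalg.norm(v_o, ord=1)
    return 1.0 - 0.5 * np.abs(a - b).sum()
```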
Further, step S1 may also be implemented as follows: input a color image; first use the ORB algorithm to extract point features and match them according to their descriptors, then remove mismatches between point features with the RANSAC method; afterwards use the EDLine algorithm to extract line features, merge fragmented line segments using distance, angle and descriptor information as the screening criteria, and match the line features according to their LBD descriptors.

Further, step S1 may also be implemented as follows: input a color image; first use the LSD algorithm to extract the line features, match the point features according to their descriptors and remove mismatches between point features with the RANSAC method; merge the fragmented line segments using distance, angle and descriptor information as the screening criteria, and match the line features according to their LBD descriptors.
Compared with the prior art, the present invention has the following significant beneficial effects:

(1) The present invention alleviates the problems of low positioning accuracy and poor robustness of visual SLAM systems that rely only on point features, and achieves satisfactory positioning and tracking in indoor man-made structured scenes with low texture and large illumination changes.

(2) By exploiting the structural features of indoor scenes and relying on the Manhattan world assumption, the present invention extracts a Manhattan world coordinate system, computes the absolute rotation from the Manhattan world coordinate system to the camera coordinate system, and from it directly obtains the rotation from the world coordinate system to the camera coordinate system, which greatly reduces the accumulated drift between frames and improves the tracking performance and accuracy.
Description of the drawings

Figure 1 is a flow chart of a structured scene visual SLAM method based on point, line and plane features according to an embodiment;

Figure 2 is a comparison of the trajectories estimated by the present invention on the TUM fr3_str_notext_far data sequence;

Figure 3 is a comparison of the x, y, z translations estimated by the present invention on the TUM fr3_str_notext_far data sequence;

Figure 4 is a comparison of the three-axis Euler-angle attitudes estimated by the present invention on the TUM fr3_str_notext_far data sequence;

Figure 5 is the sparse three-dimensional feature map reconstructed by the present invention on the TUM fr3_str_notext_far data sequence;

Figure 6 is the planar mesh map reconstructed by the present invention on the TUM fr3_str_notext_far data sequence.
Detailed description of the embodiments

The technical solution of the present invention is described in further detail below with reference to specific embodiments.

As shown in Figure 1, the structured scene visual SLAM method based on point, line and plane features provided by the present invention includes the following steps:

Step S1. Input a color image; first use the ORB algorithm to extract point features and match them according to their descriptors, then remove mismatches between point features with the PROSAC method (in this embodiment the RANSAC method can also be used to remove mismatches between point features); next use the EDLine algorithm to extract line features, merge fragmented line segments using endpoint distance, angle and descriptor information as the screening criteria, and match the line features according to their LBD descriptors.
Step S2. Input the depth image; first compute the corresponding three-dimensional point for every pixel with valid depth, forming an organized point cloud structure; then quickly extract planes from the resulting point cloud using a hierarchical clustering algorithm; finally match the planes in the image against the map planes according to the plane parameters. The specific steps are:
Step S21. Input the depth image and compute the corresponding three-dimensional point for every pixel with valid depth, forming an organized point cloud structure that is convenient for the subsequent algorithms, as sketched below.
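A minimal back-projection sketch for this step is given below; the pinhole intrinsics fx, fy, cx, cy and the millimetre depth scale are assumptions about the sensor, not values stated in the original.

```python
import numpy as np

def depth_to_organized_cloud(depth, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project a depth image into an organised (H, W, 3) point cloud in the
    camera frame; pixels with invalid depth are set to NaN."""
    h, w = depth.shape
    z = depth.astype(np.float64) / depth_scale          # metres (assumes millimetre depth)
    z[z <= 0] = np.nan                                   # mark invalid depth
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.dstack([x, y, z])                          # keeps the image organisation
```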
Step S22. Initialize the graph model: divide the point cloud of the image evenly into 10×10 blocks, each block corresponding to a node of the graph; a node and its connecting edges are removed from the graph when the mean squared plane-fitting error of the node is above a set threshold, when the node contains missing depth data or depth-discontinuous points, or when the node lies on the boundary between two planes.

Step S23. Apply hierarchical clustering: first build a min-heap data structure so that the pair of nodes whose fusion has the smallest mean squared error can be found efficiently; then recompute the plane-fitting mean squared error after fusion and find the two nodes with the smallest error. If this error exceeds a preset threshold, a segmented plane node has been found and is extracted from the graph; otherwise the two nodes are fused and re-inserted into the graph, and the min-heap is updated. The previous two operations are repeated until all nodes have been taken out of the graph. A simplified sketch of this merging loop is given after step S24.

Step S24. Further refine the plane segmentation: the jagged boundaries produced at the plane edges are optimized by eroding the boundary regions; unused data points are assigned to the nearest surrounding plane; over-segmented planes are clustered hierarchically again on a much smaller graph.
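The sketch below is a greatly simplified version of the min-heap merging loop of step S23: it ignores block adjacency and the special node-removal rules of step S22, and the MSE threshold is an arbitrary placeholder. It only illustrates the heap-driven "merge while the fused plane fit stays good" idea.

```python
import heapq
import numpy as np

def plane_mse(points):
    """Mean squared distance of a point set to its best-fit plane."""
    q = points - points.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(q.T))
    n = vecs[:, 0]                                   # normal = smallest-eigenvalue direction
    return float(np.mean((q @ n) ** 2))

def merge_blocks(blocks, neighbours, mse_threshold=1e-4):
    """blocks: dict {id: (N, 3) points}; neighbours: iterable of (i, j) block pairs."""
    version = {i: 0 for i in blocks}
    heap = []
    for i, j in neighbours:
        heapq.heappush(heap, (plane_mse(np.vstack([blocks[i], blocks[j]])),
                              i, j, version[i], version[j]))
    while heap:
        mse, i, j, vi, vj = heapq.heappop(heap)
        if i not in blocks or j not in blocks:
            continue                                 # one side was already consumed
        if version[i] != vi or version[j] != vj:
            continue                                 # stale entry, block changed since push
        if mse > mse_threshold:
            break                                    # remaining merges would break planarity
        blocks[i] = np.vstack([blocks[i], blocks[j]])    # fuse j into i
        del blocks[j]
        version[i] += 1
        for k in blocks:
            if k != i:
                heapq.heappush(heap, (plane_mse(np.vstack([blocks[i], blocks[k]])),
                                      i, k, version[i], version[k]))
    return list(blocks.values())
```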
Step S25. For each segmented plane, compute its center point P_C = [X_C, Y_C, Z_C]^T and its unit normal vector n = [n_x, n_y, n_z]^T, where X_C, Y_C, Z_C are the three-dimensional coordinates of the center point P_C and n_x, n_y, n_z are the components of the normal vector n; the distance from the camera optical center to the plane is then d = -n^T P_C, and the plane feature is expressed as π = [n^T, d]^T.
Step S26. Project the planes in the image into the world coordinate system and match each projected plane against the map planes according to the angle between their normals and the difference between their distances from the origin of the world coordinate system, for example as sketched below.
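A minimal sketch of this matching criterion follows; the angle and distance thresholds are illustrative assumptions, not values from the original.

```python
import numpy as np

def match_plane(pi_obs_w, map_planes, angle_thresh_deg=10.0, dist_thresh=0.1):
    """Match an observed plane (already expressed in world coordinates) against the map
    planes using the normal angle and the difference of origin-to-plane distances."""
    n_obs, d_obs = pi_obs_w[:3], pi_obs_w[3]
    best, best_angle = None, angle_thresh_deg
    for idx, pi_map in enumerate(map_planes):
        n_map, d_map = pi_map[:3], pi_map[3]
        angle = np.degrees(np.arccos(np.clip(abs(n_obs @ n_map), -1.0, 1.0)))
        if angle < best_angle and abs(abs(d_obs) - abs(d_map)) < dist_thresh:
            best, best_angle = idx, angle
    return best                      # index of the matched map plane, or None
```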
Step S3. Detect whether a Manhattan world coordinate system exists in the image. If it exists and has been observed in a historical key frame in the Manhattan world map, compute the camera rotation matrix from the extracted Manhattan world coordinate system and track the point, line and plane features to optimize the camera translation; otherwise track the point, line and plane features to optimize the full camera pose. The specific steps are:

Step S31. Traverse the normal vectors corresponding to the planes extracted in step S24 to find whether there is a combination of three mutually orthogonal normal vectors, or a combination of two mutually orthogonal normal vectors whose third direction is obtained as the cross product of the two mutually orthogonal normal vectors.
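The search over plane normals can be sketched as below; the orthogonality tolerance of 5 degrees is an assumption for illustration.

```python
import numpy as np

def find_orthogonal_normals(normals, tol_deg=5.0):
    """Search the detected plane normals for three mutually orthogonal directions;
    if only two orthogonal normals are found, complete the triple by a cross product."""
    cos_tol = np.cos(np.radians(90.0 - tol_deg))       # |n_a . n_b| below this => orthogonal
    n = len(normals)
    for a in range(n):
        for b in range(a + 1, n):
            if abs(normals[a] @ normals[b]) > cos_tol:
                continue                                # pair not orthogonal enough
            # try to complete the triple with a third detected normal
            for c in range(n):
                if c in (a, b):
                    continue
                if (abs(normals[a] @ normals[c]) < cos_tol and
                        abs(normals[b] @ normals[c]) < cos_tol):
                    return normals[a], normals[b], normals[c]
            # otherwise synthesise the third axis from the cross product
            n3 = np.cross(normals[a], normals[b])
            return normals[a], normals[b], n3 / np.linalg.norm(n3)
    return None
```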
Step S32. If such a combination exists, query the Manhattan world map to check whether a historical key frame contains three mutually orthogonal planes consistent with this set of orthogonal normal vectors. If so, they constitute a Manhattan world coordinate system; take the Manhattan world coordinate system formed by the normal vector combination whose corresponding planes contain the largest total number of points, and compute the rotation matrix from this Manhattan world coordinate system to the camera coordinate system of the current frame as:

R^{c_i}_{m_k} = [n_1, n_2, n_3]

where n_1, n_2, n_3 are the mutually orthogonal normal vectors that were found, c_i is the id of the current frame, and m_k is the id of the Manhattan world coordinate system. Because of sensor noise, n_1, n_2, n_3 are not exactly orthogonal and R^{c_i}_{m_k} is not an orthogonal matrix, so an SVD decomposition is applied to R^{c_i}_{m_k} to obtain the orthogonalized rotation matrix \hat{R}^{c_i}_{m_k} from the Manhattan world coordinate system to the camera coordinate system. The rotation matrix from the current camera coordinate system to the world coordinate system is then:

R^{w}_{c_i} = (R^{c_j}_{w})^T · R^{c_j}_{m_k} · (\hat{R}^{c_i}_{m_k})^T

where c_j is the id of the historical key frame that was found, R^{c_j}_{m_k} is the rotation matrix from the Manhattan world coordinate system m_k to the camera coordinate system at frame c_j, and R^{c_j}_{w} is the rotation matrix from the world coordinate system to the camera coordinate system at frame c_j.
The point, line and plane features are then tracked to optimize the camera translation t, with the translation error model e_t:

e_t = Σ Φ_p(p) · ρ_p(e_p^T Λ_p e_p) + Σ Φ_l(l) · ρ_l(e_l^T Λ_l e_l) + Σ Φ_π(π) · ρ_π(e_π^T Λ_π e_π)

where e_p, e_l, e_π are the reprojection errors of the point, line and plane features respectively, with the specific forms:

e_p = p - (K R_cw P_w + t_cw)
e_l = l^T (K R_cw P_L + t_cw)
e_π = q(π_c) - q(T_cw^{-T} π_w)

Here K is the camera intrinsic matrix, R_cw is the rotation matrix from the world coordinate system to the camera coordinate system, t_cw is the translation from the world coordinate system to the camera coordinate system, and T_cw is the pose transformation matrix from the world coordinate system to the camera coordinate system; p is the pixel coordinate of a point feature detected in the frame and P_w is the corresponding three-dimensional map point; l is a line feature detected in the frame and L is the corresponding three-dimensional line, with P_L a three-dimensional endpoint of that line; q(π) is the plane feature parameterization adopted by the present invention, with π = [n_x, n_y, n_z, d]^T, where n_x, n_y, n_z are the components of the normal vector of plane π and d is the distance from the camera optical center to plane π; π_c is the plane detected in the frame and π_w is the corresponding map plane. Λ_p, Λ_l, Λ_π are the information matrices of the point, line and plane features, ρ_p, ρ_l, ρ_π are the Huber robust kernel functions of the point, line and plane features, and Φ_p(p), Φ_l(l), Φ_π(π) are the confidence coefficients of the point, line and plane features (their explicit formulas are given as images in the original), in which n_p, n_l, n_π are the numbers of times the corresponding point, line and plane features have been observed, t_p, t_l, t_π are the weight coefficients of the point, line and plane features, level_i is the level of the ORB pyramid in which the point feature lies, α is a weight coefficient with α ∈ [0.5, 1], and θ_i is the angle between the camera viewing ray in frame i and the map line.
In this embodiment the plane feature is represented with the spherical-coordinate parameterization q(π), where π = [n_x, n_y, n_z, d]^T, n_x, n_y, n_z are the components of the normal vector of plane π and d is the distance from the camera optical center to plane π; alternatively, the plane feature can also be represented with the unit-quaternion parameterization or with the closest-point parameterization.
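A sketch of such a spherical-coordinate parameterization and its inverse is given below; the exact angle convention used by the patent is shown only as an image in the original, so the azimuth/elevation choice here is an assumption.

```python
import numpy as np

def plane_to_spherical(pi):
    """Minimal spherical parameterisation q(pi) = [azimuth, elevation, d] of a plane
    pi = [n_x, n_y, n_z, d]^T with unit normal n."""
    n, d = pi[:3], pi[3]
    azimuth = np.arctan2(n[1], n[0])
    elevation = np.arcsin(np.clip(n[2], -1.0, 1.0))
    return np.array([azimuth, elevation, d])

def spherical_to_plane(q):
    """Inverse mapping back to the homogeneous plane vector."""
    azimuth, elevation, d = q
    n = np.array([np.cos(elevation) * np.cos(azimuth),
                  np.cos(elevation) * np.sin(azimuth),
                  np.sin(elevation)])
    return np.append(n, d)
```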
Step S33. If the Manhattan world coordinate system sought in step S31 does not exist, or no corresponding normal vector combination has been observed in the historical key frames in the Manhattan world map, then track the point, line and plane features to optimize the full camera pose R, t, with the camera pose error model e_{R,t}:

e_{R,t} = Σ Φ_p(p) · ρ_p(e_p^T Λ_p e_p) + Σ Φ_l(l) · ρ_l(e_l^T Λ_l e_l) + Σ Φ_π(π) · ρ_π(e_π^T Λ_π e_π)

where e_p, e_l, e_π are the reprojection errors of the point, line and plane features, Φ_p(p), Φ_l(l), Φ_π(π) are the confidence coefficients of the point, line and plane features, Λ_p, Λ_l, Λ_π are the information matrices of the point, line and plane features, and ρ_p, ρ_l, ρ_π are the Huber robust kernel functions of the point, line and plane features.
Step S4. Judge whether the current frame is a key frame. If it is, add it to the local map and update the three-dimensional points, three-dimensional lines and the Manhattan world map; jointly optimize the current key frame and its adjacent key frames, optimizing the camera poses and the three-dimensional point and line features, and remove some outliers and redundant key frames. The specific steps are:

Step S41. Judge whether the frame is a key frame; if so, add it to the local map, otherwise return to step S1 and continue with the next image. The judging conditions are: (1) more than 10 frames have been processed since the last key frame was inserted and the local mapping thread is currently idle, in which case the frame is judged to be a key frame; (2) more than 20 frames have been processed since the last key frame was inserted, in which case the frame is judged to be a key frame; (3) the total number of matched point, line and plane features in the current image is not less than 20, otherwise the frame cannot be used as a key frame; (4) the features tracked in the current image amount to less than 90% of the features tracked by the most recent key frame, in which case the frame is judged to be a key frame; (5) a new plane is extracted from the image, in which case the frame is judged to be a key frame.

Step S42. For a newly inserted key frame, first update the covisibility graph and the spanning tree and add the new key frame node; then remove from the local map the point and line features whose three-dimensional features have been observed fewer than three times; for point and line features of the new key frame that have no match, back-project them using the depth information to generate new map points and map lines and insert them into the map; finally, record the association between combinations of perpendicular planes and the current key frame according to the perpendicular plane relations in the image, and update the Manhattan world map.
Step S43. After the complete map update, jointly optimize the current key frame and the key frames associated with it; the main optimization variables are the camera poses and the three-dimensional feature parameters of the related key frames, while the plane parameters are not optimized here. After 10 optimization iterations, outliers whose reprojection error is too large are removed and the iterative optimization is continued; this is repeated 4 times so as to remove as many outliers as possible and to optimize the camera poses R, t and the three-dimensional feature information P, L to the greatest extent (an illustrative sketch of this loop is given after step S44). The reprojection error e used in the optimization is:

e = Σ Φ_p(p) · ρ_p(e_p^T Λ_p e_p) + Σ Φ_l(l) · ρ_l(e_l^T Λ_l e_l)

where e_p and e_l are the reprojection errors of the point and line features, Φ_p(p) and Φ_l(l) are the confidence coefficients of the point and line features, Λ_p and Λ_l are the information matrices of the point and line features, and ρ_p and ρ_l are the Huber robust kernel functions of the point and line features.

Step S44. Remove key frames whose features overlap with those of other key frames by more than 90%, so as to keep the covisibility graph compact.
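The outer optimize-and-reject loop of step S43 can be sketched as below. This is only an illustration under stated assumptions: it uses a generic SciPy solver rather than the graph optimizer of an actual SLAM system, each observation is represented by a hypothetical residual callable, and the chi-square style threshold is a placeholder.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_with_outlier_rejection(residual_fns, x0, rounds=4, inner_evals=10,
                                  outlier_thresh=5.991):
    """residual_fns: list of callables, one per observation, each mapping the packed
    variable vector x to that observation's residual vector (point or line term).
    After each optimisation round, observations whose squared residual exceeds the
    threshold are disabled and the optimisation is repeated."""
    active = np.ones(len(residual_fns), dtype=bool)
    x = np.asarray(x0, dtype=float)

    def stacked(xv):
        # residuals of the observations still enabled
        return np.concatenate([f(xv) for f, on in zip(residual_fns, active) if on])

    for _ in range(rounds):
        if not active.any():
            break
        sol = least_squares(stacked, x, loss="huber", max_nfev=inner_evals)
        x = sol.x
        sq = np.array([float(f(x) @ f(x)) for f in residual_fns])
        active = sq < outlier_thresh          # keep only inliers for the next round
    return x, active
```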
Step S5. Perform loop closure detection and global optimization on the key frames. The specific steps are:

Step S51. Based on a point-line feature dictionary model, first look up the word types in the dictionary K-ary tree for the descriptors of all point and line features of the current key frame and compute the weight of each word, obtaining the word vector of the current key frame.
Step S52. Compute the preselected similarity s between the word vector of the current key frame and the word vectors of the other key frames, and select the key frame with the highest similarity that also exceeds a set threshold as the loop closure frame of the current key frame. The preselected similarity between the word vector of the current key frame and that of another key frame is:

s(v_c, v_o) [formula given as an image in the original]

where v_c is the word vector of the current key frame and v_o is the word vector of another key frame. The word vector v has the specific form:

v = {(w_1, η_1), (w_2, η_2), …, (w_k, η_k)}

where w_i is the i-th word in the visual dictionary and η_i is the word weight of w_i, computed as:

η_i = (n_i / n) · log(N / N_i)

where n_i is the number of features in the image belonging to word w_i, n is the total number of point and line features in the image, N is the number of all features in the key frame database, and N_i is the total number of point and line features in the database belonging to word w_i.
Step S53. Build the pose graph of all key frames inside the loop, in which the vertices are the poses of the key frames and the edges are the co-visible point or line features between key frames, and perform global pose graph optimization.
Step S6. Output the results: save the poses, draw the path, and save and display the map synchronously in real time. Figure 2 shows a comparison of the trajectories estimated by the present invention on the TUM fr3_str_notext_far data sequence; Figures 3(a), 3(b) and 3(c) show a comparison of the x, y, z translations estimated by the present invention on the TUM fr3_str_notext_far data sequence; Figures 4(a), 4(b) and 4(c) show a comparison of the x, y, z Euler-angle attitudes estimated by the present invention on the TUM fr3_str_notext_far data sequence; Figure 5 shows the sparse three-dimensional feature map reconstructed by the present invention on the TUM fr3_str_notext_far data sequence; and Figure 6 shows the planar mesh map reconstructed by the present invention on the TUM fr3_str_notext_far data sequence.
Embodiment 2

A structured scene visual SLAM method based on point, line and plane features includes the following steps:

S1. Input a color image, extract point features and line features from the color image and perform feature matching; in this embodiment the EDLine algorithm is used to extract the line features, but the LSD algorithm can also be used, and the fragmented line segments extracted by the LSD algorithm are likewise merged using endpoint distance, angle and descriptor information as the screening criteria.

S2. Input the depth image, convert it into an organized point cloud structure, extract the image planes, and then match the extracted image planes with the map planes; in this embodiment, besides using the angle between the normals and the difference between the distances from the origin of the world coordinate system to the planes as the plane matching criterion, the angle between the normals together with whether the two planes have an overlapping region can also be used as the matching criterion.

S3. Detect whether a Manhattan world coordinate system exists in the image. If it exists and has been observed in a historical key frame in the Manhattan world map, estimate the camera attitude from the extracted Manhattan world coordinate system and track the point, line and plane features to optimize the camera translation; otherwise track the point, line and plane features to optimize the full camera pose.

S4. Judge whether the current frame is a key frame. If it is, add it to the local map and update the three-dimensional points, three-dimensional lines and the Manhattan world map; jointly optimize the current key frame and its adjacent key frames, optimizing the camera poses and the three-dimensional point and line features, and remove some outliers and redundant key frames.

S5. Perform loop closure detection on the key frames. If a loop closure is detected, close the loop and perform global optimization to reduce the accumulated error.
Embodiment 3

A structured scene visual SLAM method based on point, line and plane features includes the following steps:

S1. Input a color image, extract point features and line features from the color image and perform feature matching; in this embodiment the LSD algorithm is used to extract the line features, and the fragmented line segments extracted by the LSD algorithm are likewise merged using endpoint distance, angle and descriptor information as the screening criteria.

S2. Input the depth image, convert it into an organized point cloud structure, extract the image planes, and then match the extracted image planes with the map planes; in this embodiment, besides using the angle between the normals and the difference between the distances from the origin of the world coordinate system to the planes as the plane matching criterion, the angle between the normals together with whether the two planes have an overlapping region can also be used as the matching criterion.

S3. Detect whether a Manhattan world coordinate system exists in the image. If it exists and has been observed in a historical key frame in the Manhattan world map, estimate the camera attitude from the extracted Manhattan world coordinate system and track the point, line and plane features to optimize the camera translation; otherwise track the point, line and plane features to optimize the full camera pose.

S4. Judge whether the current frame is a key frame. If it is, add it to the local map and update the three-dimensional points, three-dimensional lines and the Manhattan world map; jointly optimize the current key frame and its adjacent key frames, optimizing the camera poses and the three-dimensional point and line features, and remove some outliers and redundant key frames.

S5. Perform loop closure detection on the key frames. If a loop closure is detected, close the loop and perform global optimization to reduce the accumulated error.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications according to the present invention, and all such corresponding changes and modifications fall within the protection scope of the appended claims of the present invention.

Claims (10)

  1. A structured scene visual SLAM method based on point, line and plane features, characterized by including the following steps:

    S1. inputting a color image, extracting point features and line features from the color image and performing feature matching;

    S2. inputting the depth image, converting it into an organized point cloud structure, extracting the image planes, and then matching the extracted image planes with the map planes;

    S3. detecting whether a Manhattan world coordinate system exists in the image and, if it exists and has been observed in a historical key frame in the Manhattan world map, estimating the camera attitude from the extracted Manhattan world coordinate system and tracking the point, line and plane features to optimize the camera translation, otherwise tracking the point, line and plane features to optimize the full camera pose;

    S4. judging whether the current frame is a key frame and, if it is, adding it to the local map and updating the three-dimensional points, three-dimensional lines and the Manhattan world map, jointly optimizing the current key frame and its adjacent key frames, optimizing the camera poses and the three-dimensional point and line features, and removing some outliers and redundant key frames;

    S5. performing loop closure detection on the key frames and, if a loop closure is detected, closing the loop and performing global optimization to reduce the accumulated error.
  2. The structured scene visual SLAM method based on point, line and plane features according to claim 1, characterized in that step S1 specifically comprises: inputting a color image; first extracting point features with the ORB algorithm and matching them by their descriptors, then removing mismatches between point features with the PROSAC method; afterwards extracting line features with the EDLine algorithm, merging fragmented line segments using distance, angle and descriptor information as screening criteria, and matching the line features by their LBD descriptors.
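A minimal sketch of the point-feature part of this front end using OpenCV is given below. The RANSAC fundamental-matrix check stands in for the PROSAC rejection step named in the claim (OpenCV does not expose PROSAC for this purpose), and the feature count and thresholds are illustrative assumptions, not the patent's settings.

```python
import cv2
import numpy as np

def match_orb_points(img1, img2, n_features=1000):
    """Extract ORB point features, match by descriptor, and reject outliers.

    RANSAC on the fundamental matrix is used here in place of the PROSAC
    step described in the claim; all thresholds are illustrative only.
    """
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Hamming-distance brute-force matching with cross-check
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Geometric verification: keep only matches consistent with epipolar geometry
    _, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    inliers = [m for m, ok in zip(matches, mask.ravel()) if ok]
    return kp1, kp2, inliers
```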
  3. The structured scene visual SLAM method based on point, line and plane features according to claim 1, characterized in that step S2 specifically comprises: first computing the corresponding 3D point for every pixel with valid depth to form an organized point cloud structure; then merging small planar patches into larger planes with a hierarchical clustering algorithm and refining the segmentation of the merged planes; finally, for each segmented plane, computing the 3D center point P_C = [X_C Y_C Z_C]^T of the plane and its unit normal vector n = [n_x n_y n_z]^T, where X_C, Y_C, Z_C are the 3D coordinates of the center point P_C and n_x, n_y, n_z are the components of the normal vector n; the distance from the camera optical center to the plane is then d = −n^T P_C (so that n^T P + d = 0 for points P on the plane), and the plane feature is expressed as π = [n^T d]^T; the planes in the image are then projected into the world coordinate system and matched against the map planes according to the angle between their normals and the difference between their distances from the world origin.
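The plane-parameter computation and the normal-angle/origin-distance matching test can be sketched in NumPy as follows. The thresholds (10° and 0.1 m) are illustrative assumptions, not values taken from the patent, and the PCA-based normal fit stands in for whatever plane fitting the hierarchical clustering stage uses.

```python
import numpy as np

def plane_parameters(points):
    """Fit pi = [n, d] to a patch of 3D points (N x 3).

    The normal is the smallest principal component of the patch, and d is
    chosen so that n . P + d = 0 for points P on the plane.
    """
    center = points.mean(axis=0)                  # plane center P_C
    _, _, vt = np.linalg.svd(points - center)     # PCA via SVD
    n = vt[-1] / np.linalg.norm(vt[-1])           # unit normal
    d = -float(n @ center)                        # signed distance term
    return n, d

def planes_match(n1, d1, n2, d2, max_angle_deg=10.0, max_dist=0.1):
    """Associate two planes (assumed expressed in the same frame) by the
    angle between their normals and the difference of their origin distances."""
    cos_angle = abs(float(n1 @ n2))               # ignore normal orientation
    angle_ok = cos_angle > np.cos(np.deg2rad(max_angle_deg))
    dist_ok = abs(abs(d1) - abs(d2)) < max_dist
    return angle_ok and dist_ok
```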
  4. The structured scene visual SLAM method based on point, line and plane features according to claim 1, characterized in that step S3 specifically comprises:
    S31. Traversing the normal vectors of the planes extracted in step S2 to check whether a combination of three mutually orthogonal normal vectors exists; if only two mutually orthogonal normal vectors are found, the third normal vector is obtained as the cross product of the two orthogonal normal vectors;
    S32. If the normal vector combination sought in step S31 exists, querying the Manhattan world map to check whether mutually orthogonal planes consistent with this set of orthogonal normal vectors have been observed in a historical keyframe; if so, they constitute a Manhattan world coordinate system. Taking the Manhattan world coordinate system built from the normal vector combination whose corresponding planes contain the largest total number of points, the rotation matrix from that Manhattan world coordinate system to the camera coordinate system of the current frame is
    R_{c_i m_k} = [n_1  n_2  n_3]
    where n_1, n_2, n_3 are the mutually orthogonal normal vectors found, c_i is the id of the current frame and m_k is the id of the Manhattan world coordinate system; applying an SVD decomposition to R_{c_i m_k} yields the orthogonalized rotation matrix R̂_{c_i m_k} from the Manhattan world coordinate system to the camera coordinate system, and the rotation matrix from the camera coordinate system to the world coordinate system is then
    R_{w c_i} = (R_{c_j w})^T R_{c_j m_k} (R̂_{c_i m_k})^T
    where c_j is the id of the historical keyframe found, R_{c_j m_k} is the rotation matrix from the Manhattan world coordinate system m_k to the camera coordinate system at frame c_j, and R_{c_j w} is the rotation matrix from the world coordinate system to the camera coordinate system at frame c_j;
    point features, line features and plane features are then tracked to optimize the camera translation t; the translation error model e_t combines the reprojection errors of the tracked features, weighted by their confidence coefficients, information matrices and Huber robust kernels:
    e_t = Σ_p Φ_p(p) ρ_p(e_p^T Λ_p e_p) + Σ_l Φ_l(l) ρ_l(e_l^T Λ_l e_l) + Σ_π Φ_π(π) ρ_π(e_π^T Λ_π e_π)
    where e_p, e_l and e_π are the reprojection errors of the point, line and plane features respectively, with
    e_p = p − (K R_cw P_w + t_cw)
    e_l = l^T (K R_cw P_L + t_cw)
    and e_π comparing, in the plane feature parameter expression form, the plane π_c detected in the current frame with the associated map plane π_w transformed into the camera frame by the pose T_cw;
    here K is the camera intrinsic matrix, R_cw is the rotation matrix from the world coordinate system to the camera coordinate system, t_cw is the translation from the world coordinate system to the camera coordinate system, and T_cw is the pose transformation matrix from the world coordinate system to the camera coordinate system; p is the pixel coordinate of a point feature detected in the frame and P_w is the 3D point corresponding to that point feature; l is a line feature detected in the image, L is the 3D line corresponding to that line feature, and P_L is a 3D endpoint of the 3D line; the plane feature parameters are expressed as π = [n_x n_y n_z d]^T, where n_x, n_y, n_z are the components of the normal vector of the plane π and d is the distance from the camera optical center to the plane π; π_c is the plane detected in the frame and π_w is the map plane associated with it; Λ_p, Λ_l, Λ_π are the information matrices of the point, line and plane features, ρ_p, ρ_l, ρ_π are the Huber robust kernel functions of the point, line and plane features, and Φ_p(p), Φ_l(l), Φ_π(π) are the confidence coefficients of the point, line and plane features, which depend on n_p, n_l, n_π, the numbers of times the corresponding point, line and plane have been observed, on the weight coefficients t_p, t_l, t_π of the point, line and plane features, on level_i, the level of the ORB pyramid at which the point feature was extracted, on the weight coefficient α ∈ [0.5, 1], and on θ_i, the angle between the camera viewing ray of frame i and the map line;
    S33. If the Manhattan world coordinate system sought in step S31 does not exist, or no corresponding normal vector combination has been observed in the historical keyframes of the Manhattan world map, point features, line features and plane features are tracked to optimize the camera pose R, t; the camera pose error model e_{R,t} takes the same form as e_t but is minimized over both the rotation R and the translation t.
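A small NumPy sketch of the Manhattan-frame rotation step in S32 is given below. Projecting the stacked normal matrix onto the nearest rotation via SVD (U Vᵀ, with a determinant sign correction) is one standard way to realize the orthogonalization the claim describes; it is shown here as an illustrative assumption, not as the patent's exact procedure.

```python
import numpy as np

def manhattan_rotation(n1, n2, n3):
    """Build the (approximately orthogonal) Manhattan-to-camera matrix from
    three plane normals and project it onto SO(3) with an SVD."""
    r_raw = np.column_stack([n1, n2, n3])   # R_{c_i m_k} before orthogonalization
    u, _, vt = np.linalg.svd(r_raw)
    r_hat = u @ vt                          # nearest orthogonal matrix
    if np.linalg.det(r_hat) < 0:            # keep a proper rotation (det = +1)
        u[:, -1] *= -1
        r_hat = u @ vt
    return r_hat

def camera_to_world_rotation(r_cj_w, r_cj_mk, r_hat_ci_mk):
    """Chain the historical keyframe's rotations to obtain the current
    camera-to-world rotation: R_{w c_i} = R_{c_j w}^T R_{c_j m_k} R̂_{c_i m_k}^T."""
    return r_cj_w.T @ r_cj_mk @ r_hat_ci_mk.T
```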
  5. The structured scene visual SLAM method based on point, line and plane features according to claim 1, characterized in that step S4 specifically comprises:
    S41. Deciding whether the frame is set as a keyframe according to the number of matched point and line feature pairs, whether a new plane has been detected, the feature tracking status and the keyframe situation in the local map; if it is a keyframe, adding it to the local map, otherwise returning to step S1;
    S42. For the newly inserted keyframe, updating the covisibility graph and the spanning tree and adding the new keyframe node; removing, based on observation consistency, the point, line and plane features that have not been continuously and reliably observed since their creation; back-projecting, using the depth information, the point and line features of the new keyframe that have no match, so as to generate new map points and map lines and insert them into the map; recording, according to the perpendicular relations between planes in the image, the association between vertical plane combinations and the current keyframe, and updating the Manhattan world map;
    S43. After the complete map update, jointly optimizing the current keyframe and the keyframes associated with it, and removing outliers during the optimization so as to refine the camera poses as far as possible; the optimization variables are the camera poses R, t of the keyframes involved and the 3D feature parameters P, L, and the reprojection error e used in the optimization combines the point and line terms weighted by their confidence coefficients, information matrices and Huber robust kernels:
    e = Σ_p Φ_p(p) ρ_p(e_p^T Λ_p e_p) + Σ_l Φ_l(l) ρ_l(e_l^T Λ_l e_l)
    where e_p and e_l are the reprojection errors of the point and line features, Φ_p(p) and Φ_l(l) are their confidence coefficients, Λ_p and Λ_l are the information matrices of the point and line features, and ρ_p and ρ_l are the Huber robust kernel functions of the point and line features;
    S44. Removing keyframes whose feature overlap with other keyframes exceeds 90%.
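As an illustration of the robustified, confidence-weighted residual terms that claims 4 and 5 describe, the helper below evaluates one point term Φ_p(p)·ρ_p(e_pᵀ Λ_p e_p) with a Huber kernel. The Huber threshold is an illustrative choice, and the confidence value and information matrix are passed in as inputs because their exact forms are defined by the patent's own formulas.

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber robust kernel applied to a squared error x (delta is illustrative)."""
    return x if x <= delta**2 else 2.0 * delta * np.sqrt(x) - delta**2

def point_term(e_p, info_p, confidence_p):
    """One confidence-weighted, robustified point residual term:
    Phi_p(p) * rho_p(e_p^T Lambda_p e_p)."""
    quad = float(e_p @ info_p @ e_p)   # e_p^T Lambda_p e_p
    return confidence_p * huber(quad)
```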
  6. The structured scene visual SLAM method based on point, line and plane features according to claim 5, characterized in that the conditions for judging a keyframe are specifically:
    (1) if more than 10 frames have been processed since the last keyframe was inserted and the local mapping thread is currently idle, the frame is judged to be a keyframe;
    (2) if more than 20 frames have been processed since the last keyframe was inserted, the frame is judged to be a keyframe;
    (3) the total number of matched point, line and plane features in the current image must be no less than 20, otherwise the frame cannot be used as a keyframe;
    (4) if the features tracked in the current image are fewer than 90% of the features tracked by the most recent keyframe, the frame is judged to be a keyframe;
    (5) if a new plane is extracted from the image, the frame is judged to be a keyframe.
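The keyframe test in claim 6 is simple enough to sketch directly; the helper below is an illustrative composition of the five conditions, with hypothetical field names standing in for the tracker's internal state.

```python
from dataclasses import dataclass

@dataclass
class TrackingState:
    frames_since_last_keyframe: int   # frames processed since last KF insertion
    mapping_thread_idle: bool         # local mapping thread has no pending work
    matched_features: int             # matched points + lines + planes in this frame
    tracked_features: int             # features tracked in the current image
    reference_kf_features: int        # features tracked by the most recent keyframe
    new_plane_detected: bool          # a plane not yet in the map was extracted

def is_keyframe(s: TrackingState) -> bool:
    # Condition (3) is a hard requirement: too few matches disqualifies the frame.
    if s.matched_features < 20:
        return False
    cond1 = s.frames_since_last_keyframe > 10 and s.mapping_thread_idle
    cond2 = s.frames_since_last_keyframe > 20
    cond4 = s.tracked_features < 0.9 * s.reference_kf_features
    cond5 = s.new_plane_detected
    return cond1 or cond2 or cond4 or cond5
```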
  7. The structured scene visual SLAM method based on point, line and plane features according to claim 1, characterized in that step S5 specifically comprises: based on a point-line feature dictionary model, first looking up, for the descriptors of all point and line features of the current keyframe, the corresponding word types in the dictionary K-ary tree, computing the weight of each word, and obtaining the word vector of the current keyframe; computing the preselected similarity between the word vector of the current keyframe and the word vectors of other keyframes to obtain the frame-to-frame similarity, and selecting the loop closure frame of the current keyframe according to this similarity; constructing the pose graph of all keyframes inside the loop and performing global pose graph optimization to reduce the accumulated error.
  8. The structured scene visual SLAM method based on point, line and plane features according to claim 1, characterized in that the preselected similarity is computed between the word vector v_c of the current keyframe and the word vector v_o of another keyframe, where the word vector v has the specific form
    v = {(w_1, η_1), (w_2, η_2), …, (w_i, η_i)}
    where w_i is the i-th word in the visual dictionary and η_i is the word weight of w_i, computed as
    η_i = (n_i / n) · log(N / N_i)
    where n_i is the number of features in the image belonging to word w_i, n is the total number of point and line features in the image, N is the number of all features in the keyframe database, and N_i is the total number of point and line features in the database belonging to word w_i.
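A brief sketch of the bag-of-words bookkeeping in claims 7 and 8 is shown below. The TF-IDF weight follows the variables the claim defines; the L1-normalized similarity score is a DBoW-style illustrative choice and should not be read as the patent's own preselected similarity formula.

```python
import numpy as np

def word_weight(n_i, n, N, N_i):
    """TF-IDF weight eta_i for word w_i: term frequency in the image times
    inverse document frequency over the keyframe database."""
    return (n_i / n) * np.log(N / N_i)

def bow_similarity(v_c, v_o):
    """L1 similarity between two bag-of-words vectors (dicts word_id -> weight).

    This DBoW-style score is an illustrative assumption; the patent defines
    its own preselected similarity between v_c and v_o.
    """
    def normalize(v):
        s = sum(abs(w) for w in v.values())
        return {k: w / s for k, w in v.items()} if s > 0 else v
    a, b = normalize(v_c), normalize(v_o)
    words = set(a) | set(b)
    l1 = sum(abs(a.get(w, 0.0) - b.get(w, 0.0)) for w in words)
    return 1.0 - 0.5 * l1
```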
  9. The structured scene visual SLAM method based on point, line and plane features according to claim 1, characterized in that step S1 specifically comprises: inputting a color image; first extracting point features with the ORB algorithm and matching them by their descriptors, then removing mismatches between point features with the RANSAC method; afterwards extracting line features with the EDLine algorithm, merging fragmented line segments using distance, angle and descriptor information as screening criteria, and matching the line features by their LBD descriptors.
  10. The structured scene visual SLAM method based on point, line and plane features according to claim 1, characterized in that step S1 specifically comprises: inputting a color image; first extracting line features with the LSD algorithm, matching point features by their descriptors, and then removing mismatches between point features with the RANSAC method; afterwards extracting line features with the EDLine algorithm, merging fragmented line segments using distance, angle and descriptor information as screening criteria, and matching the line features by their LBD descriptors.
PCT/CN2022/128826 2022-04-02 2022-10-31 Structured scene visual slam method based on point line surface features WO2023184968A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210346890.6A CN114862949A (en) 2022-04-02 2022-04-02 Structured scene vision SLAM method based on point, line and surface characteristics
CN202210346890.6 2022-04-02

Publications (1)

Publication Number Publication Date
WO2023184968A1 (en)

Family

ID=82629058

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/128826 WO2023184968A1 (en) 2022-04-02 2022-10-31 Structured scene visual slam method based on point line surface features

Country Status (2)

Country Link
CN (1) CN114862949A (en)
WO (1) WO2023184968A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114862949A (en) * 2022-04-02 2022-08-05 华南理工大学 Structured scene vision SLAM method based on point, line and surface characteristics
CN115376051B (en) * 2022-10-25 2023-03-24 杭州华橙软件技术有限公司 Key frame management method and device, SLAM method and electronic equipment
CN115830110B (en) * 2022-10-26 2024-01-02 北京城市网邻信息技术有限公司 Instant positioning and map construction method and device, terminal equipment and storage medium
CN115578620B (en) * 2022-10-28 2023-07-18 北京理工大学 Point-line-plane multidimensional feature-visible light fusion slam method
CN115601434B (en) * 2022-12-12 2023-03-07 安徽蔚来智驾科技有限公司 Loop detection method, computer device, computer-readable storage medium and vehicle
CN116468786B (en) * 2022-12-16 2023-12-26 中国海洋大学 Semantic SLAM method based on point-line combination and oriented to dynamic environment
CN117611677A (en) * 2024-01-23 2024-02-27 北京理工大学 Robot positioning method based on target detection and structural characteristics

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114777A1 (en) * 2017-10-18 2019-04-18 Tata Consultancy Services Limited Systems and methods for edge points based monocular visual slam
CN108682027A (en) * 2018-05-11 2018-10-19 北京华捷艾米科技有限公司 VSLAM realization method and systems based on point, line Fusion Features
CN114241050A (en) * 2021-12-20 2022-03-25 东南大学 Camera pose optimization method based on Manhattan world hypothesis and factor graph
CN114862949A (en) * 2022-04-02 2022-08-05 华南理工大学 Structured scene vision SLAM method based on point, line and surface characteristics

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649536A (en) * 2024-01-29 2024-03-05 华东交通大学 Visual synchronous positioning and mapping method for fusing dot line and line structural features
CN117649536B (en) * 2024-01-29 2024-04-16 华东交通大学 Visual synchronous positioning and mapping method for fusing dot line and line structural features
CN117649495A (en) * 2024-01-30 2024-03-05 山东大学 Indoor three-dimensional point cloud map generation method and system based on point cloud descriptor matching

Also Published As

Publication number Publication date
CN114862949A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
WO2023184968A1 (en) Structured scene visual slam method based on point line surface features
CN108090958B (en) Robot synchronous positioning and map building method and system
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
US20230260151A1 (en) Simultaneous Localization and Mapping Method, Device, System and Storage Medium
CN107025668B (en) Design method of visual odometer based on depth camera
CN112132893B (en) Visual SLAM method suitable for indoor dynamic environment
WO2019057179A1 (en) Visual slam method and apparatus based on point and line characteristic
Huang Review on LiDAR-based SLAM techniques
CN103646391A (en) Real-time camera tracking method for dynamically-changed scene
CN107818598B (en) Three-dimensional point cloud map fusion method based on visual correction
CN113658337B (en) Multi-mode odometer method based on rut lines
Zhang et al. Hand-held monocular SLAM based on line segments
CN112085849A (en) Real-time iterative three-dimensional modeling method and system based on aerial video stream and readable medium
CN111998862A (en) Dense binocular SLAM method based on BNN
CN116449384A (en) Radar inertial tight coupling positioning mapping method based on solid-state laser radar
Koch et al. Wide-area egomotion estimation from known 3d structure
Zhu et al. A review of 6d object pose estimation
Zhang et al. Stereo plane slam based on intersecting lines
CN115965686A (en) Semi-direct visual positioning method integrating point-line characteristics
Zhang LILO: A Novel Lidar–IMU SLAM System With Loop Optimization
Zhao et al. Visual SLAM combining lines and structural regularities: Towards robust localization
CN113781525A (en) Three-dimensional target tracking algorithm research based on original CAD model
CN117253003A (en) Indoor RGB-D SLAM method integrating direct method and point-plane characteristic method
WO2023030062A1 (en) Flight control method and apparatus for unmanned aerial vehicle, and device, medium and program
Sun et al. A multisensor-based tightly coupled integrated navigation system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934786

Country of ref document: EP

Kind code of ref document: A1