CN112258600A - Simultaneous positioning and map construction method based on vision and laser radar


Info

Publication number
CN112258600A
CN112258600A
Authority
CN
China
Prior art keywords
point
points
laser
depth information
visual
Prior art date
Legal status
Withdrawn
Application number
CN202011117756.6A
Other languages
Chinese (zh)
Inventor
章国锋
鲍虎军
王宇伟
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202011117756.6A
Publication of CN112258600A
Status: Withdrawn


Classifications

    • G06T11/206 — 2D image generation; drawing from basic elements: drawing of charts or graphs
    • G06T11/006 — 2D image generation; reconstruction from projections: inverse problem, transformation from projection-space into object-space
    • G06T5/30 — Image enhancement or restoration by local operators: erosion or dilatation, e.g. thinning
    • G06T7/73 — Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G01S17/06 — Systems using reflection of electromagnetic waves other than radio waves (e.g. lidar): determining position data of a target
    • G01S17/86 — Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
    • G01S17/89 — Lidar systems specially adapted for mapping or imaging
    • G06T2207/10028 — Image acquisition modality: range image; depth image; 3D point clouds
    • G06T2207/10044 — Image acquisition modality: radar image

Abstract

The invention discloses a simultaneous positioning and map construction method based on vision and laser radar, belonging to the field of SLAM. A laser odometer and a visual odometer are operated simultaneously: the visual odometer helps the laser point cloud to better remove motion distortion, and the distortion-free point cloud is projected onto the image as depth information for the motion calculation of the next frame. With a good initial value, the laser odometer avoids falling into degenerate scenes caused by the limitations of a single sensor, and the added visual information gives the odometer higher positioning accuracy. A loop detection and repositioning module is realized through a visual bag-of-words model, forming a complete visual-laser SLAM system; after a loop is detected, the system performs pose graph optimization according to the loop constraints, which eliminates to a certain extent the accumulated errors of long-term motion, so that high-precision positioning and point cloud map construction can be completed in complex environments.

Description

Simultaneous positioning and map construction method based on vision and laser radar
Technical Field
The invention relates to the field of simultaneous localization and mapping (SLAM), in particular to a simultaneous localization and mapping method based on vision and laser radar.
Background
With the rise of artificial intelligence concepts and the maturation of computer vision related technologies, unmanned devices and robotics are receiving more and more attention. Meanwhile, Simultaneous Localization and Mapping (SLAM) is an indispensable module in the related art of robots. Depending on the sensors used, SLAM technology can be largely classified into camera-based visual SLAM, which can be subdivided into monocular SLAM, binocular SLAM, and RGBD-SLAM, and LiDAR-based laser SLAM, which can be further classified into 2D LiDAR-based laser SLAM and 3D LiDAR-based laser SLAM.
Visual SLAM is a low-cost solution that is currently widely used in augmented reality and robotics. It acquires surrounding environment information with cameras, extracts image features, completes inter-frame data association, solves the pose of each camera frame using multi-view geometry, and finally constructs a three-dimensional map. Because the camera's imaging result is strongly affected by illumination, it is difficult for visual SLAM to operate continuously in environments with unstable lighting conditions. Furthermore, for texture-less and weakly textured regions, tracking loss is likely to occur, since it is difficult to obtain a sufficient number of feature points from visual information alone.
A mature visual SLAM scheme generally consists of a front end and a back end. The front end is mainly responsible for real-time positioning, i.e. the Visual Odometer (VO) function. The back end is an optimization module: it typically performs further processing of the frame data passed from the front end, such as extracting key frames, and then optimizes the camera poses and map points of the key frames; in addition, a loop detection and relocation module is maintained to preserve the global consistency of the map and to ensure that the system can relocalize after a tracking loss.
Academic research on visual SLAM has continued for more than ten years. In 2007 the MonoSLAM system was proposed; it is based on a filter framework and builds a probabilistic model that predicts the posterior distribution of the system state through Kalman filtering. The system adopts a serial architecture and alternately updates the camera pose and the map point information in a single thread. Later, filter-based SLAM schemes were gradually replaced by optimization-based visual SLAM schemes; the earliest optimization-based visual SLAM was PTAM, proposed by Klein et al. Its biggest difference from MonoSLAM is the introduction of Bundle Adjustment (BA): the camera poses and map points are optimized simultaneously by nonlinear least squares so as to minimize the reprojection error. In addition, PTAM proposed the concept of key frames (keyframes), in which higher-quality images are added to the global optimization as key frames. These innovative concepts were heavily borrowed by later work, making PTAM the prototype of many mature optimization-based visual SLAM schemes. Bundle Adjustment itself became an independent research topic, and researchers have proposed many more efficient and robust BA schemes, such as Segment-Based BA, EIBA, and ICE-BA, which was later applied in visual-inertial SLAM systems. With the great improvement of hardware performance, optimization-based schemes have surpassed filter-based schemes in both accuracy and speed. ORB-SLAM, proposed by Mur-Artal et al. in 2015, is one of the most advanced SLAM systems at present. One of its most distinctive characteristics is the use of a uniform ORB feature throughout, which greatly improves the robustness of the system compared with prior work. In addition to the tracking thread and the mapping thread introduced in PTAM, ORB-SLAM opens a separate thread for loop detection and relocalization, which not only eliminates the accumulated error of long-term operation but also allows positioning and mapping to continue after tracking is lost, once the system returns to the original trajectory.
Laser SLAM uses a 2D or 3D lidar to measure distances to discrete points by emitting laser pulses and receiving the light reflected by the measured objects, obtaining the structural information of the surrounding environment; the data are returned as point cloud sets. The point clouds are registered online with ICP (Iterative Closest Point) and improved ICP methods (e.g. Point-to-Line ICP) to obtain real-time pose information and construct an environment map. At present, most sweeping robots on the market are based on 2D laser SLAM technology. For positioning and mapping of three-dimensional scenes, multi-line lidars are expensive, but are characterized by a wide viewing angle (the horizontal viewing angle of a common rotating lidar reaches 360 degrees), long measuring range and high precision (generally more than 100 meters, with errors within 2 centimeters), so with good data processing, 3D laser SLAM is generally superior to visual SLAM schemes in positioning accuracy. This is also reflected in the public KITTI data set, where the top 3 schemes in positioning accuracy are all based on lidar.
Lidar SLAM generally performs frame-to-frame or frame-to-local-map pose solution with ICP (Iterative Closest Point) and improved ICP algorithms at its core. Because a single frame of a multi-line lidar contains a large number of points, usually tens of thousands, direct ICP solution is computationally inefficient and, being easily trapped in a local optimum, also yields inaccurate pose estimates. Similar to visual methods, the point cloud data are therefore preprocessed by extracting feature points with high information content, and only loss terms between the extracted feature points are added during ICP so as to improve the computational efficiency. A typical example is LOAM, proposed by Zhang et al. in 2014, which estimates the curvature of each point from the distance relationship between the point and its neighbors on the same scan line of the multi-line lidar, then extracts edge points and face points according to the curvature; the error terms between these features constitute the objective to be optimized. In addition, since the lidar acquires the point cloud continuously while the lidar coordinate system keeps changing during motion, and the lidar itself cannot sense this movement, the lidar point cloud exhibits motion distortion, and the degree of distortion increases with the motion speed. In LOAM, Zhang uses a uniform-motion hypothesis to reduce the influence of motion distortion as much as possible: a timestamp is obtained for each swept point by calculating its deflection angle relative to the rotation center, the true pose of the laser at that moment is interpolated from the uniform motion model, and each point is converted into the same laser coordinate system according to the relative poses, thereby removing the point cloud motion distortion.
Many later 3D laser SLAM approaches borrow, to varying degrees, from the ideas of LOAM. LeGO-LOAM, proposed in 2018, is one of them: it consists of point cloud segmentation, feature extraction, a laser odometer, laser mapping and pose fusion, and its algorithmic idea is somewhat similar to that of RGBD-SLAM. The system receives input from a three-dimensional lidar and outputs a 6-Degree-of-Freedom (DoF) pose estimate. The whole system is divided into five modules. The first is segmentation: a single scanned point cloud is projected onto a fixed-range image for segmentation. The second sends the segmented point cloud to the feature extraction module. Third, the lidar odometer uses the features extracted by the previous module to find the transformation of the robot pose between consecutive scans; this information is finally used in the lidar point cloud mapping. The fifth module fuses the pose estimation results of the lidar odometry and the mapping and outputs the final pose estimate. In the pose estimation, in order to reduce the amount of computation and accelerate the optimization, the 6-DoF optimization is split into two steps: first the vertical offset (z), roll angle (roll) and pitch angle (pitch) are optimized by the constraints between face points, and then the two horizontal offsets (x and y) and the yaw angle (yaw) are optimized by the constraints between edge points. A strong prior, however, is that the point cloud necessarily contains a level, open road surface, so the scheme is more of a solution for specific road scenes and lacks strong generality.
IMLS-SLAM is another pure laser positioning solution. It represents surfaces with an Implicit Moving Least Squares (IMLS) formulation and completes the registration of feature points to the surface after feature screening, achieving high positioning accuracy. Its drawback is obvious: because the amount of computation is too large, real-time positioning cannot be achieved, so IMLS-SLAM is generally used for offline mapping.
The conventional improved ICP methods are prone to degeneracy in scenes with repeated structure (e.g. moving through a long, narrow corridor) or scenes lacking structural information, and loop detection based on point cloud structural information still has no robust solution at present; moreover, laser SLAM and visual SLAM cannot simply be combined directly.
Therefore, how to combine the visual and laser information to further improve the accuracy of positioning and mapping and improve the robustness of the algorithm is still a problem worthy of study.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a simultaneous positioning and map construction method based on vision and laser radar. In the visual odometer, the calibrated extrinsic parameters are used to convert the structural information of the laser point cloud into depth information of the image, so that the visual odometer no longer lacks scale constraints and can estimate accurate displacement information. A visual odometer running alongside the laser odometer is added as a constraint, so that the motion distortion in the laser point cloud is removed and a good initial motion value is provided for the laser odometer, preventing it from falling into local optima in certain specific environments. A loop detection and repositioning module is added at the back end as a further constraint: accumulated errors caused by long-term motion are eliminated through pose graph optimization after loop detection, so that high-precision positioning and point cloud map construction can be completed in complex environments.
In order to achieve the purpose, the invention adopts the following technical scheme:
a vision and laser radar based simultaneous localization and map construction method comprises the following steps:
step 1: simultaneously operating a visual odometer and a laser odometer, wherein the visual odometer acquires image information of the surrounding environment in real time by using a camera to obtain image data; the laser odometer utilizes a laser radar to emit laser and receive reflected light of a peripheral measured object for ranging, and obtains structured information of a peripheral environment to obtain laser point cloud data;
step 2: projecting the laser point cloud to a pixel plane by using an external reference result calibrated by combining a laser radar and a camera, and converting the laser point cloud into depth information of an image; taking the depth information as the scale constraint of the visual odometer, matching visual feature points from a previous frame to a current frame in the image data, and calculating visual motion information;
Step 3: removing the motion distortion of the laser point cloud of the current frame using the visual motion information from the previous frame to the current frame; taking the visual motion information obtained in step 2 as the initial pose, registering the laser point cloud against the local point cloud map, constructing the environment map, and continuously performing loop detection and repositioning with key frames to optimize the key frame poses and map points.
Further, after the laser point cloud is projected to the pixel plane in the step 2, a sparse depth map is generated, and the sparse depth map is completed to obtain depth information of the image; the completion comprises the following steps:
Step 2.1.1: invert the valid depth information in the sparse depth map:

$$\hat{D}_x = \begin{cases} D_{max} - D_x, & D_x \neq 0 \\ 0, & D_x = 0 \end{cases}$$

where $\hat{D}_x$ is the inverted depth value of a visual feature point, $D_{max}$ is a preset maximum depth value, and $D_x$ is the depth value of the visual feature point before inversion; $D_x \neq 0$ indicates that depth information exists, i.e. the depth is valid, and $D_x = 0$ denotes no depth information;
Step 2.1.2: dilate the sparse depth map so that the valid depth values fill the regions without depth information;
Step 2.1.3: fill the regions that still lack depth information after dilation, then apply median filtering and Gaussian filtering, and finally re-invert the depth to recover the true depth information.
Further, in the visual feature point matching process described in step 2, the error terms of two different types of matching points are weighted to form the final reprojection error; specifically:

for a feature point $p_1$ with depth information, it is back-projected into three-dimensional space and re-projected onto the other image plane to obtain a point $\tilde{p}_1$; if the position of the true matching point is $p_1'$, the distance $\|\tilde{p}_1 - p_1'\|$ constitutes the error term of the feature point $p_1$ with depth information;

for a feature point $p_2$ without depth information, its back-projection into three-dimensional space may lie anywhere on the ray $Op_2$, where $O$ is the origin of the three-dimensional space; the ray $Op_2$ is projected onto the other image plane to obtain a straight line, and the distance from the position of the matching point $p_2'$ to this line constitutes the error term of the feature point $p_2$ without depth information;

the final reprojection error is expressed as:

$$T^{*}=\arg\min_{T}\left\{\sum_{i=1}^{n}\left\|\pi\!\left(T\,\pi^{-1}(u_{(1,i)},d_i)\right)-v_{(1,i)}\right\|_2+\lambda\sum_{i=1}^{m}\mathrm{dist}\!\left(v_{(2,i)},\,T\,\phi^{-1}(u_{(2,i)})\right)\right\}$$

where $T^{*}$ denotes the optimum of the error function and $T$ is the relative pose between the two frames; $n$ is the number of feature points with depth information and $m$ the number of feature points without depth information; $\pi$ is the projection function mapping a three-dimensional point to the pixel coordinate system, and $\pi^{-1}$ is the back-projection function that restores a pixel coordinate and its corresponding depth to a three-dimensional point; $u_{(1,i)}$ is the pixel coordinate of the $i$-th point of the first type (points with depth information) in the left view, $d_i$ its depth, and $v_{(1,i)}$ its matching pixel coordinate in the right view; $u_{(2,i)}$ is the pixel coordinate of the $i$-th point of the second type (points without depth information) in the left view and $v_{(2,i)}$ its matching pixel coordinate in the right view; $\phi^{-1}$ is the back-projection function that maps a pixel to a three-dimensional ray; $\|\cdot\|_2$ denotes the Euclidean norm; $t$ is a threshold value; $\lambda$ is a weight coefficient adjusting the confidence between the two types of points; $\mathrm{dist}(\cdot)$ is the point-to-ray distance function.
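For illustration only (not part of the claimed method), the following Python sketch shows one way the two weighted error terms could be evaluated, assuming a pinhole intrinsic matrix K and hypothetical matched pixels; sampling two depths (0.1 and 50.0) along the ray to obtain its image-plane projection is an implementation choice, not taken from the specification.

```python
import numpy as np

def project(K, X):
    """Project a 3D point in the camera frame to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

def backproject(K, u, d):
    """Back-project pixel u with depth d into a 3D point."""
    return d * (np.linalg.inv(K) @ np.array([u[0], u[1], 1.0]))

def reprojection_cost(K, R, t, with_depth, without_depth, lam=1.0):
    """with_depth: list of (u1, d, v1); without_depth: list of (u2, v2)."""
    cost = 0.0
    # Type 1: points with depth -> standard 3D-to-2D reprojection error.
    for u1, d, v1 in with_depth:
        X = R @ backproject(K, u1, d) + t            # transform into the second frame
        cost += np.linalg.norm(project(K, X) - v1)
    # Type 2: points without depth -> distance from the matching point to the
    # image-plane projection of the transformed viewing ray.
    for u2, v2 in without_depth:
        ray = np.linalg.inv(K) @ np.array([u2[0], u2[1], 1.0])
        p_near = project(K, R @ (0.1 * ray) + t)     # two samples along the ray
        p_far = project(K, R @ (50.0 * ray) + t)
        line = p_far - p_near
        line /= np.linalg.norm(line)
        diff = v2 - p_near
        cost += lam * abs(diff[0] * line[1] - diff[1] * line[0])  # point-to-line distance
    return cost
```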
Further, the calculating of the visual motion information in step 2 specifically includes:
step 2.2.1: extracting and matching the characteristics of the previous frame image and the current frame image;
step 2.2.2: recovering the depth of the matching feature points of the previous frame by using the previous frame image and the laser point cloud of the previous frame after the motion distortion is removed;
step 2.2.3: counting the number of feature points with depth information in the previous frame, if the number of feature points is less than a threshold value, returning error information, and using a uniform motion hypothesis; otherwise, carrying out PnP calculation on the feature points with depth information in the previous frame and the feature points without depth information in the current frame to obtain visual motion information from the previous frame to the current frame;
step 2.2.4: and (5) optimizing the reprojection error to obtain the optimal pose estimation.
Further, the step 3 of removing the motion distortion of the laser point cloud of the current frame by using the visual motion information from the previous frame to the current frame specifically includes:
step 3.1: for each three-dimensional point in the current frame laser point cloud, determining a horizontal deflection angle theta of each three-dimensional point relative to the initial direction;
step 3.2: performing linear interpolation between a previous frame and a current frame according to the horizontal deflection angle theta, and projecting distorted point clouds to the positions where motion distortion is removed according to the result of the linear interpolation;
Step 3.3: further optimize the result of step 3.2 by extracting edge point features and face point features for matching and calculating the error terms, wherein the error term of an edge point feature is:

$$d_{\varepsilon} = \frac{\left|\left(T X^{L}_{(k,i)} - X^{L}_{(k-1,j)}\right)\times\left(T X^{L}_{(k,i)} - X^{L}_{(k-1,l)}\right)\right|}{\left|X^{L}_{(k-1,j)} - X^{L}_{(k-1,l)}\right|}$$

where $X^{L}_{(k,i)}$, $X^{L}_{(k-1,j)}$ and $X^{L}_{(k-1,l)}$ are the three-dimensional coordinates of the points $i$, $j$ and $l$ in the laser coordinate system $\{L\}$; the subscripts $k$ and $k-1$ denote the current frame and the previous frame; $T$ denotes the relative pose between the two frames;

the error term of a face point feature is:

$$d_{\mathcal{H}} = \frac{\left|\left(T X^{L}_{(k,i)} - X^{L}_{(k-1,j)}\right)\cdot\left(\left(X^{L}_{(k-1,j)} - X^{L}_{(k-1,l)}\right)\times\left(X^{L}_{(k-1,j)} - X^{L}_{(k-1,m)}\right)\right)\right|}{\left|\left(X^{L}_{(k-1,j)} - X^{L}_{(k-1,l)}\right)\times\left(X^{L}_{(k-1,j)} - X^{L}_{(k-1,m)}\right)\right|}$$

where $X^{L}_{(k-1,m)}$ is the three-dimensional coordinate of the point $m$ in the laser coordinate system $\{L\}$; $l$, $m$ and $j$ are the three face points nearest to the point $i$;

the optimization loss function is expressed as:

$$T^{*} = \arg\min_{T}\left\{\sum_{i\in\mathcal{E}_{k}} d_{\varepsilon}(i, T) + \lambda\sum_{i\in\mathcal{H}_{k}} d_{\mathcal{H}}(i, T)\right\}$$

where $\mathcal{E}_{k}$ denotes the point set composed of edge points in the $k$-th frame laser coordinate system $\{L\}$, $\mathcal{H}_{k}$ denotes the point set composed of face points in the $k$-th frame laser coordinate system $\{L\}$, and $\lambda$ denotes the relative weight coefficient between the two types of error terms; both error terms contain the parameter $T$, which represents the relative pose relationship between the two frames.
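A minimal numpy sketch of the point-to-line and point-to-plane residuals reconstructed above; the function and variable names are hypothetical, and the pose is applied as a plain 4x4 homogeneous transform.

```python
import numpy as np

def edge_residual(p_i, p_j, p_l):
    """Point-to-line distance: transformed edge point p_i of the current frame
    against the line through p_j, p_l from the previous frame."""
    num = np.linalg.norm(np.cross(p_i - p_j, p_i - p_l))
    den = np.linalg.norm(p_j - p_l)
    return num / den

def plane_residual(p_i, p_j, p_l, p_m):
    """Point-to-plane distance: p_i against the plane through p_j, p_l, p_m."""
    n = np.cross(p_j - p_l, p_j - p_m)               # plane normal
    return abs(np.dot(p_i - p_j, n)) / np.linalg.norm(n)

def total_loss(T, edge_pairs, plane_pairs, lam=1.0):
    """T: 4x4 relative pose; *_pairs hold raw points of frame k plus their
    nearest features found in frame k-1."""
    def transform(p):
        return (T[:3, :3] @ p) + T[:3, 3]
    loss = sum(edge_residual(transform(p), j, l) for p, j, l in edge_pairs)
    loss += lam * sum(plane_residual(transform(p), j, l, m) for p, j, l, m in plane_pairs)
    return loss
```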
Further, the method for extracting the edge point features and the face point features comprises the following steps:
Step 3.3.1: divide the three-dimensional lidar point cloud into a number of two-dimensional lidar point cloud sets with different pitch angles, and define a smoothness $c$ for every point in each beam of the laser point cloud through the positional relationship between the point and its neighboring points:

$$c = \frac{1}{\left|S\right|\cdot\left\|X^{(L,n)}_{(k,i)}\right\|}\left\|\sum_{j\in S,\,j\neq i}\left(X^{(L,n)}_{(k,i)} - X^{(L,n)}_{(k,j)}\right)\right\|$$

where $X^{(L,n)}_{(k,i)}$ is the three-dimensional coordinate, in the laser coordinate system $\{L\}$, of the $i$-th point of the $n$-th beam of the $k$-th frame point cloud, and $S$ is the set of all points within the polar-angle neighborhood of the $i$-th point;

Step 3.3.2: sort the points in each beam of the point cloud according to the smoothness $c$, and set two thresholds $T_p$ and $T_e$:

when the smoothness of a point is less than $T_p$, the point is sufficiently close to its neighboring points and is taken as a face point;

when the smoothness of a point is greater than $T_e$, the point is far from its neighboring points and is taken as an edge point;

when the numbers of face points and edge points exceed the thresholds, only the top $N_p$ face point features and the top $N_e$ edge point features are extracted, where $N_p$ and $N_e$ are the maximum numbers of face points and edge points.
Further, in the loop detection and repositioning using the key frame strategy in step 3, after similar key frames are found, the relative pose of the two frames also needs to be solved through feature matching and PnP, and the visual motion information of step 2 is optimized. The optimization target is the sum of the reprojection errors between the two frames after the depth information of the successfully matched features of both frames has been restored, expressed as:

$$T^{*} = \arg\min_{T}\sum_{i}\left(\left\|\pi\!\left(T\,\pi^{-1}(u_i, d_{1,i})\right) - v_i\right\|_2 + \left\|\pi\!\left(T^{-1}\,\pi^{-1}(v_i, d_{2,i})\right) - u_i\right\|_2\right)$$

where $T^{*}$ denotes the optimum of the error function, $T$ is the relative pose between the two frames, $u_i$ is the pixel coordinate of the $i$-th matching point in the first image, $v_i$ is the pixel coordinate of the $i$-th matching point in the second image, and $d$ denotes the depth information of the pixel points in the two images; $\pi$ is the projection function mapping a three-dimensional point to the pixel coordinate system, and $\pi^{-1}$ is the back-projection function restoring a pixel coordinate and its corresponding depth to a three-dimensional point; $d_{1,i}$ is the depth of the $i$-th matching point in the first image, $d_{2,i}$ the depth of the $i$-th matching point in the second image, and $T^{-1}$ is the inverse of the relative pose;

the final optimization result is the pose constraint between the two frames.
Further, the key frames in step 3 satisfy the following conditions (a minimal selection sketch follows the list):

1) the number of feature points matched with the previous frame is greater than $N_{min}$, ensuring the tracking quality, where $N_{min}$ is the minimum matching threshold;

2) the number of feature points matched with the last key frame is less than $N_{max}$, avoiding redundant storage, where $N_{max}$ is the maximum matching threshold;

3) at least M frames have passed since the last key frame, limiting the minimum interval and reducing storage, where M is the minimum interval threshold.
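As an illustration, a small predicate for the three conditions; the numeric defaults are placeholders, not values prescribed by the invention.

```python
def is_keyframe(matches_prev_frame, matches_last_keyframe, frames_since_keyframe,
                N_min=50, N_max=200, M=5):
    """All three conditions must hold simultaneously; thresholds are placeholders."""
    return (matches_prev_frame > N_min          # 1) tracking quality
            and matches_last_keyframe < N_max   # 2) avoid redundant storage
            and frames_since_keyframe >= M)     # 3) minimum interval
```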
Compared with the prior art, the invention has the advantages that:
1) the invention builds a SLAM system with visual laser information fusion on the basis of a laser odometer system LOAM. The front end of the system mainly comprises a visual odometer and a laser odometer. In the visual odometer, sparse depth information is obtained by projection from laser point cloud to an image, then depth completion is carried out on the sparse depth image, and an accurate initial pose is estimated through feature matching between two frames and minimized reprojection errors.
2) The method is mainly characterized in that the experimental visual motion information replaces the previous uniform motion hypothesis to remove the point cloud motion distortion, and the precision of the algorithm is improved on the whole. And a good initial value not only avoids entering a degradation environment during point cloud registration, but also avoids falling into a local optimal solution in some specific environments, thereby improving the robustness of the algorithm, accelerating the iterative convergence speed during point cloud registration and realizing real-time positioning (less than 100 milliseconds).
3) According to the invention, a loop detection and relocation module is added at the rear end of the odometer, a visual bag-of-words model is adopted, and a characteristic dictionary formed by a multi-branch tree is maintained through hierarchical clustering, so that the retrieval robustness is ensured and the retrieval speed of the characteristics is accelerated. And when the loop is detected, the integral optimization of the inside of the whole track loop is completed through the pose graph optimization. The module not only eliminates the accumulated error caused by long-time movement and reduces the absolute error, but also ensures that the system has the function of repositioning after returning to the original track after being hijacked, thereby being beneficial to the positioning in long-time movement.
Drawings
FIG. 1 is a flow chart of the SLAM method of the present invention;
FIG. 2 is a schematic diagram of the present invention for calculating the reprojection error of two types of points;
FIG. 3 is a schematic diagram of pose graph optimization;
FIG. 4 is a trace of the present invention under KITTI data set, wherein FIGS. 4-1 to 4-3 show traces over KITTI-00 sequence, KITTI-05 sequence, and KITTI-09 sequence, respectively;
FIG. 5 is the mapping result of the present invention at KITTI-07.
Detailed Description
The invention is described in detail below with reference to the accompanying drawings. The technical features of the embodiments of the present invention can be combined correspondingly without mutual conflict.
The invention relates to a vision and laser radar based simultaneous positioning and map building method, which comprises the steps of simultaneously operating a vision odometer and a laser odometer, wherein the vision odometer acquires image information of the surrounding environment in real time by using a camera to obtain image data; the laser odometer utilizes a laser radar to emit laser and receive reflected light of a peripheral measured object for ranging, and obtains structured information of a peripheral environment to obtain laser point cloud data;
as shown in fig. 1, the method mainly comprises the following steps:
Step one: projecting the lidar point cloud into the image using the calibrated extrinsic parameters, densifying the sparse depth, and running a visual odometer with depth information at the front end;
Step two: adding the visual motion information as an initial value into the lidar odometer, removing the motion distortion of the lidar point cloud, and solving the lidar motion information through local point cloud map matching.
Step three: running a loop detection and repositioning module at the back end, optimizing the closed loop and smoothing the trajectory with a pose graph, and reducing accumulated errors.
In one embodiment of the present invention, the implementation of step one is described.
Obtaining a dense depth map from a sparse depth map is mainly divided into the following 7 steps:
1) Depth Inversion: the dilation performed later is a "bright area expansion", i.e. larger depth values would cover smaller depth values in their surroundings. Since the closer (smaller) depths are the ones that should be expanded, the valid depth information must first be inverted:

$$\hat{D}_x = \begin{cases} D_{max} - D_x, & D_x \neq 0 \\ 0, & D_x = 0 \end{cases}$$

where $\hat{D}_x$ is the inverted depth value of a visual feature point and $D_{max}$ is a preset maximum depth value. In the KITTI data the depth values are within 80 meters, so the invention takes $D_{max} = 100$. $D_x$ is the depth value of the visual feature point before inversion; $D_x \neq 0$ indicates that valid depth information exists, and $D_x = 0$ means no depth information.
2) Dilation with a custom convolution kernel (Custom Kernel Dilation): a suitable kernel is designed so that the dilation spreads valid depths into the regions without depth information.
3) Closing of small holes (Small Hole Closure): after the previous dilation, some regions still lack depth information; a morphological closing (dilation followed by erosion) with a full 5 × 5 kernel is applied to fill these small holes.
4) Small hole filling (Small Hole Fill): regions that still lack depth information are filled using a 7 × 7 full-kernel dilation.
5) Large hole filling (Large Hole Fill): apart from areas the laser cannot reach, such as the sky, the regions that still lack depth information at this point are filled with a large 31 × 31 full convolution kernel.
6) Median/Gaussian filtering (Median and Gaussian Blur): to reduce the noise introduced by the dilation operations, a 5 × 5 median filter first removes outliers at local edges, and a 5 × 5 Gaussian filter then smooths sharp edges.
7) Depth re-inversion (Depth Re-Inversion): the true depth information is recovered.
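For illustration, a simplified OpenCV sketch of the seven-step completion pipeline. The kernel sizes follow the values mentioned above, but other details are simplifications of the described procedure (e.g. an elliptical custom kernel is assumed, and the blurs are applied to all pixels rather than only valid ones).

```python
import cv2
import numpy as np

def complete_sparse_depth(sparse, d_max=100.0):
    """sparse: HxW float32 depth map, 0 = no measurement."""
    depth = sparse.astype(np.float32)
    valid = depth > 0.1

    # 1) Invert valid depths so that closer points dominate during dilation.
    depth[valid] = d_max - depth[valid]

    # 2) Dilate with a custom (here: elliptical) kernel.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    depth = cv2.dilate(depth, kernel)

    # 3) Close small holes (dilate then erode) with a full 5x5 kernel.
    depth = cv2.morphologyEx(depth, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))

    # 4) Fill remaining small holes with a 7x7 dilation, only where still empty.
    empty = depth < 0.1
    depth[empty] = cv2.dilate(depth, np.ones((7, 7), np.uint8))[empty]

    # 5) Fill large holes with a 31x31 dilation, only where still empty.
    empty = depth < 0.1
    depth[empty] = cv2.dilate(depth, np.ones((31, 31), np.uint8))[empty]

    # 6) Median then Gaussian blur to remove outliers and smooth edges.
    depth = cv2.medianBlur(depth, 5)
    depth = cv2.GaussianBlur(depth, (5, 5), 0)

    # 7) Re-invert to recover metric depth.
    valid = depth > 0.1
    depth[valid] = d_max - depth[valid]
    return depth
```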
After the depth completion, depth information is obtained for part of the feature points; however, because the visible area of the lidar is limited, roughly the upper third of the image has no depth information, so two different types of matching-point error terms are designed and weighted to obtain the final error term.
As shown in FIG. 2, for a point with depth, the error term is the three-dimensional-to-two-dimensional reprojection error: the point $p_1$ with depth information is back-projected into three-dimensional space and re-projected onto the other image plane to obtain the point $\tilde{p}_1$; if the position of the true matching point is $p_1'$, the distance $\|\tilde{p}_1 - p_1'\|$ constitutes the reprojection error term of a point with depth. For a point without depth, the error term must be designed with a different geometric model: for a point $p_2$ without depth, its three-dimensional position may appear anywhere on the ray $Op_2$, and the projection of this ray in the second camera is a straight line; the distance from the position of the true matching point $p_2'$ to this projected line is used as the reprojection error term. Two error terms are thus constructed for the different types of matching points:

$$T^{*}=\arg\min_{T}\left\{\sum_{i=1}^{n}\left\|\pi\!\left(T\,\pi^{-1}(u_{(1,i)},d_i)\right)-v_{(1,i)}\right\|_2+\lambda\sum_{i=1}^{m}\mathrm{dist}\!\left(v_{(2,i)},\,T\,\phi^{-1}(u_{(2,i)})\right)\right\}$$

where $T^{*}$ denotes the optimum of the error function and $T$ is the relative pose between the two frames; $n$ is the number of feature points with depth information and $m$ the number of feature points without depth information; $\pi$ is the projection function mapping a three-dimensional point to the pixel coordinate system, and $\pi^{-1}$ is the back-projection function restoring a pixel coordinate and its corresponding depth to a three-dimensional point; $u_{(1,i)}$ is the pixel coordinate of the $i$-th point of the first type (points with depth information) in the left view, $d_i$ its depth, and $v_{(1,i)}$ its matching pixel coordinate in the right view; $u_{(2,i)}$ is the pixel coordinate of the $i$-th point of the second type (points without depth information) in the left view and $v_{(2,i)}$ its matching pixel coordinate in the right view; $\phi^{-1}$ is the back-projection function mapping a pixel to a three-dimensional ray; $\|\cdot\|_2$ denotes the Euclidean norm; $t$ is a threshold value; $\lambda$ is a weight coefficient adjusting the confidence between the two types of points; $\mathrm{dist}(\cdot)$ is the point-to-ray distance function.
The invention adopts the feature point method to complete the relative motion estimation between two frames. This relative motion can be used to remove the motion distortion of the current frame point cloud, and can also be passed to the laser odometer as an initial relative motion estimate, preventing the laser odometer from falling into a degenerate environment. In the following embodiment, the process of calculating visual motion information is described, taking the motion from the k-th frame to the (k+1)-th frame as an example.
Assuming that the motion information from the (k-1)-th frame to the k-th frame is already known, the motion from the k-th frame to the (k+1)-th frame is now estimated:
1) and completing feature extraction and matching of the kth frame image and the (k + 1) th frame image.
2) And removing the motion distortion of the point cloud of the k frame by using the motion information from the k-1 frame to the k frame.
3) And restoring the depth of the matching feature points of the kth frame by using the image of the kth frame and the point cloud information of the kth frame after the motion distortion is removed. The depth information of the (k + 1) th frame point cloud is not recovered here, because there is no motion information at this time, and motion distortion cannot be removed for the point cloud.
4) Counting the number of characteristic points with depth in the kth frame, if the number of the characteristic points is less than 20, judging that the characteristic points are fewer at the moment, the robust motion estimation cannot be obtained, returning error information, and continuously using a uniform motion hypothesis; otherwise, the next step is continued.
5) And carrying out PnP calculation on the feature points with depth in the kth frame and the feature points without depth in the (k + 1) th frame to obtain the motion information from the kth frame to the (k + 1) th frame.
6) And (5) optimizing the reprojection error to obtain the optimal pose estimation.
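Steps 4)–6) could be prototyped with OpenCV roughly as below; using RANSAC PnP and assuming no lens distortion are implementation choices, the fallback behaviour is a placeholder, and the refinement of step 6) would reuse the two-term reprojection cost described earlier.

```python
import cv2
import numpy as np

def estimate_motion(pts3d_prev, pts2d_curr, K, min_points=20):
    """pts3d_prev: Nx3 points of frame k with recovered depth (camera frame);
    pts2d_curr: Nx2 matched pixels in frame k+1; K: 3x3 intrinsics."""
    if len(pts3d_prev) < min_points:
        return None                          # too few points: fall back to constant velocity
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d_prev.astype(np.float64),
        pts2d_curr.astype(np.float64),
        K.astype(np.float64), None)          # None = no lens distortion assumed
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)               # rotation vector -> rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T                                 # maps frame-k coordinates into frame k+1
```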
In one embodiment of the present invention, the implementation of step two is described.
The invention regards the three-dimensional lidar point cloud as a set of two-dimensional lidar point clouds at several different pitch angles. For all points in each beam of the laser point cloud, the returns usually arrive in clockwise or counterclockwise order (depending on the rotation direction of the lidar); that is, if a two-dimensional polar coordinate system is constructed with the laser emission center as the pole and the initial laser direction as the polar axis, the points are already sorted by polar angle when they are returned. According to the positional relationship between each point and its neighboring points on the same line, the invention extracts edge points and face points; adjacent points in the index are therefore typically points with close polar angles.
The method for extracting the edge point features and the face point features comprises the following steps:
1) A smoothness $c$ is first defined for each point through its positional relationship with its neighboring points:

$$c = \frac{1}{\left|S\right|\cdot\left\|X^{(L,n)}_{(k,i)}\right\|}\left\|\sum_{j\in S,\,j\neq i}\left(X^{(L,n)}_{(k,i)} - X^{(L,n)}_{(k,j)}\right)\right\|$$

where $X^{(L,n)}_{(k,i)}$ is the three-dimensional coordinate, in the laser coordinate system $\{L\}$, of the $i$-th point of the $n$-th beam of the $k$-th frame point cloud, and $S$ is the set of all points within the polar-angle neighborhood of the $i$-th point.
2) The points in each beam of the point cloud are sorted according to their smoothness $c$, and two thresholds $T_p$ and $T_e$ are set.
3) When the smoothness of a point is less than $T_p$, it is close enough to its neighboring points and is considered a face (planar) point.
4) When the smoothness of a point is greater than $T_e$, it is far from its neighboring points and is considered an edge point.
5) The numbers of edge points and face points are also limited: when too many feature points exceed the thresholds, only the top $N_p$ face point features and the top $N_e$ edge point features are taken.
Two further cases also need to be considered:
Case one: when the laser is nearly parallel to a smooth surface, i.e. the angle between the incident ray and the surface is small, only a few points on the plane can be detected, which would affect subsequent feature matching, so such points are not selected as face points.
Case two: after enough edge points and face points have been obtained, the features must be matched between adjacent frames and a reasonable error term designed to estimate the current motion state.
In one embodiment of the invention, in the absence of other sensor information, a uniform-motion assumption is generally made first and the estimated motion state is then finely registered. Under the uniform-motion assumption, the points in the next frame's laser coordinate system $\{L_{k+1}\}$ are projected into the previous frame's laser coordinate system $\{L_k\}$.
The error term of the edge points is considered first. The edge points $i$ in the (k+1)-th frame coordinate system $\{L_{k+1}\}$ form a point set $\mathcal{E}_{k+1}$, which is projected into the k-th frame coordinate system $\{L_k\}$ to form the point set $\tilde{\mathcal{E}}_{k+1}$. For each edge point $i$ in $\tilde{\mathcal{E}}_{k+1}$, the two edge points nearest to $i$ are found in $\mathcal{E}_k$, where $\mathcal{E}_k$ denotes the point set composed of the k-th frame edge points. Since a real edge in three-dimensional space has at most one intersection with any single laser beam, it must be verified that the two edge points found do not belong to the same beam. When the two points $j$ and $l$ nearest to point $i$ belong to different beams, the point-to-line distance is used as the error term:

$$d_{\varepsilon} = \frac{\left|\left(\tilde{X}^{L}_{(k+1,i)} - X^{L}_{(k,j)}\right)\times\left(\tilde{X}^{L}_{(k+1,i)} - X^{L}_{(k,l)}\right)\right|}{\left|X^{L}_{(k,j)} - X^{L}_{(k,l)}\right|}$$

where $\tilde{X}^{L}_{(k+1,i)}$, $X^{L}_{(k,j)}$ and $X^{L}_{(k,l)}$ are the three-dimensional coordinates of the points $i$, $j$ and $l$ in the laser coordinate system $\{L\}$; the subscripts $k$ and $k+1$ denote the two adjacent frames; $T$ denotes the relative pose between the two frames, on which the projection of point $i$ depends.
the error term for the face point is then considered. For the k +1 th frame coordinate system { Lk+1Set of points formed by surface points i in
Figure BDA0002730917950000135
Project it to the k frame coordinate system { L }kIn the preceding, form a point set
Figure BDA0002730917950000136
Then for each of the points i of the face,
Figure BDA0002730917950000137
in that
Figure BDA0002730917950000138
Finding three face points nearest to point i, wherein
Figure BDA0002730917950000139
A point set composed of k-th frame edge points is shown. In order to be able to construct a plane without degradation, the three points cannot be located on the same beam, while it is also verified whether the three points are on the same line. Only three surface points which simultaneously satisfy non-identical wire harnesses and non-identical straight lines can form an error term of one surface point:
Figure BDA00027309179500001310
wherein the content of the first and second substances,
Figure BDA0002730917950000141
representing the three-dimensional coordinates of the point m under the laser coordinate system { L }; l, m and j are three face points nearest to the point i;
projection error term d when determining edge points and surface pointsεAnd
Figure BDA0002730917950000142
then, the invention accumulates the error terms and uses a nonlinear optimization method to solve an optimal pose.
The invention combines two types of error terms through a parameter lambda:
Figure BDA0002730917950000143
1) $\lambda$ represents the relative weight coefficient between the two types of error terms and can be adjusted for the specific scene. For scenes with rich planar structure, $\lambda$ is increased ($\lambda > 1$) so that the registration of the face feature points carries a higher weight; for more complex regions where face feature points are scarcer, the relative weight of the edge points can be raised by decreasing $\lambda$ ($\lambda < 1$). In the experiments, the invention sets $\lambda = 1$.
2) Both error terms contain the parameter $T$, which represents the relative pose relationship between the two frames; adjusting $T$ changes the projected position of point $i$ in the previous frame, and the neighboring points are then searched to evaluate the error.
The final optimization objective is:

$$T^{*} = \arg\min_{T}\left\{\sum_{i\in\tilde{\mathcal{E}}_{k+1}} d_{\varepsilon}(i, T) + \lambda\sum_{i\in\tilde{\mathcal{H}}_{k+1}} d_{\mathcal{H}}(i, T)\right\}$$

During the iterative optimization, the projected position of point $i$ changes every time the parameter $T$ is updated, and the neighboring feature points should then be recomputed to evaluate the optimal error term. Recomputing the neighboring feature points at every iteration, however, greatly increases the time complexity of the algorithm. In the experiments it was found that, given a good initial value and a small number of iterations, the neighbors of most re-projected points do not change. Therefore, in the practical implementation of the algorithm, the mutual iteration is completed through two loops:

the outer loop re-searches the neighboring points at most MaxSearchTimes times;

the inner loop iteratively optimizes $T$, with the neighboring points held fixed, at most MaxIterTimes times.
In addition, to improve the efficiency of the neighbor search, the invention maintains the three-dimensional point cloud in a kd-tree, reducing the complexity of a linear search from O(N) to O(K log₂ N), where K = 3 is the coordinate dimension of a point and N is the number of feature points.
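A sketch of the two-level iteration combined with a kd-tree neighbor search. The function `optimize_step` stands for one update of T over the edge/face residuals and must be supplied by the caller, so the names and default counts here are illustrative only.

```python
import numpy as np
from scipy.spatial import cKDTree

def register(source_pts, target_pts, T_init, optimize_step,
             max_search_times=5, max_iter_times=10):
    """Outer loop: re-search the nearest neighbors with a kd-tree after the pose
    has moved; inner loop: optimize T with the correspondences held fixed.
    optimize_step(source_pts, corr, T) -> updated 4x4 T (e.g. one Gauss-Newton
    step over the edge/face residuals above)."""
    tree = cKDTree(target_pts)                         # fast nearest-neighbor queries
    T = np.array(T_init, dtype=float)
    for _ in range(max_search_times):                  # MaxSearchTimes
        moved = source_pts @ T[:3, :3].T + T[:3, 3]    # transform sources by current T
        _, idx = tree.query(moved, k=1)                # current nearest neighbors
        corr = target_pts[idx]
        for _ in range(max_iter_times):                # MaxIterTimes
            T = optimize_step(source_pts, corr, T)
    return T
```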
The inter-frame motion estimation can obtain an overall motion track through the relative motion relationship between every two frames, but due to the existence of observation errors and matching errors, the overall track obtained through the relative motion relationship between the two frames often has larger accumulated errors after moving for a long time. In order to reduce accumulated errors, the method based on the sliding window is adopted, the feature points of each frame are matched with the feature point set of the nearest N frames, and all the feature points are represented by a world coordinate system { W }, so that the result obtained after final optimization is directly the pose of the world coordinate system { W }. In one embodiment of the present application, N ≡ 10 is taken.
In absolute motion estimation, the feature point set is the superposition of the nearest N frames of point clouds under their respective motion states. The kd-tree neighbor search is enlarged by a factor of N, and relying too heavily on the single nearest neighbor is prone to overfitting because of observation errors. Therefore, the invention does not construct error terms from the nearest neighbor alone, but finds the principal directions of the edges and planes from the several nearest points.
The invention also uses a kd-tree to maintain the feature point set of the nearest N frames of point clouds to be matched. As in the frame-to-frame relative motion estimation, the initial pose $T_0$ is determined through a uniform motion model.
In one embodiment of the present invention, a process for removing motion distortion of a lidar using visual motion information is described.
The angular speed of the rotating lidar is assumed to be uniform, so the specific time corresponding to each point can be calculated quickly and the pose at any time can be obtained; the edge points and face points are then projected to their distortion-free positions according to the relative poses.
In one embodiment of the present invention, the motion compensation for the point cloud between the time k and the time k +1 includes the following steps:
1) For each three-dimensional point, calculate the horizontal deflection angle θ of the point relative to the initial direction from its x and y coordinates.
2) Linearly interpolate the pose between times k and k+1 according to θ.
3) Project the distorted point cloud to the distortion-free position according to the result of the linear interpolation (a small sketch of this interpolation is given after the error-term discussion below).
4) The result of the linear interpolation is further optimized. In the original lidar odometer, edge points and face points are extracted for matching, the error terms are calculated, and the optimal lidar pose is solved by optimizing these error terms. The error term of an edge point is:

$$d_{\varepsilon} = \frac{\left|\left(X^{L}_{(k+1,i)} - X^{L}_{(k,j)}\right)\times\left(X^{L}_{(k+1,i)} - X^{L}_{(k,l)}\right)\right|}{\left|X^{L}_{(k,j)} - X^{L}_{(k,l)}\right|}$$

where $X^{L}_{(k+1,i)}$, $X^{L}_{(k,j)}$ and $X^{L}_{(k,l)}$ are the three-dimensional coordinates of the points $i$, $j$ and $l$ in the $\{L\}$ coordinate system.

The error term of a face point is:

$$d_{\mathcal{H}} = \frac{\left|\left(X^{L}_{(k+1,i)} - X^{L}_{(k,j)}\right)\cdot\left(\left(X^{L}_{(k,j)} - X^{L}_{(k,l)}\right)\times\left(X^{L}_{(k,j)} - X^{L}_{(k,m)}\right)\right)\right|}{\left|\left(X^{L}_{(k,j)} - X^{L}_{(k,l)}\right)\times\left(X^{L}_{(k,j)} - X^{L}_{(k,m)}\right)\right|}$$

Assume the position of each point in the undistorted point cloud is $\tilde{X}^{L}_{(k+1,i)}(T)$; it is no longer a fixed value as in ordinary ICP, but one that changes constantly as the relative pose $T$ is updated. The edge point error term is then rewritten as:

$$d_{\varepsilon} = \frac{\left|\left(\tilde{X}^{L}_{(k+1,i)}(T) - X^{L}_{(k,j)}\right)\times\left(\tilde{X}^{L}_{(k+1,i)}(T) - X^{L}_{(k,l)}\right)\right|}{\left|X^{L}_{(k,j)} - X^{L}_{(k,l)}\right|}$$

Likewise, the face point error term is rewritten as:

$$d_{\mathcal{H}} = \frac{\left|\left(\tilde{X}^{L}_{(k+1,i)}(T) - X^{L}_{(k,j)}\right)\cdot\left(\left(X^{L}_{(k,j)} - X^{L}_{(k,l)}\right)\times\left(X^{L}_{(k,j)} - X^{L}_{(k,m)}\right)\right)\right|}{\left|\left(X^{L}_{(k,j)} - X^{L}_{(k,l)}\right)\times\left(X^{L}_{(k,j)} - X^{L}_{(k,m)}\right)\right|}$$

The final loss function to be optimized is:

$$T^{*} = \arg\min_{T}\left\{\sum_{i\in\mathcal{E}_{k}} d_{\varepsilon}(i, T) + \lambda\sum_{i\in\mathcal{H}_{k}} d_{\mathcal{H}}(i, T)\right\}$$

where $\mathcal{E}_{k}$ denotes the set of edge points in the k-th frame laser coordinate system $\{L\}$, $\mathcal{H}_{k}$ denotes the set of face points in the k-th frame laser coordinate system $\{L\}$, and $\lambda$ is the relative weight coefficient between the two types of error terms; both error terms contain the parameter $T$, which represents the relative pose relationship between the two frames.
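A simplified sketch of the distortion removal by interpolation, assuming T_rel is the relative motion over one sweep (here supplied by the visual odometer) expressed as the pose of the sweep end in the sweep-start frame; mapping the deflection angle to a sweep fraction by min-max normalization is a simplification of the timestamp computation described above.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def remove_motion_distortion(points, T_rel):
    """points: Nx3 float points of one lidar sweep; T_rel: 4x4 relative motion
    over the sweep. Each point is moved into the sweep-start frame according
    to its horizontal deflection angle."""
    theta = np.arctan2(points[:, 1], points[:, 0])            # horizontal deflection angle
    s = (theta - theta.min()) / (theta.max() - theta.min())   # sweep fraction in [0, 1]

    rots = Rotation.from_matrix(np.stack([np.eye(3), T_rel[:3, :3]]))
    slerp = Slerp([0.0, 1.0], rots)                           # interpolate the rotation
    undistorted = np.empty_like(points)
    for i, p in enumerate(points):
        R_i = slerp(s[i]).as_matrix()
        t_i = s[i] * T_rel[:3, 3]                             # linearly interpolate translation
        undistorted[i] = R_i @ p + t_i                        # point in the sweep-start frame
    return undistorted
```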
In one embodiment of the present invention, the process of step three is described.
According to the invention, a loop detection and repositioning module is added and realized through a visual bag-of-words model, forming a complete visual-laser SLAM system. After a loop is detected, the system performs pose graph optimization according to the loop constraints, which eliminates to a certain extent the accumulated errors of long-term motion.
The pose graph optimization process specifically comprises the following steps:
after the loop constraint is obtained through the visual bag-of-words model, the relative position relationship between two loop frames can be calculated, so that new constraint relationships except for the adjacent frames before and after are obtained. If the current k frame and the previous i frame are determined to have a loop through loop detection, a large number of similar features must exist between the two frames, the similar features are matched, meanwhile, the depth information of feature points can be obtained through projection and depth completion of laser point cloud on an image, and the estimation of the relative pose between the two frames is completed through calculating PnP.
Suppose the poses estimated by the odometer are $\{T_i \mid i \in [0, 9]\}$, and the loop detection module detects loops between $T_0$ and $T_7$ and between $T_2$ and $T_9$. Taking $T_0$ and $T_7$ as an example, the relative pose between the two frames solved through PnP is $T_{0,7}$; then

$$T_{0,7} = T_0^{-1} T_7$$
Theoretically, if the odometer estimates and the pose calculation were exact, this equation would hold; in visual problems, however, uncertainties such as observation errors inevitably exist, so the equation is not strictly satisfied and a least-squares error has to be constructed. A common method is Pose Graph Optimization (PGO). As shown in FIG. 3, the nodes represent the pose of each frame, and the directed edges between two nodes represent the relative constraints between the poses.
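A toy pose-graph-optimization sketch using a rotation-vector-plus-translation parameterization rather than a full SE(3) Lie-algebra treatment; the gauge is fixed by anchoring the first pose, and all names and the solver choice are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def to_T(x):
    """6-vector (rotation vector, translation) -> 4x4 pose."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(x[:3]).as_matrix()
    T[:3, 3] = x[3:]
    return T

def log_T(T):
    """4x4 pose -> 6-vector (rotation vector, translation)."""
    return np.concatenate([Rotation.from_matrix(T[:3, :3]).as_rotvec(), T[:3, 3]])

def pgo_residuals(x, edges):
    """x: flattened 6-DoF parameters of all poses; edges: (i, j, T_ij) constraints,
    including both odometry edges (i, i+1) and loop edges such as (0, 7), (2, 9)."""
    poses = [to_T(x[6 * k:6 * k + 6]) for k in range(len(x) // 6)]
    res = []
    for i, j, T_ij in edges:
        T_err = np.linalg.inv(T_ij) @ np.linalg.inv(poses[i]) @ poses[j]
        res.append(log_T(T_err))         # zero when T_i^-1 T_j equals the measurement
    res.append(log_T(poses[0]))          # gauge: anchor the first pose at identity
    return np.concatenate(res)

def optimize_pose_graph(initial_poses, edges):
    x0 = np.concatenate([log_T(T) for T in initial_poses])
    sol = least_squares(pgo_residuals, x0, args=(edges,))
    return [to_T(sol.x[6 * k:6 * k + 6]) for k in range(len(initial_poses))]
```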
The loop detection needs to store historical information of image frames, and if the information of each previous frame is stored indiscriminately, when the track is gradually increased, the consumption of memory is increased, and the time loss of loop detection is also seriously affected, so a key frame strategy is generally adopted in the most advanced SLAM system.
To complete loop detection and relocation more efficiently, i.e. to retrieve from the historical images the picture most similar to the current frame, the invention also borrows the idea of key frames and stores only key frame information, which greatly reduces the memory consumption at run time. Regarding the key frame strategy, SLAM systems differ slightly in implementation details but are largely similar: frames with rich features are selected as key frames. The key frame policy used in the embodiments of the invention requires the following three conditions to be satisfied simultaneously:
1) the number of feature points matched with the previous frame is greater than N_min, which ensures tracking quality;
2) the number of feature points matched with the last keyframe is less than N_max, which avoids redundant storage;
3) at least M frames have passed since the last keyframe, which limits the minimum spacing and further reduces storage.
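A minimal sketch of the three-condition check is given below; the function name and the threshold values N_MIN, N_MAX and M are placeholders, not values specified by the invention.

```python
# Illustrative keyframe test following the three conditions listed above.
N_MIN, N_MAX, M = 50, 200, 5   # placeholder tuning parameters

def is_keyframe(matches_prev_frame: int,
                matches_last_keyframe: int,
                frames_since_last_keyframe: int) -> bool:
    """Return True only when all three keyframe conditions hold."""
    tracked_well  = matches_prev_frame > N_MIN           # condition 1: tracking quality
    not_redundant = matches_last_keyframe < N_MAX        # condition 2: avoid redundancy
    far_enough    = frames_since_last_keyframe >= M      # condition 3: minimum spacing
    return tracked_well and not_redundant and far_enough

print(is_keyframe(120, 80, 7))    # True
print(is_keyframe(120, 250, 7))   # False: too similar to the last keyframe
```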
After similar keyframes are found by the loop detection and relocalization module, the relative pose between them still needs to be solved through feature matching and PnP.
Unlike PnP between consecutive frames, the motion distortion of the two frames forming the loop has already been removed using the motion information of their neighbouring frames, and their depth information is therefore reliable, so the pose solved by PnP can be further optimized. The optimization target is the sum of the reprojection errors between the two frames after the depth information has been recovered:
T^* = \arg\min_T \sum_i \left( \left\| u_i - \pi\left(T^{-1}\,\pi^{-1}(v_i, d_{2,i})\right) \right\|^2 + \left\| v_i - \pi\left(T\,\pi^{-1}(u_i, d_{1,i})\right) \right\|^2 \right)

wherein T* denotes the result of the error-term optimization, T denotes the relative pose between the two frames, u_i denotes the pixel coordinates of the i-th matching point in the first image, v_i denotes the pixel coordinates of the i-th matching point in the second image, d denotes the depth information of the pixel points in the two images, π is the projection function that projects a three-dimensional point into the pixel coordinate system, π^{-1} is the back-projection function that restores pixel coordinates and the corresponding depth to a three-dimensional point, d_{1,i} denotes the depth of the i-th matching point in the first image, d_{2,i} denotes the depth of the i-th matching point in the second image, and T^{-1} is the inverse of the relative pose.
the final optimization result is the pose constraint between two frames.
Examples
To further demonstrate the effect of the present invention, this embodiment uses the raw (not undistorted) point cloud data from the KITTI raw data set, and selects 10 of the first 11 sequences (00-10) of the KITTI odometry benchmark as experimental data (sequence 03 has no corresponding raw point cloud data).
Evaluation indexes are as follows:
according to different positioning requirements, the relative error and the absolute error are adopted to respectively evaluate the positioning accuracy.
For the relative error evaluation, sub-trajectories with lengths of 100 m, 200 m, 300 m, ..., 800 m (eight lengths) are taken starting from each frame, the estimated poses are compared with the ground-truth poses computed in the same way, the error is divided by the actual length of each segment, and the average rotation error R (unit: degree/100 m) and translation error t (unit: %) are computed. By capping the segment length, the relative error evaluation avoids the error accumulated over long-term motion and is used to evaluate relative positioning accuracy.
For the absolute error evaluation, all estimated poses are compared directly with the ground-truth poses, the error is divided by the total trajectory length, and the average rotation error R (unit: degree/100 m) and translation error t (unit: %) are then computed. The absolute error evaluation fully accounts for accumulated error and is used to evaluate the overall positioning accuracy of the whole sequence.
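The following simplified sketch illustrates the two ideas; it is not the official KITTI evaluation code, which aligns each segment in a local frame and also reports rotation error. est_pos and gt_pos are assumed to be time-aligned Nx3 arrays of positions.

```python
# Simplified relative and absolute translation error sketches (illustrative only).
import numpy as np

def traj_length(positions):
    return np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1))

def absolute_translation_error(est_pos, gt_pos):
    """Mean position error divided by total trajectory length, as a percentage."""
    err = np.mean(np.linalg.norm(est_pos - gt_pos, axis=1))
    return 100.0 * err / traj_length(gt_pos)

def relative_translation_error(est_pos, gt_pos, seg_len=100.0):
    """Average end-point drift of segments roughly seg_len metres long, in %."""
    steps = np.linalg.norm(np.diff(gt_pos, axis=0), axis=1)
    dists = np.concatenate([[0.0], np.cumsum(steps)])          # travelled distance per frame
    errs = []
    for i in range(len(gt_pos)):
        j = np.searchsorted(dists, dists[i] + seg_len)         # frame ~seg_len metres ahead
        if j >= len(gt_pos):
            break
        drift_est = est_pos[j] - est_pos[i]
        drift_gt = gt_pos[j] - gt_pos[i]
        errs.append(np.linalg.norm(drift_est - drift_gt) / seg_len)
    return 100.0 * np.mean(errs)
```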
Note that the relative error and the absolute error measure different aspects and are not necessarily positively correlated; a small relative error does not imply a small absolute error. In addition, VLO below denotes the laser-visual fusion method proposed by the present invention with the loop detection step removed from the overall scheme, and VL-SLAM denotes VLO with loop detection added, i.e., the complete invention.
The odometer method provided by the present invention differs from a pure laser odometer in that the motion information computed by the visual method assists in removing the motion distortion of the laser point cloud and provides a better initial value for the laser odometer, preventing the laser odometer from falling into a local optimum.
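As an illustration of this distortion removal, the sketch below re-projects each point of a sweep with a pose interpolated from the visual frame-to-frame motion according to the point's horizontal angle. The angle-to-ratio mapping, the use of scipy's Slerp and the function signature are assumptions made for the example, not details fixed by the invention.

```python
# Sketch of removing motion distortion from one LiDAR sweep using the visual motion estimate.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def undistort_sweep(points, R_motion, t_motion):
    """points: Nx3 raw sweep; (R_motion, t_motion): frame-to-frame motion from the visual odometer."""
    theta = np.arctan2(points[:, 1], points[:, 0])                       # horizontal deflection angle
    ratio = (theta - theta.min()) / (theta.max() - theta.min() + 1e-9)   # 0 at sweep start, 1 at end

    key_rots = Rotation.from_matrix(np.stack([np.eye(3), R_motion]))
    slerp = Slerp([0.0, 1.0], key_rots)                                  # interpolate the rotation
    out = np.empty_like(points)
    for i, (p, s) in enumerate(zip(points, ratio)):
        R_i = slerp([s]).as_matrix()[0]
        t_i = s * t_motion                                               # interpolate the translation
        out[i] = R_i @ p + t_i                                           # move point to a common sweep frame
    return out
```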
In the experiments, the proposed VLO (Visual-LiDAR Odometry) is compared with LOAM, the most advanced open-source pure laser odometer. In addition, to better verify positioning accuracy, VLO is also compared with ORB-SLAM2, the current state-of-the-art open-source SLAM system.
Experiment 1: comparison of positioning accuracy of original point cloud data
TABLE 1 comparison of positioning accuracy of VLO (raw Point clouds) and LOAM (raw Point clouds) on KITTI
This experiment uses the raw point cloud data; the positioning accuracy comparison is shown in Table 1. Compared with the pure laser odometer LOAM, the proposed VLO, which fuses laser and visual information, greatly improves both relative and absolute positioning accuracy. On the one hand, reliable visual motion information replaces the uniform-motion assumption and helps remove the motion distortion of the laser point cloud; on the other hand, it provides a good initial value for the ICP in the laser odometer, which accelerates convergence and avoids falling into degenerate environments.
Experiment 2: verification of distortion removal effect
To verify the effect of removing the laser point cloud motion distortion with the visual method, the distortion-removal switch in LOAM is turned off and the already undistorted point clouds from the odometry data set are used instead; the resulting positioning accuracy is then compared with that of VLO, which removes motion distortion by the visual method, and the relative and absolute errors are computed on each sequence. The VLO experiment is not performed on sequence 03 because the raw point cloud is missing. The results are shown in Table 2.
TABLE 2 comparison of positioning accuracy of VLO (original point cloud) and LOAM (undistorted point cloud) on KITTI
As can be seen from Table 2, LOAM using the undistorted point clouds is comparable in accuracy on most sequences to the proposed VLO using the raw point clouds, with the difference lying within the error range, which indicates that the distortion removal in VLO is already close to the ground truth.
Experiment 3: run time comparison
Besides positioning accuracy, time performance is also one of the targets of the present invention. In the motion estimation, the matching relationship between edge points and face points changes after each completed iteration, and the ICP iteration must be run again after re-matching. Lacking a good initial motion estimate, LOAM often needs more re-matching and ICP rounds; its maximum number of re-matchings is set to 5. Since VLO already has a good visual motion estimate, its maximum number of re-matchings is set to 3; as Table 2 shows, the smaller number of re-matchings does not affect the positioning accuracy of VLO.
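The edge points and face (surface) points referred to here are selected with a LOAM-style smoothness measure, the same idea formalized in claim 6 below. A simplified per-scan-line sketch follows; the neighbourhood size, the thresholds and the feature counts are placeholder values.

```python
# Illustrative extraction of edge and surface points from one ordered scan line.
import numpy as np

def smoothness(scan_line, k=5):
    """scan_line: Nx3 points ordered by scan angle; returns one smoothness value per point."""
    n = len(scan_line)
    c = np.full(n, np.nan)
    for i in range(k, n - k):
        neighbours = np.vstack([scan_line[i - k:i], scan_line[i + 1:i + 1 + k]])
        diff = np.sum(scan_line[i] - neighbours, axis=0)            # sum of offsets to neighbours
        c[i] = np.linalg.norm(diff) / (2 * k * np.linalg.norm(scan_line[i]) + 1e-9)
    return c

def select_features(scan_line, t_plane=0.01, t_edge=0.1, n_plane=4, n_edge=2):
    c = smoothness(scan_line)
    order = np.argsort(c)                                           # ascending smoothness
    planes = [i for i in order if c[i] < t_plane][:n_plane]         # smoothest -> surface points
    edges = [i for i in order[::-1] if c[i] > t_edge][:n_edge]      # sharpest -> edge points
    return planes, edges
```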
For LOAM, the average running time per frame is considered; for VLO, since visual motion information is needed as a prior, the average running times of the visual odometer and the laser odometer are considered separately. The comparison is shown in Table 3.
TABLE 3 LOAM vs VLO runtime on KITTI
As can be seen from Table 3, although the vision module added to VLO brings extra time consumption to the system, the good initial solution reduces the burden of the laser odometer to a certain extent: a good pose can be obtained with only a few iterations of optimization, so the total time cost increases little and real-time performance (less than 100 milliseconds per frame) is achieved on KITTI.
Experiment 4: comparison with the accuracy of a pure vision odometer
This experiment compares VLO with the most advanced open-source system ORB-SLAM2 (with its loop closure module disabled). Because a pure monocular odometer lacks scale information and cannot be compared fairly, the binocular (stereo) configuration of ORB-SLAM2 is used; it estimates depth from stereo disparity with a calibrated binocular camera, but compared with laser measurements the depth obtained from binocular vision has lower confidence and requires continuous iterative optimization.
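The lower confidence of binocular depth can be seen from the disparity-to-depth relation z = f·b/d: a fixed disparity error causes a depth error that grows roughly quadratically with depth. The focal length and baseline in the snippet below are placeholder values, not the actual KITTI calibration.

```python
# Quick illustration of stereo depth uncertainty from the relation z = f*b/d.
f_px, baseline_m = 700.0, 0.54          # placeholder calibration values

def stereo_depth(disparity_px: float) -> float:
    return f_px * baseline_m / disparity_px

for d in (60.0, 6.0):                   # large vs small disparity (near vs far point)
    z = stereo_depth(d)
    z_err = stereo_depth(d - 0.5) - z   # effect of a 0.5 px disparity error
    print(f"disparity {d:>5.1f}px -> depth {z:6.2f} m, error from 0.5px: {z_err:5.2f} m")
```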
TABLE 4 comparison of location accuracy of VLO (raw Point cloud) and ORB-SLAM2-stereo on KITTI
As can be seen from Table 4, the two methods each perform better on different sequences because their main data sources differ: binocular vision adds visual information from the right camera, while the lidar adds point cloud structure information of the scene, so each has advantageous scenes (for example, the binocular odometer of ORB-SLAM2 performs better on KITTI-04 and KITTI-10, while VLO is more accurate on KITTI-00 and KITTI-07). Moreover, the translation error of the proposed VLO is smaller than that of ORB-SLAM2-stereo on most of the sequences.
Similarly, with the loop constraint added, the positioning accuracy of the proposed VL-SLAM is further improved, as shown in Table 5.
TABLE 5 location accuracy comparison of VL-SLAM on KITTI Loop data
Therefore, after loop detection is added, the positioning precision is obviously improved.
In addition, comparisons between the VL-SLAM trajectories and the ground-truth trajectories are shown in FIGS. 4-1 to 4-3, which show the trajectories on the KITTI-00, KITTI-05 and KITTI-09 sequences respectively; the estimated trajectories essentially coincide with the ground truth, indicating very high accuracy. FIG. 5 shows the mapping result of the present invention on KITTI-07.
The present invention builds a SLAM system with visual-laser information fusion on the basis of the laser odometer system LOAM. The front end mainly consists of a visual odometer and a laser odometer. In the visual odometer, sparse depth is obtained by projecting the laser point cloud onto the image, the sparse depth map is then completed, and a better initial pose is estimated through inter-frame feature matching and minimization of the reprojection error. This initial pose is fed into the laser odometer for further point cloud distortion removal and registration; the good initial value both prevents the point cloud registration from falling into degenerate configurations and accelerates its iterative convergence.
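A compact sketch of the invert / dilate / fill / filter / re-invert completion of the sparse depth map is shown below; the kernel sizes, the 0.1 m validity threshold and the 100 m depth cap are assumptions for the example and are not values specified by the invention.

```python
# Sketch of classical morphological completion of a sparse LiDAR depth map.
import cv2
import numpy as np

def complete_sparse_depth(sparse, d_max=100.0):
    depth = sparse.astype(np.float32).copy()
    valid = depth > 0.1
    depth[valid] = d_max - depth[valid]                                   # invert valid depths

    depth = cv2.dilate(depth, np.ones((5, 5), np.uint8))                  # grow measurements into gaps
    empty = depth < 0.1
    depth[empty] = cv2.dilate(depth, np.ones((9, 9), np.uint8))[empty]    # fill larger holes

    depth = cv2.medianBlur(depth, 5)                                      # remove outliers
    depth = cv2.GaussianBlur(depth, (5, 5), 0)                            # smooth the result
    valid = depth > 0.1
    depth[valid] = d_max - depth[valid]                                   # re-invert to metric depth
    return depth
```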
In addition, a loop detection and relocalization module is added at the back end of the odometer. It adopts a visual bag-of-words model and maintains a feature dictionary organized as a multi-branch tree built by hierarchical clustering, which guarantees retrieval robustness while accelerating feature retrieval. When a loop is detected, the entire trajectory loop is optimized globally by pose graph optimization. This module not only eliminates the error accumulated over long-term motion, but also gives the system the ability to relocalize after being "kidnapped" and returning to a previously visited part of the trajectory.
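As a toy illustration of bag-of-words loop detection, the sketch below clusters ORB descriptors into a small flat vocabulary and compares normalized word histograms by cosine similarity. A real system, like the hierarchical multi-branch tree described above, clusters the vocabulary recursively and adds TF-IDF weighting, which is omitted here for brevity.

```python
# Toy bag-of-visual-words similarity for loop candidates (flat k-means vocabulary).
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

orb = cv2.ORB_create(nfeatures=500)

def describe(gray):
    _, desc = orb.detectAndCompute(gray, None)   # ORB keypoints and binary descriptors
    return desc.astype(np.float32)

def build_vocabulary(descriptor_sets, n_words=200):
    all_desc = np.vstack(descriptor_sets)        # descriptors from many keyframes
    return MiniBatchKMeans(n_clusters=n_words, n_init=3).fit(all_desc)

def bow_vector(desc, vocab):
    words = vocab.predict(desc)                  # assign each descriptor to a visual word
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-9)

def similarity(v1, v2):
    return float(np.dot(v1, v2))                 # cosine similarity of normalized histograms
```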
The following conclusions can be drawn from comparative experiments:
1) With raw point clouds, the proposed VLO greatly improves positioning accuracy compared with the pure laser odometer, mainly because the visual motion information replaces the previous uniform-motion assumption when removing point cloud motion distortion, improving the overall accuracy of the algorithm.
2) By adding visual prior motion information, VLO can prevent the laser odometer from falling into degenerate environments in some special scenes (such as KITTI-02), improving the robustness of the algorithm.
3) VLO runs in real time on a PC (less than 100 ms per frame), consuming 88 ms per frame on KITTI, only slightly slower than the 82 ms of the pure laser odometer LOAM; the difference is essentially negligible.
4) Adding the loop detection module reduces the absolute error to a certain extent and benefits positioning during long-term motion.
The foregoing merely illustrates specific embodiments of the present invention. It is obvious that the invention is not limited to the above embodiments and that many variations are possible. All modifications that a person skilled in the art can derive or infer directly from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (8)

1. A simultaneous localization and map construction method based on vision and laser radar is characterized by comprising the following steps:
step 1: operating a visual odometer and a laser odometer simultaneously, wherein the visual odometer uses a camera to acquire image information of the surrounding environment in real time to obtain image data, and the laser odometer uses a laser radar to emit laser and receive the light reflected by surrounding measured objects for ranging, obtaining structured information of the surrounding environment as laser point cloud data;
step 2: projecting the laser point cloud to a pixel plane by using an external reference result calibrated by combining a laser radar and a camera, and converting the laser point cloud into depth information of an image; taking the depth information as the scale constraint of the visual odometer, matching visual feature points from a previous frame to a current frame in the image data, and calculating visual motion information;
and step 3: removing the motion distortion of the laser point cloud of the current frame by using the visual motion information from the previous frame to the current frame; taking the visual motion information obtained in step 2 as an initial pose, registering the laser point cloud against the local point cloud map to construct an environment map, and continuously performing loop detection and relocalization using keyframes to optimize the poses of the keyframes and the map points.
2. The vision and lidar based simultaneous localization and mapping method according to claim 1, wherein after the laser point cloud is projected onto the pixel plane in step 2, a sparse depth map is generated and then completed to obtain the depth information of the image; the completion comprises the following steps:
step 2.1.1: inverting effective depth information in the sparse depth map:
\hat{D}_x = \begin{cases} D_{\max} - D_x, & D_x \neq 0 \\ 0, & D_x = 0 \end{cases}

in the formula, \hat{D}_x is the inverted depth value of a visual feature point, D_max is a preset maximum depth value, and D_x is the depth value of the visual feature point before inversion; D_x ≠ 0 indicates that depth information exists, i.e., the depth is valid, and D_x = 0 indicates that there is no depth information;
step 2.1.2: expanding a depth information-free region in the sparse depth map;
step 2.1.3: and continuously filling the expanded region without the depth information, then performing median filtering and Gaussian filtering, and finally performing depth re-inversion to recover the depth information.
3. The simultaneous localization and mapping method based on vision and lidar of claim 1, wherein in the visual feature point matching process of step 2, a weighted combination of two different types of matching-point error terms is used as the final reprojection error; specifically:
for a feature point p_1 with depth information, it is back-projected into three-dimensional space and re-projected onto the other image plane to obtain a point \hat{p}_1'; with the true matching point located at p_1', the distance ‖\hat{p}_1' − p_1'‖ constitutes the error term of the feature point p_1 with depth information;
for a feature point p_2 without depth information, its back-projection into three-dimensional space may lie anywhere on the ray Op_2, where O is the origin of the three-dimensional space; the ray Op_2 is projected onto the other image plane to obtain a straight line, and the distance from the position of the matching point p_2' to this straight line constitutes the error term of the feature point p_2 without depth information;
the final reprojection error is expressed as:
T^* = \arg\min_T \left[ \sum_{i=1}^{n} \left\| v_{(1,i)} - \pi\left(T\,\pi^{-1}(u_{(1,i)}, d_i)\right) \right\|^2 + \lambda \sum_{i=1}^{m} \mathrm{dist}\left( v_{(2,i)},\; T\,\phi^{-1}(u_{(2,i)}) \right) \right]

in the formula, T* denotes the result of the error-term optimization and T denotes the relative pose between the two frames; n denotes the number of feature points with depth information and m the number of feature points without depth information; π is the projection function that projects a three-dimensional point into the pixel coordinate system, and π^{-1} is the back-projection function that restores pixel coordinates and the corresponding depth to a three-dimensional point; u_{(1,i)} denotes the pixel coordinates in the left view of the i-th point of the first type, i.e., points with depth information, d_i denotes the depth information of the i-th point of the first type, and v_{(1,i)} denotes the pixel coordinates in the right view of the i-th point of the first type; u_{(2,i)} denotes the pixel coordinates in the left view of the i-th point of the second type, i.e., points without depth information, and φ^{-1} is the back-projection function that projects a pixel point into a three-dimensional ray; ‖·‖² denotes the squared distance (modular length), t is a threshold value, λ is a weight coefficient adjusting the trust between the two types of points, and dist(·) denotes the point-to-ray distance function.
4. The vision and lidar based simultaneous localization and mapping method according to claim 3, wherein the calculating of the vision motion information in step 2 is specifically:
step 2.2.1: extracting and matching the characteristics of the previous frame image and the current frame image;
step 2.2.2: recovering the depth of the matching feature points of the previous frame by using the previous frame image and the laser point cloud of the previous frame after the motion distortion is removed;
step 2.2.3: counting the number of feature points with depth information in the previous frame, if the number of feature points is less than a threshold value, returning error information, and using a uniform motion hypothesis; otherwise, carrying out PnP calculation on the feature points with depth information in the previous frame and the feature points without depth information in the current frame to obtain visual motion information from the previous frame to the current frame;
step 2.2.4: and (5) optimizing the reprojection error to obtain the optimal pose estimation.
5. The vision and lidar based simultaneous localization and mapping method according to claim 1, wherein the removing of the motion distortion of the laser point cloud of the current frame by using the vision motion information from the previous frame to the current frame in step 3 specifically comprises:
step 3.1: for each three-dimensional point in the current frame laser point cloud, determining a horizontal deflection angle theta of each three-dimensional point relative to the initial direction;
step 3.2: performing linear interpolation between a previous frame and a current frame according to the horizontal deflection angle theta, and projecting distorted point clouds to the positions where motion distortion is removed according to the result of the linear interpolation;
step 3.3: and (3) further optimizing the result of the step (3.2), extracting edge point features and surface point features for matching, and calculating an error term, wherein the error term of the edge point features is as follows:
d_E = \frac{\left\| \left(T\,X^L_{(k,i)} - X^L_{(k-1,j)}\right) \times \left(T\,X^L_{(k,i)} - X^L_{(k-1,l)}\right) \right\|}{\left\| X^L_{(k-1,j)} - X^L_{(k-1,l)} \right\|}

in the formula, X^L_{(k,i)}, X^L_{(k-1,j)} and X^L_{(k-1,l)} are the three-dimensional coordinates of the points i, j and l in the laser coordinate system {L}; the subscripts k and k-1 denote the current frame and the previous frame; T denotes the relative pose between the two frames;
the error term of the face point feature is:
d_H = \frac{\left| \left(T\,X^L_{(k,i)} - X^L_{(k-1,j)}\right) \cdot \left( \left(X^L_{(k-1,j)} - X^L_{(k-1,l)}\right) \times \left(X^L_{(k-1,j)} - X^L_{(k-1,m)}\right) \right) \right|}{\left\| \left(X^L_{(k-1,j)} - X^L_{(k-1,l)}\right) \times \left(X^L_{(k-1,j)} - X^L_{(k-1,m)}\right) \right\|}

in the formula, X^L_{(k-1,m)} denotes the three-dimensional coordinates of the point m in the laser coordinate system {L}; l, m and j are the three face points nearest to the point i;
the optimization loss function is expressed as:
T^* = \arg\min_T \left( \sum_{i \in \mathcal{E}_k} d_{E,i}^2 + \lambda \sum_{i \in \mathcal{H}_k} d_{H,i}^2 \right)

in the formula, \mathcal{E}_k denotes the set of edge points in the k-th frame laser coordinate system {L}, and \mathcal{H}_k denotes the set of face points in the k-th frame laser coordinate system {L}; λ denotes the relative weight coefficient between the two types of error terms; both error terms contain the parameter T, which denotes the relative pose between the two frames.
6. The vision and lidar based simultaneous localization and mapping method according to claim 5, wherein the method for extracting the edge point feature and the face point feature comprises:
step 3.3.1: dividing the three-dimensional laser radar point cloud into a plurality of two-dimensional laser radar point cloud sets with different pitch angles, and defining a smoothness c for all points in each beam of laser point cloud through the position relation between each point and adjacent points thereof:
c = \frac{1}{|S| \cdot \left\| X^L_{(k,n,i)} \right\|} \left\| \sum_{j \in S,\, j \neq i} \left( X^L_{(k,n,i)} - X^L_{(k,n,j)} \right) \right\|

in the formula, X^L_{(k,n,i)} denotes the three-dimensional coordinates, in the laser coordinate system {L}, of the i-th point of the n-th beam of the k-th frame point cloud, and S denotes the set of all points within the polar-angle neighbourhood of the i-th point;
step 3.3.2: sorting the points in a beam of point cloud according to the smoothness c, and setting two thresholds T_p and T_e;
when the smoothness of a point is less than T_p, the point is sufficiently close to its neighbouring points and is taken as a face point;
when the smoothness of a point is greater than T_e, the point is far from its neighbouring points and is taken as an edge point;
when the numbers of face points and edge points exceed the thresholds, only the top N_p face point features and the top N_e edge point features are extracted, N_p and N_e being the maximum numbers of face points and edge points.
7. The vision and lidar based simultaneous localization and mapping method according to claim 1, wherein in step 3 the keyframe strategy is used for loop detection and relocalization; after similar keyframes are found, the relative pose of the two frames is solved through feature matching and PnP, the visual motion information of step 2 is optimized, and the sum of the reprojection errors between the two frames after the depth information is recovered is expressed as:
T^* = \arg\min_T \sum_i \left( \left\| u_i - \pi\left(T^{-1}\,\pi^{-1}(v_i, d_{2,i})\right) \right\|^2 + \left\| v_i - \pi\left(T\,\pi^{-1}(u_i, d_{1,i})\right) \right\|^2 \right)

wherein T* denotes the result of the error-term optimization, T denotes the relative pose between the two frames, u_i denotes the pixel coordinates of the i-th matching point in the first image, v_i denotes the pixel coordinates of the i-th matching point in the second image, d denotes the depth information of the pixel points in the two images, π is the projection function that projects a three-dimensional point into the pixel coordinate system, π^{-1} is the back-projection function that restores pixel coordinates and the corresponding depth to a three-dimensional point, d_{1,i} denotes the depth of the i-th matching point in the first image, d_{2,i} denotes the depth of the i-th matching point in the second image, and T^{-1} is the inverse of the relative pose;
the final optimization result is the pose constraint between two frames.
8. The simultaneous vision and lidar-based localization and mapping method according to claim 1, wherein the keyframe of step 3 satisfies the following condition:
1) the number of feature points matched with the previous frame is greater than N_min, ensuring tracking quality, N_min being the minimum matching threshold;
2) the number of feature points matched with the last keyframe is less than N_max, avoiding redundant storage, N_max being the maximum matching threshold;
3) at least M frames have passed since the last keyframe, limiting the minimum spacing and reducing storage, M being the minimum interval threshold.
CN202011117756.6A 2020-10-19 2020-10-19 Simultaneous positioning and map construction method based on vision and laser radar Withdrawn CN112258600A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011117756.6A CN112258600A (en) 2020-10-19 2020-10-19 Simultaneous positioning and map construction method based on vision and laser radar

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011117756.6A CN112258600A (en) 2020-10-19 2020-10-19 Simultaneous positioning and map construction method based on vision and laser radar

Publications (1)

Publication Number Publication Date
CN112258600A true CN112258600A (en) 2021-01-22

Family

ID=74244771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011117756.6A Withdrawn CN112258600A (en) 2020-10-19 2020-10-19 Simultaneous positioning and map construction method based on vision and laser radar

Country Status (1)

Country Link
CN (1) CN112258600A (en)

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925322A (en) * 2021-01-26 2021-06-08 哈尔滨工业大学(深圳) Autonomous positioning method of unmanned vehicle in long-term scene
CN112965063A (en) * 2021-02-11 2021-06-15 深圳市安泽智能机器人有限公司 Robot mapping and positioning method
CN113012191A (en) * 2021-03-11 2021-06-22 中国科学技术大学 Laser mileage calculation method based on point cloud multi-view projection graph
CN113012197A (en) * 2021-03-19 2021-06-22 华南理工大学 Binocular vision odometer positioning method suitable for dynamic traffic scene
CN113031002A (en) * 2021-02-25 2021-06-25 桂林航天工业学院 SLAM running car based on Kinect3 and laser radar
CN113066152A (en) * 2021-03-18 2021-07-02 内蒙古工业大学 AGV map construction method and system
CN113126115A (en) * 2021-04-06 2021-07-16 北京航空航天大学杭州创新研究院 Semantic SLAM method and device based on point cloud, electronic equipment and storage medium
CN113327296A (en) * 2021-06-28 2021-08-31 浙江大学 Laser radar and camera online combined calibration method based on depth weighting
CN113359810A (en) * 2021-07-29 2021-09-07 东北大学 Unmanned aerial vehicle landing area identification method based on multiple sensors
CN113379841A (en) * 2021-06-21 2021-09-10 上海仙工智能科技有限公司 Laser SLAM method based on phase correlation method and factor graph and readable storage medium thereof
CN113450410A (en) * 2021-06-29 2021-09-28 浙江大学 Monocular depth and pose joint estimation method based on epipolar geometry
CN113506369A (en) * 2021-07-13 2021-10-15 阿波罗智能技术(北京)有限公司 Method, apparatus, electronic device, and medium for generating map
CN113587933A (en) * 2021-07-29 2021-11-02 山东山速机器人科技有限公司 Indoor mobile robot positioning method based on branch-and-bound algorithm
CN113625288A (en) * 2021-06-15 2021-11-09 中国科学院自动化研究所 Camera and laser radar pose calibration method and device based on point cloud registration
CN113624221A (en) * 2021-06-30 2021-11-09 同济人工智能研究院(苏州)有限公司 2.5D map construction method fusing vision and laser
CN113625271A (en) * 2021-07-29 2021-11-09 中汽创智科技有限公司 Millimeter wave radar and binocular camera based simultaneous positioning and image building method
CN113689485A (en) * 2021-08-25 2021-11-23 北京三快在线科技有限公司 Method and device for determining depth information of unmanned aerial vehicle, unmanned aerial vehicle and storage medium
CN113777615A (en) * 2021-07-19 2021-12-10 派特纳(上海)机器人科技有限公司 Positioning method and system of indoor robot and cleaning robot
CN113781582A (en) * 2021-09-18 2021-12-10 四川大学 Synchronous positioning and map creating method based on laser radar and inertial navigation combined calibration
CN113820697A (en) * 2021-09-09 2021-12-21 中国电子科技集团公司第五十四研究所 Visual positioning method based on urban building characteristics and three-dimensional map
CN113837277A (en) * 2021-09-24 2021-12-24 东南大学 Multisource fusion SLAM system based on visual point-line feature optimization
CN113947639A (en) * 2021-10-27 2022-01-18 北京斯年智驾科技有限公司 Self-adaptive online estimation calibration system and method based on multi-radar-point cloud line characteristics
CN113947665A (en) * 2021-09-14 2022-01-18 广西大学 Method for constructing map of spherical hedge trimmer based on multi-line laser radar and monocular vision
CN114088087A (en) * 2022-01-21 2022-02-25 深圳大学 High-reliability high-precision navigation positioning method and system under unmanned aerial vehicle GPS-DENIED
CN114154117A (en) * 2021-06-15 2022-03-08 元橡科技(苏州)有限公司 SLAM method
CN114170280A (en) * 2021-12-09 2022-03-11 北京能创科技有限公司 Laser odometer method, system and device based on double windows
CN114199235A (en) * 2021-11-29 2022-03-18 珠海一微半导体股份有限公司 Positioning system and positioning method based on sector depth camera
CN114440892A (en) * 2022-01-27 2022-05-06 中国人民解放军军事科学院国防科技创新研究院 Self-positioning method based on topological map and odometer
WO2022170847A1 (en) * 2021-02-09 2022-08-18 中国科学院深圳先进技术研究院 Online calibration method based on laser and visual fusion
CN115015955A (en) * 2022-05-23 2022-09-06 天津卡尔狗科技有限公司 Method, apparatus, device, storage medium and program product for determining motion information
CN115079199A (en) * 2022-08-22 2022-09-20 山东省科学院海洋仪器仪表研究所 Underwater target multi-mode information sensing system and method
CN115267796A (en) * 2022-08-17 2022-11-01 深圳市普渡科技有限公司 Positioning method, positioning device, robot and storage medium
CN115631319A (en) * 2022-11-02 2023-01-20 北京科技大学 Loopback detection method based on cross attention network
CN115631314A (en) * 2022-12-19 2023-01-20 中汽研(天津)汽车工程研究院有限公司 Point cloud map construction method based on multiple features and self-adaptive key frames
WO2023004956A1 (en) * 2021-07-30 2023-02-02 西安交通大学 Laser slam method and system in high-dynamic environment, and device and storage medium
CN115797490A (en) * 2022-12-19 2023-03-14 广州宸境科技有限公司 Drawing construction method and system based on laser vision fusion
CN116071550A (en) * 2023-02-09 2023-05-05 安徽海博智能科技有限责任公司 Laser radar dust point cloud filtering method
CN116295457A (en) * 2022-12-21 2023-06-23 辉羲智能科技(上海)有限公司 Vehicle vision positioning method and system based on two-dimensional semantic map
CN116299500A (en) * 2022-12-14 2023-06-23 江苏集萃清联智控科技有限公司 Laser SLAM positioning method and device integrating target detection and tracking
CN116859360A (en) * 2023-09-04 2023-10-10 深圳安智杰科技有限公司 Motion estimation method and system based on single-frame millimeter wave Lei Dadian cloud
CN112881999B (en) * 2021-01-25 2024-02-02 上海西虹桥导航技术有限公司 Semi-automatic calibration method for multi-line laser radar and vision sensor
CN113379841B (en) * 2021-06-21 2024-04-30 上海仙工智能科技有限公司 Laser SLAM method based on phase correlation method and factor graph and readable storage medium thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446815A (en) * 2016-09-14 2017-02-22 浙江大学 Simultaneous positioning and map building method
CN110009739A (en) * 2019-01-29 2019-07-12 浙江省北大信息技术高等研究院 The extraction and coding method of the motion feature of the digital retina of mobile camera

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446815A (en) * 2016-09-14 2017-02-22 浙江大学 Simultaneous positioning and map building method
CN110009739A (en) * 2019-01-29 2019-07-12 浙江省北大信息技术高等研究院 The extraction and coding method of the motion feature of the digital retina of mobile camera

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LI JINYU,YANG BANGBANG,AND ETC,.: "Survey and evaluation of monocular visual-inertial SLAM algorithms for augmented reality", 《VIRTUAL REALITY & INTELLIGENT HARDWARE》 *
刘浩敏,章国锋,鲍虎军: "基于单目视觉的同时定位与地图构建方法综述", 《计算机辅助设计与图形学学报》 *
刘浩敏,章国锋,鲍虎军: "面向大尺度场景的单目同时定位与地图构建", 《中国科学:信息科学》 *
王宇伟: "基于视觉与激光雷达融合的鲁棒同时定位与地图构建", 《万方学位论文》 *

Cited By (69)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112881999B (en) * 2021-01-25 2024-02-02 上海西虹桥导航技术有限公司 Semi-automatic calibration method for multi-line laser radar and vision sensor
CN112925322A (en) * 2021-01-26 2021-06-08 哈尔滨工业大学(深圳) Autonomous positioning method of unmanned vehicle in long-term scene
WO2022170847A1 (en) * 2021-02-09 2022-08-18 中国科学院深圳先进技术研究院 Online calibration method based on laser and visual fusion
CN112965063A (en) * 2021-02-11 2021-06-15 深圳市安泽智能机器人有限公司 Robot mapping and positioning method
CN113031002B (en) * 2021-02-25 2023-10-24 桂林航天工业学院 SLAM accompany running trolley based on Kinect3 and laser radar
CN113031002A (en) * 2021-02-25 2021-06-25 桂林航天工业学院 SLAM running car based on Kinect3 and laser radar
CN113012191A (en) * 2021-03-11 2021-06-22 中国科学技术大学 Laser mileage calculation method based on point cloud multi-view projection graph
CN113012191B (en) * 2021-03-11 2022-09-02 中国科学技术大学 Laser mileage calculation method based on point cloud multi-view projection graph
CN113066152A (en) * 2021-03-18 2021-07-02 内蒙古工业大学 AGV map construction method and system
CN113012197B (en) * 2021-03-19 2023-07-18 华南理工大学 Binocular vision odometer positioning method suitable for dynamic traffic scene
CN113012197A (en) * 2021-03-19 2021-06-22 华南理工大学 Binocular vision odometer positioning method suitable for dynamic traffic scene
CN113126115B (en) * 2021-04-06 2023-11-17 北京航空航天大学杭州创新研究院 Semantic SLAM method and device based on point cloud, electronic equipment and storage medium
CN113126115A (en) * 2021-04-06 2021-07-16 北京航空航天大学杭州创新研究院 Semantic SLAM method and device based on point cloud, electronic equipment and storage medium
CN113625288A (en) * 2021-06-15 2021-11-09 中国科学院自动化研究所 Camera and laser radar pose calibration method and device based on point cloud registration
CN114154117B (en) * 2021-06-15 2022-08-23 元橡科技(苏州)有限公司 SLAM method
CN114154117A (en) * 2021-06-15 2022-03-08 元橡科技(苏州)有限公司 SLAM method
CN113379841B (en) * 2021-06-21 2024-04-30 上海仙工智能科技有限公司 Laser SLAM method based on phase correlation method and factor graph and readable storage medium thereof
CN113379841A (en) * 2021-06-21 2021-09-10 上海仙工智能科技有限公司 Laser SLAM method based on phase correlation method and factor graph and readable storage medium thereof
CN113327296A (en) * 2021-06-28 2021-08-31 浙江大学 Laser radar and camera online combined calibration method based on depth weighting
CN113327296B (en) * 2021-06-28 2022-04-12 浙江大学 Laser radar and camera online combined calibration method based on depth weighting
CN113450410B (en) * 2021-06-29 2022-07-26 浙江大学 Monocular depth and pose joint estimation method based on epipolar geometry
CN113450410A (en) * 2021-06-29 2021-09-28 浙江大学 Monocular depth and pose joint estimation method based on epipolar geometry
WO2023273169A1 (en) * 2021-06-30 2023-01-05 同济人工智能研究院(苏州)有限公司 Vision and laser-fused 2.5d map construction method
CN113624221A (en) * 2021-06-30 2021-11-09 同济人工智能研究院(苏州)有限公司 2.5D map construction method fusing vision and laser
CN113624221B (en) * 2021-06-30 2023-11-28 同济人工智能研究院(苏州)有限公司 2.5D map construction method integrating vision and laser
CN113506369A (en) * 2021-07-13 2021-10-15 阿波罗智能技术(北京)有限公司 Method, apparatus, electronic device, and medium for generating map
CN113777615A (en) * 2021-07-19 2021-12-10 派特纳(上海)机器人科技有限公司 Positioning method and system of indoor robot and cleaning robot
CN113777615B (en) * 2021-07-19 2024-03-29 派特纳(上海)机器人科技有限公司 Positioning method and system of indoor robot and cleaning robot
CN113625271A (en) * 2021-07-29 2021-11-09 中汽创智科技有限公司 Millimeter wave radar and binocular camera based simultaneous positioning and image building method
CN113625271B (en) * 2021-07-29 2023-10-27 中汽创智科技有限公司 Simultaneous positioning and mapping method based on millimeter wave radar and binocular camera
CN113359810A (en) * 2021-07-29 2021-09-07 东北大学 Unmanned aerial vehicle landing area identification method based on multiple sensors
CN113587933A (en) * 2021-07-29 2021-11-02 山东山速机器人科技有限公司 Indoor mobile robot positioning method based on branch-and-bound algorithm
CN113587933B (en) * 2021-07-29 2024-02-02 山东山速机器人科技有限公司 Indoor mobile robot positioning method based on branch-and-bound algorithm
WO2023004956A1 (en) * 2021-07-30 2023-02-02 西安交通大学 Laser slam method and system in high-dynamic environment, and device and storage medium
CN113689485A (en) * 2021-08-25 2021-11-23 北京三快在线科技有限公司 Method and device for determining depth information of unmanned aerial vehicle, unmanned aerial vehicle and storage medium
CN113820697B (en) * 2021-09-09 2024-03-26 中国电子科技集团公司第五十四研究所 Visual positioning method based on city building features and three-dimensional map
CN113820697A (en) * 2021-09-09 2021-12-21 中国电子科技集团公司第五十四研究所 Visual positioning method based on urban building characteristics and three-dimensional map
CN113947665B (en) * 2021-09-14 2022-10-28 广西大学 Method for constructing map of spherical hedge trimmer based on multi-line laser radar and monocular vision
CN113947665A (en) * 2021-09-14 2022-01-18 广西大学 Method for constructing map of spherical hedge trimmer based on multi-line laser radar and monocular vision
CN113781582A (en) * 2021-09-18 2021-12-10 四川大学 Synchronous positioning and map creating method based on laser radar and inertial navigation combined calibration
CN113781582B (en) * 2021-09-18 2023-09-19 四川大学 Synchronous positioning and map creation method based on laser radar and inertial navigation combined calibration
CN113837277A (en) * 2021-09-24 2021-12-24 东南大学 Multisource fusion SLAM system based on visual point-line feature optimization
CN113837277B (en) * 2021-09-24 2022-11-18 东南大学 Multisource fusion SLAM system based on visual point-line feature optimization
CN113947639B (en) * 2021-10-27 2023-08-18 北京斯年智驾科技有限公司 Self-adaptive online estimation calibration system and method based on multi-radar point cloud line characteristics
CN113947639A (en) * 2021-10-27 2022-01-18 北京斯年智驾科技有限公司 Self-adaptive online estimation calibration system and method based on multi-radar-point cloud line characteristics
CN114199235B (en) * 2021-11-29 2023-11-03 珠海一微半导体股份有限公司 Positioning system and positioning method based on sector depth camera
CN114199235A (en) * 2021-11-29 2022-03-18 珠海一微半导体股份有限公司 Positioning system and positioning method based on sector depth camera
CN114170280A (en) * 2021-12-09 2022-03-11 北京能创科技有限公司 Laser odometer method, system and device based on double windows
CN114170280B (en) * 2021-12-09 2023-11-28 北京能创科技有限公司 Laser odometer method, system and device based on double windows
CN114088087A (en) * 2022-01-21 2022-02-25 深圳大学 High-reliability high-precision navigation positioning method and system under unmanned aerial vehicle GPS-DENIED
CN114088087B (en) * 2022-01-21 2022-04-15 深圳大学 High-reliability high-precision navigation positioning method and system under unmanned aerial vehicle GPS-DENIED
CN114440892B (en) * 2022-01-27 2023-11-03 中国人民解放军军事科学院国防科技创新研究院 Self-positioning method based on topological map and odometer
CN114440892A (en) * 2022-01-27 2022-05-06 中国人民解放军军事科学院国防科技创新研究院 Self-positioning method based on topological map and odometer
CN115015955A (en) * 2022-05-23 2022-09-06 天津卡尔狗科技有限公司 Method, apparatus, device, storage medium and program product for determining motion information
CN115267796B (en) * 2022-08-17 2024-04-09 深圳市普渡科技有限公司 Positioning method, positioning device, robot and storage medium
CN115267796A (en) * 2022-08-17 2022-11-01 深圳市普渡科技有限公司 Positioning method, positioning device, robot and storage medium
CN115079199A (en) * 2022-08-22 2022-09-20 山东省科学院海洋仪器仪表研究所 Underwater target multi-mode information sensing system and method
CN115631319B (en) * 2022-11-02 2023-06-23 北京科技大学 Loop detection method based on cross attention network
CN115631319A (en) * 2022-11-02 2023-01-20 北京科技大学 Loopback detection method based on cross attention network
CN116299500B (en) * 2022-12-14 2024-03-15 江苏集萃清联智控科技有限公司 Laser SLAM positioning method and device integrating target detection and tracking
CN116299500A (en) * 2022-12-14 2023-06-23 江苏集萃清联智控科技有限公司 Laser SLAM positioning method and device integrating target detection and tracking
CN115631314A (en) * 2022-12-19 2023-01-20 中汽研(天津)汽车工程研究院有限公司 Point cloud map construction method based on multiple features and self-adaptive key frames
CN115797490A (en) * 2022-12-19 2023-03-14 广州宸境科技有限公司 Drawing construction method and system based on laser vision fusion
CN115631314B (en) * 2022-12-19 2023-06-09 中汽研(天津)汽车工程研究院有限公司 Point cloud map construction method based on multi-feature and self-adaptive key frames
CN116295457A (en) * 2022-12-21 2023-06-23 辉羲智能科技(上海)有限公司 Vehicle vision positioning method and system based on two-dimensional semantic map
CN116071550A (en) * 2023-02-09 2023-05-05 安徽海博智能科技有限责任公司 Laser radar dust point cloud filtering method
CN116071550B (en) * 2023-02-09 2023-10-20 安徽海博智能科技有限责任公司 Laser radar dust point cloud filtering method
CN116859360A (en) * 2023-09-04 2023-10-10 深圳安智杰科技有限公司 Motion estimation method and system based on single-frame millimeter wave Lei Dadian cloud
CN116859360B (en) * 2023-09-04 2023-11-24 深圳安智杰科技有限公司 Motion estimation method and system based on single-frame millimeter wave Lei Dadian cloud

Similar Documents

Publication Publication Date Title
CN112258600A (en) Simultaneous positioning and map construction method based on vision and laser radar
CN110223348B (en) Robot scene self-adaptive pose estimation method based on RGB-D camera
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN107301654B (en) Multi-sensor high-precision instant positioning and mapping method
CN110569704B (en) Multi-strategy self-adaptive lane line detection method based on stereoscopic vision
Yang et al. Monocular object and plane slam in structured environments
CN107025668B (en) Design method of visual odometer based on depth camera
US11270148B2 (en) Visual SLAM method and apparatus based on point and line features
CN111076733B (en) Robot indoor map building method and system based on vision and laser slam
CN107980150B (en) Modeling three-dimensional space
Clipp et al. Parallel, real-time visual SLAM
CN108615246B (en) Method for improving robustness of visual odometer system and reducing calculation consumption of algorithm
CN110009732B (en) GMS feature matching-based three-dimensional reconstruction method for complex large-scale scene
EP3293700B1 (en) 3d reconstruction for vehicle
CN113223045A (en) Vision and IMU sensor fusion positioning system based on dynamic object semantic segmentation
CN112419497A (en) Monocular vision-based SLAM method combining feature method and direct method
CN111144213A (en) Object detection method and related equipment
CN113994376A (en) Detection, three-dimensional reconstruction and tracking of a plurality of rigid objects moving relative to one another
Ceriani et al. Pose interpolation slam for large maps using moving 3d sensors
CN111998862A (en) Dense binocular SLAM method based on BNN
CN116449384A (en) Radar inertial tight coupling positioning mapping method based on solid-state laser radar
CN113658337A (en) Multi-mode odometer method based on rut lines
CN116878501A (en) High-precision positioning and mapping system and method based on multi-sensor fusion
CN112945233A (en) Global drift-free autonomous robot simultaneous positioning and map building method
CN116468786B (en) Semantic SLAM method based on point-line combination and oriented to dynamic environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20210122