CN116429087A - Visual SLAM method suitable for dynamic environment - Google Patents


Info

Publication number
CN116429087A
CN116429087A (application CN202310387172.8A)
Authority
CN
China
Prior art keywords
points, dynamic, point, algorithm, visual SLAM
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310387172.8A
Other languages
Chinese (zh)
Inventor
黎萍 (Li Ping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China Zhongshan Institute
Original Assignee
University of Electronic Science and Technology of China Zhongshan Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China Zhongshan Institute filed Critical University of Electronic Science and Technology of China Zhongshan Institute
Priority to CN202310387172.8A
Publication of CN116429087A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/38 Electronic maps specially adapted for navigation; Updating thereof
    • G01C21/3804 Creation or updating of map data
    • G01C21/3833 Creation or updating of map data characterised by the source of data
    • G01C21/3837 Data obtained from a single source


Abstract

The invention discloses a visual SLAM method suitable for dynamic environments. The method comprises four main threads: a tracking thread, a local mapping thread, a closed-loop detection thread and a dense mapping thread. To improve the adaptability of the visual SLAM system in dynamic environments, guarantee the quantity and quality of extracted feature points and improve system robustness, the deep-learning algorithm YOLOv4 and a dynamic-feature-point detection algorithm based on geometric constraints are added at the front end of the SLAM system: YOLOv4 first identifies dynamic objects, and the geometric constraints then further remove dynamic points. This avoids the influence of dynamic objects on feature extraction in the visual SLAM method, improves the mapping and localization accuracy of visual SLAM in dynamic environments, and effectively improves the robustness of the visual SLAM system.

Description

Visual SLAM method suitable for dynamic environment
Technical Field
The invention belongs to the field of visual SLAM (Simultaneous Localization and Mapping), and particularly relates to a visual SLAM method suitable for a dynamic environment.
Background
Current visual SLAM systems achieve excellent pose estimation and localization accuracy in static scenes. In dynamic scenes, however, a large number of moving objects interfere with the system's feature point extraction: the system usually treats feature points on dynamic objects as ordinary static points, which introduces large errors and makes pose estimation and map construction inaccurate. How to introduce data association appropriate to dynamic environments into SLAM systems, so as to achieve accurate localization and dense mapping, has therefore become a hotspot of current research.
If there are few dynamic objects in the environment, a visual SLAM system using a filtering method such as RANSAC (Random Sample Consensus) can still perform the tasks its designers expect. The ORB-SLAM2 system adopts RANSAC: because the motion of dynamic points differs greatly from that of static points, the filter can identify dynamic points as outliers and remove them. However, when there are many dynamic objects in the environment, the model's ability to detect outliers degrades sharply; dynamic and static points can no longer be distinguished accurately, and serious errors appear in the subsequent pose estimation. One solution in dynamic visual SLAM environments is to discard the moving points entirely when estimating the camera pose, which requires an algorithm that can accurately detect dynamic objects. Such algorithms currently fall into two main categories: deep-learning algorithms and geometric algorithms.
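To illustrate the inlier/outlier split that RANSAC performs, the following NumPy sketch fits a toy 2-D translation model to matched feature points; the translation model, thresholds and data are simplified stand-ins for the fundamental-matrix estimation an actual visual SLAM front end performs:

```python
import numpy as np

def ransac_translation(src, dst, iters=200, thresh=1.0, seed=0):
    """Toy RANSAC: fit a 2-D translation between matched points and
    split the matches into inliers (static-like) and outliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))       # minimal sample: one match
        t = dst[i] - src[i]              # candidate translation model
        err = np.linalg.norm(src + t - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# Eight consistent ("static") matches displaced by (5, 3),
# plus two inconsistent ("dynamic") matches.
src = np.array([[i, i] for i in range(8)], dtype=float)
dst = src + np.array([5.0, 3.0])
src = np.vstack([src, [[100.0, 100.0], [50.0, 60.0]]])
dst = np.vstack([dst, [[130.0, 90.0], [10.0, 10.0]]])
mask = ransac_translation(src, dst)
print(mask.sum())   # 8: only the consistent matches survive as inliers
```

The same pattern, with a fundamental matrix in place of the translation and the epipolar distance as the error, is what degrades when dynamic points dominate the sample pool.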
On the deep-learning side, a network first obtains a class label for each object, and the label is then used to judge whether the object is dynamic.
FlowFusion extends StaticFusion with segmentation and discrimination of dynamic point clouds, using the PWC-Net network to identify and then reject dynamic objects. Riazuelo et al. use object detection to reject feature points on dynamic objects, reducing the influence of pedestrians on pose estimation; however, their algorithm assumes the only dynamic objects are people, and since dynamic objects in real life are varied, it no longer applies in those varied situations. Xu et al. use the Mask R-CNN semantic-segmentation algorithm, which retains enough static feature points while filtering dynamic ones, but the model's huge computational cost makes the final system far from real-time. Bescós et al. propose the DynaSLAM system, which supports monocular, stereo and RGB-D cameras: with a stereo camera or other sensors, Mask R-CNN first segments the dynamic objects and only the static regions are used for mapping; for RGB-D cameras the system adds a step that judges whether an object is genuinely dynamic. Its greatest advantage is that it can build a more complete map, because instead of simply removing dynamic objects from the image it completes the occluded areas using information from other viewpoints. Wang et al. propose segmenting objects on the depth map, which slightly improves the system's efficiency. All of these algorithms rely on semantic segmentation, however, so their real-time performance is low.
Among geometric approaches to handling dynamic objects, Tan et al. propose the RANSAC-based RDSLAM algorithm, which can eliminate dynamic points that differ clearly from static points and also provides a strategy for updating keyframes online, so that frames dominated by motion can be replaced; but when dynamic points outnumber static points, its localization accuracy drops sharply. Dai et al. triangulate the feature points in an image frame into many triangles, judge from the change in length of the connecting edges across two images whether two points belong to the same target, directly remove the edges that do not, and take the region formed by the remaining triangles, thereby rejecting the dynamic target. Sun et al. propose a moving-object removal method that obtains the contour of the moving object from sparse optical flow and then segments the object further, but it assumes the camera never moves, so the method clearly needs improvement before it can be applied to SLAM systems.
The above information disclosed in the background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not form the prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention aims to improve the robustness of a SLAM system in dynamic environments and to avoid the influence of dynamic objects on mapping and localization. It provides a visual SLAM method suitable for dynamic environments that improves the SLAM system's robustness to dynamic environments while retaining its excellent mapping and loop-detection capabilities under ordinary conditions.
In order to achieve the above purpose, the present invention is realized by the following technical scheme:
a visual SLAM method adapted to dynamic environments, characterized by comprising four main threads: a tracking thread, a local mapping thread, a closed-loop detection thread and a dense mapping thread, and specifically comprising the following steps:
A. tracking thread: the SLAM system receives images from the camera; a deep-learning YOLOv4 network first identifies common dynamic objects such as people in the environment; after ORB features are extracted, dynamic feature points are eliminated using the epipolar geometric constraint; the thread outputs the camera pose corresponding to each frame of image for localization, performs local map tracking, selects keyframes and passes them to the local mapping thread and the dense mapping thread;
B. local mapping thread: receiving the keyframes output by the tracking thread, completing keyframe insertion and generating new map points; then optimizing with local bundle adjustment, and finally screening the inserted keyframes to remove redundant keyframes;
C. closed-loop detection thread: mainly comprising two processes, loop detection and loop correction, wherein loop detection first uses a bag of words to detect loop keyframes and then performs a similarity transformation with the Sim(3) algorithm, and loop correction performs loop fusion and optimizes the essential graph;
D. dense mapping thread: constructing a dense map with the PCL (Point Cloud Library); a static dense map is constructed from keyframes with dynamic points removed; because the obtained point cloud information is usually noisy and contains much redundant information, outliers are removed with the statistical filtering method of the PCL library and redundant point cloud information is removed with voxel filtering.
A visual SLAM method adapted to dynamic environments as described above, characterized in that: to improve the adaptability of the visual SLAM system in dynamic environments, guarantee the quantity and quality of extracted feature points and improve system robustness, the deep-learning algorithm YOLOv4 and a dynamic-feature-point detection algorithm based on geometric constraints are added at the front end of the SLAM system; YOLOv4 first identifies dynamic objects, and the geometric constraints then further remove dynamic points, as follows:
first, matching points in the image are screened with the RANSAC algorithm and wrong matches are removed; the remaining feature points are used to compute the fundamental matrix F and to solve for R and t;
the RANSAC algorithm divides the data to be processed into inliers and outliers, where the inliers are the points for which the model is expected to hold, i.e. valid feature points, excluding points on dynamic objects;
secondly, from the epipolar geometric relation between the pixel points of the two matched points and the solved R and t, the specific coordinates X1 and X2 of the space point P in the two camera coordinate systems are recovered;
third, check whether the recovered X1 and X2 satisfy the formula X2 = R·X1 + t; if they do, the space point P is a static point; if not, the judgment continues: X1 (or X2) is projected into the next frame image to obtain the pixel point p6, 3×3 pixel blocks centred at the matched pixel p5 and at p6 are constructed and denoted A and B respectively, and the degree of correlation between A and B is expressed by the normalized cross-correlation factor as follows:
S(A,B)_NCC = Σ_{i,j}[A(i,j) - Ā][B(i,j) - B̄] / √( Σ_{i,j}[A(i,j) - Ā]² · Σ_{i,j}[B(i,j) - B̄]² )

where Ā and B̄ denote the mean pixel values of blocks A and B.
The threshold is set to 0.9. If S(A,B)_NCC is greater than 0.9, the two points are considered similar and the space point P is judged to be a static point; otherwise they are dissimilar and P is judged to be a dynamic point.
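The NCC decision above, with its 0.9 threshold, can be sketched in a few lines of NumPy; the 3×3 sample blocks below are invented for illustration:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equal-size pixel blocks."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

A = np.array([[10, 12, 11], [13, 15, 14], [9, 11, 10]], dtype=float)
B = A + 2.0   # same texture, uniformly brighter: NCC is exactly 1.0
C = np.array([[50, 3, 40], [1, 60, 2], [45, 0, 55]], dtype=float)

print(ncc(A, B) > 0.9)   # True  -> blocks similar, point judged static
print(ncc(A, C) > 0.9)   # False -> blocks dissimilar, point judged dynamic
```

Because the block means are subtracted, the measure is insensitive to uniform brightness changes between frames, which is why it suits patch comparison across consecutive images.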
The visual SLAM method adapted to dynamic environment as described above is characterized in that the specific steps of the RANSAC algorithm are as follows:
(1) Randomly select four sample data (ensuring that they are not collinear) from the feature-point data set obtained by matching in the visual SLAM system, calculate the transformation matrix H, and denote the model M;
(2) Traverse all feature points using the transformation matrix H and calculate the projection error of each feature point against model M; if the error of a feature point P is smaller than the set error threshold, add P to the inlier set;
(3) Compare the currently obtained model M' with the previous model M and keep the model with more points in its inlier set;
(4) Repeat the above three steps until iteration finishes (i.e. the number of iterations reaches a preset value).
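As a sketch of step (2) above, the projection-error test that decides inlier membership can be written as follows, assuming a known candidate homography H for one iteration; the synthetic H, the point set and the 2-pixel threshold are illustrative choices, not values from the patent:

```python
import numpy as np

def project(H, pts):
    """Apply a 3x3 homography to Nx2 points (with homogeneous divide)."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def count_inliers(H, src, dst, thresh=2.0):
    """Step (2): projection error of every match under candidate model H."""
    err = np.linalg.norm(project(H, src) - dst, axis=1)
    return err < thresh

# Synthetic ground-truth homography: in-plane rotation plus translation.
theta = np.deg2rad(10)
H = np.array([[np.cos(theta), -np.sin(theta),  4.0],
              [np.sin(theta),  np.cos(theta), -1.0],
              [0.0,            0.0,            1.0]])
src = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]], dtype=float)
dst = project(H, src)
dst[-1] += [25.0, 25.0]   # one match displaced: a "dynamic" point
mask = count_inliers(H, src, dst)
print(mask.tolist())      # [True, True, True, True, False]
```

Repeating this over many random four-point samples and keeping the H with the largest inlier set is exactly the loop described in steps (1)-(4).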
Compared with the prior art, the invention has the following advantages:
the visual SLAM method suitable for dynamic environments adds YOLOv4 and a dynamic-point detection algorithm based on geometric features at the front end of the SLAM system: YOLOv4 first identifies dynamic objects, and geometric constraints then further reject dynamic points. This avoids the influence of dynamic objects on feature extraction in the visual SLAM method, improves the mapping and localization accuracy of visual SLAM in dynamic environments, and effectively improves the robustness of the visual SLAM system.
Drawings
FIG. 1 is a block diagram of a visual SLAM method of the present invention adapted to a dynamic environment;
FIG. 2 is a graph of the geometry of a dynamic point under multiple camera coordinate systems;
FIG. 3 is a flow chart of a dynamic point detection algorithm based on geometric constraints of the present invention;
FIG. 4 is a comparative graph of feature point extraction experiment results;
FIG. 5 is a comparison of the absolute trajectory error of ORB-SLAM2 and the algorithm of the present invention on the fr3_sitting_static sequence;
FIG. 6 is a comparison of the absolute trajectory error of ORB-SLAM2 and the algorithm of the present invention on the fr3_walking_halfsphere sequence;
FIG. 7 is a comparison of the absolute trajectory error of ORB-SLAM2 and the algorithm of the present invention on the fr3_walking_static sequence;
FIG. 8 is a comparison of the absolute trajectory error of ORB-SLAM2 and the algorithm of the present invention on the fr3_walking_xyz sequence;
FIG. 9 is the first comparison of ORB-SLAM2 and the algorithm of the present invention on dynamic-sequence APE (Absolute Pose Error);
FIG. 10 is the second comparison of ORB-SLAM2 and the algorithm of the present invention on dynamic-sequence APE (Absolute Pose Error);
FIG. 11 is the third comparison of ORB-SLAM2 and the algorithm of the present invention on dynamic-sequence APE (Absolute Pose Error);
FIG. 12 is the fourth comparison of ORB-SLAM2 and the algorithm of the present invention on dynamic-sequence APE (Absolute Pose Error);
FIG. 13 is a dense point cloud built by ORB-SLAM2;
FIG. 14 is a dense point cloud built by the algorithm of the present invention.
Detailed Description
The technical features of the present invention are described in further detail below with reference to the accompanying drawings so that those skilled in the art can understand the features.
As shown in fig. 1, the visual SLAM method of the present invention adapted to a dynamic environment is divided into four main threads: a tracking thread, a local mapping thread, a closed-loop detection thread and a dense mapping thread, as follows:
1. Tracking thread: the SLAM system receives images from the camera; a deep-learning YOLOv4 network first identifies common dynamic objects such as people in the environment; after ORB features are extracted, dynamic feature points are eliminated using the epipolar geometric constraint; the thread outputs the camera pose corresponding to each frame of image for localization, performs local map tracking, selects keyframes and passes them to the local mapping thread and the dense mapping thread.
2. Local mapping thread: receives the keyframes output by the tracking thread, completes keyframe insertion and generates new map points; then optimizes with local bundle adjustment (BA), and finally screens the inserted keyframes to remove redundant ones.
3. Closed-loop detection thread: mainly comprises two processes, loop detection and loop correction. Loop detection first uses a bag of words to detect loop keyframes and then computes a similarity transformation with the Sim(3) algorithm; loop correction performs loop fusion and optimizes the essential graph.
4. Dense mapping thread: builds a dense map with the PCL (Point Cloud Library). A static dense map is constructed from keyframes with dynamic points removed; because the obtained point cloud is usually noisy and contains much redundant information, outliers are removed with the statistical filtering method of the PCL library and redundant point cloud information is removed with voxel filtering.
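The two PCL filters named above can be approximated with a short NumPy sketch; this is an illustrative stand-in for PCL's StatisticalOutlierRemoval and VoxelGrid, and the neighbour count, standard-deviation ratio and voxel size below are arbitrary demonstration parameters, not the patent's settings:

```python
import numpy as np

def statistical_outlier_removal(points, k=5, std_ratio=1.0):
    """Drop points whose mean distance to their k nearest neighbours is more
    than std_ratio standard deviations above the global mean distance."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, 1:k + 1]      # skip the zero self-distance
    mean_d = knn.mean(axis=1)
    keep = mean_d < mean_d.mean() + std_ratio * mean_d.std()
    return points[keep]

def voxel_downsample(points, voxel=0.05):
    """Keep one representative point (the first seen) per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

rng = np.random.default_rng(1)
cloud = rng.uniform(0.0, 1.0, (200, 3))           # dense points in a unit cube
cloud = np.vstack([cloud, [[10.0, 10.0, 10.0]]])  # one far-away noise point
cloud = statistical_outlier_removal(cloud)        # removes the noise point
cloud = voxel_downsample(cloud, voxel=0.2)        # thins redundant points
print(cloud.shape)
```

The brute-force distance matrix is only suitable for small clouds; PCL uses a k-d tree for the neighbour search, which is what makes the same idea practical on full keyframe point clouds.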
The geometrical relationship of the dynamic point under a plurality of camera coordinate systems is shown in fig. 2, and the dynamic point detection algorithm based on geometrical constraint is shown in fig. 3.
First, matching points in the image are screened with the RANSAC algorithm and wrong matches are removed; the remaining feature points are used to compute the fundamental matrix F and to solve for R and t.
The RANSAC algorithm divides the data to be processed into inliers and outliers: the inliers are the points for which the model is expected to hold, i.e. valid feature points excluding those on dynamic objects, while the outliers are invalid data, i.e. the points on dynamic objects. In the visual SLAM system, the RANSAC algorithm specifically comprises the following steps:
(1) Randomly select four sample data (ensuring that they are not collinear) from the feature-point data set obtained by matching in the visual SLAM system, calculate the transformation matrix H, and denote the model M;
(2) Traverse all feature points using the transformation matrix H and calculate the projection error of each feature point against model M; if the error of a feature point P is smaller than the set error threshold, add P to the inlier set;
(3) Compare the currently obtained model M' with the previous model M and keep the model with more points in its inlier set;
(4) Repeat the above three steps until iteration finishes (i.e. the number of iterations reaches a preset value).
Secondly, from the epipolar geometric relation between the pixel points of the two matched points and the solved R and t, the specific coordinates X1 and X2 of the space point P in the two camera coordinate systems are recovered.
Third, check whether the recovered X1 and X2 satisfy the formula X2 = R·X1 + t. If they do, the space point P is a static point; if not, the judgment continues: X1 (or X2) is projected into the next frame image to obtain the pixel point p6, 3×3 pixel blocks centred at the matched pixel p5 and at p6 are constructed and denoted A and B respectively, and the degree of correlation between A and B is expressed by the normalized cross-correlation factor as follows:
S(A,B)_NCC = Σ_{i,j}[A(i,j) - Ā][B(i,j) - B̄] / √( Σ_{i,j}[A(i,j) - Ā]² · Σ_{i,j}[B(i,j) - B̄]² )

where Ā and B̄ denote the mean pixel values of blocks A and B.
The threshold is set to 0.9. If S(A,B)_NCC is greater than 0.9, the two points are considered similar and the space point P is judged to be a static point; otherwise they are dissimilar and P is judged to be a dynamic point.
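The third step's consistency check can be sketched as follows, assuming the reconstructed static-point relation X2 = R·X1 + t; the camera motion R, t, the test point and the tolerance are hypothetical values for illustration:

```python
import numpy as np

def is_static(X1, X2, R, t, tol=0.01):
    """A point is static if its second-frame coordinates agree with the
    rigid camera motion applied to its first-frame coordinates."""
    return np.linalg.norm(R @ X1 + t - X2) < tol

# Hypothetical camera motion between frames: small rotation about Z plus a shift.
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t = np.array([0.2, 0.0, 0.05])

X1 = np.array([1.0, 2.0, 5.0])
X2_static = R @ X1 + t                  # consistent with the camera motion
X2_moving = X2_static + [0.5, 0.0, 0.0] # the point itself moved as well

print(is_static(X1, X2_static, R, t))   # True  -> static point
print(is_static(X1, X2_moving, R, t))   # False -> candidate dynamic point
```

In practice the recovered coordinates carry triangulation noise, which is why a point failing this test is not rejected outright but passed on to the NCC patch comparison above.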
The algorithm of the application is demonstrated by the following experiments and comparisons:
To verify the effectiveness of the algorithm herein, test experiments were carried out on five dynamic sequences of the fr3 dynamic dataset in the TUM dataset of the Technical University of Munich: fr3_walking_xyz, fr3_walking_halfsphere, fr3_walking_static, fr3_sitting_static and fr3_sitting_xyz, each sequence being tested 5 times. These five sequences fall into two classes. "walking" denotes high dynamics: people walk back and forth with a large range of motion and can occupy 1/3 or even 1/2 of the frame, which is certainly a great challenge for the robustness of a visual SLAM system. "sitting" denotes low dynamics: the person mainly sits in a chair with little limb movement. The suffix in the name describes the camera motion, e.g. "xyz" means moving along the three coordinate axes, "halfsphere" means moving along a hemisphere, and "static" means the camera is stationary. The experimental environment is a Dell Precision 7820 Tower computer with an Intel Xeon Silver 4210R processor and an NVIDIA Quadro P2200 (4G) graphics card, running Ubuntu 16.04.
1. Feature point extraction experiment
The result of feature-point extraction after dynamic objects are identified by YOLOv4 alone is shown on the left of FIG. 4. Because YOLOv4 can misjudge people who are incomplete or blurred in the image, ORB-SLAM2 combined only with YOLOv4 does not remove all feature points on the dynamic object, so many feature points remain on the pedestrian's body; these cause subsequent pose-estimation errors and make the accuracy of the built map drop rapidly. The improved visual front end combines YOLOv4 with geometric constraints to remove dynamic points; the processed image is shown on the right of FIG. 4. Even when the image is blurred by pedestrian motion, the invention removes the feature points on the pedestrian well, and the subsequent system can estimate the pose accurately from the valid static feature points.
2. Positioning accuracy test experiment
The localization accuracy of ORB-SLAM2 and of the algorithm of the invention was compared on four dynamic sequences (covering low dynamics and high dynamics). The absolute trajectory errors of the two methods on the four sequences are shown in FIGS. 5 to 8: the absolute pose error is plotted on the left, and the colour bar on the right encodes the error magnitude, increasing from bottom to top as the colour changes from dark blue to dark red. On the fr3_sitting_static sequence the algorithm improves little over ORB-SLAM2, because this is a low-dynamic sequence on which ORB-SLAM2 is already robust. On the remaining three high-dynamic sequences the difference between the two algorithms is large: on fr3_walking_halfsphere, fr3_walking_static and fr3_walking_xyz the algorithm's error is smaller and its trajectory closer to the real trajectory, while ORB-SLAM2's result deviates much more, mainly because ORB-SLAM2 cannot handle scenes with many dynamic feature points and only stays robust in static environments. The dynamic-sequence APE (Absolute Pose Error) test results are shown in FIGS. 9 to 12; APE evaluates the global consistency of a SLAM trajectory and includes indices such as absolute error and root mean square error. It can be seen intuitively that on the static sequence the error fluctuation of the two algorithms is essentially similar, while on the dynamic sequences ORB-SLAM2's error fluctuates strongly and the improved algorithm's fluctuation is clearly and consistently smaller, so the improved algorithm outperforms ORB-SLAM2.
The absolute trajectory error (ATE) and relative pose error (RPE) of the two algorithms on the four sequences are shown in Tables 1, 2 and 3; a smaller rmse value means a smaller error. The rmse rows of the three tables show that the algorithm deviates less from the real trajectory than ORB-SLAM2, and the gap is most obvious on the dynamic sequences, indicating that the improved algorithm handles interference in dynamic environments well. RPE measures the error between the true and estimated relative pose of two frames separated by a fixed time interval.
TABLE 1 Absolute trajectory error (ATE) test results (m)
[Table 1 is provided as an image in the original publication]
TABLE 2 Relative pose error (RPE) translation test results (m)
[Table 2 is provided as an image in the original publication]
TABLE 3 Relative pose error (RPE) rotation angle test results (deg)
[Table 3 is provided as an image in the original publication]
To show the merits of the improved algorithm more intuitively, the rmse improvement of the improved algorithm's absolute trajectory error (ATE) relative to the original ORB-SLAM2 is computed with the following formula:
α = (m - n) / m × 100%
where α represents the improvement rate, m is the root mean square error obtained by ORB-SLAM2, and n is the root mean square error obtained by the improved algorithm herein. The calculation results are shown in Table 4: the improved algorithm of the invention has an obvious advantage in high-dynamic scenes, with an improvement rate above 90%, and copes better with dynamic environments.
TABLE 4 RMSE comparison of absolute trajectory error (ATE) between the two algorithms

Sequence name        ORB-SLAM2 (m)   Algorithm herein (m)   Improvement (%)
sitting_static       0.0091          0.0077                 15.38
sitting_xyz          0.0091          0.0078                 14.29
walking_halfsphere   0.7757          0.0507                 93.46
walking_static       0.2813          0.0099                 96.48
walking_xyz          1.0007          0.0167                 98.33
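The improvement column of Table 4 can be reproduced directly from the two rmse columns with the formula α = (m - n)/m × 100%:

```python
# rmse values from Table 4: (ORB-SLAM2, improved algorithm) per sequence
pairs = {
    "sitting_static":     (0.0091, 0.0077),
    "sitting_xyz":        (0.0091, 0.0078),
    "walking_halfsphere": (0.7757, 0.0507),
    "walking_static":     (0.2813, 0.0099),
    "walking_xyz":        (1.0007, 0.0167),
}
for name, (m, n) in pairs.items():
    alpha = (m - n) / m * 100   # improvement rate from the formula above
    print(f"{name}: {alpha:.2f}%")   # matches the table's last column
```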
In dynamic SLAM system research there are many excellent improved algorithms, such as DS-SLAM [Yu C, Liu Z X, Liu X J, et al. DS-SLAM: a semantic visual SLAM towards dynamic environments [C]. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018: 1168-1174] and DynaSLAM [Bescós B, Fácil J M, Civera J, et al. DynaSLAM: tracking, mapping, and inpainting in dynamic scenes [J]. IEEE Robotics and Automation Letters, 2018, 3(4): 4076-4083]. To verify the advancement of the algorithm herein, and because each algorithm's experimental configuration differs, the improvement rate of the absolute trajectory error rmse of these two algorithms is compared with that of the algorithm herein; the data are taken from the published values in the corresponding literature, as shown in Table 5.
TABLE 5 Absolute trajectory error RMSE improvement rate of each dynamic SLAM algorithm

Sequence name        DS-SLAM   DynaSLAM   Algorithm herein
sitting_static       25.94%    -          15.38%
walking_halfsphere   93.76%    92.88%     93.46%
walking_static       97.91%    93.33%     96.48%
walking_xyz          96.71%    96.73%     98.33%
As can be seen from Table 5, the algorithm of the invention performs best on the high-dynamic sequence walking_xyz and is better overall than DynaSLAM, though slightly behind the overall best DS-SLAM algorithm. The comparison with these two excellent dynamic SLAM algorithms shows that the improved algorithm reduces errors well and improves the system's localization accuracy.
3. Dense map building test
The experiment was performed with the high-dynamic sequences walking_halfsphere and walking_xyz. First, dense point-cloud maps were built for both without removing dynamic objects (i.e. with the original ORB-SLAM2 algorithm); the result is shown in FIG. 13. Because the moving people are not removed, the point cloud map is stitched incorrectly and contains a large number of ghosts, which degrades the mapping accuracy of the SLAM system.
The algorithm of the invention was then used to eliminate the dynamic objects in the environment; the dense point-cloud map it builds is shown in FIG. 14. In contrast to the dense point cloud of FIG. 13 produced by the ORB-SLAM2 algorithm, the point cloud of the dynamic objects in FIG. 14 is filtered out: the many ghosts caused by pedestrians are gone, and the scene occluded by pedestrians is completely restored. The comparison shows that the algorithm handles dynamic objects in the environment well and improves the robustness of the SLAM system.
The above merely describes preferred embodiments of the present invention and is not intended to limit its scope or spirit; various modifications and improvements of the technical solutions of the present invention made by those skilled in the art without departing from its design concept shall fall within its protection scope.

Claims (3)

1. A visual SLAM method adapted to a dynamic environment, characterized by comprising four main threads: a tracking thread, a local mapping thread, a loop closure detection thread and a dense mapping thread, and specifically comprising the following steps:
A. tracking thread: the SLAM system receives images from the camera; a deep-learning YOLOv4 network first identifies common dynamic objects in the environment, such as people; after ORB features are extracted, dynamic feature points are eliminated using the epipolar geometric constraint; the camera pose corresponding to each frame is output for positioning, local map tracking is performed, and keyframes are selected and passed to the local mapping thread and the dense mapping thread;
B. local mapping thread: receives the keyframes output by the tracking thread, inserts them, and generates new map points; then performs local bundle adjustment, and finally screens the inserted keyframes to remove redundant ones;
C. loop closure detection thread: mainly comprises two processes, loop detection and loop correction; loop detection first uses a bag of words to detect loop keyframes and then computes a similarity transformation with the sim3 algorithm; loop correction performs loop fusion and optimizes the essential graph;
D. dense mapping thread: constructs a dense map with the PCL point cloud library, building a static dense map from keyframes whose dynamic points have been removed; since the resulting point cloud is often noisy and contains redundant information, outliers are removed with the statistical filtering method in the PCL library and redundant point cloud information is removed with voxel filtering.
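The two filters named in step D can be sketched as follows (an illustrative pure-Python stand-in for PCL's statistical outlier removal and voxel-grid classes; the neighbour count, standard-deviation multiplier, and leaf size below are assumed demo values, not the patent's parameters): statistical filtering drops points whose mean nearest-neighbour distance is abnormally large, and voxel filtering keeps one representative point per voxel.

```python
# Sketch of step D's point-cloud cleanup: statistical outlier removal
# followed by voxel-grid downsampling (pure-Python stand-ins for the PCL
# StatisticalOutlierRemoval and VoxelGrid filters).
import math

def statistical_filter(points, k=2, std_mul=1.0):
    """Drop points whose mean distance to their k nearest neighbours exceeds
    (mean + std_mul * stddev) computed over the whole cloud."""
    def mean_knn_dist(p):
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        return sum(dists[:k]) / k
    d = [mean_knn_dist(p) for p in points]
    mu = sum(d) / len(d)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in d) / len(d))
    return [p for p, x in zip(points, d) if x <= mu + std_mul * sigma]

def voxel_filter(points, leaf=0.05):
    """Keep one representative point (the first seen) per voxel of size leaf."""
    seen = {}
    for p in points:
        key = tuple(int(c // leaf) for c in p)
        seen.setdefault(key, p)
    return list(seen.values())

cloud = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.0, 0.01, 0.0),
         (5.0, 5.0, 5.0)]                      # last point is a lone outlier
inliers = statistical_filter(cloud, k=2, std_mul=1.0)
print(len(inliers))                            # outlier removed
print(len(voxel_filter(inliers, leaf=0.05)))   # close points merged to one voxel
```

The same two-stage order as in the claim is used: noise is removed first, then the surviving cloud is decimated so the dense map stays compact.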
2. The visual SLAM method adapted to a dynamic environment according to claim 1, wherein a deep-learning algorithm, YOLOv4, and a dynamic feature point detection algorithm based on geometric constraints are added at the front end of the SLAM system: YOLOv4 first identifies dynamic objects, and the geometric constraint then further removes dynamic points, as follows:
firstly, screening the matching points in the image with the RANSAC algorithm and removing wrong matches, then calculating the fundamental matrix F from the remaining feature points and solving for R and t;
the RANSAC algorithm divides the data into inliers and outliers: the inliers are points for which the model is expected to hold, i.e., valid feature points excluding points on dynamic objects, while the outliers are invalid data, i.e., points on dynamic objects;
secondly, recovering the coordinates X1 and X2 of the space point P in the two camera coordinate systems from the epipolar geometric relation between the pixels of the matched point pair and the solved R and t;
thirdly, checking whether the recovered X1 and X2 satisfy X2 = R·X1 + t; if so, the space point P is a static point; if not, the judgment continues: X2 (or X1) is projected into the next frame image to obtain a pixel point p6, 3 x 3 pixel blocks centered at p5 and p6 are constructed, the corresponding blocks are denoted A and B, and the degree of correlation between A and B is expressed by the normalized cross-correlation factor as follows:
S(A,B)_NCC = Σ_{i,j} [A(i,j) − Ā][B(i,j) − B̄] / √( Σ_{i,j} [A(i,j) − Ā]² · Σ_{i,j} [B(i,j) − B̄]² )

where Ā and B̄ are the mean pixel values of blocks A and B;
setting the threshold to 0.9: if S(A,B)_NCC is larger than 0.9, the two points are considered similar and the space point P is judged to be a static point; otherwise, they are dissimilar and the space point P is judged to be a dynamic point.
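The geometric test of claim 2 can be sketched as follows (a minimal illustration, not the patent's implementation; the tolerance value and helper names are assumptions): first check whether X2 is consistent with R·X1 + t, and only on failure fall back to the NCC comparison of the 3 x 3 blocks A and B against the 0.9 threshold.

```python
# Sketch of the claim-2 dynamic-point test: a rigid-motion consistency
# check, with a normalized cross-correlation (NCC) fallback on 3x3 blocks.
import math

def ncc(A, B):
    """Normalized cross-correlation of two equal-size pixel blocks."""
    a = [x for row in A for x in row]
    b = [x for row in B for x in row]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def is_static(X1, X2, R, t, A=None, B=None, tol=1e-3, ncc_thresh=0.9):
    """True if X2 = R*X1 + t holds (static point); otherwise compare
    patches A and B with NCC against the 0.9 threshold from the claim."""
    pred = [sum(R[i][j] * X1[j] for j in range(3)) + t[i] for i in range(3)]
    if math.dist(pred, X2) < tol:
        return True
    return A is not None and B is not None and ncc(A, B) > ncc_thresh

I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]    # identity rotation for the demo
t = [0.0, 0.0, 1.0]
print(is_static([1, 2, 3], [1, 2, 4], I, t))   # motion-consistent -> True
patchA = [[10, 20, 10], [20, 40, 20], [10, 20, 10]]
patchB = [[11, 21, 11], [21, 41, 21], [11, 21, 11]]   # same pattern, offset
print(is_static([1, 2, 3], [5, 5, 5], I, t, patchA, patchB))  # NCC rescue -> True
```

Because NCC subtracts the block means, patchB (patchA plus a uniform brightness offset) still correlates perfectly, which is exactly why the claim uses the normalized form rather than a raw pixel difference.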
3. The visual SLAM method adapted to a dynamic environment according to claim 2, characterized in that the RANSAC algorithm comprises the following specific steps:
firstly, randomly selecting four sample data points from the feature point data set obtained by matching in the visual SLAM system, calculating the transformation matrix H, and denoting the model as M;
secondly, traversing all the feature points with the transformation matrix H, calculating the projection error between each feature point and the model M, and adding a feature point P to the inlier set if its error is smaller than the set error threshold;
thirdly, comparing the currently obtained model M' with the previous model M and keeping the model with more points in its inlier set;
fourthly, repeating the above three steps until the iterations are finished.
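The four RANSAC steps above can be sketched with the following loop (a simplified illustration: to stay self-contained it fits a 2D line from two samples instead of the 4-point homography H of the claim; the iteration count, threshold, and seed are assumed demo values):

```python
# Sketch of the claim-3 RANSAC loop. The claim fits a homography H from
# four matched points; a 2D line (2 samples) stands in as the model here
# so the example needs no linear-algebra dependency.
import random

def ransac_line(points, iters=200, thresh=0.1, seed=0):
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):                           # step 4: repeat
        (x1, y1), (x2, y2) = rng.sample(points, 2)   # step 1: random samples
        if x1 == x2:
            continue                                 # degenerate pair
        m = (y2 - y1) / (x2 - x1)                    # candidate model M': y = m*x + c
        c = y1 - m * x1
        inliers = [(x, y) for x, y in points         # step 2: score every point
                   if abs(y - (m * x + c)) < thresh]
        if len(inliers) > len(best_inliers):         # step 3: keep larger inlier set
            best_model, best_inliers = (m, c), inliers
    return best_model, best_inliers

# 10 points on y = 2x + 1 plus two gross outliers (simulated bad matches).
pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 9.0), (7, 1.0)]
model, inliers = ransac_line(pts)
print(len(inliers))  # the 10 collinear points; the outliers are rejected
```

The rejected outliers play the role of the points on dynamic objects in claim 2: they never dominate the vote, so the model M is estimated only from the static (inlier) structure.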
CN202310387172.8A 2023-04-12 2023-04-12 Visual SLAM method suitable for dynamic environment Pending CN116429087A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310387172.8A CN116429087A (en) 2023-04-12 2023-04-12 Visual SLAM method suitable for dynamic environment

Publications (1)

Publication Number Publication Date
CN116429087A true CN116429087A (en) 2023-07-14

Family

ID=87082780

Country Status (1)

Country Link
CN (1) CN116429087A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117553808A (en) * 2024-01-12 2024-02-13 中国民用航空飞行学院 Deep learning-based robot positioning navigation method, device, equipment and medium
CN117553808B (en) * 2024-01-12 2024-04-16 中国民用航空飞行学院 Deep learning-based robot positioning navigation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination