CN110378997B - ORB-SLAM 2-based dynamic scene mapping and positioning method - Google Patents


Info

Publication number
CN110378997B
CN110378997B (granted from application CN201910481714.1A)
Authority
CN
China
Prior art keywords
dynamic
pixel
key frame
pixels
pose
Prior art date
Legal status (assumed; not a legal conclusion)
Active
Application number
CN201910481714.1A
Other languages
Chinese (zh)
Other versions
CN110378997A (en)
Inventor
龙土志
蔡述庭
李丹
杨家兵
董海涛
Current Assignee (the listed assignees may be inaccurate)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201910481714.1A
Publication of CN110378997A publication Critical patent/CN110378997A/en
Application granted granted Critical
Publication of CN110378997B publication Critical patent/CN110378997B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05: Geographic models

Abstract

The invention discloses a dynamic scene mapping and positioning method based on ORB-SLAM2, comprising a local map tracking process, a dynamic pixel eliminating process, a sparse mapping process, a closed-loop detection process and an octree map building process. The method has a dynamic pixel elimination function: by combining a target detection method with the depth image of each new key frame, it quickly detects moving objects in the camera's image information and constructs a clean static-background octree map in a complex dynamic environment.

Description

ORB-SLAM 2-based dynamic scene mapping and positioning method
Technical Field
The invention relates to the technical field of robot simultaneous localization and mapping (SLAM), in particular to a dynamic scene mapping and positioning method based on ORB-SLAM2.
Background
SLAM (simultaneous localization and mapping) has long been a popular topic in computer vision and robotics and has attracted the attention of many high-tech companies. SLAM builds a map of an unknown environment while localizing the sensor within that map in real time. The frameworks of modern visual SLAM systems, such as ORB-SLAM2 and LSD-SLAM, are very mature; most state-of-the-art visual SLAM (V-SLAM) systems offer high-precision localization, but most of them assume a static operating environment, which limits their application.
For establishing a static map in a dynamic scene, the existing algorithms have drawbacks: DynaSLAM cannot eliminate the pixels of predefined objects in real time, nor can it eliminate the pixels of undefined objects or of undetected parts of predefined objects; the StaticFusion system likewise cannot eliminate undefined objects or parts of predefined objects. None of these algorithms can build a clean static map in real time in a complex dynamic environment.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a dynamic scene mapping and positioning method based on ORB-SLAM2 which has a dynamic pixel elimination function: it rapidly detects moving objects in the camera's image information by combining a target detection method with the depth image of each new key frame, and constructs a clean static-background octree map in a complex dynamic environment.
In order to realize the task, the invention adopts the following technical scheme:
a dynamic scene mapping and positioning method based on ORB-SLAM2 comprises the following steps:
step 1, local map tracking
Initializing the pose of a camera by using image information captured by the camera carried by the robot, taking the first frame image captured by the camera as a key frame during initialization; after the initial pose is obtained, tracking the local map so as to optimize the pose of the camera and generate new key frames;
step 2, dynamic pixel elimination
Detecting a predefined dynamic object in the color image of the new key frame using a target detection algorithm, and then identifying dynamic pixels in combination with the depth image of the new key frame; dynamic pixels detected by the two methods are removed;
step 3, sparse mapping
For the key frames with the dynamic pixels removed, the robot poses of the key frames are optimized, new map points are added, and the quality and scale of a key frame set are maintained;
step 4, closed loop detection
Performing closed loop detection on each new key frame, and once closed loops are detected, performing pose graph optimization;
step 5, constructing an octree map
Dividing map points into voxels by using an octree and storing the voxels through an octree structure to construct an octree map; the occupation probability of each voxel is calculated to judge whether the voxel is occupied, and occupied voxels are visualized in the octree map.
Further, tracking the local map to optimize the camera pose and generate new keyframes includes:
the local map refers to the 3D points observed by key frames whose distance and viewing angle are close to the current frame; more matched 3D points are obtained through re-projection, so that the camera pose is optimized and new key frames are generated with minimum error:
projecting the 3D points on the local map to the current frame to obtain 3D-2D feature matching;
limiting the area for searching 2D matching points in the current frame to reduce mismatching; then comparing each pixel in the current frame with the position obtained by projecting its matched 3D point according to the currently estimated camera pose, using the resulting errors to construct a least-squares problem, and minimizing it to find the best camera pose for positioning;
and judging whether a new key frame needs to be generated or not according to preset conditions.
Further, the step 2 of identifying dynamic pixels in combination with the depth image of the new key frame includes:
projecting the pixels remaining after the target detection algorithm has removed predefined objects into world coordinates to create 3D points; dividing the 3D points into a plurality of clusters, and uniformly selecting M pixels from each cluster; each selected pixel is projected into the N key frames nearest to the new key frame and compared, to detect whether it is a dynamic pixel:
back-projecting pixel u to the 3D point p_w in world coordinates, using the depth in the depth image of the key frame and the robot pose of the new key frame;
projecting the 3D point p_w onto the color image of the jth key frame near the new key frame;
if the pixel u' of the jth key frame has a valid depth value z' in the corresponding depth image, back-projecting u' to the 3D point p_w' in world coordinates;
comparing the distance d between p_w' and p_w with the set threshold d_mth to determine whether pixel u is dynamic:
searching the pixels in a square around u' so that d takes its minimum value d_min; if d_min is greater than the threshold d_mth, the pixel u is preliminarily judged to be dynamic, otherwise it is preliminarily judged to be static.
Further, assume that among the preliminary judgment results of pixel u over all nearby key frames, the number of static results is N_s, the number of dynamic results is N_d, and the number of invalid results is N_i; the final attribute of pixel u is as follows:
if N_s ≥ N_d and N_s > 0, pixel u is a static pixel;
if N_d > N_s and N_d > 0, pixel u is a dynamic pixel;
if N_s = N_d = 0, pixel u is invalid.
Further, the dynamic pixels detected by the two methods are removed; after dynamic pixels are identified in combination with the depth image of the new key frame, the removal method is as follows:
among the M pixels uniformly selected from one cluster, assume the number of static pixels is N_s', the number of dynamic pixels is N_d', and the number of invalid pixels is N_i'; the final attribute of the cluster is as follows:
if N_s' ≥ N_d', the cluster is a static cluster and is retained;
if N_d' > N_s', the cluster is a dynamic cluster and is removed.
Further, the optimizing the robot pose of the key frame, adding new map points, and maintaining the quality and scale of the key frame set includes:
calculating the BoW (bag-of-words) vector of the current key frame, and updating the map points of the current key frame;
optimizing the pose of the robot by using sliding-window local BA, where the object being optimized is the pose of the current frame;
redundant key frames are detected and culled: if 90% of the points observed on a key frame can be observed by at least three other key frames, the key frame is considered redundant and deleted.
Further, the determining whether a voxel is occupied by calculating an occupation probability of the voxel, and if so, visualizing in an octree map includes:
let Z_t denote the observation of voxel n at time t, and let L(n|Z_1:t) denote the log-odds value of voxel n up to time t; then at time t+1 the log-odds of voxel n is given by:
L(n|Z_1:t+1) = L(n|Z_1:t) + L(n|Z_t+1)    (Formula 8)
if voxel n is observed at time t+1, L(n|Z_t+1) is τ, otherwise it is 0; the increment τ is a predetermined value;
defining p ∈ [0,1] as the probability that a voxel is occupied and l ∈ R as the log-odds value, l is computed by the logit transform:
l = logit(p) = log(p / (1 − p))    (Formula 9)
the inverse transform of the above equation is:
p = logit⁻¹(l) = exp(l) / (1 + exp(l)) = 1 / (1 + exp(−l))    (Formula 10)
the occupancy probability p of the voxel is calculated by substituting the log-odds value into Formula 10; only if p is greater than a predetermined threshold is the voxel n considered occupied and visualized in the octree map.
The invention has the following technical characteristics:
1. Fast operation
The algorithm uses the CornerNet-Squeeze target detection algorithm to detect dynamic objects and the Mini Batch K-Means variant of K-Means to cluster the depth information of the images, which is faster than existing algorithms. CornerNet-Squeeze needs only 34 ms to process a picture (test environment: 1080 Ti GPU + Intel Core i7-7700K CPU), which is faster than algorithms such as YOLOv3. For clustering problems with more than ten thousand points, Mini Batch K-Means is about three times faster than standard K-Means with little difference in performance.
In addition, the octree map is established, so that the updating time of the map is shortened.
2. A very clean static map can be established in a complex environment
The algorithm combines the target detection method with the log-odds voxel update of the octree map to detect and eliminate dynamic pixels.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the local BA optimization process;
FIG. 3 is an octree map model.
Detailed Description
ORB-SLAM2 is a complete SLAM solution for monocular, stereo and RGB-D cameras. It supports map reuse, loop detection and relocalization. However, it assumes the operating environment is static, which limits its application.
The proposed algorithm builds on the ORB-SLAM2 algorithm and can quickly establish a clean static octree map in a dynamic environment in real time. It mainly comprises five steps: local map tracking, dynamic pixel culling, sparse mapping, closed-loop detection and octree map creation; the overall flow chart is shown in Fig. 1. The details are as follows:
step 1, local map tracking
Initializing the pose of the camera by using image information captured by the camera carried by the robot, taking the first frame image captured by the camera as a key frame during initialization; after the initial pose is obtained, tracking the local map so as to optimize the pose of the camera and generate new key frames. The specific steps are as follows:
step 1.1, extracting ORB feature points from the original RGB-D image information (color image and depth image) captured by a Kinect2 and matching them; the camera pose is tracked and initialized, i.e. positioned, by trying in turn the constant-velocity motion model, reference key frame tracking and relocalization. At initialization the first frame is set as a key frame; this step 1.1 is the same as in ORB-SLAM2.
Step 1.2, after the initial pose is obtained, tracking a local map, wherein the local map refers to 3D points observed by a key frame (local key frame) of a current frame (a picture currently shot by a camera) at a distance and a visual angle (the distance is set to be 4m, and the angle is set to be 1 rad); more matched 3D points are obtained through re-projection, so that the camera pose is optimized with the minimum error:
(1) Definitions:
Transformation matrix from the camera coordinate system c to the robot coordinate system r:
T_rc ∈ SE(3)
Transformation matrix from the robot coordinate system r to the world coordinate system w (i.e. the pose of the robot):
T_wr ∈ SE(3)
Projection relation from a 3D point P_c associated with a frame C to the 2D point u on that frame, where K is the camera intrinsic matrix and z is the depth of P_c:
u = π(P_c) = (1/z) · K · P_c
Back-projection from a 2D point u on a frame C and its corresponding depth z to the 3D point P_c, where ũ is the homogeneous form of u:
P_c = π⁻¹(u, z) = z · K⁻¹ · ũ
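To make these relations concrete, the following minimal Python sketch implements the projection and back-projection helpers (NumPy only; the function names and the conventions that K is the 3×3 intrinsic matrix and poses are 4×4 homogeneous matrices are illustrative assumptions, not the patent's code):

```python
import numpy as np

def project(K, P_c):
    """Project a 3D point P_c in the camera frame to a 2D pixel: u = (1/z) K P_c."""
    z = P_c[2]
    uv = (K @ P_c) / z
    return uv[:2], z

def backproject(K, u, z):
    """Back-project pixel u with depth z to a camera-frame 3D point: P_c = z K^-1 u~."""
    u_h = np.array([u[0], u[1], 1.0])          # homogeneous pixel coordinates
    return z * (np.linalg.inv(K) @ u_h)

def transform(T, P):
    """Apply a 4x4 rigid transform T (e.g. T_wr @ T_rc) to a 3D point P."""
    return (T @ np.array([P[0], P[1], P[2], 1.0]))[:3]
```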
(2) Carrying out reprojection to obtain feature matching:
Let the robot pose (i.e. the pose of the current frame) be T_wr, and let {P_w^i} denote the 3D points of the local map. Projecting these points onto the current frame yields the 3D-2D feature matches:
u_i' = π(T_rc⁻¹ · T_wr⁻¹ · P_w^i)
(3) Optimizing the pose of the camera:
In a dynamic scene there are many mismatches among the feature matches. To address this, the invention reduces mismatching by limiting the area (a circle with radius set to 3 pixels) in which 2D matching points (i.e. pixels) are searched in the current frame. Each pixel u_i in the current frame is then compared with the position u_i' obtained by projecting its matched 3D point according to the currently estimated camera pose; the resulting errors form a least-squares problem (Formula 3), which is minimized to find the best camera pose for positioning:
T_wr* = argmin_{T_wr} Σ_i ‖u_i − π(T_rc⁻¹ · T_wr⁻¹ · P_w^i)‖²    (Formula 3)
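A hedged sketch of this least-squares pose refinement, assuming the helpers above; it parameterizes the world-to-camera pose as a rotation vector plus translation and uses SciPy's generic robust solver in place of the g2o-style optimizer a real ORB-SLAM2 pipeline would use:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(x, K, pts_w, pix):
    """Stack the reprojection errors u_i - pi(T(x) P_w^i) for pose params x = (rotvec, t)."""
    R = Rotation.from_rotvec(x[:3]).as_matrix()
    t = x[3:]
    res = []
    for P_w, u in zip(pts_w, pix):
        P_c = R @ P_w + t                      # world -> camera with the candidate pose
        uv = (K @ P_c) / P_c[2]                # pinhole projection
        res.extend(u - uv[:2])
    return np.asarray(res)

def refine_pose(x0, K, pts_w, pix):
    """Minimize the total squared reprojection error (Formula 3), robustified with a Huber loss."""
    sol = least_squares(residuals, x0, args=(K, np.asarray(pts_w), np.asarray(pix)),
                        loss="huber")
    return sol.x                               # refined (rotvec, t)
```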
(4) Finally, judge whether a new key frame needs to be generated according to preset conditions; these preset conditions are the same as in the ORB-SLAM2 algorithm.
Step 2, dynamic pixel elimination
In constructing static maps in dynamic scenes, the identification and deletion of dynamic pixels is most critical. Since only the key frames are used to construct the octree map, only the newly selected key frames (new key frames) from the previous step are subjected to dynamic pixel culling.
The method first detects predefined dynamic objects in the color image of the new key frame using a target detection method, and then identifies dynamic pixels in combination with the depth image of the new key frame; the dynamic pixels detected by both methods are removed. The steps are as follows:
step 2.1, for predefined objects such as people, tables and chairs, this scheme detects predefined dynamic objects in the color image of the new key frame with the CornerNet-Squeeze target detection algorithm from CornerNet-Lite; if a dynamic object is detected, its pixels are removed.
The CornerNet-Squeeze target detection algorithm needs only 34 ms to process a picture (test environment: 1080 Ti GPU + Intel Core i7-7700K CPU), which is faster than algorithms such as YOLOv3.
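As an illustration of step 2.1, the sketch below masks out the pixels of detected predefined dynamic objects. `detect_objects` is a hypothetical stand-in for any detector such as CornerNet-Squeeze (its real API is not reproduced here) and is assumed to return (class name, bounding box) pairs:

```python
import numpy as np

PREDEFINED_DYNAMIC = {"person", "chair", "table"}   # illustrative class list

def remove_predefined_dynamic(color, depth, detect_objects):
    """Return a boolean mask where True marks pixels kept after object removal."""
    keep = np.ones(depth.shape, dtype=bool)
    for cls, (x0, y0, x1, y1) in detect_objects(color):
        if cls in PREDEFINED_DYNAMIC:
            keep[y0:y1, x0:x1] = False              # cull the detected object's pixels
    return keep
```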
Step 2.2, for some undefined dynamic objects such as books, boxes and the like or parts of predefined objects which can not be detected by the target detection algorithm, such as human hands, the scheme detects dynamic pixels on the color picture processed by the previous step by combining the depth image of the new key frame through the following method:
2.2.1 Project the pixels remaining after the target detection algorithm has removed predefined objects into world coordinates to create 3D points.
2.2.2 Divide the 3D points into clusters using a clustering algorithm
The 3D points are divided into a plurality of clusters; the number of clusters k is determined from the number of 3D points s_p: k = s_p / n_pt, where n_pt is the adjustable average number of points per cluster and s_p represents the size of the point cloud. M pixels are uniformly selected from each cluster.
Because the number of pixels is extremely large and clustering must be as fast as possible, this scheme uses the Mini Batch K-Means variant of K-Means. Reducing n_pt guarantees a better approximation but increases the computational burden; this scheme sets n_pt to 6000 to balance computational cost and accuracy.
Because this scheme focuses on removing dynamic pixels and building a static map, without tracking dynamic objects, each cluster is assumed to be a rigid body, meaning that pixels in the same cluster share the same motion attributes; therefore the scheme only needs to detect which clusters are dynamic. To speed up dynamic-cluster detection, M = 100 pixels are uniformly selected from each cluster.
In the following steps, the dynamic/static attributes of the selected pixels are judged; if a cluster contains more dynamic pixels than static ones, it is judged dynamic and removed, otherwise it is judged static and retained.
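A minimal sketch of steps 2.2.1-2.2.2 with scikit-learn's MiniBatchKMeans, using n_pt = 6000 and M = 100 as in the text; `points_w` is assumed to be an (s_p, 3) array of back-projected 3D points, and the uniform selection is approximated by uniform random sampling:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_and_sample(points_w, n_pt=6000, M=100, seed=0):
    """Cluster 3D points into k = s_p / n_pt clusters and sample M pixels per cluster."""
    s_p = len(points_w)
    k = max(1, s_p // n_pt)                       # k = s_p / n_pt
    labels = MiniBatchKMeans(n_clusters=k, random_state=seed).fit_predict(points_w)
    rng = np.random.default_rng(seed)
    samples = {}
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        samples[c] = rng.choice(idx, size=min(M, len(idx)), replace=False)
    return labels, samples                        # cluster labels and sampled pixel indices
```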
2.2.4 determining whether a pixel is a dynamic pixel
This scheme detects whether a pixel is dynamic by projecting each of the M pixels selected in step 2.2.2 into the N = 6 key frames nearest to the new key frame and comparing the results. The specific steps are as follows:
(1) Using the depth z in the depth image of the new key frame and the robot pose T_wr^n of the new key frame, back-project pixel u to the 3D point p_w in world coordinates:
p_w = T_wr^n · T_rc · (z · K⁻¹ · ũ)
(2) Project the 3D point p_w onto the color image of the jth key frame near the new key frame:
u' = π(T_rc⁻¹ · (T_wr^j)⁻¹ · p_w)
where T_wr^j is the robot pose of the jth key frame near the new key frame.
(3) If the pixel u' of the jth key frame has a valid depth value z' in the corresponding depth image, back-project u' to the 3D point p_w' in world coordinates:
p_w' = T_wr^j · T_rc · (z' · K⁻¹ · ũ')
(4) Determine whether pixel u is dynamic by comparing the distance d between p_w' and p_w with a set threshold d_mth:
Since the depth image and the pose of a key frame both contain errors, and u' may not be exactly the pixel corresponding to u, this scheme searches a square around u' (side length S empirically set to 10 pixels) for the pixel that gives the minimum distance d_min. If d_min is greater than the threshold d_mth (d_mth is set to increase linearly with the depth value z'), the pixel u is preliminarily judged to be dynamic, otherwise it is preliminarily judged to be static. If no valid depth value is found in the square search area, or u' falls outside the image, the pixel u is judged to be invalid.
Considering that the result from a single key frame is not reliable enough and may be invalid, this scheme applies the preliminary judgment procedure (1)-(4) to all key frames near the new key frame (this embodiment selects 6 key frames), and the final attribute of pixel u is determined by voting: assume that among the preliminary judgment results of pixel u over all nearby key frames, the number of static results is N_s, the number of dynamic results is N_d, and the number of invalid results is N_i; then the final attribute of pixel u is as follows:
if N_s ≥ N_d and N_s > 0, pixel u is a static pixel;
if N_d > N_s and N_d > 0, pixel u is a dynamic pixel;
if N_s = N_d = 0, pixel u is invalid.
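Putting steps (1)-(4) and the vote together, a hedged sketch of the per-pixel check; it reuses the projection helpers sketched in step 1, assumes each key frame object carries T_wr, T_rc and a depth array, and uses illustrative coefficients a, b for the depth-dependent threshold d_mth = a + b·z:

```python
import numpy as np

def classify_pixel(u, z, kf_new, neighbors, K, S=10, a=0.04, b=0.01):
    """Vote pixel u static / dynamic / invalid over the N nearest key frames."""
    # Step (1): back-project u into world coordinates with the new key frame's pose.
    p_w = transform(kf_new.T_wr @ kf_new.T_rc, backproject(K, u, z))
    n_s = n_d = 0
    for kf in neighbors:                          # the N = 6 nearest key frames
        # Step (2): project p_w into key frame j.
        T_cw = np.linalg.inv(kf.T_wr @ kf.T_rc)
        uv, _ = project(K, transform(T_cw, p_w))
        d_min, h = np.inf, S // 2
        for du in range(-h, h + 1):               # square search around u'
            for dv in range(-h, h + 1):
                x, y = int(uv[0]) + du, int(uv[1]) + dv
                if not (0 <= y < kf.depth.shape[0] and 0 <= x < kf.depth.shape[1]):
                    continue
                z2 = kf.depth[y, x]
                if z2 <= 0:
                    continue                      # no valid depth at this candidate
                # Step (3): back-project the candidate pixel of key frame j.
                p_w2 = transform(kf.T_wr @ kf.T_rc, backproject(K, (x, y), z2))
                d_min = min(d_min, np.linalg.norm(p_w2 - p_w))
        if not np.isfinite(d_min):
            continue                              # invalid result for this key frame
        # Step (4): compare d_min with the depth-dependent threshold d_mth.
        if d_min > a + b * z:
            n_d += 1                              # preliminary: dynamic
        else:
            n_s += 1                              # preliminary: static
    if n_s >= n_d and n_s > 0:
        return "static"
    if n_d > n_s and n_d > 0:
        return "dynamic"
    return "invalid"
```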
2.2.5 Determining whether a cluster is dynamic
This step uses the same voting method as the previous step to determine the cluster attribute. Among the M pixels uniformly selected from one cluster, assume the number of static pixels is N_s', the number of dynamic pixels is N_d', and the number of invalid pixels is N_i'; the final attribute of the cluster is as follows:
if N_s' ≥ N_d', the cluster is a static cluster and is retained;
if N_d' > N_s', the cluster is a dynamic cluster and is removed.
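The cluster-level vote of step 2.2.5 then reduces to counting the per-pixel results over the M sampled pixels (a sketch; `classify` is the per-pixel function sketched above):

```python
def classify_cluster(sampled_pixels, classify):
    """Keep a cluster only if its static votes reach its dynamic votes (N_s' >= N_d')."""
    votes = [classify(px) for px in sampled_pixels]
    n_s, n_d = votes.count("static"), votes.count("dynamic")
    return "static" if n_s >= n_d else "dynamic"  # dynamic clusters are removed
```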
Step 3, sparse mapping
The main purpose of sparse mapping is to receive and process the key frames from which dynamic pixels have been removed: optimize the robot pose of each key frame, add new map points, and maintain the quality and scale of the key-frame set. The specific steps are as follows:
step 3.1, processing newly introduced key frame
Calculate the BoW (bag-of-words) vector of the current key frame and update the map points of the current key frame;
step 3.2, local BA
Sliding-window local BA is applied to optimize the pose of the robot; the optimization framework is shown in Fig. 2, and the object being optimized is the pose of the current frame. The optimization involves:
(1) all the poses of the key frames connected with the current key frame in the sliding window (to balance time and accuracy, this scheme sets the window size n to 7);
(2) the black map points in Fig. 2, created before the key frames in the sliding window;
(3) the white map points in Fig. 2, created after the key frames in the sliding window; these are already fixed and are not treated as variables.
After local BA optimization, the pose of the new key frame is optimized and new map points are created, which will be used to construct the octree map.
Step 3.3, local Key frame screening
In order to control the reconstruction density and the complexity of BA optimization, the step also comprises a process of detecting redundant key frames and removing the redundant key frames; as can be seen from the reprojection, if 90% of the pixels on a key frame can be observed by more than three arbitrary key frames, it is considered as a redundant key frame and deleted.
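A sketch of this redundancy rule, assuming each key frame stores the set of map points it observes (the data layout is an illustrative assumption):

```python
def is_redundant(kf, other_keyframes, ratio=0.9, min_observers=3):
    """Redundant if >= 90% of kf's observed points are seen by >= 3 other key frames."""
    if not kf.points:
        return False
    covered = sum(
        1 for p in kf.points
        if sum(p in o.points for o in other_keyframes) >= min_observers
    )
    return covered / len(kf.points) >= ratio
```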
Step 4, closed loop detection
Closed-loop detection is performed on each new key frame using the bag-of-words model DBoW2. Once a closed loop is detected, pose graph optimization is performed; this process is the same as in the ORB-SLAM2 algorithm, and the specific pose graph optimization belongs to the prior art, so it is not described again.
Step 5, constructing octree map
The map points created and optimized in the previous steps are divided into voxels (small cubes) and stored through an octree structure to construct the octree map; the occupation probability of each voxel is calculated to judge whether it is occupied, and occupied voxels are visualized in the octree map.
As shown in fig. 3, constructing the octree map involves continuously updating whether each voxel is occupied. Unlike map points that are simply 0 (free) or 1 (occupied), the octree map uses a probability to indicate whether a voxel is occupied; this is described below using log-odds values. The specific steps are as follows:
step 5.1, let Z_t denote the observation of voxel n at time t (obtainable by reprojection), and let L(n|Z_1:t) denote the log-odds value of voxel n up to time t; then at time t+1, the log-odds of voxel n is given by:
L(n|Z_1:t+1) = L(n|Z_1:t) + L(n|Z_t+1)    (Formula 8)
If voxel n is observed at time t+1, L(n|Z_t+1) is τ, otherwise it is 0; the increment τ is a predetermined value. The formula indicates that the log-odds of a voxel increases when it is repeatedly observed to be occupied, and decreases otherwise.
Step 5.2, defining p ∈ [0,1] to represent the probability of voxel being occupied, and l ∈ R to represent the probability logarithm value, l can be calculated by the logit transformation:
Figure BDA0002084065270000101
the above equation is inversely transformed into:
Figure BDA0002084065270000102
calculating the occupancy probability p of the voxel by substituting the probability logarithm value obtained in the last step into the formula 10; only if the probability of occupancy p is greater than a predetermined threshold, the voxel n is considered occupied and will be visualized in the octree graph.
In other words, voxels that have been observed as occupied many times are considered stably occupied; in this way the scheme can deal well with the mapping problem in dynamic environments. Under complex conditions, the octree map helps to strengthen the elimination of dynamic pixels and minimize the influence of dynamic objects.
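A minimal sketch of the log-odds bookkeeping of Formulas 8-10; the increment τ and the occupancy threshold are illustrative values, and a real map would store one log-odds value per octree voxel:

```python
import math

TAU = 0.85            # increment per positive observation (illustrative)
P_OCC = 0.7           # occupancy probability threshold (illustrative)

def update_log_odds(l, observed, tau=TAU):
    """Formula 8: add tau when the voxel is observed occupied, otherwise add 0."""
    return l + (tau if observed else 0.0)

def logit(p):
    """Formula 9: l = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def occupancy(l):
    """Formula 10: p = 1 / (1 + exp(-l))."""
    return 1.0 / (1.0 + math.exp(-l))

def is_occupied(l, threshold=P_OCC):
    """Visualize a voxel only when its occupancy probability exceeds the threshold."""
    return occupancy(l) > threshold
```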

Claims (7)

1. A dynamic scene mapping and positioning method based on ORB-SLAM2 is characterized by comprising the following steps:
step 1, local map tracking
Initializing the pose of a camera by using image information captured by the camera carried by the robot, taking the first frame image captured by the camera as a key frame during initialization; after the initial pose is obtained, tracking the local map so as to optimize the pose of the camera and generate new key frames;
step 2, dynamic pixel elimination
Detecting a predefined dynamic object in the color image of the new key frame using a target detection algorithm, and then identifying dynamic pixels in combination with the depth image of the new key frame; dynamic pixels detected by the two methods are removed;
step 3, sparse mapping
For the key frames with the dynamic pixels removed, the robot poses of the key frames are optimized, new map points are added, and the quality and scale of a key frame set are maintained;
step 4, closed loop detection
Performing closed loop detection on each new key frame, and once closed loops are detected, performing pose graph optimization;
step 5, constructing an octree map
Dividing map points into voxels by using an octree and storing the voxels through an octree structure to construct an octree map; judging whether each voxel is occupied by calculating its occupation probability, and if occupied, visualizing it in the octree map.
2. The ORB-SLAM2 based dynamic scene mapping and localization method of claim 1, wherein tracking the local map to optimize camera pose and generate new keyframes comprises:
the local map refers to the 3D points observed by key frames within 4 m distance and 1 rad viewing angle of the current frame; more matched 3D points are obtained through re-projection, so that the camera pose is optimized and new key frames are generated with minimum error:
projecting the 3D points on the local map to the current frame to obtain 3D-2D feature matching;
limiting the area for searching 2D matching points in the current frame to reduce mismatching; then comparing each pixel in the current frame with the position obtained by projecting its matched 3D point according to the currently estimated camera pose, using the resulting errors to construct a least-squares problem, and minimizing it to find the best camera pose for positioning;
and judging whether a new key frame needs to be generated or not according to preset conditions.
3. The ORB-SLAM2 based dynamic scene mapping and localization method of claim 1, wherein the step 2 of identifying dynamic pixels in combination with the depth image of the new key frame comprises:
projecting the pixels remaining after the target detection algorithm has removed predefined objects into world coordinates to create 3D points; dividing the 3D points into a plurality of clusters, and uniformly selecting M pixels from each cluster; each selected pixel is projected into the N key frames nearest to the new key frame and compared, to detect whether it is a dynamic pixel:
back-projecting pixel u to the 3D point p_w in world coordinates, using the depth in the depth image of the key frame and the robot pose of the new key frame;
projecting the 3D point p_w onto the color image of the jth key frame near the new key frame;
if the pixel u' of the jth key frame has a valid depth value z' in the corresponding depth image, back-projecting u' to the 3D point p_w' in world coordinates;
comparing the distance d between p_w' and p_w with the set threshold d_mth to determine whether pixel u is dynamic:
searching the pixels in a square around u' so that d takes its minimum value d_min; if d_min is greater than the threshold d_mth, the pixel u is preliminarily judged to be dynamic, otherwise it is preliminarily judged to be static.
4. The ORB-SLAM 2-based dynamic scene mapping and localization method of claim 3, wherein, assuming that among the preliminary judgment results of pixel u over all nearby key frames the number of static results is N_s, the number of dynamic results is N_d, and the number of invalid results is N_i, the final attribute of pixel u is as follows:
if N_s ≥ N_d and N_s > 0, pixel u is a static pixel;
if N_d > N_s and N_d > 0, pixel u is a dynamic pixel;
if N_s = N_d = 0, pixel u is invalid.
5. The ORB-SLAM 2-based dynamic scene mapping and positioning method of claim 4, wherein the dynamic pixels detected by both methods are removed, and the removal method after identifying dynamic pixels in combination with the depth image of the new key frame comprises:
among the M pixels uniformly selected from one cluster, assuming the number of static pixels is N_s', the number of dynamic pixels is N_d', and the number of invalid pixels is N_i', the final attribute of the cluster is as follows:
if N_s' ≥ N_d', the cluster is a static cluster and is retained;
if N_d' > N_s', the cluster is a dynamic cluster and is removed.
6. The ORB-SLAM 2-based dynamic scene mapping and localization method of claim 1, wherein optimizing keyframe robot pose, adding new map points, maintaining keyframe set quality and scale, comprises:
calculating the bag-of-words (BoW) vector of the current key frame, and updating the map points of the current key frame;
optimizing the pose of the robot by using sliding-window local bundle adjustment (BA), where the object being optimized is the pose of the current frame;
redundant key frames are detected and culled: if 90% of the points observed on a key frame can be observed by at least three other key frames, the key frame is considered redundant and deleted.
7. The ORB-SLAM 2-based dynamic scene mapping and localization method according to claim 1, wherein said determining whether voxels are occupied by calculating their occupancy probability, if occupied then visualizing in an octree map, comprises:
letting Z_t denote the observation of voxel n at time t and L(n|Z_1:t) denote the log-odds value of voxel n up to time t, at time t+1 the log-odds of voxel n is given by:
L(n|Z_1:t+1) = L(n|Z_1:t) + L(n|Z_t+1)    (Formula 8)
if voxel n is observed at time t+1, L(n|Z_t+1) is τ, otherwise it is 0; the increment τ is a predetermined value;
defining p ∈ [0,1] as the probability that a voxel is occupied and l ∈ R as the log-odds value, l is calculated by the logit transform:
l = logit(p) = log(p / (1 − p))    (Formula 9)
the inverse transform of the above equation is:
p = logit⁻¹(l) = exp(l) / (1 + exp(l)) = 1 / (1 + exp(−l))    (Formula 10)
the occupancy probability p of the voxel is calculated by substituting the log-odds value into Formula 10; only if p is greater than a predetermined threshold is the voxel n considered occupied and visualized in the octree map.
CN201910481714.1A 2019-06-04 2019-06-04 ORB-SLAM 2-based dynamic scene mapping and positioning method Active CN110378997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910481714.1A CN110378997B (en) 2019-06-04 2019-06-04 ORB-SLAM 2-based dynamic scene mapping and positioning method

Publications (2)

Publication Number Publication Date
CN110378997A CN110378997A (en) 2019-10-25
CN110378997B (en) 2023-01-20

Family

ID=68249727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910481714.1A Active CN110378997B (en) 2019-06-04 2019-06-04 ORB-SLAM 2-based dynamic scene mapping and positioning method

Country Status (1)

Country Link
CN (1) CN110378997B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827305B (en) * 2019-10-30 2021-06-08 中山大学 Semantic segmentation and visual SLAM tight coupling method oriented to dynamic environment
CN110930519B (en) * 2019-11-14 2023-06-20 华南智能机器人创新研究院 Semantic ORB-SLAM sensing method and device based on environment understanding
CN111178342B (en) * 2020-04-10 2020-07-07 浙江欣奕华智能科技有限公司 Pose graph optimization method, device, equipment and medium
WO2021207999A1 (en) * 2020-04-16 2021-10-21 华为技术有限公司 Vehicle positioning method and apparatus, and positioning map layer generation method and apparatus
CN111539305B (en) * 2020-04-20 2024-03-12 肇庆小鹏汽车有限公司 Map construction method and system, vehicle and storage medium
CN111709982B (en) * 2020-05-22 2022-08-26 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN111914832B (en) * 2020-06-03 2023-06-13 华南理工大学 SLAM method of RGB-D camera under dynamic scene
CN111862162B (en) * 2020-07-31 2021-06-11 湖北亿咖通科技有限公司 Loop detection method and system, readable storage medium and electronic device
CN112884835A (en) * 2020-09-17 2021-06-01 中国人民解放军陆军工程大学 Visual SLAM method for target detection based on deep learning
CN112802053B (en) * 2021-01-27 2023-04-11 广东工业大学 Dynamic object detection method for dense mapping in dynamic environment
CN113547501B (en) * 2021-07-29 2022-10-28 中国科学技术大学 SLAM-based mobile mechanical arm cart task planning and control method
CN114529672A (en) * 2022-02-15 2022-05-24 国网山东省电力公司建设公司 Three-dimensional reconstruction method and system for transformer substation construction site
CN117596367A (en) * 2024-01-19 2024-02-23 安徽协创物联网技术有限公司 Low-power-consumption video monitoring camera and control method thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105856230A (en) * 2016-05-06 2016-08-17 简燕梅 ORB key frame closed-loop detection SLAM method capable of improving consistency of position and pose of robot
CN108596974A (en) * 2018-04-04 2018-09-28 清华大学 Dynamic scene robot localization builds drawing system and method
CN109631855A (en) * 2019-01-25 2019-04-16 西安电子科技大学 High-precision vehicle positioning method based on ORB-SLAM

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ORB_SLAM2 optimized by semantic segmentation in dynamic scenes (一种动态场景下语义分割优化的ORB_SLAM2); Wang Zhaodong et al.; Journal of Dalian Maritime University (大连海事大学学报); 2018-12-21 (No. 04); full text *

Also Published As

Publication number Publication date
CN110378997A (en) 2019-10-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant