CN114972656A - Dynamic scene vision SLAM optimization method based on semantic segmentation network - Google Patents

Dynamic scene vision SLAM optimization method based on semantic segmentation network

Info

Publication number
CN114972656A
CN114972656A (application CN202210715033.9A)
Authority
CN
China
Prior art keywords
dynamic
semantic segmentation
segmentation network
point
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210715033.9A
Other languages
Chinese (zh)
Inventor
Li Dan (李丹)
Zhao Kai (赵凯)
Guan Ling (管玲)
Xu Feihu (徐飞虎)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Technology AHUT
Original Assignee
Anhui University of Technology AHUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Technology AHUT filed Critical Anhui University of Technology AHUT
Priority to CN202210715033.9A priority Critical patent/CN114972656A/en
Publication of CN114972656A publication Critical patent/CN114972656A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/005: Tree description, e.g. octree, quadtree
    • G06T7/00: Image analysis
    • G06T7/70: Determining position or orientation of objects or cameras
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30244: Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dynamic scene visual SLAM optimization method based on a semantic segmentation network. The method comprises the following steps. Step 1: acquire a color image and a depth image through a camera. Step 2: remove the influence of prior dynamic objects through a semantic segmentation network. Step 3: estimate the pose of the current frame through a lightweight tracking module. Step 4: filter out non-prior dynamic objects with an improved multi-view geometry algorithm. Step 5: mutually verify the multi-view geometry detection result and the LR-ASPP detection result to obtain the complete dynamic region, filter out the ORB feature points on that region, and pass the frame to the tracking thread, which generates keyframes. Step 6: construct an octree map of the dynamic scene from the keyframe data. The invention makes full use of the semantic information output by the semantic segmentation network, reduces use of the region-growing algorithm, and improves the running speed of the system; it constructs a reusable octree map from keyframes, reducing memory occupation and improving map-search efficiency.

Description

Dynamic scene vision SLAM optimization method based on semantic segmentation network
Technical Field
The invention relates to the technical field of visual SLAM positioning and mapping, in particular to a dynamic scene visual SLAM optimization method based on a semantic segmentation network.
Background
SLAM (Simultaneous Localization and Mapping) technology is one of the core technologies for realizing true autonomy of mobile robots. Conventional SLAM systems can be classified into laser SLAM and visual SLAM according to the sensor used. Visual SLAM relies mainly on camera sensor data and combines computer vision and deep learning techniques; with its simplicity, portability, low hardware cost, and high positioning accuracy, it has gradually become the mainstream of SLAM research and is maturing steadily. ORB-SLAM, which tracks with the feature-point method, LSD-SLAM, which tracks with the direct method, and VINS-SLAM, which fuses visual and inertial sensors, all perform satisfactorily in their corresponding scenarios.
However, conventional SLAM frameworks share a common problem: they adopt a strong assumption of a static environment. In challenging dynamic environments, feature points on dynamic objects corrupt the feature matching results, and system robustness and positioning accuracy degrade markedly. In the real world, dynamic elements such as pedestrians, animals, and vehicles are unavoidable during simultaneous localization and mapping, which limits the applicability of traditional algorithms.
In recent years, convolutional neural networks have developed vigorously on image semantic segmentation tasks, offering a new way to address SLAM mapping in dynamic environments. A convolutional neural network can provide semantic labels of objects, help the SLAM system better understand its surroundings, complete some high-level tasks, and improve system robustness. However, a common weakness of the solutions proposed by many scholars against dynamic-environment effects is that only information about prior dynamic objects can be obtained, so robustness to non-prior dynamic objects is poor. Prior-art SLAM systems using an RGB-D camera can identify non-prior dynamic objects well, but because they adopt point-cloud segmentation and region-growing algorithms, the computational load is large and the system cannot run in real time.
Disclosure of Invention
1. Technical problem to be solved by the invention
To address the poor real-time performance and mapping effectiveness of SLAM systems in dynamic environments, the invention provides a dynamic scene visual SLAM optimization method based on a semantic segmentation network: a visual SLAM system, Segment-SLAM, built on ORB-SLAM3 and usable in dynamic environments. Compared with other classical dynamic SLAM systems, Segment-SLAM markedly improves positioning accuracy. In highly dynamic scenes in particular, positioning accuracy improves by more than 80% relative to ORB-SLAM, and octree maps of actual scenes can be created.
2. Technical scheme
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the invention relates to a dynamic scene vision SLAM optimization method based on a semantic segmentation network, which comprises the following steps:
Step 1: acquire image data, including RGB images and depth images, with a monocular RGB-D camera;
Step 2: to ensure real-time performance and segmentation quality, remove the influence of prior dynamic objects in the RGB image through a semantic segmentation network;
Step 3: estimate the pose of the current frame from the image obtained in step 2 through a lightweight tracking module;
Step 4: filter out the non-prior dynamic objects and remaining prior dynamic objects in the image obtained in step 3 with an improved multi-view geometry algorithm;
Step 5: generate keyframes with the tracking thread;
Step 6: construct an octree map of the dynamic scene from the keyframe data.
Further, in step 2 the image is processed by an LR-ASPP semantic segmentation network, which uses the lightweight convolutional network MobileNetV3 as its backbone. The RGB image is processed by the LR-ASPP network to obtain pixel-level semantic information, and this semantic information is used to remove the feature points lying on prior dynamic objects in the image.
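As an illustration only (the patent discloses no source code), the following sketch shows how an off-the-shelf LR-ASPP model with a MobileNetV3-Large backbone, as shipped in torchvision, could be used to drop ORB feature points that fall on a prior dynamic class such as "person". The helper name static_orb_keypoints and all parameter choices are assumptions.

```python
# A minimal sketch, assuming torchvision's pretrained LR-ASPP model
# (21 Pascal-VOC-style categories; class 15 is "person").
# Not the patent's implementation; names and thresholds are illustrative.
import cv2
import numpy as np
import torch
from torchvision import transforms
from torchvision.models.segmentation import lraspp_mobilenet_v3_large

model = lraspp_mobilenet_v3_large(weights="DEFAULT").eval()  # torchvision >= 0.13
PERSON = 15  # prior dynamic class

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def static_orb_keypoints(bgr_image: np.ndarray):
    """Detect ORB keypoints and keep only those off the prior dynamic mask."""
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        out = model(preprocess(rgb).unsqueeze(0))["out"]  # [1, 21, H, W]
    labels = out.argmax(1)[0].numpy()       # per-pixel class ids, H x W
    dynamic = labels == PERSON              # mask of prior dynamic pixels

    orb = cv2.ORB_create(1000)
    keypoints = orb.detect(bgr_image, None)
    return [kp for kp in keypoints
            if not dynamic[int(kp.pt[1]), int(kp.pt[0])]]
```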
Further, in step 3, after the feature points on the prior dynamic objects in the image are removed, the remaining feature points are used by the lightweight tracking module to obtain the pose of the current frame.
Further, the lightweight tracking module only estimates the pose of the current frame and does not participate in subsequent map construction.
Furthermore, the specific method for filtering out the non-prior dynamic object by using the improved multi-view geometry in the step 4 is as follows:
the multi-view geometry selects 5 key frames with the highest overlapping degree with the current frame from 20 key frames closest to the current frame for each input image frame;
detecting dynamic points in the key frame;
when the dynamic point is detected, acquiring a label of the dynamic point according to a semantic segmentation result, and dividing the dynamic point into two types, wherein one type is the dynamic point with semantic information, and the other type is the dynamic point without the semantic information;
semantic contour search is carried out on dynamic points with semantic information, and region growth is carried out on dynamic points without semantic information in the depth map, so that the semantic information can be fully utilized, and the number of region growth seed points is reduced;
and finally, combining the dynamic object mask without semantics and the dynamic object mask with semantic information to obtain a complete dynamic object mask.
Further, the process of detecting the key frame dynamic point in step 4 is as follows:
Assume x is a keypoint on a selected keyframe, x' is the projection of x into the coordinate system of the current frame, and X is the 3D point corresponding to x. Compute the angle α between the viewing rays from x and x' to X, and the projection depth l_proj. When α exceeds a certain threshold, the keypoint is judged a possible dynamic point; the depth l' of keypoint x' in the current frame is then computed from the depth map and compared with l_proj. If the difference Δl = l_proj − l' exceeds a threshold τ_z, the point is considered dynamic.
Further, the angle threshold and the projection depth threshold are set to α = 30° and τ_z = 0.2 m.
Furthermore, in step 4, the multi-view geometric detection result and the LR-ASPP detection result are mutually verified to obtain a complete dynamic region, ORB feature points on the dynamic region are filtered, and feature points outside the mask are extracted through a feature point extraction module, so that the effect of removing dynamic objects is achieved.
Further, the specific process of step 6 is:
Assuming y ∈ R is the log-odds value and x is the probability that the node is occupied, the transformation between x and y is described by the logit transform:

y = logit(x) = ln(x / (1 − x)), with inverse x = 1 / (1 + e^(−y))
storing y to express whether the node is occupied, and increasing y by one value when the occupation is continuously observed, or decreasing y by one value otherwise;
let a node be n and the observed data be z, and let the log-odds of the node given the observations from the beginning up to time t−1 be L(n|z_{1:t−1}); after the observation z_t at time t:

L(n|z_{1:t}) = L(n|z_{1:t−1}) + L(n|z_t)
when a node is repeatedly observed as occupied, its y value keeps increasing; once it exceeds the set threshold, the node is considered occupied and can be visualized in the octree map.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following remarkable effects:
(1) The dynamic scene visual SLAM optimization method based on a semantic segmentation network disclosed by the invention provides a SLAM system for dynamic environments, overcomes the interference of dynamic objects with traditional SLAM systems, and constructs a reusable octree map.
(2) In the method provided by the invention, the LR-ASPP semantic segmentation network eliminates prior dynamic objects, and a coarse pose estimate is obtained through lightweight tracking. The improved multi-view geometry handles non-prior dynamic objects, and contour retrieval makes full use of the semantic information from the convolutional neural network, so a more accurate pose estimate is obtained on top of the lightweight tracking result and the robustness of the system in dynamic environments is improved. Experiments show that the system has good real-time performance and more accurate poses; in particular, high SLAM positioning accuracy is maintained in highly dynamic environments.
(3) In the method, a front end that removes feature points on dynamic objects using the semantic segmentation network and an improved multi-view geometry method is added on top of the ORB-SLAM3 framework. By fully exploiting the semantic information output by the segmentation network, the method reduces use of the region-growing algorithm and improves the running speed of the system; it also constructs a reusable octree map from keyframes, reducing memory occupation and improving map-search efficiency.
Drawings
FIG. 1 is a system framework diagram of the present invention;
FIG. 2 is a diagram of the LR-ASPP network architecture of the present invention;
FIG. 3 is a flow chart of the multi-view geometry algorithm of the present invention;
FIG. 4 is a multi-view geometry inspection dynamic point diagram of the present invention;
FIG. 5 is a graph of Segment-SLAM versus ORB-SLAM3 trajectory error;
FIG. 6 is a graph of octree mapping test results.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Example 1
With reference to fig. 1, the system of the present embodiment employs four-thread parallelism: tracking, local map building, loop detection and octree map building, wherein the tracking thread comprises an LR-ASPP network, a lightweight tracking module, an improved multi-view geometric algorithm module, a feature point extraction module, an initialization and repositioning or map reconstruction module and a local tracking module.
The present embodiment first acquires color images and depth images through an RGB-D camera to provide image frames for LR-ASPP.
Fig. 2 shows the structure of the LR-ASPP network. To ensure real-time performance and segmentation quality, this embodiment adopts the lightweight convolutional network MobileNetV3-Large as the backbone of LR-ASPP. The color image is first processed by the LR-ASPP semantic segmentation network to obtain pixel-level semantic information, which is used to remove the feature points on prior dynamic objects in the image.
The LR-ASPP model is trained on a subset of COCO train2017 and can distinguish 20 object classes. The mean IoU (Intersection over Union) of the model is 57.9 and its global pixel-wise accuracy is 91.2, giving a good segmentation effect while meeting the real-time requirement.
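For reference, mean IoU averages, over the classes, the ratio of intersection to union between predicted and ground-truth label maps. A minimal numpy sketch, not tied to the patent's evaluation code:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union over the classes present in the labels."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```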
The MobileNetV3-Large network uses NAS techniques, proposes the h-swish activation function, and adds SE modules, effectively reducing the computational load while improving network accuracy and overall performance.
Feature maps of different resolutions extracted by the MobileNetV3 backbone serve as the input of the LR-ASPP semantic segmentation head, so that LR-ASPP can effectively eliminate feature points on prior dynamic objects.
The image after semantic segmentation passes through the lightweight tracking module, a simplified version of the local tracking module, which only produces a pose and generates no keyframes. Once the pose of the current frame is obtained, the depth of the feature points is easy to compute. Non-prior dynamic objects are further processed by the improved multi-view geometry. Fig. 3 is the flowchart of the multi-view geometry algorithm:
First, for each input frame, the multi-view geometry selects the 5 keyframes with the highest degree of overlap with the current frame from the 20 keyframes closest to it. Dynamic points are then obtained through the improved multi-view geometry algorithm; Fig. 4 is a schematic diagram of multi-view geometric dynamic-point detection:
Let x be a keypoint on a selected keyframe, x' the projection of x into the coordinate system of the current frame, and X the 3D point corresponding to x. Compute the angle α between the viewing rays from x and x' to X, and the projection depth l_proj. When α is greater than a certain threshold, the keypoint may be a dynamic point, and a further depth-change test is required. Taking reprojection error into account, the depth l' of keypoint x' in the current frame is obtained directly from the depth map and compared with l_proj. If the difference Δl = l_proj − l' exceeds a threshold τ_z, the point is considered dynamic. Experimental tests on the TUM dataset show that with α = 30° and τ_z = 0.2 m, dynamic and static objects are well distinguished.
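A minimal numpy sketch of this dynamic-point test, assuming a pinhole camera model with intrinsics K and 4x4 camera-to-world poses; the function name and the exact candidate/confirmation logic are illustrative assumptions, not the patent's code:

```python
import numpy as np

ALPHA_THRESH = np.deg2rad(30.0)  # angle threshold alpha = 30 degrees
TAU_Z = 0.2                      # projection depth threshold tau_z, metres

def is_dynamic_point(x_kf, depth_kf, T_kf, T_cur, depth_cur, K):
    """Dynamic-point test for one keyframe keypoint (illustrative sketch).

    x_kf        : (u, v) pixel of keypoint x in the keyframe
    depth_kf    : depth of x in the keyframe, metres
    T_kf, T_cur : 4x4 camera-to-world poses of keyframe and current frame
    depth_cur   : dense depth map of the current frame
    K           : 3x3 camera intrinsics
    """
    # Back-project x to the 3D point X in world coordinates.
    ray = np.linalg.inv(K) @ np.array([x_kf[0], x_kf[1], 1.0])
    X_world = (T_kf @ np.append(depth_kf * ray, 1.0))[:3]

    # Project X into the current frame: x' and its projected depth l_proj.
    X_cur = (np.linalg.inv(T_cur) @ np.append(X_world, 1.0))[:3]
    l_proj = X_cur[2]
    if l_proj <= 0:
        return False                       # behind the camera, skip
    u, v, _ = K @ (X_cur / l_proj)
    if not (0 <= int(v) < depth_cur.shape[0] and 0 <= int(u) < depth_cur.shape[1]):
        return False                       # projects outside the image

    # Angle alpha between the two viewing rays of X.
    r1 = X_world - T_kf[:3, 3]
    r2 = X_world - T_cur[:3, 3]
    cos_a = r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2))
    if np.arccos(np.clip(cos_a, -1.0, 1.0)) <= ALPHA_THRESH:
        return False                       # too little parallax: treat as static

    # Candidate dynamic point: confirm with the depth-consistency test
    # Delta_l = l_proj - l' against tau_z.
    l_measured = depth_cur[int(v), int(u)]
    return (l_proj - l_measured) > TAU_Z
```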
When the dynamic point is detected, the label of the dynamic point is obtained according to the result of semantic segmentation, and the dynamic point is divided into two types, wherein one type is the dynamic point with semantic information, and the other type is the dynamic point without the semantic information.
Semantic contour search is then performed on the feature points carrying semantic information to obtain the dynamic-object mask with semantics, while region growing in the depth map is applied to the dynamic points without semantic information to obtain the dynamic-object mask without semantics. This makes full use of the semantic information, reduces the number of region-growing seed points, and improves operating efficiency.
Finally, the dynamic-object mask without semantics is merged with the dynamic-object mask with semantic information to obtain the complete dynamic-object mask, as in the sketch below.
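A sketch of how the two masks might be produced and merged with OpenCV, using floodFill as the depth-based region growing; all names, the depth tolerance, and the flood-fill parameters are assumptions made for illustration:

```python
import cv2
import numpy as np

def build_dynamic_mask(sem_labels, depth, pts_with_sem, pts_without_sem,
                       depth_tol=0.05):
    """Merge semantic-contour and region-grown dynamic masks (illustrative).

    sem_labels      : H x W per-pixel class ids from the segmentation network
    depth           : H x W depth map, float32, metres
    pts_with_sem    : (u, v) dynamic points that carry a semantic label
    pts_without_sem : (u, v) dynamic points without a semantic label
    """
    h, w = depth.shape
    # Dynamic points with semantics: take the whole labelled region
    # (the "semantic contour search"), avoiding region growing entirely.
    mask_sem = np.zeros((h, w), np.uint8)
    for (u, v) in pts_with_sem:
        mask_sem |= (sem_labels == sem_labels[v, u]).astype(np.uint8)

    # Dynamic points without semantics: region growing in the depth map,
    # seeded at each point, joining neighbouring pixels of similar depth.
    mask_grow = np.zeros((h + 2, w + 2), np.uint8)  # floodFill needs a border
    flags = 4 | (1 << 8) | cv2.FLOODFILL_MASK_ONLY  # 4-connectivity, write 1s
    for (u, v) in pts_without_sem:
        if mask_grow[v + 1, u + 1] == 0:            # not grown already
            cv2.floodFill(depth, mask_grow, (u, v), 0,
                          loDiff=depth_tol, upDiff=depth_tol, flags=flags)
    mask_nosem = mask_grow[1:-1, 1:-1]

    # Complete dynamic-object mask: union of the two masks.
    return cv2.bitwise_or(mask_sem, mask_nosem)
```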
Meanwhile, the detection result of the improved multi-view geometry algorithm and the detection result of LR-ASPP are mutually verified to obtain the complete dynamic region, and the ORB feature points on the dynamic region are filtered out. The feature points outside the mask are then extracted by the feature point extraction module, achieving the removal of dynamic objects. The data then passes through the initialization and relocalization (or map reconstruction) module into the local tracking module.
At this point, the local tracking module of the tracking thread generates a keyframe. Octree mapping then proceeds: the octree map of the dynamic scene is built from the keyframe data.
An octree is a flexible, compact map representation that can be updated in real time. A large cube is recursively subdivided into eight blocks until the smallest voxel is reached; the whole large cube is the root node, and the smallest blocks are the leaf nodes. Going up one layer of nodes therefore expands the represented volume eightfold. When all child nodes of a block share the same state (all occupied or all free), the node need not be expanded.
The octree map is updated probabilistically. Each node stores whether it is occupied: free can be represented by 0 and occupied by 1. For convenience, a floating-point number x ∈ [0, 1] expresses the occupancy of a node, typically initialized to x = 0.5. The value is increased when the node is observed occupied and decreased when it is observed free, which dynamically models the obstacle information in the map. Since repeated updates could push x outside [0, 1], the log-odds value is used instead. Given y ∈ R as the log-odds value, the transformation between x and y is described by the logit transform:

y = logit(x) = ln(x / (1 − x)), with inverse x = 1 / (1 + e^(−y))
thus y expresses well whether a node is occupied, and when occupancy is constantly observed, y is incremented by one value, otherwise it is decremented by one value. In mathematical terms, let a certain node be n and the observed data be z. Then the log of the probability for a node from the beginning to time t is: l (n | z) 1:n-1 ) Then time t +1 is:
L(n|z 1:t+1 )=L(n|z 1:t-1 )+L(n|z t )
when a certain node is observed repeatedly, the value of y is increased continuously, and when the set threshold value is exceeded, the node is occupied and can be visualized in the octree graph. By the method, the map of the dynamic environment can be well constructed.
The present embodiment scheme is evaluated below using a sub-dataset for dynamic objects in the disclosed TUM RGB-D dataset.
All experiments run on a PC with an AMD Ryzen 3700 CPU, 16 GB of memory, an RTX 2080 GPU, and 8 GB of video memory. The system environment is Ubuntu 18.04 with CUDA 10.2; the deep learning framework used for model training is PyTorch 1.10.0, and the deployed inference model uses the LibTorch 1.10.0 library.
Pose estimation experiments are performed on the TUM RGB-D dataset; for pose error estimation, the camera poses estimated by Segment-SLAM are compared with the ground-truth poses in the dataset. The evaluation metric is the Absolute Trajectory Error (ATE).
The root mean square error (RMSE) and the mean error are used to evaluate the system. RMSE is easily influenced by large or occasional errors and therefore reflects the robustness of the system well, while the mean reflects its stability.
Segment-SLAM was first tested on 8 high- and low-dynamic sequences and evaluated quantitatively on w_half and s_half. Each experiment was run 5 times and the results averaged. The improvement rate is calculated as follows:
ρ = (α − β) / α × 100%
where α denotes the test result of the SLAM system under comparison, β the test result of the Segment-SLAM system, and ρ the relative improvement rate.
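For example, with hypothetical errors of α = 0.35 m for the baseline system and β = 0.01 m for Segment-SLAM, ρ = (0.35 − 0.01) / 0.35 × 100% ≈ 97.1%.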
Fig. 5(a) and (b) show the trajectory error of Segment-SLAM and ORB-SLAM3 on the w_half data, Fig. 5(c) and (d) the trajectory error in the XYZ directions, and Fig. 5(e) and (f) the trajectory error in the RPY angles, where CameraTrajectory denotes Segment-SLAM and ORBCameraTrajectory denotes ORB-SLAM3. Fig. 5 shows that ORB-SLAM3 performs poorly in the highly dynamic environment: tracking in the x direction fluctuates strongly, and angular tracking occasionally shows large errors. This is because the lens was rotating at the time, and ORB-SLAM3 extracted too many feature points on people, producing many mismatches that degrade positioning accuracy. Segment-SLAM, having removed the dynamic objects, maintains a good tracking effect and essentially coincides with the ground-truth trajectory in both the XYZ and RPY directions. This demonstrates the invention's highly successful improvement of the ORB-SLAM system.
Table 1 shows the absolute track error comparison between Segment-SLAM and ORB-SLAM3, and the absolute track error index directly calculates the difference between the true value and the estimated value of the camera pose, so that the algorithm precision and the global consistency of the track can be reflected very intuitively.
TABLE 1. Absolute Trajectory Error (ATE) analysis and comparison (unit: m)
(Table 1 appears as an image in the original publication and is not reproduced here.)
Here N denotes using the convolutional neural network alone, N+G denotes using the convolutional neural network together with the multi-view geometry, the bolded entries mark the best result on each sequence, and ρ denotes the relative improvement of Segment-SLAM over ORB-SLAM3.
From the results in Table 1, ORB-SLAM3 performs well on the low-dynamic data but unsatisfactorily on the high-dynamic sequences.
Segment-SLAM (N+G) performs excellently in both low- and high-dynamic environments, keeping the error at the centimeter level; its relative improvement in dynamic environments exceeds 80%, reaching 97.16% at most, reflecting the robustness and stability of the Segment-SLAM system.
Meanwhile, Segment-SLAM is compared with SLAM algorithms in other dynamic scenes, and the comparison result is shown in Table 2.
TABLE 2. Analysis and comparison of SLAM algorithms in dynamic scenes (unit: m)
(Table 2 appears as an image in the original publication and is not reproduced here.)
The comparison covers DS-SLAM, which is based on a semantic segmentation network; Detect-SLAM, based on an object detection network; DVO-SLAM, which optimizes the pose graph between keyframes and computes constraints by minimizing photometric and depth errors; and the optical-flow-based MR-SLAM. The results show that Segment-SLAM leads comprehensively in accuracy.
Further, octree mapping with Segment-SLAM was tested on the s_static sequence and in a real scene.
In Fig. 6, (a) is the real scene of the dataset, in which two people sit on chairs talking throughout; (b) is the real laboratory environment, in which a person keeps moving through the scene; (c) is the octree map generated from the s_static sequence, in which the parts corresponding to the people sitting on chairs have been substantially removed; (d) is the mapping experiment in the real scene, in which the person has been completely eliminated.
The s_static sequence contains 707 frames in total. The point-cloud map generated from the image sequence occupies 55.6 MB on disk, while the octree map occupies only 3.1 MB, i.e. 5.64% of the point-cloud file. The octree map can therefore model large-scale scenes effectively and is convenient to maintain and update.
The test results show that the system has good real-time performance and more accurate poses; in particular, high SLAM positioning accuracy is maintained in highly dynamic environments.
The invention and its embodiments have been described above schematically, and the description is not limiting; the drawings show only one embodiment, and the actual structure is not limited thereto. Therefore, if a person of ordinary skill in the art, taught by the invention and without departing from its spirit, devises structural modes and embodiments similar to this technical solution without inventive effort, these shall fall within the protection scope of the invention.

Claims (10)

1. A dynamic scene vision SLAM optimization method based on a semantic segmentation network is characterized by comprising the following steps:
Step 1: acquiring image frames, including RGB images and depth images, through a camera;
Step 2: removing the influence of prior dynamic objects in the RGB image through a semantic segmentation network;
Step 3: estimating the pose of the current frame from the image obtained in step 2 through a lightweight tracking module;
Step 4: filtering out the non-prior dynamic objects and prior dynamic objects in the image obtained in step 3 with an improved multi-view geometry algorithm;
Step 5: generating keyframes with the tracking thread;
Step 6: constructing an octree map of the dynamic scene from the keyframe data.
2. The dynamic scene vision SLAM optimization method based on the semantic segmentation network as claimed in claim 1, wherein: in the step 1, image frames are collected through a monocular RGB-D camera.
3. The dynamic scene vision SLAM optimization method based on the semantic segmentation network as claimed in claim 1 or 2, wherein: in step 2, the image is processed through an LR-ASPP semantic segmentation network that uses the lightweight convolutional network MobileNetV3 as its backbone; the RGB image is processed by the LR-ASPP network to obtain pixel-level semantic information, and the semantic information is used to remove the feature points on prior dynamic objects in the image.
4. The dynamic scene vision SLAM optimization method based on the semantic segmentation network as claimed in claim 3, wherein: in step 3, after the feature points on the prior dynamic objects in the image are removed, the remaining feature points are used by the lightweight tracking module to obtain the pose of the current frame.
5. The dynamic scene vision SLAM optimization method based on the semantic segmentation network as claimed in claim 4, wherein: the light weight tracking module only estimates the pose of the current frame and does not participate in the subsequent image construction.
6. The dynamic scene vision SLAM optimization method based on the semantic segmentation network as claimed in claim 5, wherein: the specific method for filtering the non-prior dynamic object by using the improved multi-view geometry method in the step 4 comprises the following steps:
the multi-view geometry selects 5 key frames with the highest overlapping degree with the current frame from 20 key frames closest to the current frame for each input image frame;
detecting dynamic points in the key frame;
when the dynamic point is detected, acquiring a label of the dynamic point according to a semantic segmentation result, and dividing the dynamic point into two types, wherein one type is the dynamic point with semantic information, and the other type is the dynamic point without the semantic information;
semantic contour search is carried out on dynamic points with semantic information, and region growth is carried out on dynamic points without semantic information in the depth map, so that the semantic information can be fully utilized, and the number of region growth seed points is reduced;
and finally, combining the dynamic object mask without semantics and the dynamic object mask with semantic information to obtain a complete dynamic object mask.
7. The dynamic scene vision SLAM optimization method based on the semantic segmentation network as claimed in claim 6, wherein: the process of detecting the dynamic point of the key frame in the step 4 is as follows:
assuming that x is a keypoint on a selected keyframe, x' is the projection of x into the coordinate system of the current frame, and X is the 3D point corresponding to x, the angle α between the viewing rays from x and x' to X and the projection depth l_proj are calculated; when the angle α is greater than a certain threshold, the keypoint is judged a possible dynamic point; the depth l' of keypoint x' in the current frame is then obtained from the depth map and compared with l_proj; if the difference Δl = l_proj − l' exceeds a threshold τ_z, the point is considered dynamic.
8. The dynamic scene vision SLAM optimization method based on the semantic segmentation network as claimed in claim 7, wherein: the angle threshold and the projection depth threshold are α = 30° and τ_z = 0.2 m, respectively.
9. The dynamic scene vision SLAM optimization method based on the semantic segmentation network as claimed in claim 8, wherein: in step 4, the multi-view geometry detection result and the LR-ASPP detection result are mutually verified to obtain the complete dynamic region, the ORB feature points on the dynamic region are filtered out, and the feature points outside the mask are extracted by a feature point extraction module, thereby achieving the effect of eliminating dynamic objects.
10. The dynamic scene vision SLAM optimization method based on the semantic segmentation network as claimed in claim 9, wherein: the specific process of the step 6 is as follows:
assuming y ∈ R is the log-odds value and x is the probability that the node is occupied, the transformation between x and y is described by the logit transform:

y = logit(x) = ln(x / (1 − x)), with inverse x = 1 / (1 + e^(−y))
storing y to express whether the node is occupied, and increasing y by one value when the occupation is continuously observed, or decreasing y by one value otherwise;
let a node be n and the observed data be z, and let the log-odds of the node given the observations from the beginning up to time t−1 be L(n|z_{1:t−1}); after the observation z_t at time t:

L(n|z_{1:t}) = L(n|z_{1:t−1}) + L(n|z_t)
when a node is repeatedly observed as occupied, its y value keeps increasing; once it exceeds the set threshold, the node is considered occupied and can be visualized in the octree map.
CN202210715033.9A 2022-06-23 2022-06-23 Dynamic scene vision SLAM optimization method based on semantic segmentation network Pending CN114972656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210715033.9A CN114972656A (en) 2022-06-23 2022-06-23 Dynamic scene vision SLAM optimization method based on semantic segmentation network


Publications (1)

Publication Number Publication Date
CN114972656A true CN114972656A (en) 2022-08-30

Family

ID=82965254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210715033.9A Pending CN114972656A (en) 2022-06-23 2022-06-23 Dynamic scene vision SLAM optimization method based on semantic segmentation network

Country Status (1)

Country Link
CN (1) CN114972656A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination