CN115187614A - Real-time simultaneous positioning and mapping method based on STDC semantic segmentation network - Google Patents
- Publication number
- CN115187614A (application CN202210690440.9A)
- Authority
- CN
- China
- Prior art keywords
- stdc
- algorithm
- semantic segmentation
- feature points
- segmentation network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/11 — Region-based segmentation (image analysis: segmentation; edge detection)
- G01C21/3804 — Creation or updating of map data (electronic maps for navigation)
- G06N3/04 — Neural networks: architecture, e.g. interconnection topology
- G06N3/08 — Neural networks: learning methods
- G06T7/13 — Edge detection
- G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
- G06T2207/10028 — Range image; depth image; 3-D point clouds
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30204 — Marker
- G06T2207/30244 — Camera pose
Abstract
The invention discloses a real-time simultaneous positioning and mapping method based on an STDC semantic segmentation network, belonging to the field of autonomous navigation of intelligent robots. The method comprises the following steps: S1, acquiring environment information with an RGB-D depth camera; S2, preprocessing the input images and extracting feature points with the ORB algorithm; S3, obtaining image semantic information with an STDC semantic segmentation network and using it to eliminate dynamic feature points; and S4, performing positioning and navigation on the remaining feature points with the ORB-SLAM3 algorithm. Verification on the public TUM dataset and comparison with recent high-performing SLAM systems show that the proposed system positions and navigates more accurately in dynamic environments.
Description
Technical Field
The invention belongs to the field of autonomous navigation of intelligent robots, and particularly relates to a real-time simultaneous positioning and mapping method based on an STDC semantic segmentation network.
Background
In recent years, with the rapid development of technologies such as big data and deep learning, mobile robot technology has matured considerably and brought great convenience to people's lives. Mobile robotics involves many disciplines, such as kinematics, dynamics, control theory, computer science, mechanical principles and sensor technology, and is one of the most active fields in current scientific research. Simultaneous Localization and Mapping (SLAM) refers to a mobile robot simultaneously localizing itself and building a map of its surroundings without any prior knowledge of the environment.
Although visual SLAM technology has developed greatly, some problems remain to be solved. For example, when a mobile robot works in a complex dynamic environment, a highly dynamic target can cause large inconsistency between two adjacent image frames, seriously affecting the robustness of the SLAM system. In addition, many existing visual SLAM algorithms assume the external environment is a static scene and ignore the influence of dynamic objects. When dynamic objects appear in the environment, the robustness of the SLAM system suffers, its positioning accuracy drops, and tracking may even fail.
Combining deep learning with SLAM can effectively reduce the influence of dynamic targets on positioning and mapping, but high-accuracy semantic segmentation networks consume a large amount of time per image and do not meet the real-time requirements of practical applications.
CN113516664A discloses a visual SLAM method based on semantic segmentation of dynamic points, which uses a Mask R-CNN segmentation network together with a multi-view geometric constraint algorithm to remove dynamic feature points, improving the positioning accuracy and robustness of the system in a dynamic environment. However, the Mask R-CNN network used in this method is time-consuming at segmentation, which reduces the processing speed of the system.
CN112435262A discloses a dynamic environment information detection method based on a semantic segmentation network and multi-view geometry, which uses the lightweight semantic segmentation network FcHarDnet together with a multi-view geometric constraint algorithm to remove dynamic feature points, improving system robustness. However, the segmentation speed of the adopted network still falls short of the real-time requirement, and the multi-view geometric constraint algorithm is itself time-consuming, so the system does not run in real time.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A real-time simultaneous positioning and mapping method based on an STDC semantic segmentation network is provided. The technical scheme of the invention is as follows:
a real-time simultaneous positioning and mapping method based on an STDC semantic segmentation network comprises the following steps:
s1, directly shooting by using an RGB-D depth camera to obtain RGB image information and image depth information;
s2, extracting feature points of the RGB image information and the image depth information by using an ORB corner detection and feature description algorithm;
s3, obtaining image semantic information by using an STDC (Short-Term Dense Concatenate) network, and removing dynamic feature points by using the semantic information;
and S4, positioning and navigating the remaining feature points by utilizing the ORB-SLAM3 algorithm, wherein ORB-SLAM3 is a real-time feature-point-based SLAM algorithm comprising a tracking thread, a local mapping thread and a loop detection thread.
Further, in the step S2, feature points are extracted by using an ORB algorithm; the method specifically comprises the following steps:
firstly, obtaining FAST key points by using a FAST algorithm, wherein the method comprises the following steps: 1. traversing each pixel in the image to be extracted, and calculating the gray value I of the pixel p (ii) a 2. Setting a threshold T (T is generally I) p 30% of); 3. selecting 16 pixel points on a circle with the pixel as the center of a circle and 3 as the radius; 4. if the gray values of continuous 12 pixel points in the 16 pixel points are all larger than I p + T or less than I p -T, determining the pixel as a FAST keypoint.
Then, the gray centroid of the image patch is used to add an orientation to the feature point. The moments of an image patch A are defined as:

$$m_x = \sum_{(x,y)\in A} x\,I(x,y), \qquad m_y = \sum_{(x,y)\in A} y\,I(x,y)$$

in the formula: I(x, y) is the gray value at image pixel (x, y), and $m_x$, $m_y$ are the moments in the horizontal and vertical directions, respectively.

The direction of the feature point is defined as:

$$\theta = \arctan\left(m_y / m_x\right)$$

Finally, the feature points are described with BRIEF descriptors. N point pairs are randomly selected around the feature point P and their gray values compared:

$$\tau(P;\,x,y) = \begin{cases} 1, & P(x) < P(y) \\ 0, & P(x) \ge P(y) \end{cases}$$

in the formula: P(x) and P(y) are the gray values at points x and y, respectively.

The BRIEF descriptor is therefore expressed as:

$$f_N(P) = \sum_{n=1}^{N} 2^{n-1}\,\tau(P;\,x_n, y_n)$$

in the formula: n indexes the n-th point pair of the feature point.
further, the STDC semantic segmentation network in step S3 is specifically to select an STDC2-Seg75 network, the STDC2-Seg75 network is encoded by using an STDC module, and features of low-level learning space details are guided by using training loss.
Further, the STDC module obtains feature maps with different receptive fields using 4 convolutional layers, each followed by an excitation layer, and then fuses these feature maps by concatenation. The 4 convolutional layers use 2-D convolutions with stride 1 and kernel sizes of 1×1 and 3×3 respectively, and the excitation layers use the ReLU function, whose formula is:
R(x)=max(0,x) (13)
in the formula: x is the input and R (x) is the output after the ReLU unit.
Further, the training loss of the STDC semantic segmentation network combines a Dice loss and a binary cross-entropy loss:

$$L_d = L_{dice}(p_d, g_d) + L_{bce}(p_d, g_d) \tag{14}$$

in the formula: $L_d$ is the detail training loss, $L_{dice}$ the Dice (binary segmentation) loss, $L_{bce}$ the binary cross-entropy loss, $p_d$ the predicted detail map, and $g_d$ the corresponding detail ground truth. The model trained with this loss segments the input image to obtain a semantic segmentation map.
Further, eliminating dynamic feature points with the semantic information specifically comprises: firstly, marking high-dynamic targets in the semantic segmentation map; then using the marked segmentation map as a mask; and finally removing the dynamic feature points by applying the mask to the feature point map.
Further, the step S4 of positioning and navigating the remaining feature points by using the ORB-SLAM3 algorithm specifically includes:
tracking the thread: and searching and matching local map feature points, minimizing a reprojection error by using a Beam Adjustment (BA) algorithm, and positioning the pose of each frame of camera.
Local mapping thread: and optimizing the pose and the feature point cloud of the camera by using a local BA algorithm.
Loop detection thread: and detecting a loop and eliminating accumulated drift errors through pose graph optimization. And after the pose graph is optimized, a global BA algorithm thread is started, and the optimal structure and motion result of the whole system are calculated.
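For illustration only (ORB-SLAM3 itself performs BA jointly over many poses and points with an optimization library), the reprojection error that the tracking thread's BA step minimizes can be sketched as below; the function names and the simple pinhole model are assumptions for the sketch:

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of 3-D world points X (N, 3) into pixels,
    given camera rotation R (3, 3), translation t (3,), intrinsics K (3, 3)."""
    Xc = X @ R.T + t          # world frame -> camera frame
    uv = Xc @ K.T             # camera frame -> homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]

def reprojection_error(K, R, t, X, obs):
    """Mean squared pixel error between projected points and observed
    feature locations -- the quantity BA minimizes over (R, t) and X."""
    return float(((project(K, R, t, X) - obs) ** 2).sum(axis=1).mean())
```

BA would iteratively adjust R, t and X (e.g. by Gauss-Newton) to drive this error down; the sketch only evaluates the residual.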
The invention has the following advantages and beneficial effects:
the invention provides a real-time simultaneous positioning and mapping method based on an STDC semantic segmentation network, aiming at the problems that an SLAM system fusing the semantic segmentation network is low in positioning accuracy in a dynamic environment and high in processing time consumption of the segmentation network. On one hand, the STDC short-term dense connection network is adopted as a semantic segmentation network of the system, and as the steps 3 and 4, the characteristics of low-level learning space details are guided by using training loss, and the guiding step is removed in a prediction stage, so that the accuracy and the processing speed of the network are improved; on the other hand, due to the high time consumption of multi-view geometric constraint, the invention adopts a direct artificial marking method, such as step 6, to artificially mark high-dynamic objects, such as people, animals, automobiles and the like, and the method can extremely improve the processing speed and has high accuracy.
Experiments on the public TUM dataset show that the proposed real-time simultaneous positioning and mapping method integrating the STDC semantic segmentation network achieves better positioning accuracy and robustness, while the semantic segmentation thread consumes little time and meets the real-time requirement. To compare the positioning accuracy and time consumption of SLAM systems based on different semantic segmentation networks, the proposed system was compared with other semantic-segmentation-based SLAM systems; it shows smaller positioning error and lower time consumption, achieving excellent performance.
Drawings
FIG. 1 is a framework of a real-time simultaneous localization and mapping method for an STDC-based semantic segmentation network according to a preferred embodiment of the present invention;
fig. 2 is an STDC semantic segmentation network framework.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, a method for simultaneously positioning and mapping in real time based on an STDC semantic segmentation network, which utilizes the STDC semantic segmentation network to segment a dynamic target in an environment in real time, eliminates the influence of a dynamic object on the stability of an SLAM system in a real environment, and improves the positioning accuracy and robustness of the SLAM, specifically comprising the following steps:
s1, directly shooting by using an RGB-D depth camera to obtain RGB image information and image depth information.
And S2, the input image information is preprocessed, and ORB feature points are extracted from the RGB image information and the image depth information with the ORB algorithm. Firstly, FAST key points are obtained with the FAST algorithm as follows: 1. traverse each pixel of the image and record its gray value $I_p$; 2. set a threshold T (T is generally taken as 30% of $I_p$); 3. select the 16 pixels lying on a circle of radius 3 centered on the pixel; 4. if 12 consecutive pixels among these 16 all have gray values greater than $I_p + T$ or all less than $I_p - T$, determine the pixel to be a FAST key point.
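As a minimal illustrative sketch (not the patent's implementation), the segment test in steps 1-4 can be written directly in NumPy. The 30%-of-$I_p$ adaptive threshold follows the description above; real detectors such as OpenCV's FAST typically use a fixed threshold instead:

```python
import numpy as np

# Offsets of the 16 pixels on a Bresenham circle of radius 3 around the candidate.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_keypoint(img, r, c, t_ratio=0.3, n_contig=12):
    """Segment test: the pixel at (r, c) is a FAST key point if 12 contiguous
    circle pixels are all brighter than I_p + T or all darker than I_p - T,
    with T = t_ratio * I_p as in the description above."""
    ip = float(img[r, c])
    t = t_ratio * ip
    vals = np.array([float(img[r + dy, c + dx]) for dx, dy in CIRCLE])
    brighter = vals > ip + t
    darker = vals < ip - t
    for mask in (brighter, darker):
        doubled = np.concatenate([mask, mask])  # doubling handles wrap-around runs
        run = 0
        for v in doubled:
            run = run + 1 if v else 0
            if run >= n_contig:
                return True
    return False
```

Doubling the boolean ring before scanning is a simple way to catch contiguous runs that wrap past index 15; production detectors use faster early-exit tests on pixels 1, 5, 9, 13.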
Then, the gray centroid of the image patch is used to add an orientation to the feature point. The moments of an image patch A are defined as:

$$m_x = \sum_{(x,y)\in A} x\,I(x,y), \qquad m_y = \sum_{(x,y)\in A} y\,I(x,y)$$

in the formula: I(x, y) is the gray value at image pixel (x, y), and $m_x$, $m_y$ are the moments in the horizontal and vertical directions, respectively.

The direction of the feature point is defined as:

$$\theta = \arctan\left(m_y / m_x\right)$$

Finally, the feature points are described with BRIEF descriptors. N point pairs are randomly selected around the feature point P and their gray values compared:

$$\tau(P;\,x,y) = \begin{cases} 1, & P(x) < P(y) \\ 0, & P(x) \ge P(y) \end{cases}$$

in the formula: P(x) and P(y) are the gray values at points x and y, respectively.

The BRIEF descriptor is therefore expressed as:

$$f_N(P) = \sum_{n=1}^{N} 2^{n-1}\,\tau(P;\,x_n, y_n)$$

in the formula: n indexes the n-th point pair of the feature point.
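The gray-centroid orientation and the BRIEF test can be sketched as follows. The uniform random sampling of point pairs and the 256-pair descriptor length are assumptions for illustration (real ORB uses a fixed, learned sampling pattern and steers it by the patch orientation):

```python
import numpy as np

def orientation(patch):
    """Gray-centroid orientation: theta = arctan(m_y / m_x), with
    m_x = sum(x * I) and m_y = sum(y * I) over the patch."""
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    m_x = (xs * patch).sum()
    m_y = (ys * patch).sum()
    return np.arctan2(m_y, m_x)

def brief_descriptor(patch, n_pairs=256, rng=None):
    """BRIEF-style binary descriptor: tau(P; x, y) = 1 if P(x) < P(y), else 0,
    over n_pairs randomly chosen point pairs, packed into bytes."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = patch.shape
    xs = rng.integers(0, w, size=(n_pairs, 2))
    ys = rng.integers(0, h, size=(n_pairs, 2))
    bits = (patch[ys[:, 0], xs[:, 0]] < patch[ys[:, 1], xs[:, 1]]).astype(np.uint8)
    return np.packbits(bits)  # 256 bits -> 32 bytes
```

Matching then reduces to Hamming distance between the packed bit strings, which is why BRIEF-family descriptors are fast enough for real-time SLAM front ends.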
and S3, obtaining image semantic information by using an STDC semantic segmentation network, selecting an STDC2-Seg75 network, coding the network by using an STDC module, obtaining feature maps of different receptive fields by using 4 convolutional layers and excitation layers, and then performing cascade fusion on the feature maps of the different receptive fields. The 4 convolutional layers respectively adopt a 2-dimensional convolution algorithm with the step length of 1 and the convolution kernel size of {1, 3}, the excitation layer adopts a ReLU function, and the formula is as follows:
R(x)=max(0,x) (20)
in the formula: x is the input and R (x) is the output after the ReLU unit.
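A toy sketch of the cascade-fusion idea behind the STDC module, under loud assumptions: the learned convolutions are replaced by simple mean filters and the channel reduction of the real STDC block is omitted. Only the structure named above is illustrated — four stride-1 stages (kernel sizes 1 and 3) with ReLU, whose feature maps of growing receptive field are concatenated:

```python
import numpy as np

def relu(x):
    # R(x) = max(0, x)
    return np.maximum(0.0, x)

def box_filter(x, k):
    """Naive k x k mean filter, a stand-in for a learned conv layer."""
    h, w = x.shape
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def stdc_like_block(x):
    """Toy STDC-style block: 4 cascaded stages (kernel sizes 1, 3, 3, 3,
    stride 1), each followed by ReLU; the 4 feature maps with growing
    receptive fields are concatenated along a channel axis."""
    feats = []
    f = x
    for k in (1, 3, 3, 3):
        f = relu(box_filter(f, k))
        feats.append(f)
    return np.stack(feats, axis=0)  # shape (4, H, W)
```

The point of the concatenation is that one block exposes both fine (small receptive field) and coarse (large receptive field) responses at once, which is what lets a shallow network keep spatial detail.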
The training loss combines a Dice loss and a binary cross-entropy loss:

$$L_d = L_{dice}(p_d, g_d) + L_{bce}(p_d, g_d) \tag{21}$$

in the formula: $L_d$ is the detail training loss, $L_{dice}$ the Dice (binary segmentation) loss, $L_{bce}$ the binary cross-entropy loss, $p_d$ the predicted detail map, and $g_d$ the corresponding detail ground truth. The model trained with this loss segments the input image to obtain a semantic segmentation map. Highly dynamic objects in the segmentation map, such as people, animals and cars, are then labeled, and the labeled segmentation map is used as a mask. Finally, the dynamic feature points are removed by applying the mask to the feature point map.
And S4, positioning and navigation are performed on the remaining feature points with the ORB-SLAM3 algorithm, yielding the SLAM trajectory tracking map and an environment point cloud map.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.
Claims (7)
1. A real-time simultaneous positioning and mapping method based on an STDC semantic segmentation network is characterized by comprising the following steps:
s1, directly shooting by using an RGB-D depth camera to obtain RGB image information and image depth information;
s2, extracting feature points of the RGB image information and the image depth information by using an ORB corner detection and feature description algorithm;
s3, obtaining image semantic information by using an STDC (Short-Term Dense Concatenate) network, and removing dynamic feature points by using the semantic information;
and S4, positioning and navigating the remaining feature points by utilizing the ORB-SLAM3 algorithm, wherein ORB-SLAM3 is a real-time feature-point-based SLAM algorithm comprising a tracking thread, a local mapping thread and a loop detection thread.
2. The STDC semantic segmentation network-based real-time simultaneous localization and mapping method according to claim 1, wherein the step S2 utilizes an ORB algorithm to extract feature points; the method specifically comprises the following steps:
firstly, FAST key points are obtained with the FAST algorithm as follows: 1. traverse each pixel of the image to be processed and calculate its gray value $I_p$; 2. set a threshold T (T is generally taken as 30% of $I_p$); 3. select the 16 pixels lying on a circle of radius 3 centered on the pixel; 4. if 12 consecutive pixels among these 16 all have gray values greater than $I_p + T$ or all less than $I_p - T$, determine the pixel to be a FAST key point;
then, the gray centroid of the image patch is used to add an orientation to the feature point; the moments of an image patch A are defined as:

$$m_x = \sum_{(x,y)\in A} x\,I(x,y), \qquad m_y = \sum_{(x,y)\in A} y\,I(x,y)$$

in the formula: I(x, y) is the gray value at image pixel (x, y), and $m_x$, $m_y$ are the moments in the horizontal and vertical directions, respectively;

the direction of the feature point is defined as:

$$\theta = \arctan\left(m_y / m_x\right)$$

finally, the feature points are described with BRIEF descriptors; N point pairs are randomly selected around the feature point P and their gray values compared:

$$\tau(P;\,x,y) = \begin{cases} 1, & P(x) < P(y) \\ 0, & P(x) \ge P(y) \end{cases}$$

in the formula: P(x) and P(y) are the gray values at points x and y, respectively;

the BRIEF descriptor is therefore expressed as:

$$f_N(P) = \sum_{n=1}^{N} 2^{n-1}\,\tau(P;\,x_n, y_n)$$

in the formula: n indexes the n-th point pair of the feature point.
3. The STDC semantic segmentation network-based real-time simultaneous localization and mapping method according to claim 1, wherein the STDC semantic segmentation network of step S3 specifically selects the STDC2-Seg75 network, which is encoded with STDC modules, and a training loss is used to guide the low-level layers to learn spatial details.
4. The method as claimed in claim 3, wherein the STDC module obtains feature maps with different receptive fields using 4 convolutional layers, each followed by an excitation layer, and then fuses these feature maps by concatenation; the 4 convolutional layers use 2-D convolutions with stride 1 and kernel sizes of 1×1 and 3×3 respectively, and the excitation layers use the ReLU function, whose formula is:
R(x)=max(0,x) (6)
in the formula: x is the input and R (x) is the output after the ReLU unit.
5. The method according to claim 3 or 4, wherein the training loss of the STDC semantic segmentation network combines a Dice loss and a binary cross-entropy loss:

$$L_d = L_{dice}(p_d, g_d) + L_{bce}(p_d, g_d) \tag{7}$$

in the formula: $L_d$ is the detail training loss, $L_{dice}$ the Dice (binary segmentation) loss, $L_{bce}$ the binary cross-entropy loss, $p_d$ the predicted detail map, and $g_d$ the corresponding detail ground truth; the model trained with this loss segments the input image to obtain a semantic segmentation map.
6. The method for real-time simultaneous localization and mapping based on the STDC semantic segmentation network according to claim 5, wherein eliminating dynamic feature points with the semantic information specifically comprises: firstly, marking high-dynamic targets in the semantic segmentation map; then using the marked segmentation map as a mask; and finally removing the dynamic feature points by applying the mask to the feature point map.
7. The STDC semantic segmentation network-based real-time simultaneous localization and mapping method according to claim 6, wherein the step S4 of localization and navigation of the remaining feature points by using an ORB-SLAM3 algorithm specifically comprises:
tracking thread: search for and match local map feature points, minimize the reprojection error with a Bundle Adjustment (BA) algorithm, and estimate the camera pose of each frame;

local mapping thread: optimize the camera poses and the feature point cloud with a local BA algorithm;

loop detection thread: detect loops and eliminate accumulated drift error through pose graph optimization; after the pose graph is optimized, a global BA thread is launched to compute the optimal structure and motion of the whole system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210690440.9A CN115187614A (en) | 2022-06-17 | 2022-06-17 | Real-time simultaneous positioning and mapping method based on STDC semantic segmentation network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210690440.9A CN115187614A (en) | 2022-06-17 | 2022-06-17 | Real-time simultaneous positioning and mapping method based on STDC semantic segmentation network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115187614A true CN115187614A (en) | 2022-10-14 |
Family
ID=83514400
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210690440.9A Pending CN115187614A (en) | 2022-06-17 | 2022-06-17 | Real-time simultaneous positioning and mapping method based on STDC semantic segmentation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115187614A (en) |
- 2022-06-17: CN application CN202210690440.9A filed; publication CN115187614A, status Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117036408A (en) * | 2023-08-22 | 2023-11-10 | 哈尔滨理工大学 | Object SLAM method combining multi-target tracking under dynamic environment |
CN117036408B (en) * | 2023-08-22 | 2024-03-29 | 哈尔滨理工大学 | Object SLAM method combining multi-target tracking under dynamic environment |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |