CN114187447A - Semantic SLAM method based on instance segmentation - Google Patents

Semantic SLAM method based on instance segmentation

Info

Publication number
CN114187447A
Authority
CN
China
Prior art keywords
semantic
information
matching
segmentation
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN202111497853.7A
Other languages
Chinese (zh)
Inventor
牛毅
吴腾飞
马明明
李甫
石光明
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202111497853.7A
Publication of CN114187447A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks


Abstract

The application relates to the field of ORB-SLAM2 systems and deep-learning instance segmentation, and in particular provides a semantic SLAM method based on instance segmentation. The method comprises the following steps: S1, acquiring an image sequence; S2, extracting feature point information and semantic information; S3, fusing feature point information and semantic information; S4, detecting and removing dynamic objects; S5, object-level inter-frame matching; and S6, object-level loop detection. The method accurately identifies multiple targets in a scene, uses their class, bounding-box, and mask information to remove dynamic objects, and uses the remaining static objects to better assist the system in inter-frame matching and loop detection. Object-level matching is used to constrain inter-frame matching, which effectively solves the tracking loss caused by mismatched feature points in some scenes. The method uses a neural network to perform instance segmentation on the scene, effectively identifies dynamic objects under both monocular and binocular conditions, and improves the robustness of the SLAM system in dynamic scenes.

Description

Semantic SLAM method based on instance segmentation
Technical Field
The application relates to the field of ORB-SLAM2 systems and deep learning instance segmentation, in particular to a semantic SLAM method based on instance segmentation.
Background
As science and technology advance, so does the demand for convenience in daily life. With the rise of artificial intelligence, service robots ranging from household helpers to Robotaxi are becoming widespread, and helping these robots build a better and more accurate model of their surroundings is increasingly important; visual SLAM is one of the best choices for the simultaneous localization and mapping task. Systems such as ORB-SLAM2 and RGB-D SLAM-V2 are already widely used, offering fast sensor acquisition, low cost, short latency, and high accuracy. Taking ORB-SLAM2 as an example, however, the system easily loses tracking in highly dynamic scenes: its underlying visual information consists of ORB feature descriptors, so only the most basic feature points can be compared and object-level information goes unused.
With the rapid development of deep learning, many vision problems now have better and faster solutions; a neural network can easily identify the objects of interest in an image, in particular their classes and accurate boundaries. These results make it possible to combine neural networks with visual SLAM and use semantic information to help SLAM systems perceive the world better.
In recent years, researchers have proposed a variety of semantic SLAM methods. The ORB-SLAM2 system proposed by Raul Mur-Artal is a complete feature-point-based SLAM scheme for monocular, stereo, and RGB-D cameras; it adopts a bag-of-words (DBoW2) model to cluster feature points and to match them during inter-frame matching and loop detection. Since ORB feature points are 32-byte binary (0/1) vectors, matching based purely on vector distance does not agree with human intuition: a feature point on a vehicle and a feature point on the ground can lie at a very similar vector distance even though, from a human perspective, they should never match. Likewise, loop detection based on the DBoW2 model only considers whether two frames match at the word-vector level and imposes no requirement on the spatial structure of the two frames.
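To make this concrete, the following minimal Python sketch (with random descriptors standing in for real ones) shows that raw ORB matching reduces to a Hamming distance between 32-byte binary vectors, with no record of which object a point belongs to:

```python
import numpy as np

def hamming_distance(d1: np.ndarray, d2: np.ndarray) -> int:
    """Hamming distance between two 32-byte (256-bit) ORB descriptors."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

# Hypothetical descriptors standing in for a point on a vehicle and one on the ground:
desc_vehicle = np.random.randint(0, 256, size=32, dtype=np.uint8)
desc_ground  = np.random.randint(0, 256, size=32, dtype=np.uint8)

# The distance is a bare bit count; nothing records which object each point came from.
print(hamming_distance(desc_vehicle, desc_ground))
```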
MASK-SLAM, proposed by Masaya Kaneko of the University of Tokyo, is a monocular SLAM system combined with MASK-RCNN. It can effectively segment semantic-level information such as sky and vehicles, remove the feature points belonging to dynamic objects, and thereby turn SLAM in a dynamic scene into SLAM in a static scene. However, the method only suits a monocular SLAM system and removes all semantically dynamic feature points without considering the different states of individual objects, which loses relevant information and the camera pose, making it unsuitable in some scenes.
DynaSLAM, proposed by Berta Bescos, combines MASK-RCNN with ORB-SLAM2 and improves the adaptability of a SLAM system to dynamic scenes. In the monocular case it assigns prior dynamic labels to objects and removes their feature points during tracking; in the binocular case it uses multi-view geometry to compute the angle between corresponding points of the current frame and the reference frame, and points whose angle exceeds 30 degrees are judged dynamic and excluded. The influence of these points is also removed during mapping, and a static map is obtained with a background-inpainting method. However, the method does not constrain the reference points at the object level, so many mismatches remain and the motion state of an object can be misjudged.
In summary, the prior art suffers from feature point mismatching and poor robustness because it does not consider inter-frame constraints and object-level information in dynamic scenes.
Disclosure of Invention
The present invention aims to provide a semantic SLAM method based on instance segmentation to solve the problems of feature point mismatching and poor robustness caused by ignoring inter-frame constraints and object-level information in dynamic scenes in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
The application provides a semantic SLAM method based on instance segmentation, which comprises the following steps: S1, acquiring an image sequence; S2, extracting feature point information and semantic information; S3, fusing feature point information and semantic information; S4, detecting and removing dynamic objects; S5, object-level inter-frame matching; and S6, object-level loop detection.
Further, the extraction of the feature point information and the semantic information in step S2 is performed simultaneously.
Furthermore, the extraction of semantic information is completed by the network instance segmentation module.
Still further, the network instance segmentation is object-level segmentation.
Further, the network instance segmentation adopts a MASK-RCNN network.
Further, the MASK-RCNN network is trained in step S2.
Further, step S3 performs fine-grained feature point classification on the feature points inside each object in the image.
Further, fine-grained feature point classification is accomplished by a KD-TREE data structure.
Further, step S5 constrains inter-frame matching using object-level matching.
Further, step S5 uses the KM algorithm to find the best match.
Compared with the prior art, the invention has the following beneficial effects:
firstly, the method performs instance segmentation on the scene with a neural network, effectively identifies dynamic objects under both monocular and binocular conditions, and improves the robustness of the SLAM system in dynamic scenes;
secondly, to remedy the defect that the DBoW2 model only attends to vector-level information and cannot distinguish feature point matches at a higher level, the invention proposes constraining inter-frame matching with object-level matching, which effectively solves the tracking loss caused by wrong feature point matches in some scenes;
thirdly, the invention proposes dividing the image into a grid, associating the grid with semantic information, and establishing a three-dimensional Object_KeyFrame_DataBase structure; this effectively uses the position information of objects rather than category information alone and efficiently retrieves the associated keyframes during loop detection, making loop detection more efficient;
fourthly, by using the libtorch library, network training and deployment are separated and the network model is decoupled from the SLAM system; the model is easy to modify, modification does not interfere with direct use of the SLAM system, and the method therefore has high applicability.
Drawings
FIG. 1 is a schematic diagram of a semantic SLAM method based on instance segmentation according to the present invention;
FIG. 2 is a schematic diagram of the image grid division of step S32 in a semantic SLAM method based on instance segmentation according to the present invention;
FIG. 3 is a schematic diagram of the three-dimensional map point transformation under a binocular system in a semantic SLAM method based on instance segmentation according to the present invention;
FIG. 4 is a diagram of keyframe association based on the bag-of-words model in the prior art;
FIG. 5 is a schematic diagram of the spatial structure of the three-dimensional keyframe database (KeyFrame DataBase) in a semantic SLAM method based on instance segmentation according to the present invention;
FIG. 6 is a flowchart of a semantic SLAM method based on instance segmentation according to the present invention.
Detailed Description
In order to make the implementation of the present invention clearer, the following detailed description is made with reference to the accompanying drawings.
Example 1:
the invention provides a semantic SLAM method based on instance segmentation, which comprises the following steps of:
s1, acquiring an image sequence;
the method comprises the following steps that images are captured by a camera to form an image sequence, wherein the image sequence can be a plurality of images under a static scene or a plurality of images under a dynamic scene, and the images under the dynamic scene can be a plurality of images with larger difference or a plurality of images with smaller difference; the embodiment of the invention uses an RGB camera to acquire an image sequence in a dynamic scene. The images captured by the cameras and the timestamp information issued using the ROS system are transferred into the SLAM system. Since ORB feature point extraction needs to be performed on a gray scale image, in the SLAM system, an image is converted into a gray scale image, and then an ORB feature point extraction thread and a network instance segmentation thread are simultaneously entered. The difference between the input images of the ORB feature point extraction thread and the network instance segmentation thread is only that the input image of the ORB feature point extraction thread needs to be converted into a gray scale image, and the input image of the network instance segmentation thread may or may not be converted into a gray scale image.
S2, extracting feature point information and semantic information;
the extraction of the feature point information is completed through an ORB feature point extraction module, and the extraction of the semantic information is completed through a network instance segmentation module, wherein the network instance is segmented into object-level segmentation, namely different objects of different same types can be distinguished. The extraction of feature point information and semantic information is performed simultaneously, that is, the following step S21 and step S22 are performed simultaneously.
S21, extracting the feature point information of the image by using an ORB feature point extraction module;
the ORB feature points consist of key points and feature descriptors. Its key point is also called "Oriented FAST", which is a modified form of the FAST corner. The feature descriptors are also called BRIFE. The FAST corner is known as FAST, and only one pixel needs to be judged whether the pixel value difference between one pixel and the surrounding pixels is large, so that a plurality of feature points can be extracted quickly, and non-maximum suppression should be performed later to avoid the problem of over-concentration of the feature points. Because the FAST corner points do not have scale invariance and rotation invariance, the ORB feature point extraction module adopted by the invention introduces a feature pyramid and a gray centroid method to solve the problems, and has scale invariance and rotation invariance, so that the changes of the positions, scales and directions of the feature points in the image brought in the playground scene can be effectively coped with, and the method is closer to the actual scene. For the feature descriptor, the BRIFE descriptor is a binary descriptor, and its description vector is composed of 01, which encodes two random pixels near the key point, if the former is large, it takes 1, otherwise it takes 0. Because of the use of binary representation, it is very fast to compute and store.
Thanks to its rotation and scale invariance and its fast extraction, the ORB feature point meets practical requirements well and is widely used in SLAM systems.
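As a minimal sketch of this extraction step (using OpenCV's stock ORB implementation rather than ORB-SLAM2's own extractor; the file name is hypothetical):

```python
import cv2

# Grayscale input, as required by the extraction thread described in S1/S21.
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# nlevels/scaleFactor build the image pyramid (scale invariance); the keypoint
# angle computed via the intensity-centroid method gives rotation invariance.
orb = cv2.ORB_create(nfeatures=1000, scaleFactor=1.2, nlevels=8)
keypoints, descriptors = orb.detectAndCompute(img, None)
# descriptors: N x 32 uint8 array, i.e., 256 binary BRIEF comparisons per keypoint.
```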
And S22, extracting object-level semantic information of the image by using the network instance segmentation module.
Network instance segmentation of the image is realized through a MASK-RCNN network. Class information, bounding-box information, and MASK information are obtained from the MASK-RCNN network, and the feature points are screened and classified with this information to extract object-level semantic information. The MASK-RCNN network is constructed as follows:
s221, constructing a MASK-RCNN network;
the concrete construction steps are as follows:
the method comprises the following steps: the method mainly aims to extract feature graphs of different scales for RPN networks and subsequent tasks.
Step two: construct the RPN (Region Proposal Network), which receives the multi-scale feature maps output by the FPN, assigns anchor boxes of different sizes to each pixel position, and distinguishes foreground from background to judge whether a target is present. The proposed target boxes are then regressed and refined through ROI Align to obtain more accurate target box positions.
Step three: feed the feature map inside each target box extracted by the RPN into the target detection branch and the mask prediction branch to complete the tasks of classification, localization, and mask acquisition.
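For illustration, torchvision ships a pre-built network with exactly this backbone + FPN + RPN + two-branch structure; the sketch below (argument names follow recent torchvision versions and may differ in older ones) shows the three outputs the method relies on:

```python
import torch
import torchvision

# Backbone + FPN + RPN + detection/mask heads, matching steps one to three.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None)
model.eval()

image = torch.rand(3, 480, 640)  # dummy RGB frame, values in [0, 1]
with torch.no_grad():
    out = model([image])[0]

# out["labels"] -> class information, out["boxes"] -> bounding boxes,
# out["masks"]  -> per-instance masks, as used in steps S22 and S3.
```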
S222, training the built MASK-RCNN network;
Training MASK-RCNN with different data sets for different tasks improves its classification accuracy on each task and yields more accurate object-level semantic information. The MASK-RCNN network is trained with PyTorch on a local data set comprising the KITTI data set and the TUM data set. For outdoor scenes the KITTI data set is used: a collection of data sets suitable for various computer vision tasks, established by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago, containing targets such as vehicles and road signs. For indoor scenes the TUM data set is used: a set of continuous indoor images collected with a depth camera at the Technical University of Munich, Germany, containing targets such as computers, tables, and chairs.
Whether MASK-RCNN training is finished is judged through the loss function. The MASK-RCNN network has three branches with three outputs, so the loss function consists of three parts:

L = L_{cls} + L_{box} + L_{mask}

where L_{cls} is the classification loss:

L_{cls}(p_i, p_i^*) = -\log( p_i p_i^* + (1 - p_i)(1 - p_i^*) )

with p_i the predicted probability of the current class and p_i^* the ground-truth label;

L_{box} is the bounding-box loss:

L_{box} = smooth_{L1}(t_i - t_i^*)

with t_i the predicted bounding-box position and t_i^* the ground truth;

and L_{mask} is the mask loss, a binary cross-entropy:

L_{mask} = -( t \log(o) + (1 - t) \log(1 - o) )

with t the ground truth and o the predicted value.

Positive samples are bounding boxes whose intersection over union (IOU) with the ground truth exceeds 0.6; negative samples are those with IOU below 0.6. During training the ratio of positive to negative samples is set to 1:3, and training ends when the loss function is close to convergence.
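A small plain-Python sketch of the sample-labeling rule just described, assuming (x1, y1, x2, y2) box coordinates:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_sample(anchor, gt_box):
    """Positive if IOU > 0.6, negative otherwise, as in the training setup above."""
    return "positive" if iou(anchor, gt_box) > 0.6 else "negative"
```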
S223: call the MASK-RCNN network trained in step S222 within the SLAM system. The trained MASK-RCNN network is saved as a script file and loaded into the SLAM system through libtorch.
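One plausible realization of this hand-off: the export uses standard TorchScript on the PyTorch side, while the file name and the C++ loading call shown in the comment are assumptions:

```python
import torch
import torchvision

# `trained_model` stands in for the network trained in S222.
trained_model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None)
trained_model.eval()

scripted = torch.jit.script(trained_model)  # torchvision detection models are scriptable
scripted.save("mask_rcnn.pt")               # the script file referred to in S223

# On the C++ side, libtorch would load it roughly as:
#   torch::jit::script::Module net = torch::jit::load("mask_rcnn.pt");
```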
S3, fusing feature point information and semantic information; the method comprises the following specific steps:
S31: traverse all extracted feature points; for each point, obtain the target information at the corresponding position of the corresponding MASK through its two-dimensional coordinates p(x, y), and add the point's Index to the data structure of the corresponding object, each object having its own data structure.
S32: perform fine-grained classification of the feature points inside each object using a KD-TREE data structure, i.e., hierarchically classify the feature point description vectors. With fine-grained vector information for the descriptors inside each object, many unnecessary comparisons can be skipped during feature point matching, accelerating the matching process. The current image is also divided into a 28 x 32 grid, as shown in fig. 2. Each grid cell is assigned a semantic category according to the semantic information of the corresponding MASK, and an object-level keyframe database is established for loop detection; the specific association scheme is introduced in S61.
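A minimal sketch of both S31 and S32 under stated assumptions (boolean per-instance masks, image dimensions divisible by the grid size; all helper names are hypothetical):

```python
import numpy as np

def fuse_points_with_masks(keypoints, masks):
    """S31: assign each feature point Index to the object whose mask covers it.
    keypoints: list of (x, y) pixel coordinates; masks: (num_objects, H, W) bool."""
    objects = {obj_id: [] for obj_id in range(masks.shape[0])}
    for idx, (x, y) in enumerate(keypoints):
        for obj_id in range(masks.shape[0]):
            if masks[obj_id, int(y), int(x)]:
                objects[obj_id].append(idx)
                break
    return objects

def grid_semantics(masks, labels, rows=28, cols=32):
    """S32: label each cell of the 28 x 32 grid with its dominant semantic class."""
    cell_h, cell_w = masks.shape[1] // rows, masks.shape[2] // cols
    grid = np.zeros((rows, cols), dtype=np.int32)  # 0 = background
    for r in range(rows):
        for c in range(cols):
            cell = masks[:, r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w]
            counts = cell.reshape(masks.shape[0], -1).sum(axis=1)
            if counts.max() > 0:
                grid[r, c] = labels[int(counts.argmax())]
    return grid
```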
S4, detecting and removing dynamic objects;
the invention relates to a monocular and binocular combined system. Wherein, when one image is input by a single eye, the initialized information is less; two images are input at a time through the two eyes, and the initialized information is more. The method can effectively identify the dynamic object under the monocular and binocular conditions, and improves the robustness of the SLAM system in a dynamic scene.
S41, monocular system;
Due to the scale ambiguity of a monocular system, the distance of a feature point cannot be recovered from a single frame, so prior dynamic objects are screened out by their prior semantic category: all objects of a prior dynamic class are removed. For example, indoors, every object classified as a person is removed; outdoors, every object classified as a car is removed. This can degrade the accuracy of the SLAM system in some scenarios, such as parking lots, where most cars are in fact static.
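A short sketch of this prior-class filter (the class names and container layout are assumptions):

```python
PRIOR_DYNAMIC = {"person"}  # indoor setting; outdoors one would add "car"

def remove_prior_dynamic(objects, class_names):
    """Drop every object whose semantic class is a prior dynamic category.
    objects: {object_id: [feature point indices]}, class_names: {object_id: str}."""
    return {i: pts for i, pts in objects.items() if class_names[i] not in PRIOR_DYNAMIC}
```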
S42, binocular system.
S421: obtain the correspondence between feature points of the left and right views by line scanning. The feature points of each row of the right image are collected, and matching is performed by stereo matching: for the feature point pi on row i of the left image, row i of the right image is searched for the best-matching point qi, where i denotes the row index and p, q denote feature points of the left and right images respectively. A further search centered on qi with radius r (a search radius of 10 pixels) is then performed, and sub-pixel interpolation is applied to refine the result, yielding a more accurate matching point.
S422: triangulate the corresponding feature points. After the corresponding feature points of the left and right views are obtained, the two rays from the optical centers through the feature points converge at a single point in space, which yields the distance, i.e., the depth information of the feature point. Through triangulation, the two-dimensional points of the left and right views are converted into points in three-dimensional space, called three-dimensional map points.
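For a rectified stereo pair this reduces to the standard disparity relation z = f * b / d; a minimal sketch with the camera intrinsics as assumed parameters:

```python
def stereo_depth(u_left, u_right, f, baseline):
    """Depth from disparity for a rectified stereo pair: z = f * b / d."""
    d = u_left - u_right
    return f * baseline / d if d > 0 else None

def backproject(u, v, z, fx, fy, cx, cy):
    """Lift a pixel with known depth to a three-dimensional map point
    in camera coordinates."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)
```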
S423: while the SLAM system runs, the most recent frame preceding the current frame serves as the reference frame; every time a new frame is input, its transformation relative to the reference frame is computed. The transformation matrix between the current frame and the reference frame is calculated from background feature points, the three-dimensional map points of the reference frame are transformed accordingly, and the error between each transformed map point and its counterpart in the current frame is compared with a threshold; if more than 90% of an object's three-dimensional map points exceed the threshold, the object is considered a truly moving object. Specifically, objects that are absolutely static are identified via semantic information, and their three-dimensional map points are used to compute an initial pose transformation matrix Tinit, which preliminarily estimates the transformation from the reference frame to the current frame. The three-dimensional map points shared by the current frame and the reference frame are transformed through Tinit; writing Plast for a feature point of the reference frame and Pcurr for the corresponding feature point of the current frame, a point is flagged as dynamic if:

|Plast - Pcurr| > th
As shown in fig. 3, C1 and C2 denote the left and right cameras, the dark color the original camera position, and the light color the current position; the dots denote map points in space. When the camera moves, a map point whose change obeys the camera's transformation matrix is considered static; if the error exceeds the threshold th, which is set from prior information, the feature point is considered dynamic. If 90% or more of an object's feature points are dynamic points, the object is considered a dynamic object and all of its feature points are removed.
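A compact sketch of this check (numpy arrays of matched object points and a 4x4 homogeneous Tinit are assumed; the 90% ratio follows the text):

```python
import numpy as np

def is_dynamic_object(P_last, P_curr, T_init, th, ratio=0.9):
    """P_last, P_curr: (N, 3) matched map points of one object in the reference
    and current frames; T_init: 4x4 initial pose transform from S423."""
    P_h = np.hstack([P_last, np.ones((len(P_last), 1))])  # homogeneous coords
    P_pred = (T_init @ P_h.T).T[:, :3]                    # where static points should land
    errors = np.linalg.norm(P_pred - P_curr, axis=1)      # |Plast - Pcurr| per point
    return np.mean(errors > th) >= ratio                  # True -> remove the object's points
```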
S5, object-level inter-frame matching;
S51: static objects obey a common transformation matrix and can therefore be solved, whereas dynamic objects do not and cannot. After the dynamic objects of the current frame are removed, all remaining static objects are available for matching. The KM algorithm is then applied, using the IOU as the weight for whether two objects in the two frames are the same object, and the optimal matching that maximizes the total weight is found, giving high matching accuracy. The method takes inter-frame constraints in dynamic scenes into account, so feature point matching is more accurate and robustness improves.
Specifically, first compute the weight information W for matching two objects across the two frames, i.e., the intersection-over-union of their bounding boxes plus a certain offset; this gives the matching better translation invariance, so it is more accurate, less prone to mismatches, and more robust. Then find the optimal set of matches with the KM algorithm: let V1 be the Index information of all objects in the reference frame, V2 the Index information of all objects in the current frame, and for every edge <i, j> ∈ G let W(i, j) be the matching weight from the i-th vertex of V1 to the j-th vertex of V2. The graph is stored as an adjacency matrix, the vertex labels are initialized with a greedy algorithm, and a perfect matching is sought with the Hungarian algorithm; if none is found, the vertex labels are enlarged and the search repeats until a perfect matching, i.e., a one-to-one correspondence, is reached, at which point every vertex of the left subset in the KM algorithm has a correspondence. This yields the pairwise matching of objects between the two frames.
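As a hedged sketch of the assignment step, scipy's Hungarian solver is used below in place of the hand-rolled KM algorithm with vertex labels described above (it finds the same maximum-weight matching with different bookkeeping); the offset value is an assumption:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) bounding boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_objects(boxes_ref, boxes_curr, offset=0.05):
    """Maximum-total-weight object matching between two frames."""
    W = np.array([[box_iou(a, b) + offset for b in boxes_curr] for a in boxes_ref])
    rows, cols = linear_sum_assignment(-W)  # negate: the solver minimizes
    return [(i, j) for i, j in zip(rows, cols) if W[i, j] > offset]
```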
S52: match the feature points inside each pair of matched objects at fine granularity using the KD-TREE data structure. The word vectors of the current frame and the reference frame are traversed simultaneously, and feature points are matched when their word IDs are the same. The corresponding feature point is searched one by one for each feature point of the current word of the current frame; a match must satisfy Dist1 < th and Dist1 < 0.8 * Dist2, where Dist1 and Dist2 are the smallest and second-smallest matching distances.
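The acceptance rule is a classic ratio test; as a one-function sketch, with the absolute threshold value assumed:

```python
def accept_match(dist1, dist2, th=50):
    """Ratio test from S52: the best distance must beat the absolute threshold
    and clearly beat the second-best distance. The value of `th` is an assumption."""
    return dist1 < th and dist1 < 0.8 * dist2
```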
S53: after matching succeeds, optimize the pose information of the current frame through the three-dimensional map points associated with the corresponding feature points of the previous frame. The velocity vector from the motion model is used as the initial value, and the pose of the current frame is optimized by graph optimization with the three-dimensional map points held fixed.
S54: after pose optimization succeeds, project the three-dimensional map points associated with the neighboring frames of the current frame and compute the vector distance to the corresponding feature points of the current frame; if the distance is below th, a new matched point pair is found. If the final total number of matched points exceeds 30, tracking is considered successful; otherwise tracking is lost and relocalization is performed.
S6, object-level loop detection.
S61: a keyframe is a frame for which mapping and tracking both succeeded and which carries enough new information. Whenever a keyframe is obtained, the loop detection thread is entered and the current keyframe's information is added to the keyframe database KeyFrame DataBase. The existing keyframe association method based on the bag-of-words model is shown in fig. 4: A, B, C, D are four different categories, and the right side is the current keyframe, which contains all four. The keyframe database likewise contains the four categories, each associated with certain keyframes, and similar keyframes are found through these associations during loop matching. However, this association ignores position information, so a certain amount of mismatching occurs. The improved method is shown in fig. 5: the dark, gray, and light colors represent 3 categories and white represents the background; the left side shows the observation of the current keyframe, divided into a grid whose cell colors mark the categories. The three-dimensional data structure on the right is the keyframe database KeyFrame DataBase: its length and width match the grid division of the image, and its depth equals the number of categories, the background not being counted; for example, with 3 categories a data structure of depth 3 is established. In this way different categories at different positions are associated with different keyframes, and the semantic information of all keyframes at each position is stored in the three-dimensional data structure.
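A minimal sketch of this three-dimensional database as a (grid row, grid column, class) index in pure Python (class and variable names are hypothetical; background is encoded as 0 as in the text):

```python
from collections import defaultdict

class KeyFrameDataBase:
    """Three-dimensional index (grid row, grid col, class) -> keyframe ids."""

    def __init__(self):
        self.index = defaultdict(set)

    def add(self, kf_id, grid):
        """grid: 28 x 32 array of semantic classes per cell, 0 = background."""
        rows, cols = grid.shape
        for r in range(rows):
            for c in range(cols):
                if grid[r, c] != 0:
                    self.index[(r, c, int(grid[r, c]))].add(kf_id)

    def common_grids(self, grid):
        """Count cells where a stored keyframe saw the same class at the
        same position as the query frame (CommonGrids in S62)."""
        counts = defaultdict(int)
        rows, cols = grid.shape
        for r in range(rows):
            for c in range(cols):
                if grid[r, c] != 0:
                    for kf in self.index[(r, c, int(grid[r, c]))]:
                        counts[kf] += 1
        return counts
```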
S62: count the maximum similarity over all associated keyframes, scale it by a coefficient to obtain a threshold, and let keyframes above the threshold enter the next screening stage. Specifically, the keyframes that have the same category as the current frame at the corresponding grid position are counted, CommonGrids being the number of grid cells matched between two frames. Among all qualifying keyframes, the maximum grid count MaxCommonGrids is determined, and 0.8 * MaxCommonGrids is taken as the threshold; keyframes below it do not enter the next step and are removed, so that the keyframes closest to the current keyframe in both position and category are selected.
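Continuing the KeyFrameDataBase sketch above, the screening rule of S62 might read:

```python
# db and current_grid come from the sketch above.
counts = db.common_grids(current_grid)
if counts:
    max_common = max(counts.values())  # MaxCommonGrids
    candidates = [kf for kf, n in counts.items() if n >= 0.8 * max_common]
```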
S63: perform non-maximum suppression and select the best-matching frame among keyframes that are related to each other. Specifically, the covisibility information of all current candidate frames is computed; keyframes with a covisibility relationship are treated as one group, and within each group only the keyframe with the largest CommonGrids is kept.
S64: perform continuity detection. If the keyframes obtained by loop detection on three consecutive current frames also share a covisibility relationship, the current loop match is considered successful and the Sim3 matrix is computed. That is, once loop matching succeeds at the image level, the transformation matrix between the two keyframes is checked for consistency with the associated three-dimensional map points, and if consistent, the map is corrected.
Fig. 6 shows an overall flow chart of the method of the invention.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A semantic SLAM method based on instance segmentation, the method comprising the steps of: S1, acquiring an image sequence; S2, extracting feature point information and semantic information; S3, fusing feature point information and semantic information; S4, detecting and removing dynamic objects; S5, object-level inter-frame matching; and S6, object-level loop detection.
2. The semantic SLAM method based on instance segmentation of claim 1, wherein the extraction of the feature point information and the semantic information in step S2 is performed simultaneously.
3. The semantic SLAM method based on instance segmentation of claim 2, wherein the extraction of semantic information is completed by a network instance segmentation module.
4. The semantic SLAM method based on instance segmentation of claim 3, wherein the network instance segmentation is object-level segmentation.
5. The semantic SLAM method based on instance segmentation of claim 4, wherein the network instance segmentation adopts a MASK-RCNN network.
6. The semantic SLAM method based on instance segmentation of claim 5, wherein the MASK-RCNN network is trained in step S2.
7. The semantic SLAM method based on instance segmentation of claim 6, wherein step S3 performs fine-grained feature point classification on the feature points inside each object in the image.
8. The semantic SLAM method based on instance segmentation of claim 7, wherein the fine-grained feature point classification is completed by a KD-TREE data structure.
9. The semantic SLAM method based on instance segmentation of claim 8, wherein step S5 constrains inter-frame matching using object-level matching.
10. The semantic SLAM method based on instance segmentation of claim 9, wherein step S5 employs the KM algorithm to find the best match.
CN202111497853.7A (priority and filing date 2021-12-09): Semantic SLAM method based on instance segmentation. Status: Pending. Publication: CN114187447A (en).

Priority Applications (1)

Application Number: CN202111497853.7A; Priority Date: 2021-12-09; Filing Date: 2021-12-09; Publication: CN114187447A (en); Title: Semantic SLAM method based on instance segmentation

Applications Claiming Priority (1)

Application Number: CN202111497853.7A; Priority Date: 2021-12-09; Filing Date: 2021-12-09; Publication: CN114187447A (en); Title: Semantic SLAM method based on instance segmentation

Publications (1)

Publication Number: CN114187447A; Publication Date: 2022-03-15

Family

ID=80603995

Family Applications (1)

Application Number: CN202111497853.7A; Status: Pending; Publication: CN114187447A (en); Title: Semantic SLAM method based on instance segmentation

Country Status (1)

Country: CN; Publication: CN114187447A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758148A (en) * 2023-05-08 2023-09-15 苏州科技大学 SLAM method and system in dynamic environment
CN116592897A (en) * 2023-07-17 2023-08-15 河海大学 Improved ORB-SLAM2 positioning method based on pose uncertainty
CN116592897B (en) * 2023-07-17 2023-09-22 河海大学 Improved ORB-SLAM2 positioning method based on pose uncertainty


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination