CN116468786B - Semantic SLAM method based on point-line combination and oriented to dynamic environment - Google Patents
- Publication number: CN116468786B (application CN202211619407.3A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
- G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, corners; connectivity analysis
- G06V10/443: Local feature extraction by matching or filtering
- G06V10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks (CNN)
- G06V10/52: Scale-space analysis, e.g. wavelet analysis
- G06V10/54: Extraction of image or video features relating to texture
- G06V10/757: Matching configurations of points or features
- Y02T10/40: Engine management systems
Abstract
The invention provides a semantic SLAM method based on point-line combination that is oriented to dynamic environments and builds on ORB-SLAM3. The method extracts point and line features and uses them for accurate, robust matching and relocalization in scenes lacking texture and under illumination changes to estimate the camera pose, reducing localization and relocalization errors; the algorithm addresses the failure of feature-point detection and the difficulty of localization in weakly textured regions and scenes with illumination changes.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a semantic SLAM method based on point-line combination and oriented to a dynamic environment.
Background
Simultaneous Localization and Mapping (SLAM) refers to a robot in an unknown environment collecting information about its surroundings with its on-board sensors, estimating its own position by an algorithm, and building a map of the environment. Visual SLAM mainly uses cameras to acquire data, including monocular, binocular, and RGB-D cameras. Because camera sensors are cost-effective, small, low-power, and capture rich environmental information, visual SLAM has become a popular research field in recent years.
Conventional visual SLAM algorithms achieve good feature matching in static scenes, but mismatches occur in dynamic scenes, producing large errors in the localization and mapping of the SLAM system. To address the reduced localization accuracy and robustness of a SLAM system when moving objects are present in the application scene, a semantic SLAM method and system based on feature points and feature lines are provided.
Existing semantic SLAM techniques mainly target scenes containing dynamic objects. They typically either delete all pixels on a priori dynamic objects and use the remaining pixels for feature extraction and subsequent localization, or delete all dynamic feature points and use only static feature points for matching and back-end processing. These approaches can improve camera localization accuracy in richly textured dynamic scenes; however, in low-texture scenes with strong illumination and dynamic objects, relying only on feature points and semantic information makes it difficult to obtain enough data, which easily causes tracking loss in the SLAM system and reduces localization accuracy.
Currently, vision-based SLAM research has made great progress, with algorithms such as ORB-SLAM2 (Oriented FAST and Rotated BRIEF SLAM) and LSD-SLAM (Large-Scale Direct monocular SLAM). However, these algorithms generally rest on the strong assumption of a static working environment with many features and no obvious illumination changes, which strictly limits the applicable environments. This assumption affects the applicability of visual SLAM systems in real scenes: when the environment is a dynamic, weakly textured area with illumination changes, feature points are scene-sensitive and hard to detect, the accuracy and robustness of camera pose estimation degrade, vision-based localization accumulates errors, and the three-dimensional reconstruction result deviates substantially.
The camera is typically in motion while a mobile robot localizes and maps with it, which makes classical motion-segmentation methods such as background subtraction unusable in visual SLAM. Early SLAM systems mostly employed data-optimization methods to reduce the influence of dynamic objects. A Random Sample Consensus (RANSAC) algorithm roughly estimates the fundamental matrix between two frames; combining semantic information with motion-consistency detection results, a two-stage semantic knowledge base is built, and all feature points inside dynamic contours are deleted as noise or outliers. Inter-frame feature-point matches on dynamic objects are eliminated with RANSAC, reducing the influence of dynamic objects on the SLAM system to some extent. These methods all implicitly assume that most objects in the image are static, and they fail when the data generated by dynamic objects exceeds a certain threshold.
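The RANSAC-style rejection of dynamic matches described above can be pictured with a small numpy-only sketch (an illustration, not the patent's implementation): a minimal one-point model, the dominant inter-frame translation, is hypothesized repeatedly, and matches whose motion disagrees with the consensus, such as points on a moving object, are flagged as outliers.

```python
import numpy as np

def ransac_translation(pts_a, pts_b, iters=200, thresh=2.0, seed=0):
    """Estimate the dominant 2D translation between two matched point sets
    and flag matches that disagree with it (e.g. points on moving objects).
    Returns (translation, inlier_mask)."""
    rng = np.random.default_rng(seed)
    diffs = pts_b - pts_a                      # per-match motion vectors
    best_inliers = np.zeros(len(pts_a), dtype=bool)
    for _ in range(iters):
        t = diffs[rng.integers(len(diffs))]    # 1-point model: a translation
        inliers = np.linalg.norm(diffs - t, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    best_t = diffs[best_inliers].mean(axis=0)  # refine on the consensus set
    return best_t, best_inliers

# Static background moves by (5, 0); two "dynamic" matches move differently.
static = np.array([[0., 0.], [10., 3.], [4., 8.], [7., 1.], [2., 5.]])
pts_a = np.vstack([static, [[1., 1.], [6., 6.]]])
pts_b = np.vstack([static + [5., 0.], [[9., 4.], [0., 2.]]])
t, mask = ransac_translation(pts_a, pts_b)
```

A real SLAM front end hypothesizes a fundamental matrix from eight-point samples instead of a translation, but the consensus-and-reject logic is the same.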
In the prior art, research on visual localization and robot navigation in feature-rich scenes such as cities and indoor environments has made some progress, but much of it remains insufficient. For low-texture scenes with geometric features and illumination variation, visual localization still faces the following problems:
(1) In feature detection, existing methods are affected by occlusion and missing parts of objects; complete geometric features are hard to detect from the image, making the camera pose difficult to compute;
(2) In low-texture images with few textures and few feature points, existing methods struggle to extract image features or produce feature-matching errors, causing SLAM tracking and relocalization to fail and camera pose estimation to degrade;
(3) In areas with obvious illumination changes, feature-point detection is sensitive; feature points are hard to detect or go unmatched, making the camera pose inaccurate.
The method combines Mask R-CNN with multi-view geometry to achieve instance segmentation and rejection of dynamic targets, identifies dynamic feature points, eliminates the interference of dynamic targets with feature matching, and removes their influence on the SLAM system.
Disclosure of Invention
The invention builds on ORB-SLAM3 and provides a semantic SLAM method based on point-line features. Compared with point features, lines provide more geometric structure information about the environment; jointly optimizing the camera pose over points and lines improves localization accuracy and robustness. The method extracts point and line features and uses them for accurate, robust matching and relocalization in scenes lacking texture and under illumination changes to estimate the camera pose, reducing localization and relocalization errors; the algorithm addresses the failure of feature-point detection and the difficulty of localization in weakly textured regions and scenes with illumination changes.
The invention is realized by the following technical scheme: a semantic SLAM method based on point-line combination and oriented to dynamic environments, which specifically comprises the following steps:
step S1: acquiring an image stream of the scene, feeding it frame by frame into a CNN network, segmenting objects with a priori dynamic properties pixel by pixel, and segmenting out the dynamic objects in the scene to obtain key frame images; the static scene occluded by dynamic targets is completed using information from previous frames;
step S2: extracting feature points and feature lines from the key frame image obtained in step S1, and constructing a local map related to the current frame image, including key frame images sharing covisible points with the current frame image and their adjacent frame images; searching those frames for feature points and line segments matching the current frame image; then performing a dynamic consistency check on the a priori dynamic objects, removing the feature points and feature lines on dynamic objects, retaining those on static objects, and matching with the remaining static feature points and static lines;
step S3: matching the characteristic points and the characteristic lines in the step S2, filtering at the same time, removing the points and the lines which are incorrectly matched to obtain correct matching point pairs and line pairs, and obtaining the initial camera pose by using the matching point pairs;
step S4: calculating the camera pose of the current frame through the matching point pair and the line pair obtained in the step S3, and obtaining accurate camera pose estimation by minimizing the re-projection error of the point pair and the line pair;
step S5: constructing a local map about a scene by utilizing a key frame image, carrying out instance segmentation on each frame image, merging characteristic points and characteristic lines in each instance into corresponding instances, positioning a camera pose by utilizing the characteristic points and the characteristic lines, and calculating point clouds of objects and the scene to obtain a sparse point cloud map;
step S6: performing pose optimization with loop detection and correcting drift errors to obtain a more accurate camera pose estimate.
As a preferred scheme, step S1 further extracts the feature points and feature lines of the static region of the key frame image, which specifically includes the following steps: ORB feature points are extracted from the static region of the image, and ORB descriptors are computed at the same time to obtain the feature points and descriptors of the static region; line features are then extracted from the image with dynamic objects removed. Line-feature extraction adopts a Transformer network structure, and the line features of the static region are obtained by fusing feature information at different scales through a series of up-sampling and down-sampling operations.
Further, the extracted line features use the horizontal distance d_x and the vertical distance d_y to generate a vector v = (d_x, d_y) that predicts the positions of the two endpoints of a single line segment, yielding the line feature, where (x_l, y_l) and (x_r, y_r) denote the coordinates of the left and right endpoints of the segment, (x_m, y_m) is the midpoint coordinate of the segment, and v represents the vector relation between the right-endpoint coordinates and the midpoint coordinates; in the present method d_x and d_y are expressed as: d_x = x_r − x_m, d_y = y_r − y_m.
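Under this midpoint-plus-offset parameterization (the symbols follow the description above; the helper name is made up for illustration), recovering both endpoints of a segment is a one-line computation:

```python
import numpy as np

def endpoints_from_midpoint(mid, d):
    """Recover both endpoints of a segment from its midpoint (x_m, y_m)
    and the offset vector d = (d_x, d_y) to the right endpoint."""
    mid, d = np.asarray(mid, float), np.asarray(d, float)
    right = mid + d          # (x_r, y_r) = (x_m + d_x, y_m + d_y)
    left = mid - d           # the midpoint bisects the segment, by symmetry
    return left, right

left, right = endpoints_from_midpoint((4.0, 3.0), (2.0, 1.0))
# left is (2.0, 2.0) and right is (6.0, 4.0)
```

Because the offsets are learned directly, a small prediction error shifts an endpoint by a bounded amount, unlike a length-and-angle encoding where a small angle error moves the endpoint of a long segment far from its true position.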
As a preferred solution, the matching of the feature points and feature lines in step S3 specifically includes the following steps: for feature-point matching, ORB descriptors are generated and a fast nearest-neighbor search finds, in the current frame, the feature point with the closest descriptor distance as the matching point; mismatched pairs are then rejected: when the matching descriptor distance is larger than a threshold γ, or the ratio of the best match distance to the second-best match distance is close to 1 (so that the second match is nearly as good as the first), the matching pair is considered prone to mismatch and is rejected. For feature-line matching, 2D-2D matching line pairs are obtained through geometric constraints, outliers are rejected and the pairs are mapped directly into 3D space, and accurate 2D-3D line matching pairs are then obtained by minimizing the reprojection error.
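The descriptor-distance and ratio checks can be sketched for binary (ORB-style) descriptors in plain numpy (a hedged illustration; real systems run a brute-force or FLANN matcher over 256-bit descriptors):

```python
import numpy as np

def ratio_test_match(desc_a, desc_b, ratio=0.8):
    """Match binary descriptors by Hamming distance with a ratio test:
    a match is kept only when the best distance is clearly smaller than
    the second-best (best < ratio * second_best). Returns (i, j) pairs."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.count_nonzero(desc_b != d, axis=1)   # Hamming distances
        j, k = np.argsort(dists)[:2]                    # best, second best
        if dists[j] < ratio * dists[k]:
            matches.append((int(i), int(j)))
    return matches

rng = np.random.default_rng(1)
desc_b = rng.integers(0, 2, size=(6, 32), dtype=np.uint8)   # "map" descriptors
desc_a = desc_b[[2, 4]].copy()                              # two queries
desc_a[0, 0] ^= 1          # flip one bit: still an unambiguous match
matches = ratio_test_match(desc_a, desc_b)
```

Because ORB descriptors are binary, the distance is the Hamming distance, and the ratio test discards a match whenever some second candidate is almost as close as the best one.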
As a preferred solution, optimizing the camera pose in step S4 by minimizing the reprojection errors of the point pairs and line pairs is specifically implemented as follows:
The pose is jointly optimized over points and lines, and the minimized reprojection error is defined as:

E = Σ_{i=1}^{M} λ_p ‖π(P_i) − p_i‖² + Σ_{j=1}^{N} λ_l e_{l,j}

where M and N denote the numbers of 2D-3D matching point pairs and line pairs. The line error e_{l,j} combines the distance between the projected line f(L_j) and the observed 2D line l_j with the angle error e_θ defined by the two planes π_1 and π_2; the function f(·) projects a 3D line L onto the 2D image plane, the function π(·) projects a 3D point P onto the 2D image plane where it is observed at p_i, and λ_p and λ_l are given weights. The camera pose is optimized by minimizing this reprojection error.
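A schematic numpy version of such a weighted point-line cost may help; the projection model is a plain pinhole, and the line term is simplified to an endpoint-distance error rather than the patent's exact distance-plus-angle formulation (all names here are illustrative):

```python
import numpy as np

def project(K, R, t, P):
    """Pinhole projection of Nx3 world points into pixel coordinates."""
    cam = P @ R.T + t                 # world frame -> camera frame
    uv = cam[:, :2] / cam[:, 2:3]     # perspective divide
    return uv @ K[:2, :2].T + K[:2, 2]

def pointline_cost(K, R, t, pts3d, pts2d, lines3d, lines2d, wp=1.0, wl=1.0):
    """Weighted sum of point and line reprojection errors. A 3D line is
    given by its two endpoints; its error here is the projected
    endpoint-to-endpoint distance (a simplification of the distance plus
    angle error used in point-line SLAM)."""
    e_pts = np.sum((project(K, R, t, pts3d) - pts2d) ** 2)
    proj_ends = project(K, R, t, lines3d.reshape(-1, 3)).reshape(-1, 2, 2)
    e_lines = np.sum((proj_ends - lines2d) ** 2)
    return wp * e_pts + wl * e_lines

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
pts3d = np.array([[0., 0., 4.], [1., -1., 5.]])
lines3d = np.array([[[0., 1., 4.], [1., 1., 4.]]])         # one 3D segment
pts2d = project(K, R, t, pts3d)                            # perfect observations
lines2d = project(K, R, t, lines3d.reshape(-1, 3)).reshape(-1, 2, 2)
cost = pointline_cost(K, R, t, pts3d, pts2d, lines3d, lines2d)
```

In a full system this cost would be fed to a nonlinear least-squares solver (Gauss-Newton or Levenberg-Marquardt) over the pose parameters; at the true pose the residual above is zero, and any perturbation of the pose increases it.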
As a preferred scheme, in step S5 point-cloud processing is performed through local mapping and the camera pose is optimized by global relocalization to obtain the sparse point-cloud reconstruction map, which specifically includes the following steps:
calculating the bag-of-words (BoW) vector of each frame of the data stream, inserting the current frame image, together with its BoW vector and covisibility information, into the map, and updating the covisibility graph; during tracking, each key frame carries information including feature points, feature lines, and descriptors, and map points are then created by triangulation; whether other key frames exist in the key-frame queue is checked, and if not, the map points are optimized, performing local BA (bundle adjustment) optimization with the current frame, the key frame images sharing covisible points with the current frame image, and their adjacent frame images;
candidate key frames corresponding to the current frame are then found; for each candidate key frame, the current frame is matched against it using the BoW dictionary, initialization uses the matching relation between the current frame and the candidate key frame, and the pose is estimated with EPnP.
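The keyframe lookup can be pictured with a toy bag-of-words retrieval (the vocabulary, scores, and threshold are made-up illustrations; ORB-SLAM-style systems use a DBoW2 vocabulary and then estimate the pose with EPnP on the matched points):

```python
import numpy as np

def bow_vector(word_ids, vocab_size):
    """Histogram of visual-word occurrences in one frame, L2-normalised."""
    v = np.bincount(word_ids, minlength=vocab_size).astype(float)
    n = np.linalg.norm(v)
    return v / n if n else v

def candidate_keyframes(query, keyframes, min_score=0.5):
    """Rank stored keyframe BoW vectors by cosine similarity to the query
    and keep those above a relocalisation threshold."""
    scores = np.array([kf @ query for kf in keyframes])
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_score]

VOCAB = 8
kf0 = bow_vector(np.array([0, 1, 1, 2, 5]), VOCAB)     # shares words with query
kf1 = bow_vector(np.array([3, 4, 6, 7, 7]), VOCAB)     # disjoint vocabulary
query = bow_vector(np.array([0, 1, 2, 5, 5]), VOCAB)
cands = candidate_keyframes(query, [kf0, kf1])
# only kf0 survives the threshold and ranks first
```

Each surviving candidate would then be verified geometrically: descriptor matches between the frames feed an EPnP solver, and the candidate is accepted only if enough inliers support the estimated pose.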
Further, optimizing the camera pose through loop detection in step S6 specifically includes the following steps:
Loop detection is performed with key frames based on both point and line features. When three consecutive closed-loop candidate key frames have high similarity to the current key frame, loop candidate frames are obtained. The feature points and feature lines on each candidate loop frame are first matched with the current frame, and the similarity transformation matrix is then solved using the three-dimensional information corresponding to the feature points and feature lines. If enough inlier points and inlier lines exist in the loop frame, Sim(3) optimization is performed, loop correction is carried out with the loop candidate frames, and the feature-point and line-segment constraints are optimized to obtain the camera pose after joint point-line optimization.
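Solving the similarity transformation from the matched three-dimensional points is a Sim(3) alignment (scale, rotation, translation); below is a compact sketch of the classic Umeyama closed form, offered as an illustration rather than the patent's exact solver:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form similarity transform (s, R, t) with dst ≈ s * R @ src + t,
    computed from matched 3D point sets (Umeyama, 1991)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)              # cross-covariance of the sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # keep R a proper rotation
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)      # variance of the source set
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

rng = np.random.default_rng(0)
src = rng.standard_normal((20, 3))
Rz = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])  # 90 deg about z
dst = 2.0 * src @ Rz.T + np.array([1., 2., 3.])
s, R, t = umeyama_sim3(src, dst)
# recovers s = 2, R = Rz, t = (1, 2, 3)
```

In monocular SLAM the scale s absorbs the scale drift accumulated along the loop, which is why loop correction uses Sim(3) rather than a rigid SE(3) transform.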
(1) Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the invention builds on ORB-SLAM3 and proposes a SLAM algorithm based on feature points, feature lines, and semantic information; it combines Mask R-CNN with multi-view geometry to achieve instance segmentation and rejection of dynamic targets, identifies dynamic feature points and feature lines, eliminates the interference of dynamic targets with feature matching and their influence on the SLAM system, and completes the static scene occluded by dynamic targets using information from previous frames;
(2) The invention provides a semantic SLAM system based on feature points and feature lines that extracts line features with a Transformer structure; the line features extracted this way are more accurate than those extracted by traditional methods.
Compared with point features, lines provide more geometric structure information about the environment. By extracting both point and line features, matching becomes more accurate and robust in weak-texture scenes and under illumination changes, camera pose estimation is achieved, localization and relocalization errors are reduced, and the algorithm solves the difficulty of localization in low-texture scenes.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a feature line detection diagram;
FIG. 2 is a flow chart of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
The semantic SLAM method based on the dotted line combination for the dynamic environment according to the embodiment of the present invention is specifically described below with reference to fig. 1 to 2.
As shown in fig. 1 and fig. 2, the invention provides a semantic SLAM method based on point-line combination and oriented to a dynamic environment, which is characterized by comprising the following steps:
step S1: acquiring an image stream of the scene, feeding it frame by frame into a CNN network, segmenting objects with a priori dynamic properties, such as pedestrians, vehicles, and fish, pixel by pixel, and segmenting out the dynamic objects in the scene to obtain key frame images; the static scene occluded by dynamic targets is completed using information from previous frames. Feature points and feature lines of the static region are then extracted from the key frame image, specifically as follows: ORB feature points are extracted from the static region of the image, and ORB descriptors are computed at the same time to obtain the feature points and descriptors of the static region; line features are extracted from the image with dynamic objects removed, adopting a Transformer network structure, and the line features of the static region are obtained by fusing feature information at different scales through a series of up-sampling and down-sampling operations.
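The occlusion-completion step can be pictured as a mask-based copy from an earlier frame (a deliberately simplified stand-in for the CNN-based pipeline; a real system would first warp the previous frame by the estimated camera motion):

```python
import numpy as np

def fill_dynamic(frame, mask, prev_frame):
    """Replace pixels flagged as dynamic (mask == True) with the pixel
    values of the previous frame, approximating the occluded background."""
    out = frame.copy()
    out[mask] = prev_frame[mask]
    return out

prev_frame = np.full((4, 4), 7, dtype=np.uint8)      # static background
frame = prev_frame.copy()
frame[1:3, 1:3] = 255                                # a moving object appears
mask = frame == 255                                  # its segmentation mask
restored = fill_dynamic(frame, mask, prev_frame)
# restored equals the static background everywhere
```

Feature extraction then runs on the restored image, so points and lines are never anchored on the moving object itself.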
If line features were extracted using the segment length l and the angle θ to obtain the two endpoints of a segment, then for a long segment a small change in the angle would greatly shift the endpoint positions and cause large line errors. The present method therefore uses the horizontal distance d_x and the vertical distance d_y to generate a vector v = (d_x, d_y) that predicts the positions of the two endpoints of a single line segment, yielding the line feature, where (x_l, y_l) and (x_r, y_r) denote the coordinates of the left and right endpoints of the segment, (x_m, y_m) is the midpoint coordinate, and v represents the vector relation between the right-endpoint coordinates and the midpoint coordinates; in the present method d_x and d_y are expressed as: d_x = x_r − x_m, d_y = y_r − y_m.
Step S2: for step S1: extracting feature points and feature lines from the obtained key frame image, constructing a local map related to the current frame image, including a key frame image sharing a common view point with the current frame image and adjacent frame images of the key frame image, searching feature points and line segments matched with the current frame image in the key frame image and the adjacent frame images of the key frame image, then carrying out dynamic consistency check on the prior dynamic object, removing the feature points and the feature lines on the dynamic object, reserving the feature points and the feature lines on the static object, and carrying out matching by utilizing the rest static feature points and the rest static lines;
step S3: matching the feature points and feature lines of step S2 while filtering, removing incorrectly matched points and lines to obtain correct matching point pairs and line pairs, and obtaining the initial camera pose from the matching point pairs. Matching proceeds as follows: for feature-point matching, ORB descriptors are generated and a fast nearest-neighbor search finds, in the current frame, the feature point with the closest descriptor distance as the matching point; mismatched pairs are then rejected: when the matching descriptor distance is larger than a threshold γ, or the ratio of the best match distance to the second-best match distance is close to 1 (so that the second match is nearly as good as the first), the matching pair is considered prone to mismatch and is rejected. For feature-line matching, 2D-2D matching line pairs are obtained through geometric constraints, outliers are rejected and the pairs are mapped directly into 3D space, and accurate 2D-3D line matching pairs are then obtained by minimizing the reprojection error. The initial camera pose is computed as follows: the fundamental matrix and the essential matrix are computed from the feature points and feature lines, and a relatively accurate pose transformation matrix between the cameras is obtained by SVD decomposition.
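The SVD-based recovery of the initial pose from the essential matrix can be sketched as the standard four-hypothesis decomposition (an illustration of the textbook recipe, not the patent's exact code; in practice the correct hypothesis is selected by checking that triangulated points lie in front of both cameras):

```python
import numpy as np

def decompose_essential(E):
    """SVD decomposition of an essential matrix into the four candidate
    (R, t) pose hypotheses; cheirality (points in front of both cameras)
    picks the physically valid one."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U                       # force proper rotations
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                      # translation up to scale and sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Build E = [t]x R from a known pose and check it is among the hypotheses.
R_true = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
t_true = np.array([1., 0., 0.])
tx = np.array([[0., -t_true[2], t_true[1]],
               [t_true[2], 0., -t_true[0]],
               [-t_true[1], t_true[0], 0.]])
E = tx @ R_true
hyps = decompose_essential(E)
```

The four hypotheses arise because the essential matrix fixes the pose only up to the sign of t and a twisted-pair rotation; monocular reconstruction also leaves the norm of t undetermined, which is the scale ambiguity Sim(3) loop closure later corrects.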
Step S4: calculating the camera pose of the current frame through the matching point pair and the line pair obtained in the step S3, and obtaining accurate camera pose estimation by minimizing the re-projection error of the point pair and the line pair; the specific implementation of optimizing the camera pose by minimizing the reprojection error of the point pair and the line pair is as follows:
the position and posture are jointly optimized by adopting the dotted line, and the minimized reprojection error is defined as:
wherein the method comprises the steps of
Wherein N represents a pair of matching lines on 2D-3D, a functionEqual to 3D line->Line projected onto 2D plane, angle error +.>By defining two planesFace->And->Defined, functionEqual to 3D point->Dot of figure onto 2D plane +.>And->Is a given weight value, and optimizes the camera pose by minimizing the re-projection error.
Step S5: constructing a local map about a scene by utilizing a key frame image, carrying out instance segmentation on each frame image, merging characteristic points and characteristic lines in each instance into corresponding instances, positioning a camera pose by utilizing the characteristic points and the characteristic lines, calculating point clouds of an object and the scene, carrying out point cloud processing by utilizing the local map, and optimizing the camera pose by utilizing global repositioning, thereby obtaining a sparse point cloud reconstruction map, and specifically comprising the following steps:
calculating a BOW vector of each frame of data stream, calculating the current frame image comprising the BOW vector and the common view relation information, inserting the current frame image into a map, and updating the common view; in the tracking process, each key frame is attached with information comprising feature points, feature lines and descriptors, but not all feature points become 3D map points, so that unqualified feature points and feature lines need to be removed, and then the map points are created by utilizing triangulation; judging whether other key frames exist in the key frame queue, if not, optimizing map points, and performing local BA optimization by using the current frame, the key frame image sharing the common view point with the current frame image and the adjacent frame images of the key frame image;
Candidate key frames corresponding to the current frame are then found; for each candidate key frame, the current frame is matched with the key frame by using the BoW dictionary, initialization is performed by using the matching relation between the current frame and the candidate key frame, and the pose is estimated with EPnP for each candidate key frame.
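EPnP itself solves the pose through four weighted control points; as an illustrative stand-in (our own minimal DLT pose solver under the same 2D-3D input, not the patent's implementation), pose estimation from the BoW matches can be sketched as:

```python
import numpy as np

def pnp_dlt(K, pts3d, pts2d):
    """Minimal DLT pose solver from n >= 6 2D-3D correspondences.
    A stand-in sketch for EPnP: recovers R, t given intrinsics K."""
    rows = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    P = np.linalg.svd(np.array(rows))[2][-1].reshape(3, 4)
    # choose the overall sign that puts the points in front of the camera
    if np.mean([P[2] @ np.append(p, 1.0) for p in pts3d]) < 0:
        P = -P
    M = np.linalg.inv(K) @ P          # = s * [R | t] for some s > 0
    U, S, Vt = np.linalg.svd(M[:, :3])
    R = U @ Vt                        # nearest rotation (Procrustes)
    if np.linalg.det(R) < 0:          # guard against a reflection
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    t = M[:, 3] / S.mean()            # remove the projective scale s
    return R, t
```

In a full pipeline this closed-form estimate would seed the minimization of the point-line reprojection error of step S4.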
Step S6: performing pose optimization by using loop detection, correcting drift errors, and obtaining a more accurate camera pose estimation. The method specifically comprises the following steps:
Based on the two kinds of features, points and lines, loop detection is performed using key frames. When three consecutive closed-loop candidate key frames have sufficiently high similarity with the current key frame, loop candidate frames are obtained. The feature points and feature lines on each candidate loop frame are first matched with the current frame, and a similarity transformation matrix is then solved using the three-dimensional information corresponding to the feature points and feature lines; if enough inlier points and inlier lines exist in the loop frame, Sim(3) optimization is performed, loop correction is carried out using the loop candidate frames, the feature point constraints and line segment constraints are optimized, and the camera pose after point-line joint optimization is obtained.
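The similarity transformation solved from the 3D correspondences during loop closing can be sketched with the closed-form Umeyama alignment, a standard choice for this step (variable names are ours, and this is a sketch of the geometry only, not the patent's Sim(3) optimizer):

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form similarity estimate such that dst ≈ s * R @ src + t,
    in the spirit of the Sim(3) solve used between loop key frames."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)            # cross-covariance of the two sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                      # avoid a reflection solution
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)  # source variance fixes the scale
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In practice the solve would run inside RANSAC over the matched point/line landmarks, and Sim(3) optimization would refine the result when enough inliers support it.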
In the description of the present invention, the term "plurality" means two or more, unless explicitly defined otherwise, the orientation or positional relationship indicated by the terms "upper", "lower", etc. are based on the orientation or positional relationship shown in the drawings, merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention; the terms "coupled," "mounted," "secured," and the like are to be construed broadly, and may be fixedly coupled, detachably coupled, or integrally connected, for example; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the description of the present specification, the terms "one embodiment," "some embodiments," "particular embodiments," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (3)
1. The semantic SLAM method based on the point-line combination for the dynamic environment is characterized by comprising the following steps of:
step S1: acquiring an image stream of a scene, transmitting the image stream frame by frame into a CNN network, segmenting objects with a priori dynamic properties pixel by pixel, and segmenting out the dynamic objects in the scene to obtain a key frame image, while completing the static scene occluded by a dynamic target by utilizing information from the previous frames; extracting the feature points and feature lines of the static region of the key frame image, which specifically comprises the following steps: extracting features of the static region of the image by using ORB feature points and simultaneously calculating ORB descriptors to obtain the feature points and descriptors of the static region of the image; extracting line features of the image from which the dynamic objects have been removed, wherein the line feature extraction adopts a Transformer network structure, and the line features of the static region of the image are obtained by fusing feature information at different scales through a series of up-sampling and down-sampling operations; for line feature extraction, a horizontal distance d_x and a vertical distance d_y are used to generate a vector v = (d_x, d_y) that predicts the positions of the two endpoints of a single line segment, thereby obtaining a line feature, wherein p_l = (x_l, y_l) and p_r = (x_r, y_r) represent the coordinates of the left and right endpoints of the line segment, p_m = (x_m, y_m) is the midpoint coordinate of the line segment, and v represents the relation between the right-endpoint coordinate p_r and the midpoint coordinate p_m; in the method, p_r and p_l are expressed as: p_r = p_m + v and p_l = p_m − v;
Step S2: for the key frame image obtained in step S1 with its extracted feature points and feature lines, constructing a local map related to the current frame image, comprising the key frame images sharing co-visible points with the current frame image and the frames adjacent to those key frame images; searching the key frame images and their adjacent frames for feature points and line segments matching the current frame image; then carrying out a dynamic consistency check on the a priori dynamic objects, removing the feature points and feature lines on dynamic objects while retaining those on static objects, and performing matching by utilizing the remaining static feature points and static lines;
step S3: matching the feature points and feature lines from step S2 while filtering, removing incorrectly matched points and lines to obtain correct matching point pairs and line pairs, and obtaining an initial camera pose by using the matching point pairs; the matching of the feature points and feature lines specifically comprises the following steps: for feature point matching, ORB descriptors are generated and a fast nearest-neighbor search finds the feature point with the closest descriptor distance in the current frame as the matching point; mismatched pairs are then rejected: when the matching descriptor distance is larger than a threshold γ, or the ratio of the best matching distance to the second-best matching distance is greater than a ratio threshold (i.e., close to 1, so that the second-best match is nearly equivalent to the best match), the matching pair is considered prone to mismatching and is rejected; the matching of the feature lines obtains 2D-2D matching line pairs through geometric constraints, maps them directly to 3D space after outlier rejection, and then obtains accurate 2D-3D line matching pairs by minimizing the reprojection error;
step S4: calculating the camera pose of the current frame through the matching point pair and the line pair obtained in the step S3, and obtaining accurate camera pose estimation by minimizing the re-projection error of the point pair and the line pair; the specific implementation of optimizing the camera pose by minimizing the reprojection error of the point pair and the line pair is as follows:
the pose is jointly optimized by adopting both points and lines, and the minimized reprojection error is defined as:

E = Σ_{i=1}^{M} λ_p · ‖ p_i − f_p(T, P_i) ‖² + Σ_{j=1}^{N} λ_l · e_ang( l_j, f_l(T, L_j) )²

wherein N represents the number of 2D-3D matching line pairs and M the number of 2D-3D matching point pairs; the function f_l(T, L_j) equals the 3D line L_j projected onto the 2D plane under the camera pose T; the angular error e_ang is defined by the two planes back-projected from the observed 2D line l_j and the projected line; the function f_p(T, P_i) equals the 3D point P_i projected onto the 2D plane; λ_p and λ_l are the given weight values; and the camera pose is optimized by minimizing the reprojection error;
Step S5: constructing a local map of the scene by utilizing the key frame images, carrying out instance segmentation on each frame image, merging the feature points and feature lines within each instance into the corresponding instance, positioning the camera pose by utilizing the feature points and feature lines, and calculating the point clouds of the objects and the scene to obtain a sparse point cloud map;
step S6: performing pose optimization by using loop detection, correcting drift errors, and obtaining a more accurate camera pose estimation.
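The line-segment parameterization in step S1 of claim 1 stores a midpoint and a displacement vector v = (d_x, d_y); recovering the two endpoints is then a one-line computation (a trivial sketch, under our reading that v points from the midpoint toward the right endpoint):

```python
import numpy as np

def endpoints_from_midpoint(mid, disp):
    """Recover line-segment endpoints from the predicted midpoint p_m and
    the displacement vector v = (d_x, d_y): p_r = p_m + v, p_l = p_m - v."""
    mid, disp = np.asarray(mid, float), np.asarray(disp, float)
    return mid - disp, mid + disp   # (left endpoint, right endpoint)
```

Predicting (midpoint, v) rather than two free endpoints ties the two ends of a segment together, which is why a single vector suffices to describe the whole line feature.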
2. The semantic SLAM method based on point-line combination for dynamic environment according to claim 1, wherein in step S5, the point cloud processing is performed by local mapping, and the camera pose is optimized by global repositioning, so as to obtain a sparse point cloud reconstruction map, which specifically comprises the following steps:
calculating a BoW vector for each frame of the data stream, computing the BoW vector and co-visibility information for the current frame image, inserting the current frame image into the map, and updating the co-visibility graph; in the tracking process, each key frame carries information comprising feature points, feature lines and descriptors, and map points are then created by triangulation; judging whether other key frames exist in the key frame queue, and if not, optimizing the map points and performing local BA optimization by using the current frame, the key frame images sharing co-visible points with the current frame image, and the frames adjacent to those key frame images;
finding candidate key frames corresponding to the current frame; for each candidate key frame, matching the current frame with the key frame by using the BoW dictionary, initializing by using the matching relation between the current frame and the candidate key frame, and estimating the pose with EPnP for each candidate key frame.
3. The semantic SLAM method based on point-line combination for dynamic environment according to claim 2, wherein optimizing the camera pose by loop detection in step S6 specifically comprises the steps of:
based on the two kinds of features, points and lines, performing loop detection by using key frames; when three consecutive closed-loop candidate key frames have sufficiently high similarity with the current key frame, obtaining loop candidate frames; firstly matching the feature points and feature lines on each candidate loop frame with the current frame, then solving a similarity transformation matrix by using the three-dimensional information corresponding to the feature points and feature lines; if enough inlier points and inlier lines exist in the loop frame, performing Sim(3) optimization, performing loop correction by using the loop candidate frames, and optimizing the feature point constraints and line segment constraints to obtain the camera pose after point-line joint optimization.
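The two rejection rules in step S3 of claim 1 (the absolute descriptor-distance threshold γ and the best/second-best ratio test) can be sketched over binary ORB-style descriptors represented as integers (a toy sketch with our own names, not the patent's matcher):

```python
import numpy as np

def match_ratio_test(desc_a, desc_b, max_dist=64, ratio=0.8):
    """Brute-force Hamming matching of binary descriptors with the two
    rejection rules from step S3: an absolute-distance threshold and a
    Lowe-style ratio test between the best and second-best candidates."""
    matches = []
    for i, d in enumerate(desc_a):
        # Hamming distance from descriptor d to every descriptor in desc_b
        dists = np.array([bin(int(d) ^ int(e)).count("1") for e in desc_b])
        j, k = np.argsort(dists)[:2]      # best and second-best candidates
        if dists[j] <= max_dist and dists[j] < ratio * dists[k]:
            matches.append((i, int(j)))   # unambiguous match kept
    return matches
```

When the second-best distance is nearly as small as the best, the match is ambiguous (the condition `dists[j] < ratio * dists[k]` fails) and the pair is discarded, exactly the behavior the claim describes.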
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211619407.3A CN116468786B (en) | 2022-12-16 | 2022-12-16 | Semantic SLAM method based on point-line combination and oriented to dynamic environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116468786A CN116468786A (en) | 2023-07-21 |
CN116468786B true CN116468786B (en) | 2023-12-26 |
Family
ID=87181281
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211619407.3A Active CN116468786B (en) | 2022-12-16 | 2022-12-16 | Semantic SLAM method based on point-line combination and oriented to dynamic environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116468786B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117173342A (en) * | 2023-11-02 | 2023-12-05 | 中国海洋大学 | Underwater monocular and binocular camera-based natural light moving three-dimensional reconstruction device and method |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110489501A (en) * | 2019-07-24 | 2019-11-22 | 西北工业大学 | SLAM system rapid relocation algorithm based on line feature |
CN110782494A (en) * | 2019-10-16 | 2020-02-11 | 北京工业大学 | Visual SLAM method based on point-line fusion |
CN111402336A (en) * | 2020-03-23 | 2020-07-10 | 中国科学院自动化研究所 | Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method |
CN112132897A (en) * | 2020-09-17 | 2020-12-25 | 中国人民解放军陆军工程大学 | Visual SLAM method based on deep learning semantic segmentation |
CN112381890A (en) * | 2020-11-27 | 2021-02-19 | 上海工程技术大学 | RGB-D vision SLAM method based on dotted line characteristics |
CN112396595A (en) * | 2020-11-27 | 2021-02-23 | 广东电网有限责任公司肇庆供电局 | Semantic SLAM method based on point-line characteristics in dynamic environment |
CN112435262A (en) * | 2020-11-27 | 2021-03-02 | 广东电网有限责任公司肇庆供电局 | Dynamic environment information detection method based on semantic segmentation network and multi-view geometry |
CN112446882A (en) * | 2020-10-28 | 2021-03-05 | 北京工业大学 | Robust visual SLAM method based on deep learning in dynamic scene |
CN113837277A (en) * | 2021-09-24 | 2021-12-24 | 东南大学 | Multisource fusion SLAM system based on visual point-line feature optimization |
WO2022041596A1 (en) * | 2020-08-31 | 2022-03-03 | 同济人工智能研究院(苏州)有限公司 | Visual slam method applicable to indoor dynamic environment |
CN114283199A (en) * | 2021-12-29 | 2022-04-05 | 北京航空航天大学 | Dynamic scene-oriented dotted line fusion semantic SLAM method |
CN114627309A (en) * | 2022-03-11 | 2022-06-14 | 长春工业大学 | Visual SLAM method based on dotted line features in low texture environment |
CN114708293A (en) * | 2022-03-22 | 2022-07-05 | 广东工业大学 | Robot motion estimation method based on deep learning point-line feature and IMU tight coupling |
CN114862949A (en) * | 2022-04-02 | 2022-08-05 | 华南理工大学 | Structured scene vision SLAM method based on point, line and surface characteristics |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11017545B2 (en) * | 2018-06-07 | 2021-05-25 | Uisee Technologies (Beijing) Ltd. | Method and device of simultaneous localization and mapping |
Non-Patent Citations (1)
Title |
---|
Monocular Visual Simultaneous Localization and Mapping Algorithm Based on Point-Line Features; Wang Dan et al.; Robot, No. 03; full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109345588B (en) | Tag-based six-degree-of-freedom attitude estimation method | |
CN110223348B (en) | Robot scene self-adaptive pose estimation method based on RGB-D camera | |
CN110389348B (en) | Positioning and navigation method and device based on laser radar and binocular camera | |
CN108986037B (en) | Monocular vision odometer positioning method and positioning system based on semi-direct method | |
US9330471B2 (en) | Camera aided motion direction and speed estimation | |
CN110807809B (en) | Light-weight monocular vision positioning method based on point-line characteristics and depth filter | |
CN110322511B (en) | Semantic SLAM method and system based on object and plane features | |
CN109579825B (en) | Robot positioning system and method based on binocular vision and convolutional neural network | |
CN108519102B (en) | Binocular vision mileage calculation method based on secondary projection | |
CN108776989B (en) | Low-texture planar scene reconstruction method based on sparse SLAM framework | |
CN113506318B (en) | Three-dimensional target perception method under vehicle-mounted edge scene | |
CN107862735B (en) | RGBD three-dimensional scene reconstruction method based on structural information | |
CN112484746B (en) | Monocular vision auxiliary laser radar odometer method based on ground plane | |
US10991105B2 (en) | Image processing device | |
CN113658337B (en) | Multi-mode odometer method based on rut lines | |
CN112419497A (en) | Monocular vision-based SLAM method combining feature method and direct method | |
CN113223045A (en) | Vision and IMU sensor fusion positioning system based on dynamic object semantic segmentation | |
CN116449384A (en) | Radar inertial tight coupling positioning mapping method based on solid-state laser radar | |
CN111998862A (en) | Dense binocular SLAM method based on BNN | |
CN116468786B (en) | Semantic SLAM method based on point-line combination and oriented to dynamic environment | |
CN114088081A (en) | Map construction method for accurate positioning based on multi-segment joint optimization | |
CN114140527A (en) | Dynamic environment binocular vision SLAM method based on semantic segmentation | |
CN112037282B (en) | Aircraft attitude estimation method and system based on key points and skeleton | |
CN116385538A (en) | Visual SLAM method, system and storage medium for dynamic scene | |
CN116128966A (en) | Semantic positioning method based on environmental object |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||