CN111797688A - Visual SLAM method based on optical flow and semantic segmentation - Google Patents

Visual SLAM method based on optical flow and semantic segmentation

Info

Publication number
CN111797688A
CN111797688A
Authority
CN
China
Prior art keywords
dynamic
area
semantic segmentation
optical flow
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010488128.2A
Other languages
Chinese (zh)
Inventor
姚剑
卓胜德
程军豪
龚烨
涂静敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010488128.2A
Publication of CN111797688A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of visual spatial positioning and discloses a visual SLAM method based on optical flow and semantic segmentation, which comprises the following steps: segmenting the input image information with a semantic segmentation network to obtain a static region and a predicted dynamic region; performing feature tracking on the static region and the predicted dynamic region with a sparse optical flow method; judging the types of the feature points in the input image information and removing the dynamic feature points; and taking the set of feature points with the motion feature points removed as tracking data, inputting it into ORB-SLAM for processing, and outputting the pose result. The method addresses the poor tracking and localization performance of SLAM in dynamic environments and can obtain trajectory information with high pose accuracy in such environments.

Description

Visual SLAM method based on optical flow and semantic segmentation
Technical Field
The invention relates to the technical field of visual space positioning, in particular to a visual SLAM method based on optical flow and semantic segmentation.
Background
SLAM is a key technology in the field of intelligent mobile robots. Visual SLAM uses a camera as its main sensor, and because a camera provides richer information than other types of sensors, visual SLAM has been widely studied in recent years. However, achieving accurate tracking and localization in dynamic scenes has always been a significant challenge for SLAM systems.
In a real scene, a dynamic object can introduce erroneous data into the camera motion calculation, leading to tracking failure or incorrect tracking. Several methods have been proposed to address this problem. One is the traditional robust estimation method RANSAC, which treats dynamic information as outliers to be removed and retains the static information so that tracking and motion calculation can succeed; however, when the moving objects dominate the scene, this method fails because too little usable data remains. Another approach integrates additional sensors, whose data can complement each other and compensate the tracking and motion calculation; however, this approach is uneconomical in terms of equipment cost and computational cost, and in practice often amounts merely to increasing the number of cameras.
Existing methods therefore do not perform well in SLAM applications. With the application of deep learning to semantic segmentation, object detection, and related tasks in recent years, new solutions have become available for handling the influence of moving objects in dynamic scenes.
Visual SLAM can be divided into two categories: feature-based methods and direct methods. Feature-based methods achieve tracking and localization by comparing feature descriptors to match point pairs and minimizing the reprojection error, and they remain fairly robust to geometric noise; however, extracting feature points is time-consuming. Direct methods, based on the gray-level-invariance assumption, optimize the pose for tracking by minimizing the photometric error; they perform better than feature-point methods in low-texture environments and at lower computational cost, but the overall algorithm is less robust. Neither the feature-point method nor the direct method can solve the problems caused by common dynamic objects, whose incorrect data associations reduce the accuracy of the computed pose.
Disclosure of Invention
The embodiment of the application solves the problem of poor SLAM tracking and positioning effect in a dynamic environment by providing a visual SLAM method based on optical flow and semantic segmentation.
The embodiment of the application provides a visual SLAM method based on optical flow and semantic segmentation, which comprises the following steps:
step 1, segmenting input image information by adopting a semantic segmentation network to obtain a static region and a predicted dynamic region;
step 2, carrying out feature tracking on the static area and the prediction dynamic area by adopting a sparse optical flow method;
step 3, judging the types of the feature points in the input image information, and removing the dynamic feature points;
and step 4, taking the set of feature points with the motion feature points removed as tracking data, inputting it into ORB-SLAM for processing, and outputting a pose result.
Preferably, the input image information in step 1 is one of input data corresponding to a monocular camera, input data corresponding to a binocular camera, and input data corresponding to a depth camera;
in step 3, the types of the feature points in the input data corresponding to the monocular camera are judged through the epipolar constraint, and the types of the feature points in the input data corresponding to the binocular camera or the input data corresponding to the depth camera are judged through the reprojection error.
Preferably, the step 1 comprises the following substeps:
step 1.1, selecting a data set to train a Mask R-CNN network to obtain a trained semantic segmentation network; the data set includes multiple types of data as potential moving objects;
step 1.2, inputting the image information into the trained semantic segmentation network to complete image segmentation and obtain a static region A_s and a predicted motion region A_m.
Preferably, the step 2 comprises the following substeps:
step 2.1, performing feature extraction and matching on the static region A_s and the predicted motion region A_m by a sparse optical flow method to obtain a static matching point pair set and a predicted-motion matching point pair set;
step 2.2, solving the pose based on the SLAM operating model.
Preferably, the pose solving based on the SLAM operating model in step 2.2 comprises:
at time k, projecting the j-th landmark y_j into the current frame to obtain the projected position h(ξ_k, y_j), and obtaining the corresponding observation model:
z_{k,j} = h(ξ_k, y_j) + v_{k,j}
where h(·) denotes the nonlinear observation model of the landmark under a known pose transformation, z_{k,j} denotes the pixel coordinates of landmark y_j in the current frame, and v_{k,j} ~ N(0, Q_{k,j}) denotes Gaussian noise with mean 0 and covariance Q_{k,j};
according to the observation model, establishing an error model from the reprojection error formed by the projected position and the corresponding pixel coordinates:
e_{k,j} = z_{k,j} - h(ξ_k, y_j)
where e_{k,j} denotes the difference between the observed position of landmark y_j in the current frame and its projected position, and ξ_k denotes the Lie-algebra form of the pose transformation between the two frames at time k;
converting the error model into a nonlinear least-squares problem, taking all camera poses ξ and landmarks y as the quantity x to be optimized, letting the tracking duration be m and the total number of landmarks be n, and establishing the loss function:
J(x) = (1/2) Σ_{k=1}^{m} Σ_{j=1}^{n} e_{k,j}^T Q_{k,j}^{-1} e_{k,j}
where J(·) denotes the loss function;
and obtaining the optimized pose by solving the loss function.
Preferably, in step 3, if the input image information is the input data corresponding to a binocular camera or the input data corresponding to a depth camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:
obtaining, by reprojection, a first offset vector set corresponding to the static matching point pair set; using a weighted-average method, computing from the static region A_s and the first offset vector set the first offset vector weights T_i and the mean Φ_s of the first offset vector weights;
obtaining, by reprojection, a second offset vector set corresponding to the predicted-motion matching point pair set; using the weighted-average method, computing from the predicted motion region A_m and the second offset vector set the second offset vector weights T_j;
judging the type of each feature point in the predicted motion region A_m according to the second offset vector weight T_j and the mean Φ_s of the first offset vector weights;
judging whether the predicted motion region is a dynamic region: if the number of feature points in the predicted motion region judged to be dynamic feature points exceeds a first threshold, marking the predicted motion region as a dynamic region and removing all feature points marked as belonging to the dynamic region.
Preferably, the first offset vector set corresponding to the static matching point pair set is obtained as follows:
the rotation and translation, in matrix form, of the optimized pose obtained in step 2 are R and t respectively, and the camera intrinsic matrix is K; a matching point pair p_i and q_i(x_i, y_i) of the previous frame and the current frame corresponds to a three-dimensional space point P_i, and projecting P_i into the current frame gives the projected coordinates q̂_i(x̂_i, ŷ_i):
ŝ_i [x̂_i, ŷ_i, 1]^T = K (R P_i + t)
where ŝ_i denotes the depth of P_i in the current camera frame, q̂_i denotes the pixel position of the spatial point P_i projected into the current frame, x_i, y_i denote the pixel coordinates of q_i, and x̂_i, ŷ_i denote the pixel coordinates of q̂_i;
the position offset vector corresponding to a matching point pair of the static region A_s is expressed as
V_i = q̂_i - q_i
and the first set of position offset vectors corresponding to the n matching point pairs is expressed as
V_state = {V_1, V_2, ..., V_n}.
Preferably, the first offset vector weight T_i and the mean Φ_s of the first offset vector weights are obtained as follows:
for each first offset vector V_i, its angle θ_i and its modulus length l_i are computed and combined into the weight T_i of the offset vector over angle and modulus length, and the mean of the first offset vector weights is
Φ_s = (1/n) Σ_{i=1}^{n} T_i
where T_i denotes the weight of the first offset vector over angle and modulus length, and Φ_s denotes the mean of the first offset vector weights;
the type of each feature point in the predicted motion region A_m is judged as follows:
T_j is compared with Φ_s; if T_j is greater than Φ_s, the feature point is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
Preferably, in step 3, if the input image information is the input data corresponding to a monocular camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:
obtaining the fundamental matrix F from the rotation R and translation t corresponding to the optimized pose obtained in step 2:
F = K^{-T} t^ R K^{-1}
where t^ denotes the skew-symmetric matrix of t;
obtaining the epipolar line Fp_k = [x, y, z]^T from the fundamental matrix F and the static matching point pair set and predicted-motion matching point pair set obtained in step 2;
obtaining the feature-point D value from the epipolar line;
presetting a second threshold η and judging the type of each feature point from its D value and the second threshold η;
judging whether the predicted motion region is a dynamic region: if the number of feature points in the predicted motion region judged to be dynamic feature points exceeds the first threshold, marking the predicted motion region as a dynamic region and removing all feature points marked as belonging to the dynamic region.
Preferably, the feature-point D value is calculated as:
D = |q_k^T F p_k| / sqrt(x^2 + y^2)
where p_k, q_k are used in homogeneous form in the formula, Fp_k denotes the epipolar line in epipolar geometry, x, y, z are the vector parameters of the epipolar line, and D denotes the distance from the matched point q_k in the current frame to the epipolar line Fp_k;
if the feature-point D value is greater than the second threshold η, the point pair of p_k is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
in the embodiment of the application, firstly, a semantic segmentation network is adopted to segment input image information to obtain a static area and a predicted dynamic area, then a sparse optical flow method is adopted to track the characteristics of the static area and the predicted dynamic area, then the types of characteristic points in the input image information are judged, the dynamic characteristic points are removed, finally, a set with the motion characteristic points removed is used as tracking data and is input into an ORB-SLAM for processing, and a pose result is output. The method is based on the dynamic object extraction, judgment and removal of dynamic influence of the combination of semantic segmentation and an optical flow method, and the static feature points without the dynamic influence are applied to a subsequent SLAM system, so that the track information with high pose precision in the dynamic environment can be obtained finally. Compared with the traditional method for processing the dynamic environment, the method can well judge and eliminate the problems of characteristic influence and low pose precision of the dynamic object.
Drawings
In order to illustrate the technical solution of the present embodiment more clearly, the drawings needed in the description of the embodiment are briefly introduced below. The drawings described below illustrate one embodiment of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is an overall flowchart of a visual SLAM method based on optical flow and semantic segmentation according to an embodiment of the present invention.
Detailed Description
The embodiment provides a visual SLAM method based on optical flow and semantic segmentation, which mainly comprises the following steps:
Step 1: segment the input image information with a semantic segmentation network to obtain a static region and a predicted dynamic region.
Step 2: perform feature tracking on the static region and the predicted dynamic region with a sparse optical flow method.
Step 3: judge the types of the feature points in the input image information and remove the dynamic feature points.
Step 4: input the set of feature points with the motion feature points removed into ORB-SLAM as tracking data for processing, and output the pose result.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The embodiment provides a visual SLAM method based on optical flow and semantic segmentation, as shown in fig. 1, including the following steps:
Step 1: segment the input image information (data) with a Mask R-CNN network, distinguishing static and dynamic objects and obtaining a static region and a predicted dynamic region.
Step 1.1: select a data set to train the Mask R-CNN network; 20 categories of the COCO data set are selected as potential moving objects, such as people, bicycles, buses, boats, birds, cats, and dogs.
Step 1.2: read the data corresponding to a monocular camera, a binocular camera, or an RGB-D depth camera and input it into the network. For the trained semantic segmentation network, the input image has format m×n×3 and the output has format m×n×l, where m×n is the image size, 3 is the number of image channels (RGB), and l is the number of training categories selected in step 1.1 (namely 20). Regions belonging to the selected 20 categories are marked as potential moving objects, completing the semantic segmentation and yielding the static region A_s and the predicted motion region A_m. If the segmentation yields no A_m, the data is considered to contain no dynamic region, no dynamic-region processing is needed, the frame can be handled by conventional ORB-SLAM, and the flow proceeds directly to step 4.
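To make step 1 concrete, the following is a minimal sketch of how a frame could be split into A_s and A_m with an off-the-shelf, COCO-pretrained Mask R-CNN; the model choice, the class-id subset, and the thresholds are illustrative assumptions rather than the exact network trained in step 1.1.

```python
# Minimal sketch of step 1: segment a frame into a static region A_s and a predicted
# motion region A_m with a COCO-pretrained Mask R-CNN (illustrative assumptions only).
import numpy as np
import torch
import torchvision

# Assumed subset of COCO class ids treated as potentially dynamic
# (person, bicycle, car, motorcycle, bus, truck, bird, cat, dog).
DYNAMIC_COCO_IDS = {1, 2, 3, 4, 6, 8, 16, 17, 18}

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def segment_dynamic_regions(frame_bgr, score_thresh=0.5, mask_thresh=0.5):
    """Return boolean masks (A_m, A_s) for one BGR frame of shape (m, n, 3)."""
    rgb = frame_bgr[:, :, ::-1].copy().astype(np.float32) / 255.0
    tensor = torch.from_numpy(rgb).permute(2, 0, 1)           # 3 x m x n, values in [0, 1]
    with torch.no_grad():
        out = model([tensor])[0]
    a_m = np.zeros(frame_bgr.shape[:2], dtype=bool)
    for label, score, mask in zip(out["labels"], out["scores"], out["masks"]):
        if score >= score_thresh and int(label) in DYNAMIC_COCO_IDS:
            a_m |= (mask[0].numpy() >= mask_thresh)           # accumulate predicted motion region A_m
    a_s = ~a_m                                                # everything else is the static region A_s
    return a_m, a_s
```

In a full system the network trained on the 20 selected categories from step 1.1 would replace the pretrained model used here.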
Step 2: perform feature tracking on the static region and the predicted dynamic region with a sparse optical flow method.
After the data preprocessing is completed, a lightweight algorithm is used for tracking. In essence, it keeps the feature extraction and tracking functions of ORB-SLAM tracking, but the feature-point method is replaced by an optical flow method and the local-optimization and keyframe-decision modules are removed, so that feature extraction, matching, and pose solving are completed.
Step 2.1: for the segmented data, the Lucas-Kanade optical flow method is applied to the static region A_s and the predicted motion region A_m to extract and match features. For the current frame, the full feature point set P is obtained, and matching yields the static matching point pair set P_match_s = {(p_i, q_i), i = 1, 2, 3, ..., n} and the predicted-motion matching point pair set P_match_m = {(p_j, q_j), j = 1, 2, 3, ..., m}, where p_i and q_i denote the i-th statically matched pixel point pair of the previous and current frames respectively, p_j and q_j denote the j-th predicted-motion matched pixel point pair of the previous and current frames respectively, and n and m denote the numbers of static-region and predicted-motion-region matching point pairs respectively.
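The feature extraction and matching of step 2.1 could be sketched with OpenCV's pyramidal Lucas-Kanade implementation as below, run once with the mask A_s and once with A_m; the detector, parameter values, and function name are assumptions for illustration.

```python
# Minimal sketch of step 2.1: sparse LK optical flow tracking restricted to a region mask.
import cv2
import numpy as np

def track_region(prev_gray, curr_gray, region_mask):
    """Return matched point pairs (p_i in previous frame, q_i in current frame) inside region_mask."""
    mask_u8 = region_mask.astype(np.uint8) * 255
    p = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01,
                                minDistance=7, mask=mask_u8)
    if p is None:
        return np.empty((0, 2), np.float32), np.empty((0, 2), np.float32)
    q, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p, None,
                                               winSize=(21, 21), maxLevel=3)
    ok = status.reshape(-1) == 1                       # keep only successfully tracked points
    return p.reshape(-1, 2)[ok], q.reshape(-1, 2)[ok]

# Usage: p_s, q_s = track_region(prev, curr, a_s)      # static matching point pairs
#        p_m, q_m = track_region(prev, curr, a_m)      # predicted-motion matching point pairs
```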
Step 2.2: solve the pose based on the SLAM operating model. At time k, the j-th landmark y_j is projected into the current frame to obtain the projected position h(ξ_k, y_j), and the corresponding observation model can be obtained:
z_{k,j} = h(ξ_k, y_j) + v_{k,j}
where h(·) denotes the nonlinear observation model of the landmark under a known pose transformation, z_{k,j} denotes the pixel coordinates of landmark y_j in the current frame, and v_{k,j} ~ N(0, Q_{k,j}) denotes Gaussian noise with mean 0 and covariance Q_{k,j}.
From the observation model, an error model can be established from the reprojection error formed by the projected position and the corresponding pixel coordinates:
e_{k,j} = z_{k,j} - h(ξ_k, y_j)
where e_{k,j} denotes the difference between the observed position of landmark y_j in the current frame and its projected position, and ξ_k denotes the Lie-algebra form of the pose transformation between the two frames at time k.
Step 2.3: convert the error model into a nonlinear least-squares problem, take all camera poses ξ and landmarks y as the quantity x to be optimized, let the tracking duration be m and the total number of landmarks be n, and establish the following loss function:
J(x) = (1/2) Σ_{k=1}^{m} Σ_{j=1}^{n} e_{k,j}^T Q_{k,j}^{-1} e_{k,j}
where J(·) denotes the loss function, k denotes time k of tracking, j denotes the j-th landmark, e_{k,j} is the error from step 2.2, and Q_{k,j} denotes the covariance of the Gaussian noise.
The optimized camera pose is obtained by solving this loss function.
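A minimal numerical sketch of this least-squares pose solve is given below; it parameterizes the inter-frame pose as a rotation vector plus translation, treats Q_{k,j} as the identity (an unweighted reprojection error), and uses assumed helper names, so it illustrates the optimization rather than reproducing the exact solver of the embodiment.

```python
# Minimal sketch of steps 2.2-2.3: refine the inter-frame pose by minimizing the
# reprojection error of known 3D landmarks (Q_{k,j} taken as identity here).
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(pose, landmarks_3d, observed_px, K):
    """pose = [rx, ry, rz, tx, ty, tz]; residuals e_{k,j} stacked for all landmarks."""
    R, _ = cv2.Rodrigues(pose[:3])
    cam = landmarks_3d @ R.T + pose[3:]            # R * P_i + t
    proj = cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]              # pinhole projection h(xi_k, y_j)
    return (observed_px - proj).ravel()            # z_{k,j} - h(xi_k, y_j)

def solve_pose(landmarks_3d, observed_px, K, pose0=np.zeros(6)):
    result = least_squares(reprojection_residuals, pose0,
                           args=(landmarks_3d, observed_px, K))
    return result.x                                # optimized [rotation vector, translation]
```

The weighted form of the loss would follow by left-multiplying each residual by a square root of Q_{k,j}^{-1}.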
Step 3: judge the types of the feature points in the input image information and remove the dynamic feature points.
The input image information is one of the input data corresponding to a monocular camera, the input data corresponding to a binocular camera, and the input data corresponding to a depth camera. Different dynamic-feature-point judgment and processing methods are designed for the different sensor types corresponding to the different input data types.
RGB-D depth camera and binocular camera data are processed with the same method and proceed to step 3.1; monocular camera data jumps directly to step 3.4.
Step 3.1: this step is for RGB-D and binocular systems. The Lie-algebra form of the camera's optimized initial pose obtained in step 2 is ξ, whose corresponding rotation and translation are R, t, and the camera intrinsic matrix K is known. A matching pixel point pair p_i and q_i(x_i, y_i) of the previous frame and the current frame corresponds to a three-dimensional space point P_i; projecting P_i into the current frame gives the projected coordinates q̂_i(x̂_i, ŷ_i), with the relationship:
ŝ_i [x̂_i, ŷ_i, 1]^T = K (R P_i + t)
where ŝ_i denotes the depth of P_i in the current camera frame, q̂_i denotes the pixel position of the spatial point P_i projected into the current frame, x_i, y_i denote the pixel coordinates of q_i, and x̂_i, ŷ_i denote the pixel coordinates of q̂_i.
If there were no error influence, q̂_i and q_i would coincide; however, under the influence of noise and the like, a position offset may exist between the static feature points and the predicted motion points:
V_i = q̂_i - q_i
where V_i denotes the position offset vector corresponding to a static matching point pair. For the n matching point pairs there is then the offset vector set
V_state = {V_1, V_2, ..., V_n}
that is, the first offset vector set corresponding to the static matching point pair set.
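The reprojection and offset-vector construction of step 3.1 could be sketched as follows, assuming the 3D points P_i of the static matches are already available (for example from the depth or stereo data of the previous frame); the function name and array layout are assumptions.

```python
# Minimal sketch of step 3.1: reproject the 3D points of the matches with the refined
# pose (R, t) and form the offset vectors V_i = q_hat_i - q_i.
import numpy as np

def offset_vectors(points_3d, q_curr, R, t, K):
    """points_3d: Nx3 landmarks of the matches; q_curr: Nx2 matched pixels in the current frame."""
    cam = points_3d @ R.T + t                  # R * P_i + t
    proj = cam @ K.T
    q_hat = proj[:, :2] / proj[:, 2:3]         # projected pixel coordinates q_hat_i
    return q_hat - q_curr                      # offset vectors V_i, one row per matching pair
```

Applied to the static matches this yields V_state; applied in the same way to the predicted-motion matches (step 3.3) it yields the second offset vector set.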
Step 3.2: using the static matching point pairs obtained in step 2.1, describe the static region A_s with a weighted-average method. For the static region A_s and the first offset vector set V_state, compute for each offset error V_i its angle θ_i and its modulus length l_i, and combine them into the weight T_i of the offset vector over angle and modulus length.
Then calculate the mean Φ_s of the offset vector weights:
Φ_s = (1/n) Σ_{i=1}^{n} T_i
Step 3.3: obtain, by reprojection, the second offset vector set corresponding to the predicted-motion matching point pair set; using the weighted-average method, compute from the predicted motion region A_m and the second offset vector set the second offset vector weights T_j.
That is, following steps 3.1 and 3.2, obtain for the predicted motion region A_m the corresponding second offset vector set V_other, the angle and modulus length of each offset error, and the corresponding second offset vector weight T_j, and compare T_j with Φ_s: if T_j is greater than Φ_s, the corresponding predicted feature point is judged to be dynamic; otherwise it is judged to be static.
Each vector in V_other is used to complete the dynamic-or-static judgment of its corresponding predicted feature point. After this step is completed, jump to step 3.6.
Step 3.4: for a monocular system, the Lie-algebra form of the camera's optimized initial pose obtained in step 2 is ξ, whose corresponding rotation and translation are R, t, from which the fundamental matrix F of the current motion is obtained:
F = K^{-T} t^ R K^{-1}
where t^ denotes the skew-symmetric matrix of t. Meanwhile, step 2.1 gives the set of all feature-matched point pairs between the two frames (including both the static-region and predicted-motion-region matching point pairs), P_match = {(p_k, q_k), k = 1, 2, 3, ..., n}, where n denotes the total number of matched point pairs. Combining this with the fundamental matrix F gives the epipolar line Fp_k = [x, y, z]^T and the distance
D = |q_k^T F p_k| / sqrt(x^2 + y^2)
where p_k, q_k are used in homogeneous form in the formula, Fp_k denotes the epipolar line in epipolar geometry, x, y, z are the vector parameters of the epipolar line, and D denotes the distance from the matched point q_k in the current frame to the epipolar line Fp_k.
Step 3.5: for the point pairs in the set P_match, the feature-point D value is computed in turn; a threshold η is set, and the type of each feature point is judged from its D value and the threshold η: if D is greater than η, the point pair is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
In testing, setting η to 5 was found to give stable results, so η = 5 is preferably used in this solution.
Step 3.6: judge the predicted motion regions extracted in step 1.2 against the dynamic feature point set: when most of the feature points in a predicted motion region (for example, more than 80%) are determined to be dynamic feature points, the region is determined to be a dynamic region, and all feature points marked as lying on the dynamic region are then removed.
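The region-level decision of step 3.6 could then be sketched as below, using the 80% ratio mentioned above; the data layout is an assumption.

```python
# Minimal sketch of step 3.6: mark a predicted motion region as dynamic when more than
# 80% of its feature points were judged dynamic, and drop all of its points in that case.
import numpy as np

def filter_region_points(points, point_is_dynamic, ratio=0.8):
    """points: Nx2 feature points of one predicted motion region; point_is_dynamic: N bools."""
    if len(points) > 0 and point_is_dynamic.mean() > ratio:   # region judged dynamic
        return np.empty((0, 2), dtype=points.dtype)           # remove every point of the region
    return points                                             # otherwise keep the region's points
```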
Step 4: from all feature points (including the feature points corresponding to the static region and the predicted dynamic region), remove the motion feature points to obtain the set P_e, which is used as the tracking data; at this point the set P_e is free of the influence of the dynamic features, and it is then fed into the conventional ORB-SLAM framework for processing and output of the pose result.
Step 4.1: with the set P_e, complete local map creation and pose optimization during tracking.
Step 4.2: perform loop-closure detection.
Step 4.3: output the pose result.
In summary, the method provided by the invention, which extracts dynamic objects, judges them, and removes their dynamic influence based on the combination of semantic segmentation and an optical flow method, first uses a semantic segmentation network to segment the potential dynamic objects effectively, then uses a sparse optical flow method to complete stable feature tracking, and then judges and removes the dynamic feature points through the epipolar constraint on matching point pairs and the difference in their reprojection-error distributions; the static feature points, freed of the dynamic influence, are applied to the subsequent SLAM system, so that trajectory information with higher pose accuracy is finally obtained in a dynamic environment. Compared with conventional methods for handling dynamic environments, this method reliably identifies and eliminates the influence of dynamic-object features and the resulting loss of pose accuracy.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to examples, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solution of the present invention without departing from its spirit and scope, all of which should be covered by the claims of the present invention.

Claims (10)

1. A visual SLAM method based on optical flow and semantic segmentation is characterized by comprising the following steps:
step 1, segmenting input image information by adopting a semantic segmentation network to obtain a static region and a predicted dynamic region;
step 2, carrying out feature tracking on the static area and the prediction dynamic area by adopting a sparse optical flow method;
step 3, judging the types of the feature points in the input image information, and removing the dynamic feature points;
and step 4, taking the set of feature points with the motion feature points removed as tracking data, inputting it into ORB-SLAM for processing, and outputting a pose result.
2. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 1 wherein the input image information in step 1 is one of input data corresponding to a monocular camera, input data corresponding to a binocular camera, input data corresponding to a depth camera;
in step 3, the types of the feature points in the input data corresponding to the monocular camera are judged through the epipolar constraint, and the types of the feature points in the input data corresponding to the binocular camera or the input data corresponding to the depth camera are judged through the reprojection error.
3. The optical flow and semantic segmentation based visual SLAM method as defined in claim 1 wherein step 1 comprises the sub-steps of:
step 1.1, selecting a data set to train a Mask R-CNN network to obtain a trained semantic segmentation network; the data set includes multiple types of data as potential moving objects;
step 1.2, inputting the image information into the trained semantic segmentation network to complete image segmentation and obtain a static region A_s and a predicted motion region A_m.
4. The optical flow and semantic segmentation based visual SLAM method of claim 3 wherein said step 2 comprises the sub-steps of:
step 2.1, performing feature extraction and matching on the static region A_s and the predicted motion region A_m by a sparse optical flow method to obtain a static matching point pair set and a predicted-motion matching point pair set;
and 2.2, solving the pose based on the SLAM running model.
5. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 4, wherein the pose solving based on the SLAM operating model in step 2.2 comprises:
at time k, projecting the j-th landmark y_j into the current frame to obtain the projected position h(ξ_k, y_j), and obtaining the corresponding observation model:
z_{k,j} = h(ξ_k, y_j) + v_{k,j}
where h(·) denotes the nonlinear observation model of the landmark under a known pose transformation, z_{k,j} denotes the pixel coordinates of landmark y_j in the current frame, and v_{k,j} ~ N(0, Q_{k,j}) denotes Gaussian noise with mean 0 and covariance Q_{k,j};
according to the observation model, establishing an error model from the reprojection error formed by the projected position and the corresponding pixel coordinates:
e_{k,j} = z_{k,j} - h(ξ_k, y_j)
where e_{k,j} denotes the difference between the observed position of landmark y_j in the current frame and its projected position, and ξ_k denotes the Lie-algebra form of the pose transformation between the two frames at time k;
converting the error model into a nonlinear least-squares problem, taking all camera poses ξ and landmarks y as the quantity x to be optimized, letting the tracking duration be m and the total number of landmarks be n, and establishing the loss function:
J(x) = (1/2) Σ_{k=1}^{m} Σ_{j=1}^{n} e_{k,j}^T Q_{k,j}^{-1} e_{k,j}
where J(·) denotes the loss function;
and obtaining the optimized pose by solving the loss function.
6. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 5, wherein, in step 3, if the input image information is the input data corresponding to a binocular camera or the input data corresponding to a depth camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:
obtaining, by reprojection, a first offset vector set corresponding to the static matching point pair set; using a weighted-average method, computing from the static region A_s and the first offset vector set the first offset vector weights T_i and the mean Φ_s of the first offset vector weights;
obtaining, by reprojection, a second offset vector set corresponding to the predicted-motion matching point pair set; using the weighted-average method, computing from the predicted motion region A_m and the second offset vector set the second offset vector weights T_j;
judging the type of each feature point in the predicted motion region A_m according to the second offset vector weight T_j and the mean Φ_s of the first offset vector weights;
judging whether the predicted motion region is a dynamic region: if the number of feature points in the predicted motion region judged to be dynamic feature points exceeds a first threshold, marking the predicted motion region as a dynamic region and removing all feature points marked as belonging to the dynamic region.
7. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 6, wherein the first offset vector set corresponding to the static matching point pair set is obtained as follows:
the rotation and translation, in matrix form, of the optimized pose obtained in step 2 are R and t respectively, and the camera intrinsic matrix is K; a matching point pair p_i and q_i(x_i, y_i) of the previous frame and the current frame corresponds to a three-dimensional space point P_i, and projecting P_i into the current frame gives the projected coordinates q̂_i(x̂_i, ŷ_i):
ŝ_i [x̂_i, ŷ_i, 1]^T = K (R P_i + t)
where ŝ_i denotes the depth of P_i in the current camera frame, q̂_i denotes the pixel position of the spatial point P_i projected into the current frame, x_i, y_i denote the pixel coordinates of q_i, and x̂_i, ŷ_i denote the pixel coordinates of q̂_i;
the position offset vector corresponding to a matching point pair of the static region A_s is expressed as
V_i = q̂_i - q_i
and the first set of position offset vectors corresponding to the n matching point pairs is expressed as
V_state = {V_1, V_2, ..., V_n}.
8. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 7, wherein the first offset vector weight T_i and the mean Φ_s of the first offset vector weights are obtained as follows:
for each first offset vector V_i, its angle θ_i and its modulus length l_i are computed and combined into the weight T_i of the offset vector over angle and modulus length, and the mean of the first offset vector weights is
Φ_s = (1/n) Σ_{i=1}^{n} T_i
where T_i denotes the weight of the first offset vector over angle and modulus length, and Φ_s denotes the mean of the first offset vector weights;
the type of each feature point in the predicted motion region A_m is judged as follows:
T_j is compared with Φ_s; if T_j is greater than Φ_s, the feature point is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
9. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 5, wherein, in step 3, if the input image information is the input data corresponding to a monocular camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:
obtaining the fundamental matrix F from the rotation R and translation t corresponding to the optimized pose obtained in step 2:
F = K^{-T} t^ R K^{-1}
where t^ denotes the skew-symmetric matrix of t;
obtaining the epipolar line Fp_k = [x, y, z]^T from the fundamental matrix F and the static matching point pair set and predicted-motion matching point pair set obtained in step 2;
obtaining the feature-point D value from the epipolar line;
presetting a second threshold η and judging the type of each feature point from its D value and the second threshold η;
judging whether the predicted motion region is a dynamic region: if the number of feature points in the predicted motion region judged to be dynamic feature points exceeds the first threshold, marking the predicted motion region as a dynamic region and removing all feature points marked as belonging to the dynamic region.
10. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 9, wherein the feature-point D value is calculated as:
D = |q_k^T F p_k| / sqrt(x^2 + y^2)
where p_k, q_k are used in homogeneous form in the formula, Fp_k denotes the epipolar line in epipolar geometry, x, y, z are the vector parameters of the epipolar line, and D denotes the distance from the matched point q_k in the current frame to the epipolar line Fp_k;
if the feature-point D value is greater than the second threshold η, the point pair of p_k is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
CN202010488128.2A 2020-06-02 2020-06-02 Visual SLAM method based on optical flow and semantic segmentation Pending CN111797688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010488128.2A CN111797688A (en) 2020-06-02 2020-06-02 Visual SLAM method based on optical flow and semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010488128.2A CN111797688A (en) 2020-06-02 2020-06-02 Visual SLAM method based on optical flow and semantic segmentation

Publications (1)

Publication Number Publication Date
CN111797688A true CN111797688A (en) 2020-10-20

Family

ID=72806020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010488128.2A Pending CN111797688A (en) 2020-06-02 2020-06-02 Visual SLAM method based on optical flow and semantic segmentation

Country Status (1)

Country Link
CN (1) CN111797688A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114777A1 (en) * 2017-10-18 2019-04-18 Tata Consultancy Services Limited Systems and methods for edge points based monocular visual slam
CN110125928A (en) * 2019-03-27 2019-08-16 浙江工业大学 A kind of binocular inertial navigation SLAM system carrying out characteristic matching based on before and after frames
CN110706279A (en) * 2019-09-27 2020-01-17 清华大学 Global position and pose estimation method based on information fusion of global map and multiple sensors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUNHAO CH. et al.: "DM-SLAM: A Feature-Based SLAM System for Rigid Dynamic Scenes", International Journal of Geo-Information *
席志红 et al.: "Simultaneous localization and semantic mapping of indoor dynamic scenes based on semantic segmentation", Journal of Computer Applications *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308921A (en) * 2020-11-09 2021-02-02 重庆大学 Semantic and geometric based joint optimization dynamic SLAM method
CN112308921B (en) * 2020-11-09 2024-01-12 重庆大学 Combined optimization dynamic SLAM method based on semantics and geometry
CN112418288A (en) * 2020-11-17 2021-02-26 武汉大学 GMS and motion detection-based dynamic vision SLAM method
CN112418288B (en) * 2020-11-17 2023-02-03 武汉大学 GMS and motion detection-based dynamic vision SLAM method
CN112381841A (en) * 2020-11-27 2021-02-19 广东电网有限责任公司肇庆供电局 Semantic SLAM method based on GMS feature matching in dynamic scene
CN112446885A (en) * 2020-11-27 2021-03-05 广东电网有限责任公司肇庆供电局 SLAM method based on improved semantic optical flow method in dynamic environment
CN113920163A (en) * 2021-10-09 2022-01-11 成都信息工程大学 Moving target detection method based on combination of tradition and deep learning
CN113920163B (en) * 2021-10-09 2024-06-11 成都信息工程大学 Moving target detection method based on combination of traditional and deep learning
WO2023178951A1 (en) * 2022-03-25 2023-09-28 上海商汤智能科技有限公司 Image analysis method and apparatus, model training method and apparatus, and device, medium and program
CN115061770A (en) * 2022-08-10 2022-09-16 荣耀终端有限公司 Method and electronic device for displaying dynamic wallpaper
CN115660944A (en) * 2022-10-27 2023-01-31 深圳市大头兄弟科技有限公司 Dynamic method, device and equipment for static picture and storage medium
CN115660944B (en) * 2022-10-27 2023-06-30 深圳市闪剪智能科技有限公司 Method, device, equipment and storage medium for dynamic state of static picture

Similar Documents

Publication Publication Date Title
US11763485B1 (en) Deep learning based robot target recognition and motion detection method, storage medium and apparatus
CN111797688A (en) Visual SLAM method based on optical flow and semantic segmentation
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
Melekhov et al. Dgc-net: Dense geometric correspondence network
CN108665496B (en) End-to-end semantic instant positioning and mapping method based on deep learning
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN108776989B (en) Low-texture planar scene reconstruction method based on sparse SLAM framework
CN113221647B (en) 6D pose estimation method fusing point cloud local features
CN110853100A (en) Structured scene vision SLAM method based on improved point-line characteristics
Zhao et al. Deep direct visual odometry
CN114937083B (en) Laser SLAM system and method applied to dynamic environment
CN112001859A (en) Method and system for repairing face image
CN112489083A (en) Image feature point tracking matching method based on ORB-SLAM algorithm
CN112308921B (en) Combined optimization dynamic SLAM method based on semantics and geometry
CN114708293A (en) Robot motion estimation method based on deep learning point-line feature and IMU tight coupling
Zhu et al. A review of 6d object pose estimation
CN112686952A (en) Image optical flow computing system, method and application
Yu et al. Accurate and robust visual localization system in large-scale appearance-changing environments
Zhu et al. Fusing panoptic segmentation and geometry information for robust visual slam in dynamic environments
CN114283265A (en) Unsupervised face correcting method based on 3D rotation modeling
Min et al. Coeb-slam: A robust vslam in dynamic environments combined object detection, epipolar geometry constraint, and blur filtering
Geiger Monocular road mosaicing for urban environments
CN114119999B (en) Iterative 6D pose estimation method and device based on deep learning
CN113012298B (en) Curved MARK three-dimensional registration augmented reality method based on region detection
Wang et al. Stream query denoising for vectorized hd map construction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201020)