CN114140527A - Dynamic environment binocular vision SLAM method based on semantic segmentation - Google Patents

Dynamic environment binocular vision SLAM method based on semantic segmentation Download PDF

Info

Publication number
CN114140527A
Authority
CN
China
Prior art keywords
feature points
dynamic
binocular
semantic
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111373890.7A
Other languages
Chinese (zh)
Inventor
沈晔湖
李星
卢金斌
王其聪
赵冲
蒋全胜
朱其新
谢鸥
牛福洲
牛雪梅
付贵忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University of Science and Technology
Original Assignee
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University of Science and Technology filed Critical Suzhou University of Science and Technology
Priority to CN202111373890.7A priority Critical patent/CN114140527A/en
Publication of CN114140527A publication Critical patent/CN114140527A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00
    • G01C 21/38 Electronic maps specially adapted for navigation; Updating thereof
    • G01C 21/3804 Creation or updating of map data
    • G01C 21/3833 Creation or updating of map data characterised by the source of data
    • G01C 21/3841 Data obtained from two or more sources, e.g. probe vehicles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/215 Motion-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T 7/85 Stereo camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/10012 Stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30244 Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Automation & Control Theory (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a dynamic environment binocular vision SLAM method based on semantic segmentation, which comprises the following steps: obtaining semantic masks of objects, wherein the semantic masks are generated by a deep learning network; acquiring multiple consecutive binocular image frames with a binocular camera; extracting feature points in each binocular frame and matching the feature points between adjacent binocular frames; removing the feature points located on the semantic masks and calculating the camera pose from the remaining feature points; separating the dynamic objects and static objects in the binocular images based on the camera pose; recalculating the camera pose based on the separated static objects; and constructing a static map based on the updated camera pose and the feature points on the static objects. Using a binocular camera and guided by semantically segmented images, the method can distinguish dynamic from static objects in the scene and construct a map.

Description

Dynamic environment binocular vision SLAM method based on semantic segmentation
Technical Field
The invention relates to the technical field of visual spatial localization, and in particular to a dynamic environment binocular vision SLAM method based on semantic segmentation.
Background
With the development of computer technology and artificial intelligence, intelligent autonomous mobile robots have become an important research direction and hotspot in the field of robotics. As mobile robots become progressively more intelligent, their requirements for self-localization and environment maps grow ever higher. At present, intelligent mobile robots can already perform self-localization and mapping in known environments in some practical applications, but many challenges remain in unknown environments. The technology for accomplishing localization and mapping in such environments is called SLAM (Simultaneous Localization and Mapping); its goal is to enable a robot to complete self-localization and incremental map construction while moving through an unknown environment.
Traditional SLAM algorithms rely mainly on relatively stable range sensors such as lidar. However, the range data obtained by lidar are very sparse, so the environment map constructed by SLAM contains only a very small number of landmark points. Such a map can only be used to improve the localization accuracy of the robot and cannot be used for other aspects of robot navigation such as path planning. Moreover, the high price, large size, weight, and power consumption of lidar limit its application in certain fields. A camera can, to a certain extent, overcome the disadvantages of lidar in price, size, weight, and power consumption, and can also acquire rich information; however, it has its own problems, such as sensitivity to illumination changes and high computational complexity. Multi-sensor fusion SLAM algorithms have been proposed that can effectively alleviate the problems caused by the deficiencies of a single sensor, but they further increase the cost and the complexity of the algorithm.
Most existing visual SLAM algorithms are based on the static-environment assumption, that is, the scene is static and contains no relatively moving objects. However, in real outdoor scenes, dynamic objects such as pedestrians and vehicles are present in large numbers, which limits the operation of SLAM systems based on this assumption in practical scenes. To address the reduced localization accuracy and stability of visual SLAM in dynamic environments, existing algorithms use methods based on probability statistics or geometric constraints to reduce the influence of dynamic objects on accuracy and stability. For example, when there are only a few dynamic objects in the scene, probabilistic algorithms such as RANSAC (Random Sample Consensus) can be used to reject them. However, when a large number of dynamic objects appear in the scene, such algorithms can no longer distinguish them reliably. Other algorithms use optical flow to distinguish dynamic objects, which does work in scenes with many dynamic objects, but computing dense optical flow is time-consuming and reduces the execution efficiency of the SLAM algorithm.
Therefore, how to provide a dynamic environment binocular vision SLAM method based on semantic segmentation that is simple to operate, low in cost, and applicable to most practical scenes is a technical problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
The invention provides a dynamic environment binocular vision SLAM method based on semantic segmentation, which aims to solve the above technical problem.
In order to solve the technical problem, the invention provides a dynamic environment binocular vision SLAM method based on semantic segmentation, which comprises the following steps:
obtaining a semantic mask of an object, wherein the semantic mask is generated through a deep learning network;
acquiring a plurality of continuous binocular images by using a binocular camera;
extracting feature points on each frame of binocular image, and matching the feature points on the adjacent frames of binocular images;
removing the feature points on the semantic mask, and calculating the pose of the camera according to the remaining feature points;
separating dynamic objects and static objects on the binocular image based on the camera pose;
estimating the motion parameters of the dynamic object based on the separated dynamic object;
recalculating the camera pose based on the separated static object;
and constructing a static map based on the updated camera pose and the feature points on the static object.
Preferably, the deep learning network for generating the semantic Mask is a Mask R-CNN model.
Preferably, the method for extracting the feature points on the binocular images of each frame and matching the feature points on the binocular images of the adjacent frames comprises:
extracting the characteristic points by adopting an ORB method;
obtaining the descriptors of each feature point on each frame of binocular image, calculating the Hamming distance between two descriptors of one feature point on two adjacent frames of binocular images, and forming a group of matched feature points by two feature points with the minimum Hamming distance.
Preferably, the method for determining whether a feature point is located on the semantic mask comprises: the semantic mask at least comprises the bounding box of the object, and if the coordinates of the feature point lie within the bounding box, the feature point is located on the semantic mask.
Preferably, the method for calculating the camera pose according to the remaining feature points includes: and solving the pose of the camera by adopting a PnP algorithm.
Preferably, the method for separating the dynamic objects and the static objects in the binocular images based on the camera pose, and for estimating the motion parameters of the dynamic objects based on the separated dynamic objects, comprises the following steps:
separating the dynamic object: calculating the motion probability of an object corresponding to the semantic mask based on the camera pose and the position relation between the binocular images of the adjacent frames and the semantic mask, and if the motion probability is greater than a first threshold value, judging that the object corresponding to the semantic mask is a dynamic object;
dynamic object matching: for each dynamic object, calculating the Hu moments, the Euclidean distance between the center points, and the histogram distributions of the semantic masks corresponding to the dynamic object in adjacent binocular frames, calculating the matching probability of the dynamic object between the adjacent binocular frames based on the Hu moments, the center-point Euclidean distance, and the histogram distributions, and if the probability is greater than a second threshold value, judging that the two dynamic objects in the adjacent binocular frames are the same object; and
and (3) dynamic object motion estimation: and completing the association of the dynamic object between the continuous frames through the dynamic object matching, and estimating the motion parameters of the dynamic object through a PnP algorithm.
Preferably, the step of separating the dynamic object comprises:
calculating the position of the semantic mask of the previous frame corresponding to the current frame based on the camera pose;
calculating three-dimensional coordinates of all feature points on the semantic mask after projection by using a disparity map, wherein the disparity map is obtained by calculating through the binocular image;
calculating errors of the corresponding feature points of the previous frame and the current frame in the x, y and z directions, wherein the maximum value of the errors is used as an error value of the feature point;
and converting the error value into the motion probability of the object corresponding to the semantic mask where the feature point is located, and judging whether the object corresponding to the semantic mask is a dynamic object or not based on the motion probability.
Preferably, the method for recalculating the camera pose based on the separated static object comprises: and eliminating the feature points on the semantic mask corresponding to the dynamic object, and updating the camera pose by adopting a PnP algorithm according to the remaining feature points.
Preferably, the method for constructing the static map based on the updated camera pose and the feature points located on the static object comprises:
determining a plurality of keyframes based on the updated camera poses and the feature points located on the static object;
matching the feature points on the plurality of key frames, and eliminating unmatched feature points;
checking whether the matched feature points meet epipolar geometric constraint or not, and eliminating the feature points which are not met;
checking whether the remaining feature points have positive depth, sufficient parallax, acceptable reprojection error, and consistent scale, eliminating the feature points that fail these checks, and generating map points based on the remaining feature points;
constructing the static map based on the map points.
Preferably, before the static map is constructed, a step of optimizing the generated map points by bundle adjustment is further included.
Compared with the prior art, the dynamic environment binocular vision SLAM method based on semantic segmentation provided by the invention uses a binocular camera and, guided by semantically segmented images, can distinguish dynamic from static objects in the scene and construct a map.
Drawings
FIG. 1 is a schematic flow chart of a dynamic environment binocular vision SLAM method based on semantic segmentation according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of separating dynamic objects according to an embodiment of the present invention.
Detailed Description
In order to more thoroughly express the technical scheme of the invention, the following specific examples are listed to demonstrate the technical effect; it is emphasized that these examples are intended to illustrate the invention and are not to be construed as limiting the scope of the invention.
The invention provides a dynamic environment binocular vision SLAM method based on semantic segmentation, which comprises the following steps as shown in figure 1:
in the embodiment, the deep learning network used for generating the semantic Mask is a Mask R-CNN model, so that high-quality semantic segmentation is realized.
A binocular camera is used to acquire multiple consecutive binocular image frames, from which the three-dimensional depth information of the two-dimensional image pixels can be recovered. The intrinsic and extrinsic parameters of the binocular camera mainly include the focal length f of the camera, the optical center (u, v) of the camera, and the radial distortion coefficients kc1 and kc2 of the camera lens; these parameters can be obtained by calibration with Zhang Zhengyou's calibration method.
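A minimal calibration sketch, assuming a 9x6 chessboard with 25 mm squares and hypothetical image files; it only illustrates how the focal length, optical center, and radial distortion coefficients mentioned above could be obtained with OpenCV's implementation of Zhang's method:

```python
import cv2
import numpy as np

# Minimal sketch: intrinsic calibration of one camera of the stereo pair with
# Zhang Zhengyou's chessboard method via OpenCV. The 9x6 board, the 25 mm
# square size, and the image file names are illustrative assumptions.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 0.025

obj_pts, img_pts = [], []
for path in ["calib_left_00.png", "calib_left_01.png"]:  # hypothetical files
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# K holds fx, fy and the optical centre (u, v); dist holds kc1, kc2 (and higher terms).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
print("fx, fy =", K[0, 0], K[1, 1], " optical centre =", (K[0, 2], K[1, 2]))
```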
Feature points are extracted from each binocular frame and matched between adjacent binocular frames. The specific method comprises the following steps:
extracting the feature points with the ORB (Oriented FAST and Rotated BRIEF) method;
obtaining the descriptor of each feature point in each binocular frame, calculating the Hamming distance between the descriptors of feature points in two adjacent binocular frames, and taking the two feature points with the minimum Hamming distance as a pair of matched feature points.
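The extraction and matching step could look roughly as follows; the feature count and the cross-check option are illustrative choices, not values specified by the patent:

```python
import cv2

# Minimal sketch: ORB feature extraction on two adjacent frames and descriptor
# matching by Hamming distance. The feature count (2000) and the use of
# cross-checking are illustrative choices.
orb = cv2.ORB_create(nfeatures=2000)

def match_adjacent(img_prev, img_curr):
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    # Brute-force matcher with Hamming norm: each query descriptor is paired
    # with the train descriptor of minimum Hamming distance.
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des1, des2)
    return kp1, kp2, sorted(matches, key=lambda m: m.distance)
```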
The feature points located on the semantic masks are removed, and the camera pose is calculated from the remaining feature points. Whether a feature point is located on a semantic mask is determined as follows: the semantic mask at least comprises the bounding box of the object; if the coordinates of the feature point lie within the bounding box, the feature point is located on the semantic mask, otherwise it is not. The camera pose is calculated from the remaining feature points by solving the PnP (Perspective-n-Point) problem, constructing and optimizing the reprojection error shown in formula (1):

T* = argmin_T (1/2) Σ_{i=1..n} ‖ u_i − (1/s_i) K T P_i ‖²   (1)

where P_i is the three-dimensional coordinate of the i-th feature point, u_i its observed pixel coordinate, s_i its depth, K the camera intrinsic matrix, and T the camera pose. The optimal solution obtained by minimizing the reprojection error is the required camera pose.
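A sketch of the mask filtering and PnP step, under the assumption that matched 3D-2D correspondences and the intrinsic matrix K are already available; the RANSAC variant of PnP is an implementation choice, not mandated by the text:

```python
import cv2
import numpy as np

# Minimal sketch: discard feature points that fall on any semantic mask, then
# solve the camera pose from the remaining 3D-2D correspondences with PnP.
# `pts3d_prev` (from the previous frame's disparity) and `pts2d_curr` are
# assumed to be aligned arrays of matched points; `masks` is an iterable of
# (binary mask array, class id) pairs; K is the intrinsic matrix.
def pose_from_static_points(pts3d_prev, pts2d_curr, masks, K):
    keep = []
    for i, (u, v) in enumerate(pts2d_curr):
        on_mask = any(m[int(v), int(u)] for m, _ in masks)  # point inside a mask?
        if not on_mask:
            keep.append(i)
    p3 = np.asarray(pts3d_prev, np.float32)[keep]
    p2 = np.asarray(pts2d_curr, np.float32)[keep]
    # PnP with RANSAC minimises the reprojection error of formula (1).
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(p3, p2, K, None)
    return ok, rvec, tvec
```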
Separating the dynamic object and the static object on the binocular image based on the camera pose, and the specific method comprises the following steps:
separating the dynamic object: and calculating the motion probability of the object corresponding to the semantic mask based on the camera pose and the position relation between the binocular images of the adjacent frames and the semantic mask, and if the motion probability is greater than a first threshold value, judging that the object corresponding to the semantic mask is a dynamic object. The specific steps are shown in fig. 2, and include:
calculating the position of the semantic mask of the previous frame corresponding to the current frame based on the camera pose;
calculating the three-dimensional coordinates of all feature points on the semantic mask after projection by using a disparity map, wherein the disparity map is computed from the binocular image; specifically, the disparity map can be calculated with the ELAS (Efficient Large-scale Stereo) algorithm;
calculating errors of the corresponding feature points of the previous frame and the current frame in the x, y and z directions, wherein the maximum value of the errors is used as an error value of the feature point;
and converting the error value into the motion probability of the object corresponding to the semantic mask where the feature point is located, and judging whether the object corresponding to the semantic mask is a dynamic object or not based on the motion probability.
According to the camera imaging model, the conversion between the three-dimensional coordinate system and the pixel (two-dimensional) coordinate system, and between depth and disparity, is:

u = fx·X/Z + cx,  v = fy·Y/Z + cy   (2)

Z = f·b/d   (3)

where (u, v) is the pixel coordinate, (cx, cy) the optical center, fx and fy (f) the focal length, b the baseline of the binocular camera, and d the disparity.

Denote the set of pixel coordinates of the j-th semantic mask of frame t−1 as M_{t−1}^j. Through formulas (2) and (3), the corresponding set of three-dimensional coordinates P_{t−1}^j of the semantic mask is obtained. The set of three-dimensional points after the camera motion is then obtained by formula (4):

P̂_t^j = T_{t,t−1} · P_{t−1}^j   (4)

where T_{t,t−1} is the camera pose estimated above. P̂_t^j is converted into the pixel-coordinate set M̂_t^j by formula (2); then, using M̂_t^j together with the disparity map of frame t, the actually observed three-dimensional point set P_t^j is obtained through formulas (2) and (3). Denote p̂_i as the i-th point of P̂_t^j and p_i as the i-th point of P_t^j. The error δ_i between the two points is calculated as:

δ_i = max( |x̂_i − x_i|, |ŷ_i − y_i|, |ẑ_i − z_i| )   (5)

The error Δ_j of the object corresponding to these feature points is then obtained by aggregating the per-point errors δ_i over the mask (formula (6)), and Δ_j is mapped to the motion probability S(Δ_j) (formula (7)); if S(Δ_j) exceeds the first threshold, the object corresponding to the semantic mask is judged to be dynamic.
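The dynamic check of formulas (2)-(5) could be sketched as follows; since formulas (6) and (7) are not reproduced above, the mean aggregation and the logistic mapping used here are stand-in assumptions:

```python
import numpy as np

# Minimal sketch of the dynamic check in formulas (2)-(5). fx, fy, cx, cy and
# the baseline b are the calibrated parameters. The mean aggregation and the
# logistic mapping below stand in for formulas (6) and (7), whose exact form
# is not reproduced in this text; they are assumptions.
def backproject(u, v, disp, fx, fy, cx, cy, b):
    z = fx * b / disp                                         # formula (3)
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])  # formula (2)

def motion_probability(mask_pts_prev, disp_prev, disp_curr, T_curr_prev,
                       fx, fy, cx, cy, b, k=5.0):
    deltas = []
    for (u, v) in mask_pts_prev:
        d1 = disp_prev[v, u]
        if d1 <= 0:
            continue
        P_prev = backproject(u, v, d1, fx, fy, cx, cy, b)
        P_pred = T_curr_prev[:3, :3] @ P_prev + T_curr_prev[:3, 3]   # formula (4)
        u2 = int(round(fx * P_pred[0] / P_pred[2] + cx))             # formula (2)
        v2 = int(round(fy * P_pred[1] / P_pred[2] + cy))
        if not (0 <= v2 < disp_curr.shape[0] and 0 <= u2 < disp_curr.shape[1]):
            continue
        d2 = disp_curr[v2, u2]
        if d2 <= 0:
            continue
        P_obs = backproject(u2, v2, d2, fx, fy, cx, cy, b)
        deltas.append(np.max(np.abs(P_pred - P_obs)))                # formula (5)
    delta_obj = float(np.mean(deltas)) if deltas else 0.0            # stand-in for (6)
    # Stand-in for (7): maps zero error to probability 0, large error towards 1.
    return 2.0 / (1.0 + np.exp(-k * delta_obj)) - 1.0
```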
Dynamic object matching: for each dynamic object, the Hu moments (i.e., image moments), the Euclidean distance between the center points, and the histogram distributions of the semantic masks corresponding to the dynamic object in adjacent binocular frames are calculated; the matching probability of the dynamic object between the adjacent frames is computed from these three quantities, and if the probability is greater than a second threshold, the two dynamic objects in the adjacent frames are judged to be the same object. Specifically, the Hu moments of an image are image features invariant to translation, rotation, and scale.
The raw moment of an image is calculated as:

m_pq = Σ_x Σ_y x^p · y^q · I(x, y)   (8)

Calculating the Hu moments requires the central moments; first the centroid coordinates are calculated:

x̄ = m10 / m00,  ȳ = m01 / m00   (9)

The central moments are then constructed:

μ_pq = Σ_x Σ_y (x − x̄)^p · (y − ȳ)^q · I(x, y)   (10)

The central moments are then normalized:

η_pq = μ_pq / μ00^(1 + (p+q)/2)   (11)

The Hu moments are constructed from the normalized central moments and consist of 7 invariant moments, with the specific formulas:

Φ1 = η20 + η02
Φ2 = (η20 − η02)² + 4η11²
Φ3 = (η30 − 3η12)² + (3η21 − η03)²
Φ4 = (η30 + η12)² + (η21 + η03)²
Φ5 = (η30 − 3η12)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] + (3η21 − η03)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]
Φ6 = (η20 − η02)[(η30 + η12)² − (η21 + η03)²] + 4η11(η30 + η12)(η21 + η03)
Φ7 = (3η21 − η03)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] − (η30 − 3η12)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]   (12)

Denote Φ_{t−1}^j as the Hu-moment vector of the j-th semantic mask of frame t−1; the distance between the Hu moments of two semantic masks in adjacent frames is then computed. The center position of each semantic mask is calculated, and the Euclidean distance between the center positions of corresponding masks in the two frames is computed. The histogram distribution of each semantic mask is calculated and normalized, and the KL divergence (also called relative entropy) between the histograms of the semantic masks of the two frames is computed:

D_KL(P‖Q) = Σ_i P(i) · log( P(i) / Q(i) )

Finally, the matching probability is estimated by combining the Hu-moment distance, the center-point Euclidean distance, and the histogram divergence.
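A sketch of the three matching cues (Hu moments, center distance, histogram KL divergence), computed on grey-level images; how the three cues are fused into the final matching probability is an assumption, since the exact combination formula is not reproduced above:

```python
import cv2
import numpy as np

# Minimal sketch of the three cues used to match masks across frames: Hu-moment
# distance, centre-point Euclidean distance, and KL divergence of the grey-level
# histograms inside the masks. The final fusion into a single score is an
# illustrative assumption, not the patent's exact combination formula.
def match_score(mask_a, mask_b, img_a, img_b):
    hu_a = cv2.HuMoments(cv2.moments(mask_a.astype(np.uint8))).flatten()
    hu_b = cv2.HuMoments(cv2.moments(mask_b.astype(np.uint8))).flatten()
    # Compare Hu moments on a log scale, which is common practice.
    d_hu = np.linalg.norm(np.sign(hu_a) * np.log1p(np.abs(hu_a)) -
                          np.sign(hu_b) * np.log1p(np.abs(hu_b)))

    ca = np.mean(np.argwhere(mask_a), axis=0)      # mask centre (row, col)
    cb = np.mean(np.argwhere(mask_b), axis=0)
    d_center = np.linalg.norm(ca - cb)

    ha = cv2.calcHist([img_a], [0], mask_a.astype(np.uint8), [32], [0, 256]).flatten()
    hb = cv2.calcHist([img_b], [0], mask_b.astype(np.uint8), [32], [0, 256]).flatten()
    ha = ha / (ha.sum() + 1e-9) + 1e-9
    hb = hb / (hb.sum() + 1e-9) + 1e-9
    d_kl = float(np.sum(ha * np.log(ha / hb)))     # KL divergence of histograms

    # Illustrative fusion: smaller distances -> higher matching probability.
    return np.exp(-(d_hu + 0.01 * d_center + d_kl))
```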
the method for estimating the motion parameters of the dynamic object based on the separated dynamic object comprises the following steps: and (3) dynamic object motion estimation: and completing the association of the dynamic object between the continuous frames through the dynamic object matching, and estimating the motion parameters of the dynamic object through a PnP algorithm.
The camera pose is then recalculated based on the separated static objects. Specifically, the feature points on the semantic masks corresponding to dynamic objects are removed, and the camera pose is updated with the PnP algorithm using the remaining feature points; the calculation follows the method used for the first pose estimation.
Constructing a static map based on the updated camera pose and the feature points on the static object, wherein the specific method comprises the following steps:
determining a plurality of keyframes based on the updated camera poses and the feature points located on the static object;
matching the feature points across the plurality of key frames and triangulating the matched feature points; attempting to match the remaining unmatched feature points in other key frames until all possible matches are found, and rejecting the feature points that remain unmatched;
checking whether the matched feature points satisfy the epipolar geometric constraint, and eliminating those that do not;
checking whether the remaining feature points have positive depth, sufficient parallax, acceptable reprojection error, and consistent scale, eliminating the feature points that fail these checks, and generating map points based on the remaining feature points;
constructing the static map based on the map points.
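A sketch of the map-point creation step for one keyframe pair, assuming calibrated poses and matched pixel coordinates are available; the 2-pixel reprojection threshold is illustrative, and the parallax and scale-consistency checks described above are omitted for brevity:

```python
import cv2
import numpy as np

# Minimal sketch: triangulate matched feature points between two keyframes and
# keep only candidates with positive depth and small reprojection error. The
# 2-pixel error threshold is an illustrative assumption.
def make_map_points(K, pose1, pose2, pts1, pts2, max_err=2.0):
    P1 = K @ pose1[:3, :]                                  # 3x4 projection matrices
    P2 = K @ pose2[:3, :]
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)    # homogeneous 4xN
    X = (X_h[:3] / X_h[3]).T
    keep = []
    for i, Xw in enumerate(X):
        for P, pt in ((P1, pts1[i]), (P2, pts2[i])):
            x = P @ np.append(Xw, 1.0)
            if x[2] <= 0:                                  # positive-depth check
                break
            err = np.linalg.norm(x[:2] / x[2] - pt)
            if err > max_err:                              # reprojection-error check
                break
        else:
            keep.append(Xw)
    return np.array(keep)
```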
Preferably, before the static map is constructed, a step of optimizing the generated map points by Bundle Adjustment (BA) is further included.
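A minimal bundle-adjustment sketch, treating BA as a nonlinear least-squares problem over poses and map points; the variable packing and the use of SciPy's solver are illustrative assumptions, not the patent's specific implementation:

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

# Minimal sketch of bundle adjustment: jointly refine camera poses and map
# points by minimising the total reprojection error with a nonlinear
# least-squares solver. Each camera is packed as rvec (3) + tvec (3).
def ba_residuals(params, n_cams, n_pts, K, cam_idx, pt_idx, obs_uv):
    cams = params[:n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    res = []
    for c, p, uv in zip(cam_idx, pt_idx, obs_uv):
        R, _ = cv2.Rodrigues(cams[c, :3])
        Xc = R @ pts[p] + cams[c, 3:]
        proj = K @ Xc
        res.extend(proj[:2] / proj[2] - uv)     # 2D reprojection residual
    return np.asarray(res)

def bundle_adjust(cams0, pts0, K, cam_idx, pt_idx, obs_uv):
    x0 = np.hstack([cams0.ravel(), pts0.ravel()])
    sol = least_squares(ba_residuals, x0, method="trf",
                        args=(len(cams0), len(pts0), K, cam_idx, pt_idx, obs_uv))
    return sol.x
```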
By processing the binocular images in this way, the dynamic objects present in the binocular images are identified, the camera pose and the poses of the dynamic objects are estimated, and an environment map is constructed, meeting the mobile robot's requirements for a three-dimensional map.
In summary, the semantic segmentation based binocular vision SLAM method for dynamic environments provided by the invention comprises the following steps: obtaining semantic masks of objects, wherein the semantic masks are generated by a deep learning network; acquiring multiple consecutive binocular image frames with a binocular camera; extracting feature points in each binocular frame and matching the feature points between adjacent binocular frames; removing the feature points located on the semantic masks and calculating the camera pose from the remaining feature points; separating the dynamic objects and static objects in the binocular images based on the camera pose; estimating the motion parameters of the dynamic objects based on the separated dynamic objects; recalculating the camera pose based on the separated static objects; and constructing a static map based on the updated camera pose and the feature points on the static objects. Using a binocular camera and guided by semantically segmented images, the method can distinguish dynamic from static objects in the scene and construct a map.
It will be apparent to those skilled in the art that various changes and modifications may be made in the invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A dynamic environment binocular vision SLAM method based on semantic segmentation is characterized by comprising the following steps:
obtaining a semantic mask of an object, wherein the semantic mask is generated through a deep learning network;
acquiring a plurality of continuous binocular images by using a binocular camera;
extracting feature points on each frame of binocular image, and matching the feature points on the adjacent frames of binocular images;
removing the feature points on the semantic mask, and calculating the pose of the camera according to the remaining feature points;
separating dynamic objects and static objects on the binocular image based on the camera pose;
estimating the motion parameters of the dynamic object based on the separated dynamic object;
recalculating the camera pose based on the separated static object;
and constructing a static map based on the updated camera pose and the feature points on the static object.
2. The semantic segmentation based dynamic environment binocular vision SLAM method of claim 1, wherein the deep learning network used to generate the semantic Mask is a Mask R-CNN model.
3. The semantic segmentation based dynamic environment binocular vision SLAM method of claim 1, wherein the method of extracting feature points on the binocular images of each frame, matching feature points on binocular images of adjacent frames comprises:
extracting the characteristic points by adopting an ORB method;
obtaining the descriptors of each feature point on each frame of binocular image, calculating the Hamming distance between two descriptors of one feature point on two adjacent frames of binocular images, and forming a group of matched feature points by two feature points with the minimum Hamming distance.
4. The semantic segmentation based dynamic environment binocular vision SLAM method of claim 1, wherein the method of determining whether a feature point is located on the semantic mask comprises: the semantic mask at least comprises the bounding box of the object, and if the coordinates of the feature point lie within the bounding box, the feature point is located on the semantic mask.
5. The semantic segmentation based dynamic environment binocular vision SLAM method of claim 1, wherein the method of calculating camera pose from remaining feature points comprises: and solving the pose of the camera by adopting a PnP algorithm.
6. The semantic segmentation based dynamic environment binocular vision SLAM method of claim 1, wherein the separation of dynamic objects and static objects on the binocular images based on the camera pose; the method for estimating the motion parameters of the dynamic object based on the separated dynamic object comprises the following steps:
separating the dynamic object: calculating the motion probability of an object corresponding to the semantic mask based on the camera pose and the position relation between the binocular images of the adjacent frames and the semantic mask, and if the motion probability is greater than a first threshold value, judging that the object corresponding to the semantic mask is a dynamic object;
dynamic object matching: for each dynamic object, calculating the Hu moments, the Euclidean distance between the center points, and the histogram distributions of the semantic masks corresponding to the dynamic object in adjacent binocular frames, calculating the matching probability of the dynamic object between the adjacent binocular frames based on the Hu moments, the center-point Euclidean distance, and the histogram distributions, and if the probability is greater than a second threshold value, judging that the two dynamic objects in the adjacent binocular frames are the same object; and
and (3) dynamic object motion estimation: and completing the association of the dynamic object between the continuous frames through the dynamic object matching, and estimating the motion parameters of the dynamic object through a PnP algorithm.
7. The semantic segmentation based dynamic ambient binocular vision SLAM method of claim 6 wherein the step of separating dynamic objects comprises:
calculating the position of the semantic mask of the previous frame corresponding to the current frame based on the camera pose;
calculating three-dimensional coordinates of all feature points on the semantic mask after projection by using a disparity map, wherein the disparity map is obtained by calculating through the binocular image;
calculating errors of the corresponding feature points of the previous frame and the current frame in the x, y and z directions, wherein the maximum value of the errors is used as an error value of the feature point;
and converting the error value into the motion probability of the object corresponding to the semantic mask where the feature point is located, and judging whether the object corresponding to the semantic mask is a dynamic object or not based on the motion probability.
8. The semantic segmentation based dynamic environment binocular vision SLAM method of claim 1, wherein the method of recalculating camera pose based on separated static objects comprises: and eliminating the feature points on the semantic mask corresponding to the dynamic object, and updating the camera pose by adopting a PnP algorithm according to the remaining feature points.
9. The semantic segmentation based dynamic environment binocular vision SLAM method of claim 1, wherein the method of constructing a static map based on updated camera poses and feature points located on the static object comprises:
determining a plurality of keyframes based on the updated camera poses and the feature points located on the static object;
matching the feature points on the plurality of key frames, and eliminating unmatched feature points;
checking whether the matched feature points meet epipolar geometric constraint or not, and eliminating the feature points which are not met;
checking whether the remaining feature points have positive depth, sufficient parallax, acceptable reprojection error, and consistent scale, eliminating the feature points that fail these checks, and generating map points based on the remaining feature points;
constructing the static map based on the map points.
10. The semantic segmentation based dynamic ambient binocular vision SLAM method of claim 9 further comprising the step of optimizing the generated map points by bundle adjustment prior to constructing the static map.
CN202111373890.7A 2021-11-19 2021-11-19 Dynamic environment binocular vision SLAM method based on semantic segmentation Pending CN114140527A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111373890.7A CN114140527A (en) 2021-11-19 2021-11-19 Dynamic environment binocular vision SLAM method based on semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111373890.7A CN114140527A (en) 2021-11-19 2021-11-19 Dynamic environment binocular vision SLAM method based on semantic segmentation

Publications (1)

Publication Number Publication Date
CN114140527A true CN114140527A (en) 2022-03-04

Family

ID=80390414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111373890.7A Pending CN114140527A (en) 2021-11-19 2021-11-19 Dynamic environment binocular vision SLAM method based on semantic segmentation

Country Status (1)

Country Link
CN (1) CN114140527A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524026A (en) * 2023-05-08 2023-08-01 哈尔滨理工大学 Dynamic vision SLAM method based on frequency domain and semantics
CN116524026B (en) * 2023-05-08 2023-10-27 哈尔滨理工大学 Dynamic vision SLAM method based on frequency domain and semantics
CN116958265A (en) * 2023-09-19 2023-10-27 交通运输部天津水运工程科学研究所 Ship pose measurement method and system based on binocular vision

Similar Documents

Publication Publication Date Title
CN109345588B (en) Tag-based six-degree-of-freedom attitude estimation method
CN111462135B (en) Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation
WO2021233029A1 (en) Simultaneous localization and mapping method, device, system and storage medium
CN111201451B (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN110335319B (en) Semantic-driven camera positioning and map reconstruction method and system
CN110827395B (en) Instant positioning and map construction method suitable for dynamic environment
CN110322511B (en) Semantic SLAM method and system based on object and plane features
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN112396595B (en) Semantic SLAM method based on point-line characteristics in dynamic environment
CN108537844B (en) Visual SLAM loop detection method fusing geometric information
CN111882602B (en) Visual odometer implementation method based on ORB feature points and GMS matching filter
CN112435262A (en) Dynamic environment information detection method based on semantic segmentation network and multi-view geometry
CN110070578B (en) Loop detection method
CN114140527A (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
CN112419497A (en) Monocular vision-based SLAM method combining feature method and direct method
WO2021114776A1 (en) Object detection method, object detection device, terminal device, and medium
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
CN111899345B (en) Three-dimensional reconstruction method based on 2D visual image
CN111998862A (en) Dense binocular SLAM method based on BNN
CN115410167A (en) Target detection and semantic segmentation method, device, equipment and storage medium
CN114088081A (en) Map construction method for accurate positioning based on multi-segment joint optimization
CN112634305B (en) Infrared visual odometer implementation method based on edge feature matching
Shi et al. Dense semantic 3D map based long-term visual localization with hybrid features
CN116468786A (en) Semantic SLAM method based on point-line combination and oriented to dynamic environment
CN114648639B (en) Target vehicle detection method, system and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination