CN111311708B - Visual SLAM method based on semantic optical flow and inverse depth filtering - Google Patents

Visual SLAM method based on semantic optical flow and inverse depth filtering

Info

Publication number
CN111311708B
Authority
CN
China
Prior art keywords
map
semantic
inverse depth
point
dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010065930.0A
Other languages
Chinese (zh)
Other versions
CN111311708A (en)
Inventor
崔林艳
马朝伟
郭政航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010065930.0A
Publication of CN111311708A
Application granted
Publication of CN111311708B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/20 Drawing from basic elements, e.g. lines or circles
    • G06T11/206 Drawing of charts or graphs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion

Abstract

The invention relates to a visual SLAM method based on semantic optical flow and inverse depth filtering, which comprises the following steps: (1) a vision sensor collects images, and feature extraction and semantic segmentation are performed on the collected images to obtain feature points and semantic segmentation results; (2) the map is initialized by a semantic optical flow method according to the feature points and the segmentation results, dynamic feature points are removed, and a reliable initial map is created; (3) an inverse depth filter evaluates whether the 3D map points in the map are dynamic points, and the map is expanded according to the evaluation result of the inverse depth filter; (4) tracking, local mapping and loop detection are carried out continuously in sequence on the map expanded by the inverse depth filter, finally realizing a visual SLAM for dynamic scenes based on semantic optical flow and inverse depth filtering.

Description

Visual SLAM method based on semantic optical flow and inverse depth filtering
Technical Field
The invention relates to a visual SLAM method based on semantic optical flow and inverse depth filtering. It is a new visual SLAM method that combines semantic optical flow with inverse depth filtering, and it addresses problems of traditional visual SLAM systems such as failure in highly dynamic scenes and a lack of scene understanding.
Background
Simultaneous localization and mapping (SLAM) means estimating the pose of a robot from acquired sensor data, with no prior information about the environment, while simultaneously constructing a globally consistent map of that environment. A SLAM system based on a visual sensor is called visual SLAM; its advantages include low hardware cost, high positioning accuracy, and fully autonomous positioning and navigation, so the technology has received wide attention in fields such as artificial intelligence and virtual reality, and many excellent visual SLAM systems, such as RTAB-MAP, DVO-SLAM and ORB-SLAM2, have emerged.
Traditional visual SLAM systems usually assume that the environment is static and have difficulty coping with situations common in daily life, such as long durations, large spatial scales and highly dynamic scenes. In a highly dynamic scene in particular, visual SLAM based on the static-world assumption can neither recognize that the scene is dynamic nor distinguish the dynamic objects in it, so the accuracy of the SLAM system drops sharply in a dynamic environment, and in severe cases the whole system fails, which hinders the wide application of visual SLAM in daily life. How to improve the accuracy and stability of a visual SLAM system in dynamic scenes and enhance its understanding of the surrounding environment is therefore very important, and has become an urgent problem in the visual SLAM field.
In recent years, with progress in deep learning algorithms and improvements in computing power, computers have become increasingly capable of image tasks such as image classification and semantic segmentation. Combining traditional visual SLAM with deep-learning-based semantic segmentation can greatly improve the robustness and practicality of a SLAM system. SLAM algorithms that incorporate semantic information are generally called semantic SLAM, an emerging research field; there is as yet no mature, agreed-upon scheme for how the semantic information should be used. The current difficulties are: (1) how to ensure the accuracy and stability of a semantic visual SLAM system in highly dynamic scenes; (2) how to strengthen the system's ability to cope with highly dynamic scenes while keeping good performance in static scenes.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a visual SLAM method based on semantic optical flow and inverse depth filtering that improves the SLAM system's ability to cope with dynamic scenes, its understanding of the scene, and its positioning accuracy in dynamic scenes.
The technical scheme of the invention is a visual SLAM method based on semantic optical flow and inverse depth filtering, which comprises the following steps:
step (1): a vision sensor collects images, and feature extraction and semantic segmentation are performed on the collected images to obtain the extracted feature points and the semantic segmentation results;
step (2): map initialization is carried out with the semantic optical flow method according to the feature points and the segmentation results, dynamic feature points are removed, and a reliable initial map is created;
step (3): an inverse depth filter evaluates whether the 3D map points in the initial map are dynamic points, and the map is expanded according to the evaluation result of the inverse depth filter;
step (4): tracking, local mapping and loop detection are carried out continuously in sequence on the map expanded by the inverse depth filter, an accurate map in the dynamic scene is thereby constructed, and the dynamic-scene-oriented visual SLAM based on semantic optical flow and inverse depth filtering is finally realized.
Further, in the step (1), the image feature extraction and semantic segmentation method includes:
after the image data acquired by the sensor is obtained, image feature points are extracted, and semantic segmentation is performed on the RGB image of the current frame with a SegNet semantic segmentation network; the feature points are divided into three classes, static, potentially dynamic and dynamic, according to the semantic information; the SegNet comprises an encoder network and a decoder network: the input image is first fed to the encoder network, each encoder generates a series of feature maps through convolution followed by batch normalization and ReLU activation, and each decoder in the decoder network then up-samples its input feature map using the max-pooling indices stored for the corresponding encoder feature map, generating a sparse feature map; the sparse feature maps are then passed through trainable convolution modules to generate dense feature maps; and the high-dimensional feature representation output by the last decoder of the decoder network is fed to the softmax classifier, which generates a semantic label for each pixel, completing the semantic segmentation of the image.
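For concreteness, the following minimal PyTorch sketch illustrates one encoder/decoder stage of this kind: the encoder saves its max-pooling indices, the decoder reuses them for sparse up-sampling, a trainable convolution densifies the result, and a per-pixel softmax produces the semantic labels. The layer widths, the single-stage depth and the three-class head (static / potentially dynamic / dynamic) are illustrative assumptions, not the exact network trained for this method.

```python
# Minimal SegNet-style stage (sketch; sizes are illustrative assumptions).
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),   # batch normalization
            nn.ReLU(inplace=True),   # ReLU activation
        )
        # return_indices=True stores the max-pooling index map for the decoder
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        x = self.conv(x)
        x, indices = self.pool(x)
        return x, indices

class DecoderStage(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # MaxUnpool2d places values back at the stored indices -> sparse map
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        # trainable convolution densifies the sparse feature map
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, indices):
        return self.conv(self.unpool(x, indices))

class TinySegNet(nn.Module):
    def __init__(self, num_classes=3):  # static / potentially dynamic / dynamic
        super().__init__()
        self.enc = EncoderStage(3, 64)
        self.dec = DecoderStage(64, 64)
        self.head = nn.Conv2d(64, num_classes, 1)  # per-pixel logits

    def forward(self, x):
        f, idx = self.enc(x)
        f = self.dec(f, idx)
        return self.head(f).softmax(dim=1)  # per-pixel class probabilities

labels = TinySegNet()(torch.randn(1, 3, 64, 64)).argmax(dim=1)  # label map
```

The real SegNet stacks five such encoder stages (mirroring VGG16) and five matching decoders; reusing the pooling indices is what lets the decoder up-sample without learned transposed convolutions.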
Further, in the step (2), map initialization is performed by a semantic optical flow method and a reliable initial map is created; the method comprises:
firstly, on the basis of the feature points of the acquired image having been divided into the three classes static, potentially dynamic and dynamic by the semantic segmentation method, sparse optical flow is computed for the semantically static feature points of the current frame using the image data of the current frame and of the previous frame; subsequently, the fundamental matrix F, which is the key to the epipolar geometric constraint, is computed; finally, the motion character of the static, potentially dynamic and dynamic feature points is judged again according to the epipolar constraint, and the judgment is checked against the fundamental matrix F just computed: a threshold of one pixel is set, and if the distance from a feature point in the current frame image to its corresponding epipolar line exceeds this threshold, the feature point is judged to be a genuinely dynamic feature point, whereby a reliable initial map is obtained.
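As an illustration of the epipolar check just described, here is a short sketch using standard OpenCV calls (cv2.calcOpticalFlowPyrLK, cv2.findFundamentalMat, cv2.computeCorrespondEpilines); the one-pixel threshold follows the text above, while the function boundary, the RANSAC settings and the point bookkeeping are illustrative assumptions.

```python
# Sketch of the semantic-optical-flow epipolar check (assumptions noted above).
import cv2
import numpy as np

def dynamic_point_mask(prev_gray, cur_gray, static_pts, thresh_px=1.0):
    """static_pts: Nx1x2 float32 pixel coordinates of semantically static
    feature points in prev_gray. Returns True where a tracked point is
    judged to be a genuinely dynamic feature point."""
    # 1. Sparse LK optical flow from the previous frame to the current frame.
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray,
                                                  static_pts, None)
    ok = status.ravel() == 1
    p0 = static_pts[ok].reshape(-1, 2)
    p1 = cur_pts[ok].reshape(-1, 2)

    # 2. Fundamental matrix F from the tracked points (RANSAC; in practice,
    #    check that F is not None before using it).
    F, _ = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, 1.0, 0.99)

    # 3. Epipolar lines in the current image for the previous-frame points.
    lines = cv2.computeCorrespondEpilines(p0.reshape(-1, 1, 2), 1, F)
    a, b, c = lines.reshape(-1, 3).T  # each line: ax + by + c = 0

    # 4. Point-to-epipolar-line distance; beyond the threshold => dynamic.
    dist = np.abs(a * p1[:, 0] + b * p1[:, 1] + c) / np.sqrt(a**2 + b**2)
    return dist > thresh_px
```

This check is what re-examines the potentially dynamic points: a person sitting still, for instance, passes the epipolar test even though its semantic class is potentially dynamic.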
Further, in the step (3), an inverse depth filter is adopted to evaluate the 3D map points in the initial map and to expand the map; the method comprises:
a depth filter based on the Gaussian-uniform mixture distribution assumption is applied to SLAM: first, the observations of the inverse depth of a map point are modeled with a mixture of a Gaussian distribution and a uniform distribution:
p(x|Z,π) = π·N(x|Z,τ²) + (1−π)·U(x|Z_min, Z_max)
the meaning of the respective quantities in the above formula is:
x is an observation of the inverse depth of the map point and is a random variable; Z is the true inverse depth of the map point, the value to be estimated; π is the probability that the map point is an inlier, referred to as the inlier rate for short, where an inlier is a static map point in the map whose depth is obtained by triangulating a correct feature match; p(x|Z,π) is the distribution of the inverse depth observations of the map point; N(x|Z,τ²) is a Gaussian distribution with the true inverse depth Z of the map point as its mean and τ² as its variance; U(x|Z_min,Z_max) is a uniform distribution whose lower and upper bounds Z_min and Z_max are the minimum and maximum inverse depths;
The posterior probability distribution of (Z,π) at the current time is calculated as:
p(Z,π|x_1,…,x_n) ∝ p(Z,π|x_1,…,x_{n−1})·p(x_n|Z,π)
wherein x_1,…,x_n are a series of mutually independent observations of the inverse depth of the map point, n being the index of the observation; p(Z,π|x_1,…,x_n) is the posterior probability distribution of (Z,π) at the current time, p(Z,π|x_1,…,x_{n−1}) is the posterior probability distribution of (Z,π) at the previous time, and p(x_n|Z,π) is the likelihood of the depth measurement at the current time; to estimate the parameters Z and π and to simplify the computation, p(Z,π|x_1,…,x_n) is approximated by a distribution of Gaussian-Beta form:
q(Z,π|a,b,μ,σ²) = N(Z|μ,σ²)·Beta(π|a,b)
wherein q(Z,π|a,b,μ,σ²) denotes that (Z,π) follows a Gaussian-Beta distribution with parameters (a,b,μ,σ²), N(Z|μ,σ²) is a Gaussian distribution, and Beta(π|a,b) is a Beta distribution. The Gaussian-Beta distribution has 4 parameters (a,b,μ,σ²), wherein a and b are the two positive parameters of a Beta distribution in probability theory, and μ and σ² are the expectation and variance of the Gaussian distribution; each time a new inverse depth observation is obtained, these 4 parameters are updated to give a new Gaussian-Beta distribution. First, the un-normalized posterior

q(Z,π|a_{n−1},b_{n−1},μ_{n−1},σ²_{n−1})·p(x_n|Z,π)

is used to find the first and second moments of Z and π, where q(Z,π|a_{n−1},b_{n−1},μ_{n−1},σ²_{n−1}) indicates that (Z,π) follows the Gaussian-Beta distribution with parameters (a_{n−1},b_{n−1},μ_{n−1},σ²_{n−1}) at this time; then, by the moment-matching method, the first and second moments of Z and π under p(Z,π|x_1,…,x_n) and under q(Z,π|a,b,μ,σ²) are computed and compared, yielding the new parameters

(a_n, b_n, μ_n, σ²_n).

When σ_n is smaller than a set threshold, the inverse depth of the map point is considered to have converged. The first moment of the inlier rate π can be used as an estimate of π:

E[π] = a_n / (a_n + b_n)

When the inverse depth of a map point has converged but the inlier rate π is lower than a set threshold, the map point is still considered a dynamic point and is removed; only when the inverse depth of the map point has converged and the inlier rate π is higher than the set threshold is the map point considered a reliable static map point, and the previously obtained reliable initial map is updated with these reliable static map points.
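A minimal sketch of one such update is given below. It implements the standard closed-form moment-matching step for the Gaussian-Beta approximation (the Vogiatzis-Hernández parametric filter also used in SVO-style depth filters), which matches the model above; the variable names and the two thresholds in is_reliable_static() are illustrative assumptions.

```python
# Sketch: one Gaussian-Beta inverse depth filter update by moment matching.
import math

def update_seed(x, tau2, mu, sigma2, a, b, z_min, z_max):
    """Fuse one inverse depth observation x (variance tau2) into the
    Gaussian-Beta parameters (mu, sigma2, a, b); returns updated values."""
    # Gaussian fusion of the new observation with the current estimate.
    s2 = 1.0 / (1.0 / sigma2 + 1.0 / tau2)
    m = s2 * (mu / sigma2 + x / tau2)

    # Responsibilities of the inlier (Gaussian) and outlier (uniform) models.
    norm2 = sigma2 + tau2  # predictive variance of an inlier observation
    c1 = (a / (a + b)) * math.exp(-0.5 * (x - mu) ** 2 / norm2) \
         / math.sqrt(2.0 * math.pi * norm2)
    c2 = (b / (a + b)) / (z_max - z_min)
    c1, c2 = c1 / (c1 + c2), c2 / (c1 + c2)

    # First and second moments of pi under the true (un-normalized) posterior.
    f = c1 * (a + 1) / (a + b + 1) + c2 * a / (a + b + 1)
    e = (c1 * (a + 1) * (a + 2) / ((a + b + 1) * (a + b + 2))
         + c2 * a * (a + 1) / ((a + b + 1) * (a + b + 2)))

    # Moment matching for the Gaussian part (mean and variance of Z).
    mu_new = c1 * m + c2 * mu
    sigma2_new = c1 * (s2 + m * m) + c2 * (sigma2 + mu * mu) - mu_new ** 2

    # Moment matching for the Beta part: recover (a, b) from E[pi], E[pi^2].
    a_new = (e - f) / (f - e / f)
    b_new = a_new * (1.0 - f) / f
    return mu_new, sigma2_new, a_new, b_new

def is_reliable_static(sigma2, a, b, sigma_thresh=1e-3, pi_thresh=0.5):
    """Converged inverse depth + high inlier rate -> reliable static point."""
    converged = math.sqrt(sigma2) < sigma_thresh     # sigma_n below threshold
    inlier_rate = a / (a + b)                        # E[pi] = a/(a+b)
    return converged and inlier_rate > pi_thresh
```

A converged point whose inlier rate stays low is exactly the small, isolated dynamic map point case: its observations keep falling in the uniform (outlier) component, driving b up relative to a.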
Further, in the step (4), the tracking and local mapping threads in the dynamic scene are run on the map expanded by the inverse depth filter; the method comprises:
initial pose estimation or relocalization of the system is carried out on the initial map obtained by semantic optical flow and inverse depth filtering, the reconstructed local map is tracked, the pose is optimized, and new keyframes are determined; after a keyframe is determined, the local mapping thread performs keyframe insertion, culls redundant map points and keyframes, and then carries out local bundle adjustment; the loop detection thread comprises candidate frame detection, Sim3 computation, loop fusion and loop optimization; finally, an accurate map in the dynamic scene is constructed, realizing the dynamic-scene-oriented visual SLAM based on semantic optical flow and inverse depth filtering.
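For orientation, the following runnable sketch shows one way the three threads named above (tracking, local mapping, loop detection) can be organised around queues, in the manner of ORB-SLAM2; all SLAM work is stubbed out with comments, and the keyframe decision is a stand-in, so only the thread structure itself should be read from it.

```python
# Sketch of the three-thread organisation (all SLAM steps stubbed out).
import queue
import threading

kf_queue = queue.Queue()    # tracking -> local mapping
loop_queue = queue.Queue()  # local mapping -> loop detection

def tracking_thread(frames):
    for frame in frames:
        # initial pose estimation / relocalization, local-map tracking and
        # pose optimization would run here
        if frame % 5 == 0:          # stand-in for the keyframe decision
            kf_queue.put(frame)
    kf_queue.put(None)              # shutdown signal

def local_mapping_thread():
    while (kf := kf_queue.get()) is not None:
        # keyframe insertion, culling of redundant map points/keyframes and
        # local bundle adjustment would run here
        loop_queue.put(kf)
    loop_queue.put(None)

def loop_detection_thread():
    while (kf := loop_queue.get()) is not None:
        # candidate frame detection, Sim3 computation, loop fusion and loop
        # optimization would run here
        print(f"loop thread processed keyframe {kf}")

threads = [threading.Thread(target=tracking_thread, args=(range(20),)),
           threading.Thread(target=local_mapping_thread),
           threading.Thread(target=loop_detection_thread)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```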
Compared with the prior art, the invention has the advantages that:
(1) The invention adopts the semantic optical flow method, in which semantic information and optical flow information are integrated into the visual SLAM system in a tightly coupled manner, solving the problems that traditional visual SLAM can neither understand scene information nor cope with dynamic scenes. The pose estimation accuracy in dynamic scenes is improved and is superior to that of existing methods.
(2) The invention adopts the inverse depth filtering method, which considers all image frames in which a map point is observed and continuously accumulates new observations in a probabilistic framework, so that even isolated and small dynamic map points can be detected and handled.
In summary, the method of the invention performs well in highly dynamic scenes and achieves accurate positioning of the visual SLAM system in dynamic scenes.
Drawings
FIG. 1 is a flow chart of a visual SLAM method based on semantic optical flow and inverse depth filtering according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them; all other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
As shown in fig. 1, the specific implementation steps of the present invention are as follows:
step 1, acquiring image data acquired by a sensor, extracting image characteristic points, and performing semantic segmentation on an RGB image of a current frame by using a SegNet semantic segmentation network. The characteristic points are classified into three categories of static, latent dynamic and dynamic through semantic information. The SegNet comprises two modules of an encoder network and a decoder network. The input image is firstly sent to an encoder network, each encoder in the encoder network generates a series of feature maps through convolution operation, and then after batch normalization processing, ReLU activation function activation and other operations are carried out, a decoder in a decoder network uses the maximum pooling index value stored in the corresponding encoder feature map to carry out up-sampling on the input feature map, so that a sparse feature map is generated. The feature maps are then passed through a trainable convolution module to generate dense feature maps. And the high-dimensional feature representation output by the last decoder of the decoder network is transmitted to the softmax classifier, a semantic label of each pixel is generated, and the semantic segmentation process of the image is completed.
Step 2: the semantic optical flow method detects dynamic feature points by tightly coupling semantic information with geometric information, remedying the shortcomings of traditional dynamic feature point detection algorithms. On the basis of the feature points of the acquired image having been divided into the three classes static, potentially dynamic and dynamic by the semantic segmentation method, the semantic optical flow method first computes sparse optical flow for the semantically static feature points of the current frame using the image data of the current frame and of the previous frame. The fundamental matrix F, which is the key to the epipolar geometric constraint, is then computed. Finally, the motion character of the static, potentially dynamic and dynamic feature points is judged again according to the epipolar constraint, and the judgment is checked against the fundamental matrix F just computed: a threshold of one pixel is set, and if the distance from a feature point in the current frame image to its corresponding epipolar line exceeds this threshold, the feature point is judged to be a genuinely dynamic feature point. This yields a reliable initial map.
Step 3, the 3D map points in the initial map are evaluated with an inverse depth filter and the map is expanded. Applying a depth filter based on the Gaussian-uniform mixture distribution assumption to SLAM lets the system handle both the effect of wrong matches and the effect of moving elements on map point construction.
The observations of the inverse depth of a map point are modeled with a mixture of a Gaussian distribution and a uniform distribution:
p(x|Z,π) = π·N(x|Z,τ²) + (1−π)·U(x|Z_min, Z_max)
the meaning of the respective quantities in the above formula is:
x is an observation of the inverse depth of the map point and is a random variable; Z is the true inverse depth of the map point, the value to be estimated; π is the probability that the map point is an inlier, where an inlier is a static map point in the map whose depth is obtained by triangulating a correct feature match; p(x|Z,π) is the distribution of the inverse depth observations of the map point; N(x|Z,τ²) is a Gaussian distribution with the true inverse depth Z of the map point as its mean and τ² as its variance; U(x|Z_min,Z_max) is a uniform distribution whose lower and upper bounds Z_min and Z_max are the minimum and maximum inverse depths.
Calculating the posterior probability distribution of the current time (Z, pi) to obtain:
p(Z,π|x1,...,xn)∝p(Z,π|x1,...,xn-1)p(xn|Z,π)
wherein x_1,…,x_n are a series of mutually independent observations of the inverse depth of the map point, n being the index of the observation; p(Z,π|x_1,…,x_n) is the posterior probability distribution of (Z,π) at the current time, p(Z,π|x_1,…,x_{n−1}) is the posterior probability distribution of (Z,π) at the previous time, and p(x_n|Z,π) is the likelihood of the depth measurement at the current time. To estimate the parameters Z and π and to simplify the computation, p(Z,π|x_1,…,x_n) is approximated by a Gaussian-Beta distribution:
q(Z,π|a,b,μ,σ²) = N(Z|μ,σ²)·Beta(π|a,b)
wherein q(Z,π|a,b,μ,σ²) denotes that (Z,π) follows a Gaussian-Beta distribution with parameters (a,b,μ,σ²), N(Z|μ,σ²) is a Gaussian distribution, and Beta(π|a,b) is a Beta distribution. The Gaussian-Beta distribution has 4 parameters (a,b,μ,σ²), wherein a and b are the two positive parameters of a Beta distribution in probability theory, and μ and σ² are the expectation and variance of the Gaussian distribution, so that each time a new inverse depth observation is obtained, the 4 parameters are updated to give a new Gaussian-Beta distribution. First, the un-normalized posterior

q(Z,π|a_{n−1},b_{n−1},μ_{n−1},σ²_{n−1})·p(x_n|Z,π)

is used to find the first and second moments of Z and π, where q(Z,π|a_{n−1},b_{n−1},μ_{n−1},σ²_{n−1}) indicates that (Z,π) follows the Gaussian-Beta distribution with parameters (a_{n−1},b_{n−1},μ_{n−1},σ²_{n−1}) at this time. The moment-matching method then compares the first and second moments of Z and π obtained in these two ways, giving the new parameters

(a_n, b_n, μ_n, σ²_n).

When σ_n is smaller than a set threshold, the inverse depth of the map point is considered to have converged. The first moment of the inlier rate π can be used as an estimate of π:

E[π] = a_n / (a_n + b_n)

When the inverse depth of a map point has converged but the inlier rate π is lower than the set threshold, the map point is still considered a dynamic point and is removed; only when the inverse depth of the map point has converged and the inlier rate π is higher than the set threshold is the map point considered a reliable static map point, and the previously obtained reliable initial map is updated with these reliable static map points.
Step 4: initial pose estimation or relocalization of the system is performed using the initial map obtained by semantic optical flow and inverse depth filtering, the reconstructed local map is tracked, the pose is optimized, and new keyframes are determined. After a keyframe is determined, the local mapping thread mainly completes keyframe insertion, culling of redundant map points and keyframes, and local bundle adjustment. The loop detection thread includes candidate frame detection, Sim3 computation, loop fusion and loop optimization. Through these threads, an accurate map in the dynamic scene is finally constructed, realizing the dynamic-scene-oriented visual SLAM based on semantic optical flow and inverse depth filtering.
As shown in Table 1, the method of the present invention is compared quantitatively on the TUM RGB-D dataset with existing visual SLAM systems for dynamic scenes (the 4 most representative algorithms are selected here: DS-SLAM, DynaSLAM, Detect-SLAM, and the algorithm proposed by L. Zhang et al.). The TUM RGB-D dataset includes one low-dynamic video sequence, s_static, and four high-dynamic video sequences, w_halfsphere, w_rpy, w_static and w_xyz. The quantitative comparison shows that the method has the highest accuracy in both low-dynamic and high-dynamic scenes, more effectively improving the visual SLAM system's ability to cope with dynamic scenes and its positioning accuracy in them.
Table 1 compares the accuracy of the results obtained on five dynamic-scene video sequences of the TUM RGB-D dataset using the method of the present invention and other classical visual SLAM methods.
TABLE 1 (reproduced as an image in the original publication; values not recoverable here)
(Note: the percentages in the table indicate the percentage improvement in accuracy of the visual SLAM method in that column over classical ORB-SLAM2; "-" indicates that the corresponding algorithm was not tested on that video sequence.)
The invention combines traditional visual SLAM with deep-learning-based semantic optical flow and inverse depth filtering, providing a new visual SLAM method based on semantic optical flow and inverse depth filtering. It is of strong practical value for the innovation and improvement of SLAM systems based on visual sensors, and of great significance for the wider application of visual SLAM systems in the future.
Those skilled in the art will appreciate that the invention may be practiced without these specific details.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the present invention, the present invention is not limited to the scope of those embodiments. Various changes will be apparent to those skilled in the art, and all inventions making use of the inventive concept set forth herein fall under protection, provided they do not depart from the spirit and scope of the present invention as defined by the appended claims.

Claims (4)

1. A visual SLAM method based on semantic optical flow and inverse depth filtering, characterized by comprising the following steps:
step (1): a vision sensor collects images, and feature extraction and semantic segmentation are performed on the collected images to obtain the extracted feature points and the semantic segmentation results;
step (2): map initialization is carried out with the semantic optical flow method according to the feature points and the segmentation results, dynamic feature points are removed, and a reliable initial map is created;
step (3): an inverse depth filter evaluates whether the 3D map points in the initial map are dynamic points, and the map is expanded according to the evaluation result of the inverse depth filter;
in the step (3), the inverse depth filter is adopted to evaluate the 3D map points in the initial map and to expand the map, the method comprising:
a depth filter based on the Gaussian-uniform mixture distribution assumption is applied to SLAM: first, the observations of the inverse depth of a map point are modeled with a mixture of a Gaussian distribution and a uniform distribution:
p(x|Z,π) = π·N(x|Z,τ²) + (1−π)·U(x|Z_min, Z_max)
the meaning of the respective quantities in the above formula is:
x is an observation of the inverse depth of the map point and is a random variable; Z is the true inverse depth of the map point, the value to be estimated; π is the probability that the map point is an inlier, referred to as the inlier rate for short, where an inlier is a static map point in the map whose depth is obtained by triangulating a correct feature match; p(x|Z,π) is the distribution of the inverse depth observations of the map point; N(x|Z,τ²) is a Gaussian distribution with the true inverse depth Z of the map point as its mean and τ² as its variance; U(x|Z_min,Z_max) is a uniform distribution whose lower and upper bounds Z_min and Z_max are the minimum and maximum inverse depths;
the posterior probability distribution of (Z,π) at the current time is calculated as:
p(Z,π|x_1,…,x_n) ∝ p(Z,π|x_1,…,x_{n−1})·p(x_n|Z,π)
wherein x_1,…,x_n are a series of mutually independent observations of the inverse depth of the map point, n being the index of the observation; p(Z,π|x_1,…,x_n) is the posterior probability distribution of (Z,π) at the current time, p(Z,π|x_1,…,x_{n−1}) is the posterior probability distribution of (Z,π) at the previous time, and p(x_n|Z,π) is the likelihood of the depth measurement at the current time; to estimate the parameters Z and π and to simplify the computation, p(Z,π|x_1,…,x_n) is approximated by a distribution of Gaussian-Beta form:
q(Z,π|a,b,μ,σ²) = N(Z|μ,σ²)·Beta(π|a,b)
wherein q(Z,π|a,b,μ,σ²) denotes that (Z,π) follows a Gaussian-Beta distribution with parameters (a,b,μ,σ²), N(Z|μ,σ²) is a Gaussian distribution, and Beta(π|a,b) is a Beta distribution, which has a total of 4 parameters (a,b,μ,σ²), wherein a and b are the two positive parameters of a Beta distribution in probability theory, and μ and σ² are the expectation and variance of the Gaussian distribution; each time a new inverse depth observation is obtained, these 4 parameters are updated to give a new Gaussian-Beta distribution; first, the un-normalized posterior

q(Z,π|a_{n−1},b_{n−1},μ_{n−1},σ²_{n−1})·p(x_n|Z,π)

is used to find the first and second moments of Z and π, where q(Z,π|a_{n−1},b_{n−1},μ_{n−1},σ²_{n−1}) indicates that (Z,π) follows the Gaussian-Beta distribution with parameters (a_{n−1},b_{n−1},μ_{n−1},σ²_{n−1}) at this time; then, by the moment-matching method, the first and second moments of Z and π under p(Z,π|x_1,…,x_n) and under q(Z,π|a,b,μ,σ²) are computed and compared, yielding the new parameters

(a_n, b_n, μ_n, σ²_n);

when σ_n is smaller than a set threshold, the inverse depth of the map point is considered to have converged; the first moment of the inlier rate π can be used as an estimate of π:

E[π] = a_n / (a_n + b_n)

when the inverse depth of a map point has converged but the inlier rate π is lower than a set threshold, the map point is still considered a dynamic point and is removed; only when the inverse depth of the map point has converged and the inlier rate π is higher than the set threshold is the map point considered a reliable static map point, and the previously obtained reliable initial map is updated with these reliable static map points;
step (4): tracking, local mapping and loop detection are carried out continuously in sequence on the map expanded by the inverse depth filter, an accurate map in the dynamic scene is thereby constructed, and the dynamic-scene-oriented visual SLAM based on semantic optical flow and inverse depth filtering is finally realized.
2. The visual SLAM method based on semantic optical flow and inverse depth filtering of claim 1, wherein: in the step (1), the image feature extraction and semantic segmentation method comprises the following steps:
after the image data acquired by the sensor is obtained, image feature points are extracted, and semantic segmentation is performed on the RGB image of the current frame with a SegNet semantic segmentation network; the feature points are divided into three classes, static, potentially dynamic and dynamic, according to the semantic information; the SegNet comprises an encoder network and a decoder network: the input image is first fed to the encoder network, each encoder generates a series of feature maps through convolution followed by batch normalization and ReLU activation, and each decoder in the decoder network then up-samples its input feature map using the max-pooling indices stored for the corresponding encoder feature map, generating a sparse feature map; the sparse feature maps are then passed through trainable convolution modules to generate dense feature maps; and the high-dimensional feature representation output by the last decoder of the decoder network is fed to the softmax classifier, which generates a semantic label for each pixel, completing the semantic segmentation of the image.
3. The visual SLAM method based on semantic optical flow and inverse depth filtering of claim 1, wherein: in the step (2), map initialization is performed by the semantic optical flow method and a reliable initial map is created, the method comprising:
firstly, on the basis of the feature points of the acquired image having been divided into the three classes static, potentially dynamic and dynamic by the semantic segmentation method, sparse optical flow is computed for the semantically static feature points of the current frame using the image data of the current frame and of the previous frame; subsequently, the fundamental matrix F, which is the key to the epipolar geometric constraint, is computed; finally, the motion character of the static, potentially dynamic and dynamic feature points is judged again according to the epipolar constraint, and the judgment is checked against the fundamental matrix F just computed: a threshold of one pixel is set, and if the distance from a feature point in the current frame image to its corresponding epipolar line exceeds this threshold, the feature point is judged to be a genuinely dynamic feature point, whereby a reliable initial map is obtained.
4. The visual SLAM method based on semantic optical flow and inverse depth filtering of claim 1, wherein: in the step (4), the tracking and local mapping threads in the dynamic scene are run on the map expanded by the inverse depth filter, the method comprising:
initial pose estimation or relocalization of the system is carried out on the initial map obtained by semantic optical flow and inverse depth filtering, the reconstructed local map is tracked, the pose is optimized, and new keyframes are determined; after a keyframe is determined, the local mapping thread performs keyframe insertion, culls redundant map points and keyframes, and then carries out local bundle adjustment; the loop detection thread comprises candidate frame detection, Sim3 computation, loop fusion and loop optimization; finally, an accurate map in the dynamic scene is constructed, realizing the dynamic-scene-oriented visual SLAM based on semantic optical flow and inverse depth filtering.
CN202010065930.0A 2020-01-20 2020-01-20 Visual SLAM method based on semantic optical flow and inverse depth filtering Active CN111311708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010065930.0A CN111311708B (en) 2020-01-20 2020-01-20 Visual SLAM method based on semantic optical flow and inverse depth filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010065930.0A CN111311708B (en) 2020-01-20 2020-01-20 Visual SLAM method based on semantic optical flow and inverse depth filtering

Publications (2)

Publication Number Publication Date
CN111311708A CN111311708A (en) 2020-06-19
CN111311708B (en) 2022-03-11

Family

ID=71160787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010065930.0A Active CN111311708B (en) 2020-01-20 2020-01-20 Visual SLAM method based on semantic optical flow and inverse depth filtering

Country Status (1)

Country Link
CN (1) CN111311708B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132893B (en) * 2020-08-31 2024-01-09 同济人工智能研究院(苏州)有限公司 Visual SLAM method suitable for indoor dynamic environment
CN112037268B (en) * 2020-09-02 2022-09-02 中国科学技术大学 Environment sensing method based on probability transfer model in dynamic scene
CN112884835A (en) * 2020-09-17 2021-06-01 中国人民解放军陆军工程大学 Visual SLAM method for target detection based on deep learning
CN112446885A (en) * 2020-11-27 2021-03-05 广东电网有限责任公司肇庆供电局 SLAM method based on improved semantic optical flow method in dynamic environment
CN112465858A (en) * 2020-12-10 2021-03-09 武汉工程大学 Semantic vision SLAM method based on probability grid filtering
CN113781574B (en) * 2021-07-19 2024-04-12 长春理工大学 Dynamic point removing method for binocular refraction and reflection panoramic system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176033A (en) * 2019-05-08 2019-08-27 北京航空航天大学 A kind of mixing probability based on probability graph is against depth estimation method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101619076B1 (en) * 2009-08-25 2016-05-10 삼성전자 주식회사 Method of detecting and tracking moving object for mobile platform
CN107833236B (en) * 2017-10-31 2020-06-26 中国科学院电子学研究所 Visual positioning system and method combining semantics under dynamic environment
CN108648270B (en) * 2018-05-12 2022-04-19 西北工业大学 Unmanned aerial vehicle real-time three-dimensional scene reconstruction method capable of realizing real-time synchronous positioning and map construction
CN108986136B (en) * 2018-07-23 2020-07-24 南昌航空大学 Binocular scene flow determination method and system based on semantic segmentation
CN110084850B (en) * 2019-04-04 2023-05-23 东南大学 Dynamic scene visual positioning method based on image semantic segmentation
CN110335319B (en) * 2019-06-26 2022-03-18 华中科技大学 Semantic-driven camera positioning and map reconstruction method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176033A (en) * 2019-05-08 2019-08-27 北京航空航天大学 A kind of mixing probability based on probability graph is against depth estimation method

Also Published As

Publication number Publication date
CN111311708A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111311708B (en) Visual SLAM method based on semantic optical flow and inverse depth filtering
CN110335319B (en) Semantic-driven camera positioning and map reconstruction method and system
CN112184752A (en) Video target tracking method based on pyramid convolution
CN110490158B (en) Robust face alignment method based on multistage model
CN111462210B (en) Monocular line feature map construction method based on epipolar constraint
CN113012122B (en) Category-level 6D pose and size estimation method and device
CN114782691A (en) Robot target identification and motion detection method based on deep learning, storage medium and equipment
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN110335299B (en) Monocular depth estimation system implementation method based on countermeasure network
Kong et al. A method for learning matching errors for stereo computation.
CN110119768B (en) Visual information fusion system and method for vehicle positioning
CN112785636B (en) Multi-scale enhanced monocular depth estimation method
CN115880720A (en) Non-labeling scene self-adaptive human body posture and shape estimation method based on confidence degree sharing
CN112801945A (en) Depth Gaussian mixture model skull registration method based on dual attention mechanism feature extraction
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
Kong et al. Local stereo matching using adaptive cross-region-based guided image filtering with orthogonal weights
Singh et al. Fusing semantics and motion state detection for robust visual SLAM
Rodríguez-Puigvert et al. Bayesian deep neural networks for supervised learning of single-view depth
CN106650814B (en) Outdoor road self-adaptive classifier generation method based on vehicle-mounted monocular vision
Bouaynaya et al. A complete system for head tracking using motion-based particle filter and randomly perturbed active contour
CN113570713B (en) Semantic map construction method and device for dynamic environment
Min et al. COEB-SLAM: A Robust VSLAM in Dynamic Environments Combined Object Detection, Epipolar Geometry Constraint, and Blur Filtering
JP2023065296A (en) Planar surface detection apparatus and method
CN111583331B (en) Method and device for simultaneous localization and mapping
CN113888603A (en) Loop detection and visual SLAM method based on optical flow tracking and feature matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant