CN117541652A - Dynamic SLAM method based on depth LK optical flow method and D-PROSAC sampling strategy - Google Patents
- Publication number: CN117541652A
- Application number: CN202311572768.1A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
- G06N3/045 — Combinations of networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06T7/269 — Analysis of motion using gradient-based methods
- G06T7/557 — Depth or shape recovery from multiple images from light fields, e.g. from plenoptic cameras
- G06V10/267 — Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V10/454 — Integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06T2207/10016 — Video; image sequence
- G06T2207/10024 — Color image
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20112 — Image segmentation details
Abstract
The invention discloses a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy, suitable for robot navigation in indoor dynamic environments. Image information is acquired with an RGB-D camera, and the current-frame RGB image is passed into an improved YOLOv5-7.0 network model to obtain instance segmentation of potential dynamic targets. To further judge whether the feature points in these dynamic regions are truly dynamic, ORB feature points are computed on the previous and current RGB frames and matched with a depth LK optical flow method in combination with the preprocessed current-frame depth image, while the instance segmentation mask boundary is refined and expanded, so that truly dynamic feature points are effectively distinguished. In the pose calculation thread, a D-PROSAC method is designed to replace the traditional RANSAC algorithm for feature point sampling, and only static features are used for pose calculation and trajectory estimation, which effectively improves feature point matching accuracy in dynamic environments and enhances the robustness of the SLAM system.
Description
Technical Field
The invention relates to the technical field of robot positioning and navigation, and in particular to a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy.
Background
Simultaneous localization and mapping (SLAM) is the basis for robots to achieve autonomous localization and navigation in unknown environments. With the development of deep learning in recent years, fusing semantic information with SLAM systems has broad prospects: semantic information can effectively enhance a robot's perception of its surrounding environment and assist the operation of the SLAM system, allowing the robot to understand elements in the environment the way humans do and thereby achieve more reasonable path planning and decision making.
Most existing SLAM systems are based on the static-scene assumption; however, real environments contain many dynamic objects, and feature points on dynamic objects that are erroneously included in the point cloud for pose calculation cause large errors in the result. Even when mapping remains possible, the dynamic regions of the constructed map contain many residual shadows, greatly reducing its readability. Although many schemes now address the dynamic SLAM problem with deep learning methods, their precision and efficiency remain problematic; the YOLOv5-7.0 network has not been used for instance segmentation and discrimination of dynamic objects, and few schemes handle the mask boundary during segmentation. In the feature point sampling link, the traditional RANSAC algorithm can filter some dynamic feature points, but it guarantees model accuracy only by sacrificing more iterations, and it risks failure when a large dynamic area exists in the environment.
For indoor dynamic environments, most existing solutions suffer from poor precision and low efficiency; therefore, filtering the dynamic features in the environment and effectively improving the calculation precision of the SLAM system, while guaranteeing the real-time performance of the algorithm, is of great significance.
Disclosure of Invention
Aiming at the defects of the prior art, which is easily disturbed by dynamic objects and consequently suffers pose estimation drift and reduced precision, the invention provides a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy, which is suitable for robot navigation in indoor dynamic environments and improves the performance of mobile robot SLAM in such environments.
The technical scheme adopted for solving the technical problems is as follows:
a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy comprises the following steps:
step 1: training an improved YOLOv5-7.0 network model by using a COCO data set, taking a current frame RGB image acquired by an RGB-D camera as the input of a target detection thread, and transmitting the current frame RGB image into the improved YOLOv5-7.0 network model to obtain an instance segmentation mask of a potential dynamic target, wherein the process is as follows:
Step 1.1: SLAM has high real-time requirements, so the network structure of YOLOv5-7.0 is improved: a lighter-weight C2f-CA-Faster network is proposed to replace the C3 network module, which suffers from a large parameter count and poor real-time performance. In the C2f-CA-Faster network, the lightweight attention module CA replaces the convolution module of C3 and extracts features along the width and height dimensions of the image to obtain attention feature encodings at different levels: the input features are passed through average pooling layers in the two directions to obtain direction-specific feature maps, the corresponding weighting coefficients are computed by convolution, fusion and separation, and a sigmoid function highlights the key-region features of the network output, effectively improving the model's ability to extract discriminative target features. In addition, a Faster module is added, consisting mainly of a PConv partial-convolution structure and a 1×1 convolution: the features are split along the channel dimension, one part is propagated by identity mapping and the other by channel-wise convolution, and the two output feature maps are fused in the channel dimension as the final output, so that the network increases the information flow paths while reducing information redundancy, effectively improving computational efficiency. After these improvements, the improved YOLOv5-7.0 network model is obtained;
Step 1.2: Common indoor categories are selected and the improved YOLOv5-7.0 network model is trained on the COCO 2017 dataset. Images are acquired with an RGB-D camera, and the current-frame RGB image is fed into the improved YOLOv5-7.0 network model in three-channel form (640×640×3); slicing and convolution in the Focus structure form a 320×320×32 image feature map. The feature map is passed into the Backbone network, which comprises the CBS, C2f-CA-Faster and SPPF network structures and is mainly used to extract image features while continuously shrinking the feature map. The extracted feature maps are then passed into the Neck structure, which fuses features from different levels, and the feature-fused images are input to the detection branch and the segmentation branch for the next operation;
Step 1.3: Detection branch: input the feature-fusion image generated in step 1.2 into the YOLACT network to obtain category information, bounding-box information, and the confidence coefficients of the k masks.
Step 1.4: Segmentation branch: screen the feature-fusion images generated in step 1.2, select a feature map with high resolution, sufficient spatial information and rich semantic information for the up-sampling operation of the FCN structure, and form k mask prototype images through a 1×1 convolution;
Step 1.5: and (3) linearly combining the information generated in the step (1.3) and the information generated in the step (1.4), wherein the combination formula is as follows:
wherein n is the number of object categories identified by the detection branches, k is the number of masks obtained by dividing the branches, P is k image prototype masks, C is confidence information of the masks, and the combined feature map is activated:
M=σ(mask)
where M is the potential dynamic target mask after high confidence processing and σ is the Sigmoid () activation function. The potential dynamic target instance segmentation mask is obtained for further calculation.
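The prototype–coefficient combination and sigmoid activation described above can be sketched in a few lines; the YOLACT-style shapes (h×w×k prototypes, n×k coefficients) are assumptions for illustration:

```python
import numpy as np

np.random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine_masks(P, C):
    """Linearly combine k prototype masks P (h, w, k) with per-instance
    mask coefficients C (n, k), then apply the sigmoid activation:
    M = sigma(P C^T)."""
    mask = P @ C.T               # (h, w, n): one combined map per instance
    return sigmoid(mask)

# toy example: 4x4 prototypes, k = 2, one detected instance
P = np.random.randn(4, 4, 2)
C = np.random.randn(1, 2)
M = combine_masks(P, C)
assert M.shape == (4, 4, 1)
assert np.all((M > 0) & (M < 1))   # sigmoid outputs lie strictly in (0, 1)
```

Thresholding M then yields the binary instance segmentation mask used in the following steps.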
Step 2: and carrying out ORB characteristic point calculation on the current frame RGB image acquired and obtained by the RGB-D camera and the previous frame RGB image to obtain a characteristic point set with direction information, wherein the process is as follows:
step 2.1: and (3) performing ORB feature point extraction on the previous frame image and the current frame image, counting the number n of the ORB feature points, and performing initialization operation only when the number n of two continuous frames is larger than a given threshold T.
Step 2.2: and calculating the main direction of ORB characteristic points by using a gray centroid method, and finding the centroid position of the image block according to the image moment:
wherein, C is the centroid of the image, m is the moment of the defined image block, and the expression is:
from the centroid C and the geometric center O, a direction vector can be obtained The direction of the feature points is thus defined as:
θ=arctan(m 01 /m 10 )
after ORB characteristic points with direction information are obtained, performing next calculation;
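The gray-centroid orientation computation can be sketched as follows; the patch size and the use of `arctan2` (rather than plain `arctan`, so the quadrant is preserved) are illustrative choices:

```python
import numpy as np

def orb_orientation(patch):
    """Gray-centroid orientation of an image patch: compute the moments
    m10 and m01 about the geometric centre O, then theta = arctan2(m01, m10)."""
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]].astype(float)
    # centre the coordinates so the geometric centre O is the origin
    xs -= (patch.shape[1] - 1) / 2.0
    ys -= (patch.shape[0] - 1) / 2.0
    m10 = np.sum(xs * patch)     # first moment in x
    m01 = np.sum(ys * patch)     # first moment in y
    return np.arctan2(m01, m10)

# a patch brighter on its right side should point along +x (theta = 0)
patch = np.zeros((5, 5)); patch[:, 3:] = 1.0
assert abs(orb_orientation(patch)) < 1e-9
```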
Step 3: Preprocess the current-frame depth image acquired by the RGB-D camera. In the feature point dynamic-judgment thread, take the preprocessed depth image, the potential dynamic target instance segmentation mask obtained in step 1, and the ORB feature point set obtained in step 2 as input to the depth LK optical flow matcher, and calculate a high-confidence static feature point set. The specific process is as follows:
step 3.1: performing coarse filtering on the instance segmentation mask of the potential dynamic target according to the confidence coefficient;
Step 3.2: A certain number of dynamic feature points inevitably lie on the boundary of the instance segmentation mask and would affect subsequent steps such as pose calculation, so the mask boundary is expanded in combination with the depth image to ensure that the dynamic feature points are all included in the corresponding instance segmentation mask. Specifically, the depth map is first preprocessed by normalizing the depth information and mapping the values to a designated interval, obtaining the preprocessed depth image. The mapping is:

D_nor = γ · (D − D_min) / (D_max − D_min)

where D_nor is the normalized depth value, γ is the amplification factor, D is the current depth value, and D_max, D_min are the maximum and minimum depth values in the depth image, respectively. The normalized image is then passed through a bilateral filter, which reduces noise while retaining edge information;
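A minimal sketch of the normalization step, assuming γ simply sets the upper end of the target interval; the bilateral filtering that follows would typically use a library routine (e.g. OpenCV's bilateral filter) and is omitted here:

```python
import numpy as np

def normalize_depth(D, gamma=255.0):
    """Map raw depth values into [0, gamma]:
    D_nor = gamma * (D - D_min) / (D_max - D_min)."""
    D = D.astype(float)
    d_min, d_max = D.min(), D.max()
    return gamma * (D - d_min) / (d_max - d_min)

depth = np.array([[500, 1000], [1500, 2000]])   # raw depth in millimetres
d = normalize_depth(depth, gamma=255.0)
assert d.min() == 0.0 and d.max() == 255.0
```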
Step 3.3: Expand and refine the dynamic region by combining the instance segmentation mask generated in step 3.1 with the preprocessed depth image obtained in step 3.2. The specific process is as follows: let P_{i,j} denote a specific pixel on the boundary of the instance segmentation mask; taking this pixel as the coordinate origin, combine it with its 8 surrounding neighbor pixels to form the dynamic generation set P_net, which can be expressed as:

P_net = { p(i, j) | −1 ≤ i ≤ 1, −1 ≤ j ≤ 1, i, j ∈ Z }

Within the range of P_net, define the effective depth value D_p(i, j): if the pixel p(i, j) lies in the instance segmentation mask region, its effective depth value is its normalized depth value, otherwise the effective depth value is set to 0:

D_p(i, j) = Depth(i, j), if p(i, j) ∈ A_d;  D_p(i, j) = 0, otherwise

where Depth(i, j) is the normalized depth value of the pixel at coordinates (i, j) and A_d is the set of pixels within the instance segmentation mask region. After obtaining the effective depth values D_p(i, j), calculate the average effective depth D_mean(i, j) of the pixels inside P_net:

D_mean(i, j) = (1 / N_d) · Σ_{p(i,j) ∈ P_net} D_p(i, j)

where N_d is the number of pixels of P_net that lie inside the mask region A_d.
A threshold is then set, determining the depth range of a unified plane around the average effective depth D_mean:

δ_min · D_mean ≤ Depth(i, j) ≤ δ_max · D_mean

where δ_min, δ_max are the minimum and maximum threshold coefficients, respectively. This average effective depth range effectively characterizes the depth range of the plane on which the pixel P_{i,j} lies; using it as a criterion, different planes in the image can be effectively distinguished, realizing the expansion and refinement of the instance segmentation mask;
Step 3.4: Judge in turn whether the depth values of the 8 neighboring pixels within P_net fall within the average effective depth range; a pixel within the range very likely belongs to the dynamic region and is classified into the instance segmentation mask of the dynamic target. Performing this operation for all boundary pixels P_{i,j} of the mask is recorded as one expansion-refinement operation; after repeating the expansion-refinement operation 3 times, the refined instance segmentation mask is obtained;
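One expansion-refinement pass over the mask boundary, as described in steps 3.3 and 3.4, might look like the following sketch; the threshold coefficients `dmin`/`dmax` are placeholders:

```python
import numpy as np

def refine_mask_once(mask, depth, dmin=0.8, dmax=1.2):
    """One expansion/refinement pass: for each boundary pixel of the
    instance mask, average the effective depth of its 3x3 neighbourhood
    (only pixels already inside the mask count), then absorb neighbours
    whose depth lies within [dmin*D_mean, dmax*D_mean]."""
    h, w = mask.shape
    new_mask = mask.copy()
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            if not mask[i, j]:
                continue
            nbh = np.s_[i-1:i+2, j-1:j+2]        # the 3x3 set P_net
            if mask[nbh].all():                   # interior pixel, not boundary
                continue
            eff = np.where(mask[nbh], depth[nbh], 0.0)   # D_p(i, j)
            d_mean = eff.sum() / mask[nbh].sum()         # average effective depth
            grow = (~mask[nbh]) & (depth[nbh] >= dmin * d_mean) \
                                & (depth[nbh] <= dmax * d_mean)
            new_mask[nbh] |= grow
    return new_mask

# a same-depth neighbouring column is absorbed; a far plane is not
mask = np.zeros((3, 4), bool); mask[:, :2] = True
depth = np.array([[1., 1., 1., 5.]] * 3)
out = refine_mask_once(mask, depth)
assert out[1, 2] and not out[1, 3]
```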
Step 3.5: The refined instance segmentation mask only indicates a potential for motion; in reality the object may still be static. To accurately judge the real motion of the feature points inside the mask, the LK optical flow is calculated and matched for the ORB feature points within the refined mask. Since LK optical flow assumes that all pixels in an image block share the same motion, the following over-determined system holds for the w×w pixels of the block:

[ I_x(p_1) I_y(p_1) ; … ; I_x(p_{w²}) I_y(p_{w²}) ] · [u ; v] = − [ I_t(p_1) ; … ; I_t(p_{w²}) ]

where I_x is the gradient of a pixel in the x direction, I_y is the gradient in the y direction, I_t is the temporal gradient, and w is the image block size. Solving this least-squares system and iterating several times tracks the pixel points and yields the motion vectors of the feature points. The motion vectors are then screened: feature points whose vector modulus exceeds a threshold are marked as dynamic. All dynamic feature points in the refined instance segmentation mask are eliminated, and the remaining static feature points are recorded as the high-confidence static feature point set for the next calculation.
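The per-window least-squares solve behind LK optical flow can be sketched directly; the synthetic window below is constructed so that the true motion is (1, 0):

```python
import numpy as np

np.random.seed(0)

def lk_flow(Ix, Iy, It):
    """Solve the Lucas-Kanade least-squares system over a window:
    each pixel contributes Ix*u + Iy*v = -It; stack all pixels into
    A [u v]^T = b and solve in the least-squares sense."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # (w*w, 2)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# brightness constancy: for motion (1, 0), It = -Ix*1 - Iy*0
Ix = np.random.randn(5, 5); Iy = np.random.randn(5, 5)
It = -Ix
u, v = lk_flow(Ix, Iy, It)
assert abs(u - 1.0) < 1e-9 and abs(v) < 1e-9
```

In practice this solve is repeated in a coarse-to-fine iteration, as noted in the text.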
Step 4: in the pose calculation thread, on the basis of a high-confidence static feature point set, removing dynamic feature points in the environment, designing a D-PROSAC feature point sampling method, sampling feature points in the high-confidence static feature point set to remove mismatching and low-quality matching points, and carrying out camera pose estimation by utilizing the rest high-quality feature points, wherein the specific process is as follows:
Step 4.1: When calculating the camera pose model, samples must be drawn from the static feature point set. The conventional RANSAC sampling method is uncertain and guarantees model accuracy only by sacrificing more iterations. The invention proposes a D-PROSAC sampling method that evaluates the reliability of the data before sampling, so that compared with the traditional RANSAC algorithm it converges faster and calculates more accurately. The specific flow of D-PROSAC is as follows: design a ratio evaluation function Q_1(p_i), select the 8 highest-scoring feature points by this criterion, and obtain an original model F_0 by the eight-point method. Specifically: record the static feature point set to be sampled as U_N, where N is the number of feature points in the set and each static feature point to be sampled is denoted p_i, with p_i ∈ U_N. For every static feature point p_i to be sampled, calculate the ratio of the minimum Hamming distance d_min1(p_i) of its descriptor to the next-smallest Hamming distance d_min2(p_i), recorded as the ratio evaluation function Q_1(p_i):

Q_1(p_i) = d_min1(p_i) / d_min2(p_i)

The ratio evaluation function Q_1(p_i) characterizes the reliability of the static feature point p_i during matching: the smaller Q_1, the higher the matching quality of the feature point. With Q_1 as criterion, sort the static feature points p_i in ascending order and select the first 8 points with the highest matching quality as the initial sample set U_0; from U_0 the original model F_0 is obtained by the eight-point method;
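The ratio evaluation function on binary descriptors can be sketched as follows, with toy 8-bit descriptors standing in for full-length ORB descriptors:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary descriptors stored as uint8 arrays."""
    return int(np.unpackbits(a ^ b).sum())

def ratio_score(desc, candidates):
    """Q1(p_i): smallest over second-smallest Hamming distance between a
    point's descriptor and all candidate descriptors in the other frame.
    Smaller Q1 means a more unambiguous, reliable match."""
    dists = sorted(hamming(desc, c) for c in candidates)
    return dists[0] / dists[1]

desc  = np.array([0b11110000], dtype=np.uint8)
cands = [np.array([0b11110001], dtype=np.uint8),   # distance 1 (best match)
         np.array([0b11111111], dtype=np.uint8),   # distance 4
         np.array([0b00000000], dtype=np.uint8)]   # distance 4
assert ratio_score(desc, cands) == 0.25            # 1 / 4: unambiguous match
```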
Step 4.2: obtaining an original model F 0 After that, to U N All static feature points p to be sampled in the interior i Distance d of polar line i And based thereon design the pole pitch evaluation function Q 2 (p i ) The specific process is as follows: for U N Any static feature point p to be sampled in the inner part i The pixel coordinates on the two images where they are located are noted as:
Then p is i1 Corresponding polar line I 1 Can be expressed as:
wherein F represents a basic matrix, X, Y, Z represents a polar line I 1 From the above equation, the point p can be obtained i2 To the polar line I 1 Polar distance d of (2) i :
With the epipolar distance d_i as index, design the pole-pitch evaluation function Q_2(p_i):

Q_2(p_i) = θ / (θ + d_i)

where θ is a scaling factor specified to be greater than 0. The pole-pitch evaluation function Q_2(p_i) characterizes how well the static feature point p_i to be sampled satisfies the epipolar constraint: the larger Q_2, the higher the matching quality of the feature point;
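The epipolar distance, together with one plausible form of the pole-pitch evaluation function consistent with the stated properties (θ > 0, larger Q_2 for smaller error), can be sketched as:

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Distance from p2 to the epipolar line l = F p1 = (X, Y, Z)^T:
    d = |p2^T F p1| / sqrt(X^2 + Y^2). Points are homogeneous pixel coords."""
    l = F @ p1
    return abs(p2 @ l) / np.hypot(l[0], l[1])

def q2_score(d, theta=1.0):
    """Assumed form of the pole-pitch evaluation function: equals 1 for a
    perfect epipolar fit (d = 0) and decays as the error d grows."""
    return theta / (theta + d)

# pure x-translation: E = [t]_x with t = (1, 0, 0)
F  = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
p1 = np.array([2.0, 3.0, 1.0])
p2 = np.array([5.0, 3.0, 1.0])        # lies exactly on the epipolar line y = 3
assert epipolar_distance(F, p1, p2) < 1e-12
assert q2_score(0.0) == 1.0
```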
Step 4.3: Fit the ratio evaluation function Q_1(p_i) and the pole-pitch evaluation function Q_2(p_i) to generate the final bidirectional evaluation function Q_0(p_i):

Q_0(p_i) = β_2 · Q_2(p_i) − β_1 · Q_1(p_i)

where β_1, β_2 are scaling coefficients used to adjust the ratio evaluation function Q_1(p_i) and the pole-pitch evaluation function Q_2(p_i) to a uniform order of magnitude (so that smaller Q_1 and larger Q_2 both increase Q_0). The bidirectional evaluation function Q_0(p_i) combines the characteristics of the ratio evaluation function and the pole-pitch evaluation function: it accounts for the reliability of feature point matching and for how well the feature point satisfies the epipolar constraint, and can therefore measure the comprehensive quality of a feature point — the larger Q_0, the better the overall quality. With Q_0(p_i) as index, sort all feature points in the static feature point set U_N to be sampled in descending order, i.e.:

∀ u_i, u_j ∈ U_N : i < j → Q_0(u_i) > Q_0(u_j)

The sorted static feature points are input into the next link for the sampling flow;
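A sketch of fusing the two criteria and sorting; since the exact fusion formula is not fully specified in the text, a signed weighted sum (smaller Q_1 and larger Q_2 both raise Q_0) is assumed here:

```python
import numpy as np

def bidirectional_scores(q1, q2, beta1=1.0, beta2=1.0):
    """Assumed bidirectional evaluation: Q0 = beta2*Q2 - beta1*Q1, so a
    small ratio-test value and a good epipolar fit both raise the score;
    beta1/beta2 balance the two terms' magnitudes."""
    return beta2 * np.asarray(q2) - beta1 * np.asarray(q1)

def sort_descending(points, q0):
    """Order feature points by Q0, best (largest) first."""
    order = np.argsort(-np.asarray(q0))
    return [points[k] for k in order]

q1  = [0.9, 0.2, 0.5]      # ratio scores: lower is better
q2  = [0.1, 0.8, 0.5]      # epipolar scores: higher is better
pts = ["a", "b", "c"]
assert sort_descending(pts, bidirectional_scores(q1, q2)) == ["b", "c", "a"]
```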
Step 4.4: Sample the sorted static feature point set U_N to be sampled and generate the pose model F. The specific flow is as follows:
Step 4.4.1: Determine the maximum number of iterations K, the reprojection error threshold δ, and the inlier count threshold M.
Step 4.4.2: Determine the size n of the hypothesis generation set U according to the growth function rule of the PROSAC algorithm; from the sorted static feature point set U_N, with the bidirectional evaluation function Q_0(p_i) designed in step 4.3 as criterion, select the first n feature points as the hypothesis generation set U.
Step 4.4.3: in the hypothesis generation set U, 8 points are randomly selected, and an essential matrix F is obtained by calculation by an eight-point method U 。
Step 4.4.4: for static feature point set U N All feature points in the inner are defined by F U Performing reprojection operation, calculating reprojection error epsilon, and if epsilon<Delta, it is marked as an inner point, and vice versa.
Step 4.4.5: counting the number of internal points M, if M > M, making m=M, otherwise repeating the steps of 4.4.2-4.4.4, and repeating the iteration times k=k+1.
Step 4.4.6: recalculating the essential matrix F from all the inliers after updating U When k is<K, obtaining an essential matrix F U And a new set of inliers, otherwise no model is obtained.
After the essential matrix F_U is obtained, Singular Value Decomposition (SVD) is performed on F_U to obtain the high-accuracy camera pose R, t.
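The SVD decomposition of the essential matrix into candidate poses follows the standard recipe (as in, e.g., OpenCV's decomposeEssentialMat); a sketch, with cheirality checking left out:

```python
import numpy as np

def decompose_essential(E):
    """SVD-based decomposition of an essential matrix into the four
    candidate (R, t) pairs; the physically valid one is then chosen by
    cheirality (triangulated points in front of both cameras)."""
    U, _, Vt = np.linalg.svd(E)
    # enforce proper rotations (det = +1)
    if np.linalg.det(U) < 0: U = -U
    if np.linalg.det(Vt) < 0: Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                       # translation known only up to sign/scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# an E built from a known pose must yield that rotation among the candidates
t = np.array([1.0, 0.0, 0.0])
tx = np.array([[0., -t[2], t[1]], [t[2], 0., -t[0]], [-t[1], t[0], 0.]])
R = np.eye(3)
E = tx @ R
cands = decompose_essential(E)
assert any(np.allclose(Rc, R) for Rc, _ in cands)
```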
The technical conception of the invention is as follows: a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy. A C2f-CA-Faster network is designed that replaces the convolution module of C3 with the lightweight attention module CA and extracts features along the width and height dimensions of the image to obtain attention feature encodings at different levels, effectively improving the model's ability to extract discriminative target features. A Faster module is added so that the network increases the information flow paths while reducing information redundancy, effectively improving computational efficiency. For the situation where a certain number of dynamic feature points still exist around the instance segmentation mask, a depth-LK optical flow method is proposed that uses depth information to effectively distinguish different planes, realizing the expansion and refinement of the instance segmentation mask. In the sampling process, to address the low precision and poor reliability of the traditional RANSAC method, a D-PROSAC algorithm is proposed: before sampling, the feature points are sorted with the bidirectional evaluation function as criterion, and only high-quality feature points are sampled for pose calculation, effectively improving the precision and convergence speed of the model.
The beneficial effects of the invention are mainly as follows:
1) A lightweight network is adopted for feature processing, effectively improving the speed and precision of the semantic segmentation link in the dynamic SLAM method. 2) The image depth information is fully utilized to expand and refine the segmentation boundary, effectively avoiding missed segmentation. 3) Progressive dual-criterion sampling is performed on the feature points, so the static feature points selected by sampling are highly reliable and the sampling process converges faster. 4) The overall algorithm has stronger real-time performance, lower equipment requirements, and more accurate pose calculation.
Drawings
FIG. 1 is an overall flow chart of a specific embodiment of the present invention;
FIG. 2 is a SLAM system frame diagram of an embodiment of the present invention;
FIG. 3 is a schematic diagram of a modified YOLOv5-7.0 network model in accordance with embodiments of the present invention;
FIG. 4 is a schematic diagram of a C2f-CA-Faster network architecture in accordance with embodiments of the present invention;
FIG. 5 is a flow diagram of a depth-LK optical flow matcher in accordance with an embodiment of the present invention;
FIG. 6 is a flow chart of the D-PROSAC algorithm of an embodiment of the present invention;
FIG. 7 compares the absolute trajectory error (ATE) of ORB-SLAM3 and the algorithm of the present invention, wherein (a) shows the absolute trajectory error of ORB-SLAM3 on sequence fr3_half, (b) that of the present method on sequence fr3_half, (c) that of ORB-SLAM3 on sequence fr3_walking_xyz, and (d) that of the present method on sequence fr3_walking_xyz;
FIG. 8 compares the relative pose error (RPE) of ORB-SLAM3 and the algorithm of the present invention, where (a) shows the relative pose error of ORB-SLAM3 on sequence fr3_walking_xyz and (b) that of the present method on sequence fr3_walking_xyz.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 6, an RGB-D SLAM method based on the depth LK optical flow method and a D-PROSAC sampling strategy in an indoor dynamic environment includes the following steps:
step 1: training an improved YOLOv5-7.0 network model by using a COCO data set, taking a current frame RGB image acquired by an RGB-D camera as the input of a target detection thread, and transmitting the current frame RGB image into the improved YOLOv5-7.0 network model to obtain an instance segmentation mask of a potential dynamic target, wherein the process is as follows:
step 1.1: SLAM has high real-time requirements, so the network structure of YOLOv5-7.0 is improved: a lighter-weight C2f-CA-Faster network is proposed to replace the C3 network module, which suffers from many parameters and poor real-time performance. In the C2f-CA-Faster network, the lightweight attention module CA replaces the convolution module of C3 and performs feature extraction along the width and height dimensions of the image to obtain attention feature codes of different layers: the input features pass through average pooling layers in the different directions to obtain feature maps for each direction, the corresponding weighting coefficients are computed by convolution, fusion, and separation, and the key-region features of the network output are obtained through a sigmoid function, strengthening their saliency and effectively improving the model's ability to extract discriminative target features. In addition, a Faster module is added, consisting mainly of partial convolution (PConv) and 1×1 convolution: PConv splits the features at the channel level, passes one part of the network features through identically and the other part through channel-wise convolution, and fuses the two output feature maps in the channel dimension as the final output, so the network increases its information flow paths while reducing information redundancy, effectively improving computational efficiency. After these improvements, the improved YOLOv5-7.0 network model is obtained;
Step 1.2: the indoor common categories were selected and the improved YOLOv5-7.0 network model was trained using the COCO 2017 dataset. And acquiring an image by using an RGB-D camera, transmitting the RGB image of the current frame into an improved YOLOv5-7.0 network model in a three-channel mode of 640 x 3, and performing slicing and convolution operation in a Focus structure to form an image feature map of 320 x 32. The image feature map is taken as input to be transmitted into a Backbone network of a backbond, and the backbond comprises CBS, C2f-CA-fast and SPPF 3 network structures, and the Backbone network structure is mainly used for extracting the image features and continuously shrinking the feature map. The extracted characteristic images are transmitted into a Neck structure, the Neck has the main effects that relatively shallow characteristics are obtained from a Backbone, then multi-scale characteristic fusion is carried out on the characteristics and deep semantic characteristics, and the images with the characteristics fused are input into a detection branch and a segmentation branch for the next operation;
step 1.3: and (3) detecting a branch part, namely inputting the feature fusion image generated in the step (1.2) into a YOLACT network to obtain the confidence degrees of category information, frame information and k mask information.
Step 1.4: dividing the branch part, screening the feature fusion image generated in the step 1.2, selecting a feature image with high resolution, sufficient space information and rich semantic information to perform up-sampling operation of the FCN structure, and forming k mask prototype images through a convolution process of 1*1;
Step 1.5: and (3) linearly combining the information generated in the step (1.3) and the information generated in the step (1.4), wherein the combination formula is as follows:
wherein n is the number of object categories identified by the detection branches, k is the number of masks obtained by dividing the branches, P is k image prototype masks, C is confidence information of the masks, and the combined feature map is activated:
M=σ(mask)
where M is the potential dynamic target mask after high confidence processing and σ is the Sigmoid () activation function. The potential dynamic target instance segmentation mask is obtained for further calculation.
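The linear combination and activation of steps 1.3-1.5 can be sketched as follows, assuming YOLACT-style mask assembly (mask = P·C^T followed by a sigmoid); the shapes and names are illustrative:

```python
import numpy as np

def assemble_masks(prototypes, coeffs):
    """Combine k prototype masks with per-instance mask coefficients.

    prototypes: H x W x k prototype maps P from the segmentation branch.
    coeffs:     n x k mask coefficients C from the detection branch.
    Returns an H x W x n stack, one sigmoid-activated mask per instance.
    """
    lin = prototypes @ coeffs.T          # mask = P @ C^T
    return 1.0 / (1.0 + np.exp(-lin))    # M = sigma(mask)
```

Thresholding the activated maps and cropping them to the predicted boxes would then yield the final instance masks.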
FIG. 1 is the overall operation flow chart of the SLAM method, describing the overall operation flow and steps of the improved YOLOv5 and depth-LK optical flow method of the present invention;
FIG. 2 is the system frame diagram of the SLAM method, describing the overall system framework of the improved YOLOv5 and depth-LK optical flow method, which includes four links: RGB-D camera acquisition, the target detection thread, the feature point dynamic judgment thread, and the pose calculation thread;
step 2: and carrying out ORB characteristic point calculation on the current frame RGB image acquired and obtained by the RGB-D camera and the previous frame RGB image to obtain a characteristic point set with direction information, wherein the process is as follows:
Step 2.1: and (3) performing ORB feature point extraction on the previous frame image and the current frame image, counting the number n of the ORB feature points, and performing initialization operation only when the number n of two continuous frames is larger than a given threshold T.
Step 2.2: and calculating the main direction of ORB characteristic points by using a gray centroid method, and finding the centroid position of the image block according to the image moment:
wherein, C is the centroid of the image, m is the moment of the defined image block, and the expression is:
from the centroid C and the geometric center O, a direction vector can be obtainedThe direction of the feature points is thus defined as:
θ=arctan(m 01 /m 10 )
after ORB characteristic points with direction information are obtained, performing next calculation;
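The gray-centroid orientation of step 2.2 can be sketched as follows (a minimal sketch; taking patch coordinates relative to the patch centre is an assumption of this illustration):

```python
import numpy as np

def orb_orientation(patch):
    """Feature-point direction by the intensity-centroid method:
    theta = arctan(m01 / m10), with moments m_pq = sum x^p y^q I(x, y)
    computed in a frame centred on the patch."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs -= (w - 1) / 2.0   # centre the coordinate frame on the patch
    ys -= (h - 1) / 2.0
    m10 = np.sum(xs * patch)
    m01 = np.sum(ys * patch)
    return np.arctan2(m01, m10)  # arctan2 keeps the full quadrant
```

A patch whose bright mass lies to the right of the centre yields an angle near 0; one whose mass lies below yields an angle near π/2.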
step 3: performing depth map preprocessing on a current frame depth image acquired by an RGB-D camera, and in a feature point dynamic judging thread, taking the preprocessed depth image, the potential dynamic target instance segmentation mask obtained in the step 1 and the ORB feature point set obtained in the step 2 as input to a depth-LK optical flow method matcher, and calculating to obtain a high-confidence static feature point set, wherein the specific process is as follows:
step 3.1: performing confidence level rough filtering on the instance segmentation mask of the potential dynamic target, discarding the instance segmentation mask with the confidence level lower than 0.20, and obtaining the instance segmentation mask with reliability;
Step 3.2: because a certain number of dynamic feature points inevitably exist on the boundary of the example segmentation mask, the subsequent steps such as pose calculation and the like are influenced, and the boundary of the example segmentation mask is expanded by combining the depth image so as to ensure that the dynamic feature points are all included in the corresponding example segmentation mask. The specific method comprises the steps of preprocessing a depth map, normalizing depth information, and mapping corresponding numerical values to a designated interval to obtain a preprocessed depth image, wherein the mapping method comprises the following steps:
wherein D is nor For normalizing the depth value, gamma is the amplification factor, D is the current depth value, D max 、D min Representing the maximum depth value and the minimum depth value in the depth image, respectively. Inputting the normalized image into a bilateral filter, so that the image can reduce noise and retain edge information;
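The normalization step can be sketched as follows (a sketch under the assumption that the mapping is a linear min-max scaling with amplification factor γ); the bilateral filtering that follows would typically use an off-the-shelf routine such as OpenCV's cv2.bilateralFilter:

```python
import numpy as np

def normalize_depth(depth, gamma=255.0):
    """Map raw depth values into a designated interval:
    D_nor = gamma * (D - D_min) / (D_max - D_min).
    gamma (the amplification factor) and the flat-map fallback are
    illustrative choices."""
    d_min, d_max = float(depth.min()), float(depth.max())
    if d_max == d_min:                       # flat depth map: avoid /0
        return np.zeros_like(depth, dtype=float)
    return gamma * (depth - d_min) / (d_max - d_min)
```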
step 3.3: combining the example segmentation mask generated in the step 3.1 with the preprocessed depth image obtained in the step 3.2 to expand and thin the dynamic areaThe method comprises the following specific processes: by P i,j Representing a specific pixel point on the boundary of the example segmentation mask, combining the pixel point serving as the origin of coordinates with 8 adjacent pixel points around to form a dynamic generation set P net The dynamic generation set P net Can be expressed as:
P net =p{(i,j)|-1≤i≤1,-1≤j≤1,i、j∈Z}
Within the dynamic generation set P_net, the effective depth value D_p(i, j) is defined: if the pixel p(i, j) lies in the instance segmentation mask region, its effective depth value is its normalized depth value; otherwise the effective depth value is set to 0:

D_p(i, j) = Depth(i, j) if p(i, j) ∈ A_d, otherwise 0

where Depth(i, j) denotes the normalized depth value of the pixel at coordinates (i, j) and A_d is the set of pixels within the instance segmentation mask region. After the effective depth values D_p(i, j) are obtained, the average effective depth D_mean(i, j) of the pixels in the dynamic generation set P_net is calculated:

D_mean(i, j) = Σ_(p(i,j) ∈ P_net) D_p(i, j) / |P_net ∩ A_d|
A threshold δ is set, and the average effective depth range of a common plane is determined as:

δ_min D_mean ≤ Depth(i, j) ≤ δ_max D_mean

wherein δ_min and δ_max denote the minimum and maximum threshold coefficients, respectively. The average effective depth range effectively characterizes the depth range of the plane in which the pixel P_(i,j) lies; taking it as a criterion, different planes in the image can be effectively distinguished, realizing expansion and refinement of the instance segmentation mask;
step 3.4: it is judged in turn whether the depth values of the 8 neighbouring pixels in P_net fall within the average effective depth range; if a point lies within the range, it most likely belongs to the dynamic region and is classified into the instance segmentation mask of the dynamic target. Performing this operation for all boundary pixels P_(i,j) of the instance segmentation mask is recorded as one expansion-and-refinement operation; after 3 repetitions of the expansion-and-refinement operation, the refined instance segmentation mask is obtained;
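One expansion-and-refinement pass of steps 3.3-3.4 can be sketched as follows; the 3×3 neighbourhood handling and the threshold coefficients dmin/dmax (standing in for δ_min, δ_max) are illustrative simplifications:

```python
import numpy as np

def refine_mask_once(mask, depth, dmin=0.9, dmax=1.1):
    """One expansion-and-refinement pass (simplified sketch): for every
    pixel on the mask boundary, compute the average effective depth of
    its 3x3 neighbourhood inside the mask, then absorb neighbours whose
    depth falls within [dmin * D_mean, dmax * D_mean]."""
    h, w = mask.shape
    out = mask.copy()
    for i in range(h):
        for j in range(w):
            if not mask[i, j]:
                continue
            i0, i1 = max(i - 1, 0), min(i + 2, h)
            j0, j1 = max(j - 1, 0), min(j + 2, w)
            nb_mask = mask[i0:i1, j0:j1]
            if nb_mask.all():          # interior pixel, not on the boundary
                continue
            nb_depth = depth[i0:i1, j0:j1]
            d_mean = nb_depth[nb_mask].mean()   # average effective depth
            grow = (nb_depth >= dmin * d_mean) & (nb_depth <= dmax * d_mean)
            out[i0:i1, j0:j1] |= grow
    return out
```

Neighbours on the same depth plane are absorbed into the mask, while neighbours across a depth discontinuity are left out.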
step 3.5: since the refined instance segmentation mask only represents the potential motion of an object, which in reality may still be static, the real motion of the feature points in the mask must be judged accurately: the LK optical flow is calculated and matched for the ORB feature points in the refined instance segmentation mask. Since LK optical flow assumes that the pixel motion within an image block is the same, the following least-squares system holds over the block:

[Σ I_x², Σ I_x I_y; Σ I_x I_y, Σ I_y²] [u; v] = −[Σ I_x I_t; Σ I_y I_t]

wherein I_x is the gradient of the pixel in the x direction, I_y the gradient in the y direction, I_t the temporal gradient, u and v the components of the motion vector, and w the image block size over which the sums are taken. After several iterations, the pixels are tracked and the motion vectors of the feature points are obtained; the motion vectors are screened, and feature points whose modulus is larger than the threshold are marked as dynamic feature points. All dynamic feature points in the refined instance segmentation mask are eliminated, and the remaining static feature points are recorded as the high-confidence static feature point set for the next calculation.
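The per-point LK solve of step 3.5 can be sketched as follows (a single-window sketch; using np.gradient for the spatial derivatives and a plain least-squares solve are assumptions of this illustration — a practical pyramidal, iterative tracker adds warping and multiple scales):

```python
import numpy as np

def lk_flow_at(prev, curr, x, y, w=7):
    """Single-point Lucas-Kanade step: assume constant motion inside a
    (2w+1)^2 window and solve the least-squares normal equations
    [sum Ix^2, sum IxIy; sum IxIy, sum Iy^2] v = -[sum IxIt; sum IyIt]."""
    prev = prev.astype(float)
    curr = curr.astype(float)
    Iy, Ix = np.gradient(prev)          # spatial gradients (rows=y, cols=x)
    It = curr - prev                    # temporal gradient
    sl = (slice(y - w, y + w + 1), slice(x - w, x + w + 1))
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    A = np.array([[np.dot(ix, ix), np.dot(ix, iy)],
                  [np.dot(ix, iy), np.dot(iy, iy)]])
    b = -np.array([np.dot(ix, it), np.dot(iy, it)])
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v                            # (u, v) motion in pixels
```

For a brightness ramp shifted one pixel in +x, the recovered flow is (1, 0); thresholding the flow magnitude then separates dynamic from static points.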
As shown in fig. 3, the schematic diagram of the improved YOLOv5 network model describes the network structure of the improved YOLOv5, comprising the Backbone network, the multi-scale feature fusion network, the detection and segmentation branches, and the linear combination module; through this model, dynamic feature extraction and instance segmentation can be performed on the input RGB-D image to obtain the potential dynamic target instance segmentation masks. FIG. 4 is a schematic diagram of the C2f-CA-Faster module in the improved YOLOv5-7.0 network model of the SLAM method, mainly showing the network architecture details of CA and Faster.
Step 4: in the pose calculation thread, on the basis of a high-confidence static feature point set, removing dynamic feature points in the environment, designing a D-PROSAC feature point sampling method, sampling feature points in the high-confidence static feature point set to remove mismatching and low-quality matching points, and carrying out camera pose estimation by utilizing the rest high-quality feature points, wherein the specific process is as follows:
step 4.1: when calculating the camera pose model, sampling is required in the static feature point set. The conventional RANSAC sampling method has uncertainty and requires that the accuracy of the model be guaranteed by sacrificing more iterations. The invention provides a D-PROSAC sampling method, which evaluates the reliability of data before sampling, so that compared with the traditional RANSAC algorithm, the method has faster convergence rate and calculation accuracy, and the specific flow of the D-PROSAC is as follows: design ratio evaluation function Q 1 (p i ) And 8 feature points with highest scores are selected by taking the feature points as the standard, and an original model F is obtained by an eight-point method 0 The specific process is as follows: recording the static characteristic point set to be sampled as U N N represents the number of feature points in the set, the static feature points to be sampled in the set can be represented as p i And p is i ∈U N . For all static feature points p to be sampled i Calculating the minimum Hamming distance d of the descriptors min1 (p i ) Distance d from the next smallest Hamming distance min2 (p i ) Is recorded as a ratio evaluation function Q 1 ( p i):
Ratio evaluation function Q 1 ( p i) Characterizing the static feature point p to be sampled i Degree of reliability in matching process, Q 1 The smaller the feature point, the higher the matching quality of the feature point. Evaluating the function Q by a ratio 1 As a standard, the static feature point p to be sampled i Ascending order is carried out, and the first 8 points with highest matching quality are selected as an initial sample set U 0 For initial sample set U 0 The original model F can be obtained by using an eight-point method 0 ;
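The ratio evaluation function Q1 can be sketched over bit-packed binary descriptors as follows (the uint8 packing and the zero-distance guard are illustrative):

```python
import numpy as np

def ratio_scores(desc_a, desc_b):
    """Ratio-evaluation score Q1(p_i) = d_min1 / d_min2 over Hamming
    distances between binary descriptors (smaller is better).

    desc_a: N x B uint8 array, one packed descriptor per query point.
    desc_b: M x B uint8 array of candidate descriptors (M >= 2).
    """
    # Hamming distance table via XOR + bit count
    x = desc_a[:, None, :] ^ desc_b[None, :, :]
    dist = np.unpackbits(x, axis=2).sum(axis=2)
    part = np.partition(dist, 1, axis=1)     # two smallest per row
    d1, d2 = part[:, 0], part[:, 1]
    return d1 / np.maximum(d2, 1)            # guard against d2 == 0
```

Sorting the points by this score ascending and keeping the first 8 would yield the initial sample set U_0.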
Step 4.2: obtaining an original model F 0 After that, to U N All static feature points p to be sampled in the interior i Distance d of polar line i And based thereon design the pole pitch evaluation function Q 2 (p i ) The specific process is as follows: for U N Any static feature point p to be sampled in the inner part i The pixel coordinates on the two images where they are located are noted as:
then p is i1 Corresponding polar line I 1 Can be expressed as:
wherein F represents a basic matrix, X, Y, Z represents a polar line I 1 From the above equation, the point p can be obtained i2 To the polar line I 1 Polar distance d of (2) i :
With the epipolar distance d_i as the index, the polar-distance evaluation function Q2(p_i) is designed, where θ is a scaling factor specified to be greater than 0. The polar-distance evaluation function Q2(p_i) characterizes how well the static feature point p_i to be sampled satisfies the epipolar constraint: the larger Q2 is, the higher the matching quality of the feature point;
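The epipolar distance d_i underlying Q2 can be sketched as follows (points in homogeneous pixel coordinates; names illustrative):

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Distance from the matched point p2 to the epipolar line
    l = F @ p1 = (X, Y, Z)^T induced by p1 under the fundamental
    matrix F; p1 and p2 are homogeneous pixel coordinates."""
    X, Y, Z = F @ p1                     # epipolar line coefficients
    return abs(X * p2[0] + Y * p2[1] + Z) / np.hypot(X, Y)
```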
step 4.3: the ratio evaluation function Q1(p_i) and the polar-distance evaluation function Q2(p_i) are fitted to generate the final bidirectional evaluation function Q0(p_i), where β1 and β2 are scaling coefficients used to adjust the ratio evaluation function Q1(p_i) and the polar-distance evaluation function Q2(p_i) to a uniform order of magnitude. The bidirectional evaluation function Q0(p_i) combines the characteristics of the ratio evaluation function and the polar-distance evaluation function: it considers the reliability of feature-point matching and synthesizes how well the feature point satisfies the epipolar constraint, so it measures the comprehensive quality of a feature point; the larger Q0 is, the better the overall quality of the feature point. With the bidirectional evaluation function Q0(p_i) as the index, all feature points in the static feature point set U_N to be sampled are sorted in descending order, i.e.:

∀ u_i, u_j ∈ U_N: i < j → Q0(u_i) > Q0(u_j)
inputting the ordered static characteristic points into the next link to carry out a sampling flow;
step 4.4: for the ordered static characteristic point set U to be sampled N Sampling is carried out to generate a pose model F, and the specific flow is as follows:
step 4.4.1: and determining the maximum iteration number K, the re-projection error threshold delta and the number of inner points threshold M.
Step 4.4.2: determining the size n of the hypothesized generation set U according to the growth function rule of the PROSAC algorithm, and sequencing the static feature point set U N In, the bidirectional evaluation function Q designed in the step 4.3 0 (p i ) As a criterion, the first n feature points are selected as the hypothesis generation set U.
Step 4.4.3: in the hypothesis generation set U, 8 points are randomly selected, and an essential matrix F is obtained by calculation by an eight-point method U 。
Step 4.4.4: for static feature point set U N Inner wall of the containerHas characteristic points, consisting of F U Performing reprojection operation, calculating reprojection error epsilon, and if epsilon<Delta, it is marked as an inner point, and vice versa.
Step 4.4.5: counting the number of internal points M, if M > M, making m=M, otherwise repeating the steps of 4.4.2-4.4.4, and repeating the iteration times k=k+1.
Step 4.4.6: recalculating the essential matrix F from all the inliers after updating U When k is<K, obtaining an essential matrix F U And a new set of inliers, otherwise no model is obtained.
Obtaining an essential matrix F U After that, for F U Singular Value Decomposition (SVD) is performed to obtain the camera pose R, t with high accuracy. As shown in fig. 5, a flow diagram of a depth-LK optical flow method matcher is shown, the matcher filters a dynamic mask according to a confidence level, expands a dynamic region by a processed depth image, calculates LK optical flow in the expanded dynamic region, and realizes finer distinction of the dynamic region.
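The D-PROSAC loop of step 4.4 can be sketched in skeleton form as follows; to keep the sketch self-contained it fits a toy 2-D line instead of the essential matrix, and the linear growth of the hypothesis-generation set is a simplification of PROSAC's growth function — only the control flow (quality-sorted points, progressively enlarged sampling pool, inlier counting) mirrors the steps above:

```python
import numpy as np

def d_prosac_line(points, scores, K=200, delta=0.1, seed=0):
    """Skeleton of the D-PROSAC flow on a toy line-fitting problem:
    points are sorted by a quality score Q0 (descending), samples are
    drawn from a progressively enlarged top-n pool, and the model with
    the most inliers wins."""
    order = np.argsort(-scores)          # descending quality (Q0)
    pts = points[order]
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, -1
    for k in range(K):
        n = min(len(pts), 2 + k // 10)   # simplified growth of the pool
        i, j = rng.choice(n, size=2, replace=False) if n > 2 else (0, 1)
        (x1, y1), (x2, y2) = pts[i], pts[j]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)        # candidate model y = a*x + b
        b = y1 - a * x1
        resid = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = int((resid < delta).sum())
        if inliers > best_inliers:
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers
```

Because the low-scoring points enter the pool only late, the first iterations already sample from high-quality data, which is what gives the strategy its faster convergence over plain RANSAC.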
Simulation experiment:
the simulation environment for the experiments on the dynamic SLAM method based on the depth LK optical flow method and the D-PROSAC sampling strategy is as follows: GPU NVIDIA RTX 3060, CPU i7-12700H, Ubuntu 20.04, CUDA 11.0, PyTorch 1.8.1.
The fr3_half and fr3_walking_xyz sequences in the public dataset TUM Dynamic Objects were selected for evaluation; the entire process is dynamic. To verify the performance of the algorithm in a dynamic environment, dynamic subsequences of the TUM dataset were selected to compare ORB-SLAM3 with the present method, with the absolute trajectory error (ATE) as the criterion; the quantitative comparison shows that the present method significantly improves the positioning accuracy of the visual SLAM system in dynamic environments. FIG. 7 shows the comparison of the absolute trajectory error (ATE) of ORB-SLAM3 and the algorithm of the present invention, wherein (a) shows the absolute trajectory error of ORB-SLAM3 on sequence fr3_half, (b) that of the present method on sequence fr3_half, (c) that of ORB-SLAM3 on sequence fr3_walking_xyz, and (d) that of the present method on sequence fr3_walking_xyz.
TABLE 1. RMSE (m) of absolute trajectory error for ORB-SLAM3 and the method of the invention

Sequence name | ORB-SLAM3 | Method of the invention | Precision improvement
fr3_half | 0.366 | 0.031 | 91.53%
fr3_walking_xyz | 0.556 | 0.017 | 96.94%
FIG. 8 is a comparison of the relative track error (Relative Pose Error, RPE) of ORB-SLAM3 and the algorithm of the present invention, where (a) represents the relative track error of ORB-SLAM3 under the sequence fre3_walking_xyz and (b) represents the relative track error of the present method under the sequence fre3_walking_xyz.
TABLE 2. RMSE (m) of relative pose error for ORB-SLAM3 and the method of the invention

Sequence name | ORB-SLAM3 | Method of the invention | Precision improvement
fr3_half | 0.1517 | 0.0308 | 79.69%
Therefore, the dynamic SLAM method based on the depth LK optical flow method and the D-PROSAC sampling strategy effectively eliminates dynamic characteristic points in the environment, and greatly improves the positioning accuracy and robustness of the system in the dynamic environment.
As indicated above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (5)
1. A dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy is characterized by comprising the following steps:
step 1: training an improved YOLOv5-7.0 network model by using a COCO data set, taking a current frame RGB image acquired by an RGB-D camera as the input of a target detection thread, and transmitting the current frame RGB image into the improved YOLOv5-7.0 network model to obtain an instance segmentation mask of a potential dynamic target;
Step 2: carrying out ORB characteristic point calculation on a current frame RGB image acquired by an RGB-D camera and a previous frame RGB image to obtain a characteristic point set with direction information;
step 3: performing depth map preprocessing on a current frame depth image acquired by an RGB-D camera, taking the preprocessed depth image, the potential dynamic target instance segmentation mask obtained in the step 1 and the ORB feature point set obtained in the step 2 as input to a depth LK optical flow method matcher in a feature point dynamic judging thread, and calculating to obtain a high-confidence static feature point set;
step 4: in the pose calculation thread, on the basis of a high-confidence static feature point set, removing dynamic feature points in the environment, designing a D-PROSAC feature point sampling method, sampling feature points in the high-confidence static feature point set to remove mismatching and low-quality matching points, and estimating the pose of the camera by using the rest high-quality feature points.
2. The dynamic SLAM method based on the deep LK optical flow method and the D-PROSAC sampling strategy of claim 1, wherein the procedure of step 1 is as follows:
step 1.1: in a C2f-CA-fast network, a lightweight attention module CA is used for replacing a convolution module of C3, and feature extraction is carried out from the width dimension and the height dimension of an image so as to obtain attention feature codes of different layers, wherein the attention feature codes are specifically as follows: the input features respectively acquire feature graphs corresponding to different directions through an average pooling layer in different directions, then the convolution fusion separation calculation respectively corresponds to the weighting coefficients, and the key region features of the nonlinear weight parameters to the features output by the network are acquired through a sigmoid function and the highlighting is enhanced;
Adding a fast module, wherein the fast module comprises a Pcony part convolution structure and 1x1 convolution, the Pcony part convolution structure is formed by segmenting characteristics of a channel layer, one part of network characteristics are transmitted in a identity mode, the other part of network characteristics are transmitted in a channel-by-channel convolution mode, and feature graphs output by the two parts are fused in a channel dimension to be used as final output; after the improvement is completed, an improved YOLOv5-7.0 network model is obtained;
step 1.2: training an improved YOLOv5-7.0 network model using the COCO 2017 dataset; an RGB-D camera is utilized to collect images, the current frame RGB images are transmitted into an improved YOLOv5-7.0 network model in a three-channel mode of 640 x 3, slicing and convolution operations are carried out in a Focus structure, and an image feature map of 320 x 32 is formed; taking the image feature map as input and transmitting the image feature map into a Backbone network of a backbond, wherein the backbond comprises CBS, C2f-CA-fast and SPPF 3 network structures and is used for extracting image features and continuously shrinking the feature map; the extracted characteristic images are transmitted into a Neck structure, the Neck structure is used for fusing the characteristics of different layers, and the images with the characteristics fused are input into a detection branch and a segmentation branch for the next operation;
Step 1.3: detecting a branch part, inputting the feature fusion image generated in the step 1.2 into a YOLACT network to obtain the confidence degrees of category information, frame information and k mask information;
step 1.4: dividing the branch part, screening the feature fusion image generated in the step 1.2, selecting a feature image with high resolution, sufficient space information and rich semantic information to perform up-sampling operation of the FCN structure, and forming k mask prototype images through a convolution process of 1*1;
step 1.5: and (3) linearly combining the information generated in the step (1.3) and the information generated in the step (1.4), wherein the combination formula is as follows:
wherein n is the number of object categories identified by the detection branches, k is the number of masks obtained by dividing the branches, P is k image prototype masks, C is confidence information of the masks, and the combined feature map is activated:
M=σ(mask)#(2)
in the formula, M is a potential dynamic target mask after high confidence processing, and sigma is a Sigmoid () activation function; the potential dynamic target instance segmentation mask is obtained for further calculation.
3. The dynamic SLAM method based on the deep LK optical flow method and the D-PROSAC sampling strategy of claim 1, wherein the procedure of step 2 is as follows:
Step 2.1: ORB feature point extraction is carried out on the previous frame image and the current frame image, the number n of the ORB feature points is counted, and initialization operation is carried out only when the number n of two continuous frames is larger than a given threshold T;
step 2.2: and calculating the main direction of ORB characteristic points by using a gray centroid method, and finding the centroid position of the image block according to the image moment:
wherein, C is the centroid of the image, m is the moment of the defined image block, and the expression is:
centroid C and geometric center O obtained from equation 3, obtain direction vectorThe direction of the feature points is thus defined as:
thereby obtaining ORB characteristic points with direction information for further calculation.
4. The dynamic SLAM method based on the deep LK optical flow method and the D-PROSAC sampling strategy of claim 1, wherein the procedure of step 3 is as follows:
step 3.1: performing coarse filtering on the instance segmentation mask of the potential dynamic target according to the confidence coefficient;
step 3.2: performing expansion processing on the boundary of the instance segmentation mask by combining the depth image so as to ensure that the dynamic feature points are all classified into the corresponding instance segmentation mask; the specific method comprises the steps of preprocessing a depth map, normalizing depth information, and mapping corresponding numerical values to a designated interval to obtain a preprocessed depth image, wherein the mapping method comprises the following steps:
Wherein D is nor For normalizing the depth value, gamma is the amplification factor, D is the current depth value, D max 、D min Respectively representing the maximum depth value and the minimum depth value in the current depth image, and inputting the normalized image into a bilateral filter so that the image can reduce noise and retain edge information;
step 3.3: the expansion and refinement of the dynamic region are carried out by combining the example segmentation mask generated in the step 3.1 and the preprocessed depth image obtained in the step 3.2, and the specific process is as follows: by P i,j Representing a specific pixel point on the boundary of the example segmentation mask, combining the pixel point serving as the origin of coordinates with 8 adjacent pixel points around to form a dynamic generation set P net The dynamic generation set P net Expressed as:
P nei ={p(i,j)|-1≤i≤1,-1≤j≤1,i、j∈Z}#(7)
within the dynamic generation set P_net, the effective depth value D_p(i, j) is defined: if the pixel p(i, j) lies in the instance segmentation mask region, its effective depth value is its normalized depth value; otherwise the effective depth value is set to 0:

D_p(i, j) = Depth(i, j) if p(i, j) ∈ A_d, otherwise 0    (8)

where Depth(i, j) denotes the normalized depth value of the pixel at coordinates (i, j) and A_d is the set of pixels within the instance segmentation mask region; after the effective depth values D_p(i, j) are obtained, the average effective depth D_mean(i, j) of the pixels in the dynamic generation set P_net is calculated:

D_mean(i, j) = Σ_(p(i,j) ∈ P_net) D_p(i, j) / |P_net ∩ A_d|    (9)
setting threshold coefficients δ, the average effective depth range of a unified plane is determined as:
δ_min·D_mean ≤ Depth(i,j) ≤ δ_max·D_mean #(10)
wherein δ_min, δ_max respectively represent the minimum and maximum threshold coefficients; the average effective depth range effectively characterizes the depth range of the plane in which the pixel point P_i,j lies, and taking it as the criterion allows different planes in the image to be effectively distinguished, realizing the expansion and refinement of the instance segmentation mask;
step 3.4: judging in turn whether the depth values of the 8 neighboring pixels within P_nei fall inside the average effective depth range, and classifying the pixels that do into the instance segmentation mask of the dynamic target; performing this operation on all boundary pixel points P_i,j of the instance segmentation mask is recorded as one expansion-and-refinement operation, and after repeating the expansion-and-refinement operation 3 times, the refined instance segmentation mask is obtained;
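Steps 3.3–3.4 can be sketched as below. Two points are assumptions: δ_min/δ_max are illustrative values, and since the exact form of Eq. (9) is an image not reproduced in the text, the average here is taken over the in-mask neighbors only; for clarity every mask pixel is treated as a candidate boundary pixel.

```python
import numpy as np

def refine_mask(mask, depth_nor, delta_min=0.8, delta_max=1.2, n_iter=3):
    """Steps 3.3-3.4 sketch: grow the instance mask by admitting 8-neighbors
    whose normalized depth falls inside the average-effective-depth range."""
    mask = mask.astype(bool).copy()
    h, w = mask.shape
    for _ in range(n_iter):                                # 3 refinement passes
        grown = mask.copy()
        ys, xs = np.nonzero(mask)
        for i, j in zip(ys, xs):
            if not (0 < i < h - 1 and 0 < j < w - 1):
                continue
            nei = np.s_[i - 1:i + 2, j - 1:j + 2]          # P_nei, Eq. (7)
            d_eff = np.where(mask[nei], depth_nor[nei], 0.0)   # Eq. (8)
            valid = d_eff[d_eff > 0]
            if valid.size == 0:
                continue
            d_mean = valid.mean()          # Eq. (9); in-mask mean (assumption)
            win = depth_nor[nei]
            ok = (win >= delta_min * d_mean) & (win <= delta_max * d_mean)
            grown[nei] |= ok               # admit neighbors passing Eq. (10)
        mask = grown
    return mask
```

With a uniform-depth plane next to a deeper plane, the mask grows across the plane it was seeded on but stops at the depth discontinuity.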
step 3.5: calculating LK optical flow for the ORB feature points within the refined instance segmentation mask and performing matching; LK optical flow assumes that the pixels within an image block share the same motion, giving the following equation for each pixel k in the block:
I_x(k)·u + I_y(k)·v = −I_t(k), k = 1, …, w² #(11)
wherein I_x is the gradient of the pixel in the x direction, I_y is the gradient of the pixel in the y direction, I_t is the temporal gradient between the two frames, (u, v) is the motion vector to be solved, and w is the image block size;
solving and iterating this system multiple times tracks the pixel points and yields the motion vectors of the feature points; the motion vectors are then screened: feature points whose modulus exceeds a threshold are marked as dynamic feature points, all dynamic feature points within the refined instance segmentation mask are eliminated, and the remaining static feature points are recorded as the high-confidence static feature point set for the next stage of calculation.
5. The dynamic SLAM method based on the deep LK optical flow method and the D-PROSAC sampling strategy of claim 1, wherein the process of step 4 is as follows:
step 4.1: the specific flow of D-PROSAC is as follows: design the ratio evaluation function Q_1(p_i), select the 8 highest-scoring feature points by this criterion, and obtain the original model F_0 by the eight-point method; the specific process is as follows: record the static feature point set to be sampled as U_N, where N represents the number of feature points in the set, and denote a static feature point to be sampled in the set as p_i, with p_i ∈ U_N; for every static feature point p_i to be sampled, calculate the ratio of the minimum Hamming distance d_min1(p_i) of its descriptor to the second-smallest Hamming distance d_min2(p_i), recorded as the ratio evaluation function Q_1(p_i):
Q_1(p_i) = d_min1(p_i)/d_min2(p_i) #(12)
the ratio evaluation function Q_1(p_i) characterizes the reliability of the static feature point p_i to be sampled in the matching process, a smaller Q_1 indicating higher matching quality of the feature point; taking the ratio evaluation function Q_1 as the criterion, sort the static feature points p_i to be sampled in ascending order and select the first 8 points with the highest matching quality as the initial sample set U_0; the original model F_0 is obtained from the initial sample set U_0 by the eight-point method;
step 4.2: after obtaining the original model F_0, calculate for all static feature points p_i to be sampled within U_N the distance d_i to the epipolar line, and on this basis design the epipolar distance evaluation function Q_2(p_i); the specific process is as follows: for any static feature point p_i to be sampled in U_N, record its pixel coordinates on the two images in which it appears as:
p_i1 = (u_i1, v_i1, 1)^T, p_i2 = (u_i2, v_i2, 1)^T #(13)
the epipolar line l_1 corresponding to p_i1 is then expressed as:
l_1 = F·p_i1 = (X, Y, Z)^T #(14)
wherein F represents the fundamental matrix and X, Y, Z represent the three directional components of the epipolar line l_1; from these, the epipolar distance d_i from p_i2 to the epipolar line l_1 is obtained:
d_i = |X·u_i2 + Y·v_i2 + Z| / √(X² + Y²) #(15)
taking the epipolar distance d_i as the index, design the epipolar distance evaluation function Q_2(p_i):
wherein θ is a scaling factor specified to be larger than 0; the epipolar distance evaluation function Q_2(p_i) characterizes the degree to which the static feature point p_i to be sampled satisfies the epipolar constraint, a larger Q_2 indicating higher matching quality of the feature point;
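The epipolar distance of step 4.2 can be sketched as below. Note that the claim's Eq. (16) for Q_2 is an image not reproduced in the text, so `q2_score` is a hypothetical surrogate: any positive function of d_i that decreases with d_i and is scaled by θ > 0 matches the stated behaviour.

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Step 4.2 sketch: epipolar line l1 = F @ p_i1 (Eq. 14), then the
    point-to-line distance of Eq. (15)."""
    X, Y, Z = F @ np.array([p1[0], p1[1], 1.0])
    return abs(X * p2[0] + Y * p2[1] + Z) / np.hypot(X, Y)

def q2_score(d, theta=1.0):
    """Hypothetical stand-in for Eq. (16): larger when the epipolar
    constraint is better satisfied (smaller d)."""
    return np.exp(-theta * np.asarray(d))
```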
step 4.3: fusing the ratio evaluation function Q_1(p_i) and the epipolar distance evaluation function Q_2(p_i) to generate the final bidirectional evaluation function Q_0(p_i):
wherein β_1, β_2 are scaling coefficients used to adjust the ratio evaluation function Q_1(p_i) and the epipolar distance evaluation function Q_2(p_i) to a uniform order of magnitude; the bidirectional evaluation function Q_0(p_i) combines the characteristics of the ratio evaluation function and the epipolar distance evaluation function, taking into account the reliability of feature point matching while integrating the degree to which the feature point satisfies the epipolar constraint, so that it measures the comprehensive quality of a feature point, a larger Q_0 indicating better comprehensive quality of the feature point; taking the bidirectional evaluation function Q_0(p_i) as the index, all feature points in the static feature point set U_N to be sampled are sorted in descending order, namely:
∀ u_i, u_j ∈ U_N: i < j → Q_0(u_i) > Q_0(u_j) #(18)
the sorted static feature points are then input into the next stage for the sampling flow;
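Step 4.3 can be sketched as follows. The fusion formula (Eq. 17) is an image not reproduced in the text, so the form below is a hypothetical choice that merely respects the stated monotonicity: Q_0 rises as Q_1 falls and as Q_2 rises, with β_1, β_2 balancing the two terms' magnitudes.

```python
import numpy as np

def sort_by_q0(points, q1, q2, beta1=1.0, beta2=1.0):
    """Step 4.3 sketch: fuse Q1 (smaller = better) and Q2 (larger = better)
    into Q0 and sort points in descending Q0 order (Eq. 18)."""
    # Hypothetical fusion standing in for Eq. (17)
    q0 = beta1 * (1.0 - np.asarray(q1)) + beta2 * np.asarray(q2)
    order = np.argsort(-q0)            # descending comprehensive quality
    return points[order], q0[order]
```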
step 4.4: for the ordered static characteristic point set U to be sampled N Sampling is carried out to generate a pose model F, and the specific flow is as follows:
step 4.4.1: determining the maximum iteration number K, the re-projection error threshold delta and the inner point number threshold M;
step 4.4.2: determine the size n of the hypothesis generation set U according to the growth function rule of the PROSAC algorithm, and from the sorted static feature point set U_N, taking the bidirectional evaluation function Q_0(p_i) designed in step 4.3 as the criterion, select the first n feature points as the hypothesis generation set U;
step 4.4.3: randomly select 8 points within the hypothesis generation set U and calculate the essential matrix F_U by the eight-point method;
step 4.4.4: perform a reprojection operation with F_U on all feature points in the static feature point set U_N and calculate the reprojection error ε; if ε is smaller than δ, mark the point as an inlier, otherwise mark it as an outlier;
step 4.4.5: count the number m of inliers; if m > M, set M = m; otherwise set k = k + 1 and repeat steps 4.4.2 to 4.4.4;
step 4.4.6: recalculate the essential matrix F_U from all inliers after the update; when k < K, the essential matrix F_U and the new inlier set are obtained, otherwise no model is obtained;
after obtaining the essential matrix F_U, performing singular value decomposition (SVD) on F_U yields the camera pose R, t.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311572768.1A CN117541652A (en) | 2023-11-23 | 2023-11-23 | Dynamic SLAM method based on depth LK optical flow method and D-PROSAC sampling strategy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117541652A true CN117541652A (en) | 2024-02-09 |
Family
ID=89795426
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117974711A (en) * | 2024-04-02 | 2024-05-03 | 荣耀终端有限公司 | Video frame inserting method and related equipment |
CN118247591A (en) * | 2024-05-20 | 2024-06-25 | 华南农业大学 | Dynamic mask eliminating method based on multiple geometric constraints |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||