CN117541652A - Dynamic SLAM method based on depth LK optical flow method and D-PROSAC sampling strategy - Google Patents
- Publication number: CN117541652A
- Application number: CN202311572768.1A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
- G06N3/045 — Combinations of networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06T7/269 — Analysis of motion using gradient-based methods
- G06T7/557 — Depth or shape recovery from multiple images from light fields, e.g. from plenoptic cameras
- G06V10/267 — Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V10/454 — Integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06T2207/10016 — Video; image sequence
- G06T2207/10024 — Color image
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20112 — Image segmentation details
Abstract
The invention discloses a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy, suitable for robot navigation in indoor dynamic environments. Image information is acquired with an RGB-D camera, and the current-frame RGB image is passed into an improved YOLOv5-7.0 network model to obtain instance segmentation of potential dynamic targets. To further judge whether the feature points in these dynamic regions are truly dynamic, ORB feature points are computed on the previous and current RGB frames and matched with a depth LK optical flow method in combination with the preprocessed current-frame depth image, while the instance segmentation mask boundary is refined and expanded, so that truly dynamic feature points are effectively distinguished. In the pose calculation thread, a D-PROSAC method is designed to replace the traditional RANSAC algorithm for feature point sampling, and only static features are used for pose calculation and trajectory estimation, which effectively improves feature point matching accuracy in dynamic environments and enhances the robustness of the SLAM system.
Description
Technical Field
The invention relates to the technical field of robot positioning and navigation, and in particular to a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy.
Background
Simultaneous localization and mapping (SLAM) is the basis for robots to achieve autonomous localization and navigation in unknown environments. With the development of deep learning in recent years, fusing semantic information with SLAM systems has broad prospects: semantic information can effectively enhance a robot's perception of its surrounding environment and assist the operation of the SLAM system, allowing the robot to understand elements in the environment the way humans do and thereby achieve more reasonable path planning and decision making.
Most existing SLAM systems are based on the static-scene assumption; however, real environments contain many dynamic objects, and feature points on dynamic objects that are erroneously included in the point cloud for pose calculation cause large errors in the result. Even when mapping remains possible, the dynamic regions of the constructed map contain many residual shadows, greatly reducing its readability. Although many schemes now address the dynamic SLAM problem with deep learning methods, their precision and efficiency remain problematic; the YOLOv5-7.0 network has not been used for instance segmentation and discrimination of dynamic objects, and few schemes handle the mask boundary during segmentation. In the feature point sampling link, the traditional RANSAC algorithm can filter some dynamic feature points, but it guarantees model accuracy only by sacrificing more iterations, and it risks failure when a large dynamic area exists in the environment.
For indoor dynamic environments, most existing solutions suffer from poor precision and low efficiency; therefore, filtering the dynamic features in the environment and effectively improving the calculation precision of the SLAM system, while guaranteeing the real-time performance of the algorithm, is of great significance.
Disclosure of Invention
Aiming at the defects of the prior art, which is easily disturbed by dynamic objects and consequently suffers pose estimation drift and reduced precision, the invention provides a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy, which is suitable for robot navigation in indoor dynamic environments and improves the performance of mobile robot SLAM in such environments.
The technical scheme adopted for solving the technical problems is as follows:
a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy comprises the following steps:
step 1: training an improved YOLOv5-7.0 network model by using a COCO data set, taking a current frame RGB image acquired by an RGB-D camera as the input of a target detection thread, and transmitting the current frame RGB image into the improved YOLOv5-7.0 network model to obtain an instance segmentation mask of a potential dynamic target, wherein the process is as follows:
Step 1.1: SLAM has high real-time requirements, so the network structure of YOLOv5-7.0 is improved: a lighter-weight C2f-CA-Faster network is proposed to replace the C3 network module, which suffers from a large parameter count and poor real-time performance. In the C2f-CA-Faster network, the lightweight attention module CA replaces the convolution module of C3 and extracts features along the width and height dimensions of the image to obtain attention feature encodings at different levels: the input features are passed through average pooling layers in the two directions to obtain direction-specific feature maps, the corresponding weighting coefficients are computed by convolution, fusion and separation, and a sigmoid function highlights the key-region features of the network output, effectively improving the model's ability to extract discriminative target features. In addition, a Faster module is added, consisting mainly of a PConv partial-convolution structure and a 1×1 convolution: the features are split along the channel dimension, one part is propagated by identity mapping and the other by channel-wise convolution, and the two output feature maps are fused in the channel dimension as the final output, so that the network increases the information flow paths while reducing information redundancy, effectively improving computational efficiency. After these improvements, the improved YOLOv5-7.0 network model is obtained;
Step 1.2: Common indoor categories are selected and the improved YOLOv5-7.0 network model is trained on the COCO 2017 dataset. Images are acquired with an RGB-D camera, and the current-frame RGB image is fed into the improved YOLOv5-7.0 network model in three-channel form (640×640×3); slicing and convolution in the Focus structure form a 320×320×32 image feature map. The feature map is passed into the Backbone network, which comprises the CBS, C2f-CA-Faster and SPPF network structures and is mainly used to extract image features while continuously shrinking the feature map. The extracted feature maps are then passed into the Neck structure, which fuses features from different levels, and the feature-fused images are input to the detection branch and the segmentation branch for the next operation;
Step 1.3: Detection branch: input the feature-fusion image generated in step 1.2 into the YOLACT network to obtain category information, bounding-box information, and the confidence coefficients of the k masks.
Step 1.4: Segmentation branch: screen the feature-fusion images generated in step 1.2, select a feature map with high resolution, sufficient spatial information and rich semantic information for the up-sampling operation of the FCN structure, and form k mask prototype images through a 1×1 convolution;
Step 1.5: and (3) linearly combining the information generated in the step (1.3) and the information generated in the step (1.4), wherein the combination formula is as follows:
wherein n is the number of object categories identified by the detection branches, k is the number of masks obtained by dividing the branches, P is k image prototype masks, C is confidence information of the masks, and the combined feature map is activated:
M=σ(mask)
where M is the potential dynamic target mask after high confidence processing and σ is the Sigmoid () activation function. The potential dynamic target instance segmentation mask is obtained for further calculation.
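The prototype–coefficient combination and sigmoid activation described above can be sketched in a few lines; the YOLACT-style shapes (h×w×k prototypes, n×k coefficients) are assumptions for illustration:

```python
import numpy as np

np.random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine_masks(P, C):
    """Linearly combine k prototype masks P (h, w, k) with per-instance
    mask coefficients C (n, k), then apply the sigmoid activation:
    M = sigma(P C^T)."""
    mask = P @ C.T               # (h, w, n): one combined map per instance
    return sigmoid(mask)

# toy example: 4x4 prototypes, k = 2, one detected instance
P = np.random.randn(4, 4, 2)
C = np.random.randn(1, 2)
M = combine_masks(P, C)
assert M.shape == (4, 4, 1)
assert np.all((M > 0) & (M < 1))   # sigmoid outputs lie strictly in (0, 1)
```

Thresholding M then yields the binary instance segmentation mask used in the following steps.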
Step 2: and carrying out ORB characteristic point calculation on the current frame RGB image acquired and obtained by the RGB-D camera and the previous frame RGB image to obtain a characteristic point set with direction information, wherein the process is as follows:
step 2.1: and (3) performing ORB feature point extraction on the previous frame image and the current frame image, counting the number n of the ORB feature points, and performing initialization operation only when the number n of two continuous frames is larger than a given threshold T.
Step 2.2: and calculating the main direction of ORB characteristic points by using a gray centroid method, and finding the centroid position of the image block according to the image moment:
wherein, C is the centroid of the image, m is the moment of the defined image block, and the expression is:
from the centroid C and the geometric center O, a direction vector can be obtained The direction of the feature points is thus defined as:
θ=arctan(m 01 /m 10 )
after ORB characteristic points with direction information are obtained, performing next calculation;
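The gray-centroid orientation computation can be sketched as follows; the patch size and the use of `arctan2` (rather than plain `arctan`, so the quadrant is preserved) are illustrative choices:

```python
import numpy as np

def orb_orientation(patch):
    """Gray-centroid orientation of an image patch: compute the moments
    m10 and m01 about the geometric centre O, then theta = arctan2(m01, m10)."""
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]].astype(float)
    # centre the coordinates so the geometric centre O is the origin
    xs -= (patch.shape[1] - 1) / 2.0
    ys -= (patch.shape[0] - 1) / 2.0
    m10 = np.sum(xs * patch)     # first moment in x
    m01 = np.sum(ys * patch)     # first moment in y
    return np.arctan2(m01, m10)

# a patch brighter on its right side should point along +x (theta = 0)
patch = np.zeros((5, 5)); patch[:, 3:] = 1.0
assert abs(orb_orientation(patch)) < 1e-9
```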
Step 3: Preprocess the current-frame depth image acquired by the RGB-D camera. In the feature point dynamic-judgment thread, take the preprocessed depth image, the potential dynamic target instance segmentation mask obtained in step 1, and the ORB feature point set obtained in step 2 as input to the depth LK optical flow matcher, and calculate a high-confidence static feature point set. The specific process is as follows:
step 3.1: performing coarse filtering on the instance segmentation mask of the potential dynamic target according to the confidence coefficient;
Step 3.2: A certain number of dynamic feature points inevitably lie on the boundary of the instance segmentation mask and would affect subsequent steps such as pose calculation, so the mask boundary is expanded in combination with the depth image to ensure that the dynamic feature points are all included in the corresponding instance segmentation mask. Specifically, the depth map is first preprocessed by normalizing the depth information and mapping the values to a designated interval, obtaining the preprocessed depth image. The mapping is:

D_nor = γ · (D − D_min) / (D_max − D_min)

where D_nor is the normalized depth value, γ is the amplification factor, D is the current depth value, and D_max, D_min are the maximum and minimum depth values in the depth image, respectively. The normalized image is then passed through a bilateral filter, which reduces noise while retaining edge information;
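A minimal sketch of the normalization step, assuming γ simply sets the upper end of the target interval; the bilateral filtering that follows would typically use a library routine (e.g. OpenCV's bilateral filter) and is omitted here:

```python
import numpy as np

def normalize_depth(D, gamma=255.0):
    """Map raw depth values into [0, gamma]:
    D_nor = gamma * (D - D_min) / (D_max - D_min)."""
    D = D.astype(float)
    d_min, d_max = D.min(), D.max()
    return gamma * (D - d_min) / (d_max - d_min)

depth = np.array([[500, 1000], [1500, 2000]])   # raw depth in millimetres
d = normalize_depth(depth, gamma=255.0)
assert d.min() == 0.0 and d.max() == 255.0
```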
Step 3.3: Expand and refine the dynamic region by combining the instance segmentation mask generated in step 3.1 with the preprocessed depth image obtained in step 3.2. The specific process is as follows: let P_{i,j} denote a specific pixel on the boundary of the instance segmentation mask; taking this pixel as the coordinate origin, combine it with its 8 surrounding neighbor pixels to form the dynamic generation set P_net, which can be expressed as:

P_net = { p(i, j) | −1 ≤ i ≤ 1, −1 ≤ j ≤ 1, i, j ∈ Z }

Within the range of P_net, define the effective depth value D_p(i, j): if the pixel p(i, j) lies in the instance segmentation mask region, its effective depth value is its normalized depth value, otherwise the effective depth value is set to 0:

D_p(i, j) = Depth(i, j), if p(i, j) ∈ A_d;  D_p(i, j) = 0, otherwise

where Depth(i, j) is the normalized depth value of the pixel at coordinates (i, j) and A_d is the set of pixels within the instance segmentation mask region. After obtaining the effective depth values D_p(i, j), calculate the average effective depth D_mean(i, j) of the pixels inside P_net:

D_mean(i, j) = (1 / N_d) · Σ_{p(i,j) ∈ P_net} D_p(i, j)

where N_d is the number of pixels of P_net that lie inside the mask region A_d.
A threshold is then set, determining the depth range of a unified plane around the average effective depth D_mean:

δ_min · D_mean ≤ Depth(i, j) ≤ δ_max · D_mean

where δ_min, δ_max are the minimum and maximum threshold coefficients, respectively. This average effective depth range effectively characterizes the depth range of the plane on which the pixel P_{i,j} lies; using it as a criterion, different planes in the image can be effectively distinguished, realizing the expansion and refinement of the instance segmentation mask;
Step 3.4: Judge in turn whether the depth values of the 8 neighboring pixels within P_net fall within the average effective depth range; a pixel within the range very likely belongs to the dynamic region and is classified into the instance segmentation mask of the dynamic target. Performing this operation for all boundary pixels P_{i,j} of the mask is recorded as one expansion-refinement operation; after repeating the expansion-refinement operation 3 times, the refined instance segmentation mask is obtained;
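One expansion-refinement pass over the mask boundary, as described in steps 3.3 and 3.4, might look like the following sketch; the threshold coefficients `dmin`/`dmax` are placeholders:

```python
import numpy as np

def refine_mask_once(mask, depth, dmin=0.8, dmax=1.2):
    """One expansion/refinement pass: for each boundary pixel of the
    instance mask, average the effective depth of its 3x3 neighbourhood
    (only pixels already inside the mask count), then absorb neighbours
    whose depth lies within [dmin*D_mean, dmax*D_mean]."""
    h, w = mask.shape
    new_mask = mask.copy()
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            if not mask[i, j]:
                continue
            nbh = np.s_[i-1:i+2, j-1:j+2]        # the 3x3 set P_net
            if mask[nbh].all():                   # interior pixel, not boundary
                continue
            eff = np.where(mask[nbh], depth[nbh], 0.0)   # D_p(i, j)
            d_mean = eff.sum() / mask[nbh].sum()         # average effective depth
            grow = (~mask[nbh]) & (depth[nbh] >= dmin * d_mean) \
                                & (depth[nbh] <= dmax * d_mean)
            new_mask[nbh] |= grow
    return new_mask

# a same-depth neighbouring column is absorbed; a far plane is not
mask = np.zeros((3, 4), bool); mask[:, :2] = True
depth = np.array([[1., 1., 1., 5.]] * 3)
out = refine_mask_once(mask, depth)
assert out[1, 2] and not out[1, 3]
```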
Step 3.5: The refined instance segmentation mask only indicates a potential for motion; in reality the object may still be static. To accurately judge the real motion of the feature points inside the mask, the LK optical flow is calculated and matched for the ORB feature points within the refined mask. Since LK optical flow assumes that all pixels in an image block share the same motion, the following over-determined system holds for the w×w pixels of the block:

[ I_x(p_1) I_y(p_1) ; … ; I_x(p_{w²}) I_y(p_{w²}) ] · [u ; v] = − [ I_t(p_1) ; … ; I_t(p_{w²}) ]

where I_x is the gradient of a pixel in the x direction, I_y is the gradient in the y direction, I_t is the temporal gradient, and w is the image block size. Solving this least-squares system and iterating several times tracks the pixel points and yields the motion vectors of the feature points. The motion vectors are then screened: feature points whose vector modulus exceeds a threshold are marked as dynamic. All dynamic feature points in the refined instance segmentation mask are eliminated, and the remaining static feature points are recorded as the high-confidence static feature point set for the next calculation.
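The per-window least-squares solve behind LK optical flow can be sketched directly; the synthetic window below is constructed so that the true motion is (1, 0):

```python
import numpy as np

np.random.seed(0)

def lk_flow(Ix, Iy, It):
    """Solve the Lucas-Kanade least-squares system over a window:
    each pixel contributes Ix*u + Iy*v = -It; stack all pixels into
    A [u v]^T = b and solve in the least-squares sense."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # (w*w, 2)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# brightness constancy: for motion (1, 0), It = -Ix*1 - Iy*0
Ix = np.random.randn(5, 5); Iy = np.random.randn(5, 5)
It = -Ix
u, v = lk_flow(Ix, Iy, It)
assert abs(u - 1.0) < 1e-9 and abs(v) < 1e-9
```

In practice this solve is repeated in a coarse-to-fine iteration, as noted in the text.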
Step 4: in the pose calculation thread, on the basis of a high-confidence static feature point set, removing dynamic feature points in the environment, designing a D-PROSAC feature point sampling method, sampling feature points in the high-confidence static feature point set to remove mismatching and low-quality matching points, and carrying out camera pose estimation by utilizing the rest high-quality feature points, wherein the specific process is as follows:
Step 4.1: When calculating the camera pose model, samples must be drawn from the static feature point set. The conventional RANSAC sampling method is uncertain and guarantees model accuracy only by sacrificing more iterations. The invention proposes a D-PROSAC sampling method that evaluates the reliability of the data before sampling, so that compared with the traditional RANSAC algorithm it converges faster and calculates more accurately. The specific flow of D-PROSAC is as follows: design a ratio evaluation function Q_1(p_i), select the 8 highest-scoring feature points by this criterion, and obtain an original model F_0 by the eight-point method. Specifically: record the static feature point set to be sampled as U_N, where N is the number of feature points in the set and each static feature point to be sampled is denoted p_i, with p_i ∈ U_N. For every static feature point p_i to be sampled, calculate the ratio of the minimum Hamming distance d_min1(p_i) of its descriptor to the next-smallest Hamming distance d_min2(p_i), recorded as the ratio evaluation function Q_1(p_i):

Q_1(p_i) = d_min1(p_i) / d_min2(p_i)

The ratio evaluation function Q_1(p_i) characterizes the reliability of the static feature point p_i during matching: the smaller Q_1, the higher the matching quality of the feature point. With Q_1 as criterion, sort the static feature points p_i in ascending order and select the first 8 points with the highest matching quality as the initial sample set U_0; from U_0 the original model F_0 is obtained by the eight-point method;
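The ratio evaluation function on binary descriptors can be sketched as follows, with toy 8-bit descriptors standing in for full-length ORB descriptors:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary descriptors stored as uint8 arrays."""
    return int(np.unpackbits(a ^ b).sum())

def ratio_score(desc, candidates):
    """Q1(p_i): smallest over second-smallest Hamming distance between a
    point's descriptor and all candidate descriptors in the other frame.
    Smaller Q1 means a more unambiguous, reliable match."""
    dists = sorted(hamming(desc, c) for c in candidates)
    return dists[0] / dists[1]

desc  = np.array([0b11110000], dtype=np.uint8)
cands = [np.array([0b11110001], dtype=np.uint8),   # distance 1 (best match)
         np.array([0b11111111], dtype=np.uint8),   # distance 4
         np.array([0b00000000], dtype=np.uint8)]   # distance 4
assert ratio_score(desc, cands) == 0.25            # 1 / 4: unambiguous match
```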
Step 4.2: obtaining an original model F 0 After that, to U N All static feature points p to be sampled in the interior i Distance d of polar line i And based thereon design the pole pitch evaluation function Q 2 (p i ) The specific process is as follows: for U N Any static feature point p to be sampled in the inner part i The pixel coordinates on the two images where they are located are noted as:
Then p is i1 Corresponding polar line I 1 Can be expressed as:
wherein F represents a basic matrix, X, Y, Z represents a polar line I 1 From the above equation, the point p can be obtained i2 To the polar line I 1 Polar distance d of (2) i :
With the epipolar distance d_i as index, design the pole-pitch evaluation function Q_2(p_i):

Q_2(p_i) = θ / (θ + d_i)

where θ is a scaling factor specified to be greater than 0. The pole-pitch evaluation function Q_2(p_i) characterizes how well the static feature point p_i to be sampled satisfies the epipolar constraint: the larger Q_2, the higher the matching quality of the feature point;
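The epipolar distance, together with one plausible form of the pole-pitch evaluation function consistent with the stated properties (θ > 0, larger Q_2 for smaller error), can be sketched as:

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Distance from p2 to the epipolar line l = F p1 = (X, Y, Z)^T:
    d = |p2^T F p1| / sqrt(X^2 + Y^2). Points are homogeneous pixel coords."""
    l = F @ p1
    return abs(p2 @ l) / np.hypot(l[0], l[1])

def q2_score(d, theta=1.0):
    """Assumed form of the pole-pitch evaluation function: equals 1 for a
    perfect epipolar fit (d = 0) and decays as the error d grows."""
    return theta / (theta + d)

# pure x-translation: E = [t]_x with t = (1, 0, 0)
F  = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
p1 = np.array([2.0, 3.0, 1.0])
p2 = np.array([5.0, 3.0, 1.0])        # lies exactly on the epipolar line y = 3
assert epipolar_distance(F, p1, p2) < 1e-12
assert q2_score(0.0) == 1.0
```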
Step 4.3: Fit the ratio evaluation function Q_1(p_i) and the pole-pitch evaluation function Q_2(p_i) to generate the final bidirectional evaluation function Q_0(p_i):

Q_0(p_i) = β_2 · Q_2(p_i) − β_1 · Q_1(p_i)

where β_1, β_2 are scaling coefficients used to adjust the ratio evaluation function Q_1(p_i) and the pole-pitch evaluation function Q_2(p_i) to a uniform order of magnitude (so that smaller Q_1 and larger Q_2 both increase Q_0). The bidirectional evaluation function Q_0(p_i) combines the characteristics of the ratio evaluation function and the pole-pitch evaluation function: it accounts for the reliability of feature point matching and for how well the feature point satisfies the epipolar constraint, and can therefore measure the comprehensive quality of a feature point — the larger Q_0, the better the overall quality. With Q_0(p_i) as index, sort all feature points in the static feature point set U_N to be sampled in descending order, i.e.:

∀ u_i, u_j ∈ U_N : i < j → Q_0(u_i) > Q_0(u_j)

The sorted static feature points are input into the next link for the sampling flow;
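A sketch of fusing the two criteria and sorting; since the exact fusion formula is not fully specified in the text, a signed weighted sum (smaller Q_1 and larger Q_2 both raise Q_0) is assumed here:

```python
import numpy as np

def bidirectional_scores(q1, q2, beta1=1.0, beta2=1.0):
    """Assumed bidirectional evaluation: Q0 = beta2*Q2 - beta1*Q1, so a
    small ratio-test value and a good epipolar fit both raise the score;
    beta1/beta2 balance the two terms' magnitudes."""
    return beta2 * np.asarray(q2) - beta1 * np.asarray(q1)

def sort_descending(points, q0):
    """Order feature points by Q0, best (largest) first."""
    order = np.argsort(-np.asarray(q0))
    return [points[k] for k in order]

q1  = [0.9, 0.2, 0.5]      # ratio scores: lower is better
q2  = [0.1, 0.8, 0.5]      # epipolar scores: higher is better
pts = ["a", "b", "c"]
assert sort_descending(pts, bidirectional_scores(q1, q2)) == ["b", "c", "a"]
```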
Step 4.4: Sample the sorted static feature point set U_N to be sampled and generate the pose model F. The specific flow is as follows:
Step 4.4.1: Determine the maximum number of iterations K, the reprojection error threshold δ, and the inlier count threshold M.
Step 4.4.2: Determine the size n of the hypothesis generation set U according to the growth function rule of the PROSAC algorithm; from the sorted static feature point set U_N, with the bidirectional evaluation function Q_0(p_i) designed in step 4.3 as criterion, select the first n feature points as the hypothesis generation set U.
Step 4.4.3: in the hypothesis generation set U, 8 points are randomly selected, and an essential matrix F is obtained by calculation by an eight-point method U 。
Step 4.4.4: for static feature point set U N All feature points in the inner are defined by F U Performing reprojection operation, calculating reprojection error epsilon, and if epsilon<Delta, it is marked as an inner point, and vice versa.
Step 4.4.5: counting the number of internal points M, if M > M, making m=M, otherwise repeating the steps of 4.4.2-4.4.4, and repeating the iteration times k=k+1.
Step 4.4.6: recalculating the essential matrix F from all the inliers after updating U When k is<K, obtaining an essential matrix F U And a new set of inliers, otherwise no model is obtained.
After the essential matrix F_U is obtained, Singular Value Decomposition (SVD) is performed on F_U to obtain the high-accuracy camera pose R, t.
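The SVD decomposition of the essential matrix into candidate poses follows the standard recipe (as in, e.g., OpenCV's decomposeEssentialMat); a sketch, with cheirality checking left out:

```python
import numpy as np

def decompose_essential(E):
    """SVD-based decomposition of an essential matrix into the four
    candidate (R, t) pairs; the physically valid one is then chosen by
    cheirality (triangulated points in front of both cameras)."""
    U, _, Vt = np.linalg.svd(E)
    # enforce proper rotations (det = +1)
    if np.linalg.det(U) < 0: U = -U
    if np.linalg.det(Vt) < 0: Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                       # translation known only up to sign/scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# an E built from a known pose must yield that rotation among the candidates
t = np.array([1.0, 0.0, 0.0])
tx = np.array([[0., -t[2], t[1]], [t[2], 0., -t[0]], [-t[1], t[0], 0.]])
R = np.eye(3)
E = tx @ R
cands = decompose_essential(E)
assert any(np.allclose(Rc, R) for Rc, _ in cands)
```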
The technical conception of the invention is as follows: a dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy. A C2f-CA-Faster network is designed that replaces the convolution module of C3 with the lightweight attention module CA and extracts features along the width and height dimensions of the image to obtain attention feature encodings at different levels, effectively improving the model's ability to extract discriminative target features. A Faster module is added so that the network increases the information flow paths while reducing information redundancy, effectively improving computational efficiency. For the situation where a certain number of dynamic feature points still exist around the instance segmentation mask, a depth-LK optical flow method is proposed that uses depth information to effectively distinguish different planes, realizing the expansion and refinement of the instance segmentation mask. In the sampling process, to address the low precision and poor reliability of the traditional RANSAC method, a D-PROSAC algorithm is proposed: before sampling, the feature points are sorted with the bidirectional evaluation function as criterion, and only high-quality feature points are sampled for pose calculation, effectively improving the precision and convergence speed of the model.
The beneficial effects of the invention are mainly as follows:
1) A lightweight network is adopted for feature processing, effectively improving the speed and precision of the semantic segmentation link in the dynamic SLAM method. 2) The image depth information is fully utilized to expand and refine the segmentation boundary, effectively avoiding missed segmentation. 3) Progressive dual-criterion sampling is performed on the feature points, so the static feature points selected by sampling are highly reliable and the sampling process converges faster. 4) The overall algorithm has stronger real-time performance, lower equipment requirements, and more accurate pose calculation.
Drawings
FIG. 1 is an overall flow chart of a specific embodiment of the present invention;
FIG. 2 is a SLAM system frame diagram of an embodiment of the present invention;
FIG. 3 is a schematic diagram of a modified YOLOv5-7.0 network model in accordance with embodiments of the present invention;
FIG. 4 is a schematic diagram of a C2f-CA-Faster network architecture in accordance with embodiments of the present invention;
FIG. 5 is a flow diagram of a depth-LK optical flow matcher in accordance with an embodiment of the present invention;
FIG. 6 is a flow chart of the D-PROSAC algorithm of an embodiment of the present invention;
FIG. 7 compares the absolute trajectory error (ATE) of ORB-SLAM3 and the algorithm of the present invention, wherein (a) shows the absolute trajectory error of ORB-SLAM3 on sequence fr3_half, (b) that of the present method on sequence fr3_half, (c) that of ORB-SLAM3 on sequence fr3_walking_xyz, and (d) that of the present method on sequence fr3_walking_xyz;
FIG. 8 compares the relative pose error (RPE) of ORB-SLAM3 and the algorithm of the present invention, where (a) shows the relative pose error of ORB-SLAM3 on sequence fr3_walking_xyz and (b) that of the present method on sequence fr3_walking_xyz.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 6, an RGB-D SLAM method based on the depth LK optical flow method and a D-PROSAC sampling strategy in an indoor dynamic environment includes the following steps:
step 1: training an improved YOLOv5-7.0 network model by using a COCO data set, taking a current frame RGB image acquired by an RGB-D camera as the input of a target detection thread, and transmitting the current frame RGB image into the improved YOLOv5-7.0 network model to obtain an instance segmentation mask of a potential dynamic target, wherein the process is as follows:
step 1.1: SLAM has high real-time requirements, so the network structure of YOLOv5-7.0 is improved: a lighter-weight C2f-CA-Faster network is proposed to replace the C3 network module, which suffers from many parameters and poor real-time performance. In the C2f-CA-Faster network, the lightweight attention module CA replaces the convolution module of C3 and performs feature extraction along the width and height dimensions of the image to obtain attention feature codes of different layers: the input features pass through average pooling layers in the different directions to obtain feature maps for each direction, the corresponding weighting coefficients are computed by convolution, fusion, and separation, and the key-region features of the network output are obtained through a sigmoid function, strengthening their saliency and effectively improving the model's ability to extract discriminative target features. In addition, a Faster module is added, consisting mainly of partial convolution (PConv) and 1×1 convolution: PConv splits the features at the channel level, passes one part of the network features through identically and the other part through channel-wise convolution, and fuses the two output feature maps in the channel dimension as the final output, so the network increases its information flow paths while reducing information redundancy, effectively improving computational efficiency. After these improvements, the improved YOLOv5-7.0 network model is obtained;
Step 1.2: the indoor common categories were selected and the improved YOLOv5-7.0 network model was trained using the COCO 2017 dataset. And acquiring an image by using an RGB-D camera, transmitting the RGB image of the current frame into an improved YOLOv5-7.0 network model in a three-channel mode of 640 x 3, and performing slicing and convolution operation in a Focus structure to form an image feature map of 320 x 32. The image feature map is taken as input to be transmitted into a Backbone network of a backbond, and the backbond comprises CBS, C2f-CA-fast and SPPF 3 network structures, and the Backbone network structure is mainly used for extracting the image features and continuously shrinking the feature map. The extracted characteristic images are transmitted into a Neck structure, the Neck has the main effects that relatively shallow characteristics are obtained from a Backbone, then multi-scale characteristic fusion is carried out on the characteristics and deep semantic characteristics, and the images with the characteristics fused are input into a detection branch and a segmentation branch for the next operation;
step 1.3: and (3) detecting a branch part, namely inputting the feature fusion image generated in the step (1.2) into a YOLACT network to obtain the confidence degrees of category information, frame information and k mask information.
Step 1.4: dividing the branch part, screening the feature fusion image generated in the step 1.2, selecting a feature image with high resolution, sufficient space information and rich semantic information to perform up-sampling operation of the FCN structure, and forming k mask prototype images through a convolution process of 1*1;
Step 1.5: and (3) linearly combining the information generated in the step (1.3) and the information generated in the step (1.4), wherein the combination formula is as follows:
wherein n is the number of object categories identified by the detection branches, k is the number of masks obtained by dividing the branches, P is k image prototype masks, C is confidence information of the masks, and the combined feature map is activated:
M=σ(mask)
where M is the potential dynamic target mask after high confidence processing and σ is the Sigmoid () activation function. The potential dynamic target instance segmentation mask is obtained for further calculation.
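The linear combination and activation of steps 1.3-1.5 can be sketched as follows, assuming YOLACT-style mask assembly (mask = P·C^T followed by a sigmoid); the shapes and names are illustrative:

```python
import numpy as np

def assemble_masks(prototypes, coeffs):
    """Combine k prototype masks with per-instance mask coefficients.

    prototypes: H x W x k prototype maps P from the segmentation branch.
    coeffs:     n x k mask coefficients C from the detection branch.
    Returns an H x W x n stack, one sigmoid-activated mask per instance.
    """
    lin = prototypes @ coeffs.T          # mask = P @ C^T
    return 1.0 / (1.0 + np.exp(-lin))    # M = sigma(mask)
```

Thresholding the activated maps and cropping them to the predicted boxes would then yield the final instance masks.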
FIG. 1 is the overall operation flow chart of the SLAM method, describing the overall operation flow and steps of the improved YOLOv5 and depth-LK optical flow method of the present invention;
FIG. 2 is the system frame diagram of the SLAM method, describing the overall system framework of the improved YOLOv5 and depth-LK optical flow method, which includes four links: RGB-D camera acquisition, the target detection thread, the feature point dynamic judgment thread, and the pose calculation thread;
step 2: and carrying out ORB characteristic point calculation on the current frame RGB image acquired and obtained by the RGB-D camera and the previous frame RGB image to obtain a characteristic point set with direction information, wherein the process is as follows:
Step 2.1: and (3) performing ORB feature point extraction on the previous frame image and the current frame image, counting the number n of the ORB feature points, and performing initialization operation only when the number n of two continuous frames is larger than a given threshold T.
Step 2.2: and calculating the main direction of ORB characteristic points by using a gray centroid method, and finding the centroid position of the image block according to the image moment:
wherein, C is the centroid of the image, m is the moment of the defined image block, and the expression is:
from the centroid C and the geometric center O, a direction vector can be obtainedThe direction of the feature points is thus defined as:
θ=arctan(m 01 /m 10 )
after ORB characteristic points with direction information are obtained, performing next calculation;
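The gray-centroid orientation of step 2.2 can be sketched as follows (a minimal sketch; taking patch coordinates relative to the patch centre is an assumption of this illustration):

```python
import numpy as np

def orb_orientation(patch):
    """Feature-point direction by the intensity-centroid method:
    theta = arctan(m01 / m10), with moments m_pq = sum x^p y^q I(x, y)
    computed in a frame centred on the patch."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs -= (w - 1) / 2.0   # centre the coordinate frame on the patch
    ys -= (h - 1) / 2.0
    m10 = np.sum(xs * patch)
    m01 = np.sum(ys * patch)
    return np.arctan2(m01, m10)  # arctan2 keeps the full quadrant
```

A patch whose bright mass lies to the right of the centre yields an angle near 0; one whose mass lies below yields an angle near π/2.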
step 3: performing depth map preprocessing on a current frame depth image acquired by an RGB-D camera, and in a feature point dynamic judging thread, taking the preprocessed depth image, the potential dynamic target instance segmentation mask obtained in the step 1 and the ORB feature point set obtained in the step 2 as input to a depth-LK optical flow method matcher, and calculating to obtain a high-confidence static feature point set, wherein the specific process is as follows:
step 3.1: performing confidence level rough filtering on the instance segmentation mask of the potential dynamic target, discarding the instance segmentation mask with the confidence level lower than 0.20, and obtaining the instance segmentation mask with reliability;
Step 3.2: because a certain number of dynamic feature points inevitably exist on the boundary of the example segmentation mask, the subsequent steps such as pose calculation and the like are influenced, and the boundary of the example segmentation mask is expanded by combining the depth image so as to ensure that the dynamic feature points are all included in the corresponding example segmentation mask. The specific method comprises the steps of preprocessing a depth map, normalizing depth information, and mapping corresponding numerical values to a designated interval to obtain a preprocessed depth image, wherein the mapping method comprises the following steps:
wherein D is nor For normalizing the depth value, gamma is the amplification factor, D is the current depth value, D max 、D min Representing the maximum depth value and the minimum depth value in the depth image, respectively. Inputting the normalized image into a bilateral filter, so that the image can reduce noise and retain edge information;
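The normalization step can be sketched as follows (a sketch under the assumption that the mapping is a linear min-max scaling with amplification factor γ); the bilateral filtering that follows would typically use an off-the-shelf routine such as OpenCV's cv2.bilateralFilter:

```python
import numpy as np

def normalize_depth(depth, gamma=255.0):
    """Map raw depth values into a designated interval:
    D_nor = gamma * (D - D_min) / (D_max - D_min).
    gamma (the amplification factor) and the flat-map fallback are
    illustrative choices."""
    d_min, d_max = float(depth.min()), float(depth.max())
    if d_max == d_min:                       # flat depth map: avoid /0
        return np.zeros_like(depth, dtype=float)
    return gamma * (depth - d_min) / (d_max - d_min)
```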
step 3.3: combining the example segmentation mask generated in the step 3.1 with the preprocessed depth image obtained in the step 3.2 to expand and thin the dynamic areaThe method comprises the following specific processes: by P i,j Representing a specific pixel point on the boundary of the example segmentation mask, combining the pixel point serving as the origin of coordinates with 8 adjacent pixel points around to form a dynamic generation set P net The dynamic generation set P net Can be expressed as:
P net =p{(i,j)|-1≤i≤1,-1≤j≤1,i、j∈Z}
Within the dynamic generation set P_net, the effective depth value D_p(i, j) is defined: if the pixel p(i, j) lies in the instance segmentation mask region, its effective depth value is its normalized depth value; otherwise the effective depth value is set to 0:

D_p(i, j) = Depth(i, j) if p(i, j) ∈ A_d, otherwise 0

where Depth(i, j) denotes the normalized depth value of the pixel at coordinates (i, j) and A_d is the set of pixels within the instance segmentation mask region. After the effective depth values D_p(i, j) are obtained, the average effective depth D_mean(i, j) of the pixels in the dynamic generation set P_net is calculated:

D_mean(i, j) = Σ_(p(i,j) ∈ P_net) D_p(i, j) / |P_net ∩ A_d|
A threshold δ is set, and the average effective depth range of a common plane is determined as:

δ_min D_mean ≤ Depth(i, j) ≤ δ_max D_mean

wherein δ_min and δ_max denote the minimum and maximum threshold coefficients, respectively. The average effective depth range effectively characterizes the depth range of the plane in which the pixel P_(i,j) lies; taking it as a criterion, different planes in the image can be effectively distinguished, realizing expansion and refinement of the instance segmentation mask;
step 3.4: it is judged in turn whether the depth values of the 8 neighbouring pixels in P_net fall within the average effective depth range; if a point lies within the range, it most likely belongs to the dynamic region and is classified into the instance segmentation mask of the dynamic target. Performing this operation for all boundary pixels P_(i,j) of the instance segmentation mask is recorded as one expansion-and-refinement operation; after 3 repetitions of the expansion-and-refinement operation, the refined instance segmentation mask is obtained;
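One expansion-and-refinement pass of steps 3.3-3.4 can be sketched as follows; the 3×3 neighbourhood handling and the threshold coefficients dmin/dmax (standing in for δ_min, δ_max) are illustrative simplifications:

```python
import numpy as np

def refine_mask_once(mask, depth, dmin=0.9, dmax=1.1):
    """One expansion-and-refinement pass (simplified sketch): for every
    pixel on the mask boundary, compute the average effective depth of
    its 3x3 neighbourhood inside the mask, then absorb neighbours whose
    depth falls within [dmin * D_mean, dmax * D_mean]."""
    h, w = mask.shape
    out = mask.copy()
    for i in range(h):
        for j in range(w):
            if not mask[i, j]:
                continue
            i0, i1 = max(i - 1, 0), min(i + 2, h)
            j0, j1 = max(j - 1, 0), min(j + 2, w)
            nb_mask = mask[i0:i1, j0:j1]
            if nb_mask.all():          # interior pixel, not on the boundary
                continue
            nb_depth = depth[i0:i1, j0:j1]
            d_mean = nb_depth[nb_mask].mean()   # average effective depth
            grow = (nb_depth >= dmin * d_mean) & (nb_depth <= dmax * d_mean)
            out[i0:i1, j0:j1] |= grow
    return out
```

Neighbours on the same depth plane are absorbed into the mask, while neighbours across a depth discontinuity are left out.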
step 3.5: since the refined instance segmentation mask only represents the potential motion of an object, which in reality may still be static, the real motion of the feature points in the mask must be judged accurately: the LK optical flow is calculated and matched for the ORB feature points in the refined instance segmentation mask. Since LK optical flow assumes that the pixel motion within an image block is the same, the following least-squares system holds over the block:

[Σ I_x², Σ I_x I_y; Σ I_x I_y, Σ I_y²] [u; v] = −[Σ I_x I_t; Σ I_y I_t]

wherein I_x is the gradient of the pixel in the x direction, I_y the gradient in the y direction, I_t the temporal gradient, u and v the components of the motion vector, and w the image block size over which the sums are taken. After several iterations, the pixels are tracked and the motion vectors of the feature points are obtained; the motion vectors are screened, and feature points whose modulus is larger than the threshold are marked as dynamic feature points. All dynamic feature points in the refined instance segmentation mask are eliminated, and the remaining static feature points are recorded as the high-confidence static feature point set for the next calculation.
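The per-point LK solve of step 3.5 can be sketched as follows (a single-window sketch; using np.gradient for the spatial derivatives and a plain least-squares solve are assumptions of this illustration — a practical pyramidal, iterative tracker adds warping and multiple scales):

```python
import numpy as np

def lk_flow_at(prev, curr, x, y, w=7):
    """Single-point Lucas-Kanade step: assume constant motion inside a
    (2w+1)^2 window and solve the least-squares normal equations
    [sum Ix^2, sum IxIy; sum IxIy, sum Iy^2] v = -[sum IxIt; sum IyIt]."""
    prev = prev.astype(float)
    curr = curr.astype(float)
    Iy, Ix = np.gradient(prev)          # spatial gradients (rows=y, cols=x)
    It = curr - prev                    # temporal gradient
    sl = (slice(y - w, y + w + 1), slice(x - w, x + w + 1))
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    A = np.array([[np.dot(ix, ix), np.dot(ix, iy)],
                  [np.dot(ix, iy), np.dot(iy, iy)]])
    b = -np.array([np.dot(ix, it), np.dot(iy, it)])
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v                            # (u, v) motion in pixels
```

For a brightness ramp shifted one pixel in +x, the recovered flow is (1, 0); thresholding the flow magnitude then separates dynamic from static points.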
As shown in fig. 3, the schematic diagram of the improved YOLOv5 network model describes the network structure of the improved YOLOv5, comprising the Backbone network, the multi-scale feature fusion network, the detection and segmentation branches, and the linear combination module; through this model, dynamic feature extraction and instance segmentation can be performed on the input RGB-D image to obtain the potential dynamic target instance segmentation masks. FIG. 4 is a schematic diagram of the C2f-CA-Faster module in the improved YOLOv5-7.0 network model of the SLAM method, mainly showing the network architecture details of CA and Faster.
Step 4: in the pose calculation thread, on the basis of a high-confidence static feature point set, removing dynamic feature points in the environment, designing a D-PROSAC feature point sampling method, sampling feature points in the high-confidence static feature point set to remove mismatching and low-quality matching points, and carrying out camera pose estimation by utilizing the rest high-quality feature points, wherein the specific process is as follows:
step 4.1: when calculating the camera pose model, sampling is required in the static feature point set. The conventional RANSAC sampling method has uncertainty and requires that the accuracy of the model be guaranteed by sacrificing more iterations. The invention provides a D-PROSAC sampling method, which evaluates the reliability of data before sampling, so that compared with the traditional RANSAC algorithm, the method has faster convergence rate and calculation accuracy, and the specific flow of the D-PROSAC is as follows: design ratio evaluation function Q 1 (p i ) And 8 feature points with highest scores are selected by taking the feature points as the standard, and an original model F is obtained by an eight-point method 0 The specific process is as follows: recording the static characteristic point set to be sampled as U N N represents the number of feature points in the set, the static feature points to be sampled in the set can be represented as p i And p is i ∈U N . For all static feature points p to be sampled i Calculating the minimum Hamming distance d of the descriptors min1 (p i ) Distance d from the next smallest Hamming distance min2 (p i ) Is recorded as a ratio evaluation function Q 1 ( p i):
Ratio evaluation function Q 1 ( p i) Characterizing the static feature point p to be sampled i Degree of reliability in matching process, Q 1 The smaller the feature point, the higher the matching quality of the feature point. Evaluating the function Q by a ratio 1 As a standard, the static feature point p to be sampled i Ascending order is carried out, and the first 8 points with highest matching quality are selected as an initial sample set U 0 For initial sample set U 0 The original model F can be obtained by using an eight-point method 0 ;
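The ratio evaluation function Q1 can be sketched over bit-packed binary descriptors as follows (the uint8 packing and the zero-distance guard are illustrative):

```python
import numpy as np

def ratio_scores(desc_a, desc_b):
    """Ratio-evaluation score Q1(p_i) = d_min1 / d_min2 over Hamming
    distances between binary descriptors (smaller is better).

    desc_a: N x B uint8 array, one packed descriptor per query point.
    desc_b: M x B uint8 array of candidate descriptors (M >= 2).
    """
    # Hamming distance table via XOR + bit count
    x = desc_a[:, None, :] ^ desc_b[None, :, :]
    dist = np.unpackbits(x, axis=2).sum(axis=2)
    part = np.partition(dist, 1, axis=1)     # two smallest per row
    d1, d2 = part[:, 0], part[:, 1]
    return d1 / np.maximum(d2, 1)            # guard against d2 == 0
```

Sorting the points by this score ascending and keeping the first 8 would yield the initial sample set U_0.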
Step 4.2: obtaining an original model F 0 After that, to U N All static feature points p to be sampled in the interior i Distance d of polar line i And based thereon design the pole pitch evaluation function Q 2 (p i ) The specific process is as follows: for U N Any static feature point p to be sampled in the inner part i The pixel coordinates on the two images where they are located are noted as:
then p is i1 Corresponding polar line I 1 Can be expressed as:
wherein F represents a basic matrix, X, Y, Z represents a polar line I 1 From the above equation, the point p can be obtained i2 To the polar line I 1 Polar distance d of (2) i :
With the epipolar distance d_i as the index, the polar-distance evaluation function Q2(p_i) is designed, where θ is a scaling factor specified to be greater than 0. The polar-distance evaluation function Q2(p_i) characterizes how well the static feature point p_i to be sampled satisfies the epipolar constraint: the larger Q2 is, the higher the matching quality of the feature point;
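The epipolar distance d_i underlying Q2 can be sketched as follows (points in homogeneous pixel coordinates; names illustrative):

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Distance from the matched point p2 to the epipolar line
    l = F @ p1 = (X, Y, Z)^T induced by p1 under the fundamental
    matrix F; p1 and p2 are homogeneous pixel coordinates."""
    X, Y, Z = F @ p1                     # epipolar line coefficients
    return abs(X * p2[0] + Y * p2[1] + Z) / np.hypot(X, Y)
```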
step 4.3: the ratio evaluation function Q1(p_i) and the polar-distance evaluation function Q2(p_i) are fitted to generate the final bidirectional evaluation function Q0(p_i), where β1 and β2 are scaling coefficients used to adjust the ratio evaluation function Q1(p_i) and the polar-distance evaluation function Q2(p_i) to a uniform order of magnitude. The bidirectional evaluation function Q0(p_i) combines the characteristics of the ratio evaluation function and the polar-distance evaluation function: it considers the reliability of feature-point matching and synthesizes how well the feature point satisfies the epipolar constraint, so it measures the comprehensive quality of a feature point; the larger Q0 is, the better the overall quality of the feature point. With the bidirectional evaluation function Q0(p_i) as the index, all feature points in the static feature point set U_N to be sampled are sorted in descending order, i.e.:

∀ u_i, u_j ∈ U_N: i < j → Q0(u_i) > Q0(u_j)
inputting the ordered static characteristic points into the next link to carry out a sampling flow;
step 4.4: for the ordered static characteristic point set U to be sampled N Sampling is carried out to generate a pose model F, and the specific flow is as follows:
step 4.4.1: and determining the maximum iteration number K, the re-projection error threshold delta and the number of inner points threshold M.
Step 4.4.2: determining the size n of the hypothesized generation set U according to the growth function rule of the PROSAC algorithm, and sequencing the static feature point set U N In, the bidirectional evaluation function Q designed in the step 4.3 0 (p i ) As a criterion, the first n feature points are selected as the hypothesis generation set U.
Step 4.4.3: in the hypothesis generation set U, 8 points are randomly selected, and an essential matrix F is obtained by calculation by an eight-point method U 。
Step 4.4.4: for static feature point set U N Inner wall of the containerHas characteristic points, consisting of F U Performing reprojection operation, calculating reprojection error epsilon, and if epsilon<Delta, it is marked as an inner point, and vice versa.
Step 4.4.5: counting the number of internal points M, if M > M, making m=M, otherwise repeating the steps of 4.4.2-4.4.4, and repeating the iteration times k=k+1.
Step 4.4.6: recalculating the essential matrix F from all the inliers after updating U When k is<K, obtaining an essential matrix F U And a new set of inliers, otherwise no model is obtained.
Obtaining an essential matrix F U After that, for F U Singular Value Decomposition (SVD) is performed to obtain the camera pose R, t with high accuracy. As shown in fig. 5, a flow diagram of a depth-LK optical flow method matcher is shown, the matcher filters a dynamic mask according to a confidence level, expands a dynamic region by a processed depth image, calculates LK optical flow in the expanded dynamic region, and realizes finer distinction of the dynamic region.
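The D-PROSAC loop of step 4.4 can be sketched in skeleton form as follows; to keep the sketch self-contained it fits a toy 2-D line instead of the essential matrix, and the linear growth of the hypothesis-generation set is a simplification of PROSAC's growth function — only the control flow (quality-sorted points, progressively enlarged sampling pool, inlier counting) mirrors the steps above:

```python
import numpy as np

def d_prosac_line(points, scores, K=200, delta=0.1, seed=0):
    """Skeleton of the D-PROSAC flow on a toy line-fitting problem:
    points are sorted by a quality score Q0 (descending), samples are
    drawn from a progressively enlarged top-n pool, and the model with
    the most inliers wins."""
    order = np.argsort(-scores)          # descending quality (Q0)
    pts = points[order]
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, -1
    for k in range(K):
        n = min(len(pts), 2 + k // 10)   # simplified growth of the pool
        i, j = rng.choice(n, size=2, replace=False) if n > 2 else (0, 1)
        (x1, y1), (x2, y2) = pts[i], pts[j]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)        # candidate model y = a*x + b
        b = y1 - a * x1
        resid = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = int((resid < delta).sum())
        if inliers > best_inliers:
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers
```

Because the low-scoring points enter the pool only late, the first iterations already sample from high-quality data, which is what gives the strategy its faster convergence over plain RANSAC.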
Simulation experiment:
the simulation environment for the experiments on the dynamic SLAM method based on the depth LK optical flow method and the D-PROSAC sampling strategy is as follows: GPU NVIDIA RTX 3060, CPU i7-12700H, Ubuntu 20.04, CUDA 11.0, PyTorch 1.8.1.
The fr3_half and fr3_walking_xyz sequences in the public dataset TUM Dynamic Objects were selected for evaluation; the entire process is dynamic. To verify the performance of the algorithm in a dynamic environment, dynamic subsequences of the TUM dataset were selected to compare ORB-SLAM3 with the present method, with the absolute trajectory error (ATE) as the criterion; the quantitative comparison shows that the present method significantly improves the positioning accuracy of the visual SLAM system in dynamic environments. FIG. 7 shows the comparison of the absolute trajectory error (ATE) of ORB-SLAM3 and the algorithm of the present invention, wherein (a) shows the absolute trajectory error of ORB-SLAM3 on sequence fr3_half, (b) that of the present method on sequence fr3_half, (c) that of ORB-SLAM3 on sequence fr3_walking_xyz, and (d) that of the present method on sequence fr3_walking_xyz.
TABLE 1. RMSE (m) of absolute trajectory error for ORB-SLAM3 and the method of the invention

Sequence name | ORB-SLAM3 | Method of the invention | Precision improvement
fr3_half | 0.366 | 0.031 | 91.53%
fr3_walking_xyz | 0.556 | 0.017 | 96.94%
FIG. 8 is a comparison of the relative track error (Relative Pose Error, RPE) of ORB-SLAM3 and the algorithm of the present invention, where (a) represents the relative track error of ORB-SLAM3 under the sequence fre3_walking_xyz and (b) represents the relative track error of the present method under the sequence fre3_walking_xyz.
TABLE 2. RMSE (m) of relative pose error for ORB-SLAM3 and the method of the invention

Sequence name | ORB-SLAM3 | Method of the invention | Precision improvement
fr3_half | 0.1517 | 0.0308 | 79.69%
Therefore, the dynamic SLAM method based on the depth LK optical flow method and the D-PROSAC sampling strategy effectively eliminates dynamic characteristic points in the environment, and greatly improves the positioning accuracy and robustness of the system in the dynamic environment.
As indicated above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (5)
1. A dynamic SLAM method based on a depth LK optical flow method and a D-PROSAC sampling strategy is characterized by comprising the following steps:
step 1: training an improved YOLOv5-7.0 network model by using a COCO data set, taking a current frame RGB image acquired by an RGB-D camera as the input of a target detection thread, and transmitting the current frame RGB image into the improved YOLOv5-7.0 network model to obtain an instance segmentation mask of a potential dynamic target;
Step 2: carrying out ORB characteristic point calculation on a current frame RGB image acquired by an RGB-D camera and a previous frame RGB image to obtain a characteristic point set with direction information;
step 3: performing depth map preprocessing on a current frame depth image acquired by an RGB-D camera, taking the preprocessed depth image, the potential dynamic target instance segmentation mask obtained in the step 1 and the ORB feature point set obtained in the step 2 as input to a depth LK optical flow method matcher in a feature point dynamic judging thread, and calculating to obtain a high-confidence static feature point set;
step 4: in the pose calculation thread, on the basis of a high-confidence static feature point set, removing dynamic feature points in the environment, designing a D-PROSAC feature point sampling method, sampling feature points in the high-confidence static feature point set to remove mismatching and low-quality matching points, and estimating the pose of the camera by using the rest high-quality feature points.
2. The dynamic SLAM method based on the deep LK optical flow method and the D-PROSAC sampling strategy of claim 1, wherein the procedure of step 1 is as follows:
step 1.1: in a C2f-CA-fast network, a lightweight attention module CA is used for replacing a convolution module of C3, and feature extraction is carried out from the width dimension and the height dimension of an image so as to obtain attention feature codes of different layers, wherein the attention feature codes are specifically as follows: the input features respectively acquire feature graphs corresponding to different directions through an average pooling layer in different directions, then the convolution fusion separation calculation respectively corresponds to the weighting coefficients, and the key region features of the nonlinear weight parameters to the features output by the network are acquired through a sigmoid function and the highlighting is enhanced;
Adding a fast module, wherein the fast module comprises a Pcony part convolution structure and 1x1 convolution, the Pcony part convolution structure is formed by segmenting characteristics of a channel layer, one part of network characteristics are transmitted in a identity mode, the other part of network characteristics are transmitted in a channel-by-channel convolution mode, and feature graphs output by the two parts are fused in a channel dimension to be used as final output; after the improvement is completed, an improved YOLOv5-7.0 network model is obtained;
step 1.2: training an improved YOLOv5-7.0 network model using the COCO 2017 dataset; an RGB-D camera is utilized to collect images, the current frame RGB images are transmitted into an improved YOLOv5-7.0 network model in a three-channel mode of 640 x 3, slicing and convolution operations are carried out in a Focus structure, and an image feature map of 320 x 32 is formed; taking the image feature map as input and transmitting the image feature map into a Backbone network of a backbond, wherein the backbond comprises CBS, C2f-CA-fast and SPPF 3 network structures and is used for extracting image features and continuously shrinking the feature map; the extracted characteristic images are transmitted into a Neck structure, the Neck structure is used for fusing the characteristics of different layers, and the images with the characteristics fused are input into a detection branch and a segmentation branch for the next operation;
Step 1.3: detecting a branch part, inputting the feature fusion image generated in the step 1.2 into a YOLACT network to obtain the confidence degrees of category information, frame information and k mask information;
step 1.4: dividing the branch part, screening the feature fusion image generated in the step 1.2, selecting a feature image with high resolution, sufficient space information and rich semantic information to perform up-sampling operation of the FCN structure, and forming k mask prototype images through a convolution process of 1*1;
step 1.5: and (3) linearly combining the information generated in the step (1.3) and the information generated in the step (1.4), wherein the combination formula is as follows:
wherein n is the number of object categories identified by the detection branches, k is the number of masks obtained by dividing the branches, P is k image prototype masks, C is confidence information of the masks, and the combined feature map is activated:
M=σ(mask)#(2)
in the formula, M is a potential dynamic target mask after high confidence processing, and sigma is a Sigmoid () activation function; the potential dynamic target instance segmentation mask is obtained for further calculation.
3. The dynamic SLAM method based on the deep LK optical flow method and the D-PROSAC sampling strategy of claim 1, wherein the procedure of step 2 is as follows:
Step 2.1: ORB feature point extraction is carried out on the previous frame image and the current frame image, the number n of the ORB feature points is counted, and initialization operation is carried out only when the number n of two continuous frames is larger than a given threshold T;
step 2.2: and calculating the main direction of ORB characteristic points by using a gray centroid method, and finding the centroid position of the image block according to the image moment:
wherein, C is the centroid of the image, m is the moment of the defined image block, and the expression is:
centroid C and geometric center O obtained from equation 3, obtain direction vectorThe direction of the feature points is thus defined as:
thereby obtaining ORB characteristic points with direction information for further calculation.
4. The dynamic SLAM method based on the deep LK optical flow method and the D-PROSAC sampling strategy of claim 1, wherein the procedure of step 3 is as follows:
step 3.1: performing coarse filtering on the instance segmentation mask of the potential dynamic target according to the confidence coefficient;
step 3.2: performing expansion processing on the boundary of the instance segmentation mask by combining the depth image so as to ensure that the dynamic feature points are all classified into the corresponding instance segmentation mask; the specific method comprises the steps of preprocessing a depth map, normalizing depth information, and mapping corresponding numerical values to a designated interval to obtain a preprocessed depth image, wherein the mapping method comprises the following steps:
Wherein D is nor For normalizing the depth value, gamma is the amplification factor, D is the current depth value, D max 、D min Respectively representing the maximum depth value and the minimum depth value in the current depth image, and inputting the normalized image into a bilateral filter so that the image can reduce noise and retain edge information;
step 3.3: the expansion and refinement of the dynamic region are carried out by combining the example segmentation mask generated in the step 3.1 and the preprocessed depth image obtained in the step 3.2, and the specific process is as follows: by P i,j Representing a specific pixel point on the boundary of the example segmentation mask, combining the pixel point serving as the origin of coordinates with 8 adjacent pixel points around to form a dynamic generation set P net The dynamic generation set P net Expressed as:
P nei ={p(i,j)|-1≤i≤1,-1≤j≤1,i、j∈Z}#(7)
within the dynamic generation set P_net, the effective depth value D_p(i, j) is defined: if the pixel p(i, j) lies in the instance segmentation mask region, its effective depth value is its normalized depth value; otherwise the effective depth value is set to 0:

D_p(i, j) = Depth(i, j) if p(i, j) ∈ A_d, otherwise 0    (8)

where Depth(i, j) denotes the normalized depth value of the pixel at coordinates (i, j) and A_d is the set of pixels within the instance segmentation mask region; after the effective depth values D_p(i, j) are obtained, the average effective depth D_mean(i, j) of the pixels in the dynamic generation set P_net is calculated:

D_mean(i, j) = Σ_(p(i,j) ∈ P_net) D_p(i, j) / |P_net ∩ A_d|    (9)
setting threshold coefficients δ, the average effective depth range of a unified plane is determined as:
δ_min·D_mean ≤ Depth(i,j) ≤ δ_max·D_mean #(10)
wherein δ_min, δ_max respectively represent the minimum and maximum threshold coefficients; the average effective depth range effectively characterizes the depth range of the plane in which the pixel point P_i,j lies, and taking it as the criterion allows different planes in the image to be effectively distinguished, realizing the expansion and refinement of the instance segmentation mask;
step 3.4: judging in turn whether the depth values of the 8 neighboring pixels within P_nei fall inside the average effective depth range, and classifying the pixels that do into the instance segmentation mask of the dynamic target; performing this operation on all boundary pixel points P_i,j of the instance segmentation mask is recorded as one expansion-and-refinement operation, and after repeating the expansion-and-refinement operation 3 times, the refined instance segmentation mask is obtained;
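Steps 3.3–3.4 can be sketched as below. Two points are assumptions: δ_min/δ_max are illustrative values, and since the exact form of Eq. (9) is an image not reproduced in the text, the average here is taken over the in-mask neighbors only; for clarity every mask pixel is treated as a candidate boundary pixel.

```python
import numpy as np

def refine_mask(mask, depth_nor, delta_min=0.8, delta_max=1.2, n_iter=3):
    """Steps 3.3-3.4 sketch: grow the instance mask by admitting 8-neighbors
    whose normalized depth falls inside the average-effective-depth range."""
    mask = mask.astype(bool).copy()
    h, w = mask.shape
    for _ in range(n_iter):                                # 3 refinement passes
        grown = mask.copy()
        ys, xs = np.nonzero(mask)
        for i, j in zip(ys, xs):
            if not (0 < i < h - 1 and 0 < j < w - 1):
                continue
            nei = np.s_[i - 1:i + 2, j - 1:j + 2]          # P_nei, Eq. (7)
            d_eff = np.where(mask[nei], depth_nor[nei], 0.0)   # Eq. (8)
            valid = d_eff[d_eff > 0]
            if valid.size == 0:
                continue
            d_mean = valid.mean()          # Eq. (9); in-mask mean (assumption)
            win = depth_nor[nei]
            ok = (win >= delta_min * d_mean) & (win <= delta_max * d_mean)
            grown[nei] |= ok               # admit neighbors passing Eq. (10)
        mask = grown
    return mask
```

With a uniform-depth plane next to a deeper plane, the mask grows across the plane it was seeded on but stops at the depth discontinuity.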
step 3.5: calculating LK optical flow for the ORB feature points within the refined instance segmentation mask and performing matching; LK optical flow assumes that the pixels within an image block share the same motion, giving the following equation for each pixel k in the block:
I_x(k)·u + I_y(k)·v = −I_t(k), k = 1, …, w² #(11)
wherein I_x is the gradient of the pixel in the x direction, I_y is the gradient of the pixel in the y direction, I_t is the temporal gradient between the two frames, (u, v) is the motion vector to be solved, and w is the image block size;
solving and iterating this system multiple times tracks the pixel points and yields the motion vectors of the feature points; the motion vectors are then screened: feature points whose modulus exceeds a threshold are marked as dynamic feature points, all dynamic feature points within the refined instance segmentation mask are eliminated, and the remaining static feature points are recorded as the high-confidence static feature point set for the next stage of calculation.
5. The dynamic SLAM method based on the deep LK optical flow method and the D-PROSAC sampling strategy of claim 1, wherein the process of step 4 is as follows:
step 4.1: the specific flow of D-PROSAC is as follows: design the ratio evaluation function Q_1(p_i), select the 8 highest-scoring feature points by this criterion, and obtain the original model F_0 by the eight-point method; the specific process is as follows: record the static feature point set to be sampled as U_N, where N represents the number of feature points in the set, and denote a static feature point to be sampled in the set as p_i, with p_i ∈ U_N; for every static feature point p_i to be sampled, calculate the ratio of the minimum Hamming distance d_min1(p_i) of its descriptor to the second-smallest Hamming distance d_min2(p_i), recorded as the ratio evaluation function Q_1(p_i):
Q_1(p_i) = d_min1(p_i)/d_min2(p_i) #(12)
the ratio evaluation function Q_1(p_i) characterizes the reliability of the static feature point p_i to be sampled in the matching process, a smaller Q_1 indicating higher matching quality of the feature point; taking the ratio evaluation function Q_1 as the criterion, sort the static feature points p_i to be sampled in ascending order and select the first 8 points with the highest matching quality as the initial sample set U_0; the original model F_0 is obtained from the initial sample set U_0 by the eight-point method;
step 4.2: after obtaining the original model F_0, calculate for all static feature points p_i to be sampled within U_N the distance d_i to the epipolar line, and on this basis design the epipolar distance evaluation function Q_2(p_i); the specific process is as follows: for any static feature point p_i to be sampled in U_N, record its pixel coordinates on the two images in which it appears as:
p_i1 = (u_i1, v_i1, 1)^T, p_i2 = (u_i2, v_i2, 1)^T #(13)
the epipolar line l_1 corresponding to p_i1 is then expressed as:
l_1 = F·p_i1 = (X, Y, Z)^T #(14)
wherein F represents the fundamental matrix and X, Y, Z represent the three directional components of the epipolar line l_1; from these, the epipolar distance d_i from p_i2 to the epipolar line l_1 is obtained:
d_i = |X·u_i2 + Y·v_i2 + Z| / √(X² + Y²) #(15)
taking the epipolar distance d_i as the index, design the epipolar distance evaluation function Q_2(p_i):
wherein θ is a scaling factor specified to be larger than 0; the epipolar distance evaluation function Q_2(p_i) characterizes the degree to which the static feature point p_i to be sampled satisfies the epipolar constraint, a larger Q_2 indicating higher matching quality of the feature point;
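The epipolar distance of step 4.2 can be sketched as below. Note that the claim's Eq. (16) for Q_2 is an image not reproduced in the text, so `q2_score` is a hypothetical surrogate: any positive function of d_i that decreases with d_i and is scaled by θ > 0 matches the stated behaviour.

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Step 4.2 sketch: epipolar line l1 = F @ p_i1 (Eq. 14), then the
    point-to-line distance of Eq. (15)."""
    X, Y, Z = F @ np.array([p1[0], p1[1], 1.0])
    return abs(X * p2[0] + Y * p2[1] + Z) / np.hypot(X, Y)

def q2_score(d, theta=1.0):
    """Hypothetical stand-in for Eq. (16): larger when the epipolar
    constraint is better satisfied (smaller d)."""
    return np.exp(-theta * np.asarray(d))
```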
step 4.3: fusing the ratio evaluation function Q_1(p_i) and the epipolar distance evaluation function Q_2(p_i) to generate the final bidirectional evaluation function Q_0(p_i):
wherein β_1, β_2 are scaling coefficients used to adjust the ratio evaluation function Q_1(p_i) and the epipolar distance evaluation function Q_2(p_i) to a uniform order of magnitude; the bidirectional evaluation function Q_0(p_i) combines the characteristics of the ratio evaluation function and the epipolar distance evaluation function, taking into account the reliability of feature point matching while integrating the degree to which the feature point satisfies the epipolar constraint, so that it measures the comprehensive quality of a feature point, a larger Q_0 indicating better comprehensive quality of the feature point; taking the bidirectional evaluation function Q_0(p_i) as the index, all feature points in the static feature point set U_N to be sampled are sorted in descending order, namely:
∀ u_i, u_j ∈ U_N: i < j → Q_0(u_i) > Q_0(u_j) #(18)
the sorted static feature points are then input into the next stage for the sampling flow;
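Step 4.3 can be sketched as follows. The fusion formula (Eq. 17) is an image not reproduced in the text, so the form below is a hypothetical choice that merely respects the stated monotonicity: Q_0 rises as Q_1 falls and as Q_2 rises, with β_1, β_2 balancing the two terms' magnitudes.

```python
import numpy as np

def sort_by_q0(points, q1, q2, beta1=1.0, beta2=1.0):
    """Step 4.3 sketch: fuse Q1 (smaller = better) and Q2 (larger = better)
    into Q0 and sort points in descending Q0 order (Eq. 18)."""
    # Hypothetical fusion standing in for Eq. (17)
    q0 = beta1 * (1.0 - np.asarray(q1)) + beta2 * np.asarray(q2)
    order = np.argsort(-q0)            # descending comprehensive quality
    return points[order], q0[order]
```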
step 4.4: for the ordered static characteristic point set U to be sampled N Sampling is carried out to generate a pose model F, and the specific flow is as follows:
step 4.4.1: determining the maximum iteration number K, the re-projection error threshold delta and the inner point number threshold M;
step 4.4.2: determine the size n of the hypothesis generation set U according to the growth function rule of the PROSAC algorithm, and from the sorted static feature point set U_N, taking the bidirectional evaluation function Q_0(p_i) designed in step 4.3 as the criterion, select the first n feature points as the hypothesis generation set U;
step 4.4.3: randomly select 8 points within the hypothesis generation set U and calculate the essential matrix F_U by the eight-point method;
step 4.4.4: perform a reprojection operation with F_U on all feature points in the static feature point set U_N and calculate the reprojection error ε; if ε is smaller than δ, mark the point as an inlier, otherwise mark it as an outlier;
step 4.4.5: count the number m of inliers; if m > M, set M = m; otherwise set k = k + 1 and repeat steps 4.4.2 to 4.4.4;
step 4.4.6: recalculate the essential matrix F_U from all inliers after the update; when k < K, the essential matrix F_U and the new inlier set are obtained, otherwise no model is obtained;
after obtaining the essential matrix F_U, performing singular value decomposition (SVD) on F_U yields the camera pose R, t.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311572768.1A CN117541652A (en) | 2023-11-23 | 2023-11-23 | Dynamic SLAM method based on depth LK optical flow method and D-PROSAC sampling strategy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117541652A true CN117541652A (en) | 2024-02-09 |
Family
ID=89795426
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117974711A (en) * | 2024-04-02 | 2024-05-03 | 荣耀终端有限公司 | Video frame inserting method and related equipment |
CN118247591A (en) * | 2024-05-20 | 2024-06-25 | 华南农业大学 | Dynamic mask eliminating method based on multiple geometric constraints |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||