CN111797688A - Visual SLAM method based on optical flow and semantic segmentation - Google Patents

Visual SLAM method based on optical flow and semantic segmentation

Info

Publication number
CN111797688A
CN111797688A
Authority
CN
China
Prior art keywords
dynamic
area
semantic segmentation
optical flow
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010488128.2A
Other languages
Chinese (zh)
Inventor
姚剑
卓胜德
程军豪
龚烨
涂静敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010488128.2A
Publication of CN111797688A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of visual spatial positioning and discloses a visual SLAM method based on optical flow and semantic segmentation, which comprises the following steps: segmenting the input image information with a semantic segmentation network to obtain a static region and a predicted dynamic region; performing feature tracking on the static region and the predicted dynamic region with a sparse optical flow method; judging the types of the feature points in the input image information and removing the dynamic feature points; and taking the set of feature points with the motion feature points removed as tracking data, inputting it into ORB-SLAM for processing, and outputting the pose result. The method addresses the poor tracking and localization performance of SLAM in dynamic environments and can obtain trajectory information with high pose accuracy in such environments.

Description

Visual SLAM method based on optical flow and semantic segmentation
Technical Field
The invention relates to the technical field of visual space positioning, in particular to a visual SLAM method based on optical flow and semantic segmentation.
Background
SLAM is a key technology in the field of intelligent mobile robots. Visual SLAM uses a camera as its main sensor, and because a camera provides richer information than other types of sensors, visual SLAM has been widely studied in recent years. However, achieving accurate tracking and localization in dynamic scenes has always been a significant challenge for SLAM systems.
In a real scene, a dynamic object can introduce erroneous data into the camera motion calculation, leading to tracking failure or incorrect tracking. Several methods have been proposed to address this problem. One is the traditional robust estimation method RANSAC, which treats dynamic information as outliers to be removed and retains the static information so that tracking and motion calculation can succeed; however, when the moving objects dominate the scene, this method fails because too little usable data remains. Another approach integrates additional sensors, whose data can complement each other and compensate the tracking and motion calculation; however, this approach is uneconomical in terms of equipment cost and computational cost, and in practice often amounts merely to increasing the number of cameras.
Existing methods therefore do not perform well in SLAM applications. With the application of deep learning to semantic segmentation, object detection, and related tasks in recent years, new solutions have become available for handling the influence of moving objects in dynamic scenes.
Visual SLAM can be divided into two categories: feature-based methods and direct methods. Feature-based methods achieve tracking and localization by comparing feature descriptors to match point pairs and minimizing the reprojection error, and they remain fairly robust to geometric noise; however, extracting feature points is time-consuming. Direct methods, based on the gray-level-invariance assumption, optimize the pose for tracking by minimizing the photometric error; they perform better than feature-point methods in low-texture environments and at lower computational cost, but the overall algorithm is less robust. Neither the feature-point method nor the direct method can solve the problems caused by common dynamic objects, whose incorrect data associations reduce the accuracy of the computed pose.
Disclosure of Invention
The embodiment of the application solves the problem of poor SLAM tracking and positioning effect in a dynamic environment by providing a visual SLAM method based on optical flow and semantic segmentation.
The embodiment of the application provides a visual SLAM method based on optical flow and semantic segmentation, which comprises the following steps:
step 1, segmenting input image information by adopting a semantic segmentation network to obtain a static region and a predicted dynamic region;
step 2, carrying out feature tracking on the static area and the prediction dynamic area by adopting a sparse optical flow method;
step 3, judging the types of the feature points in the input image information, and removing the dynamic feature points;
and step 4, taking the set of feature points with the motion feature points removed as tracking data, inputting it into ORB-SLAM for processing, and outputting a pose result.
Preferably, the input image information in step 1 is one of input data corresponding to a monocular camera, input data corresponding to a binocular camera, and input data corresponding to a depth camera;
in step 3, the types of the feature points in the input data corresponding to the monocular camera are judged through the epipolar constraint, and the types of the feature points in the input data corresponding to the binocular camera or the input data corresponding to the depth camera are judged through the reprojection error.
Preferably, the step 1 comprises the following substeps:
step 1.1, selecting a data set to train a Mask R-CNN network to obtain a trained semantic segmentation network; the data set includes multiple types of data as potential moving objects;
step 1.2, inputting the image information into the trained semantic segmentation network to complete image segmentation and obtain a static region A_s and a predicted motion region A_m.
Preferably, the step 2 comprises the following substeps:
step 2.1, performing feature extraction and matching on the static region A_s and the predicted motion region A_m by a sparse optical flow method to obtain a static matching point pair set and a predicted-motion matching point pair set;
step 2.2, solving the pose based on the SLAM operating model.
Preferably, the pose solving based on the SLAM operating model in step 2.2 comprises:
at time k, projecting the j-th landmark y_j into the current frame to obtain the projected position h(ξ_k, y_j), and obtaining the corresponding observation model:
z_{k,j} = h(ξ_k, y_j) + v_{k,j}
where h(·) denotes the nonlinear observation model of the landmark under a known pose transformation, z_{k,j} denotes the pixel coordinates of landmark y_j in the current frame, and v_{k,j} ~ N(0, Q_{k,j}) denotes Gaussian noise with mean 0 and covariance Q_{k,j};
according to the observation model, establishing an error model from the reprojection error formed by the projected position and the corresponding pixel coordinates:
e_{k,j} = z_{k,j} - h(ξ_k, y_j)
where e_{k,j} denotes the difference between the observed position of landmark y_j in the current frame and its projected position, and ξ_k denotes the Lie-algebra form of the pose transformation between the two frames at time k;
converting the error model into a nonlinear least-squares problem, taking all camera poses ξ and landmarks y as the quantity x to be optimized, letting the tracking duration be m and the total number of landmarks be n, and establishing the loss function:
J(x) = (1/2) Σ_{k=1}^{m} Σ_{j=1}^{n} e_{k,j}^T Q_{k,j}^{-1} e_{k,j}
where J(·) denotes the loss function;
and obtaining the optimized pose by solving the loss function.
Preferably, in step 3, if the input image information is the input data corresponding to a binocular camera or the input data corresponding to a depth camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:
obtaining, by reprojection, a first offset vector set corresponding to the static matching point pair set; using a weighted-average method, computing from the static region A_s and the first offset vector set the first offset vector weights T_i and the mean Φ_s of the first offset vector weights;
obtaining, by reprojection, a second offset vector set corresponding to the predicted-motion matching point pair set; using the weighted-average method, computing from the predicted motion region A_m and the second offset vector set the second offset vector weights T_j;
judging the type of each feature point in the predicted motion region A_m according to the second offset vector weight T_j and the mean Φ_s of the first offset vector weights;
judging whether the predicted motion region is a dynamic region: if the number of feature points in the predicted motion region judged to be dynamic feature points exceeds a first threshold, marking the predicted motion region as a dynamic region and removing all feature points marked as belonging to the dynamic region.
Preferably, the first offset vector set corresponding to the static matching point pair set is obtained as follows:
the rotation and translation, in matrix form, of the optimized pose obtained in step 2 are R and t respectively, and the camera intrinsic matrix is K; a matching point pair p_i and q_i(x_i, y_i) of the previous frame and the current frame corresponds to a three-dimensional space point P_i, and projecting P_i into the current frame gives the projected coordinates q̂_i(x̂_i, ŷ_i):
ŝ_i [x̂_i, ŷ_i, 1]^T = K (R P_i + t)
where ŝ_i denotes the depth of P_i in the current camera frame, q̂_i denotes the pixel position of the spatial point P_i projected into the current frame, x_i, y_i denote the pixel coordinates of q_i, and x̂_i, ŷ_i denote the pixel coordinates of q̂_i;
the position offset vector corresponding to a matching point pair of the static region A_s is expressed as
V_i = q̂_i - q_i
and the first set of position offset vectors corresponding to the n matching point pairs is expressed as
V_state = {V_1, V_2, ..., V_n}.
Preferably, the first offset vector weight T_i and the mean Φ_s of the first offset vector weights are obtained as follows:
for each first offset vector V_i, its angle θ_i and its modulus length l_i are computed and combined into the weight T_i of the offset vector over angle and modulus length, and the mean of the first offset vector weights is
Φ_s = (1/n) Σ_{i=1}^{n} T_i
where T_i denotes the weight of the first offset vector over angle and modulus length, and Φ_s denotes the mean of the first offset vector weights;
the type of each feature point in the predicted motion region A_m is judged as follows:
T_j is compared with Φ_s; if T_j is greater than Φ_s, the feature point is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
Preferably, in step 3, if the input image information is the input data corresponding to a monocular camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:
obtaining the fundamental matrix F from the rotation R and translation t corresponding to the optimized pose obtained in step 2:
F = K^{-T} t^ R K^{-1}
where t^ denotes the skew-symmetric matrix of t;
obtaining the epipolar line Fp_k = [x, y, z]^T from the fundamental matrix F and the static matching point pair set and predicted-motion matching point pair set obtained in step 2;
obtaining the feature-point D value from the epipolar line;
presetting a second threshold η and judging the type of each feature point from its D value and the second threshold η;
judging whether the predicted motion region is a dynamic region: if the number of feature points in the predicted motion region judged to be dynamic feature points exceeds the first threshold, marking the predicted motion region as a dynamic region and removing all feature points marked as belonging to the dynamic region.
Preferably, the feature-point D value is calculated as:
D = |q_k^T F p_k| / sqrt(x^2 + y^2)
where p_k, q_k are used in homogeneous form in the formula, Fp_k denotes the epipolar line in epipolar geometry, x, y, z are the vector parameters of the epipolar line, and D denotes the distance from the matched point q_k in the current frame to the epipolar line Fp_k;
if the feature-point D value is greater than the second threshold η, the point pair of p_k is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
in the embodiment of the application, firstly, a semantic segmentation network is adopted to segment input image information to obtain a static area and a predicted dynamic area, then a sparse optical flow method is adopted to track the characteristics of the static area and the predicted dynamic area, then the types of characteristic points in the input image information are judged, the dynamic characteristic points are removed, finally, a set with the motion characteristic points removed is used as tracking data and is input into an ORB-SLAM for processing, and a pose result is output. The method is based on the dynamic object extraction, judgment and removal of dynamic influence of the combination of semantic segmentation and an optical flow method, and the static feature points without the dynamic influence are applied to a subsequent SLAM system, so that the track information with high pose precision in the dynamic environment can be obtained finally. Compared with the traditional method for processing the dynamic environment, the method can well judge and eliminate the problems of characteristic influence and low pose precision of the dynamic object.
Drawings
In order to illustrate the technical solution of the present embodiment more clearly, the drawings needed in the description of the embodiment are briefly introduced below. The drawings described below illustrate one embodiment of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is an overall flowchart of a visual SLAM method based on optical flow and semantic segmentation according to an embodiment of the present invention.
Detailed Description
The embodiment provides a visual SLAM method based on optical flow and semantic segmentation, which mainly comprises the following steps:
Step 1: segment the input image information with a semantic segmentation network to obtain a static region and a predicted dynamic region.
Step 2: perform feature tracking on the static region and the predicted dynamic region with a sparse optical flow method.
Step 3: judge the types of the feature points in the input image information and remove the dynamic feature points.
Step 4: input the set of feature points with the motion feature points removed into ORB-SLAM as tracking data for processing, and output the pose result.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The embodiment provides a visual SLAM method based on optical flow and semantic segmentation, as shown in fig. 1, including the following steps:
Step 1: segment the input image information (data) with a Mask R-CNN network, distinguishing static and dynamic objects and obtaining a static region and a predicted dynamic region.
Step 1.1: select a data set to train the Mask R-CNN network; 20 categories of the COCO data set are selected as potential moving objects, such as people, bicycles, buses, boats, birds, cats, and dogs.
Step 1.2: read the data corresponding to a monocular camera, a binocular camera, or an RGB-D depth camera and input it into the network. For the trained semantic segmentation network, the input image has format m×n×3 and the output has format m×n×l, where m×n is the image size, 3 is the number of image channels (RGB), and l is the number of training categories selected in step 1.1 (namely 20). Regions belonging to the selected 20 categories are marked as potential moving objects, completing the semantic segmentation and yielding the static region A_s and the predicted motion region A_m. If the segmentation yields no A_m, the data is considered to contain no dynamic region, no dynamic-region processing is needed, the frame can be handled by conventional ORB-SLAM, and the flow proceeds directly to step 4.
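To make step 1 concrete, the following is a minimal sketch of how a frame could be split into A_s and A_m with an off-the-shelf, COCO-pretrained Mask R-CNN; the model choice, the class-id subset, and the thresholds are illustrative assumptions rather than the exact network trained in step 1.1.

```python
# Minimal sketch of step 1: segment a frame into a static region A_s and a predicted
# motion region A_m with a COCO-pretrained Mask R-CNN (illustrative assumptions only).
import numpy as np
import torch
import torchvision

# Assumed subset of COCO class ids treated as potentially dynamic
# (person, bicycle, car, motorcycle, bus, truck, bird, cat, dog).
DYNAMIC_COCO_IDS = {1, 2, 3, 4, 6, 8, 16, 17, 18}

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def segment_dynamic_regions(frame_bgr, score_thresh=0.5, mask_thresh=0.5):
    """Return boolean masks (A_m, A_s) for one BGR frame of shape (m, n, 3)."""
    rgb = frame_bgr[:, :, ::-1].copy().astype(np.float32) / 255.0
    tensor = torch.from_numpy(rgb).permute(2, 0, 1)           # 3 x m x n, values in [0, 1]
    with torch.no_grad():
        out = model([tensor])[0]
    a_m = np.zeros(frame_bgr.shape[:2], dtype=bool)
    for label, score, mask in zip(out["labels"], out["scores"], out["masks"]):
        if score >= score_thresh and int(label) in DYNAMIC_COCO_IDS:
            a_m |= (mask[0].numpy() >= mask_thresh)           # accumulate predicted motion region A_m
    a_s = ~a_m                                                # everything else is the static region A_s
    return a_m, a_s
```

In a full system the network trained on the 20 selected categories from step 1.1 would replace the pretrained model used here.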
Step 2: perform feature tracking on the static region and the predicted dynamic region with a sparse optical flow method.
After the data preprocessing is completed, a lightweight algorithm is used for tracking. In essence, it keeps the feature extraction and tracking functions of ORB-SLAM tracking, but the feature-point method is replaced by an optical flow method and the local-optimization and keyframe-decision modules are removed, so that feature extraction, matching, and pose solving are completed.
Step 2.1: for the segmented data, the Lucas-Kanade optical flow method is applied to the static region A_s and the predicted motion region A_m to extract and match features. For the current frame, the full feature point set P is obtained, and matching yields the static matching point pair set P_match_s = {(p_i, q_i), i = 1, 2, 3, ..., n} and the predicted-motion matching point pair set P_match_m = {(p_j, q_j), j = 1, 2, 3, ..., m}, where p_i and q_i denote the i-th statically matched pixel point pair of the previous and current frames respectively, p_j and q_j denote the j-th predicted-motion matched pixel point pair of the previous and current frames respectively, and n and m denote the numbers of static-region and predicted-motion-region matching point pairs respectively.
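The feature extraction and matching of step 2.1 could be sketched with OpenCV's pyramidal Lucas-Kanade implementation as below, run once with the mask A_s and once with A_m; the detector, parameter values, and function name are assumptions for illustration.

```python
# Minimal sketch of step 2.1: sparse LK optical flow tracking restricted to a region mask.
import cv2
import numpy as np

def track_region(prev_gray, curr_gray, region_mask):
    """Return matched point pairs (p_i in previous frame, q_i in current frame) inside region_mask."""
    mask_u8 = region_mask.astype(np.uint8) * 255
    p = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01,
                                minDistance=7, mask=mask_u8)
    if p is None:
        return np.empty((0, 2), np.float32), np.empty((0, 2), np.float32)
    q, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p, None,
                                               winSize=(21, 21), maxLevel=3)
    ok = status.reshape(-1) == 1                       # keep only successfully tracked points
    return p.reshape(-1, 2)[ok], q.reshape(-1, 2)[ok]

# Usage: p_s, q_s = track_region(prev, curr, a_s)      # static matching point pairs
#        p_m, q_m = track_region(prev, curr, a_m)      # predicted-motion matching point pairs
```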
Step 2.2: solve the pose based on the SLAM operating model. At time k, the j-th landmark y_j is projected into the current frame to obtain the projected position h(ξ_k, y_j), and the corresponding observation model can be obtained:
z_{k,j} = h(ξ_k, y_j) + v_{k,j}
where h(·) denotes the nonlinear observation model of the landmark under a known pose transformation, z_{k,j} denotes the pixel coordinates of landmark y_j in the current frame, and v_{k,j} ~ N(0, Q_{k,j}) denotes Gaussian noise with mean 0 and covariance Q_{k,j}.
From the observation model, an error model can be established from the reprojection error formed by the projected position and the corresponding pixel coordinates:
e_{k,j} = z_{k,j} - h(ξ_k, y_j)
where e_{k,j} denotes the difference between the observed position of landmark y_j in the current frame and its projected position, and ξ_k denotes the Lie-algebra form of the pose transformation between the two frames at time k.
Step 2.3: convert the error model into a nonlinear least-squares problem, take all camera poses ξ and landmarks y as the quantity x to be optimized, let the tracking duration be m and the total number of landmarks be n, and establish the following loss function:
J(x) = (1/2) Σ_{k=1}^{m} Σ_{j=1}^{n} e_{k,j}^T Q_{k,j}^{-1} e_{k,j}
where J(·) denotes the loss function, k denotes time k of tracking, j denotes the j-th landmark, e_{k,j} is the error from step 2.2, and Q_{k,j} denotes the covariance of the Gaussian noise.
The optimized camera pose is obtained by solving this loss function.
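A minimal numerical sketch of this least-squares pose solve is given below; it parameterizes the inter-frame pose as a rotation vector plus translation, treats Q_{k,j} as the identity (an unweighted reprojection error), and uses assumed helper names, so it illustrates the optimization rather than reproducing the exact solver of the embodiment.

```python
# Minimal sketch of steps 2.2-2.3: refine the inter-frame pose by minimizing the
# reprojection error of known 3D landmarks (Q_{k,j} taken as identity here).
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(pose, landmarks_3d, observed_px, K):
    """pose = [rx, ry, rz, tx, ty, tz]; residuals e_{k,j} stacked for all landmarks."""
    R, _ = cv2.Rodrigues(pose[:3])
    cam = landmarks_3d @ R.T + pose[3:]            # R * P_i + t
    proj = cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]              # pinhole projection h(xi_k, y_j)
    return (observed_px - proj).ravel()            # z_{k,j} - h(xi_k, y_j)

def solve_pose(landmarks_3d, observed_px, K, pose0=np.zeros(6)):
    result = least_squares(reprojection_residuals, pose0,
                           args=(landmarks_3d, observed_px, K))
    return result.x                                # optimized [rotation vector, translation]
```

The weighted form of the loss would follow by left-multiplying each residual by a square root of Q_{k,j}^{-1}.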
Step 3: judge the types of the feature points in the input image information and remove the dynamic feature points.
The input image information is one of the input data corresponding to a monocular camera, the input data corresponding to a binocular camera, and the input data corresponding to a depth camera. Different dynamic-feature-point judgment and processing methods are designed for the different sensor types corresponding to the different input data types.
RGB-D depth camera and binocular camera data are processed with the same method and proceed to step 3.1; monocular camera data jumps directly to step 3.4.
Step 3.1: this step is for RGB-D and binocular systems. The Lie-algebra form of the camera's optimized initial pose obtained in step 2 is ξ, whose corresponding rotation and translation are R, t, and the camera intrinsic matrix K is known. A matching pixel point pair p_i and q_i(x_i, y_i) of the previous frame and the current frame corresponds to a three-dimensional space point P_i; projecting P_i into the current frame gives the projected coordinates q̂_i(x̂_i, ŷ_i), with the relationship:
ŝ_i [x̂_i, ŷ_i, 1]^T = K (R P_i + t)
where ŝ_i denotes the depth of P_i in the current camera frame, q̂_i denotes the pixel position of the spatial point P_i projected into the current frame, x_i, y_i denote the pixel coordinates of q_i, and x̂_i, ŷ_i denote the pixel coordinates of q̂_i.
If there were no error influence, q̂_i and q_i would coincide; however, under the influence of noise and the like, a position offset may exist between the static feature points and the predicted motion points:
V_i = q̂_i - q_i
where V_i denotes the position offset vector corresponding to a static matching point pair. For the n matching point pairs there is then the offset vector set
V_state = {V_1, V_2, ..., V_n}
that is, the first offset vector set corresponding to the static matching point pair set.
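The reprojection and offset-vector construction of step 3.1 could be sketched as follows, assuming the 3D points P_i of the static matches are already available (for example from the depth or stereo data of the previous frame); the function name and array layout are assumptions.

```python
# Minimal sketch of step 3.1: reproject the 3D points of the matches with the refined
# pose (R, t) and form the offset vectors V_i = q_hat_i - q_i.
import numpy as np

def offset_vectors(points_3d, q_curr, R, t, K):
    """points_3d: Nx3 landmarks of the matches; q_curr: Nx2 matched pixels in the current frame."""
    cam = points_3d @ R.T + t                  # R * P_i + t
    proj = cam @ K.T
    q_hat = proj[:, :2] / proj[:, 2:3]         # projected pixel coordinates q_hat_i
    return q_hat - q_curr                      # offset vectors V_i, one row per matching pair
```

Applied to the static matches this yields V_state; applied in the same way to the predicted-motion matches (step 3.3) it yields the second offset vector set.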
Step 3.2: using the static matching point pairs obtained in step 2.1, describe the static region A_s with a weighted-average method. For the static region A_s and the first offset vector set V_state, compute for each offset error V_i its angle θ_i and its modulus length l_i, and combine them into the weight T_i of the offset vector over angle and modulus length.
Then calculate the mean Φ_s of the offset vector weights:
Φ_s = (1/n) Σ_{i=1}^{n} T_i
Step 3.3: obtain, by reprojection, the second offset vector set corresponding to the predicted-motion matching point pair set; using the weighted-average method, compute from the predicted motion region A_m and the second offset vector set the second offset vector weights T_j.
That is, following steps 3.1 and 3.2, obtain for the predicted motion region A_m the corresponding second offset vector set V_other, the angle and modulus length of each offset error, and the corresponding second offset vector weight T_j, and compare T_j with Φ_s: if T_j is greater than Φ_s, the corresponding predicted feature point is judged to be dynamic; otherwise it is judged to be static.
Each vector in V_other is used to complete the dynamic-or-static judgment of its corresponding predicted feature point. After this step is completed, jump to step 3.6.
Step 3.4: for a monocular system, the Lie-algebra form of the camera's optimized initial pose obtained in step 2 is ξ, whose corresponding rotation and translation are R, t, from which the fundamental matrix F of the current motion is obtained:
F = K^{-T} t^ R K^{-1}
where t^ denotes the skew-symmetric matrix of t. Meanwhile, step 2.1 gives the set of all feature-matched point pairs between the two frames (including both the static-region and predicted-motion-region matching point pairs), P_match = {(p_k, q_k), k = 1, 2, 3, ..., n}, where n denotes the total number of matched point pairs. Combining this with the fundamental matrix F gives the epipolar line Fp_k = [x, y, z]^T and the distance
D = |q_k^T F p_k| / sqrt(x^2 + y^2)
where p_k, q_k are used in homogeneous form in the formula, Fp_k denotes the epipolar line in epipolar geometry, x, y, z are the vector parameters of the epipolar line, and D denotes the distance from the matched point q_k in the current frame to the epipolar line Fp_k.
Step 3.5: for the point pairs in the set P_match, the feature-point D value is computed in turn; a threshold η is set, and the type of each feature point is judged from its D value and the threshold η: if D is greater than η, the point pair is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
In testing, setting η to 5 was found to give stable results, so η = 5 is preferably used in this solution.
Step 3.6: judge the predicted motion regions extracted in step 1.2 against the dynamic feature point set: when most of the feature points in a predicted motion region (for example, more than 80%) are determined to be dynamic feature points, the region is determined to be a dynamic region, and all feature points marked as lying on the dynamic region are then removed.
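The region-level decision of step 3.6 could then be sketched as below, using the 80% ratio mentioned above; the data layout is an assumption.

```python
# Minimal sketch of step 3.6: mark a predicted motion region as dynamic when more than
# 80% of its feature points were judged dynamic, and drop all of its points in that case.
import numpy as np

def filter_region_points(points, point_is_dynamic, ratio=0.8):
    """points: Nx2 feature points of one predicted motion region; point_is_dynamic: N bools."""
    if len(points) > 0 and point_is_dynamic.mean() > ratio:   # region judged dynamic
        return np.empty((0, 2), dtype=points.dtype)           # remove every point of the region
    return points                                             # otherwise keep the region's points
```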
Step 4: from all feature points (including the feature points corresponding to the static region and the predicted dynamic region), remove the motion feature points to obtain the set P_e, which is used as the tracking data; at this point the set P_e is free of the influence of the dynamic features, and it is then fed into the conventional ORB-SLAM framework for processing and output of the pose result.
Step 4.1: with the set P_e, complete local map creation and pose optimization during tracking.
Step 4.2: perform loop-closure detection.
Step 4.3: output the pose result.
In summary, the method provided by the invention, which extracts dynamic objects, judges them, and removes their dynamic influence based on the combination of semantic segmentation and an optical flow method, first uses a semantic segmentation network to segment the potential dynamic objects effectively, then uses a sparse optical flow method to complete stable feature tracking, and then judges and removes the dynamic feature points through the epipolar constraint on matching point pairs and the difference in their reprojection-error distributions; the static feature points, freed of the dynamic influence, are applied to the subsequent SLAM system, so that trajectory information with higher pose accuracy is finally obtained in a dynamic environment. Compared with conventional methods for handling dynamic environments, this method reliably identifies and eliminates the influence of dynamic-object features and the resulting loss of pose accuracy.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to examples, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solution of the present invention without departing from its spirit and scope, all of which should be covered by the claims of the present invention.

Claims (10)

1. A visual SLAM method based on optical flow and semantic segmentation is characterized by comprising the following steps:
step 1, segmenting input image information by adopting a semantic segmentation network to obtain a static region and a predicted dynamic region;
step 2, carrying out feature tracking on the static area and the prediction dynamic area by adopting a sparse optical flow method;
step 3, judging the types of the feature points in the input image information, and removing the dynamic feature points;
and step 4, taking the set of feature points with the motion feature points removed as tracking data, inputting it into ORB-SLAM for processing, and outputting a pose result.
2. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 1 wherein the input image information in step 1 is one of input data corresponding to a monocular camera, input data corresponding to a binocular camera, input data corresponding to a depth camera;
in step 3, the types of the feature points in the input data corresponding to the monocular camera are judged through the epipolar constraint, and the types of the feature points in the input data corresponding to the binocular camera or the input data corresponding to the depth camera are judged through the reprojection error.
3. The optical flow and semantic segmentation based visual SLAM method as defined in claim 1 wherein step 1 comprises the sub-steps of:
step 1.1, selecting a data set to train a Mask R-CNN network to obtain a trained semantic segmentation network; the data set includes multiple types of data as potential moving objects;
step 1.2, inputting the image information into the trained semantic segmentation network to complete image segmentation and obtain a static region A_s and a predicted motion region A_m.
4. The optical flow and semantic segmentation based visual SLAM method of claim 3 wherein said step 2 comprises the sub-steps of:
step 2.1, performing feature extraction and matching on the static region A_s and the predicted motion region A_m by a sparse optical flow method to obtain a static matching point pair set and a predicted-motion matching point pair set;
and 2.2, solving the pose based on the SLAM running model.
5. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 4, wherein the pose solving based on the SLAM operating model in step 2.2 comprises:
at time k, projecting the j-th landmark y_j into the current frame to obtain the projected position h(ξ_k, y_j), and obtaining the corresponding observation model:
z_{k,j} = h(ξ_k, y_j) + v_{k,j}
where h(·) denotes the nonlinear observation model of the landmark under a known pose transformation, z_{k,j} denotes the pixel coordinates of landmark y_j in the current frame, and v_{k,j} ~ N(0, Q_{k,j}) denotes Gaussian noise with mean 0 and covariance Q_{k,j};
according to the observation model, establishing an error model from the reprojection error formed by the projected position and the corresponding pixel coordinates:
e_{k,j} = z_{k,j} - h(ξ_k, y_j)
where e_{k,j} denotes the difference between the observed position of landmark y_j in the current frame and its projected position, and ξ_k denotes the Lie-algebra form of the pose transformation between the two frames at time k;
converting the error model into a nonlinear least-squares problem, taking all camera poses ξ and landmarks y as the quantity x to be optimized, letting the tracking duration be m and the total number of landmarks be n, and establishing the loss function:
J(x) = (1/2) Σ_{k=1}^{m} Σ_{j=1}^{n} e_{k,j}^T Q_{k,j}^{-1} e_{k,j}
where J(·) denotes the loss function;
and obtaining the optimized pose by solving the loss function.
6. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 5, wherein, in step 3, if the input image information is the input data corresponding to a binocular camera or the input data corresponding to a depth camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:
obtaining, by reprojection, a first offset vector set corresponding to the static matching point pair set; using a weighted-average method, computing from the static region A_s and the first offset vector set the first offset vector weights T_i and the mean Φ_s of the first offset vector weights;
obtaining, by reprojection, a second offset vector set corresponding to the predicted-motion matching point pair set; using the weighted-average method, computing from the predicted motion region A_m and the second offset vector set the second offset vector weights T_j;
judging the type of each feature point in the predicted motion region A_m according to the second offset vector weight T_j and the mean Φ_s of the first offset vector weights;
judging whether the predicted motion region is a dynamic region: if the number of feature points in the predicted motion region judged to be dynamic feature points exceeds a first threshold, marking the predicted motion region as a dynamic region and removing all feature points marked as belonging to the dynamic region.
7. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 6, wherein the first offset vector set corresponding to the static matching point pair set is obtained as follows:
the rotation and translation, in matrix form, of the optimized pose obtained in step 2 are R and t respectively, and the camera intrinsic matrix is K; a matching point pair p_i and q_i(x_i, y_i) of the previous frame and the current frame corresponds to a three-dimensional space point P_i, and projecting P_i into the current frame gives the projected coordinates q̂_i(x̂_i, ŷ_i):
ŝ_i [x̂_i, ŷ_i, 1]^T = K (R P_i + t)
where ŝ_i denotes the depth of P_i in the current camera frame, q̂_i denotes the pixel position of the spatial point P_i projected into the current frame, x_i, y_i denote the pixel coordinates of q_i, and x̂_i, ŷ_i denote the pixel coordinates of q̂_i;
the position offset vector corresponding to a matching point pair of the static region A_s is expressed as
V_i = q̂_i - q_i
and the first set of position offset vectors corresponding to the n matching point pairs is expressed as
V_state = {V_1, V_2, ..., V_n}.
8. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 7, wherein the first offset vector weight T_i and the mean Φ_s of the first offset vector weights are obtained as follows:
for each first offset vector V_i, its angle θ_i and its modulus length l_i are computed and combined into the weight T_i of the offset vector over angle and modulus length, and the mean of the first offset vector weights is
Φ_s = (1/n) Σ_{i=1}^{n} T_i
where T_i denotes the weight of the first offset vector over angle and modulus length, and Φ_s denotes the mean of the first offset vector weights;
the type of each feature point in the predicted motion region A_m is judged as follows:
T_j is compared with Φ_s; if T_j is greater than Φ_s, the feature point is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
9. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 5, wherein, in step 3, if the input image information is the input data corresponding to a monocular camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:
obtaining the fundamental matrix F from the rotation R and translation t corresponding to the optimized pose obtained in step 2:
F = K^{-T} t^ R K^{-1}
where t^ denotes the skew-symmetric matrix of t;
obtaining the epipolar line Fp_k = [x, y, z]^T from the fundamental matrix F and the static matching point pair set and predicted-motion matching point pair set obtained in step 2;
obtaining the feature-point D value from the epipolar line;
presetting a second threshold η and judging the type of each feature point from its D value and the second threshold η;
judging whether the predicted motion region is a dynamic region: if the number of feature points in the predicted motion region judged to be dynamic feature points exceeds the first threshold, marking the predicted motion region as a dynamic region and removing all feature points marked as belonging to the dynamic region.
10. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 9, wherein the feature-point D value is calculated as:
D = |q_k^T F p_k| / sqrt(x^2 + y^2)
where p_k, q_k are used in homogeneous form in the formula, Fp_k denotes the epipolar line in epipolar geometry, x, y, z are the vector parameters of the epipolar line, and D denotes the distance from the matched point q_k in the current frame to the epipolar line Fp_k;
if the feature-point D value is greater than the second threshold η, the point pair of p_k is judged to be a dynamic feature point; otherwise it is judged to be a static feature point.
CN202010488128.2A 2020-06-02 2020-06-02 Visual SLAM method based on optical flow and semantic segmentation Pending CN111797688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010488128.2A CN111797688A (en) 2020-06-02 2020-06-02 Visual SLAM method based on optical flow and semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010488128.2A CN111797688A (en) 2020-06-02 2020-06-02 Visual SLAM method based on optical flow and semantic segmentation

Publications (1)

Publication Number Publication Date
CN111797688A true CN111797688A (en) 2020-10-20

Family

ID=72806020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010488128.2A Pending CN111797688A (en) 2020-06-02 2020-06-02 Visual SLAM method based on optical flow and semantic segmentation

Country Status (1)

Country Link
CN (1) CN111797688A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114777A1 (en) * 2017-10-18 2019-04-18 Tata Consultancy Services Limited Systems and methods for edge points based monocular visual slam
CN110125928A (en) * 2019-03-27 2019-08-16 浙江工业大学 A kind of binocular inertial navigation SLAM system carrying out characteristic matching based on before and after frames
CN110706279A (en) * 2019-09-27 2020-01-17 清华大学 Global position and pose estimation method based on information fusion of global map and multiple sensors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUNHAO CH. et al.: "DM-SLAM: A Feature-Based SLAM System for Rigid Dynamic Scenes", International Journal of Geo-Information *
席志红 et al.: "Simultaneous localization and semantic mapping of indoor dynamic scenes based on semantic segmentation", Journal of Computer Applications *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308921A (en) * 2020-11-09 2021-02-02 重庆大学 Semantic and geometric based joint optimization dynamic SLAM method
CN112308921B (en) * 2020-11-09 2024-01-12 重庆大学 Combined optimization dynamic SLAM method based on semantics and geometry
CN112418288A (en) * 2020-11-17 2021-02-26 武汉大学 GMS and motion detection-based dynamic vision SLAM method
CN112418288B (en) * 2020-11-17 2023-02-03 武汉大学 GMS and motion detection-based dynamic vision SLAM method
CN112381841A (en) * 2020-11-27 2021-02-19 广东电网有限责任公司肇庆供电局 Semantic SLAM method based on GMS feature matching in dynamic scene
CN112446885A (en) * 2020-11-27 2021-03-05 广东电网有限责任公司肇庆供电局 SLAM method based on improved semantic optical flow method in dynamic environment
CN113920163A (en) * 2021-10-09 2022-01-11 成都信息工程大学 Moving target detection method based on combination of tradition and deep learning
CN113920163B (en) * 2021-10-09 2024-06-11 成都信息工程大学 Moving target detection method based on combination of traditional and deep learning
WO2023178951A1 (en) * 2022-03-25 2023-09-28 上海商汤智能科技有限公司 Image analysis method and apparatus, model training method and apparatus, and device, medium and program
CN115061770A (en) * 2022-08-10 2022-09-16 荣耀终端有限公司 Method and electronic device for displaying dynamic wallpaper
CN115660944A (en) * 2022-10-27 2023-01-31 深圳市大头兄弟科技有限公司 Dynamic method, device and equipment for static picture and storage medium
CN115660944B (en) * 2022-10-27 2023-06-30 深圳市闪剪智能科技有限公司 Method, device, equipment and storage medium for dynamic state of static picture

Similar Documents

Publication Publication Date Title
US11763485B1 (en) Deep learning based robot target recognition and motion detection method, storage medium and apparatus
CN111797688A (en) Visual SLAM method based on optical flow and semantic segmentation
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
Melekhov et al. Dgc-net: Dense geometric correspondence network
CN108665496B (en) End-to-end semantic instant positioning and mapping method based on deep learning
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN108776989B (en) Low-texture planar scene reconstruction method based on sparse SLAM framework
CN113221647B (en) 6D pose estimation method fusing point cloud local features
CN110853100A (en) Structured scene vision SLAM method based on improved point-line characteristics
Zhao et al. Deep direct visual odometry
CN114937083B (en) Laser SLAM system and method applied to dynamic environment
CN112001859A (en) Method and system for repairing face image
CN112489083A (en) Image feature point tracking matching method based on ORB-SLAM algorithm
CN112308921B (en) Combined optimization dynamic SLAM method based on semantics and geometry
CN114708293A (en) Robot motion estimation method based on deep learning point-line feature and IMU tight coupling
Zhu et al. A review of 6d object pose estimation
CN112686952A (en) Image optical flow computing system, method and application
Yu et al. Accurate and robust visual localization system in large-scale appearance-changing environments
Zhu et al. Fusing panoptic segmentation and geometry information for robust visual slam in dynamic environments
CN114283265A (en) Unsupervised face correcting method based on 3D rotation modeling
Min et al. Coeb-slam: A robust vslam in dynamic environments combined object detection, epipolar geometry constraint, and blur filtering
Geiger Monocular road mosaicing for urban environments
CN114119999B (en) Iterative 6D pose estimation method and device based on deep learning
CN113012298B (en) Curved MARK three-dimensional registration augmented reality method based on region detection
Wang et al. Stream query denoising for vectorized hd map construction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201020)