CN117036404A - Monocular thermal imaging simultaneous positioning and mapping method and system

Monocular thermal imaging simultaneous positioning and mapping method and system

Info

Publication number
CN117036404A
CN117036404A
Authority
CN
China
Prior art keywords
map
frame
image
lines
points
Prior art date
Legal status
Pending
Application number
CN202310995674.9A
Other languages
Chinese (zh)
Inventor
王岭雪
吴昱臻
张廉
白宇
蔡毅
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202310995674.9A
Publication of CN117036404A
Legal status: Pending


Classifications

    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06T7/10 Segmentation; Edge detection
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T2207/10048 Infrared image
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30241 Trajectory
    • G06T2207/30244 Camera pose
    • Y02T10/40 Engine management systems


Abstract

The invention relates to a monocular thermal imaging simultaneous localization and mapping method and system. The system comprises a collector and a processor: the collector acquires thermal infrared video sequence images that are not interrupted by non-uniformity correction, and the processor comprises a denoising module, a feature extraction module, an initialization module, a feature tracking module, a local mapping module and a loop detection module. The invention improves the signal-to-noise ratio of the thermal infrared image, ensures that thermal information is retained to the greatest extent in different environments, and extends the applicability of visual SLAM methods to thermal infrared imagery; it also uses several strategies to eliminate the interference of dynamic targets with the SLAM system, achieving high robustness and high positioning accuracy in complex dynamic environments.

Description

Monocular thermal imaging simultaneous positioning and mapping method and system
Technical Field
The invention relates to the field of infrared thermal imaging, in particular to a monocular thermal imaging simultaneous positioning and mapping method and system.
Background
Although China's BeiDou satellite navigation system is approaching completion, positioning and navigation for vehicles and unmanned platforms (unmanned ground vehicles or unmanned aerial vehicles) that rely on BeiDou degrade or fail entirely indoors, in tunnels, in urban canyons, and wherever satellite signals are jammed or denied. Positioning and navigation using laser point clouds obtained by lidar and image information obtained by imaging systems have therefore become research hotspots, known respectively as laser simultaneous localization and mapping (laser SLAM) and visual simultaneous localization and mapping (visual SLAM).
Currently, the imaging detectors used by visual SLAM systems are overwhelmingly visible-light detectors. Compared with a visible-light imaging system, a thermal imaging system (thermal infrared imager) works day and night, is immune to glare, and can see through smoke, dust, haze, light rain and light snow. However, compared with visible-light images, thermal infrared images have inherent disadvantages: low signal-to-noise ratio, few scene details, weak texture features and low spatial resolution (the most advanced infrared focal-plane detectors currently offer only a few megapixels, whereas everyday visible-light detectors such as mobile-phone cameras provide tens or hundreds of times that resolution), which leads to the low positioning accuracy of existing thermal imaging SLAM methods. In addition, a thermal imaging system requires periodic non-uniformity correction; the common method blocks the optics with a shutter for several seconds, which causes data loss in dynamic environments and severely degrades positioning continuity.
Furthermore, conventional SLAM systems use random sample consensus (RANSAC) to eliminate erroneous data associations caused by dynamic objects, but this fails when dynamic objects occupy a large proportion of the image. Another approach uses convolutional neural networks (CNNs) to detect all potentially moving objects in the image and delete them wholesale, but this also erroneously removes objects that are actually static. And when conventional SLAM is applied to weakly textured thermal infrared images, few features are available for tracking, severely degrading SLAM performance.
Therefore, there are many technical difficulties faced in using a thermal imaging system to perform visual SLAM, which is why infrared SLAM has not been widely developed.
Disclosure of Invention
To solve these problems, the invention provides a monocular thermal imaging simultaneous localization and mapping method and system built on a full analysis of the low signal-to-noise-ratio, weak-texture characteristics of thermal infrared images, enabling high-precision, robust localization in visually degraded dynamic environments.
The technical scheme of the invention is as follows:
a monocular thermal imaging simultaneous localization and mapping system comprises a collector and a processor; the processor comprises a denoising module, a static characteristic extraction module, an initializing module, a characteristic tracking module, a local image building module and a loop detection module; wherein:
the collector collects the thermal infrared video sequence image which is not interrupted by the non-uniformity correction;
the denoising module performs real-time denoising on the acquired thermal infrared video sequence image;
the static feature extraction module extracts key points and descriptors for point features and key line segments and descriptors for line features; it performs object detection with a target-detection neural network, improves precision with a moving-object tracking method, shrinks bounding boxes with instance segmentation to achieve pixel-level segmentation of moving objects, and then screens out static features for subsequent tracking using the epipolar geometric constraint;
the initialization module recovers the relative pose of two adjacent frames from the static point features and static line features extracted by the static feature extraction module;
the feature tracking module performs the following processing: map points and map line tracking; key frame selection;
in map points and map line tracking, inter-frame tracking and local map tracking are sequentially used for carrying out data association, and the current frame pose is updated;
in key frame selection, the key frame that shares the most common observations with the current thermal infrared frame is used as the reference key frame;
the local map building module updates the connection relation between key frames through the co-view relation between map points and map lines, and eliminates map points and map lines with poor quality according to the observation condition;
the loop detection module uses similarity based on image appearance to determine whether a loop has occurred, and offline trains a bag-of-words model on thermal infrared images by clustering descriptors of point features and line features.
Further, in real-time scene denoising:
first, scene-based non-uniformity correction is applied to the thermal infrared image; at the same time, because compression conversion loses thermal infrared image information, adaptive histogram equalization is used to preserve as much of the original image information as possible; random noise is then removed from the thermal infrared image.
Further, random noise is removed from the thermal infrared image with FFDNet, a convolutional-neural-network denoiser;
the original image is reshaped into several downsampled sub-images, which are concatenated with an adjustable noise-level map and fed into a convolutional neural network; after the convolutions, a denoised image of the same size as the original input is produced.
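As a concrete illustration of this input layout, here is a minimal NumPy sketch of the 2x2 pixel-unshuffle plus noise-level-map packing used by FFDNet-style denoisers; the function name and the uniform noise map are illustrative, not the patent's implementation:

```python
import numpy as np

def ffdnet_input(image: np.ndarray, sigma: float) -> np.ndarray:
    """Reshape an HxW image into four half-resolution sub-images and
    append a uniform noise-level map, mimicking FFDNet's input layout."""
    h, w = image.shape
    assert h % 2 == 0 and w % 2 == 0, "dimensions must be even"
    # 2x2 pixel-unshuffle: each sub-image takes one pixel of every 2x2 block
    subs = [image[i::2, j::2] for i in (0, 1) for j in (0, 1)]
    noise_map = np.full((h // 2, w // 2), sigma, dtype=image.dtype)
    # channels: 4 downsampled sub-images + 1 adjustable noise-level map
    return np.stack(subs + [noise_map], axis=0)

x = ffdnet_input(np.arange(16, dtype=np.float32).reshape(4, 4), sigma=0.1)
print(x.shape)  # (5, 2, 2)
```

The network then convolves this 5-channel tensor and inverts the unshuffle at the output, which is why the denoised result has the original resolution.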
Further, during line-segment feature extraction, extremely short, invalid line features are filtered out; broken long segments are reconnected; if two segments have very similar lengths, lie close together and differ little in direction, they are merged; the new segment is then described with the descriptor, and geometric constraints are added to remove possible false matches, the constraints being:
(1) The length difference and the angle difference of the matched line segment pairs are smaller than a certain value;
(2) The distance of the matched line segment pairs is smaller than a certain value;
(3) The descriptor distance of the matching line segment pairs is less than a certain value;
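The three constraints above can be sketched as a simple match filter; all threshold values below are illustrative placeholders, since the patent only says "smaller than a certain value":

```python
import math

def passes_geometric_constraints(seg1, seg2, desc_dist,
                                 max_len_diff=10.0, max_angle_diff=0.1,
                                 max_midpoint_dist=30.0, max_desc_dist=50.0):
    """seg = (x1, y1, x2, y2). Returns True if the matched pair satisfies
    the length, angle, and distance checks plus the descriptor check."""
    def length(s):
        return math.hypot(s[2] - s[0], s[3] - s[1])
    def angle(s):
        return math.atan2(s[3] - s[1], s[2] - s[0]) % math.pi  # undirected
    def midpoint(s):
        return ((s[0] + s[2]) / 2, (s[1] + s[3]) / 2)

    if abs(length(seg1) - length(seg2)) > max_len_diff:
        return False
    d_ang = abs(angle(seg1) - angle(seg2))
    if min(d_ang, math.pi - d_ang) > max_angle_diff:   # handle wrap-around
        return False
    m1, m2 = midpoint(seg1), midpoint(seg2)
    if math.hypot(m1[0] - m2[0], m1[1] - m2[1]) > max_midpoint_dist:
        return False
    return desc_dist <= max_desc_dist

print(passes_geometric_constraints((0, 0, 100, 0), (2, 3, 102, 3), 20.0))  # True
```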
In the moving-object tracking method, the target state detected by the target-detection neural network is defined as

s = (x, y, a, p, ẋ, ẏ, ȧ)ᵀ

where x, y are the center coordinates of the detected object, a and p are the area and aspect ratio of its bounding box, and ẋ, ẏ, ȧ predict the center coordinates and bounding-box area in the next frame. Motion prediction and data association of targets during tracking are performed with a Kalman filter and the Hungarian algorithm, and the cost matrix is computed from the intersection-over-union distance of the detection boxes.
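A minimal sketch of the IoU-based data association step, using SciPy's `linear_sum_assignment` as the Hungarian solver; the box format and values are invented for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def associate(tracks, detections):
    """Cost matrix of (1 - IoU) distances, solved by the Hungarian method."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(associate(tracks, dets))  # [(0, 1), (1, 0)]
```

In a full tracker the predicted Kalman-filter boxes, not the raw previous-frame boxes, would populate `tracks`.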
Further, under the epipolar geometric constraint, the dynamic status of key points inside prior target regions of the thermal infrared image is judged; an object is considered dynamic if more than a certain number of dynamic key points lie within its region:
first, the target detection and segmentation results are obtained; second, key-point matches across consecutive frames are obtained with coarse-to-fine optical flow; third, the fundamental matrix is solved by random sample consensus using the points in non-target regions, epipolar lines are computed from the fundamental matrix, and finally key points whose distance from their epipolar line exceeds a threshold are declared moving.
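The epipolar test in the third step can be sketched as follows, assuming the fundamental matrix F has already been estimated from the static (non-target) points; the example F corresponds to a pure sideways translation, so epipolar lines are image rows:

```python
import numpy as np

def epipolar_distance(F, p, q):
    """Distance of point q (second frame) to the epipolar line F @ p
    induced by its match p (first frame). Points are (x, y) pixels."""
    ph = np.array([p[0], p[1], 1.0])
    line = F @ ph                      # epipolar line (a, b, c) in frame 2
    a, b, c = line
    return abs(a * q[0] + b * q[1] + c) / np.hypot(a, b)

def is_dynamic(F, p, q, threshold=1.0):
    """A matched keypoint is flagged as moving if it deviates from
    its epipolar line by more than the threshold (in pixels)."""
    return epipolar_distance(F, p, q) > threshold

# fundamental matrix for horizontal translation: F = [t]x with t = (1, 0, 0)
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
print(is_dynamic(F, (5.0, 10.0), (40.0, 10.0)))  # False: stays on its row
print(is_dynamic(F, (5.0, 10.0), (40.0, 15.0)))  # True: off the epipolar line
```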
Further, the initialization module performs the following processing:
for a pair of parallel 3D lines, their 2D projections l₁, l₂ in the first frame captured by the collector intersect at the vanishing point l₁ × l₂, whose normalized direction is

v_p1 = (l₁ × l₂) / ‖l₁ × l₂‖;

similarly, the normalized vanishing-point direction of their 2D projections m₁, m₂ in the second frame is

v_p2 = (m₁ × m₂) / ‖m₁ × m₂‖.

In addition, let the normalized vanishing-point directions of the 2D projections of a second pair of parallel 3D lines in the first and second frames be v_p3 and v_p4. Then, for the two pairs of parallel 3D lines, the ideal rotation matrix between the two images satisfies

v_p2 = R v_p1 + g₁,  v_p4 = R v_p3 + g₂

where g₁ and g₂ are error terms; based on this, the rotation matrix R is solved by minimizing ‖g₁‖² + ‖g₂‖².
Consider two 2D feature points p₁, p₂ detected in the first image with the point-feature detector and their correspondences q₁, q₂ in the second image found by feature matching; the translation vector t is then determined (up to scale) from the epipolar constraints qᵢᵀ [t]× R pᵢ = 0, i = 1, 2.
The initialization method of ORB-SLAM2 runs in parallel at the same time, and random sample consensus over the sampled point pairs and line pairs selects the best result.
Further, in map point and map line tracking, the system first assumes uniform motion and estimates the pose through the projection matching relationship between adjacent frames: map points and map lines are projected into the current frame and matching point and line features are searched for within a given range; besides requiring the shortest descriptor distance, matches must also satisfy rotation consistency, and if the number of valid matches after the search is still below a given threshold, the search area is enlarged until the requirement is met. The initial pose obtained this way is then refined using the 3D-2D projection relationship, i.e. the reprojection errors of 3D points and 3D lines are minimized by bundle adjustment. For pose estimation under the constant-velocity model, the rotation matrix R* and translation vector t* of the current frame are taken as the state variables to be optimized, a graph optimization model is built, and the following cost function is minimized iteratively with the Levenberg-Marquardt method:

{R*, t*} = argmin Σ_{(i,j)∈χ_c} ρ_p(e_{p,ij}ᵀ Σ_p⁻¹ e_{p,ij}) + Σ_{(i,j)∈χ_c} ρ_l(e_{l,ij}ᵀ Σ_l⁻¹ e_{l,ij})

where ρ_p and ρ_l are Huber robust cost functions, introduced to suppress outlier terms in the cost; e_{p,ij} and e_{l,ij} are the minimum reprojection errors of points and lines, respectively; Σ_p and Σ_l are the observation covariance matrices of points and lines; and χ_c is the set of matching pairs between consecutive image frames in the video sequence. For computational efficiency, the Jacobian matrix is used to solve directly; after pose optimization, outlier points and lines are removed from the map, and the match is considered successful if the number of successfully matched map points and map lines exceeds a certain value.
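The Huber robust cost ρ mentioned above can be illustrated in a few lines; the δ parameter and residual values are arbitrary:

```python
import math

def huber(e, delta=1.0):
    """Huber robust cost: quadratic near zero, linear for large residuals,
    which bounds the influence of outlier matches on the optimization."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * (a - 0.5 * delta)

def robust_reprojection_cost(point_residuals, line_residuals, delta=1.0):
    """Sum of Huber-weighted point and line reprojection residuals,
    mirroring the structure of the tracking cost function."""
    return (sum(huber(e, delta) for e in point_residuals)
            + sum(huber(e, delta) for e in line_residuals))

print(huber(0.5))   # 0.125  (inlier: quadratic)
print(huber(10.0))  # 9.5    (outlier: grows linearly, not 50.0)
```

This is why a single grossly mismatched feature cannot dominate the pose estimate: its cost grows linearly rather than quadratically.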
A matching strategy between images is adopted: the feature descriptors are converted into bag-of-words vectors to accelerate matching between the current frame and the reference key frame, i.e. the key frame with the highest co-visibility with the current frame. If the number of matched features is still insufficient, a nearest-neighbor matching algorithm is used and the homography matrix between the images is computed with random sample consensus to obtain enough correct matches; finally, map points and map lines are projected into the current frame and the pose is optimized according to the aforementioned equation, with the previous frame's pose as the initial value.
If all of these methods fail to track and positioning is lost, relocalization is required: first, the current frame is converted into a bag of words, a group of candidate key frames similar to the current frame is found in the key-frame database, and qualifying candidates are selected from the group. Once the matching requirement is met, the camera pose of the current frame is estimated by solving the PnP problem and optimized according to the aforementioned equation; if too few inliers remain after optimization, the unmatched map points and map lines of the key frame are projected into the current frame to generate new matches, and the pose is optimized again from the projection matching result. If relocalization succeeds on one candidate key frame, the remaining candidates are not considered; otherwise the process repeats on the next frame until relocalization succeeds.
in the key frame selection, the key frame is selected in any of the following cases:
(1) Some frames have passed since the last global relocation;
(2) Some frames have passed since the insertion of the last key frame or the local map building thread is idle;
(3) Map points and map lines tracked by the current frame are less than a certain proportion of map points and map lines tracked by the reference key frame;
(4) The position and posture of the key frame are changed to a certain extent;
(5) The current frame tracks at least a certain number of feature points and spatial lines.
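The five conditions can be collected into one decision function; treating condition (5) as a gating requirement and the rest as triggers is one plausible reading of the list, and every threshold is an illustrative placeholder, not the patent's value:

```python
def need_new_keyframe(frames_since_reloc, frames_since_kf, mapping_idle,
                      tracked_ratio, pose_change, tracked_features,
                      min_gap_reloc=20, min_gap_kf=20, min_ratio=0.9,
                      min_pose_change=0.05, min_features=50):
    """Keyframe insertion policy following the five listed conditions."""
    c1 = frames_since_reloc > min_gap_reloc            # (1) long since global relocalization
    c2 = frames_since_kf > min_gap_kf or mapping_idle  # (2) frame gap, or idle mapping thread
    c3 = tracked_ratio < min_ratio                     # (3) tracking decayed vs reference KF
    c4 = pose_change > min_pose_change                 # (4) pose changed enough
    c5 = tracked_features >= min_features              # (5) enough tracked points and lines
    # require the feature count plus at least one trigger condition
    return c5 and (c1 or c2 or c3 or c4)

print(need_new_keyframe(0, 25, False, 0.95, 0.0, 100))  # True: condition (2) fires
```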
Further, the local mapping module uses the current key frame and neighboring co-visible key frames to generate new map points and map lines by triangulation, making tracking more stable; finally, duplicate map points and map lines are checked and fused, and when the local map contains more than a certain number of key frames, local bundle adjustment is performed according to

{X_i, P_j, L_k} = argmin Σ_{(i,j)∈χ_p} ρ_p(e_{p,ij}ᵀ Σ_p⁻¹ e_{p,ij}) + Σ_{(i,k)∈χ_l} ρ_l(e_{l,ik}ᵀ Σ_l⁻¹ e_{l,ik})

to adjust the camera poses X_i, map points P_j and map lines L_k in the local map, where χ_p and χ_l are the sets of matching pairs of points and lines within the local map.
After the optimization, each key frame is checked for redundancy: if more than a certain percentage of the map points and map lines it tracks can also be tracked by other key frames, the redundant key frame is eliminated; the current frame is then added to the loop-closure detection queue.
Further, the loop detection module computes a similarity score between the bag-of-words vectors of the current key frame and each co-visible key frame, the similarity being defined as

s(v_a, v_b) = p · s_p(v_a, v_b) + l · s_l(v_a, v_b)

where p and l are weight coefficients, s_p(v_a, v_b) is the point-feature similarity between images and s_l(v_a, v_b) is the line-feature similarity between images. Whether loop closure succeeds is judged by finding a set of loop candidate frames among all key frames; if it succeeds, the poses, map points and map lines are adjusted through the solved similarity transformation, and finally global bundle adjustment completes the optimization.
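A minimal sketch of the weighted point/line similarity and candidate filtering; the weights p = 0.7, l = 0.3 and the per-keyframe scores are invented for illustration:

```python
def combined_similarity(s_point, s_line, w_point=0.7, w_line=0.3):
    """Weighted combination of point- and line-feature bag-of-words
    similarities: s = p * s_p + l * s_l (weights are illustrative)."""
    return w_point * s_point + w_line * s_line

def loop_candidates(keyframes, min_score):
    """Keep keyframes whose combined similarity to the current frame's
    bag-of-words vectors exceeds the acceptance threshold."""
    return [kf_id for kf_id, (sp, sl) in keyframes.items()
            if combined_similarity(sp, sl) >= min_score]

# per-keyframe (point similarity, line similarity) scores vs current frame
kfs = {1: (0.9, 0.8), 2: (0.2, 0.1), 3: (0.6, 0.7)}
print(loop_candidates(kfs, min_score=0.6))  # [1, 3]
```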
The invention also relates to a monocular thermal imaging simultaneous localization mapping method, which is suitable for the system and comprises the following steps: collecting a thermal infrared image, denoising the image, extracting features, initializing, tracking features, constructing a local image and detecting a loop.
According to the invention, a shutterless uncooled infrared focal-plane detector assembly first performs scene-based non-uniformity correction, after which the thermal infrared image is denoised. This avoids thermal imager data interruption, greatly improves thermal infrared image quality, and widens the boundary of visual SLAM applied to thermal infrared imagery.
And secondly, the interference of a dynamic target is reduced by combining epipolar constraint and semantic segmentation, and the robustness of the thermal SLAM in a dynamic scene is improved.
Third, the defect of poor spatial texture distribution of the thermal infrared image is overcome.
We performed experiments on more than 180,000 thermal infrared images of real environments, covering small-scale indoor and outdoor sequences and large-scale driving sequences. The results show that the system of the present invention achieves camera localization and sparse structural map reconstruction in the face of visual degradation and moving objects in dynamic environments, without the aid of any other sensors. On the indoor and outdoor sequences, the positioning error, measured as relative position error against the ground-truth trajectory, is less than 0.1 m.
Furthermore, the system of the present invention is the only successful system to track all sequences, with higher positioning accuracy and incomparable robustness compared to the current state-of-the-art monocular SLAM systems. Thus, the system can be used as a completely new positioning solution to replace expensive commercial navigation systems, especially in challenging urban dynamic environments with varying illumination and low visibility in air.
The contribution of the invention has the following four aspects:
the invention provides real-time comprehensive denoising based on environment, avoids data interruption when the thermal imager runs NUC, improves the signal-to-noise ratio of the thermal infrared image, ensures that thermal information is furthest reserved in different environments, and widens the use boundary of the visual SLAM method in the thermal infrared image;
according to the invention, semantic segmentation is combined with epipolar constraint, so that the interference of a dynamic target on an SLAM system is eliminated, and high robustness can be realized in a complex dynamic environment;
aiming at the defect of weak texture of a thermal infrared image, the invention designs and realizes a complete monocular thermal SLAM system using point and line characteristics, and the system comprises initialization, tracking, image construction, loop detection and global optimization;
the system of the present invention demonstrates excellent accuracy and robustness compared to existing most advanced monocular SLAM systems over multiple real world dynamic environmental data sequences. Furthermore, the system of the present invention is the only system that is able to track all data sequences in their entirety.
Drawings
FIG. 1 is a neural network architecture for thermal infrared image denoising according to an embodiment of the present invention;
FIG. 2 is an epipolar constraint for points in the dynamic environment of an embodiment of the present invention;
FIG. 3 is a graph of minimum re-projection errors of points and lines according to an embodiment of the present invention;
FIG. 4 is a system block diagram of an embodiment of the present invention;
FIG. 5 is a frame of dynamic target region segmentation in accordance with an embodiment of the present invention;
FIG. 6 is a view of a parallel 3D line projected onto an image plane and its vanishing point direction according to an embodiment of the present invention;
FIG. 7 is an illustration of local map optimization, including camera pose, map points, map lines, and visual measurements;
FIG. 8 is a schematic illustration of a thermal imager calibrated using a checkerboard calibration apparatus made of a particular material in accordance with an embodiment of the present invention;
FIG. 9 is a screenshot of a sequence of thermal infrared images collected in an embodiment of the invention; the night outdoor sequence, the night driving sequence, the low-illumination indoor sequence and the night rain driving sequence are sequentially arranged from left to right;
FIG. 10 shows original and processed thermal infrared images; (1) left and right are the original thermal infrared image and the image after CLAHE, respectively; (2)-(3) left is the original thermal infrared image, middle is a high-quality denoised thermal infrared image from the database, right is the image after applying FFDNet;
Fig. 11 (1) is a point feature matching; (2) line feature matching; the top row represents the feature matching of the original thermal infrared image, and the bottom row represents the feature matching of the denoised thermal infrared image; the green connecting lines between the images represent feature matching;
FIG. 12 is a dynamic target detection; the sequence contains five representative frames illustrating the detection and tracking process of multiple pedestrians; top row: a single frame detection method, wherein undetected people are circled by red; bottom row: suggested enhanced detection methods;
fig. 13 is an example of filtering static features for tracking in a driving scenario. The first column shows the detection and segmentation of moving objects; the second and third columns display the initial point feature and the line feature, respectively; the fourth column shows the tracking of optical flow, where key points that do not meet epipolar constraints are purple and the rest are green; the fifth column and the sixth column are static point features and line features, respectively, for tracking after filtering;
FIG. 14 is the result of different initializations; (a) is a proposed initialization method. The left column represents the initializing graph and the right column of color line segments represents the different parallel line segments detected in the thermal infrared image; (b) is a conventional initialization method; the left column represents the map with failed initialization, and the right column represents the tracking of point characteristics;
FIGS. 15 (a)-(c) compare the trajectories generated by MonoThermal-SLAM, DSO, SVO2.0 and ORB-SLAM3 on the Indoor1-3 datasets with those generated by LeGO-LOAM (grey dashed line); (d) is a detailed comparison of the Indoor2 trajectory with ground truth on the x, y and z axes (grey dashed line);
FIGS. 16 (a)-(f) compare the trajectories generated by MonoThermal-SLAM, DSO, SVO2.0 and ORB-SLAM3 on the Outdoor1-6 datasets with those generated by LeGO-LOAM (grey dashed line);
FIG. 17 (a) is the mapping effect of MonoThermal-SLAM; (b) is a scene snapshot of the Outdoor1 sequence; (c) is the LeGO-LOAM mapping effect; (d) shows dynamic objects in the Outdoor1 sequence;
FIG. 18 compares the trajectories on six driving-sequence datasets obtained by MonoThermal-SLAM, DSO, SVO2.0 and ORB-SLAM3 with trajectories obtained by RTK-GNSS as ground truth (grey dashed line).
FIG. 19 (a) shows the trajectory and global map of the MonoThermal-SLAM system on the Driving5 sequence, the top row showing scene features and the mapping effect, with the bird's-eye view showing high similarity to the trajectory and the detected loop closed on the right; FIGS. 19 (b) and (c) compare the tracking and position-estimation performance of MonoThermal-SLAM, DSO, SVO2.0, ORB-SLAM3 and RTK-GNSS on the Driving5 sequence.
Detailed Description
The technical solutions in this embodiment will be clearly and completely described in conjunction with the embodiment of the present invention, and it is obvious that the described embodiment is only a part of examples of the present invention, not all examples. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
The monocular thermal imaging simultaneous localization and mapping system comprises a collector and a processor; the processor comprises a denoising module, a feature extraction module, an initialization module, a feature tracking module, a local mapping module and a loop detection module. The collector uses a shutterless thermal imaging system to collect thermal infrared video sequence images that are not interrupted by non-uniformity correction, as shown in FIG. 4.
The first step of the system of this embodiment is to remove random noise from the thermal infrared image in real time. We propose a comprehensive denoising chain covering both the detector module and the processing algorithm. To address the information loss thermal infrared images suffer during bit-depth compression, adaptive histogram equalization is used so that the original image information is preserved to the greatest possible extent.
For random noise, we denoise while losing as little of the original image information as possible. Traditional BM3D is time-consuming and memory-intensive; given the real-time requirement of a SLAM system, we select FFDNet, a DnCNN-derived network, to denoise the thermal infrared image. It defines the denoising model between the noisy input observation σ and the desired output ζ as:

ζ = F(σ, M; Θ)

where M is a noise level map associated with the noise level, fed to the network together with σ, and Θ denotes the model parameters.
As shown in fig. 1, FFDNet reshapes the original image into several downsampled sub-images, concatenates them with an adjustable noise level map, and feeds them into a convolutional neural network consisting of convolution, ReLU (Rectified Linear Unit) and Batch Normalization layers, finally generating a denoised image of the same size as the original input. A single network can handle a wide range of noise levels, and a non-uniform noise level map can be specified to remove spatially varying noise, which makes FFDNet fast without sacrificing denoising performance.
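As a rough illustration of this downsample-and-concatenate step, the following numpy sketch (the function names and the uniform noise-level map are our own assumptions, not FFDNet's code) builds the multi-channel tensor that would be fed to the network:

```python
import numpy as np

def pixel_unshuffle(img, factor=2):
    """Reshape an HxW image into factor^2 downsampled sub-images
    (the reversible downsampling applied before the denoising CNN)."""
    h, w = img.shape
    assert h % factor == 0 and w % factor == 0
    subs = img.reshape(h // factor, factor, w // factor, factor)
    return subs.transpose(1, 3, 0, 2).reshape(factor * factor, h // factor, w // factor)

def build_cnn_input(img, sigma, factor=2):
    """Stack the sub-images with a uniform noise-level map of value sigma,
    giving a (factor^2 + 1)-channel input; a spatially varying map could
    be passed instead to handle non-uniform noise."""
    subs = pixel_unshuffle(img, factor)
    h, w = subs.shape[1:]
    noise_map = np.full((1, h, w), sigma, dtype=img.dtype)
    return np.concatenate([subs, noise_map], axis=0)

x = np.arange(16, dtype=np.float32).reshape(4, 4)
inp = build_cnn_input(x, sigma=0.1)
```

The actual network then maps this stack back to a full-resolution denoised image; only the input packing is sketched here.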
The feature extraction module comprises the following aspects:
1. feature extraction
After full noise removal of the thermal infrared image, we first extract features. Building on ORB-SLAM2, we select ORB (Oriented FAST and Rotated BRIEF) features, which offer excellent keypoint detection and efficient computation and matching. During the feature extraction stage we also distribute the ORB feature points as evenly as possible across the image.
For line segment extraction we use ELSED (Enhanced Line SEgment Drawing). Thanks to its local segment-growing algorithm it is fast while retaining high accuracy and repeatability; other line feature extractors, such as LSD (Line Segment Detector), could be substituted. To ensure line feature matching and tracking efficiency, we adjust the ELSED parameters to fit thermal infrared images and filter out very short, uninformative line features. Broken long segments are reconnected: if two segments are close in length and distance and their difference in direction is small enough, we merge them. Finally we describe each line segment with LBD (Line Band Descriptor) — any line segment description algorithm could be used — and add geometric constraints to remove likely outlier matches. In practice, a successfully matched pair of line segments should satisfy the following conditions:
(1) The length difference and the angle difference of the matched line segment pairs are smaller than a certain value;
(2) The distance of the matched line segment pairs is smaller than a certain value;
(3) The LBD descriptor distance of the matching segment pair is less than a certain value.
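A minimal sketch of these three tests might look as follows; the threshold values (`len_tol`, `ang_tol`, `dist_tol`, `desc_tol`) are illustrative placeholders, not the system's tuned parameters:

```python
import numpy as np

def segment_match_ok(seg_a, seg_b, desc_dist,
                     len_tol=0.25, ang_tol=np.deg2rad(10),
                     dist_tol=20.0, desc_tol=60.0):
    """Accept a candidate line-segment match only if length, angle,
    distance and descriptor-distance tests all pass."""
    (a0, a1), (b0, b1) = seg_a, seg_b
    a0, a1, b0, b1 = map(np.asarray, (a0, a1, b0, b1))
    len_a, len_b = np.linalg.norm(a1 - a0), np.linalg.norm(b1 - b0)
    # (1) similar length and similar direction
    if abs(len_a - len_b) / max(len_a, len_b) > len_tol:
        return False
    ang_a = np.arctan2(*(a1 - a0)[::-1])   # arctan2(dy, dx)
    ang_b = np.arctan2(*(b1 - b0)[::-1])
    dang = abs(ang_a - ang_b) % np.pi
    if min(dang, np.pi - dang) > ang_tol:
        return False
    # (2) midpoints close enough
    if np.linalg.norm((a0 + a1) / 2 - (b0 + b1) / 2) > dist_tol:
        return False
    # (3) LBD descriptor distance small enough
    return desc_dist < desc_tol

ok = segment_match_ok(((0, 0), (100, 0)), ((2, 1), (101, 2)), desc_dist=30.0)
bad = segment_match_ok(((0, 0), (100, 0)), ((0, 0), (0, 100)), desc_dist=30.0)
```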
2. Semantic segmentation of moving objects
Moving objects, such as walking people and driving cars, especially in urban scenes, can corrupt the feature associations of a visual SLAM system, degrading the quality of the state estimation and even causing the system to fail. Thermal infrared images are advantageous for detecting moving objects because the movement of an object generates heat, raising its temperature above ambient. This embodiment builds on this advantage of thermal infrared images while balancing accuracy and real-time performance. We first use the object detection neural network YOLOv7; however, its accuracy alone is insufficient for SLAM, so we additionally adopt the SORT (Simple Online and Realtime Tracking) moving-object tracking method. Furthermore, an instance segmentation method refines the bounding box, enabling pixel-level segmentation of moving objects.
YOLOv7 was trained on the CVC-14 dataset and our own collected thermal infrared dataset, containing in total 20,000 thermal infrared images for object detection and 10,000 for semantic segmentation. The dataset covers 7 classes of common movable objects: people, cars, trucks, buses, dogs, bicycles and motorcycles. FIG. 5 shows the dynamic target region segmentation framework with examples of thermal infrared images and their corresponding ground-truth annotations.
In the SORT method, the target state detected by YOLOv7 is defined as:

s = (x, y, a, p, ẋ, ẏ, ȧ)^T

where x, y are the center coordinates of the detected object, a and p are the area and aspect ratio of its bounding box, and ẋ, ẏ, ȧ are the rates used to predict the center coordinates and bounding-box area in the next frame. Motion prediction and data association during tracking are performed with a Kalman filter and the Hungarian algorithm, and the intersection-over-union distance is used to compute the cost matrix.
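The constant-velocity Kalman prediction for such a state can be sketched as below; the state layout follows the formula above, while the transition and noise matrices are textbook assumptions rather than this embodiment's tuned values:

```python
import numpy as np

dim = 7  # state: (x, y, a, p, vx, vy, va)
F = np.eye(dim)
F[0, 4] = F[1, 5] = F[2, 6] = 1.0   # x += vx, y += vy, a += va (unit time step)

def kf_predict(state, P, Q=None):
    """One Kalman prediction step: propagate mean and covariance."""
    Q = np.eye(dim) * 1e-2 if Q is None else Q
    state = F @ state
    P = F @ P @ F.T + Q
    return state, P

# A box at (10, 20) with area 400, aspect ratio 1.5, drifting right/up:
s0 = np.array([10.0, 20.0, 400.0, 1.5, 2.0, -1.0, 0.0])
s1, P1 = kf_predict(s0, np.eye(dim))
```

The update step (correcting with the YOLOv7 measurement matched by the Hungarian algorithm) is omitted for brevity.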
In instance segmentation, if a detection box for a moving object occupies less than 1% of the image area, we consider its effect on SLAM limited and ignore it directly. Furthermore, we segment only the bounding-box region rather than the entire image, which reduces computational complexity and increases SLAM speed. After segmentation, the epipolar constraint is used to judge whether the target is dynamic, and all dynamic point and line features are removed.
In the epipolar constraint, let x1, x2 be a matched keypoint pair in consecutive frames, with homogeneous coordinates X1 = (μ1, ν1, 1)^T and X2 = (μ2, ν2, 1)^T, where μi, νi are the keypoint coordinates in the image frame. The epipolar constraint states:

X2^T F X1 = 0

where F is the fundamental matrix. The epipolar line L2 can be calculated as:

L2 = [a, b, c]^T = F X1

where a, b, c are the coefficients of the line. The distance e from the matched keypoint to the epipolar line is then:

e = |X2^T F X1| / √(a² + b²)

If point P is a static point, the distance e from point x2 to the epipolar line L2 satisfies e < ε; conversely, if point P moves to point P′, the matched point x2′ has distance e > ε from L2. By deleting the feature points of dynamic targets, outliers affecting pose estimation are eliminated.
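The distance test can be sketched in a few lines of numpy; the fundamental matrix below is a toy example for a pure horizontal camera translation (epipolar lines are image rows), not an estimate from real data:

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance e of matched point x2 to the epipolar line L2 = F @ X1."""
    X1 = np.array([x1[0], x1[1], 1.0])
    X2 = np.array([x2[0], x2[1], 1.0])
    a, b, _ = F @ X1                       # epipolar line [a, b, c] in image 2
    return abs(X2 @ F @ X1) / np.hypot(a, b)

# For translation along x the epipole is at infinity (1, 0, 0) and F = [e]_x:
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
e_static = epipolar_distance(F, (100.0, 50.0), (120.0, 50.0))  # stayed on its row
e_moving = epipolar_distance(F, (100.0, 50.0), (120.0, 80.0))  # left its row
```

Comparing `e` against the threshold ε then classifies the keypoint as static or dynamic.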
In the initialization module:
After feature selection, the map needs to be initialized. A common monocular initialization estimates the essential matrix or homography matrix from matched point pairs between two frames. For low-texture thermal infrared images with insufficient feature points, this often makes initialization hard to succeed.
Therefore, we propose a SLAM map initialization scheme based on combined points and lines; it complements scenes that are weakly textured but rich in structural features, such as urban environments seen in thermal infrared images.
Typically, line-segment-based initialization relies on three images, using a trifocal tensor to estimate relative pose. In contrast, as shown in fig. 6, our initialization method requires only two views for pose estimation. We assume there are multiple pairs of parallel 3D lines in the environment, which is common in urban scenes. For a pair of parallel 3D lines L1, L2, their 2D projections l1, l2 in the first frame image captured by the collector intersect at the vanishing point l1 × l2, with normalized direction v_p1; similarly, their 2D projections m1, m2 in the second frame image give the vanishing-point normalized direction v_p2. In addition, assume another pair of parallel 3D lines yields vanishing-point normalized directions v_p3 and v_p4 in the first and second frame images respectively. Then, for the two pairs of parallel 3D lines, the ideal rotation matrix R between the two images satisfies:

R v_p1 = a1 v_p2 + g1,  R v_p3 = a2 v_p4 + g2
where a1, a2 = ±1 and, ideally, g1 = g2 = 0. In reality the formula cannot hold exactly because of noise, but a Singular Value Decomposition (SVD) can be performed on the 3×3 matrix B = a1 v_p2 v_p1^T + a2 v_p4 v_p3^T to find the rotation matrix R that minimizes ‖g1‖² + ‖g2‖².
Since a1, a2 are unknown, we use a geometric criterion to obtain the correct rotation matrix. Although the thermal infrared image is weakly textured, some feature points can still be detected and matched, and the translation vector can be estimated from two point correspondences. Consider two image points p1, p2 whose matches in the second image, determined by feature matching, are q1, q2; ideally they satisfy:

(R p_i × q_i)^T t = 0

The actual translation vector t is therefore taken as the unit vector minimizing Σ_i ((R p_i × q_i)^T t)².
To ensure the robustness of the initialization, we run the initialization method of ORB-SLAM2 (an existing method) in parallel, sample the point pairs and line pairs with a random sample consensus (RANSAC) scheme, and keep the best result.
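Under the stated assumptions (two pairs of unit vanishing-point directions with sign ambiguity a1, a2 = ±1), the SVD-based rotation estimate can be sketched as follows; the sign search and the orthogonal-Procrustes projection are our reading of the text, not the original implementation:

```python
import numpy as np

def rotation_from_vps(v1, v3, v2, v4):
    """Estimate R mapping directions v1 -> ±v2 and v3 -> ±v4.
    Tries both sign options for each pair and keeps the best alignment."""
    best, best_err = None, np.inf
    for a1 in (1.0, -1.0):
        for a2 in (1.0, -1.0):
            B = a1 * np.outer(v2, v1) + a2 * np.outer(v4, v3)
            U, _, Vt = np.linalg.svd(B)
            # nearest rotation to B (orthogonal Procrustes, det-corrected)
            R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
            err = np.linalg.norm(R @ v1 - a1 * v2) + np.linalg.norm(R @ v3 - a2 * v4)
            if err < best_err:
                best, best_err = R, err
    return best

# Synthetic check: rotate two direction vectors by a known rotation.
th = 0.3
R_true = np.array([[np.cos(th), -np.sin(th), 0.0],
                   [np.sin(th),  np.cos(th), 0.0],
                   [0.0,         0.0,        1.0]])
v1 = np.array([1.0, 0.0, 0.0])
v3 = np.array([0.0, 0.0, 1.0])
R_est = rotation_from_vps(v1, v3, R_true @ v1, R_true @ v3)
```

In practice the directions come from noisy vanishing points, so the recovered R is only approximate and is refined together with the translation during optimization.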
In the feature tracking module:
1. map point and map line tracking
After the initialization of the map is completed, inter-frame tracking and local map tracking are sequentially used for carrying out data association, and the current frame pose is updated. Our approach references the ideas of ORB-SLAM2 and makes some adjustments for thermal infrared images.
Firstly, assuming uniform motion of the system, pose estimation is carried out through the projection matching relation between adjacent frames. Map points and map lines are projected to the current frame and point features and line features meeting the requirements are searched for within a given range. The matching of point features requires that rotation consistency be met in addition to the shortest descriptor distance, and the matching of line features is performed as described in feature extraction.
If the number of effectively matched features after the search is still below a given threshold, the search area is enlarged until the requirement is met. This yields an initial value of the current pose, which is then optimized using the 3D-2D projection relations, i.e. the 3D point and 3D line reprojection errors are minimized by bundle adjustment, as shown in fig. 3.
For frame pose estimation under the constant-velocity model, we take the current frame's rotation matrix R ∈ SO(3) and translation vector t ∈ R³ as the state variables to be optimized, build a graph optimization model, and iteratively minimize the following cost function with the Levenberg-Marquardt method:

{R, t}* = argmin Σ_k ρ_p(e_p,k^T Σ_p⁻¹ e_p,k) + Σ_j ρ_l(e_l,j^T Σ_l⁻¹ e_l,j)

where ρ_p and ρ_l are Huber robust kernels, introduced to reduce the influence of outlying terms on the cost function, Σ_p and Σ_l are the observation covariance matrices of the points and lines, and the sums run over the sets of point and line matching pairs between consecutive frames. In view of computational complexity, we solve directly with analytic Jacobian matrices. After pose optimization, outlier points and lines are removed. Tracking is considered successful if the number of successfully matched map points and map lines exceeds a threshold.
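A minimal sketch of evaluating such a Huber-robustified cost (illustrative threshold δ = 1, identity covariances for simplicity):

```python
import numpy as np

def huber(s, delta=1.0):
    """Huber kernel rho applied to a squared error s = e^T Sigma^-1 e:
    quadratic for small residuals, linear beyond delta."""
    r = np.sqrt(s)
    return s if r <= delta else 2.0 * delta * r - delta ** 2

def robust_cost(errors, sigmas, delta=1.0):
    """Sum of Huber-robustified squared Mahalanobis reprojection errors."""
    total = 0.0
    for e, Sigma in zip(errors, sigmas):
        s = e @ np.linalg.inv(Sigma) @ e
        total += huber(s, delta)
    return total

inlier = np.array([0.1, 0.2])     # small reprojection error
outlier = np.array([5.0, -4.0])   # gross mismatch
Sigma = np.eye(2)
c = robust_cost([inlier, outlier], [Sigma, Sigma])
```

Note how the outlier contributes roughly 2√41 − 1 ≈ 11.8 instead of its quadratic cost 41, which is what keeps a few bad matches from dominating the pose estimate.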
Although constant-velocity-model tracking is efficient, aggressive camera motion may leave too few data associations between adjacent frames, in which case an image-to-image matching strategy is adopted. Feature descriptors are converted into bag-of-words vectors to accelerate matching between the current frame and a reference keyframe, i.e. the keyframe with the highest covisibility with the current frame. If the number of matched features is still insufficient, nearest-neighbor matching is used and a homography between the images is computed with RANSAC to obtain enough correct matches. Finally, map points and map lines are projected into the current frame, the previous frame's pose is taken as the initial value, and the pose is optimized according to the cost function above.
If all the above tracking methods fail and positioning is lost, relocalization is needed. The current frame is first converted into a bag-of-words vector, a group of candidate keyframes similar to it is found in the keyframe database, and qualifying candidates are selected.
Once the matching requirement is met, the camera pose of the current frame is estimated by solving the PnP problem, followed by pose optimization according to the cost function. If too few inliers remain after optimization, the unmatched map points and map lines of the keyframe are projected into the current frame to generate new matching relations, and the pose is optimized again with the projection matching result. Note that as soon as one candidate keyframe relocalizes successfully, the remaining candidates are not considered; otherwise the process repeats on the next frame until relocalization succeeds.
The above methods consider only limited inter-frame information and inevitably accumulate error, so afterwards we use local map tracking to improve system accuracy. The local map consists of keyframes sharing a covisibility relationship with the current frame, together with the point and line features observed by those keyframes. Like inter-frame tracking, local map tracking first performs reprojection matching of the point and line features in the environment map, then optimizes the pose by bundle adjustment, updates the observation counts of the current frame's map points and map lines, and judges whether tracking succeeded from the number of map points and map lines tracked by the current frame.
2. Key frame selection strategy
From the point of view of system robustness and real-time performance, we select representative images as keyframes. Too loose a keyframe selection introduces redundant information and inflates the back-end optimization cost; too strict a selection may cause tracking failure through difficult feature matching. Because of the weak texture of thermal infrared images, we insert keyframes more frequently than typical SLAM systems. For each thermal infrared frame, the keyframe sharing the most common observations with it is taken as its reference keyframe, and the frame is selected as a keyframe in any of the following cases:
(1) 15 frames have passed since the last global relocation;
(2) The insertion of the last key frame has been 15 frames away or the local map building thread is idle;
(3) The map points and map lines tracked by the current frame are less than 90% of the map points and map lines tracked by the reference key frame;
(4) The position and posture of the key frame are changed to a certain extent;
(5) The current frame has tracked at least 40 feature points and 15 spatial lines.
Note that: the specific values above are selected according to specific needs.
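The five tests above can be sketched as a simple predicate; the numeric thresholds mirror the example values in the text and are meant to be tuned per deployment:

```python
def is_keyframe(frames_since_reloc, frames_since_kf, mapping_idle,
                tracked_ratio, pose_change_large,
                tracked_points, tracked_lines):
    """Return True if the current frame should become a keyframe,
    per the five conditions listed in the text (any one suffices)."""
    if frames_since_reloc >= 15:                       # (1) far from last relocalization
        return True
    if frames_since_kf >= 15 or mapping_idle:          # (2) stale keyframe / idle mapper
        return True
    if tracked_ratio < 0.90:                           # (3) < 90% of reference's features
        return True
    if pose_change_large:                              # (4) pose changed enough
        return True
    if tracked_points >= 40 and tracked_lines >= 15:   # (5) enough points and lines
        return True
    return False

kf = is_keyframe(3, 5, False, 0.95, False, 10, 2)      # no condition fires
```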
In the local mapping module:
As shown in fig. 8, after the current frame is determined to be a keyframe, the connection relationships between keyframes are updated through the covisibility of map points and map lines, and low-quality map points and map lines are removed according to their observation conditions. New map points and map lines are simultaneously generated by triangulation between the current keyframe and its neighboring covisible keyframes to ensure more stable tracking. Finally, duplicate map points and map lines are checked and fused, and when the local map contains more than 3 keyframes, local bundle adjustment is executed according to the cost function to adjust the camera poses, map points, and map lines within the local map.
Here the optimization runs over the sets of point and line matching pairs within the local map. After the optimization adjustment, keyframes more than 90% of whose tracked map points and map lines (the specific percentage can be chosen as needed) are also tracked by other keyframes are eliminated as redundant, and the current frame is added to the loop-closure detection queue.
In the loop detection module:
Although the local mapping process can reduce error to some extent, globally consistent trajectories and maps are constructed, and accumulated error eliminated, through loop closure detection. Loop closure detection decides whether a loop occurs based on image-appearance similarity. We reuse the ORB (Oriented FAST and Rotated BRIEF) point descriptors together with LBD (Line Band Descriptor) line descriptors to train a bag-of-words model offline on thermal infrared images by clustering. In loop detection, we first compute a similarity score between the bag-of-words vectors of the current keyframe ν_a and each covisible keyframe ν_b, defining the similarity as:

s(ν_a, ν_b) = η_p s_p(ν_a, ν_b) + η_l s_l(ν_a, ν_b)

where η_p and η_l are weight coefficients, s_p(ν_a, ν_b) is the point-feature similarity between the images, and s_l(ν_a, ν_b) is the line-feature similarity. A set of loop-closure candidate frames is then found among all keyframes to judge whether the loop closure succeeds. On success, the poses, map points and map lines are corrected through the solved similarity transformation, the result is optimized over the essential graph, and finally global bundle adjustment is performed.
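A sketch of the weighted similarity score; cosine similarity stands in here for the actual bag-of-words scoring, and the weights η_p, η_l are illustrative values summing to 1:

```python
import numpy as np

def bow_similarity(va, vb):
    """Cosine similarity between two bag-of-words vectors (stand-in score)."""
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

def combined_similarity(vp_a, vp_b, vl_a, vl_b, eta_p=0.7, eta_l=0.3):
    """s = eta_p * s_p + eta_l * s_l, combining point and line similarities."""
    return eta_p * bow_similarity(vp_a, vp_b) + eta_l * bow_similarity(vl_a, vl_b)

vp = np.array([1.0, 0.0, 2.0])
vl = np.array([0.0, 1.0])
s_same = combined_similarity(vp, vp, vl, vl)                      # identical views
s_diff = combined_similarity(vp, np.array([0.0, 3.0, 0.0]),
                             vl, np.array([1.0, 0.0]))            # disjoint words
```

Keyframes whose score exceeds a threshold relative to the covisible keyframes' scores would then enter the loop-candidate set.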
In this embodiment, the epipolar constraint for point features in the dynamic environment:
moving targets in the dynamic environment do not meet epipolar constraints, based on which we judge the dynamic situation of key points in a priori target region in the thermal infrared image. An object is considered dynamic if there are more than a certain number of dynamic keypoints within the object region.
First, the detection and segmentation result of the object is obtained through the first three steps of the semantic part (object detection, object tracking and semantic segmentation).
Second, keypoint matches across consecutive frames are obtained with coarse-to-fine optical flow.
Third, the fundamental matrix F is solved with the RANSAC algorithm using the points in non-target regions, which contribute the most inliers, thereby avoiding interference from moving targets. The epipolar lines are then computed from F, and keypoints whose distance to their epipolar line exceeds the threshold are judged to be moving.
Let x1, x2 be a matched keypoint pair in consecutive frames, with homogeneous coordinates X1 = (μ1, ν1, 1)^T and X2 = (μ2, ν2, 1)^T, where μi, νi are the keypoint coordinates in the image frame. The epipolar constraint gives:

X2^T F X1 = 0

where F is the fundamental matrix. The epipolar line L2 can be calculated as:

L2 = [a, b, c]^T = F X1

where a, b, c are the coefficients of the line, and the distance e from the matched keypoint to the epipolar line is:

e = |X2^T F X1| / √(a² + b²)

As shown in fig. 3, if point P is a static point, the distance e from point x2 to the epipolar line L2 satisfies e < ε; conversely, if point P moves to point P′, the matched point x′ has distance e > ε from L2. By deleting the feature points of dynamic targets, outliers affecting pose estimation are eliminated.
In this embodiment, the geometric representation of the line in the thermal infrared image:
Extraction and matching of point features require images rich in texture information. To overcome the weak texture and low contrast of thermal infrared images, we additionally use line features, which provide extra structural information. A 3D straight line has 4 degrees of freedom; its representations include two endpoints or the intersection of two planes (eight parameters), Plücker coordinates (six parameters), and the orthonormal representation (four parameters). Considering viewpoint changes and line-segment occlusion during SLAM operation, it is difficult to extract accurate line endpoints from an image. In this embodiment we therefore treat the 3D line in space as infinitely long, use Plücker coordinates for transformation and projection of the 3D line, and use the orthonormal representation for back-end optimization.
The present embodiment uses a minimal set of four optimization parameters δ = [ψ^T, θ]^T ∈ R⁴ to update the orthonormal representation (U, W) of the 3D line, where the vector ψ = (ψ1, ψ2, ψ3)^T updates U:

U ← U R(ψ), with R(ψ) = R_x(ψ1) R_y(ψ2) R_z(ψ3)

Here R_x(ψ1) is the SO(3) matrix of a 3D rotation by angle ψ1 around the x-axis, and R_y, R_z are defined likewise. The scalar θ updates W: W ← W R(θ), with R(θ) the 2D rotation by angle θ.
Likewise, the orthonormal representation is easily converted back to Plücker coordinates:

L′ = [w11 u1^T, w21 u2^T]^T

where u_i denotes the i-th column of matrix U and w_ij the entries of W; this recovers the Plücker coordinates up to scale.
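Assuming the standard orthonormal parameterisation (U built from the normalized normal/direction frame of the line, W encoding their norms up to scale), the conversion both ways can be sketched as:

```python
import numpy as np

def pluecker_to_orthonormal(n, d):
    """Plücker (n = normal of the plane through origin, d = direction,
    with n ⟂ d) -> orthonormal representation (U in SO(3), W in SO(2))."""
    u1, u2 = n / np.linalg.norm(n), d / np.linalg.norm(d)
    U = np.column_stack([u1, u2, np.cross(u1, u2)])
    s = np.hypot(np.linalg.norm(n), np.linalg.norm(d))
    W = np.array([[np.linalg.norm(n), -np.linalg.norm(d)],
                  [np.linalg.norm(d),  np.linalg.norm(n)]]) / s
    return U, W

def orthonormal_to_pluecker(U, W):
    """Recover (n, d) up to a common scale: (w11 * u1, w21 * u2)."""
    return W[0, 0] * U[:, 0], W[1, 0] * U[:, 1]

# Toy line with orthogonal n and d:
n = np.array([0.0, 0.0, 2.0])
d = np.array([1.0, 0.0, 0.0])
U, W = pluecker_to_orthonormal(n, d)
n2, d2 = orthonormal_to_pluecker(U, W)
```

The round trip returns (n, d) scaled by 1/√(‖n‖² + ‖d‖²), which is immaterial since Plücker coordinates are homogeneous.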
In this embodiment, the joint beam adjustment of the point and line:
Unlike SLAM methods that use only point features, the system of this embodiment performs bundle adjustment for map optimization using both point and line features, applied to thermal infrared images, which leads to different reprojection errors and different Jacobian matrices.
We implement the joint bundle adjustment with the nonlinear optimization library g2o, optimizing camera poses, 3D map points and 3D map lines so as to minimize the reprojection errors. As shown in fig. 3, define the point and line projected onto the pixel plane as p′ and l′ = [l1, l2, l3]^T respectively, and the matched point and line segment as p and l. The reprojection errors of the point and line features are then:

e_p = p − p′
e_l = [p1^T l′, p2^T l′]^T / √(l1² + l2²)

where p1 = [u1, v1, 1]^T and p2 = [u2, v2, 1]^T are the two endpoints of the matched segment l. Let f_x and f_y be the intrinsic parameters of the camera, P_c = [X_c, Y_c, Z_c]^T a spatial point in the camera coordinate system, and ξ the camera pose represented in Lie algebra with exp(ξ^) = [R | t]. The Jacobians J_{p,ξ} and J_{l,ξ} of the point and line reprojection errors with respect to ξ follow by the chain rule: writing e1 = p1^T l′ and e2 = p2^T l′, the derivatives of e1, e2 with respect to the projected line l′ are combined with the derivative of the line projection with respect to the pose. The Jacobians of the errors with respect to the map points and map lines are obtained analogously through the camera projection model.
the beam spread of the points and lines is critical to the operation of SLAM.
As a specific application:
1. experimental data
To evaluate the performance of MonoThermal-SLAM, we collected data in real-world dynamic environments to verify the accuracy and robustness of our system.
The main sensor device of this embodiment includes a 3D LiDAR, a thermal infrared camera, and a high-precision RTK-GNSS receiver, the specifications of which are shown in table 1:
TABLE 1
Using a shuttered thermal infrared camera with scene-based non-uniformity correction ensures that the output does not freeze frequently; the thermal imager is calibrated with a calibration plate.
As shown in fig. 8, the present embodiment adopts an existing calibration plate to calibrate the thermal imager. The calibration plate is a low emissivity aviation aluminum plate, and a square chessboard pattern of a high emissivity black coating is coated on the calibration plate.
We collected a total of 16 scene sequences: 3 small-scale indoor sequences, 6 outdoor sequences, and 7 large-scale driving sequences captured with the experimental equipment fixed on a vehicle. They include many challenging scenarios: moving dynamic objects (pedestrians, vehicles), illumination changes (low light, darkness), weather changes (light rain), and ambient visibility changes (light smoke), as shown in fig. 9.
Since very accurate ground-truth trajectories are difficult to obtain in these challenging environments, we use the trajectories produced by the well-known LeGO-LOAM algorithm as reference for the indoor and outdoor sequences. LeGO-LOAM is chosen because it not only yields very accurate pose estimates but is also robust in dynamic environments. In the driving sequences, the trajectory measured by the high-precision real-time kinematic global navigation satellite system (RTK-GNSS) receiver in fixed solution is used as ground truth.
Due to the lack of open-source SLAM code specifically designed for monocular thermal imagers, we compare the proposed method against DSO, SVO2.0 (monocular mode) and ORB-SLAM3 (monocular mode), which are state-of-the-art SLAM methods designed for monocular cameras. The system of this example runs on the Ubuntu 20.04 platform, and all tests used a computer equipped with an Intel i7-9700 CPU and an NVIDIA GeForce RTX 2080Ti GPU.
2. Real-time denoising performance
Although random noise generally follows a Poisson distribution, we still model it as additive white Gaussian noise (AWGN). The reason is that applying a variance-stabilizing transform readily converts Poisson-distributed random variables into approximately standard-Gaussian ones, so the assumption is simple and reasonable and makes the denoising problem easier to solve.
We collected 5000 thermal infrared images of different scenes; the training dataset consists of input-output pairs {((o_i, N_i), I_i)}, where o_i is image I_i with AWGN added and N_i is the corresponding noise level map. The trained FFDNet can handle spatially variant noise. We evaluated it against a paired real thermal infrared denoising open-source database, in which the low-quality images are raw thermal data after only preliminary non-uniformity correction, with severe noise interference, while the high-quality references were produced by a complex denoising pipeline. FIG. 10 compares original and denoised images: the denoised thermal infrared image has both an improved signal-to-noise ratio and clear texture. Furthermore, with GPU acceleration FFDNet processes a 640x512 thermal infrared image in only about 10 milliseconds, whereas the conventional BM3D method requires several seconds or even tens of seconds.
3. Point line feature detection and matching
In order to verify the improvement of the performance of the proposed method for point and line feature detection and matching, we tested the original thermal infrared image and the denoised thermal infrared image of the dataset, respectively. Both images are set to the same point and line feature parameters, and fig. 11 shows the feature matching result after outlier rejection. It can be seen that the features of a successful match are very few due to the low contrast and low signal to noise ratio of the original thermal infrared image, and it can be inferred that tracking using the original image is very difficult. The denoised thermal infrared image is much more robust, enough point and line features can be seen, and the matching performance is good.
4. Dynamic object interference resistance
To filter movable objects, we detect dynamic objects in each frame using the proposed method; an example of object detection is shown in fig. 12. Conventional single-frame target detection, lacking inter-frame information, frequently misses dynamic targets, whereas our method greatly improves the detection rate. The epipolar-constraint screening of dynamic targets was likewise verified experimentally. FIG. 13 shows its effect on the accuracy of the system in a driving scene with dynamic targets, where, subject to the epipolar constraint, only static point and line features are kept for tracking. This resists interference from dynamic objects and improves the robustness of the SLAM system in dynamic environments.
5. Positioning accuracy test
5.1 map initialization
To measure the effect of the proposed initialization method, we compared it experimentally with classical initialization. The traditional method, as proposed in ORB-SLAM2, selects a fundamental or homography matrix for pose estimation and reconstructs the map. Note that both tests use 800 point features and 100 line features per frame, essentially the feature budget of the proposed system. The results of the different initialization tests, shown in fig. 14, indicate that the initialization method proposed in this embodiment outperforms the traditional one; moreover, it generates a structural line map. Conventional methods rely solely on feature points, but weak-texture thermal infrared images lack sufficient, repeatable feature points, so map initialization often fails across multiple runs. FIG. 15 also shows an example of localization and mapping, indicating that the proposed initialization builds an initial map faster than the traditional initialization, with particularly strong robustness in thermal scenes rich in structural information.
5.2 indoor and outdoor experiments
We tested the positioning accuracy of several SLAM systems on sequences Indoor1-3 and Outdoor1-6. The proposed system extracts 1000 point features and 100 line features per thermal infrared image; ORB-SLAM2 uses the same point-feature parameters as the proposed system, while SVO2.0 and DSO use the officially recommended parameters. Note that, according to our experiments, feeding the raw (un-denoised) thermal infrared images into the above visual SLAM systems causes frequent tracking failures, so we uniformly use the real-time fully denoised thermal infrared images as input. Since monocular tracking yields trajectories of unobservable scale, we align them with the ground-truth trajectory by a similarity transformation and compute the Root Mean Square Error (RMSE) of the relative position error; positioning accuracy is evaluated from the difference between the SLAM trajectory and ground truth. In addition, we compare the tracked distance against the ground-truth distance to show the robustness of each system.
TABLE 2
Comparison of positioning accuracy and tracking path length of MonoThermal-SLAM, ORB-SLAM3, SVO2.0 and DSO on indoor and outdoor sequences. Systems whose tracking distance is less than one third of the ground-truth distance are considered tracking failures and are denoted by "-" in the table.
TABLE 3 Table 3
Tables 2 and 3 show the positioning accuracy, tracking path length, and sequence characteristics of each SLAM system on the Indoor and Outdoor sequences. Systems whose tracking distance is less than one third of the ground-truth distance are considered tracking failures, denoted by "-" in the tables. The proposed system shows significant advantages over the other methods; the tracked trajectories are shown in fig. 16. ORB-SLAM2, which uses only point features, struggles with feature tracking given the weak texture of thermal infrared images; it is prone to tracking loss when dynamic objects interfere or sufficiently robust feature points are lacking, and its tracked path distance is markedly shorter than our method's, an important reason being frequent initialization failures. DSO, based on the direct method, improves robustness to moving objects in weak-texture scenes by minimizing photometric error, but the abrupt photometric changes in the Outdoor1 sequence break the brightness-constancy assumption, producing erroneous data associations and hence tracking failure. FIG. 17 depicts the trajectory and mapping effect of MonoThermal-SLAM on the Outdoor1 sequence. SVO2.0, based on the hybrid method, has low computational cost and combines the advantages of the feature-point and direct methods in individual scenes, but its tracking distance is shortest in most sequences, as fast motion and dynamic scene interference easily cause it to lose tracking.
In summary, the real-time denoising stage effectively improves the quality of the thermal infrared images and widens the operating envelope of visual SLAM on thermal infrared imagery. MonoThermal-SLAM offers clear advantages, especially in indoor and outdoor dynamic environments, owing to the high robustness of its point and line features, while the epipolar constraint avoids erroneous data associations. Our system succeeds on all sequences and achieves high positioning accuracy, with an RPE below 0.1 m, whereas ORB-SLAM2, SVO2.0 and DSO fail to track all sequences completely.
5.3 Driving experiment and loop closure
The positioning accuracy and robustness of the system of this embodiment in large-scale driving scenes are further evaluated by comparison against RTK-GNSS. Unlike the RPE used in the indoor and outdoor sequence evaluations, we compare the performance of the SLAM systems using the RMSE of the absolute trajectory error (ATE) metric, as shown in Tables 4 and 5:
TABLE 4
Comparison of the positioning accuracy and tracking path length of MonoThermal-SLAM, ORB-SLAM3, SVO2.0 and DSO on the Driving sequences. A system whose tracking distance is less than one third of the ground-truth distance is considered a tracking failure and is denoted in the table by "-".
TABLE 5
Fig. 18 shows the trajectories of several driving sequences. Among them, the Driving1 sequence (fig. 18 (a)) is a challenging dataset, with a path length exceeding 500 m and numerous dynamic targets. ORB-SLAM2 fails to track soon after initialization, and SVO2.0 loses feature tracking when the sequence contains a sharp turn or a dynamic target passes by. DSO performs better and exhibits an RMSE similar to that of the proposed system, yet it still shows significant error.
As can be seen from fig. 18 (b-f), the method of this embodiment achieves positioning accuracy far superior to the other methods and overlaps the ground truth most closely.
By contrast, the system of this embodiment still operates smoothly on the sequence, provides more robust pose estimation, and attains the longest tracking distance and the highest positioning accuracy. Likewise, in the remaining sequence tests, the trajectory of the system of this embodiment substantially coincides with the RTK-GNSS trajectory, exhibiting the best performance.
In addition, a further experiment evaluates loop-closure accuracy. We recorded the Driving5 loop sequence in a campus environment, with a path length exceeding 470 m. Lacking loop closure, both DSO and SVO2.0 exhibit large scale drift. ORB-SLAM2 performs closest to our method, but in thermal scenes it struggles to detect correct loops and finds map initialization even harder. Using the retrained thermal infrared bag of words, loops were successfully detected (fig. 19 (a)); on that basis, scale correction and global BA were performed, and our trajectory matches the bird's-eye view of Google Maps (fig. 19 (a)). Fig. 19 (b) depicts the position estimates of each SLAM system on the Driving5 sequence and the comparison with the ground truth. Compared with the other systems, the trajectory estimated by MonoThermal-SLAM agrees best with the ground truth from high-precision equipment. FIG. 19 (c) shows a comparison of the tracking and position-estimation performance of MonoThermal-SLAM, DSO, SVO2.0, ORB-SLAM3 and RTK-GNSS on the Driving5 sequence.
Influence of the line-feature selection strategy in this embodiment:
the weak texture of the thermal infrared image is detrimental to the extraction and matching of point features, but instead is advantageous for line segments, which avoids detecting a large number of cumbersome and ineffective line segments like a visual image. Our particular line-feature selection strategy for thermal infrared image sequences contributes to the success of the system of this embodiment. Firstly, the real-time total noise of the image is removed, and the detection rate of the line segments is greatly improved. Secondly, we use geometric constraints to improve the robustness of the line features in the line feature extraction and matching stage. And performing descriptor matching by adjusting strategies such as search range and the like. When the number of matches is small, we increase the threshold of the parameters in the matching process to find more matching pairs, which improves efficiency and reduces the false match rate. Finally, we dynamically adjust the specific gravity of line features in localization and mapping for different scenes, which is critical for robust localization in certain scenes.
Regarding the structural map:
considering the instability of the line features, we usually choose the line features more strictly to ensure the accuracy of tracking and positioning, so the map sometimes cannot reflect the scene features well. However, buildings in urban environments have a rich rectilinear structure, with line features being more effective as underlying features expressing their geometry. Thanks to the accurate pose estimation of the system of the embodiment, we use Line3D (an existing method) to take the pose and image of the camera as input, output a denser scene geometric model as a three-dimensional reconstruction result, so as to better represent the characteristics of the scene, which is beneficial to the establishment of a semantic map and the simplified representation of a structural scene.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. A monocular thermal imaging simultaneous localization and mapping system, characterized in that it comprises a collector and a processor; the processor comprises a denoising module, a static feature extraction module, an initialization module, a feature tracking module, a local map building module and a loop detection module; wherein:
the collector acquires thermal infrared video sequence images without interruption by non-uniformity correction;
the denoising module performs real-time denoising on the acquired thermal infrared video sequence image;
the static feature extraction module extracts the key points and descriptors of point features and the key line segments and descriptors of line features in the image; it performs object detection with a target detection neural network, improves precision with a moving-object tracking method, refines the bounding box with an instance segmentation method to achieve pixel-level segmentation of moving objects, and then screens out the static features used for subsequent tracking with the epipolar geometric constraint;
the initialization module performs the following processing: restoring the pose of two adjacent frames according to the static point features and the static line features extracted by the static feature extraction module;
The feature tracking module performs the following processing: map points and map line tracking; key frame selection;
in map points and map line tracking, inter-frame tracking and local map tracking are sequentially used for carrying out data association, and the current frame pose is updated;
in key frame selection, the key frame sharing the most co-observations with the current thermal infrared image frame is used as the reference key frame;
the local map building module updates the connection relation between key frames through the co-view relation between map points and map lines, and eliminates map points and map lines with poor quality according to the observation condition;
the loop detection module judges whether a loop occurs using image-appearance similarity, with a bag-of-words model trained offline by clustering the descriptors of point features and line features from thermal infrared images.
2. The system according to claim 1, wherein: in the real-time scene denoising:
scene-based non-uniformity correction is first performed on the thermal infrared image; meanwhile, to counter the information loss of thermal infrared images during bit-depth compression, an adaptive histogram equalization method preserves the original image information to the greatest extent; random noise in the thermal infrared image is then removed.
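As an illustrative sketch (not this embodiment's implementation), the bit-depth compression step can be approximated with a global histogram equalization in place of the adaptive, tile-based variant; the bin count is an assumed parameter:

```python
import numpy as np

def equalize_to_8bit(raw, n_bins=4096):
    """Compress a high-bit-depth thermal frame to 8 bits with global
    histogram equalization -- a simplified stand-in for the adaptive
    (tile-based) equalization described in the text."""
    flat = raw.astype(np.float64).ravel()
    hist, edges = np.histogram(flat, bins=n_bins)
    cdf = hist.cumsum().astype(np.float64)
    span = max(cdf[-1] - cdf[0], 1.0)
    cdf = (cdf - cdf[0]) / span                         # normalize CDF to [0, 1]
    # Map each pixel to its histogram bin, then through the CDF to [0, 255].
    idx = np.clip(np.searchsorted(edges, flat, side="right") - 1, 0, n_bins - 1)
    return (cdf[idx] * 255.0).astype(np.uint8).reshape(raw.shape)
```

The mapping is monotonic, so the relative ordering of temperatures is preserved while the full 8-bit range is used.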
3. The system according to claim 2, wherein:
for random noise, adopting FFDNet based on convolutional neural network to denoise the thermal infrared image;
the original image is reshaped into several downsampled sub-images, which are concatenated with a tunable noise-level map and fed into the convolutional neural network; after convolution, a denoised image of the same size as the original input is produced.
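A minimal sketch of the sub-image reshaping performed before the network and its inverse performed afterwards (the convolutional network itself is omitted; the downsampling factor of 2 is FFDNet's usual setting):

```python
import numpy as np

def pixel_unshuffle(img, factor=2):
    """Reshape an H x W image into factor**2 downsampled sub-images,
    the input layout FFDNet builds before its convolutional layers."""
    h, w = img.shape
    assert h % factor == 0 and w % factor == 0
    return (img.reshape(h // factor, factor, w // factor, factor)
               .transpose(1, 3, 0, 2)
               .reshape(factor * factor, h // factor, w // factor))

def pixel_shuffle(subs, factor=2):
    """Inverse operation: reassemble the sub-images into a denoised
    image of the same size as the original input."""
    c, h, w = subs.shape
    assert c == factor * factor
    return (subs.reshape(factor, factor, h, w)
                .transpose(2, 0, 3, 1)
                .reshape(h * factor, w * factor))
```

The two operations are exact inverses, so no information is lost in the reshaping itself.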
4. The system according to claim 1, wherein: during line-segment feature extraction, extremely short, invalid line features are filtered out; split long line segments are reconnected; if two line segments are very close in length and distance and their direction difference is small enough, they are merged; the new line segment is described with a descriptor, and geometric constraints are added to remove possible abnormal matches, the geometric constraints comprising:
(1) The length difference and the angle difference of the matched line segment pairs are smaller than a certain value;
(2) The distance of the matched line segment pairs is smaller than a certain value;
(3) The descriptor distance of the matching line segment pairs is less than a certain value;
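A hedged sketch of applying the three constraints above to one candidate match; all thresholds are illustrative placeholders, not values from this embodiment:

```python
import numpy as np

def passes_geometric_checks(seg_a, seg_b, desc_dist,
                            max_len_ratio=0.5, max_angle_deg=10.0,
                            max_midpoint_dist=40.0, max_desc_dist=0.25):
    """Check one candidate line-segment match against the constraints:
    (1) similar length and direction, (2) small spatial distance,
    (3) small descriptor distance. Segments are ((x1, y1), (x2, y2))."""
    (ax1, ay1), (ax2, ay2) = seg_a
    (bx1, by1), (bx2, by2) = seg_b
    va = np.array([ax2 - ax1, ay2 - ay1], float)
    vb = np.array([bx2 - bx1, by2 - by1], float)
    la, lb = np.linalg.norm(va), np.linalg.norm(vb)
    # (1) length difference and angle difference must be small
    if abs(la - lb) / max(la, lb) > max_len_ratio:
        return False
    cosang = abs(va @ vb) / (la * lb)          # direction is sign-agnostic
    if np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))) > max_angle_deg:
        return False
    # (2) midpoints must be near each other
    mid_a = np.array([(ax1 + ax2) / 2.0, (ay1 + ay2) / 2.0])
    mid_b = np.array([(bx1 + bx2) / 2.0, (by1 + by2) / 2.0])
    if np.linalg.norm(mid_a - mid_b) > max_midpoint_dist:
        return False
    # (3) descriptor distance must be small
    return desc_dist <= max_desc_dist
```

In practice the thresholds would be loosened when too few matches survive, as the description section notes.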
in the moving-object tracking method, the target state detected by the target detection neural network is defined as:
where x, y represent the center coordinates of the detected object, a, p represent the area and aspect ratio of the bounding box, and the remaining state terms represent the predicted center coordinates and bounding-box area of the object in the next frame; Kalman filtering and the Hungarian algorithm perform motion prediction and data association of targets during tracking, and the cost matrix is computed from the intersection-over-union (IoU) distance between detection boxes.
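The data-association step can be sketched as follows; a brute-force minimum-cost assignment stands in for the Hungarian algorithm (it gives the same optimum for the handful of targets a frame typically holds):

```python
import numpy as np
from itertools import permutations

def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections):
    """Assign detections to predicted track boxes by minimizing the
    total (1 - IoU) cost -- a brute-force stand-in for the Hungarian
    algorithm used in the claim. Returns (track_idx, det_idx) pairs."""
    if not tracks or not detections:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    n = min(len(tracks), len(detections))
    best, best_cost = [], np.inf
    for perm in permutations(range(len(detections)), n):
        c = sum(cost[i, j] for i, j in enumerate(perm))
        if c < best_cost:
            best, best_cost = list(enumerate(perm)), c
    return best
```

A production system would also gate assignments whose IoU falls below a minimum, spawning new tracks instead.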
5. The system according to claim 1, wherein: the epipolar geometric constraint judges the dynamic state of key points within prior target regions of the thermal infrared image; an object is considered dynamic if more than a certain number of dynamic key points lie within its region:
first, the target detection and segmentation results are obtained; second, key-point matches across consecutive frames are obtained using coarse-to-fine optical flow; third, the fundamental matrix is solved with a random sample consensus (RANSAC) algorithm using the points in non-target regions, the epipolar lines are computed from the fundamental matrix, and finally the key points whose distance to their epipolar line exceeds a threshold are determined to be moving.
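The epipolar test in the third step can be sketched as follows, assuming the fundamental matrix F has already been estimated from the static background points; the threshold value is illustrative:

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Distance from p2 (frame 2) to the epipolar line F @ p1 of its
    match p1 (frame 1). Points are (x, y) pixel coordinates."""
    x1 = np.array([p1[0], p1[1], 1.0])
    x2 = np.array([p2[0], p2[1], 1.0])
    line = F @ x1                       # epipolar line a*x + b*y + c = 0
    return abs(x2 @ line) / np.hypot(line[0], line[1])

def dynamic_keypoints(F, matches, thresh=2.0):
    """Return indices of matched keypoints whose epipolar distance
    exceeds the threshold, i.e. points inconsistent with camera motion."""
    return [i for i, (p1, p2) in enumerate(matches)
            if epipolar_distance(F, p1, p2) > thresh]
```

A static point lies (up to noise) on its epipolar line; a point on a moving object generally does not, which is what the claim exploits.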
6. The system according to claim 1, wherein: the initialization module is used for processing according to the following method:
for a pair of parallel 3D lines, their 2D projections l_1, l_2 in the first frame image captured by the collector intersect at the vanishing point l_1 × l_2, whose normalized direction is v_p1;
similarly, the normalized vanishing-point direction of their 2D projections m_1, m_2 in the second frame image is v_p2. In addition, let the normalized vanishing-point directions of the 2D projections of another pair of parallel 3D lines in the first and second frame images be v_p3 and v_p4. Then, for the two pairs of parallel 3D lines, the ideal rotation matrix between the two images satisfies the following formula:
wherein g_1 and g_2 are constants; based on the above formula, the rotation matrix R is solved by minimizing ‖g_1‖² + ‖g_2‖²;
consider two 2D feature points p_1, p_2 detected in the first image using the point-feature detector; feature matching determines their corresponding points q_1, q_2 in the second image; the actual translation vector is defined as:
the initialization method in ORB-SLAM2 is used in parallel at the same time, and the random sampling consistency method is used for sampling the point pairs and the line pairs in the initialization method to select the best result.
7. The system according to claim 1, wherein: in map point and map line tracking, uniform motion of the system is first assumed, and the pose is estimated through the projection-matching relationship between adjacent frames; map points and map lines are projected into the current frame, and point and line features meeting the requirements are searched within a given range, where point-feature matching must satisfy rotational consistency in addition to the shortest descriptor distance; if the number of effectively matched features after the search is still smaller than a given threshold, the search region is enlarged until the requirement is met; the initial value of the current pose obtained in this way is then refined using the 3D-2D projection relationship, i.e., the reprojection errors of 3D points and 3D lines are minimized by bundle adjustment; for frame pose estimation under the constant-velocity model, the rotation matrix R* and translation vector t* of the current frame are taken as the state variables to be optimized, a graph optimization model is built, and the following cost function is iteratively minimized with the Levenberg-Marquardt method:
wherein ρ_p and ρ_l are Huber robust cost functions, introduced to suppress outlier terms in the cost function; the residual terms are the minimum reprojection errors of points and lines, respectively;
Σ_p and Σ_l denote the observation covariance matrices of points and lines;
and the remaining set denotes the matching pairs between successive image frames in the video sequence. Considering computational complexity, the Jacobian matrix is used for direct solution; after pose optimization, outlier points and outlier lines are removed from the map; if the number of successfully matched map points and map lines exceeds a certain value, tracking is considered successful.
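The Huber robust cost ρ used in the cost function above can be written as follows (δ is an illustrative threshold, not a value from the claim):

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber robust cost rho(r): quadratic near zero, linear in the
    tails, so outlier reprojection residuals cannot dominate the
    pose optimization."""
    a = np.abs(np.asarray(r, dtype=float))
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))
```

The two branches meet at |r| = δ with matching value and slope, which keeps the cost function continuously differentiable for the Levenberg-Marquardt solver.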
A matching strategy between images is adopted: the feature descriptors are converted into bag-of-words vectors to accelerate matching between the current frame and a reference key frame, the reference key frame being the key frame with the highest co-visibility with the current frame; if the number of matched features still falls short, a nearest-neighbor matching algorithm is used, and a homography matrix between the images is computed with a random sample consensus method to obtain enough correct matches; finally, map points and map lines are projected into the current frame, and the pose is optimized according to the above equation with the previous frame's pose as the initial value;
if the above methods fail to track and positioning fails, relocalization is needed: first, the current frame is converted into a bag of words, a group of candidate key frames similar to the current frame is found in the key frame database, and the candidate key frames meeting the requirements are selected from the group; once the matching requirement is met, the camera pose of the current frame is estimated by solving the PnP problem and optimized according to the above equation; if the number of inliers after optimization is too small, the unmatched map points and map lines in the key frame are projected into the current frame to generate new matching relationships, and the pose is optimized again using the projection-matching result; if relocalization succeeds on one candidate key frame, the remaining candidates are not considered; otherwise the process repeats on the next frame until relocalization succeeds;
in the key frame selection, the key frame is selected in any of the following cases:
(1) Some frames have passed since the last global relocation;
(2) Some frames have passed since the insertion of the last key frame or the local map building thread is idle;
(3) Map points and map lines tracked by the current frame are less than a certain proportion of map points and map lines tracked by the reference key frame;
(4) The position and posture of the key frame are changed to a certain extent;
(5) The current frame tracks at least a certain number of feature points and spatial lines.
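The five conditions above can be sketched as one decision function; the thresholds are illustrative, and condition (5) is treated as a gate so that a weakly tracked frame is never promoted (an assumed interpretation, since the claim does not state how the conditions combine):

```python
def need_keyframe(frames_since_reloc, frames_since_kf, mapper_idle,
                  tracked_ratio, pose_changed, n_tracked,
                  min_gap=20, max_ratio=0.9, min_tracked=50):
    """Keyframe decision: conditions (1)-(4) trigger insertion,
    condition (5) -- enough tracked features -- gates it."""
    trigger = (frames_since_reloc >= min_gap          # (1)
               or frames_since_kf >= min_gap          # (2a)
               or mapper_idle                         # (2b)
               or tracked_ratio < max_ratio           # (3)
               or pose_changed)                       # (4)
    return trigger and n_tracked >= min_tracked       # (5)
```

This mirrors the ORB-SLAM-style policy the claim echoes: insert generously, but only when the frame itself is well tracked.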
8. The system according to claim 1, wherein: the local map building module uses the current key frame and neighboring co-visible key frames to generate new map points and map lines by triangulation, making tracking more stable; finally, duplicate map points and map lines are checked and fused, and when the number of key frames in the local map exceeds a certain value, local bundle adjustment is performed according to the following equation to adjust the camera poses, map points and map lines in the local map:
wherein the corresponding sets denote the matching pairs of points and lines within the local map;
after the optimization, it is judged whether more than a certain percentage of the map points and map lines tracked by a key frame can also be tracked by other key frames; redundant key frames are eliminated accordingly, and the current frame is added to the loop-closure detection queue.
9. The system according to claim 1, wherein: in the loop detection module, a similarity score is first computed between the bag-of-words vectors of the current key frame and each co-visible key frame, the similarity being defined as:
wherein p and l denote weight coefficients, s_p(v_a, v_b) denotes the point-feature similarity between images, and s_l(v_a, v_b) denotes the line-feature similarity between images; whether loop closure succeeds is judged by finding a set of loop-closure candidate frames among all key frames; if successful, the poses, map points and map lines are adjusted through the solved similarity transformation, and finally global bundle adjustment is performed to complete the optimization.
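A minimal sketch of the weighted similarity and candidate filtering; the weight values stand in for the coefficients p and l, which the claim leaves unspecified:

```python
def combined_similarity(s_point, s_line, w_p=0.7, w_l=0.3):
    """s(v_a, v_b) = p * s_p(v_a, v_b) + l * s_l(v_a, v_b) from the
    claim; the weights here are illustrative placeholders."""
    return w_p * s_point + w_l * s_line

def loop_candidates(scores, min_score):
    """Keep key frames whose combined score clears a minimum score
    (e.g. the lowest score among covisible key frames), the usual
    gate before geometric loop verification."""
    return [kf for kf, s in scores.items() if s >= min_score]
```

Candidates surviving this gate would then be verified geometrically before the similarity transform and global bundle adjustment run.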
10. A monocular thermal imaging simultaneous localization and mapping method, applicable to the system of any one of claims 1-9, characterized by comprising: acquiring thermal infrared images, image denoising, feature extraction, initialization, feature tracking, local map building and loop detection.
CN202310995674.9A 2023-08-09 2023-08-09 Monocular thermal imaging simultaneous positioning and mapping method and system Pending CN117036404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310995674.9A CN117036404A (en) 2023-08-09 2023-08-09 Monocular thermal imaging simultaneous positioning and mapping method and system

Publications (1)

Publication Number Publication Date
CN117036404A true CN117036404A (en) 2023-11-10

Family

ID=88629399


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117906598A (en) * 2024-03-19 2024-04-19 深圳市其域创新科技有限公司 Positioning method and device of unmanned aerial vehicle equipment, computer equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination