CN116740539A - Visual SLAM method and system based on lightweight target detection network - Google Patents

Visual SLAM method and system based on lightweight target detection network

Info

Publication number
CN116740539A
CN116740539A (application CN202310887776.9A)
Authority
CN
China
Prior art keywords
detection
frame
slam
pose
characteristic points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310887776.9A
Other languages
Chinese (zh)
Inventor
徐慧英
朱信忠
戴康佳
李琛
刘巍
曹雨淇
王拔龙
刘子洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Normal University CJNU
Original Assignee
Zhejiang Normal University CJNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Normal University CJNU filed Critical Zhejiang Normal University CJNU
Priority to CN202310887776.9A priority Critical patent/CN116740539A/en
Publication of CN116740539A publication Critical patent/CN116740539A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a visual SLAM method and system based on a lightweight target detection network, belonging to the technical fields of image detection and robot mapping and positioning. The method comprises: inputting a detection image into a YOLOv8s detection model for preprocessing to obtain a detection frame and a detection category of the detection image; removing the dynamic feature points inside the detection frame with a RANSAC algorithm to obtain the static feature points of the detection image; and inputting the feature points outside the detection frame and the static feature points inside the detection frame into a local map for back-end optimization. The method aims to solve the problem that the pose estimation of traditional visual SLAM systems is inaccurate in dynamic environments. Target detection is introduced into the preprocessing stage of the SLAM system, the detection results are then passed to the SLAM front end through a communication interface, and the dynamic feature points are removed with a RANSAC-based geometric method, thereby eliminating the influence of dynamic objects on pose estimation.

Description

Visual SLAM method and system based on lightweight target detection network
Technical Field
The invention relates to the technical fields of image detection and robot mapping and positioning, and in particular to a visual SLAM method and system based on a lightweight target detection network.
Background
SLAM (Simultaneous Localization and Mapping) is one of the key technologies required by many robots. It uses data acquired by sensors such as a camera, an inertial measurement unit and a laser radar to estimate the position of the robot in real time while simultaneously constructing a map of the environment. In early SLAM research, researchers mainly used a monocular camera as the primary vision sensor and performed robot pose estimation by means of a Kalman filter. In 2007, PTAM first proposed splitting localization and mapping into two threads running in parallel and using nonlinear optimization. In 2009, Lourakis and Argyros proposed a bundle adjustment (BA) algorithm based on graph optimization, which exploits the sparsity and symmetric structure of the Hessian matrix to replace filtering approaches, and graph-based optimization has since become the mainstream method.
VSLAM is a SLAM system with a camera as its main sensor; in general, a VSLAM system can be divided into a front end and a back end. The front end refers to the visual odometry (VO) part of the VSLAM system, while the back end is mainly responsible for pose optimization. In recent years a number of excellent systems have emerged in the VSLAM field, such as ORB-SLAM, LSD-SLAM, VINS-Mono and DSO, which have made important contributions to the development of visual SLAM. For example, ORB-SLAM uses ORB feature descriptors to extract and match feature points, and optimizes the localization and mapping of the robot through a local map. The system performs well in terms of speed and accuracy and is suitable for real-time applications. However, ORB-SLAM has limitations when processing large-scale maps and is sensitive to illumination changes and object occlusion. To cope with the challenges of complex scenes, ORB-SLAM2 improves on ORB-SLAM by introducing stereo and RGB-D cameras, improving the robustness and scalability of the system. ORB-SLAM3 further integrates IMU, GPS and map fusion technologies, allowing the SLAM system to handle larger-scale and more complex scenes. However, conventional SLAM systems have an inherent limitation: the environment is assumed to be static. In practical applications a truly static environment is difficult to achieve, since there are many dynamic objects, especially humans. Dynamic objects strongly affect the extracted feature points and thus the estimation of the camera pose. A method is therefore needed for SLAM systems to reduce the influence of dynamic feature points on pose estimation.
At present, researchers use geometric methods, deep learning methods, multi-sensor fusion and the like to remove the feature points extracted in dynamic regions and so reduce their influence on the SLAM system. Geometric methods mainly include epipolar constraints, reprojection error constraints and the like. The epipolar constraint requires the fundamental matrix to be known in advance; the epipolar line is found in the current frame, and whether a point is dynamic is judged from the distance between the matched point and the epipolar line. The reprojection error constraint judges whether a point is dynamic according to the magnitude of the error between the projected point and the matched point in the current frame. However, these methods require the fundamental matrix to be computed beforehand and assume that most of the image is static. Other methods use graph theory and weighting strategies to distinguish dynamic from static points.
Deep learning methods play an important role in handling dynamic objects in SLAM systems. The main approach is to perform semantic segmentation on the image with a neural network in order to identify and segment the dynamic regions, thereby filtering out the influence of dynamic points on the SLAM system. DS-SLAM is a method that combines a semantic segmentation network (SegNet) with an epipolar-constraint method to reduce the interference of dynamic targets on the SLAM system. Many excellent SLAM systems, such as DynaSLAM and DP-SLAM, use Mask R-CNN to segment image semantics. Although Mask R-CNN can identify feature points at the pixel level, when the colour of a dynamic point is similar to the background this method may fail to remove the dynamic point completely; moreover, the model is complex and requires a long computation time, leaving it far from practical application.
Therefore, providing a visual SLAM method and system that can improve pose estimation accuracy in dynamic environments is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides an RGB-D vision SLAM method and system based on a lightweight target detection network, which combines a geometric method with deep learning, and can improve the accuracy of system detection and the accuracy of pose estimation.
In order to achieve the above object, the present invention provides the following technical solutions:
in one aspect, the invention provides a visual SLAM method based on a lightweight target detection network, comprising the following steps:
inputting a detection image into a YOLOv8s detection model for preprocessing, and obtaining a detection frame and a detection category of the detection image;
removing dynamic feature points in the detection frame by using a RANSAC algorithm to obtain static feature points of the detection image;
and inputting the characteristic points outside the detection frame and the static characteristic points inside the detection frame into a local map, and performing back-end optimization processing.
Optionally, the inputting the detection image into the YOLOv8s detection model for preprocessing, and obtaining a detection frame and a detection category of the detection image specifically includes:
training the YOLOv8s detection model by adopting a data set to obtain training weights;
adopting a positive sample allocation strategy of a task alignment target detector to allocate labels to the pictures;
calculating the probability of each tag class using an activation function and calculating a global class loss;
and calculating the position loss of the detection image, wherein the position loss comprises two parts: the loss between a prediction frame and a target frame is calculated with a CIoU loss function, and a DFL loss function is adopted so that the network rapidly focuses on the position distribution closest to the target frame.
Optionally, removing the dynamic feature points in the detection frame by using a RANSAC algorithm to obtain the static feature points of the detection image, which specifically includes:
step 1: initializing the number of iterations (Iterations), the number of best inliers BestInliers and the dynamic feature point set S;
step 2: randomly selecting two initial feature points and calculating their average depth value;
step 3: for each remaining feature point, if the error between its depth value di and the average depth value is smaller than a threshold, adding the feature point Pi to S and incrementing the current inlier count Inliers by one;
step 4: judging whether the current inlier count Inliers is larger than BestInliers; if so, updating BestInliers and the dynamic feature point set S;
step 5: repeating step 2 to step 4 for the set number of iterations;
step 6: after all iterations are completed, outputting the dynamic feature point set S.
Optionally, the feature points outside the picture detection frame and the static feature points inside the detection frame are input into a local map, and the back-end optimization processing is performed, which specifically includes:
constructing a pose graph according to the input feature points outside the picture detection frame and the static feature points inside the detection frame, in which the relative pose relations among the keyframes are represented by edges; the poses of the keyframes are adjusted with a nonlinear least-squares method so that the error between the observation position of a map point in a keyframe and its reprojection position in other keyframes is minimized;
optimizing the three-dimensional positions of the map points with a nonlinear least-squares method combined with the reprojection error, so that the error between the observed position of a map point and its reprojected positions in a plurality of keyframes is minimized;
and performing loop detection; after a loop is detected, triggering closed-loop optimization, which corrects the positions of the map points observed in the loop by optimizing the poses of the keyframes associated with the loop in the pose graph.
In another aspect, the present invention further provides a visual SLAM system based on a lightweight target detection network, comprising a preprocessing module, a tracking module and an optimization module, wherein:
the preprocessing module preprocesses the detection image by using a YOLOv8s detection model, and outputs a detection frame and a detection category of the detection image to the tracking module;
the tracking module is used for eliminating dynamic characteristic points in the detection frame by utilizing a RANSAC algorithm and acquiring static characteristic points of the detection image;
the optimizing module is used for inputting the characteristic points outside the detection frame and the static characteristic points inside the detection frame into a local map to perform back-end optimizing processing.
Optionally, the preprocessing module is specifically configured to:
training the YOLOv8s detection model by adopting a data set to obtain training weights;
performing label distribution by adopting a positive sample distribution strategy of a task alignment target detector;
calculating the probability of each category using an activation function and calculating a global category loss;
and calculating the position loss of the detection image, wherein the position loss comprises two parts: the loss between a prediction frame and a target frame is calculated with a CIoU loss function, and a DFL loss function is adopted so that the network rapidly focuses on the position distribution closest to the target frame.
Optionally, the optimizing module specifically includes:
the pose graph construction unit is used for constructing a pose graph and adjusting the poses of the keyframes with a nonlinear least-squares method, so that the error between the observation position of a map point in a keyframe and its reprojection position in other keyframes is minimized;
a three-dimensional position optimizing unit for optimizing the three-dimensional positions of the map points with a nonlinear least-squares method, so that the error between the observed position of a map point and its reprojected positions in a plurality of keyframes is minimized;
the closed loop detection unit is used for carrying out loop detection, triggering closed loop optimization processing after loop detection, and correcting the position of a map point observed by the loop in the pose graph by optimizing the pose of a key frame associated with the loop in the pose graph.
Compared with the prior art, the visual SLAM method and system based on a lightweight target detection network provided by the invention combine a geometric method with deep learning: deep learning is adopted as a preprocessing step and the latest YOLOv8s model is used for image preprocessing, giving higher detection accuracy, and a geometric method is then used to further improve the accuracy of pose estimation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic flow chart of the pretreatment of the YOLOv8s detection model in the invention;
FIG. 3 (a) is a graph of absolute track error for the system YS-SLAM of the invention at fr3/w/xyz sequence;
FIG. 3 (b) is a graph of absolute track error for ORB-SLAM2 at fr3/w/xyz sequence;
FIG. 3 (c) is a graph of absolute track error at fr3/w/static for the system YS-SLAM of the invention;
FIG. 3 (d) is a graph of absolute track error at fr3/w/static sequence for ORB-SLAM 2;
FIG. 3 (e) is a graph of absolute track error for the system YS-SLAM of the invention at fr3/w/rpy sequence;
FIG. 3 (f) is a graph of absolute track error for ORB-SLAM2 at fr3/w/rpy sequence;
FIG. 3 (g) is a graph of absolute track error for the system YS-SLAM of the invention at fr3/w/half sequence;
FIG. 3 (h) is a graph of absolute track error for ORB-SLAM2 at fr3/w/half sequence;
FIG. 3 (i) is a graph of absolute track error at fr3/s/static for the system YS-SLAM of the invention;
FIG. 3 (j) is a graph of absolute track error for ORB-SLAM2 at fr3/s/static sequence;
FIG. 4 (a) is a graph of the relative trajectory error of the system YS-SLAM of the invention under fr3/w/xyz sequence;
FIG. 4 (b) is a graph of relative track error for ORB-SLAM2 at fr3/w/xyz sequence;
FIG. 4 (c) is a relative trajectory error comparison of the system YS-SLAM of the invention at fr3/w/static sequence;
FIG. 4 (d) is a relative track error comparison of ORB-SLAM2 at fr3/w/static sequence;
FIG. 4 (e) is a relative track error comparison of the system YS-SLAM of the invention at fr3/w/rpy sequence;
FIG. 4 (f) is a relative track error comparison of ORB-SLAM2 at fr3/w/rpy sequence;
FIG. 4 (g) is a relative trajectory error comparison of the YS-SLAM of the present invention at fr3/w/half sequence;
FIG. 4 (h) is a relative track error comparison of ORB-SLAM2 at fr3/w/half sequence;
FIG. 4 (i) is a relative trajectory error comparison of the system YS-SLAM of the invention at fr3/s/static sequence;
FIG. 4 (j) is a relative track error comparison of ORB-SLAM2 at fr3/s/static sequence;
FIG. 5 (a) is a comparison of ORB-SLAM2 and YS-SLAM of the present system in three dimensions of fr3/w/xyz sequence;
FIG. 5 (b) is a comparison of ORB-SLAM2 and YS-SLAM of the present system in three dimensions of fr3/w/static sequence;
FIG. 5 (c) is a comparison of ORB-SLAM2 and YS-SLAM of the present system in three dimensions of fr3/w/rpy sequence;
FIG. 5 (d) is a comparison of ORB-SLAM2 and YS-SLAM of the present system in three dimensions of fr3/w/half sequence;
FIG. 5 (e) is a comparison of ORB-SLAM2 and YS-SLAM of the present system in three dimensions of fr3/s/static sequence;
FIG. 6 is a schematic diagram of a system frame of the present invention;
FIG. 7 is a schematic flow chart of the system of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
The embodiment of the invention discloses a visual SLAM method based on a lightweight target detection network, which is shown in fig. 1 and comprises the following steps:
inputting the detection image into a YOLOv8s detection model for preprocessing, and obtaining a detection frame and a detection category of the detection image;
removing dynamic characteristic points in the detection frame by using a RANSAC algorithm to obtain static characteristic points of the detection image;
and inputting the characteristic points outside the detection frame and the static characteristic points inside the detection frame into a local map, and performing back-end optimization processing.
The YOLOv8 preprocessing module is shown in fig. 2, and the detection frame and object class obtained by preprocessing are transmitted to the tracking module of ORB-SLAM2.
Wherein the preprocessing comprises the following steps:
Training weights: the model is trained on the classic COCO dataset to obtain the training weights.
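For illustration, a minimal sketch of how this training and weight-loading step might look with the ultralytics Python package is given below; the dataset file, epoch count, image size and weight paths are assumptions for illustration, not values taken from the patent.

```python
# Sketch only: assumes the ultralytics package; all paths and hyperparameters are illustrative.
from ultralytics import YOLO

# Start from the YOLOv8s checkpoint and fine-tune on COCO to obtain the training weights.
model = YOLO("yolov8s.pt")
model.train(data="coco.yaml", epochs=100, imgsz=640)

# Later, load the trained weights in the SLAM preprocessing module.
detector = YOLO("runs/detect/train/weights/best.pt")
results = detector("frame.png")      # list of Results objects, one per image
boxes = results[0].boxes.xyxy        # detection boxes as (x1, y1, x2, y2)
classes = results[0].boxes.cls       # detection categories
```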
YOLOv8 provides a more complete and more advanced image recognition and detection framework. The flow of the preprocessing is shown in fig. 2. The state-of-the-art single-stage object detection algorithm YOLOv8 has five variants: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x. To balance detection speed and accuracy, YOLOv8s is used. YOLOv8 adopts the CSPLayer_2Conv structure instead of the C3 structure of YOLOv5 in the backbone and Neck; this structure has richer gradient flow, and the number of channels is adjusted for different scales. In the Head part, YOLOv8 introduces a decoupled-head design that separates classification from bounding-box prediction. Unlike previous Anchor-Based methods, YOLOv8 adopts an Anchor-Free approach, removes the objectness (confidence) prediction branch from the head, and trains the regression branch with the new loss functions CIoU + DFL (Distribution Focal Loss);
Label distribution: YOLOv8 directly uses the TaskAlignedAssigner of TOOD. The assignment strategy selects positive samples according to a weighted combination of the classification and regression scores, with the formula:
t = s^γ × u^β
where s is the prediction score corresponding to the labelled category and u is the IoU between the prediction box and the ground-truth box; multiplying the two measures the degree of task alignment.
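As a small illustration of this alignment metric, a hedged sketch follows; the exponent values are placeholders rather than the settings used by the patent.

```python
# Sketch only: gamma and beta are illustrative exponents, not the patent's settings.
def task_alignment_metric(cls_score: float, iou: float, gamma: float = 1.0, beta: float = 6.0) -> float:
    """t = s^gamma * u^beta, where s is the predicted score for the labelled class
    and u is the IoU between the prediction box and the ground-truth box."""
    return (cls_score ** gamma) * (iou ** beta)

# Candidates with the largest t are chosen as positive samples for a ground-truth box.
```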
Category loss: YOLOv8s uses the same strategy as RetinaNet, calculating the probability of each class with an activation function and computing a global class loss, where the class label of a positive sample is its IoU value and that of a negative sample is 0. The loss function is the simple BCE (Binary Cross-Entropy) loss, defined as follows:
Loss = -(y·log p(x) + (1 - y)·log(1 - p(x))).
Position loss: YOLOv8 divides it into two parts. The first part, similar to the earlier YOLOv5, computes the IoU between the prediction box and the target box using the CIoU (Complete-IoU) loss function. The CIoU loss is as follows:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv
where b and b^gt denote the center points of the prediction box and the ground-truth box respectively, ρ denotes the Euclidean distance between the two center points, c denotes the diagonal length of the smallest enclosing box that covers both the prediction box and the ground-truth box, α is a weight function, and v measures the consistency of the aspect ratios.
The second part adopts the DFL loss function, which allows the network to rapidly focus on the position distribution close to the target position. The DFL loss function is as follows:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log S_i + (y - y_i)·log S_{i+1}).
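To make the two position-loss terms concrete, a minimal PyTorch sketch is given below. It follows the published CIoU and DFL definitions rather than the patent's exact implementation; the box layout (x1, y1, x2, y2) and the discretised-bin layout of the DFL input are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def ciou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2). Returns 1 - CIoU per box."""
    # IoU term
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # rho^2: squared distance between centre points; c^2: squared enclosing-box diagonal
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # v: aspect-ratio consistency term; alpha: its weight
    wp = pred[:, 2] - pred[:, 0]; hp = pred[:, 3] - pred[:, 1]
    wt = target[:, 2] - target[:, 0]; ht = target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)

def dfl_loss(pred_dist, target):
    """pred_dist: (N, n_bins) logits over discretised box-edge positions;
    target: (N,) continuous positions in [0, n_bins - 1].
    Implements DFL(S_i, S_{i+1}) = -((y_{i+1} - y) log S_i + (y - y_i) log S_{i+1})."""
    tl = target.floor().long()                       # left bin index y_i
    tr = (tl + 1).clamp(max=pred_dist.size(1) - 1)   # right bin index y_{i+1}
    wl = tl.float() + 1 - target                     # weight (y_{i+1} - y)
    wr = target - tl.float()                         # weight (y - y_i)
    return (F.cross_entropy(pred_dist, tl, reduction="none") * wl
            + F.cross_entropy(pred_dist, tr, reduction="none") * wr)
```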
ORB-SLAM2 adopts an indirect method for pose estimation and map construction, which means that the number of static feature points and the accuracy of dynamic-point removal directly affect pose estimation and mapping. In this embodiment, the depth values of objects in a highly dynamic environment are analysed. In this work a person is regarded as a dynamic object; if a person sits on a chair, the movement of the person may cause the chair to move, so a chair connected to a person is also regarded as a dynamic object. The tracking module extracts feature points from people and chairs, whose depth values are generally similar and small, whereas the depth values of feature points extracted from the background are typically larger and more loosely distributed.
This embodiment provides a Depth Value-RANSAC geometric algorithm, which aims to remove the dynamic feature points extracted from people and chairs while keeping, as far as possible, the static feature points extracted from the background. With a depth camera, as the measured distance increases the depth value of a feature point may become 0, and such feature points are removed directly. For the feature points inside the detection frame whose depth values are greater than 0, the geometric algorithm is used to separate static points from dynamic points. The steps of the Depth Value-RANSAC algorithm are as follows.
Algorithm 1: Depth Value-RANSAC
Input: the feature points Pi inside the semantic detection frame and the corresponding depth values di.
Output: the dynamic feature point set S.
Step 1: initialize the number of iterations (Iterations), the number of best inliers BestInliers and the dynamic feature point set S.
Step 2: loop for Iterations rounds:
Step 3: randomly select two initial feature points and compute their average depth value.
Step 4: for all other feature points, if the error between the depth value di of a feature point and the average depth value is smaller than the threshold (Thre), add the feature point Pi to S and increment the current inlier count Inliers.
Step 5: if the current inlier count is larger than the best inlier count, update BestInliers and the dynamic feature point set S.
Step 6: end the loop.
Step 7: output S.
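A minimal Python sketch of the Depth Value-RANSAC procedure described above is given below; the iteration count and the depth threshold are placeholders, and the input is assumed to be the list of feature points inside one detection frame together with their (non-zero) depth values.

```python
import random

def depth_value_ransac(points, depths, iterations=50, thre=0.2):
    """points: feature points inside a detection frame; depths: their depth values (> 0).
    Returns the dynamic feature point set S: the largest depth-consistent group,
    assumed to belong to the foreground (dynamic) object."""
    if len(points) < 2:
        return []
    best_inliers = 0
    dynamic_set = []                                   # dynamic feature point set S
    for _ in range(iterations):                        # Step 2: loop
        # Step 3: pick two feature points at random and take their mean depth.
        i, j = random.sample(range(len(points)), 2)
        mean_depth = (depths[i] + depths[j]) / 2.0
        # Step 4: collect every point whose depth error to the mean is below the threshold.
        candidate = [p for p, d in zip(points, depths) if abs(d - mean_depth) < thre]
        # Step 5: keep the largest consistent set found so far.
        if len(candidate) > best_inliers:
            best_inliers = len(candidate)
            dynamic_set = candidate
    return dynamic_set                                 # Step 7: output S
```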
Preferably, the backend optimization process includes:
step 1: first, the mapping module builds a pose graph (Pose Graph) in which the relative pose relations between keyframes are represented by edges. The goal of the optimization is to minimize the reprojection error, so that the observed positions of map points in the keyframes are as close as possible to their reprojected positions in other keyframes. The poses of the keyframes are adjusted by nonlinear least-squares optimization so that they better agree with the observations of the map points.
step 2: once the pose graph has been optimized, the mapping module optimizes the map points in the map. The goal is to adjust the three-dimensional positions of the map points so that their observed positions are as consistent as possible with their reprojected positions in the plurality of keyframes. Map point optimization typically uses a nonlinear least-squares method combined with the reprojection error.
step 3: after the loop detection module detects a loop, the back end triggers closed-loop optimization. Closed-loop optimization corrects the positions of the map points observed in the loop by optimizing the poses of the keyframes associated with the loop in the pose graph, which improves the consistency of the whole map. Closed-loop optimization also uses a nonlinear least-squares method.
Example 2
The present embodiment provides a SLAM system based on the YOLOv8s lightweight target detection network; the system framework, shown in fig. 6 and fig. 7, comprises a preprocessing module, a tracking module and an optimization module, wherein:
the preprocessing module preprocesses the detection image by using a YOLOv8s detection model, and outputs a detection frame and a detection category of the detection image to the tracking module;
the tracking module is used for eliminating dynamic characteristic points in the detection frame by utilizing the RANSAC algorithm and acquiring static characteristic points of the detection image;
the optimization module is used for inputting the feature points outside the detection frame and the static feature points inside the detection frame into the local map to perform back-end optimization processing.
Optionally, the preprocessing module is specifically configured to:
training a Yolov8s detection model by adopting a data set to obtain training weights;
performing label distribution by adopting a positive sample distribution strategy of a task alignment target detector;
calculating the probability of each category using an activation function and calculating a global category loss;
and calculating the position loss of the detection image, wherein the position loss comprises two parts: the loss between the prediction frame and the target frame is calculated with a CIoU loss function, and a DFL loss function is adopted so that the network rapidly focuses on the position distribution closest to the target frame.
The tracking module extracts feature points from the detected image and judges whether each feature point lies inside the detection frame transmitted by the preprocessing module. Feature points outside the detection frame are added directly to the local map and participate in local mapping and optimization. For feature points inside the detection frame, however, static points and dynamic points must be further separated geometrically: since dynamic feature points affect the accuracy of pose estimation, feature points belonging to dynamic objects need to be culled. The tracking module initializes the camera pose and an initial map of the scene through motion estimation between two image frames; once initialization is complete, the tracking module continuously tracks the camera motion. During tracking, if the camera loses tracking or tracking fails, the tracking module attempts relocalization. Relocalization matches feature points between the current frame and the keyframes in the map and estimates the camera pose with a PnP algorithm. If relocalization succeeds, the tracking module resumes tracking and continues estimating the camera's motion and pose; otherwise the system may need to be reinitialized.
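For the relocalization step described above, a hedged sketch of PnP-based pose recovery with OpenCV is shown below; the camera intrinsics, the matched 3D-2D correspondences and the RANSAC parameters are placeholders.

```python
import numpy as np
import cv2

def relocalize(matched_map_points, matched_keypoints, K):
    """matched_map_points: (N, 3) map points; matched_keypoints: (N, 2) pixel
    observations in the current frame. Returns (success, rvec, tvec)."""
    object_pts = np.asarray(matched_map_points, dtype=np.float64)
    image_pts = np.asarray(matched_keypoints, dtype=np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_pts, image_pts, K, None,
        iterationsCount=100, reprojectionError=3.0)
    return ok, rvec, tvec
```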
Optionally, the optimizing module specifically includes:
and the map building module selects frames suitable as key frames through a strategy, uses characteristic point matching between the current frame and the previous key frames, and calculates the positions of map points through a triangulation method. In order to improve the accuracy and consistency of the map, the map building module optimizes the relative pose between the key frames by using an optimization algorithm. This can be achieved by constructing a pose graph in which the constraints between key frames are derived by feature point matching and motion estimation between them. The mapping module periodically performs loop detection to identify previously accessed scenes. And loop detection judges whether loop occurs or not by calculating a similarity score by utilizing characteristic point matching between the current frame and the historical key frame. Once loop is detected, the map building module triggers closed loop optimization, and the consistency and accuracy of the map are improved by optimizing the whole pose map.
Example 3
In the embodiment, the beneficial effects of the method are verified by experimental comparison on a TUM data set.
Experiment platform: the experiments are carried out on a Linux operating system, with Visual Studio Code and PyCharm as the integrated development environments; the model framework is implemented in Python and C++. The main hardware configuration is as follows: Ubuntu 20.04 64-bit operating system, an Intel Core i5-10400K CPU at 2.9 GHz, a GeForce RTX 2080 Ti (22 GB) GPU, and 32 GB of RAM. The deep learning development environment is: Visual Studio Code, Python 3.8, CUDA 12.1 and PyTorch 1.12.0.
Data set: the TUM dataset, provided by the Technical University of Munich, is widely used in the fields of computer vision, robotics and three-dimensional reconstruction. The dataset contains several sequences captured in dynamic environments, divided into two categories, walking and sitting, and four motion types: halfsphere, xyz, rpy and static. In the image sequences the dynamic objects (people) walk and gesture, while static objects include tables, chairs, etc. Accurately estimating the pose and constructing a map in such dynamic environments is very challenging.
Experimental comparison: to test the improvement of YS-SLAM over the ORB-SLAM2 system, the invention compares the estimated camera trajectory with the ground truth to evaluate the performance of the SLAM system. The invention uses the APE (absolute pose error) and the RPE (relative pose error) as metrics; the resulting trajectories are evaluated with the evaluate_rpe.py and evaluate_ate.py scripts. The APE evaluates the error between the estimated camera pose and the true pose, while the RPE evaluates the accuracy of camera pose estimation between successive frames, i.e. the error of the SLAM system in estimating the relative motion between successive frames.
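As a sketch of how the RMSE of the absolute trajectory error can be computed once the estimated trajectory has been timestamp-associated and aligned with the ground truth (alignment that the TUM benchmark scripts perform internally), assuming (N, 3) position arrays:

```python
import numpy as np

def ate_rmse(estimated_xyz, groundtruth_xyz):
    """estimated_xyz, groundtruth_xyz: (N, 3) aligned, time-associated camera positions."""
    errors = np.linalg.norm(estimated_xyz - groundtruth_xyz, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))
```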
Table 1 shows the performance of the YS-SLAM system when the detection module and the geometric algorithm are added. The behaviour of the SLAM system is evaluated using the RMSE of the absolute pose error, with the original ORB-SLAM2 as the reference. YS-SLAM (Y) means that only the detection module is added and all feature points within the detection frame are removed. YS-SLAM (Y+G) means that the SLAM system employs both the detection module and the geometric algorithm. The results show that adding the detection module and the geometric algorithm together improves SLAM performance significantly compared with adding the detection module only; this is because adding only the detection module also removes most of the static feature points inside the detection frame, whereas introducing the geometric algorithm preserves those static feature points.
TABLE 1 YS-SLAM using different methods
Note: "√" indicates the modules used in each configuration
ORB-SLAM2 is currently widely recognized as one of the best SLAM systems. To evaluate the improvement of the YS-SLAM system of the invention in dynamic scenes, YS-SLAM is compared with ORB-SLAM2 on the sequences fr3_walking_xyz, fr3_walking_static, fr3_walking_rpy, fr3_walking_half and fr3_sitting_static. The first four sequences belong to highly dynamic environments, while the last one is a low-dynamic environment. The comparison results are shown in fig. 3 (a)-3 (j), fig. 4 (a)-4 (j) and fig. 5 (a)-5 (e).
The quantitative comparison results are shown in tables 2-4, which give the values for these five sequences in terms of RMSE (root mean square error) and S.D. (standard deviation). YS-SLAM improves the highly dynamic sequences by more than 90 percent and the low-dynamic sequence by more than 40 percent. The improvement rate is calculated as:
n = (β - α) / β × 100%
where β represents the measurement of ORB-SLAM2, α represents the measurement of the YS-SLAM system, and n represents the improvement rate of YS-SLAM over ORB-SLAM2.
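A tiny sketch of this improvement-rate calculation; the numbers in the usage comment are placeholders, not values from the tables.

```python
def improvement_rate(beta, alpha):
    """beta: metric of ORB-SLAM2; alpha: metric of YS-SLAM. Returns n in percent."""
    return (beta - alpha) / beta * 100.0

# e.g. improvement_rate(0.50, 0.02) == 96.0  (illustrative values only)
```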
TABLE 2 comparison of YS-SLAM and ORB-SLAM2 absolute track errors
TABLE 3 comparison of YS-SLAM and ORB-SLAM2 relative translational track errors
TABLE 4 error comparison of YS-SLAM and ORB-SLAM2 relative rotation trajectories
Example 4
The invention also compares the YS-SLAM system with DynaSLAM, DS-SLAM and YOLO-SLAM, using RMSE and S.D. as indicators of SLAM system performance; RMSE and S.D. better reflect the superiority of a SLAM system. From the comparison results in tables 5 to 7 it can be seen that the absolute trajectory errors of YS-SLAM on the four sequences fr3_walking_xyz, fr3_walking_rpy, fr3_walking_half and fr3_walking_static are slightly lower. This also indicates that the YS-SLAM system performs well.
TABLE 5 comparison of YS-SLAM with other SLAMs in absolute track error
TABLE 6 comparison of YS-SLAM with other SLAMs on relative translational track errors
TABLE 7 comparison of YS-SLAM with other SLAMs on relative rotation trajectory error
To verify the real-time performance of the YS-SLAM system, YS-SLAM was compared with ORB-SLAM2, DynaSLAM, DS-SLAM and YOLO-SLAM in terms of the average time required to process one frame; the results are shown in Table 8 below.
TABLE 8 time consuming costs
The specific experimental process is as follows:
experiment platform: the experiment is carried out on a linux operating system, visual Studio Code and Pycharm are selected as integrated development environments, and a model framework is realized based on Python language and C++. The main hardware configuration of the experiment is as follows: ubuntu 20.0464 bit operating system, processor (CPU) is Intel Core i5-10400K,2.9GHz, graphic card (GPU) model is GeForce RTX2080TI (22G), and memory (RAM) is 32G. The deep learning development environment is as follows: visual Studio Code, python3.8, CUDA12.1, pyTorrch1.12.0.
Experimental results: DynaSLAM and YOLO-SLAM are time-consuming, mainly because their preprocessing modules take a significant amount of time. In contrast, the present system adopts an end-to-end communication mode, and the average processing time of the detection module is only 0.032 seconds. While the SLAM system processes the current frame, the detection module already starts processing the next frame image; the two run in parallel, which greatly reduces the preprocessing time of the SLAM system. In an environment configured with an Intel(R) Core(TM) i5-10400 CPU and an NVIDIA GeForce RTX 2080 Ti, the YS-SLAM system can process 30 frames per second on average. Compared with the other SLAM systems, its real-time performance is good.
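The parallel, end-to-end communication between the detection module and the SLAM front end described above could be organised roughly as follows; this is a schematic Python sketch (the actual system couples the detector to the C++ ORB-SLAM2 tracking thread), and the queue-based structure is an assumption.

```python
import queue
import threading
from ultralytics import YOLO   # assumption: the same YOLOv8s detector as in the preprocessing sketch

frame_queue = queue.Queue(maxsize=2)    # raw frames waiting for detection
result_queue = queue.Queue(maxsize=2)   # detection boxes/classes handed to the tracker

def detection_worker():
    """Detects on the next frame while the tracking thread handles the current one."""
    detector = YOLO("yolov8s.pt")
    while True:
        frame_id, image = frame_queue.get()
        results = detector(image)
        result_queue.put((frame_id, results[0].boxes))

threading.Thread(target=detection_worker, daemon=True).start()
# The tracking thread pops (frame_id, boxes) from result_queue while the worker
# is already detecting on the following frame.
```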
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. The visual SLAM method based on the lightweight target detection network is characterized by comprising the following steps of:
inputting a detection image into a YOLOv8s detection model for preprocessing, and obtaining a detection frame and a detection category of the detection image;
removing dynamic feature points in the detection frame by using a RANSAC algorithm to obtain static feature points of the detection image;
and inputting the characteristic points outside the detection frame and the static characteristic points inside the detection frame into a local map, and performing back-end optimization processing to obtain an optimized pose graph.
2. The visual SLAM method based on the lightweight target detection network according to claim 1, wherein the inputting the detection image into the YOLOv8s detection model for preprocessing, obtaining the detection frame and the detection category of the detection image specifically comprises:
training the YOLOv8s detection model by adopting a data set to obtain training weights;
adopting a positive sample allocation strategy of a task alignment target detector to allocate labels to the pictures;
calculating the probability of each tag class using an activation function and calculating a global class loss;
and calculating the position loss of the detection image, wherein the position loss comprises two parts: the loss between a prediction frame and a target frame is calculated with a CIoU loss function, and a DFL loss function is adopted so that the network rapidly focuses on the position distribution closest to the target frame.
3. The visual SLAM method based on the lightweight target detection network according to claim 1, wherein removing the dynamic feature points in the detection frame with the RANSAC algorithm to obtain the static feature points of the detection image specifically comprises the following steps:
step 1: initializing the number of iterations (Iterations), the number of best inliers BestInliers and the dynamic feature point set S;
step 2: randomly selecting two initial feature points and calculating their average depth value;
step 3: for each remaining feature point, if the error between its depth value di and the average depth value is smaller than a threshold, adding the feature point Pi to the dynamic feature point set S and incrementing the current inlier count Inliers by one;
step 4: judging whether the current inlier count Inliers is larger than BestInliers; if so, updating BestInliers and the dynamic feature point set S;
step 5: repeating step 2 to step 4 for the set number of iterations;
step 6: after all iterations are completed, outputting the dynamic feature point set S.
4. The visual SLAM method based on the lightweight target detection network according to claim 1, wherein the feature points outside the picture detection frame and the static feature points inside the detection frame are input into a local map and back-end optimization processing is performed, specifically comprising:
constructing a pose graph according to the input feature points outside the picture detection frame and the static feature points inside the detection frame; the poses of the keyframes are adjusted with a nonlinear least-squares method so that the error between the observation position of a map point in a keyframe and its reprojection position in other keyframes is minimized;
optimizing the three-dimensional positions of the map points with a nonlinear least-squares method combined with the reprojection error, so that the error between the observed position of a map point and its reprojected positions in a plurality of keyframes is minimized;
and performing loop detection; after a loop is detected, triggering closed-loop optimization, which corrects the positions of the map points observed in the loop by optimizing the poses of the keyframes associated with the loop in the pose graph.
5. A visual SLAM system based on a lightweight target detection network, comprising a preprocessing module, a tracking module and an optimization module, wherein:
the preprocessing module preprocesses the detection image by using a YOLOv8s detection model, and outputs a detection frame and a detection category of the detection image to the tracking module;
the tracking module is used for eliminating dynamic characteristic points in the detection frame by utilizing a RANSAC algorithm and acquiring static characteristic points of the detection image;
the optimizing module is used for inputting the characteristic points outside the detection frame and the static characteristic points inside the detection frame into a local map to perform back-end optimizing processing.
6. The visual SLAM system of claim 5, wherein the preprocessing module is specifically configured to:
training the YOLOv8s detection model by adopting a data set to obtain training weights;
performing label distribution by adopting a positive sample distribution strategy of a task alignment target detector;
calculating the probability of each category using an activation function and calculating a global category loss;
and calculating the position loss of the detection image, wherein the position loss comprises two parts: the loss between a prediction frame and a target frame is calculated with a CIoU loss function, and a DFL loss function is adopted so that the network rapidly focuses on the position distribution closest to the target frame.
7. The visual SLAM system of claim 5, wherein the optimization module specifically comprises:
the pose graph construction unit is used for constructing a pose graph and adjusting the poses of the keyframes with a nonlinear least-squares method, so that the error between the observation position of a map point in a keyframe and its reprojection position in other keyframes is minimized;
a three-dimensional position optimizing unit for optimizing the three-dimensional positions of the map points with a nonlinear least-squares method, so that the error between the observed position of a map point and its reprojected positions in a plurality of keyframes is minimized;
the closed loop detection unit is used for carrying out loop detection, triggering closed loop optimization processing after loop detection, and correcting the position of a map point observed by the loop in the pose graph by optimizing the pose of a key frame associated with the loop in the pose graph.
CN202310887776.9A 2023-07-19 2023-07-19 Visual SLAM method and system based on lightweight target detection network Pending CN116740539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310887776.9A CN116740539A (en) 2023-07-19 2023-07-19 Visual SLAM method and system based on lightweight target detection network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310887776.9A CN116740539A (en) 2023-07-19 2023-07-19 Visual SLAM method and system based on lightweight target detection network

Publications (1)

Publication Number Publication Date
CN116740539A true CN116740539A (en) 2023-09-12

Family

ID=87916979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310887776.9A Pending CN116740539A (en) 2023-07-19 2023-07-19 Visual SLAM method and system based on lightweight target detection network

Country Status (1)

Country Link
CN (1) CN116740539A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197728A (en) * 2023-11-07 2023-12-08 成都千嘉科技股份有限公司 Method for identifying real-time gas diffusing operation through wearable camera equipment
CN117197728B (en) * 2023-11-07 2024-01-23 成都千嘉科技股份有限公司 Method for identifying real-time gas diffusing operation through wearable camera equipment

Similar Documents

Publication Publication Date Title
He et al. Bounding box regression with uncertainty for accurate object detection
Liu et al. Robust visual tracking using local sparse appearance model and k-selection
Choi et al. A general framework for tracking multiple people from a moving camera
US8320618B2 (en) Object tracker and object tracking method
Chen et al. A deep learning approach to drone monitoring
CN105760846B (en) Target detection and localization method and system based on depth data
CN110490158B (en) Robust face alignment method based on multistage model
US20130251246A1 (en) Method and a device for training a pose classifier and an object classifier, a method and a device for object detection
CN108229416B (en) Robot SLAM method based on semantic segmentation technology
KR20080073933A (en) Object tracking method and apparatus, and object pose information calculating method and apparatus
CN109919977A (en) A kind of video motion personage tracking and personal identification method based on temporal characteristics
CN110472542A (en) A kind of infrared image pedestrian detection method and detection system based on deep learning
CN111476827A (en) Target tracking method, system, electronic device and storage medium
CN110610210B (en) Multi-target detection method
CN107563323A (en) A kind of video human face characteristic point positioning method
CN109255289A (en) A kind of across aging face identification method generating model based on unified formula
Zhao et al. Adversarial deep tracking
Wu et al. Multivehicle object tracking in satellite video enhanced by slow features and motion features
CN116740539A (en) Visual SLAM method and system based on lightweight target detection network
CN110827320A (en) Target tracking method and device based on time sequence prediction
Hou et al. Robust human tracking based on DPM constrained multiple-kernel from a moving camera
Cordea et al. Real-time 2 (1/2)-D head pose recovery for model-based video-coding
CN115330833A (en) Fruit yield estimation method with improved multi-target tracking
CN114283355A (en) Multi-target endangered animal tracking method based on small sample learning
Elassal et al. Unsupervised crowd counting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination