CN114283198A - SLAM method for removing dynamic target based on RGBD sensor - Google Patents

SLAM method for removing dynamic target based on RGBD sensor

Info

Publication number
CN114283198A
CN114283198A CN202111637308.3A
Authority
CN
China
Prior art keywords
image
dynamic target
pixel
dynamic
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111637308.3A
Other languages
Chinese (zh)
Inventor
陆龙飞
林志赟
王博
韩志敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111637308.3A priority Critical patent/CN114283198A/en
Publication of CN114283198A publication Critical patent/CN114283198A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an SLAM method for removing a dynamic target based on an RGBD sensor, which comprises the following steps: detecting dynamic objects in the image using the object detection neural network YOLOv5; combining the image depth information, extracting all dynamic targets in the frame image through a pixel-level segmentation method to form a mask image; after processing the original image information with the mask image, obtaining a dynamic target image and a static background image respectively, and performing positioning and map building with the static background image through the SLAM system. Experimental results show that integrating deep learning with visual SLAM effectively removes the interference of dynamic targets on the SLAM system and improves the positioning and mapping accuracy of the visual SLAM system in a dynamic environment; meanwhile, the real-time performance of the system can be ensured without the support of a high-performance GPU, so the method can be effectively applied to actual scenes.

Description

SLAM method for removing dynamic target based on RGBD sensor
Technical Field
The invention relates to the technical field of computer vision and mobile robot positioning, in particular to an SLAM method for removing a dynamic target based on an RGBD sensor.
Background
In recent years, computer vision and robotics have become a hot research direction. The most basic research problem is the positioning of the robot itself, and simultaneous localization and mapping (SLAM) technology is widely applied to robot positioning, navigation and obstacle avoidance. Among the various sensors carried by robots, vision sensors have the advantages of low price and large information content and are widely used in SLAM technology, and RGBD sensors in particular can directly provide depth information and effectively reduce the amount of calculation. The traditional simultaneous localization and mapping technology concentrates on ideal static scenes without moving targets, while dynamic targets in actual scenes cause deviations in the camera pose calculation, so that the positioning of the whole visual SLAM system becomes misaligned. The purely static assumption is not applicable in a real environment, and SLAM suitable for dynamic environments is a research hotspot of the technology.
With the development of deep learning, deep learning methods are widely applied to target detection and semantic segmentation. However, most deep learning networks need a powerful graphics processing unit (GPU) for acceleration to achieve real-time detection. At present, much related work is based on semantic segmentation networks, but semantic segmentation requires a larger amount of calculation than target detection and has difficulty meeting real-time requirements. YOLOv5 is a recent target detection network with high precision and good real-time performance. After the target is detected, pixel-level segmentation is carried out to achieve the effect of removing the dynamic target.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an SLAM method for removing a dynamic target based on an RGBD sensor. The invention aims to realize an SLAM system that can run in real time on a mobile platform while removing dynamic target interference, eliminating the interference of preset dynamic target objects in the scene on the positioning and mapping of the mobile robot and ensuring the stability of the SLAM system.
The purpose of the invention is realized by the following technical scheme: an SLAM method for removing dynamic targets based on RGBD sensors, the method comprising the steps of:
step 1, obtaining a picture needing to be preset to remove a dynamic target category, and constructing a training sample set and a test data set;
step 2, training and testing the target detection neural network YOLOv5 by using the training sample set and the test data set constructed in the step 1 to obtain a trained YOLOv5 model;
step 3, processing the RGB image acquired by the RGBD sensor by using the target detection neural network YOLOv5 model trained in the step 2, and judging whether a dynamic target exists in the image or not and the pixel coordinate position of a dynamic target frame;
step 4, combining the pixel coordinate position of the dynamic target frame obtained in the step 3 and the depth image information of the RGBD sensor, judging whether a dynamic target exists in the current frame of the depth image of the RGBD sensor, if so, processing the same pixel region marked out in the dynamic target frame in the depth image corresponding to the current frame by using a pixel level segmentation method to obtain a mask image, and if not, directly setting the mask image to be empty;
the pixel level segmentation method specifically comprises the following steps:
step 4.1, for all dynamic target detection frames of each frame of RGB image, uniformly sampling the depth values of N pixel points in the dynamic target detection frame region corresponding to the pixel positions in the depth image of that frame, and storing the depth values in a container X_i, where i ranges from 1 to N;
step 4.2, dividing the obtained depth values of the N pixel points into two sets consisting of the depth values by using an absolute median deviation outlier algorithm, and respectively obtaining average pixel depth values of the two sets by using weighted summation on the two sets, wherein the average pixel depth value which is relatively small is set as a dynamic target class depth value, and the average pixel depth value which is relatively large is set as a static background class depth value; in a dynamic target frame region output by the target detection neural network YOLOv5 model, the position of a dynamic target is closer to the optical center of the camera than the static background;
4.3, aiming at the depth values of the dynamic target class and the static background class, clustering all pixels in the dynamic target detection frame by using a distance-based clustering algorithm, dividing all pixels in the dynamic target frame into the dynamic target class or the static background class, and obtaining a set of pixel points of the dynamic target class or the static background class;
step 4.4, extracting dynamic target pixel positions of all target detection frames in the image, projecting the obtained dynamic target pixel positions on the image with the same size as the original input image and outputting the dynamic target pixel positions in a binarization mode, and carrying out OR operation on pixel positions corresponding to the binarization image generated by all the target detection frames in the image to enable a plurality of images to be superposed and fused into a mask image, wherein the mask image is also output in the binarization mode;
and 5, processing the gray level image in the SLAM by using the mask image, specifically, performing an AND operation between the mask image output in binarized form and the gray level image in the SLAM, so that the pixel value at the positions of the dynamic target class is 0 while the other parts of the gray level image are not changed; the part of the gray level image with pixel value 0 is the dynamic target information in the image, the unchanged part is the static background information, and the static background information is used for positioning and map construction of the mobile robot.
Further, in step 1, the test data set is labeled and then adjusted to the VOC data set format for use in subsequent steps.
Further, in the step 2, the simplest YOLOv5s network of the target detection neural network YOLOv5 is used for training; compared with other networks of the same type, it has fewer parameters and a faster operation speed.
Further, in step 3, the pixel coordinate position includes a target frame center point coordinate and a target frame width and height, and normalization processing is performed according to the size of the input image.
Further, in step 4.2, the absolute median deviation outlier algorithm can detect data whose values differ markedly from the other values, where the threshold for this difference is determined by a parameter n of the algorithm; the average depth information of the dynamic target is extracted with the algorithm, whose criterion is:
|X_i − X_median| > n · MAD, where MAD = median(|X_i − X_median|)
The basic steps of the algorithm can be summarized as the following sub-steps:
step 4.2.1, calculate the median X_median of all elements in X_i;
step 4.2.2, calculate the absolute deviation of each element from the median, and then take the median of these absolute deviations as MAD;
step 4.2.3, determine the parameter n and adjust it according to the algorithm formula; separate the outlier data and store it in a container outlier, store the other data in a container normal, calculate the average depth value of all pixels in the container outlier and the average depth value of all pixels in the container normal, and compare the two: the relatively small average pixel depth value is set as the dynamic target class depth value, and the relatively large average pixel depth value is set as the static background class depth value.
Further, in the step 4.3, the distance-based clustering algorithm includes the following steps:
step 4.3.1, firstly setting the central points of the clusters, wherein the depth values of the two central points are set in the step, namely the depth value of the dynamic target class and the depth value of the static background class obtained in the step 4.2;
4.3.2, traversing all pixel points in the dynamic target detection frame, respectively calculating the Euclidean distance from the depth value of each pixel point to the depth values of two set clustering center points, comparing the two distance values, and dividing all the pixel points in the dynamic target detection frame to the clustering center with a short distance;
step 4.3.3, reselecting the depth value of the clustering center, calculating the average pixel of the two types of pixel point sets divided in the step 4.3.2, using the average pixel as the depth value of the clustering center set in the step 4.3.1, and performing the iterative clustering of the next round again;
step 4.3.4, looping step 4.3.1 to step 4.3.3, ending the loop until at least one of the following conditions is met, and outputting the pixel position information belonging to the dynamic target class or the static background class in the dynamic target detection frame;
the condition 1 is that the difference between the average pixel of the two types and the clustering center is less than a set value a;
condition 2, the difference between the numbers of pixels belonging to the two classes in the target frame is greater than a set value b;
the parameters a and b are adjusted according to the condition of adapting to the actual application scene.
Further, in step 5, the algorithm flow of the SLAM system specifically includes the following steps:
step 5.1, ORB feature extraction is carried out on the static image provided in the step 4, and when the number of extracted feature points exceeds a set value, an SLAM system is initialized;
step 5.2, the extracted ORB features are combined with the static background information of the previous frame to estimate the motion posture of the camera, the RGBD camera provides depth information, and a PnP method is used for solving the pose of the camera;
step 5.3, minimizing the reprojection error by using a beam adjustment method (BA) and optimizing a local map;
and 5.4, optimizing the pose by using loop detection and correcting drift errors.
Further, said step 5.4 is divided into two parts, closed-loop detection and closed-loop correction; the closed-loop detection first uses the BoW bag-of-words model for detection, and then calculates the similarity transformation through the Sim3 algorithm; the closed-loop correction mainly performs loop fusion and Essential Graph optimization to adjust and correct the accumulated error.
The invention has the beneficial effects that: aiming at the fact that traditional SLAM cannot overcome dynamic target interference, the SLAM method for removing dynamic targets based on the RGBD sensor effectively overcomes the instability and low precision of traditional SLAM methods under dynamic target interference and effectively improves the precision under such interference. Meanwhile, a deep learning target detection framework is introduced, so the method has universality, and a real-time effect can be achieved with the support of only an ordinary graphics processor. The pixel-level segmentation method of the invention has a small calculation amount and a good segmentation effect. Unlike mainstream methods that process dynamic targets entirely with deep learning, the method has the advantages of small calculation amount, good real-time performance and good dynamic target removal, and is favorable for deployment and application on the mobile robot side. The method can effectively improve the accuracy and robustness of tracking with the RGBD sensor and improve the positioning and mapping precision of visual SLAM in dynamic scenes.
Drawings
FIG. 1 is a schematic structural diagram of an SLAM method for removing a dynamic target based on an RGBD sensor according to the present invention;
FIG. 2 is a flow chart of a pixel level segmentation method according to the present invention;
FIG. 3 is a diagram illustrating the processing result of YOLOv5 under a certain scenario in an embodiment of the present invention;
FIG. 4 is a diagram illustrating an effect of uniform point fetching in a certain scenario according to an embodiment of the present invention;
FIG. 5 is a graph of a pixel level segmentation result in a certain scenario in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a mask image in a scene in an embodiment of the present invention;
FIG. 7 is a graph of error comparison of a motion trajectory in the xyz direction in an embodiment of the present invention;
FIG. 8 is a graph showing the comparison of the errors of the motion trajectory at rpy degrees in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
Due to the static assumption of the traditional synchronous positioning and mapping technology, the visual SLAM technology is difficult to be widely applied in a real scene, and can be more effectively deployed in a real environment after the interference of a dynamic target is removed.
The purpose of the invention is realized by the following technical scheme: an RGBD sensor based SLAM method for removing dynamic targets is characterized in that RGB image information and depth image information are provided by an RGBD sensor of a mobile robot, and dynamic target detection is performed on RGB images by using a target detection network to obtain the positions of dynamic targets in pixel coordinates. And performing pixel-level segmentation on each dynamic target area by using the depth image and the coordinate position of the dynamic target to completely separate the dynamic target from the static background. The positioning and mapping of the mobile robot is performed using a static background. The method removes dynamic target interference with a small calculation amount, can be applied to the mobile robot in real time, and is beneficial to stable operation of the SLAM system. As shown in fig. 1, the method comprises the steps of:
step 1, obtaining a picture of a dynamic target which is preset to be removed according to an actual application scene, and constructing a training sample set and a test data set. In this embodiment, the preset dynamic target is set as a person, and the training sample set and the test data set of the person are marked with the pixel positions of the dynamic target of the set class in the corresponding image. The data set was created by collecting a total of 17125 relevant pictures containing the set target. Labeling of the data sets was done using a Labelimg tool, and the labeled test data sets were formatted into a VOC data set for use in subsequent steps.
And 2, training and testing the target detection neural network YOLOv5 model by using the constructed training sample set and test data set to obtain a trained YOLOv5 model. The invention uses the simplest YOLOv5s model in YOLOv5 for training; this network has the advantages of being small and fast. The ratio of the training sample set to the test sample set is 10:1; the YOLOv5 network itself is not modified in the invention, and the open-source network framework is used. Testing the training result, the detection accuracy reaches 85.1%; there is still room to improve this index within the network framework, and the method can be adapted to a specified application environment by adjusting the data set and the parameters of the network framework. Accurately detecting the dynamic target is of great significance for the subsequent steps of the method.
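For illustration only, the following Python sketch shows how a model trained as above could be loaded and run on one RGB frame, assuming the open-source Ultralytics YOLOv5 framework referred to in the text; the weight path 'best.pt', the image path and the variable names are assumptions, not part of the invention:

```python
import cv2
import torch

# Load custom-trained YOLOv5s weights through the Ultralytics hub entry point
# (weight path 'best.pt' is an assumed placeholder).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')

# Read one RGB frame from the RGBD sensor stream (here a saved image).
rgb_frame = cv2.cvtColor(cv2.imread('frame_000001.png'), cv2.COLOR_BGR2RGB)

results = model(rgb_frame)
# Normalized (cx, cy, w, h, confidence, class) rows, matching the normalized
# box description used in step 3 of the method.
print(results.xywhn[0])
```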
And 3, processing the RGB image acquired by the RGBD sensor by using the trained YOLOv5 model, and judging whether a dynamic target exists in the image and the pixel coordinate position of the dynamic target frame. The dynamic target box information output by this module can be represented as:

(o_j, x_i^j, y_i^j, w_i^j, h_i^j)

wherein o_j refers to a preset class-j dynamic object in the frame image, x_i^j refers to the x coordinate of the center point of the i-th class-j target frame in the frame image, y_i^j refers to the y coordinate of the center point of the i-th class-j target frame, and w_i^j and h_i^j refer to the width and height of the target frame centered at (x_i^j, y_i^j). The four parameters, i.e. the center point coordinates and the width and height, are normalized according to the size of the input image, so that after scaling they can be applied to images of different sizes; the effect of the YOLO module is shown in fig. 3.
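As an illustration of how the normalized detection output can be mapped back to pixel coordinates, a minimal Python sketch follows; the function name `to_pixel_box` and the example values are hypothetical, not part of the invention:

```python
def to_pixel_box(cx, cy, w, h, img_w, img_h):
    """Convert a YOLO-style normalized box (cx, cy, w, h in [0, 1])
    into integer pixel bounds (x_min, y_min, x_max, y_max)."""
    x_min = int((cx - w / 2.0) * img_w)
    y_min = int((cy - h / 2.0) * img_h)
    x_max = int((cx + w / 2.0) * img_w)
    y_max = int((cy + h / 2.0) * img_h)
    # Clamp to the image so the box can be used to index the depth image.
    x_min, y_min = max(x_min, 0), max(y_min, 0)
    x_max, y_max = min(x_max, img_w - 1), min(y_max, img_h - 1)
    return x_min, y_min, x_max, y_max

# Example: a person detected near the center of a 640x480 frame.
print(to_pixel_box(0.5, 0.5, 0.25, 0.8, 640, 480))  # (240, 48, 400, 432)
```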
And 4, combining the pixel coordinate position of the dynamic target frame obtained in the step 3 with the depth image of the RGBD sensor, judging whether a dynamic target exists in the current frame, if so, processing the same pixel region marked out in the dynamic target frame in the depth image corresponding to the current frame by using a pixel level segmentation method to obtain a mask image, and if not, directly setting the mask image to be empty.
Further, as shown in fig. 2, in step 4, the pixel level segmentation method specifically includes the following sub-steps:
Step 4.1, for all dynamic target detection frames of each frame of RGB image, the depth values of N pixel points are uniformly sampled in the dynamic target detection frame area corresponding to the pixel positions in the depth image of that frame, and the depth values are stored in a container X_i, where i takes values from 1 to N. When N is 9, the effect is shown in fig. 4.
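A minimal sketch of this uniform sampling step, assuming the depth image is a NumPy array aligned with the RGB image and the detection box is given in pixel coordinates; the square-grid layout with N = 9 is an assumption consistent with fig. 4, and the function name is illustrative:

```python
import numpy as np

def sample_depths(depth, box, n=9):
    """Uniformly sample n depth values inside a detection box.

    depth : 2-D array of depth values aligned with the RGB frame.
    box   : (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    x_min, y_min, x_max, y_max = box
    side = int(round(np.sqrt(n)))                       # e.g. 3x3 grid for n = 9
    xs = np.linspace(x_min, x_max, side + 2, dtype=int)[1:-1]
    ys = np.linspace(y_min, y_max, side + 2, dtype=int)[1:-1]
    samples = [depth[y, x] for y in ys for x in xs]
    # Discard invalid (zero) readings that RGBD sensors commonly return.
    return [d for d in samples if d > 0]
```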
And 4.2, the obtained depth values of the N pixel points are divided into two sets by an absolute median deviation outlier algorithm, the average pixel depth value of each set is obtained by weighted summation, and the two average pixel depth values are compared. The relatively small average pixel depth value is set as the dynamic target class depth value, and the relatively large average pixel depth value is set as the static background class depth value. When the camera acquires an image in an indoor environment, the distances of the dynamic target and of the background from the camera optical center are clearly distinguished in the depth image, where the depth value of the background is far greater than that of the target class; thus, in the dynamic target frame region output by the target detection neural network in step 3, the dynamic target is closer to the camera optical center than the static background.
The absolute median deviation outlier algorithm can detect one or more values in the data that differ greatly from the other values and reject them; what counts as a large difference is obtained by experiments in the specific application scene, is defined by the parameter n of the algorithm, and is set to n = 1.1 in the experiments of the invention. As shown in fig. 3, since most of the pixels in the target detection frame belong to the dynamic target object, their depth values are very close to each other, while the difference between the background depth and the detected target is large, so the average depth information of the dynamic target can be extracted with the algorithm. The criterion is:

|X_i − X_median| > n · MAD, where MAD = median(|X_i − X_median|)
the basic steps of the algorithm can be summarized as the following sub-steps:
Step 4.2.1, calculate the median X_median of all elements in X_i.
And 4.2.2, calculate the absolute deviation of each element from the median, and then take the median of these absolute deviations as MAD.
And 4.2.3, determining the parameter n, and adjusting the data according to the MAD formula.
In the invention, the parameter n is adjusted to 1.1 according to the depth value distribution of the target scene; the outlier data is separated and stored in the container outlier, the other data is stored in the container normal, and the outlier data itself is not further adjusted. Then the average depth value of all pixels in the container outlier and the average depth value of all pixels in the container normal are calculated and compared: the relatively small average pixel depth value is set as the dynamic target class depth value, and the relatively large average pixel depth value is set as the static background class depth value.
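A sketch of the absolute-median-deviation separation described in steps 4.2.1 to 4.2.3, under the assumption that the outlier criterion is |X_i − X_median| > n · MAD; the function and variable names are illustrative:

```python
import numpy as np

def split_by_mad(depths, n=1.1):
    """Split sampled depth values into 'normal' and 'outlier' sets using
    the absolute median deviation, then return the smaller and larger
    mean depths as (dynamic_depth, background_depth)."""
    x = np.asarray(depths, dtype=float)
    x_median = np.median(x)
    mad = np.median(np.abs(x - x_median))
    if mad == 0:                                  # all samples identical: no outliers
        return x.mean(), x.mean()
    outlier = x[np.abs(x - x_median) > n * mad]
    normal = x[np.abs(x - x_median) <= n * mad]
    means = [normal.mean()] + ([outlier.mean()] if outlier.size else [])
    # The dynamic target is closer to the optical center than the background,
    # so the smaller mean depth is taken as the dynamic-target class depth.
    return min(means), max(means)
```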
And 4.3, aiming at the depth values of the dynamic target class and the static background class, clustering all pixels in the dynamic target detection frame by using a distance-based clustering algorithm, and dividing all pixels in the dynamic target frame into the dynamic target class or the static background class, wherein the dynamic target class or the static background class is a set of pixel points.
The distance-based clustering algorithm specifically comprises the following substeps:
and 4.3.1, firstly setting the central points of the clusters, wherein the depth values of the two central points are set to be the dynamic target class depth value and the static background class depth value obtained in the step 4.2.
And 4.3.2, traversing all the pixel points in the dynamic target detection frame, respectively calculating the Euclidean distance from the depth value of each pixel point to the depth values of the two set clustering center points, comparing the two distance values, and dividing all the pixel points in the dynamic target detection frame to the clustering center with a short distance.
And 4.3.3, reselecting the depth value of the clustering center, calculating the average pixel of the two types of pixel point sets divided in the step 4.3.2, using the average pixel as the depth value of the clustering center set in the step 4.3.1, and performing the next iteration clustering again.
And 4.3.4, looping the steps 4.3.1 to 4.3.3 until at least one of the following conditions is met, ending the loop, and outputting the pixel position information belonging to the dynamic target class or the static background class in the dynamic target detection frame.
And in the condition 1, the difference between the average pixel of the two types and the clustering center is smaller than a set value a, and a is set to be 1 in the invention.
Condition 2, the difference between the numbers of pixels belonging to the two classes in the target frame is larger than a set value b, and b is set to 1.7 in the present invention.
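The following Python sketch illustrates the distance-based clustering of steps 4.3.1 to 4.3.4 above; the interpretation of condition 2 as a ratio of the two class sizes is an assumption, and all names and defaults are illustrative rather than the claimed implementation:

```python
import numpy as np

def cluster_box_pixels(depth, box, dyn_depth, bg_depth, a=1.0, b=1.7, max_iter=20):
    """Assign every pixel inside a detection box to the dynamic-target or
    static-background class by iterative two-center clustering on depth."""
    x_min, y_min, x_max, y_max = box
    patch = depth[y_min:y_max + 1, x_min:x_max + 1].astype(float)
    centers = np.array([dyn_depth, bg_depth], dtype=float)
    for _ in range(max_iter):
        # Assign each pixel to the nearer center (1-D Euclidean distance on depth).
        labels = np.abs(patch[..., None] - centers).argmin(axis=-1)
        new_centers = np.array([
            patch[labels == k].mean() if np.any(labels == k) else centers[k]
            for k in range(2)
        ])
        counts = [np.sum(labels == k) for k in range(2)]
        # Condition 1: both centers have essentially stopped moving.
        if np.all(np.abs(new_centers - centers) < a):
            break
        # Condition 2 (read here as a class-size ratio, an assumption):
        # one class dominates the detection box.
        if min(counts) > 0 and max(counts) / min(counts) > b:
            break
        centers = new_centers
    return labels == 0   # boolean mask of dynamic-target pixels within the box
```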
Step 4.4, the dynamic target pixel positions of all target detection frames in the frame image are extracted, the obtained dynamic target pixel positions are projected onto an image of the same size as the original input image and output in binarized form, and an OR operation is performed on the corresponding pixel positions of the binarized images generated by all the target detection frames, so that the multiple images are superposed and fused into one mask image output in binarized form; the mask image is shown in fig. 6.
fig. 5 shows the effect of the dynamic object detection box of fig. 4 after being processed by the pixel level segmentation method.
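A sketch of the mask fusion in step 4.4, assuming OpenCV-style binary masks (255 at dynamic pixels) and per-box boolean results such as those returned by the clustering sketch above; all names are illustrative:

```python
import numpy as np

def build_mask(image_shape, boxes, box_masks):
    """Fuse per-box dynamic-pixel masks into one binarized mask image.

    image_shape : (height, width) of the original input image.
    boxes       : list of (x_min, y_min, x_max, y_max) detection boxes.
    box_masks   : list of boolean arrays marking dynamic pixels per box.
    """
    mask = np.zeros(image_shape, dtype=np.uint8)
    for (x_min, y_min, x_max, y_max), m in zip(boxes, box_masks):
        region = mask[y_min:y_max + 1, x_min:x_max + 1]
        # OR operation: a pixel is dynamic if any box marks it as dynamic.
        region[m] = 255
    return mask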
And 5, the mask image is used to process the gray level image in the SLAM system: specifically, an AND operation is performed between the mask image output in binarized form and the gray level image in the SLAM, so that the pixel value at the positions of the dynamic target class becomes 0 while the other parts of the gray level image are not changed; the part of the gray level image with pixel value 0 is the dynamic target information in the image, the unchanged part is the static background information, and the static background information is used for positioning and map construction of the mobile robot. In the invention, combined with the ORB-SLAM2 system, the algorithm flow of the SLAM system can be summarized as the following sub-steps:
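A minimal sketch of this masking step, assuming the fused mask above (255 at dynamic pixels) is inverted before the AND so that dynamic positions become 0 as described; the cv2 calls are standard OpenCV and the function name is illustrative:

```python
import cv2

def mask_gray_image(gray, dynamic_mask):
    """Zero out dynamic-target pixels in the SLAM gray image.

    gray         : single-channel gray image used by the SLAM front end.
    dynamic_mask : uint8 mask with 255 at dynamic-target pixels.
    """
    static_mask = cv2.bitwise_not(dynamic_mask)            # 255 where static
    static_gray = cv2.bitwise_and(gray, gray, mask=static_mask)
    return static_gray   # fed to ORB feature extraction instead of the raw image
```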
and 5.1, performing ORB feature extraction on the static image provided in the step 4, and initializing the SLAM system when the number of the extracted feature points exceeds a set value.
And 5.2, estimating the motion attitude of the camera by using the extracted ORB features combined with the static background information of the previous frame. The RGBD camera provides depth information, so the PnP method can be used directly to solve for the camera pose.
PnP is a method of estimating camera motion by projecting matched points from three-dimensional space onto the image plane and computing the error with respect to the observed data; this error is also called the reprojection error. The analytic PnP method can estimate the relative motion with only a small number of matched pairs, after which the camera motion is refined by nonlinear optimization, effectively reducing the amount of calculation.
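For illustration, a hedged sketch of solving the camera pose with OpenCV's RANSAC PnP, assuming 3-D map points (e.g. back-projected from the previous frame's depth) already matched to 2-D ORB keypoints in the current frame; the intrinsic matrix K and the matching step are outside this snippet, and the names are assumptions:

```python
import cv2
import numpy as np

def solve_pose_pnp(points_3d, points_2d, K):
    """Estimate camera pose from 3D-2D matches with RANSAC PnP.

    points_3d : (N, 3) array of map points expressed in the reference frame.
    points_2d : (N, 2) array of matched pixel coordinates in the current frame.
    K         : 3x3 camera intrinsic matrix.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(points_2d, dtype=np.float64),
        np.asarray(K, dtype=np.float64),
        None)                          # None: depth camera assumed undistorted
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)         # rotation vector -> rotation matrix
    return R, tvec, inliers            # pose to be refined by local BA (step 5.3)
```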
And 5.3, minimizing the reprojection error by a nonlinear optimization method, namely by using a beam adjustment method (BA), and optimizing the local map.
And 5.4, the pose is optimized and drift errors are corrected using loop detection. The process is divided into two parts, closed-loop detection and closed-loop correction. Closed-loop detection first uses the BoW bag-of-words model for detection, and then the similarity transformation is calculated by the Sim3 algorithm. Closed-loop correction mainly performs loop fusion and Essential Graph optimization to adjust and correct the errors.
The method has the key points that a new YOLOv5 algorithm is used for detecting a preset dynamic object in a scene image, and then a pixel-level segmentation method is used for extracting a dynamic target mask image; feature point matching is performed on the static background and then applied to the ORB-SLAM2 system.
The present embodiment evaluates the performance of the algorithm on the public TUM data set. Five image sequences of the TUM dataset are used, including fr1_xyz, fr3_sitting_static, fr3_walking_halfsphere, fr3_walking_rpy and fr3_walking_xyz. Here fr1 or fr3 denote the different configuration files required to run an image sequence; the xyz suffix indicates that the camera moves along the three x-y-z axes, the rpy suffix indicates that the camera rotates about the three r-p-y orientation angles, the halfsphere suffix indicates that the camera adds an arc motion in space to the xyz and rpy motions, and the static suffix indicates that the camera remains approximately stationary. The sitting suffix denotes a low-motion image sequence and the walking suffix denotes a high-motion image sequence.
The comparison of the algorithm of the present application with other algorithms is shown in table 1.
TABLE 1 positioning accuracy comparison table of the present invention
As the data in Table 1 show, for high-dynamic image sequences the positioning accuracy of the invention is greatly improved compared with the original ORB-SLAM2 system. Meanwhile, in low-dynamic and purely static image sequences, the positioning accuracy of the invention matches that of ORB-SLAM2.
Fig. 7 and fig. 8 show the effect achieved by the present invention. In the figures, groundtruth is the real trajectory, CameraTrajectory_ORB is the ORB-SLAM2 system trajectory, and CameraTrajectory_my is the trajectory of the present invention; fig. 7 shows the errors of the three trajectories along the three x-y-z axes, and fig. 8 shows the errors in the three r-p-y angle directions. It can be seen from the figures that the original system has inaccurate positioning and a large drift error in the dynamic environment, with tracking failure occurring between 63 and 68 seconds.
In terms of running time cost, the method uses the simplest YOLOv5s model in YOLOv5 for training; the network is small and fast, and a real-time effect can be achieved with only an ordinary GPU accelerating YOLOv5s. With a high-performance GPU, the detection time of the YOLOv5s model is only 2.5 ms per frame, i.e. a detection speed of 400 frames per second.
As shown in Table 2, the average running time of the ORB-SLAM2 system on CPU i7-7700 is about 30 milliseconds per frame, i.e., the running speed is about 33 frames per second. Although the invention adds the pixel level segmentation method to process the image, the method has extremely small time cost, and the time of the method is basically consistent with that of the original system. However, the original ORB-SLAM2 is not robust in a dynamic environment, but the invention can remove the influence of a dynamic target on a SLAM system in the dynamic environment, not only has higher precision in the dynamic environment, but also can run on a Central Processing Unit (CPU) in real time.
TABLE 2 comparison of the time spent in the present invention
The foregoing is only a preferred embodiment of the present invention, and although the present invention has been disclosed in the preferred embodiments, it is not intended to limit the present invention. Those skilled in the art can make numerous possible variations and modifications to the present teachings, or modify equivalent embodiments to equivalent variations, without departing from the scope of the present teachings, using the methods and techniques disclosed above. Therefore, any simple modification, equivalent change and modification made to the above embodiments according to the technical essence of the present invention are still within the scope of the protection of the technical solution of the present invention, unless the contents of the technical solution of the present invention are departed.

Claims (8)

1. An SLAM method for removing a dynamic target based on an RGBD sensor, the method comprising the steps of:
step 1, obtaining a picture needing to be preset to remove a dynamic target category, and constructing a training sample set and a test data set;
step 2, training and testing the target detection neural network YOLOv5 by using the training sample set and the test data set constructed in the step 1 to obtain a trained YOLOv5 model;
step 3, processing the RGB image acquired by the RGBD sensor by using the target detection neural network YOLOv5 model trained in the step 2, and judging whether a dynamic target exists in the image or not and the pixel coordinate position of a dynamic target frame;
step 4, combining the pixel coordinate position of the dynamic target frame obtained in the step 3 and the depth image information of the RGBD sensor, judging whether a dynamic target exists in the current frame of the depth image of the RGBD sensor, if so, processing the same pixel region marked out in the dynamic target frame in the depth image corresponding to the current frame by using a pixel level segmentation method to obtain a mask image, and if not, directly setting the mask image to be empty;
the pixel level segmentation method specifically comprises the following steps:
step 4.1, for all dynamic target detection frames of each frame of RGB image, the depth values of N pixel points are uniformly sampled in the dynamic target detection frame area corresponding to the pixel positions in the depth image of that frame, and the depth values are stored in a container X_i, where i takes values from 1 to N;
step 4.2, dividing the obtained depth values of the N pixel points into two sets consisting of the depth values by using an absolute median deviation outlier algorithm, and respectively obtaining average pixel depth values of the two sets by using weighted summation on the two sets, wherein the average pixel depth value which is relatively small is set as a dynamic target class depth value, and the average pixel depth value which is relatively large is set as a static background class depth value; in a dynamic target frame region output by the target detection neural network YOLOv5 model, the position of a dynamic target is closer to the optical center of the camera than the static background;
4.3, aiming at the depth values of the dynamic target class and the static background class, clustering all pixels in the dynamic target detection frame by using a distance-based clustering algorithm, dividing all pixels in the dynamic target frame into the dynamic target class or the static background class, and obtaining a set of pixel points of the dynamic target class or the static background class;
step 4.4, extracting dynamic target pixel positions of all target detection frames in the image, projecting the obtained dynamic target pixel positions on the image with the same size as the original input image and outputting the dynamic target pixel positions in a binarization mode, and carrying out OR operation on pixel positions corresponding to the binarization image generated by all the target detection frames in the image to enable a plurality of images to be superposed and fused into a mask image, wherein the mask image is also output in the binarization mode;
and 5, processing the gray level image in the SLAM by using the mask image, specifically, performing an AND operation between the mask image output in binarized form and the gray level image in the SLAM, so that the pixel value at the positions of the dynamic target class is 0 while the other parts of the gray level image are not changed; the part of the gray level image with pixel value 0 is the dynamic target information in the image, the unchanged part is the static background information, and the static background information is used for positioning and map construction of the mobile robot.
2. The SLAM method for removing dynamic targets based on RGBD sensors of claim 1, wherein in step 1, the test data set is labeled and adjusted to VOC data set format for use in subsequent steps.
3. The SLAM method for removing dynamic targets based on RGBD sensors as claimed in claim 1, wherein in step 2, the simplest YOLOv5s network of the target detection neural network YOLOv5 is used for training, and the target detection neural network has smaller parameters and faster operation speed than other similar type networks.
4. The SLAM method for removing the dynamic target based on the RGBD sensor, as claimed in claim 1, wherein in the step 3, the pixel coordinate position comprises a target frame center point coordinate and a target frame width height, and both are normalized according to the size of the input image.
5. The SLAM method for removing the dynamic target based on the RGBD sensor as claimed in claim 1, wherein in step 4.2, the absolute median deviation outlier algorithm can detect the data with one or more values having a larger difference compared with other values in the data and reject the data, wherein the difference is determined according to the parameter n in the absolute median deviation outlier algorithm, and the average depth information of the dynamic target is extracted by using the algorithm, which has the formula:
|X_i − X_median| > n · MAD, where MAD = median(|X_i − X_median|)
the basic steps of the algorithm can be summarized as the following sub-steps:
step 4.2.1, calculate the median X_median of all elements in X_i;
4.2.2, calculate the absolute deviation of each element from the median, and then take the median of these absolute deviations as MAD;
step 4.2.3, determining a parameter n, and adjusting the parameter according to an algorithm formula; and separating outlier data, storing the outlier data into a container outlier, storing other data into a container normal, calculating the average depth value of all pixels in the container outlier and the average depth value of all pixels in the container normal, comparing the average depth values of the pixels and the normal, setting the average pixel depth value as a dynamic target class depth value when the average pixel depth value is relatively small, and setting the average pixel depth value as a static background class depth value when the average pixel depth value is relatively large.
6. The SLAM method for removing dynamic targets based on RGBD sensor as claimed in claim 1, wherein in the step 4.3, the distance based clustering algorithm comprises the following steps:
step 4.3.1, firstly setting the central points of the clusters, wherein the depth values of the two central points are set in the step, namely the depth value of the dynamic target class and the depth value of the static background class obtained in the step 4.2;
4.3.2, traversing all pixel points in the dynamic target detection frame, respectively calculating the Euclidean distance from the depth value of each pixel point to the depth values of two set clustering center points, comparing the two distance values, and dividing all the pixel points in the dynamic target detection frame to the clustering center with a short distance;
step 4.3.3, reselecting the depth value of the clustering center, calculating the average pixel of the two types of pixel point sets divided in the step 4.3.2, using the average pixel as the depth value of the clustering center set in the step 4.3.1, and performing the iterative clustering of the next round again;
step 4.3.4, looping step 4.3.1 to step 4.3.3, ending the loop until at least one of the following conditions is met, and outputting the pixel position information belonging to the dynamic target class or the static background class in the dynamic target detection frame;
the condition 1 is that the difference between the average pixel of the two types and the clustering center is less than a set value a;
condition 2, the difference between the numbers of pixels belonging to the two classes in the target frame is greater than a set value b;
the parameters a and b are adjusted according to the condition of adapting to the actual application scene.
7. The SLAM method for removing the dynamic target based on the RGBD sensor, as claimed in claim 1, wherein in step 5, the algorithm flow of the SLAM system specifically comprises the following steps:
step 5.1, ORB feature extraction is carried out on the static image provided in the step 4, and when the number of extracted feature points exceeds a set value, an SLAM system is initialized;
step 5.2, the extracted ORB features are combined with the static background information of the previous frame to estimate the motion posture of the camera, the RGBD camera provides depth information, and a PnP method is used for solving the pose of the camera;
step 5.3, minimizing the reprojection error by using a beam adjustment method (BA) and optimizing a local map;
and 5.4, optimizing the pose by using loop detection and correcting drift errors.
8. The SLAM method for removing dynamic targets based on RGBD sensor of claim 7, wherein the step 5.4 is divided into two parts, closed loop detection and closed loop correction; the closed loop detection firstly uses a Bow bag-of-words model for detection, and then calculates similarity transformation through an Sim3 algorithm; the closed-loop correction is mainly closed-loop fusion Essential Graph optimization to achieve the effect of adjusting and correcting errors.
CN202111637308.3A 2021-12-29 2021-12-29 SLAM method for removing dynamic target based on RGBD sensor Pending CN114283198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111637308.3A CN114283198A (en) 2021-12-29 2021-12-29 SLAM method for removing dynamic target based on RGBD sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111637308.3A CN114283198A (en) 2021-12-29 2021-12-29 SLAM method for removing dynamic target based on RGBD sensor

Publications (1)

Publication Number Publication Date
CN114283198A true CN114283198A (en) 2022-04-05

Family

ID=80877822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111637308.3A Pending CN114283198A (en) 2021-12-29 2021-12-29 SLAM method for removing dynamic target based on RGBD sensor

Country Status (1)

Country Link
CN (1) CN114283198A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116399350A (en) * 2023-05-26 2023-07-07 北京理工大学 Method for determining semi-direct method visual odometer fused with YOLOv5
CN116399350B (en) * 2023-05-26 2023-09-01 北京理工大学 Method for determining semi-direct method visual odometer fused with YOLOv5

Similar Documents

Publication Publication Date Title
CN111258313B (en) Multi-sensor fusion SLAM system and robot
CN111275763B (en) Closed loop detection system, multi-sensor fusion SLAM system and robot
WO2021139484A1 (en) Target tracking method and apparatus, electronic device, and storage medium
CN108229416B (en) Robot SLAM method based on semantic segmentation technology
CN103886107B (en) Robot localization and map structuring system based on ceiling image information
CN111462207A (en) RGB-D simultaneous positioning and map creation method integrating direct method and feature method
US11367195B2 (en) Image segmentation method, image segmentation apparatus, image segmentation device
CN112184759A (en) Moving target detection and tracking method and system based on video
Yuan et al. Sad-slam: A visual slam based on semantic and depth information
CN112484746B (en) Monocular vision auxiliary laser radar odometer method based on ground plane
Li et al. Fast visual odometry using intensity-assisted iterative closest point
CN111127519B (en) Dual-model fusion target tracking control system and method thereof
CN106846367B (en) A kind of Mobile object detection method of the complicated dynamic scene based on kinematic constraint optical flow method
CN110070578B (en) Loop detection method
CN111709982B (en) Three-dimensional reconstruction method for dynamic environment
CN111998862A (en) Dense binocular SLAM method based on BNN
CN112541423A (en) Synchronous positioning and map construction method and system
CN117615255B (en) Shooting tracking method, device, equipment and storage medium based on cradle head
CN114283198A (en) SLAM method for removing dynamic target based on RGBD sensor
Zhu et al. PairCon-SLAM: Distributed, online, and real-time RGBD-SLAM in large scenarios
CN115564798A (en) Intelligent robot vision tracking method based on deep learning
CN111915632B (en) Machine learning-based method for constructing truth database of lean texture target object
CN110910418B (en) Target tracking algorithm based on rotation invariance image feature descriptor
Zhang et al. Feature regions segmentation based RGB-D visual odometry in dynamic environment
CN113870307A (en) Target detection method and device based on interframe information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination