CN110827395B - Instant positioning and map construction method suitable for dynamic environment - Google Patents

Instant positioning and map construction method suitable for dynamic environment

Info

Publication number
CN110827395B
Authority
CN
China
Prior art keywords
map
frame
points
key frame
image
Prior art date
Legal status
Active
Application number
CN201910848375.6A
Other languages
Chinese (zh)
Other versions
CN110827395A (en)
Inventor
陈文峰
蔡述庭
李丰
徐伟峰
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201910848375.6A
Publication of CN110827395A
Application granted
Publication of CN110827395B
Status: Active


Classifications

    • G06T17/005 Tree description, e.g. octree, quadtree (three dimensional [3D] modelling)
    • G06T17/05 Geographic models (three dimensional [3D] modelling)
    • G06T7/10 Segmentation; Edge detection (image analysis)
    • G06T2207/10028 Range image; Depth image; 3D point clouds (image acquisition modality)
    • G06T2207/20081 Training; Learning (special algorithmic details)


Abstract

The invention discloses an instant positioning and map construction (SLAM) method suitable for a dynamic environment, which comprises the following steps: firstly, the depth camera and the encoder are extrinsically calibrated to obtain the transformation matrix from encoder coordinates to depth camera coordinates; then each frame of image collected by the depth camera is layered with an image pyramid, and ORB feature points are extracted from every layer of the pyramid; the data obtained from the encoder serve as the constant-velocity model of the tracking module: the feature points of the previous keyframe are projected onto the current frame and matched against the feature points of the current frame, and the pose between the two frames is obtained by triangulating the successfully matched keypoints. The tracking part of the invention uses the encoder data as the initial pose for camera re-projection, achieving accurate positioning and tracking in a dynamic environment and saving a large amount of time compared with traditional semantic SLAM tracking modules that must remove dynamic pixels from every frame.

Description

Instant positioning and map construction method suitable for dynamic environment
Technical Field
The invention relates to the technical field of positioning, navigation and mapping of mobile robots, and in particular to an instant positioning and map construction method suitable for a dynamic environment.
Background
SLAM (simultaneous localization and mapping, rendered here as "instant positioning and map construction") is widely applied in mobile robotics, unmanned aerial vehicles, augmented reality and other fields. In a dynamic environment, however, a conventional SLAM system is not robust. In a real scene full of dynamic objects, the tracking module of a classic SLAM system such as ORB_SLAM fails; worse, the classic system builds a static map from which dynamic objects cannot be removed, so the map becomes unusable. To adapt SLAM to dynamic scenes, many papers propose different solutions. DS_SLAM is one effective solution: it performs semantic segmentation on every input frame, computes an essential matrix from the segmented image, uses the essential matrix to judge whether pixels belong to dynamic objects, and eliminates them if so. However, because semantic segmentation is carried out on every frame, the system can hardly run in real time on a mobile platform, and the dense point-cloud map it builds occupies a large amount of memory.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an instant positioning and map construction method suitable for a dynamic environment.
The purpose of the invention is realized by the following technical scheme:
An instant positioning and map construction method suitable for a dynamic environment comprises the following steps:
firstly, the depth camera and the encoder are extrinsically calibrated to obtain the transformation matrix from encoder coordinates to depth camera coordinates; then each frame of image collected by the depth camera is layered with an image pyramid, and ORB feature points are extracted from every layer of the pyramid; the data obtained from the encoder serve as the constant-velocity model of the tracking module: the feature points of the previous keyframe are projected onto the current frame and matched against the feature points of the current frame, and the pose between the two frames is obtained by triangulating the successfully matched keypoints;
step two, the tracking module judges whether the current frame is selected as a keyframe, according to whether more than 20 frames have passed since the last global optimization and whether at least 50 map points can be detected in the current frame; the keyframe is then transmitted to the dynamic object removal module, which performs semantic segmentation on the keyframe with a network;
thirdly, projecting the key frame selected by the tracking module in a local map optimization module to obtain map points, and then optimizing by using lacol BA; then, using a DBow2 bag-of-words model to perform loop detection in a loop detection module, and performing global optimization on all map points and key frames in the past by using BA (building block), so that the pose of the robot is optimized and the map points are also optimized;
and step four, the globally optimized keyframes and the semantic segmentation results are received, the keyframes are projected into a 3D point cloud, the point cloud is colored according to the different semantic segmentation labels, and a semantic octree map is then built in the semantic octree map module.
Preferably, the tracking module works as follows:
the depth camera collects color images and depth images at 30 frames per second; they are published as topics by the iai_kinect2 package in ROS, and the SLAM system subscribes to the two topics to receive the collected color and depth images. The mobile robot publishes encoder data at 90 Hz; the SLAM system obtains the encoder data by subscribing to the topic, and then preprocesses the encoder data and the images separately.
Preferably, the preprocessing of the encoder data is specifically:
firstly, at time t the pose of the robot is ξ_t = [x_t, y_t, θ_t]^T, and according to the odometer motion model the pose of the robot at time t+1 is obtained as:

ξ_{t+1} = ξ_t + [ Δs·cos(θ_t + Δθ/2),  Δs·sin(θ_t + Δθ/2),  Δθ ]^T

where Δs is the distance moved by the center of the robot and Δθ is the angle through which the robot turns. The readings of the left and right code wheels of the encoder are obtained by subscribing to the encoder topic, and the readings and the robot motion satisfy the following relations:

Δs = (Δs_r + Δs_l) / 2,  Δθ = (Δs_r − Δs_l) / b

where b is the wheelbase between the two wheels, and Δs_r and Δs_l are the reading increments of the right and left code wheels from t to t+1. The transformation matrix of the robot coordinates from t to t+1 is therefore obtained as:

T_rr =
[ cos Δθ   −sin Δθ   Δs·cos(Δθ/2) ]
[ sin Δθ    cos Δθ   Δs·sin(Δθ/2) ]
[ 0         0        1            ]
preferably, the preprocessing the image is divided into preprocessing a color image and a depth image, specifically:
the color image obtained from the topic is a three-channel color image, and the color image must be converted into a single-channel gray image by using an OpenCV library function; meanwhile, the depth map is also a three-channel image, and the depth map needs to be converted into a single-channel image of CV _32F by a mapping factor;
after the data are preprocessed, the initialization part of the tracking module is entered, and the first frame image is selected as a keyframe; then, each time a new frame arrives, the main thread of the tracking module is entered:
(1) Use the image pyramid to extract ORB feature points, including ORB keypoints and descriptors, at 12 pyramid scales;
(2) Re-project the feature points using the encoder model. The transformation matrix of the robot coordinates from t to t+1 was obtained in the data preprocessing; the acquisition frequency of the encoder is 3 times that of the camera, and the encoder data and camera data carry the same timestamps and can be put in correspondence, so the transformation matrix T_cr from the current frame to the previous frame is obtained as

T_cr = T_rc · T_rr · T_rc⁻¹

Let p_c be the camera coordinate of a pixel of the current frame, and let π⁻¹ denote the pinhole-model operation that converts a 2D pixel coordinate into the camera coordinates of a 3D point. The result of projecting the current-frame pixel p_c onto the previous frame can then be expressed as

p_c′ = π(T_cr · π⁻¹(p_c))

where T_rc is the extrinsic parameter matrix of the robot, determined by the relative poses of the robot and the camera. Matching feature points can then be searched within a 3σ pixel region around p_c′; the encoder error obeys a normal distribution, and σ is the standard deviation of that distribution;
(3) Then select the keypoint with the highest descriptor matching score in the region as the matching point. All matching points are traversed and their depth values are checked by re-projection to eliminate mismatches caused by the dynamic environment. First the matched feature-point pair is projected to world coordinates to obtain a 3D point p; this 3D point is then projected into the nearest keyframe of the frame, reconstructing the map point on that keyframe as

p′ = T_kw · p

where T_kw is the transformation from world coordinates to the keyframe camera coordinates. If the map point corresponding to this matching point is static, the depth values of p′ and p are unchanged; therefore, if the difference between the depth values is larger than an empirical threshold d_th, the matching point is considered a dynamic point and its depth value is set to zero. The threshold is obtained from d_th = d_b + k_d·z′, where k_d is scale-related and d_b is a base threshold; typically d_b = 0.2 m and k_d = 0.025;
(4) Call the PnP function provided in OpenCV and compute the pose between the two frames of pictures with EPnP and the matched keypoints;
(5) Finally, perform the keyframe judgment: decide whether the current frame satisfies the conditions for being a keyframe, and regard it as a keyframe if it meets all of the following; the map points are obtained by converting the feature points on keyframes into world coordinates;
(1) more than 20 frames have passed since the last global relocalization;
(2) the local mapping thread is idle, or more than 20 frames have passed since the last keyframe was inserted;
(3) the current frame tracks at least 50 map points;
(4) the map points tracked by the current frame are fewer than 90% of the map points in the reference keyframe.
Preferably, the local map optimization module mainly manages map points and keyframes and optimizes the transformation matrices between map points and keyframes through local BA; the working steps of the local map optimization module are as follows:
(1) Map point culling: remove from the system the map points that do not meet the following conditions:
(1) the point must be tracked in more than 25% of the keyframes that are predicted to observe it;
(2) it must be observed by more than three key frames;
(2) Create new map points: after a new keyframe is generated, feature matching is performed between this keyframe and the keyframes connected to it, and the matched feature points are triangulated to obtain 3D points, which are the new map points;
(3) Local-map BA optimization: a cost function between map points and feature points is built from the re-projection relationship between the map points and the keyframe feature points, and the camera extrinsic parameters are then adjusted to minimize the error value of the cost function;
(4) Keyframe elimination: if more than 90% of the keypoints of a keyframe can be observed by more than 3 other keyframes, that keyframe is considered redundant and is eliminated.
Preferably, the dynamic object elimination module comprises the following working steps:
when the tracking module generates a new keyframe, the color image and depth image corresponding to the keyframe are published as a ROS topic, and the dynamic object removal module thread receives the topic, thereby obtaining the color and depth images from which dynamic pixels must be removed; because the semantic segmentation network used in the keyframe dynamic object removal module is implemented on the PyTorch platform and must be written in Python, while the other modules are written in C++, the ROS communication mechanism is used to bridge the two platforms;
before semantic segmentation can be carried out, the segmentation network needs to be trained; an ICNet network is trained on the PASCAL VOC 2012 dataset, chosen because it labels the background as one class and labels different objects such as people, tables and chairs as other classes, which makes it convenient to separate the background from the dynamic objects during semantic segmentation of the image;
after the color and depth images are received, the color image is segmented with ICNet; the segmentation result is a single-channel grayscale image, which is converted into a three-channel color image with OpenCV; the labels marked on the color image are then colored with a pre-made color bar, and the final semantic segmentation result is obtained after coloring is finished;
semantic segmentation divides the picture with pre-defined labels: label class 0 is the background and label class 15 is "person", and for an indoor dynamic scene the dynamic objects to be removed are mainly people; therefore the image is first binarized by setting the gray value of class-15 pixels to 255, and the class-15 pixel region is then dilated with the OpenCV region-dilation function; then, in the depth map, the depth values of pixels whose value is 255 in the corresponding binary image are set to zero, eliminating the dynamic pixels;
finally, the depth map with dynamic pixels removed and the semantically segmented color map are sent back to the SLAM system in topic form.
Preferably, the working steps of the loop detection module are as follows:
after the local map optimization module finishes, the keyframes are transmitted to the loop detection module, which compares the bag-of-words vector of the current keyframe against the bag-of-words vectors of the other keyframes with a similarity test, as in the following formula:

s(v_1, v_2) = 1 − (1/2)·‖ v_1/‖v_1‖ − v_2/‖v_2‖ ‖

where v_1 and v_2 are bag-of-words vectors; a bag-of-words vector reflects the feature-point information of one image and is obtained by calling the library functions of the bag-of-words model. If the similarity between the current keyframe and some previous keyframe exceeds 1.1 times the similarity threshold, the two keyframes are considered to have the same pose, loop detection succeeds, and the result is used as a condition for global keyframe optimization. The similarity threshold is obtained by computing the similarity between the current keyframe and its four neighboring keyframes and taking the highest similarity value.
Preferably, the semantic octree map module works as follows:
after loop detection finishes, the semantic octree map module receives the color and depth images produced by semantic segmentation. Let p_a = [x_a, y_a, 1]^T be a pixel of the color image; p_a is converted to world coordinates as

P_w = T_wc · z_a · K⁻¹ · p_a

where K is the camera intrinsic matrix, z_a is the depth value at p_a, and T_wc is the camera-to-world transformation.
Every pixel of the color map is traversed and converted into world coordinates, yielding 3D points in the world coordinate system. The 3D points are then converted into the PCL color point-cloud type PointXYZRGB, a six-dimensional variable whose first three dimensions are the world coordinates of the point and whose last three dimensions are its color. After segmentation, different labels have different colors in the color image, and the point-cloud data are colored accordingly: class-0 labels are the background, so after the pixels corresponding to class-0 labels are converted into point-cloud data the points are set to silvery white, and the other labels are set to different colors in turn.
Once the point-cloud data containing semantic information exist, they can be converted directly into an octree map: the octomap library is called in C++, a ColorOcTree octree object is created, the point-cloud data are inserted into the octree map, and finally the resolution of the octree map is set to 0.05, the hit probability to 0.7 and the miss probability to 0.55; the finished octree is then published on the ROS topic /octomap.
Compared with the prior art, the invention has the following beneficial effects:
(1) The tracking part of the invention uses encoder data as the initial pose for camera re-projection, achieving accurate positioning and tracking in a dynamic environment and saving a large amount of time compared with traditional semantic SLAM tracking modules that must remove dynamic pixels from every frame. Semantic segmentation is applied only to keyframes, which still removes dynamic pixels from the map and builds a reliable static map. Meanwhile, the semantic segmentation part is innovatively placed on the TensorFlow platform rather than the caffe platform of traditional semantic SLAM, which both shortens the segmentation time and improves the segmentation accuracy;
(2) Unlike the usual encoder-fusing SLAM schemes, the encoder here is used only as the constant-velocity model when the color-image feature points are re-projected. Traditional dynamic-environment SLAM that fuses encoder data treats the encoder as the main sensor, and the encoder data are then optimized during local map optimization and loop detection; but because the wheels slip easily and the data may fail, using the encoder as the main sensor can cause considerable errors in some scenarios. Using the encoder only as the constant-velocity model effectively avoids this problem.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a graph showing the performance comparison between an ICNet network and a conventional semantic segmentation network according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the embodiments of the present invention are not limited thereto.
The invention provides an instant positioning and map construction method suitable for a dynamic environment. On the basis of the DS_SLAM system, encoder data are integrated so that the tracking part is more robust in a dynamic environment; image segmentation is applied only to keyframes, and the segmentation result is used only for the mapping part, which greatly shortens the running time. In addition, a semantic segmentation network on the TensorFlow platform is innovatively adopted; compared with segmentation networks under the traditional caffe framework, the new network performs better. The new network introduces the multi-node communication mechanism of the ROS system, and a separate node thread is opened for semantic segmentation. Finally, a semantic octree map is built from the semantic segmentation result.
Specifically, as shown in FIG. 1 and FIG. 2, an instant positioning and map construction method suitable for a dynamic environment includes the following steps:
firstly, the depth camera and the encoder are extrinsically calibrated to obtain the transformation matrix from encoder coordinates to depth camera coordinates; then each frame of image collected by the depth camera is layered with an image pyramid, and ORB feature points are extracted from every layer of the pyramid; the data obtained from the encoder serve as the constant-velocity model of the tracking module: the feature points of the previous keyframe are projected onto the current frame and matched against the feature points of the current frame, and the pose between the two frames is obtained by triangulating the successfully matched keypoints;
the tracking module comprises the following working steps:
the depth camera collects color images and depth images at 30 frames per second; they are published as topics by the iai_kinect2 package in ROS, and the SLAM system subscribes to the two topics to receive the collected color and depth images. The mobile robot publishes encoder data at 90 Hz; the SLAM system obtains the encoder data by subscribing to the topic, and then preprocesses the encoder data and the images separately.
The encoder data preprocessing specifically comprises:
firstly, at time t the pose of the robot is ξ_t = [x_t, y_t, θ_t]^T, and according to the odometer motion model the pose of the robot at time t+1 is obtained as:

ξ_{t+1} = ξ_t + [ Δs·cos(θ_t + Δθ/2),  Δs·sin(θ_t + Δθ/2),  Δθ ]^T

where Δs is the distance moved by the center of the robot and Δθ is the angle through which the robot turns. The readings of the left and right code wheels of the encoder are obtained by subscribing to the encoder topic, and the readings and the robot motion satisfy the following relations:

Δs = (Δs_r + Δs_l) / 2,  Δθ = (Δs_r − Δs_l) / b

where b is the wheelbase between the two wheels, and Δs_r and Δs_l are the reading increments of the right and left code wheels from t to t+1. The transformation matrix of the robot coordinates from t to t+1 is therefore obtained as:

T_rr =
[ cos Δθ   −sin Δθ   Δs·cos(Δθ/2) ]
[ sin Δθ    cos Δθ   Δs·sin(Δθ/2) ]
[ 0         0        1            ]
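A minimal C++ sketch of this wheel-odometry update, assuming a differential-drive robot; all names are illustrative:

    #include <cmath>

    struct Pose2D { double x, y, theta; };

    // ds_r, ds_l: reading increments of the right/left code wheels from t to t+1;
    // b: wheelbase between the two wheels.
    Pose2D integrateOdometry(const Pose2D& xi, double ds_r, double ds_l, double b)
    {
        const double ds     = 0.5 * (ds_r + ds_l);  // distance moved by the robot centre
        const double dtheta = (ds_r - ds_l) / b;    // change of heading
        Pose2D next;
        next.x     = xi.x + ds * std::cos(xi.theta + 0.5 * dtheta);
        next.y     = xi.y + ds * std::sin(xi.theta + 0.5 * dtheta);
        next.theta = xi.theta + dtheta;
        return next;
    }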
the image preprocessing comprises the following steps of preprocessing a color image and a depth image:
the color image obtained from the topic is a three-channel color image, and the color image must be converted into a single-channel gray image by using an OpenCV library function; meanwhile, the depth map is also a three-channel image, and the depth map needs to be converted into a single-channel image of CV _32F by a mapping factor;
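A minimal C++ sketch of this preprocessing with OpenCV; the mapping factor depthScale and the choice of channel are assumptions, not taken from the patent:

    #include <opencv2/opencv.hpp>

    // color: 3-channel image from the iai_kinect2 topic; depthRaw: 3-channel depth image.
    void preprocessImages(const cv::Mat& color, const cv::Mat& depthRaw,
                          cv::Mat& gray, cv::Mat& depth, double depthScale)
    {
        cv::cvtColor(color, gray, cv::COLOR_BGR2GRAY);        // 3-channel colour -> grey

        cv::Mat depthOneChannel;
        cv::extractChannel(depthRaw, depthOneChannel, 0);     // keep one channel
        depthOneChannel.convertTo(depth, CV_32F, depthScale); // -> single-channel CV_32F
    }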
after the data are preprocessed, the initialization part of the tracking module is entered, and the first frame image is selected as a keyframe; then, each time a new frame arrives, the main thread of the tracking module is entered:
(1) Use the image pyramid to extract ORB feature points, including ORB keypoints and descriptors, at 12 pyramid scales;
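A minimal C++ sketch of step (1) with OpenCV's ORB at 12 pyramid levels; nfeatures = 1000 and scaleFactor = 1.2 are assumptions, not taken from the patent:

    #include <opencv2/features2d.hpp>
    #include <vector>

    void extractOrbFeatures(const cv::Mat& gray,
                            std::vector<cv::KeyPoint>& keypoints, cv::Mat& descriptors)
    {
        static cv::Ptr<cv::ORB> orb =
            cv::ORB::create(/*nfeatures=*/1000, /*scaleFactor=*/1.2f, /*nlevels=*/12);
        orb->detectAndCompute(gray, cv::noArray(), keypoints, descriptors);
    }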
(2) Re-project the feature points using the encoder model. The transformation matrix of the robot coordinates from t to t+1 was obtained in the data preprocessing; the acquisition frequency of the encoder is 3 times that of the camera, and the encoder data and camera data carry the same timestamps and can be put in correspondence, so the transformation matrix T_cr from the current frame to the previous frame is obtained as

T_cr = T_rc · T_rr · T_rc⁻¹

Let p_c be the camera coordinate of a pixel of the current frame, and let π⁻¹ denote the pinhole-model operation that converts a 2D pixel coordinate into the camera coordinates of a 3D point. The result of projecting the current-frame pixel p_c onto the previous frame can then be expressed as

p_c′ = π(T_cr · π⁻¹(p_c))

where T_rc is the extrinsic parameter matrix of the robot, determined by the relative poses of the robot and the camera. Matching feature points can then be searched within a 3σ pixel region around p_c′; the encoder error obeys a normal distribution, and σ is the standard deviation of that distribution;
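A minimal C++ sketch of the 3σ window search of step (2), assuming ORB descriptors compared by Hamming distance; all function and variable names are illustrative:

    #include <opencv2/core.hpp>
    #include <cmath>
    #include <limits>
    #include <vector>

    // Look for candidate keypoints inside a 3-sigma window around the projected
    // pixel p_c' and return the best descriptor match, or -1 if none is found.
    int findMatchInWindow(const cv::Point2f& projected, float sigma,
                          const std::vector<cv::KeyPoint>& keypoints,
                          const cv::Mat& descriptors, const cv::Mat& refDescriptor)
    {
        const float radius = 3.0f * sigma;
        int best = -1;
        double bestDist = std::numeric_limits<double>::max();
        for (int i = 0; i < static_cast<int>(keypoints.size()); ++i) {
            const cv::Point2f d = keypoints[i].pt - projected;
            if (std::abs(d.x) > radius || std::abs(d.y) > radius)
                continue;                                  // outside the 3-sigma region
            const double dist =
                cv::norm(refDescriptor, descriptors.row(i), cv::NORM_HAMMING);
            if (dist < bestDist) { bestDist = dist; best = i; }
        }
        return best;
    }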
(3) Then select the keypoint with the highest descriptor matching score in the region as the matching point. All matching points are traversed and their depth values are checked by re-projection to eliminate mismatches caused by the dynamic environment. First the matched feature-point pair is projected to world coordinates to obtain a 3D point p; this 3D point is then projected into the nearest keyframe of the frame, reconstructing the map point on that keyframe as

p′ = T_kw · p

where T_kw is the transformation from world coordinates to the keyframe camera coordinates. If the map point corresponding to this matching point is static, the depth values of p′ and p are unchanged; therefore, if the difference between the depth values is larger than an empirical threshold d_th, the matching point is considered a dynamic point and its depth value is set to zero. The threshold is obtained from d_th = d_b + k_d·z′, where k_d is scale-related and d_b is a base threshold; typically d_b = 0.2 m and k_d = 0.025;
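A minimal C++ sketch of the depth-consistency test of step (3); zObserved and zPrime are illustrative names for the observed and re-projected depths:

    #include <cmath>

    // Returns true if the matched point should be treated as dynamic.
    bool isDynamicPoint(double zObserved, double zPrime,
                        double d_b = 0.2, double k_d = 0.025)
    {
        const double d_th = d_b + k_d * zPrime;      // empirical threshold d_th
        return std::abs(zObserved - zPrime) > d_th;  // dynamic if the depth changed
    }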
(4) Call the PnP function provided in OpenCV and compute the pose between the two frames of pictures with EPnP and the matched keypoints;
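A minimal C++ sketch of step (4) using OpenCV's EPnP solver; wrapping it in solvePnPRansac and the RANSAC parameters are assumptions, since the patent only states that EPnP is called:

    #include <opencv2/calib3d.hpp>
    #include <vector>

    // Pose between two frames from matched 3D map points and 2D keypoints.
    bool estimatePose(const std::vector<cv::Point3f>& mapPoints,
                      const std::vector<cv::Point2f>& pixels,
                      const cv::Mat& K, cv::Mat& rvec, cv::Mat& tvec)
    {
        return cv::solvePnPRansac(mapPoints, pixels, K, cv::noArray(),
                                  rvec, tvec, /*useExtrinsicGuess=*/false,
                                  /*iterationsCount=*/100, /*reprojectionError=*/8.0f,
                                  /*confidence=*/0.99, cv::noArray(),
                                  cv::SOLVEPNP_EPNP);
    }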
(5) Finally, perform the keyframe judgment: decide whether the current frame satisfies the conditions for being a keyframe, and regard it as a keyframe if it meets all of the following (a minimal sketch of this test is given after the list); the map points are obtained by converting the feature points on keyframes into world coordinates;
(1) more than 20 frames have passed since the last global relocalization;
(2) the local mapping thread is idle, or more than 20 frames have passed since the last keyframe was inserted;
(3) the current frame tracks at least 50 map points;
(4) the map points tracked by the current frame are fewer than 90% of the map points in the reference keyframe.
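A minimal C++ sketch of the keyframe test, implementing the four conditions above; the counters are illustrative:

    bool needNewKeyFrame(int framesSinceRelocalisation, int framesSinceKeyFrame,
                         bool localMappingIdle, int trackedMapPoints,
                         int refKeyFrameMapPoints)
    {
        const bool c1 = framesSinceRelocalisation > 20;               // condition (1)
        const bool c2 = localMappingIdle || framesSinceKeyFrame > 20; // condition (2)
        const bool c3 = trackedMapPoints >= 50;                       // condition (3)
        const bool c4 = trackedMapPoints < 0.9 * refKeyFrameMapPoints;// condition (4)
        return c1 && c2 && c3 && c4;
    }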
Step two, the tracking module judges whether the current frame is selected as a keyframe, according to whether more than 20 frames have passed since the last global optimization and whether at least 50 map points can be detected in the current frame; the keyframe is then transmitted to the dynamic object removal module, which performs semantic segmentation on the keyframe with a network;
the dynamic object removing module comprises the following working steps:
when the tracking module generates a new keyframe, the color image and depth image corresponding to the keyframe are published as a ROS topic, and the dynamic object removal module thread receives the topic, thereby obtaining the color and depth images from which dynamic pixels must be removed; because the semantic segmentation network used in the keyframe dynamic object removal module is implemented on the PyTorch platform and must be written in Python, while the other modules are written in C++, the ROS communication mechanism is used to bridge the two platforms;
before semantic segmentation can be carried out, the segmentation network needs to be trained; an ICNet network is trained on the PASCAL VOC 2012 dataset, chosen because it labels the background as one class and labels different objects such as people, tables and chairs as other classes, which makes it convenient to separate the background from the dynamic objects during semantic segmentation of the image;
after the color and depth images are received, the color image is segmented with ICNet; the segmentation result is a single-channel grayscale image, which is converted into a three-channel color image with OpenCV; the labels marked on the color image are then colored with a pre-made color bar, and the final semantic segmentation result is obtained after coloring is finished;
semantic segmentation divides the picture with pre-defined labels: label class 0 is the background and label class 15 is "person", and for an indoor dynamic scene the dynamic objects to be removed are mainly people; therefore the image is first binarized by setting the gray value of class-15 pixels to 255, and the class-15 pixel region is then dilated with the OpenCV region-dilation function; then, in the depth map, the depth values of pixels whose value is 255 in the corresponding binary image are set to zero, eliminating the dynamic pixels;
finally, the depth map with dynamic pixels removed and the semantically segmented color map are sent back to the SLAM system in topic form.
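A minimal C++ sketch of the removal step just described: binarize the "person" label (class 15), dilate the region, and zero the corresponding depth values; the 15x15 dilation kernel size is an assumption:

    #include <opencv2/opencv.hpp>

    void removeDynamicPixels(const cv::Mat& labelImage, cv::Mat& depth)
    {
        cv::Mat mask = (labelImage == 15);                 // class-15 pixels -> 255
        const cv::Mat kernel =
            cv::getStructuringElement(cv::MORPH_RECT, cv::Size(15, 15));
        cv::dilate(mask, mask, kernel);                    // expand the person region
        depth.setTo(0.0f, mask);                           // zero depth under the mask
    }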
Thirdly, the keyframe selected by the tracking module is projected in the local map optimization module to obtain map points, which are then optimized with local BA (bundle adjustment); loop detection is then performed in the loop detection module with the DBoW2 bag-of-words model, and all past map points and keyframes are globally optimized with BA, so that both the robot pose and the map points are optimized;
the local map optimization module is mainly used for managing map points and key frames and optimizing a conversion matrix between the map points and the key frames through local BA; the working steps of the local map optimization module are as follows:
(1) Map point culling: remove from the system the map points that do not meet the following conditions:
(1) the point must be tracked in more than 25% of the keyframes that are predicted to observe it;
(2) it must be observed by more than three key frames;
(2) Create new map points: after a new keyframe is generated, feature matching is performed between this keyframe and the keyframes connected to it, and the matched feature points are triangulated to obtain 3D points, which are the new map points;
(3) Local-map BA optimization: a cost function between map points and feature points is built from the re-projection relationship between the map points and the keyframe feature points, and the camera extrinsic parameters are then adjusted to minimize the error value of the cost function;
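In its standard form, the cost minimized by the local BA of step (3) is the robust sum of squared re-projection errors; the patent does not reproduce its formula, so the robust kernel ρ (e.g. Huber) is an assumption here:

    \min_{T_i,\; X_j} \sum_{(i,j)} \rho\left( \left\| p_{ij} - \pi\!\left( T_i X_j \right) \right\|^2 \right)

where T_i are the keyframe poses in the local window, X_j the map points, p_{ij} the feature point observed for X_j in keyframe i, and π the pinhole projection.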
(4) Keyframe elimination: if more than 90% of the keypoints of a keyframe can be observed by more than 3 other keyframes, that keyframe is considered redundant and is eliminated.
The working steps of the loop detection module are as follows:
after the local map optimization module finishes, the keyframes are transmitted to the loop detection module, which compares the bag-of-words vector of the current keyframe against the bag-of-words vectors of the other keyframes with a similarity test, as in the following formula:

s(v_1, v_2) = 1 − (1/2)·‖ v_1/‖v_1‖ − v_2/‖v_2‖ ‖

where v_1 and v_2 are bag-of-words vectors; a bag-of-words vector reflects the feature-point information of one image and is obtained by calling the library functions of the bag-of-words model. If the similarity between the current keyframe and some previous keyframe exceeds 1.1 times the similarity threshold, the two keyframes are considered to have the same pose, loop detection succeeds, and the result is used as a condition for global keyframe optimization. The similarity threshold is obtained by computing the similarity between the current keyframe and its four neighboring keyframes and taking the highest similarity value.
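A minimal C++ sketch of the similarity score above, written over dense vectors for brevity (DBoW2 itself stores sparse bag-of-words vectors):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // L1 similarity between two normalised bag-of-words vectors, in [0, 1].
    double bowScore(const std::vector<double>& v1, const std::vector<double>& v2)
    {
        double n1 = 0.0, n2 = 0.0;
        for (double x : v1) n1 += std::abs(x);
        for (double x : v2) n2 += std::abs(x);
        if (n1 == 0.0 || n2 == 0.0) return 0.0;  // empty vector: no similarity
        double d = 0.0;
        for (std::size_t i = 0; i < v1.size(); ++i)
            d += std::abs(v1[i] / n1 - v2[i] / n2);
        return 1.0 - 0.5 * d;                    // s(v1, v2)
    }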
And step four, receiving the globally optimized key frame and the result of semantic segmentation, projecting the key frame to the 3d point cloud, coloring the point cloud according to different semantic segmentation labels, and then establishing a semantic octree map in a semantic octree map module.
The semantic octree map module comprises the following working steps:
after loop detection finishes, the semantic octree map module receives the color and depth images produced by semantic segmentation. Let p_a = [x_a, y_a, 1]^T be a pixel of the color image; p_a is converted to world coordinates as

P_w = T_wc · z_a · K⁻¹ · p_a

where K is the camera intrinsic matrix, z_a is the depth value at p_a, and T_wc is the camera-to-world transformation.
Every pixel of the color map is traversed and converted into world coordinates, yielding 3D points in the world coordinate system. The 3D points are then converted into the PCL color point-cloud type PointXYZRGB, a six-dimensional variable whose first three dimensions are the world coordinates of the point and whose last three dimensions are its color. After segmentation, different labels have different colors in the color image, and the point-cloud data are colored accordingly: class-0 labels are the background, so after the pixels corresponding to class-0 labels are converted into point-cloud data the points are set to silvery white, and the other labels are set to different colors in turn.
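A minimal C++ sketch of the per-pixel back-projection and coloring, assuming pinhole intrinsics fx, fy, cx, cy and a camera-to-world transform Twc; the RGB value used for "silvery white" and the per-label colors are assumptions:

    #include <Eigen/Dense>
    #include <pcl/point_types.h>

    pcl::PointXYZRGB pixelToWorldPoint(float u, float v, float z, int label,
                                       float fx, float fy, float cx, float cy,
                                       const Eigen::Matrix4f& Twc)
    {
        // Back-project the pixel with its depth, then move it to the world frame.
        const Eigen::Vector4f pCamera((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0f);
        const Eigen::Vector4f pWorld = Twc * pCamera;

        pcl::PointXYZRGB p;
        p.x = pWorld.x(); p.y = pWorld.y(); p.z = pWorld.z();
        if (label == 0) {
            p.r = p.g = p.b = 230;  // class 0 (background) -> silvery white
        } else {
            // distinct colour per label, standing in for the pre-made colour bar
            p.r = static_cast<unsigned char>((37 * label) % 256);
            p.g = static_cast<unsigned char>((91 * label) % 256);
            p.b = static_cast<unsigned char>((151 * label) % 256);
        }
        return p;
    }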
Once the point-cloud data containing semantic information exist, they can be converted directly into an octree map: the octomap library is called in C++, a ColorOcTree octree object is created, the point-cloud data are inserted into the octree map, and finally the resolution of the octree map is set to 0.05, the hit probability to 0.7 and the miss probability to 0.55; the finished octree is then published on the ROS topic /octomap.
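A minimal C++ sketch of the octree construction with the parameters stated above; names other than the octomap and PCL APIs are illustrative:

    #include <octomap/ColorOcTree.h>
    #include <pcl/point_cloud.h>
    #include <pcl/point_types.h>

    void buildSemanticOctree(const pcl::PointCloud<pcl::PointXYZRGB>& cloud,
                             octomap::ColorOcTree& tree)
    {
        tree.setProbHit(0.7);    // hit probability, as stated above
        tree.setProbMiss(0.55);  // miss probability, as stated above
        for (const auto& p : cloud.points) {
            octomap::ColorOcTreeNode* node =
                tree.updateNode(octomap::point3d(p.x, p.y, p.z), /*occupied=*/true);
            node->setColor(p.r, p.g, p.b);  // keep the semantic colour in the voxel
        }
        tree.updateInnerOccupancy();        // propagate occupancy to inner nodes
    }

    // usage: octomap::ColorOcTree tree(0.05); buildSemanticOctree(cloud, tree);
    // the tree can then be published on the ROS topic /octomap (e.g. via octomap_msgs).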
Traditional semantic segmentation SLAM is generally based on the caffe framework, under which few semantic segmentation networks are available, so the segmentation effect is poor even though the speed is low. As shown in FIG. 2, the abscissa is the number of frames processed per second and the ordinate is the segmentation accuracy in mIoU (%); ICNet is far ahead of the SegNet network used in the traditional semantic segmentation framework in both forward running time and accuracy, and its performance reaches real time.
Compared with prior-art methods such as LiDAR-based SLAM in a dynamic environment, robot SLAM object-state detection in a dynamic sparse environment, and indoor personnel autonomous positioning based on the integration of SLAM and gait IMU, the main innovation points of the method are:
1. An innovative fusion of encoder data: the encoder data are used only to form the transformation matrix for re-projection between the current frame and the keyframe of the camera, realizing a shallow fusion of encoder and camera data and avoiding the large impact on the system of encoder errors caused by wheel slip.
2. Multi-node communication between C++ and Python is achieved with the ROS system, so a semantic segmentation network under the TensorFlow framework is innovatively added to the SLAM system, effectively improving the performance of semantic segmentation.
3. A semantic octree map is innovatively built: the image information obtained by semantic segmentation is fully utilized, and a more complete and more practical semantic map is established.
The tracking part of the invention uses encoder data as the initial pose for camera re-projection, achieving accurate positioning and tracking in a dynamic environment and saving a large amount of time compared with traditional semantic SLAM tracking modules that must remove dynamic pixels from every frame. Semantic segmentation is applied only to keyframes, which still removes dynamic pixels from the map and builds a reliable static map. Meanwhile, the semantic segmentation part is innovatively placed on the TensorFlow platform rather than the caffe platform of traditional semantic SLAM, which both shortens the segmentation time and improves the segmentation accuracy.
Unlike the usual encoder-fusing SLAM schemes, the encoder here is used only as the constant-velocity model when the color-image feature points are re-projected. Traditional dynamic-environment SLAM that fuses encoder data treats the encoder as the main sensor, and the encoder data are then optimized during local map optimization and loop detection; but because the wheels slip easily and the data may fail, using the encoder as the main sensor can cause considerable errors in some scenarios. Using the encoder only as the constant-velocity model effectively avoids this problem.
The present invention is not limited to the above embodiments; any other changes, modifications, substitutions, combinations and simplifications that do not depart from the spirit and principle of the present invention are regarded as equivalent replacements and are included in the scope of protection of the present invention.

Claims (4)

1. An instant positioning and mapping method suitable for dynamic environment, characterized by comprising the following steps:
firstly, the depth camera and the encoder are extrinsically calibrated to obtain the transformation matrix from encoder coordinates to depth camera coordinates; then each frame of image collected by the depth camera is layered with an image pyramid, and ORB feature points are extracted from every layer of the pyramid; the data obtained from the encoder serve as the constant-velocity model of the tracking module: the feature points of the previous keyframe are projected onto the current frame and matched against the feature points of the current frame, and the pose between the two frames is obtained by triangulating the successfully matched keypoints;
step two, the tracking module judges whether the current frame is selected as a keyframe, according to whether more than 20 frames have passed since the last global optimization and whether at least 50 map points can be detected in the current frame, and then transmits the keyframe to the dynamic object removal module, which performs semantic segmentation on the keyframe with a network;
when the tracking module generates a new keyframe, the color image and depth image corresponding to the keyframe are published as a ROS topic, and the dynamic object removal module thread receives the topic, thereby obtaining the color and depth images from which dynamic pixels must be removed; because the semantic segmentation network used in the keyframe dynamic object removal module is implemented on the PyTorch platform and must be written in Python, while the other modules are written in C++, the ROS communication mechanism is used to bridge the two platforms;
before semantic segmentation can be carried out, the segmentation network needs to be trained; an ICNet network is trained on the PASCAL VOC 2012 dataset, chosen because it labels the background as one class and labels different objects including people, tables and chairs as other classes, which makes it convenient to separate the background from the dynamic objects during semantic segmentation of the image;
after the color and depth images are received, the color image is segmented with ICNet; the segmentation result is a single-channel grayscale image, which is converted into a three-channel color image with OpenCV; the labels marked on the color image are then colored with a pre-made color bar, and the final semantic segmentation result is obtained after coloring is finished;
semantic segmentation divides the picture with pre-defined labels: label class 0 is the background and label class 15 is "person", and for an indoor dynamic scene the dynamic objects to be removed are people; therefore the image is first binarized by setting the gray value of class-15 pixels to 255, and the class-15 pixel region is then dilated with the OpenCV region-dilation function; then, in the depth map, the depth values of pixels whose value is 255 in the corresponding binary image are set to zero, realizing the elimination of the dynamic pixels;
finally, the depth map with dynamic pixels removed and the semantically segmented color map are sent back to the SLAM system in topic form;
thirdly, the keyframe selected by the tracking module is projected in the local map optimization module to obtain map points, which are then optimized with local-map BA (bundle adjustment); loop detection is then performed in the loop detection module with the DBoW2 bag-of-words model, and all past map points and keyframes are globally optimized with BA, so that both the robot pose and the map points are optimized;
step four, the globally optimized keyframes and the semantic segmentation results are received, the keyframes are projected into a 3D point cloud, the point cloud is colored according to the different semantic segmentation labels, and a semantic octree map is then built in the semantic octree map module;
the working steps of the tracking module are as follows:
the depth camera collects color images and depth images at 30 frames per second; they are published as topics by the iai_kinect2 package in ROS, and the SLAM system subscribes to the two topics to receive the collected color and depth images; the mobile robot publishes encoder data at 90 Hz, the SLAM system obtains the encoder data by subscribing to the topic, and the system then preprocesses the encoder data and the images separately;
the image preprocessing is divided into preprocessing of the color image and of the depth image:
the color image obtained from the topic has three channels and must be converted into a single-channel grayscale image with an OpenCV library function; the depth map as received also has three channels and must be converted into a single-channel CV_32F image through a mapping factor;
after the data are preprocessed, the initialization part of the tracking module is entered and the first frame image is selected as a keyframe; then, each time a new frame arrives, the main thread of the tracking module, comprising the following steps (1) to (5), is entered:
(1) Use the image pyramid to extract ORB feature points, including ORB keypoints and descriptors, at 12 pyramid scales;
(2) Re-project the feature points using the encoder model. The transformation matrix of the robot coordinates from time t to t+1 is obtained in the data preprocessing; the acquisition frequency of the encoder is 3 times that of the camera, and the encoder data and camera data carry the same timestamps and correspond to each other, so the transformation matrix from the current frame to the previous frame is obtained as

T_cr = T_rc · T_rr · T_rc⁻¹

where T_rr denotes the transformation matrix of the robot coordinates from t to t+1. Let p_c be the camera coordinate of a pixel of the current frame, and let π⁻¹ denote the pinhole-model operation that converts a 2D pixel coordinate into the camera coordinates of a 3D point; the result of projecting the current-frame pixel p_c onto the previous frame is expressed as

p_c′ = π(T_cr · π⁻¹(p_c))

where T_rc is the extrinsic parameter matrix of the robot, determined by the relative poses of the robot and the camera. Matching feature points are then searched within a 3σ pixel region around p_c′; the encoder error obeys a normal distribution, and σ is the standard deviation of that distribution;
(3) Then select the keypoint with the highest descriptor matching score in the region as the matching point; traverse all matching points and check their depth values by re-projection to eliminate mismatches caused by the dynamic environment. First project the matched feature-point pair to world coordinates to obtain a 3D point p, then project this 3D point into the nearest keyframe of the frame, reconstructing the map point on that keyframe as

p′ = T_kw · p

where T_kw is the transformation from world coordinates to the keyframe camera coordinates. If the map point corresponding to this matching point is static, the depth values of p′ and p are unchanged; therefore, if the difference between the depth values is larger than an empirical threshold d_th, the matching point is considered a dynamic point and its depth value is set to zero; d_th is obtained from the formula d_th = 0.2 + 0.025·z′;
(4) Call the EPnP function provided in OpenCV and compute the pose between the two frames of pictures with the EPnP function and the matched keypoints;
(5) Finally, perform the keyframe judgment: decide whether the current frame satisfies the conditions for being a keyframe, and regard it as a keyframe if it meets all of the following; the map points are obtained by converting the feature points on keyframes into world coordinates;
(1) more than 20 frames have passed since the last global relocalization;
(2) the local mapping thread is idle, or more than 20 frames have passed since the last keyframe was inserted;
(3) the current frame tracks at least 50 map points;
(4) the map points tracked by the current frame are fewer than 90% of the map points in the reference keyframe;
the semantic octree map module comprises the following working steps:
after loop detection finishes, the semantic octree map module receives the color image and depth image obtained after semantic segmentation; let p_a = [x_a, y_a, 1]^T be a pixel of the color image; p_a is converted to world coordinates as

P_w = T_wc · z_a · K⁻¹ · p_a

where K is the camera intrinsic matrix, z_a is the depth value at p_a, and T_wc is the camera-to-world transformation;
each pixel of the color map is traversed and converted into world coordinates, yielding 3D points in the world coordinate system; the 3D points are then converted into the PCL color point-cloud type PointXYZRGB, a six-dimensional variable whose first three dimensions are the world coordinates of the point cloud and whose last three dimensions are its color; after segmentation, different labels have different colors in the color image, and the point-cloud data are colored accordingly; class-0 labels are the background, so after the pixels corresponding to class-0 labels are converted into point-cloud data the points are set to silvery white, and the other labels are set to different colors in turn;
the point-cloud data containing semantic information are converted directly into an octree map: the octomap library is called in C++, a ColorOcTree octree object is created, the point-cloud data are inserted into the octree map, and finally the resolution of the octree map is set to 0.05, the hit probability to 0.7 and the miss probability to 0.55; the finished octree is then published on the ROS topic /octomap.
2. The method of claim 1, wherein the pre-processing of the encoder data is specifically:
firstly, at time t the pose of the robot is ξ_t = [x_t, y_t, θ_t]^T, and according to the odometer motion model the pose of the robot at time t+1 is obtained as:

ξ_{t+1} = ξ_t + [ Δs·cos(θ_t + Δθ/2),  Δs·sin(θ_t + Δθ/2),  Δθ ]^T

where Δs is the distance moved by the center of the robot and Δθ is the angle through which the robot turns; the readings of the left and right code wheels of the encoder are obtained by subscribing to the encoder topic, and the readings and the robot motion satisfy the following relations:

Δs = (Δs_r + Δs_l) / 2,  Δθ = (Δs_r − Δs_l) / b

where b is the wheelbase between the two wheels, Δs_r is the reading increment of the right code wheel from t to t+1 and Δs_l is the reading increment of the left code wheel from t to t+1; the transformation matrix of the robot coordinates from t to t+1 is therefore obtained as:

T_rr =
[ cos Δθ   −sin Δθ   Δs·cos(Δθ/2) ]
[ sin Δθ    cos Δθ   Δs·sin(Δθ/2) ]
[ 0         0        1            ]
3. The method of claim 1, wherein the local map optimization module is configured to manage map points and keyframes and to optimize the transformation matrices between map points and keyframes through local-map BA; the working steps of the local map optimization module are as follows:
A. Map point culling: remove from the system the map points that do not meet the following conditions:
(1) the point must be tracked in more than 25% of the keyframes that are predicted to observe it;
(2) it must be observed by more than three key frames;
B. Create new map points: after a new keyframe is generated, feature matching is performed between this keyframe and the keyframes connected to it, and the matched feature points are triangulated to obtain 3D points, which are the new map points;
C. Local-map BA optimization: a cost function between map points and feature points is built from the re-projection relationship between the map points and the keyframe feature points, and the camera extrinsic parameters are then adjusted to minimize the error value of the cost function;
D. Keyframe elimination: if more than 90% of the keypoints of a keyframe can be observed by more than 3 other keyframes, that keyframe is considered redundant and is eliminated.
4. The method of claim 1, wherein the loop detection module comprises the following steps:
after the local map optimization module finishes, the keyframes are transmitted to the loop detection module, which compares the bag-of-words vector of the current keyframe against the bag-of-words vectors of the other keyframes with a similarity test, as in the following formula:

s(v_1, v_2) = 1 − (1/2)·‖ v_1/‖v_1‖ − v_2/‖v_2‖ ‖

where s(v_1, v_2) is the result of the similarity test, and v_1 and v_2 are bag-of-words vectors; a bag-of-words vector reflects the feature-point information of one image and is obtained by calling the library functions of the bag-of-words model; if the similarity between the current keyframe and some previous keyframe exceeds 1.1 times the similarity threshold, the two keyframes are considered to have the same pose, loop detection succeeds, and the result is used as a condition for global keyframe optimization; the similarity threshold is obtained by computing the similarity between the keyframe and its four neighboring keyframes and taking the highest similarity value.
CN201910848375.6A 2019-09-09 2019-09-09 Instant positioning and map construction method suitable for dynamic environment Active CN110827395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910848375.6A CN110827395B (en) 2019-09-09 2019-09-09 Instant positioning and map construction method suitable for dynamic environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910848375.6A CN110827395B (en) 2019-09-09 2019-09-09 Instant positioning and map construction method suitable for dynamic environment

Publications (2)

Publication Number Publication Date
CN110827395A CN110827395A (en) 2020-02-21
CN110827395B true CN110827395B (en) 2023-01-20

Family

ID=69547981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910848375.6A Active CN110827395B (en) 2019-09-09 2019-09-09 Instant positioning and map construction method suitable for dynamic environment

Country Status (1)

Country Link
CN (1) CN110827395B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100877071B1 (en) * 2007-07-18 2009-01-07 삼성전자주식회사 Method and apparatus of pose estimation in a mobile robot based on particle filter
KR101572851B1 (en) * 2008-12-22 2015-11-30 삼성전자 주식회사 Method for building map of mobile platform in dynamic environment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106123890A (en) * 2016-06-14 2016-11-16 Hefei Institutes of Physical Science, Chinese Academy of Sciences A robot localization method based on fusion
CN107516326A (en) * 2017-07-14 2017-12-26 Institute of Computing Technology, Chinese Academy of Sciences Robot localization method and system fusing monocular vision and encoder information
CN108596974A (en) * 2018-04-04 2018-09-28 Tsinghua University Dynamic scene robot localization and mapping system and method
CN108594816A (en) * 2018-04-23 2018-09-28 Changsha University A method and system for localization and mapping based on an improved ORB-SLAM algorithm
CN109753913A (en) * 2018-12-28 2019-05-14 Southeast University Computationally efficient multi-modal video semantic segmentation method
CN110097553A (en) * 2019-04-10 2019-08-06 Southeast University Semantic mapping system based on simultaneous localization and 3D semantic segmentation
CN110095111A (en) * 2019-05-10 2019-08-06 Guangdong University of Technology Map scene construction method, construction system, and related apparatus

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DRE-SLAM: Dynamic RGB-D Encoder SLAM for a Differential-Drive Robot; Dongsheng Yang et al.; Remote Sensing; 2019-02-13; pp. 1-29 *
ICNet for Real-Time Semantic Segmentation on High-Resolution Images; Zhao H et al.; Proceedings of the European Conference on Computer Vision (ECCV); 2018-12-31; pp. 405-420 *
Research and Implementation of Simultaneous Localization and Sparse Map Construction Based on Monocular Vision; Wu Yong; China Master's Theses Full-text Database, Information Science and Technology; 2018-01-15; pp. I138-1304 *
Research on Fruit Tree Trunk Detection and Navigation for LiDAR-based Agricultural Mobile Robots; Zhang Yingying; China Master's Theses Full-text Database, Information Science and Technology; 2017-06-15; pp. I140-174 *
Research on Key Technologies of Vision-based Semantic SLAM; Wang Zemin; China Master's Theses Full-text Database, Information Science and Technology; 2019-05-15; pp. I138-1503 *
Research on Real-time Image Processing Algorithms in Mobile Systems; Ma Chaoqun; China Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology; 2018-04-15; pp. I138-2363 *

Also Published As

Publication number Publication date
CN110827395A (en) 2020-02-21

Similar Documents

Publication Publication Date Title
CN110827395B (en) Instant positioning and map construction method suitable for dynamic environment
CN111462135B (en) Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation
CN109345588B (en) Tag-based six-degree-of-freedom attitude estimation method
CN108229366B (en) Deep learning vehicle-mounted obstacle detection method based on radar and image data fusion
CN112734852B (en) Robot mapping method and device and computing equipment
WO2019153245A1 (en) Systems and methods for deep localization and segmentation with 3d semantic map
CN110706248A (en) Visual perception mapping algorithm based on SLAM and mobile robot
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN106875437B (en) RGBD three-dimensional reconstruction-oriented key frame extraction method
CN113506318B (en) Three-dimensional target perception method under vehicle-mounted edge scene
CN114413881B (en) Construction method, device and storage medium of high-precision vector map
CN112825192B (en) Object identification system and method based on machine learning
CN112101160B (en) Binocular semantic SLAM method for automatic driving scene
CN111179344A (en) Efficient mobile robot SLAM system for repairing semantic information
CN109934873A (en) Mark image acquiring method, device and equipment
CN112541423A (en) Synchronous positioning and map construction method and system
CN116643291A (en) SLAM method for removing dynamic targets by combining vision and laser radar
CN117523461B (en) Moving target tracking and positioning method based on airborne monocular camera
CN114140527A (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
CN115147344A (en) Three-dimensional detection and tracking method for parts in augmented reality assisted automobile maintenance
WO2020199072A1 (en) Autonomous driving dataset generation with automatic object labelling methods and apparatuses
CN113628334A (en) Visual SLAM method, device, terminal equipment and storage medium
CN112509110A (en) Automatic image data set acquisition and labeling framework for land confrontation intelligent agent
CN117132620A (en) Multi-target tracking method, system, storage medium and terminal for automatic driving scene
CN116468786B (en) Semantic SLAM method based on point-line combination and oriented to dynamic environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant