CN110827395A - Instant positioning and map construction method suitable for dynamic environment

Instant positioning and map construction method suitable for dynamic environment

Info

Publication number
CN110827395A
Authority
CN
China
Prior art keywords
map
frame
image
key
key frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910848375.6A
Other languages
Chinese (zh)
Other versions
CN110827395B (en)
Inventor
陈文峰
蔡述庭
李丰
徐伟峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201910848375.6A priority Critical patent/CN110827395B/en
Publication of CN110827395A publication Critical patent/CN110827395A/en
Application granted granted Critical
Publication of CN110827395B publication Critical patent/CN110827395B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/005 Tree description, e.g. octree, quadtree
    • G06T 17/05 Geographic models
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Abstract

The invention discloses an instant positioning and map construction (SLAM) method suitable for a dynamic environment, which comprises the following steps: first, extrinsic calibration is performed between a depth camera and an encoder to obtain the transformation matrix from encoder coordinates to depth-camera coordinates; each frame of image collected by the depth camera is then layered with an image pyramid, and ORB feature points are extracted from every pyramid layer; the encoder data serve as the constant-velocity model of the tracking module, the feature points of the previous key frame are projected onto the current frame and matched against the feature points of the current frame, and the pose between the two frames is obtained by triangulating the successfully matched key points. The tracking part of the invention uses the encoder data as the initial pose for camera re-projection, achieving accurate localization and tracking in a dynamic environment while saving a large amount of time compared with conventional semantic SLAM tracking modules, which must remove dynamic pixels in every frame.

Description

Instant positioning and map construction method suitable for dynamic environment
Technical Field
The invention relates to the technical field of positioning navigation and map building of mobile robots, in particular to an instant positioning and map building method suitable for a dynamic environment.
Background
SLAM (simultaneous localization and mapping) is widely applied in mobile robotics, unmanned aerial vehicles, augmented reality and other fields. In a dynamic environment, however, a conventional SLAM system is not robust. In a real scene full of dynamic objects, the tracking module of a classic SLAM system such as ORB_SLAM fails; worse, the classic system builds a static map from which dynamic objects cannot be removed, so the map becomes unusable. To adapt SLAM systems to dynamic scenes, the literature proposes various solutions. DS_SLAM is one effective solution: it performs semantic segmentation on every input frame, computes an essential matrix from the segmented image, uses the essential matrix to decide whether image regions belong to dynamic objects, and rejects them if they do. However, because every frame is semantically segmented, the system can hardly run in real time on a mobile platform, and the dense point cloud map it builds occupies a large amount of memory.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an instant positioning and map construction method suitable for a dynamic environment.
The purpose of the invention is realized by the following technical scheme:
An instant positioning and mapping method suitable for a dynamic environment comprises the following steps:
Step one, extrinsic calibration is performed between the depth camera and the encoder to obtain the transformation matrix from encoder coordinates to depth-camera coordinates; each frame of image collected by the depth camera is then layered with an image pyramid, and ORB feature points are extracted from every pyramid layer; the encoder data serve as the constant-velocity model of the tracking module, the feature points of the previous key frame are projected onto the current frame and matched against the feature points of the current frame, and the pose between the two frames is obtained by triangulating the successfully matched key points;
Step two, the tracking module decides whether the current frame is selected as a key frame based on whether more than 20 frames have passed since the last global optimization and whether at least 50 map points can be detected in the current frame; the key frame is then passed to the dynamic object removal module, which performs semantic segmentation of the key frame with a neural network;
Step three, the key frame selected by the tracking module is projected in the local map optimization module to obtain map points, which are then optimized with local BA; the loop detection module then uses a DBoW2 bag-of-words model for loop detection, all map points and key frames are globally optimized with BA, and the robot pose and the map points are refined;
Step four, the globally optimized key frames and the semantic segmentation results are received, the key frames are projected into a 3D point cloud, the point cloud is colored according to the different semantic segmentation labels, and a semantic octree map is then built in the semantic octree map module.
Preferably, the tracking module works as follows:
The depth camera collects color images and depth images at 30 frames per second; they are published as topics by the iai_kinect2 package in ROS, and the SLAM system subscribes to these two topics to receive the collected color and depth images. The mobile robot publishes encoder data at 90 Hz, which the SLAM system also obtains by subscribing to the corresponding topic; the system then preprocesses the encoder data and the images separately.
Preferably, the preprocessing of the encoder data is specifically:
First, at time t the robot pose is ξ_t = [x_t  y_t  z_t]^T, and from the odometer motion model the robot pose at time t+1 is obtained as follows (equation image not reproduced),
where Δs is the distance moved by the robot center and Δθ is the angle through which the robot turns. The readings of the left and right encoder discs are obtained by subscribing to the encoder topic, and they are related to the robot motion by
(equation image not reproduced)
where b is the wheel base between the two wheels, and Δs_r and Δs_l are the changes in the right and left encoder disc readings from time t to t+1. The transformation matrix of the robot coordinates from t to t+1 is therefore
(equation image not reproduced)
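The standard differential-drive odometry relations that these missing equations presumably correspond to are sketched below; treating the third pose component as the heading θ_t is our assumption, not a quotation of the patent:

    Δs = (Δs_r + Δs_l) / 2,        Δθ = (Δs_r − Δs_l) / b

    x_{t+1} = x_t + Δs·cos(θ_t + Δθ/2)
    y_{t+1} = y_t + Δs·sin(θ_t + Δθ/2)
    θ_{t+1} = θ_t + Δθ

The corresponding transformation of the robot coordinates from t to t+1 would then be a planar rotation by Δθ combined with the translation (x_{t+1} − x_t, y_{t+1} − y_t).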
preferably, the preprocessing the image is divided into preprocessing a color image and a depth image, specifically:
The color image obtained from the topic has three channels and must be converted to a single-channel grayscale image with an OpenCV library function; the depth map is likewise a three-channel image and needs to be converted by a mapping factor into a single-channel CV_32F image;
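As a minimal illustration, this conversion could be done with OpenCV in C++ roughly as follows; the 1/1000 scale factor (millimetres to metres) and the single-channel extraction are assumptions, since the text only states that a mapping factor is used:

    #include <opencv2/opencv.hpp>

    // Convert incoming frames to the formats the tracker expects.
    void preprocessFrames(const cv::Mat& colorIn, const cv::Mat& depthIn,
                          cv::Mat& grayOut, cv::Mat& depthOut) {
        // Three-channel BGR color image -> single-channel grayscale image.
        cv::cvtColor(colorIn, grayOut, cv::COLOR_BGR2GRAY);

        // Depth: keep a single channel, then rescale into a CV_32F image.
        cv::Mat depthSingle = depthIn;
        if (depthIn.channels() > 1)
            cv::extractChannel(depthIn, depthSingle, 0);
        // Assumed mapping factor: raw depth in millimetres -> metres.
        depthSingle.convertTo(depthOut, CV_32F, 1.0 / 1000.0);
    }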
After the data are preprocessed, the initialization part of the tracking module is entered and the first frame is selected as a key frame; then, whenever a new frame arrives, the main thread of the tracking module runs the following steps:
(1) Using the image pyramid, extract ORB feature points, i.e. ORB key points and descriptors, at 12 scale levels;
(2) Re-project the feature points using the encoder model. The data preprocessing yields the transformation matrix of the robot coordinates from t to t+1; because the encoder acquisition frequency is three times the camera frequency and the encoder and camera data carry matching time stamps, the transformation matrix from the current frame to the previous frame can be obtained (equation image not reproduced).
Let p_c be the camera coordinate of a pixel of the current frame; in the camera pinhole model, converting a 2D pixel coordinate into the camera coordinate of a 3D point is a function operation. The result of projecting the current-frame pixel p_c into the previous frame can then be expressed as p_c' (equation image not reproduced), where the matrix in that expression (equation image not reproduced) is the extrinsic matrix of the robot, determined by the relative pose of the robot and the camera. Matched feature points are then searched within a 3σ pixel region around p_c'; the encoder error is assumed to follow a normal distribution with standard deviation σ;
(3) Select the key point with the highest descriptor matching score within the region as the matching point. All matching points are then traversed and their depth values are re-projected to reject mismatches caused by the dynamic environment: the matched feature point pair is first projected to world coordinates to obtain a 3D point p, which is then projected into the key frame nearest to this frame, and the map point p' is reconstructed on that key frame (equation image not reproduced).
If the map point corresponding to the matching point is static, the depth value between p' and p is unchanged; therefore, if the difference between the depth values exceeds an empirical threshold d_th, the matching point is a dynamic point and its depth value is set to zero. The threshold is obtained from d_th = d_b + k_d·z', where k_d is related to scale and d_b is a base threshold; typically d_b = 0.2 m and k_d = 0.025 (a minimal sketch of this test is given after this list);
(4) Call the PnP function provided by OpenCV and compute the pose between the two frames from the matched key points using EPnP;
(5) Finally, key-frame selection: the current frame is judged against the conditions for being a key frame and is taken as one if it satisfies all of the following (map points are obtained by transforming the feature points of key frames into world coordinates):
① more than 20 frames have passed since the last global relocalization;
② the local mapping thread is idle, or more than 20 frames have passed since the last key-frame insertion;
③ the current frame tracks at least 50 map points;
④ the current frame tracks less than 90% of the map points of the reference key frame.
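A minimal C++ sketch of the dynamic-point depth test in step (3); the data structure and the example values are illustrative assumptions, and only the threshold formula and the constants d_b = 0.2 m, k_d = 0.025 come from the text:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Simplified stand-in for one matched feature pair.
    struct MatchedPoint {
        float zMapPoint;   // depth z' of the map point re-projected onto the key frame
        float zCurrent;    // depth of the matched point p in the current frame
    };

    // A match is considered static only if the two depths agree within the
    // adaptive threshold d_th = d_b + k_d * z'.
    bool isStaticMatch(const MatchedPoint& m, float d_b = 0.2f, float k_d = 0.025f) {
        const float d_th = d_b + k_d * m.zMapPoint;
        return std::fabs(m.zMapPoint - m.zCurrent) <= d_th;
    }

    int main() {
        std::vector<MatchedPoint> matches = {{2.0f, 2.05f}, {2.0f, 1.2f}};
        for (const auto& m : matches) {
            // Dynamic matches have their depth set to zero and are excluded
            // from the subsequent EPnP pose estimation.
            std::printf("match is %s\n", isStaticMatch(m) ? "static" : "dynamic");
        }
        return 0;
    }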
Preferably, the local map optimization module mainly manages map points and key frames, and optimizes a transformation matrix between the map points and the key frames through local BA; the working steps of the local map optimization module are as follows:
(1) Map point culling: map points that do not satisfy the following conditions are removed from the system:
① of the key frames predicted to be able to observe the point, more than 25% actually track it;
② the point must be observed by more than three key frames;
(2) New map point creation: after a new key frame is generated, it is feature-matched with the key frames connected to it, and the matched feature points are triangulated into 3D points, which become new map points;
(3) Local map BA optimization: using the re-projection relationship between map points and key-frame feature points, a cost function between map points and feature points is established, and the camera extrinsic parameters are adjusted so that the error value of the cost function is minimized (a hedged form of this cost is sketched after this list);
(4) Key frame culling: if more than 90% of the key points of a key frame can be observed by more than 3 other key frames, the key frame is considered redundant and is removed.
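For reference, a common form of the local BA cost implied in step (3) is sketched below with our own notation (an assumption, not the patent's formula): with K the camera intrinsics, T_j the pose of key frame j, P_i a map point, u_{ij} the observed feature location, and ρ a robust kernel,

    min over {T_j}, {P_i} of  Σ_{i,j}  ρ( ‖ u_{ij} − π(K, T_j · P_i) ‖² )

where π(·) denotes pinhole projection of a 3D point into the image.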
Preferably, the dynamic object elimination module comprises the following working steps:
When the tracking module generates a new key frame, the color image and depth image corresponding to that key frame are published as a ROS topic, and the dynamic object removal thread receives the topic, thereby obtaining the color image and depth image from which dynamic pixels must be removed. Because the semantic segmentation network used in the key-frame dynamic object removal module runs on the PyTorch platform and must be implemented in Python, while the other modules are implemented in C++, ROS communication is used to bridge the two platforms;
Before semantic segmentation can be performed, the segmentation network must be trained; the ICNet network is trained on the PASCAL VOC 2012 dataset. This dataset is chosen because it labels the background as one class and other objects, such as people, tables and chairs, as separate classes, which makes it convenient to separate the background from dynamic objects when an image is semantically segmented;
After the color image and the depth image are received, the color image is segmented with ICNet; the segmentation result is a single-channel grayscale image, which is converted to a three-channel color image with OpenCV, and the labels on the color image are then colored with a pre-made color bar; when the coloring is finished, the final semantic segmentation result is obtained;
Semantic segmentation partitions the picture using pre-defined labels, in which class 0 is the background and class 15 is a person; for an indoor dynamic scene the dynamic object to be removed is mainly a person. The image is therefore first binarized by setting the gray value of class-15 pixels to 255, and the class-15 pixel region is then dilated with an OpenCV region-expansion function; finally, in the depth map, the depth value of every pixel that is 255 in the corresponding binary image is set to zero, removing the dynamic pixels;
Finally, the depth map with the dynamic pixels removed and the semantically segmented color map are published back to the SLAM system as a topic; a minimal sketch of the binarize-dilate-and-zero step is given below.
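A minimal OpenCV/C++ sketch of that step; the assumption is that the segmentation result arrives as a single-channel label image with person = class 15, and the 15×15 structuring element is an illustrative choice not stated in the patent:

    #include <opencv2/opencv.hpp>

    // Remove dynamic (person) pixels from the depth map using the segmentation labels.
    void removeDynamicPixels(const cv::Mat& labels, cv::Mat& depth) {
        // Binarize: person pixels (class 15) -> 255, everything else -> 0.
        cv::Mat personMask = (labels == 15);

        // Dilate the person region so that boundary pixels are removed as well.
        cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(15, 15));
        cv::dilate(personMask, personMask, kernel);

        // Zero the depth wherever the mask is set, excluding these pixels from mapping.
        depth.setTo(0.0f, personMask);
    }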
Preferably, the working steps of the loop detection module are as follows:
After the local map optimization module finishes, the key frame is passed to the loop detection module, which compares the bag-of-words vector of the current key frame against the bag-of-words vectors of the other key frames with the similarity test shown in the following formula
(equation image not reproduced)
where v_1 and v_2 are the bag-of-words vectors; a bag-of-words vector reflects the feature-point information of one image and is obtained by calling the bag-of-words model's library functions. If the similarity between the current key frame and some earlier key frame is more than 1.1 times the similarity threshold, the two key frames are considered to have the same pose, loop detection succeeds, and this is used as a condition for global key-frame optimization. The similarity threshold is obtained by computing the similarity between the key frame and its four neighbouring key frames and taking the highest of these values.
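The missing similarity formula is presumably the standard DBoW2 score between two normalized bag-of-words vectors, given here as an assumption rather than a quotation of the patent:

    s(v_1, v_2) = 1 − (1/2) · ‖ v_1/‖v_1‖ − v_2/‖v_2‖ ‖₁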
Preferably, the semantic octree map module works as follows:
After loop detection finishes, the semantic octree map module receives the color image and depth image obtained after semantic segmentation. Let p_a = [x_a  y_a  1]^T be a pixel in the color image; p_a is converted to world coordinates as follows
(equation image not reproduced)
Every pixel of the color map is traversed and converted to world coordinates, giving 3D points in the world coordinate system. Each 3D point is then converted into the PCL colored point type PointXYZRGB, a six-dimensional variable whose first three dimensions are the world coordinates of the point and whose last three dimensions are its color. After segmentation, different labels have different colors in the color image, and the point cloud is colored accordingly: class-0 (background) pixels are set to silvery white after conversion to point cloud data, and the remaining labels are assigned distinct colors in turn;
Once point cloud data containing semantic information exist, they can be converted directly into an octree map: the octomap library is called from C++, a ColorOcTree octree object is created, the point cloud data are inserted into the octree map, and the map resolution is set to 0.05, the hit probability to 0.7 and the miss probability to 0.55; the constructed octree is then published on the ROS topic /octomap.
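A minimal C++ sketch of this map construction step, assuming the semantic point cloud arrives as a PCL cloud of PointXYZRGB; the octomap calls reflect our reading of the ColorOcTree API and are not quoted from the patent (publishing on /octomap is omitted):

    #include <octomap/ColorOcTree.h>
    #include <pcl/point_cloud.h>
    #include <pcl/point_types.h>

    // Insert a colored semantic point cloud into an octree map.
    // Probabilities and resolution follow the values stated in the text.
    void buildSemanticOctree(const pcl::PointCloud<pcl::PointXYZRGB>& cloud,
                             octomap::ColorOcTree& tree) {
        tree.setProbHit(0.7);
        tree.setProbMiss(0.55);
        for (const auto& pt : cloud) {
            // Mark the voxel as occupied, then attach the semantic color.
            tree.updateNode(octomap::point3d(pt.x, pt.y, pt.z), true);
            tree.integrateNodeColor(pt.x, pt.y, pt.z, pt.r, pt.g, pt.b);
        }
        tree.updateInnerOccupancy();   // propagate occupancy and color to inner nodes
    }

    // Usage: octomap::ColorOcTree tree(0.05);  buildSemanticOctree(cloud, tree);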
Compared with the prior art, the invention has the following beneficial effects:
(1) The tracking part of the invention uses encoder data as the initial pose for camera re-projection, achieving accurate localization and tracking in a dynamic environment and saving a large amount of time compared with conventional semantic SLAM tracking modules, which must remove dynamic pixels in every frame. Semantic segmentation is applied only to key frames, which is sufficient to remove dynamic pixels from the map and build a reliable static map. At the same time, the semantic segmentation part is innovatively placed on a TensorFlow platform rather than the caffe platform of conventional semantic SLAM, which improves both the speed and the accuracy of semantic segmentation;
(2) Unlike existing encoder-fusion SLAM schemes, the encoder here is used only as the constant-velocity model when the color-image feature points are re-projected; in conventional dynamic-environment SLAM that fuses encoder data, the encoder serves as the main sensor and the encoder data are also optimized in local map optimization and loop detection. Because the encoder is prone to wheel slip, its data may become invalid, so using it as the main sensor can introduce considerable error in some scenarios; using the encoder only as the constant-velocity model effectively avoids this problem.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a graph comparing the performance of the ICNet network used in the present invention with that of a conventional semantic segmentation network.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
The invention provides an instant positioning and map construction method suitable for a dynamic environment. Encoder data are integrated on the basis of the DS_SLAM system so that the tracking part is more robust in a dynamic environment; only key frames are subjected to image segmentation and the segmentation result is used only for map building, which greatly shortens the running time. In addition, a semantic segmentation network on the TensorFlow platform is innovatively combined, and the new network performs better than semantic segmentation networks under the traditional caffe framework; a ROS multi-node communication mechanism is introduced so that a separate node thread is opened for semantic segmentation; finally, a semantic octree map is built from the semantic segmentation result.
Specifically, as shown in fig. 1 to 2, a method for instant positioning and map construction suitable for a dynamic environment includes the following steps:
Step one, extrinsic calibration is performed between the depth camera and the encoder to obtain the transformation matrix from encoder coordinates to depth-camera coordinates; each frame of image collected by the depth camera is then layered with an image pyramid, and ORB feature points are extracted from every pyramid layer; the encoder data serve as the constant-velocity model of the tracking module, the feature points of the previous key frame are projected onto the current frame and matched against the feature points of the current frame, and the pose between the two frames is obtained by triangulating the successfully matched key points;
the tracking module comprises the following working steps:
The depth camera collects color images and depth images at 30 frames per second; they are published as topics by the iai_kinect2 package in ROS, and the SLAM system subscribes to these two topics to receive the collected color and depth images. The mobile robot publishes encoder data at 90 Hz, which the SLAM system also obtains by subscribing to the corresponding topic; the system then preprocesses the encoder data and the images separately.
The encoder data preprocessing specifically comprises:
First, at time t the robot pose is ξ_t = [x_t  y_t  z_t]^T, and from the odometer motion model the robot pose at time t+1 is obtained as follows
(equation image not reproduced)
where Δs is the distance moved by the robot center and Δθ is the angle through which the robot turns. The readings of the left and right encoder discs are obtained by subscribing to the encoder topic, and they are related to the robot motion by
(equation image not reproduced)
where b is the wheel base between the two wheels, and Δs_r and Δs_l are the changes in the right and left encoder disc readings from time t to t+1. The transformation matrix of the robot coordinates from t to t+1 is therefore
(equation image not reproduced)
The image preprocessing is divided into preprocessing of the color image and of the depth image:
The color image obtained from the topic has three channels and must be converted to a single-channel grayscale image with an OpenCV library function; the depth map is likewise a three-channel image and needs to be converted by a mapping factor into a single-channel CV_32F image;
After the data are preprocessed, the initialization part of the tracking module is entered and the first frame is selected as a key frame; then, whenever a new frame arrives, the main thread of the tracking module runs the following steps:
(1) Using the image pyramid, extract ORB feature points, i.e. ORB key points and descriptors, at 12 scale levels;
(2) Re-project the feature points using the encoder model. The data preprocessing yields the transformation matrix of the robot coordinates from t to t+1; because the encoder acquisition frequency is three times the camera frequency and the encoder and camera data carry matching time stamps, the transformation matrix from the current frame to the previous frame can be obtained (equation image not reproduced).
Let p_c be the camera coordinate of a pixel of the current frame; in the camera pinhole model, converting a 2D pixel coordinate into the camera coordinate of a 3D point is a function operation. The result of projecting the current-frame pixel p_c into the previous frame can then be expressed as p_c' (equation image not reproduced), where the matrix in that expression (equation image not reproduced) is the extrinsic matrix of the robot, determined by the relative pose of the robot and the camera. Matched feature points are then searched within a 3σ pixel region around p_c'; the encoder error is assumed to follow a normal distribution with standard deviation σ;
(3) Select the key point with the highest descriptor matching score within the region as the matching point. All matching points are then traversed and their depth values are re-projected to reject mismatches caused by the dynamic environment: the matched feature point pair is first projected to world coordinates to obtain a 3D point p, which is then projected into the key frame nearest to this frame, and the map point p' is reconstructed on that key frame (equation image not reproduced).
If the map point corresponding to the matching point is static, the depth value between p' and p is unchanged; therefore, if the difference between the depth values exceeds an empirical threshold d_th, the matching point is a dynamic point and its depth value is set to zero. The threshold is obtained from d_th = d_b + k_d·z', where k_d is related to scale and d_b is a base threshold; typically d_b = 0.2 m and k_d = 0.025;
(4) Call the PnP function provided by OpenCV and compute the pose between the two frames from the matched key points using EPnP;
(5) Finally, key-frame selection: the current frame is judged against the conditions for being a key frame and is taken as one if it satisfies all of the following (map points are obtained by transforming the feature points of key frames into world coordinates):
① more than 20 frames have passed since the last global relocalization;
② the local mapping thread is idle, or more than 20 frames have passed since the last key-frame insertion;
③ the current frame tracks at least 50 map points;
④ the current frame tracks less than 90% of the map points of the reference key frame.
Step two, the tracking module decides whether the current frame is selected as a key frame based on whether more than 20 frames have passed since the last global optimization and whether at least 50 map points can be detected in the current frame; the key frame is then passed to the dynamic object removal module, which performs semantic segmentation of the key frame with a neural network;
the dynamic object removing module comprises the following working steps:
When the tracking module generates a new key frame, the color image and depth image corresponding to that key frame are published as a ROS topic, and the dynamic object removal thread receives the topic, thereby obtaining the color image and depth image from which dynamic pixels must be removed. Because the semantic segmentation network used in the key-frame dynamic object removal module runs on the PyTorch platform and must be implemented in Python, while the other modules are implemented in C++, ROS communication is used to bridge the two platforms;
Before semantic segmentation can be performed, the segmentation network must be trained; the ICNet network is trained on the PASCAL VOC 2012 dataset. This dataset is chosen because it labels the background as one class and other objects, such as people, tables and chairs, as separate classes, which makes it convenient to separate the background from dynamic objects when an image is semantically segmented;
After the color image and the depth image are received, the color image is segmented with ICNet; the segmentation result is a single-channel grayscale image, which is converted to a three-channel color image with OpenCV, and the labels on the color image are then colored with a pre-made color bar; when the coloring is finished, the final semantic segmentation result is obtained;
Semantic segmentation partitions the picture using pre-defined labels, in which class 0 is the background and class 15 is a person; for an indoor dynamic scene the dynamic object to be removed is mainly a person. The image is therefore first binarized by setting the gray value of class-15 pixels to 255, and the class-15 pixel region is then dilated with an OpenCV region-expansion function; finally, in the depth map, the depth value of every pixel that is 255 in the corresponding binary image is set to zero, removing the dynamic pixels;
Finally, the depth map with the dynamic pixels removed and the semantically segmented color map are published back to the SLAM system as a topic.
Step three, the key frame selected by the tracking module is projected in the local map optimization module to obtain map points, which are then optimized with local BA; the loop detection module then uses a DBoW2 bag-of-words model for loop detection, all map points and key frames are globally optimized with BA, and the robot pose and the map points are refined;
the local map optimization module is mainly used for managing map points and key frames and optimizing a conversion matrix between the map points and the key frames through local BA; the working steps of the local map optimization module are as follows:
(1) Map point culling: map points that do not satisfy the following conditions are removed from the system:
① of the key frames predicted to be able to observe the point, more than 25% actually track it;
② the point must be observed by more than three key frames;
(2) New map point creation: after a new key frame is generated, it is feature-matched with the key frames connected to it, and the matched feature points are triangulated into 3D points, which become new map points;
(3) Local map BA optimization: using the re-projection relationship between map points and key-frame feature points, a cost function between map points and feature points is established, and the camera extrinsic parameters are adjusted so that the error value of the cost function is minimized;
(4) Key frame culling: if more than 90% of the key points of a key frame can be observed by more than 3 other key frames, the key frame is considered redundant and is removed.
The working steps of the loop detection module are as follows:
After the local map optimization module finishes, the key frame is passed to the loop detection module, which compares the bag-of-words vector of the current key frame against the bag-of-words vectors of the other key frames with the similarity test shown in the following formula
(equation image not reproduced)
where v_1 and v_2 are the bag-of-words vectors; a bag-of-words vector reflects the feature-point information of one image and is obtained by calling the bag-of-words model's library functions. If the similarity between the current key frame and some earlier key frame is more than 1.1 times the similarity threshold, the two key frames are considered to have the same pose, loop detection succeeds, and this is used as a condition for global key-frame optimization. The similarity threshold is obtained by computing the similarity between the key frame and its four neighbouring key frames and taking the highest of these values.
Step four, the globally optimized key frames and the semantic segmentation results are received, the key frames are projected into a 3D point cloud, the point cloud is colored according to the different semantic segmentation labels, and a semantic octree map is then built in the semantic octree map module.
The semantic octree map module comprises the following working steps:
After loop detection finishes, the semantic octree map module receives the color image and depth image obtained after semantic segmentation. Let p_a = [x_a  y_a  1]^T be a pixel in the color image; p_a is converted to world coordinates as follows
(equation image not reproduced)
Every pixel of the color map is traversed and converted to world coordinates, giving 3D points in the world coordinate system. Each 3D point is then converted into the PCL colored point type PointXYZRGB, a six-dimensional variable whose first three dimensions are the world coordinates of the point and whose last three dimensions are its color. After segmentation, different labels have different colors in the color image, and the point cloud is colored accordingly: class-0 (background) pixels are set to silvery white after conversion to point cloud data, and the remaining labels are assigned distinct colors in turn;
Once point cloud data containing semantic information exist, they can be converted directly into an octree map: the octomap library is called from C++, a ColorOcTree octree object is created, the point cloud data are inserted into the octree map, and the map resolution is set to 0.05, the hit probability to 0.7 and the miss probability to 0.55; the constructed octree is then published on the ROS topic /octomap.
Conventional semantic segmentation SLAM is generally based on the caffe framework, under which few semantic segmentation networks are available to choose from, and even then the segmentation is slow and the effect is poor. In the present invention, as shown in FIG. 2, the abscissa is the number of frames processed per second and the ordinate is the segmentation accuracy in mIoU (%); in terms of both forward running time and accuracy, ICNet is far superior to the SegNet network used in conventional semantic segmentation frameworks, and its performance reaches real time.
Compared with prior-art approaches such as the SLAM method based on laser radar in a dynamic environment, the robot SLAM object-state detection method in a dynamic sparse environment, and the indoor personnel autonomous positioning method based on the fusion of SLAM and a gait IMU, the main innovations of the invention are:
1. An innovative way of fusing encoder data is used: the encoder data serve only as the transformation matrix for re-projection between the current frame and the key frame of the camera, realizing a shallow fusion of encoder data and camera data and avoiding the large impact on the system that encoder errors caused by wheel slip would otherwise have.
2. C++/Python multi-node communication is realized with the ROS system, so that a semantic segmentation network under the TensorFlow framework is innovatively added to the SLAM system, effectively improving the semantic segmentation performance.
3. A semantic octree map is innovatively constructed, making full use of the image information obtained from semantic segmentation and building a more complete and practical semantic map.
The tracking part of the invention uses encoder data as the initial pose for camera re-projection, achieving accurate localization and tracking in a dynamic environment and saving a large amount of time compared with conventional semantic SLAM tracking modules, which must remove dynamic pixels in every frame. Semantic segmentation is applied only to key frames, which is sufficient to remove dynamic pixels from the map and build a reliable static map. At the same time, the semantic segmentation part is innovatively placed on a TensorFlow platform rather than the caffe platform of conventional semantic SLAM, which improves both the speed and the accuracy of semantic segmentation.
Unlike existing encoder-fusion SLAM schemes, the encoder here is used only as the constant-velocity model when the color-image feature points are re-projected; in conventional dynamic-environment SLAM that fuses encoder data, the encoder serves as the main sensor and the encoder data are also optimized in local map optimization and loop detection. Because the encoder is prone to wheel slip, its data may become invalid, so using it as the main sensor can introduce considerable error in some scenarios; using the encoder only as the constant-velocity model effectively avoids this problem.
The present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents and are included in the scope of the present invention.

Claims (8)

1. An instant positioning and mapping method suitable for a dynamic environment, characterized by comprising the following steps:
Step one, extrinsic calibration is performed between the depth camera and the encoder to obtain the transformation matrix from encoder coordinates to depth-camera coordinates; each frame of image collected by the depth camera is then layered with an image pyramid, and ORB feature points are extracted from every pyramid layer; the encoder data serve as the constant-velocity model of the tracking module, the feature points of the previous key frame are projected onto the current frame and matched against the feature points of the current frame, and the pose between the two frames is obtained by triangulating the successfully matched key points;
Step two, the tracking module decides whether the current frame is selected as a key frame based on whether more than 20 frames have passed since the last global optimization and whether at least 50 map points can be detected in the current frame; the key frame is then passed to the dynamic object removal module, which performs semantic segmentation of the key frame with a neural network;
Step three, the key frame selected by the tracking module is projected in the local map optimization module to obtain map points, which are then optimized with local BA; the loop detection module then uses a DBoW2 bag-of-words model for loop detection, all map points and key frames are globally optimized with BA, and the robot pose and the map points are refined;
Step four, the globally optimized key frames and the semantic segmentation results are received, the key frames are projected into a 3D point cloud, the point cloud is colored according to the different semantic segmentation labels, and a semantic octree map is then built in the semantic octree map module.
2. The method of claim 1, wherein the tracking module comprises the following steps:
The depth camera collects color images and depth images at 30 frames per second; they are published as topics by the iai_kinect2 package in ROS, and the SLAM system subscribes to these two topics to receive the collected color and depth images. The mobile robot publishes encoder data at 90 Hz, which the SLAM system also obtains by subscribing to the corresponding topic; the system then preprocesses the encoder data and the images separately.
3. The method of claim 2, wherein the pre-processing of the encoder data is specifically:
First, at time t the robot pose is ξ_t = [x_t  y_t  z_t]^T, and from the odometer motion model the robot pose at time t+1 is obtained as follows
(equation image not reproduced)
where Δs is the distance moved by the robot center and Δθ is the angle through which the robot turns. The readings of the left and right encoder discs are obtained by subscribing to the encoder topic, and they are related to the robot motion by
(equation image not reproduced)
where b is the wheel base between the two wheels, and Δs_r and Δs_l are the changes in the right and left encoder disc readings from time t to t+1. The transformation matrix of the robot coordinates from t to t+1 is therefore
(equation image not reproduced)
4. The method of claim 2, wherein the image preprocessing is divided into preprocessing of the color image and of the depth image, specifically:
The color image obtained from the topic has three channels and must be converted to a single-channel grayscale image with an OpenCV library function; the depth map is likewise a three-channel image and needs to be converted by a mapping factor into a single-channel CV_32F image;
After the data are preprocessed, the initialization part of the tracking module is entered and the first frame is selected as a key frame; then, whenever a new frame arrives, the main thread of the tracking module runs the following steps:
(1) Using the image pyramid, extract ORB feature points, i.e. ORB key points and descriptors, at 12 scale levels;
(2) Re-project the feature points using the encoder model. The data preprocessing yields the transformation matrix of the robot coordinates from t to t+1; because the encoder acquisition frequency is three times the camera frequency and the encoder and camera data carry matching time stamps, the transformation matrix from the current frame to the previous frame can be obtained (equation image not reproduced).
Let p_c be the camera coordinate of a pixel of the current frame; in the camera pinhole model, converting a 2D pixel coordinate into the camera coordinate of a 3D point is a function operation. The result of projecting the current-frame pixel p_c into the previous frame can then be expressed as p_c' (equation image not reproduced), where the matrix in that expression (equation image not reproduced) is the extrinsic matrix of the robot, determined by the relative pose of the robot and the camera. Matched feature points are then searched within a 3σ pixel region around p_c'; the encoder error is assumed to follow a normal distribution with standard deviation σ;
(3) Select the key point with the highest descriptor matching score within the region as the matching point. All matching points are then traversed and their depth values are re-projected to reject mismatches caused by the dynamic environment: the matched feature point pair is first projected to world coordinates to obtain a 3D point p, which is then projected into the key frame nearest to this frame, and the map point p' is reconstructed on that key frame (equation image not reproduced).
If the map point corresponding to the matching point is static, the depth value between p' and p is unchanged; therefore, if the difference between the depth values exceeds an empirical threshold d_th, the matching point is a dynamic point and its depth value is set to zero. The threshold is obtained from d_th = d_b + k_d·z', where k_d is related to scale and d_b is a base threshold; typically d_b = 0.2 m and k_d = 0.025;
(4) Call the PnP function provided by OpenCV and compute the pose between the two frames from the matched key points using EPnP;
(5) Finally, key-frame selection: the current frame is judged against the conditions for being a key frame and is taken as one if it satisfies all of the following (map points are obtained by transforming the feature points of key frames into world coordinates):
① more than 20 frames have passed since the last global relocalization;
② the local mapping thread is idle, or more than 20 frames have passed since the last key-frame insertion;
③ the current frame tracks at least 50 map points;
④ the current frame tracks less than 90% of the map points of the reference key frame.
5. The method of claim 1, wherein the local map optimization module mainly manages map points and key frames, and optimizes a transformation matrix between map points and key frames through local BA; the working steps of the local map optimization module are as follows:
(1) Map point culling: map points that do not satisfy the following conditions are removed from the system:
① of the key frames predicted to be able to observe the point, more than 25% actually track it;
② the point must be observed by more than three key frames;
(2) New map point creation: after a new key frame is generated, it is feature-matched with the key frames connected to it, and the matched feature points are triangulated into 3D points, which become new map points;
(3) Local map BA optimization: using the re-projection relationship between map points and key-frame feature points, a cost function between map points and feature points is established, and the camera extrinsic parameters are adjusted so that the error value of the cost function is minimized;
(4) Key frame culling: if more than 90% of the key points of a key frame can be observed by more than 3 other key frames, the key frame is considered redundant and is removed.
6. The method for instant location and mapping in a dynamic environment as claimed in claim 1, wherein the dynamic object elimination module comprises the following steps:
When the tracking module generates a new key frame, the color image and depth image corresponding to that key frame are published as a ROS topic, and the dynamic object removal thread receives the topic, thereby obtaining the color image and depth image from which dynamic pixels must be removed. Because the semantic segmentation network used in the key-frame dynamic object removal module runs on the PyTorch platform and must be implemented in Python, while the other modules are implemented in C++, ROS communication is used to bridge the two platforms;
Before semantic segmentation can be performed, the segmentation network must be trained; the ICNet network is trained on the PASCAL VOC 2012 dataset. This dataset is chosen because it labels the background as one class and other objects, such as people, tables and chairs, as separate classes, which makes it convenient to separate the background from dynamic objects when an image is semantically segmented;
After the color image and the depth image are received, the color image is segmented with ICNet; the segmentation result is a single-channel grayscale image, which is converted to a three-channel color image with OpenCV, and the labels on the color image are then colored with a pre-made color bar; when the coloring is finished, the final semantic segmentation result is obtained;
Semantic segmentation partitions the picture using pre-defined labels, in which class 0 is the background and class 15 is a person; for an indoor dynamic scene the dynamic object to be removed is mainly a person. The image is therefore first binarized by setting the gray value of class-15 pixels to 255, and the class-15 pixel region is then dilated with an OpenCV region-expansion function; finally, in the depth map, the depth value of every pixel that is 255 in the corresponding binary image is set to zero, removing the dynamic pixels;
Finally, the depth map with the dynamic pixels removed and the semantically segmented color map are published back to the SLAM system as a topic.
7. The method of claim 1, wherein the loop detection module comprises the following steps:
After the local map optimization module finishes, the key frame is passed to the loop detection module, which compares the bag-of-words vector of the current key frame against the bag-of-words vectors of the other key frames with the similarity test shown in the following formula
(equation image not reproduced)
where v_1 and v_2 are the bag-of-words vectors; a bag-of-words vector reflects the feature-point information of one image and is obtained by calling the bag-of-words model's library functions. If the similarity between the current key frame and some earlier key frame is more than 1.1 times the similarity threshold, the two key frames are considered to have the same pose, loop detection succeeds, and this is used as a condition for global key-frame optimization. The similarity threshold is obtained by computing the similarity between the key frame and its four neighbouring key frames and taking the highest of these values.
8. The method for instant location and mapping in a dynamic environment of claim 1, wherein the semantic octree map module works as follows:
After loop detection finishes, the semantic octree map module receives the color image and depth image obtained after semantic segmentation. Let p_a = [x_a  y_a  1]^T be a pixel in the color image; p_a is converted to world coordinates as follows
(equation image not reproduced)
Every pixel of the color map is traversed and converted to world coordinates, giving 3D points in the world coordinate system. Each 3D point is then converted into the PCL colored point type PointXYZRGB, a six-dimensional variable whose first three dimensions are the world coordinates of the point and whose last three dimensions are its color. After segmentation, different labels have different colors in the color image, and the point cloud is colored accordingly: class-0 (background) pixels are set to silvery white after conversion to point cloud data, and the remaining labels are assigned distinct colors in turn;
Once point cloud data containing semantic information exist, they can be converted directly into an octree map: the octomap library is called from C++, a ColorOcTree octree object is created, the point cloud data are inserted into the octree map, and the map resolution is set to 0.05, the hit probability to 0.7 and the miss probability to 0.55; the constructed octree is then published on the ROS topic /octomap.
CN201910848375.6A 2019-09-09 2019-09-09 Instant positioning and map construction method suitable for dynamic environment Active CN110827395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910848375.6A CN110827395B (en) 2019-09-09 2019-09-09 Instant positioning and map construction method suitable for dynamic environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910848375.6A CN110827395B (en) 2019-09-09 2019-09-09 Instant positioning and map construction method suitable for dynamic environment

Publications (2)

Publication Number Publication Date
CN110827395A true CN110827395A (en) 2020-02-21
CN110827395B CN110827395B (en) 2023-01-20

Family

ID=69547981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910848375.6A Active CN110827395B (en) 2019-09-09 2019-09-09 Instant positioning and map construction method suitable for dynamic environment

Country Status (1)

Country Link
CN (1) CN110827395B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic SLAM-based dynamic environment camera pose estimation and semantic map construction method
CN111563442A (en) * 2020-04-29 2020-08-21 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN111652933A (en) * 2020-05-06 2020-09-11 Oppo广东移动通信有限公司 Monocular camera-based repositioning method and device, storage medium and electronic equipment
CN111664842A (en) * 2020-05-07 2020-09-15 苏州品坤智能科技有限公司 Instant positioning and map building system of unmanned sweeper
CN112116657A (en) * 2020-08-07 2020-12-22 中国科学院深圳先进技术研究院 Table retrieval-based simultaneous positioning and mapping method and device
CN112132857A (en) * 2020-09-18 2020-12-25 福州大学 Dynamic object detection and static map reconstruction method of dynamic environment hybrid vision system
CN112200874A (en) * 2020-10-30 2021-01-08 中国科学院自动化研究所 Multilayer scene reconstruction and rapid segmentation method, system and device in narrow space
CN112233180A (en) * 2020-10-23 2021-01-15 上海影谱科技有限公司 Map-based SLAM rapid initialization method and device and electronic equipment
CN112378409A (en) * 2020-12-01 2021-02-19 杭州宇芯机器人科技有限公司 Robot RGB-D SLAM method based on geometric and motion constraint in dynamic environment
CN113009453A (en) * 2020-03-20 2021-06-22 青岛慧拓智能机器有限公司 Mine road edge detection and map building method and device
CN113326716A (en) * 2020-02-28 2021-08-31 北京创奇视界科技有限公司 Loop detection method for guiding AR application positioning by assembling in-situ environment
CN113362363A (en) * 2021-06-18 2021-09-07 广东工业大学 Automatic image annotation method and device based on visual SLAM and storage medium
CN113763551A (en) * 2021-09-08 2021-12-07 北京易航远智科技有限公司 Point cloud-based rapid repositioning method for large-scale mapping scene
CN113902847A (en) * 2021-10-11 2022-01-07 岱悟智能科技(上海)有限公司 Monocular depth image pose optimization method based on three-dimensional feature constraint
WO2022041596A1 (en) * 2020-08-31 2022-03-03 同济人工智能研究院(苏州)有限公司 Visual slam method applicable to indoor dynamic environment
CN114926536A (en) * 2022-07-19 2022-08-19 合肥工业大学 Semantic-based positioning and mapping method and system and intelligent robot
CN115120346A (en) * 2022-08-30 2022-09-30 中国科学院自动化研究所 Target point positioning method and device, electronic equipment and bronchoscope system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024251A1 (en) * 2007-07-18 2009-01-22 Samsung Electronics Co., Ltd. Method and apparatus for estimating pose of mobile robot using particle filter
US20100161225A1 (en) * 2008-12-22 2010-06-24 Samsung Electronics Co., Ltd. Method of building map of mobile platform in dynamic environment
CN106123890A (en) * 2016-06-14 2016-11-16 中国科学院合肥物质科学研究院 A kind of robot localization method of Fusion
CN107516326A (en) * 2017-07-14 2017-12-26 中国科学院计算技术研究所 Merge monocular vision and the robot localization method and system of encoder information
CN108596974A (en) * 2018-04-04 2018-09-28 清华大学 Dynamic scene robot localization builds drawing system and method
CN108594816A (en) * 2018-04-23 2018-09-28 长沙学院 A kind of method and system for realizing positioning and composition by improving ORB-SLAM algorithms
CN109753913A (en) * 2018-12-28 2019-05-14 东南大学 Calculate efficient multi-mode video semantic segmentation method
CN110097553A (en) * 2019-04-10 2019-08-06 东南大学 The semanteme for building figure and three-dimensional semantic segmentation based on instant positioning builds drawing system
CN110095111A (en) * 2019-05-10 2019-08-06 广东工业大学 A kind of construction method of map scene, building system and relevant apparatus

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DONGSHENG YANG ET AL: "DRE-SLAM: Dynamic RGB-D Encoder SLAM for a Differential-Drive Robot", Remote Sensing *
ZHAO H ET AL: "ICNet for Real-Time Semantic Segmentation on High-Resolution Images", Proceedings of the European Conference on Computer Vision (ECCV) *
吴勇: "Research and Implementation of Simultaneous Localization and Sparse Map Construction Based on Monocular Vision", China Master's Theses Full-text Database, Information Science and Technology *
张莹莹: "Research on Fruit-Tree Trunk Detection and Navigation for a LiDAR-based Agricultural Mobile Robot", China Master's Theses Full-text Database, Information Science and Technology *
王泽民: "Research on Key Technologies of Vision-based Semantic SLAM", China Master's Theses Full-text Database, Information Science and Technology *
马超群: "Research on Real-time Image Processing Algorithms in Mobile Systems", China Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326716B (en) * 2020-02-28 2024-03-01 北京创奇视界科技有限公司 Loop detection method for AR application positioning of assembly guidance of assembly site environment
CN113326716A (en) * 2020-02-28 2021-08-31 北京创奇视界科技有限公司 Loop detection method for guiding AR application positioning by assembling in-situ environment
CN113009453A (en) * 2020-03-20 2021-06-22 青岛慧拓智能机器有限公司 Mine road edge detection and map building method and device
CN111402336B (en) * 2020-03-23 2024-03-12 中国科学院自动化研究所 Semantic SLAM-based dynamic environment camera pose estimation and semantic map construction method
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic SLAM-based dynamic environment camera pose estimation and semantic map construction method
CN111563442A (en) * 2020-04-29 2020-08-21 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN111563442B (en) * 2020-04-29 2023-05-02 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN111652933A (en) * 2020-05-06 2020-09-11 Oppo广东移动通信有限公司 Monocular camera-based repositioning method and device, storage medium and electronic equipment
CN111652933B (en) * 2020-05-06 2023-08-04 Oppo广东移动通信有限公司 Repositioning method and device based on monocular camera, storage medium and electronic equipment
CN111664842A (en) * 2020-05-07 2020-09-15 苏州品坤智能科技有限公司 Instant positioning and map building system of unmanned sweeper
CN112116657A (en) * 2020-08-07 2020-12-22 中国科学院深圳先进技术研究院 Table retrieval-based simultaneous positioning and mapping method and device
CN112116657B (en) * 2020-08-07 2023-12-19 中国科学院深圳先进技术研究院 Simultaneous positioning and mapping method and device based on table retrieval
WO2022041596A1 (en) * 2020-08-31 2022-03-03 同济人工智能研究院(苏州)有限公司 Visual slam method applicable to indoor dynamic environment
CN112132857B (en) * 2020-09-18 2023-04-07 福州大学 Dynamic object detection and static map reconstruction method of dynamic environment hybrid vision system
CN112132857A (en) * 2020-09-18 2020-12-25 福州大学 Dynamic object detection and static map reconstruction method of dynamic environment hybrid vision system
CN112233180B (en) * 2020-10-23 2024-03-15 上海影谱科技有限公司 Map-based SLAM rapid initialization method and device and electronic equipment
CN112233180A (en) * 2020-10-23 2021-01-15 上海影谱科技有限公司 Map-based SLAM rapid initialization method and device and electronic equipment
CN112200874B (en) * 2020-10-30 2022-06-21 中国科学院自动化研究所 Multilayer scene reconstruction and rapid segmentation method, system and device in narrow space
CN112200874A (en) * 2020-10-30 2021-01-08 中国科学院自动化研究所 Multilayer scene reconstruction and rapid segmentation method, system and device in narrow space
CN112378409A (en) * 2020-12-01 2021-02-19 杭州宇芯机器人科技有限公司 Robot RGB-D SLAM method based on geometric and motion constraint in dynamic environment
CN112378409B (en) * 2020-12-01 2022-08-12 杭州宇芯机器人科技有限公司 Robot RGB-D SLAM method based on geometric and motion constraint in dynamic environment
CN113362363A (en) * 2021-06-18 2021-09-07 广东工业大学 Automatic image annotation method and device based on visual SLAM and storage medium
CN113763551A (en) * 2021-09-08 2021-12-07 北京易航远智科技有限公司 Point cloud-based rapid repositioning method for large-scale mapping scene
CN113763551B (en) * 2021-09-08 2023-10-27 北京易航远智科技有限公司 Rapid repositioning method for large-scale map building scene based on point cloud
CN113902847A (en) * 2021-10-11 2022-01-07 岱悟智能科技(上海)有限公司 Monocular depth image pose optimization method based on three-dimensional feature constraint
CN113902847B (en) * 2021-10-11 2024-04-16 岱悟智能科技(上海)有限公司 Monocular depth image pose optimization method based on three-dimensional feature constraint
CN114926536B (en) * 2022-07-19 2022-10-14 合肥工业大学 Semantic-based positioning and mapping method and system and intelligent robot
CN114926536A (en) * 2022-07-19 2022-08-19 合肥工业大学 Semantic-based positioning and mapping method and system and intelligent robot
CN115120346B (en) * 2022-08-30 2023-02-17 中国科学院自动化研究所 Target point positioning device, electronic equipment and bronchoscope system
CN115120346A (en) * 2022-08-30 2022-09-30 中国科学院自动化研究所 Target point positioning method and device, electronic equipment and bronchoscope system

Also Published As

Publication number Publication date
CN110827395B (en) 2023-01-20

Similar Documents

Publication Publication Date Title
CN110827395B (en) Instant positioning and map construction method suitable for dynamic environment
CN111462135B (en) Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation
CN108229366B (en) Deep learning vehicle-mounted obstacle detection method based on radar and image data fusion
CN112734852B (en) Robot mapping method and device and computing equipment
CN111563442A (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN110706248A (en) Visual perception mapping algorithm based on SLAM and mobile robot
CN113506318B (en) Three-dimensional target perception method under vehicle-mounted edge scene
CN110765922A (en) AGV is with two mesh vision object detection barrier systems
CN111028267B (en) Monocular vision following system and method for mobile robot
CN110533720A (en) Semantic SLAM system and method based on joint constraint
CN111179344A (en) Efficient mobile robot SLAM system for repairing semantic information
CN110533716B (en) Semantic SLAM system and method based on 3D constraint
WO2020237516A1 (en) Point cloud processing method, device, and computer readable storage medium
CN114821507A (en) Multi-sensor fusion vehicle-road cooperative sensing method for automatic driving
CN113743391A (en) Three-dimensional obstacle detection system and method applied to low-speed autonomous driving robot
CN115410167A (en) Target detection and semantic segmentation method, device, equipment and storage medium
Yu et al. Accurate and robust visual localization system in large-scale appearance-changing environments
CN114140527A (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
CN112509110A (en) Automatic image data set acquisition and labeling framework for land confrontation intelligent agent
CN116642490A (en) Visual positioning navigation method based on hybrid map, robot and storage medium
CN116804744A (en) Target detection method based on vehicle-mounted laser radar and image information
Priya et al. 3DYOLO: Real-time 3D Object Detection in 3D Point Clouds for Autonomous Driving
CN116309817A (en) Tray detection and positioning method based on RGB-D camera
CN115147344A (en) Three-dimensional detection and tracking method for parts in augmented reality assisted automobile maintenance
CN115965673B (en) Centralized multi-robot positioning method based on binocular vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant