CN112833892B - Semantic mapping method based on track alignment - Google Patents

Semantic mapping method based on track alignment

Info

Publication number
CN112833892B
CN112833892B (application CN202011644778.8A)
Authority
CN
China
Prior art keywords
data
semantic
landmark
slam
orb
Prior art date
Legal status
Active
Application number
CN202011644778.8A
Other languages
Chinese (zh)
Other versions
CN112833892A (en)
Inventor
张亮
朱光明
汪火良
蒋得志
Current Assignee
Hangzhou Adaptive Technology Co ltd
Original Assignee
Hangzhou Adaptive Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Adaptive Technology Co ltd filed Critical Hangzhou Adaptive Technology Co ltd
Priority to CN202011644778.8A priority Critical patent/CN112833892B/en
Publication of CN112833892A publication Critical patent/CN112833892A/en
Application granted granted Critical
Publication of CN112833892B publication Critical patent/CN112833892B/en

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30 Map- or contour-matching
    • G01C21/32 Structuring or formatting of map data

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic mapping method based on track alignment, which comprises the following steps: S1, synchronously acquiring RGB data, laser data, IMU data and odometer data along the same running track; S2, through semantic ORB-SLAM, while ORB tracking is carried out on the RGB data, performing target detection on each keyframe image, constructing a triangulation relation, and extracting 3D semantic landmarks; S3, filtering the laser data with a down-sampling operation, and processing the IMU data and odometer data; S4, processing the laser data, the IMU data and the odometer data to construct a 2D grid map; and S5, solving the similarity transformation between the trajectory sequences generated by ORB-SLAM and the grid-mapping algorithm, and fusing the 3D semantic landmarks with the 2D grid map to generate a semantic map. The method can construct an accurate grid map with semantic landmarks.

Description

Semantic mapping method based on track alignment
Technical Field
The invention belongs to the field of semantic map building, and relates to a semantic map building method based on track alignment, which can be used for building a semantic map of a surrounding environment.
Background
SLAM (Simultaneous Localization and Mapping) technology helps a robot gradually construct a map of its surroundings from sensor measurements in an unknown environment, while also localizing the robot within that map as the environment information changes; this enables mobile robots to perform autonomous navigation tasks. Traditional SLAM falls into two categories, laser SLAM and visual SLAM, but both are limited to analyzing the geometric features of the environment, so a robot using a map generated by traditional SLAM can only perform basic functions such as obstacle avoidance and path planning. In recent years, research on semantic SLAM has advanced: with increasingly mature deep-learning-based image processing, semantic information about environmental objects is extracted from images and merged into the SLAM pipeline and its results, which further improves algorithm accuracy and enriches the resulting map. Laser SLAM can generate an environment map that is accurate and well suited for robot navigation thanks to the lidar sensor, but because the lidar obtains point clouds directly by the laser time-of-flight method, semantic information about the environment is hard to extract from it. Visual SLAM uses a camera to acquire an image sequence of the environment for localization and mapping; the sensor data carries richer environment information from which semantic information can be extracted, but it is easily affected by environmental conditions, so the mapping accuracy of visual SLAM is often worse than that of laser SLAM.
Traditional SLAM can only construct a geometric feature map of the environment, so the tasks a robot can complete with such a map are very limited. Semantic SLAM further improves the robot's understanding of the environment and allows it to complete more complex tasks with a semantic map; research on semantic mapping methods that fuse multiple sensors is therefore of great significance in the semantic SLAM field.
Disclosure of Invention
In order to solve the above problems, the present invention provides a semantic mapping method based on track alignment, which adds a target detection process into ORB (Oriented FAST and Rotated BRIEF)-SLAM and extracts 3D semantic landmark information of environmental objects; meanwhile, a high-precision 2D grid map is constructed using lidar data, IMU data and odometer data; finally, the similarity transformation between the two trajectory sequences is solved to align the 3D landmarks with the 2D map and generate a semantic map, thereby solving the problems of traditional SLAM.
The technical scheme adopted by the invention is as follows:
a semantic mapping method based on track alignment comprises the following steps:
s1, synchronously acquiring RGB data, laser data, IMU (inertial measurement unit) data and odometer data along the same running track;
s2, through semantic ORB-SLAM, when ORB tracking is carried out by using RGB data, target detection is carried out on each frame of key frame image, a triangularization relation is constructed according to target detection results between adjacent key frames, and a 3D semantic landmark is extracted;
s3, filtering the laser data by utilizing a down-sampling operation, reducing the calculated amount, and calculating and processing IMU data and mileage count data;
s4, processing the laser data, the IMU data and the odometer data through a Cartogrph algorithm to construct a 2D grid map;
and S5, solving the similarity transformation between the trajectory sequences generated by ORB-SLAM and the Cartographer algorithm, and fusing the 3D semantic landmarks with the 2D grid map to generate a semantic map.
Preferably, in step S1, the RGB data, the laser data, the IMU data and the odometer data start to be acquired synchronously. The final trajectories of the visual SLAM and the laser SLAM are each composed of pose points, and solving the similarity transformation between the trajectories requires pose points that can be put in one-to-one correspondence by their timestamps; since every pose point is solved from the relative motion between adjacent frames, the laser data and the RGB image sequence must be acquired synchronously, and all sensors must move together as a single unit during acquisition so that the finally generated trajectories coincide closely.
Preferably, the step S2 is specifically performed according to the following steps:
s21, completing map initialization of the ORB-SLAM algorithm, enabling the ORB-SLAM algorithm to track the ORB of each frame of RGB image and generate a key frame;
s22, after the key frame is obtained, iteratively updating the 3D pose of the object landmark: firstly, projecting an object landmark (3D cube) serving as a boundary frame into each key frame image, matching the projection results and target detection results pairwise by using a KM algorithm, and matching the cost c (p, D) of a projection boundary frame p and a detected boundary frame D as shown in a formula (1):
c(p, d) = (|l_p - l_d| + |t_p - t_d| + |r_p - r_d| + |b_p - b_d|) / (r_d - l_d)    (1)
wherein p represents a projected bounding box, d represents a detected bounding box, and l, t, r, b represent the left, top, right and bottom edges of the bounding box in the pixel coordinate system, respectively; the denominator normalizes the cost by the width of the detected bounding box, preventing a larger bounding box from dominating the overall matching cost.
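For illustration only, a minimal sketch of this association step is given below; it pairs projected and detected boxes with the Kuhn-Munkres (Hungarian) assignment using the normalized edge-distance cost described above. The function names and the max_cost rejection threshold are assumptions made for the sketch, not values specified by the patent:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def box_match_cost(p, d):
        # Boxes are (l, t, r, b) in pixel coordinates; the detected box width
        # normalizes the cost so that large boxes do not dominate the assignment.
        p, d = np.asarray(p, dtype=float), np.asarray(d, dtype=float)
        return np.abs(p - d).sum() / (d[2] - d[0])

    def associate(projected, detected, max_cost=2.0):
        # Kuhn-Munkres (Hungarian) assignment between projected landmark boxes
        # and detected boxes; pairs above max_cost are rejected, and their
        # detections may later spawn new object landmarks.
        cost = np.array([[box_match_cost(p, d) for d in detected] for p in projected])
        rows, cols = linear_sum_assignment(cost)
        return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]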
s23, triangulating the position of the 3D point of the landmark by a ray intersection extending from the center of the camera to the detected bounding box; the projection point selects the center point of the 2D detection frame, which means that the triangulated 3D point is also at the exact center of the object cubic landmark; after a group of object landmark hypotheses are sampled to be 3D points, a 3D boundary cube is instantiated for each point, and the center of the top surface of the boundary cube is anchored on the 3D point;
After generating a cubic landmark hypothesis for a new object, the new hypothesis is scored; the score S(h) of an object hypothesis h is shown in formula (2):
S(h) = α Σ_{k ∈ K_o} c(f_k(h), δ_k) + Σ_{h′} Γ(h, h′)    (2)
where L is the set of landmarks, H_o the set of hypotheses of o ∈ L, K_o the keyframes containing an associated detection of o, δ_k the associated detection in keyframe k, and f_k(h) the projection of h into keyframe k; α is a scale factor used to weight the first sum, and Γ measures the compatibility of object cubes h and h′ according to their typical context (e.g., a mouse and a keyboard tend to be on the same surface). Under this formula, the lower the computed score, the better the object landmark hypothesis. After scoring every hypothesis of all landmarks, if the best score of a new hypothesis is lower than the score of the existing estimated landmark, the hypothesis with the lowest score replaces the original landmark;
s24, in each iteration, if the bounding box detected during the data association does not match any existing landmark, a new object landmark is instantiated for the detection. The accuracy of the cube at the very beginning is not important because the camera does not move much in subsequent key frames where the projection of the cube is sufficient to establish the data association. Once multiple views of landmarks are obtained through different keyframes, the pose and size scale of object landmarks can be refined gradually through multiple iterations.
Preferably, in step S3, the filtering the laser data specifically includes:
preprocessing the lidar data through a voxel filter in the PCL library: a 3D voxel grid is created over the input point cloud data, and within each voxel all existing points are approximated by their centroid.
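A minimal numpy sketch of this down-sampling operation is given below; it reproduces the centroid-per-voxel behaviour of the PCL voxel filter rather than calling PCL itself, and the leaf size is an assumed value:

    import numpy as np

    def voxel_downsample(points, leaf=0.05):
        # Assign every point to a leaf-sized voxel and replace all points that
        # fall into the same voxel by their centroid.
        idx = np.floor(points / leaf).astype(np.int64)
        _, inverse = np.unique(idx, axis=0, return_inverse=True)
        sums = np.zeros((inverse.max() + 1, points.shape[1]))
        counts = np.zeros(inverse.max() + 1)
        np.add.at(sums, inverse, points)
        np.add.at(counts, inverse, 1)
        return sums / counts[:, None]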
The specific processing of the IMU data is as follows: a quaternion-to-Euler-angle conversion is performed on the IMU data, as shown in formula (3):
α = arctan2(2(q_0 q_1 + q_2 q_3), 1 - 2(q_1² + q_2²))
β = arcsin(2(q_0 q_2 - q_3 q_1))
γ = arctan2(2(q_0 q_3 + q_1 q_2), 1 - 2(q_2² + q_3²))    (3)
wherein q = (q_0, q_1, q_2, q_3) is the quaternion and r = (α, β, γ) is the Euler angle.
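For reference, a small sketch of this conversion (assuming the common w-x-y-z quaternion order and roll-pitch-yaw angles) is:

    import numpy as np

    def quat_to_euler(q0, q1, q2, q3):
        # Formula (3): convert one IMU orientation quaternion to Euler angles.
        alpha = np.arctan2(2.0 * (q0 * q1 + q2 * q3), 1.0 - 2.0 * (q1 * q1 + q2 * q2))
        beta = np.arcsin(np.clip(2.0 * (q0 * q2 - q3 * q1), -1.0, 1.0))
        gamma = np.arctan2(2.0 * (q0 * q3 + q1 * q2), 1.0 - 2.0 * (q2 * q2 + q3 * q3))
        return alpha, beta, gamma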
The odometer data is obtained by equipping the left and right wheels of the mobile robot with speedometers to obtain the forward speed and angular speed of the robot, so the odometer data is acquired directly.
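As an illustration of how the wheel speeds turn into odometry, a minimal differential-drive integration step is sketched below; wheel_base (the distance between the two wheels) and the function name are assumptions made for the sketch:

    import numpy as np

    def wheel_odometry_step(x, y, theta, v_left, v_right, wheel_base, dt):
        # Forward and angular speed from the left/right wheel speedometers,
        # then integrate the planar pose over one time step dt.
        v = 0.5 * (v_left + v_right)
        w = (v_right - v_left) / wheel_base
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        theta += w * dt
        return x, y, theta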
Preferably, in step S4, the 2D occupancy grid map is generated by processing the lidar data, the IMU data and the odometer data with the local SLAM and global SLAM of the Cartographer algorithm.
Preferably, the step S5 is specifically performed according to the following steps:
s51, obtaining a track sequence S generated by a semantic ORB-SLAM algorithm o And a track sequence S generated by the Cartogrier algorithm c Aligning the two sequences by means of time stamps, gradually aligning S o Filtering pose nodes without matching points, and finally generating two new sequences S which have the same number of pose points and are in one-to-one correspondence in sequence c′ And S o′
S52, after obtaining the matched sequences S_c′ and S_o′, solving the similarity transformation between the two sequences so that the transformed error value ε² is minimized, where ε² is given by formula (4):
ε² = (1/n) Σ_{i=1}^{n} ||S_c′(i) - s(R S_o′(i) + t)||²    (4)
wherein s is the scale factor, R the rotation matrix and t the translation vector; the final solution can be obtained by solving this equation with the Umeyama method:
R = U S V^T    (5)
s = tr(D S) / σ_o²    (6)
t = μ_c / s - R μ_o    (7)
where U D V^T is the singular value decomposition of the cross-covariance matrix between the centered sequences S_c′ and S_o′, S is the identity matrix (with its last diagonal entry set to -1 when a reflection must be corrected), σ_o² is the variance of the points of S_o′, and μ_o, μ_c are the means of S_o′ and S_c′.
s53, after obtaining the similarity transformation parameters S, R and t, solving the corresponding position of any point in the environment point cloud generated by the ORB-SLAM in the 2D occupied grid coordinate system, wherein the specific conversion formula is as follows:
p′=s(Rp+t) (8)
wherein p′ is a point in the grid map coordinate system, and p is the corresponding point in the semantic landmark coordinate system.
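A compact numpy sketch of the closed-form Umeyama solution and of formula (8) is given below; it assumes the matched pose positions are stacked as (n, 3) arrays and follows the p′ = s(Rp + t) convention of formula (8), with function and variable names chosen for illustration:

    import numpy as np

    def align_umeyama(src, dst):
        # src: matched ORB-SLAM positions S_o', dst: matched Cartographer positions S_c'.
        mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
        xs, xd = src - mu_s, dst - mu_d
        cov = xd.T @ xs / len(src)                      # cross-covariance matrix
        U, D, Vt = np.linalg.svd(cov)
        S = np.eye(3)
        if np.linalg.det(U) * np.linalg.det(Vt) < 0:    # correct a reflection
            S[2, 2] = -1.0
        R = U @ S @ Vt
        var_src = (xs ** 2).sum() / len(src)            # variance of the source points
        s = np.trace(np.diag(D) @ S) / var_src
        t = mu_d / s - R @ mu_s                         # so that dst ≈ s (R src + t)
        return s, R, t

    def to_grid_frame(points, s, R, t):
        # Formula (8): map semantic-landmark points into the grid-map frame.
        return s * (points @ R.T + t)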
S54, obtaining the correspondence between the 3D semantic landmarks and the 2D grid map through this conversion relation, and combining the two into a single semantic map.
The invention has the beneficial effects that:
(1) The method combines the advantages of laser SLAM and visual SLAM from traditional SLAM: lidar data and RGB data are collected along the same running track; the laser data is then processed with the Cartographer algorithm to generate a 2D grid map, while the ORB-SLAM algorithm constructs cubic semantic landmarks of the objects; finally, the two pose-point sequences generated for the same trajectory are aligned, and the semantic landmarks are mapped onto the grid map with the similarity transformation to form the final semantic map. This exploits both the ability of laser SLAM to construct an accurate 2D map and the ability of visual SLAM to extract semantic information, and solves the problem that traditional SLAM algorithms cannot construct an accurate grid map with semantic landmarks.
(2) The invention can collect a data set in an actual scene with a hardware platform consisting of a sweeping robot and a camera sensor, which gives good flexibility and expandability; fixing the sweeping robot and the camera sensor in the same device effectively avoids inconsistency between the generated trajectory sequences, and the whole system is implemented with the message publish-subscribe model of ROS, so the final system can be conveniently applied to indoor tasks of a mobile robot.
Drawings
FIG. 1 is a process diagram of a semantic mapping method based on track alignment according to the present invention;
FIG. 2 is an overall architecture diagram of the semantic SLAM system based on a four-layer structure, implemented with a sweeping robot and a camera sensor.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The embodiment of the invention provides a semantic mapping method based on track alignment. Aiming at the problem of content coupling in traditional SLAM development, the invention provides a four-level SLAM development framework and, on this basis, implements a semantic SLAM algorithm based on multi-sensor fusion. As shown in FIGS. 1 and 2, the semantic SLAM method comprises the following steps:
a1, synchronously scanning and acquiring RGB data, laser data, IMU data and odometer data by using a monocular camera and a sweeping robot, wherein the monocular camera is fixed on the sweeping robot and connected with a notebook computer, and continuously acquires RGB images as input of semantic ORB-SLAM in the moving process. And the laser data, IMU data and odometer data collected by the sweeping robot are used as the input of the Cartogrer.
Camera sensor calibration and sensor time alignment are required before the data set is acquired. The purpose of camera calibration is to determine the geometric position relationship between a certain point on the surface of an object in a three-dimensional space and the corresponding point in a two-dimensional image, and a geometric model of camera imaging must be established, namely, internal and external parameters and distortion parameters of a camera are solved. The purpose of sensor time alignment is to enable the timestamps recorded by the nodes of the two tracks to be matched on the same timeline when the track nodes are finally matched.
The camera calibration adopts Zhang Zhengyou's checkerboard calibration method: a checkerboard calibration plate with a known number and size of squares is used, an image coordinate system is established with the upper-left corner as the origin, and a world coordinate system is established with the checkerboard plane as z = 0. Since the points where the black and white squares intersect are all corner points with known coordinates, the transformation matrix between the two coordinate systems can be obtained, and the camera parameters can then be recovered by decomposing the transformation matrix using the orthogonality constraints on the camera's intrinsic and extrinsic parameters.
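A minimal OpenCV sketch of this calibration procedure is given below; the pattern size, the square size and the list of calibration images are assumed values introduced for illustration, not parameters specified by the patent:

    import cv2
    import numpy as np

    pattern, square = (9, 6), 0.025          # inner corners per row/column, square edge in metres
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square   # z = 0 plane

    obj_pts, img_pts = [], []
    for gray in calibration_images:          # iterable of grayscale checkerboard views (assumed)
        found, corners = cv2.findChessboardCorners(gray, pattern)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)

    # Zhang's method: recover the intrinsics K, distortion coefficients and per-view extrinsics.
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_pts, img_pts, gray.shape[::-1], None, None)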
The sensor time comprises the time of the sweeping robot's built-in system and the system time of the notebook computer. A wireless network card is embedded in the sweeping robot; after the robot is started, the notebook connects to the robot's wireless network so that the two are on the same local area network. Before the robot is driven, the sweeping robot sends 10 UDP packets carrying its current timestamp to the notebook; each time a packet is received, the notebook records a timestamp pair formed by the robot's timestamp and its own current timestamp, and finally the average timestamp difference between the two clocks is calculated from the 10 timestamp pairs.
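The clock-offset estimation can be sketched as follows; the UDP port and the packet layout (one little-endian double carrying the robot timestamp in seconds) are assumptions made for the sketch:

    import socket
    import struct
    import time

    def average_clock_offset(port=9000, n=10):
        # Receive n UDP packets carrying the robot's current timestamp and average
        # the difference against the notebook's clock.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", port))
        offsets = []
        for _ in range(n):
            data, _addr = sock.recvfrom(64)
            robot_time = struct.unpack("<d", data[:8])[0]   # assumed packet layout
            offsets.append(time.time() - robot_time)
        sock.close()
        return sum(offsets) / len(offsets)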
A2, extracting 3D semantic landmarks from the RGB image sequence with semantic ORB-SLAM involves several steps: extracting keyframes, obtaining target detections and optimizing landmark hypotheses.
And A21, completing map initialization of an ORB-SLAM algorithm, enabling the map initialization to perform ORB tracking on each frame of RGB image, and generating a key frame.
A22, after a keyframe is obtained, the 3D pose of the object landmarks is updated iteratively. First, each object landmark (3D cube) is projected into the keyframe image as a bounding box, and the projections are then matched pairwise with the target detection results using the KM algorithm. The cost c(p, d) of matching a projected bounding box p with a detected bounding box d is given by formula (1):
c(p, d) = (|l_p - l_d| + |t_p - t_d| + |r_p - r_d| + |b_p - b_d|) / (r_d - l_d)    (1)
where p denotes a projected bounding box, d denotes a detected bounding box, and l, t, r, b denote the left, top, right and bottom of the bounding box in the pixel coordinate system, respectively. The purpose of the denominator is to normalize the cost using the width of the detected bounding box, preventing a larger bounding box from dominating the overall matching cost.
A23, triangulating the 3D point locations of the landmarks by ray intersections extending from the camera center to the detected bounding box. The projection point is the center point of the 2D detection frame, which means that the triangulated 3D point is also at the exact center of the object cube landmark. After a group of object landmark hypotheses are sampled to be 3D points, a 3D boundary cube is instantiated for each point, and the center of the top surface of the boundary cube is anchored on the 3D point;
after generating a cubic landmark hypothesis for a new object, the new hypothesis is scored, and the score S (h) of the object hypothesis h is shown in equation (2):
S(h) = α Σ_{k ∈ K_o} c(f_k(h), δ_k) + Σ_{h′} Γ(h, h′)    (2)
where L is the set of landmarks, H_o the set of hypotheses of o ∈ L, K_o the keyframes containing an associated detection of o, δ_k the associated detection in keyframe k, and f_k(h) the projection of h into keyframe k; α is a scale factor used to weight the first sum, and Γ measures the compatibility of the object cubes h and h′ according to their typical context (e.g., the mouse and keyboard tend to be on the same surface). Under this formula, the lower the computed score, the better the object landmark hypothesis. Finally, after scoring each hypothesis of all landmarks, if the best score of the new hypothesis is lower than the score of the existing estimated landmark, the hypothesis with the lowest score replaces the original landmark.
At each iteration, if the bounding box detected during the data association does not match any existing landmarks, a new object landmark is instantiated for the detection. The accuracy of the cube at the very beginning is not important because the camera does not move much in subsequent key frames where the projection of the cube is sufficient to establish the data association. Once multiple views of landmarks are obtained through different keyframes, the pose and size scale of object landmarks can be refined gradually through multiple iterations.
A3, preprocessing the laser data, the IMU data and the odometer data before they are used to construct the grid map.
The lidar data is pre-processed by voxel filters in the PCL library, creating individual 3D voxel grids on the input point cloud data, and then in each voxel, all existing points are approximated by their centroids.
The IMU data requires a quaternion-to-Euler-angle conversion, as shown in formula (3):
α = arctan2(2(q_0 q_1 + q_2 q_3), 1 - 2(q_1² + q_2²))
β = arcsin(2(q_0 q_2 - q_3 q_1))
γ = arctan2(2(q_0 q_3 + q_1 q_2), 1 - 2(q_2² + q_3²))    (3)
wherein q = (q_0, q_1, q_2, q_3) is the quaternion and r = (α, β, γ) is the Euler angle;
and the odometer data is obtained by equipping the left and right wheels of the mobile robot with speedometers to obtain the advancing speed and the angular speed of the robot.
A4, processing the lidar data, the IMU data and the odometer data with the local SLAM and global SLAM of the Cartographer algorithm library embedded in the sweeping robot to generate the 2D occupancy grid map.
A5, after the semantic landmark generation and the grid map construction are completed, combining the two results by means of trajectory alignment.
A51, obtaining the trajectory sequence S_o generated by the semantic ORB-SLAM algorithm and the trajectory sequence S_c generated by the Cartographer algorithm, aligning the two sequences by their timestamps, filtering out of S_o the pose nodes that have no matching point, and finally generating two new sequences S_c′ and S_o′ which have the same number of pose points in one-to-one correspondence.
A52, after obtaining the matched sequences S_c′ and S_o′, solving the similarity transformation between the two sequences so that the transformed error value ε² is minimized, where ε² is given by formula (4):
ε² = (1/n) Σ_{i=1}^{n} ||S_c′(i) - s(R S_o′(i) + t)||²    (4)
wherein s is the scale factor, R the rotation matrix and t the translation vector; solving this equation with the Umeyama method gives the final solution:
R = U S V^T    (5)
s = tr(D S) / σ_o²    (6)
t = μ_c / s - R μ_o    (7)
where U D V^T is the singular value decomposition of the cross-covariance matrix between the centered sequences S_c′ and S_o′, S is the identity matrix (with its last diagonal entry set to -1 when a reflection must be corrected), σ_o² is the variance of the points of S_o′, and μ_o, μ_c are the means of S_o′ and S_c′.
a53, after obtaining the similarity transformation parameters s, R and t, knowing the conversion relation between the two coordinate systems, the corresponding position of any point in the environment point cloud generated by the ORB-SLAM in the 2D occupied grid coordinate system can be obtained. The specific conversion formula is as follows:
p′=s(Rp+t) (8)
where p' is a certain point in the grid map coordinate system and p is a certain point in the semantic landmark coordinate system.
Finally, the correspondence between the 3D semantic landmarks and the 2D grid map is obtained through this conversion relation, and the two are combined into a single semantic map.
Experiments and analysis were carried out on the cubic semantic landmark generation module and the grid mapping module, using a public data set as well as data from an actual scene. The experimental results show that the semantic landmark generation module can accurately generate a cubic box representing the position and size of an object in three-dimensional space from an image sequence acquired from multiple viewpoints, that the grid map construction module can complete the construction and optimization of a grid map in a small scene, and that the map optimization effect is best when multiple sensor data are used at the same time. Finally, a data set was acquired in an indoor scene containing several objects, the trajectory alignment between ORB-SLAM and Cartographer was completed with the semantic SLAM system implemented herein, and a semantic map fusing multiple sensor data was generated for the scene.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (4)

1. A semantic mapping method based on track alignment is characterized by comprising the following steps:
s1, synchronously acquiring RGB data, laser data, IMU data and odometer data along the same running track;
s2, through semantic ORB-SLAM, when ORB tracking is carried out by using RGB data, target detection is carried out on each frame of key frame image, a triangularization relation is constructed according to target detection results between adjacent key frames, and a 3D semantic landmark is extracted;
s3, filtering the laser data by utilizing a down-sampling operation, reducing the calculated amount, and calculating and processing IMU data and mileage count data;
s4, processing the laser data, the IMU data and the odometer data through a Cartogrer algorithm to construct a 2D grid map;
s5, solving a similarity transformation relation between the ORB-SLAM and a track sequence generated by a Cartogrer algorithm, and fusing a 3D semantic landmark and a 2D raster map to generate a semantic map;
the step S2 is specifically performed according to the following steps:
s21, completing map initialization of the ORB-SLAM algorithm, enabling the ORB-SLAM algorithm to track the ORB of each frame of RGB image and generate a key frame;
s22, after the key frame is obtained, iteratively updating the 3D pose of the object landmark: firstly, projecting an object landmark as a boundary frame into each key frame image, then matching the projection results and the target detection results pairwise by using a KM algorithm, wherein the cost c (p, d) of matching a projection boundary frame p and a detected boundary frame d is shown as a formula (1):
c(p, d) = (|l_p - l_d| + |t_p - t_d| + |r_p - r_d| + |b_p - b_d|) / (r_d - l_d)    (1)
wherein p represents a projection bounding box, d represents a detected bounding box, and l, t, r, b represent the left, top, right and bottom of the bounding box in the pixel coordinate system, respectively;
s23, triangulating the position of the 3D point of the landmark by the intersection point of the light rays extending from the center of the camera to the detected bounding box; after a group of object landmark hypotheses are sampled to be 3D points, a 3D boundary cube is instantiated for each point, and the center of the top surface of the boundary cube is anchored on the 3D point;
after generating a cubic landmark hypothesis of a new object, scoring the new hypothesis, wherein the score S (h) of the object hypothesis h is shown in formula (2):
S(h) = α Σ_{k ∈ K_o} c(f_k(h), δ_k) + Σ_{h′} Γ(h, h′)    (2)
where L is the set of landmarks, H_o the set of hypotheses of o ∈ L, K_o the keyframes containing an associated detection of o, δ_k the associated detection in keyframe k, and f_k(h) the projection of h into keyframe k; α is a scale factor used to weight the first sum, and Γ measures the compatibility of object cubes h and h′ according to their typical context; after scoring each hypothesis of all landmarks, if the best score of the new hypothesis is lower than the score of the existing estimated landmark, the hypothesis with the lowest score replaces the original landmark;
s24, in each iteration, if the bounding box detected during data association does not match any existing landmark, instantiating a new object landmark for the detection;
the step S5 is specifically performed according to the following steps:
s51, obtaining a track sequence S generated by a semantic ORB-SLAM algorithm o And a track sequence S generated by the Cartogrier algorithm c Aligning the two sequences by means of time stamps, gradually aligning S o Filtering pose nodes without matching points, and finally generating two new sequences S which have the same number of pose points and are in one-to-one correspondence in sequence c′ And S o′
S52, after obtaining the matched sequences S_c′ and S_o′, solving the similarity transformation between the two sequences so that the transformed error value ε² is minimized, where ε² is given by formula (4):
ε² = (1/n) Σ_{i=1}^{n} ||S_c′(i) - s(R S_o′(i) + t)||²    (4)
wherein s is the scale factor, R the rotation matrix and t the translation vector; the final solution can be obtained by solving this equation with the Umeyama method:
R = U S V^T    (5)
s = tr(D S) / σ_o²    (6)
t = μ_c / s - R μ_o    (7)
where U D V^T is the singular value decomposition of the cross-covariance matrix between the centered sequences S_c′ and S_o′, S is the identity matrix (with its last diagonal entry set to -1 when a reflection must be corrected), σ_o² is the variance of the points of S_o′, and μ_o, μ_c are the means of S_o′ and S_c′;
s53, after obtaining the similarity transformation parameters S, R and t, solving the corresponding position of any point in the environment point cloud generated by the ORB-SLAM in the 2D occupied grid coordinate system, wherein the specific conversion formula is as follows:
p′=s(Rp+t) (8)
wherein p' is a certain point in the grid map coordinate system, and p is a certain point in the semantic landmark coordinate system;
and S54, obtaining the correspondence between the 3D semantic landmarks and the 2D grid map through this transformation relation, and combining the two into a single semantic map.
2. The semantic mapping method based on trajectory alignment according to claim 1, wherein in step S1, the RGB data, laser data, IMU data and odometer data are synchronously acquired, and all sensors move together as a single unit during acquisition.
3. The semantic mapping method based on track alignment according to claim 1, wherein in the step S3, the filtering the laser data specifically comprises:
preprocessing the laser radar data through a voxel filter in a PCL library: creating a 3D voxel grid on the input point cloud data, then in each voxel, all existing points are approximated by their centroids;
the specific calculation processing of the IMU data is as follows: performing a quaternion to euler angle conversion on the IMU data, as shown in equation (3):
α = arctan2(2(q_0 q_1 + q_2 q_3), 1 - 2(q_1² + q_2²))
β = arcsin(2(q_0 q_2 - q_3 q_1))
γ = arctan2(2(q_0 q_3 + q_1 q_2), 1 - 2(q_2² + q_3²))    (3)
wherein q = (q_0, q_1, q_2, q_3) is the quaternion and r = (α, β, γ) is the Euler angle;
and the odometer data is obtained by equipping speedometers at the left and right wheels of the mobile robot to obtain the advancing speed and the angular speed of the robot.
4. The semantic mapping method based on trajectory alignment as claimed in claim 3, wherein in step S4, the lidar data, IMU data and odometry data are processed by means of the local SLAM and global SLAM of the Cartographer algorithm to generate a 2D occupancy grid map.
CN202011644778.8A 2020-12-31 2020-12-31 Semantic mapping method based on track alignment Active CN112833892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011644778.8A CN112833892B (en) 2020-12-31 2020-12-31 Semantic mapping method based on track alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011644778.8A CN112833892B (en) 2020-12-31 2020-12-31 Semantic mapping method based on track alignment

Publications (2)

Publication Number Publication Date
CN112833892A CN112833892A (en) 2021-05-25
CN112833892B true CN112833892B (en) 2022-12-16

Family

ID=75927230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011644778.8A Active CN112833892B (en) 2020-12-31 2020-12-31 Semantic mapping method based on track alignment

Country Status (1)

Country Link
CN (1) CN112833892B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240755B (en) * 2021-07-12 2021-09-21 中国海洋大学 City scene composition method and system based on street view image and vehicle-mounted laser fusion
CN114782626B (en) * 2022-04-14 2024-06-07 国网河南省电力公司电力科学研究院 Transformer substation scene map building and positioning optimization method based on laser and vision fusion
CN115496923B (en) * 2022-09-14 2023-10-20 北京化工大学 Multi-mode fusion target detection method and device based on uncertainty perception
CN115388880B (en) * 2022-10-27 2023-02-03 联友智连科技有限公司 Low-cost parking map construction and positioning method and device and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020156923A2 (en) * 2019-01-30 2020-08-06 Harman Becker Automotive Systems Gmbh Map and method for creating a map
CN111665842A (en) * 2020-06-09 2020-09-15 山东大学 Indoor SLAM mapping method and system based on semantic information fusion

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107036594A (en) * 2017-05-07 2017-08-11 郑州大学 The positioning of intelligent Power Station inspection intelligent body and many granularity environment perception technologies
CN109376785B (en) * 2018-10-31 2021-09-24 东南大学 Navigation method based on iterative extended Kalman filtering fusion inertia and monocular vision
CN111263308A (en) * 2020-01-15 2020-06-09 上海交通大学 Positioning data acquisition method and system
CN111539994B (en) * 2020-04-28 2023-04-18 武汉科技大学 Particle filter repositioning method based on semantic likelihood estimation
CN111798475B (en) * 2020-05-29 2024-03-22 浙江工业大学 Indoor environment 3D semantic map construction method based on point cloud deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020156923A2 (en) * 2019-01-30 2020-08-06 Harman Becker Automotive Systems Gmbh Map and method for creating a map
CN111665842A (en) * 2020-06-09 2020-09-15 山东大学 Indoor SLAM mapping method and system based on semantic information fusion

Also Published As

Publication number Publication date
CN112833892A (en) 2021-05-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221116

Address after: No. 407-7, Shengxing Road, Ningwei Street, Xiaoshan District, Hangzhou, Zhejiang 310000

Applicant after: Hangzhou Adaptive Technology Co.,Ltd.

Address before: 311215 room b1-3-034, No. 198, Qidi Road, economic and Technological Development Zone, Xiaoshan District, Hangzhou City, Zhejiang Province

Applicant before: Hangzhou purevision Technology Co.,Ltd.

GR01 Patent grant