CN112016612A - Monocular depth estimation-based multi-sensor fusion SLAM method - Google Patents


Info

Publication number
CN112016612A
Authority
CN
China
Prior art keywords
map
loss function
image
representing
parallax
Prior art date
Legal status
Pending
Application number
CN202010871686.7A
Other languages
Chinese (zh)
Inventor
彭倍
刘帅
邵继业
曾双耀
Current Assignee
Sichuan Artigent Robotics Equipment Co ltd
Original Assignee
Sichuan Artigent Robotics Equipment Co ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Artigent Robotics Equipment Co ltd filed Critical Sichuan Artigent Robotics Equipment Co ltd
Priority to CN202010871686.7A priority Critical patent/CN112016612A/en
Publication of CN112016612A publication Critical patent/CN112016612A/en
Pending legal-status Critical Current

Classifications

    • G06F18/25 Fusion techniques
    • G01C22/00 Measuring distance traversed on the ground by vehicles, persons, animals or other moving solid bodies, e.g. using odometers, using pedometers
    • G01S11/12 Systems for determining distance or velocity not using reflection or reradiation, using electromagnetic waves other than radio waves
    • G01S17/86 Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
    • G01S17/93 Lidar systems specially adapted for specific applications for anti-collision purposes
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T7/55 Depth or shape recovery from multiple images
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/56 Extraction of image or video features relating to colour
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10024 Color image
    • G06T2207/10032 Satellite or aerial image; Remote sensing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Electromagnetism (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a multi-sensor fusion SLAM method based on monocular depth estimation, which comprises the following steps: firstly, acquiring an RGB image, laser point cloud information and mileage information of the current environment, and performing monocular depth estimation on the RGB image to acquire the corresponding depth information; secondly, synchronizing the topics corresponding to the depth information, the laser point cloud information and the mileage information in an ROS system; thirdly, establishing a new node according to the information transmitted in each synchronized topic to obtain a local grid map; then, carrying out loop detection on the local grid map; and finally, fusing the local grid maps after loop detection to create a global map. By creating the global map with a sensor combination of a laser radar and a pseudo-RGBD camera obtained through monocular depth estimation, the method reduces the economic cost and the device size compared with directly using an RGBD camera or a combination of a binocular camera and a laser radar sensor.

Description

Monocular depth estimation-based multi-sensor fusion SLAM method
Technical Field
The application relates to the technical field of mobile robots, in particular to a multi-sensor fusion SLAM method based on monocular depth estimation.
Background
Multi-sensor fusion is one of the future development directions of robot positioning and navigation, and within multi-sensor fusion, the fusion of laser radar and visual information is an important part. In existing approaches, visual images are obtained through an RGBD camera, a binocular camera and the like. However, RGBD cameras and binocular cameras that can directly obtain depth information are costly and bulky, which poses considerable problems in economic cost and structural design for low-cost or small-sized robots. Although a monocular camera is small and inexpensive, it cannot directly acquire depth information. It is therefore desirable to provide a solution to this problem.
Disclosure of Invention
The application aims to provide a multi-sensor fusion SLAM method based on monocular depth estimation, so as to achieve the technical effects of obtaining depth information of the surrounding environment through a monocular sensor while reducing cost and equipment size.
The embodiment of the application provides a multi-sensor fusion SLAM method based on monocular depth estimation, which comprises the following steps:
step 1, acquiring an RGB image, laser point cloud information and mileage information of a current environment, and performing monocular depth estimation on the RGB image to acquire corresponding depth information;
step 2, synchronizing topics corresponding to the depth information, the laser point cloud information and the mileage information in an ROS system;
step 3, establishing new nodes according to the information transmitted in each synchronized topic to obtain a local grid map;
step 4, carrying out loop detection on the local grid map;
and 5, fusing the local grid maps subjected to loop detection to create a global map.
The beneficial effect that this application can realize is: by creating the global map with a sensor combination of a laser radar and a pseudo-RGBD camera obtained through monocular depth estimation, the method reduces the economic cost and the device size compared with directly using an RGBD camera or a combination of a binocular camera and a laser radar sensor.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flowchart of a multi-sensor fusion SLAM method based on monocular depth estimation according to an embodiment of the present disclosure;
fig. 2 is a schematic view of a monocular depth estimation process according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a new node creation process provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of the results of an analysis comparison experiment provided in the examples of the present application;
fig. 5 is a schematic diagram of a loopback detection result provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a multi-sensor fusion SLAM method based on monocular depth estimation according to an embodiment of the present disclosure.
The embodiment of the application provides a multi-sensor fusion SLAM method based on monocular depth estimation, which comprises the following steps:
step 1, acquiring an RGB image, laser point cloud information and mileage information of a current environment, and performing monocular depth estimation on the RGB image to acquire corresponding depth information;
step 2, synchronizing topics corresponding to the depth information, the laser point cloud information and the mileage information in an ROS system;
step 3, establishing new nodes according to the information transmitted in each synchronized topic to obtain a local grid map;
step 4, carrying out loop detection on the local grid map;
and 5, fusing the local grid maps subjected to loop detection to create a global map.
In one embodiment, the RGB image of the current environment can be acquired through the monocular sensor, the depth information of the RGB image is analyzed, and the result is then transmitted to the ROS system; the laser point cloud information of the current environment is acquired through a laser radar and transmitted to the ROS system; and the mileage information of the robot is acquired through an encoder and transmitted to the ROS system. The RGB image is obtained directly from the monocular sensor, while the depth information matched to it (including the depth of each pixel in the depth image) is obtained by learning through a CNN neural network.
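The following sketch illustrates how such a pseudo-RGBD pair can be formed: a trained disparity network predicts a disparity map from the RGB frame, and the disparity is converted to depth. The stub network, focal length and baseline values are illustrative placeholders and are not taken from the patent.

```python
# A sketch of forming the pseudo-RGBD pair: a trained disparity CNN (replaced here by a
# trivial stub) predicts a disparity map from the RGB frame, and the disparity is
# converted to depth via depth = focal_length * baseline / disparity. The stub network,
# focal length and baseline are illustrative placeholders, not values from the patent.
import torch
import torch.nn as nn

class DispNetStub(nn.Module):
    """Stand-in for the trained monocular disparity network."""
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(3, 1, kernel_size=3, padding=1)
    def forward(self, x):
        return torch.sigmoid(self.head(x)) * 0.3        # disparity as a fraction of image width

def disparity_to_depth(disp, focal_px=525.0, baseline_m=0.12, width=640):
    disp_px = disp * width                               # normalized disparity -> pixels
    return focal_px * baseline_m / disp_px.clamp(min=1e-3)

model = DispNetStub().eval()
rgb = torch.rand(1, 3, 480, 640)                         # RGB frame from the monocular camera
with torch.no_grad():
    disparity = model(rgb)
depth = disparity_to_depth(disparity)                    # pseudo depth image aligned with rgb
print(tuple(depth.shape), float(depth.min()), float(depth.max()))
```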
Since different sensors publish to the ROS system on their own topics, and each sensor publishes its topic at a different rate, a topic synchronization scheme with a suitable policy is needed to synchronize the sensor topics. When topic synchronization is performed in step 2, three properties should hold. First, the output sets should not intersect: assuming there are two sets M and N, where i is a certain moment in the topic, then for all i the two sets should satisfy either Mi ≤ Ni or Ni ≤ Mi. Second, there is at least one topic for which no messages are discarded between two successive output groups, i.e. the sets should be contiguous. Third, the output sets should depend on the ROS time stamps of the messages, not on their actual arrival times.
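A minimal sketch of this synchronization step using the ROS message_filters package is given below; the topic names, message types, queue size and slop are illustrative assumptions. ApproximateTimeSynchronizer groups messages by their ROS header time stamps rather than by arrival time, matching the policy described above.

```python
# Synchronize depth image, laser scan and odometry topics with an approximate-time policy.
import rospy
import message_filters
from sensor_msgs.msg import Image, LaserScan
from nav_msgs.msg import Odometry

def synced_callback(depth_msg, scan_msg, odom_msg):
    # The three messages here belong to (approximately) the same time instant and can
    # be forwarded to the node-creation / local-mapping step.
    rospy.loginfo("synced stamps: %s %s %s",
                  depth_msg.header.stamp, scan_msg.header.stamp, odom_msg.header.stamp)

rospy.init_node("sensor_topic_sync")
depth_sub = message_filters.Subscriber("/camera/depth/image", Image)
scan_sub = message_filters.Subscriber("/scan", LaserScan)
odom_sub = message_filters.Subscriber("/odom", Odometry)
sync = message_filters.ApproximateTimeSynchronizer(
    [depth_sub, scan_sub, odom_sub], queue_size=10, slop=0.05)
sync.registerCallback(synced_callback)
rospy.spin()
```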
Please refer to fig. 2, fig. 2 is a schematic view illustrating a monocular depth estimation process according to an embodiment of the present application.
Monocular depth estimation estimates, through a function, the depth of each pixel in the monocular image given to the function as input. Such a function can be briefly written as:

$$\hat{d} = f(I)$$

where $\hat{d}$ is the per-pixel depth information and $I$ is the input image. The training of the function comprises the following steps:
firstly, acquiring a real left image and a real right image of a surrounding environment;
secondly, estimating a right disparity map and a left disparity map according to the real left map through a CNN neural network;
and finally, overlapping the right disparity map and the real left map to obtain a right map estimation map, and overlapping the left disparity map and the real right map to obtain a left map estimation map.
In the above process, the loss functions of the right and left image estimation maps in the estimation process can be analyzed, and the loss functions are reduced to expected values by adjusting the parameters of the CNN neural network.
Specifically, the loss function comprises an image reconstruction loss function $C_{ap}$, a parallax smoothing loss function $C_{ds}$ and a left-right consistency loss function $C_{lr}$. Taking the real left image as an example, its image reconstruction loss function $C^l_{ap}$ is calculated as follows:

$$C^l_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}\!\left(I^l_{ij},\tilde{I}^l_{ij}\right)}{2} + (1-\alpha)\left\|I^l_{ij}-\tilde{I}^l_{ij}\right\|\right]$$

where $C^l_{ap}$ denotes the first image reconstruction loss function; $N$ denotes the number of pixels; $I^l$ denotes the real left image; $\tilde{I}^l$ denotes the estimated left image; and $\alpha$ is the weight between the basic reconstruction error and the similarity error. In one embodiment, a single-scale SSIM (structural similarity index) with simplified 3 × 3 filtering may be employed, with $\alpha$ taken to be 0.82. The second image reconstruction loss function $C^r_{ap}$ is obtained in the same way.
The parallax smoothing loss function $C_{ds}$ is calculated as follows:

$$C^l_{ds} = \frac{1}{N}\sum_{i,j}\left(\left|\partial_x d^l_{ij}\right| e^{-\left\|\partial_x I^l_{ij}\right\|} + \left|\partial_y d^l_{ij}\right| e^{-\left\|\partial_y I^l_{ij}\right\|}\right)$$

where $C^l_{ds}$ denotes the first parallax smoothing loss function; $\partial_x d^l$ denotes the parallax gradient of the left disparity map in the x direction; $\partial_x I^l$ denotes the image gradient of the real left image in the x direction; $\partial_y d^l$ denotes the parallax gradient of the left disparity map in the y direction; and $\partial_y I^l$ denotes the image gradient of the real left image in the y direction. The second parallax smoothing loss function $C^r_{ds}$ is obtained in the same way.
In practice, the left-to-right parallax and the right-to-left parallax of a binocular camera are completely equivalent. The CNN neural network receives only the left image as input while predicting both the left and right disparity maps. To keep the two consistent, a penalty for left-right disparity inconsistency is added to the loss function, forcing the left disparity map to equal the projected right disparity map; adding this strong prior supervision effectively improves the accuracy of the network's disparity prediction. The left-right consistency loss function $C_{lr}$ is calculated as follows:

$$C^l_{lr} = \frac{1}{N}\sum_{i,j}\left|d^l_{ij} - d^r_{ij+d^l_{ij}}\right|$$

where $C^l_{lr}$ denotes the first left-right consistency loss function; $d^l$ denotes the left disparity map; and $d^r_{ij+d^l_{ij}}$ denotes the reconstructed (projected) disparity map of the left disparity map. The second left-right consistency loss function $C^r_{lr}$ is obtained in the same way.
The three loss functions are weighted and summed to obtain the total loss function:

$$C_s = \alpha_{ap}\left(C^l_{ap}+C^r_{ap}\right) + \alpha_{ds}\left(C^l_{ds}+C^r_{ds}\right) + \alpha_{lr}\left(C^l_{lr}+C^r_{lr}\right)$$

where $C_s$ denotes the total loss function, $\alpha_{ap}$ denotes the weight of the image reconstruction loss function, $\alpha_{ds}$ denotes the weight of the parallax smoothing loss function, and $\alpha_{lr}$ denotes the weight of the left-right consistency loss function. In one embodiment, $\alpha_{ap}$ may be 0.9, $\alpha_{ds}$ may be 1, and $\alpha_{lr}$ may be 0.01.
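A compact PyTorch sketch of this weighted total loss is given below. The SSIM + L1 appearance terms are assumed to be computed elsewhere, and the function names, tensor layout and projected-disparity inputs are illustrative assumptions rather than the patent's implementation.

```python
# Minimal sketch of the total loss: edge-aware smoothness, left-right consistency and
# the weighted sum with the weights 0.9 / 1 / 0.01 named in the text.
import torch

def smoothness_loss(disp, img):
    """Edge-aware disparity smoothness: |∂d| weighted by exp(-|∂I|)."""
    dx_d = torch.abs(disp[:, :, :, 1:] - disp[:, :, :, :-1])
    dy_d = torch.abs(disp[:, :, 1:, :] - disp[:, :, :-1, :])
    dx_i = torch.mean(torch.abs(img[:, :, :, 1:] - img[:, :, :, :-1]), dim=1, keepdim=True)
    dy_i = torch.mean(torch.abs(img[:, :, 1:, :] - img[:, :, :-1, :]), dim=1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def lr_consistency_loss(disp, disp_other_projected):
    """Penalize disagreement between a disparity map and the projected opposite-view map."""
    return torch.abs(disp - disp_other_projected).mean()

def total_loss(app_l, app_r, disp_l, disp_r, img_l, img_r,
               disp_r_proj_to_l, disp_l_proj_to_r,
               a_ap=0.9, a_ds=1.0, a_lr=0.01):
    c_ap = app_l + app_r                                        # image reconstruction terms
    c_ds = smoothness_loss(disp_l, img_l) + smoothness_loss(disp_r, img_r)
    c_lr = (lr_consistency_loss(disp_l, disp_r_proj_to_l) +
            lr_consistency_loss(disp_r, disp_l_proj_to_r))
    return a_ap * c_ap + a_ds * c_ds + a_lr * c_lr
```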
As shown in fig. 3, fig. 3 is a schematic diagram of a new node creation process provided in the embodiment of the present application.
In one embodiment, the new node may be obtained by visual odometry; specifically, the new node is established by performing the following procedure in sequence:

First, feature detection is performed. Specifically, a frame of image is acquired, and GFTT (good features to track) detection is applied to the frame to obtain a feature map.

Second, feature matching is performed. Specifically, the features in the feature map are described with BRIEF (binary robust independent elementary features) descriptors and matched through an NNDR (nearest neighbor distance ratio) test; the feature map contains 3D features with descriptors from the last key frame. A sketch of this detection and matching step is given after this procedure.

Third, motion estimation is performed. Specifically, a PnP algorithm is used to compute the feature transformation of the current frame with respect to the feature map, implemented as:

$$\lambda_i v_i = R P_i + t \qquad (1)$$

where $\lambda_i$ is the depth of the 2D reference point $v_i$ in the camera coordinate system, $v_i$ satisfies $\|v_i\| = 1$, $R$ is the rotation matrix, $t$ is the translation vector, and $P_i$ is the spatial 3D reference point in the world coordinate system.

Fourth, local BA (bundle adjustment) optimization is performed. Specifically, the displacement result is refined using the features of all key frames in the feature map.

Fifth, the pose is updated. Specifically, the output odometry and the coordinate transformation are updated using the feature transformation from motion estimation; that is, the output odometry and the ROS tf-tree transform from the odometry coordinate system to the "base_link" coordinate system are updated. When the covariance is computed, the median absolute deviation (MAD) of the 3D features is used.

Finally, motion prediction is performed. Specifically, based on the updated pose, a motion model predicts the positions of the feature-map features in the current frame, and feature matching is then performed again.
In one embodiment, the present application uses an improved PnP algorithm for motion estimation, as follows. Let a space point $P_i$ in the world coordinate system project to $P_i' = [x', y']^T$ in the image coordinate system. Since there is some error and uncertainty in measuring the location of the projected point, this uncertainty can be expressed with a two-dimensional covariance matrix (where $p'$ corresponds to $P_i'$):

$$\Sigma_{p'p'} = \begin{bmatrix} \sigma_{x'x'} & \sigma_{x'y'} \\ \sigma_{y'x'} & \sigma_{y'y'} \end{bmatrix}$$

To introduce this uncertainty into the whole calculation, two-dimensional covariance matrices are likewise used to express the coordinate normalization and the transformation of the point's space vector in the camera coordinate system. The uncertainty of the reference point is finally expressed as a matrix $C$, built from $\hat{\Sigma}_{p'p'}$, the normalized form of the matrix $\Sigma_{p'p'}$.
Writing equation (1) in the form of a homogeneous system of linear equations:

$$Bu = 0$$

where $B$ is the coefficient matrix of the homogeneous system and $u$ is the column vector formed by stacking the unknown entries of the rotation matrix $R$ and the translation vector $t$.

The uncertainty of the reference points described above is added to the resulting homogeneous linear system:

$$B^T C B u = N u = 0$$

where $u$ satisfies the constraint $\|u\| = 1$. Singular value decomposition is then applied to the coefficient matrix:

$$N = U D V^T$$

where $U$, $D$ and $V^T$ are the factors of the singular value decomposition, with $D$ containing the singular values.
The singular value decomposition yields a solution vector, from which a rotation matrix $\hat{R}$ and a translation vector $\hat{t}$ are recovered. The recovered $\hat{t}$ reflects only the translation direction and carries no distance information, so the distance information is added to obtain the translation vector $t$.

Singular value decomposition is then applied to $\hat{R}$,

$$\hat{R} = U_R D_R V_R^T$$

and the rotation matrix is obtained as

$$R = U_R V_R^T$$

Finally, with the initial values of the camera's rotation matrix and translation vector obtained, an error function is constructed and the optimal solution is searched iteratively using the Gauss-Newton method.
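For comparison, the sketch below solves the same pose problem with OpenCV's standard RANSAC PnP followed by an iterative Levenberg-Marquardt refinement, as a stand-in for the uncertainty-weighted PnP and Gauss-Newton refinement derived above; the synthetic points, noise level and camera intrinsics are illustrative assumptions.

```python
# Baseline PnP: RANSAC initialization plus iterative refinement of R and t.
import cv2
import numpy as np

object_pts = np.random.rand(50, 3)                                # 3D reference points, world frame
K = np.array([[525.0, 0.0, 319.5], [0.0, 525.0, 239.5], [0.0, 0.0, 1.0]])
dist = np.zeros(5)                                                # assume no lens distortion
true_rvec = np.array([[0.10], [0.20], [0.05]])
true_tvec = np.array([[0.30], [-0.10], [1.50]])

image_pts, _ = cv2.projectPoints(object_pts, true_rvec, true_tvec, K, dist)
image_pts += np.random.normal(0.0, 0.5, image_pts.shape)          # simulated measurement noise

ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_pts, image_pts, K, dist,
                                             reprojectionError=3.0)
rvec, tvec = cv2.solvePnPRefineLM(object_pts[inliers[:, 0]], image_pts[inliers[:, 0]],
                                  K, dist, rvec, tvec)
R, _ = cv2.Rodrigues(rvec)                                        # rotation matrix from rvec
print("inliers:", len(inliers))
print("R =\n", R, "\nt =", tvec.ravel())
```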
In one embodiment, the step of establishing a new node from the information transmitted in each synchronized topic to obtain the local grid map further includes: analyzing whether the number of inliers computed during motion estimation is below a set fixed threshold, and if so, updating the key frames used in feature matching and local BA optimization.

Specifically, if the number of inliers computed during motion estimation is below the fixed threshold, the key frames or the feature map are updated. For the feature map, the matched features of the new frame are added and their positions are updated through the local BA optimization module. When the size of the feature map exceeds a fixed threshold, the oldest features that do not match the current frame are deleted. If a key frame no longer has any feature in the feature map, it is discarded.
In one embodiment, the process of fusing the local grid maps after loop detection to create the global map uses a Bayesian filtering algorithm to fuse the local grid maps:

$$p(\mathrm{state}_i \mid z_{t+1}) = \left[1 + \frac{1 - p_{obs}(\mathrm{state}_i \mid z_{t+1})}{p_{obs}(\mathrm{state}_i \mid z_{t+1})}\cdot\frac{1 - p(\mathrm{state}_i \mid z_t)}{p(\mathrm{state}_i \mid z_t)}\right]^{-1}$$

where $z$ denotes the observed value of the local map; $\mathrm{state}_i$ is the state of the i-th grid cell; $p(\mathrm{state}_i \mid z_{t+1})$ and $p(\mathrm{state}_i \mid z_t)$ are the state distributions of the i-th grid cell at time t+1 and time t, respectively; and $p_{obs}(\mathrm{state}_i \mid z_{t+1})$ is a preset constant. Illustratively, when the observed state of the i-th grid cell at time t+1 is occupied, $p_{obs}(\mathrm{state}_i \mid z_{t+1}) = 0.8$; when the observed state at time t+1 is free, $p_{obs}(\mathrm{state}_i \mid z_{t+1}) = 0.3$.
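A small Python sketch of this per-cell fusion is given below, written as a recursive binary Bayes update in odds form; the grid size and the encoding of the local map are illustrative, and the exact algebraic form used by the patent may differ.

```python
# Per-cell Bayesian occupancy update with the 0.8 / 0.3 observation constants from the text.
import numpy as np

P_OCCUPIED_OBS = 0.8      # p_obs when the new observation says "occupied"
P_FREE_OBS = 0.3          # p_obs when the new observation says "free"

def bayes_update(p_prev, observed_occupied):
    """Fuse one observation of a cell into its occupancy probability."""
    p_obs = P_OCCUPIED_OBS if observed_occupied else P_FREE_OBS
    odds = (p_obs / (1.0 - p_obs)) * (p_prev / (1.0 - p_prev))
    return odds / (1.0 + odds)

# Fuse a local grid map into the global grid: -1 = unknown, 0 = free, 1 = occupied
global_prob = np.full((100, 100), 0.5)                     # prior: everything unknown
local_map = np.random.choice([-1, 0, 1], size=(100, 100))  # stand-in for a real local map
for idx in zip(*np.nonzero(local_map >= 0)):
    global_prob[idx] = bayes_update(global_prob[idx], local_map[idx] == 1)
print("occupied cells:", int((global_prob > 0.5).sum()))
```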
As shown in fig. 4, fig. 4 is a schematic diagram of the results of the analysis and comparison experiment provided in the embodiment of the present application.
In one embodiment, the application performs algorithmic verification with the freiburg2_pioneer_360 dataset from the Computer Vision Group of the Technical University of Munich (TUM). The dataset is recorded by a Kinect camera mounted on top of a Pioneer robot that moves in circles under the control of a remote handle; the dataset also includes the robot's lidar data and odometry data. Of the Kinect data, the experiment uses only the RGB images.
First, the dataset's rosbag file is played back in the ROS system, so that the RGB images of the Kinect camera, the laser point cloud information of the laser radar and the mileage information of the odometer are input into the ROS system. The monocular depth estimation module estimates a depth image sequence from the RGB image sequence and feeds it into the ROS system, and the camera motion is obtained by fusing the image information with the laser radar point cloud information. As can be seen from fig. 4, the path obtained with the method of the present application has a smaller cumulative error and is closer to the real path.
As shown in fig. 5, fig. 5 is a schematic diagram of a loop detection result provided in the embodiment of the present application.
As the robot moves, its estimated path still drifts from the actual path, so a loop detection module is needed to reduce the influence of accumulated error on the robot path. After loop detection, the robot path is re-adjusted and a more accurate final result is obtained.
In one embodiment, a bag-of-words method may be used for loop detection. When a new node is created, the program extracts visual features from the RGB image (the same GFTT features as used in building the node can be reused) and quantizes them into an incremental visual vocabulary. GFTT can extract a large amount of feature information in a short time with rich feature diversity; to improve loop detection efficiency, only the subset of odometry features with the highest response is quantized into visual words, while the other features are kept in the node for when a loop closure transformation has to be computed. The newly created node is then compared with previous nodes to detect a loop closure.
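A minimal bag-of-words sketch of this loop detection step follows: binary descriptors are quantized against a small k-means vocabulary and nodes are compared by the cosine similarity of their word histograms. The randomly generated descriptors stand in for the GFTT/BRIEF features of real nodes, and the vocabulary size, similarity measure and threshold are illustrative assumptions rather than the patent's settings.

```python
# Bag-of-words loop-closure candidate search over per-node descriptor histograms.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def bow_histogram(descriptors, vocab):
    words = vocab.predict(descriptors.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
    return hist / max(hist.sum(), 1)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Simulated per-node descriptor sets (in the real pipeline these come from the RGB image)
node_descriptors = [np.random.randint(0, 256, (300, 32), dtype=np.uint8) for _ in range(20)]

vocab = MiniBatchKMeans(n_clusters=64).fit(
    np.vstack(node_descriptors).astype(np.float32))         # stand-in for the visual vocabulary
histograms = [bow_histogram(d, vocab) for d in node_descriptors]

new_node = histograms[-1]
for j, h in enumerate(histograms[:-1]):
    if cosine(new_node, h) > 0.8:                            # similarity threshold (illustrative)
        print("loop closure candidate: current node vs node", j)
```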
In summary, the present application provides a multi-sensor fusion SLAM method based on monocular depth estimation, which comprises: acquiring an RGB image, laser point cloud information and mileage information of the current environment, and performing monocular depth estimation on the RGB image to acquire the corresponding depth information; synchronizing the topics corresponding to the depth information, the laser point cloud information and the mileage information in the ROS system; establishing a new node according to the information transmitted in each synchronized topic to obtain a local grid map; carrying out loop detection on the local grid map; and fusing the local grid maps after loop detection to create a global map. By creating the global map with a sensor combination of a laser radar and a pseudo-RGBD camera obtained through monocular depth estimation, the method reduces the economic cost and the device size compared with directly using an RGBD camera or a combination of a binocular camera and a laser radar sensor.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any change or substitution that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. A multi-sensor fusion SLAM method based on monocular depth estimation is characterized by comprising the following steps:
step 1, acquiring an RGB image, laser point cloud information and mileage information of a current environment, and performing monocular depth estimation on the RGB image to acquire corresponding depth information;
step 2, synchronizing topics corresponding to the depth information, the laser point cloud information and the mileage information in an ROS system;
step 3, establishing new nodes according to the information transmitted in each synchronized topic to obtain a local grid map;
step 4, carrying out loop detection on the local grid map;
and 5, fusing the local grid maps subjected to loop detection to create a global map.
2. The method of claim 1, wherein step 1 comprises:
step 11, acquiring a real left image and a real right image of the surrounding environment;
step 12, estimating a right disparity map and a left disparity map according to the real left map through a CNN neural network;
and step 13, overlapping the right disparity map and the real left map to obtain a right map estimation map, and overlapping the left disparity map and the real right map to obtain a left map estimation map.
3. The method of claim 2, wherein step 1 further comprises:
step 14, analyzing the loss functions of the right graph estimation graph and the left graph estimation graph, and adjusting the parameters of the CNN neural network to reduce the total loss function to an expected value; the method comprises the following steps:
analyzing a first image reconstruction loss function of the estimation process from the real left image and the estimated left image, and analyzing a second image reconstruction loss function of the estimation process from the real right image and the estimated right image; wherein the first image reconstruction loss function is expressed as:

$$C^l_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}\!\left(I^l_{ij},\tilde{I}^l_{ij}\right)}{2} + (1-\alpha)\left\|I^l_{ij}-\tilde{I}^l_{ij}\right\|\right]$$

where $C^l_{ap}$ denotes the first image reconstruction loss function, $N$ denotes the number of pixels, $I^l$ denotes the real left image, $\tilde{I}^l$ denotes the estimated left image, and $\alpha$ is the weight between the basic reconstruction error and the similarity error; the second image reconstruction loss function $C^r_{ap}$ is obtained in the same way;
Analyzing a first parallax smoothing loss function from the parallax gradients of the left disparity map and the image gradients of the real left image, and analyzing a second parallax smoothing loss function from the parallax gradients of the right disparity map and the image gradients of the real right image; wherein the first parallax smoothing loss function is expressed as:

$$C^l_{ds} = \frac{1}{N}\sum_{i,j}\left(\left|\partial_x d^l_{ij}\right| e^{-\left\|\partial_x I^l_{ij}\right\|} + \left|\partial_y d^l_{ij}\right| e^{-\left\|\partial_y I^l_{ij}\right\|}\right)$$

where $C^l_{ds}$ denotes the first parallax smoothing loss function, $\partial_x d^l$ denotes the parallax gradient of the left disparity map in the x direction, $\partial_x I^l$ denotes the image gradient of the real left image in the x direction, $\partial_y d^l$ denotes the parallax gradient of the left disparity map in the y direction, and $\partial_y I^l$ denotes the image gradient of the real left image in the y direction; the second parallax smoothing loss function $C^r_{ds}$ is obtained in the same way;
Reconstructing disparity maps from the left disparity map and the right disparity map, and analyzing a first left-right consistency loss function of the left disparity map and a second left-right consistency loss function of the right disparity map; wherein the first left-right consistency loss function is expressed as:

$$C^l_{lr} = \frac{1}{N}\sum_{i,j}\left|d^l_{ij} - d^r_{ij+d^l_{ij}}\right|$$

where $C^l_{lr}$ denotes the first left-right consistency loss function, $d^l$ denotes the left disparity map, and $d^r_{ij+d^l_{ij}}$ denotes the reconstructed (projected) disparity map of the left disparity map; the second left-right consistency loss function $C^r_{lr}$ is obtained in the same way;
Weighting the first image reconstruction loss function, the second image reconstruction loss function, the first parallax smoothing loss function, the second parallax smoothing loss function, the first left-right consistency loss function and the second left-right consistency loss function to obtain the total loss function; wherein the total loss function is expressed as:

$$C_s = \alpha_{ap}\left(C^l_{ap}+C^r_{ap}\right) + \alpha_{ds}\left(C^l_{ds}+C^r_{ds}\right) + \alpha_{lr}\left(C^l_{lr}+C^r_{lr}\right)$$

where $C_s$ denotes the total loss function, $\alpha_{ap}$ denotes the weight of the image reconstruction loss function, $\alpha_{ds}$ denotes the weight of the parallax smoothing loss function, and $\alpha_{lr}$ denotes the weight of the left-right consistency loss function.
4. The method of claim 1, wherein step 3 comprises:
step 31, feature detection: acquiring a frame of image, and performing feature detection on the frame of image by using GFTT features to acquire a feature map;
step 32, feature matching: extracting features in the feature map by using a binary robust independent basic feature descriptor through NNDR test; the feature map contains 3D features with descriptors from the last key frame;
step 33, motion estimation: a PnP algorithm is used to calculate the feature transformation of the current frame corresponding to the feature map, implemented as

$$\lambda_i v_i = R P_i + t$$

where $\lambda_i$ is the depth of the 2D reference point $v_i$ in the camera coordinate system, $v_i$ satisfies $\|v_i\| = 1$, $R$ is the rotation matrix, $t$ is the translation vector, and $P_i$ is a spatial 3D reference point in the world coordinate system;
step 34, local BA optimization: refining the displacement result by using the characteristics of all key frames in the characteristic diagram;
step 35, pose updating: updating the output odometer and the coordinate transformation by using the characteristic transformation of the motion estimation;
step 36, motion prediction: and according to the pose updating result, performing motion prediction on the position in the current frame by using the features of the feature map, and then performing the step 32 again.
5. The method of claim 4, wherein step 3 further comprises:
and step 37, analyzing whether the number of inliers calculated in the motion estimation process is lower than a set fixed threshold, and if the number of inliers is lower than the set fixed threshold, updating the key frames in the step 32 and the step 34.
6. The method according to claim 1, wherein in said step 5 the local grid maps are fused using a Bayesian filtering algorithm:

$$p(\mathrm{state}_i \mid z_{t+1}) = \left[1 + \frac{1 - p_{obs}(\mathrm{state}_i \mid z_{t+1})}{p_{obs}(\mathrm{state}_i \mid z_{t+1})}\cdot\frac{1 - p(\mathrm{state}_i \mid z_t)}{p(\mathrm{state}_i \mid z_t)}\right]^{-1}$$

where $z$ denotes the observed value of the local map; $\mathrm{state}_i$ is the state of the i-th grid cell; $p(\mathrm{state}_i \mid z_{t+1})$ and $p(\mathrm{state}_i \mid z_t)$ are the state distributions of the i-th grid cell at time t+1 and time t, respectively; and $p_{obs}(\mathrm{state}_i \mid z_{t+1})$ is a preset constant.
CN202010871686.7A 2020-08-26 2020-08-26 Monocular depth estimation-based multi-sensor fusion SLAM method Pending CN112016612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010871686.7A CN112016612A (en) 2020-08-26 2020-08-26 Monocular depth estimation-based multi-sensor fusion SLAM method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010871686.7A CN112016612A (en) 2020-08-26 2020-08-26 Monocular depth estimation-based multi-sensor fusion SLAM method

Publications (1)

Publication Number Publication Date
CN112016612A true CN112016612A (en) 2020-12-01

Family

ID=73503706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010871686.7A Pending CN112016612A (en) 2020-08-26 2020-08-26 Monocular depth estimation-based multi-sensor fusion SLAM method

Country Status (1)

Country Link
CN (1) CN112016612A (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120121161A1 (en) * 2010-09-24 2012-05-17 Evolution Robotics, Inc. Systems and methods for vslam optimization
US20170337470A1 (en) * 2016-05-20 2017-11-23 Magic Leap, Inc. Method and system for performing convolutional image transformation estimation
CN106127739A (en) * 2016-06-16 2016-11-16 华东交通大学 A kind of RGB D SLAM method of combination monocular vision
US20190377952A1 (en) * 2016-06-24 2019-12-12 Robert Bosch Gmbh RGB-D Camera Based Tracking System and Method Thereof
CN107193279A (en) * 2017-05-09 2017-09-22 复旦大学 Robot localization and map structuring system based on monocular vision and IMU information
EP3451288A1 (en) * 2017-09-04 2019-03-06 Universität Zürich Visual-inertial odometry with an event camera
WO2020111844A2 (en) * 2018-11-28 2020-06-04 서울대학교 산학협력단 Method and apparatus for enhancing image feature point in visual slam by using object label
CN110443843A (en) * 2019-07-29 2019-11-12 东北大学 A kind of unsupervised monocular depth estimation method based on generation confrontation network
CN110501017A (en) * 2019-08-12 2019-11-26 华南理工大学 A kind of Mobile Robotics Navigation based on ORB_SLAM2 ground drawing generating method
CN111045017A (en) * 2019-12-20 2020-04-21 成都理工大学 Method for constructing transformer substation map of inspection robot by fusing laser and vision
CN111089585A (en) * 2019-12-30 2020-05-01 哈尔滨理工大学 Mapping and positioning method based on sensor information fusion
CN111462207A (en) * 2020-03-30 2020-07-28 重庆邮电大学 RGB-D simultaneous positioning and map creation method integrating direct method and feature method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
庄严, 王伟, 王珂, 徐晓东 (Zhuang Yan, Wang Wei, Wang Ke, Xu Xiaodong): "Indoor simultaneous localization and mapping for mobile robots based on laser ranging and monocular vision", Acta Automatica Sinica (自动化学报), No. 06 *
李爽 (Li Shuang): "Research on vision-based indoor localization algorithms", China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑), pages 49-51 *
王化友; 代波; 何玉庆 (Wang Huayou; Dai Bo; He Yuqing): "CFD-SLAM: a fast and robust SLAM system fusing the feature-based and direct methods", High Technology Letters (高技术通讯), No. 12 *
王消为; 贺利乐; 赵涛 (Wang Xiaowei; He Lile; Zhao Tao): "Research on mobile robot SLAM based on lidar and binocular vision", Chinese Journal of Sensors and Actuators (传感技术学报), No. 03 *
骆燕燕; 陈龙 (Luo Yanyan; Chen Long): "Laser localization and mapping fused with visual information", Industrial Control Computer (工业控制计算机), No. 12 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561824A (en) * 2020-12-21 2021-03-26 中国科学院自动化研究所 Data alignment restoration method and system for fusion of 2D laser and depth image
CN112561824B (en) * 2020-12-21 2023-04-07 中国科学院自动化研究所 Data alignment restoration method and system for fusion of 2D laser and depth image
CN112446905A (en) * 2021-01-29 2021-03-05 中国科学院自动化研究所 Three-dimensional real-time panoramic monitoring method based on multi-degree-of-freedom sensing association
CN112446905B (en) * 2021-01-29 2021-05-11 中国科学院自动化研究所 Three-dimensional real-time panoramic monitoring method based on multi-degree-of-freedom sensing association
CN113219492A (en) * 2021-03-30 2021-08-06 苏州市卫航智能技术有限公司 Method and system for positioning and navigating river course ship driving
CN114624688A (en) * 2022-03-15 2022-06-14 电子科技大学 Tracking and positioning method based on multi-sensor combination
CN114624688B (en) * 2022-03-15 2023-04-07 电子科技大学 Tracking and positioning method based on multi-sensor combination

Similar Documents

Publication Publication Date Title
CN110827415B (en) All-weather unknown environment unmanned autonomous working platform
CN112016612A (en) Monocular depth estimation-based multi-sensor fusion SLAM method
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN113811920A (en) Distributed pose estimation
US20130335528A1 (en) Imaging device capable of producing three dimensional representations and methods of use
CN107808407A (en) Unmanned plane vision SLAM methods, unmanned plane and storage medium based on binocular camera
CN110717927A (en) Indoor robot motion estimation method based on deep learning and visual inertial fusion
CN112200165A (en) Model training method, human body posture estimation method, device, equipment and medium
CN110969648B (en) 3D target tracking method and system based on point cloud sequence data
CN110070578B (en) Loop detection method
US11948309B2 (en) Systems and methods for jointly training a machine-learning-based monocular optical flow, depth, and scene flow estimator
CN111462324A (en) Online spatiotemporal semantic fusion method and system
KR101869605B1 (en) Three-Dimensional Space Modeling and Data Lightening Method using the Plane Information
CN111753696A (en) Method for sensing scene information, simulation device and robot
Tarrio et al. Realtime edge based visual inertial odometry for MAV teleoperation in indoor environments
CN114627491A (en) Single three-dimensional attitude estimation method based on polar line convergence
CN115900710A (en) Dynamic environment navigation method based on visual information
US11188787B1 (en) End-to-end room layout estimation
CN114898355A (en) Method and system for self-supervised learning of body-to-body movements for autonomous driving
CN114266823A (en) Monocular SLAM method combining SuperPoint network characteristic extraction
Zhu et al. PairCon-SLAM: Distributed, online, and real-time RGBD-SLAM in large scenarios
CN115727854A (en) VSLAM positioning method based on BIM structure information
CN113902995B (en) Multi-mode human behavior recognition method and related equipment
Liao et al. VI-NeRF-SLAM: a real-time visual–inertial SLAM with NeRF mapping
Demiannay et al. Design and Development of Multirotor UAV with Automatic Collision Avoidance using Stereo Camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination