CN114359388A - Binocular vision SLAM dense image construction method based on DNN stereo matching module


Info

Publication number
CN114359388A
Authority
CN
China
Prior art keywords
stereo matching
thread
binocular
network model
depth
Prior art date
Legal status
Pending
Application number
CN202210014232.7A
Other languages
Chinese (zh)
Inventor
巢建树
刘洋
胡诗佳
顾明珠
郭杰龙
魏宪
俞辉
刘文
Current Assignee
Mindu Innovation Laboratory
Original Assignee
Mindu Innovation Laboratory
Priority date
Filing date
Publication date
Application filed by Mindu Innovation Laboratory
Priority to CN202210014232.7A
Publication of CN114359388A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a binocular vision SLAM dense mapping method based on a DNN stereo matching module, which comprises the following steps: step 1, training an end-to-end stereo matching network model on a GPU server using a public dataset; step 2, performing lightweight processing on the trained stereo matching network model; step 3, adding the lightweight stereo matching network model into a visual SLAM algorithm as a module running in its own thread, with real-time binocular depth calculation accelerated by a GPU; and step 4, finally completing the real-time construction of the dense map through an added point cloud construction thread.

Description

Binocular vision SLAM dense image construction method based on DNN stereo matching module
Technical Field
The invention relates to the technical field of image data processing and visual SLAM (simultaneous localization and mapping) algorithms, in particular to a binocular stereo vision dense point cloud mapping method based on Deep Neural Networks (DNN).
Background
Existing dense mapping methods mainly rely on RGB-D depth cameras or lidar, but depth cameras suffer strong interference from sunlight and cannot be applied to outdoor scenes, while lidar remains expensive. In addition, traditional binocular stereo matching algorithms based on the epipolar constraint cannot calculate the depth of objects that are too close or too far, and the camera baseline limits the measurement range. Finally, existing binocular stereo matching deep neural networks have complex structures and high computational complexity, and cannot meet real-time application requirements.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a binocular vision SLAM dense mapping method based on a DNN stereo matching module, comprising the following steps:
step 1, training an end-to-end stereo matching network model on a GPU server by utilizing a public data set;
step 2, carrying out lightweight processing on the trained stereo matching network model;
step 3, adding the lightweight stereo matching network model into a visual SLAM algorithm as a module running in its own thread, and completing real-time binocular depth calculation with GPU acceleration;
and step 4, finally completing the real-time construction of the dense map through the added point cloud construction thread.
Advantageous effects:
the method is different from the traditional stereo matching algorithm, and adopts the stereo matching deep neural network, so that the robustness under the weak texture scene is improved, and the matching precision is improved; according to the method, two threads are added on the basis of three threads of an ORB-SLAM3 algorithm, one thread is a binocular stereo vision depth estimation thread to achieve more accurate depth calculation of binocular images, the other thread is a dense point cloud construction thread, and a dense point cloud map is generated by combining key frames transmitted by a TRACKING thread and image depth information transmitted by the depth estimation thread.
Drawings
FIG. 1 is a schematic diagram of the PSMNet network architecture;
FIG. 2 is a schematic diagram of the channel pruning algorithm;
FIG. 3 is a schematic flow diagram of a teacher-student network based on target distillation;
FIG. 4 is the modified ORB-SLAM3 framework, with the main modifications in the box;
FIG. 5(a) a real scene;
FIG. 5(b) ORB-SLAM3 sparse map;
FIG. 5(c) a dense map estimated by the method;
FIG. 5(d) ground truth generated by an RGBD camera.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some of the embodiments of the present invention, not all of them; all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
At present, deep learning has proven highly effective at the depth estimation problem of stereo vision, and stereo matching deep neural networks such as AANet, GCNet, PSMNet and LEAStereo show much better results than traditional methods.
Taking PSMNet as an example (other network structures are also applicable to this method), according to an embodiment of the present invention, the network model is trained on a GPU server using the public datasets KITTI 2012 and KITTI 2015. The trained depth estimation network is then subjected to lightweight processing, and the lightweight network is added into a visual SLAM algorithm (taking ORB-SLAM3 as an example) as a module running in its own thread, with real-time binocular depth calculation accelerated by a GPU; finally, the real-time construction of a dense map is completed by the added point cloud construction thread. Hardware required for the whole scheme: a binocular camera, an Ubuntu 16.04 computer with a GPU configured with the ORB-SLAM3 algorithm, and a GPU server. The binocular camera may be handheld or mounted on equipment such as a small cart, a robot, or an unmanned aerial vehicle.
the invention provides a binocular vision SLAM dense map building method based on a DNN stereo matching module, which comprises the following steps:
step 1, training an end-to-end stereo matching network model on a GPU server using a public dataset; the process of training the binocular stereo matching network model is specifically as follows:
According to one embodiment of the invention, the PSMNet network model is selected as an example stereo matching network model; the stereo matching network model of the invention may also adopt stereo matching deep neural networks such as AANet, GCNet and LEAStereo.
Firstly, the PSMNet network model is trained on a GPU server. The network structure of PSMNet is shown in FIG. 1: it consists of an SPP (spatial pyramid pooling) module that aggregates global context and a stacked hourglass module for cost regularization. Each of the three hourglass networks generates a disparity map (disparity is the horizontal displacement between corresponding pixels in the left and right images). In the training stage, the total loss is calculated as a weighting of the three losses, and the loss function is defined as

$$L(d, \hat{d}) = \frac{1}{N} \sum_{i=1}^{N} \operatorname{smooth}_{L_1}\!\left(d_i - \hat{d}_i\right),$$

where $N$ is the number of labeled pixels in the image, $d_i$ is the ground-truth disparity of pixel $i$, and $\hat{d}_i$ is the predicted disparity, obtained as the probability-weighted average over all candidate disparities $d$ (a soft argmin over the matching costs $c_d$):

$$\hat{d} = \sum_{d=0}^{D_{\max}} d \times \sigma(-c_d),$$

wherein

$$\operatorname{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^{2}, & \text{if } |x| < 1, \\ |x| - 0.5, & \text{otherwise.} \end{cases}$$

In the testing phase, the final disparity map is the last of the three outputs. The image depth is then calculated according to the disparity-depth formula

$$z = \frac{f \cdot b}{d},$$

where $z$ is the depth, $f$ is the focal length, $b$ is the baseline of the binocular camera, and $d$ is the disparity.
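As a concrete illustration of the above formulas, the following is a minimal PyTorch sketch of the soft-argmin disparity regression, the weighted smooth-L1 training loss, and the disparity-to-depth conversion. The tensor shapes and function names are illustrative assumptions, not the patent's implementation; the 0.5/0.7/1.0 loss weights follow the original PSMNet paper.

```python
# Sketch (PyTorch assumed) of the PSMNet-style loss and disparity-depth
# conversion described above; shapes and names are assumptions.
import torch
import torch.nn.functional as F

def soft_argmin_disparity(cost_volume: torch.Tensor) -> torch.Tensor:
    """Regress disparity as the softmax-weighted average over candidates.

    cost_volume: (B, D, H, W) matching costs c_d for D candidate disparities.
    Returns a (B, H, W) map: d_hat = sum_d d * softmax(-c_d).
    """
    prob = F.softmax(-cost_volume, dim=1)  # sigma(-c_d)
    d_values = torch.arange(cost_volume.size(1),
                            device=cost_volume.device,
                            dtype=cost_volume.dtype).view(1, -1, 1, 1)
    return (prob * d_values).sum(dim=1)

def psmnet_loss(pred1, pred2, pred3, gt, mask, weights=(0.5, 0.7, 1.0)):
    """Total loss: weighted sum of the smooth-L1 losses of the three
    hourglass outputs, averaged over the N labeled pixels in `mask`."""
    losses = [F.smooth_l1_loss(p[mask], gt[mask])
              for p in (pred1, pred2, pred3)]
    return sum(w * l for w, l in zip(weights, losses))

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """z = f * b / d; focal length in pixels, baseline in meters."""
    return focal_px * baseline_m / disparity.clamp(min=eps)
```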
Pre-training is performed using the published KITTI dataset (a computer vision algorithm assessment dataset for use in autonomous driving scenarios), mainly using its binocular pictures and depth maps.
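For context, one pre-training step on KITTI-style data can be sketched as follows, reusing the psmnet_loss above; the dataloader field names, the 192-pixel maximum disparity and the three-output model interface are illustrative assumptions:

```python
# One training step on a KITTI-style batch; field names are assumptions.
import torch

def train_step(model, batch, optimizer):
    left, right = batch["left"], batch["right"]  # rectified stereo pair
    gt_disp = batch["disparity"]                 # ground-truth disparity map
    mask = (gt_disp > 0) & (gt_disp < 192)       # valid labeled pixels only
    pred1, pred2, pred3 = model(left, right)     # three hourglass outputs
    loss = psmnet_loss(pred1, pred2, pred3, gt_disp, mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```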
Step 2, carrying out lightweight processing on the trained stereo matching network model;
according to the embodiment of the invention, the trained network model is subjected to lightweight processing, so that the whole network structure can be compressed and can be remarkably accelerated, and meanwhile, the precision is kept as much as possible.
According to one embodiment of the invention, the lightweight process comprises channel pruning and knowledge distillation;
As shown in FIG. 2, in the channel pruning algorithm, A is the original image, B is a feature map, C is the feature map after convolution, and W is the convolution kernel (filter); the optimization is carried out inside the dashed box, which illustrates that when two channels of feature map B are pruned, the corresponding channels of the filter W (i.e., the convolution kernels marked by dashed lines) can be removed. Here c and n respectively denote the numbers of channels of feature maps B and C, and k_h × k_w is the kernel size. The idea of channel pruning is to accelerate the convolution calculation and reduce the model size by reducing the number of channels of feature map B while minimizing the reconstruction error of C. The specific method comprises the following steps:
An improved channel pruning algorithm (based on two-step iteration) is applied to the SPP module and the 3D CNN of the PSMNet stereo matching network, specifically as follows:
In one step, representative channels are found based on LASSO regression and redundant channels are eliminated; in the other step, the output of the remaining channels is reconstructed by linear least squares; the two steps are executed alternately.
The first step finds the representative channels. An L1 regularization term $\lambda\lVert\beta\rVert_1$ is added after the loss function (λ is the coefficient of the regularization term, and β_j is the selection weight of the jth channel). With the input X held fixed, a number of channels are selected and pruned; meanwhile, after the channels are pruned, the weights are re-learned according to the following formula so that the output feature map changes with minimum L2 norm before and after pruning. Channels with high weights are the representative channels, and channels with lower weights can be regarded as redundant:

$$\min_{\beta, W} \; \frac{1}{2N} \left\lVert Y - \sum_{i=1}^{c} \beta_i X_i W_i^{\top} \right\rVert_F^2 + \lambda \lVert \beta \rVert_1, \qquad \text{subject to } \lVert \beta \rVert_0 \le c',$$

where N is the number of samples; n is the number of output channels; Y is the N×n output matrix generated by applying the n×c×k_h×k_w convolution filter W to the N×c×k_h×k_w input X; ‖·‖_F is the Frobenius norm (i.e., the 2-norm); X_i is the ith channel slice of the input X; β is a coefficient vector of length c for channel selection, with β_i the mask of the ith channel (i.e., whether the entire channel is discarded); W_i is the weight of the ith channel of W; and c′ is the desired number of channels, between 0 and c. If β_i = 0, X_i is no longer useful and can be safely removed, and its corresponding weight W_i can also be deleted.
Because jointly solving W and β optimally is an NP-hard problem, the second step alternates: W is first fixed and β is solved to select channels; then β is fixed and W is solved to reconstruct the error; the weights are then re-learned based on the least square method and the remaining channels are reconstructed (i.e., the optimal channel combination is selected), so that the accuracy of the model changes little before and after pruning.
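The two-step procedure can be sketched for a single convolutional layer as below, assuming per-channel input patches have been sampled in advance; the array shapes, the prune_channels name and the top-|β| selection heuristic are illustrative assumptions, not the patent's implementation:

```python
# Minimal NumPy/scikit-learn sketch of LASSO channel selection followed
# by least-squares reconstruction; shapes and names are assumptions.
import numpy as np
from sklearn.linear_model import Lasso

def prune_channels(X, W, Y, c_keep, lam=1e-3):
    """X: (N, c, k) per-channel input patches (k = kh*kw)
       W: (n, c, k) convolution filter
       Y: (N, n) original layer outputs to be reconstructed
       c_keep: desired number of remaining channels c'
       Returns kept channel indices and re-learned weights."""
    N, c, k = X.shape
    n = W.shape[0]

    # Step 1: LASSO channel selection with W fixed. Each channel's
    # contribution Z_i = X_i @ W_i.T has shape (N, n); stacking them as
    # columns turns Y ~ sum_i beta_i * Z_i into an ordinary LASSO fit.
    Z = np.stack([X[:, i, :] @ W[:, i, :].T for i in range(c)], axis=-1)
    design = Z.reshape(N * n, c)
    lasso = Lasso(alpha=lam, fit_intercept=False)
    lasso.fit(design, Y.reshape(N * n))
    beta = lasso.coef_
    keep = np.argsort(-np.abs(beta))[:c_keep]  # most representative channels

    # Step 2: with beta fixed, re-learn W on the kept channels by linear
    # least squares to minimize the reconstruction error of Y.
    X_keep = X[:, keep, :].reshape(N, c_keep * k)
    W_new, *_ = np.linalg.lstsq(X_keep, Y, rcond=None)  # (c_keep*k, n)
    return keep, W_new.T.reshape(n, c_keep, k)
```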
The trained depth estimation network model is then used as a teacher model to supervise the training of a student model, so that the student model attains performance comparable to the large model with a greatly reduced number of parameters, realizing compression and acceleration of the model. Taking the PSMNet network of the present invention as an example, its basic structure has twelve 3×3 convolutional layers of different dimensions, and half of them are taken as the student model. The PSMNet network calculates the probability of each candidate disparity for every pixel through a softmax output layer, whose function is changed to:
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},$$

where T is the temperature hyperparameter of the softmax function, normally set to 1; z denotes a logit, i.e., a predicted probability value for a candidate disparity output by the teacher model; i is a pixel index and z_i is the predicted value of the corresponding pixel i; j likewise indexes pixels, z_j being the predicted value of pixel j, so the denominator is the sum over the predicted values of all pixels. The value of the parameter T is then increased continuously: the teacher model's outputs are divided by the temperature parameter before the softmax calculation, yielding the soft target values (i.e., the prediction results output by the teacher model, corresponding to the original hard-target labels of the samples).
Then the student model is trained: the samples predicted by the teacher model are fed in to obtain outputs, which are used in two calculations: first, the outputs are divided by the same temperature parameter as the teacher model before the softmax calculation, and the result is compared with the soft targets; second, the softmax calculation is performed directly to obtain predicted values, which are compared with the hard targets.
The two loss functions are added to obtain the total loss function; the loss is computed, and the parameters of the student network are updated with a gradient descent optimization algorithm, as shown in FIG. 3.
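A minimal PyTorch sketch of this temperature-based distillation objective follows; the equal weighting of the two terms (alpha), the KL formulation of the soft-target loss and the T² scaling are common-practice assumptions rather than details prescribed by the patent:

```python
# Distillation loss sketch: soft targets at temperature T plus hard
# targets at T=1; `alpha` and T are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_target,
                      T=4.0, alpha=0.5):
    soft_teacher = F.softmax(teacher_logits / T, dim=1)         # soft targets
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # T*T compensates for the 1/T^2 gradient scaling of the soft term.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, hard_target)    # vs. labels
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```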
Step 3, adding the binocular stereo matching module into the visual SLAM algorithm as a thread;
After the lightweight processing (i.e., pruning and distillation) of the stereo matching deep neural network is completed, the scale of the whole network is greatly reduced. The compressed network is added to the visual SLAM algorithm (taking ORB-SLAM3 as an example) as a whole module running in its own thread; it mainly performs depth estimation on the key frames (each comprising a left image and a right image, with the left image as the main view) transmitted by the TRACKING thread, and the GPU accelerates the calculation process to meet the real-time requirement, as shown in FIG. 4.
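The thread layout can be illustrated with the following minimal Python sketch: a depth estimation thread consumes key frames from the TRACKING thread, and a point cloud thread consumes (key frame, depth) pairs. The queue structure and the stereo_net/build_cloud callables are illustrative assumptions; the patent's actual integration lives inside the C++ ORB-SLAM3 framework:

```python
# Sketch of the two added worker threads; all names are assumptions.
import queue
import threading

keyframe_q = queue.Queue()  # filled by the TRACKING thread
depth_q = queue.Queue()     # filled by the depth estimation thread

def depth_estimation_thread(stereo_net, focal_px, baseline_m):
    while True:
        kf = keyframe_q.get()                      # (left, right, pose)
        disparity = stereo_net(kf.left, kf.right)  # GPU-accelerated DNN
        depth = focal_px * baseline_m / disparity  # z = f * b / d
        depth_q.put((kf, depth))

def point_cloud_thread(build_cloud, global_map):
    while True:
        kf, depth = depth_q.get()
        cloud = build_cloud(kf.left, depth, kf.pose)  # color + back-project
        global_map.append(cloud)                      # splice into dense map

def start_mapping_threads(stereo_net, build_cloud, focal_px, baseline_m):
    global_map = []
    threading.Thread(target=depth_estimation_thread,
                     args=(stereo_net, focal_px, baseline_m),
                     daemon=True).start()
    threading.Thread(target=point_cloud_thread,
                     args=(build_cloud, global_map),
                     daemon=True).start()
    return global_map
```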
Step 4, carrying out dense point cloud mapping by using the depth information calculated by the stereo matching module;
The depth map estimated by the binocular stereo matching thread is passed into the dense point cloud construction thread, and the scene structure is recovered through camera motion, mainly comprising the following steps:
1. feature point detection and matching;
2. epipolar geometry construction;
3. camera pose and scene structure estimation;
4. BA (bundle adjustment) optimization of the camera pose and scene;
5. coloring and splicing of the spatial point cloud.
Since ORB-SLAM is a visual SLAM based on the feature point method, the traditional steps 1-4 can only generate a sparse point cloud map. Through the added point cloud construction thread, the spatial coordinates of the points are obtained from the transmitted key frames and the key frame depth maps provided by the binocular stereo matching thread; the point cloud is colored according to the image information, and point cloud splicing and global optimization are performed continuously as key frames are added, thereby obtaining a dense map.
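The core of the point cloud construction step can be sketched as the back-projection of a key frame depth map into a colored world-frame point cloud, as below; the pinhole intrinsics (fx, fy, cx, cy) and the 4×4 camera-to-world pose convention are standard assumptions rather than details specified in the patent:

```python
# Back-project a keyframe depth map into a colored point cloud (sketch).
import numpy as np

def backproject_keyframe(depth, rgb, fx, fy, cx, cy, T_wc):
    """depth: (H, W) in meters; rgb: (H, W, 3); T_wc: 4x4 camera-to-world.
    Returns an (M, 6) array of [x, y, z, r, g, b] world-frame points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0                       # skip pixels without depth
    z = depth[valid]
    x = (u[valid] - cx) * z / fx            # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous
    pts_world = (T_wc @ pts_cam.T).T[:, :3]                 # to world frame
    colors = rgb[valid].astype(np.float64)                  # per-point RGB
    return np.hstack([pts_world, colors])
```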
FIGS. 5(a)-5(d) show the experimental effect of ORB-SLAM3 before and after the improvement.
Unlike traditional stereo matching algorithms, the adopted stereo matching deep neural network improves robustness in weak-texture scenes and improves matching precision.
Two threads are added on the basis of the three threads of the ORB-SLAM3 algorithm: one is a binocular stereo matching thread that realizes more accurate depth calculation of binocular images; the other is a dense point cloud construction thread, which generates a dense point cloud map using key frames transmitted by the TRACKING thread and image depth information transmitted by the binocular stereo matching thread.
Although illustrative embodiments of the present invention have been described above to facilitate understanding by those skilled in the art, it should be understood that the invention is not limited in scope to these embodiments. Various changes will be apparent to those skilled in the art, and all inventions utilizing the inventive concepts set forth herein are intended to be protected, provided they do not depart from the spirit and scope of the present invention as defined and limited by the appended claims.

Claims (5)

1. A binocular vision SLAM dense mapping method based on a DNN stereo matching module is characterized by comprising the following steps:
step 1, training an end-to-end stereo matching network model on a GPU server by utilizing a public data set;
step 2, carrying out lightweight processing on the trained stereo matching network model;
step 3, adding the lightweight stereo matching network model into a visual SLAM algorithm as a module running in its own thread, and completing real-time binocular depth calculation through GPU acceleration;
and step 4, finally completing the real-time construction of the dense map through the added point cloud construction thread.
2. The binocular vision SLAM dense mapping method based on the DNN stereo matching module as claimed in claim 1, wherein the step 1 of training the end-to-end stereo matching network model on the GPU server by using the public data set specifically comprises the following steps:
firstly, a stereo matching network model for binocular depth estimation is trained on a GPU server; the stereo matching network model generates one or more disparity maps, each corresponding to a loss function; in the training phase, the total loss is calculated as a weighting of all the loss functions, defined as

$$L(d, \hat{d}) = \frac{1}{N} \sum_{i=1}^{N} \operatorname{smooth}_{L_1}\!\left(d_i - \hat{d}_i\right),$$

where N is the number of labeled pixels in the image, $d_i$ is the ground-truth disparity, and $\hat{d}_i$ is the predicted disparity, obtained as the probability-weighted average over all candidate disparities $d$,

$$\hat{d} = \sum_{d=0}^{D_{\max}} d \times \sigma(-c_d),$$

wherein

$$\operatorname{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^{2}, & \text{if } |x| < 1, \\ |x| - 0.5, & \text{otherwise}; \end{cases}$$

in the testing stage, the final disparity map is the last of all the outputs, and the image depth is calculated according to the disparity-depth formula

$$z = \frac{f \cdot b}{d},$$

where z is the depth, f is the focal length, b is the baseline of the binocular camera, and d is the disparity;
pre-training is performed using the binocular pictures and depth maps in the public dataset.
3. The binocular vision SLAM dense mapping method based on the DNN stereo matching module as claimed in claim 1, wherein the step 2 is to perform lightweight processing on the trained stereo matching network model, specifically as follows:
in one step, representative channels are found based on LASSO regression and redundant channels are removed; in the other step, the output of the remaining channels is reconstructed by a linear least square method; the two steps are executed alternately; in the first step, representative channels are found: L1 regularization is added after the loss function, and with the input X held fixed, a number of channels are selected and pruned, while the weights are re-learned after pruning so that the output feature map has the minimum L2 norm difference before and after pruning;
in the second step, W is first fixed and β is solved for channel selection, then β is fixed and W is solved for error reconstruction; the weights are then re-learned based on the least square method and the remaining channels are reconstructed; β is a coefficient vector of length c for channel selection, and W is the convolution kernel filter;
the trained depth estimation network model is used as a teacher model to supervise the training of a student model, so that the student model attains performance comparable to the teacher model with a reduced number of parameters, realizing compression and acceleration of the model.
4. The binocular vision SLAM dense mapping method based on the DNN stereo matching module as claimed in claim 1, wherein the step 3 is to add the stereo matching network model after the lightweight processing as a module into the vision SLAM algorithm in a thread manner, and complete the real-time binocular depth calculation through GPU acceleration, specifically as follows:
after the lightweight processing of the stereo matching deep neural network is completed, the compressed network is added to the visual SLAM algorithm as a whole module running in its own thread; it mainly performs depth estimation on key frames transmitted by the TRACKING thread, and a GPU is used to accelerate the calculation process to meet the real-time requirement.
5. The binocular vision SLAM dense mapping method based on the DNN stereo matching module according to claim 1, wherein in step 4 the real-time construction of the dense map is finally completed through the added point cloud construction thread, specifically as follows:
the depth map estimated by the binocular stereo matching thread is passed into the dense point cloud construction thread, and the scene structure is recovered through camera motion, mainly comprising the following steps:
4.1. feature point detection and matching;
4.2. epipolar geometry construction;
4.3. camera pose and scene structure estimation;
4.4. BA (bundle adjustment) optimization of the camera pose and scene;
4.5. coloring and splicing of the spatial point cloud;
through the added point cloud construction thread, the spatial coordinates of the points are obtained from the transmitted key frames and the key frame depth maps provided by the binocular stereo matching thread; the point cloud is colored according to the image information, and point cloud splicing and global optimization are performed continuously as key frames are added, thereby obtaining a dense map.
Application CN202210014232.7A, priority/filing date 2022-01-06: Binocular vision SLAM dense image construction method based on DNN stereo matching module. Published as CN114359388A (pending).

Priority Applications (1)

Application Number: CN202210014232.7A; Priority/Filing Date: 2022-01-06; Title: Binocular vision SLAM dense image construction method based on DNN stereo matching module

Publications (1)

Publication Number: CN114359388A; Publication Date: 2022-04-15


Patent Citations (5)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN110533712A * | 2019-08-26 | 2019-12-03 | 北京工业大学 | A binocular stereo matching method based on convolutional neural networks
WO2020134254A1 * | 2018-12-27 | 2020-07-02 | 南京芊玥机器人科技有限公司 | Method employing reinforcement learning to optimize the trajectory of a spray painting robot
CN111583136A * | 2020-04-25 | 2020-08-25 | 华南理工大学 | Method for simultaneous localization and mapping of an autonomous mobile platform in a rescue scene
CN111998862A * | 2020-07-02 | 2020-11-27 | 中山大学 | Dense binocular SLAM method based on BNN
CN112785702A * | 2020-12-31 | 2021-05-11 | 华南理工大学 | SLAM method based on tight coupling of a 2D laser radar and a binocular camera

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party

JIA-REN CHANG et al.: "Pyramid Stereo Matching Network", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pages 1-9 *
YIHUI HE et al.: "Channel Pruning for Accelerating Very Deep Neural Networks", 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pages 1-10 *
刘德康: "Research on localization and mapping of binocular vision SLAM in dynamic environments" (动态环境下双目视觉SLAM的定位与建图研究), China Master's Theses Full-text Database, Information Science and Technology, No. 2021, 15 August 2021, pages 18-22 *
孙其功 et al.: "FPGA design and implementation of deep neural networks" (深度神经网络FPGA设计与实现), Xidian University Press, 31 July 2020, pages 231-233 *


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination