CN114359388A - Binocular vision SLAM dense image construction method based on DNN stereo matching module


Info

Publication number
CN114359388A
Authority
CN
China
Prior art keywords
stereo matching
thread
binocular
network model
depth
Prior art date
Legal status
Pending
Application number
CN202210014232.7A
Other languages
Chinese (zh)
Inventor
巢建树
刘洋
胡诗佳
顾明珠
郭杰龙
魏宪
俞辉
刘文
Current Assignee
Mindu Innovation Laboratory
Original Assignee
Mindu Innovation Laboratory
Priority date
Filing date
Publication date
Application filed by Mindu Innovation Laboratory
Priority to CN202210014232.7A
Publication of CN114359388A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a binocular vision SLAM dense mapping method based on a DNN stereo matching module, which comprises the following steps: step 1, training an end-to-end stereo matching network model on a GPU server using a public dataset; step 2, performing lightweight processing on the trained stereo matching network model; step 3, adding the lightweight stereo matching network model into a visual SLAM algorithm as a module running in its own thread, with real-time binocular depth calculation accelerated by a GPU; and step 4, finally completing the real-time construction of the dense map through an added point cloud construction thread.

Description

Binocular vision SLAM dense image construction method based on DNN stereo matching module
Technical Field
The invention relates to the technical field of image data processing and visual SLAM (simultaneous localization and mapping) algorithms, in particular to a binocular stereo vision dense point cloud mapping method based on Deep Neural Networks (DNN).
Background
Existing dense mapping methods mainly rely on RGB-D depth cameras or lidar, but depth cameras suffer strong interference from sunlight and cannot be applied to outdoor scenes, while lidar remains expensive. In addition, traditional binocular stereo matching algorithms based on the epipolar constraint cannot calculate the depth of objects that are too close or too far, and the camera baseline limits the measurement range. Finally, existing binocular stereo matching deep neural networks have complex structures and high computational complexity, and cannot meet real-time application requirements.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a binocular vision SLAM dense mapping method based on a DNN stereo matching module, comprising the following steps:
step 1, training an end-to-end stereo matching network model on a GPU server by utilizing a public data set;
step 2, carrying out lightweight processing on the trained stereo matching network model;
step 3, adding the lightweight stereo matching network model into a visual SLAM algorithm as a module running in its own thread, and completing real-time binocular depth calculation with GPU acceleration;
and step 4, finally completing the real-time construction of the dense map through the added point cloud construction thread.
Advantageous effects:
the method is different from the traditional stereo matching algorithm, and adopts the stereo matching deep neural network, so that the robustness under the weak texture scene is improved, and the matching precision is improved; according to the method, two threads are added on the basis of three threads of an ORB-SLAM3 algorithm, one thread is a binocular stereo vision depth estimation thread to achieve more accurate depth calculation of binocular images, the other thread is a dense point cloud construction thread, and a dense point cloud map is generated by combining key frames transmitted by a TRACKING thread and image depth information transmitted by the depth estimation thread.
Drawings
FIG. 1 is a schematic diagram of the PSMNet network architecture;
FIG. 2 is a schematic diagram of the channel pruning algorithm;
FIG. 3 is a schematic flow diagram of a teacher-student network based on target distillation;
FIG. 4 is the modified ORB-SLAM3 framework, with the main modifications in the box;
FIG. 5(a) a real scene;
FIG. 5(b) ORB-SLAM3 sparse map;
FIG. 5(c) a dense map estimated by the method;
FIG. 5(d) ground truth generated by an RGBD camera.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some of the embodiments of the present invention, not all of them; all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
At present, deep learning has proven highly effective at the depth estimation problem of stereo vision, and stereo matching deep neural networks such as AANet, GCNet, PSMNet and LEAStereo show much better results than traditional methods.
Taking PSMNet as an example (other network structures are also applicable to this method), according to an embodiment of the present invention, the network model is trained on a GPU server using the public datasets KITTI 2012 and KITTI 2015. The trained depth estimation network is then subjected to lightweight processing, and the lightweight network is added into a visual SLAM algorithm (taking ORB-SLAM3 as an example) as a module running in its own thread, with real-time binocular depth calculation accelerated by a GPU; finally, the real-time construction of a dense map is completed by the added point cloud construction thread. Hardware required for the whole scheme: a binocular camera, an Ubuntu 16.04 computer with a GPU configured with the ORB-SLAM3 algorithm, and a GPU server. The binocular camera may be handheld or mounted on equipment such as a small cart, a robot, or an unmanned aerial vehicle.
the invention provides a binocular vision SLAM dense map building method based on a DNN stereo matching module, which comprises the following steps:
step 1, training an end-to-end stereo matching network model on a GPU server using a public dataset; the process of training the binocular stereo matching network model is specifically as follows:
According to one embodiment of the invention, the PSMNet network model is selected as an example stereo matching network model; the stereo matching network model of the invention may also adopt stereo matching deep neural networks such as AANet, GCNet and LEAStereo.
Firstly, the PSMNet network model is trained on a GPU server. The network structure of PSMNet is shown in FIG. 1: it consists of an SPP (spatial pyramid pooling) module that aggregates global context and a stacked hourglass module for cost regularization. Each of the three hourglass networks generates a disparity map (disparity is the horizontal displacement between corresponding pixels in the left and right images). In the training stage, the total loss is calculated as a weighting of the three losses, and the loss function is defined as

$$L(d, \hat{d}) = \frac{1}{N} \sum_{i=1}^{N} \operatorname{smooth}_{L_1}\!\left(d_i - \hat{d}_i\right),$$

where $N$ is the number of labeled pixels in the image, $d_i$ is the ground-truth disparity of pixel $i$, and $\hat{d}_i$ is the predicted disparity, obtained as the probability-weighted average over all candidate disparities $d$ (a soft argmin over the matching costs $c_d$):

$$\hat{d} = \sum_{d=0}^{D_{\max}} d \times \sigma(-c_d),$$

wherein

$$\operatorname{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^{2}, & \text{if } |x| < 1, \\ |x| - 0.5, & \text{otherwise.} \end{cases}$$

In the testing phase, the final disparity map is the last of the three outputs. The image depth is then calculated according to the disparity-depth formula

$$z = \frac{f \cdot b}{d},$$

where $z$ is the depth, $f$ is the focal length, $b$ is the baseline of the binocular camera, and $d$ is the disparity.
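As a concrete illustration of the above formulas, the following is a minimal PyTorch sketch of the soft-argmin disparity regression, the weighted smooth-L1 training loss, and the disparity-to-depth conversion. The tensor shapes and function names are illustrative assumptions, not the patent's implementation; the 0.5/0.7/1.0 loss weights follow the original PSMNet paper.

```python
# Sketch (PyTorch assumed) of the PSMNet-style loss and disparity-depth
# conversion described above; shapes and names are assumptions.
import torch
import torch.nn.functional as F

def soft_argmin_disparity(cost_volume: torch.Tensor) -> torch.Tensor:
    """Regress disparity as the softmax-weighted average over candidates.

    cost_volume: (B, D, H, W) matching costs c_d for D candidate disparities.
    Returns a (B, H, W) map: d_hat = sum_d d * softmax(-c_d).
    """
    prob = F.softmax(-cost_volume, dim=1)  # sigma(-c_d)
    d_values = torch.arange(cost_volume.size(1),
                            device=cost_volume.device,
                            dtype=cost_volume.dtype).view(1, -1, 1, 1)
    return (prob * d_values).sum(dim=1)

def psmnet_loss(pred1, pred2, pred3, gt, mask, weights=(0.5, 0.7, 1.0)):
    """Total loss: weighted sum of the smooth-L1 losses of the three
    hourglass outputs, averaged over the N labeled pixels in `mask`."""
    losses = [F.smooth_l1_loss(p[mask], gt[mask])
              for p in (pred1, pred2, pred3)]
    return sum(w * l for w, l in zip(weights, losses))

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """z = f * b / d; focal length in pixels, baseline in meters."""
    return focal_px * baseline_m / disparity.clamp(min=eps)
```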
Pre-training is performed using the published KITTI dataset (a computer vision algorithm assessment dataset for use in autonomous driving scenarios), mainly using its binocular pictures and depth maps.
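For context, one pre-training step on KITTI-style data can be sketched as follows, reusing the psmnet_loss above; the dataloader field names, the 192-pixel maximum disparity and the three-output model interface are illustrative assumptions:

```python
# One training step on a KITTI-style batch; field names are assumptions.
import torch

def train_step(model, batch, optimizer):
    left, right = batch["left"], batch["right"]  # rectified stereo pair
    gt_disp = batch["disparity"]                 # ground-truth disparity map
    mask = (gt_disp > 0) & (gt_disp < 192)       # valid labeled pixels only
    pred1, pred2, pred3 = model(left, right)     # three hourglass outputs
    loss = psmnet_loss(pred1, pred2, pred3, gt_disp, mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```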
Step 2, carrying out lightweight processing on the trained stereo matching network model;
according to the embodiment of the invention, the trained network model is subjected to lightweight processing, so that the whole network structure can be compressed and can be remarkably accelerated, and meanwhile, the precision is kept as much as possible.
According to one embodiment of the invention, the lightweight process comprises channel pruning and knowledge distillation;
As shown in FIG. 2, in the channel pruning algorithm, A is the original image, B is a feature map, C is the feature map after convolution, and W is the convolution kernel (filter); the optimization is carried out inside the dashed box, which illustrates that when two channels of feature map B are pruned, the corresponding channels of the filter W (i.e., the convolution kernels marked by dashed lines) can be removed. Here c and n respectively denote the numbers of channels of feature maps B and C, and k_h × k_w is the kernel size. The idea of channel pruning is to accelerate the convolution calculation and reduce the model size by reducing the number of channels of feature map B while minimizing the reconstruction error of C. The specific method comprises the following steps:
An improved channel pruning algorithm (based on two-step iteration) is applied to the SPP module and the 3D CNN of the PSMNet stereo matching network, specifically as follows:
In one step, representative channels are found based on LASSO regression and redundant channels are eliminated; in the other step, the output of the remaining channels is reconstructed by linear least squares; the two steps are executed alternately.
The first step finds the representative channels. An L1 regularization term $\lambda\lVert\beta\rVert_1$ is added after the loss function (λ is the coefficient of the regularization term, and β_j is the selection weight of the jth channel). With the input X held fixed, a number of channels are selected and pruned; meanwhile, after the channels are pruned, the weights are re-learned according to the following formula so that the output feature map changes with minimum L2 norm before and after pruning. Channels with high weights are the representative channels, and channels with lower weights can be regarded as redundant:

$$\min_{\beta, W} \; \frac{1}{2N} \left\lVert Y - \sum_{i=1}^{c} \beta_i X_i W_i^{\top} \right\rVert_F^2 + \lambda \lVert \beta \rVert_1, \qquad \text{subject to } \lVert \beta \rVert_0 \le c',$$

where N is the number of samples; n is the number of output channels; Y is the N×n output matrix generated by applying the n×c×k_h×k_w convolution filter W to the N×c×k_h×k_w input X; ‖·‖_F is the Frobenius norm (i.e., the 2-norm); X_i is the ith channel slice of the input X; β is a coefficient vector of length c for channel selection, with β_i the mask of the ith channel (i.e., whether the entire channel is discarded); W_i is the weight of the ith channel of W; and c′ is the desired number of channels, between 0 and c. If β_i = 0, X_i is no longer useful and can be safely removed, and its corresponding weight W_i can also be deleted.
Because jointly solving W and β optimally is an NP-hard problem, the second step alternates: W is first fixed and β is solved to select channels; then β is fixed and W is solved to reconstruct the error; the weights are then re-learned based on the least square method and the remaining channels are reconstructed (i.e., the optimal channel combination is selected), so that the accuracy of the model changes little before and after pruning.
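The two-step procedure can be sketched for a single convolutional layer as below, assuming per-channel input patches have been sampled in advance; the array shapes, the prune_channels name and the top-|β| selection heuristic are illustrative assumptions, not the patent's implementation:

```python
# Minimal NumPy/scikit-learn sketch of LASSO channel selection followed
# by least-squares reconstruction; shapes and names are assumptions.
import numpy as np
from sklearn.linear_model import Lasso

def prune_channels(X, W, Y, c_keep, lam=1e-3):
    """X: (N, c, k) per-channel input patches (k = kh*kw)
       W: (n, c, k) convolution filter
       Y: (N, n) original layer outputs to be reconstructed
       c_keep: desired number of remaining channels c'
       Returns kept channel indices and re-learned weights."""
    N, c, k = X.shape
    n = W.shape[0]

    # Step 1: LASSO channel selection with W fixed. Each channel's
    # contribution Z_i = X_i @ W_i.T has shape (N, n); stacking them as
    # columns turns Y ~ sum_i beta_i * Z_i into an ordinary LASSO fit.
    Z = np.stack([X[:, i, :] @ W[:, i, :].T for i in range(c)], axis=-1)
    design = Z.reshape(N * n, c)
    lasso = Lasso(alpha=lam, fit_intercept=False)
    lasso.fit(design, Y.reshape(N * n))
    beta = lasso.coef_
    keep = np.argsort(-np.abs(beta))[:c_keep]  # most representative channels

    # Step 2: with beta fixed, re-learn W on the kept channels by linear
    # least squares to minimize the reconstruction error of Y.
    X_keep = X[:, keep, :].reshape(N, c_keep * k)
    W_new, *_ = np.linalg.lstsq(X_keep, Y, rcond=None)  # (c_keep*k, n)
    return keep, W_new.T.reshape(n, c_keep, k)
```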
The trained depth estimation network model is then used as a teacher model to supervise the training of a student model, so that the student model attains performance comparable to the large model with a greatly reduced number of parameters, realizing compression and acceleration of the model. Taking the PSMNet network of the present invention as an example, its basic structure has twelve 3×3 convolutional layers of different dimensions, and half of them are taken as the student model. The PSMNet network calculates the probability of each candidate disparity for every pixel through a softmax output layer, whose function is changed to:
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},$$

where T is the temperature hyperparameter of the softmax function, normally set to 1; z denotes a logit, i.e., a predicted probability value for a candidate disparity output by the teacher model; i is a pixel index and z_i is the predicted value of the corresponding pixel i; j likewise indexes pixels, z_j being the predicted value of pixel j, so the denominator is the sum over the predicted values of all pixels. The value of the parameter T is then increased continuously: the teacher model's outputs are divided by the temperature parameter before the softmax calculation, yielding the soft target values (i.e., the prediction results output by the teacher model, corresponding to the original hard-target labels of the samples).
Then the student model is trained: the samples predicted by the teacher model are fed in to obtain outputs, which are used in two calculations: first, the outputs are divided by the same temperature parameter as the teacher model before the softmax calculation, and the result is compared with the soft targets; second, the softmax calculation is performed directly to obtain predicted values, which are compared with the hard targets.
The two loss functions are added to obtain the total loss function; the loss is computed, and the parameters of the student network are updated with a gradient descent optimization algorithm, as shown in FIG. 3.
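A minimal PyTorch sketch of this temperature-based distillation objective follows; the equal weighting of the two terms (alpha), the KL formulation of the soft-target loss and the T² scaling are common-practice assumptions rather than details prescribed by the patent:

```python
# Distillation loss sketch: soft targets at temperature T plus hard
# targets at T=1; `alpha` and T are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_target,
                      T=4.0, alpha=0.5):
    soft_teacher = F.softmax(teacher_logits / T, dim=1)         # soft targets
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # T*T compensates for the 1/T^2 gradient scaling of the soft term.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, hard_target)    # vs. labels
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```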
Step 3, adding the binocular stereo matching module into the visual SLAM algorithm as a thread;
After the lightweight processing (i.e., pruning and distillation) of the stereo matching deep neural network is completed, the scale of the whole network is greatly reduced. The compressed network is added to the visual SLAM algorithm (taking ORB-SLAM3 as an example) as a whole module running in its own thread; it mainly performs depth estimation on the key frames (each comprising a left image and a right image, with the left image as the main view) transmitted by the TRACKING thread, and the GPU accelerates the calculation process to meet the real-time requirement, as shown in FIG. 4.
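The thread layout can be illustrated with the following minimal Python sketch: a depth estimation thread consumes key frames from the TRACKING thread, and a point cloud thread consumes (key frame, depth) pairs. The queue structure and the stereo_net/build_cloud callables are illustrative assumptions; the patent's actual integration lives inside the C++ ORB-SLAM3 framework:

```python
# Sketch of the two added worker threads; all names are assumptions.
import queue
import threading

keyframe_q = queue.Queue()  # filled by the TRACKING thread
depth_q = queue.Queue()     # filled by the depth estimation thread

def depth_estimation_thread(stereo_net, focal_px, baseline_m):
    while True:
        kf = keyframe_q.get()                      # (left, right, pose)
        disparity = stereo_net(kf.left, kf.right)  # GPU-accelerated DNN
        depth = focal_px * baseline_m / disparity  # z = f * b / d
        depth_q.put((kf, depth))

def point_cloud_thread(build_cloud, global_map):
    while True:
        kf, depth = depth_q.get()
        cloud = build_cloud(kf.left, depth, kf.pose)  # color + back-project
        global_map.append(cloud)                      # splice into dense map

def start_mapping_threads(stereo_net, build_cloud, focal_px, baseline_m):
    global_map = []
    threading.Thread(target=depth_estimation_thread,
                     args=(stereo_net, focal_px, baseline_m),
                     daemon=True).start()
    threading.Thread(target=point_cloud_thread,
                     args=(build_cloud, global_map),
                     daemon=True).start()
    return global_map
```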
Step 4, carrying out dense point cloud mapping by using the depth information calculated by the stereo matching module;
The depth map estimated by the binocular stereo matching thread is passed into the dense point cloud construction thread, and the scene structure is recovered through camera motion, mainly comprising the following steps:
1. feature point detection and matching;
2. epipolar geometry construction;
3. camera pose and scene structure estimation;
4. BA (bundle adjustment) optimization of the camera pose and scene;
5. coloring and splicing of the spatial point cloud.
Since ORB-SLAM is a visual SLAM based on the feature point method, the traditional steps 1-4 can only generate a sparse point cloud map. Through the added point cloud construction thread, the spatial coordinates of the points are obtained from the transmitted key frames and the key frame depth maps provided by the binocular stereo matching thread; the point cloud is colored according to the image information, and point cloud splicing and global optimization are performed continuously as key frames are added, thereby obtaining a dense map.
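The core of the point cloud construction step can be sketched as the back-projection of a key frame depth map into a colored world-frame point cloud, as below; the pinhole intrinsics (fx, fy, cx, cy) and the 4×4 camera-to-world pose convention are standard assumptions rather than details specified in the patent:

```python
# Back-project a keyframe depth map into a colored point cloud (sketch).
import numpy as np

def backproject_keyframe(depth, rgb, fx, fy, cx, cy, T_wc):
    """depth: (H, W) in meters; rgb: (H, W, 3); T_wc: 4x4 camera-to-world.
    Returns an (M, 6) array of [x, y, z, r, g, b] world-frame points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0                       # skip pixels without depth
    z = depth[valid]
    x = (u[valid] - cx) * z / fx            # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous
    pts_world = (T_wc @ pts_cam.T).T[:, :3]                 # to world frame
    colors = rgb[valid].astype(np.float64)                  # per-point RGB
    return np.hstack([pts_world, colors])
```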
FIGS. 5(a)-5(d) show the experimental effect of ORB-SLAM3 before and after the improvement.
Unlike traditional stereo matching algorithms, the adopted stereo matching deep neural network improves robustness in weak-texture scenes and improves matching precision.
Two threads are added on the basis of the three threads of the ORB-SLAM3 algorithm: one is a binocular stereo matching thread that realizes more accurate depth calculation of binocular images; the other is a dense point cloud construction thread, which generates a dense point cloud map using key frames transmitted by the TRACKING thread and image depth information transmitted by the binocular stereo matching thread.
Although illustrative embodiments of the present invention have been described above to facilitate understanding by those skilled in the art, it should be understood that the invention is not limited in scope to these embodiments. Various changes will be apparent to those skilled in the art, and all inventions utilizing the inventive concepts set forth herein are intended to be protected, provided they do not depart from the spirit and scope of the present invention as defined and limited by the appended claims.

Claims (5)

1. A binocular vision SLAM dense mapping method based on a DNN stereo matching module is characterized by comprising the following steps:
step 1, training an end-to-end stereo matching network model on a GPU server by utilizing a public data set;
step 2, carrying out lightweight processing on the trained stereo matching network model;
step 3, adding the lightweight stereo matching network model into a visual SLAM algorithm as a module running in its own thread, and completing real-time binocular depth calculation through GPU acceleration;
and step 4, finally completing the real-time construction of the dense map through the added point cloud construction thread.
2. The binocular vision SLAM dense mapping method based on the DNN stereo matching module as claimed in claim 1, wherein the step 1 of training the end-to-end stereo matching network model on the GPU server by using the public data set specifically comprises the following steps:
firstly, a stereo matching network model for binocular depth estimation is trained on a GPU server; the stereo matching network model generates one or more disparity maps, each corresponding to a loss function; in the training phase, the total loss is calculated as a weighting of all the loss functions, defined as

$$L(d, \hat{d}) = \frac{1}{N} \sum_{i=1}^{N} \operatorname{smooth}_{L_1}\!\left(d_i - \hat{d}_i\right),$$

where N is the number of labeled pixels in the image, $d_i$ is the ground-truth disparity, and $\hat{d}_i$ is the predicted disparity, obtained as the probability-weighted average over all candidate disparities $d$,

$$\hat{d} = \sum_{d=0}^{D_{\max}} d \times \sigma(-c_d),$$

wherein

$$\operatorname{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^{2}, & \text{if } |x| < 1, \\ |x| - 0.5, & \text{otherwise}; \end{cases}$$

in the testing stage, the final disparity map is the last of all the outputs, and the image depth is calculated according to the disparity-depth formula

$$z = \frac{f \cdot b}{d},$$

where z is the depth, f is the focal length, b is the baseline of the binocular camera, and d is the disparity;
pre-training is performed using the binocular pictures and depth maps in the public dataset.
3. The binocular vision SLAM dense mapping method based on the DNN stereo matching module as claimed in claim 1, wherein the step 2 is to perform lightweight processing on the trained stereo matching network model, specifically as follows:
in one step, representative channels are found based on LASSO regression and redundant channels are removed; in the other step, the output of the remaining channels is reconstructed by a linear least square method; the two steps are executed alternately; in the first step, representative channels are found: L1 regularization is added after the loss function, and with the input X held fixed, a number of channels are selected and pruned, while the weights are re-learned after pruning so that the output feature map has the minimum L2 norm difference before and after pruning;
in the second step, W is first fixed and β is solved for channel selection, then β is fixed and W is solved for error reconstruction; the weights are then re-learned based on the least square method and the remaining channels are reconstructed; β is a coefficient vector of length c for channel selection, and W is the convolution kernel filter;
the trained depth estimation network model is used as a teacher model to supervise the training of a student model, so that the student model attains performance comparable to the teacher model with a reduced number of parameters, realizing compression and acceleration of the model.
4. The binocular vision SLAM dense mapping method based on the DNN stereo matching module as claimed in claim 1, wherein the step 3 is to add the stereo matching network model after the lightweight processing as a module into the vision SLAM algorithm in a thread manner, and complete the real-time binocular depth calculation through GPU acceleration, specifically as follows:
after the lightweight processing of the stereo matching deep neural network is completed, the compressed network is added to the visual SLAM algorithm as a whole module running in its own thread; it mainly performs depth estimation on key frames transmitted by the TRACKING thread, and a GPU is used to accelerate the calculation process to meet the real-time requirement.
5. The binocular vision SLAM dense mapping method based on the DNN stereo matching module according to claim 1, wherein in step 4 the real-time construction of the dense map is finally completed through the added point cloud construction thread, specifically as follows:
the depth map estimated by the binocular stereo matching thread is passed into the dense point cloud construction thread, and the scene structure is recovered through camera motion, mainly comprising the following steps:
4.1. feature point detection and matching;
4.2. epipolar geometry construction;
4.3. camera pose and scene structure estimation;
4.4. BA (bundle adjustment) optimization of the camera pose and scene;
4.5. coloring and splicing of the spatial point cloud;
through the added point cloud construction thread, the spatial coordinates of the points are obtained from the transmitted key frames and the key frame depth maps provided by the binocular stereo matching thread; the point cloud is colored according to the image information, and point cloud splicing and global optimization are performed continuously as key frames are added, thereby obtaining a dense map.
Application CN202210014232.7A, priority/filing date 2022-01-06: Binocular vision SLAM dense image construction method based on DNN stereo matching module. Published as CN114359388A (pending).

Priority Applications (1)

Application Number: CN202210014232.7A; Priority/Filing Date: 2022-01-06; Title: Binocular vision SLAM dense image construction method based on DNN stereo matching module

Publications (1)

Publication Number: CN114359388A; Publication Date: 2022-04-15


Patent Citations (5)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN110533712A * | 2019-08-26 | 2019-12-03 | 北京工业大学 | A binocular stereo matching method based on convolutional neural networks
WO2020134254A1 * | 2018-12-27 | 2020-07-02 | 南京芊玥机器人科技有限公司 | Method employing reinforcement learning to optimize the trajectory of a spray painting robot
CN111583136A * | 2020-04-25 | 2020-08-25 | 华南理工大学 | Method for simultaneous localization and mapping of an autonomous mobile platform in a rescue scene
CN111998862A * | 2020-07-02 | 2020-11-27 | 中山大学 | Dense binocular SLAM method based on BNN
CN112785702A * | 2020-12-31 | 2021-05-11 | 华南理工大学 | SLAM method based on tight coupling of a 2D laser radar and a binocular camera

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party

JIA-REN CHANG et al.: "Pyramid Stereo Matching Network", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pages 1-9 *
YIHUI HE et al.: "Channel Pruning for Accelerating Very Deep Neural Networks", 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pages 1-10 *
刘德康: "Research on localization and mapping of binocular vision SLAM in dynamic environments" (动态环境下双目视觉SLAM的定位与建图研究), China Master's Theses Full-text Database, Information Science and Technology, No. 2021, 15 August 2021, pages 18-22 *
孙其功 et al.: "FPGA design and implementation of deep neural networks" (深度神经网络FPGA设计与实现), Xidian University Press, 31 July 2020, pages 231-233 *


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination