CN111028285A - Depth estimation method based on binocular vision and laser radar fusion
- Publication number: CN111028285A (application CN201911221616.0A)
- Authority: CN (China)
- Prior art keywords: binocular, disparity map, laser radar, confidence, depth estimation
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/593: Depth or shape recovery from multiple images, from stereo images
- G06T5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T7/85: Stereo camera calibration
- G06T2207/10004: Still image; Photographic image
- G06T2207/10012: Stereo images
- G06T2207/10028: Range image; Depth image; 3D point clouds
- G06T2207/10032: Satellite or aerial image; Remote sensing
- G06T2207/10044: Radar image
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
- G06T2207/20212: Image combination
- G06T2207/20221: Image fusion; Image merging
Abstract
The invention discloses a depth estimation method based on binocular vision and laser radar fusion. Data collected by a laser radar and a binocular camera are registered through joint calibration; a laser radar disparity map is obtained from the joint calibration result; a binocular disparity map is obtained through a binocular stereo matching algorithm, confidence analysis is performed on it, and points with low confidence are removed to obtain the confidence-processed binocular disparity map; features of the laser radar disparity map and the confidence-processed binocular disparity map are extracted and fused; further feature extraction is carried out through cascaded hourglass structures, and disparity regression is performed; relay supervision exploits the outputs of both the front and the rear hourglass structure; finally, an accurate and dense fused disparity map is output. The method designs a more effective network structure in which the features of the laser radar disparity map and the binocular disparity map are better extracted and fused, yielding a more accurate disparity map.
Description
Technical Field
The invention belongs to the field of robotics and computer vision applications, and particularly relates to a depth estimation method based on binocular vision and laser radar fusion.
Background
In many robotics and computer vision applications, perceiving the three-dimensional geometry of a scene or object through depth estimation is key to many tasks, such as autonomous driving, mobile robots, localization, obstacle avoidance, path planning, and three-dimensional reconstruction.
To estimate reliable depth information of a scene, two techniques are commonly used: a lidar scanner, or a stereo matching algorithm applied to binocular images. For complex outdoor scenes, lidar scanners are the most practical 3D sensing solution and can provide very accurate depth information, with errors on the order of centimeters. However, since the lidar point cloud is sparse and covers less than 6% of the image points, three-dimensional reconstruction with lidar alone is limited in practice and cannot cover all objects in a scene; although there have been efforts to interpolate depth from the sparse three-dimensional depth points, their performance is likewise limited. On the other hand, depth estimation by binocular vision yields dense depth information. However, due to the small baseline of a binocular camera, the limited sensing range, and the inherent limitations of stereo matching algorithms (caused by occlusion, illumination, and other factors), the accuracy of depth information obtained by binocular stereo vision is often low.
Therefore, to obtain better depth estimates, the depth information acquired by the laser radar and that acquired by binocular vision should be fused, producing depth information that is both accurate and dense. Attempts along these lines exist in the prior art, but the results are not satisfactory.
The invention patent with application number CN 201810448954.7 provides a fusion method based on a low-beam laser radar and a binocular camera: an error coefficient is generated from the data corresponding to the same object in the image data and the radar data, and calibrated image data are generated according to the error coefficient. The method is strongly affected by environmental factors: image data acquired by a binocular camera in a complex environment suffer heavy interference, and the generated disparity map contains many wrongly estimated regions, so an accurate fusion result is not obtained.
The patent with application number CN 201810575904.5 builds a network on top of an existing binocular stereo matching algorithm and adds laser radar disparity data as a supervision term, i.e., a systematic-error compensation module, for training and tuning before outputting an optimized disparity map. This fusion method is front-end fusion: the laser radar disparity map is added during the computation of the binocular disparity map as a supervision term constraining its solution. The method places high demands on the binocular stereo matching algorithm, and because the supervision information is injected into the stereo matching algorithm, the structure of the whole network must be adjusted whenever a different stereo matching algorithm is substituted, which is cumbersome.
The patent with application number CN 201710851841.7 provides a fusion correction method for stereo vision and low-beam laser radar in unmanned driving. The method obtains a semantically segmented disparity map through semantic segmentation, then compensates the binocular disparity map with laser radar data to obtain a compensated full-pixel disparity map; finally, the two disparity maps are fed into a neural network to obtain the final disparity map. However, the result of semantic segmentation is not accurate in complex environments, and the original depth information cannot be recovered accurately.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a depth estimation method based on binocular vision and laser radar fusion. The method designs a more effective network structure in which the features of the laser radar disparity map and the binocular disparity map are better extracted and fused, yielding a more accurate disparity map.
A depth estimation method based on binocular vision and laser radar fusion comprises the following steps:
(1) performing binocular calibration on the camera, and performing joint calibration of the camera and the laser radar;
(2) acquiring a binocular disparity map through a binocular stereo matching algorithm according to the binocular calibration result;
(3) acquiring a laser point cloud disparity map, and performing confidence processing on the binocular disparity map, wherein the confidence processing adopts a convolutional neural network that obtains the confidence of an input disparity map end to end;
(4) respectively extracting features from the input laser point cloud disparity map and the confidence-processed binocular disparity map, then concatenating and fusing them;
(5) performing further feature extraction through cascaded hourglass structures, performing disparity regression, and adding relay supervision so that the output of each hourglass structure is fully utilized;
(6) outputting the fused disparity map.
In the above technical solution, further, step (1) specifically comprises: synchronously acquiring images and laser point cloud data with the binocular camera and the laser radar, performing monocular and binocular camera calibration on the acquired images to determine the camera intrinsic and extrinsic parameters, and obtaining the extrinsic parameters between the camera and the laser radar by additionally using the laser point cloud data.
Further, step (2) is specifically: rectifying the images acquired by the binocular camera according to the binocular calibration parameters so that their rows are aligned, then obtaining the corresponding binocular disparity map with a binocular stereo matching algorithm.
Further, step (3) is specifically: projecting the laser point cloud onto the image using the joint calibration result to obtain the corresponding laser point cloud disparity map; and removing points whose confidence is less than 95% from the original binocular disparity map by means of a convolutional neural network, obtaining the confidence-processed binocular disparity map.
Further, step (4) is specifically: taking the confidence-processed binocular disparity map and the laser point cloud disparity map as input, performing preliminary feature extraction through a multilayer convolutional neural network to obtain two sixteen-channel feature maps, then concatenating and fusing them into a thirty-two-channel feature map. The convolution kernels use dilated convolution, which needs fewer parameters for the same receptive field and reduces the running time of the network; BN layers are added to prevent overfitting, vanishing gradients, and similar problems.
Further, step (5) is specifically: according to the obtained thirty-two-channel feature map, further feature extraction is performed through two cascaded hourglass structures, and finally the overall loss of the network is computed against the ground truth. Since there are two cascaded hourglass structures in total, the output of each is compared with the ground truth to compute a loss, and the final loss is the weighted sum

L = λ_1 · L(D_F1, D_G) + λ_2 · L(D_F2, D_G), evaluated over all pixels p where D_G is non-null,

so that the shallow layers of the network are sufficiently trained and the performance of the whole network improves; here D_F1 and D_F2 denote the disparity maps output by the first and second hourglass structures, D_G denotes the ground-truth disparity map, p denotes all pixels of D_G whose value is not null, and λ_1 and λ_2 are the weight coefficients applied to the losses of the two outputs.
Further, step (6) is specifically: at test time, the output of the second hourglass structure is used as the final fused disparity map.
The main idea of the method of the invention is as follows:
Data acquired by the laser radar and the binocular camera are registered through joint calibration; a laser radar disparity map is obtained from the joint calibration result; a binocular disparity map is obtained through a binocular stereo matching algorithm, confidence analysis is performed on it, and low-confidence points are removed to obtain the confidence-processed binocular disparity map; features of the laser radar disparity map and the confidence-processed binocular disparity map are extracted and fused; deeper features are extracted through cascaded hourglass structures and disparity regression is performed; relay supervision exploits the outputs of both the front and the rear hourglass structure; finally, an accurate and dense fused disparity map is output.
Compared with the prior art, the invention has the following advantages:
1. The invention provides a depth estimation framework with high precision and high flexibility.
2. The invention provides a fusion method in which binocular vision and laser radar complement each other's advantages; through confidence analysis it removes low-confidence points from the binocular disparity map, eliminating the influence of mismatched points on the final fusion.
3. The invention innovatively adopts a back-end fusion method, unlike earlier front-end fusion methods, in which the laser point cloud is added as a supervision term inside the solving process of the binocular stereo matching algorithm.
4. The invention adopts, for the first time, the hourglass structure for feature extraction in the disparity regression module. The hourglass structure downsamples first and then upsamples, with skip-level connections added to assist upsampling, so deep features can be extracted while the amount of network parameters and computation stays modest. Relay supervision is added to the cascaded hourglass structures, and the loss of each hourglass output is computed separately, making full use of the output information of network layers at different depths; this is equivalent to integrating several models and improves the robustness of the network.
5. The invention trains the network on a large number of different scenes, so its adaptability to the environment is stronger and good fusion results can be obtained even in complex environments.
Drawings
FIG. 1 is an overall flow diagram of the process of the present invention;
FIG. 2 shows the original image and the disparity maps obtained in Embodiment 1 of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
Fig. 1 shows a process of performing binocular vision and lidar fusion according to the present invention, which includes the following steps:
1. Synchronously acquire calibration plate images and laser point cloud data with the binocular camera and the laser radar; perform monocular and binocular camera calibration on the acquired images to determine the intrinsic parameters of the cameras and the extrinsic parameters between the left and right cameras; and obtain the extrinsic parameters between the cameras and the laser radar by additionally using the laser point cloud data.
1.1 In this step, keep the optical axes of the binocular camera lenses parallel and the focal lengths identical, and fix the positions of the camera and the laser radar so that their relative pose stays unchanged; collect the two image streams and the laser radar point cloud data simultaneously under the control of a signal synchronization unit.
1.2 According to the monocular and binocular camera calibration principles, acquire the camera intrinsic and extrinsic parameters together with the plane equation a_{m,i} of the calibration plate in each frame's camera coordinate system; at the same time, obtain from the laser point cloud the plane equation a_{l,i} of the calibration plate in each frame's laser radar coordinate system. Here θ denotes the normal vector of the plane, X denotes the coordinates of a point on the plane, and d denotes the distance from the origin of the coordinate system to the plane:

a_{m,i}: θ_{m,i}·X + d_{m,i} = 0
a_{l,i}: θ_{l,i}·X + d_{l,i} = 0
1.3 After the plane equations of the calibration plate in the different coordinate systems are obtained, constrain the RT matrix by minimizing the following objective and solve for the final extrinsic parameters; i is the index of each group of image and laser point cloud data, and l is the number of points on the calibration plane in each group:

min_{R,T} Σ_i Σ_{j=1..l} ( θ_{m,i}·(R·X_{l,i,j} + T) + d_{m,i} )²

where X_{l,i,j} denotes the j-th laser point on the calibration plane in group i.
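As an illustration of this plane-alignment solve, a minimal numerical sketch using nonlinear least squares; the function names and the rotation-vector parameterization are assumptions for the sketch, and the camera-frame plane parameters (θ_{m,i}, d_{m,i}) and the lidar points lying on the calibration plane are taken as already extracted:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation


def plane_residuals(params, planes_cam, plane_points_lidar):
    # params: rotation vector (3) + translation (3) of the lidar-to-camera transform
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    res = []
    for (theta, d), pts in zip(planes_cam, plane_points_lidar):
        pts_cam = pts @ R.T + t          # lidar plane points in the camera frame
        res.append(pts_cam @ theta + d)  # signed distance to the plane theta.X + d = 0
    return np.concatenate(res)


def solve_extrinsics(planes_cam, plane_points_lidar):
    # planes_cam: list of (theta_i, d_i) pairs with unit normals;
    # plane_points_lidar: list of (N_i, 3) arrays of lidar points on the plate
    sol = least_squares(plane_residuals, np.zeros(6),
                        args=(planes_cam, plane_points_lidar))
    return Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:]
```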
2. Using the binocular calibration and the joint calibration, process the image and laser point cloud data to obtain the binocular disparity map and the laser point cloud disparity map.
2.1 In this step, first use the binocular calibration result to obtain row-aligned left and right images, then obtain the corresponding disparity map with the PSMNet end-to-end network. PSMNet extracts feature information from the left and right views through a pyramid pooling structure, constructs a matching cost volume from the extracted features, and finally performs disparity regression through three cascaded hourglass structures.
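The embodiment uses PSMNet, but step 2.1 only requires some binocular stereo matching algorithm on row-aligned images. As a self-contained stand-in, a sketch with OpenCV rectification and semi-global matching (the parameter values are illustrative, not from the patent):

```python
import cv2
import numpy as np


def binocular_disparity(left, right, K1, D1, K2, D2, R, T):
    # Rectify with the binocular calibration results so image rows are aligned,
    # then run SGBM as a classical stand-in for the stereo matching step.
    size = (left.shape[1], left.shape[0])
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
    map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
    left_r = cv2.remap(left, map1x, map1y, cv2.INTER_LINEAR)
    right_r = cv2.remap(right, map2x, map2y, cv2.INTER_LINEAR)
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    # SGBM returns fixed-point disparities scaled by 16
    return sgbm.compute(left_r, right_r).astype(np.float32) / 16.0
```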
2.2 After the preliminary disparity map is obtained, perform confidence analysis on it and remove the points whose confidence is lower than Thresh, as the following formula shows:

M_i = M_i if d_i ≥ Thresh; M_i = 0 if d_i < Thresh

where M_i represents the gray value of the binocular disparity map at pixel i, d_i represents the gray value at pixel i of the confidence map output by the confidence analysis, and Thresh represents the chosen confidence threshold.
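Step 2.2 amounts to masking: a minimal sketch, assuming the disparity map and the confidence map are aligned arrays with confidence values in [0, 1] (the names are illustrative):

```python
import numpy as np


def confidence_filter(disparity: np.ndarray, confidence: np.ndarray,
                      thresh: float = 0.95) -> np.ndarray:
    # Invalidate (zero out) disparities whose confidence falls below the
    # threshold; the embodiment uses a high threshold (95%) so that only
    # reliable binocular points survive into the fusion stage.
    out = disparity.copy()
    out[confidence < thresh] = 0.0
    return out
```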
2.3 Project the laser point cloud into the image according to the camera-lidar extrinsic parameters obtained by calibration, and convert the depth of each laser point into disparity according to the relation between disparity and depth, yielding the laser point cloud disparity map:

D = B·f / Z

where B and f represent the length of the binocular camera baseline and the focal length of the camera, respectively, Z represents depth, and D represents disparity.
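A sketch of step 2.3, assuming rectified-camera intrinsics K and the lidar-to-camera extrinsics (R, t) from the joint calibration; the rounding of pixel coordinates and the handling of projection collisions are simplifications:

```python
import numpy as np


def lidar_disparity_map(points, R, t, K, baseline, focal, h, w):
    # Transform lidar points into the camera frame and project with intrinsics K.
    pts_cam = points @ R.T + t
    valid = pts_cam[:, 2] > 0                       # keep points in front of camera
    pts_cam = pts_cam[valid]
    uvw = pts_cam @ K.T
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)  # pixel coordinates
    disp = np.zeros((h, w), dtype=np.float32)       # sparse lidar disparity map
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    # disparity from depth: D = B * f / Z
    disp[uv[inside, 1], uv[inside, 0]] = baseline * focal / pts_cam[inside, 2]
    return disp
```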
3. Taking the binocular disparity map and the laser point cloud disparity map obtained in step 2 as input, perform preliminary feature extraction through a multilayer convolutional neural network to obtain two sixteen-channel feature maps, then concatenate and fuse the obtained multi-channel feature maps into a thirty-two-channel feature map. The convolution kernels use dilated convolution, which needs fewer parameters for the same receptive field and reduces the running time of the network; BN layers are added to prevent overfitting, vanishing gradients, and similar problems.
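A PyTorch sketch of this step; the channel counts (two sixteen-channel maps concatenated into thirty-two channels), the dilated convolutions, and the BN layers follow the text, while the number of layers and the kernel sizes are assumptions:

```python
import torch
import torch.nn as nn


class PreFusion(nn.Module):
    """Preliminary feature extraction and concatenation fusion: each input
    disparity map (assumed single-channel) goes through a small
    dilated-convolution branch with BN, yielding a sixteen-channel feature
    map; the two maps are concatenated into a thirty-two-channel map."""
    def __init__(self):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=2, dilation=2),  # dilated conv: larger
                nn.BatchNorm2d(16),                          # receptive field with
                nn.ReLU(inplace=True),                       # fewer parameters
                nn.Conv2d(16, 16, 3, padding=2, dilation=2),
                nn.BatchNorm2d(16),
                nn.ReLU(inplace=True),
            )
        self.stereo_branch = branch()
        self.lidar_branch = branch()

    def forward(self, stereo_disp, lidar_disp):
        f1 = self.stereo_branch(stereo_disp)   # (N, 16, H, W)
        f2 = self.lidar_branch(lidar_disp)     # (N, 16, H, W)
        return torch.cat([f1, f2], dim=1)      # (N, 32, H, W)
```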
4. From the extracted features, extract deeper features through the two cascaded hourglass structures, and finally compute the overall loss of the network against the ground truth. Since there are two cascaded hourglass structures in total, the output of each is compared with the ground truth to compute a loss, and the final loss is the weighted sum

L = λ_1 · L(D_F1, D_G) + λ_2 · L(D_F2, D_G), evaluated over all pixels p where D_G is non-null,

so that the shallow layers of the network are trained sufficiently and the performance of the whole network improves; here D_F1 and D_F2 denote the disparity maps output by the first and second hourglass structures, D_G denotes the ground-truth disparity map, p denotes all pixels of D_G whose value is not null, and λ_1 and λ_2 are the weight coefficients applied to the losses of the two outputs.
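A sketch of the relay-supervised loss; the masking to non-null ground-truth pixels and the weighted sum follow the text, while smooth L1 as the per-pixel loss and the weight values are assumptions:

```python
import torch
import torch.nn.functional as F


def relay_loss(d_f1, d_f2, d_gt, lam1=0.5, lam2=1.0):
    # Relay supervision: each hourglass output is compared with the ground
    # truth separately, then the two losses are weighted and summed. Only
    # pixels where the ground truth is valid (non-null, here encoded as > 0)
    # contribute to the loss.
    mask = d_gt > 0
    l1 = F.smooth_l1_loss(d_f1[mask], d_gt[mask])
    l2 = F.smooth_l1_loss(d_f2[mask], d_gt[mask])
    return lam1 * l1 + lam2 * l2
```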
5. At test time, the output of the second hourglass structure is used as the final fused disparity map.
The method adds confidence analysis to reduce the errors caused by mismatches in the binocular disparity map. The confidence analysis works as follows:
Step 1) after the preliminary disparity map is obtained, perform confidence analysis on it;
Step 2) output the confidence analysis result map;
Step 3) since the fusion process is a complementary process, the accuracy of both the binocular disparity map and the laser point cloud disparity map must be ensured as far as possible; points with confidence below a rather high threshold Thresh are removed, guaranteeing that pixels with excessive errors do not enter the fusion process.
Thanks to the back-end fusion method (i.e., directly fusing the binocular disparity map and the laser radar disparity map), the whole system can flexibly switch among different binocular stereo matching algorithms without modifying the structure of the network, which gives higher flexibility.
The hourglass structure is adopted for feature extraction because it extracts the deep features of the disparity map well, making the information richer and benefiting the subsequent disparity map. The hourglass structure downsamples first and then upsamples, with skip-level connections added to assist upsampling; this guarantees that deep features are extracted while keeping the amount of network computation modest (downsampling before upsampling greatly reduces it). Relay supervision is added to the cascaded hourglass structures, and the loss of each hourglass output is computed separately, making full use of the output information of network layers at different depths; this is equivalent to integrating several models and improves the robustness of the network.
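A PyTorch sketch of one hourglass block with the downsample-then-upsample strategy and a skip-level connection assisting upsampling; the channel counts and depths are illustrative, and input height and width are assumed divisible by 4:

```python
import torch.nn as nn
import torch.nn.functional as F


class Hourglass(nn.Module):
    """Downsample-then-upsample feature extractor with a skip connection."""
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                                   nn.BatchNorm2d(ch * 2), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.Conv2d(ch * 2, ch * 2, 3, stride=2, padding=1),
                                   nn.BatchNorm2d(ch * 2), nn.ReLU(inplace=True))
        self.up1 = nn.ConvTranspose2d(ch * 2, ch * 2, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)

    def forward(self, x):
        d1 = self.down1(x)               # 1/2 resolution
        d2 = self.down2(d1)              # 1/4 resolution: cheap deep features
        u1 = F.relu(self.up1(d2) + d1)   # skip-level connection assists upsampling
        return F.relu(self.up2(u1))      # back to input resolution
```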
Example 1
This embodiment measures the quality of depth estimation in a road scene, covering both the restoration of fine details and of the overall contour. In FIG. 2, (a), (b), and (c) show the original image, the disparity map obtained by the PSMNet method, and the disparity map obtained by the fusion method of the present invention, respectively. The results show that, compared with the original binocular disparity map, the method of the present invention restores the contours of the person and the vehicle more accurately. Meanwhile, the original binocular disparity map has an obvious error in the depth estimate of the distant pole, leaving a gap in it, whereas the method of the present invention estimates the depth information of the whole pole well.
Example 2
This example evaluates the method of the invention on the Kitti2015 dataset. The Kitti2015 dataset consists of 200 groups of data, each including left and right views and the corresponding ground truth; the corresponding lidar data can be obtained from Kitti's raw data. The dataset evaluates the quality of depth estimation by comparing the disparity map with the ground truth and computing an error rate; the lower the error rate, the better the depth estimation. The error rate is defined as the proportion, among all pixels, of pixels whose disparity differs from the ground truth by more than 3 pixels or by more than 5%. Specific results are shown in Table 1:
TABLE 1 Performance of the respective methods on Kitti2015

Method | Error rate
---|---
PSMNet | 3.98%
Fusion method of the invention | 1.46%
As the table shows, the error rate of the PSMNet method we used is 3.98% on the Kitti2015 dataset, while the disparity map obtained by fusing the PSMNet results with the method of the invention has an error rate of 1.46%. Thus, after the binocular disparity map and the laser radar disparity map are fused by the method, the error rate of the original binocular disparity map is reduced by about 60%.
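A minimal sketch of the error-rate computation used in this evaluation, following the definition given above (valid pixels are assumed to be those with positive ground-truth disparity):

```python
import numpy as np


def disparity_error_rate(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    # Error rate: fraction of valid pixels whose disparity differs from the
    # ground truth by more than 3 pixels or by more than 5% of the true value.
    valid = gt > 0
    err = np.abs(pred[valid] - gt[valid])
    bad = (err > abs_thresh) | (err > rel_thresh * gt[valid])
    return bad.mean()
```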
Claims (7)
1. A depth estimation method based on binocular vision and laser radar fusion is characterized by comprising the following steps:
(1) performing binocular calibration on the camera, and performing joint calibration of the camera and the laser radar;
(2) acquiring a binocular disparity map through a binocular stereo matching algorithm according to the binocular calibration result;
(3) acquiring a laser point cloud disparity map, and performing confidence processing on the binocular disparity map, wherein the confidence processing adopts a convolutional neural network method;
(4) respectively extracting features from the input laser point cloud disparity map and the confidence-processed binocular disparity map, then concatenating and fusing them;
(5) performing further feature extraction through cascaded hourglass structures, performing disparity regression and adding relay supervision;
(6) outputting the fused disparity map.
2. The binocular vision and lidar fusion based depth estimation method of claim 1, wherein step (1) is specifically: synchronously acquiring images and laser point cloud data with the binocular camera and the laser radar, performing monocular and binocular camera calibration on the acquired images to determine the camera intrinsic and extrinsic parameters, and obtaining the extrinsic parameters between the camera and the laser radar by additionally using the laser point cloud data.
3. The binocular vision and lidar fusion based depth estimation method of claim 1, wherein step (2) is specifically: rectifying the images acquired by the binocular camera according to the binocular calibration parameters so that their rows are aligned, then obtaining the corresponding binocular disparity map with a binocular stereo matching algorithm.
4. The binocular vision and lidar fusion based depth estimation method of claim 1, wherein step (3) is specifically: projecting the laser point cloud onto the image using the joint calibration result to obtain the corresponding laser point cloud disparity map; and removing points whose confidence is less than 95% from the original binocular disparity map by means of a convolutional neural network, obtaining the confidence-processed binocular disparity map.
5. The binocular vision and lidar fusion based depth estimation method of claim 1, wherein step (4) is specifically: taking the confidence-processed binocular disparity map and the laser point cloud disparity map as input, performing preliminary feature extraction through a multilayer convolutional neural network to obtain two sixteen-channel feature maps, then concatenating and fusing them into a thirty-two-channel feature map; the convolution kernels adopt dilated convolution, and BN layers are added.
6. The binocular vision and lidar fusion based depth estimation method of claim 1, wherein step (5) is specifically: according to the obtained thirty-two-channel feature map, performing further feature extraction through two cascaded hourglass structures, and finally computing the overall loss of the network against the ground truth: the output of each hourglass structure is compared with the ground truth to compute a loss, and the final loss is the weighted sum

L = λ_1 · L(D_F1, D_G) + λ_2 · L(D_F2, D_G), evaluated over all pixels p where D_G is non-null,

where D_F1 and D_F2 denote the disparity maps output by the first and second hourglass structures, D_G denotes the ground-truth disparity map, p denotes all pixels of D_G whose value is not null, and λ_1 and λ_2 are the weight coefficients applied to the losses of the two outputs.
7. The binocular vision and lidar fusion based depth estimation method of claim 1, wherein step (6) is specifically: at test time, using the output of the second hourglass structure as the final fused disparity map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911221616.0A CN111028285A (en) | 2019-12-03 | 2019-12-03 | Depth estimation method based on binocular vision and laser radar fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111028285A (en) | 2020-04-17
Family
ID=70207849
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911221616.0A Pending CN111028285A (en) | 2019-12-03 | 2019-12-03 | Depth estimation method based on binocular vision and laser radar fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111028285A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10003787B1 (en) * | 2016-12-21 | 2018-06-19 | Canon Kabushiki Kaisha | Method, system and apparatus for refining a depth map |
CN108537871A (en) * | 2017-03-03 | 2018-09-14 | 索尼公司 | Information processing equipment and information processing method |
CN107886477A (en) * | 2017-09-20 | 2018-04-06 | 武汉环宇智行科技有限公司 | Unmanned neutral body vision merges antidote with low line beam laser radar |
CN109978933A (en) * | 2019-01-03 | 2019-07-05 | 北京中科慧眼科技有限公司 | The confidence level detection method of parallax information data, device and automated driving system |
Non-Patent Citations (4)
Title |
---|
JIA-REN CHANG; YONG-SHENG CHEN: "Pyramid Stereo Matching Network", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
KIHONG PARK; SEUNGRYONG KIM; KWANGHOON SOHN: "High-Precision Depth Estimation with the 3D LiDAR and Stereo Fusion", 2018 IEEE International Conference on Robotics and Automation (ICRA) *
POGGI M; MATTOCCIA S: "Learning from scratch a confidence measure", BMVC *
CHEN SHAOJIE; ZHU ZHENCAI ET AL.: "Registration Method of Lidar and Stereo Vision Based on 3D Feature Points", Laser & Optoelectronics Progress *
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112258590A (en) * | 2020-12-08 | 2021-01-22 | 杭州迦智科技有限公司 | Laser-based depth camera external parameter calibration method, device and storage medium thereof |
CN112233163A (en) * | 2020-12-14 | 2021-01-15 | 中山大学 | Depth estimation method and device for laser radar stereo camera fusion and medium thereof |
CN112684424B (en) * | 2020-12-30 | 2022-08-16 | 同济大学 | Automatic calibration method for millimeter wave radar and camera |
CN112684424A (en) * | 2020-12-30 | 2021-04-20 | 同济大学 | Automatic calibration method for millimeter wave radar and camera |
CN113096175A (en) * | 2021-03-24 | 2021-07-09 | 苏州中科广视文化科技有限公司 | Depth map confidence estimation method based on convolutional neural network |
CN113096175B (en) * | 2021-03-24 | 2023-10-24 | 苏州中科广视文化科技有限公司 | Depth map confidence estimation method based on convolutional neural network |
CN113096176A (en) * | 2021-03-26 | 2021-07-09 | 西安交通大学 | Semantic segmentation assisted binocular vision unsupervised depth estimation method |
CN113096176B (en) * | 2021-03-26 | 2024-04-05 | 西安交通大学 | Semantic segmentation-assisted binocular vision unsupervised depth estimation method |
CN113160298A (en) * | 2021-03-31 | 2021-07-23 | 奥比中光科技集团股份有限公司 | Depth truth value acquisition method, device and system and depth camera |
CN113160298B (en) * | 2021-03-31 | 2024-03-08 | 奥比中光科技集团股份有限公司 | Depth truth value acquisition method, device and system and depth camera |
CN113110451A (en) * | 2021-04-14 | 2021-07-13 | 浙江工业大学 | Mobile robot obstacle avoidance method with depth camera and single line laser radar fused |
CN113281779A (en) * | 2021-05-20 | 2021-08-20 | 中山大学 | 3D object rapid detection method, device, equipment and medium |
CN113625288A (en) * | 2021-06-15 | 2021-11-09 | 中国科学院自动化研究所 | Camera and laser radar pose calibration method and device based on point cloud registration |
CN113379831A (en) * | 2021-06-22 | 2021-09-10 | 北京航空航天大学青岛研究院 | Augmented reality method based on binocular camera and humanoid robot |
CN113379831B (en) * | 2021-06-22 | 2022-09-09 | 北京航空航天大学青岛研究院 | Augmented reality method based on binocular camera and humanoid robot |
CN113706599A (en) * | 2021-10-29 | 2021-11-26 | 纽劢科技(上海)有限公司 | Binocular depth estimation method based on pseudo label fusion |
CN114359891A (en) * | 2021-12-08 | 2022-04-15 | 华南理工大学 | Three-dimensional vehicle detection method, system, device and medium |
CN114359891B (en) * | 2021-12-08 | 2024-05-28 | 华南理工大学 | Three-dimensional vehicle detection method, system, device and medium |
CN115861401A (en) * | 2023-02-27 | 2023-03-28 | 之江实验室 | Binocular and point cloud fusion depth recovery method, device and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111028285A (en) | Depth estimation method based on binocular vision and laser radar fusion | |
CN110942449B (en) | Vehicle detection method based on laser and vision fusion | |
CN109685842B (en) | Sparse depth densification method based on multi-scale network | |
CN112634341B (en) | Method for constructing depth estimation model of multi-vision task cooperation | |
JP5926228B2 (en) | Depth detection method and system for autonomous vehicles | |
CN112419494B (en) | Obstacle detection and marking method and device for automatic driving and storage medium | |
CN110689562A (en) | Trajectory loop detection optimization method based on generation of countermeasure network | |
CN111210481A (en) | Depth estimation acceleration method of multiband stereo camera | |
CN107886477A (en) | Unmanned neutral body vision merges antidote with low line beam laser radar | |
CN113936139A (en) | Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation | |
CN111260773A (en) | Three-dimensional reconstruction method, detection method and detection system for small obstacles | |
CN111340922A (en) | Positioning and mapping method and electronic equipment | |
CN104539928A (en) | Three-dimensional printing image synthesizing method for optical grating | |
CN116258817B (en) | Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction | |
AU2021103300A4 (en) | Unsupervised Monocular Depth Estimation Method Based On Multi- Scale Unification | |
CN111209840B (en) | 3D target detection method based on multi-sensor data fusion | |
CN111260715A (en) | Depth map processing method, small obstacle detection method and system | |
CN114494462A (en) | Binocular camera ranging method based on Yolov5 and improved tracking algorithm | |
CN111105451B (en) | Driving scene binocular depth estimation method for overcoming occlusion effect | |
KR20220014678A (en) | Method and apparatus for estimating depth of images | |
CN113313740B (en) | Disparity map and surface normal vector joint learning method based on plane continuity | |
CN112270701B (en) | Parallax prediction method, system and storage medium based on packet distance network | |
CN103544732A (en) | Three-dimensional reconstruction method for lunar vehicle | |
CN117434294A (en) | Multi-aperture pure-vision optical flow velocity measurement method for unmanned aerial vehicle | |
CN116824433A (en) | Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200417 |