CN114495064A - Monocular depth estimation-based vehicle surrounding obstacle early warning method - Google Patents

Monocular depth estimation-based vehicle surrounding obstacle early warning method Download PDF

Info

Publication number
CN114495064A
Authority
CN
China
Prior art keywords
depth
training
obstacle
pixel
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210104631.2A
Other languages
Chinese (zh)
Inventor
辛海同
蔡登
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210104631.2A priority Critical patent/CN114495064A/en
Publication of CN114495064A publication Critical patent/CN114495064A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/277Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85Stereo camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • G06T2207/30261Obstacle

Abstract

The invention discloses a vehicle surrounding obstacle early warning method based on monocular depth estimation, which comprises the following steps: (1) acquiring image data, and generating the pixel labels, depth map labels and 3D target labels required for training to form a training data set; (2) establishing a 3D target detection model based on monocular depth estimation; (3) training the 3D target detection model with the training data set; (4) during obstacle early warning, detecting obstacles in consecutive frames with the trained and optimized 3D target detection model; (5) constructing a tracking model and tracking corresponding obstacles across consecutive frames using the Hungarian maximum matching algorithm; (6) establishing a Kalman filtering model of obstacle spatial position and velocity, obtaining the spatial position information of the tracked obstacles through the filtering algorithm, and judging whether a collision danger exists. The method improves the precision of vehicle obstacle early warning while saving cost.

Description

Monocular depth estimation-based vehicle surrounding obstacle early warning method
Technical Field
The invention belongs to the field of 3D target detection in computer vision, and particularly relates to a vehicle surrounding obstacle early warning method based on monocular depth estimation.
Background
Vehicle intelligence has made driver assistance an indispensable feature of mid- and high-end vehicle models. To ensure driving safety, detecting obstacles around the vehicle and warning the driver in time has become one of the core functions of driver assistance. An obstacle early warning system judges whether a safety accident is imminent by calculating the relative distance and speed between an obstacle and the vehicle, and reminds the driver in time to avoid danger.
To obtain the state information of obstacles ahead of the vehicle, Kunsoo Huh et al., in the 1999 American Control Conference paper "An Experimental Investigation of a CW/CA System for Automobiles Using Hardware-in-the-Loop Simulation", derived a discrete millimeter-wave radar measurement equation by sampling discretization and used a second-order Kalman filter to optimally estimate the system state, including the target vehicle distance and relative speed; however, the detection precision of this method is not high. To improve the detection accuracy of millimeter-wave radar, Anselm Haselhoff et al., in the 2007 IEEE Intelligent Transportation Systems Conference paper "Radar-Vision Fusion with an Application to Car-Following Using an Improved AdaBoost Detection Algorithm", proposed fusing millimeter-wave radar and vision to detect obstacles. The method first uses the millimeter-wave radar to produce 3D-space candidates for obstacles, then derives regions of interest on the image from this information, and finally verifies the radar detection result with an AdaBoost classifier. However, the method depends heavily on the radar detection result: if a threat target is missing from the radar candidate regions, it cannot be recovered in subsequent operations. To address this problem, Tao Wang et al., in the 2011 Sensors article "Integrating Millimeter Wave Radar with a Monocular Vision Sensor for On-Road Obstacle Detection Applications", proposed a three-level fusion strategy for monocular images and millimeter-wave radar: first calibrate the coordinate systems of the radar and the camera, then lock onto the radar detection area for obstacle detection, and then verify the detected target with the corresponding image region. In addition, Xin Liu et al., in the 2011 IEEE International Conference on Vehicular Electronics and Safety article "On-road vehicle detection fusing radar and vision", proposed a cross-validation method to detect obstacles. The method detects the image with a dedicated shadow segmentation method, then verifies the result by matching against the millimeter-wave radar detections of the same frame, and re-verifies unmatched radar objects with visual data.
Beyond monocular camera fusion, Shunguang Wu et al., in "Collision Sensing by Stereo Vision and Radar Sensor Fusion", published in IEEE Transactions on Intelligent Transportation Systems in 2009, proposed fusing a depth (stereo) camera with millimeter-wave radar to detect obstacles. The method first fits the closest point of a threatening obstacle's contour in the depth view, then fuses it with the millimeter-wave radar detection, and finally tracks the fused contour closest point under a rigid body constraint to obtain the spatial position and motion state of the threatening obstacle.
Compared with millimeter-wave radar, lidar offers longer detection range and higher precision. To acquire the state of surrounding obstacles in real time, Alex H. Lang et al., in "PointPillars: Fast Encoders for Object Detection From Point Clouds", published at the international top conference IEEE Conference on Computer Vision and Pattern Recognition in 2019, proposed encoding point cloud features into a bird's-eye-view pseudo image so that 3D object detection can be performed with convolutions. To improve precision while keeping real-time performance unchanged, Zetong Yang et al., in "3DSSD: Point-based 3D Single Stage Object Detector", removed the FP (feature propagation) module of the general point cloud feature learning pipeline by fusing F-FPS (feature-based farthest point sampling) with D-FPS (distance-based farthest point sampling) for downsampling. By reducing the computation of feature extraction while maintaining accuracy, it achieves quite good results on the 3D target detection task. However, using lidar for the obstacle warning function in driver assistance is expensive and impractical in real deployments.
In summary, considering cost, lidar-based sensor fusion solutions are not well suited to mass-produced vehicles with driver assistance functions, while, owing to its sensor limitations, millimeter-wave radar alone cannot achieve a good obstacle detection effect.
Disclosure of Invention
The invention provides a vehicle surrounding obstacle early warning method based on monocular depth estimation, which not only can save the actual production cost, but also can achieve the precision required by the actual application of the vehicle obstacle early warning function.
A vehicle surrounding obstacle early warning method based on monocular depth estimation comprises the following steps:
(1) acquiring image data, wherein the image data comprises camera calibration parameters and point cloud data in the same frame as the image; generating a pixel label, a depth map label and a 3D target label required by training in image data to form a training data set;
(2) establishing a 3D target detection model based on monocular depth estimation;
(3) training and testing the 3D target detection model by using a training data set to finally obtain a 3D target detection model after training optimization;
(4) in the process of early warning the obstacles, detecting the obstacles of continuous frames by using a 3D target detection model obtained by training optimization;
(5) constructing a tracking model, and tracking the corresponding obstacles in the continuous frames by using a Hungarian maximum matching algorithm;
(6) and establishing a Kalman filtering model related to the space position and the speed of the obstacle, finally obtaining space position information of the tracked obstacle through a filtering algorithm, and judging whether the collision danger exists or not by taking the space position information as a distance reference.
The invention discloses a method for early warning obstacles around a vehicle, and relates to a 3D target detection method based on monocular depth estimation. By using the monocular camera as the sensor, the cost is saved, and with the development of the monocular depth estimation method, the error of the depth obtained by using the monocular estimation method is extremely small in a short distance range, so that the space position information of the detected obstacle has extremely high confidence.
In the step (1), calculating a depth z value of a pixel point corresponding to point cloud data in a camera coordinate system by using camera calibration parameters and point cloud data in the same frame as an image, and taking the z value as a true value of pixel depth; and setting the default value of the depth value of the pixel point which is not matched with the point cloud as 0 so as to obtain the depth map label of the monocular image.
In the step (2), in the 3D target detection model, DenseNet121 is used as the backbone for image feature extraction, and a BTS depth estimation model predicts the depth value of each pixel point on the basis of the extracted image features; meanwhile, a pixel-of-interest proposal module generates a set of pixels of interest from the same features; finally, a simplified single-stage 3D detection head takes the pseudo laser points generated from the pixels of interest as input and outputs the regressed 3D spatial position, size and category of the obstacle.
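As an illustration of this pseudo laser point generation, the following is a minimal sketch of the standard pinhole back-projection from a pixel of interest and its predicted depth to a 3D point in camera coordinates (the function name and array layout are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def pixels_to_pseudo_points(uv, depth, K):
    """Back-project pixels (u, v) with predicted depth z into 3D camera
    coordinates: x = (u - cx) * z / fx, y = (v - cy) * z / fy.

    uv    : (N, 2) pixel coordinates of the pixels of interest
    depth : (N,)   predicted depth per pixel (metres)
    K     : (3, 3) camera intrinsic matrix
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) * depth / fx
    y = (uv[:, 1] - cy) * depth / fy
    return np.stack([x, y, depth], axis=1)  # (N, 3) pseudo laser points
```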
In the step (3), the process of training the 3D target detection model by using the training data set is as follows:
(3-1) randomly shuffling the training data set, then applying random horizontal flipping with probability 0.5 simultaneously to the images, pixel labels, 3D labels and depth map labels for data enhancement;
(3-2) feeding the training data set into the 3D target detection network in mini-batches of a preset BatchSize, predicting the depth value of each pixel point through the network depth regression head corresponding to the BTS depth estimation model, and generating the pixels of interest most likely to belong to obstacles through the region-of-interest module corresponding to the pixel-of-interest proposal module;
(3-3) taking the pixels of interest and their depth values as input and converting them into corresponding spatial coordinate points through the camera calibration parameters; feeding the generated spatial coordinate points into the 3D regression head corresponding to the 3D target detection head to regress the spatial position and size of obstacles and predict their category; during training of the 3D target detection head, minimizing the Euclidean distance between the predicted depth and the true depth at pixels with depth ground truth and between the predicted and true pixel-of-interest categories, while also minimizing the Euclidean distance between the predicted and true spatial position, size and category of obstacles;
and (3-4) repeating the step (3-1) to the step (3-3), and finishing training after the preset training times are reached.
In the step (3-2), the objective function for training the network depth regression head is a scale-invariant loss in log space, formulated as:

$$D(g) = \frac{1}{T}\sum_{i} g_i^2 - \frac{\lambda}{T^2}\left(\sum_{i} g_i\right)^2$$

where T denotes the number of pixel points with ground-truth depth, λ is a hyperparameter whose value is set to 0.5, and $g_i$ denotes the distance in log space between the predicted depth and the ground truth, computed as:

$$g_i = \log \hat{d}_i - \log d_i$$

where $\hat{d}_i$ and $d_i$ denote the estimated depth value and the ground-truth depth value, respectively. Since the scene contains many pixels with ground-truth depth, the final loss function of the network depth regression head is defined as:

$$L_{depth} = \alpha \sqrt{D(g)}$$

where α, the loss weight control amount, is set to 10 during training.
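A minimal PyTorch sketch of this scale-invariant log-space loss, assuming a boolean mask that selects the T pixels with ground-truth depth (names are illustrative):

```python
import torch

def silog_loss(pred, gt, mask, lam=0.5, alpha=10.0):
    """Scale-invariant loss in log space.

    pred, gt : predicted and ground-truth depth maps
    mask     : boolean map selecting pixels that have ground-truth depth
    D(g) = mean(g^2) - lam * mean(g)^2, with g = log(pred) - log(gt)
    loss = alpha * sqrt(D(g))
    """
    g = torch.log(pred[mask]) - torch.log(gt[mask])
    d = (g ** 2).mean() - lam * g.mean() ** 2
    return alpha * torch.sqrt(d)
```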
The region-of-interest module sets a pixel-class cross-entropy loss function to constrain the network during training, defined as:

$$L_{pix} = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right]$$

where y denotes the pixel class, taking the values 0 and 1 for background pixel points and obstacle pixel points respectively, and $\hat{y}$ denotes the predicted pixel-class probability.
In step (3-3), the training objective function of the 3D regression head includes a classification loss function and a regression loss function, formulated as:

$$L_c = -\sum_{i=1}^{K} y_i \log P_i$$

$$L_r = \mathrm{SmoothL1}_{\beta}(\mu_i - \hat{\mu}_i), \qquad \mathrm{SmoothL1}_{\beta}(x) = \begin{cases} 0.5\,x^2/\beta, & |x| < \beta \\ |x| - 0.5\,\beta, & \text{otherwise} \end{cases}$$

where $L_c$ is the classification loss, $P_i$ is the predicted probability of the i-th class, and K denotes the number of predicted classes (the method mainly involves two classes, cars and others, so K is set to 2), with $y_i$ denoting the class label; $L_r$ is the regression loss on the target spatial position, computed with the SmoothL1 loss, where β is a hyperparameter set to 0.1, and $\mu_i$ and $\hat{\mu}_i$ denote the true and predicted values.
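A sketch of the combined detection-head objective, assuming standard PyTorch losses stand in for the formulas above (cross entropy implements $L_c$ and smooth_l1_loss with beta = 0.1 implements $L_r$; names are illustrative):

```python
import torch
import torch.nn.functional as F

def detection_head_loss(logits, cls_target, box_pred, box_gt, beta=0.1):
    """Classification + regression loss of the 3D regression head.

    logits     : (N, K) class scores, K = 2 (car / other)
    cls_target : (N,)   ground-truth class indices
    box_pred   : (N, 7) predicted (x, y, z, h, w, l, theta)
    box_gt     : (N, 7) ground-truth boxes
    """
    l_c = F.cross_entropy(logits, cls_target)            # -sum y_i log P_i
    l_r = F.smooth_l1_loss(box_pred, box_gt, beta=beta)  # SmoothL1 with beta
    return l_c + l_r
```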
In the step (6), in the established Kalman filtering model, the observed quantities are x, y, z, h, w, l and θ, where x, y and z correspond to the spatial position of the obstacle, h, w and l to its size, and θ to its orientation; the predicted quantities are $x_p$, $y_p$, $z_p$, $h_p$, $w_p$, $l_p$, $\theta_p$, $v_x$, $v_y$ and $v_z$, i.e., the filtered spatial position, size, orientation and velocities of the obstacle along the three axes;
when the Kalman filtering model is established, the prediction noise and the observation noise are both assumed to follow a normal distribution. With the state vector $s = [x, y, z, h, w, l, \theta, v_x, v_y, v_z]^T$ and frame interval $\Delta t$, the matrices take the standard constant-velocity form:

$$F = \begin{bmatrix} I_7 & B \\ 0_{3\times7} & I_3 \end{bmatrix}, \qquad B = \begin{bmatrix} \Delta t\, I_3 \\ 0_{4\times3} \end{bmatrix}, \qquad H = \begin{bmatrix} I_7 & 0_{7\times3} \end{bmatrix}$$

with Q and K set as diagonal covariance matrices, where Q represents the prediction noise covariance matrix in the Kalman filtering model, K the observation noise covariance matrix, F the state transition matrix, and H the observation matrix.
Compared with the prior art, the invention has the following beneficial effects:
1. the method for early warning the obstacles around the vehicle based on monocular depth estimation can accurately detect the space positions of the obstacles in a relatively close range.
2. The invention uses the monocular camera to detect the space position of the surrounding obstacles, thereby reducing the cost of mass production of vehicles.
3. According to the invention, through the maximum matching algorithm and the Kalman filtering algorithm, more accurate obstacle position information can be obtained.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a network framework diagram of a 3D object detection model in the method of the present invention;
FIG. 3 shows the results of detecting vehicle obstacles with a monocular camera in different road scenes of the KITTI dataset.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, the obstacle early warning function of the present invention is mainly divided into two modules, an obstacle detection module (3D target detection model) and an obstacle tracking module, and on the basis of the detection result, the obstacle tracking module tracks the obstacle and finally obtains the ID and distance information of the obstacle.
The network framework of the 3D object detection model based on monocular depth estimation is shown in fig. 2, and the network is an end-to-end trained network model. When the model is trained, input data are images, camera calibration parameters corresponding to the images and point cloud data of the same frame as the images. The basic training steps are as follows:
1. The labels required for network model training are generated from the raw data, including pixel labels, image depth map labels, and 3D object labels. The pixel label indicates whether a pixel point is a pixel of interest; it is generated from the 2D box label, with pixel points inside a 2D box set to 1 (pixels of interest) and pixel points outside set to 0. The image depth map label is obtained from the laser point cloud data: the laser points are first transformed into the camera coordinate system using the camera extrinsic calibration parameters, then converted into pixel coordinates using the camera intrinsic parameters, giving the depth value of the pixel point corresponding to each laser point; the depth value of any pixel point not matched to a laser point is set to 0. A projection sketch is given below.
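A minimal sketch of this depth-label generation, assuming a KITTI-style (3, 4) lidar-to-camera extrinsic matrix Tr and (3, 3) intrinsic matrix K (names are illustrative):

```python
import numpy as np

def make_depth_label(points, Tr, K, h, w):
    """Project lidar points to pixels; unmatched pixels default to 0.

    points : (N, 3) lidar points
    Tr     : (3, 4) lidar-to-camera extrinsic matrix
    K      : (3, 3) camera intrinsic matrix
    h, w   : image height and width
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = pts_h @ Tr.T                  # (N, 3) camera coordinates
    cam = cam[cam[:, 2] > 0]            # keep points in front of the camera
    uvz = cam @ K.T                     # (u*z, v*z, z)
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    depth = np.zeros((h, w), dtype=np.float32)   # default value 0
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[ok], u[ok]] = cam[ok, 2]    # z value as the pixel depth truth
    return depth
```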
2. Initialize the network model parameters, feed the image data into the feature extraction backbone to obtain feature maps downsampled by factors of 2, 4, 8 and 16, and then send these four feature maps to the depth estimation module and the pixel-of-interest proposal module respectively.
3. In the depth estimation module, depth values are estimated using the BTS depth estimation model from the paper "From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation". After the estimated depth value of each pixel is obtained, it is used on one hand with the labels to optimize the model, and on the other hand the depth information serves as input to the 3D target detection head for the next stage of computation.
4. In the pixel-of-interest proposal module, the feature map of each scale is first upsampled and concatenated to the original image size, yielding a feature map with multi-scale information. With this feature map as input, a prediction head predicts the class of each pixel, and the 4096 highest-scoring pixel points are taken as pixels of interest (a selection sketch follows). Finally, the pixels of interest and their features are sent to the 3D target detection head for the next stage of computation.
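The selection in this step amounts to a top-k over the per-pixel scores; a short PyTorch sketch under that reading (names are illustrative):

```python
import torch

def propose_pixels(scores, k=4096):
    """Select the k highest-scoring pixels as pixels of interest.

    scores : (H, W) predicted obstacle-pixel probabilities
    returns a (k, 2) tensor of (v, u) pixel coordinates
    """
    h, w = scores.shape
    _, idx = torch.topk(scores.flatten(), k)
    v = torch.div(idx, w, rounding_mode="floor")
    u = idx % w
    return torch.stack([v, u], dim=1)
```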
5. The 3D target detection head receives the pixels of interest, the depth map and the pixel features, converts the pixels of interest into points in the world coordinate system through the camera intrinsic and extrinsic calibration parameters, and obtains 256 candidate points and their features by feature-distance farthest-point downsampling (sketched below), grouping and feature learning. Finally, a detection head predicts the 3D boxes and their categories with the 256 candidate points and their features as input.
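Farthest point sampling, one ingredient of the candidate-point selection above, can be sketched as follows; the patent's feature-distance variant (F-FPS) additionally measures distance in feature space, while this minimal version uses Euclidean distance only:

```python
import numpy as np

def farthest_point_sampling(pts, m=256):
    """Iteratively pick the point farthest from the already-chosen set.

    pts : (N, D) points (3D coordinates, or coordinates plus features for F-FPS)
    m   : number of candidate points to keep
    """
    n = len(pts)
    chosen = np.zeros(m, dtype=int)      # start from point 0
    dist = np.full(n, np.inf)            # distance to nearest chosen point
    for i in range(1, m):
        d = np.linalg.norm(pts - pts[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = dist.argmax()        # farthest remaining point
    return pts[chosen]
```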
6. The training data set is traversed cyclically for several epochs, finally yielding a converged monocular obstacle detection network.
After the detection results are obtained, the obstacles need to be tracked: maximum matching of obstacles between consecutive frames is performed with the Hungarian algorithm (a sketch follows). Meanwhile, to obtain a stable tracking result, the prediction error is reduced by Kalman filtering.
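Frame-to-frame association by maximum matching can be realized with the Hungarian algorithm as implemented in scipy; a sketch using centroid distance as the cost, with an illustrative gating threshold:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_obstacles(prev_centers, curr_centers, max_dist=3.0):
    """Associate obstacles across consecutive frames.

    prev_centers, curr_centers : (M, 3) and (N, 3) obstacle positions
    returns a list of (prev_idx, curr_idx) matched pairs
    """
    cost = np.linalg.norm(
        prev_centers[:, None, :] - curr_centers[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]
```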
Fig. 3 shows the effect of the monocular 3D target detection on vehicle obstacles in three road scenes; it is evident that the method of the present invention comprehensively detects the vehicles around the ego vehicle and predicts their distances to the ego vehicle quite accurately.
In the embodiment, training and testing are performed on the large public KITTI dataset: the monocular object detection network is trained on the KITTI 3D Object Detection Evaluation 2017, whose data are divided into a training set and a validation set of 3712 and 3769 pictures respectively. Detection and tracking are experimentally verified on three scenes of the KITTI Object Tracking Evaluation dataset.
The evaluation criteria used in the present invention are Precision and Recall. The target detection algorithm is applied to detection and tracking on the three KITTI road scenes, with the results shown in Table 1.
TABLE 1 (precision and recall of detection and tracking on the three KITTI road scenes; the table itself is rendered as an image in the source document)
As can be seen from Table 1, the results fully demonstrate the effectiveness of the present invention in obstacle detection and tracking.
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (8)

1. A vehicle surrounding obstacle early warning method based on monocular depth estimation is characterized by comprising the following steps:
(1) acquiring image data, wherein the image data comprises camera calibration parameters and point cloud data in the same frame as the image; generating a pixel label, a depth map label and a 3D target label required by training in image data to form a training data set;
(2) establishing a 3D target detection model based on monocular depth estimation;
(3) training and testing the 3D target detection model by using a training data set to finally obtain a 3D target detection model after training optimization;
(4) in the process of early warning the obstacles, detecting the obstacles of continuous frames by using a 3D target detection model obtained by training optimization;
(5) constructing a tracking model, and tracking the corresponding obstacles in the continuous frames by using a Hungarian maximum matching algorithm;
(6) and establishing a Kalman filtering model related to the space position and the speed of the obstacle, finally obtaining space position information of the tracked obstacle through a filtering algorithm, and judging whether the collision danger exists or not by taking the space position information as a distance reference.
2. The monocular depth estimation-based vehicle surrounding obstacle early warning method according to claim 1, wherein in the step (1), a depth z value of a pixel point corresponding to point cloud data in a camera coordinate system is calculated by using camera calibration parameters and point cloud data in the same frame as an image, and the z value is used as a pixel depth true value; and setting the default value of the depth value of the pixel point which is not matched with the point cloud as 0 so as to obtain the depth map label of the monocular image.
3. The monocular depth estimation-based vehicle surrounding obstacle early warning method according to claim 1, characterized in that in step (2), in the 3D target detection model, DenseNet121 is used as the backbone for image feature extraction, and a BTS depth estimation model predicts the depth value of each pixel point on the basis of the extracted image features; meanwhile, a pixel-of-interest proposal module generates a set of pixels of interest from the same features; finally, a simplified single-stage 3D detection head takes the pseudo laser points generated from the pixels of interest as input and outputs the regressed 3D spatial position, size and category of the obstacle.
4. The monocular depth estimation-based vehicle surrounding obstacle warning method according to claim 1, wherein in the step (3), the training of the 3D target detection model using the training data set is performed as follows:
(3-1) randomly shuffling the training data set, then applying random horizontal flipping with probability 0.5 simultaneously to the images, pixel labels, 3D labels and depth map labels for data enhancement;
(3-2) feeding the training data set into the 3D target detection network in mini-batches of a preset BatchSize, predicting the depth value of each pixel point through the network depth regression head corresponding to the BTS depth estimation model, and generating the pixels of interest most likely to belong to obstacles through the region-of-interest module corresponding to the pixel-of-interest proposal module;
(3-3) taking the pixels of interest and their depth values as input and converting them into corresponding spatial coordinate points through the camera calibration parameters; feeding the generated spatial coordinate points into the 3D regression head corresponding to the 3D target detection head to regress the spatial position and size of obstacles and predict their category; and, during training of the 3D target detection head, minimizing the Euclidean distance between the predicted depth and the true depth at pixels with depth ground truth and between the predicted and true pixel-of-interest categories, while also minimizing the Euclidean distance between the predicted and true spatial position, size and category of obstacles;
and (3-4) repeating the step (3-1) to the step (3-3), and finishing training after the preset training times are reached.
5. The monocular depth estimation-based vehicle surrounding obstacle early warning method according to claim 4, characterized in that in the step (3-2), the objective function trained by the network depth regression head is a scale-invariant loss in log space, formulated as:

$$D(g) = \frac{1}{T}\sum_{i} g_i^2 - \frac{\lambda}{T^2}\left(\sum_{i} g_i\right)^2$$

where T denotes the number of pixel points with ground-truth depth, λ is a hyperparameter whose value is set to 0.5, and $g_i$ denotes the distance in log space between the predicted depth and the ground truth, computed as:

$$g_i = \log \hat{d}_i - \log d_i$$

where $\hat{d}_i$ and $d_i$ denote the estimated depth value and the ground-truth depth value respectively; since the scene contains many pixels with ground-truth depth, the final loss function of the network depth regression head is defined as:

$$L_{depth} = \alpha \sqrt{D(g)}$$

where α, the loss weight control amount, is set to 10 during training.
6. The monocular depth estimation-based vehicle surrounding obstacle early warning method according to claim 4, characterized in that in the step (3-2), the region-of-interest module sets a pixel-class cross-entropy loss function to constrain the network during training, defined as:

$$L_{pix} = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right]$$

where y denotes the pixel class, taking the values 0 and 1 for background pixel points and obstacle pixel points respectively, and $\hat{y}$ denotes the predicted pixel-class probability.
7. The monocular depth estimation-based vehicle surrounding obstacle early warning method according to claim 4, characterized in that in step (3-3), the training objective function of the 3D regression head includes a classification loss function and a regression loss function, formulated as:

$$L_c = -\sum_{i=1}^{K} y_i \log P_i$$

$$L_r = \mathrm{SmoothL1}_{\beta}(\mu_i - \hat{\mu}_i), \qquad \mathrm{SmoothL1}_{\beta}(x) = \begin{cases} 0.5\,x^2/\beta, & |x| < \beta \\ |x| - 0.5\,\beta, & \text{otherwise} \end{cases}$$

where $L_c$ is the classification loss, $P_i$ is the predicted probability of the i-th class, and K denotes the number of predicted classes (the method mainly involves two classes, cars and others, so K is set to 2), with $y_i$ denoting the class label; $L_r$ is the regression loss on the target spatial position, computed with the SmoothL1 loss, where β is a hyperparameter set to 0.1, and $\mu_i$ and $\hat{\mu}_i$ denote the true and predicted values.
8. The monocular depth estimation-based vehicle surrounding obstacle early warning method according to claim 1, characterized in that in the step (6), in the established Kalman filtering model, the observed quantities are x, y, z, h, w, l and θ, where x, y and z correspond to the spatial position of the obstacle, h, w and l to its size, and θ to its orientation; the predicted quantities are $x_p$, $y_p$, $z_p$, $h_p$, $w_p$, $l_p$, $\theta_p$, $v_x$, $v_y$ and $v_z$, i.e., the filtered spatial position, size, orientation and velocities of the obstacle along the three axes;
when the Kalman filtering model is established, the prediction noise and the observation noise are both assumed to follow a normal distribution. With the state vector $s = [x, y, z, h, w, l, \theta, v_x, v_y, v_z]^T$ and frame interval $\Delta t$, the matrices take the standard constant-velocity form:

$$F = \begin{bmatrix} I_7 & B \\ 0_{3\times7} & I_3 \end{bmatrix}, \qquad B = \begin{bmatrix} \Delta t\, I_3 \\ 0_{4\times3} \end{bmatrix}, \qquad H = \begin{bmatrix} I_7 & 0_{7\times3} \end{bmatrix}$$

with Q and K set as diagonal covariance matrices, where Q represents the prediction noise covariance matrix in the Kalman filtering model, K the observation noise covariance matrix, F the state transition matrix, and H the observation matrix.
CN202210104631.2A 2022-01-28 2022-01-28 Monocular depth estimation-based vehicle surrounding obstacle early warning method Pending CN114495064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210104631.2A CN114495064A (en) 2022-01-28 2022-01-28 Monocular depth estimation-based vehicle surrounding obstacle early warning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210104631.2A CN114495064A (en) 2022-01-28 2022-01-28 Monocular depth estimation-based vehicle surrounding obstacle early warning method

Publications (1)

Publication Number Publication Date
CN114495064A true CN114495064A (en) 2022-05-13

Family

ID=81477134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210104631.2A Pending CN114495064A (en) 2022-01-28 2022-01-28 Monocular depth estimation-based vehicle surrounding obstacle early warning method

Country Status (1)

Country Link
CN (1) CN114495064A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI817578B (en) * 2022-06-22 2023-10-01 鴻海精密工業股份有限公司 Assistance method for safety driving, electronic device and computer-readable storage medium
TWI817580B (en) * 2022-06-22 2023-10-01 鴻海精密工業股份有限公司 Assistance method for safety driving, electronic device and computer-readable storage medium
TWI817579B (en) * 2022-06-22 2023-10-01 鴻海精密工業股份有限公司 Assistance method for safety driving, electronic device and computer-readable storage medium
WO2024011557A1 (en) * 2022-07-15 2024-01-18 深圳市正浩创新科技股份有限公司 Map construction method and device and storage medium
WO2024045030A1 (en) * 2022-08-29 2024-03-07 中车株洲电力机车研究所有限公司 Deep neural network-based obstacle detection system and method for autonomous rail rapid transit


Similar Documents

Publication Publication Date Title
CN110942449B (en) Vehicle detection method based on laser and vision fusion
CN112417967B (en) Obstacle detection method, obstacle detection device, computer device, and storage medium
CN112396650B (en) Target ranging system and method based on fusion of image and laser radar
WO2020052540A1 (en) Object labeling method and apparatus, movement control method and apparatus, device, and storage medium
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
CN110738121A (en) front vehicle detection method and detection system
JP2021523443A (en) Association of lidar data and image data
Rawashdeh et al. Collaborative automated driving: A machine learning-based method to enhance the accuracy of shared information
CN114359181B (en) Intelligent traffic target fusion detection method and system based on image and point cloud
CN108645375B (en) Rapid vehicle distance measurement optimization method for vehicle-mounted binocular system
US20230213643A1 (en) Camera-radar sensor fusion using local attention mechanism
CN113850102B (en) Vehicle-mounted vision detection method and system based on millimeter wave radar assistance
Liu et al. Vehicle detection and ranging using two different focal length cameras
CN112906777A (en) Target detection method and device, electronic equipment and storage medium
CN112683228A (en) Monocular camera ranging method and device
CN114325634A (en) Method for extracting passable area in high-robustness field environment based on laser radar
CN117111055A (en) Vehicle state sensing method based on thunder fusion
Tsai et al. Accurate and fast obstacle detection method for automotive applications based on stereo vision
Fu et al. Camera-based semantic enhanced vehicle segmentation for planar lidar
KR100962329B1 (en) Road area detection method and system from a stereo camera image and the recording media storing the program performing the said method
CN112733678A (en) Ranging method, ranging device, computer equipment and storage medium
Liu et al. Research on security of key algorithms in intelligent driving system
CN111612818A (en) Novel binocular vision multi-target tracking method and system
CN114648639B (en) Target vehicle detection method, system and device
CN114384486A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination