CN113269118A - Monocular vision forward vehicle distance detection method based on depth estimation - Google Patents

Monocular vision forward vehicle distance detection method based on depth estimation

Info

Publication number
CN113269118A
CN113269118A
Authority
CN
China
Prior art keywords
vehicle distance
forward vehicle
distance detection
depth estimation
monocular vision
Prior art date
Legal status
Granted
Application number
CN202110633046.7A
Other languages
Chinese (zh)
Other versions
CN113269118B (en)
Inventor
赵敏 (Zhao Min)
孙棣华 (Sun Dihua)
周璇 (Zhou Xuan)
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202110633046.7A priority Critical patent/CN113269118B/en
Publication of CN113269118A publication Critical patent/CN113269118A/en
Application granted granted Critical
Publication of CN113269118B publication Critical patent/CN113269118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular vision forward vehicle distance detection method based on depth estimation, characterized by comprising the following steps: Step 1, building a forward vehicle distance detection model based on depth estimation; Step 2, introducing the DORN algorithm and building a DORN-based forward vehicle distance detection model; Step 3, optimizing the target key point fitting method; Step 4, designing the loss function used in network training; and Step 5, accelerating the forward vehicle distance detection model with a model compression and acceleration tool. The method can predict the forward vehicle distance efficiently and accurately.

Description

Monocular vision forward vehicle distance detection method based on depth estimation
Technical Field
The invention relates to a monocular vision forward vehicle distance detection method based on depth estimation.
Background
Vehicle distance detection is an important part of environment perception in intelligent driving systems: by detecting the motion trends of vehicles and pedestrians ahead, collisions that may occur during driving can be predicted in real time, so it plays a very important role. However, forward vehicle distance detection faces great challenges in real traffic scenes, such as vehicle-to-vehicle occlusion, vehicle-to-environment occlusion, the complexity and changeability of unstructured roads, and pitch-angle changes during driving. Detecting the forward vehicle distance rapidly and accurately is therefore a difficult research problem for intelligent driving systems. According to the detection mode, distance detection methods can mainly be divided into electromagnetic ranging, ultrasonic ranging, visual ranging, and the like. At present, ranging methods based on active sensors such as millimeter-wave radar and lidar are expensive, limited in scanning range and speed, and easily interfered with by external signals [3]. Vision-based distance detection has the advantages of low cost, convenient installation and debugging, and rich acquired information, and thus has great application prospects. Vision-based methods can be divided into monocular, binocular and multi-view detection according to the number of cameras. Monocular ranging offers convenient equipment installation and debugging, low computing-resource consumption, and good dynamic real-time performance, giving it a good application prospect.
Existing mainstream monocular distance detection methods generally estimate the forward vehicle distance from similar-triangle geometry combined with camera parameter matching. However, such methods need geometric information about the obstacle, so some of them cannot judge non-standard obstacles; camera parameter matching is also difficult; pitching and rolling during driving and unstructured road scenes are not fully considered; and they suffer from short effective ranging distance, complex operation, and large calculation errors. With the great advances in computing, deep-learning-based artificial intelligence algorithms have been widely applied in industry with good results, and some experts and scholars have applied deep learning to monocular visual distance detection to improve its performance. However, most existing research performs depth estimation on all pixels of an RGB image rather than distance detection for specific targets in a traffic scene; moreover, real traffic exhibits obvious occlusion and environmental changes, so distance detection for a specific target carries large errors. Among the few deep-learning algorithms that do detect the distance of a specific target, the vehicle distance is regressed directly with an end-to-end distance regression algorithm, but this easily loses spatial information, damaging the spatial structure and affecting the depth prediction accuracy.
Disclosure of Invention
In view of the above analysis and the defects of the prior art, the invention builds a forward vehicle distance detection model using relevant deep learning algorithms from the field of computer vision, proposes corresponding optimization strategies, and realizes forward vehicle distance detection based on monocular vision.
Specifically, the forward vehicle distance detection based on monocular vision is realized from three aspects: model building, target key point fitting, and loss function design. In addition, considering the practical application requirements of the forward vehicle distance detection model, the TensorRT tool is adopted to optimize the model and improve its detection speed.
The technical method provided by the invention comprises the following five steps:
Step one: build a forward vehicle distance detection model based on depth estimation, which mainly comprises the following three parts:
1) determining the input and the output of a forward vehicle distance detection model;
2) selecting a convolution neural network for extracting image features;
3) and designing a vehicle target key point fitting method.
Step two: introduce the DORN algorithm and build a DORN-based forward vehicle distance detection model, which mainly comprises the following four parts:
1) replacing a common feature extraction network with a dense feature extractor;
2) a scene understanding module is added to realize the comprehensive understanding of the network to the input image;
3) using an ordinal regression module to divide the discrete depth values into a plurality of classes, converting the regression problem into a classification problem;
4) and selecting a fitting method of the key points of the vehicle target.
Step three: optimize the target key point fitting method, which mainly comprises the following two parts:
1) introducing a k-means clustering algorithm to realize the fitting of the key points of the vehicle targets;
2) and the fitting precision of the vehicle target key points is improved through effective parameter configuration.
Step four: design the loss function used in network training, which mainly comprises the following two parts:
1) designing a regression loss function of the target key point by using an L1 norm loss function;
2) and combining the ordinal regression function to realize the network training regression.
Step five: use a model compression and acceleration tool to accelerate the forward vehicle distance detection model, which mainly comprises the following two parts:
1) processing data which cannot be directly converted in a network;
2) converting the forward vehicle distance detection model into a TensorRT model.
Advantageous effects:
The forward vehicle distance detection based on monocular vision is realized from three aspects: model building, target key point fitting, and loss function design. In addition, considering the practical application requirements of the forward vehicle distance detection model, the invention adopts the TensorRT tool to optimize the model, improving the average per-image detection time from 0.0284 s to 0.0003 s, with a conversion error of about 1.5259e-05.
Drawings
FIG. 1 is a schematic flow diagram of a DORN-based forward vehicle distance detection model;
FIG. 2 is a schematic structural diagram of a DORN-based forward vehicle distance detection model.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application. On the contrary, the embodiments of the application include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Example 1: as shown in figures 1 and 2 of the drawings,
the embodiment provides a monocular vision forward vehicle distance detection method based on depth estimation, which specifically comprises the following five steps:
Step one: build a forward vehicle distance detection model based on depth estimation, which mainly comprises the following three parts:
1) First, the forward vehicle distance detection model is built. The overall model structure is divided into three parts: input, intermediate processing, and output. The input part comprises the RGB original image, the vehicle target frame coordinates, and the depth map. The RGB original image is the input that the whole network analyzes and detects; the depth map is the ground truth compared against the predicted values for training and learning; and the vehicle target frame coordinates are needed to render the final output map with the vehicle targets. The intermediate processing part comprises feature extraction, pooling, and regression, the process through which the network learns and predicts. In addition, a key point fitting part is added, which obtains target key point information from the predicted depth map. The output part is an RGB map annotated with vehicle target frames and distance values.
2) The feature extraction network is a convolutional neural network; in the field of computer vision, VGG16 or ResNet50 is usually adopted as the basic feature extraction network.
3) In the key point fitting part, a vehicle target key point fitting method is designed.
Step two: introduce the DORN algorithm and build a DORN-based forward vehicle distance detection model, which mainly comprises the following four parts:
1) The dense feature extractor module of the DORN network is introduced as the feature extraction network. By removing the last few downsampling operators in the feature-extraction DCNN and inserting holes into the filters of the subsequent convolution layers, dilated convolutions are formed, which enlarge the filters' field of view without reducing the spatial resolution or increasing the number of parameters.
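As a minimal PyTorch illustration (not the patent's exact configuration), replacing a stride-2 convolution with a dilated convolution keeps the spatial resolution while enlarging the receptive field at the same parameter count:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)  # dummy feature map

# Ordinary 3x3 convolution with stride 2: halves the spatial resolution.
down = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
print(down(x).shape)     # torch.Size([1, 256, 32, 32])

# Dilated 3x3 convolution (dilation=2): same resolution, receptive field of a
# 5x5 kernel, and the same parameter count as the plain 3x3 convolution.
dilated = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=2, dilation=2)
print(dilated(x).shape)  # torch.Size([1, 256, 64, 64])
```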
2) A scene understanding module is introduced, consisting of three parallel components: an atrous spatial pyramid pooling (ASPP) module, a cross-channel learner, and a full-image encoder. ASPP extracts features from multiple larger receptive fields via dilated convolutions with dilation rates of 6, 12 and 18; the convolution branch with a 1 × 1 kernel learns complex cross-channel interactions; and the full-image encoder captures global context information, thereby reducing local ambiguity.
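The sketch below shows an ASPP-style scene understanding module in PyTorch with dilation rates 6, 12 and 18, a 1 × 1 cross-channel branch, and a full-image (global pooling) branch; the channel counts and the concatenation of the branch outputs are assumptions for illustration, not the patent's stated configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneUnderstanding(nn.Module):
    """ASPP branches (rates 6/12/18), a 1x1 cross-channel branch, and a
    full-image encoder branch; the five outputs are concatenated."""
    def __init__(self, in_ch=512, out_ch=256):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_ch, out_ch, 1)   # cross-channel learner
        self.aspp = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in (6, 12, 18)
        ])
        self.full_image = nn.Sequential(               # global context branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1),
        )

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.branch1x1(x)] + [b(x) for b in self.aspp]
        g = F.interpolate(self.full_image(x), size=(h, w),
                          mode='bilinear', align_corners=False)
        return torch.cat(feats + [g], dim=1)           # (N, 5*out_ch, H, W)
```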
3) An ordinal regression module is introduced to divide the discretized depth values into multiple categories, converting the regression problem into a classification problem; a Softmax function is applied to obtain the regression probabilities used in network training. The depth values thus form an ordered set with strong ordinal correlation, and an ordinal loss is used to learn the network parameters. Let χ = φ(I, Φ) denote the feature map of size W × H × C obtained from the input image I, where Φ denotes the parameters involved in both the dense feature extractor and the scene understanding module. Let Y = ψ(χ, Θ) of size W × H × 2K denote the ordinal output at each spatial position, where Θ = (θ0, θ1, …, θ2K−1) contains the weight vectors and 2K is the number of convolution channels: the K score pairs are used for binary classification, with 1 indicating that the quantized ground-truth depth value exceeds k and 0 indicating that it does not. Let l(w, h) ∈ {0, 1, …, K−1} be the discrete label generated by SID at spatial position (w, h). The ordinal loss L(χ, Θ) is defined as the average of the pixel-level ordinal losses Ψ(w, h, χ, Θ) over the entire image domain:

$$\mathcal{L}(\chi, \Theta) = -\frac{1}{N} \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} \Psi(w, h, \chi, \Theta)$$

$$\Psi(w, h, \chi, \Theta) = \sum_{k=0}^{l(w,h)-1} \log \mathcal{P}_k^{(w,h)} + \sum_{k=l(w,h)}^{K-1} \log\left(1 - \mathcal{P}_k^{(w,h)}\right)$$

$$\mathcal{P}_k^{(w,h)} = P\left(\hat{l}(w,h) > k \mid \chi, \Theta\right)$$

where N = W × H and l̂(w, h) denotes the estimated discrete value decoded from y(w, h), and P is the predicted probability. Ψ is computed at each pixel over the depth bins 0 to K−1 by accumulating log-probabilities in two parts: a front part from 0 up to the true label, and a rear part from the true label out to the farthest bin; the rear part uses (1 − P) because P_k is the probability that the true quantized depth value is greater than k. The regression probability P_k^(w,h) is computed from y(w, h, 2k) and y(w, h, 2k+1) with the Softmax function:

$$\mathcal{P}_k^{(w,h)} = \frac{e^{y(w,h,2k+1)}}{e^{y(w,h,2k)} + e^{y(w,h,2k+1)}}$$

where y(w, h, i) = θiᵀ x(w, h) with x(w, h) ∈ χ and k ranging from 0 to K−1. Minimizing the ordinal loss L(χ, Θ) ensures that predictions farther from the true label receive lower scores than predictions closer to it.
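A hedged PyTorch sketch of this ordinal loss follows. The tensor layout (ordinal logits of shape N × 2K × H × W, SID labels of shape N × H × W) is an assumption for illustration, not the patent's stated interface.

```python
import torch
import torch.nn.functional as F

def ordinal_loss(logits, labels, K):
    """DORN-style ordinal loss (sketch).
    logits: (N, 2K, H, W) ordinal outputs y(w, h, .)
    labels: (N, H, W) discrete SID depth labels in {0, ..., K-1}
    """
    N, _, H, W = logits.shape
    # Softmax over each (2k, 2k+1) channel pair -> P_k = P(l_hat > k)
    pairs = logits.view(N, K, 2, H, W)
    prob = F.softmax(pairs, dim=2)[:, :, 1]            # (N, K, H, W)
    # Binary targets: 1 where the true label exceeds k, else 0
    ks = torch.arange(K, device=logits.device).view(1, K, 1, 1)
    gt = (labels.unsqueeze(1) > ks).float()            # (N, K, H, W)
    # Pixel-level ordinal loss, averaged over all positions
    eps = 1e-8
    psi = gt * torch.log(prob + eps) + (1 - gt) * torch.log(1 - prob + eps)
    return -psi.sum(dim=1).mean()
```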
4) Preliminarily, a region of half the width and half the height of the target frame is selected as the threshold region, and the pixel values within it are averaged to give the key point distance value drawn by the model:

$$d = \frac{1}{N} \sum_{w=1}^{W} \sum_{h=1}^{H} D(w, h)$$

where w and h denote the abscissa and ordinate of a pixel, W and H denote the bounds of the shrunken target frame, D(w, h) is the predicted depth at pixel (w, h), and N is the number of pixels within the threshold range; the target distance value is fitted in this way.
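A minimal sketch of this preliminary fitting rule, assuming the target frame is given as pixel coordinates (x1, y1, x2, y2) and the depth map as a 2-D array; the function name and shrink handling are illustrative only.

```python
import numpy as np

def box_center_depth(depth_map, box, shrink=0.5):
    """Average predicted depth inside a box shrunk about its center.
    depth_map: (H, W) predicted depth; box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    hw, hh = (x2 - x1) * shrink / 2.0, (y2 - y1) * shrink / 2.0
    region = depth_map[int(cy - hh):int(cy + hh) + 1,
                       int(cx - hw):int(cx + hw) + 1]
    return float(region.mean())   # mean over the N pixels in the region
```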
Step three: optimize the target key point fitting method, which mainly comprises the following two parts:
1) A k-means clustering algorithm is introduced to fit the vehicle target key points. The pixels inside the predicted target frame are clustered, the first and second largest clusters by pixel count are found, and the two are analysed. When the first cluster contains more than 1.5 times as many pixels as the second: if the distance value of the first cluster's center point is below the 80 m threshold, the first cluster is selected as the final cluster; otherwise, if its center value exceeds 80 m, the second cluster is selected. When the counts of the two clusters differ by no more than a factor of 1.5, the cluster with the smaller center-point distance value is taken as the final cluster, so as to separate the target to be measured from the environmental interference around it. The center point of the final cluster is then taken as the target key point, and its distance value is selected as the predicted forward vehicle target distance (see the sketch after this step).
2) Effective parameters are configured for the algorithm. The cropping factor of the target frame and the value of K were determined through several groups of comparison experiments; the final typical settings are K = 4 and a cropping factor of 1/2, which improve the fitting accuracy of the vehicle target key points.
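The following sketch illustrates the optimized fitting rule using scikit-learn's KMeans with the typical parameters above (K = 4, the 1/2 cropping applied beforehand), the 1.5× count ratio, and the 80 m threshold; the box format and function names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_target_depth(depth_map, box, k=4, ratio=1.5, far_thresh=80.0):
    """Cluster depths inside the (already cropped) box and return the
    center of the selected cluster as the target key point distance.
    depth_map: (H, W) predicted depth; box: (x1, y1, x2, y2), assumed."""
    x1, y1, x2, y2 = box
    pixels = depth_map[y1:y2, x1:x2].reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)
    first, second = np.argsort(counts)[::-1][:2]       # two largest clusters
    c1 = km.cluster_centers_[first, 0]
    c2 = km.cluster_centers_[second, 0]
    if counts[first] > ratio * counts[second]:
        # Dominant cluster wins unless its center lies beyond 80 m
        return c1 if c1 < far_thresh else c2
    # Comparable sizes: take the nearer cluster to split target from background
    return min(c1, c2)
```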
Step four: design the loss function used in network training, which mainly comprises the following two parts:
1) The regression loss function of the target key points is designed using the L1 norm loss. Define the center-point regression loss as L_d; let d(i) be the distance value output by the network for the i-th vehicle target, d the true center-point distance value, and N2 the number of target frames in the image. The center-point regression loss is:

$$L_d = \frac{1}{N_2} \sum_{i=1}^{N_2} \mathrm{Smooth}_{L1}\left(d(i) - d\right)$$

The L1 norm loss, also called the least absolute error, minimizes the sum of absolute differences between the target and estimated values. Smooth_L1 denotes the smoothed L1 norm loss, which removes the instability caused by the kink in the L1 loss curve:

$$\mathrm{Smooth}_{L1}(x) = \begin{cases} 0.5x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

When the input x is greater than 1 or less than −1, the derivative is 1 or −1, so outliers at the initial stage of network training cannot cause gradient explosion; when x lies between −1 and 1, the derivative equals x, a linear function between −1 and 1, giving a smooth transition that promotes convergence.
2) Combined with the ordinal regression function, training regression of the forward vehicle distance detection network model is realized. The total loss function of the network model is:

$$L = \lambda_1 \mathcal{L}(\chi, \Theta) + \lambda_2 L_d$$

where λ1 and λ2 are user-defined parameters set when training the network. By minimizing the total loss L, the model ensures that predictions closer to the true label obtain higher scores than predictions farther from it; the loss is minimized iteratively with the stochastic gradient descent (SGD) algorithm.
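As a concrete illustration, the PyTorch sketch below combines the ordinal term with the key-point regression term and minimizes the total loss with SGD. It reuses the ordinal_loss function sketched above; the λ values, tensor shapes, and helper names are assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

lambda1, lambda2 = 1.0, 1.0  # user-defined loss weights (assumed values)

def total_loss(ord_logits, depth_labels, pred_dist, true_dist, K):
    # Ordinal depth term L(chi, Theta) -- see the ordinal_loss sketch above
    l_ord = ordinal_loss(ord_logits, depth_labels, K)
    # Key-point regression term L_d: Smooth L1 over the N2 target distances
    l_d = F.smooth_l1_loss(pred_dist, true_dist)
    return lambda1 * l_ord + lambda2 * l_d

# Iterative minimization with SGD (model is the detection network):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# optimizer.zero_grad(); total_loss(...).backward(); optimizer.step()
```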
Step five: use a model compression and acceleration tool to accelerate the forward vehicle distance detection model, which mainly comprises the following two parts:
1) The model output in the network cannot be in dictionary form, so outputs containing dictionary types are converted into an operable tensor form.
2) The forward vehicle distance detection model is built with the PyTorch deep learning framework. In a Python environment, a PyTorch model can be converted to TensorRT in two ways: converting the .pt model to ONNX and then to TensorRT, or converting the .pt model to TensorRT directly. Because ONNX model conversion only supports a fixed batch size, while the torch2trt library can be imported and used directly, the invention adopts direct conversion to a TensorRT model by means of the torch2trt library.
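A minimal sketch of the direct conversion route via the torch2trt library; the stand-in model and input shape are placeholders, not the patent's network.

```python
import torch
import torch.nn as nn
from torch2trt import torch2trt

# Stand-in for the trained forward vehicle distance detection network
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval().cuda()
x = torch.ones(1, 3, 384, 1280).cuda()     # example input shape (assumed)

# Direct PyTorch -> TensorRT conversion; fp16_mode trades precision for speed
model_trt = torch2trt(model, [x], fp16_mode=True)

y, y_trt = model(x), model_trt(x)
print(torch.max(torch.abs(y - y_trt)))     # per-element conversion error
```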
Finally, the above examples are intended only to illustrate the technical solution of the present invention and not to limit it, and although the present invention has been described in detail with reference to preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention defined by the appended claims.

Claims (8)

1. A monocular vision forward vehicle distance detection method based on depth estimation, characterized in that the method comprises the following steps:
Step 1, building a forward vehicle distance detection model based on depth estimation;
Step 2, introducing a DORN algorithm, and building a forward vehicle distance detection model based on the DORN;
Step 3, optimizing a target key point fitting method;
Step 4, designing a loss function in network training;
and Step 5, utilizing a model compression acceleration tool to realize acceleration of the forward vehicle distance detection model.
2. The depth estimation-based monocular vision forward vehicle distance detection method of claim 1, characterized in that: the forward vehicle distance detection model in Step 1 comprises three parts: input, intermediate processing and output; the input part comprises the RGB original image, the depth map and the vehicle target frame coordinates; the intermediate processing part is the network learning and prediction process comprising feature extraction, pooling and regression; and the output part is an RGB map with vehicle target frames and distance values.
3. The depth estimation-based monocular vision forward vehicle distance detection method of claim 2, characterized in that: in the intermediate processing part, the feature extraction network is a convolutional neural network, with VGG16 or ResNet50 used as the feature extraction network.
4. The depth estimation-based monocular vision forward vehicle distance detection method of claim 3, characterized in that: Step 2 comprises: 1) introducing the dense feature extractor module of DORN as the feature extraction network; 2) introducing a scene understanding module; 3) introducing an ordinal regression module to divide the discrete depth values into a plurality of classes; and 4) selecting a vehicle target key point fitting method.
5. The depth estimation-based monocular vision forward vehicle distance detection method of claim 4, characterized in that: the scene understanding module comprises, in parallel, an atrous spatial pyramid pooling (ASPP) module, a cross-channel learner and a full-image encoder.
6. The depth estimation-based monocular vision forward vehicle distance detection method of claim 4, characterized in that: Step 3 comprises: 1) introducing a k-means clustering algorithm to realize fitting of the vehicle target key points, and 2) improving the fitting precision of the vehicle target key points through parameter configuration.
7. The depth estimation-based monocular vision forward vehicle distance detection method of claim 5, characterized in that: Step 4 comprises: 1) designing the regression loss function of the target key points by using an L1 norm loss function, and 2) combining the ordinal regression function to realize the network training regression.
8. The depth estimation-based monocular vision forward vehicle distance detection method of claim 6, characterized in that: Step 5 comprises: 1) converting the output of data which cannot be directly converted in the network into an operable tensor form; and 2) converting the forward vehicle distance detection model into a TensorRT model.
CN202110633046.7A 2021-06-07 2021-06-07 Monocular vision forward vehicle distance detection method based on depth estimation Active CN113269118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633046.7A CN113269118B (en) 2021-06-07 2021-06-07 Monocular vision forward vehicle distance detection method based on depth estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110633046.7A CN113269118B (en) 2021-06-07 2021-06-07 Monocular vision forward vehicle distance detection method based on depth estimation

Publications (2)

Publication Number Publication Date
CN113269118A true CN113269118A (en) 2021-08-17
CN113269118B CN113269118B (en) 2022-10-11

Family

ID=77234468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633046.7A Active CN113269118B (en) 2021-06-07 2021-06-07 Monocular vision forward vehicle distance detection method based on depth estimation

Country Status (1)

Country Link
CN (1) CN113269118B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731436A (en) * 2022-09-21 2023-03-03 东南大学 Highway vehicle image retrieval method based on deep learning fusion model

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108759667A (en) * 2018-05-29 2018-11-06 福州大学 Front truck distance measuring method based on monocular vision and image segmentation under vehicle-mounted camera
CN109029363A (en) * 2018-06-04 2018-12-18 泉州装备制造研究所 A kind of target ranging method based on deep learning
CN109461178A (en) * 2018-09-10 2019-03-12 中国科学院自动化研究所 A kind of monocular image depth estimation method and device merging sparse known label
CN109509223A (en) * 2018-11-08 2019-03-22 西安电子科技大学 Front vehicles distance measuring method based on deep learning
CN109766769A (en) * 2018-12-18 2019-05-17 四川大学 A kind of road target detection recognition method based on monocular vision and deep learning
CN110706271A (en) * 2019-09-30 2020-01-17 清华大学 Vehicle-mounted vision real-time multi-vehicle-mounted target transverse and longitudinal distance estimation method
CN111292366A (en) * 2020-02-17 2020-06-16 华侨大学 Visual driving ranging algorithm based on deep learning and edge calculation
WO2021013334A1 (en) * 2019-07-22 2021-01-28 Toyota Motor Europe Depth maps prediction system and training method for such a system
US10984543B1 (en) * 2019-05-09 2021-04-20 Zoox, Inc. Image-based depth data and relative depth data
CN112801074A (en) * 2021-04-15 2021-05-14 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108759667A (en) * 2018-05-29 2018-11-06 福州大学 Front truck distance measuring method based on monocular vision and image segmentation under vehicle-mounted camera
CN109029363A (en) * 2018-06-04 2018-12-18 泉州装备制造研究所 A kind of target ranging method based on deep learning
CN109461178A (en) * 2018-09-10 2019-03-12 中国科学院自动化研究所 A kind of monocular image depth estimation method and device merging sparse known label
CN109509223A (en) * 2018-11-08 2019-03-22 西安电子科技大学 Front vehicles distance measuring method based on deep learning
CN109766769A (en) * 2018-12-18 2019-05-17 四川大学 A kind of road target detection recognition method based on monocular vision and deep learning
US10984543B1 (en) * 2019-05-09 2021-04-20 Zoox, Inc. Image-based depth data and relative depth data
WO2021013334A1 (en) * 2019-07-22 2021-01-28 Toyota Motor Europe Depth maps prediction system and training method for such a system
CN110706271A (en) * 2019-09-30 2020-01-17 清华大学 Vehicle-mounted vision real-time multi-vehicle-mounted target transverse and longitudinal distance estimation method
CN111292366A (en) * 2020-02-17 2020-06-16 华侨大学 Visual driving ranging algorithm based on deep learning and edge calculation
CN112801074A (en) * 2021-04-15 2021-05-14 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUAN FU ET AL: "Deep Ordinal Regression Network for Monocular Depth Estimation", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
KINGSLEYLUOXIN: "Paper notes: Deep Ordinal Regression Network for Monocular Depth Estimation", https://blog.csdn.net/kingsleyluoxin/article/details/8233779002 *
YU YIJIE: "Research on visual vehicle detection algorithm based on improved Faster R-CNN", China Master's Theses Full-text Database (Engineering Science and Technology II) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731436A (en) * 2022-09-21 2023-03-03 东南大学 Highway vehicle image retrieval method based on deep learning fusion model
CN115731436B (en) * 2022-09-21 2023-09-26 东南大学 Highway vehicle image retrieval method based on deep learning fusion model

Also Published As

Publication number Publication date
CN113269118B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
Chen et al. Multi-task learning for dangerous object detection in autonomous driving
US20220197281A1 (en) Intelligent decision-making method and system for unmanned surface vehicle
CN111292366B (en) Visual driving ranging algorithm based on deep learning and edge calculation
CN110738690A (en) unmanned aerial vehicle video middle vehicle speed correction method based on multi-target tracking framework
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN110276783A (en) A kind of multi-object tracking method, device and computer system
CN110287837A (en) Sea obstacle detection method based on prior estimate network and space constraint mixed model
CN116783620A (en) Efficient three-dimensional object detection from point clouds
CN112507845B (en) Pedestrian multi-target tracking method based on CenterNet and depth correlation matrix
CN112101113B (en) Lightweight unmanned aerial vehicle image small target detection method
He et al. Real-time vehicle detection from short-range aerial image with compressed mobilenet
CN114972439A (en) Novel target tracking algorithm for unmanned aerial vehicle
CN113269118B (en) Monocular vision forward vehicle distance detection method based on depth estimation
CN116385493A (en) Multi-moving-object detection and track prediction method in field environment
CN112560799B (en) Unmanned aerial vehicle intelligent vehicle target detection method based on adaptive target area search and game and application
CN108664918B (en) Intelligent vehicle front pedestrian tracking method based on background perception correlation filter
CN112115810A (en) Target identification method, system, computer equipment and storage medium based on information fusion
CN117037085A (en) Vehicle identification and quantity statistics monitoring method based on improved YOLOv5
CN116778449A (en) Detection method for improving detection efficiency of three-dimensional target of automatic driving
CN109063543B (en) Video vehicle weight recognition method, system and device considering local deformation
CN116630904A (en) Small target vehicle detection method integrating non-adjacent jump connection and multi-scale residual error structure
CN115082897A (en) Monocular vision 3D vehicle target real-time detection method for improving SMOKE
Yang et al. Locator slope calculation via deep representations based on monocular vision
CN112347962A (en) System and method for detecting convolutional neural network target based on receptive field
CN117706942B (en) Environment sensing and self-adaptive driving auxiliary electronic control method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant