CN114612883A - Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Info

Publication number
CN114612883A
CN114612883A
Authority
CN
China
Prior art keywords: network, feature, depth estimation, detection, SSD
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210263947.6A
Other languages
Chinese (zh)
Inventor
赵敏
孙棣华
庞思袁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202210263947.6A priority Critical patent/CN114612883A/en
Publication of CN114612883A publication Critical patent/CN114612883A/en
Pending legal-status Critical Current

Classifications

    • G01C 3/00 Measuring distances in line of sight; Optical rangefinders
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213 Non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Combinations of networks
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06T 2207/10028 Range image; Depth image; 3D point clouds


Abstract

The invention discloses a forward vehicle distance detection method based on cascaded SSD and monocular depth estimation, comprising the following steps: S1, acquiring forward vehicle training samples and a depth estimation data set; S2, building a backbone network; S3, building a target detection network; S4, building a depth estimation sub-network; S5, training the forward vehicle distance detection network based on the cascaded SSD network and monocular depth estimation; and S6, selecting feature points based on K-means clustering and detecting the distance. The invention adopts a receptive-field expansion module based on dilated (atrous) convolution to enlarge the receptive field, combined with an online feature selection strategy, to improve the accuracy of small-scale vehicle target detection; aiming at inaccurate vehicle localization, a cascaded SSD network is proposed, and deformable convolution is introduced to alleviate the feature misalignment caused by the anchor box mechanism; aiming at inaccurate selection of monocular depth estimation feature points, a feature point selection and ranging method based on K-means clustering is proposed, effectively reducing the large errors in horizontal and longitudinal ranging.

Description

Forward vehicle distance detection method based on cascade SSD and monocular depth estimation
Technical Field
The invention relates to the technical field of vehicle distance detection, in particular to a forward vehicle distance detection method based on cascade SSD and monocular depth estimation.
Background
With the maturation and continuous progress of computer vision technology, emerging fields such as automatic driving, intelligent healthcare and intelligent robotics have developed rapidly; in particular, as the automatic driving industry grows mature and stable, unmanned driving has come within reach. One of its key technologies is detecting the distance to the target ahead on the premise of accurately recognizing the target in front of the vehicle, which has become one of the main research hotspots in computer vision. According to incomplete statistics, more than 1.25 million people die in road traffic accidents worldwide every year, the number of injured reaches 50 million, and the economic loss amounts to 1.85 trillion US dollars; the main causes of these accidents are driver inattention and fatigued driving. Since highway accidents are frequently rear-end collisions, accurately detecting the distance to the forward vehicle in real time can effectively help prevent such accidents. Distance detection between vehicles therefore plays an important role in ensuring travel safety to the greatest extent.
Forward vehicle distance detection methods are classified, according to the detection device and detection mode, into millimeter-wave radar ranging, laser ranging, ultrasonic ranging, visual ranging, and so on. Although ultrasonic, millimeter-wave radar and lidar approaches far exceed visual ranging in accuracy, they all have certain drawbacks; by contrast, visual ranging has the advantages of low cost and easy popularization. Binocular visual ranging obtains depth information through feature point matching, which consumes a large amount of time and makes it difficult to meet real-time detection requirements, while monocular vision is cheaper and structurally simpler, making it the more advantageous research focus.
Existing ranging algorithms based on monocular vision fall roughly into four categories. The first is based on the pinhole imaging model, which converts from the pixel coordinate system to the world coordinate system through coordinate transformations and must rely on the actual height or width of the measured object. The second is data regression modeling, which measures distances first and then models in reverse to range the forward vehicle, such as the vehicle ranging method based on monocular vision (publication number CN104899554A) applied for by Northeastern University. The third constructs an imaging geometric model combined with the camera calibration parameters, such as the forward vehicle distance detection method (publication number CN106153000A) applied for by Hefei University of Technology, which completes the ranging of the forward vehicle through a geometric model; it is fast and simple to implement, but strongly affected by the camera parameters and pitch angle. The fourth is forward distance detection based on monocular depth estimation, which recovers distance information from pictures through monocular depth estimation, such as the "monocular vision forward vehicle distance detection method based on depth estimation" applied for by Chongqing University (application number 202110633046.7).
Real-time ranging of forward vehicle targets is an important component of an autonomous driving system, and its first step is detecting the forward vehicle. Facing problems such as the varying angles of forward vehicles, the marked scale differences with distance, and low pixel counts, the invention, under the constraint of real-time detection, takes the SSD neural network with its higher detection speed as the basis for improvement and uses it as a sub-network of the forward vehicle distance detection network. Aiming at the poor detection caused by the SSD network's insufficient feature expression for small targets, the invention proposes a cascaded SSD network: an ASPP cascade enriches the features of the low-level network in the initial detection stage, and the cascaded stages are connected by deformable convolution, which alleviates the feature misalignment caused by the anchor box mechanism, reduces the influence of negative samples on network training, and improves detection precision. During camera imaging, the three-dimensional scene is projected onto the two-dimensional image plane and depth information is lost; this depth information must be recovered to obtain the distance from each pixel to the camera plane, after which a ranging feature point is selected to complete the ranging of the forward vehicle. For this, a monocular depth estimation sub-network extracting global pixel information is established and connected in parallel with the target detection sub-network to realize distance detection of the forward vehicle target.
Disclosure of Invention
Starting from the actual environments of expressways and urban roads, and addressing the SSD network's poor feature extraction for small-scale forward vehicles as well as the shortcomings of traditional ranging algorithms in forward vehicle ranging, the invention designs a forward vehicle distance detection network based on a cascaded SSD and monocular depth estimation. Aiming at the poor detection of small targets caused by the insufficient shallow features extracted by conv4_3 in the original SSD network, a receptive-field expansion module based on dilated (atrous) convolution enlarges the receptive field, combined with an online feature selection strategy to improve the accuracy of small-scale vehicle target detection; aiming at inaccurate vehicle localization, a cascaded SSD network is proposed, with deformable convolution introduced to alleviate the feature misalignment caused by the anchor box mechanism; aiming at inaccurate selection of monocular depth estimation feature points, a feature point selection and ranging method based on K-means clustering is proposed, which effectively reduces the large errors in horizontal and longitudinal ranging.
The technical method provided by the invention mainly comprises three parts, namely sample training, detection and distance measurement, and specifically comprises the following steps:
Step one: obtaining forward vehicle training samples and a depth estimation data set
Downloading the KITTI data set, which comprises vehicle training pictures, annotation files, the camera and radar calibration parameters corresponding to each picture, and point cloud files;
carrying out data cleaning on the KITTI data set, screening out and removing mislabeled pictures;
selecting an appropriate number of anchor boxes and their aspect ratios by K-means clustering;
counting the depth information of the data set in preparation for subsequently optimizing the depth estimation model;
randomly taking 80% of the data set as a training set and 20% of the data set as a testing set;
step two: backbone network construction
VGG-16 is selected as the basis of the backbone network for feature extraction and is modified as follows:
The training picture is down-sampled to the Conv4-3 resolution and passed through 3×3 convolutional layers to obtain the low-level feature map δ0 of picture target details. To obtain high-level features, the fc7-layer features are up-sampled to the Conv4-3 resolution, giving the feature map δ1, and Conv5-3 is up-sampled by deconvolution, giving the feature map δ2. Finally, the three feature maps δ0, δ1, δ2 are fused with Conv4-3 along the channel dimension to achieve feature enhancement, yielding the fused feature map ψ0.
ψ0 is taken as the input of the serial ASPP modules, whose dilation rates are set to 2 and 3 respectively; the feature layers ψ1, ψ2 after each ASPP are extracted, and ψ0, ψ1, ψ2 are concatenated along the channel dimension and brought back to the original Conv4-3 channel count, giving the feature map ψ.
Step three: construction of object detection sub-networks
The target detection sub-network is improved on the basis of the SSD network; the number and sizes of anchor boxes in the detection network are first changed to those obtained in step one;
on the basis of the feature map ψ obtained in step two, 3 additional groups of convolutional layers are added, each group consisting of one 1×1 convolutional layer with stride 2 and one 3×3 convolutional layer; the resulting 4 scales of feature maps form a feature pyramid, which serves as the initial detection network of the target detection network for classifying background and foreground;
on the basis of the initial detection network, deformable convolution is applied to the feature map of each scale to alleviate the feature misalignment caused by the anchor box mechanism and improve the regression precision of the detection boxes.
Step four: building of depth estimation sub-networks
On the basis of the feature map ψ obtained in step two, the feature layer σ0 is obtained through 3 additional convolutional layers, each composed of one 1×1 convolution with stride 2 and one 3×3 convolution;
then, on the basis of the feature layer σ0, up-sampling is performed through identity mappings and deconvolution layers until the feature map size reaches 1/4 of the original image;
step five: training based on cascaded SSD network and monocular depth estimation forward vehicle distance detection network
Designing an overall loss function, including a target detection loss function, a monocular estimation network loss function and a perception loss function;
the network input picture size is set to 300 × 300, the initial learning rate is set to 0.003, the learning rate is adjusted to polynomial attenuation, and the batch-size is set to 16;
the model training is iterated for 50 ten thousand times, then the deep estimation network branches are restrained, and after 15 ten thousand times of iteration is carried out on the target detection branches, the whole network is iterated for 20 ten thousand times;
step six: selection and distance detection based on K-means clustering feature points
The original image is taken as input, and the trained model yields the vehicle detection confidences and detection boxes in the image together with its monocular depth map;
the coordinates (x1, y1, x2, y2) of a detection box are obtained, and the coordinates of its center point are expressed as:

(x_c, y_c) = ((x1 + x2)/2, (y1 + y2)/2)

the horizontal line through the center point divides the vehicle region into an upper and a lower part; the lower part contains most of the information of the vehicle tail or head and is also where the closest point between vehicles lies, so it is used as the depth-estimation clustering set;
the set obtained above is clustered with the K-means algorithm, mainly into 2 classes, foreground and background; finally the foreground cluster center is taken as the ranging feature point and its distance information is output. A sketch of the overall inference flow is given below.
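By way of illustration, the following Python sketch (not the patent's implementation; `model`, `select_ranging_point`, the 0.5 confidence threshold and all tensor shapes are assumptions) shows how the detection branch, the depth branch and the step-six feature-point selection fit together at inference time:

```python
import torch

def detect_forward_vehicle_distances(model, image, select_ranging_point):
    """image: (3, 300, 300) tensor; returns a list of (box, confidence, distance)."""
    with torch.no_grad():
        # Assumed interface: the parallel branches return detections and a depth map.
        boxes, scores, depth_map = model(image.unsqueeze(0))
    results = []
    for box, score in zip(boxes[0], scores[0]):
        if score < 0.5:  # confidence threshold (illustrative value)
            continue
        # Step-six clustering picks a ranging feature point inside the box.
        d = select_ranging_point(depth_map[0, 0].numpy(), box.tolist())
        results.append((box.tolist(), float(score), d))
    return results
```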
Advantageous effects:
Aiming at the poor detection of small targets caused by the insufficient shallow features extracted by conv4_3 in the original SSD network, the invention adopts a receptive-field expansion module based on dilated (atrous) convolution to enlarge the receptive field, combined with an online feature selection strategy, to improve the accuracy of small-scale vehicle target detection; aiming at inaccurate vehicle localization, a cascaded SSD network is proposed, with deformable convolution introduced to alleviate the feature misalignment caused by the anchor box mechanism; aiming at inaccurate selection of monocular depth estimation feature points, a feature point selection and ranging method based on K-means clustering is proposed, which effectively reduces the large errors in horizontal and longitudinal ranging.
Drawings
Fig. 1 is a flowchart of a forward vehicle distance detection method based on a cascaded SSD network and monocular depth estimation according to the present invention.
Detailed Description
The technical solution in the embodiment of the present invention is clearly and completely described below with reference to the schematic diagram of the detection network. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the present embodiment provides a forward vehicle distance detection method based on cascaded SSD and monocular depth estimation, comprising the steps of:
the method comprises the following steps: the method comprises the following steps of acquiring a forward vehicle training sample and a depth estimation data set, wherein the forward vehicle training sample and the depth estimation data set mainly comprise the following five parts:
1) downloading the KITTI data set from the official KITTI website; the data set comprises vehicle training pictures, annotation files, the camera and radar calibration parameters corresponding to each picture, and point cloud files; the sparse three-dimensional point cloud data of the lidar must be projected from the lidar coordinate system onto the pixel coordinate system, after which a depth map is obtained with a linear interpolation algorithm;
2) performing data cleaning on the KITTI data set, screening out and removing mislabeled pictures, for example pictures where occlusion between vehicles is severe yet labels are still given, so that label boxes overlap heavily, or where vehicles are labeled incorrectly;
3) referring to the K-means clustering of target boxes proposed in YOLOv2, the anchor shapes are set by the following steps (a code sketch of this clustering is given at the end of step one):
1. Take all target boxes as one set and set the number of cluster categories;
2. Randomly select k target boxes as the initial "center boxes";
3. For each non-center target box, calculate its "IoU distance" to the k center boxes and assign it to the set S_i of the center box with the smallest IoU distance, where i is the index of that center box. The "IoU distance" is defined as:

d_IoU(C, G) = 1 − area(C∩G)/area(C∪G)

where area(C) and area(G) denote the areas of the two boxes; since the distance is one minus the IoU, it also reflects the degree of overlap between the two boxes;
4. For each set S_i, take the median width and height of all the target boxes it contains as the new "center box";
5. Judge whether the IoU distance between the new center box and the original center box in each category is smaller than a set threshold, or whether the maximum number of iterations has been reached; if so, continue to the next step, otherwise repeat steps 3-5;
6. For each target box, calculate the maximum IoU with the center boxes, and average these to obtain the mean IoU;
7. Repeat steps 2-6 while traversing k = 2, 3, …, 11, 12, obtaining 11 pairs of k and mean IoU. With k as the abscissa and mean IoU as the ordinate, find the inflection point that best balances speed and precision, and thus determine the number of categories k;
4) the depth information of the data set is counted, revealing a long-tail data distribution in KITTI similar to that found in target detection; this statistic guides the optimization of the network during training;
5) the KITTI data set contains about 7,500 samples; 80% are randomly taken as the training set and 20% as the test set;
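A minimal NumPy sketch of the anchor clustering in 3) above, assuming boxes are given as (width, height) pairs; the random seed, the iteration cap and the elbow search over k are illustrative choices:

```python
import numpy as np

def iou_wh(boxes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """IoU of (w, h) pairs, computed as if all boxes share one corner."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
            + centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / union

def anchor_kmeans(boxes: np.ndarray, k: int, iters: int = 100):
    rng = np.random.default_rng(0)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the center with the smallest 1 - IoU distance.
        assign = (1.0 - iou_wh(boxes, centers)).argmin(axis=1)
        # Median width/height of each cluster becomes the new center box.
        new = np.array([np.median(boxes[assign == i], axis=0)
                        if (assign == i).any() else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    avg_iou = iou_wh(boxes, centers).max(axis=1).mean()
    return centers, avg_iou  # sweep k = 2..12 and pick the elbow of avg_iou
```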
step two: the construction of the backbone network mainly comprises the following three parts:
1) VGG-16 is selected as the basis of the backbone network for feature extraction, and modification is carried out on the basis;
2) the original training picture is down-sampled to the Conv4-3 resolution (38×38 for a 300×300 input) and passed through a convolution group of three 3×3 convolutional layers with stride 1, giving the low-level feature map δ0 of picture target details. To obtain high-level features, the fc7-layer features are up-sampled to the Conv4-3 resolution, giving the feature map δ1; considering the information redundancy of the fc7 layer, model pruning is realized by adding 128 convolution kernels of size 1024×1×1 for online feature screening. Conv5-3 is up-sampled by deconvolution to the Conv4-3 resolution, giving the feature map δ2, which supplements the semantic information of the receptive field. Finally, the three feature layers δ0, δ1, δ2 are fused with Conv4-3 to achieve feature enhancement, yielding the fused feature map ψ0;
3) ψ0 is taken as the input of the serial ASPP modules, whose dilation rates are set to 2 and 3 respectively to enlarge the receptive field of the feature layer and improve detection robustness for small-scale or partially visible targets; the feature layers ψ1, ψ2 after each ASPP are then extracted, and ψ0, ψ1, ψ2 are concatenated along the channel dimension and brought back to the original Conv4-3 channel count, giving the feature map ψ. A sketch of this fusion follows.
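A minimal PyTorch sketch of this fusion, with assumed channel counts (512 for Conv4-3 and Conv5-3, 1024 for fc7, 128 fused channels) and module names; it illustrates the described structure, not the patent's exact network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedConv43(nn.Module):
    def __init__(self, c43=512, c53=512, cfc7=1024):
        super().__init__()
        self.low = nn.Conv2d(3, 128, 3, padding=1)                  # delta_0 from the image
        self.fc7_reduce = nn.Conv2d(cfc7, 128, 1)                   # 1x1 online feature screening of fc7
        self.deconv53 = nn.ConvTranspose2d(c53, 128, 2, stride=2)   # delta_2 from Conv5-3
        self.fuse = nn.Conv2d(c43 + 3 * 128, 512, 1)                # psi_0, back to 512 channels
        self.aspp1 = nn.Conv2d(512, 512, 3, padding=2, dilation=2)  # serial ASPP, rate 2 -> psi_1
        self.aspp2 = nn.Conv2d(512, 512, 3, padding=3, dilation=3)  # then rate 3 -> psi_2
        self.out = nn.Conv2d(3 * 512, 512, 1)                       # match Conv4-3 channel count

    def forward(self, img, conv4_3, conv5_3, fc7):
        size = conv4_3.shape[-2:]                       # e.g. 38 x 38
        d0 = self.low(F.interpolate(img, size=size))    # low-level details
        d1 = F.interpolate(self.fc7_reduce(fc7), size=size)
        d2 = F.interpolate(self.deconv53(conv5_3), size=size)
        psi0 = self.fuse(torch.cat([conv4_3, d0, d1, d2], dim=1))
        psi1 = self.aspp1(psi0)                         # dilation 2
        psi2 = self.aspp2(psi1)                         # dilation 3, in series
        return self.out(torch.cat([psi0, psi1, psi2], dim=1))
```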
Step three: the construction of the target detection sub-network mainly comprises the following three parts:
1) the target detection sub-network is improved on the basis of an SSD network, and the number and the size of anchor frames in the detection network are changed into the size obtained in the step one;
2) On the basis of the feature map ψ obtained in step two, 3 additional groups of convolutional layers are added, each consisting of one 1×1 convolutional layer with stride 2 and one 3×3 convolutional layer; the resulting 4 scales of feature maps form a feature pyramid, which serves as the initial detection network of the target detection network for classifying background and foreground. The feature-layer resolutions of the final target detection network are shown in Table 1:
Table 1 Feature-layer resolution sizes
Resolution: 38×38, 19×19, 10×10, 5×5
3) On the basis of the initial detection network, deformable convolution is applied to the feature map of each scale; the offsets of the convolution sampling points are obtained by learning, completing the deformable convolution computation, which alleviates the feature misalignment caused by the anchor box mechanism and improves the regression precision of the detection boxes. The initial detection network outputs four variables (dx, dy, dh, dw), where (dx, dy) is the offset of the spatial location and (dh, dw) the offset in scale. A feature alignment module (FAM) is designed; the offsets required by the FAM are obtained by convolving dx and dy, and the whole FAM process is:
Δpi=f(dxi,dyi)
FMi+1=Deformable(FMi,Δpi)
where FM denotes a feature map, f denotes a convolution operation, and Deformable denotes the deformable convolution. A sketch of the FAM follows.
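The FAM process above can be sketched with torchvision's deformable convolution; the channel counts and the form of the offset-regression convolution `f` are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FAM(nn.Module):
    def __init__(self, channels=256, k=3):
        super().__init__()
        # f(dx, dy): map the 2-channel location offsets to the
        # 2*k*k sampling offsets a k x k deformable kernel expects.
        self.offset_conv = nn.Conv2d(2, 2 * k * k, 3, padding=1)
        self.dconv = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, fm: torch.Tensor, dxdy: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(dxdy)   # Δp_i = f(dx_i, dy_i)
        return self.dconv(fm, offset)     # FM_{i+1} = Deformable(FM_i, Δp_i)

# Usage (shapes assumed): fm is a (N, 256, 38, 38) feature map and
# dxdy the (N, 2, 38, 38) spatial offsets regressed by the initial stage.
```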
Step four: the construction of the depth estimation sub-network mainly comprises the following two parts:
1) On the basis of the feature map ψ obtained in step two, 3 additional convolutional layers are applied, each composed of one 1×1 convolution with stride 2 and one 3×3 convolution, successively giving feature layers of resolution 38×38, 19×19, 10×10 and 5×5 for extracting depth-estimation features; the final feature layer σ0 thus has a resolution of 5×5;
2) Then, starting from the feature layer σ0, the feature layers of corresponding sizes from 1) are connected through identity mappings and up-sampled with deconvolution layers, successively giving feature-layer resolutions of 10×10, 19×19, 38×38 and 75×75; the final feature-layer resolution is 1/4 of the original image size, and the 75×75 feature layer is then used as the detection network for depth estimation (a sketch follows);
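A minimal sketch of this decoder, assuming all four encoder feature layers share one channel count c; the deconvolution/skip pattern mirrors the 5→10→19→38→75 schedule described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthDecoder(nn.Module):
    def __init__(self, c=256):
        super().__init__()
        self.up = nn.ModuleList(nn.ConvTranspose2d(c, c, 2, stride=2) for _ in range(4))
        self.head = nn.Conv2d(c, 1, 3, padding=1)  # one-channel depth map

    def forward(self, feats):
        # feats: encoder outputs at [38x38, 19x19, 10x10, 5x5], each with c channels
        x = feats[-1]
        for up, skip in zip(self.up[:3], reversed(feats[:-1])):
            x = up(x)                                         # deconvolution up-sampling
            x = F.interpolate(x, size=skip.shape[-2:]) + skip  # identity-mapping skip
        x = self.up[3](x)                                     # 38 -> 76
        x = F.interpolate(x, size=(75, 75))                   # 1/4 of a 300x300 input
        return self.head(x)
```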
step five: the training of the forward vehicle distance detection network based on the cascade SSD network and the monocular depth estimation mainly comprises the following three parts:
1) The overall loss function L_total is designed, comprising the target detection loss L_detect, the monocular depth estimation loss L_depth and the perceptual loss L_DA, with β a hyperparameter; its expression is:

L_total = L_detect + β(L_depth + L_DA)
The target detection loss adopts the SSD network loss:

L_detect = (1/N)(L_conf + λ·L_loc)

where L_conf denotes the confidence loss, L_loc the position loss, the balance coefficient λ is set to 1, and N is the total number of positive samples in a batch, used to scale the loss;
The monocular depth estimation loss L_depth penalizes the depth error together with its horizontal and vertical gradients:

L_depth = (1/N) Σ_i ( |e_i| + |∇x e_i| + |∇y e_i| ),  with e_i = d_i − d_i*

where d denotes the predicted depth image, d* the original depth image and i the pixel index; the horizontal and vertical gradients are computed so that similar local characteristics are preserved;
The perceptual loss L_DA is used to mitigate the long-tail distribution present in the KITTI data set and takes the form of an attention-weighted distance:

L_DA = (1/N) Σ_i (α_D + λ_D) · l(d_i, d_i*)

where N is the number of pixels, i the pixel index, l the distance metric, d_i the predicted depth value and d_i* the true depth value; α_D is the depth-aware attention term that guides the network to focus on the long-tail depth distribution, and λ_D is a regularization term that avoids gradient vanishing. A code sketch of the combined loss is given after item 3) below;
2) the network input picture size is set to 300 × 300, the initial learning rate is set to 0.003, the learning rate is adjusted to polynomial attenuation, and the batch-size is set to 16;
3) Model training is iterated for 500,000 steps; the depth estimation network branch is then suppressed and the target detection branch is iterated for 150,000 steps, after which the whole network is iterated for a further 200,000 steps;
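A hedged sketch of the combined loss; the detection term follows the SSD form given in 1), while the exact depth and depth-aware terms are assumptions reconstructed from the symbol definitions (e.g. α_D taken here as the normalized true depth):

```python
import torch
import torch.nn.functional as F

def grad_xy(e):
    # Horizontal and vertical finite differences of the error map.
    return e[..., :, 1:] - e[..., :, :-1], e[..., 1:, :] - e[..., :-1, :]

def depth_loss(d, d_star):
    # Assumed form: L1 on the depth error plus L1 on its gradients.
    e = d - d_star
    gx, gy = grad_xy(e)
    return e.abs().mean() + gx.abs().mean() + gy.abs().mean()

def depth_aware_loss(d, d_star, lam=0.1):
    # Assumed attention term: weight rare far depths more; lam keeps
    # near-depth pixels from vanishing out of the gradient.
    alpha = d_star / d_star.max()
    return ((alpha + lam) * F.l1_loss(d, d_star, reduction="none")).mean()

def total_loss(l_conf, l_loc, n_pos, d, d_star, beta=1.0, lam_det=1.0):
    l_detect = (l_conf + lam_det * l_loc) / max(n_pos, 1)  # SSD loss, scaled by positives
    return l_detect + beta * (depth_loss(d, d_star) + depth_aware_loss(d, d_star))
```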
step six: selection and distance detection based on K-means clustering feature points
1) The original image is taken as input, and the trained model yields the vehicle detection confidences and detection boxes in the image together with its monocular depth map;
2) The coordinates (x1, y1, x2, y2) of a detection box are obtained, and the coordinates of its center point are expressed as:

(x_c, y_c) = ((x1 + x2)/2, (y1 + y2)/2)

The horizontal line through the center point divides the vehicle region into an upper and a lower part; the lower part contains most of the information of the vehicle tail or head and is also where the closest point between vehicles lies, so it is used as the depth-estimation clustering set;
3) The set obtained in step 2) is clustered with the K-means algorithm, mainly into 2 classes, foreground and background; finally the foreground cluster center is taken as the ranging feature point and its distance information is output. A sketch follows.
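A minimal sketch of this selection, assuming scikit-learn's KMeans and taking the nearer of the two cluster centers as the foreground (vehicle) distance:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_ranging_point(depth_map: np.ndarray, box) -> float:
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    yc = (y1 + y2) // 2                              # line through the box center
    lower = depth_map[yc:y2, x1:x2].reshape(-1, 1)   # lower half: vehicle tail/head
    km = KMeans(n_clusters=2, n_init=10).fit(lower)  # foreground vs background
    fg = int(np.argmin(km.cluster_centers_))         # nearer cluster = vehicle (assumed)
    return float(km.cluster_centers_[fg, 0])         # distance of the feature point
```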
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A forward vehicle distance detection method based on cascaded SSD and monocular depth estimation is characterized by comprising the following steps:
s1, acquiring a forward vehicle training sample and a depth estimation data set;
s2, building a backbone network;
s3, building a target detection network;
s4, building a depth estimation sub-network;
s5, training a forward vehicle distance detection network based on the cascade SSD network and the monocular depth estimation;
and S6, selecting and detecting the distance based on the K-means clustering feature points.
2. The method according to claim 1, wherein the S1 comprises the steps of:
s11, downloading a KITTI data set, wherein the KITTI data set comprises vehicle training pictures, marking files, camera and radar calibration parameters corresponding to each picture and point cloud files;
s12, carrying out data cleaning on the KITTI data set, screening out the pictures marked with errors and removing the pictures;
s13, selecting a proper number of anchor frames and a proper height-width ratio by adopting K-means clustering;
s14, counting the depth information of the KITTI data set, and preparing for a subsequent optimization depth estimation model;
and S15, randomly taking 80% of KITTI data sets as training sets and 20% as test sets.
3. The cascaded SSD and monocular depth estimation based forward vehicle distance detection method according to claim 2, wherein the S2 comprises the steps of:
s21, selecting VGG-16 as the basis of the backbone network for feature extraction, and modifying the basis;
S22, down-sampling the training picture to the Conv4-3 resolution and obtaining the low-level feature map δ0 of picture target details through 3×3 convolutional layers; then up-sampling the fc7-layer features to the Conv4-3 resolution to obtain the feature map δ1, and up-sampling Conv5-3 by deconvolution to obtain the feature map δ2; finally fusing the three feature maps δ0, δ1, δ2 with Conv4-3 along the channel dimension to achieve feature enhancement and obtain the fused feature map ψ0;
S23, taking ψ0 as the input of the serial ASPP modules, whose dilation rates are set to 2 and 3 respectively; extracting the feature layers ψ1, ψ2 after each ASPP, and concatenating ψ0, ψ1, ψ2 along the channel dimension while keeping the result consistent with the original Conv4-3 channel count, to obtain the feature map ψ.
4. The cascaded SSD and monocular depth estimation based forward vehicle distance detection method according to claim 3, wherein the S3 comprises the steps of:
s31, improving the target detection sub-network on the basis of an SSD network, and firstly changing the number and the size of anchor frames in the detection network into the size obtained in S13;
S32, on the basis of the feature map ψ obtained in S23, additionally adding 3 groups of convolutional layers, each group consisting of one 1×1 convolutional layer with stride 2 and one 3×3 convolutional layer, the resulting 4 scales of feature maps forming a feature pyramid which serves as the initial detection network of the target detection network for classifying background and foreground;
and S33, on the basis of the initial detection network, applying deformable convolution to the feature map of each scale to alleviate the feature misalignment caused by the anchor box mechanism.
5. The method according to claim 4, wherein the S4 comprises the steps of:
S41, on the basis of the feature map ψ obtained in S23, obtaining the feature layer σ0 through 3 additional convolutional layers, each composed of one 1×1 convolution with stride 2 and one 3×3 convolution;
S42, on the basis of the feature layer σ0, up-sampling through identity mappings and deconvolution layers until the feature map size reaches 1/4 of the original image.
6. The method according to claim 5, wherein the S5 comprises the steps of:
s51, designing a total loss function, including a target detection loss function, a monocular estimation network loss function and a perception loss function;
s52, setting the size of the network input picture to be 300 x 300, setting the initial learning rate to be 0.003, adjusting the learning rate to be polynomial attenuation, and setting the batch-size to be 16;
and S53, iterating the model training for 500,000 steps, then suppressing the depth estimation network branch, iterating the target detection branch for 150,000 steps, and then iterating the whole network for a further 200,000 steps.
7. The method according to claim 6, wherein the S6 comprises the steps of:
s61, obtaining a vehicle detection confidence coefficient and a detection frame in the image and a monocular depth image of the image by taking the original image as input through a trained model;
S62, obtaining the coordinates (x1, y1, x2, y2) of the detection box, whose center point is expressed as:

(x_c, y_c) = ((x1 + x2)/2, (y1 + y2)/2)

the horizontal line through the center point dividing the vehicle region into an upper and a lower part, the lower part containing most of the information of the vehicle tail or head and being where the closest point between vehicles lies; this part is used as the depth-estimation clustering set;
and S63, clustering the set obtained in S62 with the K-means algorithm, mainly into foreground and background classes, and finally taking the foreground cluster center as the ranging feature point and outputting the distance information.
CN202210263947.6A 2022-03-17 2022-03-17 Forward vehicle distance detection method based on cascade SSD and monocular depth estimation Pending CN114612883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210263947.6A CN114612883A (en) 2022-03-17 2022-03-17 Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210263947.6A CN114612883A (en) 2022-03-17 2022-03-17 Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Publications (1)

Publication Number Publication Date
CN114612883A true CN114612883A (en) 2022-06-10

Family

ID=81864966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210263947.6A Pending CN114612883A (en) 2022-03-17 2022-03-17 Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Country Status (1)

Country Link
CN (1) CN114612883A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115200784A (en) * 2022-09-16 2022-10-18 福建(泉州)哈工大工程技术研究院 Powder leakage detection method and device based on improved SSD network model and readable medium
CN115200784B (en) * 2022-09-16 2022-12-02 福建(泉州)哈工大工程技术研究院 Powder leakage detection method and device based on improved SSD network model and readable medium
CN116977810A (en) * 2023-09-25 2023-10-31 之江实验室 Multi-mode post-fusion long tail category detection method and system
CN116977810B (en) * 2023-09-25 2024-01-09 之江实验室 Multi-mode post-fusion long tail category detection method and system
CN117854045A (en) * 2024-03-04 2024-04-09 东北大学 Automatic driving-oriented vehicle target detection method

Similar Documents

Publication Publication Date Title
CN111274976B (en) Lane detection method and system based on multi-level fusion of vision and laser radar
CN110136170B (en) Remote sensing image building change detection method based on convolutional neural network
CN114612883A (en) Forward vehicle distance detection method based on cascade SSD and monocular depth estimation
CN111242041B (en) Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN109784283B (en) Remote sensing image target extraction method based on scene recognition task
CN110599537A (en) Mask R-CNN-based unmanned aerial vehicle image building area calculation method and system
CN110689043A (en) Vehicle fine granularity identification method and device based on multiple attention mechanism
CN115082674A (en) Multi-mode data fusion three-dimensional target detection method based on attention mechanism
CN110807485B (en) Method for fusing two-classification semantic segmentation maps into multi-classification semantic map based on high-resolution remote sensing image
CN111582339A (en) Vehicle detection and identification method based on deep learning
CN113850129A (en) Target detection method for rotary equal-variation space local attention remote sensing image
CN111126459A (en) Method and device for identifying fine granularity of vehicle
CN115049640B (en) Road crack detection method based on deep learning
CN113095152A (en) Lane line detection method and system based on regression
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN116168246A (en) Method, device, equipment and medium for identifying waste slag field for railway engineering
CN115424237A (en) Forward vehicle identification and distance detection method based on deep learning
Yildiz et al. Hybrid image improving and CNN (HIICNN) stacking ensemble method for traffic sign recognition
CN113033363A (en) Vehicle dense target detection method based on deep learning
CN116630702A (en) Pavement adhesion coefficient prediction method based on semantic segmentation network
CN116486352A (en) Lane line robust detection and extraction method based on road constraint
CN113255704B (en) Pixel difference convolution edge detection method based on local binary pattern
CN114882205A (en) Target detection method based on attention mechanism
CN114821508A (en) Road three-dimensional target detection method based on implicit context learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination