CN114612883A - Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Info

Publication number
CN114612883A
CN114612883A
Authority
CN
China
Prior art keywords: network, feature, depth estimation, detection, SSD
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210263947.6A
Other languages
Chinese (zh)
Inventor
赵敏
孙棣华
庞思袁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202210263947.6A priority Critical patent/CN114612883A/en
Publication of CN114612883A publication Critical patent/CN114612883A/en
Pending legal-status Critical Current

Classifications

    • G01C 3/00 Measuring distances in line of sight; Optical rangefinders
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213 Non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Combinations of networks
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06T 2207/10028 Range image; Depth image; 3D point clouds


Abstract

The invention discloses a forward vehicle distance detection method based on cascaded SSD and monocular depth estimation, comprising the following steps: S1, acquiring forward vehicle training samples and a depth estimation data set; S2, building a backbone network; S3, building a target detection network; S4, building a depth estimation sub-network; S5, training the forward vehicle distance detection network based on the cascaded SSD network and monocular depth estimation; and S6, selecting feature points based on K-means clustering and detecting the distance. The invention adopts a receptive-field expansion module based on dilated (atrous) convolution to enlarge the receptive field, combined with an online feature selection strategy, to improve the accuracy of small-scale vehicle target detection; aiming at inaccurate vehicle localization, a cascaded SSD network is proposed, and deformable convolution is introduced to alleviate the feature misalignment caused by the anchor box mechanism; aiming at inaccurate selection of monocular depth estimation feature points, a feature point selection and ranging method based on K-means clustering is proposed, effectively reducing the large errors in horizontal and longitudinal ranging.

Description

Forward vehicle distance detection method based on cascade SSD and monocular depth estimation
Technical Field
The invention relates to the technical field of vehicle distance detection, in particular to a forward vehicle distance detection method based on cascade SSD and monocular depth estimation.
Background
With the maturation and continuous progress of computer vision technology, emerging fields such as automatic driving, intelligent healthcare and intelligent robotics have developed rapidly; in particular, as the automatic driving industry grows mature and stable, unmanned driving has come within reach. One of its key technologies is detecting the distance to the target ahead on the premise of accurately recognizing the target in front of the vehicle, which has become one of the main research hotspots in computer vision. According to incomplete statistics, more than 1.25 million people die in road traffic accidents worldwide every year, the number of injured reaches 50 million, and the economic loss amounts to 1.85 trillion US dollars; the main causes of these accidents are driver inattention and fatigued driving. Since highway accidents are frequently rear-end collisions, accurately detecting the distance to the forward vehicle in real time can effectively help prevent such accidents. Distance detection between vehicles therefore plays an important role in ensuring travel safety to the greatest extent.
Forward vehicle distance detection methods are classified, according to the detection device and detection mode, into millimeter-wave radar ranging, laser ranging, ultrasonic ranging, visual ranging, and so on. Although ultrasonic, millimeter-wave radar and lidar approaches far exceed visual ranging in accuracy, they all have certain drawbacks; by contrast, visual ranging has the advantages of low cost and easy popularization. Binocular visual ranging obtains depth information through feature point matching, which consumes a large amount of time and makes it difficult to meet real-time detection requirements, while monocular vision is cheaper and structurally simpler, making it the more advantageous research focus.
Existing ranging algorithms based on monocular vision fall roughly into four categories. The first is based on the pinhole imaging model, which converts from the pixel coordinate system to the world coordinate system through coordinate transformations and must rely on the actual height or width of the measured object. The second is data regression modeling, which measures distances first and then models in reverse to range the forward vehicle, such as the vehicle ranging method based on monocular vision (publication number CN104899554A) applied for by Northeastern University. The third constructs an imaging geometric model combined with the camera calibration parameters, such as the forward vehicle distance detection method (publication number CN106153000A) applied for by Hefei University of Technology, which completes the ranging of the forward vehicle through a geometric model; it is fast and simple to implement, but strongly affected by the camera parameters and pitch angle. The fourth is forward distance detection based on monocular depth estimation, which recovers distance information from pictures through monocular depth estimation, such as the "monocular vision forward vehicle distance detection method based on depth estimation" applied for by Chongqing University (application number 202110633046.7).
Real-time ranging of forward vehicle targets is an important component of an autonomous driving system, and its first step is detecting the forward vehicle. Facing problems such as the varying angles of forward vehicles, the marked scale differences with distance, and low pixel counts, the invention, under the constraint of real-time detection, takes the SSD neural network with its higher detection speed as the basis for improvement and uses it as a sub-network of the forward vehicle distance detection network. Aiming at the poor detection caused by the SSD network's insufficient feature expression for small targets, the invention proposes a cascaded SSD network: an ASPP cascade enriches the features of the low-level network in the initial detection stage, and the cascaded stages are connected by deformable convolution, which alleviates the feature misalignment caused by the anchor box mechanism, reduces the influence of negative samples on network training, and improves detection precision. During camera imaging, the three-dimensional scene is projected onto the two-dimensional image plane and depth information is lost; this depth information must be recovered to obtain the distance from each pixel to the camera plane, after which a ranging feature point is selected to complete the ranging of the forward vehicle. For this, a monocular depth estimation sub-network extracting global pixel information is established and connected in parallel with the target detection sub-network to realize distance detection of the forward vehicle target.
Disclosure of Invention
Starting from the actual environments of expressways and urban roads, and addressing the SSD network's poor feature extraction for small-scale forward vehicles as well as the shortcomings of traditional ranging algorithms in forward vehicle ranging, the invention designs a forward vehicle distance detection network based on a cascaded SSD and monocular depth estimation. Aiming at the poor detection of small targets caused by the insufficient shallow features extracted by conv4_3 in the original SSD network, a receptive-field expansion module based on dilated (atrous) convolution enlarges the receptive field, combined with an online feature selection strategy to improve the accuracy of small-scale vehicle target detection; aiming at inaccurate vehicle localization, a cascaded SSD network is proposed, with deformable convolution introduced to alleviate the feature misalignment caused by the anchor box mechanism; aiming at inaccurate selection of monocular depth estimation feature points, a feature point selection and ranging method based on K-means clustering is proposed, which effectively reduces the large errors in horizontal and longitudinal ranging.
The technical method provided by the invention mainly comprises three parts, namely sample training, detection and distance measurement, and specifically comprises the following steps:
Step one: obtaining forward vehicle training samples and a depth estimation data set
Downloading the KITTI data set, which comprises vehicle training pictures, annotation files, the camera and radar calibration parameters corresponding to each picture, and point cloud files;
carrying out data cleaning on the KITTI data set, screening out and removing mislabeled pictures;
selecting an appropriate number of anchor boxes and their aspect ratios by K-means clustering;
counting the depth information of the data set in preparation for subsequently optimizing the depth estimation model;
randomly taking 80% of the data set as a training set and 20% of the data set as a testing set;
step two: backbone network construction
VGG-16 is selected as the basis of the backbone network for feature extraction and is modified as follows:
The training picture is down-sampled to the Conv4-3 resolution and passed through 3×3 convolutional layers to obtain the low-level feature map δ0 of picture target details. To obtain high-level features, the fc7-layer features are up-sampled to the Conv4-3 resolution, giving the feature map δ1, and Conv5-3 is up-sampled by deconvolution, giving the feature map δ2. Finally, the three feature maps δ0, δ1, δ2 are fused with Conv4-3 along the channel dimension to achieve feature enhancement, yielding the fused feature map ψ0.
ψ0 is taken as the input of the serial ASPP modules, whose dilation rates are set to 2 and 3 respectively; the feature layers ψ1, ψ2 after each ASPP are extracted, and ψ0, ψ1, ψ2 are concatenated along the channel dimension and brought back to the original Conv4-3 channel count, giving the feature map ψ.
Step three: construction of object detection sub-networks
The target detection sub-network is improved on the basis of the SSD network; the number and sizes of anchor boxes in the detection network are first changed to those obtained in step one;
on the basis of the feature map ψ obtained in step two, 3 additional groups of convolutional layers are added, each group consisting of one 1×1 convolutional layer with stride 2 and one 3×3 convolutional layer; the resulting 4 scales of feature maps form a feature pyramid, which serves as the initial detection network of the target detection network for classifying background and foreground;
on the basis of the initial detection network, deformable convolution is applied to the feature map of each scale to alleviate the feature misalignment caused by the anchor box mechanism and improve the regression precision of the detection boxes.
Step four: building of depth estimation sub-networks
On the basis of the feature map ψ obtained in step two, the feature layer σ0 is obtained through 3 additional convolutional layers, each composed of one 1×1 convolution with stride 2 and one 3×3 convolution;
then, on the basis of the feature layer σ0, up-sampling is performed through identity mappings and deconvolution layers until the feature map size reaches 1/4 of the original image;
step five: training based on cascaded SSD network and monocular depth estimation forward vehicle distance detection network
Designing an overall loss function, including a target detection loss function, a monocular estimation network loss function and a perception loss function;
the network input picture size is set to 300 × 300, the initial learning rate is set to 0.003, the learning rate is adjusted to polynomial attenuation, and the batch-size is set to 16;
the model training is iterated for 50 ten thousand times, then the deep estimation network branches are restrained, and after 15 ten thousand times of iteration is carried out on the target detection branches, the whole network is iterated for 20 ten thousand times;
step six: selection and distance detection based on K-means clustering feature points
The original image is taken as input, and the trained model yields the vehicle detection confidences and detection boxes in the image together with its monocular depth map;
the coordinates (x1, y1, x2, y2) of a detection box are obtained, and the coordinates of its center point are expressed as:

(x_c, y_c) = ((x1 + x2)/2, (y1 + y2)/2)

the horizontal line through the center point divides the vehicle region into an upper and a lower part; the lower part contains most of the information of the vehicle tail or head and is also where the closest point between vehicles lies, so it is used as the depth-estimation clustering set;
the set obtained above is clustered with the K-means algorithm, mainly into 2 classes, foreground and background; finally the foreground cluster center is taken as the ranging feature point and its distance information is output. A sketch of the overall inference flow is given below.
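By way of illustration, the following Python sketch (not the patent's implementation; `model`, `select_ranging_point`, the 0.5 confidence threshold and all tensor shapes are assumptions) shows how the detection branch, the depth branch and the step-six feature-point selection fit together at inference time:

```python
import torch

def detect_forward_vehicle_distances(model, image, select_ranging_point):
    """image: (3, 300, 300) tensor; returns a list of (box, confidence, distance)."""
    with torch.no_grad():
        # Assumed interface: the parallel branches return detections and a depth map.
        boxes, scores, depth_map = model(image.unsqueeze(0))
    results = []
    for box, score in zip(boxes[0], scores[0]):
        if score < 0.5:  # confidence threshold (illustrative value)
            continue
        # Step-six clustering picks a ranging feature point inside the box.
        d = select_ranging_point(depth_map[0, 0].numpy(), box.tolist())
        results.append((box.tolist(), float(score), d))
    return results
```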
Advantageous effects:
Aiming at the poor detection of small targets caused by the insufficient shallow features extracted by conv4_3 in the original SSD network, the invention adopts a receptive-field expansion module based on dilated (atrous) convolution to enlarge the receptive field, combined with an online feature selection strategy, to improve the accuracy of small-scale vehicle target detection; aiming at inaccurate vehicle localization, a cascaded SSD network is proposed, with deformable convolution introduced to alleviate the feature misalignment caused by the anchor box mechanism; aiming at inaccurate selection of monocular depth estimation feature points, a feature point selection and ranging method based on K-means clustering is proposed, which effectively reduces the large errors in horizontal and longitudinal ranging.
Drawings
Fig. 1 is a flowchart of a forward vehicle distance detection method based on a cascaded SSD network and monocular depth estimation according to the present invention.
Detailed Description
The technical solution in the embodiment of the present invention is clearly and completely described below with reference to the schematic diagram of the detection network. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the present embodiment provides a forward vehicle distance detection method based on cascaded SSD and monocular depth estimation, comprising the steps of:
the method comprises the following steps: the method comprises the following steps of acquiring a forward vehicle training sample and a depth estimation data set, wherein the forward vehicle training sample and the depth estimation data set mainly comprise the following five parts:
1) downloading the KITTI data set from the official KITTI website; the data set comprises vehicle training pictures, annotation files, the camera and radar calibration parameters corresponding to each picture, and point cloud files; the sparse three-dimensional point cloud data of the lidar must be projected from the lidar coordinate system onto the pixel coordinate system, after which a depth map is obtained with a linear interpolation algorithm;
2) performing data cleaning on the KITTI data set, screening out and removing mislabeled pictures, for example pictures where occlusion between vehicles is severe yet labels are still given, so that label boxes overlap heavily, or where vehicles are labeled incorrectly;
3) referring to the K-means clustering of target boxes proposed in YOLOv2, the anchor shapes are set by the following steps (a code sketch of this clustering is given at the end of step one):
1. Take all target boxes as one set and set the number of cluster categories;
2. Randomly select k target boxes as the initial "center boxes";
3. For each non-center target box, calculate its "IoU distance" to the k center boxes and assign it to the set S_i of the center box with the smallest IoU distance, where i is the index of that center box. The "IoU distance" is defined as:

d_IoU(C, G) = 1 − area(C∩G)/area(C∪G)

where area(C) and area(G) denote the areas of the two boxes; since the distance is one minus the IoU, it also reflects the degree of overlap between the two boxes;
4. For each set S_i, take the median width and height of all the target boxes it contains as the new "center box";
5. Judge whether the IoU distance between the new center box and the original center box in each category is smaller than a set threshold, or whether the maximum number of iterations has been reached; if so, continue to the next step, otherwise repeat steps 3-5;
6. For each target box, calculate the maximum IoU with the center boxes, and average these to obtain the mean IoU;
7. Repeat steps 2-6 while traversing k = 2, 3, …, 11, 12, obtaining 11 pairs of k and mean IoU. With k as the abscissa and mean IoU as the ordinate, find the inflection point that best balances speed and precision, and thus determine the number of categories k;
4) the depth information of the data set is counted, revealing a long-tail data distribution in KITTI similar to that found in target detection; this statistic guides the optimization of the network during training;
5) the KITTI data set contains about 7,500 samples; 80% are randomly taken as the training set and 20% as the test set;
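A minimal NumPy sketch of the anchor clustering in 3) above, assuming boxes are given as (width, height) pairs; the random seed, the iteration cap and the elbow search over k are illustrative choices:

```python
import numpy as np

def iou_wh(boxes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """IoU of (w, h) pairs, computed as if all boxes share one corner."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
            + centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / union

def anchor_kmeans(boxes: np.ndarray, k: int, iters: int = 100):
    rng = np.random.default_rng(0)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the center with the smallest 1 - IoU distance.
        assign = (1.0 - iou_wh(boxes, centers)).argmin(axis=1)
        # Median width/height of each cluster becomes the new center box.
        new = np.array([np.median(boxes[assign == i], axis=0)
                        if (assign == i).any() else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    avg_iou = iou_wh(boxes, centers).max(axis=1).mean()
    return centers, avg_iou  # sweep k = 2..12 and pick the elbow of avg_iou
```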
step two: the construction of the backbone network mainly comprises the following three parts:
1) VGG-16 is selected as the basis of the backbone network for feature extraction, and modification is carried out on the basis;
2) the original training picture is down-sampled to the Conv4-3 resolution (38×38 for a 300×300 input) and passed through a convolution group of three 3×3 convolutional layers with stride 1, giving the low-level feature map δ0 of picture target details. To obtain high-level features, the fc7-layer features are up-sampled to the Conv4-3 resolution, giving the feature map δ1; considering the information redundancy of the fc7 layer, model pruning is realized by adding 128 convolution kernels of size 1024×1×1 for online feature screening. Conv5-3 is up-sampled by deconvolution to the Conv4-3 resolution, giving the feature map δ2, which supplements the semantic information of the receptive field. Finally, the three feature layers δ0, δ1, δ2 are fused with Conv4-3 to achieve feature enhancement, yielding the fused feature map ψ0;
3) ψ0 is taken as the input of the serial ASPP modules, whose dilation rates are set to 2 and 3 respectively to enlarge the receptive field of the feature layer and improve detection robustness for small-scale or partially visible targets; the feature layers ψ1, ψ2 after each ASPP are then extracted, and ψ0, ψ1, ψ2 are concatenated along the channel dimension and brought back to the original Conv4-3 channel count, giving the feature map ψ. A sketch of this fusion follows.
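A minimal PyTorch sketch of this fusion, with assumed channel counts (512 for Conv4-3 and Conv5-3, 1024 for fc7, 128 fused channels) and module names; it illustrates the described structure, not the patent's exact network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedConv43(nn.Module):
    def __init__(self, c43=512, c53=512, cfc7=1024):
        super().__init__()
        self.low = nn.Conv2d(3, 128, 3, padding=1)                  # delta_0 from the image
        self.fc7_reduce = nn.Conv2d(cfc7, 128, 1)                   # 1x1 online feature screening of fc7
        self.deconv53 = nn.ConvTranspose2d(c53, 128, 2, stride=2)   # delta_2 from Conv5-3
        self.fuse = nn.Conv2d(c43 + 3 * 128, 512, 1)                # psi_0, back to 512 channels
        self.aspp1 = nn.Conv2d(512, 512, 3, padding=2, dilation=2)  # serial ASPP, rate 2 -> psi_1
        self.aspp2 = nn.Conv2d(512, 512, 3, padding=3, dilation=3)  # then rate 3 -> psi_2
        self.out = nn.Conv2d(3 * 512, 512, 1)                       # match Conv4-3 channel count

    def forward(self, img, conv4_3, conv5_3, fc7):
        size = conv4_3.shape[-2:]                       # e.g. 38 x 38
        d0 = self.low(F.interpolate(img, size=size))    # low-level details
        d1 = F.interpolate(self.fc7_reduce(fc7), size=size)
        d2 = F.interpolate(self.deconv53(conv5_3), size=size)
        psi0 = self.fuse(torch.cat([conv4_3, d0, d1, d2], dim=1))
        psi1 = self.aspp1(psi0)                         # dilation 2
        psi2 = self.aspp2(psi1)                         # dilation 3, in series
        return self.out(torch.cat([psi0, psi1, psi2], dim=1))
```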
Step three: the construction of the target detection sub-network mainly comprises the following three parts:
1) the target detection sub-network is improved on the basis of an SSD network, and the number and the size of anchor frames in the detection network are changed into the size obtained in the step one;
2) On the basis of the feature map ψ obtained in step two, 3 additional groups of convolutional layers are added, each consisting of one 1×1 convolutional layer with stride 2 and one 3×3 convolutional layer; the resulting 4 scales of feature maps form a feature pyramid, which serves as the initial detection network of the target detection network for classifying background and foreground. The feature-layer resolutions of the final target detection network are shown in Table 1:
Table 1 Feature-layer resolution sizes
Resolution: 38×38, 19×19, 10×10, 5×5
3) On the basis of the initial detection network, deformable convolution is applied to the feature map of each scale; the offsets of the convolution sampling points are obtained by learning, completing the deformable convolution computation, which alleviates the feature misalignment caused by the anchor box mechanism and improves the regression precision of the detection boxes. The initial detection network outputs four variables (dx, dy, dh, dw), where (dx, dy) is the offset of the spatial location and (dh, dw) the offset in scale. A feature alignment module (FAM) is designed; the offsets required by the FAM are obtained by convolving dx and dy, and the whole FAM process is:
Δpi=f(dxi,dyi)
FMi+1=Deformable(FMi,Δpi)
where FM denotes a feature map, f denotes a convolution operation, and Deformable denotes the deformable convolution. A sketch of the FAM follows.
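The FAM process above can be sketched with torchvision's deformable convolution; the channel counts and the form of the offset-regression convolution `f` are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FAM(nn.Module):
    def __init__(self, channels=256, k=3):
        super().__init__()
        # f(dx, dy): map the 2-channel location offsets to the
        # 2*k*k sampling offsets a k x k deformable kernel expects.
        self.offset_conv = nn.Conv2d(2, 2 * k * k, 3, padding=1)
        self.dconv = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, fm: torch.Tensor, dxdy: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(dxdy)   # Δp_i = f(dx_i, dy_i)
        return self.dconv(fm, offset)     # FM_{i+1} = Deformable(FM_i, Δp_i)

# Usage (shapes assumed): fm is a (N, 256, 38, 38) feature map and
# dxdy the (N, 2, 38, 38) spatial offsets regressed by the initial stage.
```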
Step four: the construction of the depth estimation sub-network mainly comprises the following two parts:
1) On the basis of the feature map ψ obtained in step two, 3 additional convolutional layers are applied, each composed of one 1×1 convolution with stride 2 and one 3×3 convolution, successively giving feature layers of resolution 38×38, 19×19, 10×10 and 5×5 for extracting depth-estimation features; the final feature layer σ0 thus has a resolution of 5×5;
2) Then, starting from the feature layer σ0, the feature layers of corresponding sizes from 1) are connected through identity mappings and up-sampled with deconvolution layers, successively giving feature-layer resolutions of 10×10, 19×19, 38×38 and 75×75; the final feature-layer resolution is 1/4 of the original image size, and the 75×75 feature layer is then used as the detection network for depth estimation (a sketch follows);
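A minimal sketch of this decoder, assuming all four encoder feature layers share one channel count c; the deconvolution/skip pattern mirrors the 5→10→19→38→75 schedule described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthDecoder(nn.Module):
    def __init__(self, c=256):
        super().__init__()
        self.up = nn.ModuleList(nn.ConvTranspose2d(c, c, 2, stride=2) for _ in range(4))
        self.head = nn.Conv2d(c, 1, 3, padding=1)  # one-channel depth map

    def forward(self, feats):
        # feats: encoder outputs at [38x38, 19x19, 10x10, 5x5], each with c channels
        x = feats[-1]
        for up, skip in zip(self.up[:3], reversed(feats[:-1])):
            x = up(x)                                         # deconvolution up-sampling
            x = F.interpolate(x, size=skip.shape[-2:]) + skip  # identity-mapping skip
        x = self.up[3](x)                                     # 38 -> 76
        x = F.interpolate(x, size=(75, 75))                   # 1/4 of a 300x300 input
        return self.head(x)
```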
step five: the training of the forward vehicle distance detection network based on the cascade SSD network and the monocular depth estimation mainly comprises the following three parts:
1) The overall loss function L_total is designed, comprising the target detection loss L_detect, the monocular depth estimation loss L_depth and the perceptual loss L_DA, with β a hyperparameter; its expression is:

L_total = L_detect + β(L_depth + L_DA)
The target detection loss adopts the SSD network loss:

L_detect = (1/N)(L_conf + λ·L_loc)

where L_conf denotes the confidence loss, L_loc the position loss, the balance coefficient λ is set to 1, and N is the total number of positive samples in a batch, used to scale the loss;
The monocular depth estimation loss L_depth penalizes the depth error together with its horizontal and vertical gradients:

L_depth = (1/N) Σ_i ( |e_i| + |∇x e_i| + |∇y e_i| ),  with e_i = d_i − d_i*

where d denotes the predicted depth image, d* the original depth image and i the pixel index; the horizontal and vertical gradients are computed so that similar local characteristics are preserved;
The perceptual loss L_DA is used to mitigate the long-tail distribution present in the KITTI data set and takes the form of an attention-weighted distance:

L_DA = (1/N) Σ_i (α_D + λ_D) · l(d_i, d_i*)

where N is the number of pixels, i the pixel index, l the distance metric, d_i the predicted depth value and d_i* the true depth value; α_D is the depth-aware attention term that guides the network to focus on the long-tail depth distribution, and λ_D is a regularization term that avoids gradient vanishing. A code sketch of the combined loss is given after item 3) below;
2) the network input picture size is set to 300 × 300, the initial learning rate is set to 0.003, the learning rate is adjusted to polynomial attenuation, and the batch-size is set to 16;
3) Model training is iterated for 500,000 steps; the depth estimation network branch is then suppressed and the target detection branch is iterated for 150,000 steps, after which the whole network is iterated for a further 200,000 steps;
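A hedged sketch of the combined loss; the detection term follows the SSD form given in 1), while the exact depth and depth-aware terms are assumptions reconstructed from the symbol definitions (e.g. α_D taken here as the normalized true depth):

```python
import torch
import torch.nn.functional as F

def grad_xy(e):
    # Horizontal and vertical finite differences of the error map.
    return e[..., :, 1:] - e[..., :, :-1], e[..., 1:, :] - e[..., :-1, :]

def depth_loss(d, d_star):
    # Assumed form: L1 on the depth error plus L1 on its gradients.
    e = d - d_star
    gx, gy = grad_xy(e)
    return e.abs().mean() + gx.abs().mean() + gy.abs().mean()

def depth_aware_loss(d, d_star, lam=0.1):
    # Assumed attention term: weight rare far depths more; lam keeps
    # near-depth pixels from vanishing out of the gradient.
    alpha = d_star / d_star.max()
    return ((alpha + lam) * F.l1_loss(d, d_star, reduction="none")).mean()

def total_loss(l_conf, l_loc, n_pos, d, d_star, beta=1.0, lam_det=1.0):
    l_detect = (l_conf + lam_det * l_loc) / max(n_pos, 1)  # SSD loss, scaled by positives
    return l_detect + beta * (depth_loss(d, d_star) + depth_aware_loss(d, d_star))
```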
step six: selection and distance detection based on K-means clustering feature points
1) The original image is taken as input, and the trained model yields the vehicle detection confidences and detection boxes in the image together with its monocular depth map;
2) The coordinates (x1, y1, x2, y2) of a detection box are obtained, and the coordinates of its center point are expressed as:

(x_c, y_c) = ((x1 + x2)/2, (y1 + y2)/2)

The horizontal line through the center point divides the vehicle region into an upper and a lower part; the lower part contains most of the information of the vehicle tail or head and is also where the closest point between vehicles lies, so it is used as the depth-estimation clustering set;
3) The set obtained in step 2) is clustered with the K-means algorithm, mainly into 2 classes, foreground and background; finally the foreground cluster center is taken as the ranging feature point and its distance information is output. A sketch follows.
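A minimal sketch of this selection, assuming scikit-learn's KMeans and taking the nearer of the two cluster centers as the foreground (vehicle) distance:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_ranging_point(depth_map: np.ndarray, box) -> float:
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    yc = (y1 + y2) // 2                              # line through the box center
    lower = depth_map[yc:y2, x1:x2].reshape(-1, 1)   # lower half: vehicle tail/head
    km = KMeans(n_clusters=2, n_init=10).fit(lower)  # foreground vs background
    fg = int(np.argmin(km.cluster_centers_))         # nearer cluster = vehicle (assumed)
    return float(km.cluster_centers_[fg, 0])         # distance of the feature point
```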
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A forward vehicle distance detection method based on cascaded SSD and monocular depth estimation is characterized by comprising the following steps:
s1, acquiring a forward vehicle training sample and a depth estimation data set;
s2, building a backbone network;
s3, building a target detection network;
s4, building a depth estimation sub-network;
s5, training a forward vehicle distance detection network based on the cascade SSD network and the monocular depth estimation;
and S6, selecting and detecting the distance based on the K-means clustering feature points.
2. The method according to claim 1, wherein the S1 comprises the steps of:
s11, downloading a KITTI data set, wherein the KITTI data set comprises vehicle training pictures, marking files, camera and radar calibration parameters corresponding to each picture and point cloud files;
s12, carrying out data cleaning on the KITTI data set, screening out the pictures marked with errors and removing the pictures;
s13, selecting a proper number of anchor frames and a proper height-width ratio by adopting K-means clustering;
s14, counting the depth information of the KITTI data set, and preparing for a subsequent optimization depth estimation model;
and S15, randomly taking 80% of KITTI data sets as training sets and 20% as test sets.
3. The cascaded SSD and monocular depth estimation based forward vehicle distance detection method according to claim 2, wherein the S2 comprises the steps of:
s21, selecting VGG-16 as the basis of the backbone network for feature extraction, and modifying the basis;
S22, down-sampling the training picture to the Conv4-3 resolution and obtaining the low-level feature map δ0 of picture target details through 3×3 convolutional layers; then up-sampling the fc7-layer features to the Conv4-3 resolution to obtain the feature map δ1, and up-sampling Conv5-3 by deconvolution to obtain the feature map δ2; finally fusing the three feature maps δ0, δ1, δ2 with Conv4-3 along the channel dimension to achieve feature enhancement and obtain the fused feature map ψ0;
S23, taking ψ0 as the input of the serial ASPP modules, whose dilation rates are set to 2 and 3 respectively; extracting the feature layers ψ1, ψ2 after each ASPP, and concatenating ψ0, ψ1, ψ2 along the channel dimension while keeping the result consistent with the original Conv4-3 channel count, to obtain the feature map ψ.
4. The cascaded SSD and monocular depth estimation based forward vehicle distance detection method according to claim 3, wherein the S3 comprises the steps of:
s31, improving the target detection sub-network on the basis of an SSD network, and firstly changing the number and the size of anchor frames in the detection network into the size obtained in S13;
S32, on the basis of the feature map ψ obtained in S23, additionally adding 3 groups of convolutional layers, each group consisting of one 1×1 convolutional layer with stride 2 and one 3×3 convolutional layer, the resulting 4 scales of feature maps forming a feature pyramid which serves as the initial detection network of the target detection network for classifying background and foreground;
and S33, on the basis of the initial detection network, applying deformable convolution to the feature map of each scale to alleviate the feature misalignment caused by the anchor box mechanism.
5. The method according to claim 4, wherein the S4 comprises the steps of:
S41, on the basis of the feature map ψ obtained in S23, obtaining the feature layer σ0 through 3 additional convolutional layers, each composed of one 1×1 convolution with stride 2 and one 3×3 convolution;
S42, on the basis of the feature layer σ0, up-sampling through identity mappings and deconvolution layers until the feature map size reaches 1/4 of the original image.
6. The method according to claim 5, wherein the S5 comprises the steps of:
s51, designing a total loss function, including a target detection loss function, a monocular estimation network loss function and a perception loss function;
s52, setting the size of the network input picture to be 300 x 300, setting the initial learning rate to be 0.003, adjusting the learning rate to be polynomial attenuation, and setting the batch-size to be 16;
and S53, iterating the model training for 500,000 steps, then suppressing the depth estimation network branch, iterating the target detection branch for 150,000 steps, and then iterating the whole network for a further 200,000 steps.
7. The method according to claim 6, wherein the S6 comprises the steps of:
s61, obtaining a vehicle detection confidence coefficient and a detection frame in the image and a monocular depth image of the image by taking the original image as input through a trained model;
S62, obtaining the coordinates (x1, y1, x2, y2) of the detection box, whose center point is expressed as:

(x_c, y_c) = ((x1 + x2)/2, (y1 + y2)/2)

the horizontal line through the center point dividing the vehicle region into an upper and a lower part, the lower part containing most of the information of the vehicle tail or head and being where the closest point between vehicles lies; this part is used as the depth-estimation clustering set;
and S63, clustering the set obtained in S62 with the K-means algorithm, mainly into foreground and background classes, and finally taking the foreground cluster center as the ranging feature point and outputting the distance information.
CN202210263947.6A 2022-03-17 2022-03-17 Forward vehicle distance detection method based on cascade SSD and monocular depth estimation Pending CN114612883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210263947.6A CN114612883A (en) 2022-03-17 2022-03-17 Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210263947.6A CN114612883A (en) 2022-03-17 2022-03-17 Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Publications (1)

Publication Number Publication Date
CN114612883A true CN114612883A (en) 2022-06-10

Family

ID=81864966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210263947.6A Pending CN114612883A (en) 2022-03-17 2022-03-17 Forward vehicle distance detection method based on cascade SSD and monocular depth estimation

Country Status (1)

Country Link
CN (1) CN114612883A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115200784A (en) * 2022-09-16 2022-10-18 福建(泉州)哈工大工程技术研究院 Powder leakage detection method and device based on improved SSD network model and readable medium
CN115200784B (en) * 2022-09-16 2022-12-02 福建(泉州)哈工大工程技术研究院 Powder leakage detection method and device based on improved SSD network model and readable medium
CN116977810A (en) * 2023-09-25 2023-10-31 之江实验室 Multi-mode post-fusion long tail category detection method and system
CN116977810B (en) * 2023-09-25 2024-01-09 之江实验室 Multi-mode post-fusion long tail category detection method and system
CN117854045A (en) * 2024-03-04 2024-04-09 东北大学 Automatic driving-oriented vehicle target detection method

Similar Documents

Publication Publication Date Title
CN111274976B (en) Lane detection method and system based on multi-level fusion of vision and laser radar
CN110136170B (en) Remote sensing image building change detection method based on convolutional neural network
CN114612883A (en) Forward vehicle distance detection method based on cascade SSD and monocular depth estimation
CN111242041B (en) Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN109784283B (en) Remote sensing image target extraction method based on scene recognition task
CN110599537A (en) Mask R-CNN-based unmanned aerial vehicle image building area calculation method and system
CN110689043A (en) Vehicle fine granularity identification method and device based on multiple attention mechanism
CN115082674A (en) Multi-mode data fusion three-dimensional target detection method based on attention mechanism
CN110807485B (en) Method for fusing two-classification semantic segmentation maps into multi-classification semantic map based on high-resolution remote sensing image
CN111582339A (en) Vehicle detection and identification method based on deep learning
CN113850129A (en) Target detection method for rotary equal-variation space local attention remote sensing image
CN111126459A (en) Method and device for identifying fine granularity of vehicle
CN115049640B (en) Road crack detection method based on deep learning
CN113095152A (en) Lane line detection method and system based on regression
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN116168246A (en) Method, device, equipment and medium for identifying waste slag field for railway engineering
CN115424237A (en) Forward vehicle identification and distance detection method based on deep learning
Yildiz et al. Hybrid image improving and CNN (HIICNN) stacking ensemble method for traffic sign recognition
CN113033363A (en) Vehicle dense target detection method based on deep learning
CN116630702A (en) Pavement adhesion coefficient prediction method based on semantic segmentation network
CN116486352A (en) Lane line robust detection and extraction method based on road constraint
CN113255704B (en) Pixel difference convolution edge detection method based on local binary pattern
CN114882205A (en) Target detection method based on attention mechanism
CN114821508A (en) Road three-dimensional target detection method based on implicit context learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination