CN111462237A - Target distance detection method for constructing four-channel virtual image by using multi-source information - Google Patents

Target distance detection method for constructing four-channel virtual image by using multi-source information

Info

Publication number
CN111462237A
CN111462237A
Authority
CN
China
Prior art keywords
target
radar
image
information
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010258411.6A
Other languages
Chinese (zh)
Other versions
CN111462237B (en)
Inventor
杨殿阁
周韬华
江昆
于春磊
杨蒙蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202010258411.6A priority Critical patent/CN111462237B/en
Publication of CN111462237A publication Critical patent/CN111462237A/en
Application granted granted Critical
Publication of CN111462237B publication Critical patent/CN111462237B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G06T 2207/10044 Radar image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Abstract

The invention relates to a target distance detection method that constructs a four-channel virtual image from multi-source information, comprising the following steps: acquiring raw point cloud data with a millimeter-wave radar and processing it to determine which raw radar points belong to the same target, obtaining the target size and the target reflection center position; according to the target reflection center position in the radar plane and the target center pixel position in the image acquired by a monocular camera, finding the spatial conversion relation between the two sensors by a joint calibration method, and combining it with time synchronization to associate the asynchronous, heterogeneous multi-source information; constructing a virtual four-channel picture containing distance information from the association relation between the millimeter-wave radar and the image data; and building a convolutional neural network on the virtual four-channel picture to realize target detection. The invention improves the distance prediction capability of target detection, allows a lightweight network structure that saves computing resources, and improves the spatial-information prediction accuracy and speed of existing visual 3D target detection algorithms.

Description

Target distance detection method for constructing four-channel virtual image by using multi-source information
Technical Field
The invention relates to the field of environment perception of intelligent automobiles, in particular to a target distance detection method for constructing a four-channel virtual image by utilizing multi-source information.
Background
In an intelligent automobile system, accurate, reliable and robust environmental perception is critical. Image information contains rich semantic features, and with the development of artificial intelligence and deep learning, vision-based target detection realized by convolutional neural networks has matured and become a popular research topic. However, monocular vision cannot directly acquire target distance information, and convolutional neural networks are better suited to classification tasks, so vision-based target distance detection still needs improvement: existing visual 3D target detection algorithms generally cannot meet the distance-accuracy requirement of driving tasks, and with inaccurate distance estimates the tasks of target tracking or driving decision cannot be performed; network performance can be improved by increasing network depth, as in ResNet, or network width, as in the Inception method, but this makes the network structure complex and computationally expensive, so the network is difficult to deploy on a vehicle.
The millimeter-wave radar can accurately measure target distance and speed, works in all weather conditions, and can therefore provide a new data source for the target detection task, making up for the deficiency of vision and enabling detection of target distance information. Research on improving environmental perception through multi-source fusion has accordingly received growing attention. Existing multi-source information fusion detection methods fall mainly into two categories: pre-fusion algorithms use the target position information detected by the millimeter-wave radar to provide a region-of-interest (ROI) position for vision, and then perform visual target recognition and classification; post-fusion processes the radar and image information separately and then associates and fuses the target-level information obtained from each. The former is disturbed by clutter in the radar detections, and a radar missed detection lowers the target detection rate; the latter is time-consuming, and the perception information provided by the millimeter-wave radar is not fully exploited, so the fusion is less effective.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a target distance detection method for constructing a four-channel virtual image by using multi-source information, which can improve the distance prediction capability of target detection, realize a lightweight network structure, save computing resources, and improve the spatial-information prediction accuracy and speed of existing visual 3D target detection algorithms.
In order to achieve this purpose, the invention adopts the following technical scheme: a target distance detection method for constructing a four-channel virtual image by using multi-source information, comprising the following steps: 1) acquiring raw point cloud data with a millimeter-wave radar and processing it to determine which raw radar points belong to the same target, obtaining the target size and the target reflection center position; 2) according to the target reflection center position in the radar plane and the target center pixel position in the image acquired by a monocular camera, finding the spatial conversion relation between the two sensors by a joint calibration method, and combining it with time synchronization to associate the asynchronous, heterogeneous multi-source information; 3) constructing a virtual four-channel picture containing distance information from the association relation between the millimeter-wave radar and the image data; 4) building a convolutional neural network on the virtual four-channel picture to realize target detection: an end-to-end target detection convolutional neural network is constructed to realize distance fusion, so that the four-channel virtual picture generated from the raw radar point cloud information and the RGB image can be used for training, and the category, image bounding box and distance information of the target can be predicted.
Further, in step 1), in a traffic scene all targets are regarded as rigid bodies, and the raw point information belonging to the same target is determined by a clustering algorithm according to the similarity of the positions and Doppler velocities provided by the raw points; meanwhile, outliers are eliminated with the RANSAC algorithm, and then the target size (w, h) and the target reflection center position (x̄_r, ȳ_r) are obtained.
Further, step 2) specifically comprises the following steps: 2.1) determining the conversion relation between the target reflection center position (x̄_r, ȳ_r) of the calibration object in the radar coordinate system and the target center pixel position (u_0, v_0) in the image; 2.2) time-synchronizing the asynchronous information: taking the camera acquisition time as reference, each time the camera's perception data is updated its timestamp t_cam is recorded by extrapolation, the radar data closest to that moment, with timestamp t_radar, is found, and the time difference Δt = t_cam - t_radar is recorded; assuming the target speed is unchanged over this short time, the position information is updated by Δx(Δt), yielding time-synchronized multi-sensor perception data.
Further, in step 2.1), joint calibration is used to directly find the relation between the radar measurement data (x_r, y_r, z_r) and the image measurement data (u, v) of the same target; their spatial transformation relation is:
ω · [u, v, 1]^T = P · [x_r, y_r, z_r, 1]^T, with P = A[R | t]
where ω denotes a proportional constant, P denotes a 3 × 4 projection matrix, A denotes the camera intrinsic matrix, R denotes the rotation in the extrinsic calibration, and t denotes the translation in the extrinsic calibration.
Further, the joint calibration process comprises: the millimeter-wave radar detects the target and records data while a photograph records the target position in the image; the target reflection center position is then obtained by the clustering algorithm of step 1), and the corresponding position is found in the image; this process is repeated to obtain multiple sets of data.
Further, in the step 3), a single-channel virtual picture with the same resolution as the RGB image is generated according to target feature information that can be reflected by the radar through a corresponding rule, and then is associated with the RGB color image of the camera to form a 4-channel virtual picture, which is used as input data for target distance detection network training.
Further, the corresponding rule is as follows: (1) determining the center of the region of interest: the target reflection center position (x̄_r, ȳ_r) detected by the radar is projected onto the image using the calibration parameters found in step 2); the spatial conversion relation determines the center pixel position (u_0, v_0) of the region of interest of the radar-detected target on the image; (2) determining the size and pixel values of the target area in the radar single-channel virtual picture: a two-dimensional Gaussian model determines the pixel values and the size of the target area filled into the single-channel virtual picture, where the mean of the Gaussian distribution is the pixel position (u_0, v_0) corresponding to the reflection center determined in (1); according to the 3σ principle, pixels outside 3σ are taken as zero, so the variances (σ_1², σ_2²) of the Gaussian distribution reflect the size of the target area on the image in the length and width dimensions, and their relation to the relative distance r between target and sensor and to the target size (w, h) estimated by the radar in step 1) is expressed by a function g: (σ_1², σ_2²) = g(w, h, r); meanwhile, in a traffic scene, the filled pixel values reflect the degree of attention paid to the detection accuracy of close-range moving targets; furthermore, because the radar provides a confidence σ that a reflection point is a target, these factors are expressed by a scale factor k = f(r, v_rel, σ), which affects the pixel values of the target fill area on the virtual image.
Further, in the step 4), in order to implement deep fusion, the neural network structure modification includes the following aspects: (1) modifying a training data reading function of the selected algorithm, and receiving data reading of the 4-channel image; (2) modifying the convolution layer convolution kernel and extracting the characteristics with higher dimensionality; (3) and modifying the reading mode of the labeling information: adding a true labeling value of a relative distance beyond a true labeling value provided during image target detection training; (4) adding a distance prediction function: a loss function for distance prediction is added.
Further, the method also comprises a step of evaluating and optimizing the convolutional neural network: when the convolutional neural network shows loss-function convergence on the training set and training is mature, a validation set with the same distribution as the training data is constructed for verification and the effect of the convolutional neural network is evaluated.
Further, the logic for quantitatively evaluating the network effect is as follows: a prediction box is judged a positive sample when its overlap rate with the ground truth exceeds the IoU threshold iou; the model is considered to have predicted a target when the prediction box score exceeds the threshold score; each distance interval is evaluated with several indexes, including Precision = TP/(TP + FP), Recall = TP/(TP + FN), the absolute error of the predicted distance, and the error relative to the ground truth; whether the training parameters need to be adjusted to improve the model is judged from the results; TP: the number of examples that are actually positive and predicted positive by the model; FP: the number of examples that are actually negative but predicted positive by the model; FN: the number of examples that are actually positive but predicted negative by the model.
Due to the adoption of the technical scheme, the invention has the following advantages: 1. the method fully utilizes the information characteristics of the original point cloud provided by the radar to construct the virtual picture expressing the target information, and does not cause the loss of the original information. 2. The invention realizes the spatial synchronization of the multi-sensor information by using a low-cost millimeter wave radar and image combined calibration method, and has simple operation and high precision. 3. The method and the device directly acquire the spatial information of the target by fusing the radar and the image information through the virtual four-channel picture structure. 4. The invention adopts the end-to-end neural network to output the target detection information, can be directly used for the driving decision of the vehicle, reduces the intermediate processing links, and improves the accuracy, comprehensiveness and robustness of target identification.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a schematic diagram of spatial joint calibration of millimeter wave radar and camera data for use in the present invention;
FIG. 3 is a diagram of a process for constructing a four-channel virtual picture according to the present invention;
fig. 4 is an example of the prediction result of the fusion model proposed by the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
As shown in fig. 1, the invention provides a target distance detection method for constructing a four-channel virtual image by using multi-source information, which performs information fusion by using a vehicle-mounted sensor millimeter wave radar and a monocular camera, and realizes traffic target detection containing distance information by using an end-to-end convolutional neural network. The invention specifically comprises the following steps:
1) acquiring original point cloud data by using a millimeter wave radar to perform information processing, determining radar original point information belonging to the same target, and obtaining the target size and the target reflection center position;
The raw radar point information comprises a radial distance r, an angle θ, a Doppler relative velocity v_rel, and a reflection intensity γ;
In a traffic scene, all targets are regarded as rigid bodies, so the raw point information belonging to the same target is determined by a clustering algorithm (such as K-means clustering) according to the similarity of the positions and Doppler velocities provided by the raw points. Meanwhile, to reduce the influence of nearby vehicles or other static targets on the clustering result as much as possible, the RANSAC algorithm is used to eliminate outliers, and then the target size (w, h) and the target reflection center position (x̄_r, ȳ_r) are obtained.
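A minimal sketch of this radar point processing, assuming the raw points are already converted to Cartesian coordinates, using DBSCAN as a stand-in for the clustering step (the description mentions K-means) and a simple median-distance inlier filter in place of full RANSAC; the feature weighting and thresholds are illustrative assumptions, not values from this disclosure:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def extract_targets(points, eps=2.0, min_samples=2, inlier_tol=2.5):
    """points: (N, 4) array of [x_r, y_r, v_rel, intensity] per raw radar point.
    Returns a list of targets with size (w, h) and reflection center."""
    # Cluster on position and Doppler velocity similarity (rigid-body assumption).
    feats = points[:, :3] * np.array([1.0, 1.0, 0.5])   # assumed velocity weight
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)

    targets = []
    for lab in set(labels) - {-1}:                       # label -1 marks noise points
        cluster = points[labels == lab]
        # Crude outlier rejection standing in for RANSAC: keep points close
        # to the cluster median position.
        med = np.median(cluster[:, :2], axis=0)
        cluster = cluster[np.linalg.norm(cluster[:, :2] - med, axis=1) < inlier_tol]
        if len(cluster) == 0:
            continue
        # Target size from the spatial extent of the remaining points.
        w, h = cluster[:, :2].max(axis=0) - cluster[:, :2].min(axis=0)
        # Reflection center: intensity-weighted mean position.
        wgt = cluster[:, 3] / (cluster[:, 3].sum() + 1e-9)
        center = (cluster[:, :2] * wgt[:, None]).sum(axis=0)
        targets.append({"size": (w, h), "center": center,
                        "v_rel": cluster[:, 2].mean()})
    return targets
```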
2) According to the target reflection center position (x̄_r, ȳ_r) in the radar plane and the target center pixel position (u, v) in the image acquired by the monocular camera, the spatial conversion relation between the two sensors is found by a joint calibration method, and combined with time synchronization to associate the asynchronous, heterogeneous multi-source information;
the premise of realizing multi-source information fusion is to realize space-time synchronization of perception information under different acquisition frequencies and observation coordinate systems.
2.1) Determining the conversion relation between the target reflection center position (x̄_r, ȳ_r) of the calibration object in the radar coordinate system and the target center pixel position (u, v) in the image.
Joint calibration involves the interconversion of the following coordinate systems: the millimeter-wave radar coordinate system (x_r, y_r, z_r), the camera coordinate system (x_c, y_c, z_c), the imaging plane coordinate system (x, y) and the image pixel coordinate system (u, v), where z_r is a fixed value set in advance. Sequential calibration would first calibrate the camera intrinsic and extrinsic parameters and then calibrate the millimeter-wave radar extrinsic parameters, thereby determining the parameters of each conversion matrix; this calibration process is cumbersome, has high accuracy requirements, and is easily affected by accumulated errors.
In this embodiment, a joint calibration method is used to directly find the relation between the radar observation data (x_r, y_r, z_r) and the image measurement data (u, v) of the same target; their spatial transformation relation is:
ω · [u, v, 1]^T = P · [x_r, y_r, z_r, 1]^T, P = A[R | t]
where ω denotes a proportional constant, P denotes a 3 × 4 projection matrix containing the intrinsic and extrinsic calibration, A denotes the camera intrinsic matrix, R denotes the rotation in the extrinsic calibration, and t denotes the translation in the extrinsic calibration;
the purpose of the combined calibration is to solve the values of omega and 3 × 4 projection matrix P by collecting data and utilizing an SVD decomposition method, the steps of the experimental process of the combined calibration are shown in figure 2, a millimeter wave radar detects a target and records data at the same time, the position of the target in a recorded image is shot, and the position of the reflection center of the target is obtained by the clustering algorithm in the step 1
Figure BDA0002438338130000052
Then correspondingly finding the target center position (u) in the image0,v0). Repeating this process can result in multiple sets of data.
Meanwhile, since the detection range of the millimeter wave radar is a sector plane with a fixed height and the ground is taken as the zero point of the z axis, the z axis is considered to ber0, the height of the reflection center of the calibration object (angular reflection/rod-shaped reflection object) is a fixed value
Figure BDA0002438338130000053
Can obtain the same
Figure BDA0002438338130000054
Multiple sets of data under value
Figure BDA0002438338130000055
And the calibration precision is further improved by using geometric constraint. The method has the advantages of low cost and simple operation, and can obtain calibration parameters with higher precision.
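A minimal sketch of solving for the projection matrix P from such correspondences by SVD, using the standard direct-linear-transform formulation; the function and variable names are illustrative, not taken from this disclosure:

```python
import numpy as np

def solve_projection(radar_pts, pixel_pts):
    """radar_pts: (N, 3) array of (x_r, y_r, z_r); pixel_pts: (N, 2) array of (u, v).
    Returns the 3x4 projection matrix P (up to scale) minimizing the algebraic error."""
    assert len(radar_pts) >= 6, "need at least 6 correspondences for the 11 DoF of P"
    rows = []
    for (x, y, z), (u, v) in zip(radar_pts, pixel_pts):
        X = [x, y, z, 1.0]
        # Each correspondence contributes two rows of the homogeneous system A p = 0.
        rows.append([*X, 0, 0, 0, 0, *(-u * np.array(X))])
        rows.append([0, 0, 0, 0, *X, *(-v * np.array(X))])
    A = np.asarray(rows)
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    P = vt[-1].reshape(3, 4)
    return P / np.linalg.norm(P)

def project(P, radar_pt):
    """Project a radar point into the image; returns (u, v)."""
    uvw = P @ np.append(radar_pt, 1.0)
    return uvw[:2] / uvw[2]
```

The same `project` helper can later be reused to place the radar reflection centers onto the image when building the virtual channel.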
2.2) Because the data acquisition frequencies of the millimeter-wave radar and the camera differ, the asynchronous information must be time-synchronized. The acquisition frequency of the millimeter-wave radar is fixed, whereas the camera acquisition frequency is not fixed because of dropped frames, so the camera acquisition time is taken as the reference. By extrapolation, the timestamp t_cam is recorded each time the camera's perception data is updated, the radar data closest to that moment, with timestamp t_radar, is found, and the time difference Δt = t_cam - t_radar is recorded. This time difference is usually less than 5 ms, so the target speed is assumed unchanged over this short interval and the position information is updated by Δx(Δt), yielding time-synchronized multi-sensor perception data.
The correlation of asynchronous heterogeneous multi-source information is realized based on the space-time synchronization method.
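A minimal sketch of this nearest-frame matching and constant-velocity extrapolation, assuming each radar frame carries a timestamp and a 2D velocity vector per target (the Doppler relative velocity of the description would give only the radial component); the data layout is an illustrative assumption:

```python
import numpy as np

def sync_radar_to_camera(t_cam, radar_frames):
    """radar_frames: list of dicts {'t': timestamp,
    'targets': [{'center': (x, y), 'v': (vx, vy), ...}]}.
    Returns the radar targets extrapolated to the camera timestamp t_cam."""
    # Pick the radar frame whose timestamp is closest to the camera frame.
    nearest = min(radar_frames, key=lambda f: abs(t_cam - f['t']))
    dt = t_cam - nearest['t']          # typically a few milliseconds at most
    synced = []
    for tgt in nearest['targets']:
        # Constant-velocity assumption over the short interval dt.
        center = np.asarray(tgt['center']) + dt * np.asarray(tgt['v'])
        synced.append({**tgt, 'center': center})
    return synced
```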
3) Constructing a virtual four-channel picture containing distance information according to the association relation between the millimeter-wave radar and the image data;
A single-channel virtual picture with the same resolution as the RGB image is generated, via a corresponding rule, from the target feature information that the radar can reflect, and is then associated with the RGB color image of the camera to form a 4-channel virtual picture (i.e., the first 3 channels are the RGB color image and the 4th channel is the single-channel virtual picture made from the radar data that is time-space synchronized with that image by step 2)), which is used as the input data for training the target distance detection network. This new structure is convenient for feature extraction and feature learning by the convolutional neural network and, building on radar-extracted ROI algorithms, makes full use of the various data the radar provides, including the target size (w, h) and target center position (x̄_r, ȳ_r) inferred in step 1), the target relative velocity v_rel, the reflection intensity γ, and the confidence σ.
The corresponding rules for making a virtual picture are:
(1) Determining the center of the region of interest
The target reflection center position (x̄_r, ȳ_r) detected by the radar is projected onto the image using the calibration parameters found in step 2); the spatial conversion relation determines the center pixel position (u_0, v_0) of the region of interest (ROI) of the radar-detected target on the image.
(2) Determining size and pixel value of target area in radar single-channel virtual picture
A two-dimensional Gaussian model is adopted to determine the pixel values filled into the target area of the single-channel virtual picture and the size of that area. The size of the target as reflected on the image is related to the relative distance between the target and the ego vehicle and to the size of the target itself. The mean of the Gaussian distribution is the pixel position (u_0, v_0) corresponding to the reflection center determined in (1). The variance of the Gaussian distribution is a key parameter that reflects the shape of the distribution; according to the 3σ principle, pixels outside 3σ are taken as zero. The variances (σ_1², σ_2²) therefore reflect the size of the target area on the image in the length and width dimensions, and their relation to the relative distance r between target and sensor and to the target size (w, h) estimated by the radar clustering of step 1) is expressed by a function g: (σ_1², σ_2²) = g(w, h, r).
Meanwhile, in a traffic scene, driving safety makes the detection accuracy of close-range moving targets more important, so the filled pixel value can reflect this degree of attention. Furthermore, since the radar provides a confidence σ that a reflection point is a target, it is also one of the factors in the pixel fill value; these factors are expressed by a scale factor k = f(r, v_rel, σ), which affects the pixel values of the target fill area on the virtual image.
Combining the above principles, with the correlation coefficient ρ defaulted to 0, the pixel value G(u, v) filled in at an arbitrary pixel position (u, v) of the virtual picture follows the two-dimensional Gaussian distribution:
G(u, v) = k · exp( -[ (u - μ_1)² / (2σ_1²) + (v - μ_2)² / (2σ_2²) ] )
[μ_1, μ_2] = [u_0, v_0], [σ_1², σ_2²] = g(w, h, r), k = f(r, v_rel, σ)
where (μ_1, μ_2) is the mean of the two-dimensional Gaussian model, whose physical meaning is the target center pixel position (u_0, v_0) obtained by projecting the radar target reflection center position (x̄_r, ȳ_r) onto the image; (σ_1², σ_2²) are the variances of the two-dimensional Gaussian model, whose physical meaning is the dimensional relation of the target on the image, related, as analyzed above, to the actual target size (w, h) in the length and width dimensions and to the target-to-sensor relative distance r through the function g: (σ_1², σ_2²) = g(w, h, r); k is the scale factor of the model, whose physical meaning is that it determines the magnitude of the filled pixel values and, per the corresponding rule, is related to the target confidence σ provided by the radar, the relative distance r between target and sensor, and the relative velocity v_rel, a relation expressed by the function f: k = f(r, v_rel, σ).
Because the convolutional neural network extracts features from image-type input data, a single-channel picture that reflects the target position and size information is made from the millimeter-wave radar information and stacked with the RGB three-channel picture to form an RGB-D-style 4-channel picture, which is fed to the network to learn to predict targets and their distance information. The final effect is shown in fig. 3; the four-channel picture data is then used as the input data for model training.
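A minimal sketch of rendering the radar channel and stacking it with the RGB image, with illustrative choices for g and f (the concrete forms of g and f, the focal length, and the normalization are assumptions, not specified here):

```python
import numpy as np

def radar_channel(h_img, w_img, detections, fx=700.0):
    """detections: list of dicts {'uv': (u0, v0), 'size': (w, h) in meters,
    'r': range in meters, 'v_rel': m/s, 'conf': sigma}. fx is an assumed focal length."""
    chan = np.zeros((h_img, w_img), dtype=np.float32)
    vv, uu = np.mgrid[0:h_img, 0:w_img]
    for det in detections:
        u0, v0 = det['uv']
        w, h = det['size']
        r = max(det['r'], 1.0)
        # g(w, h, r): image-plane extent shrinks with distance (pinhole scaling),
        # divided by 3 so that 3-sigma covers the projected target extent.
        sig_u = max(fx * w / r / 3.0, 1.0)
        sig_v = max(fx * h / r / 3.0, 1.0)
        # f(r, v_rel, conf): emphasize close, fast, confident targets (assumed form).
        k = det['conf'] * (1.0 + abs(det['v_rel']) / 10.0) / r
        blob = k * np.exp(-(((uu - u0) ** 2) / (2 * sig_u ** 2)
                            + ((vv - v0) ** 2) / (2 * sig_v ** 2)))
        chan = np.maximum(chan, blob)      # keep the strongest response per pixel
    return chan

def make_rgbd(rgb, detections):
    """rgb: (H, W, 3) uint8 image. Returns an (H, W, 4) float32 virtual picture."""
    h_img, w_img = rgb.shape[:2]
    d = radar_channel(h_img, w_img, detections)
    d = 255.0 * d / (d.max() + 1e-6)       # match the 0-255 range of the RGB channels
    return np.dstack([rgb.astype(np.float32), d])
```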
Meanwhile, since the deep fusion is expected to predict the target's position on the image, its category and its spatial distance, the ground-truth annotation text provides the target category information, the relative position information in the image (using relative positions helps stabilize model training), and the distance information detected by the radar.
4) Building a convolutional neural network according to the virtual four-channel picture to realize target detection;
and (3) constructing an end-to-end target detection convolutional neural network to realize deep fusion, so that the training of a four-channel virtual picture generated by using radar original point cloud information and an RGB image can be realized, and the category, the image boundary frame and the distance information of the target can be predicted.
To achieve deep fusion, the neural network architecture modification includes the following aspects:
(1) modifying a training data reading function of the selected algorithm, and receiving data reading of the 4-channel image;
(2) modifying the convolutional layer convolution kernel to extract higher-dimensional features: since the number of channels of a convolution kernel (or convolution filter) must match the number of channels of its input, changing the input to a four-channel training picture requires increasing the channel count of the corresponding convolution kernels to 4 (see the sketch after this list).
(3) And modifying the reading mode of the labeling information: adding a true labeling value of a relative distance beyond a true labeling value provided during image target detection training;
(4) adding a distance prediction function: a loss function for distance prediction is added, preferably a squared loss
L_dist = λ · (d - d*)²
where λ is a scale parameter, d is the distance value predicted by the model, and d* is the ground-truth distance.
In a preferred embodiment, the YOLOv2 target detection algorithm is used to implement the above 4 aspects. The Darknet53 network corresponding to the YOLOv2 algorithm is a typical end-to-end convolutional neural network for the visual target detection task; it directly takes an image as input and outputs the target detection results, with high speed and relatively high accuracy.
5) Convolutional neural network evaluation and optimization
When the convolutional neural network shows loss-function convergence on the training set and training is mature, a validation set with the same distribution as the training data is constructed for verification and evaluation of the network's effect, which facilitates adjustment of the model parameters.
The logic for quantitatively evaluating the network effect is as follows (a sketch follows the definitions below): a prediction box is judged a positive sample when its overlap rate with the ground truth exceeds the IoU threshold iou; the model is considered to have predicted a target when the prediction box score exceeds the threshold score; each distance interval is evaluated with several indexes, including Precision = TP/(TP + FP), Recall = TP/(TP + FN), the absolute error of the predicted distance, and the error relative to the ground truth; whether the training parameters need to be adjusted to improve the model is judged from the results. Where:
TP (True Positive): the number of examples that are actually positive and predicted positive by the model;
FP (False Positive): the number of examples that are actually negative but predicted positive by the model;
FN (False Negative): the number of examples that are actually positive but predicted negative by the model;
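A minimal sketch of this evaluation logic for a single class, assuming axis-aligned boxes given as (x1, y1, x2, y2) and at most one ground-truth match per prediction; the threshold values are illustrative:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def evaluate(preds, gts, iou_thr=0.5, score_thr=0.3):
    """preds: list of (box, score, dist); gts: list of (box, dist).
    Returns precision, recall and mean absolute distance error of matched pairs."""
    preds = [p for p in preds if p[1] > score_thr]        # keep confident predictions only
    matched_gt, tp, dist_err = set(), 0, []
    for box, _, d_pred in preds:
        best_j, best_iou = -1, iou_thr
        for j, (gt_box, _) in enumerate(gts):
            if j not in matched_gt and iou(box, gt_box) > best_iou:
                best_j, best_iou = j, iou(box, gt_box)
        if best_j >= 0:                                   # counted as a true positive
            matched_gt.add(best_j)
            tp += 1
            dist_err.append(abs(d_pred - gts[best_j][1]))
    fp = len(preds) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return precision, recall, float(np.mean(dist_err)) if dist_err else None
```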
the method for improving the model according to the prediction result comprises the following steps:
when the accuracy precision of the model on the training set and the verification set is lower than a preset threshold value, the model is considered to be under-fitted and needs to be trained continuously;
when the accuracy precision of the model on the training set is higher than a preset threshold value and the accuracy on the verification set is lower than the preset threshold value, the model is considered to be over-fitted, the number of training rounds needs to be reduced, and the data volume of the training set is increased;
meanwhile, according to the evaluation results of different distance intervals, corresponding data amount is increased for the distance intervals with poor evaluation effect.
Adjusting training parameters: the learning rate and the training batch are continuously adjusted to obtain the best evaluation result.
In training the fusion network, the network effect is validated every 10000 training iterations until the network training matures. The network is optimized by adding noise: considering the false detections and missed detections of the millimeter-wave radar, random noise is added to the input data; and by modifying the projection relation: the model effect is related to the projection relation of the target box and is continuously improved, thereby improving the prediction effect of the model.
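A minimal sketch of such radar-noise augmentation applied to the detection list before the virtual channel is rendered; the drop/spurious probabilities and jitter scale are assumptions, not values given here:

```python
import random
import copy

def augment_radar(detections, p_drop=0.1, p_spurious=0.05,
                  jitter_px=3.0, img_size=(1280, 720)):
    """Simulate radar missed detections, false detections and measurement noise."""
    out = []
    for det in detections:
        if random.random() < p_drop:                      # simulate a missed detection
            continue
        det = copy.deepcopy(det)
        u, v = det['uv']
        det['uv'] = (u + random.gauss(0, jitter_px),      # position measurement noise
                     v + random.gauss(0, jitter_px))
        out.append(det)
    if random.random() < p_spurious:                      # simulate a clutter false detection
        out.append({'uv': (random.uniform(0, img_size[0]),
                           random.uniform(0, img_size[1])),
                    'size': (1.0, 1.0), 'r': random.uniform(5, 80),
                    'v_rel': 0.0, 'conf': 0.2})
    return out
```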
In addition, target information and speed information provided by the radar can be subjected to re-matching through data re-association, and a final target detection result of fusion perception is obtained. The finally realized model prediction effect is as shown in fig. 4, and can not only detect the position and the type of the target on the image, but also provide the prediction of the relative distance between the target and the sensor.
In conclusion, through training and verification, a mature model capable of detecting the target distance of the virtual four-channel picture is obtained. The method can better predict the category and distance information of the front target, and realizes the utilization of multi-source information.
In the example, a multi-sensor data acquisition system is carried on the experiment vehicle to synchronize multi-source information and generate a virtual four-channel picture. By means of deep fusion, the obtained target detection result can be applied to a simple algorithm for early warning of front vehicle collision and pedestrian avoidance, and driving decision assistance is achieved.
The method is different from the traditional fusion algorithm, the original point cloud information of the millimeter wave radar is fully utilized, target extraction is carried out through clustering and RANSAC algorithm, and the reflection center position and size of the target are obtained; the time-space synchronization of multi-source information is realized by utilizing a low-cost millimeter wave radar and image combined calibration method; the novel data structure for the fusion of the original point information and the visual information of the millimeter wave radar is provided, a single-channel virtual picture reflecting target position and distance information is generated by utilizing Gaussian distribution and combining radar point cloud information, an RGB-D four-channel virtual picture is constructed by being associated with the visual information, the single-channel virtual picture is synchronously input into a convolutional neural network for deep learning, the convolutional neural network is adjusted, and target detection with spatial information is realized. Since the network input information contains richer spatial information about the target, the distance prediction capability of target detection can be further improved. Meanwhile, deep fusion is realized by utilizing an end-to-end neural network, the lightweight of a network structure can be further realized, the computing resources are saved, and the spatial information prediction precision and speed of the existing visual 3D target detection algorithm are improved.
The above embodiments are only for illustrating the present invention, and the steps may be changed, and on the basis of the technical solution of the present invention, the modification and equivalent changes of the individual steps according to the principle of the present invention should not be excluded from the protection scope of the present invention.

Claims (10)

1. A target distance detection method for constructing a four-channel virtual image by using multi-source information is characterized by comprising the following steps:
1) acquiring original point cloud data by using a millimeter wave radar to perform information processing, determining radar original point information belonging to the same target, and obtaining the target size and the target reflection center position;
2) according to the reflection center position of a target under a radar plane and the target center pixel position in an image acquired by a monocular camera, searching the spatial conversion relation of two sensors by a combined calibration method, and simultaneously combining time synchronization to realize the association of asynchronous heterogeneous multi-source information;
3) constructing a virtual four-channel picture containing distance information according to the association relation between the millimeter-wave radar and the image data;
4) building a convolutional neural network according to the virtual four-channel picture to realize target detection: an end-to-end target detection convolutional neural network is constructed to realize distance fusion, so that the four-channel virtual picture generated from the raw radar point cloud information and the RGB image can be used for training, and the category, the image bounding box and the distance information of the target can be predicted.
2. The object distance detection method according to claim 1, characterized in that: in step 1), in a traffic scene all targets are regarded as rigid bodies, and the raw point information belonging to the same target is determined by a clustering algorithm according to the similarity of the positions and Doppler velocities provided by the raw points; meanwhile, outliers are eliminated with the RANSAC algorithm, and then the target size (w, h) and the target reflection center position (x̄_r, ȳ_r) are obtained.
3. The object distance detection method according to claim 1, characterized in that: step 2) specifically comprises the following steps:
2.1) determining the conversion relation between the target reflection center position (x̄_r, ȳ_r) of the calibration object in the radar coordinate system and the target center pixel position (u_0, v_0) in the image;
2.2) time-synchronizing the asynchronous information: taking the camera acquisition time as reference, each time the camera's perception data is updated its timestamp t_cam is recorded by extrapolation, the radar data closest to that moment, with timestamp t_radar, is found, and the time difference Δt = t_cam - t_radar is recorded; assuming the target speed is unchanged over this short time, the position information is updated by Δx(Δt), yielding time-synchronized multi-sensor perception data.
4. The object distance detection method according to claim 3, characterized in that: in the step 2.1), the measurement data (x) of the radar under the same target is directly searched by using the combined calibrationr,yr,zr) And the measurement data (u, v) of the image, the spatial transformation relationship of the two being:
Figure FDA0002438338120000011
P=A[R|t]
in the formula, omega represents a proportional constant, P represents a projection matrix of 3 × 4, A represents an internal reference matrix of the camera, R represents a rotation relation in the external reference calibration, and t represents a translation relation in the external reference calibration.
5. The object distance detection method according to claim 4, characterized in that: the joint calibration process comprises: the millimeter-wave radar detects the target and records data while a photograph records the target position in the image; the target reflection center position is then obtained by the clustering algorithm of step 1), and the corresponding position is found in the image; this process is repeated to obtain multiple sets of data.
6. The object distance detection method according to claim 1, characterized in that: in the step 3), a single-channel virtual picture with the same resolution as the RGB image is generated according to target characteristic information which can be reflected by the radar through a corresponding rule, and then the single-channel virtual picture is associated with the RGB color image of the camera to form a 4-channel virtual picture which is used as input data of target distance detection network training.
7. The object distance detection method according to claim 6, characterized in that: the corresponding rule is:
(1) determining the center of the region of interest: the target reflection center position (x̄_r, ȳ_r) detected by the radar is projected onto the image using the calibration parameters found in step 2); the spatial conversion relation determines the center pixel position (u_0, v_0) of the region of interest of the radar-detected target on the image;
(2) determining the size and pixel values of the target area in the radar single-channel virtual picture: a two-dimensional Gaussian model determines the pixel values and the size of the target area filled into the single-channel virtual picture, where the mean of the Gaussian distribution is the pixel position (u_0, v_0) corresponding to the reflection center determined in (1); according to the 3σ principle, pixels outside 3σ are taken as zero, so the variances (σ_1², σ_2²) of the Gaussian distribution reflect the size of the target area on the image in the length and width dimensions, and their relation to the relative distance r between target and sensor and to the target size (w, h) estimated by the radar in step 1) is expressed by a function g: (σ_1², σ_2²) = g(w, h, r);
meanwhile, in a traffic scene, the filled pixel values reflect the degree of attention paid to the detection accuracy of close-range moving targets; furthermore, because the radar provides a confidence σ that a reflection point is a target, these factors are expressed by a scale factor k = f(r, v_rel, σ), which affects the pixel values of the target fill area on the virtual image.
8. The object distance detection method according to claim 1, characterized in that: in the step 4), in order to implement deep fusion, the modification of the neural network structure includes the following aspects:
(1) modifying a training data reading function of the selected algorithm, and receiving data reading of the 4-channel image;
(2) modifying the convolution layer convolution kernel and extracting the characteristics with higher dimensionality;
(3) and modifying the reading mode of the labeling information: adding a true labeling value of a relative distance beyond a true labeling value provided during image target detection training;
(4) adding a distance prediction function: a loss function for distance prediction is added.
9. The object distance detection method according to claim 1, characterized in that: and when the convolutional neural network shows loss function convergence on a training set, constructing a verification set which is distributed with the training set data in the same way for verification and evaluating the effect of the convolutional neural network after the training is mature.
10. The object distance detection method according to claim 9, characterized in that: the logic for quantitatively evaluating the network effect is as follows: a prediction box is judged a positive sample when its overlap rate with the ground truth exceeds the IoU threshold iou; the model is considered to have predicted a target when the prediction box score exceeds the threshold score; each distance interval is evaluated with several indexes, including Precision = TP/(TP + FP), Recall = TP/(TP + FN), the absolute error of the predicted distance, and the error relative to the ground truth; whether the training parameters need to be adjusted to improve the model is judged from the results; TP: the number of examples that are actually positive and predicted positive by the model; FP: the number of examples that are actually negative but predicted positive by the model; FN: the number of examples that are actually positive but predicted negative by the model.
CN202010258411.6A 2020-04-03 2020-04-03 Target distance detection method for constructing four-channel virtual image by using multi-source information Active CN111462237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010258411.6A CN111462237B (en) 2020-04-03 2020-04-03 Target distance detection method for constructing four-channel virtual image by using multi-source information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010258411.6A CN111462237B (en) 2020-04-03 2020-04-03 Target distance detection method for constructing four-channel virtual image by using multi-source information

Publications (2)

Publication Number Publication Date
CN111462237A true CN111462237A (en) 2020-07-28
CN111462237B CN111462237B (en) 2022-09-20

Family

ID=71685888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010258411.6A Active CN111462237B (en) 2020-04-03 2020-04-03 Target distance detection method for constructing four-channel virtual image by using multi-source information

Country Status (1)

Country Link
CN (1) CN111462237B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111921199A (en) * 2020-08-25 2020-11-13 腾讯科技(深圳)有限公司 Virtual object state detection method, device, terminal and storage medium
CN112505684A (en) * 2020-11-17 2021-03-16 东南大学 Vehicle multi-target tracking method based on radar vision fusion under road side view angle in severe environment
CN112528763A (en) * 2020-11-24 2021-03-19 浙江大华汽车技术有限公司 Target detection method, electronic device and computer storage medium
CN112766302A (en) * 2020-12-17 2021-05-07 浙江大华技术股份有限公司 Image fusion method and device, storage medium and electronic device
CN113095154A (en) * 2021-03-19 2021-07-09 西安交通大学 Three-dimensional target detection system and method based on millimeter wave radar and monocular camera
CN113221957A (en) * 2021-04-17 2021-08-06 南京航空航天大学 Radar information fusion characteristic enhancement method based on Centernet
CN113222111A (en) * 2021-04-01 2021-08-06 上海智能网联汽车技术中心有限公司 Automatic driving 4D perception method, system and medium suitable for all-weather environment
CN113808219A (en) * 2021-09-17 2021-12-17 西安电子科技大学 Radar-assisted camera calibration method based on deep learning
US11315271B2 (en) * 2020-09-30 2022-04-26 Tsinghua University Point cloud intensity completion method and system based on semantic segmentation
CN115052333A (en) * 2021-03-08 2022-09-13 罗伯特·博世有限公司 Method and device for time synchronization of a first vehicle and a second vehicle
CN115701818A (en) * 2023-01-04 2023-02-14 江苏汉邦智能系统集成有限公司 Intelligent garbage classification control system based on artificial intelligence
CN115932702A (en) * 2023-03-14 2023-04-07 武汉格蓝若智能技术股份有限公司 Voltage transformer online operation calibration method and device based on virtual standard device
CN113808219B (en) * 2021-09-17 2024-05-14 西安电子科技大学 Deep learning-based radar auxiliary camera calibration method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015008310A1 (en) * 2013-07-19 2015-01-22 Consiglio Nazionale Delle Ricerche Method for filtering of interferometric data acquired by synthetic aperture radar (sar)
CN103559791A (en) * 2013-10-31 2014-02-05 北京联合大学 Vehicle detection method fusing radar and CCD camera signals
CN109086788A (en) * 2017-06-14 2018-12-25 通用汽车环球科技运作有限责任公司 The equipment of the multi-pattern Fusion processing of data for a variety of different-formats from isomery device sensing, method and system
CN110378196A (en) * 2019-05-29 2019-10-25 电子科技大学 A kind of road vision detection method of combination laser point cloud data
CN110390697A (en) * 2019-07-11 2019-10-29 浙江大学 A kind of millimetre-wave radar based on LM algorithm and camera combined calibrating method
CN110674733A (en) * 2019-09-23 2020-01-10 厦门金龙联合汽车工业有限公司 Multi-target detection and identification method and driving assistance method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NINGBO LONG 等: "Unifying obstacle detection, recognition,and fusion based on millimeter wave radar and RGB-depth sensors for the visually impaired", 《REV. SCI. INSTRUM. 90, 044102 (2019)》 *
TAOHUA ZHOU 等: "Object Detection Using Multi-Sensor Fusion Based on Deep Learning", 《CONFERENCE: 19TH COTA INTERNATIONAL CONFERENCE OF TRANSPORTATION PROFESSIONALS》 *
夏朝阳 et al.: "Micro-motion gesture recognition based on multi-channel FMCW millimeter-wave radar", Journal of Electronics & Information Technology *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111921199A (en) * 2020-08-25 2020-11-13 腾讯科技(深圳)有限公司 Virtual object state detection method, device, terminal and storage medium
CN111921199B (en) * 2020-08-25 2023-09-26 腾讯科技(深圳)有限公司 Method, device, terminal and storage medium for detecting state of virtual object
US11315271B2 (en) * 2020-09-30 2022-04-26 Tsinghua University Point cloud intensity completion method and system based on semantic segmentation
CN112505684A (en) * 2020-11-17 2021-03-16 东南大学 Vehicle multi-target tracking method based on radar vision fusion under road side view angle in severe environment
CN112505684B (en) * 2020-11-17 2023-12-01 东南大学 Multi-target tracking method for radar vision fusion under side view angle of severe environment road
CN112528763A (en) * 2020-11-24 2021-03-19 浙江大华汽车技术有限公司 Target detection method, electronic device and computer storage medium
CN112766302A (en) * 2020-12-17 2021-05-07 浙江大华技术股份有限公司 Image fusion method and device, storage medium and electronic device
CN112766302B (en) * 2020-12-17 2024-03-29 浙江大华技术股份有限公司 Image fusion method and device, storage medium and electronic device
CN115052333A (en) * 2021-03-08 2022-09-13 罗伯特·博世有限公司 Method and device for time synchronization of a first vehicle and a second vehicle
CN113095154A (en) * 2021-03-19 2021-07-09 西安交通大学 Three-dimensional target detection system and method based on millimeter wave radar and monocular camera
CN113222111A (en) * 2021-04-01 2021-08-06 上海智能网联汽车技术中心有限公司 Automatic driving 4D perception method, system and medium suitable for all-weather environment
CN113221957A (en) * 2021-04-17 2021-08-06 南京航空航天大学 Radar information fusion characteristic enhancement method based on Centernet
CN113221957B (en) * 2021-04-17 2024-04-16 南京航空航天大学 Method for enhancing radar information fusion characteristics based on center
CN113808219A (en) * 2021-09-17 2021-12-17 西安电子科技大学 Radar-assisted camera calibration method based on deep learning
CN113808219B (en) * 2021-09-17 2024-05-14 西安电子科技大学 Deep learning-based radar auxiliary camera calibration method
CN115701818A (en) * 2023-01-04 2023-02-14 江苏汉邦智能系统集成有限公司 Intelligent garbage classification control system based on artificial intelligence
CN115701818B (en) * 2023-01-04 2023-05-09 江苏汉邦智能系统集成有限公司 Intelligent garbage classification control system based on artificial intelligence
CN115932702A (en) * 2023-03-14 2023-04-07 武汉格蓝若智能技术股份有限公司 Voltage transformer online operation calibration method and device based on virtual standard device

Also Published As

Publication number Publication date
CN111462237B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN111462237B (en) Target distance detection method for constructing four-channel virtual image by using multi-source information
CN110675418B (en) Target track optimization method based on DS evidence theory
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
CN112149550A (en) Automatic driving vehicle 3D target detection method based on multi-sensor fusion
CN115685185B (en) 4D millimeter wave radar and vision fusion perception method
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
CN115273034A (en) Traffic target detection and tracking method based on vehicle-mounted multi-sensor fusion
CN117274749B (en) Fused 3D target detection method based on 4D millimeter wave radar and image
CN111913177A (en) Method and device for detecting target object and storage medium
CN114280611A (en) Road side sensing method integrating millimeter wave radar and camera
CN114758504A (en) Online vehicle overspeed early warning method and system based on filtering correction
CN115761534A (en) Method for detecting and tracking small target of infrared unmanned aerial vehicle under air background
CN117111085A (en) Automatic driving automobile road cloud fusion sensing method
CN116978009A (en) Dynamic object filtering method based on 4D millimeter wave radar
Ennajar et al. Deep multi-modal object detection for autonomous driving
CN116817891A (en) Real-time multi-mode sensing high-precision map construction method
CN116482627A (en) Combined calibration method based on millimeter wave radar and monocular camera
CN113177966B (en) Three-dimensional scanning coherent laser radar point cloud processing method based on velocity clustering statistics
CN113221744B (en) Monocular image 3D object detection method based on deep learning
CN115546595A (en) Track tracking method and system based on fusion sensing of laser radar and camera
CN115471526A (en) Automatic driving target detection and tracking method based on multi-source heterogeneous information fusion
WO2023009180A1 (en) Lidar-based object tracking
CN116433712A (en) Fusion tracking method and device based on pre-fusion of multi-sensor time sequence sensing results
CN111090105B (en) Vehicle-mounted laser radar point cloud signal ground point separation method
CN116433715A (en) Time sequence tracking method, device and medium based on multi-sensor front fusion result

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant