CN113095154A - Three-dimensional target detection system and method based on millimeter wave radar and monocular camera - Google Patents

Three-dimensional target detection system and method based on millimeter wave radar and monocular camera Download PDF

Info

Publication number
CN113095154A
CN113095154A CN202110299442.0A CN202110299442A CN113095154A CN 113095154 A CN113095154 A CN 113095154A CN 202110299442 A CN202110299442 A CN 202110299442A CN 113095154 A CN113095154 A CN 113095154A
Authority
CN
China
Prior art keywords
millimeter wave
wave radar
image
data
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110299442.0A
Other languages
Chinese (zh)
Inventor
薛建儒
袁佳玮
王盼
叶蓁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202110299442.0A priority Critical patent/CN113095154A/en
Publication of CN113095154A publication Critical patent/CN113095154A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The system comprises a space-time alignment module for image data and millimeter wave radar data, a feature extraction module for the image data and the millimeter wave radar data, and a fusion detection module for the image data and the millimeter wave radar data. The invention is a fused target detection method based on two sensors, a millimeter wave radar and a monocular camera, and is lower in cost than target detection algorithms based on three-dimensional lidar. The method achieves a good three-dimensional target detection effect: the millimeter wave radar provides accurate object depth and velocity information, the camera provides accurate object category information, and the strengths and weaknesses of the two sensors are complementary.

Description

Three-dimensional target detection system and method based on millimeter wave radar and monocular camera
Technical Field
The invention belongs to the technical field of computer vision combined with deep learning, and particularly relates to a three-dimensional target detection system and method based on a millimeter wave radar and a monocular camera.
Background
Environmental perception is a key part of an unmanned driving system; it is the front-end input to the planning and decision-making module and provides an important basis for it. Environmental perception covers a variety of tasks, including traffic light detection, obstacle detection and lane line detection, among which three-dimensional object detection is particularly important. An unmanned vehicle needs to determine whether obstacles exist in its surroundings in order to make planning decisions and determine its future driving trajectory. Most autonomous vehicles use sensors such as cameras, radars and lidars. Combining different types of sensors is advantageous for tasks such as object detection and can yield more accurate and reliable results, but it also makes designing real-time perception systems more challenging.
Although millimeter wave radar can provide accurate distance and speed information of a target, it is not suitable for tasks such as target classification. On the other hand, a camera is a very effective object classification sensor, but it cannot provide accurate information of object speed and depth. This makes the fusion of millimeter wave radar and image data a very interesting topic in autonomous driving applications.
With the rise of large-scale parallel computing and GPUs, methods based on deep neural networks have been widely applied to target detection for unmanned platforms. However, owing to the lack of open-source millimeter wave radar datasets, deep-learning methods that fuse millimeter wave radar and image data are still scarce.
Disclosure of Invention
The invention aims to provide a three-dimensional target detection system and a three-dimensional target detection method based on a millimeter wave radar and a monocular camera, so as to solve the above problems.
In order to achieve the above purpose, the invention adopts the following technical scheme (to be finalized and supplemented):
Compared with the prior art, the invention has the following technical effects:
the invention is a fused target detection method based on two sensors, a millimeter wave radar and a monocular camera, and is lower in cost than target detection algorithms based on three-dimensional lidar. The method achieves a good three-dimensional target detection effect: the millimeter wave radar provides accurate object depth and velocity information, while the camera provides rich appearance features, including accurate object category information, but is not a good source of depth information; the strengths and weaknesses of the two sensors are therefore complementary.
The invention relates to space-time alignment, feature extraction and fusion detection modules based on a millimeter wave radar and a monocular camera. The radar detection results are associated with the preliminary detection results obtained from the image, features are then extracted from the radar data, and the three-dimensional detection box of the target is accurately estimated in combination with the image features, thereby realizing accurate detection of obstacles in unmanned driving scenes. The image and the millimeter wave radar data are feature-extracted using CenterNet and PointNet, respectively, and the fused features are used to accurately estimate three-dimensional properties of the object, such as depth, rotation and velocity. The existing deep-learning detection network fusing millimeter wave radar and camera is improved, and the performance is significantly increased.
Drawings
FIG. 1 is a schematic diagram of a calibration board for a millimeter wave radar and camera according to the present invention;
FIG. 2 is a schematic structural diagram of a fusion detection network of a millimeter wave radar and a camera according to the present invention;
FIG. 3 is a schematic diagram of a data association module of the millimeter wave radar and the camera according to the present invention;
FIG. 4 is a schematic diagram of a feature extraction module of a millimeter wave radar point according to the present invention;
Detailed Description
The invention is further described below with reference to the accompanying drawings:
referring to fig. 1 to 4, first, the image and the millimeter wave radar data are aligned in time and space.
The two sensors each attach a timestamp to the data they acquire. To ensure the reliability of the data, the millimeter wave radar data are aligned to the timestamps of the images: for each frame of image data, a nearest-neighbor algorithm selects the closest frame of millimeter wave radar data, each detection of which is of the form

(x; y; rcs; vx; vy),

where (x; y) is the target position detected by the radar at time t_radar, rcs is the radar cross-section, and (vx; vy) is the radial velocity of the detected object resolved into the x and y directions. Because the millimeter wave radar provides accurate velocity information, the target position is compensated using the velocity measured by the radar, as shown in the following formulas:

x' = x + vx · (t_image - t_radar),
y' = y + vy · (t_image - t_radar),

where t_image is the acquisition time of the image.
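As a brief illustration of this time-alignment step, the following Python sketch (function and variable names are hypothetical, not part of the patent) selects the radar frame nearest to the image timestamp and compensates the detected positions with the measured velocities:

    import numpy as np

    def time_align_radar(radar_frames, t_image):
        """Pick the radar frame closest to the image timestamp and motion-compensate it.

        radar_frames: list of (t_radar, detections), where detections is an (N, 5)
                      array with columns [x, y, rcs, vx, vy].
        t_image:      acquisition time of the image, in seconds.
        """
        # Nearest-neighbor selection over the buffered radar timestamps.
        t_radar, detections = min(radar_frames, key=lambda frame: abs(frame[0] - t_image))

        dt = t_image - t_radar
        compensated = detections.copy()
        compensated[:, 0] += detections[:, 3] * dt   # x' = x + vx * (t_image - t_radar)
        compensated[:, 1] += detections[:, 4] * dt   # y' = y + vy * (t_image - t_radar)
        return compensated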
After the time alignment of the two sensors is achieved, a temporally synchronized frame of radar and image data is obtained. The time-aligned data of the two sensors are then spatially aligned.
It is assumed that the millimeter wave radar detects obstacles on a plane parallel to the ground, called the radar plane. The millimeter wave radar data can therefore be expressed as z_radar = (x, y) in a two-dimensional top-view coordinate system, where (x, y) is the position of the target relative to the millimeter wave sensor. Correspondingly, the image-level detection result is z_img = (u, v) in the two-dimensional image coordinate system, where (u, v) are the coordinates of the object in the image.
Spatial alignment establishes a projective transformation between the two-dimensional top-view coordinate system of the millimeter wave radar point cloud and the image coordinate system. The mapping between the millimeter wave radar data and the image is a planar homography:

s · [u, v, 1]^T = H · [x, y, 1]^T,

where

H = [[h11, h12, h13], [h21, h22, h23], [h31, h32, h33]].

From the above formula one obtains

u = (h11·x + h12·y + h13) / (h31·x + h32·y + h33),
v = (h21·x + h22·y + h23) / (h31·x + h32·y + h33).
Referring to FIG. 1, a schematic diagram of the calibration board for the millimeter wave radar and the camera according to the present invention is shown. The transformation matrix between the two sensor coordinate systems is computed with the help of a purpose-built calibration board. The calibration board is a plate to which four circular patterns are attached, with a metal trihedral corner reflector placed in the middle of the four circles; the height of the corner reflector above the lower edge of the board is the same as the mounting height of the millimeter wave radar, so that the corner reflector is reliably detected by the millimeter wave radar and all detections lie in the same plane. The positions of the four circles in the acquired image are found by Hough detection to obtain their pixel coordinates, and the pixel coordinates of the point in the middle of the four circles (i.e., the metal trihedral corner reflector) are then obtained by calculation.
The calibration board is placed vertically multiple times at different spatial positions and distances, which yields multiple groups of millimeter wave radar detection results (x, y) and corresponding image detection results (u, v). Four or more pairs of (x, y) and (u, v) are then selected manually, from which the transformation matrix between the millimeter wave radar plane coordinate system and the image coordinate system can be obtained. The solving process is as follows:
from each pair of corresponding feature points, the following linear equations are obtained:

u · (h31·x + h32·y + h33) = h11·x + h12·y + h13,
v · (h31·x + h32·y + h33) = h21·x + h22·y + h23.

Since the homography matrix is only defined up to scale (one element can be fixed to 1), it has eight unknowns, and each pair of corresponding points contributes two equations, so at least four pairs of corresponding points are needed to compute the homography matrix. In a real application, the homography matrix is computed from more than four points, which makes the result more accurate.
Finally, the time-aligned millimeter wave radar data are projected into the image coordinate system through the transformation matrix.
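As an illustrative sketch of this calibration and projection step (the numerical point pairs below are placeholders, and OpenCV's findHomography is used as one possible least-squares solver rather than the patent's own implementation):

    import cv2
    import numpy as np

    # Radar-plane coordinates (x, y) and image pixels (u, v) of the corner reflector,
    # collected from several placements of the calibration board (values are illustrative).
    radar_pts = np.array([[5.0, -1.2], [7.5, 0.3], [10.1, 1.8], [12.6, -2.4], [15.0, 0.0]])
    image_pts = np.array([[812., 420.], [655., 402.], [540., 390.], [905., 385.], [640., 378.]])

    # Least-squares homography from four or more correspondences; using more points
    # gives a more accurate result, as noted above.
    H, _ = cv2.findHomography(radar_pts, image_pts, 0)

    def project_radar_to_image(xy, H):
        """Map radar-plane points (N, 2) to pixel coordinates through the homography."""
        pts = np.hstack([xy, np.ones((xy.shape[0], 1))])   # homogeneous [x, y, 1]
        uvw = (H @ pts.T).T
        return uvw[:, :2] / uvw[:, 2:3]                    # divide out the scale s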
After the spatio-temporally aligned millimeter wave radar and camera data are obtained, a feature-level fusion method is used to detect three-dimensional targets.
FIG. 2 is a schematic structural diagram of the fusion detection network of the millimeter wave radar and the camera according to the present invention. Using the DLA network as the backbone, image features are extracted to predict the center point of each object in the image, as well as the size of its two-dimensional box, the center offset, the size of its three-dimensional box, its depth and its rotation. These values are predicted by the basic regression heads shown in FIG. 2, each consisting of a 3×3 convolutional layer with 256 channels followed by a 1×1 convolutional layer. This provides an accurate two-dimensional box and a rough three-dimensional box for each object in the image.
Firstly, adopting a CenterNet algorithm to predict the center point of an image target.
The CenterNet model takes an image I ∈ R^(W×H×3) as the input to the network and finally produces a keypoint heatmap

Ŷ ∈ [0, 1]^((W/R)×(H/R)×C)

as the output, where W and H are the width and height of the image, R is the down-sampling rate, and C is the number of object classes. An output Ŷ_xyc = 1 indicates that an object of class c has been detected at position (x, y) in the image. The ground-truth heatmap

Y ∈ [0, 1]^((W/R)×(H/R)×C)

is generated from the two-dimensional ground-truth boxes by a Gaussian kernel. For a ground-truth center point q ∈ R^2 of class c, with low-resolution equivalent q' = floor(q / R), the heatmap value is defined as:

Y_xyc = exp( -((x - q'_x)^2 + (y - q'_y)^2) / (2·σ_q^2) ),
where σ_q is a scale-adaptive standard deviation that controls the spread of the Gaussian according to the size of each object. The neural network produces a heatmap for each target class in the image, with the peaks representing the likely center points of objects. During network training, in order to deal with the imbalance between positive and negative samples, the loss function is based on the focal loss. The training strategy is to reduce the loss at pixels around the center point, computed as shown in the following formula:

L_k = -(1/N) · Σ_{x,y,c} { (1 - Ŷ_xyc)^α · log(Ŷ_xyc)                    if Y_xyc = 1
                           (1 - Y_xyc)^β · (Ŷ_xyc)^α · log(1 - Ŷ_xyc)    otherwise }

where N is the number of objects, Y is the labeled ground-truth heatmap, and α and β are the hyper-parameters of the focal loss; α = 2 and β = 4 are used.
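A minimal PyTorch sketch of this penalty-reduced focal loss, written directly from the formula above (tensor and function names are illustrative, not part of the patent):

    import torch

    def center_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
        """Pixel-wise focal loss on heatmaps; gt is the Gaussian ground-truth heatmap."""
        pos = gt.eq(1).float()                    # pixels that are object centers
        neg = 1.0 - pos

        pos_loss = (1 - pred) ** alpha * torch.log(pred + eps) * pos
        neg_loss = (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred + eps) * neg

        num_objects = pos.sum().clamp(min=1)      # N in the formula
        return -(pos_loss.sum() + neg_loss.sum()) / num_objects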
For each center point, the network also predicts a local offset

Ô ∈ R^((W/R)×(H/R)×2)

to compensate for the discretization error caused by the fixed output stride of the backbone network. The offset loss is computed with the L1 loss function, as shown in the following formula:

L_off = (1/N) · Σ_q | Ô_q' - (q/R - q') |.

The offset loss is evaluated only at the keypoint positions q'; all other positions are ignored.
The two-dimensional bounding box of an object k of class c_k is represented by (x1^(k), y1^(k), x2^(k), y2^(k)), and its center point lies at

q_k = ( (x1^(k) + x2^(k)) / 2 , (y1^(k) + y2^(k)) / 2 ).

The predicted heatmap Ŷ is used to predict all center points. For each object k the network also regresses its size s_k = (x2^(k) - x1^(k), y2^(k) - y1^(k)), and a single size prediction

Ŝ ∈ R^((W/R)×(H/R)×2)

is shared by all object classes. It is trained with the L1 loss function as follows:

L_size = (1/N) · Σ_{k=1..N} | Ŝ_{q_k} - s_k |.

The overall loss function for network training is:

L_det = L_k + λ_size·L_size + λ_off·L_off.

In the experiments, λ_size is set to 0.1 and λ_off to 1.
To generate a three-dimensional box, the network also regresses three additional attributes, namely the depth, the orientation angle and the size of the three-dimensional box. For this purpose, three separate regression branches are added, whose losses are also computed with the L1 loss function.
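The regression heads described above with reference to FIG. 2 could be sketched as follows in PyTorch; the backbone channel count, the number of classes, the intermediate ReLU and the per-head output widths (for example the rotation encoding) are assumptions made for illustration, not values fixed by the patent:

    import torch.nn as nn

    def make_head(in_channels, out_channels):
        """One basic regression head: a 3x3 convolution with 256 channels, then a 1x1 convolution."""
        return nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, out_channels, kernel_size=1),
        )

    backbone_channels = 64   # assumed output width of the DLA feature map
    num_classes = 10         # assumed number of object classes

    heads = nn.ModuleDict({
        "heatmap": make_head(backbone_channels, num_classes),  # object center points
        "wh":      make_head(backbone_channels, 2),            # 2D box size
        "offset":  make_head(backbone_channels, 2),            # center-point offset
        "depth":   make_head(backbone_channels, 1),            # depth (3D branch)
        "rot":     make_head(backbone_channels, 8),            # rotation, multi-bin encoding (assumed)
        "dim":     make_head(backbone_channels, 3),            # 3D box dimensions (3D branch)
    })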
Next, the targets detected by the millimeter wave radar must be accurately associated with the image detections. CenterNet generates a heatmap for each target class in the image; the peaks of the heatmap represent the likely center points of objects, and the image features at these locations are used to estimate the other properties of the objects. To use the radar information in this setting, radar-based features need to be mapped to the center of their corresponding object on the image, which requires a precise association between the targets detected by the radar and the targets in the image.
FIG. 3 is a schematic diagram of the data association module of the millimeter wave radar and the camera according to the present invention. A viewing-cone-based association method is used, which greatly reduces the number of radar detections that have to be checked for association, since any radar point outside the viewing cone can be ignored.
In the training phase of the neural network, a region-of-interest viewing cone is created from the three-dimensional ground-truth box of the target, and the radar detections are associated with it. In the testing phase, the viewing cone is created from the two-dimensional box predicted from the image together with its estimated depth d and size. As shown in FIG. 3, the viewing-cone method avoids the problem of associating objects that overlap in two dimensions: because the objects do not overlap in three dimensions, each object has its own viewing cone.
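This association step can be sketched as follows; the radar points are assumed to have already been transformed into the camera coordinate system, and the relative depth tolerance delta that defines the length of the viewing cone is an assumption of this sketch rather than a value given in the patent:

    import numpy as np

    def radar_points_in_frustum(radar_cam, box2d, depth, K, delta=0.25):
        """Keep the radar points that fall inside the viewing cone of one detected object.

        radar_cam: (N, 3) radar detections in the camera coordinate system.
        box2d:     (u1, v1, u2, v2) two-dimensional box predicted from the image, in pixels.
        depth:     estimated object depth d from the image branch, in meters.
        K:         3x3 camera intrinsic matrix.
        """
        u1, v1, u2, v2 = box2d
        uvw = (K @ radar_cam.T).T
        u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
        z = radar_cam[:, 2]

        inside = (
            (u >= u1) & (u <= u2) & (v >= v1) & (v <= v2) &            # inside the 2D box
            (z >= depth * (1 - delta)) & (z <= depth * (1 + delta))    # inside the depth slab
        )
        return radar_cam[inside]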
Referring to FIG. 4, the network structure used to extract features from the millimeter wave radar data according to the present invention is shown. The features of the millimeter wave radar point cloud data that have been associated with the image are extracted by PointNet. A millimeter wave radar typically represents detected objects as two-dimensional points in a bird's eye view and provides their azimuth angle and radial distance. For each detection result, the millimeter wave radar also measures the radial instantaneous velocity of the object.
Each radar detection result is represented as a three-dimensional point in the ego-vehicle coordinate system and is parameterized as P = (x; y; z; rcs; vx; vy). A point cloud is represented as a set of points {P_i | i = 1, ..., n}, where each point P_i is a vector containing (x, y, z, rcs, vx, vy). PointNet takes these n points as input and aggregates the point features by max pooling.
PointNet can be described as a set function that approximates a symmetric function on the elements of a point set, as shown by:

f({x1, ..., xn}) ≈ g(h(x1), ..., h(xn)),

where the function h is approximated by a multilayer perceptron and the function g by a max pooling layer, which finally yields a 1024-dimensional global feature. After the global point cloud feature for each viewing cone has been computed, the global feature produced by PointNet is concatenated with the raw data feature of each radar point to form the feature of that radar point, so that the resulting feature captures both local and global information.
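A compact PyTorch sketch of this per-point feature extraction, with a shared multilayer perceptron standing in for h, max pooling for g, and the global feature concatenated back onto every point (the intermediate layer widths are assumptions consistent with the 1024-dimensional global feature mentioned above):

    import torch
    import torch.nn as nn

    class RadarPointNet(nn.Module):
        """Shared per-point MLP (h), max-pooling aggregation (g), global+local concatenation."""

        def __init__(self, in_dim=6):  # (x, y, z, rcs, vx, vy)
            super().__init__()
            self.h = nn.Sequential(    # shared MLP applied independently to every point
                nn.Linear(in_dim, 64), nn.ReLU(),
                nn.Linear(64, 128), nn.ReLU(),
                nn.Linear(128, 1024), nn.ReLU(),
            )

        def forward(self, points):                            # points: (B, N, 6)
            per_point = self.h(points)                        # (B, N, 1024)
            global_feat = per_point.max(dim=1).values         # (B, 1024), g = max pooling
            global_rep = global_feat.unsqueeze(1).expand_as(per_point)
            # Concatenate the global feature with the raw data feature of each radar point.
            return torch.cat([points, global_rep], dim=-1)    # (B, N, 6 + 1024)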
The generated millimeter wave radar features are concatenated with the image features as additional channels. These fused features are used as the input of a second regression stage to compute more accurate object information. The velocity regression head estimates the x and y components of the object's actual velocity in the vehicle coordinate system. The attribute regression head estimates different attributes for different object classes, for example whether a car is moving or parked. Finally, the regression results are decoded into three-dimensional obstacle boxes, and the other attributes of the objects are obtained at the same time.
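One way to realize this concatenation is sketched below: each associated radar feature vector is written to its object's center location on an otherwise-zero map, which is then stacked onto the image feature map as additional channels before the second-stage regression heads. Names, shapes and the scatter-to-center layout are assumptions of this sketch:

    import torch

    def fuse_radar_channels(image_feat, radar_feats, centers):
        """Scatter per-object radar features onto their center pixels and concatenate.

        image_feat:  (B, C_img, H, W) backbone feature map.
        radar_feats: per batch item, a (K_b, C_rad) tensor of radar feature vectors.
        centers:     per batch item, a list of (x, y) integer center locations on the
                     feature map, produced by the viewing-cone association step.
        """
        B, _, H, W = image_feat.shape
        c_rad = radar_feats[0].shape[1]
        radar_map = image_feat.new_zeros(B, c_rad, H, W)      # zero where no radar point maps
        for b in range(B):
            for feat, (x, y) in zip(radar_feats[b], centers[b]):
                radar_map[b, :, y, x] = feat
        return torch.cat([image_feat, radar_map], dim=1)      # input to the second regression stage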

Claims (8)

1. The three-dimensional target detection system based on the millimeter wave radar and the monocular camera is characterized by comprising a space-time alignment module of image data and millimeter wave radar data, a feature extraction module of the image data and the millimeter wave radar data and a fusion detection module of the image data and the millimeter wave radar data;
the space-time alignment module of the image data and the millimeter wave radar data comprises a time alignment module and a space alignment module; the time alignment module is used for aligning the data of the millimeter wave radar to the time stamp of the image data and compensating the position of the image data by using the target speed of the millimeter wave radar; the space alignment module performs spatial association on data of the millimeter wave radar points and corresponding areas of the images through projection transformation;
the feature extraction module of the image data and the millimeter wave radar data is used for extracting image features and extracting features of the millimeter wave radar data; predicting the central point of the target by using the image characteristics to obtain the preliminary estimation of the three-dimensional coordinate and the depth information of the target; performing data association on the millimeter wave radar point and the detected central point of the image by using a method based on a viewing cone, and further performing global feature extraction on the associated millimeter wave radar point cloud data;
and the fusion detection module of the image data and the millimeter wave radar data is used for performing accurate regression of the speed, the depth, the rotation and the attribute information of the three-dimensional object at the second stage after splicing the features extracted by the two sensors.
2. The three-dimensional target detection method based on the millimeter wave radar and the monocular camera, characterized in that, based on the three-dimensional target detection system based on the millimeter wave radar and the monocular camera of claim 1, the method comprises the following steps:
step 1, performing time-space alignment on an image and millimeter wave radar data, and firstly obtaining a frame of data which comprises a millimeter wave radar and an image and is aligned in time and space;
step 2, after a frame of data containing the millimeter wave radar data and the image aligned in time and space is obtained, feature extraction is performed on the image and the millimeter wave radar data respectively, and the first-stage regression of the speed, depth, rotation and attribute information of the three-dimensional object is performed using the features extracted from the image;
step 3, after the extracted millimeter wave radar features and image data features are concatenated, the second-stage regression of the speed, depth, rotation and attribute information of the three-dimensional object is performed to compute more accurate object information.
3. The three-dimensional target detection method based on the millimeter wave radar and the monocular camera as claimed in claim 2, wherein in step 1, with the sampling rate of the camera as a reference, the camera selects a frame of data cached by the millimeter wave radar by using a nearest neighbor algorithm every time the camera acquires a frame of image, and then performs position compensation on the frame of data by using accurate speed information provided by the millimeter wave radar so as to synchronize the millimeter wave radar data and the camera data in time; and calculating a coordinate transformation matrix between a plane coordinate system and an image coordinate system of the millimeter wave radar in an off-line manner through a calibration plate, projecting the data of the millimeter wave radar after time alignment to a corresponding image plane through the transformation matrix obtained through calculation, and unifying the detection results of the millimeter wave radar and the monocular camera to the same coordinate system.
4. The millimeter wave radar and monocular camera-based three-dimensional target detection method according to claim 3, wherein the calibration plate is a plate to which four circular patterns are attached, a metal trihedral corner reflector is placed in the middle of the four circular patterns, and the height of the metal trihedral corner reflector above the lower edge of the calibration plate is the same as the installation height of the millimeter wave radar, so as to ensure that the metal trihedral corner reflector is detected by the millimeter wave radar and that all detection results lie in the same plane; the positions of the four circles in the acquired image are detected by Hough detection to obtain the pixel coordinate values of the four circles, and finally the pixel coordinate value of the point in the middle of the four circles in the image is obtained by calculation; the calibration plate is vertically placed multiple times at different spatial arrangements and distances to obtain multiple groups of different millimeter wave radar detection results and corresponding image detection results, and four or more pairs of millimeter wave radar detection results and corresponding image detection results are then manually selected to obtain the transformation matrix between the millimeter wave radar plane coordinate system and the image coordinate system.
5. The method for detecting the three-dimensional target based on the millimeter wave radar and the monocular camera as recited in claim 2, wherein in step 2, the feature extraction based on the spatio-temporally aligned frame containing the millimeter wave radar data and the image comprises: predicting the center point of an object on the image plane by using a CenterNet model, and then regressing a preliminary three-dimensional position, orientation, size and object attribute information, wherein the object attribute information includes whether a vehicle is moving or parked and whether a pedestrian is standing or sitting; a three-dimensional target region of interest is created by using the two-dimensional box predicted from the image and its estimated depth and size, the radar detection results are associated with this region of interest, and global feature extraction is performed on the associated local millimeter wave radar point cloud data by using PointNet.
6. The method for detecting the three-dimensional target based on the millimeter wave radar and the monocular camera of claim 5, wherein the center point of the target in the image is predicted by adopting a CenterNet algorithm:
the CenterNet model takes an image I ∈ R^(W×H×3) as the input to the network and finally produces a keypoint heatmap

Ŷ ∈ [0, 1]^((W/R)×(H/R)×C)

as the output; W and H are the width and height of the image, R is the down-sampling rate, and C is the number of object classes; an output Ŷ_xyc = 1 indicates that an object of class c located at (x, y) in the image is detected; the ground-truth heatmap

Y ∈ [0, 1]^((W/R)×(H/R)×C)

is generated from the two-dimensional ground-truth boxes by a Gaussian kernel; finally, for a ground-truth center point q ∈ R^2 of class c, with low-resolution equivalent q' = floor(q / R), the heatmap value is defined as:

Y_xyc = exp( -((x - q'_x)^2 + (y - q'_y)^2) / (2·σ_q^2) ),

where σ_q is a scale-adaptive standard deviation that controls the spread of the Gaussian according to the size of each object; the neural network generates a heatmap for each target class in the image, and the peaks of the heatmap represent the possible center points of objects; the training strategy is to reduce the loss at pixels around the center point, computed as shown in the following formula:

L_k = -(1/N) · Σ_{x,y,c} { (1 - Ŷ_xyc)^α · log(Ŷ_xyc)                    if Y_xyc = 1
                           (1 - Y_xyc)^β · (Ŷ_xyc)^α · log(1 - Ŷ_xyc)    otherwise }

where N is the number of objects, Y is the labeled ground-truth heatmap, and α and β are the hyper-parameters of the focal loss;
for each center point, the network predicts a local offset

Ô ∈ R^((W/R)×(H/R)×2)

to compensate for the discretization error caused by the fixed output stride of the backbone network; the offset loss is computed with the L1 loss function, as shown in the following formula:

L_off = (1/N) · Σ_q | Ô_q' - (q/R - q') |,

where the offset loss is evaluated only at the keypoint positions q' and all other positions are ignored;
the two-dimensional bounding box of an object k of class c_k is represented as (x1^(k), y1^(k), x2^(k), y2^(k)), and its center point lies at

q_k = ( (x1^(k) + x2^(k)) / 2 , (y1^(k) + y2^(k)) / 2 );

the predicted heatmap Ŷ is used to predict all center points; for each object k the network also regresses its size s_k = (x2^(k) - x1^(k), y2^(k) - y1^(k)), and a single size prediction

Ŝ ∈ R^((W/R)×(H/R)×2)

is shared by all object classes; it is trained with the L1 loss function as follows:

L_size = (1/N) · Σ_{k=1..N} | Ŝ_{q_k} - s_k |;

the overall loss function for network training is:

L_det = L_k + λ_size·L_size + λ_off·L_off.
7. the millimeter wave radar and monocular camera-based three-dimensional target detection method according to claim 5, wherein the feature of the millimeter wave radar point cloud data that has been correlated with the image is extracted by PointNet, the millimeter wave radar generally represents the detected object as a two-dimensional point in a bird's eye view and provides its azimuth angle and radial distance, and for each detection result, the millimeter wave radar can obtain the radial instantaneous velocity of the object; each radar detection result is represented as a three-dimensional point in the coordinate system of the self-vehicle and is parameterized as P ═ x; y; z; rcs; v ═ yx;vy) One point cloud is represented as a set of three-dimensional points { P }i1., n } for each point PiIs a compound containing (x, y, z, rcs, v)x,vy) The vector of (a); the PointNet takes the n points as input and aggregates the characteristics of the points through maximum pooling;
described as a function of the transformation that approximates the elements in a point set, as shown by:
f({x1,...,xn})≈g(h(x1),...,h(xn)),
and (4) approximating the function h through a multilayer perceptron, approximating the function g through a maximum pooling layer, and finally obtaining the global feature.
8. The method as claimed in claim 2, wherein step 3 specifically comprises: generating the millimeter wave radar feature map, concatenating it with the image features as additional channels, and using the combined radar and image features as the input of the second regression network, so as to recompute the depth, rotation, speed and attribute information of the object and obtain an accurate result on the basis of the image detection result.
CN202110299442.0A 2021-03-19 2021-03-19 Three-dimensional target detection system and method based on millimeter wave radar and monocular camera Pending CN113095154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110299442.0A CN113095154A (en) 2021-03-19 2021-03-19 Three-dimensional target detection system and method based on millimeter wave radar and monocular camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110299442.0A CN113095154A (en) 2021-03-19 2021-03-19 Three-dimensional target detection system and method based on millimeter wave radar and monocular camera

Publications (1)

Publication Number Publication Date
CN113095154A true CN113095154A (en) 2021-07-09

Family

ID=76668779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110299442.0A Pending CN113095154A (en) 2021-03-19 2021-03-19 Three-dimensional target detection system and method based on millimeter wave radar and monocular camera

Country Status (1)

Country Link
CN (1) CN113095154A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113625271A (en) * 2021-07-29 2021-11-09 中汽创智科技有限公司 Millimeter wave radar and binocular camera based simultaneous positioning and image building method
CN114708585A (en) * 2022-04-15 2022-07-05 电子科技大学 Three-dimensional target detection method based on attention mechanism and integrating millimeter wave radar with vision
CN115421132A (en) * 2022-11-04 2022-12-02 北京锐达仪表有限公司 3D scanning radar with installation angle error is from correcting function
CN115797463A (en) * 2022-11-28 2023-03-14 湖南华诺星空电子技术有限公司 Neural network training method and global calibration method for FOD radar and camera
CN116386016A (en) * 2023-05-22 2023-07-04 杭州睿影科技有限公司 Foreign matter treatment method and device, electronic equipment and storage medium
WO2023164845A1 (en) * 2022-03-02 2023-09-07 深圳市大疆创新科技有限公司 Three-dimensional reconstruction method, device, system, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111352112A (en) * 2020-05-08 2020-06-30 泉州装备制造研究所 Target detection method based on vision, laser radar and millimeter wave radar
CN111427020A (en) * 2020-06-11 2020-07-17 交通运输部公路科学研究所 Combined calibration method, device and system for environmental information data acquisition equipment
CN111462237A (en) * 2020-04-03 2020-07-28 清华大学 Target distance detection method for constructing four-channel virtual image by using multi-source information
CN111967525A (en) * 2020-08-20 2020-11-20 广州小鹏汽车科技有限公司 Data processing method and device, server and storage medium
CN112241978A (en) * 2020-10-21 2021-01-19 广州小鹏自动驾驶科技有限公司 Data processing method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462237A (en) * 2020-04-03 2020-07-28 清华大学 Target distance detection method for constructing four-channel virtual image by using multi-source information
CN111352112A (en) * 2020-05-08 2020-06-30 泉州装备制造研究所 Target detection method based on vision, laser radar and millimeter wave radar
CN111427020A (en) * 2020-06-11 2020-07-17 交通运输部公路科学研究所 Combined calibration method, device and system for environmental information data acquisition equipment
CN111967525A (en) * 2020-08-20 2020-11-20 广州小鹏汽车科技有限公司 Data processing method and device, server and storage medium
CN112241978A (en) * 2020-10-21 2021-01-19 广州小鹏自动驾驶科技有限公司 Data processing method and device

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CHARLES R. QI ETAL.: "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", 《HTTP://ARXIV:1612.00593V2》 *
CHARLES R. QI ETAL.: "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", 《HTTP://ARXIV:1612.00593V2》, 10 April 2017 (2017-04-10), pages 2 *
RAMIN NABATI ETAL.: "CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection", 《HTTP://ARXIV:2011.04841V1》 *
RAMIN NABATI ETAL.: "CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection", 《HTTP://ARXIV:2011.04841V1》, 10 November 2020 (2020-11-10), pages 1 - 4 *
XINGYI ZHOU ETAL.: "Object as Points", 《HTTP://ARXIV:1904.07850V2》 *
XINGYI ZHOU ETAL.: "Object as Points", 《HTTP://ARXIV:1904.07850V2》, 25 April 2019 (2019-04-25), pages 3 *
LI MENGJIE: "Research on Pedestrian Detection and Tracking Technology Based on the Combination of Laser Point Cloud and Image", China Master's Theses Full-text Database (Information Science and Technology), vol. 2020, no. 02, pages 136 - 1943 *
WANG FEI ET AL.: "A Spatial Calibration Method for Radar and Camera", Computer Measurement & Control, vol. 20, no. 02, pages 454 - 463 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113625271A (en) * 2021-07-29 2021-11-09 中汽创智科技有限公司 Millimeter wave radar and binocular camera based simultaneous positioning and image building method
CN113625271B (en) * 2021-07-29 2023-10-27 中汽创智科技有限公司 Simultaneous positioning and mapping method based on millimeter wave radar and binocular camera
WO2023164845A1 (en) * 2022-03-02 2023-09-07 深圳市大疆创新科技有限公司 Three-dimensional reconstruction method, device, system, and storage medium
CN114708585A (en) * 2022-04-15 2022-07-05 电子科技大学 Three-dimensional target detection method based on attention mechanism and integrating millimeter wave radar with vision
CN114708585B (en) * 2022-04-15 2023-10-10 电子科技大学 Attention mechanism-based millimeter wave radar and vision fusion three-dimensional target detection method
CN115421132A (en) * 2022-11-04 2022-12-02 北京锐达仪表有限公司 3D scanning radar with installation angle error is from correcting function
CN115421132B (en) * 2022-11-04 2023-02-21 北京锐达仪表有限公司 3D scanning radar with installation angle error is from correcting function
CN115797463A (en) * 2022-11-28 2023-03-14 湖南华诺星空电子技术有限公司 Neural network training method and global calibration method for FOD radar and camera
CN116386016A (en) * 2023-05-22 2023-07-04 杭州睿影科技有限公司 Foreign matter treatment method and device, electronic equipment and storage medium
CN116386016B (en) * 2023-05-22 2023-10-10 杭州睿影科技有限公司 Foreign matter treatment method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113095154A (en) Three-dimensional target detection system and method based on millimeter wave radar and monocular camera
US11461912B2 (en) Gaussian mixture models for temporal depth fusion
Banerjee et al. Online camera lidar fusion and object detection on hybrid data for autonomous driving
CN111429514A (en) Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds
CN111201451A (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN112749594B (en) Information completion method, lane line identification method, intelligent driving method and related products
Yang et al. Reactive obstacle avoidance of monocular quadrotors with online adapted depth prediction network
CN111967373B (en) Self-adaptive enhanced fusion real-time instance segmentation method based on camera and laser radar
Ji et al. RGB-D SLAM using vanishing point and door plate information in corridor environment
Peršić et al. Online multi-sensor calibration based on moving object tracking
Liu et al. Vehicle detection and ranging using two different focal length cameras
Xu et al. Road boundaries detection based on modified occupancy grid map using millimeter-wave radar
WO2024114119A1 (en) Sensor fusion method based on binocular camera guidance
CN117237919A (en) Intelligent driving sensing method for truck through multi-sensor fusion detection under cross-mode supervised learning
Hara et al. Vehicle localization based on the detection of line segments from multi-camera images
CN112945233A (en) Global drift-free autonomous robot simultaneous positioning and map building method
CN116977806A (en) Airport target detection method and system based on millimeter wave radar, laser radar and high-definition array camera
Ouyang et al. Semantic slam for mobile robot with human-in-the-loop
Chavan et al. Obstacle detection and avoidance for automated vehicle: A review
CN116403186A (en) Automatic driving three-dimensional target detection method based on FPN Swin Transformer and Pointernet++
CN114648639B (en) Target vehicle detection method, system and device
CN110864670A (en) Method and system for acquiring position of target obstacle
CN113589848B (en) Multi-unmanned aerial vehicle detection, positioning and tracking system and method based on machine vision
CN116385997A (en) Vehicle-mounted obstacle accurate sensing method, system and storage medium
CN115471526A (en) Automatic driving target detection and tracking method based on multi-source heterogeneous information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination