CN116740519A - Three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion - Google Patents

Three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion

Info

Publication number
CN116740519A
CN116740519A
Authority
CN
China
Prior art keywords
target
point
dimensional
detection
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310711683.0A
Other languages
Chinese (zh)
Inventor
方介泼
薛俊
刘仪婷
肖昊
李兴通
钱星铭
陶重犇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University of Science and Technology
Original Assignee
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University of Science and Technology filed Critical Suzhou University of Science and Technology
Priority to CN202310711683.0A priority Critical patent/CN116740519A/en
Publication of CN116740519A publication Critical patent/CN116740519A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • G06V10/245Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The invention provides a three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion, wherein the three-dimensional target detection method comprises the following steps: step one: detecting the center points of an image with CenterNet, regressing their basic attributes, and extracting feature values through a fully convolutional encoding-decoding backbone network; step two: delimiting a three-dimensional ROI region around the target by using the estimated depth, and dividing the detection task into near-view detection and far-view detection. The beneficial effects of the invention are as follows: the disclosed three-dimensional target detection method combines the advantages of three sensors, namely the lidar, the millimeter wave radar and the camera, to realize 3D target detection for the autonomous driving field; it can accurately identify and locate targets such as vehicles, pedestrians and cyclists, and can be applied to real scenes.

Description

Three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion
Technical Field
The invention relates to the technical field of image processing, in particular to a three-dimensional target detection method, a system and a storage medium for close-range and long-range multi-dimensional fusion.
Background
With the development of autonomous driving technology, a great deal of research on target detection has emerged in recent years. Conventional camera-based three-dimensional target detection methods estimate the target class and position from the semantic information of images. However, because depth is difficult to obtain from an image, extra computation is required to estimate the depth of the target. Lidar-based detection methods mostly detect on voxels or on projections of the point cloud. Although the point cloud preserves the geometric information of the target more completely, its sparsity and disorder reduce the ability to accurately detect distant objects. Mainstream multi-sensor algorithms use lidar and camera simultaneously and exploit their complementary advantages to achieve high-accuracy three-dimensional target detection. However, the detection accuracy of such methods drops when facing distant scenes or adverse weather. In addition, in many cases cameras and lidar cannot directly acquire the velocity data that is critical for collision avoidance. Besides lidar and cameras, millimeter wave radar is also widely used for driving assistance. Compared with lidar and cameras, millimeter wave radar has strong penetrating capability and better robustness in harsh environments, and it can also accurately measure the relative velocity of the target. However, the point cloud of millimeter wave radar is extremely sparse and can only serve as a source of depth and velocity information, so few algorithms use millimeter wave radar for detection. A method is therefore needed that weakens the dependence of the algorithm on any single sensor, enhances its robustness, and at the same time improves detection accuracy at both near and far range.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a detection method for close-range and long-range multi-dimensional data fusion.
The invention provides a three-dimensional target detection method for close-range and long-range multi-dimensional fusion, which comprises the following steps:
step one: detecting the center points of an image with CenterNet, regressing their basic attributes, and extracting feature values through a fully convolutional encoding-decoding backbone network;
step two: delimiting a three-dimensional region of interest (ROI) around the target by using the estimated depth, and then dividing the detection task into near-view detection and far-view detection.
The invention also discloses a three-dimensional target detection system for close-range and long-range multi-dimensional fusion, which comprises the following components: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the three-dimensional object detection method of the invention when called by the processor.
The invention also discloses a computer readable storage medium storing a computer program configured to implement the steps of the three-dimensional object detection method of the invention when called by a processor.
The beneficial effects of the invention are as follows: the three-dimensional target detection method disclosed by the invention combines the advantages of three sensors, namely the lidar, the millimeter wave radar and the camera, to realize 3D target detection for the autonomous driving field; it can accurately identify and locate targets such as vehicles, pedestrians and cyclists, and can be applied to real scenes.
Drawings
FIG. 1 is a block diagram of a three-dimensional object detection method of the present invention;
FIG. 2 is a schematic diagram of the coordinate transformation of the three-dimensional object detection method of the present invention;
FIG. 3 is a schematic diagram of a point cloud detection network of the three-dimensional object detection method of the present invention;
FIG. 4 is a schematic diagram of the relationship between radar and keypoint speed for the three-dimensional object detection method of the present invention;
fig. 5 is a schematic diagram of a rotation estimation network of the three-dimensional object detection method of the present invention.
Detailed Description
The invention discloses a detection method based on near-view and far-view multi-dimensional data fusion, which adopts different detection schemes for the far view and the near view, shares the data detected at the center point with the later stages, and thereby improves data utilization. The velocity information of the millimeter wave radar is incorporated into the detection network both as a prior and as a feature, which improves the object detection accuracy.
As shown in Figs. 1-5, the invention discloses a detection method for near-view and far-view multi-dimensional data fusion, which comprises the following steps:
step one: detecting the center points of the image with CenterNet, regressing their basic attributes, and extracting feature values through a fully convolutional encoding-decoding backbone network;
the first step is specifically as follows:
with I epsilon R W×H×3 As an input image, with width W and height H, the network generates a heatmap about the center pointR is the size scaling of the output, C is the type of center point, Y x,y,c The =1 representative point (x, Y) is the key point under type C, and Y x,y,c The expression "0" represents the background point of type c. When training the key point detection network, for true value p E R 2 Downsampling is performed to get->By constructing a gaussian kernel function as shown in the formula:
will true valueProjected onto the heat map and the size of the target 2D detection frame is adapted by sigma, which is just one parameter for adjusting the size of the 2D detection frame. Training objective function:
where N is the number of targets and,for a calibrated truth heat map, α and β are hyper-parameters of the loss function. For each center point detected, the network predicts a local offset to compensate for the discretization error in the backbone network due to downsampling. The network regresses the 2D size, three dimensional size, target depth and rotation angle of the target. These values are regressed by the main heads shown in fig. 1, each consisting of a 3 x 3 convolution layer and a1 x 1 convolution layer, the former as input and the latter as output. The detection network provides accurate detection for the target center point and the 2D detection frame for subsequent detected characters, and meanwhile, preliminary detection for the target 3D information is realized.
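For illustration only, the following Python sketch shows how a CenterNet-style ground-truth heatmap and the penalty-reduced focal loss described above could be computed; the function names, the per-object σ values and the demo numbers are assumptions made for clarity rather than code from the patent.

```python
import numpy as np

def gaussian_heatmap(width, height, centers, sigmas, num_classes):
    """Render ground-truth center points as Gaussian peaks on a per-class heatmap.

    centers: list of (cx, cy, class_id) in heatmap (downsampled) coordinates.
    sigmas:  per-object standard deviations, adapted to the 2D box size.
    """
    heatmap = np.zeros((num_classes, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for (cx, cy, cls), sigma in zip(centers, sigmas):
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heatmap[cls] = np.maximum(heatmap[cls], g)  # keep the strongest peak per pixel
    return heatmap

def center_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss over the heatmap, normalized by
    the number of positive (key-point) locations."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = gt >= 1.0
    pos_loss = ((1.0 - pred) ** alpha * np.log(pred))[pos].sum()
    neg_loss = ((1.0 - gt) ** beta * pred ** alpha * np.log(1.0 - pred))[~pos].sum()
    return -(pos_loss + neg_loss) / max(pos.sum(), 1)

if __name__ == "__main__":
    gt = gaussian_heatmap(128, 128, centers=[(40, 60, 0)], sigmas=[4.0], num_classes=3)
    print(center_focal_loss(np.clip(gt * 0.9, 0.0, 1.0), gt))
```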
Step two: the estimated depth is used to delimit a three-dimensional region of interest (ROI) around the target, and the detection task is then divided into near-view detection and far-view detection.
The three-dimensional object detection method of the present invention exploits the different advantages of different data to create complementary features.
For the near-view detection task, the RGB-D data within the view cone (frustum) is used to construct features about the target, and the semantic information from 2D detection is used as prior information. The method specifically comprises the following steps:
step S1: first, to enhance rotational invariance of the target, the learning pressure of the network is relieved, the coordinate axis is rotated along the y-axis, so that the rotated z-axis passes through the projection of the peak in the center point heatmap on the y-axisThen by constructing a rotation matrix R yΔy )∈R 3×3 The global coordinates are converted into local coordinates.
Step S2: In order to filter out point cloud data unrelated to the target, the invention segments the point cloud data within the view cone. First, the point cloud data converted into local coordinates is input into several shared MLP perceptrons for dimension lifting, each point being raised to a 1024-dimensional feature vector, and the global feature of the point cloud is obtained through a max-pooling layer while preserving the disorder (permutation invariance) of the point cloud. The global feature is then concatenated with each point, and a K-dimensional one-hot vector is appended so that the segmentation network can fully exploit the prior information brought by the 2D detection; n × 1 vectors are then generated through the same shared MLP to segment the point cloud data, and finally the point closest, along the local y-axis, to the centroid of the segmented point cloud is used to preliminarily regress the target center.
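The shared-MLP / max-pooling segmentation stage of step S2 might be sketched in PyTorch as below. Only the 1024-dimensional point feature, the max-pooled global feature, the K-dimensional one-hot prior and the n × 1 output are taken from the text; the intermediate layer widths and the module name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrustumSegNet(nn.Module):
    """Minimal PointNet-style instance segmentation over frustum points."""
    def __init__(self, num_classes_k=3):
        super().__init__()
        self.point_mlp = nn.Sequential(            # shared MLP: 3 -> 64 -> 128 -> 1024
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU())
        self.seg_mlp = nn.Sequential(              # shared MLP over concatenated features
            nn.Conv1d(1024 + 1024 + num_classes_k, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 1, 1))                   # n x 1 foreground logit per point

    def forward(self, points, one_hot):
        # points: (B, 3, N) local-frustum coordinates; one_hot: (B, K) 2D class prior
        feat = self.point_mlp(points)                              # (B, 1024, N)
        global_feat = torch.max(feat, dim=2, keepdim=True).values  # order-invariant pooling
        n = points.shape[2]
        prior = one_hot.unsqueeze(2).expand(-1, -1, n)             # (B, K, N)
        fused = torch.cat([feat, global_feat.expand(-1, -1, n), prior], dim=1)
        return self.seg_mlp(fused)                                 # (B, 1, N) mask logits
```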
The preliminary target center obtained from the point cloud segmentation still differs considerably from the true center of the target. To regress the target center accurately, the invention uses a dedicated spatial transformation network (T-Net) and exploits the relationship between the two-dimensional target center regressed by CenterNet and the depth value to perform dimension reduction for the T-Net, as shown in the formula:
where d is the difference between the preliminary target center depth and the true target center depth obtained by the regression of the spatial transformation network (T-Net). The invention builds a residual-based loss function as shown in the formula:
L_box = C_box - C_mask - ΔC_T-Net (4)
where C_box represents the prediction-frame information, C_mask represents the mask prediction information, and ΔC_T-Net represents the T-Net prediction information;
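Read literally, the residual of Eq. (4) could be computed as below; reducing the three-dimensional residual to a scalar with an L1 norm is an assumption, since the reduction is not spelled out in the text.

```python
import numpy as np

def box_residual_loss(c_box, c_mask, delta_c_tnet):
    """L_box = C_box - C_mask - dC_T-Net (Eq. 4), reduced with an L1 norm."""
    residual = np.asarray(c_box) - np.asarray(c_mask) - np.asarray(delta_c_tnet)
    return float(np.abs(residual).sum())
```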
after the center point of the object is obtained, the invention projects the point cloud separated in the view cone to the X-Z axis to form a point cloud image of the BEV (bird's eye view) bird's eye view.
In the BEV image, the invention rasterizes the projected point cloud data, which enhances the network's adaptability to the differences caused by the different point cloud densities of different sensors. When extracting features from the BEV image, the invention slices the BEV image uniformly at different heights so as to preserve the height information of the point cloud data as far as possible.
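A sketch of the BEV projection and height-sliced rasterization could look as follows; the detection ranges, grid resolution and number of height slices are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def rasterize_bev(points, x_range=(0.0, 70.0), z_range=(-40.0, 40.0),
                  y_range=(-2.5, 1.0), resolution=0.1, num_slices=5):
    """Project segmented frustum points onto the X-Z plane and rasterize them
    into a BEV occupancy grid with uniform height slices along y.

    points: (N, 3) array of (x, y, z) in local coordinates.
    """
    w = int((x_range[1] - x_range[0]) / resolution)
    h = int((z_range[1] - z_range[0]) / resolution)
    bev = np.zeros((num_slices, h, w), dtype=np.float32)
    xs = ((points[:, 0] - x_range[0]) / resolution).astype(int)
    zs = ((points[:, 2] - z_range[0]) / resolution).astype(int)
    slice_h = (y_range[1] - y_range[0]) / num_slices
    ys = ((points[:, 1] - y_range[0]) / slice_h).astype(int)
    valid = (xs >= 0) & (xs < w) & (zs >= 0) & (zs < h) & (ys >= 0) & (ys < num_slices)
    bev[ys[valid], zs[valid], xs[valid]] = 1.0  # occupancy per height slice
    return bev
```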
Because BEV features describe objects poorly at the global level, the invention fuses Point-Wise (point-level) features with BEV features. To combine information from different features, previous work has typically used early fusion or late fusion; the invention instead fuses the multi-view features layer by layer.
From assumptions (a) and (b) it can be derived that the point cloud segmentation network is more robust to noise in the point cloud and that most of the information about the target key points is carried by the Point-Wise (point-level) features; since the Point-Wise features describe local information only weakly, they behave more like a weighting. The invention therefore performs feature fusion with an Element-Wise multiplication method, the fusion being given by the formula:
wherein f is a feature, and H represents a perceptron function;
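As an illustration of this Element-Wise fusion, the sketch below applies a perceptron H (modelled here as a 1 × 1 convolution, an assumption) to each branch and multiplies the outputs element-wise; it also assumes the Point-Wise features have already been scattered onto the same spatial grid as the BEV features.

```python
import torch
import torch.nn as nn

class ElementWiseFusion(nn.Module):
    """Fuse Point-Wise and BEV features by element-wise multiplication,
    treating the point-level branch as a weighting of the BEV branch."""
    def __init__(self, point_ch, bev_ch, out_ch):
        super().__init__()
        self.h_point = nn.Sequential(nn.Conv2d(point_ch, out_ch, 1), nn.ReLU())
        self.h_bev = nn.Sequential(nn.Conv2d(bev_ch, out_ch, 1), nn.ReLU())

    def forward(self, f_point, f_bev):
        # f_point, f_bev: (B, C, H, W) feature maps aligned on the same grid
        return self.h_point(f_point) * self.h_bev(f_bev)  # element-wise product
```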
While fusing the two perception branches, the invention also adds independent auxiliary-loss training for the two perceptrons, with the weights shared between the fusion training and the auxiliary-loss training. Finally, because the regression tasks are to some extent inseparable during training, the invention jointly optimizes the multiple tasks, using the joint loss function shown below:
where i, j, k represent variables and P represents the corresponding point.
When the far-view detection task is carried out, the point cloud generated by the lidar becomes increasingly sparse as the distance grows and the RGB data provided by the camera also becomes quite blurred, whereas the millimeter wave radar can maintain high-accuracy detection at long range; its point cloud data, however, also suffers from being too sparse. The invention therefore combines vision with the millimeter wave radar and, on the basis of visual detection, makes full use of the non-visual features of the millimeter wave radar to create complementary features for the image. For each radar detection associated with a target inside the view cone, a single-channel heat map is generated. The size of the heat map is proportional to the target's 2D detection frame and is controlled by a parameter α, and its value is the normalized target depth D:
F_{x,y}^j = (1/M) · D_j,  for |x - c_x^j| ≤ α·w_j and |y - c_y^j| ≤ α·h_j
where M is the normalization factor of the target depth, (c_x^j, c_y^j) are the center-point coordinates of object j, and w_j and h_j are the width and height of object j's 2D detection frame. The generated feature map is concatenated in parallel with the image features of the target; the correspondence between the feature map and the target center point is used to determine the center point of the target, and a view cone is constructed to delimit the ROI. The feature map is also input into an auxiliary detection head to help the main detection task regress the target depth and rotation information. Since the millimeter wave radar only returns the relative radial velocity between the target and the subject (ego vehicle), the absolute velocity of the subject must be regressed before the absolute velocity of the target can be regressed. By delimiting the ROI, the radar point cloud is divided into target key points and background points, and the relationship between the background points and the subject velocity is as follows:
V_d^n = V_n · cos(θ_n) + u
where V_d^n is the radial velocity carried by data point n, θ_n is the deflection angle carried by data point n, V_n is the subject (ego-vehicle) velocity obtained by regression, and u is the least-squares error between the regression value and the true value. After the subject velocity is obtained, the invention regresses the absolute velocity of the target.
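A closed-form least-squares estimate of the subject velocity from the background radar points could be sketched as follows, assuming each background point satisfies V_d^n ≈ V · cos(θ_n); the sign convention and the function name are assumptions.

```python
import numpy as np

def regress_subject_velocity(radial_velocities, deflection_angles):
    """Estimate the subject (ego-vehicle) speed from background radar points
    by least squares, returning the fitted speed and the residual error u."""
    v_d = np.asarray(radial_velocities, dtype=np.float64)
    cos_theta = np.cos(np.asarray(deflection_angles, dtype=np.float64))
    # Minimize sum_n (V * cos(theta_n) - V_d^n)^2 in closed form
    v = (cos_theta @ v_d) / (cos_theta @ cos_theta)
    residual = cos_theta * v - v_d
    u = float(residual @ residual)  # least-squares error between fit and data
    return v, u
```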
For a larger vehicle target, multiple radar key points may be contained within its detection frame. The invention uses the angular differences of the radial velocities between different key points to regress the absolute velocity of the target:
where V_d^p is the radial-velocity information carried by target key point p, θ_pd is the angle information carried by target key point p, θ_pt is the velocity direction of the vehicle target obtained by regression, and V_T^p is the magnitude of the vehicle target's velocity obtained by regression.
For person targets, whose scale is smaller, it is difficult to find multiple radar key points within the view cone, so the person target is back-tracked with a Part-based Person-ReID (pedestrian re-identification) algorithm, and the ROI is delimited using the radial velocity provided by the millimeter wave radar, as shown in the formula:
where x_t, y_t are the target position in the current frame, x_{t-1}, y_{t-1} are the position of the tracked target in the previous frame, C_n is the prediction information for target n, and V_t is the radial velocity provided by the millimeter wave radar. The invention regresses the velocity of the target from the change of the target position between frames, uses the direction of the regressed velocity as the initial value of the target deflection angle, and during training focuses on the difference between the target velocity and the target deflection angle. The magnitude and direction of the target velocity are used as global features and fed into an auxiliary regression network. The invention constructs a loss function from the residuals, as shown in the formula:
where n is the number of targets, θ_T is the accurate deflection angle, θ_i is the predicted deflection angle, and Δθ is the difference between the accurate and the predicted deflection angle.
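For illustration, the inter-frame velocity regression for person targets and a simple residual loss over the deflection angles might be sketched as below; the frame interval dt and the mean-absolute reduction are assumptions.

```python
import numpy as np

def pedestrian_velocity_from_tracking(x_t, y_t, x_prev, y_prev, dt):
    """Regress pedestrian speed and heading from the tracked position change
    between consecutive frames; the heading serves as the initial value of the
    target deflection angle."""
    vx, vy = (x_t - x_prev) / dt, (y_t - y_prev) / dt
    speed = float(np.hypot(vx, vy))
    heading = float(np.arctan2(vy, vx))  # initial deflection-angle estimate
    return speed, heading

def deflection_residual_loss(theta_true, theta_pred):
    """Residual loss over deflection angles, averaged over the n targets."""
    delta = np.asarray(theta_true) - np.asarray(theta_pred)
    return float(np.abs(delta).mean())
```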
Finally, the invention uses the obtained target deflection angle as prior information to correct the size of the target, and constructs the prior size from semantic information; the constructed loss function is shown in the formula:
where D* represents the true size, D̂ represents the predicted size, and δ is a residual value.
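One possible form of this size-correction term, with the residual normalized by a class-wise prior size constructed from semantic information, is sketched below; the normalization and the squared reduction are assumptions.

```python
import numpy as np

def size_prior_loss(d_true, d_pred, prior_size):
    """Penalize the deviation of the predicted 3D size from the true size,
    with the residual normalized by the semantic (class-wise) prior size."""
    d_true, d_pred, prior = map(np.asarray, (d_true, d_pred, prior_size))
    delta = (d_pred - d_true) / prior   # residual relative to the prior size
    return float((delta ** 2).mean())
```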
The invention also discloses a three-dimensional target detection system for close-range and long-range multi-dimensional fusion, which comprises the following components: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the three-dimensional object detection method of the invention when called by the processor.
The invention also discloses a computer readable storage medium storing a computer program configured to implement the steps of the three-dimensional object detection method of the invention when called by a processor.
The beneficial effects of the invention are as follows: the three-dimensional target detection method disclosed by the invention combines the advantages of three sensors, namely the lidar, the millimeter wave radar and the camera, to realize 3D target detection for the autonomous driving field; it can accurately identify and locate targets such as vehicles, pedestrians and cyclists, and can be applied to real scenes.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (10)

1. A three-dimensional target detection method for close-range and long-range multi-dimensional fusion is characterized by comprising the following steps:
step one: detecting the center points of an image with CenterNet, regressing their basic attributes, and extracting feature values through a fully convolutional encoding-decoding backbone network;
step two: delimiting a three-dimensional region of interest around the target by using the estimated depth, and dividing the detection task into near-view detection and far-view detection.
2. The three-dimensional target detection method according to claim 1, wherein step one is specifically: generating a heat map about the center points by using CenterNet with I as the input image; for each detected center point, CenterNet predicts a local offset to compensate for the discretization error caused by downsampling in the fully convolutional encoding-decoding backbone network, and CenterNet regresses the 2D size, three-dimensional size, target depth and rotation angle of the target.
3. The three-dimensional object detection method according to claim 2, wherein, in the heat map Ŷ, Ŷ_{x,y,c} = 1 means that point (x, y) is a key point of class c and Ŷ_{x,y,c} = 0 means that point (x, y) is a background point of class c; the ground-truth point p is downsampled to p~, and the ground truth p~ is projected onto the heat map by constructing a Gaussian kernel function, the size of the target 2D detection frame being adapted through σ_p; the formula of the Gaussian kernel function is as follows:
Y_{x,y,c} = exp( -((x - p~_x)^2 + (y - p~_y)^2) / (2·σ_p^2) )
the training objective function formula is as follows:
L_k = (-1/N) · Σ_{x,y,c} { (1 - Ŷ_{x,y,c})^α · log(Ŷ_{x,y,c}), if Y_{x,y,c} = 1; (1 - Y_{x,y,c})^β · (Ŷ_{x,y,c})^α · log(1 - Ŷ_{x,y,c}), otherwise }
the 2D size, three-dimensional size, target depth and rotation angle are regressed by main heads, each consisting of a 3 × 3 convolution layer and a 1 × 1 convolution layer, with the 3 × 3 convolution layer as the input and the 1 × 1 convolution layer as the output.
4. The three-dimensional object detection method according to claim 1, wherein, in step two, for the near-view detection task, the RGB-D data in the view cone is used to construct features related to the target and the semantic information brought by the 2D detection is used as prior information; for the far-view detection task, a method combining vision with the millimeter wave radar is used, and on the basis of visual detection the non-visual features of the millimeter wave radar are fully utilized to create complementary features for the image.
5. The three-dimensional object detection method according to claim 4, wherein the near-view detection task specifically comprises:
step S1: first rotating the coordinate axes about the y-axis so that the rotated z-axis passes through the projection of the peak in the center-point heatmap onto the y-axis, and then constructing a rotation matrix R_y(Δθ_y) to convert the global coordinates into local coordinates;
step S2: segmenting the point cloud data within the view cone; first, the point cloud data converted into local coordinates is input into several shared perceptrons for dimension lifting, each point in the point cloud data being lifted to a 1024-dimensional feature vector, and the global feature of the point cloud data is obtained through a max-pooling layer while preserving the disorder of the point cloud; then the global feature is concatenated with each point and a K-dimensional one-hot vector is added to ensure that the segmentation network fully utilizes the prior information brought by the 2D detection, n × 1 vectors are generated through the same shared perceptrons to segment the point cloud data, and finally the point closest, along the local y-axis, to the centroid of the segmented point cloud is used to regress the target center.
6. The method according to claim 5, wherein, in step S2, in order to accurately regress the target center, dimension-reduction processing is performed for the spatial transformation network by using the relationship between the two-dimensional target center regressed by CenterNet and the depth value, according to the specific formula:
wherein d is the difference between the preliminary target depth and the actual target depth value due to regression of the spatial transformation network;
constructing a residual-based loss function during training:
L_box = C_box - C_mask - ΔC_T-Net (4)
where C_box represents the prediction-frame information, C_mask represents the mask prediction information, and ΔC_T-Net represents the T-Net prediction information;
after the center point of the target is obtained, the point cloud segmented within the view cone is projected onto the X-Z plane to form a BEV (bird's-eye view) point cloud image, and the projected point cloud data is rasterized in the BEV image;
in order to combine information from different features, the Point-Wise feature and the BEV feature are fused by adopting an Element-Wise multiplication fusion method, and the fusion mode has the following formula:
wherein f is a feature, and H represents a perceptron function;
to prevent perceptronAdding independent auxiliary loss training to the two perceptrons while fusing the two perception lines, and sharing weight between the fusion training and the auxiliary loss training;
during training, the multi-task is subjected to joint optimization, and the used joint loss function formula is as follows:
where i, j, k represent variables and P represents the corresponding point.
7. The three-dimensional object detection method according to claim 4, wherein the far-view detection task specifically comprises:
and (3) connecting the generated feature map with the image features of the target in parallel, determining the center point of the target by utilizing the corresponding relation between the feature map and the target center point, constructing a view cone to divide the ROI, and inputting the feature map into an auxiliary detection head to help the main detection task to carry out regression of the target depth and rotation information.
8. The three-dimensional object detection method according to claim 7, wherein the far-view detection task further comprises:
step A1, regressing the absolute velocity of the subject (ego vehicle): the radar point cloud is divided into target key points and background points by delimiting the ROI, and the relationship between the background points and the subject velocity is as follows:
where V_d^n is the radial velocity carried by data point n, θ_n is the deflection angle carried by data point n, V_n is the subject (ego-vehicle) velocity obtained by regression, and u is the least-squares error between the regression value and the true value;
step A2, regressing the absolute velocity of the target: for a larger vehicle target, the absolute velocity of the target is regressed by using the angular differences of the radial velocities between different key points, according to the formula:
where V_d^p is the radial-velocity information carried by target key point p, θ_pd is the angle information carried by target key point p, θ_pt is the velocity direction of the vehicle target obtained by regression, and V_T^p is the magnitude of the vehicle target's velocity obtained by regression; a person target with a smaller scale is back-tracked with a Part-based Person-ReID algorithm, and the ROI is delimited by the radial velocity provided by the millimeter wave radar, according to the formula:
where x_t, y_t are the target position in the current frame, x_{t-1}, y_{t-1} are the position of the tracked target in the previous frame, C_n is the prediction information for target n, and V_t is the radial velocity provided by the millimeter wave radar;
in the far-view detection task, a loss function is constructed from the residuals, according to the formula:
where n is the number of targets, θ_T is the accurate deflection angle, θ_i is the predicted deflection angle, and Δθ is the difference between the accurate and the predicted deflection angle;
the size of the target is corrected by using the obtained target deflection angle as prior information, and the prior size is constructed from semantic information, the constructed loss function being given by the formula:
where D* represents the true size, D̂ represents the predicted size, and δ is a residual value.
9. A near-view distant-view multi-dimensional fused three-dimensional target detection system, comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the three-dimensional object detection method of any one of claims 1-8 when invoked by the processor.
10. A computer-readable storage medium, characterized by: the computer readable storage medium stores a computer program configured to implement the steps of the three-dimensional object detection method of any one of claims 1-8 when invoked by a processor.
CN202310711683.0A 2023-06-15 2023-06-15 Three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion Pending CN116740519A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310711683.0A CN116740519A (en) 2023-06-15 2023-06-15 Three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310711683.0A CN116740519A (en) 2023-06-15 2023-06-15 Three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion

Publications (1)

Publication Number Publication Date
CN116740519A true CN116740519A (en) 2023-09-12

Family

ID=87918122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310711683.0A Pending CN116740519A (en) 2023-06-15 2023-06-15 Three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion

Country Status (1)

Country Link
CN (1) CN116740519A (en)

Similar Documents

Publication Publication Date Title
Nabati et al. Centerfusion: Center-based radar and camera fusion for 3d object detection
Bovcon et al. Stereo obstacle detection for unmanned surface vehicles by IMU-assisted semantic segmentation
CN111126269B (en) Three-dimensional target detection method, device and storage medium
EP2960858B1 (en) Sensor system for determining distance information based on stereoscopic images
Erbs et al. Moving vehicle detection by optimal segmentation of the dynamic stixel world
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
EP3293700A1 (en) 3d reconstruction for vehicle
CN112446227A (en) Object detection method, device and equipment
KR101076406B1 (en) Apparatus and Method for Extracting Location and velocity of Obstacle
Yoo et al. Real-time rear obstacle detection using reliable disparity for driver assistance
Murmu et al. Relative velocity measurement using low cost single camera-based stereo vision system
Fu et al. Camera-based semantic enhanced vehicle segmentation for planar lidar
Hayakawa et al. Ego-motion and surrounding vehicle state estimation using a monocular camera
CN114118247A (en) Anchor-frame-free 3D target detection method based on multi-sensor fusion
Narioka et al. Understanding 3d semantic structure around the vehicle with monocular cameras
CN114648639B (en) Target vehicle detection method, system and device
KR102003387B1 (en) Method for detecting and locating traffic participants using bird's-eye view image, computer-readerble recording medium storing traffic participants detecting and locating program
US20240151855A1 (en) Lidar-based object tracking
CN116246119A (en) 3D target detection method, electronic device and storage medium
Niesen et al. Camera-radar fusion for 3-D depth reconstruction
Xiong et al. A 3d estimation of structural road surface based on lane-line information
CN116740519A (en) Three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion
Madake et al. Visualization of 3D Point Clouds for Vehicle Detection Based on LiDAR and Camera Fusion
Akın et al. Challenges in Determining the Depth in 2-D Images
CN114384486A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination