CN116630937A - Multimode fusion 3D target detection method - Google Patents

Multimode fusion 3D target detection method

Info

Publication number
CN116630937A
CN116630937A (application CN202310605073.2A)
Authority
CN
China
Prior art keywords
features
millimeter wave
laser radar
point cloud
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310605073.2A
Other languages
Chinese (zh)
Inventor
冯欣
曾俊贤
单玉梅
何桢苇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN202310605073.2A priority Critical patent/CN116630937A/en
Publication of CN116630937A publication Critical patent/CN116630937A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Optical Radar Systems And Details Thereof (AREA)

Abstract

The invention relates to a multi-modal fusion 3D target detection method. The method first constructs an integral network comprising a laser radar feature extraction branch, a 4D millimeter wave feature extraction branch, a feature fusion module, a 2D backbone network and an RPN detection head; it then acquires real autonomous driving data to train the integral network; finally, the currently acquired laser radar point cloud data and 4D millimeter wave point cloud data are input into the trained network and the detection result is output. Experiments demonstrate the effectiveness of the method; in comparison experiments with other methods on the Astyx HiRes2019 dataset, the invention achieves the best performance.

Description

Multimode fusion 3D target detection method
Technical Field
The invention relates to the technical field of automatic driving, in particular to a multi-mode fusion 3D target detection method.
Background
Autonomous driving technology is one of the products of the rapid development of modern technology and reflects the automobile industry's continuous pursuit of innovation. As an important participant in modern transportation, the automobile industry has been actively exploring and applying various high-tech technologies. With the continuous progress of artificial intelligence, computer vision and related fields, autonomous driving has become a trend in the industry's future development and is attracting more and more manufacturers to compete. In addition, intelligence, new energy and connectivity are key targets for the development of the automobile industry, and the application of these new technologies will greatly change how we travel and live. With the emergence of these technologies, the automobile industry will continue to accelerate its transformation, bringing more convenience and safety to human travel.
An autonomous vehicle collects data through multiple advanced sensors mounted on the vehicle platform and combines computer vision with other artificial intelligence technologies, so that the vehicle can drive autonomously and assist the driver in completing driving tasks safely and efficiently.
In an autonomous driving perception system, 3D object detection is one of the most important tasks. It rapidly and accurately identifies and localizes targets in three-dimensional space, providing key support for intelligent driving. Conventional 3D object detection methods generally consider only point cloud data or only RGB image data and cannot fully exploit the information of multiple sensors, so the detection results are not accurate and robust enough. Multi-modal fusion, in contrast, can improve detection accuracy from multiple aspects by using the information provided by different sensors, such as color, texture, depth and illumination, and can improve the robustness of the detection algorithm by integrating data from different sensors. This means that even if one sensor fails, target detection can still maintain high accuracy; multi-modal fusion also provides more comprehensive information, making the detection algorithm suitable for a wider range of scenarios. For example, under low-light conditions an infrared sensor can be used to detect targets, thereby improving the detection effect. At the same time, multi-modal fusion 3D target detection requires combining different sensors, algorithms and data processing technologies, which promotes technical progress and innovation in many fields. Therefore, fusing multi-modal information (e.g., point clouds, images and radar) to improve the performance of 3D object detection is a challenging and significant line of research.
Currently, there are two main types of input data for 3D object detection methods: image data and laser radar point cloud data. Compared with image data, the laser radar point cloud carries ground-truth depth information, so the detection accuracy is higher, and laser-radar-based detection has become the mainstream single-modality 3D target detection approach.
However, a single sensor still has some unavoidable drawbacks. For example, laser radar data is sparse and susceptible to severe weather (e.g., rain, snow and fog), while RGB image data lacks depth information and is susceptible to illumination. To improve the accuracy and robustness of the algorithms, more and more work has begun to focus on multi-modal fusion 3D target detection. Multi-modal fusion makes better use of the advantages of each sensor; existing methods mainly combine high-resolution RGB images with high-line-count laser radar, which improves detection accuracy to a certain extent. However, such methods are expensive and still cannot handle well the problems of distant small targets, occluded targets and false detections in severe weather.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention aims to solve the following technical problem: how to reduce missed and false detections of distant small targets, occluded targets and targets in severe weather.
In order to solve the technical problems, the invention adopts the following technical scheme: a multi-mode fusion 3D target detection method comprises the following steps:
s1: constructing an integral network, wherein the integral network comprises a laser radar feature extraction branch, a 4D millimeter wave feature extraction branch, a feature fusion module, a 2D backbone network and an RPN detection head;
the laser radar feature extraction branch and the 4D millimeter wave feature extraction branch adopt the same feature extraction module;
s2: acquiring real data of automatic driving to train the whole network:
the laser radar feature extraction branch performs feature extraction on the point cloud data of the laser radar to obtain radar features, and the 4D millimeter wave feature extraction branch performs feature extraction on the 4D millimeter wave point cloud data to obtain millimeter wave features; the radar features and the millimeter wave features are input into the feature fusion module to obtain a pseudo image, the pseudo image is input into the 2D backbone network to extract pseudo image features, and finally the pseudo image features are input into the RPN detection head to regress the final detection result;
training is carried out with a deep learning framework, and when the value of the loss function no longer changes, the trained whole network is obtained.
S3: the currently acquired laser radar point cloud data and 4D millimeter wave point cloud data are input into the trained whole network, and the detection result is output.
As an improvement, the process of extracting the characteristics of the point cloud data of the laser radar by the laser radar characteristic extraction branch in S2 is as follows:
the lidar point cloud inputs N x 4-dimensional vector data by first voxelizing the original point cloud into the form of column pilar, and then expanding the points in each column from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x) c ,y c ,z c ,x p ,y p ,z p ) Wherein the subscript c denotes the deviation of the arithmetic mean of all points in the column, the subscript p denotes the deviation of the center of the column, and r denotes the reflectivity.
The point cloud data of each frame of laser radar only selects P non-empty columns, N points in each column are selected as samples, and if columns smaller than the N points are filled with zero, a column characteristic with the size of (D, P, N) is obtained as a radar characteristic, wherein D is the dimension of one point.
As an improvement, the process of obtaining the pseudo image by the feature fusion module in S2 is as follows:
the feature fusion module adopts the multi-modal fusion module LR-Fusion based on a self-attention mechanism, and its input features are the (D, P, N) pillar features extracted above. First, the dimension of the input features is reduced to lower the computational cost:
where O_L and O_R denote the input features, namely the radar features and the millimeter wave features, and F_L and F_R denote the corresponding output features;
the simplification is shown in formula (2),
where U_L and U_R denote the simplified radar and millimeter wave features, and LN(·), MP(·) and Conv(·) denote the fully connected layer, the max pooling layer and the convolution layer, respectively.
Formula (3) then computes two sets of attention weights to fuse the features of the two modalities, using Softmax, where W_{R←L} is the weight transferred from the laser radar features to the millimeter wave features and W_{L←R} is the weight in the opposite direction,
where Softmax(·) is the activation function.
Formula (4) right-multiplies the attention weight W_{R←L} by the feature U_R, subtracts the feature U_L, and then passes the result through a linear layer, a normalization layer and an activation function to generate the weighting matrix F_{Lm} of the laser radar features; the millimeter wave weighting matrix F_{Rm} is obtained in the same way,
where BN(·) is batch normalization and ReLU(·) is the activation function. The weighting matrices are added to the original features, and the features of the two modalities are updated through the residual connection, as shown in formula (5).
F_{outputL} and F_{outputR} denote the attention-aggregated pillar features of the laser radar and of the 4D millimeter wave, respectively.
The generated laser radar pillar features and 4D millimeter wave pillar features are restored to the original space according to their coordinate indices to obtain a pseudo image.
As an improvement, the loss function in S2 is:
the loss function of the network training is divided into three major parts, namely a classification loss function, a positioning loss function and a direction loss function. The coordinate difference between the truth box and the anchor box is calculated by the positioning loss function; the class loss function calculates a class prediction difference between the anchor frame and the truth frame; the directional loss calculates the difference between the rotation angle between the anchor frame and the truth frame. The truth box and the anchor box are represented by 7-dimensional vectors (x, y, z, w, l, h, θ), wherein (x, y, z) is the center coordinates of the truth box or the anchor box, and (w, l, h) is the size of the truth box or the anchor box, and θ is the direction angle of the truth box or the anchor box).
Wherein x is gt ,y gt And z gt Respectively represent the center coordinates, w, of the truth box gt ,l gt And h gt True value representing boxLength, width, height, theta gt Indicating the direction angle of the truth box;
x a ,y a and z a Respectively represent the central coordinates, w, of the anchor frame a ,l a And h a Represents the length, width and height of the anchor frame, theta a Indicating the direction angle of the truth box;
Δx, Δy, Δz and Δθ represent differences in three coordinates and direction angles of the truth box and the center of the anchor box, respectively.
d a The intermediate variable has no actual meaning and is calculated by formula (7).
Loss of positioning L loc Training was performed using SmoothL1, and the expression is shown in (8):
learning the direction of an object using a Softmax loss function, the direction loss being noted as L dir Then, the classification loss L of the target cls The use of focal loss can be represented by equation (9).
L cls =-α a (1-p a ) γ log p a (9)
Wherein p is a Is the class probability of the anchor frame, alpha a And gamma is a superparameter;
L dir the direction classification of learning objects is lost for Softmax;
the total loss function can be represented by equation (10):
wherein N is pos Number beta of positive anchor points loc ,β cls And beta dir Three lost weights, respectively.
Compared with the prior art, the invention has at least the following advantages:
the invention provides a 3D target detection method based on multi-mode fusion of a laser radar and 4D millimeter waves, and a detection network with better performance under a complex scene is obtained through multi-mode fusion of the laser radar and the 4D millimeter waves. According to the method, firstly, an extraction module similar to the PointPicloras point cloud characteristic is provided, the calculated amount of a network is effectively reduced, then, a multi-mode fusion module of the laser Lei Dage D millimeter wave characteristic is designed, the characteristics of two modes are aggregated in a self-attention mode, finally, the fusion characteristic with stronger expression capability is converted into a pseudo image form, the characteristic is further extracted by a 2D backbone network, and a final detection result is returned by an RPN detection head. The invention shows that the method is superior to other methods in an Astyx HiRes2019 data set through a large number of comparison experiments, and has the best performance.
Drawings
Fig. 1 is a block diagram of the overall network of the present invention.
Fig. 2 is a point cloud feature extraction structure diagram.
Fig. 3 is a feature fusion module.
Fig. 4 is a training process analysis.
Fig. 5 shows samples from the dataset: figures (a) and (b) are image data, figures (c) and (d) are laser radar data, and figures (e) and (f) are 4D millimeter wave data; figures (c) and (d) visualize the prediction boxes of the PointPillars model.
Detailed Description
The present invention will be described in further detail below.
The invention provides a 3D target detection method based on the multi-modal fusion of a laser radar and 4D millimeter waves. Exploiting the strong penetrability of 4D millimeter waves and their adaptability to severe weather, and inspired by the self-attention mechanism, an interactive multi-modal fusion module is designed to fuse the data of a 16-line laser radar and 4D millimeter waves. The module can aggregate the features of both modalities and identify the cross-modal relationships between them. Experiments demonstrate the effectiveness of this method, and comparison experiments with other methods on the Astyx HiRes2019 dataset show that it achieves the best performance.
Through the multi-modal fusion of the laser radar and the 4D millimeter wave, the method improves detection accuracy and robustness and alleviates the problems faced by conventional methods. It can effectively exploit the advantages of different sensors and is suitable for target detection in complex scenes and under severe weather conditions.
A multi-mode fusion 3D target detection method comprises the following steps:
s1: constructing an integral network, wherein the integral network comprises a laser radar feature extraction branch, a 4D millimeter wave feature extraction branch, a feature fusion module, a 2D backbone network and an RPN detection head. The network input comes from two sensor modalities: the laser radar point cloud and the 4D millimeter wave point cloud. In the respective feature extraction branches of the laser radar and the 4D millimeter wave, the point cloud data is first preprocessed and columnar (pillar) features are extracted from the point cloud by voxelization; the features are then sent into the multi-modal feature fusion module, which fuses the 16-line laser radar and 4D millimeter wave features using a self-attention mechanism (the module is described in detail below); the fused features are converted into the form of a pseudo image, a 2D backbone network identical to that of PointPillars further extracts the pseudo image features, and finally the RPN detection head regresses the final detection result.
The laser radar feature extraction branch and the 4D millimeter wave feature extraction branch adopt the same feature extraction module;
s2: acquiring real data of automatic driving to train the whole network:
the laser radar feature extraction branch performs feature extraction on point cloud data of a laser radar to obtain radar features, the 4D millimeter wave feature extraction branch performs feature extraction on the point cloud data of the 4D millimeter waves to obtain millimeter wave features, the radar features and the millimeter wave features are input into a feature fusion module to obtain a pseudo image, the pseudo image is input into a 2D backbone network to extract the pseudo image features, and finally the pseudo image features are input into an RPN detection head to return to a final detection result;
training is carried out by adopting a deep learning framework, and when the value of the loss function is not changed, a trained whole network is obtained.
S3: and inputting the currently acquired laser radar point cloud data and 4D millimeter wave point cloud data into a trained whole network, and outputting a checking result.
Specifically, the process of feature extraction of the point cloud data of the laser radar by the laser radar feature extraction branch in S2 is as follows:
the lidar point cloud inputs N x 4-dimensional vector data by first voxelizing the original point cloud into the form of column pilar, and then expanding the points in each column from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x) c ,y c ,z c ,x p ,y p ,z p ) Wherein the subscript c denotes the deviation of the arithmetic mean of all points in the column, the subscript p denotes the deviation of the center of the column, and r denotes the reflectivity.
Fig. 2 is a graph of laser radar point cloud feature extraction, where the point cloud data of each frame of laser radar only selects P non-empty pillars, N points are selected as samples in each pillar, and if pillars smaller than N points are filled with zeros, a pillar feature of (D, P, N) size is obtained as a radar feature, where D is the dimension of one point.
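As a concrete illustration of this pillar encoding, the sketch below groups a raw point cloud into pillars and decorates each point to 10 dimensions, yielding a (D, P, N) tensor. It is a minimal NumPy sketch under assumed grid ranges; the function name build_pillar_features and the default limits are illustrative and do not come from the patent, whose actual implementation lives in the OpenPCDet framework.

```python
import numpy as np

def build_pillar_features(points,
                          x_range=(0.0, 69.12), y_range=(-39.68, 39.68), z_range=(-3.0, 1.0),
                          pillar_size=(0.16, 0.16), max_pillars=12000, max_points=32):
    """Group an (M, 4) point cloud [x, y, z, r] into pillars and decorate each point to
    10 dims (x, y, z, r, xc, yc, zc, xp, yp, zp). Returns a (10, P, N) tensor plus the
    (P, 2) pillar grid indices used later to scatter features back into the pseudo image."""
    # keep only points inside the detection range
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[keep]
    ix = ((points[:, 0] - x_range[0]) / pillar_size[0]).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) / pillar_size[1]).astype(np.int64)
    key = ix * 100000 + iy                                    # one key per x-y grid cell
    z_center = 0.5 * (z_range[0] + z_range[1])

    feats = np.zeros((10, max_pillars, max_points), dtype=np.float32)
    coords = np.zeros((max_pillars, 2), dtype=np.int64)
    for p, k in enumerate(np.unique(key)[:max_pillars]):      # at most P non-empty pillars
        pts = points[key == k][:max_points]                   # at most N points, rest stays zero-padded
        cx = (k // 100000 + 0.5) * pillar_size[0] + x_range[0]
        cy = (k % 100000 + 0.5) * pillar_size[1] + y_range[0]
        dec = np.hstack([pts,                                 # (x, y, z, r)
                         pts[:, :3] - pts[:, :3].mean(0),     # (xc, yc, zc): offset from pillar mean
                         pts[:, :2] - [cx, cy],               # (xp, yp): offset from pillar center
                         pts[:, 2:3] - z_center])             # zp: vertical offset from pillar center
        feats[:, p, :len(pts)] = dec.T
        coords[p] = [k // 100000, k % 100000]
    return feats, coords
```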
Specifically, the process of obtaining the pseudo image by the feature fusion module in S2 is as follows:
the self-attention mechanism is an attention mechanism in a neural network that weights different locations in the input data so that the network can better understand the relationships of the different locations, and can derive a weight vector by computing the correlation between each input location and the other locations, and then multiplying and summing this weight vector with the input vector to derive a weighted vector representation. In this way, each input location can be given a vector representing its correlation with other locations through the self-attention mechanism.
In the self-attention mechanism, the correlation is calculated by dot product attention. Dot product attention treats each element in the input vector as a query vector that is used to calculate its relevance to other locations. Specifically, for each query vector, the dot product attention will calculate its dot product with the vector for all positions in the input sequence and convert the resulting relevance value into a probability distribution by a softmax function. Then, the vector of each position is multiplied by the probability value corresponding to the vector and summed to obtain the weighted vector representation. This may improve the efficiency and accuracy of the feature extractor.
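A generic dot-product attention step of this kind takes only a few lines of PyTorch. The snippet below is a textbook illustration of the mechanism just described, not the patent's exact module; the scaling by the key dimension is an assumption.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, key, value):
    """Generic scaled dot-product attention: compare every query with every key,
    turn the relevance scores into a probability distribution with softmax, and
    return the probability-weighted sum of the value vectors."""
    scores = query @ key.transpose(-2, -1) / key.shape[-1] ** 0.5  # (..., Nq, Nk) relevance
    weights = F.softmax(scores, dim=-1)                            # one distribution per query
    return weights @ value                                         # weighted vector representation
```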
The feature fusion module adopts the multi-modal fusion module LR-Fusion based on a self-attention mechanism, as shown in FIG. 3. The module aggregates the features of the two modalities and identifies the cross-modal relationships between the 4D millimeter wave and laser radar features, so that the network focuses on important feature information and ignores irrelevant features. The laser radar can accurately detect objects at close range, but its performance on distant objects is unstable; the 4D millimeter wave has better penetrability and can accurately detect distant objects, but its point cloud is irregular in shape. Experiments show that the module aggregates the respective advantages of the two sensors to the greatest extent and compensates for their shortcomings.
As shown in FIG. 3, the input features of the feature fusion module are the (D, P, N) pillar features extracted above. First, the dimension of the input features is reduced to lower the computational cost:
where O_L and O_R denote the input features, namely the radar features and the millimeter wave features, and F_L and F_R denote the corresponding output features;
the simplification is shown in formula (2),
where U_L and U_R denote the simplified radar and millimeter wave features, and LN(·), MP(·) and Conv(·) denote the fully connected layer, the max pooling layer and the convolution layer, respectively.
Formula (3) then computes two sets of attention weights to fuse the features of the two modalities, using Softmax, where W_{R←L} is the weight transferred from the laser radar features to the millimeter wave features and W_{L←R} is the weight in the opposite direction,
where Softmax(·) is the activation function.
Formula (4) right-multiplies the attention weight W_{R←L} by the feature U_R, subtracts the feature U_L, and then passes the result through a linear layer, a normalization layer and an activation function to generate the weighting matrix F_{Lm} of the laser radar features; the millimeter wave weighting matrix F_{Rm} is obtained in the same way,
where BN(·) is batch normalization and ReLU(·) is the activation function. The weighting matrices are added to the original features, and the features of the two modalities are updated through the residual connection, as shown in formula (5).
F_{outputL} and F_{outputR} denote the attention-aggregated pillar features of the laser radar and of the 4D millimeter wave, respectively.
Through the interaction of the two modalities, two sets of features are obtained, enhancing the feature expression of the laser radar and millimeter wave point clouds.
The generated laser radar pillar features and 4D millimeter wave pillar features are restored to the original space according to their coordinate indices to obtain a pseudo image.
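The flow described above — reduce each modality, exchange softmax attention weights, apply a Linear/BN/ReLU weighting, and update both feature sets through a residual — could look roughly like the PyTorch sketch below. Layer sizes, layer order and tensor shapes are assumptions (formulas (1)-(5) are not reproduced in the text), and LRFusionSketch is an illustrative name rather than the patent's code.

```python
import torch
import torch.nn as nn

class LRFusionSketch(nn.Module):
    """Rough sketch of the LR-Fusion idea: exchange attention between the laser
    radar and 4D millimeter wave pillar features and update both by a residual."""
    def __init__(self, d_in=10, d_model=64):
        super().__init__()
        self.reduce_l = nn.Sequential(nn.Linear(d_in, d_model), nn.ReLU())
        self.reduce_r = nn.Sequential(nn.Linear(d_in, d_model), nn.ReLU())
        self.update_l = nn.Sequential(nn.Linear(d_model, d_model),
                                      nn.BatchNorm1d(d_model), nn.ReLU())
        self.update_r = nn.Sequential(nn.Linear(d_model, d_model),
                                      nn.BatchNorm1d(d_model), nn.ReLU())

    def forward(self, o_l, o_r):
        # o_l, o_r: (P, d_in) per-pillar features of the two modalities
        u_l, u_r = self.reduce_l(o_l), self.reduce_r(o_r)        # simplified features U_L, U_R
        w_r_from_l = torch.softmax(u_l @ u_r.t(), dim=-1)        # attention: laser radar -> millimeter wave
        w_l_from_r = torch.softmax(u_r @ u_l.t(), dim=-1)        # attention: millimeter wave -> laser radar
        f_lm = self.update_l(w_r_from_l @ u_r - u_l)             # weighting matrix for laser radar features
        f_rm = self.update_r(w_l_from_r @ u_l - u_r)             # weighting matrix for millimeter wave features
        # residual update; adding to the simplified features here is an assumption
        return u_l + f_lm, u_r + f_rm
```

The two returned feature sets would then be scattered back onto the BEV grid by their pillar coordinate indices to form the pseudo image consumed by the 2D backbone.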
Specifically, the loss function in S2 is:
the loss function of the network training is divided into three major parts, namely a classification loss function, a positioning loss function and a direction loss function. The coordinate difference between the truth box and the anchor box is calculated by the positioning loss function; the class loss function calculates a class prediction difference between the anchor frame and the truth frame; the directional loss calculates the difference between the rotation angle between the anchor frame and the truth frame. The truth box and the anchor box are expressed by 7-dimensional vectors (x, y, z, w, l, h, theta), wherein (x, y, z) is the center coordinates of the truth box or the anchor box, and (w, l, h) is the size of the truth box or the anchor box, and theta is the direction angle of the truth box or the anchor box, and the parameter to be learned in the regression task of the detection box is the offset of the 7 variables.
Wherein, superscript gt and superscript a respectively represent a truth box and an anchor box, x gt ,y gt And z gt Respectively represent the center coordinates, w, of the truth box gt ,l gt And h gt Representing the length, width and height of the truth box, theta gt Indicating the direction angle of the truth box;
x a ,y a and z a Respectively represent the central coordinates, w, of the anchor frame a ,l a And h a Represents the length, width and height of the anchor frame, theta a Indicating the direction angle of the truth box;
Δx, Δy, Δz and Δθ represent differences in three coordinates and direction angles of the truth box and the center of the anchor box, respectively.
d a The intermediate variable has no actual meaning and is calculated by formula (7).
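Formulas (6) and (7) are not reproduced in the text. The quantities described are consistent with the standard SECOND/PointPillars box-residual encoding, which — as an assumption about what the missing formulas contain — reads:

```latex
\Delta x = \frac{x_{gt} - x_{a}}{d_{a}}, \quad
\Delta y = \frac{y_{gt} - y_{a}}{d_{a}}, \quad
\Delta z = \frac{z_{gt} - z_{a}}{h_{a}}, \quad
\Delta\theta = \theta_{gt} - \theta_{a} \qquad (6)

\Delta w = \log\frac{w_{gt}}{w_{a}}, \quad
\Delta l = \log\frac{l_{gt}}{l_{a}}, \quad
\Delta h = \log\frac{h_{gt}}{h_{a}}, \quad
d_{a} = \sqrt{w_{a}^{2} + l_{a}^{2}} \qquad (7)
```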
The localization loss L_loc is trained with SmoothL1, as expressed in formula (8):
To avoid mispredicting the direction, the direction of the object is learned with a Softmax loss function, and the direction loss is denoted L_dir. The classification loss L_cls of the target uses the focal loss and can be expressed as formula (9).
L_cls = -α_a (1 - p_a)^γ log p_a    (9)
where p_a is the class probability of the anchor box, and α_a and γ are hyperparameters; the settings α = 0.25 and γ = 2 are used.
L_dir is the Softmax loss for learning the direction classification of objects.
The total loss function can be expressed as formula (10):
where N_pos is the number of positive anchors, and β_loc, β_cls and β_dir are the weights of the three losses, with β_loc = 2, β_cls = 1 and β_dir = 0.2.
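Putting the three terms together with the stated weights could look like the sketch below. Tensor shapes, reductions and the helper name total_loss are assumptions; only the weights (β_loc = 2, β_cls = 1, β_dir = 0.2), the focal-loss parameters (α = 0.25, γ = 2) and the choice of SmoothL1/Softmax losses come from the text.

```python
import torch
import torch.nn.functional as F

def total_loss(loc_pred, loc_target, cls_prob, cls_target, dir_logits, dir_target,
               n_pos, beta_loc=2.0, beta_cls=1.0, beta_dir=0.2, alpha=0.25, gamma=2.0):
    """Combine localization, classification and direction losses as in formula (10)."""
    l_loc = F.smooth_l1_loss(loc_pred, loc_target, reduction='sum')          # SmoothL1 on the 7 box offsets
    p_a = torch.where(cls_target.bool(), cls_prob, 1.0 - cls_prob)           # probability of the true class
    l_cls = (-alpha * (1.0 - p_a) ** gamma * torch.log(p_a + 1e-6)).sum()    # focal loss, formula (9)
    l_dir = F.cross_entropy(dir_logits, dir_target, reduction='sum')         # Softmax direction loss
    return (beta_loc * l_loc + beta_cls * l_cls + beta_dir * l_dir) / max(n_pos, 1)
```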
Experiment and analysis:
1. verification data set
In order to verify the robustness of the 3D target detection model fused from laser radar and 4D millimeter wave data, the method proposed by the invention is verified on the autonomous driving dataset Astyx HiRes2019. Astyx HiRes2019 is a popular autonomous driving dataset and the first publicly released dataset containing 4D millimeter wave point clouds, lidar and camera data; it consists of 546 frames of 16-line lidar, 4D millimeter wave and image data. To ensure that the training set and the test set are mutually exclusive and that their distributions are consistent with the actual distribution of the samples, the dataset is partitioned at a ratio of 3:1, i.e., 410 frames are used as the training set and 136 frames as the test set. The 4D millimeter wave and laser radar point clouds contain about 1000-10000 and 10000-25000 points per frame, respectively. In addition, each image has a resolution of 2048×618 pixels, and the training data contains a total of 2204 cars, 36 pedestrians and 8 cyclists.
2. Data preprocessing
The 4D millimeter wave radar provides environmental information by determining object positions from distance, elevation angle, azimuth and speed. Within the detection range, because of the limited aperture, multiple objects detected at a specific distance are difficult to resolve in the azimuth dimension, which produces noisy data. The pitch angles of the point clouds in the Astyx HiRes2019 dataset were computed and analyzed, and were found to lie outside the normal range to a certain extent. Therefore, noise is eliminated by rotating the points whose z-coordinates are less than 0 clockwise. The specific processing method is shown in formula (3.6).
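An illustrative version of this correction is sketched below. Because formula (3.6) is not reproduced in the text, the rotation axis and angle used here are assumptions; the sketch only shows the stated idea of rotating the points with z < 0 clockwise to compensate an elevation offset.

```python
import numpy as np

def correct_radar_pitch(points, angle_rad=0.05):
    """Rotate points whose z-coordinate is below 0 'clockwise' in the x-z plane
    (i.e., about the y-axis) to compensate an assumed elevation offset."""
    out = points.copy()
    m = out[:, 2] < 0.0                      # only the points below the sensor plane
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    x, z = out[m, 0], out[m, 2]
    out[m, 0] = c * x + s * z                # clockwise rotation about the y-axis
    out[m, 2] = -s * x + c * z
    return out
```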
3. Training details and parameter settings
The method is implemented in the open-source OpenPCDet framework using PyTorch (v1.7.1); the operating system is Ubuntu 18.04. The hardware configuration is as follows: the CPU is an Intel Core i7-7700 @ 3.6 GHz ×8, the GPU is a single-card NVIDIA RTX 3090 server, and the memory is 32 GB.
For the input point cloud, the ranges of x, y and z are set to (0, 69.12), (-39.68, 39.68) and (-3, 1), respectively, and the pillar size is (0.16, 0.16, 4). The points in each pillar are then expanded from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x_c, y_c, z_c, x_p, y_p, z_p), where the subscript c denotes the offset from the arithmetic mean of all points in the pillar, the subscript p denotes the offset from the pillar center, and r denotes the reflectivity. To reduce sampling and thereby reduce inference time, N points are selected in each pillar, and pillars with fewer than N points are zero-padded. In practice, only P non-empty pillars are selected per frame, with N = 32 points per pillar, yielding a tensor of size (D, P, N), where D is the dimension of a point.
Training was performed on a single RTX 3090 graphics card with a batch size of 2, 160 training epochs and a learning rate of 0.003; non-maximum suppression (NMS) with a 3D IoU threshold of 0.5 is used in the inference phase. For optimization, the Adam optimizer is used with a gradually decayed learning rate. Performance is measured with the KITTI evaluation protocol on the 2D, 3D and BEV detection metrics.
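For reference, these hyperparameters could be collected as in the sketch below. The values come from the text; the dictionary layout, optimizer construction and the cosine-annealing schedule (the text only says the learning rate is gradually decayed) are assumptions, not the actual OpenPCDet configuration.

```python
import torch

# hyperparameters as stated in the text
cfg = dict(batch_size=2, epochs=160, lr=0.003, nms_iou_3d=0.5,
           point_range_x=(0.0, 69.12), point_range_y=(-39.68, 39.68),
           point_range_z=(-3.0, 1.0), pillar_size=(0.16, 0.16, 4.0),
           max_points_per_pillar=32)

def build_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg['lr'])
    # gradual learning-rate decay over the training epochs (schedule choice assumed)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=cfg['epochs'])
    return optimizer, scheduler
```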
Fig. 4 is a result of analyzing the training process of the proposed network, further verifying the correctness of the training details and parameter settings.
4. Experimental results and analysis
The method of the invention is verified on the Astyx HiRes2019 dataset and compared with mainstream 3D target detection methods of recent years. Because multi-modal methods fusing laser radar and 4D millimeter wave are scarce, several mainstream methods are first selected, such as PointRCNN, SECOND, PV-RCNN and PointPillars; a total of twelve experiments were performed on laser radar and 4D millimeter wave inputs, respectively, and the results were then compared with the multi-modal fusion method proposed by the invention, as shown in Table 1.
Table 1 results of comparative experiments
The experimental results show that among the single-modality methods that take the laser radar as input, the two-stage PV-RCNN achieves the highest accuracy of 44.63% in 3D mAP at the moderate difficulty level, 0.73% higher than the second-place SECOND, and reaches 46.36% in BEV mAP; among the single-modality methods that take the 4D millimeter wave as input, the two-stage PV-RCNN also has the highest accuracy. The last row of the table is the method proposed by the invention, whose results are significantly better than all the single-modality methods, with improvements of 6.73% in 3D mAP and 11.80% in BEV mAP at the moderate level compared with the baseline PointPillars using the laser radar as input.
To further verify the effectiveness of each module, ablation experiments were carried out on each module; the results are shown in Table 2 and cover the input data modality, the data preprocessing and the multi-modal fusion module. In the 4D millimeter wave point cloud, the data preprocessing (DP) method is used to reduce data noise. As shown in the first and second and the fourth and fifth rows of Table 2, it improves performance in every method; when only the data preprocessing module is added to the baseline PointPillars, the first and second rows show a 3.66% increase in 3D mAP and a 4.77% increase in BEV mAP at the moderate level. This shows that the data preprocessing corrects the offset error of the divergence angle by rotating the point cloud, verifying the validity of the method.
The effect of the multi-modal fusion module is verified against a single modality while keeping the other settings consistent. As shown in the first, third and fourth rows of Table 2, detection performance improves significantly when the features of the two modalities are fused with the multi-modal fusion module. At the moderate level, the network using the multi-modal fusion module gains 2.48% in 3D mAP and 8.01% in BEV mAP compared with the single-modality approach using the laser radar. These experiments show that the multi-modal fusion module enables the network to learn richer information from multiple modalities and enhances the feature expression. Finally, experiments were performed with the multi-modal fusion module together with the data preprocessing; as the last row of Table 2 shows, the 3D target detection performance is further improved significantly.
Table 2 ablation experiments
Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the technical solution of the present invention, all of which are intended to be covered by the scope of the claims of the present invention.

Claims (4)

1. A multi-mode fusion 3D target detection method is characterized in that: the method comprises the following steps:
s1: constructing an integral network, wherein the integral network comprises a laser radar feature extraction branch, a 4D millimeter wave feature extraction branch, a feature fusion module, a 2D backbone network and an RPN detection head;
the laser radar feature extraction branch and the 4D millimeter wave feature extraction branch adopt the same feature extraction module;
s2: acquiring real data of automatic driving to train the whole network:
the laser radar feature extraction branch performs feature extraction on the point cloud data of the laser radar to obtain radar features, and the 4D millimeter wave feature extraction branch performs feature extraction on the 4D millimeter wave point cloud data to obtain millimeter wave features; the radar features and the millimeter wave features are input into the feature fusion module to obtain a pseudo image, the pseudo image is input into the 2D backbone network to extract pseudo image features, and finally the pseudo image features are input into the RPN detection head to regress the final detection result;
training by adopting a deep learning framework, and obtaining a trained whole network when the value of the loss function is not changed any more;
s3: and inputting the currently acquired laser radar point cloud data and 4D millimeter wave point cloud data into a trained whole network, and outputting a checking result.
2. The multi-modal fused 3D target detection method of claim 1, wherein: the step S2 of extracting features of the point cloud data of the laser radar by the laser radar feature extraction branch comprises the following steps:
the laser radar point cloud is input as N×4-dimensional vector data; the original point cloud is first voxelized into the form of pillars, and the points in each pillar are then expanded from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x_c, y_c, z_c, x_p, y_p, z_p), wherein the subscript c represents the offset from the arithmetic mean of all points in the pillar, the subscript p represents the offset from the pillar center, and r represents the reflectivity;
for each frame of laser radar point cloud data, only P non-empty pillars are selected and N points are sampled in each pillar; pillars with fewer than N points are zero-padded, so that a pillar feature of size (D, P, N) is obtained as the radar feature, wherein D is the dimension of one point.
3. The multi-modal fused 3D target detection method of claim 2, wherein: the process of obtaining the pseudo image by the feature fusion module in the S2 is as follows:
the feature fusion module adopts the multi-modal fusion module LR-Fusion based on a self-attention mechanism, and its input features are the (D, P, N) pillar features extracted above; first, the dimension of the input features is reduced to lower the computational cost:
wherein O_L and O_R respectively represent the input features, namely the radar features and the millimeter wave features, and F_L and F_R respectively represent the corresponding output features;
the simplification is shown in formula (2);
wherein U_L and U_R respectively represent the simplified radar features and millimeter wave features, and LN(·), MP(·) and Conv(·) respectively represent the fully connected layer, the max pooling layer and the convolution layer;
formula (3) then computes two sets of attention weights to fuse the features of the two modalities, using Softmax, wherein W_{R←L} is the weight transferred from the laser radar features to the millimeter wave features and W_{L←R} is the weight in the opposite direction;
wherein Softmax(·) is the activation function;
formula (4) right-multiplies the attention weight W_{R←L} by the feature U_R, subtracts the feature U_L, and then passes the result through a linear layer, a normalization layer and an activation function to generate the weighting matrix F_{Lm} of the laser radar features; the millimeter wave weighting matrix F_{Rm} is obtained in the same way;
wherein BN(·) is batch normalization and ReLU(·) is the activation function; the weighting matrices are added to the original features, and the features of the two modalities are updated through the residual connection, as shown in formula (5);
F_{outputL} and F_{outputR} respectively represent the attention-aggregated pillar features of the laser radar and the attention-enhanced pillar features of the 4D millimeter wave;
and the generated laser radar pillar features and 4D millimeter wave pillar features are restored to the original space according to their coordinate indices to obtain the pseudo image.
4. A multi-modal fused 3D target detection method as claimed in claim 3 wherein: the loss function in S2 is:
the loss function for network training is divided into three parts: a classification loss function, a localization loss function and a direction loss function; the localization loss function computes the coordinate differences between the truth box and the anchor box; the classification loss function computes the class prediction difference between the anchor box and the truth box; the direction loss computes the difference in rotation angle between the anchor box and the truth box; the truth box and the anchor box are each represented by a 7-dimensional vector (x, y, z, w, l, h, θ), wherein (x, y, z) are the center coordinates of the truth box or the anchor box, (w, l, h) is the size of the truth box or the anchor box, and θ is the direction angle of the truth box or the anchor box.
Wherein x_gt, y_gt and z_gt respectively represent the center coordinates of the truth box, w_gt, l_gt and h_gt represent the length, width and height of the truth box, and θ_gt represents the direction angle of the truth box;
x_a, y_a and z_a respectively represent the center coordinates of the anchor box, w_a, l_a and h_a represent the length, width and height of the anchor box, and θ_a represents the direction angle of the anchor box;
Δx, Δy, Δz and Δθ respectively represent the differences between the truth box and the anchor box in the three center coordinates and in the direction angle;
d_a is an intermediate variable with no physical meaning and is calculated by formula (7);
the localization loss L_loc is trained with SmoothL1, as expressed in formula (8):
the direction of the object is learned with a Softmax loss function, and the direction loss is denoted L_dir; the classification loss L_cls of the target uses the focal loss and can be expressed as formula (9);
L_cls = -α_a (1 - p_a)^γ log p_a    (9)
wherein p_a is the class probability of the anchor box, and α_a and γ are hyperparameters;
L_dir is the Softmax loss for learning the direction classification of objects;
the total loss function can be expressed as formula (10):
wherein N_pos is the number of positive anchors, and β_loc, β_cls and β_dir are the weights of the three losses, respectively.
CN202310605073.2A 2023-05-26 2023-05-26 Multimode fusion 3D target detection method Pending CN116630937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310605073.2A CN116630937A (en) 2023-05-26 2023-05-26 Multimode fusion 3D target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310605073.2A CN116630937A (en) 2023-05-26 2023-05-26 Multimode fusion 3D target detection method

Publications (1)

Publication Number Publication Date
CN116630937A true CN116630937A (en) 2023-08-22

Family

ID=87596887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310605073.2A Pending CN116630937A (en) 2023-05-26 2023-05-26 Multimode fusion 3D target detection method

Country Status (1)

Country Link
CN (1) CN116630937A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197631A (en) * 2023-11-06 2023-12-08 安徽蔚来智驾科技有限公司 Multi-mode sensor fusion sensing method, computer equipment, medium and vehicle
CN117197631B (en) * 2023-11-06 2024-04-19 安徽蔚来智驾科技有限公司 Multi-mode sensor fusion sensing method, computer equipment, medium and vehicle

Similar Documents

Publication Publication Date Title
Zhao et al. Fusion of 3D LIDAR and camera data for object detection in autonomous vehicle applications
CN111429514B (en) Laser radar 3D real-time target detection method integrating multi-frame time sequence point cloud
Chen et al. Progressive lidar adaptation for road detection
Lee et al. Simultaneous traffic sign detection and boundary estimation using convolutional neural network
Ding et al. DiResNet: Direction-aware residual network for road extraction in VHR remote sensing images
US20230099113A1 (en) Training method and apparatus for a target detection model, target detection method and apparatus, and medium
US20230213643A1 (en) Camera-radar sensor fusion using local attention mechanism
CN112347987A (en) Multimode data fusion three-dimensional target detection method
Vaquero et al. Dual-branch CNNs for vehicle detection and tracking on LiDAR data
Chen et al. SAANet: Spatial adaptive alignment network for object detection in automatic driving
CN112949380B (en) Intelligent underwater target identification system based on laser radar point cloud data
CN115049821A (en) Three-dimensional environment target detection method based on multi-sensor fusion
CN115272416A (en) Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion
CN112683228A (en) Monocular camera ranging method and device
CN116630937A (en) Multimode fusion 3D target detection method
Zhang et al. PSNet: Perspective-sensitive convolutional network for object detection
Bieder et al. Exploiting multi-layer grid maps for surround-view semantic segmentation of sparse lidar data
Wang et al. Real-time 3D object detection from point cloud through foreground segmentation
Wu et al. Realtime single-shot refinement neural network with adaptive receptive field for 3D object detection from LiDAR point cloud
Chidanand et al. Multi-scale voxel class balanced ASPP for LIDAR pointcloud semantic segmentation
Danapal et al. Sensor fusion of camera and LiDAR raw data for vehicle detection
Yuan et al. Multi-level object detection by multi-sensor perception of traffic scenes
CN117315612A (en) 3D point cloud target detection method based on dynamic self-adaptive data enhancement
CN116468950A (en) Three-dimensional target detection method for neighborhood search radius of class guide center point
Yuan et al. SHREC 2020 track: 6D object pose estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination