CN116630937A - Multimode fusion 3D target detection method - Google Patents

Multimode fusion 3D target detection method

Info

Publication number
CN116630937A
CN116630937A (application CN202310605073.2A)
Authority
CN
China
Prior art keywords
features
millimeter wave
laser radar
point cloud
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310605073.2A
Other languages
Chinese (zh)
Inventor
冯欣
曾俊贤
单玉梅
何桢苇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN202310605073.2A priority Critical patent/CN116630937A/en
Publication of CN116630937A publication Critical patent/CN116630937A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Optical Radar Systems And Details Thereof (AREA)

Abstract

The invention relates to a multi-modal fusion 3D target detection method. The method first constructs an integral network comprising a laser radar feature extraction branch, a 4D millimeter wave feature extraction branch, a feature fusion module, a 2D backbone network and an RPN detection head; it then acquires real autonomous driving data to train the integral network; finally, the currently acquired laser radar point cloud data and 4D millimeter wave point cloud data are input into the trained network and the detection result is output. Experiments demonstrate the effectiveness of the method; in comparison experiments with other methods on the Astyx HiRes2019 dataset, the invention achieves the best performance.

Description

Multimode fusion 3D target detection method
Technical Field
The invention relates to the technical field of automatic driving, in particular to a multi-mode fusion 3D target detection method.
Background
Autonomous driving technology is one of the products of the rapid development of modern technology and reflects the automobile industry's continuous pursuit of innovation. As an important participant in modern transportation, the automobile industry has been actively exploring and applying various high-tech technologies. With the continuous progress of artificial intelligence, computer vision and related fields, autonomous driving has become a trend in the industry's future development and is attracting more and more manufacturers to compete. In addition, intelligence, new energy and connectivity are key targets for the development of the automobile industry, and the application of these new technologies will greatly change how we travel and live. With the emergence of these technologies, the automobile industry will continue to accelerate its transformation, bringing more convenience and safety to human travel.
An autonomous vehicle collects data through multiple advanced sensors mounted on the vehicle platform and combines computer vision with other artificial intelligence technologies, so that the vehicle can drive autonomously and assist the driver in completing driving tasks safely and efficiently.
In an autonomous driving perception system, 3D object detection is one of the most important tasks. It rapidly and accurately identifies and localizes targets in three-dimensional space, providing key support for intelligent driving. Conventional 3D object detection methods generally consider only point cloud data or only RGB image data and cannot fully exploit the information of multiple sensors, so the detection results are not accurate and robust enough. Multi-modal fusion, in contrast, can improve detection accuracy from multiple aspects by using the information provided by different sensors, such as color, texture, depth and illumination, and can improve the robustness of the detection algorithm by integrating data from different sensors. This means that even if one sensor fails, target detection can still maintain high accuracy; multi-modal fusion also provides more comprehensive information, making the detection algorithm suitable for a wider range of scenarios. For example, under low-light conditions an infrared sensor can be used to detect targets, thereby improving the detection effect. At the same time, multi-modal fusion 3D target detection requires combining different sensors, algorithms and data processing technologies, which promotes technical progress and innovation in many fields. Therefore, fusing multi-modal information (e.g., point clouds, images and radar) to improve the performance of 3D object detection is a challenging and significant line of research.
Currently, there are two main types of input data for 3D object detection methods: image data and laser radar point cloud data. Compared with image data, the laser radar point cloud carries ground-truth depth information, so the detection accuracy is higher, and laser-radar-based detection has become the mainstream single-modality 3D target detection approach.
However, a single sensor still has some unavoidable drawbacks. For example, laser radar data is sparse and susceptible to severe weather (e.g., rain, snow and fog), while RGB image data lacks depth information and is susceptible to illumination. To improve the accuracy and robustness of the algorithms, more and more work has begun to focus on multi-modal fusion 3D target detection. Multi-modal fusion makes better use of the advantages of each sensor; existing methods mainly combine high-resolution RGB images with high-line-count laser radar, which improves detection accuracy to a certain extent. However, such methods are expensive and still cannot handle well the problems of distant small targets, occluded targets and false detections in severe weather.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention aims to solve the following technical problem: how to reduce missed and false detections of distant small targets, occluded targets and targets in severe weather.
In order to solve the technical problems, the invention adopts the following technical scheme: a multi-mode fusion 3D target detection method comprises the following steps:
s1: constructing an integral network, wherein the integral network comprises a laser radar feature extraction branch, a 4D millimeter wave feature extraction branch, a feature fusion module, a 2D backbone network and an RPN detection head;
the laser radar feature extraction branch and the 4D millimeter wave feature extraction branch adopt the same feature extraction module;
s2: acquiring real data of automatic driving to train the whole network:
the laser radar feature extraction branch performs feature extraction on the point cloud data of the laser radar to obtain radar features, and the 4D millimeter wave feature extraction branch performs feature extraction on the 4D millimeter wave point cloud data to obtain millimeter wave features; the radar features and the millimeter wave features are input into the feature fusion module to obtain a pseudo image, the pseudo image is input into the 2D backbone network to extract pseudo image features, and finally the pseudo image features are input into the RPN detection head to regress the final detection result;
training is carried out with a deep learning framework, and when the value of the loss function no longer changes, the trained whole network is obtained.
S3: the currently acquired laser radar point cloud data and 4D millimeter wave point cloud data are input into the trained whole network, and the detection result is output.
As an improvement, the process of extracting the characteristics of the point cloud data of the laser radar by the laser radar characteristic extraction branch in S2 is as follows:
the lidar point cloud inputs N x 4-dimensional vector data by first voxelizing the original point cloud into the form of column pilar, and then expanding the points in each column from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x) c ,y c ,z c ,x p ,y p ,z p ) Wherein the subscript c denotes the deviation of the arithmetic mean of all points in the column, the subscript p denotes the deviation of the center of the column, and r denotes the reflectivity.
The point cloud data of each frame of laser radar only selects P non-empty columns, N points in each column are selected as samples, and if columns smaller than the N points are filled with zero, a column characteristic with the size of (D, P, N) is obtained as a radar characteristic, wherein D is the dimension of one point.
As an improvement, the process of obtaining the pseudo image by the feature fusion module in S2 is as follows:
the feature fusion module adopts the multi-modal fusion module LR-Fusion based on a self-attention mechanism, and its input features are the (D, P, N) pillar features extracted above. First, the dimension of the input features is reduced to lower the computational cost:
where O_L and O_R denote the input features, namely the radar features and the millimeter wave features, and F_L and F_R denote the corresponding output features;
the simplification is shown in formula (2),
where U_L and U_R denote the simplified radar and millimeter wave features, and LN(·), MP(·) and Conv(·) denote the fully connected layer, the max pooling layer and the convolution layer, respectively.
Formula (3) then computes two sets of attention weights to fuse the features of the two modalities, using Softmax, where W_{R←L} is the weight transferred from the laser radar features to the millimeter wave features and W_{L←R} is the weight in the opposite direction,
where Softmax(·) is the activation function.
Formula (4) right-multiplies the attention weight W_{R←L} by the feature U_R, subtracts the feature U_L, and then passes the result through a linear layer, a normalization layer and an activation function to generate the weighting matrix F_{Lm} of the laser radar features; the millimeter wave weighting matrix F_{Rm} is obtained in the same way,
where BN(·) is batch normalization and ReLU(·) is the activation function. The weighting matrices are added to the original features, and the features of the two modalities are updated through the residual connection, as shown in formula (5).
F_{outputL} and F_{outputR} denote the attention-aggregated pillar features of the laser radar and of the 4D millimeter wave, respectively.
The generated laser radar pillar features and 4D millimeter wave pillar features are restored to the original space according to their coordinate indices to obtain a pseudo image.
As an improvement, the loss function in S2 is:
the loss function of the network training is divided into three major parts, namely a classification loss function, a positioning loss function and a direction loss function. The coordinate difference between the truth box and the anchor box is calculated by the positioning loss function; the class loss function calculates a class prediction difference between the anchor frame and the truth frame; the directional loss calculates the difference between the rotation angle between the anchor frame and the truth frame. The truth box and the anchor box are represented by 7-dimensional vectors (x, y, z, w, l, h, θ), wherein (x, y, z) is the center coordinates of the truth box or the anchor box, and (w, l, h) is the size of the truth box or the anchor box, and θ is the direction angle of the truth box or the anchor box).
Wherein x is gt ,y gt And z gt Respectively represent the center coordinates, w, of the truth box gt ,l gt And h gt True value representing boxLength, width, height, theta gt Indicating the direction angle of the truth box;
x a ,y a and z a Respectively represent the central coordinates, w, of the anchor frame a ,l a And h a Represents the length, width and height of the anchor frame, theta a Indicating the direction angle of the truth box;
Δx, Δy, Δz and Δθ represent differences in three coordinates and direction angles of the truth box and the center of the anchor box, respectively.
d a The intermediate variable has no actual meaning and is calculated by formula (7).
Loss of positioning L loc Training was performed using SmoothL1, and the expression is shown in (8):
learning the direction of an object using a Softmax loss function, the direction loss being noted as L dir Then, the classification loss L of the target cls The use of focal loss can be represented by equation (9).
L cls =-α a (1-p a ) γ log p a (9)
Wherein p is a Is the class probability of the anchor frame, alpha a And gamma is a superparameter;
L dir the direction classification of learning objects is lost for Softmax;
the total loss function can be represented by equation (10):
wherein N is pos Number beta of positive anchor points loc ,β cls And beta dir Three lost weights, respectively.
Compared with the prior art, the invention has at least the following advantages:
the invention provides a 3D target detection method based on multi-mode fusion of a laser radar and 4D millimeter waves, and a detection network with better performance under a complex scene is obtained through multi-mode fusion of the laser radar and the 4D millimeter waves. According to the method, firstly, an extraction module similar to the PointPicloras point cloud characteristic is provided, the calculated amount of a network is effectively reduced, then, a multi-mode fusion module of the laser Lei Dage D millimeter wave characteristic is designed, the characteristics of two modes are aggregated in a self-attention mode, finally, the fusion characteristic with stronger expression capability is converted into a pseudo image form, the characteristic is further extracted by a 2D backbone network, and a final detection result is returned by an RPN detection head. The invention shows that the method is superior to other methods in an Astyx HiRes2019 data set through a large number of comparison experiments, and has the best performance.
Drawings
Fig. 1 is a block diagram of the overall network of the present invention.
Fig. 2 is a point cloud feature extraction structure diagram.
Fig. 3 is a feature fusion module.
Fig. 4 is a training process analysis.
Fig. 5 shows samples from the dataset: figures (a) and (b) are image data, figures (c) and (d) are laser radar data, and figures (e) and (f) are 4D millimeter wave data; figures (c) and (d) visualize the prediction boxes of the PointPillars model.
Detailed Description
The present invention will be described in further detail below.
The invention provides a 3D target detection method based on the multi-modal fusion of a laser radar and 4D millimeter waves. Exploiting the strong penetrability of 4D millimeter waves and their adaptability to severe weather, and inspired by the self-attention mechanism, an interactive multi-modal fusion module is designed to fuse the data of a 16-line laser radar and 4D millimeter waves. The module can aggregate the features of both modalities and identify the cross-modal relationships between them. Experiments demonstrate the effectiveness of this method, and comparison experiments with other methods on the Astyx HiRes2019 dataset show that it achieves the best performance.
Through the multi-modal fusion of the laser radar and the 4D millimeter wave, the method improves detection accuracy and robustness and alleviates the problems faced by conventional methods. It can effectively exploit the advantages of different sensors and is suitable for target detection in complex scenes and under severe weather conditions.
A multi-mode fusion 3D target detection method comprises the following steps:
s1: constructing an integral network, wherein the integral network comprises a laser radar feature extraction branch, a 4D millimeter wave feature extraction branch, a feature fusion module, a 2D backbone network and an RPN detection head. The network input comes from two sensor modalities: the laser radar point cloud and the 4D millimeter wave point cloud. In the respective feature extraction branches of the laser radar and the 4D millimeter wave, the point cloud data is first preprocessed and columnar (pillar) features are extracted from the point cloud by voxelization; the features are then sent into the multi-modal feature fusion module, which fuses the 16-line laser radar and 4D millimeter wave features using a self-attention mechanism (the module is described in detail below); the fused features are converted into the form of a pseudo image, a 2D backbone network identical to that of PointPillars further extracts the pseudo image features, and finally the RPN detection head regresses the final detection result.
The laser radar feature extraction branch and the 4D millimeter wave feature extraction branch adopt the same feature extraction module;
s2: acquiring real data of automatic driving to train the whole network:
the laser radar feature extraction branch performs feature extraction on point cloud data of a laser radar to obtain radar features, the 4D millimeter wave feature extraction branch performs feature extraction on the point cloud data of the 4D millimeter waves to obtain millimeter wave features, the radar features and the millimeter wave features are input into a feature fusion module to obtain a pseudo image, the pseudo image is input into a 2D backbone network to extract the pseudo image features, and finally the pseudo image features are input into an RPN detection head to return to a final detection result;
training is carried out by adopting a deep learning framework, and when the value of the loss function is not changed, a trained whole network is obtained.
S3: and inputting the currently acquired laser radar point cloud data and 4D millimeter wave point cloud data into a trained whole network, and outputting a checking result.
Specifically, the process of feature extraction of the point cloud data of the laser radar by the laser radar feature extraction branch in S2 is as follows:
the lidar point cloud inputs N x 4-dimensional vector data by first voxelizing the original point cloud into the form of column pilar, and then expanding the points in each column from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x) c ,y c ,z c ,x p ,y p ,z p ) Wherein the subscript c denotes the deviation of the arithmetic mean of all points in the column, the subscript p denotes the deviation of the center of the column, and r denotes the reflectivity.
Fig. 2 is a graph of laser radar point cloud feature extraction, where the point cloud data of each frame of laser radar only selects P non-empty pillars, N points are selected as samples in each pillar, and if pillars smaller than N points are filled with zeros, a pillar feature of (D, P, N) size is obtained as a radar feature, where D is the dimension of one point.
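As a concrete illustration of this pillar encoding, the sketch below groups a raw point cloud into pillars and decorates each point to 10 dimensions, yielding a (D, P, N) tensor. It is a minimal NumPy sketch under assumed grid ranges; the function name build_pillar_features and the default limits are illustrative and do not come from the patent, whose actual implementation lives in the OpenPCDet framework.

```python
import numpy as np

def build_pillar_features(points,
                          x_range=(0.0, 69.12), y_range=(-39.68, 39.68), z_range=(-3.0, 1.0),
                          pillar_size=(0.16, 0.16), max_pillars=12000, max_points=32):
    """Group an (M, 4) point cloud [x, y, z, r] into pillars and decorate each point to
    10 dims (x, y, z, r, xc, yc, zc, xp, yp, zp). Returns a (10, P, N) tensor plus the
    (P, 2) pillar grid indices used later to scatter features back into the pseudo image."""
    # keep only points inside the detection range
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[keep]
    ix = ((points[:, 0] - x_range[0]) / pillar_size[0]).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) / pillar_size[1]).astype(np.int64)
    key = ix * 100000 + iy                                    # one key per x-y grid cell
    z_center = 0.5 * (z_range[0] + z_range[1])

    feats = np.zeros((10, max_pillars, max_points), dtype=np.float32)
    coords = np.zeros((max_pillars, 2), dtype=np.int64)
    for p, k in enumerate(np.unique(key)[:max_pillars]):      # at most P non-empty pillars
        pts = points[key == k][:max_points]                   # at most N points, rest stays zero-padded
        cx = (k // 100000 + 0.5) * pillar_size[0] + x_range[0]
        cy = (k % 100000 + 0.5) * pillar_size[1] + y_range[0]
        dec = np.hstack([pts,                                 # (x, y, z, r)
                         pts[:, :3] - pts[:, :3].mean(0),     # (xc, yc, zc): offset from pillar mean
                         pts[:, :2] - [cx, cy],               # (xp, yp): offset from pillar center
                         pts[:, 2:3] - z_center])             # zp: vertical offset from pillar center
        feats[:, p, :len(pts)] = dec.T
        coords[p] = [k // 100000, k % 100000]
    return feats, coords
```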
Specifically, the process of obtaining the pseudo image by the feature fusion module in S2 is as follows:
the self-attention mechanism is an attention mechanism in a neural network that weights different locations in the input data so that the network can better understand the relationships of the different locations, and can derive a weight vector by computing the correlation between each input location and the other locations, and then multiplying and summing this weight vector with the input vector to derive a weighted vector representation. In this way, each input location can be given a vector representing its correlation with other locations through the self-attention mechanism.
In the self-attention mechanism, the correlation is calculated by dot product attention. Dot product attention treats each element in the input vector as a query vector that is used to calculate its relevance to other locations. Specifically, for each query vector, the dot product attention will calculate its dot product with the vector for all positions in the input sequence and convert the resulting relevance value into a probability distribution by a softmax function. Then, the vector of each position is multiplied by the probability value corresponding to the vector and summed to obtain the weighted vector representation. This may improve the efficiency and accuracy of the feature extractor.
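A generic dot-product attention step of this kind takes only a few lines of PyTorch. The snippet below is a textbook illustration of the mechanism just described, not the patent's exact module; the scaling by the key dimension is an assumption.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, key, value):
    """Generic scaled dot-product attention: compare every query with every key,
    turn the relevance scores into a probability distribution with softmax, and
    return the probability-weighted sum of the value vectors."""
    scores = query @ key.transpose(-2, -1) / key.shape[-1] ** 0.5  # (..., Nq, Nk) relevance
    weights = F.softmax(scores, dim=-1)                            # one distribution per query
    return weights @ value                                         # weighted vector representation
```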
The feature fusion module adopts the multi-modal fusion module LR-Fusion based on a self-attention mechanism, as shown in FIG. 3. The module aggregates the features of the two modalities and identifies the cross-modal relationships between the 4D millimeter wave and laser radar features, so that the network focuses on important feature information and ignores irrelevant features. The laser radar can accurately detect objects at close range, but its performance on distant objects is unstable; the 4D millimeter wave has better penetrability and can accurately detect distant objects, but its point cloud is irregular in shape. Experiments show that the module aggregates the respective advantages of the two sensors to the greatest extent and compensates for their shortcomings.
As shown in FIG. 3, the input features of the feature fusion module are the (D, P, N) pillar features extracted above. First, the dimension of the input features is reduced to lower the computational cost:
where O_L and O_R denote the input features, namely the radar features and the millimeter wave features, and F_L and F_R denote the corresponding output features;
the simplification is shown in formula (2),
where U_L and U_R denote the simplified radar and millimeter wave features, and LN(·), MP(·) and Conv(·) denote the fully connected layer, the max pooling layer and the convolution layer, respectively.
Formula (3) then computes two sets of attention weights to fuse the features of the two modalities, using Softmax, where W_{R←L} is the weight transferred from the laser radar features to the millimeter wave features and W_{L←R} is the weight in the opposite direction,
where Softmax(·) is the activation function.
Formula (4) right-multiplies the attention weight W_{R←L} by the feature U_R, subtracts the feature U_L, and then passes the result through a linear layer, a normalization layer and an activation function to generate the weighting matrix F_{Lm} of the laser radar features; the millimeter wave weighting matrix F_{Rm} is obtained in the same way,
where BN(·) is batch normalization and ReLU(·) is the activation function. The weighting matrices are added to the original features, and the features of the two modalities are updated through the residual connection, as shown in formula (5).
F_{outputL} and F_{outputR} denote the attention-aggregated pillar features of the laser radar and of the 4D millimeter wave, respectively.
Through the interaction of the two modalities, two sets of features are obtained, enhancing the feature expression of the laser radar and millimeter wave point clouds.
The generated laser radar pillar features and 4D millimeter wave pillar features are restored to the original space according to their coordinate indices to obtain a pseudo image.
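The flow described above — reduce each modality, exchange softmax attention weights, apply a Linear/BN/ReLU weighting, and update both feature sets through a residual — could look roughly like the PyTorch sketch below. Layer sizes, layer order and tensor shapes are assumptions (formulas (1)-(5) are not reproduced in the text), and LRFusionSketch is an illustrative name rather than the patent's code.

```python
import torch
import torch.nn as nn

class LRFusionSketch(nn.Module):
    """Rough sketch of the LR-Fusion idea: exchange attention between the laser
    radar and 4D millimeter wave pillar features and update both by a residual."""
    def __init__(self, d_in=10, d_model=64):
        super().__init__()
        self.reduce_l = nn.Sequential(nn.Linear(d_in, d_model), nn.ReLU())
        self.reduce_r = nn.Sequential(nn.Linear(d_in, d_model), nn.ReLU())
        self.update_l = nn.Sequential(nn.Linear(d_model, d_model),
                                      nn.BatchNorm1d(d_model), nn.ReLU())
        self.update_r = nn.Sequential(nn.Linear(d_model, d_model),
                                      nn.BatchNorm1d(d_model), nn.ReLU())

    def forward(self, o_l, o_r):
        # o_l, o_r: (P, d_in) per-pillar features of the two modalities
        u_l, u_r = self.reduce_l(o_l), self.reduce_r(o_r)        # simplified features U_L, U_R
        w_r_from_l = torch.softmax(u_l @ u_r.t(), dim=-1)        # attention: laser radar -> millimeter wave
        w_l_from_r = torch.softmax(u_r @ u_l.t(), dim=-1)        # attention: millimeter wave -> laser radar
        f_lm = self.update_l(w_r_from_l @ u_r - u_l)             # weighting matrix for laser radar features
        f_rm = self.update_r(w_l_from_r @ u_l - u_r)             # weighting matrix for millimeter wave features
        # residual update; adding to the simplified features here is an assumption
        return u_l + f_lm, u_r + f_rm
```

The two returned feature sets would then be scattered back onto the BEV grid by their pillar coordinate indices to form the pseudo image consumed by the 2D backbone.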
Specifically, the loss function in S2 is:
the loss function of the network training is divided into three major parts, namely a classification loss function, a positioning loss function and a direction loss function. The coordinate difference between the truth box and the anchor box is calculated by the positioning loss function; the class loss function calculates a class prediction difference between the anchor frame and the truth frame; the directional loss calculates the difference between the rotation angle between the anchor frame and the truth frame. The truth box and the anchor box are expressed by 7-dimensional vectors (x, y, z, w, l, h, theta), wherein (x, y, z) is the center coordinates of the truth box or the anchor box, and (w, l, h) is the size of the truth box or the anchor box, and theta is the direction angle of the truth box or the anchor box, and the parameter to be learned in the regression task of the detection box is the offset of the 7 variables.
Wherein, superscript gt and superscript a respectively represent a truth box and an anchor box, x gt ,y gt And z gt Respectively represent the center coordinates, w, of the truth box gt ,l gt And h gt Representing the length, width and height of the truth box, theta gt Indicating the direction angle of the truth box;
x a ,y a and z a Respectively represent the central coordinates, w, of the anchor frame a ,l a And h a Represents the length, width and height of the anchor frame, theta a Indicating the direction angle of the truth box;
Δx, Δy, Δz and Δθ represent differences in three coordinates and direction angles of the truth box and the center of the anchor box, respectively.
d a The intermediate variable has no actual meaning and is calculated by formula (7).
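Formulas (6) and (7) are not reproduced in the text. The quantities described are consistent with the standard SECOND/PointPillars box-residual encoding, which — as an assumption about what the missing formulas contain — reads:

```latex
\Delta x = \frac{x_{gt} - x_{a}}{d_{a}}, \quad
\Delta y = \frac{y_{gt} - y_{a}}{d_{a}}, \quad
\Delta z = \frac{z_{gt} - z_{a}}{h_{a}}, \quad
\Delta\theta = \theta_{gt} - \theta_{a} \qquad (6)

\Delta w = \log\frac{w_{gt}}{w_{a}}, \quad
\Delta l = \log\frac{l_{gt}}{l_{a}}, \quad
\Delta h = \log\frac{h_{gt}}{h_{a}}, \quad
d_{a} = \sqrt{w_{a}^{2} + l_{a}^{2}} \qquad (7)
```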
The localization loss L_loc is trained with SmoothL1, as expressed in formula (8):
To avoid mispredicting the direction, the direction of the object is learned with a Softmax loss function, and the direction loss is denoted L_dir. The classification loss L_cls of the target uses the focal loss and can be expressed as formula (9).
L_cls = -α_a (1 - p_a)^γ log p_a    (9)
where p_a is the class probability of the anchor box, and α_a and γ are hyperparameters; the settings α = 0.25 and γ = 2 are used.
L_dir is the Softmax loss for learning the direction classification of objects.
The total loss function can be expressed as formula (10):
where N_pos is the number of positive anchors, and β_loc, β_cls and β_dir are the weights of the three losses, with β_loc = 2, β_cls = 1 and β_dir = 0.2.
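Putting the three terms together with the stated weights could look like the sketch below. Tensor shapes, reductions and the helper name total_loss are assumptions; only the weights (β_loc = 2, β_cls = 1, β_dir = 0.2), the focal-loss parameters (α = 0.25, γ = 2) and the choice of SmoothL1/Softmax losses come from the text.

```python
import torch
import torch.nn.functional as F

def total_loss(loc_pred, loc_target, cls_prob, cls_target, dir_logits, dir_target,
               n_pos, beta_loc=2.0, beta_cls=1.0, beta_dir=0.2, alpha=0.25, gamma=2.0):
    """Combine localization, classification and direction losses as in formula (10)."""
    l_loc = F.smooth_l1_loss(loc_pred, loc_target, reduction='sum')          # SmoothL1 on the 7 box offsets
    p_a = torch.where(cls_target.bool(), cls_prob, 1.0 - cls_prob)           # probability of the true class
    l_cls = (-alpha * (1.0 - p_a) ** gamma * torch.log(p_a + 1e-6)).sum()    # focal loss, formula (9)
    l_dir = F.cross_entropy(dir_logits, dir_target, reduction='sum')         # Softmax direction loss
    return (beta_loc * l_loc + beta_cls * l_cls + beta_dir * l_dir) / max(n_pos, 1)
```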
Experiment and analysis:
1. verification data set
In order to verify the robustness of the 3D target detection model fused from laser radar and 4D millimeter wave data, the method proposed by the invention is verified on the autonomous driving dataset Astyx HiRes2019. Astyx HiRes2019 is a popular autonomous driving dataset and the first publicly released dataset containing 4D millimeter wave point clouds, lidar and camera data; it consists of 546 frames of 16-line lidar, 4D millimeter wave and image data. To ensure that the training set and the test set are mutually exclusive and that their distributions are consistent with the actual distribution of the samples, the dataset is partitioned at a ratio of 3:1, i.e., 410 frames are used as the training set and 136 frames as the test set. The 4D millimeter wave and laser radar point clouds contain about 1000-10000 and 10000-25000 points per frame, respectively. In addition, each image has a resolution of 2048×618 pixels, and the training data contains a total of 2204 cars, 36 pedestrians and 8 cyclists.
2. Data preprocessing
The 4D millimeter wave radar provides environmental information by determining object positions from distance, elevation angle, azimuth and speed. Within the detection range, because of the limited aperture, multiple objects detected at a specific distance are difficult to resolve in the azimuth dimension, which produces noisy data. The pitch angles of the point clouds in the Astyx HiRes2019 dataset were computed and analyzed, and were found to lie outside the normal range to a certain extent. Therefore, noise is eliminated by rotating the points whose z-coordinates are less than 0 clockwise. The specific processing method is shown in formula (3.6).
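An illustrative version of this correction is sketched below. Because formula (3.6) is not reproduced in the text, the rotation axis and angle used here are assumptions; the sketch only shows the stated idea of rotating the points with z < 0 clockwise to compensate an elevation offset.

```python
import numpy as np

def correct_radar_pitch(points, angle_rad=0.05):
    """Rotate points whose z-coordinate is below 0 'clockwise' in the x-z plane
    (i.e., about the y-axis) to compensate an assumed elevation offset."""
    out = points.copy()
    m = out[:, 2] < 0.0                      # only the points below the sensor plane
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    x, z = out[m, 0], out[m, 2]
    out[m, 0] = c * x + s * z                # clockwise rotation about the y-axis
    out[m, 2] = -s * x + c * z
    return out
```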
3. Training details and parameter settings
The method is implemented in the open-source OpenPCDet framework using PyTorch (v1.7.1); the operating system is Ubuntu 18.04. The hardware configuration is as follows: the CPU is an Intel Core i7-7700 @ 3.6 GHz ×8, the GPU is a single-card NVIDIA RTX 3090 server, and the memory is 32 GB.
For the input point cloud, the ranges of x, y and z are set to (0, 69.12), (-39.68, 39.68) and (-3, 1), respectively, and the pillar size is (0.16, 0.16, 4). The points in each pillar are then expanded from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x_c, y_c, z_c, x_p, y_p, z_p), where the subscript c denotes the offset from the arithmetic mean of all points in the pillar, the subscript p denotes the offset from the pillar center, and r denotes the reflectivity. To reduce sampling and thereby reduce inference time, N points are selected in each pillar, and pillars with fewer than N points are zero-padded. In practice, only P non-empty pillars are selected per frame, with N = 32 points per pillar, yielding a tensor of size (D, P, N), where D is the dimension of a point.
Training was performed on a single RTX 3090 graphics card with a batch size of 2, 160 training epochs and a learning rate of 0.003; non-maximum suppression (NMS) with a 3D IoU threshold of 0.5 is used in the inference phase. For optimization, the Adam optimizer is used with a gradually decayed learning rate. Performance is measured with the KITTI evaluation protocol on the 2D, 3D and BEV detection metrics.
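For reference, these hyperparameters could be collected as in the sketch below. The values come from the text; the dictionary layout, optimizer construction and the cosine-annealing schedule (the text only says the learning rate is gradually decayed) are assumptions, not the actual OpenPCDet configuration.

```python
import torch

# hyperparameters as stated in the text
cfg = dict(batch_size=2, epochs=160, lr=0.003, nms_iou_3d=0.5,
           point_range_x=(0.0, 69.12), point_range_y=(-39.68, 39.68),
           point_range_z=(-3.0, 1.0), pillar_size=(0.16, 0.16, 4.0),
           max_points_per_pillar=32)

def build_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg['lr'])
    # gradual learning-rate decay over the training epochs (schedule choice assumed)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=cfg['epochs'])
    return optimizer, scheduler
```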
Fig. 4 is a result of analyzing the training process of the proposed network, further verifying the correctness of the training details and parameter settings.
4. Experimental results and analysis
The method of the invention is verified on the Astyx HiRes2019 dataset and compared with mainstream 3D target detection methods of recent years. Because multi-modal methods fusing laser radar and 4D millimeter wave are scarce, several mainstream methods are first selected, such as PointRCNN, SECOND, PV-RCNN and PointPillars; a total of twelve experiments were performed on laser radar and 4D millimeter wave inputs, respectively, and the results were then compared with the multi-modal fusion method proposed by the invention, as shown in Table 1.
Table 1 results of comparative experiments
The experimental results show that among the single-modality methods that take the laser radar as input, the two-stage PV-RCNN achieves the highest accuracy of 44.63% in 3D mAP at the moderate difficulty level, 0.73% higher than the second-place SECOND, and reaches 46.36% in BEV mAP; among the single-modality methods that take the 4D millimeter wave as input, the two-stage PV-RCNN also has the highest accuracy. The last row of the table is the method proposed by the invention, whose results are significantly better than all the single-modality methods, with improvements of 6.73% in 3D mAP and 11.80% in BEV mAP at the moderate level compared with the baseline PointPillars using the laser radar as input.
To further verify the effectiveness of each module, ablation experiments were carried out on each module; the results are shown in Table 2 and cover the input data modality, the data preprocessing and the multi-modal fusion module. In the 4D millimeter wave point cloud, the data preprocessing (DP) method is used to reduce data noise. As shown in the first and second and the fourth and fifth rows of Table 2, it improves performance in every method; when only the data preprocessing module is added to the baseline PointPillars, the first and second rows show a 3.66% increase in 3D mAP and a 4.77% increase in BEV mAP at the moderate level. This shows that the data preprocessing corrects the offset error of the divergence angle by rotating the point cloud, verifying the validity of the method.
The effect of the multi-modal fusion module is verified against a single modality while keeping the other settings consistent. As shown in the first, third and fourth rows of Table 2, detection performance improves significantly when the features of the two modalities are fused with the multi-modal fusion module. At the moderate level, the network using the multi-modal fusion module gains 2.48% in 3D mAP and 8.01% in BEV mAP compared with the single-modality approach using the laser radar. These experiments show that the multi-modal fusion module enables the network to learn richer information from multiple modalities and enhances the feature expression. Finally, experiments were performed with the multi-modal fusion module together with the data preprocessing; as the last row of Table 2 shows, the 3D target detection performance is further improved significantly.
Table 2 ablation experiments
Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the technical solution of the present invention, all of which are intended to be covered by the scope of the claims of the present invention.

Claims (4)

1. A multi-mode fusion 3D target detection method is characterized in that: the method comprises the following steps:
s1: constructing an integral network, wherein the integral network comprises a laser radar feature extraction branch, a 4D millimeter wave feature extraction branch, a feature fusion module, a 2D backbone network and an RPN detection head;
the laser radar feature extraction branch and the 4D millimeter wave feature extraction branch adopt the same feature extraction module;
s2: acquiring real data of automatic driving to train the whole network:
the laser radar feature extraction branch performs feature extraction on the point cloud data of the laser radar to obtain radar features, and the 4D millimeter wave feature extraction branch performs feature extraction on the 4D millimeter wave point cloud data to obtain millimeter wave features; the radar features and the millimeter wave features are input into the feature fusion module to obtain a pseudo image, the pseudo image is input into the 2D backbone network to extract pseudo image features, and finally the pseudo image features are input into the RPN detection head to regress the final detection result;
training by adopting a deep learning framework, and obtaining a trained whole network when the value of the loss function is not changed any more;
s3: and inputting the currently acquired laser radar point cloud data and 4D millimeter wave point cloud data into a trained whole network, and outputting a checking result.
2. The multi-modal fused 3D target detection method of claim 1, wherein: the step S2 of extracting features of the point cloud data of the laser radar by the laser radar feature extraction branch comprises the following steps:
the laser radar point cloud is input as N×4-dimensional vector data; the original point cloud is first voxelized into the form of pillars, and the points in each pillar are then expanded from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x_c, y_c, z_c, x_p, y_p, z_p), wherein the subscript c represents the offset from the arithmetic mean of all points in the pillar, the subscript p represents the offset from the pillar center, and r represents the reflectivity;
for each frame of laser radar point cloud data, only P non-empty pillars are selected and N points are sampled in each pillar; pillars with fewer than N points are zero-padded, so that a pillar feature of size (D, P, N) is obtained as the radar feature, wherein D is the dimension of one point.
3. The multi-modal fused 3D target detection method of claim 2, wherein: the process of obtaining the pseudo image by the feature fusion module in the S2 is as follows:
the feature fusion module adopts the multi-modal fusion module LR-Fusion based on a self-attention mechanism, and its input features are the (D, P, N) pillar features extracted above; first, the dimension of the input features is reduced to lower the computational cost:
wherein O_L and O_R respectively represent the input features, namely the radar features and the millimeter wave features, and F_L and F_R respectively represent the corresponding output features;
the simplification is shown in formula (2);
wherein U_L and U_R respectively represent the simplified radar features and millimeter wave features, and LN(·), MP(·) and Conv(·) respectively represent the fully connected layer, the max pooling layer and the convolution layer;
formula (3) then computes two sets of attention weights to fuse the features of the two modalities, using Softmax, wherein W_{R←L} is the weight transferred from the laser radar features to the millimeter wave features and W_{L←R} is the weight in the opposite direction;
wherein Softmax(·) is the activation function;
formula (4) right-multiplies the attention weight W_{R←L} by the feature U_R, subtracts the feature U_L, and then passes the result through a linear layer, a normalization layer and an activation function to generate the weighting matrix F_{Lm} of the laser radar features; the millimeter wave weighting matrix F_{Rm} is obtained in the same way;
wherein BN(·) is batch normalization and ReLU(·) is the activation function; the weighting matrices are added to the original features, and the features of the two modalities are updated through the residual connection, as shown in formula (5);
F_{outputL} and F_{outputR} respectively represent the attention-aggregated pillar features of the laser radar and the attention-enhanced pillar features of the 4D millimeter wave;
and the generated laser radar pillar features and 4D millimeter wave pillar features are restored to the original space according to their coordinate indices to obtain the pseudo image.
4. A multi-modal fused 3D target detection method as claimed in claim 3 wherein: the loss function in S2 is:
the loss function for network training is divided into three parts: a classification loss function, a localization loss function and a direction loss function; the localization loss function computes the coordinate differences between the truth box and the anchor box; the classification loss function computes the class prediction difference between the anchor box and the truth box; the direction loss computes the difference in rotation angle between the anchor box and the truth box; the truth box and the anchor box are each represented by a 7-dimensional vector (x, y, z, w, l, h, θ), wherein (x, y, z) are the center coordinates of the truth box or the anchor box, (w, l, h) is the size of the truth box or the anchor box, and θ is the direction angle of the truth box or the anchor box.
Wherein x_gt, y_gt and z_gt respectively represent the center coordinates of the truth box, w_gt, l_gt and h_gt represent the length, width and height of the truth box, and θ_gt represents the direction angle of the truth box;
x_a, y_a and z_a respectively represent the center coordinates of the anchor box, w_a, l_a and h_a represent the length, width and height of the anchor box, and θ_a represents the direction angle of the anchor box;
Δx, Δy, Δz and Δθ respectively represent the differences between the truth box and the anchor box in the three center coordinates and in the direction angle;
d_a is an intermediate variable with no physical meaning and is calculated by formula (7);
the localization loss L_loc is trained with SmoothL1, as expressed in formula (8):
the direction of the object is learned with a Softmax loss function, and the direction loss is denoted L_dir; the classification loss L_cls of the target uses the focal loss and can be expressed as formula (9);
L_cls = -α_a (1 - p_a)^γ log p_a    (9)
wherein p_a is the class probability of the anchor box, and α_a and γ are hyperparameters;
L_dir is the Softmax loss for learning the direction classification of objects;
the total loss function can be expressed as formula (10):
wherein N_pos is the number of positive anchors, and β_loc, β_cls and β_dir are the weights of the three losses, respectively.
CN202310605073.2A 2023-05-26 2023-05-26 Multimode fusion 3D target detection method Pending CN116630937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310605073.2A CN116630937A (en) 2023-05-26 2023-05-26 Multimode fusion 3D target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310605073.2A CN116630937A (en) 2023-05-26 2023-05-26 Multimode fusion 3D target detection method

Publications (1)

Publication Number Publication Date
CN116630937A true CN116630937A (en) 2023-08-22

Family

ID=87596887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310605073.2A Pending CN116630937A (en) 2023-05-26 2023-05-26 Multimode fusion 3D target detection method

Country Status (1)

Country Link
CN (1) CN116630937A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197631A (en) * 2023-11-06 2023-12-08 安徽蔚来智驾科技有限公司 Multi-mode sensor fusion sensing method, computer equipment, medium and vehicle
CN117197631B (en) * 2023-11-06 2024-04-19 安徽蔚来智驾科技有限公司 Multi-mode sensor fusion sensing method, computer equipment, medium and vehicle

Similar Documents

Publication Publication Date Title
Zhao et al. Fusion of 3D LIDAR and camera data for object detection in autonomous vehicle applications
CN111429514B (en) Laser radar 3D real-time target detection method integrating multi-frame time sequence point cloud
Chen et al. Progressive lidar adaptation for road detection
Lee et al. Simultaneous traffic sign detection and boundary estimation using convolutional neural network
Ding et al. DiResNet: Direction-aware residual network for road extraction in VHR remote sensing images
US20230099113A1 (en) Training method and apparatus for a target detection model, target detection method and apparatus, and medium
US20230213643A1 (en) Camera-radar sensor fusion using local attention mechanism
CN112347987A (en) Multimode data fusion three-dimensional target detection method
Vaquero et al. Dual-branch CNNs for vehicle detection and tracking on LiDAR data
Chen et al. SAANet: Spatial adaptive alignment network for object detection in automatic driving
CN112949380B (en) Intelligent underwater target identification system based on laser radar point cloud data
CN115049821A (en) Three-dimensional environment target detection method based on multi-sensor fusion
CN115272416A (en) Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion
CN112683228A (en) Monocular camera ranging method and device
CN116630937A (en) Multimode fusion 3D target detection method
Zhang et al. PSNet: Perspective-sensitive convolutional network for object detection
Bieder et al. Exploiting multi-layer grid maps for surround-view semantic segmentation of sparse lidar data
Wang et al. Real-time 3D object detection from point cloud through foreground segmentation
Wu et al. Realtime single-shot refinement neural network with adaptive receptive field for 3D object detection from LiDAR point cloud
Chidanand et al. Multi-scale voxel class balanced ASPP for LIDAR pointcloud semantic segmentation
Danapal et al. Sensor fusion of camera and LiDAR raw data for vehicle detection
Yuan et al. Multi-level object detection by multi-sensor perception of traffic scenes
CN117315612A (en) 3D point cloud target detection method based on dynamic self-adaptive data enhancement
CN116468950A (en) Three-dimensional target detection method for neighborhood search radius of class guide center point
Yuan et al. SHREC 2020 track: 6D object pose estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination