CN116486368A - Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene - Google Patents
Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene
- Publication number
- CN116486368A (Application CN202310357033.0A)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- features
- bev
- image
- detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene, which comprises the following steps. Step 1: acquire the point cloud data and image data of the current frame. Step 2: input the point cloud into a point cloud feature extraction network and convert the current-frame point cloud into bird's eye view (BEV) features. Step 3: input the image data into an image feature extraction network to obtain multi-scale image features. Step 4: input the BEV features into a BEV detection module to obtain a preliminary three-dimensional detection result. Step 5: feed the point cloud and image features output in steps 2 and 3, together with the preliminary detection result from step 4, into an interleaved fusion module, which adaptively fuses the image and point cloud features; the fused result is then used to refine the preliminary detection result of step 4. Compared with existing methods, the invention achieves complementarity between modalities through the interleaved fusion architecture, shows better robustness under various kinds of lidar noise, and has stronger recall capability, thereby improving detection accuracy.
Description
Technical Field
The invention relates to computer vision and pattern recognition technology, and in particular to a multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene.
Background
With the rapid development of sensor technology and deep learning, three-dimensional object detection based on multi-modal fusion has improved remarkably. Object detection methods that rely on a single sensor have inherent limitations in complex everyday scenarios because of the imaging characteristics of that sensor. An optical camera provides dense, rich instance-level information under good illumination and reflects the texture and color of objects. However, when lighting is poor, for example at night or in rain and fog, the imaging quality of an optical camera degrades sharply, so a purely vision-based three-dimensional detection model generally cannot reach satisfactory accuracy under such conditions. Lidar, a laser sensing technology that captures objects and three-dimensional surfaces in space, has the advantage over optical cameras of providing high-accuracy distance information without being limited by lighting conditions. However, compared with the regular grid structure of an RGB image, lidar point clouds are sparse and unordered, so the convolutional neural networks used in conventional two-dimensional object detection cannot extract point cloud features effectively.
To obtain accurate three-dimensional detection results, the literature (Vora S, Lang A H, Helou B, et al. PointPainting: Sequential fusion for 3D object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 4604-4612.) proposes using features of the RGB image as a prior to establish the correlation between point cloud data and the RGB image: a two-dimensional semantic segmentation network first produces segmentation results in image space, these results are then projected into three-dimensional space with the camera projection matrix to supplement the point cloud features, and the augmented points are finally fed into a point-cloud-based three-dimensional detector. The literature (Xu S, Zhou D, Fang J, et al. FusionPainting: Multimodal fusion with adaptive attention for 3D object detection[C]//2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021: 3047-3054.) uses semantic segmentation networks for both the image and the point cloud to obtain segmentation results as priors, and an adaptive fusion module fuses the dense image information into the point cloud information before detection, which alleviates the boundary-blurring problem caused by the segmentation network. These segmentation-guided fusion methods improve detection accuracy, but they depend too heavily on the point cloud features and do not fully exploit the dense image features; if the point cloud is noisy, the overall result is strongly affected. To increase the weight of image features in the multi-modal fusion model, the invention proposes a multi-modal fusion architecture based on interleaved fusion.
Disclosure of Invention
To overcome the shortcoming of the prior art that detection depends too heavily on point cloud features, the invention provides a multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene.
The invention uses a rotating lidar and six optical cameras surrounding the vehicle as sensors, which respectively acquire three-dimensional point cloud information and RGB image information. The geometric information of objects around the vehicle is thereby perceived effectively, and targets in different road scenes can be recognized adaptively, for example pedestrians, roadblocks, cars, engineering vehicles, and buses.
The multi-modal fusion robust three-dimensional object detection method based on the autonomous driving scene comprises the following steps:
Step 1: acquire the point cloud data and image data of the current frame.
Step 2: input the point cloud into a point cloud feature extraction network and convert the current-frame point cloud into bird's eye view (BEV) features.
Step 3: input the image data into an image feature extraction network to obtain multi-scale image features.
Step 4: input the BEV features into a BEV detection module to obtain a preliminary three-dimensional detection result.
Step 5: feed the point cloud and image features output in steps 2 and 3, together with the preliminary detection result from step 4, into an interleaved fusion module to adaptively fuse the image and point cloud features; then use the fused result to refine the preliminary detection result of step 4.
The specific flow of step 2 is as follows:
Step 2-1: define the voxel size and detection range. The detection range is manually set to [-54 m, 54 m] along the X and Y axes and [-5 m, 3 m] along the Z axis, and the voxel size is defined as (0.075 m, 0.2 m).
Step 2-2: voxelize the point cloud. The points in space are rasterized according to the defined voxel size to obtain a set of voxels. Within every non-empty voxel, N points are randomly sampled; if fewer than N points exist, the voxel is padded with zeros. A fully connected neural network lifts the sampled points to a higher dimension to obtain the per-point features inside voxel i, and max pooling yields the voxel feature V_i ∈ R^C, where C is the number of feature channels. Empty voxels that contain no points are likewise filled with zeros.
Step 2-3: extract voxel features. The voxel features obtained above are downsampled with sparse three-dimensional convolutions to obtain the BEV feature F_B ∈ R^(C×W×H); conventional two-dimensional convolutions then produce multi-scale BEV features, and an FPN fuses the multi-scale information into the final BEV feature.
The specific flow of step 3 is as follows: first, ResNet-50 extracts the features of the six images at the current time; then an FPN fuses the multi-scale information and outputs feature maps at different scales.
The specific flow of step 4 is as follows:
Step 4-1: dimensionality reduction. A 3×3 convolution reduces the dimensionality of the BEV features to save computation.
Step 4-2: result prediction. The detection task is treated as a set-matching problem: given a group of learnable object query vectors, the BEV features are decoded with a cross-attention mechanism, and the object query vectors serve as containers for the decoded BEV features. Finally, the object query vectors are fed into the regression branch and the classification branch to obtain the detection result in BEV space.
The specific flow of step 5 is as follows; the decoder of the interleaved fusion module is the decoder of step 4-2:
Step 5-1: initialization. The initialization is divided into two parts. First, the object query vectors of step 4-2 are reused as the object query vectors of the interleaved fusion module, and the points inside the preliminary detection boxes are sampled to generate an enhancement vector that is added to the object query vectors. Second, the center points of the detection results of step 4-2 are taken as the "reference points" of the module.
Step 5-2: fuse image features. The object query vector is first fed into a fully connected layer to generate six learnable offsets relative to the reference point. The sampling positions computed from these offsets are used for bilinear interpolation sampling in the input multi-scale image feature maps, while the weights corresponding to the six points are generated with a fully connected layer. Finally, the features of the six points are weighted and summed to update the object query vector.
Step 5-3: fuse point cloud features. The three-dimensional coordinates of the reference point are first converted into coordinates in the BEV view; the BEV feature at the reference point is then obtained by bilinear sampling; finally, the object query vector is updated.
The advantages of the invention are as follows:
1. The invention provides a novel lidar-camera fusion model for three-dimensional object detection. Complementarity between the modalities is achieved through the interleaved fusion architecture, and the model shows better robustness under various kinds of lidar noise.
2. The invention provides a pluggable feature enhancement operation: by adding enhancement vectors generated from the points inside the initial detection results, the network is encouraged to learn difficult samples, which improves the perception results.
3. Compared with conventional fusion methods, the invention better mines and exploits dense image features, so the network has stronger recall capability, missed detections of three-dimensional objects are reduced, and driving safety is better guaranteed.
Drawings
Fig. 1 is a schematic overall flow diagram of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings; the overall flow of the invention is shown in Fig. 1.
The multi-modal fusion robust three-dimensional object detection method based on the autonomous driving scene makes better use of the dense semantic information in the image, thereby achieving better robustness and higher detection accuracy. The specific steps are as follows:
Step 1: acquire the point cloud data and image data of the current frame.
Step 2: input the point cloud into a point cloud feature extraction network and convert the current-frame point cloud into bird's eye view (BEV) features.
Step 3: input the image data into an image feature extraction network to obtain multi-scale image features.
Step 4: input the BEV features into a BEV detection module to obtain a preliminary three-dimensional detection result.
Step 5: feed the image and point cloud features output in steps 2 and 3, together with the preliminary detection result from step 4, into an interleaved fusion module to adaptively fuse the image and point cloud features; then use the fused result to refine the preliminary detection result of step 4.
The specific flow of step 1 is as follows:
Step 1-1: the trigger frequency of the lidar and the six on-board cameras is manually set to 20 Hz; the point cloud data and image data that correspond to each other are found through their timestamps and used as model input.
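The timestamp association of step 1-1 can be illustrated with a small nearest-neighbour matching routine. This is a hypothetical sketch, not part of the patent: the function name, the 25 ms tolerance and the data layout are assumptions for illustration.

```python
# Hypothetical sketch of step 1-1: pair each 20 Hz lidar sweep with the camera
# frames whose timestamps are closest to it. Names and tolerances are illustrative.
import numpy as np

def match_frames(lidar_ts, camera_ts, max_gap_s=0.025):
    """lidar_ts: (L,) lidar trigger times in seconds.
    camera_ts: dict {camera_name: (K,) frame times in seconds}.
    Returns one dict per sweep mapping camera name -> frame index (or None)."""
    matches = []
    for t in lidar_ts:
        frame = {}
        for cam, ts in camera_ts.items():
            idx = int(np.argmin(np.abs(ts - t)))
            frame[cam] = idx if abs(ts[idx] - t) <= max_gap_s else None
        matches.append(frame)
    return matches

# Two sensors triggered at 20 Hz with a small clock offset.
lidar = np.arange(0.0, 1.0, 0.05)
cams = {"CAM_FRONT": np.arange(0.01, 1.0, 0.05), "CAM_BACK": np.arange(0.02, 1.0, 0.05)}
print(match_frames(lidar, cams)[0])
```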
Step 1-2: and obtaining an internal reference matrix of the camera and an external reference matrix from a radar coordinate system to a camera coordinate system by adopting a mature camera calibration method.
The specific flow of step 2 is as follows:
Step 2-1: define the voxel size and detection range. The detection range is manually set to [-54 m, 54 m] along the X and Y axes and [-5 m, 3 m] along the Z axis, and the voxel size is defined as (0.075 m, 0.2 m).
Step 2-2: and (5) voxelization of the point cloud. And rasterizing the point cloud in the space according to the defined voxel size to obtain a plurality of voxels. Secondly, sampling point clouds in each non-empty voxel, randomly selecting N point clouds, and if the point clouds are insufficient, complementing the point clouds with 0. Performing dimension-increasing operation on the sampled point cloud by using the fully connected neural network to obtain the point cloud characteristics in the voxel i, wherein the point cloud characteristics are as followsAnd obtaining the voxel characteristic V by using a maximum pooling method i ∈R C . Wherein C is the number of characteristic channels. In addition, 0-padding is also used for empty voxels where no point cloud exists.
Step 2-3: voxel features are extracted. Downsampling the voxel features obtained by the steps by using sparse three-dimensional convolution to obtain BEV features F B ∈R C×W×H And then obtaining multi-scale BEV features by using traditional two-dimensional convolution and obtaining final BEV features by fusing multi-scale information through FPN.
The specific flow of step 3 is as follows: first, a ResNet-50 feature extractor pre-trained on ImageNet extracts the features of the six images at the current time; then an FPN fuses the multi-scale information and outputs feature maps F_s ∈ R^(C_s×W_s×H_s) at different scales, where C_s, W_s and H_s denote the number of feature channels, the width and the height of the image features at scale s, respectively.
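A hedged sketch of step 3 is given below: ResNet-50 stages provide four feature maps per image and 1×1 lateral convolutions with top-down upsampling form a small FPN. The output channel width of 256, the input resolution and the use of randomly initialised weights (ImageNet-pretrained weights would be loaded in practice) are assumptions for illustration.

```python
# Sketch of step 3: extract multi-scale features for the six camera images with
# ResNet-50 stages and merge them with a simple FPN top-down pathway.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ImageFPN(nn.Module):
    def __init__(self, out_c=256):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)   # pretrained weights in practice
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_c, 1) for c in (256, 512, 1024, 2048)])

    def forward(self, imgs):                            # imgs: (6, 3, H, W)
        x, feats = self.stem(imgs), []
        for stage in self.stages:                       # C2 ... C5
            x = stage(x)
            feats.append(x)
        outs = [self.lateral[-1](feats[-1])]            # start from the coarsest level
        for lat, f in zip(self.lateral[-2::-1], feats[-2::-1]):
            up = F.interpolate(outs[0], size=f.shape[-2:], mode="nearest")
            outs.insert(0, lat(f) + up)                 # top-down merge
        return outs                                     # multi-scale maps F_s

imgs = torch.rand(6, 3, 224, 416)
print([tuple(f.shape) for f in ImageFPN()(imgs)])
```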
The specific flow of step 4 is as follows:
Step 4-1: dimensionality reduction. Because the point cloud branch is computationally heavy and time consuming, a 3×3 convolution is used to reduce the dimensionality of the BEV features in order to lower the cost of back-propagation, which saves computation in the subsequent stages and speeds up model inference.
Step 4-2: and predicting a result. The detection task is regarded as a set matching problem, a set of learnable object query vectors is given, the BEV features are decoded using a cross-attention mechanism, and the object query vectors are used as containers after the BEV features are decoded. Finally, the object query vector is input into the regression branch and the classification branch to obtain the detection result in the BEV space.
The specific flow of step 5 is as follows; the decoder of the interleaved fusion module is the decoder of step 4-2:
Step 5-1: initialization. The initialization is divided into two parts. First, the object query vectors of step 4-2 are reused as the object query vectors of the interleaved fusion module. In addition, as shown in equation (1), I three-dimensional boxes are obtained from the preliminary detection result, Z points P_rand are randomly sampled from the point cloud inside the boxes, and the mapping f_bp produces the d_pc-dimensional high-dimensional point cloud feature P_box. Then, as shown in equation (2), a max-pooling operation generates the enhancement vector P_e, which is added to the object query vectors; a sketch of this operation follows equations (1) and (2). Finally, the center points of the detection results of step 4-2 are taken as the "reference points" of the module.
P_box = f_bp(P_rand)  (1)
P_e = MaxPool(P_box)  (2)
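A minimal sketch of equations (1) and (2) follows, assuming Z = 32 sampled points per box, d_pc = 256 and a two-layer MLP as the mapping f_bp; these values and the class name are illustrative.

```python
# Sketch of the feature enhancement in step 5-1: P_box = f_bp(P_rand) per box,
# then P_e = MaxPool(P_box) is added to the matching object query vectors.
import torch
import torch.nn as nn

Z, D_PC = 32, 256

class BoxPointEncoder(nn.Module):
    def __init__(self, in_dim=3, d_pc=D_PC):
        super().__init__()
        self.f_bp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, d_pc))

    def forward(self, pts_in_boxes):           # (I, Z, 3): Z points sampled in each of I boxes
        p_box = self.f_bp(pts_in_boxes)        # equation (1), shape (I, Z, d_pc)
        return p_box.max(dim=1).values         # equation (2), shape (I, d_pc)

p_rand = torch.rand(200, Z, 3)                 # points randomly sampled inside I = 200 boxes
print(BoxPointEncoder()(p_rand).shape)         # torch.Size([200, 256])
```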
Step 5-2: and fusing the image features. First, as shown in equation (3), C is calculated using a calibration matrix composed of camera internal parameters and external parameters in Each center x of (a) i ,y i ,z i Projecting the two-dimensional center points u to a camera coordinate system to obtain corresponding two-dimensional center points u in the RGB image i ,v i Where d is a scale factor (expressed in the current condition as a depth value in the world coordinate system). Then, as shown in formula (4), the object query vector is sent to the full connection layer to generate six learnable biases Deltax based on the reference point lkp . Next, the sampling position r is calculated by the offset c +△x lkp Thereby bilinear interpolation sampling in the input multi-scale image feature mapSimultaneously, the weight A corresponding to the six points is generated by utilizing the full connection layer lkp . Finally, the characteristics of the six points are weighted and summed for updating the object query vector +.>Where M represents the number of interleaved encoder blocks. k is the index of the sample point, Δx lkp And A lkp Respectively representing the sampling offset and the attention weight of the kth sampling point in the ith feature layer of the jth camera.
Step 5-3: fusing point cloud features. First, the three-dimensional coordinates of the reference point are converted into coordinates at the BEV viewing angle. Second, by bilinear sampling algorithm F L And calculating to obtain BEV characteristics corresponding to the reference points. Finally, as shown in equation (5), the feature is used to update the object query vectorWhere k is the index of the sample point, deltax k And A k The sample offset and the attention weight of the kth sample point are represented, respectively.
Step 6: outputting the detection result. As shown in formulas (6) and (7), the query vectors are input into classifiers f composed of linear layers, respectively class And regression f reg Obtaining a classification prediction result P of the three-dimensional target class And locating the prediction result P res 。
Step 7: when the laser radar is affected by strong light or reflective materials, distortion of laser radar data and loss of point clouds can be caused, and the quality of BEV characteristics is finally affected. In this case, the interlacing fusion module in step 5 will adaptively adjust the attention weights in step 5-2 and step 5-3, reduce the update weight of the BEV feature on the query vector in step 5-3, and increase the image feature update weight in step 5-2, thereby increasing the weight of the image modality in the model prediction process. The step utilizes abundant semantic information in the image space to compensate negative effects caused by laser radar modal data distortion, so that the prediction accuracy of the model working in a laser radar signal noise environment is improved.
The embodiments described in this specification are merely examples of implementations of the inventive concept. The protection scope of the present invention should not be construed as being limited to the specific forms set forth in the embodiments; it also covers equivalent technical means that can be conceived by those skilled in the art based on the inventive concept.
Claims (2)
1. A multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene, comprising the following specific steps:
step 1: acquiring current frame point cloud data and image data;
step 2: inputting the point cloud into a point cloud feature extraction network and converting the current-frame point cloud into bird's eye view (BEV) features, which is divided into three steps:
step 2-1: defining the voxel size and detection range; the detection range is manually set to [-54 m, 54 m] along the X and Y axes and [-5 m, 3 m] along the Z axis, and the voxel size is defined as (0.075 m, 0.2 m);
step 2-2: voxelizing the point cloud; the points in space are rasterized according to the defined voxel size to obtain a set of voxels; within every non-empty voxel, N points are randomly sampled, and the voxel is padded with zeros if fewer than N points exist; a fully connected neural network lifts the sampled points to a higher dimension to obtain the per-point features inside voxel i, and max pooling yields the voxel feature V_i ∈ R^C, where C is the number of feature channels; empty voxels that contain no points are likewise filled with zeros;
step 2-3: extracting voxel features; the voxel features obtained above are downsampled with sparse three-dimensional convolutions to obtain the BEV feature F_B ∈ R^(C×W×H); conventional two-dimensional convolutions then produce multi-scale BEV features, and an FPN fuses the multi-scale information into the final BEV feature;
step 3: inputting the image data into the image feature extraction network ResNet-50 to extract the features of the six images at the current time, and then fusing the multi-scale information with an FPN to obtain feature maps at different scales;
step 4: inputting the BEV features into a BEV detection module to obtain a preliminary three-dimensional detection result, which is specifically divided into:
step 4-1: dimensionality reduction; a 3×3 convolution reduces the dimensionality of the BEV features to save computation;
step 4-2: result prediction; the detection task is treated as a set-matching problem: given a group of learnable object query vectors, the BEV features are decoded with a cross-attention mechanism, and the object query vectors serve as containers for the decoded BEV features; finally, the object query vectors are fed into the regression branch and the classification branch to obtain the detection result in BEV space;
step 5: feeding the point cloud and image features output in steps 2 and 3, together with the preliminary detection result output in step 4, into an interleaved fusion module to adaptively fuse the image and point cloud features; and refining the preliminary detection result of step 4 with the fused result.
2. The multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene according to claim 1, wherein the specific flow of step 5 is as follows:
step 5-1: initialization; the initialization is divided into two parts: first, the object query vectors of step 4-2 are used as the object query vectors of the interleaved fusion module, and the points inside the preliminary detection boxes are sampled to generate an enhancement vector that is added to the object query vectors; second, the center points of the detection results of step 4-2 are taken as the reference points of the module;
step 5-2: fusing image features; the object query vector is first fed into a fully connected layer to generate six learnable offsets relative to the reference point; the sampling positions computed from these offsets are used for bilinear interpolation sampling in the input multi-scale image feature maps, while the weights corresponding to the six points are generated with a fully connected layer; finally, the features of the six points are weighted and summed to update the object query vector;
step 5-3: fusing point cloud features; the three-dimensional coordinates of the reference point are first converted into coordinates in the BEV view; the BEV feature at the reference point is then obtained by bilinear sampling; finally, the object query vector is updated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310357033.0A CN116486368A (en) | 2023-04-03 | 2023-04-03 | Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310357033.0A CN116486368A (en) | 2023-04-03 | 2023-04-03 | Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116486368A true CN116486368A (en) | 2023-07-25 |
Family
ID=87214767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310357033.0A Pending CN116486368A (en) | 2023-04-03 | 2023-04-03 | Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116486368A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116740668A (en) * | 2023-08-16 | 2023-09-12 | 之江实验室 | Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium |
CN116740668B (en) * | 2023-08-16 | 2023-11-14 | 之江实验室 | Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium |
CN117058646A (en) * | 2023-10-11 | 2023-11-14 | 南京工业大学 | Complex road target detection method based on multi-mode fusion aerial view |
CN117058646B (en) * | 2023-10-11 | 2024-02-27 | 南京工业大学 | Complex road target detection method based on multi-mode fusion aerial view |
CN117392393A (en) * | 2023-12-13 | 2024-01-12 | 安徽蔚来智驾科技有限公司 | Point cloud semantic segmentation method, computer equipment, storage medium and intelligent equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113128348B (en) | Laser radar target detection method and system integrating semantic information | |
CN111339977B (en) | Small target intelligent recognition system based on remote video monitoring and recognition method thereof | |
CN110570429B (en) | Lightweight real-time semantic segmentation method based on three-dimensional point cloud | |
CN116486368A (en) | Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene | |
CN111046781B (en) | Robust three-dimensional target detection method based on ternary attention mechanism | |
CN112731436B (en) | Multi-mode data fusion travelable region detection method based on point cloud up-sampling | |
CN111292366B (en) | Visual driving ranging algorithm based on deep learning and edge calculation | |
CN116129233A (en) | Automatic driving scene panoramic segmentation method based on multi-mode fusion perception | |
CN115019043B (en) | Cross-attention mechanism-based three-dimensional object detection method based on image point cloud fusion | |
CN116612468A (en) | Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism | |
CN117274749B (en) | Fused 3D target detection method based on 4D millimeter wave radar and image | |
CN117058646B (en) | Complex road target detection method based on multi-mode fusion aerial view | |
CN114495064A (en) | Monocular depth estimation-based vehicle surrounding obstacle early warning method | |
CN114639115B (en) | Human body key point and laser radar fused 3D pedestrian detection method | |
CN117975436A (en) | Three-dimensional target detection method based on multi-mode fusion and deformable attention | |
CN117173399A (en) | Traffic target detection method and system of cross-modal cross-attention mechanism | |
CN117237919A (en) | Intelligent driving sensing method for truck through multi-sensor fusion detection under cross-mode supervised learning | |
CN117111055A (en) | Vehicle state sensing method based on thunder fusion | |
CN116310673A (en) | Three-dimensional target detection method based on fusion of point cloud and image features | |
CN116486396A (en) | 3D target detection method based on 4D millimeter wave radar point cloud | |
CN115100741B (en) | Point cloud pedestrian distance risk detection method, system, equipment and medium | |
CN115115917A (en) | 3D point cloud target detection method based on attention mechanism and image feature fusion | |
CN115082902B (en) | Vehicle target detection method based on laser radar point cloud | |
CN117372697A (en) | Point cloud segmentation method and system for single-mode sparse orbit scene | |
CN116778449A (en) | Detection method for improving detection efficiency of three-dimensional target of automatic driving |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |