CN116486368A - Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene - Google Patents

Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene

Info

Publication number
CN116486368A
CN116486368A (application CN202310357033.0A)
Authority
CN
China
Prior art keywords
point cloud
features
bev
image
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310357033.0A
Other languages
Chinese (zh)
Inventor
禹鑫燚
杨阳
陈昊
沈春华
欧林林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310357033.0A
Publication of CN116486368A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a robust multi-modal fusion three-dimensional object detection method for autonomous driving scenes, comprising the following steps. Step 1: acquire the point cloud data and image data of the current frame. Step 2: input the point cloud into a point cloud feature extraction network and convert the current-frame point cloud into bird's-eye-view (BEV) features. Step 3: input the image data into an image feature extraction network to obtain multi-scale image features. Step 4: input the BEV features into a BEV detection module to obtain a preliminary three-dimensional detection result. Step 5: feed the point cloud and image features output by steps 2 and 3, together with the preliminary detection result from step 4, into an interleaved fusion module that adaptively fuses the image and point cloud features; the fused result is then used to refine the preliminary detection result of step 4. Compared with existing methods, the invention achieves modal complementarity through the interleaved fusion architecture, shows better robustness under various kinds of lidar noise, and has stronger recall capability, thereby improving detection accuracy.

Description

Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene
Technical Field
The invention relates to computer vision and pattern recognition technology, and in particular to a robust multi-modal fusion three-dimensional object detection method for autonomous driving scenes.
Background
With the rapid development of sensor technology and deep learning, three-dimensional object detection based on multi-modal fusion has improved remarkably. Object detection methods based on a single sensor have inherent shortcomings in complex everyday scenarios due to the imaging characteristics of the sensor. An optical camera provides dense, rich instance-level information under good illumination and reflects the texture and color of objects. However, under poor lighting conditions, such as at night or in rain and fog, the imaging quality of an optical camera degrades sharply, so under these conditions purely vision-based three-dimensional detection models generally cannot achieve satisfactory accuracy. Lidar is a laser sensing technology that captures objects and three-dimensional surfaces in space; its advantage over optical cameras is that it obtains high-accuracy range information and is not limited by lighting conditions. However, compared with the regular grid structure of an RGB image, lidar point cloud data is sparse and unordered, so the convolutional neural networks used in conventional two-dimensional object detection cannot extract point cloud features effectively.
To obtain accurate three-dimensional detection results, the literature (Vora S, Lang A H, Helou B, et al. PointPainting: Sequential fusion for 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 4604-4612.) proposes using RGB image features as a prior to establish the correlation between point cloud data and the RGB image. First, semantic segmentation results in image space are obtained by a two-dimensional image semantic segmentation network; the segmentation results are then projected into three-dimensional space with the camera projection matrix to supplement the point cloud features, and finally fed into a point-cloud-based three-dimensional object detector to obtain the detection results. The literature (Xu S, Zhou D, Fang J, et al. FusionPainting: Multimodal fusion with adaptive attention for 3D object detection [C]// 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021: 3047-3054.) uses semantic segmentation networks for both the image and the point cloud to obtain segmentation results of both modalities as priors, and an adaptive fusion module fuses the dense image information into the point cloud information to obtain the detection results, alleviating the boundary-blurring problem caused by the segmentation network. These methods, which fuse image and point cloud based on semantic segmentation results, improve detection accuracy, but they rely too heavily on point cloud features and do not fully exploit the dense image features; if the point cloud contains noise, the overall result is strongly affected. To increase the weight of image features in the multi-modal fusion model, the invention proposes a multi-modal fusion architecture based on interleaved fusion.
Disclosure of Invention
The invention provides a robust multi-modal fusion three-dimensional object detection method for autonomous driving scenes, which aims to overcome the over-reliance of the prior art on point cloud features.
The invention uses a rotating lidar and six optical cameras surrounding the vehicle as sensors, acquiring three-dimensional point cloud information and RGB image information respectively. The geometric information of objects in the vehicle's surroundings is thus perceived effectively and can be used to adaptively recognize targets in different road scenes; common targets include pedestrians, roadblocks, cars, engineering vehicles, buses, and the like.
The robust multi-modal fusion three-dimensional object detection method for autonomous driving scenes comprises the following steps:
Step 1: acquire the point cloud data and image data of the current frame.
Step 2: input the point cloud into a point cloud feature extraction network and convert the current-frame point cloud into bird's-eye-view (BEV) features.
Step 3: input the image data into an image feature extraction network to obtain multi-scale image features.
Step 4: input the BEV features into the BEV detection module to obtain a preliminary three-dimensional detection result.
Step 5: feed the point cloud and image features output by steps 2 and 3, together with the preliminary detection result from step 4, into the interleaved fusion module, which adaptively fuses the image and point cloud features. The fused result is used to refine the preliminary detection result of step 4.
The specific flow of step 2 is as follows:
Step 2-1: define the voxel size and detection range. The detection range is manually set to [-54 m, 54 m] along the X and Y axes and [-5 m, 3 m] along the Z axis, and the voxel size is defined as (0.075 m, 0.2 m).
Step 2-2: voxelize the point cloud. The point cloud in space is rasterized according to the defined voxel size to obtain a set of voxels. Next, the points in each non-empty voxel are sampled: N points are selected at random, and if there are fewer than N points the remainder is padded with zeros. A fully connected neural network then lifts the sampled points to a higher dimension, yielding per-point features of dimension C within voxel i, and max pooling over these features gives the voxel feature V_i ∈ R^C, where C is the number of feature channels. Empty voxels containing no points are likewise zero-padded.
Step 2-3: extract voxel features. The voxel features obtained above are downsampled with sparse three-dimensional convolutions to obtain the BEV feature F_B ∈ R^(C×W×H); multi-scale BEV features are then produced with conventional two-dimensional convolutions, and the final BEV feature is obtained by fusing the multi-scale information through an FPN.
The specific flow of step 3 is as follows: first, the features of the six camera images at the current time are extracted with a ResNet-50. An FPN then fuses the multi-scale information and outputs feature maps at the different scales.
The specific flow of step 4 is as follows:
Step 4-1: dimension reduction. The BEV features are reduced in dimension with a 3×3 convolution to save computation.
Step 4-2: predict the result. The detection task is treated as a set matching problem: given a group of learnable object query vectors, the BEV features are decoded with a cross-attention mechanism, and the object query vectors serve as containers for the decoded BEV features. Finally, the object query vectors are fed into the regression and classification branches to obtain the detection result in BEV space.
The specific flow of step 5 is as follows; the decoder in the interleaved fusion module is the decoder from step 4-2.
Step 5-1: initialization. The initialization proceeds in two steps. First, the object query vectors from step 4-2 are used as the object query vectors of the interleaved fusion module, and the points inside the preliminary detection boxes are sampled to generate an enhancement vector that is added to the object query vectors. Second, the center points of the detection results from step 4-2 are taken as the "reference points" of this module.
Step 5-2: fuse the image features. The object query vectors are first fed into a fully connected layer to generate six learnable offsets relative to the reference points. The sampling positions are then computed from these offsets, and bilinear interpolation is used to sample the input multi-scale image feature maps at those positions. At the same time, a fully connected layer generates the weights corresponding to the six points. Finally, the features of the six points are weighted and summed to update the object query vectors.
Step 5-3: fuse the point cloud features. First, the three-dimensional coordinates of the reference points are converted into coordinates in the BEV view. Second, the BEV features corresponding to the reference points are computed by bilinear sampling. Finally, the object query vectors are updated.
The advantages of the invention are as follows:
1. The invention provides a novel lidar-camera fusion model for three-dimensional object detection. Modal complementarity is achieved through the interleaved fusion architecture, which shows better robustness under various kinds of lidar noise.
2. The invention provides a pluggable feature enhancement operation: by adding enhancement vectors generated from the points inside the initial detection boxes, the network is encouraged to learn difficult samples, improving the perception results.
3. Compared with conventional fusion methods, the invention better mines and exploits dense image features, giving the network stronger recall capability, reducing missed detections of three-dimensional targets, and better ensuring driving safety.
Drawings
Fig. 1 is a schematic overall flow diagram of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawing; the flow chart of the invention is shown in Fig. 1.
The robust multi-modal fusion three-dimensional object detection method for autonomous driving scenes makes better use of the dense semantic information in the image, achieving better robustness and higher detection accuracy. The specific steps are as follows:
Step 1: acquire the point cloud data and image data of the current frame.
Step 2: input the point cloud into a point cloud feature extraction network and convert the current-frame point cloud into bird's-eye-view (BEV) features.
Step 3: input the image data into an image feature extraction network to obtain multi-scale image features.
Step 4: input the BEV features into the BEV detection module to obtain a preliminary three-dimensional detection result.
Step 5: feed the image and point cloud features output by steps 2 and 3, together with the preliminary detection result from step 4, into the interleaved fusion module, which adaptively fuses the image and point cloud features. The fused result is used to refine the preliminary detection result of step 4.
The specific steps of step 1 are as follows:
Step 1-1: the trigger frequency of the lidar and the six on-board cameras is manually set to 20 Hz. The corresponding point cloud data and image data are matched by timestamp and used as the model input.
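As a minimal illustration of this synchronization step (not part of the patent text), corresponding lidar sweeps and camera frames can be paired by nearest timestamp; the function name and the 25 ms tolerance below are assumptions chosen to match a 20 Hz trigger rate.

```python
# Hypothetical sketch: pair each lidar sweep with the closest camera frame by timestamp.
# camera_stamps must be sorted ascending; the tolerance is half the 20 Hz period.
from bisect import bisect_left

def match_by_timestamp(lidar_stamps, camera_stamps, tol_s=0.025):
    """Return (lidar_idx, camera_idx) pairs whose timestamps differ by at most tol_s seconds."""
    pairs = []
    for i, t in enumerate(lidar_stamps):
        j = bisect_left(camera_stamps, t)
        # candidate neighbors: the frame just before and just after t
        candidates = [k for k in (j - 1, j) if 0 <= k < len(camera_stamps)]
        if not candidates:
            continue
        best = min(candidates, key=lambda idx: abs(camera_stamps[idx] - t))
        if abs(camera_stamps[best] - t) <= tol_s:
            pairs.append((i, best))
    return pairs

# Example: two 20 Hz streams with a small clock offset.
lidar = [0.00, 0.05, 0.10]
cams = [0.01, 0.06, 0.11]
print(match_by_timestamp(lidar, cams))  # [(0, 0), (1, 1), (2, 2)]
```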
Step 1-2: and obtaining an internal reference matrix of the camera and an external reference matrix from a radar coordinate system to a camera coordinate system by adopting a mature camera calibration method.
The specific steps of step 2 are as follows:
Step 2-1: define the voxel size and detection range. The detection range is manually set to [-54 m, 54 m] along the X and Y axes and [-5 m, 3 m] along the Z axis, and the voxel size is defined as (0.075 m, 0.2 m).
Step 2-2: and (5) voxelization of the point cloud. And rasterizing the point cloud in the space according to the defined voxel size to obtain a plurality of voxels. Secondly, sampling point clouds in each non-empty voxel, randomly selecting N point clouds, and if the point clouds are insufficient, complementing the point clouds with 0. Performing dimension-increasing operation on the sampled point cloud by using the fully connected neural network to obtain the point cloud characteristics in the voxel i, wherein the point cloud characteristics are as followsAnd obtaining the voxel characteristic V by using a maximum pooling method i ∈R C . Wherein C is the number of characteristic channels. In addition, 0-padding is also used for empty voxels where no point cloud exists.
Step 2-3: voxel features are extracted. Downsampling the voxel features obtained by the steps by using sparse three-dimensional convolution to obtain BEV features F B ∈R C×W×H And then obtaining multi-scale BEV features by using traditional two-dimensional convolution and obtaining final BEV features by fusing multi-scale information through FPN.
The specific flow of step 3 is as follows: first, the features of the six camera images at the current time are extracted with a ResNet-50 feature extractor pre-trained on ImageNet. An FPN then fuses the multi-scale information and outputs, for each scale s, a feature map of size C_s × W_s × H_s, where C_s, W_s, and H_s denote the number of feature channels, the image width, and the image height at scale s, respectively.
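A possible realization of the image branch using torchvision's ResNet-50 and FeaturePyramidNetwork (a recent torchvision is assumed); the 256-channel FPN width and the stage names are illustrative choices, not specified by the patent.

```python
# Hypothetical sketch of step 3: extract multi-scale features from each of the six
# camera images with a ResNet-50 and fuse them with an FPN.
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import FeaturePyramidNetwork

class ImageBranch(nn.Module):
    def __init__(self, fpn_channels=256, pretrained=False):
        super().__init__()
        weights = "IMAGENET1K_V1" if pretrained else None   # ImageNet pre-training, as in the patent
        resnet = torchvision.models.resnet50(weights=weights)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layers = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], fpn_channels)

    def forward(self, images):                              # (B*6, 3, H, W): six views per sample
        x = self.stem(images)
        feats = {}
        for i, layer in enumerate(self.layers):
            x = layer(x)
            feats[f"c{i + 2}"] = x                          # C2..C5 backbone feature maps
        return self.fpn(feats)                              # dict of multi-scale maps, each fpn_channels deep

# Usage: imgs = torch.randn(6, 3, 256, 704); {k: v.shape for k, v in ImageBranch()(imgs).items()}
```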
The specific flow of step 4 is as follows:
Step 4-1: dimension reduction. Because the point cloud branch is computationally heavy and time-consuming, and in order to reduce the cost of back-propagation, the method applies a 3×3 convolution to reduce the dimension of the BEV features, saving computation in the subsequent stages and accelerating model inference.
Step 4-2: and predicting a result. The detection task is regarded as a set matching problem, a set of learnable object query vectors is given, the BEV features are decoded using a cross-attention mechanism, and the object query vectors are used as containers after the BEV features are decoded. Finally, the object query vector is input into the regression branch and the classification branch to obtain the detection result in the BEV space.
The specific flow of step 5 is as follows; the decoder in the interleaved fusion module is the decoder from step 4-2.
Step 5-1: initialization. The initialization proceeds in two steps. First, the object query vectors from step 4-2 are used as the object query vectors of the interleaved fusion module. In addition, as shown in formula (1), I three-dimensional boxes are obtained from the preliminary detection result, Z points P_rand are randomly sampled from the points inside the boxes, and the mapping f_bp is applied to obtain the d_pc-dimensional high-dimensional point cloud feature P_box. Then, as shown in formula (2), an enhancement vector P_e is generated by a max-pooling operation and added to the object query vectors. Finally, the center points of the detection results from step 4-2 are taken as the "reference points" of this module.
P_box = f_bp(P_rand)    (1)
P_e = MaxPool(P_box)    (2)
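A hedged sketch of formulas (1) and (2), where the mapping f_bp is assumed to be a small MLP and an extra projection aligns P_e with the query dimension; both are illustrative assumptions, not details given by the patent.

```python
# Hypothetical sketch: sample Z points inside each preliminary box, lift them with a
# mapping f_bp (here a small MLP) to d_pc dimensions, and max-pool into an enhancement
# vector P_e that is added to the object query vectors.
import torch
import torch.nn as nn

class QueryEnhancer(nn.Module):
    def __init__(self, point_dim=4, d_pc=128, d_query=128):
        super().__init__()
        self.f_bp = nn.Sequential(nn.Linear(point_dim, d_pc), nn.ReLU(),
                                  nn.Linear(d_pc, d_pc))             # mapping f_bp
        self.to_query = nn.Linear(d_pc, d_query)                     # align P_e with the query dimension

    def forward(self, queries, sampled_points):
        """queries: (B, I, d_query); sampled_points: (B, I, Z, point_dim), Z points per box."""
        p_box = self.f_bp(sampled_points)                            # formula (1): P_box = f_bp(P_rand)
        p_e = p_box.max(dim=2).values                                # formula (2): P_e = MaxPool(P_box)
        return queries + self.to_query(p_e)                          # enhancement added to the queries
```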
Step 5-2: fuse the image features. First, as shown in formula (3), each reference-point center (x_i, y_i, z_i) is projected into the camera coordinate system using the calibration matrix composed of the camera intrinsic and extrinsic parameters, giving the corresponding two-dimensional center point (u_i, v_i) in the RGB image, where d is a scale factor (under the current conditions it corresponds to the depth value in the world coordinate system). Then, as shown in formula (4), the object query vectors are fed into a fully connected layer to generate six learnable offsets Δx_lkp relative to the reference point. Next, the sampling positions r_c + Δx_lkp are computed from these offsets, and bilinear interpolation is used to sample the input multi-scale image feature maps at those positions; at the same time, a fully connected layer generates the attention weights A_lkp corresponding to the six points. Finally, the features of the six points are weighted and summed to update the object query vectors. Here M denotes the number of interleaved encoder blocks, k is the index of the sampling point, and Δx_lkp and A_lkp denote, respectively, the sampling offset and the attention weight of the k-th sampling point in the l-th feature layer of the corresponding camera.
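A simplified sketch of this image fusion step for a single camera and a single image scale, assuming the projected reference points are already normalized to [0, 1]; extending it over six cameras and all FPN levels (the per-level and per-camera indices) is omitted for brevity, and the softmax normalization of the weights is an illustrative choice.

```python
# Hypothetical sketch of step 5-2: for each query, predict six sampling offsets around
# its projected reference point, bilinearly sample the image feature map at those
# positions, and take an attention-weighted sum to update the query.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageCrossFusion(nn.Module):
    def __init__(self, d_model=128, c_img=256, num_points=6):
        super().__init__()
        self.num_points = num_points
        self.offset_fc = nn.Linear(d_model, num_points * 2)          # learnable offsets around the reference
        self.weight_fc = nn.Linear(d_model, num_points)              # attention weight per sampled point
        self.proj = nn.Linear(c_img, d_model)

    def forward(self, queries, ref_uv, img_feat):
        """queries: (B, Q, d_model); ref_uv: (B, Q, 2) projected reference points in [0, 1];
        img_feat: (B, c_img, H, W), one scale of the image FPN output."""
        b, q, _ = queries.shape
        offsets = self.offset_fc(queries).view(b, q, self.num_points, 2)
        weights = self.weight_fc(queries).softmax(dim=-1)            # (B, Q, num_points)
        pos = (ref_uv.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1    # sampling positions in [-1, 1]
        sampled = F.grid_sample(img_feat, pos, align_corners=False)  # (B, c_img, Q, num_points)
        sampled = sampled.permute(0, 2, 3, 1)                        # (B, Q, num_points, c_img)
        fused = (weights.unsqueeze(-1) * self.proj(sampled)).sum(dim=2)
        return queries + fused                                       # weighted sum updates the queries
```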
Step 5-3: fuse the point cloud features. First, the three-dimensional coordinates of the reference points are converted into coordinates in the BEV view. Second, the BEV features corresponding to the reference points are computed by the bilinear sampling algorithm F_L. Finally, as shown in formula (5), these features are used to update the object query vectors, where k is the index of the sampling point, and Δx_k and A_k denote the sampling offset and the attention weight of the k-th sampling point, respectively.
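A sketch of the BEV sampling in this step under the detection range of step 2-1; the output projection layer and the module name are illustrative additions rather than the patent's exact design.

```python
# Hypothetical sketch of step 5-3: convert reference points to BEV-plane coordinates,
# bilinearly sample the BEV feature map there, and add the result to the queries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVCrossFusion(nn.Module):
    def __init__(self, c_bev=128, d_model=128, pc_range=(-54.0, -54.0, 54.0, 54.0)):
        super().__init__()
        self.pc_range = pc_range                                  # (x_min, y_min, x_max, y_max)
        self.proj = nn.Linear(c_bev, d_model)

    def forward(self, queries, ref_xyz, bev_feat):
        """queries: (B, Q, d_model); ref_xyz: (B, Q, 3) reference points in metres;
        bev_feat: (B, c_bev, H, W), the BEV feature map."""
        x_min, y_min, x_max, y_max = self.pc_range
        # drop z and normalize x, y into [-1, 1] for grid_sample
        gx = (ref_xyz[..., 0] - x_min) / (x_max - x_min) * 2 - 1
        gy = (ref_xyz[..., 1] - y_min) / (y_max - y_min) * 2 - 1
        grid = torch.stack([gx, gy], dim=-1).unsqueeze(2)         # (B, Q, 1, 2)
        sampled = F.grid_sample(bev_feat, grid, align_corners=False)  # (B, c_bev, Q, 1)
        sampled = sampled.squeeze(-1).transpose(1, 2)             # (B, Q, c_bev)
        return queries + self.proj(sampled)                       # update the object queries
```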
Step 6: output the detection result. As shown in formulas (6) and (7), the query vectors are respectively fed into a classifier f_class and a regressor f_reg composed of linear layers to obtain the classification prediction P_class and the localization prediction P_res of the three-dimensional targets.
Step 7: when the lidar is affected by strong light or reflective materials, the lidar data can be distorted and points can be lost, which ultimately degrades the quality of the BEV features. In this case, the interleaved fusion module of step 5 adaptively adjusts the attention weights in steps 5-2 and 5-3: it reduces the weight with which the BEV features of step 5-3 update the query vectors and increases the image feature update weight of step 5-2, thereby raising the weight of the image modality in the model's prediction process. This step uses the rich semantic information in image space to compensate for the negative effects of distorted lidar data, improving the prediction accuracy of the model when it operates in a noisy lidar environment.
The embodiments described in this specification are merely examples of implementations of the inventive concept. The protection scope of the present invention should not be construed as being limited to the specific forms set forth in the embodiments; it also covers equivalent technical means that can be conceived by those skilled in the art based on the inventive concept.

Claims (2)

1. A robust multi-modal fusion three-dimensional object detection method for autonomous driving scenes, comprising the following specific steps:
step 1: acquiring current frame point cloud data and image data;
step 2: inputting the point cloud into a point cloud feature extraction network, and converting the point cloud of the current frame into bird's-eye-view (BEV) features; this step can be divided into three sub-steps:
step 2-1: defining the voxel size and the detection range; the detection range is manually set to [-54 m, 54 m] along the X and Y axes and [-5 m, 3 m] along the Z axis, and the voxel size is defined as (0.075 m, 0.2 m);
step 2-2: voxelizing the point cloud; rasterizing the point cloud in space according to the defined voxel size to obtain a set of voxels; sampling the points in each non-empty voxel by randomly selecting N points and padding with zeros if there are fewer than N points; lifting the sampled points with a fully connected neural network to obtain per-point features of dimension C within voxel i, and obtaining the voxel feature V_i ∈ R^C by max pooling, where C is the number of feature channels; empty voxels containing no points are likewise zero-padded;
step 2-3: extracting voxel features; downsampling the voxel features obtained above with sparse three-dimensional convolutions to obtain the BEV feature F_B ∈ R^(C×W×H), then obtaining multi-scale BEV features with conventional two-dimensional convolutions, and obtaining the final BEV feature by fusing the multi-scale information through an FPN;
step 3: firstly, inputting the image data into the image feature extraction network ResNet-50 to extract the features of the six images at the current time; then fusing the multi-scale information with an FPN to obtain feature maps at different scales;
step 4: inputting the bird's-eye-view features into the bird's-eye-view detection module to obtain a preliminary three-dimensional detection result; this step can be divided into:
step 4-1: dimension reduction; performing dimension reduction on the BEV features with a 3×3 convolution to save computation;
step 4-2: predicting the result; treating the detection task as a set matching problem, giving a group of learnable object query vectors, and decoding the BEV features with a cross-attention mechanism, the object query vectors serving as containers for the decoded BEV features; finally, inputting the object query vectors into the regression branch and the classification branch to obtain the detection result in BEV space;
step 5: sending the point cloud and image features output in steps 2 and 3 and the preliminary detection result output in step 4 to the interleaved fusion module so as to adaptively fuse the image features and the point cloud features; and fine-tuning the preliminary detection result of step 4 with the fused result.
2. The robust multi-modal fusion three-dimensional object detection method for autonomous driving scenes of claim 1, wherein the specific flow of step 5 is as follows:
step 5-1: initialization; the initialization is divided into two steps: firstly, using the object query vectors from step 4-2 as the object query vectors of the interleaved fusion module, and sampling the points inside the preliminary detection boxes to generate an enhancement vector that is added to the object query vectors; secondly, taking the center points of the detection results from step 4-2 as the reference points of this module;
step 5-2: fusing the image features; firstly, sending the object query vectors into a fully connected layer to generate six learnable offsets relative to the reference points; secondly, computing the sampling positions from these offsets and sampling the input multi-scale image feature maps by bilinear interpolation; simultaneously, generating the weights corresponding to the six points with a fully connected layer; finally, weighting and summing the features of the six points to update the object query vectors;
step 5-3: fusing the point cloud features; firstly, converting the three-dimensional coordinates of the reference points into coordinates in the BEV view; secondly, obtaining the BEV features corresponding to the reference points by bilinear sampling; finally, updating the object query vectors.
CN202310357033.0A 2023-04-03 2023-04-03 Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene Pending CN116486368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310357033.0A CN116486368A (en) 2023-04-03 2023-04-03 Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310357033.0A CN116486368A (en) 2023-04-03 2023-04-03 Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene

Publications (1)

Publication Number Publication Date
CN116486368A true CN116486368A (en) 2023-07-25

Family

ID=87214767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310357033.0A Pending CN116486368A (en) 2023-04-03 2023-04-03 Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene

Country Status (1)

Country Link
CN (1) CN116486368A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740668A (en) * 2023-08-16 2023-09-12 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN116740668B (en) * 2023-08-16 2023-11-14 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN117058646A (en) * 2023-10-11 2023-11-14 南京工业大学 Complex road target detection method based on multi-mode fusion aerial view
CN117058646B (en) * 2023-10-11 2024-02-27 南京工业大学 Complex road target detection method based on multi-mode fusion aerial view
CN117392393A (en) * 2023-12-13 2024-01-12 安徽蔚来智驾科技有限公司 Point cloud semantic segmentation method, computer equipment, storage medium and intelligent equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination