CN117333749A - Multimode hybrid automatic driving unified 3D detection and tracking method - Google Patents
- Publication number
- CN117333749A (application number CN202311382428.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10032—Satellite or aerial image; Remote sensing
- G06T2207/10044—Radar image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a multimodal hybrid unified 3D detection and tracking method for automatic driving, belonging to the technical field of automatic driving. The method mainly comprises the following steps: 1. generating BEV features for each modality; 2. generating an adaptively fused BEV feature; 3. generating a single-frame 3D target detection result; 4. generating a single-frame 3D target tracking result; 5. iterating the target tracking result from frame to frame. With the unified 3D detection and tracking method of the invention, data from different sensors can be fused into a unified BEV feature, and 3D target detection and 3D target tracking are unified into a single model. Compared with separate detection and tracking models, the unified model improves real-time performance, accuracy and robustness, thereby improving the performance and safety of the automatic driving system, while also reducing the training cost and deployment difficulty of the model.
Description
Technical Field
The invention belongs to the technical field of automatic driving, and particularly relates to a multimode hybrid automatic driving unified 3D detection and tracking method.
Background
With the rapid development of automatic driving technology, the realization of efficient and accurate target detection and tracking is critical to the safety and performance of an automatic driving system. Traditional target detection and tracking methods are based primarily on single modality data, such as images or point clouds. However, single modality data may have some limitations in some scenarios, for example, in complex environments, the image may be affected by illumination, occlusion, etc., while the point cloud may be limited by factors such as sensor resolution and noise.
To overcome the limitations of single modality data, researchers have begun exploring the mix of multi-modality data. Multimodal data is typically collected by different sensors, such as cameras and lidar. The image data provides rich visual information, while the point cloud data provides accurate geometric information. By mixing the image and the point cloud data, more comprehensive and accurate target detection and tracking results can be obtained.
Conventional target detection and tracking methods are usually performed separately, and target detection is performed first, and then target tracking is performed. However, this manner of separation may lead to loss and inconsistency of information. In order to achieve more accurate and consistent target detection and tracking, a unified fusion method needs to be provided. The method carries out joint modeling on the target detection and tracking process, and realizes the tight combination of target detection and tracking by sharing the characteristics and the context information, thereby improving the accuracy and the stability of the detection and the tracking.
The Transformer is a powerful deep learning model that has achieved remarkable results in natural language processing, computer vision and other fields. By exploiting the Transformer's self-attention mechanism and global context modeling capability, the relations and dependencies between multi-modal data can be better captured, thereby improving target detection and tracking performance.
Disclosure of Invention
The invention aims to overcome the defects of single-modality input and to fully exploit the advantages of unifying target detection and target tracking. It provides a Transformer-based unified 3D detection and tracking method for automatic driving that mixes the point-cloud and image modalities. The method reduces training cost and deployment difficulty, and by fully exploiting the correlation between the detection and tracking tasks, the two tasks mutually improve each other's performance.
The technical scheme adopted by the invention is as follows:
a multimode hybrid automatic driving unified 3D detection and tracking method comprises the following steps:
step (1), inputting multi-mode data acquired by a laser radar and a camera from an automatic driving system, and respectively extracting BEV characteristics under different modes;
step (2), learning adaptive fusion weights from the BEV features of the different modalities obtained in step (1), and fusing the BEV features of the different modalities with these adaptive fusion weights to generate an adaptively fused BEV feature;
step (3), encoding the adaptively fused BEV feature obtained in step (2) with a Transformer encoder to obtain the encoded feature of the current frame; meanwhile, passing the adaptively fused BEV feature through a candidate region generation network to complete the 3D target detection task of the current frame and generate a series of 3D candidate boxes of the current frame;
step (4), splicing the series of 3D candidate boxes of the current frame with the processed target tracking result of the previous frame, and inputting the splicing result together with the encoded feature of the current frame into a Transformer decoder to obtain the initial target tracking result of the current frame;
and (5) generating a processed target tracking result of the current frame by using the initial target tracking result of the current frame acquired in the step (4), and finally outputting the target tracking result of the whole multi-frame input through continuous iteration between frames.
Further, the step (5) includes:
step (5.1), dividing the initial target tracking result of the current frame obtained in step (4) into a new object set Q_new, corresponding to the 3D target detection output of the current frame, and an old object set Q_old, corresponding to the processed target tracking result of the previous frame;
Step (5.2) collecting the new objects obtained in step (5.1)And old object set->Respectively judging, and removing objects which do not meet the requirements from the collection;
step (5.3), merging the processed new object set Q_new and old object set Q_old to generate the processed target tracking result of the current frame.
Further, in step (5.2): for the new object set Q_new, if the detection confidence of an object is greater than the set threshold, the object is kept, otherwise it is removed from Q_new; for the old object set Q_old, if the detection confidence of an object remains below the set threshold for 3 consecutive frames, the object is removed from Q_old, otherwise it is kept.
The invention has the beneficial effects that:
the invention designs a complete multimode hybrid automatic driving unified 3D detection and tracking method, which comprises a plurality of stages of multimode BEV feature generation and target detection and target tracking. In the generation stage of the multi-mode BEV features, the method can process the sensor data of the laser radar and the camera, and is fused into a unified BEV feature space, so that the method can flexibly adapt to the change of the number of sensors and can be used as the input features of subsequent target detection and target tracking tasks. In the realization stage of target detection and target tracking, the method designs an encoder and a decoder based on a transducer structure, and effectively combines a target detection task and a target tracking task together, so that the performance of the two tasks is improved by fully utilizing the relevance between the detection task and the tracking task, and meanwhile, the difficulty of improving the training difficulty of a plurality of independent models and the difficulty of model deployment are effectively reduced.
Compared with separate target detection and target tracking models, the unified model improves real-time performance, accuracy and robustness, thereby improving the performance and safety of the automatic driving system.
Drawings
Fig. 1 is a flowchart of a multi-mode hybrid automatic driving unified 3D detection and tracking method according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The multimodal hybrid unified 3D detection and tracking method for automatic driving mixes image and point-cloud data and, through a unified fusion strategy combined with the capability of the Transformer model, achieves more accurate and stable target detection and tracking, providing better support for the safety and performance of the automatic driving system.
Variable subscripts "LiDAR" and "cam" are set to distinguish between two different sensors, liDAR and Camera, and variable subscripts "det" and "track" are set to distinguish between Object Detection (Object Detection) and Object Tracking (Object Tracking) tasks. As shown in fig. 1, the specific implementation steps of the present invention are:
Step (1): the lidar input I_lidar and the camera input I_cam of the automatic driving system are passed through separate BEV generation networks Ψ_lidar and Ψ_cam, which convert the two different input modalities into a unified BEV view. The generated BEV features of the different modalities share the same spatial resolution W × H and the same feature dimension C, so that the lidar and camera features can subsequently be fused. The computation is:

F_lidar = Ψ_lidar(I_lidar),  F_cam = Ψ_cam(I_cam)

where F_lidar denotes the BEV feature of the lidar modality and F_cam the BEV feature of the camera modality. In this embodiment, the BEV generation networks Ψ_lidar and Ψ_cam may be implemented with existing networks such as BEVFormer or BEVFusion.
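As a minimal sketch of the output contract of step (1), the following NumPy snippet uses hypothetical stub functions in place of the real Ψ_lidar and Ψ_cam networks (which the text says could be BEVFormer- or BEVFusion-style models); only the shared W × H × C shape of the two BEV features is modeled, not any actual view transformation:

```python
import numpy as np

# Hypothetical stand-ins for the BEV generation networks Psi_lidar and Psi_cam.
# Only the output contract matters here: both modalities yield a BEV map with
# the same spatial resolution W x H and the same feature dimension C.
def psi_lidar_stub(points, C=64, H=128, W=128):
    rng = np.random.default_rng(0)
    return rng.standard_normal((C, H, W)).astype(np.float32)

def psi_cam_stub(images, C=64, H=128, W=128):
    rng = np.random.default_rng(1)
    return rng.standard_normal((C, H, W)).astype(np.float32)

f_lidar = psi_lidar_stub(points=None)    # F_lidar = Psi_lidar(I_lidar)
f_cam = psi_cam_stub(images=None)        # F_cam = Psi_cam(I_cam)
assert f_lidar.shape == f_cam.shape == (64, 128, 128)
```

Because both features live on the same grid with the same channel count, the fusion network of step (2) can combine them element-wise regardless of which sensors are present.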
Step (2): the BEV features of the different modalities obtained in step (1) are passed through an adaptive BEV feature fusion network, which learns to adaptively fuse them into a single adaptively fused BEV feature. This specifically comprises the following steps:
Step (2.1): from the BEV feature F_lidar of the lidar point-cloud modality, generate the corresponding lidar BEV adaptive fusion weight W_lidar:

W_lidar = Γ_lidar(F_lidar)

where Γ_lidar is a multi-layer convolutional neural network with down-sampling and up-sampling, so that the resulting adaptive fusion weight has the same receptive field as the original BEV feature.

Step (2.2): from the BEV feature F_cam of the camera modality, generate the corresponding camera BEV adaptive fusion weight W_cam:

W_cam = Γ_cam(F_cam)

where Γ_cam is likewise a multi-layer convolutional neural network with down-sampling and up-sampling, giving a fusion weight with the same receptive field as the original BEV feature.

Step (2.3): from the fused BEV feature F_lidar-cam of the lidar point-cloud and camera modalities, generate the corresponding joint adaptive fusion weight W_lidar-cam:

W_lidar-cam = Γ_lidar-cam(F_lidar-cam)

where Γ_lidar-cam is again a multi-layer convolutional neural network with down-sampling and up-sampling, giving a fusion weight with the same receptive field as the original BEV feature.
Step (2.4) the fusion weights generated in the steps (2.1) to (2.3) are obtained by using a normalization function Is->And (3) carrying out numerical normalization:
wherein sigma is a normalization function, and can be realized by adopting a Softmax function;
Step (2.5): the BEV features of the three modalities, F_lidar, F_cam and F_lidar-cam, are combined with their corresponding normalized adaptive fusion weights W̄_lidar, W̄_cam and W̄_lidar-cam to obtain the adaptively fused BEV feature F_fused:

F_fused = MLP(Concat(W̄_lidar · F_lidar, W̄_cam · F_cam, W̄_lidar-cam · F_lidar-cam))

Here each of the three BEV features is first multiplied by its normalized adaptive fusion weight, yielding three sets of BEV features of the same size as the originals. Concat denotes the splicing operation, in which the three C-dimensional BEV features are spliced into one 3C-dimensional feature; the MLP network then converts the feature dimension back to C, finally giving an adaptively fused BEV feature of size W × H × C that can flexibly adapt to changes in the number of sensors. The adaptive BEV feature fusion network thus comprises the Γ_lidar network, the Γ_cam network, the Γ_lidar-cam network and the MLP network.
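The fusion of steps (2.1)–(2.5) can be sketched as follows. This is not the patented network: the weight networks Γ_lidar, Γ_cam and Γ_lidar-cam are replaced by simple channel means, the mixed branch by an average of the two features, and the MLP by a fixed random 3C → C projection — all hypothetical stand-ins chosen only to make the weight normalization, splicing and projection steps concrete:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_bev_fusion(f_lidar, f_cam, rng_seed=0):
    """Sketch of steps (2.1)-(2.5) with hypothetical stand-in networks."""
    C, H, W = f_lidar.shape
    f_mix = 0.5 * (f_lidar + f_cam)            # stand-in for the mixed-modality branch
    # steps (2.1)-(2.3): one scalar weight per BEV location, per branch
    w = np.stack([f.mean(axis=0) for f in (f_lidar, f_cam, f_mix)])     # (3, H, W)
    w = softmax(w, axis=0)                     # step (2.4): normalize across branches
    # step (2.5): weight each branch, splice to 3C channels, project back to C
    cat = np.concatenate([w[0] * f_lidar, w[1] * f_cam, w[2] * f_mix])  # (3C, H, W)
    mlp = np.random.default_rng(rng_seed).standard_normal((C, 3 * C)) / np.sqrt(3 * C)
    return np.einsum('ck,khw->chw', mlp, cat)  # F_fused, shape (C, H, W)

fused = adaptive_bev_fusion(np.ones((8, 4, 4)), np.zeros((8, 4, 4)))
assert fused.shape == (8, 4, 4)
```

Because the three weight maps are Softmax-normalized at every BEV location, the contribution of a degraded sensor can be suppressed per-location rather than globally, which is the point of the adaptive design.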
Step (3): a Transformer-based encoder is designed to encode the adaptively fused BEV feature F_fused obtained in step (2). Meanwhile, a candidate region generation network (RPN) is designed: F_fused is passed through the RPN to complete the 3D target detection task of the current frame and generate a series of 3D candidate boxes, specifically as follows:

Step (3.1): the adaptively fused BEV feature F_fused is passed through the RPN to generate the series of 3D candidate boxes of the current frame, which form the 3D target detection output of the current frame:

D_t = RPN(F_fused)

where D_t denotes the 3D target detection result at time t.
Step (3.2) BEV characterization by adaptive fusionCoding feature of the current frame is obtained via a transform structure based encoder ENC>
The encoder ENC is composed of a plurality of serial encoder modules, adjacent modules being connected in a residual manner. Each module is constructed with an attention mechanism, including the following main calculation processes:
wherein, ATT comprises the following main calculations:
wherein,is the input of the encoder module and will +.>Query, key and value as attention mechanisms, respectively; the sigma function is a Softmax function and is used for normalizing the correlation matrix; FFN is a feed-forward neural network with two layers; c is the feature dimension of the input adaptively fused BEV feature, which may be generally set to 128. The output of the encoder and the 3D candidate box output by the object detection task will be the input of the decoder of the transducer.
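A single encoder module of step (3.2) can be sketched in NumPy as below. The FFN weights are hypothetical random matrices; a real module would also include layer normalization, which the text does not specify and which is therefore omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_module(x, w1, w2):
    """One self-attention encoder module: the flattened BEV tokens X act as
    query, key and value; scores are scaled by sqrt(C) and Softmax-normalized;
    a two-layer FFN follows, with residual connections between sub-blocks."""
    C = x.shape[-1]
    att = softmax(x @ x.T / np.sqrt(C), axis=-1) @ x   # ATT = sigma(X X^T / sqrt(C)) X
    h = x + att                                        # residual connection
    ffn = np.maximum(h @ w1, 0.0) @ w2                 # two-layer FFN with ReLU
    return h + ffn                                     # residual connection

rng = np.random.default_rng(0)
C = 128                                    # feature dimension, as in the text
tokens = rng.standard_normal((16, C))      # 16 flattened BEV locations
out = encoder_module(tokens,
                     0.1 * rng.standard_normal((C, 2 * C)),
                     0.1 * rng.standard_normal((2 * C, C)))
assert out.shape == (16, C)
```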
Step (4): a Transformer decoder is designed. The encoded feature E_t of the current frame produced by the Transformer encoder in step (3), together with the splice of the current frame's 3D candidate boxes obtained in step (3) and the processed target tracking result of the previous frame, are input into the decoder to obtain the target tracking result of the current frame, specifically as follows:

Step (4.1): splice the processed target tracking result of the previous frame with the series of 3D candidate boxes of the current frame generated in step (3.1):

Q_t = Concat(D_t, T_{t-1})

where Concat denotes the splicing operation, D_t the 3D candidate boxes of the current frame, and T_{t-1} the processed target tracking result of the previous frame.

Step (4.2): input the encoded feature E_t of the current frame obtained in step (3.2) and the splice Q_t obtained in step (4.1) into the Transformer decoder to obtain the initial target tracking result T̂_t of the current frame:

T̂_t = DEC(Q_t, E_t)

where the attention ATT of the decoder mainly computes:

ATT(Q_t, E_t) = σ(Q_t E_tᵀ / √C) E_t

Here Q_t serves as the decoder query, and E_t as the decoder key and value; the σ function is the Softmax function, used to normalize the correlation matrix; FFN is a two-layer feed-forward neural network; C is the feature dimension of the input and may typically be set to 128.
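The cross-attention of step (4.2) can be sketched as follows. The candidate-box and track embeddings are hypothetical random vectors standing in for real learned embeddings, and the decoder's FFN is omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_module(queries, enc_feats):
    """Cross-attention of step (4.2): the splice of current-frame 3D candidates
    and previous-frame tracks is the query Q_t; the encoder output E_t supplies
    key and value."""
    C = enc_feats.shape[-1]
    scores = softmax(queries @ enc_feats.T / np.sqrt(C), axis=-1)
    return queries + scores @ enc_feats    # residual + attended features

rng = np.random.default_rng(0)
C = 128
d_t = rng.standard_normal((5, C))        # embeddings of 5 new 3D candidate boxes
t_prev = rng.standard_normal((3, C))     # embeddings of 3 previously tracked objects
q_t = np.concatenate([d_t, t_prev])      # step (4.1): Q_t = Concat(D_t, T_{t-1})
e_t = rng.standard_normal((16, C))       # encoder output E_t (16 BEV tokens)
t_init = decoder_module(q_t, e_t)        # initial tracking result of the current frame
assert t_init.shape == (8, C)
```

Note how the output keeps the splicing order of Q_t: the first 5 rows correspond to the new candidates and the last 3 to the old tracks, which is exactly what the grouping of step (5.1) relies on.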
Step (5): a life-cycle control network for the targets is designed. From the initial target tracking result T̂_t of the current frame obtained in step (4), the processed target tracking result T_t is generated and used for target tracking in the next frame; by iterating continuously between frames, the target tracking result of the whole multi-frame input is finally output, specifically as follows:

Step (5.1): according to the splicing order of step (4.1), the initial target tracking result T̂_t of the current frame obtained in step (4.2) is divided into two groups Q_new and Q_old, where Q_new corresponds to the decoding result of D_t and Q_old corresponds to the decoding result of T_{t-1}; the numbers of detected targets they contain are consistent with those of D_t and T_{t-1}, respectively:

d(Q_new) = d(D_t),  d(Q_old) = d(T_{t-1})

where d is a function that returns the number of detected objects.

Step (5.2): the two groups obtained in step (5.1) are processed separately. Q_old is used to handle objects that are already tracked and to control objects leaving the current scene; Q_new is used to handle newly appearing objects and to control objects entering the current scene. For the departure of an object: when its detection confidence remains below the set threshold τ_left for 3 consecutive frames, the object is removed from the tracked object list. For the entry of an object: when its detection confidence exceeds the set threshold τ_en, the object is added to the tracked queue. The specific computation is:

Q'_old = { q_k,old ∈ Q_old | not (s_k,old < τ_left in each of the last 3 frames) }
Q'_new = { q_k,new ∈ Q_new | s_k,new > τ_en }
T_t = Q'_old ∪ Q'_new

where q_k,old denotes the k-th object in Q_old and q_k,new the k-th object in Q_new; s_k,old denotes the detection confidence of q_k,old in the last three frames and s_k,new the detection confidence of q_k,new in the current frame; Q'_old is the object set after processing departures, Q'_new the object set after processing entries, and T_t the resulting processed target tracking result, which is used by the tracking network in the next frame.
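The life-cycle rules of step (5.2) reduce to two simple filters, sketched below with hypothetical threshold values for τ_left and τ_en (the patent does not state concrete numbers):

```python
def update_track_list(old_tracks, new_detections, tau_left=0.3, tau_en=0.5):
    """Sketch of step (5.2). old_tracks: list of (id, [confidences over the
    last 3 frames]); new_detections: list of (id, confidence). An old object
    is dropped only when its confidence stays below tau_left for 3 consecutive
    frames; a new object enters only when its confidence exceeds tau_en.
    Threshold values are hypothetical."""
    kept_old = [oid for oid, hist in old_tracks
                if not all(c < tau_left for c in hist[-3:])]   # departure rule
    entered = [nid for nid, conf in new_detections
               if conf > tau_en]                               # entry rule
    return kept_old + entered   # T_t: processed tracking result for the next frame

tracks = update_track_list(
    old_tracks=[("a", [0.9, 0.8, 0.7]),        # confident: kept
                ("b", [0.2, 0.1, 0.2])],       # below tau_left for 3 frames: dropped
    new_detections=[("c", 0.9), ("d", 0.4)])   # "c" enters, "d" is rejected
assert tracks == ["a", "c"]
```

Requiring 3 consecutive low-confidence frames before dropping an old track makes the tracker robust to brief occlusions, while the entry threshold keeps low-confidence detections from spawning spurious tracks.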
The foregoing description is only illustrative of the specific embodiments of the present application and the principles of the technology employed. Persons skilled in the art will appreciate that the scope of the invention referred to in this application is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the application, for example embodiments in which the features described above are interchanged with (but not limited to) technical features of similar function disclosed in this application.
Claims (9)
1. The multi-mode hybrid automatic driving unified 3D detection and tracking method is characterized by comprising the following steps of:
step (1), inputting multi-mode data acquired by a laser radar and a camera from an automatic driving system, and respectively extracting BEV characteristics under different modes;
step (2), learning adaptive fusion weights from the BEV features of the different modalities obtained in step (1), and fusing the BEV features of the different modalities with these adaptive fusion weights to generate an adaptively fused BEV feature;
step (3), encoding the adaptively fused BEV feature obtained in step (2) with a Transformer encoder to obtain the encoded feature of the current frame; meanwhile, passing the adaptively fused BEV feature through a candidate region generation network to complete the 3D target detection task of the current frame and generate a series of 3D candidate boxes of the current frame;
step (4), splicing the series of 3D candidate boxes of the current frame with the processed target tracking result of the previous frame, and inputting the splicing result together with the encoded feature of the current frame into a Transformer decoder to obtain the initial target tracking result of the current frame;
and (5) generating a processed target tracking result of the current frame by using the initial target tracking result of the current frame acquired in the step (4), and finally outputting the target tracking result of the whole multi-frame input through continuous iteration between frames.
2. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 1, wherein in step (1) the multi-modal data collected by the lidar and the camera of the automatic driving system are input and converted to a consistent BEV view by their respective BEV generation networks, and the generated BEV features of the different modalities have the same spatial resolution and the same feature dimension.
3. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 1, wherein step (2) comprises:
step (2.1), generating the adaptive fusion weight of the corresponding lidar BEV feature from the BEV feature of the lidar modality;

step (2.2), generating the adaptive fusion weight of the corresponding camera BEV feature from the BEV feature of the camera modality;

step (2.3), generating the adaptive fusion weight of the corresponding fused lidar-camera BEV feature from the BEV feature fused from the lidar and camera modalities;
step (2.4) carrying out numerical normalization on the three self-adaptive fusion weights generated in the steps (2.1) to (2.3);
and (2.5) obtaining the BEV characteristics of the self-adaptive fusion through the BEV characteristics of the three modes and the corresponding normalized self-adaptive fusion weights.
4. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 3, wherein step (2.5) specifically comprises: multiplying the BEV features of the three modes with the corresponding normalized self-adaptive fusion weights respectively to obtain three groups of BEV features with the same size as the original BEV features; and splicing the three groups of BEV features to obtain new BEV features with 3 times of the original BEV feature dimension, and converting the dimension of the new BEV features into the self-adaptive fusion BEV features with the same dimension as the original BEV features by using the MLP network.
5. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 1, wherein step (3) comprises:
step (3.1) generating a series of 3D candidate frames of the current frame as an output result of 3D target detection of the current frame by adaptively fused BEV features through an RPN network;
step (3.2), obtaining the encoded feature of the current frame from the adaptively fused BEV feature through a Transformer-based encoder.
6. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 5, wherein the Transformer-based encoder consists of a plurality of serial encoder modules, with adjacent modules connected through residual connections; each module is built around an attention mechanism whose query, key and value are the adaptively fused BEV feature obtained in step (2).
7. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 5, wherein step (4) comprises:
step (4.1) the processed target tracking result of the previous frame is spliced with the 3D target detection output result of the current frame generated in the step (3.1);
step (4.2), inputting the encoded feature of the current frame obtained in step (3.2) and the splicing result obtained in step (4.1) into the Transformer decoder to obtain the initial target tracking result of the current frame.
8. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 1, wherein step (5) comprises:
step (5.1), dividing the initial target tracking result of the current frame obtained in step (4) into a new object set Q_new, corresponding to the 3D target detection output of the current frame, and an old object set Q_old, corresponding to the processed target tracking result of the previous frame;
Step (5.2) collecting the new objects obtained in step (5.1)And old object set->Respectively judging, and removing objects which do not meet the requirements from the collection;
step (5.3), merging the processed new object set Q_new and old object set Q_old to generate the processed target tracking result of the current frame.
9. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 8, wherein in step (5.2): for the new object set Q_new, if the detection confidence of an object is greater than the set threshold, the object is kept, otherwise it is removed from Q_new; for the old object set Q_old, if the detection confidence of an object remains below the set threshold for 3 consecutive frames, the object is removed from Q_old, otherwise it is kept.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311382428.2A CN117333749A (en) | 2023-10-24 | 2023-10-24 | Multimode hybrid automatic driving unified 3D detection and tracking method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117333749A true CN117333749A (en) | 2024-01-02 |
Family
ID=89279078
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311382428.2A Pending CN117333749A (en) | 2023-10-24 | 2023-10-24 | Multimode hybrid automatic driving unified 3D detection and tracking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117333749A (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |