CN117333749A - Multimode hybrid automatic driving unified 3D detection and tracking method - Google Patents
- Publication number
- CN117333749A (application number CN202311382428.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10032—Satellite or aerial image; Remote sensing
- G06T2207/10044—Radar image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a multimodal hybrid unified 3D detection and tracking method for automatic driving, belonging to the technical field of automatic driving. The method mainly comprises the following steps: 1. generating BEV features for each modality; 2. generating an adaptively fused BEV feature; 3. generating a single-frame 3D target detection result; 4. generating a single-frame 3D target tracking result; 5. iterating the target tracking result from frame to frame. With the unified 3D detection and tracking method of the invention, data from different sensors can be fused into a unified BEV feature, and 3D target detection and 3D target tracking are unified into a single model. Compared with separate detection and tracking models, the unified model improves real-time performance, accuracy and robustness, thereby improving the performance and safety of the automatic driving system, while also reducing the training cost and deployment difficulty of the model.
Description
Technical Field
The invention belongs to the technical field of automatic driving, and particularly relates to a multimode hybrid automatic driving unified 3D detection and tracking method.
Background
With the rapid development of automatic driving technology, the realization of efficient and accurate target detection and tracking is critical to the safety and performance of an automatic driving system. Traditional target detection and tracking methods are based primarily on single modality data, such as images or point clouds. However, single modality data may have some limitations in some scenarios, for example, in complex environments, the image may be affected by illumination, occlusion, etc., while the point cloud may be limited by factors such as sensor resolution and noise.
To overcome the limitations of single modality data, researchers have begun exploring the mix of multi-modality data. Multimodal data is typically collected by different sensors, such as cameras and lidar. The image data provides rich visual information, while the point cloud data provides accurate geometric information. By mixing the image and the point cloud data, more comprehensive and accurate target detection and tracking results can be obtained.
Conventional target detection and tracking methods are usually performed separately, and target detection is performed first, and then target tracking is performed. However, this manner of separation may lead to loss and inconsistency of information. In order to achieve more accurate and consistent target detection and tracking, a unified fusion method needs to be provided. The method carries out joint modeling on the target detection and tracking process, and realizes the tight combination of target detection and tracking by sharing the characteristics and the context information, thereby improving the accuracy and the stability of the detection and the tracking.
The Transformer is a powerful deep learning model that has achieved remarkable results in natural language processing, computer vision and other fields. By exploiting the Transformer's self-attention mechanism and global context modeling capability, the relations and dependencies between multi-modal data can be better captured, thereby improving target detection and tracking performance.
Disclosure of Invention
The invention aims to overcome the defects of single-modality input and to fully exploit the advantages of unifying target detection and target tracking. It provides a Transformer-based unified 3D detection and tracking method for automatic driving that mixes the point-cloud and image modalities. The method reduces training cost and deployment difficulty, and by fully exploiting the correlation between the detection and tracking tasks, the two tasks mutually improve each other's performance.
The technical scheme adopted by the invention is as follows:
a multimode hybrid automatic driving unified 3D detection and tracking method comprises the following steps:
step (1), inputting multi-mode data acquired by a laser radar and a camera from an automatic driving system, and respectively extracting BEV characteristics under different modes;
step (2), learning adaptive fusion weights from the BEV features of the different modalities obtained in step (1), and fusing the BEV features of the different modalities with these adaptive fusion weights to generate an adaptively fused BEV feature;
step (3), encoding the adaptively fused BEV feature obtained in step (2) with a Transformer encoder to obtain the encoded feature of the current frame; meanwhile, passing the adaptively fused BEV feature through a candidate region generation network to complete the 3D target detection task of the current frame and generate a series of 3D candidate boxes of the current frame;
step (4), splicing the series of 3D candidate boxes of the current frame with the processed target tracking result of the previous frame, and inputting the splicing result together with the encoded feature of the current frame into a Transformer decoder to obtain the initial target tracking result of the current frame;
and (5) generating a processed target tracking result of the current frame by using the initial target tracking result of the current frame acquired in the step (4), and finally outputting the target tracking result of the whole multi-frame input through continuous iteration between frames.
Further, the step (5) includes:
step (5.1), dividing the initial target tracking result of the current frame obtained in step (4) into a new object set Q_new, corresponding to the 3D target detection output of the current frame, and an old object set Q_old, corresponding to the processed target tracking result of the previous frame;
Step (5.2) collecting the new objects obtained in step (5.1)And old object set->Respectively judging, and removing objects which do not meet the requirements from the collection;
step (5.3), merging the processed new object set Q_new and old object set Q_old to generate the processed target tracking result of the current frame.
Further, in step (5.2): for the new object set Q_new, if the detection confidence of an object is greater than the set threshold, the object is kept, otherwise it is removed from Q_new; for the old object set Q_old, if the detection confidence of an object remains below the set threshold for 3 consecutive frames, the object is removed from Q_old, otherwise it is kept.
The invention has the beneficial effects that:
the invention designs a complete multimode hybrid automatic driving unified 3D detection and tracking method, which comprises a plurality of stages of multimode BEV feature generation and target detection and target tracking. In the generation stage of the multi-mode BEV features, the method can process the sensor data of the laser radar and the camera, and is fused into a unified BEV feature space, so that the method can flexibly adapt to the change of the number of sensors and can be used as the input features of subsequent target detection and target tracking tasks. In the realization stage of target detection and target tracking, the method designs an encoder and a decoder based on a transducer structure, and effectively combines a target detection task and a target tracking task together, so that the performance of the two tasks is improved by fully utilizing the relevance between the detection task and the tracking task, and meanwhile, the difficulty of improving the training difficulty of a plurality of independent models and the difficulty of model deployment are effectively reduced.
Compared with separate target detection and target tracking models, the unified model improves real-time performance, accuracy and robustness, thereby improving the performance and safety of the automatic driving system.
Drawings
Fig. 1 is a flowchart of a multi-mode hybrid automatic driving unified 3D detection and tracking method according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The multimodal hybrid unified 3D detection and tracking method for automatic driving mixes image and point-cloud data and, through a unified fusion strategy combined with the capability of the Transformer model, achieves more accurate and stable target detection and tracking, providing better support for the safety and performance of the automatic driving system.
Variable subscripts "LiDAR" and "cam" are set to distinguish between two different sensors, liDAR and Camera, and variable subscripts "det" and "track" are set to distinguish between Object Detection (Object Detection) and Object Tracking (Object Tracking) tasks. As shown in fig. 1, the specific implementation steps of the present invention are:
Step (1): the lidar input I_lidar and the camera input I_cam of the automatic driving system are passed through separate BEV generation networks Ψ_lidar and Ψ_cam, which convert the two different input modalities into a unified BEV view. The generated BEV features of the different modalities share the same spatial resolution W × H and the same feature dimension C, so that the lidar and camera features can subsequently be fused. The computation is:

F_lidar = Ψ_lidar(I_lidar),  F_cam = Ψ_cam(I_cam)

where F_lidar denotes the BEV feature of the lidar modality and F_cam the BEV feature of the camera modality. In this embodiment, the BEV generation networks Ψ_lidar and Ψ_cam may be implemented with existing networks such as BEVFormer or BEVFusion.
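As a minimal sketch of the output contract of step (1), the following NumPy snippet uses hypothetical stub functions in place of the real Ψ_lidar and Ψ_cam networks (which the text says could be BEVFormer- or BEVFusion-style models); only the shared W × H × C shape of the two BEV features is modeled, not any actual view transformation:

```python
import numpy as np

# Hypothetical stand-ins for the BEV generation networks Psi_lidar and Psi_cam.
# Only the output contract matters here: both modalities yield a BEV map with
# the same spatial resolution W x H and the same feature dimension C.
def psi_lidar_stub(points, C=64, H=128, W=128):
    rng = np.random.default_rng(0)
    return rng.standard_normal((C, H, W)).astype(np.float32)

def psi_cam_stub(images, C=64, H=128, W=128):
    rng = np.random.default_rng(1)
    return rng.standard_normal((C, H, W)).astype(np.float32)

f_lidar = psi_lidar_stub(points=None)    # F_lidar = Psi_lidar(I_lidar)
f_cam = psi_cam_stub(images=None)        # F_cam = Psi_cam(I_cam)
assert f_lidar.shape == f_cam.shape == (64, 128, 128)
```

Because both features live on the same grid with the same channel count, the fusion network of step (2) can combine them element-wise regardless of which sensors are present.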
Step (2): the BEV features of the different modalities obtained in step (1) are passed through an adaptive BEV feature fusion network, which learns to adaptively fuse them into a single adaptively fused BEV feature. This specifically comprises the following steps:
Step (2.1): from the BEV feature F_lidar of the lidar point-cloud modality, generate the corresponding lidar BEV adaptive fusion weight W_lidar:

W_lidar = Γ_lidar(F_lidar)

where Γ_lidar is a multi-layer convolutional neural network with down-sampling and up-sampling, so that the resulting adaptive fusion weight has the same receptive field as the original BEV feature.

Step (2.2): from the BEV feature F_cam of the camera modality, generate the corresponding camera BEV adaptive fusion weight W_cam:

W_cam = Γ_cam(F_cam)

where Γ_cam is likewise a multi-layer convolutional neural network with down-sampling and up-sampling, giving a fusion weight with the same receptive field as the original BEV feature.

Step (2.3): from the fused BEV feature F_lidar-cam of the lidar point-cloud and camera modalities, generate the corresponding joint adaptive fusion weight W_lidar-cam:

W_lidar-cam = Γ_lidar-cam(F_lidar-cam)

where Γ_lidar-cam is again a multi-layer convolutional neural network with down-sampling and up-sampling, giving a fusion weight with the same receptive field as the original BEV feature.
Step (2.4) the fusion weights generated in the steps (2.1) to (2.3) are obtained by using a normalization function Is->And (3) carrying out numerical normalization:
wherein sigma is a normalization function, and can be realized by adopting a Softmax function;
Step (2.5): the BEV features of the three modalities, F_lidar, F_cam and F_lidar-cam, are combined with their corresponding normalized adaptive fusion weights W̄_lidar, W̄_cam and W̄_lidar-cam to obtain the adaptively fused BEV feature F_fused:

F_fused = MLP(Concat(W̄_lidar · F_lidar, W̄_cam · F_cam, W̄_lidar-cam · F_lidar-cam))

Here each of the three BEV features is first multiplied by its normalized adaptive fusion weight, yielding three sets of BEV features of the same size as the originals. Concat denotes the splicing operation, in which the three C-dimensional BEV features are spliced into one 3C-dimensional feature; the MLP network then converts the feature dimension back to C, finally giving an adaptively fused BEV feature of size W × H × C that can flexibly adapt to changes in the number of sensors. The adaptive BEV feature fusion network thus comprises the Γ_lidar network, the Γ_cam network, the Γ_lidar-cam network and the MLP network.
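The fusion of steps (2.1)–(2.5) can be sketched as follows. This is not the patented network: the weight networks Γ_lidar, Γ_cam and Γ_lidar-cam are replaced by simple channel means, the mixed branch by an average of the two features, and the MLP by a fixed random 3C → C projection — all hypothetical stand-ins chosen only to make the weight normalization, splicing and projection steps concrete:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_bev_fusion(f_lidar, f_cam, rng_seed=0):
    """Sketch of steps (2.1)-(2.5) with hypothetical stand-in networks."""
    C, H, W = f_lidar.shape
    f_mix = 0.5 * (f_lidar + f_cam)            # stand-in for the mixed-modality branch
    # steps (2.1)-(2.3): one scalar weight per BEV location, per branch
    w = np.stack([f.mean(axis=0) for f in (f_lidar, f_cam, f_mix)])     # (3, H, W)
    w = softmax(w, axis=0)                     # step (2.4): normalize across branches
    # step (2.5): weight each branch, splice to 3C channels, project back to C
    cat = np.concatenate([w[0] * f_lidar, w[1] * f_cam, w[2] * f_mix])  # (3C, H, W)
    mlp = np.random.default_rng(rng_seed).standard_normal((C, 3 * C)) / np.sqrt(3 * C)
    return np.einsum('ck,khw->chw', mlp, cat)  # F_fused, shape (C, H, W)

fused = adaptive_bev_fusion(np.ones((8, 4, 4)), np.zeros((8, 4, 4)))
assert fused.shape == (8, 4, 4)
```

Because the three weight maps are Softmax-normalized at every BEV location, the contribution of a degraded sensor can be suppressed per-location rather than globally, which is the point of the adaptive design.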
Step (3): a Transformer-based encoder is designed to encode the adaptively fused BEV feature F_fused obtained in step (2). Meanwhile, a candidate region generation network (RPN) is designed: F_fused is passed through the RPN to complete the 3D target detection task of the current frame and generate a series of 3D candidate boxes, specifically as follows:

Step (3.1): the adaptively fused BEV feature F_fused is passed through the RPN to generate the series of 3D candidate boxes of the current frame, which form the 3D target detection output of the current frame:

D_t = RPN(F_fused)

where D_t denotes the 3D target detection result at time t.
Step (3.2) BEV characterization by adaptive fusionCoding feature of the current frame is obtained via a transform structure based encoder ENC>
The encoder ENC is composed of a plurality of serial encoder modules, adjacent modules being connected in a residual manner. Each module is constructed with an attention mechanism, including the following main calculation processes:
wherein, ATT comprises the following main calculations:
wherein,is the input of the encoder module and will +.>Query, key and value as attention mechanisms, respectively; the sigma function is a Softmax function and is used for normalizing the correlation matrix; FFN is a feed-forward neural network with two layers; c is the feature dimension of the input adaptively fused BEV feature, which may be generally set to 128. The output of the encoder and the 3D candidate box output by the object detection task will be the input of the decoder of the transducer.
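A single encoder module of step (3.2) can be sketched in NumPy as below. The FFN weights are hypothetical random matrices; a real module would also include layer normalization, which the text does not specify and which is therefore omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_module(x, w1, w2):
    """One self-attention encoder module: the flattened BEV tokens X act as
    query, key and value; scores are scaled by sqrt(C) and Softmax-normalized;
    a two-layer FFN follows, with residual connections between sub-blocks."""
    C = x.shape[-1]
    att = softmax(x @ x.T / np.sqrt(C), axis=-1) @ x   # ATT = sigma(X X^T / sqrt(C)) X
    h = x + att                                        # residual connection
    ffn = np.maximum(h @ w1, 0.0) @ w2                 # two-layer FFN with ReLU
    return h + ffn                                     # residual connection

rng = np.random.default_rng(0)
C = 128                                    # feature dimension, as in the text
tokens = rng.standard_normal((16, C))      # 16 flattened BEV locations
out = encoder_module(tokens,
                     0.1 * rng.standard_normal((C, 2 * C)),
                     0.1 * rng.standard_normal((2 * C, C)))
assert out.shape == (16, C)
```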
Step (4): a Transformer decoder is designed. The encoded feature E_t of the current frame produced by the Transformer encoder in step (3), together with the splice of the current frame's 3D candidate boxes obtained in step (3) and the processed target tracking result of the previous frame, are input into the decoder to obtain the target tracking result of the current frame, specifically as follows:

Step (4.1): splice the processed target tracking result of the previous frame with the series of 3D candidate boxes of the current frame generated in step (3.1):

Q_t = Concat(D_t, T_{t-1})

where Concat denotes the splicing operation, D_t the 3D candidate boxes of the current frame, and T_{t-1} the processed target tracking result of the previous frame.

Step (4.2): input the encoded feature E_t of the current frame obtained in step (3.2) and the splice Q_t obtained in step (4.1) into the Transformer decoder to obtain the initial target tracking result T̂_t of the current frame:

T̂_t = DEC(Q_t, E_t)

where the attention ATT of the decoder mainly computes:

ATT(Q_t, E_t) = σ(Q_t E_tᵀ / √C) E_t

Here Q_t serves as the decoder query, and E_t as the decoder key and value; the σ function is the Softmax function, used to normalize the correlation matrix; FFN is a two-layer feed-forward neural network; C is the feature dimension of the input and may typically be set to 128.
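The cross-attention of step (4.2) can be sketched as follows. The candidate-box and track embeddings are hypothetical random vectors standing in for real learned embeddings, and the decoder's FFN is omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_module(queries, enc_feats):
    """Cross-attention of step (4.2): the splice of current-frame 3D candidates
    and previous-frame tracks is the query Q_t; the encoder output E_t supplies
    key and value."""
    C = enc_feats.shape[-1]
    scores = softmax(queries @ enc_feats.T / np.sqrt(C), axis=-1)
    return queries + scores @ enc_feats    # residual + attended features

rng = np.random.default_rng(0)
C = 128
d_t = rng.standard_normal((5, C))        # embeddings of 5 new 3D candidate boxes
t_prev = rng.standard_normal((3, C))     # embeddings of 3 previously tracked objects
q_t = np.concatenate([d_t, t_prev])      # step (4.1): Q_t = Concat(D_t, T_{t-1})
e_t = rng.standard_normal((16, C))       # encoder output E_t (16 BEV tokens)
t_init = decoder_module(q_t, e_t)        # initial tracking result of the current frame
assert t_init.shape == (8, C)
```

Note how the output keeps the splicing order of Q_t: the first 5 rows correspond to the new candidates and the last 3 to the old tracks, which is exactly what the grouping of step (5.1) relies on.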
Step (5): a life-cycle control network for the targets is designed. From the initial target tracking result T̂_t of the current frame obtained in step (4), the processed target tracking result T_t is generated and used for target tracking in the next frame; by iterating continuously between frames, the target tracking result of the whole multi-frame input is finally output, specifically as follows:

Step (5.1): according to the splicing order of step (4.1), the initial target tracking result T̂_t of the current frame obtained in step (4.2) is divided into two groups Q_new and Q_old, where Q_new corresponds to the decoding result of D_t and Q_old corresponds to the decoding result of T_{t-1}; the numbers of detected targets they contain are consistent with those of D_t and T_{t-1}, respectively:

d(Q_new) = d(D_t),  d(Q_old) = d(T_{t-1})

where d is a function that returns the number of detected objects.

Step (5.2): the two groups obtained in step (5.1) are processed separately. Q_old is used to handle objects that are already tracked and to control objects leaving the current scene; Q_new is used to handle newly appearing objects and to control objects entering the current scene. For the departure of an object: when its detection confidence remains below the set threshold τ_left for 3 consecutive frames, the object is removed from the tracked object list. For the entry of an object: when its detection confidence exceeds the set threshold τ_en, the object is added to the tracked queue. The specific computation is:

Q'_old = { q_k,old ∈ Q_old | not (s_k,old < τ_left in each of the last 3 frames) }
Q'_new = { q_k,new ∈ Q_new | s_k,new > τ_en }
T_t = Q'_old ∪ Q'_new

where q_k,old denotes the k-th object in Q_old and q_k,new the k-th object in Q_new; s_k,old denotes the detection confidence of q_k,old in the last three frames and s_k,new the detection confidence of q_k,new in the current frame; Q'_old is the object set after processing departures, Q'_new the object set after processing entries, and T_t the resulting processed target tracking result, which is used by the tracking network in the next frame.
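The life-cycle rules of step (5.2) reduce to two simple filters, sketched below with hypothetical threshold values for τ_left and τ_en (the patent does not state concrete numbers):

```python
def update_track_list(old_tracks, new_detections, tau_left=0.3, tau_en=0.5):
    """Sketch of step (5.2). old_tracks: list of (id, [confidences over the
    last 3 frames]); new_detections: list of (id, confidence). An old object
    is dropped only when its confidence stays below tau_left for 3 consecutive
    frames; a new object enters only when its confidence exceeds tau_en.
    Threshold values are hypothetical."""
    kept_old = [oid for oid, hist in old_tracks
                if not all(c < tau_left for c in hist[-3:])]   # departure rule
    entered = [nid for nid, conf in new_detections
               if conf > tau_en]                               # entry rule
    return kept_old + entered   # T_t: processed tracking result for the next frame

tracks = update_track_list(
    old_tracks=[("a", [0.9, 0.8, 0.7]),        # confident: kept
                ("b", [0.2, 0.1, 0.2])],       # below tau_left for 3 frames: dropped
    new_detections=[("c", 0.9), ("d", 0.4)])   # "c" enters, "d" is rejected
assert tracks == ["a", "c"]
```

Requiring 3 consecutive low-confidence frames before dropping an old track makes the tracker robust to brief occlusions, while the entry threshold keeps low-confidence detections from spawning spurious tracks.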
The foregoing description is only illustrative of the specific embodiments of the present application and the principles of the technology employed. Persons skilled in the art will appreciate that the scope of the invention referred to in this application is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the application, for example embodiments in which the features described above are interchanged with (but not limited to) technical features of similar function disclosed in this application.
Claims (9)
1. The multi-mode hybrid automatic driving unified 3D detection and tracking method is characterized by comprising the following steps of:
step (1), inputting multi-mode data acquired by a laser radar and a camera from an automatic driving system, and respectively extracting BEV characteristics under different modes;
step (2), learning adaptive fusion weights from the BEV features of the different modalities obtained in step (1), and fusing the BEV features of the different modalities with these adaptive fusion weights to generate an adaptively fused BEV feature;
step (3), encoding the adaptively fused BEV feature obtained in step (2) with a Transformer encoder to obtain the encoded feature of the current frame; meanwhile, passing the adaptively fused BEV feature through a candidate region generation network to complete the 3D target detection task of the current frame and generate a series of 3D candidate boxes of the current frame;
step (4), splicing the series of 3D candidate boxes of the current frame with the processed target tracking result of the previous frame, and inputting the splicing result together with the encoded feature of the current frame into a Transformer decoder to obtain the initial target tracking result of the current frame;
and (5) generating a processed target tracking result of the current frame by using the initial target tracking result of the current frame acquired in the step (4), and finally outputting the target tracking result of the whole multi-frame input through continuous iteration between frames.
2. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 1, wherein in step (1) the multi-modal data collected by the lidar and the camera of the automatic driving system are input and converted to a consistent BEV view by their respective BEV generation networks, and the generated BEV features of the different modalities have the same spatial resolution and the same feature dimension.
3. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 1, wherein step (2) comprises:
step (2.1), generating the adaptive fusion weight of the corresponding lidar BEV feature from the BEV feature of the lidar modality;

step (2.2), generating the adaptive fusion weight of the corresponding camera BEV feature from the BEV feature of the camera modality;

step (2.3), generating the adaptive fusion weight of the corresponding fused lidar-camera BEV feature from the BEV feature fused from the lidar and camera modalities;
step (2.4) carrying out numerical normalization on the three self-adaptive fusion weights generated in the steps (2.1) to (2.3);
and (2.5) obtaining the BEV characteristics of the self-adaptive fusion through the BEV characteristics of the three modes and the corresponding normalized self-adaptive fusion weights.
4. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 3, wherein step (2.5) specifically comprises: multiplying the BEV features of the three modes with the corresponding normalized self-adaptive fusion weights respectively to obtain three groups of BEV features with the same size as the original BEV features; and splicing the three groups of BEV features to obtain new BEV features with 3 times of the original BEV feature dimension, and converting the dimension of the new BEV features into the self-adaptive fusion BEV features with the same dimension as the original BEV features by using the MLP network.
5. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 1, wherein step (3) comprises:
step (3.1) generating a series of 3D candidate frames of the current frame as an output result of 3D target detection of the current frame by adaptively fused BEV features through an RPN network;
step (3.2), obtaining the encoded feature of the current frame from the adaptively fused BEV feature through a Transformer-based encoder.
6. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 5, wherein the Transformer-based encoder consists of a plurality of serial encoder modules, with adjacent modules connected through residual connections; each module is built around an attention mechanism whose query, key and value are the adaptively fused BEV feature obtained in step (2).
7. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 5, wherein step (4) comprises:
step (4.1) the processed target tracking result of the previous frame is spliced with the 3D target detection output result of the current frame generated in the step (3.1);
step (4.2), inputting the encoded feature of the current frame obtained in step (3.2) and the splicing result obtained in step (4.1) into the Transformer decoder to obtain the initial target tracking result of the current frame.
8. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 1, wherein step (5) comprises:
step (5.1), dividing the initial target tracking result of the current frame obtained in step (4) into a new object set Q_new, corresponding to the 3D target detection output of the current frame, and an old object set Q_old, corresponding to the processed target tracking result of the previous frame;
Step (5.2) collecting the new objects obtained in step (5.1)And old object set->Respectively judging, and removing objects which do not meet the requirements from the collection;
step (5.3), merging the processed new object set Q_new and old object set Q_old to generate the processed target tracking result of the current frame.
9. The multi-modal hybrid automatic driving unified 3D detection and tracking method of claim 8, wherein in step (5.2): for the new object set Q_new, if the detection confidence of an object is greater than the set threshold, the object is kept, otherwise it is removed from Q_new; for the old object set Q_old, if the detection confidence of an object remains below the set threshold for 3 consecutive frames, the object is removed from Q_old, otherwise it is kept.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311382428.2A CN117333749A (en) | 2023-10-24 | 2023-10-24 | Multimode hybrid automatic driving unified 3D detection and tracking method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117333749A true CN117333749A (en) | 2024-01-02 |
Family
ID=89279078
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311382428.2A Pending CN117333749A (en) | 2023-10-24 | 2023-10-24 | Multimode hybrid automatic driving unified 3D detection and tracking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117333749A (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |