CN117516581A - End-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and neighborhood attention Transformer - Google Patents

End-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and neighborhood attention Transformer

Info

Publication number: CN117516581A
Application number: CN202311691441.6A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: vehicle, BEV, data, attention, model
Legal status: Pending
Inventors: 刘擎超, 王林强, 陈思齐, 蔡英凤, 王海, 赵晶娅, 赵霞, 熊晓夏, 许淼
Current and original assignee: Jiangsu University
Application filed by Jiangsu University; priority to CN202311691441.6A; publication of CN117516581A


Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00-G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00-G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/3446 Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an end-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and a neighborhood attention Transformer. BEVFormer learns a unified BEV feature representation from multi-view images through a Transformer with spatio-temporal structure, so as to capture the spatial relationships and temporal information in the input BEV data; an RNN feature extractor extracts the dynamic features and associated features of vehicle travel from the vehicle's historical trajectory data and vehicle state data, which are then fused into the BEV features; finally, the vehicle control signals and trajectories are output through a neighborhood-attention-based Transformer planning model and a fully connected layer. The invention exploits spatial and temporal information across cameras and time steps, fuses vehicle motion features with bird's-eye-view features to better understand and analyze complex relationships such as vehicle behavior and environmental changes, and applies a self-attention mechanism within local neighborhoods to capture the associations between different positions in the features, thereby improving the expressive capacity of the model.

Description

End-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and neighborhood attention Transformer
Technical Field
The invention belongs to the technical field of end-to-end automatic driving, and particularly relates to an end-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and a neighborhood attention Transformer.
Background
The statements in this section merely provide background information related to the present application and may not constitute prior art.
Autonomous driving, as a key technology, is becoming an important development direction in the future of transportation. In modern society, problems such as frequent traffic accidents and traffic congestion have become global challenges, and the introduction of autonomous driving technology offers new possibilities for solving them. With advanced perception and decision-making systems, autonomous driving enables a vehicle to perceive its environment autonomously, make safe driving decisions, and achieve efficient traffic flow. The development of autonomous driving technology has gone through several stages. Initially, rule-based methods were widely used in automated driving systems. These methods control vehicle behavior, such as lane keeping and traffic-signal compliance, through predefined rules and logic. However, this approach usually requires a large number of manually designed rules and cannot cope with complex driving scenarios and changing traffic conditions.
To address the limitations of traditional approaches, end-to-end autonomous driving technology is attracting increasing attention. End-to-end autonomous driving integrates perception, decision-making, planning and control into a unified system that learns a driving model directly from raw sensor data and outputs a control signal or a future trajectory. It eliminates the complex rule design and decision-reasoning processes of traditional methods and offers better adaptability and flexibility. Despite significant progress, end-to-end autonomous driving still faces challenges and limitations, including large data requirements, poor model interpretability, and limited generalization to new scenes.
Disclosure of Invention
To solve the above technical problems, the invention provides an end-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and a neighborhood attention Transformer, which greatly improve the autonomous driving system's understanding of the environment and its decision-making capability, and improve the accuracy and generalization of the driving system, so that the control signal or future trajectory of the autonomous vehicle can be output more accurately, providing support for safer and more intelligent autonomous driving.
The technical problems to be solved by the present invention are not limited to the above-described problems, and any other technical problems not mentioned in the present application will be clearly understood from the following description by those skilled in the art to which the present application pertains.
The end-to-end automatic driving trajectory planning system fusing BEVFormer and neighborhood attention Transformer is characterized by comprising the following components:
a data acquisition module, used to acquire multi-view images, historical trajectory data of the vehicle and vehicle state data, and to preprocess the acquired data;
an RNN feature extractor, used to extract the dynamic features and associated features of vehicle travel from the vehicle's historical trajectory data and vehicle state data; for the historical trajectory data, the motion change trends in the trajectory are captured through time-series processing, and the temporal dependencies in the vehicle's travel are captured using the memory mechanism of the RNN feature extractor; by learning the vehicle state data, the association between vehicle state and behavior is obtained;
a BEVFormer feature extraction module, comprising a backbone neural network and at least one encoder layer, each encoder layer comprising a BEV query mechanism, a spatial cross-attention module, a temporal self-attention module and a feed-forward network;
the backbone neural network is used to obtain the view features $F_t$ of the multi-view image at time $t$;
the BEV query mechanism predefines a set of grid-shaped learnable parameters $Q\in\mathbb{R}^{H\times W\times C}$ as the BEVFormer queries, used to query the corresponding grid cell region at $p=(x,y)$ in the BEV plane;
the BEV query mechanism queries the temporal information of the BEV spatial feature $B_{t-1}$ through the temporal self-attention module, and extracts the spatial information of the BEV spatial feature $B_{t-1}$ through the spatial cross-attention module;
the spatial cross-attention module further obtains multi-view spatial information of the vehicle based on the view features $F_t$ of the multi-view image;
the feed-forward network obtains the refined BEV feature $B_t$ based on the spatial information of the view features $F_t$ of the multi-view image and the temporal and spatial information of the BEV spatial feature $B_{t-1}$;
a feature fusion module, used to fuse the dynamic features and associated features of vehicle travel extracted by the RNN feature extractor into the BEV features;
a neighborhood attention Transformer planning model, which plans the future trajectory of the vehicle based on the fused BEV features and outputs the planning result through the fully connected layer and the visualization module.
Further, the data acquisition module comprises cameras with different visual angles, vehicle-mounted sensors and an inertial measurement unit.
Further, the spatial cross-attention (SCA) of the spatial cross-attention module is computed as:

$$\mathrm{SCA}(Q_p,F_t)=\frac{1}{|\mathcal{V}_{hit}|}\sum_{i\in\mathcal{V}_{hit}}\sum_{j=1}^{N_{ref}}\mathrm{DeformAttn}\big(Q_p,\mathcal{P}(p,i,j),F_t^i\big)$$

where:
- $i$ is the index of the camera view;
- $j$ is the index of the reference point;
- $N_{ref}$ is the number of total reference points per BEV query;
- $F_t^i$ is the feature of the $i$-th camera view;
- $Q_p$ is each BEV query;
- $\mathcal{P}(p,i,j)$ is the projection function, mapping the $j$-th 3D point $(x',y',z'_j)$ to a 2D point on the $i$-th view;
- $\mathrm{DeformAttn}(\cdot)$ is deformable attention.

The projection function is given by $\mathcal{P}(p,i,j)=(x_{ij},y_{ij})$, where

$$z_{ij}\cdot[x_{ij}\ \ y_{ij}\ \ 1]^{T}=T_i\cdot[x'\ \ y'\ \ z'_j\ \ 1]^{T}$$

and $T_i\in\mathbb{R}^{3\times 4}$ is the known projection matrix of the $i$-th camera.
Further, given the BEV query $Q$ at the current time $t$ and the historical BEV feature $B_{t-1}$ maintained at time $t-1$, $B_{t-1}$ is first aligned with $Q$ according to the ego-motion, so that features on the same grid correspond to the same real-world location; the aligned historical BEV feature $B_{t-1}$ is denoted $B'_{t-1}$. The temporal correlation between BEV features is modeled by the temporal self-attention module as:

$$\mathrm{TSA}(Q_p,\{Q,B'_{t-1}\})=\sum_{V\in\{Q,B'_{t-1}\}}\mathrm{DeformAttn}(Q_p,p,V)$$

where $Q_p$ denotes the BEV query located at $p=(x,y)$; the offsets in the temporal self-attention module are predicted from the concatenation of $Q$ and $B'_{t-1}$; for the first sample of each sequence, the temporal self-attention degenerates to self-attention without temporal information, in which the BEV features $\{Q,B'_{t-1}\}$ are replaced by the duplicated BEV queries $\{Q,Q\}$.
Further, the feature fusion module concatenates the features extracted by the RNN feature extractor and the BEV features together along the feature dimension.
Further, the neighborhood attention Transformer embeds the fused BEV features using 2 consecutive 3×3 overlapping convolutions; a 4-level neighborhood attention Transformer planning model is arranged in a stacked manner, with an overlapping tokenizer upstream of the first level and a downsampler (a 3×3 convolution with stride 2) connected between adjacent levels; the neighborhood attention mechanism is as follows:

Given an input $Y\in\mathbb{R}^{n\times d}$, a matrix whose rows are $d$-dimensional token vectors, with linear projections $Q$, $K$, $V$ of $Y$ and relative positional biases $G(u,v)$, the attention weight $A_u^k$ for the $u$-th input with neighborhood size $k$ is defined as the dot product of the $u$-th input's query projection $Q_u$ and the key projections of its $k$ nearest neighbors:

$$A_u^k=\begin{bmatrix} Q_uK_{\rho_1(u)}^{T}+G(u,\rho_1(u)) \\ \vdots \\ Q_uK_{\rho_k(u)}^{T}+G(u,\rho_k(u)) \end{bmatrix}$$

where $\rho_v(u)$ denotes the $v$-th nearest neighbor of $u$. The neighboring values $V_u^k$ are then defined as a matrix whose rows are the value projections of the $k$ nearest neighbors of the $u$-th input:

$$V_u^k=\big[V_{\rho_1(u)}^{T}\ \ V_{\rho_2(u)}^{T}\ \cdots\ V_{\rho_k(u)}^{T}\big]^{T}$$

The neighborhood attention of the $u$-th token with neighborhood size $k$ is:

$$\mathrm{NA}_k(u)=\mathrm{softmax}\!\left(\frac{A_u^k}{\sqrt{d}}\right)V_u^k$$

where $\sqrt{d}$ is the scaling parameter; this is repeated for every pixel in the fused BEV feature.
The training method for the fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning and prediction system is characterized by comprising the following steps:
S1, collecting the input data of the model: collecting multi-view images, historical trajectory data of the vehicle and vehicle state data, and preprocessing the collected data;
S2, taking the vehicle's historical trajectory data and vehicle state data as the input of the RNN model, and using the RNN module to extract the dynamic features and associated features of vehicle travel from them;
S3, taking the multi-view images as the input of the BEVFormer model, and using the BEVFormer feature extraction module to extract BEV features with temporal and spatial information;
S4, fusing the dynamic features and associated features of vehicle travel extracted by the RNN into the BEV features through the feature fusion module;
S5, training, verifying and optimizing the neighborhood attention Transformer model: inputting the fused BEV features into the neighborhood attention Transformer for training; after the neighborhood attention Transformer output, applying a fully connected layer to map the input features to the output space and outputting the vehicle's future trajectory planning result; and repeating steps S2-S5 to verify and optimize the model.
Further, the multi-view images in step S1 comprise image data from cameras at different viewing angles; the vehicle's historical trajectory data records the motion trajectory of the vehicle over a past period of time, including the vehicle's position coordinates, speed, acceleration and heading-angle information; the vehicle state data is information on the current state of the vehicle, including the vehicle's speed, acceleration, steering angle, powertrain parameters and brake-system parameters.
Further, the data preprocessing in step S1 includes:
(1) Data cleaning: cleaning the data, including processing missing values, outliers and repeated values;
(2) Feature selection: selecting characteristics related to the problems, and eliminating irrelevant characteristics;
(3) Data set partitioning: dividing the data set into a training set, a verification set and a test set;
(4) Data coding: encoding the collected data and converting it into numerical data;
(5) Data normalization: the data is normalized to have zero mean and unit variance to eliminate dimensional differences between features.
Further, the specific steps of training, verification and optimization in step S5 are as follows:
(1) Initializing model parameters: the parameters of the model are initialized using a random initialization method;
(2) Defining a loss function: a loss function is determined to measure the difference between the model's predictions and the true values on the training data; the loss function is a mean-square-error or cross-entropy function;
(3) Defining an optimization algorithm: an optimization algorithm suitable for model training is selected, such as gradient descent, stochastic gradient descent or Adam; stochastic gradient descent is selected for this task; the goal of the optimization algorithm is to minimize the loss function by adjusting the parameters of the model.
(4) Iterative training: performing iterative training of the model, each training iteration comprising the following steps:
(a) Forward propagation: the input data is propagated forward through the model to obtain a predicted value;
(b) Loss computation: the predicted value is compared with the true value, and the value of the loss function is computed;
(c) Back propagation: the gradient of the loss with respect to each parameter is computed by the back-propagation algorithm from the value of the loss function;
(d) Parameter update: the parameters of the model are updated along the gradients using the optimization algorithm;
the above steps are repeated until a set stopping condition is reached (such as reaching the maximum number of iterations or convergence of the loss function);
(5) Verification and adjustment of the model: during training, the validation set is periodically used to evaluate the performance of the model; the model is adjusted according to validation performance, such as tuning the learning rate or increasing regularization, so as to optimize model performance.
(6) Model preservation: the trained model parameters are saved for subsequent use and deployment.
An automatic driving trajectory planning and prediction method based on the fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning and prediction system, characterized in that
the method comprises the following steps:
step 1, collecting model input data: collecting multi-view images, historical trajectory data of the vehicle and vehicle state data, and preprocessing the collected data;
step 2, taking the vehicle's historical trajectory data and vehicle state data as the input of the RNN model, and using the RNN module to extract the dynamic features and associated features of vehicle travel from them;
step 3, taking the multi-view images as the input of the BEVFormer model, and using the BEVFormer feature extraction module to extract BEV features with temporal and spatial information;
step 4, fusing the dynamic features and associated features of vehicle travel extracted by the RNN feature extractor into the BEV features through the feature fusion module;
step 5, prediction with the neighborhood attention Transformer model: inputting the fused BEV features into the trained neighborhood attention Transformer; after the neighborhood attention Transformer output, applying a fully connected layer to map the input features to the output space and outputting the vehicle's future trajectory planning result.
The invention provides an end-to-end automatic driving trajectory planning system fusing BEVFormer and a neighborhood attention Transformer, built on perception of the complex environment of the autonomous vehicle. In the automatic driving trajectory planning and prediction method based on this system, the multi-view images are first used as the input of BEVFormer, while the historical trajectory data and vehicle state data are used as the input of the RNN feature extractor model. BEVFormer learns a unified BEV feature representation through a Transformer with spatio-temporal structure, effectively capturing the spatial relationships and temporal information in the input BEV data; the RNN feature extractor extracts the dynamic features and associated features of vehicle travel from the vehicle's historical trajectory data and vehicle state data. Then, the dynamic features and associated features extracted by the RNN feature extractor are fused into the BEV features by the feature fusion module. Finally, the vehicle control signals and trajectories are output through a neighborhood-attention-based Transformer model and a fully connected layer. The neighborhood attention Transformer is an efficient, accurate and scalable hierarchical Transformer, and neighborhood attention is an efficient and scalable sliding-window attention mechanism that expands the receptive field of each query to its nearest neighbors and approaches the self-attention mechanism as the receptive field grows. Therefore, the end-to-end automatic driving trajectory planning system improves the autonomous driving system's understanding of the environment and its decision-making capability, and improves accuracy and generalization, so that the control signal or future trajectory of the autonomous vehicle can be output more accurately, providing support for safer and more intelligent autonomous driving.
The beneficial effects of the invention are as follows:
1. The invention provides a feature fusion module that fuses the vehicle motion features acquired by the RNN feature extractor with the BEV features. First, the RNN feature extractor can capture the temporal information and patterns in the vehicle's historical trajectory data and vehicle state data. By fusing the vehicle motion features with the BEV features, this temporal information is effectively incorporated into the overall feature representation, describing the vehicle's behavior and state more completely. Fusing different types of features thus provides a richer, more diverse, multi-dimensional feature representation of the vehicle's state and its surroundings, improving the expressive capacity of the model. In addition, by fusing the vehicle motion features with the bird's-eye-view features, the model can better understand and analyze complex relationships such as vehicle behavior and environmental changes, further improving its performance on related tasks.
2. The invention provides a novel method fusing BEVFormer and a neighborhood attention Transformer. First, the spatio-temporal features and historical features from the multi-view cameras are effectively aggregated by BEVFormer. BEVFormer is a spatio-temporal Transformer that can support various autonomous driving perception tasks; it interacts with the spatio-temporal space through predefined grid-shaped BEV queries and can exploit spatial and temporal information across cameras and time steps, improving the performance and efficiency of the framework. Second, the neighborhood attention Transformer captures the associations between different positions in a feature by applying a self-attention mechanism within local neighborhoods. This locality modeling makes the neighborhood attention Transformer excellent at handling fine-grained features and structures. In addition, the self-attention mechanism in the neighborhood attention Transformer can be computed in parallel, with the features at different positions computed simultaneously, which makes it computationally efficient on large-scale data.
Drawings
FIG. 1 is a block diagram of the end-to-end automatic driving trajectory planning system fusing BEVFormer and neighborhood attention Transformer according to the present invention.
FIG. 2 is a flow chart of the end-to-end automatic driving trajectory planning system fusing BEVFormer and neighborhood attention Transformer according to the present invention.
FIG. 3 is a flow chart of training and testing the fused BEVFormer and neighborhood attention Transformer model according to the present invention.
FIG. 4 is a block diagram of the BEVFormer model architecture of the present invention.
FIG. 5 is a diagram of the neighborhood attention Transformer architecture of the present invention.
FIG. 6 is a block diagram of the neighborhood attention of the present invention.
Fig. 7 is a diagram showing the results of the end-to-end autopilot trajectory planning system of the present invention.
Fig. 8 is an exemplary diagram of an application scenario of the end-to-end automatic driving trajectory planning system of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings, but the protection of the invention is not limited thereto.
The invention discloses an end-to-end automatic driving trajectory planning and prediction system fusing BEVFormer and a neighborhood attention Transformer, as shown in FIG. 1, comprising a data acquisition module, an RNN feature extractor, a BEVFormer feature extraction module, a feature fusion module and a neighborhood attention Transformer planning model.
The data acquisition module comprises cameras with different visual angles, vehicle-mounted sensors and an inertial measurement unit; the method is used for acquiring multi-view images, historical track data of the vehicle and vehicle state data and preprocessing the acquired data.
The RNN feature extractor is used to extract the dynamic features and associated features of vehicle travel from the vehicle's historical trajectory data and vehicle state data; for the historical trajectory data, the motion change trends in the trajectory are captured through time-series processing, and the temporal dependencies in the vehicle's travel are captured using the memory mechanism of the RNN feature extractor; by learning the vehicle state data, the association between vehicle state and behavior is obtained.
The BEVFormer feature extraction module, as shown in fig. 4, comprises a backbone neural network and at least one encoder layer, each encoder layer comprising a BEV query mechanism, a spatial cross-attention module, a temporal self-attention module and a feed-forward network. The backbone neural network is used to obtain the view features of the multi-view image at time $t$; the BEV query mechanism predefines a set of grid-shaped learnable parameters as the BEVFormer queries, used to query the corresponding grid cell region in the BEV plane. The BEV query mechanism queries the temporal information of the BEV spatial features through the temporal self-attention module, and extracts the spatial information of the queried BEV spatial features through the spatial cross-attention module. The spatial cross-attention module also obtains multi-view spatial information of the vehicle based on the view features of the multi-view image. The feed-forward network obtains the refined BEV feature $B_t$ based on the spatial information of the view features of the multi-view image and the temporal and spatial information of the BEV spatial features.
The feature fusion module is used to fuse the dynamic features and associated features of vehicle travel extracted by the RNN feature extractor into the BEV features.
The neighborhood attention Transformer planning model plans the future trajectory of the vehicle based on the fused BEV features, and then outputs the planning result through the fully connected layer and the visualization module.
As shown in FIG. 2, the training method of the fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning model comprises the following steps:
data collection, data preprocessing, feature extraction, construction of the neighborhood attention Transformer model, and model training, verification and optimization.
S1, constructing the model inputs:
data collection uses cameras, sensors, GPS devices, inertial Measurement Units (IMUs) or experimental devices to collect images of multiple perspectives, historical trajectory data of the vehicle, and vehicle state data. Wherein the multi-view image is used as input of the BEVFormer feature extraction module, and the historical track data of the vehicle and the vehicle state data are used as input of the RNN feature extractor. The multi-view image is image data obtained from cameras from different views, each providing visual information of the surroundings of the vehicle. The historical track data of the vehicle records the motion track of the vehicle in a period of time in the past, and the motion track comprises information such as position coordinates, speed, acceleration, course angle and the like of the vehicle; such data may be obtained by means of onboard sensors such as GPS or Inertial Measurement Units (IMU). The vehicle state data provides information about the current state of the vehicle, including the speed, acceleration, steering angle, vehicle powertrain parameters (e.g., engine speed, accelerator opening), brake system parameters, etc. These data may be acquired by onboard sensors and an Electronic Control Unit (ECU) of the vehicle.
The multi-view images, the vehicle's historical trajectory data and the vehicle state data undergo data preprocessing before being input into the BEVFormer feature extraction module and the RNN feature extractor: the raw data are cleaned, converted and organized to improve data quality, reduce noise, handle missing values and outliers, and convert the data into a format suitable for model training and analysis. The data preprocessing steps are as follows:
(1) Data cleaning: the data is cleaned, including processing missing values, outliers, duplicate values, etc. Various methods may be used, such as filling in missing values, deleting outliers, etc.
(2) Feature selection: and selecting the characteristics related to the problems, and eliminating the irrelevant characteristics so as to improve the effect and efficiency of the model. Feature selection may be performed by statistical analysis, feature correlation, feature importance, etc. Feature selection is to select features with the highest relevance and importance from the original data so as to improve the performance and efficiency of the model.
(3) Data set partitioning: the data set is divided into a training set, a validation set and a test set. As shown in fig. 3, the processed multi-view images, vehicle historical trajectory data and vehicle state data are divided into a training set, a validation set and a test set at a ratio of 6:2:2. The training set is used to train the model and adjust parameters, the validation set is used for model selection, optimization and hyperparameter tuning, and the test set is used for the final evaluation of the model's performance and generalization ability.
(4) Data coding: the collected and processed data are encoded and converted into numerical data to facilitate processing by the model.
(5) Data normalization: and carrying out standardization processing on the data to ensure that the data has zero mean and unit variance so as to eliminate dimension differences among features and improve the stability and convergence rate of the model.
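A minimal sketch of steps (3) and (5) above, assuming NumPy arrays and the 6:2:2 split; fitting the standardization statistics on the training split only is an assumption, as the text does not specify it:

```python
import numpy as np

def split_and_standardize(X):
    # step (3): split into training, validation and test sets at 6:2:2
    n = len(X)
    train, val, test = np.split(X, [int(0.6 * n), int(0.8 * n)])
    # step (5): zero mean and unit variance; statistics are fitted on the
    # training split only (an assumption, not stated in the text)
    mu, sigma = train.mean(axis=0), train.std(axis=0) + 1e-8
    return (train - mu) / sigma, (val - mu) / sigma, (test - mu) / sigma
```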
S2, feature extraction
(1) A BEVFormer feature extraction module is established, and BEV features with spatio-temporal characteristics are extracted using the BEVFormer feature extraction module:
A new Transformer-based BEV generation framework is presented that can efficiently aggregate spatio-temporal features from multi-view cameras and aggregate BEV features through attention mechanisms. As shown in fig. 4, the BEVFormer feature extraction module has a backbone neural network and 3 encoder layers, each following the conventional structure of Transformers except for the BEV query mechanism, the spatial cross-attention module and the temporal self-attention module. Specifically, in each encoder layer, the BEV query mechanism is used to query features in BEV space from the multiple views through an attention mechanism. The spatial cross-attention module and the temporal self-attention module are attention layers that work with the BEV query mechanism to look up and aggregate spatial features from the multi-camera images and temporal features from the historical BEV according to the BEV queries. That is: at time $t$, the BEVFormer feature extraction module first obtains the view features $F_t$ of the multi-view image via a backbone neural network, e.g. VGGNet. The system retains the historical BEV feature $B_{t-1}$ from the previous time $t-1$. In each encoder layer, the temporal information of the prior BEV feature $B_{t-1}$ is first queried by the temporal self-attention module using the BEV queries; then, the BEV queries are used to extract the spatial information of the queried BEV spatial features through the spatial cross-attention module, while spatial information is simultaneously queried from the multi-camera features $F_t$ at time $t$ through the spatial cross-attention module; finally, after the feed-forward network, the encoder outputs the refined BEV features as the input of the next encoder layer. After 3 stacked encoder layers, the unified BEV feature $B_t$ at the current time $t$ is generated for the subsequent tasks.
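The per-layer dataflow just described (temporal self-attention, then spatial cross-attention, then a feed-forward network) can be sketched as follows; this is a structural sketch only, with standard multi-head attention standing in for the deformable attention layers detailed below, and all dimensions are illustrative:

```python
import torch.nn as nn

class BEVFormerEncoderLayer(nn.Module):
    # Dataflow sketch: temporal self-attention -> spatial cross-attention -> FFN.
    # nn.MultiheadAttention is a stand-in for the deformable attention layers.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, q, bev_prev, img_feats):
        # q: (B, H*W, C) BEV queries; bev_prev: aligned B'_{t-1};
        # img_feats: (B, N, C) flattened multi-camera view features F_t
        q = self.norm1(q + self.temporal_self_attn(q, bev_prev, bev_prev)[0])
        q = self.norm2(q + self.spatial_cross_attn(q, img_feats, img_feats)[0])
        return self.norm3(q + self.ffn(q))  # refined BEV feature, next layer's input
```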
First, BEV query mechanism:
The BEV query mechanism predefines a set of grid-shaped learnable parameters $Q\in\mathbb{R}^{H\times W\times C}$ as the BEVFormer queries, where $H$, $W$ are the spatial shape of the BEV plane. Specifically, the query $Q_p\in\mathbb{R}^{1\times C}$ at $p=(x,y)$ of $Q$ is responsible for querying the corresponding grid cell region in the BEV plane. Each grid cell in the BEV plane corresponds to a real-world size of $s$ meters. The center of the BEV features corresponds by default to the position of the ego vehicle. Before the BEV queries $Q$ are input to BEVFormer, a learnable positional embedding is added to them.
Second, spatial cross-attention module:
The computational cost of ordinary multi-head attention is high because of the large scale of the multi-camera 3D inputs. Therefore, the spatial cross-attention is developed based on deformable attention, a resource-efficient attention layer in which each BEV query $Q_p$ interacts only with its regions of interest across the camera views. The deformable attention mechanism is formulated as:

$$\mathrm{DeformAttn}(q,p,x)=\sum_{i=1}^{N_{head}} W_i \sum_{j=1}^{N_{key}} A_{ij}\cdot W'_i\, x(p+\Delta p_{ij})$$

where:
- $q$, $p$, $x$ denote the query, the reference point and the input features, respectively;
- $i$ is the index of the attention head and $j$ is the index of the sampling key;
- $N_{head}$ is the total number of attention heads and $N_{key}$ is the number of sampling keys per attention head;
- $W_i\in\mathbb{R}^{C\times(C/N_{head})}$ and $W'_i\in\mathbb{R}^{(C/N_{head})\times C}$ are learnable parameters, where $C$ is the feature dimension;
- $A_{ij}\in[0,1]$ is the predicted attention weight, normalized by $\sum_{j=1}^{N_{key}} A_{ij}=1$;
- $\Delta p_{ij}\in\mathbb{R}^{2}$ is the predicted offset relative to the reference point $p$;
- $x(p+\Delta p_{ij})$ is the feature at position $p+\Delta p_{ij}$, extracted by bilinear interpolation.
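A runnable single-scale sketch of this deformable attention, assuming PyTorch; reference points and predicted offsets are expressed in the normalized [-1, 1] coordinates expected by grid_sample, and the head and point counts are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAttn(nn.Module):
    # Single-scale sketch of DeformAttn(q, p, x) from the formula above.
    def __init__(self, dim=256, n_heads=8, n_points=4):
        super().__init__()
        self.h, self.np_, self.hd = n_heads, n_points, dim // n_heads
        self.offsets = nn.Linear(dim, n_heads * n_points * 2)  # predicts Δp_ij
        self.weights = nn.Linear(dim, n_heads * n_points)      # predicts A_ij
        self.w_value = nn.Linear(dim, dim)                     # W'_i
        self.w_out = nn.Linear(dim, dim)                       # W_i

    def forward(self, q, ref, x):
        # q: (B, Nq, C) queries; ref: (B, Nq, 2) reference points in [-1, 1];
        # x: (B, C, H, W) input feature map
        B, Nq, _ = q.shape
        off = self.offsets(q).view(B, Nq, self.h, self.np_, 2)
        attn = self.weights(q).view(B, Nq, self.h, self.np_).softmax(-1)  # Σ_j A_ij = 1
        v = self.w_value(x.flatten(2).transpose(1, 2))         # (B, H*W, C)
        v = v.transpose(1, 2).reshape(B, self.h, self.hd, *x.shape[2:])
        loc = (ref[:, :, None, None, :] + off).clamp(-1, 1)    # p + Δp_ij
        heads = []
        for i in range(self.h):
            # bilinear interpolation of the head's values at its sampling points
            g = F.grid_sample(v[:, i], loc[:, :, i], align_corners=False)
            heads.append((g * attn[:, :, i].unsqueeze(1)).sum(-1))  # weighted sum
        return self.w_out(torch.cat(heads, dim=1).transpose(1, 2))  # (B, Nq, C)
```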
However, the variable attention is designed based on 2D perception, so some adjustments need to be made to accommodate the 3D scene. Specifically, each query on the BEV plane is first promoted to a columnar query, and N is sampled from the column ref The 3D reference points are then projected into the 2D view. For a BEV query, the point projected onto 2D can only fall on some views, while other views are not hit. We call the hit view V here hit . These 2D points are then considered to be query Q p And from hit view V hit Surrounding sampling features. Finally, the sampled features are weighted and then output as spatial cross-attention. The following is a process formula for spatial cross-attention (SCA):
-i represents an index of camera views;
-j represents the index of the reference point;
-N ref is the number of total reference points per BEV query;
-F t i is characteristic of the ith camera view;
-Q p is each BEV query;
-P (P, i, j) is a projection function, a j-th reference point on the i-th view image being obtained;
deformatt () is a variable attention.
Next, it is described how the reference points on the view images are obtained from the projection function $\mathcal{P}$. First, the real-world position $(x',y')$ corresponding to the query $Q_p$ located at $p=(x,y)$ is computed:

$$x'=s\cdot\Big(x-\frac{W}{2}\Big),\qquad y'=s\cdot\Big(y-\frac{H}{2}\Big)$$

where:
- $H$, $W$ are the height and width of the BEV queries;
- $s$ is the resolution size of the BEV grid;
- $(x',y')$ are coordinates with the ego-vehicle position as the origin.

In 3D space, an object located at $(x',y')$ will appear at some height $z'$ on the z-axis. Therefore, a set of anchor heights $\{z'_j\}_{j=1}^{N_{ref}}$ is predefined to ensure that cues appearing at different heights can be captured. In this way, each query $Q_p$ obtains a pillar of 3D reference points $(x',y',z'_j)$. Finally, the 3D reference points are projected into the different views through the projection matrices of the cameras; the projection function is expressed as $\mathcal{P}(p,i,j)=(x_{ij},y_{ij})$, where

$$z_{ij}\cdot[x_{ij}\ \ y_{ij}\ \ 1]^{T}=T_i\cdot[x'\ \ y'\ \ z'_j\ \ 1]^{T}$$

Here $\mathcal{P}(p,i,j)$ is the 2D point on the $i$-th view projected from the $j$-th 3D point $(x',y',z'_j)$, and $T_i\in\mathbb{R}^{3\times 4}$ is the known projection matrix of the $i$-th camera.
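The projection function P(p, i, j) defined by these formulas can be sketched as follows, assuming NumPy; the grid size, resolution and anchor heights are illustrative values, not taken from the patent:

```python
import numpy as np

def project_bev_query(p, T_i, z_anchors, H=200, W=200, s=0.5):
    # p = (x, y): BEV grid coordinates of the query Q_p
    # T_i: known 3x4 projection matrix of the i-th camera
    # z_anchors: predefined anchor heights z'_j (illustrative values)
    x, y = p
    x_w = s * (x - W / 2.0)  # real-world x', ego-vehicle position as origin
    y_w = s * (y - H / 2.0)  # real-world y'
    hits = []
    for z_w in z_anchors:    # the pillar of 3D reference points (x', y', z'_j)
        zxy = T_i @ np.array([x_w, y_w, z_w, 1.0])
        if zxy[2] > 0:       # the point falls in front of camera i (a "hit")
            hits.append(zxy[:2] / zxy[2])  # 2D point (x_ij, y_ij) on view i
    return hits              # an empty list means view i is not hit
```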
Third, temporal self-attention module:
in addition to spatial information, temporal information is also critical to the system's understanding of the surrounding environment. For example, it is challenging to infer the velocity of moving objects or detect highly occluded objects from static images without time cues. To address this problem, a temporal self-attention module was designed to represent the current environment by incorporating historical BEV features. The method comprises the following specific steps:
Given the BEV query $Q$ at the current time $t$ and the historical BEV feature $B_{t-1}$ saved at time $t-1$, $B_{t-1}$ is first aligned with $Q$ according to the ego-motion, so that the features on the same grid correspond to the same real-world location; the aligned historical BEV feature $B_{t-1}$ is denoted $B'_{t-1}$. However, from $t-1$ to $t$, movable objects travel in the real world with various offsets, so constructing precise correspondences of the same targets between BEV features at different times is challenging. Therefore, the temporal correlation between BEV features is modeled by the temporal self-attention (TSA) module:

$$\mathrm{TSA}(Q_p,\{Q,B'_{t-1}\})=\sum_{V\in\{Q,B'_{t-1}\}}\mathrm{DeformAttn}(Q_p,p,V)$$

where $Q_p$ denotes the BEV query located at $p=(x,y)$.

Furthermore, unlike ordinary deformable attention, the offsets in the temporal self-attention module are predicted from the concatenation of $Q$ and $B'_{t-1}$. In particular, for the first sample of each sequence, the temporal self-attention degenerates to self-attention without temporal information, in which the BEV features $\{Q,B'_{t-1}\}$ are replaced by the duplicated BEV queries $\{Q,Q\}$.

Thus, the temporal self-attention module can model long-term dependencies more effectively than simply stacking BEVs. BEVFormer extracts temporal information from the previous BEV feature rather than from multiple stacked BEV features, which reduces both the computational cost and the interfering information.
(2) The RNN feature extractor is used for extracting vehicle dynamic features and associated features from historical track data and vehicle state data of the vehicle, and the processing steps are as follows:
first, vehicle motion data including historical track data of a vehicle and vehicle state data are input to an RNN feature extractor (e.g., LSTM). For the historical trajectory data, the RNN feature extractor considers information of vehicle position, speed, acceleration, and the like at each point in time, and captures a movement pattern and a change trend in the trajectory through time-series processing. The memory mechanism of the RNN feature extractor may help capture time-dependent relationships during vehicle travel, such as whether the acceleration of the vehicle remains consistent over a period of time, or whether the position of the vehicle exhibits periodic changes. By learning these patterns and trends in the historical trajectory data, the RNN feature extractor can further extract dynamic features of the vehicle's travel, providing a reference for subsequent driving strategies. For vehicle status data, the RNN feature extractor may consider various sensor data of the vehicle, such as steering wheel angle, accelerator pedal position, brake pedal status, etc. of the vehicle, as well as other status information of the vehicle, such as vehicle speed, turn signal status, whether the vehicle is turning on cruise control, etc. By taking these vehicle state data as inputs to the RNN feature extractor, the RNN feature extractor can learn an association between the vehicle state and behavior, e.g., the vehicle is more likely to take some behavior in some state. By extracting these associated features, the RNN feature extractor may provide more accurate inputs for the vehicle driving strategy.
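A minimal sketch of such an RNN feature extractor, using the LSTM variant mentioned above; the input channel counts and hidden sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VehicleMotionEncoder(nn.Module):
    # Sketch: one LSTM over the historical trajectory and one over the state
    # sequence; the final hidden states are combined into one motion feature.
    def __init__(self, traj_dim=5, state_dim=8, hidden=128, out_dim=256):
        super().__init__()
        self.traj_rnn = nn.LSTM(traj_dim, hidden, batch_first=True)
        self.state_rnn = nn.LSTM(state_dim, hidden, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, traj, state):
        # traj:  (B, T, 5) position, speed, acceleration, heading per step
        # state: (B, T, 8) steering, pedals, turn signals, cruise flag, ...
        _, (h_traj, _) = self.traj_rnn(traj)     # dynamic features of travel
        _, (h_state, _) = self.state_rnn(state)  # state-behavior association
        return self.proj(torch.cat([h_traj[-1], h_state[-1]], dim=-1))
```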
S3, feature fusion:
The vehicle motion features extracted by the RNN feature extractor, comprising the vehicle's historical trajectory features and state features, are fused into the BEV features extracted by BEVFormer. Here, the features extracted by the RNN feature extractor and the BEV features are concatenated along the feature dimension to form a longer feature vector. In addition, the consistency and alignment of the data are maintained during fusion, ensuring the temporal and spatial correspondence between the features extracted by the RNN feature extractor and the BEV features. Fusing features from different feature extraction methods enhances the expressive capacity of the features, improves the model's ability to model the data, and improves the robustness, generalization ability and performance of the model.
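The concatenation along the feature dimension can be sketched as below; broadcasting the single motion vector over every BEV grid cell is one plausible way of keeping the temporal and spatial correspondence described above, not a detail stated in the patent:

```python
import torch

def fuse_features(bev, motion):
    # bev:    (B, H*W, C_bev) BEV features from BEVFormer
    # motion: (B, C_m)        motion feature from the RNN feature extractor
    motion = motion.unsqueeze(1).expand(-1, bev.shape[1], -1)
    return torch.cat([bev, motion], dim=-1)  # (B, H*W, C_bev + C_m)
```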
S4, training, verifying and optimizing the neighborhood attention Transformer model
(1) Constructing the neighborhood attention Transformer planning model:
As shown in fig. 5, the neighborhood attention Transformer planning model is an efficient, accurate and scalable hierarchical Transformer. It embeds the fused BEV features using 2 consecutive 3×3 convolutions with stride 2, which yields a spatial size 1/4 of the input. This is similar to using a 4×4 patch-embedding layer, but with overlapping rather than non-overlapping convolutions, which introduces a useful inductive bias. On the other hand, the two convolutions introduce more parameters; this is handled by reconfiguring the model, which yields a better trade-off. The 4-level neighborhood attention Transformer planning model is arranged in a stacked manner, with the overlapping tokenizer upstream of the first level and a downsampler between adjacent levels. Each downsampler halves the spatial size and doubles the number of channels, using a 3×3 convolution with stride 2. Since the overlapping tokenizer downsamples the spatial size by a factor of 4, the model produces feature maps of sizes $\frac{H}{4}\times\frac{W}{4}$, $\frac{H}{8}\times\frac{W}{8}$, $\frac{H}{16}\times\frac{W}{16}$ and $\frac{H}{32}\times\frac{W}{32}$, allowing the neighborhood attention Transformer to migrate pre-trained models to downstream tasks more easily.
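The overlapping tokenizer and the inter-stage downsampler described above can be sketched as follows in PyTorch; the channel widths are illustrative:

```python
import torch.nn as nn

def overlapping_tokenizer(in_ch, embed_dim):
    # two consecutive 3x3 convolutions with stride 2 -> 1/4 spatial size,
    # with overlapping receptive fields (unlike a 4x4 patch embedding)
    return nn.Sequential(
        nn.Conv2d(in_ch, embed_dim // 2, kernel_size=3, stride=2, padding=1),
        nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
    )

def downsampler(dim):
    # between stages: halve the spatial size and double the channel count
    return nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
```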
As shown in FIG. 6, the neighborhood attention mechanism is an effective and scalable sliding-window attention mechanism that localizes the attention range of each pixel to its nearest neighborhood, approaches self-attention as this range grows, and maintains translational equivariance. Compared with self-attention it has linear time and space complexity, and it also introduces a local inductive bias similar to convolution. The specific steps are as follows:

Given an input $Y\in\mathbb{R}^{n\times d}$, a matrix whose rows are $d$-dimensional token vectors, with linear projections $Q$, $K$, $V$ of $Y$ and relative positional biases $G(u,v)$, the attention weight $A_u^k$ for the $u$-th input with neighborhood size $k$ is defined as the dot product of the $u$-th input's query projection $Q_u$ and the key projections of its $k$ nearest neighbors:

$$A_u^k=\begin{bmatrix} Q_uK_{\rho_1(u)}^{T}+G(u,\rho_1(u)) \\ \vdots \\ Q_uK_{\rho_k(u)}^{T}+G(u,\rho_k(u)) \end{bmatrix}$$

where $\rho_v(u)$ denotes the $v$-th nearest neighbor of $u$. The neighboring values $V_u^k$ are then defined as a matrix whose rows are the value projections of the $k$ nearest neighbors of the $u$-th input:

$$V_u^k=\big[V_{\rho_1(u)}^{T}\ \ V_{\rho_2(u)}^{T}\ \cdots\ V_{\rho_k(u)}^{T}\big]^{T}$$

The neighborhood attention of the $u$-th token with neighborhood size $k$ is:

$$\mathrm{NA}_k(u)=\mathrm{softmax}\!\left(\frac{A_u^k}{\sqrt{d}}\right)V_u^k$$

where $\sqrt{d}$ is the scaling parameter; this is repeated for every pixel in the fused BEV feature.

As can be seen from this definition, as $k$ grows, $A_u^k$ approaches the self-attention weights and $V_u^k$ approaches $V$ itself. Each pixel in neighborhood attention attends to the window around it, and the input is padded around the edges to handle the edge cases. It is due to this property that neighborhood attention approaches self-attention as the window size grows.
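A direct, unvectorized sketch of these formulas for a 1-D token sequence, assuming PyTorch; taking the neighborhood rho_1(u)..rho_k(u) as a window of k tokens clamped at the sequence edges matches the edge-padding behavior described above, and flattening the relative biases G(u, rho_v(u)) into one (n, k) table is a simplification:

```python
import torch

def neighborhood_attention(Q, K, V, G, k):
    # Q, K, V: (n, d) projections of the input Y; G: (n, k) relative biases
    n, d = Q.shape  # assumes n >= k
    out = torch.empty_like(V)
    for u in range(n):
        lo = min(max(u - k // 2, 0), n - k)  # clamp the window at the edges
        nbrs = torch.arange(lo, lo + k)      # rho_1(u) .. rho_k(u)
        A = Q[u] @ K[nbrs].T + G[u]          # attention weights A_u^k
        out[u] = torch.softmax(A / d ** 0.5, dim=-1) @ V[nbrs]  # NA_k(u)
    return out
```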
(2) Model training, verification and optimization
Model training plays a vital role in the autonomous driving system; training is the process by which the model learns and extracts useful patterns and regularities from the data. The model is trained through the steps of initializing model parameters, defining a loss function, defining an optimization algorithm, iterative training, and model verification and adjustment; during training, the generalization ability and overfitting of the model are monitored to obtain the best model performance.
The neighborhood attention Transformer planning model reaches the expected performance through model training, verification and optimization. Finally, the model is applied to predict the driving control and future trajectory of the autonomous vehicle in complex scenarios. In addition, factors such as model security and privacy protection need to be considered when the model is deployed. The trained end-to-end model is used to predict or infer on new input data; in the model application phase, the corresponding driving control or future trajectory is generated.
The data set is divided into a training set, a validation set and a test set according to a certain proportion. The training and validation sets are then used for model training and validation: the fused BEV features are input into the neighborhood attention Transformer planning model for training, verification and optimization.
The method comprises the following specific steps:
(1) Initializing model parameters: the parameters of the neighborhood attention Transformer planning model are initialized using a random initialization method. The purpose of initialization is to give the model a starting point from which it can gradually adjust its parameters during training to fit the data.
(2) Defining a loss function: an appropriate loss function is chosen to measure the difference between the model's predictions and the true values on the training data. Common loss functions include Mean Square Error (MSE), cross entropy, etc., where cross entropy functions are chosen as the loss functions.
(3) Defining an optimization algorithm: an optimization algorithm suitable for model training is selected; common algorithms include gradient descent, stochastic gradient descent and Adam. Stochastic gradient descent is selected for this task, and the goal of the optimization algorithm is to minimize the loss function by adjusting the parameters of the model.
(4) Iterative training: iterative training of the model begins (a minimal code sketch of this loop is given after this list). Each training iteration comprises the following steps:
(a) Forward propagation: the input data is propagated forward through the model to obtain a predicted value.
(b) Loss computation: the predicted value is compared with the true value, and the value of the loss function is computed.
(c) Back propagation: the gradient of the loss with respect to each parameter is computed by the back-propagation algorithm from the value of the loss function.
(d) Parameter update: the parameters of the model are updated along the gradients using the optimization algorithm.
The above steps are repeated until a set stopping condition is reached, such as reaching the maximum number of iterations or convergence of the loss function.
(5) Verification and adjustment of the model: during training, the validation set is periodically used to evaluate the performance of the model; the model is adjusted according to validation performance, such as tuning the learning rate or increasing regularization, to optimize model performance.
(6) Model preservation: the trained model parameters are saved for subsequent use and deployment.
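A minimal sketch of the iterative training loop of step (4), assuming PyTorch; stochastic gradient descent follows the choice above, while MSELoss is shown instead of the cross entropy named earlier (it fits a continuous trajectory output), and the learning rate, momentum and data layout are illustrative assumptions:

```python
import torch

def train(model, train_loader, epochs=50, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.MSELoss()  # illustrative stand-in for the chosen loss
    for epoch in range(epochs):
        for fused_bev, target_traj in train_loader:
            pred = model(fused_bev)            # (a) forward propagation
            loss = loss_fn(pred, target_traj)  # (b) loss computation
            opt.zero_grad()
            loss.backward()                    # (c) back propagation
            opt.step()                         # (d) parameter update
```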
Model generation and optimization: for each test set sample, the test set sample is input into a trained model for prediction, the model propagates forward according to the input characteristics, and the output result is calculated layer by layer. In the trained model, the model generates control signals or future trajectories of the vehicle based on current inputs. The generated vehicle control signal or future trajectory is then transferred to a specific execution unit of the vehicle to control the vehicle. At the same time, data in the actual execution process continues to be collected for further optimization and iteration of the model.
As shown in fig. 7, the results of the end-to-end automatic driving trajectory planning system are displayed. Referring to fig. 7, the interface consists of four parts: frame (1) shows the current running state of the vehicle, including running, stopping, braking, accelerating, starting, etc.; frame (2) displays the current time and the signal state of the vehicle; frame (3) displays the display-interface keys of the end-to-end prediction system, namely the vehicle historical trajectory information, the vehicle state information and the end-to-end prediction information, which can be used to view the corresponding display information; frame (4) shows the vehicle-related information for the key selected in frame (3).
In detail, the "vehicle historical trajectory information" interface displays the vehicle's trajectory at the current moment; the historical trajectory data mainly comprise the vehicle's position information, timestamps, speed, direction, etc. The "vehicle state information" interface displays the state of the vehicle at the current time; the vehicle state data mainly comprise the vehicle ID, vehicle speed, current travel direction, acceleration, steering angle, vehicle type, vehicle status, etc. The "end-to-end prediction information" interface shows the predicted driving control signal or future trajectory of the autonomous vehicle at the current time, based on the inputs described above.
As shown in fig. 8, an exemplary application scenario of the end-to-end automatic driving trajectory planning system is presented.
It should first be understood that fig. 8 is presented by way of example only and is not intended to limit the scope of the present application.
Referring to fig. 8 and taking the traffic scenario in the figure as an example, the driving control or future trajectory of the ego vehicle during driving is studied: the ego vehicle collects various information, including the vehicle's historical trajectory information, vehicle state information and multi-view image information, processes it, and passes it to the trained prediction model, which outputs the driving control or future trajectory of the vehicle. Finally, all predicted results are displayed on the car's central control screen, showing the driver the vehicle's real-time driving strategy and providing related suggestions.
The above list of detailed descriptions is only specific to practical embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent manners or modifications that do not depart from the technical scope of the present invention should be included in the scope of the present invention.

Claims (11)

1. An end-to-end automatic driving trajectory planning system fusing BEVFormer and neighborhood attention Transformer, characterized by comprising:
a data acquisition module, used to acquire multi-view images, historical trajectory data of the vehicle and vehicle state data, and to preprocess the acquired data;
an RNN feature extractor, used to extract the dynamic features and associated features of vehicle travel from the vehicle's historical trajectory data and vehicle state data; for the historical trajectory data, the motion change trends in the trajectory are captured through time-series processing, and the temporal dependencies in the vehicle's travel are captured using the memory mechanism of the RNN feature extractor; by learning the vehicle state data, the association between vehicle state and behavior is obtained;
a BEVFormer feature extraction module, comprising a backbone neural network and at least one encoder layer, each encoder layer comprising a BEV query mechanism, a spatial cross-attention module, a temporal self-attention module and a feed-forward network;
the backbone neural network is used to obtain the view features $F_t$ of the multi-view image at time $t$;
the BEV query mechanism predefines a set of grid-shaped learnable parameters $Q\in\mathbb{R}^{H\times W\times C}$ as the BEVFormer queries, used to query the corresponding grid cell region at $p=(x,y)$ in the BEV plane;
the BEV query mechanism queries the temporal information of the BEV spatial feature $B_{t-1}$ through the temporal self-attention module, and extracts the spatial information of the BEV spatial feature $B_{t-1}$ through the spatial cross-attention module;
the spatial cross-attention module further obtains multi-view spatial information of the vehicle based on the view features $F_t$ of the multi-view image;
the feed-forward network obtains the refined BEV feature $B_t$ based on the spatial information of the view features $F_t$ of the multi-view image and the temporal and spatial information of the BEV spatial feature $B_{t-1}$;
a feature fusion module, used to fuse the dynamic features and associated features of vehicle travel extracted by the RNN feature extractor into the BEV features;
a neighborhood attention Transformer planning model, which plans the future trajectory of the vehicle based on the fused BEV features and outputs the planning result through the fully connected layer and the visualization module.
2. The fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning system of claim 1, wherein the data acquisition module comprises cameras at different viewing angles, on-board sensors and an inertial measurement unit.
3. The fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning system of claim 1, wherein the spatial cross-attention (SCA) of the spatial cross-attention module is computed as:

$$\mathrm{SCA}(Q_p,F_t)=\frac{1}{|\mathcal{V}_{hit}|}\sum_{i\in\mathcal{V}_{hit}}\sum_{j=1}^{N_{ref}}\mathrm{DeformAttn}\big(Q_p,\mathcal{P}(p,i,j),F_t^i\big)$$

where:
- $i$ is the index of the camera view;
- $j$ is the index of the reference point;
- $N_{ref}$ is the number of total reference points per BEV query;
- $F_t^i$ is the feature of the $i$-th camera view;
- $Q_p$ is each BEV query;
- $\mathcal{P}(p,i,j)$ is the projection function, mapping the $j$-th 3D point $(x',y',z'_j)$ to a 2D point on the $i$-th view;
- $\mathrm{DeformAttn}(\cdot)$ is deformable attention.

The projection function is given by $\mathcal{P}(p,i,j)=(x_{ij},y_{ij})$, where

$$z_{ij}\cdot[x_{ij}\ \ y_{ij}\ \ 1]^{T}=T_i\cdot[x'\ \ y'\ \ z'_j\ \ 1]^{T}$$

and $T_i\in\mathbb{R}^{3\times 4}$ is the known projection matrix of the $i$-th camera.
4. The fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning system of claim 1, wherein, given the BEV query Q at the current time t and the historical BEV feature $B_{t-1}$ saved at time t-1, $B_{t-1}$ is first aligned with Q according to the ego-motion, so that features on the same grid correspond to the same real-world location; the aligned historical BEV feature $B_{t-1}$ is denoted $B'_{t-1}$; the temporal correlation between BEV features modeled by the temporal self-attention (TSA) module is expressed as:

$$\mathrm{TSA}(Q_p, \{Q, B'_{t-1}\}) = \sum_{V \in \{Q,\ B'_{t-1}\}} \mathrm{DeformAttn}(Q_p,\ p,\ V)$$

wherein $Q_p$ represents the BEV query located at $p=(x,y)$; the offsets in the temporal self-attention module are predicted from the concatenation of Q and $B'_{t-1}$; for the first sample of each sequence, the temporal self-attention degenerates into self-attention without temporal information, in which the BEV features $\{Q, B'_{t-1}\}$ are replaced by the duplicated BEV queries $\{Q, Q\}$.
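A compact sketch of this fallback behavior for a sequence's first sample; `deform_attn` stands in for the deformable-attention call, whose real signature differs:

```python
def temporal_self_attention(deform_attn, Q, B_prev_aligned=None):
    """TSA over {Q, B'_{t-1}}; falls back to {Q, Q} for a sequence's first sample."""
    values = [Q, B_prev_aligned if B_prev_aligned is not None else Q]
    return sum(deform_attn(Q, V) for V in values)  # sum over V in {Q, B'_{t-1}}
```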
5. The fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning system of claim 1, wherein the feature fusion module concatenates the features extracted by the RNN feature extractor and the BEV features along the feature dimension.
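A minimal sketch of this fusion, assuming the BEV feature is a (B, C, H, W) map and the RNN feature a (B, C') vector broadcast over the grid; the names and shapes are illustrative:

```python
import torch

def fuse_features(bev_feat, rnn_feat):
    """Claim 5: concatenate RNN-extracted features with BEV features channel-wise."""
    B, _, H, W = bev_feat.shape
    rnn_map = rnn_feat[:, :, None, None].expand(B, rnn_feat.shape[1], H, W)
    return torch.cat([bev_feat, rnn_map], dim=1)  # (B, C + C', H, W)
```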
6. The fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning system of claim 1, wherein the neighborhood attention Transformer embeds the fused BEV features using 2 consecutive 3×3 overlapping convolutions; multiple stages of the neighborhood attention Transformer planning model are arranged in a stack, with an overlapping tokenizer upstream of the first stage and a downsampler between adjacent stages, the downsampler being a 3×3 convolution with stride 2; the neighborhood attention mechanism is as follows:
given an input $Y \in \mathbb{R}^{n \times d}$, a matrix whose rows are d-dimensional token vectors, with linear projections $Q$, $K$, $V$ of $Y$ and relative positional biases $G(u,v)$;

the attention weight $A_u^k$ for the $u$-th input with neighborhood size k is defined as the inner products of the $u$-th input's query projection $Q_u$ with the key projections of its k nearest neighbors, plus the corresponding positional biases:

$$A_u^k = \begin{bmatrix} Q_u K_{\rho_1(u)}^T + G(u, \rho_1(u)) \\ Q_u K_{\rho_2(u)}^T + G(u, \rho_2(u)) \\ \vdots \\ Q_u K_{\rho_k(u)}^T + G(u, \rho_k(u)) \end{bmatrix}$$

wherein $\rho_v(u)$ denotes the $v$-th nearest neighbor of $u$;

the neighboring values $V_u^k$ are then defined as a matrix whose rows are the value projections of the k nearest neighbors of the $u$-th input:

$$V_u^k = \left[\, V_{\rho_1(u)}^T \ \ V_{\rho_2(u)}^T \ \cdots \ V_{\rho_k(u)}^T \,\right]^T$$

the neighborhood attention of the $u$-th token with neighborhood size k is then:

$$\mathrm{NA}_k(u) = \mathrm{softmax}\!\left(\frac{A_u^k}{\sqrt{d}}\right) V_u^k$$

wherein $\sqrt{d}$ is the scaling parameter; this operation is repeated for every pixel in the fused BEV feature.
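To make the neighborhood restriction concrete, a minimal single-head sketch over a 1D token sequence; the function and its border-clamping rule are illustrative, while NAT itself applies the same rule over 2D pixel neighborhoods with optimized kernels:

```python
import torch

def neighborhood_attention_1d(Q, K, V, G, k):
    """Single-head neighborhood attention over a 1D token sequence (sketch).

    Q, K, V: (n, d) linear projections of the input Y; G: (n, n) matrix of
    relative positional biases G(u, v). Requires n >= k.
    """
    n, d = Q.shape
    out = torch.empty_like(V)
    for u in range(n):
        lo = min(max(u - k // 2, 0), n - k)       # clamp the window at the borders
        nbrs = torch.arange(lo, lo + k)           # rho_1(u) ... rho_k(u)
        a = (Q[u] @ K[nbrs].t() + G[u, nbrs]) / d ** 0.5  # A_u^k / sqrt(d)
        out[u] = torch.softmax(a, dim=-1) @ V[nbrs]       # softmax(...) V_u^k
    return out
```

With n = 16 tokens and k = 3, for instance, each output attends only to itself and its two nearest neighbors, which is exactly the locality that lets the model capture correlations between nearby BEV positions at reduced cost.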
7. A training method of the fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning system of any one of claims 1-6, comprising the following steps:
S1, collecting the input data of the model: collecting multi-view images, historical track data of the vehicle and vehicle state data, and preprocessing the collected data;
S2, feature extraction: taking the historical track data and vehicle state data of the vehicle as the input of the RNN feature extractor, which extracts the dynamic features and associated features of vehicle running; taking the multi-view images as the input of the BEVFormer model, and extracting spatio-temporal BEV features with the BEVFormer feature extraction module;
S3, fusing the dynamic features and associated features of vehicle running extracted by the RNN feature extractor into the BEV features through the feature fusion module;
S4, training, verifying and optimizing the neighborhood attention Transformer model: inputting the fused BEV features into the neighborhood attention Transformer for training; after the neighborhood attention Transformer output, applying a fully connected layer to map the features to the output space and output the future trajectory planning result of the vehicle; repeating steps S2 to S4 to verify and optimize the model.
8. The training method of the automatic driving trajectory planning system according to claim 7, wherein the multi-view images in step S1 comprise image data from cameras with different views; the historical track data of the vehicle records the motion track of the vehicle over a past period of time, including position coordinates, speed, acceleration and heading angle information; the vehicle state data is information on the current state of the vehicle, including speed, acceleration, steering angle, powertrain parameters and brake system parameters.
9. The training method of the automatic driving trajectory planning system according to claim 7, wherein the data preprocessing in step S1 comprises:
(1) Data cleaning: cleaning the data, including handling missing values, outliers and duplicate values;
(2) Feature selection: selecting features relevant to the task and removing irrelevant features;
(3) Data set partitioning: dividing the data set into a training set, a validation set and a test set;
(4) Data encoding: encoding the collected data and converting it into numerical data;
(5) Data normalization: normalizing the data to zero mean and unit variance to eliminate dimensional differences between features (a combined sketch of these steps follows this list).
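A hedged sketch of this preprocessing pipeline with NumPy/scikit-learn; the split ratios and the column-mean imputation strategy are assumptions, and steps (2) and (4) are task-specific:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess(X, y):
    """Sketch of claim 9 on a numeric feature matrix X of shape (n_samples, n_features).

    Feature selection (2) and categorical encoding (4) are assumed done
    upstream; the 70/15/15 split ratio is an assumption.
    """
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)                 # (1) fill missing values
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
    X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
    scaler = StandardScaler().fit(X_tr)                    # (5) zero mean, unit variance
    return (scaler.transform(X_tr), y_tr), (scaler.transform(X_val), y_val), \
           (scaler.transform(X_te), y_te)                  # (3) train / val / test
```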
10. The training method of the automatic driving trajectory planning system according to claim 7, wherein the specific steps of training, verifying and optimizing the model in S4 are:
(1) Initializing model parameters: initializing the parameters of the model with a random initialization method;
(2) Defining a loss function: determining a loss function that measures the difference between the model's predictions and the true values on the training data; the loss function is a mean square error or cross-entropy function;
(3) Defining an optimization algorithm: selecting an optimization algorithm suitable for model training from gradient descent, stochastic gradient descent and Adam; this task uses stochastic gradient descent;
(4) Iterative training: performing iterative training of the model, where each training iteration comprises the following steps:
(a) Forward propagation: passing the input data forward through the model to obtain predicted values;
(b) Computing the loss: comparing the predicted values with the true values and computing the value of the loss function;
(c) Back propagation: computing the gradient of the loss with respect to each parameter via the back-propagation algorithm;
(d) Updating parameters: updating the parameters of the model from the gradients with the optimization algorithm;
repeating these steps until the set stopping condition is reached;
(5) Verifying and adjusting the model: during training, periodically evaluating the performance of the model on the validation set, and adjusting the model according to the validation performance, for example by tuning the learning rate or adding regularization, to optimize performance;
(6) Saving the model: saving the trained model parameters for subsequent use and deployment (a minimal training-loop sketch follows).
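As a concrete reading of steps (1)-(6), a minimal PyTorch training loop; MSE loss and SGD are the options the claim names, while the epoch count, learning rate and loader format are assumptions:

```python
import torch
import torch.nn as nn

def train_planner(model, train_loader, val_loader, epochs=50, lr=1e-3):
    """Sketch of claim 10's loop; hyperparameters and loader format are assumptions."""
    criterion = nn.MSELoss()                                # (2) MSE on planned waypoints
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # (3) stochastic gradient descent
    for epoch in range(epochs):                             # (4) iterative training
        model.train()
        for features, target in train_loader:
            pred = model(features)              # (a) forward propagation
            loss = criterion(pred, target)      # (b) compute the loss
            optimizer.zero_grad()
            loss.backward()                     # (c) back-propagate gradients
            optimizer.step()                    # (d) update parameters
        model.eval()                            # (5) periodic validation
        with torch.no_grad():
            val = sum(criterion(model(f), t).item() for f, t in val_loader) / len(val_loader)
        print(f"epoch {epoch}: validation MSE = {val:.4f}")
    torch.save(model.state_dict(), "planner.pt")  # (6) save trained parameters
```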
11. An automatic driving trajectory planning prediction method based on the fused BEVFormer and neighborhood attention Transformer end-to-end automatic driving trajectory planning system, characterized in that the method comprises the following steps:
Step 1, collecting the model input data: collecting multi-view images, historical track data of the vehicle and vehicle state data, and preprocessing the collected data;
Step 2, taking the historical track data and vehicle state data of the vehicle as the input of the RNN feature extractor, and extracting the dynamic features and associated features of vehicle running with the RNN feature extractor;
Step 3, taking the multi-view images as the input of the BEVFormer model, and extracting spatio-temporal BEV features with the BEVFormer feature extraction module;
Step 4, fusing the dynamic features and associated features of vehicle running extracted by the RNN feature extractor into the BEV features through the feature fusion module;
Step 5, prediction with the neighborhood attention Transformer model: inputting the fused BEV features into the trained neighborhood attention Transformer; after the Transformer output, applying a fully connected layer to map the features to the output space and output the future trajectory planning result of the vehicle.
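Putting claim 11 together, a hedged end-to-end inference sketch; the `system` attribute names (`bevformer`, `rnn_extractor`, `fuse`, `nat_planner`, `fc_head`) and the 6-waypoint horizon are invented here for illustration:

```python
import torch

@torch.no_grad()
def plan_trajectory(system, images, history, state, horizon=6):
    """Steps 1-5 of claim 11 at inference time; attribute names are assumptions.

    images: multi-view camera tensor; history: past ego trajectory; state:
    current vehicle status. Returns `horizon` future (x, y) waypoints.
    """
    bev = system.bevformer(images)              # step 3: spatio-temporal BEV feature
    dyn = system.rnn_extractor(history, state)  # step 2: driving dynamics feature
    fused = system.fuse(bev, dyn)               # step 4: channel-wise feature fusion
    tokens = system.nat_planner(fused)          # step 5: neighborhood attention Transformer
    waypoints = system.fc_head(tokens.mean(dim=1))  # fully connected output head
    return waypoints.view(-1, horizon, 2)
```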
CN202311691441.6A 2023-12-11 2023-12-11 End-to-end automatic driving track planning system, method and training method integrating BEVFomer and neighborhood attention transducer Pending CN117516581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311691441.6A CN117516581A (en) 2023-12-11 2023-12-11 End-to-end automatic driving track planning system, method and training method integrating BEVFomer and neighborhood attention transducer


Publications (1)

Publication Number Publication Date
CN117516581A true CN117516581A (en) 2024-02-06

Family

ID=89762705


Country Status (1)

Country Link
CN (1) CN117516581A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117935173A (en) * 2024-03-21 2024-04-26 安徽蔚来智驾科技有限公司 Target vehicle identification method, field end server and readable storage medium


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180120843A1 (en) * 2016-11-03 2018-05-03 Mitsubishi Electric Research Laboratories, Inc. System and Method for Controlling Vehicle Using Neural Network
KR102388806B1 (en) * 2021-04-30 2022-05-02 (주)에이아이매틱스 System for deciding driving situation of vehicle
CN113954864A (en) * 2021-09-22 2022-01-21 江苏大学 Intelligent automobile track prediction system and method fusing peripheral vehicle interaction information
CN114372116A (en) * 2021-12-30 2022-04-19 华南理工大学 Vehicle track prediction method based on LSTM and space-time attention mechanism
CN115937821A (en) * 2022-12-12 2023-04-07 上海人工智能创新中心 Full-stack automatic driving planning method and unified architecture system thereof
CN116258242A (en) * 2022-12-14 2023-06-13 北京理工大学 Reactive track prediction method and system for automatic driving vehicle
CN116853272A (en) * 2023-07-12 2023-10-10 江苏大学 Automatic driving vehicle behavior prediction method and system integrating complex network and graph converter

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ALI HASSANI et al.: "Neighborhood Attention Transformer", 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16 May 2023 (2023-05-16), page 3 *
HE YOUGUO: "Predicting pedestrian tracks around moving vehicles based on conditional variational transformer", Proceedings of the Institution of Mechanical Engineers, 26 May 2023 (2023-05-26) *
LI ZHIQI: "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers", Computer Vision - ECCV 2022, 31 December 2022 (2022-12-31) *
LI ZHIQI et al.: "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers", 17th European Conference on Computer Vision, 13 July 2022 (2022-07-13), page 1 *
CAI YINGFENG et al.: "Vehicle behavior prediction based on an attention mechanism" (基于注意力机制的车辆行为预测), Journal of Jiangsu University, vol. 41, no. 2, 31 March 2020 (2020-03-31) *
ZHAO HONG et al.: "Fundamentals of Intelligent Computing Technology and Applications" (智能计算技术与应用基础), Beijing University of Posts and Telecommunications Press, 31 August 2022, page 191 *


Similar Documents

Publication Publication Date Title
CN114842028B (en) Cross-video target tracking method, system, electronic equipment and storage medium
CN113705636B (en) Method and device for predicting track of automatic driving vehicle and electronic equipment
Srikanth et al. Infer: Intermediate representations for future prediction
CN114970321A (en) Scene flow digital twinning method and system based on dynamic trajectory flow
CN110281949B (en) Unified hierarchical decision-making method for automatic driving
CN117516581A (en) End-to-end automatic driving track planning system, method and training method integrating BEVFomer and neighborhood attention transducer
CN113592905B (en) Vehicle driving track prediction method based on monocular camera
CN112329645B (en) Image detection method, device, electronic equipment and storage medium
CN113538520B (en) Pedestrian trajectory prediction method and device, electronic equipment and storage medium
CN112381132A (en) Target object tracking method and system based on fusion of multiple cameras
Spannaus et al. AUTOMATUM DATA: Drone-based highway dataset for the development and validation of automated driving software for research and commercial applications
CN114372503A (en) Cluster vehicle motion trail prediction method
CN115457081A (en) Hierarchical fusion prediction method based on graph neural network
US20220284623A1 (en) Framework For 3D Object Detection And Depth Prediction From 2D Images
CN111695627A (en) Road condition detection method and device, electronic equipment and readable storage medium
CN118397046A (en) Highway tunnel pollutant emission estimation method based on video vehicle track tracking
KR102563346B1 (en) System for monitoring of structural and method ithereof
CN117390590B (en) CIM model-based data management method and system
CN114255450A (en) Near-field vehicle jamming behavior prediction method based on forward panoramic image
Sadid et al. Dynamic Spatio-temporal Graph Neural Network for Surrounding-aware Trajectory Prediction of Autonomous Vehicles
Katariya et al. A pov-based highway vehicle trajectory dataset and prediction architecture
CN117351318A (en) Multi-source multi-element fusion method based on traffic calculation network
CN117610734A (en) Deep learning-based user behavior prediction method, system and electronic equipment
CN117334040A (en) Cross-domain road side perception multi-vehicle association method and system
CN114620059A (en) Automatic driving method and system thereof, and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination