CN113139446A - End-to-end automatic driving behavior decision method, system and terminal equipment - Google Patents

End-to-end automatic driving behavior decision method, system and terminal equipment

Info

Publication number
CN113139446A
CN113139446A (application CN202110391084.6A)
Authority
CN
China
Prior art keywords
attention
time
information
automatic driving
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110391084.6A
Other languages
Chinese (zh)
Other versions
CN113139446B (en)
Inventor
刘占文
赵祥模
樊星
齐明远
范颂华
李超
张嘉颖
高涛
王润民
林杉
员惠莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changan University
Original Assignee
Changan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changan University filed Critical Changan University
Priority to CN202110391084.6A priority Critical patent/CN113139446B/en
Publication of CN113139446A publication Critical patent/CN113139446A/en
Application granted granted Critical
Publication of CN113139446B publication Critical patent/CN113139446B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end automatic driving behavior decision method, system and terminal device, belonging to the field of automatic driving. Scene spatial position features are extracted through a convolutional neural network embedded with an attention mechanism, a spatial feature extraction network is constructed, and the spatial features and semantic information of scene targets are accurately analyzed; scene temporal context features are captured through a long short-term memory (LSTM) encoding-decoding structure embedded with a temporal attention mechanism, a temporal feature extraction network is constructed, and the scene time-sequence information is understood and memorized. The method integrates scene spatial information and time-sequence information and, by combining the attention mechanism, assigns higher weights to key visual regions and motion sequences, so that the prediction process better conforms to the driving habits of human drivers and the prediction result is more accurate.

Description

End-to-end automatic driving behavior decision method, system and terminal equipment
Technical Field
The invention belongs to the field of automatic driving, and relates to an automatic driving behavior decision method, in particular to an end-to-end automatic driving behavior decision method, an end-to-end automatic driving behavior decision system and terminal equipment.
Background
Automatic driving decision technology is an important research direction in the fields of artificial intelligence and automatic driving, and its decision effectiveness largely determines the performance of the whole automatic driving system. However, current rule-based automatic driving decision methods do not conform to human driving behavior, and automatic driving behavior decision remains a classic problem in the field of automatic driving. A driving behavior decision is related not only to the current driving scene of the vehicle but also to its historical movement speed, so the joint influence of the current driving scene and the historical motion state on the vehicle must be considered. The human visual system selectively attends to the primary content of an observed scene and ignores secondary content; while a vehicle is driving, the driver should pay attention to the things that strongly influence the driving decision, such as vehicles, pedestrians and traffic lights, and ignore features that are unimportant for driving, such as the sky and trees. Therefore, an end-to-end automatic driving decision model based on an attention mechanism and spatio-temporal features has become a new research hotspot.
Automatic driving decision methods are mainly divided into rule-based methods and end-to-end methods. Rule-based decision methods divide the automatic driving decision process into different task modules, understand and classify the vehicle state according to the traffic situation, and generate reasonable real-time driving actions by combining a manually constructed rule base with prior knowledge, thereby controlling the automatic driving vehicle. End-to-end learning models can unify driving subtasks such as scene environment perception, target recognition, target tracking and planning decision within a single deep neural network, directly map the perception information to control quantities such as throttle, steering wheel and braking, and thus unify cognition and decision without module splitting; they simplify the task steps that require complicated feature engineering and make the structure of the automatic driving system simpler and more efficient. However, existing end-to-end automatic driving decision methods do not consider the influence of the vehicle's historical motion state on the vehicle decision, and suffer from problems such as low decision accuracy and low efficiency.
Disclosure of Invention
In order to overcome the defects of the prior-art end-to-end automatic driving decision methods, namely that the influence of the vehicle's historical motion state on the vehicle decision is not considered and that decision accuracy and efficiency are low, the invention aims to provide an end-to-end automatic driving behavior decision method, system and terminal device.
In order to achieve the above purpose, the invention adopts the following technical scheme:
an end-to-end automatic driving behavior decision method comprises the following steps:
acquiring the spatial characteristics of a scene through a convolutional neural network embedded with an attention mechanism based on image information, depth information and semantic segmentation information;
acquiring time characteristics through a memory network coding-decoding structure embedded with an attention mechanism based on the historical motion state sequence information of the vehicle;
and connecting the spatial characteristics with the temporal characteristics, establishing an end-to-end automatic driving behavior decision model, and acquiring an end-to-end automatic driving behavior prediction result.
Preferably, the extracting process of the spatial features comprises:
inputting the packed image and depth image into a backbone network to obtain image information and depth information;
image information is pooled through a backbone network and a pyramid to obtain semantic segmentation information;
inputting image information, depth information and semantic segmentation information into a connecting layer, generating a spatial feature vector with a fixed length, and acquiring spatial features of a scene;
wherein, an attention mechanism is embedded in the backbone network.
Further preferably, the image information and the depth information are obtained by:
capturing an input characteristic diagram by using three spatial attention branch networks, establishing interaction between spatial dimensions and channel dimensions, and acquiring a spatial attention diagram;
training a backbone network to obtain different sparse masks; pruning the main network by using different sparse masks to generate two different sub-networks, and acquiring image information and depth information based on the different sub-networks.
Further preferably, the process of acquiring the spatial attention map specifically includes:
establishing interaction of space dimensionality and channel dimensionality through the feature diagram input by rotation in the three space attention branch networks respectively; then, performing maximum pooling and average pooling on the rotated feature maps respectively; cascading the average pooled feature map and the maximum pooled feature map, and inputting the cascaded feature maps into two full-connection layers for coding; generating attention weights through a sigmoid activation function, and combining the attention weights with an original input feature map to obtain three attention maps; the three attention diagrams are averaged to obtain a spatial attention diagram.
Preferably, the pruning is specifically performed by:
initializing a base network and a mask matrix randomly, and setting a pruning threshold value at the same time;
training the base network and the mask matrix after an AND operation, and iteratively updating the mask matrix to obtain two different sparse mask matrices that share parameters;
and obtaining different sparse masks of the two sub-networks through training, and pruning the trunk network by using the different sparse masks.
Preferably, the extracting process of the time feature comprises:
understanding, summarizing and memorizing the vehicle historical motion state sequence information by using an encoder to obtain a vehicle historical motion state feature vector;
constructing a time attention mechanism by utilizing a time feature extraction network;
and performing time sequence generation and feature extraction on the vehicle historical motion state feature vector subjected to the time attention mechanism by using a decoder, updating the hidden state at the current moment, and acquiring time features.
Preferably, the time attention mechanism is constructed based on a time attention module, and the specific operations include:
the multi-layer perceptron in the time attention module obtains an energy item according to the hidden state of the encoder and the hidden state of the decoder;
a Softmax function in the time attention module obtains, from the energy terms, the attention coefficient between the encoder abstract feature at each step and the decoder;
and the time attention module takes the attention coefficient as a weight and carries out weighted summation on the hidden states at all the moments to obtain the context vector of the decoder at each moment.
An end-to-end automated driving behavior decision system, comprising:
the spatial attention module is used for acquiring spatial features of a scene based on image information, depth information and semantic segmentation information;
the time attention module is used for acquiring time characteristics based on the vehicle historical motion state sequence information and constructing a time attention mechanism based on the time characteristics;
the model establishing module is respectively interacted with the space attention module and the time attention module, establishes an end-to-end automatic driving behavior decision model based on the space characteristic and the time attention mechanism, and predicts an end-to-end automatic driving behavior result through the end-to-end automatic driving behavior decision model.
Preferably, the spatial attention module comprises a ResNet feature extractor, a pyramid pooling unit and three spatial attention branch networks;
the ResNet feature extractor is used for acquiring feature information in the image information;
the pyramid pooling unit is used for performing maximum pooling and average pooling on the feature information acquired by the ResNet feature extractor;
the space attention branching network is used for establishing interaction of space dimension and channel dimension based on the input feature diagram;
the time attention module comprises an encoder, a decoder and a multi-layer perceptron;
the encoder is used for understanding, summarizing and memorizing the vehicle historical motion state sequence;
the decoder is used for generating a time sequence and extracting features and updating the hidden state at the current moment;
the multi-layer perceptron is used to derive an energy term based on the hidden states of the encoder and decoder.
A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the end-to-end automated driving behavior decision method when executing the computer program.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses an end-to-end automatic driving decision method, which comprises the steps of extracting scene space position characteristics through a convolutional neural network embedded with an attention mechanism, constructing a space characteristic extraction network, and accurately analyzing scene target space characteristics and semantic information; capturing scene time context characteristics through a long-short term memory network coding-decoding structure embedded with a time attention mechanism, constructing a time characteristic extraction network, and understanding memory scene time sequence information; according to the method, scene space information and time sequence information are integrated, and a higher weight is given to the key visual region and the motion sequence by combining an attention mechanism, so that the prediction process is more in line with the driving habits of human drivers, and the prediction result is more accurate.
Furthermore, in the end-to-end automatic driving decision process, the visual information in the RGB image alone is not sufficient to fully perceive objects such as vehicles, pedestrians and obstacles in the scene; the depth information contains additional position and contour features of the scene objects, and the semantic segmentation information covers the high-level semantic understanding of the driving scene.
Furthermore, the spatial position features and the temporal context features of the scene are extracted by a convolutional network and an LSTM respectively and then fused; in the process of extracting the spatial position features, the semantic segmentation information is used to improve the prediction precision of the model, and the decision quantities of speed and steering angle are output. Although considering multi-modal input improves the prediction effect of the model, multi-modal input alone does not focus attention on the key objects in the scene, such as pedestrians, lane lines and traffic signs.
Furthermore, in order to improve the extraction of the spatio-temporal saliency features of the driving scene, an attention module is introduced, so that the end-to-end automatic driving behavior decision method focuses on the detailed information of the current task target area; this improves the performance and efficiency of the system under limited resources and reduces unnecessary resource waste.
Further, because the multi-modal inputs and the attention modules are embedded, the network model becomes large; the complexity of the model is therefore reduced by pruning the network with sparse masks.
The invention also discloses an end-to-end automatic driving behavior decision system and a terminal device that implement the above method.
drawings
FIG. 1 is an overall architecture of the end-to-end automatic driving decision system of the present invention;
FIG. 2 is a training process of sparse mask matrix pruning in the method of the present invention;
FIG. 3 is a spatial attention module in the present invention;
FIG. 4 is an LSTM encoding-decoding structure in the present invention;
FIG. 5 is a graph of loss for the inventive system and MM-STConv model;
FIG. 6 is a graph of the accuracy of the inventive system and MM-STConv model;
FIG. 7 is a velocity prediction curve for 100s (1000 frames) of consecutive images in a data set for the system of the present invention;
FIG. 8 is a plot of steering angle prediction for 100s (1000 frames) of consecutive images in a data set by the system of the present invention;
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
example 1
The end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features comprises the following steps:
Step 1: the spatial feature extraction network describes scene spatial position features by using RGB image information, depth information and semantic segmentation information.
Step 11: the packed RGB image and depth image are input into a backbone network embedded with an attention mechanism to extract RGB image features and depth features.
Step 12: the RGB image is passed through a ResNet network and a pyramid pooling module, both embedded with an attention mechanism, to obtain semantic segmentation information and perceive context information.
Step 13: the three kinds of fused feature information are input into a fully connected layer to generate a fixed-length spatial feature vector.
Step 2: the time feature extraction network extracts time context features by using the vehicle historical motion state sequence information.
Step 3: the features obtained by the spatial feature extraction network are connected with the features obtained by the temporal feature extraction network, and the final prediction result is obtained through two fully connected layers.
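By way of illustration only, step 3 can be sketched in PyTorch-style code as follows: the fixed-length spatial feature vector and the temporal feature vector are concatenated and passed through two fully connected layers that output the behavior prediction. The module name FusionHead, the feature dimensions (512 and 128) and the two-dimensional output (speed and steering angle) are assumptions made for this sketch and not taken from the embodiment itself.

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    # Fuses the spatial feature vector with the temporal feature vector and
    # predicts the driving behavior through two fully connected layers.
    def __init__(self, spatial_dim=512, temporal_dim=128, hidden_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(spatial_dim + temporal_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 2)   # assumed outputs: [speed, steering angle]

    def forward(self, spatial_feat, temporal_feat):
        fused = torch.cat([spatial_feat, temporal_feat], dim=1)   # connect the two feature vectors
        return self.fc2(torch.relu(self.fc1(fused)))              # two fully connected layers

head = FusionHead()
prediction = head(torch.randn(4, 512), torch.randn(4, 128))       # batch of 4 dummy samples
print(prediction.shape)                                            # torch.Size([4, 2])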
In step 11, the RGB image and the depth image are input into a backbone network embedded with an attention mechanism to extract RGB image features and depth features, and the method specifically includes:
step 111: by adopting three spatial attention branch networks to capture interaction between the spatial dimension and the channel dimension of the input feature map, the important position features of the traffic scene are highlighted, and irrelevant scene features are inhibited.
Step 112: pruning the trunk network through different sparse masks obtained by training the trunk network, and generating two different sub-networks for extracting RGB (red, green, blue) features, depth features and semantic features.
The step 111 specifically includes steps a-D:
A. rotating the input feature map in the first branch of the attention module to establish the interaction between the spatial dimension H and the channel dimension C, then performing maximum pooling and average pooling on the rotated feature map respectively, inputting the pooled results into two fully connected (FC) layers for encoding, and finally generating an attention weight through a sigmoid activation function and multiplying it with the original input feature map to obtain an attention map;
B. rotating the input feature map in the second branch of the attention module to establish the interaction between the spatial dimension W and the channel dimension C, then performing maximum pooling and average pooling on the rotated feature map respectively, inputting the pooled results into two fully connected (FC) layers for encoding, and finally generating an attention weight through a sigmoid activation function and multiplying it with the original input feature map to obtain an attention map;
C. performing maximum pooling and average pooling on the input feature map in the third branch of the attention module, concatenating the average-pooled feature map and the maximum-pooled feature map, generating an attention weight through a sigmoid activation function, and multiplying the weight element-wise with the input feature map to obtain a weighted attention map;
D. averaging, by element-wise addition, the attention maps obtained by the three branches of the attention module to obtain the final spatial attention map.
The step 112 specifically includes steps a-C:
A. initializing a base network and a mask matrix randomly, and setting a pruning threshold value at the same time;
B. training the base network and the mask matrix after an AND operation, and iteratively updating the mask matrix to obtain two different sparse mask matrices that share parameters;
C. and obtaining different sparse masks of two sub-networks in the spatial feature extraction network through training, and pruning the base network.
In the step 12, the RGB image is processed through a ResNet network embedded with an attention mechanism and a pyramid pooling module to obtain semantic segmentation information, which is as follows:
The RGB image is passed through a ResNet feature extractor embedded with the spatial attention module and through a pyramid pooling module to obtain semantic segmentation features that fuse global information and multi-scale context information, together with a spatial attention map, thereby acquiring high-level semantic information.
The step 2 specifically comprises the following steps:
Step 21: the LSTM encoder understands, summarizes and memorizes the vehicle historical motion state sequence.
Step 22: a time attention mechanism is constructed in the time feature extraction network to model the relationship between the historical speed state sequences and give more weight to important time context features.
Step 23: the decoder performs time sequence generation and feature extraction, and updates the hidden state at the current moment.
In the step 21, the LSTM encoder understands, summarizes and memorizes the vehicle historical continuous motion state sequence as follows:
the LSTM encoder performs T recursive updates on the vehicle historical continuous motion state sequence with the length of T to obtain a time context encoding vector c with the fixed lengtht
The step 22 constructs a time attention mechanism in the time feature extraction network, which is specifically as follows:
Step 221: the multi-layer perceptron in the time attention module obtains an energy term e_ji from the hidden state of the encoder and the hidden state of the decoder.
Step 222: the Softmax function in the time attention module obtains from e_ji the attention coefficient a_ji between the encoder abstract feature at step i and the decoder at step j.
Step 223: the time attention module takes the attention coefficient a_ji as a weight and performs a weighted summation over the hidden states at all time steps to obtain the context vector m_j of the decoder at step j.
The decoder performs time sequence generation and feature extraction in step 23, and updates the hidden state at the current time, which is specifically as follows:
The decoder updates the hidden state s_j at step j according to the output vector y_{j-1} of the previous step, the hidden state s_{j-1} and the context vector m_j, and then updates the historical motion output vector decoded at step j according to s_j, y_{j-1} and m_j.
Example 2
As shown in fig. 1, the end-to-end automatic driving behavior decision method based on attention mechanism and spatiotemporal features specifically includes the following steps:
step 1: the spatial feature extraction network describes scene spatial position features by using RGB image information, depth information and semantic segmentation information, and generates two sub-networks sharing parameters for extracting the image spatial position features and the semantic features by using a sparse mask pruning trunk network;
step 2: the time feature extraction network extracts time context features by using the vehicle historical movement speed sequence information.
Step 3: the features obtained by the spatial network are connected with the features obtained by the time-series network to obtain the final prediction result.
The step 1 comprises step 11, step 12 and step 13:
Step 11: the packed RGB image and depth image are input into a ResNet network embedded with an attention mechanism to extract RGB image features and depth features.
Step 12: the RGB image is passed through the ResNet network and a pyramid pooling module, both embedded with an attention mechanism, to obtain semantic segmentation information and perceive context information.
Step 13: the semantic segmentation feature map is passed through a convolution layer and two pooling layers to generate a feature map with the same size as the spatial feature extraction feature map, and the two feature maps are connected and input into a fully connected layer to generate a fixed-length spatial feature vector.
In the step 11, the RGB image and the depth image are input into a ResNet network embedded with an attention mechanism to extract RGB image features and depth features, and the method specifically includes steps 111 to 113:
Step 111: feature information useful for the decision task is extracted by embedding a spatial attention module in each bottleneck block of ResNet; as shown in FIG. 3, the attention module emphasizes the important position features of the traffic scene and suppresses irrelevant scene features by establishing dependencies between the spatial and channel dimensions. The specific steps are as follows:
A. In the first branch of the attention module, the input feature map F(x) of shape C×H×W is rotated 90° counterclockwise along the H dimension to establish the interaction between the H dimension and the C dimension; the rotated feature map F_r1(x) has shape W×H×C. Maximum pooling and average pooling are then performed on the rotated feature map respectively, and the vectors obtained by the two pooling operations are input into two fully connected (FC) layers to encode the relationship between the channels. Finally, the two encoded feature vectors are added pixel-wise, an attention weight is generated through a sigmoid function and multiplied with the original input feature map, and the output is rotated 90° clockwise along the H dimension so that its shape is consistent with the input. This yields:
F_r1(x) = Rotate_H(F(x))
A_1(x) = Rotate_H^{-1}( σ( FC(MaxPool(F_r1(x))) + FC(AvgPool(F_r1(x))) ) ⊙ F_r1(x) )
B. In the second branch of the attention module, the input feature map F(x) of shape C×H×W is rotated 90° counterclockwise along the W dimension to establish the interaction between the W dimension and the C dimension; the rotated feature map F_r2(x) has shape H×W×C. Maximum pooling and average pooling are then performed on the rotated feature map respectively, and the vectors obtained by the two pooling operations are input into two fully connected (FC) layers to encode the relationship between the channels. Finally, the two encoded feature vectors are added pixel-wise, an attention weight is generated through a sigmoid function and multiplied with the original input feature map, and the output is rotated 90° clockwise along the W dimension so that its shape is consistent with the input. This yields:
F_r2(x) = Rotate_W(F(x))
A_2(x) = Rotate_W^{-1}( σ( FC(MaxPool(F_r2(x))) + FC(AvgPool(F_r2(x))) ) ⊙ F_r2(x) )
C. In the third branch of the attention module, maximum pooling and average pooling are performed on the input feature map F(x) of shape C×H×W, and the average-pooled feature map of shape 1×H×W and the maximum-pooled feature map of shape 1×H×W are concatenated into a feature vector of shape 2×H×W. This feature vector first passes through a standard convolution layer with kernel size K×K and a batch normalization layer, an attention weight is then generated through a sigmoid activation function, and the weight is multiplied element-wise with the input feature map to obtain the weighted attention map:
A_3(x) = σ( BN( Conv_{K×K}( [AvgPool(F(x)); MaxPool(F(x))] ) ) ) ⊙ F(x)
D. The attention maps obtained by the three branches of the attention module are added element-wise and averaged to obtain the final spatial attention map:
A(x) = ( A_1(x) + A_2(x) + A_3(x) ) / 3
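By way of illustration only, a minimal PyTorch-style sketch of the three-branch spatial attention module of steps A to D is given below. Where the description leaves details open, the sketch makes assumptions: the pooling over the rotated views is implemented as global max/average pooling, the two FC layers use a reduction ratio of 4, and the convolution kernel of the third branch is set to K = 7; it is therefore a sketch of the described mechanism rather than a definitive implementation.

import torch
import torch.nn as nn

class RotatedBranch(nn.Module):
    # Attention computed on a rotated view of the feature map (branches A and B):
    # global max/average pooling, two shared FC layers, pixel-wise addition, sigmoid.
    def __init__(self, rotated_channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(rotated_channels, rotated_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(rotated_channels // reduction, rotated_channels),
        )

    def forward(self, x_rot):                          # x_rot: (B, C', H', W')
        max_vec = torch.amax(x_rot, dim=(2, 3))        # maximum pooling
        avg_vec = torch.mean(x_rot, dim=(2, 3))        # average pooling
        weight = torch.sigmoid(self.fc(max_vec) + self.fc(avg_vec))
        return x_rot * weight[:, :, None, None]        # re-weight the rotated feature map

class SpatialAttention(nn.Module):
    def __init__(self, height, width, kernel_size=7):
        super().__init__()
        self.branch_h = RotatedBranch(height)          # branch A: H interacts with C
        self.branch_w = RotatedBranch(width)           # branch B: W interacts with C
        self.branch_s = nn.Sequential(                 # branch C: K x K conv + BN on pooled maps
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1),
        )

    def forward(self, x):                              # x: (B, C, H, W)
        a1 = self.branch_h(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)   # rotate H<->C and back
        a2 = self.branch_w(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)   # rotate W<->C and back
        pooled = torch.cat([x.amax(1, keepdim=True), x.mean(1, keepdim=True)], dim=1)
        a3 = x * torch.sigmoid(self.branch_s(pooled))                   # branch C attention map
        return (a1 + a2 + a3) / 3.0                                     # step D: average of the three maps

attention = SpatialAttention(height=32, width=32)
print(attention(torch.randn(2, 64, 32, 32)).shape)     # torch.Size([2, 64, 32, 32])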
Step 112: the trunk network is pruned by using the different sparse masks obtained by training the trunk network, and two different sub-networks are generated for extracting RGB (red, green, blue) features, depth features and semantic features. As shown in FIG. 2, the details are as follows:
A. The base network is randomly initialized, all mask matrices are set to 1, and a pruning threshold is set.
B. The base network and the mask matrices are trained after an element-wise AND operation; the trained weights are compared with the threshold, and the mask matrices are updated iteratively to obtain two different sparse mask matrices that share the common parameters.
C. Through training, the different sparse masks of the two sub-networks in the spatial feature extraction network are obtained, and the elements that contribute little to the task are pruned in each bottleneck block of ResNet.
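By way of illustration only, the sparse-mask pruning of steps A to C can be sketched on a toy layer as follows: a shared weight tensor is trained through two binary masks (an element-wise AND/product of weights and mask), and after each update the weights whose magnitude falls below a per-sub-network threshold are masked out, yielding two differently pruned sub-networks that share the common parameters. The toy linear layer, the thresholds and the training loop are assumptions made for this sketch.

import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, n_subnets=2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        # One binary mask per sub-network, initialised to all ones (nothing pruned yet).
        self.register_buffer("masks", torch.ones(n_subnets, out_features, in_features))

    def forward(self, x, subnet):
        # Element-wise product of the shared weights and the selected sub-network's mask.
        return nn.functional.linear(x, self.weight * self.masks[subnet])

    def update_masks(self, thresholds):
        # Iterative mask update: prune weights below each sub-network's threshold.
        with torch.no_grad():
            for k, thr in enumerate(thresholds):
                self.masks[k] = (self.weight.abs() >= thr).float()

layer = MaskedLinear(16, 8)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
x, target = torch.randn(32, 16), torch.randn(32, 8)
for step in range(50):
    loss = sum(nn.functional.mse_loss(layer(x, k), target) for k in range(2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    layer.update_masks(thresholds=(0.05, 0.10))     # different sparsity per sub-network
print([float(m.mean()) for m in layer.masks])        # fraction of weights kept in each sub-network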
In the step 12, the RGB image is processed through a ResNet network embedded with an attention mechanism and a pyramid pooling module to obtain semantic segmentation information, which is as follows:
The RGB image is passed through a ResNet feature extractor embedded with the spatial attention module and through a pyramid pooling module to obtain semantic segmentation features that fuse global information and multi-scale context information, together with a spatial attention map, thereby acquiring high-level semantic information.
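By way of illustration only, a PSPNet-style pyramid pooling module of the kind referred to above can be sketched as follows: the feature map is pooled onto several grid sizes, re-projected with 1×1 convolutions, upsampled back to the input resolution and concatenated with the input, so that global and multi-scale context information is fused. The bin sizes (1, 2, 3, 6) and channel widths are assumptions made for this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        branch_channels = in_channels // len(bins)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),                      # pool to bin_size x bin_size
                nn.Conv2d(in_channels, branch_channels, 1, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        ])
        self.out_channels = in_channels + branch_channels * len(bins)

    def forward(self, x):
        h, w = x.shape[2:]
        pyramids = [x]
        for branch in self.branches:
            # Upsample each pooled context map back to the input resolution.
            pyramids.append(F.interpolate(branch(x), size=(h, w),
                                          mode="bilinear", align_corners=False))
        return torch.cat(pyramids, dim=1)               # fused multi-scale context features

ppm = PyramidPooling(in_channels=256)
print(ppm(torch.randn(1, 256, 60, 60)).shape)           # torch.Size([1, 512, 60, 60])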
The step 2 specifically comprises the following steps:
Step 21: the LSTM encoder understands, summarizes and memorizes the vehicle historical continuous motion state sequence.
Step 22: a time attention mechanism is constructed in the time feature extraction network to model the relationship between the historical speed state sequences and give more weight to important time context features.
Step 23: the decoder performs time sequence generation and feature extraction, and updates the hidden state at the current moment.
as shown in fig. 4, the LSTM encoder understands, summarizes and memorizes the vehicle historical continuous motion state sequence in step 21, which is as follows:
The LSTM encoder performs t recursive updates on the vehicle historical continuous motion state sequence s_1, ..., s_t of length t to obtain a fixed-length temporal context encoding vector c_t; this vector contains the encoder's understanding, summarization and memory of the vehicle's historical continuous motion state sequence.
the step 22 constructs a time attention mechanism in the time feature extraction network, which is specifically as follows:
step 221, the multi-layer perceptron in the time attention branch network hides the state h according to the encoder of the ith stepiHidden state s of decoder in step j-1j-1To obtain an energy term eji=wTtanh(W[sj-1,hi]+ b), where W and b are the weight and bias vectors from the input layer to the hidden layer, and W is the weight vector from the hidden layer to the output layer.
Step 222, the Softmax function in the temporal attention Branch network according to ejiObtaining the abstract feature of the encoder at the ith step and the attention coefficient a of the decoder at the jth stepjiI.e. by
Figure BDA0003016728030000131
Where t is the length of the input sequence.
Step 223, the temporal attention branch network compares the attention coefficient ajiAs weight, carrying out weighted summation on hidden states at all moments to obtain a context vector m of the decoder at the j stepjI.e. by
Figure BDA0003016728030000132
The decoder performs time sequence generation and feature extraction in step 23, and updates the hidden state at the current time, which is specifically as follows:
The decoder updates the hidden state s_j at step j according to the output vector y_{j-1} of the previous step, the hidden state s_{j-1} and the context vector m_j, and then updates the historical motion output vector decoded at step j according to s_j, y_{j-1} and m_j.
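By way of illustration only, steps 21 to 23 can be combined into the following PyTorch-style sketch: an LSTM encoder summarizes the historical motion sequence, a multi-layer perceptron scores each encoder hidden state h_i against the previous decoder state s_{j-1} to produce the energy e_ji, Softmax turns the energies into the attention coefficients a_ji, their weighted sum gives the context vector m_j, and an LSTM cell decoder is updated from y_{j-1}, s_{j-1} and m_j. The feature dimensions, the number of decoding steps and the output head are assumptions made for this sketch.

import torch
import torch.nn as nn

class TemporalAttentionDecoder(nn.Module):
    def __init__(self, input_dim=1, hidden_dim=64, out_dim=1):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTMCell(out_dim + hidden_dim, hidden_dim)
        # Multi-layer perceptron producing the energy e_ji = w^T tanh(W[s_{j-1}, h_i] + b).
        self.energy = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1, bias=False),
        )
        self.out = nn.Linear(2 * hidden_dim + out_dim, out_dim)

    def forward(self, history, n_steps=5):
        # history: (B, t, input_dim) historical motion states s_1 ... s_t.
        enc_h, _ = self.encoder(history)                     # all encoder hidden states h_i
        batch = history.size(0)
        s = torch.zeros(batch, enc_h.size(2))                # decoder hidden state s_j
        c = torch.zeros_like(s)                              # decoder cell state
        y = torch.zeros(batch, 1)                            # previous output y_{j-1}
        outputs = []
        for _ in range(n_steps):
            # Energy e_ji for every encoder step i given the current decoder state s_{j-1}.
            e = self.energy(torch.cat([s.unsqueeze(1).expand_as(enc_h), enc_h], dim=2))
            a = torch.softmax(e, dim=1)                      # attention coefficients a_ji
            m = (a * enc_h).sum(dim=1)                       # context vector m_j (weighted sum of h_i)
            # Update the decoder hidden state from y_{j-1}, (s_{j-1}, c) and m_j.
            s, c = self.decoder(torch.cat([y, m], dim=1), (s, c))
            y = self.out(torch.cat([s, m, y], dim=1))        # decoded motion output y_j
            outputs.append(y)
        return torch.stack(outputs, dim=1)                   # (B, n_steps, out_dim)

model = TemporalAttentionDecoder()
print(model(torch.randn(8, 20, 1)).shape)                    # torch.Size([8, 5, 1])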
Effect verification:
in order to verify the effectiveness of the method, a data set generated and labeled by an automatic driving simulation test platform is adopted, 8112 images are selected for training, and the remaining 3476 images are used for testing the images for algorithm verification.
The training error of the method of the invention is compared with that of the MM-STConv behavior decision model, and the result is shown in FIG. 5. The training loss curves of both models decrease gradually as the training period increases; the loss curve of the end-to-end automatic driving decision method based on the attention mechanism and spatio-temporal features lies below that of the MM-STConv behavior decision model throughout and decreases faster. At the same time, the training loss curve of the model with the fused attention mechanism shows less jitter than that of the MM-STConv behavior decision model. Compared with the MM-STConv behavior decision model, the training process of the end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features is therefore more stable and efficient, and converges faster.
The prediction accuracy of the method of the invention is compared with that of the MM-STConv behavior decision model in FIG. 6. The prediction accuracy curves of both methods rise gradually as the training period increases; the accuracy curve of the method based on the attention mechanism and spatio-temporal features lies above that of the MM-STConv behavior decision model throughout and rises faster. Owing to the introduction of the attention module, the end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features achieves better model performance and more stable prediction results than the MM-STConv behavior decision model.
The results of speed prediction and steering angle prediction using the system of the invention are shown in FIG. 7 and FIG. 8. The speed and steering angle prediction curves of the method of the invention are close to the real reference curves, fit the reference curves well, show little jitter, and yield stable predictions.
Example 3
The method of the present invention, if implemented in the form of a software functional unit and sold or used as a stand-alone product, can be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. Computer-readable storage media, including volatile and non-volatile, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical memory (e.g., CD, DVD, BD, HVD, etc.), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid state disk (SSD)), etc.
Example 4
In an exemplary embodiment, a computer device is also provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the end-to-end automatic driving behavior decision method when executing the computer program. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. An end-to-end automatic driving behavior decision method is characterized by comprising the following steps:
acquiring the spatial characteristics of a scene through a convolutional neural network embedded with an attention mechanism based on image information, depth information and semantic segmentation information;
acquiring time characteristics through a memory network coding-decoding structure embedded with an attention mechanism based on the historical motion state sequence information of the vehicle;
and connecting the spatial characteristics with the temporal characteristics, establishing an end-to-end automatic driving behavior decision model, and acquiring an end-to-end automatic driving behavior prediction result.
2. The end-to-end automatic driving behavior decision method according to claim 1, characterized in that the extraction process of the spatial features comprises:
inputting the packed image and depth image into a backbone network to obtain image information and depth information;
image information is pooled through a backbone network and a pyramid to obtain semantic segmentation information;
inputting image information, depth information and semantic segmentation information into a connecting layer, generating a spatial feature vector with a fixed length, and acquiring spatial features of a scene;
wherein, an attention mechanism is embedded in the backbone network.
3. The end-to-end automatic driving behavior decision method according to claim 2, characterized in that the acquisition process of the image information and the depth information is as follows:
capturing an input characteristic diagram by using three spatial attention branch networks, establishing interaction between spatial dimensions and channel dimensions, and acquiring a spatial attention diagram;
training a backbone network to obtain different sparse masks; pruning the main network by using different sparse masks to generate two different sub-networks, and acquiring image information and depth information based on the different sub-networks.
4. The end-to-end automatic driving behavior decision method according to claim 3, characterized in that the acquisition process of the spatial attention map is specifically as follows:
establishing interaction of space dimensionality and channel dimensionality through the feature diagram input by rotation in the three space attention branch networks respectively; then, performing maximum pooling and average pooling on the rotated feature maps respectively; cascading the average pooled feature map and the maximum pooled feature map, and inputting the cascaded feature maps into two full-connection layers for coding; generating attention weights through a sigmoid activation function, and combining the attention weights with an original input feature map to obtain three attention maps; the three attention diagrams are averaged to obtain a spatial attention diagram.
5. The end-to-end automatic driving behavior decision method according to claim 3, characterized in that the specific operations of pruning are:
initializing a base network and a mask matrix randomly, and setting a pruning threshold value at the same time;
training the base network and the mask matrix after an AND operation, and iteratively updating the mask matrix to obtain two different sparse mask matrices that share parameters;
and obtaining different sparse masks of the two sub-networks through training, and pruning the trunk network by using the different sparse masks.
6. The end-to-end automatic driving behavior decision method according to claim 1, characterized in that the extraction process of the temporal features comprises:
understanding, summarizing and memorizing the vehicle historical motion state sequence information by using an encoder to obtain a vehicle historical motion state feature vector;
constructing a time attention mechanism by utilizing a time feature extraction network;
and performing time sequence generation and feature extraction on the vehicle historical motion state feature vector subjected to the time attention mechanism by using a decoder, updating the hidden state at the current moment, and acquiring time features.
7. The end-to-end automated driving behavior decision method of claim 6,
the time attention mechanism is constructed based on a time attention module, and the specific operations comprise:
the multi-layer perceptron in the time attention module obtains an energy item according to the hidden state of the encoder and the hidden state of the decoder;
a Softmax function in the time attention module obtains real-time encoder abstract characteristics and a decoder attention coefficient according to the energy items;
and the time attention module takes the attention coefficient as a weight and carries out weighted summation on the hidden states at all the moments to obtain the context vector of the decoder at each moment.
8. An end-to-end automated driving behavior decision system, comprising:
the spatial attention module is used for acquiring spatial features of a scene based on image information, depth information and semantic segmentation information;
the time attention module is used for acquiring time characteristics based on the vehicle historical motion state sequence information and constructing a time attention mechanism based on the time characteristics;
the model establishing module is respectively interacted with the space attention module and the time attention module, establishes an end-to-end automatic driving behavior decision model based on the space characteristic and the time attention mechanism, and predicts an end-to-end automatic driving behavior result through the end-to-end automatic driving behavior decision model.
9. The end-to-end automated driving behavior decision system of claim 8, characterized in that the spatial attention module comprises a ResNet feature extractor, a pyramid pooling unit and three spatial attention branch networks;
the ResNet feature extractor is used for acquiring feature information in the image information;
the pyramid pooling unit is used for performing maximum pooling and average pooling on the feature information acquired by the ResNet feature extractor;
the space attention branching network is used for establishing interaction of space dimension and channel dimension based on the input feature diagram;
the time attention module comprises an encoder, a decoder and a multi-layer perceptron;
the encoder is used for understanding, summarizing and memorizing the vehicle historical motion state sequence;
the decoder is used for generating a time sequence and extracting features and updating the hidden state at the current moment;
the multi-layer perceptron is used to derive an energy term based on the hidden states of the encoder and decoder.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor when executing the computer program implements the steps of the end-to-end autopilot behavior decision method according to any one of claims 1 to 7.
CN202110391084.6A 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment Active CN113139446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110391084.6A CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110391084.6A CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Publications (2)

Publication Number Publication Date
CN113139446A true CN113139446A (en) 2021-07-20
CN113139446B CN113139446B (en) 2024-02-06

Family

ID=76811192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110391084.6A Active CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Country Status (1)

Country Link
CN (1) CN113139446B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673412A (en) * 2021-08-17 2021-11-19 驭势(上海)汽车科技有限公司 Key target object identification method and device, computer equipment and storage medium
CN114423061A (en) * 2022-01-20 2022-04-29 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning
CN114463670A (en) * 2021-12-29 2022-05-10 电子科技大学 Airport scene monitoring video change detection system and method
CN114777797A (en) * 2022-06-13 2022-07-22 长沙金维信息技术有限公司 High-precision map visual positioning method for automatic driving and automatic driving method
CN115049130A (en) * 2022-06-20 2022-09-13 重庆邮电大学 Automatic driving track prediction method based on space-time pyramid

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
US20200139973A1 (en) * 2018-11-01 2020-05-07 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle
WO2020253965A1 (en) * 2019-06-20 2020-12-24 Toyota Motor Europe Control device, system and method for determining perceptual load of a visual and dynamic driving scene in real time
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
US20200139973A1 (en) * 2018-11-01 2020-05-07 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle
WO2020253965A1 (en) * 2019-06-20 2020-12-24 Toyota Motor Europe Control device, system and method for determining perceptual load of a visual and dynamic driving scene in real time
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
杜圣东; 李天瑞; 杨燕; 王浩; 谢鹏; 洪西进: "A Traffic Flow Prediction Model Based on Sequence-to-Sequence Spatiotemporal Attention Learning", Journal of Computer Research and Development, no. 08
王军; 鹿姝; 李云伟: "Multimodal Sign Language Recognition Fusing Attention Mechanism and Connectionist Temporal Classification", Journal of Signal Processing, no. 09
胡学敏; 童秀迟; 郭琳; 张若晗; 孔力: "End-to-End Autonomous Driving Model Based on Deep Visual Attention Neural Network", Journal of Computer Applications, no. 07
蔡英凤; 朱南楠; 邰康盛; 刘擎超; 王海: "Vehicle Behavior Prediction Based on Attention Mechanism", Journal of Jiangsu University (Natural Science Edition), no. 02
赵祥模; 连心雨; 刘占文; 沈超; 董鸣: "End-to-End Autonomous Driving Behavior Decision Model Based on MM-STConv", China Journal of Highway and Transport, no. 03

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673412A (en) * 2021-08-17 2021-11-19 驭势(上海)汽车科技有限公司 Key target object identification method and device, computer equipment and storage medium
CN113673412B (en) * 2021-08-17 2023-09-26 驭势(上海)汽车科技有限公司 Method and device for identifying key target object, computer equipment and storage medium
CN114463670A (en) * 2021-12-29 2022-05-10 电子科技大学 Airport scene monitoring video change detection system and method
CN114423061A (en) * 2022-01-20 2022-04-29 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning
CN114423061B (en) * 2022-01-20 2024-05-07 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning
CN114777797A (en) * 2022-06-13 2022-07-22 长沙金维信息技术有限公司 High-precision map visual positioning method for automatic driving and automatic driving method
CN115049130A (en) * 2022-06-20 2022-09-13 重庆邮电大学 Automatic driving track prediction method based on space-time pyramid

Also Published As

Publication number Publication date
CN113139446B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN113139446B (en) End-to-end automatic driving behavior decision method, system and terminal equipment
Alonso et al. 3d-mininet: Learning a 2d representation from point clouds for fast and efficient 3d lidar semantic segmentation
Ding et al. Context contrasted feature and gated multi-scale aggregation for scene segmentation
JP2023503527A (en) Trajectory prediction method, trajectory prediction device, electronic device, recording medium, and computer program
US20180336469A1 (en) Sigma-delta position derivative networks
US10325371B1 (en) Method and device for segmenting image to be used for surveillance using weighted convolution filters for respective grid cells by converting modes according to classes of areas to satisfy level 4 of autonomous vehicle, and testing method and testing device using the same
Akan et al. Stretchbev: Stretching future instance prediction spatially and temporally
CN111696110B (en) Scene segmentation method and system
CN110570035B (en) People flow prediction system for simultaneously modeling space-time dependency and daily flow dependency
CN117157678A (en) Method and system for graph-based panorama segmentation
JP2023514172A (en) A method for continuously learning a classifier for classifying client images using a continuous learning server and a continuous learning server using the same
WO2020198173A1 (en) Subject-object interaction recognition model
US11943460B2 (en) Variable bit rate compression using neural network models
CN113362491A (en) Vehicle track prediction and driving behavior analysis method
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN114037640A (en) Image generation method and device
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN115018039A (en) Neural network distillation method, target detection method and device
CN116485867A (en) Structured scene depth estimation method for automatic driving
Zhao et al. End‐to‐end autonomous driving decision model joined by attention mechanism and spatiotemporal features
CN113435356B (en) Track prediction method for overcoming observation noise and perception uncertainty
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN113033430B (en) Artificial intelligence method, system and medium for multi-mode information processing based on bilinear
CN116152263A (en) CM-MLP network-based medical image segmentation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant