CN113139446A - End-to-end automatic driving behavior decision method, system and terminal equipment - Google Patents

End-to-end automatic driving behavior decision method, system and terminal equipment

Info

Publication number
CN113139446A
CN113139446A (application CN202110391084.6A)
Authority
CN
China
Prior art keywords
attention
time
information
automatic driving
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110391084.6A
Other languages
Chinese (zh)
Other versions
CN113139446B (en)
Inventor
刘占文
赵祥模
樊星
齐明远
范颂华
李超
张嘉颖
高涛
王润民
林杉
员惠莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changan University
Original Assignee
Changan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changan University filed Critical Changan University
Priority to CN202110391084.6A priority Critical patent/CN113139446B/en
Publication of CN113139446A publication Critical patent/CN113139446A/en
Application granted granted Critical
Publication of CN113139446B publication Critical patent/CN113139446B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end automatic driving behavior decision method, system and terminal device, belonging to the field of automatic driving. Scene spatial position features are extracted through a convolutional neural network embedded with an attention mechanism, a spatial feature extraction network is constructed, and the spatial features and semantic information of scene targets are accurately analyzed; scene temporal context features are captured through a long short-term memory (LSTM) encoding-decoding structure embedded with a temporal attention mechanism, a temporal feature extraction network is constructed, and the scene time-sequence information is understood and memorized. The method integrates scene spatial information and time-sequence information and, by combining the attention mechanism, assigns higher weights to key visual regions and motion sequences, so that the prediction process better conforms to the driving habits of human drivers and the prediction result is more accurate.

Description

End-to-end automatic driving behavior decision method, system and terminal equipment
Technical Field
The invention belongs to the field of automatic driving, and relates to an automatic driving behavior decision method, in particular to an end-to-end automatic driving behavior decision method, an end-to-end automatic driving behavior decision system and terminal equipment.
Background
Automatic driving decision technology is an important research direction in the fields of artificial intelligence and automatic driving, and its decision effectiveness largely determines the performance of the whole automatic driving system. However, current rule-based automatic driving decision methods do not conform to human driving behavior, and automatic driving behavior decision remains a classic problem in the field of automatic driving. A driving behavior decision is related not only to the current driving scene of the vehicle but also to its historical movement speed, so the joint influence of the current driving scene and the historical motion state on the vehicle must be considered. The human visual system selectively attends to the primary content of an observed scene and ignores secondary content; while a vehicle is driving, the driver should pay attention to the things that strongly influence the driving decision, such as vehicles, pedestrians and traffic lights, and ignore features that are unimportant for driving, such as the sky and trees. Therefore, an end-to-end automatic driving decision model based on an attention mechanism and spatio-temporal features has become a new research hotspot.
Automatic driving decision methods are mainly divided into rule-based methods and end-to-end methods. Rule-based decision methods divide the automatic driving decision process into different task modules, understand and classify the vehicle state according to the traffic situation, and generate reasonable real-time driving actions by combining a manually constructed rule base with prior knowledge, thereby controlling the automatic driving vehicle. End-to-end learning models can unify driving subtasks such as scene environment perception, target recognition, target tracking and planning decision within a single deep neural network, directly map the perception information to control quantities such as throttle, steering wheel and braking, and thus unify cognition and decision without module splitting; they simplify the task steps that require complicated feature engineering and make the structure of the automatic driving system simpler and more efficient. However, existing end-to-end automatic driving decision methods do not consider the influence of the vehicle's historical motion state on the vehicle decision, and suffer from problems such as low decision accuracy and low efficiency.
Disclosure of Invention
In order to overcome the defects of the prior-art end-to-end automatic driving decision methods, namely that the influence of the vehicle's historical motion state on the vehicle decision is not considered and that decision accuracy and efficiency are low, the invention aims to provide an end-to-end automatic driving behavior decision method, system and terminal device.
In order to achieve the above purpose, the invention adopts the following technical scheme:
an end-to-end automatic driving behavior decision method comprises the following steps:
acquiring the spatial characteristics of a scene through a convolutional neural network embedded with an attention mechanism based on image information, depth information and semantic segmentation information;
acquiring time characteristics through a memory network coding-decoding structure embedded with an attention mechanism based on the historical motion state sequence information of the vehicle;
and connecting the spatial characteristics with the temporal characteristics, establishing an end-to-end automatic driving behavior decision model, and acquiring an end-to-end automatic driving behavior prediction result.
Preferably, the extracting process of the spatial features comprises:
inputting the packed image and depth image into a backbone network to obtain image information and depth information;
image information is pooled through a backbone network and a pyramid to obtain semantic segmentation information;
inputting image information, depth information and semantic segmentation information into a connecting layer, generating a spatial feature vector with a fixed length, and acquiring spatial features of a scene;
wherein, an attention mechanism is embedded in the backbone network.
Further preferably, the image information and the depth information are obtained by:
capturing an input characteristic diagram by using three spatial attention branch networks, establishing interaction between spatial dimensions and channel dimensions, and acquiring a spatial attention diagram;
training a backbone network to obtain different sparse masks; pruning the main network by using different sparse masks to generate two different sub-networks, and acquiring image information and depth information based on the different sub-networks.
Further preferably, the process of acquiring the spatial attention map specifically includes:
establishing interaction of space dimensionality and channel dimensionality through the feature diagram input by rotation in the three space attention branch networks respectively; then, performing maximum pooling and average pooling on the rotated feature maps respectively; cascading the average pooled feature map and the maximum pooled feature map, and inputting the cascaded feature maps into two full-connection layers for coding; generating attention weights through a sigmoid activation function, and combining the attention weights with an original input feature map to obtain three attention maps; the three attention diagrams are averaged to obtain a spatial attention diagram.
Preferably, the pruning is specifically performed by:
initializing a base network and a mask matrix randomly, and setting a pruning threshold value at the same time;
training the base network and the mask matrix after an AND operation, and iteratively updating the mask matrix to obtain two different sparse mask matrices that share parameters;
and obtaining different sparse masks of the two sub-networks through training, and pruning the trunk network by using the different sparse masks.
Preferably, the extracting process of the time feature comprises:
understanding, summarizing and memorizing the vehicle historical motion state sequence information by using an encoder to obtain a vehicle historical motion state feature vector;
constructing a time attention mechanism by utilizing a time feature extraction network;
and performing time sequence generation and feature extraction on the vehicle historical motion state feature vector subjected to the time attention mechanism by using a decoder, updating the hidden state at the current moment, and acquiring time features.
Preferably, the time attention mechanism is constructed based on a time attention module, and the specific operations include:
the multi-layer perceptron in the time attention module obtains an energy item according to the hidden state of the encoder and the hidden state of the decoder;
a Softmax function in the time attention module obtains, from the energy terms, the attention coefficient between the encoder abstract feature at each step and the decoder;
and the time attention module takes the attention coefficient as a weight and carries out weighted summation on the hidden states at all the moments to obtain the context vector of the decoder at each moment.
An end-to-end automated driving behavior decision system, comprising:
the spatial attention module is used for acquiring spatial features of a scene based on image information, depth information and semantic segmentation information;
the time attention module is used for acquiring time characteristics based on the vehicle historical motion state sequence information and constructing a time attention mechanism based on the time characteristics;
the model establishing module is respectively interacted with the space attention module and the time attention module, establishes an end-to-end automatic driving behavior decision model based on the space characteristic and the time attention mechanism, and predicts an end-to-end automatic driving behavior result through the end-to-end automatic driving behavior decision model.
Preferably, the spatial attention module comprises a ResNet feature extractor, a pyramid pooling unit and three spatial attention branch networks;
the ResNet feature extractor is used for acquiring feature information in the image information;
the pyramid pooling unit is used for performing maximum pooling and average pooling on the feature information acquired by the ResNet feature extractor;
the space attention branching network is used for establishing interaction of space dimension and channel dimension based on the input feature diagram;
the time attention module comprises an encoder, a decoder and a multi-layer perceptron;
the encoder is used for understanding, summarizing and memorizing the vehicle historical motion state sequence;
the decoder is used for generating a time sequence and extracting features and updating the hidden state at the current moment;
the multi-layer perceptron is used to derive an energy term based on the hidden states of the encoder and decoder.
A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the end-to-end automated driving behavior decision method when executing the computer program.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses an end-to-end automatic driving decision method, which comprises the steps of extracting scene space position characteristics through a convolutional neural network embedded with an attention mechanism, constructing a space characteristic extraction network, and accurately analyzing scene target space characteristics and semantic information; capturing scene time context characteristics through a long-short term memory network coding-decoding structure embedded with a time attention mechanism, constructing a time characteristic extraction network, and understanding memory scene time sequence information; according to the method, scene space information and time sequence information are integrated, and a higher weight is given to the key visual region and the motion sequence by combining an attention mechanism, so that the prediction process is more in line with the driving habits of human drivers, and the prediction result is more accurate.
Furthermore, in the end-to-end automatic driving decision process, the visual information in the RGB image alone is not sufficient to fully perceive objects such as vehicles, pedestrians and obstacles in the scene; the depth information contains additional position and contour features of the scene objects, and the semantic segmentation information covers the high-level semantic understanding of the driving scene.
Furthermore, the spatial position features and the temporal context features of the scene are extracted by a convolutional network and an LSTM respectively and then fused; in the process of extracting the spatial position features, the semantic segmentation information is used to improve the prediction precision of the model, and the decision quantities of speed and steering angle are output. Although considering multi-modal input improves the prediction effect of the model, multi-modal input alone does not focus attention on the key objects in the scene, such as pedestrians, lane lines and traffic signs.
Furthermore, in order to improve the extraction of the spatio-temporal saliency features of the driving scene, an attention module is introduced, so that the end-to-end automatic driving behavior decision method focuses on the detailed information of the current task target area; this improves the performance and efficiency of the system under limited resources and reduces unnecessary resource waste.
Further, because the multi-modal inputs and the attention modules are embedded, the network model becomes large; the complexity of the model is therefore reduced by pruning the network with sparse masks.
The invention also discloses an end-to-end automatic driving behavior decision system and a terminal device that implement the above method.
drawings
FIG. 1 is an overall architecture of the end-to-end automatic driving decision system of the present invention;
FIG. 2 is a training process of sparse mask matrix pruning in the method of the present invention;
FIG. 3 is a spatial attention module in the present invention;
FIG. 4 is an LSTM encoding-decoding structure in the present invention;
FIG. 5 is a graph of loss for the inventive system and MM-STConv model;
FIG. 6 is a graph of the accuracy of the inventive system and MM-STConv model;
FIG. 7 is a velocity prediction curve for 100s (1000 frames) of consecutive images in a data set for the system of the present invention;
FIG. 8 is a plot of steering angle prediction for 100s (1000 frames) of consecutive images in a data set by the system of the present invention;
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
example 1
The end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features comprises the following steps:
Step 1: the spatial feature extraction network describes scene spatial position features by using RGB image information, depth information and semantic segmentation information.
Step 11: the packed RGB image and depth image are input into a backbone network embedded with an attention mechanism to extract RGB image features and depth features.
Step 12: the RGB image is passed through a ResNet network and a pyramid pooling module, both embedded with an attention mechanism, to obtain semantic segmentation information and perceive context information.
Step 13: the three kinds of fused feature information are input into a fully connected layer to generate a fixed-length spatial feature vector.
Step 2: the time feature extraction network extracts time context features by using the vehicle historical motion state sequence information.
Step 3: the features obtained by the spatial feature extraction network are connected with the features obtained by the temporal feature extraction network, and the final prediction result is obtained through two fully connected layers.
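By way of illustration only, step 3 can be sketched in PyTorch-style code as follows: the fixed-length spatial feature vector and the temporal feature vector are concatenated and passed through two fully connected layers that output the behavior prediction. The module name FusionHead, the feature dimensions (512 and 128) and the two-dimensional output (speed and steering angle) are assumptions made for this sketch and not taken from the embodiment itself.

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    # Fuses the spatial feature vector with the temporal feature vector and
    # predicts the driving behavior through two fully connected layers.
    def __init__(self, spatial_dim=512, temporal_dim=128, hidden_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(spatial_dim + temporal_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 2)   # assumed outputs: [speed, steering angle]

    def forward(self, spatial_feat, temporal_feat):
        fused = torch.cat([spatial_feat, temporal_feat], dim=1)   # connect the two feature vectors
        return self.fc2(torch.relu(self.fc1(fused)))              # two fully connected layers

head = FusionHead()
prediction = head(torch.randn(4, 512), torch.randn(4, 128))       # batch of 4 dummy samples
print(prediction.shape)                                            # torch.Size([4, 2])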
In step 11, the RGB image and the depth image are input into a backbone network embedded with an attention mechanism to extract RGB image features and depth features, and the method specifically includes:
step 111: by adopting three spatial attention branch networks to capture interaction between the spatial dimension and the channel dimension of the input feature map, the important position features of the traffic scene are highlighted, and irrelevant scene features are inhibited.
Step 112: pruning the trunk network through different sparse masks obtained by training the trunk network, and generating two different sub-networks for extracting RGB (red, green, blue) features, depth features and semantic features.
The step 111 specifically includes steps a-D:
A. rotating the input feature map in the first branch of the attention module to establish the interaction between the spatial dimension H and the channel dimension C, then performing maximum pooling and average pooling on the rotated feature map respectively, inputting the pooled results into two fully connected (FC) layers for encoding, and finally generating an attention weight through a sigmoid activation function and multiplying it with the original input feature map to obtain an attention map;
B. rotating the input feature map in the second branch of the attention module to establish the interaction between the spatial dimension W and the channel dimension C, then performing maximum pooling and average pooling on the rotated feature map respectively, inputting the pooled results into two fully connected (FC) layers for encoding, and finally generating an attention weight through a sigmoid activation function and multiplying it with the original input feature map to obtain an attention map;
C. performing maximum pooling and average pooling on the input feature map in the third branch of the attention module, concatenating the average-pooled feature map and the maximum-pooled feature map, generating an attention weight through a sigmoid activation function, and multiplying the weight element-wise with the input feature map to obtain a weighted attention map;
D. averaging, by element-wise addition, the attention maps obtained by the three branches of the attention module to obtain the final spatial attention map.
The step 112 specifically includes steps a-C:
A. initializing a base network and a mask matrix randomly, and setting a pruning threshold value at the same time;
B. training the base network and the mask matrix after an AND operation, and iteratively updating the mask matrix to obtain two different sparse mask matrices that share parameters;
C. and obtaining different sparse masks of two sub-networks in the spatial feature extraction network through training, and pruning the base network.
In the step 12, the RGB image is processed through a ResNet network embedded with an attention mechanism and a pyramid pooling module to obtain semantic segmentation information, which is as follows:
The RGB image is passed through a ResNet feature extractor embedded with the spatial attention module and through a pyramid pooling module to obtain semantic segmentation features that fuse global information and multi-scale context information, together with a spatial attention map, thereby acquiring high-level semantic information.
The step 2 specifically comprises the following steps:
Step 21: the LSTM encoder understands, summarizes and memorizes the vehicle historical motion state sequence.
Step 22: a time attention mechanism is constructed in the time feature extraction network to model the relationship between the historical speed state sequences and give more weight to important time context features.
Step 23: the decoder performs time sequence generation and feature extraction, and updates the hidden state at the current moment.
In the step 21, the LSTM encoder understands, summarizes and memorizes the vehicle historical continuous motion state sequence as follows:
the LSTM encoder performs T recursive updates on the vehicle historical continuous motion state sequence with the length of T to obtain a time context encoding vector c with the fixed lengtht
The step 22 constructs a time attention mechanism in the time feature extraction network, which is specifically as follows:
Step 221: the multi-layer perceptron in the time attention module obtains an energy term e_ji from the hidden state of the encoder and the hidden state of the decoder.
Step 222: the Softmax function in the time attention module obtains from e_ji the attention coefficient a_ji between the encoder abstract feature at step i and the decoder at step j.
Step 223: the time attention module takes the attention coefficient a_ji as a weight and performs a weighted summation over the hidden states at all time steps to obtain the context vector m_j of the decoder at step j.
The decoder performs time sequence generation and feature extraction in step 23, and updates the hidden state at the current time, which is specifically as follows:
The decoder updates the hidden state s_j at step j according to the output vector y_{j-1} of the previous step, the hidden state s_{j-1} and the context vector m_j, and then updates the historical motion output vector decoded at step j according to s_j, y_{j-1} and m_j.
Example 2
As shown in fig. 1, the end-to-end automatic driving behavior decision method based on attention mechanism and spatiotemporal features specifically includes the following steps:
step 1: the spatial feature extraction network describes scene spatial position features by using RGB image information, depth information and semantic segmentation information, and generates two sub-networks sharing parameters for extracting the image spatial position features and the semantic features by using a sparse mask pruning trunk network;
step 2: the time feature extraction network extracts time context features by using the vehicle historical movement speed sequence information.
Step 3: the features obtained by the spatial network are connected with the features obtained by the time-series network to obtain the final prediction result.
The step 1 comprises step 11, step 12 and step 13:
Step 11: the packed RGB image and depth image are input into a ResNet network embedded with an attention mechanism to extract RGB image features and depth features.
Step 12: the RGB image is passed through the ResNet network and a pyramid pooling module, both embedded with an attention mechanism, to obtain semantic segmentation information and perceive context information.
Step 13: the semantic segmentation feature map is passed through a convolution layer and two pooling layers to generate a feature map with the same size as the spatial feature extraction feature map, and the two feature maps are connected and input into a fully connected layer to generate a fixed-length spatial feature vector.
In the step 11, the RGB image and the depth image are input into a ResNet network embedded with an attention mechanism to extract RGB image features and depth features, and the method specifically includes steps 111 to 113:
Step 111: feature information useful for the decision task is extracted by embedding a spatial attention module in each bottleneck block of ResNet; as shown in FIG. 3, the attention module emphasizes the important position features of the traffic scene and suppresses irrelevant scene features by establishing dependencies between the spatial and channel dimensions. The specific steps are as follows:
A. In the first branch of the attention module, the input feature map F(x) of shape C×H×W is rotated 90° counterclockwise along the H dimension to establish the interaction between the H dimension and the C dimension; the rotated feature map F_r1(x) has shape W×H×C. Maximum pooling and average pooling are then performed on the rotated feature map respectively, and the vectors obtained by the two pooling operations are input into two fully connected (FC) layers to encode the relationship between the channels. Finally, the two encoded feature vectors are added pixel-wise, an attention weight is generated through a sigmoid function and multiplied with the original input feature map, and the output is rotated 90° clockwise along the H dimension so that its shape is consistent with the input. This yields:
F_r1(x) = Rotate_H(F(x))
A_1(x) = Rotate_H^{-1}( σ( FC(MaxPool(F_r1(x))) + FC(AvgPool(F_r1(x))) ) ⊙ F_r1(x) )
B. In the second branch of the attention module, the input feature map F(x) of shape C×H×W is rotated 90° counterclockwise along the W dimension to establish the interaction between the W dimension and the C dimension; the rotated feature map F_r2(x) has shape H×W×C. Maximum pooling and average pooling are then performed on the rotated feature map respectively, and the vectors obtained by the two pooling operations are input into two fully connected (FC) layers to encode the relationship between the channels. Finally, the two encoded feature vectors are added pixel-wise, an attention weight is generated through a sigmoid function and multiplied with the original input feature map, and the output is rotated 90° clockwise along the W dimension so that its shape is consistent with the input. This yields:
F_r2(x) = Rotate_W(F(x))
A_2(x) = Rotate_W^{-1}( σ( FC(MaxPool(F_r2(x))) + FC(AvgPool(F_r2(x))) ) ⊙ F_r2(x) )
C. In the third branch of the attention module, maximum pooling and average pooling are performed on the input feature map F(x) of shape C×H×W, and the average-pooled feature map of shape 1×H×W and the maximum-pooled feature map of shape 1×H×W are concatenated into a feature vector of shape 2×H×W. This feature vector first passes through a standard convolution layer with kernel size K×K and a batch normalization layer, an attention weight is then generated through a sigmoid activation function, and the weight is multiplied element-wise with the input feature map to obtain the weighted attention map:
A_3(x) = σ( BN( Conv_{K×K}( [AvgPool(F(x)); MaxPool(F(x))] ) ) ) ⊙ F(x)
D. The attention maps obtained by the three branches of the attention module are added element-wise and averaged to obtain the final spatial attention map:
A(x) = ( A_1(x) + A_2(x) + A_3(x) ) / 3
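By way of illustration only, a minimal PyTorch-style sketch of the three-branch spatial attention module of steps A to D is given below. Where the description leaves details open, the sketch makes assumptions: the pooling over the rotated views is implemented as global max/average pooling, the two FC layers use a reduction ratio of 4, and the convolution kernel of the third branch is set to K = 7; it is therefore a sketch of the described mechanism rather than a definitive implementation.

import torch
import torch.nn as nn

class RotatedBranch(nn.Module):
    # Attention computed on a rotated view of the feature map (branches A and B):
    # global max/average pooling, two shared FC layers, pixel-wise addition, sigmoid.
    def __init__(self, rotated_channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(rotated_channels, rotated_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(rotated_channels // reduction, rotated_channels),
        )

    def forward(self, x_rot):                          # x_rot: (B, C', H', W')
        max_vec = torch.amax(x_rot, dim=(2, 3))        # maximum pooling
        avg_vec = torch.mean(x_rot, dim=(2, 3))        # average pooling
        weight = torch.sigmoid(self.fc(max_vec) + self.fc(avg_vec))
        return x_rot * weight[:, :, None, None]        # re-weight the rotated feature map

class SpatialAttention(nn.Module):
    def __init__(self, height, width, kernel_size=7):
        super().__init__()
        self.branch_h = RotatedBranch(height)          # branch A: H interacts with C
        self.branch_w = RotatedBranch(width)           # branch B: W interacts with C
        self.branch_s = nn.Sequential(                 # branch C: K x K conv + BN on pooled maps
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1),
        )

    def forward(self, x):                              # x: (B, C, H, W)
        a1 = self.branch_h(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)   # rotate H<->C and back
        a2 = self.branch_w(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)   # rotate W<->C and back
        pooled = torch.cat([x.amax(1, keepdim=True), x.mean(1, keepdim=True)], dim=1)
        a3 = x * torch.sigmoid(self.branch_s(pooled))                   # branch C attention map
        return (a1 + a2 + a3) / 3.0                                     # step D: average of the three maps

attention = SpatialAttention(height=32, width=32)
print(attention(torch.randn(2, 64, 32, 32)).shape)     # torch.Size([2, 64, 32, 32])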
Step 112: the trunk network is pruned by using the different sparse masks obtained by training the trunk network, and two different sub-networks are generated for extracting RGB (red, green, blue) features, depth features and semantic features. As shown in FIG. 2, the details are as follows:
A. The base network is randomly initialized, all mask matrices are set to 1, and a pruning threshold is set.
B. The base network and the mask matrices are trained after an element-wise AND operation; the trained weights are compared with the threshold, and the mask matrices are updated iteratively to obtain two different sparse mask matrices that share the common parameters.
C. Through training, the different sparse masks of the two sub-networks in the spatial feature extraction network are obtained, and the elements that contribute little to the task are pruned in each bottleneck block of ResNet.
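By way of illustration only, the sparse-mask pruning of steps A to C can be sketched on a toy layer as follows: a shared weight tensor is trained through two binary masks (an element-wise AND/product of weights and mask), and after each update the weights whose magnitude falls below a per-sub-network threshold are masked out, yielding two differently pruned sub-networks that share the common parameters. The toy linear layer, the thresholds and the training loop are assumptions made for this sketch.

import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, n_subnets=2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        # One binary mask per sub-network, initialised to all ones (nothing pruned yet).
        self.register_buffer("masks", torch.ones(n_subnets, out_features, in_features))

    def forward(self, x, subnet):
        # Element-wise product of the shared weights and the selected sub-network's mask.
        return nn.functional.linear(x, self.weight * self.masks[subnet])

    def update_masks(self, thresholds):
        # Iterative mask update: prune weights below each sub-network's threshold.
        with torch.no_grad():
            for k, thr in enumerate(thresholds):
                self.masks[k] = (self.weight.abs() >= thr).float()

layer = MaskedLinear(16, 8)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
x, target = torch.randn(32, 16), torch.randn(32, 8)
for step in range(50):
    loss = sum(nn.functional.mse_loss(layer(x, k), target) for k in range(2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    layer.update_masks(thresholds=(0.05, 0.10))     # different sparsity per sub-network
print([float(m.mean()) for m in layer.masks])        # fraction of weights kept in each sub-network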
In the step 12, the RGB image is processed through a ResNet network embedded with an attention mechanism and a pyramid pooling module to obtain semantic segmentation information, which is as follows:
The RGB image is passed through a ResNet feature extractor embedded with the spatial attention module and through a pyramid pooling module to obtain semantic segmentation features that fuse global information and multi-scale context information, together with a spatial attention map, thereby acquiring high-level semantic information.
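By way of illustration only, a PSPNet-style pyramid pooling module of the kind referred to above can be sketched as follows: the feature map is pooled onto several grid sizes, re-projected with 1×1 convolutions, upsampled back to the input resolution and concatenated with the input, so that global and multi-scale context information is fused. The bin sizes (1, 2, 3, 6) and channel widths are assumptions made for this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        branch_channels = in_channels // len(bins)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),                      # pool to bin_size x bin_size
                nn.Conv2d(in_channels, branch_channels, 1, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        ])
        self.out_channels = in_channels + branch_channels * len(bins)

    def forward(self, x):
        h, w = x.shape[2:]
        pyramids = [x]
        for branch in self.branches:
            # Upsample each pooled context map back to the input resolution.
            pyramids.append(F.interpolate(branch(x), size=(h, w),
                                          mode="bilinear", align_corners=False))
        return torch.cat(pyramids, dim=1)               # fused multi-scale context features

ppm = PyramidPooling(in_channels=256)
print(ppm(torch.randn(1, 256, 60, 60)).shape)           # torch.Size([1, 512, 60, 60])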
The step 2 specifically comprises the following steps:
Step 21: the LSTM encoder understands, summarizes and memorizes the vehicle historical continuous motion state sequence.
Step 22: a time attention mechanism is constructed in the time feature extraction network to model the relationship between the historical speed state sequences and give more weight to important time context features.
Step 23: the decoder performs time sequence generation and feature extraction, and updates the hidden state at the current moment.
as shown in fig. 4, the LSTM encoder understands, summarizes and memorizes the vehicle historical continuous motion state sequence in step 21, which is as follows:
The LSTM encoder performs t recursive updates on the vehicle historical continuous motion state sequence s_1, ..., s_t of length t to obtain a fixed-length temporal context encoding vector c_t; this vector contains the encoder's understanding, summarization and memory of the vehicle's historical continuous motion state sequence.
the step 22 constructs a time attention mechanism in the time feature extraction network, which is specifically as follows:
step 221, the multi-layer perceptron in the time attention branch network hides the state h according to the encoder of the ith stepiHidden state s of decoder in step j-1j-1To obtain an energy term eji=wTtanh(W[sj-1,hi]+ b), where W and b are the weight and bias vectors from the input layer to the hidden layer, and W is the weight vector from the hidden layer to the output layer.
Step 222, the Softmax function in the temporal attention Branch network according to ejiObtaining the abstract feature of the encoder at the ith step and the attention coefficient a of the decoder at the jth stepjiI.e. by
Figure BDA0003016728030000131
Where t is the length of the input sequence.
Step 223, the temporal attention branch network compares the attention coefficient ajiAs weight, carrying out weighted summation on hidden states at all moments to obtain a context vector m of the decoder at the j stepjI.e. by
Figure BDA0003016728030000132
The decoder performs time sequence generation and feature extraction in step 23, and updates the hidden state at the current time, which is specifically as follows:
The decoder updates the hidden state s_j at step j according to the output vector y_{j-1} of the previous step, the hidden state s_{j-1} and the context vector m_j, and then updates the historical motion output vector decoded at step j according to s_j, y_{j-1} and m_j.
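By way of illustration only, steps 21 to 23 can be combined into the following PyTorch-style sketch: an LSTM encoder summarizes the historical motion sequence, a multi-layer perceptron scores each encoder hidden state h_i against the previous decoder state s_{j-1} to produce the energy e_ji, Softmax turns the energies into the attention coefficients a_ji, their weighted sum gives the context vector m_j, and an LSTM cell decoder is updated from y_{j-1}, s_{j-1} and m_j. The feature dimensions, the number of decoding steps and the output head are assumptions made for this sketch.

import torch
import torch.nn as nn

class TemporalAttentionDecoder(nn.Module):
    def __init__(self, input_dim=1, hidden_dim=64, out_dim=1):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTMCell(out_dim + hidden_dim, hidden_dim)
        # Multi-layer perceptron producing the energy e_ji = w^T tanh(W[s_{j-1}, h_i] + b).
        self.energy = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1, bias=False),
        )
        self.out = nn.Linear(2 * hidden_dim + out_dim, out_dim)

    def forward(self, history, n_steps=5):
        # history: (B, t, input_dim) historical motion states s_1 ... s_t.
        enc_h, _ = self.encoder(history)                     # all encoder hidden states h_i
        batch = history.size(0)
        s = torch.zeros(batch, enc_h.size(2))                # decoder hidden state s_j
        c = torch.zeros_like(s)                              # decoder cell state
        y = torch.zeros(batch, 1)                            # previous output y_{j-1}
        outputs = []
        for _ in range(n_steps):
            # Energy e_ji for every encoder step i given the current decoder state s_{j-1}.
            e = self.energy(torch.cat([s.unsqueeze(1).expand_as(enc_h), enc_h], dim=2))
            a = torch.softmax(e, dim=1)                      # attention coefficients a_ji
            m = (a * enc_h).sum(dim=1)                       # context vector m_j (weighted sum of h_i)
            # Update the decoder hidden state from y_{j-1}, (s_{j-1}, c) and m_j.
            s, c = self.decoder(torch.cat([y, m], dim=1), (s, c))
            y = self.out(torch.cat([s, m, y], dim=1))        # decoded motion output y_j
            outputs.append(y)
        return torch.stack(outputs, dim=1)                   # (B, n_steps, out_dim)

model = TemporalAttentionDecoder()
print(model(torch.randn(8, 20, 1)).shape)                    # torch.Size([8, 5, 1])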
Effect verification:
in order to verify the effectiveness of the method, a data set generated and labeled by an automatic driving simulation test platform is adopted, 8112 images are selected for training, and the remaining 3476 images are used for testing the images for algorithm verification.
The training error of the method of the invention is compared with that of the MM-STConv behavior decision model, and the result is shown in FIG. 5. The training loss curves of both models decrease gradually as the training period increases; the loss curve of the end-to-end automatic driving decision method based on the attention mechanism and spatio-temporal features lies below that of the MM-STConv behavior decision model throughout and decreases faster. At the same time, the training loss curve of the model with the fused attention mechanism shows less jitter than that of the MM-STConv behavior decision model. Compared with the MM-STConv behavior decision model, the training process of the end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features is therefore more stable and efficient, and converges faster.
The prediction accuracy of the method of the invention is compared with that of the MM-STConv behavior decision model in FIG. 6. The prediction accuracy curves of both methods rise gradually as the training period increases; the accuracy curve of the method based on the attention mechanism and spatio-temporal features lies above that of the MM-STConv behavior decision model throughout and rises faster. Owing to the introduction of the attention module, the end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features achieves better model performance and more stable prediction results than the MM-STConv behavior decision model.
The results of speed prediction and steering angle prediction using the system of the invention are shown in FIG. 7 and FIG. 8. The speed and steering angle prediction curves of the method of the invention are close to the real reference curves, fit the reference curves well, show little jitter, and yield stable predictions.
Example 3
The method of the present invention, if implemented in the form of a software functional unit and sold or used as a stand-alone product, can be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. Computer-readable storage media, including volatile and non-volatile, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical memory (e.g., CD, DVD, BD, HVD, etc.), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid state disk (SSD)), etc.
Example 4
In an exemplary embodiment, a computer device is also provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the end-to-end automatic driving behavior decision method when executing the computer program. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. An end-to-end automatic driving behavior decision method is characterized by comprising the following steps:
acquiring the spatial characteristics of a scene through a convolutional neural network embedded with an attention mechanism based on image information, depth information and semantic segmentation information;
acquiring time characteristics through a memory network coding-decoding structure embedded with an attention mechanism based on the historical motion state sequence information of the vehicle;
and connecting the spatial characteristics with the temporal characteristics, establishing an end-to-end automatic driving behavior decision model, and acquiring an end-to-end automatic driving behavior prediction result.
2. The end-to-end automatic driving behavior decision method according to claim 1, characterized in that the extraction process of the spatial features comprises:
inputting the packed image and depth image into a backbone network to obtain image information and depth information;
image information is pooled through a backbone network and a pyramid to obtain semantic segmentation information;
inputting image information, depth information and semantic segmentation information into a connecting layer, generating a spatial feature vector with a fixed length, and acquiring spatial features of a scene;
wherein, an attention mechanism is embedded in the backbone network.
3. The end-to-end automatic driving behavior decision method according to claim 2, characterized in that the acquisition process of the image information and the depth information is as follows:
capturing an input characteristic diagram by using three spatial attention branch networks, establishing interaction between spatial dimensions and channel dimensions, and acquiring a spatial attention diagram;
training a backbone network to obtain different sparse masks; pruning the main network by using different sparse masks to generate two different sub-networks, and acquiring image information and depth information based on the different sub-networks.
4. The end-to-end automatic driving behavior decision method according to claim 3, characterized in that the acquisition process of the spatial attention map is specifically as follows:
establishing interaction of space dimensionality and channel dimensionality through the feature diagram input by rotation in the three space attention branch networks respectively; then, performing maximum pooling and average pooling on the rotated feature maps respectively; cascading the average pooled feature map and the maximum pooled feature map, and inputting the cascaded feature maps into two full-connection layers for coding; generating attention weights through a sigmoid activation function, and combining the attention weights with an original input feature map to obtain three attention maps; the three attention diagrams are averaged to obtain a spatial attention diagram.
5. The end-to-end automatic driving behavior decision method according to claim 3, characterized in that the specific operations of pruning are:
initializing a base network and a mask matrix randomly, and setting a pruning threshold value at the same time;
training the base network and the mask matrix after an AND operation, and iteratively updating the mask matrix to obtain two different sparse mask matrices that share parameters;
and obtaining different sparse masks of the two sub-networks through training, and pruning the trunk network by using the different sparse masks.
6. The end-to-end automatic driving behavior decision method according to claim 1, characterized in that the extraction process of the temporal features comprises:
understanding, summarizing and memorizing the vehicle historical motion state sequence information by using an encoder to obtain a vehicle historical motion state feature vector;
constructing a time attention mechanism by utilizing a time feature extraction network;
and performing time sequence generation and feature extraction on the vehicle historical motion state feature vector subjected to the time attention mechanism by using a decoder, updating the hidden state at the current moment, and acquiring time features.
7. The end-to-end automated driving behavior decision method of claim 6,
the time attention mechanism is constructed based on a time attention module, and the specific operations comprise:
the multi-layer perceptron in the time attention module obtains an energy item according to the hidden state of the encoder and the hidden state of the decoder;
a Softmax function in the time attention module obtains real-time encoder abstract characteristics and a decoder attention coefficient according to the energy items;
and the time attention module takes the attention coefficient as a weight and carries out weighted summation on the hidden states at all the moments to obtain the context vector of the decoder at each moment.
8. An end-to-end automated driving behavior decision system, comprising:
the spatial attention module is used for acquiring spatial features of a scene based on image information, depth information and semantic segmentation information;
the time attention module is used for acquiring time characteristics based on the vehicle historical motion state sequence information and constructing a time attention mechanism based on the time characteristics;
the model establishing module is respectively interacted with the space attention module and the time attention module, establishes an end-to-end automatic driving behavior decision model based on the space characteristic and the time attention mechanism, and predicts an end-to-end automatic driving behavior result through the end-to-end automatic driving behavior decision model.
9. The end-to-end automated driving behavior decision system of claim 8, characterized in that the spatial attention module comprises a ResNet feature extractor, a pyramid pooling unit and three spatial attention branch networks;
the ResNet feature extractor is used for acquiring feature information in the image information;
the pyramid pooling unit is used for performing maximum pooling and average pooling on the feature information acquired by the ResNet feature extractor;
the space attention branching network is used for establishing interaction of space dimension and channel dimension based on the input feature diagram;
the time attention module comprises an encoder, a decoder and a multi-layer perceptron;
the encoder is used for understanding, summarizing and memorizing the vehicle historical motion state sequence;
the decoder is used for generating a time sequence and extracting features and updating the hidden state at the current moment;
the multi-layer perceptron is used to derive an energy term based on the hidden states of the encoder and decoder.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor when executing the computer program implements the steps of the end-to-end autopilot behavior decision method according to any one of claims 1 to 7.
CN202110391084.6A 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment Active CN113139446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110391084.6A CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110391084.6A CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Publications (2)

Publication Number Publication Date
CN113139446A true CN113139446A (en) 2021-07-20
CN113139446B CN113139446B (en) 2024-02-06

Family

ID=76811192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110391084.6A Active CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Country Status (1)

Country Link
CN (1) CN113139446B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673412A (en) * 2021-08-17 2021-11-19 驭势(上海)汽车科技有限公司 Key target object identification method and device, computer equipment and storage medium
CN114423061A (en) * 2022-01-20 2022-04-29 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning
CN114463670A (en) * 2021-12-29 2022-05-10 电子科技大学 Airport scene monitoring video change detection system and method
CN114777797A (en) * 2022-06-13 2022-07-22 长沙金维信息技术有限公司 High-precision map visual positioning method for automatic driving and automatic driving method
CN115049130A (en) * 2022-06-20 2022-09-13 重庆邮电大学 Automatic driving track prediction method based on space-time pyramid

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
US20200139973A1 (en) * 2018-11-01 2020-05-07 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle
WO2020253965A1 (en) * 2019-06-20 2020-12-24 Toyota Motor Europe Control device, system and method for determining perceptual load of a visual and dynamic driving scene in real time
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
US20200139973A1 (en) * 2018-11-01 2020-05-07 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle
WO2020253965A1 (en) * 2019-06-20 2020-12-24 Toyota Motor Europe Control device, system and method for determining perceptual load of a visual and dynamic driving scene in real time
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
杜圣东; 李天瑞; 杨燕; 王浩; 谢鹏; 洪西进: "A Traffic Flow Prediction Model Based on Sequence-to-Sequence Spatiotemporal Attention Learning", Journal of Computer Research and Development, no. 08
王军; 鹿姝; 李云伟: "Multimodal Sign Language Recognition Fusing Attention Mechanism and Connectionist Temporal Classification", Journal of Signal Processing, no. 09
胡学敏; 童秀迟; 郭琳; 张若晗; 孔力: "End-to-End Autonomous Driving Model Based on Deep Visual Attention Neural Network", Journal of Computer Applications, no. 07
蔡英凤; 朱南楠; 邰康盛; 刘擎超; 王海: "Vehicle Behavior Prediction Based on Attention Mechanism", Journal of Jiangsu University (Natural Science Edition), no. 02
赵祥模; 连心雨; 刘占文; 沈超; 董鸣: "End-to-End Autonomous Driving Behavior Decision Model Based on MM-STConv", China Journal of Highway and Transport, no. 03

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673412A (en) * 2021-08-17 2021-11-19 驭势(上海)汽车科技有限公司 Key target object identification method and device, computer equipment and storage medium
CN113673412B (en) * 2021-08-17 2023-09-26 驭势(上海)汽车科技有限公司 Method and device for identifying key target object, computer equipment and storage medium
CN114463670A (en) * 2021-12-29 2022-05-10 电子科技大学 Airport scene monitoring video change detection system and method
CN114423061A (en) * 2022-01-20 2022-04-29 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning
CN114423061B (en) * 2022-01-20 2024-05-07 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning
CN114777797A (en) * 2022-06-13 2022-07-22 长沙金维信息技术有限公司 High-precision map visual positioning method for automatic driving and automatic driving method
CN115049130A (en) * 2022-06-20 2022-09-13 重庆邮电大学 Automatic driving track prediction method based on space-time pyramid

Also Published As

Publication number Publication date
CN113139446B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN113139446B (en) End-to-end automatic driving behavior decision method, system and terminal equipment
Alonso et al. 3d-mininet: Learning a 2d representation from point clouds for fast and efficient 3d lidar semantic segmentation
Ding et al. Context contrasted feature and gated multi-scale aggregation for scene segmentation
JP2023503527A (en) Trajectory prediction method, trajectory prediction device, electronic device, recording medium, and computer program
US20180336469A1 (en) Sigma-delta position derivative networks
US10325371B1 (en) Method and device for segmenting image to be used for surveillance using weighted convolution filters for respective grid cells by converting modes according to classes of areas to satisfy level 4 of autonomous vehicle, and testing method and testing device using the same
Akan et al. Stretchbev: Stretching future instance prediction spatially and temporally
CN111696110B (en) Scene segmentation method and system
CN110570035B (en) People flow prediction system for simultaneously modeling space-time dependency and daily flow dependency
CN117157678A (en) Method and system for graph-based panorama segmentation
JP2023514172A (en) A method for continuously learning a classifier for classifying client images using a continuous learning server and a continuous learning server using the same
WO2020198173A1 (en) Subject-object interaction recognition model
US11943460B2 (en) Variable bit rate compression using neural network models
CN113362491A (en) Vehicle track prediction and driving behavior analysis method
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN114037640A (en) Image generation method and device
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN115018039A (en) Neural network distillation method, target detection method and device
CN116485867A (en) Structured scene depth estimation method for automatic driving
Zhao et al. End‐to‐end autonomous driving decision model joined by attention mechanism and spatiotemporal features
CN113435356B (en) Track prediction method for overcoming observation noise and perception uncertainty
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN113033430B (en) Artificial intelligence method, system and medium for multi-mode information processing based on bilinear
CN116152263A (en) CM-MLP network-based medical image segmentation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant