CN116476863A - Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning

Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning

Info

Publication number
CN116476863A
Authority
CN
China
Prior art keywords
vehicle
decision
automatic driving
time
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310332965.XA
Other languages
Chinese (zh)
Inventor
任明仑
汪娟
周俊杰
吴淑慧
朱倩倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202310332965.XA
Publication of CN116476863A

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 60/00 Drive control systems specially adapted for autonomous road vehicles
    • B60W 60/001 Planning or execution of driving tasks
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The invention provides an automatic driving transverse and longitudinal integrated decision-making method and system based on deep reinforcement learning, a storage medium and electronic equipment, and relates to the technical field of automatic driving. The method comprises: collecting the states of the autonomous vehicle and the surrounding environment in real time; taking the current states of the autonomous vehicle and the surrounding environment as the input of a deep learning model to acquire an action space; selecting and executing the action with the highest evaluation according to the reinforcement learning decision model, with the goal of maximizing the expected reward; and repeating these steps until the target point is reached. By collecting and processing a large amount of driving data, the intelligent decision network automatically learns and optimizes, and can quickly adapt to various driving scenes and tasks without retraining the model, thereby improving the accuracy and stability of decision-making and laying a solid foundation for the commercial application of automatic driving technology.

Description

Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of automatic driving, in particular to an automatic driving transverse and longitudinal integrated decision-making method, system, storage medium and electronic equipment based on deep reinforcement learning.
Background
Conventional reinforcement learning algorithms are often designed for a specific driving scenario or task, such as car-following and obstacle avoidance on a highway. The decision models trained by such algorithms are only suitable for that specific driving scene or task and cannot adapt well to other driving tasks or scenes. This leaves a large gap between traditional reinforcement learning algorithms and human drivers, since a human driver can operate a vehicle freely across a variety of driving scenarios and tasks.
Moreover, traffic conditions on the road change constantly during automatic driving. How to accurately predict these complex traffic conditions and formulate an optimal driving plan and decision, so that the vehicle travels to its destination safely and efficiently, is the problem that intelligent automatic driving decision-making needs to solve.
Disclosure of Invention
(I) Technical problems to be solved
Aiming at the defects of the prior art, the invention provides an automatic driving transverse and longitudinal integrated decision-making method, a system, a storage medium and electronic equipment based on deep reinforcement learning, which solve the technical problem that accurate prediction cannot be carried out in the face of complex traffic conditions.
(II) technical scheme
In order to achieve the above purpose, the invention is realized by the following technical scheme:
an automatic driving transverse and longitudinal integrated decision method based on deep reinforcement learning is characterized in that an intelligent decision network is trained in advance, and the intelligent decision network comprises a deep learning model and a reinforcement learning decision model; the method comprises the following steps:
s1, acquiring states of an automatic driving vehicle and surrounding environments in real time;
s2, taking the current state of the automatic driving vehicle and the surrounding environment as the input of the deep learning model, and acquiring an action space;
S3, according to the reinforcement learning decision model and with the goal of maximizing the expected reward, selecting and executing the action with the highest evaluation;
s4, repeatedly executing S1 to S3 until the target point is reached.
Preferably, the automatic driving vehicle and the surrounding environment state in S1 include:
$s_t = \left(s_t^{ego},\ s_t^{goal},\ s_t^{other}\right)$

wherein $s_t$ denotes the state of the autonomous vehicle and the surrounding environment at time t; $s_t^{ego}$ denotes the own-vehicle state at time t, comprising the own-vehicle speed $v_t^{ego}$ and the own-vehicle position $p_t^{ego}$; $s_t^{goal}$ denotes the target state at time t, represented by the target position $p_t^{goal}$; and $s_t^{other}$ denotes the states of the other vehicles at time t.
Preferably, the other-vehicle states are acquired by sensors in real time and are represented by a grid occupancy map;

wherein the grid occupancy map specifically refers to the following:

the occupancy of a 12 m × 120 m region around the own vehicle is characterized by a 7 × 40 binary matrix;

the grid occupancy map is defined in the Frenet coordinate system, where the variable s represents the position of a vehicle along the road direction (front and rear) and the variable d represents its position across the road (left and right);

along the s-axis, the region from 100 m ahead of to 20 m behind the own vehicle is taken as the sampling area; along the d-axis, the region within 6 m to the left and right of the own vehicle is taken as the sampling area.
Preferably, the action space $a_t$ in S2 comprises five discrete behaviors: acceleration, deceleration, left lane change, right lane change, and keeping state;

wherein acceleration is defined as increasing the current speed by 2 m/s, deceleration as decreasing the current speed by 2 m/s, left lane change as moving to the adjacent lane on the left, right lane change as moving to the adjacent lane on the right, and keeping state as driving along the current lane at the current speed for a fixed period of time.
Preferably, the deep learning model adopts a three-layer stacked LSTM network.
Preferably, training the reinforcement learning decision model by adopting an Actor-Critic method;
wherein the policy network $\pi(a\mid s;\theta)$ takes the state information $s_t$ at time t as input; the deep learning model computes and outputs the action space $a_t$, from which the reinforcement learning decision model selects the optimal action $a_t^{*}$ according to the current decision policy $\pi$; the value network $q(s,a;\omega)$ is used to evaluate how good the action $a_t^{*}$ is in the state $s_t$;

after the action $a_t^{*}$ is executed, the autonomous vehicle enters a new state $s_{t+1}$ and obtains the feedback reward $r_t$ from the environment; based on the reward function $r_t$, the reinforcement learning decision model continuously maximizes the reward and learns the optimal decision policy $\pi^{*}$.
Preferably, the reward function $r_t$ specifically comprises:

$dist_t = \sqrt{\left(x_t^{ego}-x_t^{goal}\right)^2+\left(y_t^{ego}-y_t^{goal}\right)^2}$

wherein $dist_t$ denotes the distance between the own-vehicle position $p_t^{ego}$ at time t and the target position $p_t^{goal}$; $x_t^{ego}$ denotes the x-coordinate of the own vehicle at time t, $x_t^{goal}$ the x-coordinate of the target point at time t, $y_t^{ego}$ the y-coordinate of the own vehicle at time t, and $y_t^{goal}$ the y-coordinate of the target point at time t; and $v_t^{ego}$ denotes the speed of the own vehicle at time t.
An automatic driving transverse and longitudinal integrated decision system based on deep reinforcement learning, in which an intelligent decision network is trained in advance, the intelligent decision network comprising a deep learning model and a reinforcement learning decision model; the system comprises:

a collection module, used for collecting the states of the autonomous vehicle and the surrounding environment in real time;

an acquisition module, used for taking the current states of the autonomous vehicle and the surrounding environment as the input of the deep learning model to acquire an action space;

a selection module, used for selecting and executing the action with the highest evaluation according to the reinforcement learning decision model, with the goal of maximizing the expected reward;

and a repetition module, used for repeatedly invoking the collection module, the acquisition module and the selection module until the target point is reached.
A storage medium storing a computer program for automatic driving transverse and longitudinal integrated decision-making based on deep reinforcement learning, wherein the computer program causes a computer to execute the automated driving intelligent decision method described above.
An electronic device, comprising:
one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the automated driving intelligent decision method as described above.
(III) beneficial effects
The invention provides an automatic driving transverse and longitudinal integrated decision-making method and system based on deep reinforcement learning, a storage medium and electronic equipment. Compared with the prior art, the method has the following beneficial effects:
the method comprises the steps of collecting the states of an automatic driving vehicle and surrounding environments in real time; taking the current state of the automatic driving vehicle and the surrounding environment as the input of the deep learning model to acquire an action space; selecting and executing the action with highest evaluation according to the reinforcement learning decision model and with the aim of maximizing expected rewards; and repeatedly executing the steps until the target point is reached. By collecting and processing a large amount of driving data, the intelligent decision network is automatically learned and optimized, and various driving scenes and tasks can be quickly adapted without retraining a model, so that the accuracy and stability of decision making are improved, and a solid foundation is laid for realizing commercialized application of the automatic driving technology.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a technical path diagram of the automatic driving transverse and longitudinal integrated decision method based on deep reinforcement learning provided by an embodiment of the invention;
FIG. 2 is a block diagram of the automatic driving transverse and longitudinal integrated decision method based on deep reinforcement learning provided by an embodiment of the invention;
FIG. 3 is an exemplary diagram of a grid occupancy map provided by an embodiment of the present invention;
fig. 4 is an exemplary diagram of another grid occupation map according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
According to the embodiment of the application, the technical problem that accurate prediction cannot be performed in the face of complex traffic conditions is solved by providing the automatic driving transverse and longitudinal integrated decision method, the system, the storage medium and the electronic equipment based on deep reinforcement learning.
The technical scheme in the embodiment of the application aims to solve the technical problems, and the overall thought is as follows:
as shown in FIG. 1, the invention provides an intelligent decision method for automatic driving based on deep reinforcement learning. The method obtains the states s of the own vehicle and other surrounding vehicles in real time through the sensor t Input into a deep learning network model to generate decision candidates (namely action space) a t Selecting optimal actions by reinforcement learning modelOptimal action of the vehicle>Environment awarding reward function r t And reach the next state s t+1 The process is repeated to reach the target point to finish the current task. By collecting and processing a large amount of driving data, the intelligent decision network is automatically learned and optimized, and various driving scenes and tasks can be quickly adapted without retraining a model, so that the accuracy and stability of decision making are improved, and a solid foundation is laid for realizing commercialized application of the automatic driving technology.
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
Examples:
As shown in FIG. 2, an embodiment of the invention provides an automatic driving transverse and longitudinal integrated decision method based on deep reinforcement learning, characterized in that an intelligent decision network is trained in advance, the intelligent decision network comprising a deep learning model and a reinforcement learning decision model; the method comprises the following steps:
s1, acquiring states of an automatic driving vehicle and surrounding environments in real time;
s2, taking the current state of the automatic driving vehicle and the surrounding environment as the input of the deep learning model, and acquiring an action space;
S3, according to the reinforcement learning decision model and with the goal of maximizing the expected reward, selecting and executing the action with the highest evaluation;
s4, repeatedly executing S1 to S3 until the target point is reached.
The embodiment of the invention can quickly adapt to various driving scenes and tasks by collecting and processing a large amount of driving data and automatically learning and optimizing the intelligent decision network without retraining a model, thereby improving the accuracy and stability of decision and laying a solid foundation for realizing the commercialized application of the automatic driving technology.
The following will describe each step of the above technical solution in detail:
firstly, it should be noted that, in the implementation of the present invention, the intelligent decision network trained in advance based on the deep reinforcement learning algorithm is specifically composed of two parts, namely a deep learning model and a reinforcement learning decision model.
A deep learning model uses a multi-layer neural network to learn complex relationships in the input data. Here, the deep learning model is used to extract features from the state $s_t$ and generate decision candidates.
The reinforcement learning decision model is intended to teach an agent how to make decisions in an unknown environment. In reinforcement learning, an agent interacts with an environment and receives rewards or penalties from the environment. The reinforcement learning decision model maximizes future rewards by optimizing the agent's strategy.
In step S1, the state of the autonomous vehicle and the surrounding environment is acquired in real time.
The automatic driving vehicle and the surrounding environment state in this step include:
$s_t = \left(s_t^{ego},\ s_t^{goal},\ s_t^{other}\right)$

wherein $s_t$ denotes the state of the autonomous vehicle and the surrounding environment at time t; $s_t^{ego}$ denotes the own-vehicle state at time t, comprising the own-vehicle speed $v_t^{ego}$ and the own-vehicle position $p_t^{ego}$; $s_t^{goal}$ denotes the target state at time t, represented by the target position $p_t^{goal}$; and $s_t^{other}$ denotes the states of the other vehicles at time t.
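For illustration, the following Python sketch packs the components of $s_t$ into a single container; the field names and the flattening scheme are assumptions for this sketch, not taken from the patent:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DrivingState:
    ego_speed: float        # v_t^ego, in m/s
    ego_pos: tuple          # p_t^ego = (x, y)
    goal_pos: tuple         # p_t^goal = (x, y)
    occupancy: np.ndarray   # 7 x 40 binary grid of other vehicles

    def as_vector(self) -> np.ndarray:
        """Flatten the state into one feature vector for the network."""
        scalars = np.array([self.ego_speed, *self.ego_pos, *self.goal_pos])
        return np.concatenate([scalars, self.occupancy.ravel()])
```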
In particular, the other-vehicle states are acquired by sensors in real time and are characterized by a grid occupancy map.

As shown in FIG. 3 and FIG. 4, the grid occupancy map specifically refers to the following:

for a 12 m × 120 m region around the own vehicle, a 7 × 40 binary matrix is used to characterize its occupancy;

the grid occupancy map is defined in the Frenet coordinate system, where the variable s represents the position of a vehicle along the road direction (i.e. the longitudinal displacement) and the variable d represents its position across the road (i.e. the lateral offset);

along the s-axis, the region from 100 m ahead of to 20 m behind the own vehicle is taken as the sampling area; along the d-axis, the region within 6 m to the left and right of the own vehicle is taken as the sampling area.
Compared with raw data representations such as lidar point clouds or camera images, the grid map has a relatively low dimension, which effectively reduces the complexity of the state space and cuts the computation and training time of the algorithm. It also generalizes well across different scenes and environments, such as urban roads and highways, and adapts to a variety of environments better than other representations.

Moreover, because the grid occupancy map used as the input state differs little between simulation and the real environment, the state domain gap stays small, and a model trained in the simulation environment can be conveniently migrated to the real vehicle environment.
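A hedged sketch of how such a grid occupancy map could be built follows; the 7 × 40 resolution and the 120 m × 12 m window come from the text, while the function name and the binning scheme are illustrative assumptions:

```python
import numpy as np

S_MIN, S_MAX = -20.0, 100.0   # 20 m behind to 100 m ahead of the own vehicle
D_MIN, D_MAX = -6.0, 6.0      # 6 m to either side
ROWS, COLS = 7, 40            # lateral x longitudinal grid resolution

def occupancy_grid(other_vehicles):
    """other_vehicles: iterable of (s, d) Frenet offsets relative to the own vehicle."""
    grid = np.zeros((ROWS, COLS), dtype=np.uint8)
    for s, d in other_vehicles:
        if S_MIN <= s < S_MAX and D_MIN <= d < D_MAX:
            col = int((s - S_MIN) / (S_MAX - S_MIN) * COLS)  # longitudinal bin
            row = int((d - D_MIN) / (D_MAX - D_MIN) * ROWS)  # lateral bin
            grid[row, col] = 1
    return grid

# Example: one vehicle 30 m ahead in the same lane, one 10 m behind to the left.
print(occupancy_grid([(30.0, 0.0), (-10.0, -3.5)]))
```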
In step S2, the current state of the autonomous vehicle and the surrounding environment is used as an input of the deep learning model, and an operation space is acquired.
In automated driving, the vehicle needs to analyze historical data in order to predict future dynamic states and make appropriate driving decisions. Decision making based on single-frame observation usually loses important state information, so that the embodiment of the invention introduces multi-frame data into a deep learning model and adopts an LSTM network to process time sequence input. LSTM networks may model historical states, predicting current states to guide decision making. By adding the LSTM, the intelligent vehicle can more accurately predict dynamic changes on the road, so that more accurate driving decisions are made, and the decision accuracy is further improved.
Specifically, the deep learning model in this step adopts a three-layer stacked LSTM network, in which the input layer, the hidden layers and the output layer constitute the input, processing and output parts of the neural network respectively. In the hidden layers, each LSTM cell contains three gates (a forget gate, an input gate and an output gate) and one cell state that maintains the current state. To improve the expressive capacity of the network, each LSTM layer has 128 cells, and a structure of three recurrent layers is adopted. Between the layers, a dropout rate of 0.2 is used to avoid overfitting. In addition, the embodiment of the invention adopts the Adam optimizer with a learning rate of α = 0.0005 and a decay rate of 0.9 to accelerate the convergence of network training.
The choice of the LSTM input length is critical to model performance. Too short an input prevents the model from capturing the inherent features of the input data, reducing feature-extraction accuracy; too long an input increases the computation of the model, raising the time cost of training and inference. To balance feature-extraction accuracy and computational efficiency, the model takes the historical data of 20 frames $(s_t, s_{t-1}, \dots, s_{t-19})$ as the input layer, and the last layer outputs the action space $a_t$.
By adding the LSTM to the network structure, a memory function is realized: previous actions and rewards are taken as part of the input to provide more complete information, which improves the convergence speed of the model.
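The following PyTorch sketch illustrates a network of the kind described above (three stacked LSTM layers, 128 units, dropout 0.2 between layers, a 20-frame input window, Adam with learning rate 0.0005); the class name and the state dimension are assumptions, and the stated 0.9 decay rate is omitted because the text does not say whether it is a learning-rate schedule or an Adam beta:

```python
import torch
import torch.nn as nn

class DecisionLSTM(nn.Module):
    def __init__(self, state_dim: int, n_actions: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(input_size=state_dim, hidden_size=128,
                            num_layers=3, dropout=0.2, batch_first=True)
        self.head = nn.Linear(128, n_actions)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, 20, state_dim), the frames (s_t, ..., s_{t-19})
        out, _ = self.lstm(states)
        return self.head(out[:, -1, :])   # one score per discrete action

model = DecisionLSTM(state_dim=285)        # e.g. 5 scalars + 7*40 grid cells
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
```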
As shown in Table 1, the action space $a_t$ in this step includes five discrete behaviors: acceleration, deceleration, left lane change, right lane change, and keeping state.

TABLE 1

Action | Definition
Acceleration | Increase the current speed by 2 m/s
Deceleration | Decrease the current speed by 2 m/s
Left lane change | Move to the adjacent lane on the left
Right lane change | Move to the adjacent lane on the right
Keep state | Drive along the lane at the current speed for a fixed period of time

Wherein acceleration is defined as increasing the current speed by 2 m/s, deceleration as decreasing the current speed by 2 m/s, left lane change as moving to the adjacent lane on the left, right lane change as moving to the adjacent lane on the right, and keeping state as driving along the current lane at the current speed for a fixed period of time.
In step S3, according to the reinforcement learning decision model, the action with the highest evaluation is selected and executed, with the goal of maximizing the expected reward.
The embodiment of the invention adopts an Actor-Critic method to train the reinforcement learning decision model;
wherein the policy network $\pi(a\mid s;\theta)$ takes the state information $s_t$ at time t as input; the deep learning model computes and outputs the action space $a_t$, from which the reinforcement learning decision model selects the optimal action $a_t^{*}$ according to the current decision policy $\pi$; the value network $q(s,a;\omega)$ is used to evaluate how good the action $a_t^{*}$ is in the state $s_t$;

after the action $a_t^{*}$ is executed, the autonomous vehicle enters a new state $s_{t+1}$ and obtains the feedback reward $r_t$ from the environment; based on the reward function $r_t$, the reinforcement learning decision model continuously maximizes the reward and learns the optimal decision policy $\pi^{*}$.
The reward function $r_t$ is designed so that, during driving, behaviors with a higher relative speed are rewarded, ensuring that the driving speed does not become too low and improving the efficiency of completing the driving task. Specifically:

$dist_t = \sqrt{\left(x_t^{ego}-x_t^{goal}\right)^2+\left(y_t^{ego}-y_t^{goal}\right)^2}$

wherein $dist_t$ denotes the distance between the own-vehicle position $p_t^{ego}$ at time t and the target position $p_t^{goal}$; $x_t^{ego}$ denotes the x-coordinate of the own vehicle at time t, $x_t^{goal}$ the x-coordinate of the target point at time t, $y_t^{ego}$ the y-coordinate of the own vehicle at time t, and $y_t^{goal}$ the y-coordinate of the target point at time t; $v_t^{ego}$ denotes the speed of the own vehicle at time t.

The overtaking state is defined as the vehicle taking a left or right lane-change action. Overtaking can improve driving efficiency, especially under congested traffic, shortening travel time and reducing congestion; but it also carries a certain risk, so a small positive reward of 5 is given to the overtaking maneuver.

A collision state is defined as the own vehicle coinciding with another vehicle in the grid occupancy map, i.e. two or more vehicles occupying the same grid cell. This situation represents a traffic accident, and therefore a large negative reward is given.
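A sketch of a reward function consistent with this description is shown below; the distance and speed terms, the +5 overtaking bonus, and a large negative collision penalty follow the text, while the weights w_dist and w_speed and the -100 value are assumptions:

```python
import math

def reward(ego_xy, goal_xy, ego_speed, overtaking, collided,
           w_dist=1.0, w_speed=0.1):
    dist_t = math.hypot(ego_xy[0] - goal_xy[0], ego_xy[1] - goal_xy[1])
    r = -w_dist * dist_t + w_speed * ego_speed   # approach goal, keep speed up
    if overtaking:
        r += 5.0                                 # small positive overtaking bonus
    if collided:
        r -= 100.0                               # assumed collision penalty
    return r
```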
At the same time, the obtained experience $(s_t, a_t, s_{t+1}, r_t)$ is stored in an experience pool; the parameters of the policy network are continuously updated through prioritized experience replay, and new experiences are collected, so as to obtain the optimal policy under which the value network gives the current action the highest evaluation.

When the number of experiences reaches the preset capacity of the experience pool, a batch of experience vectors of a set size is drawn from the pool according to the prioritized-experience-replay principle, and the parameters of the policy network $\pi(a\mid s;\theta)$ and the value network $q(s,a;\omega)$ are learned.

The prioritized-experience-replay principle is to preferentially select the experience vectors in the pool with large TD errors. The TD error is the difference between the target value and the predicted value; the larger the TD error, the greater the updating effect of that experience vector on the network, so experience vectors with high TD errors are preferentially used for training, improving the efficiency and performance of the model.
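A minimal prioritized-replay sketch along these lines follows; proportional sampling with an alpha exponent is a standard choice assumed here, since the text only states that high-TD-error experiences are replayed preferentially:

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def push(self, transition, td_error):
        if len(self.buffer) >= self.capacity:    # drop the oldest experience
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)           # transition = (s, a, s', r)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()              # sample proportional to TD error
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i] for i in idx]
```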
The goal of the policy network $\pi(a\mid s;\theta)$ is to maximize the expected reward, improving the performance of the policy by updating its parameters. The update direction of the policy network can be computed by the back-propagation algorithm.

Illustratively, the loss function of the policy network takes the standard Actor-Critic form

$L(\theta) = -\log \pi(a_t \mid s_t;\theta)\, q(s_t, a_t;\omega)$

where θ is the parameter of the policy network, $q(s_t,a_t;\omega)$ is the score given by the value network for the current state $s_t$, and $\nabla_\theta \log \pi(a_t\mid s_t;\theta)$ is the gradient back-propagated through the policy network.
The goal of the value network $q(s,a;\omega)$ is to accurately estimate the state-action value function, improving the accuracy of the estimate by updating the parameters of the value network. The update direction of the value network is adjusted by computing the error between the predicted value and the target value and applying the back-propagation algorithm.

Illustratively, the loss function of the value network is defined as

$L(\omega) = \left( q(s_t, a_t;\omega) - \left( r_t + \gamma\, q(s_{t+1}, a_{t+1};\omega) \right) \right)^2$

where ω is the parameter of the value network, $q(s_t,a_t;\omega)$ is the network's score for the state $s_t$ and action $a_t$ at time t, $q(s_{t+1},a_{t+1};\omega)$ is its score for the state $s_{t+1}$ and action $a_{t+1}$ at time t+1, and γ is the discount factor.
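Putting the two losses together, one possible single-step Actor-Critic update is sketched below (PyTorch); the greedy choice of $a_{t+1}$ and all variable names are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def ac_update(policy, value, pi_opt, q_opt, s, a, r, s_next, gamma=0.99):
    # Critic: minimize the squared TD error L(w).
    q_sa = value(s).gather(1, a.unsqueeze(1)).squeeze(1)        # q(s_t, a_t; w)
    with torch.no_grad():
        a_next = policy(s_next).argmax(dim=1)                   # assumed greedy a_{t+1}
        td_target = r + gamma * value(s_next).gather(
            1, a_next.unsqueeze(1)).squeeze(1)                  # r_t + g*q(s_{t+1}, a_{t+1}; w)
    critic_loss = F.mse_loss(q_sa, td_target)
    q_opt.zero_grad(); critic_loss.backward(); q_opt.step()

    # Actor: minimize L(theta) = -log pi(a_t|s_t; theta) * q(s_t, a_t; w).
    log_pi = F.log_softmax(policy(s), dim=1).gather(
        1, a.unsqueeze(1)).squeeze(1)
    actor_loss = -(log_pi * q_sa.detach()).mean()
    pi_opt.zero_grad(); actor_loss.backward(); pi_opt.step()

    return (td_target - q_sa).detach().abs()   # |TD error|, usable for replay priorities
```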
The action with the highest evaluation in the action space acquired in step S2 is then selected and executed according to the reinforcement learning decision model, with the goal of maximizing the expected reward.
In step S4, S1 to S3 are repeatedly executed until the target point is reached.
The embodiment of the invention further provides an automatic driving transverse and longitudinal integrated decision system based on deep reinforcement learning, in which an intelligent decision network is trained in advance, the intelligent decision network comprising a deep learning model and a reinforcement learning decision model; the system comprises:

a collection module, used for collecting the states of the autonomous vehicle and the surrounding environment in real time;

an acquisition module, used for taking the current states of the autonomous vehicle and the surrounding environment as the input of the deep learning model to acquire an action space;

a selection module, used for selecting and executing the action with the highest evaluation according to the reinforcement learning decision model, with the goal of maximizing the expected reward;

and a repetition module, used for repeatedly invoking the collection module, the acquisition module and the selection module until the target point is reached.
Embodiments of the present invention provide a storage medium storing a computer program for automatic driving transverse and longitudinal integrated decision-making based on deep reinforcement learning, wherein the computer program causes a computer to execute the automated driving intelligent decision method described above.
An electronic device, comprising:
one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the automated driving intelligent decision method described above.
It can be understood that the automatic driving transverse and longitudinal integrated decision system based on deep reinforcement learning, the storage medium and the electronic device provided by the embodiments of the present invention correspond to the automatic driving transverse and longitudinal integrated decision method based on deep reinforcement learning provided by the embodiments of the present invention; for explanations, examples, beneficial effects and other aspects of the related content, reference may be made to the corresponding parts of the automated driving intelligent decision method, which are not repeated here.
In summary, compared with the prior art, the method has the following beneficial effects:
1. the embodiment of the invention can quickly adapt to various driving scenes and tasks by collecting and processing a large amount of driving data and automatically learning and optimizing the intelligent decision network without retraining a model, thereby improving the accuracy and stability of decision and laying a solid foundation for realizing the commercialized application of the automatic driving technology.
2. Unlike methods that take camera images or raw radar data as the network input, the grid occupancy map is used as the input state; the simulated state differs little from the real environment, which keeps the state domain gap small and allows a model trained in the simulation environment to be conveniently migrated to the real vehicle environment.
3. By adding the LSTM to the network structure, a memory function is realized: previous actions and rewards are taken as part of the input to provide more complete information, which improves the convergence speed of the model.
It is noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An automatic driving transverse and longitudinal integrated decision method based on deep reinforcement learning, characterized in that an intelligent decision network is trained in advance, the intelligent decision network comprising a deep learning model and a reinforcement learning decision model; the method comprises the following steps:
s1, acquiring states of an automatic driving vehicle and surrounding environments in real time;
s2, taking the current state of the automatic driving vehicle and the surrounding environment as the input of the deep learning model, and acquiring an action space;
S3, according to the reinforcement learning decision model and with the goal of maximizing the expected reward, selecting and executing the action with the highest evaluation;
s4, repeatedly executing S1 to S3 until the target point is reached.
2. The automated driving intelligent decision method of claim 1, wherein the automated driving vehicle and surrounding environment state in S1 comprises:
$s_t = \left(s_t^{ego},\ s_t^{goal},\ s_t^{other}\right)$

wherein $s_t$ denotes the state of the autonomous vehicle and the surrounding environment at time t; $s_t^{ego}$ denotes the own-vehicle state at time t, comprising the own-vehicle speed $v_t^{ego}$ and the own-vehicle position $p_t^{ego}$; $s_t^{goal}$ denotes the target state at time t, represented by the target position $p_t^{goal}$; and $s_t^{other}$ denotes the states of the other vehicles at time t.
3. The automated driving intelligent decision method of claim 2, wherein the other vehicle state is obtained by a sensor in real time, characterized by using a grid occupancy map;
wherein the grid occupancy map specifically refers to the following:

the occupancy of a 12 m × 120 m region around the own vehicle is characterized by a 7 × 40 binary matrix;

the grid occupancy map is defined in the Frenet coordinate system, where the variable s represents the position of a vehicle along the road direction (front and rear) and the variable d represents its position across the road (left and right);

along the s-axis, the region from 100 m ahead of to 20 m behind the own vehicle is taken as the sampling area; along the d-axis, the region within 6 m to the left and right of the own vehicle is taken as the sampling area.
4. The automated driving intelligent decision method of claim 1, wherein the action space $a_t$ in S2 comprises five discrete behaviors: acceleration, deceleration, left lane change, right lane change, and keeping state;

wherein acceleration is defined as increasing the current speed by 2 m/s, deceleration as decreasing the current speed by 2 m/s, left lane change as moving to the adjacent lane on the left, right lane change as moving to the adjacent lane on the right, and keeping state as driving along the current lane at the current speed for a fixed period of time.
5. The automated driving intelligent decision method of claim 1, wherein the deep learning model employs a three-layer stacked LSTM network.
6. The automated driving intelligent decision method of claim 1, wherein the reinforcement learning decision model is trained using an Actor-Critic method;
wherein the policy network $\pi(a\mid s;\theta)$ takes the state information $s_t$ at time t as input; the deep learning model computes and outputs the action space $a_t$, from which the reinforcement learning decision model selects the optimal action $a_t^{*}$ according to the current decision policy $\pi$; the value network $q(s,a;\omega)$ is used to evaluate how good the action $a_t^{*}$ is in the state $s_t$;

after the action $a_t^{*}$ is executed, the autonomous vehicle enters a new state $s_{t+1}$ and obtains the feedback reward $r_t$ from the environment; based on the reward function $r_t$, the reinforcement learning decision model continuously maximizes the reward and learns the optimal decision policy $\pi^{*}$.
7. The automated driving intelligent decision method of claim 6, wherein the reward function $r_t$ specifically comprises:

$dist_t = \sqrt{\left(x_t^{ego}-x_t^{goal}\right)^2+\left(y_t^{ego}-y_t^{goal}\right)^2}$

wherein $dist_t$ denotes the distance between the own-vehicle position $p_t^{ego}$ at time t and the target position $p_t^{goal}$; $x_t^{ego}$ denotes the x-coordinate of the own vehicle at time t, $x_t^{goal}$ the x-coordinate of the target point at time t, $y_t^{ego}$ the y-coordinate of the own vehicle at time t, and $y_t^{goal}$ the y-coordinate of the target point at time t; and $v_t^{ego}$ denotes the speed of the own vehicle at time t.
8. An automatic driving transverse and longitudinal integrated decision system based on deep reinforcement learning, characterized in that an intelligent decision network is trained in advance, the intelligent decision network comprising a deep learning model and a reinforcement learning decision model; the system comprises:

a collection module, used for collecting the states of the autonomous vehicle and the surrounding environment in real time;

an acquisition module, used for taking the current states of the autonomous vehicle and the surrounding environment as the input of the deep learning model to acquire an action space;

a selection module, used for selecting and executing the action with the highest evaluation according to the reinforcement learning decision model, with the goal of maximizing the expected reward;

and a repetition module, used for repeatedly invoking the collection module, the acquisition module and the selection module until the target point is reached.
9. A storage medium storing a computer program for automatic driving transverse and longitudinal integrated decision-making based on deep reinforcement learning, wherein the computer program causes a computer to execute the automated driving intelligent decision method according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the automated driving intelligent decision method according to any one of claims 1 to 7.
CN202310332965.XA 2023-03-30 2023-03-30 Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning Pending CN116476863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310332965.XA CN116476863A (en) 2023-03-30 2023-03-30 Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310332965.XA CN116476863A (en) 2023-03-30 2023-03-30 Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116476863A true CN116476863A (en) 2023-07-25

Family

ID=87220429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310332965.XA Pending CN116476863A (en) 2023-03-30 2023-03-30 Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116476863A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117172123A (en) * 2023-09-13 2023-12-05 江苏大块头智驾科技有限公司 Sensor data processing method and system for mine automatic driving
CN117172123B (en) * 2023-09-13 2024-03-08 江苏大块头智驾科技有限公司 Sensor data processing method and system for mine automatic driving
CN117208019A (en) * 2023-11-08 2023-12-12 北京理工大学前沿技术研究院 Longitudinal decision method and system under perceived occlusion based on value distribution reinforcement learning
CN117208019B (en) * 2023-11-08 2024-04-05 北京理工大学前沿技术研究院 Longitudinal decision method and system under perceived occlusion based on value distribution reinforcement learning

Similar Documents

Publication Publication Date Title
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
CN110834644B (en) Vehicle control method and device, vehicle to be controlled and storage medium
CN110297494B (en) Decision-making method and system for lane change of automatic driving vehicle based on rolling game
CN111026127B (en) Automatic driving decision method and system based on partially observable transfer reinforcement learning
CN113805572B (en) Method and device for motion planning
CN110646009B (en) DQN-based vehicle automatic driving path planning method and device
CN112937564B (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
CN116476863A (en) Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning
CN111696370A (en) Traffic light control method based on heuristic deep Q network
CN114341950A (en) Occupancy-prediction neural network
CN113044064B (en) Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN109991987A (en) Automatic Pilot decision-making technique and device
CN114881339A (en) Vehicle trajectory prediction method, system, computer device, and storage medium
CN112256037B (en) Control method and device applied to automatic driving, electronic equipment and medium
CN114239974B (en) Multi-agent position prediction method and device, electronic equipment and storage medium
CN113264064B (en) Automatic driving method for intersection scene and related equipment
Papathanasopoulou et al. Flexible car-following models incorporating information from adjacent lanes
Arbabi et al. Planning for autonomous driving via interaction-aware probabilistic action policies
CN115062202A (en) Method, device, equipment and storage medium for predicting driving behavior intention and track
CN115116240A (en) Lantern-free intersection vehicle cooperative control method and system
CN114527759A (en) End-to-end driving method based on layered reinforcement learning
CN114779764A (en) Vehicle reinforcement learning motion planning method based on driving risk analysis
CN117036966B (en) Learning method, device, equipment and storage medium for point feature in map
CN114212110B (en) Obstacle trajectory prediction method and device, electronic equipment and storage medium
CN114627640B (en) Dynamic evolution method of intelligent network-connected automobile driving strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination