CN112861269B - Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction - Google Patents
- Publication number
- CN112861269B · CN202110267799.0A
- Authority
- CN
- China
- Prior art keywords
- vehicle
- state
- value
- neural network
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/10—Geometric CAD
- G06F30/15—Vehicle, aircraft or watercraft design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/10—Geometric CAD
- G06F30/17—Mechanical parametric or variational design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/14—Force analysis or force optimisation, e.g. static or dynamic forces
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an automobile longitudinal multi-state control method based on deep reinforcement learning with preferential extraction, which comprises the following steps: 1, defining a state parameter set s and a control parameter set a for driving the automobile; 2, initializing the deep reinforcement learning parameters and constructing a deep neural network; 3, defining the deep reinforcement learning reward function and the priority extraction rule; 4, training the deep neural network to obtain an optimal network model; 5, obtaining the state parameter s_t of the automobile at time t, inputting it into the optimal network model to obtain the output a_t, and having the automobile execute it. By combining a priority extraction algorithm with a deep reinforcement learning control method, the invention accomplishes longitudinal multi-state driving of the automobile, ensuring higher safety during driving and reducing the occurrence of traffic accidents.
Description
Technical Field
The invention relates to the technical field of intelligent automobile longitudinal multi-state control, in particular to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction.
Background
With the rapid development of urban economies and the continuous improvement of living standards, the number of motor vehicles in cities has increased sharply. The automobile has become an indispensable means of transportation, but alongside speed and convenience it brings a series of safety problems. Owing to the limited skill of drivers and other uncontrollable external factors, collisions between two or more vehicles frequently occur on the road, causing loss of life and property and seriously impeding traffic flow. With the continuous development of automobile technology, many manufacturers have introduced adaptive cruise control systems, emergency braking systems, and the like. An adaptive cruise control system obtains data about the road ahead from sensors such as radar and, according to a corresponding algorithm, maintains a certain distance from the preceding vehicle at a certain speed; however, it can usually be engaged only above a relatively high speed, such as 25 km/h, and the driver must take manual control below that speed. An emergency braking system is a technology that brakes actively to avoid an accident when the vehicle is travelling outside the adaptive cruise state and encounters an emergency ahead, such as a sudden stop of the preceding vehicle or a pedestrian stepping out; however, because of sensor misjudgment, environmental error, and related causes, it cannot be applied in all driving environments and may itself lead to dangerous accidents.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an automobile longitudinal multi-state control method based on deep reinforcement learning with priority extraction. By combining a priority extraction algorithm with a deep reinforcement learning control method, longitudinal multi-state driving of the automobile is accomplished, the safety of the automobile during driving is improved, and the occurrence of traffic accidents is reduced.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which is characterized by comprising the following steps of:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
Step 2: acquiring automobile running data in a real driving scene as initialization data, the automobile running data being the initial state information of the vehicle and the initial control parameter information of the vehicle;
and step 3: defining a set of state information s ═ s of the vehicle 0 ,s 1 ,···s t ,···,s n },s 0 Information indicating the initial state of the vehicle, s t Indicating that the vehicle is in state s t-1 I.e. the control action a is executed at time t-1 t-1 The state reached thereafter, and has s t ={Ax t ,e t ,Ve t In which Ax is t Representing the longitudinal acceleration of the vehicle at time t, e t Representing the difference, Ve, between the speed of the vehicle and the relative distance between the two vehicles before time t t The difference value between the self vehicle speed and the front vehicle speed at the moment t is represented;
Defining the control parameter set a = {a_0, a_1, ···, a_t, ···, a_n} of the vehicle, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e. at time t; further, a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the master cylinder pressure of the vehicle at time t; t = 1, 2, ···, c, where c denotes the total training duration;
Step 4: initializing the parameters, including the time t, the greedy probability ε-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
Step 5: constructing a deep neural network and randomly initializing its parameters: weights w and biases b;
The deep neural network comprises an input layer, a hidden layer, and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, applies the activation function Relu to the state information from the input layer, and passes the result to the output layer; the output layer comprises k neurons and outputs the action value function:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)   (1)
In equation (1), w_1, b_1 are the weights and biases of the hidden layer, w_2, b_2 are the weights and biases of the output layer, and Q_e, the output value of the output layer, is the current Q value of all actions obtained by the deep neural network;
step 6: define reward functions for deep reinforcement learning:
in the formulae (2) and (3), r h The bonus value r is the bonus value in the high-speed state of the vehicle l The method comprises the following steps that (1) the reward value is in a low-speed state of a vehicle, dis is the relative distance between the vehicle and a front vehicle, Vf is the speed of the front vehicle, x represents the lower limit of the relative distance, y represents the upper limit of the relative distance, mid represents the switching threshold value of a reward function relative to the relative distance, lim represents the switching threshold value of the reward function relative to the difference value between the speed of the vehicle and the speed of the front vehicle, z represents the switching threshold value of the reward function relative to the speed of the front vehicle, and u represents the lower limit of the speed of the front vehicle;
and 7: defining an experience pool priority extraction rule;
for the current Q value Q stored in the experience pool e And a target Q value Q t Making difference, and using the difference value to perform priority ordering on various parameter forms stored in the experience pool according to SumTree algorithm, obtaining ordered parameter forms and extracting the ordered parameter formsTaking a parameter form of a previous bs strip;
the weight ISW in the form of the extracted pre-bs parameter is obtained using equation (4):
in the formula (4), p k The priority value is in the form of any kth parameter, min (p) is the minimum value of the priority in the form of the extracted parameters of the previous bs, and beta is a weight increase coefficient, and the value of the weight increase coefficient gradually converges from 0 to 1 along with the increase of the extraction times;
and 8: defining a greedy strategy;
generating a random number eta between 0 and 1, judging whether eta is less than or equal to epsilon-greedy, if yes, selecting Q e The action corresponding to the medium and maximum Q value is a vehicle execution action, otherwise, one action is randomly selected as the vehicle execution action;
and step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment;
state s at time t t Obtaining all action value functions through the deep neural network, and selecting action a by utilizing a greedy strategy t Then executed by the vehicle;
state s of the vehicle at time t t Lower execution action a t Obtaining the state parameter s at the moment of t +1 t+1 And a prize value r at time t t Each parameter is expressed in a parameter form s t ,a t ,r t ,s t+1 Storing the data into an experience pool D;
step 10: constructing a target neural network with the same structure as the deep neural network;
bs tuples are obtained from the experience pool D using the priority extraction rule, and the state s_{t+1} at time t+1 is input into the target neural network:
Q_ne = Relu(Relu(s_{t+1} × w_1′ + b_1′) × w_2′ + b_2′)   (5)
In equation (5), Q_ne, the output value of the output layer of the target neural network, is the Q value of all actions obtained by the target neural network; w_1′, w_2′ are the weights of the hidden layer and output layer of the target neural network, respectively, and b_1′, b_2′ are the corresponding biases;
step 11: establishing a target Q value Q t ;
The probability distribution pi (a | s) of the action a performed in the state s is defined by equation (6):
π(a|s)=P(a t =a|s t =s) (6)
in the formula (6), p represents a conditional probability;
The state value function v_π(s) is obtained using equation (7):
v_π(s) = E_π(r_t + γ·r_{t+1} + γ²·r_{t+2} + ··· | s_t = s)   (7)
In equation (7), γ is the reward attenuation factor and E_π denotes an expectation;
The probability P_{ss′}^a of transferring to the next state s′ after executing action a_t at time t is obtained by equation (8):
P_{ss′}^a = P(s_{t+1} = s′ | s_t = s, a_t = a)   (8)
The action value function q_π(s, a) is obtained using equation (9):
q_π(s, a) = R_s^a + γ·Σ_{s′} P_{ss′}^a · v_π(s′)   (9)
In equation (9), R_s^a denotes the reward value obtained after the vehicle performs action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by equation (10):
Q_t = r_t + γ·max(Q_ne)   (10)
Step 12: the loss function loss is constructed using equation (11):
loss=ISW×(Q t -Q e ) 2 (11)
carrying out a gradient descent method on the loss function loss so as to update the deep neural network parameter w 1 、w 2 、b 1 、b 2 ;
Updating the parameter w of the target neural network with an update frequency rt 1 ′、w 2 ′、b 1 ′、b 2 ', and update values are taken from the deep neural network;
step 13: assigning t +1 to t, judging whether t is less than or equal to c, if so, returning to the step 9 to continue training, otherwise, judging whether the loss value gradually decreases and tends to converge, if so, indicating that a trained deep neural network is obtained, otherwise, making t equal to c +1, increasing the network iteration times, and returning to the step 9 to execute;
step 14: and inputting the real-time state parameter information of the vehicle into the trained deep neural network to obtain an output action, so that corresponding actions are executed on the vehicle to complete longitudinal multi-state control.
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with traditional longitudinal control methods, the control method of the invention offers better smoothness under different working conditions and better stability under extreme conditions, and is suitable for multi-state control of the automobile at high, medium, and low speeds;
2. The deep reinforcement learning of the invention uses a trained deep neural network, so the corresponding action can be executed simply by inputting the automobile's state information; compared with complex traditional automobile control this is simpler and faster, with comparatively good control performance;
3. Compared with ordinary reinforcement learning, the deep reinforcement learning of the invention processes the input state parameters with a neural network rather than large lookup tables, greatly saving memory; neural network training is also more efficient and converges better than ordinary iterative methods;
4. Whereas switching between traditional multi-state control methods is cumbersome, the data priority extraction method of the invention ranks the data in the experience pool by priority and integrates the parameter information of the automobile in its various states, greatly shortening training time; multi-state control of the automobile is unified, no complicated switching between control methods is needed, and the control effect is better.
Detailed Description
In this embodiment, the automobile longitudinal multi-state control method based on deep reinforcement learning with preferential extraction decides the throttle opening and master cylinder pressure of the automobile at each moment from its real-time state parameters, thereby accomplishing car-following and adaptive cruise in the high-speed state, emergency braking in the medium-speed state, and start-stop in the low-speed state. It proceeds according to the following steps:
step 1: establishing a vehicle dynamic model and a vehicle running environment model by utilizing carsim software;
step 2: acquiring automobile driving data in a real driving scene and taking the automobile driving data as initialization data, wherein the automobile driving data is initial state information of a vehicle and initial control parameter information of the vehicle;
and 3, step 3: defining a set of state information s ═ s for a vehicle 0 ,s 1 ,···s t ,···,s n },s 0 Information indicating the initial state of the vehicle, s t Indicating that the vehicle is in state s t-1 I.e. control action a is performed at time t-1 t-1 The state reached thereafter, and has s t ={Ax t ,e t ,Ve t In which Ax is t Represents the longitudinal acceleration of the vehicle at time t, in m/s 2 ,e t Representing the difference, Ve, between the speed of the vehicle and the relative distance between the two vehicles before time t t The difference value between the self vehicle speed and the front vehicle speed at the moment t is represented;
defining a control parameter set a ═ { a) of a vehicle 0 ,a 1 ,···,a t ,···,a n },a 0 Initial control parameter information indicative of a vehicle, a t Indicating that the vehicle is in state s t I.e. the action performed by the vehicle at time t, and has a t ={T t ,B t In which T is t Representing the throttle opening at time t of the vehicle, B t The unit of master cylinder pressure of the vehicle at the time t is Mpa, t is 1,2, c and c represents the total training time;
Step 4: initializing the parameters, including the time t, the greedy probability ε-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
Step 5: constructing a deep neural network and randomly initializing its parameters: weights w and biases b;
The deep neural network comprises an input layer, a hidden layer, and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, applies the activation function Relu to the state information from the input layer, and passes the result to the output layer; the output layer comprises k neurons and outputs the action value function;
For the hidden layer:
l = Relu((s_t × w_1) + b_1)   (1)
In equation (1), w_1, b_1 are the weights and biases of the hidden layer;
For the output layer:
out = Relu((l × w_2) + b_2)   (2)
In equation (2), w_2, b_2 are the weights and biases of the output layer;
Combining equations (1) and (2):
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)   (3)
In equation (3), Q_e, the output value of the output layer, is the current Q value of all actions obtained through the deep neural network;
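As a minimal sketch (not part of the patent; the layer sizes m, n, k, the weight initialization, and the example state values are all chosen purely for illustration), the forward pass of equations (1)-(3) can be written as:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def q_forward(s_t, w1, b1, w2, b2):
    """Equations (1)-(3): hidden layer l = Relu(s_t*w1 + b1),
    output Q_e = Relu(l*w2 + b2) -> current Q value of every action."""
    l = relu(s_t @ w1 + b1)      # hidden layer, n neurons
    return relu(l @ w2 + b2)     # output layer, k action values

# Illustrative sizes (the patent does not fix m, n, k):
m, n, k = 3, 16, 5               # state dim {Ax_t, e_t, Ve_t}, hidden, actions
rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((m, n)) * 0.1, np.zeros(n)
w2, b2 = rng.standard_normal((n, k)) * 0.1, np.zeros(k)

s_t = np.array([0.5, 1.2, -0.3])  # example state (Ax_t, e_t, Ve_t)
q_e = q_forward(s_t, w1, b1, w2, b2)
```

Note that the Relu on the output layer, as written in equation (3), clips negative Q values to zero; many DQN implementations use a linear output layer instead.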
Step 6: defining the deep reinforcement learning reward function. The design of the reward function is an essential component of a deep reinforcement learning algorithm: the updating and convergence of the neural network weights and biases depend on its quality. The reward function is defined as follows:
In equations (4) and (5), r_h is the reward value in the high-speed state of the vehicle and r_l is the reward value in the low-speed state; the condition distinguishing the two is whether the vehicle speed reaches 25 km/h. If the speed reaches or exceeds 25 km/h, high-speed control of the vehicle is performed, completing car-following and adaptive cruise; if the speed is below 25 km/h, medium/low-speed control is performed, completing emergency braking and start-stop operations. dis is the relative distance between the ego vehicle and the preceding vehicle, in m; Vf is the speed of the preceding vehicle, in km/h; x is the lower limit of the relative distance, in m; y is the upper limit of the relative distance, in m; mid is the switching threshold of the reward function with respect to the relative distance, in m; lim is the switching threshold of the reward function with respect to the difference between the ego-vehicle speed and the preceding-vehicle speed, in km/h; z is the switching threshold of the reward function with respect to the preceding-vehicle speed, in km/h; u is the lower limit of the preceding-vehicle speed, in km/h;
and 7: defining an experience pool priority extraction rule;
Under normal conditions the vehicle rarely encounters states that carry a large reward value; the reward values of the other states are very small, contribute little to iterating the neural network parameters, and are hardly worth learning. In an environment with only a small number of high-reward states, learning time therefore increases greatly and the results are poor;
Using the experience pool priority extraction method, this small number of state samples that are worth learning can be given due weight;
Specifically, when the current and target state parameters are stored in the experience pool, the difference between the stored current Q value Q_e and target Q value Q_t is computed, and this difference is used to rank all parameter tuples stored in the experience pool by priority according to the SumTree algorithm; the first bs tuples of the ranked list are then extracted;
The weights ISW of the extracted first bs tuples are obtained using equation (6):
ISW = (p_k / min(p))^(−β)   (6)
In equation (6), p_k is the priority value of any k-th tuple, min(p) is the minimum priority among the extracted first bs tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
Using the experience pool priority extraction method effectively avoids ineffective training, greatly shortens training time, and yields a better training result;
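The priority extraction of step 7 can be sketched as follows. This is a simplification, not the patent's implementation: it ranks transitions by a plain sort on the |Q_t − Q_e| difference instead of sampling through a SumTree, and the importance-sampling weight formula ISW_k = (p_k / min(p))^(−β) is reconstructed from the variable descriptions, since the equation image is not reproduced in the source text.

```python
def prioritized_sample(buffer, td_errors, bs, beta):
    """Rank transitions by the |Q_t - Q_e| difference (their priority) and
    take the bs highest-priority entries, together with importance-sampling
    weights ISW_k = (p_k / min(p)) ** (-beta)."""
    order = sorted(range(len(buffer)), key=lambda i: td_errors[i], reverse=True)
    picked = order[:bs]                              # first bs tuples
    p_min = min(td_errors[i] for i in picked)        # min(p) over the batch
    isw = [(td_errors[i] / p_min) ** (-beta) for i in picked]
    return [buffer[i] for i in picked], isw

# Toy buffer of 6 transitions with illustrative priorities:
buffer = [("s%d" % i,) for i in range(6)]
td = [0.5, 2.0, 0.1, 1.0, 4.0, 0.25]
batch, isw = prioritized_sample(buffer, td, bs=3, beta=0.4)
```

The highest-priority transition receives the smallest weight, which is the usual correction for the sampling bias that priority extraction introduces.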
and step 8: defining a greedy strategy;
generating a random number eta between 0 and 1, judging whether eta is less than or equal to epsilon-greedy, if yes, selecting Q e The action corresponding to the medium and maximum Q value is a vehicle execution action, otherwise, one action is randomly selected as the vehicle execution action;
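The greedy rule of step 8 can be sketched as below (the Q values are illustrative only). Note the patent's convention: ε-greedy here is the probability of *exploiting* the maximum-Q action, with the random action taken otherwise, which is the reverse of the convention in much of the RL literature.

```python
import random

def select_action(q_values, eps_greedy, rng=random):
    """Patent's rule: draw eta in [0, 1); if eta <= eps_greedy take the
    argmax of Q_e (exploit), otherwise pick a random action (explore)."""
    eta = rng.random()
    if eta <= eps_greedy:
        return max(range(len(q_values)), key=lambda i: q_values[i])
    return rng.randrange(len(q_values))

q_e = [0.1, 0.7, 0.3]                         # illustrative action values
a_greedy = select_action(q_e, eps_greedy=1.0) # always exploits -> action 1
```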
and step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment, and processing data correlation and non-static distribution problems by deep reinforcement learning through the aid of experience pool playback;
state s at time t t Obtaining all action value functions through a deep neural network, and selecting an action a by using a greedy strategy t Then executed by the vehicle;
state s of the vehicle at time t t Lower execution action a t Obtaining the state parameter s at the moment of t +1 t+1 And a prize value r at time t t Each parameter is expressed in a parameter form s t ,a t ,r t ,s t+1 Storing the data into an experience pool D;
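The experience pool D of step 9 can be sketched as a bounded buffer of (s_t, a_t, r_t, s_{t+1}) tuples. The eviction-of-oldest policy and the example transitions are illustrative assumptions; the patent only specifies a pool of size ms.

```python
from collections import deque

class ExperiencePool:
    """Experience pool D: stores (s_t, a_t, r_t, s_next) tuples, discarding
    the oldest entry once the pool size ms is exceeded."""
    def __init__(self, ms):
        self.data = deque(maxlen=ms)

    def store(self, s_t, a_t, r_t, s_next):
        self.data.append((s_t, a_t, r_t, s_next))

    def __len__(self):
        return len(self.data)

D = ExperiencePool(ms=2)
D.store((0.5, 1.2, -0.3), 0, 1.0, (0.4, 1.0, -0.2))
D.store((0.4, 1.0, -0.2), 1, 0.5, (0.3, 0.8, -0.1))
D.store((0.3, 0.8, -0.1), 2, 2.0, (0.2, 0.6, 0.0))  # evicts the oldest tuple
```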
step 10: constructing a target neural network with the same structure as the deep neural network;
bs tuples are obtained from the experience pool D using the priority extraction rule, and the state s_{t+1} at time t+1 is input into the target neural network:
Q_ne = Relu(Relu(s_{t+1} × w_1′ + b_1′) × w_2′ + b_2′)   (7)
In equation (7), Q_ne, the output value of the output layer of the target neural network, is the Q value of all actions obtained by the target neural network; w_1′, w_2′ are the weights of the hidden layer and output layer of the target neural network, respectively, and b_1′, b_2′ are the corresponding biases;
step 11: establishing a target Q value Q t ;
The action of the vehicle in a certain state is uncertain, and a relevant conditional probability is needed to select the determined action, wherein the conditional probability is defined as follows:
π(a|s)=P(a t =a|s t =s) (8)
in equation (8), pi (a | s) represents a probability distribution of an action a performed by the vehicle in a state s, and p represents a conditional probability;
The state value function v_π(s) is obtained using equation (9):
v_π(s) = E_π(r_t + γ·r_{t+1} + γ²·r_{t+2} + ··· | s_t = s)   (9)
In equation (9), E_π denotes an expectation and γ is the reward attenuation factor, taking a value between 0 and 1. When γ = 0, v_π(s) = E_π(r_t | s_t = s), and the state value function is determined solely by the reward of the current state, independent of subsequent states; when γ = 1, v_π(s) = E_π(r_t + r_{t+1} + r_{t+2} + ··· | s_t = s), and the state value function is determined by the rewards of the current and all subsequent states. As γ tends to 0, the current reward is emphasized; as γ tends to 1, subsequent rewards are weighted more heavily;
The probability P_{ss′}^a of transferring to the next state s′ after executing action a_t at time t is obtained by equation (10):
P_{ss′}^a = P(s_{t+1} = s′ | s_t = s, a_t = a)   (10)
The action value function q_π(s, a) is obtained using equation (11):
q_π(s, a) = R_s^a + γ·Σ_{s′} P_{ss′}^a · v_π(s′)   (11)
In equation (11), R_s^a denotes the reward value obtained after the vehicle performs action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by equation (12):
Q_t = r_t + γ·max(Q_ne)   (12)
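Equation (12) is a one-liner; the numbers below are illustrative only:

```python
def target_q(r_t, gamma, q_ne):
    """Equation (12): Q_t = r_t + gamma * max(Q_ne), where Q_ne holds the
    target network's action values for state s_{t+1}."""
    return r_t + gamma * max(q_ne)

q_ne = [0.2, 1.5, 0.7]                          # target-net outputs for s_{t+1}
q_t = target_q(r_t=1.0, gamma=0.9, q_ne=q_ne)   # 1.0 + 0.9 * 1.5
```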
Step 12: the loss function loss is constructed using equation (13):
loss=ISW×(Q t -Q e ) 2 (13)
a gradient descent method is carried out on the loss function loss so as to update the parameter w of the deep neural network 1 、w 2 、b 1 、b 2 ;
Updating the parameter w of the target neural network with an update frequency rt 1 ′、w 2 ′、b 1 ′、b 2 ', and the update value is taken from the deep neural network;
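The weighted loss of equation (13) and the periodic target-network copy of step 12 can be sketched as follows. Averaging the weighted squared errors over the batch and representing the parameters as a dict are illustrative choices, not specified by the patent.

```python
def per_loss(isw, q_t, q_e):
    """Equation (13), ISW * (Q_t - Q_e)^2, averaged over the sampled batch."""
    return sum(w * (t - e) ** 2 for w, t, e in zip(isw, q_t, q_e)) / len(isw)

def maybe_sync_target(step, rt, online_params, target_params):
    """Every rt steps, copy the deep (online) network parameters into the
    target network, per the patent's update frequency rt."""
    if step % rt == 0:
        target_params.update(online_params)
    return target_params

loss = per_loss(isw=[1.0, 0.5], q_t=[2.0, 1.0], q_e=[1.0, 1.0])
params = maybe_sync_target(step=10, rt=5,
                           online_params={"w1": 3}, target_params={"w1": 0})
```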
step 13: assigning t +1 to t, judging whether t is less than or equal to c, if so, returning to the step 9 to continue training, otherwise, judging whether the loss value gradually decreases and tends to converge, if so, indicating that a trained deep neural network is obtained, otherwise, making t equal to c +1, increasing the network iteration times, and returning to the step 9 to execute;
step 14: and inputting the real-time state parameter information of the vehicle into the trained deep neural network to obtain an output action, and executing corresponding actions on the vehicle to finish longitudinal high, medium and low speed multi-state control.
Claims (1)
1. A longitudinal multi-state control method of an automobile based on deep reinforcement learning preferential extraction is characterized by comprising the following steps:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
step 2: acquiring automobile running data in a real driving scene as initialization data, wherein the automobile running data is initial state information of a vehicle and initial control parameter information of the vehicle;
and step 3: defining a set of state information s ═ s for a vehicle 0 ,s 1 ,···s t ,···,s n },s 0 Indicating initial state information of the vehicle, s t Indicating that the vehicle is in state s t-1 I.e. the control action a is executed at time t-1 t-1 The state reached thereafter, and has s t ={Ax t ,e t ,Ve t In which Ax is t Representing the longitudinal acceleration of the vehicle at time t, e t Representing the difference, Ve, between the speed of the vehicle and the relative distance between the two vehicles before time t t The difference value between the self vehicle speed and the front vehicle speed at the moment t is represented;
defining a control parameter set a ═ { a) of a vehicle 0 ,a 1 ,···,a t ,···,a n },a 0 Initial control parameter information representing a vehicle, a t Indicating that the vehicle is in state s t I.e. performed by the vehicle at time tIs actuated and has a t ={T t ,B t In which T is t Representing the throttle opening at time t of the vehicle, B t The master cylinder pressure of the vehicle at the time t is represented, and t is 1,2, c and c represents the total training time;
Step 4: initializing the parameters, including the time t, the greedy probability ε-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
Step 5: constructing a deep neural network and randomly initializing its parameters: weights w and biases b;
The deep neural network comprises an input layer, a hidden layer, and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, applies the activation function Relu to the state information from the input layer, and passes the result to the output layer; the output layer comprises k neurons and outputs the action value function:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)   (1)
In equation (1), w_1, b_1 are the weights and biases of the hidden layer, w_2, b_2 are the weights and biases of the output layer, and Q_e, the output value of the output layer, is the current Q value of all actions obtained by the deep neural network;
Step 6: defining the reward function for deep reinforcement learning:
In equations (2) and (3), r_h is the reward value in the high-speed state of the vehicle and r_l is the reward value in the low-speed state; dis is the relative distance between the ego vehicle and the preceding vehicle; Vf is the speed of the preceding vehicle; x is the lower limit of the relative distance and y its upper limit; mid is the switching threshold of the reward function with respect to the relative distance; lim is the switching threshold of the reward function with respect to the difference between the ego-vehicle speed and the preceding-vehicle speed; z is the switching threshold of the reward function with respect to the preceding-vehicle speed; u is the lower limit of the preceding-vehicle speed;
and 7: defining an experience pool priority extraction rule;
for the current Q value Q stored in the experience pool e And a target Q value Q t Making a difference, and carrying out priority sequencing on all parameter forms stored in the experience pool by using the difference value according to a SumTree algorithm to obtain a sequenced parameter form and extracting a front bs parameter form from the sequenced parameter form;
the weight ISW in the form of the extracted pre-bs parameter is obtained using equation (4):
in the formula (4), p k The priority value is in the form of any kth parameter, min (p) is the minimum value of the priority in the form of the extracted parameters of the previous bs, and beta is a weight increase coefficient, and the value of the weight increase coefficient gradually converges from 0 to 1 along with the increase of the extraction times;
and step 8: defining a greedy strategy;
generating a random number eta between 0 and 1, judging whether eta is less than or equal to epsilon-greedy, if yes, selecting Q e The action corresponding to the medium and maximum Q value is a vehicle execution action, otherwise, one action is randomly selected as the vehicle execution action;
Step 9: creating an experience pool D for storing the state, action, and reward information of the vehicle at each moment;
The state s_t at time t is passed through the deep neural network to obtain all action value functions, and the greedy strategy is used to select the action a_t, which is then executed by the vehicle;
After the vehicle executes action a_t in state s_t at time t, the state parameter s_{t+1} at time t+1 and the reward value r_t at time t are obtained, and the parameters are stored in the experience pool D as the tuple (s_t, a_t, r_t, s_{t+1});
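A minimal sketch of the experience pool D as a fixed-capacity buffer of (s_t, a_t, r_t, s_{t+1}) tuples (the class name and capacity value are illustrative, not specified by the method):

```python
from collections import deque

class ExperiencePool:
    """Experience pool D holding (s_t, a_t, r_t, s_next) tuples;
    oldest entries are discarded once capacity is reached."""
    def __init__(self, capacity=10000):
        self.data = deque(maxlen=capacity)

    def store(self, s_t, a_t, r_t, s_next):
        self.data.append((s_t, a_t, r_t, s_next))

pool = ExperiencePool()
# illustrative transition: [relative distance, speed diff, front speed]
pool.store([20.0, 3.5, 18.0], 2, 0.7, [20.3, 3.4, 18.1])
```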
Step 10: constructing a target neural network with the same structure as the deep neural network;
bs parameter tuples are obtained from the experience pool D using the priority extraction rule, and the state s_{t+1} at time t+1 is input into the target neural network, giving:
Q_ne = Relu(Relu(s_{t+1} × w_1′ + b_1′) × w_2′ + b_2′) (5)
In equation (5), Q_ne is the output of the target neural network's output layer, i.e. the Q values of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
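The forward pass of equation (5) can be sketched with NumPy; the layer sizes and random weights below are purely illustrative, not values from the method:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def target_forward(s_next, w1p, b1p, w2p, b2p):
    """Equation (5): Q_ne = Relu(Relu(s_{t+1} w1' + b1') w2' + b2')."""
    return relu(relu(s_next @ w1p + b1p) @ w2p + b2p)

# illustrative dimensions: 3 state inputs, 4 hidden units, 2 actions
rng = np.random.default_rng(seed=0)
w1p, b1p = rng.standard_normal((3, 4)), np.zeros(4)
w2p, b2p = rng.standard_normal((4, 2)), np.zeros(2)
q_ne = target_forward(np.array([25.0, 2.0, 24.0]), w1p, b1p, w2p, b2p)
```

Note that equation (5) applies Relu on the output layer as well, so the Q values produced by this sketch are non-negative.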
Step 11: establishing the target Q value Q_t;
The probability distribution π(a|s) of performing action a in state s is defined by equation (6):
π(a|s) = P(a_t = a | s_t = s) (6)
In equation (6), P denotes a conditional probability;
The state value function v_π(s) is obtained using equation (7):
v_π(s) = E_π(r_t + γr_{t+1} + γ^2 r_{t+2} + ··· | s_t = s) (7)
In equation (7), γ is the reward attenuation factor and E_π denotes the expectation;
obtaining the execution of action a at time t by equation (8) t Probability of going to the next state s
Obtaining an action cost function q by using the formula (9) π (s,a):
In the formula (9), the reaction mixture is,representing the reward value, v, of the vehicle after performing action a in state s π (s ') represents a state cost function for the vehicle at state s';
The target Q value Q_t is obtained by equation (10):
Q_t = r_t + γ max(Q_ne) (10)
Step 12: constructing the loss function loss using equation (11):
loss = ISW × (Q_t − Q_e)^2 (11)
Gradient descent is applied to the loss function loss to update the deep neural network parameters w_1, w_2, b_1, and b_2;
The target neural network parameters w_1′, w_2′, b_1′, and b_2′ are updated at an update frequency rt, with the updated values taken from the deep neural network;
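Equations (10) and (11) and the periodic target-network update can be sketched as follows; the batch shapes and the parameter-dictionary representation are assumptions of the sketch, not part of the method:

```python
import numpy as np

def td_target(r_t, q_ne, gamma):
    """Equation (10): Q_t = r_t + gamma * max(Q_ne), per batch row."""
    return r_t + gamma * np.max(q_ne, axis=1)

def per_loss(isw, q_t, q_e):
    """Equation (11): loss = ISW * (Q_t - Q_e)^2, averaged over the batch."""
    return float(np.mean(isw * (q_t - q_e) ** 2))

def sync_target(step, rt, online, target):
    """Copy the online-network parameters into the target network
    every rt steps (parameters held in plain dicts for the sketch)."""
    if step % rt == 0:
        for k in online:
            target[k] = online[k].copy()

q_ne = np.array([[1.0, 2.0], [3.0, 0.0]])   # target-net outputs for a batch of 2
q_t = td_target(np.array([0.5, 1.0]), q_ne, gamma=0.9)
loss = per_loss(np.ones(2), q_t, np.array([2.3, 3.7]))
```

In practice the gradient of this loss with respect to Q_e is backpropagated through w_1, w_2, b_1, b_2 by an autograd framework; the sketch only shows the scalar objective.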
Step 13: after assigning t+1 to t, judging whether t ≤ c; if so, returning to step 9 to continue training; otherwise, judging whether the loss value gradually decreases and tends to converge; if so, the trained deep neural network has been obtained; otherwise, setting t to c+1, increasing the number of network iterations, and returning to step 9;
Step 14: the real-time state parameter information of the vehicle is input into the trained deep neural network to obtain an output action, and the vehicle executes the corresponding action to complete longitudinal multi-state control.
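Once trained, the controller of step 14 reduces to one forward pass and an argmax; a sketch with placeholder weights (not trained values):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def control_action(state, w1, b1, w2, b2):
    """Step 14: forward the real-time state through the trained deep
    neural network and return the index of the highest-valued action."""
    q = relu(relu(np.asarray(state) @ w1 + b1) @ w2 + b2)
    return int(np.argmax(q))

# placeholder 2-input / 2-action weights chosen so action 1 dominates
a = control_action([1.0, 0.0], np.eye(2), np.zeros(2),
                   np.eye(2), np.array([0.0, 2.0]))
```

The returned index would then be mapped to the corresponding longitudinal command (e.g. a throttle or brake level) by the vehicle-side executor.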
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110267799.0A CN112861269B (en) | 2021-03-11 | 2021-03-11 | Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112861269A CN112861269A (en) | 2021-05-28 |
CN112861269B true CN112861269B (en) | 2022-08-30 |
Family
ID=75994127
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113734170B (en) * | 2021-08-19 | 2023-10-24 | 崔建勋 | Automatic driving lane change decision method based on deep Q learning |
CN113715842B (en) * | 2021-08-24 | 2023-02-03 | 华中科技大学 | High-speed moving vehicle control method based on imitation learning and reinforcement learning |
CN114527642B (en) * | 2022-03-03 | 2024-04-02 | 东北大学 | Method for automatically adjusting PID parameters by AGV based on deep reinforcement learning |
CN115303290B (en) * | 2022-10-09 | 2022-12-06 | 北京理工大学 | System key level switching method and system of vehicle hybrid key level system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110450771A (en) * | 2019-08-29 | 2019-11-15 | 合肥工业大学 | A kind of intelligent automobile stability control method based on deeply study |
CN110716550A (en) * | 2019-11-06 | 2020-01-21 | 南京理工大学 | Gear shifting strategy dynamic optimization method based on deep reinforcement learning |
CN110716562A (en) * | 2019-09-25 | 2020-01-21 | 南京航空航天大学 | Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning |
CN110850720A (en) * | 2019-11-26 | 2020-02-28 | 国网山东省电力公司电力科学研究院 | DQN algorithm-based area automatic power generation dynamic control method |
CN110969848A (en) * | 2019-11-26 | 2020-04-07 | 武汉理工大学 | Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes |
CN111605565A (en) * | 2020-05-08 | 2020-09-01 | 昆山小眼探索信息科技有限公司 | Automatic driving behavior decision method based on deep reinforcement learning |
CN111985614A (en) * | 2020-07-23 | 2020-11-24 | 中国科学院计算技术研究所 | Method, system and medium for constructing automatic driving decision system |
CN112162555A (en) * | 2020-09-23 | 2021-01-01 | 燕山大学 | Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet |
CN112406867A (en) * | 2020-11-19 | 2021-02-26 | 清华大学 | Emergency vehicle hybrid lane change decision method based on reinforcement learning and avoidance strategy |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111316295B (en) * | 2017-10-27 | 2023-09-22 | 渊慧科技有限公司 | Reinforcement learning using distributed prioritized playback |
US11688160B2 (en) * | 2018-01-17 | 2023-06-27 | Huawei Technologies Co., Ltd. | Method of generating training data for training a neural network, method of training a neural network and using neural network for autonomous operations |
US10845815B2 (en) * | 2018-07-27 | 2020-11-24 | GM Global Technology Operations LLC | Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents |
Non-Patent Citations (3)
Title |
---|
A review of communication, driver characteristics, and controls aspects of cooperative adaptive cruise control; Dey, K.C.; Li, Yan; IEEE Transactions on Intelligent Transportation Systems; Feb. 2016; Vol. 17, No. 2; pp. 491-509 *
Cooperative adaptive cruise control based on deep reinforcement learning; Wang Wensa, Liang Jun, Chen Long, Chen Xiaobo, Zhu Ning, Hua Guodong; Journal of Transport Information and Safety; Mar. 2019; Vol. 37, No. 3; pp. 93-100 *
Automatic parking control strategy based on deep reinforcement learning; Huang He, Guo Weifeng, Mei Weiwei, Zhang Run, Cheng Jin, Zhang Bingli; Proceedings of the 2020 Annual Congress of the China Society of Automotive Engineers; Dec. 2020; pp. 181-189 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112861269B (en) | Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction | |
CN111898211B (en) | Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof | |
CN107229973B (en) | Method and device for generating strategy network model for automatic vehicle driving | |
CN111845701B (en) | HEV energy management method based on deep reinforcement learning in car following environment | |
CN111845741B (en) | Automatic driving decision control method and system based on hierarchical reinforcement learning | |
CN106740846A (en) | A kind of electric automobile self-adapting cruise control method of double mode switching | |
CN111332362B (en) | Intelligent steer-by-wire control method integrating individual character of driver | |
EP3725627A1 (en) | Method and apparatus for generating vehicle control command, and vehicle controller and storage medium | |
CN110949398A (en) | Method for detecting abnormal driving behavior of first-vehicle drivers in vehicle formation driving | |
CN113276884B (en) | Intelligent vehicle interactive decision passing method and system with variable game mode | |
CN113954837B (en) | Deep learning-based lane change decision-making method for large-scale commercial vehicle | |
CN112668779A (en) | Preceding vehicle motion state prediction method based on self-adaptive Gaussian process | |
CN113096402B (en) | Dynamic speed limit control method, system, terminal and readable storage medium based on intelligent networked vehicle | |
CN109436085A (en) | A kind of wire-controlled steering system gearratio control method based on driving style | |
JP7415471B2 (en) | Driving evaluation device, driving evaluation system, in-vehicle device, external evaluation device, and driving evaluation program | |
CN113722835A (en) | Modeling method for anthropomorphic random lane change driving behavior | |
CN110879595A (en) | Unmanned mine card tracking control system and method based on deep reinforcement learning | |
CN114852105A (en) | Method and system for planning track change of automatic driving vehicle | |
CN112158045A (en) | Active suspension control method based on depth certainty strategy gradient | |
CN112542061B (en) | Lane borrowing and overtaking control method, device and system based on Internet of vehicles and storage medium | |
CN114030485A (en) | Automatic driving automobile man lane change decision planning method considering attachment coefficient | |
CN114074680A (en) | Vehicle lane change behavior decision method and system based on deep reinforcement learning | |
CN114148349B (en) | Vehicle personalized following control method based on generation of countermeasure imitation study | |
CN113033902B (en) | Automatic driving lane change track planning method based on improved deep learning | |
WO2023004698A1 (en) | Method for intelligent driving decision-making, vehicle movement control method, apparatus, and vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||