CN112861269B - Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction - Google Patents

Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Info

Publication number
CN112861269B
CN112861269B (application CN202110267799.0A)
Authority
CN
China
Prior art keywords
vehicle
state
value
neural network
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110267799.0A
Other languages
Chinese (zh)
Other versions
CN112861269A (en)
Inventor
黄鹤
吴润晨
张峰
王博文
于海涛
汤德江
张炳力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202110267799.0A priority Critical patent/CN112861269B/en
Publication of CN112861269A publication Critical patent/CN112861269A/en
Application granted granted Critical
Publication of CN112861269B publication Critical patent/CN112861269B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/15Vehicle, aircraft or watercraft design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/17Mechanical parametric or variational design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/14Force analysis or force optimisation, e.g. static or dynamic forces
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which comprises the following steps: 1) define the state parameter set s and the control parameter set a of the driving automobile; 2) initialize the deep reinforcement learning parameters and construct a deep neural network; 3) define the deep reinforcement learning reward function and the priority extraction rule; 4) train the deep neural network and obtain the optimal network model; 5) obtain the state parameter s_t of the automobile at time t, input it into the optimal network model to obtain the output a_t, and have the automobile execute it. By combining a priority extraction algorithm with a deep reinforcement learning control method, the invention accomplishes longitudinal multi-state driving of the automobile, ensuring higher safety during driving and reducing the occurrence of traffic accidents.

Description

Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
Technical Field
The invention relates to the technical field of intelligent automobile longitudinal multi-state control, in particular to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction.
Background
With the rapid development of the urban economy and the continuous improvement of living standards, the number of motor vehicles in cities has increased sharply, and the automobile has become an indispensable means of transportation. Along with speed and convenience, however, it has brought a series of safety problems. Because of limited driver skill or other uncontrollable external factors, collisions between two or more vehicles frequently occur on the road, causing loss of life and property and seriously obstructing traffic. With the continuous development of automotive technology, many manufacturers have introduced adaptive cruise systems, emergency braking systems and the like. An adaptive cruise system uses sensors such as radar to obtain data about the road ahead and, according to a corresponding algorithm, keeps a certain distance from the preceding vehicle while maintaining a certain speed; however, it is usually only enabled above a relatively high speed, for example above 25 km/h, and below that speed the driver must take over manually. An emergency braking system is a technology that actively brakes to avoid an accident when the vehicle is driving outside the adaptive cruise state and encounters an emergency ahead, such as the preceding vehicle stopping suddenly or a pedestrian appearing; however, because of sensor misjudgment, environmental errors and related causes, it cannot be applied to all driving environments and may still lead to dangerous accidents.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which combines a priority extraction algorithm with a deep reinforcement learning control method to achieve longitudinal multi-state driving, so that the automobile is safer during driving and traffic accidents are reduced.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which is characterized by comprising the following steps of:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
step 2: acquire automobile running data in a real driving scene as initialization data, where the running data consist of the initial state information of the vehicle and the initial control parameter information of the vehicle;
step 3: define the state information set of the vehicle s = {s_0, s_1, ···, s_t, ···, s_n}, where s_0 denotes the initial state information of the vehicle and s_t denotes the state reached after the vehicle executes control action a_{t−1} in state s_{t−1}, i.e., at time t−1; s_t = {Ax_t, e_t, Ve_t}, where Ax_t denotes the longitudinal acceleration of the vehicle at time t, e_t denotes the difference between the own-vehicle speed and the relative distance between the two vehicles at time t, and Ve_t denotes the difference between the own-vehicle speed and the preceding-vehicle speed at time t;
define the control parameter set of the vehicle a = {a_0, a_1, ···, a_t, ···, a_n}, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e., at time t; a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the master cylinder pressure of the vehicle at time t; t = 1, 2, ···, c, where c denotes the total training time;
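Since the Q-network in the later steps outputs one value per discrete action, the continuous commands T_t and B_t have to be discretized in practice. The sketch below (Python) shows one possible discretization; the grid values and the exclusion of simultaneous throttle and brake are illustrative assumptions, not values taken from the patent.

    # One possible discretization of the control set a_t = {T_t, B_t} for a DQN-style
    # controller. The grid values below are illustrative assumptions only.
    import itertools

    THROTTLE_LEVELS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # throttle opening T_t (fraction)
    BRAKE_LEVELS = [0.0, 1.0, 2.0, 4.0, 6.0]           # master cylinder pressure B_t (MPa)

    # Each discrete action index maps to one (T_t, B_t) pair; pressing throttle and brake
    # at the same time is excluded here.
    ACTIONS = [(th, br) for th, br in itertools.product(THROTTLE_LEVELS, BRAKE_LEVELS)
               if th == 0.0 or br == 0.0]

    def action_to_command(index):
        # Return the (throttle opening, master cylinder pressure) pair for an action index.
        return ACTIONS[index]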
step 4: initialize parameters, including the time t, the greedy probability ε-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted samples, and the reward attenuation factor γ;
step 5: construct a deep neural network and randomly initialize its parameters: weights w and biases b;
the deep neural network comprises an input layer, a hidden layer and an output layer; the input layer contains m neurons and receives the state s_t of the vehicle at time t, the hidden layer contains n neurons, processes the state information from the input layer with the ReLU activation function and passes it to the output layer, and the output layer contains k neurons and outputs the action value function:
Q_e = ReLU(ReLU(s_t × w_1 + b_1) × w_2 + b_2)   (1)
In formula (1), w_1 and b_1 are the weights and biases of the hidden layer, w_2 and b_2 are the weights and biases of the output layer, and Q_e, the output of the output layer, is the current Q value of all actions obtained by the deep neural network;
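A minimal NumPy sketch of the forward pass in formula (1); the layer sizes m, n, k and the random initialization are illustrative assumptions (m = 3 matches the three state components Ax_t, e_t, Ve_t).

    # Forward pass of formula (1): Q_e = ReLU(ReLU(s_t·w1 + b1)·w2 + b2).
    # Layer sizes and the random initialization below are assumptions, not patent values.
    import numpy as np

    m, n, k = 3, 64, 10            # input, hidden and output neurons (illustrative)
    rng = np.random.default_rng(0)
    w1, b1 = rng.normal(0.0, 0.1, (m, n)), np.zeros(n)
    w2, b2 = rng.normal(0.0, 0.1, (n, k)), np.zeros(k)

    def relu(x):
        return np.maximum(x, 0.0)

    def q_values(s_t):
        # s_t = [Ax_t, e_t, Ve_t]; returns the current Q values Q_e of all k actions.
        hidden = relu(s_t @ w1 + b1)    # hidden layer (inner ReLU of formula (1))
        return relu(hidden @ w2 + b2)   # output layer Q_e (outer ReLU of formula (1))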
step 6: define reward functions for deep reinforcement learning:
r_h = [piecewise reward function for the high-speed state; given as an image in the original publication]   (2)
r_l = [piecewise reward function for the low-speed state; given as an image in the original publication]   (3)
In formulas (2) and (3), r_h is the reward value in the high-speed state of the vehicle and r_l is the reward value in the low-speed state; dis is the relative distance between the own vehicle and the preceding vehicle; Vf is the speed of the preceding vehicle; x denotes the lower limit of the relative distance; y denotes the upper limit of the relative distance; mid denotes the switching threshold of the reward function with respect to the relative distance; lim denotes the switching threshold of the reward function with respect to the difference between the own-vehicle speed and the preceding-vehicle speed; z denotes the switching threshold of the reward function with respect to the preceding-vehicle speed; and u denotes the lower limit of the preceding-vehicle speed;
step 7: define the experience pool priority extraction rule;
Compute the difference between the current Q value Q_e and the target Q value Q_t stored in the experience pool, and use this difference to rank all parameter tuples stored in the experience pool by priority according to the SumTree algorithm, obtaining the ranked tuples and extracting the top bs of them;
The weight ISW of each of the extracted bs parameter tuples is obtained using formula (4):
ISW_k = (min(p) / p_k)^β   (4)
In formula (4), p_k is the priority value of any kth parameter tuple, min(p) is the minimum priority among the extracted bs tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
step 8: define a greedy strategy;
Generate a random number η between 0 and 1 and judge whether η ≤ ε-greedy; if so, select the action corresponding to the maximum Q value in Q_e as the action executed by the vehicle; otherwise, randomly select an action as the action executed by the vehicle;
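A minimal sketch of this greedy strategy; note that, as written above, η ≤ ε-greedy selects the maximum-Q action and a larger η triggers a random exploratory action.

    # ε-greedy action selection as described in step 8.
    import numpy as np

    _rng = np.random.default_rng()

    def select_action(q_e, epsilon_greedy):
        # q_e: array of current Q values of all actions (output of the deep neural network).
        eta = _rng.random()                      # random number η in [0, 1)
        if eta <= epsilon_greedy:
            return int(np.argmax(q_e))           # exploit: action with the maximum Q value
        return int(_rng.integers(len(q_e)))      # explore: uniformly random action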
step 9: create an experience pool D for storing the state, action and reward information of the vehicle at each moment;
The state s_t at time t is passed through the deep neural network to obtain the value function of all actions, and action a_t is selected with the greedy strategy and then executed by the vehicle;
After the vehicle executes action a_t in state s_t at time t, the state parameter s_{t+1} at time t+1 and the reward value r_t at time t are obtained, and the parameters are stored into the experience pool D as the tuple (s_t, a_t, r_t, s_{t+1});
step 10: constructing a target neural network with the same structure as the deep neural network;
Extract bs parameter tuples from the experience pool D using the priority extraction rule, and input the state s_{t+1} at time t+1 into the target neural network:
Q_ne = ReLU(ReLU(s_{t+1} × w_1′ + b_1′) × w_2′ + b_2′)   (5)
In formula (5), Q_ne, the output of the output layer of the target neural network, is the Q value of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
step 11: establish the target Q value Q_t;
The probability distribution π(a|s) of action a performed in state s is defined by formula (6):
π(a|s) = P(a_t = a | s_t = s)   (6)
In formula (6), P denotes a conditional probability;
The state value function v_π(s) is obtained using formula (7):
v_π(s) = E_π(r_t + γ·r_{t+1} + γ²·r_{t+2} + ··· | s_t = s)   (7)
In formula (7), γ is the reward attenuation factor and E_π denotes the expectation;
The probability P^a_{ss′} of transferring to the next state s′ when action a_t is executed at time t is obtained by formula (8):
P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a)   (8)
The action value function q_π(s, a) is obtained using formula (9):
q_π(s, a) = Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ·v_π(s′)]   (9)
In formula (9), R^a_{ss′} denotes the reward value of the vehicle after performing action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by formula (10):
Q_t = r_t + γ·max(Q_ne)   (10)
Step 12: the loss function loss is constructed using formula (11):
loss = ISW × (Q_t − Q_e)²   (11)
Gradient descent is applied to the loss function loss to update the deep neural network parameters w_1, w_2, b_1 and b_2;
The target neural network parameters w_1′, w_2′, b_1′ and b_2′ are updated at the update frequency rt, and the updated values are copied from the deep neural network;
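A condensed PyTorch-style sketch of steps 10-12: the target Q value of formula (10), the importance-weighted loss of formula (11), one gradient-descent step, and the periodic copy of parameters into the target network every rt steps. The network constructor, the plain-SGD optimizer and the batch layout are assumptions, not details given in the patent.

    # Sketch of steps 10-12 with two networks sharing the 2-layer ReLU structure of
    # formulas (1) and (5). Optimizer choice and layer sizes are assumptions.
    import torch
    import torch.nn as nn

    def make_net(m=3, n=64, k=10):
        return nn.Sequential(nn.Linear(m, n), nn.ReLU(), nn.Linear(n, k), nn.ReLU())

    q_net, target_net = make_net(), make_net()
    target_net.load_state_dict(q_net.state_dict())            # start with identical parameters
    optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)  # plain gradient descent

    def train_step(batch, isw, gamma, step, rt):
        s, a, r, s_next = batch                              # bs samples drawn by priority
        q_e = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # current Q of the taken actions
        with torch.no_grad():
            q_t = r + gamma * target_net(s_next).max(dim=1).values   # formula (10)
        loss = (isw * (q_t - q_e) ** 2).mean()               # formula (11), ISW-weighted
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % rt == 0:                                   # sync target network every rt steps
            target_net.load_state_dict(q_net.state_dict())
        return loss.item(), (q_t - q_e).abs().detach()       # |Q_t - Q_e| -> new priorities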
step 13: assign t+1 to t and judge whether t ≤ c; if so, return to step 9 to continue training; if not, judge whether the loss value gradually decreases and tends to converge; if it does, the trained deep neural network has been obtained; otherwise, with t = c + 1, increase the number of network iterations and return to step 9;
step 14: input the real-time state parameter information of the vehicle into the trained deep neural network to obtain the output action, and have the vehicle execute the corresponding action to complete longitudinal multi-state control.
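At run time, step 14 reduces to a single forward pass and an argmax. The sketch below reuses the hypothetical q_net and action_to_command objects from the earlier sketches.

    # Online use of the trained network (step 14); q_net and action_to_command are the
    # hypothetical objects defined in the earlier sketches.
    import torch

    def control(ax_t, e_t, ve_t):
        # Map the real-time state (Ax_t, e_t, Ve_t) to a (throttle, brake) command.
        s_t = torch.tensor([[ax_t, e_t, ve_t]], dtype=torch.float32)
        with torch.no_grad():
            a_t = int(q_net(s_t).argmax(dim=1))
        return action_to_command(a_t)            # (T_t, B_t) executed by the vehicle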
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with traditional automobile longitudinal control methods, the control method of the invention offers better control smoothness under different working conditions and better control stability under extreme working conditions, and is suitable for high-, medium- and low-speed multi-state control of the automobile;
2. The deep reinforcement learning of the invention uses a trained deep neural network, so that the corresponding action can be executed simply by inputting the state information of the automobile; compared with complex traditional automobile control it is simpler and faster, while the control effect remains relatively good;
3. Compared with ordinary reinforcement learning, the deep reinforcement learning of the invention processes the input state parameter information with a neural network instead of storing large tables of data, which greatly saves memory; the neural network training is also more efficient and converges better than ordinary iterative methods;
4. The invention adopts a data priority extraction method. In contrast to the harsh switching of traditional multi-state automobile control methods, it can prioritize the data in the experience pool and integrate the parameter information of the automobile in its various states, greatly shortening the training time; the multi-state control of the automobile is unified, no complicated switching between control methods is needed, and the control effect is better.
Detailed Description
In this embodiment, an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction decides the throttle opening and master cylinder pressure of the automobile at each moment from its real-time state parameters, thereby completing multi-state control: car-following and adaptive cruise in the high-speed state, emergency braking in the medium-speed state, and start-stop in the low-speed state. The method proceeds according to the following steps:
step 1: establish a vehicle dynamics model and a vehicle driving environment model using CarSim software;
step 2: acquiring automobile driving data in a real driving scene and taking the automobile driving data as initialization data, wherein the automobile driving data is initial state information of a vehicle and initial control parameter information of the vehicle;
step 3: define the state information set of the vehicle s = {s_0, s_1, ···, s_t, ···, s_n}, where s_0 denotes the initial state information of the vehicle and s_t denotes the state reached after the vehicle executes control action a_{t−1} in state s_{t−1}, i.e., at time t−1; s_t = {Ax_t, e_t, Ve_t}, where Ax_t denotes the longitudinal acceleration of the vehicle at time t, in m/s², e_t denotes the difference between the own-vehicle speed and the relative distance between the two vehicles at time t, and Ve_t denotes the difference between the own-vehicle speed and the preceding-vehicle speed at time t;
define the control parameter set of the vehicle a = {a_0, a_1, ···, a_t, ···, a_n}, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e., at time t; a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the master cylinder pressure of the vehicle at time t, in MPa; t = 1, 2, ···, c, where c denotes the total training time;
step 4: initialize parameters, including the time t, the greedy probability ε-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted samples, and the reward attenuation factor γ;
step 5: construct a deep neural network and randomly initialize its parameters: weights w and biases b;
the deep neural network comprises an input layer, a hidden layer and an output layer; the input layer contains m neurons and receives the state s_t of the vehicle at time t, the hidden layer contains n neurons, processes the state information from the input layer with the ReLU activation function and passes it to the output layer, and the output layer contains k neurons and outputs the action value function;
For the hidden layer:
l = ReLU((s_t × w_1) + b_1)   (1)
In formula (1), w_1 and b_1 are the weights and biases of the hidden layer;
For the output layer:
out = ReLU((l × w_2) + b_2)   (2)
In formula (2), w_2 and b_2 are the weights and biases of the output layer;
Combining formula (1) and formula (2) gives:
Q_e = ReLU(ReLU(s_t × w_1 + b_1) × w_2 + b_2)   (3)
In formula (3), Q_e, the output value of the output layer, is the current Q value of all actions obtained through the deep neural network;
step 6: define the deep reinforcement learning reward function. The design of the reward function is a key part of a deep reinforcement learning algorithm, since the updating and convergence of the network weights and biases depend on how well the reward function is designed. The reward function is defined as follows:
r_h = [piecewise reward function for the high-speed state; given as an image in the original publication]   (4)
r_l = [piecewise reward function for the low-speed state; given as an image in the original publication]   (5)
In formulas (4) and (5), r_h is the reward value in the high-speed state of the vehicle and r_l is the reward value in the low-speed state. The condition that separates them is whether the vehicle speed reaches 25 km/h: at or above 25 km/h the vehicle is under high-speed control and completes car-following and adaptive cruise; below 25 km/h it is under medium/low-speed control and completes emergency braking and start-stop operation. dis is the relative distance between the own vehicle and the preceding vehicle, in m; Vf is the speed of the preceding vehicle, in km/h; x is the lower limit of the relative distance, in m; y is the upper limit of the relative distance, in m; mid is the switching threshold of the reward function with respect to the relative distance, in m; lim is the switching threshold of the reward function with respect to the difference between the own-vehicle speed and the preceding-vehicle speed, in km/h; z is the switching threshold of the reward function with respect to the preceding-vehicle speed, in km/h; and u is the lower limit of the preceding-vehicle speed, in km/h;
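Because formulas (4) and (5) are reproduced only as images, the sketch below shows just the high/low-speed switching skeleton described in this step; reward_high and reward_low are hypothetical placeholders for the piecewise terms, which are not given here.

    # Skeleton of the reward switching described above. The actual piecewise expressions of
    # formulas (4) and (5) are not reproduced, so the two branch functions are placeholders.
    HIGH_SPEED_THRESHOLD = 25.0   # km/h, switching condition stated in the text

    def reward_high(dis, vf, ve):
        raise NotImplementedError   # placeholder for formula (4), r_h

    def reward_low(dis, vf, ve):
        raise NotImplementedError   # placeholder for formula (5), r_l

    def reward(v_ego, dis, vf, ve):
        # v_ego: own speed (km/h); dis: gap to the preceding vehicle (m);
        # vf: preceding-vehicle speed (km/h); ve: own/preceding speed difference (km/h).
        if v_ego >= HIGH_SPEED_THRESHOLD:
            return reward_high(dis, vf, ve)   # r_h: car-following / adaptive cruise regime
        return reward_low(dis, vf, ve)        # r_l: emergency braking / start-stop regime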
step 7: define the experience pool priority extraction rule;
Under normal conditions, the vehicle rarely encounters states in the environment that yield a large reward value; the reward values of the other states are very small, contribute little to updating the neural network parameters, and are hardly worth learning from. In an environment with only a few high-reward states, this greatly increases the learning time and gives poor results;
By using the experience pool priority extraction method, the small number of state samples that are worth learning from are given the attention they deserve;
Specifically, when the current and target state parameters are stored in the experience pool, the difference between the stored current Q value Q_e and the target Q value Q_t is computed, and this difference is used to rank all parameter tuples in the experience pool by priority according to the SumTree algorithm, obtaining the ranked tuples and extracting the top bs of them;
The weight ISW of each of the extracted bs parameter tuples is obtained using formula (6):
ISW_k = (min(p) / p_k)^β   (6)
In formula (6), p_k is the priority value of any kth parameter tuple, min(p) is the minimum priority among the extracted bs tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
Using the experience pool priority extraction method effectively avoids ineffective training, greatly shortens the training time, and yields a better training result;
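A generic sketch of SumTree-based priority extraction consistent with this step. Note two assumptions: the text describes ranking the stored tuples and taking the top bs of them, whereas the sketch uses the standard stochastic proportional sampling a SumTree is normally built for, and the small epsilon added to priorities is an implementation detail not given in the patent. The ISW normalisation by the minimum priority of the extracted batch follows the description of formula (6).

    # SumTree-based priority extraction (step 7). |Q_t - Q_e| is used as the priority.
    import numpy as np

    class SumTree:
        def __init__(self, capacity):
            self.capacity = capacity
            self.tree = np.zeros(2 * capacity - 1)   # internal nodes hold sums of priorities
            self.data = [None] * capacity            # stored (s_t, a_t, r_t, s_{t+1}) tuples
            self.write = 0

        def add(self, priority, sample):
            idx = self.write + self.capacity - 1     # leaf index of the next free slot
            self.data[self.write] = sample
            self.update(idx, priority)
            self.write = (self.write + 1) % self.capacity   # overwrite the oldest when full

        def update(self, idx, priority):
            change = priority - self.tree[idx]
            self.tree[idx] = priority
            while idx != 0:                          # propagate the change up to the root
                idx = (idx - 1) // 2
                self.tree[idx] += change

        def get(self, value):
            idx = 0
            while 2 * idx + 1 < len(self.tree):      # descend until a leaf is reached
                left = 2 * idx + 1
                if value <= self.tree[left]:
                    idx = left
                else:
                    value -= self.tree[left]
                    idx = left + 1
            return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    def sample_batch(tree, bs, beta, eps=1e-3):
        # Draw bs tuples roughly in proportion to priority and compute their ISW weights.
        segment = tree.tree[0] / bs
        idxs, priorities, batch = [], [], []
        for i in range(bs):
            value = np.random.uniform(segment * i, segment * (i + 1))
            idx, p, sample = tree.get(value)
            idxs.append(idx)
            priorities.append(p + eps)
            batch.append(sample)
        p = np.asarray(priorities)
        isw = (p.min() / p) ** beta                  # weight form described for formula (6)
        return idxs, batch, isw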
step 8: define a greedy strategy;
Generate a random number η between 0 and 1 and judge whether η ≤ ε-greedy; if so, select the action corresponding to the maximum Q value in Q_e as the action executed by the vehicle; otherwise, randomly select an action as the action executed by the vehicle;
step 9: create an experience pool D for storing the state, action and reward information of the vehicle at each moment; replaying experience from this pool lets the deep reinforcement learning algorithm cope with data correlation and non-stationary distribution problems;
The state s_t at time t is passed through the deep neural network to obtain the value function of all actions, and action a_t is selected with the greedy strategy and then executed by the vehicle;
After the vehicle executes action a_t in state s_t at time t, the state parameter s_{t+1} at time t+1 and the reward value r_t at time t are obtained, and the parameters are stored into the experience pool D as the tuple (s_t, a_t, r_t, s_{t+1});
step 10: constructing a target neural network with the same structure as the deep neural network;
Extract bs parameter tuples from the experience pool D using the priority extraction rule, and input the state s_{t+1} at time t+1 into the target neural network:
Q_ne = ReLU(ReLU(s_{t+1} × w_1′ + b_1′) × w_2′ + b_2′)   (7)
In formula (7), Q_ne, the output of the output layer of the target neural network, is the Q value of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
step 11: establish the target Q value Q_t;
The action of the vehicle in a given state is not deterministic, so a conditional probability is needed to select a definite action; it is defined as follows:
π(a|s) = P(a_t = a | s_t = s)   (8)
In formula (8), π(a|s) denotes the probability distribution of action a performed by the vehicle in state s, and P denotes a conditional probability;
The state value function v_π(s) is obtained using formula (9):
v_π(s) = E_π(r_t + γ·r_{t+1} + γ²·r_{t+2} + ··· | s_t = s)   (9)
In formula (9), E_π denotes the expectation and γ denotes the reward attenuation factor, which takes a value between 0 and 1. When γ = 0, v_π(s) = E_π(r_t | s_t = s), and the state value function is determined only by the reward value of the current state, independently of subsequent states; when γ = 1, v_π(s) = E_π(r_t + r_{t+1} + r_{t+2} + ··· | s_t = s), and the state value function is determined by the reward values of the current state and all subsequent states. As γ tends to 0 the current reward is emphasized more, and as γ tends to 1 subsequent rewards are given more consideration;
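A small numerical illustration of the effect of γ; the reward sequence is made up purely for the example.

    # Effect of the reward attenuation factor γ on the discounted return.
    rewards = [1.0, 0.5, 0.2, 0.1]   # made-up reward sequence r_t, r_{t+1}, ...

    def discounted_return(rewards, gamma):
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    print(discounted_return(rewards, 0.0))   # 1.0    -> only the current reward counts
    print(discounted_return(rewards, 0.9))   # 1.6849 -> later rewards partially counted
    print(discounted_return(rewards, 1.0))   # 1.8    -> all rewards counted equally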
The probability P^a_{ss′} of transferring to the next state s′ when action a_t is executed at time t is obtained by formula (10):
P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a)   (10)
The action value function q_π(s, a) is obtained using formula (11):
q_π(s, a) = Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ·v_π(s′)]   (11)
In formula (11), R^a_{ss′} denotes the reward value of the vehicle after performing action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by formula (12):
Q_t = r_t + γ·max(Q_ne)   (12)
Step 12: the loss function loss is constructed using equation (13):
loss = ISW × (Q_t − Q_e)²   (13)
Gradient descent is applied to the loss function loss to update the deep neural network parameters w_1, w_2, b_1 and b_2;
The target neural network parameters w_1′, w_2′, b_1′ and b_2′ are updated at the update frequency rt, and the updated values are copied from the deep neural network;
step 13: assign t+1 to t and judge whether t ≤ c; if so, return to step 9 to continue training; if not, judge whether the loss value gradually decreases and tends to converge; if it does, the trained deep neural network has been obtained; otherwise, with t = c + 1, increase the number of network iterations and return to step 9;
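A compact skeleton tying steps 9-13 together. It reuses the hypothetical helpers from the earlier sketches (SumTree, sample_batch, select_action, action_to_command, q_net, train_step); env stands for the CarSim-based vehicle and environment model, and its reset()/step() interface is an assumption.

    # Training loop skeleton for steps 9-13, built on the earlier hypothetical sketches.
    import numpy as np
    import torch

    def to_tensors(batch):
        # Pack a list of (s_t, a_t, r_t, s_{t+1}) tuples into training tensors.
        s, a, r, s1 = zip(*batch)
        return (torch.tensor(np.asarray(s), dtype=torch.float32),
                torch.tensor(a, dtype=torch.int64),
                torch.tensor(r, dtype=torch.float32),
                torch.tensor(np.asarray(s1), dtype=torch.float32))

    def train(env, c, ms, bs, rt, gamma, epsilon_greedy):
        tree = SumTree(ms)                                   # experience pool D of size ms
        s_t, losses = env.reset(), []
        for t in range(1, c + 1):                            # step 13: run until t > c
            with torch.no_grad():                            # step 9: Q values of all actions
                q_e = q_net(torch.tensor([s_t], dtype=torch.float32)).numpy()[0]
            a_t = select_action(q_e, epsilon_greedy)
            s_t1, r_t = env.step(action_to_command(a_t))     # vehicle executes a_t
            tree.add(1.0, (s_t, a_t, r_t, s_t1))             # new samples get a default priority
            if t >= bs:                                      # steps 10-12 once enough samples exist
                beta = min(1.0, t / c)                       # β grows from 0 towards 1
                idxs, batch, isw = sample_batch(tree, bs, beta)
                loss, td = train_step(to_tensors(batch),
                                      torch.tensor(isw, dtype=torch.float32), gamma, t, rt)
                for idx, p in zip(idxs, td.numpy()):         # refresh priorities with |Q_t - Q_e|
                    tree.update(idx, float(p))
                losses.append(loss)
            s_t = s_t1
        return losses                                        # step 13: loss should fall and converge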
step 14: input the real-time state parameter information of the vehicle into the trained deep neural network to obtain the output action, and have the vehicle execute the corresponding action to complete longitudinal high-, medium- and low-speed multi-state control.

Claims (1)

1. A longitudinal multi-state control method of an automobile based on deep reinforcement learning preferential extraction is characterized by comprising the following steps:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
step 2: acquiring automobile running data in a real driving scene as initialization data, wherein the automobile running data is initial state information of a vehicle and initial control parameter information of the vehicle;
step 3: define the state information set of the vehicle s = {s_0, s_1, ···, s_t, ···, s_n}, where s_0 denotes the initial state information of the vehicle and s_t denotes the state reached after the vehicle executes control action a_{t−1} in state s_{t−1}, i.e., at time t−1; s_t = {Ax_t, e_t, Ve_t}, where Ax_t denotes the longitudinal acceleration of the vehicle at time t, e_t denotes the difference between the own-vehicle speed and the relative distance between the two vehicles at time t, and Ve_t denotes the difference between the own-vehicle speed and the preceding-vehicle speed at time t;
define the control parameter set of the vehicle a = {a_0, a_1, ···, a_t, ···, a_n}, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e., at time t; a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the master cylinder pressure of the vehicle at time t; t = 1, 2, ···, c, where c denotes the total training time;
step 4: initialize parameters, including the time t, the greedy probability ε-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted samples, and the reward attenuation factor γ;
step 5: construct a deep neural network and randomly initialize its parameters: weights w and biases b;
the deep neural network comprises an input layer, a hidden layer and an output layer; the input layer contains m neurons and receives the state s_t of the vehicle at time t, the hidden layer contains n neurons, processes the state information from the input layer with the ReLU activation function and passes it to the output layer, and the output layer contains k neurons and outputs the action value function:
Q_e = ReLU(ReLU(s_t × w_1 + b_1) × w_2 + b_2)   (1)
In formula (1), w_1 and b_1 are the weights and biases of the hidden layer, w_2 and b_2 are the weights and biases of the output layer, and Q_e, the output of the output layer, is the current Q value of all actions obtained by the deep neural network;
step 6: defining a reward function for deep reinforcement learning:
r_h = [piecewise reward function for the high-speed state; given as an image in the original publication]   (2)
r_l = [piecewise reward function for the low-speed state; given as an image in the original publication]   (3)
In formulas (2) and (3), r_h is the reward value in the high-speed state of the vehicle, r_l is the reward value in the low-speed state of the vehicle, dis is the relative distance between the own vehicle and the preceding vehicle, Vf is the speed of the preceding vehicle, x is the lower limit of the relative distance, y is the upper limit of the relative distance, mid is the switching threshold of the reward function with respect to the relative distance, lim is the switching threshold of the reward function with respect to the difference between the own-vehicle speed and the preceding-vehicle speed, z is the switching threshold of the reward function with respect to the preceding-vehicle speed, and u is the lower limit of the preceding-vehicle speed;
step 7: define the experience pool priority extraction rule;
Compute the difference between the current Q value Q_e and the target Q value Q_t stored in the experience pool, and use this difference to rank all parameter tuples stored in the experience pool by priority according to the SumTree algorithm, obtaining the ranked tuples and extracting the top bs of them;
The weight ISW of each of the extracted bs parameter tuples is obtained using formula (4):
ISW_k = (min(p) / p_k)^β   (4)
In formula (4), p_k is the priority value of any kth parameter tuple, min(p) is the minimum priority among the extracted bs tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
step 8: define a greedy strategy;
Generate a random number η between 0 and 1 and judge whether η ≤ ε-greedy; if so, select the action corresponding to the maximum Q value in Q_e as the action executed by the vehicle; otherwise, randomly select an action as the action executed by the vehicle;
step 9: create an experience pool D for storing the state, action and reward information of the vehicle at each moment;
The state s_t at time t is passed through the deep neural network to obtain the value function of all actions, and action a_t is selected with the greedy strategy and then executed by the vehicle;
After the vehicle executes action a_t in state s_t at time t, the state parameter s_{t+1} at time t+1 and the reward value r_t at time t are obtained, and the parameters are stored into the experience pool D as the tuple (s_t, a_t, r_t, s_{t+1});
step 10: constructing a target neural network with the same structure as the deep neural network;
Extract bs parameter tuples from the experience pool D using the priority extraction rule, and input the state s_{t+1} at time t+1 into the target neural network:
Q_ne = ReLU(ReLU(s_{t+1} × w_1′ + b_1′) × w_2′ + b_2′)   (5)
In formula (5), Q_ne, the output of the output layer of the target neural network, is the Q value of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
step 11: establish the target Q value Q_t;
The probability distribution π(a|s) of action a performed in state s is defined by formula (6):
π(a|s) = P(a_t = a | s_t = s)   (6)
In formula (6), P denotes a conditional probability;
The state value function v_π(s) is obtained using formula (7):
v_π(s) = E_π(r_t + γ·r_{t+1} + γ²·r_{t+2} + ··· | s_t = s)   (7)
In formula (7), γ is the reward attenuation factor and E_π denotes the expectation;
The probability P^a_{ss′} of transferring to the next state s′ when action a_t is executed at time t is obtained by formula (8):
P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a)   (8)
The action value function q_π(s, a) is obtained using formula (9):
q_π(s, a) = Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ·v_π(s′)]   (9)
In formula (9), R^a_{ss′} denotes the reward value of the vehicle after performing action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by formula (10):
Q_t = r_t + γ·max(Q_ne)   (10)
Step 12: the loss function loss is constructed using equation (11):
loss = ISW × (Q_t − Q_e)²   (11)
Gradient descent is applied to the loss function loss to update the deep neural network parameters w_1, w_2, b_1 and b_2;
The target neural network parameters w_1′, w_2′, b_1′ and b_2′ are updated at the update frequency rt, and the updated values are copied from the deep neural network;
step 13: after assigning t+1 to t, judge whether t ≤ c; if so, return to step 9 to continue training; if not, judge whether the loss value gradually decreases and tends to converge; if it does, the trained deep neural network has been obtained; otherwise, with t = c + 1, increase the number of network iterations and return to step 9;
step 14: input the real-time state parameter information of the vehicle into the trained deep neural network to obtain the output action, and have the vehicle execute the corresponding action to complete longitudinal multi-state control.
CN202110267799.0A 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction Active CN112861269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110267799.0A CN112861269B (en) 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110267799.0A CN112861269B (en) 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Publications (2)

Publication Number Publication Date
CN112861269A CN112861269A (en) 2021-05-28
CN112861269B true CN112861269B (en) 2022-08-30

Family

ID=75994127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110267799.0A Active CN112861269B (en) 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Country Status (1)

Country Link
CN (1) CN112861269B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113734170B (en) * 2021-08-19 2023-10-24 崔建勋 Automatic driving lane change decision method based on deep Q learning
CN113715842B (en) * 2021-08-24 2023-02-03 华中科技大学 High-speed moving vehicle control method based on imitation learning and reinforcement learning
CN114527642B (en) * 2022-03-03 2024-04-02 东北大学 Method for automatically adjusting PID parameters by AGV based on deep reinforcement learning
CN115303290B (en) * 2022-10-09 2022-12-06 北京理工大学 System key level switching method and system of vehicle hybrid key level system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110450771A (en) * 2019-08-29 2019-11-15 合肥工业大学 A kind of intelligent automobile stability control method based on deeply study
CN110716550A (en) * 2019-11-06 2020-01-21 南京理工大学 Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
CN110850720A (en) * 2019-11-26 2020-02-28 国网山东省电力公司电力科学研究院 DQN algorithm-based area automatic power generation dynamic control method
CN110969848A (en) * 2019-11-26 2020-04-07 武汉理工大学 Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111985614A (en) * 2020-07-23 2020-11-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN112162555A (en) * 2020-09-23 2021-01-01 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112406867A (en) * 2020-11-19 2021-02-26 清华大学 Emergency vehicle hybrid lane change decision method based on reinforcement learning and avoidance strategy

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111316295B (en) * 2017-10-27 2023-09-22 渊慧科技有限公司 Reinforcement learning using distributed prioritized playback
US11688160B2 (en) * 2018-01-17 2023-06-27 Huawei Technologies Co., Ltd. Method of generating training data for training a neural network, method of training a neural network and using neural network for autonomous operations
US10845815B2 (en) * 2018-07-27 2020-11-24 GM Global Technology Operations LLC Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110450771A (en) * 2019-08-29 2019-11-15 合肥工业大学 A kind of intelligent automobile stability control method based on deeply study
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
CN110716550A (en) * 2019-11-06 2020-01-21 南京理工大学 Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN110850720A (en) * 2019-11-26 2020-02-28 国网山东省电力公司电力科学研究院 DQN algorithm-based area automatic power generation dynamic control method
CN110969848A (en) * 2019-11-26 2020-04-07 武汉理工大学 Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111985614A (en) * 2020-07-23 2020-11-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN112162555A (en) * 2020-09-23 2021-01-01 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112406867A (en) * 2020-11-19 2021-02-26 清华大学 Emergency vehicle hybrid lane change decision method based on reinforcement learning and avoidance strategy

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A review of communication, driver characteristics, and controls aspects of cooperative adaptive cruise control; Dey, K.C.; Li Yan; IEEE Transactions on Intelligent Transportation Systems; 2016-02-28; Vol. 17, No. 2; pp. 491-509 *
Cooperative adaptive cruise control based on deep reinforcement learning (in Chinese); 王文飒, 梁军, 陈龙, 陈小波, 朱宁, 华国栋; Journal of Transport Information and Safety; 2019-03-31; Vol. 37, No. 3; pp. 93-100 *
Automatic parking control strategy based on deep reinforcement learning (in Chinese); 黄鹤, 郭伟锋, 梅炜炜, 张润, 程进, 张炳力; Proceedings of the 2020 Annual Congress of the China Society of Automotive Engineers; 2020-12-31; pp. 181-189 *

Also Published As

Publication number Publication date
CN112861269A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN112861269B (en) Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
CN111898211B (en) Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof
CN107229973B (en) Method and device for generating strategy network model for automatic vehicle driving
CN111845701B (en) HEV energy management method based on deep reinforcement learning in car following environment
CN111845741B (en) Automatic driving decision control method and system based on hierarchical reinforcement learning
CN106740846A (en) A kind of electric automobile self-adapting cruise control method of double mode switching
CN111332362B (en) Intelligent steer-by-wire control method integrating individual character of driver
EP3725627A1 (en) Method and apparatus for generating vehicle control command, and vehicle controller and storage medium
CN110949398A (en) Method for detecting abnormal driving behavior of first-vehicle drivers in vehicle formation driving
CN113276884B (en) Intelligent vehicle interactive decision passing method and system with variable game mode
CN113954837B (en) Deep learning-based lane change decision-making method for large-scale commercial vehicle
CN112668779A (en) Preceding vehicle motion state prediction method based on self-adaptive Gaussian process
CN113096402B (en) Dynamic speed limit control method, system, terminal and readable storage medium based on intelligent networked vehicle
CN109436085A (en) A kind of wire-controlled steering system gearratio control method based on driving style
JP7415471B2 (en) Driving evaluation device, driving evaluation system, in-vehicle device, external evaluation device, and driving evaluation program
CN113722835A (en) Modeling method for anthropomorphic random lane change driving behavior
CN110879595A (en) Unmanned mine card tracking control system and method based on deep reinforcement learning
CN114852105A (en) Method and system for planning track change of automatic driving vehicle
CN112158045A (en) Active suspension control method based on depth certainty strategy gradient
CN112542061B (en) Lane borrowing and overtaking control method, device and system based on Internet of vehicles and storage medium
CN114030485A (en) Automatic driving automobile man lane change decision planning method considering attachment coefficient
CN114074680A (en) Vehicle lane change behavior decision method and system based on deep reinforcement learning
CN114148349B (en) Vehicle personalized following control method based on generation of countermeasure imitation study
CN113033902B (en) Automatic driving lane change track planning method based on improved deep learning
WO2023004698A1 (en) Method for intelligent driving decision-making, vehicle movement control method, apparatus, and vehicle

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant