CN112861269B - Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction - Google Patents
- Publication number
- CN112861269B · CN202110267799.0A
- Authority
- CN
- China
- Prior art keywords
- vehicle
- state
- value
- neural network
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/10—Geometric CAD
- G06F30/15—Vehicle, aircraft or watercraft design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/10—Geometric CAD
- G06F30/17—Mechanical parametric or variational design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/14—Force analysis or force optimisation, e.g. static or dynamic forces
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an automobile longitudinal multi-state control method based on deep reinforcement learning with preferential extraction, which comprises the following steps: 1, defining a state parameter set s and a control parameter set a for driving the automobile; 2, initializing the deep reinforcement learning parameters and constructing a deep neural network; 3, defining the deep reinforcement learning reward function and the priority extraction rule; 4, training the deep neural network to obtain an optimal network model; 5, obtaining the state parameter s_t of the automobile at time t, inputting it into the optimal network model to obtain the output a_t, and having the automobile execute it. By combining a priority extraction algorithm with a deep reinforcement learning control method, the invention accomplishes longitudinal multi-state driving of the automobile, ensuring higher safety during driving and reducing the occurrence of traffic accidents.
Description
Technical Field
The invention relates to the technical field of intelligent automobile longitudinal multi-state control, in particular to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction.
Background
With the rapid development of urban economies and the continuous improvement of living standards, the number of motor vehicles in cities has increased sharply. The automobile has become an indispensable means of transportation, but alongside speed and convenience it brings a series of safety problems. Owing to the limited skill of drivers and other uncontrollable external factors, collisions between two or more vehicles frequently occur on the road, causing loss of life and property and seriously impeding traffic flow. With the continuous development of automobile technology, many manufacturers have introduced adaptive cruise control systems, emergency braking systems, and the like. An adaptive cruise control system obtains data about the road ahead from sensors such as radar and, according to a corresponding algorithm, maintains a certain distance from the preceding vehicle at a certain speed; however, it can usually be engaged only above a relatively high speed, such as 25 km/h, and the driver must take manual control below that speed. An emergency braking system is a technology that brakes actively to avoid an accident when the vehicle is travelling outside the adaptive cruise state and encounters an emergency ahead, such as a sudden stop of the preceding vehicle or a pedestrian stepping out; however, because of sensor misjudgment, environmental error, and related causes, it cannot be applied in all driving environments and may itself lead to dangerous accidents.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an automobile longitudinal multi-state control method based on deep reinforcement learning with priority extraction. By combining a priority extraction algorithm with a deep reinforcement learning control method, longitudinal multi-state driving of the automobile is accomplished, the safety of the automobile during driving is improved, and the occurrence of traffic accidents is reduced.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which is characterized by comprising the following steps of:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
Step 2: acquiring automobile running data in a real driving scene as initialization data, the automobile running data being the initial state information of the vehicle and the initial control parameter information of the vehicle;
and step 3: defining a set of state information s ═ s of the vehicle 0 ,s 1 ,···s t ,···,s n },s 0 Information indicating the initial state of the vehicle, s t Indicating that the vehicle is in state s t-1 I.e. the control action a is executed at time t-1 t-1 The state reached thereafter, and has s t ={Ax t ,e t ,Ve t In which Ax is t Representing the longitudinal acceleration of the vehicle at time t, e t Representing the difference, Ve, between the speed of the vehicle and the relative distance between the two vehicles before time t t The difference value between the self vehicle speed and the front vehicle speed at the moment t is represented;
Defining the control parameter set a = {a_0, a_1, ···, a_t, ···, a_n} of the vehicle, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e. at time t; further, a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the master cylinder pressure of the vehicle at time t; t = 1, 2, ···, c, where c denotes the total training duration;
Step 4: initializing the parameters, including the time t, the greedy probability ε-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
Step 5: constructing a deep neural network and randomly initializing its parameters: weights w and biases b;
The deep neural network comprises an input layer, a hidden layer, and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, applies the activation function Relu to the state information from the input layer, and passes the result to the output layer; the output layer comprises k neurons and outputs the action value function:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)   (1)
In equation (1), w_1, b_1 are the weights and biases of the hidden layer, w_2, b_2 are the weights and biases of the output layer, and Q_e, the output value of the output layer, is the current Q value of all actions obtained by the deep neural network;
step 6: define reward functions for deep reinforcement learning:
in the formulae (2) and (3), r h The bonus value r is the bonus value in the high-speed state of the vehicle l The method comprises the following steps that (1) the reward value is in a low-speed state of a vehicle, dis is the relative distance between the vehicle and a front vehicle, Vf is the speed of the front vehicle, x represents the lower limit of the relative distance, y represents the upper limit of the relative distance, mid represents the switching threshold value of a reward function relative to the relative distance, lim represents the switching threshold value of the reward function relative to the difference value between the speed of the vehicle and the speed of the front vehicle, z represents the switching threshold value of the reward function relative to the speed of the front vehicle, and u represents the lower limit of the speed of the front vehicle;
and 7: defining an experience pool priority extraction rule;
for the current Q value Q stored in the experience pool e And a target Q value Q t Making difference, and using the difference value to perform priority ordering on various parameter forms stored in the experience pool according to SumTree algorithm, obtaining ordered parameter forms and extracting the ordered parameter formsTaking a parameter form of a previous bs strip;
the weight ISW in the form of the extracted pre-bs parameter is obtained using equation (4):
in the formula (4), p k The priority value is in the form of any kth parameter, min (p) is the minimum value of the priority in the form of the extracted parameters of the previous bs, and beta is a weight increase coefficient, and the value of the weight increase coefficient gradually converges from 0 to 1 along with the increase of the extraction times;
and 8: defining a greedy strategy;
generating a random number eta between 0 and 1, judging whether eta is less than or equal to epsilon-greedy, if yes, selecting Q e The action corresponding to the medium and maximum Q value is a vehicle execution action, otherwise, one action is randomly selected as the vehicle execution action;
and step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment;
state s at time t t Obtaining all action value functions through the deep neural network, and selecting action a by utilizing a greedy strategy t Then executed by the vehicle;
state s of the vehicle at time t t Lower execution action a t Obtaining the state parameter s at the moment of t +1 t+1 And a prize value r at time t t Each parameter is expressed in a parameter form s t ,a t ,r t ,s t+1 Storing the data into an experience pool D;
step 10: constructing a target neural network with the same structure as the deep neural network;
bs tuples are obtained from the experience pool D using the priority extraction rule, and the state s_{t+1} at time t+1 is input into the target neural network:
Q_ne = Relu(Relu(s_{t+1} × w_1′ + b_1′) × w_2′ + b_2′)   (5)
In equation (5), Q_ne, the output value of the output layer of the target neural network, is the Q value of all actions obtained by the target neural network; w_1′, w_2′ are the weights of the hidden layer and output layer of the target neural network, respectively, and b_1′, b_2′ are the corresponding biases;
step 11: establishing a target Q value Q t ;
The probability distribution pi (a | s) of the action a performed in the state s is defined by equation (6):
π(a|s)=P(a t =a|s t =s) (6)
in the formula (6), p represents a conditional probability;
The state value function v_π(s) is obtained using equation (7):
v_π(s) = E_π(r_t + γ·r_{t+1} + γ²·r_{t+2} + ··· | s_t = s)   (7)
In equation (7), γ is the reward attenuation factor and E_π denotes an expectation;
The probability P_{ss′}^a of transferring to the next state s′ after executing action a_t at time t is obtained by equation (8):
P_{ss′}^a = P(s_{t+1} = s′ | s_t = s, a_t = a)   (8)
The action value function q_π(s, a) is obtained using equation (9):
q_π(s, a) = R_s^a + γ·Σ_{s′} P_{ss′}^a · v_π(s′)   (9)
In equation (9), R_s^a denotes the reward value obtained after the vehicle performs action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by equation (10):
Q_t = r_t + γ·max(Q_ne)   (10)
Step 12: the loss function loss is constructed using equation (11):
loss=ISW×(Q t -Q e ) 2 (11)
carrying out a gradient descent method on the loss function loss so as to update the deep neural network parameter w 1 、w 2 、b 1 、b 2 ;
Updating the parameter w of the target neural network with an update frequency rt 1 ′、w 2 ′、b 1 ′、b 2 ', and update values are taken from the deep neural network;
step 13: assigning t +1 to t, judging whether t is less than or equal to c, if so, returning to the step 9 to continue training, otherwise, judging whether the loss value gradually decreases and tends to converge, if so, indicating that a trained deep neural network is obtained, otherwise, making t equal to c +1, increasing the network iteration times, and returning to the step 9 to execute;
step 14: and inputting the real-time state parameter information of the vehicle into the trained deep neural network to obtain an output action, so that corresponding actions are executed on the vehicle to complete longitudinal multi-state control.
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with traditional longitudinal control methods, the control method of the invention offers better smoothness under different working conditions and better stability under extreme conditions, and is suitable for multi-state control of the automobile at high, medium, and low speeds;
2. The deep reinforcement learning of the invention uses a trained deep neural network, so the corresponding action can be executed simply by inputting the automobile's state information; compared with complex traditional automobile control this is simpler and faster, with comparatively good control performance;
3. Compared with ordinary reinforcement learning, the deep reinforcement learning of the invention processes the input state parameters with a neural network rather than large lookup tables, greatly saving memory; neural network training is also more efficient and converges better than ordinary iterative methods;
4. Whereas switching between traditional multi-state control methods is cumbersome, the data priority extraction method of the invention ranks the data in the experience pool by priority and integrates the parameter information of the automobile in its various states, greatly shortening training time; multi-state control of the automobile is unified, no complicated switching between control methods is needed, and the control effect is better.
Detailed Description
In this embodiment, the automobile longitudinal multi-state control method based on deep reinforcement learning with preferential extraction decides the throttle opening and master cylinder pressure of the automobile at each moment from its real-time state parameters, thereby accomplishing car-following and adaptive cruise in the high-speed state, emergency braking in the medium-speed state, and start-stop in the low-speed state. It proceeds according to the following steps:
step 1: establishing a vehicle dynamic model and a vehicle running environment model by utilizing carsim software;
step 2: acquiring automobile driving data in a real driving scene and taking the automobile driving data as initialization data, wherein the automobile driving data is initial state information of a vehicle and initial control parameter information of the vehicle;
and 3, step 3: defining a set of state information s ═ s for a vehicle 0 ,s 1 ,···s t ,···,s n },s 0 Information indicating the initial state of the vehicle, s t Indicating that the vehicle is in state s t-1 I.e. control action a is performed at time t-1 t-1 The state reached thereafter, and has s t ={Ax t ,e t ,Ve t In which Ax is t Represents the longitudinal acceleration of the vehicle at time t, in m/s 2 ,e t Representing the difference, Ve, between the speed of the vehicle and the relative distance between the two vehicles before time t t The difference value between the self vehicle speed and the front vehicle speed at the moment t is represented;
defining a control parameter set a ═ { a) of a vehicle 0 ,a 1 ,···,a t ,···,a n },a 0 Initial control parameter information indicative of a vehicle, a t Indicating that the vehicle is in state s t I.e. the action performed by the vehicle at time t, and has a t ={T t ,B t In which T is t Representing the throttle opening at time t of the vehicle, B t The unit of master cylinder pressure of the vehicle at the time t is Mpa, t is 1,2, c and c represents the total training time;
Step 4: initializing the parameters, including the time t, the greedy probability ε-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
Step 5: constructing a deep neural network and randomly initializing its parameters: weights w and biases b;
The deep neural network comprises an input layer, a hidden layer, and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, applies the activation function Relu to the state information from the input layer, and passes the result to the output layer; the output layer comprises k neurons and outputs the action value function;
For the hidden layer:
l = Relu((s_t × w_1) + b_1)   (1)
In equation (1), w_1, b_1 are the weights and biases of the hidden layer;
For the output layer:
out = Relu((l × w_2) + b_2)   (2)
In equation (2), w_2, b_2 are the weights and biases of the output layer;
Combining equations (1) and (2):
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)   (3)
In equation (3), Q_e, the output value of the output layer, is the current Q value of all actions obtained through the deep neural network;
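As a minimal sketch (not part of the patent; the layer sizes m, n, k, the weight initialization, and the example state values are all chosen purely for illustration), the forward pass of equations (1)-(3) can be written as:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def q_forward(s_t, w1, b1, w2, b2):
    """Equations (1)-(3): hidden layer l = Relu(s_t*w1 + b1),
    output Q_e = Relu(l*w2 + b2) -> current Q value of every action."""
    l = relu(s_t @ w1 + b1)      # hidden layer, n neurons
    return relu(l @ w2 + b2)     # output layer, k action values

# Illustrative sizes (the patent does not fix m, n, k):
m, n, k = 3, 16, 5               # state dim {Ax_t, e_t, Ve_t}, hidden, actions
rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((m, n)) * 0.1, np.zeros(n)
w2, b2 = rng.standard_normal((n, k)) * 0.1, np.zeros(k)

s_t = np.array([0.5, 1.2, -0.3])  # example state (Ax_t, e_t, Ve_t)
q_e = q_forward(s_t, w1, b1, w2, b2)
```

Note that the Relu on the output layer, as written in equation (3), clips negative Q values to zero; many DQN implementations use a linear output layer instead.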
Step 6: defining the deep reinforcement learning reward function. The design of the reward function is an essential component of a deep reinforcement learning algorithm: the updating and convergence of the neural network weights and biases depend on its quality. The reward function is defined as follows:
In equations (4) and (5), r_h is the reward value in the high-speed state of the vehicle and r_l is the reward value in the low-speed state; the condition distinguishing the two is whether the vehicle speed reaches 25 km/h. If the speed reaches or exceeds 25 km/h, high-speed control of the vehicle is performed, completing car-following and adaptive cruise; if the speed is below 25 km/h, medium/low-speed control is performed, completing emergency braking and start-stop operations. dis is the relative distance between the ego vehicle and the preceding vehicle, in m; Vf is the speed of the preceding vehicle, in km/h; x is the lower limit of the relative distance, in m; y is the upper limit of the relative distance, in m; mid is the switching threshold of the reward function with respect to the relative distance, in m; lim is the switching threshold of the reward function with respect to the difference between the ego-vehicle speed and the preceding-vehicle speed, in km/h; z is the switching threshold of the reward function with respect to the preceding-vehicle speed, in km/h; u is the lower limit of the preceding-vehicle speed, in km/h;
and 7: defining an experience pool priority extraction rule;
Under normal conditions the vehicle rarely encounters states that carry a large reward value; the reward values of the other states are very small, contribute little to iterating the neural network parameters, and are hardly worth learning. In an environment with only a small number of high-reward states, learning time therefore increases greatly and the results are poor;
Using the experience pool priority extraction method, this small number of state samples that are worth learning can be given due weight;
Specifically, when the current and target state parameters are stored in the experience pool, the difference between the stored current Q value Q_e and target Q value Q_t is computed, and this difference is used to rank all parameter tuples stored in the experience pool by priority according to the SumTree algorithm; the first bs tuples of the ranked list are then extracted;
The weights ISW of the extracted first bs tuples are obtained using equation (6):
ISW = (p_k / min(p))^(−β)   (6)
In equation (6), p_k is the priority value of any k-th tuple, min(p) is the minimum priority among the extracted first bs tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
Using the experience pool priority extraction method effectively avoids ineffective training, greatly shortens training time, and yields a better training result;
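The priority extraction of step 7 can be sketched as follows. This is a simplification, not the patent's implementation: it ranks transitions by a plain sort on the |Q_t − Q_e| difference instead of sampling through a SumTree, and the importance-sampling weight formula ISW_k = (p_k / min(p))^(−β) is reconstructed from the variable descriptions, since the equation image is not reproduced in the source text.

```python
def prioritized_sample(buffer, td_errors, bs, beta):
    """Rank transitions by the |Q_t - Q_e| difference (their priority) and
    take the bs highest-priority entries, together with importance-sampling
    weights ISW_k = (p_k / min(p)) ** (-beta)."""
    order = sorted(range(len(buffer)), key=lambda i: td_errors[i], reverse=True)
    picked = order[:bs]                              # first bs tuples
    p_min = min(td_errors[i] for i in picked)        # min(p) over the batch
    isw = [(td_errors[i] / p_min) ** (-beta) for i in picked]
    return [buffer[i] for i in picked], isw

# Toy buffer of 6 transitions with illustrative priorities:
buffer = [("s%d" % i,) for i in range(6)]
td = [0.5, 2.0, 0.1, 1.0, 4.0, 0.25]
batch, isw = prioritized_sample(buffer, td, bs=3, beta=0.4)
```

The highest-priority transition receives the smallest weight, which is the usual correction for the sampling bias that priority extraction introduces.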
and step 8: defining a greedy strategy;
generating a random number eta between 0 and 1, judging whether eta is less than or equal to epsilon-greedy, if yes, selecting Q e The action corresponding to the medium and maximum Q value is a vehicle execution action, otherwise, one action is randomly selected as the vehicle execution action;
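The greedy rule of step 8 can be sketched as below (the Q values are illustrative only). Note the patent's convention: ε-greedy here is the probability of *exploiting* the maximum-Q action, with the random action taken otherwise, which is the reverse of the convention in much of the RL literature.

```python
import random

def select_action(q_values, eps_greedy, rng=random):
    """Patent's rule: draw eta in [0, 1); if eta <= eps_greedy take the
    argmax of Q_e (exploit), otherwise pick a random action (explore)."""
    eta = rng.random()
    if eta <= eps_greedy:
        return max(range(len(q_values)), key=lambda i: q_values[i])
    return rng.randrange(len(q_values))

q_e = [0.1, 0.7, 0.3]                         # illustrative action values
a_greedy = select_action(q_e, eps_greedy=1.0) # always exploits -> action 1
```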
and step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment, and processing data correlation and non-static distribution problems by deep reinforcement learning through the aid of experience pool playback;
state s at time t t Obtaining all action value functions through a deep neural network, and selecting an action a by using a greedy strategy t Then executed by the vehicle;
state s of the vehicle at time t t Lower execution action a t Obtaining the state parameter s at the moment of t +1 t+1 And a prize value r at time t t Each parameter is expressed in a parameter form s t ,a t ,r t ,s t+1 Storing the data into an experience pool D;
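The experience pool D of step 9 can be sketched as a bounded buffer of (s_t, a_t, r_t, s_{t+1}) tuples. The eviction-of-oldest policy and the example transitions are illustrative assumptions; the patent only specifies a pool of size ms.

```python
from collections import deque

class ExperiencePool:
    """Experience pool D: stores (s_t, a_t, r_t, s_next) tuples, discarding
    the oldest entry once the pool size ms is exceeded."""
    def __init__(self, ms):
        self.data = deque(maxlen=ms)

    def store(self, s_t, a_t, r_t, s_next):
        self.data.append((s_t, a_t, r_t, s_next))

    def __len__(self):
        return len(self.data)

D = ExperiencePool(ms=2)
D.store((0.5, 1.2, -0.3), 0, 1.0, (0.4, 1.0, -0.2))
D.store((0.4, 1.0, -0.2), 1, 0.5, (0.3, 0.8, -0.1))
D.store((0.3, 0.8, -0.1), 2, 2.0, (0.2, 0.6, 0.0))  # evicts the oldest tuple
```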
step 10: constructing a target neural network with the same structure as the deep neural network;
bs tuples are obtained from the experience pool D using the priority extraction rule, and the state s_{t+1} at time t+1 is input into the target neural network:
Q_ne = Relu(Relu(s_{t+1} × w_1′ + b_1′) × w_2′ + b_2′)   (7)
In equation (7), Q_ne, the output value of the output layer of the target neural network, is the Q value of all actions obtained by the target neural network; w_1′, w_2′ are the weights of the hidden layer and output layer of the target neural network, respectively, and b_1′, b_2′ are the corresponding biases;
step 11: establishing a target Q value Q t ;
The action of the vehicle in a certain state is uncertain, and a relevant conditional probability is needed to select the determined action, wherein the conditional probability is defined as follows:
π(a|s)=P(a t =a|s t =s) (8)
in equation (8), pi (a | s) represents a probability distribution of an action a performed by the vehicle in a state s, and p represents a conditional probability;
The state value function v_π(s) is obtained using equation (9):
v_π(s) = E_π(r_t + γ·r_{t+1} + γ²·r_{t+2} + ··· | s_t = s)   (9)
In equation (9), E_π denotes an expectation and γ is the reward attenuation factor, taking a value between 0 and 1. When γ = 0, v_π(s) = E_π(r_t | s_t = s), and the state value function is determined solely by the reward of the current state, independent of subsequent states; when γ = 1, v_π(s) = E_π(r_t + r_{t+1} + r_{t+2} + ··· | s_t = s), and the state value function is determined by the rewards of the current and all subsequent states. As γ tends to 0, the current reward is emphasized; as γ tends to 1, subsequent rewards are weighted more heavily;
The probability P_{ss′}^a of transferring to the next state s′ after executing action a_t at time t is obtained by equation (10):
P_{ss′}^a = P(s_{t+1} = s′ | s_t = s, a_t = a)   (10)
The action value function q_π(s, a) is obtained using equation (11):
q_π(s, a) = R_s^a + γ·Σ_{s′} P_{ss′}^a · v_π(s′)   (11)
In equation (11), R_s^a denotes the reward value obtained after the vehicle performs action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by equation (12):
Q_t = r_t + γ·max(Q_ne)   (12)
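Equation (12) is a one-liner; the numbers below are illustrative only:

```python
def target_q(r_t, gamma, q_ne):
    """Equation (12): Q_t = r_t + gamma * max(Q_ne), where Q_ne holds the
    target network's action values for state s_{t+1}."""
    return r_t + gamma * max(q_ne)

q_ne = [0.2, 1.5, 0.7]                          # target-net outputs for s_{t+1}
q_t = target_q(r_t=1.0, gamma=0.9, q_ne=q_ne)   # 1.0 + 0.9 * 1.5
```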
Step 12: the loss function loss is constructed using equation (13):
loss=ISW×(Q t -Q e ) 2 (13)
a gradient descent method is carried out on the loss function loss so as to update the parameter w of the deep neural network 1 、w 2 、b 1 、b 2 ;
Updating the parameter w of the target neural network with an update frequency rt 1 ′、w 2 ′、b 1 ′、b 2 ', and the update value is taken from the deep neural network;
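The weighted loss of equation (13) and the periodic target-network copy of step 12 can be sketched as follows. Averaging the weighted squared errors over the batch and representing the parameters as a dict are illustrative choices, not specified by the patent.

```python
def per_loss(isw, q_t, q_e):
    """Equation (13), ISW * (Q_t - Q_e)^2, averaged over the sampled batch."""
    return sum(w * (t - e) ** 2 for w, t, e in zip(isw, q_t, q_e)) / len(isw)

def maybe_sync_target(step, rt, online_params, target_params):
    """Every rt steps, copy the deep (online) network parameters into the
    target network, per the patent's update frequency rt."""
    if step % rt == 0:
        target_params.update(online_params)
    return target_params

loss = per_loss(isw=[1.0, 0.5], q_t=[2.0, 1.0], q_e=[1.0, 1.0])
params = maybe_sync_target(step=10, rt=5,
                           online_params={"w1": 3}, target_params={"w1": 0})
```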
step 13: assigning t +1 to t, judging whether t is less than or equal to c, if so, returning to the step 9 to continue training, otherwise, judging whether the loss value gradually decreases and tends to converge, if so, indicating that a trained deep neural network is obtained, otherwise, making t equal to c +1, increasing the network iteration times, and returning to the step 9 to execute;
step 14: and inputting the real-time state parameter information of the vehicle into the trained deep neural network to obtain an output action, and executing corresponding actions on the vehicle to finish longitudinal high, medium and low speed multi-state control.
Claims (1)
1. A longitudinal multi-state control method of an automobile based on deep reinforcement learning preferential extraction is characterized by comprising the following steps:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
step 2: acquiring automobile running data in a real driving scene as initialization data, wherein the automobile running data is initial state information of a vehicle and initial control parameter information of the vehicle;
and step 3: defining a set of state information s ═ s for a vehicle 0 ,s 1 ,···s t ,···,s n },s 0 Indicating initial state information of the vehicle, s t Indicating that the vehicle is in state s t-1 I.e. the control action a is executed at time t-1 t-1 The state reached thereafter, and has s t ={Ax t ,e t ,Ve t In which Ax is t Representing the longitudinal acceleration of the vehicle at time t, e t Representing the difference, Ve, between the speed of the vehicle and the relative distance between the two vehicles before time t t The difference value between the self vehicle speed and the front vehicle speed at the moment t is represented;
defining a control parameter set a ═ { a) of a vehicle 0 ,a 1 ,···,a t ,···,a n },a 0 Initial control parameter information representing a vehicle, a t Indicating that the vehicle is in state s t I.e. performed by the vehicle at time tIs actuated and has a t ={T t ,B t In which T is t Representing the throttle opening at time t of the vehicle, B t The master cylinder pressure of the vehicle at the time t is represented, and t is 1,2, c and c represents the total training time;
Step 4: initializing the parameters, including the time t, the greedy probability ε-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
Step 5: constructing a deep neural network and randomly initializing its parameters: weights w and biases b;
The deep neural network comprises an input layer, a hidden layer, and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, applies the activation function Relu to the state information from the input layer, and passes the result to the output layer; the output layer comprises k neurons and outputs the action value function:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)   (1)
In equation (1), w_1, b_1 are the weights and biases of the hidden layer, w_2, b_2 are the weights and biases of the output layer, and Q_e, the output value of the output layer, is the current Q value of all actions obtained by the deep neural network;
Step 6: defining the reward function for deep reinforcement learning:
In equations (2) and (3), r_h is the reward value in the high-speed state of the vehicle and r_l is the reward value in the low-speed state; dis is the relative distance between the ego vehicle and the preceding vehicle; Vf is the speed of the preceding vehicle; x is the lower limit of the relative distance and y its upper limit; mid is the switching threshold of the reward function with respect to the relative distance; lim is the switching threshold of the reward function with respect to the difference between the ego-vehicle speed and the preceding-vehicle speed; z is the switching threshold of the reward function with respect to the preceding-vehicle speed; u is the lower limit of the preceding-vehicle speed;
and 7: defining an experience pool priority extraction rule;
for the current Q value Q stored in the experience pool e And a target Q value Q t Making a difference, and carrying out priority sequencing on all parameter forms stored in the experience pool by using the difference value according to a SumTree algorithm to obtain a sequenced parameter form and extracting a front bs parameter form from the sequenced parameter form;
the weight ISW in the form of the extracted pre-bs parameter is obtained using equation (4):
in the formula (4), p k The priority value is in the form of any kth parameter, min (p) is the minimum value of the priority in the form of the extracted parameters of the previous bs, and beta is a weight increase coefficient, and the value of the weight increase coefficient gradually converges from 0 to 1 along with the increase of the extraction times;
and step 8: defining a greedy strategy;
generating a random number eta between 0 and 1, judging whether eta is less than or equal to epsilon-greedy, if yes, selecting Q e The action corresponding to the medium and maximum Q value is a vehicle execution action, otherwise, one action is randomly selected as the vehicle execution action;
Step 9: creating an experience pool D for storing the state, action, and reward information of the vehicle at each moment;
The state s_t at time t is passed through the deep neural network to obtain all action value functions, and the greedy strategy is used to select the action a_t, which is then executed by the vehicle;
After the vehicle executes action a_t in state s_t at time t, the state parameter s_{t+1} at time t+1 and the reward value r_t at time t are obtained, and the parameters are stored in the experience pool D as the tuple (s_t, a_t, r_t, s_{t+1});
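A minimal sketch of the experience pool D as a fixed-capacity buffer of (s_t, a_t, r_t, s_{t+1}) tuples (the class name and capacity value are illustrative, not specified by the method):

```python
from collections import deque

class ExperiencePool:
    """Experience pool D holding (s_t, a_t, r_t, s_next) tuples;
    oldest entries are discarded once capacity is reached."""
    def __init__(self, capacity=10000):
        self.data = deque(maxlen=capacity)

    def store(self, s_t, a_t, r_t, s_next):
        self.data.append((s_t, a_t, r_t, s_next))

pool = ExperiencePool()
# illustrative transition: [relative distance, speed diff, front speed]
pool.store([20.0, 3.5, 18.0], 2, 0.7, [20.3, 3.4, 18.1])
```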
Step 10: constructing a target neural network with the same structure as the deep neural network;
bs parameter tuples are obtained from the experience pool D using the priority extraction rule, and the state s_{t+1} at time t+1 is input into the target neural network, giving:
Q_ne = Relu(Relu(s_{t+1} × w_1′ + b_1′) × w_2′ + b_2′) (5)
In equation (5), Q_ne is the output of the target neural network's output layer, i.e. the Q values of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
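The forward pass of equation (5) can be sketched with NumPy; the layer sizes and random weights below are purely illustrative, not values from the method:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def target_forward(s_next, w1p, b1p, w2p, b2p):
    """Equation (5): Q_ne = Relu(Relu(s_{t+1} w1' + b1') w2' + b2')."""
    return relu(relu(s_next @ w1p + b1p) @ w2p + b2p)

# illustrative dimensions: 3 state inputs, 4 hidden units, 2 actions
rng = np.random.default_rng(seed=0)
w1p, b1p = rng.standard_normal((3, 4)), np.zeros(4)
w2p, b2p = rng.standard_normal((4, 2)), np.zeros(2)
q_ne = target_forward(np.array([25.0, 2.0, 24.0]), w1p, b1p, w2p, b2p)
```

Note that equation (5) applies Relu on the output layer as well, so the Q values produced by this sketch are non-negative.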
Step 11: establishing the target Q value Q_t;
The probability distribution π(a|s) of performing action a in state s is defined by equation (6):
π(a|s) = P(a_t = a | s_t = s) (6)
In equation (6), P denotes a conditional probability;
The state value function v_π(s) is obtained using equation (7):
v_π(s) = E_π(r_t + γr_{t+1} + γ^2 r_{t+2} + ··· | s_t = s) (7)
In equation (7), γ is the reward attenuation factor and E_π denotes the expectation;
obtaining the execution of action a at time t by equation (8) t Probability of going to the next state s
Obtaining an action cost function q by using the formula (9) π (s,a):
In the formula (9), the reaction mixture is,representing the reward value, v, of the vehicle after performing action a in state s π (s ') represents a state cost function for the vehicle at state s';
The target Q value Q_t is obtained by equation (10):
Q_t = r_t + γ max(Q_ne) (10)
Step 12: constructing the loss function loss using equation (11):
loss = ISW × (Q_t − Q_e)^2 (11)
Gradient descent is applied to the loss function loss to update the deep neural network parameters w_1, w_2, b_1, and b_2;
The target neural network parameters w_1′, w_2′, b_1′, and b_2′ are updated at an update frequency rt, with the updated values taken from the deep neural network;
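Equations (10) and (11) and the periodic target-network update can be sketched as follows; the batch shapes and the parameter-dictionary representation are assumptions of the sketch, not part of the method:

```python
import numpy as np

def td_target(r_t, q_ne, gamma):
    """Equation (10): Q_t = r_t + gamma * max(Q_ne), per batch row."""
    return r_t + gamma * np.max(q_ne, axis=1)

def per_loss(isw, q_t, q_e):
    """Equation (11): loss = ISW * (Q_t - Q_e)^2, averaged over the batch."""
    return float(np.mean(isw * (q_t - q_e) ** 2))

def sync_target(step, rt, online, target):
    """Copy the online-network parameters into the target network
    every rt steps (parameters held in plain dicts for the sketch)."""
    if step % rt == 0:
        for k in online:
            target[k] = online[k].copy()

q_ne = np.array([[1.0, 2.0], [3.0, 0.0]])   # target-net outputs for a batch of 2
q_t = td_target(np.array([0.5, 1.0]), q_ne, gamma=0.9)
loss = per_loss(np.ones(2), q_t, np.array([2.3, 3.7]))
```

In practice the gradient of this loss with respect to Q_e is backpropagated through w_1, w_2, b_1, b_2 by an autograd framework; the sketch only shows the scalar objective.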
Step 13: after assigning t+1 to t, judging whether t ≤ c; if so, returning to step 9 to continue training; otherwise, judging whether the loss value gradually decreases and tends to converge; if so, the trained deep neural network has been obtained; otherwise, setting t to c+1, increasing the number of network iterations, and returning to step 9;
Step 14: the real-time state parameter information of the vehicle is input into the trained deep neural network to obtain an output action, and the vehicle executes the corresponding action to complete longitudinal multi-state control.
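Once trained, the controller of step 14 reduces to one forward pass and an argmax; a sketch with placeholder weights (not trained values):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def control_action(state, w1, b1, w2, b2):
    """Step 14: forward the real-time state through the trained deep
    neural network and return the index of the highest-valued action."""
    q = relu(relu(np.asarray(state) @ w1 + b1) @ w2 + b2)
    return int(np.argmax(q))

# placeholder 2-input / 2-action weights chosen so action 1 dominates
a = control_action([1.0, 0.0], np.eye(2), np.zeros(2),
                   np.eye(2), np.array([0.0, 2.0]))
```

The returned index would then be mapped to the corresponding longitudinal command (e.g. a throttle or brake level) by the vehicle-side executor.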
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110267799.0A CN112861269B (en) | 2021-03-11 | 2021-03-11 | Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112861269A CN112861269A (en) | 2021-05-28 |
CN112861269B true CN112861269B (en) | 2022-08-30 |
Family
ID=75994127
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113734170B (en) * | 2021-08-19 | 2023-10-24 | 崔建勋 | Automatic driving lane change decision method based on deep Q learning |
CN113715842B (en) * | 2021-08-24 | 2023-02-03 | 华中科技大学 | High-speed moving vehicle control method based on imitation learning and reinforcement learning |
CN114527642B (en) * | 2022-03-03 | 2024-04-02 | 东北大学 | Method for automatically adjusting PID parameters by AGV based on deep reinforcement learning |
CN115303290B (en) * | 2022-10-09 | 2022-12-06 | 北京理工大学 | System key level switching method and system of vehicle hybrid key level system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110450771A (en) * | 2019-08-29 | 2019-11-15 | 合肥工业大学 | A kind of intelligent automobile stability control method based on deeply study |
CN110716550A (en) * | 2019-11-06 | 2020-01-21 | 南京理工大学 | Gear shifting strategy dynamic optimization method based on deep reinforcement learning |
CN110716562A (en) * | 2019-09-25 | 2020-01-21 | 南京航空航天大学 | Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning |
CN110850720A (en) * | 2019-11-26 | 2020-02-28 | 国网山东省电力公司电力科学研究院 | DQN algorithm-based area automatic power generation dynamic control method |
CN110969848A (en) * | 2019-11-26 | 2020-04-07 | 武汉理工大学 | Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes |
CN111605565A (en) * | 2020-05-08 | 2020-09-01 | 昆山小眼探索信息科技有限公司 | Automatic driving behavior decision method based on deep reinforcement learning |
CN111985614A (en) * | 2020-07-23 | 2020-11-24 | 中国科学院计算技术研究所 | Method, system and medium for constructing automatic driving decision system |
CN112162555A (en) * | 2020-09-23 | 2021-01-01 | 燕山大学 | Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet |
CN112406867A (en) * | 2020-11-19 | 2021-02-26 | 清华大学 | Emergency vehicle hybrid lane change decision method based on reinforcement learning and avoidance strategy |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111316295B (en) * | 2017-10-27 | 2023-09-22 | 渊慧科技有限公司 | Reinforcement learning using distributed prioritized playback |
US11688160B2 (en) * | 2018-01-17 | 2023-06-27 | Huawei Technologies Co., Ltd. | Method of generating training data for training a neural network, method of training a neural network and using neural network for autonomous operations |
US10845815B2 (en) * | 2018-07-27 | 2020-11-24 | GM Global Technology Operations LLC | Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents |
Non-Patent Citations (3)
Title |
---|
A review of communication, driver characteristics, and controls aspects of cooperative adaptive cruise control; Dey, K.C.; Li, Yan; IEEE Transactions on Intelligent Transportation Systems; Feb. 2016; Vol. 17, No. 2; pp. 491-509 *
Cooperative adaptive cruise control based on deep reinforcement learning; Wang Wensa, Liang Jun, Chen Long, Chen Xiaobo, Zhu Ning, Hua Guodong; Journal of Transport Information and Safety; Mar. 2019; Vol. 37, No. 3; pp. 93-100 *
Automatic parking control strategy based on deep reinforcement learning; Huang He, Guo Weifeng, Mei Weiwei, Zhang Run, Cheng Jin, Zhang Bingli; Proceedings of the 2020 Annual Congress of the China Society of Automotive Engineers; Dec. 2020; pp. 181-189 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112861269B (en) | Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction | |
CN111898211B (en) | Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof | |
CN107229973B (en) | Method and device for generating strategy network model for automatic vehicle driving | |
CN111845701B (en) | HEV energy management method based on deep reinforcement learning in car following environment | |
CN111845741B (en) | Automatic driving decision control method and system based on hierarchical reinforcement learning | |
CN106740846A (en) | A kind of electric automobile self-adapting cruise control method of double mode switching | |
CN111332362B (en) | Intelligent steer-by-wire control method integrating individual character of driver | |
EP3725627A1 (en) | Method and apparatus for generating vehicle control command, and vehicle controller and storage medium | |
CN110949398A (en) | Method for detecting abnormal driving behavior of first-vehicle drivers in vehicle formation driving | |
CN113276884B (en) | Intelligent vehicle interactive decision passing method and system with variable game mode | |
CN113954837B (en) | Deep learning-based lane change decision-making method for large-scale commercial vehicle | |
CN112668779A (en) | Preceding vehicle motion state prediction method based on self-adaptive Gaussian process | |
CN113096402B (en) | Dynamic speed limit control method, system, terminal and readable storage medium based on intelligent networked vehicle | |
CN109436085A (en) | A kind of wire-controlled steering system gearratio control method based on driving style | |
JP7415471B2 (en) | Driving evaluation device, driving evaluation system, in-vehicle device, external evaluation device, and driving evaluation program | |
CN113722835A (en) | Modeling method for anthropomorphic random lane change driving behavior | |
CN110879595A (en) | Unmanned mine card tracking control system and method based on deep reinforcement learning | |
CN114852105A (en) | Method and system for planning track change of automatic driving vehicle | |
CN112158045A (en) | Active suspension control method based on depth certainty strategy gradient | |
CN112542061B (en) | Lane borrowing and overtaking control method, device and system based on Internet of vehicles and storage medium | |
CN114030485A (en) | Automatic driving automobile man lane change decision planning method considering attachment coefficient | |
CN114074680A (en) | Vehicle lane change behavior decision method and system based on deep reinforcement learning | |
CN114148349B (en) | Vehicle personalized following control method based on generation of countermeasure imitation study | |
CN113033902B (en) | Automatic driving lane change track planning method based on improved deep learning | |
WO2023004698A1 (en) | Method for intelligent driving decision-making, vehicle movement control method, apparatus, and vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||