CN111845773A - Automatic driving vehicle micro-decision-making method based on reinforcement learning - Google Patents
Automatic driving vehicle micro-decision-making method based on reinforcement learning
- Publication number
- CN111845773A CN111845773A CN202010642778.8A CN202010642778A CN111845773A CN 111845773 A CN111845773 A CN 111845773A CN 202010642778 A CN202010642778 A CN 202010642778A CN 111845773 A CN111845773 A CN 111845773A
- Authority
- CN
- China
- Prior art keywords
- driving
- network
- vehicle
- decision
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0019—Control system elements or transfer functions
- B60W2050/0028—Mathematical models, e.g. for simulation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Automation & Control Theory (AREA)
- Human Computer Interaction (AREA)
- Transportation (AREA)
- Mechanical Engineering (AREA)
- Probability & Statistics with Applications (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention discloses a reinforcement-learning-based micro-decision method for automatic driving vehicles. The method adopts the A3C reinforcement learning algorithm; the driving behavior is output by an Actor network, which gives high flexibility, and the complexity of the decision logic is not affected by the size of the state space or the behavior space. The method employs a two-stage training and solving process. In the first stage, an automatic driving micro-decision model applicable to all road sections is obtained through training so as to ensure driving safety. In the second stage, the overall model of the first stage is deployed to each road section, and each road section trains its own single-road-section model on that basis, which provides portability. Meanwhile, the continuous training of the second stage enables the method to adapt to the influence of various real-time factors. Finally, a distributed communication architecture based on a real Internet-of-Vehicles system structure is set forth, which can complete the distributed computation of the solving process, so that the method adapts to different road characteristics and dynamic driving environments and has wide applicability and robustness.
Description
Technical Field
The invention relates to the technical field of automatic driving, in particular to an automatic driving vehicle micro-decision method based on reinforcement learning.
Background
The automatic driving technology is one of the core technologies of intelligent transportation. Automatic driving decisions are generally divided into two types. One type is the macroscopic path planning problem: after the starting point and the destination of a vehicle are determined, factors such as driving distance and congestion are comprehensively considered in order to select an optimal driving path. The other type is the microscopic decision problem addressed by the present invention, namely how the vehicle selects specific driving behaviors in real time during driving according to the surrounding environment.
In the prior art, automatic driving vehicle micro-decision models fall into the following categories:
Finite state machine model: the vehicle selects a suitable driving behavior from predefined behavior modes such as parking, lane changing, overtaking, avoiding and slow driving according to the environment;
Decision tree model: driving behavior modes are expressed by the tree structure of the model, the judgment logic is solidified at the branch nodes of the tree, and a top-down search mechanism is carried out.
For example, the invention patent with Chinese patent publication No. CN110969848A discloses an automatic driving overtaking decision method based on reinforcement learning under opposite double lanes, which comprises the following steps: collecting the traffic state of the automatic driving vehicle through sensors; inputting the acquired traffic state into a trained decision model; the decision model selects and outputs a corresponding driving action instruction from the action space according to the input information, and the automatic driving vehicle forms a new traffic state after executing the driving action; calculating the reward value of the driving action through a reward function, and storing the original traffic state, the driving action, the reward value and the new traffic state as a transition sample in an experience replay pool; calculating the loss function value of the decision model, and optimizing the parameters of the decision model according to the transition samples and the loss function value; and repeating the above steps until automatic driving is finished. The reinforcement learning decision method ensures the safety and comfort of the overtaking decision process of the automatic driving vehicle and improves the human-likeness and robustness of the decision.
For another example, the invention patent with Chinese patent publication No. CN109624986A discloses a mode-switching-based learning cruise control system and method for adaptive cruise control, which performs mode switching according to a specific driver style and adaptive learning of car-following behavior. The system defines the driving style as a switching strategy among several modes of constant-speed cruising, accelerating approach, steady-state following and rapid braking under different following conditions, learns the driving style, and further learns the driving characteristics under each driving mode by using a continuous-state-based learning method.
The prior art has at least the following problems:
Both the finite state machine model and the decision tree model ignore environmental uncertainty and cannot adapt well to dynamic changes of the environment; when more behavior modes are defined, the state space and the behavior space become larger, the judgment logic becomes complex, the feasibility is low, and it is difficult to show good decision performance on urban roads with rich structural features.
No effective solution to these problems of the prior art has yet been proposed.
Disclosure of Invention
The invention aims to provide a reinforcement-learning-based automatic driving vehicle micro-decision method that addresses the above defects of the prior art and meets both the safety requirement and the driving efficiency requirement of automatic driving.
The automatic driving vehicle micro-decision method comprises the following steps:
step 1, reinforcement learning modeling is carried out, and an automatic driving decision scheme is subjected to modeling representation;
Step 2, designing the solving network: for the optimal vehicle micro-decision scheme related to the driving micro-decision of step 1, the solution is then obtained with the A3C algorithm; in the A3C algorithm, the global network and the agent networks each comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and of all agent networks are respectively identical; in terms of network structure, both the Actor network and the Critic network take the state as input and, in combination with step 1, adopt a neural network composed of a convolutional layer and fully connected layers, wherein the Actor network represents the policy function and its output layer gives the parameters μ_θ(s) and σ_θ(s) of the probability density function in the policy function, and the Critic network represents the state-value function and its output layer gives the state value;
Step 3, solving the decision scheme: based on the model, the decision scheme and the solving network defined in steps 1 and 2, the Actor network and the Critic network are trained to obtain the optimal policy.
Further, in step 1, the method further comprises the following steps:
Step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatic driving vehicle is regarded as an agent and its driving environment as the reinforcement learning environment; the agent vehicle makes driving decisions and executes driving behaviors according to the detected environment information and adjusts its driving decisions according to the driving results; the driving time is divided into a number of time slots, each agent vehicle makes a driving decision at the beginning of each time slot, and this decision determines the driving behavior of the agent vehicle within that time slot;
step 1.2, modeling is carried out by using basic elements in reinforcement learning:
Step 1.2.1, state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction (for a curved lane, the y direction is the tangential direction of the lane); the positions and speeds of the agent vehicle and of the I nearest surrounding vehicles are defined as the state, and the state is expressed as s = [c_0, c_1, c_2, ..., c_I], where s is one sample in the state set, c_0 = [x_0, y_0, v_0x, v_0y] is the vector of the agent vehicle's positions and velocities in the x and y directions, c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, and Δx_i, Δy_i, v_ix, v_iy respectively represent the x- and y-direction relative distances and velocities of the i-th vehicle closest to the agent vehicle;
Step 1.2.2, action set: the moving distance of the agent vehicle in the two directions in each time slot is defined as an action, expressed as a = [x, y], X_m < x < X_M, Y_m < y < Y_M, where a is one sample in the action set, x and y respectively represent the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively represent the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, policy function: the policy function π: S → A is the mapping from states to actions and represents the specific way in which the agent selects an action according to the current state; the policy function is defined as a stochastic function π_θ(a|s), whose value represents the probability of taking action a in state s, that is, the policy function is a probability density function, and the action is obtained by sampling according to this probability density, as shown in formula (1):
π_θ(a|s) = N(a; μ_θ(s), σ_θ(s)), a_m ≤ a ≤ a_M    (1)
where N(·) denotes the Gaussian probability density, a_m = [X_m, Y_m] and a_M = [X_M, Y_M] represent the minimum and maximum values of the action, μ_θ(s) represents the mean of the distribution, and σ_θ(s) represents the variance of the distribution;
Step 1.2.4, reward function: the reward function specifies the reward value obtained after a certain action is performed in a certain state, reflecting the quality of the action selection, and is defined by formula (2), in which the coefficient k_c is positive;
Step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and, with safety and driving efficiency taken into consideration, the following optimization objective is defined; for each agent, an action is selected according to the policy function in the initial state so as to reach the next state, this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated; for this trajectory, the cumulative discount return is expressed as formula (3):
R(τ(π_θ)) = Σ_t γ^t · r_t    (3)
where γ is the discount factor, representing the importance of the return at a future time to the decision at the current time, and r_t represents the reward obtained by the agent at time t; the expectation of the cumulative discount return is taken as the objective function, as shown in formula (4):
J(θ) = E_{τ~π_θ}[ Σ_t γ^t · r_t ]    (4)
where E_{τ~π_θ}[·] denotes the expectation of the cumulative discount return over trajectories generated by the policy;
Step 1.2.6, optimizing the decision scheme: the driving decision scheme is to find the optimal policy π* that maximizes the objective function; the optimization of the policy is essentially the optimization of the parameter θ of the policy function, and the optimization decision scheme is finally expressed as formula (5):
θ* = arg max_θ J(θ)    (5)
After the optimal parameter θ* is obtained, the optimal policy is expressed as π_θ*, i.e., the optimal vehicle micro-decision scheme.
Further, in step 3, the method further comprises the following steps:
Step 3.1, training the global strategy: the training process of this stage aims to obtain a basic driving strategy model applicable to all road sections and is deployed on a two-layer structure consisting of a central server and RSUs (Road Side Units); in this stage the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is one agent; the specific deployment process is as follows:
step 3.1.1, deploying the decision neural network on the central server and all RSUs;
step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
Step 3.1.2.1, for each RSU, the driving trajectories of the vehicles on the road it covers are collected to simulate the driving environment; an agent is randomly generated to execute driving behaviors in the simulated driving environment; state information obtained from the driving environment is input into the Actor and Critic networks; a driving decision is made according to the output of the Actor network and the corresponding driving action is executed; after this interaction, the interaction result is obtained and the driving environment of the next state is reached; the interaction continues until sampled driving-trajectory data are generated;
Step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
Step 3.1.2.5, after the network converges, the network parameters no longer change, and a basic model applicable to all road sections is obtained;
Step 3.2, training the single-road model: the global network layer of step 3.1 is sunk to the RSU of each road, and the agent layer is sunk to all the automatic driving vehicles covered by that RSU on the road; each road is deployed specifically in the following way:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
Step 3.2.2, when an automatic driving vehicle enters the road covered by the RSU, it synchronizes the global network model from the RSU, becomes an agent in the road agent layer, and performs the training of the decision network:
Step 3.2.1.1, for each vehicle, the vehicle is taken as an agent and its driving behavior trajectory is taken as the training sample, and the same process as step 3.1.2.1 is performed to obtain trajectory sampling data;
Step 3.2.1.2, the vehicle trains a local decision network using its local driving trajectory data set and uploads the training result to the RSU;
Step 3.2.1.3, after the RSU collects the training result transmitted by one vehicle, it updates the global network once and returns the updated global network parameters to the vehicle;
Step 3.2.1.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes them to its local network and starts a new round of sample collection and training on that basis, until the vehicle leaves the current road.
Compared with the prior art, the automatic driving vehicle micro-decision method has the following remarkable advantages:
1. The design of the invention is combined with the network architecture of the Internet of Vehicles, so it is easy to deploy and highly feasible.
2. The invention does not use predefined driving modes, so the driving behavior is more flexible and adaptable; the complexity of decision making does not increase with the growth of the state space and the behavior space, and the calculation is simpler.
3. The first stage of the invention yields a universal driving model that ensures driving safety on different road sections; therefore, when a road section is newly added, the model only needs to be synchronized from the central server and the RSU and the automatic driving vehicles can immediately start the training process, giving strong universality and portability.
4. The second stage of the invention yields a dedicated driving model for each road section. Compared with using the same model for all road sections, these models better adapt to the characteristics of different road sections, and under the driving environment of a given road section the single-road-section model achieves higher driving efficiency than a model shared by all road sections; in addition, compared with independently training one model for each road section from scratch, the training and calculation cost of the invention is lower.
5. Compared with a fixed driving strategy model, the second-stage model can adapt to constantly changing real-time factors such as road conditions, weather and traffic flow density, and therefore has better robustness.
Drawings
FIG. 1 is a schematic diagram of the calculation structure of the A3C algorithm of the reinforcement learning-based automatic vehicle micro-decision making method according to the present invention;
FIG. 2 is a schematic diagram of a three-layer system architecture of the reinforcement learning-based automated vehicle micro-decision making method of the present invention;
FIG. 3 is a schematic diagram of the Actor network structure of the reinforcement learning-based automatic driving vehicle micro-decision method of the present invention;
FIG. 4 is a schematic diagram of the Critic network structure of the reinforcement learning-based automatic driving vehicle micro-decision method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
As shown in fig. 1 to 4, the automatic driving vehicle micro-decision method includes the following steps:
Step 1, reinforcement learning modeling: the automatic driving decision scheme is modeled and represented:
Step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatic driving vehicle is regarded as an agent and its driving environment as the reinforcement learning environment; the agent vehicle makes driving decisions and executes driving behaviors according to the detected environment information and adjusts its driving decisions according to the driving results; the driving time is divided into a number of time slots, each agent vehicle makes a driving decision at the beginning of each time slot, and this decision determines the driving behavior of the agent vehicle within that time slot;
step 1.2, modeling is carried out by using basic elements in reinforcement learning:
Step 1.2.1, state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction (for a curved lane, the y direction is the tangential direction of the lane); the positions and speeds of the agent vehicle and of the I nearest surrounding vehicles are defined as the state, and the state is expressed as s = [c_0, c_1, c_2, ..., c_I], where s is one sample in the state set, c_0 = [x_0, y_0, v_0x, v_0y] is the vector of the agent vehicle's positions and velocities in the x and y directions, c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, and Δx_i, Δy_i, v_ix, v_iy respectively represent the x- and y-direction relative distances and velocities of the i-th vehicle closest to the agent vehicle;
Step 1.2.2, action set: the moving distance of the agent vehicle in the two directions in each time slot is defined as an action, expressed as a = [x, y], X_m < x < X_M, Y_m < y < Y_M, where a is one sample in the action set, x and y respectively represent the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively represent the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, policy function: the policy function π: S → A is the mapping from states to actions and represents the specific way in which the agent selects an action according to the current state; the policy function is defined as a stochastic function π_θ(a|s), whose value represents the probability of taking action a in state s, that is, the policy function is a probability density function, and the action is obtained by sampling according to this probability density, as shown in formula (1):
π_θ(a|s) = N(a; μ_θ(s), σ_θ(s)), a_m ≤ a ≤ a_M    (1)
where N(·) denotes the Gaussian probability density, a_m = [X_m, Y_m] and a_M = [X_M, Y_M] represent the minimum and maximum values of the action, μ_θ(s) represents the mean of the distribution, and σ_θ(s) represents the variance of the distribution;
Step 1.2.4, reward function: the reward function specifies the reward value obtained after a certain action is performed in a certain state, reflecting the quality of the action selection, and is defined by formula (2), in which the coefficient k_c is positive;
Step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and, with safety and driving efficiency taken into consideration, the following optimization objective is defined; for each agent, an action is selected according to the policy function in the initial state so as to reach the next state, this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated; for this trajectory, the cumulative discount return is expressed as formula (3):
R(τ(π_θ)) = Σ_t γ^t · r_t    (3)
where γ is the discount factor, representing the importance of the return at a future time to the decision at the current time, and r_t represents the reward obtained by the agent at time t; the expectation of the cumulative discount return is taken as the objective function, as shown in formula (4):
J(θ) = E_{τ~π_θ}[ Σ_t γ^t · r_t ]    (4)
where E_{τ~π_θ}[·] denotes the expectation of the cumulative discount return over trajectories generated by the policy;
Step 1.2.6, optimizing the decision scheme: the driving decision scheme is to find the optimal policy π* that maximizes the objective function; the optimization of the policy is essentially the optimization of the parameter θ of the policy function, and the optimization decision scheme is finally expressed as formula (5):
θ* = arg max_θ J(θ)    (5)
After the optimal parameter θ* is obtained, the optimal policy is expressed as π_θ*, i.e., the optimal vehicle micro-decision scheme. A numerical sketch of these modelling elements is given below.
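The following Python sketch is not part of the patent text; the movement bounds, the number I of neighbouring vehicles and the discount factor are assumed values chosen only for illustration. It builds the two-dimensional state s = [c_0, ..., c_I], samples one bounded action from a Gaussian policy with mean μ_θ(s) and variance σ_θ(s), and evaluates the cumulative discount return of formula (3).

```python
import numpy as np

I = 4                                  # assumed number of nearest surrounding vehicles
X_MIN, X_MAX = -1.0, 1.0               # assumed per-slot movement bounds in the x direction (m)
Y_MIN, Y_MAX = 0.0, 30.0               # assumed per-slot movement bounds in the y direction (m)
GAMMA = 0.9                            # assumed discount factor

def build_state(ego, neighbours):
    """State s = [c_0, c_1, ..., c_I]: c_0 holds the agent vehicle's position and
    velocity, c_i the relative position and velocity of the i-th nearest vehicle."""
    c0 = np.array([ego["x"], ego["y"], ego["vx"], ego["vy"]])
    cs = [np.array([n["dx"], n["dy"], n["vx"], n["vy"]]) for n in neighbours[:I]]
    return np.stack([c0] + cs)         # shape (I + 1, 4): the two-dimensional state

def sample_action(mu, sigma, rng):
    """Draw a = [x, y] from the Gaussian policy and clip it to [a_m, a_M]."""
    a = rng.normal(mu, np.sqrt(sigma))  # sigma is the variance, so the scale is sqrt(sigma)
    return np.clip(a, [X_MIN, Y_MIN], [X_MAX, Y_MAX])

def discounted_return(rewards):
    """Cumulative discount return of formula (3): sum over t of gamma^t * r_t."""
    return sum(GAMMA ** t * r for t, r in enumerate(rewards))
```

For example, sample_action(np.array([0.0, 10.0]), np.array([0.04, 4.0]), np.random.default_rng(0)) yields one clipped [x, y] movement for the current time slot.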
Step 2, designing the solving network: for the optimal vehicle micro-decision scheme related to the driving micro-decision of step 1, the solution is then obtained with the A3C algorithm; in the A3C algorithm, the global network and the agent networks each comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and of all agent networks are respectively identical; in terms of network structure, both the Actor network and the Critic network take the state as input and, in combination with the two-dimensional structural characteristics of the state defined in step 1.2.1, adopt the neural networks composed of a convolutional layer and fully connected layers shown in figs. 3 and 4, wherein the Actor network represents the policy function and its output layer gives the parameters μ_θ(s) and σ_θ(s) of the probability density function in the policy function, and the Critic network represents the state-value function and its output layer gives the state value.
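The network design of step 2 can be sketched as follows. This is a schematic PyTorch example rather than the exact layer layouts of figs. 3 and 4: the Actor outputs μ_θ(s) and σ_θ(s), the Critic outputs the state value, both reading the (I+1)×4 state matrix through a convolutional layer followed by fully connected layers; all channel and layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Policy network: state -> (mu_theta(s), sigma_theta(s)) of the Gaussian policy."""
    def __init__(self, num_vehicles=5, feat=4, action_dim=2):
        super().__init__()
        self.conv = nn.Conv1d(feat, 16, kernel_size=1)        # per-vehicle feature extraction
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(16 * num_vehicles, 64), nn.ReLU())
        self.mu = nn.Linear(64, action_dim)
        self.sigma = nn.Linear(64, action_dim)

    def forward(self, s):                                      # s: (batch, num_vehicles, feat)
        h = self.fc(torch.relu(self.conv(s.transpose(1, 2))))
        return self.mu(h), F.softplus(self.sigma(h)) + 1e-5    # keep the variance positive

class Critic(nn.Module):
    """Value network: state -> scalar state value."""
    def __init__(self, num_vehicles=5, feat=4):
        super().__init__()
        self.conv = nn.Conv1d(feat, 16, kernel_size=1)
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(16 * num_vehicles, 64),
                                nn.ReLU(), nn.Linear(64, 1))

    def forward(self, s):
        return self.fc(torch.relu(self.conv(s.transpose(1, 2))))

# Example: one state with the agent vehicle and 4 neighbouring vehicles, 4 features each
state = torch.randn(1, 5, 4)
mu, sigma = Actor()(state)
value = Critic()(state)
```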
Step 3, solving the decision scheme: based on the model, the decision scheme and the solving network defined in steps 1 and 2, the Actor network and the Critic network are trained to obtain the optimal policy:
Step 3.1, training the global strategy: the training process of this stage aims to obtain a basic driving strategy model applicable to all road sections and is deployed on a two-layer structure consisting of a central server and RSUs (Road Side Units); in this stage the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is one agent; the specific deployment process is as follows:
Step 3.1.1, deploying the decision neural network on the central server and all RSUs;
step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
Step 3.1.2.1, for each RSU, the driving trajectories of the vehicles on the road it covers are collected to simulate the driving environment; an agent is randomly generated to execute driving behaviors in the simulated driving environment; state information obtained from the driving environment is input into the Actor and Critic networks; a driving decision is made according to the output of the Actor network and the corresponding driving action is executed; after this interaction, the interaction result is obtained and the driving environment of the next state is reached; the interaction continues until sampled driving-trajectory data are generated;
step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
Step 3.1.2.5, after the network convergence, the network parameters are not changed, and a basic model suitable for all road sections is obtained;
Step 3.2, training the single-road model: the global network layer of step 3.1 is sunk to the RSU of each road, and the agent layer is sunk to all the automatic driving vehicles covered by that RSU on the road; each road is deployed specifically in the following way:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
Step 3.2.2, when an automatic driving vehicle enters the road covered by the RSU, it synchronizes the global network model from the RSU, becomes an agent in the road agent layer, and performs the training of the decision network:
Step 3.2.1.1, for each vehicle, the vehicle is taken as an agent and its driving behavior trajectory is taken as the training sample, and the same process as step 3.1.2.1 is performed to obtain trajectory sampling data;
Step 3.2.1.2, the vehicle trains a local decision network using its local driving trajectory data set and uploads the training result to the RSU;
Step 3.2.1.3, after the RSU collects the training result transmitted by one vehicle, it updates the global network once and returns the updated global network parameters to the vehicle;
Step 3.2.1.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes them to its local network and starts a new round of sample collection and training on that basis, until the vehicle leaves the current road (a simplified sketch of this stage-two deployment is given below).
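The stage-two deployment can likewise be sketched as follows: the RSU holds the road-level global model initialised from the stage-one base model, and each vehicle synchronizes it on entering the road, trains on its own trajectories, and exchanges updates with the RSU until it leaves. The class names, learning rate and the local training step are placeholders assumed for illustration only.

```python
class RoadRSU:
    """Road-level global network layer (one RSU per road)."""
    def __init__(self, base_model):
        self.global_model = dict(base_model)     # step 3.2.1: base model synced from the central server

    def register(self, vehicle):
        vehicle.model = dict(self.global_model)  # step 3.2.2: vehicle syncs the model on entering the road

    def receive_update(self, update, lr=1e-3):   # step 3.2.1.3
        for k in self.global_model:
            self.global_model[k] -= lr * update[k]
        return dict(self.global_model)

class Vehicle:
    """Agent layer: an automatic driving vehicle currently on the road."""
    def __init__(self, slots_on_road=5):
        self.model = None
        self.slots_left = slots_on_road          # leaves the road when this reaches zero

    def drive_and_train(self):                   # steps 3.2.1.1-3.2.1.2 (placeholder result)
        self.slots_left -= 1
        return {k: 0.0 for k in self.model}

base = {"actor": 0.0, "critic": 0.0}             # toy scalar stand-ins for the network parameters
rsu, car = RoadRSU(base), Vehicle()
rsu.register(car)
while car.slots_left > 0:                        # step 3.2.1.4: repeat until the vehicle leaves the road
    car.model = rsu.receive_update(car.drive_and_train())
```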
It should be noted that the Actor network structures of the global network and of all agents are the same, and the Critic network structures of the global network and of all agents are the same; that is, the global network and every agent network have the same structure, each comprising an Actor network and a Critic network, with all Actor networks sharing one structure and all Critic networks sharing another.
The above description is only for the preferred embodiment of the present invention and should not be construed as limiting the present invention, and various modifications and changes can be made by those skilled in the art without departing from the spirit and principle of the present invention, and any modifications, equivalents, improvements, etc. should be included in the scope of the claims of the present invention.
Claims (3)
1. An automatic driving vehicle micro-decision method based on reinforcement learning is characterized by comprising the following steps:
step 1, reinforcement learning modeling is carried out, and an automatic driving decision scheme is subjected to modeling representation;
Step 2, designing the solving network: for the optimal vehicle micro-decision scheme related to the driving micro-decision of step 1, the solution is then obtained with the A3C algorithm; in the A3C algorithm, the global network and the agent networks each comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and of all agent networks are respectively identical; in terms of network structure, both the Actor network and the Critic network take the state as input and, in combination with step 1, adopt a neural network composed of a convolutional layer and fully connected layers, wherein the Actor network represents the policy function and its output layer gives the parameters μ_θ(s) and σ_θ(s) of the probability density function in the policy function, and the Critic network represents the state-value function and its output layer gives the state value;
Step 3, solving the decision scheme: based on the model, the decision scheme and the solving network defined in steps 1 and 2, the Actor network and the Critic network are trained to obtain the optimal policy.
2. The reinforcement learning-based automated driving vehicle micro-decision method according to claim 1, further comprising, in step 1, the steps of:
Step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatic driving vehicle is regarded as an agent and its driving environment as the reinforcement learning environment; the agent vehicle makes driving decisions and executes driving behaviors according to the detected environment information and adjusts its driving decisions according to the driving results; the driving time is divided into a number of time slots, each agent vehicle makes a driving decision at the beginning of each time slot, and this decision determines the driving behavior of the agent vehicle within that time slot;
Step 1.2, modeling is carried out by using basic elements in reinforcement learning:
Step 1.2.1, state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction (for a curved lane, the y direction is the tangential direction of the lane); the positions and speeds of the agent vehicle and of the I nearest surrounding vehicles are defined as the state, and the state is expressed as s = [c_0, c_1, c_2, ..., c_I], where s is one sample in the state set, c_0 = [x_0, y_0, v_0x, v_0y] is the vector of the agent vehicle's positions and velocities in the x and y directions, c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, and Δx_i, Δy_i, v_ix, v_iy respectively represent the x- and y-direction relative distances and velocities of the i-th vehicle closest to the agent vehicle;
Step 1.2.2, action set: the moving distance of the agent vehicle in the two directions in each time slot is defined as an action, expressed as a = [x, y], X_m < x < X_M, Y_m < y < Y_M, where a is one sample in the action set, x and y respectively represent the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively represent the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, policy function: the policy function π: S → A is the mapping from states to actions and represents the specific way in which the agent selects an action according to the current state; the policy function is defined as a stochastic function π_θ(a|s), whose value represents the probability of taking action a in state s, that is, the policy function is a probability density function, and the action is obtained by sampling according to this probability density, as shown in formula (1):
π_θ(a|s) = N(a; μ_θ(s), σ_θ(s)), a_m ≤ a ≤ a_M    (1)
where N(·) denotes the Gaussian probability density, a_m = [X_m, Y_m] and a_M = [X_M, Y_M] represent the minimum and maximum values of the action, μ_θ(s) represents the mean of the distribution, and σ_θ(s) represents the variance of the distribution;
Step 1.2.4, reward function: the reward function specifies the reward value obtained after a certain action is performed in a certain state, reflecting the quality of the action selection, and is defined by formula (2), in which the coefficient k_c is positive;
Step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and, with safety and driving efficiency taken into consideration, the following optimization objective is defined; for each agent, an action is selected according to the policy function in the initial state so as to reach the next state, this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated; for this trajectory, the cumulative discount return is expressed as formula (3):
R(τ(π_θ)) = Σ_t γ^t · r_t    (3)
where γ is the discount factor, representing the importance of the return at a future time to the decision at the current time, and r_t represents the reward obtained by the agent at time t; the expectation of the cumulative discount return is taken as the objective function, as shown in formula (4):
J(θ) = E_{τ~π_θ}[ Σ_t γ^t · r_t ]    (4)
where E_{τ~π_θ}[·] denotes the expectation of the cumulative discount return over trajectories generated by the policy;
Step 1.2.6, optimizing the decision scheme: the driving decision scheme is to find the optimal policy π* that maximizes the objective function; the optimization of the policy is essentially the optimization of the parameter θ of the policy function, and the optimization decision scheme is finally expressed as formula (5):
θ* = arg max_θ J(θ)    (5).
3. The reinforcement learning-based automated vehicular micro decision making method according to claim 1, further comprising, in step 3, the steps of:
Step 3.1, training the global strategy: the training process of this stage aims to obtain a basic driving strategy model applicable to all road sections and is deployed on a two-layer structure consisting of a central server and RSUs; in this stage the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is one agent; the specific deployment process is as follows:
step 3.1.1, deploying the decision neural network on the central server and all RSUs;
Step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
Step 3.1.2.1, for each RSU, the driving trajectories of the vehicles on the road it covers are collected to simulate the driving environment; an agent is randomly generated to execute driving behaviors in the simulated driving environment; state information obtained from the driving environment is input into the Actor and Critic networks; a driving decision is made according to the output of the Actor network and the corresponding driving action is executed; after this interaction, the interaction result is obtained and the driving environment of the next state is reached; the interaction continues until sampled driving-trajectory data are generated;
step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
Step 3.1.2.5, after the network converges, the network parameters no longer change, and a basic model applicable to all road sections is obtained;
Step 3.2, training the single-road model: the global network layer of step 3.1 is sunk to the RSU of each road, and the agent layer is sunk to all the automatic driving vehicles covered by that RSU on the road; each road is deployed specifically in the following way:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
Step 3.2.2, when an automatic driving vehicle enters the road covered by the RSU, it synchronizes the global network model from the RSU, becomes an agent in the road agent layer, and performs the training of the decision network:
Step 3.2.1.1, for each vehicle, the vehicle is taken as an agent and its driving behavior trajectory is taken as the training sample, and the same process as step 3.1.2.1 is performed to obtain trajectory sampling data;
Step 3.2.1.2, the vehicle trains a local decision network using its local driving trajectory data set and uploads the training result to the RSU;
Step 3.2.1.3, after the RSU collects the training result transmitted by one vehicle, it updates the global network once and returns the updated global network parameters to the vehicle;
Step 3.2.1.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes them to its local network and starts a new round of sample collection and training on that basis, until the vehicle leaves the current road.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010642778.8A CN111845773B (en) | 2020-07-06 | 2020-07-06 | Automatic driving vehicle micro-decision-making method based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010642778.8A CN111845773B (en) | 2020-07-06 | 2020-07-06 | Automatic driving vehicle micro-decision-making method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111845773A true CN111845773A (en) | 2020-10-30 |
CN111845773B CN111845773B (en) | 2021-10-26 |
Family
ID=73153538
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010642778.8A Active CN111845773B (en) | 2020-07-06 | 2020-07-06 | Automatic driving vehicle micro-decision-making method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111845773B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348201A (en) * | 2020-11-11 | 2021-02-09 | 扬州大学 | Intelligent decision implementation method for automatic driving group vehicle based on federated deep reinforcement learning
CN112644516A (en) * | 2020-12-16 | 2021-04-13 | 吉林大学青岛汽车研究院 | Unmanned control system and control method suitable for roundabout scene |
CN112700642A (en) * | 2020-12-19 | 2021-04-23 | 北京工业大学 | Method for improving traffic passing efficiency by using intelligent internet vehicle |
CN112896187A (en) * | 2021-02-08 | 2021-06-04 | 浙江大学 | System and method for considering social compatibility and making automatic driving decision |
CN113044064A (en) * | 2021-04-01 | 2021-06-29 | 南京大学 | Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning |
CN113071524A (en) * | 2021-04-29 | 2021-07-06 | 深圳大学 | Decision control method, decision control device, autonomous driving vehicle and storage medium |
CN113099418A (en) * | 2021-03-26 | 2021-07-09 | 深圳供电局有限公司 | Optimization method of block chain task for data transmission of Internet of vehicles |
CN113501008A (en) * | 2021-08-12 | 2021-10-15 | 东风悦享科技有限公司 | Automatic driving behavior decision method based on reinforcement learning algorithm |
CN113511222A (en) * | 2021-08-27 | 2021-10-19 | 清华大学 | Scene self-adaptive vehicle interactive behavior decision and prediction method and device |
CN113619604A (en) * | 2021-08-26 | 2021-11-09 | 清华大学 | Integrated decision and control method and device for automatic driving automobile and storage medium |
CN117828489A (en) * | 2024-03-05 | 2024-04-05 | 河钢国际科技(北京)有限公司 | Intelligent ship remote dynamic control system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015094645A1 (en) * | 2013-12-22 | 2015-06-25 | Lytx, Inc. | Autonomous driving comparison and evaluation |
US20180011488A1 (en) * | 2016-07-08 | 2018-01-11 | Toyota Motor Engineering & Manufacturing North America, Inc. | Control policy learning and vehicle control method based on reinforcement learning without active exploration |
CN109213148A (en) * | 2018-08-03 | 2019-01-15 | 东南大学 | Vehicle low-speed car-following decision method based on deep reinforcement learning
CN109682392A (en) * | 2018-12-28 | 2019-04-26 | 山东大学 | Visual navigation method and system based on deep reinforcement learning
CN110320883A (en) * | 2018-03-28 | 2019-10-11 | 上海汽车集团股份有限公司 | Vehicle automatic driving control method and device based on a reinforcement learning algorithm
CN110406530A (en) * | 2019-07-02 | 2019-11-05 | 宁波吉利汽车研究开发有限公司 | Automatic driving method, apparatus, device and vehicle
CN111026127A (en) * | 2019-12-27 | 2020-04-17 | 南京大学 | Automatic driving decision method and system based on partially observable transfer reinforcement learning |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015094645A1 (en) * | 2013-12-22 | 2015-06-25 | Lytx, Inc. | Autonomous driving comparison and evaluation |
US20180011488A1 (en) * | 2016-07-08 | 2018-01-11 | Toyota Motor Engineering & Manufacturing North America, Inc. | Control policy learning and vehicle control method based on reinforcement learning without active exploration |
CN110320883A (en) * | 2018-03-28 | 2019-10-11 | 上海汽车集团股份有限公司 | Vehicle automatic driving control method and device based on a reinforcement learning algorithm |
CN109213148A (en) * | 2018-08-03 | 2019-01-15 | 东南大学 | Vehicle low-speed car-following decision method based on deep reinforcement learning |
CN109682392A (en) * | 2018-12-28 | 2019-04-26 | 山东大学 | Visual navigation method and system based on deep reinforcement learning |
CN110406530A (en) * | 2019-07-02 | 2019-11-05 | 宁波吉利汽车研究开发有限公司 | Automatic driving method, apparatus, device and vehicle |
CN111026127A (en) * | 2019-12-27 | 2020-04-17 | 南京大学 | Automatic driving decision method and system based on partially observable transfer reinforcement learning |
Non-Patent Citations (1)
Title |
---|
立升波: "Key technologies of deep neural networks and their applications in the field of automatic driving", Journal of Automotive Safety and Energy *
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348201A (en) * | 2020-11-11 | 2021-02-09 | 扬州大学 | Intelligent decision implementation method for automatic driving group vehicle based on federated deep reinforcement learning |
CN112348201B (en) * | 2020-11-11 | 2024-03-12 | 扬州大学 | Intelligent decision-making implementation method of automatic driving group vehicle based on federated deep reinforcement learning |
CN112644516A (en) * | 2020-12-16 | 2021-04-13 | 吉林大学青岛汽车研究院 | Unmanned control system and control method suitable for roundabout scene |
CN112644516B (en) * | 2020-12-16 | 2022-03-29 | 吉林大学青岛汽车研究院 | Unmanned control system and control method suitable for roundabout scene |
CN112700642A (en) * | 2020-12-19 | 2021-04-23 | 北京工业大学 | Method for improving traffic passing efficiency by using intelligent internet vehicle |
CN112896187B (en) * | 2021-02-08 | 2022-07-26 | 浙江大学 | System and method for considering social compatibility and making automatic driving decision |
CN112896187A (en) * | 2021-02-08 | 2021-06-04 | 浙江大学 | System and method for considering social compatibility and making automatic driving decision |
CN113099418A (en) * | 2021-03-26 | 2021-07-09 | 深圳供电局有限公司 | Optimization method of block chain task for data transmission of Internet of vehicles |
CN113099418B (en) * | 2021-03-26 | 2022-08-16 | 深圳供电局有限公司 | Optimization method of block chain task for data transmission of Internet of vehicles |
CN113044064A (en) * | 2021-04-01 | 2021-06-29 | 南京大学 | Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning |
CN113044064B (en) * | 2021-04-01 | 2022-07-29 | 南京大学 | Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning |
CN113071524A (en) * | 2021-04-29 | 2021-07-06 | 深圳大学 | Decision control method, decision control device, autonomous driving vehicle and storage medium |
CN113501008B (en) * | 2021-08-12 | 2023-05-19 | 东风悦享科技有限公司 | Automatic driving behavior decision method based on reinforcement learning algorithm |
CN113501008A (en) * | 2021-08-12 | 2021-10-15 | 东风悦享科技有限公司 | Automatic driving behavior decision method based on reinforcement learning algorithm |
CN113619604A (en) * | 2021-08-26 | 2021-11-09 | 清华大学 | Integrated decision and control method and device for automatic driving automobile and storage medium |
CN113619604B (en) * | 2021-08-26 | 2023-08-15 | 清华大学 | Integrated control method, device and storage medium for automatic driving automobile |
CN113511222A (en) * | 2021-08-27 | 2021-10-19 | 清华大学 | Scene self-adaptive vehicle interactive behavior decision and prediction method and device |
CN113511222B (en) * | 2021-08-27 | 2023-09-26 | 清华大学 | Scene self-adaptive vehicle interaction behavior decision and prediction method and device |
CN117828489A (en) * | 2024-03-05 | 2024-04-05 | 河钢国际科技(北京)有限公司 | Intelligent ship remote dynamic control system |
CN117828489B (en) * | 2024-03-05 | 2024-05-14 | 河钢国际科技(北京)有限公司 | Intelligent ship remote dynamic control system |
Also Published As
Publication number | Publication date |
---|---|
CN111845773B (en) | 2021-10-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111845773B (en) | Automatic driving vehicle micro-decision-making method based on reinforcement learning | |
CN114495527B (en) | Internet-connected intersection vehicle road collaborative optimization method and system in mixed traffic environment | |
CN111931905B (en) | Graph convolution neural network model and vehicle track prediction method using same | |
CN110032782B (en) | City-level intelligent traffic signal control system and method | |
WO2022052406A1 (en) | Automatic driving training method, apparatus and device, and medium | |
US11243532B1 (en) | Evaluating varying-sized action spaces using reinforcement learning | |
KR102306939B1 (en) | Method and device for short-term path planning of autonomous driving through information fusion by using v2x communication and image processing | |
CN111696370A (en) | Traffic light control method based on heuristic deep Q network | |
CN111267830B (en) | Hybrid power bus energy management method, device and storage medium | |
CN104952248A (en) | Automobile convergence predicting method based on Euclidean space | |
CN113255998B (en) | Expressway unmanned vehicle formation method based on multi-agent reinforcement learning | |
CN112183288B (en) | Multi-agent reinforcement learning method based on model | |
CN104966129A (en) | Method for separating vehicle running track | |
Baskar et al. | Hierarchical traffic control and management with intelligent vehicles | |
CN118097989B (en) | Multi-agent traffic area signal control method based on digital twin | |
CN112550314A (en) | Embedded optimization type control method suitable for unmanned driving, driving control module and automatic driving control system thereof | |
CN113903173B (en) | Vehicle track feature extraction method based on directed graph structure and LSTM | |
CN111310919A (en) | Driving control strategy training method based on scene segmentation and local path planning | |
CN113299079B (en) | Regional intersection signal control method based on PPO and graph convolution neural network | |
Gutiérrez-Moreno et al. | Hybrid decision making for autonomous driving in complex urban scenarios | |
CN114267191A (en) | Control system, method, medium, equipment and application for relieving traffic jam of driver | |
CN117075473A (en) | Multi-vehicle collaborative decision-making method in man-machine mixed driving environment | |
CN115273502B (en) | Traffic signal cooperative control method | |
CN116620327A (en) | Lane changing decision method for realizing automatic driving high-speed scene based on PPO and Lattice | |
CN114117944B (en) | Model updating method, device, equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||