CN111845773B - Automatic driving vehicle micro-decision-making method based on reinforcement learning - Google Patents


Info

Publication number
CN111845773B
CN111845773B (application CN202010642778.8A)
Authority
CN
China
Prior art keywords
network
driving
vehicle
decision
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010642778.8A
Other languages
Chinese (zh)
Other versions
CN111845773A (en)
Inventor
郑侃
刘杰
赵龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202010642778.8A
Publication of CN111845773A
Application granted
Publication of CN111845773B
Legal status: Active
Anticipated expiration

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001Details of the control system
    • B60W2050/0019Control system elements or transfer functions
    • B60W2050/0028Mathematical models, e.g. for simulation

Abstract

The invention discloses a reinforcement-learning-based micro-decision method for automatically driven vehicles. The method uses the A3C reinforcement learning algorithm: driving behavior is output by an Actor network, which gives high flexibility, and the complexity of the decision logic is not affected by the size of the state space and behavior space. The method employs a two-stage training and solving process. In the first stage, an automatic driving micro-decision model applicable to all road sections is obtained through training, which ensures driving safety. In the second stage, the first-stage model is deployed to each road section, and each road section trains its own single-road-section model on that basis, which provides portability. The continued training of the second stage also allows the method to adapt to the influence of various real-time factors. Finally, a distributed communication architecture based on a realistic Internet-of-Vehicles system structure is set forth, which carries out the distributed computation of the solving process, so the method can adapt to different road characteristics and dynamic driving environments and has wide applicability and robustness.

Description

Automatic driving vehicle micro-decision-making method based on reinforcement learning
Technical Field
The invention relates to the technical field of automatic driving, in particular to an automatic driving vehicle micro-decision method based on reinforcement learning.
Background
Automatic driving technology is one of the core technologies of intelligent transportation. Automatic driving decisions are generally divided into two types. One type is the macroscopic path-planning problem: after the origin and destination of a vehicle are determined, factors such as driving distance and congestion are considered comprehensively to select an optimal driving path. The other type is the microscopic decision problem addressed by the invention: selecting specific driving behaviors in real time according to the surrounding driving environment during travel.
In the prior art, the automatic driving vehicle micro-decision model is divided into the following categories:
Finite state machine model: the vehicle selects an appropriate driving behavior from predefined behavior modes such as parking, lane changing, overtaking, avoiding, and slow driving according to the environment;
Decision tree model: driving behavior modes are represented by the tree structure of the model, and the judgment logic is fixed at the branch nodes of the tree, which are searched top-down.
For example, the invention patent with Chinese patent publication No. CN110969848A discloses an automatic driving overtaking decision method based on reinforcement learning on an opposing two-lane road, which comprises the following steps: collecting the traffic state of the automatically driven vehicle through sensors; inputting the collected traffic state into a trained decision model; the decision model selects and outputs a corresponding driving action instruction from the action space according to the input information, and the automatically driven vehicle forms a new traffic state after the driving action; the reward value of the driving action is calculated by a reward function, and the original traffic state, the driving action, the reward value and the new traffic state are stored as a transition sample in an experience replay pool; the loss function value of the decision model is calculated, and the parameters of the decision model are optimized according to the transition samples and the loss function value; these steps are repeated until automatic driving ends. The method ensures the safety and comfort of the overtaking decision process, and the reinforcement learning decision method improves the human-likeness and robustness of the decisions.
For another example, the invention patent with Chinese patent publication No. CN109624986A discloses a mode-switching-based learning cruise control system and method, which performs adaptive cruise control through mode switching for a specific driver's style and adaptive learning of car-following behavior. The system defines the driving style as a switching strategy among several modes of constant-speed cruising, accelerating approach, steady-state following and rapid braking under different following conditions, learns the driving style, and further learns the driving characteristics in each driving mode with a continuous-state-based learning method.
The prior art has at least the following problems:
The finite state machine model and the decision tree model ignore environmental uncertainty and cannot adapt well to dynamic changes of the environment; when more behavior modes are defined, the state space and behavior space become large, the judgment logic becomes complex, feasibility is low, and it is difficult to achieve good decision performance on urban roads with rich structural features.
No effective solution has yet been proposed for these problems of the prior art.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a reinforcement-learning-based micro-decision method for automatically driven vehicles that meets the safety and driving-efficiency requirements of automatic driving.
The automatic driving vehicle micro-decision method comprises the following steps:
step 1, reinforcement learning modeling is carried out, and an automatic driving decision scheme is subjected to modeling representation;
step 2, designing the solving network: the optimal vehicle micro-decision scheme for the driving micro-decision obtained in step 1 is solved with the A3C algorithm. In the A3C algorithm, the global network and each agent network both comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and all agent networks respectively have the same structure. Both the Actor network and the Critic network take the state as input and, in accordance with step 1, adopt a neural network composed of convolutional layers and fully connected layers, wherein the Actor network represents the strategy function, its output layer producing the mean μ_θ(s) and variance σ_θ(s) of the probability density function in the strategy function, and the Critic network represents the state-value function, its output layer producing the state value V(s);
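A minimal sketch of Actor and Critic networks of this kind is given below, assuming PyTorch as the framework; the layer widths, the (I+1)×4 state layout from step 1.2.1, and the use of a one-dimensional convolution over the vehicle dimension are illustrative assumptions, not the exact structures of Figures 3 and 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Strategy network: state -> mean mu_theta(s) and variance sigma_theta(s)."""
    def __init__(self, vehicles: int = 6, feat: int = 4, action_dim: int = 2):
        super().__init__()
        # The state is treated as a (feat x vehicles) grid and convolved over vehicles.
        self.conv = nn.Conv1d(feat, 16, kernel_size=3, padding=1)
        self.fc = nn.Linear(16 * vehicles, 64)
        self.mu_head = nn.Linear(64, action_dim)      # mean of the action density
        self.sigma_head = nn.Linear(64, action_dim)   # variance, kept positive below

    def forward(self, s):                             # s: (batch, feat, vehicles)
        h = F.relu(self.conv(s)).flatten(1)
        h = F.relu(self.fc(h))
        return self.mu_head(h), F.softplus(self.sigma_head(h)) + 1e-5

class Critic(nn.Module):
    """Value network: state -> scalar state value V(s)."""
    def __init__(self, vehicles: int = 6, feat: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(feat, 16, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(16 * vehicles, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, s):
        h = F.relu(self.conv(s)).flatten(1)
        return self.fc2(F.relu(self.fc1(h)))

mu, sigma = Actor()(torch.zeros(1, 4, 6))             # one state with I = 5 neighbours
value = Critic()(torch.zeros(1, 4, 6))
```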
And 3, solving the decision scheme, and training an Actor network and a Critic network based on the model, the decision scheme and the solution network defined in the steps 1 and 2 to obtain an optimal strategy.
Further, in step 1, the method further comprises the following steps:
step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatically driven vehicle is regarded as an agent and its driving environment as the reinforcement learning environment. The agent vehicle makes driving decisions and carries out driving behaviors according to the detected environment information, and adjusts its driving decisions according to the driving results. Driving time is divided into a number of time slots; at the beginning of each time slot, every agent vehicle makes a driving decision that determines its driving behavior within that time slot;
step 1.2, modeling is carried out by using basic elements in reinforcement learning:
step 1.2.1, a state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction; if the lane is curved, the y direction represents the tangential direction of the lane. The positions and velocities of the agent vehicle and the I nearest vehicles around it are defined as the state, represented as s = [c_0, c_1, c_2, ..., c_I], where s is one sample of the state set, c_0 = [x_0, y_0, v_0x, v_0y] is the vector formed by the x- and y-direction positions and velocities of the agent vehicle, and c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, where Δx_i, Δy_i, v_ix, v_iy respectively represent the relative distances and the velocities of the i-th nearest vehicle in the x and y directions;
step 1.2.2, an action set: the distance the agent vehicle moves in the two directions within one time slot is defined as an action, and the action set is represented as A = {a | a = [x, y], X_m < x < X_M, Y_m < y < Y_M}, where a is one sample of the action set, x and y respectively represent the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively represent the minimum and maximum moving distances in the two directions, and Y_m = 0;
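To make the state representation concrete, the sketch below assembles one state sample s = [c_0, c_1, ..., c_I] from raw vehicle data; the neighbourhood size I = 5 and the zero-padding of missing neighbours are illustrative assumptions rather than values fixed by the invention.

```python
import numpy as np

I = 5  # number of nearest surrounding vehicles considered (assumed value)

def build_state(ego, neighbours):
    """ego = (x0, y0, v0x, v0y); neighbours = [(dx, dy, vx, vy), ...] sorted by distance."""
    c0 = np.asarray(ego, dtype=np.float32)
    cs = [np.asarray(n, dtype=np.float32) for n in neighbours[:I]]
    while len(cs) < I:                       # pad when fewer than I vehicles are nearby
        cs.append(np.zeros(4, dtype=np.float32))
    return np.stack([c0] + cs)               # shape (I + 1, 4): one row per vehicle

s = build_state((12.0, 80.0, 0.1, 14.0),
                [(1.8, 6.0, 0.0, 13.0), (-1.7, -9.0, 0.0, 15.0)])
```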
Step 1.2.3, strategy function: the strategy function π: S → A is the mapping from states to actions, representing the way the agent selects an action according to the current state. The strategy function is defined as a stochastic function π_θ(a|s), whose value represents the probability of taking action a in state s; that is, the strategy function is a probability density function, and the action is obtained by sampling according to this density, as shown in the following formula (1):

π_θ(a|s) = N(a; μ_θ(s), σ_θ(s)),  a_m ≤ a ≤ a_M    (1)

In formula (1), N(·; μ_θ(s), σ_θ(s)) denotes a Gaussian density with mean μ_θ(s) and variance σ_θ(s), restricted to the action range, and a_m = [X_m, Y_m], a_M = [X_M, Y_M] respectively represent the minimum and maximum values of the action;
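As one possible reading of formula (1), the sketch below draws an action from a Gaussian density with mean μ_θ(s) and variance σ_θ(s) and keeps it inside [a_m, a_M]; enforcing the range by clipping, and the concrete bound values, are assumptions, since the exact truncation is given only as an image in the original formula.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(mu, sigma2, a_m, a_M):
    """Sample a ~ N(mu, sigma2) per dimension and restrict it to [a_m, a_M]."""
    a = rng.normal(loc=mu, scale=np.sqrt(sigma2))   # sigma2 plays the role of sigma_theta(s)
    return np.clip(a, a_m, a_M)

mu = np.array([0.2, 12.0])        # mu_theta(s) for [x, y], illustrative
sigma2 = np.array([0.05, 4.0])    # sigma_theta(s), illustrative
a_m = np.array([-3.5, 0.0])       # [X_m, Y_m] with Y_m = 0, illustrative bounds
a_M = np.array([3.5, 30.0])       # [X_M, Y_M], illustrative bounds
print(sample_action(mu, sigma2, a_m, a_M))
```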
step 1.2.4, a reward function: the reward function specifies the reward obtained after a given action is performed in a given state and reflects the quality of the action selection; it is defined as formula (2) (shown as an image in the original publication), in which the coefficient k_c is positive;
step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and the following optimization objective is defined in view of safety and driving efficiency. For each agent, starting from the initial state an action is selected according to the strategy function to reach the next state; this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated. For this trajectory, the cumulative discounted return is expressed as the following formula (3):

R(τ(π_θ)) = Σ_{t=0}^{T} γ^t · r_t    (3)

In formula (3), γ is the discount factor, representing the importance of the return at future times to the current decision, and r_t is the reward obtained by the agent at time t. The expectation of the cumulative discounted return is taken as the objective function, as shown in the following formula (4):

J(θ) = E_{τ(π_θ)}[ Σ_{t=0}^{T} γ^t · r_t ]    (4)

In formula (4), E_{τ(π_θ)}[·] denotes the expectation of the cumulative discounted return over trajectories generated by the strategy π_θ;
step 1.2.6, optimal decision scheme: the driving decision scheme is to find the optimal strategy π* that maximizes the objective function. Optimizing the strategy essentially optimizes the parameter θ of the strategy function, so the optimal decision scheme is finally expressed as the following formula (5):

θ* = argmax_θ J(θ)    (5)

After the optimal parameter θ* is obtained, the optimal strategy is expressed as π* = π_{θ*}, i.e., the optimal vehicle micro-decision scheme.
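To make formulas (3)-(5) concrete, the following sketch computes the cumulative discounted return of one trajectory and a Monte-Carlo estimate of the objective J(θ) over several sampled trajectories; solving formula (5) by gradient ascent on such estimates is the standard actor-critic reading and is stated here as an assumption, not a detail given in the text.

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted return of one trajectory, formula (3)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def objective_estimate(trajectories, gamma=0.95):
    """Sample estimate of J(theta) = E[sum_t gamma^t r_t], formula (4)."""
    return float(np.mean([discounted_return(r, gamma) for r in trajectories]))

# Reward sequences of three illustrative trajectories generated under pi_theta.
trajs = [[1.0, 1.2, 0.8], [0.5, -2.0, 0.7, 1.1], [1.4, 1.3]]
print(objective_estimate(trajs))   # a larger value indicates a better parameter theta
```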
Further, in step 3, the method further comprises the following steps:
step 3.1, training a global strategy: the training in this stage aims to obtain a basic driving strategy model applicable to all road sections. The training process is deployed on a two-layer structure consisting of a central server and RSUs (Road Side Units); the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is one agent (the agent-aggregator parameter exchange used in both training stages is sketched after step 3.2.1.4 below). The specific deployment process is as follows:
step 3.1.1, deploying the decision neural network on the central server and all RSUs;
step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
step 3.1.2.1, for each RSU, the driving tracks of the vehicles covered on the road are collected to simulate a driving environment, and an agent is randomly generated to execute driving behaviors in the simulated driving environment. State information is obtained from the driving environment and input into the Actor and Critic networks; a driving decision is made according to the output of the Actor network and the driving action is carried out. After the interaction, the interaction result is obtained and the driving environment of the next state is reached, and the interaction continues until sampled data of a driving track are generated (an illustrative sketch of this interaction loop is given after step 3.1.2.5);
step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
step 3.1.2.5, after the network convergence, the network parameters are not changed, and a basic model suitable for all road sections is obtained;
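A minimal sketch of the interaction loop of step 3.1.2.1 is given below: an environment replayed from collected driving tracks is stepped with actions drawn from the current Actor output until a batch of (state, action, reward, next state) samples is gathered. The environment interface, its placeholder dynamics and rewards, and the batch size are assumptions for illustration only.

```python
import numpy as np

class ReplayedTrackEnv:
    """Toy stand-in for a driving environment simulated from collected vehicle tracks."""
    action_low, action_high = np.array([-3.5, 0.0]), np.array([3.5, 30.0])

    def reset(self):
        self.t = 0
        return np.zeros((6, 4), dtype=np.float32)           # state: ego + 5 neighbours

    def step(self, a):
        self.t += 1
        s_next = np.random.randn(6, 4).astype(np.float32)   # placeholder next state
        reward = 0.1 * float(a[1])                           # placeholder reward signal
        return s_next, reward, self.t >= 20

def collect_trajectory(env, actor, horizon=64):
    """Interaction loop of step 3.1.2.1: gather (s, a, r, s') driving-track samples."""
    samples, s = [], env.reset()
    for _ in range(horizon):
        mu, sigma2 = actor(s)                                # Actor output for this state
        a = np.clip(np.random.normal(mu, np.sqrt(sigma2)),
                    env.action_low, env.action_high)         # driving decision for the slot
        s_next, r, done = env.step(a)                        # carry out the driving action
        samples.append((s, a, r, s_next))
        s = env.reset() if done else s_next
    return samples

data = collect_trajectory(ReplayedTrackEnv(),
                          actor=lambda s: (np.array([0.0, 10.0]), np.array([0.1, 4.0])))
```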
step 3.2, training a single-road-section model: the global network layer of step 3.1 sinks to the RSU of each road section, and the agent layer sinks to all the automatically driven vehicles covered by that RSU on the road. Each road section is specifically deployed in the following way:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
step 3.2.2, when each automatic driving vehicle starts to enter the road covered by the RSU, synchronizing the global network model from the RSU to become an agent in the road agent layer, and executing the training of a decision network:
step 3.2.1.1, for each vehicle, taking the vehicle as an agent and taking the driving behavior track of the vehicle as a training sample, and performing the same process as the step 3.1.2.1 to obtain track sampling data;
3.2.1.2, training a local decision network by the vehicle by using a local driving track data set, and uploading a training result to the RSU;
step 3.2.1.3, after the RSU collects a training result transmitted by a vehicle, the RSU updates the global network once and returns the updated global network parameters to the vehicle;
step 3.2.1.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes the global network parameters to the local network and starts a new round of sample collection and training on this basis, until the vehicle leaves the current road.
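The same agent-aggregator parameter exchange underlies both training stages: in the first stage the aggregator is the central server and the agents are RSUs; in the second stage the aggregator is an RSU and the agents are vehicles. The sketch below illustrates that exchange; representing the uploaded "training result" as a gradient and applying a plain SGD step on the aggregator side are A3C-style assumptions, since the text only states that training results are uploaded and updated parameters returned.

```python
import numpy as np

class Aggregator:
    """Holds the global Actor/Critic parameters (central server in stage 1, RSU in stage 2)."""
    def __init__(self, params, lr=0.01):
        self.params = {k: v.copy() for k, v in params.items()}
        self.lr = lr

    def update(self, grads):
        # Asynchronous update: apply one step as soon as a single agent's result arrives.
        for k, g in grads.items():
            self.params[k] -= self.lr * g
        return {k: v.copy() for k, v in self.params.items()}   # updated global parameters

class Agent:
    """An RSU (stage 1) or an automatically driven vehicle (stage 2)."""
    def __init__(self, params):
        self.params = {k: v.copy() for k, v in params.items()}

    def train_locally(self):
        # Placeholder for sampling local driving tracks and computing Actor/Critic gradients.
        return {k: 0.1 * np.random.randn(*v.shape) for k, v in self.params.items()}

    def synchronize(self, new_params):
        self.params = {k: v.copy() for k, v in new_params.items()}

init = {"actor": np.zeros(8), "critic": np.zeros(4)}
aggregator, agent = Aggregator(init), Agent(init)
for _ in range(3):                         # repeated until the networks converge
    grads = agent.train_locally()          # local sampling and training
    new_params = aggregator.update(grads)  # global update on arrival of one result
    agent.synchronize(new_params)          # synchronize and begin a new round
```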
Compared with the prior art, the automatic driving vehicle micro-decision method of the invention has the following significant advantages:
1. The design of the invention is combined with the network architecture of the Internet of Vehicles, is easy to deploy, and has strong feasibility.
2. The invention does not use predefined driving modes, so driving behavior is more flexible and adaptable; the complexity of decision making does not grow with the state space and behavior space, and the computation is simpler.
3. The first stage of the invention yields a universal driving model that guarantees driving safety on different road sections; when a road section is newly added, the model only needs to be synchronized from the central server, and the RSU and the automatically driven vehicles can immediately start the training process, so the invention has strong universality and portability.
4. The second stage of the invention yields a dedicated driving model for each road section. Compared with using the same model for all road sections, this model adapts better to the characteristics of different road sections, and within a given road section its driving efficiency is superior to that of the model shared by all road sections; in addition, compared with independently training one model for each road section, the training and computation cost of the invention is lower.
5. Compared with a fixed driving strategy model, the second-stage model of the method can adapt to constantly changing real-time factors such as road conditions, weather, and traffic-flow density, and has better robustness.
Drawings
FIG. 1 is a schematic diagram of the calculation structure of the A3C algorithm of the reinforcement learning-based automatic vehicle micro-decision making method according to the present invention;
FIG. 2 is a schematic diagram of a three-layer system architecture of the reinforcement learning-based automated vehicle micro-decision making method of the present invention;
FIG. 3 is a schematic diagram of the Actor network structure of the reinforcement-learning-based automatic driving vehicle micro-decision method of the present invention;
FIG. 4 is a schematic diagram of the Critic network structure of the reinforcement-learning-based automatic driving vehicle micro-decision method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
As shown in fig. 1 to 4, the automatic driving vehicle micro-decision method includes the following steps:
step 1, reinforcement learning modeling, and modeling and representing an automatic driving decision scheme:
step 1.1, defining a driving process of a vehicle as a Markov decision process, regarding an automatic driving vehicle as a proxy, regarding a driving environment of the vehicle as a reinforcement learning environment, making a driving decision and a driving behavior by the proxy vehicle according to detected environment information, adjusting the driving decision according to a driving result, dividing driving time into a plurality of time slots, making the driving decision at the beginning of the time slot of each proxy vehicle, and determining the driving behavior of each proxy vehicle in the time slot;
step 1.2, modeling is carried out by using basic elements in reinforcement learning:
step 1.2.1, a state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction; if the lane is curved, the y direction represents the tangential direction of the lane. The positions and velocities of the agent vehicle and the I nearest vehicles around it are defined as the state, represented as s = [c_0, c_1, c_2, ..., c_I], where s is one sample of the state set, c_0 = [x_0, y_0, v_0x, v_0y] is the vector formed by the x- and y-direction positions and velocities of the agent vehicle, and c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, where Δx_i, Δy_i, v_ix, v_iy respectively represent the relative distances and the velocities of the i-th nearest vehicle in the x and y directions;
step 1.2.2, an action set: the distance the agent vehicle moves in the two directions within one time slot is defined as an action, and the action set is represented as A = {a | a = [x, y], X_m < x < X_M, Y_m < y < Y_M}, where a is one sample of the action set, x and y respectively represent the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively represent the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, strategy function: the strategy function π: S → A is the mapping from states to actions, representing the way the agent selects an action according to the current state. The strategy function is defined as a stochastic function π_θ(a|s), whose value represents the probability of taking action a in state s; that is, the strategy function is a probability density function, and the action is obtained by sampling according to this density, as shown in the following formula (1):

π_θ(a|s) = N(a; μ_θ(s), σ_θ(s)),  a_m ≤ a ≤ a_M    (1)

In formula (1), N(·; μ_θ(s), σ_θ(s)) denotes a Gaussian density with mean μ_θ(s) and variance σ_θ(s), restricted to the action range, and a_m = [X_m, Y_m], a_M = [X_M, Y_M] respectively represent the minimum and maximum values of the action;
step 1.2.4, a reward function: the reward function specifies the reward obtained after a given action is performed in a given state and reflects the quality of the action selection; it is defined as formula (2) (shown as an image in the original publication), in which the coefficient k_c is positive;
step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and the following optimization objective is defined in view of safety and driving efficiency. For each agent, starting from the initial state an action is selected according to the strategy function to reach the next state; this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated. For this trajectory, the cumulative discounted return is expressed as the following formula (3):

R(τ(π_θ)) = Σ_{t=0}^{T} γ^t · r_t    (3)

In formula (3), γ is the discount factor, representing the importance of the return at future times to the current decision, and r_t is the reward obtained by the agent at time t. The expectation of the cumulative discounted return is taken as the objective function, as shown in the following formula (4):

J(θ) = E_{τ(π_θ)}[ Σ_{t=0}^{T} γ^t · r_t ]    (4)

In formula (4), E_{τ(π_θ)}[·] denotes the expectation of the cumulative discounted return over trajectories generated by the strategy π_θ;
step 1.2.6, optimal decision scheme: the driving decision scheme is to find the optimal strategy π* that maximizes the objective function. Optimizing the strategy essentially optimizes the parameter θ of the strategy function, so the optimal decision scheme is finally expressed as the following formula (5):

θ* = argmax_θ J(θ)    (5)

After the optimal parameter θ* is obtained, the optimal strategy is expressed as π* = π_{θ*}, i.e., the optimal vehicle micro-decision scheme.
Step 2, designing the solving network: the optimal vehicle micro-decision scheme for the driving micro-decision obtained in step 1 is solved with the A3C algorithm. In the A3C algorithm, the global network and each agent network both comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and all agent networks respectively have the same structure. Both the Actor network and the Critic network take the state as input and, in view of the two-dimensional structure of the state defined in step 1.2.1, adopt the neural network composed of convolutional layers and fully connected layers shown in fig. 3 and fig. 4, wherein the Actor network represents the strategy function, its output layer producing the mean μ_θ(s) and variance σ_θ(s) of the probability density function in the strategy function, and the Critic network represents the state-value function, its output layer producing the state value V(s).
And 3, solving the decision scheme, and training an Actor network and a Critic network based on the model, the decision scheme and the solution network defined in the steps 1 and 2 to obtain an optimal strategy:
step 3.1, training a global strategy, wherein the training process in the stage aims to obtain a basic driving strategy model suitable for all Road sections, the training process in the stage is deployed on a two-layer structure of a central server and RSUs (Road Side units), in the stage, the central server is used as a global network layer, all the RSUs form an agent layer, each RSU is an agent, and the specific deployment process is as follows:
step 3.1.1, deploying the decision neural network on the central server and all RSUs;
step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
step 3.1.2.1, for each RSU, collecting a driving track covering vehicles on a road to simulate a driving environment, randomly generating an agent, executing driving behaviors in the simulated driving environment, obtaining state information according to the driving environment, inputting an Actor and a Critic network, making a driving decision according to the output of the Actor network, making a driving action, obtaining an interaction result after the interaction process is finished, reaching the driving environment of the next state, and continuing the interaction until generating sampling data of the driving track;
step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
step 3.1.2.5, after the network convergence, the network parameters are not changed, and a basic model suitable for all road sections is obtained;
step 3.2, training a single road model, sinking the global network layer in the step 3.1 to the RSU of each road, sinking the agent layer to the RSU to cover all the automatic driving vehicles on the road, and specifically deploying each road in the following way:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
step 3.2.2, when each automatic driving vehicle starts to enter the road covered by the RSU, synchronizing the global network model from the RSU to become an agent in the road agent layer, and executing the training of a decision network:
step 3.2.1.1, for each vehicle, taking the vehicle as an agent and taking the driving behavior track of the vehicle as a training sample, and performing the same process as the step 3.1.2.1 to obtain track sampling data;
3.2.1.2, training a local decision network by the vehicle by using a local driving track data set, and uploading a training result to the RSU;
step 3.2.1.3, after the RSU collects a training result transmitted by a vehicle, the RSU updates the global network once and returns the updated global network parameters to the vehicle;
step 3.2.1.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes the global network parameters to the local network and starts a new round of sample collection and training on this basis, until the vehicle leaves the current road.
It should be noted that the Actor networks of the global network and of all agents have the same structure, and the Critic networks of the global network and of all agents have the same structure; that is, the global network and every agent network share the same structure, each comprising an Actor network and a Critic network, with all Actor networks identical in structure and all Critic networks identical in structure.
The above description is only for the preferred embodiment of the present invention and should not be construed as limiting the present invention, and various modifications and changes can be made by those skilled in the art without departing from the spirit and principle of the present invention, and any modifications, equivalents, improvements, etc. should be included in the scope of the claims of the present invention.

Claims (2)

1. An automatic driving vehicle micro-decision method based on reinforcement learning is characterized by comprising the following steps:
step 1, reinforcement learning modeling is carried out, and an automatic driving decision scheme is subjected to modeling representation;
step 1.1, defining a driving process of a vehicle as a Markov decision process, regarding an automatic driving vehicle as a proxy, regarding a driving environment of the vehicle as a reinforcement learning environment, making a driving decision and a driving behavior by the proxy vehicle according to detected environment information, adjusting the driving decision according to a driving result, dividing driving time into a plurality of time slots, making the driving decision at the beginning of the time slot of each proxy vehicle, and determining the driving behavior of each proxy vehicle in the time slot;
step 1.2, modeling is carried out by using basic elements in reinforcement learning:
step 1.2.1, a state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction; if the lane is curved, the y direction represents the tangential direction of the lane. The positions and velocities of the agent vehicle and the I nearest vehicles around it are defined as the state, represented as s = [c_0, c_1, c_2, ..., c_I], where s is one sample of the state set, c_0 = [x_0, y_0, v_0x, v_0y] is the vector formed by the x- and y-direction positions and velocities of the agent vehicle, and c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, where Δx_i, Δy_i, v_ix, v_iy respectively represent the relative distances and the velocities of the i-th nearest vehicle in the x and y directions;
step 1.2.2, an action set: the distance the agent vehicle moves in the two directions within one time slot is defined as an action, and the action set is represented as A = {a | a = [x, y], X_m < x < X_M, Y_m < y < Y_M}, where a is one sample of the action set, x and y respectively represent the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively represent the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, strategy function: the strategy function π: S → A is the mapping from states to actions, representing the way the agent selects an action according to the current state. The strategy function is defined as a stochastic function π_θ(a|s), whose value represents the probability of taking action a in state s; that is, the strategy function is a probability density function, and the action is obtained by sampling according to this density, as shown in the following formula (1):

π_θ(a|s) = N(a; μ_θ(s), σ_θ(s)),  a_m ≤ a ≤ a_M    (1)

In formula (1), N(·; μ_θ(s), σ_θ(s)) denotes a Gaussian density with mean μ_θ(s) and variance σ_θ(s), restricted to the action range, and a_m = [X_m, Y_m], a_M = [X_M, Y_M] respectively represent the minimum and maximum values of the action;
step 1.2.4, a reward function: the reward function specifies the reward obtained after a given action is performed in a given state and reflects the quality of the action selection; it is defined as formula (2) (shown as an image in the original publication), in which the coefficient k_c is positive;
step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and the following optimization objective is defined in view of safety and driving efficiency. For each agent, starting from the initial state an action is selected according to the strategy function to reach the next state; this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated. For this trajectory, the cumulative discounted return is expressed as the following formula (3):

R(τ(π_θ)) = Σ_{t=0}^{T} γ^t · r_t    (3)

In formula (3), γ is the discount factor, representing the importance of the return at future times to the current decision, and r_t is the reward obtained by the agent at time t. The expectation of the cumulative discounted return is taken as the objective function, as shown in the following formula (4):

J(θ) = E_{τ(π_θ)}[ Σ_{t=0}^{T} γ^t · r_t ]    (4)

In formula (4), E_{τ(π_θ)}[·] denotes the expectation of the cumulative discounted return over trajectories generated by the strategy π_θ;
step 1.2.6, optimal decision scheme: the driving decision scheme is to find the optimal strategy π* that maximizes the objective function. Optimizing the strategy essentially optimizes the parameter θ of the strategy function, so the optimal decision scheme is finally expressed as the following formula (5):

θ* = argmax_θ J(θ)    (5)

After the optimal parameter θ* is obtained, the optimal strategy is expressed as π* = π_{θ*}, namely the optimal vehicle micro-decision scheme;
step 2, designing the solving network: the optimal vehicle micro-decision scheme for the driving micro-decision obtained in step 1 is solved with the A3C algorithm. In the A3C algorithm, the global network and each agent network both comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and all agent networks respectively have the same structure. Both the Actor network and the Critic network take the state as input and, in accordance with step 1, adopt a neural network composed of convolutional layers and fully connected layers, wherein the Actor network represents the strategy function, its output layer producing the mean μ_θ(s) and variance σ_θ(s) of the probability density function in the strategy function, and the Critic network represents the state-value function, its output layer producing the state value V(s);
And 3, solving the decision scheme, and training an Actor network and a Critic network based on the model, the decision scheme and the solution network defined in the steps 1 and 2 to obtain an optimal strategy.
2. The reinforcement learning-based automated vehicular micro decision making method according to claim 1, further comprising, in step 3, the steps of:
step 3.1, training a global strategy, wherein the training process in the stage aims to obtain a basic driving strategy model suitable for all road sections, the training process in the stage is deployed on a two-layer structure of a central server and RSUs, in the stage, the central server is used as a global network layer, all the RSUs form an agent layer, each RSU is an agent, and the specific deployment process is as follows:
step 3.1.1, deploying the decision neural network on the central server and all RSUs;
step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
step 3.1.2.1, for each RSU, collecting a driving track covering vehicles on a road to simulate a driving environment, randomly generating an agent, executing driving behaviors in the simulated driving environment, obtaining state information according to the driving environment, inputting an Actor and a Critic network, making a driving decision according to the output of the Actor network, making a driving action, obtaining an interaction result after the interaction process is finished, reaching the driving environment of the next state, and continuing the interaction until generating sampling data of the driving track;
step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
step 3.1.2.5, after the network convergence, the network parameters are not changed, and a basic model suitable for all road sections is obtained;
step 3.2, training a single road model, sinking the global network layer in the step 3.1 to the RSU of each road, sinking the agent layer to the RSU to cover all the automatic driving vehicles on the road, and specifically deploying each road in the following way:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
step 3.2.2, when each automatic driving vehicle starts to enter the road covered by the RSU, synchronizing the global network model from the RSU to become an agent in the road agent layer, and executing the training of a decision network:
step 3.2.1.1, for each vehicle, taking the vehicle as an agent and taking the driving behavior track of the vehicle as a training sample, and performing the same process as the step 3.1.2.1 to obtain track sampling data;
3.2.1.2, training a local decision network by the vehicle by using a local driving track data set, and uploading a training result to the RSU;
step 3.2.1.3, after the RSU collects a training result transmitted by a vehicle, the RSU updates the global network once and returns the updated global network parameters to the vehicle;
step 3.2.1.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes the global network parameters to the local network and starts a new round of sample collection and training on this basis, until the vehicle leaves the current road.
CN202010642778.8A 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning Active CN111845773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010642778.8A CN111845773B (en) 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010642778.8A CN111845773B (en) 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111845773A CN111845773A (en) 2020-10-30
CN111845773B (en) 2021-10-26

Family

ID=73153538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010642778.8A Active CN111845773B (en) 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111845773B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348201B (en) * 2020-11-11 2024-03-12 扬州大学 Intelligent decision-making implementation method of automatic driving group vehicle based on federal deep reinforcement learning
CN112644516B (en) * 2020-12-16 2022-03-29 吉林大学青岛汽车研究院 Unmanned control system and control method suitable for roundabout scene
CN112700642B (en) * 2020-12-19 2022-09-23 北京工业大学 Method for improving traffic passing efficiency by using intelligent internet vehicle
CN112896187B (en) * 2021-02-08 2022-07-26 浙江大学 System and method for considering social compatibility and making automatic driving decision
CN113099418B (en) * 2021-03-26 2022-08-16 深圳供电局有限公司 Optimization method of block chain task for data transmission of Internet of vehicles
CN113044064B (en) * 2021-04-01 2022-07-29 南京大学 Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN113071524B (en) * 2021-04-29 2022-04-12 深圳大学 Decision control method, decision control device, autonomous driving vehicle and storage medium
CN113501008B (en) * 2021-08-12 2023-05-19 东风悦享科技有限公司 Automatic driving behavior decision method based on reinforcement learning algorithm
CN113619604B (en) * 2021-08-26 2023-08-15 清华大学 Integrated control method, device and storage medium for automatic driving automobile
CN113511222B (en) * 2021-08-27 2023-09-26 清华大学 Scene self-adaptive vehicle interaction behavior decision and prediction method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015094645A1 (en) * 2013-12-22 2015-06-25 Lytx, Inc. Autonomous driving comparison and evaluation
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 A vehicle low-speed car-following decision method based on deep reinforcement learning
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Visual navigation method and system based on deep reinforcement learning
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A vehicle automatic driving control method and device based on a reinforcement learning algorithm
CN110406530A (en) * 2019-07-02 2019-11-05 宁波吉利汽车研究开发有限公司 An automatic driving method, apparatus, device and vehicle
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10061316B2 (en) * 2016-07-08 2018-08-28 Toyota Motor Engineering & Manufacturing North America, Inc. Control policy learning and vehicle control method based on reinforcement learning without active exploration

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015094645A1 (en) * 2013-12-22 2015-06-25 Lytx, Inc. Autonomous driving comparison and evaluation
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A vehicle automatic driving control method and device based on a reinforcement learning algorithm
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 A vehicle low-speed car-following decision method based on deep reinforcement learning
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Visual navigation method and system based on deep reinforcement learning
CN110406530A (en) * 2019-07-02 2019-11-05 宁波吉利汽车研究开发有限公司 An automatic driving method, apparatus, device and vehicle
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Key technologies of deep neural networks and their applications in the field of autonomous driving; Li Shengbo; Journal of Automotive Safety and Energy; 2019-02-28; full text *

Also Published As

Publication number Publication date
CN111845773A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111845773B (en) Automatic driving vehicle micro-decision-making method based on reinforcement learning
CN109733415B (en) Anthropomorphic automatic driving and following model based on deep reinforcement learning
CN111898211B (en) Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof
US11243532B1 (en) Evaluating varying-sized action spaces using reinforcement learning
KR102306939B1 (en) Method and device for short-term path planning of autonomous driving through information fusion by using v2x communication and image processing
CN114495527B (en) Internet-connected intersection vehicle road collaborative optimization method and system in mixed traffic environment
US20230124864A1 (en) Graph Representation Querying of Machine Learning Models for Traffic or Safety Rules
CN111267830B (en) Hybrid power bus energy management method, device and storage medium
CN113643553B (en) Multi-intersection intelligent traffic signal lamp control method and system based on federal reinforcement learning
CN104952248A (en) Automobile convergence predicting method based on Euclidean space
CN104966129A (en) Method for separating vehicle running track
CN114013443A (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN113255998B (en) Expressway unmanned vehicle formation method based on multi-agent reinforcement learning
CN112183288B (en) Multi-agent reinforcement learning method based on model
CN112550314A (en) Embedded optimization type control method suitable for unmanned driving, driving control module and automatic driving control system thereof
CN113593228A (en) Automatic driving cooperative control method for bottleneck area of expressway
CN114038218A (en) Chained feedback multi-intersection signal lamp decision system and method based on road condition information
CN112201070A (en) Deep learning-based automatic driving expressway bottleneck section behavior decision method
CN113299079B (en) Regional intersection signal control method based on PPO and graph convolution neural network
CN117075473A (en) Multi-vehicle collaborative decision-making method in man-machine mixed driving environment
CN117007066A (en) Unmanned trajectory planning method integrated by multiple planning algorithms and related device
Wang et al. Joint traffic signal and connected vehicle control in IoV via deep reinforcement learning
CN111310919A (en) Driving control strategy training method based on scene segmentation and local path planning
CN114926823B (en) WGCN-based vehicle driving behavior prediction method
CN114516336B (en) Vehicle track prediction method considering road constraint conditions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant