CN111845773A - Automatic driving vehicle micro-decision-making method based on reinforcement learning - Google Patents

Automatic driving vehicle micro-decision-making method based on reinforcement learning Download PDF

Info

Publication number
CN111845773A
Authority
CN
China
Prior art keywords
driving
network
vehicle
decision
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010642778.8A
Other languages
Chinese (zh)
Other versions
CN111845773B (en)
Inventor
郑侃
刘杰
赵龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202010642778.8A priority Critical patent/CN111845773B/en
Publication of CN111845773A publication Critical patent/CN111845773A/en
Application granted granted Critical
Publication of CN111845773B publication Critical patent/CN111845773B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001Details of the control system
    • B60W2050/0019Control system elements or transfer functions
    • B60W2050/0028Mathematical models, e.g. for simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Probability & Statistics with Applications (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a reinforcement-learning-based micro-decision method for automatic driving vehicles. The method uses the A3C reinforcement learning algorithm: the driving behavior is output by an Actor network, so the decision is highly flexible and the complexity of the decision logic is not affected by the size of the state space or the behavior space. The method adopts a two-stage training and solving process. In the first stage, a micro-decision model applicable to all road sections is obtained through training, so as to guarantee driving safety. In the second stage, the overall model of the first stage is deployed to each road section, and each road section trains its own single-road-section model on that basis, which provides portability; the continuous training of the second stage also allows the method to adapt to the influence of various real-time factors. Finally, a distributed communication architecture based on a real Internet-of-Vehicles system structure is set forth to complete the distributed computation of the solving process, so that the method can adapt to different road characteristics and dynamic driving environments and has wide applicability and robustness.

Description

Automatic driving vehicle micro-decision-making method based on reinforcement learning
Technical Field
The invention relates to the technical field of automatic driving, in particular to an automatic driving vehicle micro-decision method based on reinforcement learning.
Background
Automatic driving is one of the core technologies of intelligent transportation. Automatic driving decisions generally fall into two types. One type is the macroscopic path-planning problem: after the origin and destination of the vehicle are determined, factors such as driving distance and congestion are considered comprehensively to select an optimal driving path. The other type is the microscopic decision problem addressed by the present invention: during driving, the vehicle continuously selects specific driving behaviors according to the surrounding driving environment.
In the prior art, automatic driving vehicle micro-decision models fall into the following categories:
Finite state machine model: the vehicle selects an appropriate driving behavior from predefined behavior modes such as parking, lane changing, overtaking, avoiding and slow running according to the environment;
Decision tree model: the driving behavior modes are expressed by the tree structure of the model, the judgment logic is fixed at the branch nodes of the tree, and decisions are made by a top-down search mechanism.
For example, the invention patent with Chinese patent publication No. CN110969848A discloses an automatic driving overtaking decision method based on reinforcement learning on an opposite two-lane road, which comprises the following steps: collecting the traffic state of the automatic driving vehicle through sensors; inputting the collected traffic state into a trained decision model; the decision model selecting and outputting a corresponding driving action instruction from the action space according to the input information, the automatic driving vehicle forming a new traffic state after performing the driving action; calculating the reward value of the driving action through a reward function, and storing the original traffic state, the driving action, the reward value and the new traffic state as a transition sample in an experience replay pool; calculating the loss function value of the decision model, and optimizing the parameters of the decision model according to the transition samples and the loss function value; and repeating the above steps until the automatic driving is finished. The method ensures the safety and comfort of the overtaking decision process of the automatic driving vehicle and improves the human-likeness and robustness of the decision by means of a reinforcement learning decision method.
For another example, the invention patent with Chinese patent publication No. CN109624986A discloses a learning cruise control system and method based on mode switching, which performs adaptive cruise control through mode switching for a specific driver style and adaptive learning of following behavior. The system characterizes the driving style as a switching strategy among several modes of the driver under different following conditions (constant-speed cruising, accelerated approach, steady-state following and hard braking), learns this driving style, and further learns the driving characteristics in each driving mode using a continuous-state learning method.
The prior art has at least the following problems:
The finite state machine model and the decision tree model both ignore environmental uncertainty and cannot adapt well to dynamic changes of the environment; when more behavior modes are defined, the state space and the behavior space become large, the judgment logic becomes complex, the feasibility is low, and it is difficult to achieve good decision performance on urban roads with rich structural features.
For the above problems in the prior art, namely that the finite state machine model and the decision tree model ignore environmental uncertainty, cannot adapt well to dynamic changes of the environment, and, when more behavior modes are defined, suffer from large state and behavior spaces, complex judgment logic, low feasibility and poor decision performance on urban roads with rich structural features, no effective solution has been proposed at present.
Disclosure of Invention
In view of the above defects of the prior art, the invention aims to provide a reinforcement-learning-based automatic driving vehicle micro-decision method that meets both the safety requirement and the driving efficiency requirement of automatic driving.
The automatic driving vehicle micro-decision method comprises the following steps:
step 1, reinforcement learning modeling is carried out, and an automatic driving decision scheme is subjected to modeling representation;
Step 2, designing the solving network: to obtain the optimal vehicle micro-decision scheme for the driving micro-decision of step 1, the problem is solved with the A3C algorithm. In the A3C algorithm, the global network and the agent networks each comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and of all agent networks are respectively identical in structure. Both networks take the state as input and, in accordance with step 1, adopt a neural network composed of convolutional layers and fully connected layers. The Actor network represents the policy function, and its output layer gives the parameters μ_θ(s) and σ_θ(s) of the probability density function in the policy function; the Critic network represents the state value function, and its output layer gives the state value V(s).
Step 3, solving the decision scheme: based on the model, the decision scheme and the solving network defined in steps 1 and 2, the Actor network and the Critic network are trained to obtain the optimal policy.
Further, in step 1, the method further comprises the following steps:
Step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatic driving vehicle is regarded as an agent and the driving environment of the vehicle as the reinforcement learning environment; the agent vehicle makes driving decisions and executes driving behaviors according to the detected environment information, and adjusts its driving decisions according to the driving results; the driving time is divided into a number of time slots, and at the beginning of each time slot each agent vehicle makes a driving decision that determines its driving behavior within that time slot;
step 1.2, modeling is carried out by using basic elements in reinforcement learning:
Step 1.2.1, state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction (if the lane is a curve, the y direction is the tangential direction of the lane). The positions and speeds of the agent vehicle and of the I nearest surrounding vehicles are defined as the state, and the state set is expressed as: s = [c_0, c_1, c_2, ..., c_I], where s is one sample in the state set, c_0 = [x_0, y_0, v_0x, v_0y] is a vector consisting of the x- and y-direction positions and speeds of the agent vehicle, and c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, where Δx_i, Δy_i, v_ix, v_iy respectively denote the x- and y-direction distances and speeds of the i-th vehicle nearest to the agent vehicle;
Step 1.2.2, action set: the distances moved by the agent vehicle in the two directions within one time slot are defined as an action, expressed as: a = [x, y], X_m < x < X_M, Y_m < y < Y_M, where a is one sample in the action set, x and y respectively denote the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively denote the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, policy function: the policy function π: S → A is a mapping from states to actions and represents the specific way in which the agent selects an action according to the current state. The policy function is defined as a stochastic function π_θ(a|s), whose value is the probability of taking action a in state s; that is, the policy function is a probability density function, and the action is obtained by sampling from this density, as shown in the following formula (1):

π_θ(a|s) = f(a; μ_θ(s), σ_θ(s)),  a_m < a < a_M    (1)

In formula (1), a_m = [X_m, Y_m] and a_M = [X_M, Y_M] denote the minimum and maximum values of the action, and f(a; μ_θ(s), σ_θ(s)) is the probability density over the action range, where μ_θ(s) represents the mean of the distribution and σ_θ(s) represents the variance of the distribution;
Step 1.2.4, reward function: the reward function specifies the reward value obtained after a certain action is performed in a certain state, reflecting the quality of the action selection, and is defined as formula (2) [the expression of formula (2) is given as an image in the original publication]; in formula (2), the coefficient k_c is positive;
Step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and, considering both safety and driving efficiency, the following optimization objective is defined. For each agent, starting from the initial state, an action is selected according to the policy function to reach the next state; this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated. For this trajectory, the cumulative discounted return is expressed as the following formula (3):

R(τ) = Σ_t γ^t · r_t    (3)

In formula (3), γ is the discount factor, representing the importance of the return at a future time to the decision at the current time, and r_t denotes the reward obtained by the agent at time t. The expectation of the cumulative discounted return is taken as the objective function, as shown in the following formula (4):

J(θ) = E_{τ~π_θ}[ R(τ) ]    (4)

In formula (4), E_{τ~π_θ}[·] denotes the expectation of the cumulative discounted return over the trajectories generated by the policy π_θ;
Step 1.2.6, optimization of the decision scheme: the driving decision scheme is to find the optimal policy π* that maximizes the objective function. The optimization of the policy is in essence the optimization of the parameter θ of the policy function, and the optimization decision scheme is finally expressed as the following formula (5):

θ* = argmax_θ J(θ)    (5)

After the optimal parameter θ* is obtained, the optimal policy is expressed as π* = π_{θ*}, i.e. the optimal vehicle micro-decision scheme.
Further, in step 3, the method further comprises the following steps:
Step 3.1, training the global strategy: the training in this stage aims to obtain a basic driving strategy model applicable to all road sections. The training process of this stage is deployed on a two-layer structure consisting of a central server and RSUs (Road Side Units): the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is an agent. The specific deployment process is as follows:
step 3.1.1, deploying the decision neural network on the central server and all RSUs;
step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
Step 3.1.2.1, each RSU collects the driving trajectories of the vehicles on the road it covers to simulate a driving environment and randomly generates an agent that executes driving behaviors in this simulated environment: state information obtained from the driving environment is input to the Actor and Critic networks, a driving decision is made according to the output of the Actor network and the corresponding driving action is executed, the interaction result is obtained after the interaction process ends and the driving environment of the next state is reached, and the interaction continues until sampled driving-trajectory data are generated;
Step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
Step 3.1.2.5, after the network converges, the network parameters no longer change, and a basic model applicable to all road sections is obtained;
Step 3.2, training the single-road-section model: the global network layer of step 3.1 is moved down to the RSU of each road section, and the agent layer is moved down to all the automatic driving vehicles on the road covered by that RSU. Each road section is deployed as follows:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
Step 3.2.2, when an automatic driving vehicle enters the road covered by the RSU, it synchronizes the global network model from the RSU, becomes an agent in the agent layer of that road, and performs the training of the decision network:
Step 3.2.2.1, each vehicle is taken as an agent and its driving behavior trajectory is taken as the training sample, and the same process as in step 3.1.2.1 is performed to obtain sampled trajectory data;
Step 3.2.2.2, the vehicle trains its local decision network with the local driving-trajectory data set and uploads the training result to the RSU;
Step 3.2.2.3, after the RSU receives the training result transmitted by a vehicle, it updates the global network once and returns the updated global network parameters to the vehicle;
Step 3.2.2.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes them to its local network and starts a new round of sample collection and training on that basis, until the vehicle leaves the current road.
Compared with the prior art, the automatic driving vehicle micro-decision method has the following remarkable advantages:
1, the design of the invention is combined with the network architecture of the Internet of Vehicles, is easy to deploy and has strong feasibility.
2, the invention does not use predefined driving modes, so the driving behavior is more flexible and adaptable; the complexity of the decision does not increase with the size of the state space and the behavior space, and the computation is simpler.
3, the first stage of the invention yields a universal driving model that guarantees driving safety on different road sections; therefore, when a road section is newly added, the model only needs to be synchronized from the central server, and the RSU and the automatic driving vehicles can start the training process immediately, so the invention has strong universality and portability.
4, the second stage of the invention yields a dedicated driving model for each road section; compared with using the same model for all road sections, the model of the invention better adapts to the characteristics of different road sections, and under the driving environment of a given road section the driving efficiency of the single-road-section model is superior to that of a model shared by all road sections; in addition, compared with training a separate model for each road section independently, the training and computation cost of the invention is lower.
5, compared with a fixed driving strategy model, the second-stage model of the method can adapt to constantly changing real-time factors such as road conditions, weather and traffic flow density, and has better robustness.
Drawings
FIG. 1 is a schematic diagram of the calculation structure of the A3C algorithm of the reinforcement learning-based automatic vehicle micro-decision making method according to the present invention;
FIG. 2 is a schematic diagram of a three-layer system architecture of the reinforcement learning-based automated vehicle micro-decision making method of the present invention;
FIG. 3 is a schematic diagram of the Actor network structure of the reinforcement-learning-based automatic driving vehicle micro-decision method of the present invention;
FIG. 4 is a schematic diagram of the Critic network structure of the reinforcement-learning-based automatic driving vehicle micro-decision method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
As shown in fig. 1 to 4, the automatic driving vehicle micro-decision method includes the following steps:
step 1, reinforcement learning modeling, and modeling and representing an automatic driving decision scheme:
Step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatic driving vehicle is regarded as an agent and the driving environment of the vehicle as the reinforcement learning environment; the agent vehicle makes driving decisions and executes driving behaviors according to the detected environment information, and adjusts its driving decisions according to the driving results; the driving time is divided into a number of time slots, and at the beginning of each time slot each agent vehicle makes a driving decision that determines its driving behavior within that time slot;
step 1.2, modeling is carried out by using basic elements in reinforcement learning:
Step 1.2.1, state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction (if the lane is a curve, the y direction is the tangential direction of the lane). The positions and speeds of the agent vehicle and of the I nearest surrounding vehicles are defined as the state, and the state set is expressed as: s = [c_0, c_1, c_2, ..., c_I], where s is one sample in the state set, c_0 = [x_0, y_0, v_0x, v_0y] is a vector consisting of the x- and y-direction positions and speeds of the agent vehicle, and c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, where Δx_i, Δy_i, v_ix, v_iy respectively denote the x- and y-direction distances and speeds of the i-th vehicle nearest to the agent vehicle;
Step 1.2.2, action set: the distances moved by the agent vehicle in the two directions within one time slot are defined as an action, expressed as: a = [x, y], X_m < x < X_M, Y_m < y < Y_M, where a is one sample in the action set, x and y respectively denote the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively denote the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, policy function: the policy function π: S → A is a mapping from states to actions and represents the specific way in which the agent selects an action according to the current state. The policy function is defined as a stochastic function π_θ(a|s), whose value is the probability of taking action a in state s; that is, the policy function is a probability density function, and the action is obtained by sampling from this density, as shown in the following formula (1):

π_θ(a|s) = f(a; μ_θ(s), σ_θ(s)),  a_m < a < a_M    (1)

In formula (1), a_m = [X_m, Y_m] and a_M = [X_M, Y_M] denote the minimum and maximum values of the action, and f(a; μ_θ(s), σ_θ(s)) is the probability density over the action range, where μ_θ(s) represents the mean of the distribution and σ_θ(s) represents the variance of the distribution;
Step 1.2.4, reward function: the reward function specifies the reward value obtained after a certain action is performed in a certain state, reflecting the quality of the action selection, and is defined as formula (2) [the expression of formula (2) is given as an image in the original publication]; in formula (2), the coefficient k_c is positive;
Step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and, considering both safety and driving efficiency, the following optimization objective is defined. For each agent, starting from the initial state, an action is selected according to the policy function to reach the next state; this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated. For this trajectory, the cumulative discounted return is expressed as the following formula (3):

R(τ) = Σ_t γ^t · r_t    (3)

In formula (3), γ is the discount factor, representing the importance of the return at a future time to the decision at the current time, and r_t denotes the reward obtained by the agent at time t. The expectation of the cumulative discounted return is taken as the objective function, as shown in the following formula (4):

J(θ) = E_{τ~π_θ}[ R(τ) ]    (4)

In formula (4), E_{τ~π_θ}[·] denotes the expectation of the cumulative discounted return over the trajectories generated by the policy π_θ;
Step 1.2.6, optimization of the decision scheme: the driving decision scheme is to find the optimal policy π* that maximizes the objective function. The optimization of the policy is in essence the optimization of the parameter θ of the policy function, and the optimization decision scheme is finally expressed as the following formula (5):

θ* = argmax_θ J(θ)    (5)

After the optimal parameter θ* is obtained, the optimal policy is expressed as π* = π_{θ*}, i.e. the optimal vehicle micro-decision scheme.
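The modeling of steps 1.2.1 to 1.2.6 can be illustrated with a short numerical sketch. The following Python code is only an illustration: the number of surrounding vehicles I, the action bounds X_m, X_M, Y_m, Y_M, the zero-padding of missing neighbours, the clipped Gaussian form used for sampling and the discount factor are assumptions not fixed by the text above; only the shapes of the state s and the action a and the form of formula (3) follow the description.

```python
import numpy as np

# Illustrative constants (not fixed in the text above): the number of
# surrounding vehicles I and the action bounds X_m, X_M, Y_m, Y_M.
I = 6
X_MIN, X_MAX = -0.5, 0.5      # assumed lateral move per time slot (m)
Y_MIN, Y_MAX = 0.0, 30.0      # assumed longitudinal move per time slot (m), Y_m = 0

def build_state(ego, neighbors):
    """Step 1.2.1: s = [c_0, c_1, ..., c_I].
    ego = (x0, y0, v0x, v0y); neighbors = list of (dx, dy, vx, vy) for the
    I nearest vehicles, ordered by distance."""
    c = [list(ego)] + [list(n) for n in neighbors[:I]]
    while len(c) < I + 1:                 # zero-pad missing neighbours (an assumption)
        c.append([0.0, 0.0, 0.0, 0.0])
    return np.asarray(c, dtype=np.float32)        # two-dimensional state, shape (I+1, 4)

def sample_action(mu, sigma, rng=np.random.default_rng()):
    """Step 1.2.3: draw a = [x, y] from the density with mean mu_theta(s) and
    variance sigma_theta(s); a Gaussian clipped to the action range is assumed."""
    a = rng.normal(mu, np.sqrt(sigma))
    return np.clip(a, [X_MIN, Y_MIN], [X_MAX, Y_MAX])

def discounted_return(rewards, gamma=0.9):
    """Step 1.2.5, formula (3): R(tau) = sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# One decision step and the return of a short trajectory.
s = build_state((0.0, 0.0, 0.0, 15.0), [(3.5, 10.0, 0.0, 14.0)])
a = sample_action(mu=np.array([0.0, 15.0]), sigma=np.array([0.01, 4.0]))
print(s.shape, a, discounted_return([1.0, 0.8, -5.0]))
```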
Step 2, designing the solving network: to obtain the optimal vehicle micro-decision scheme of the driving micro-decision in step 1, the scheme is then solved with the A3C algorithm. In the A3C algorithm, the global network and the agent networks each comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and of all agent networks are respectively identical in structure. Both networks take the state as input and, in accordance with the two-dimensional structure of the state defined in step 1.2.1, adopt the neural networks composed of convolutional layers and fully connected layers shown in fig. 3 and fig. 4. The Actor network represents the policy function, and its output layer gives the parameters μ_θ(s) and σ_θ(s) of the probability density function in the policy function; the Critic network represents the state value function, and its output layer gives the state value V(s).
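Figures 3 and 4 only indicate that the Actor and the Critic are composed of convolutional layers and fully connected layers; the concrete layer widths, kernel sizes and activation functions in the following sketch are therefore assumptions rather than the networks of the figures. The sketch, written with PyTorch, shows the interface described in step 2: both networks take the (I+1)×4 state of step 1.2.1 as input, the Actor outputs μ_θ(s) and σ_θ(s), and the Critic outputs the state value V(s).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Policy network: state -> (mu_theta(s), sigma_theta(s)).
    Layer sizes and kernel shapes are assumptions; the text only states that
    convolutional and fully connected layers are used (fig. 3)."""
    def __init__(self, n_vehicles=7, n_features=4, action_dim=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * n_vehicles, 128), nn.ReLU())
        self.mu_head = nn.Linear(128, action_dim)      # mean of the density
        self.sigma_head = nn.Linear(128, action_dim)   # variance, kept positive

    def forward(self, state):                 # state: (batch, n_vehicles, 4)
        h = self.conv(state.transpose(1, 2)).flatten(1)
        h = self.fc(h)
        return self.mu_head(h), F.softplus(self.sigma_head(h))

class Critic(nn.Module):
    """Value network: state -> V(s) (fig. 4); same assumed backbone."""
    def __init__(self, n_vehicles=7, n_features=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * n_vehicles, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, state):
        h = self.conv(state.transpose(1, 2)).flatten(1)
        return self.fc(h).squeeze(-1)

# Example: one forward pass on a batch of two states of shape (I+1, 4).
s = torch.randn(2, 7, 4)
mu, sigma = Actor()(s)
v = Critic()(s)
print(mu.shape, sigma.shape, v.shape)   # (2, 2), (2, 2), (2,)
```

Keeping the variance output positive with a softplus is one common choice; the text does not state how σ_θ(s) is constrained.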
Step 3, solving the decision scheme: based on the model, the decision scheme and the solving network defined in steps 1 and 2, the Actor network and the Critic network are trained to obtain the optimal policy:
Step 3.1, training the global strategy: the training in this stage aims to obtain a basic driving strategy model applicable to all road sections. The training process of this stage is deployed on a two-layer structure consisting of a central server and RSUs (Road Side Units): the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is an agent. The specific deployment process is as follows:
Step 3.1.1, deploying the decision neural network on the central server and all RSUs;
step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
Step 3.1.2.1, each RSU collects the driving trajectories of the vehicles on the road it covers to simulate a driving environment and randomly generates an agent that executes driving behaviors in this simulated environment: state information obtained from the driving environment is input to the Actor and Critic networks, a driving decision is made according to the output of the Actor network and the corresponding driving action is executed, the interaction result is obtained after the interaction process ends and the driving environment of the next state is reached, and the interaction continues until sampled driving-trajectory data are generated;
step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
Step 3.1.2.5, after the network converges, the network parameters no longer change, and a basic model applicable to all road sections is obtained;
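A minimal single-process sketch of the parameter exchange in steps 3.1.2.1 to 3.1.2.4 is given below, reusing the Actor and Critic classes sketched after step 2. It assumes PyTorch, a standard A3C advantage-based loss and the uploading of gradients, none of which are spelled out in the text above; in a real deployment the upload and download would travel over the vehicular network between each RSU and the central server.

```python
import torch

def a3c_losses(actor, critic, states, actions, rewards, gamma=0.9):
    """Loss for one sampled trajectory (the data of step 3.1.2.1). The use of
    the advantage R_t - V(s_t) follows standard A3C; the text above only says
    that the Actor and Critic networks are trained."""
    returns, running = [], 0.0
    for r in reversed(rewards):                     # discounted returns, formula (3)
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns, dtype=torch.float32)
    values = critic(states)                          # V(s_t)
    advantage = returns - values
    mu, sigma = actor(states)                        # mu_theta(s_t), sigma_theta(s_t)
    dist = torch.distributions.Normal(mu, sigma.sqrt())
    log_prob = dist.log_prob(actions).sum(-1)
    actor_loss = -(log_prob * advantage.detach()).mean()
    critic_loss = advantage.pow(2).mean()
    return actor_loss + critic_loss

def rsu_training_round(local_actor, local_critic, trajectory):
    """Steps 3.1.2.1-3.1.2.2: train on the local trajectory data and produce
    the gradients that the RSU uploads to the central server."""
    states, actions, rewards = trajectory
    local_actor.zero_grad()
    local_critic.zero_grad()
    loss = a3c_losses(local_actor, local_critic, states, actions, rewards)
    loss.backward()
    return [p.grad.clone() for p in
            list(local_actor.parameters()) + list(local_critic.parameters())]

def server_update(global_actor, global_critic, grads, optimizer):
    """Step 3.1.2.3: the central server applies one RSU's gradients to the
    global network and returns the updated parameters."""
    params = list(global_actor.parameters()) + list(global_critic.parameters())
    for p, g in zip(params, grads):
        p.grad = g
    optimizer.step()
    optimizer.zero_grad()
    return global_actor.state_dict(), global_critic.state_dict()

def rsu_sync(local_actor, local_critic, actor_state, critic_state):
    """Step 3.1.2.4: the RSU synchronizes the returned global parameters
    into its local networks before the next round of sampling and training."""
    local_actor.load_state_dict(actor_state)
    local_critic.load_state_dict(critic_state)
```

Because the central server updates the global network as soon as the result of any single RSU arrives (step 3.1.2.3), the updates are asynchronous, which matches the asynchronous nature of the A3C algorithm.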
Step 3.2, training the single-road-section model: the global network layer of step 3.1 is moved down to the RSU of each road section, and the agent layer is moved down to all the automatic driving vehicles on the road covered by that RSU. Each road section is deployed as follows:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
Step 3.2.2, when an automatic driving vehicle enters the road covered by the RSU, it synchronizes the global network model from the RSU, becomes an agent in the agent layer of that road, and performs the training of the decision network:
Step 3.2.2.1, each vehicle is taken as an agent and its driving behavior trajectory is taken as the training sample, and the same process as in step 3.1.2.1 is performed to obtain sampled trajectory data;
Step 3.2.2.2, the vehicle trains its local decision network with the local driving-trajectory data set and uploads the training result to the RSU;
Step 3.2.2.3, after the RSU receives the training result transmitted by a vehicle, it updates the global network once and returns the updated global network parameters to the vehicle;
Step 3.2.2.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes them to its local network and starts a new round of sample collection and training on that basis, until the vehicle leaves the current road.
It should be noted that the Actor network structures of the global network and of all agents are the same, and the Critic network structures of the global network and of all agents are the same; that is, the global network and all agent networks have the same structure: each comprises an Actor network and a Critic network, all Actor networks have the same structure, and all Critic networks have the same structure.
The above description is only for the preferred embodiment of the present invention and should not be construed as limiting the present invention, and various modifications and changes can be made by those skilled in the art without departing from the spirit and principle of the present invention, and any modifications, equivalents, improvements, etc. should be included in the scope of the claims of the present invention.

Claims (3)

1. An automatic driving vehicle micro-decision method based on reinforcement learning is characterized by comprising the following steps:
step 1, reinforcement learning modeling is carried out, and an automatic driving decision scheme is subjected to modeling representation;
Step 2, designing the solving network: to obtain the optimal vehicle micro-decision scheme for the driving micro-decision of step 1, the problem is solved with the A3C algorithm. In the A3C algorithm, the global network and the agent networks each comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and of all agent networks are respectively identical in structure. Both networks take the state as input and, in accordance with step 1, adopt a neural network composed of convolutional layers and fully connected layers. The Actor network represents the policy function, and its output layer gives the parameters μ_θ(s) and σ_θ(s) of the probability density function in the policy function; the Critic network represents the state value function, and its output layer gives the state value V(s);
Step 3, solving the decision scheme: based on the model, the decision scheme and the solving network defined in steps 1 and 2, the Actor network and the Critic network are trained to obtain the optimal policy.
2. The reinforcement learning-based automated driving vehicle micro-decision method according to claim 1, further comprising, in step 1, the steps of:
Step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatic driving vehicle is regarded as an agent and the driving environment of the vehicle as the reinforcement learning environment; the agent vehicle makes driving decisions and executes driving behaviors according to the detected environment information, and adjusts its driving decisions according to the driving results; the driving time is divided into a number of time slots, and at the beginning of each time slot each agent vehicle makes a driving decision that determines its driving behavior within that time slot;
Step 1.2, modeling is carried out by using basic elements in reinforcement learning:
Step 1.2.1, state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction (if the lane is a curve, the y direction is the tangential direction of the lane). The positions and speeds of the agent vehicle and of the I nearest surrounding vehicles are defined as the state, and the state set is expressed as: s = [c_0, c_1, c_2, ..., c_I], where s is one sample in the state set, c_0 = [x_0, y_0, v_0x, v_0y] is a vector consisting of the x- and y-direction positions and speeds of the agent vehicle, and c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, where Δx_i, Δy_i, v_ix, v_iy respectively denote the x- and y-direction distances and speeds of the i-th vehicle nearest to the agent vehicle;
Step 1.2.2, action set: the distances moved by the agent vehicle in the two directions within one time slot are defined as an action, expressed as: a = [x, y], X_m < x < X_M, Y_m < y < Y_M, where a is one sample in the action set, x and y respectively denote the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively denote the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, policy function: the policy function π: S → A is a mapping from states to actions and represents the specific way in which the agent selects an action according to the current state. The policy function is defined as a stochastic function π_θ(a|s), whose value is the probability of taking action a in state s; that is, the policy function is a probability density function, and the action is obtained by sampling from this density, as shown in the following formula (1):

π_θ(a|s) = f(a; μ_θ(s), σ_θ(s)),  a_m < a < a_M    (1)

In formula (1), a_m = [X_m, Y_m] and a_M = [X_M, Y_M] denote the minimum and maximum values of the action, and f(a; μ_θ(s), σ_θ(s)) is the probability density over the action range, where μ_θ(s) represents the mean of the distribution and σ_θ(s) represents the variance of the distribution;
Step 1.2.4, reward function: the reward function specifies the reward value obtained after a certain action is performed in a certain state, reflecting the quality of the action selection, and is defined as formula (2) [the expression of formula (2) is given as an image in the original publication]; in formula (2), the coefficient k_c is positive;
step 1.2.5, objective function:in order to obtain the optimal driving strategy, the driving strategy is taken as a variable according to the consideration of safety and driving efficiency, the following optimization target is defined, for each agent, the action is selected according to the strategy function in the initial state to reach the next state, the process of selecting the action and reaching the next state is continuously repeated, and a track (pi) is finally generated after a plurality of iterationsθ) For this track, the cumulative discount return is expressed as the following equation (3):
Figure FDA0002571864740000025
in equation (3), γ is a discount factor representing the importance of the return at a future time to the decision at that time, r tRepresenting the reward obtained by the agent at time t, the expectation of the cumulative discount reward is taken as an objective function, as shown in the following formula (4):
Figure FDA0002571864740000026
in the formula (4), the first and second groups,
Figure FDA0002571864740000031
representing a desire to accumulate a discount return,
Step 1.2.6, optimization of the decision scheme: the driving decision scheme is to find the optimal policy π* that maximizes the objective function. The optimization of the policy is in essence the optimization of the parameter θ of the policy function, and the optimization decision scheme is finally expressed as the following formula (5):

θ* = argmax_θ J(θ)    (5)

After the optimal parameter θ* is obtained, the optimal policy is expressed as π* = π_{θ*}, i.e. the optimal vehicle micro-decision scheme.
3. The reinforcement learning-based automated vehicular micro decision making method according to claim 1, further comprising, in step 3, the steps of:
Step 3.1, training the global strategy: the training in this stage aims to obtain a basic driving strategy model applicable to all road sections. The training process of this stage is deployed on a two-layer structure consisting of a central server and RSUs (Road Side Units): the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is an agent. The specific deployment process is as follows:
step 3.1.1, deploying the decision neural network on the central server and all RSUs;
Step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
Step 3.1.2.1, each RSU collects the driving trajectories of the vehicles on the road it covers to simulate a driving environment and randomly generates an agent that executes driving behaviors in this simulated environment: state information obtained from the driving environment is input to the Actor and Critic networks, a driving decision is made according to the output of the Actor network and the corresponding driving action is executed, the interaction result is obtained after the interaction process ends and the driving environment of the next state is reached, and the interaction continues until sampled driving-trajectory data are generated;
step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
Step 3.1.2.5, after the network converges, the network parameters no longer change, and a basic model applicable to all road sections is obtained;
Step 3.2, training the single-road-section model: the global network layer of step 3.1 is moved down to the RSU of each road section, and the agent layer is moved down to all the automatic driving vehicles on the road covered by that RSU. Each road section is deployed as follows:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
Step 3.2.2, when an automatic driving vehicle enters the road covered by the RSU, it synchronizes the global network model from the RSU, becomes an agent in the agent layer of that road, and performs the training of the decision network:
Step 3.2.2.1, each vehicle is taken as an agent and its driving behavior trajectory is taken as the training sample, and the same process as in step 3.1.2.1 is performed to obtain sampled trajectory data;
Step 3.2.2.2, the vehicle trains its local decision network with the local driving-trajectory data set and uploads the training result to the RSU;
Step 3.2.2.3, after the RSU receives the training result transmitted by a vehicle, it updates the global network once and returns the updated global network parameters to the vehicle;
Step 3.2.2.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes them to its local network and starts a new round of sample collection and training on that basis, until the vehicle leaves the current road.
CN202010642778.8A 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning Active CN111845773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010642778.8A CN111845773B (en) 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010642778.8A CN111845773B (en) 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111845773A true CN111845773A (en) 2020-10-30
CN111845773B CN111845773B (en) 2021-10-26

Family

ID=73153538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010642778.8A Active CN111845773B (en) 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111845773B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348201A (en) * 2020-11-11 2021-02-09 扬州大学 Intelligent decision implementation method for automatic driving group vehicle based on federal deep reinforcement learning
CN112644516A (en) * 2020-12-16 2021-04-13 吉林大学青岛汽车研究院 Unmanned control system and control method suitable for roundabout scene
CN112700642A (en) * 2020-12-19 2021-04-23 北京工业大学 Method for improving traffic passing efficiency by using intelligent internet vehicle
CN112896187A (en) * 2021-02-08 2021-06-04 浙江大学 System and method for considering social compatibility and making automatic driving decision
CN113044064A (en) * 2021-04-01 2021-06-29 南京大学 Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN113071524A (en) * 2021-04-29 2021-07-06 深圳大学 Decision control method, decision control device, autonomous driving vehicle and storage medium
CN113099418A (en) * 2021-03-26 2021-07-09 深圳供电局有限公司 Optimization method of block chain task for data transmission of Internet of vehicles
CN113501008A (en) * 2021-08-12 2021-10-15 东风悦享科技有限公司 Automatic driving behavior decision method based on reinforcement learning algorithm
CN113511222A (en) * 2021-08-27 2021-10-19 清华大学 Scene self-adaptive vehicle interactive behavior decision and prediction method and device
CN113619604A (en) * 2021-08-26 2021-11-09 清华大学 Integrated decision and control method and device for automatic driving automobile and storage medium
CN117828489A (en) * 2024-03-05 2024-04-05 河钢国际科技(北京)有限公司 Intelligent ship remote dynamic control system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015094645A1 (en) * 2013-12-22 2015-06-25 Lytx, Inc. Autonomous driving comparison and evaluation
US20180011488A1 (en) * 2016-07-08 2018-01-11 Toyota Motor Engineering & Manufacturing North America, Inc. Control policy learning and vehicle control method based on reinforcement learning without active exploration
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 It is a kind of based on deeply study vehicle low speed with decision-making technique of speeding
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Vision navigation method and system based on deeply study
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A kind of Vehicular automatic driving control method and device based on nitrification enhancement
CN110406530A (en) * 2019-07-02 2019-11-05 宁波吉利汽车研究开发有限公司 A kind of automatic Pilot method, apparatus, equipment and vehicle
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015094645A1 (en) * 2013-12-22 2015-06-25 Lytx, Inc. Autonomous driving comparison and evaluation
US20180011488A1 (en) * 2016-07-08 2018-01-11 Toyota Motor Engineering & Manufacturing North America, Inc. Control policy learning and vehicle control method based on reinforcement learning without active exploration
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A kind of Vehicular automatic driving control method and device based on nitrification enhancement
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 It is a kind of based on deeply study vehicle low speed with decision-making technique of speeding
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Vision navigation method and system based on deeply study
CN110406530A (en) * 2019-07-02 2019-11-05 宁波吉利汽车研究开发有限公司 A kind of automatic Pilot method, apparatus, equipment and vehicle
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
立升波: "Key technologies of deep neural networks and their applications in the field of automatic driving", Journal of Automotive Safety and Energy *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348201A (en) * 2020-11-11 2021-02-09 扬州大学 Intelligent decision implementation method for automatic driving group vehicle based on federal deep reinforcement learning
CN112348201B (en) * 2020-11-11 2024-03-12 扬州大学 Intelligent decision-making implementation method of automatic driving group vehicle based on federal deep reinforcement learning
CN112644516A (en) * 2020-12-16 2021-04-13 吉林大学青岛汽车研究院 Unmanned control system and control method suitable for roundabout scene
CN112644516B (en) * 2020-12-16 2022-03-29 吉林大学青岛汽车研究院 Unmanned control system and control method suitable for roundabout scene
CN112700642A (en) * 2020-12-19 2021-04-23 北京工业大学 Method for improving traffic passing efficiency by using intelligent internet vehicle
CN112896187B (en) * 2021-02-08 2022-07-26 浙江大学 System and method for considering social compatibility and making automatic driving decision
CN112896187A (en) * 2021-02-08 2021-06-04 浙江大学 System and method for considering social compatibility and making automatic driving decision
CN113099418A (en) * 2021-03-26 2021-07-09 深圳供电局有限公司 Optimization method of block chain task for data transmission of Internet of vehicles
CN113099418B (en) * 2021-03-26 2022-08-16 深圳供电局有限公司 Optimization method of block chain task for data transmission of Internet of vehicles
CN113044064A (en) * 2021-04-01 2021-06-29 南京大学 Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN113044064B (en) * 2021-04-01 2022-07-29 南京大学 Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN113071524A (en) * 2021-04-29 2021-07-06 深圳大学 Decision control method, decision control device, autonomous driving vehicle and storage medium
CN113501008B (en) * 2021-08-12 2023-05-19 东风悦享科技有限公司 Automatic driving behavior decision method based on reinforcement learning algorithm
CN113501008A (en) * 2021-08-12 2021-10-15 东风悦享科技有限公司 Automatic driving behavior decision method based on reinforcement learning algorithm
CN113619604A (en) * 2021-08-26 2021-11-09 清华大学 Integrated decision and control method and device for automatic driving automobile and storage medium
CN113619604B (en) * 2021-08-26 2023-08-15 清华大学 Integrated control method, device and storage medium for automatic driving automobile
CN113511222A (en) * 2021-08-27 2021-10-19 清华大学 Scene self-adaptive vehicle interactive behavior decision and prediction method and device
CN113511222B (en) * 2021-08-27 2023-09-26 清华大学 Scene self-adaptive vehicle interaction behavior decision and prediction method and device
CN117828489A (en) * 2024-03-05 2024-04-05 河钢国际科技(北京)有限公司 Intelligent ship remote dynamic control system
CN117828489B (en) * 2024-03-05 2024-05-14 河钢国际科技(北京)有限公司 Intelligent ship remote dynamic control system

Also Published As

Publication number Publication date
CN111845773B (en) 2021-10-26

Similar Documents

Publication Publication Date Title
CN111845773B (en) Automatic driving vehicle micro-decision-making method based on reinforcement learning
CN114495527B (en) Internet-connected intersection vehicle road collaborative optimization method and system in mixed traffic environment
CN111931905B (en) Graph convolution neural network model and vehicle track prediction method using same
CN110032782B (en) City-level intelligent traffic signal control system and method
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
US11243532B1 (en) Evaluating varying-sized action spaces using reinforcement learning
KR102306939B1 (en) Method and device for short-term path planning of autonomous driving through information fusion by using v2x communication and image processing
CN111696370A (en) Traffic light control method based on heuristic deep Q network
CN111267830B (en) Hybrid power bus energy management method, device and storage medium
CN104952248A (en) Automobile convergence predicting method based on Euclidean space
CN113255998B (en) Expressway unmanned vehicle formation method based on multi-agent reinforcement learning
CN112183288B (en) Multi-agent reinforcement learning method based on model
CN104966129A (en) Method for separating vehicle running track
Baskar et al. Hierarchical traffic control and management with intelligent vehicles
CN118097989B (en) Multi-agent traffic area signal control method based on digital twin
CN112550314A (en) Embedded optimization type control method suitable for unmanned driving, driving control module and automatic driving control system thereof
CN113903173B (en) Vehicle track feature extraction method based on directed graph structure and LSTM
CN111310919A (en) Driving control strategy training method based on scene segmentation and local path planning
CN113299079B (en) Regional intersection signal control method based on PPO and graph convolution neural network
Gutiérrez-Moreno et al. Hybrid decision making for autonomous driving in complex urban scenarios
CN114267191A (en) Control system, method, medium, equipment and application for relieving traffic jam of driver
CN117075473A (en) Multi-vehicle collaborative decision-making method in man-machine mixed driving environment
CN115273502B (en) Traffic signal cooperative control method
CN116620327A (en) Lane changing decision method for realizing automatic driving high-speed scene based on PPO and Lattice
CN114117944B (en) Model updating method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant