CN111845773A - Automatic driving vehicle micro-decision-making method based on reinforcement learning - Google Patents

Automatic driving vehicle micro-decision-making method based on reinforcement learning Download PDF

Info

Publication number
CN111845773A
Authority
CN
China
Prior art keywords
driving
network
vehicle
decision
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010642778.8A
Other languages
Chinese (zh)
Other versions
CN111845773B (en)
Inventor
郑侃
刘杰
赵龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202010642778.8A priority Critical patent/CN111845773B/en
Publication of CN111845773A publication Critical patent/CN111845773A/en
Application granted granted Critical
Publication of CN111845773B publication Critical patent/CN111845773B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001Details of the control system
    • B60W2050/0019Control system elements or transfer functions
    • B60W2050/0028Mathematical models, e.g. for simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Probability & Statistics with Applications (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a reinforcement-learning-based micro-decision method for automatic driving vehicles. The method uses the A3C reinforcement learning algorithm: the driving behavior is output by an Actor network, so the decision is highly flexible and the complexity of the decision logic is not affected by the size of the state space or the behavior space. The method adopts a two-stage training and solving process. In the first stage, a micro-decision model applicable to all road sections is obtained through training, so as to guarantee driving safety. In the second stage, the overall model of the first stage is deployed to each road section, and each road section trains its own single-road-section model on that basis, which provides portability; the continuous training of the second stage also allows the method to adapt to the influence of various real-time factors. Finally, a distributed communication architecture based on a real Internet-of-Vehicles system structure is set forth to complete the distributed computation of the solving process, so that the method can adapt to different road characteristics and dynamic driving environments and has wide applicability and robustness.

Description

Automatic driving vehicle micro-decision-making method based on reinforcement learning
Technical Field
The invention relates to the technical field of automatic driving, in particular to an automatic driving vehicle micro-decision method based on reinforcement learning.
Background
Automatic driving is one of the core technologies of intelligent transportation. Automatic driving decisions generally fall into two types. One type is the macroscopic path-planning problem: after the origin and destination of the vehicle are determined, factors such as driving distance and congestion are considered comprehensively to select an optimal driving path. The other type is the microscopic decision problem addressed by the present invention: during driving, the vehicle continuously selects specific driving behaviors according to the surrounding driving environment.
In the prior art, automatic driving vehicle micro-decision models fall into the following categories:
Finite state machine model: the vehicle selects an appropriate driving behavior from predefined behavior modes such as parking, lane changing, overtaking, avoiding and slow running according to the environment;
Decision tree model: the driving behavior modes are expressed by the tree structure of the model, the judgment logic is fixed at the branch nodes of the tree, and decisions are made by a top-down search mechanism.
For example, the invention patent with Chinese patent publication No. CN110969848A discloses an automatic driving overtaking decision method based on reinforcement learning on an opposite two-lane road, which comprises the following steps: collecting the traffic state of the automatic driving vehicle through sensors; inputting the collected traffic state into a trained decision model; the decision model selecting and outputting a corresponding driving action instruction from the action space according to the input information, the automatic driving vehicle forming a new traffic state after performing the driving action; calculating the reward value of the driving action through a reward function, and storing the original traffic state, the driving action, the reward value and the new traffic state as a transition sample in an experience replay pool; calculating the loss function value of the decision model, and optimizing the parameters of the decision model according to the transition samples and the loss function value; and repeating the above steps until the automatic driving is finished. The method ensures the safety and comfort of the overtaking decision process of the automatic driving vehicle and improves the human-likeness and robustness of the decision by means of a reinforcement learning decision method.
For another example, the invention patent with Chinese patent publication No. CN109624986A discloses a learning cruise control system and method based on mode switching, which performs adaptive cruise control through mode switching for a specific driver style and adaptive learning of following behavior. The system characterizes the driving style as a switching strategy among several modes of the driver under different following conditions (constant-speed cruising, accelerated approach, steady-state following and hard braking), learns this driving style, and further learns the driving characteristics in each driving mode using a continuous-state learning method.
The prior art has at least the following problems:
The finite state machine model and the decision tree model both ignore environmental uncertainty and cannot adapt well to dynamic changes of the environment; when more behavior modes are defined, the state space and the behavior space become large, the judgment logic becomes complex, the feasibility is low, and it is difficult to achieve good decision performance on urban roads with rich structural features.
For the above problems in the prior art, namely that the finite state machine model and the decision tree model ignore environmental uncertainty, cannot adapt well to dynamic changes of the environment, and, when more behavior modes are defined, suffer from large state and behavior spaces, complex judgment logic, low feasibility and poor decision performance on urban roads with rich structural features, no effective solution has been proposed at present.
Disclosure of Invention
In view of the above defects of the prior art, the invention aims to provide a reinforcement-learning-based automatic driving vehicle micro-decision method that meets both the safety requirement and the driving efficiency requirement of automatic driving.
The automatic driving vehicle micro-decision method comprises the following steps:
step 1, reinforcement learning modeling is carried out, and an automatic driving decision scheme is subjected to modeling representation;
Step 2, designing the solving network: to obtain the optimal vehicle micro-decision scheme for the driving micro-decision of step 1, the problem is solved with the A3C algorithm. In the A3C algorithm, the global network and the agent networks each comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and of all agent networks are respectively identical in structure. Both networks take the state as input and, in accordance with step 1, adopt a neural network composed of convolutional layers and fully connected layers. The Actor network represents the policy function, and its output layer gives the parameters μ_θ(s) and σ_θ(s) of the probability density function in the policy function; the Critic network represents the state value function, and its output layer gives the state value V(s).
Step 3, solving the decision scheme: based on the model, the decision scheme and the solving network defined in steps 1 and 2, the Actor network and the Critic network are trained to obtain the optimal policy.
Further, in step 1, the method further comprises the following steps:
Step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatic driving vehicle is regarded as an agent and the driving environment of the vehicle as the reinforcement learning environment; the agent vehicle makes driving decisions and executes driving behaviors according to the detected environment information, and adjusts its driving decisions according to the driving results; the driving time is divided into a number of time slots, and at the beginning of each time slot each agent vehicle makes a driving decision that determines its driving behavior within that time slot;
step 1.2, modeling is carried out by using basic elements in reinforcement learning:
Step 1.2.1, state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction (if the lane is a curve, the y direction is the tangential direction of the lane). The positions and speeds of the agent vehicle and of the I nearest surrounding vehicles are defined as the state, and the state set is expressed as: s = [c_0, c_1, c_2, ..., c_I], where s is one sample in the state set, c_0 = [x_0, y_0, v_0x, v_0y] is a vector consisting of the x- and y-direction positions and speeds of the agent vehicle, and c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, where Δx_i, Δy_i, v_ix, v_iy respectively denote the x- and y-direction distances and speeds of the i-th vehicle nearest to the agent vehicle;
Step 1.2.2, action set: the distances moved by the agent vehicle in the two directions within one time slot are defined as an action, expressed as: a = [x, y], X_m < x < X_M, Y_m < y < Y_M, where a is one sample in the action set, x and y respectively denote the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively denote the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, policy function: the policy function π: S → A is a mapping from states to actions and represents the specific way in which the agent selects an action according to the current state. The policy function is defined as a stochastic function π_θ(a|s), whose value is the probability of taking action a in state s; that is, the policy function is a probability density function, and the action is obtained by sampling from this density, as shown in the following formula (1):

π_θ(a|s) = f(a; μ_θ(s), σ_θ(s)),  a_m < a < a_M    (1)

In formula (1), a_m = [X_m, Y_m] and a_M = [X_M, Y_M] denote the minimum and maximum values of the action, and f(a; μ_θ(s), σ_θ(s)) is the probability density over the action range, where μ_θ(s) represents the mean of the distribution and σ_θ(s) represents the variance of the distribution;
Step 1.2.4, reward function: the reward function specifies the reward value obtained after a certain action is performed in a certain state, reflecting the quality of the action selection, and is defined as formula (2) [the expression of formula (2) is given as an image in the original publication]; in formula (2), the coefficient k_c is positive;
Step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and, considering both safety and driving efficiency, the following optimization objective is defined. For each agent, starting from the initial state, an action is selected according to the policy function to reach the next state; this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated. For this trajectory, the cumulative discounted return is expressed as the following formula (3):

R(τ) = Σ_t γ^t · r_t    (3)

In formula (3), γ is the discount factor, representing the importance of the return at a future time to the decision at the current time, and r_t denotes the reward obtained by the agent at time t. The expectation of the cumulative discounted return is taken as the objective function, as shown in the following formula (4):

J(θ) = E_{τ~π_θ}[ R(τ) ]    (4)

In formula (4), E_{τ~π_θ}[·] denotes the expectation of the cumulative discounted return over the trajectories generated by the policy π_θ;
Step 1.2.6, optimization of the decision scheme: the driving decision scheme is to find the optimal policy π* that maximizes the objective function. The optimization of the policy is in essence the optimization of the parameter θ of the policy function, and the optimization decision scheme is finally expressed as the following formula (5):

θ* = argmax_θ J(θ)    (5)

After the optimal parameter θ* is obtained, the optimal policy is expressed as π* = π_{θ*}, i.e. the optimal vehicle micro-decision scheme.
Further, in step 3, the method further comprises the following steps:
Step 3.1, training the global strategy: the training in this stage aims to obtain a basic driving strategy model applicable to all road sections. The training process of this stage is deployed on a two-layer structure consisting of a central server and RSUs (Road Side Units): the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is an agent. The specific deployment process is as follows:
step 3.1.1, deploying the decision neural network on the central server and all RSUs;
step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
Step 3.1.2.1, each RSU collects the driving trajectories of the vehicles on the road it covers to simulate a driving environment and randomly generates an agent that executes driving behaviors in this simulated environment: state information obtained from the driving environment is input to the Actor and Critic networks, a driving decision is made according to the output of the Actor network and the corresponding driving action is executed, the interaction result is obtained after the interaction process ends and the driving environment of the next state is reached, and the interaction continues until sampled driving-trajectory data are generated;
Step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
Step 3.1.2.5, after the network converges, the network parameters no longer change, and a basic model applicable to all road sections is obtained;
Step 3.2, training the single-road-section model: the global network layer of step 3.1 is moved down to the RSU of each road section, and the agent layer is moved down to all the automatic driving vehicles on the road covered by that RSU. Each road section is deployed as follows:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
Step 3.2.2, when an automatic driving vehicle enters the road covered by the RSU, it synchronizes the global network model from the RSU, becomes an agent in the agent layer of that road, and performs the training of the decision network:
Step 3.2.2.1, each vehicle is taken as an agent and its driving behavior trajectory is taken as the training sample, and the same process as in step 3.1.2.1 is performed to obtain sampled trajectory data;
Step 3.2.2.2, the vehicle trains its local decision network with the local driving-trajectory data set and uploads the training result to the RSU;
Step 3.2.2.3, after the RSU receives the training result transmitted by a vehicle, it updates the global network once and returns the updated global network parameters to the vehicle;
Step 3.2.2.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes them to its local network and starts a new round of sample collection and training on that basis, until the vehicle leaves the current road.
Compared with the prior art, the automatic driving vehicle micro-decision method has the following remarkable advantages:
1, the design of the invention is combined with the network architecture of the Internet of Vehicles, is easy to deploy and has strong feasibility.
2, the invention does not use predefined driving modes, so the driving behavior is more flexible and adaptable; the complexity of the decision does not increase with the size of the state space and the behavior space, and the computation is simpler.
3, the first stage of the invention yields a universal driving model that guarantees driving safety on different road sections; therefore, when a road section is newly added, the model only needs to be synchronized from the central server, and the RSU and the automatic driving vehicles can start the training process immediately, so the invention has strong universality and portability.
4, the second stage of the invention yields a dedicated driving model for each road section; compared with using the same model for all road sections, the model of the invention better adapts to the characteristics of different road sections, and under the driving environment of a given road section the driving efficiency of the single-road-section model is superior to that of a model shared by all road sections; in addition, compared with training a separate model for each road section independently, the training and computation cost of the invention is lower.
5, compared with a fixed driving strategy model, the second-stage model of the method can adapt to constantly changing real-time factors such as road conditions, weather and traffic flow density, and has better robustness.
Drawings
FIG. 1 is a schematic diagram of the calculation structure of the A3C algorithm of the reinforcement learning-based automatic vehicle micro-decision making method according to the present invention;
FIG. 2 is a schematic diagram of a three-layer system architecture of the reinforcement learning-based automated vehicle micro-decision making method of the present invention;
FIG. 3 is a schematic diagram of the Actor network structure of the reinforcement-learning-based automatic driving vehicle micro-decision method of the present invention;
FIG. 4 is a schematic diagram of the Critic network structure of the reinforcement-learning-based automatic driving vehicle micro-decision method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
As shown in fig. 1 to 4, the automatic driving vehicle micro-decision method includes the following steps:
step 1, reinforcement learning modeling, and modeling and representing an automatic driving decision scheme:
Step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatic driving vehicle is regarded as an agent and the driving environment of the vehicle as the reinforcement learning environment; the agent vehicle makes driving decisions and executes driving behaviors according to the detected environment information, and adjusts its driving decisions according to the driving results; the driving time is divided into a number of time slots, and at the beginning of each time slot each agent vehicle makes a driving decision that determines its driving behavior within that time slot;
step 1.2, modeling is carried out by using basic elements in reinforcement learning:
Step 1.2.1, state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction (if the lane is a curve, the y direction is the tangential direction of the lane). The positions and speeds of the agent vehicle and of the I nearest surrounding vehicles are defined as the state, and the state set is expressed as: s = [c_0, c_1, c_2, ..., c_I], where s is one sample in the state set, c_0 = [x_0, y_0, v_0x, v_0y] is a vector consisting of the x- and y-direction positions and speeds of the agent vehicle, and c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, where Δx_i, Δy_i, v_ix, v_iy respectively denote the x- and y-direction distances and speeds of the i-th vehicle nearest to the agent vehicle;
Step 1.2.2, action set: the distances moved by the agent vehicle in the two directions within one time slot are defined as an action, expressed as: a = [x, y], X_m < x < X_M, Y_m < y < Y_M, where a is one sample in the action set, x and y respectively denote the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively denote the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, policy function: the policy function π: S → A is a mapping from states to actions and represents the specific way in which the agent selects an action according to the current state. The policy function is defined as a stochastic function π_θ(a|s), whose value is the probability of taking action a in state s; that is, the policy function is a probability density function, and the action is obtained by sampling from this density, as shown in the following formula (1):

π_θ(a|s) = f(a; μ_θ(s), σ_θ(s)),  a_m < a < a_M    (1)

In formula (1), a_m = [X_m, Y_m] and a_M = [X_M, Y_M] denote the minimum and maximum values of the action, and f(a; μ_θ(s), σ_θ(s)) is the probability density over the action range, where μ_θ(s) represents the mean of the distribution and σ_θ(s) represents the variance of the distribution;
Step 1.2.4, reward function: the reward function specifies the reward value obtained after a certain action is performed in a certain state, reflecting the quality of the action selection, and is defined as formula (2) [the expression of formula (2) is given as an image in the original publication]; in formula (2), the coefficient k_c is positive;
Step 1.2.5, objective function: in order to obtain the optimal driving strategy, the driving strategy is taken as the variable and, considering both safety and driving efficiency, the following optimization objective is defined. For each agent, starting from the initial state, an action is selected according to the policy function to reach the next state; this process of selecting an action and reaching the next state is repeated continuously, and after a number of iterations a trajectory τ(π_θ) is finally generated. For this trajectory, the cumulative discounted return is expressed as the following formula (3):

R(τ) = Σ_t γ^t · r_t    (3)

In formula (3), γ is the discount factor, representing the importance of the return at a future time to the decision at the current time, and r_t denotes the reward obtained by the agent at time t. The expectation of the cumulative discounted return is taken as the objective function, as shown in the following formula (4):

J(θ) = E_{τ~π_θ}[ R(τ) ]    (4)

In formula (4), E_{τ~π_θ}[·] denotes the expectation of the cumulative discounted return over the trajectories generated by the policy π_θ;
Step 1.2.6, optimization of the decision scheme: the driving decision scheme is to find the optimal policy π* that maximizes the objective function. The optimization of the policy is in essence the optimization of the parameter θ of the policy function, and the optimization decision scheme is finally expressed as the following formula (5):

θ* = argmax_θ J(θ)    (5)

After the optimal parameter θ* is obtained, the optimal policy is expressed as π* = π_{θ*}, i.e. the optimal vehicle micro-decision scheme.
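The modeling of steps 1.2.1 to 1.2.6 can be illustrated with a short numerical sketch. The following Python code is only an illustration: the number of surrounding vehicles I, the action bounds X_m, X_M, Y_m, Y_M, the zero-padding of missing neighbours, the clipped Gaussian form used for sampling and the discount factor are assumptions not fixed by the text above; only the shapes of the state s and the action a and the form of formula (3) follow the description.

```python
import numpy as np

# Illustrative constants (not fixed in the text above): the number of
# surrounding vehicles I and the action bounds X_m, X_M, Y_m, Y_M.
I = 6
X_MIN, X_MAX = -0.5, 0.5      # assumed lateral move per time slot (m)
Y_MIN, Y_MAX = 0.0, 30.0      # assumed longitudinal move per time slot (m), Y_m = 0

def build_state(ego, neighbors):
    """Step 1.2.1: s = [c_0, c_1, ..., c_I].
    ego = (x0, y0, v0x, v0y); neighbors = list of (dx, dy, vx, vy) for the
    I nearest vehicles, ordered by distance."""
    c = [list(ego)] + [list(n) for n in neighbors[:I]]
    while len(c) < I + 1:                 # zero-pad missing neighbours (an assumption)
        c.append([0.0, 0.0, 0.0, 0.0])
    return np.asarray(c, dtype=np.float32)        # two-dimensional state, shape (I+1, 4)

def sample_action(mu, sigma, rng=np.random.default_rng()):
    """Step 1.2.3: draw a = [x, y] from the density with mean mu_theta(s) and
    variance sigma_theta(s); a Gaussian clipped to the action range is assumed."""
    a = rng.normal(mu, np.sqrt(sigma))
    return np.clip(a, [X_MIN, Y_MIN], [X_MAX, Y_MAX])

def discounted_return(rewards, gamma=0.9):
    """Step 1.2.5, formula (3): R(tau) = sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# One decision step and the return of a short trajectory.
s = build_state((0.0, 0.0, 0.0, 15.0), [(3.5, 10.0, 0.0, 14.0)])
a = sample_action(mu=np.array([0.0, 15.0]), sigma=np.array([0.01, 4.0]))
print(s.shape, a, discounted_return([1.0, 0.8, -5.0]))
```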
Step 2, designing the solving network: to obtain the optimal vehicle micro-decision scheme of the driving micro-decision in step 1, the scheme is then solved with the A3C algorithm. In the A3C algorithm, the global network and the agent networks each comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and of all agent networks are respectively identical in structure. Both networks take the state as input and, in accordance with the two-dimensional structure of the state defined in step 1.2.1, adopt the neural networks composed of convolutional layers and fully connected layers shown in fig. 3 and fig. 4. The Actor network represents the policy function, and its output layer gives the parameters μ_θ(s) and σ_θ(s) of the probability density function in the policy function; the Critic network represents the state value function, and its output layer gives the state value V(s).
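Figures 3 and 4 only indicate that the Actor and the Critic are composed of convolutional layers and fully connected layers; the concrete layer widths, kernel sizes and activation functions in the following sketch are therefore assumptions rather than the networks of the figures. The sketch, written with PyTorch, shows the interface described in step 2: both networks take the (I+1)×4 state of step 1.2.1 as input, the Actor outputs μ_θ(s) and σ_θ(s), and the Critic outputs the state value V(s).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Policy network: state -> (mu_theta(s), sigma_theta(s)).
    Layer sizes and kernel shapes are assumptions; the text only states that
    convolutional and fully connected layers are used (fig. 3)."""
    def __init__(self, n_vehicles=7, n_features=4, action_dim=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * n_vehicles, 128), nn.ReLU())
        self.mu_head = nn.Linear(128, action_dim)      # mean of the density
        self.sigma_head = nn.Linear(128, action_dim)   # variance, kept positive

    def forward(self, state):                 # state: (batch, n_vehicles, 4)
        h = self.conv(state.transpose(1, 2)).flatten(1)
        h = self.fc(h)
        return self.mu_head(h), F.softplus(self.sigma_head(h))

class Critic(nn.Module):
    """Value network: state -> V(s) (fig. 4); same assumed backbone."""
    def __init__(self, n_vehicles=7, n_features=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * n_vehicles, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, state):
        h = self.conv(state.transpose(1, 2)).flatten(1)
        return self.fc(h).squeeze(-1)

# Example: one forward pass on a batch of two states of shape (I+1, 4).
s = torch.randn(2, 7, 4)
mu, sigma = Actor()(s)
v = Critic()(s)
print(mu.shape, sigma.shape, v.shape)   # (2, 2), (2, 2), (2,)
```

Keeping the variance output positive with a softplus is one common choice; the text does not state how σ_θ(s) is constrained.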
Step 3, solving the decision scheme: based on the model, the decision scheme and the solving network defined in steps 1 and 2, the Actor network and the Critic network are trained to obtain the optimal policy:
Step 3.1, training the global strategy: the training in this stage aims to obtain a basic driving strategy model applicable to all road sections. The training process of this stage is deployed on a two-layer structure consisting of a central server and RSUs (Road Side Units): the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is an agent. The specific deployment process is as follows:
Step 3.1.1, deploying the decision neural network on the central server and all RSUs;
step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
Step 3.1.2.1, each RSU collects the driving trajectories of the vehicles on the road it covers to simulate a driving environment and randomly generates an agent that executes driving behaviors in this simulated environment: state information obtained from the driving environment is input to the Actor and Critic networks, a driving decision is made according to the output of the Actor network and the corresponding driving action is executed, the interaction result is obtained after the interaction process ends and the driving environment of the next state is reached, and the interaction continues until sampled driving-trajectory data are generated;
step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
Step 3.1.2.5, after the network converges, the network parameters no longer change, and a basic model applicable to all road sections is obtained;
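A minimal single-process sketch of the parameter exchange in steps 3.1.2.1 to 3.1.2.4 is given below, reusing the Actor and Critic classes sketched after step 2. It assumes PyTorch, a standard A3C advantage-based loss and the uploading of gradients, none of which are spelled out in the text above; in a real deployment the upload and download would travel over the vehicular network between each RSU and the central server.

```python
import torch

def a3c_losses(actor, critic, states, actions, rewards, gamma=0.9):
    """Loss for one sampled trajectory (the data of step 3.1.2.1). The use of
    the advantage R_t - V(s_t) follows standard A3C; the text above only says
    that the Actor and Critic networks are trained."""
    returns, running = [], 0.0
    for r in reversed(rewards):                     # discounted returns, formula (3)
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns, dtype=torch.float32)
    values = critic(states)                          # V(s_t)
    advantage = returns - values
    mu, sigma = actor(states)                        # mu_theta(s_t), sigma_theta(s_t)
    dist = torch.distributions.Normal(mu, sigma.sqrt())
    log_prob = dist.log_prob(actions).sum(-1)
    actor_loss = -(log_prob * advantage.detach()).mean()
    critic_loss = advantage.pow(2).mean()
    return actor_loss + critic_loss

def rsu_training_round(local_actor, local_critic, trajectory):
    """Steps 3.1.2.1-3.1.2.2: train on the local trajectory data and produce
    the gradients that the RSU uploads to the central server."""
    states, actions, rewards = trajectory
    local_actor.zero_grad()
    local_critic.zero_grad()
    loss = a3c_losses(local_actor, local_critic, states, actions, rewards)
    loss.backward()
    return [p.grad.clone() for p in
            list(local_actor.parameters()) + list(local_critic.parameters())]

def server_update(global_actor, global_critic, grads, optimizer):
    """Step 3.1.2.3: the central server applies one RSU's gradients to the
    global network and returns the updated parameters."""
    params = list(global_actor.parameters()) + list(global_critic.parameters())
    for p, g in zip(params, grads):
        p.grad = g
    optimizer.step()
    optimizer.zero_grad()
    return global_actor.state_dict(), global_critic.state_dict()

def rsu_sync(local_actor, local_critic, actor_state, critic_state):
    """Step 3.1.2.4: the RSU synchronizes the returned global parameters
    into its local networks before the next round of sampling and training."""
    local_actor.load_state_dict(actor_state)
    local_critic.load_state_dict(critic_state)
```

Because the central server updates the global network as soon as the result of any single RSU arrives (step 3.1.2.3), the updates are asynchronous, which matches the asynchronous nature of the A3C algorithm.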
Step 3.2, training the single-road-section model: the global network layer of step 3.1 is moved down to the RSU of each road section, and the agent layer is moved down to all the automatic driving vehicles on the road covered by that RSU. Each road section is deployed as follows:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
Step 3.2.2, when an automatic driving vehicle enters the road covered by the RSU, it synchronizes the global network model from the RSU, becomes an agent in the agent layer of that road, and performs the training of the decision network:
Step 3.2.2.1, each vehicle is taken as an agent and its driving behavior trajectory is taken as the training sample, and the same process as in step 3.1.2.1 is performed to obtain sampled trajectory data;
Step 3.2.2.2, the vehicle trains its local decision network with the local driving-trajectory data set and uploads the training result to the RSU;
Step 3.2.2.3, after the RSU receives the training result transmitted by a vehicle, it updates the global network once and returns the updated global network parameters to the vehicle;
Step 3.2.2.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes them to its local network and starts a new round of sample collection and training on that basis, until the vehicle leaves the current road.
It should be noted that the Actor network structures of the global network and of all agents are the same, and the Critic network structures of the global network and of all agents are the same; that is, the global network and all agent networks have the same structure: each comprises an Actor network and a Critic network, all Actor networks have the same structure, and all Critic networks have the same structure.
The above description is only for the preferred embodiment of the present invention and should not be construed as limiting the present invention, and various modifications and changes can be made by those skilled in the art without departing from the spirit and principle of the present invention, and any modifications, equivalents, improvements, etc. should be included in the scope of the claims of the present invention.

Claims (3)

1. An automatic driving vehicle micro-decision method based on reinforcement learning is characterized by comprising the following steps:
step 1, reinforcement learning modeling is carried out, and an automatic driving decision scheme is subjected to modeling representation;
Step 2, designing the solving network: to obtain the optimal vehicle micro-decision scheme for the driving micro-decision of step 1, the problem is solved with the A3C algorithm. In the A3C algorithm, the global network and the agent networks each comprise an Actor network and a Critic network, and the Actor networks and the Critic networks of the global network and of all agent networks are respectively identical in structure. Both networks take the state as input and, in accordance with step 1, adopt a neural network composed of convolutional layers and fully connected layers. The Actor network represents the policy function, and its output layer gives the parameters μ_θ(s) and σ_θ(s) of the probability density function in the policy function; the Critic network represents the state value function, and its output layer gives the state value V(s);
Step 3, solving the decision scheme: based on the model, the decision scheme and the solving network defined in steps 1 and 2, the Actor network and the Critic network are trained to obtain the optimal policy.
2. The reinforcement learning-based automated driving vehicle micro-decision method according to claim 1, further comprising, in step 1, the steps of:
Step 1.1, the driving process of a vehicle is defined as a Markov decision process: the automatic driving vehicle is regarded as an agent and the driving environment of the vehicle as the reinforcement learning environment; the agent vehicle makes driving decisions and executes driving behaviors according to the detected environment information, and adjusts its driving decisions according to the driving results; the driving time is divided into a number of time slots, and at the beginning of each time slot each agent vehicle makes a driving decision that determines its driving behavior within that time slot;
Step 1.2, modeling is carried out by using basic elements in reinforcement learning:
Step 1.2.1, state set: the lane direction is defined as the y direction and the direction perpendicular to it as the x direction (if the lane is a curve, the y direction is the tangential direction of the lane). The positions and speeds of the agent vehicle and of the I nearest surrounding vehicles are defined as the state, and the state set is expressed as: s = [c_0, c_1, c_2, ..., c_I], where s is one sample in the state set, c_0 = [x_0, y_0, v_0x, v_0y] is a vector consisting of the x- and y-direction positions and speeds of the agent vehicle, and c_i = [Δx_i, Δy_i, v_ix, v_iy], i ≤ I, where Δx_i, Δy_i, v_ix, v_iy respectively denote the x- and y-direction distances and speeds of the i-th vehicle nearest to the agent vehicle;
Step 1.2.2, action set: the distances moved by the agent vehicle in the two directions within one time slot are defined as an action, expressed as: a = [x, y], X_m < x < X_M, Y_m < y < Y_M, where a is one sample in the action set, x and y respectively denote the moving distances in the two directions, X_m, X_M, Y_m, Y_M respectively denote the minimum and maximum moving distances in the two directions, and Y_m = 0;
Step 1.2.3, policy function: the policy function π: S → A is a mapping from states to actions and represents the specific way in which the agent selects an action according to the current state. The policy function is defined as a stochastic function π_θ(a|s), whose value is the probability of taking action a in state s; that is, the policy function is a probability density function, and the action is obtained by sampling from this density, as shown in the following formula (1):

π_θ(a|s) = f(a; μ_θ(s), σ_θ(s)),  a_m < a < a_M    (1)

In formula (1), a_m = [X_m, Y_m] and a_M = [X_M, Y_M] denote the minimum and maximum values of the action, and f(a; μ_θ(s), σ_θ(s)) is the probability density over the action range, where μ_θ(s) represents the mean of the distribution and σ_θ(s) represents the variance of the distribution;
Step 1.2.4, reward function: the reward function specifies the reward value obtained after a certain action is performed in a certain state, reflecting the quality of the action selection, and is defined as formula (2) [the expression of formula (2) is given as an image in the original publication]; in formula (2), the coefficient k_c is positive;
step 1.2.5, objective function:in order to obtain the optimal driving strategy, the driving strategy is taken as a variable according to the consideration of safety and driving efficiency, the following optimization target is defined, for each agent, the action is selected according to the strategy function in the initial state to reach the next state, the process of selecting the action and reaching the next state is continuously repeated, and a track (pi) is finally generated after a plurality of iterationsθ) For this track, the cumulative discount return is expressed as the following equation (3):
Figure FDA0002571864740000025
in equation (3), γ is a discount factor representing the importance of the return at a future time to the decision at that time, r tRepresenting the reward obtained by the agent at time t, the expectation of the cumulative discount reward is taken as an objective function, as shown in the following formula (4):
Figure FDA0002571864740000026
in the formula (4), the first and second groups,
Figure FDA0002571864740000031
representing a desire to accumulate a discount return,
Step 1.2.6, optimization of the decision scheme: the driving decision scheme is to find the optimal policy π* that maximizes the objective function. The optimization of the policy is in essence the optimization of the parameter θ of the policy function, and the optimization decision scheme is finally expressed as the following formula (5):

θ* = argmax_θ J(θ)    (5)

After the optimal parameter θ* is obtained, the optimal policy is expressed as π* = π_{θ*}, i.e. the optimal vehicle micro-decision scheme.
3. The reinforcement learning-based automated vehicular micro decision making method according to claim 1, further comprising, in step 3, the steps of:
Step 3.1, training the global strategy: the training in this stage aims to obtain a basic driving strategy model applicable to all road sections. The training process of this stage is deployed on a two-layer structure consisting of a central server and RSUs (Road Side Units): the central server serves as the global network layer, all RSUs form the agent layer, and each RSU is an agent. The specific deployment process is as follows:
step 3.1.1, deploying the decision neural network on the central server and all RSUs;
Step 3.1.2, starting iteration of the training network, namely repeatedly executing the following steps until the network converges:
Step 3.1.2.1, each RSU collects the driving trajectories of the vehicles on the road it covers to simulate a driving environment and randomly generates an agent that executes driving behaviors in this simulated environment: state information obtained from the driving environment is input to the Actor and Critic networks, a driving decision is made according to the output of the Actor network and the corresponding driving action is executed, the interaction result is obtained after the interaction process ends and the driving environment of the next state is reached, and the interaction continues until sampled driving-trajectory data are generated;
step 3.1.2.2, the RSU trains a local decision network by using a local driving track data set, and uploads a training result to a central server;
step 3.1.2.3, after the central server collects the training result transmitted from one RSU, it updates the global network once and returns the updated global network parameter to the RSU;
step 3.1.2.4, after receiving the global network returned by the central server, the RSU synchronizes the global network to the local network, and starts a new round of sample collection and training on the basis;
Step 3.1.2.5, after the network converges, the network parameters no longer change, and a basic model applicable to all road sections is obtained;
Step 3.2, training the single-road-section model: the global network layer of step 3.1 is moved down to the RSU of each road section, and the agent layer is moved down to all the automatic driving vehicles on the road covered by that RSU. Each road section is deployed as follows:
step 3.2.1, the RSU synchronizes the basic model obtained in the first stage from the central server to be used as a global network;
Step 3.2.2, when an automatic driving vehicle enters the road covered by the RSU, it synchronizes the global network model from the RSU, becomes an agent in the agent layer of that road, and performs the training of the decision network:
Step 3.2.2.1, each vehicle is taken as an agent and its driving behavior trajectory is taken as the training sample, and the same process as in step 3.1.2.1 is performed to obtain sampled trajectory data;
Step 3.2.2.2, the vehicle trains its local decision network with the local driving-trajectory data set and uploads the training result to the RSU;
Step 3.2.2.3, after the RSU receives the training result transmitted by a vehicle, it updates the global network once and returns the updated global network parameters to the vehicle;
Step 3.2.2.4, after the vehicle receives the global network parameters returned by the RSU, it synchronizes them to its local network and starts a new round of sample collection and training on that basis, until the vehicle leaves the current road.
CN202010642778.8A 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning Active CN111845773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010642778.8A CN111845773B (en) 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010642778.8A CN111845773B (en) 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111845773A true CN111845773A (en) 2020-10-30
CN111845773B CN111845773B (en) 2021-10-26

Family

ID=73153538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010642778.8A Active CN111845773B (en) 2020-07-06 2020-07-06 Automatic driving vehicle micro-decision-making method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111845773B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348201A (en) * 2020-11-11 2021-02-09 扬州大学 Intelligent decision implementation method for automatic driving group vehicle based on federal deep reinforcement learning
CN112644516A (en) * 2020-12-16 2021-04-13 吉林大学青岛汽车研究院 Unmanned control system and control method suitable for roundabout scene
CN112700642A (en) * 2020-12-19 2021-04-23 北京工业大学 Method for improving traffic passing efficiency by using intelligent internet vehicle
CN112896187A (en) * 2021-02-08 2021-06-04 浙江大学 System and method for considering social compatibility and making automatic driving decision
CN113044064A (en) * 2021-04-01 2021-06-29 南京大学 Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN113071524A (en) * 2021-04-29 2021-07-06 深圳大学 Decision control method, decision control device, autonomous driving vehicle and storage medium
CN113099418A (en) * 2021-03-26 2021-07-09 深圳供电局有限公司 Optimization method of block chain task for data transmission of Internet of vehicles
CN113501008A (en) * 2021-08-12 2021-10-15 东风悦享科技有限公司 Automatic driving behavior decision method based on reinforcement learning algorithm
CN113511222A (en) * 2021-08-27 2021-10-19 清华大学 Scene self-adaptive vehicle interactive behavior decision and prediction method and device
CN113619604A (en) * 2021-08-26 2021-11-09 清华大学 Integrated decision and control method and device for automatic driving automobile and storage medium
CN117828489A (en) * 2024-03-05 2024-04-05 河钢国际科技(北京)有限公司 Intelligent ship remote dynamic control system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015094645A1 (en) * 2013-12-22 2015-06-25 Lytx, Inc. Autonomous driving comparison and evaluation
US20180011488A1 (en) * 2016-07-08 2018-01-11 Toyota Motor Engineering & Manufacturing North America, Inc. Control policy learning and vehicle control method based on reinforcement learning without active exploration
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 It is a kind of based on deeply study vehicle low speed with decision-making technique of speeding
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Vision navigation method and system based on deeply study
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A kind of Vehicular automatic driving control method and device based on nitrification enhancement
CN110406530A (en) * 2019-07-02 2019-11-05 宁波吉利汽车研究开发有限公司 A kind of automatic Pilot method, apparatus, equipment and vehicle
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015094645A1 (en) * 2013-12-22 2015-06-25 Lytx, Inc. Autonomous driving comparison and evaluation
US20180011488A1 (en) * 2016-07-08 2018-01-11 Toyota Motor Engineering & Manufacturing North America, Inc. Control policy learning and vehicle control method based on reinforcement learning without active exploration
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A kind of Vehicular automatic driving control method and device based on nitrification enhancement
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 It is a kind of based on deeply study vehicle low speed with decision-making technique of speeding
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Vision navigation method and system based on deeply study
CN110406530A (en) * 2019-07-02 2019-11-05 宁波吉利汽车研究开发有限公司 A kind of automatic Pilot method, apparatus, equipment and vehicle
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
立升波: "Key technologies of deep neural networks and their applications in the field of automatic driving", Journal of Automotive Safety and Energy *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348201A (en) * 2020-11-11 2021-02-09 扬州大学 Intelligent decision implementation method for automatic driving group vehicle based on federal deep reinforcement learning
CN112348201B (en) * 2020-11-11 2024-03-12 扬州大学 Intelligent decision-making implementation method of automatic driving group vehicle based on federal deep reinforcement learning
CN112644516A (en) * 2020-12-16 2021-04-13 吉林大学青岛汽车研究院 Unmanned control system and control method suitable for roundabout scene
CN112644516B (en) * 2020-12-16 2022-03-29 吉林大学青岛汽车研究院 Unmanned control system and control method suitable for roundabout scene
CN112700642A (en) * 2020-12-19 2021-04-23 北京工业大学 Method for improving traffic passing efficiency by using intelligent internet vehicle
CN112896187B (en) * 2021-02-08 2022-07-26 浙江大学 System and method for considering social compatibility and making automatic driving decision
CN112896187A (en) * 2021-02-08 2021-06-04 浙江大学 System and method for considering social compatibility and making automatic driving decision
CN113099418A (en) * 2021-03-26 2021-07-09 深圳供电局有限公司 Optimization method of block chain task for data transmission of Internet of vehicles
CN113099418B (en) * 2021-03-26 2022-08-16 深圳供电局有限公司 Optimization method of block chain task for data transmission of Internet of vehicles
CN113044064A (en) * 2021-04-01 2021-06-29 南京大学 Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN113044064B (en) * 2021-04-01 2022-07-29 南京大学 Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN113071524A (en) * 2021-04-29 2021-07-06 深圳大学 Decision control method, decision control device, autonomous driving vehicle and storage medium
CN113501008B (en) * 2021-08-12 2023-05-19 东风悦享科技有限公司 Automatic driving behavior decision method based on reinforcement learning algorithm
CN113501008A (en) * 2021-08-12 2021-10-15 东风悦享科技有限公司 Automatic driving behavior decision method based on reinforcement learning algorithm
CN113619604A (en) * 2021-08-26 2021-11-09 清华大学 Integrated decision and control method and device for automatic driving automobile and storage medium
CN113619604B (en) * 2021-08-26 2023-08-15 清华大学 Integrated control method, device and storage medium for automatic driving automobile
CN113511222A (en) * 2021-08-27 2021-10-19 清华大学 Scene self-adaptive vehicle interactive behavior decision and prediction method and device
CN113511222B (en) * 2021-08-27 2023-09-26 清华大学 Scene self-adaptive vehicle interaction behavior decision and prediction method and device
CN117828489A (en) * 2024-03-05 2024-04-05 河钢国际科技(北京)有限公司 Intelligent ship remote dynamic control system
CN117828489B (en) * 2024-03-05 2024-05-14 河钢国际科技(北京)有限公司 Intelligent ship remote dynamic control system

Also Published As

Publication number Publication date
CN111845773B (en) 2021-10-26

Similar Documents

Publication Publication Date Title
CN111845773B (en) Automatic driving vehicle micro-decision-making method based on reinforcement learning
CN114495527B (en) Internet-connected intersection vehicle road collaborative optimization method and system in mixed traffic environment
CN111931905B (en) Graph convolution neural network model and vehicle track prediction method using same
CN110032782B (en) City-level intelligent traffic signal control system and method
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
US11243532B1 (en) Evaluating varying-sized action spaces using reinforcement learning
KR102306939B1 (en) Method and device for short-term path planning of autonomous driving through information fusion by using v2x communication and image processing
CN111696370A (en) Traffic light control method based on heuristic deep Q network
CN111267830B (en) Hybrid power bus energy management method, device and storage medium
CN104952248A (en) Automobile convergence predicting method based on Euclidean space
CN113255998B (en) Expressway unmanned vehicle formation method based on multi-agent reinforcement learning
CN112183288B (en) Multi-agent reinforcement learning method based on model
CN104966129A (en) Method for separating vehicle running track
Baskar et al. Hierarchical traffic control and management with intelligent vehicles
CN118097989B (en) Multi-agent traffic area signal control method based on digital twin
CN112550314A (en) Embedded optimization type control method suitable for unmanned driving, driving control module and automatic driving control system thereof
CN113903173B (en) Vehicle track feature extraction method based on directed graph structure and LSTM
CN111310919A (en) Driving control strategy training method based on scene segmentation and local path planning
CN113299079B (en) Regional intersection signal control method based on PPO and graph convolution neural network
Gutiérrez-Moreno et al. Hybrid decision making for autonomous driving in complex urban scenarios
CN114267191A (en) Control system, method, medium, equipment and application for relieving traffic jam of driver
CN117075473A (en) Multi-vehicle collaborative decision-making method in man-machine mixed driving environment
CN115273502B (en) Traffic signal cooperative control method
CN116620327A (en) Lane changing decision method for realizing automatic driving high-speed scene based on PPO and Lattice
CN114117944B (en) Model updating method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant