CN114120672B - Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning - Google Patents

Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning

Info

Publication number
CN114120672B
CN114120672B CN202111402414.3A
Authority
CN
China
Prior art keywords
agent
intersection
network
leader
intelligent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111402414.3A
Other languages
Chinese (zh)
Other versions
CN114120672A (en)
Inventor
张程伟
栾利广
赵心田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN202111402414.3A priority Critical patent/CN114120672B/en
Publication of CN114120672A publication Critical patent/CN114120672A/en
Application granted
Publication of CN114120672B publication Critical patent/CN114120672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/07 Controlling traffic signals
    • G08G1/081 Plural intersections under common control
    • G08G1/083 Controlling the allocation of time between phases of a cycle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning, relating to the technical field of traffic signal control. The method not only ensures the best overall performance but also obtains better performance at special intersections.

Description

Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning
Technical Field
The invention relates to the technical field of traffic signal control, in particular to a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning.
Background
The goal of Adaptive Traffic Signal Control (ATSC) is to reduce traffic congestion by adaptively adjusting the signal phase. In recent years, significant progress has been made in applying Multi-Agent Reinforcement Learning (MARL) to the ATSC problem. In a real urban traffic environment, there may be thousands of intersections coordinating together to optimize urban traffic. It is therefore natural to define traffic signal control as a cooperative multi-agent game, i.e. each intersection is controlled by a single agent with local observations.
To date, most of the existing ATSC multi-agent perspective work has focused on independent optimization-based approaches that use local observations and messages from other coordinating agents, treating the ATSC problem as a global-level or neighbor-level multi-agent cooperative game.
However, these approaches all assume that all agents in the cooperative game are homogeneous, ignoring the fact that different agents may play heterogeneous roles in the ATSC scenario. In fact, the tolerance for congestion varies at different intersections in the same area; for example, traffic congestion near a hospital or school may affect the timely treatment of patients or the safety of children, and certainly needs more attention than congestion at ordinary intersections.
Disclosure of Invention
In view of the above, the invention provides a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning that considers both ordinary intersections and special intersections. A special intersection has higher importance in a real environment: traffic congestion near an intersection needing special attention, such as a hospital or school, can affect the timely treatment of patients or the safety of children, and certainly requires more attention than congestion at an ordinary intersection.
Therefore, the invention provides the following technical scheme:
the invention provides a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning, which comprises the following steps:
s1, establishing an intelligent agent corresponding to each intersection according to a traffic network; the agents comprise leader agents corresponding to special intersections and follower agents corresponding to common intersections;
s2, sequencing each intersection in a traffic network, and determining the action selection sequence of each intersection to obtain a precursor agent and a successor agent of each agent;
s3, acquiring real-time traffic characteristics of each intersection in the traffic network, acquiring observation information of the intelligent agent and a neighbor intelligent agent thereof according to the real-time traffic characteristics of each intersection aiming at the intelligent agent corresponding to each intersection, and acquiring joint action of a precursor intelligent agent of the intelligent agent;
s4, inputting the observation information of the agent and the neighbor agents thereof and the joint action of the precursor agent of the agent into the HDQN network of the agent, determining the action of the agent, and transmitting the action of the agent to the inheritor of the agent;
s5, the intelligent agent executes the determined action and obtains observation information and reward after the action is executed and trajectory information observed by the intelligent agent in each round;
s6, generating an interaction experience according to the track information;
and S7, updating the HDQN network of the agent according to the interaction experience.
Further, sequencing each intersection in the traffic network comprises:
sequencing each intersection in the traffic network by adopting a breadth-first sort method.
Further, the leader agent's predecessor set is empty.
Further, the interaction experience includes: the observation values of the agent and its neighbors, the actions executed by the agent and its predecessors, the rewards received by the agent, and the observation values after the agent executes the actions.
Further, the HDQN network comprises a DQN estimation network and a DQN target network; accordingly, updating the HDQN network of the agent according to the interaction experience, comprising:
updating the DQN estimation network according to the interaction experience;
updating the DQN target network based on the DQN estimation network every N_θ steps, where N_θ represents the number of steps the agent interacts with the environment.
Further, the reward after the agent performs the action is as follows:
$$r_i^{(t)} = \begin{cases} r_{i,\text{self}}^{(t)}, & i \in N_L \\ r_{i,\text{self}}^{(t)} + \sum_{j \in N_i} \omega_{ij}\, r_{j,\text{self}}^{(t)}, & i \in N_F \end{cases}$$
wherein $r_{i,\text{self}}^{(t)} = -\sum_{l \in L_i} \mathrm{wave}[l]^{(t)}$ is the current congestion situation near intersection i at time t, i.e. the negative of the number of waiting vehicles; $N_L$ and $N_F = N - N_L$ are respectively the sets of leader agents and follower agents; $N_i$ is the set of neighbor agents of agent i; and ω ∈ (0, 1) is a discount factor (one value $\omega_{ij}$ per neighbor pair) used to measure the importance between a follower agent and its neighbors.
The invention has the advantages and positive effects that:
the invention models the ATSC problem as Leader-Follower Markov Game (LF-MG), which is a master-slave Markov Game model considering the performance of the whole and special intersections, and on the basis of the LF-MG, the invention provides a distributed Learning multi-agent cooperation method expanded by HDQN (high Learning Deep Q-Learning Network), named as Breadth First Sort Hyperteric DQN (BFS-HDQN), which is used for Learning the cooperation control strategy of the whole return optimization of a plurality of intersections and the intersections needing special attention. The method is experimentally evaluated in two synthetic scenes and a real traffic scene, and the experimental result shows that in almost all evaluation indexes commonly used by ATSC, compared with the existing method, the algorithm can not only ensure the optimal overall performance, but also obtain better performance at a special intersection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning according to an embodiment of the present invention;
FIG. 2 illustrates three traffic network configurations according to an embodiment of the present invention;
FIG. 3 is a graph illustrating the average total return of all intersections under three traffic scenarios according to an embodiment of the present invention;
FIG. 4 shows the average return of the leader intersections under three traffic scenarios in the embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Given the apparent topological relationship between intersections in a traffic network, and inspired by human intelligence in cooperative behavior (for example, the captain of a football team has greater decision-making authority, while the other players cooperate with the captain to achieve the team's goal), the present invention proposes a new model, called the leader-follower Markov game (LF-MG), to define the goals of different intersections in ATSC. Specifically, the model, based on the leader-follower paradigm (LF paradigm), divides intersections in a traffic scene into two categories, leader agents and follower agents, and adopts different optimization targets for the two categories of agents. A leader agent represents an intersection that needs special attention and only considers optimizing the traffic conditions nearby, while a follower agent also needs to consider the congestion of the leader agents and of its neighbors. On this basis, the invention designs an independent MARL framework, called Breadth First Sort Hysteretic DQN (BFS-HDQN), to learn the multi-objective optimization strategy. BFS-HDQN models the ATSC problem as a neighbor-aware Markov game, in which each agent controls its intersection based on its local information (its own observations and those of its neighbors). The MARL framework consists of two parts: an independent MARL algorithm used to train the different types of agents (the invention uses HDQN, an independent MARL algorithm designed for learning the optimal joint strategy in cooperative multi-agent games, as the basic algorithm), and a communication mechanism based on graph Breadth First Search (BFS) that generates the interaction information of each agent. The communication mechanism uses a leader-follower action selection paradigm, so that successor agents select their individual actions after obtaining the actions of their predecessors, and it determines the order and content of information transfer between agents. The method can not only ensure the best overall performance, but also obtain better performance at special intersections.
As shown in fig. 1, it shows a flowchart of a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning in an embodiment of the present invention, where the method includes:
s1, establishing intelligent bodies corresponding to intersections according to a traffic network; the agents comprise leader agents corresponding to special intersections and follower agents corresponding to common intersections;
markov Games (MG) are standard MARL settings, commonly defined as<N,S,O,A,P,R,γ>Where N is the agent set, N = | N | is the size of the set, S is the state space, O =<O 1 ,...,O N >For viewing space, wherein O i Is a local observation that agent i can observe, A =<A 1 ,...,A N > and A i Respectively, joint motion space for all agents and local motion space for agent i, P: S × A × S → [0,1]Represents the state transfer function, R =<r 1 ,...,r N >Is the joint reward function and gamma is the discount factor. MG is a fully cooperative game, if each participant uses the same award, then
Figure BDA0003365351390000052
r i And (r). The Network Markov Game (NMG) considers the relation between agents, and replaces N with a graph G (N, epsilon) in MG, wherein j epsilon is a communication link between agent i and j; in addition, agent i Observation in NMG O i Is the self-observation that includes agent i and the information received from the neighbors M ij } ij∈ε . agent i strategy pi i :O i →Δ(A i ) The observation space is mapped to a probability distribution of the behavior space. Same as aboveIn the present embodiment, pi =isdefined<π i ,...,π N >As a common strategy for all agents. In the s-state, the expected return of agent i's union policy π (or the expected sum of future awards) is defined by a value function or a state-value function (also called Q-value function):
Figure BDA0003365351390000051
wherein the content of the first and second substances,
Figure BDA0003365351390000061
the reward obtained for agent i at time t; k is a coefficient of a discount factor, and is put on the discount factor to represent the importance degree of the current reward and the future reward, wherein the discount factor is a value from 0 to 1, and k is time, and represents that the farther from the current time t, the lower the importance degree of the corresponding reward is; u denotes an action.
In the embodiment of the invention, an ATSC problem in an actual traffic scene is defined as a Markov Game model (LF-MG) based on a Leader-Follower paradigm, which distinguishes intersections in the ATSC scene into special intersections and common intersections. Traffic congestion near special intersections, such as hospitals or schools, can affect patient timely treatment or child safety and certainly require more attention than congestion at ordinary intersections. The LF-MG is established based on a special network Markov game, and in the game, the intersection can observe the vehicle condition on each driving lane of the intersection and the adjacent intersections. LF-MG defines two agents: the device comprises a leader agent and a follower agent, wherein the leader agent only considers the benefits of the leader agent, and the follower agent comprehensively considers the benefits of the leader agent and the neighbors. The invention herein defines the reward for an agent as:
$$r_i^{(t)} = \begin{cases} r_{i,\text{self}}^{(t)}, & i \in N_L \\ r_{i,\text{self}}^{(t)} + \sum_{j \in N_i} \omega_{ij}\, r_{j,\text{self}}^{(t)}, & i \in N_F \end{cases}$$
Here $r_{i,\text{self}}^{(t)} = -\sum_{l \in L_i} \mathrm{wave}[l]^{(t)}$ is the current congestion situation near intersection i at time t, i.e. the negative of the number of waiting vehicles, which can be derived directly from the observation of agent i. $N_L$ and $N_F = N - N_L$ are respectively the sets of leader agents and follower agents, and ω ∈ (0, 1) is a discount factor for measuring the importance between a follower agent and its neighboring agents: the larger ω is, the stronger the social awareness of the agent and the more aligned the goals of all agents; conversely, the smaller ω is, the more selfish the agent. When ω = 0, the agent is equivalent to a leader agent. If the neighbor is a leader, the invention fixes ω to 1, otherwise to 0.5. The follower reward definition mainly takes the group interest of the agents as its objective and accounts for the communication limitations among agents in a networked scene.
The present invention does not distinguish the observation and action definitions of different agents. Formally, the observation $o_i \in O_i$ of agent i is defined by the information of intersection i and its neighbors:
$$o_i^{(t)} = \left\{ \mathrm{phase}_i^{(t)},\; \{\mathrm{wave}[l]^{(t)}\}_{l \in L_i \cup L_{N_i}} \right\}$$
where $\mathrm{phase}_i^{(t)}$ is the signal phase of agent i at time t, $L_i$ and $L_{N_i}$ are respectively the sets of entrance lanes of intersection i and of its neighboring intersections, and $\mathrm{wave}[l]^{(t)}$ is the number of vehicles waiting in lane l at time t. The action $a_i \in A_i$ of agent i is the signal phase. The goal of each agent in the LF-MG is to maximize its cumulative reward (return) by finding a joint policy π:
$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, r_i^{(t)}\right]$$
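As an illustration of the neighbor-aware observation, the sketch below assembles the observation vector of one intersection from its own signal phase and the waiting-vehicle counts of its entrance lanes and those of its neighbors; all names and the flat-vector layout are assumptions for illustration only.

import numpy as np

def build_observation(i, phase, lanes, wave, neighbors):
    """Neighbor-aware observation o_i: the current signal phase of intersection i
    plus the waiting-vehicle count of every entrance lane of i and of its neighbors.

    phase:     dict intersection -> current phase index
    lanes:     dict intersection -> list of entrance-lane ids
    wave:      dict lane id -> number of waiting vehicles at time t
    neighbors: dict intersection -> list of neighboring intersections
    """
    features = [float(phase[i])]
    for l in lanes[i]:
        features.append(float(wave[l]))
    for j in neighbors[i]:
        for l in lanes[j]:
            features.append(float(wave[l]))
    return np.asarray(features, dtype=np.float32)

In practice the phase would typically be one-hot encoded and the vector padded to a fixed length so that intersections with different numbers of neighbors share a common input size.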
s2, sequencing each intersection in the traffic network, determining the action selection sequence of each intersection, and obtaining predecessors and successors of each intelligent agent;
in the embodiment of the invention, the BFS-HDQN is adopted to learn the multi-objective optimization strategy, and the BFS-HDQN framework enables two agents to learn a combined strategy, so that the benefits of the whole network and some special intersections are maximized.
Given the graph structure of the ATSC problem, the impact of actions between intersections progressively propagates along the edge between each pair of connected neighbors. In the embodiment of the invention, the leader intersections are used as starting nodes and the intersections in the ATSC scene are ordered by a classic graph ordering algorithm, Breadth First Sort, to determine the action selection order of each intersection. Considering that there may be multiple leaders in an ATSC scenario, the original BFS algorithm is improved in the present embodiment, as shown in Table 1, by enqueuing multiple nodes at the beginning of the algorithm. According to the resulting ordering of intersections, the invention can determine which agents announce their intended actions before any agent i; these neighbors, denoted $N_i^{pre} \subseteq N_i$, are defined as the predecessor neighbors of i. Then, for an agent at a common intersection, when it chooses its action it has already obtained the observations of itself and its neighbors, as well as the actions of its predecessor neighbors. For a leader agent, the invention defines its predecessor set as empty. The agent can then select an action, or train its strategy based on its estimates of the situation, when interacting with the environment.
S3, acquiring real-time traffic characteristics of each intersection in the traffic network, and, for the agent corresponding to each intersection, acquiring the observation information of the agent and its neighbor agents and the joint action of its predecessor agents according to the real-time traffic characteristics of each intersection;
S4, inputting the observation information of the agent and its neighbor agents and the joint action of its predecessor agents into the HDQN network of the agent, determining the action of the agent, and transmitting the action of the agent to the successors of the agent;
s5, the intelligent agent executes the determined action and obtains observation information and reward after the action is executed and trajectory information observed by the intelligent agent in each round;
s6, generating experience according to the track information;
and S7, updating the HDQN network of the intelligent agent according to the experience.
Deep Reinforcement Learning (DRL), such as DQN, combines a deep neural network with traditional RL by representing the value function with the neural network. To train the neural network, DRL uses a replay memory buffer to store experiences ⟨s, a, r, s′⟩, where r and s′ are respectively the reward received after taking action a in state s and the next state. θ denotes the parameters of the DRL agent's neural network; for convenience, the invention also refers to the neural network directly by θ. In DQN, θ is updated by minimizing the squared TD error δ_i:
$$\delta_i = r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta), \qquad \mathcal{L}(\theta) = \frac{1}{b}\sum_{i=1}^{b} \delta_i^{2}$$
where b is the batch size and θ⁻ is a target network that is periodically copied from θ and kept unchanged between copies.
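A minimal replay memory in Python, matching the ⟨o, a_pre, a, r, o′⟩ experience layout used in this document, could look as follows; the class and method names are assumptions, and the uniform sampling mirrors standard DQN practice rather than any specific implementation detail of the invention.

import random
from collections import deque

class ReplayMemory:
    """Fixed-size replay buffer storing interaction experiences
    (obs, predecessor_actions, action, reward, next_obs)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, obs, pre_actions, action, reward, next_obs):
        self.buffer.append((obs, pre_actions, action, reward, next_obs))

    def sample(self, batch_size):
        # Uniform random mini-batch, as in standard DQN training.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)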
HDQN is an independent learning algorithm designed for cooperative multi-agent environments. It uses two learning rates, α and α′ = αh with 0 < h < 1. HDQN applies the smaller learning rate αh whenever an update would lower the Q-value, and the larger rate α otherwise. This yields an optimistic update that focuses more on positive experiences, which has proven useful in many fully cooperative multi-agent tasks:
$$\theta \leftarrow \begin{cases} \theta + \alpha\, \delta\, \nabla_{\theta} Q(s, a; \theta), & \delta \geq 0 \\ \theta + \alpha h\, \delta\, \nabla_{\theta} Q(s, a; \theta), & \delta < 0 \end{cases}$$
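One common way to realize the hysteretic update for a deep Q-network is to down-weight the squared loss of samples with negative TD error by h, which is equivalent to applying the smaller learning rate αh to them; the PyTorch sketch below illustrates this under assumed tensor shapes and function names, and is not the exact training code of the invention.

import torch

def hysteretic_dqn_loss(q_net, target_net, batch, gamma=0.95, h=0.5):
    """Hysteretic TD loss: negative TD errors are down-weighted by h (0 < h < 1),
    which plays the role of the smaller learning rate alpha*h.

    batch: tensors obs [B, D], action [B], reward [B], next_obs [B, D]
    """
    obs, action, reward, next_obs = batch
    q = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)        # Q(s, a)
    with torch.no_grad():
        q_next = target_net(next_obs).max(dim=1).values             # max_a' Q_target(s', a')
        td_target = reward + gamma * q_next
    delta = td_target - q                                            # TD error
    weight = torch.where(delta >= 0, torch.ones_like(delta), h * torch.ones_like(delta))
    return (weight * delta.pow(2)).mean()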
in the embodiment of the invention, HDQN is used as a basic RL framework to learn the joint optimal strategy of a plurality of intersections.
For ease of understanding, the BFS-HDQN of the present invention is described in detail below.
Real-world tasks sometimes require agents to consider the information of other agents for proper collaboration during the action selection phase. For example, in a human collaboration task, there is typically a team leader of higher priority than other team members. In the cooperation process, the leader firstly tells other players what he wants to do, and other team members perform corresponding cooperation according to the decision of the leader. Intuitively, this would give priority to protecting the interests of the leader. At the same time, other members, acting as collaborators, may consider how to maximize the benefits of the team after knowing the actions the leader is to perform. Thus, the final strategy takes into account the common interests of the leader and team.
BFS-HDQN is motivated by the above organizational behavior. In order to estimate the joint action of an agent and its predecessors, the DQN network of each BFS-HDQN agent takes the actions of its predecessor neighbors as part of its state input and outputs the Q-values of all of its own actions:
$$Q_i\!\left(o_i, a_{N_i^{pre}}, a_i;\, \theta_i\right)$$
where $\theta_i$ is the network parameter of agent i and $a_{N_i^{pre}}$ is the joint action of the predecessor neighbors of agent i.
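A minimal PyTorch sketch of such a predecessor-aware Q-network is given below: the observation is concatenated with a one-hot encoding of the predecessor neighbors' joint action and mapped to one Q-value per candidate signal phase. The layer sizes, class name and encoding are assumptions for illustration; for a leader agent the predecessor set is empty, so the input reduces to the observation alone.

import torch
import torch.nn as nn

class PredecessorAwareQNet(nn.Module):
    """Q_i(o_i, a_pre, . ; theta_i): Q-value of every own action, conditioned on
    the joint action of the predecessor neighbors."""

    def __init__(self, obs_dim, n_predecessors, n_actions, hidden=128):
        super().__init__()
        # Predecessor actions are one-hot encoded and concatenated to the observation.
        in_dim = obs_dim + n_predecessors * n_actions
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, pre_actions_onehot):
        # obs: [B, obs_dim]; pre_actions_onehot: [B, n_predecessors * n_actions]
        # (for a leader agent, pass a tensor of shape [B, 0])
        x = torch.cat([obs, pre_actions_onehot], dim=1)
        return self.net(x)  # [B, n_actions]: one Q-value per candidate signal phase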
Considering that the greedy value of the next state needs to be calculated to train the DQN network, the above definition cannot generate an experience directly after each step as conventional DQN does. Here, the invention uses a temporary memory τ_i to temporarily store the trajectory information observed by the agent in each round; the final experiences are generated from this trajectory.
In the embodiment of the invention, for each agent, BFS-HDQN has two randomly initialized neural networks, namely the DQN estimation network θ_i and the DQN target network θ_i⁻. Each agent selects its own action from its neighbor-aware observation and the actions of its predecessors, using an ε-greedy policy in which ε decreases from 1 to 0 as the interaction time increases. It then propagates this action to its successors. After the agents execute their actions, each agent obtains a new observation and a reward. At the end of each round, each agent generates experiences from the trajectory τ_i it observed during that round. Finally, all networks are trained as in a common DRL algorithm.
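The per-round interaction just described can be sketched as follows, with agents acting in BFS order under an ε-greedy policy, actions being passed to successors, the round's trajectory being cached, and experiences being generated from it at the end of the round; the environment interface (observe/step), the agent attributes and the target-network synchronization interval used here are assumptions for illustration only.

import random

def run_round(env, agents, order, epsilon, n_theta=200):
    """One interaction round of BFS-HDQN (illustrative sketch under an assumed API).

    env:    simulator wrapper exposing observe() and step(actions) -> (next_obs, rewards)
    agents: dict id -> agent with q_values(obs, pre_actions), predecessors, memory, ...
    order:  BFS ordering of intersection ids (leaders first)
    """
    trajectories = {i: [] for i in order}
    obs = env.observe()
    for t in range(env.horizon):
        actions = {}
        for i in order:  # act in BFS order so predecessor actions are already decided
            pre = [actions[j] for j in agents[i].predecessors]
            if random.random() < epsilon:
                actions[i] = random.randrange(agents[i].n_actions)
            else:
                actions[i] = int(agents[i].q_values(obs[i], pre).argmax())
        next_obs, rewards = env.step(actions)
        for i in order:
            pre = [actions[j] for j in agents[i].predecessors]
            trajectories[i].append((obs[i], pre, actions[i], rewards[i]))
        obs = next_obs
        if (t + 1) % n_theta == 0:
            for agent in agents.values():
                agent.sync_target()  # copy the estimation network into the target network
    # Convert each trajectory into <o, a_pre, a, r, o'> experiences at the end of the round.
    for i in order:
        traj = trajectories[i]
        for k in range(len(traj) - 1):   # the last step has no recorded next observation
            o, a_pre, a, r = traj[k]
            next_o = traj[k + 1][0]
            agents[i].memory.push(o, a_pre, a, r, next_o)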
The communication mechanism (the communication mechanism based on graph Breadth First Search (BFS) described above) uses a leader-follower action selection paradigm, so that a successor agent selects its individual action after obtaining the actions of its predecessors; this determines the order and content of information transmission between agents. The BFS-HDQN algorithm is shown in Table 2.
TABLE 1: Multi-source Breadth First Sort
Input: graph G(N, ε), leader set N_L
Output: ordered sequence sq
1: initialize an empty ordered sequence sq
2: initialize an empty queue q and enqueue every i ∈ N_L into q
3: while q is not empty:
4:   take the front element u out of q
5:   enqueue all unvisited neighbors of u
6:   append u to sq
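A Python sketch of the multi-source breadth-first sort of Table 1, together with the derivation of each agent's predecessor neighbors from the resulting order, is given below; the graph representation and function names are assumptions, and unreachable intersections (if any) would need separate handling.

from collections import deque

def breadth_first_sort(graph, leaders):
    """Multi-source BFS ordering of intersections (Table 1): all leader nodes are
    enqueued first, so leaders always appear before their followers.

    graph:   dict node -> list of neighboring nodes
    leaders: iterable of leader nodes
    """
    sq, visited = [], set(leaders)
    q = deque(leaders)                 # step 2: enqueue every leader
    while q:                           # step 3
        u = q.popleft()                # step 4
        for v in graph[u]:             # step 5: enqueue unvisited neighbors
            if v not in visited:
                visited.add(v)
                q.append(v)
        sq.append(u)                   # step 6
    return sq

def predecessor_sets(graph, order):
    """A neighbor j is a predecessor of i if j is ranked before i in the BFS order."""
    rank = {node: k for k, node in enumerate(order)}
    return {i: [j for j in graph[i] if rank[j] < rank[i]] for i in order}

# Toy 2x2 grid with one leader intersection "A".
grid = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
order = breadth_first_sort(grid, ["A"])        # ["A", "B", "C", "D"]
print(order, predecessor_sets(grid, order))    # A has no predecessors; D has B and C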
TABLE 2: The BFS-HDQN algorithm (the algorithm table is presented as images in the original publication and is not reproduced here).
In order to prove the effectiveness of the traffic signal control method based on multi-agent reinforcement learning in the embodiment of the invention, the method provided by the invention is evaluated by using a real scene and two synthetic scenes. In particular, the present invention uses a real traffic network introduced from denna and two composite traffic grids to validate the method of the present invention.
The invention first compares BFS-HDQN with the state-of-the-art MARL method MA2C. The invention also conducts an ablation experiment against HDQN and against BFS-HDQN without the weight setting, denoted BFS-HDQN(non), which shows that both the leader-follower setting and the BFS communication mechanism in BFS-HDQN help to improve ATSC performance.
(1) Scene setting
The present invention evaluates the method of the present invention using three available public traffic networks, namely one real traffic network and two 4 × 4 synthetic traffic grids, on CityFlow (a common traffic simulator). As shown in FIG. 2, for completeness, the present invention sets a different number of special intersections in the three ATSC scenarios. The intersections in all three scenarios are homogeneous and have the same action space.
Three traffic network settings. The five-pointed star and the circle in each network respectively represent a special intersection (leader) and a common intersection (follower). There is only one special intersection in road networks (a) and (b), and three in road network (c).
The invention adopts three indexes to demonstrate the performance of the model of the invention: (1) travel time m_travel-time: the average travel time of all vehicles, the most common index for evaluating different methods; (2) queue length m_queue-length: the average vehicle queue length over the entrance lanes of each intersection; (3) throughput m_throughput: the total number of vehicles arriving at their destination. Specifically:
$$m_{\text{travel-time}} = \frac{1}{|V_{in}|} \sum_{v \in V_{in}} \left(t_v^{out} - t_v^{in}\right)$$
$$m_{\text{queue-length}} = \frac{1}{T\,|N|} \sum_{t=1}^{T} \sum_{i \in N} q_i^{(t)}$$
$$m_{\text{throughput}} = |V|$$
where $t_v^{in}$ and $t_v^{out}$ respectively represent the entry and exit time of a vehicle v, V is the set of vehicles arriving at their destination within one hour, $V_{in}$ is the set of vehicles entering the traffic network, $q_i^{(t)}$ is the vehicle queue length at intersection i at time t, |N| is the number of intersections, and the episode length is set to T = 360. The goal is to learn a joint strategy that maximizes the return of the entire network and of some special intersections. Therefore, the present invention evaluates the method on the average indexes of all intersections and of the special (leader) intersections, respectively.
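For illustration, the three evaluation indexes can be computed from logged vehicle entry/exit times and queue measurements roughly as follows; the data layouts and function names are assumptions and do not reflect the simulator's actual logging interface.

def mean_travel_time(enter_times, exit_times):
    """Average travel time of vehicles that reached their destination.
    enter_times / exit_times: dict vehicle id -> time step."""
    done = [v for v in exit_times if v in enter_times]
    return sum(exit_times[v] - enter_times[v] for v in done) / max(len(done), 1)

def mean_queue_length(queue_log, n_intersections):
    """Average queue length per intersection per time step.
    queue_log: list over time of dicts intersection -> queued vehicles."""
    total = sum(sum(step.values()) for step in queue_log)
    return total / (len(queue_log) * n_intersections)

def throughput(exit_times):
    """Total number of vehicles that arrived at their destination."""
    return len(exit_times)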
(2) Performance comparison
FIG. 3 and FIG. 4 are the learning curves of the MARL methods for the average total return and the leader return, respectively, under the three traffic scenarios ((a), (b) and (c) correspond to the three traffic scenarios in FIG. 2). The lines and the shadows around them are the average return and the error range of these methods during learning.
As can be seen from FIG. 3, BFS-HDQN achieves stable and almost the highest return in all environments, followed by BFS-HDQN(non) and HDQN, with MA2C performing the worst. The results show that the algorithm is effective in improving overall efficiency, and all of its innovations are meaningful.
In FIG. 4, the advantage of BFS-HDQN over MA2C and HDQN is evident, while the performance of BFS-HDQN(non) is almost the same as that of BFS-HDQN (even slightly better than BFS-HDQN in the two synthetic scenarios). The reason is intuitive: the weights in BFS-HDQN are added to the followers' rewards, not the leader's. The results of FIG. 4 show that the algorithm of the invention can also effectively improve the benefit of special intersections.
In order to compare execution performance, the invention also reports the final statistical results of the trained MARL strategies in Table 3 (execution comparison of the trained strategies on overall performance) and Table 4 (execution comparison of the trained strategies on leader performance, measured by queue length), which show that the algorithm of the invention achieves the best results in all environments.
TABLE 3: Execution comparison of the trained strategies on overall performance (the table is presented as an image in the original publication and is not reproduced here).
TABLE 4: Execution comparison of the trained strategies on leader performance, measured by queue length (the table is presented as an image in the original publication and is not reproduced here).
In summary, the present invention evaluates the proposed method in one real-world and two synthetic traffic scenarios and compares it with state-of-the-art methods. The experimental results show that, compared with existing methods, the algorithm can not only ensure the best overall performance but also obtain better performance at special intersections under the three most common indexes of travel time, throughput and queue length.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning is characterized by comprising the following steps:
S1, establishing an agent corresponding to each intersection according to a traffic network; the agents comprise leader agents corresponding to special intersections and follower agents corresponding to common intersections; the ATSC problem of an actual traffic scene is defined as a Markov game model LF-MG based on a leader-follower paradigm, and intersections in the ATSC scene are divided into special intersections and common intersections; the LF-MG defines two kinds of agents, a leader agent and a follower agent, wherein the leader agent only considers its own benefit, and the follower agent comprehensively considers the benefits of the leader agents and of its neighbors; the reward of an agent is defined as:
$$r_i^{(t)} = \begin{cases} r_{i,\text{self}}^{(t)}, & i \in N_L \\ r_{i,\text{self}}^{(t)} + \sum_{j \in N_i} \omega_{ij}\, r_{j,\text{self}}^{(t)}, & i \in N_F \end{cases}$$
wherein $r_{i,\text{self}}^{(t)} = -\sum_{l \in L_i} \mathrm{wave}[l]^{(t)}$ is the current congestion situation near intersection i at time t, i.e. the negative of the number of waiting vehicles, obtained directly from the observation of agent i; $N_L$ and $N_F = N - N_L$ are respectively the sets of leader agents and follower agents, N is the agent set, and $N_i$ is the set of neighbor agents; ω ∈ (0, 1) is a discount factor used to measure the importance between a follower agent and its neighboring agents: the larger ω is, the stronger the social awareness of the agent and the more aligned the goals of all agents; conversely, the smaller ω is, the more selfish the agent; when ω = 0, the agent is equivalent to a leader agent; if the neighbor is a leader, ω is fixed to 1, otherwise it is fixed to 0.5; the follower reward definition mainly takes the group benefit of the agents as its objective and accounts for the communication limitations among agents in a networked scene;
the observation $o_i \in O_i$ of agent i is defined by the information of intersection i and its neighbors:
$$o_i^{(t)} = \left\{ \mathrm{phase}_i^{(t)},\; \{\mathrm{wave}[l]^{(t)}\}_{l \in L_i \cup L_{N_i}} \right\}$$
wherein $\mathrm{phase}_i^{(t)}$ is the signal phase of agent i at time t, $L_i$ and $L_{N_i}$ are respectively the sets of entrance lanes of intersection i and of its neighboring intersections, and $\mathrm{wave}[l]^{(t)}$ is the number of vehicles waiting in lane l at time t; the action $a_i \in A_i$ of agent i is the signal phase; the goal of each agent in the LF-MG is to maximize its cumulative reward by finding a joint policy π:
$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, r_i^{(t)}\right]$$
$O_i$ is the local observation that agent i can observe, and $A_i$ is the local action space of agent i;
S2, sequencing each intersection in the traffic network, and determining the action selection order of each intersection to obtain the predecessor agents and successor agents of each agent;
S3, acquiring real-time traffic characteristics of each intersection in the traffic network, and, for the agent corresponding to each intersection, acquiring the observation information of the agent and its neighbor agents and the joint action of its predecessor agents according to the real-time traffic characteristics of each intersection;
S4, inputting the observation information of the agent and its neighbor agents and the joint action of its predecessor agents into the HDQN network of the agent, determining the action of the agent, and transmitting the action of the agent to the successors of the agent;
S5, the agent executes the determined action and obtains the observation information and reward after the action is executed and the trajectory information observed by the agent in each round;
S6, generating an interaction experience according to the trajectory information; the interaction experience comprises: the observations of the agent and its neighbors, the actions executed by the agent and its predecessors, the reward received by the agent, and the observation after the agent executes the action;
and S7, updating the HDQN network of the agent according to the interaction experience.
2. The heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning of claim 1, wherein the step of ordering the intersections in the traffic network comprises:
sequencing each intersection in the traffic network by adopting a breadth-first sort method.
3. The multi-agent reinforcement learning-based heterogeneous intersection scene traffic signal control method of claim 2, wherein the leader agent's predecessor set is empty.
4. The multi-agent reinforcement learning-based heterogeneous intersection scene traffic signal control method according to claim 3, wherein the HDQN network comprises a DQN estimation network and a DQN target network; accordingly, updating the HDQN network of the agent according to the interaction experience, comprising:
updating the DQN estimation network according to the interaction experience;
updating the DQN target network based on the DQN estimation network every N_θ steps, where N_θ represents the number of steps the agent interacts with the environment.
CN202111402414.3A 2021-11-19 2021-11-19 Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning Active CN114120672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111402414.3A CN114120672B (en) 2021-11-19 2021-11-19 Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111402414.3A CN114120672B (en) 2021-11-19 2021-11-19 Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN114120672A CN114120672A (en) 2022-03-01
CN114120672B true CN114120672B (en) 2022-10-25

Family

ID=80371736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111402414.3A Active CN114120672B (en) 2021-11-19 2021-11-19 Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN114120672B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2187369A3 (en) * 2008-06-04 2012-03-28 Roads and Traffic Authority of New South Wales Traffic signals control system
CN102542793B (en) * 2012-01-11 2014-02-26 东南大学 Active control method of oversaturated traffic situation at intersection group
BR102017019865A2 (en) * 2017-09-15 2019-04-16 Velsis Sistemas E Tecnologia Viaria S/A PREDICTIVE, INTEGRATED AND INTELLIGENT SYSTEM FOR TRAFFIC TRAFFIC TIME CONTROL
CN108335497B (en) * 2018-02-08 2021-09-14 南京邮电大学 Traffic signal self-adaptive control system and method
CN108877256B (en) * 2018-06-27 2020-11-13 南京邮电大学 Wireless communication-based method for controlling scattered cooperative self-adaptive cruise near intersection
CN113393667B (en) * 2021-06-10 2022-05-13 大连海事大学 Traffic control method based on Categorical-DQN optimistic exploration
CN113435112B (en) * 2021-06-10 2024-02-13 大连海事大学 Traffic signal control method based on neighbor awareness multi-agent reinforcement learning
CN113643553B (en) * 2021-07-09 2022-10-25 华东师范大学 Multi-intersection intelligent traffic signal lamp control method and system based on federal reinforcement learning

Also Published As

Publication number Publication date
CN114120672A (en) 2022-03-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant