CN114120672B - Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning - Google Patents

Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning

Info

Publication number
CN114120672B
CN114120672B CN202111402414.3A
Authority
CN
China
Prior art keywords
agent
intersection
network
leader
intelligent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111402414.3A
Other languages
Chinese (zh)
Other versions
CN114120672A (en)
Inventor
张程伟
栾利广
赵心田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN202111402414.3A priority Critical patent/CN114120672B/en
Publication of CN114120672A publication Critical patent/CN114120672A/en
Application granted
Publication of CN114120672B publication Critical patent/CN114120672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/07 Controlling traffic signals
    • G08G1/081 Plural intersections under common control
    • G08G1/083 Controlling the allocation of time between phases of a cycle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning, relating to the technical field of traffic signal control. The method not only ensures the best overall performance but also obtains better performance at special intersections.

Description

Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning
Technical Field
The invention relates to the technical field of traffic signal control, in particular to a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning.
Background
The goal of Adaptive Traffic Signal Control (ATSC) is to reduce traffic congestion by adaptively adjusting the signal phase. In recent years, significant progress has been made in applying Multi-Agent Reinforcement Learning (MARL) to the ATSC problem. In a real urban traffic environment, there may be thousands of intersections coordinating together to optimize urban traffic. It is therefore natural to define traffic signal control as a cooperative multi-agent game, i.e. each intersection is controlled by a single agent with local observations.
To date, most of the existing ATSC multi-agent perspective work has focused on independent optimization-based approaches that use local observations and messages from other coordinating agents, treating the ATSC problem as a global-level or neighbor-level multi-agent cooperative game.
However, these approaches all assume that all agents in the cooperative game are homogeneous, ignoring the fact that different agents may play heterogeneous roles in the ATSC scenario. In fact, the tolerance for congestion varies at different intersections in the same area; for example, traffic congestion near a hospital or school may affect the timely treatment of patients or the safety of children, and certainly needs more attention than congestion at ordinary intersections.
Disclosure of Invention
In view of the above, the invention provides a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning that considers both ordinary intersections and special intersections. A special intersection has higher importance in a real environment: traffic congestion near an intersection needing special attention, such as a hospital or school, can affect the timely treatment of patients or the safety of children, and certainly requires more attention than congestion at an ordinary intersection.
Therefore, the invention provides the following technical scheme:
the invention provides a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning, which comprises the following steps:
s1, establishing an intelligent agent corresponding to each intersection according to a traffic network; the agents comprise leader agents corresponding to special intersections and follower agents corresponding to common intersections;
s2, sequencing each intersection in a traffic network, and determining the action selection sequence of each intersection to obtain a precursor agent and a successor agent of each agent;
s3, acquiring real-time traffic characteristics of each intersection in the traffic network, acquiring observation information of the intelligent agent and a neighbor intelligent agent thereof according to the real-time traffic characteristics of each intersection aiming at the intelligent agent corresponding to each intersection, and acquiring joint action of a precursor intelligent agent of the intelligent agent;
s4, inputting the observation information of the agent and the neighbor agents thereof and the joint action of the precursor agent of the agent into the HDQN network of the agent, determining the action of the agent, and transmitting the action of the agent to the inheritor of the agent;
s5, the intelligent agent executes the determined action and obtains observation information and reward after the action is executed and trajectory information observed by the intelligent agent in each round;
s6, generating an interaction experience according to the track information;
and S7, updating the HDQN network of the agent according to the interaction experience.
Further, sequencing each intersection in the traffic network comprises:
sequencing each intersection in the traffic network by adopting a breadth-first sort method.
Further, the leader agent's predecessor set is empty.
Further, the interaction experience includes: the observation values of the agent and its neighbors, the actions executed by the agent and its predecessors, the rewards received by the agent, and the observation values after the agent executes the actions.
Further, the HDQN network comprises a DQN estimation network and a DQN target network; accordingly, updating the HDQN network of the agent according to the interaction experience, comprising:
updating the DQN estimation network according to the interaction experience;
updating the DQN target network based on the DQN estimation network every N_θ steps, where N_θ represents the number of steps the agent interacts with the environment.
Further, the reward after the agent performs the action is as follows:
$$r_i^{(t)} = \begin{cases} r_{i,\text{self}}^{(t)}, & i \in N_L \\ r_{i,\text{self}}^{(t)} + \sum_{j \in N_i} \omega_{ij}\, r_{j,\text{self}}^{(t)}, & i \in N_F \end{cases}$$
wherein $r_{i,\text{self}}^{(t)} = -\sum_{l \in L_i} \mathrm{wave}[l]^{(t)}$ is the current congestion situation near intersection i at time t, i.e. the negative of the number of waiting vehicles; $N_L$ and $N_F = N - N_L$ are respectively the sets of leader agents and follower agents; $N_i$ is the set of neighbor agents of agent i; and ω ∈ (0, 1) is a discount factor (one value $\omega_{ij}$ per neighbor pair) used to measure the importance between a follower agent and its neighbors.
The invention has the advantages and positive effects that:
the invention models the ATSC problem as Leader-Follower Markov Game (LF-MG), which is a master-slave Markov Game model considering the performance of the whole and special intersections, and on the basis of the LF-MG, the invention provides a distributed Learning multi-agent cooperation method expanded by HDQN (high Learning Deep Q-Learning Network), named as Breadth First Sort Hyperteric DQN (BFS-HDQN), which is used for Learning the cooperation control strategy of the whole return optimization of a plurality of intersections and the intersections needing special attention. The method is experimentally evaluated in two synthetic scenes and a real traffic scene, and the experimental result shows that in almost all evaluation indexes commonly used by ATSC, compared with the existing method, the algorithm can not only ensure the optimal overall performance, but also obtain better performance at a special intersection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning according to an embodiment of the present invention;
FIG. 2 illustrates three traffic network configurations according to an embodiment of the present invention;
FIG. 3 is a graph illustrating the average total return of all intersections under three traffic scenarios according to an embodiment of the present invention;
FIG. 4 shows the average return of the leader intersections under three traffic scenarios in the embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Given the apparent topological relationship between intersections in a traffic network, and inspired by human intelligence in cooperative behavior (for example, the captain of a football team has greater decision-making authority, while the other players cooperate with the captain to achieve the team's goal), the present invention proposes a new model, called the leader-follower Markov game (LF-MG), to define the goals of different intersections in ATSC. Specifically, the model, based on the leader-follower paradigm (LF paradigm), divides intersections in a traffic scene into two categories, leader agents and follower agents, and adopts different optimization targets for the two categories of agents. A leader agent represents an intersection that needs special attention and only considers optimizing the traffic conditions nearby, while a follower agent also needs to consider the congestion of the leader agents and of its neighbors. On this basis, the invention designs an independent MARL framework, called Breadth First Sort Hysteretic DQN (BFS-HDQN), to learn the multi-objective optimization strategy. BFS-HDQN models the ATSC problem as a neighbor-aware Markov game, in which each agent controls its intersection based on its local information (its own observations and those of its neighbors). The MARL framework consists of two parts: an independent MARL algorithm used to train the different types of agents (the invention uses HDQN, an independent MARL algorithm designed for learning the optimal joint strategy in cooperative multi-agent games, as the basic algorithm), and a communication mechanism based on graph Breadth First Search (BFS) that generates the interaction information of each agent. The communication mechanism uses a leader-follower action selection paradigm, so that successor agents select their individual actions after obtaining the actions of their predecessors, and it determines the order and content of information transfer between agents. The method can not only ensure the best overall performance, but also obtain better performance at special intersections.
As shown in fig. 1, it shows a flowchart of a heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning in an embodiment of the present invention, where the method includes:
s1, establishing intelligent bodies corresponding to intersections according to a traffic network; the agents comprise leader agents corresponding to special intersections and follower agents corresponding to common intersections;
markov Games (MG) are standard MARL settings, commonly defined as<N,S,O,A,P,R,γ>Where N is the agent set, N = | N | is the size of the set, S is the state space, O =<O 1 ,...,O N >For viewing space, wherein O i Is a local observation that agent i can observe, A =<A 1 ,...,A N > and A i Respectively, joint motion space for all agents and local motion space for agent i, P: S × A × S → [0,1]Represents the state transfer function, R =<r 1 ,...,r N >Is the joint reward function and gamma is the discount factor. MG is a fully cooperative game, if each participant uses the same award, then
Figure BDA0003365351390000052
r i And (r). The Network Markov Game (NMG) considers the relation between agents, and replaces N with a graph G (N, epsilon) in MG, wherein j epsilon is a communication link between agent i and j; in addition, agent i Observation in NMG O i Is the self-observation that includes agent i and the information received from the neighbors M ij } ij∈ε . agent i strategy pi i :O i →Δ(A i ) The observation space is mapped to a probability distribution of the behavior space. Same as aboveIn the present embodiment, pi =isdefined<π i ,...,π N >As a common strategy for all agents. In the s-state, the expected return of agent i's union policy π (or the expected sum of future awards) is defined by a value function or a state-value function (also called Q-value function):
Figure BDA0003365351390000051
wherein the content of the first and second substances,
Figure BDA0003365351390000061
the reward obtained for agent i at time t; k is a coefficient of a discount factor, and is put on the discount factor to represent the importance degree of the current reward and the future reward, wherein the discount factor is a value from 0 to 1, and k is time, and represents that the farther from the current time t, the lower the importance degree of the corresponding reward is; u denotes an action.
In the embodiment of the invention, an ATSC problem in an actual traffic scene is defined as a Markov Game model (LF-MG) based on a Leader-Follower paradigm, which distinguishes intersections in the ATSC scene into special intersections and common intersections. Traffic congestion near special intersections, such as hospitals or schools, can affect patient timely treatment or child safety and certainly require more attention than congestion at ordinary intersections. The LF-MG is established based on a special network Markov game, and in the game, the intersection can observe the vehicle condition on each driving lane of the intersection and the adjacent intersections. LF-MG defines two agents: the device comprises a leader agent and a follower agent, wherein the leader agent only considers the benefits of the leader agent, and the follower agent comprehensively considers the benefits of the leader agent and the neighbors. The invention herein defines the reward for an agent as:
$$r_i^{(t)} = \begin{cases} r_{i,\text{self}}^{(t)}, & i \in N_L \\ r_{i,\text{self}}^{(t)} + \sum_{j \in N_i} \omega_{ij}\, r_{j,\text{self}}^{(t)}, & i \in N_F \end{cases}$$
Here $r_{i,\text{self}}^{(t)} = -\sum_{l \in L_i} \mathrm{wave}[l]^{(t)}$ is the current congestion situation near intersection i at time t, i.e. the negative of the number of waiting vehicles, which can be derived directly from the observation of agent i. $N_L$ and $N_F = N - N_L$ are respectively the sets of leader agents and follower agents, and ω ∈ (0, 1) is a discount factor for measuring the importance between a follower agent and its neighboring agents: the larger ω is, the stronger the social awareness of the agent and the more aligned the goals of all agents; conversely, the smaller ω is, the more selfish the agent. When ω = 0, the agent is equivalent to a leader agent. If the neighbor is a leader, the invention fixes ω to 1, otherwise to 0.5. The follower reward definition mainly takes the group interest of the agents as its objective and accounts for the communication limitations among agents in a networked scene.
The present invention does not distinguish the observation and action definitions of different agents. Formally, the observation $o_i \in O_i$ of agent i is defined by the information of intersection i and its neighbors:
$$o_i^{(t)} = \left\{ \mathrm{phase}_i^{(t)},\; \{\mathrm{wave}[l]^{(t)}\}_{l \in L_i \cup L_{N_i}} \right\}$$
where $\mathrm{phase}_i^{(t)}$ is the signal phase of agent i at time t, $L_i$ and $L_{N_i}$ are respectively the sets of entrance lanes of intersection i and of its neighboring intersections, and $\mathrm{wave}[l]^{(t)}$ is the number of vehicles waiting in lane l at time t. The action $a_i \in A_i$ of agent i is the signal phase. The goal of each agent in the LF-MG is to maximize its cumulative reward (return) by finding a joint policy π:
$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, r_i^{(t)}\right]$$
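As an illustration of the neighbor-aware observation, the sketch below assembles the observation vector of one intersection from its own signal phase and the waiting-vehicle counts of its entrance lanes and those of its neighbors; all names and the flat-vector layout are assumptions for illustration only.

import numpy as np

def build_observation(i, phase, lanes, wave, neighbors):
    """Neighbor-aware observation o_i: the current signal phase of intersection i
    plus the waiting-vehicle count of every entrance lane of i and of its neighbors.

    phase:     dict intersection -> current phase index
    lanes:     dict intersection -> list of entrance-lane ids
    wave:      dict lane id -> number of waiting vehicles at time t
    neighbors: dict intersection -> list of neighboring intersections
    """
    features = [float(phase[i])]
    for l in lanes[i]:
        features.append(float(wave[l]))
    for j in neighbors[i]:
        for l in lanes[j]:
            features.append(float(wave[l]))
    return np.asarray(features, dtype=np.float32)

In practice the phase would typically be one-hot encoded and the vector padded to a fixed length so that intersections with different numbers of neighbors share a common input size.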
s2, sequencing each intersection in the traffic network, determining the action selection sequence of each intersection, and obtaining predecessors and successors of each intelligent agent;
in the embodiment of the invention, the BFS-HDQN is adopted to learn the multi-objective optimization strategy, and the BFS-HDQN framework enables two agents to learn a combined strategy, so that the benefits of the whole network and some special intersections are maximized.
Given the graph structure of the ATSC problem, the impact of actions between intersections progressively propagates along the edge between each pair of connected neighbors. In the embodiment of the invention, the leader intersections are used as starting nodes and the intersections in the ATSC scene are ordered by a classic graph ordering algorithm, Breadth First Sort, to determine the action selection order of each intersection. Considering that there may be multiple leaders in an ATSC scenario, the original BFS algorithm is improved in the present embodiment, as shown in Table 1, by enqueuing multiple nodes at the beginning of the algorithm. According to the resulting ordering of intersections, the invention can determine which agents announce their intended actions before any agent i; these neighbors, denoted $N_i^{pre} \subseteq N_i$, are defined as the predecessor neighbors of i. Then, for an agent at a common intersection, when it chooses its action it has already obtained the observations of itself and its neighbors, as well as the actions of its predecessor neighbors. For a leader agent, the invention defines its predecessor set as empty. The agent can then select an action, or train its strategy based on its estimates of the situation, when interacting with the environment.
S3, acquiring real-time traffic characteristics of each intersection in the traffic network, and, for the agent corresponding to each intersection, acquiring the observation information of the agent and its neighbor agents and the joint action of its predecessor agents according to the real-time traffic characteristics of each intersection;
S4, inputting the observation information of the agent and its neighbor agents and the joint action of its predecessor agents into the HDQN network of the agent, determining the action of the agent, and transmitting the action of the agent to the successors of the agent;
s5, the intelligent agent executes the determined action and obtains observation information and reward after the action is executed and trajectory information observed by the intelligent agent in each round;
s6, generating experience according to the track information;
and S7, updating the HDQN network of the intelligent agent according to the experience.
Deep Reinforcement Learning (DRL), such as DQN, combines a deep neural network with traditional RL by representing the value function with the neural network. To train the neural network, DRL uses a replay memory buffer to store experiences ⟨s, a, r, s′⟩, where r and s′ are respectively the reward received after taking action a in state s and the next state. θ denotes the parameters of the DRL agent's neural network; for convenience, the invention also refers to the neural network directly by θ. In DQN, θ is updated by minimizing the squared TD error δ_i:
$$\delta_i = r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta), \qquad \mathcal{L}(\theta) = \frac{1}{b}\sum_{i=1}^{b} \delta_i^{2}$$
where b is the batch size and θ⁻ is a target network that is periodically copied from θ and kept unchanged between copies.
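A minimal replay memory in Python, matching the ⟨o, a_pre, a, r, o′⟩ experience layout used in this document, could look as follows; the class and method names are assumptions, and the uniform sampling mirrors standard DQN practice rather than any specific implementation detail of the invention.

import random
from collections import deque

class ReplayMemory:
    """Fixed-size replay buffer storing interaction experiences
    (obs, predecessor_actions, action, reward, next_obs)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, obs, pre_actions, action, reward, next_obs):
        self.buffer.append((obs, pre_actions, action, reward, next_obs))

    def sample(self, batch_size):
        # Uniform random mini-batch, as in standard DQN training.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)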
HDQN is an independent learning algorithm designed for cooperative multi-agent environments. It uses two learning rates, α and α′ = αh with 0 < h < 1. HDQN applies the smaller learning rate αh whenever an update would lower the Q-value, and the larger rate α otherwise. This yields an optimistic update that focuses more on positive experiences, which has proven useful in many fully cooperative multi-agent tasks:
$$\theta \leftarrow \begin{cases} \theta + \alpha\, \delta\, \nabla_{\theta} Q(s, a; \theta), & \delta \geq 0 \\ \theta + \alpha h\, \delta\, \nabla_{\theta} Q(s, a; \theta), & \delta < 0 \end{cases}$$
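One common way to realize the hysteretic update for a deep Q-network is to down-weight the squared loss of samples with negative TD error by h, which is equivalent to applying the smaller learning rate αh to them; the PyTorch sketch below illustrates this under assumed tensor shapes and function names, and is not the exact training code of the invention.

import torch

def hysteretic_dqn_loss(q_net, target_net, batch, gamma=0.95, h=0.5):
    """Hysteretic TD loss: negative TD errors are down-weighted by h (0 < h < 1),
    which plays the role of the smaller learning rate alpha*h.

    batch: tensors obs [B, D], action [B], reward [B], next_obs [B, D]
    """
    obs, action, reward, next_obs = batch
    q = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)        # Q(s, a)
    with torch.no_grad():
        q_next = target_net(next_obs).max(dim=1).values             # max_a' Q_target(s', a')
        td_target = reward + gamma * q_next
    delta = td_target - q                                            # TD error
    weight = torch.where(delta >= 0, torch.ones_like(delta), h * torch.ones_like(delta))
    return (weight * delta.pow(2)).mean()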
in the embodiment of the invention, HDQN is used as a basic RL framework to learn the joint optimal strategy of a plurality of intersections.
For ease of understanding, the BFS-HDQN of the present invention is described in detail below.
Real-world tasks sometimes require agents to consider the information of other agents for proper collaboration during the action selection phase. For example, in a human collaboration task, there is typically a team leader of higher priority than other team members. In the cooperation process, the leader firstly tells other players what he wants to do, and other team members perform corresponding cooperation according to the decision of the leader. Intuitively, this would give priority to protecting the interests of the leader. At the same time, other members, acting as collaborators, may consider how to maximize the benefits of the team after knowing the actions the leader is to perform. Thus, the final strategy takes into account the common interests of the leader and team.
BFS-HDQN is motivated by the above organizational behavior. In order to estimate the joint action of an agent and its predecessors, the DQN network of each BFS-HDQN agent takes the actions of its predecessor neighbors as part of its state input and outputs the Q-values of all of its own actions:
$$Q_i\!\left(o_i, a_{N_i^{pre}}, a_i;\, \theta_i\right)$$
where $\theta_i$ is the network parameter of agent i and $a_{N_i^{pre}}$ is the joint action of the predecessor neighbors of agent i.
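A minimal PyTorch sketch of such a predecessor-aware Q-network is given below: the observation is concatenated with a one-hot encoding of the predecessor neighbors' joint action and mapped to one Q-value per candidate signal phase. The layer sizes, class name and encoding are assumptions for illustration; for a leader agent the predecessor set is empty, so the input reduces to the observation alone.

import torch
import torch.nn as nn

class PredecessorAwareQNet(nn.Module):
    """Q_i(o_i, a_pre, . ; theta_i): Q-value of every own action, conditioned on
    the joint action of the predecessor neighbors."""

    def __init__(self, obs_dim, n_predecessors, n_actions, hidden=128):
        super().__init__()
        # Predecessor actions are one-hot encoded and concatenated to the observation.
        in_dim = obs_dim + n_predecessors * n_actions
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, pre_actions_onehot):
        # obs: [B, obs_dim]; pre_actions_onehot: [B, n_predecessors * n_actions]
        # (for a leader agent, pass a tensor of shape [B, 0])
        x = torch.cat([obs, pre_actions_onehot], dim=1)
        return self.net(x)  # [B, n_actions]: one Q-value per candidate signal phase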
Considering that the greedy value of the next state needs to be calculated to train the DQN network, the above definition cannot generate an experience directly after each step as conventional DQN does. Here, the invention uses a temporary memory τ_i to temporarily store the trajectory information observed by the agent in each round; the final experiences are generated from this trajectory.
In the embodiment of the invention, for each agent, BFS-HDQN has two randomly initialized neural networks, namely the DQN estimation network θ_i and the DQN target network θ_i⁻. Each agent selects its own action from its neighbor-aware observation and the actions of its predecessors, using an ε-greedy policy in which ε decreases from 1 to 0 as the interaction time increases. It then propagates this action to its successors. After the agents execute their actions, each agent obtains a new observation and a reward. At the end of each round, each agent generates experiences from the trajectory τ_i it observed during that round. Finally, all networks are trained as in a common DRL algorithm.
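The per-round interaction just described can be sketched as follows, with agents acting in BFS order under an ε-greedy policy, actions being passed to successors, the round's trajectory being cached, and experiences being generated from it at the end of the round; the environment interface (observe/step), the agent attributes and the target-network synchronization interval used here are assumptions for illustration only.

import random

def run_round(env, agents, order, epsilon, n_theta=200):
    """One interaction round of BFS-HDQN (illustrative sketch under an assumed API).

    env:    simulator wrapper exposing observe() and step(actions) -> (next_obs, rewards)
    agents: dict id -> agent with q_values(obs, pre_actions), predecessors, memory, ...
    order:  BFS ordering of intersection ids (leaders first)
    """
    trajectories = {i: [] for i in order}
    obs = env.observe()
    for t in range(env.horizon):
        actions = {}
        for i in order:  # act in BFS order so predecessor actions are already decided
            pre = [actions[j] for j in agents[i].predecessors]
            if random.random() < epsilon:
                actions[i] = random.randrange(agents[i].n_actions)
            else:
                actions[i] = int(agents[i].q_values(obs[i], pre).argmax())
        next_obs, rewards = env.step(actions)
        for i in order:
            pre = [actions[j] for j in agents[i].predecessors]
            trajectories[i].append((obs[i], pre, actions[i], rewards[i]))
        obs = next_obs
        if (t + 1) % n_theta == 0:
            for agent in agents.values():
                agent.sync_target()  # copy the estimation network into the target network
    # Convert each trajectory into <o, a_pre, a, r, o'> experiences at the end of the round.
    for i in order:
        traj = trajectories[i]
        for k in range(len(traj) - 1):   # the last step has no recorded next observation
            o, a_pre, a, r = traj[k]
            next_o = traj[k + 1][0]
            agents[i].memory.push(o, a_pre, a, r, next_o)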
The communication mechanism (the communication mechanism based on graph Breadth First Search (BFS) described above) uses a leader-follower action selection paradigm, so that a successor agent selects its individual action after obtaining the actions of its predecessors; this determines the order and content of information transmission between agents. The BFS-HDQN algorithm is shown in Table 2.
TABLE 1: Multi-source Breadth First Sort
Input: graph G(N, ε), leader set N_L
Output: ordered sequence sq
1: initialize an empty ordered sequence sq
2: initialize an empty queue q and enqueue every i ∈ N_L into q
3: while q is not empty:
4:   take the front element u out of q
5:   enqueue all unvisited neighbors of u
6:   append u to sq
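A Python sketch of the multi-source breadth-first sort of Table 1, together with the derivation of each agent's predecessor neighbors from the resulting order, is given below; the graph representation and function names are assumptions, and unreachable intersections (if any) would need separate handling.

from collections import deque

def breadth_first_sort(graph, leaders):
    """Multi-source BFS ordering of intersections (Table 1): all leader nodes are
    enqueued first, so leaders always appear before their followers.

    graph:   dict node -> list of neighboring nodes
    leaders: iterable of leader nodes
    """
    sq, visited = [], set(leaders)
    q = deque(leaders)                 # step 2: enqueue every leader
    while q:                           # step 3
        u = q.popleft()                # step 4
        for v in graph[u]:             # step 5: enqueue unvisited neighbors
            if v not in visited:
                visited.add(v)
                q.append(v)
        sq.append(u)                   # step 6
    return sq

def predecessor_sets(graph, order):
    """A neighbor j is a predecessor of i if j is ranked before i in the BFS order."""
    rank = {node: k for k, node in enumerate(order)}
    return {i: [j for j in graph[i] if rank[j] < rank[i]] for i in order}

# Toy 2x2 grid with one leader intersection "A".
grid = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
order = breadth_first_sort(grid, ["A"])        # ["A", "B", "C", "D"]
print(order, predecessor_sets(grid, order))    # A has no predecessors; D has B and C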
TABLE 2: The BFS-HDQN algorithm (the algorithm table is presented as images in the original publication and is not reproduced here).
In order to prove the effectiveness of the traffic signal control method based on multi-agent reinforcement learning in the embodiment of the invention, the method provided by the invention is evaluated by using a real scene and two synthetic scenes. In particular, the present invention uses a real traffic network introduced from denna and two composite traffic grids to validate the method of the present invention.
The invention first compares BFS-HDQN with the state-of-the-art MARL method MA2C. The invention also conducts an ablation experiment against HDQN and against BFS-HDQN without the weight setting, denoted BFS-HDQN(non), which shows that both the leader-follower setting and the BFS communication mechanism in BFS-HDQN help to improve ATSC performance.
(1) Scene setting
The present invention evaluates the method of the present invention using three available public traffic networks, namely one real traffic network and two 4 × 4 synthetic traffic grids, on CityFlow (a common traffic simulator). As shown in FIG. 2, for completeness, the present invention sets a different number of special intersections in the three ATSC scenarios. The intersections in all three scenarios are homogeneous and have the same action space.
Three traffic network settings. The five-pointed star and the circle in each network respectively represent a special intersection (leader) and a common intersection (follower). There is only one special intersection in road networks (a) and (b), and three in road network (c).
The invention adopts three indexes to demonstrate the performance of the model of the invention: (1) travel time m_travel-time: the average travel time of all vehicles, the most common index for evaluating different methods; (2) queue length m_queue-length: the average vehicle queue length over the entrance lanes of each intersection; (3) throughput m_throughput: the total number of vehicles arriving at their destination. Specifically:
$$m_{\text{travel-time}} = \frac{1}{|V_{in}|} \sum_{v \in V_{in}} \left(t_v^{out} - t_v^{in}\right)$$
$$m_{\text{queue-length}} = \frac{1}{T\,|N|} \sum_{t=1}^{T} \sum_{i \in N} q_i^{(t)}$$
$$m_{\text{throughput}} = |V|$$
where $t_v^{in}$ and $t_v^{out}$ respectively represent the entry and exit time of a vehicle v, V is the set of vehicles arriving at their destination within one hour, $V_{in}$ is the set of vehicles entering the traffic network, $q_i^{(t)}$ is the vehicle queue length at intersection i at time t, |N| is the number of intersections, and the episode length is set to T = 360. The goal is to learn a joint strategy that maximizes the return of the entire network and of some special intersections. Therefore, the present invention evaluates the method on the average indexes of all intersections and of the special (leader) intersections, respectively.
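For illustration, the three evaluation indexes can be computed from logged vehicle entry/exit times and queue measurements roughly as follows; the data layouts and function names are assumptions and do not reflect the simulator's actual logging interface.

def mean_travel_time(enter_times, exit_times):
    """Average travel time of vehicles that reached their destination.
    enter_times / exit_times: dict vehicle id -> time step."""
    done = [v for v in exit_times if v in enter_times]
    return sum(exit_times[v] - enter_times[v] for v in done) / max(len(done), 1)

def mean_queue_length(queue_log, n_intersections):
    """Average queue length per intersection per time step.
    queue_log: list over time of dicts intersection -> queued vehicles."""
    total = sum(sum(step.values()) for step in queue_log)
    return total / (len(queue_log) * n_intersections)

def throughput(exit_times):
    """Total number of vehicles that arrived at their destination."""
    return len(exit_times)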
(2) Performance comparison
FIG. 3 and FIG. 4 are the learning curves of the MARL methods for the average total return and the leader return, respectively, under the three traffic scenarios ((a), (b) and (c) correspond to the three traffic scenarios in FIG. 2). The lines and the shadows around them are the average return and the error range of these methods during learning.
As can be seen from FIG. 3, BFS-HDQN achieves stable and almost the highest return in all environments, followed by BFS-HDQN(non) and HDQN, with MA2C performing the worst. The results show that the algorithm is effective in improving overall efficiency, and all of its innovations are meaningful.
In FIG. 4, the advantage of BFS-HDQN over MA2C and HDQN is evident, while the performance of BFS-HDQN(non) is almost the same as that of BFS-HDQN (even slightly better than BFS-HDQN in the two synthetic scenarios). The reason is intuitive: the weights in BFS-HDQN are added to the followers' rewards, not the leader's. The results of FIG. 4 show that the algorithm of the invention can also effectively improve the benefit of special intersections.
In order to compare execution performance, the invention also reports the final statistical results of the trained MARL strategies in Table 3 (execution comparison of the trained strategies on overall performance) and Table 4 (execution comparison of the trained strategies on leader performance, measured by queue length), which show that the algorithm of the invention achieves the best results in all environments.
TABLE 3: Execution comparison of the trained strategies on overall performance (the table is presented as an image in the original publication and is not reproduced here).
TABLE 4: Execution comparison of the trained strategies on leader performance, measured by queue length (the table is presented as an image in the original publication and is not reproduced here).
In summary, the present invention evaluates the proposed method in one real-world and two synthetic traffic scenarios and compares it with state-of-the-art methods. The experimental results show that, compared with existing methods, the algorithm can not only ensure the best overall performance but also obtain better performance at special intersections under the three most common indexes of travel time, throughput and queue length.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning is characterized by comprising the following steps:
S1, establishing an agent corresponding to each intersection according to a traffic network; the agents comprise leader agents corresponding to special intersections and follower agents corresponding to common intersections; the ATSC problem of an actual traffic scene is defined as a Markov game model LF-MG based on a leader-follower paradigm, and intersections in the ATSC scene are divided into special intersections and common intersections; the LF-MG defines two kinds of agents, a leader agent and a follower agent, wherein the leader agent only considers its own benefit, and the follower agent comprehensively considers the benefits of the leader agents and of its neighbors; the reward of an agent is defined as:
$$r_i^{(t)} = \begin{cases} r_{i,\text{self}}^{(t)}, & i \in N_L \\ r_{i,\text{self}}^{(t)} + \sum_{j \in N_i} \omega_{ij}\, r_{j,\text{self}}^{(t)}, & i \in N_F \end{cases}$$
wherein $r_{i,\text{self}}^{(t)} = -\sum_{l \in L_i} \mathrm{wave}[l]^{(t)}$ is the current congestion situation near intersection i at time t, i.e. the negative of the number of waiting vehicles, obtained directly from the observation of agent i; $N_L$ and $N_F = N - N_L$ are respectively the sets of leader agents and follower agents, N is the agent set, and $N_i$ is the set of neighbor agents; ω ∈ (0, 1) is a discount factor used to measure the importance between a follower agent and its neighboring agents: the larger ω is, the stronger the social awareness of the agent and the more aligned the goals of all agents; conversely, the smaller ω is, the more selfish the agent; when ω = 0, the agent is equivalent to a leader agent; if the neighbor is a leader, ω is fixed to 1, otherwise it is fixed to 0.5; the follower reward definition mainly takes the group benefit of the agents as its objective and accounts for the communication limitations among agents in a networked scene;
the observation $o_i \in O_i$ of agent i is defined by the information of intersection i and its neighbors:
$$o_i^{(t)} = \left\{ \mathrm{phase}_i^{(t)},\; \{\mathrm{wave}[l]^{(t)}\}_{l \in L_i \cup L_{N_i}} \right\}$$
wherein $\mathrm{phase}_i^{(t)}$ is the signal phase of agent i at time t, $L_i$ and $L_{N_i}$ are respectively the sets of entrance lanes of intersection i and of its neighboring intersections, and $\mathrm{wave}[l]^{(t)}$ is the number of vehicles waiting in lane l at time t; the action $a_i \in A_i$ of agent i is the signal phase; the goal of each agent in the LF-MG is to maximize its cumulative reward by finding a joint policy π:
$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, r_i^{(t)}\right]$$
$O_i$ is the local observation that agent i can observe, and $A_i$ is the local action space of agent i;
S2, sequencing each intersection in the traffic network, and determining the action selection order of each intersection to obtain the predecessor agents and successor agents of each agent;
S3, acquiring real-time traffic characteristics of each intersection in the traffic network, and, for the agent corresponding to each intersection, acquiring the observation information of the agent and its neighbor agents and the joint action of its predecessor agents according to the real-time traffic characteristics of each intersection;
S4, inputting the observation information of the agent and its neighbor agents and the joint action of its predecessor agents into the HDQN network of the agent, determining the action of the agent, and transmitting the action of the agent to the successors of the agent;
S5, the agent executes the determined action and obtains the observation information and reward after the action is executed and the trajectory information observed by the agent in each round;
S6, generating an interaction experience according to the trajectory information; the interaction experience comprises: the observations of the agent and its neighbors, the actions executed by the agent and its predecessors, the reward received by the agent, and the observation after the agent executes the action;
and S7, updating the HDQN network of the agent according to the interaction experience.
2. The heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning of claim 1, wherein the step of ordering the intersections in the traffic network comprises:
sequencing each intersection in the traffic network by adopting a breadth-first sort method.
3. The multi-agent reinforcement learning-based heterogeneous intersection scene traffic signal control method of claim 2, wherein the leader agent's predecessor set is empty.
4. The multi-agent reinforcement learning-based heterogeneous intersection scene traffic signal control method according to claim 3, wherein the HDQN network comprises a DQN estimation network and a DQN target network; accordingly, updating the HDQN network of the agent according to the interaction experience, comprising:
updating the DQN estimation network according to the interaction experience;
updating the DQN target network based on the DQN estimation network every N_θ steps, where N_θ represents the number of steps the agent interacts with the environment.
CN202111402414.3A 2021-11-19 2021-11-19 Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning Active CN114120672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111402414.3A CN114120672B (en) 2021-11-19 2021-11-19 Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111402414.3A CN114120672B (en) 2021-11-19 2021-11-19 Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN114120672A CN114120672A (en) 2022-03-01
CN114120672B true CN114120672B (en) 2022-10-25

Family

ID=80371736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111402414.3A Active CN114120672B (en) 2021-11-19 2021-11-19 Heterogeneous intersection scene traffic signal control method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN114120672B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2187369A3 (en) * 2008-06-04 2012-03-28 Roads and Traffic Authority of New South Wales Traffic signals control system
CN102542793B (en) * 2012-01-11 2014-02-26 东南大学 Active control method of oversaturated traffic situation at intersection group
BR102017019865A2 (en) * 2017-09-15 2019-04-16 Velsis Sistemas E Tecnologia Viaria S/A PREDICTIVE, INTEGRATED AND INTELLIGENT SYSTEM FOR TRAFFIC TRAFFIC TIME CONTROL
CN108335497B (en) * 2018-02-08 2021-09-14 南京邮电大学 Traffic signal self-adaptive control system and method
CN108877256B (en) * 2018-06-27 2020-11-13 南京邮电大学 Wireless communication-based method for controlling scattered cooperative self-adaptive cruise near intersection
CN113393667B (en) * 2021-06-10 2022-05-13 大连海事大学 Traffic control method based on Categorical-DQN optimistic exploration
CN113435112B (en) * 2021-06-10 2024-02-13 大连海事大学 Traffic signal control method based on neighbor awareness multi-agent reinforcement learning
CN113643553B (en) * 2021-07-09 2022-10-25 华东师范大学 Multi-intersection intelligent traffic signal lamp control method and system based on federal reinforcement learning

Also Published As

Publication number Publication date
CN114120672A (en) 2022-03-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant