CN113628442A - Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning - Google Patents
Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning
- Publication number
- CN113628442A (application number CN202110911165.4A)
- Authority
- CN
- China
- Prior art keywords
- network
- road
- behavior
- traffic
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0125—Traffic data processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/07—Controlling traffic signals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computer Hardware Design (AREA)
- Geometry (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention discloses a traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning, belonging to the field of traffic signal lamp control. First, an Actor network comprising a state space set and a behavior space set is constructed; then observed values are transmitted in, high-dimensional information is compressed into low-dimensional information by a Subnet network, and the behavior deflection probability is calculated; next, the initial state information, the updated state information and the behavior deflection probability are transmitted into a Critic network for centralized learning; finally, the vehicle trajectories are reconstructed. In a multi-intersection traffic environment, the multiple agents improve the road network smooth-traffic rate by means of the Actor-Critic algorithm framework. Meanwhile, centralized learning with distributed execution is used among the agents, combining the advantages of both, so that the convergence rate of the algorithm is greatly improved.
Description
Technical Field
The invention relates to the field of traffic signal lamp control, in particular to a traffic organization scheme optimization method based on multi-signal lamp reinforcement learning.
Background
In the information age, living standards have risen and most families own private automobiles, which causes various urban traffic problems such as excessive waiting time and excessive lane occupancy. With the development of artificial intelligence, many intelligent traffic technologies have emerged and traffic behavior can begin to be controlled effectively. Reinforcement learning is one of these artificial intelligence technologies and is currently the mainstream of intelligent traffic control, including algorithms such as Q-learning, Sarsa and TD(lambda).
How to enable an agent to learn efficiently in a traffic environment has been a challenge for reinforcement learning in recent years. The traditional way of training an agent in reinforcement learning is to train repeatedly by continuously iterating the policy, but such training is only suitable for a single agent and does not carry over to multiple agents.
Considering the intelligent management of urban traffic, how an agent selects and executes a good policy from among many candidate policies when it starts to act has been a difficult research point in recent years.
Extensive research has been conducted on traffic light control at individual intersections, where vehicles are set to reach their destinations in a particular manner, and most attempts optimize both vehicle travel time and queue length at the intersection. Many reinforcement-learning-based methods attempt to solve this problem by learning from data; early experiments used Q-learning by building Q-tables. However, Q-learning is suited to discrete problems: deployed in the current traffic environment, even a single intersection faces tens of thousands of distinct situations, the capacity of a Q-table is limited and cannot enumerate them all, so Q-learning is not suitable for the traffic environment.
For multi-intersection traffic signal optimization, one method performs joint modeling and executes behaviors collectively through centralized training of the agents, but this method has two common problems:
a) as the number of agents grows, the computational cost of centralized training becomes too large;
b) during testing each agent acts independently, and a change to one agent in a dynamic environment requires coordination with the other agents in its vicinity.
The other approach uses distributed reinforcement learning agents to control multiple intersections, with each agent making decisions based on information from the adjacent intersections around it. Decentralized communication is more practical and scales better than centralized decision-making, but its model convergence is often unstable and slow.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning, which uses multiple agents and an Actor-Critic network framework to optimize the traffic state in a multi-intersection environment.
The technical scheme of the invention is as follows:
a traffic organization scheme optimization method based on multi-signal lamp reinforcement learning comprises the following steps:
s1: constructing an Actor network
The traffic network comprises a plurality of intersections, a signal lamp of each intersection corresponds to an agent, a plurality of Actor networks corresponding to the agents are constructed, and the Actor networks comprise state space sets and behavior space sets;
s2: transmitted into the observed value
The method comprises the steps that a multi-agent observes traffic states of multiple intersections to obtain observed values, and then the observed values are transmitted into a state space set in an Actor network, wherein the observed values comprise vehicle waiting time and lane occupancy of corresponding intersections;
s3: afferent behavior scheme
Setting a behavior scheme of a plurality of intelligent agents, and transmitting the behavior scheme into a behavior space set in an Actor network;
s4: calculating behavioral deflection probability
In the Actor network, calculating behavior deflection probability based on the observation value and the behavior scheme;
s5: selecting behavior and updating state
Each agent selects a behavior based on the behavior deflection probability and updates the state space set according to the selected behavior;
s6: critic web learning
Transmitting the behavior deflection probability, the initial state space set and the updated state space set in the Actor network into the criticic network for centralized learning training, reversely transmitting the learned information to the Actor network, and outputting the selected behavior scheme;
s7: trajectory reconstruction
And after the Actor network selects the behaviors, deleting the blocked road sections from the track of the vehicle, replanning the path, and outputting the replanned path.
The Actor network is responsible for interacting with the surrounding environment according to a policy function, where s denotes the current state of the agent and a denotes the action selected by the agent, and the policy is represented approximately. The policy Π can be described as a function with parameter θ: Π_θ(a|s) = P(a|s, θ) ≈ Π(a|s). The size of the state space depends on the number of agents, and the size of the behavior space depends on the number of actions of the agents. The Critic network approximates a value function: in addition to the state value function, an action value function q̂ described by a parameter w is introduced, which receives a state s and an action a as inputs and, after calculation, yields an approximate action value; V(s) is the value of the agent in state s, and the action value function satisfies q̂(s, a, w) ≈ q_Π(s, a). The formula used to update the policy parameters in the AC algorithm is θ = θ + α·∇_θ log Π_θ(s, a)·v, where α is the training step size, v is the state value, and θ is the policy-function parameter.
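As an illustrative sketch (not the patented implementation itself), the update θ = θ + α·∇_θ log Π_θ(s, a)·v can be written in Python/NumPy for a linear softmax policy over the 16 behavior schemes; the feature dimension, step size and network form below are assumptions, not values taken from the patent.

```python
import numpy as np

# Sketch of the Actor update theta <- theta + alpha * grad_theta log pi_theta(s, a) * v
# for an assumed linear softmax policy.
N_ACTIONS = 16          # 16 candidate signal "Programs" (behavior schemes)
STATE_DIM = 11          # 10 lane-occupancy segments + total waiting time
ALPHA = 0.001           # assumed training step size

theta = np.zeros((STATE_DIM, N_ACTIONS))

def behavior_deflection_probability(state, theta):
    """Softmax over action preferences: the 'behavior deflection probability'."""
    logits = state @ theta
    logits -= logits.max()                  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def actor_update(state, action, v, theta):
    """One policy-gradient step weighted by the state value v."""
    probs = behavior_deflection_probability(state, theta)
    # gradient of log pi w.r.t. theta for a linear softmax policy
    grad_log_pi = -np.outer(state, probs)
    grad_log_pi[:, action] += state
    return theta + ALPHA * grad_log_pi * v

state = np.random.rand(STATE_DIM)           # example observation
probs = behavior_deflection_probability(state, theta)
action = np.random.choice(N_ACTIONS, p=probs)
theta = actor_update(state, action, v=0.5, theta=theta)
```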
Further, a Subnet network is constructed behind the Actor network; the Subnet network compresses the high-dimensional state information transmitted by the Actor network into low-dimensional state information and then transmits it back to the Actor network to calculate the behavior deflection probability. The number of matrices transmitted into the Subnet network equals the number of agents; the state information acquired by the agents is used as input, features are extracted through convolution, and the output features are converted by a fully connected layer into a one-dimensional vector, which is the state information generated by the interaction of the target agent.
The Subnet network is a convolutional network divided into several layers; each layer uses a different filter, and the Subnet network shares parameters with the Actor network.
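A minimal sketch of such a Subnet in the TF 1.x style used by the embodiments (Tensorflow-gpu-1.4.0 is mentioned there); the state-matrix shape, number of agents, filter counts and compressed dimension are illustrative assumptions rather than values specified in the patent.

```python
import tensorflow as tf  # written against the TF 1.x API mentioned in the embodiments

N_AGENTS = 9        # assumed: one state matrix per agent
H, W = 11, 4        # assumed state-matrix shape: 10 segments + waiting time, 4 lanes
LOW_DIM = 32        # assumed size of the compressed (low-dimensional) vector

# One state matrix per agent is fed in; convolution extracts features and a
# fully connected layer converts them into a one-dimensional vector per agent.
states = tf.placeholder(tf.float32, [N_AGENTS, H, W, 1], name="agent_states")

conv1 = tf.layers.conv2d(states, filters=16, kernel_size=3, padding="same",
                         activation=tf.nn.relu)
conv2 = tf.layers.conv2d(conv1, filters=32, kernel_size=3, padding="same",
                         activation=tf.nn.relu)
flat = tf.reshape(conv2, [N_AGENTS, -1])
low_dim_state = tf.layers.dense(flat, LOW_DIM, activation=tf.nn.relu)

# low_dim_state is passed back to the Actor networks (and on to the Critic)
# in place of the raw high-dimensional observation.
```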
Further, a Subnet network is arranged between the Actor network and the Critic network, and the Subnet network compresses the initial state space set and the updated state space set in each Actor network and transmits the compressed state space sets and the behavior deflection probability to the Critic network for centralized learning.
Further, the exit road of the intersection is discretized into a certain number of road sections, each containing its corresponding vehicles; the lane occupancy is obtained by comparing the vehicle length in each road section with the length of that road section. The vehicle waiting time is the waiting time of all vehicles on the current road.
When the exit road is discretized into n road sections, each road section contains its corresponding vehicles, and the ratio of vehicle length to road-section length is added to the observed value as the lane occupancy. Different intersections have different numbers of lanes; after discretizing the road, the state information of the intersection contains n + 1 kinds of information as input (the n occupancies plus 1 kind of intersection vehicle waiting time).
The number of lanes contained in each road section also differs, and when the observed value is transmitted, the lane occupancies of each section of lane need to be aggregated and then transmitted into the state matrix. For the travel road of the vehicle, the state of the agent is set as S = {s1, s2, ..., sn, sn+1}, where s1, s2, ..., sn are the lane occupancy rates and sn+1 is the total vehicle queuing time; the occupancy rate is calculated as rate_i = VehLength_i / RoadLength_i, where VehLength_i is the vehicle length in road section i and RoadLength_i is the length of road section i.
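As a hedged illustration of this state construction (the segment count, segment lengths and waiting-time value below are example figures, not values from the patent), the occupancy of each discretized segment and the total queuing time can be assembled as follows:

```python
from typing import List

def segment_occupancy(vehicle_lengths: List[float], segment_length: float) -> float:
    """Lane occupancy of one road section: total vehicle length / section length."""
    return sum(vehicle_lengths) / segment_length

def build_state(vehicles_per_segment: List[List[float]],
                segment_lengths: List[float],
                total_waiting_time: float) -> List[float]:
    """S = {s1, ..., sn, sn+1}: n segment occupancies plus the total queuing time."""
    occupancies = [segment_occupancy(v, l)
                   for v, l in zip(vehicles_per_segment, segment_lengths)]
    return occupancies + [total_waiting_time]

# Example with n = 10 segments of 50 m each, two 5 m / 4.5 m vehicles per segment.
state = build_state([[5.0, 4.5]] * 10, [50.0] * 10, total_waiting_time=120.0)
```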
Further, the behavior scheme of step S3 is to set the left-turn signal lamp to red (prohibit left turn), and/or set the right-turn signal lamp to red (prohibit right turn), and/or set the straight-through signal lamp to red (prohibit going straight), and/or prohibit U-turns. When the traffic flow would otherwise flow into another road section that already carries heavier traffic, adjustment can be made through these prohibit-left, prohibit-right, prohibit-straight and prohibit-U-turn actions.
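A sketch of how such a restriction might be applied through SUMO's TraCI interface used in the embodiments; the traffic-light ID and program IDs are hypothetical, and the patent only states that the road state is changed through the signal Program, so this mapping is an assumption.

```python
import traci  # SUMO's Python TraCI interface

# Assumed example: each pre-built signal "Program" encodes one behavior scheme,
# e.g. "prohibit_left" keeps the left-turn links red in every phase.
PROGRAM_IDS = {"allow_all": "0", "prohibit_left": "prog_no_left"}  # hypothetical IDs

def apply_behavior(tls_id: str, scheme: str) -> None:
    """Switch the traffic light tls_id to the signal program of the chosen scheme."""
    traci.trafficlight.setProgram(tls_id, PROGRAM_IDS[scheme])
```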
Further, each Actor network is responsible only for its own agent rather than for all or several agents, which reduces the delay of the agent's behavior during learning. All agents have the same target and are homogeneous, so training can be accelerated through parameter sharing. That the Actor network parameters of the agents are identical does not mean the agents take the same action: each agent takes a different action according to the different observed environment around it. In addition, since the policy is allowed to use extra information to simplify training during execution, a simple Actor-Critic scheme is adopted for agent cooperation, in which the Critic network adds the information of the other agents into its learning; with parameter sharing, the TD-based gradient obtained by each agent follows g = ∇_θ log Π(a_i | o_i)·(r + V^Π(S_{t+1}) - V^Π(S_t)), where Π is the agent's policy, a and s are the agent's behavior and state respectively, o denotes an observed value, V denotes the state value function, and r is the agent's reward value.
Further, in step S6, the Critic network calculates the current state value v and the next-state value v_ from the input behavior deflection probability, initial state space set and updated state space set; it then calculates the TD-error after the behavior is selected, TD-error = r + V^Π(S′) - V^Π(S), where r is the feedback, S is the initially obtained state and S′ is the new state reached by the selected behavior; in terms of the computed values, TD-error = r + GAMMA·v_ - v, where GAMMA is the decay value.
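A minimal sketch of this Critic step in Python/NumPy, assuming a simple linear value function; the decay value, learning rate and feature dimension are illustrative, not values from the patent.

```python
import numpy as np

GAMMA = 0.9          # decay (discount) value, example setting
CRITIC_LR = 0.01     # assumed learning rate
STATE_DIM = 11

w = np.zeros(STATE_DIM)                    # parameters of a linear value function

def value(state, w):
    return float(state @ w)

def critic_step(state, next_state, r, w):
    """Compute TD-error = r + GAMMA * v_ - v and update the value parameters."""
    v = value(state, w)                    # current state value
    v_ = value(next_state, w)              # next-state value
    td_error = r + GAMMA * v_ - v
    w = w + CRITIC_LR * td_error * state   # semi-gradient TD(0) update
    return td_error, w

td_error, w = critic_step(np.random.rand(STATE_DIM),
                          np.random.rand(STATE_DIM), r=-3.0, w=w)
# td_error is then transmitted back to weight the Actor update.
```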
Further, in step S7, the route of the prohibited road segment is changed without changing the starting point and the ending point, so that the vehicle can run normally, and the deadlock of the road vehicle is avoided.
Furthermore, the reward obtained by the agent is formulated after trajectory reconstruction: for each round, the trajectories of the vehicles are reconstructed and the total waiting time of vehicles in the whole road network is calculated, with reward = -(wt - selfwt), where selfwt is the total waiting time of vehicles on the road in the original environment and wt is the total waiting time of vehicles after trajectory reconstruction following each round of reinforcement learning; wt is initialized to 0, and the queuing time of vehicles running on the reconstructed trajectories, compared with the original queuing time, is used as the reward of the round.
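The reward computation can be sketched as follows; the waiting-time totals would in practice come from the simulator, and the numbers below are placeholders (selfwt reuses the 11567 s figure from embodiment 2, wt is invented).

```python
def round_reward(wt: float, selfwt: float) -> float:
    """reward = -(wt - selfwt): positive when reconstruction reduces total waiting time."""
    return -(wt - selfwt)

selfwt = 11567.0   # total waiting time in the original environment
wt = 9800.0        # total waiting time after this round's control + trajectory reconstruction
reward = round_reward(wt, selfwt)   # 1767.0 -> waiting time reduced, positive reward
```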
Further, the traffic condition is evaluated with two indexes: the road network smooth-traffic rate and the travel time index. The road network smooth-traffic rate is the ratio, within a time period T, of the mileage of road sections of the road network in a good traffic state to the mileage of all road sections in the road network, and describes the overall smoothness of the road network. The calculation formula is RNCR(T) = Σ_{i=1}^{n} k_i·l_i / Σ_{i=1}^{n} l_i, where RNCR(T) is the road network smooth-traffic rate in time period T (T may be 5 min or 3 min), n is the number of road sections in the network, l_i is the length of the i-th road section, and k_i is a binary function: k_i = 1 when the traffic state level of section i belongs to an acceptable traffic state, otherwise k_i = 0. RNCR(T) lies in the range [0, 1]; the larger the value, the better the road network state, and conversely, the worse.
The travel time index (TTI) is the ratio of the actual travel time to the expected travel time; the larger the value, the worse the traffic state. It is calculated from T, the time interval taken, and meanTimeLoss, the average time lost over that period.
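A small evaluation sketch for these two indexes; the section data are invented for illustration, and the TTI helper simply applies the ratio definition above (actual over expected travel time), since the exact expression is not spelled out in the text.

```python
from typing import List

def rncr(section_lengths: List[float], acceptable: List[bool]) -> float:
    """Road network smooth-traffic rate: RNCR(T) = sum(k_i * l_i) / sum(l_i), k_i in {0, 1}."""
    good = sum(l for l, k in zip(section_lengths, acceptable) if k)
    return good / sum(section_lengths)

def tti(actual_travel_time: float, expected_travel_time: float) -> float:
    """Travel time index: ratio of actual to expected travel time (values >= 1 mean delay)."""
    return actual_travel_time / expected_travel_time

# Example: three sections of 400 m, 250 m, 600 m; only the first two are acceptable.
print(rncr([400.0, 250.0, 600.0], [True, True, False]))   # 0.52
print(tti(300.0, 220.0))                                   # ~1.36
```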
The traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning may also be called Traffic Redirected Light, TR-Light for short; the traffic organization scheme is optimized through the two key reinforcement-learning elements of reward and state, together with the ideas of centralized learning with distributed execution among the multiple agents and of trajectory reconstruction.
The invention has the advantages that:
the reinforcement learning of the multi-agent is used, the multi-agent carries out interactive learning with other agents according to the current state of the multi-agent and the self current learning, the multi-agent is coordinated and organized by self learning, and the multi-agent effectively cooperates with other agents to finish the self learning and change the self state in the process to finish the final high-efficiency target.
The control of traffic lights is used as the action selection, and the waiting time of vehicles at intersections and the vehicle-to-road occupancy are used as the observed values of the environment. The agents at multiple intersections are trained through centralized learning with distributed execution, which effectively improves the smooth-traffic rate of the road network; meanwhile, the vehicles running on a controlled road section have their trajectories reconstructed, and the reconstructed trajectories are combined with reinforcement learning to achieve the best effect.
In the AC algorithm, the multiple agents learn interactively in a centralized way: the agents uniformly feed back the behaviors they executed to the Critic network, and the same Critic network back-propagates to the Actor networks of the other agents. This learning mode makes the agents converge more stably and quickly.
The Subnet network is composed of a neural network: after high-dimensional data are fed in, they are compressed and dimension-reduced by the neural network to generate new information, which is transmitted through the Subnet network into the Critic network for learning, improving the learning efficiency of the Critic.
Drawings
FIG. 1 is a schematic diagram of the difference between the multi-agent and single-agent environments in example 1;
fig. 2 is a schematic diagram of traffic light behavior scheme setting in embodiment 1;
FIG. 3 is a schematic view of the road discretization in example 1;
FIG. 4 is a schematic diagram of the multi-port agent operation in example 1;
fig. 5 is a schematic diagram of a Subnet network in embodiment 1;
FIG. 6 is a SUMO simulation diagram in example 2;
FIG. 7 is a comparison graph of experiments of different algorithms in example 2;
FIG. 8 is a SUMO simulation diagram in example 3;
FIG. 9 is a comparison graph of experiments of different algorithms in example 3;
FIG. 10 is a SUMO simulation map of the Mianyang gardening mountain area in example 4;
FIG. 11 is a diagram showing the results of an iterative final scheme after the reinforcement learning method is used in example 4;
FIG. 12 is a comparison of the optimization results of different algorithms in example 4.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
The examples are given to better illustrate the invention, but the invention is not limited to them. Insubstantial modifications and adaptations made by those skilled in the art in light of the above teachings still fall within the scope of the invention.
Example 1
This embodiment is a multi-intersection traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning; the road network smooth-traffic rate is improved using multiple agents, Actor-Critic networks, a Subnet network and trajectory reconstruction. A multi-agent environment is an environment with multiple intelligent entities at each step; fig. 1 shows the difference between a multi-agent environment and a single-agent environment.
The method comprises the steps of firstly, constructing an Actor network, wherein a traffic network comprises a plurality of intersections, a signal lamp of each intersection corresponds to one intelligent agent, the plurality of intelligent agents need to construct a plurality of corresponding Actor networks, and each Actor network comprises a state space set and a behavior space set.
The state of a road is changed through the Program of the traffic signal lamp, performing traffic control by temporarily closing the road in a certain sense. In this embodiment, the behaviors are set as four schemes: setting the left-turn signal lamp to red (prohibit left), setting the right-turn signal lamp to red (prohibit right), setting the straight-through signal lamp to red (prohibit straight), and prohibiting U-turns. When the traffic flow would otherwise flow into another road section that already carries heavier traffic, adjustment can be made through these prohibit-left, prohibit-right, prohibit-straight and prohibit-U-turn actions. In the action setting, an action space is set first, and the defined action schemes are transmitted into it in turn. As shown in fig. 2, since the actions are implemented as signal Programs, the action set is A = {program1, program2, ..., program16}, where 16 is the total number of selectable actions; substituting the cases of fig. 2, program1 represents the prohibit-left action, so the action set is A = {prohibit left, prohibit right, prohibit straight, ..., program16}. Because each of the 4 restrictions in fig. 2 can be applied or not, there are 2^4 = 16 possible behavior schemes in total.
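As an illustrative sketch of this action space (the names of the restrictions are assumptions; the text only states that the 4 settable restrictions yield 16 selectable Programs), the combinations can be enumerated as follows:

```python
from itertools import product

RESTRICTIONS = ["prohibit_left", "prohibit_right", "prohibit_straight", "prohibit_u_turn"]

# Each of the 4 restrictions can be applied or not, giving 2**4 = 16 behavior schemes.
action_space = []
for flags in product([False, True], repeat=len(RESTRICTIONS)):
    scheme = {name: on for name, on in zip(RESTRICTIONS, flags)}
    action_space.append(scheme)

assert len(action_space) == 16
# action_space[0] applies no restriction; action_space[-1] applies all four.
```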
The agent obtains the real-time traffic state by observing the environment of the intersection and transmits this state into the state space set defined in the Actor network for subsequent execution. The current road state is represented by the vehicle waiting time and lane occupancy at the intersection. As shown in fig. 3, the departure road of the intersection is discretized into 10 sections, each section containing its corresponding vehicles, and the ratio of vehicle length to section length is added to the observed value as the lane occupancy. Different intersections have different numbers of lanes; after discretizing the road, the state contains 11 kinds of information as input (the 10 occupancies plus 1 kind of intersection vehicle waiting time). When the observed state is transmitted, the occupancies of all lanes in each section of lane are aggregated and transmitted into the state matrix. For the travel road of the vehicle, the state of the agent is set as S = {s1, s2, ..., s10, s11}, where s1, ..., s10 are the lane occupancy rates and s11 is the total vehicle queuing time.
as shown in fig. 4, the Actor-Critic network is combined with the Subnet network to realize the cooperation of the intelligent agents at the multiple intersections. In the Actor network, behavior deflection probability is calculated based on the observation value and the behavior scheme, each agent selects behaviors based on the behavior deflection probability, and a state space set is updated according to the selected behaviors. Each Actor network is only responsible for respective agents, each agent has the same target and is homogeneous, the training speed can be accelerated in a parameter sharing mode, and each agent takes different actions according to different observation environments around the agent.
A Subnet network is constructed behind the Actor network; the Subnet network compresses the high-dimensional state information transmitted by the Actor network into low-dimensional state information and then transmits it back to the Actor network to calculate the behavior deflection probability. As shown in fig. 5, the Subnet network is a convolutional network with several layers, each layer using a different filter, and the Subnet network shares parameters with the Actor network. The number of matrices transmitted into the Subnet network equals the number of agents; the state information acquired by the agents is used as input, features are extracted through convolution, and the output features are converted by a fully connected layer into a one-dimensional vector, which is the state information generated by the interaction of the target agent.
The initial state space set and the updated state space set in the Actor networks are transmitted into the Subnet network for compression and are then transmitted, together with the behavior deflection probability, into the Critic network for centralized learning. The Critic network calculates the current value v and the next-state value v_ from the input behavior deflection probability, initial state space set and updated state space set; the TD-error after the selected action is then calculated as TD-error = r + V^Π(S′) - V^Π(S), where r is the feedback, S is the initially obtained state and S′ is the new state reached by the selected behavior; in terms of the computed values, TD-error = r + GAMMA·v_ - v, where GAMMA is the decay value. The learned information is then transmitted back to the Actor network.
After the Actor network selects the behavior, the blocked road section is deleted from the vehicle's trajectory and the path is re-planned; the route through the prohibited road section is changed without changing the starting point and the ending point, so that the vehicle can still run normally and deadlock of road vehicles is avoided.
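A hedged sketch of one way such trajectory reconstruction could be carried out through TraCI; the patent does not prescribe these exact calls, so penalizing the banned edge and letting SUMO reroute is an assumed mechanism.

```python
import traci

def reconstruct_trajectories(banned_edge: str) -> None:
    """Re-plan the route of every vehicle whose route uses the banned edge,
    keeping its origin and destination unchanged."""
    for veh_id in traci.vehicle.getIDList():
        if banned_edge in traci.vehicle.getRoute(veh_id):
            # Make the banned edge look extremely expensive for this vehicle,
            # then let SUMO recompute a route around it.
            traci.vehicle.setAdaptedTraveltime(veh_id, banned_edge, 1e6)
            traci.vehicle.rerouteTraveltime(veh_id, currentTravelTimes=False)
```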
The reward obtained by the agent after trajectory reconstruction is formulated as follows: for each round, the trajectories of the vehicles are reconstructed and the total waiting time of vehicles in the whole road network is calculated, with reward = -(wt - selfwt), where selfwt is the total waiting time of vehicles on the road in the original environment and wt is the total waiting time of vehicles after trajectory reconstruction following each round of reinforcement learning; wt is initialized to 0, and the queuing time of vehicles running on the reconstructed trajectories, compared with the original queuing time, is used as the reward of the round.
Example 2
This embodiment is a traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning for a single intersection. The simulation platform adopted is SUMO, an open-source road traffic simulator that supports collecting the relevant data needed in the simulation experiments, simulating traffic behavior, building the required road network and, most importantly, collecting the timing data of the traffic signal lamps. PyCharm is used as the IDE for code development, and Tensorflow-gpu-1.4.0 and Numpy are used to build the reinforcement learning components and the neural networks. In addition, the TraCI traffic control interface of SUMO is used: TraCI can dynamically control the traffic signals, call the SUMO simulation tool, acquire information on individual vehicles, and obtain detailed data and real-time road conditions for each road.
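For illustration, a minimal TraCI loop that gathers the observed values used here (lane occupancy and vehicle waiting time) might look as follows; the configuration file, lane IDs and step count are assumptions, not artifacts from the patent's experiments.

```python
import traci

traci.start(["sumo", "-c", "scenario.sumocfg"])   # hypothetical SUMO configuration file

LANES = ["edge_in_0", "edge_in_1"]                # hypothetical incoming lane IDs

for step in range(1000):
    traci.simulationStep()
    # Occupancy of each incoming lane as reported by SUMO for the last step.
    occupancies = [traci.lane.getLastStepOccupancy(lane) for lane in LANES]
    # Total waiting time of all vehicles currently in the network.
    waiting = sum(traci.vehicle.getWaitingTime(v) for v in traci.vehicle.getIDList())
    observation = occupancies + [waiting]
    # observation would be transmitted into the corresponding Actor network's state space set.

traci.close()
```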
In the first experiment, a two-traffic-light coordination mode is adopted in a two-intersection environment. During the experiment 3000 vehicles are deployed in the simulation system, the model is initially set to 30 vehicles, and the random seed parameter is set to 4. The environment contains 13 OD pairs in total, and most trajectories in these OD pairs run from north to south. Without any reinforcement learning method, the total waiting time of the vehicles in the model is 11567 seconds, and in fig. 6 the traffic flowing from north to south through the two intersections congests the middle road section. After the reinforcement learning method is added and the model is trained continuously, the most reasonable result given by the optimization scheme is to prohibit, in real time, the north-to-south road sections at the two intersections, as shown in fig. 6; meanwhile, the vehicles on the prohibited road sections have their trajectories reconstructed, so that vehicles in that direction reach their destinations through other road sections. The comparative experiments of the model and the implementation scheme of each intersection when the algorithm reaches the optimal result are shown in fig. 7 (TR-Light is the multi-signal-lamp reinforcement learning of the present invention).
Example 3
This embodiment uses a nine-intersection grid environment: each rectangle represents a signalized intersection, and every two adjacent intersections are connected by two lanes.
In the setting of this embodiment, the following parameters are configured in the SUMO simulation software: 7000 vehicles step into the simulation system in the 9-grid environment, and the model sets the initial number of vehicles to 50, where the shortest vehicle travel path is 2, the longest vehicle travel path is 7, and the random seed parameter is 10.
After the experimental model is built, the action mode of each agent is constructed. The total waiting time of the vehicles in this environment under the original conditions is 24732 seconds, and the experimental traffic environment contains 21 OD pairs. In the original environment the traffic volume of the lower-right area of the 9-grid is larger, and according to the OD pairs the trajectory end points of most vehicles lie at the lower-right intersection, so the road network environment of the whole map is not smooth. Fig. 8 shows the final plan given for each intersection after the final result of the model experiment: the lower-right intersection and its adjacent intersections are controlled in real time, the other intersections keep their original schemes, and the trajectories of vehicles heading in the prohibited directions are reconstructed in real time. Fig. 9 shows the comparative test results of different algorithms (TR-Light is the multi-signal-lamp reinforcement learning of the present invention).
Fig. 9 shows that in the given experimental environment, thanks to the neural network, the effect of multi-signal-lamp reinforcement learning is very obvious and the coordination among intersections can be handled quickly.
Example 4
In this embodiment, a map of the Mianyang gardening mountain area is simulated with the SUMO software. As shown in fig. 10, the simulation area is constructed in SUMO and the original timing of the traffic lights is set; for the traffic lights, several intersections with heavy traffic flow are selected. A comparative experiment is performed against the conventional Q-learning and DQN algorithms on the total waiting time of the vehicles in the road network, and real vehicle data are added: 51320 vehicles pass through the area during the evening peak from 17:00 to 19:00. With the original signal timing and without any reinforcement learning method, the total waiting time of the vehicles in this environment is 338798 seconds, and according to the historical trajectory data the traffic flow at the four intersections A, B, C and D is largest during the evening peak. The result of the final iterated scheme after using the reinforcement learning model is shown in fig. 11, and the overall optimization results of the different algorithms on this environment are shown in fig. 12.
According to the results in fig. 12, the difference in effect between the DQN algorithm and the TR-Light algorithm of the present invention is not particularly obvious in the first 50 iterations. After the number of iterations grows, the Critic network in the TR-Light model gradually acts on the TD-error and begins self-learning updates, gradually moving toward better behavior patterns and changing its corresponding strategy to keep it up to date. Because the Q-learning method does not contain a neural network, it cannot make predictions from the state and can only gradually select the best action each time, so its optimization effect is not obvious and its final result does not converge. The convergence speed of the TR-Light model accelerates as the iterations increase, and it is the first to reach convergence.
For each added target intersection, the traffic indexes collected for the single-intersection and multi-intersection cases are compared across the following optimization methods, as shown in Table 1:
crossing | Optimization method | Number of iterations | Road network open traffic rate | Time index |
Single road junction | Q-learning | 136 | 0.47674 | 1.98378 |
Single road junction | DQN | 94 | 0.59421 | 1.43793 |
Single road junction | Sarsa | 144 | 0.49327 | 1.82331 |
Single road junction | PG | 105 | 0.54732 | 1.31251 |
Single road junction | TR-Light | 62 | 0.68315 | 1.12741 |
Multi-port | Q-learning | 4300 | 0.34975 | 3.46581 |
Multi-port | DQN | 1600 | 0.43653 | 2.67138 |
Multi-port | Sarsa | 4700 | 0.31462 | 3.60431 |
Multi-port | PG | 3100 | 0.40587 | 2.96971 |
Multi-port | TR-Light | 1167 | 0.51312 | 1.94672 |
TABLE 1
In the multi-intersection environment, the TR-Light model is designed around the control of traffic lights, uses the Actor-Critic algorithm framework, and adopts centralized learning with distributed execution among the agents, combining the advantages of both, so the convergence speed of the algorithm is greatly improved. The comparison of the multi-intersection experimental data shows that, when handling the traffic environment, the states of the agents are changeable and diverse; the traditional Q-learning algorithm, having no neural network, cannot predict from the state and is therefore hard to converge. Although the DQN algorithm is assisted by a neural network, it does not implement an interaction method for multiple agents. The design of the TR-Light model improves the traffic state and lays a foundation for later applications of multi-agent reinforcement learning to traffic signal control.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all of these shall be covered by the claims of the present invention.
Claims (10)
1. A traffic organization scheme optimization method based on multi-signal lamp reinforcement learning is characterized by comprising the following steps:
s1: constructing an Actor network
The signal lamp of each intersection corresponds to an agent, a plurality of Actor networks corresponding to the agents are constructed, and the Actor networks comprise state space sets and behavior space sets;
s2: transmitted into the observed value
The method comprises the steps that a multi-agent observes traffic states of multiple intersections to obtain observed values, and the observed values are transmitted into a state space set in an Actor network, wherein the observed values comprise vehicle waiting time and lane occupancy of corresponding intersections;
s3: afferent behavior scheme
Setting a behavior scheme of a multi-agent, and transmitting the behavior scheme into a behavior space set in the Actor network;
s4: calculating behavioral deflection probability
In the Actor network, calculating a behavior deflection probability based on the observation values and a behavior scheme;
s5: selecting behavior and updating state
Each agent selects a behavior based on the behavior deflection probability and updates a state space set according to the selected behavior;
s6: critic web learning
Transmitting the behavior deflection probability, the initial state space set and the updated state space set in the Actor network into a criticic network for centralized learning training, reversely transmitting the learned information to the Actor network, and outputting the selected behavior scheme;
s7: trajectory reconstruction
And after the Actor network selects the behaviors, deleting the blocked road sections from the track of the vehicle, replanning the path, and outputting the replanned path.
2. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: a Subnet network is constructed behind the Actor network; the Subnet network compresses the high-dimensional state information transmitted by the Actor network into low-dimensional state information and then transmits it back to the Actor network to calculate the behavior deflection probability; the Subnet network is a convolutional network divided into several layers, each layer using a different filter, and the Subnet network shares parameters with the Actor network; the number of matrices transmitted into the Subnet network equals the number of agents.
3. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 2, characterized in that: the Subnet network is arranged between the Actor network and the Critic network, compresses the initial state space set and the updated state space set in each Actor network, and transmits the compressed state space sets and the behavior deflection probability together into the Critic network for centralized learning.
4. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: the exit road of the intersection is discretized into a certain number of road sections, each containing its corresponding vehicles, and the lane occupancy is obtained by comparing the vehicle length in each road section with the length of that road section; the vehicle waiting time is the waiting time of all vehicles on the current road.
5. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: the behavior scheme of step S3 is to set the left-turn signal lamp to red (prohibit left turn), and/or set the right-turn signal lamp to red (prohibit right turn), and/or set the straight-through signal lamp to red (prohibit going straight), and/or prohibit U-turns.
6. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: each Actor network is only responsible for respective agents, each agent has the same target and is homogeneous, and the training speed can be accelerated in a parameter sharing mode.
7. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: in step S6, the Critic network calculates the current value v and the next-state value v_ from the input behavior deflection probability, initial state space set and updated state space set; the TD-error after the selected behavior is then calculated as TD-error = r + V^Π(S′) - V^Π(S), where r is the feedback, S is the initially obtained state and S′ is the new state reached by the selected behavior; in terms of the computed values, TD-error = r + GAMMA·v_ - v, where GAMMA is the decay value.
8. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: in step S7, the route of the prohibited road segment is changed without changing the starting point and the ending point, so that the vehicle can run normally, and the deadlock of the road vehicle is avoided.
9. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: the reward obtained by the agent after trajectory reconstruction is formulated as follows: for each round, the trajectories of the vehicles are reconstructed and the total waiting time of vehicles in the whole road network is calculated, with reward = -(wt - selfwt), where selfwt is the total waiting time of vehicles on the road in the original environment and wt is the total waiting time of vehicles after trajectory reconstruction following each round of reinforcement learning; wt is initialized to 0, and the queuing time of vehicles running on the reconstructed trajectories, compared with the original queuing time, is used as the reward of the round.
10. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: the traffic condition is evaluated with two indexes, the road network smooth-traffic rate and the travel time index; the road network smooth-traffic rate is the ratio, within a time period T, of the mileage of road sections of the road network in a good traffic state to the mileage of all road sections in the road network, and is calculated as RNCR(T) = Σ_{i=1}^{n} k_i·l_i / Σ_{i=1}^{n} l_i, where RNCR(T) is the road network smooth-traffic rate in time period T (T may be 5 min or 3 min), n is the number of road sections in the network, l_i is the length of the i-th road section, and k_i is a binary function: k_i = 1 when the traffic state level of section i belongs to an acceptable traffic state, otherwise k_i = 0; RNCR(T) lies in the range [0, 1], and the larger the value, the better the road network state, otherwise the worse;
the travel time index (TTI) is the ratio of the actual travel time to the expected travel time, the larger the value the worse the traffic state, and it is calculated from T, the time interval taken, and meanTimeLoss, the average time lost over that period.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110911165.4A CN113628442B (en) | 2021-08-06 | 2021-08-06 | Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110911165.4A CN113628442B (en) | 2021-08-06 | 2021-08-06 | Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113628442A true CN113628442A (en) | 2021-11-09 |
CN113628442B CN113628442B (en) | 2022-10-14 |
Family
ID=78383803
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110911165.4A Active CN113628442B (en) | 2021-08-06 | 2021-08-06 | Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113628442B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114639255A (en) * | 2022-03-28 | 2022-06-17 | 浙江大华技术股份有限公司 | Traffic signal control method, device, equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110114806A (en) * | 2018-02-28 | 2019-08-09 | 华为技术有限公司 | Signalized control method, relevant device and system |
CN111582469A (en) * | 2020-03-23 | 2020-08-25 | 成都信息工程大学 | Multi-agent cooperation information processing method and system, storage medium and intelligent terminal |
AU2021101685A4 (en) * | 2021-04-01 | 2021-05-20 | Arun Singh Chouhan | Design and development of real time automated routing algorithm for computer networks |
WO2021105055A1 (en) * | 2019-11-25 | 2021-06-03 | Thales | Decision assistance device and method for managing aerial conflicts |
CN112949933A (en) * | 2021-03-23 | 2021-06-11 | 成都信息工程大学 | Traffic organization scheme optimization method based on multi-agent reinforcement learning |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110114806A (en) * | 2018-02-28 | 2019-08-09 | 华为技术有限公司 | Signalized control method, relevant device and system |
WO2021105055A1 (en) * | 2019-11-25 | 2021-06-03 | Thales | Decision assistance device and method for managing aerial conflicts |
CN111582469A (en) * | 2020-03-23 | 2020-08-25 | 成都信息工程大学 | Multi-agent cooperation information processing method and system, storage medium and intelligent terminal |
CN112949933A (en) * | 2021-03-23 | 2021-06-11 | 成都信息工程大学 | Traffic organization scheme optimization method based on multi-agent reinforcement learning |
AU2021101685A4 (en) * | 2021-04-01 | 2021-05-20 | Arun Singh Chouhan | Design and development of real time automated routing algorithm for computer networks |
Non-Patent Citations (2)
Title |
---|
YANG Wenchen et al.: "Review of the application of multi-agent reinforcement learning in urban traffic network signal control methods", Application Research of Computers *
ZHAO Yuhang et al.: "An efficient communication mechanism for multi-agent cooperative learning", Journal of Information Security Research *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114639255A (en) * | 2022-03-28 | 2022-06-17 | 浙江大华技术股份有限公司 | Traffic signal control method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN113628442B (en) | 2022-10-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110032782B (en) | City-level intelligent traffic signal control system and method | |
Wang et al. | Adaptive Traffic Signal Control for large-scale scenario with Cooperative Group-based Multi-agent reinforcement learning | |
CN111696370B (en) | Traffic light control method based on heuristic deep Q network | |
CN109215355A (en) | A kind of single-point intersection signal timing optimization method based on deeply study | |
Durand et al. | Ant colony optimization for air traffic conflict resolution | |
CN113223305B (en) | Multi-intersection traffic light control method and system based on reinforcement learning and storage medium | |
CN110345960B (en) | Route planning intelligent optimization method for avoiding traffic obstacles | |
Koh et al. | Reinforcement learning for vehicle route optimization in SUMO | |
Aragon-Gómez et al. | Traffic-signal control reinforcement learning approach for continuous-time Markov games | |
Tahifa et al. | Swarm reinforcement learning for traffic signal control based on cooperative multi-agent framework | |
CN113628442B (en) | Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning | |
CN112446538B (en) | Optimal path obtaining method based on personalized risk avoidance | |
CN117933673B (en) | Line patrol planning method and device and line patrol planning system | |
CN114995119A (en) | Urban traffic signal cooperative control method based on multi-agent deep reinforcement learning | |
CN113053122A (en) | WMGIRL algorithm-based regional flow distribution prediction method in variable traffic control scheme | |
Wang et al. | A large-scale traffic signal control algorithm based on multi-layer graph deep reinforcement learning | |
CN113299079B (en) | Regional intersection signal control method based on PPO and graph convolution neural network | |
CN114815801A (en) | Adaptive environment path planning method based on strategy-value network and MCTS | |
CN113156979A (en) | Forest guard patrol path planning method and device based on improved MADDPG algorithm | |
CN111507499B (en) | Method, device and system for constructing model for prediction and testing method | |
Ordouei | Comparative Analysis of Evolutionary Algorithms for Optimizing Vehicle Routing in Smart Cities | |
Zhang et al. | Coordinated control of distributed traffic signal based on multiagent cooperative game | |
CN115762128A (en) | Deep reinforcement learning traffic signal control method based on self-attention mechanism | |
Zhang et al. | Research on urban traffic active control in cooperative vehicle infrastructure | |
Alhassan et al. | Adjusting Street Plans Using Deep Reinforcement Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |