CN113628442A - Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning - Google Patents

Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning Download PDF

Info

Publication number
CN113628442A
Authority
CN
China
Prior art keywords
network
road
behavior
traffic
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110911165.4A
Other languages
Chinese (zh)
Other versions
CN113628442B (en)
Inventor
郑皎凌
吴昊昇
王茂帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202110911165.4A priority Critical patent/CN113628442B/en
Publication of CN113628442A publication Critical patent/CN113628442A/en
Application granted granted Critical
Publication of CN113628442B publication Critical patent/CN113628442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G1/00 - Traffic control systems for road vehicles
    • G08G1/01 - Detecting movement of traffic to be counted or controlled
    • G08G1/0104 - Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125 - Traffic data processing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/20 - Design optimisation, verification or simulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G1/00 - Traffic control systems for road vehicles
    • G08G1/07 - Controlling traffic signals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning, and belongs to the field of traffic signal lamp control. Firstly, an Actor network comprising a state space set and a behavior space set is constructed; then observed values are transmitted in, high-dimensional information is compressed into low-dimensional information by a Subnet network, and the behavior deflection probability is calculated; next, the initial state information, the updated state information and the behavior deflection probability are transmitted into a Critic network for centralized learning; finally, the vehicle trajectory is reconstructed. In a multi-intersection traffic environment, the multiple agents improve the road network smooth-traffic rate by means of the Actor-Critic algorithm framework. Meanwhile, a method of centralized learning and distributed execution among the agents is used, combining the advantages of both, so that the convergence rate of the algorithm is greatly improved.

Description

Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning
Technical Field
The invention relates to the field of traffic signal lamp control, in particular to a traffic organization scheme optimization method based on multi-signal lamp reinforcement learning.
Background
In the age of scientific and technological informatization, human life has become increasingly rich and most families own private automobiles, which has brought various urban traffic problems, such as excessive waiting time and excessive lane occupancy. With the development of artificial intelligence, many intelligent traffic technologies have emerged, and traffic behaviors have begun to be effectively controlled. Reinforcement learning is one of these artificial intelligence technologies and is currently the mainstream of intelligent traffic technology, including algorithms such as Q-learning, Sarsa and TD(λ).
How to enable an agent to learn efficiently in a traffic environment has been a challenge for reinforcement learning in recent years. The traditional method of training a reinforcement learning agent performs repeated training by continuously iterating the policy, but such training is only suitable for a single agent and not for multiple agents.
Considering the problem of intelligent traffic management in cities, when agents start to act according to policies, how to select and execute an excellent policy from among many policies has been a difficult research point in recent years.
Extensive research has been conducted on traffic light control at individual intersections, where the vehicle is set to reach a destination in a particular manner, and most attempts optimize both the travel time of the vehicle and the queue length at the intersection. Many reinforcement learning-based methods attempt to solve this problem by learning from data; for example, early experiments used Q-learning by building Q-tables. However, Q-learning is suited to discrete situations: deployed in the current traffic environment, even a single intersection faces thousands of distinct situations, the capacity of a Q-table is limited and cannot enumerate tens of thousands of situations, so Q-learning is not suitable for the traffic environment.
For multi-intersection traffic signal optimization, one approach performs joint modeling and executes behaviors collectively through centrally trained agents, but this method has two related problems:
a) as the number of agents grows, the computational effort of centralized training is too large;
b) during testing, each agent acts independently, and changes to an agent in the dynamic environment require coordination with the other nearby agents.
The other approach uses distributed reinforcement learning agents to control multiple intersections interactively, with each agent making decisions based on information from its adjacent intersections. Decentralized communication is more practical and scales better than centralized decision-making, but its model convergence is often very unstable and slow.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning, which uses multiple agents and an Actor-Critic network framework to optimize the traffic state in a multi-intersection environment.
The technical scheme of the invention is as follows:
a traffic organization scheme optimization method based on multi-signal lamp reinforcement learning comprises the following steps:
s1: constructing an Actor network
The traffic network comprises a plurality of intersections, a signal lamp of each intersection corresponds to an agent, a plurality of Actor networks corresponding to the agents are constructed, and the Actor networks comprise state space sets and behavior space sets;
s2: transmitted into the observed value
The method comprises the steps that a multi-agent observes traffic states of multiple intersections to obtain observed values, and then the observed values are transmitted into a state space set in an Actor network, wherein the observed values comprise vehicle waiting time and lane occupancy of corresponding intersections;
s3: afferent behavior scheme
Setting a behavior scheme of a plurality of intelligent agents, and transmitting the behavior scheme into a behavior space set in an Actor network;
s4: calculating behavioral deflection probability
In the Actor network, calculating behavior deflection probability based on the observation value and the behavior scheme;
s5: selecting behavior and updating state
Each agent selects a behavior based on the behavior deflection probability and updates the state space set according to the selected behavior;
s6: critic web learning
Transmitting the behavior deflection probability, the initial state space set and the updated state space set in the Actor network into the Critic network for centralized learning training, transmitting the learned information back to the Actor network, and outputting the selected behavior scheme;
s7: trajectory reconstruction
And after the Actor network selects the behaviors, deleting the blocked road sections from the track of the vehicle, replanning the path, and outputting the replanned path.
The Actor network is responsible for completing the interaction between actions and the surrounding environment according to a policy function, where s represents the current state of the agent, a represents the action selected by the agent, and the policy is represented approximately. The policy Π can be described as a function containing the parameter θ:
Π_θ = P(a|s, θ) ≈ Π(a|s)
The size of the state space depends on the number of agents, and the size of the behavior space depends on the number of actions of the agents. The Critic network uses approximations of the value functions, described by a parameter w. The state value function approximates the value of the agent being in state S:
v̂(s, w) ≈ v_Π(s)
The action value function q̂ receives a state s and an action a as inputs and, after calculation, gives an approximate action value:
q̂(s, a, w) ≈ q_Π(s, a)
The formula used to update the policy parameters in the AC algorithm is
θ = θ + α ∇_θ log Π_θ(s, a) · v
where α is the training step size, v is the state value, and θ is the policy function parameter.
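For illustration only, a minimal Python sketch of this policy update is given below; the softmax parameterization and the names theta, features and alpha are assumptions made here and are not part of the patented method.

```python
import numpy as np

def softmax_policy(theta, features):
    # Pi_theta(a|s): action probabilities from state features (one theta column per action)
    logits = features @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def actor_update(theta, features, action, v, alpha=0.01):
    # One step of theta = theta + alpha * grad_theta log Pi_theta(s, a) * v,
    # where v is the value signal supplied by the Critic (e.g. the TD-error).
    probs = softmax_policy(theta, features)
    grad_log = -np.outer(features, probs)   # d log Pi / d theta, all actions
    grad_log[:, action] += features         # extra term for the selected action
    return theta + alpha * v * grad_log
```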
Further, a Subnet network is constructed behind the Actor network; the Subnet network compresses the high-dimensional state information transmitted by the Actor network into low-dimensional state information and then transmits it back to the Actor network to calculate the behavior deflection probability. The number of matrices transmitted into the Subnet network equals the number of agents: the state information acquired by the agents is used as input, feature extraction is carried out through convolution, and the output features are converted through fully connected layers into a one-dimensional vector, which is the state information generated by the interaction of the target agents.
The Subnet network is a convolutional network divided into several layers, the filters adopted by each layer are different, and the Subnet network shares parameters with the Actor network.
Further, a Subnet network is arranged between the Actor network and the Critic network, and the Subnet network compresses the initial state space set and the updated state space set in each Actor network and transmits the compressed state space sets and the behavior deflection probability to the Critic network for centralized learning.
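As a non-authoritative sketch of such a Subnet (the patent itself gives no code), the following TensorFlow 1.x-style snippet compresses each agent's high-dimensional state row into a short vector through shared convolutional and fully connected layers; the layer sizes and the function name subnet are illustrative assumptions.

```python
import tensorflow as tf  # TensorFlow 1.x API, as mentioned later in the embodiments

def subnet(state_matrix, out_dim=16):
    # state_matrix: (n_agents, state_len); parameters are shared across all agents
    with tf.variable_scope("subnet", reuse=tf.AUTO_REUSE):
        x = tf.expand_dims(state_matrix, -1)                                      # (n_agents, state_len, 1)
        x = tf.layers.conv1d(x, filters=8, kernel_size=3, activation=tf.nn.relu)  # feature extraction
        x = tf.layers.conv1d(x, filters=4, kernel_size=3, activation=tf.nn.relu)
        x = tf.layers.flatten(x)                                                  # one feature row per agent
        return tf.layers.dense(x, out_dim)                                        # low-dimensional state vector
```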
Further, the exit road of the intersection is discretized into a certain number of road sections, each road section contains its corresponding vehicles, and the lane occupancy is obtained by comparing the total length of the vehicles in each road section with the length of that road section; the vehicle waiting time is the waiting time of all vehicles on the current road.
When the exit road is discretized into n road sections, each road section contains corresponding vehicles, and the ratio of the total vehicle length to the road section length is used as the lane occupancy and added to the observed value. Different intersections have different numbers of lanes; after the road is discretized, the state information of an intersection contains n+1 kinds of information as input (the extra one being the intersection vehicle waiting time).
The number of lanes contained in each road section also differs, and when the observed value is transmitted, the lane occupancy of every lane in each section needs to be aggregated before being transmitted into the state matrix. For the travel road of the vehicle, the state of the agent is set as S = {s1, s2, …, sn, sn+1}, where s1, s2, …, sn represent the lane occupancy rates of the road sections and sn+1 represents the total time of vehicle queuing. The occupancy rate is calculated as
rate = Σ VehLength_i / RoadLength_i
where VehLength_i represents the vehicle length and RoadLength_i represents the length of the road section.
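A small illustrative helper (names and structure assumed, not taken from the patent) that builds the state vector S = {s1, …, sn, sn+1} described above:

```python
def lane_occupancy(vehicle_lengths, segment_length):
    # Ratio of total vehicle length inside one discretized road section to its length
    return sum(vehicle_lengths) / segment_length

def build_state(segments, total_queue_time):
    # segments: list of (vehicle_lengths, segment_length) for the n discretized sections
    occupancies = [lane_occupancy(v, l) for v, l in segments]  # s1 .. sn
    return occupancies + [total_queue_time]                    # sn+1 = total queuing time
```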
Further, the behavior scheme of step S3 is to set the left-turn signal lamp to red (no left turn), and/or set the right-turn signal lamp to red (no right turn), and/or set the straight-ahead signal lamp to red (no straight ahead), and/or forbid U-turns. When traffic would otherwise flow into another road section that already carries a larger traffic flow, adjustments can be made through the no-left, no-right, no-straight and no-U-turn actions.
Further, each Actor network is responsible only for its own agent, rather than for all or several agents, which reduces the delay in the agent's behavior during the learning process. Each agent has the same target and is homogeneous, so training can be accelerated through parameter sharing. That the Actor network parameters of the agents are the same does not mean the agents take the same actions; each agent acts differently according to its different surrounding observation environment. In addition, since the policy allows additional information to be used to simplify training during execution, a simple Actor-Critic algorithm is provided for the agent cooperation process, so that the Critic network incorporates information from the other agents during learning; under parameter sharing, the TD gradient obtained by each agent follows
g = ∇_θ log Π(a_i|o_i) · (r + V_Π(S_{t+1}) − V_Π(S_t))
where Π is the policy of the agent, a and s are the behavior and state of the agent respectively, o represents an observed value, V_Π represents the state value function, and r is the reward value of the agent.
Further, in step S6, the Critic network calculates the current value v and the next-state value v_ according to the input behavior deflection probability, the initial state space set and the updated state space set; it then calculates the TD-error value after the selected behavior, TD = r + V_Π(S′) − V_Π(S), where r is the feedback, S is the initially obtained state, and S′ is the new state resulting from the selected behavior; finally, the error is computed as TD-error = r + GAMMA · v_ − v, where r is the feedback and GAMMA is the decay value.
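For clarity, the Critic computation in step S6 can be sketched as follows; this is a simplified illustration in which the linear value approximation and the names w and gamma are assumptions, not the patented implementation.

```python
import numpy as np

def critic_td_error(w, state, next_state, reward, gamma=0.9):
    # v = value of the current state, v_ = value of the next state (linear approximation)
    v, v_ = state @ w, next_state @ w
    td_error = reward + gamma * v_ - v      # TD-error = r + GAMMA * v_ - v
    return td_error, v, v_

def critic_update(w, state, td_error, lr=0.01):
    # Move the value estimate of the current state toward the TD target
    return w + lr * td_error * state
```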
Further, in step S7, the path is re-planned around the prohibited road segment without changing the starting point and the ending point, so that the vehicle can continue to run normally and deadlock of road vehicles is avoided.
Furthermore, the reward acquired by the agent is formulated after the trajectory is reconstructed. For each round, the trajectories of the vehicles are reconstructed and the total waiting time of the vehicles in the whole road network is calculated, with reward = −(wt − selfwt), where selfwt is the total waiting time of the vehicles on the road in the original environment, and wt is the total waiting time of the vehicles after trajectory reconstruction following each round of reinforcement learning; wt is initialized to 0, and the queuing time of vehicles running on the reconstructed trajectories is compared with the original queuing time as the reward of the round.
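The round reward described above reduces to a one-line comparison; a hypothetical helper (names assumed) is shown for concreteness:

```python
def round_reward(wt, selfwt):
    # selfwt: total vehicle waiting time in the original environment
    # wt:     total waiting time after this round's trajectory reconstruction
    return -(wt - selfwt)   # positive when the reconstructed trajectories reduce waiting time
```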
Further, the traffic condition is evaluated using two indexes: the road network smooth-traffic rate and the travel time index. The road network smooth-traffic rate is the ratio, within a certain time period T, of the mileage of road sections of the road network whose traffic state is good to the mileage of all road sections in the road network, and describes the overall smoothness of the road network. It is calculated as
RNCR(T) = Σ_{i=1..n} k_i · l_i / Σ_{i=1..n} l_i
where RNCR(T) is the road network smooth-traffic rate in the time period T (T can be, for example, 5 min or 3 min), n is the number of road sections in the road network, l_i is the length of the i-th road section, and k_i is a binary function: k_i = 1 when the traffic state level of section i belongs to the acceptable traffic states, otherwise k_i = 0. RNCR(T) lies in the range [0, 1]; the larger the value, the better the road network state, and conversely, the worse the road network state.
The travel time index is the ratio of the actual travel time to the expected travel time; the larger the value, the worse the traffic state. It is calculated as
TTI = T / (T − meanTimeLoss)
where TTI is the travel time index, T is the time interval actually taken, and meanTimeLoss is the average time lost over the period.
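The two evaluation indexes can be computed directly from the definitions above; the sketch below assumes per-section lengths and binary acceptability flags as inputs (illustrative names only).

```python
def rncr(section_lengths, acceptable_flags):
    # Road network smooth-traffic rate: mileage in an acceptable state / total mileage, in [0, 1]
    good = sum(l for l, k in zip(section_lengths, acceptable_flags) if k)
    return good / sum(section_lengths)

def travel_time_index(time_taken, mean_time_loss):
    # Ratio of actual travel time to expected (loss-free) travel time; larger means worse traffic
    return time_taken / (time_taken - mean_time_loss)
```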
The traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning can also be called Traffic Redirected Light, TR-Light for short; it optimizes the traffic organization scheme through the two major reinforcement learning elements of reward and state, together with the ideas of centralized learning with distributed execution among multiple agents and trajectory reconstruction.
The invention has the advantages that:
the reinforcement learning of the multi-agent is used, the multi-agent carries out interactive learning with other agents according to the current state of the multi-agent and the self current learning, the multi-agent is coordinated and organized by self learning, and the multi-agent effectively cooperates with other agents to finish the self learning and change the self state in the process to finish the final high-efficiency target.
The control of traffic lights is used as the action selection, and the waiting time of vehicles at intersections and the lane occupancy are used as the observed values of the environment. The agents at multiple intersections are trained through the centralized-learning, distributed-execution method, which effectively improves the road network smooth-traffic rate; meanwhile, vehicles running on controlled road sections have their trajectories reconstructed, and combining the reconstructed trajectories with reinforcement learning achieves the best effect.
In the AC algorithm process, the multiple agents carry out centralized interactive learning: the agents uniformly feed back their executed behavior patterns to the Critic network, and the same Critic network back-propagates to the Actor networks of the other agents. This learning mode makes the agents converge more stably and quickly.
The Subnet network is composed of a neural network. After high-dimensional data is transmitted in, the data is compressed and reduced in dimension through the neural network to generate new information, which is transmitted by the Subnet network into the Critic network for learning; this can improve the learning efficiency of the Critic.
Drawings
FIG. 1 is a schematic diagram of the difference between the multi-agent and single-agent environments in example 1;
fig. 2 is a schematic diagram of traffic light behavior scheme setting in embodiment 1;
FIG. 3 is a schematic view of the road discretization in example 1;
FIG. 4 is a schematic diagram of the multi-intersection agent operation in example 1;
fig. 5 is a schematic diagram of a Subnet network in embodiment 1;
FIG. 6 is a SUMO simulation diagram in example 2;
FIG. 7 is a comparison graph of experiments of different algorithms in example 2;
FIG. 8 is a SUMO simulation diagram in example 3;
FIG. 9 is a comparison graph of experiments of different algorithms in example 3;
FIG. 10 is a SUMO simulation map of the Mianyang gardening mountain area in example 4;
FIG. 11 is a diagram showing the results of an iterative final scheme after the reinforcement learning method is used in example 4;
FIG. 12 is a comparison of the optimization results of different algorithms in example 4.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
The examples are given for the purpose of better illustration of the invention, but the invention is not limited to the examples. Therefore, those skilled in the art should make insubstantial modifications and adaptations to the embodiments of the present invention in light of the above teachings and remain within the scope of the invention.
Example 1
The embodiment is a multi-intersection traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning, which improves the road network smooth-traffic rate using multiple agents, an Actor-Critic network, a Subnet network and trajectory reconstruction. A multi-agent environment is an environment with multiple intelligent entities acting in each step; fig. 1 shows the difference between a multi-agent environment and a single-agent environment.
The method comprises the steps of firstly, constructing an Actor network, wherein a traffic network comprises a plurality of intersections, a signal lamp of each intersection corresponds to one intelligent agent, the plurality of intelligent agents need to construct a plurality of corresponding Actor networks, and each Actor network comprises a state space set and a behavior space set.
The state of the road is changed through the Program of the traffic signal lamp, and in a certain sense traffic control is carried out by temporarily closing parts of the road. In this embodiment, the behaviors are set as four schemes: setting the left-turn signal lamp to red (no left turn), setting the right-turn signal lamp to red (no right turn), setting the straight-ahead signal lamp to red (no straight ahead), and forbidding U-turns. When traffic would otherwise flow into another road section with larger traffic flow, adjustments can be made through the no-left, no-right, no-straight and no-U-turn actions. In the action setting, an action space is set first, and the configured action schemes are transmitted into the action space in turn. As shown in fig. 2, since the actions are designed using signal-lamp Programs, the action set is A = {program1, program2, …, program16}, where 16 is the total number of selectable actions. Substituting the cases of fig. 2 into the action set, program1 represents the action that prohibits left turns, so the action set A is {no left, no right, no straight, no U-turn, …, program16}; because the 4 restrictions in fig. 2 can each be switched on or off, there are 2^4 = 16 possible action schemes in total, and these are configured accordingly (a sketch of this enumeration follows).
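As a sketch under the assumptions above (the restriction names and the mapping to SUMO program IDs are hypothetical), the 16 behavior schemes can be enumerated as every on/off combination of the four restrictions, and a chosen scheme applied through the Traci interface:

```python
from itertools import product
import traci  # SUMO's traffic control interface, used in the embodiments

RESTRICTIONS = ("no_left", "no_right", "no_straight", "no_uturn")

# 2**4 = 16 combinations, matching the action set A = {program1, ..., program16}
ACTION_SPACE = [dict(zip(RESTRICTIONS, flags)) for flags in product((False, True), repeat=4)]

def apply_action(tls_id, action_index):
    # Assumes a signal program named "program<i>" has been defined for each combination
    traci.trafficlight.setProgram(tls_id, "program%d" % (action_index + 1))
```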
The agent obtains the real-time traffic state by observing the environment of the intersection and transmits it into the state space set configured in the Actor network for subsequent execution. The current road state is represented by the vehicle waiting time and the lane occupancy at the intersection. As shown in fig. 3, the exit road is discretized into 10 sections, each section contains its corresponding vehicles, and the ratio of the total vehicle length in a section to the section length is added to the observed value as the lane occupancy. Different intersections have different numbers of lanes; after the road is discretized, the state contains 11 kinds of information as input (including 1 kind for the intersection vehicle waiting time). When the observed state is transmitted, the occupancy rates of all lanes in each section need to be aggregated and transmitted into the state matrix. For the travel road of the vehicle, the state of the agent is set as S = {s1, s2, …, s10, s11}, where s1, …, s10 represent the lane occupancy rates, s11 represents the total vehicle queuing time, and the occupancy is calculated as rate = Σ VehLength_i / RoadLength_i as defined above.
as shown in fig. 4, the Actor-Critic network is combined with the Subnet network to realize the cooperation of the intelligent agents at the multiple intersections. In the Actor network, behavior deflection probability is calculated based on the observation value and the behavior scheme, each agent selects behaviors based on the behavior deflection probability, and a state space set is updated according to the selected behaviors. Each Actor network is only responsible for respective agents, each agent has the same target and is homogeneous, the training speed can be accelerated in a parameter sharing mode, and each agent takes different actions according to different observation environments around the agent.
A Subnet network is constructed behind the Actor network; the Subnet network compresses the high-dimensional state information transmitted by the Actor network into low-dimensional state information and then transmits it back to the Actor network to calculate the behavior deflection probability. As shown in fig. 5, the Subnet network is a convolutional network with several layers, the filter used in each layer is different, and the Subnet network shares parameters with the Actor network. The number of matrices transmitted into the Subnet network equals the number of agents: the state information acquired by the agents is used as input, feature extraction is carried out through convolution, and the output features are converted through fully connected layers into a one-dimensional vector, which is the state information generated by the interaction of the target agents.
The initial state space set and the updated state space set in the Actor network are transmitted into the Subnet network for compression and then transmitted, together with the behavior deflection probability, into the Critic network for centralized learning. The Critic network calculates the current value v and the next-state value v_ from the input behavior deflection probability, the initial state space set and the updated state space set; it then calculates the TD-error value after the selected action, TD = r + V_Π(S′) − V_Π(S), where r is the feedback, S is the initially obtained state, and S′ is the new state resulting from the selected behavior; finally, the error is computed as TD-error = r + GAMMA · v_ − v, where r is the feedback and GAMMA is the decay value, and the learned information is transmitted back to the Actor network.
After the Actor network selects the behavior, the blocked road section is deleted from the trajectory of the vehicle and the path is re-planned; the path is changed around the prohibited road section without changing the starting point and the ending point, so that the vehicle can run normally and deadlock of road vehicles is avoided.
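A rough Traci-based sketch of this trajectory reconstruction step is shown below; the helper name and the rerouting strategy are assumptions, and only the general idea of removing the prohibited edge while keeping the original start and end points is taken from the text.

```python
import traci

def reconstruct_trajectory(veh_id, banned_edge):
    # Re-plan the remaining trip so it no longer uses the prohibited road section,
    # keeping the original destination (the trip's start and end points are unchanged).
    route = traci.vehicle.getRoute(veh_id)
    if banned_edge not in route:
        return
    current = traci.vehicle.getRoadID(veh_id)
    destination = route[-1]
    new_route = traci.simulation.findRoute(current, destination).edges
    if banned_edge not in new_route:
        traci.vehicle.setRoute(veh_id, new_route)
```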
The reward acquired by the agent is established after the trajectory is reconstructed; for each round, the trajectories of the vehicles are reconstructed and the total waiting time of the vehicles in the whole road network is calculated, with reward = −(wt − selfwt), where selfwt is the total waiting time of the vehicles on the road in the original environment, and wt is the total waiting time of the vehicles after trajectory reconstruction following each round of reinforcement learning; wt is initialized to 0, and the queuing time of vehicles running on the reconstructed trajectories is compared with the original queuing time as the reward of the round.
Example 2
The embodiment is a traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning for a single intersection. The simulation platform adopted in this embodiment is SUMO, an open-source road simulator that can collect the relevant data required in the simulation experiments, simulate traffic behaviors and build the required road networks; most importantly, it can also collect the timing data of traffic signal lamps. PyCharm is used as the IDE for developing the code, and TensorFlow-gpu 1.4.0 and NumPy are used to build the reinforcement learning components and the neural networks. Secondly, the most important element is the SUMO traffic control interface Traci: Traci helps to control traffic signals dynamically, can call the SUMO simulation tool, obtain information about individual vehicles, and acquire detailed data and real-time road conditions for each road (a minimal control loop is sketched below).
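To make the setup concrete, a minimal Traci control loop of the kind used with SUMO is sketched here; the configuration file name, the step count and the specific observation calls are assumptions for illustration only.

```python
import traci

traci.start(["sumo", "-c", "network.sumocfg"])   # "network.sumocfg" is a placeholder name
for step in range(3600):
    traci.simulationStep()
    # Example observations used as state inputs: waiting time and lane occupancy
    wait = sum(traci.edge.getWaitingTime(e) for e in traci.edge.getIDList())
    occ = {lane: traci.lane.getLastStepOccupancy(lane) for lane in traci.lane.getIDList()}
traci.close()
```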
In the first experiment, a dual-traffic-light coordination mode is adopted in a two-intersection environment. During the experiment, 3000 vehicles are deployed into the simulation system, the number of vehicles initially present is set to 30, and the random seed parameter is set to 4. In this environment, traffic has 13 OD pairs, and most trajectories in the OD pairs run from north to south. Without any reinforcement learning method, the total waiting time of the vehicles in the model is 11567 seconds, and in fig. 6 the traffic flowing from north to south through the middle road section between the two intersections is congested. After the relevant reinforcement learning method is added and the model is continuously trained, the most reasonable result among the given optimization schemes is that the north-to-south road sections at the two intersections are prohibited in real time, as shown in fig. 6; meanwhile, the vehicles on the prohibited road sections have their trajectories reconstructed, and vehicles in that direction reach their destinations through other road sections. The comparison experiment of the model and the scheme implemented at each intersection when the algorithm reaches the optimal result are shown in fig. 7 (TR-Light is the multi-signal-lamp reinforcement learning of the present invention).
Example 3
This embodiment uses a 9-intersection grid environment in which each rectangle represents a signalized intersection, and every two adjacent intersections are connected by two lanes.
In the setup of this embodiment, the following parameters need to be configured in the SUMO simulation software: 7000 vehicles enter the simulation system in the 9-grid environment, the model sets the initial number of vehicles to 50, the shortest vehicle travel path is 2, the longest vehicle travel path is 7, and the random seed parameter is 10.
After the experimental model is built, the action mode of each agent is constructed. The total waiting time of the vehicles in the environment under the original condition is 24732 seconds, and there are 21 OD pairs in the experimental traffic environment. The traffic volume in the lower-right area of the 9-grid in the original environment is larger, and querying the OD pairs shows that the trajectory end points of most vehicles are at the lower-right intersection, so the road network environment of the whole map is not smooth. Fig. 8 shows the final plan given for each intersection after the final result of the model experiment: the lower-right intersection and its adjacent intersections are controlled in real time, the other intersections keep their original schemes, and the real-time trajectories of vehicles heading in the prohibited directions are reconstructed. Fig. 9 shows the comparison test results of the different algorithms (TR-Light is the multi-signal-lamp reinforcement learning of the present invention).
Fig. 9 shows that in the given experimental environment, thanks to the neural network, the multi-signal-lamp reinforcement learning effect is very obvious and coordination among intersections can be handled quickly.
Example 4
In this embodiment, a map of the Mianyang gardening mountain area is simulated with SUMO software. As shown in fig. 10, the simulation area is constructed in SUMO and the original timing of the traffic lights is set; in the traffic light setting, several intersections with a large traffic flow are selected and equipped with traffic lights. In this embodiment, a comparative experiment on the total waiting time of vehicles in the road network is carried out against the conventional Q-learning and DQN algorithms, and real vehicle data is added to the environment: there are 51320 vehicles in the area between 17:00 and 19:00 of the evening peak. With the original traffic signal timing and without any reinforcement learning method, the total waiting time of vehicles in the environment is 338798 seconds, and in the historical trajectory data the traffic flow at the four intersections A, B, C and D is largest during the evening peak. The result of the final iterated scheme after using the reinforcement learning model is shown in fig. 11, and the overall optimization results of the different algorithms in this environment are shown in fig. 12.
According to the results in fig. 12, the difference in effect between the DQN algorithm and the TR-Light algorithm of the present invention is not particularly obvious in the first 50 iterations. After the number of iterations becomes large, the Critic network in the TR-Light model gradually starts to act on the TD-error and update itself, gradually moves toward a better behavior pattern, and changes its corresponding policy to keep the policy up to date. Because the Q-learning method does not include a neural network, it cannot make predictions from the state and can only gradually select the best option each time, so its optimization effect is not obvious and its final result does not converge. The convergence speed of the TR-Light model accelerates as the iterations increase, and it is the first to reach convergence.
For each added target intersection, the following data, collected for the traffic indexes in single-intersection and multi-intersection settings, compare the different optimization methods, as shown in Table 1 below:
Intersection | Optimization method | Number of iterations | Road network smooth-traffic rate | Travel time index
Single intersection | Q-learning | 136 | 0.47674 | 1.98378
Single intersection | DQN | 94 | 0.59421 | 1.43793
Single intersection | Sarsa | 144 | 0.49327 | 1.82331
Single intersection | PG | 105 | 0.54732 | 1.31251
Single intersection | TR-Light | 62 | 0.68315 | 1.12741
Multi-intersection | Q-learning | 4300 | 0.34975 | 3.46581
Multi-intersection | DQN | 1600 | 0.43653 | 2.67138
Multi-intersection | Sarsa | 4700 | 0.31462 | 3.60431
Multi-intersection | PG | 3100 | 0.40587 | 2.96971
Multi-intersection | TR-Light | 1167 | 0.51312 | 1.94672

TABLE 1
In the multi-intersection environment, the TR-Light model is designed by controlling traffic lights, using the Actor-Critic algorithm framework together with the method of centralized learning and distributed execution among agents; combining the advantages of centralized and distributed learning greatly improves the convergence speed of the algorithm. The comparison of the multi-intersection experimental data shows that, when handling the traffic environment, the states of the agents are changeable and diverse, and the traditional Q-Learning algorithm, having no neural network and therefore being unable to predict from states, is difficult to converge. Although the DQN algorithm has the assistance of a neural network, it does not implement an interaction method for multiple agents. The design of the TR-Light model improves the traffic state and lays a foundation for the later application of multi-agent reinforcement learning to traffic signal control.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications and substitutions should be covered by the claims of the present invention.

Claims (10)

1. A traffic organization scheme optimization method based on multi-signal lamp reinforcement learning is characterized by comprising the following steps:
s1: constructing an Actor network
The signal lamp of each intersection corresponds to an agent, a plurality of Actor networks corresponding to the agents are constructed, and the Actor networks comprise state space sets and behavior space sets;
s2: transmitted into the observed value
The method comprises the steps that a multi-agent observes traffic states of multiple intersections to obtain observed values, and the observed values are transmitted into a state space set in an Actor network, wherein the observed values comprise vehicle waiting time and lane occupancy of corresponding intersections;
s3: afferent behavior scheme
Setting a behavior scheme of a multi-agent, and transmitting the behavior scheme into a behavior space set in the Actor network;
s4: calculating behavioral deflection probability
In the Actor network, calculating a behavior deflection probability based on the observation values and a behavior scheme;
s5: selecting behavior and updating state
Each agent selects a behavior based on the behavior deflection probability and updates a state space set according to the selected behavior;
s6: critic web learning
Transmitting the behavior deflection probability, the initial state space set and the updated state space set in the Actor network into a Critic network for centralized learning training, transmitting the learned information back to the Actor network, and outputting the selected behavior scheme;
s7: trajectory reconstruction
And after the Actor network selects the behaviors, deleting the blocked road sections from the track of the vehicle, replanning the path, and outputting the replanned path.
2. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: a Subnet network is constructed behind the Actor network; the Subnet network compresses the high-dimensional state information transmitted by the Actor network into low-dimensional state information and then transmits it back to the Actor network to calculate the behavior deflection probability; the Subnet network is a convolutional network divided into several layers, the filters adopted by each layer are different, and the Subnet network shares parameters with the Actor network; the number of matrices transmitted into the Subnet network equals the number of agents.
3. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 2, characterized in that: the Subnet network is arranged between the Actor network and the Critic network, compresses the initial state space set and the updated state space set in each Actor network, and transmits the compressed state space sets and the behavior deflection probability together into the Critic network for centralized learning.
4. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: the exit road of the intersection is discretized into a certain number of road sections, each road section contains its corresponding vehicles, and the lane occupancy is obtained by comparing the total length of the vehicles in each road section with the length of that road section; the vehicle waiting time is the waiting time of all vehicles on the current road.
5. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: the behavior scheme of step S3 is to set the left-turn signal lamp to red (no left turn), and/or set the right-turn signal lamp to red (no right turn), and/or set the straight-ahead signal lamp to red (no straight ahead), and/or forbid U-turns.
6. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: each Actor network is only responsible for respective agents, each agent has the same target and is homogeneous, and the training speed can be accelerated in a parameter sharing mode.
7. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: in step S6, the Critic network calculates the current value v and the next-state value v_ according to the inputted behavior deflection probability, the initial state space set and the updated state space set; it then calculates the TD-error value after the selected behavior, TD = r + V_Π(S′) − V_Π(S), where r is feedback, S is the initially obtained state, and S′ is the new state based on the selected behavior; finally, the error is computed as TD-error = r + GAMMA · v_ − v, where r is feedback and GAMMA is the decay value.
8. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: in step S7, the route of the prohibited road segment is changed without changing the starting point and the ending point, so that the vehicle can run normally, and the deadlock of the road vehicle is avoided.
9. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: the reward acquired by the agent is formulated after the trajectory is reconstructed; for each round, the trajectories of the vehicles are reconstructed and the total waiting time of the vehicles in the whole road network is calculated, with reward = −(wt − selfwt), where selfwt is the total waiting time of the vehicles on the road in the original environment, and wt is the total waiting time of the vehicles after trajectory reconstruction following each round of reinforcement learning; wt is initialized to 0, and the queuing time of vehicles running on the reconstructed trajectories is compared with the original queuing time as the reward of the round.
10. The traffic organization scheme optimization method based on multi-signal lamp reinforcement learning of claim 1, characterized in that: the traffic condition is evaluated using two indexes, the road network smooth-traffic rate and the travel time index, wherein the road network smooth-traffic rate is the ratio, within a certain time period T, of the mileage of road sections of the road network whose traffic state is good to the mileage of all road sections in the road network, calculated as
RNCR(T) = Σ_{i=1..n} k_i · l_i / Σ_{i=1..n} l_i
where RNCR(T) is the road network smooth-traffic rate in the time period T (T can be, for example, 5 min or 3 min), n is the number of road sections in the road network, l_i is the length of the i-th road section, and k_i is a binary function: k_i = 1 when the traffic state level of section i belongs to the acceptable traffic states, otherwise k_i = 0; RNCR(T) lies in the range [0, 1], and the larger the value, the better the road network state, and conversely, the worse the road network state;
the travel time index is the ratio of the actual travel time to the expected travel time, and the larger the value, the worse the traffic state; it is calculated as
TTI = T / (T − meanTimeLoss)
where TTI is the travel time index, T is the time interval actually taken, and meanTimeLoss is the average time lost over the period.
CN202110911165.4A 2021-08-06 2021-08-06 Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning Active CN113628442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110911165.4A CN113628442B (en) 2021-08-06 2021-08-06 Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110911165.4A CN113628442B (en) 2021-08-06 2021-08-06 Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning

Publications (2)

Publication Number Publication Date
CN113628442A true CN113628442A (en) 2021-11-09
CN113628442B CN113628442B (en) 2022-10-14

Family

ID=78383803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110911165.4A Active CN113628442B (en) 2021-08-06 2021-08-06 Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning

Country Status (1)

Country Link
CN (1) CN113628442B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114639255A (en) * 2022-03-28 2022-06-17 浙江大华技术股份有限公司 Traffic signal control method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110114806A (en) * 2018-02-28 2019-08-09 华为技术有限公司 Signalized control method, relevant device and system
CN111582469A (en) * 2020-03-23 2020-08-25 成都信息工程大学 Multi-agent cooperation information processing method and system, storage medium and intelligent terminal
AU2021101685A4 (en) * 2021-04-01 2021-05-20 Arun Singh Chouhan Design and development of real time automated routing algorithm for computer networks
WO2021105055A1 (en) * 2019-11-25 2021-06-03 Thales Decision assistance device and method for managing aerial conflicts
CN112949933A (en) * 2021-03-23 2021-06-11 成都信息工程大学 Traffic organization scheme optimization method based on multi-agent reinforcement learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110114806A (en) * 2018-02-28 2019-08-09 华为技术有限公司 Signalized control method, relevant device and system
WO2021105055A1 (en) * 2019-11-25 2021-06-03 Thales Decision assistance device and method for managing aerial conflicts
CN111582469A (en) * 2020-03-23 2020-08-25 成都信息工程大学 Multi-agent cooperation information processing method and system, storage medium and intelligent terminal
CN112949933A (en) * 2021-03-23 2021-06-11 成都信息工程大学 Traffic organization scheme optimization method based on multi-agent reinforcement learning
AU2021101685A4 (en) * 2021-04-01 2021-05-20 Arun Singh Chouhan Design and development of real time automated routing algorithm for computer networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨文臣 et al.: "A review of the application of multi-agent reinforcement learning to urban traffic network signal control methods", Application Research of Computers *
赵宇航 et al.: "An efficient communication mechanism for multi-agent cooperative learning", Journal of Information Security Research *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114639255A (en) * 2022-03-28 2022-06-17 浙江大华技术股份有限公司 Traffic signal control method, device, equipment and medium

Also Published As

Publication number Publication date
CN113628442B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN110032782B (en) City-level intelligent traffic signal control system and method
Wang et al. Adaptive Traffic Signal Control for large-scale scenario with Cooperative Group-based Multi-agent reinforcement learning
CN109215355A (en) A kind of single-point intersection signal timing optimization method based on deeply study
Durand et al. Ant colony optimization for air traffic conflict resolution
CN111785045A (en) Distributed traffic signal lamp combined control method based on actor-critic algorithm
CN111696370A (en) Traffic light control method based on heuristic deep Q network
CN112614343A (en) Traffic signal control method and system based on random strategy gradient and electronic equipment
CN110345960B (en) Route planning intelligent optimization method for avoiding traffic obstacles
CN103593535A (en) Urban traffic complex self-adaptive network parallel simulation system and method based on multi-scale integration
Koh et al. Reinforcement learning for vehicle route optimization in SUMO
CN113223305A (en) Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
Tahifa et al. Swarm reinforcement learning for traffic signal control based on cooperative multi-agent framework
Aragon-Gómez et al. Traffic-signal control reinforcement learning approach for continuous-time Markov games
CN113628442B (en) Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning
CN113053122A (en) WMGIRL algorithm-based regional flow distribution prediction method in variable traffic control scheme
CN113299079B (en) Regional intersection signal control method based on PPO and graph convolution neural network
CN114815801A (en) Adaptive environment path planning method based on strategy-value network and MCTS
CN112446538B (en) Optimal path obtaining method based on personalized risk avoidance
Chentoufi et al. A hybrid particle swarm optimization and tabu search algorithm for adaptive traffic signal timing optimization
CN111507499B (en) Method, device and system for constructing model for prediction and testing method
CN113628435A (en) Information processing method and device
Wang et al. Joint traffic signal and connected vehicle control in IoV via deep reinforcement learning
Zhang et al. Coordinated control of distributed traffic signal based on multiagent cooperative game
CN115762128A (en) Deep reinforcement learning traffic signal control method based on self-attention mechanism
Wang et al. A large-scale traffic signal control algorithm based on multi-layer graph deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant