CN109559530A

CN109559530A - A kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning

Info

Publication number: CN109559530A
Application number: CN201910011893.2A
Authority: CN
Inventors: 葛宏伟; 宋玉美
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2019-01-07
Filing date: 2019-01-07
Publication date: 2019-04-02
Anticipated expiration: 2039-01-07
Also published as: CN109559530B

Abstract

The present invention provides a kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning, belongs to the crossing domain of machine learning and intelligent transportation.The multi-intersection transportation network in one region is modeled as Agent system first by this method, each Agent considers the influence of the adjacent Agent movement at nearest moment simultaneously during learning strategy, and multiple Agent is enabled synergistically to carry out the Signalized control of multi-intersection.Each Agent controls an intersection by a depth Q network self-adapting, and network inputs are the discrete traffic behavior coding of the respectively preprocessed original state information at corresponding crossing.In the loss function that the optimal movement Q value of adjacent Agent of nearest moment is moved to network in its learning process.This method is able to ascend the magnitude of traffic flow of Regional Road Network, improves the utilization rate of road, reduces the queue length of vehicle, alleviates traffic congestion.This method infinitely makes each intersection mouth structure.

Description

A kind of multi-intersection signal lamp Collaborative Control based on Q value Transfer Depth intensified learning Method

Technical field

The invention belongs to the crossing domains of machine learning and intelligent transportation, are related to a kind of based on Q value Transfer Depth extensive chemical The multi-intersection signal lamp cooperative control method of habit.

Background technique

Traffic jam issue has become the urgent challenge that urban transportation faces, however existing basic road equipment is due to sky Between, the limitation of environment and economic aspect is difficult to expand.Therefore, the optimal control of traffic lights is solve the problems, such as this effective One of approach.By the self adaptive control of signal lamp, the traffic of area road network can be optimized, reduce congestion and carbon dioxide Discharge.

Currently, different machine learning methods has been used for the research of city traffic signal lamp control, main includes fuzzy Logic, evolution algorithm and Dynamic Programming.Control based on fuzzy logic usually establishes one group of rule according to expertise, further according to Traffic behavior selects approximate signal lamp phase.However, since the formulation of rule is too dependent on expertise, it is big for possessing The multi-intersection of phase is measured, it is more difficult to obtain a set of effective rule.The evolution algorithms such as genetic algorithm and ant group algorithm, due to Its lower search efficiency, when being applied to large-scale traffic collaboration optimal control, it is difficult to meet traffic lights decision Requirement of real-time.Dynamic Programming is difficult to set up effective traffic environment model, it is difficult to solve to calculate cost and calculate environment transfer The problem of probability.

Traffic signal control is actually a Sequence Decision problem, and the frames of many research and utilization intensified learnings is sought Seek optimal control policy.Intensified learning learns to be made of Agent by the uncertain award of perception ambient condition and therefrom acquisition Dynamical system optimum behavior strategy.Study is considered as the process of a trial and error by this method, if some behavior plan of Agent The award (enhanced signal) for slightly causing environment positive, then generating the trend of this behavioral strategy after Agent will reinforce. The target of Agent is to find optimal policy in each discrete state so that desired accumulation award is maximum.

Intensified learning method has extensive application in terms of single intersection and region Multiple Intersections Signalized control.For Multiple Intersections Signalized control, mainly have centerized fusion and distributed AC servo system two ways.Centralized control utilizes intensified learning training one A individual Agent controls entire road network, in each time step Agent to each Intersections phase of road network Carry out decision.However, centralized control due to state space and motion space can with the linearly increasing and exponential increase of intersection, Lead to the dimension disaster of state space and motion space；Multiple Intersections Signalized control problem is modeled as more by distributed AC servo system Agent system, wherein each Agent is responsible for controlling the signal lamp of a single intersection.The local environment that Agent passes through single crossing The mode for carrying out decision easily scales to multi-intersection.

Traditional intensified learning indicates state space by the crossing feature manually extracted.To avoid state space mistake Greatly, usually all simplifying state indicates, often has ignored some important informations.Passed through based on the Agent of intensified learning to ring around The state observation in border carries out decision, if losing important information, Agent is difficult to make the decision optimal to true environment.Example Such as, indicate that state space has ignored the position of the vehicle and vehicle that are moving, speed merely with vehicle queue length on road The information such as degree；And historical traffic data is only reflected merely with average traffic delay, have ignored real-time traffic demand.These are solved The excessive method of state space does not make full use of the effective status information of intersection, and the decision for causing Agent to be done is to be based on Partial information.

Mnih in the laboratory Deep Mind is proposed intensified learning and united depth Q network (the Deep Q of deep learning Network, DQN) (MnihV, KavukcuogluK, SilverD, et al.Human-level control after learning algorithm Through deep reinforcement learning [J] .Nature, 2015,518 (7540): 529-533.), Hen Duoxue Deeply learning art is applied to the Signalized control of single intersection and Multiple Intersections by person.Pass through convolutional neural networks (Convolutional Neural Network, CNN), stacking self-encoding encoder (Stacked Auto-Encoder, SAE) etc. are deep Degree learning model carries out the feature that automatically extracts of crossing status information, and Agent can be carried out fully using crossing status information Optimizing decision.Li et al. people uses each road vehicle queue length as crossing state, and using depth stack self-encoding encoder come Estimate optimal Q value (Li L, Yisheng L, Wang F Y.Traffic signal timing via deep reinforcement learning[J].ACTAAUTOMATICASINICA,2016,3(3):247-254.).Genders etc. People proposes the deeply study control single intersection signal lamp based on CNN, and state space is defined as to location matrix, the speed of vehicle The signal lamp phase for spending matrix and nearest moment, using the letter of the Q-learning algorithm training Single Intersection with experience replay Signal lamp controller.This method is due to the potentially relevant property between action value and target value, so that the stability of algorithm is poor (Genders W,Razavi S.Using a Deep Reinforcement Learning Agent for Traffic Signal Control[J].//arXiv preprint arXiv:1611.01142,2016.).In order to solve unstable ask Topic, Gao et al. improve method (Gao J, ShenY, Liu J, the et al.Adaptive of Genders using target network Traffic Signal Control:DeepReinforcement Learning Algorithm with Experience Replay and TargetNetwork.//arXiv preprint arXiv:1705.02755,2017.).Jeon et al. refers to The parameter in previous most of intensified learning researchs cannot completely represent the complexity of actual traffic state out, they directly use The video image of intersection indicates traffic behavior (Jeon H J, Lee J and SohnK.Artificial intelligence for traffic signal controlbased solely on video images.Journal of Intelligent TransportationSystems,2018,22(5):433-445).Recently, Van der Pol et al. The study of multi-Agent deeply is applied to Multiple Intersections signal lamp self adaptive control (the Vander Pol E and of rule for the first time Oliehoek F A,Coordinated deep reinforcement learnersfor traffic light control.//In NIPS’16Workshop on Learning,Inferenceand Control of Multi-Agent Systems,2016).Multi-Agent problem is divided into multiple lesser subproblems first, and (Agent of two Adjacent Intersections is One subproblem, also known as " source problem "), it is trained on source problem using DQN algorithm and obtains approximately combining Q function, into And the approximate joint Q function for obtaining training source problem is migrated to other subproblems, is finally found most using max-plus algorithm Excellent teamwork.However, max-plus algorithm to be applied to the cooperation Agent system indicated with cooperative figure, it cannot be guaranteed that receiving Optimal solution is held back, and migrates each source problem state space of Q function requirements and motion space size phase between different source problems Together, thus this method is applied with stronger limitation to the network structure of each intersection.

For the feature extraction of multi-intersection traffic behavior, difficult, Signalized control lacks effective collaborative strategy and collaboration Tactful excessively to rely on the problem of intersecting mouth structure, the invention proposes a kind of more intersections based on Q value Transfer Depth intensified learning Message signal lamp cooperative control method (Cooperative Deep Q-Learning with Q-value Transfer, QT- CDQN).Area road network modelling is Agent system by QT-CDQN, and each Agent passes through one DQN network-control one Intersection, the input of network are the discrete traffic behavior coding of the preprocessed original state information of vehicle.The corresponding Agent in each crossing is being instructed During white silk, influence of the optimal movement of adjacent intersection to this crossing is considered, by the Q of the optimal movement at adjacent Agent nearest moment Value moves in the loss function of network.This can balance the magnitude of traffic flow at each crossing to a certain extent, improve regional traffic The utilization rate of middle road reduces the queue length of vehicle, alleviates traffic congestion.This method has preferably transportation network and can expand Malleability, and each intersection mouth structure is infinitely made.

Summary of the invention

For traditional Signalized control method, there are lack between traffic behavior feature extraction difficulty, Multiple Intersections signal lamp Effective collaborative strategy and algorithm excessively rely on the problems such as intersecting mouth structure, and the present invention proposes a kind of association with the migration of Q value Multi-intersection signal lamp Collaborative Control is used for depth Q network (QT-CDQN).This method carries out the raw information of traffic behavior Automatic Feature Extraction, and fully consider the influence of Adjacent Intersections, Collaborative Control is carried out to multi-intersection signal lamp, is improved more The traffic efficiency of intersection alleviates the congestion of each intersection.

Technical solution of the present invention:

A kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning, includes the following steps:

Step 1: the transportation network in a region being modeled as Agent system, each intersection is controlled by an Agent System, each Agent include an experience pond M, an estimation network and a target network composition, and network is estimated in initialization respectively With the parameter θ of target network_iAnd θ_i', initialize each experience pond.

Step 2: discrete state coding is carried out to all road vehicles for entering intersection, for some intersection i, The road k that length is l since stop line is divided into the discrete unit of length c, by the vehicle location of the road k of intersection i It is vehicle location matrix with speed recordWith car speed matrixWhen vehicle head is on some discrete unit, then vehicle Location matrixCorresponding positional value is 1, and otherwise value is 0；It will be after the maximum speed normalization of car speed and road limitation Value as rate matricesThe value of corresponding unit lattice., there is a position in the lane for entering intersection i for every accordingly Set matrixWith a rate matricesFor i-th of intersection, all lanesWithForm the position of intersection i Matrix P_iWith rate matrices V_i.In t moment, Agent observes that the state of i-th of intersection isWherein S_i Indicate the state space at i-th of crossing.

Define the motion space A of i-th of intersection_i, i.e., all switching signal lamp phases of i-th intersection.

Define the average queuing that reward functions r is t moment and the t+1 moment enters vehicle on all lanes in i-th of intersection The variation of length.Calculation formula are as follows:

Wherein,WithThe respectively average row of t moment and t+1 moment into vehicle on all lanes in i-th of intersection Team leader's degree.

Step 3: in each time step t, by i-th of intersection current stateInput the estimation net of i-th of Agent Network, estimation network automatically extract the feature of intersection and estimate the corresponding Q value of each movement, and Agent is exported according to estimation network The corresponding Q value of each movement the corresponding movement of maximum Q value is selected according to ε-Greedy strategy with probability 1- ε, i.e.,Otherwise a movement is randomly choosed in motion spaceThen Agent executes selection MovementThe movement residence time is τ_g(minimum unit time), intersection enter next stateAgent is according to formula (1) award is calculatedWherein, the initial value of ε is 1, is linearly successively decreased.

Step 4: by the experience of each AgentIt is stored in the corresponding experience pond M of Agent.Wherein,Table Show the Q value of the everything of the estimation network output of i-th of Agent of t moment；

Step 5: m experience of stochastical sampling from the M of experience pond, using RMSProp gradient descent algorithm more new estimation network Parameter θ_i, loss function is

Wherein, γ is learning rate.A ' is some optional movement in motion space.N is the neighborhood of i-th of Agent, J is some neighbours Agent, A therein_jFor the motion space of j-th of Agent,For j-th of Agent the t-1 moment shape State,For the optimal Q value at neighbours j nearest moment.

Step 6: enabling

Step 7: repeating T step 3 to step 6.

Step 8: updating the parameter θ of target network_i'=θ_i, ε value is successively decreased until value is 0.1.

Step 9: repeat step 3 to step 8, timing (traffic in about 50 hours) calculates a vehicle and is averaged queue length L, When continuous 3 non-decreasings of L and when adjacent L difference is less than 0.02, then multi-intersection contract network training is completed.

Step 10: after the completion of the training of multi-intersection contract network, in each time step t, by the current shape of i-th of intersection StateInput the estimation network of i-th of Agent, the corresponding Q value of each movement of estimation network output of each Agent, Agent The corresponding movement of maximum Q value is selected with probability 1- ε, i.e.,Otherwise random in motion space Select a movementAgent executes movement

The estimation network and target network are convolutional neural networks, include 4 hidden layers, first convolutional layer is by 16 A 4 × 4 filter composition, step-length 2；Second convolutional layer is made of 32 2 × 2 filters, step-length 1；Third layer It is two full articulamentums with the 4th layer, is made of respectively 128 and 64 neurons.Four hidden layers all use Relu nonlinear activation Then the output valve of network is re-used as the input of last output layer by function, output layer uses softmax activation primitive, The neuron number of middle output layer and the motion space of corresponding intersection are equal in magnitude.

Beneficial effects of the present invention: the signal lamp cooperative control method based on Q value Transfer Depth intensified learning makes full use of The status information of intersection, and the signal lamp of multi-intersection can be synergistically controlled, this method can expand to more intersections Mouthful, and each intersection mouth structure is infinitely made.

Detailed description of the invention

Four intersection schematic diagrames of Fig. 1 unsymmetric structure；

The discrete state of Fig. 2 traffic information encodes；

The motion space of tetra- intersection Fig. 3；

The structure of Fig. 4 estimation network and target network；

Fig. 5 has the multi-intersection cooperative control structure of Q value migration；

Signal lamp Collaborative Control flow chart of the Fig. 6 based on Q value Transfer Depth intensified learning；

(wherein, QT-CDQN is with Q value Transfer Depth for average queue length of Fig. 7 QT-CDQN method on four crossings The cooperative control method of intensified learning, MADQN are the DQN method without collaboration, and FTA is optimal fixed to be set in advance according to vehicle flowrate Period control method)；

Average speed of Fig. 8 QT-CDQN method on four crossings；

Average latency of Fig. 9 QT-CDQN method on four crossings；

Average queue length of Figure 10 QT-CDQN method in each intersection；

Average speed of Figure 11 QT-CDQN method in each intersection；

Average latency of Figure 12 QT-CDQN method in each intersection.

Specific embodiment

The present invention provides a kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning.Institute The specific embodiment of discussion is merely to illustrate implementation of the invention, and does not limit the scope of the invention.With reference to the accompanying drawing Detailed description of embodiments of the present invention, specifically includes the following steps:

1. the schematic diagram of four intersections.Application of the invention does not limit the structure of intersection, hands over in Fig. 1 irregular four Illustrate for prong, wherein crossing 3 is simple intersection, and others are three-way intersection, and there is a letter in each intersection The passage of signal lamp control vehicle.Three-way intersection and simple intersection have respectively three and four enter crossing roads, every Road has two lanes.According to the structure at crossing, left-hand lane allows vehicle to keep straight on or turn left, and right-hand lane allows vehicle straight Row is turned right.

2. the discrete state of traffic information encodes.By since stop line length be l road k (k=0,1 ... 12) draw It is divided into the discrete unit of length c, wherein the value of c wants moderate, and c value is excessive, is easy to ignore individual vehicle state, too small to make It is too big at calculation amount.As shown in Fig. 2, by the vehicle location of the road k of crossing i and speed record in two matrixes: vehicle location MatrixWith car speed matrixIf vehicle head is on some cell, matrixCorresponding positional value is 1, no Then value is 0；Using the value after the maximum speed normalization of car speed and road limitation as rate matricesCorresponding unit lattice Value.For i-th of intersection (here by taking simple intersection as an example), the vehicle location matrix P of all roads_iWith car speed square Battle array V_iIt is expressed asWithIn t moment, Agent observes the shape at i-th of crossing State isWherein S_iIndicate the state space at i-th of crossing.

3. the motion space of four intersections.In t moment, Agent obtains the state of i-th of intersectionAfterwards, one is selected MovementWherein A_iIndicate the motion space of i-th of intersection, the corresponding motion space A in different intersections_iDifference, such as Shown in Fig. 3, there are three the movements different with four respectively for three-way intersection and simple intersection.The movement selected every time, phase The position time is the time interval τ of one section of regular length_g(6s), after current phase time, current time, t terminated therewith, and And starting next moment t+1, Agent starts to observe next state of i-th of intersectionStateIt can be by nearest one The secondary performed influence acted, for new stateSelect next movementAnd executed (possible selection at this time and upper a period of time Carve identical movement).

4. the setting of reward functions.Reward functions are the prize signals obtained during with environmental interaction, award letter Number has reacted the property for the task that Agent is faced, while the basis as Agent modification strategy.I-th is observed in Agent The state of a intersectionAfterwards, a movement is selectedAnd execute, Agent will obtain a scalar reward value from environmentWith The performed quality acted of evaluation.The target that Agent is pursued is exactly to find a kind of state-action policy, is made finally obtained tired Product reward value reaches maximum.The present invention select crossing vehicle be averaged queue length variation as reward functions,WithRespectively Enter the average queue length of vehicle on all lanes in i-th of intersection, award for t moment and t+1 momentSuch as formula (1) institute Show, reward value is positive, and indicates that the movement taken of t moment has an active influence to environment, subtracts the vehicle queue length that is averaged Few, the expression that is negative movement causes in environment the vehicle queue length that be averaged to increase.

5. estimating the structure of network and target network.By taking the road network of four intersections as an example, each intersection is by one Agent control, each Agent are made of an estimation network and a target network, and each network is a convolutional Neural Network.Estimation network can carry out Automatic Feature Extraction according to the original traffic state at respective crossing and approach state action value letter Number (Q function).CNN estimation network structure as shown in Figure 4 (answer when realizing in figure by the dimension of matrix and output layer neuron number It is arranged according to the actual situation).Each crossing using the normalization matrix that the speed of the location matrix of vehicle and vehicle constructs as pair The input of CNN network is answered, the output of network is that (Q value is passed through to the value assessment of everything in the state of observed Probability value after Softmax).CNN network includes 4 hidden layers, and first convolutional layer is made of 16 4 × 4 filters, step-length It is 2；Second convolutional layer is made of 32 2 × 2 filters, step-length 1；Third layer and the 4th layer are two full articulamentums, It is made of respectively 128 and 64 neurons.Four hidden layers all use Relu nonlinear activation function, then by the output valve of network Be re-used as the input of last output layer, output layer uses softmax activation primitive, wherein the neuron number of output layer with it is right Answer the motion space at crossing equal in magnitude.Change the tactful concussion problem that may cause to alleviate Q value small in decision process, The newly-increased target network different from estimation network structure identical parameters of each Agent, estimates under network-evaluated current state The Q value of each movementTarget network estimates target value y_t, whereinBy at one section Freeze the parameter of target network in time, so that estimation network is more stable.

6. the training process of network.As shown in figure 5, each Agent only considers the optimal movement of adjacent intersection to the shadow at this crossing It rings, in the loss function by the way that the Q value at the nearest moment of adjacent Agent to be moved to respective Agent system, so that multiple Agent can synergistically carry out the Signalized control of multi-intersection.By taking synergistic mechanism, the action selection plan of an intersection Its own Q value is slightly depended not only upon, the Q value of its Adjacent Intersections is additionally depended on, this method improves the traffic flow of Regional Road Network Amount, alleviates traffic congestion.

The optimal Q value at adjacent intersection nearest moment is moved in the loss function at each crossing, loss function is

Wherein, m is batch size, and θ is the parameter for estimating network,For i-th Agent estimation network it is defeated Out, θ ' is the parameter of target network,For the output of corresponding target network, N is neighbours' collection of i-th of Agent It closes,For the optimal Q value at neighbours j nearest moment.

The flow chart of QT-CDQN method is as shown in fig. 6, in each time step t, and i-th of Agent is by the state to intersection ObservationNetwork is inputted, according to network outputValue is acted using Greedy strategy selectionAnd execute, Agent is by formula at this time (2) award from environment is calculatedAnd enter next stateIn each time step t by the warp to i-th of crossing It testsDeposit experience pond M_iIn (each agent corresponding an experience pond).Each experience pond Most multipotency stores max_size (2 × 10⁵) experience, earliest rejection of data is continued after being filled with to be stored in newest experience.For More effective training estimation network C NN_iParameter θ_i, at interval of certain step number from experience pond M_iMiddle stochastical sampling m (32) item Experience is updated network.Since the optimal Q value at its neighbour nearest moment can be moved when updating the network of i-th of Agent It moves on in the loss function of current Agent, therefore, after stochastical sampling in the Mi of experience pond, needs from the experience pond of its neighbour Sample the experience at corresponding nearest moment.

In training, when acting selection using the ε-Greedy strategy to successively decrease, i.e., selected at random with probability ε (initial value 1) A movement is selected, with the maximum movement of probability 1- ε selection action value, ε successively decreases with the incremental of training bout, this choosing Selection method tends to be increasingly turned to utilize by exploration, until ε value remains unchanged after being reduced to 0.1.Each estimation network is all using The RMSProp gradient descent algorithm that habit rate is 0.0002, the parameter of target network every T (200) step are updated to latest value, that is, estimate The most recent parameters of network.When estimate network can fully approximate movement value function Q after, it is defeated by selection current state lower network The corresponding movement of maximum value out controls to be optimal.

After the completion of the training of multi-intersection contract network, in each time step t, by i-th of intersection current state input i-th The estimation network of a Agent, the corresponding Q value of each movement of estimation network output of each Agent, Agent are selected most with probability 1- ε The big corresponding movement of Q value, otherwise randomly chooses a movement in motion space, and Agent executes selected movement.

Claims

1. a kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning, which is characterized in that including Following steps:

Step 1: the transportation network in a region being modeled as Agent system, each intersection is controlled by an Agent, often A Agent includes an experience pond M, an estimation network and a target network composition, and network and mesh are estimated in initialization respectively Mark the parameter θ of network_iAnd θ_i', initialize each experience pond；

Step 2: discrete state coding is carried out to all road vehicles for entering intersection, it, will be from for some intersection i Stop line starts the discrete unit that the road k that length is l is divided into length c, by the vehicle location and speed of the road k of intersection i Degree is recorded as vehicle location matrixWith car speed matrixWhen vehicle head is on some discrete unit, then vehicle position Set matrixCorresponding positional value is 1, and otherwise value is 0；By the value after the maximum speed normalization of car speed and road limitation As rate matricesThe value of corresponding unit lattice；, there is a position square in the lane for entering intersection i for every accordingly Battle arrayWith a rate matricesFor i-th of intersection, all lanesWithForm the location matrix of intersection i P_iWith rate matrices V_i；In t moment, Agent observes that the state of i-th of intersection isWherein S_iIt indicates The state space at i-th of crossing；

Define the motion space A of i-th of intersection_i, i.e., all switching signal lamp phases of i-th intersection；

Define the average queue length that reward functions r is t moment and the t+1 moment enters vehicle on all lanes in i-th of intersection Variation；Calculation formula are as follows:

Wherein,WithThe average queuing of respectively t moment and t+1 moment into vehicle on all lanes in i-th of intersection is long Degree；

Step 3: in each time step t, by i-th of intersection current stateThe estimation network for inputting i-th of Agent, estimates Meter network automatically extracts the feature of intersection and estimates the corresponding Q value of each movement, and Agent exports each according to estimation network Corresponding Q value is acted, according to ε-Greedy strategy, the corresponding movement of maximum Q value is selected with probability 1- ε, i.e.,Otherwise a movement is randomly choosed in motion spaceThen Agent executes selection MovementThe movement residence time is τ_g, intersection enters next stateAgent is calculated according to formula (1) and is awarded Wherein, the initial value of ε is 1, is linearly successively decreased；

Step 4: by the experience of each AgentIt is stored in the corresponding experience pond M of Agent；Wherein,Indicate t The Q value of the everything of the estimation network output of i-th of Agent of moment；

Wherein, γ is learning rate；A ' is some optional movement in motion space；N is the neighborhood of i-th of Agent, and j is Some neighbours Agent, A therein_jFor the motion space of j-th of Agent,State for j-th of Agent at the t-1 moment,For the optimal Q value at neighbours j nearest moment；

Step 6: enabling

Step 7: repeating T step 3 to step 6；

Step 8: updating the parameter θ of target network_i'=θ_i, ε value is successively decreased until value is 0.1；

Step 9: repeating step 3 to step 8, timing calculates a vehicle and is averaged queue length L, when continuous 3 non-decreasings of L and phase When adjacent L difference is less than 0.02, then multi-intersection contract network training is completed；

Step 10: after the completion of the training of multi-intersection contract network, in each time step t, by the current state of i-th of intersectionInput the estimation network of i-th of Agent, the corresponding Q value of each movement of estimation network output of each Agent, Agent with Probability 1- ε selects the corresponding movement of maximum Q value, i.e.,Otherwise it is randomly choosed in motion space One movementAgent executes movement

2. a kind of multi-intersection signal lamp Collaborative Control based on Q value Transfer Depth intensified learning according to claim 1 Method, which is characterized in that the estimation network and target network are convolutional neural networks, include 4 hidden layers, first volume Lamination is made of 16 4 × 4 filters, step-length 2；Second convolutional layer is made of 32 2 × 2 filters, and step-length is 1；Third layer and the 4th layer are two full articulamentums, are made of respectively 128 and 64 neurons；Four hidden layers all use Relu non- The output valve of network, is then re-used as the input of last output layer by linear activation primitive, and output layer is activated using softmax Function, wherein the neuron number of output layer and the motion space of corresponding intersection are equal in magnitude.

3. a kind of multi-intersection signal lamp based on Q value Transfer Depth intensified learning according to claim 1 or 2 cooperates with control Method processed, which is characterized in that timing in the step 9 calculates a vehicle and be averaged queue length L, is set as 50 hours meters A vehicle is calculated to be averaged queue length L.