CN109559530A - Multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning


Info

Publication number
CN109559530A
CN109559530A (application CN201910011893.2A; granted as CN109559530B)
Authority
CN
China
Prior art keywords
intersection
agent
network
value
action
Prior art date
Legal status
Granted
Application number
CN201910011893.2A
Other languages
Chinese (zh)
Other versions
CN109559530B (en)
Inventor
葛宏伟 (Ge Hongwei)
宋玉美 (Song Yumei)
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN201910011893.2A
Publication of CN109559530A
Application granted
Publication of CN109559530B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G 1/00: Traffic control systems for road vehicles
    • G08G 1/07: Controlling traffic signals
    • G08G 1/081: Plural intersections under common control
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The present invention provides a multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning, belonging to the interdisciplinary field of machine learning and intelligent transportation. The method first models the multi-intersection traffic network of a region as a multi-agent system. While learning its policy, each agent also considers the influence of the most recent actions of its neighboring agents, so that multiple agents cooperatively control the signals of multiple intersections. Each agent adaptively controls one intersection through a deep Q-network whose input is a discrete traffic state encoding obtained by preprocessing the raw state information of the corresponding intersection. During learning, the Q values of the optimal actions of the neighboring agents at the most recent time step are transferred into the loss function of the network. The method can increase the traffic flow of the regional road network, improve road utilization, reduce vehicle queue lengths, and alleviate traffic congestion. The method places no restriction on the structure of each intersection.

Description

Multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning
Technical field
The invention belongs to the interdisciplinary field of machine learning and intelligent transportation, and relates to a multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning.
Background art
Traffic congestion has become an urgent challenge for urban transportation, yet existing road infrastructure is difficult to expand because of spatial, environmental, and economic constraints. Optimal control of traffic signals is therefore one of the effective ways to address this problem. Through adaptive signal control, the traffic of a regional road network can be optimized, reducing congestion and carbon dioxide emissions.
At present, various machine learning methods have been applied to urban traffic signal control, mainly fuzzy logic, evolutionary algorithms, and dynamic programming. Fuzzy-logic-based control usually establishes a set of rules from expert knowledge and then selects an approximate signal phase according to the traffic state. However, because rule design depends heavily on expert knowledge, it is difficult to obtain an effective rule set for multiple intersections with a large number of phases. Evolutionary algorithms such as genetic algorithms and ant colony algorithms have low search efficiency, so when applied to large-scale cooperative traffic optimization they can hardly meet the real-time requirements of signal decision-making. Dynamic programming struggles to build an effective traffic environment model, and it is difficult to handle the computational cost and the estimation of environment transition probabilities.
Traffic signal control is in fact a sequential decision problem, and many studies use the reinforcement learning framework to seek an optimal control policy. In reinforcement learning, an agent learns the optimal behavior policy of a dynamic system by perceiving the environment state and obtaining uncertain rewards from it. Learning is treated as a trial-and-error process: if a behavior policy of the agent leads to a positive reward (reinforcement signal) from the environment, the agent's tendency to produce that behavior is strengthened. The agent's goal is to find, in every discrete state, the optimal policy that maximizes the expected cumulative reward.
Reinforcement learning methods have been widely applied to signal control of single intersections and of regional multi-intersection networks. For multi-intersection signal control there are two main approaches: centralized control and distributed control. Centralized control uses reinforcement learning to train a single agent that controls the whole road network and decides the phase of every intersection at each time step. However, because the state space and action space grow exponentially as intersections are added, centralized control suffers from the curse of dimensionality. Distributed control models the multi-intersection signal control problem as a multi-agent system in which each agent is responsible for the signals of a single intersection. Since each agent makes decisions from the local environment of its own intersection, this approach scales easily to many intersections.
Traditional reinforcement learning represents the state space with manually extracted intersection features. To avoid an excessively large state space, the state representation is usually simplified, which often discards important information. An agent based on reinforcement learning makes decisions from its observation of the surrounding environment; if important information is lost, the agent can hardly make decisions that are optimal for the true environment. For example, representing the state only by the vehicle queue length on each road ignores the positions and speeds of moving vehicles, and representing it only by the average traffic delay reflects historical traffic data while ignoring real-time traffic demand. These methods for reducing the state space do not make full use of the available state information of the intersection, so the agent's decisions are based on only partial information.
Mnih et al. at the DeepMind laboratory proposed the deep Q-network (DQN) learning algorithm, which combines reinforcement learning with deep learning (Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning [J]. Nature, 2015, 518(7540): 529-533.), and many researchers have since applied deep reinforcement learning to signal control of single and multiple intersections. Deep learning models such as convolutional neural networks (CNN) and stacked auto-encoders (SAE) automatically extract features from intersection state information, so the agent can make full use of that information for decision optimization. Li et al. used the vehicle queue length of each road as the intersection state and a deep stacked auto-encoder to estimate the optimal Q value (Li L, Yisheng L, Wang F Y. Traffic signal timing via deep reinforcement learning [J]. ACTA AUTOMATICA SINICA, 2016, 3(3): 247-254.). Genders et al. proposed CNN-based deep reinforcement learning for controlling the signals of a single intersection, defining the state as the vehicle position matrix, the vehicle speed matrix, and the most recent signal phase, and training the controller with a Q-learning algorithm with experience replay; because of the correlation between action values and target values, the stability of this algorithm is poor (Genders W, Razavi S. Using a Deep Reinforcement Learning Agent for Traffic Signal Control [J]. arXiv preprint arXiv:1611.01142, 2016.). To address this instability, Gao et al. improved the method of Genders by introducing a target network (Gao J, Shen Y, Liu J, et al. Adaptive Traffic Signal Control: Deep Reinforcement Learning Algorithm with Experience Replay and Target Network. arXiv preprint arXiv:1705.02755, 2017.). Jeon et al. pointed out that the parameters used in most previous reinforcement learning studies cannot fully represent the complexity of the actual traffic state, and used the video images of the intersection directly to represent the traffic state (Jeon H J, Lee J and Sohn K. Artificial intelligence for traffic signal control based solely on video images. Journal of Intelligent Transportation Systems, 2018, 22(5): 433-445.).
Recently, Van der Pol et al. applied multi-agent deep reinforcement learning to adaptive signal control of regular multiple intersections for the first time (Van der Pol E and Oliehoek F A, Coordinated deep reinforcement learners for traffic light control. In NIPS'16 Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016). The multi-agent problem is first divided into multiple smaller sub-problems (the agents of two adjacent intersections form one sub-problem, also called a "source problem"); the DQN algorithm is trained on a source problem to obtain an approximate joint Q function, which is then transferred to the other sub-problems, and finally the max-plus algorithm is used to find the optimal joint action. However, the max-plus algorithm applies to cooperative multi-agent systems represented by a coordination graph and cannot guarantee convergence to the optimal solution, and transferring the joint Q function between different source problems requires that every source problem have state and action spaces of the same size, so this method imposes strong restrictions on the network structure of each intersection.
To address the difficulty of feature extraction from multi-intersection traffic states, the lack of an effective cooperation strategy in signal control, and the excessive dependence of cooperation strategies on intersection structure, the invention proposes a multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning (Cooperative Deep Q-Learning with Q-value Transfer, QT-CDQN). QT-CDQN models the regional road network as a multi-agent system in which each agent controls one intersection through a DQN; the input of the network is a discrete traffic state encoding obtained by preprocessing the raw vehicle state information. During training, the agent of each intersection considers the influence of the optimal actions of its adjacent intersections by transferring the Q values of the optimal actions of the neighboring agents at the most recent time step into the loss function of its network. This balances, to a certain extent, the traffic flow at the intersections, improves road utilization in the region, reduces vehicle queue lengths, and alleviates traffic congestion. The method scales well to larger traffic networks and places no restriction on the structure of each intersection.
Summary of the invention
Traditional signal control methods suffer from difficult traffic state feature extraction, the lack of an effective cooperation strategy among the signals of multiple intersections, and excessive dependence of the algorithm on intersection structure. The present invention proposes a cooperative deep Q-network with Q-value transfer (QT-CDQN) for multi-intersection signal cooperative control. The method automatically extracts features from the raw traffic state information and fully considers the influence of adjacent intersections to cooperatively control the signals of multiple intersections, improving the traffic efficiency of the intersections and alleviating congestion at each of them.
Technical solution of the present invention:
A multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning comprises the following steps:
Step 1: Model the traffic network of a region as a multi-agent system in which each intersection is controlled by one agent. Each agent consists of an experience pool M, an estimation network, and a target network. Initialize the parameters θi and θi' of the estimation network and the target network respectively, and initialize each experience pool.
Step 2: Perform discrete state encoding for all vehicles on the roads entering an intersection. For an intersection i, divide each approach lane k of length l, starting from the stop line, into discrete cells of length c, and record the vehicle positions and speeds of lane k as a position matrix P_k^i and a speed matrix V_k^i. If the head of a vehicle lies on a cell, the corresponding entry of the position matrix P_k^i is 1, otherwise it is 0; the vehicle speed normalized by the speed limit of the road is the value of the corresponding entry of the speed matrix V_k^i. Every lane entering intersection i thus has one position matrix P_k^i and one speed matrix V_k^i, and for the i-th intersection the matrices of all lanes together form the position matrix P_i and the speed matrix V_i of intersection i. At time t, the state of the i-th intersection observed by the agent is s_t^i = (P_i, V_i) ∈ S_i, where S_i denotes the state space of the i-th intersection.
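As an illustration of this discrete traffic state encoding, the sketch below builds the position and speed vectors for one approach lane; the cell length, lane length, speed limit, and vehicle list format are assumptions chosen for the example, not values fixed by the invention.

```python
import numpy as np

def encode_lane(vehicles, lane_length=150.0, cell_length=5.0, speed_limit=13.9):
    """Discrete traffic state encoding for one approach lane.

    vehicles: list of (distance_to_stop_line_m, speed_m_per_s) tuples.
    Returns a 0/1 position vector and a normalized speed vector, one entry
    per cell of length `cell_length` counted from the stop line.
    """
    n_cells = int(lane_length // cell_length)
    position = np.zeros(n_cells, dtype=np.float32)
    speed = np.zeros(n_cells, dtype=np.float32)
    for dist, v in vehicles:
        cell = int(dist // cell_length)
        if 0 <= cell < n_cells:
            position[cell] = 1.0                      # vehicle head occupies this cell
            speed[cell] = min(v / speed_limit, 1.0)   # speed normalized by the speed limit
    return position, speed

# Example: two vehicles, 3 m and 22 m from the stop line
P_k, V_k = encode_lane([(3.0, 0.0), (22.0, 8.5)])
```

Stacking these per-lane vectors for all lanes of intersection i gives the matrices P_i and V_i described above.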
Define the action space A_i of the i-th intersection, i.e., the set of all selectable signal phases of the i-th intersection.
Define the reward function r as the change, between time t and time t+1, of the average queue length of the vehicles on all lanes entering the i-th intersection. The calculation formula is

r_t^i = q_t^i - q_{t+1}^i        (1)

where q_t^i and q_{t+1}^i are the average queue lengths of the vehicles on all lanes entering the i-th intersection at times t and t+1, respectively.
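For example, if the average queue length across the entering lanes of intersection i drops from 6 vehicles at time t to 4 vehicles at time t+1, formula (1) gives a reward of r_t^i = 6 - 4 = 2; if the queue grows instead, the reward is negative.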
Step 3: At each time step t, input the current state s_t^i of the i-th intersection into the estimation network of the i-th agent. The estimation network automatically extracts intersection features and estimates the Q value of each action. Following the ε-greedy policy, with probability 1-ε the agent selects the action with the largest Q value output by the estimation network, i.e. a_t^i = argmax_a Q_i(s_t^i, a; θi); otherwise it randomly selects an action a_t^i from the action space. The agent then executes the selected action a_t^i for a holding time of τg (the minimum unit of time), after which the intersection enters the next state s_{t+1}^i, and the agent computes the reward r_t^i according to formula (1). The initial value of ε is 1 and it decreases linearly.
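A minimal sketch of the ε-greedy selection described above, assuming the Q values have already been produced by the estimation network for state s_t^i (the array used here is a placeholder):

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick a phase index: random with probability epsilon, greedy otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # explore
    return int(np.argmax(q_values))              # exploit: action with the largest Q value

# q_values would come from the estimation network of agent i for state s_t^i
action = epsilon_greedy(np.array([0.2, 0.7, 0.1]), epsilon=1.0)
```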
Step 4: Store the experience of each agent, e_t^i = (s_t^i, a_t^i, r_t^i, s_{t+1}^i, Q_t^i), in the agent's experience pool M, where Q_t^i denotes the Q values of all actions output by the estimation network of the i-th agent at time t.
Step 5: Randomly sample m experiences from the experience pool M and update the estimation network parameters θi with the RMSProp gradient descent algorithm. The loss function is

L_i(θi) = (1/m) Σ [ r_t^i + γ max_{a'∈A_i} Q_i(s_{t+1}^i, a'; θi') + Σ_{j∈N} max_{a_j∈A_j} Q_j(s_{t-1}^j, a_j; θj) - Q_i(s_t^i, a_t^i; θi) ]^2        (2)

where γ is the discount factor, a' is an action in the action space, N is the set of neighbors of the i-th agent, j is a neighboring agent, A_j is the action space of the j-th agent, s_{t-1}^j is the state of the j-th agent at time t-1, and max_{a_j∈A_j} Q_j(s_{t-1}^j, a_j; θj) is the optimal Q value of neighbor j at the most recent time step.
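The following sketch computes this loss for one sampled batch in PyTorch. The network objects, batch layout, and single discount value are assumptions made to keep the example short; the neighbor term uses the neighbors' stored estimation-network outputs from time t-1, as described in step 4.

```python
import torch

def qt_cdqn_loss(est_net, target_net, batch, neighbor_q_prev, gamma=0.9):
    """Loss with Q-value transfer for one agent.

    batch: dict of tensors s (states), a (action indices), r (rewards), s_next.
    neighbor_q_prev: tensor [batch, n_neighbors, n_actions_j] holding the neighbors'
                     estimation-network outputs at time t-1 (taken from their pools).
    """
    q_all = est_net(batch["s"])                                    # Q_i(s_t, ., theta_i)
    q_taken = q_all.gather(1, batch["a"].unsqueeze(1)).squeeze(1)  # Q_i(s_t, a_t; theta_i)
    with torch.no_grad():
        bootstrap = target_net(batch["s_next"]).max(dim=1).values  # max_a' Q_i(s_{t+1}, a'; theta_i')
        transfer = neighbor_q_prev.max(dim=2).values.sum(dim=1)    # sum_j max_{a_j} Q_j(s_{t-1}^j, a_j)
        target = batch["r"] + gamma * bootstrap + transfer
    return torch.mean((target - q_taken) ** 2)
```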
Step 6: Let s_t^i ← s_{t+1}^i, taking the new state as the current state for the next decision.
Step 7: Repeat steps 3 to 6 for T steps.
Step 8: Update the target network parameters θi' = θi, and decrease ε until its value reaches 0.1.
Step 9: Repeat steps 3 to 8. At regular intervals (about 50 hours of traffic), compute the average vehicle queue length L; when L has not decreased for 3 consecutive evaluations and the difference between adjacent values of L is less than 0.02, training of the multi-intersection cooperative network is complete.
Step 10: After training of the multi-intersection cooperative network is complete, at each time step t input the current state s_t^i of the i-th intersection into the estimation network of the i-th agent. The estimation network of each agent outputs the Q value of every action; with probability 1-ε the agent selects the action with the largest Q value, i.e. a_t^i = argmax_a Q_i(s_t^i, a; θi), otherwise it randomly selects an action a_t^i from the action space, and the agent executes the action a_t^i.
The estimation network and the target network are convolutional neural networks with 4 hidden layers. The first convolutional layer consists of 16 filters of size 4 × 4 with stride 2; the second convolutional layer consists of 32 filters of size 2 × 2 with stride 1; the third and fourth layers are two fully connected layers of 128 and 64 neurons, respectively. All four hidden layers use the ReLU nonlinear activation function; their output is then fed to the final output layer, which uses the softmax activation function, and the number of neurons of the output layer equals the size of the action space of the corresponding intersection.
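A sketch of this architecture in PyTorch is given below; the input height and width (the number of lanes and cells) and the two-channel position/speed layout are assumptions for illustration, since the actual dimensions depend on the road network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EstimationNet(nn.Module):
    """CNN with two conv layers, two fully connected layers, and a softmax output."""

    def __init__(self, in_channels=2, height=8, width=30, n_actions=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=4, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=2, stride=1)
        with torch.no_grad():                          # infer the flattened feature size
            dummy = torch.zeros(1, in_channels, height, width)
            n_flat = self.conv2(self.conv1(dummy)).numel()
        self.fc1 = nn.Linear(n_flat, 128)
        self.fc2 = nn.Linear(128, 64)
        self.out = nn.Linear(64, n_actions)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.out(x), dim=1)           # one value per signal phase
```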
Beneficial effects of the present invention: the signal cooperative control method based on Q-value transfer deep reinforcement learning makes full use of the state information of the intersections and cooperatively controls the signals of multiple intersections. The method can be extended to more intersections and places no restriction on the structure of each intersection.
Description of the drawings
Fig. 1: Schematic diagram of four intersections with asymmetric structure;
Fig. 2: Discrete state encoding of traffic information;
Fig. 3: Action spaces of the four intersections;
Fig. 4: Structure of the estimation network and the target network;
Fig. 5: Multi-intersection cooperative control structure with Q-value transfer;
Fig. 6: Flow chart of signal cooperative control based on Q-value transfer deep reinforcement learning;
Fig. 7: Average queue length of the QT-CDQN method over the four intersections (where QT-CDQN is the cooperative control method with Q-value transfer deep reinforcement learning, MADQN is the DQN method without cooperation, and FTA is the fixed-time control method with the optimal cycle preset according to traffic flow);
Fig. 8: Average speed of the QT-CDQN method over the four intersections;
Fig. 9: Average waiting time of the QT-CDQN method over the four intersections;
Fig. 10: Average queue length of the QT-CDQN method at each intersection;
Fig. 11: Average speed of the QT-CDQN method at each intersection;
Fig. 12: Average waiting time of the QT-CDQN method at each intersection.
Specific embodiment
The present invention provides a multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning. The specific embodiment discussed here merely illustrates an implementation of the invention and does not limit its scope. Embodiments of the invention are described in detail below with reference to the accompanying drawings, and specifically include the following steps:
1. Schematic diagram of the four intersections. The application of the invention does not limit the structure of the intersections; the irregular four-intersection network in Fig. 1 is taken as an example, in which intersection 3 is a cross (four-way) intersection, the others are T-shaped (three-way) intersections, and each intersection has one traffic signal controlling the passage of vehicles. A three-way intersection and a cross intersection have three and four entering roads respectively, and each road has two lanes. According to the structure of the intersection, the left lane allows vehicles to go straight or turn left, and the right lane allows vehicles to go straight or turn right.
2. Discrete state encoding of traffic information. Each road k (k = 0, 1, ..., 12) of length l, starting from the stop line, is divided into discrete cells of length c, where the value of c should be moderate: too large a value easily ignores the state of individual vehicles, while too small a value makes the computation too heavy. As shown in Fig. 2, the vehicle positions and speeds of road k of intersection i are recorded in two matrices: the vehicle position matrix P_k^i and the vehicle speed matrix V_k^i. If the head of a vehicle lies on a cell, the corresponding entry of P_k^i is 1, otherwise it is 0; the vehicle speed normalized by the speed limit of the road is the value of the corresponding entry of V_k^i. For the i-th intersection (here taking the cross intersection as an example), the matrices of all roads together form the vehicle position matrix P_i and vehicle speed matrix V_i. At time t, the state of the i-th intersection observed by the agent is s_t^i = (P_i, V_i) ∈ S_i, where S_i denotes the state space of the i-th intersection.
3. Action spaces of the four intersections. At time t, after obtaining the state s_t^i of the i-th intersection, the agent selects an action a_t^i ∈ A_i, where A_i denotes the action space of the i-th intersection; different intersections have different action spaces A_i. As shown in Fig. 3, the three-way intersections and the cross intersection have three and four different actions, respectively. Each selected action is held for a fixed time interval τg (6 s); when the current phase time ends, the current time step t ends, the next time step t+1 starts, and the agent observes the next state s_{t+1}^i of the i-th intersection. The state s_{t+1}^i is affected by the most recently executed action; for the new state, the next action a_{t+1}^i is selected and executed (possibly the same action as at the previous time step).
4. Setting of the reward function. The reward function provides the reward signal obtained from interaction with the environment; it reflects the nature of the task faced by the agent and is the basis on which the agent modifies its policy. After observing the state s_t^i of the i-th intersection and selecting and executing an action a_t^i, the agent obtains a scalar reward r_t^i from the environment to evaluate the quality of the executed action. The goal pursued by the agent is to find a state-action policy that maximizes the final cumulative reward. The present invention selects the change of the average vehicle queue length of the intersection as the reward function: q_t^i and q_{t+1}^i are the average queue lengths of the vehicles on all lanes entering the i-th intersection at times t and t+1, and the reward r_t^i is given by formula (1). A positive reward indicates that the action taken at time t had a positive influence on the environment, reducing the average vehicle queue length; a negative reward indicates that the action increased the average vehicle queue length.
5. Structure of the estimation network and the target network. Taking the road network of four intersections as an example, each intersection is controlled by one agent, and each agent consists of an estimation network and a target network, each of which is a convolutional neural network. The estimation network automatically extracts features from the raw traffic state of its intersection and approximates the state-action value function (Q function). The structure of the CNN estimation network is shown in Fig. 4 (the dimensions of the matrices and the number of output-layer neurons in the figure should be set according to the actual situation). Each intersection takes the vehicle position matrix and the normalized vehicle speed matrix as the input of its CNN; the output of the network is the value estimate of every action in the observed state (the Q values passed through a softmax, giving probability values). The CNN contains 4 hidden layers: the first convolutional layer consists of 16 filters of size 4 × 4 with stride 2; the second convolutional layer consists of 32 filters of size 2 × 2 with stride 1; the third and fourth layers are two fully connected layers of 128 and 64 neurons, respectively. All four hidden layers use the ReLU nonlinear activation function; their output is fed to the final output layer, which uses the softmax activation function, and the number of neurons of the output layer equals the size of the action space of the corresponding intersection. To alleviate the policy oscillation that small changes of the Q values may cause during decision-making, each agent adds a target network with the same structure and parameters as its estimation network. The estimation network estimates the Q value of each action in the current state, Q_i(s_t^i, a; θi), and the target network estimates the target value y_t = r_t^i + γ max_{a'} Q_i(s_{t+1}^i, a'; θi'). By freezing the parameters of the target network for a period of time, the estimation network becomes more stable.
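As a small illustration of this freezing mechanism (not a fixed part of the patent text), the target network can be created as a copy of the estimation network and its parameters overwritten only every T training steps; EstimationNet is the sketch class defined above and T = 200 follows the embodiment described below.

```python
import copy

est_net = EstimationNet(n_actions=4)
target_net = copy.deepcopy(est_net)        # same structure and identical initial parameters
target_net.eval()                          # the target network is never trained directly

T = 200
for step in range(1, 10_001):
    # ... sample a batch, compute qt_cdqn_loss(est_net, target_net, ...), backpropagate ...
    if step % T == 0:
        # hard update: copy the latest estimation-network parameters into the frozen target
        target_net.load_state_dict(est_net.state_dict())
```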
6. Training process of the networks. As shown in Fig. 5, each agent only considers the influence of the optimal actions of its adjacent intersections on its own intersection; by transferring the Q values of the most recent time step of the adjacent agents into the loss function of the respective agent, multiple agents can cooperatively control the signals of multiple intersections. With this cooperation mechanism, the action selection policy of an intersection depends not only on its own Q values but also on the Q values of its adjacent intersections; the method thereby increases the traffic flow of the regional road network and alleviates traffic congestion.
The optimal Q values of the adjacent intersections at the most recent time step are transferred into the loss function of each intersection; the loss function is

L_i(θi) = (1/m) Σ [ r_t^i + γ max_{a'∈A_i} Q_i(s_{t+1}^i, a'; θi') + Σ_{j∈N} max_{a_j∈A_j} Q_j(s_{t-1}^j, a_j; θj) - Q_i(s_t^i, a_t^i; θi) ]^2        (2)

where m is the batch size, θi is the parameter of the estimation network, Q_i(s_t^i, a_t^i; θi) is the output of the estimation network of the i-th agent, θi' is the parameter of the target network, Q_i(s_{t+1}^i, a'; θi') is the output of the corresponding target network, N is the set of neighbors of the i-th agent, and max_{a_j∈A_j} Q_j(s_{t-1}^j, a_j; θj) is the optimal Q value of neighbor j at the most recent time step.
The flow chart of the QT-CDQN method is shown in Fig. 6. At each time step t, the i-th agent inputs its observation s_t^i of the intersection state into the network, selects an action a_t^i with the ε-greedy policy according to the network output, and executes it; the agent then obtains the reward r_t^i from the environment according to formula (1) and enters the next state s_{t+1}^i. At each time step t, the experience e_t^i = (s_t^i, a_t^i, r_t^i, s_{t+1}^i, Q_t^i) of the i-th intersection is stored in the experience pool M_i (each agent has its own experience pool). Each experience pool stores at most max_size (2 × 10^5) experiences; after it is full, the oldest data are discarded so that the newest experiences can continue to be stored. To train the parameters θi of the estimation network CNN_i more effectively, m (32) experiences are randomly sampled from the experience pool M_i at fixed step intervals to update the network. Because the optimal Q values of the neighbors at the most recent time step are transferred into the loss function of the current agent when updating the network of the i-th agent, after random sampling from the experience pool M_i the experiences of the corresponding most recent time step must also be sampled from the experience pools of its neighbors.
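Below is a minimal sketch of such an experience pool with a 2 × 10^5 cap and time-aligned lookup, so that for each sampled step t of agent i the neighbors' stored Q values from step t-1 can be retrieved; the class name and storage layout are illustrative assumptions.

```python
import random
from collections import deque

class ExperiencePool:
    """Per-agent replay pool storing (s, a, r, s_next, q_all) keyed by time step."""

    def __init__(self, max_size=200_000):
        self.steps = deque(maxlen=max_size)   # time-step indices, oldest evicted first
        self.data = {}                        # step -> experience tuple

    def store(self, t, s, a, r, s_next, q_all):
        if len(self.steps) == self.steps.maxlen:
            self.data.pop(self.steps[0], None)   # drop the oldest experience
        self.steps.append(t)
        self.data[t] = (s, a, r, s_next, q_all)

    def sample_steps(self, m=32):
        return random.sample(list(self.steps), min(m, len(self.steps)))

    def get(self, t):
        return self.data.get(t)

# For a sampled step t of agent i, the transfer term in the loss uses
# max(q_all) taken from neighbor_pool.get(t - 1) of each neighboring agent j.
```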
During training, the decaying ε-greedy policy is used for action selection: a random action is selected with probability ε (initial value 1) and the action with the largest value is selected with probability 1-ε. As the training episodes increase, ε decreases, so that selection gradually shifts from exploration to exploitation, and ε remains constant after it has decreased to 0.1. Every estimation network is trained with the RMSProp gradient descent algorithm with a learning rate of 0.0002, and the parameters of the target network are updated every T (200) steps to the latest values, i.e. the latest parameters of the estimation network. Once the estimation network approximates the action value function Q sufficiently well, selecting the action with the largest network output in the current state achieves optimal control.
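The decay schedule and optimizer setup described here could look as follows; the number of episodes over which ε decays linearly is not specified in the text and is an assumption, while the learning rate 0.0002 and the floor of 0.1 are taken from the description above.

```python
import torch

def epsilon_schedule(episode, decay_episodes=500, eps_start=1.0, eps_end=0.1):
    """Linearly decay epsilon from eps_start to eps_end, then hold it constant."""
    frac = min(episode / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# RMSProp optimizer for one agent's estimation network (EstimationNet from the sketch above)
est_net = EstimationNet(n_actions=4)
optimizer = torch.optim.RMSprop(est_net.parameters(), lr=0.0002)
```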
After training of the multi-intersection cooperative network is complete, at each time step t the current state of the i-th intersection is input into the estimation network of the i-th agent; the estimation network of each agent outputs the Q value of every action, the agent selects the action with the largest Q value with probability 1-ε and otherwise randomly selects an action from the action space, and then executes the selected action.

Claims (3)

1. A multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning, characterized by comprising the following steps:
Step 1: Model the traffic network of a region as a multi-agent system in which each intersection is controlled by one agent; each agent consists of an experience pool M, an estimation network, and a target network; initialize the parameters θi and θi' of the estimation network and the target network respectively, and initialize each experience pool;
Step 2: Perform discrete state encoding for all vehicles on the roads entering an intersection; for an intersection i, divide each approach lane k of length l, starting from the stop line, into discrete cells of length c, and record the vehicle positions and speeds of lane k as a position matrix P_k^i and a speed matrix V_k^i; if the head of a vehicle lies on a cell, the corresponding entry of the position matrix P_k^i is 1, otherwise it is 0; the vehicle speed normalized by the speed limit of the road is the value of the corresponding entry of the speed matrix V_k^i; every lane entering intersection i thus has one position matrix P_k^i and one speed matrix V_k^i, and for the i-th intersection the matrices of all lanes together form the position matrix P_i and the speed matrix V_i of intersection i; at time t, the state of the i-th intersection observed by the agent is s_t^i = (P_i, V_i) ∈ S_i, where S_i denotes the state space of the i-th intersection;
Define the action space A_i of the i-th intersection, i.e., all selectable signal phases of the i-th intersection;
Define the reward function r as the change, between time t and time t+1, of the average queue length of the vehicles on all lanes entering the i-th intersection; the calculation formula is
r_t^i = q_t^i - q_{t+1}^i        (1)
where q_t^i and q_{t+1}^i are the average queue lengths of the vehicles on all lanes entering the i-th intersection at times t and t+1, respectively;
Step 3: At each time step t, input the current state s_t^i of the i-th intersection into the estimation network of the i-th agent; the estimation network automatically extracts intersection features and estimates the Q value of each action; following the ε-greedy policy, with probability 1-ε the agent selects the action with the largest Q value output by the estimation network, i.e. a_t^i = argmax_a Q_i(s_t^i, a; θi), otherwise it randomly selects an action a_t^i from the action space; the agent then executes the selected action a_t^i for a holding time of τg, the intersection enters the next state s_{t+1}^i, and the agent computes the reward r_t^i according to formula (1); the initial value of ε is 1 and it decreases linearly;
Step 4: Store the experience e_t^i = (s_t^i, a_t^i, r_t^i, s_{t+1}^i, Q_t^i) of each agent in the agent's experience pool M, where Q_t^i denotes the Q values of all actions output by the estimation network of the i-th agent at time t;
Step 5: Randomly sample m experiences from the experience pool M and update the estimation network parameters θi with the RMSProp gradient descent algorithm; the loss function is
L_i(θi) = (1/m) Σ [ r_t^i + γ max_{a'∈A_i} Q_i(s_{t+1}^i, a'; θi') + Σ_{j∈N} max_{a_j∈A_j} Q_j(s_{t-1}^j, a_j; θj) - Q_i(s_t^i, a_t^i; θi) ]^2        (2)
where γ is the discount factor, a' is an action in the action space, N is the set of neighbors of the i-th agent, j is a neighboring agent, A_j is the action space of the j-th agent, s_{t-1}^j is the state of the j-th agent at time t-1, and max_{a_j∈A_j} Q_j(s_{t-1}^j, a_j; θj) is the optimal Q value of neighbor j at the most recent time step;
Step 6: Let s_t^i ← s_{t+1}^i;
Step 7: Repeat steps 3 to 6 for T steps;
Step 8: Update the target network parameters θi' = θi, and decrease ε until its value reaches 0.1;
Step 9: Repeat steps 3 to 8; at regular intervals compute the average vehicle queue length L; when L has not decreased for 3 consecutive evaluations and the difference between adjacent values of L is less than 0.02, training of the multi-intersection cooperative network is complete;
Step 10: After training of the multi-intersection cooperative network is complete, at each time step t input the current state s_t^i of the i-th intersection into the estimation network of the i-th agent; the estimation network of each agent outputs the Q value of every action; with probability 1-ε the agent selects the action with the largest Q value, i.e. a_t^i = argmax_a Q_i(s_t^i, a; θi), otherwise it randomly selects an action a_t^i from the action space, and the agent executes the action a_t^i.
2. The multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning according to claim 1, characterized in that the estimation network and the target network are convolutional neural networks with 4 hidden layers; the first convolutional layer consists of 16 filters of size 4 × 4 with stride 2; the second convolutional layer consists of 32 filters of size 2 × 2 with stride 1; the third and fourth layers are two fully connected layers of 128 and 64 neurons, respectively; all four hidden layers use the ReLU nonlinear activation function, their output is fed to the final output layer, which uses the softmax activation function, and the number of neurons of the output layer equals the size of the action space of the corresponding intersection.
3. The multi-intersection traffic signal cooperative control method based on Q-value transfer deep reinforcement learning according to claim 1 or 2, characterized in that the interval at which the average vehicle queue length L is computed in step 9 is set to 50 hours of traffic.
CN201910011893.2A 2019-01-07 2019-01-07 Multi-intersection signal lamp cooperative control method based on Q value migration depth reinforcement learning Active CN109559530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910011893.2A CN109559530B (en) 2019-01-07 2019-01-07 Multi-intersection signal lamp cooperative control method based on Q value migration depth reinforcement learning


Publications (2)

Publication Number Publication Date
CN109559530A (en) 2019-04-02
CN109559530B CN109559530B (en) 2020-07-14

Family

ID=65872499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910011893.2A Active CN109559530B (en) 2019-01-07 2019-01-07 Multi-intersection signal lamp cooperative control method based on Q value migration depth reinforcement learning

Country Status (1)

Country Link
CN (1) CN109559530B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2916305A1 (en) * 2014-03-05 2015-09-09 Siemens Industry, Inc. Cloud-enhanced traffic controller
JP2017081382A (en) * 2015-10-27 2017-05-18 トヨタ自動車株式会社 Automatic drive apparatus
CN105654744A (en) * 2016-03-10 2016-06-08 同济大学 Improved traffic signal control method based on Q learning
US20180013211A1 (en) * 2016-07-07 2018-01-11 NextEv USA, Inc. Duplicated wireless transceivers associated with a vehicle to receive and send sensitive information
CN106340192A (en) * 2016-10-08 2017-01-18 京东方科技集团股份有限公司 Intelligent traffic system and intelligent traffic control method
CN108510764A (en) * 2018-04-24 2018-09-07 南京邮电大学 A kind of adaptive phase difference coordinated control system of Multiple Intersections and method based on Q study

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797857A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Data processing method, data processing device, storage medium and electronic equipment
CN110060475A (en) * 2019-04-17 2019-07-26 清华大学 A kind of multi-intersection signal lamp cooperative control method based on deeply study
CN110264750A (en) * 2019-06-14 2019-09-20 大连理工大学 A kind of multi-intersection signal lamp cooperative control method of the Q value migration based on multitask depth Q network
CN110264750B (en) * 2019-06-14 2020-11-13 大连理工大学 Multi-intersection signal lamp cooperative control method based on Q value migration of multi-task deep Q network
CN110164151A (en) * 2019-06-21 2019-08-23 西安电子科技大学 Traffic lamp control method based on distributed deep-cycle Q network
CN110363295A (en) * 2019-06-28 2019-10-22 电子科技大学 A kind of intelligent vehicle multilane lane-change method based on DQN
CN110753384A (en) * 2019-10-12 2020-02-04 西安邮电大学 Distributed reinforcement learning stable topology generation method based on self-adaptive boundary
CN110753384B (en) * 2019-10-12 2023-02-03 西安邮电大学 Distributed reinforcement learning stable topology generation method based on self-adaptive boundary
CN110718077B (en) * 2019-11-04 2020-08-07 武汉理工大学 Signal lamp optimization timing method under action-evaluation mechanism
CN110718077A (en) * 2019-11-04 2020-01-21 武汉理工大学 Signal lamp optimization timing method under action-evaluation mechanism
CN110930734A (en) * 2019-11-30 2020-03-27 天津大学 Intelligent idle traffic indicator lamp control method based on reinforcement learning
CN111081035A (en) * 2019-12-17 2020-04-28 扬州市鑫通智能信息技术有限公司 Traffic signal control method based on Q learning
CN111091710A (en) * 2019-12-18 2020-05-01 上海天壤智能科技有限公司 Traffic signal control method, system and medium
CN111091711A (en) * 2019-12-18 2020-05-01 上海天壤智能科技有限公司 Traffic control method and system based on reinforcement learning and traffic lane competition theory
CN111260937A (en) * 2020-02-24 2020-06-09 武汉大学深圳研究院 Cross traffic signal lamp control method based on reinforcement learning
CN111260937B (en) * 2020-02-24 2021-09-14 武汉大学深圳研究院 Cross traffic signal lamp control method based on reinforcement learning
CN112365724A (en) * 2020-04-13 2021-02-12 北方工业大学 Continuous intersection signal cooperative control method based on deep reinforcement learning
CN111653106A (en) * 2020-04-15 2020-09-11 南京理工大学 Traffic signal control method based on deep Q learning
CN111696370A (en) * 2020-06-16 2020-09-22 西安电子科技大学 Traffic light control method based on heuristic deep Q network
CN111813893A (en) * 2020-06-24 2020-10-23 重庆邮电大学 Real estate market analysis method, device and equipment based on deep migration learning
CN111813893B (en) * 2020-06-24 2022-11-18 重庆邮电大学 Real estate market analysis method, device and equipment based on deep migration learning
CN112216124A (en) * 2020-09-17 2021-01-12 浙江工业大学 Traffic signal control method based on deep reinforcement learning
CN112216124B (en) * 2020-09-17 2021-07-27 浙江工业大学 Traffic signal control method based on deep reinforcement learning
CN112150808B (en) * 2020-09-25 2022-06-17 天津大学 Urban traffic system scheduling strategy generation method based on deep learning
CN112150808A (en) * 2020-09-25 2020-12-29 天津大学 Urban traffic system scheduling strategy generation method based on deep learning
CN112258859A (en) * 2020-09-28 2021-01-22 航天科工广信智能技术有限公司 Intersection traffic control optimization method based on time difference learning
CN112216129A (en) * 2020-10-13 2021-01-12 大连海事大学 Self-adaptive traffic signal control method based on multi-agent reinforcement learning
CN112309138A (en) * 2020-10-19 2021-02-02 智邮开源通信研究院(北京)有限公司 Traffic signal control method and device, electronic equipment and readable storage medium
CN112614343A (en) * 2020-12-11 2021-04-06 多伦科技股份有限公司 Traffic signal control method and system based on random strategy gradient and electronic equipment
CN112750298A (en) * 2020-12-17 2021-05-04 梁宏斌 Truck formation dynamic resource allocation method based on SMDP and DRL
CN112669629A (en) * 2020-12-17 2021-04-16 北京建筑大学 Real-time traffic signal control method and device based on deep reinforcement learning
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy
CN112927522A (en) * 2021-01-19 2021-06-08 华东师范大学 Internet of things equipment-based reinforcement learning variable-duration signal lamp control method
CN112927505A (en) * 2021-01-28 2021-06-08 哈尔滨工程大学 Signal lamp self-adaptive control method based on multi-agent deep reinforcement learning in Internet of vehicles environment
CN112927505B (en) * 2021-01-28 2022-08-02 哈尔滨工程大学 Signal lamp self-adaptive control method based on multi-agent deep reinforcement learning in Internet of vehicles environment
CN113160585A (en) * 2021-03-24 2021-07-23 中南大学 Traffic light timing optimization method, system and storage medium
CN113160585B (en) * 2021-03-24 2022-09-06 中南大学 Traffic light timing optimization method, system and storage medium
CN113223305A (en) * 2021-03-26 2021-08-06 中南大学 Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
CN113299079A (en) * 2021-03-29 2021-08-24 东南大学 Regional intersection signal control method based on PPO and graph convolution neural network
CN113299079B (en) * 2021-03-29 2022-06-10 东南大学 Regional intersection signal control method based on PPO and graph convolution neural network
CN113299084A (en) * 2021-05-31 2021-08-24 大连理工大学 Regional signal lamp cooperative control method based on multi-view coding migration reinforcement learning
CN113487891A (en) * 2021-06-04 2021-10-08 东南大学 Intersection joint signal control method based on Nash Q learning algorithm
CN113724507A (en) * 2021-08-19 2021-11-30 复旦大学 Traffic control and vehicle induction cooperation method and system based on deep reinforcement learning
CN113724507B (en) * 2021-08-19 2024-01-23 复旦大学 Traffic control and vehicle guidance cooperative method and system based on deep reinforcement learning
CN113963555A (en) * 2021-10-12 2022-01-21 南京航空航天大学 Deep reinforcement learning traffic signal control method combined with state prediction
CN114613169B (en) * 2022-04-20 2023-02-28 南京信息工程大学 Traffic signal lamp control method based on double experience pools DQN
CN114613169A (en) * 2022-04-20 2022-06-10 南京信息工程大学 Traffic signal lamp control method based on double experience pools DQN
CN115457781A (en) * 2022-09-13 2022-12-09 内蒙古工业大学 Intelligent traffic signal lamp control method based on multi-agent deep reinforcement learning
CN115457781B (en) * 2022-09-13 2023-07-11 内蒙古工业大学 Intelligent traffic signal lamp control method based on multi-agent deep reinforcement learning
CN116612636A (en) * 2023-05-22 2023-08-18 暨南大学 Signal lamp cooperative control method based on multi-agent reinforcement learning and multi-mode signal sensing
CN116612636B (en) * 2023-05-22 2024-01-23 暨南大学 Signal lamp cooperative control method based on multi-agent reinforcement learning
CN117275259A (en) * 2023-11-20 2023-12-22 北京航空航天大学 Multi-intersection cooperative signal control method based on field information backtracking
CN117275259B (en) * 2023-11-20 2024-02-06 北京航空航天大学 Multi-intersection cooperative signal control method based on field information backtracking

Also Published As

Publication number Publication date
CN109559530B (en) 2020-07-14

Similar Documents

Publication Publication Date Title
CN109559530A (en) A kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning
CN110264750A (en) A kind of multi-intersection signal lamp cooperative control method of the Q value migration based on multitask depth Q network
CN106910351B (en) A kind of traffic signals self-adaptation control method based on deeply study
Wu et al. Multi-agent deep reinforcement learning for urban traffic light control in vehicular networks
CN110060475A (en) A kind of multi-intersection signal lamp cooperative control method based on deeply study
CN110047278B (en) Adaptive traffic signal control system and method based on deep reinforcement learning
CN111696370B (en) Traffic light control method based on heuristic deep Q network
CN113643553B (en) Multi-intersection intelligent traffic signal lamp control method and system based on federal reinforcement learning
Liang et al. Deep reinforcement learning for traffic light control in vehicular networks
CN110032782B (en) City-level intelligent traffic signal control system and method
CN109215355A (en) A kind of single-point intersection signal timing optimization method based on deeply study
CN112365724B (en) Continuous intersection signal cooperative control method based on deep reinforcement learning
CN110470301A (en) Unmanned plane paths planning method under more dynamic task target points
CN113223305B (en) Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
Lee et al. Reinforcement learning for joint control of traffic signals in a transportation network
CN113299084A (en) Regional signal lamp cooperative control method based on multi-view coding migration reinforcement learning
CN109872531A (en) Road traffic signal controls system-wide net optimized control objective function construction method
Xie et al. Iedqn: Information exchange dqn with a centralized coordinator for traffic signal control
CN114995119A (en) Urban traffic signal cooperative control method based on multi-agent deep reinforcement learning
Ha-li et al. An intersection signal control method based on deep reinforcement learning
CN114613169A (en) Traffic signal lamp control method based on double experience pools DQN
CN116524745B (en) Cloud edge cooperative area traffic signal dynamic timing system and method
Yu et al. Minimize pressure difference traffic signal control based on deep reinforcement learning
Benedetti et al. Application of deep reinforcement learning for traffic control of road intersection with emergency vehicles
CN106373410B (en) A kind of Optimal Method of Urban Traffic Signal Control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant