CN112365724A - Continuous intersection signal cooperative control method based on deep reinforcement learning - Google Patents
- Publication number
- CN112365724A (application CN202010287076.2A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
- G08G1/08—Controlling traffic signals according to detected number or speed of vehicles
- G06N3/045—Combinations of networks (computing arrangements based on neural-network models)
- G08G1/0125—Traffic data processing
- G08G1/0137—Measuring and analyzing of parameters relative to traffic conditions for specific applications
- G08G1/081—Plural intersections under common control
Abstract
The invention provides a cooperative signal control method for consecutive intersections based on deep reinforcement learning. A DQN strategy with upper- and lower-layer Agent networks handles signal timing across consecutive intersections, reducing the complexity of state acquisition and feedback evaluation and solving the signal optimization problem for consecutive intersections. To keep the training target stable and to avoid oscillation and divergence in the feedback loop between the target value and the predicted value, the DQN is trained with a Dueling Double DQN optimization method.
Description
Technical Field
The invention belongs to the technical field of vehicle-road cooperative / main-line coordinated control, and particularly relates to a traffic signal control model based on vehicle-road cooperation and deep reinforcement learning, suitable for main-line coordinated control of any pair of adjacent upstream and downstream intersections on a road section.
Background
For an urban road network, intersections are the nodes where traffic flows between road sections are exchanged, and they are also the main bottlenecks restricting the capacity of the network. A traffic signal control system assigns right of way to conflicting traffic flows so that the conflicting flows are separated, thereby improving both the efficiency and the safety of the intersection. Scholars at home and abroad have carried out a great deal of research and application on traffic signal control, and have successively proposed a number of signal control methods based on deep reinforcement learning, which play an important role in improving urban traffic conditions and relieving congestion. At present, however, intersection traffic-flow control generally targets a single intersection: a convolutional neural network predicts the intersection traffic flow, and timing control is then applied to the prediction result. In addition, for signal control over consecutive intersections, there are too many intersections to establish a state-space model conveniently; even when such a model is established, the complexity of traditional reinforcement learning grows exponentially when multi-intersection space models in different states are processed, which increases the complexity of state acquisition and feedback evaluation.
(1) Related prior art
1. Traffic signal control and optimization
In recent years, rapid socio-economic development has driven urban traffic demand to grow quickly; as the number of motor vehicles rises, the load on urban roads increases and congestion becomes more frequent. Congestion increases traffic delay, reduces driving speed, induces traffic accidents, and directly affects people's working efficiency and physical health. Unreasonable traffic signal configuration and insufficient coordination and optimization of urban road intersections are the main technical causes; therefore, improving traffic management measures, in particular optimizing intersection signal timing, can effectively improve the smoothness and safety of urban arterial traffic.
2. Deep reinforcement learning
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning in a general framework, and can realize direct control from raw input to output through end-to-end learning.
3. Urban main-line coordination
Urban main-line coordinated control is an important topic in intelligent transportation research. Because urban traffic is random and time-varying, an accurate scientific model is difficult to establish; the average queue length in each phase direction of each intersection serves as the evaluation parameter of control quality, and several adjacent intersections together form the urban arterial control model.
(2) Shortcomings of the prior art
1. Common existing deep reinforcement learning models only consider the control of a single intersection and do not address cooperative control of multiple intersections; neural-network modeling for multiple intersections still needs further research. When multi-intersection space models in different states are processed simultaneously, the complexity of traditional reinforcement learning grows exponentially and the complexity of state acquisition and feedback evaluation increases, so the multi-intersection case must be modeled explicitly. In addition, existing signal-lamp control methods operate on a per-cycle basis; under the premise of an interactive connected-vehicle environment and big data, a model-free adaptive control method is an important solution.
2. Existing green-wave optimization algorithms usually require a large amount of computation and lack flexibility. Most past research targets a single-phase coordinated path; traditional models cannot adequately support the case of consecutive intersections, and the benefit of intersection signal coordination is low.
3. The main difficulties in coordinated design for consecutive intersections are the high density of cross streets and the unequal spacing between intersections, which make conventional green-wave band control hard to realize; the requirement that two-way traffic on a road section arrive at the same intersection almost simultaneously is too harsh. Various fine-grained optimization algorithms can, in theory, partially satisfy the capacity of the road network, but their practicability and operability are hard to guarantee. If neural-network technology is combined with an optimization scheme in a vehicle-road cooperative environment to control the traffic lights, and the signal phases of adjacent intersections are coordinated, the signal control problem of consecutive intersections can be solved.
Disclosure of Invention
Aiming at the shortcomings of the three related technologies above, the invention makes full use of vehicle-road cooperation theory, takes upper- and lower-layer neural networks as the basic premise of traffic signal control for an intersection group, analyzes and calculates the disturbance that an intersection control scheme in this environment exerts on the coordination of adjacent intersection groups, and establishes a signal-lamp control method based on upper- and lower-layer neural networks. Intersection phases are switched in real time according to different road environments and traffic states, cooperation among intersections is increased, smooth driving through the intersections is guaranteed, and intersection capacity is improved, providing a new solution and a theoretical basis for relieving congestion, improving travel efficiency, and reducing accidents. The invention specifically adopts the following technical scheme:
First, data for a training batch is generated, and the current state, the action, and the received feedback value are stored in memory as a quadruple (s, a, r, s') for updating the parameters of the main neural network; the model then selects actions at random while a sufficient number of samples is stored; next, the learning rate in the neural network is updated through back propagation; finally, the green-light duration of every intersection is adjusted a second time according to the global average vehicle delay and the average vehicle delay at each intersection;
step 1, definition of lower-layer state space
The vehicle waiting time, vehicle delay, and signal-lamp phase change in each direction of the intersection are taken as the state input, and the intersection area is modeled by discretization:
step 2, discretized modeling of the intersection area
The intersection is divided into rectangular grids of equal size: each lane is divided into grids, each grid being regarded as a cell; a detector senses vehicle state information, and for each small square area the speed and position detected within time t are expressed in a single channel; if the detector detects no vehicle, the cell is zero-padded; finally, the resulting speed and position matrices serve as the state information of the whole road network;
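As an illustrative sketch of this discretized state encoding (the grid dimensions, cell length, speed normalization constant, and function name are assumptions, not values from the patent), each detected vehicle fills one cell of a position channel and a speed channel, and empty cells stay zero:

```python
import numpy as np

def encode_state(vehicles, n_rows=60, n_cols=60, cell_len=4.0, v_max=15.0):
    """Discretize the intersection area into a grid and fill two channels:
    binary occupancy (position) and normalized speed. Cells with no
    detected vehicle stay zero, matching the zero-padding rule."""
    pos = np.zeros((n_rows, n_cols))
    spd = np.zeros((n_rows, n_cols))
    for x, y, v in vehicles:            # (x, y) in metres, v in m/s
        r, c = int(y // cell_len), int(x // cell_len)
        if 0 <= r < n_rows and 0 <= c < n_cols:
            pos[r, c] = 1.0
            spd[r, c] = min(v / v_max, 1.0)
    return np.stack([pos, spd])         # state tensor fed to the CNN

# a moving vehicle at (10, 2) and a stopped one at (30, 2)
state = encode_state([(10.0, 2.0, 7.5), (30.0, 2.0, 0.0)])
```

The resulting speed and position matrices then serve as the road-network state input described above.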
step 3, action selection of lower-layer neural network
According to the current traffic state, an appropriate action is selected to guide the vehicles at the intersection; the switching between phases is taken as the action space, the unit time of the cycle is discretized into 5-second steps, the current phase-sequence state is updated to the selected phase-sequence state after switching, and the traffic signal selects its next action in the same way as before;
step 4, defining feedback value of lower layer neural network
Let $k_i$ denote the number of vehicles arriving on the road network between time period $i$ and time period $i+1$, and let $w_{i,j}$ denote the waiting time of the $j$-th vehicle in time period $i$. The feedback value for time period $i$ is defined as the reduction in average waiting time:

$$r_i=\frac{1}{k_i}\sum_{j=1}^{k_i}w_{i,j}-\frac{1}{k_{i+1}}\sum_{j=1}^{k_{i+1}}w_{i+1,j}$$
step 5, modeling of the lower layer neural network
The collected vehicle speed and position information is first arranged into matrices, and the data then pass through three convolutional layers, constructed as follows: the first convolutional layer contains 32 filters of size 4 × 4 with a stride of 4; the second convolutional layer has 64 filters of size 2 × 2 with a stride of 2, and the output after these two layers has size 30 × 30 × 64; the third convolutional layer has 128 filters of size 2 × 2 with a stride of 1, producing a 30 × 30 × 128 tensor, which a fully connected layer converts into a 128 × 1 vector. After the fully connected layer, the data is split into two parts of equal size 64 × 1: the first part represents the state value function V(s), the value of the current static road-network state; the second part represents the state-dependent action advantage function A(a), the additional change in road-network delay brought by selecting a given action. V(s) and A(a) are combined to obtain the Q value of each action, where the Q function gives the maximum cumulative feedback value obtainable starting from state s with a as the first action, its average expected value being predicted from the current road-network traffic state; the controller then executes the optimal signal-switching strategy under the current neural network;
step 6, optimization of lower-layer reinforcement learning network
1) Deep reinforcement learning network
The neural network is updated with the mean square error loss:

$$L(\theta)=\sum_{s}p(s)\left(Q_{target}(s,a)-Q(s,a;\theta)\right)^2$$

where $Q_{target}(s,a)$ is the target Q value of taking action a in state s, θ are the network parameters trained by the mean-square-error loss, and p(s) is the probability that state s occurs in a training batch;
Parameters in the main neural network are updated by back propagation, and the target-network parameters $\theta^-$ are updated from θ according to:

$$\theta^-=\alpha\theta^-+(1-\alpha)\theta$$

where α is the update rate, representing the degree of influence of the new parameters on the target network;
Here $Q(s,a;\theta_i)$ is the Q value output by the current network, used to evaluate the Q value of the current state-action pair; $Q(s,a;\theta_i^-)$ is the output of the target-value network; and $r+\gamma\max_{a'}Q(s',a';\theta_i^-)$ approximately represents the optimization objective of the value function, i.e., the target Q value;
2) determining neural network parameters
The priority probability of experience samples is computed with a rank-based method, where the error δ of sample i is defined as:

$$\delta_i=\left|Q(s,a;\theta)_i-Q_{target}(s,a)_i\right|$$

The errors δ are sorted; the priority $p_i$ of experience i is the reciprocal of its rank, and $P_i$ is the probability of sampling experience i:

$$P_i=\frac{p_i^{\tau}}{\sum_k p_k^{\tau}}$$

where τ determines how strongly the priorities are used;
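The rank-based sampling rule can be sketched as follows (the function name and the descending-rank convention, i.e. the largest error receives rank 1, are assumptions):

```python
import numpy as np

def rank_based_probs(deltas, tau=0.7):
    """Rank-based priorities: p_i = 1 / rank(i) with the largest error
    ranked first, then P_i = p_i^tau / sum_k p_k^tau. tau = 0 recovers
    uniform sampling; tau = 1 uses the priorities fully."""
    deltas = np.asarray(deltas, dtype=float)
    order = np.argsort(-deltas)               # indices sorted by descending error
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(deltas) + 1)
    p = 1.0 / ranks                           # reciprocal of the rank
    w = p ** tau
    return w / w.sum()

probs = rank_based_probs([0.1, 0.9, 0.5])     # sample 1 has the largest error
```

Samples with a larger temporal-difference error are thus replayed more often, which is the intent of the prioritization step above.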
let J (θ) denote the loss function, calculate the parametric gradient g:
the first and second order bias moments s and r are updated with exponential moving averages, respectively.
s=ρss+(1-ρs)g
r=ρrr+(1-ρr)g
Where ρ issAnd ρrFirst and second order exponential decay rates, respectively, using the time step t to correct the first and second order bias moments so that the corrected results are closer to each otherThe true gradient.
Calculating a gradient update:
the final parameter updates are as follows:
wherein oa isrIs the initial learning rate of the initial learning rate,is a constant that stabilizes the value.
The final loss function J is as follows:
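A minimal sketch of one Adam step as described in this optimization, assuming the standard elementwise form (the decay rates and learning rate are illustrative defaults, not values from the patent):

```python
import numpy as np

def adam_step(theta, g, s, r, t, lr=1e-3, rho_s=0.9, rho_r=0.999, eps=1e-8):
    """One Adam update: EMA of the first and second gradient moments,
    bias correction by the time step t, then the scaled gradient step."""
    s = rho_s * s + (1 - rho_s) * g          # first-order moment
    r = rho_r * r + (1 - rho_r) * g * g      # second-order moment
    s_hat = s / (1 - rho_s ** t)             # bias correction
    r_hat = r / (1 - rho_r ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + eps)
    return theta, s, r

theta = np.array([1.0, -1.0])
s, r = np.zeros(2), np.zeros(2)
theta, s, r = adam_step(theta, np.array([0.5, -0.5]), s, r, t=1)
```

On the first step the bias correction makes each parameter move by approximately the learning rate in the direction opposite its gradient.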
step 7, defining upper layer state space
Each agent in the system is the traffic signal controller of one intersection; the upper-layer controller of the hierarchical network jointly controls the area formed by several lower-layer intersection signal controllers, the intersections being indexed 1, 2, …, ζ;
step 8, defining upper layer action space
Let j be the green-light adjustment time and let $\bar{r}$ denote the average delay of vehicles over all intersections. If the average delay at the current intersection ζ is $r_\zeta$, the phase green time at that intersection is adjusted on the basis of the difference between $r_\zeta$ and $\bar{r}$;
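The exact adjustment formula of step 8 is only partially legible in the text; the following is a purely hypothetical sketch, assuming green time is lengthened by j seconds when an intersection's average delay exceeds the global average and shortened otherwise, clipped to the 5–60 second phase bounds used elsewhere in the patent:

```python
def adjust_green(green, r_zeta, r_bar, j=5, g_min=5, g_max=60):
    """Hypothetical secondary adjustment: lengthen the green phase by
    j seconds when this intersection's average delay r_zeta exceeds the
    global average r_bar, shorten it otherwise, then clip to the
    minimum/maximum phase durations."""
    if r_zeta > r_bar:
        green += j
    elif r_zeta < r_bar:
        green -= j
    return max(g_min, min(g_max, green))
```

This is a sketch of one plausible rule consistent with the surrounding description, not the patent's own formula.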
Step 9, upper layer neural network feedback value definition
The feedback value $r_k$ of the upper-layer Agent is defined as the average delay of all vehicles at the intersections; it serves as the feedback value of the traffic signal control system and is calculated as the input of the next cycle:

$$r_n=\frac{1}{m}\sum_{\zeta=1}^{m}r_\zeta$$

where m is the number of intersections and n indicates that the current cycle is the n-th one.
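The upper-layer feedback value is then simply the mean of the m per-intersection average delays (the function name is an assumption):

```python
def upper_feedback(delays):
    """Upper-layer Agent feedback for one cycle: the mean of the m
    per-intersection average vehicle delays."""
    m = len(delays)
    return sum(delays) / m

r_k = upper_feedback([12.0, 18.0, 15.0])   # three intersections
```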
Drawings
Fig. 1 is a schematic diagram of upper- and lower-layer traffic signal control at consecutive signalized intersections.
Fig. 2 is a diagram of a global model framework for an upper and lower network.
FIG. 3 is a diagram of the discretized modeling of the intersection area.
Fig. 4 is an MDP loop flow diagram.
FIG. 5 is a schematic diagram of a convolutional neural network processing vehicle information.
Fig. 6 is a model framework diagram of DQN.
Fig. 7 is a training flow diagram of DQN.
Fig. 8 is an upper state space definition diagram.
Fig. 9 is a schematic diagram of the upper state space.
Detailed Description
A scene of upper- and lower-layer traffic signal control at consecutive intersections is established, in which the intersections are equipped with detectors that sense vehicle information and the vehicles also carry sensors that collect information. Traffic signal timing data, vehicle running states, actual road conditions, and other information are acquired in real time through vehicle-network technology and various sensors; deep reinforcement learning with a neural network then predicts the signal timing that fits the current traffic state.
The invention divides the control of the signal lamps at consecutive intersections into upper- and lower-layer control. The lower-layer Agents are the traffic signal controllers of the individual intersections, each controller with its own learning strategy; the upper-layer Agent mainly adjusts the tentative strategies of the lower-layer Agents. The two layers of controllers jointly control the signal lights of the whole area; the multi-agent system is modeled as shown in figure 1.
Fig. 2 is the global model framework diagram of the upper- and lower-layer neural networks; the main convolutional neural network takes the current intersection state and the tentative phase-switching actions and uses the feedback value to select the most valuable action. First, a training batch of data is generated, and the current state, the action, and the received feedback value are stored in memory as a quadruple (s, a, r, s') to update the parameters of the main neural network. The target network $\theta^-$ is a separate neural network that increases learning stability. Using Double DQN in the main and target networks reduces over-estimation and improves performance: the model is trained on the Q value of each action, and the action with the largest Q value is selected to obtain the optimal strategy. The model then selects actions at random while a sufficient number of samples is stored. Before training, all samples have the same priority and are randomly divided into mini-batches for training. After each training round the sample priorities are updated, samples are selected with different probabilities, and the learning rate in the neural network is updated by Adam back propagation. The model derives an initial control scheme from the action-selection operation with the largest Q value. Finally, the green times of all intersections are adjusted a second time according to the global average vehicle delay and the average vehicle delay at each intersection, so that through learning the model reacts appropriately to different traffic scenes and reduces vehicle delay.
To accurately describe traffic information at an intersection, the vehicle waiting time W, vehicle delay D, and signal-lamp phase change C in each direction of the intersection are taken as the state input. In addition, to accurately represent the specific distribution of vehicle position and speed at the intersection, the intersection area is modeled by discretization.
As shown in fig. 3, the whole intersection is divided into rectangular grids of equal size; to reduce computation and save resources, the vehicles' speed and position information is stored in matrices. Each lane is divided into grids, each regarded as a cell; a detector senses vehicle state information, and for each small square area the speed and position detected within time t are expressed in a single channel; if the detector detects no vehicle, the cell is zero-padded. Finally, the resulting speed and position matrices serve as the state information of the whole road network.
Step 3, action selection of lower-layer neural network
The traffic signal control system selects an appropriate action to guide the vehicles at the intersection according to the current traffic state. The invention takes the switching between phases as the action space and models the phase-switching process as a Markov Decision Process (MDP). An MDP is a sequential-decision mathematical model used to simulate the stochastic policies and feedback values achievable by an agent in a traffic scene whose system state has the Markov property; deep reinforcement learning then learns, through trial and error combined with the MDP control strategy, the switching strategy with the best feedback value.
In fig. 4, each loop represents a phase transition of the intersection signal lamp within one time-slot cycle; the unit time of the cycle is discretized into 5 seconds, and after switching, the current phase-sequence state is updated to the selected phase-sequence state, the traffic signal selecting its next action in the same way as before. In addition, so that the model can learn the phase switching, maximum and minimum light durations are set: the invention sets the maximum and minimum phase times to 60 seconds and 5 seconds. If the green time of a phase reaches 60 seconds, the signal is forcibly switched to the next phase, and updating continues iteratively on the basis of the original control scheme.
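A minimal sketch of these phase-timing constraints (the function name, the cyclic phase order, and the default of four phases are assumptions; the 5-second minimum and 60-second maximum are from the text):

```python
def next_phase(phase, elapsed, want_switch, n_phases=4, g_min=5, g_max=60):
    """Apply the min/max green constraints: a phase may not be left
    before g_min seconds, and is forcibly switched once it has run for
    g_max seconds; on a switch the elapsed timer resets to zero."""
    if elapsed >= g_max:                     # forced switch at 60 s
        return (phase + 1) % n_phases, 0
    if want_switch and elapsed >= g_min:     # agent-chosen switch
        return (phase + 1) % n_phases, 0
    return phase, elapsed                    # keep the current phase
```

The agent's chosen action (`want_switch`) is thus only honored inside the allowed green-time window.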
Step 4, defining feedback value of lower layer neural network
To give the reinforcement learning model feedback on its previous performance, a feedback value is defined to help the traffic signal adopt the optimal action strategy. To reduce average vehicle delay, the invention defines the reward as the reduction in average vehicle delay within a time period, so that the reward is kept positive during training.
Let $k_i$ denote the number of vehicles arriving on the road network between time period $i$ and time period $i+1$, and let $w_{i,j}$ denote the waiting time of the $j$-th vehicle in time period $i$. The feedback value for time period $i$ is defined as the reduction in average waiting time:

$$r_i=\frac{1}{k_i}\sum_{j=1}^{k_i}w_{i,j}-\frac{1}{k_{i+1}}\sum_{j=1}^{k_{i+1}}w_{i+1,j}$$
From the formula, if $r_i$ is negative, the average waiting time is longer than before; to continuously reduce vehicle delay, $r_i$ should be made as large as possible.
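Assuming the feedback value is the drop in average waiting time between consecutive periods, as described in step 4, the computation can be sketched as:

```python
def feedback(waits_i, waits_next):
    """r_i = (average waiting time in period i) minus (average waiting
    time in period i+1); positive when delay has been reduced."""
    w_i = sum(waits_i) / len(waits_i)
    w_n = sum(waits_next) / len(waits_next)
    return w_i - w_n

# average wait drops from 15 s to 10 s -> positive reward
r = feedback([10.0, 20.0], [5.0, 15.0])
```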
Step 5, modeling of the lower layer neural network
The invention uses a main network and a target network with identical structure: the main network θ updates its weights in real time, while the target network $\theta^-$ is updated after the main network has been updated several times. The Q value is updated jointly from the state value function V(s) and the action advantage function A(a); Adam is chosen as the optimizer, and an ε-greedy strategy and experience replay are adopted during learning.
The structure of the underlying CNN is shown in fig. 5. It consists of three convolutional layers and three fully connected layers. The vehicle speed and position matrix first passes through the three convolutional layers, each of which includes convolution, pooling, and a nonlinear activation function. A convolutional layer comprises several filters, each containing a set of weights and sliding by a defined stride to produce the input to the next layer; different filters have different weights and generate different features in the next layer. The invention uses the Leaky ReLU function as the activation function:

$$f(x)=\begin{cases}x, & x>0\\ \beta x, & x\le 0\end{cases}$$
where x is the unit's input and β is a small constant. Compared with the conventional ReLU, introducing β avoids dead neurons caused by a zero gradient on the negative side. The Leaky ReLU function can converge faster than other activation functions (e.g., tanh and sigmoid), thereby speeding up the convergence of vehicle delay during training.
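A minimal sketch of the Leaky ReLU activation (β = 0.01 is an illustrative default; the patent does not state the value of β):

```python
import numpy as np

def leaky_relu(x, beta=0.01):
    """Leaky ReLU: identity for positive inputs, a small slope beta on
    the negative side so the gradient never vanishes entirely."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, beta * x)

out = leaky_relu([-2.0, 0.0, 3.0])
```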
The neural network modeling is shown in fig. 5 below.
FIG. 5 shows how the convolutional neural network processes the vehicle speed and position matrices: the collected information is first arranged into matrices, and the data then pass through three convolutional layers. The three convolutional layers and the fully connected layer are constructed as follows: the first convolutional layer contains 32 filters of size 4 × 4 with a stride of 4; the second has 64 filters of size 2 × 2 with a stride of 2, and the output after these two layers has size 30 × 30 × 64; the third has 128 filters of size 2 × 2 with a stride of 1, producing a 30 × 30 × 128 tensor, which a fully connected layer converts into a 128 × 1 vector. After the fully connected layer, the data is split into two parts of equal size 64 × 1. The first part represents the state value function V(s), the value of the current static road-network state; the second part represents the state-dependent action advantage function A(a), the additional change in road-network delay brought by selecting a given action. The number of possible actions is the number k of legal phases, so A(a) has size k × 1. The two parts are recombined to obtain the Q value of each action; the parameters of the CNN are denoted θ, so Q(s, a) becomes Q(s, a; θ), where θ are the network parameters trained by the mean-square-error loss. The Q function gives the maximum cumulative feedback value obtainable starting from state s with a as the first action, its average expected value being predicted from the current road-network traffic state, and the controller executes the optimal signal-switching strategy under the current neural network.
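The layer sizes quoted above imply a 240 × 240 input, which is inferred rather than stated in the patent (240/4 = 60 after the first layer, 60/2 = 30 after the second, and the third layer needs one unit of total padding to keep 30 × 30). A sketch that checks these shapes:

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a square convolution; pad is the total
    zero-padding added along the dimension."""
    return (size + pad - kernel) // stride + 1

# Input assumed 240x240 (inferred from the stated 30x30x64 output).
s1 = conv_out(240, kernel=4, stride=4)         # 32 channels
s2 = conv_out(s1, kernel=2, stride=2)          # 64 channels
s3 = conv_out(s2, kernel=2, stride=1, pad=1)   # 128 channels, padded to keep 30
flat = s3 * s3 * 128                           # tensor fed to the 128x1 FC layer
```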
Step 6, optimization of lower-layer reinforcement learning network
The core of the DQN model is a convolutional neural network trained with Q-learning: its input is the raw road-network data matrix, and its output is the estimated Q value of the optimal strategy. Figs. 6 and 7 below are the framework diagram and the flow chart of DQN, respectively.
The matrix containing vehicle position and speed information passes through the convolutional layers and the fully connected layers, and for the input state a vector containing the Q value of each action is output.
1) Deep reinforcement learning network
During DQN training, a deep convolutional network approximates the current value function, while another network produces the target Q value. Specifically, let $Q_{target}(s,a)$ denote the target Q value of taking action a in state s, and let θ be the network parameters trained by the mean-square-error loss; the neural network is updated with the mean square error (MSE):

$$L(\theta)=\sum_{s}p(s)\left(Q_{target}(s,a)-Q(s,a;\theta)\right)^2$$

where p(s) is the probability that state s occurs in a training batch. To provide a stable update in each iteration (a steady reduction of road-network delay during training), a separate target network $\theta^-$, identical in structure to the main neural network but with different parameters, is used to generate the Q value.
Parameters in the main neural network are updated by back propagation, and the target-network parameters $\theta^-$ are updated from θ according to:

$$\theta^-=\alpha\theta^-+(1-\alpha)\theta \tag{4}$$

where α is the update rate, representing the degree of influence of the new parameters on the target network.
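The soft target-network update of equation (4) can be sketched as (parameter and function names are assumptions):

```python
import numpy as np

def soft_update(theta_target, theta, alpha=0.99):
    """theta^- = alpha * theta^- + (1 - alpha) * theta: alpha controls
    how strongly the old target parameters are retained."""
    return alpha * theta_target + (1 - alpha) * theta

t = soft_update(np.array([1.0]), np.array([0.0]), alpha=0.9)
```

With α close to 1 the target network drifts slowly toward the main network, which stabilizes the training target.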
Here $Q(s,a;\theta_i)$ is the Q value output by the current network, used to evaluate the Q value of the current state-action pair; $Q(s,a;\theta_i^-)$ is the output of the target-value network; and $r+\gamma\max_{a'}Q(s',a';\theta_i^-)$ approximately represents the optimization objective of the value function, i.e., the target Q value.
After the parameters θ of the current-value network have been updated for N iterations, they are copied to the target-value network $\theta^-$. The network parameters are updated by minimizing the mean square error between the current Q value and the target-network $Q_{target}$ value, which confines the error terms to a limited interval and keeps the Q values and gradients in a reasonable range, benefiting a steady reduction of road-network delay.
2) Dueling DQN optimization method
In some states sₜ, for example when there are too few or too many vehicles on the road network, no action aₜ affects the delay of the next state sₜ₊₁; that is, the correlation between the current state-action value function and the current action selection is weak, which easily prevents the road-network delay from converging in that state. To solve this problem, the invention adopts Dueling DQN to improve the learning effect and convergence speed of DQN.
On the basis of the original network, a deep network is used to fit the Q value in reinforcement learning, and the Q-value function is decomposed into a state value V and an action advantage A; the Q value is then obtained as the sum of the state value V and the advantage A.
In the neural network, the overall expected return of actions taken in future steps is represented by the state value V(s; θ). For each action, the Q value is the sum of the state value V and a state-dependent advantage function A(s, a; θ), defined as the cumulative discounted return of the current action relative to the optimal action. The Q value is calculated as follows:

Q(s, a; θ) = V(s; θ) + A(s, a; θ) (5)
Here A(s, a; θ) represents the contribution of the chosen action to the Q-value function. If the A value of an action is positive, that action reduces delay more than the alternatives; conversely, if the A value of an action is negative, the potential return of that action is below the average. Compared with using the raw Q value directly, this decomposition improves the stability of the model and reduces average vehicle delay.
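The dueling aggregation can be sketched as below. The text above defines Q as the plain sum of V(s) and A(s, a); subtracting the mean advantage is the standard identifiability fix from the Dueling DQN literature and is an assumption added here, not stated in the patent:

```python
import numpy as np

def dueling_q(v, advantages):
    """Combine state value V and per-action advantages A into Q values.

    Mean-subtraction makes the V/A decomposition unique (standard Dueling DQN
    trick); with zero-mean advantages it coincides with the plain sum above.
    """
    advantages = np.asarray(advantages, dtype=float)
    return v + (advantages - advantages.mean())

v = 10.0                              # hypothetical state value
a = np.array([1.0, -1.0, 0.0])        # hypothetical per-action advantages
q = dueling_q(v, a)                   # mean(A) = 0 here, so Q = [11, 9, 10]
```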
3) Double DQN optimization method
Traditional DQN suffers from over-estimation: because the estimates are non-uniform, over-estimation can arise during parameter updates and iteration, so that the current phase-switching scheme is not the optimal one. To prevent the Q value from being over-estimated, the Qtarget value is updated with the Double DQN algorithm.
Qtarget(s, a) = r + γ Q′(s′, argmaxₐ′ Q(s′, a′; θ), θ⁻) (6)
The formula above involves two Q networks: the online network Q selects the greedy action for the next state, while the target network Q′ evaluates that action. Decoupling action selection from evaluation mitigates the over-estimation problem and thus effectively reduces the average vehicle delay on the road network.
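The target of Eq. (6) can be sketched as follows; the online network picks the greedy next action and the target network evaluates it (all Q rows here are hypothetical values for a single next state s′):

```python
import numpy as np

def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    """Eq. (6): r + gamma * Q_target(s', argmax_a' Q_online(s', a'))."""
    a_star = int(np.argmax(q_online_next))      # action selection: online net
    return reward + gamma * q_target_next[a_star]  # evaluation: target net

target = double_dqn_target(
    reward=1.0, gamma=0.9,
    q_online_next=np.array([0.2, 0.8]),   # online net prefers action 1
    q_target_next=np.array([0.5, 0.3]),   # target net's value for that action
)
# 1.0 + 0.9 * 0.3 = 1.27
```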
4) Neural network parameters
The invention adopts a rank-based prioritized experience replay structure to improve learning efficiency, increasing the replay probability of samples associated with lower average delay. The priority probability of an experience sample is computed with a rank-based method, where the error δ of sample i is defined as:
δᵢ = |Q(s, a; θ)ᵢ − Qtarget(s, a)ᵢ| (7)
The errors δ are sorted, and the priority pᵢ of experience i is set to the reciprocal of its rank. The probability Pᵢ of sampling experience i is then:

Pᵢ = pᵢ^τ / Σₖ pₖ^τ (8)
where τ indicates how strongly the priorities are used; when τ = 0, sampling is uniform (purely random).
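The rank-based scheme can be sketched as below: priorities are reciprocals of the error ranks, sharpened by τ, with τ = 0 recovering uniform sampling (the error values are illustrative):

```python
import numpy as np

def replay_probabilities(errors, tau):
    """Rank-based replay: p_i = 1/rank(|delta_i|), P_i = p_i^tau / sum_k p_k^tau."""
    errors = np.asarray(errors, dtype=float)
    ranks = np.empty(len(errors), dtype=int)
    ranks[np.argsort(-np.abs(errors))] = np.arange(1, len(errors) + 1)  # rank 1 = largest error
    p = (1.0 / ranks) ** tau
    return p / p.sum()

errors = np.array([0.1, 0.5, 0.3])
uniform = replay_probabilities(errors, tau=0.0)   # tau = 0 -> uniform sampling
greedy = replay_probabilities(errors, tau=1.0)    # larger error -> higher probability
```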
The Adam (adaptive moment estimation) method is selected as the optimizer of the neural network model. By computing first- and second-moment estimates of the gradient, Adam assigns an independent adaptive learning rate to each parameter, accelerating convergence and improving the model. Let J(θ) denote the loss function; the parameter gradient g is calculated as:

g = ∇θ J(θ) (9)
the first and second order bias moments s and r are updated with exponential moving averages, respectively.
s = ρₛs + (1 − ρₛ)g (10)
r = ρᵣr + (1 − ρᵣ)g ⊙ g (11)
Where ρ issAnd ρrFirst-order and second-order exponential decay rates are respectively adopted, and the time step t is used for correcting the first-order and second-order bias moments, so that the corrected result is closer to the real gradient.
The gradient update is calculated element by element:

Δθ = −ε ŝ / (√r̂ + δ) (12)

and the final parameter update is:

θ = θ + Δθ (13)

where ε is the initial learning rate, δ is a small constant that stabilizes the value, and ŝ and r̂ are the bias-corrected first- and second-order moments.
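One Adam step following the moment updates and bias correction described above can be sketched as follows (the learning rate, decay rates, and starting values are illustrative defaults, not values fixed by the patent):

```python
import numpy as np

def adam_step(theta, g, s, r, t, lr=0.001, rho_s=0.9, rho_r=0.999, eps=1e-8):
    """One Adam update: decayed moments, bias correction by step t, parameter move."""
    s = rho_s * s + (1 - rho_s) * g          # first moment (mean of gradients)
    r = rho_r * r + (1 - rho_r) * g * g      # second moment (squared gradients)
    s_hat = s / (1 - rho_s ** t)             # bias-corrected first moment
    r_hat = r / (1 - rho_r ** t)             # bias-corrected second moment
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + eps)
    return theta, s, r

theta = np.array([1.0])
g = np.array([2.0])                          # positive gradient of the loss
theta, s, r = adam_step(theta, g, np.zeros(1), np.zeros(1), t=1)
# the parameter moves opposite the gradient sign, by roughly one learning rate
```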
The final loss function J is:

J(θ) = E[(Qtarget(s, a) − Q(s, a; θ))²] (14)
step 7, defining upper layer state space
When the upper-layer Agent controls the continuous intersections, it first adjusts the action of each lower-layer intersection on the basis of the original scheme, and finally updates the optimization scheme according to the average queue length at each intersection.
Modeling of a multi-body system is shown in FIG. 8.
Each agent in the system is the traffic signal controller of one intersection. An upper-layer controller at the network level jointly controls a region formed by several lower-layer intersection signal controllers; the intersections are numbered 1, 2, …, ζ. The lower-layer Agent of each intersection has its own learning strategy, and the upper-layer Agent provides guidance. For the secondary adjustment of the signals, the delays at each intersection are first sorted, as shown in FIG. 9.
Through the above steps, the intersection delays are sorted, and the state space of the upper layer consists of the numbers of the intersections with the highest delays.
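Building the upper-layer state as described above can be sketched as a simple sort over per-intersection average delays; the number of intersections kept, k, and the delay values are assumptions for illustration:

```python
import numpy as np

def most_delayed(delays, k):
    """Return the indices of the k intersections with the highest average delay."""
    order = np.argsort(-np.asarray(delays, dtype=float))  # descending by delay
    return order[:k].tolist()

delays = [12.0, 45.5, 8.2, 30.1]     # hypothetical per-intersection average delays
state = most_delayed(delays, k=2)    # upper-layer state: intersections 1 and 3
```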
Step 8, defining upper layer action space
In order to reduce average vehicle delay, intersections with larger delay should be allocated more green time and intersections with smaller delay less green time. Let j be the green-light adjustment time, whose specific value is determined by the average vehicle delay at each intersection, and denote the average vehicle delay over all intersections by r̄. If the average delay at intersection ζ is rζ, the phase green time at that intersection is adjusted to:
Step 9, upper layer neural network feedback value definition
The feedback value rₖ of the upper-layer Agent is defined as the average delay of vehicles at all intersections; it serves as the feedback value of the traffic signal control system and is calculated as the input of the next cycle:
where m is the number of intersections and n denotes the current cycle index.
Claims (1)
1. A continuous intersection signal cooperative control method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, building continuous intersection upper and lower layer signal control model framework
Firstly, generating data of a training batch, and storing the current state, the action and the received feedback value as a quadruple (s, a, r, s′) in a memory for updating the parameters of the main neural network; then the model randomly selects actions until a sufficient number of samples have been stored; then updating the parameters of the neural network through back propagation; finally, carrying out a secondary adjustment of the green-light duration at all intersections according to the global average vehicle delay and the average vehicle delay at each intersection;
step 2, defining the state space of the lower layer neural network
Taking the vehicle waiting time, the vehicle delay and the signal lamp phase change of each direction of the intersection as state input, and carrying out discretization modeling on the intersection area:
dividing the intersection into rectangular grids of the same size, wherein each lane is divided into grids, each grid being regarded as a cell; a detector detects the vehicle state information, and for each small square area the speed and position information detected in time t is expressed as a single-channel input to the convolution; if the detector detects no vehicle, the cell is zero-filled, and finally the obtained speed and position matrix serves as the state information of the whole road network;
step 3, action selection of lower-layer neural network
Selecting a proper action to guide the vehicles at the intersection according to the current traffic state, taking the switching between stages as the action space, and discretizing the cycle into unit time steps of 5 seconds; after switching, the current phase-sequence state is updated to the selected phase-sequence state, and the traffic signal lamp selects the next action in the same way as the previous process;
step 4, defining feedback value of lower layer neural network
Let kᵢ denote the number of vehicles arriving on the road network between time period i and time period i + 1, and record the waiting time of the j-th vehicle in time period i; the feedback value in the i-th time period is defined as:
step 5, modeling of the lower layer neural network
The collected vehicle speed and position information is first formed into a matrix, and the data then passes through three convolutional layers constructed as follows: the first convolutional layer contains 32 filters, each of size 4 × 4, moving over the input with a stride of 4 × 4; the second convolutional layer has 64 filters, each of size 2 × 2 and stride 2 × 2, the output size after the two convolutional layers being 30 × 30 × 64; the third convolutional layer has 128 filters of size 2 × 2 with stride 1 × 1, and its output is a 30 × 30 × 128 tensor, which a fully connected layer converts into a 128 × 1 vector; after the fully connected layer, the data is split into two parts of the same size, 64 × 1; the first part represents the state value function V(s), the value function of the static state of the current road network; the second part represents the state-dependent action advantage function A(a), the additional change in road-network delay brought by selecting a certain action; V(s) and A(a) are combined to obtain the Q value of each action, where the Q function predicts, from the current road-network traffic state, the expected value of the maximum accumulated feedback obtained by taking a as the first action from state s, and the controller executes the optimal signal-switching strategy under the current neural network;
step 6, optimization of lower-layer reinforcement learning network
1) Deep reinforcement learning network
Updating the neural network with mean square error:
wherein Qtarget(s, a) represents the target Q value of taking action a in state s, θ represents the network parameters, and p(s) represents the probability of state s occurring in a training batch;
parameters in the main neural network are updated by back propagation, where θ-The update is based on θ in the following equation:
θ⁻ = αθ⁻ + (1 − α)θ
α is the update rate, representing the degree to which the new parameters influence the target network;
wherein Q(s, a; θᵢ) represents the output Q value of the current network, used to evaluate the Q value of the current state-action pair, Q(s, a; θᵢ⁻) represents the output of the target value network, and Qtarget approximates the optimization objective of the value function, i.e., the target Q value;
2) determining neural network parameters
Calculating the priority probability of experience samples based on a sorting method, wherein the error delta of a sample i is defined as:
δᵢ = |Q(s, a; θ)ᵢ − Qtarget(s, a)ᵢ|
the errors δ are sorted, the priority pᵢ of experience i is the reciprocal of its rank, and the probability Pᵢ of sampling experience i is:

Pᵢ = pᵢ^τ / Σₖ pₖ^τ
where τ represents how strongly the priorities are used;
let J (θ) denote the loss function, calculate the parametric gradient g:
the first and second order bias moments s and r are updated with exponential moving averages, respectively.
s = ρₛs + (1 − ρₛ)g
r = ρᵣr + (1 − ρᵣ)g ⊙ g
Where ρ issAnd ρrFirst-order and second-order exponential decay rates are respectively adopted, and the time step t is used for correcting the first-order and second-order bias moments, so that the corrected result is closer to the real gradient.
calculating the gradient update element by element:

Δθ = −ε ŝ / (√r̂ + δ)

and the final parameter update:

θ = θ + Δθ

where ε is the initial learning rate, δ is a constant that stabilizes the value, and ŝ and r̂ are the bias-corrected first- and second-order moments;
the final loss function J is:

J(θ) = E[(Qtarget(s, a) − Q(s, a; θ))²];
step 7, defining upper layer state space
each agent in the system is the traffic signal controller of one intersection; an upper-layer controller at the network level jointly controls a region formed by several lower-layer intersection signal controllers, the intersections being numbered 1, 2, …, ζ;
step 8, defining upper layer action space
let j be the green-light adjustment time and r̄ the average delay of vehicles over all intersections; if the average delay at intersection ζ is rζ, the phase green time at that intersection is adjusted to:
Step 9, upper layer neural network feedback value definition
the feedback value rₖ of the upper-layer Agent is defined as the average delay of vehicles at all intersections, serves as the feedback value of the traffic signal control system, and is calculated as the input of the next cycle:
where m is the number of intersections and n denotes the current cycle index.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010287076.2A CN112365724B (en) | 2020-04-13 | 2020-04-13 | Continuous intersection signal cooperative control method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112365724A true CN112365724A (en) | 2021-02-12 |
CN112365724B CN112365724B (en) | 2022-03-29 |
Family
ID=74516407
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113257008A (en) * | 2021-05-12 | 2021-08-13 | 兰州交通大学 | Pedestrian flow dynamic control system and method based on deep learning |
CN113299069A (en) * | 2021-05-28 | 2021-08-24 | 广东工业大学华立学院 | Self-adaptive traffic signal control method based on historical error back propagation |
CN113299078A (en) * | 2021-03-29 | 2021-08-24 | 东南大学 | Multi-mode traffic trunk line signal coordination control method and device based on multi-agent cooperation |
CN113487902A (en) * | 2021-05-17 | 2021-10-08 | 东南大学 | Reinforced learning area signal control method based on vehicle planned path |
CN113643543A (en) * | 2021-10-13 | 2021-11-12 | 北京大学深圳研究生院 | Traffic flow control method and traffic signal control system with privacy protection function |
CN113724507A (en) * | 2021-08-19 | 2021-11-30 | 复旦大学 | Traffic control and vehicle induction cooperation method and system based on deep reinforcement learning |
CN114627657A (en) * | 2022-03-09 | 2022-06-14 | 哈尔滨理工大学 | Adaptive traffic signal control method based on deep graph reinforcement learning |
CN114898576A (en) * | 2022-05-10 | 2022-08-12 | 阿波罗智联(北京)科技有限公司 | Traffic control signal generation method and target network model training method |
CN115171408B (en) * | 2022-07-08 | 2023-05-30 | 华侨大学 | Traffic signal optimization control method |
CN117114079A (en) * | 2023-10-25 | 2023-11-24 | 中泰信合智能科技有限公司 | Method for migrating single intersection signal control model to target environment |
CN117173914A (en) * | 2023-11-03 | 2023-12-05 | 中泰信合智能科技有限公司 | Road network signal control unit decoupling method, device and medium for simplifying complex model |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1389820A (en) * | 2001-06-05 | 2003-01-08 | 郑肖惺 | Intelligent city traffic controlling network system |
CN104464310A (en) * | 2014-12-02 | 2015-03-25 | 上海交通大学 | Signal collaborative optimization control method and system of multiple intersections of urban region |
CN105118308A (en) * | 2015-10-12 | 2015-12-02 | 青岛大学 | Method based on clustering reinforcement learning and used for optimizing traffic signals of urban road intersections |
CN107705557A (en) * | 2017-09-04 | 2018-02-16 | 清华大学 | Road network signal control method and device based on depth enhancing network |
CN109472984A (en) * | 2018-12-27 | 2019-03-15 | 苏州科技大学 | Signalized control method, system and storage medium based on deeply study |
CN109559530A (en) * | 2019-01-07 | 2019-04-02 | 大连理工大学 | A kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning |
CN110060475A (en) * | 2019-04-17 | 2019-07-26 | 清华大学 | A kind of multi-intersection signal lamp cooperative control method based on deeply study |
CN110264750A (en) * | 2019-06-14 | 2019-09-20 | 大连理工大学 | A kind of multi-intersection signal lamp cooperative control method of the Q value migration based on multitask depth Q network |
CN110428615A (en) * | 2019-07-12 | 2019-11-08 | 中国科学院自动化研究所 | Learn isolated intersection traffic signal control method, system, device based on deeply |
US20190347933A1 (en) * | 2018-05-11 | 2019-11-14 | Virtual Traffic Lights, LLC | Method of implementing an intelligent traffic control apparatus having a reinforcement learning based partial traffic detection control system, and an intelligent traffic control apparatus implemented thereby |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||