CN112365724A - Continuous intersection signal cooperative control method based on deep reinforcement learning - Google Patents

Continuous intersection signal cooperative control method based on deep reinforcement learning

Info

Publication number
CN112365724A
CN112365724A (application CN202010287076.2A)
Authority
CN
China
Prior art keywords
value
intersection
state
network
action
Prior art date
Legal status
Granted
Application number
CN202010287076.2A
Other languages
Chinese (zh)
Other versions
CN112365724B (en)
Inventor
王庞伟
冯月
汪云峰
张名芳
王力
Current Assignee
North China University of Technology
Original Assignee
North China University of Technology
Priority date
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN202010287076.2A priority Critical patent/CN112365724B/en
Publication of CN112365724A publication Critical patent/CN112365724A/en
Application granted granted Critical
Publication of CN112365724B publication Critical patent/CN112365724B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/07: Controlling traffic signals
    • G08G1/08: Controlling traffic signals according to detected number or speed of vehicles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/01: Detecting movement of traffic to be counted or controlled
    • G08G1/0104: Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125: Traffic data processing
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/01: Detecting movement of traffic to be counted or controlled
    • G08G1/0104: Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0137: Measuring and analyzing of parameters relative to traffic conditions for specific applications
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/07: Controlling traffic signals
    • G08G1/081: Plural intersections under common control

Abstract

The invention provides a continuous intersection signal cooperative control method based on deep reinforcement learning. It adopts a DQN strategy with upper- and lower-layer Agent networks to handle continuous-intersection signal timing, which reduces the complexity of state acquisition and feedback evaluation and addresses the continuous-intersection signal optimization problem. To keep the training target stable and to avoid oscillation and divergence in the feedback loop between the target value and the predicted value, a Dueling Double optimization method is adopted to train the DQN.

Description

Continuous intersection signal cooperative control method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of vehicle-road cooperation and main-line coordination control, and particularly relates to a traffic signal control model based on vehicle-road cooperation and deep reinforcement learning, suitable for coordinated main-line control of any adjacent upstream and downstream intersections on a road section.
Background
For an urban road network, intersections are the nodes where the traffic flows of connected road sections are exchanged, and they are also the main bottlenecks restricting road-network capacity. A traffic signal control system assigns right of way to conflicting traffic flows and thereby separates them, improving the efficiency and safety of the intersection. Scholars at home and abroad have carried out extensive research and application in traffic signal control and have successively proposed many signal control methods based on deep reinforcement learning, which play an important role in improving urban traffic conditions and relieving congestion. At present, however, control of intersection traffic flow generally targets a single intersection: the intersection traffic flow is predicted with a convolutional neural network and timing control is then applied to the prediction result. In addition, for continuous-intersection signal control, the large number of intersections makes it difficult to establish a state-space model; even when such a model is established, the complexity of traditional reinforcement learning grows exponentially when multi-intersection models in different states are processed, which increases the complexity of state acquisition and feedback evaluation.
(1) Related prior art
1. Traffic signal control and optimization
In recent years, rapid social and economic development has driven urban traffic demand to grow quickly. As the number of motor vehicles increases sharply, the load on urban roads rises and traffic congestion becomes increasingly common. Congestion increases travel delay, lowers driving speed, induces traffic accidents, and directly affects people's working efficiency and health. Unreasonable traffic signal configuration and insufficient coordination and optimization at urban road intersections are the main technical causes, so improving traffic management measures, in particular optimizing intersection signal timing, can effectively improve the smoothness and safety of urban arterial traffic.
2. Deep reinforcement learning
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning in a general framework, and can realize direct control from raw input to output through end-to-end learning.
3. Urban main-line coordination
Urban main-line coordinated control is an important topic in intelligent transportation research. Because urban traffic is random and time-varying, an accurate scientific model is difficult to establish; the average queue length in each phase direction at each intersection is used as the evaluation parameter of control quality, and several adjacent intersections are taken together as the urban trunk-line control model.
(2) Deficiencies of the prior art
1. Existing deep reinforcement learning models usually consider only the control of a single intersection and do not address cooperative control of multiple intersections; neural network modeling for multiple intersections still requires further study. When multi-intersection models in different states are processed simultaneously, the complexity of traditional reinforcement learning grows exponentially and state acquisition and feedback evaluation become more complex, so the multi-intersection case needs to be modeled explicitly. In addition, existing signal control methods operate on a cycle-by-cycle basis; under the premise of an interactive connected-vehicle environment and big data, a model-free adaptive control method is an important solution.
2. Existing green-wave optimization algorithms usually require a large amount of computation and lack flexibility. Most past research has addressed a single-phase coordination path, so traditional models are insufficient for the continuous-intersection case, and the benefit of intersection signal coordination is low.
3. The main difficulties in coordinated design of continuous intersections are the high density of cross streets and the uneven spacing between intersections, which make conventional green-wave band control hard to realize; the requirement that two-way traffic on a road section arrive at the same intersection almost simultaneously is too strict. Various fine-grained optimization algorithms can, in theory, partially satisfy the capacity of the road network, but their practicality and operability are hard to guarantee. If neural network techniques are combined with an optimization scheme under a vehicle-road cooperative environment to control the traffic lights, and the signal phases of adjacent intersections are controlled in coordination, the signal control problem of continuous intersections can be solved.
Disclosure of Invention
Aiming at the deficiencies of the three related technologies above, the invention makes full use of the vehicle-road cooperation theory, takes upper- and lower-layer neural networks as the basic premise of intersection-group traffic signal control, analyzes and calculates the disturbance that an intersection control scheme in this environment exerts on the coordination of adjacent intersection groups, and establishes a signal lamp control method based on upper- and lower-layer neural networks. Intersection phases are switched in real time according to different road environments and traffic states, the cooperation among intersections is strengthened, smooth passage through the intersections is guaranteed, and the capacity of the intersections is improved, providing a new solution and theoretical basis for relieving congestion, improving travel efficiency, and reducing accidents. The invention specifically adopts the following technical scheme:
step 1, building continuous intersection upper and lower layer signal control model framework
Firstly, generating data of a training batch, and storing the current state and action and a received feedback value as a quadruple (s, a, r, s') in a memory for updating parameters of a main neural network; then the model randomly selects an action while storing a sufficient number of samples; then updating the learning rate in the neural network through back propagation; finally, carrying out secondary adjustment on the green light duration of all the intersections according to the average delay of the global vehicles and the average delay of the vehicles at each intersection;
step 2, defining the state space of the lower layer neural network
Taking the vehicle waiting time, the vehicle delay and the signal lamp phase change of each direction of the intersection as state input, and carrying out discretization modeling on the intersection area:
dividing the intersection into rectangular grids with the same size, wherein each lane is divided into grids and is regarded as a cell, detecting vehicle state information by a detector, and expressing the speed and position information detected in time t by single-channel convolution for each small square area; if the detector does not detect the vehicle, zero filling is carried out on the block, and finally the obtained speed and position matrix is used as the state information of the whole road network;
step 3, action selection of lower-layer neural network
Selecting proper actions to guide vehicles at the intersection according to the current traffic state, taking the switching between phases as the action space, discretizing the unit time of the cycle into 5 seconds, updating the current phase-sequence state into the selected phase-sequence state after switching, and selecting the next action of the traffic signal lamp in the same way as the previous process;
step 4, defining feedback value of lower layer neural network
Let k_i denote the number of vehicles arriving on the road network between time period i and time period i+1, and let w_j^i denote the waiting time of the j-th vehicle in time period i; the feedback value for the i-th time period is defined as the reduction in average vehicle waiting time:
r_i = (1/k_i)·Σ_{j=1..k_i} w_j^i − (1/k_{i+1})·Σ_{j=1..k_{i+1}} w_j^(i+1);
step 5, modeling of the lower layer neural network
The collected vehicle speed and position information is first matrixed and the data is then processed through three convolutional layers, which are constructed as follows: the first convolutional layer contains 32 filters, each filter has a size of 4 × 4, and each time the data input is shifted by 4 × 4/step; the second convolutional layer has 64 filters, each filter has a size of 2 × 2, it moves 2 × 2/step, the output size after two convolutional layers is 30 × 30 × 64; the third convolutional layer has 128 filters, the size is 2 × 2, the step size is 1 × 1, the output of the third convolutional layer is 30 × 30 × 128 tensor, and one fully connected layer converts the tensor into a 128 × 1 matrix; after the layer is fully connected, the data is split into two parts of the same size, 64 × 1; the first part represents a state value function V(s) and represents a value function of the static state of the current road network; the second part represents a state-dependent action dominance function A (a) which represents a road network delay change value additionally brought by selecting a certain action, and combines V(s) and A (a) to obtain a Q value of each action, wherein the Q function represents that an average expected value is predicted by the current road network traffic state by using a as the maximum accumulated feedback value of the first action from the state s, and an optimal signal switching strategy under the current neural network is executed by a controller;
step 6, optimization of lower-layer reinforcement learning network
1) Deep reinforcement learning network
Updating the neural network with the mean square error:
L(θ) = Σ_s p(s)·[Q_target(s,a) − Q(s,a;θ)]²
where Q_target(s,a) denotes the target Q value for taking action a in state s, θ denotes the network parameters of the mean-square-error loss, and p(s) denotes the probability of state s occurring in a training batch;
parameters in the main neural network are updated by back propagation, and the target network parameters θ⁻ are updated from θ according to:
θ⁻ = αθ⁻ + (1 − α)θ
where α is the update rate and represents the degree of influence of the new parameters on the target network;
here Q(s,a;θ_i) denotes the output Q value of the current network and is used to evaluate the Q value of the current state-action pair, Q(s′,a′;θ_i⁻) denotes the output of the target-value network, and r + γ·max_{a′} Q(s′,a′;θ_i⁻) approximately represents the optimization objective of the value function, i.e., the target Q value;
2) determining neural network parameters
Calculating the priority probability of experience samples with a rank-based method, where the error δ of sample i is defined as:
δ_i = |Q(s,a;θ)_i − Q_target(s,a)_i|
The errors δ are sorted, the priority p_i of each experience is the reciprocal of its rank, and P_i is the probability of sampling experience i:
P_i = p_i^τ / Σ_k p_k^τ
where τ determines how strongly the priorities are used;
let J (θ) denote the loss function, calculate the parametric gradient g:
Figure RE-GDA0002891513160000054
the first and second order bias moments s and r are updated with exponential moving averages, respectively.
s=ρss+(1-ρs)g
r=ρrr+(1-ρr)g
Where ρ issAnd ρrFirst and second order exponential decay rates, respectively, using the time step t to correct the first and second order bias moments so that the corrected results are closer to each otherThe true gradient.
Figure RE-GDA0002891513160000055
Figure RE-GDA0002891513160000056
Calculating a gradient update:
Figure RE-GDA0002891513160000057
the final parameter updates are as follows:
Figure RE-GDA0002891513160000061
wherein oa isrIs the initial learning rate of the initial learning rate,
Figure RE-GDA0002891513160000062
is a constant that stabilizes the value.
The final loss function J is the mean squared temporal-difference error over the sampled batch:
J(θ) = (1/m)·Σ_i [Q_target(s,a)_i − Q(s,a;θ)_i]²
step 7, defining upper layer state space
Each agent in the system is the traffic signal controller of one intersection; the upper-layer controller of the hierarchical network control jointly controls the area formed by the lower-layer intersection signal controllers, and the intersections are numbered 1, 2, …, ζ;
step 8, defining upper layer action space
Let j be the green light adjustment time and let the average delay of vehicles over all intersections be
r̄ = (1/m)·Σ_{ζ=1..m} r_ζ
If the average delay at the current intersection ζ is r_ζ, the phase green time at that intersection is increased or decreased by the adjustment time j according to how r_ζ compares with the global average r̄;
Step 9, upper layer neural network feedback value definition
The feedback value r_k of the upper-layer Agent is defined as the average delay of all vehicles at the intersections, serves as the feedback value of the traffic signal control system, and is calculated as the input of the next cycle:
r_k = (1/m)·Σ_{ζ=1..m} r_ζ^n
where m is the number of intersections and n indicates that the current cycle is the n-th cycle.
Drawings
Fig. 1 is a schematic diagram of the upper- and lower-layer traffic signal control of continuous intersections.
Fig. 2 is a diagram of a global model framework for an upper and lower network.
FIG. 3 is a diagram of the regionalized, discretized modeling of the intersection.
Fig. 4 is an MDP loop flow diagram.
FIG. 5 is a schematic diagram of a convolutional neural network processing vehicle information.
Fig. 6 is a model framework diagram of DQN.
Fig. 7 is a training flow diagram of DQN.
Fig. 8 is an upper state space definition diagram.
Fig. 9 is a schematic diagram of the upper state space.
Detailed Description
Step 1, controlling model frame for upper and lower layer signals of continuous intersection
A scene with upper- and lower-layer traffic signal control of continuous intersections is established: each intersection is equipped with detectors for collecting vehicle information, and the vehicles themselves carry sensors for information acquisition. Traffic signal timing data, vehicle running states, actual road conditions, and other information are obtained in real time through vehicular networking technology and various sensors, and deep reinforcement learning with a neural network is then used to predict the signal timing that matches the current traffic state.
The invention divides the control of the signal lamps at the continuous intersections into upper-layer and lower-layer control: the lower-layer Agent is the traffic signal controller of each intersection, and each controller has its own learning strategy; the upper-layer Agent mainly adjusts the tentative strategy of the lower-layer Agents. The two layers of controllers jointly control the signal lights of the whole area, and the multi-agent system is modeled as shown in figure 1.
Fig. 2 is the global model framework of the upper- and lower-layer neural networks. The main convolutional neural network takes the current intersection state and the tentative phase-switching action and uses the feedback value to select the most valuable action. First, a training batch of data is generated, and the current state, the action, and the received feedback value are stored in memory as a quadruple (s, a, r, s') to update the parameters of the main neural network. The target network θ⁻ is a separate neural network that increases learning stability. Using Double DQN in the main and target networks reduces overestimation and improves performance: the model is trained on the Q value of each action, and the action with the largest Q value is selected to obtain the optimal strategy. The model then selects actions randomly while a sufficient number of samples are stored. Before training, every sample has the same priority and samples are randomly divided into small batches for training. After each training step, sample priorities are updated so that samples are selected with different probabilities, and the learning rate in the neural network is then updated through Adam back-propagation. The model derives an initial control scheme from the learned Q values and the action-selection operation with the largest Q value. Finally, the green times of all intersections are adjusted a second time according to the global average vehicle delay and the average vehicle delay at each intersection, so that through learning the model reacts appropriately to different traffic scenes and reduces vehicle delay.
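A minimal sketch of this training loop is given below, assuming the (s, a, r, s') quadruples are stored in a replay memory as described; the env and agent interfaces (observe, apply, best_action, learn), the buffer capacity, and the exploration rate are illustrative placeholders rather than details taken from the patent.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (s, a, r, s') quadruples used to update the main neural network."""
    def __init__(self, capacity=20000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def training_step(env, agent, memory, batch_size=32, epsilon=0.1):
    """One interaction plus learning step of a lower-layer Agent (sketch)."""
    state = env.observe()                                 # speed/position matrix of the intersection
    if random.random() < epsilon or len(memory) < batch_size:
        action = random.randrange(agent.num_actions)      # explore while filling the memory
    else:
        action = agent.best_action(state)                 # action with the largest Q value
    reward, next_state = env.apply(action)                # switch the phase, collect the feedback value
    memory.push(state, action, reward, next_state)        # store the quadruple
    if len(memory) >= batch_size:
        agent.learn(memory.sample(batch_size))            # back-propagation update (Adam)
```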
Step 2, defining the state space of the lower layer neural network
In order to accurately describe traffic information at an intersection, a vehicle waiting time W, a vehicle delay D, and a signal lamp phase change C for each direction at the intersection are input as states. In addition, in order to accurately represent the specific distribution of the position and speed information of the vehicles at the intersection, discretized modeling is performed on the intersection area.
As shown in fig. 3, the whole intersection is divided into rectangular grids of the same size; to reduce the amount of calculation and save computing resources, the speed and position information of the vehicles is stored in a matrix. Each lane is divided into grids, each grid is regarded as a cell, and the detectors collect the vehicle state information; for each small square area, the speed and position detected within time t are expressed as a single-channel matrix Q. If the detector does not detect a vehicle, the cell is filled with zero. Finally, the resulting speed and position matrices are taken as the state information of the whole road network.
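The grid discretization can be sketched as follows; the cell size, approach length, speed normalization, and per-vehicle record format are assumptions made for illustration, since the text only specifies equal-size rectangular cells with zero filling for empty cells.

```python
import numpy as np

def build_state_matrix(vehicles, lane_length_m=150.0, cell_m=5.0, num_lanes=8, v_max=16.7):
    """Discretize the intersection approaches into equal cells and fill a
    position channel and a normalized speed channel from detector data."""
    cells = int(lane_length_m / cell_m)
    position = np.zeros((num_lanes, cells), dtype=np.float32)
    speed = np.zeros((num_lanes, cells), dtype=np.float32)
    for lane, dist_m, speed_mps in vehicles:          # one (lane, distance, speed) record per detected vehicle
        cell = min(int(dist_m / cell_m), cells - 1)
        position[lane, cell] = 1.0                    # cell is occupied
        speed[lane, cell] = speed_mps / v_max         # normalized speed; undetected cells stay zero
    return np.stack([position, speed])                # state information of the road network
```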
Step 3, action selection of lower-layer neural network
The traffic signal control system selects an appropriate action to guide the vehicles at the intersection according to the current traffic state. The invention takes the switching between phases as the action space and models the phase-switching process as a Markov Decision Process (MDP). The MDP is a sequential-decision mathematical model used to describe the random strategy and the feedback value that an agent can obtain in a traffic scene whose system state has the Markov property; the optimal switching strategy is then learned by trial and error in deep reinforcement learning combined with the MDP control strategy.
In fig. 4, each loop represents the phase transition of the intersection signal lamp within one time-slot cycle; the unit time of the cycle is discretized into 5 seconds, after switching the current phase-sequence state is updated to the selected phase-sequence state, and the traffic signal lamp selects the next action in the same way as before. In addition, so that the model can learn to switch phases, maximum and minimum durations are set for each lamp color; the invention sets the maximum and minimum phase times to 60 seconds and 5 seconds. If the green time of a phase reaches 60 seconds, the controller is forced to switch to the next phase, and the scheme is updated iteratively on the basis of the original control scheme.
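The phase-holding and forced-switching logic described above can be sketched as below; only the 5-second step and the 5/60-second bounds come from the text, while the greedy tie-breaking for the forced switch is an assumed detail.

```python
MIN_GREEN_S = 5    # minimum phase duration stated in the text
MAX_GREEN_S = 60   # maximum phase duration stated in the text
STEP_S = 5         # the cycle unit time is discretized into 5-second steps

def choose_phase(q_values, current_phase, elapsed_green_s):
    """Pick the next phase subject to the minimum/maximum green constraints."""
    if elapsed_green_s < MIN_GREEN_S:
        return current_phase                                   # hold: minimum green not yet served
    if elapsed_green_s >= MAX_GREEN_S:
        ranked = sorted(range(len(q_values)), key=lambda a: -q_values[a])
        return next(a for a in ranked if a != current_phase)   # forced switch to the best other phase
    return max(range(len(q_values)), key=lambda a: q_values[a])  # otherwise greedy choice
```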
Step 4, defining feedback value of lower layer neural network
To provide feedback to the reinforcement learning model regarding previous performance, feedback values are defined to assist traffic signals in taking optimal action strategies. In order to reduce the average delay of the vehicle, the invention defines the Reward as the average delay reduction value of the vehicle within a time period, so that the Reward is ensured to be positive during training.
Let k_i denote the number of vehicles arriving on the road network between time period i and time period i+1, and let w_j^i denote the waiting time of the j-th vehicle in time period i. The feedback value for the i-th time period is defined as the reduction in average vehicle waiting time:
r_i = (1/k_i)·Σ_{j=1..k_i} w_j^i − (1/k_{i+1})·Σ_{j=1..k_{i+1}} w_j^(i+1)
from the formula, if riWhen the vehicle is larger, the average waiting time is longer than before, and r is ensured to achieve the purpose of continuously reducing the vehicle delayiThe maximum is taken as much as possible.
Step 5, modeling of the lower layer neural network
The invention uses a main network and a target network with the same structure: the main network θ updates its weights in real time, while the target network θ⁻ is updated after the main network has been updated several times. The Q value is updated jointly from a state value function V(s) and an action advantage function A(a), the optimizer is Adam, and an ε-greedy strategy and experience replay are adopted during learning.
The structure of the lower-layer CNN is shown in fig. 5. It consists of three convolutional layers and three fully connected layers. The vehicle speed and position matrix first passes through the three convolutional layers, each of which includes convolution, pooling, and a nonlinear activation function. A convolutional layer contains several filters, each holding a set of weights; each filter moves across the input by a defined stride to produce the input of the next layer, and different filters have different weights and generate different features in the next layer. The invention uses the Leaky ReLU function as the activation function:
f(x) = x, if x > 0;  f(x) = βx, if x ≤ 0
where x represents the input of the unit and β is a small constant. Compared with the conventional ReLU function, introducing β avoids dead neurons caused by a zero gradient on the negative side. The Leaky ReLU function can converge faster than other activation functions (e.g., tanh and sigmoid), which increases the convergence rate of the vehicle delay during training.
The neural network modeling is shown in fig. 5 below.
FIG. 5 shows how the vehicle speed and position matrix is processed in the convolutional neural network: the collected information is first arranged into a matrix and the data are then processed by three convolutional layers. The three convolutional layers and the fully connected layer are constructed as follows: the first convolutional layer contains 32 filters, each of size 4 × 4, moving over the input with a stride of 4 × 4; the second convolutional layer has 64 filters of size 2 × 2 with a stride of 2 × 2, and the output after the two convolutional layers has size 30 × 30 × 64; the third convolutional layer has 128 filters of size 2 × 2 with a stride of 1 × 1, so its output is a 30 × 30 × 128 tensor, which one fully connected layer converts into a 128 × 1 vector. After the fully connected layer, the data are split into two parts of the same size, 64 × 1. The first part represents the state value function V(s), the value of the static state of the current road network; the second part represents the state-dependent action advantage function A(a), the additional change in road-network delay brought by selecting a certain action. Since the number of possible actions equals the number k of legal phases, A(a) has size k × 1. The two parts are combined again to obtain the Q value of each action; the parameters of the CNN are denoted θ, so Q(s,a) becomes Q(s,a;θ), where θ denotes the network parameters of the mean-square-error loss. The Q function represents the maximum accumulated feedback value obtained by starting from state s and taking a as the first action, i.e., the average expected value predicted from the current road-network traffic state, and the controller executes the optimal signal-switching strategy under the current neural network.
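A sketch of this dueling architecture in PyTorch is shown below; the number of phases, the use of adaptive pooling to reach a 128-wide feature vector, and the exact strides and padding are assumptions that only approximate the stated 30 × 30 × 64 and 30 × 30 × 128 tensor sizes.

```python
import torch
import torch.nn as nn

class DuelingIntersectionNet(nn.Module):
    """Sketch of the lower-layer CNN: three conv layers, a shared fully
    connected layer, then separate V(s) and A(s,a) streams combined into Q."""
    def __init__(self, in_channels=2, num_phases=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=4), nn.LeakyReLU(0.01),
            nn.Conv2d(32, 64, kernel_size=2, stride=2), nn.LeakyReLU(0.01),
            nn.Conv2d(64, 128, kernel_size=2, stride=1, padding=1), nn.LeakyReLU(0.01),
            nn.AdaptiveAvgPool2d(1),      # collapse spatial dimensions
            nn.Flatten(),                 # -> (batch, 128)
        )
        self.fc = nn.Sequential(nn.Linear(128, 128), nn.LeakyReLU(0.01))
        self.value_head = nn.Sequential(nn.Linear(64, 64), nn.LeakyReLU(0.01), nn.Linear(64, 1))
        self.adv_head = nn.Sequential(nn.Linear(64, 64), nn.LeakyReLU(0.01), nn.Linear(64, num_phases))

    def forward(self, x):
        h = self.fc(self.features(x))
        v_part, a_part = h[:, :64], h[:, 64:]          # split into two 64-wide halves
        v = self.value_head(v_part)                    # state value V(s)
        a = self.adv_head(a_part)                      # action advantage A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)     # dueling combination -> Q(s, a)
```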
Step 6, optimization of lower-layer reinforcement learning network
The core of the DQN model is the convolutional neural network, which is trained with Q-learning: its input is the raw road-network data matrix and its output is the estimated Q value of the optimal strategy. Fig. 6 and fig. 7 below are the framework diagram and the flow chart of the DQN, respectively.
The matrix containing vehicle position and speed information passes through the convolutional layers and the fully connected layers, and a vector containing the Q value of each action is then output from the input state and action.
1) Deep reinforcement learning network
During DQN training, a deep convolutional network is used to approximate the current value function, while another network is used to produce the target Q value. Specifically, let Q_target(s,a) denote the target Q value for taking action a in state s and let θ denote the network parameters of the mean-square-error loss; the neural network is updated with the mean square error (MSE) as follows:
L(θ) = Σ_s p(s)·[Q_target(s,a) − Q(s,a;θ)]²
where p(s) denotes the probability of state s occurring in a training batch. In order to obtain a stable update in each iteration (a steady reduction of the road-network delay during training), a separate target network θ⁻, identical in structure to the main neural network but with different parameters, is used to generate the Q value.
Parameters in the main neural network are updated by back propagation, and θ⁻ is updated from θ according to:
θ⁻ = αθ⁻ + (1 − α)θ (4)
where α is the update rate and represents the degree of influence of the new parameters on the target network.
Here Q(s,a;θ_i) denotes the output Q value of the current network and is used to evaluate the Q value of the current state-action pair, Q(s′,a′;θ_i⁻) denotes the output of the target-value network, and r + γ·max_{a′} Q(s′,a′;θ_i⁻) approximately represents the optimization objective of the value function, i.e., the target Q value.
After the parameters θ of the current value network have been updated, they are copied to the target value network θ⁻ every N iterations. The network parameters are updated by minimizing the mean square error between the current Q value and the target value Q_target, which keeps the error terms of the network within a bounded interval and keeps the Q values and gradients in a reasonable range, helping the road-network delay to decrease stably.
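A sketch of this loss and of the target-network update of Eq. (4) follows, assuming a soft update applied at every step rather than a periodic copy; the discount factor and update rate are illustrative values.

```python
import torch
import torch.nn.functional as F

def soft_update(target_net, main_net, alpha=0.001):
    """theta_minus = alpha * theta_minus + (1 - alpha) * theta, as in Eq. (4)."""
    for tp, p in zip(target_net.parameters(), main_net.parameters()):
        tp.data.copy_(alpha * tp.data + (1.0 - alpha) * p.data)

def dqn_loss(main_net, target_net, batch, gamma=0.95):
    """Mean square error between the predicted Q value and the target Q value."""
    states, actions, rewards, next_states = batch
    q_pred = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values   # target network generates the Q value
        q_target = rewards + gamma * q_next
    return F.mse_loss(q_pred, q_target)
```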
2) Dueling DQN optimization method
In some states s_t, for example when there are too few or too many vehicles on the road network, whatever action a_t is taken does not affect the delay of the next state s_{t+1}; that is, the correlation between the current state-action value function and the current action selection is weak, which easily prevents the road-network delay from converging in such states. To solve this problem, the invention adopts Dueling DQN to improve the learning effect and the convergence speed of the DQN.
On the basis of the original network, a deep network is used to fit the Q value in reinforcement learning, and the Q-value function is divided into a state value V and an action advantage A; the Q value is updated by combining the state value and the action advantage.
In the neural network, the overall expected feedback value of taking actions under the policy in future steps is represented by the state value V(s;θ); for each action, A(s,a;θ) is the advantage function, defined as the cumulative discounted return of the current actual action compared with the optimal action, and the Q value is the sum of the state value and the state-dependent advantage:
Q(s,a;θ) = V(s;θ) + ( A(s,a;θ) − (1/|A|)·Σ_{a′} A(s,a′;θ) )
where A(s,a′,θ) represents the influence of the taken action on the Q-value function: if the A value of an action is positive, that action reduces the delay better than the other actions; conversely, if the A value of an action is negative, the potential feedback value of that action is below average. Compared with using the raw Q value directly, this improves the stability of the model and reduces the average vehicle delay.
3) Double DQN optimization method
The traditional DQN suffers from overestimation: owing to estimation non-uniformity, an overestimation problem arises during parameter updating and iteration, so that the current phase-switching scheme is not the optimal one. To prevent the Q value from being overestimated, the Q_target values are updated with the Double DQN algorithm.
Q_target(s,a) = r + γ·Q′(s′, argmax_{a′} Q(s′,a′;θ); θ⁻) (6)
There are two Q networks in the above formula: the online network Q selects the action with the largest value in the next state, and the Q′ function (the target network) evaluates that action. This reduces the overestimation problem and thereby effectively lowers the average vehicle delay on the road network.
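Equation (6) can be sketched as follows, with the online network choosing the action and the target network evaluating it; the discount factor is an assumed value.

```python
import torch

def double_dqn_target(main_net, target_net, rewards, next_states, gamma=0.95):
    """Q_target(s,a) = r + gamma * Q'(s', argmax_a' Q(s', a'; theta); theta_minus), Eq. (6)."""
    with torch.no_grad():
        best_actions = main_net(next_states).argmax(dim=1, keepdim=True)     # chosen by the online network
        q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluated by the target network
    return rewards + gamma * q_eval
```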
4) Neural network parameters
The invention adopts a rank-based prioritized experience replay structure to increase learning efficiency, which raises the replay probability of samples associated with lower average delay. The priority probability of an experience sample is calculated with the rank-based method, where the error δ of sample i is defined as:
δ_i = |Q(s,a;θ)_i − Q_target(s,a)_i| (7)
The errors δ are sorted, the priority p_i of each experience is the reciprocal of its rank, and P_i is the probability of sampling experience i:
P_i = p_i^τ / Σ_k p_k^τ
where τ determines how strongly the priorities are used; when τ is 0, sampling is uniform (random).
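A sketch of the rank-based sampling probabilities of Eq. (7) and the expression above follows; the value of τ and the batch size in the usage comment are assumptions.

```python
import numpy as np

def rank_based_probabilities(td_errors, tau=0.7):
    """Rank-based priorities: p_i = 1 / rank(i); sampling probability
    P_i = p_i^tau / sum_k p_k^tau. tau = 0 recovers uniform sampling."""
    td_errors = np.asarray(td_errors, dtype=np.float64)
    ranks = np.empty_like(td_errors)
    order = np.argsort(-np.abs(td_errors))        # largest |delta| gets rank 1
    ranks[order] = np.arange(1, len(td_errors) + 1)
    priorities = 1.0 / ranks
    weighted = priorities ** tau
    return weighted / weighted.sum()

# usage sketch: indices = np.random.choice(len(errors), size=32, p=rank_based_probabilities(errors))
```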
The optimizer of the neural network model is Adam (adaptive moment estimation); the Adam algorithm designs an independent adaptive learning rate for each parameter by computing first-moment and second-moment estimates of the gradient, which speeds up convergence and improves the model. Let J(θ) denote the loss function and compute the parameter gradient g:
g = ∇_θ J(θ) (9)
The first-order and second-order biased moments s and r are updated with exponential moving averages:
s = ρ_s·s + (1 − ρ_s)·g (10)
r = ρ_r·r + (1 − ρ_r)·g⊙g (11)
where ρ_s and ρ_r are the first-order and second-order exponential decay rates, and the time step t is used to correct the biased moments so that the corrected results are closer to the true gradient:
ŝ = s / (1 − ρ_s^t)
r̂ = r / (1 − ρ_r^t)
Calculating the gradient update (element by element):
Δθ = −ε_r·ŝ / (√r̂ + δ)
The final parameter update is as follows:
θ = θ + Δθ
where ε_r is the initial learning rate and δ is a small constant that stabilizes the value.
The final loss function J is the mean squared temporal-difference error over the sampled batch:
J(θ) = (1/m)·Σ_i [Q_target(s,a)_i − Q(s,a;θ)_i]²
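The Adam update of Eqs. (9)-(11) and the bias-corrected step above can be written out by hand as below; the learning rate, decay rates, and stabilizing constant are conventional defaults rather than values stated in the patent, and in practice the framework's built-in Adam optimizer would be used.

```python
import numpy as np

def adam_step(theta, grad, s, r, t, lr=1e-3, rho_s=0.9, rho_r=0.999, delta=1e-8):
    """One Adam update: exponential moving averages of the first and second
    moments, bias correction by time step t, then the parameter step."""
    s = rho_s * s + (1 - rho_s) * grad               # first-order moment
    r = rho_r * r + (1 - rho_r) * grad * grad        # second-order moment (element-wise square)
    s_hat = s / (1 - rho_s ** t)                     # bias-corrected first moment
    r_hat = r / (1 - rho_r ** t)                     # bias-corrected second moment
    delta_theta = -lr * s_hat / (np.sqrt(r_hat) + delta)   # element-wise update
    return theta + delta_theta, s, r
```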
step 7, defining upper layer state space
When the upper-layer Agent controls the continuous intersections, firstly, the action of each intersection on the lower layer is adjusted based on the original scheme, and finally, the optimization scheme is updated according to the average queuing length of each intersection.
Modeling of a multi-body system is shown in FIG. 8.
Each agent in the system is the traffic signal controller of one intersection; the upper-layer controller of the hierarchical network control jointly controls the area formed by several lower-layer intersection signal controllers, the intersections being numbered 1, 2, …, ζ. The Agent of each lower-layer intersection has its own learning strategy, and the upper-layer Agent provides guidance. The secondary adjustment of the signals is performed by first sorting the delays at each intersection, as shown in fig. 9.
Through the steps above, the delays of the intersections are sorted, and the state space of the upper layer consists of the numbers of the intersections with the highest delay.
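A minimal sketch of this upper-layer state follows; the number of intersections reported (top_k) is an assumption, since the text only states that the state consists of the intersection numbers with the highest delay.

```python
def upper_layer_state(intersection_delays, top_k=3):
    """Upper-layer state: indices of the intersections with the highest average delay."""
    ranked = sorted(range(len(intersection_delays)),
                    key=lambda z: intersection_delays[z], reverse=True)
    return ranked[:top_k]
```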
Step 8, defining upper layer action space
In order to reduce the average vehicle delay, intersections with larger delay need to be allocated more green time and intersections with smaller delay less green time. Let j be the green light adjustment time, whose specific value is determined by the average vehicle delay at each intersection, and let the average delay of vehicles over all intersections be
r̄ = (1/m)·Σ_{ζ=1..m} r_ζ
If the average delay at the current intersection ζ is r_ζ, the phase green time at that intersection is increased or decreased by the adjustment time j according to how r_ζ compares with the global average r̄.
Step 9, upper layer neural network feedback value definition
The feedback value r_k of the upper-layer Agent is defined as the average delay of all vehicles at the intersections, serves as the feedback value of the traffic signal control system, and is calculated as the input of the next cycle:
r_k = (1/m)·Σ_{ζ=1..m} r_ζ^n
where m is the number of intersections and n indicates that the current cycle is the n-th cycle.
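Because the exact adjustment formula is not reproduced above, the following sketch only illustrates the stated principle (intersections with above-average delay gain green time, the others lose it) and uses the global average delay as the upper-layer feedback; the fixed plus-or-minus j rule is an assumption.

```python
def upper_layer_adjust(green_times, delays, j=5.0):
    """Secondary adjustment sketch: intersections with above-average delay gain
    green time, those below lose it; the global average delay is the feedback."""
    m = len(delays)
    r_bar = sum(delays) / m                          # feedback value: average delay over all intersections
    adjusted = [g + (j if d > r_bar else -j) for g, d in zip(green_times, delays)]
    return adjusted, r_bar
```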

Claims (1)

1. A continuous intersection signal cooperative control method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, building continuous intersection upper and lower layer signal control model framework
Firstly, generating data of a training batch, and storing the current state and action and a received feedback value as a quadruple (s, a, r, s') in a memory for updating parameters of a main neural network; then the model randomly selects an action while storing a sufficient number of samples; then updating the learning rate in the neural network through back propagation; finally, carrying out secondary adjustment on the green light duration of all the intersections according to the average delay of the global vehicles and the average delay of the vehicles at each intersection;
step 2, defining the state space of the lower layer neural network
Taking the vehicle waiting time, the vehicle delay and the signal lamp phase change of each direction of the intersection as state input, and carrying out discretization modeling on the intersection area:
dividing the intersection into rectangular grids with the same size, wherein each lane is divided into grids and is regarded as a cell, detecting vehicle state information by a detector, and expressing the speed and position information detected in time t by single-channel convolution for each small square area; if the detector does not detect the vehicle, zero filling is carried out on the block, and finally the obtained speed and position matrix is used as the state information of the whole road network;
step 3, action selection of lower-layer neural network
Selecting proper actions to guide vehicles at the intersection according to the current traffic state, taking the switching between phases as the action space, discretizing the unit time of the cycle into 5 seconds, updating the current phase-sequence state into the selected phase-sequence state after switching, and selecting the next action of the traffic signal lamp in the same way as the previous process;
step 4, defining feedback value of lower layer neural network
Let k_i denote the number of vehicles arriving on the road network between time period i and time period i+1, and let w_j^i denote the waiting time of the j-th vehicle in time period i; the feedback value for the i-th time period is defined as the reduction in average vehicle waiting time:
r_i = (1/k_i)·Σ_{j=1..k_i} w_j^i − (1/k_{i+1})·Σ_{j=1..k_{i+1}} w_j^(i+1);
step 5, modeling of the lower layer neural network
The collected vehicle speed and position information is first matrixed and the data is then processed through three convolutional layers, which are constructed as follows: the first convolutional layer contains 32 filters, each filter has a size of 4 × 4, and each time the data input is shifted by 4 × 4/step; the second convolutional layer has 64 filters, each filter has a size of 2 × 2, it moves 2 × 2/step, the output size after two convolutional layers is 30 × 30 × 64; the third convolutional layer has 128 filters, the size is 2 × 2, the step size is 1 × 1, the output of the third convolutional layer is 30 × 30 × 128 tensor, and one fully connected layer converts the tensor into a 128 × 1 matrix; after the layer is fully connected, the data is split into two parts of the same size, 64 × 1; the first part represents a state value function V(s) and represents a value function of the static state of the current road network; the second part represents a state-dependent action dominance function A (a) which represents a road network delay change value additionally brought by selecting a certain action, and combines V(s) and A (a) to obtain a Q value of each action, wherein the Q function represents that an average expected value is predicted by the current road network traffic state by using a as the maximum accumulated feedback value of the first action from the state s, and an optimal signal switching strategy under the current neural network is executed by a controller;
step 6, optimization of lower-layer reinforcement learning network
1) Deep reinforcement learning network
Updating the neural network with the mean square error:
L(θ) = Σ_s p(s)·[Q_target(s,a) − Q(s,a;θ)]²
where Q_target(s,a) denotes the target Q value for taking action a in state s, θ denotes the network parameters of the mean-square-error loss, and p(s) denotes the probability of state s occurring in a training batch;
parameters in the main neural network are updated by back propagation, and the target network parameters θ⁻ are updated from θ according to:
θ⁻ = αθ⁻ + (1 − α)θ
where α is the update rate and represents the degree of influence of the new parameters on the target network;
here Q(s,a;θ_i) denotes the output Q value of the current network and is used to evaluate the Q value of the current state-action pair, Q(s′,a′;θ_i⁻) denotes the output of the target-value network, and r + γ·max_{a′} Q(s′,a′;θ_i⁻) approximately represents the optimization objective of the value function, i.e., the target Q value;
2) determining neural network parameters
Calculating the priority probability of experience samples with a rank-based method, where the error δ of sample i is defined as:
δ_i = |Q(s,a;θ)_i − Q_target(s,a)_i|
The errors δ are sorted, the priority p_i of each experience is the reciprocal of its rank, and P_i is the probability of sampling experience i:
P_i = p_i^τ / Σ_k p_k^τ
where τ determines how strongly the priorities are used;
let J (θ) denote the loss function, calculate the parametric gradient g:
Figure FDA0002448924720000038
the first and second order bias moments s and r are updated with exponential moving averages, respectively.
s=ρss+(1-ρs)g
r=ρrr+(1-ρr)g
Where ρ issAnd ρrFirst-order and second-order exponential decay rates are respectively adopted, and the time step t is used for correcting the first-order and second-order bias moments, so that the corrected result is closer to the real gradient.
Figure FDA0002448924720000032
Figure FDA0002448924720000033
Calculating a gradient update:
Figure FDA0002448924720000034
the final parameter updates are as follows:
Figure FDA0002448924720000035
wherein
Figure FDA0002448924720000036
Is the initial learning rate of the initial learning rate,
Figure FDA0002448924720000037
is a constant that stabilizes the value.
The final loss function J is the mean squared temporal-difference error over the sampled batch:
J(θ) = (1/m)·Σ_i [Q_target(s,a)_i − Q(s,a;θ)_i]²
step 7, defining upper layer state space
Each agent in the system is the traffic signal controller of one intersection; the upper-layer controller of the hierarchical network control jointly controls the area formed by the lower-layer intersection signal controllers, and the intersections are numbered 1, 2, …, ζ;
step 8, defining upper layer action space
Let j be the green light adjustment time and let the average delay of vehicles over all intersections be
r̄ = (1/m)·Σ_{ζ=1..m} r_ζ
If the average delay at the current intersection ζ is r_ζ, the phase green time at that intersection is increased or decreased by the adjustment time j according to how r_ζ compares with the global average r̄;
Step 9, upper layer neural network feedback value definition
The feedback value r_k of the upper-layer Agent is defined as the average delay of all vehicles at the intersections, serves as the feedback value of the traffic signal control system, and is calculated as the input of the next cycle:
r_k = (1/m)·Σ_{ζ=1..m} r_ζ^n
where m is the number of intersections and n indicates that the current cycle is the n-th cycle.
CN202010287076.2A 2020-04-13 2020-04-13 Continuous intersection signal cooperative control method based on deep reinforcement learning Active CN112365724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010287076.2A CN112365724B (en) 2020-04-13 2020-04-13 Continuous intersection signal cooperative control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010287076.2A CN112365724B (en) 2020-04-13 2020-04-13 Continuous intersection signal cooperative control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112365724A true CN112365724A (en) 2021-02-12
CN112365724B CN112365724B (en) 2022-03-29

Family

ID=74516407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010287076.2A Active CN112365724B (en) 2020-04-13 2020-04-13 Continuous intersection signal cooperative control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112365724B (en)



Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1389820A (en) * 2001-06-05 2003-01-08 郑肖惺 Intelligent city traffic controlling network system
CN104464310A (en) * 2014-12-02 2015-03-25 上海交通大学 Signal collaborative optimization control method and system of multiple intersections of urban region
CN105118308A (en) * 2015-10-12 2015-12-02 青岛大学 Method based on clustering reinforcement learning and used for optimizing traffic signals of urban road intersections
CN107705557A (en) * 2017-09-04 2018-02-16 清华大学 Road network signal control method and device based on depth enhancing network
US20190347933A1 (en) * 2018-05-11 2019-11-14 Virtual Traffic Lights, LLC Method of implementing an intelligent traffic control apparatus having a reinforcement learning based partial traffic detection control system, and an intelligent traffic control apparatus implemented thereby
CN109472984A (en) * 2018-12-27 2019-03-15 苏州科技大学 Signalized control method, system and storage medium based on deeply study
CN109559530A (en) * 2019-01-07 2019-04-02 大连理工大学 A kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning
CN110060475A (en) * 2019-04-17 2019-07-26 清华大学 A kind of multi-intersection signal lamp cooperative control method based on deeply study
CN110264750A (en) * 2019-06-14 2019-09-20 大连理工大学 A kind of multi-intersection signal lamp cooperative control method of the Q value migration based on multitask depth Q network
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299078A (en) * 2021-03-29 2021-08-24 东南大学 Multi-mode traffic trunk line signal coordination control method and device based on multi-agent cooperation
CN113299078B (en) * 2021-03-29 2022-04-08 东南大学 Multi-mode traffic trunk line signal coordination control method and device based on multi-agent cooperation
CN113257008A (en) * 2021-05-12 2021-08-13 兰州交通大学 Pedestrian flow dynamic control system and method based on deep learning
CN113487902B (en) * 2021-05-17 2022-08-12 东南大学 Reinforced learning area signal control method based on vehicle planned path
CN113487902A (en) * 2021-05-17 2021-10-08 东南大学 Reinforced learning area signal control method based on vehicle planned path
CN113299069A (en) * 2021-05-28 2021-08-24 广东工业大学华立学院 Self-adaptive traffic signal control method based on historical error back propagation
CN113299069B (en) * 2021-05-28 2022-05-13 广东工业大学华立学院 Self-adaptive traffic signal control method based on historical error back propagation
CN113724507A (en) * 2021-08-19 2021-11-30 复旦大学 Traffic control and vehicle induction cooperation method and system based on deep reinforcement learning
CN113724507B (en) * 2021-08-19 2024-01-23 复旦大学 Traffic control and vehicle guidance cooperative method and system based on deep reinforcement learning
CN113643543A (en) * 2021-10-13 2021-11-12 北京大学深圳研究生院 Traffic flow control method and traffic signal control system with privacy protection function
CN113643543B (en) * 2021-10-13 2022-01-11 北京大学深圳研究生院 Traffic flow control method and traffic signal control system with privacy protection function
CN114627657A (en) * 2022-03-09 2022-06-14 哈尔滨理工大学 Adaptive traffic signal control method based on deep graph reinforcement learning
CN114898576A (en) * 2022-05-10 2022-08-12 阿波罗智联(北京)科技有限公司 Traffic control signal generation method and target network model training method
CN114898576B (en) * 2022-05-10 2023-12-19 阿波罗智联(北京)科技有限公司 Traffic control signal generation method and target network model training method
CN115171408B (en) * 2022-07-08 2023-05-30 华侨大学 Traffic signal optimization control method
CN117114079A (en) * 2023-10-25 2023-11-24 中泰信合智能科技有限公司 Method for migrating single intersection signal control model to target environment
CN117114079B (en) * 2023-10-25 2024-01-26 中泰信合智能科技有限公司 Method for migrating single intersection signal control model to target environment
CN117173914A (en) * 2023-11-03 2023-12-05 中泰信合智能科技有限公司 Road network signal control unit decoupling method, device and medium for simplifying complex model
CN117173914B (en) * 2023-11-03 2024-01-26 中泰信合智能科技有限公司 Road network signal control unit decoupling method, device and medium for simplifying complex model

Also Published As

Publication number Publication date
CN112365724B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
CN112365724B (en) Continuous intersection signal cooperative control method based on deep reinforcement learning
CN108847037B (en) Non-global information oriented urban road network path planning method
CN112216124B (en) Traffic signal control method based on deep reinforcement learning
Liang et al. Deep reinforcement learning for traffic light control in vehicular networks
Liang et al. A deep reinforcement learning network for traffic light cycle control
CN111696370B (en) Traffic light control method based on heuristic deep Q network
CN108510764B (en) Multi-intersection self-adaptive phase difference coordination control system and method based on Q learning
CN112632858A (en) Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm
CN114038212B (en) Signal lamp control method based on two-stage attention mechanism and deep reinforcement learning
CN109215355A (en) A kind of single-point intersection signal timing optimization method based on deeply study
CN113223305B (en) Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
Liang et al. A deep q learning network for traffic lights’ cycle control in vehicular networks
Mao et al. A comparison of deep reinforcement learning models for isolated traffic signal control
CN111985619A (en) City single intersection control method based on short-term traffic flow prediction
CN114463997A (en) Lantern-free intersection vehicle cooperative control method and system
CN114995119A (en) Urban traffic signal cooperative control method based on multi-agent deep reinforcement learning
Wu et al. ES-CTC: A deep neuroevolution model for cooperative intelligent freeway traffic control
Shamsi et al. Reinforcement learning for traffic light control with emphasis on emergency vehicles
CN113724507B (en) Traffic control and vehicle guidance cooperative method and system based on deep reinforcement learning
CN113487857A (en) Regional multi-intersection variable lane cooperative control decision method
CN115019523A (en) Deep reinforcement learning traffic signal coordination optimization control method based on minimized pressure difference
CN115273502B (en) Traffic signal cooperative control method
CN115116240A (en) Lantern-free intersection vehicle cooperative control method and system
Qiao et al. Traffic signal control using a cooperative EWMA-based multi-agent reinforcement learning
Wu et al. Deep Reinforcement Learning Based Traffic Signal Control: A Comparative Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant