CN112365724A - Continuous intersection signal cooperative control method based on deep reinforcement learning - Google Patents

Continuous intersection signal cooperative control method based on deep reinforcement learning

Info

Publication number
CN112365724A
CN112365724A (application CN202010287076.2A)
Authority
CN
China
Prior art keywords
value
intersection
state
network
action
Prior art date
Legal status
Granted
Application number
CN202010287076.2A
Other languages
Chinese (zh)
Other versions
CN112365724B (en)
Inventor
王庞伟
冯月
汪云峰
张名芳
王力
Current Assignee
North China University of Technology
Original Assignee
North China University of Technology
Priority date
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN202010287076.2A priority Critical patent/CN112365724B/en
Publication of CN112365724A publication Critical patent/CN112365724A/en
Application granted granted Critical
Publication of CN112365724B publication Critical patent/CN112365724B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/07: Controlling traffic signals
    • G08G1/08: Controlling traffic signals according to detected number or speed of vehicles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/01: Detecting movement of traffic to be counted or controlled
    • G08G1/0104: Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125: Traffic data processing
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/01: Detecting movement of traffic to be counted or controlled
    • G08G1/0104: Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0137: Measuring and analyzing of parameters relative to traffic conditions for specific applications
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/07: Controlling traffic signals
    • G08G1/081: Plural intersections under common control

Abstract

The invention provides a continuous intersection signal cooperative control method based on deep reinforcement learning. It adopts a DQN strategy with upper- and lower-layer Agent networks to handle continuous-intersection signal timing, which reduces the complexity of state acquisition and feedback evaluation and addresses the continuous-intersection signal optimization problem. To keep the training target stable and to avoid oscillation and divergence in the feedback loop between the target value and the predicted value, a Dueling Double optimization method is adopted to train the DQN.

Description

Continuous intersection signal cooperative control method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of vehicle-road cooperation and main-line coordination control, and particularly relates to a traffic signal control model based on vehicle-road cooperation and deep reinforcement learning, suitable for coordinated main-line control of any adjacent upstream and downstream intersections on a road section.
Background
For an urban road network, intersections are the nodes where the traffic flows of connected road sections are exchanged, and they are also the main bottlenecks restricting road-network capacity. A traffic signal control system assigns right of way to conflicting traffic flows and thereby separates them, improving the efficiency and safety of the intersection. Scholars at home and abroad have carried out extensive research and application in traffic signal control and have successively proposed many signal control methods based on deep reinforcement learning, which play an important role in improving urban traffic conditions and relieving congestion. At present, however, control of intersection traffic flow generally targets a single intersection: the intersection traffic flow is predicted with a convolutional neural network and timing control is then applied to the prediction result. In addition, for continuous-intersection signal control, the large number of intersections makes it difficult to establish a state-space model; even when such a model is established, the complexity of traditional reinforcement learning grows exponentially when multi-intersection models in different states are processed, which increases the complexity of state acquisition and feedback evaluation.
(1) Related prior art
1. Traffic signal control and optimization
In recent years, rapid social and economic development has driven urban traffic demand to grow quickly. As the number of motor vehicles increases sharply, the load on urban roads rises and traffic congestion becomes increasingly common. Congestion increases travel delay, lowers driving speed, induces traffic accidents, and directly affects people's working efficiency and health. Unreasonable traffic signal configuration and insufficient coordination and optimization at urban road intersections are the main technical causes, so improving traffic management measures, in particular optimizing intersection signal timing, can effectively improve the smoothness and safety of urban arterial traffic.
2. Deep reinforcement learning
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning in a general framework, and can realize direct control from raw input to output through end-to-end learning.
3. Urban main-line coordination
Urban main-line coordinated control is an important topic in intelligent transportation research. Because urban traffic is random and time-varying, an accurate scientific model is difficult to establish; the average queue length in each phase direction at each intersection is used as the evaluation parameter of control quality, and several adjacent intersections are taken together as the urban trunk-line control model.
(2) Deficiencies of the prior art
1. Existing deep reinforcement learning models usually consider only the control of a single intersection and do not address cooperative control of multiple intersections; neural network modeling for multiple intersections still requires further study. When multi-intersection models in different states are processed simultaneously, the complexity of traditional reinforcement learning grows exponentially and state acquisition and feedback evaluation become more complex, so the multi-intersection case needs to be modeled explicitly. In addition, existing signal control methods operate on a cycle-by-cycle basis; under the premise of an interactive connected-vehicle environment and big data, a model-free adaptive control method is an important solution.
2. Existing green-wave optimization algorithms usually require a large amount of computation and lack flexibility. Most past research has addressed a single-phase coordination path, so traditional models are insufficient for the continuous-intersection case, and the benefit of intersection signal coordination is low.
3. The main difficulties in coordinated design of continuous intersections are the high density of cross streets and the uneven spacing between intersections, which make conventional green-wave band control hard to realize; the requirement that two-way traffic on a road section arrive at the same intersection almost simultaneously is too strict. Various fine-grained optimization algorithms can, in theory, partially satisfy the capacity of the road network, but their practicality and operability are hard to guarantee. If neural network techniques are combined with an optimization scheme under a vehicle-road cooperative environment to control the traffic lights, and the signal phases of adjacent intersections are controlled in coordination, the signal control problem of continuous intersections can be solved.
Disclosure of Invention
Aiming at the deficiencies of the three related technologies above, the invention makes full use of the vehicle-road cooperation theory, takes upper- and lower-layer neural networks as the basic premise of intersection-group traffic signal control, analyzes and calculates the disturbance that an intersection control scheme in this environment exerts on the coordination of adjacent intersection groups, and establishes a signal lamp control method based on upper- and lower-layer neural networks. Intersection phases are switched in real time according to different road environments and traffic states, the cooperation among intersections is strengthened, smooth passage through the intersections is guaranteed, and the capacity of the intersections is improved, providing a new solution and theoretical basis for relieving congestion, improving travel efficiency, and reducing accidents. The invention specifically adopts the following technical scheme:
step 1, building continuous intersection upper and lower layer signal control model framework
Firstly, generating data of a training batch, and storing the current state and action and a received feedback value as a quadruple (s, a, r, s') in a memory for updating parameters of a main neural network; then the model randomly selects an action while storing a sufficient number of samples; then updating the learning rate in the neural network through back propagation; finally, carrying out secondary adjustment on the green light duration of all the intersections according to the average delay of the global vehicles and the average delay of the vehicles at each intersection;
step 2, defining the state space of the lower layer neural network
Taking the vehicle waiting time, the vehicle delay and the signal lamp phase change of each direction of the intersection as state input, and carrying out discretization modeling on the intersection area:
dividing the intersection into rectangular grids with the same size, wherein each lane is divided into grids and is regarded as a cell, detecting vehicle state information by a detector, and expressing the speed and position information detected in time t by single-channel convolution for each small square area; if the detector does not detect the vehicle, zero filling is carried out on the block, and finally the obtained speed and position matrix is used as the state information of the whole road network;
step 3, action selection of lower-layer neural network
Selecting proper actions to guide vehicles at the intersection according to the current traffic state, taking the switching between phases as the action space, discretizing the unit time of the cycle into 5 seconds, updating the current phase-sequence state into the selected phase-sequence state after switching, and selecting the next action of the traffic signal lamp in the same way as the previous process;
step 4, defining feedback value of lower layer neural network
Let k_i denote the number of vehicles arriving on the road network between time period i and time period i+1, and let w_j^i denote the waiting time of the j-th vehicle in time period i; the feedback value for the i-th time period is defined as the reduction in average vehicle waiting time:
r_i = (1/k_i)·Σ_{j=1..k_i} w_j^i − (1/k_{i+1})·Σ_{j=1..k_{i+1}} w_j^(i+1);
step 5, modeling of the lower layer neural network
The collected vehicle speed and position information is first matrixed and the data is then processed through three convolutional layers, which are constructed as follows: the first convolutional layer contains 32 filters, each filter has a size of 4 × 4, and each time the data input is shifted by 4 × 4/step; the second convolutional layer has 64 filters, each filter has a size of 2 × 2, it moves 2 × 2/step, the output size after two convolutional layers is 30 × 30 × 64; the third convolutional layer has 128 filters, the size is 2 × 2, the step size is 1 × 1, the output of the third convolutional layer is 30 × 30 × 128 tensor, and one fully connected layer converts the tensor into a 128 × 1 matrix; after the layer is fully connected, the data is split into two parts of the same size, 64 × 1; the first part represents a state value function V(s) and represents a value function of the static state of the current road network; the second part represents a state-dependent action dominance function A (a) which represents a road network delay change value additionally brought by selecting a certain action, and combines V(s) and A (a) to obtain a Q value of each action, wherein the Q function represents that an average expected value is predicted by the current road network traffic state by using a as the maximum accumulated feedback value of the first action from the state s, and an optimal signal switching strategy under the current neural network is executed by a controller;
step 6, optimization of lower-layer reinforcement learning network
1) Deep reinforcement learning network
Updating the neural network with the mean square error:
L(θ) = Σ_s p(s)·[Q_target(s,a) − Q(s,a;θ)]²
where Q_target(s,a) denotes the target Q value for taking action a in state s, θ denotes the network parameters of the mean-square-error loss, and p(s) denotes the probability of state s occurring in a training batch;
parameters in the main neural network are updated by back propagation, and the target network parameters θ⁻ are updated from θ according to:
θ⁻ = αθ⁻ + (1 − α)θ
where α is the update rate and represents the degree of influence of the new parameters on the target network;
here Q(s,a;θ_i) denotes the output Q value of the current network and is used to evaluate the Q value of the current state-action pair, Q(s′,a′;θ_i⁻) denotes the output of the target-value network, and r + γ·max_{a′} Q(s′,a′;θ_i⁻) approximately represents the optimization objective of the value function, i.e., the target Q value;
2) determining neural network parameters
Calculating the priority probability of experience samples with a rank-based method, where the error δ of sample i is defined as:
δ_i = |Q(s,a;θ)_i − Q_target(s,a)_i|
The errors δ are sorted, the priority p_i of each experience is the reciprocal of its rank, and P_i is the probability of sampling experience i:
P_i = p_i^τ / Σ_k p_k^τ
where τ determines how strongly the priorities are used;
let J (θ) denote the loss function, calculate the parametric gradient g:
Figure RE-GDA0002891513160000054
the first and second order bias moments s and r are updated with exponential moving averages, respectively.
s=ρss+(1-ρs)g
r=ρrr+(1-ρr)g
Where ρ issAnd ρrFirst and second order exponential decay rates, respectively, using the time step t to correct the first and second order bias moments so that the corrected results are closer to each otherThe true gradient.
Figure RE-GDA0002891513160000055
Figure RE-GDA0002891513160000056
Calculating a gradient update:
Figure RE-GDA0002891513160000057
the final parameter updates are as follows:
Figure RE-GDA0002891513160000061
wherein oa isrIs the initial learning rate of the initial learning rate,
Figure RE-GDA0002891513160000062
is a constant that stabilizes the value.
The final loss function J is the mean squared temporal-difference error over the sampled batch:
J(θ) = (1/m)·Σ_i [Q_target(s,a)_i − Q(s,a;θ)_i]²
step 7, defining upper layer state space
Each agent in the system is the traffic signal controller of one intersection; the upper-layer controller of the hierarchical network control jointly controls the area formed by the lower-layer intersection signal controllers, and the intersections are numbered 1, 2, …, ζ;
step 8, defining upper layer action space
Let j be the green light adjustment time and let the average delay of vehicles over all intersections be
r̄ = (1/m)·Σ_{ζ=1..m} r_ζ
If the average delay at the current intersection ζ is r_ζ, the phase green time at that intersection is increased or decreased by the adjustment time j according to how r_ζ compares with the global average r̄;
Step 9, upper layer neural network feedback value definition
The feedback value r_k of the upper-layer Agent is defined as the average delay of all vehicles at the intersections, serves as the feedback value of the traffic signal control system, and is calculated as the input of the next cycle:
r_k = (1/m)·Σ_{ζ=1..m} r_ζ^n
where m is the number of intersections and n indicates that the current cycle is the n-th cycle.
Drawings
Fig. 1 is a schematic diagram of the upper- and lower-layer traffic signal control of continuous intersections.
Fig. 2 is a diagram of a global model framework for an upper and lower network.
FIG. 3 is a diagram of the regionalized, discretized modeling of the intersection.
Fig. 4 is an MDP loop flow diagram.
FIG. 5 is a schematic diagram of a convolutional neural network processing vehicle information.
Fig. 6 is a model framework diagram of DQN.
Fig. 7 is a training flow diagram of DQN.
Fig. 8 is an upper state space definition diagram.
Fig. 9 is a schematic diagram of the upper state space.
Detailed Description
Step 1, controlling model frame for upper and lower layer signals of continuous intersection
A scene with upper- and lower-layer traffic signal control of continuous intersections is established: each intersection is equipped with detectors for collecting vehicle information, and the vehicles themselves carry sensors for information acquisition. Traffic signal timing data, vehicle running states, actual road conditions, and other information are obtained in real time through vehicular networking technology and various sensors, and deep reinforcement learning with a neural network is then used to predict the signal timing that matches the current traffic state.
The invention divides the control of the signal lamps at the continuous intersections into upper-layer and lower-layer control: the lower-layer Agent is the traffic signal controller of each intersection, and each controller has its own learning strategy; the upper-layer Agent mainly adjusts the tentative strategy of the lower-layer Agents. The two layers of controllers jointly control the signal lights of the whole area, and the multi-agent system is modeled as shown in figure 1.
Fig. 2 is the global model framework of the upper- and lower-layer neural networks. The main convolutional neural network takes the current intersection state and the tentative phase-switching action and uses the feedback value to select the most valuable action. First, a training batch of data is generated, and the current state, the action, and the received feedback value are stored in memory as a quadruple (s, a, r, s') to update the parameters of the main neural network. The target network θ⁻ is a separate neural network that increases learning stability. Using Double DQN in the main and target networks reduces overestimation and improves performance: the model is trained on the Q value of each action, and the action with the largest Q value is selected to obtain the optimal strategy. The model then selects actions randomly while a sufficient number of samples are stored. Before training, every sample has the same priority and samples are randomly divided into small batches for training. After each training step, sample priorities are updated so that samples are selected with different probabilities, and the learning rate in the neural network is then updated through Adam back-propagation. The model derives an initial control scheme from the learned Q values and the action-selection operation with the largest Q value. Finally, the green times of all intersections are adjusted a second time according to the global average vehicle delay and the average vehicle delay at each intersection, so that through learning the model reacts appropriately to different traffic scenes and reduces vehicle delay.
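A minimal sketch of this training loop is given below, assuming the (s, a, r, s') quadruples are stored in a replay memory as described; the env and agent interfaces (observe, apply, best_action, learn), the buffer capacity, and the exploration rate are illustrative placeholders rather than details taken from the patent.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (s, a, r, s') quadruples used to update the main neural network."""
    def __init__(self, capacity=20000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def training_step(env, agent, memory, batch_size=32, epsilon=0.1):
    """One interaction plus learning step of a lower-layer Agent (sketch)."""
    state = env.observe()                                 # speed/position matrix of the intersection
    if random.random() < epsilon or len(memory) < batch_size:
        action = random.randrange(agent.num_actions)      # explore while filling the memory
    else:
        action = agent.best_action(state)                 # action with the largest Q value
    reward, next_state = env.apply(action)                # switch the phase, collect the feedback value
    memory.push(state, action, reward, next_state)        # store the quadruple
    if len(memory) >= batch_size:
        agent.learn(memory.sample(batch_size))            # back-propagation update (Adam)
```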
Step 2, defining the state space of the lower layer neural network
In order to accurately describe traffic information at an intersection, a vehicle waiting time W, a vehicle delay D, and a signal lamp phase change C for each direction at the intersection are input as states. In addition, in order to accurately represent the specific distribution of the position and speed information of the vehicles at the intersection, discretized modeling is performed on the intersection area.
As shown in fig. 3, the whole intersection is divided into rectangular grids of the same size; to reduce the amount of calculation and save computing resources, the speed and position information of the vehicles is stored in a matrix. Each lane is divided into grids, each grid is regarded as a cell, and the detectors collect the vehicle state information; for each small square area, the speed and position detected within time t are expressed as a single-channel matrix Q. If the detector does not detect a vehicle, the cell is filled with zero. Finally, the resulting speed and position matrices are taken as the state information of the whole road network.
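The grid discretization can be sketched as follows; the cell size, approach length, speed normalization, and per-vehicle record format are assumptions made for illustration, since the text only specifies equal-size rectangular cells with zero filling for empty cells.

```python
import numpy as np

def build_state_matrix(vehicles, lane_length_m=150.0, cell_m=5.0, num_lanes=8, v_max=16.7):
    """Discretize the intersection approaches into equal cells and fill a
    position channel and a normalized speed channel from detector data."""
    cells = int(lane_length_m / cell_m)
    position = np.zeros((num_lanes, cells), dtype=np.float32)
    speed = np.zeros((num_lanes, cells), dtype=np.float32)
    for lane, dist_m, speed_mps in vehicles:          # one (lane, distance, speed) record per detected vehicle
        cell = min(int(dist_m / cell_m), cells - 1)
        position[lane, cell] = 1.0                    # cell is occupied
        speed[lane, cell] = speed_mps / v_max         # normalized speed; undetected cells stay zero
    return np.stack([position, speed])                # state information of the road network
```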
Step 3, action selection of lower-layer neural network
The traffic signal control system selects an appropriate action to guide the vehicles at the intersection according to the current traffic state. The invention takes the switching between phases as the action space and models the phase-switching process as a Markov Decision Process (MDP). The MDP is a sequential-decision mathematical model used to describe the random strategy and the feedback value that an agent can obtain in a traffic scene whose system state has the Markov property; the optimal switching strategy is then learned by trial and error in deep reinforcement learning combined with the MDP control strategy.
In fig. 4, each loop represents the phase transition of the intersection signal lamp within one time-slot cycle; the unit time of the cycle is discretized into 5 seconds, after switching the current phase-sequence state is updated to the selected phase-sequence state, and the traffic signal lamp selects the next action in the same way as before. In addition, so that the model can learn to switch phases, maximum and minimum durations are set for each lamp color; the invention sets the maximum and minimum phase times to 60 seconds and 5 seconds. If the green time of a phase reaches 60 seconds, the controller is forced to switch to the next phase, and the scheme is updated iteratively on the basis of the original control scheme.
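The phase-holding and forced-switching logic described above can be sketched as below; only the 5-second step and the 5/60-second bounds come from the text, while the greedy tie-breaking for the forced switch is an assumed detail.

```python
MIN_GREEN_S = 5    # minimum phase duration stated in the text
MAX_GREEN_S = 60   # maximum phase duration stated in the text
STEP_S = 5         # the cycle unit time is discretized into 5-second steps

def choose_phase(q_values, current_phase, elapsed_green_s):
    """Pick the next phase subject to the minimum/maximum green constraints."""
    if elapsed_green_s < MIN_GREEN_S:
        return current_phase                                   # hold: minimum green not yet served
    if elapsed_green_s >= MAX_GREEN_S:
        ranked = sorted(range(len(q_values)), key=lambda a: -q_values[a])
        return next(a for a in ranked if a != current_phase)   # forced switch to the best other phase
    return max(range(len(q_values)), key=lambda a: q_values[a])  # otherwise greedy choice
```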
Step 4, defining feedback value of lower layer neural network
To provide feedback to the reinforcement learning model regarding previous performance, feedback values are defined to assist traffic signals in taking optimal action strategies. In order to reduce the average delay of the vehicle, the invention defines the Reward as the average delay reduction value of the vehicle within a time period, so that the Reward is ensured to be positive during training.
Let k_i denote the number of vehicles arriving on the road network between time period i and time period i+1, and let w_j^i denote the waiting time of the j-th vehicle in time period i. The feedback value for the i-th time period is defined as the reduction in average vehicle waiting time:
r_i = (1/k_i)·Σ_{j=1..k_i} w_j^i − (1/k_{i+1})·Σ_{j=1..k_{i+1}} w_j^(i+1)
from the formula, if riWhen the vehicle is larger, the average waiting time is longer than before, and r is ensured to achieve the purpose of continuously reducing the vehicle delayiThe maximum is taken as much as possible.
Step 5, modeling of the lower layer neural network
The invention uses a main network and a target network with the same structure: the main network θ updates its weights in real time, while the target network θ⁻ is updated after the main network has been updated several times. The Q value is updated jointly from a state value function V(s) and an action advantage function A(a), the optimizer is Adam, and an ε-greedy strategy and experience replay are adopted during learning.
The structure of the lower-layer CNN is shown in fig. 5. It consists of three convolutional layers and three fully connected layers. The vehicle speed and position matrix first passes through the three convolutional layers, each of which includes convolution, pooling, and a nonlinear activation function. A convolutional layer contains several filters, each holding a set of weights; each filter moves across the input by a defined stride to produce the input of the next layer, and different filters have different weights and generate different features in the next layer. The invention uses the Leaky ReLU function as the activation function:
f(x) = x, if x > 0;  f(x) = βx, if x ≤ 0
where x represents the input of the unit and β is a small constant. Compared with the conventional ReLU function, introducing β avoids dead neurons caused by a zero gradient on the negative side. The Leaky ReLU function can converge faster than other activation functions (e.g., tanh and sigmoid), which increases the convergence rate of the vehicle delay during training.
The neural network modeling is shown in fig. 5 below.
FIG. 5 shows how the vehicle speed and position matrix is processed in the convolutional neural network: the collected information is first arranged into a matrix and the data are then processed by three convolutional layers. The three convolutional layers and the fully connected layer are constructed as follows: the first convolutional layer contains 32 filters, each of size 4 × 4, moving over the input with a stride of 4 × 4; the second convolutional layer has 64 filters of size 2 × 2 with a stride of 2 × 2, and the output after the two convolutional layers has size 30 × 30 × 64; the third convolutional layer has 128 filters of size 2 × 2 with a stride of 1 × 1, so its output is a 30 × 30 × 128 tensor, which one fully connected layer converts into a 128 × 1 vector. After the fully connected layer, the data are split into two parts of the same size, 64 × 1. The first part represents the state value function V(s), the value of the static state of the current road network; the second part represents the state-dependent action advantage function A(a), the additional change in road-network delay brought by selecting a certain action. Since the number of possible actions equals the number k of legal phases, A(a) has size k × 1. The two parts are combined again to obtain the Q value of each action; the parameters of the CNN are denoted θ, so Q(s,a) becomes Q(s,a;θ), where θ denotes the network parameters of the mean-square-error loss. The Q function represents the maximum accumulated feedback value obtained by starting from state s and taking a as the first action, i.e., the average expected value predicted from the current road-network traffic state, and the controller executes the optimal signal-switching strategy under the current neural network.
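A sketch of this dueling architecture in PyTorch is shown below; the number of phases, the use of adaptive pooling to reach a 128-wide feature vector, and the exact strides and padding are assumptions that only approximate the stated 30 × 30 × 64 and 30 × 30 × 128 tensor sizes.

```python
import torch
import torch.nn as nn

class DuelingIntersectionNet(nn.Module):
    """Sketch of the lower-layer CNN: three conv layers, a shared fully
    connected layer, then separate V(s) and A(s,a) streams combined into Q."""
    def __init__(self, in_channels=2, num_phases=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=4), nn.LeakyReLU(0.01),
            nn.Conv2d(32, 64, kernel_size=2, stride=2), nn.LeakyReLU(0.01),
            nn.Conv2d(64, 128, kernel_size=2, stride=1, padding=1), nn.LeakyReLU(0.01),
            nn.AdaptiveAvgPool2d(1),      # collapse spatial dimensions
            nn.Flatten(),                 # -> (batch, 128)
        )
        self.fc = nn.Sequential(nn.Linear(128, 128), nn.LeakyReLU(0.01))
        self.value_head = nn.Sequential(nn.Linear(64, 64), nn.LeakyReLU(0.01), nn.Linear(64, 1))
        self.adv_head = nn.Sequential(nn.Linear(64, 64), nn.LeakyReLU(0.01), nn.Linear(64, num_phases))

    def forward(self, x):
        h = self.fc(self.features(x))
        v_part, a_part = h[:, :64], h[:, 64:]          # split into two 64-wide halves
        v = self.value_head(v_part)                    # state value V(s)
        a = self.adv_head(a_part)                      # action advantage A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)     # dueling combination -> Q(s, a)
```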
Step 6, optimization of lower-layer reinforcement learning network
The core of the DQN model is the convolutional neural network, which is trained with Q-learning: its input is the raw road-network data matrix and its output is the estimated Q value of the optimal strategy. Fig. 6 and fig. 7 below are the framework diagram and the flow chart of the DQN, respectively.
The matrix containing vehicle position and speed information passes through the convolutional layers and the fully connected layers, and a vector containing the Q value of each action is then output from the input state and action.
1) Deep reinforcement learning network
During DQN training, a deep convolutional network is used to approximate the current value function, while another network is used to produce the target Q value. Specifically, let Q_target(s,a) denote the target Q value for taking action a in state s and let θ denote the network parameters of the mean-square-error loss; the neural network is updated with the mean square error (MSE) as follows:
L(θ) = Σ_s p(s)·[Q_target(s,a) − Q(s,a;θ)]²
where p(s) denotes the probability of state s occurring in a training batch. In order to obtain a stable update in each iteration (a steady reduction of the road-network delay during training), a separate target network θ⁻, identical in structure to the main neural network but with different parameters, is used to generate the Q value.
Parameters in the main neural network are updated by back propagation, and θ⁻ is updated from θ according to:
θ⁻ = αθ⁻ + (1 − α)θ (4)
where α is the update rate and represents the degree of influence of the new parameters on the target network.
Here Q(s,a;θ_i) denotes the output Q value of the current network and is used to evaluate the Q value of the current state-action pair, Q(s′,a′;θ_i⁻) denotes the output of the target-value network, and r + γ·max_{a′} Q(s′,a′;θ_i⁻) approximately represents the optimization objective of the value function, i.e., the target Q value.
After the parameters θ of the current value network have been updated, they are copied to the target value network θ⁻ every N iterations. The network parameters are updated by minimizing the mean square error between the current Q value and the target value Q_target, which keeps the error terms of the network within a bounded interval and keeps the Q values and gradients in a reasonable range, helping the road-network delay to decrease stably.
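A sketch of this loss and of the target-network update of Eq. (4) follows, assuming a soft update applied at every step rather than a periodic copy; the discount factor and update rate are illustrative values.

```python
import torch
import torch.nn.functional as F

def soft_update(target_net, main_net, alpha=0.001):
    """theta_minus = alpha * theta_minus + (1 - alpha) * theta, as in Eq. (4)."""
    for tp, p in zip(target_net.parameters(), main_net.parameters()):
        tp.data.copy_(alpha * tp.data + (1.0 - alpha) * p.data)

def dqn_loss(main_net, target_net, batch, gamma=0.95):
    """Mean square error between the predicted Q value and the target Q value."""
    states, actions, rewards, next_states = batch
    q_pred = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values   # target network generates the Q value
        q_target = rewards + gamma * q_next
    return F.mse_loss(q_pred, q_target)
```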
2) Dueling DQN optimization method
In some states s_t, for example when there are too few or too many vehicles on the road network, whatever action a_t is taken does not affect the delay of the next state s_{t+1}; that is, the correlation between the current state-action value function and the current action selection is weak, which easily prevents the road-network delay from converging in such states. To solve this problem, the invention adopts Dueling DQN to improve the learning effect and the convergence speed of the DQN.
On the basis of the original network, a deep network is used to fit the Q value in reinforcement learning, and the Q-value function is divided into a state value V and an action advantage A; the Q value is updated by combining the state value and the action advantage.
In the neural network, the overall expected feedback value of taking actions under the policy in future steps is represented by the state value V(s;θ); for each action, A(s,a;θ) is the advantage function, defined as the cumulative discounted return of the current actual action compared with the optimal action, and the Q value is the sum of the state value and the state-dependent advantage:
Q(s,a;θ) = V(s;θ) + ( A(s,a;θ) − (1/|A|)·Σ_{a′} A(s,a′;θ) )
where A(s,a′,θ) represents the influence of the taken action on the Q-value function: if the A value of an action is positive, that action reduces the delay better than the other actions; conversely, if the A value of an action is negative, the potential feedback value of that action is below average. Compared with using the raw Q value directly, this improves the stability of the model and reduces the average vehicle delay.
3) Double DQN optimization method
The traditional DQN suffers from overestimation: owing to estimation non-uniformity, an overestimation problem arises during parameter updating and iteration, so that the current phase-switching scheme is not the optimal one. To prevent the Q value from being overestimated, the Q_target values are updated with the Double DQN algorithm.
Q_target(s,a) = r + γ·Q′(s′, argmax_{a′} Q(s′,a′;θ); θ⁻) (6)
There are two Q networks in the above formula: the online network Q selects the action with the largest value in the next state, and the Q′ function (the target network) evaluates that action. This reduces the overestimation problem and thereby effectively lowers the average vehicle delay on the road network.
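Equation (6) can be sketched as follows, with the online network choosing the action and the target network evaluating it; the discount factor is an assumed value.

```python
import torch

def double_dqn_target(main_net, target_net, rewards, next_states, gamma=0.95):
    """Q_target(s,a) = r + gamma * Q'(s', argmax_a' Q(s', a'; theta); theta_minus), Eq. (6)."""
    with torch.no_grad():
        best_actions = main_net(next_states).argmax(dim=1, keepdim=True)     # chosen by the online network
        q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluated by the target network
    return rewards + gamma * q_eval
```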
4) Neural network parameters
The invention adopts a rank-based prioritized experience replay structure to increase learning efficiency, which raises the replay probability of samples associated with lower average delay. The priority probability of an experience sample is calculated with the rank-based method, where the error δ of sample i is defined as:
δ_i = |Q(s,a;θ)_i − Q_target(s,a)_i| (7)
The errors δ are sorted, the priority p_i of each experience is the reciprocal of its rank, and P_i is the probability of sampling experience i:
P_i = p_i^τ / Σ_k p_k^τ
where τ determines how strongly the priorities are used; when τ is 0, sampling is uniform (random).
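A sketch of the rank-based sampling probabilities of Eq. (7) and the expression above follows; the value of τ and the batch size in the usage comment are assumptions.

```python
import numpy as np

def rank_based_probabilities(td_errors, tau=0.7):
    """Rank-based priorities: p_i = 1 / rank(i); sampling probability
    P_i = p_i^tau / sum_k p_k^tau. tau = 0 recovers uniform sampling."""
    td_errors = np.asarray(td_errors, dtype=np.float64)
    ranks = np.empty_like(td_errors)
    order = np.argsort(-np.abs(td_errors))        # largest |delta| gets rank 1
    ranks[order] = np.arange(1, len(td_errors) + 1)
    priorities = 1.0 / ranks
    weighted = priorities ** tau
    return weighted / weighted.sum()

# usage sketch: indices = np.random.choice(len(errors), size=32, p=rank_based_probabilities(errors))
```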
The optimizer of the neural network model is Adam (adaptive moment estimation); the Adam algorithm designs an independent adaptive learning rate for each parameter by computing first-moment and second-moment estimates of the gradient, which speeds up convergence and improves the model. Let J(θ) denote the loss function and compute the parameter gradient g:
g = ∇_θ J(θ) (9)
The first-order and second-order biased moments s and r are updated with exponential moving averages:
s = ρ_s·s + (1 − ρ_s)·g (10)
r = ρ_r·r + (1 − ρ_r)·g⊙g (11)
where ρ_s and ρ_r are the first-order and second-order exponential decay rates, and the time step t is used to correct the biased moments so that the corrected results are closer to the true gradient:
ŝ = s / (1 − ρ_s^t)
r̂ = r / (1 − ρ_r^t)
Calculating the gradient update (element by element):
Δθ = −ε_r·ŝ / (√r̂ + δ)
The final parameter update is as follows:
θ = θ + Δθ
where ε_r is the initial learning rate and δ is a small constant that stabilizes the value.
The final loss function J is the mean squared temporal-difference error over the sampled batch:
J(θ) = (1/m)·Σ_i [Q_target(s,a)_i − Q(s,a;θ)_i]²
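The Adam update of Eqs. (9)-(11) and the bias-corrected step above can be written out by hand as below; the learning rate, decay rates, and stabilizing constant are conventional defaults rather than values stated in the patent, and in practice the framework's built-in Adam optimizer would be used.

```python
import numpy as np

def adam_step(theta, grad, s, r, t, lr=1e-3, rho_s=0.9, rho_r=0.999, delta=1e-8):
    """One Adam update: exponential moving averages of the first and second
    moments, bias correction by time step t, then the parameter step."""
    s = rho_s * s + (1 - rho_s) * grad               # first-order moment
    r = rho_r * r + (1 - rho_r) * grad * grad        # second-order moment (element-wise square)
    s_hat = s / (1 - rho_s ** t)                     # bias-corrected first moment
    r_hat = r / (1 - rho_r ** t)                     # bias-corrected second moment
    delta_theta = -lr * s_hat / (np.sqrt(r_hat) + delta)   # element-wise update
    return theta + delta_theta, s, r
```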
step 7, defining upper layer state space
When the upper-layer Agent controls the continuous intersections, firstly, the action of each intersection on the lower layer is adjusted based on the original scheme, and finally, the optimization scheme is updated according to the average queuing length of each intersection.
Modeling of a multi-body system is shown in FIG. 8.
Each agent in the system is the traffic signal controller of one intersection; the upper-layer controller of the hierarchical network control jointly controls the area formed by several lower-layer intersection signal controllers, the intersections being numbered 1, 2, …, ζ. The Agent of each lower-layer intersection has its own learning strategy, and the upper-layer Agent provides guidance. The secondary adjustment of the signals is performed by first sorting the delays at each intersection, as shown in fig. 9.
Through the steps above, the delays of the intersections are sorted, and the state space of the upper layer consists of the numbers of the intersections with the highest delay.
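A minimal sketch of this upper-layer state follows; the number of intersections reported (top_k) is an assumption, since the text only states that the state consists of the intersection numbers with the highest delay.

```python
def upper_layer_state(intersection_delays, top_k=3):
    """Upper-layer state: indices of the intersections with the highest average delay."""
    ranked = sorted(range(len(intersection_delays)),
                    key=lambda z: intersection_delays[z], reverse=True)
    return ranked[:top_k]
```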
Step 8, defining upper layer action space
In order to reduce the average vehicle delay, intersections with larger delay need to be allocated more green time and intersections with smaller delay less green time. Let j be the green light adjustment time, whose specific value is determined by the average vehicle delay at each intersection, and let the average delay of vehicles over all intersections be
r̄ = (1/m)·Σ_{ζ=1..m} r_ζ
If the average delay at the current intersection ζ is r_ζ, the phase green time at that intersection is increased or decreased by the adjustment time j according to how r_ζ compares with the global average r̄.
Step 9, upper layer neural network feedback value definition
The feedback value r_k of the upper-layer Agent is defined as the average delay of all vehicles at the intersections, serves as the feedback value of the traffic signal control system, and is calculated as the input of the next cycle:
r_k = (1/m)·Σ_{ζ=1..m} r_ζ^n
where m is the number of intersections and n indicates that the current cycle is the n-th cycle.
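Because the exact adjustment formula is not reproduced above, the following sketch only illustrates the stated principle (intersections with above-average delay gain green time, the others lose it) and uses the global average delay as the upper-layer feedback; the fixed plus-or-minus j rule is an assumption.

```python
def upper_layer_adjust(green_times, delays, j=5.0):
    """Secondary adjustment sketch: intersections with above-average delay gain
    green time, those below lose it; the global average delay is the feedback."""
    m = len(delays)
    r_bar = sum(delays) / m                          # feedback value: average delay over all intersections
    adjusted = [g + (j if d > r_bar else -j) for g, d in zip(green_times, delays)]
    return adjusted, r_bar
```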

Claims (1)

1. A continuous intersection signal cooperative control method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, building continuous intersection upper and lower layer signal control model framework
Firstly, generating data of a training batch, and storing the current state and action and a received feedback value as a quadruple (s, a, r, s') in a memory for updating parameters of a main neural network; then the model randomly selects an action while storing a sufficient number of samples; then updating the learning rate in the neural network through back propagation; finally, carrying out secondary adjustment on the green light duration of all the intersections according to the average delay of the global vehicles and the average delay of the vehicles at each intersection;
step 2, defining the state space of the lower layer neural network
Taking the vehicle waiting time, the vehicle delay and the signal lamp phase change of each direction of the intersection as state input, and carrying out discretization modeling on the intersection area:
dividing the intersection into rectangular grids with the same size, wherein each lane is divided into grids and is regarded as a cell, detecting vehicle state information by a detector, and expressing the speed and position information detected in time t by single-channel convolution for each small square area; if the detector does not detect the vehicle, zero filling is carried out on the block, and finally the obtained speed and position matrix is used as the state information of the whole road network;
step 3, action selection of lower-layer neural network
Selecting proper actions to guide vehicles at the intersection according to the current traffic state, taking the switching between phases as the action space, discretizing the unit time of the cycle into 5 seconds, updating the current phase-sequence state into the selected phase-sequence state after switching, and selecting the next action of the traffic signal lamp in the same way as the previous process;
step 4, defining feedback value of lower layer neural network
Let k_i denote the number of vehicles arriving on the road network between time period i and time period i+1, and let w_j^i denote the waiting time of the j-th vehicle in time period i; the feedback value for the i-th time period is defined as the reduction in average vehicle waiting time:
r_i = (1/k_i)·Σ_{j=1..k_i} w_j^i − (1/k_{i+1})·Σ_{j=1..k_{i+1}} w_j^(i+1);
step 5, modeling of the lower layer neural network
The collected vehicle speed and position information is first matrixed and the data is then processed through three convolutional layers, which are constructed as follows: the first convolutional layer contains 32 filters, each filter has a size of 4 × 4, and each time the data input is shifted by 4 × 4/step; the second convolutional layer has 64 filters, each filter has a size of 2 × 2, it moves 2 × 2/step, the output size after two convolutional layers is 30 × 30 × 64; the third convolutional layer has 128 filters, the size is 2 × 2, the step size is 1 × 1, the output of the third convolutional layer is 30 × 30 × 128 tensor, and one fully connected layer converts the tensor into a 128 × 1 matrix; after the layer is fully connected, the data is split into two parts of the same size, 64 × 1; the first part represents a state value function V(s) and represents a value function of the static state of the current road network; the second part represents a state-dependent action dominance function A (a) which represents a road network delay change value additionally brought by selecting a certain action, and combines V(s) and A (a) to obtain a Q value of each action, wherein the Q function represents that an average expected value is predicted by the current road network traffic state by using a as the maximum accumulated feedback value of the first action from the state s, and an optimal signal switching strategy under the current neural network is executed by a controller;
step 6, optimization of lower-layer reinforcement learning network
1) Deep reinforcement learning network
Updating the neural network with the mean square error:
L(θ) = Σ_s p(s)·[Q_target(s,a) − Q(s,a;θ)]²
where Q_target(s,a) denotes the target Q value for taking action a in state s, θ denotes the network parameters of the mean-square-error loss, and p(s) denotes the probability of state s occurring in a training batch;
parameters in the main neural network are updated by back propagation, and the target network parameters θ⁻ are updated from θ according to:
θ⁻ = αθ⁻ + (1 − α)θ
where α is the update rate and represents the degree of influence of the new parameters on the target network;
here Q(s,a;θ_i) denotes the output Q value of the current network and is used to evaluate the Q value of the current state-action pair, Q(s′,a′;θ_i⁻) denotes the output of the target-value network, and r + γ·max_{a′} Q(s′,a′;θ_i⁻) approximately represents the optimization objective of the value function, i.e., the target Q value;
2) determining neural network parameters
Calculating the priority probability of experience samples with a rank-based method, where the error δ of sample i is defined as:
δ_i = |Q(s,a;θ)_i − Q_target(s,a)_i|
The errors δ are sorted, the priority p_i of each experience is the reciprocal of its rank, and P_i is the probability of sampling experience i:
P_i = p_i^τ / Σ_k p_k^τ
where τ determines how strongly the priorities are used;
let J (θ) denote the loss function, calculate the parametric gradient g:
Figure FDA0002448924720000038
the first and second order bias moments s and r are updated with exponential moving averages, respectively.
s=ρss+(1-ρs)g
r=ρrr+(1-ρr)g
Where ρ issAnd ρrFirst-order and second-order exponential decay rates are respectively adopted, and the time step t is used for correcting the first-order and second-order bias moments, so that the corrected result is closer to the real gradient.
Figure FDA0002448924720000032
Figure FDA0002448924720000033
Calculating a gradient update:
Figure FDA0002448924720000034
the final parameter updates are as follows:
Figure FDA0002448924720000035
wherein
Figure FDA0002448924720000036
Is the initial learning rate of the initial learning rate,
Figure FDA0002448924720000037
is a constant that stabilizes the value.
The final loss function J is the mean squared temporal-difference error over the sampled batch:
J(θ) = (1/m)·Σ_i [Q_target(s,a)_i − Q(s,a;θ)_i]²
step 7, defining upper layer state space
Each agent in the system is the traffic signal controller of one intersection; the upper-layer controller of the hierarchical network control jointly controls the area formed by the lower-layer intersection signal controllers, and the intersections are numbered 1, 2, …, ζ;
step 8, defining upper layer action space
Let j be the green light adjustment time and let the average delay of vehicles over all intersections be
r̄ = (1/m)·Σ_{ζ=1..m} r_ζ
If the average delay at the current intersection ζ is r_ζ, the phase green time at that intersection is increased or decreased by the adjustment time j according to how r_ζ compares with the global average r̄;
Step 9, upper layer neural network feedback value definition
The feedback value r_k of the upper-layer Agent is defined as the average delay of all vehicles at the intersections, serves as the feedback value of the traffic signal control system, and is calculated as the input of the next cycle:
r_k = (1/m)·Σ_{ζ=1..m} r_ζ^n
where m is the number of intersections and n indicates that the current cycle is the n-th cycle.
CN202010287076.2A 2020-04-13 2020-04-13 Continuous intersection signal cooperative control method based on deep reinforcement learning Active CN112365724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010287076.2A CN112365724B (en) 2020-04-13 2020-04-13 Continuous intersection signal cooperative control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010287076.2A CN112365724B (en) 2020-04-13 2020-04-13 Continuous intersection signal cooperative control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112365724A true CN112365724A (en) 2021-02-12
CN112365724B CN112365724B (en) 2022-03-29

Family

ID=74516407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010287076.2A Active CN112365724B (en) 2020-04-13 2020-04-13 Continuous intersection signal cooperative control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112365724B (en)



Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1389820A (en) * 2001-06-05 2003-01-08 郑肖惺 Intelligent city traffic controlling network system
CN104464310A (en) * 2014-12-02 2015-03-25 上海交通大学 Signal collaborative optimization control method and system of multiple intersections of urban region
CN105118308A (en) * 2015-10-12 2015-12-02 青岛大学 Method based on clustering reinforcement learning and used for optimizing traffic signals of urban road intersections
CN107705557A (en) * 2017-09-04 2018-02-16 清华大学 Road network signal control method and device based on depth enhancing network
US20190347933A1 (en) * 2018-05-11 2019-11-14 Virtual Traffic Lights, LLC Method of implementing an intelligent traffic control apparatus having a reinforcement learning based partial traffic detection control system, and an intelligent traffic control apparatus implemented thereby
CN109472984A (en) * 2018-12-27 2019-03-15 苏州科技大学 Signalized control method, system and storage medium based on deeply study
CN109559530A (en) * 2019-01-07 2019-04-02 大连理工大学 A kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning
CN110060475A (en) * 2019-04-17 2019-07-26 清华大学 A kind of multi-intersection signal lamp cooperative control method based on deeply study
CN110264750A (en) * 2019-06-14 2019-09-20 大连理工大学 A kind of multi-intersection signal lamp cooperative control method of the Q value migration based on multitask depth Q network
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299078A (en) * 2021-03-29 2021-08-24 东南大学 Multi-mode traffic trunk line signal coordination control method and device based on multi-agent cooperation
CN113299078B (en) * 2021-03-29 2022-04-08 东南大学 Multi-mode traffic trunk line signal coordination control method and device based on multi-agent cooperation
CN113257008A (en) * 2021-05-12 2021-08-13 兰州交通大学 Pedestrian flow dynamic control system and method based on deep learning
CN113487902B (en) * 2021-05-17 2022-08-12 东南大学 Reinforced learning area signal control method based on vehicle planned path
CN113487902A (en) * 2021-05-17 2021-10-08 东南大学 Reinforced learning area signal control method based on vehicle planned path
CN113299069A (en) * 2021-05-28 2021-08-24 广东工业大学华立学院 Self-adaptive traffic signal control method based on historical error back propagation
CN113299069B (en) * 2021-05-28 2022-05-13 广东工业大学华立学院 Self-adaptive traffic signal control method based on historical error back propagation
CN113724507A (en) * 2021-08-19 2021-11-30 复旦大学 Traffic control and vehicle induction cooperation method and system based on deep reinforcement learning
CN113724507B (en) * 2021-08-19 2024-01-23 复旦大学 Traffic control and vehicle guidance cooperative method and system based on deep reinforcement learning
CN113643543A (en) * 2021-10-13 2021-11-12 北京大学深圳研究生院 Traffic flow control method and traffic signal control system with privacy protection function
CN113643543B (en) * 2021-10-13 2022-01-11 北京大学深圳研究生院 Traffic flow control method and traffic signal control system with privacy protection function
CN114627657A (en) * 2022-03-09 2022-06-14 哈尔滨理工大学 Adaptive traffic signal control method based on deep graph reinforcement learning
CN114898576A (en) * 2022-05-10 2022-08-12 阿波罗智联(北京)科技有限公司 Traffic control signal generation method and target network model training method
CN114898576B (en) * 2022-05-10 2023-12-19 阿波罗智联(北京)科技有限公司 Traffic control signal generation method and target network model training method
CN115171408B (en) * 2022-07-08 2023-05-30 华侨大学 Traffic signal optimization control method
CN117114079A (en) * 2023-10-25 2023-11-24 中泰信合智能科技有限公司 Method for migrating single intersection signal control model to target environment
CN117114079B (en) * 2023-10-25 2024-01-26 中泰信合智能科技有限公司 Method for migrating single intersection signal control model to target environment
CN117173914A (en) * 2023-11-03 2023-12-05 中泰信合智能科技有限公司 Road network signal control unit decoupling method, device and medium for simplifying complex model
CN117173914B (en) * 2023-11-03 2024-01-26 中泰信合智能科技有限公司 Road network signal control unit decoupling method, device and medium for simplifying complex model

Also Published As

Publication number Publication date
CN112365724B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
CN112365724B (en) Continuous intersection signal cooperative control method based on deep reinforcement learning
CN108847037B (en) Non-global information oriented urban road network path planning method
CN112216124B (en) Traffic signal control method based on deep reinforcement learning
Liang et al. Deep reinforcement learning for traffic light control in vehicular networks
Liang et al. A deep reinforcement learning network for traffic light cycle control
CN111696370B (en) Traffic light control method based on heuristic deep Q network
CN108510764B (en) Multi-intersection self-adaptive phase difference coordination control system and method based on Q learning
CN112632858A (en) Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm
CN114038212B (en) Signal lamp control method based on two-stage attention mechanism and deep reinforcement learning
CN109215355A (en) A kind of single-point intersection signal timing optimization method based on deeply study
CN113223305B (en) Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
Liang et al. A deep q learning network for traffic lights’ cycle control in vehicular networks
Mao et al. A comparison of deep reinforcement learning models for isolated traffic signal control
CN111985619A (en) City single intersection control method based on short-term traffic flow prediction
CN114463997A (en) Lantern-free intersection vehicle cooperative control method and system
CN114995119A (en) Urban traffic signal cooperative control method based on multi-agent deep reinforcement learning
Wu et al. ES-CTC: A deep neuroevolution model for cooperative intelligent freeway traffic control
Shamsi et al. Reinforcement learning for traffic light control with emphasis on emergency vehicles
CN113724507B (en) Traffic control and vehicle guidance cooperative method and system based on deep reinforcement learning
CN113487857A (en) Regional multi-intersection variable lane cooperative control decision method
CN115019523A (en) Deep reinforcement learning traffic signal coordination optimization control method based on minimized pressure difference
CN115273502B (en) Traffic signal cooperative control method
CN115116240A (en) Lantern-free intersection vehicle cooperative control method and system
Qiao et al. Traffic signal control using a cooperative EWMA-based multi-agent reinforcement learning
Wu et al. Deep Reinforcement Learning Based Traffic Signal Control: A Comparative Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant