CN114613169A - Traffic signal lamp control method based on double experience pools DQN


Info

Publication number
CN114613169A
Authority
CN
China
Prior art keywords
network
value
experience pool
traffic
experience
Prior art date
Legal status
Granted
Application number
CN202210415387.1A
Other languages
Chinese (zh)
Other versions
CN114613169B (en)
Inventor
孔燕
杨智超
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202210415387.1A
Publication of CN114613169A
Application granted
Publication of CN114613169B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/07 Controlling traffic signals
    • G08G1/08 Controlling traffic signals according to detected number or speed of vehicles
    • G08G1/01 Detecting movement of traffic to be counted or controlled
    • G08G1/0104 Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125 Traffic data processing
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a traffic signal light control method based on a double-experience-pool DQN, comprising: 1. establishing a DQN-based traffic signal control main network and a target value network; 2. initializing the algorithm parameters, collecting the road condition information of the intersection, and building the state value s_t; 3. inputting s_t into the main network and selecting the action a_t with the maximum Q value; 4. executing a_t, computing the reward r_t and the state s_t+1, and storing (s_t, a_t, r_t, s_t+1) in the first experience pool; 5. if the reward r_t is greater than the historical average reward r̄, also storing (s_t, a_t, r_t, s_t+1) in the second experience pool; 6. generating a random number P, selecting the first experience pool with probability 1-P and the second experience pool with probability P, sampling at random from the selected pool, and training the parameters of the main network by minimizing the loss function; 7. periodically updating the parameters of the target value network, updating s_t from the current road condition information, and returning to step 3. The method enables the algorithm to converge quickly, so that the obtained traffic signal control strategy is optimized quickly.

Description

A traffic signal light control method based on a double-experience-pool DQN

Technical Field

The invention belongs to the technical field of traffic signal control, and in particular relates to a traffic signal control method based on double-experience-pool deep Q-learning.

Background Art

There has been extensive research on controlling traffic signals with the deep Q-learning algorithm (DQN). The method needs no labeled test data; instead, the training data are constructed by building an experience pool. The policy obtained at the start of the algorithm is poor, and as the experience pool is continually updated and training proceeds, the policy is gradually optimized and keeps improving. Therefore, how to make the algorithm converge quickly, i.e. how to optimize the policy quickly, is an important factor in the overall performance of the method.

Summary of the Invention

Purpose of the invention: the present invention provides a traffic signal control method based on a double-experience-pool DQN, which enables the algorithm to converge quickly so that the obtained traffic signal control strategy is optimized quickly.

Technical solution: the present invention adopts the following technical solution:

A traffic signal control method based on a double-experience-pool DQN, comprising the steps of:

S1. Establish a DQN-based traffic signal control main network and a target value network. The main network and the target value network have the same structure: the input is a state value, and the output is the maximum Q value over the actions that can be executed in that state, together with the action corresponding to that maximum Q value. The state space of the main network and the target value network is the vector formed by the numbers of vehicles in each lane of the intersection, the action space is the vector formed by the control operations on the current phases of all traffic lights at the intersection, and the reward function is the difference between the number of vehicles in all incoming lanes and the number of vehicles in all outgoing lanes at the intersection;

S2. Randomly initialize the parameters θ of the main network, initialize the parameters θ′ of the target value network to θ, initialize the time step t = 0, collect the road condition information of the intersection, establish the initial state value s_t, and initialize the historical average reward r̄;

S3. Input s_t into the main network, and select the action a_t that maximizes Q(s_t, a; θ) as the control operation applied to the traffic lights at the current time, i.e. a_t = argmax_a Q(s_t, a; θ), where Q(s_t, a; θ) denotes the Q value output by the main network for state s_t and action a under parameters θ;

S4. Execute action a_t, and compute the reward r_t and the next state s_t+1; store (s_t, a_t, r_t, s_t+1) in the first experience pool;

S5. When t > 0, compute the current historical average reward r̄; if r_t > r̄, store (s_t, a_t, r_t, s_t+1) in the second experience pool;

S6. Generate a random number P in the interval (p1, p2), select the first experience pool with probability 1-P and the second experience pool with probability P, randomly sample B records from the selected experience pool, and train the parameters θ of the main network by minimizing the loss function; p1 and p2 are the preset lower and upper bounds of the interval, with 0 < p1 < p2 < 1;

The loss function is:

L(θ) = (1/B) · Σ_i [ r_i + γ · max_a′ Q′(s_i+1, a′; θ′) - max_a Q(s_i, a; θ) ]²

where (s_i, a_i, r_i, s_i+1) are the records randomly sampled from the selected experience pool, γ is the discount factor, max_a′ Q′(s_i+1, a′; θ′) denotes the maximum Q value output by the target value network for input state s_i+1, and max_a Q(s_i, a; θ) denotes the maximum Q value output by the main network for input state s_i;

S7. Increase t by one; if mod(t, C) is 0, update the parameters θ′ of the target value network to the parameters θ of the main network, where mod is the remainder operation and C is the preset parameter-update period; update s_t according to the current road condition information and jump to step S3 to continue.

Further, in step S6 the gradient descent method is used to minimize the loss function and obtain the parameters of the main network.

Further, when the intersection is a crossroads, the state value in the state space of the main network and the target value network is [n1, m1, n2, m2, n3, m3, n4, m4], where nj is the number of vehicles in the j-th incoming lane of the intersection and mj is the number of vehicles in the j-th outgoing lane, j = 1, 2, 3, 4.

Further, the action value in the action space of the main network and the target value network takes one of three values: ac1: increase the current phase duration by T seconds; ac2: decrease the current phase duration by T seconds; ac3: keep the current phase duration unchanged. In the present invention, T is 5 seconds.

Further, in the present invention the lower bound of the interval used to generate the random number P is p1 = 0.7 and the upper bound is p2 = 0.9.

Further, the reward function value is:

r_t = (n1 + n2 + n3 + n4) - (m1 + m2 + m3 + m4)

Further, the first experience pool and the second experience pool both use fixed-capacity queues to store records.

Further, step S5 computes the current historical average reward as r̄_t = ((t-1) · r̄_(t-1) + r_t) / t.

Beneficial effects: the traffic signal control method disclosed in the present invention combines a dual experience pool with DQN. The dual-experience-pool mechanism makes the training of the network parameters converge quickly, so that the obtained traffic signal control strategy is optimized quickly and the traffic lights are regulated more intelligently.

Brief Description of the Drawings

Figure 1 is a flowchart of the traffic signal control method disclosed in the present invention;

Figure 2 is a schematic diagram of the intersection in the embodiment;

Figure 3 is a schematic diagram of the network architecture of the present invention.

Detailed Description of the Embodiments

The present invention is further explained below with reference to the drawings and specific embodiments.

The invention discloses a traffic signal control method based on a double-experience-pool DQN which, as shown in Figure 1, comprises the steps of:

S1. Establish a DQN-based traffic signal control main network and a target value network. The main network and the target value network have the same structure: the input is a state value, and the output is the Q values of the actions that can be executed in that state. The state space of the main network and the target value network is the vector formed by the numbers of vehicles in each lane of the intersection, the action space is the vector formed by the control operations on the current phases of all traffic lights at the intersection, and the reward function is the difference between the number of vehicles in all incoming lanes and the number of vehicles in all outgoing lanes at the intersection;

When the intersection is a crossroads, as shown at intersection A in Figure 2, each of its four approaches has lanes entering the intersection and lanes leaving it; in the figure, N1-N4 are the lanes entering the intersection and M1-M4 are the lanes leaving it. The state value in the state space of the main network and the target value network is then [n1, m1, n2, m2, n3, m3, n4, m4], where nj is the number of vehicles in the j-th incoming lane of the intersection and mj is the number of vehicles in the j-th outgoing lane, j = 1, 2, 3, 4. These data can be captured by sensors or cameras installed on the roads in each direction. The reward function value is

r_t = (n1 + n2 + n3 + n4) - (m1 + m2 + m3 + m4)

i.e. the difference between the number of vehicles in the incoming lanes and the number of vehicles in the outgoing lanes. The action value in the action space of the main network and the target value network takes one of three values: ac1: increase the current phase duration by T seconds; ac2: decrease the current phase duration by T seconds; ac3: keep the current phase duration unchanged, i.e. change the state of the current phase according to the preset traffic-light phase sequence.
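As an illustration of this encoding, the following sketch builds the state vector, the reward and the three actions from the lane counts. It is a minimal sketch in Python: the function names and the dictionary encoding of the actions are assumptions made for illustration and are not specified in the patent.

```python
def build_state(n, m):
    # n, m: vehicle counts on incoming/outgoing lanes 1..4, e.g. read from loop detectors or cameras
    state = []
    for nj, mj in zip(n, m):
        state += [nj, mj]
    return state                      # [n1, m1, n2, m2, n3, m3, n4, m4]

def reward(n, m):
    # difference between the vehicles on the incoming lanes and on the outgoing lanes
    return sum(n) - sum(m)

T = 5                                 # seconds added or removed by an action (T = 5 s in the embodiment)
ACTIONS = {0: +T, 1: -T, 2: 0}        # ac1, ac2, ac3: change applied to the current phase duration
```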

S2. Randomly initialize the parameters θ of the main network, initialize the parameters θ′ of the target value network to θ, initialize the time step t = 0, collect the road condition information of the intersection, establish the initial state value s_t, and initialize the historical average reward r̄;
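A minimal sketch of the two networks and of this initialization follows, assuming a small fully connected Q-network implemented in PyTorch; the patent does not fix the layer sizes, the hidden width or the framework, so the architecture and values below are illustrative only.

```python
import copy
import torch
import torch.nn as nn

STATE_DIM = 8       # [n1, m1, ..., n4, m4] for the crossroads of the embodiment
N_ACTIONS = 3       # ac1: +T seconds, ac2: -T seconds, ac3: keep the current phase duration

class QNet(nn.Module):
    """Main network and target value network share this structure."""
    def __init__(self, state_dim=STATE_DIM, n_actions=N_ACTIONS, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),        # one Q value per action
        )

    def forward(self, state):
        return self.net(state)

main_net = QNet()                        # parameters theta, randomly initialized
target_net = copy.deepcopy(main_net)     # theta' initialized to theta
t = 0                                    # time step
r_bar = 0.0                              # historical average reward (illustrative initial value)
```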

S3. Input s_t into the main network, and select the action a_t that maximizes Q(s_t, a; θ) as the control operation applied to the traffic lights at the current time, i.e. a_t = argmax_a Q(s_t, a; θ), where Q(s_t, a; θ) denotes the Q value output by the main network for state s_t and action a under parameters θ;
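Continuing the sketch above, step S3 reduces to a forward pass through the main network followed by an argmax over the Q values; the greedy rule below contains no exploration term, matching the step as stated.

```python
import torch

def select_action(main_net, s_t):
    # a_t = argmax_a Q(s_t, a; theta)
    with torch.no_grad():
        q_values = main_net(torch.as_tensor(s_t, dtype=torch.float32))
    return int(q_values.argmax().item())
```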

S4. Execute action a_t, and compute the reward r_t and the next state s_t+1; store (s_t, a_t, r_t, s_t+1) in the first experience pool;

S5. When t > 0, compute the current historical average reward r̄; if r_t > r̄, store (s_t, a_t, r_t, s_t+1) in the second experience pool;

The current historical average reward is computed as r̄_t = ((t-1) · r̄_(t-1) + r_t) / t, i.e. from the average reward r̄_(t-1) of the previous time step, the current time step number t and the reward r_t.

In the present invention, both the first experience pool and the second experience pool use fixed-capacity queues to store records. When a queue is full, the record at the head of the queue is deleted and the new record is stored at the tail, thereby updating the experience pool.
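A sketch of this pool mechanism and of steps S4-S5 is given below, assuming Python deques for the fixed-capacity queues; the capacity value and the helper name store_transition are illustrative and not taken from the patent.

```python
from collections import deque

POOL_CAPACITY = 10000                    # illustrative; the patent only states that the capacity is fixed
pool1 = deque(maxlen=POOL_CAPACITY)      # first experience pool: every transition
pool2 = deque(maxlen=POOL_CAPACITY)      # second experience pool: above-average transitions only

def store_transition(s_t, a_t, r_t, s_next, t, r_bar):
    """Steps S4-S5: store the transition and update the historical average reward."""
    pool1.append((s_t, a_t, r_t, s_next))         # a full deque silently drops its oldest record
    if t > 0:
        r_bar = ((t - 1) * r_bar + r_t) / t       # incremental mean over the rewards seen so far
        if r_t > r_bar:
            pool2.append((s_t, a_t, r_t, s_next))
    return r_bar
```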

S6. Generate a random number P in the interval (p1, p2), select the first experience pool with probability 1-P and the second experience pool with probability P, randomly sample B records from the selected experience pool, and train the parameters θ of the main network by minimizing the loss function; p1 and p2 are the preset lower and upper bounds of the interval, with 0 < p1 < p2 < 1. In this embodiment, p1 = 0.7 and p2 = 0.9, i.e. the probability of selecting the second experience pool is greater than that of selecting the first. Because the records in the second experience pool have larger rewards, they perform better than those in the first pool, and training on records from the second pool speeds up convergence. Selecting the first experience pool with the smaller probability 1-P is retained in order to reduce the risk of the network overfitting.
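A possible implementation of this pool-selection rule is sketched below, using the embodiment's p1 = 0.7 and p2 = 0.9; the mini-batch size B and the fall-back to the first pool while the chosen pool is still nearly empty are assumptions added for robustness rather than features of the patent.

```python
import random

P1, P2, B = 0.7, 0.9, 32                 # interval bounds from the embodiment; B is illustrative

def sample_batch(pool1, pool2):
    P = random.uniform(P1, P2)           # random number P in (p1, p2)
    pool = pool2 if random.random() < P else pool1   # pool 2 with probability P, pool 1 with 1-P
    if len(pool) < B:                    # fall-back while the chosen pool is still small
        pool = pool1
    return random.sample(list(pool), min(B, len(pool)))
```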

The loss function is:

L(θ) = (1/B) · Σ_i [ r_i + γ · max_a′ Q′(s_i+1, a′; θ′) - max_a Q(s_i, a; θ) ]²

where (s_i, a_i, r_i, s_i+1) are the records randomly sampled from the selected experience pool, γ is the discount factor, max_a′ Q′(s_i+1, a′; θ′) denotes the maximum Q value output by the target value network for input state s_i+1, and max_a Q(s_i, a; θ) denotes the maximum Q value output by the main network for input state s_i;
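A sketch of one training step under this loss follows, reusing main_net and target_net from the earlier sketch. Note that, following the formula as reconstructed above, the current-state term is max_a Q(s_i, a; θ) rather than the Q(s_i, a_i; θ) of standard DQN; the discount factor, optimizer and learning rate are illustrative choices.

```python
import torch

gamma = 0.95                                                   # discount factor (illustrative)
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)   # gradient-descent optimizer for step S6

def train_step(batch):
    s, _a, r, s_next = zip(*batch)
    s = torch.as_tensor(s, dtype=torch.float32)
    r = torch.as_tensor(r, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values   # r_i + gamma * max_a' Q'(s_i+1, a'; theta')
    q_max = main_net(s).max(dim=1).values                           # max_a Q(s_i, a; theta)
    loss = torch.mean((target - q_max) ** 2)                        # mean squared error over the B sampled records
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```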

In this embodiment, the gradient descent method is used to minimize the loss function and obtain the parameters of the main network. Figure 3 is a schematic diagram of the network architecture of the present invention.

S7. Increase t by one; if mod(t, C) is 0, update the parameters θ′ of the target value network to the parameters θ of the main network, where mod is the remainder operation and C is the preset parameter-update period. The frequency with which the target-value-network parameters are updated can be controlled through the duration between time t-1 and time t and through the value of C. Update s_t according to the current road condition information and jump to step S3 to continue.
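The periodic copy of step S7 can be sketched as follows, with an illustrative value for the update period C.

```python
C = 100                                   # preset parameter-update period (illustrative value)

def maybe_update_target(t, main_net, target_net):
    if t % C == 0:                                           # mod(t, C) == 0
        target_net.load_state_dict(main_net.state_dict())    # theta' <- theta
```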

The dual-experience-pool scheme adopted in the present invention speeds up the convergence of the network during DQN training, thereby better relieving traffic congestion and advancing the fields of intelligent transportation and deep reinforcement learning.

Claims (9)

1. A traffic signal control method based on a double-experience-pool DQN, characterized by comprising the steps of:
S1. establishing a DQN-based traffic signal control main network and a target value network, wherein the main network and the target value network have the same structure, the input is a state value, and the output is the maximum Q value over the actions that can be executed in that state together with the action corresponding to that maximum Q value; the state space of the main network and the target value network is the vector formed by the numbers of vehicles in each lane of the intersection, the action space is the vector formed by the control operations on the current phases of all traffic lights at the intersection, and the reward function is the difference between the number of vehicles in all incoming lanes and the number of vehicles in all outgoing lanes at the intersection;
S2. randomly initializing the parameters θ of the main network, initializing the parameters θ′ of the target value network to θ, initializing the time step t = 0, collecting the road condition information of the intersection, establishing the initial state value s_t, and initializing the historical average reward r̄;
S3. inputting s_t into the main network, and selecting the action a_t that maximizes Q(s_t, a; θ) as the control operation applied to the traffic lights at the current time, i.e. a_t = argmax_a Q(s_t, a; θ), where Q(s_t, a; θ) denotes the Q value output by the main network for state s_t and action a under parameters θ;
S4. executing action a_t and computing the reward r_t and the next state s_t+1; storing (s_t, a_t, r_t, s_t+1) in the first experience pool;
S5. when t > 0, computing the current historical average reward r̄; if r_t > r̄, storing (s_t, a_t, r_t, s_t+1) in the second experience pool;
S6. generating a random number P in the interval (p1, p2), selecting the first experience pool with probability 1-P and the second experience pool with probability P, randomly sampling B records from the selected experience pool, and training the parameters θ of the main network by minimizing the loss function, where p1 and p2 are the preset lower and upper bounds of the interval and 0 < p1 < p2 < 1;
the loss function being:
L(θ) = (1/B) · Σ_i [ r_i + γ · max_a′ Q′(s_i+1, a′; θ′) - max_a Q(s_i, a; θ) ]²
where (s_i, a_i, r_i, s_i+1) are the records randomly sampled from the selected experience pool, γ is the discount factor, max_a′ Q′(s_i+1, a′; θ′) denotes the maximum Q value output by the target value network for input state s_i+1, and max_a Q(s_i, a; θ) denotes the maximum Q value output by the main network for input state s_i;
S7. increasing t by one; if mod(t, C) is 0, updating the parameters θ′ of the target value network to the parameters θ of the main network, where mod is the remainder operation and C is the preset parameter-update period; updating s_t according to the current road condition information and jumping to step S3 to continue.
2. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that in step S6 the gradient descent method is used to minimize the loss function and obtain the parameters of the main network.
3. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that, when the intersection is a crossroads, the state value in the state space of the main network and the target value network is [n1, m1, n2, m2, n3, m3, n4, m4], where nj is the number of vehicles in the j-th incoming lane of the intersection and mj is the number of vehicles in the j-th outgoing lane, j = 1, 2, 3, 4.
4. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that the action value in the action space of the main network and the target value network takes one of three values: ac1: increase the current phase duration by T seconds; ac2: decrease the current phase duration by T seconds; ac3: keep the current phase duration unchanged.
5. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that p1 = 0.7 and p2 = 0.9.
6. The traffic signal control method based on a double-experience-pool DQN according to claim 3, characterized in that the reward function value is r_t = (n1 + n2 + n3 + n4) - (m1 + m2 + m3 + m4).
7. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that the first experience pool and the second experience pool both use fixed-capacity queues to store records.
8. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that step S5 computes the current historical average reward as r̄_t = ((t-1) · r̄_(t-1) + r_t) / t.
9. The traffic signal control method based on a double-experience-pool DQN according to claim 4, characterized in that T is 5 seconds.
CN202210415387.1A 2022-04-20 2022-04-20 Traffic signal lamp control method based on double experience pools DQN Active CN114613169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210415387.1A CN114613169B (en) 2022-04-20 2022-04-20 Traffic signal lamp control method based on double experience pools DQN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210415387.1A CN114613169B (en) 2022-04-20 2022-04-20 Traffic signal lamp control method based on double experience pools DQN

Publications (2)

Publication Number Publication Date
CN114613169A true CN114613169A (en) 2022-06-10
CN114613169B CN114613169B (en) 2023-02-28

Family

ID=81870213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210415387.1A Active CN114613169B (en) 2022-04-20 2022-04-20 Traffic signal lamp control method based on double experience pools DQN

Country Status (1)

Country Link
CN (1) CN114613169B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109559530A (en) * 2019-01-07 2019-04-02 大连理工大学 A kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning
CN110060475A (en) * 2019-04-17 2019-07-26 清华大学 A kind of multi-intersection signal lamp cooperative control method based on deeply study
CN110930734A (en) * 2019-11-30 2020-03-27 天津大学 Intelligent idle traffic indicator lamp control method based on reinforcement learning
CN111696370A (en) * 2020-06-16 2020-09-22 西安电子科技大学 Traffic light control method based on heuristic deep Q network
CN111898211A (en) * 2020-08-07 2020-11-06 吉林大学 Intelligent vehicle speed decision method and simulation method based on deep reinforcement learning
CN112632858A (en) * 2020-12-23 2021-04-09 浙江工业大学 Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm
CN113411099A (en) * 2021-05-28 2021-09-17 杭州电子科技大学 Double-change frequency hopping pattern intelligent decision method based on PPER-DQN
CN113947928A (en) * 2021-10-15 2022-01-18 河南工业大学 Traffic signal lamp timing method based on combination of deep reinforcement learning and extended Kalman filtering
CN113963553A (en) * 2021-10-20 2022-01-21 西安工业大学 Road intersection signal lamp green signal ratio control method, device and equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WAN C-H ET AL: "Value-based deep reinforcement learning for adaptive isolated intersection signal control", IET INTELLIGENT TRANSPORT SYSTEMS *
DING Wenjie: "Research on adaptive traffic signal control based on deep reinforcement learning", China Excellent Master's Theses Full-text Database, Engineering Science and Technology II *
XU Dongwei et al.: "A survey of urban traffic signal control based on deep reinforcement learning", Journal of Transportation Engineering and Information *
GAN Zhengsheng et al.: "Few-shot remote sensing image classification based on meta-learning", Computer Engineering and Design *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115643623A (en) * 2022-10-17 2023-01-24 北京航空航天大学 A Wireless Ad Hoc Network Device Routing Method Based on Deep Q-Learning
CN115758705A (en) * 2022-11-10 2023-03-07 北京航天驭星科技有限公司 Modeling method, model and acquisition method of satellite north-south conservation strategy model
CN117010482A (en) * 2023-07-06 2023-11-07 三峡大学 Strategy method based on double experience pool priority sampling and DuelingDQN implementation

Also Published As

Publication number Publication date
CN114613169B (en) 2023-02-28

Similar Documents

Publication Publication Date Title
CN114613169A (en) Traffic signal lamp control method based on double experience pools DQN
CN109559530B (en) A collaborative control method for multi-intersection signal lights based on Q-value transfer deep reinforcement learning
CN110264750B (en) Multi-intersection signal lamp cooperative control method based on Q value migration of multi-task deep Q network
CN110047278A (en) A kind of self-adapting traffic signal control system and method based on deeply study
CN114627657A (en) Adaptive traffic signal control method based on deep graph reinforcement learning
WO2021051870A1 (en) Reinforcement learning model-based information control method and apparatus, and computer device
CN110136456A (en) Traffic light anti-jamming control method and system based on deep reinforcement learning
CN110570672B (en) A method of regional traffic light control based on graph neural network
CN111260937A (en) Cross traffic signal lamp control method based on reinforcement learning
CN108335497A (en) A kind of traffic signals adaptive control system and method
CN113554875B (en) Variable speed-limiting control method for heterogeneous traffic flow of expressway based on edge calculation
CN106097733B (en) A kind of traffic signal optimization control method based on Policy iteration and cluster
CN111951574A (en) Adaptive Iterative Learning Control Method for Traffic Signals Based on Attenuated Memory Removal
CN115019523A (en) Deep reinforcement learning traffic signal coordination optimization control method based on minimized pressure difference
CN113299079B (en) Regional intersection signal control method based on PPO and graph convolution neural network
CN115691167A (en) Single-point traffic signal control method based on intersection holographic data
CN115359672A (en) A Traffic Area Boundary Control Method Combining Data-Driven and Reinforcement Learning
CN115472023B (en) Intelligent traffic light control method and device based on deep reinforcement learning
CN116524745A (en) Cloud edge cooperative area traffic signal dynamic timing system and method
CN115578870A (en) A Traffic Signal Control Method Based on Proximal Policy Optimization
CN118172951A (en) Urban intersection signal control method based on deep reinforcement learning
CN114419884B (en) Self-adaptive signal control method and system based on reinforcement learning and phase competition
CN116597670A (en) Traffic signal lamp timing method, device and equipment based on deep reinforcement learning
CN113724507B (en) Traffic control and vehicle guidance cooperative method and system based on deep reinforcement learning
WO2023206248A1 (en) Control method and apparatus for traffic light, and road network system, electronic device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant