CN114613169A - Traffic signal lamp control method based on double experience pools DQN


Info

Publication number
CN114613169A
Authority
CN
China
Prior art keywords
network
value
experience pool
traffic
experience
Prior art date
Legal status
Granted
Application number
CN202210415387.1A
Other languages
Chinese (zh)
Other versions
CN114613169B (en)
Inventor
孔燕
杨智超
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202210415387.1A
Publication of CN114613169A
Application granted
Publication of CN114613169B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/07 Controlling traffic signals
    • G08G1/08 Controlling traffic signals according to detected number or speed of vehicles
    • G08G1/01 Detecting movement of traffic to be counted or controlled
    • G08G1/0104 Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125 Traffic data processing
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a traffic signal light control method based on a double-experience-pool DQN, comprising: 1. establishing a DQN-based traffic signal control main network and a target value network; 2. initializing the algorithm parameters, collecting the road condition information of the intersection, and building the state value s_t; 3. inputting s_t into the main network and selecting the action a_t with the maximum Q value; 4. executing a_t, computing the reward r_t and the state s_t+1, and storing (s_t, a_t, r_t, s_t+1) in the first experience pool; 5. if the reward r_t is greater than the historical average reward r̄, also storing (s_t, a_t, r_t, s_t+1) in the second experience pool; 6. generating a random number P, selecting the first experience pool with probability 1-P and the second experience pool with probability P, sampling at random from the selected pool, and training the parameters of the main network by minimizing the loss function; 7. periodically updating the parameters of the target value network, updating s_t from the current road condition information, and returning to step 3. The method enables the algorithm to converge quickly, so that the obtained traffic signal control strategy is optimized quickly.

Description

A traffic signal light control method based on a double-experience-pool DQN

Technical Field

The invention belongs to the technical field of traffic signal control, and in particular relates to a traffic signal control method based on double-experience-pool deep Q-learning.

Background Art

There has been extensive research on controlling traffic signals with the deep Q-learning algorithm (DQN). The method needs no labeled test data; instead, the training data are constructed by building an experience pool. The policy obtained at the start of the algorithm is poor, and as the experience pool is continually updated and training proceeds, the policy is gradually optimized and keeps improving. Therefore, how to make the algorithm converge quickly, i.e. how to optimize the policy quickly, is an important factor in the overall performance of the method.

Summary of the Invention

Purpose of the invention: the present invention provides a traffic signal control method based on a double-experience-pool DQN, which enables the algorithm to converge quickly so that the obtained traffic signal control strategy is optimized quickly.

Technical solution: the present invention adopts the following technical solution:

A traffic signal control method based on a double-experience-pool DQN, comprising the steps of:

S1. Establish a DQN-based traffic signal control main network and a target value network. The main network and the target value network have the same structure: the input is a state value, and the output is the maximum Q value over the actions that can be executed in that state, together with the action corresponding to that maximum Q value. The state space of the main network and the target value network is the vector formed by the numbers of vehicles in each lane of the intersection, the action space is the vector formed by the control operations on the current phases of all traffic lights at the intersection, and the reward function is the difference between the number of vehicles in all incoming lanes and the number of vehicles in all outgoing lanes at the intersection;

S2. Randomly initialize the parameters θ of the main network, initialize the parameters θ′ of the target value network to θ, initialize the time step t = 0, collect the road condition information of the intersection, establish the initial state value s_t, and initialize the historical average reward r̄;

S3. Input s_t into the main network, and select the action a_t that maximizes Q(s_t, a; θ) as the control operation applied to the traffic lights at the current time, i.e. a_t = argmax_a Q(s_t, a; θ), where Q(s_t, a; θ) denotes the Q value output by the main network for state s_t and action a under parameters θ;

S4. Execute action a_t, and compute the reward r_t and the next state s_t+1; store (s_t, a_t, r_t, s_t+1) in the first experience pool;

S5. When t > 0, compute the current historical average reward r̄; if r_t > r̄, store (s_t, a_t, r_t, s_t+1) in the second experience pool;

S6. Generate a random number P in the interval (p1, p2), select the first experience pool with probability 1-P and the second experience pool with probability P, randomly sample B records from the selected experience pool, and train the parameters θ of the main network by minimizing the loss function; p1 and p2 are the preset lower and upper bounds of the interval, with 0 < p1 < p2 < 1;

The loss function is:

L(θ) = (1/B) · Σ_i [ r_i + γ · max_a′ Q′(s_i+1, a′; θ′) - max_a Q(s_i, a; θ) ]²

where (s_i, a_i, r_i, s_i+1) are the records randomly sampled from the selected experience pool, γ is the discount factor, max_a′ Q′(s_i+1, a′; θ′) denotes the maximum Q value output by the target value network for input state s_i+1, and max_a Q(s_i, a; θ) denotes the maximum Q value output by the main network for input state s_i;

S7. Increase t by one; if mod(t, C) is 0, update the parameters θ′ of the target value network to the parameters θ of the main network, where mod is the remainder operation and C is the preset parameter-update period; update s_t according to the current road condition information and jump to step S3 to continue.

Further, in step S6 the gradient descent method is used to minimize the loss function and obtain the parameters of the main network.

Further, when the intersection is a crossroads, the state value in the state space of the main network and the target value network is [n1, m1, n2, m2, n3, m3, n4, m4], where nj is the number of vehicles in the j-th incoming lane of the intersection and mj is the number of vehicles in the j-th outgoing lane, j = 1, 2, 3, 4.

Further, the action value in the action space of the main network and the target value network takes one of three values: ac1: increase the current phase duration by T seconds; ac2: decrease the current phase duration by T seconds; ac3: keep the current phase duration unchanged. In the present invention, T is 5 seconds.

Further, in the present invention the lower bound of the interval used to generate the random number P is p1 = 0.7 and the upper bound is p2 = 0.9.

Further, the reward function value is:

r_t = (n1 + n2 + n3 + n4) - (m1 + m2 + m3 + m4)

Further, the first experience pool and the second experience pool both use fixed-capacity queues to store records.

Further, step S5 computes the current historical average reward as r̄_t = ((t-1) · r̄_(t-1) + r_t) / t.

Beneficial effects: the traffic signal control method disclosed in the present invention combines a dual experience pool with DQN. The dual-experience-pool mechanism makes the training of the network parameters converge quickly, so that the obtained traffic signal control strategy is optimized quickly and the traffic lights are regulated more intelligently.

Brief Description of the Drawings

Figure 1 is a flowchart of the traffic signal control method disclosed in the present invention;

Figure 2 is a schematic diagram of the intersection in the embodiment;

Figure 3 is a schematic diagram of the network architecture of the present invention.

Detailed Description of the Embodiments

The present invention is further explained below with reference to the drawings and specific embodiments.

The invention discloses a traffic signal control method based on a double-experience-pool DQN which, as shown in Figure 1, comprises the steps of:

S1. Establish a DQN-based traffic signal control main network and a target value network. The main network and the target value network have the same structure: the input is a state value, and the output is the Q values of the actions that can be executed in that state. The state space of the main network and the target value network is the vector formed by the numbers of vehicles in each lane of the intersection, the action space is the vector formed by the control operations on the current phases of all traffic lights at the intersection, and the reward function is the difference between the number of vehicles in all incoming lanes and the number of vehicles in all outgoing lanes at the intersection;

When the intersection is a crossroads, as shown at intersection A in Figure 2, each of its four approaches has lanes entering the intersection and lanes leaving it; in the figure, N1-N4 are the lanes entering the intersection and M1-M4 are the lanes leaving it. The state value in the state space of the main network and the target value network is then [n1, m1, n2, m2, n3, m3, n4, m4], where nj is the number of vehicles in the j-th incoming lane of the intersection and mj is the number of vehicles in the j-th outgoing lane, j = 1, 2, 3, 4. These data can be captured by sensors or cameras installed on the roads in each direction. The reward function value is

r_t = (n1 + n2 + n3 + n4) - (m1 + m2 + m3 + m4)

i.e. the difference between the number of vehicles in the incoming lanes and the number of vehicles in the outgoing lanes. The action value in the action space of the main network and the target value network takes one of three values: ac1: increase the current phase duration by T seconds; ac2: decrease the current phase duration by T seconds; ac3: keep the current phase duration unchanged, i.e. change the state of the current phase according to the preset traffic-light phase sequence.
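As an illustration of this encoding, the following sketch builds the state vector, the reward and the three actions from the lane counts. It is a minimal sketch in Python: the function names and the dictionary encoding of the actions are assumptions made for illustration and are not specified in the patent.

```python
def build_state(n, m):
    # n, m: vehicle counts on incoming/outgoing lanes 1..4, e.g. read from loop detectors or cameras
    state = []
    for nj, mj in zip(n, m):
        state += [nj, mj]
    return state                      # [n1, m1, n2, m2, n3, m3, n4, m4]

def reward(n, m):
    # difference between the vehicles on the incoming lanes and on the outgoing lanes
    return sum(n) - sum(m)

T = 5                                 # seconds added or removed by an action (T = 5 s in the embodiment)
ACTIONS = {0: +T, 1: -T, 2: 0}        # ac1, ac2, ac3: change applied to the current phase duration
```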

S2. Randomly initialize the parameters θ of the main network, initialize the parameters θ′ of the target value network to θ, initialize the time step t = 0, collect the road condition information of the intersection, establish the initial state value s_t, and initialize the historical average reward r̄;
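A minimal sketch of the two networks and of this initialization follows, assuming a small fully connected Q-network implemented in PyTorch; the patent does not fix the layer sizes, the hidden width or the framework, so the architecture and values below are illustrative only.

```python
import copy
import torch
import torch.nn as nn

STATE_DIM = 8       # [n1, m1, ..., n4, m4] for the crossroads of the embodiment
N_ACTIONS = 3       # ac1: +T seconds, ac2: -T seconds, ac3: keep the current phase duration

class QNet(nn.Module):
    """Main network and target value network share this structure."""
    def __init__(self, state_dim=STATE_DIM, n_actions=N_ACTIONS, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),        # one Q value per action
        )

    def forward(self, state):
        return self.net(state)

main_net = QNet()                        # parameters theta, randomly initialized
target_net = copy.deepcopy(main_net)     # theta' initialized to theta
t = 0                                    # time step
r_bar = 0.0                              # historical average reward (illustrative initial value)
```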

S3. Input s_t into the main network, and select the action a_t that maximizes Q(s_t, a; θ) as the control operation applied to the traffic lights at the current time, i.e. a_t = argmax_a Q(s_t, a; θ), where Q(s_t, a; θ) denotes the Q value output by the main network for state s_t and action a under parameters θ;
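Continuing the sketch above, step S3 reduces to a forward pass through the main network followed by an argmax over the Q values; the greedy rule below contains no exploration term, matching the step as stated.

```python
import torch

def select_action(main_net, s_t):
    # a_t = argmax_a Q(s_t, a; theta)
    with torch.no_grad():
        q_values = main_net(torch.as_tensor(s_t, dtype=torch.float32))
    return int(q_values.argmax().item())
```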

S4. Execute action a_t, and compute the reward r_t and the next state s_t+1; store (s_t, a_t, r_t, s_t+1) in the first experience pool;

S5. When t > 0, compute the current historical average reward r̄; if r_t > r̄, store (s_t, a_t, r_t, s_t+1) in the second experience pool;

The current historical average reward is computed as r̄_t = ((t-1) · r̄_(t-1) + r_t) / t, i.e. from the average reward r̄_(t-1) of the previous time step, the current time step number t and the reward r_t.

In the present invention, both the first experience pool and the second experience pool use fixed-capacity queues to store records. When a queue is full, the record at the head of the queue is deleted and the new record is stored at the tail, thereby updating the experience pool.
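A sketch of this pool mechanism and of steps S4-S5 is given below, assuming Python deques for the fixed-capacity queues; the capacity value and the helper name store_transition are illustrative and not taken from the patent.

```python
from collections import deque

POOL_CAPACITY = 10000                    # illustrative; the patent only states that the capacity is fixed
pool1 = deque(maxlen=POOL_CAPACITY)      # first experience pool: every transition
pool2 = deque(maxlen=POOL_CAPACITY)      # second experience pool: above-average transitions only

def store_transition(s_t, a_t, r_t, s_next, t, r_bar):
    """Steps S4-S5: store the transition and update the historical average reward."""
    pool1.append((s_t, a_t, r_t, s_next))         # a full deque silently drops its oldest record
    if t > 0:
        r_bar = ((t - 1) * r_bar + r_t) / t       # incremental mean over the rewards seen so far
        if r_t > r_bar:
            pool2.append((s_t, a_t, r_t, s_next))
    return r_bar
```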

S6. Generate a random number P in the interval (p1, p2), select the first experience pool with probability 1-P and the second experience pool with probability P, randomly sample B records from the selected experience pool, and train the parameters θ of the main network by minimizing the loss function; p1 and p2 are the preset lower and upper bounds of the interval, with 0 < p1 < p2 < 1. In this embodiment, p1 = 0.7 and p2 = 0.9, i.e. the probability of selecting the second experience pool is greater than that of selecting the first. Because the records in the second experience pool have larger rewards, they perform better than those in the first pool, and training on records from the second pool speeds up convergence. Selecting the first experience pool with the smaller probability 1-P is retained in order to reduce the risk of the network overfitting.
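A possible implementation of this pool-selection rule is sketched below, using the embodiment's p1 = 0.7 and p2 = 0.9; the mini-batch size B and the fall-back to the first pool while the chosen pool is still nearly empty are assumptions added for robustness rather than features of the patent.

```python
import random

P1, P2, B = 0.7, 0.9, 32                 # interval bounds from the embodiment; B is illustrative

def sample_batch(pool1, pool2):
    P = random.uniform(P1, P2)           # random number P in (p1, p2)
    pool = pool2 if random.random() < P else pool1   # pool 2 with probability P, pool 1 with 1-P
    if len(pool) < B:                    # fall-back while the chosen pool is still small
        pool = pool1
    return random.sample(list(pool), min(B, len(pool)))
```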

The loss function is:

L(θ) = (1/B) · Σ_i [ r_i + γ · max_a′ Q′(s_i+1, a′; θ′) - max_a Q(s_i, a; θ) ]²

where (s_i, a_i, r_i, s_i+1) are the records randomly sampled from the selected experience pool, γ is the discount factor, max_a′ Q′(s_i+1, a′; θ′) denotes the maximum Q value output by the target value network for input state s_i+1, and max_a Q(s_i, a; θ) denotes the maximum Q value output by the main network for input state s_i;
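A sketch of one training step under this loss follows, reusing main_net and target_net from the earlier sketch. Note that, following the formula as reconstructed above, the current-state term is max_a Q(s_i, a; θ) rather than the Q(s_i, a_i; θ) of standard DQN; the discount factor, optimizer and learning rate are illustrative choices.

```python
import torch

gamma = 0.95                                                   # discount factor (illustrative)
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)   # gradient-descent optimizer for step S6

def train_step(batch):
    s, _a, r, s_next = zip(*batch)
    s = torch.as_tensor(s, dtype=torch.float32)
    r = torch.as_tensor(r, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values   # r_i + gamma * max_a' Q'(s_i+1, a'; theta')
    q_max = main_net(s).max(dim=1).values                           # max_a Q(s_i, a; theta)
    loss = torch.mean((target - q_max) ** 2)                        # mean squared error over the B sampled records
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```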

In this embodiment, the gradient descent method is used to minimize the loss function and obtain the parameters of the main network. Figure 3 is a schematic diagram of the network architecture of the present invention.

S7. Increase t by one; if mod(t, C) is 0, update the parameters θ′ of the target value network to the parameters θ of the main network, where mod is the remainder operation and C is the preset parameter-update period. The frequency with which the target-value-network parameters are updated can be controlled through the duration between time t-1 and time t and through the value of C. Update s_t according to the current road condition information and jump to step S3 to continue.
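The periodic copy of step S7 can be sketched as follows, with an illustrative value for the update period C.

```python
C = 100                                   # preset parameter-update period (illustrative value)

def maybe_update_target(t, main_net, target_net):
    if t % C == 0:                                           # mod(t, C) == 0
        target_net.load_state_dict(main_net.state_dict())    # theta' <- theta
```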

The dual-experience-pool scheme adopted in the present invention speeds up the convergence of the network during DQN training, thereby better relieving traffic congestion and advancing the fields of intelligent transportation and deep reinforcement learning.

Claims (9)

1. A traffic signal control method based on a double-experience-pool DQN, characterized by comprising the steps of:
S1. establishing a DQN-based traffic signal control main network and a target value network, wherein the main network and the target value network have the same structure, the input is a state value, and the output is the maximum Q value over the actions that can be executed in that state together with the action corresponding to that maximum Q value; the state space of the main network and the target value network is the vector formed by the numbers of vehicles in each lane of the intersection, the action space is the vector formed by the control operations on the current phases of all traffic lights at the intersection, and the reward function is the difference between the number of vehicles in all incoming lanes and the number of vehicles in all outgoing lanes at the intersection;
S2. randomly initializing the parameters θ of the main network, initializing the parameters θ′ of the target value network to θ, initializing the time step t = 0, collecting the road condition information of the intersection, establishing the initial state value s_t, and initializing the historical average reward r̄;
S3. inputting s_t into the main network, and selecting the action a_t that maximizes Q(s_t, a; θ) as the control operation applied to the traffic lights at the current time, i.e. a_t = argmax_a Q(s_t, a; θ), where Q(s_t, a; θ) denotes the Q value output by the main network for state s_t and action a under parameters θ;
S4. executing action a_t and computing the reward r_t and the next state s_t+1; storing (s_t, a_t, r_t, s_t+1) in the first experience pool;
S5. when t > 0, computing the current historical average reward r̄; if r_t > r̄, storing (s_t, a_t, r_t, s_t+1) in the second experience pool;
S6. generating a random number P in the interval (p1, p2), selecting the first experience pool with probability 1-P and the second experience pool with probability P, randomly sampling B records from the selected experience pool, and training the parameters θ of the main network by minimizing the loss function, where p1 and p2 are the preset lower and upper bounds of the interval and 0 < p1 < p2 < 1;
the loss function being:
L(θ) = (1/B) · Σ_i [ r_i + γ · max_a′ Q′(s_i+1, a′; θ′) - max_a Q(s_i, a; θ) ]²
where (s_i, a_i, r_i, s_i+1) are the records randomly sampled from the selected experience pool, γ is the discount factor, max_a′ Q′(s_i+1, a′; θ′) denotes the maximum Q value output by the target value network for input state s_i+1, and max_a Q(s_i, a; θ) denotes the maximum Q value output by the main network for input state s_i;
S7. increasing t by one; if mod(t, C) is 0, updating the parameters θ′ of the target value network to the parameters θ of the main network, where mod is the remainder operation and C is the preset parameter-update period; updating s_t according to the current road condition information and jumping to step S3 to continue.
2. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that in step S6 the gradient descent method is used to minimize the loss function and obtain the parameters of the main network.
3. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that, when the intersection is a crossroads, the state value in the state space of the main network and the target value network is [n1, m1, n2, m2, n3, m3, n4, m4], where nj is the number of vehicles in the j-th incoming lane of the intersection and mj is the number of vehicles in the j-th outgoing lane, j = 1, 2, 3, 4.
4. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that the action value in the action space of the main network and the target value network takes one of three values: ac1: increase the current phase duration by T seconds; ac2: decrease the current phase duration by T seconds; ac3: keep the current phase duration unchanged.
5. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that p1 = 0.7 and p2 = 0.9.
6. The traffic signal control method based on a double-experience-pool DQN according to claim 3, characterized in that the reward function value is r_t = (n1 + n2 + n3 + n4) - (m1 + m2 + m3 + m4).
7. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that the first experience pool and the second experience pool both use fixed-capacity queues to store records.
8. The traffic signal control method based on a double-experience-pool DQN according to claim 1, characterized in that step S5 computes the current historical average reward as r̄_t = ((t-1) · r̄_(t-1) + r_t) / t.
9. The traffic signal control method based on a double-experience-pool DQN according to claim 4, characterized in that T is 5 seconds.
CN202210415387.1A 2022-04-20 2022-04-20 Traffic signal lamp control method based on double experience pools DQN Active CN114613169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210415387.1A CN114613169B (en) 2022-04-20 2022-04-20 Traffic signal lamp control method based on double experience pools DQN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210415387.1A CN114613169B (en) 2022-04-20 2022-04-20 Traffic signal lamp control method based on double experience pools DQN

Publications (2)

Publication Number Publication Date
CN114613169A true CN114613169A (en) 2022-06-10
CN114613169B CN114613169B (en) 2023-02-28

Family

ID=81870213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210415387.1A Active CN114613169B (en) 2022-04-20 2022-04-20 Traffic signal lamp control method based on double experience pools DQN

Country Status (1)

Country Link
CN (1) CN114613169B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109559530A (en) * 2019-01-07 2019-04-02 大连理工大学 A kind of multi-intersection signal lamp cooperative control method based on Q value Transfer Depth intensified learning
CN110060475A (en) * 2019-04-17 2019-07-26 清华大学 A kind of multi-intersection signal lamp cooperative control method based on deeply study
CN110930734A (en) * 2019-11-30 2020-03-27 天津大学 Intelligent idle traffic indicator lamp control method based on reinforcement learning
CN111696370A (en) * 2020-06-16 2020-09-22 西安电子科技大学 Traffic light control method based on heuristic deep Q network
CN111898211A (en) * 2020-08-07 2020-11-06 吉林大学 Intelligent vehicle speed decision method and simulation method based on deep reinforcement learning
CN112632858A (en) * 2020-12-23 2021-04-09 浙江工业大学 Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm
CN113411099A (en) * 2021-05-28 2021-09-17 杭州电子科技大学 Double-change frequency hopping pattern intelligent decision method based on PPER-DQN
CN113947928A (en) * 2021-10-15 2022-01-18 河南工业大学 Traffic signal lamp timing method based on combination of deep reinforcement learning and extended Kalman filtering
CN113963553A (en) * 2021-10-20 2022-01-21 西安工业大学 Road intersection signal lamp green signal ratio control method, device and equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WAN C-H ET AL: "Value-based deep reinforcement learning for adaptive isolated intersection signal control", IET INTELLIGENT TRANSPORT SYSTEMS *
DING Wenjie: "Research on adaptive traffic signal control based on deep reinforcement learning", China Excellent Master's Theses Full-text Database, Engineering Science and Technology II *
XU Dongwei et al.: "A survey of urban traffic signal control based on deep reinforcement learning", Journal of Transportation Engineering and Information *
GAN Zhengsheng et al.: "Few-shot remote sensing image classification based on meta-learning", Computer Engineering and Design *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115643623A (en) * 2022-10-17 2023-01-24 北京航空航天大学 A Wireless Ad Hoc Network Device Routing Method Based on Deep Q-Learning
CN115758705A (en) * 2022-11-10 2023-03-07 北京航天驭星科技有限公司 Modeling method, model and acquisition method of satellite north-south conservation strategy model
CN117010482A (en) * 2023-07-06 2023-11-07 三峡大学 Strategy method based on double experience pool priority sampling and DuelingDQN implementation

Also Published As

Publication number Publication date
CN114613169B (en) 2023-02-28

Similar Documents

Publication Publication Date Title
CN114613169A (en) Traffic signal lamp control method based on double experience pools DQN
CN109559530B (en) A collaborative control method for multi-intersection signal lights based on Q-value transfer deep reinforcement learning
CN110264750B (en) Multi-intersection signal lamp cooperative control method based on Q value migration of multi-task deep Q network
CN110047278A (en) A kind of self-adapting traffic signal control system and method based on deeply study
CN114627657A (en) Adaptive traffic signal control method based on deep graph reinforcement learning
WO2021051870A1 (en) Reinforcement learning model-based information control method and apparatus, and computer device
CN110136456A (en) Traffic light anti-jamming control method and system based on deep reinforcement learning
CN110570672B (en) A method of regional traffic light control based on graph neural network
CN111260937A (en) Cross traffic signal lamp control method based on reinforcement learning
CN108335497A (en) A kind of traffic signals adaptive control system and method
CN113554875B (en) Variable speed-limiting control method for heterogeneous traffic flow of expressway based on edge calculation
CN106097733B (en) A kind of traffic signal optimization control method based on Policy iteration and cluster
CN111951574A (en) Adaptive Iterative Learning Control Method for Traffic Signals Based on Attenuated Memory Removal
CN115019523A (en) Deep reinforcement learning traffic signal coordination optimization control method based on minimized pressure difference
CN113299079B (en) Regional intersection signal control method based on PPO and graph convolution neural network
CN115691167A (en) Single-point traffic signal control method based on intersection holographic data
CN115359672A (en) A Traffic Area Boundary Control Method Combining Data-Driven and Reinforcement Learning
CN115472023B (en) Intelligent traffic light control method and device based on deep reinforcement learning
CN116524745A (en) Cloud edge cooperative area traffic signal dynamic timing system and method
CN115578870A (en) A Traffic Signal Control Method Based on Proximal Policy Optimization
CN118172951A (en) Urban intersection signal control method based on deep reinforcement learning
CN114419884B (en) Self-adaptive signal control method and system based on reinforcement learning and phase competition
CN116597670A (en) Traffic signal lamp timing method, device and equipment based on deep reinforcement learning
CN113724507B (en) Traffic control and vehicle guidance cooperative method and system based on deep reinforcement learning
WO2023206248A1 (en) Control method and apparatus for traffic light, and road network system, electronic device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant