KR102346507B1

KR102346507B1 - Method and apparatus for controlling distributed traffic signal based on reinforcement learning in multi-intersections environment

Info

Publication number: KR102346507B1
Application number: KR1020200013141A
Authority: KR
Inventors: 임유진; 주현진
Original assignee: 숙명여자대학교산학협력단
Priority date: 2020-02-04
Filing date: 2020-02-04
Publication date: 2022-01-03
Also published as: KR20210099369A

Abstract

The present invention provides a reinforcement-learning-based distributed traffic signal control method in a multi-intersection environment, the method comprising: setting states distinguished according to a plurality of travel directions at an intersection, an action of setting the traffic signal for a specific state to one of a plurality of signal combinations, and a reward determined according to the vehicle throughput of the intersection and the average vehicle waiting time at a plurality of intersections adjacent to the intersection; performing reinforcement learning using predetermined traffic data based on the state, the action, and the reward; and setting traffic signals for a plurality of travel directions of a target intersection corresponding to the intersection, based on the result of the reinforcement learning and traffic information of the target intersection.

Description

Distributed traffic signal control method and apparatus based on reinforcement learning in a multi-intersection environment

The present invention provides a method and apparatus for controlling traffic signals based on reinforcement learning in a multi-intersection environment.

Traffic congestion occurs in urban areas every day. It is costly because it increases fuel consumption, operating costs, and travel time, and it also causes environmental pollution. Traffic congestion is therefore a major problem for transportation systems to solve. Various studies on traffic management systems have been conducted to address it, and recent research has focused on efficient, safe, and environmentally friendly intelligent transportation systems.

The smart city is a recent trend in urban development that aims to improve quality of life by making almost everything used in daily life efficient and convenient. A smart traffic management system, which aims to reduce traffic congestion, is one of the important aspects of a smart city. Among traffic management measures, traffic signal control remains the most effective and important means in use. Conventional traffic signal control in current use relies on a fixed-time mechanism.

However, as smart cities develop, a smart traffic signal control system must process large amounts of data and adapt to a continuously changing traffic environment. Many researchers have therefore tried to solve the traffic signal control problem using artificial intelligence techniques.

However, because intersections are connected in sequence and influence one another, traffic can be controlled more efficiently if the signal order changes flexibly according to the situation.

Accordingly, the present invention seeks to provide a distributed traffic signal control method and apparatus optimized using reinforcement learning in a multi-intersection environment.

The present invention provides a method and apparatus for optimizing distributed traffic signal control using reinforcement learning in a multi-intersection environment.

In addition, the present invention provides an algorithm that distributes traffic signals efficiently by allocating appropriate signals so that as many vehicles as possible pass through the intersection, and by adjusting the traffic signal order in consideration of the average vehicle waiting time at all adjacent intersections.

The present invention provides a reinforcement-learning-based distributed traffic signal control method in a multi-intersection environment, the method comprising: setting states distinguished according to a plurality of travel directions at an intersection, an action of setting the traffic signal for a specific state to one of a plurality of signal combinations, and a reward determined according to the vehicle throughput of the intersection and the average vehicle waiting time at a plurality of intersections adjacent to the intersection; performing reinforcement learning using predetermined traffic data based on the state, the action, and the reward; and setting traffic signals for a plurality of travel directions of a target intersection corresponding to the intersection, based on the result of the reinforcement learning and traffic information of the target intersection.

According to an embodiment, when the specific state is a current state indicating the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection, the step of performing the reinforcement learning may, each time the action is applied to the current state, reset the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection as the new current state.

According to an embodiment, the reinforcement learning may be Q-learning.

According to an embodiment, the step of setting the traffic signals may include: determining, using the traffic information of the target intersection, the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the target intersection; calculating, using that travel direction and the Q-function resulting from the reinforcement learning, a maximum traffic signal, which is the traffic signal among the plurality of signal combinations that maximizes the value of the Q-function; and setting the traffic signal for the travel direction with the largest number of waiting vehicles to the maximum traffic signal.

According to an embodiment, the reward may be proportional to the vehicle throughput of the intersection per unit time and inversely proportional to the average vehicle waiting time at the plurality of intersections adjacent to the intersection.

According to an embodiment, the reward may be defined by Equation 1.

[Equation 1]

Here, r_t is the reward, i is the intersection index, α is a weighting coefficient, ω is a value greater than 1, tp is the vehicle throughput per unit time, L_agent is the number of adjacent intersections, ξ is a value between 0 and 1, and wt is the average vehicle waiting time at the plurality of adjacent intersections.

In addition, the present invention provides a reinforcement-learning-based distributed traffic signal control apparatus in a multi-intersection environment, the apparatus comprising: a preparation unit that sets states distinguished according to a plurality of travel directions at an intersection, an action of setting the traffic signal for a specific state to one of a plurality of signal combinations, and a reward determined according to the vehicle throughput of the intersection and the average vehicle waiting time at a plurality of intersections adjacent to the intersection; a learning unit that performs reinforcement learning using predetermined traffic data based on the state, the action, and the reward; and a setting unit that sets traffic signals for a plurality of travel directions of a target intersection corresponding to the intersection, based on the result of the reinforcement learning and traffic information of the target intersection.

According to an embodiment, when the specific state is a current state indicating the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection, the learning unit may, each time the action is applied to the current state, reset the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection as the new current state.

According to an embodiment, the setting unit may determine, using the traffic information of the target intersection, the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the target intersection; calculate, using that travel direction and the Q-function resulting from the reinforcement learning, a maximum traffic signal, which is the traffic signal among the plurality of signal combinations that maximizes the value of the Q-function; and set the traffic signal for the travel direction with the largest number of waiting vehicles to the maximum traffic signal.

According to an embodiment, the reward may be proportional to the vehicle throughput of the intersection per unit time and inversely proportional to the average vehicle waiting time at the plurality of intersections adjacent to the intersection.

The reinforcement-learning-based traffic signal control method and apparatus according to an embodiment of the present invention allocate signals appropriately so that as many vehicles as possible pass through the intersection, and adjust the traffic signal order of the intersection in consideration of the average vehicle waiting time at all adjacent intersections, thereby distributing traffic signals efficiently in a multi-intersection environment.

FIG. 1 is a flowchart of a reinforcement-learning-based distributed traffic signal control method in a multi-intersection environment according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method of setting traffic signals using a reinforcement learning result according to an embodiment of the present invention.
FIG. 3 is a block diagram of a reinforcement-learning-based distributed traffic signal control apparatus in a multi-intersection environment according to an embodiment of the present invention.
FIG. 4 is a diagram showing an intersection according to an embodiment of the present invention.
FIG. 5 is a diagram showing signal combinations according to an embodiment of the present invention.
FIGS. 6A to 6C are diagrams showing the results of comparing an embodiment of the present invention with the prior art.
FIGS. 7A and 7B are diagrams showing a multi-intersection environment.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention. In describing each drawing, like reference numerals are used for like elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be named a second element, and similarly a second element may be named a first element. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

When an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that other elements may exist in between. In contrast, when an element is said to be "directly connected" or "directly coupled" to another element, it should be understood that no other element exists in between.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" are intended to designate the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application. Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart of a reinforcement-learning-based traffic signal control method according to an embodiment of the present invention.

Reinforcement learning is an algorithm that learns on its own from past learning experience. Q-learning, a type of reinforcement learning, explores the environment using a trial-and-error approach and selects the best action in the current state based on experience. That is, Q-learning is built on the concepts of state, action, and reward, and the action taken in a given state may be chosen so as to maximize the reward. Taking action a_t in state s_t moves the agent to the next state s_{t+1}. The related formula is Equation 1 below.

[Equation 1]

In addition, in Q-learning, the Q-table that stores experience is updated to a new value by applying the learning rate to the current state (s_t), the action (a_t), the reward (r_t), and the maximum value of the next state (max_a Q(s_{t+1}, a_{t+1})). The related formula is Equation 2 below.

[Equation 2]

In Equation 2, Q(s_t, a_t) is the existing value, and the remaining parameters are the learning rate and the discount factor.
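For reference, the textbook Q-learning update, which matches the description above (existing value, learning rate, discount factor, reward, and the maximum value of the next state), is commonly written as follows. The symbols η for the learning rate and γ for the discount factor are assumptions, since the original symbols did not survive in this text; this is the standard form, not a verified transcription of Equation 2.

```latex
Q(s_t, a_t) \leftarrow (1 - \eta)\, Q(s_t, a_t)
  + \eta \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) \right]
```

The first term keeps a share of the existing value, and the bracketed term is the new estimate built from the observed reward and the best value reachable from the next state.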

In step S110, the traffic signal control apparatus sets the states distinguished according to the plurality of travel directions at an intersection, the action of setting the traffic signal for a specific state to one of a plurality of signal combinations, and the reward determined according to the vehicle throughput of the intersection and the average vehicle waiting time at a plurality of intersections adjacent to the intersection.

Here, the traffic signal control apparatus may receive, over wired or wireless communication, information on the average vehicle waiting time at each adjacent intersection from a plurality of local agents (for example, the traffic signal control apparatuses of the adjacent intersections), each corresponding to one of the intersections adjacent to the intersection.

For example, referring to FIG. 7A, if the traffic signal control apparatus corresponds to intersection 701, it is adjacent to intersections 702 and 704 and may therefore receive information on the average vehicle waiting time at each of those intersections from the local agents of intersections 702 and 704.
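The neighbor lookup in this example can be sketched as follows. The 3x3 layout with intersections numbered 701 to 709 in row-major order is an assumption: it is merely consistent with 701 being adjacent to 702 and 704 and with the nine-intersection simulation, and FIG. 7A is not reproduced here. The function names and the `reported_wait` dictionary are hypothetical.

```python
# Assumed layout: a 3x3 grid numbered 701..709 row by row (not confirmed by the text).
GRID_ROWS, GRID_COLS, FIRST_ID = 3, 3, 701


def adjacent_intersections(intersection_id: int) -> list[int]:
    """Return the IDs of intersections directly connected to the given one."""
    index = intersection_id - FIRST_ID
    if not 0 <= index < GRID_ROWS * GRID_COLS:
        raise ValueError(f"unknown intersection {intersection_id}")
    row, col = divmod(index, GRID_COLS)
    neighbours = []
    # Visit the up, left, right, and down neighbours and keep those inside the grid.
    for d_row, d_col in ((-1, 0), (0, -1), (0, 1), (1, 0)):
        r, c = row + d_row, col + d_col
        if 0 <= r < GRID_ROWS and 0 <= c < GRID_COLS:
            neighbours.append(FIRST_ID + r * GRID_COLS + c)
    return neighbours


def mean_neighbour_wait(intersection_id: int, reported_wait: dict[int, float]) -> float:
    """Average the waiting times reported by the neighbours' local agents."""
    neighbours = adjacent_intersections(intersection_id)
    return sum(reported_wait[n] for n in neighbours) / len(neighbours)
```

Under this assumed numbering, a corner intersection such as 701 has two neighbors, while the central intersection has four.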

Here, the states may be distinguished according to the plurality of travel directions that make up the intersection. For example, referring to FIG. 4, if the intersection is a four-way intersection and going straight (401, 403, 405, 407) and turning left (402, 404, 406, 408) are possible from each approach, there are eight travel directions in total, so eight states are defined. Likewise, ten states are defined for a five-way intersection and twelve for a six-way intersection. The travel directions in the present invention are not limited to going straight and turning left; various other directions, such as U-turns and right turns, may of course be applied.
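The state count described above can be sketched as a small enumeration: one state per (approach, movement) pair. The names `MOVEMENTS` and `enumerate_states` are hypothetical, and the two-movement assumption follows the straight/left example in the text.

```python
from itertools import product

# Two movements per approach, as in the straight/left-turn example.
MOVEMENTS = ("straight", "left")


def enumerate_states(num_approaches: int) -> list[tuple[int, str]]:
    """Return one state per (approach index, movement) pair."""
    return list(product(range(num_approaches), MOVEMENTS))


for legs in (4, 5, 6):
    # Prints 8, 10, and 12 states for four-, five-, and six-way intersections.
    print(legs, len(enumerate_states(legs)))
```

Adding U-turns or right turns would only require extending `MOVEMENTS`, which grows the state set proportionally.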

In addition, an action is applied at a specific time t to a specific state, which is one of the states described above, and sets the traffic signal for that specific state to one of a plurality of signal combinations. For example, referring to FIG. 4, when going straight (407) is the specific state, the action may set the traffic signal for that state to one of the three signal combinations (501, 502, 503) shown in FIG. 5. The signal combinations in the present invention are not limited to those shown in FIG. 5; various signal combinations may of course be applied depending on the configuration of the intersection.
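With eight states and three signal combinations, the learned values fit in a small table keyed by (state, action). The sketch below is one possible layout, not the patent's data structure; using the reference numerals 501 to 503 as action labels and the name `new_q_table` are illustrative assumptions.

```python
NUM_STATES = 8                          # travel directions of a four-way intersection
SIGNAL_COMBINATIONS = (501, 502, 503)   # FIG. 5 reference numerals reused as action labels


def new_q_table(num_states: int, actions: tuple[int, ...], initial: float = 0.0) -> dict:
    """Create a Q-table with one entry per (state, action) pair."""
    return {(s, a): initial for s in range(num_states) for a in actions}


q_table = new_q_table(NUM_STATES, SIGNAL_COMBINATIONS)
```

A dictionary keeps the example dependency-free; an 8x3 array would serve equally well.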

In addition, to minimize delay at the intersection, the reward may be determined using as parameters the vehicle throughput per unit time (tp) and the average vehicle waiting time (wt) at the plurality of intersections adjacent to the intersection. Here, the vehicle throughput (tp) is the number of vehicles that passed through the intersection during a unit of time, and the average vehicle waiting time (wt) is the average vehicle waiting time at the plurality of intersections that are adjacent and directly connected to the intersection.

In another embodiment, the reward may be proportional to the vehicle throughput of the intersection per unit time and inversely proportional to the average vehicle waiting time at the plurality of intersections adjacent to the intersection.

Given that the present invention aims to minimize vehicle waiting time and to let as many vehicles as possible pass through the intersection, it is evident that the reward increases or decreases in proportion to the vehicle throughput of the intersection and in inverse proportion to the average vehicle waiting time at the plurality of intersections adjacent to the intersection.

In yet another embodiment, the reward may be defined by Equation 3.

[Equation 3]

That is, the reward (r_t) is proportional to the vehicle throughput of the intersection and inversely proportional to the average vehicle waiting time at the plurality of adjacent intersections.
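Equation 3 itself is not reproduced in this text, so the following is not the patent's formula. It is a deliberately simple stand-in that only has the two stated tendencies: it grows with throughput and shrinks as the neighbors' average waiting time grows. The function name and the `1.0 +` offset are assumptions made for illustration.

```python
def illustrative_reward(throughput: float, neighbour_waits: list[float]) -> float:
    """Toy reward: rises with tp, falls with the mean wt of adjacent intersections.

    This is NOT Equation 3; it only mirrors the qualitative behaviour in the text.
    """
    if not neighbour_waits:
        raise ValueError("at least one adjacent intersection is required")
    mean_wait = sum(neighbour_waits) / len(neighbour_waits)
    # The 1.0 offset is an assumption that avoids division by zero when nobody waits.
    return throughput / (1.0 + mean_wait)
```

Because of the offset, this stand-in is decreasing in the waiting time instead of strictly inversely proportional to it; any function with the same monotonic behavior would illustrate the point equally well.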

In step S120, the traffic signal control apparatus performs reinforcement learning using predetermined traffic data, based on the state, the action, and the reward.

That is, the traffic signal control apparatus may perform reinforcement learning by using the state, action, and reward determined above and applying the predetermined traffic data. The traffic data may be actual data collected at the intersection itself or data collected at another intersection with a similar structure.

For example, the traffic signal control apparatus may keep updating the Q-table while performing Q-learning, a type of reinforcement learning, on the traffic data. The updated Q-function value here is the reward arising from experience (the traffic data). In addition, exploration (ε) may be added to open up new paths. Exploration, which introduces randomness, takes a value between 0 and 1 and makes it possible to find better solutions.
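The text does not say how ε is used. A common reading is ε-greedy selection, sketched below as an assumption: with probability ε the agent tries a random signal combination, otherwise it picks the one with the highest Q-value. The function name and the dictionary-based Q-table are hypothetical.

```python
import random


def epsilon_greedy(q_table: dict, state, actions, epsilon: float, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: try any signal combination, regardless of its current value.
        return rng.choice(actions)
    # Exploit: choose the action with the highest learned value (unseen pairs count as 0).
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```

With ε = 0 the choice is purely greedy; with ε = 1 it is purely random. Training setups often start with a larger ε and reduce it over time, although the text does not specify a schedule.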

Meanwhile, referring to Equation 2, the learning rate is a value between 0 and 1 that determines how much new information is learned. The closer this value is to 1, the more weight the newly acquired information carries. The discount factor determines the importance of future states. As the discount factor approaches 1, the focus shifts from present experience to future rewards.
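The effect of the two parameters can be seen in a small worked update. The sketch assumes the standard Q-learning rule and a dictionary Q-table; the function and parameter names are hypothetical, not taken from the patent.

```python
def q_update(q_table, state, action, reward, next_state, actions,
             learning_rate, discount):
    """Apply one standard Q-learning update and return the new value."""
    old_value = q_table.get((state, action), 0.0)
    # Best value reachable from the next state, over all signal combinations.
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    target = reward + discount * best_next
    # Blend the existing value with the new target according to the learning rate.
    new_value = (1.0 - learning_rate) * old_value + learning_rate * target
    q_table[(state, action)] = new_value
    return new_value


q = {(0, "a"): 1.0, (1, "a"): 2.0, (1, "b"): 4.0}
# target = 3.0 + 0.5 * 4.0 = 5.0; new value = 0.5 * 1.0 + 0.5 * 5.0 = 3.0
print(q_update(q, 0, "a", 3.0, 1, ("a", "b"), learning_rate=0.5, discount=0.5))
```

With a learning rate of 1 the old value would be overwritten by the target (5.0); with a learning rate of 0 it would stay at 1.0. Raising the discount factor increases the share of the target that comes from the next state's value.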

In another embodiment, when the specific state is a current state indicating the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection, the traffic signal control apparatus may perform the reinforcement learning while resetting, each time an action is applied to the current state, the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection as the new current state.

That is, the traffic signal control apparatus may take as the specific state the current state, which indicates the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection, and apply an action to that current state. In addition, each time an action is applied, the traffic signal control apparatus may update the new current state to the travel direction with the largest number of waiting vehicles, as in Equation 4 below.

[Equation 4]

Here, s_{t+1} is the new current state, and qt_i is the number of waiting vehicles in the i-th travel direction.

In other words, after applying action a_t to the current state s_t at a specific time t, the traffic signal control apparatus may reset the new current state s_{t+1} at the next time t+1 to the travel direction with the largest number of waiting vehicles.
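Read this way, the state transition is an argmax over the per-direction queue lengths qt_i. A minimal sketch, assuming the queue lengths arrive as a list indexed by travel direction; the function name and the tie-breaking rule (lowest index wins) are assumptions, since the text does not say how ties are resolved.

```python
def busiest_direction(queue_lengths: list[int]) -> int:
    """Return the index i that maximises qt_i; ties go to the lowest index."""
    if not queue_lengths:
        raise ValueError("queue_lengths must not be empty")
    # max() returns the first index that attains the largest queue length.
    return max(range(len(queue_lengths)), key=queue_lengths.__getitem__)
```

After each action, the controller would measure the queues again and call this function to obtain the next state.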

Finally, in step S130, the traffic signal control apparatus sets traffic signals for the plurality of travel directions of the target intersection, based on the result of the reinforcement learning and the traffic information of the target intersection corresponding to the intersection.

That is, the traffic signal control apparatus may set the traffic signals for the plurality of travel directions of the target intersection using the result of the reinforcement learning completed on the predetermined traffic data together with the traffic information of the target intersection.

For example, when Q-learning is used, the traffic signal control apparatus may determine the action that maximizes the value of the Q-function and set the traffic signal corresponding to that action.

Here, the target intersection may be the same intersection as the intersection, or another intersection with the same or a similar structure (for example, number of lanes and travel directions).

In another embodiment, the reinforcement learning may be Q-learning.

한편, 본 발명은 9개의 교차로 환경에서 시뮬레이션을 진행했다. 제안하는 모델인 통행량과 인접한 교차로에서의 대기시간을 파라미터로 적용한 D-TCS(distributed), 고정형 방식인 F-TCS(fixed-time) 그리고 적응형 방식은 A-TCS(adaptive)으로 명시하여 실험을 진행하였다. 시뮬레이션은 연속된 9개의 4-way 교차로에서 진행하였다.On the other hand, in the present invention, simulation was performed in the environment of 9 intersections. Experiments were conducted by specifying the proposed model, D-TCS (distributed), in which traffic volume and waiting time at adjacent intersections were applied as parameters, and F-TCS (fixed-time), which is a fixed method, and A-TCS (adaptive) for the adaptive method. proceeded. The simulation was conducted at 9 consecutive 4-way intersections.

FIGS. 6A to 6C show experimental results in which performance was measured as a function of the traffic volume at the intersections. FIG. 6A shows the average queue length. When the traffic volume was between 120% and 160%, the average queue length of D-TCS was generally about 35% shorter than that of F-TCS and about 60% shorter than that of A-TCS. A short average queue length for D-TCS means that few vehicles are waiting at the intersection, which shows that the proposed system, D-TCS, distributed the traffic signals efficiently.

FIG. 6B shows the average waiting time. When the traffic volume was 120%, the waiting time of D-TCS was about 32% shorter than that of F-TCS and A-TCS. This is interpreted as follows: because intersections influence one another, D-TCS, which uses information from nearby intersections, performs better.

FIG. 6C shows the vehicle throughput. When the traffic volume was 120%, D-TCS processed about 15% more vehicles than F-TCS. On average, D-TCS also processed about 25% more vehicles than A-TCS.

FIG. 2 is a flowchart of a method of setting a traffic signal using a reinforcement learning result according to an embodiment of the present invention.

In step S210, the traffic signal control device uses the traffic information of the target intersection to determine, among the plurality of travel directions of the target intersection, the travel direction with the largest number of waiting vehicles.

That is, the traffic signal control device may determine, among the plurality of travel directions present at the target intersection, the travel direction with the largest number of waiting vehicles.

In step S220, the traffic signal control device calculates a maximum traffic signal, that is, the traffic signal that maximizes the value of the Q function among the plurality of signal combinations, using the travel direction with the largest number of waiting vehicles and the Q function obtained as a result of the reinforcement learning.

Here, the traffic signal control device may input the travel direction with the largest number of waiting vehicles, together with every action applicable to that travel direction, into the Q function, and thereby obtain the action that maximizes the value of the Q function, that is, the maximum traffic signal.

Finally, in step S230, the traffic signal control device sets the traffic signal to the maximum traffic signal for the travel direction with the largest number of waiting vehicles.

That is, by setting the traffic signal to the maximum traffic signal for the travel direction with the largest number of waiting vehicles, the traffic signal control device can increase the vehicle throughput at the target intersection while reducing the standard deviation of the queue length.
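Steps S210 to S230 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the learned Q function is stored as a table keyed by (travel direction, signal combination), and the direction names, signal names, and Q values are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# Assumption: the learned Q function is a table keyed by (state, action).
QTable = Dict[Tuple[str, str], float]


def busiest_direction(waiting: Dict[str, int]) -> str:
    """S210: pick the travel direction with the largest number of waiting vehicles."""
    return max(waiting, key=lambda d: waiting[d])


def max_traffic_signal(q: QTable, state: str, signals: Sequence[str]) -> str:
    """S220: pick the signal combination that maximizes Q(state, action).

    Unseen (state, action) pairs default to 0.0, which is an assumption.
    """
    return max(signals, key=lambda a: q.get((state, a), 0.0))


def set_signal(q: QTable, waiting: Dict[str, int], signals: Sequence[str]) -> Tuple[str, str]:
    """S230: return the direction and the signal to apply to it."""
    state = busiest_direction(waiting)
    return state, max_traffic_signal(q, state, signals)


# Hypothetical traffic information and Q values, for illustration only.
waiting = {"north": 4, "east": 9, "south": 2, "west": 5}
signals = ["straight", "left_turn", "straight_and_left"]
q_table = {
    ("east", "straight"): 0.4,
    ("east", "left_turn"): 0.1,
    ("east", "straight_and_left"): 0.7,
    ("north", "straight"): 0.9,
}
state, action = set_signal(q_table, waiting, signals)
print(state, action)  # east straight_and_left
```

Only the Q values for the busiest direction are compared, so the high value stored for ("north", "straight") does not affect the choice.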

FIG. 3 is a block diagram of a reinforcement learning-based traffic signal control device according to an embodiment of the present invention.

Referring to FIG. 3, the reinforcement learning-based traffic signal control device 300 according to an embodiment of the present invention includes a preparation unit 310, a learning unit 320, and a setting unit 330.

The preparation unit 310 sets a state distinguished according to a plurality of travel directions at an intersection, an action of setting a traffic signal to one of a plurality of signal combinations for a specific state, and a reward determined according to the vehicle throughput at the intersection and the average vehicle waiting time at a plurality of intersections adjacent to the intersection.

The learning unit 320 performs reinforcement learning using predetermined traffic data, based on the state, the action, and the reward.

In another embodiment, when the specific state is a current state indicating the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection, the learning unit 320 may, each time an action is applied to the current state, reset the new current state to the travel direction that then has the largest number of waiting vehicles among the plurality of travel directions of the intersection.
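The state reset described above can be sketched inside a tabular Q-learning loop. This is a sketch under stated assumptions, not the embodiment itself: the toy environment `step`, the action names, the learning rate `lr`, the discount `gamma`, the exploration rate `eps`, and the use of served vehicles as a placeholder reward are all hypothetical.

```python
import random
from collections import defaultdict

DIRECTIONS = ["north", "east", "south", "west"]
ACTIONS = ["short_green", "long_green"]  # hypothetical signal combinations


def current_state(queues):
    """State = travel direction with the largest number of waiting vehicles."""
    return max(DIRECTIONS, key=lambda d: queues[d])


def step(queues, state, action, rng):
    """Toy environment (assumption): serve the chosen direction, then add arrivals."""
    capacity = 2 if action == "short_green" else 4
    served = min(queues[state], capacity)
    queues[state] -= served
    for d in DIRECTIONS:
        queues[d] += rng.randint(0, 1)
    return served  # placeholder reward, not the reward of the patent


def train(episodes=50, steps=20, lr=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # Q table keyed by (state, action)
    for _ in range(episodes):
        queues = {d: rng.randint(0, 5) for d in DIRECTIONS}
        state = current_state(queues)
        for _ in range(steps):
            # Epsilon-greedy choice among the signal combinations.
            if rng.random() < eps:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            reward = step(queues, state, action, rng)
            # Reset: the new current state is whichever direction is busiest now.
            next_state = current_state(queues)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Standard tabular Q-learning update.
            q[(state, action)] += lr * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q = train()
```

Recomputing the state after every action keeps the agent focused on the direction that is currently most congested, rather than following a fixed phase order.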

Finally, the setting unit 330 sets traffic signals for the plurality of travel directions of the target intersection, based on the result of the reinforcement learning and on traffic information of the target intersection corresponding to the intersection.

In yet another embodiment, the setting unit 330 may use the traffic information of the target intersection to determine the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the target intersection; calculate the maximum traffic signal, that is, the traffic signal that maximizes the value of the Q function among the plurality of signal combinations, using that travel direction and the Q function obtained as a result of the reinforcement learning; and set the traffic signal to the maximum traffic signal for the travel direction with the largest number of waiting vehicles.

In yet another embodiment, the reward may be proportional to the vehicle throughput at the intersection per unit time and inversely proportional to the average vehicle waiting time at the plurality of intersections adjacent to the intersection.
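The two proportionality properties can be illustrated with the simplest function that satisfies them. This sketch is not Equation 1 of the claims, whose weighting terms are not reproduced here. Averaging the waiting times over the neighbors and adding a small `eps` to avoid division by zero are assumptions made for the illustration.

```python
from statistics import mean
from typing import Sequence


def reward(throughput_per_unit_time: float,
           neighbor_avg_waits: Sequence[float],
           eps: float = 1e-6) -> float:
    """Reward that grows with local throughput and shrinks with neighbor waiting time.

    neighbor_avg_waits holds the average vehicle waiting time at each adjacent
    intersection; eps is a hypothetical guard against division by zero.
    """
    avg_wait = mean(neighbor_avg_waits)
    return throughput_per_unit_time / (avg_wait + eps)


# Hypothetical values: doubling throughput raises the reward,
# doubling the neighbors' waiting time lowers it.
base = reward(20.0, [10.0, 10.0])
more_throughput = reward(40.0, [10.0, 10.0])
more_waiting = reward(20.0, [20.0, 20.0])
```

Because the denominator depends on adjacent intersections, a signal choice that clears the local queue but increases waiting at the neighbors earns a lower reward than one that does not.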

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

Claims

1. A distributed traffic signal control method based on reinforcement learning in a multi-intersection environment, the method comprising:
setting a state distinguished according to a plurality of travel directions at an intersection, an action of setting a traffic signal to one of a plurality of signal combinations for a specific state, and a reward determined according to a vehicle throughput at the intersection and an average vehicle waiting time at a plurality of intersections adjacent to the intersection;
performing reinforcement learning using predetermined traffic data based on the state, the action, and the reward; and
setting traffic signals for a plurality of travel directions of a target intersection corresponding to the intersection, based on a result of the reinforcement learning and traffic information of the target intersection,
wherein the reward is proportional to the vehicle throughput at the intersection per unit time and inversely proportional to the average vehicle waiting time at the plurality of intersections adjacent to the intersection.

2. The method of claim 1,
wherein, when the specific state is a current state indicating the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection,
the performing of the reinforcement learning comprises
resetting, each time the action is applied to the current state, the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection as a new current state.

3. The method of claim 1,
wherein the reinforcement learning is Q-learning.

4. The method of claim 3,
wherein the setting of the traffic signals comprises:
determining, using the traffic information of the target intersection, the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the target intersection;
calculating a maximum traffic signal, which is the traffic signal that maximizes the value of the Q function among the plurality of signal combinations, using the travel direction with the largest number of waiting vehicles and the Q function obtained as a result of the reinforcement learning; and
setting the traffic signal to the maximum traffic signal for the travel direction with the largest number of waiting vehicles.

5. (Deleted)

6. The method of claim 1,
wherein the reward is defined by Equation 1.
[Equation 1]

where r_t is the reward, i is an intersection index, α is a weighting factor, ω is a value greater than 1, tp is the vehicle throughput per unit time, L_agent is the number of adjacent intersections, ξ is a value between 0 and 1, and wt is the average vehicle waiting time at the plurality of adjacent intersections.

7. A distributed traffic signal control apparatus based on reinforcement learning in a multi-intersection environment, the apparatus comprising:
a preparation unit that sets a state distinguished according to a plurality of travel directions at an intersection, an action of setting a traffic signal to one of a plurality of signal combinations for a specific state, and a reward determined according to a vehicle throughput at the intersection and an average vehicle waiting time at a plurality of intersections adjacent to the intersection;
a learning unit that performs reinforcement learning using predetermined traffic data based on the state, the action, and the reward; and
a setting unit that sets traffic signals for a plurality of travel directions of a target intersection corresponding to the intersection, based on a result of the reinforcement learning and traffic information of the target intersection,
wherein the reward is proportional to the vehicle throughput at the intersection per unit time and inversely proportional to the average vehicle waiting time at the plurality of intersections adjacent to the intersection.

8. The apparatus of claim 7,
wherein, when the specific state is a current state indicating the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection,
the learning unit
resets, each time the action is applied to the current state, the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the intersection as a new current state.

9. The apparatus of claim 7,
wherein the reinforcement learning is Q-learning.

10. The apparatus of claim 9,
wherein the setting unit
determines, using the traffic information of the target intersection, the travel direction with the largest number of waiting vehicles among the plurality of travel directions of the target intersection,
calculates a maximum traffic signal, which is the traffic signal that maximizes the value of the Q function among the plurality of signal combinations, using the travel direction with the largest number of waiting vehicles and the Q function obtained as a result of the reinforcement learning, and
sets the traffic signal to the maximum traffic signal for the travel direction with the largest number of waiting vehicles.

11. (Deleted)