KR102549744B1

KR102549744B1 - Method for controlling traffic flow using deep reinforcement learning based autonomous vehicles in road information system, recording medium and device for performing the method

Info

Publication number: KR102549744B1
Application number: KR1020210091129A
Authority: KR
Inventors: 권민혜
Original assignee: 숭실대학교 산학협력단
Priority date: 2021-07-12
Filing date: 2021-07-12
Publication date: 2023-06-29
Also published as: KR20230010498A

Abstract

A method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle includes: receiving, at predetermined time intervals, the positions and speeds of the autonomous vehicle and non-autonomous vehicles as state information from a road information system that uses a selected deep reinforcement learning algorithm; selecting, based on the state information, an action of the autonomous vehicle, the action being the intensity with which the accelerator or the brake is pressed; deriving a reward value for the action of the autonomous vehicle based on a target speed to be reached and the speeds of the autonomous vehicle and the non-autonomous vehicles; updating a policy based on the reward value; and running the trained autonomous vehicle together with the non-autonomous vehicles in the road information system, based on the training result and on state information consisting of the real-time positions and speeds of the autonomous vehicle and the non-autonomous vehicles. Accordingly, on roads where traffic congestion occurs frequently, road flow can be smoothed through speed control of the autonomous vehicle.

Description

Method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle, and recording medium and device for performing the method

The present invention relates to a method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle, and to a recording medium and a device for performing the method. More specifically, it relates to a technique for smoothing road flow at intersections where traffic congestion occurs frequently, based on speed control of an autonomous vehicle.

With the rapid advance of artificial intelligence, autonomous driving has become a topic of great interest to both the public and researchers. The Society of Automotive Engineers (SAE) provides a guideline that classifies autonomous driving technology into levels 0 to 5. There are six levels in total: level 0 (no automation), driver assistance, partial automation, conditional automation, high automation, and level 5 (full automation).

Companies in Korea and abroad have already succeeded in developing level 3 autonomous vehicles, and Waymo, GM, Hyundai-Kia, and others are developing level 4 autonomous vehicles. Korea was also the first country in the world to announce safety standards for level 3 autonomous vehicles, and the release and sale of level 3 vehicles became possible from July 2020. According to a 2019 report by the Korea Institute of Science and Technology Evaluation and Planning, the level 4 autonomous vehicle market is forecast to grow at an average annual rate of 84.2% from 2020 to 2035, and the level 3 market at an average annual rate of 33.6%.

The components of autonomous driving technology are perception (such as environment and location recognition), decision-making, and control, together with an interface that provides information to the occupants. Fully autonomous vehicles can be commercialized only when the three elements of perception, decision-making, and control are learned thoroughly, and this research is being pursued intensively through deep learning, which has played a key role in the development of artificial intelligence.

Meanwhile, as traffic demand and the number of cars continue to increase, traffic congestion in urban areas has become routine. According to the Korea Transport Institute, traffic congestion cost Korea about 59.6 trillion won in 2017, and the environmental and economic problems caused by congestion are serious.

Against this background, the commercialization of autonomous vehicles is becoming increasingly likely, and autonomous vehicles are expected to offer a new way to solve road traffic problems.

KR 10-2018-0086602 A
CN 107886750 A
US 9,792,575 B2

The technical problem addressed by the present invention arises from the above, and an object of the present invention is to provide a method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle.

Another object of the present invention is to provide a recording medium on which a computer program for performing the above method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle is recorded.

Yet another object of the present invention is to provide a device for performing the above method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle.

A method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle, according to an embodiment for achieving the above object of the present invention, includes: receiving, at predetermined time intervals, the positions and speeds of the autonomous vehicle and non-autonomous vehicles as state information from a road information system that uses a selected deep reinforcement learning algorithm; selecting, based on the state information, an action of the autonomous vehicle, the action being the intensity with which the accelerator or the brake is pressed; deriving a reward value for the action of the autonomous vehicle based on a target speed to be reached and the speeds of the autonomous vehicle and the non-autonomous vehicles; updating a policy based on the reward value; and running the trained autonomous vehicle together with the non-autonomous vehicles in the road information system, based on the training result and on state information consisting of the real-time positions and speeds of the autonomous vehicle and the non-autonomous vehicles.

In an embodiment of the present invention, the step of deriving the reward value for the action of the autonomous vehicle may derive the reward value based on the total number of vehicles in the road information system and the target speed to be reached.

In an embodiment of the present invention, the road environment of the road information system may be a road structure that includes an intersection without traffic lights.

In an embodiment of the present invention, the action of the autonomous vehicle may consist of acceleration, a positive continuous value representing the intensity with which the accelerator is pressed while driving, and deceleration, a negative continuous value representing the intensity with which the brake is pressed while driving.

In an embodiment of the present invention, the selected deep reinforcement learning algorithm may be the Proximal Policy Optimization (PPO) algorithm.

In an embodiment of the present invention, the step of training the autonomous vehicle may further include limiting the policy update ratio by clipping during the iterative training that optimizes with the PPO algorithm.

A computer-readable storage medium according to an embodiment for achieving another object of the present invention has recorded on it a computer program for performing the above method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle.

A device for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle, according to an embodiment for achieving yet another object of the present invention, includes: a road monitoring unit that receives, at predetermined time intervals, the positions and speeds of the autonomous vehicle and non-autonomous vehicles as state information from a road information system that uses a selected deep reinforcement learning algorithm; an action selection unit that selects, based on the state information, an action of the autonomous vehicle, the action being the intensity with which the accelerator or the brake is pressed; an action reward unit that derives a reward value for the action of the autonomous vehicle based on a target speed to be reached and the speeds of the autonomous vehicle and the non-autonomous vehicles; a policy update unit that updates a policy based on the reward value; and an autonomous driving execution unit that runs the trained autonomous vehicle together with the non-autonomous vehicles in the road information system, based on the training result of the action selection unit and on state information consisting of the real-time positions and speeds of the autonomous vehicle and the non-autonomous vehicles received from the road monitoring unit.

In an embodiment of the present invention, the action reward unit may derive the reward value based on the total number of vehicles in the road information system and the target speed to be reached.

In an embodiment of the present invention, when Proximal Policy Optimization (PPO) is selected as the deep reinforcement learning algorithm, the policy update ratio may be limited by clipping during the iterative training that optimizes with the PPO algorithm.

According to this method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle, an autonomous vehicle driven by a deep reinforcement learning algorithm induces all vehicles on the road to drive close to the target speed, which smooths road flow.

FIG. 1 is a block diagram of a road flow control device in a road information system using a deep reinforcement learning-based autonomous vehicle according to an embodiment of the present invention.
FIG. 2 is a diagram showing a road information system with the road structure trained on in the present invention, in which an intersection without traffic lights is combined with circular roads.
FIG. 3 is a diagram showing a simulation of the road situation when there is no autonomous vehicle.
FIG. 4 is a diagram showing a simulation of the road situation when one autonomous vehicle trained according to the present invention is included.
FIG. 5 is a graph of the reward function during training of the autonomous vehicle according to the present invention.
FIG. 6 is a graph of the loss function during training of the autonomous vehicle according to the present invention.
FIG. 7 is a flowchart of the autonomous driving training stage of a method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle according to an embodiment of the present invention.
FIG. 8 is a flowchart of the autonomous driving execution stage of a method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle according to an embodiment of the present invention.

The detailed description of the present invention that follows refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, properly construed, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the drawings.

FIG. 1 is a block diagram of a road flow control device in a road information system using a deep reinforcement learning-based autonomous vehicle according to an embodiment of the present invention.

The road flow control device (10, hereinafter "the device") in a road information system using a deep reinforcement learning-based autonomous vehicle according to the present invention trains an autonomous vehicle, by deep reinforcement learning, to control its speed according to the road environment, and then operates it together with non-autonomous vehicles so that road flow improves through the autonomous vehicle's speed control.

Referring to FIG. 1, the device 10 according to the present invention includes a road monitoring unit 110, an autonomous driving learning unit 130, and an autonomous driving execution unit 150. The autonomous driving learning unit 130 includes an action selection unit 131, an action reward unit 133, and a policy update unit 135.

The device 10 may be included in the control module of the autonomous vehicle or may constitute part of that control module.

Software (an application) for performing road flow control in a road information system using a deep reinforcement learning-based autonomous vehicle may be installed and executed on the device 10, and the road monitoring unit 110, the action selection unit 131, the action reward unit 133, the policy update unit 135, and the autonomous driving execution unit 150 may be controlled by that software running on the device 10.

The device 10 may be a separate terminal or a module of a terminal. The road monitoring unit 110, the action selection unit 131, the action reward unit 133, the policy update unit 135, and the autonomous driving execution unit 150 may be formed as a single integrated module or as one or more modules. Alternatively, each unit may be implemented as a separate module.

The device 10 may be mobile or stationary. The device 10 may take the form of a server or an engine, and may be referred to by other terms such as device, apparatus, terminal, user equipment (UE), mobile station (MS), wireless device, or handheld device.

The device 10 may run or build various software on top of an operating system (OS). The operating system is a system program that allows software to use the hardware of the device, and may be any mobile operating system such as Android OS, iOS, Windows Mobile OS, Bada OS, Symbian OS, or BlackBerry OS, or any computer operating system such as the Windows family, the Linux family, the Unix family, MAC, AIX, or HP-UX.

The road monitoring unit 110 receives, at predetermined time intervals, the positions and speeds of the autonomous vehicle and the non-autonomous vehicles as state information from the road information system (1, see FIG. 2) that uses the selected deep reinforcement learning algorithm.

As shown in FIG. 2, the road information system 1 includes a real-time road monitoring module 20, through which it collects road condition information and passes it to the road monitoring unit 110 periodically or on demand.

The autonomous driving learning unit 130 trains the autonomous vehicle, using the selected deep reinforcement learning algorithm, to control its speed according to the road environment.

Referring again to FIG. 1, the autonomous driving learning unit 130 includes the action selection unit 131, which selects, based on the state information, an action of the autonomous vehicle, the action being the intensity with which the accelerator or the brake is pressed; the action reward unit 133, which derives a reward value for the action of the autonomous vehicle based on the target speed to be reached and the speeds of the autonomous vehicle and the non-autonomous vehicles; and the policy update unit 135, which updates the policy based on the reward value.

For example, the autonomous vehicle may be trained with deep reinforcement learning algorithms such as Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), or Twin Delayed DDPG (TD3). These algorithms are only examples, and other algorithms may also be used.

Reinforcement learning is a method in which an agent learns by interacting with an environment. When the agent takes an action in a given state, the state changes, and the environment returns the new state and a reward to the agent. The agent learns in the direction that maximizes the cumulative reward.
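
As a minimal sketch of what "cumulative reward" can mean, the Python function below sums the rewards of one episode with an optional discount. The function name and the discount factor `gamma` are assumptions made for illustration; the description itself does not state whether or how rewards are discounted.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative reward of one episode.

    `gamma` is an assumed discount factor (not given in the description);
    gamma=1.0 reduces this to a plain sum of the rewards.
    """
    total = 0.0
    # Work backwards so each earlier reward is added to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three per-step rewards from a hypothetical episode.
    print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))
```

With `gamma` below 1, rewards that arrive sooner count more than rewards that arrive later.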

The action selection unit 131 takes the positions and speeds of the autonomous vehicle and the non-autonomous vehicles at each predetermined time as state information and selects an action of the autonomous vehicle, the action being the intensity with which the accelerator or the brake is pressed.

In an embodiment of the present invention, the environment may be a road structure that includes an intersection without traffic lights. FIG. 2 shows one example of such a road structure: a figure-eight layout in which circular roads are joined.

In the present invention the agent is the autonomous vehicle. Every t seconds it receives, as the state, its own position and speed and those of a plurality of non-autonomous vehicles (for example, 13), and selects an action.

The action consists of the continuous values acceleration and deceleration. Acceleration is the intensity with which the accelerator is pressed while driving, and deceleration is the intensity with which the brake is pressed while driving.
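
The signed, continuous action can be sketched in Python as follows. This is an illustration under stated assumptions: the bounds `max_accel` and `max_decel`, the function name, and the simple speed update are hypothetical, and the 0.1 s default step is the simulation time step mentioned elsewhere in the description.

```python
def apply_action(speed, action, dt=0.1, max_accel=1.0, max_decel=1.0):
    """Apply one continuous action to the vehicle's speed.

    A positive action stands for pressing the accelerator, a negative one for
    pressing the brake. The bounds (m/s^2) are assumed values, not taken from
    the description.
    """
    # Keep the action inside the assumed physical limits.
    accel = max(-max_decel, min(max_accel, action))
    if accel > 0:
        pedal = "accelerator"
    elif accel < 0:
        pedal = "brake"
    else:
        pedal = "none"
    # Integrate over one time step; a vehicle cannot drive backwards here.
    new_speed = max(0.0, speed + accel * dt)
    return new_speed, pedal


if __name__ == "__main__":
    print(apply_action(5.0, 2.0))    # request above the bound is clamped
    print(apply_action(0.05, -1.0))  # braking never drives the speed below zero
```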

The action reward unit 133 derives a reward value for the action of the autonomous vehicle based on the target speed to be reached and the speeds of the autonomous vehicle and the non-autonomous vehicles, and the policy update unit 135 updates the policy based on the reward value.

In one embodiment, the reward function used to derive the reward value during training is given by Equation 1 below.

[Equation 1]

Here, v_d is the target speed to be reached and may be set to, for example, 10 m/s; i is the vehicle ID and N is the total number of vehicles. The reward function induces all vehicles on the road to drive close to the target speed.
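
Equation 1 itself is not reproduced in this text, so the Python sketch below assumes one form that fits the description: the reward is highest when every vehicle i = 1..N drives at v_d and falls as speeds deviate from it. The normalized Euclidean-distance form and the function name are assumptions, not the patent's formula.

```python
import math


def target_speed_reward(speeds, v_d=10.0):
    """Assumed reward: 1.0 when all N vehicles drive at v_d, falling toward 0.0.

    r = max(||v_d * 1|| - ||v - v_d * 1||, 0) / ||v_d * 1||
    where v is the vector of the N vehicle speeds. This form is an
    illustrative assumption consistent with the description of Equation 1.
    """
    n = len(speeds)
    if n == 0:
        return 0.0
    # Largest deviation considered: all N vehicles standing still.
    max_cost = math.sqrt(n * v_d ** 2)
    # Euclidean distance between the actual speeds and the target speed.
    cost = math.sqrt(sum((v - v_d) ** 2 for v in speeds))
    return max(max_cost - cost, 0.0) / max_cost


if __name__ == "__main__":
    print(target_speed_reward([10.0] * 14))  # every vehicle at the target speed
    print(target_speed_reward([8.0] * 14))   # every vehicle 2 m/s too slow
```

Because the deviation is measured over all N vehicles, the autonomous vehicle is rewarded for the speed of the whole road and not only for its own.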

The following describes, as an example, the case in which PPO is used as the deep reinforcement learning algorithm that controls the action of the autonomous vehicle. The PPO algorithm trains iteratively, optimizing a surrogate objective function using data sampling and stochastic gradient ascent.

Limiting the policy update ratio by clipping keeps the policy from changing abruptly. The total loss function of PPO is given by Equation 2 below.

[Equation 2]

Here, the terms of Equation 2 are the policy loss function, the value loss function, and the entropy bonus S. C₁ and C₂ are coefficients; for example, C₁ may be set to 1.0 and C₂ to 0.0.
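
Equation 2 is not reproduced in this text. The Python sketch below follows the commonly published form of the PPO objective, clipped policy term minus C₁ times the value loss plus C₂ times the entropy bonus, and assumes that this is the form meant here. The clipping range `eps=0.2`, the squared-error value loss, and all function names are assumptions.

```python
def clipped_policy_term(ratio, advantage, eps=0.2):
    """Clipped surrogate for one sample.

    `ratio` is new_policy_prob / old_policy_prob for the taken action.
    Clipping it to [1 - eps, 1 + eps] limits how far one update can move
    the policy; eps=0.2 is an assumed value.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(ratios, advantages, value_errors, entropies,
                  c1=1.0, c2=0.0, eps=0.2):
    """Batch-averaged objective: policy term - c1 * value loss + c2 * entropy.

    The defaults c1=1.0 and c2=0.0 are the example coefficients from the text.
    The sign convention (an objective to maximize) is an assumption; code
    that minimizes a loss would negate the result.
    """
    n = len(ratios)
    policy = sum(clipped_policy_term(r, a, eps)
                 for r, a in zip(ratios, advantages)) / n
    value_loss = sum(e ** 2 for e in value_errors) / n
    entropy = sum(entropies) / n
    return policy - c1 * value_loss + c2 * entropy


if __name__ == "__main__":
    # A ratio of 1.5 with a positive advantage is capped at 1 + eps.
    print(clipped_policy_term(1.5, 1.0))
    print(ppo_objective([1.0], [2.0], [0.5], [0.3]))
```

With C₂ set to 0.0 the entropy bonus has no effect, so the objective reduces to the clipped policy term and the value loss.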

The autonomous driving execution unit 150 controls the flow of the entire road by running the trained autonomous vehicle together with the non-autonomous vehicles in the road information system 1, based on the training result of the action selection unit 131 and on state information consisting of the real-time positions and speeds of the autonomous vehicle and the non-autonomous vehicles received from the road monitoring unit 110.

In one embodiment, FLOW may be used as the simulator that implements the environment in which the autonomous vehicle drives. FLOW provides an experimental environment that connects the traffic simulator SUMO with the open-source deep reinforcement learning library RLlib.

In an embodiment of the present invention, the simulation time step may be set to 0.1 seconds to make the simulated environment resemble a real one. When an episode starts, all vehicles drive while keeping a fixed gap from the vehicles ahead and behind, as in FIG. 2. The action is then controlled by the PPO algorithm. If a collision occurs while driving, the episode ends and is started again.
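
The episode logic above can be sketched with a toy environment. This is not the FLOW or SUMO API: `ToyFollowEnv` is a hypothetical one-lane stand-in with a single controlled car behind a leader, used only to show the 0.1 s time step and the rule that a collision ends the episode.

```python
DT = 0.1  # simulation time step in seconds, as given in the description


class ToyFollowEnv:
    """Hypothetical stand-in for the simulator: one car following a leader."""

    def __init__(self, gap=20.0, lead_speed=8.0):
        self.init_gap = gap
        self.lead_speed = lead_speed
        self.reset()

    def reset(self):
        """Start a fresh episode and return the initial state."""
        self.gap = self.init_gap
        self.speed = 0.0
        return (self.gap, self.speed)

    def step(self, accel):
        """Advance one time step; report whether a collision occurred."""
        self.speed = max(0.0, self.speed + accel * DT)
        self.gap += (self.lead_speed - self.speed) * DT
        collided = self.gap <= 0.0
        return (self.gap, self.speed), collided


def run_episode(env, policy, max_steps=1000):
    """Run until a collision or `max_steps`; return (steps taken, collided)."""
    state = env.reset()
    for step in range(max_steps):
        state, collided = env.step(policy(state))
        if collided:
            # A collision ends the episode; the next call starts it again.
            return step + 1, True
    return max_steps, False


if __name__ == "__main__":
    env = ToyFollowEnv()
    print(run_episode(env, lambda state: 3.0))  # always accelerating
    print(run_episode(env, lambda state: 0.0))  # never accelerating
```

In training, `policy` would be the network being optimized, and each call to `run_episode` would supply the samples for the next policy update.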

Referring to FIG. 3, when 14 non-autonomous vehicles (white) drive with no autonomous vehicle present, traffic congestion occurs at the intersection. This causes sudden stops and lowers the average speed of the vehicles.

In contrast, when one autonomous vehicle trained according to the present invention drives with 13 non-autonomous vehicles, the autonomous vehicle leads the non-autonomous vehicles so that all vehicles on the road can pass through the intersection efficiently. This can be seen in the simulation image of FIG. 4, in which the autonomous vehicle is black and the non-autonomous vehicles are gray.

An example of the reward graph in this case is shown in FIG. 5, and an example of the total loss graph is shown in FIG. 6. Referring to FIGS. 5 and 6, as training progresses the reward increases and the loss decreases, and both converge.

The degree to which the autonomous vehicle improves road flow is shown in Table 1 below.

[Table 1]

Referring to Table 1, with the autonomous vehicle driving, the average speed is about 20.7% higher than without it. The number of sudden stops also drops markedly, from about 50 to 2.
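
The figures quoted above can be checked with a short Python calculation. The sudden-stop counts come from the text; the baseline average speed is not stated here, so the value used below is hypothetical and serves only to show how a 20.7% increase is computed.

```python
def relative_change(before, after):
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


# Sudden stops: about 50 without the autonomous vehicle, 2 with it.
stop_change = relative_change(50, 2)

# Average speed: only the ratio is given, so this baseline is hypothetical.
baseline_speed = 5.0  # m/s, hypothetical
improved_speed = baseline_speed * (1 + 0.207)
speed_change = relative_change(baseline_speed, improved_speed)

if __name__ == "__main__":
    print(f"sudden stops: {stop_change:.0%}")
    print(f"average speed: {speed_change:+.1%}")
```

Going from about 50 sudden stops to 2 is a reduction of roughly 96%.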

본 발명은 자율주행차량을 기반으로 신호등이 없는 교차로와 원형 도로가 결합된 도로 구조에서의 도로 흐름 개선 방법을 제안하였다. 심층강화학습 중 PPO(Proximal Policy Optimization) 알고리즘 이용하여 자율주행 시스템을 구현하기 위해 필요한 기술 중 판단 기술을 자율주행차량이 학습하도록 하였다. The present invention proposes a road flow improvement method in a road structure in which an intersection without a traffic light and a circular road are combined based on an autonomous vehicle. Among deep reinforcement learning, the self-driving vehicle learns the judgment technology among the technologies necessary to implement the autonomous driving system by using the PPO (Proximal Policy Optimization) algorithm.

As training progresses, the autonomous vehicle accelerates and decelerates so that all vehicles on the road can pass through the intersection smoothly. Simulation confirmed that, with the autonomous vehicle controlling its speed appropriately, overall traffic flow improved relative to the case without an autonomous vehicle.

FIG. 7 is a flowchart of a road flow control method in a road information system using a deep reinforcement learning-based autonomous vehicle according to an embodiment of the present invention.

The road flow control method in a road information system using a deep reinforcement learning-based autonomous vehicle according to this embodiment may be performed in substantially the same configuration as the device 10 of FIG. 1 and the road information system 1 of FIG. 2. Accordingly, components identical to those of the device 10 of FIG. 1 and the road information system 1 of FIG. 2 are given the same reference numerals, and repeated descriptions are omitted.

In addition, the road flow control method according to this embodiment may be executed by software (an application) for performing road flow control in a road information system using a deep reinforcement learning-based autonomous vehicle.

The present invention trains an autonomous vehicle, on the basis of deep reinforcement learning, to control its speed according to the road environment, and then operates it together with non-autonomous vehicles so that road flow improves through the autonomous vehicle's speed control.

Referring to FIG. 7, the road flow control method according to this embodiment trains the autonomous vehicle, through a selected deep reinforcement learning algorithm, to control its speed according to the road environment.

In one embodiment, the road environment may be a structure that includes an intersection without traffic lights, for example a figure-eight road structure formed by joined circular roads.

At predetermined time intervals, the positions and speeds of the autonomous vehicle and the non-autonomous vehicles are received as state information from the road information system using the selected deep reinforcement learning algorithm (step S10). Based on the state information, an action of the autonomous vehicle, namely the intensity with which the accelerator or brake is pressed, is selected (step S20).

The actions that the autonomous vehicle can select may consist of acceleration, a continuous value representing how hard the accelerator is pressed while driving, and deceleration, a continuous value representing how hard the brake is pressed while driving.
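As a rough illustration of such a continuous action, the sketch below maps a single signed value to an accelerator or brake intensity. The function name `split_action`, the symmetric range `[-limit, limit]`, and the returned dictionary are assumptions made for illustration; the description states only that acceleration and deceleration are continuous values.

```python
def split_action(action: float, limit: float = 1.0) -> dict:
    """Map one signed continuous action to accelerator and brake intensity.

    Positive values press the accelerator and negative values press the brake.
    The symmetric range [-limit, limit] is an assumption for illustration.
    """
    # Keep the action inside the allowed range.
    a = max(-limit, min(limit, action))
    return {"accelerator": max(a, 0.0), "brake": max(-a, 0.0)}
```

For example, `split_action(0.4)` presses only the accelerator, while a value below `-limit` is clamped to full braking.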

A reward value for the autonomous vehicle's action is derived based on the target speed to be reached and the speeds of the autonomous and non-autonomous vehicles (step S30), and the policy is updated based on the reward value (step S40). When a predetermined time has elapsed (step S50), training may be terminated.
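Steps S10 to S50 can be sketched as a simple loop. This is a minimal outline rather than the patented implementation: every callable passed to `train` is a hypothetical hook standing in for the road information system and the learner, and treating the predetermined time as a wall-clock budget is an assumption.

```python
import time
from typing import Callable, List, Sequence

State = Sequence[float]


def train(
    get_state: Callable[[], State],           # S10: positions and speeds of all vehicles
    select_action: Callable[[State], float],  # S20: accelerator/brake intensity
    apply_action: Callable[[float], None],    # hypothetical hook that advances the simulation
    compute_reward: Callable[[], float],      # S30: reward from target speed and vehicle speeds
    update_policy: Callable[[State, float, float], None],  # S40: policy update
    time_budget_s: float,                     # S50: stop once this much time has elapsed
    max_steps: int = 1_000_000,               # safety cap, not part of the description
) -> List[float]:
    """Run the S10-S50 loop and return the reward collected at each step."""
    rewards: List[float] = []
    start = time.monotonic()
    for _ in range(max_steps):
        if time.monotonic() - start >= time_budget_s:  # S50
            break
        state = get_state()                    # S10
        action = select_action(state)          # S20
        apply_action(action)
        reward = compute_reward()              # S30
        update_policy(state, action, reward)   # S40
        rewards.append(reward)
    return rewards
```

Passing the hooks in as arguments keeps the loop independent of any particular simulator or learning library.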

Reinforcement learning is a method in which an agent learns by interacting with an environment. When the agent takes an action in a given state, the state changes, and the environment passes the new state and a reward to the agent. The agent learns so as to maximize the cumulative reward.
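The cumulative reward that the agent maximizes is commonly computed as a discounted sum. The sketch below shows that computation; the discount factor `gamma` and its default value are assumptions, since the description speaks only of a cumulative reward.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode.

    gamma = 1.0 gives the plain (undiscounted) cumulative reward.
    """
    g = 0.0
    # Accumulate from the last reward backwards so each step is one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```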

In the present invention, the agent is the autonomous vehicle. At each predetermined time interval it receives as its state the positions and speeds of the plurality of non-autonomous vehicles (for example, 13) and of itself, and selects an action.
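One way to picture this state is as a flat vector of positions and speeds. The `Vehicle` class, the ego-first ordering, and the units in the comments are assumptions for illustration; the description does not specify how the state is encoded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vehicle:
    position: float  # e.g. distance along the road (unit assumed)
    speed: float     # e.g. metres per second (unit assumed)


def build_state(ego: Vehicle, others: List[Vehicle]) -> List[float]:
    """Flatten all vehicles into [pos_0, speed_0, pos_1, speed_1, ...], ego first."""
    state: List[float] = []
    for v in [ego, *others]:
        state.extend([v.position, v.speed])
    return state
```

With one autonomous vehicle and 13 non-autonomous vehicles, this encoding yields a state of 28 values.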

In one embodiment, the selected deep reinforcement learning algorithm may be the PPO (Proximal Policy Optimization) algorithm. During iterative training toward the PPO optimum, the policy update ratio can be limited by clipping so that the policy does not change abruptly.
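The clipping mentioned here corresponds to the standard PPO clipped surrogate objective, sketched below for a single sample. The function name and the clip range `eps = 0.2` are assumptions; the description does not give a clip value.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate objective for one (state, action) sample.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    The ratio is clipped to [1 - eps, 1 + eps] so that a single update
    cannot move the policy far from the one that collected the data.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum keeps the pessimistic (lower) estimate of improvement.
    return min(ratio * advantage, clipped * advantage)
```

In training, this quantity would be averaged over a batch and maximized; when the ratio leaves the clip range in the direction favored by the advantage, the clipped term is selected and the objective stops rewarding further movement.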

Referring to FIG. 8, the trained autonomous vehicle and the non-autonomous vehicles are driven together, and the learned behavior is executed.

While the vehicles are in operation, state information comprising the positions and speeds of the autonomous and non-autonomous vehicles may be received at predetermined time intervals (step S60). For example, every t seconds the autonomous vehicle may receive as its state the positions and speeds of the plurality of non-autonomous vehicles (for example, 13) and of itself, and select an action.

Based on the training result and the real-time state information comprising the positions and speeds of the autonomous and non-autonomous vehicles, the trained autonomous vehicle is run together with the non-autonomous vehicles in the road information system, and the action of the autonomous vehicle, namely the intensity with which the accelerator or brake is pressed, is selected (step S70), so that road flow can be controlled.

In one embodiment, the reward function is given by Equation 1, and it encourages every vehicle on the road to drive close to the target speed. As training progresses, the reward increases and the loss decreases, and both converge.
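Equation 1 is not reproduced in this text, so the sketch below is not the patent's reward function. It shows only one plausible reward with the stated property, namely that the reward is highest when all N vehicles drive at the target speed v_d. The function name, the Euclidean-norm form, and the normalization to [0, 1] are all assumptions.

```python
import math
from typing import Sequence


def target_speed_reward(speeds: Sequence[float], v_d: float) -> float:
    """Illustrative reward in [0, 1]; equals 1 when every vehicle drives at v_d.

    NOT the patent's Equation 1: an assumed form based on the Euclidean
    distance between the vector of vehicle speeds and the target speed.
    """
    n = len(speeds)
    if n == 0 or v_d <= 0:
        raise ValueError("need at least one vehicle and a positive target speed")
    # Deviation that results when every vehicle is stationary; used to normalize.
    max_dev = math.sqrt(n) * v_d
    dev = math.sqrt(sum((v - v_d) ** 2 for v in speeds))
    return max(max_dev - dev, 0.0) / max_dev
```

Because the reward depends on the speeds of all vehicles and not only on the autonomous vehicle, the agent is rewarded for behavior that keeps the whole platoon near the target speed.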

The present invention uses deep reinforcement learning to improve road flow at intersections where traffic congestion frequently occurs. To this end, an autonomous vehicle is trained, on the basis of deep reinforcement learning, to control its speed according to the road environment, and is then operated at an intersection together with ordinary vehicles.

As a result, road flow was confirmed to improve: the number of sudden stops across all vehicles on the road fell markedly compared with the case without an autonomous vehicle, and the average speed of all vehicles rose by about 20%.

The road flow control method in a road information system using a deep reinforcement learning-based autonomous vehicle described above may be implemented as an application, or in the form of program instructions executable by various computer components, and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the field of computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware device may be configured to operate as one or more software modules in order to perform the processing according to the present invention, and vice versa.

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that various modifications and changes can be made to the present invention without departing from the spirit and scope of the invention set forth in the claims below.

The present invention improves overall traffic flow through appropriate speed control by a trained autonomous vehicle. Accordingly, once advances in autonomous driving technology bring fully autonomous vehicles to market, the road flow improvement and the economic and environmental benefits are expected to be substantial even if only a small number of autonomous vehicles are introduced onto the road.

10: road flow control device in a road information system using a deep reinforcement learning-based autonomous vehicle
110: road monitoring unit
130: autonomous driving learning unit
131: action selection unit
133: action reward unit
135: policy update unit
150: autonomous driving execution unit

Claims

A method for controlling road flow in a road information system using a deep reinforcement learning-based autonomous vehicle, the method comprising:
receiving, at predetermined time intervals, the positions and speeds of an autonomous vehicle and non-autonomous vehicles as state information from a road information system using a selected deep reinforcement learning algorithm;
selecting, based on the state information, an action of the autonomous vehicle, the action being the intensity with which the accelerator or brake is pressed;
deriving a reward value for the action of the autonomous vehicle based on a target speed to be reached and the speeds of the autonomous vehicle and the non-autonomous vehicles;
updating a policy based on the reward value; and
executing the trained autonomous vehicle together with the non-autonomous vehicles in the road information system, based on the training result and on real-time state information comprising the positions and speeds of the autonomous vehicle and the non-autonomous vehicles,
wherein the deriving of the reward value for the action of the autonomous vehicle
derives the reward value based on the total number of vehicles in the road information system and the target speed to be reached, and
the reward value is derived through the reward function of Equation 1 below.
[Equation 1]

Here, v_d is the target speed to be reached, i is the vehicle ID, and N is the total number of vehicles.

(Deleted)

The road flow control method in a road information system using a deep reinforcement learning-based autonomous vehicle according to claim 1,
wherein the road environment of the road information system is a road structure including an intersection without traffic lights.

The road flow control method in a road information system using a deep reinforcement learning-based autonomous vehicle according to claim 1,
wherein the action of the autonomous vehicle consists of acceleration, a continuous positive value representing how hard the accelerator is pressed while driving, and deceleration, a continuous negative value representing how hard the brake is pressed while driving.

The road flow control method in a road information system using a deep reinforcement learning-based autonomous vehicle according to claim 1,
wherein the selected deep reinforcement learning algorithm is the PPO (Proximal Policy Optimization) algorithm.

The road flow control method in a road information system using a deep reinforcement learning-based autonomous vehicle according to claim 5, wherein training the autonomous vehicle
further comprises limiting the policy update ratio by clipping during iterative training toward the PPO optimum.

A computer-readable storage medium on which is recorded a computer program for performing the road flow control method in a road information system using a deep reinforcement learning-based autonomous vehicle according to claim 1.

A road flow control device in a road information system using a deep reinforcement learning-based autonomous vehicle, the device comprising:
a road monitoring unit that receives the positions and speeds of an autonomous vehicle and non-autonomous vehicles as state information from a road information system using a selected deep reinforcement learning algorithm;
an action selection unit that selects, based on the state information, an action of the autonomous vehicle, the action being the intensity with which the accelerator or brake is pressed;
an action reward unit that derives a reward value for the action of the autonomous vehicle based on a target speed to be reached and the speeds of the autonomous vehicle and the non-autonomous vehicles;
a policy update unit that updates a policy based on the reward value; and
an autonomous driving execution unit that executes the trained autonomous vehicle together with the non-autonomous vehicles in the road information system, based on the training result of the action selection unit and on real-time state information, delivered from the road monitoring unit, comprising the positions and speeds of the autonomous vehicle and the non-autonomous vehicles,
wherein the action reward unit
derives the reward value based on the total number of vehicles in the road information system and the target speed to be reached, and
the reward value is derived through the reward function of Equation 1 below.
[Equation 1]

(Deleted)

The road flow control device in a road information system using a deep reinforcement learning-based autonomous vehicle according to claim 8,
wherein the road environment of the road information system is a road structure including an intersection without traffic lights.

The road flow control device in a road information system using a deep reinforcement learning-based autonomous vehicle according to claim 8,
wherein the action of the autonomous vehicle consists of acceleration, a continuous positive value representing how hard the accelerator is pressed while driving, and deceleration, a continuous negative value representing how hard the brake is pressed while driving.

The road flow control device in a road information system using a deep reinforcement learning-based autonomous vehicle according to claim 8,
wherein, when PPO (Proximal Policy Optimization) is selected as the deep reinforcement learning algorithm, the policy update ratio is limited by clipping during iterative training toward the PPO optimum.