KR20220167730A

KR20220167730A - Method of lane change for autonomous vehicles based deep reinforcement learning, recording medium and device for performing the method

Info

Publication number: KR20220167730A
Application number: KR1020210118540A
Authority: KR
Inventors: 권민혜; 이동수
Original assignee: 숭실대학교산학협력단
Priority date: 2021-06-14
Filing date: 2021-09-06
Publication date: 2022-12-21
Also published as: KR102631402B1

Abstract

A deep reinforcement learning-based lane change method for an autonomous vehicle comprises the steps of: collecting data by having the autonomous vehicle partially observe the state of the nearby road in a multi-lane road environment; performing at least one action of lane change and acceleration control based on the collected data; deriving a reward value for the performed action from a reward function based on a target speed of the autonomous vehicle and a safe distance from the rear vehicle; learning an action policy for at least one of lane change and acceleration control using a deep reinforcement learning algorithm, based on at least one of the partially observed data, the performed action information, and the derived reward value; and performing optimal driving of the autonomous vehicle for the current road state based on the learned action policy. A deep reinforcement learning algorithm thereby enables safe and efficient lane changes by an autonomous vehicle.

Description

Lane change method based on deep reinforcement learning for autonomous vehicles, recording medium and device for performing the same

The present invention relates to a deep reinforcement learning-based lane change method for an autonomous vehicle, and to a recording medium and a device for performing the method. More particularly, it relates to a technique that uses a deep reinforcement learning algorithm to train an autonomous vehicle to change lanes efficiently and safely.

Research on autonomous vehicles has recently accelerated, and the feasibility of commercialization is being reviewed alongside institutional developments. Commercializing autonomous vehicles requires technology for vehicle stability and efficient driving under complex road conditions. In particular, a strategy for the lane changes that occur frequently while pursuing the driver's goal is essential for improving driving safety and efficiency.

Existing registered technologies divide stable lane changing into stages such as perception of the surrounding environment, safety judgment, and function control. Each stage is then handled with classical control theory or trained through deep learning.

Patent Document 1 of the prior art literature discloses a lane change control device and method for an autonomous vehicle; it performs deep learning after subdividing, group by group, the various pieces of situational information that must be considered for safety when changing lanes.

Patent Document 2 of the prior art literature discloses a lane change method, device, and storage medium for an unmanned vehicle; it determines candidate lanes, selects a target lane by screening the candidate lanes, reviews safety, and applies control theory to the execution of the lane change.

Accordingly, there is a need for a safe and efficient lane change technology that suits a variety of situations and actions and that is learned end-to-end, without subdivision by situation or by dynamic action.

KR 10-2021-0044960 A
KR 10-2020-0116409 A

Erdmann, "SUMO's Lane-Changing Model," Springer, 2015.

The technical problem of the present invention was conceived in this respect, and an object of the present invention is to provide a deep reinforcement learning-based lane change method for an autonomous vehicle.

Another object of the present invention is to provide a recording medium on which a computer program for performing the deep reinforcement learning-based lane change method for an autonomous vehicle is recorded.

Still another object of the present invention is to provide a device for performing the deep reinforcement learning-based lane change method for an autonomous vehicle.

A deep reinforcement learning-based lane change method for an autonomous vehicle according to an embodiment for achieving the above object of the present invention includes: collecting data by having the autonomous vehicle partially observe the state of the nearby road in a multi-lane road environment; performing at least one action of lane change and acceleration control based on the collected data; deriving a reward value for the performed action of the autonomous vehicle from a reward function based on a target speed of the autonomous vehicle and a safe distance from the rear vehicle; learning an action policy for at least one of lane change and acceleration control using a deep reinforcement learning algorithm, based on at least one of the partially observed data, the performed action information, and the derived reward value; and performing optimal driving of the autonomous vehicle for the current road state based on the learned action policy.

In an embodiment of the present invention, the step of collecting data by partially observing the state of the nearby road may collect the speed of the leading vehicle in each lane, the speed of the rear vehicle in each lane, the speed of the autonomous vehicle, the relative distance between the leading vehicle in each lane and the autonomous vehicle, the relative distance between the rear vehicle in each lane and the autonomous vehicle, and the lane number in which each vehicle is located.

In an embodiment of the present invention, in the step of performing at least one action of lane change and acceleration control, the acceleration of the autonomous vehicle may have a continuous range between -1 and 1, and the lane change direction may take values that respectively denote keeping the lane, changing to the right lane, and changing to the left lane.

In an embodiment of the present invention, the reward function may include a reward term that encourages the autonomous vehicle to drive close to its target speed, and a penalty term for intruding on the safe distance of the rear vehicle when the autonomous vehicle changes lanes.

In an embodiment of the present invention, the penalty term may be generated based on a minimum allowable distance between vehicles and a minimum allowable time required for the leading vehicle and the rear vehicle to reach the same position.

In an embodiment of the present invention, the step of learning using the deep reinforcement learning algorithm may use the Proximal Policy Optimization (PPO) algorithm, one of the deep reinforcement learning algorithms.

In an embodiment of the present invention, the deep reinforcement learning-based lane change method for an autonomous vehicle may further include updating the action policy based on the derived reward value for the action of the autonomous vehicle.

In an embodiment of the present invention, the multi-lane road environment may be a two-lane circular road.

A computer-readable storage medium according to an embodiment for achieving another object of the present invention has recorded on it a computer program for performing the deep reinforcement learning-based lane change method for an autonomous vehicle.

A deep reinforcement learning-based lane change device for an autonomous vehicle according to an embodiment for achieving still another object of the present invention includes: an observation unit that collects data by having the autonomous vehicle partially observe the state of the nearby road in a multi-lane road environment; an integrated learning unit that performs at least one action of lane change and acceleration control based on the collected data, derives a reward value for the performed action of the autonomous vehicle, and learns an action policy for at least one of lane change and acceleration control using a deep reinforcement learning algorithm, based on at least one of the partially observed data, the performed action information, and the derived reward value; and a policy utilization unit that performs optimal driving of the autonomous vehicle for the current road state based on the learned action policy.

In an embodiment of the present invention, the observation unit may collect the speed of the leading vehicle in each lane, the speed of the rear vehicle in each lane, the speed of the autonomous vehicle, the relative distance between the leading vehicle in each lane and the autonomous vehicle, the relative distance between the rear vehicle in each lane and the autonomous vehicle, and the lane number in which each vehicle is located.

In an embodiment of the present invention, the integrated learning unit may include: an action unit that performs at least one action of lane change and acceleration control based on the collected data; a reward unit that derives a reward value for the performed action of the autonomous vehicle from a reward function based on a target speed of the autonomous vehicle and a safe distance from the rear vehicle; and a policy learning unit that learns an action policy for at least one of lane change and acceleration control using a deep reinforcement learning algorithm, based on at least one of the partially observed data, the performed action information, and the derived reward value.

In an embodiment of the present invention, in the action unit, the acceleration of the autonomous vehicle may have a continuous range between -1 and 1, and the lane change direction may take values that respectively denote keeping the lane, changing to the right lane, and changing to the left lane.

In an embodiment of the present invention, the integrated learning unit may use the Proximal Policy Optimization (PPO) algorithm, one of the deep reinforcement learning algorithms.

In an embodiment of the present invention, the integrated learning unit may update the action policy based on the derived reward value for the action of the autonomous vehicle.

In an embodiment of the present invention, the multi-lane road environment may be a two-lane circular road.

According to this deep reinforcement learning-based lane change method for an autonomous vehicle, deep reinforcement learning allows learning to be performed end-to-end, without subdividing situational information and actions into groups, which enables safe and efficient lane changes by the autonomous vehicle.

In experiments applying the present invention, the variance with respect to the target speed decreased substantially and the vehicle showed efficient driving performance. In addition, when changing lanes, the vehicle was able to do so while maintaining a safe distance from the vehicles ahead and behind.

FIG. 1 is a block diagram of a deep reinforcement learning-based lane change device for an autonomous vehicle according to an embodiment of the present invention.
FIG. 2 is an example diagram of a multi-lane road environment to which the present invention is applied.
FIG. 3 is a diagram for explaining the state of the nearby road that the autonomous vehicle of the present invention partially observes.
FIG. 4 is a diagram for explaining the learning process of the integrated learning unit of the present invention.
FIG. 5 is a diagram for explaining the safe distance in the penalty term of the reward function of the present invention.
FIG. 6 is a graph comparing, against the prior art, the change over time in the speed of a lane-changing vehicle, in order to verify the performance of the present invention.
FIG. 7 is a flowchart of a deep reinforcement learning-based lane change method for an autonomous vehicle according to an embodiment of the present invention.

The detailed description of the present invention that follows refers to the accompanying drawings, which show, by way of example, specific embodiments in which the present invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the present invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the present invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the present invention. Accordingly, the detailed description below is not to be taken in a limiting sense, and the scope of the present invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals denote the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the drawings.

FIG. 1 is a block diagram of a deep reinforcement learning-based lane change device for an autonomous vehicle according to an embodiment of the present invention.

The deep reinforcement learning-based lane change device 100 for an autonomous vehicle according to the present invention (hereinafter, the device) proposes a method of learning lane changes for an autonomous vehicle using a deep reinforcement learning algorithm.

In the present invention, the agent (the autonomous vehicle) learns through interaction with the environment (the road state). The agent observes the road state and then performs an appropriate action based on what it has learned. Because the complete road information is not available, the agent uses observations that are partial and imperfect.

Performing an action (lane change and acceleration control) yields a new state, from which a reward is obtained. The agent learns in the direction that maximizes the reward.

Referring to FIG. 1, the device 100 according to the present invention includes an observation unit 110, an integrated learning unit 130, and a policy utilization unit 150. The device 100 may be included in a control module of the autonomous vehicle or may constitute part of the control module.

Software (an application) for performing deep reinforcement learning-based lane changes for an autonomous vehicle may be installed and executed on the device 100, and the observation unit 110, the integrated learning unit 130, and the policy utilization unit 150 may be controlled by that software running on the device 100.

The device 100 may be a separate terminal or a module of a terminal. The observation unit 110, the integrated learning unit 130, and the policy utilization unit 150 may be formed as an integrated module or as one or more modules. Conversely, each component may be formed as a separate module.

The device 100 may be mobile or stationary. The device 100 may take the form of a server or an engine, and may be referred to by other terms such as device, apparatus, terminal, user equipment (UE), mobile station (MS), wireless device, or handheld device.

The device 100 may run or build various software on the basis of an operating system (OS), that is, a system. The operating system is a system program that allows software to use the hardware of the device, and may include mobile operating systems such as Android OS, iOS, Windows Mobile OS, Bada OS, Symbian OS, and BlackBerry OS, as well as computer operating systems such as the Windows family, Linux family, Unix family, MAC, AIX, and HP-UX.

The observation unit 110 collects data by having the autonomous vehicle partially observe the state of the nearby road in a multi-lane road environment. For example, the observation unit 110 may collect the speed and lane of the autonomous vehicle, the speed of the leading vehicle, the speed of the rear vehicle, the relative position of the leading vehicle, the relative position of the rear vehicle, and the lane number in which each vehicle is located.

In an embodiment of the present invention, the multi-lane road environment may be a two-lane circular road (see FIG. 2). The non-autonomous vehicles on the road travel slowly at a constant speed, creating an environment in which the autonomous vehicle 10 can reach its target speed only by changing lanes.

The set of vehicles on the road [...] consists of the set of non-autonomous vehicles [...] and the set of autonomous vehicles [...]. The total number of vehicles placed on the road is [...]. For example, for the lane number k, the outermost lane is lane 0, and the lane number may increase toward the inside.

Referring to FIG. 3, in the present invention the autonomous vehicle e_N (10) cannot observe the state information s_t of the entire road; it can only partially observe the state of the nearby road. The observation [...] of the autonomous vehicle e_N (10) can be defined as the following 14-dimensional vector.

[...]

Here, [...] denote, respectively, the speed of the leading vehicle in lane 0, the speed of the leading vehicle in lane 1, the speed of the rear vehicle in lane 0, the speed of the rear vehicle in lane 1, and the speed of the autonomous vehicle.

[...] denote, respectively, the relative distances between the autonomous vehicle and the leading vehicle in lane 0, the leading vehicle in lane 1, the rear vehicle in lane 0, and the rear vehicle in lane 1.

Finally, [...] denote the lanes at time t of the leading vehicle in lane 0, the leading vehicle in lane 1, the rear vehicle in lane 0, the rear vehicle in lane 1, and the autonomous vehicle.
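As a minimal sketch of how such a 14-dimensional observation could be assembled, the Python below concatenates five speeds, four relative distances, and five lane indices. The `Neighbor` container, the `build_observation` name, and the ordering of the components are assumptions made for illustration; the text above fixes what the observation contains, not how it is laid out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Neighbor:
    """One surrounding vehicle as seen from the autonomous vehicle (hypothetical container)."""
    speed: float         # m/s
    rel_distance: float  # m, distance to the autonomous vehicle
    lane: int            # lane index, 0 = outermost lane


def build_observation(lead0: Neighbor, lead1: Neighbor,
                      rear0: Neighbor, rear1: Neighbor,
                      ego_speed: float, ego_lane: int) -> List[float]:
    """Return a flat 14-element observation: 5 speeds, 4 distances, 5 lanes.

    The component order is an assumption of this sketch.
    """
    neighbors = [lead0, lead1, rear0, rear1]
    speeds = [n.speed for n in neighbors] + [ego_speed]
    distances = [n.rel_distance for n in neighbors]
    lanes = [float(n.lane) for n in neighbors] + [float(ego_lane)]
    return speeds + distances + lanes
```

Keeping the observation as one flat vector makes it straightforward to feed into a fully connected policy network, which is one common choice for low-dimensional inputs of this kind.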

As shown in FIG. 4, the integrated learning unit 130 includes a deep neural network 11, performs an action based on the data (observations) collected through sensors 13 and the like, and derives a reward value for the action performed by the autonomous vehicle.

The integrated learning unit 130 also uses a deep reinforcement learning algorithm to learn an action policy for the lane change and acceleration control of the autonomous vehicle.

In one embodiment, the Proximal Policy Optimization (PPO) algorithm may be used as the deep reinforcement learning algorithm for learning the action policy.

The integrated learning unit 130 may include an action unit 131, a reward unit 133, and a policy learning unit 135.

The action unit 131 may perform at least one action of lane change and acceleration control based on the collected data.

The reward unit 133 may derive a reward value for the performed action of the autonomous vehicle from a reward function based on a target speed of the autonomous vehicle and a safe distance from the rear vehicle.

The policy learning unit 135 may learn an action policy for at least one of lane change and acceleration control using a deep reinforcement learning algorithm, based on at least one of the partially observed data, the performed action information, and the derived reward value. It may also update the action policy based on the derived reward value for the action of the autonomous vehicle.
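The division of labor among the action unit 131, the reward unit 133, and the policy learning unit 135 might be sketched as follows. All class, field, and function names are hypothetical, and the policy and the reward are passed in as plain callables rather than a trained network; the sketch only shows how one interaction step produces a transition that a policy learner could later consume.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Observation = Sequence[float]
Action = Tuple[float, int]  # (acceleration, lane-change direction)


@dataclass
class Transition:
    observation: Observation
    action: Action
    reward: float


@dataclass
class IntegratedLearningUnit:
    # Role of the action unit (131): map an observation to an action.
    act: Callable[[Observation], Action]
    # Role of the reward unit (133): score the observed outcome of an action.
    compute_reward: Callable[[Observation, Action, Observation], float]
    # Data that a policy learning unit (135) would use for policy updates.
    buffer: List[Transition] = field(default_factory=list)

    def step(self, obs: Observation,
             env_step: Callable[[Action], Observation]) -> Observation:
        """Act once, score the outcome, store the transition, return the next observation."""
        action = self.act(obs)
        next_obs = env_step(action)
        reward = self.compute_reward(obs, action, next_obs)
        self.buffer.append(Transition(obs, action, reward))
        return next_obs
```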

In the present invention, a Markov decision process (MDP) is modeled for fast driving and stability, and the autonomous vehicle is trained with the Proximal Policy Optimization (PPO) algorithm, one of the deep reinforcement learning algorithms, in order to verify its effect.
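For reference, the core of PPO is its clipped surrogate objective, which limits how far a single update can move the policy away from the policy that collected the data. The sketch below evaluates that objective on plain Python lists. The function name, the argument layout, and the clipping value 0.2 are illustrative assumptions, not parameters stated in this document.

```python
import math
from typing import Sequence


def ppo_clip_objective(new_log_probs: Sequence[float],
                       old_log_probs: Sequence[float],
                       advantages: Sequence[float],
                       clip_eps: float = 0.2) -> float:
    """Mean clipped surrogate objective (to be maximized).

    For each sample, ratio = pi_new(a|o) / pi_old(a|o), and the objective is
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    if not advantages:
        raise ValueError("at least one sample is required")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies agree, the ratio is 1 and the objective is simply the mean advantage; the clipping only takes effect once the ratio leaves the interval [1 - eps, 1 + eps].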

Reinforcement learning is a machine learning method in which the agent (10, the autonomous vehicle), the subject of learning, learns through interaction with the environment (the state of the road 1). In an embodiment of the present invention, reinforcement learning follows an MDP. An MDP is a method of probabilistically modeling the decision-making process carried out by the agent, and can be defined as a tuple [...].

The state space [...] is the set of states s_t, at time t, of the environment with which the agent interacts. The observation space [...] is the set of observations o_t that the agent makes of the environment. When the set of state information observable by the agent is identical to the state space, this is called full observation; when it is limited to a part of it, this is called partial observation.

The action space [...] is the set of all actions a_t that the agent can take. The reward function [...](s_t, a_t, s_{t+1}) (hereinafter written [...]_t) denotes the reward that the environment gives the agent for the state s_{t+1} reached by taking action a_t in state s_t. The agent learns to take, in a given state s_t, the action a_t that maximizes the reward [...]_t. Finally, [...] is the discount factor over time.
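To make the role of the discount factor concrete, the following sketch computes the discounted return at every time step from a list of per-step rewards. The function name and the numbers in the usage note are illustrative assumptions.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `discounted_returns([1.0, 1.0, 1.0], 0.5)` evaluates to `[1.75, 1.5, 1.0]`: rewards further in the future contribute less to the return at the current step.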

In the present invention, the reward function that lets the agent drive efficiently through lane changes while not obstructing the driving of surrounding vehicles is given by Equation 1 below.

[Equation 1]

First, [...] is the reward term, which encourages the autonomous vehicle to drive close to the target speed [...]. If the vehicle's speed [...] equals the target speed [...], the maximum reward of 1 is given; if it rises above or falls below [...], a lower reward is given. [...] is the penalty term for intruding on the safe range of the rear vehicle when the autonomous vehicle changes lanes.

Referring to FIG. 5, [...] denotes the relative distance between the rear vehicle and the autonomous vehicle at time t+1. [...] is the safe distance, which can be adjusted through the environment settings and by the user.
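Equation 1 is not reproduced in this text, so the following is only a sketch of a reward with the two properties described: a speed term that peaks at 1 when the vehicle drives at the target speed, and a penalty that applies when a lane change leaves the rear vehicle closer than the safe distance. The linear shapes of both terms, the weights, and every name below are assumptions made for illustration, not the patent's formula.

```python
def speed_reward(v: float, v_target: float) -> float:
    """Equal to 1 at the target speed and lower on either side (assumed linear, floored at 0)."""
    if v_target <= 0:
        raise ValueError("v_target must be positive")
    return max(0.0, 1.0 - abs(v - v_target) / v_target)


def safety_penalty(changed_lane: bool, rear_gap: float, safe_distance: float) -> float:
    """Positive only when a lane change leaves the rear vehicle inside the safe distance."""
    if safe_distance <= 0:
        raise ValueError("safe_distance must be positive")
    if changed_lane and rear_gap < safe_distance:
        # Assumed shape: the penalty grows as the gap shrinks, up to 1 at zero gap.
        return 1.0 - max(0.0, rear_gap) / safe_distance
    return 0.0


def reward(v: float, v_target: float, changed_lane: bool,
           rear_gap: float, safe_distance: float,
           w_speed: float = 1.0, w_penalty: float = 1.0) -> float:
    """Weighted reward term minus weighted penalty term (weights are hypothetical)."""
    return (w_speed * speed_reward(v, v_target)
            - w_penalty * safety_penalty(changed_lane, rear_gap, safe_distance))
```

Separating the two terms keeps the trade-off explicit: raising `w_penalty` relative to `w_speed` makes the agent more reluctant to cut in close to the rear vehicle, at the cost of reaching the target speed more slowly.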

안전 거리는 사용하는 안전 시스템에 따라 달라질 수 있다. 예를 들어, 자율주행차량의 자동 안정 보조 시스템에 내장되어 있다면, 각각의 시스템에서 사용되는 안전 거리를 사용할 수 있다. The safety distance may vary depending on the safety system used. For example, if it is built into an automatic stability assist system of an autonomous vehicle, the safety distance used in each system can be used.

In addition, when the non-autonomous vehicles are designed around an IDM controller, the vehicles are regulated by the IDM controller, so the safe distance can be set as shown in Equation 2 below.

[Equation 2]

Here, the two parameters are the minimum allowable distance between vehicles and the time headway, which is the minimum allowable time required for the leading vehicle and the rear vehicle to reach the same position.
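One common way to combine a minimum gap and a time headway, familiar from IDM-style car-following models, is to add the distance covered during the headway time to the minimum gap. The sketch below assumes that form; whether Equation 2 matches it exactly is an assumption, and the name `safe_distance` and its parameters are hypothetical.

```python
def safe_distance(rear_speed: float, min_gap: float, time_headway: float) -> float:
    """Assumed safe distance: minimum gap plus the distance the rear
    vehicle covers during one time headway (metres, m/s, seconds)."""
    if rear_speed < 0 or min_gap < 0 or time_headway < 0:
        raise ValueError("inputs must be non-negative")
    return min_gap + rear_speed * time_headway
```

Under this assumption the required gap grows linearly with the rear vehicle's speed and reduces to the minimum gap when that vehicle is stationary.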

The action the agent can take is expressed as a pair consisting of an acceleration and a lane change direction. acc denotes the acceleration of the autonomous vehicle and has a continuous range between -1 and 1. The lane change direction takes discrete values. For example, 0 may mean keeping the current lane, -1 a lane change to the right lane, and 1 a lane change to the left lane.
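The mixed continuous and discrete action described above can be sketched as a small data structure. The sketch assumes the policy emits two raw numbers that must be clipped and rounded into the valid ranges; the names `Action` and `make_action` and the rounding step are illustrative assumptions, not part of the source.

```python
from dataclasses import dataclass

# Lane change encoding given in the text: 0 keep, -1 right, 1 left.
LANE_KEEP, LANE_RIGHT, LANE_LEFT = 0, -1, 1


@dataclass(frozen=True)
class Action:
    acc: float        # continuous acceleration command in [-1, 1]
    lane_change: int  # one of LANE_KEEP, LANE_RIGHT, LANE_LEFT


def make_action(raw_acc: float, raw_lane: float) -> Action:
    """Clip the raw acceleration to [-1, 1] and snap the raw lane
    output to the nearest of the three discrete directions."""
    acc = max(-1.0, min(1.0, raw_acc))
    lane = int(round(max(-1.0, min(1.0, raw_lane))))
    return Action(acc, lane)
```

For example, `make_action(2.3, 0.9)` yields a full-acceleration command combined with a lane change to the left.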

The policy utilization unit 150 performs optimal driving of the autonomous vehicle for the current road state based on the policy learned through deep reinforcement learning.

The following describes the results of evaluating the performance of the present invention using FLOW, a deep reinforcement learning framework for road traffic simulators.

The road is a 260 m two-lane circular road (FIG. 2) on which a set number of vehicles travel, some of them autonomous and the rest non-autonomous. All non-autonomous vehicles use the IDM controller, and their driving speed is fixed at 1 m/s.
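On a circular road, the relative distance between two vehicles has to wrap around the end of the lane. A minimal sketch, assuming positions are measured in metres along the lane and ignoring vehicle length; the helper name `forward_gap` is hypothetical, and only the 260 m length comes from the text.

```python
RING_LENGTH_M = 260.0  # road length stated in the text


def forward_gap(follower_pos: float, leader_pos: float,
                ring_length: float = RING_LENGTH_M) -> float:
    """Distance travelled forward along the ring from the follower
    to the leader, wrapping around at the end of the lane."""
    return (leader_pos - follower_pos) % ring_length
```

A follower at 250 m behind a leader at 10 m is therefore 20 m away, not 240 m.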

The minimum allowable distance, the time headway, and the target speed were each set for the simulation, the target speed being 3 m/s. In this simulation, the two parameters of Equation 1 were set to 10 and 1, respectively, and the duration of one time step was defined.

To evaluate the performance of the vehicle trained with the deep reinforcement learning algorithm PPO, its performance was compared with that of the control-theory-based lane change model LC2013 (Non-Patent Document 1 of the prior art documents).

FIG. 6 shows that both vehicles try to maintain the target speed of 3 m/s. Table 1 below details the mean and the variance of the speed of a single lane-changing vehicle over a single episode.

[Table 1]

With the LC2013 model, the vehicle does not make the lane change decision promptly. Instead it lingers behind the non-autonomous leading vehicle, matching that vehicle's speed and losing time. This is plainly visible in FIG. 6 near time steps 1700 and 2500, where the vehicle holds 1 m/s, the speed of the non-autonomous vehicles.

In contrast, with PPO the vehicle does not wait needlessly when the vehicle ahead blocks the way but changes lanes, and it was therefore observed to drive at a steady speed.

As a result, training with PPO, a deep reinforcement learning-based model, was observed to produce more natural lane changes than using LC2013, a lane change model based on control theory.

In conclusion, the autonomous vehicle trained through PPO, a deep reinforcement learning algorithm, according to the present invention showed higher performance than the non-autonomous vehicle to which the traditional control-theory-based lane change model was applied. The average speed of the autonomous vehicle was about 10% higher than that of the non-autonomous vehicle, and the autonomous vehicle drove close to the target speed.
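The per-episode metrics used in this comparison (mean speed, speed variance, and the relative gain in mean speed) can be computed as in the sketch below. The function names are hypothetical, and the choice of population variance is an assumption, since the text does not say which variance estimator was used.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def speed_summary(speeds: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population variance) of the speeds logged in one episode."""
    return fmean(speeds), pvariance(speeds)


def relative_gain(mean_a: float, mean_b: float) -> float:
    """Fractional increase of mean_a over mean_b, e.g. 0.1 for a 10% gain."""
    if mean_b == 0:
        raise ValueError("mean_b must be non-zero")
    return (mean_a - mean_b) / mean_b
```

A lower variance together with a mean closer to the target speed corresponds to the steadier driving described for the PPO-trained vehicle.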

FIG. 7 is a flowchart of a deep reinforcement learning-based lane change method for an autonomous vehicle according to an embodiment of the present invention.

The deep reinforcement learning-based lane change method for an autonomous vehicle according to the present embodiment may be carried out in substantially the same configuration as the device 100 of FIG. 1. Accordingly, components identical to those of the device 100 of FIG. 1 are given the same reference numerals, and repeated descriptions are omitted.

In addition, the deep reinforcement learning-based lane change method for an autonomous vehicle according to the present embodiment may be executed by software (an application) for performing deep reinforcement learning-based lane changes for an autonomous vehicle.

The present invention proposes, as one embodiment of a deep reinforcement learning algorithm, a PPO-based method for learning lane changes for an autonomous vehicle. In the present invention, the agent (the autonomous vehicle) learns through interaction with the environment (the road state). The agent observes the road state and then performs a suitable action based on what it has learned. Because complete road information is not available, it uses observed information that is partial and imperfect.

Referring to FIG. 7, in the deep reinforcement learning-based lane change method for an autonomous vehicle according to the present embodiment, the autonomous vehicle collects data by partially observing the state of the nearby road in a multi-lane road environment (step S10).

For example, partial observation can collect the speed of the leading vehicle in each lane, the speed of the rear vehicle in each lane, the speed of the autonomous vehicle, the relative distance between the leading vehicle in each lane and the autonomous vehicle, the relative distance between the rear vehicle in each lane and the autonomous vehicle, and the number of the lane in which each vehicle is located.
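The partial observation listed above could be packed into a flat vector for a neural-network policy roughly as follows. The class names, field names, and field ordering are assumptions for illustration; the text specifies only which quantities are collected.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaneObservation:
    lane: int              # lane number
    leader_speed: float    # speed of the leading vehicle in this lane
    follower_speed: float  # speed of the rear vehicle in this lane
    leader_gap: float      # relative distance to the leading vehicle
    follower_gap: float    # relative distance to the rear vehicle


@dataclass
class Observation:
    ego_speed: float       # speed of the autonomous vehicle
    ego_lane: int          # lane number of the autonomous vehicle
    lanes: List[LaneObservation]

    def to_vector(self) -> List[float]:
        """Flatten into [ego_speed, ego_lane, then five values per lane],
        with lanes ordered by lane number."""
        vec = [self.ego_speed, float(self.ego_lane)]
        for obs in sorted(self.lanes, key=lambda o: o.lane):
            vec += [float(obs.lane), obs.leader_speed, obs.follower_speed,
                    obs.leader_gap, obs.follower_gap]
        return vec
```

For a two-lane road this layout gives a 12-element vector.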

For example, the multi-lane road environment may be a two-lane circular road.

Based on the collected data, at least one action of lane change and acceleration control is performed (step S20). Here, the acceleration of the autonomous vehicle has a continuous range between -1 and 1, and the lane change direction may take values indicating lane keeping, a lane change to the right lane, and a lane change to the left lane, respectively.

Reflecting the action performed by the autonomous vehicle, a reward value for that action is derived from a reward function based on the target speed of the autonomous vehicle and the safe distance from the rear vehicle (step S30).

The reward function may include a reward term that encourages the autonomous vehicle to drive close to its target speed and a penalty term for intruding on the safe distance from the rear vehicle when the autonomous vehicle changes lanes. In particular, the penalty term may be generated based on the minimum allowable distance between vehicles and the minimum allowable time required for the leading vehicle and the rear vehicle to reach the same position.

The safe distance may vary depending on the safety system in use. For example, if a safe distance is built into the automatic stability assist systems of the autonomous vehicle, the safe distance used by each of those systems can be adopted. In addition, when the non-autonomous vehicles are designed around an IDM controller, the safe distance may be regulated by the IDM controller.

Based on at least one of the partially observed data, the performed action information, and the derived reward value, an action policy for at least one of lane change and acceleration control is learned using a deep reinforcement learning algorithm (step S40).

For example, the action policy may be learned using the PPO (Proximal Policy Optimization) algorithm, one of the deep reinforcement learning algorithms, and the action policy may be updated based on the derived reward value for the action of the autonomous vehicle.
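For orientation, the widely used clipped-surrogate form of the PPO objective can be sketched as below. The text does not state which PPO variant or hyperparameters were used, so the clipping form, the value `clip_eps=0.2`, and the function name are assumptions; the advantages are taken as given rather than estimated here.

```python
import math
from typing import Sequence


def clipped_surrogate(logp_new: Sequence[float], logp_old: Sequence[float],
                      advantages: Sequence[float], clip_eps: float = 0.2) -> float:
    """Mean PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old are log-probabilities of the taken actions under
    the current and the data-collecting policy, respectively.
    """
    if not (len(logp_new) == len(logp_old) == len(advantages)) or not advantages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio new / old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)  # pessimistic bound
    return total / len(advantages)
```

The clipping limits how much a single update can profit from moving the policy away from the one that collected the data, which is what keeps each policy update small.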

If the performance of the policy learned so far exceeds a preset threshold (step S50), optimal driving of the autonomous vehicle for the current road state is performed based on the learned action policy (step S60).

Otherwise, if the performance of the policy learned so far falls short of the preset threshold (step S50), the method returns to step S10 and training is performed again.
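The control flow of steps S10 to S60 can be sketched as a loop that repeats data collection and learning until the policy's score exceeds the threshold. The callable `collect_and_learn` stands in for steps S10 to S40 and is hypothetical, as is the `max_iterations` cap, which the flowchart does not include and which is added here only so the loop always terminates.

```python
from typing import Callable


def train_until_threshold(collect_and_learn: Callable[[], float],
                          threshold: float,
                          max_iterations: int = 1000) -> int:
    """Run S10-S40 repeatedly; stop when the returned performance score
    exceeds the threshold (S50). Returns the number of iterations used,
    after which the learned policy would be deployed (S60)."""
    for iteration in range(1, max_iterations + 1):
        score = collect_and_learn()  # S10-S40: observe, act, reward, update
        if score > threshold:        # S50: performance check
            return iteration
    raise RuntimeError("threshold not reached within max_iterations")
```

In a real setup `collect_and_learn` would run one or more simulated episodes and a policy update, and return whatever evaluation metric the threshold is defined on.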

This deep reinforcement learning-based lane change method for an autonomous vehicle may be implemented as an application, or in the form of program instructions executable by various computer components, and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the art of computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware device may be configured to operate as one or more software modules in order to perform the processing according to the present invention, and vice versa.

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

The present invention can improve the safety and efficiency of driving with respect to the lane changes that occur frequently on the way to a driver's goal. It can therefore be usefully applied to autonomous driving technology, which is advancing along with current artificial intelligence technology.

10: autonomous vehicle
100: deep reinforcement learning-based lane change device for autonomous vehicles
110: observation unit
130: integrated learning unit
131: action unit
133: reward unit
135: policy learning unit
150: policy utilization unit
1: road
11: deep neural network
13: sensor

Claims

A deep reinforcement learning-based lane change method for an autonomous vehicle, comprising:
collecting data by partially observing, by the autonomous vehicle, the state of a nearby road in a multi-lane road environment;
performing at least one action of lane change and acceleration control based on the collected data;
deriving, by reflecting the performed action of the autonomous vehicle, a reward value for the action of the autonomous vehicle from a reward function based on a target speed of the autonomous vehicle and a safe distance from a rear vehicle;
learning an action policy for at least one of lane change and acceleration control using a deep reinforcement learning algorithm, based on at least one of the partially observed data, the performed action information, and the derived reward value; and
performing optimal driving of the autonomous vehicle for a current road state based on the learned action policy.

The method of claim 1, wherein collecting data by partially observing the state of the nearby road comprises
collecting the speed of the leading vehicle in each lane, the speed of the rear vehicle in each lane, the speed of the autonomous vehicle, the relative distance between the leading vehicle in each lane and the autonomous vehicle, the relative distance between the rear vehicle in each lane and the autonomous vehicle, and the number of the lane in which each vehicle is located.

The method of claim 1, wherein, in performing at least one action of lane change and acceleration control,
the acceleration of the autonomous vehicle has a continuous range between -1 and 1, and the lane change direction takes values indicating lane keeping, a lane change to the right lane, and a lane change to the left lane, respectively.

The method of claim 1, wherein the reward function
includes a reward term that encourages the autonomous vehicle to drive close to the target speed and a penalty term for intruding on the safe distance from the rear vehicle when the autonomous vehicle changes lanes.

The method of claim 4, wherein the penalty term
is generated based on the minimum allowable distance between vehicles and the minimum allowable time required for a leading vehicle and a rear vehicle to reach the same position.

The method of claim 1, wherein the learning using the deep reinforcement learning algorithm
uses the PPO (Proximal Policy Optimization) algorithm among deep reinforcement learning algorithms.

The method of claim 1,
further comprising updating the action policy based on the derived reward value for the action of the autonomous vehicle.

The method of claim 1,
wherein the multi-lane road environment is a two-lane circular road.

A computer-readable storage medium on which a computer program for performing the deep reinforcement learning-based lane changing method for an autonomous vehicle according to claim 1 is recorded.

A deep reinforcement learning-based lane change device for an autonomous vehicle, comprising:
an observation unit that collects data by partially observing, by the autonomous vehicle, the state of a nearby road in a multi-lane road environment;
an integrated learning unit that performs at least one action of lane change and acceleration control based on the collected data, derives a reward value for the performed action of the autonomous vehicle, and learns an action policy for at least one of lane change and acceleration control using a deep reinforcement learning algorithm based on at least one of the partially observed data, the performed action information, and the derived reward value; and
a policy utilization unit that performs optimal driving of the autonomous vehicle for a current road state based on the learned action policy.

The device of claim 11, wherein the observation unit
collects the speed of the leading vehicle in each lane, the speed of the rear vehicle in each lane, the speed of the autonomous vehicle, the relative distance between the leading vehicle in each lane and the autonomous vehicle, the relative distance between the rear vehicle in each lane and the autonomous vehicle, and the number of the lane in which each vehicle is located.

The device of claim 11, wherein the integrated learning unit comprises:
an action unit that performs at least one action of lane change and acceleration control based on the collected data;
a reward unit that derives, by reflecting the performed action of the autonomous vehicle, a reward value for the action of the autonomous vehicle from a reward function based on a target speed of the autonomous vehicle and a safe distance from a rear vehicle; and
a policy learning unit that learns an action policy for at least one of lane change and acceleration control using a deep reinforcement learning algorithm based on at least one of the partially observed data, the performed action information, and the derived reward value.

The device of claim 12, wherein, in the action unit,
the acceleration of the autonomous vehicle has a continuous range between -1 and 1, and the lane change direction takes values indicating lane keeping, a lane change to the right lane, and a lane change to the left lane, respectively.

The device of claim 11, wherein the reward function
includes a reward term that encourages the autonomous vehicle to drive close to the target speed and a penalty term for intruding on the safe distance from the rear vehicle when the autonomous vehicle changes lanes.

The device of claim 14, wherein the penalty term
is generated based on the minimum allowable distance between vehicles and the minimum allowable time required for a leading vehicle and a rear vehicle to reach the same position.

The device of claim 11, wherein the integrated learning unit
uses the PPO (Proximal Policy Optimization) algorithm among deep reinforcement learning algorithms.

The device of claim 11, wherein the integrated learning unit
updates the action policy based on the derived reward value for the action of the autonomous vehicle.

The device of claim 11,
wherein the multi-lane road environment is a two-lane circular road.