KR102166811B1

KR102166811B1 - Method and Apparatus for Controlling of Autonomous Vehicle using Deep Reinforcement Learning and Driver Assistance System

Info

Publication number: KR102166811B1
Application number: KR1020190007345A
Authority: KR
Inventors: 허건수; 민규식; 김하영
Original assignee: 한양대학교 산학협력단
Priority date: 2019-01-21
Filing date: 2019-01-21
Publication date: 2020-10-19
Also published as: KR20200095590A

Abstract

A method and apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system are presented. According to an embodiment, the method may include: inputting measured sensor data and captured image data into a deep reinforcement learning algorithm; determining, in the deep reinforcement learning algorithm, an action for vehicle control using the input sensor data and image data; and controlling the vehicle by selecting a driver assistance system (DAS) according to the determined action.

Description

Method and Apparatus for Controlling of Autonomous Vehicle using Deep Reinforcement Learning and Driver Assistance System

The following embodiments relate to a method and apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system.

Research on the safe driving of autonomous vehicles has advanced considerably in recent years. One of the most important topics in safe autonomous driving is finding a driving policy that provides safety and comfort to the driver. The difficulty is that the driving policy must be robust to diverse road conditions and to vehicle parameters that can change easily.

Conventional autonomous driving proceeds through the stages of perception, planning, and control. First, information about surrounding vehicles, the road, and the environment is perceived through sensors, and the path the vehicle will travel is planned on that basis. Finally, once the final path is determined, the vehicle is controlled so that it follows that path. For this control, techniques commonly used in control engineering, such as PID control, sliding mode control, and model predictive control, directly actuate the steering wheel, brake, and accelerator.

There have also been past attempts to control autonomous vehicles through reinforcement learning. In most cases, reinforcement learning was used to directly control the vehicle's steering wheel, brake, and accelerator, and training proceeded by deriving optimal control values for a goal set by the driver.

However, the prior art is rule-based: a person directly decides and configures everything in perception, decision, and control. Such a system handles the situations the designer considered well, but it has difficulty coping with special situations that were not considered. Driving is a stochastic environment in which a wide variety of situations can occur, so it is very difficult for a person to prepare a response for every situation that may arise while driving.

Earlier attempts also used reinforcement learning to control vehicles, and these techniques tried to directly control the vehicle's steering wheel, brake, and accelerator. With reinforcement learning, the agent may still show unstable behavior even after training is complete. Direct control through reinforcement learning can therefore feel unstable to a human occupant and, in some cases, can create dangerous situations.

Non-Patent Document 1: V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
Non-Patent Document 2: H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double q-learning," in AAAI, vol. 16, 2016, pp. 2094-2100.
Non-Patent Document 3: Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, 2015.
Non-Patent Document 4: D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
Non-Patent Document 5: M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., "Tensorflow: A system for large-scale machine learning," in OSDI, vol. 16, 2016, pp. 265-283.

The embodiments describe a method and apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system. More specifically, they provide a technique that controls an autonomous vehicle by determining the driver assistance system appropriate to each situation through an algorithm that decides the optimal action using deep reinforcement learning.

An object of the embodiments is to provide a method and apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system that can stably perform path planning and control of the autonomous vehicle. This is done by using a system that decides, based on a deep reinforcement learning algorithm, which driver assistance system to select, choosing the vehicle action suited to each situation, and controlling the vehicle accordingly.

According to an embodiment, a method of controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system may include: inputting measured sensor data and captured image data into a deep reinforcement learning algorithm; determining, in the deep reinforcement learning algorithm, an action for vehicle control using the input sensor data and image data; and controlling the vehicle by selecting a driver assistance system (DAS) according to the determined action.

In the step of inputting the measured sensor data and the captured image data into the deep reinforcement learning algorithm, the sensor data measured by a LIDAR sensor mounted on the vehicle and the image data captured by a camera may be input into the deep reinforcement learning algorithm.

The step of determining an action for vehicle control may include: refining the input sensor data and the input image data, respectively; concatenating the refined sensor data and the refined image data to form concatenated data; inputting the concatenated data into a fully connected layer of the deep reinforcement learning algorithm to obtain Q values; and determining an action according to the Q values.
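The data flow of this step can be sketched as follows. This is only an illustrative sketch: the function name `q_from_fused_features`, the feature sizes, and the single linear layer with random weights are assumptions made for the example, standing in for the trained refinement networks and fully connected layer.

```python
import numpy as np

def q_from_fused_features(sensor_features, image_features, weights, bias):
    """Concatenate the two refined feature vectors and apply one linear layer."""
    fused = np.concatenate([sensor_features, image_features])
    return fused @ weights + bias

# Hypothetical sizes: 8 sensor features, 16 image features, 5 actions.
rng = np.random.default_rng(0)
q_values = q_from_fused_features(
    rng.normal(size=8),        # stand-in for refined sensor data
    rng.normal(size=16),       # stand-in for refined image data
    rng.normal(size=(24, 5)),  # untrained weights, for illustration only
    np.zeros(5),
)
action = int(np.argmax(q_values))  # greedy choice: the action with the largest Q value
```

Choosing the action with the largest Q value is one common way to "determine an action according to the Q values"; during training an exploration rule would typically be added on top.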

The step of refining the sensor data and the image data, respectively, may include: refining the sensor data using a Long Short-Term Memory (LSTM) network; and refining the image data using a Convolutional Neural Network (CNN).

In the step of obtaining the Q values by inputting into the fully connected layer, a network may be used in which the fully connected layer is split into two parts, a state value and an action advantage that evaluates the advantage of each action, and the Q values may be obtained by combining the state value and the action advantage.
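One way to combine the two parts is the mean-subtracted aggregation used in the Dueling DQN paper (Wang et al.), Q(s, a) = V(s) + (A(s, a) - mean over a of A(s, a)). The sketch below assumes that aggregation; the text above does not state which combination rule the embodiment uses, and the function name and numeric values are hypothetical.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a).

    Assumed aggregation: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    return state_value + (advantages - advantages.mean())

# Hypothetical outputs of the two streams for five actions.
q = dueling_q_values(2.0, [0.5, -0.5, 0.0, 1.0, -1.0])
```

Subtracting the mean advantage keeps the split identifiable: the average of the Q values equals the state value, so the two streams cannot trade a constant offset between them.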

The step of determining an action for vehicle control may include five actions, namely increasing the target speed, decreasing the target speed, changing lanes to the left, changing lanes to the right, and maintaining the current state, so that the vehicle can be driven in the longitudinal and lateral directions simultaneously.

The method may further include training the deep reinforcement learning algorithm toward an optimal driving policy.

In the step of training the deep reinforcement learning algorithm toward an optimal driving policy, a reward may be designed according to the result of the action so as to find an optimal driving policy based on driving the vehicle at high speed, driving along a collision-free trajectory, and avoiding unnecessary lane changes, and the deep reinforcement learning algorithm may be trained with that driving policy.

According to another embodiment, an apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system may include: an input unit that inputs measured sensor data and captured image data into a deep reinforcement learning algorithm; a deep reinforcement learning unit that determines, in the deep reinforcement learning algorithm, an action for vehicle control using the input sensor data and image data; and a vehicle control unit that controls the vehicle by selecting a driver assistance system (DAS) according to the determined action.

The input unit may input the sensor data measured by a LIDAR sensor mounted on the vehicle and the image data captured by a camera into the deep reinforcement learning algorithm.

The deep reinforcement learning unit may include: a sensor data refinement unit that refines the input sensor data; an image data refinement unit that refines the input image data; a concatenated data forming unit that concatenates the refined sensor data and the refined image data to form concatenated data; a fully connected layer unit that inputs the concatenated data into a fully connected layer of the deep reinforcement learning algorithm to obtain Q values; and an action determination unit that determines an action according to the Q values.

The sensor data refinement unit may refine the sensor data using a Long Short-Term Memory (LSTM) network, and the image data refinement unit may refine the image data using a Convolutional Neural Network (CNN).

The fully connected layer unit may use a network in which the fully connected layer is split into two parts, a state value and an action advantage that evaluates the advantage of each action, and may obtain the Q values by combining the state value and the action advantage.

The vehicle control unit may provide five actions, namely increasing the target speed, decreasing the target speed, changing lanes to the left, changing lanes to the right, and maintaining the current state, so that the vehicle can be driven in the longitudinal and lateral directions simultaneously.

The deep reinforcement learning unit may design a reward according to the result of the action so as to find an optimal driving policy based on driving the vehicle at high speed, driving along a collision-free trajectory, and avoiding unnecessary lane changes, and may train the deep reinforcement learning algorithm with that driving policy.

According to the embodiments, it is possible to provide a method and apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system that control the autonomous vehicle by determining the driver assistance system appropriate to each situation through an algorithm that decides the optimal action using deep reinforcement learning.

According to the embodiments, it is possible to provide a method and apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system that can stably perform path planning and control of the autonomous vehicle, by using a system that decides, based on a deep reinforcement learning algorithm, which driver assistance system to select, choosing the vehicle action suited to each situation, and controlling the vehicle accordingly.

FIG. 1 is a diagram illustrating the configuration of the sensors and cameras of a vehicle according to an embodiment.
FIG. 2 is a diagram illustrating an example of the LIDAR sensor detecting surrounding vehicles according to an embodiment.
FIG. 3 is a diagram illustrating the immediate reward for overtaking a slow vehicle according to an embodiment.
FIG. 4 is a diagram showing an example of the simulator according to an embodiment.
FIG. 5 is a flowchart showing a method of controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system according to an embodiment.
FIG. 6 is a diagram schematically showing the structure of an apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system according to an embodiment.
FIG. 7 is a diagram showing an example of the average speed obtained with the algorithm that determines the driving policy according to an embodiment.
FIG. 8 is a diagram showing an example of the average number of lane changes obtained with the algorithm that determines the driving policy according to an embodiment.
FIG. 9 is a diagram showing an example of the average number of overtakes obtained with the algorithm that determines the driving policy according to an embodiment.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. In the drawings, the shapes and sizes of elements may be exaggerated for clarity.

With the recent commercialization of various driver assistance systems (DAS), most vehicles now partially provide autonomous driving functions such as Smart Cruise Control (SCC) and a Lane Keeping System (LKS). In restricted settings such as highways, it is possible to drive automatically without driver intervention simply by combining driver assistance systems well. Implementing such an autonomous driving function requires being able to select the appropriate DAS function at the appropriate time.

The following embodiments may provide a technique for training, through deep reinforcement learning, a supervisor that selects the appropriate driver assistance system (DAS). The driving policy may operate on camera images and LIDAR data available in the autonomous vehicle.

More specifically, the embodiments provide an autonomous driving framework that uses existing driver assistance systems (DAS) and deep reinforcement learning. The proposed technique decides among DAS functions such as lane keeping, lane change, and cruise control so that, together with lane changes, the average speed and the number of overtakes are maximized. The action space is defined at the behavior level, and the driving policy is trained to decide behavior-level actions such as lane keeping, lane change, and cruise control.

Because the vehicle's behavior is built on the core functions of existing, commercially deployed driver assistance systems (DAS), driving safety can be ensured, unlike methods that directly learn throttle, brake, and steering values. The driving policy may be designed to operate on camera images and LIDAR data, which are readily available in autonomous vehicles. A simulator that reproduces a road and the motion of multiple vehicles was built with Unity and can be used to train and validate the deep reinforcement learning algorithm.

A Markov Decision Process (MDP) is a mathematical framework for decision making, defined by the tuple $\langle S, A, T, R, \gamma \rangle$ of a set of states $S$, a set of actions $A$, a transition model $T$, a reward function $R$, and a discount factor $\gamma$. The problem to be solved in an MDP is to find a decision-making policy $\pi$: for the given reward function $R$, the goal is to find the policy $\pi^{*}$ that maximizes the expected sum of discounted rewards $\mathbb{E}\left[\sum_{t} \gamma^{t} R(s_t, a_t)\right]$.
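As a small worked illustration of the quantity a policy is judged by, the following sketch computes the discounted sum of rewards for one finite episode. The function name and the sample rewards are hypothetical; the expectation over episodes is not computed here.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one finite episode."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three steps with reward 1 each and a discount factor of 0.5.
episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a discount factor below 1, rewards received later count less, which is why the sum stays bounded even for long episodes.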

Meanwhile, recent deep reinforcement learning has made it possible to train a deep neural network (DNN) stably and effectively on large data sets, so that an MDP can be solved without knowing the state $S_t$ directly, by using a fixed state representation $\phi$ obtained from the raw input. Indeed, in computer vision, neural networks learn better representations than handcrafted features.

The driving policy is learned on the basis of an MDP, through the host vehicle's interaction with its surroundings, including nearby vehicles and lanes. Exploiting the ability of deep reinforcement learning to learn the representation itself, the observation states $S$, action space $A$, and reward function $R$ for learning the driving policy can be defined as follows.

FIG. 1 is a diagram illustrating the configuration of the sensors and cameras of a vehicle according to an embodiment.

Referring to FIG. 1, in an embodiment, the observation states may be constructed from sensor data and camera data of the vehicle 100.

The sensor simulates a LIDAR that detects the sides of the vehicle: it emits particles at specific angles and returns the distance at which each particle collides with an object within the sensor coverage 110.

The cameras may be located at the front of the vehicle, the rear, or both, and provide front and rear image information 120. Two network models process the sensor data and the camera data, respectively, and the refined information output by the two models is combined and used as the observation, which is the input to deep reinforcement learning.

FIG. 2 is a diagram illustrating an example of the LIDAR sensor detecting surrounding vehicles according to an embodiment.

As shown in FIG. 2, for example, the left and right sides of the vehicle 200 are each sensed over a range of 90 degrees, with one ray 210 emitted per degree. The LIDAR sensor thus yields the distance to an obstacle for each degree, and a raw image of the front side is also obtained to construct the observation.
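The range part of such an observation could be assembled as in the sketch below. It assumes 90 readings per side (one per degree), a hypothetical maximum range of 40 m, and that rays hitting nothing are reported as infinity or NaN; none of these details are specified in the text, and the function name is illustrative.

```python
import numpy as np

RAYS_PER_SIDE = 90   # assumed: one ray per degree over a 90-degree span
MAX_RANGE_M = 40.0   # assumed sensor range; not given in the text

def build_lidar_observation(left_ranges, right_ranges, max_range=MAX_RANGE_M):
    """Stack left and right range readings into one vector scaled to [0, 1]."""
    left = np.asarray(left_ranges, dtype=np.float32)
    right = np.asarray(right_ranges, dtype=np.float32)
    if left.shape != (RAYS_PER_SIDE,) or right.shape != (RAYS_PER_SIDE,):
        raise ValueError("expected 90 range readings per side")
    ranges = np.concatenate([left, right])
    # Treat rays that hit nothing (inf or NaN) as readings at maximum range.
    ranges = np.nan_to_num(ranges, nan=max_range, posinf=max_range)
    return np.clip(ranges, 0.0, max_range) / max_range

# Obstacles 20 m away on the left, nothing detected on the right.
obs = build_lidar_observation(np.full(90, 20.0), np.full(90, np.inf))
```

Scaling to a fixed interval is a common preprocessing choice for neural network inputs; the embodiment may process the raw distances differently.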

Because the range data from the LIDAR and the image data from the camera have entirely different characteristics, a network with multi-modal inputs may be used.

The action space for the driving policy may be defined as a discrete action space. Because these actions are carried out on top of conventional driver assistance systems (DAS), they are performed within the limits in which the DAS functions take priority.

In the longitudinal direction, there are three kinds of actions: 1) cruise control at speed $v + v_{cc}$, where $v_{cc}$ may be an additional target speed set to 5 km/h; 2) cruise control at the current speed $v$; and 3) cruise control at speed $v - v_{cc}$. These longitudinal actions may incorporate Autonomous Emergency Braking (AEB) and Adaptive Cruise Control (ACC). The longitudinal actions thus raise or lower the target speed, and the vehicle drives at the target speed, but in a dangerous situation where a collision with the vehicle ahead is likely, the vehicle decelerates to prevent the collision.

In the lateral direction, there are three kinds of actions: 1) lane keeping, 2) lane change to the left, and 3) lane change to the right. When a left lane-change action is issued while another vehicle is within a certain distance on the left of the host vehicle, the DAS for lane changes intervenes, so the vehicle does not move left even though the action was given. The right side works the same way. When the action of maintaining the current state is performed, the vehicle is controlled to stay centered in its lane.

Since the autonomous vehicle must be driven in the longitudinal and lateral directions simultaneously, five actions may be included: increase the target speed, decrease the target speed, change lanes to the left, change lanes to the right, and maintain the current state. The driving policy obtains robustness by reusing, unchanged, existing driver assistance systems (DAS) that already guarantee basic longitudinal stability.
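The mapping from the five actions to DAS commands can be sketched as follows. The 5 km/h step comes from the description above; the names `Action`, `DasCommand`, and `select_das`, the boolean "blocked" flags, and the lower bound of 0 km/h on the target speed are assumptions made for the illustration.

```python
from dataclasses import dataclass
from enum import Enum

V_CC_KMH = 5.0  # target-speed step v_cc from the description

class Action(Enum):
    SPEED_UP = 0
    SLOW_DOWN = 1
    CHANGE_LEFT = 2
    CHANGE_RIGHT = 3
    KEEP = 4

@dataclass(frozen=True)
class DasCommand:
    target_speed_kmh: float  # set point handed to cruise control
    lateral: str             # "keep_lane", "change_left" or "change_right"

def select_das(action, speed_kmh, left_blocked, right_blocked):
    """Translate a policy action into a cruise-control and lane command."""
    if action is Action.SPEED_UP:
        return DasCommand(speed_kmh + V_CC_KMH, "keep_lane")
    if action is Action.SLOW_DOWN:
        # Assumed: the target speed is never set below zero.
        return DasCommand(max(0.0, speed_kmh - V_CC_KMH), "keep_lane")
    if action is Action.CHANGE_LEFT and not left_blocked:
        return DasCommand(speed_kmh, "change_left")
    if action is Action.CHANGE_RIGHT and not right_blocked:
        return DasCommand(speed_kmh, "change_right")
    # KEEP, or a lane change vetoed because a vehicle is alongside.
    return DasCommand(speed_kmh, "keep_lane")
```

For example, `select_das(Action.CHANGE_LEFT, 60.0, True, False)` yields a lane-keeping command, mirroring the case where the lane-change DAS refuses to move toward an occupied lane.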

In reinforcement learning, selecting an action yields a reward as the result of that action. As described above, the problem to be solved in the MDP is to find the driving policy that maximizes the expected value of future rewards. The optimal driving policy can therefore differ completely depending on how the reward is designed, so designing the reward appropriately is important for learning a suitable driving policy for the autonomous vehicle.

When the vehicle drives in dense traffic, it is desirable to satisfy three conditions: 1) find a policy that drives the vehicle at high speed, 2) drive along a collision-free trajectory, and 3) do not change lanes too often. A reward satisfying these three conditions was designed and can be expressed by the following equations.

[Equation 1]

[Equation 2]

[Equation 3]

[Equation 4]

[Equation 5]

Here, $v$ is the current speed of the vehicle, $v_{\max}$ is the appropriate maximum speed, and $v_{\min}$ is the appropriate minimum speed. $r_{v,\max}$ is the reward for driving at the maximum speed, $r_{lc}$ is the penalty for a lane change, $r_{collision}$ is the penalty the agent receives on a collision, and $r_{overtake}$ is the reward for overtaking another vehicle.

From a safety standpoint, a collision is the outcome that must be avoided above all when driving, so r_collision is assigned a larger magnitude than any other reward value so that collisions are prevented with the highest priority. Below the appropriate maximum speed v_max it is desirable to drive as fast as possible, so below that speed the agent obtains a reward that is linear in the current speed. The optimal driving policy can thus lead the host vehicle to overtake slower vehicles. Finally, because changing lanes too often is undesirable, the value of r_lc (also written r_lanechange) can be set so that the agent learns to drive without unnecessary lane changes.
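The reward terms described above can be sketched as follows. Equations 1 to 5 and the values of Table 1 are not reproduced in this text, so the linear form of the speed term, the way the terms are summed, and every numeric value below are illustrative assumptions rather than the patent's parameters. The names `RewardParams`, `speed_reward`, and `total_reward` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardParams:
    # All values are placeholders chosen for illustration only.
    v_min: float = 40.0         # appropriate minimum speed [km/h]
    v_max: float = 80.0         # appropriate maximum speed [km/h]
    r_v_max: float = 1.0        # reward when driving at v_max
    r_lc: float = -0.25         # penalty for a lane change
    r_overtake: float = 0.5     # reward per overtaken vehicle
    r_collision: float = -10.0  # collision penalty, largest in magnitude


def speed_reward(v: float, p: RewardParams) -> float:
    """Grow linearly from 0 at v_min to r_v_max at v_max (assumed form)."""
    ratio = (v - p.v_min) / (p.v_max - p.v_min)
    return p.r_v_max * min(max(ratio, 0.0), 1.0)


def total_reward(v, lane_changed, overtakes, collided, p=RewardParams()):
    """Sum the speed, lane-change, overtake, and collision terms (assumed combination)."""
    r = speed_reward(v, p)
    if lane_changed:
        r += p.r_lc
    r += p.r_overtake * overtakes
    if collided:
        r += p.r_collision
    return r
```

With these placeholder values a collision outweighs every other term in a step, which mirrors the priority the text gives to r_collision.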

Table 1 defines the parameters of the reward.

[Table 1]

FIG. 3 is a diagram for explaining the immediate reward obtained when overtaking a slow vehicle according to an embodiment.

FIG. 3 shows an example of the immediate reward for overtaking a slow vehicle: the value of r_tot received when a vehicle traveling at 60 km/h changes lanes to overtake a slower vehicle and then travels faster, at 70 km/h. The reward decreases with the lane change but increases with the higher speed and the overtaking.

Deep reinforcement learning, which combines deep neural networks with reinforcement learning, has been studied extensively. In particular, many studies have extended value-based deep reinforcement learning built on DQN. Here, among the various value-based deep reinforcement learning algorithms, DQN (Non-Patent Document 1), Double DQN (Non-Patent Document 2), and Dueling DQN (Non-Patent Document 3) are used to derive autonomous driving agents and compare their results.

DQN (Non-Patent Document 1) combines a CNN (Convolutional Neural Network) with reinforcement learning. Only the raw pixel frame of the game screen is used as the observation O_t and fed to the CNN as input, and the CNN outputs a Q value for each action A_t. The action is selected and executed according to the ε-greedy technique; once it has been executed, the next observation O_t+1 and the reward R_t are obtained. The information obtained in this way is stored as a single experience (O_t, A_t, R_t, O_t+1) in a dataset. During training, an experience replay technique is used in which data in the dataset are randomly sampled to perform a mini-batch update. The parameters of the network are trained by stochastic gradient descent in the direction that minimizes the loss, which can be expressed by the following equation.

[Equation 6]
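The equation image is not reproduced in this text. Assuming it follows the standard DQN loss of Non-Patent Document 1, written with the observation, action, and reward symbols used above, it would take the form below. The symbols γ (discount factor), θ (network parameters), θ⁻ (target network parameters), and D (the dataset of stored experiences) are assumed notation.

```latex
L(\theta) = \mathbb{E}_{(O_t, A_t, R_t, O_{t+1}) \sim D}
\left[ \left( R_t + \gamma \max_{a'} Q(O_{t+1}, a'; \theta^{-}) - Q(O_t, A_t; \theta) \right)^{2} \right]
```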

Here, γ is the discount factor, which determines how strongly the sum of rewards expected in the future is reflected in training the network; θ denotes the parameters of the network; and θ⁻ denotes the parameters of the target network, which are copied from the network's parameters every fixed number of steps and are not trained separately.
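The pieces of the DQN procedure described above (action selection, experience replay, the target value, and the periodic copy into the target network) can be sketched in plain Python. This is a minimal illustration, assuming ε-greedy action selection as in Non-Patent Document 1; the names `epsilon_greedy`, `ReplayMemory`, `td_target`, and `sync_target` are hypothetical, and Q values are passed in as plain lists rather than computed by a network.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Fixed-size store of experiences (obs, action, reward, next_obs)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped first

    def push(self, obs, action, reward, next_obs):
        self.buffer.append((obs, action, reward, next_obs))

    def sample(self, batch_size, rng=random):
        # Uniform random sampling yields the mini-batch used for one update.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_target, gamma):
    """R_t + gamma * max_a Q(O_{t+1}, a; theta^-), from target-network Q values."""
    return reward + gamma * max(next_q_target)


def sync_target(online_params, step, period, target_params):
    """Copy the online parameters into the target network every `period` steps."""
    if step % period == 0:
        return dict(online_params)
    return target_params
```

Between copies the target parameters stay fixed, so the regression target does not move with every gradient step.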

Double DQN (DDQN) (Non-Patent Document 2) addresses the overestimation of action values in the DQN algorithm. When computing the target value, DQN uses the same Q-function both to select the action and to evaluate it, which can be expressed by the following equation.

[Equation 7]
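The equation image is not reproduced in this text. Assuming the standard DQN target with a target network, written so that the selection and the evaluation of the next action are both visible, it would read as follows; the label Y_t and the symbols γ and θ⁻ are assumed notation.

```latex
Y_t^{\mathrm{DQN}}
= R_t + \gamma \, Q\!\left(O_{t+1}, \arg\max_{a} Q(O_{t+1}, a; \theta^{-}); \theta^{-}\right)
= R_t + \gamma \max_{a} Q(O_{t+1}, a; \theta^{-})
```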

As a result, values are overestimated, which degrades the performance of DQN. DDQN resolves this overestimation by using two Q-functions to separate action selection from action evaluation when determining the target value, which can be expressed by the following equation.

[Equation 8]
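The equation image is not reproduced in this text. In the Double DQN of Non-Patent Document 2, the online network chooses the next action and the target network scores it, so the target is the reward plus the discounted target-network Q value of the action that the online network ranks highest. The sketch below contrasts this with the DQN target on plain lists of Q values; the function names are hypothetical.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, q_target_next, gamma):
    """The target network both selects and evaluates the next action."""
    return reward + gamma * q_target_next[argmax(q_target_next)]


def double_dqn_target(reward, q_online_next, q_target_next, gamma):
    """The online network selects the action; the target network evaluates it."""
    best_action = argmax(q_online_next)
    return reward + gamma * q_target_next[best_action]
```

When the two networks disagree about the best next action, the Double DQN target is never larger than the DQN target, which is how the overestimation is reduced.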

In Dueling DQN (Non-Patent Document 3), the convolutional part of the DQN network is kept unchanged, and the fully connected part is split into two parts that estimate the state value and the advantage of each action. Accordingly, the final expression for the Q value becomes an aggregation of the state value and the action advantages, which can be expressed by the following equation.

[Equation 9]
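The equation image is not reproduced in this text. Assuming the mean-subtracted aggregation of Non-Patent Document 3, the Q value would be written as below. Here s stands for the network input (the observation O_t above), θ for the shared convolutional parameters, and α and β for the parameters of the advantage and value streams; all of this notation is assumed.

```latex
Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta)
+ \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right)
```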

Here, α denotes the parameters of the fully connected layer that computes the advantage function A, and β denotes the parameters of the fully connected layer that computes the value function V. Dueling DQN, which uses a network composed of these two streams, showed a marked performance improvement over DQN and, in particular, remained robust as the number of actions increased.
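The aggregation of the two streams can be illustrated on plain numbers. This sketch assumes the mean-subtracted form from Non-Patent Document 3; the function name `dueling_q_values` is hypothetical, and the stream outputs are passed in directly instead of being computed by fully connected layers.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) with mean-centered advantages A(s, a) into Q(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + (a - mean_adv) for a in advantages]
```

Centering the advantages leaves the ranking of the actions unchanged and makes the mean of the Q values equal to the state value.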

According to an embodiment, the deep reinforcement learning algorithms described above (DQN, DDQN, and Dueling DQN) can be combined to find an optimal driving policy in simulation. The objective of the deep reinforcement learning algorithm is to drive at high speed without colliding with other vehicles and without unnecessary lane changes.

FIG. 4 is a diagram illustrating an example of a simulator according to an embodiment.

Referring to FIG. 4, an example autonomous driving environment can be built with the Unity simulator or Unity ML-Agents to test the embodiments. The vehicle labeled Agent (400) is the directly controlled vehicle; many other vehicles are present around it and may change lanes, accelerate, or decelerate at random. The training goal in this simulator is to drive at high speed through a highway environment containing many vehicles while minimizing unnecessary lane changes. The DQN-based algorithms described above can be used for tests in this environment, and deep reinforcement learning algorithms other than DQN can also be applied.

The simulated road environment is a five-lane highway. Other vehicles, of eight visually distinct types, are generated at random at lane centers within a certain distance of the host vehicle. A longitudinal collision avoidance system is strictly applied to each of these vehicles so that the surrounding vehicles avoid colliding with one another as far as possible. The surrounding vehicles also accelerate, decelerate, and change lanes to the left or right at random, which varies the simulation environment so that the vehicle learning the driving policy experiences diverse situations.

Because the varied actions of the other vehicles change the simulation environment in many ways, the agent can experience a wide range of situations. The simulator provides two types of observation: an image and a LIDAR-style range array. Because cameras are mounted at the front and elsewhere, a raw pixel image is observed at every step. The LIDAR sensor casts one ray per degree and covers a 360-degree range. When a ray hits an object (for example, an obstacle such as another vehicle), it returns the distance between the host vehicle and the object. When there is no obstacle, it returns the maximum sensing distance at every simulator step.

FIG. 5 is a flowchart showing a method of controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system according to an embodiment.

Referring to FIG. 5A, the method of controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system according to an embodiment may include: receiving measured sensor data and captured image data as input to a deep reinforcement learning algorithm (S110); determining, in the deep reinforcement learning algorithm, an action for vehicle control using the received sensor data and image data (S120); and controlling the vehicle by selecting a driver assistance system (DAS) according to the determined action (S130).

Referring to FIG. 5B, determining the action for vehicle control (S120) may include: refining the received sensor data and image data, respectively (S121); concatenating the refined sensor data and image data to form concatenated data (S122); feeding the concatenated data to a fully connected layer of the deep reinforcement learning algorithm to obtain Q values (S123); and determining the action according to the Q values (S124).

The embodiments provide a method of controlling a vehicle by applying artificial neural network and reinforcement learning techniques. Learning-based methods such as artificial intelligence learn from a wide range of experience without manual configuration by a person and thereby find a suitable control method. Because they learn from diverse experience, they can handle situations that are difficult for a person to specify in advance and can select better actions.

Moreover, rather than using reinforcement learning to control the steering wheel, brake, and accelerator directly, the embodiments select among driver assistance systems (DAS) that have already been developed to control the vehicle stably. Using the DAS in this way allows the vehicle to be controlled safely without causing discomfort to the driver.

Each step of the method of controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system according to an embodiment is described below.

FIG. 6 is a diagram schematically showing the structure of an apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system according to an embodiment.

FIG. 6 shows the structure of the apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system, including the structure of the multi-modal driving policy network. Two inputs are required: a raw grayscale camera image and LIDAR sensor data in the form of an array of distances in meters.

The method according to an embodiment can be described in more detail with reference to this apparatus. The apparatus may include an input unit, a deep reinforcement learning unit (600), and a vehicle control unit. In particular, the deep reinforcement learning unit (600) may include a sensor data refining unit (610), an image data refining unit (620), a concatenated data forming unit (630), fully connected layer units (640, 650), and an action determination unit (660).

In step S110, the input unit may receive measured sensor data (602) and captured image data (601) as input to the deep reinforcement learning algorithm. For example, the input unit may receive sensor data (602) measured by a LIDAR sensor mounted on the vehicle and image data (601) captured by a camera.

In step S120, the deep reinforcement learning unit (600) may determine an action (603) for vehicle control using the sensor data (602) and image data (601) received by the deep reinforcement learning algorithm.

More specifically, in step S121, the sensor data refining unit (610) refines the received sensor data (602), and the image data refining unit (620) refines the received image data (601). The sensor data refining unit (610) may refine the sensor data (602) using an LSTM (Long Short-Term Memory), and the image data refining unit (620) may refine the image data (601) using a CNN (Convolutional Neural Network).

In step S122, the concatenated data forming unit (630) may concatenate the refined sensor data (602) and image data (601) to form concatenated data.

In step S123, the fully connected layer units (640, 650) may feed the concatenated data to a fully connected layer of the deep reinforcement learning algorithm to obtain Q values.

In particular, depending on the deep reinforcement learning algorithm, the fully connected layer units (640, 650) may use a network in which the fully connected layer is split into two parts, a state value and an action advantage that evaluates the advantage of each action, and may aggregate the state value and the action advantage to obtain the Q values.

The method may further include training the deep reinforcement learning algorithm toward an optimal driving policy. The deep reinforcement learning unit (600) may design a reward based on the result of the action (603), find an optimal driving policy based on high-speed driving, driving along a collision-free trajectory, and avoiding unnecessary lane changes, and train the deep reinforcement learning algorithm with that driving policy.

In step S124, the action determination unit (660) may determine the action (603) according to the Q values.

In step S130, the vehicle control unit may control the vehicle by selecting a driver assistance system (DAS) according to the determined action (603). So that the vehicle can be driven in the longitudinal and lateral directions simultaneously, the vehicle control unit may provide five actions: increase the target speed, decrease the target speed, change lanes to the left, change lanes to the right, and maintain the current state.
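Steps S124 and S130 can be sketched as a greedy choice over five Q values followed by an update of the set points handed to the driver assistance systems. This is an illustration only: the set-point representation (`DasSetpoint`), the 5 km/h speed step, the lane indexing, and all function names are assumptions, and how each DAS actually receives its command is not specified here.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SPEED_UP = 0    # increase the target speed
    SPEED_DOWN = 1  # decrease the target speed
    LANE_LEFT = 2   # change lanes to the left
    LANE_RIGHT = 3  # change lanes to the right
    KEEP = 4        # maintain the current state


@dataclass(frozen=True)
class DasSetpoint:
    target_speed: float  # km/h, for the longitudinal assistance function
    target_lane: int     # 0 is the leftmost lane, for the lateral function


def select_action(q_values):
    """Greedy choice: the action whose Q value is largest."""
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return Action(best)


def apply_action(sp, action, speed_step=5.0, num_lanes=5):
    """Translate the chosen action into new set points for the DAS."""
    speed, lane = sp.target_speed, sp.target_lane
    if action is Action.SPEED_UP:
        speed += speed_step
    elif action is Action.SPEED_DOWN:
        speed = max(speed - speed_step, 0.0)
    elif action is Action.LANE_LEFT:
        lane = max(lane - 1, 0)
    elif action is Action.LANE_RIGHT:
        lane = min(lane + 1, num_lanes - 1)
    return DasSetpoint(speed, lane)
```

Because the policy only moves set points, the low-level steering, braking, and acceleration remain with the assistance systems, which is the division of labor the embodiments describe.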

In this way, information needed for perception, such as the sensor data (602) and camera images, is received as input by the deep reinforcement learning algorithm, which selects the driver assistance system (DAS) it judges most appropriate based on that information, and the vehicle is controlled accordingly. Because the DAS is in the loop, the vehicle can drive safely and stably.

By driving in diverse situations, the vehicle progressively learns a better way of driving. According to the embodiments, when applied to an autonomous vehicle, deep reinforcement learning determines which driver assistance system (DAS) is appropriate in each situation, and the vehicle is controlled using the DAS built into it.

According to the embodiments, a system that decides which driver assistance system to select based on a deep reinforcement learning algorithm is used to select the vehicle action (603) suited to each situation and to control the vehicle accordingly. This allows path planning and control of the autonomous vehicle to be performed stably. Furthermore, because the algorithm improves through learning, the embodiments can respond appropriately to driving environments in which a person cannot anticipate every situation.

Because the deep reinforcement learning algorithm proposed for learning the driving policy uses two different inputs at once, the convolutional layers of the original DQN structure can be modified to process multiple inputs.

For camera images, the image data can be refined using a CNN as in DQN, and the LIDAR sensor data can be refined using an LSTM. The refined image data and sensor data are concatenated, and the result is used as input to a fully connected layer to obtain the Q values that determine the action. In other words, the two results are fused and passed through a fully connected layer to obtain the final Q values and determine the action.

For Dueling DQN, the final fully connected layer can use a two-stream structure that computes the state value and the action advantage.

Table 2 shows the overall structure of the driving policy network and its hyperparameters.

[Table 2]

The optimizer of (Non-Patent Document 4) can be used to train the policy network. The other hyperparameters are as follows: the learning rate is 0.00025, the batch size is 32, the replay memory size is 100,000, and the number of skipped frames is 4.

The deep reinforcement learning algorithm is described in more detail below using an example.

For example, the CNN that processes the image data can have the same structure as the CNN of DQN. The image input is a stack of four 80x80 grayscale images. The simulator here assumes cameras at both the front and the rear, so an image of size 80x80x8 in total is fed to the CNN. The first convolutional hidden layer consists of 32 filters of size 8x8 with a stride of 4. The second consists of 64 filters of size 4x4 with a stride of 2. The third consists of 64 filters of size 3x3 with a stride of 1. ReLU can be used as the activation function in all convolutional hidden layers.
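The feature-map sizes implied by these three layers can be worked out with the usual convolution size formula. The calculation below assumes no padding ("valid" convolutions), which the text does not state; the function names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


def cnn_feature_shapes(height=80, width=80):
    """Shapes (height, width, channels) after each convolutional layer."""
    layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (filters, kernel, stride)
    shapes = []
    h, w = height, width
    for filters, kernel, stride in layers:
        h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
        shapes.append((h, w, filters))
    return shapes
```

Under that assumption the 80x80x8 input becomes 19x19x32, then 8x8x64, then 6x6x64, so the image branch would contribute 2,304 values to the fusion step.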

The sensor input covers a range of 90 degrees to each of the left and right, with one ray emitted per degree; a ray that hits an object returns the distance to that object, and a ray that hits nothing returns the maximum sensing distance. The resulting 180 distance values are stored for four time steps, as with the images, and fed to an RNN built from an LSTM. The output of the last LSTM cell is taken as the final output for the sensor data. The LSTM has 256 cell-state units.
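The sensor preprocessing described above (180 range values per scan, kept for four time steps) can be sketched as a small rolling buffer whose contents form the input sequence for the LSTM. The maximum range value, the padding of the first steps with empty scans, and the names `make_scan` and `ScanHistory` are assumptions made for illustration.

```python
from collections import deque

NUM_RAYS = 180   # one ray per degree, 90 degrees to each side
TIME_STEPS = 4   # scans kept for the LSTM input sequence


def make_scan(hits, max_range):
    """Build one range array; `hits` maps ray index to measured distance."""
    return [min(hits.get(i, max_range), max_range) for i in range(NUM_RAYS)]


class ScanHistory:
    """Rolling window of the most recent scans, oldest first."""

    def __init__(self, max_range):
        empty = [max_range] * NUM_RAYS  # assumed padding before real scans arrive
        self.frames = deque([empty] * TIME_STEPS, maxlen=TIME_STEPS)

    def push(self, scan):
        self.frames.append(list(scan))

    def as_sequence(self):
        """Return a TIME_STEPS x NUM_RAYS nested list."""
        return [list(frame) for frame in self.frames]
```

Each call to `push` drops the oldest scan, so the sequence always has four time steps of 180 values.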

After the image-processing result and the sensor-processing result are fused, a fully connected layer of 512 ReLU units is applied, and as many Q values are output as there are actions. The remaining parameters are a learning rate of 0.00025, a batch size of 32, a replay memory size of 100,000, and 4 skipped frames; the optimizer of (Non-Patent Document 4) can be used for optimization.
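For reference, the hyperparameters stated in the text can be gathered into a single configuration object. The values come from the description above; the class name `TrainingConfig` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float = 0.00025
    batch_size: int = 32
    replay_memory_size: int = 100_000
    frame_skip: int = 4        # number of skipped frames
    stacked_frames: int = 4    # images and scans kept per observation
    lstm_units: int = 256      # LSTM cell-state size for the sensor branch
    fc_units: int = 512        # ReLU units in the fully connected layer
    num_actions: int = 5       # one Q value per action
```

Keeping the values in one frozen object makes it harder for the three compared algorithms to drift apart in their settings by accident.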

FIG. 7 is a diagram showing an example of the average speed obtained with the algorithm for determining the driving policy according to an embodiment. FIG. 7 shows the average speed over five runs for the multi-input, image-input, and sensor-input configurations.

FIG. 8 is a diagram showing an example of the average number of lane changes obtained with the algorithm for determining the driving policy according to an embodiment. FIG. 8 shows the average number of lane changes over five runs for the multi-input, image-input, and sensor-input configurations.

FIG. 9 is a diagram showing an example of the average number of overtakes obtained with the algorithm for determining the driving policy according to an embodiment. FIG. 9 shows the average number of overtakes over five runs for the multi-input, image-input, and sensor-input configurations.

The graphs in FIGS. 7 to 9 show the average speed, the number of lane changes, and the number of overtaken vehicles as training proceeds. As training progresses, the average speed and the number of overtaken vehicles keep increasing, while the number of lane changes decreases. This shows that, in line with the stated goals, the vehicle changes lanes only when necessary and keeps its speed close to the maximum.

The proposed algorithm for determining the driving policy can be implemented using (Non-Patent Document 5); it was written in Python with the (Non-Patent Document 5) library. All simulations were run on an Nvidia GTX 1080Ti. For every algorithm, training began once 25,000 experiences had accumulated, at which point the replay memory was considered sufficiently filled. The four deep reinforcement learning algorithms used to learn the driving policy all improved on their initial average speed by learning how to choose well among the existing driver assistance systems (DAS).

Inspecting the learned driving policy in the simulation environment shows that when the vehicle ahead is slow and the adjacent lane is empty, the agent changes lanes and proceeds faster. Although a lane change in itself earns only a negative reward, the agent can be seen as accounting for the speed reward obtainable after the change and performing the lane change that maximizes the expected future reward.

In addition, to confirm the benefit of the multi-input architecture, which combines camera and LIDAR features through a CNN and an LSTM respectively, two additional policy networks can be implemented, one with camera input only and one with LIDAR input only.

Three network architectures for the different inputs (camera only, LIDAR only, and camera plus LIDAR) are compared below. As training progresses, the autonomous vehicle in every input configuration tends to overtake more vehicles and drive faster without unnecessary lane changes.

표 3은 카메라, LIDAR, 카메라 및 LIDAR의 서로 다른 입력에 따른 성과 비교를 나타낸다. Table 3 shows the comparison of performance according to different inputs of camera, LIDAR, camera and LIDAR.

[Table 3]

As the results in Table 3 show, the multi-input architecture using both camera and LIDAR achieves the best performance, with an average speed of 73.54 km/h and an average of 42.2 overtakes. However, the multi-input architecture also has the highest number of lane changes, with an average of 30.2.

The purpose of the proposed algorithm is to reduce the number of unnecessary lane changes, yet the multi-modal input yields the highest number of lane changes. With the LIDAR-only and camera-only architectures, however, there are cases in which the vehicle keeps following the preceding vehicle without changing lanes even when that vehicle is slow. The larger number of lane changes can therefore be regarded as reasonable for finding the optimal policy.

Safe autonomous driving has been actively developed over the past few years. One of the important topics in autonomous driving is finding the best driving policy, or supervisor, that provides safety and comfort to the driver. The driving policy must remain robust across a variety of traffic environments, but rule-based algorithms have difficulty coping with such varied situations. To solve this problem, the embodiments use a deep reinforcement learning algorithm, which learns by itself from diverse experience and can achieve excellent performance with high-dimensional inputs such as raw images.

As described above, the embodiments take as input not only camera images but also data simulating the sensors mainly used in actual autonomous vehicles, in order to improve perception accuracy, and can solve the problem of optimizing, through deep reinforcement learning, the behavior of an autonomous vehicle that drives on the basis of driver assistance systems (DAS) to ensure safety.

According to the embodiments, an autonomous vehicle trained with the deep reinforcement learning algorithm can successfully drive at high speed in a simulated highway scenario without unnecessary lane changes. Because the proposed driving policy network uses multi-modal input, it can drive the vehicle better than a single-input network in terms of average speed, number of lane changes, and number of overtakes.

The embodiments may be used for driving and controlling an autonomous vehicle on a highway. Because they determine the most appropriate action for the current state and perform control accordingly, they are also applicable to autonomous drones and robots.

The device described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method of controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system, the method comprising:
receiving measured sensor data and captured image data as input to a deep reinforcement learning algorithm;
determining an action for vehicle control using the sensor data and the image data input to the deep reinforcement learning algorithm; and
controlling the vehicle by selecting a driver assistance system (DAS) according to the determined action,
wherein the step of determining the action for vehicle control includes:
refining the input sensor data and the input image data, respectively;
forming connected data by connecting the refined sensor data and the refined image data;
inputting the connected data into a fully connected layer of the deep reinforcement learning algorithm to obtain a Q value; and
determining the action according to the Q value,
wherein the step of refining the sensor data and the image data, respectively, includes:
refining the sensor data using long short-term memory (LSTM); and
refining the image data using a convolutional neural network (CNN), and
wherein the deep reinforcement learning algorithm processes multiple inputs so as to use two different inputs simultaneously for learning the driving policy, connects the refined sensor data and the refined image data, and then passes the result through the fully connected layer to finally obtain the Q value and determine the action.

The method of claim 1,
wherein the step of receiving the measured sensor data and the captured image data as input to the deep reinforcement learning algorithm includes
receiving, as input to the deep reinforcement learning algorithm, the sensor data measured by a LIDAR sensor mounted on the vehicle and the image data captured by a camera.

(Deleted)

The method of claim 1,
wherein the step of inputting the connected data into the fully connected layer to obtain the Q value includes
using a network that splits the fully connected layer into two parts, a state value and an action advantage that evaluates the benefit of each action, and obtaining the Q value by combining the state value and the action advantage.

The method of claim 1,
wherein the step of determining the action for vehicle control
uses five actions (increasing the target speed, decreasing the target speed, changing lanes to the left, changing lanes to the right, and maintaining the current state) so that the vehicle can be driven in the longitudinal and lateral directions simultaneously.

The method of claim 1, further comprising
training the deep reinforcement learning algorithm toward an optimal driving policy.

The method of claim 7,
wherein the step of training the deep reinforcement learning algorithm toward the optimal driving policy includes
designing a reward according to the result of the action so as to find an optimal driving policy based on high-speed driving of the vehicle, driving on a collision-free trajectory, and avoiding unnecessary lane changes, and training the deep reinforcement learning algorithm with that driving policy.

A device for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system, the device comprising:
an input unit that receives measured sensor data and captured image data as input to a deep reinforcement learning algorithm;
a deep reinforcement learning unit that determines an action for vehicle control using the sensor data and the image data input to the deep reinforcement learning algorithm; and
a vehicle control unit that controls the vehicle by selecting a driver assistance system (DAS) according to the determined action,
wherein the deep reinforcement learning unit includes:
a sensor data refinement unit that refines the input sensor data;
an image data refinement unit that refines the input image data;
a connected data forming unit that connects the refined sensor data and the refined image data to form connected data;
a fully connected layer unit that inputs the connected data into a fully connected layer of the deep reinforcement learning algorithm to obtain a Q value; and
an action determination unit that determines the action according to the Q value,
wherein the sensor data refinement unit refines the sensor data using long short-term memory (LSTM),
wherein the image data refinement unit refines the image data using a convolutional neural network (CNN), and
wherein the deep reinforcement learning algorithm processes multiple inputs so as to use two different inputs simultaneously for learning the driving policy, connects the refined sensor data and the refined image data, and then passes the result through the fully connected layer to finally obtain the Q value and determine the action.

The device of claim 9,
wherein the input unit
receives, as input to the deep reinforcement learning algorithm, the sensor data measured by a LIDAR sensor mounted on the vehicle and the image data captured by a camera.

(Deleted)

The device of claim 9,
wherein the fully connected layer unit
uses a network that splits the fully connected layer into two parts, a state value and an action advantage that evaluates the benefit of each action, and obtains the Q value by combining the state value and the action advantage.

The device of claim 9,
wherein the vehicle control unit
uses five actions (increasing the target speed, decreasing the target speed, changing lanes to the left, changing lanes to the right, and maintaining the current state) so that the vehicle can be driven in the longitudinal and lateral directions simultaneously.

The device of claim 9,
wherein the deep reinforcement learning unit
designs a reward according to the result of the action so as to find an optimal driving policy based on high-speed driving of the vehicle, driving on a collision-free trajectory, and avoiding unnecessary lane changes, and trains the deep reinforcement learning algorithm with that driving policy.