KR102590791B1

KR102590791B1 - Method and apparatus of uncertainty-conditioned deep reinforcement learning

Info

Publication number: KR102590791B1
Application number: KR1020230045027A
Authority: KR
Inventors: 서승우; 김성우; 윤형석; 김찬
Original assignee: 대한민국(방위사업청장)
Priority date: 2023-04-05
Filing date: 2023-04-05
Publication date: 2023-10-30

Abstract

One embodiment of the present disclosure provides a deep reinforcement learning device and method that receives a state (s_t) as input and outputs a feature vector (h_t) for the state (s_t), estimates an epistemic uncertainty (u) for the state (s_t), and estimates an action (a_t) based on the feature vector (h_t) and the epistemic uncertainty (u), wherein the deep reinforcement learning device is trained based on a reward (r_t) for the action (a_t). Accordingly, the present invention checks whether an input state is one on which the deep reinforcement learning model has been well trained, and determines an appropriate action taking this into account.

Description

Uncertainty conditional deep reinforcement learning method and processing device thereof {METHOD AND APPARATUS OF UNCERTAINTY-CONDITIONED DEEP REINFORCEMENT LEARNING}

The present disclosure relates to an uncertainty-conditioned deep reinforcement learning method and a processing device therefor. More specifically, the present disclosure relates to a method, and a processing device therefor, for determining an action in a deep reinforcement learning model by considering uncertainty in addition to the current state.

Reinforcement learning learns which action is optimal to take in the current state. Each time an action is taken, a reward is given by the external environment, and learning proceeds in the direction that maximizes this reward.

A deep reinforcement learning (DRL) model using a deep neural network learns how to solve a task by learning a policy, expressed as a function that generates an action when the current state is given as input. However, existing deep reinforcement learning models are overconfident in the selected action even when the input state is far from the distribution of the states seen during training. For example, if an autonomous vehicle takes an overconfident action in a situation it was not trained on, this can lead to a wrong decision and can have fatal consequences for surrounding vehicles, pedestrians, and the autonomous vehicle itself. To address this, it is necessary to check whether the input state is one on which the deep reinforcement learning model has been well trained, and to determine the action taking this into account.

In short, existing methods do not provide a way to check whether the input state is one on which the deep reinforcement learning model has been well trained and to determine an action taking this into account. The present invention is intended to solve this problem.

One embodiment of the present disclosure is intended to check whether the input state is one on which the deep reinforcement learning model has been well trained, and to determine an appropriate action taking this into account.

In addition, one embodiment of the present disclosure is intended to provide a deep reinforcement learning model that determines an appropriate action by considering the uncertainty of the input state.

One embodiment of the present disclosure seeks to provide an uncertainty-conditioned deep reinforcement learning method and a processing device therefor.

One embodiment of the present disclosure seeks to provide a deep reinforcement learning device including: a feature extraction unit that receives a state (s_t) as input and outputs a feature vector (h_t) for the state (s_t); an uncertainty estimation unit that estimates an epistemic uncertainty (u) for the state (s_t); and an action estimation unit that estimates an action (a_t) based on the feature vector (h_t) and the epistemic uncertainty (u), wherein the deep reinforcement learning device is trained based on a reward (r_t) for the action (a_t).

In one embodiment, the feature extraction unit may include an encoder that outputs the feature vector (h_t) for the state (s_t).

In one embodiment, the uncertainty estimation unit may estimate the uncertainty for the state (s_t) using a Monte Carlo dropout method.

In one embodiment, the uncertainty estimation unit may include an uncertainty calculation unit that applies Monte Carlo dropout to the encoder and derives an uncertainty indicator as the variance of N Monte Carlo samples.

In one embodiment, the uncertainty estimation unit may include a scalar mapping unit that normalizes the uncertainty indicator, which is a vector of the same size as the feature vector output by the encoder, by the upper bound of each of its elements, and then takes a weighted average to output a scalar value (u).

In one embodiment, the action estimation unit may estimate the mean and variance of the action (a_t) based on the feature vector (h_t) and the epistemic uncertainty (u), and may provide an action for the current state by sampling from the distribution of the action (a_t) having that mean and variance.

In one embodiment, the action estimation unit may be trained so that a lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) is maximized, so that the action (a_t) varies with the degree of the epistemic uncertainty (u).

In one embodiment, the action estimation unit may be trained so that the sum of the soft policy improvement term and the lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) is maximized.

In one embodiment, the lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) may be calculated using data sampled from a replay buffer.
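The specific form of the mutual-information lower bound is not given in this part of the text. For orientation only, one standard variational lower bound (the Barber-Agakov bound) is shown here; it is an assumption and not necessarily the bound used in the disclosure. The auxiliary distribution q_φ is a hypothetical name, and the expectation would in practice be estimated from samples drawn from the replay buffer.

```latex
I(u; a_t) \;=\; H(u) - H(u \mid a_t)
\;\ge\; H(u) + \mathbb{E}_{p(u,\, a_t)}\!\left[\log q_\phi(u \mid a_t)\right]
```

The inequality holds for any choice of q_φ because the gap is an expected KL divergence between the true posterior p(u | a_t) and q_φ(u | a_t), which is non-negative.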

One embodiment of the present disclosure seeks to provide a deep reinforcement learning method including: a feature extraction step of receiving a state (s_t) as input and outputting a feature vector (h_t) for the state (s_t); an uncertainty estimation step of estimating an epistemic uncertainty (u) for the state (s_t); and an action estimation step of estimating an action (a_t) based on the feature vector (h_t) and the epistemic uncertainty (u), wherein the feature extraction step and the uncertainty estimation step are performed in parallel.

In one embodiment, in the feature extraction step, the feature vector (h_t) for the state (s_t) may be output using an encoder.

In one embodiment, in the uncertainty estimation step, the uncertainty for the state (s_t) may be estimated using a Monte Carlo dropout method.

In one embodiment, the uncertainty estimation step may apply Monte Carlo dropout to the encoder and derive an uncertainty indicator as the variance of N Monte Carlo samples.

In one embodiment, the uncertainty estimation step may normalize the uncertainty indicator, which is a vector of the same size as the feature vector output by the encoder, by the upper bound of each of its elements, and then take a weighted average to output a scalar value (u).

In one embodiment, the action estimation step may estimate the mean and variance of the action (a_t) based on the feature vector (h_t) and the epistemic uncertainty (u), and may provide an action for the current state by sampling from the distribution of the action (a_t) having that mean and variance.

In one embodiment, in the action estimation step, training may be performed so that a lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) is maximized, so that the action (a_t) varies with the degree of the epistemic uncertainty (u).

In one embodiment, in the action estimation step, training may be performed so that the sum of the soft policy improvement term and the lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) is maximized.

In one embodiment, the lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) may be calculated using data sampled from a replay buffer.

One embodiment of the present disclosure includes a program stored in a recording medium to cause a computer to execute the method according to an embodiment of the present disclosure.

One embodiment of the present disclosure includes a computer-readable recording medium on which a program for executing the method according to an embodiment of the present disclosure on a computer is recorded.

One embodiment of the present disclosure includes a computer-readable recording medium on which a database used in an embodiment of the present disclosure is recorded.

According to an embodiment of the present disclosure, an uncertainty-conditioned deep reinforcement learning method and a processing device therefor may be provided.

In addition, according to an embodiment of the present disclosure, a deep reinforcement learning model may be provided that checks whether the input state is one on which the model has been well trained and determines an appropriate action taking this into account.

In addition, according to an embodiment of the present disclosure, a deep reinforcement learning model may be provided that determines an appropriate action by considering the uncertainty of the input state.

In addition, according to an embodiment of the present disclosure, a deep reinforcement learning model may be provided that takes appropriate action in situations it was not trained on.

Figure 1 is a graph illustrating the difference between a deep reinforcement learning model according to an embodiment of the present disclosure and a model that does not consider uncertainty.
Figure 2 is a conceptual diagram illustrating a deep reinforcement learning model that considers uncertainty according to an embodiment of the present disclosure.
Figure 3 is a conceptual diagram illustrating a deep reinforcement learning model that considers uncertainty according to an embodiment of the present disclosure.
Figure 4 is a flowchart illustrating an uncertainty-conditioned deep reinforcement learning method according to an embodiment of the present disclosure.
Figure 5 is a conceptual diagram illustrating an experimental environment for evaluating the performance of an embodiment of the present disclosure.
Figures 6 to 9 are graphs showing the results of evaluating the performance of an embodiment of the present disclosure using the MuJoCo simulator.

To clarify the technical idea of the present disclosure, embodiments of the present disclosure are described in detail with reference to the attached drawings. In describing the present disclosure, a detailed description of a related known function or component is omitted where it is judged that it would unnecessarily obscure the gist of the present disclosure. Components having substantially the same functional configuration are given the same reference numbers and symbols wherever possible, even when they appear in different drawings. Where convenient for explanation, the device and the method are described together. The operations of the present disclosure need not be performed in the order described, and may be performed in parallel, selectively, or individually.

The terms used in the embodiments of the present disclosure were selected, as far as possible, from general terms currently in wide use, taking into account the functions of the present disclosure; however, they may vary depending on the intention of those skilled in the art, precedent, the emergence of new technology, and so on. In certain cases, some terms were arbitrarily selected by the applicant, in which case their meaning is described in detail in the description of the relevant embodiment. Therefore, the terms used in this specification should be defined based on their meaning and the overall content of the present disclosure, not simply on their names.

Throughout the present disclosure, singular expressions may include plural expressions unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the presence of a feature, number, step, operation, component, part, or combination thereof, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. That is, throughout the present disclosure, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

An expression such as "at least one of" modifies the entire list of elements and does not modify the elements of the list individually. For example, "at least one of A, B, and C" and "at least one of A, B, or C" refer to only A, only B, only C, both A and B, both B and C, both A and C, all of A, B, and C, or a combination thereof.

In addition, terms such as "... unit" and "... module" used in the present disclosure refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Throughout the present disclosure, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The expression "configured to (or set to)" used throughout the present disclosure may be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of," depending on the situation. The term "configured to (or set to)" does not necessarily mean only "specifically designed to" in hardware. Instead, in some situations, the expression "a system configured to" may mean that the system is "capable of" operating together with other devices or components. For example, the phrase "a processor configured (or set) to perform A, B, and C" may mean a dedicated processor for performing the corresponding operations (e.g., an embedded processor), or a general-purpose processor (e.g., a CPU or an application processor) capable of performing the corresponding operations by executing one or more software programs stored in memory.

Functions related to artificial intelligence according to the present disclosure may be operated through a processor and a memory. The processor may include a general-purpose processor such as a CPU, an AP, or a DSP (Digital Signal Processor), a graphics-dedicated processor such as a GPU or a VPU (Vision Processing Unit), or an artificial-intelligence-dedicated processor such as an NPU. The processor may control input data to be processed according to predefined operation rules or an artificial intelligence model stored in the memory. An artificial-intelligence-dedicated processor may be designed with a hardware structure specialized for processing a specific artificial intelligence model.

The predefined operation rules or artificial intelligence model are characterized by being created through learning. Here, being created through learning may mean that a basic artificial intelligence model is trained by a learning algorithm using a large amount of training data, thereby creating predefined operation rules or an artificial intelligence model set to perform a desired purpose. Such learning may be performed on the device itself on which the artificial intelligence according to the present disclosure runs, or through a separate server and/or system.

An artificial intelligence model may be composed of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values, and performs a neural network operation through an operation between the result of the previous layer and the plurality of weights. The weights of the neural network layers may be optimized by the training results of the artificial intelligence model. For example, the weights may be updated so that a loss value or cost value obtained from the artificial intelligence model is reduced or minimized during the learning process. The artificial neural network may include, for example, a CNN (Convolutional Neural Network), a DNN (Deep Neural Network), an RNN (Recurrent Neural Network), an RBM (Restricted Boltzmann Machine), a DBN (Deep Belief Network), a BRDNN (Bidirectional Recurrent Deep Neural Network), or Deep Q-Networks, but is not limited thereto.

Figure 1 is a graph illustrating the difference between a deep reinforcement learning model according to an embodiment of the present disclosure and a model that does not consider uncertainty.

Existing deep reinforcement learning models consider only the current state and do not consider the learning model's uncertainty about that state. The uncertainty considered in the present disclosure is model uncertainty, or epistemic uncertainty, which arises from a lack of knowledge about the model representing the data; data uncertainty, or aleatoric uncertainty, which arises from randomness inherent in the input, is not covered. Accordingly, the uncertainty referred to in the present disclosure means epistemic uncertainty.

In an existing deep reinforcement learning model, once the current state is determined, the corresponding action is determined regardless of the degree of uncertainty. Referring to Figure 1(a), once the current state (s_t) is determined, the action (a_t) is determined regardless of the uncertainty (u), so when the uncertainty (u) is high the selected action (a_t) may be an overconfident action. That is, it may be an action selected while the learning model has not sufficiently learned the state.

The deep reinforcement learning model according to an embodiment of the present disclosure receives the degree of uncertainty as input together with the current state, and determines an action for the current state according to the degree of uncertainty. Referring to Figure 1(b), even for the same state (s_t), the action (a_t) may differ depending on the degree of uncertainty (u). For example, if the current state (s_t) is a driving situation on a narrow road that has never been driven before, and the uncertainty (u) is judged to be high because the learning model for autonomous driving has not been sufficiently trained for this situation, the control action (a_t) of the autonomous vehicle may be chosen with a smaller value than when the uncertainty is low, so that the vehicle drives cautiously.

Figure 2 is a conceptual diagram illustrating a deep reinforcement learning model that considers uncertainty according to an embodiment of the present disclosure.

The deep reinforcement learning model according to an embodiment of the present disclosure may include a feature extraction unit 100, an uncertainty estimation unit 200, and an action estimation unit 300. In one embodiment, the current state (s_t) is input to the feature extraction unit 100 and the uncertainty estimation unit 200, and the action estimation unit 300 receives the outputs of the feature extraction unit 100 and the uncertainty estimation unit 200 as input and outputs an action (a_t) corresponding to the current state (s_t) and the uncertainty (u).

In one embodiment, the feature extraction unit 100 may receive the state (s_t) as input and output a feature vector (h_t) for that state. For example, the feature extraction unit 100 may include an encoder network.

In one embodiment, the uncertainty estimation unit 200 may estimate the epistemic uncertainty of the reinforcement learning model for the current state (s_t) using the Monte Carlo dropout method. For example, the uncertainty estimation unit 200 may normalize each element of an uncertainty vector having the same dimension as the feature vector (h_t) output by the feature extraction unit 100, and then take a weighted arithmetic mean to output the degree of uncertainty (u) as a scalar value.

In one embodiment, the action estimation unit 300 may receive the feature vector (h_t) output by the feature extraction unit 100 and the degree of uncertainty (u) output by the uncertainty estimation unit 200, and output the corresponding action (a_t). For example, the feature vector (h_t) and the degree of uncertainty (u) may be concatenated into a single vector and input to the action estimation unit 300, and the action estimation unit 300 may derive the mean and variance of the corresponding action (a_t) and finally estimate the action (a_t) from this distribution.
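The data flow of Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the functions `encoder`, `estimate_uncertainty`, and `action_head` are hypothetical hand-written stand-ins for units 100, 200, and 300 (which in the disclosure are learned networks), and a Gaussian action distribution is assumed because the text specifies only a mean and a variance.

```python
import math
import random


def encoder(state):
    # Hypothetical stand-in for feature extraction unit 100: a fixed linear map.
    return [0.5 * state[0] + 0.1 * state[1], -0.2 * state[0] + 0.3 * state[1]]


def estimate_uncertainty(state):
    # Hypothetical stand-in for uncertainty estimation unit 200; returns u in [0, 1].
    return min(1.0, abs(state[0]) / 10.0)


def action_head(policy_input):
    # Hypothetical stand-in for action estimation unit 300.
    # Maps the concatenated vector [h_t, u] to a (mean, variance) pair.
    h0, h1, u = policy_input
    mean = (h0 + h1) * (1.0 - u)  # illustrative only: smaller action when u is high
    variance = 0.01 + 0.1 * u
    return mean, variance


def select_action(state, rng):
    h_t = encoder(state)                  # feature vector h_t
    u = estimate_uncertainty(state)       # scalar uncertainty u
    policy_input = h_t + [u]              # concatenation of h_t and u
    mean, variance = action_head(policy_input)
    action = rng.gauss(mean, math.sqrt(variance))  # sample a_t from the distribution
    return action, mean, variance, policy_input


action, mean, variance, policy_input = select_action([1.0, 2.0], random.Random(0))
```

The point of the sketch is the wiring: the state feeds both the encoder and the uncertainty estimator, and the action head sees their concatenated outputs, so two inputs with the same features but different u can yield different actions.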

Figure 3 is a conceptual diagram illustrating a deep reinforcement learning model that considers uncertainty according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, the feature extraction unit 100 is the encoder 110 of a policy network. The encoder network 110 is trained to output the feature vector (h_t) from the input state (s_t), and can be expressed as [expression not reproduced in the source text].

According to an embodiment of the present disclosure, the uncertainty estimation unit 200 may include an uncertainty calculation unit 210 and a scalar mapping unit 220. In one embodiment, the uncertainty calculation unit 210 may estimate the epistemic uncertainty by applying dropout, which omits part of the neural network, to the encoder 110 of the trained policy network. For example, by applying Monte Carlo dropout to a neural network, an uncertainty indicator can be calculated as the variance of N Monte Carlo samples, as in the formula below (the related paper, Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in Proc. 33rd Int. Conf. Mach. Learn., 2016, vol. 48, pp. 1050-1059, is incorporated into the present disclosure by reference).

(1)

In Equation (1), each sample is the i-th output, for an input x, of the neural network to which dropout is applied, and the variance is taken about the mean of these outputs.
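As an illustration only, the description above is consistent with an element-wise sample variance of the following form. The symbols f_i, \bar{f} and U(x), and the choice of the 1/N normalization rather than 1/(N-1), are assumptions and are not taken from the original formula.

```latex
U(x) = \frac{1}{N}\sum_{i=1}^{N}\bigl(f_i(x) - \bar{f}(x)\bigr)^{2},
\qquad
\bar{f}(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x)
```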

In one embodiment, the uncertainty calculation unit 210 may estimate the uncertainty by applying Monte-Carlo dropout to the encoder 110 of the trained policy network, as in Equation (2) below.

(2)

In Equation (2), each sample is the i-th output of the encoder 110 to which dropout is applied, and the variance is taken about the mean of these outputs.
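The Monte-Carlo dropout estimate described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `make_toy_encoder` is a hypothetical one-layer stand-in for the encoder 110, dropout is applied to its input for simplicity, and the population variance (1/N) is used. None of these choices are specified by the disclosure.

```python
import numpy as np


def mc_dropout_uncertainty(encoder, state, n_samples=20, seed=0):
    """Per-element variance of encoder outputs over n_samples stochastic passes."""
    rng = np.random.default_rng(seed)
    # Each call draws a fresh dropout mask, so repeated outputs differ.
    outputs = np.stack([encoder(state, rng) for _ in range(n_samples)])
    # Population variance (1/N) along the sample axis; same shape as h_t.
    return outputs.var(axis=0)


def make_toy_encoder(in_dim, out_dim, drop_prob=0.2, seed=1):
    """Hypothetical one-layer tanh encoder with dropout (illustration only)."""
    weights = np.random.default_rng(seed).normal(size=(in_dim, out_dim))

    def encoder(state, rng):
        keep = rng.random(in_dim) >= drop_prob       # Bernoulli keep mask
        dropped = state * keep / (1.0 - drop_prob)   # inverted-dropout scaling
        return np.tanh(dropped @ weights)

    return encoder


encoder = make_toy_encoder(in_dim=8, out_dim=4)
U = mc_dropout_uncertainty(encoder, np.ones(8))      # uncertainty vector U
U_zero = mc_dropout_uncertainty(encoder, np.zeros(8))
```

The result `U` has the same dimension as the encoder output, which is why a separate scalar mapping is needed before it can be used as a single degree of uncertainty.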

In one embodiment, because the encoder network 110 outputs the feature vector (h_t), the output of the uncertainty calculation unit 210 is a vector (U) of the same size as the feature vector (h_t), and the scalar mapping unit 220 maps this output to a scalar value (u). That is, what the uncertainty estimation unit 200 seeks to estimate is not the uncertainty of each element of the given state (s_t), to which the elements of the vector (U) would correspond, but the uncertainty of the given state (s_t) as a whole. The elements of the vector (U) are therefore combined by a weighted average and mapped to the scalar value (u).

For example, the scalar mapping unit 220 may normalize the scalar value (u) representing the degree of uncertainty to the range [0, 1]. To this end, each element of the uncertainty vector (U) is normalized by its upper bound, and a weighted average is taken to obtain the uncertainty scalar value (u). For this purpose, during training of the encoder network 110, the minimum and maximum outputs of each element (hⁱ) of the feature vector (h_t) may be collected as follows, being updated whenever a new minimum or maximum is found (in the equations below, hⁱ denotes the i-th element of the vector h_t).

(3)

(4)
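The running collection of per-element minima and maxima described above can be sketched as follows. The class name `RunningMinMax` and its attributes `h_min` and `h_max` are hypothetical names chosen for illustration.

```python
import numpy as np


class RunningMinMax:
    """Tracks the per-element minimum and maximum of the feature vectors seen so far."""

    def __init__(self, dim):
        self.h_min = np.full(dim, np.inf)
        self.h_max = np.full(dim, -np.inf)

    def update(self, h):
        h = np.asarray(h, dtype=float)
        # An element is overwritten only when a new extreme is observed.
        self.h_min = np.minimum(self.h_min, h)
        self.h_max = np.maximum(self.h_max, h)


tracker = RunningMinMax(dim=3)
for h in ([0.1, -0.5, 2.0], [0.4, -0.2, 1.0], [-0.3, 0.0, 3.0]):
    tracker.update(h)
```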

Using the collected minimum and maximum, and assuming that the output follows a uniform distribution, the upper bound of the uncertainty for each element (hⁱ) of the feature vector (h_t) may be calculated as the variance of that uniform distribution, as follows.

(5)
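For reference, the variance of a uniform distribution on an interval [a, b] is (b - a)^2 / 12. Applied per element, this suggests an upper bound of the following form, where h^i_min, h^i_max and U^i_max are assumed names for the collected minimum, the collected maximum and the resulting upper bound.

```latex
\operatorname{Var}[X] = \frac{(b-a)^{2}}{12}
\quad \text{for } X \sim \mathcal{U}(a, b)
\qquad\Longrightarrow\qquad
U^{i}_{\max} = \frac{\bigl(h^{i}_{\max} - h^{i}_{\min}\bigr)^{2}}{12}
```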

The upper bound of the uncertainty according to Equation (5) converges early in training, when the neural network model has high uncertainty for all data, and therefore does not interfere with the learning process. In one embodiment, gradients for optimizing the encoder network 110 may not be computed during the uncertainty calculation.

Finally, the scalar value (u) representing the degree of uncertainty may be calculated as follows, by normalizing the uncertainty value of each element by its upper bound and taking a weighted average over the elements.

(6)

In Equation (6), the weighted average runs over d elements, where d is the dimension of the uncertainty vector (U), which equals that of the feature vector (h_t). Because Equation (6) uses a weighted average, elements with high uncertainty are prevented from being offset by elements with low uncertainty.
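The scalar mapping can be sketched as below. The weights are an assumption: the original weight definition is not recoverable from this text, so this sketch weights each element in proportion to its own normalized uncertainty, which is one way to keep highly uncertain elements from being averaged away. The function name `scalar_uncertainty` and the `eps` guard are likewise illustrative.

```python
import numpy as np


def scalar_uncertainty(U, h_min, h_max, eps=1e-12):
    """Map an uncertainty vector U to a scalar u in [0, 1]."""
    U = np.asarray(U, dtype=float)
    span = np.asarray(h_max, dtype=float) - np.asarray(h_min, dtype=float)
    u_max = span ** 2 / 12.0                      # variance of a uniform distribution
    normalized = np.clip(U / (u_max + eps), 0.0, 1.0)
    total = normalized.sum()
    if total <= eps:
        return 0.0                                # no element is uncertain
    # ASSUMPTION: weights proportional to the normalized uncertainty itself.
    weights = normalized / total
    return float(np.sum(weights * normalized))


# One highly uncertain element among three; a plain mean would give about 0.3.
u = scalar_uncertainty(U=[0.0, 0.0, 0.3], h_min=[0.0, 0.0, 0.0], h_max=[2.0, 2.0, 2.0])
```

With these assumed weights, a single highly uncertain element dominates the result instead of being diluted by the two certain ones.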

According to an embodiment of the present disclosure, the action estimation unit 300 receives the feature vector (h_t) output by the feature extraction unit 100 and the uncertainty scalar value (u_t) output by the uncertainty estimation unit 200, and estimates the action (a_t) using deep reinforcement learning. In one embodiment, the feature extraction unit 100 and the uncertainty estimation unit 200 may compute the feature vector (h_t) and the uncertainty scalar value (u_t), respectively, in parallel and provide them as inputs to the action estimation unit 300.

In one embodiment, the action estimation unit 300 may use a Q-function, which is the action-value function for reinforcement learning, an uncertainty-aware policy, and a neural information measure. For example, the neural information measure may be defined as below, as a lower bound on the mutual information, which represents the degree of mutual dependence between uncertainty and action.

(7)
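One well-known lower bound of this kind is the Donsker-Varadhan bound used in mutual information neural estimation (MINE). Whether Equation (7) takes exactly this form is an assumption, and the statistics network T_\omega with parameter \omega is an illustrative name.

```latex
I(U; A) \;\ge\; I_{\omega}(U; A)
= \mathbb{E}_{P_{UA}}\!\left[T_{\omega}(u, a)\right]
- \log \mathbb{E}_{P_{U}\otimes P_{A}}\!\left[e^{T_{\omega}(u, a)}\right]
```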

In one embodiment, the feature vector (h_t), which is the encoded state output by the feature extraction unit 100, and the scalar value (u_t) representing the degree of epistemic uncertainty are concatenated and input to the action estimation unit 300. Using reinforcement learning, the action estimation unit 300 may provide the mean and variance of an action distribution, and the output action (a_t) is finally sampled from this distribution.
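The concatenate-then-sample flow can be sketched as follows. The linear heads `W_mean` and `W_log_std`, the Gaussian form of the action distribution, and the clipping range of the log standard deviation are assumptions for illustration. A trained policy network would take the place of the random weights.

```python
import numpy as np


def sample_action(h_t, u_t, W_mean, W_log_std, rng):
    """Concatenate h_t and u_t, then sample an action from a Gaussian head."""
    x = np.concatenate([h_t, [u_t]])             # single input vector [h_t, u_t]
    mean = x @ W_mean
    log_std = np.clip(x @ W_log_std, -5.0, 2.0)  # keep the variance in a sane range
    std = np.exp(log_std)
    action = mean + std * rng.standard_normal(mean.shape)
    return action, mean, std ** 2


rng = np.random.default_rng(0)
h_t = rng.normal(size=4)                         # stand-in feature vector
W_mean = rng.normal(size=(5, 2))                 # hypothetical, untrained weights
W_log_std = rng.normal(size=(5, 2))
action, mean, variance = sample_action(h_t, 0.7, W_mean, W_log_std, rng)
```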

In one embodiment, the Q-function of the action estimation unit 300 may estimate the Q-value, that is, the action value, for a given situation (s_t, u_t, a_t). The Q-function may be trained using soft policy evaluation, in the same manner as in the reinforcement learning algorithm SAC (Soft Actor-Critic).
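In standard SAC, soft policy evaluation regresses the Q-function toward a target that adds an entropy term to the bootstrapped value. The sketch below shows that target. The function name and the default values of `gamma` and `alpha` are illustrative, and it is assumed that `next_q` and `next_log_prob` were evaluated at the next state, the next uncertainty value and an action sampled from the current policy.

```python
def soft_q_target(reward, done, next_q, next_log_prob, gamma=0.99, alpha=0.2):
    """Soft Bellman target: r + gamma * (1 - done) * (Q' - alpha * log pi')."""
    soft_value = next_q - alpha * next_log_prob   # entropy-regularized value
    return reward + gamma * (1.0 - done) * soft_value


target = soft_q_target(reward=1.0, done=0.0, next_q=2.0, next_log_prob=-1.0,
                       gamma=0.5, alpha=0.1)
terminal_target = soft_q_target(reward=1.0, done=1.0, next_q=2.0, next_log_prob=-1.0)
```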

In one embodiment, the policy function of the action estimation unit 300 is trained to maximize the basic soft policy improvement objective and, at the same time, to maximize the neural information measure so that the policy is differentiated according to the degree of uncertainty. For example, an extended objective function for training the policy function may be constructed as follows.

(8)

Here, the neural information measure may be computed as follows using data ((s⁽¹⁾, u⁽¹⁾), ..., (s^(b), u^(b))) ~ D sampled from the replay buffer.

(9)

In Equation (9), b is the number of sampled data, and the randomly sampled degree of uncertainty approximates data sampled from the marginal distribution P_U.
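A batch estimate of such a lower bound can be sketched as follows, assuming the Donsker-Varadhan form used by MINE. Shuffling the uncertainty values within the batch is one common way to approximate samples from the marginal P_U. The critic `T`, the shuffling strategy and the scalar actions are assumptions for illustration and are not taken from Equation (9).

```python
import numpy as np


def mi_lower_bound(T, u, a, rng):
    """Donsker-Varadhan estimate: mean T(u, a) - log mean exp T(u_shuffled, a)."""
    u = np.asarray(u, dtype=float)
    a = np.asarray(a, dtype=float)
    joint = T(u, a)                       # pairs drawn together from the batch
    u_shuffled = rng.permutation(u)       # breaks the pairing: approximates P_U x P_A
    marginal = T(u_shuffled, a)
    m = marginal.max()                    # log-mean-exp computed stably
    log_mean_exp = m + np.log(np.mean(np.exp(marginal - m)))
    return float(joint.mean() - log_mean_exp)


rng = np.random.default_rng(0)
u_batch = rng.random(64)                  # degrees of uncertainty in [0, 1]
a_batch = 2.0 * u_batch - 1.0             # toy actions that depend on u
bound = mi_lower_bound(lambda u, a: u * a, u_batch, a_batch, rng)
```

In training, the critic would be a learned network, and the bound would be maximized jointly with the policy objective.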

By maximizing the objective function of Equation (8), the action estimation unit 300 may be trained both to improve the policy so that it provides actions maximizing the expected cumulative reward and to differentiate the behavior of the policy according to the degree of uncertainty.

Figure 4 is a flowchart illustrating an uncertainty-conditioned deep reinforcement learning method according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, the uncertainty-conditioned deep reinforcement learning method may include an initialization step (S100), a feature extraction step (S200), an uncertainty estimation step (S300), an action estimation step (S400), an action execution and reward confirmation step (S500), a replay buffer storage step (S600), and a parameter update step (S700).

According to one embodiment, in the initialization step (S100), the parameters and the replay buffer D are initialized.

According to one embodiment, in the feature extraction step (S200), the current state (s_t) is received as input and a feature vector (h_t) for that state is output. For example, the feature extraction step (S200) may be performed using an encoder network.

According to one embodiment, in the uncertainty estimation step (S300), the epistemic uncertainty of the reinforcement learning model for the current state (s_t) may be estimated using the Monte-Carlo dropout method. For example, in the uncertainty estimation step (S300), the elements of an uncertainty vector having the same dimension as the feature vector (h_t) may be normalized and then combined by a weighted arithmetic mean, so that the degree of uncertainty (u) is output as a scalar value.

According to one embodiment, the feature extraction step (S200) and the uncertainty estimation step (S300) may be performed in parallel. In addition, gradients for optimizing the encoder network 110 may not be computed in the feature extraction step (S200) and the uncertainty estimation step (S300).

According to one embodiment, in the action estimation step (S400), the feature vector (h_t) derived in the feature extraction step (S200) and the degree of uncertainty (u) derived in the uncertainty estimation step (S300) are received as input, and the corresponding action (a_t) is output. For example, the feature vector (h_t) and the degree of uncertainty (u) may be concatenated into a single vector and input to a neural network trained by reinforcement learning, which derives the mean and variance of a distribution of high-reward actions (a_t), and the action (a_t) is finally estimated from this distribution.

According to one embodiment, in the action execution and reward confirmation step (S500), the action (a_t) output in the action estimation step (S400) is executed, and the resulting reward (r_t) and new state (s_{t+1}) are observed.

According to one embodiment, in the replay buffer storage step (S600), the state (s_t), the uncertainty (u_t), the action (a_t), the reward (r_t) and the new state (s_{t+1}) may be stored in the replay buffer D.

According to one embodiment, in the parameter update step (S700), data are sampled from the replay buffer, gradients for the parameter update are computed, and the parameters are updated using these gradients. In one embodiment, the parameter update step (S700) may be performed repeatedly for each of the sampled data.
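Steps S100 to S700 can be read as the interaction loop sketched below. All callables (`env_reset`, `env_step`, `encode`, `estimate_uncertainty`, `act`, `update`) are hypothetical placeholders for the environment and the trained components. The toy environment and the batch size are chosen only so that the sketch runs on its own.

```python
import random
from collections import deque


def run_episode(env_reset, env_step, encode, estimate_uncertainty, act, update,
                buffer, batch_size=4, max_steps=50):
    """One episode of the S200-S700 loop; the buffer is initialized by the caller (S100)."""
    s = env_reset()
    for _ in range(max_steps):
        h = encode(s)                          # S200: feature extraction
        u = estimate_uncertainty(s)            # S300: uncertainty estimation
        a = act(h, u)                          # S400: action estimation
        s_next, r, done = env_step(a)          # S500: execute action, observe reward
        buffer.append((s, u, a, r, s_next))    # S600: store the transition in D
        if len(buffer) >= batch_size:          # S700: update from a sampled batch
            update(random.sample(list(buffer), batch_size))
        s = s_next
        if done:
            break
    return buffer


# Toy stand-ins: a counter environment that terminates after 10 steps.
clock = {"t": 0}


def env_reset():
    clock["t"] = 0
    return 0.0


def env_step(action):
    clock["t"] += 1
    return float(clock["t"]), 1.0, clock["t"] >= 10


buffer = deque(maxlen=1000)                    # S100: initialize replay buffer D
batches = []                                   # records each batch passed to update
run_episode(env_reset, env_step,
            encode=lambda s: [s],
            estimate_uncertainty=lambda s: 0.5,
            act=lambda h, u: 0.0,
            update=batches.append,
            buffer=buffer)
```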

Figure 5 is a conceptual diagram illustrating the experimental environments used to evaluate the performance of an embodiment of the present disclosure.

To compare the performance of the present disclosure with that of existing methods, the MuJoCo physics simulator and the CARLA simulator, an open-source simulator for autonomous driving research, were used.

Referring to Figure 5(a), four environments (Ant, Hopper, Humanoid and Walker2D) were used in the MuJoCo simulator. In the training phase, each environment used its original task set without modification of states or rewards. In the test phase, random noise sampled from a multivariate normal distribution was added to the state. The added noise is assumed to increase the epistemic uncertainty in the experiment.
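The test-time perturbation can be sketched as below. The zero mean, the isotropic covariance and the `noise_scale` parameter are assumptions, since the text states only that the noise is drawn from a multivariate normal distribution.

```python
import numpy as np


def perturb_state(state, noise_scale, rng):
    """Add zero-mean multivariate normal noise to a state vector (test phase only)."""
    state = np.asarray(state, dtype=float)
    cov = (noise_scale ** 2) * np.eye(state.size)   # ASSUMPTION: isotropic covariance
    noise = rng.multivariate_normal(np.zeros(state.size), cov)
    return state + noise


rng = np.random.default_rng(0)
clean_state = np.zeros(5)
noisy_state = perturb_state(clean_state, noise_scale=0.1, rng=rng)
```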

Referring to Figures 5(b) to 5(e), the CARLA environment includes a central roundabout scenario that the agent must pass through (Figure 5(b)), a cyclist scenario in which all vehicles are replaced by cyclists (Figure 5(c)), and a random-spawn scenario in which the agent is spawned at a location it has never seen (Figure 5(d)). A bird's-eye-view semantic image (Figure 5(e)) is given as the observation. The bird's-eye-view semantic image provided by the CARLA environment is encoded into a latent vector that represents the state of the environment. With the latent vector given as the state, steering is configured to guarantee a minimum driving capability when the full route is given, and throttle and brake are computed as the action.

Figures 6 to 9 are graphs showing the results of evaluating the performance of an embodiment of the present disclosure using the MuJoCo simulator.

In the MuJoCo simulator, the following were compared: SAC, a deep reinforcement learning method that does not consider uncertainty; SAC-U, which is SAC with the degree of uncertainty u as an additional input; the method according to an embodiment of the present disclosure (UNICON); and a model (UNICON-V) obtained by removing the scalar mapping unit 220, which corresponds to the mapping function, from an embodiment of the present disclosure.

Referring to Figures 6 to 9, it can be seen that an embodiment of the present disclosure shows less performance degradation overall and a higher average return.

In the CARLA environment, SAC, a deep reinforcement learning method that does not consider uncertainty, SAC-U, which is SAC with the degree of uncertainty u as an additional input, and the method according to an embodiment of the present disclosure (denoted UNICON) were compared using success rate and collision rate as performance metrics.

Referring to Tables 1 to 3 below, it can be seen that the method according to an embodiment of the present disclosure has the highest success rate and the lowest collision rate in all CARLA scenarios.

<Table 1>

<Table 2>

<Table 3>

In addition, the paper for which the exception to loss of novelty is claimed [C. Kim, J. K. Cho, H. S. Yoon, S. W. Seo, and S. W. Kim, "UNICON: Uncertainty-Conditioned Policy for Robust Behavior in Unfamiliar Scenarios," IEEE Robotics and Automation Letters, Vol. 7, No. 4, pp. 9099-9106.] is incorporated into the present disclosure by reference.

An embodiment of the present disclosure may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media. A computer-readable medium may also include both computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. Communication media typically contain computer-readable instructions, data structures or program modules, and include any information delivery medium.

The foregoing description of the present disclosure is provided for illustration, and a person of ordinary skill in the art to which the present disclosure pertains will understand that it can easily be modified into other specific forms without changing its technical idea or essential features.

The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present disclosure is defined by the claims set out below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present disclosure.

Claims

A deep reinforcement learning device comprising:
a feature extraction unit that receives a state (s_t) as input and outputs a feature vector (h_t) for the state (s_t);
an uncertainty estimation unit that estimates an epistemic uncertainty (u) for the state (s_t); and
an action estimation unit that estimates an action (a_t) based on the feature vector (h_t) and the epistemic uncertainty (u),
wherein the deep reinforcement learning device is trained based on a reward (r_t) for the action (a_t), and
wherein the feature extraction unit includes an encoder that outputs the feature vector (h_t) for the state (s_t).

(Deleted)

The deep reinforcement learning device of claim 1, wherein the uncertainty estimation unit
estimates the uncertainty for the state (s_t) using a Monte-Carlo dropout method.

The deep reinforcement learning device of claim 1, wherein the uncertainty estimation unit
includes an uncertainty calculation unit that derives an uncertainty indicator by applying Monte-Carlo dropout to the encoder.

The deep reinforcement learning device of claim 4, wherein the uncertainty estimation unit
includes a scalar mapping unit that normalizes each element of the uncertainty indicator, which is a vector of the same size as the feature vector output from the encoder, by the upper bound of that element, and then outputs a scalar value (u) by weighted averaging.

The deep reinforcement learning device of claim 1, wherein the action estimation unit
estimates a mean and a variance of the action (a_t) based on the feature vector (h_t) and the epistemic uncertainty (u), and provides an action for the current state by sampling from the distribution of the action (a_t) having the mean and the variance.

The deep reinforcement learning device of claim 1, wherein the action estimation unit
is trained such that a lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) is maximized, so that the action (a_t) varies according to the degree of the epistemic uncertainty (u).

The deep reinforcement learning device of claim 1, wherein the action estimation unit
is trained such that the sum of a soft policy improvement objective and a lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) is maximized.

The deep reinforcement learning device of claim 7, wherein
the lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t)
is calculated using data sampled from a replay buffer.

A deep reinforcement learning method comprising:
a feature extraction step of receiving a state (s_t) as input and outputting a feature vector (h_t) for the state (s_t);
an uncertainty estimation step of estimating an epistemic uncertainty (u) for the state (s_t); and
an action estimation step of estimating an action (a_t) based on the feature vector (h_t) and the epistemic uncertainty (u),
wherein the feature extraction step and the uncertainty estimation step are performed in parallel, and
wherein, in the feature extraction step, the feature vector (h_t) for the state (s_t) is output using an encoder.

(Deleted)

The deep reinforcement learning method of claim 10, wherein, in the uncertainty estimation step,
the uncertainty for the state (s_t) is estimated using a Monte-Carlo dropout method.

The deep reinforcement learning method of claim 10, wherein, in the uncertainty estimation step,
an uncertainty indicator is derived by applying Monte-Carlo dropout to the encoder.

The deep reinforcement learning method of claim 13, wherein, in the uncertainty estimation step,
each element of the uncertainty indicator, which is a vector of the same size as the feature vector output from the encoder, is normalized by the upper bound of that element, and a scalar value (u) is then output by weighted averaging.

The deep reinforcement learning method of claim 10, wherein, in the action estimation step,
a mean and a variance of the action (a_t) are estimated based on the feature vector (h_t) and the epistemic uncertainty (u), and an action for the current state is provided by sampling from the distribution of the action (a_t) having the mean and the variance.

The deep reinforcement learning method of claim 10, wherein, in the action estimation step,
training is performed such that a lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) is maximized, so that the action (a_t) varies according to the degree of the epistemic uncertainty (u).

The deep reinforcement learning method of claim 10, wherein, in the action estimation step,
training is performed such that the sum of a soft policy improvement objective and a lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t) is maximized.

The deep reinforcement learning method of claim 16, wherein
the lower bound of the mutual information between the epistemic uncertainty (u) and the action (a_t)
is calculated using data sampled from a replay buffer.

A program stored in a computer-readable recording medium for executing, on a computer, the method of any one of claims 10 and 12 to 18.