KR20220162096A

KR20220162096A - Deep Neural Network Structure for Inducing Rational Reinforcement Learning Agent Behavior

Info

Publication number: KR20220162096A
Application number: KR1020220067133A
Authority: KR
Inventors: 최호진; 황예찬; 구본홍; 윤성열; 최형균; 이성후; 심현우; 김보라; 이정욱; 남제현; 원준희
Original assignee: 한국과학기술원; 주식회사 네비웍스
Priority date: 2021-05-31
Filing date: 2022-05-31
Publication date: 2022-12-07

Abstract

During the execution phase of a reinforcement learning agent of a deep neural network, an action induced by a user is provided as input to the agent. For every action other than the induced action, the agent calculates a final value-function value by subtracting, from the output of the value function, a parameter value that reflects the coerciveness of the induced action. The final action to be performed is determined using the calculated final value-function values. The parameter value can be changed to adjust the degree of coercion of the induced action. The present invention can respond effectively when a user of a reinforcement learning agent wants to induce or command the agent to perform a specific action, departing from the greedy decision method that selects the action with the highest expected cumulative reward among the actions available in the current state.

Description

Method for Inducing Rational Behavioral Decisions of Reinforcement Learning Agents of Deep Neural Networks {Deep Neural Network Structure for Inducing Rational Reinforcement Learning Agent Behavior}

The present invention relates to a reinforcement learning agent in a deep neural network and, more particularly, to a method for inducing action decisions in a reinforcement learning agent.

Research on deep reinforcement learning, which combines reinforcement learning with deep learning, has been applied not only to games such as Atari games, Go, and Dota 2 but also to fields such as computer vision, intelligent robotics, and natural language processing, and it continues to advance very successfully. The detailed structure of a reinforcement learning model depends on its goal, but all such models learn to make sequential decisions that maximize the expected cumulative reward in an environment that can be expressed as a Markov decision process (MDP).

FIG. 1 is a conceptual diagram of a general reinforcement learning model that learns from rewards. As shown, the agent determines and performs the next action (A_t) using a policy, which is a function that decides the action to perform when information about the state (S_t) is given as input. Based on the agent's current state and the action taken, the environment returns a reward (R_{t+1}) together with the next state (S_{t+1}). Using this reward signal, the agent evaluates its previous action and learns by adjusting the policy little by little. A well-trained agent will therefore have a policy that, for every possible state, returns an action yielding a high cumulative reward.

Even after training is complete, the agent decides which action to perform in a given state based on its policy, much as it did during training. The action decision algorithm of the trained agent nevertheless differs from the one used during training. During training, an action decision algorithm such as the epsilon-greedy algorithm is often adopted: with a certain probability the agent performs a random action instead of the action with the highest expected reward, so that it can experience trial and error in as many situations as possible. After training, trial and error is no longer needed, so the agent always greedily chooses the action with the highest expected reward. In the general case, therefore, a trained reinforcement learning agent always selects and performs the action with the highest expected cumulative reward among the actions available in the current state.
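The difference between the two decision rules described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the sample value estimates are hypothetical and do not come from the patent.

```python
import random

def greedy_action(q_values):
    """Execution phase: return the index of the action with the highest estimate."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Training phase: explore uniformly at random with probability epsilon,
    otherwise fall back to the greedy choice."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)

q_values = [0.2, 1.5, 0.7]  # hypothetical value estimates for three actions
print(greedy_action(q_values))                # -> 1
print(epsilon_greedy_action(q_values, 0.0))   # -> 1 (epsilon = 0 never explores)
```

With `epsilon = 0` the two rules coincide, which is the behavior the text attributes to a trained agent.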

However, a user may sometimes want to induce or command a reinforcement learning agent to perform a specific action that departs from this greedy decision method. Examples include changing the driving direction of an autonomous vehicle trained with reinforcement learning, or adjusting the play style or behavior of a game agent.

1. "Playing Atari with Deep Reinforcement Learning," in NIPS Deep Learning Workshop, Mnih et al., 2013.

An object of the present invention is to provide a method for inducing rational action decisions of a reinforcement learning agent of a deep neural network, by which an action requested by a user can be rationally induced during the execution phase of the agent.

The method for inducing a rational action decision of an agent includes: providing an action induced by a user as an input to a reinforcement learning agent of a deep neural network during the execution phase of the agent; calculating, in the reinforcement learning agent, as a final value-function value, the value obtained by subtracting a parameter value reflecting the coerciveness of the induced action from the value-function output of every action other than the induced action; and determining, in the reinforcement learning agent, the final action to be performed using the calculated final value-function values.

In an exemplary embodiment, the reinforcement learning agent may use the following expression as the modified value function q' for determining the final action, so that the final action is determined by considering the induced action together with the current observation and situation.

\[ q'(s, a_j, a_i) = \begin{cases} q(s, a_j), & j = i \\ q(s, a_j) - \lambda, & j \neq i \end{cases} \]

Here, a and s denote the action and the current situation, respectively; λ denotes the parameter; i and j are indices of the action a; and q denotes the value function that approximates the expected reward obtained by performing a given action in the current situation.

In an exemplary embodiment, the method for inducing a rational action decision of the reinforcement learning agent of the deep neural network may further include changing the parameter value to adjust the degree of coercion of the induced action.

In an exemplary embodiment, changing the parameter value may include changing the value of the parameter (λ) to a large value, which lowers the value-function outputs of all actions other than the induced action (a_i) and thereby makes the reinforcement learning agent more likely to select the induced action (a_i) as the final action.

In an exemplary embodiment, the deep neural network may be configured so that its last layer has as many nodes as the number of actions the reinforcement learning agent can perform, and each node may learn a value function q that approximates the expected reward obtained by performing a specific action in the current situation.

In an exemplary embodiment, the induced action may include a plurality of actions.

The present invention can respond effectively when a user of a reinforcement learning agent wants to induce or command the agent to perform a specific action, departing from the greedy decision method in which the agent selects the action with the highest expected cumulative reward among the actions available in the current state. That is, when a user induces a specific action during the execution phase of the reinforcement learning agent, the agent can rationally determine the final action by considering the induced action together with the current observation and situation.

Because the present invention does not modify the existing value function at the implementation stage, it has the advantage that inducing actions for a certain period does not affect action decisions afterward. It can also be applied easily when there are multiple actions to be induced.

The present invention also allows the degree of coercion of the action to be induced to be reflected in the agent's decision step through an additional parameter.

With this methodology, agent users will be able to use agents more flexibly.

FIG. 1 is a conceptual diagram of a general reinforcement learning model that learns from rewards.
FIG. 2 shows the input/output structure of a deep neural network in deep reinforcement learning, assuming the number of possible actions is N.
FIG. 3 is a flowchart illustrating the algorithm of a method for inducing rational action decisions in a reinforcement learning agent of a deep neural network according to an exemplary embodiment of the present invention.
FIG. 4 shows the derivation of a new value function for making action decisions in consideration of observation information and the induced action.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

In most deep reinforcement learning with a discrete action space, the agent's action is determined in a manner similar to solving a multi-class classification problem with a deep neural network. The last layer of the deep neural network has as many nodes as the number of actions the agent can perform, and each node learns a value function q that approximates the expected reward obtained by performing a specific action a_i in the current situation s.

Building on this structure of deep reinforcement learning, the problem can be solved by adding a new structure to the step that selects the action maximizing the value function q, so that an action can be induced.

FIG. 2 illustrates the input/output structure of a deep neural network according to an exemplary embodiment of the present invention.

Referring to FIG. 2, the input/output structure of the illustrated deep neural network 100 assumes that the number of possible actions is N. In most deep reinforcement learning with a discrete action space, the agent's action is determined in a manner similar to solving a multi-class classification problem with a deep neural network. In an exemplary embodiment, the last layer of the deep neural network 100 has as many nodes as the number of actions (N) that the reinforcement learning agent 110 can perform. Each node can learn a value function q that approximates the expected reward obtained by performing a specific action a_i in the current situation s.
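The "one output node per action" structure can be illustrated with a toy output layer. The sketch below is an assumption-laden stand-in, not the network of FIG. 2: it uses a single untrained linear layer with random weights, and the names `make_linear_q_head`, `state_dim`, and `num_actions` are hypothetical.

```python
import random

def make_linear_q_head(state_dim, num_actions, seed=0):
    """Build a toy last layer with one output node per action (untrained weights)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(state_dim)]
               for _ in range(num_actions)]
    biases = [0.0] * num_actions

    def q_head(state):
        # Node j outputs an estimate of q(s, a_j) for the given state features.
        return [sum(w * x for w, x in zip(row, state)) + b
                for row, b in zip(weights, biases)]

    return q_head

q_head = make_linear_q_head(state_dim=4, num_actions=2)
q_values = q_head([0.0, 0.1, 0.02, -0.1])  # hypothetical 4-dimensional state
print(len(q_values))  # -> 2, one value estimate per action
```

In a real agent the weights would be learned, for example with a deep Q-network; only the shape of the output matters for the induction method that follows.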

The reinforcement learning agent 110 may determine the action to perform by, for example, selecting the action that maximizes the value function q.

An exemplary embodiment of the present invention relates to a method for constraining the behavior of a reinforcement learning agent in the special case mentioned in the last paragraph of the background description above. That is, the exemplary embodiment first presents, from the viewpoint of model structure, the requirements for constraining the behavior of the reinforcement learning agent 110, and then presents a method for inducing rational action decisions of a reinforcement learning agent of a deep neural network by which a specific action can be rationally induced. With the method presented in this embodiment, the reinforcement learning agent 110 determines the final action by considering the induced action together with the current observation and situation.

FIG. 3 is a flowchart illustrating the algorithm of a method for inducing rational action decisions of a reinforcement learning agent according to an exemplary embodiment of the present invention.

Referring to FIG. 3, consider the case in which the user wants to induce or command the agent to perform a specific action, departing from the greedy decision method of selecting the action with the highest expected cumulative reward among the actions available in the current state. In that case, the action a_i induced or commanded by the user during the execution phase of the reinforcement learning agent 110 of the deep neural network 100 may be provided as an input to the reinforcement learning agent 110 (S10).

The first point to consider when inducing a specific action in a reinforcement learning agent is that inducing actions for a certain period must not affect later action decisions. It is therefore necessary to induce a specific action without modifying the parameters inside the deep neural network 100.

The second point to consider is that the reinforcement learning agent 110 must select an appropriate action by considering the induced action and the current observation at the same time. That is, the reinforcement learning agent 110 must also be able to decide, where its judgment warrants, to perform an action other than the induced one.

To satisfy both of these requirements, the reinforcement learning agent 110 may then calculate, as the final value-function value, the value obtained by subtracting the parameter λ from the value-function output of every action other than the action a_i induced by the user (S20). The reinforcement learning agent 110 may then determine and perform the final action according to the calculated final value function (S30).

FIG. 4 shows this schematically. That is, FIG. 4 shows the derivation of a new value function for making action decisions in consideration of the observation information and the induced action. Here, the parameter λ reflects the coerciveness of the action a_i. The modified value function q' used to determine the final action can be defined as in Equation 1.

\[ q'(s, a_j, a_i) = \begin{cases} q(s, a_j), & j = i \\ q(s, a_j) - \lambda, & j \neq i \end{cases} \qquad \text{(Equation 1)} \]

Specifically, q'(s, a_j, a_i) denotes the value-function output of action a_j when a_i is given as the induced action. When the parameter λ is large, every action other than the induced action a_i receives a small value-function output, so the reinforcement learning agent 110 becomes very likely to choose a_i. In other words, a large value of λ gives the action a_i a high degree of coercion.
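The modified value function can be sketched directly from its verbal definition: leave the induced action's value unchanged and subtract λ from every other action's value, then pick the maximum. The function names and the sample values below are hypothetical, chosen so that the floating-point arithmetic is exact.

```python
def modified_q_values(q_values, induced, lam):
    """Return q'(s, a_j, a_i) for every j: q for the induced action, q - lam otherwise."""
    return [q if j == induced else q - lam for j, q in enumerate(q_values)]

def select_action(q_values, induced, lam):
    """Greedy choice over the modified values; the original q_values are not altered."""
    q_mod = modified_q_values(q_values, induced, lam)
    return max(range(len(q_mod)), key=lambda j: q_mod[j])

q_values = [1.0, 0.5, 0.75]  # hypothetical network outputs for three actions
print(modified_q_values(q_values, induced=1, lam=0.25))  # -> [0.75, 0.5, 0.5]
print(select_action(q_values, induced=1, lam=0.25))      # -> 0 (lam too small to override)
print(select_action(q_values, induced=1, lam=1.0))       # -> 1 (induced action wins)
```

Because the penalty is applied to a copy of the outputs, nothing inside the network changes, which matches the requirement that induction must not affect later decisions.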

In contrast, when the parameter λ is small, actions a_j with j ≠ i can still have large value-function outputs, so the reinforcement learning agent 110 retains room to choose, in light of the observation, an action other than a_i.
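The boundary between the "large λ" and "small λ" regimes can be made explicit with a short derivation. It assumes greedy selection over the modified values and ignores ties; the symbol Δ(s) is introduced here for illustration only and does not appear in the source.

```latex
\begin{aligned}
a_i \text{ is selected}
  &\iff q(s, a_i) > q(s, a_j) - \lambda \quad \text{for all } j \neq i \\
  &\iff \lambda > \Delta(s) := \max_{j \neq i} q(s, a_j) - q(s, a_i)
\end{aligned}
```

Under these assumptions, the induced action is taken exactly in those states where its value falls short of the best alternative by less than λ. In states where the shortfall Δ(s) exceeds λ, the agent overrides the request, which is the "room to choose" described above.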

In this way, by feeding the model both the index i of the action to be induced and the parameter λ that reflects the degree of coercion, the reinforcement learning agent 110 can be used to determine the final action while considering the observation information and the induced action together.

In addition, the methodology can be applied easily when there are multiple actions to be induced.
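The source does not spell out how several induced actions are handled. One plausible reading, shown here purely as an assumption, is to exempt every action in the induced set from the penalty and subtract λ from the rest. The function name and sample values are hypothetical.

```python
def select_with_induced_set(q_values, induced_set, lam):
    """Assumed extension: actions in induced_set keep q, all others get q - lam."""
    q_mod = [q if j in induced_set else q - lam for j, q in enumerate(q_values)]
    return max(range(len(q_mod)), key=lambda j: q_mod[j])

q_values = [1.0, 0.5, 0.75, 0.25]  # hypothetical outputs for four actions
print(select_with_induced_set(q_values, {1, 2}, lam=0.0))  # -> 0 (plain greedy)
print(select_with_induced_set(q_values, {1, 2}, lam=0.5))  # -> 2 (best of the induced set)
```

Under this reading, the agent still ranks the induced actions against each other using the unmodified value function.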

In an exemplary embodiment, the method for inducing rational action decisions of a reinforcement learning agent may further include changing the value of the parameter λ to adjust the degree of coercion of the induced action. In an exemplary embodiment, it is determined whether to stop using the reinforcement learning agent (S40); if the agent is to be used further, the value of the parameter λ may be changed to a desired value or kept at its current value (S50). When changing the parameter value, λ may be set to a large value, which lowers the value-function outputs of all actions other than the induced action a_i and thereby makes the reinforcement learning agent 110 more likely to select the induced action a_i as the final action.
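The flow of steps S10 to S50 can be sketched as a loop in which λ may change from one decision to the next while the value function stays untouched. This is a simplified sketch: `run_with_induction`, `fixed_q`, and the λ schedule are hypothetical, and the termination check of S40 is reduced to exhausting the input sequence.

```python
def run_with_induction(q_fn, observations, induced, lam_schedule):
    """Select one action per observation; q_fn itself is never modified."""
    actions = []
    for obs, lam in zip(observations, lam_schedule):  # S50: lam may change each step
        q_values = q_fn(obs)                          # unmodified network output
        q_mod = [q if j == induced else q - lam       # S20: penalize non-induced actions
                 for j, q in enumerate(q_values)]
        actions.append(max(range(len(q_mod)), key=lambda j: q_mod[j]))  # S30
    return actions

def fixed_q(_obs):
    """Stand-in for a trained network that always prefers action 0."""
    return [1.0, 0.5]

print(run_with_induction(fixed_q, [None, None, None], induced=1,
                         lam_schedule=[0.25, 1.0, 0.0]))
# -> [0, 1, 0]
```

The last step uses λ = 0 and returns to the plain greedy choice, illustrating that a period of induction leaves later decisions unaffected.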

To evaluate the performance of the method for inducing rational action decisions of a reinforcement learning agent according to an embodiment of the present invention, an experiment was conducted in 'CartPole', one of the environments of 'OpenAI Gym', which is commonly used to evaluate reinforcement learning agents. The goal in this environment is to keep a pole standing on a cart, which rests on a frictionless surface, from falling over. The reinforcement learning agent balances the pole by moving the cart left or right. The agent receives a reward of +1 for every time step in which the pole has not fallen, and it learns to maximize this reward. Because the pole must be kept balanced, an agent that moves the cart left and right a similar number of times can generally be expected to receive a high reward.

A 'CartPole' agent trained with the deep Q-network (DQN) method was evaluated over 1,000 episodes without applying the method for inducing rational action decisions according to the embodiment of the present invention. The results are shown in Table 1, which summarizes the performance of the reference model.

Table 1
| Model | Average reward | Move left | Move right |
|---|---|---|---|
| Reference model | 500 | 250,001 | 249,999 |

Because the termination conditions of the 'CartPole' environment include the episode exceeding 500 time steps, an average reward of 500 means that the reference model finished every episode without letting the pole fall. Table 1 also shows that, over the 1,000 episodes, the model moved the cart left and right almost the same number of times.

To evaluate performance after applying the method of the exemplary embodiment to the reference model, the action to be induced was taken to be 'move the cart right'. To examine how performance changes with the degree of coercion, experiments were run on two comparison models with the parameter λ set to 0.007 and 0.01. Induction was applied only when the remainder of the time step divided by 10 was less than 5, so that the agent had an opportunity to recover when the pole was about to fall to the left. Table 2 shows the results of evaluating the two comparison models over 1,000 episodes.

Table 2
| Model | Average reward | Move left | Move right |
|---|---|---|---|
| Comparison model 1 (λ=0.007) | 493.738 | 245,112 (49.6%) | 248,626 (50.4%) |
| Comparison model 2 (λ=0.01) | 168.379 | 80,034 (47.5%) | 88,345 (52.5%) |
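The induction schedule used in this experiment (induce only when the time step modulo 10 is less than 5) can be sketched as follows. The sketch assumes that time steps are counted from 0 and that action index 1 means 'move right', as in Gym's CartPole action encoding; the Q-values are hypothetical, and the environment itself is not simulated.

```python
def induction_active(t):
    """Induce only during the first five steps of every ten-step window."""
    return t % 10 < 5

def act(q_values, t, induced, lam):
    """Apply the lam penalty to non-induced actions only while induction is active."""
    if induction_active(t):
        q_values = [q if j == induced else q - lam for j, q in enumerate(q_values)]
    return max(range(len(q_values)), key=lambda j: q_values[j])

MOVE_RIGHT = 1            # assumed encoding: 0 = move left, 1 = move right
q_values = [0.505, 0.5]   # hypothetical state in which 'left' is slightly preferred
print([act(q_values, t, MOVE_RIGHT, lam=0.007) for t in range(10)])
# -> [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

In this hypothetical state the value gap (0.005) is smaller than λ, so the agent follows the request while induction is active and reverts to its own preference otherwise.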

Comparison model 1, with the lower degree of coercion, achieved a high average reward of 493.738, close to 500. In addition, the induced action 'move right' accounted for 50.4% of all actions, a higher share than 'move left'. The results for comparison model 1 confirm that the action was induced as far as possible without significantly degrading performance.

Comparison model 2, with the higher degree of coercion, showed the highest share of the induced action 'move right' among the three models (52.5%) and the lowest average reward, 168.379.
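The figures in Table 2 are internally consistent, which a short calculation can confirm: the action counts sum to the total number of time steps, the total divided by 1,000 episodes gives the average reward (one reward unit per step), and the 'move right' share follows from the counts. The helper name `summarize` is hypothetical; the numbers are taken from Table 2.

```python
def summarize(left, right, episodes):
    """Return (average reward per episode, percentage of 'move right' actions)."""
    total_steps = left + right               # each step yields a reward of +1
    average_reward = total_steps / episodes
    right_share = round(100 * right / total_steps, 1)
    return average_reward, right_share

print(summarize(245_112, 248_626, 1000))  # comparison model 1 (lambda = 0.007)
print(summarize(80_034, 88_345, 1000))    # comparison model 2 (lambda = 0.01)
```

The results reproduce the reported 493.738 with 50.4% and 168.379 with 52.5%, showing that a modest rise in the induced action's share coincided with a large drop in episode length for the larger λ.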

In the 'CartPole' environment used for the experiment, only two actions are available, and high performance can be expected only when the two actions are performed a similar number of times, so some performance loss from action induction was unavoidable. In an environment with more available actions, where high performance does not require the actions to be performed a similar number of times, applying the action induction method of the exemplary embodiment should produce a more pronounced change in the agent's behavior while maintaining performance.

The above has presented, according to an exemplary embodiment, a way of using a deep neural network so that when a specific action is induced in a reinforcement learning agent, the agent can rationally determine the final action by considering the induced action together with the current observation and situation. With the presented method, agent users will be able to use agents more flexibly.

The exemplary embodiment concerns deep reinforcement learning with a discrete action space, but a way of constraining behavior can also be obtained when the agent has a continuous action space.

The method for inducing rational action decisions of a reinforcement learning agent according to the embodiments of the present invention described above may be implemented as a computer program executable by various computer means. The computer program may be recorded in a storage unit of a computer or on a separate computer-readable recording medium. The method of the present invention can be performed when the processing unit of a computer device reads and executes the computer program stored on the recording medium.

The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

The features, structures, effects, and the like described in the above embodiments are included in one embodiment of the present invention and are not necessarily limited to a single embodiment. Furthermore, the features, structures, effects, and the like illustrated in each embodiment may be combined or modified for other embodiments by a person having ordinary knowledge in the field to which the embodiments belong. Accordingly, matters relating to such combinations and modifications should be construed as falling within the scope of the present invention.

In addition, although the description above has focused on the embodiments, they are merely examples and do not limit the present invention. A person having ordinary knowledge in the field to which the present invention pertains will understand that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the embodiments. Differences relating to such modifications and applications should be construed as falling within the scope of the present invention as defined in the appended claims.

The present invention can be used in the field of reinforcement learning of deep neural networks.

100: deep neural network
110: reinforcement learning agent

Claims

A method for inducing a rational action decision of a reinforcement learning agent of a deep neural network, the method comprising:
providing, in an execution step of the reinforcement learning agent of the deep neural network, an action induced by a user as an input to the reinforcement learning agent;
calculating, in the reinforcement learning agent, as a final value function value, a value obtained by subtracting a parameter value reflecting the degree of coercion of the induced action from the output value of the value function for every action other than the induced action; and
determining, in the reinforcement learning agent, a final action to be performed using the calculated final value function value.
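The three steps of this claim can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function names `final_values` and `choose_action`, the use of a plain list for the value-function outputs, and the sample numbers are all assumptions made for the example.

```python
from typing import List


def final_values(q_values: List[float], induced: int, lam: float) -> List[float]:
    """Subtract lam from the value of every action except the induced one."""
    return [q if j == induced else q - lam for j, q in enumerate(q_values)]


def choose_action(q_values: List[float], induced: int, lam: float) -> int:
    """Return the index of the action with the highest final value."""
    adjusted = final_values(q_values, induced, lam)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Illustrative value-function outputs for three actions (hypothetical numbers).
q = [1.0, 0.4, 0.9]
print(choose_action(q, induced=1, lam=0.0))  # prints 0: no coercion, greedy choice
print(choose_action(q, induced=1, lam=1.0))  # prints 1: the induced action is chosen
```

With `lam` set to 0 the selection reduces to the ordinary greedy choice, so the parameter acts as a dial between the agent's own preference and the user's instruction.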

The method of claim 1, wherein the reinforcement learning agent uses the following equation as a modified value function q' for determining the final action, and determines the final action by simultaneously considering the induced action and the current observation information and situation,

where a and s denote the action and the current situation, respectively, λ denotes the parameter, i and j are indices of the action a, and q denotes a value function approximating the expected value of the reward expected when an action is performed in the current situation.
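The equation referred to in this claim does not survive in the text as reproduced here. A form consistent with claims 1 and 4, in which λ is subtracted from the value of every action other than the induced action a_i, would be the following. It is a reconstruction under that assumption and should not be read as the literal claim language.

```latex
q'(s, a_j) =
\begin{cases}
  q(s, a_j),           & j = i \\
  q(s, a_j) - \lambda, & j \neq i
\end{cases}
```

Under this reading, the final action would be the a_j that maximizes q'(s, a_j), and setting λ = 0 recovers the unmodified value function q.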

The method of claim 1, further comprising the step of changing the parameter value to adjust the degree of coercion of the induced action.

The method of claim 3, wherein the step of changing the parameter value comprises changing the value of the parameter (λ) to a larger value, thereby reducing the output values of the value function for all actions other than the specific induced action (a_i) and increasing the likelihood that the reinforcement learning agent selects the induced action (a_i) as the final action.
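The effect of enlarging λ can be illustrated with a short Python sketch. It assumes the final action is the one with the highest adjusted value and that ties go to the lowest index; the helper names `min_lambda_to_select` and `selected` and the sample values are hypothetical. The sketch computes the gap between the best competing action and the induced action, which is the value λ must exceed for the induced action to win.

```python
from typing import List


def min_lambda_to_select(q_values: List[float], induced: int) -> float:
    """Gap between the best non-induced action and the induced action (at least 0)."""
    best_other = max(q for j, q in enumerate(q_values) if j != induced)
    return max(0.0, best_other - q_values[induced])


def selected(q_values: List[float], induced: int, lam: float) -> int:
    """Index of the action with the highest value after subtracting lam from the others."""
    adjusted = [q if j == induced else q - lam for j, q in enumerate(q_values)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


q = [2.0, 0.5, 1.0]  # hypothetical value-function outputs
print(min_lambda_to_select(q, induced=1))  # prints 1.5
for lam in (0.0, 1.0, 2.0):
    # The induced action (index 1) is selected only once lam exceeds 1.5.
    print(lam, selected(q, induced=1, lam=lam))
```

For values of λ below the gap the agent still follows its own greedy choice, which is how a small λ expresses a weak suggestion and a large λ a near-command.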

The method of claim 1, wherein the deep neural network is configured such that its last layer has the same number of nodes as the number of actions the reinforcement learning agent can perform, and each node learns a value function approximating the expected value of the reward expected when a specific action is performed in the current situation.
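The output-layer arrangement described in this claim can be pictured with a toy forward pass. The sketch below is an assumption-laden illustration in plain Python: the class name `TinyQNetwork`, the single hidden layer, the ReLU activation, and the random untrained weights are all hypothetical, and training is omitted entirely. Its only purpose is to show one output node per action.

```python
import random
from typing import List


class TinyQNetwork:
    """Toy network whose last layer has one node per available action."""

    def __init__(self, n_obs: int, n_hidden: int, n_actions: int, seed: int = 0):
        rng = random.Random(seed)
        # Random, untrained weights; a real agent would learn these.
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_obs)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_actions)]

    def forward(self, obs: List[float]) -> List[float]:
        """Return one estimated value per action for the given observation."""
        hidden = [max(0.0, sum(w * x for w, x in zip(row, obs))) for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


net = TinyQNetwork(n_obs=4, n_hidden=8, n_actions=3)
q_values = net.forward([0.1, -0.2, 0.3, 0.0])
print(len(q_values))  # prints 3: one value per action
```

Because every action has its own output node, the subtraction of λ can be applied directly to the output vector without changing the network itself.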

The method of claim 1, wherein the induced action includes a plurality of actions.
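One way to read this claim is that several actions are induced at once and only the remaining actions are penalized. The Python sketch below follows that reading, which is an assumption rather than something stated in the claim; the function name `final_values_multi` and the sample numbers are hypothetical.

```python
from typing import Iterable, List


def final_values_multi(q_values: List[float], induced: Iterable[int], lam: float) -> List[float]:
    """Subtract lam from every action that is not in the induced set."""
    keep = set(induced)
    return [q if j in keep else q - lam for j, q in enumerate(q_values)]


q = [1.0, 0.2, 0.6, 0.8]  # hypothetical value-function outputs
adjusted = final_values_multi(q, induced={1, 2}, lam=1.0)
best = max(range(len(adjusted)), key=adjusted.__getitem__)
print(best)  # prints 2: the higher-valued of the two induced actions
```

Under this reading the agent still ranks the induced actions by its own learned values, so the user narrows the choice while the agent picks the best option within it.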

A computer-executable program stored on a computer-readable recording medium for performing the method for inducing a rational action decision of a reinforcement learning agent of a deep neural network according to any one of claims 1 to 6.

A computer-readable recording medium on which a computer executable program is recorded for performing the method for inducing a rational action decision of a reinforcement learning agent of a deep neural network according to any one of claims 1 to 6.