KR102190584B1

KR102190584B1 - System and method for predicting human choice behavior and underlying strategy using meta-reinforcement learning

Info

Publication number: KR102190584B1
Application number: KR1020190001557A
Authority: KR
Inventors: 이상완; 김동재
Original assignee: 한국과학기술원
Priority date: 2019-01-07
Filing date: 2019-01-07
Publication date: 2020-12-14
Also published as: KR20200092457A

Abstract

A system and method for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning are presented. According to an embodiment, the system may comprise: a model-free (MF) reinforcement learning unit that learns a reward function; a model-based (MB) reinforcement learning unit that learns the reward function and a state-transition function and dynamically interacts with the MF reinforcement learning unit; and a hierarchical control unit that predicts human decision-making by hierarchically controlling the MF reinforcement learning unit and the MB reinforcement learning unit through dynamically assigned weights.

Description

Human behavior pattern and behavioral strategy estimation system and method using meta-reinforcement learning {SYSTEM AND METHOD FOR PREDICTING HUMAN CHOICE BEHAVIOR AND UNDERLYING STRATEGY USING META-REINFORCEMENT LEARNING}

The following embodiments relate to a system and method for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning.

Reinforcement learning (RL) algorithms have been used to solve adaptive control problems because, in order to find an optimal policy, they update values in the direction that reduces the prediction error computed from rewards. RL can be classified into model-free (MF) RL, which learns a reward function, and model-based (MB) RL, which learns both a reward function and a state-transition function. Because the state-space representation corresponding to the model is unclear in most practical problems, MF RL agents are suited to general-purpose use. MB RL agents, however, have the advantage of adapting quickly by learning the state-transition function itself, and they are used for fully exploited state-space problems.

Numerous studies of human RL have suggested that humans use MF RL and MB RL together, and that the two are hierarchically controlled according to the uncertainty of state transitions (Non-Patent Document 1). That study proposed a computational model of hierarchical control over the two types of RL for performing a task in which the degree of state-transition uncertainty varies.

However, most human RL studies, including the above study, assumed a simplified Markov decision process (MDP) environment, which places a major limitation on predicting actual human choice behavior. Because real-life environments contain numerous factors that affect both the RL process and the hierarchical control over RL, existing approaches to understanding human RL may fail to verify the influence of RL on human learning and behavior. This problem can be addressed by testing a Markov decision process (MDP) that takes realistic environmental factors into account.

(Non-Patent Document 1) S. W. Lee, S. Shimojo, and J. P. O'Doherty, "Neural Computations Underlying Arbitration between Model-Based and Model-free Learning," Neuron, vol. 81, no. 3, pp. 687-699, Feb. 2014.
(Non-Patent Document 2) P. Dayan and L. F. Abbott, Theoretical neuroscience: computational and mathematical modeling of neural systems. Massachusetts Institute of Technology Press, 2001.

The embodiments describe a system and method for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning. More specifically, they aim to provide such a system and method in which control weights are dynamically applied to a model-based (MB) RL method and a plurality of model-free (MF) RL methods associated with various task goals.

A system for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning according to an embodiment may comprise: a model-free (MF) reinforcement learning unit that learns a reward function; a model-based (MB) reinforcement learning unit that learns the reward function and a state-transition function and dynamically interacts with the MF reinforcement learning unit; and a hierarchical control unit that predicts human decision-making by hierarchically controlling the MF reinforcement learning unit and the MB reinforcement learning unit through dynamically assigned weights.

The hierarchical control unit may update the weights of the MF reinforcement learning unit and the MB reinforcement learning unit through transition rates in two directions, each of which is a function of the reliability of the MF reinforcement learning unit and the MB reinforcement learning unit.

The hierarchical control unit may predict the human decision-making and estimate the behavioral strategy underlying the human decision-making.

The hierarchical control unit may predict human decision-making by hierarchically controlling the MF reinforcement learning unit and the MB reinforcement learning unit according to at least one of the variable task variables, namely a goal value, a state-space size, and a state-transition uncertainty.

The MF reinforcement learning unit may consist of multiple goal-dependent MF RL agents, which provide a plurality of policies for a plurality of goals and thereby reflect the goal value.

The MF reinforcement learning unit or the MB reinforcement learning unit may receive the variable task variables (the goal value, the state-space size, and the state-transition uncertainty) from a variable environment and calculate the transition rates between the MF reinforcement learning unit and the MB reinforcement learning unit, and the hierarchical control unit may determine a model selection probability from the transition rates.

The hierarchical control unit may calculate action values by integrating the model selection probabilities of the MF reinforcement learning unit and the MB reinforcement learning unit, and may convert the action values into an action by providing action selection probabilities derived from them.
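As a minimal illustrative sketch of this integration step (not the patented implementation), the Python code below blends MB and MF action values using a model selection probability and converts the result into action selection probabilities. The function names, the linear blend, the softmax choice rule, and the `inverse_temperature` parameter are assumptions introduced for illustration; the source does not specify them.

```python
import math


def integrate_action_values(q_mb, q_mf, p_mb):
    """Blend MB and MF action values.

    p_mb is the probability of selecting the MB system; the MF system
    receives 1 - p_mb. The linear blend is an assumed form.
    """
    return [p_mb * qb + (1.0 - p_mb) * qf for qb, qf in zip(q_mb, q_mf)]


def action_probabilities(q_values, inverse_temperature=1.0):
    """Softmax over action values (assumed choice rule)."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp(inverse_temperature * (q - q_max)) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# Example: the two systems disagree and are weighted equally.
blended = integrate_action_values([1.0, 0.0], [0.0, 1.0], p_mb=0.5)
probs = action_probabilities(blended)
```

With equal weighting of two systems that prefer opposite actions, the blended values coincide and both actions become equally likely.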

A method for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning according to another embodiment may comprise: dynamically assigning weights to a model-free (MF) reinforcement learning unit and a model-based (MB) reinforcement learning unit; and predicting human decision-making by hierarchically controlling the MF reinforcement learning unit and the MB reinforcement learning unit according to the assigned weights.

The method may further comprise updating the weights of the MF reinforcement learning unit and the MB reinforcement learning unit through transition rates in two directions, each of which is a function of the reliability of the MF reinforcement learning unit and the MB reinforcement learning unit.

The step of predicting human decision-making through hierarchical control may predict the human decision-making and estimate the behavioral strategy underlying the human decision-making.

The step of predicting human decision-making through hierarchical control may predict human decision-making by hierarchically controlling the MF reinforcement learning unit and the MB reinforcement learning unit according to at least one of the variable task variables, namely a goal value, a state-space size, and a state-transition uncertainty.

The step of predicting human decision-making through hierarchical control may determine a model selection probability using the transition rates between the MF reinforcement learning unit and the MB reinforcement learning unit, which the MF reinforcement learning unit or the MB reinforcement learning unit calculates after receiving the variable task variables (the goal value, the state-space size, and the state-transition uncertainty) from a variable environment.

The step of predicting human decision-making through hierarchical control may calculate action values by integrating the model selection probabilities of the MF reinforcement learning unit and the MB reinforcement learning unit, and may convert the action values into an action by providing action selection probabilities derived from them.

According to the embodiments, it is possible to provide a system and method for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning in which control weights are dynamically applied to a model-based (MB) RL method and a plurality of model-free (MF) RL methods associated with various task goals.

According to the embodiments, human behavior patterns and behavioral strategies can be estimated through a model that implements competition between an MB RL method and a plurality of goal-dependent MF RL methods, and goal-driven RL control can further be applied to various real-life human-robot interaction scenarios.

FIG. 1 is a diagram for explaining inference of a human's behavioral strategy and decision-making according to an embodiment.
FIG. 2 is a diagram for explaining a two-stage Markov decision task of varying complexity according to an embodiment.
FIG. 3 is a diagram showing the configuration of the top two models according to an embodiment.
FIG. 4 is a diagram showing the exceedance probabilities for the 57 models according to an embodiment.
FIG. 5 is a block diagram showing a system for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning according to an embodiment.
FIG. 6 is a flowchart showing a method for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning according to an embodiment.
FIG. 7 is a diagram for explaining a task design according to an embodiment.
FIG. 8 is a diagram for explaining a computational model of arbitration control that integrates uncertainty and complexity according to an embodiment.

Hereinafter, embodiments will be described with reference to the accompanying drawings. However, the described embodiments may be modified in various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. In the drawings, the shapes and sizes of elements may be exaggerated for clarity.

Recent evidence from neuroscience and psychology suggests that, in uncertain and dynamic environments where the uncertainty of state-action-state transitions changes over time, a single RL algorithm accounts for less than 60% of the variance in human choice behavior. Prediction performance decreases further as the size of the state-space increases.

The present invention proposes a hierarchical context-dependent RL control framework that dynamically applies control weights to a model-based (MB) RL method and a plurality of model-free (MF) RL methods associated with various task goals. To properly evaluate the effectiveness of the proposed method, a two-stage Markov decision task (MDT) in which three different types of context change over time is considered. Here, 57 different RL control models were trained on the Caltech MDT dataset and their prediction performance was compared using Bayesian model comparison; the model that provided accurate predictions was the one implementing competition between an MB RL method and a plurality of goal-dependent MF RL methods. According to the embodiments, goal-driven RL control can be applied to various real-life human-robot interaction scenarios.

FIG. 1 is a diagram for explaining inference of a human's behavioral strategy and decision-making according to an embodiment.

Referring to FIG. 1, the system for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning according to an embodiment may perform estimation 140 of a human's decision-making and of the behavioral strategy underlying that decision-making. The embodiments explain human behavior 110 through a plurality of goal-dependent habitual behavior agents, and can thereby predict human behavior 110 across a wider range of situational changes. In predicting human behavior 110, how closely the assumed environment reflects reality is important; because the environment here includes various situational changes, the approach is easier to apply.

RL can be applied to the human decision-making process. RL is advantageous for inferring human behavior 110 in uncertain environments. Most studies aim to observe human RL and predict behavior within simple environments. In real life, however, the environments in which humans must make choices are not simple but highly complex, and they change continually (variable environments). Three major kinds of change can be expected within a variable environment. The first is a change in the complexity of the situation, in which the number of available choices changes. The second is a change in uncertainty, in which predictions about the outcome of a choice become inaccurate. The third is a change in goal, in which the goal to be achieved as a result of a choice changes.

The embodiments may provide a system for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning that uses a control unit (meta-controller), which comprehensively controls various RL agents, not only to predict human decision-making but also to perform estimation 140 of the behavioral strategies involved in the decision-making process. Such a system can respond to various environmental changes and can change adaptively by means of the multiple RL agents.

The RL model is the model underlying human decision-making, and may consist of one goal-directed behavior agent 120 and a plurality of goal-dependent habitual behavior agents 130 for responding to various goals. The goal-dependent habitual behavior agent 130 is composed of a plurality of habitual behavior agents 131 so that it can respond easily and efficiently to the goal the user wishes to pursue. This is particularly useful in variable environments.

The control unit may predict human decision-making behavior by evaluating the performance of each agent and determining the degree to which each agent's utilization should be increased.

The system and method for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning according to an embodiment are described in more detail below.

The present invention considers the Caltech RL dataset, acquired through a more complex Markov decision task (MDT) that uses three varying task variables: the goal value, the state-space size, and the degree of state-transition uncertainty. In particular, the embodiments study the influence of changes in state-space size and goal value within the hierarchical control RL model. This influence was not sufficiently examined in previous studies (Non-Patent Document 1). To demonstrate the influence of factors that can affect hierarchical control in a complex Markov decision process (MDP) setting, a total of 57 hierarchical control RL models were compared, each implemented by combining different hypotheses. As a result, a new hierarchical control RL model was found that provides highly accurate predictions of choice behavior in a realistic MDP environment. The simulation results show a large advantage of hierarchical control over an MB RL agent and goal-dependent MF RL agents in explaining human choice behavior.

A hierarchical control architecture that integrates the three variable environment variables described above is provided below.

FIG. 2 is a diagram for explaining a two-stage Markov decision task of varying complexity according to an embodiment.

Referring to FIG. 2, a two-stage Markov decision task (MDT) that uses three variable task variables, namely a goal value 210, a state-space size 220, and a degree of state-transition uncertainty 230, can be described. The embodiments use the Caltech RL dataset obtained through the two-stage Markov decision task. The dataset consists of data from 22 participants (10 female, aged 19-55). Participants make two sequential choices 201, 202 while the three task variables 210, 220, 230 change. Participants are informed of the goal value 210 of the colored coins and make choices 201, 202 among the available options so as to maximize cumulative reward. State transitions occur according to the participant's choices 201, 202, and the degree of state-transition uncertainty 230 may vary between a low level (Low) and a high level (High). As shown in Stage 2 of FIG. 2, the number of available options, which represents the state-space size 220, may vary between two and four. The reward policy of the task can encourage participants to learn the optimal policy.
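To make the three task variables concrete, the following Python sketch models a single choice in a task of this kind. It is a simplified illustration under stated assumptions rather than the actual task design: the `TaskContext` fields, the `P_EXPECTED` probabilities, the two-branch transition structure, and the rule that only the goal coin pays out are hypothetical and are not given in the source.

```python
import random
from dataclasses import dataclass


@dataclass
class TaskContext:
    goal_coin: str     # coin colour that currently carries value (goal value)
    n_options: int     # number of available choices (state-space size)
    uncertainty: str   # "low" or "high" state-transition uncertainty


# Hypothetical probability that a choice leads to its more likely successor.
P_EXPECTED = {"low": 0.9, "high": 0.5}


def transition(state, action, ctx, rng):
    """Sample one of two successor branches for the chosen action."""
    if not 0 <= action < ctx.n_options:
        raise ValueError("action is outside the current option set")
    branch = 0 if rng.random() < P_EXPECTED[ctx.uncertainty] else 1
    return (state, action, branch)


def reward(outcome_coin, coin_value, ctx):
    """Pay out only when the collected coin matches the current goal (assumed rule)."""
    return coin_value if outcome_coin == ctx.goal_coin else 0.0


# Example: a low-uncertainty context with two options and a red goal coin.
ctx = TaskContext(goal_coin="red", n_options=2, uncertainty="low")
rng = random.Random(0)
next_state = transition("s1", 1, ctx, rng)
```

Changing `n_options` from 2 to 4, switching `uncertainty`, or changing `goal_coin` corresponds to the three kinds of context change described above.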

The hierarchical control RL model is described below. The hierarchical control RL model may itself be the system for estimating human behavior patterns and behavioral strategies using meta-reinforcement learning according to an embodiment, or it may be included in such a system.

The hierarchical control RL model can hierarchically control an MB agent and MF agents. Its main component is reliability-based competition between MB RL and MF RL. In this model, the control unit may update the weight of each RL agent through transition rates in two directions (MB to MF, and MF to MB), each of which is a function of the reliability of each RL agent.
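The reliability-based weight update can be sketched as follows. This is an illustrative sketch only: Algorithm 2 is not reproduced in the text, so the sigmoid form of `transition_rate`, its `max_rate` and `steepness` parameters, and the discrete-time update with step `dt` are assumptions chosen to show two-direction transition rates that depend on reliability.

```python
import math


def transition_rate(reliability, max_rate=1.0, steepness=5.0):
    """Rate of switching away from a system.

    Assumed form: the rate falls as the reliability of the system being
    left rises, so a reliable system tends to keep control.
    """
    return max_rate / (1.0 + math.exp(steepness * reliability))


def update_mb_weight(p_mb, reliability_mb, reliability_mf, dt=1.0):
    """One step of a two-direction transition update for the MB weight."""
    rate_mf_to_mb = transition_rate(reliability_mf)  # unreliable MF pushes control to MB
    rate_mb_to_mf = transition_rate(reliability_mb)  # unreliable MB pushes control to MF
    p_new = p_mb + dt * (rate_mf_to_mb * (1.0 - p_mb) - rate_mb_to_mf * p_mb)
    return min(1.0, max(0.0, p_new))  # keep the weight a valid probability


# Example: MB is currently far more reliable than MF, so its weight grows.
p_mb = update_mb_weight(0.5, reliability_mb=0.9, reliability_mf=0.1)
```

The MF weight in this sketch is simply `1 - p_mb`, so a single number tracks the balance between the two systems.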

Table 1 shows Algorithm 1, which represents the hierarchical control RL model.

[Table 1]

Table 2 shows Algorithm 2, which represents the weight update for hierarchical control.

[Table 2]

This dynamic two-stage transition model was borrowed from biophysics to implement competition between the two types of RL (Non-Patent Document 2).

However, because a two-stage Markov decision task (MDT) with three changing task variables was used to interpret general human RL in complex situations, the model requires more than a control unit spanning MB RL and MF RL. Two key components are associated with the task variables, complexity and currency, and there are several hypotheses for each component.

First, complexity (the size of the state-space) represents a binary change in the size of the state space and can be used to formulate the influence of the current state-space size within hierarchical control. Because the state space changes, the weight of each RL system and the corresponding optimal policy must also be updated.

Here, four different effects of complexity on hierarchical control were assumed (Algorithm 2). In Algorithm 2, [symbol missing] denotes no contribution from complexity, that is, the transition rate described in the previous study (Non-Patent Document 1). The other three transition rates ([symbol missing] - [symbol missing] in Algorithm 2) include a complexity contribution in the numerator or the denominator, as described in lines 3-9 of Algorithm 2. For each of these three transition rates there are six variants, because the complexity contribution can be applied in three ways (bidirectional contribution / contribution in the MB-to-MF direction / contribution in the MF-to-MB direction) and in two directions (forward/reverse complexity, i.e., the complexity term z increases/decreases as the state space expands).
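The three ways and two directions of the complexity contribution can be illustrated with a short Python sketch. Because Algorithm 2 is not reproduced in the text, the details here are assumptions: the values `z_low` and `z_high`, the threshold of two options, and the use of simple multiplication in place of the numerator or denominator placement are hypothetical.

```python
def complexity_term(n_options, direction, z_low=1.0, z_high=2.0):
    """Return the complexity term z for the current state-space size.

    forward: z increases when the state space expands.
    reverse: z decreases when the state space expands.
    """
    expanded = n_options > 2
    if direction == "forward":
        return z_high if expanded else z_low
    if direction == "reverse":
        return z_low if expanded else z_high
    raise ValueError("direction must be 'forward' or 'reverse'")


def apply_complexity(rate_mf_to_mb, rate_mb_to_mf, z, way):
    """Scale one or both transition rates by z (assumed multiplicative form)."""
    if way == "both":
        return rate_mf_to_mb * z, rate_mb_to_mf * z
    if way == "mb_to_mf":
        return rate_mf_to_mb, rate_mb_to_mf * z
    if way == "mf_to_mb":
        return rate_mf_to_mb * z, rate_mb_to_mf
    raise ValueError("way must be 'both', 'mb_to_mf', or 'mf_to_mb'")


# Example: forward complexity applied only to the MF-to-MB rate, four options.
z = complexity_term(4, "forward")
rates = apply_complexity(0.2, 0.4, z, "mf_to_mb")
```

Each combination of `way` and `direction` corresponds to one of the six variants described for a given transition-rate form.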

FIG. 3 is a diagram showing the configuration of the top two models according to an embodiment.

As shown in FIG. 3, the upper model, for example, has a complexity effect only on the MF -> MB transition rate, with a forward complexity term.

Next, currency (goal value) denotes the current goal value of each coin and must be reflected in the hierarchical-control RL model. The concept is borrowed from multi-agent RL, which is frequently used in environments that require multiple optimal policies because there are multiple goals. Because the goal value changes in the task according to an embodiment, it is more intuitive to use goal-dependent MF RL than a single MB RL, which already has a backward planning process and is less affected by goal changes.

Accordingly, two hypotheses about goal-dependent agents (the 3MF model and the 3Q model) were tested against one null hypothesis with only a single MF agent (the 1MF model). The 3MF model has three separate MF agents, one for each of the three coins, unlike the 3Q model, which keeps a separate action value for each coin but shares a single MF RL agent.

Thus, in Algorithm 1, the 3MF model has three MF RL agents (lines 5-7), whereas the 3Q model has three Q_MF values in lines 5 and 6. Both goal-dependent agent models update only one MF RL agent (or action value) at a time. Before line 5 begins, the goal-dependent agent model either selectively updates only the action value (3Q model) or updates the action value and the corresponding reliability of each goal-dependent MF agent (3MF model), according to the maximum currency of each trial.
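The difference between the 3Q and 3MF configurations can be sketched as a data-structure choice. The following is a minimal illustration only: Algorithm 1 is not reproduced here, so the update rule `mf_update`, the reliability rule, the learning rates, and all class and function names are assumptions introduced for this example.

```python
from dataclasses import dataclass, field

COINS = ("silver", "red", "blue")


def mf_update(q, state, action, reward, lr=0.1):
    """Move the cached value toward the reward; return the reward prediction error."""
    key = (state, action)
    rpe = reward - q.get(key, 0.0)
    q[key] = q.get(key, 0.0) + lr * rpe
    return rpe


def current_goal(coin_values):
    """Coin with the maximum currency on this trial."""
    return max(coin_values, key=coin_values.get)


@dataclass
class Model3Q:
    """One MF agent that keeps a separate action-value table per coin."""
    q: dict = field(default_factory=lambda: {c: {} for c in COINS})

    def update(self, coin_values, state, action, reward):
        goal = current_goal(coin_values)
        return mf_update(self.q[goal], state, action, reward)  # values only


@dataclass
class Model3MF:
    """Three MF agents, one per coin, each with its own values and reliability."""
    q: dict = field(default_factory=lambda: {c: {} for c in COINS})
    reliability: dict = field(default_factory=lambda: {c: 0.5 for c in COINS})

    def update(self, coin_values, state, action, reward, rel_lr=0.2):
        goal = current_goal(coin_values)
        rpe = mf_update(self.q[goal], state, action, reward)
        # Hypothetical reliability rule: small unsigned RPE means high reliability.
        target = 1.0 - min(abs(rpe) / max(coin_values.values()), 1.0)
        self.reliability[goal] += rel_lr * (target - self.reliability[goal])
        return rpe
```

In both classes only the entry for the coin with the maximum currency is touched on a given trial; the 3MF variant additionally tracks a per-agent reliability, which is the quantity the hierarchical control would consume.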

FIG. 4 is a diagram showing the exceedance probability for 57 models according to an embodiment.

There are 3*3*3*2 = 54 hierarchical-control RL models. This count is the product of the number of ways the complexity term contributes (bidirectional / MB-to-MF / MF-to-MB), the number of goal-dependent MF agent configurations (3MF / 3Q / 1MF model), the number of hypotheses about the transition rate, and the number of directions of the complexity term (forward / reverse). In addition, there are three null models, which differ only in the configuration of the goal-dependent MF agent (the 1MF-null model is identical to the model of the previous study (Non-Patent Document 1)), giving the 57 models of FIG. 4. Hereafter, model details other than the configuration of the goal-dependent MF agent and the contribution of the complexity term are abbreviated as shown in FIG. 4 (m1 to m9). For example, m9 with reverse complexity and the 3MF model is abbreviated as m9-reverse 3MF.
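The model count can be checked by enumerating the factors. This is a sketch for illustration: the labels for the three transition-rate hypotheses are placeholders (the original symbols are not available in the text), and the tuple layout is an assumption.

```python
from itertools import product

contributions = ("bidirectional", "MB_to_MF", "MF_to_MB")
mf_configs = ("3MF", "3Q", "1MF")
rate_hypotheses = ("rate_A", "rate_B", "rate_C")  # placeholder labels, not from the source
directions = ("forward", "reverse")

# Every combination of the four factors is one hierarchical-control model.
hierarchical = list(product(contributions, mf_configs, rate_hypotheses, directions))

# Null models vary only in the MF agent configuration.
null_models = [("null", cfg) for cfg in mf_configs]

all_models = hierarchical + null_models
print(len(hierarchical), len(null_models), len(all_models))  # 54 3 57
```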

FIG. 5 is a block diagram of a system for estimating human behavior patterns and behavior strategies using meta-reinforcement learning according to an embodiment.

Referring to FIG. 5, a system 500 for estimating human behavior patterns and behavior strategies using meta-reinforcement learning according to an embodiment may include a model-free (MF) reinforcement learning unit 510, a model-based (MB) reinforcement learning unit 520, and a hierarchical control unit 530.

The model-free (MF) reinforcement learning unit learns a reward function and can interact dynamically with the model-based (MB) reinforcement learning unit. In particular, the MF reinforcement learning unit may consist of multiple goal-dependent MF reinforcement learning agents, so that it provides multiple policies for multiple goals and thereby reflects the goal value.

The model-based (MB) reinforcement learning unit learns the reward function and a state-transition function, and can interact dynamically with the model-free (MF) reinforcement learning unit.

The hierarchical control unit can predict human decision-making by dynamically assigning weights to the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit and thereby controlling them hierarchically. The hierarchical control unit can also estimate the behavior strategy underlying the human decision-making.

The hierarchical control unit may predict human decision-making by hierarchically controlling the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit according to at least one of the variable task variables: the goal value, the size of the state space, and the state-transition uncertainty.

The hierarchical control unit may also update the weights of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit through transition rates in two directions, which are functions of the reliability of the MF reinforcement learning unit and of the MB reinforcement learning unit.

The model-free (MF) reinforcement learning unit or the model-based (MB) reinforcement learning unit may receive the variable task variables (goal value, state-space size, and state-transition uncertainty) from the variable environment and calculate the transition rates between the MF and MB reinforcement learning units; the hierarchical control unit may then determine the model selection probability from the transition rates.

The hierarchical control unit may then compute an action value by integrating the model selection probabilities of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit, and convert the action value into an action by way of an action selection probability.
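The integration and action-selection steps can be sketched as follows. The text does not give the formulas, so this example assumes a probability-weighted sum of the two value estimates and a softmax conversion; the function names and the `inverse_temperature` parameter are illustrative.

```python
import math


def integrate_values(q_mb, q_mf, p_mb):
    """Blend MB and MF action values using the model selection probability P_MB."""
    return {a: p_mb * q_mb[a] + (1.0 - p_mb) * q_mf[a] for a in q_mb}


def action_probabilities(q, inverse_temperature=1.0):
    """Softmax over integrated action values (assumed conversion rule)."""
    peak = max(q.values())  # subtract the maximum for numerical stability
    exps = {a: math.exp(inverse_temperature * (v - peak)) for a, v in q.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


q_mb = {"L": 1.0, "R": 0.0}
q_mf = {"L": 0.0, "R": 1.0}
q = integrate_values(q_mb, q_mf, p_mb=0.5)
probs = action_probabilities(q)
```

With equal weighting the two systems cancel out in this toy case and both actions become equally likely; shifting `p_mb` toward 1 makes the choice follow the MB estimate.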

FIG. 6 is a flowchart of a method for estimating human behavior patterns and behavior strategies using meta-reinforcement learning according to an embodiment.

Referring to FIG. 6, a method for estimating human behavior patterns and behavior strategies using meta-reinforcement learning according to an embodiment may include dynamically assigning weights to the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit (S110), and predicting human decision-making by hierarchically controlling the MF and MB reinforcement learning units according to the assigned weights (S120).

The method may further include updating the weights of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit through transition rates in two directions, which are functions of the reliability of the MF and MB reinforcement learning units (S130).

Each step of the method for estimating human behavior patterns and behavior strategies using meta-reinforcement learning according to an embodiment is described below. The method may be described in more detail with reference to the system according to an embodiment described with FIG. 5. The system may include a hierarchical control unit and, depending on the embodiment, may further include a model-free (MF) reinforcement learning unit and a model-based (MB) reinforcement learning unit.

In step S110, the hierarchical control unit may dynamically assign weights to the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit. The MF reinforcement learning unit may consist of multiple goal-dependent MF reinforcement learning agents, so that it provides multiple policies for multiple goals and thereby reflects the goal value.

In step S120, the hierarchical control unit may predict human decision-making by hierarchically controlling the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit according to the assigned weights. The hierarchical control unit can also estimate the behavior strategy underlying the human decision-making.

The hierarchical control unit may determine the model selection probability using the transition rates between the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit, which the MF or MB reinforcement learning unit calculates after receiving the variable task variables (goal value, state-space size, and state-transition uncertainty) from the variable environment.

The hierarchical control unit may then compute an action value by integrating the model selection probabilities of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit, and convert the action value into an action by way of an action selection probability.

In step S130, the hierarchical control unit may update the weights of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit through transition rates in two directions, which are functions of the reliability of the MF and MB reinforcement learning units.

The system and method for estimating human behavior patterns and behavior strategies using meta-reinforcement learning according to an embodiment are described in more detail below.

FIG. 7 is a diagram for explaining a task design according to an embodiment.

FIG. 7(a) shows the two-stage Markov decision task. More specifically, following a transition governed by a given state-transition probability p, participants choose among two to four options (e.g., a left (L) or right (R) action) and thereby move from one state 710 to another state 711. The probability of successfully reaching a desired state is proportional to the accuracy with which the state-transition probability is estimated, and it is bounded by the entropy of the true state-transition distribution. For example, in the highest-entropy case, the probability of a successful transition to the desired state cannot exceed 0.5.

FIG. 7(b) describes the experimental conditions. Reference numeral 720 denotes the state-transition probability: the low and high state-transition uncertainty conditions each correspond to a different state-transition probability. Reference numeral 721 denotes choice availability: the low and high state-space complexity conditions correspond to the cases in which two and four choices are available, respectively. Only two choices are ever available in the first state, and two or four options are available in the following state depending on the complexity condition.

Referring to FIG. 7(c), participants make two sequential choices to obtain coins of different colors (silver, blue, and red) 730 whose values change across trials. On each trial, participants are informed of the "currency", that is, the current value of each coin 730. In each of the following two states (represented by fractal images), they make a choice by pressing one of the available buttons (L1, L2, R1, R2). Choice availability information 731 may be presented at the bottom of the screen, where bold gray circles 732 and light gray circles 733 indicate available and unavailable choices, respectively.

FIG. 7(d) illustrates the task structure; each gray circle represents a state. Bold arrows and solid lines represent the participants' choices (L1, L2, R1, R2) and the subsequent state transitions according to the state-transition probability, respectively. Each outcome state (states 4-11) is associated with a reward (a colored coin, or no coin, which is shown as a gray mosaic image). The reward probability is 0.8.

To examine the role of uncertainty and complexity in arbitration control, a new two-stage Markov decision task (MDP) was designed in which the degree of uncertainty and complexity varies, as shown in FIG. 7(a). As shown in FIG. 7(b), two task variables, state-transition uncertainty and state-space complexity, are manipulated systematically across blocks of trials. State-transition uncertainty is controlled through the state-action-state transition probability, which switches between two conditions: high uncertainty (0.5 vs. 0.5) and low uncertainty (0.9 vs. 0.1).
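The two uncertainty conditions can be compared numerically. The following is a minimal sketch, assuming a binary next-state distribution; the function names are illustrative. It computes the entropy of each condition and the best achievable probability of reaching a desired state when the true distribution is known.

```python
import math


def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def best_success_probability(probs):
    """Upper bound on reaching a desired next state: pick the most likely one."""
    return max(probs)


low_uncertainty = (0.9, 0.1)
high_uncertainty = (0.5, 0.5)

for name, probs in (("low", low_uncertainty), ("high", high_uncertainty)):
    print(name, round(entropy_bits(probs), 3), best_success_probability(probs))
```

The high-uncertainty condition has the maximum entropy for a binary outcome (1 bit), so even a perfect estimate of the transition probabilities cannot raise the success probability above 0.5.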

Switching between these two uncertainty conditions is designed to change the average magnitude of state prediction errors (SPE), which in turn translates into changes in the reliability of the model-based (MB) system. For example, the high-uncertainty condition produces large SPEs, which inherently degrade MB prediction performance. In the low-uncertainty condition, the SPE decreases or stays low on average because the MB system improves its estimate of the state-action-state transition probability. The performance of the model-free (MF) system, by contrast, is less affected by the degree of state-transition uncertainty.
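The effect of the two conditions on the average SPE can be illustrated with a short calculation. This sketch assumes the SPE on a trial is one minus the estimated probability of the state actually reached, a common convention that is not stated in the text; `expected_spe` is an illustrative name.

```python
def expected_spe(true_probs, estimated_probs):
    """Average SPE when next states follow true_probs and the learner holds estimated_probs."""
    return sum(p * (1.0 - t) for p, t in zip(true_probs, estimated_probs))


low, high, naive = (0.9, 0.1), (0.5, 0.5), (0.5, 0.5)

print(expected_spe(low, naive))   # before learning, low-uncertainty condition
print(expected_spe(low, low))     # after the estimate matches the true probabilities
print(expected_spe(high, high))   # high-uncertainty condition, even with a perfect estimate
```

Under this convention the average SPE in the low-uncertainty condition falls from 0.5 to about 0.18 as the estimate improves, whereas in the high-uncertainty condition it stays at 0.5 regardless of learning.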

The second variable, the number of available choices, is intended to manipulate task complexity. The total number of available choices is two in the low-complexity condition and four in the high-complexity condition. To keep the state-space representation from becoming too complex, the number of available choices was limited to two in the first stage of each trial and set to two or four in the second stage. Manipulating choice availability varies the number of ways to reach each goal, so that the difficulty of a trial ranges from easy to very hard. The design therefore provides four different condition types (low/high x uncertainty/complexity). Participants make sequential choices to obtain coins of different colors.

Another feature of the decision process (MDP), as shown in FIG. 7(c), is that on each trial participants can act to obtain one of three different coins: silver, red, or blue. On each trial, the relative value of each coin was assigned flexibly as an exchange rate into real-world currency (US cents), which was revealed at the start of the trial.

For example, on a given trial a silver coin might pay 1 US cent, a red coin 9 US cents, and a blue coin 3 US cents if won, and this assignment may change on the next trial. This design feature is intended to induce trial-by-trial changes in goal values, which in turn produce variation in the reward prediction error and in the reliability of the model-free (MF) system across trials.
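How changing coin values produce variation in the reward prediction error (RPE) can be shown with a toy calculation. This is a sketch under simplifying assumptions: a single cached value for one outcome state, a delta-rule update with learning rate `lr`, deterministic reward delivery (the 0.8 reward probability is ignored), and illustrative value schedules.

```python
def run_trials(coin_value_schedule, lr=0.5):
    """Return the RPE on each trial for one cached value tracking the coin's payoff."""
    q = 0.0
    rpes = []
    for value in coin_value_schedule:
        rpe = value - q      # reward prediction error on this trial
        q += lr * rpe        # delta-rule update of the cached value
        rpes.append(rpe)
    return rpes


stable = run_trials([9, 9, 9, 9])
changing = run_trials([9, 1, 9, 1])
print(stable)    # [9.0, 4.5, 2.25, 1.125]
print(changing)  # [9.0, -3.5, 6.25, -4.875]
```

When the coin value is fixed the RPE shrinks toward zero, but when the value changes from trial to trial the RPE stays large, which is the kind of variation a goal-agnostic MF learner would experience in this task.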

Twenty-four adult participants (20 female, aged 19 to 55) performed the task, and 22 of them underwent functional magnetic resonance imaging (fMRI) scanning. The participants' task performance, measured by the total amount of reward earned and the proportion of optimal choices, was significantly above chance level in every condition (t-test; p<1e-5).

FIG. 8 is a diagram for explaining a computational model of arbitration control that incorporates uncertainty and complexity according to an embodiment.

FIG. 8 illustrates a computational model of arbitration control that incorporates uncertainty and complexity. This computational model may be the system for estimating human behavior patterns and behavior strategies using meta-reinforcement learning according to an embodiment, or it may be included in that system.

The circle and arrows represent a two-state dynamic transition model in which the current state depends on the previous state (endogenous variable) and on inputs from the environment (exogenous variables). The environmental inputs comprise the state transition, which yields the state prediction error (SPE); the reward, which yields the reward prediction error (RPE); and the task complexity. The arrows represent the transition rate from model-based (MB) RL to model-free (MF) RL and vice versa, each of which is a function of the SPE, the RPE, and the task complexity. The circle represents the state, defined as the probability of choosing MB RL (P_MB). Q(s,a) denotes the value of the currently available action (a) in the current state (s). This value is subsequently converted into an action, expressed as the action selection probability P(a|s).

In the computational model of arbitration control incorporating uncertainty and complexity according to an embodiment, the dynamic interaction between model-based (MB) RL and model-free (MF) RL is based on a two-state transition model. This type of recursive structure has previously been shown to account well for the arbitration process between MB RL and MF RL in both behavioral data and fMRI data.

In this model, each state depends on the previous state (endogenous input) and on inputs from the environment (exogenous variables) such as the state, the reward, and the perceived task complexity. The preference between model-based (MB) RL and model-free (MF) RL, expressed as the model selection probability (P_MB), is a function of prediction uncertainty and task complexity. Prediction uncertainty denotes the estimation uncertainty about state-action-state transitions and complexity.

It is computed from the state prediction error (SPE) and the reward prediction error (RPE), which are the key variables for model-based (MB) learning and model-free (MF) learning, respectively. It is specifically assumed here that task complexity affects the transitions between MB and MF. When the environment is completely stable (i.e., fixed state-transition uncertainty and a fixed level of task complexity), the model converges to a stable mixture of MB RL and MF RL.
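The convergence to a stable mixture can be illustrated with a two-state transition model. The differential form used below, dP_MB/dt = alpha * (1 - P_MB) - beta * P_MB, is an assumption made for this sketch, as are the fixed rates `alpha` (MF -> MB) and `beta` (MB -> MF); in the described model these rates depend on SPE, RPE, and task complexity, and are held constant here to mimic a completely stable environment.

```python
def step_p_mb(p_mb, alpha, beta, dt=0.1):
    """One Euler step of dP_MB/dt = alpha * (1 - P_MB) - beta * P_MB."""
    return p_mb + dt * (alpha * (1.0 - p_mb) - beta * p_mb)


def simulate(p0, alpha, beta, steps=500, dt=0.1):
    """Iterate the two-state dynamics from an initial P_MB with fixed rates."""
    p = p0
    for _ in range(steps):
        p = step_p_mb(p, alpha, beta, dt)
    return p


alpha, beta = 0.6, 0.2             # illustrative fixed transition rates
fixed_point = alpha / (alpha + beta)
print(simulate(0.1, alpha, beta), simulate(0.9, alpha, beta), fixed_point)
```

Under this assumed form, P_MB settles at alpha / (alpha + beta) regardless of its starting value, so the mixture of the two systems is determined entirely by the ratio of the two transition rates.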

The process of the computational model according to an embodiment is as follows.

First, in response to the agent's action on each trial, the environment provides the model with the state-action-state transition, the coin values, and the task complexity. These are used to compute the transition rates (MB -> MF, MF -> MB), which in turn determine the model selection probability P_MB.

Second, the model integrates the model-based (MB) and model-free (MF) value estimates to compute an overall integrated action value (Q(s,a) in FIG. 8), which is subsequently converted into an action (the action selection probability P(a|s) in FIG. 8). This framework makes it possible to formally implement various hypotheses about the effects of uncertainty and complexity on reinforcement learning (RL). For example, the configuration of the model that best explains the participants' choice behavior specifies how people combine MB RL and MF RL to adjust their behavior to the uncertainty and complexity of the environment.

As described above, the variable environment referred to in the present invention reflects real human life, and the invention can therefore be applied broadly to predicting the human decisions that arise in such environments. In particular, predicting human behavior helps improve the user experience (UX) of software for human convenience. The embodiments can be applied in various ways, as follows.

For example, a navigation system in a city visited for the first time can estimate the user's behavioral intention and provide additional information. As another example, the embodiments can help users adapt when an existing administrative process, such as a document-issuing procedure, is replaced by an unattended system or gains new functions. In Internet of Things (IoT) infrastructure, when a new IoT device is added or an existing device gains functions, human behavior can be predicted to make adaptation and use more intuitive. Likewise, when a user who owns an existing product purchases the latest one, the user can be helped to discover the additional functions of the new product more quickly.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will recognize that it may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results can be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A system for estimating human behavior patterns and behavior strategies using meta-reinforcement learning, comprising:
a model-free (MF) reinforcement learning unit that learns a reward function;
a model-based (MB) reinforcement learning unit that learns the reward function and a state-transition function and dynamically interacts with the model-free (MF) reinforcement learning unit; and
a hierarchical control unit that predicts human decision-making by hierarchically controlling the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit through weights dynamically assigned to them.

The system of claim 1,
wherein the hierarchical control unit updates the weights of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit through transition rates in two directions,
the transition rates being functions of the reliability of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit.

The system of claim 1,
wherein the hierarchical control unit predicts the human decision-making and estimates the behavioral strategy underlying the human decision-making.

The system of claim 1,
wherein the hierarchical control unit predicts human decision-making by hierarchically controlling the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit according to at least one of the changing task variables, namely goal value, state-space size, and state-transition uncertainty.

The system of claim 4,
wherein the model-free (MF) reinforcement learning unit reflects the goal value and consists of multiple goal-dependent model-free (MF) reinforcement learning agents that provide multiple policies according to multiple goals.

The system of claim 4,
wherein the model-free (MF) reinforcement learning unit or the model-based (MB) reinforcement learning unit receives the changing task variables, namely goal value, state-space size, and state-transition uncertainty, from a changing environment and calculates a transition probability between the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit, and
wherein the hierarchical control unit determines a model selection probability from the transition probability.

The system of claim 6,
wherein the hierarchical control unit integrates the model selection probabilities of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit to calculate an action value, and converts the action value into an action by providing an action selection probability based on the action value.

A method of estimating human behavior patterns and behavior strategies using meta-reinforcement learning, comprising:
dynamically assigning, in a hierarchical control unit of a system for estimating human behavior patterns and behavior strategies using meta-reinforcement learning, weights to a model-free (MF) reinforcement learning unit and a model-based (MB) reinforcement learning unit; and
predicting human decision-making by hierarchically controlling the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit according to the weights assigned by the hierarchical control unit.

The method of claim 8,
further comprising updating the weights of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit through transition rates in two directions,
the transition rates being functions of the reliability of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit.

The method of claim 8,
wherein the predicting of human decision-making through hierarchical control comprises predicting the human decision-making and estimating the behavioral strategy underlying the human decision-making.

The method of claim 8,
wherein the predicting of human decision-making through hierarchical control comprises hierarchically controlling the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit according to at least one of the changing task variables, namely goal value, state-space size, and state-transition uncertainty.

The method of claim 11,
wherein the model-free (MF) reinforcement learning unit reflects the goal value and consists of multiple goal-dependent model-free (MF) reinforcement learning agents that provide multiple policies according to multiple goals.

The method of claim 11,
wherein the predicting of human decision-making through hierarchical control comprises determining a model selection probability from a transition probability between the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit,
the transition probability being calculated by the model-free (MF) reinforcement learning unit or the model-based (MB) reinforcement learning unit upon receiving the changing task variables, namely goal value, state-space size, and state-transition uncertainty, from a changing environment.

The method of claim 13,
wherein the predicting of human decision-making through hierarchical control comprises integrating the model selection probabilities of the model-free (MF) reinforcement learning unit and the model-based (MB) reinforcement learning unit to calculate an action value, and converting the action value into an action by providing an action selection probability based on the action value.
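
The arbitration recited in the claims (reliability-dependent transition rates in two directions, a model selection probability, an integrated action value, and an action selection probability) can be illustrated with a small numerical sketch. The Python below is not taken from the patent. The logistic form of the transition rates, the single-step update of the MB selection probability, the weighted-sum integration, and the softmax action rule are all assumptions chosen for illustration, as are the function names `transition_rate`, `update_mb_probability`, `integrate_values`, and `softmax` and the parameters `max_rate`, `steepness`, `dt`, and `inverse_temperature`.

```python
import math


def transition_rate(reliability, max_rate=1.0, steepness=5.0):
    """Rate of switching away from a learner (assumed logistic form).

    The more reliable a learner is, the lower the rate of leaving it.
    """
    return max_rate / (1.0 + math.exp(steepness * reliability))


def update_mb_probability(p_mb, rel_mf, rel_mb, max_rate=1.0, steepness=5.0, dt=1.0):
    """One update step of the probability of selecting the MB learner.

    alpha is the MF -> MB rate (depends on MF reliability) and
    beta is the MB -> MF rate (depends on MB reliability).
    """
    alpha = transition_rate(rel_mf, max_rate, steepness)
    beta = transition_rate(rel_mb, max_rate, steepness)
    p_new = p_mb + dt * (alpha * (1.0 - p_mb) - beta * p_mb)
    return min(1.0, max(0.0, p_new))  # keep the result a valid probability


def integrate_values(q_mf, q_mb, p_mb):
    """Combine MF and MB action values, weighted by model selection probability."""
    return [p_mb * b + (1.0 - p_mb) * f for f, b in zip(q_mf, q_mb)]


def softmax(values, inverse_temperature=1.0):
    """Turn action values into action selection probabilities."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(inverse_temperature * (v - peak)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical reliabilities: the MB learner is currently more reliable.
    p_mb = update_mb_probability(0.5, rel_mf=0.2, rel_mb=0.9)
    q = integrate_values(q_mf=[1.0, 0.0], q_mb=[0.0, 2.0], p_mb=p_mb)
    print(p_mb, q, softmax(q, inverse_temperature=2.0))
```

In this sketch the weight shifts toward whichever learner is currently more reliable, and that weight decides how strongly each learner's action values influence the predicted choice. Any concrete system would need to fit the rate and temperature parameters to observed behavior.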