KR101932835B1

KR101932835B1 - An apparatus for selecting action and method thereof, computer-readable storage medium

Info

Publication number: KR101932835B1
Application number: KR1020170014504A
Authority: KR
Inventors: 세르게이; 이세희; 이지형
Original assignee: 성균관대학교산학협력단
Priority date: 2017-02-01
Filing date: 2017-02-01
Publication date: 2019-03-20
Also published as: KR20180089769A

Abstract

A method of determining an action through reinforcement learning based on Neural Fitted Q-Iteration (NFQ), performed by a processor, is provided. The method may include: training an artificial neural network, based on a plurality of state-transition sample data, on a Q-value function that takes a current state and an action as inputs and outputs the Q value of that action; computing, based on the Q-value function, the Q value of each of a plurality of actions in a given state; and determining the action with the largest Q value among the plurality of actions as the next action. The training step may determine the transition cost to the next state based on the length of the sequence of sample data being learned. As a result, the artificial neural network can converge faster and produce more accurate results.

Description

APPARATUS FOR SELECTING ACTION AND METHOD THEREOF, COMPUTER-READABLE STORAGE MEDIUM

The present invention relates to an apparatus and method for determining an action, and more particularly to an apparatus and method for determining an action using an artificial neural network trained through reinforcement learning.

NFQ (Neural Fitted Q-Iteration) is an algorithm that operates on the basis of Q-learning. Conventional Q-learning can be used only in an online learning environment and is difficult to apply otherwise. To address this, several techniques for learning an optimal policy have been proposed, and NFQ is one of the most widely used. Its methodology is very similar to function approximation in online reinforcement learning: rather than storing the Q value of each (state, action) pair in a table, it uses an artificial neural network to learn an approximation of the Q values. However, when the artificial neural network is trained and the training data contain no transition cost, the transition cost must be determined manually using expert knowledge.

Specifically, when the artificial neural network is trained in the conventional NFQ technique, the reward of the final state is used as the training target if the sample being learned leads to a final state; otherwise, the sum of the transition cost and the Q value of the next state is used. When no transition cost is given, it has to be declared manually, and a poor declaration slows the convergence of the artificial neural network.

Korean Registered Patent No. 1171054 ("Apparatus and method for adaptive learning of a mobile terminal grip pattern", Samsung Electronics Co., Ltd.)

An object of the present invention, devised to solve the above problem, is to provide a more efficient method of determining an action through NFQ-based reinforcement learning by setting the transition cost based on the reward of the final state and the length of the sequence of training data.

Another object of the present invention, devised to solve the above problem, is to provide a more efficient apparatus for determining an action through NFQ-based reinforcement learning by setting the transition cost based on the reward of the final state and the length of the sequence of training data.

However, the problems to be solved by the present invention are not limited to these and may be extended in various ways without departing from the spirit and scope of the present invention.

To achieve the above object, an action determination method according to an embodiment of the present invention is a method, performed by a processor, of determining an action through reinforcement learning based on NFQ (Neural Fitted Q-Iteration), the method comprising: training an artificial neural network, based on a plurality of state-transition sample data, on a Q-value function that takes a current state and an action as inputs and outputs the Q value of the action; computing, based on the Q-value function, the Q value of each of a plurality of actions in a given state; and determining the action with the largest Q value among the plurality of actions as the next action, wherein the training step may determine the transition cost to the next state based on the length of the sequence of sample data being learned.

According to one aspect, when the next state is a final state, the training step may set the reward of the next state as the Q value.

According to one aspect, when the next state is not a final state, the training step may set the Q value based on the transition cost to the next state and the maximum Q value of the next state.

According to one aspect, when the next state is not a final state, the training step may set the Q value to the sum of the transition cost to the next state and the discounted maximum Q value of the next state.

According to one aspect, the transition cost may be determined based on the length of the sequence of sample data being learned and the final-state reward of that sequence.

According to one aspect, the transition cost may be determined as the final-state reward of the sequence of sample data being learned, divided by the length of the sequence, minus 1.

According to one aspect, the final-state reward may be a normalized value.

According to one aspect, the state may represent the activities performed so far.

According to one aspect, the action may represent the activity to be performed now.

To solve the above problem, an action determination apparatus according to another embodiment of the present invention is an apparatus for determining an action through reinforcement learning based on NFQ (Neural Fitted Q-Iteration). The apparatus includes a processor configured to: train an artificial neural network, based on a plurality of state-transition sample data, on a Q-value function that takes a current state and an action as inputs and outputs the Q value of the action; compute, based on the Q-value function, the Q value of each of a plurality of actions in a given state; and determine the action with the largest Q value among the plurality of actions as the next action, wherein the training may determine the transition cost to the next state based on the length of the sequence of sample data being learned.

According to one aspect, when the next state is a final state, the training may set the reward of the next state as the Q value.

According to one aspect, when the next state is not a final state, the training may set the Q value based on the transition cost to the next state and the maximum Q value of the next state.

According to one aspect, when the next state is not a final state, the training may set the Q value to the sum of the transition cost to the next state and the discounted maximum Q value of the next state.

To solve the above problem, a computer-readable storage medium according to another embodiment of the present invention stores instructions that, in order to have action determination through NFQ (Neural Fitted Q-Iteration)-based reinforcement learning performed, cause a processor included in the computer to: train an artificial neural network, based on a plurality of state-transition sample data, on a Q-value function that takes a current state and an action as inputs and outputs the Q value of the action; compute, based on the Q-value function, the Q value of each of a plurality of actions in a given state; and determine the action with the largest Q value among the plurality of actions as the next action, wherein the training may determine the transition cost to the next state based on the length of the sequence of sample data being learned.

The disclosed technology may have the following effects. This does not mean, however, that a particular embodiment must include all of the following effects or only the following effects, and the scope of the disclosed technology should not be understood as limited by them.

According to the method and apparatus for determining an action through NFQ (Neural Fitted Q-Iteration)-based reinforcement learning according to an embodiment of the present invention described above, setting the transition cost based on the reward of the final state and the length of the sequence of training data allows action determination using an artificial neural network to be performed more efficiently.

Specifically, training the artificial neural network on the Q-value function from the sample data can converge faster and yield more accurate results, enabling optimized action determination.

FIG. 1 is a flowchart of Q-learning.
FIG. 2 shows an example of an artificial neural network.
FIG. 3 shows the structure of an artificial neural network for NFQ (Neural Fitted Q-Iteration).
FIG. 4 is a flowchart of an action determination method according to an embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of an action determination apparatus according to another embodiment of the present invention.
FIG. 6 shows a performance comparison of action determination according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

This is not intended to limit the present invention to particular embodiments, however, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of them.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but intervening components may also be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terminology used in this application serves only to describe particular embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which this invention belongs. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Preferred embodiments of the present invention are described in more detail below with reference to the accompanying drawings. To aid overall understanding, the same reference numerals are used for the same components throughout the drawings, and duplicate descriptions of the same components are omitted.

Reinforcement learning does not determine the optimal action in the current state by computation; instead, it learns the optimal action in each state little by little from experience gained through repeated trial and error. The system learns its objective in real time through interaction with the given environment. The concept of learning is a foundation of artificial intelligence theory, and the approach of having a processor carry out such learning may be called machine learning.

The method conventionally used most often in reinforcement learning is Q-learning, which is simple to implement because it learns, without model information, a table of Q values representing the suitability of actions. Q-learning has since been extended to continuous state and action spaces, treating states and actions as continuous.

FIG. 1 is a flowchart of Q-learning. Reinforcement learning does not determine the optimal action in the current state by computation, but learns the optimal action in each state from experience based on repeated trial and error. Q-learning, the most widely used form of reinforcement learning, learns a control policy that maximizes reward through interaction with the environment. As shown in FIG. 1, Q-learning proceeds through continuous interaction between an agent and an environment. The agent is the entity that learns, and everything other than the agent is the environment. The agent takes an action (a); the environment responds to the action by providing a transition to a new state (s) and a reward (r); and the agent again selects an action so as to maximize the reward obtainable in the future.

Conventional Q-learning stores Q values in a lookup table and learns over discrete states and actions. With a lookup table, however, problem solving is constrained when there are few state-action pairs, that is, few sample data to learn from, and when there are too many sample data, a large amount of memory is required and learning slows markedly. Methods that approximate the Q values directly with an artificial neural network have therefore been proposed.
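For comparison with the neural approach, the lookup-table form of Q-learning can be sketched as follows. This is a minimal illustration of the standard tabular update, not code from the patent; the function name, the learning rate `alpha`, and the discount factor `gamma` are assumptions chosen for the example.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, next_actions,
                      alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q value.

    alpha (learning rate) and gamma (discount factor) are illustrative values.
    """
    # Best Q value reachable from the next state (0.0 if no actions are left).
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    # Move the stored estimate a fraction alpha toward the target.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# The lookup table maps (state, action) pairs to Q values, defaulting to 0.0.
q_table = defaultdict(float)
new_q = q_learning_update(q_table, "s0", "a0", reward=1.0,
                          next_state="s1", next_actions=["a0", "a1"])
```

Every distinct (state, action) pair needs its own table entry, which is the memory cost the text describes.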

An artificial neural network is an artificial intelligence algorithm modeled on the structure and function of the human or animal brain. It can approximate a mathematical function relating inputs to outputs, where the inputs and outputs may be real numbers and the approximated function may be nonlinear. FIG. 2 shows an example of an artificial neural network. As shown in FIG. 2, the widely used multilayer artificial neural network has one or more hidden layers between an input layer and an output layer. The numbers of input neurons, output neurons, and hidden layers may vary with the task to be solved or the kind of function to be approximated.
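The forward pass of such a network can be sketched in a few lines. The example assumes a single hidden layer with a tanh activation and a single linear output; the patent does not specify the layer sizes or the activation, so these are illustrative choices, and `mlp_forward` is a hypothetical name.

```python
import math


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute the scalar output of a one-hidden-layer network.

    hidden_weights holds one weight list per hidden neuron; tanh is an
    assumed activation, not one named in the source.
    """
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Linear output neuron, suitable for a real-valued target such as a Q value.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Two inputs, two hidden neurons with zero weights: the output is the bias.
output = mlp_forward([1.0, 0.0],
                     hidden_weights=[[0.0, 0.0], [0.0, 0.0]],
                     hidden_biases=[0.0, 0.0],
                     output_weights=[1.0, 1.0],
                     output_bias=0.25)
```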

Among the methods of approximating Q values described above, the NFQ (Neural Fitted Q-Iteration) algorithm in particular is a batch reinforcement learning algorithm for approximating an optimal action-selection policy. FIG. 3 shows the structure of an artificial neural network for NFQ. The optimal action-selection policy may amount to determining a sequence made up of at least some of a plurality of actions that can be performed to achieve a final goal. Here, the state may represent the activities already performed so far, and the action may represent the activity to be performed now. A Q-value function can be learned by the artificial neural network from the training sample data. Based on the trained Q-value function, the action with the highest Q value in the current state is determined, and repeating this determination in each subsequent state yields the sequence of optimal actions for the final goal.
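The repeated greedy selection described here can be sketched as follows. The sketch assumes the state is the tuple of activities performed so far and that a trained Q-value function is available as a callable `q_function(state, action)`; that callable, `toy_q`, and the activity names are hypothetical stand-ins for a trained network.

```python
def greedy_action_sequence(q_function, actions, max_steps):
    """Build a sequence by repeatedly taking the action with the largest Q value."""
    state = ()  # state = activities performed so far (assumed encoding)
    for _ in range(max_steps):
        best_action = max(actions, key=lambda a: q_function(state, a))
        state = state + (best_action,)
    return list(state)


def toy_q(state, action):
    """Stand-in for a trained network: penalize repeats, otherwise fixed scores."""
    scores = {"read": 3.0, "exercise": 2.0, "quiz": 1.0}
    return -1.0 if action in state else scores[action]


sequence = greedy_action_sequence(toy_q, ["read", "exercise", "quiz"], max_steps=3)
```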

As shown in FIG. 3, the NFQ algorithm requires suitable representations of the states, the actions, and the reward of the given task in order to reach the optimal goal. The artificial neural network may take a state (for example, Input 1) and an action (for example, Input 2) as inputs and a Q value as the target value. Typically these attributes can be derived from the task itself; otherwise they must be defined manually by an expert. Defining them manually is very difficult, however, and the expert's knowledge may be insufficient. If incorrect information is supplied, the algorithm may behave incorrectly and the artificial neural network may converge more slowly. In particular, because the transition cost between states is the immediate reward for each step from one state to the next, it affects the convergence speed of the algorithm more than the other attributes do.

The present invention discloses a method of determining the transition cost for the NFQ algorithm. NFQ may refer to an artificial neural network that takes a state and an action as inputs and a Q value as the target value, and approximates Q values in batch mode. NFQ requires several rounds of updates to converge. When the reward in the final state is known, however, the cost of each state transition on the path to the final goal can be approximated, which allows the artificial neural network to converge faster and perform better.

Turning first to the conventional NFQ (Neural Fitted Q-Iteration) algorithm, an artificial neural network has the advantage that it can approximate functions with high accuracy and generalize well from few training samples. This ability can be used to represent the Q-value function used in Q-learning. By expressing the Q-value estimation function as the target function of the artificial neural network, the network can be trained on the Q value of each state. Then, by repeating the NFQ update and training, the Q values of all states can be approximated. A major advantage is that an approximate Q value can readily be obtained even for a new state that is not in the training data set. New policies can therefore be generated, and feedback given, on the basis of existing state transitions. The NFQ algorithm is as follows.

Here, NN denotes the artificial neural network, I the number of instances in the data set, (s, a) a state-action pair from the data set, and R(s, a, s') the transition cost (or transition reward) from the current state s to the next state s' under an action a from the set A of all possible actions.
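The algorithm listing itself is not reproduced in this text. As a rough guide only, the outer loop of a fitted Q-iteration scheme of this kind can be sketched as follows. This is not the patent's listing: the tuple layout of `transitions`, the `fit` callback, the iteration count, and `gamma` are all assumptions, and `fit_table` is a deliberately trivial stand-in for training the neural network NN.

```python
def nfq(transitions, actions, fit, iterations=10, gamma=0.9):
    """Sketch of a fitted Q-iteration loop.

    transitions: list of (s, a, cost, s_next, next_is_final, final_reward).
    fit: callable that takes [((s, a), target), ...] and returns q(s, a);
         in NFQ this would train the neural network on the pattern set.
    """
    def q(s, a):
        return 0.0  # initial Q estimate before any training

    for _ in range(iterations):
        patterns = []
        for s, a, cost, s_next, next_is_final, final_reward in transitions:
            if next_is_final:
                target = final_reward
            else:
                target = cost + gamma * max(q(s_next, b) for b in actions)
            patterns.append(((s, a), target))
        q = fit(patterns)  # batch update on the whole pattern set
    return q


def fit_table(patterns):
    """Stand-in regressor that memorizes targets instead of training a network."""
    table = dict(patterns)
    return lambda s, a: table.get((s, a), 0.0)


# Two-step chain: s0 -> s1 (cost -0.5), then s1 -> final state with reward 1.0.
samples = [("s0", "a", -0.5, "s1", False, 0.0),
           ("s1", "a", 0.0, "s2", True, 1.0)]
q_trained = nfq(samples, actions=["a"], fit=fit_table)
```

In this toy chain the value of the final reward propagates back one step per iteration, which is why the loop has to be repeated.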

As shown above, the target value used to train the artificial neural network in conventional NFQ is given by Equation 1 below.
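Equation 1 is not reproduced in this text. A reconstruction consistent with the surrounding description, assuming that \gamma denotes the discount factor and Q(s', a') the network's current estimate, would read:

```latex
\text{target}(s, a) =
\begin{cases}
R(s'), & \text{if } s' \text{ is a final state} \\
R(s, a, s') + \gamma \max_{a' \in A} Q(s', a'), & \text{otherwise}
\end{cases}
```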

That is, in setting the Q value that serves as the target for action a in the current state s, when the next state s' is a final state, the reward R(s') of the final state is set as the Q value used to train the artificial neural network. When the next state s' is not a final state, the target value may be set to the sum of the transition cost R(s, a, s') from the current state s to the next state s' under action a, and the suitably discounted maximum Q value over the actions a' in the next state s'. In the non-final case, the maximum Q value in the next state is a delayed reward rather than an immediate one, so it is reflected through an appropriately chosen discount factor.

As noted above, when the transition cost does not appear in the training data set itself, it has to be set appropriately using expert knowledge. This is difficult, and a poorly chosen value can markedly slow the convergence of the artificial neural network and reduce the accuracy of the Q-value approximation.

According to an embodiment of the present invention, the transition cost is determined automatically from the length of the sequence of data samples being learned, which can speed up the convergence of the artificial neural network and improve the accuracy of the approximation.

According to an embodiment of the present invention, the transition cost may be determined according to Equation 2 below.
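Equation 2 is not reproduced in this text. Based on the stated rule (final-state reward divided by sequence length, minus 1), one possible form is given here; the symbols are assumptions introduced for illustration, with \sigma the sequence being learned, |\sigma| its length, and \hat{R}_{\text{final}}(\sigma) its normalized final-state reward:

```latex
R(s, a, s') = \frac{\hat{R}_{\text{final}}(\sigma)}{|\sigma|} - 1
```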

Here, the two quantities in Equation 2 are the reward of the final state of the sequence currently being learned, which is one of all the sequences contained in the sample data, and the length of that sequence.

That is, when the next state s' is a final state, the reward r of the final state is set as the target value for training the artificial neural network, as in the conventional NFQ algorithm. When the next state s' is not a final state, that is, when it is a transition state, the transition cost to the next state can be computed from the length of the sequence of sample data being learned. More specifically, as Equation 2 shows, the average reward of the states in the current sequence can be determined from the final-state reward of the sequence being learned and the length of that sequence. The final-state reward is normalized and divided by the sequence length, and 1 is then subtracted to give the transition cost of each state. The transition cost is thus always negative, with a larger absolute value for a higher cost and a smaller absolute value for a lower cost.
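The computation described here can be sketched as a small function. The patent states only that the final-state reward is normalized; min-max scaling to the range [0, 1] is an assumption made for this example, as are the function and parameter names. Under that assumption the result lies in [-1, 0], reaching 0 only for a one-step sequence with the maximum reward.

```python
def transition_cost(final_reward, sequence_length, reward_min=0.0, reward_max=1.0):
    """Return (normalized final reward / sequence length) - 1.

    Min-max normalization to [0, 1] is an assumed choice; the source says
    only that the final-state reward is normalized.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if reward_max <= reward_min:
        raise ValueError("reward_max must exceed reward_min")
    normalized = (final_reward - reward_min) / (reward_max - reward_min)
    return normalized / sequence_length - 1.0


# A grade of 80 out of 100 reached after 4 steps, and 100 after 2 steps.
cost_long = transition_cost(80.0, 4, reward_min=0.0, reward_max=100.0)
cost_short = transition_cost(100.0, 2, reward_min=0.0, reward_max=100.0)
```

A short sequence ending in a high reward therefore gets a per-step cost closer to zero than a long sequence ending in a lower reward.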

Table 1 below shows exemplary sample data.

The behavior determination method according to an embodiment of the present invention can be used for educational process mining. As illustrated in Table 1, the final grade can be used as the final reward of an educational process, and sequences of at least some of a plurality of actions toward that grade, for example Diagram, Study Exercise, and Working in Text Editor, can be used as sample data. The sequence of actions performed by each student and each student's final grade can be included in the sample data as a plurality of sequences. The actions performed and their order may differ from student to student. By using the Q-value function learned from the plurality of sample data to determine the optimal action in each state, the sequence of actions that ultimately yields the best final grade can be determined.
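To make the shape of such sample data concrete, the following Python sketch turns one student's activity sequence and final grade into state transition samples. It is an illustration under stated assumptions: the function name `sequence_to_transitions`, the tuple layout, and the choice to represent a state as the tuple of activities performed so far are hypothetical, and the activity names and grade in the usage note are made-up values, not data from Table 1.

```python
from typing import List, Optional, Tuple

# (state, action, next_state, next_is_final, final_reward or None)
Transition = Tuple[Tuple[str, ...], str, Tuple[str, ...], bool, Optional[float]]


def sequence_to_transitions(activities: List[str],
                            final_grade: float) -> List[Transition]:
    """Split one activity sequence into per-step transition samples.

    Assumption: a state is the tuple of activities performed so far, and
    only the last transition of the sequence carries the final grade.
    """
    transitions: List[Transition] = []
    last_index = len(activities) - 1
    for i, action in enumerate(activities):
        state = tuple(activities[:i])           # activities before this step
        next_state = tuple(activities[:i + 1])  # activities including this step
        next_is_final = (i == last_index)
        reward = final_grade if next_is_final else None
        transitions.append((state, action, next_state, next_is_final, reward))
    return transitions
```

For example, `sequence_to_transitions(["Diagram", "Study Exercise", "Working in Text Editor"], 85.0)` yields three samples, the first starting from the empty state `()` and the last marked as reaching the final state with reward 85.0.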

FIG. 4 is a flowchart of a behavior determination method according to an embodiment of the present invention. The method is described below with reference to FIG. 4.

The behavior determination method according to an embodiment of the present invention is performed by a processor and determines actions through reinforcement learning based on Neural Fitted Q-Iteration (NFQ). As shown in FIG. 4, first, an artificial neural network can be trained, based on a plurality of state transition sample data, on a Q-value function that takes a current state and an action as inputs and outputs the Q value for that action (step 410).

As described above, when training the artificial neural network, if the next state in the sample data being learned is the final state, the reward of the next state can be set as the Q value. If the next state is not the final state, however, the Q value used as the target value for training the artificial neural network can be determined, as shown in Equation 1, based on the transition cost to the next state and the maximum Q value of the next state. More specifically, to reflect that the maximum Q value of the next state is a delayed reward rather than an immediate one, the sum of the transition cost to the next state and the discounted maximum Q value of the next state can be set as the Q value.
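The two cases for the training target can be sketched in Python as follows. This is a hedged illustration that follows the verbal description only: the function name `nfq_target`, the callable `q_fn` standing in for the current network, and the default discount factor `gamma=0.9` are assumptions introduced for the example.

```python
from typing import Callable, Hashable, Sequence


def nfq_target(reward: float,
               cost: float,
               next_state: Hashable,
               next_is_final: bool,
               actions: Sequence[Hashable],
               q_fn: Callable[[Hashable, Hashable], float],
               gamma: float = 0.9) -> float:
    """Target Q value for one transition sample.

    q_fn(state, action) is a hypothetical stand-in for the current
    Q-value approximation; gamma is an assumed discount factor.
    """
    if next_is_final:
        # Final state: the target is the reward of the next (final) state.
        return reward
    # Otherwise: transition cost plus the discounted maximum Q value
    # over all candidate actions in the next state.
    best_next_q = max(q_fn(next_state, a) for a in actions)
    return cost + gamma * best_next_q
```

In a fitted Q-iteration loop, targets like these would be computed for every sample with the current network and then used as regression labels for the next round of training.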

Meanwhile, as described above, how the transition cost is determined has a significant effect on the convergence speed of the artificial neural network and the accuracy of the approximated Q value. According to one aspect, the transition cost to the next state can be determined based on the length of the sequence of sample data being learned. The sample data being learned can have a current state and an action in that state, and can have a sequence that contains them. The Q value can be set based on the length of this current sequence and the final-state reward of the sequence. More specifically, as shown in Equation 2, the transition cost can be determined as the final-state reward of the sequence of sample data being learned divided by the length of the sequence, minus 1. Here, the final-state reward can be a normalized value, so the determined Q value can always be negative.

Through this learning process, the behavior determination method according to an embodiment of the present invention can make the artificial neural network converge faster and can produce accurate results.

Referring again to FIG. 4, after the artificial neural network has learned the Q-value function from the plurality of sample data, the Q value for each of a plurality of actions in a given state can be calculated based on the Q-value function (step 420). The state can represent the activities performed so far, and the action can represent the activity to be performed now. Accordingly, for the state given by the activities performed so far, the Q value of each candidate action can be approximated using the Q-value function learned by the artificial neural network.

Thereafter, the action with the largest Q value among the plurality of actions can be determined as the next action (step 430). By repeating this process of selecting the action with the largest Q value as the next action until the final state is reached, the optimal action policy can be determined.
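Steps 420 and 430, repeated until the final state, can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: `greedy_rollout`, `q_fn`, `step_fn`, `is_final_fn`, and the `max_steps` safety bound are hypothetical names and parameters introduced for the example.

```python
from typing import Callable, Hashable, List, Sequence


def greedy_rollout(q_fn: Callable[[Hashable, Hashable], float],
                   actions: Sequence[Hashable],
                   initial_state: Hashable,
                   step_fn: Callable[[Hashable, Hashable], Hashable],
                   is_final_fn: Callable[[Hashable], bool],
                   max_steps: int = 100) -> List[Hashable]:
    """Repeatedly pick the action with the largest Q value until a final state.

    q_fn stands in for the trained Q-value function; step_fn maps
    (state, action) to the next state; max_steps is an assumed safety bound.
    """
    state = initial_state
    chosen: List[Hashable] = []
    for _ in range(max_steps):
        if is_final_fn(state):
            break
        # Step 420: evaluate every candidate action in the current state.
        # Step 430: take the action with the largest Q value.
        best_action = max(actions, key=lambda a: q_fn(state, a))
        chosen.append(best_action)
        state = step_fn(state, best_action)
    return chosen
```

With a state modeled as the tuple of activities performed so far, `step_fn` can simply append the chosen action, for example `lambda s, a: s + (a,)`.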

FIG. 5 is a block diagram showing the configuration of a behavior determination apparatus according to another embodiment of the present invention. As shown in FIG. 5, the behavior determination apparatus 500 according to an embodiment of the present invention determines actions through reinforcement learning based on Neural Fitted Q-Iteration (NFQ), and can include a processor 510 and an artificial neural network 520. The artificial neural network 520 can be configured independently as a system separate from the behavior determination apparatus 500, or can be integrated into the behavior determination apparatus 500 and implemented in software.

The processor 510 can be configured to train the artificial neural network, based on a plurality of state transition sample data, on a Q-value function that takes a current state and an action as inputs and outputs the Q value for that action; to calculate, based on the Q-value function, the Q value for each of a plurality of actions in a given state; and to determine the action with the largest Q value among the plurality of actions as the next action. The processor 510 can include a training unit 511 for training the artificial neural network, a calculation unit 513 for calculating Q values, and a determination unit 515 for determining the next action. The training unit 511, the calculation unit 513, and the determination unit 515 can each be implemented on a separate processor, or at least some of them can be implemented on a single integrated processor.

The specific operation of the behavior determination apparatus 500 according to an embodiment of the present invention can follow the behavior determination method according to an embodiment of the present invention described above.

FIG. 6 shows a performance comparison for behavior determination according to an embodiment of the present invention. As described above, the behavior determination method according to an embodiment of the present invention can be applied to process mining for education. In the experimental example, the sample data being learned included, for each of 115 students, a record of the student's learning activities and a record of the student's final grade. The state represents the number of learning activities performed up to the current point in time, and the action represents the next learning activity. The reward of the final state represents each student's grade for the overall educational process. NFQ was trained by applying the behavior determination method according to an embodiment of the present invention, the trained Q-value function was used to generate new educational process sequences, and an artificial neural network trained on the given grades was used. As shown in FIG. 6, compared with the original data and the conventional NFQ algorithm, the educational process sequences produced by the modified NFQ method according to an embodiment of the present invention yield higher result values.

The behavior determination method according to the present invention described above can be implemented as computer-readable code on a computer-readable recording medium. Computer-readable recording media include all kinds of recording media that store data readable by a computer system. Examples include ROM (Read Only Memory), RAM (Random Access Memory), magnetic tape, magnetic disks, flash memory, and optical data storage devices. The computer-readable recording medium can also be distributed over computer systems connected through a computer network, so that the code is stored and executed in a distributed manner.

Although the invention has been described above with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments. Those skilled in the art will understand that the present invention can be variously modified and changed without departing from the spirit and scope of the invention as set forth in the claims below.

500: Behavior determination apparatus
510: Processor
511: Training unit
513: Calculation unit
515: Determination unit
520: Artificial neural network

Claims

1. A method of determining an action through reinforcement learning based on Neural Fitted Q-Iteration (NFQ), the method being performed by a processor and comprising:
training an artificial neural network, based on a plurality of state transition sample data, on a Q-value function that takes a current state and an action as inputs and outputs a Q value for the action;
calculating, based on the Q-value function, a Q value for each of a plurality of actions in a given state; and
determining, as a next action, the action having the largest Q value among the plurality of actions,
wherein the training comprises determining a transition cost to a next state based on a length of a sequence of the sample data being learned,
wherein the transition cost is determined based on the length of the sequence of the sample data being learned and a final-state reward of the sequence, and
wherein the transition cost is determined as a value obtained by dividing the final-state reward of the sequence of the sample data being learned by the length of the sequence and then subtracting 1.

The method according to claim 1,
wherein the training comprises setting a reward of the next state as the Q value if the next state is a final state.

The method according to claim 1,
wherein the training comprises setting the Q value based on the transition cost to the next state and a maximum Q value of the next state if the next state is not a final state.

The method according to claim 1,
wherein the training comprises setting, as the Q value, a sum of the transition cost to the next state and a discounted maximum Q value of the next state if the next state is not a final state.

delete

The method according to claim 1,
wherein the final-state reward is a normalized value.

The method according to claim 1,
wherein the current state represents activities performed so far.

The method according to claim 1,
wherein the action represents an activity to be performed now.

An apparatus for determining an action through reinforcement learning based on Neural Fitted Q-Iteration (NFQ), the apparatus comprising a processor configured to:
train an artificial neural network, based on a plurality of state transition sample data, on a Q-value function that takes a current state and an action as inputs and outputs a Q value for the action;
calculate, based on the Q-value function, a Q value for each of a plurality of actions in a given state; and
determine, as a next action, the action having the largest Q value among the plurality of actions,
wherein the training comprises determining a transition cost to a next state based on a length of a sequence of the sample data being learned,
wherein the transition cost is determined based on the length of the sequence of the sample data being learned and a final-state reward of the sequence, and
wherein the transition cost is determined as a value obtained by dividing the final-state reward of the sequence of the sample data being learned by the length of the sequence and then subtracting 1.

The apparatus of claim 10,
wherein the training comprises setting a reward of the next state as the Q value if the next state is a final state.

The apparatus of claim 10,
wherein the training comprises setting the Q value based on the transition cost to the next state and a maximum Q value of the next state if the next state is not a final state.

The apparatus of claim 10,
wherein the training comprises setting, as the Q value, a sum of the transition cost to the next state and a discounted maximum Q value of the next state if the next state is not a final state.

delete

The apparatus of claim 10,
wherein the final-state reward is a normalized value.

The apparatus of claim 10,
wherein the current state represents activities performed so far.

The apparatus of claim 10,
wherein the action represents an activity to be performed now.

A computer-readable storage medium storing instructions executable by a processor included in a computer, the instructions causing the processor to determine an action through reinforcement learning based on Neural Fitted Q-Iteration (NFQ), the instructions comprising:
instructions for training an artificial neural network, based on a plurality of state transition sample data, on a Q-value function that takes a current state and an action as inputs and outputs a Q value for the action;
instructions for calculating, based on the Q-value function, a Q value for each of a plurality of actions in a given state; and
instructions for determining, as a next action, the action having the largest Q value among the plurality of actions,
wherein the instructions for training include instructions for determining a transition cost to a next state based on a length of a sequence of the sample data being learned,
wherein the transition cost is determined based on the length of the sequence of the sample data being learned and a final-state reward of the sequence, and
wherein the transition cost is determined as a value obtained by dividing the final-state reward of the sequence of the sample data being learned by the length of the sequence and then subtracting 1.