KR102209917B1

KR102209917B1 - Data processing apparatus and method for deep reinforcement learning

Info

Publication number: KR102209917B1
Application number: KR1020180173851A
Authority: KR
Inventors: 오상윤; 여상호; 이민수
Original assignee: 아주대학교산학협력단
Priority date: 2018-12-31
Filing date: 2018-12-31
Publication date: 2021-01-29
Also published as: KR20200087341A; KR102209917B9

Abstract

The present application relates to a data processing method for deep reinforcement learning. The method includes: (a) storing human demonstration data among target data in a human replay buffer, and storing actor data among the target data in an actor replay buffer; and (b) performing data processing by applying online frame skipping to the human replay buffer in order to sample the human demonstration data stored in the human replay buffer, wherein the online frame skipping may include two types of frame skipping that differ in whether the frame skipping interval is variable.

Description

Data processing device and method for deep reinforcement learning {DATA PROCESSING APPARATUS AND METHOD FOR DEEP REINFORCEMENT LEARNING}

The present application relates to a data processing apparatus and method for deep reinforcement learning. In particular, the present application relates to a data processing apparatus and method for deep reinforcement learning that use human demonstration data, based on dual replay buffer management and online frame skipping, in order to accelerate deep reinforcement learning.

Deep reinforcement learning is a method that combines a deep neural network with a reinforcement learning algorithm.

AlphaGo and the Deep Q-Network (DQN) demonstrate the effectiveness of deep learning on complex decision-making problems. DQN is described, for example, in Document 1 [V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, 2013. [Online]. Available: http://arxiv.org/abs/1312.5602].

Recent advances in deep reinforcement learning have led to breakthroughs in robotics, game AI, and self-driving cars. Deep reinforcement learning algorithms generally require an enormous number of interactions with the environment to achieve good performance. The learning procedure of deep reinforcement learning is therefore computationally expensive compared with other deep learning techniques.

There are two categories of approaches for accelerating the learning process of deep reinforcement learning.

The first category uses actor-critic agents, including Asynchronous Advantage Actor-Critic (A3C). A3C is described, for example, in Document 2 [V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928-1937]. Such an agent is trained on transitions collected by multiple actors running in parallel, and this architecture greatly shortens the overall training time.

The second category uses human demonstration data for deep reinforcement learning and includes the Deep Q-Learning from Demonstration (DQfD) and Normalized Actor-Critic (NAC) models.

DQfD is described, for example, in Document 3 [T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold et al., "Deep Q-learning from demonstrations," arXiv preprint arXiv:1704.03732, 2017]. NAC is described, for example, in Document 4 [Y. Gao, H. Xu, J. Lin, F. Yu, S. Levine, and T. Darrell, "Reinforcement learning from imperfect demonstrations," CoRR, vol. abs/1802.05313, 2018. [Online]. Available: http://arxiv.org/abs/1802.05313].

DQfD adds a supervised loss to the loss model of the existing DQN, and it has shown excellent learning performance not only in the pre-training stage but also in the reinforcement learning stage.

However, because of the supervised loss, DQfD overfits the model to the expert data. The overfitting of DQfD becomes worse when the amount of expert data is limited. NAC attempted to solve the overfitting problem of DQfD with a new loss model, but with a small number of human demonstrations it performs worse than DQfD.

It is well known that using a sufficient amount of expert demonstration data helps accelerate the learning process of deep reinforcement learning and achieve good results. In most cases, however, it is difficult to secure (or generate) enough expert demonstration data. To make matters worse, most recently published papers do not share the human demonstration data used in their performance evaluations, for privacy and/or security reasons. Their results are therefore difficult to reproduce, because the available and/or open datasets do not contain enough high-quality human demonstration data.

For example, although a human demonstration dataset for the Atari 2600 games (ALE environments) is publicly available, it does not contain high-scoring episodes. The ALE environments are described, for example, in Document 5 [M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The arcade learning environment: An evaluation platform for general agents," CoRR, vol. abs/1207.4708, 2012. [Online]. Available: http://arxiv.org/abs/1207.4708]. In contrast, the dataset used in the conventional DQfD study contains episodes with very high scores.

Because collecting high-quality human demonstration data is relatively difficult in practical (that is, real-world) cases, the human demonstration data that is available needs to be utilized more effectively.

The present application is intended to solve the above-described problems of the prior art, and its object is to provide a data processing apparatus and method for deep reinforcement learning that utilize human demonstration data more effectively in order to accelerate deep reinforcement learning.

However, the technical problems to be solved by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for solving the above technical problem, a data processing method for deep reinforcement learning according to an embodiment of the present application includes: (a) storing human demonstration data among target data in a human replay buffer, and storing actor data among the target data in an actor replay buffer; and (b) performing data processing by applying online frame skipping to the human replay buffer in order to sample the human demonstration data stored in the human replay buffer, wherein the online frame skipping may include two types of frame skipping that differ in whether the frame skipping interval is variable.

In addition, the online frame skipping may include, as a first type of frame skipping, FS-ER skipping, which performs skipping while keeping the frame skipping interval constant, and, as a second type of frame skipping, DFS-ER skipping, which performs skipping dynamically by varying the frame skipping interval.

In addition, in step (b), before the online frame skipping is applied, frame states may be generated for the human demonstration data stored in the human replay buffer, and each frame state may be generated by stacking a number of frames corresponding to a frame skipping coefficient.

In addition, in step (b), when the first type of frame skipping is applied, a number of frames corresponding to the frame skipping coefficient may be stacked at every interval corresponding to a preset number of frames.

In addition, in step (b), the second type of frame skipping may be applied in consideration of an action repeat value that indicates the action duration of each frame of the human demonstration data, and the action repeat value may include two skip values.

In addition, in step (b), when the second type of frame skipping is applied, the skip value of each frame may be checked when the frame state is generated, and the frame state may be generated in consideration of the checked skip value.

In addition, the data processing method for deep reinforcement learning according to an embodiment of the present application may further include (c) redefining a return value required to calculate a loss value, the loss value being used when updating a neural network that takes as input the actor data and the human demonstration data sampled by the data processing in step (b). In step (c), the return value may be redefined for each of the two types of frame skipping.

Meanwhile, a data processing apparatus for deep reinforcement learning according to an embodiment of the present application includes: a replay manager that stores human demonstration data among target data in a human replay buffer and stores actor data among the target data in an actor replay buffer; and a data processing unit that performs data processing by applying online frame skipping to the human replay buffer in order to sample the human demonstration data stored in the human replay buffer, wherein the online frame skipping may include two types of frame skipping that differ in whether the frame skipping interval is variable.

In addition, before the online frame skipping is applied, the data processing unit may generate frame states for the human demonstration data stored in the human replay buffer, and each frame state may be generated by stacking a number of frames corresponding to a frame skipping coefficient.

Meanwhile, a deep reinforcement learning apparatus for a neural network according to an embodiment of the present application includes: a data processing apparatus that performs data processing for deep reinforcement learning on target data; and a learning control unit that trains the neural network by deep reinforcement learning, using as inputs to the neural network the human demonstration data and actor data sampled by the data processing of the data processing apparatus. The data processing apparatus includes: a replay manager that stores human demonstration data among the target data in a human replay buffer and stores actor data among the target data in an actor replay buffer; and a data processing unit that performs data processing by applying online frame skipping to the human replay buffer in order to sample the human demonstration data stored in the human replay buffer, wherein the online frame skipping may include two types of frame skipping that differ in whether the frame skipping interval is variable.

The above-described means for solving the problem are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description of the invention.

According to the above-described means for solving the problem, providing a data processing apparatus and method for deep reinforcement learning makes it possible to accelerate deep reinforcement learning by utilizing human demonstration data effectively.

According to the above-described means for solving the problem, online frame skipping is performed on a dual replay buffer that includes a human replay buffer and an actor replay buffer, so that deep reinforcement learning can use all of the human demonstration data without discarding any high-quality human demonstration data. The present application can thereby effectively improve the results and performance of deep reinforcement learning.

According to the above-described means for solving the problem, the human demonstration data and the actor data are not stored together in a single replay buffer but are stored separately in two replay buffers, and on that basis online frame skipping can be applied independently to the human demonstration data stored in the human replay buffer. The present application thus allows all of the human demonstration data to be used as training data for the neural network.

However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.

FIG. 1 is a diagram showing the architecture of conventional DQfD.
FIG. 2 is a diagram schematically showing (a) the architecture of conventional DFDQN and (b) the architecture of FiGAR.
FIG. 3 is a diagram showing example screenshot samples from the Atari Grand Challenge dataset.
FIG. 4 is a block diagram showing a schematic configuration of a data processing apparatus for deep reinforcement learning according to an embodiment of the present application.
FIG. 5 is a diagram schematically showing the architecture of a data processing apparatus for deep reinforcement learning according to an embodiment of the present application.
FIG. 6 is a diagram for explaining the approach by which a data processing apparatus for deep reinforcement learning according to an embodiment of the present application samples human demonstration data.
FIG. 7 is a diagram showing an algorithm for the operation of DFS-ER skipping by a data processing apparatus for deep reinforcement learning according to an embodiment of the present application.
FIG. 8 is a block diagram showing a schematic configuration of a deep reinforcement learning apparatus for a neural network according to an embodiment of the present application.
FIG. 9 is an experimental result of the present application, showing the number of human data items in a minibatch.
FIG. 10 is an experimental result of the present application, showing the average episode scores for three games.
FIG. 11 is an experimental result of the present application, summarizing the average scores achieved in four games.
FIG. 12 is an operation flowchart of a data processing method for deep reinforcement learning according to an embodiment of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present application pertains can readily practice them. The present application may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted so that the present application can be described clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" or "indirectly connected" with another element interposed between them.

Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member exists between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

Human demonstration data plays an important role in accelerating the training process in the early stage of deep reinforcement learning and in guiding the reinforcement learning agent to learn a complicated policy.

However, most conventional reinforcement learning methods that use human demonstration data and rewards assume that a sufficient amount of high-quality human demonstration data is available. This assumption does not hold in most real-world learning cases, where the amount of expert demonstration data is always limited.

To overcome this limitation, the present application proposes a new deep reinforcement learning method based on dual replay buffer management and online frame skipping.

In other words, to overcome the problem of limited human demonstration data, the present application proposes a new method of deep reinforcement learning from human demonstrations that uses dual replay buffer management and online frame skipping to sample the human demonstrations. For convenience of description, the technique proposed in the present application (the online frame skipping technique based on a dual replay buffer, that is, the data processing method for deep reinforcement learning performed by the apparatus 100 described later) is hereinafter referred to as the proposed technique.

In the proposed technique, the dual replay buffer may consist of a human replay buffer (also called a human replay memory), an actor replay buffer (also called an actor replay memory), and a replay manager. In the present application, a buffer may equivalently be referred to as a memory. According to the present application, the two replay buffers can be managed with independent sampling policies.

The dual replay buffer management considered in the present application consists of a human replay buffer (memory), an actor replay buffer (memory), and a replay manager, whereas a conventional replay buffer holds both the human demonstration data and the actor data in the same replay buffer.

In addition, because the conventional static frame skipping method skips frames of the human demonstration data stored in the memory buffer, the dropped frames cannot be used during reinforcement learning. The conventional frame skipping method also prevents the reinforcement learning algorithm from capturing the action-repeat characteristics of the human demonstration data.

In contrast, because the architecture of the proposed technique has two distinct replay buffers (the human replay buffer and the actor replay buffer, in other words the human replay memory and the actor replay memory), the replay manager can handle each of the two replay buffers with an independent sampling policy. This dual replay buffer architecture helps process the human demonstration data in a more sophisticated manner.
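The separation described above can be illustrated with a minimal Python sketch. It is not the implementation of the present application: the class name `DualReplayBuffer`, the `human_ratio` parameter, and the uniform random sampling inside each buffer are all assumptions made for illustration. The sketch only shows that the two buffers are stored apart and that the replay-manager role can draw from each one independently.

```python
import random
from collections import deque


class DualReplayBuffer:
    """Two separate buffers, each sampled on its own terms (illustrative only)."""

    def __init__(self, actor_capacity, seed=None):
        self.human = []                            # demonstrations are kept in full
        self.actor = deque(maxlen=actor_capacity)  # oldest actor data is overwritten
        self.rng = random.Random(seed)

    def store(self, transition, from_human):
        # Route each transition to the buffer that matches its origin.
        (self.human if from_human else self.actor).append(transition)

    def sample(self, batch_size, human_ratio):
        # Replay-manager role: draw from each buffer independently.
        # human_ratio is a hypothetical knob, not a parameter named in the source.
        n_human = min(int(batch_size * human_ratio), len(self.human))
        n_actor = min(batch_size - n_human, len(self.actor))
        human_part = self.rng.sample(self.human, n_human)
        actor_part = self.rng.sample(list(self.actor), n_actor)
        return human_part, actor_part
```

Because the human part is returned separately from the actor part, a caller can apply frame skipping to the human samples alone without touching the actor samples.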

In addition, the present application proposes online frame skipping in order to fully utilize the available human data. In the present application, human data may equivalently be referred to as human demonstration data. During the training period, frame skipping can be performed dynamically on the human replay buffer, in which all of the human demonstration data is stored. To sample data from the human replay buffer, the present application can use two kinds of online frame skipping: FS-ER (Frame Skipping Experience Replay) and DFS-ER (Dynamic Frame Skipping Experience Replay).

In other words, the present application proposes the FS-ER (Frame Skipping Experience Replay) technique in order to use the entire human demonstration data together with frame skipping. With FS-ER, the entire human demonstration data is stored in the human replay buffer, and frame skipping is performed online during the training period. By using FS-ER, the model can be trained on the full set of available human demonstration data.
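The difference between skipping before storage and skipping online can be sketched as follows. This is an illustrative reading rather than the method of the present application: the function names, the uniform choice of a start index, and the parameters `skip` (fixed interval) and `stack` (number of frames per state) are assumptions.

```python
import random


def offline_skip(frames, skip):
    """Skip before storing: only every skip-th frame survives; the rest are never seen."""
    return frames[::skip]


def fs_er_sample(frames, skip, stack, rng):
    """Skip at sample time: every frame stays stored and the fixed interval is applied on demand.

    Requires len(frames) > skip * (stack - 1); otherwise randrange raises ValueError.
    """
    span = skip * (stack - 1)
    start = rng.randrange(len(frames) - span)   # any stored frame may start a state
    return [frames[start + i * skip] for i in range(stack)]
```

With 20 stored frames and `skip=4`, `offline_skip` keeps only frames 0, 4, 8, 12, and 16, whereas repeated calls to `fs_er_sample` can begin a state at frame 1, 2, or 3 as well, so frames that offline skipping would have dropped still contribute to training.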

In addition, the present application proposes the DFS-ER (Dynamic Frame Skipping Experience Replay) technique, which adopts the dynamic frame skipping of DFDQN (Dynamic Frame-Skipping Deep Q-Network) in FS-ER. DFDQN is described, for example, in Document 6 [A. S. Lakshminarayanan, S. Sharma, and B. Ravindran, "Dynamic frame skip deep Q network," CoRR, vol. abs/1605.05365, 2016. [Online]. Available: http://arxiv.org/abs/1605.05365].

The DFS-ER of the proposed technique can learn the action repeats in the human demonstration data, and because it skips frames dynamically, it can explore the environment faster than the conventional static frame skipping method.
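One possible reading of dynamic skipping driven by action repeats is sketched below. The algorithm of FIG. 7 is not reproduced here, so everything in the sketch is an assumption: the rule that a frame receives the longer of two skip values when the demonstrator held its action for at least that many frames, the names `short_skip` and `long_skip`, and the way a state is assembled by stepping forward by each frame's own skip value.

```python
def action_repeat_lengths(actions):
    """For each frame, count the consecutive frames (starting at it) that share its action."""
    n = len(actions)
    run = [1] * n
    for i in range(n - 2, -1, -1):
        if actions[i] == actions[i + 1]:
            run[i] = run[i + 1] + 1
    return run


def skip_values(actions, short_skip, long_skip):
    """Assign one of two skip values per frame (assumed rule: long if the action is held long enough)."""
    return [long_skip if r >= long_skip else short_skip
            for r in action_repeat_lengths(actions)]


def dfs_er_indices(skips, start, stack):
    """Indices of the frames stacked into one state, stepping by each frame's own skip value."""
    idx, out = start, []
    while len(out) < stack and idx < len(skips):
        out.append(idx)
        idx += skips[idx]
    return out
```

For a demonstration whose first six frames hold one action and whose remaining frames alternate, with `short_skip=2` and `long_skip=4`, a state that starts at frame 0 takes one long step while the action is held and shorter steps afterwards.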

Before the proposed technique is described in detail, the background techniques on which it builds are first described in more detail below.

Deep reinforcement learning from demonstrations is described as follows.

Human demonstration data can be used to improve the initial stage of deep reinforcement learning, and can help reinforcement learning in sparse reward environments.

Currently, the most advanced models that use human demonstration data are DQfD and NAC.

DQfD adds a supervised loss when training with human demonstration data. Because the supervised loss increases the overall loss value when the agent's action differs from the expert's action, the model can behave like a supervised learning algorithm when it is trained with human demonstration data.

FIG. 1 is a diagram showing the architecture of a conventional DQfD.

Referring to FIG. 1, in the DQfD architecture both the human demonstration data (human data) and the actor data are stored and managed in a single replay buffer. In particular, the human demonstration data is stored permanently in the replay buffer, whereas older actor data is replaced by incoming actor data.
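The storage policy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DQfD implementation: the class name `SingleReplayBuffer`, its methods, and the uniform sampling are hypothetical simplifications (the conventional DQfD uses prioritized experience replay).

```python
import random
from collections import deque


class SingleReplayBuffer:
    """Toy single buffer: demonstrations are permanent, actor data is FIFO."""

    def __init__(self, actor_capacity):
        self.human = []                            # never evicted
        self.actor = deque(maxlen=actor_capacity)  # oldest entries drop out

    def add_human(self, transition):
        self.human.append(transition)

    def add_actor(self, transition):
        self.actor.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling over one mixed pool (a simplification).
        pool = self.human + list(self.actor)
        return rng.sample(pool, min(batch_size, len(pool)))


buf = SingleReplayBuffer(actor_capacity=3)
for i in range(2):
    buf.add_human(("human", i))
for i in range(5):
    buf.add_actor(("actor", i))  # the two oldest actor entries are replaced
```

Because both kinds of data share one pool here, any processing applied to the buffer affects the small human portion and the large actor portion alike.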

In addition, both the human demonstration data and the actor data are already frame-skipped before they are stored in the replay buffer. Therefore, in DQfD, when data is sampled during the learning process, experience data including the Pre-State, Post-State, Reward, Action, n-Step-State, and n-Step-Reward is sampled without frame skipping. The Minibatch in FIG. 1 represents this sampling process.

In the Minibatch of FIG. 1, the blue-colored part (P1) represents the Pre-State and the Post-State, and the green-colored part (P2) represents the n-Step-State and the n-Step-Reward. After sampling, the Minibatch can be input to the neural network (Network). The neural network can be updated by back-propagating the sum of four losses.

The loss equations (that is, the expressions of the loss function) can be expressed as Equations 1 and 2 below.

[Equation 1]

[Equation 2]

Equation 1 is the expression for calculating the overall loss used to update the DQfD network. The overall loss of DQfD consists of the Q loss, the n-step Q loss, the supervised loss, and the L2 regularization loss, and each loss is weighted by a λ value.

Equation 2 is the expression for calculating the supervised loss.
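For reference, the DQfD loss described by Equations 1 and 2 is commonly written in the literature in the following form. This is a sketch based on the published DQfD formulation; the symbol names J_DQ, J_n, J_E, J_L2, λ_1 to λ_3, and l are assumptions and may differ from the notation used in Equations 1 and 2 of this document.

```latex
% Overall loss: Q loss, n-step Q loss, supervised loss, L2 regularization loss
J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)

% Supervised (large-margin) loss for a demonstration sample (s, a_{expert})
J_E(Q) = \max_{a \in A} \left[ Q(s, a) + l(a_{expert}, a) \right] - Q(s, a_{expert})
```

In this form, l(a_expert, a) is zero when a equals a_expert and positive otherwise, so the supervised term vanishes only when the expert's action has the highest Q value by at least the margin.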

In Equation 2, a_expert denotes the action performed by the human (an expert among humans) in state s. If a_expert and a (where a is the data value of each action included in the human data) are not the same value, the margin term of Equation 2 may take the value 0.8 in the original DQfD. In addition, the supervised loss term is present only when the sampled data is human demonstration data.
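As a hedged illustration of how such a supervised loss behaves, the following Python sketch computes a large-margin loss for one demonstration sample. The function name, the NumPy array layout, and the large-margin form are assumptions based on the published DQfD formulation; only the margin value 0.8 comes from the description above.

```python
import numpy as np


def supervised_margin_loss(q_values, expert_action, margin=0.8):
    """Large-margin loss for one human demonstration sample.

    q_values: 1-D array holding Q(s, a) for every action a.
    expert_action: index of the action the human took in state s.
    """
    q_values = np.asarray(q_values, dtype=float)
    # The margin is 0 for the expert action and `margin` for all others.
    margins = np.full(q_values.shape, margin)
    margins[expert_action] = 0.0
    return float(np.max(q_values + margins) - q_values[expert_action])


# The expert action already has the highest Q value by a wide enough gap.
loss_agree = supervised_margin_loss([1.0, 2.0, 0.5], expert_action=1)
# The agent prefers action 1, but the expert took action 0.
loss_disagree = supervised_margin_loss([1.0, 2.0, 0.5], expert_action=0)
```

The loss is zero in the first case and positive in the second, which is the sense in which the supervised loss raises the overall loss when the agent and the expert disagree.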

In the Arcade Learning Environment (ALE), DQfD shows better performance than DQN with prioritized experience replay. It can be trained on games such as Montezuma's Revenge using a small but high-quality human demonstration dataset. Montezuma's Revenge is a game that is difficult to train with models based on conventional deep reinforcement learning algorithms such as A3C and DQN, because its rewards are sparse.

Because DQfD operates on the basis of the supervised loss and is trained with a small amount of human demonstration data, it often exhibits an overfitting problem, which is also why DQfD can easily reach a top score. However, after a high score is reached during pre-training, the score hardly increases, and the top score is rarely reached again. The L2 regularization loss can be adopted to alleviate the overfitting problem of DQfD. In the performance comparison between DQfD and NAC, a decay of the supervised loss may be introduced as the iterations proceed. Nevertheless, the overfitting problem of DQfD has not yet been completely resolved.

NAC is a model intended to solve the overfitting problem of DQfD, and it can use imperfect human demonstration data in deep reinforcement learning. Unlike DQfD, which uses separate loss models for the actor data and the human demonstration data, the NAC model uses a unified loss model. The NAC work also identifies the cause of the problem that a DQN model has when learning from human demonstration data. The authors of Document 4, which discloses an example of the NAC model, point out that the main cause of this problem is the overestimation of the Q value.

To solve this, Document 4 discloses a technique that modifies the existing soft Q-learning model so that the modified NAC subtracts the gradient of the state value function from the gradient of the Q value, after which the Q value is normalized according to the soft Q-learning loss model. However, this NAC model outperforms the DQfD model when the amount of human demonstration data is 300,000 transitions, but performs worse than DQfD when the amount is 100,000 transitions. That is, the conventional NAC model performs worse than DQfD when only a small amount of human demonstration data is available.

Meanwhile, dynamic frame skipping is described as follows.

Frame skipping is a technique for exploring the environment faster in DQN. With frame skipping, the reinforcement learning agent repeats the predicted action until the frame skipping coefficient is reached. As a result, the number of action decisions decreases, and the actual training time is reduced accordingly.
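The action repetition described above can be sketched in Python as follows. This is a minimal illustration, assuming a hypothetical toy environment `CountingEnv` with a `step` method that returns an observation, a reward, and a done flag; none of these names come from the original description.

```python
class CountingEnv:
    """Hypothetical toy environment: every frame yields a reward of 1."""

    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.frame = 0

    def step(self, action):
        self.frame += 1
        done = self.frame >= self.episode_length
        return self.frame, 1.0, done


def step_with_frame_skip(env, action, skip):
    """Repeat `action` for up to `skip` frames and accumulate the reward."""
    total_reward, done, obs = 0.0, False, None
    for _ in range(skip):
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break  # stop repeating once the episode ends
    return obs, total_reward, done


env = CountingEnv(episode_length=10)
decisions = 0
done = False
while not done:
    _, _, done = step_with_frame_skip(env, action=0, skip=4)
    decisions += 1
```

With a frame skipping coefficient of 4, the 10-frame episode needs only 3 action decisions instead of 10, which is where the reduction in training time comes from.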

Moreover, an agent to which an improved frame skipping method, namely dynamic frame skipping, is applied can learn the action repetition in each state by using additional output values of the neural network as the frame skipping coefficient. The following two architectures (that is, DFDQN and FiGAR) apply dynamic frame skipping.

FIG. 2 is a diagram schematically showing the architecture of the conventional DFDQN (a) and the architecture of FiGAR (b).

DFDQN (Dynamic Frame Skipping DQN) may be understood with reference to Document 6 mentioned above, and is briefly described as follows. DFDQN is designed to learn both the action and the repetition of the action in each state. The model sets two cases for the action repeat value, fs₁ and fs₂. In addition, the model doubles the dimension of the output so that the repetition of the action can be represented as an output of the neural network. Accordingly, if the predicted action value is lower than the action dimension, the agent repeats the action fs₁ times; otherwise, it repeats the action fs₂ times.

DFDQN also doubles the size of the last fully connected layer in order to handle the doubled action dimension. As training progresses, an action value larger than the action dimension may be selected in a specific state. In this way, the exploration of each episode can be accelerated.

As a result, when the performance of the model is evaluated in the Atari game environment, DFDQN learns faster and obtains more reward than DQN over the same number of time steps.

In FIG. 2, (a) shows a DFDQN in which fs₁ is 4 and fs₂ is 20. In the neural network model shown in (a) of FIG. 2, the action is repeated 4 times if the action value is lower than the action dimension, and 20 times otherwise.
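The mapping from the doubled output layer to an action and a repeat count can be sketched in Python as follows. The function name and argument names are hypothetical; the defaults 4 and 20 are the fs₁ and fs₂ values given for FIG. 2(a).

```python
def decode_dfdqn_output(output_index, num_actions, fs1=4, fs2=20):
    """Map an index of the doubled output layer to (action, repeat count).

    Indices below `num_actions` select the short repeat value fs1;
    the remaining indices select the same actions with the long value fs2.
    """
    if not 0 <= output_index < 2 * num_actions:
        raise ValueError("output_index out of range")
    if output_index < num_actions:
        return output_index, fs1
    return output_index - num_actions, fs2


# With 6 actions, outputs 2 and 8 both mean action 2, repeated 4 or 20 times.
short_repeat = decode_dfdqn_output(2, num_actions=6)
long_repeat = decode_dfdqn_output(8, num_actions=6)
```

Encoding the repeat value in the output index means a single argmax over the network output chooses both the action and how long to hold it.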

Meanwhile, FiGAR (Fine Grained Action Repetition) may be understood with reference to, for example, Document 7 [S. Sharma, A. S. Lakshminarayanan, and B. Ravindran, "Learning to repeat: Fine grained action repetition for deep reinforcement learning," arXiv preprint arXiv:1702.06054, 2017.], and is briefly described as follows.

FiGAR (Fine Grained Action Repetition) is a model that extends the earlier dynamic frame skipping of DFDQN. Rather than being limited to two action repeat values, this model can select the action repeat value from a certain range. FiGAR can be combined with a variety of deep reinforcement learning algorithms.

In FIG. 2, (b) shows the structure of FiGAR, in which two networks (neural networks) exist. One of the two networks selects the action, and the other selects the repetition of the action. However, to apply this model to various deep reinforcement learning models, the target deep reinforcement learning model needs to be modified.

Document 7 discloses cases in which FiGAR is applied to three models: A3C, TRPO, and DDPG. In addition, FiGAR defines the estimate of the n-step return of A3C as shown in Equation 4 below, so that it matches the number of skipped frames.

[Equation 3]

[Equation 4]

In Equation 3, x_k denotes the repeat value of action a_k in state s_k, and y denotes the exponent of the decay value in Equation 4. That is, the more times an action is repeated and the larger the value of y becomes, the more the value of r_k (the reward of each frame k) can be decayed by that decay value.

Meanwhile, the Atari game human demonstration dataset is described as follows.

FIG. 3 is a diagram showing examples of screenshot samples from the Atari Grand Challenge dataset. In particular, FIG. 3 shows example screenshots of Q*bert (a), Video Pinball (b), Ms. Pacman (c), Space Invaders (d), and Montezuma's Revenge (e) from the Atari Grand Challenge dataset.

Referring to FIG. 3, the Atari Grand Challenge dataset is an open dataset of five Atari games for deep reinforcement learning. Each dataset contains diverse episodes whose score distribution ranges from novice to expert. Each episode of a game has several image files and one trajectory file. The image dataset contains the raw color images of the Atari games. The trajectory dataset contains information such as the reward, frame ID, score, terminal state, and action.

From a frame skipping point of view, this dataset contains the full 60 Hz human demonstration data, without dropping any frames of the game. This differs from the dataset of Document 3, which discloses the conventional DQfD and uses 15 Hz human demonstration data that has already been frame-skipped. In terms of score, most episode scores in this dataset are lower than those in the dataset of Document 3. In addition, the dataset lacks information on life loss, which makes it difficult to assign a negative reward to each frame.

Hereinafter, the proposed technique (the technique proposed in the present application), namely online frame skipping based on a dual replay buffer, is described in more detail.

FIG. 4 is a block diagram showing a schematic configuration of a data processing apparatus 100 for deep reinforcement learning according to an embodiment of the present application. FIG. 5 is a diagram schematically showing the architecture of the data processing apparatus 100 for deep reinforcement learning according to an embodiment of the present application. That is, FIG. 5 shows the system architecture of the online frame skipping technique based on the dual replay buffer, as performed by the data processing apparatus 100 for deep reinforcement learning according to an embodiment of the present application.

Hereinafter, for convenience of description, the data processing apparatus 100 for deep reinforcement learning according to an embodiment of the present application is referred to as the present apparatus 100.

Referring to FIGS. 4 and 5, the present apparatus 100 relates to a data processing apparatus for deep reinforcement learning, and in particular to a data processing apparatus for accelerating deep reinforcement learning.

The present apparatus 100 can perform the proposed technique described above. That is, in order to accelerate the training process that uses human demonstration data 111', the present apparatus 100 can perform data processing for a new form of deep reinforcement learning based on dual replay buffer management and online frame skipping.

In other words, the present apparatus 100 can perform data processing for deep reinforcement learning (in particular, for accelerating deep reinforcement learning) based on dual replay buffer management and online frame skipping.

The present apparatus 100 may include a replay manager 110, a data processing unit 120, and a redefinition unit 130. The present apparatus 100 may also include a human replay buffer 111 and an actor replay buffer 112 that are managed by the replay manager 110. In the present application, the human replay buffer 111 and the actor replay buffer 112 may be referred to as the dual replay buffers 111 and 112.

The replay manager 110 may store the human demonstration data 111' among the target data in the human replay buffer 111, and may store the actor data 112' among the target data in the actor replay buffer 112. In other words, the human replay buffer 111 may be a buffer that stores the human demonstration data 111' among the target data, and the actor replay buffer 112 may be a buffer that stores the actor data 112' among the target data. In the present application, a buffer may also be referred to as a memory.

Here, the target data may mean, for example, the data on which learning is performed. The target data may be, for example, image data. As a specific example, when the target data is data related to a game, the target data may be game-related image data.

In addition, the human demonstration data (human data, 111') may mean data played (demonstrated) by a human. That is, when a human plays a game, the human demonstration data 111' may mean the data generated by the human playing (demonstrating) the game. The actor data 112' may mean data produced by the behavior of an agent (or a system or device). That is, the actor data 112' is data generated by the actions of the agent, in other words data related to the actions of the agent.

The data processing unit 120 may perform data processing by applying online frame skipping to the human replay buffer 111 in order to sample the human demonstration data 111' stored in the human replay buffer 111.

Here, the online frame skipping may include two types of frame skipping 121 and 122, distinguished by whether the frame skipping interval is variable. In particular, the two types of frame skipping 121 and 122 may include online frame skipping (Frame-skipping Experience Replay, FS-ER skipping, 121) and dynamic online frame skipping (Dynamic Frame-skipping Experience Replay, DFS-ER skipping, 122).

That is, as the first of the two types of frame skipping 121 and 122, the online frame skipping may include online frame skipping (FS-ER skipping, 121), which performs skipping with a constant, non-variable frame skipping interval in order to use the entire human demonstration data 111'. Here, the online frame skipping (FS-ER skipping, 121) may also be referred to as static online frame skipping or fixed (constant) frame skipping. As the second of the two types of frame skipping 121 and 122, the online frame skipping may include dynamic online frame skipping (DFS-ER skipping, 122), which performs skipping dynamically by varying the frame skipping interval in order to learn the action repetition of the human demonstration data 111'.

As the online frame skipping, the data processing unit 120 may apply these two types of frame skipping 121 and 122, distinguished by whether the frame skipping interval is variable, to the human replay buffer 111, and thereby perform data processing on the human demonstration data 111'.

A more detailed description is as follows. Compared with the DQfD of FIG. 1, the present apparatus 100 has three main characteristics that differ from the prior art: dual replay buffer management, online frame skipping, and dynamic online frame skipping.

The dual replay buffer management may involve the human replay buffer 111, the actor replay buffer 112, and the replay manager 110. The replay manager 110 may handle (manage) the human replay buffer 111 and the actor replay buffer 112 separately, each with its own distinct management policy.
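The separation of the two buffers can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the apparatus's implementation: the class name `ReplayManager`, the `source` argument, the FIFO policy for actor data, and the uniform sampling are all hypothetical choices made for illustration.

```python
import random
from collections import deque


class ReplayManager:
    """Keeps human and actor data in two separately managed buffers."""

    def __init__(self, actor_capacity):
        self.human_buffer = []                            # demonstrations kept in full
        self.actor_buffer = deque(maxlen=actor_capacity)  # assumed FIFO replacement

    def store(self, item, source):
        if source == "human":
            self.human_buffer.append(item)
        elif source == "actor":
            self.actor_buffer.append(item)
        else:
            raise ValueError(f"unknown source: {source!r}")

    def sample(self, n_human, n_actor, rng=random):
        # Each buffer is sampled on its own, so a different sampling
        # method can be plugged in for either side.
        human = rng.sample(self.human_buffer, min(n_human, len(self.human_buffer)))
        actor = rng.sample(list(self.actor_buffer), min(n_actor, len(self.actor_buffer)))
        return human, actor
```

Keeping the two buffers apart is what allows a sampling-time operation such as frame skipping to be applied to the human side only.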

In the case of DQfD, both the human demonstration data and the actor data are stored in a single replay buffer. The human demonstration data stored in the single replay buffer is relatively small in amount compared with the actor data, whereas the actor data is relatively large in amount. Therefore, when frame skipping is applied to that buffer, part of the already small amount of human demonstration data is conventionally discarded. The amount of training data available for training the neural network is reduced accordingly (that is, there is not enough human demonstration data for training), and the learning performance of the neural network (for example, its accuracy) deteriorates.

In contrast, by using the online frame skipping 121 and 122 based on the dual replay buffer, the present apparatus 100 can apply FS-ER or DFS-ER to the human demonstration data independently, regardless of the sampling method used for the actor replay buffer.

The present apparatus 100 can thereby resolve the conventional problem in which, because both the human demonstration data and the actor data are stored in a single replay buffer, human demonstration data is discarded when frame skipping is applied to that buffer and sufficient data for neural network training cannot be secured. That is, instead of storing the human demonstration data and the actor data together in a single replay buffer, the present apparatus 100 stores them separately in the two replay buffers 111 and 112, and on that basis can apply the online frame skipping 121 and 122 independently to the human demonstration data 111' stored in the human replay buffer 111. Accordingly, through online frame skipping based on the dual replay buffer, the present apparatus 100 allows all of the human demonstration data 111' to be used as training data for the neural network, without any of it being discarded.

FS-ER (Frame Skipping Experience Replay, 121), one of the two online frame skipping techniques 121 and 122 considered in the present apparatus 100, is described as follows.

In FS-ER, the entire human demonstration data 111' is stored in the human replay buffer 111, and frame skipping can be performed online during the training period.

In other words, the replay manager 110 may store the human demonstration data 111' in the human replay buffer 111. The data processing unit 120 may then perform data processing by applying the FS-ER skipping 121 to the human replay buffer 111 and the human demonstration data 111' stored in it. In this case, the data processing unit 120 may perform the FS-ER skipping 121 online during the training period.

The conventional DQfD skips frames when generating both the human demonstration data and the actor data, before they are stored in the single replay buffer, so only the selected frames can be used during the learning process. In contrast, the data processing unit 120 may perform the FS-ER skipping 121 as on-demand frame skipping whenever the training procedure requires the human demonstration data 111' as input. By using the FS-ER skipping 121, the present apparatus 100 can use all frames of each episode for model training.
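The on-demand skipping described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the episode layout (one `(observation, action, reward)` tuple per raw frame), the function names, and the choice to sum the rewards of the skipped frames are all hypothetical and are not specified in the description.

```python
import random


def fs_er_transition(episode, start, skip):
    """Build one frame-skipped transition starting at raw frame `start`.

    episode: list of (observation, action, reward), one entry per raw frame.
    skip: fixed frame skipping interval applied at sampling time.
    """
    last = len(episode) - 1
    pre_state, action, _ = episode[start]
    post_state = episode[min(start + skip, last)][0]
    # Assumption: rewards of the skipped frames are summed.
    reward = sum(r for _, _, r in episode[start:start + skip])
    done = start + skip >= len(episode)
    return pre_state, action, reward, post_state, done


def sample_fs_er(episode, skip, rng=random):
    """Any raw frame may be the start, so no stored frame is left unused."""
    return fs_er_transition(episode, rng.randrange(len(episode)), skip)


episode = [(f"s{i}", i % 2, 1.0) for i in range(10)]
middle = fs_er_transition(episode, start=2, skip=4)
near_end = fs_er_transition(episode, start=8, skip=4)
```

If frames were skipped before storage with an interval of 4, only frames 0, 4, and 8 of this episode would exist in the buffer; with skipping at sampling time, all 10 frames remain eligible as starting points.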

Referring to FIG. 5, the MiniBatch considered in the present apparatus 100 is drawn with spacing between the sampled frames in order to depict the frame skipping used to sample the human data 111'.

The other of the two online frame skipping techniques considered herein, dynamic online frame skipping (Dynamic Frame Skipping Experience Replay, DFS-ER, 122), is described below.

The present application proposes DFS-ER, a variant of FS-ER, to handle action repetition in the human demonstration data 111'. That is, the data processing unit 120 may perform data processing by applying DFS-ER skipping 122 to the human demonstration data 111'. By adopting the dynamic frame skipping scheme of DFDQN, the apparatus 100 can implement DFS-ER skipping 122 as a dynamic frame skipping scheme for processing the full 60 Hz human demonstration data.

In FIG. 5, the output dimension of the neural network 50 is twice the action dimension so that it can represent the two repeat values. Because of dynamic frame skipping (DFS-ER, 122), the frames in the mini-batch are not sampled at uniform intervals. The data processing unit 120 may define the skip intervals used in DFS-ER skipping 122 dynamically, from the skip value of each action in the human demonstration data 111'.
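The doubled output dimension can be pictured as two banks of Q-values, one per repeat value. The sketch below is illustrative only: the function names, and the convention that the second bank holds the fs₂ outputs, are assumptions borrowed from the layout commonly used with DFDQN and are not stated in this description.

```python
from typing import Tuple


def output_index(a_expert: int, uses_fs2: bool, action_dim: int) -> int:
    """Map (action, repeat class) to an index in a head of size 2 * action_dim."""
    if not 0 <= a_expert < action_dim:
        raise ValueError("action index out of range")
    # Assumed layout: [0, action_dim) is the fs1 bank,
    # [action_dim, 2 * action_dim) is the fs2 bank.
    return a_expert + action_dim if uses_fs2 else a_expert


def decode_index(index: int, action_dim: int) -> Tuple[int, bool]:
    """Inverse mapping: recover (action, uses_fs2) from an output index."""
    if not 0 <= index < 2 * action_dim:
        raise ValueError("output index out of range")
    return index % action_dim, index >= action_dim
```

With 18 actions, for example, action 3 paired with the larger skip value would occupy output 21 under this assumed layout.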

Accordingly, the dual replay buffer management proposed herein is an architecture that allows the human demonstration data 111' and the actor data 112' to be managed independently. Through this architecture, the data processing unit 120 of the apparatus 100 can apply the online frame skipping called FS-ER 121 to the human replay buffer 111 independently of the actor data 112'. The data processing unit 120 may also apply the dynamic online skipping called DFS-ER as online frame skipping, in order to handle action repetition in the human demonstration data 111'.

The apparatus 100 is described in more detail below.

Regarding the management of the dual replay buffers 111 and 112: in conventional DQfD, the human demonstration data and the actor data are stored and managed in a single memory (a single buffer). This makes it difficult to apply separate sampling methods, let alone separate frame skipping methods, to the human data (human demonstration data) and the actor data.

In contrast, in the apparatus 100 the replay manager 110 resolves this problem by managing the human data 111' and the actor data 112' in distinct memory (buffer) spaces through the dual replay buffer (111, 112) management technique. On top of the dual replay buffers 111 and 112, the data processing unit 120 can readily apply the online frame skipping schemes 121 and 122 proposed herein.

The replay manager 110 manages the dual replay buffers 111 and 112 and handles requests from agents either to sample training data (the target data, comprising the human demonstration data and the actor data) or to store new applications in the replay buffers 111 and 112.

The replay manager 110 may directly control the sampling ratio of the human demonstration data 111'. In addition, unlike conventional DQfD, the apparatus 100 need not limit the proportion of human demonstration data held in the replay buffers 111 and 112. The replay manager 110 can therefore manage the sampling ratio of the human demonstration data 111' more flexibly through the dual replay buffer (111, 112) management technique.
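The separation of the two buffers and the directly controlled sampling ratio can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class name `ReplayManager`, the `human_ratio` parameter, the bounded actor buffer, and uniform sampling with replacement are all hypothetical choices.

```python
import random
from collections import deque


class ReplayManager:
    """Hypothetical manager that keeps human and actor data in separate buffers."""

    def __init__(self, actor_capacity=10_000, human_ratio=0.25, seed=0):
        self.human_buffer = []                             # demonstrations, kept in full
        self.actor_buffer = deque(maxlen=actor_capacity)   # assumed bounded actor buffer
        self.human_ratio = human_ratio                     # share of each mini-batch
        self._rng = random.Random(seed)

    def store_human(self, transition):
        self.human_buffer.append(transition)

    def store_actor(self, transition):
        self.actor_buffer.append(transition)

    def sample(self, batch_size):
        """Return (human_samples, actor_samples) according to human_ratio."""
        n_human = round(batch_size * self.human_ratio) if self.human_buffer else 0
        n_actor = batch_size - n_human if self.actor_buffer else 0
        human = self._rng.choices(self.human_buffer, k=n_human) if n_human else []
        actor = self._rng.choices(list(self.actor_buffer), k=n_actor) if n_actor else []
        return human, actor
```

Because each buffer is sampled on its own, the ratio can be changed at any time (for example by assigning a new `human_ratio`) without touching the stored data, and a different skipping scheme can be applied to the human side only.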

The application of FS-ER skipping (FS-ER frame skipping, 121) by the data processing unit 120 is described in detail below, and can be understood more easily with reference to FIG. 6.

FIG. 6 is a diagram for explaining the approaches used by the data processing apparatus 100 for deep reinforcement learning according to an embodiment of the present application to sample the human demonstration data 111'. That is, FIG. 6 illustrates the sampling technique based on FS-ER skipping 121 and the sampling technique based on DFS-ER skipping 122, as applied by the data processing unit 120.

In particular, FIG. 6(a) illustrates the sampling scheme for human demonstration data in the DQfD architecture (Original). FIG. 6(b) illustrates FS-ER 121 as considered in the apparatus 100, that is, FS-ER skipping 121 for on-demand frame skipping during sampling. FIG. 6(c) illustrates DFS-ER 122 as considered in the apparatus 100, that is, DFS-ER skipping 122 for handling action repetition in the human demonstration data 111' during sampling.

Referring to FIG. 6, multiple experiences may be sampled from the human demonstration data 111' to prepare a mini-batch for reinforcement learning. Here, the multiple experiences are items of human demonstration data 111' stored in the human replay buffer 111.

To sample each experience, the data processing unit 120 may generate three states, S_pre, S_post, and S_n-step. Each state may be generated by stacking a number of frames corresponding to the frame skipping coefficient. The frame skipping coefficient may be preset by user input and may be, for example, 4, in which case each state is formed from four frames.

In other words, before applying online frame skipping (FS-ER skipping, 121), the data processing unit 120 may generate frame states for the human demonstration data 111' stored in the human replay buffer 111. Specifically, it may generate S_pre, S_post, and S_n-step as multiple types of frame state, each formed by stacking a number of frames (for example, four) corresponding to the frame skipping coefficient (for example, 4).

That is, before applying FS-ER skipping 121 to the human demonstration data 111' in the human replay buffer 111, the data processing unit 120 may first form each state by grouping four frames, as shown in FIG. 6(a).

As noted above, FIG. 6(a) shows the sampling scheme for human demonstration data in the conventional DQfD architecture (Original). Suppose, for example, that a reinforcement learning model is trained on the 60 Hz human demonstration data of the Atari Grand Challenge dataset. For Atari games, the frame skipping interval is usually set to 4. The conventional DQfD scheme therefore skips frames at every fourth interval when sampling the human demonstration data.
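The effect of skipping before storage can be illustrated with a few lines of Python. This is a sketch under an assumption: the description does not say which frame of each group of four is kept, so the helper `offline_skip` (a hypothetical name) simply keeps the first.

```python
def offline_skip(frames, interval=4):
    """Keep one frame per `interval` frames before anything is stored."""
    return frames[::interval]


# One second of 60 Hz demonstration data, represented by frame indices.
episode = list(range(60))
stored = offline_skip(episode)
retained_fraction = len(stored) / len(episode)
```

Under this scheme only a quarter of the recorded frames ever reach the replay buffer; the remaining frames are unavailable during learning.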

FIG. 6(b) shows the proposed FS-ER, which performs on-demand frame skipping during sampling. When applying the first type of frame skipping (FS-ER skipping, 121), the data processing unit 120 may stack a number of frames corresponding to the frame skipping coefficient, taking one frame at each interval corresponding to a preset number of frames. The preset number may be set by user input and may be, for example, 4. Thus, when generating a frame state for the human demonstration data 111' under FS-ER skipping 121, the data processing unit 120 may stack, for example, four frames taken at intervals of four frames.

In other words, when the frame skipping coefficient is 4, the data processing unit 120 may generate a frame state under FS-ER skipping 121 by stacking four frames taken at every fourth interval, so that the result resembles conventional frame stacking for deep reinforcement learning. The difference is that the apparatus 100 stacks four frames spaced four intervals apart, whereas conventional DQfD stacks four consecutive frames with no gap between them.

Accordingly, under FS-ER skipping 121 the frame skipping interval does not vary and is kept constant, at an interval corresponding to the preset number of frames (for example, four). That is, when applying FS-ER skipping 121, the data processing unit 120 may fix the frame skipping interval at the interval corresponding to the preset number of frames (for example, four frames).
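The fixed-interval stacking described above can be sketched as an index computation over the unskipped 60 Hz data. The helper name `fs_er_stack_indices` is hypothetical, and the choice to end the state at the selected frame t and look backward is an assumption made for illustration.

```python
def fs_er_stack_indices(t, interval=4, stack=4):
    """Indices of the frames stacked into one state that ends at frame t.

    Frames are taken `interval` apart, so with the defaults the state
    spans frames t-12, t-8, t-4 and t of the raw demonstration data.
    """
    if t < interval * (stack - 1):
        raise ValueError("not enough earlier frames to build the state")
    return [t - interval * k for k in range(stack - 1, -1, -1)]


state_a = fs_er_stack_indices(20)   # [8, 12, 16, 20]
state_b = fs_er_stack_indices(21)   # [9, 13, 17, 21]
```

Because the skipping is applied at sampling time, any frame index can serve as t, which is how every stored frame remains usable for training.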

Meanwhile, the redefinition unit 130 may redefine the return value needed to compute the loss value (loss function) used when updating the neural network 50, whose inputs are the human demonstration data 111' sampled through the data processing of the data processing unit 120 and the previously sampled actor data 112' stored in the actor replay buffer 112. That is, the actor data 112' may be sampled in advance and stored in the actor replay buffer 112.

The redefinition unit 130 may redefine the return value for each of the two types of frame skipping 121 and 122. That is, for FS-ER skipping 121 and for DFS-ER skipping 122 respectively, the redefinition unit 130 may redefine how the return value of the neural network 50, which is needed to obtain the applicable loss function (loss value), is computed.

To construct the losses of DQfD given in Equation 1 above, namely the 1-step Q loss, the n-step Q loss, and the supervised loss, the redefinition unit 130 may compute the n-step return as well as the reward and action of the selected frame. The n-step return may be expressed as Equation 5 below.

[Equation 5]

In addition, to redefine the return value applicable to FS-ER skipping 121, the redefinition unit 130 may compute the reward value R_t of each frame as the sum of the rewards over a series of frames whose number corresponds to the frame skipping coefficient (four frames), as in Equation 6 below. Equation 6 is the equation for the reward value of each frame. This reflects the fact that a deep reinforcement learning environment generally returns, as the reward, the sum of all rewards over the skipped frames.

[Equation 6]

[Equation 7]

The redefinition unit 130 may redefine the n-step return of FS-ER, as the return value applicable to the FS-ER skipping 121 proposed herein, so that it satisfies Equation 7 above. In other words, the n-step return of FS-ER as redefined by the redefinition unit 130 may be expressed as Equation 7, which is the equation for the n-step return value of FS-ER.
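The bodies of Equations 5 to 7 are not reproduced in this text. A reconstruction consistent with the surrounding prose, offered only as an assumption, might read as follows. It takes the standard DQfD n-step return as its starting point and assumes a frame skipping coefficient of 4; the symbols r (per-frame reward), γ (decay value), Q, s, and a are not defined in this section, and the direction of the sum in Equation 6 is likewise assumed.

```latex
% Equation 5 (assumed form): n-step return of DQfD
r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \max_a \gamma^{n} Q(s_{t+n}, a)

% Equation 6 (assumed form): reward value of each frame under FS-ER
R_t = \sum_{i=0}^{3} r_{t+i}

% Equation 7 (assumed form): n-step return of FS-ER
R_t + \gamma R_{t+4} + \cdots + \gamma^{n-1} R_{t+4(n-1)} + \max_a \gamma^{n} Q(s_{t+4n}, a)
```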

The n-step return of FS-ER may be computed by summing the decayed values of R_t, each obtained from Equation 6 above, at every interval corresponding to the preset number of frames (for example, four frames). The n-step return of FS-ER further adds the maximum Q value after the n steps, multiplied by the corresponding decay value. Accordingly, the n-step return value of FS-ER can be calculated using Equations 6 and 7 above.
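The calculation described above can be sketched in Python. This is a minimal illustration, not the patented implementation: the function names, the `gamma` and `max_q_after` parameters, and the forward-looking sum over `skip` frames are assumptions, since the equation bodies are not reproduced in this text.

```python
def frame_reward(rewards, t, skip=4):
    """Reward value R_t: sum of the per-frame rewards over `skip` frames from t."""
    return sum(rewards[t:t + skip])


def fs_er_n_step_return(rewards, t, n, gamma, max_q_after, skip=4):
    """Sum decayed R values taken every `skip` frames, then add the decayed max Q.

    `max_q_after` stands for the maximum Q value of the state reached after
    n steps; it is passed in so the sketch does not depend on a network.
    """
    total = 0.0
    for k in range(n):
        total += (gamma ** k) * frame_reward(rewards, t + k * skip, skip)
    return total + (gamma ** n) * max_q_after


# A reward of 1 on every raw frame, two steps, decay 0.5, max Q of 8:
# 4 + 0.5 * 4 + 0.25 * 8 = 8.0
value = fs_er_n_step_return([1] * 40, t=0, n=2, gamma=0.5, max_q_after=8.0)
```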

By applying FS-ER skipping 121, the apparatus 100 can make full use of the entire 60 Hz human demonstration data without dropping (that is, wasting or deleting) any of the human demonstration data 111'. Compared with the conventional frame skipping method using the same number of episodes, adopting FS-ER ensures a higher diversity of sampled training data, because all of the human demonstration data 111' can be used as training data with none of it discarded. This diversity of training data allows the apparatus 100 to increase training efficiency.

The application of DFS-ER skipping (DFS-ER frame skipping, 122) by the data processing unit 120 is described in detail below.

DFS-ER adds dynamic frame skipping to FS-ER in order to handle action repetition in the human demonstration data 111' through online frame skipping.

The data processing unit 120 may perform data processing by applying the second type of frame skipping (DFS-ER, 122) in consideration of the action repetition value, which indicates how long the action of each frame of the human demonstration data 111' lasts. Here, the action repetition value may comprise two skip values, fs₁ and fs₂.

When applying DFS-ER skipping 122, the data processing unit 120 may check the skip value of every frame while generating a frame state for the human demonstration data 111', and may generate the frame state in consideration of the checked skip values.

By generating frame states in consideration of the skip value checked for each frame, the apparatus 100 can apply frame skipping dynamically under DFS-ER skipping 122.

That is, when the data processing unit 120 applies DFS-ER skipping 122, the frame skipping interval varies and skipping is performed dynamically. The dynamically varying frame skipping interval may be set by the skip value of the action in the human demonstration data 111' (that is, the skip value associated with the action repetition value). In other words, when applying DFS-ER skipping 122, the data processing unit 120 may dynamically set (define) the frame skipping interval from the skip values (fs₁ and fs₂) of the action repetition value for each frame of the human demonstration data 111'.

To apply DFS-ER skipping 122, the data processing unit 120 may stack a number of frames (for example, four) corresponding to the frame skipping coefficient, taking one frame at each interval dynamically set by the skip value of the action repetition value.

In other words, the DFS-ER skipping (dynamic frame skipping, 122) performed by the data processing unit 120 operates on the action repetition value of each frame of the human demonstration data 111', which is defined by how long the current action of that frame lasts.

For example, when the action repetition value of the current frame is 7, the action continues for seven frames after the current frame. The action repetition value may be abstracted into two skip values, fs₁ and fs₂, which are hyper-parameters for the small skip value and the big skip value, respectively. The abstraction into two skip values can be understood more easily with reference to prior-art document 6 mentioned above, and a detailed description is omitted here.

The two skip values fs₁ and fs₂ play the main role in adapting dynamic frame skipping to frame stacking, to the reward, and to the n-step return.

If DFS-ER stacked frames at static spacing intervals as in FIG. 6(b), the spacing intervals between frames of the human demonstration data 111' would differ from the interval steps of the actor data, which may be stored in the actor replay buffer 112 after dynamic frame skipping (DFS-ER, 122).

In addition, to calibrate the difference between the behaviors of the human and the actor, the data processing unit 120 may classify the action repetition value of the current frame of the human demonstration data 111' into fs₁ or fs₂ according to fs₂_condition, the condition for fs₂, as in Algorithm 1 shown in FIG. 7. Algorithm 1 is described in more detail below.
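The classification step can be sketched as a simple threshold test. This is an assumption-laden illustration: the description does not give the form of fs₂_condition in this section, so the comparison `repeat_t >= fs2_condition` and the default skip values 4 and 20 are hypothetical choices.

```python
def classify_repeat(repeat_t, fs2_condition, fs1=4, fs2=20):
    """Abstract an action repetition value into one of two skip values.

    Assumption: a repetition value that reaches `fs2_condition` is mapped
    to the big skip value fs2; anything shorter is mapped to fs1.
    """
    return fs2 if repeat_t >= fs2_condition else fs1


# An action held for 7 frames stays in the small class; one held for 25 does not.
short_hold = classify_repeat(7, fs2_condition=10)
long_hold = classify_repeat(25, fs2_condition=10)
```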

As the difference between FIG. 6(b) and FIG. 6(c) shows, when applying DFS-ER skipping 122 the data processing unit 120 may check the skip value of every frame while stacking, for example, four frames to generate each of the frame states S_pre, S_post, and S_n-step.

As an example, S_post may be generated for DFS-ER skipping 122 as follows. Assuming that the index of the selected frame is t, the data processing unit 120 may stack frame x_t and three previous frames to generate S_post. To stack these four frames, the data processing unit 120 may check the skip value of each of the four frames in backward order.

Referring to FIG. 6(c), for example, the data processing unit 120 may check the skip value of a candidate preceding frame. If the skip value of the candidate frame is fs₁, the candidate frame may be added to S_post. Otherwise, if the skip value of the candidate frame is fs₂, the candidate frame cannot be added to S_post, because the difference between y_t and y_{t+1} (that is, fs₁) is not equal to the skip value of the candidate frame (that is, fs₂). This process may be repeated until the four frames of S_post are filled.

Put differently, to apply DFS-ER skipping 122 the data processing unit 120 may check the skip value of the candidate frame. If that skip value is fs₁, the data processing unit 120 may generate the frame state S_post so that the candidate frame is included in it. Otherwise, if that skip value is fs₂, the data processing unit 120 may generate S_post so that the candidate frame is not included. The data processing unit 120 may repeat this process until S_post contains four frames.
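One way to read the backward check is that a preceding frame belongs to the state only if its own skip value equals its distance from the frame already accepted. The sketch below implements that reading. It is an interpretation, not the patented procedure: the function name, the order in which the two distances are tried, and the padding used when no candidate matches are all assumptions, and small skip values (2 and 5) are used only to keep the example short.

```python
def build_state_indices(skips, t, fs1=4, fs2=20, stack=4):
    """Walk backward from frame t and collect `stack` frame indices.

    `skips[i]` is the skip value (fs1 or fs2) attached to frame i.
    A candidate at distance d is accepted only when skips[candidate] == d.
    """
    indices = [t]
    cur = t
    while len(indices) < stack:
        chosen = None
        for dist in (fs1, fs2):          # assumed order: small skip first
            j = cur - dist
            if j >= 0 and skips[j] == dist:
                chosen = j
                break
        if chosen is None:
            # Assumed fallback: pad with the earliest accepted frame.
            indices.extend([cur] * (stack - len(indices)))
            break
        indices.append(chosen)
        cur = chosen
    return indices[::-1]                 # oldest frame first


skips = [2] * 12
skips[8] = 5    # frame 8 carries the big skip, so it is not 2 frames before frame 10
skips[5] = 5    # frame 5 carries the big skip and sits 5 frames before frame 10
state = build_state_indices(skips, t=10, fs1=2, fs2=5)
```

In this example frame 8 is rejected because its skip value (5) differs from its distance to frame 10 (2), mirroring the rejection case described above, and the resulting state is unevenly spaced.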

S_pre and S_n-step are generated in the same or a similar way as S_post, so a detailed description is omitted.

The data processing unit 120 may perform DFS-ER skipping 122 based on Algorithm 1 shown in FIG. 7.

FIG. 7 shows the algorithm for the operation of DFS-ER skipping 122 performed by the data processing apparatus 100 for deep reinforcement learning according to an embodiment of the present application.

Referring to FIG. 7, Algorithm 1 may take the following parameters as inputs: M_human, the human replay buffer; M_actor, the actor replay buffer; k, the number of pre-train steps; Size_human, the human replay buffer size; Start_pretrain, the pre-train start condition defined by Size_human; fs₂_condition, the skip condition for fs₂; repeat_t, the action repetition value of the selected frame x_t; a_expert, the action index of x_t; and action_dim, the number of actions in the environment (for example, ALE).

To obtain rewards for the human data 111' that have a dynamic time interval, like the rewards of the actor data 112', the redefinition unit 130 of the apparatus 100 may modify the reward equation for the human demonstration data 111'.

DFDQN does not consider the n-step return. Therefore, to adapt the dynamic frame skipping of DFDQN to DQfD, the redefinition unit (130) may modify and redefine the n-step return of DQfD so that it correctly reflects dynamic frame skipping (DFS-ER, 122). That is, the redefinition unit (130) may modify the n-step return of DQfD and redefine the n-step return value of DFS-ER so that it satisfies Equation 11 described below.

As shown in lines 3 and 4 of Algorithm 1 in FIG. 7, to apply the dynamic frame skipping scheme (DFS-ER skipping, 122) to the reward and the n-step return, the redefinition unit (130) may sum the partial rewards over a series of fs₁ or fs₂ frames to obtain the candidates for the reward R_t of the selected frame x_t.

Also, to redefine the n-step return, the redefinition unit (130) may determine not only R_t and y but also the skip value of the action needed to obtain these two variables. Here, y may denote the index of the frame reached after frames are dynamically skipped.

In the human demonstration data (111'), the skip value of an action may be defined by the action value, as in line 5 of Algorithm 1 in FIG. 7. If the action value is smaller than action_dim (which is the case when repeat_t is smaller than fs₂_condition), the skip value of the action (the skip value associated with the action repetition value) is fs₁; otherwise, it may be fs₂.
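The skip-value rule can be illustrated with a minimal Python sketch. It assumes an extended action space of size 2 × action_dim in which the upper half encodes the fs₂ case; the function names and the `a_expert + action_dim` encoding are hypothetical and are not taken from Algorithm 1. The default values 4, 12, and 70 are those used in the experiment described later in this document.

```python
def skip_value(action_value: int, action_dim: int, fs1: int = 4, fs2: int = 12) -> int:
    """Return the frame-skip length implied by an extended action value.

    Assumption: action values in [0, action_dim) mean "skip fs1 frames" and
    values in [action_dim, 2 * action_dim) mean "skip fs2 frames".
    """
    if not 0 <= action_value < 2 * action_dim:
        raise ValueError("action_value is outside the doubled action space")
    return fs1 if action_value < action_dim else fs2


def encode_expert_action(a_expert: int, repeat_t: int, action_dim: int,
                         fs2_condition: int = 70) -> int:
    """Map a demonstration frame to an extended action value (hypothetical encoding).

    Frames whose action repetition value is below fs2_condition stay in the
    lower half; all others are shifted into the upper half.
    """
    return a_expert if repeat_t < fs2_condition else a_expert + action_dim


# Example with an 18-action environment: a short repeat maps to fs1, a long one to fs2.
short = encode_expert_action(a_expert=3, repeat_t=10, action_dim=18)
long_ = encode_expert_action(a_expert=3, repeat_t=70, action_dim=18)
print(skip_value(short, 18), skip_value(long_, 18))
```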

In Algorithm 1, the human replay buffer M_human may satisfy Equation 8 below. The reward R_t considered when applying DFS-ER (122) may be selected as either R_r1 or R_r2, as shown in Equation 9 below (that is, by checking the action value a_i of the current frame). Also, when applying DFS-ER (122), the apparatus (100) may use Equation 10 below (that is, check the skip value of the current frame x_t) to find the next dynamic frame, whose index is the skipped frame index y.

[Equation 8]

[Equation 9]

[Equation 10]

After computing R_i in Equation 9 and y_i in Equation 10, the apparatus (100) may use them to obtain the n-step return value of DFS-ER through Equation 11 below.

That is, the redefinition unit (130) may redefine the n-step return of DFS-ER, as a return value applicable to the DFS-ER skipping (122) proposed herein, so that it satisfies Equation 11 below. In other words, Equation 11 expresses the n-step return value of DFS-ER as redefined by the redefinition unit (130).

[Equation 11]

In Equation 11, the apparatus (100) may decay the reward by using Equation 4 above.

The redefinition unit (130) may redefine the return value used to obtain a loss function applicable to FS-ER skipping as in Equation 7 above, and may redefine the return value used to obtain a loss function applicable to DFS-ER skipping as in Equation 11 above. Based on the return value redefined by the redefinition unit (130), the value of the loss function used when applying the online frame skipping techniques considered in the apparatus (100) (that is, the loss value used when updating the neural network) may be computed using Equation 1 above.

There are two differences between the conventional FiGAR shown in (b) of FIG. 2 and the DFS-ER skipping (122) proposed herein.

The first difference is that FiGAR has skip values within a certain range, whereas the DFS-ER skipping (122) of the proposed technique has two frame skipping cases (that is, the case of FS-ER, which is online frame skipping, and the case of DFS-ER, which is dynamic online frame skipping).

The second difference is that DFS-ER (122) of the proposed technique does not use y_i itself as the exponent of gamma (γ) but uses a different quantity in its place. The reason is that the fs₁ skip value is used as one decay unit when the agent is trained with the n-step Q loss.
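The remark that the fs₁ skip value serves as one decay unit can be read as follows. This is a sketch under a stated assumption, not a reproduction of Equation 11: the exponent of γ is taken to be the number of frames elapsed so far divided by fs₁. The function name and argument layout are hypothetical, and the bootstrapped Q term of the n-step loss is left to the caller.

```python
def dfs_er_n_step_reward_sum(rewards, skips, n, gamma, fs1=4):
    """Discounted sum of the first n rewards under dynamic frame skipping.

    rewards[i]: summed partial reward R_i of the i-th selected frame.
    skips[i]:   number of frames skipped after the i-th selected frame (fs1 or fs2).
    Assumption: the discount exponent is (frames elapsed so far) / fs1, so one
    fs1-length skip counts as exactly one decay step.
    Returns the discounted reward sum and the discount to apply to a bootstrap term.
    """
    total = 0.0
    elapsed = 0  # frames elapsed since the first selected frame
    for reward, skip in zip(rewards[:n], skips[:n]):
        total += gamma ** (elapsed / fs1) * reward
        elapsed += skip
    bootstrap_discount = gamma ** (elapsed / fs1)
    return total, bootstrap_discount


# Three selected frames with skips of 4, 12, and 4 frames (fs1 = 4, fs2 = 12).
total, discount = dfs_er_n_step_reward_sum([1.0, 1.0, 1.0], [4, 12, 4], n=3, gamma=0.5)
print(total, discount)
```

Under this reading, a skip of fs₂ = 12 frames advances the exponent by three decay units, so rewards collected after a long skip are discounted as if three fs₁ steps had passed.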

Accordingly, the apparatus (100) can address the problem that high-quality human demonstration data is limited in most real-world cases. Conventional reinforcement learning generally trains on high-quality training data to achieve higher accuracy (that is, to improve performance effectively). As a result, there is a limit to how far reinforcement learning performance can be improved when sufficient high-quality training data is hard to obtain.

To raise reinforcement learning performance when high-quality training data is hard to use, the valuable data among the existing data (high-quality data, for example human demonstration data) must be used effectively and without waste. Conventionally, however, frame skipping is applied to a single replay buffer that stores both human demonstration data and actor data, so human demonstration data is discarded. That is, in the prior art, valuable data (human demonstration data) is discarded by frame skipping, which reduces the amount of training data available for training the neural network and degrades its training performance.

Accordingly, to effectively improve reinforcement learning performance (that is, to accelerate reinforcement learning) in situations where sufficient high-quality training data is hard to obtain, the present application proposes an online frame skipping technique based on dual replay buffers, in which all human demonstration data can be used as training data for the neural network without any of it being discarded.

That is, the present application proposes a new deep reinforcement learning technique that uses dual replay buffer management and online frame skipping (121, 122) for sampling the human demonstration data (111').

According to the dual replay buffer (111, 112) management technique proposed herein, the replay manager (110) may handle the human replay buffer and the actor replay buffer separately, using independent sampling policies.
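As a rough illustration of separate buffers with independent sampling, the following Python sketch keeps human and actor transitions apart and draws from each on its own. The class name, method names, and the `human_ratio` parameter are hypothetical; they are not the replay manager (110) described here, and uniform sampling is used only as a stand-in for whatever sampling policy each buffer actually applies.

```python
import random
from collections import deque


class DualReplayBuffer:
    """Two independent buffers: human demonstrations and actor experience."""

    def __init__(self, human_capacity, actor_capacity, seed=None):
        self.human_capacity = human_capacity
        self.human = []                            # demonstrations are never evicted
        self.actor = deque(maxlen=actor_capacity)  # oldest actor data is overwritten
        self.rng = random.Random(seed)

    def add_human(self, transition):
        if len(self.human) >= self.human_capacity:
            raise OverflowError("human replay buffer is full")
        self.human.append(transition)

    def add_actor(self, transition):
        self.actor.append(transition)

    def sample(self, batch_size, human_ratio):
        """Draw a minibatch whose human share is set independently of buffer sizes."""
        n_human = min(int(batch_size * human_ratio), len(self.human))
        n_actor = min(batch_size - n_human, len(self.actor))
        human_batch = self.rng.sample(self.human, n_human)
        actor_batch = self.rng.sample(list(self.actor), n_actor)
        return human_batch, actor_batch


buf = DualReplayBuffer(human_capacity=100, actor_capacity=100, seed=0)
for i in range(50):
    buf.add_human(("human", i))
    buf.add_actor(("actor", i))
human_batch, actor_batch = buf.sample(batch_size=32, human_ratio=0.25)
print(len(human_batch), len(actor_batch))
```

Keeping the two stores apart means a skipping rule can be applied to the human side alone without touching how actor data is stored or sampled.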

Also, the apparatus (100) proposes the FS-ER skipping (121) technique so that the high-quality human demonstration data discarded by conventional frame skipping can all be used without loss (that is, so that the entire human demonstration data can be used). In addition, the apparatus (100) proposes the DFS-ER skipping (122) technique, a dynamic online skipping technique, for learning the action repetition of the human demonstration data (111').

That is, even in an environment where sufficient high-quality training data cannot be obtained, the apparatus (100) allows the human demonstration data, which is high-quality training data, to be fully used as training data for reinforcement learning without any of it being discarded, and can thereby effectively improve the performance (speed and accuracy) of reinforcement learning.

In other words, even in an environment where reinforcement learning cannot rely on abundant high-quality training data, the apparatus (100) can provide high-performance reinforcement learning by making full use of human demonstrations.

The present application proposes a technique for processing data used in deep reinforcement learning. By using all of the high-quality human data that conventional frame skipping discarded, the present application can improve the results and performance of deep reinforcement learning.

Also, unlike the prior art, the present application builds the human replay buffer and the actor replay buffer separately, so online frame skipping can easily be applied to the human data only. In addition, the present application applies FS-ER skipping, which keeps the frame skipping interval constant, and DFS-ER skipping, which varies the frame skipping interval, so that all human data can be used without waste.

Also, the present application may redefine how the return value is computed for obtaining the loss function of the neural network that is applicable to FS-ER skipping and DFS-ER skipping.

FIG. 8 is a block diagram showing a schematic configuration of an apparatus (1) for deep reinforcement learning of a neural network according to an embodiment of the present application.

Referring to FIG. 8, the apparatus (1) for deep reinforcement learning of a neural network according to an embodiment of the present application may include a data processing apparatus (100) and a learning control unit (200).

Here, the data processing apparatus (100) may be the data processing apparatus for deep reinforcement learning (100, "the apparatus") according to an embodiment of the present application described above. Therefore, even where omitted below, the description given above for the apparatus (100) applies equally to the data processing apparatus (100) shown in FIG. 8.

The data processing apparatus (100) may perform data processing for deep reinforcement learning on target data.

The data processing apparatus (100) may store the human demonstration data among the target data in the human replay buffer and store the actor data among the target data in the actor replay buffer. Also, to sample the human demonstration data stored in the human replay buffer, the data processing apparatus (100) may perform data processing by applying online frame skipping to the human replay buffer.

Here, online frame skipping may include two types of frame skipping, distinguished by whether the frame skipping interval is variable.

As the first type, online frame skipping may include FS-ER skipping, which keeps the frame skipping interval constant. As the second type, online frame skipping may include DFS-ER skipping, which varies the frame skipping interval and skips frames dynamically.

Also, the data processing apparatus (100) may redefine the return value required to compute the loss value (loss function) used when updating the neural network (50), whose inputs are the human demonstration data sampled by the data processing and the actor data already sampled and stored in the actor replay buffer.

The learning control unit (200) may train the neural network (50) by deep reinforcement learning, using the human demonstration data and actor data sampled by the data processing of the data processing apparatus (100) as inputs to the neural network (50).

By training the neural network (50) with all of the human demonstration data, the apparatus (1) for deep reinforcement learning of a neural network according to an embodiment of the present application can effectively improve the performance (speed and accuracy) of reinforcement learning.

An experimental example for evaluating the performance of the data processing apparatus (100) for deep reinforcement learning according to an embodiment of the present application is described below.

In this experiment, empirical experiments were conducted on four popular Atari games to evaluate the performance of the apparatus (100), that is, of the proposed technique. The results show that the two proposed online frame skipping techniques using dual replay memory outperform the existing baselines.

In particular, DFS-ER showed the fastest score increase during the reinforcement learning process in three of the four experiments. FS-ER showed the best performance in the remaining environment, where the model is hard to train because of sparse rewards.

In other words, the empirical experiments used the DQfD model of Document 3 described above and four games of the Atari Grand Challenge dataset. For the Atari Grand Challenge dataset, see, for example, Document 8 [V. Kurin, S. Nowozin, K. Hofmann, L. Beyer, and B. Leibe, "The atari grand challenge dataset," CoRR, vol. abs/1705.10998, 2017. [Online]. Available: http://arxiv.org/abs/1705.10998].

According to the experimental results, FS-ER outperformed the other conventional methods in Montezuma's Revenge, one of the four evaluation environments. Montezuma's Revenge is an environment in which a reinforcement learning model is hard to train because of sparse rewards. The proposed DFS-ER outperformed the other conventional methods in three of the four evaluation environments. In particular, in the pre-training stage DFS-ER achieved a higher average score than the conventional method, and in the reinforcement learning stage its average score increased rapidly.

The experiment of the present application (hereinafter "this experiment" for convenience) is described in more detail below.

First, the evaluation settings of this experiment are as follows.

To evaluate the effect of the proposed technique, this experiment conducted empirical experiments on the Arcade Learning Environment (ALE). The test program was implemented with TensorFlow 1.8, and the Atari Grand Challenge v1 dataset was used for the human demonstration data. The replay buffer class considered in this experiment may be an extended version of an existing replay buffer.

Also, in this experiment the DQfD model of Document 3 described above was used both as the test neural network model and as the baseline model for performance comparison. To use the DQfD model with the replay buffer of OpenAI Baselines, the replay buffer class was extended by adding the DFS-ER and FS-ER methods.

Regarding the Atari Grand Challenge dataset, only four of its five games were used. The reason is that in Video Pinball ((b) of FIG. 3) a high score can be obtained very easily even when actions are chosen by a uniform random policy. Video Pinball was therefore excluded from the evaluation environments.

Also, in this experiment the size of the human replay buffer (111) may be 50,000 when frame skipping is not performed on the human demonstration data beforehand. However, to evaluate the conventional method, which skips frames before storing the human demonstration data in the human replay buffer, the size of the human replay buffer (111) may be set to 12,500 (that is, one quarter of 50,000).
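The two buffer sizes follow from where the skipping happens. The short Python sketch below contrasts the two placements; it uses integer frame indices in place of real frames, and both function names are hypothetical. The interval of 4 matches the fs₁ value of this experiment.

```python
def offline_skip(frames, fs=4):
    """Conventional placement: keep only every fs-th frame before storing."""
    return frames[::fs]


def online_skip_pair(frames, t, fs=4):
    """Online placement: store every frame, and pair frame t with the frame
    fs steps later only when a sample is drawn (clamped at the last frame)."""
    nxt = min(t + fs, len(frames) - 1)
    return frames[t], frames[nxt]


frames = list(range(50_000))           # stand-in for 50,000 demonstration frames
print(len(offline_skip(frames)))       # frames that survive offline skipping
print(online_skip_pair(frames, t=5))   # any stored frame can start a transition
```

With offline skipping, frames whose index is not a multiple of 4 can never be sampled, whereas the online variant can start a transition at any of the 50,000 stored frames.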

Also, in this experiment the episode scores of each game were sorted in descending order, and the human demonstration data was stored so that only high-scoring demonstrations were used. The size of the actor replay buffer (112) may be set equal to the size of the human replay buffer (111).

The parameters used in this experiment may be set as follows.

The minibatch size may be set to 32. The total number of pre-training steps may be set to 500,000, and the total number of reinforcement learning steps to 1,000,000. The learning rate may be set to 0.0001. The fs₁ skip value may be set to 4, the fs₂ skip value to 12, and the fs₂ condition value to 70. The maximum episode length of the environment may be set to 50,000 steps. The values of λ₁, λ₂, λ₃, and λ₄ of DQfD may be set to 1.0, 1.0, 1.0, and 0.0001, respectively.
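For reference, the settings listed above can be collected in one place. The Python dataclass below is only a convenience sketch; the class and field names are hypothetical, while the values are the ones stated in this paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters of the experiment (field names are illustrative)."""
    minibatch_size: int = 32
    pretrain_steps: int = 500_000
    rl_steps: int = 1_000_000
    learning_rate: float = 1e-4
    fs1_skip: int = 4            # fixed skip value
    fs2_skip: int = 12           # longer skip value used by DFS-ER
    fs2_condition: int = 70      # action-repeat threshold for choosing fs2
    max_episode_length: int = 50_000
    dqfd_lambdas: tuple = (1.0, 1.0, 1.0, 1e-4)  # lambda_1 .. lambda_4 of DQfD


config = ExperimentConfig()
print(config.fs2_skip // config.fs1_skip)  # fs2 spans three fs1-length units
```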

The experiment for evaluating the proposed technique focuses on comparing conventional frame skipping with single replay buffer management against the proposed online frame skipping with dual replay buffer management.

Three approaches were evaluated in this experiment, and each result was averaged over four runs with different random seeds.

The three approaches are SINGLE, FS-ER, and DFS-ER. SINGLE is the conventional frame skipping technique based on single replay buffer management. FS-ER is the first type of frame skipping proposed herein, based on dual replay buffer management. DFS-ER is the second type of frame skipping proposed herein, also based on dual replay buffer management.

To keep the human sampling ratio at each reinforcement learning step the same across approaches, this experiment set a sampling ratio policy for dual replay buffer management that mimics single replay buffer management.

FIG. 9 shows the number of human data items in a minibatch as a result of this experiment. That is, FIG. 9 shows an example of how human demonstration data is sampled under dual replay buffer management and under single replay buffer management.

Referring to FIG. 9, the small difference between dual replay buffer management and single replay buffer management is attributable to converting the number of human demonstration data items sampled from the human replay buffer to an integer.
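One plausible way to mimic single-buffer sampling with two buffers is sketched below in Python. It assumes that a single buffer is sampled uniformly, so the expected human share of a minibatch equals the human fraction of all stored transitions; the function name is hypothetical, and truncation with `int` is only one possible integer conversion.

```python
def human_samples_per_batch(batch_size, n_human, n_actor):
    """Number of human transitions to draw so that a dual buffer matches the
    expected composition of a uniformly sampled single buffer.

    The expected count is generally fractional; truncating it to an integer
    is what introduces a small gap between the two schemes.
    """
    total = n_human + n_actor
    if total == 0:
        return 0
    expected = batch_size * n_human / total
    return int(expected)


print(human_samples_per_batch(32, 50_000, 0))       # only demonstrations stored
print(human_samples_per_batch(32, 50_000, 25_000))  # expected 21.33 is truncated
print(human_samples_per_batch(32, 50_000, 50_000))  # both buffers full
```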

The evaluation results of this experiment are described below.

FIG. 10 shows the average episode scores of three games as a result of this experiment. Specifically, FIG. 10 shows the average episode scores obtained when the three approaches described above (SINGLE, FS-ER, and DFS-ER) are applied to each of three games: Q*bert (a), Ms. Pacman (b), and Montezuma's Revenge (c).

(a) of FIG. 10 compares the evaluation results for Q*bert.

Referring to (a) of FIG. 10, FS-ER outperforms the other methods in the pre-training stage. This is a meaningful result, because the FS-ER technique proposed herein achieved a higher score than the result reported in Document 7 described above, which discloses the conventional DQfD using high-scoring human demonstration data.

Also, the proposed DFS-ER achieves a better score than the conventional method even though it doubles the output dimension of the neural network. This means that the proposed DFS-ER has a positive effect in the pre-training stage.

In the Q*bert results, the scores of FS-ER and SINGLE converge as training proceeds. The score of DFS-ER, by contrast, is higher than those of the other methods at the end of the graph (the right-hand side of the plot), owing to the faster episode exploration that dynamic frame skipping allows.

(b) of FIG. 10 compares the evaluation results for Ms. Pacman.

Referring to (b) of FIG. 10, the pattern of the evaluation results for Ms. Pacman differs from that of Q*bert shown in (a) of FIG. 10. DFS-ER scores slightly higher than the other methods in the pre-training stage of Ms. Pacman and performs well in the reinforcement learning stage. This shows that learning the action repetition of human demonstration data with DFS-ER is effective.

The evaluation results for Space Invaders showed almost the same pattern as those for Ms. Pacman shown in (b) of FIG. 10.

(c) of FIG. 10 compares the evaluation results for Montezuma's Revenge.

Referring to (c) of FIG. 10, unlike in (a) and (b) of FIG. 10, the epoch length for Montezuma's Revenge (that is, the interval over which scores are averaged) was increased to 100,000, because its episodes are longer than those of the other games.

For Montezuma's Revenge, FS-ER outperformed the other methods at the end of the graph (the right-hand side of the plot). This result indicates that FS-ER uses the human demonstration data more efficiently than the existing method, because it observes the entire human demonstration data without dropping any frame.

FIG. 11 summarizes, as an experimental result of the present application, the average scores achieved in the four games. Specifically, FIG. 11 shows the average scores obtained when the three approaches described above (SINGLE, FS-ER, and DFS-ER) were applied to the four games. In other words, FIG. 11 summarizes the average scores for all environments considered in this experiment.

In FIG. 11, the Pretrain column gives the average score of four runs with different random seeds at the end of the pre-training phase. The RL column gives the average score at the end of the reinforcement learning phase. FIG. 11 also includes, for each of the three approaches (SINGLE, FS-ER, and DFS-ER) in each game, a column for the maximum average score over all training epochs (that is, over the entire training process, covering both the pre-training phase and the reinforcement learning phase).

Referring to FIG. 11, in every case (that is, for each of the Pretrain, RL, and maximum average score columns), the FS-ER or DFS-ER skipping technique proposed herein performed better than the conventional SINGLE technique (that is, a skipping technique based on a single replay buffer).

In other words, the empirical results on the four Atari games described above confirm that the proposed technique, which provides a dual replay buffer (memory) architecture with online skipping (FS-ER and DFS-ER), outperforms the existing baselines (the prior art).

DFS-ER showed the fastest score increase during pre-training in three of the four environments, outperforming the other conventional methods. FS-ER, in turn, showed the best performance in an environment where the model is difficult to train because of sparse rewards (for example, Montezuma's Revenge).

These experimental results demonstrate the effectiveness of the proposed technique in making use of limited human demonstration data to strengthen training.

Hereinafter, the operation flow of the present application is briefly described on the basis of the details given above.

FIG. 12 is an operation flowchart of a data processing method for deep reinforcement learning according to an embodiment of the present application.

The data processing method for deep reinforcement learning shown in FIG. 12 may be performed by the apparatus 100 described above. Accordingly, even where details are omitted below, the description given for the apparatus 100 applies equally to the data processing method for deep reinforcement learning.

Referring to FIG. 12, in step S110, human demonstration data among the target data may be stored in a human replay buffer, and actor data among the target data may be stored in an actor replay buffer.
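The storage step S110 can be illustrated with a minimal sketch. This is not the implementation of the apparatus 100; the class name `DualReplayBuffer`, the `from_human` flag, the capacity parameters, and the way the two buffers are sampled are all assumptions introduced here for illustration.

```python
from collections import deque
import random


class DualReplayBuffer:
    """Hypothetical container that keeps human and actor transitions apart."""

    def __init__(self, human_capacity, actor_capacity):
        # Bounded queues: the oldest entries are discarded once a buffer is full.
        self.human = deque(maxlen=human_capacity)
        self.actor = deque(maxlen=actor_capacity)

    def store(self, transition, from_human):
        # Route each transition by its origin (cf. step S110).
        target = self.human if from_human else self.actor
        target.append(transition)

    def sample(self, n_human, n_actor, rng=random):
        # Draw from each buffer independently; the mixing ratio is an assumption.
        human_batch = rng.sample(list(self.human), min(n_human, len(self.human)))
        actor_batch = rng.sample(list(self.actor), min(n_actor, len(self.actor)))
        return human_batch, actor_batch
```

Keeping the two buffers separate means that processing such as frame skipping can be applied to the human demonstration data alone, without touching the actor data.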

Next, in step S120, data processing may be performed by applying online frame skipping to the human replay buffer in order to sample the human demonstration data stored in the human replay buffer in step S110.

The online frame skipping may include, as a first type of frame skipping, FS-ER skipping, which performs skipping while keeping the frame skipping interval constant. The online frame skipping may also include, as a second type of frame skipping, DFS-ER skipping, which performs skipping dynamically by varying the frame skipping interval.

In addition, in step S120, frame states for the human demonstration data stored in the human replay buffer may be generated before the online frame skipping is applied. Each frame state may be generated by stacking a number of frames corresponding to the frame skipping coefficient.

In addition, in step S120, when the first type of frame skipping is applied, a number of frames corresponding to the frame skipping coefficient may be stacked at intervals corresponding to a preset number of frames.
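One possible reading of fixed-interval stacking is sketched below. The function name `stack_fixed`, its parameters, and the choice to repeat the first frame at the start of an episode are assumptions made for illustration; the text above does not specify how episode boundaries are handled.

```python
def stack_fixed(frames, end, num_stack, interval):
    """Return num_stack frames ending at index `end`, spaced `interval` apart.

    frames    : sequence of stored frames (any objects)
    end       : index of the most recent frame in the state
    num_stack : number of frames to stack (stands in for the skipping coefficient)
    interval  : fixed spacing, in frames, between stacked frames
    """
    # Newest-to-oldest indices at a constant spacing.
    indices = [end - i * interval for i in range(num_stack)]
    # Reorder oldest-to-newest and clamp at 0 (assumed boundary handling).
    indices = [max(i, 0) for i in reversed(indices)]
    return [frames[i] for i in indices]
```

For example, with 20 stored frames, `stack_fixed(frames, 12, 4, 4)` selects the frames at indices 0, 4, 8, and 12.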

In addition, in step S120, the second type of frame skipping may be applied in consideration of an action repetition value that indicates the action duration of each frame of the human demonstration data. Here, the action repetition value may include two skip values.

In addition, in step S120, when the second type of frame skipping is applied, the skip value of each frame may be checked when a frame state is generated, and the frame state may be generated in consideration of the checked skip value.
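The idea of checking a per-frame skip value while building a frame state can be sketched as follows. This is only an illustration under simplifying assumptions: it uses a single hypothetical skip value per frame (the list `skips`) and does not model the two skip values mentioned above, and the function name `stack_dynamic` and the backward walk are not taken from the source.

```python
def stack_dynamic(frames, skips, end, num_stack):
    """Build a frame state by stepping back by each frame's own skip value.

    frames    : sequence of stored frames
    skips     : skips[i] is the assumed action repetition of frame i (>= 1)
    end       : index of the most recent frame in the state
    num_stack : number of frames to stack
    """
    indices = []
    i = end
    for _ in range(num_stack):
        indices.append(i)
        # Check this frame's skip value and move back by that many frames,
        # never past the first frame (assumed boundary handling).
        i = max(i - max(skips[i], 1), 0)
    indices.reverse()  # oldest frame first
    return [frames[j] for j in indices]
```

In contrast to a fixed interval, the spacing between stacked frames here changes from frame to frame, following how long each demonstrated action was held.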

In addition, although not shown in the drawing, the data processing method for deep reinforcement learning according to an embodiment of the present application may include redefining the return value required to calculate the loss value used when updating a neural network that takes as input the actor data and the human demonstration data sampled by the data processing in step S120.

Here, in the redefining step, the return value may be redefined for each of the two types of frame skipping.

In the above description, steps S110 and S120 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

The data processing method for deep reinforcement learning according to an embodiment of the present application may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

In addition, the data processing method for deep reinforcement learning described above may also be implemented in the form of a computer program or application that is stored on a recording medium and executed by a computer.

The foregoing description of the present application is for illustrative purposes, and those of ordinary skill in the art to which the present application pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present application. The embodiments described above should therefore be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present application is indicated by the claims set out below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

100: Data processing apparatus for deep reinforcement learning
110: Replay manager
111: Human replay buffer
112: Actor replay buffer
120: Data processing unit
130: Redefinition unit

Claims

A data processing method for deep reinforcement learning, each step of which is performed by a data processing apparatus implemented as a computer, the method comprising:
(a) storing human demonstration data among target data in a human replay buffer, and storing actor data among the target data in an actor replay buffer; and
(b) performing data processing by applying online frame skipping to the human replay buffer in order to sample the human demonstration data stored in the human replay buffer,
wherein the online frame skipping includes two types of frame skipping that differ in whether the frame skipping interval is variable.

The method of claim 1,
wherein the online frame skipping includes,
as a first type of frame skipping, FS-ER skipping, which performs skipping while keeping the frame skipping interval constant, and,
as a second type of frame skipping, DFS-ER skipping, which performs skipping dynamically by varying the frame skipping interval.

The method of claim 2,
wherein step (b) comprises,
before the online frame skipping is applied, generating frame states for the human demonstration data stored in the human replay buffer,
wherein each frame state is generated by stacking a number of frames corresponding to a frame skipping coefficient.

The method of claim 3,
wherein step (b) comprises,
when the first type of frame skipping is applied, stacking a number of frames corresponding to the frame skipping coefficient at intervals corresponding to a preset number of frames.

The method of claim 3,
wherein step (b) comprises
applying the second type of frame skipping in consideration of an action repetition value indicating an action duration of each frame of the human demonstration data,
wherein the action repetition value includes two skip values.

The method of claim 5,
wherein step (b) comprises,
when the second type of frame skipping is applied, checking a skip value of each frame when generating the frame state, and generating the frame state in consideration of the checked skip value.

The method of claim 1,
further comprising:
(c) redefining a return value required to calculate a loss value used when updating a neural network that takes as input the actor data and the human demonstration data sampled by the data processing in step (b),
wherein step (c) redefines the return value for each of the two types of frame skipping.

A data processing apparatus for deep reinforcement learning, comprising:
a replay manager that stores human demonstration data among target data in a human replay buffer, and stores actor data among the target data in an actor replay buffer; and
a data processing unit that performs data processing by applying online frame skipping to the human replay buffer in order to sample the human demonstration data stored in the human replay buffer,
wherein the online frame skipping includes two types of frame skipping that differ in whether the frame skipping interval is variable.

The apparatus of claim 8,
wherein the online frame skipping includes,
as a first type of frame skipping, FS-ER skipping, which performs skipping while keeping the frame skipping interval constant, and,
as a second type of frame skipping, DFS-ER skipping, which performs skipping dynamically by varying the frame skipping interval.

The apparatus of claim 9,
wherein the data processing unit,
before the online frame skipping is applied, generates frame states for the human demonstration data stored in the human replay buffer,
wherein each frame state is generated by stacking a number of frames corresponding to a frame skipping coefficient.

A deep reinforcement learning apparatus for a neural network, comprising:
a data processing apparatus that performs data processing for deep reinforcement learning on target data; and
a learning control unit that performs deep reinforcement learning of the neural network by using, as inputs to the neural network, human demonstration data and actor data sampled through the data processing performed by the data processing apparatus,
wherein the data processing apparatus comprises:
a replay manager that stores the human demonstration data among the target data in a human replay buffer, and stores the actor data among the target data in an actor replay buffer; and
a data processing unit that performs data processing by applying online frame skipping to the human replay buffer in order to sample the human demonstration data stored in the human replay buffer,
wherein the online frame skipping includes two types of frame skipping that differ in whether the frame skipping interval is variable.

The apparatus of claim 11,
wherein the online frame skipping includes,
as a first type of frame skipping, FS-ER skipping, which performs skipping while keeping the frame skipping interval constant, and,
as a second type of frame skipping, DFS-ER skipping, which performs skipping dynamically by varying the frame skipping interval.