KR20200082909A - Imitational reinforcement learning system for controlling devices remotely and the method thereof

Info

Publication number: KR20200082909A
Application number: KR1020180173947A
Authority: KR
Inventors: 한연희; 임현교; 김주봉; 권도형
Original assignee: 한국기술교육대학교 산학협력단
Priority date: 2018-12-31
Filing date: 2018-12-31
Publication date: 2020-07-08
Also published as: KR102198450B1

Abstract

The present invention relates to an imitation reinforcement learning system for remote device control and a method thereof. The system controls a specific device using a traditional machine control model defined in advance, generates and stores transition information for that device, and imitates the stored transition information, so that reinforcement learning of a reinforcement learning model for controlling the device can be performed on stable initial data. As a result, when a device exposed to various environments is controlled through the reinforcement learning model, the model adapts quickly to the environment and controls the device accurately and effectively.

Description

IMITATIONAL REINFORCEMENT LEARNING SYSTEM FOR CONTROLLING DEVICES REMOTELY AND THE METHOD THEREOF

The present invention relates to an imitation reinforcement learning system for remote device control and a method thereof. A traditional machine control model defined in advance for controlling a specific device is used to generate and store transition information for that device, and the stored transition information is imitated so that reinforcement learning of a reinforcement learning model for controlling the device can be performed on stable initial data. When a device exposed to various environments is then controlled through the reinforcement learning model, the model adapts quickly to the environment and controls the device precisely and effectively.

With the recent rapid development of industrial technology and information and communication technology (ICT), the Fourth Industrial Revolution has arrived. In fields such as big data analysis and processing, robotics, the Internet of Things, and unmanned transportation such as driverless vehicles, the real physical environment and the cyber environment are converged on the basis of ICT, and every product in the real world is connected to a network and made intelligent, so that users are provided with more convenient and suitable services.

Machine learning, such as deep learning, is at the center of this Fourth Industrial Revolution. In particular, in reinforcement learning, a branch of machine learning, an agent defined within a specific environment recognizes the current state of a specific device located in a real physical environment and selects, among the selectable actions, the action or sequence of actions that maximizes the reward. By developing the agent's behavior in this way, control of the device located in the physical environment is optimized and the device can be controlled efficiently.

However, reinforcement learning is used overwhelmingly in arcade game simulations rather than in control systems for real devices.

The reason is that the high-dimensional state of the real physical environment must be captured, which makes it very difficult to construct a reinforcement learning network.

In addition, conventional reinforcement learning methods for controlling a device apply a reinforcement learning network with randomly set weights directly to the real environment, generate control commands for the device, and perform reinforcement learning according to the results of those commands. Because of the nonlinear complexity of the real physical environment, it takes a very long time for the network to adapt to the physical environment and become mature enough to control the device precisely. Moreover, such a network cannot generate accurate control commands for controlling the device according to a preset episode, so the device cannot be controlled precisely.

Accordingly, the present invention proposes a technical solution in which the device is controlled through a traditional machine control model; transition information of the device, comprising the current state information, control information, reward information, and next state information of the device, is generated and stored; and reinforcement learning is performed by imitating, on the basis of the stored transition information, the series of control operations that the traditional machine control model performed on the device. This allows rapid adaptation to the physical environment of the device and precise remote control of the device.
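The transition information described above can be pictured as a four-field record. The following is a minimal sketch only; the class name `Transition`, the helper `make_transition`, and the choice of a tuple of floats for the state and a single float for the control value are assumptions made for illustration and are not specified in the source.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Transition:
    """One piece of transition information (field names are hypothetical)."""
    state: Tuple[float, ...]       # current state information of the device
    action: float                  # control information applied in that state
    reward: float                  # reward information for the resulting state
    next_state: Tuple[float, ...]  # next state information after the control


def make_transition(state: Sequence[float], action: float,
                    reward: float, next_state: Sequence[float]) -> Transition:
    # Normalize the inputs so that every stored record is immutable.
    return Transition(tuple(state), float(action), float(reward), tuple(next_state))
```

A frozen dataclass is used here so that a stored record cannot be modified after it has been written to the experience memory.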

That is, the present invention is intended to provide a reinforcement learning system and a method thereof that enable fast reinforcement learning based on stable initial data, namely the transition information stored by using the traditional machine control model, and thereby adapt quickly to the real physical environment in which the device is located and control the device precisely from a remote location.

Next, the prior art in the technical field of the present invention is briefly described, followed by the technical matters that distinguish the present invention from that prior art.

First, Korean Patent Laid-Open Publication No. 2018-0034553 (2018.04.04.) relates to continuous control using deep reinforcement learning. It receives observations characterizing each state of an environment; in response to each observation, selects the next action from an action neural network and performs it; receives from the environment the state into which the environment transitions as a result of the action and the corresponding reward; generates and stores an experience tuple comprising the current observation, the selected action, the reward, and the next observation; and then performs reinforcement learning of the action neural network using the stored experience tuples.

That is, this prior art observes the current state of a specific environment, generates an action for it, observes the next state into which the environment changes as a result of the action, sets a reward for it, accumulates and stores experience tuples comprising the current state, action, reward, and next state, and then uses the stored experience tuples to perform reinforcement learning of the action neural network that generates the actions.

In addition, Korean Patent Laid-Open Publication No. 2018-0137562 (2018.12.27.) relates to a method and apparatus for pruning experience memories for deep neural network-based Q-learning. An agent interacting with an environment collects a first state of the environment; the first state is input to a neural network to generate an action for the first state, which is provided to the environment; a second state and a reward for the action are then received from the environment, forming a new experience comprising the first state, action, reward, and second state. The similarity between the new experience and the previously stored experiences is calculated, and if the new experience is very similar to a stored experience, the new experience is discarded, thereby relaxing the constraint on memory size.

This prior art generates, through a neural network, an action for a specific state of an environment, applies it to the environment, receives the corresponding reward from the environment, stores the resulting experience in a memory, and applies the stored experiences to the neural network to train it by reinforcement learning.

As seen above, in most of the prior art the generation of actions for the environment depends entirely on the training of an initial neural network. The neural network therefore takes a long time to adapt to the environment, and while its training is insufficient it cannot control the environment accurately.

In contrast, the present invention quickly collects stable initial data for controlling a device through a traditional control model and enables fast reinforcement learning based on the collected data, thereby minimizing the time needed to adapt to the real physical environment in which a specific device is located while controlling the device precisely and effectively. The prior art neither describes nor suggests this technical feature of the present invention.

The present invention has been devised to solve the above problems. An object of the present invention is to provide an imitation reinforcement learning system for remote device control and a method thereof that, through a reinforcement learning model for device control capable of remotely and automatically controlling from a cyber environment a plurality of devices located in a physical environment, adapt quickly to the physical environment and control each device accurately and efficiently.

Another object of the present invention is to provide an imitation reinforcement learning system for remote device control and a method thereof that collect, through a traditional device control model, transition information of a specific device, comprising current state information of the device, control information for controlling the device in the current state, reward information for the state into which the device changed as a result of the control information, and next state information representing that changed state; store the transition information in an experience memory; and perform reinforcement learning by imitating the stored transition information. Fast reinforcement learning based on stable initial data thus allows rapid adaptation to the physical environment and precise control of the device located in it.

Another object of the present invention is to provide an imitation reinforcement learning system for remote device control and a method thereof that, when reinforcement learning is performed to control the device according to a preset episode and the number of transitions stored in the experience memory is less than a preset number, control the device through the traditional device control model. Stable, high-quality transition information for the device located in the physical environment is thereby collected effectively, so that reinforcement learning can be performed quickly using stable initial transition information.

Another object of the present invention is to provide an imitation reinforcement learning system for remote device control and a method thereof that perform reinforcement learning of a reinforcement learning model for device control, used to control the device remotely, based on the transition information collected through the traditional control model; then generate, through the reinforcement learning model, control information for the current state information of the device and control the device; collect and store the resulting transition information in the experience memory; and, each time an episode ends, perform reinforcement learning of the model based on the transition information stored in the experience memory, thereby refining the model so that the device can be controlled precisely.

Another object of the present invention is to provide an imitation reinforcement learning system for remote device control and a method thereof in which the experience memory has a first-in, first-out (FIFO) memory structure and a preset number of transitions are stored in FIFO order, which minimizes repeated use of any particular transition in reinforcement learning and maximizes the efficiency of reinforcement learning.
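A fixed-capacity FIFO experience memory of this kind can be sketched with a bounded double-ended queue. This is an illustrative sketch under stated assumptions: the class name `ExperienceMemory`, its method names, and the use of Python's `collections.deque` are choices made here and are not taken from the source.

```python
from collections import deque


class ExperienceMemory:
    """Hypothetical FIFO experience memory holding at most `capacity` transitions."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently drops the oldest item when a new one
        # is appended to a full buffer, which gives first-in, first-out eviction.
        self._buffer = deque(maxlen=capacity)

    def store(self, transition) -> None:
        self._buffer.append(transition)

    def __len__(self) -> int:
        return len(self._buffer)

    def is_full(self) -> bool:
        # True once the preset number of transitions has been stored.
        return len(self._buffer) == self._buffer.maxlen

    def snapshot(self) -> list:
        # Stored transitions, oldest first.
        return list(self._buffer)
```

Because the oldest transition is evicted as soon as the preset capacity is exceeded, no transition stays in the memory indefinitely, which is one way to limit how often a single transition is reused in learning.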

Another object of the present invention is to provide an imitation reinforcement learning system for remote device control and a method thereof that, when state information, control information, and transition information are transmitted and received, adjust the network traffic carrying that information to have the highest priority, so that the device can be controlled in real time without network delay.

An imitation reinforcement learning system for remote device control according to an embodiment of the present invention comprises: an experience memory that stores, in a first-in, first-out (FIFO) manner, transition information generated based on the results of controlling at least one device located in a physical environment according to a preset episode; and an imitation reinforcement learning unit that uses the transition information stored in the experience memory to perform reinforcement learning of a reinforcement learning model for device control, the model being used to control the device step by step. The transition information comprises current state information of the device, control information for controlling the device, next state information of the device controlled according to the control information, and reward information for the next state information.

In addition, when fewer than a preset number of transitions are stored in the experience memory, the imitation reinforcement learning system controls the device according to the episode through a PID controller that uses a PID control model, generates the transition information, and stores it in the experience memory. The imitation reinforcement learning unit performs the reinforcement learning based on the transition information generated and stored through the PID controller when the number of such transitions reaches the preset number, each time a transition is stored in the experience memory through the PID controller, or when the episode ends, thereby performing the reinforcement learning by imitating the experience of the PID controller that controlled the device.
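The switching rule in this paragraph, together with a textbook discrete PID control law, might look roughly as follows. This is a sketch under assumptions: the source does not give the PID gains, the sampling interval, or any function names, so `PIDController`, `select_controller`, and all numeric values are hypothetical.

```python
class PIDController:
    """Textbook discrete PID controller (gains are illustrative, not from the source)."""

    def __init__(self, kp: float, ki: float, kd: float, setpoint: float = 0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error = 0.0

    def control(self, measurement: float, dt: float) -> float:
        # Error between the target state and the measured device state.
        error = self.setpoint - measurement
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def select_controller(stored_count: int, preset_count: int) -> str:
    # Below the preset number of stored transitions the PID controller drives
    # the device; once the preset number is reached the learned model takes over.
    return "pid" if stored_count < preset_count else "rl"
```

In such a setup, each control value produced by `control` would be sent to the device, and the resulting state, control, reward, and next state would be stored as one transition until `select_controller` returns `"rl"`.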

In addition, the imitation reinforcement learning system further comprises a reinforcement learning agent that, when the preset number of transitions are stored in the experience memory, inputs the state information of the device into the reinforcement learning model for device control on which the imitation reinforcement learning has been performed and extracts control information for controlling the device, thereby controlling the device step by step according to the episode. The reinforcement learning agent receives device state information from the device controlled based on the control information, generates transition information for the device, and stores it in the experience memory in a FIFO manner.
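The agent's step-by-step control of one episode can be sketched as a simple loop. The sketch assumes hypothetical callables `policy` (the trained model mapping state to control), `step_device` (sends the control to the device and returns the next state, the reward, and an end-of-episode flag), and `train` (one learning pass over the stored transitions at the end of the episode); none of these names come from the source.

```python
from collections import deque


def run_episode(step_device, policy, memory, train, initial_state, max_steps):
    """Control the device step by step, store each transition, train at episode end."""
    state = initial_state
    for _ in range(max_steps):
        action = policy(state)  # control information extracted from the model
        next_state, reward, done = step_device(state, action)
        # With a bounded deque, appending to a full memory drops the oldest entry (FIFO).
        memory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    train(list(memory))  # learning step once the episode has ended
    return state
```

Here `memory` is expected to be a bounded queue such as `deque(maxlen=N)`, with `N` standing for the preset number of transitions.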

In addition, the reinforcement learning is performed so as to extract, for the device state information of a specific state of the device, the control information that maximizes the total sum of the reward information up to a preset final state of the device. The experience memory stores the transition information in a FIFO manner, thereby preventing the same transition information from being applied repeatedly during the reinforcement learning.
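As an illustration of this objective, and assuming an undiscounted sum taken in expectation (the source mentions neither a discount factor nor an expectation), the selection of control information could be written as follows. The symbols $s_t$, $a_t$, $r_k$, and $T$ are notation introduced here for the state information, the control information, the reward information, and the step at which the preset final state is reached.

```latex
G_t = \sum_{k=t}^{T} r_k,
\qquad
a_t^{*} = \arg\max_{a} \; \mathbb{E}\left[\, G_t \mid s_t,\ a_t = a \,\right]
```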

In addition, the reward information is generated by applying a positive reward value when the state information of the device is the result of control within a preset threshold range, and a negative reward value when it is the result of control exceeding the preset threshold range.
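This reward rule reduces to a range check. The sketch below assumes a scalar state value and hypothetical reward magnitudes of +1 and -1; the source states only that the reward is positive inside the threshold range and negative outside it.

```python
def compute_reward(state_value: float, lower: float, upper: float,
                   plus: float = 1.0, minus: float = -1.0) -> float:
    """Positive reward inside the preset threshold range, negative outside it.

    The bounds and the reward magnitudes are illustrative assumptions.
    """
    return plus if lower <= state_value <= upper else minus
```

For a device such as an inverted pendulum, `state_value` might be the pendulum angle and the range a small band around the upright position.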

Further, an imitation reinforcement learning method for remote device control according to an embodiment of the present invention comprises: a storing step of storing in an experience memory, in a first-in, first-out (FIFO) manner, transition information generated based on the results of controlling at least one device located in a physical environment according to a preset episode; and an imitation reinforcement learning step of using the transition information stored in the experience memory to perform reinforcement learning of a reinforcement learning model for device control, the model being used to control the device. The transition information comprises current state information of the device, control information for controlling the device, next state information of the device controlled according to the control information, and reward information for the next state information.

In addition, in the imitation reinforcement learning method, when fewer than a preset number of transitions are stored in the experience memory, the device is controlled according to the episode through a PID controller that uses a PID control model, and the transition information is generated and stored in the experience memory. In the imitation reinforcement learning step, the reinforcement learning is performed based on the transition information generated and stored through the PID controller when the number of such transitions reaches the preset number, each time a transition is stored in the experience memory through the PID controller, or when the episode ends, thereby performing the reinforcement learning by imitating the experience of the PID controller that controlled the device.

In addition, the imitation reinforcement learning method further comprises a step in which, when the preset number of transitions are stored in the experience memory, a reinforcement learning agent inputs the state information of the device into the reinforcement learning model for device control on which the reinforcement learning has been performed and extracts control information for controlling the device, thereby controlling the device step by step according to the episode. The reinforcement learning agent receives device state information from the device controlled based on the control information, generates transition information for the device, and stores it in the experience memory in a FIFO manner.

As described above, the imitation reinforcement learning system for remote device control and the method thereof according to the present invention collect stable transition information for the device through a traditional control model and perform reinforcement learning of a reinforcement learning model for device control, used to control the device remotely, based on the collected stable transition information. This has the effect of adapting quickly to the physical environment of the device and controlling the device precisely.

In addition, storing the transition information in FIFO order in an experience memory having a FIFO structure minimizes repeated application of any particular transition to the reinforcement learning, which has the effect of allowing the reinforcement learning to be performed efficiently.

In addition, the priority of the network traffic carrying the data required for the reinforcement learning can be adjusted automatically to the highest priority, which has the effect of allowing the device to be controlled in real time.

FIG. 1 is a diagram schematically illustrating an imitation reinforcement learning system for remote device control and a method thereof according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a method of performing imitation reinforcement learning for remote device control through a device remote control apparatus and an imitation reinforcement learning unit according to an embodiment of the present invention.
FIG. 3 is a diagram showing the structure of a reinforcement learning network according to an embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of a device remote control apparatus according to an embodiment of the present invention.
FIG. 5 is a diagram showing the configuration of the imitation reinforcement learning unit of the imitation reinforcement learning system for remote device control according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a procedure for performing reinforcement learning according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the imitation reinforcement learning system for remote device control and the method thereof according to the present invention are described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote the same members. Specific structural and functional descriptions of the embodiments are given only for the purpose of describing embodiments according to the present invention. Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless expressly so defined herein.

FIG. 1 is a diagram schematically illustrating an imitation reinforcement learning system for remote device control and a method thereof according to an embodiment of the present invention.

As shown in FIG. 1, the imitation reinforcement learning system (10) for remote device control according to an embodiment of the present invention comprises: at least one device (400) located in a physical environment; a device remote control apparatus (100) for remotely controlling, from a cyber environment, the at least one device (400) located in the physical environment; an imitation reinforcement learning unit (200) that generates a reinforcement learning model for device control, which extracts control information for the device (400) so that the device (400) can be controlled remotely, and performs reinforcement learning of the generated model to refine it so that the device (400) can be controlled precisely; a network controller (300) that manages and adjusts the traffic of the network through which data is transmitted and received between the device remote control apparatus (100) and the device (400); and a user terminal (500).

The devices 400 are located in various physical environments, such as a smart factory or an outdoor or indoor space, and are equipped with motors, actuators, and various sensors for sensing and collecting specific information in order to perform their own functions. They are the various apparatuses that are controlled remotely through the imitation reinforcement learning system 10 for remote device control of the present invention.

For example, the device 400 may be a rotary inverted pendulum apparatus for keeping an inverted pendulum upright, a robot arm for moving a specific object to a preset position, a cleaning robot for automatically cleaning a specific space, or any of various other apparatuses.

That is, the device 400 is any of various apparatuses that is connected through a network to the device remote control apparatus 100, which controls the device 400 remotely, and that performs its own function under the control of the device remote control apparatus 100 according to a predetermined episode.

The device remote control apparatus 100 receives, from the imitation reinforcement learning unit 200 implemented in the form of a server, the reinforcement learning model for device control, which extracts control information for remotely controlling the device 400 according to a preset episode, and controls the device 400 on the basis of the received model.

The device remote control apparatus 100 may also include a reinforcement learning agent 120 and a PID controller 110, which interact with the devices 400, collect actual physical state information from each device 400, and allow the device 400 to be controlled remotely on the basis of the collected state information.

The reinforcement learning agent 120 uses the reinforcement learning model for device control provided by the imitation reinforcement learning unit 200 to extract, from the state information of the device 400, control information for controlling the device 400, and transmits the control information to the device 400 so that the device 400 can be controlled remotely.

That is, the reinforcement learning agent 120 stores the reinforcement learning model for device control in its own memory (not shown). When the device 400 located in the physical environment is to be controlled according to a preset episode using this model, the reinforcement learning agent 120 generates a control command for initializing the device 400, transmits it to the device 400 to initialize the device, and receives the resulting state information from the device 400.

The reinforcement learning agent 120 then inputs the state information received after initialization into the reinforcement learning model for device control, extracts control information for controlling the device 400 according to the input state information, provides it to the device 400, and receives from the device 400 the state information after the device has been controlled according to the control information.

At this time, if the state information currently received is the result of the device 400 having been controlled according to the control information provided to it, the reinforcement learning agent 120 calculates a reward for that state information; if the received state information is the initial state information following initialization, no reward is calculated.
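The initialize, act, observe, and reward cycle described above can be sketched as a short loop. This is only an illustrative sketch: the callables `reset`, `step`, `policy`, and `reward_fn` and the fixed `max_steps` episode length are hypothetical names introduced here, not interfaces defined in the specification.

```python
def run_episode(reset, step, policy, reward_fn, max_steps):
    """Collect (state, action, reward, next_state) tuples for one episode.

    All arguments are hypothetical stand-ins:
      reset()       -> initial state after the device is initialized
      step(a)       -> state reported by the device after applying action a
      policy(s)     -> control information extracted for state s
      reward_fn(s)  -> reward for a state that resulted from a control action
    """
    transitions = []
    state = reset()  # initial state: no reward is computed for it
    for _ in range(max_steps):
        action = policy(state)          # extract control information
        next_state = step(action)       # device reports the controlled state
        reward = reward_fn(next_state)  # reward only for controlled states
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions
```

Note that the reward function is never called on the value returned by `reset()`, which mirrors the rule that the initial state information receives no reward.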

The reinforcement learning agent 120 also generates transition information for the device 400 on the basis of the received state information and stores it in an experience memory (not shown) provided in the imitation reinforcement learning unit 200, so that the imitation reinforcement learning unit 200 can perform reinforcement learning on the reinforcement learning model for device control.

The transition information includes the current state information, control information, reward information, and next state information of the device 400. The current state information is the state information of the device 400 before it is controlled on the basis of the control information, and the next state information is the state information of the device 400 after it has been controlled on the basis of the control information and has transitioned to a particular state.
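As an illustration, the four components of the transition information could be held in a small record type. The class name `Transition` and its field names are assumptions made for this sketch; the specification does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transition:
    """One piece of transition information (field names are illustrative)."""
    state: Tuple[float, ...]       # current state, before the control is applied
    action: float                  # control information sent to the device
    reward: float                  # reward for the resulting next state
    next_state: Tuple[float, ...]  # state after the device has been controlled


# Example with made-up numbers: a two-component state and a scalar action.
example = Transition(state=(0.0, 0.1), action=0.5, reward=1.0, next_state=(0.0, 0.2))
```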

The reward information is the reward value for the next state information of the device 400 controlled on the basis of the control information.

The current state information and the next state information of the device 400 included in the transition information may be defined in various ways depending on the type of the device 400, and may include the angle or angular velocity of a motor provided to operate the device 400, the position information of the device 400, and the like.

The control information may likewise be defined in various ways depending on the type of the device 400, and may include the amount of power, the amount of fuel, and the like required to drive the device 400 according to a preset episode.

The reward information is assigned as a preset reward value according to the result of determining whether the device 400 has been controlled appropriately, within a preset threshold range, under the control information.

For example, when the device 400 is a rotary inverted pendulum apparatus that keeps an inverted pendulum upright, the current state information and the next state information may include the angular velocity and angle of the motor provided in the apparatus and the angular velocity and angle of the inverted pendulum, and the control information may include the amount of power for operating the motor. The reward information may be given as a positive reward value (e.g., +1) when, under the control information, the angular velocity and angle of the motor and of the inverted pendulum remain within a preset range, and as a negative reward value (e.g., -100) when they exceed the preset range.
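The threshold-based reward in this example can be written as a small function. The +1 and -100 values are the examples given above; the function name, the ordering of the state components, and the representation of the preset range as per-component `(low, high)` pairs are assumptions made for this sketch.

```python
def pendulum_reward(state, limits, in_range=1.0, out_of_range=-100.0):
    """Return the reward for a controlled state.

    state  : sequence of state components, e.g. (motor angle, motor angular
             velocity, pendulum angle, pendulum angular velocity)
    limits : one hypothetical (low, high) pair per state component
    """
    for value, (low, high) in zip(state, limits):
        if not (low <= value <= high):
            return out_of_range  # at least one component left the preset range
    return in_range              # every component stayed within the preset range
```

In practice the limits would be chosen per device; nothing in the specification fixes their values.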

When the device 400 is to be controlled according to a preset episode through the reinforcement learning model for device control, the model as initially generated has undergone little reinforcement learning, and its weights are set randomly or arbitrarily according to the user's selection. The control information it extracts from the state information of the device 400 may therefore be inaccurate, because the model has not yet adapted to the environment of the device 400.

Accordingly, in the present invention, a preset number of pieces of transition information are accumulated and stored in the experience memory through the PID controller 110, and reinforcement learning is performed rapidly on the basis of this stable initial transition information, so that the model quickly adapts to the physical environment of the device 400 and the device 400 can be controlled precisely.

When the device remote control apparatus 100 is to control the device 400 according to the episode preset for the device 400 and the preset number of pieces of transition information has not yet been stored in the experience memory of the imitation reinforcement learning unit 200, it first controls the device 400 through the PID controller 110.

The PID controller 110 controls the device 400 through a proportional-integral-derivative (PID) control model, which is a traditional control model, and serves to quickly collect the initial data for reinforcement learning of the reinforcement learning model for device control.

The PID control model is defined in advance by the user in order to control the device 400 stably. It is a mathematical model that computes an error by comparing the output value (that is, the state information) of the specific device 400 to be controlled with a preset and stored reference value or setpoint, and then uses the computed error to calculate the value needed to control the device 400.
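A textbook discrete-time PID law illustrates the error-based calculation described above. This is a generic sketch rather than the control model of the invention: the gains `kp`, `ki`, `kd`, the sampling period `dt`, and the class name are all assumptions, and a single scalar measurement is used for simplicity.

```python
class PIDController:
    """Minimal discrete PID controller (generic textbook form)."""

    def __init__(self, kp, ki, kd, setpoint, dt):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint   # stored reference value
        self.dt = dt               # assumed fixed sampling period
        self.integral = 0.0
        self.prev_error = None

    def control(self, measurement):
        # Error between the stored setpoint and the device output.
        error = self.setpoint - measurement
        self.integral += error * self.dt
        # No derivative term on the first sample (no previous error yet).
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

For a multi-component state such as the rotary inverted pendulum, one such loop per controlled quantity, or a weighted combination of them, would be a typical design, but the specification leaves this to the user-defined model.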

The PID control model serves to quickly collect transition information, which is the initial data for reinforcement learning of the reinforcement learning model for device control, and may be defined and built in advance by the user for each device 400.

Until the episode preset for the device 400 ends, the PID controller 110 receives the state information of the device 400, generates control information for that state information according to the PID control model, and provides the generated control information to the device 400 to control it. It then receives from the device 400 the state information resulting from the control, calculates a reward for that state information, and thereby generates transition information for the device 400 and stores it in the experience memory of the imitation reinforcement learning unit 200.

When the preset number of pieces of transition information has been stored in the experience memory, the device remote control apparatus 100 allows the imitation reinforcement learning unit 200 to perform reinforcement learning on the reinforcement learning model for device control using the transition information stored by the PID controller 110.

That is, the device remote control apparatus 100 quickly collects, through the PID controller 110, the initial transition information for reinforcement learning of the reinforcement learning model for device control, and uses the transition information collected by the PID controller 110 so that the initial reinforcement learning of the model can be performed quickly. The model thereby adapts rapidly to the environment in which the device 400 is located, and the device 400 can be controlled precisely.

Thereafter, the device remote control apparatus 100 stops controlling the device 400 through the PID controller 110 and controls the device 400 through the reinforcement learning agent 120.

Control of the device 400 through the reinforcement learning agent 120 is performed using the reinforcement learning model for device control: the state information received from the device 400 is input into the model, and control information for controlling the device 400 according to the preset episode is extracted.

The reinforcement learning agent 120 transmits the extracted control information to the device 400 so that the device 400 is controlled on the basis of the control information, and the device 400 transmits to the reinforcement learning agent 120 its state information after having been controlled on the basis of the control information.

The reinforcement learning agent 120 then generates, on the basis of the received state information of the device 400, transition information for the reinforcement learning performed in the imitation reinforcement learning unit 200, and stores it sequentially in the experience memory of the imitation reinforcement learning unit 200.

Thereafter, the imitation reinforcement learning unit 200 draws random mini-batches from the transition information stored in the experience memory and performs reinforcement learning on the reinforcement learning network that underlies the reinforcement learning model for device control.
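Random mini-batch sampling from the experience memory can be sketched as follows. The description of FIG. 2 states that the experience memory has a FIFO structure, which is modelled here with a bounded `collections.deque`; the class name, the `capacity` parameter, and the method names are assumptions made for this sketch, and the network update itself is not shown.

```python
import random
from collections import deque


class ExperienceMemory:
    """Bounded FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # When full, appending drops the oldest transition (FIFO behaviour).
        self._buffer = deque(maxlen=capacity)

    def store(self, transition):
        self._buffer.append(transition)

    def __len__(self):
        return len(self._buffer)

    def sample(self, batch_size, rng=random):
        # Draw a mini-batch without replacement; shrink it if memory is small.
        size = min(batch_size, len(self._buffer))
        return rng.sample(list(self._buffer), size)
```

Because the buffer is FIFO, transitions written first by the PID controller are the first to be displaced once the reinforcement learning agent starts storing its own.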

The reinforcement learning is performed by updating the reinforcement learning network of the reinforcement learning model for device control on the basis of the transition information stored in the experience memory, and is carried out each time an episode ends.

Although FIG. 1 shows the PID controller 110 using a PID control model as the means for quickly collecting transition information for the initial reinforcement learning of the reinforcement learning model for device control, various other controllers that control a specific device through a traditional mathematical model may be used instead, such as an LQR controller using a linear quadratic regulator (LQR) control model.

When performing the initial reinforcement learning, the imitation reinforcement learning unit 200 uses the transition information stored in the experience memory by the PID controller 110, as described above.

That is, when reinforcement learning is initially performed on the reinforcement learning network (in other words, when the number of pieces of transition information stored in the experience memory does not exceed the preset number), the imitation reinforcement learning unit 200 performs the reinforcement learning on the basis of the transition information stored by the PID controller 110. Thereafter, the transition information stored by the PID controller 110 gradually gives way to the transition information stored by the reinforcement learning agent 120, and the reinforcement learning comes to be performed on the basis of the latter.

Data (that is, state information and control information) is transmitted and received between the device remote control apparatus 100 and the device 400 over a wired/wireless communication network, so that the device 400 can be controlled remotely through the device remote control apparatus 100.

The wired/wireless network refers to a well-known wired or wireless communication network such as 5G (fifth generation) or Ethernet.

As described above, when each device 400 is to be controlled according to its preset episode using the reinforcement learning model for device control, the imitation reinforcement learning system 10 for remote device control of the present invention checks whether the preset number of pieces of transition information is stored in the experience memory. If it is not, the system repeatedly controls the device 400 through the PID controller 110, which uses a traditional control model, so that the transition information needed for reinforcement learning of the reinforcement learning model for device control is generated and stored through the PID controller 110. By imitating the transition information stored by the PID controller 110, the initial reinforcement learning of the model is performed rapidly, allowing the model to adapt quickly to the environment in which the device 400 is located.

Thereafter, the reinforcement learning agent 120 uses the reinforcement learning model for device control to generate and store transition information for reinforcement learning of that model, and the model is continuously updated with the stored transition information. The model is thereby refined so that the device 400 can be controlled precisely.

The network controller 300 adjusts the priority of network traffic so that the device 400 can be controlled in real time through the device remote control apparatus 100.

That is, the network controller 300 sets the highest priority for the network traffic carrying the state information transmitted from the device 400 to the device remote control apparatus 100 and the control information transmitted from the device remote control apparatus 100 to the device 400, so that the state information and control information can be transmitted and received in real time and the device 400 can be controlled remotely in real time.

The network controller 300 communicates, through the OpenFlow protocol, with a plurality of network nodes (not shown) constituting the network and configures the operation of each network node.

Each network node forwards data transmitted through the network toward its destination, and is implemented as a switch (e.g., Open vSwitch (OVS)), a router, or the like.

The present invention separates the control plane and the data plane of each network node through the network controller 300, so that the network nodes can be controlled and managed centrally by the network controller 300 regardless of network details.

That is, the network controller 300 of the present invention receives node information from the plurality of network nodes in real time and manages and controls each network node.

The node information includes a flow table of the network traffic for all data transmitted and received through each network node, and information on whether the data transmission paths at each network node are congested.

Accordingly, on the basis of the node information received in real time from each network node, the network controller 300 can monitor in real time the network traffic for all data transmitted and received through each network node and whether the transmission and reception paths of the data are congested.

The plurality of devices 400 located in the physical environment and the device remote control apparatus 100, which is located in the cyber environment and controls each device 400 remotely, transmit and receive the data for this control (that is, state information and control information) through the plurality of network nodes constituting the network.

To control each of the plurality of devices 400 in real time through the device remote control apparatus 100, it is essential that the device state information transmitted from the devices 400 and the control information transmitted from the device remote control apparatus 100 be transmitted and received in real time.

Accordingly, while monitoring the flow of network traffic on the basis of the flow tables received from the plurality of network nodes, the network controller 300 configures each network node so that the maximum bandwidth is allocated to the network traffic carrying the state information and the control commands.

The network controller 300 also adjusts the priority of the network traffic carrying the state information and the control commands, configuring each network node to transmit and receive the state information and the control commands ahead of all other data.

That is, the network controller 300 checks the network traffic at each network node by referring to the flow table and adjusts the priority of each traffic flow, assigning a low priority to traffic other than the state information and control information and a high priority to traffic carrying the state information and control information, so that the state information and control information are transmitted and received preferentially.
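The priority rule above amounts to a simple classification over flows. The sketch below is plain Python and does not use any OpenFlow library; the flow identifiers, the traffic labels `"status"` and `"control"`, and the numeric priority values are hypothetical and chosen only for illustration.

```python
def assign_priorities(flows, control_kinds=("status", "control"), high=100, low=1):
    """Map each flow to a priority value.

    flows : dict of flow id -> traffic kind label (hypothetical labels)
    Flows carrying state or control information get `high`; all others `low`.
    """
    return {
        flow_id: (high if kind in control_kinds else low)
        for flow_id, kind in flows.items()
    }


# Example: two device-control flows share the network with a video stream.
priorities = assign_priorities({"f1": "status", "f2": "video", "f3": "control"})
```

In an actual deployment the resulting values would be pushed to the network nodes as flow-entry priorities by the controller; that step is outside the scope of this sketch.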

The network controller 300 also refers to the congestion information on the data transmission paths contained in the received node information, calculates the shortest path for the state information and the control commands, and transmits the calculated shortest path to each network node, so that each network node forwards the state information and control information along the shortest path.
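The specification does not state how congestion enters the shortest-path calculation. One plausible reading, assumed here purely for illustration, is Dijkstra's algorithm with an extra cost added to congested links. The link-cost dictionary, the `penalty` value, and the function name are all hypothetical.

```python
import heapq


def shortest_path(links, source, target, congested=frozenset(), penalty=10.0):
    """Dijkstra over undirected links; congested links cost `penalty` more.

    links     : dict of (node_a, node_b) -> base cost (hypothetical topology)
    congested : set of (node_a, node_b) pairs reported as congested
    Returns the list of nodes on the cheapest path, or None if unreachable.
    """
    graph = {}
    for (u, v), cost in links.items():
        extra = penalty if (u, v) in congested or (v, u) in congested else 0.0
        graph.setdefault(u, []).append((v, cost + extra))
        graph.setdefault(v, []).append((u, cost + extra))

    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt, cost in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (dist + cost, nxt, path + [nxt]))
    return None
```

With this formulation a congested link is discouraged rather than forbidden, so a path is still found when every route is congested.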

That is, through maximum bandwidth allocation, priority adjustment, and shortest-path calculation for the state information and control information, the network controller 300 of the present invention allows the state information and control information to be exchanged between the device 400 and the device remote control apparatus 100 in real time, so that the device 400 can be controlled remotely in real time through the device remote control apparatus 100.

The network controller 300 provides the node information and the priority information for network traffic to the user terminal 500 through a REST (representational state transfer) API, so that the user can monitor in real time the network status, including the node information and priority information.

The user can also adjust, through the user terminal 500, the priority of the network traffic carrying the state information and control information. The resulting priority adjustment information is provided to the network controller 300 through the REST API, so that the adjusted priority is applied to the state information and control information.

FIG. 2 is a diagram illustrating a method of performing imitation reinforcement learning for remote device control through the device remote control apparatus and the imitation reinforcement learning unit according to an embodiment of the present invention.

As shown in FIG. 2, imitation reinforcement learning for remote device control through the imitation reinforcement learning unit 200 is performed by controlling the device 400 located in the physical environment according to a preset episode, storing a plurality of pieces of transition information in the experience memory 210, which has a FIFO structure, and updating the reinforcement learning network with the transition information stored in the experience memory 210 each time the set episode ends.

Meanwhile, the reinforcement learning network is the basis of the device control learning model from which control information for remotely controlling the device 400 is extracted. The weights of the initial reinforcement learning network are set randomly or arbitrarily according to the user's selection.

Accordingly, the accuracy of the initially generated device control learning model may be low, because very few rounds of reinforcement learning have been performed and only a small amount of transition information is available for that learning.

To solve this problem, the device remote control apparatus 100 refers to the experience memory 210 and, when the number of pieces of transition information stored in the experience memory 210 is less than a preset number (e.g., 20,000), controls the device 400 using the PID controller 110, so that initial transition information is generated stably and quickly and stored in the experience memory 210.

That is, by controlling the device 400 through the PID controller 110, the device remote control apparatus 100 quickly collects the transition information that is lacking at the initial stage and allows the reinforcement learning to be performed on that basis, so that the reinforcement learning model for device control quickly adapts to the physical environment of the device 400 and can control the device 400 precisely.
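The switching rule described above can be illustrated with a minimal Python sketch. The function name `select_controller` and the string labels are hypothetical and chosen only for illustration; the threshold of 20,000 is the example value given in the description.

```python
PRESET_COUNT = 20_000  # example threshold from the description


def select_controller(num_stored_transitions: int,
                      preset_count: int = PRESET_COUNT) -> str:
    """Return which controller drives the device for the next episode.

    While the experience memory holds fewer transitions than the preset
    number, the PID controller is used; afterwards only the reinforcement
    learning agent is used. The labels are illustrative assumptions.
    """
    if num_stored_transitions < preset_count:
        return "pid"
    return "rl_agent"
```

With this rule, an empty memory selects the PID controller, and a memory that has reached the preset number selects the reinforcement learning agent.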

In addition, when controlling the device 400 according to the preset episode, the device remote control apparatus 100 initializes the device 400 by generating control information for initialization and transmitting it to the device 400.

At this time, when the number of pieces of transition information stored in the experience memory 210 is less than or equal to the preset number, the device remote control apparatus 100 initializes the device 400 through the PID controller 110, and after performing the initialization the device 400 provides device status information to the PID controller 110.

When the device status information received from the device 400 is the status information resulting from the control information for initialization, the PID controller 110 generates control information for that status information according to a pre-stored PID control model, transmits it to the device 400, and receives the resulting device status information.

The PID controller 110 then calculates a reward for the received device status information, thereby generating reward information for that device status information.

Thereafter, the PID controller 110 generates transition information for the device 400 that includes the generated reward information, and stores the generated transition information in the experience memory 210.

This process of generating transition information and storing it in the experience memory 210 is repeated until the preset episode ends.
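The PID-driven collection loop described above might look roughly like the following Python sketch. Everything concrete here is an assumption made for illustration: the textbook discrete PID form, the gains, the one-dimensional toy plant `toy_device_step`, the target of 0, and the threshold range of ±2. The description does not specify the PID control model or the device dynamics.

```python
from collections import deque


class PID:
    """Textbook discrete PID controller (illustrative gains, not from the source)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def control(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def toy_device_step(state, control, dt=0.05):
    # Hypothetical plant: the state changes in proportion to the control input.
    return state + control * dt


def run_pid_episode(memory, steps=50, limit=2.0):
    """Control the toy device with PID and store one transition per step."""
    pid = PID(kp=2.0, ki=0.1, kd=0.05, dt=0.05)
    state = 1.0  # assumed state reported right after initialization
    for _ in range(steps):
        control = pid.control(0.0 - state)            # drive the state toward 0
        next_state = toy_device_step(state, control)
        reward = 1.0 if abs(next_state) <= limit else -100.0
        memory.append((state, control, reward, next_state))
        state = next_state
        if reward < 0:                                # left the threshold range
            break
    return state


memory = deque(maxlen=20_000)  # FIFO experience memory
final_state = run_pid_episode(memory)
```

Each stored tuple has the form (current state, control, reward, next state), matching the four components of the transition information.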

When the episode ends, the reinforcement learning unit 220 of the imitation reinforcement learning unit 200 performs random mini-batch sampling on the transition information stored in the experience memory 210, randomly selecting and extracting a preset number of pieces of transition information from the experience memory 210, and performs reinforcement learning on the reinforcement learning network using the extracted transition information.
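Random mini-batch extraction of this kind can be sketched in Python as uniform sampling without replacement. The function name `sample_minibatch` and the fallback for a memory smaller than the batch size are assumptions; the description only states that a preset number of pieces are chosen at random.

```python
import random


def sample_minibatch(memory, batch_size, rng=random):
    """Uniformly sample up to batch_size transitions without replacement.

    If the memory holds fewer items than batch_size, all of them are
    returned in random order (an assumption; the source does not say).
    """
    items = list(memory)
    k = min(batch_size, len(items))
    return rng.sample(items, k)
```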

Meanwhile, when the episode ends and the number of pieces of transition information stored in the experience memory 210 is less than or equal to the preset number, the reinforcement learning is performed on transition information generated by the PID controller 110.

The transition information generated by the PID controller 110 results from control performed through the PID control model, a predefined and complex mathematical control model for precisely controlling the device 400, and is therefore highly reliable and very stable data.

Accordingly, the transition information initially stored in the experience memory 210, up to the preset number, is transition information generated through the PID controller 110, so the reinforcement learning unit 220 can perform reinforcement learning quickly on the basis of stable initial transition information.

That is, the reinforcement learning that the reinforcement learning unit 220 performs on the reinforcement learning network is based on the transition information generated through the PID controller 110. This amounts to performing reinforcement learning by imitating the experience of the PID controller 110 (that is, the results of controlling the device according to a predefined mathematical model), and the present invention defines this as imitation reinforcement learning.

Next, when the number of pieces of transition information stored in the experience memory 210 has reached the preset number, the device remote control apparatus 100 controls the device 400 through the reinforcement learning agent 120 to generate the transition information, so that reinforcement learning can be performed on the reinforcement learning network.

That is, once the transition information stored in the experience memory 210 has reached the preset number, control through the PID controller 110 is excluded, and from then on only control through the reinforcement learning agent 120 is performed.

This is because the reinforcement learning model for device control stored in the memory (not shown) of the reinforcement learning agent 120 has already undergone reinforcement learning based on the transition information stored in the experience memory 210 (that is, the transition information stored by the PID controller), so that reliable transition information for the device 400 can be obtained through the reinforcement learning agent 120.

Accordingly, upon receiving device status information from the device 400 as a result of the control, the reinforcement learning agent 120 inputs the received device status information into the reinforcement learning model for device control stored in the memory, generates control information for controlling the device 400, and transmits it to the device 400, so that the device 400 performs the operation corresponding to that control information.

Thereafter, the device 400 operates based on the control information received from the reinforcement learning agent 120, generates the resulting device status information, and transmits it to the reinforcement learning agent 120.

The reinforcement learning agent 120 then calculates a reward for the received device status information to generate reward information for that status information, and stores transition information of the device 400 that includes the generated reward information in the experience memory 210.

Thereafter, when the episode ends, the reinforcement learning unit 220 of the imitation reinforcement learning unit 200 extracts a preset number of pieces of transition information from the experience memory 210 through random mini-batch sampling, and performs reinforcement learning on the reinforcement learning network using the extracted transition information.

When the reinforcement learning has been performed, the reinforcement learning unit 220 provides the learning parameters that result from it to the reinforcement learning agent 120. The reinforcement learning agent 120 applies the received learning parameters to the reinforcement learning model stored in the memory, thereby updating that model and progressively refining the reinforcement learning model for remote device control.

As described above, the transition information for the device 400 represents a plurality of results of controlling the device according to the preset episode, and is stored in the experience memory 210 in a FIFO manner until the episode ends.

That is, the experience memory 210 always holds up to the preset number of pieces of transition information. When transition information is input beyond the preset number, the transition information that was stored earliest in the experience memory 210 is deleted, and the subsequently input transition information is stored in sequence.
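The FIFO behaviour described above maps naturally onto a bounded double-ended queue. The following Python sketch is an illustration under that assumption; the class name `ExperienceMemory` and its methods are hypothetical and are not taken from the source.

```python
from collections import deque


class ExperienceMemory:
    """Bounded FIFO store: once full, the oldest transition is dropped."""

    def __init__(self, capacity):
        # deque(maxlen=...) discards items from the opposite end on append.
        self._buffer = deque(maxlen=capacity)

    def store(self, transition):
        self._buffer.append(transition)

    def oldest(self):
        return self._buffer[0]

    def __len__(self):
        return len(self._buffer)
```

For example, storing five transitions in a memory of capacity three leaves only the last three, with the third one stored now being the oldest.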

Meanwhile, an episode means a series of steps in which a specific task is performed according to the function unique to each device 400. For example, when the device 400 is a rotary inverted pendulum, the series of steps for standing the pendulum upright constitutes an episode; when the device 400 is a robot arm for moving a specific object to a preset position, the series of steps for moving the object to that position constitutes one episode.

The transition information is generated for each piece of control information used to control the device 400 step by step in carrying out the preset episode, and comprises the current status information of the device 400, the control information, the reward information for the next status information of the device 400, and the next status information of the device 400.

The current status information of the device 400 means the status information before the device 400 is controlled based on the control information, and the next status information of the device 400 means the status information after the device 400 has been controlled based on the control information.

The reward information means a reward value assigned to the next status information, obtained after control based on the control information, with reference to a preset threshold range. As described above, a positive reward value (e.g., +1) may be assigned when the next status information results from operation within the preset threshold range, and a negative reward value (e.g., -100) when it results from operation exceeding the preset threshold range.
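A threshold-based reward of this kind can be sketched in Python as follows. Treating the status information as a single scalar and the threshold range as a closed interval are simplifying assumptions; the values +1 and -100 are the examples given in the description.

```python
def compute_reward(next_state_value, lower, upper,
                   in_range_reward=1.0, out_of_range_reward=-100.0):
    """Return the reward for the next state relative to a threshold range.

    next_state_value, lower and upper are hypothetical scalar stand-ins
    for the device status information and its preset threshold range.
    """
    if lower <= next_state_value <= upper:
        return in_range_reward
    return out_of_range_reward
```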

The reinforcement learning is performed such that, when control information for controlling the device 400 is extracted from the current status information of the device 400 on the basis of the reward information, the extracted control information is the one leading to the next status information with the highest total reward up to the end of the episode.

That is, when the device 400 transitions from its current state to a next state under a specific control command, the reinforcement learning does not extract the control command with the highest reward for that single transition; it extracts the control command leading to the next state with the highest total reward from the current state until the episode ends.
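The difference between choosing by immediate reward and choosing by total reward until the end of the episode can be shown with a small Python example. The two candidate commands and their reward sequences are invented purely for illustration and do not come from the source.

```python
# Hypothetical reward sequences that follow each candidate control command,
# from the current state until the episode ends.
candidate_rewards = {
    "a": [2.0, -100.0],      # best immediate reward, but the episode then fails
    "b": [1.0, 1.0, 1.0],    # smaller immediate reward, better overall
}

# Choice based only on the reward of the very next transition.
greedy_choice = max(candidate_rewards, key=lambda c: candidate_rewards[c][0])

# Choice based on the total reward accumulated until the episode ends,
# which is the criterion the description adopts.
best_choice = max(candidate_rewards, key=lambda c: sum(candidate_rewards[c]))
```

Here the immediate-reward criterion picks command "a", whereas the total-reward criterion picks command "b".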

In other words, based on the transition information stored in the experience memory 210, the reinforcement learning unit 220 trains the learning network for device control so that, when specific status information of the device 400 is input, it outputs the control information with the highest total reward.

Meanwhile, the learning network for device control is preferably built using a CNN (convolutional neural network), but may be built using various other machine learning networks such as an ANN (artificial neural network). That is, the structure of the learning network for device control according to an embodiment of the present invention is not limited.

This reinforcement learning is performed repeatedly each time an episode ends, and the learning model is continuously updated through this process, so that each device 400 exposed to various environments can be controlled precisely in a manner adapted to its environment.

Accordingly, the reinforcement learning agent 120 recognizes the current state of the device 400 located in an arbitrary environment and selects, from among the selectable control information, the control information for which the sum of the reward information over the entire episode is greatest, so that the device 400 can be controlled precisely.

The device status information and the control information may be transmitted and received through any communication method.

Accordingly, the device status information and the control information are converted into a data format suitable for the communication method in use before being transmitted and received.

Meanwhile, the communication method refers to a lightweight publish/subscribe messaging transport protocol by which communication devices can communicate with one another, and may include one that is optimized for M2M (machine to machine), IoT (Internet of Things), and the like, and that can communicate at high speed in low-power, low-bandwidth environments.

FIG. 3 is a diagram showing the structure of the reinforcement learning network according to an embodiment of the present invention.

As shown in FIG. 3, the reinforcement learning network according to an embodiment of the present invention consists of an input layer that receives the transition information stored in the experience memory 210; a first convolution layer and a second convolution layer, each of which uses a kernel of a preset size and convolves the kernel with a specific portion of the reinforcement learning data; two fully connected layers; and an output layer.

Each time the episode preset for a device 400 ends, the input layer receives a plurality of pieces of transition information stored in the experience memory 210. The transition information is converted into an image suitable for the reinforcement learning network before being input to the input layer.

The transition information is generated by the PID controller 110 or the reinforcement learning agent 120. Not all of the transition information stored in the experience memory 210 is input to the input layer; instead, random mini-batch sampling is performed, and a preset number of pieces are randomly selected, batched, and input to the input layer.

The reason for performing random mini-batch sampling on the transition information is as follows. Because the current and next status information contained in the transition information record states that follow one another sequentially under the control information, the correlation between successive pieces of transition information can make the reinforcement learning very slow. Random sampling disregards this correlation, so that reinforcement learning proceeds quickly toward extracting, for a given state, the control command with the highest total reward up to the end of the episode (that is, the final state of the device).

The first convolution layer moves a first kernel, which has a preset size (e.g., 4x4) and specific weights, by a preset stride (e.g., 1) and convolves each specific portion of the image for the plurality of pieces of transition information with the weights of the first kernel, thereby generating and outputting a first feature map for that portion of the image. Here, the stride is the unit by which the kernel moves.

The first convolution layer outputs a total of 32 feature maps, each of size 7x1 (that is, 7x1x32). The second convolution layer moves a second kernel, which has a preset size (e.g., 4x1) and specific weights, by a preset stride (e.g., 1) over the first feature map output by the first convolution layer, thereby generating and outputting a second feature map. When the second kernel is 4x1, the second feature map output by the second convolution layer may consist of a total of 64 feature maps, each of size 4x1.
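The feature-map sizes quoted above follow from the standard output-size formula for an unpadded convolution, which the following Python sketch checks. The assumption of no padding and the 10x4 input image size are inferences made for illustration: the description does not state the input size, and 10x4 is simply the size that yields a 7x1 output with a 4x4 kernel and a stride of 1.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Output length along one axis for a convolution without padding."""
    return (input_size - kernel_size) // stride + 1


# First convolution layer: assumed 10x4 input, 4x4 kernel, stride 1.
first_h = conv_output_size(10, 4)
first_w = conv_output_size(4, 4)

# Second convolution layer: 7x1 input, 4x1 kernel, stride 1.
second_h = conv_output_size(first_h, 4)
second_w = conv_output_size(first_w, 1)
```

Under these assumptions the first layer yields 7x1 maps and the second yields 4x1 maps, consistent with the 7x1x32 and 4x1x64 outputs stated in the description.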

Next, the two fully connected layers connect the output feature maps so that at least one control command for moving the device 400 from a specific state to a next state can be output through the output layer.

Here, at least one control command is output, in descending order of the total reward accumulated over each state from the transition out of the specific state up to the end state in which the episode set for the device 400 terminates.

Through this process, the reinforcement learning unit 220 applies to the reinforcement learning network the transition information generated each time the device 400 is controlled according to the episode, and can thereby update the reinforcement learning network incrementally.

As a result, the reinforcement learning unit 220 can continuously refine the reinforcement learning model for device control, so that devices 400 located in various environments can be controlled remotely and precisely in a manner adapted to each environment.

FIG. 4 is a block diagram showing the configuration of the device remote control apparatus according to an embodiment of the present invention.

As shown in FIG. 4, the device remote control apparatus 100 according to an embodiment of the present invention comprises a PID controller 110 for controlling the device using a PID control model, and a reinforcement learning agent 120 for controlling the device using a reinforcement learning model for device control.

The device remote control apparatus 100 refers to the experience memory 210 of the imitation reinforcement learning unit 200 and, when fewer than the preset number of pieces of transition information are stored in the experience memory 210, controls the device 400 through the PID controller 110 so that transition information for the device 400 is generated and stored in the experience memory 210.

The PID controller 110 controls the device 400 according to the preset episode using the PID control model stored in a memory (not shown) of the PID controller 110, generates the transition information, and stores the generated transition information in the experience memory 210 of the imitation reinforcement learning unit 200.

Since the PID controller 110 has been described with reference to FIG. 2, further detailed description is omitted.

Although FIGS. 1 to 3 show the PID controller 110 integrated into the device remote control apparatus 100, it may of course be configured separately from the device remote control apparatus 100 and perform the functions of the PID controller 110 by interworking with it.

When the experience memory 210 of the imitation reinforcement learning unit 200 holds the preset number of pieces of transition information, the device remote control apparatus 100 controls the device 400 through the reinforcement learning agent 120 so that transition information for the device 400 is generated and stored in the experience memory 210.

The reinforcement learning agent 120 comprises: a control information processing unit 121 that generates control information for remotely controlling the device 400 according to the preset episode and transmits the generated control information to the device 400, thereby enabling remote control of the device 400; a device status information receiving unit 122 that receives, from the device 400, the device status information of the device 400 as controlled according to the control information; a reward information generating unit 123 that generates reward information for the received device status information; a transition information generating unit 124 that generates transition information for the device 400 based on the generated reward information and stores it in the experience memory 210; a reinforcement learning model update unit 125 that receives learning parameters from the reinforcement learning unit 220 of the imitation reinforcement learning unit 200 and updates the reinforcement learning model for device control; and a memory 126.

The control information processing unit 121 generates control information for controlling the device 400 according to the preset episode using the reinforcement learning model for device control provided by the imitation reinforcement learning unit 200, and transmits the generated control information to the device 400 to control the device 400.

That is, the control information processing unit 121 loads the reinforcement learning model for device control for the device 400 from the memory 126 and inputs the device status information received by the device status information receiving unit 122 into the loaded model, thereby generating control information for controlling the device 400 step by step.

The reinforcement learning model for device control outputs at least one piece of control information in descending order of the total reward up to the final state of the episode preset for the device 400, and the control information processing unit 121 selects the control information with the highest reward, thereby generating the control information for controlling the device 400.
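The selection step described above amounts to taking the candidate with the largest estimated total reward. A minimal Python sketch follows, assuming the model's output is available as a list with one estimated total reward per candidate control; the function name `select_control` and that output format are assumptions, not details from the source.

```python
def select_control(estimated_totals):
    """Return the index of the control with the highest estimated total reward.

    estimated_totals is a hypothetical list holding, for each selectable
    control, the model's estimate of the total reward until the episode ends.
    """
    return max(range(len(estimated_totals)), key=estimated_totals.__getitem__)
```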

The control information processing unit 121 further has the function of generating control information for initializing the device 400 and transmitting it to the device 400 when control of the device 400 is to begin.

The device status information receiving unit 122 receives the status information of the device 400 after the device 400 has been controlled according to the control information for initialization or the control information generated through the reinforcement learning model for device control.

Meanwhile, as described above, the control information and the device status information are transmitted and received using an arbitrary communication method.

In addition, the reward information generating unit 123 generates reward information for the received device status information by calculating a reward for that status information against a preset threshold range.

The reward information generating unit 123 applies a positive reward value when the received device status information indicates that the device was controlled within the threshold range, and applies a negative reward value when the received device status information indicates that the device operated beyond the preset threshold range, thereby generating the reward information for the received device status information.
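
By way of a non-limiting illustration, a threshold-based reward of this kind can be sketched as follows. The sketch assumes that the status is a single scalar value and that the reward magnitudes are +1 and -1; the embodiment specifies neither, and the name `compute_reward` and its parameters are hypothetical.

```python
def compute_reward(status_value: float, low: float, high: float,
                   positive: float = 1.0, negative: float = -1.0) -> float:
    """Return a positive reward inside [low, high], a negative one outside."""
    if low > high:
        raise ValueError("low must not exceed high")
    # Assumption: the preset threshold range is a closed interval.
    return positive if low <= status_value <= high else negative
```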

In addition, the transition information generating unit 124 generates, based on the received device status information and the generated reward information, the transition information of the device 400 resulting from the control information, and stores it in the experience memory 210 of the imitation reinforcement learning unit 200.

The transition information comprises the current status information of the device 400, the control information, the reward information, and the next status information.

That is, the transition information generating unit 124 generates the transition information by setting the device status information received from the device 400, together with its reward information, as the next status information and the reward information, setting the status information that preceded the next status information as the current status information, and setting the control information generated for that preceding status information as the control information of the transition.
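
By way of a non-limiting illustration, the pairing of each newly received status with the preceding status and control can be sketched as follows. The sketch assumes that a status is a tuple of floats and that a control is an integer index; the names `Transition` and `TransitionBuilder` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

State = Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    state: State        # current status information
    control: int        # control information generated for `state`
    reward: float       # reward information computed for `next_state`
    next_state: State   # status information received after the control


class TransitionBuilder:
    """Pairs each newly received status with the preceding status and control."""

    def __init__(self, initial_state: State) -> None:
        self._state = initial_state
        self._control: Optional[int] = None

    def record_control(self, control: int) -> None:
        """Remember the control issued for the current status."""
        self._control = control

    def on_status(self, next_state: State, reward: float) -> Transition:
        """Build the transition for a status received after a control."""
        if self._control is None:
            raise RuntimeError("no control was issued for the previous status")
        transition = Transition(self._state, self._control, reward, next_state)
        # The received status becomes the current status of the next step.
        self._state, self._control = next_state, None
        return transition
```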

This transition information is stored sequentially in the experience memory 210 until the set episode ends. Here, the end of the set episode denotes the final state in which control of the device 400 is finally completed, that is, the point at which the device 400 has finally transitioned to a preset end state as a result of the control.

Alternatively, the transition information may be held in the memory 126 of the reinforcement learning agent 120 until the episode ends, and then stored in the experience memory 210 all at once, in the order in which it was generated, when the episode ends.
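
By way of a non-limiting illustration, holding transitions locally and storing them all at once when the episode ends can be sketched as follows. The sketch assumes that the experience memory is a `collections.deque` and that a transition is a plain tuple; the name `EpisodeBuffer` is hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple

Transition = Tuple[float, int, float, float]  # (state, control, reward, next_state)


class EpisodeBuffer:
    """Holds transitions locally and flushes them when the episode ends."""

    def __init__(self) -> None:
        self._pending: List[Transition] = []

    def add(self, transition: Transition) -> None:
        self._pending.append(transition)

    def flush_to(self, experience_memory: Deque[Transition]) -> int:
        """Append all pending transitions in generation order; return the count."""
        count = len(self._pending)
        experience_memory.extend(self._pending)
        self._pending.clear()
        return count
```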

Thereafter, when the episode has ended, the imitation reinforcement learning unit 200 randomly selects a preset number of pieces of transition information from the plurality of pieces stored in the experience memory 210 and performs reinforcement learning on the reinforcement learning network for device control.
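
By way of a non-limiting illustration, the random selection of a preset number of transitions can be sketched as follows. The sketch assumes sampling without replacement and assumes that, when fewer transitions are stored than requested, all of them are used; the embodiment states neither, and the name `sample_batch` is hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_batch(experience_memory: Sequence[T], batch_size: int,
                 rng: Optional[random.Random] = None) -> List[T]:
    """Draw up to `batch_size` distinct stored transitions at random."""
    rng = rng or random.Random()
    items = list(experience_memory)
    # Assumption: never request more items than are currently stored.
    return rng.sample(items, min(batch_size, len(items)))
```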

In addition, the reinforcement learning model update unit 125 receives from the imitation reinforcement learning unit 200 the learning parameters resulting from the reinforcement learning, and updates the reinforcement learning model for device control stored in the memory 126 by applying the received learning parameters to it.
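
By way of a non-limiting illustration, applying received learning parameters to the stored model can be sketched as follows. The sketch assumes that the parameters are a mapping from a parameter name to a list of floats; the class `ControlModel` and its `version` counter are hypothetical stand-ins, not part of the embodiment.

```python
import copy
from typing import Dict, List

Parameters = Dict[str, List[float]]


class ControlModel:
    """Stand-in for the device-control model held by the agent."""

    def __init__(self, parameters: Parameters) -> None:
        self.parameters = parameters
        self.version = 0

    def apply_parameters(self, learned: Parameters) -> None:
        """Replace the local parameters with a copy of the learned ones."""
        if learned.keys() != self.parameters.keys():
            raise ValueError("parameter names do not match the stored model")
        # Copy so that later training does not silently change the agent's model.
        self.parameters = copy.deepcopy(learned)
        self.version += 1
```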

That is, the reinforcement learning agent 120 repeats the process of controlling the device 400 according to a preset episode, the process of generating transition information through that control and providing it to the imitation reinforcement learning unit 200, and the process of receiving from the imitation reinforcement learning unit 200 the learning parameters resulting from the reinforcement learning and updating the reinforcement learning model for device control, thereby progressively improving the model so that the device 400 can be controlled precisely.

FIG. 5 is a diagram showing the configuration of the imitation reinforcement learning unit of the imitation reinforcement learning system for remote device control according to an embodiment of the present invention.

As shown in FIG. 5, the imitation reinforcement learning unit 200 of the imitation reinforcement learning system 10 for remote device control according to an embodiment of the present invention generates a reinforcement learning model for device control for controlling a specific device 400 and provides the generated model to the reinforcement learning agent 120, so that the device 400 located in the physical environment can be controlled remotely through the model.

In addition, the imitation reinforcement learning unit 200 comprises an experience memory 210 that stores, in FIFO fashion, up to a preset number of pieces of transition information of the device 400 generated from control of the device 400; a reinforcement learning unit 220 that performs reinforcement learning on the reinforcement learning network of the reinforcement learning model for device control using the stored transition information; and a learning parameter providing unit 230 that provides the learning parameters resulting from the reinforcement learning to the reinforcement learning model update unit 125 of the reinforcement learning agent 120.

The transition information is the data on which the reinforcement learning is based; it is generated by the PID controller 110 and the reinforcement learning agent 120 and stored in the experience memory 210.

Meanwhile, the experience memory 210 has a FIFO structure and stores up to a preset number of pieces of transition information in the order in which they were generated.

In addition, the transition information is stored in order according to the results of controlling the device 400 during a preset episode; it may be stored all at once each time the episode ends, or each time a piece of transition information is generated.

Meanwhile, immediately after the reinforcement learning model for device control is created, its weights are set randomly or arbitrarily according to the user's selection, and there is not yet enough transition information for reinforcement learning, so the transition information is generated by the PID controller 110. At this stage, the experience memory 210 stores the preset number of pieces of transition information generated by the PID controller 110, and reinforcement learning on the reinforcement learning network of the reinforcement learning model for device control is performed on the basis of that transition information. Because this reinforcement learning imitates the PID control model, which is the traditional control model of the PID controller 110, the present invention refers to it as imitation reinforcement learning, as described above.

Thereafter, the transition information is generated through the reinforcement learning agent 120 and stored in the experience memory 210 in FIFO fashion, and the reinforcement learning unit 220 performs reinforcement learning using the transition information stored in the experience memory 210, so that the device 400 located in the physical environment can be controlled precisely.

In addition, the learning parameter providing unit 230 provides the learning parameters of the reinforcement learning network, obtained as a result of the reinforcement learning, to the reinforcement learning model update unit 125 of the reinforcement learning agent 120, so that the learning parameters can be applied to the reinforcement learning model for device control and the model thereby updated.

That is, the reinforcement learning model for device control is progressively improved through the learning parameters resulting from the reinforcement learning, so that the device 400 can be controlled quickly and precisely.

Meanwhile, when the reinforcement learning network is based on a convolutional neural network (CNN), the transition information stored in the experience memory 210 is converted into images so that reinforcement learning can be performed on the reinforcement learning network.

However, the reinforcement learning network of the present invention may be implemented on the basis of various machine learning networks, such as an artificial neural network (ANN), and the transition information may accordingly be converted into whatever format the chosen network requires.

FIG. 6 is a flowchart showing a procedure for performing reinforcement learning according to an embodiment of the present invention.

As shown in FIG. 6, the procedure for performing reinforcement learning, so that the device 400 located in a physical environment can be controlled precisely by quickly adapting to that environment through the imitation reinforcement learning system 10 for remote device control according to an embodiment of the present invention, begins as follows. When the device remote control apparatus 100 of the imitation reinforcement learning system 10 is to remotely control the device 400 according to an episode preset for the reinforcement learning (S110), it initializes the device 400 and receives the status information of the device 400 (S120).

At this time, the device remote control apparatus 100 refers to the experience memory 210 of the imitation reinforcement learning unit 200. If the number of pieces of transition information stored in the experience memory 210 is less than the preset number, it controls the device 400 through the PID controller 110 to generate the transition information; if the number is equal to or greater than the preset number, it controls the device 400 through the reinforcement learning agent 120 to generate the transition information.
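
By way of a non-limiting illustration, the choice between the two controllers can be sketched as follows. The function name `choose_controller` and the string labels it returns are hypothetical.

```python
def choose_controller(stored_count: int, preset_count: int) -> str:
    """Use the PID controller until enough transitions have been stored."""
    if stored_count < 0 or preset_count < 0:
        raise ValueError("counts must be non-negative")
    # Fewer than the preset number: PID controller; otherwise: learning agent.
    return "pid" if stored_count < preset_count else "agent"
```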

In the process of generating the transition information by controlling the device 400 through the PID controller 110, the PID controller 110 first generates control information from the device status information using the PID control model (S131).

The PID control model calculates an error value by comparing the status information of the device 400 with a preset reference value or set point, and then uses the calculated error value to calculate the value needed to control the device; it is a mathematical model defined in advance so that the device 400 can be controlled stably according to a preset episode.
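
By way of a non-limiting illustration, a textbook discrete-time PID control model can be sketched as follows. The embodiment does not give the gains, the sampling interval, or the discretization, so the parameters `kp`, `ki`, `kd` and `dt`, and the class name `PIDController`, are assumptions made for this sketch.

```python
from typing import Optional


class PIDController:
    """Discrete PID: output = kp*e + ki*integral(e) + kd*d(e)/dt."""

    def __init__(self, kp: float, ki: float, kd: float,
                 setpoint: float, dt: float) -> None:
        if dt <= 0:
            raise ValueError("dt must be positive")
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint, self.dt = setpoint, dt
        self._integral = 0.0
        self._previous_error: Optional[float] = None

    def control(self, measurement: float) -> float:
        """Compute the control value from the latest measured status."""
        error = self.setpoint - measurement          # compare with the set point
        self._integral += error * self.dt            # accumulated error
        if self._previous_error is None:
            derivative = 0.0                         # no history on the first step
        else:
            derivative = (error - self._previous_error) / self.dt
        self._previous_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative
```

With `kp=1.0`, `ki=1.0`, `kd=0.0`, a set point of 1.0 and `dt=0.5`, a first measurement of 0.0 gives an error of 1.0 and an accumulated error of 0.5, so the control value is 1.5.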

Next, the PID controller 110 transmits the generated control information to the device 400 to control the device 400 (S141). The PID controller 110 then receives from the device 400 the device status information resulting from that control, generates the corresponding transition information, and stores the generated transition information in the experience memory 210 (S151).

Thereafter, when the episode ends (S160), the imitation reinforcement learning unit 200 performs reinforcement learning on the reinforcement learning network using the transition information stored in the experience memory 210, thereby improving the reinforcement learning model for device control.

Meanwhile, the transition information is generated through the PID controller 110 and stored in the experience memory 210 in order to collect the initial reinforcement learning data (that is, transition information) for the reinforcement learning model for device control quickly, and to allow reinforcement learning to proceed by imitating stable initial data, so that the model can adapt quickly to the physical environment of the device 400.

In addition, the imitation reinforcement learning unit 200 may be implemented to perform the imitation reinforcement learning, based on the transition information stored through the PID controller 110, each time transition information is generated through the PID controller 110 and stored in the experience memory 210, each time the transition information stored through the PID controller 110 reaches the preset number, or when the episode ends.

At this time, the imitation reinforcement learning unit 200 provides the learning parameters resulting from the reinforcement learning to the reinforcement learning agent 120 so that they can be applied to the reinforcement learning model for device control, allowing the model to adapt quickly to the physical environment of the device 400.

On the other hand, when the preset number of pieces of transition information has been stored in the experience memory 210, so that the device 400 is controlled through the reinforcement learning agent 120 to generate and store transition information for the reinforcement learning, the reinforcement learning agent 120 inputs the received status information into the reinforcement learning model for device control to generate control information for controlling the device (S130).

Next, the reinforcement learning agent 120 controls the device 400 by transmitting the generated control information to the device 400, receives from the device 400 the device status information of the device 400 as controlled according to that control information, generates the corresponding transition information, and stores the generated transition information in the experience memory 210 (S150).

Thereafter, when control of the device 400 according to the preset episode has ended (S160), the imitation reinforcement learning unit 200 performs the reinforcement learning using the transition information stored in the experience memory 210 (S170).
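
By way of a non-limiting illustration, the overall procedure of FIG. 6 can be sketched as the loop below. The device is represented by two hypothetical callables, `reset` and `step`, and the learning step by a hypothetical callable `learn`; the sketch also assumes that the controller is chosen once at the start of each episode and that a status is a single float, neither of which is stated in the embodiment.

```python
import random
from collections import deque
from typing import Callable, Deque, List, Tuple

Transition = Tuple[float, int, float, float]  # (state, control, reward, next_state)
Policy = Callable[[float], int]


def run_episode(reset: Callable[[], float],
                step: Callable[[int], Tuple[float, float, bool]],
                policy: Policy,
                memory: Deque[Transition],
                max_steps: int = 1000) -> None:
    """Initialize the device, then control it until the episode ends."""
    state = reset()                                    # S120
    for _ in range(max_steps):
        control = policy(state)                        # S130 (agent) or S131 (PID)
        next_state, reward, done = step(control)       # send control, read status
        memory.append((state, control, reward, next_state))  # S150 or S151
        state = next_state
        if done:                                       # S160
            break


def train(reset: Callable[[], float],
          step: Callable[[int], Tuple[float, float, bool]],
          pid_policy: Policy, agent_policy: Policy,
          learn: Callable[[List[Transition]], None],
          preset_count: int, capacity: int,
          episodes: int, batch_size: int) -> Deque[Transition]:
    """Alternate between controlling the device and learning from memory."""
    memory: Deque[Transition] = deque(maxlen=capacity)
    for _ in range(episodes):
        # Assumption: the controller is chosen once, at the start of an episode.
        policy = pid_policy if len(memory) < preset_count else agent_policy
        run_episode(reset, step, policy, memory)
        batch = random.sample(list(memory), min(batch_size, len(memory)))
        learn(batch)                                   # S170
    return memory
```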

At this time, the imitation reinforcement learning unit 200 provides the learning parameters resulting from the reinforcement learning to the reinforcement learning agent 120 so that they can be applied to the reinforcement learning model for device control, thereby improving the model so that the device 400 can be controlled precisely.

Meanwhile, reinforcement learning using the transition information stored through the reinforcement learning agent 120 is performed each time an episode ends; by performing the reinforcement learning repeatedly, the reinforcement learning model for device control is progressively improved so that the device 400 can be controlled remotely and precisely.

In addition, the reinforcement learning is performed, based on the transition information, so that for a given status of the device 400 the model outputs at least one piece of control information ranked in descending order of the total reward over all states until the episode ends, and the reinforcement learning agent 120 generates the control information by selecting the control information with the highest total reward.
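
By way of a non-limiting illustration, the total reward from each step until the end of an episode can be computed as follows. The embodiment describes a plain sum; the optional `discount` parameter (default 1.0, which reproduces the plain sum) and the function name `totals_to_episode_end` are assumptions made for this sketch.

```python
from typing import List, Sequence


def totals_to_episode_end(rewards: Sequence[float],
                          discount: float = 1.0) -> List[float]:
    """For each step, sum the rewards from that step to the end of the episode."""
    totals = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each total reuses the total of the following step.
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + discount * running
        totals[i] = running
    return totals
```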

In addition, as described above, the device status information and the control information are transmitted and received through any particular communication method, and the network controller 300 controls them so that they are transmitted and received in real time.

In addition, the experience memory 210 is a FIFO-structured memory that stores up to a preset number of pieces of transition information; when storing a new piece of transition information would exceed the preset number, the oldest stored transition information is deleted and the new transition information is stored in the experience memory 210.
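
By way of a non-limiting illustration, a FIFO memory of this kind can be sketched with the bounded `collections.deque` of the Python standard library, which discards the oldest item when a new one is appended to a full deque. The class name `ExperienceMemory` and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, Generic, List, TypeVar

T = TypeVar("T")


class ExperienceMemory(Generic[T]):
    """Bounded FIFO store: the oldest transition is dropped when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items: Deque[T] = deque(maxlen=capacity)

    def store(self, transition: T) -> None:
        # A deque with maxlen evicts from the opposite end automatically.
        self._items.append(transition)

    def __len__(self) -> int:
        return len(self._items)

    def snapshot(self) -> List[T]:
        """Return the stored transitions, oldest first."""
        return list(self._items)
```

For example, storing the values 0 through 4 in a memory of capacity 3 leaves 2, 3 and 4, in that order.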

As a result, the same transition information is prevented from being applied repeatedly during the reinforcement learning, so that the reinforcement learning can be performed efficiently.

As described above, the imitation reinforcement learning system for remote device control and the method thereof according to an embodiment of the present invention use a traditional machine control model for controlling a specific device to generate transition information for that device quickly, and imitate the generated transition information to perform reinforcement learning on the reinforcement learning model for device control, so that the model adapts quickly to the physical environment of the device and the device can be controlled precisely.

Although the foregoing description has focused on preferred embodiments of the present invention, the technical idea of the present invention is not limited thereto, and each component of the present invention may be changed or modified within the technical scope of the present invention in order to achieve the same purpose and effect.

In addition, although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Various modifications may of course be made by a person having ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or perspective of the present invention.

10: imitation reinforcement learning system for remote device control
100: device remote control apparatus 110: PID controller
120: reinforcement learning agent 121: control information processing unit
122: device status information receiving unit 123: reward information generating unit
124: transition information generating unit 125: reinforcement learning model update unit
126: memory 200: imitation reinforcement learning unit
210: experience memory 220: reinforcement learning unit
230: learning parameter providing unit 300: network controller
400: device 500: user terminal

Claims

An imitation reinforcement learning system for remote device control, comprising: an experience memory that stores, in a first-in, first-out (FIFO) manner, transition information generated from the results of controlling at least one device located in a physical environment according to a preset episode; and
an imitation reinforcement learning unit that performs reinforcement learning on a reinforcement learning model for device control, for controlling the device, using the transition information stored in the experience memory,
wherein the transition information includes current status information of the device, control information for controlling the device, next status information of the device as controlled according to the control information, and reward information for the next status information.

The system according to claim 1,
wherein the imitation reinforcement learning system further comprises
a PID controller which, when fewer than a preset number of pieces of transition information are stored in the experience memory, controls the device according to the episode using a PID control model, and generates the transition information and stores it in the experience memory, and
wherein the imitation reinforcement learning unit performs the reinforcement learning based on the transition information generated and stored through the PID controller, each time that transition information reaches the preset number, each time transition information is stored in the experience memory through the PID controller, or when the episode ends, thereby performing the reinforcement learning by imitating the experience of the PID controller in controlling the device.

The system according to claim 2,
wherein the imitation reinforcement learning system further comprises
a reinforcement learning agent which, when the preset number of pieces of transition information are stored in the experience memory, inputs the status information of the device into the reinforcement learning model for device control on which the reinforcement learning has been performed, extracts control information for controlling the device, and controls the device step by step according to the episode, and
wherein the reinforcement learning agent receives device status information from the device controlled based on the control information, generates transition information for the device, and stores it in the experience memory in the FIFO manner.

The system according to claim 1,
wherein the reinforcement learning
is performed so as to extract, from device status information for a specific state of the device, the control information that maximizes the sum of the reward information up to a preset final state of the device, and
wherein the experience memory
stores the transition information in the FIFO manner, thereby preventing the transition information from being applied repeatedly during the reinforcement learning.

The system according to claim 1,
wherein the reward information
is generated by applying a positive reward value when the status information of the device indicates control within a preset threshold range, and a negative reward value when it indicates control beyond the preset threshold range.

An imitation reinforcement learning method for remote device control, comprising: a storing step of storing, in an experience memory in a first-in, first-out (FIFO) manner, transition information generated from the results of controlling at least one device located in a physical environment according to a preset episode; and
an imitation reinforcement learning step of performing reinforcement learning on a reinforcement learning model for device control, for controlling the device, using the transition information stored in the experience memory,
wherein the transition information includes current status information of the device, control information for controlling the device, next status information of the device as controlled according to the control information, and reward information for the next status information.

The method according to claim 6,
wherein the imitation reinforcement learning method further comprises,
when fewer than a preset number of pieces of transition information are stored in the experience memory, a step of controlling the device according to the episode through a PID controller using a PID control model, and generating the transition information and storing it in the experience memory, and
wherein the imitation reinforcement learning step performs the reinforcement learning based on the transition information generated and stored through the PID controller, each time that transition information reaches the preset number, each time transition information is stored in the experience memory through the PID controller, or when the episode ends, thereby performing the reinforcement learning by imitating the experience of the PID controller in controlling the device.

The method according to claim 7,
wherein the imitation reinforcement learning method further comprises,
when the preset number of pieces of transition information are stored in the experience memory, a step of, through a reinforcement learning agent, inputting the status information of the device into the reinforcement learning model for device control on which the reinforcement learning has been performed, extracting control information for controlling the device, and controlling the device step by step according to the episode, and
wherein the reinforcement learning agent receives device status information from the device controlled based on the control information, generates transition information for the device, and stores it in the experience memory in the FIFO manner.

The method according to claim 6,
wherein the reinforcement learning
is performed so as to extract, from device status information for a specific state of the device, the control information that maximizes the sum of the reward information up to a preset final state of the device, and
wherein the experience memory
stores the transition information in the FIFO manner, thereby preventing the transition information from being applied repeatedly during the reinforcement learning.

The method according to claim 6,
wherein the reward information
is generated by applying a positive reward value when the status information of the device indicates control within a preset threshold range, and a negative reward value when it indicates control beyond the preset threshold range.