KR20230151736A

KR20230151736A - Method and system for selecting movement routs of mobile robot performing multiple tasks based on deep reinforcement learning

Info

Publication number: KR20230151736A
Application number: KR1020220051446A
Authority: KR
Inventors: 유수정; 이창식; 지상훈
Original assignee: 한국생산기술연구원
Priority date: 2022-04-26
Filing date: 2022-04-26
Publication date: 2023-11-02

Abstract

A reinforcement learning-based movement path selection method for a mobile robot performing multiple tasks comprises: receiving parameters of a DQN artificial intelligence, including the task to be performed by the mobile robot and the work environment; designing a simulated environment by assigning a node to each of a plurality of zones of the input work environment and representing the environment as a graph, defining an environment state and an observable state for the task, and designing a goal for the task and a reward function according to the goal; performing reinforcement learning in the simulated environment by having the DQN artificial intelligence select the movement path that maximizes the action-value function; and controlling autonomous driving of the mobile robot in a real environment or a simulation environment using the trained DQN artificial intelligence.

Description

Reinforcement learning-based movement path selection method and system for mobile robots performing multiple tasks {METHOD AND SYSTEM FOR SELECTING MOVEMENT ROUTS OF MOBILE ROBOT PERFORMING MULTIPLE TASKS BASED ON DEEP REINFORCEMENT LEARNING}

The following description relates to a method and system for selecting a movement path that allows a mobile robot to optimally perform a task selected from among multiple tasks in a given work environment.

Reinforcement learning is a machine learning methodology in which, given an arbitrary task environment, a reward function representing the degree of task achievement is defined and an agent learns the actions that maximize the reward within that environment. With advances in computing hardware and deep learning, reinforcement learning has developed into deep reinforcement learning and has proven effective on various problems that were previously hard to solve because they were too complex to model. For example, in 2016 AlphaGo, a reinforcement learning-based Go program, defeated the Korean professional Go player Lee Sedol, which demonstrated the practical value of deep reinforcement learning to the public. The scope of deep reinforcement learning is now expanding to research on autonomous task performance by mobile robots.

Research on giving mobile robots the ability to perform tasks autonomously has been pursued continuously because of advantages such as lower cost and reduced risk to human life. In particular, service robots capable of multiple tasks, including patrol, guard deployment, fire surveillance, item delivery, and cleaning, have been actively studied in order to extend the scope and efficiency of existing unmanned service systems.

Existing studies on deep reinforcement learning have mostly been limited to mobile robots that perform only a single task among patrol, guard deployment, fire surveillance, item delivery, and cleaning. No study has built a reinforcement learning simulated environment that covers these multiple tasks in the same environment and trained an agent in it.

In addition, existing deep reinforcement learning work for mobile robots has mainly studied artificial intelligence models that take as input video captured by the robot in real time, or a map image of the work environment, and estimate the robot's control signal or movement coordinates. Methods that use image data, however, require a large amount of computation, so their performance is limited when they run on a computing device that can be mounted on a mobile robot.

To reduce the amount of computation, research has also applied reinforcement learning to data abstracted into a lower-dimensional form than images. Such work has addressed single tasks such as patrol, delivery, or cleaning, but none has built an abstracted reinforcement learning simulated environment and trained an agent for multiple tasks including patrol, guard deployment, fire surveillance, item delivery, and cleaning.

The background art described above was possessed or acquired by the inventors in the course of arriving at the present disclosure, and is not necessarily known art disclosed to the general public before the present application.

An object of the embodiments is to provide a method and system for selecting a movement path that allows a mobile robot to optimally perform a task selected from among multiple tasks, such as patrol, guard deployment, fire surveillance, item delivery, and cleaning, in a given work environment.

The problems to be solved by the embodiments are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

A reinforcement learning-based movement path selection method for a mobile robot performing multiple tasks according to an embodiment comprises: receiving parameters of a DQN artificial intelligence, including the task to be performed by the mobile robot and the work environment; designing a simulated environment by assigning a node to each of a plurality of zones of the input work environment and representing the environment as a graph, defining an environment state and an observable state for the task, and designing a goal for the task and a reward function according to the goal; performing reinforcement learning in the simulated environment by having the DQN artificial intelligence select the movement path that maximizes the action-value function; and controlling autonomous driving of the mobile robot in a real environment or a simulation environment using the trained DQN artificial intelligence.

In addition, a movement path selection system that controls autonomous driving of a mobile robot performing multiple tasks according to an embodiment comprises: an input unit that receives parameters of a DQN artificial intelligence, including the task to be performed by the mobile robot and the work environment; a learning unit that designs a simulated environment for reinforcement learning of the DQN artificial intelligence and trains the DQN artificial intelligence in the designed simulated environment; and a driving control unit that controls autonomous driving of the mobile robot in a real environment or a simulation environment using the trained DQN artificial intelligence.

In addition, a mobile robot performing multiple tasks according to an embodiment is operated on the basis of a DQN artificial intelligence that has been trained by designing a simulated environment for the input task and work environment and, in that simulated environment, selecting the movement path that maximizes the action-value function.

According to the embodiments, a simulated environment is designed for deep reinforcement learning of a mobile robot that can perform multiple tasks, including one or more of patrol, guard deployment, fire surveillance, item delivery, and cleaning, in a given work environment. Training an artificial intelligence agent in this environment allows the mobile robot to compute efficiently and to select the movement path that performs the task optimally.

The effects of the reinforcement learning-based movement path selection method and system for a mobile robot performing multiple tasks according to the embodiments are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a block diagram illustrating a reinforcement learning-based movement path selection system for a mobile robot performing multiple tasks according to an embodiment.
FIG. 2 is a diagram illustrating a method of representing a work environment as a graph according to an embodiment.
FIG. 3 is a block diagram illustrating the configuration of a DQN artificial intelligence according to an embodiment.
FIG. 4 is a diagram illustrating a method of training the DQN artificial intelligence according to an embodiment.
FIG. 5 is a block diagram illustrating the configuration of the driving control unit of FIG. 1.
FIG. 6 is a diagram illustrating the operation of the driving control unit of FIG. 5.
FIGS. 7A to 7E are graphs showing the task success rate as a function of the number of training iterations of the DQN artificial intelligence according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. Because various changes may be made to the embodiments, the scope of rights of the patent application is not restricted or limited by these embodiments. All changes, equivalents, and substitutes for the embodiments should be understood to fall within the scope of rights.

The terms used in the embodiments are for description only and should not be construed as limiting. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the embodiments belong. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in the present application, are not to be interpreted in an idealized or excessively formal sense.

In the description with reference to the accompanying drawings, identical components are given the same reference numerals regardless of the figure number, and redundant descriptions of them are omitted. In describing the embodiments, detailed descriptions of related known art are omitted where they would unnecessarily obscure the gist of the embodiments.

In describing the components of the embodiments, terms such as first, second, A, B, (a), and (b) may be used. These terms only distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, the component may be directly connected or joined to that other component, but it should be understood that a further component may also be "connected", "coupled", or "joined" between them.

A component included in one embodiment and a component having a common function are described using the same name in other embodiments. Unless stated otherwise, a description given for one embodiment also applies to other embodiments, and detailed descriptions are omitted where they would overlap.

Hereinafter, a reinforcement learning-based movement path selection method and system for a mobile robot performing multiple tasks according to an embodiment are described with reference to FIGS. 1 to 7E. For reference, FIG. 1 is a block diagram illustrating a reinforcement learning-based movement path selection system (hereinafter, "movement path selection system") 100 for a mobile robot performing multiple tasks according to an embodiment. FIG. 2 is a diagram illustrating how the movement path selection system 100 represents a work environment as a graph in order to design the simulated environment. FIG. 3 is a block diagram illustrating the configuration of the DQN artificial intelligence 151 according to an embodiment, and FIG. 4 is a diagram illustrating the reinforcement learning method of the DQN artificial intelligence 151. FIG. 5 is a block diagram illustrating the configuration of the driving control unit 160 according to an embodiment, and FIG. 6 is a diagram illustrating the operation of the driving control unit 160.

Referring to FIG. 1, the movement path selection system 100 for controlling autonomous driving of a mobile robot performing multiple tasks may include an input unit 110, a processor 120, a memory 130, an output unit 140, a learning unit 150, and a driving control unit 160.

The movement path selection system 100 applies autonomous driving technology that combines driving control with localization based on sensors such as cameras and LiDAR (Light Detection And Ranging). Based on a DQN (Deep Q Network) artificial intelligence 151, it allows a mobile robot that performs multiple tasks, such as patrol (security), item delivery, cleaning, guard deployment, and fire surveillance, to compute efficiently and to select an optimal movement path.

First, reinforcement learning is described.

Reinforcement learning is a machine learning methodology that, when the theoretical model of a given problem is complex, solves the problem by simplifying it into an interaction between an environment and an agent. It is mainly used to train an artificial intelligence that must perform a specific task with optimal efficiency in a dynamically changing situation.

In reinforcement learning, an agent in an arbitrary environment recognizes the current state through observation, takes an action according to its policy for the recognized state, and receives a reward. Here, the reward indicates how close the action taken by the agent is to the optimal solution of the problem: the closer the action is to the optimal solution, the higher the reward.

To solve a problem, reinforcement learning may, for example, model the environment as a POMDP (Partially Observable Markov Decision Process). A POMDP is a 7-tuple (S, A, P, R, Ω, O, γ), where S is the state space, A is the action space, P is the transition probability, R is the reward function, Ω is the observation space, O is the observation probability, and γ is the discount factor.

The state space S is the set of information describing the current state of the environment. For example, the state space S may include the current position of the mobile robot, the position of the work target, the positions of dynamic obstacles, and the positions of congested areas.

The action space A is the set of actions the agent can take in the current environment. For example, the action space A may include the coordinates of the points to which the mobile robot can move from its current position.

The transition probability P is the probability that the agent transitions from the current state s_t to the state s_{t+1} as a result of the selected action a_t. For example, the transition probability P may represent whether the mobile robot can move from its current position to the center coordinates of each zone in the environment.

The reward function R is the reward given when the agent selects action a_t in the current state s_t and transitions to state s_{t+1}. For example, the reward function R may be configured to give a reward of -10 each time the mobile robot attempts to move to an unreachable coordinate during a task, and a reward of +10 when the mobile robot completes the task.
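As a minimal sketch of the example reward values above (not the patent's actual implementation), the function below returns -10 for an attempted move to an unreachable coordinate and +10 on task completion. The function name, its arguments, and the zero reward for ordinary steps are assumptions made for illustration; the description does not state a per-step reward.

```python
def step_reward(move_is_valid: bool, task_completed: bool) -> float:
    """Reward for a single step, using the example values from the description."""
    if not move_is_valid:
        return -10.0  # robot tried to move to an unreachable coordinate
    if task_completed:
        return 10.0   # robot finished the selected task on this step
    return 0.0        # assumption: no reward is specified for ordinary steps


print(step_reward(move_is_valid=False, task_completed=False))
print(step_reward(move_is_valid=True, task_completed=True))
```

In a multi-task setting, each task would presumably supply its own completion test while sharing this kind of penalty for invalid moves.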

The observation probability O is the probability that the agent observes a given piece of information in the current state. For example, the observation probability O may represent whether a dynamic obstacle can be observed, given the current positions of the mobile robot and of the dynamic obstacle.

The discount factor γ ∈ [0, 1] is a constant that indicates how much more important the current reward is than rewards obtained in the future. For example, with γ = 0 the mobile robot learns a policy that maximizes only the immediate reward, and with γ = 1 it learns a policy that maximizes the sum of future rewards.
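The effect of γ can be illustrated with a short computation of the discounted return, the sum of rewards weighted by successive powers of γ. The function name and the sample reward sequence are hypothetical and chosen only to show the two limiting cases.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [0.0, 0.0, 10.0]  # hypothetical episode: task completed on the third step
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 0.5))
print(discounted_return(rewards, 1.0))  # plain sum of all rewards
```

With γ = 0 the delayed +10 contributes nothing to the first step's return, which is why such an agent would not learn to plan toward a distant goal.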

That is, in a work environment modeled as a POMDP as above, the agent repeats a cycle of observing o_t, recognizing the state s_t, and taking action a_t, and through this process learns the policy π^* that maximizes the reward r_t. In this way it can find the optimal solution to the given problem.
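The observe, act, and receive-reward cycle described above can be sketched as a simple episode loop. The `LineWorld` environment, its `reset`/`step` interface, and the reward values are hypothetical stand-ins chosen to keep the example self-contained; they are not taken from the patent.

```python
class LineWorld:
    """Hypothetical toy environment: nodes 0..3 on a line, goal at node 3."""

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):
        """action is +1 or -1; returns (observation, reward, done)."""
        nxt = self.pos + action
        if nxt < 0 or nxt > 3:
            return self.pos, -10.0, False  # invalid move: stay in place, penalty
        self.pos = nxt
        done = self.pos == 3
        return self.pos, (10.0 if done else 0.0), done


def run_episode(env, policy, max_steps=20):
    """Repeat observe -> act -> reward until the task ends or steps run out."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


print(run_episode(LineWorld(), lambda obs: +1))  # always move toward the goal
```

Training then amounts to adjusting `policy` so that the total returned by such episodes increases.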

In reinforcement learning, the state-value function V_π(s_t) can be defined as the expected reward the agent receives when it follows policy π from an arbitrary state s_t. The action-value function Q_π(s_t, a_t) can be defined as the expected reward the agent receives when, in an arbitrary state s_t, it takes action a_t and then follows policy π.

Reinforcement learning trains the agent to maximize the state-value function V_π(s_t) or the action-value function Q_π(s_t, a_t) for each state, so that the agent finds the optimal policy π^*(s_t) that maximizes the reward.

The state-value function V_π(s_t) and the action-value function Q_π(s_t, a_t) can be expressed as Equation 1 according to the Bellman expectation equation.

The optimal state-value function V^*(s_t) and the optimal action-value function Q^*(s_t, a_t) can then be defined as in Equation 2.

From Equations 1 and 2, the optimal state-value function V^*(s_t) and the optimal action-value function Q^*(s_t, a_t) can be expressed as Equation 3.
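Equations 1 to 3 themselves do not appear in this text. Under the definitions given above, their standard textbook forms would read as follows. This is a reconstruction under that assumption, not the patent's own typesetting; in particular, writing the reward for the transition from s_t to s_{t+1} as r_t follows the description of the reward function R and may differ from the original indexing.

```latex
\begin{align}
V_\pi(s_t) &= \mathbb{E}_\pi\!\left[ r_t + \gamma V_\pi(s_{t+1}) \mid s_t \right], &
Q_\pi(s_t, a_t) &= \mathbb{E}_\pi\!\left[ r_t + \gamma Q_\pi(s_{t+1}, a_{t+1}) \mid s_t, a_t \right] \tag{1} \\
V^*(s_t) &= \max_\pi V_\pi(s_t), &
Q^*(s_t, a_t) &= \max_\pi Q_\pi(s_t, a_t) \tag{2} \\
V^*(s_t) &= \max_{a_t} \mathbb{E}\!\left[ r_t + \gamma V^*(s_{t+1}) \mid s_t, a_t \right], &
Q^*(s_t, a_t) &= \mathbb{E}\!\left[ r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \mid s_t, a_t \right] \tag{3}
\end{align}
```

Equation 3 is the Bellman optimality form: substituting the maximizing policy of Equation 2 into the recursion of Equation 1 replaces the expectation over the policy's action with a maximum over actions.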

In the Q-learning method, the optimal policy π^*(s_t) that maximizes the reward can be found by using the Bellman expectation equation to learn the action-value function Q_π(s_t, a_t) for every state in the environment. When the number of states in the environment is very large, however, it is difficult to define and learn the action-value function Q_π(s_t, a_t) for all states S and all actions A.

In contrast, the DQN (Deep Q Network) method constructs a deep neural network that approximates the action-value function Q_π(s_t, a_t) for all states S and all actions A, which makes the computation more efficient.

For example, as shown in FIG. 3, when the DQN artificial intelligence 151 receives the current state s_t as input, it outputs the action-value function values Q(s_t, a_t) for the actions that can be taken in that state, and selects the action with the largest output value, a_t^* = argmax_{a_t} Q(s_t, a_t).
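The greedy selection a_t^* = argmax Q(s_t, a_t) can be sketched as follows. The function name, the list-based Q-values indexed by action, and the explicit set of allowed actions are illustrative assumptions; in the described system the Q-values would come from the network of FIG. 3.

```python
def select_action(q_values, allowed_actions):
    """Return the allowed action whose Q-value is largest (greedy selection)."""
    return max(allowed_actions, key=lambda a: q_values[a])


# Hypothetical Q-values for four candidate nodes; node 3 is not reachable here.
q_values = [0.1, 0.7, -0.2, 0.9]
print(select_action(q_values, allowed_actions=[0, 1, 2]))
```

Restricting the maximum to `allowed_actions` reflects the statement that Q-values are produced for the actions available in the current state, so an unreachable node is never chosen even if its raw network output is high.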

The DQN artificial intelligence 151 is trained to find the parameters that minimize a loss function L derived from Equation 3. The loss function L can be defined as in Equation 4.
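Equation 4 does not appear in this text. A commonly used form of the DQN loss is the mean squared temporal-difference error between Q(s_t, a_t) and the target r_t + γ max Q(s_{t+1}, a'), and the sketch below computes that quantity. Whether the patent's Equation 4 matches this form exactly, and whether it uses a separate target network, are assumptions; `q_net` and `target_net` are hypothetical callables that map a state to a list of Q-values.

```python
def td_target(reward, next_q_values, gamma, done):
    """Bootstrapped target: r_t + gamma * max_a' Q(s_{t+1}, a'), or r_t at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def dqn_loss(batch, q_net, target_net, gamma):
    """Mean squared TD error over transitions (s, a, r, s_next, done)."""
    errors = []
    for s, a, r, s_next, done in batch:
        target = td_target(r, target_net(s_next), gamma, done)
        errors.append((target - q_net(s)[a]) ** 2)
    return sum(errors) / len(errors)


# Toy stand-ins for the networks: constant Q-values for two actions.
q_net = lambda s: [1.0, 2.0]
target_net = lambda s: [0.5, 3.0]
batch = [(0, 1, 1.0, 1, False)]
print(dqn_loss(batch, q_net, target_net, gamma=0.9))
```

In practice the loss would be minimized by gradient descent on the parameters of `q_net`; that optimization step is omitted here.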

Returning to FIG. 1, the input unit 110 receives parameters from the user, such as the task to be performed by the mobile robot, the work environment in which the task is performed, and the algorithm settings.

The processor 120 controls the memory 130, the output unit 140, the learning unit 150, and the driving control unit 160 using the received parameters.

The memory 130 stores data such as the parameters of the artificial intelligence algorithm, the tasks, and the environment settings.

The output unit 140 can display the system status, the simulation status, and similar information on a screen.

The learning unit 150 designs a simulated environment for deep reinforcement learning of the artificial intelligence agent, based on the task and work environment selected by the user through the input unit 110 and on the results of the mobile robot's selections. In the designed simulated environment, it trains the agent, the DQN (Deep Q Network) artificial intelligence 151, to maximize the reward obtained from the movement path the agent selects.

In detail, the method by which the learning unit 150 designs the simulated environment may include representing the work environment as a graph by assigning a node to each zone, defining the environment state and observable state suited to each task, and designing the goal and reward function suited to each task.

First, as shown in FIG. 2, representing the work environment as a graph includes creating a map of the work environment (S11), dividing the map into a plurality of zones and assigning a node to each zone (S12), and creating a graph that indicates, for each zone, which other zones can be reached and where dynamic obstacles can be observed (S13). For example, each node may be set to the center coordinates representing its zone, and the graph may be drawn by connecting each node with lines to the nodes it can move to and the nodes it can observe. The movable zones and the observable zones of a node may differ or may coincide. For example, node 2 can move to nodes 0, 1, 3, 4, 19, 23, 24, and 25, and can observe nodes 0, 1, 3, 4, 23, 24, and 25.

Next, the environment state suited to the task selected by the user through the input unit 110, and the observable state available to the mobile robot for that task, are defined. The environment state may include all the information the mobile robot needs to perform the selected task within the work environment. The observable state may include those parts of the defined environment state that the mobile robot can access while performing the selected task. For example, the environment state and the observable state can be defined as shown in Table 1.

Table 1: environment state and observable state by target task

Patrol
Environment state:
· Zones the robot can move to from each position
· Robot's field of view at each position
· Robot position
· Robot field of view
· Intruder position, which changes randomly
Observable state:
· Zones the robot can move to from its current position
· Robot position over the last 3 time frames
· Robot field of view over the last 3 time frames and intruder positions captured within it

Goods delivery
Environment state:
· Zones the robot can move to from each position
· Delivery target point
· Robot position
· Congestion level of each zone, which changes randomly
Observable state:
· Zones the robot can move to from its current position
· Delivery target point
· Robot position over the last 3 time frames
· Positions of congested zones over the last 3 time frames

Cleaning
Environment state:
· Zones the robot can move to from each position
· Cleaned zones
· Robot position
· Congestion level of each zone, which changes randomly
Observable state:
· Zones the robot can move to from its current position
· Cleaned zones
· Robot position over the last 3 time frames
· Positions of congested zones over the last 3 time frames

Guard deployment
Environment state:
· Zones the robot can move to from each position
· Robot's field of view at each position
· Target point of the intruders
· Robot position
· Positions of the intruders moving toward the target point
Observable state:
· Zones the robot can move to from its current position
· Target point of the intruders
· Robot position over the last 3 time frames
· Robot field of view over the last 3 time frames and intruder positions captured within it

Fire surveillance
Environment state:
· Zones the robot can move to from each position
· Robot's field of view at each position
· Robot position
· Robot field of view
· Fire zone
Observable state:
· Zones the robot can move to from its current position
· Robot position over the last 3 time frames
· Robot field of view over the last 3 time frames and fire zone positions captured within it
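As a minimal sketch of the graph representation and of the patrol entry in Table 1, the movable and observable links can be held as two separate adjacency maps, and the observable state as a rolling window over the last 3 time frames. The class name `PatrolObservation`, the dictionary layout, and the use of `None` for "no intruder in view" are assumptions made for illustration; only the node-2 adjacency and the 3-frame window come from the description above.

```python
from collections import deque
from dataclasses import dataclass, field

# Zone graph from the node-2 example; the two maps are kept separate
# because movable and observable zones may differ.
MOVABLE = {2: {0, 1, 3, 4, 19, 23, 24, 25}}
OBSERVABLE = {2: {0, 1, 3, 4, 23, 24, 25}}  # node 19 is reachable but not visible

HISTORY = 3  # number of recent time frames kept in the observable state


@dataclass
class PatrolObservation:
    """Hypothetical container for the patrol task's observable state."""
    positions: deque = field(default_factory=lambda: deque(maxlen=HISTORY))
    sightings: deque = field(default_factory=lambda: deque(maxlen=HISTORY))

    def update(self, robot_node, intruder_node):
        visible = OBSERVABLE.get(robot_node, set())
        self.positions.append(robot_node)
        # The intruder position is recorded only when it lies in the field of view.
        self.sightings.append(intruder_node if intruder_node in visible else None)

    def as_dict(self, robot_node):
        return {
            "movable": sorted(MOVABLE.get(robot_node, set())),
            "positions": list(self.positions),
            "sightings": list(self.sightings),
        }


obs = PatrolObservation()
for intruder in (23, 19, 4):
    obs.update(2, intruder)
print(obs.as_dict(2))
```

In this example the intruder at node 19 is stored as `None`, because node 2 can move to node 19 but cannot observe it.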

Next, a goal suited to the task selected by the user through the input unit 110 is defined, and a reward function is designed whose reward conditions follow from that goal. Defining the goal defines the optimal behavior policy the mobile robot should follow in the work environment, and designing the reward function to match the goal allows the mobile robot to learn that policy. For example, the goals and reward conditions may be defined as shown in Table 2.

Table 2: goal and reward conditions by target task

Patrol
Goal: while patrolling the work environment, the robot captures an intruder wandering in the work environment.
Reward conditions:
· Robot collision: -10
· Normal (no event): -1
· Intruder detected: -1 to 0 (increases as the distance decreases)
· Intruder captured: +10 (task ends)

Goods delivery
Goal: reach the delivery target point while bypassing congested areas whose positions change randomly within the work environment.
Reward conditions:
· Robot collision: -10
· Normal (no event): -1
· Near the target point: -1 to 0 (increases as the distance decreases)
· Target point reached: +10 (task ends)

Cleaning
Goal: visit every area in the work environment while bypassing congested areas whose positions change randomly within the work environment.
Reward conditions:
· Robot collision: -10
· Normal (no event): -1
· Cleaned area increased: -0.5
· All areas cleaned: +10 (task ends)

Guard deployment
Goal: while waiting at a specific position, detect intruders moving toward a target point within the work environment.
Reward conditions:
· Robot collision: -10
· Robot position changed: -5
· Normal (no event): -1
· Intruder detected: +10
· At the end of the task:
▷ Number of intruders detected > number of intruders missed: +10
▷ Number of intruders detected < number of intruders missed: -10

Fire surveillance
Goal: while patrolling the work environment, the robot reaches a fire zone that occurs randomly within the work environment.
Reward conditions:
· Robot collision: -10
· Normal (no event): -1
· Fire risk detected: -1 to 0 (increases as the distance decreases)
· Fire zone reached: +10 (task ends)
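The reward conditions of Table 2 can be sketched as small functions, shown here for the patrol task and for the end-of-task bonus of the guard deployment task. The numeric rewards come from Table 2. The linear scaling of the "-1 to 0" term, the `max_distance` parameter, the priority between simultaneous events, the choice not to end the episode on a collision, and the zero reward for a tie are assumptions, since the description does not specify them.

```python
def patrol_reward(collided, captured, intruder_distance=None, max_distance=10.0):
    """Return (reward, done) for one step of the patrol task.

    intruder_distance is None when no intruder is in view.
    The linear shaping and max_distance are illustrative assumptions.
    """
    if collided:
        return -10.0, False  # assumption: a collision does not end the episode
    if captured:
        return 10.0, True    # task ends on capture
    if intruder_distance is not None:
        # Scale from -1 (at max_distance or beyond) up to 0 (distance zero),
        # so the reward increases as the distance decreases.
        clipped = min(max(intruder_distance, 0.0), max_distance)
        return -clipped / max_distance, False
    return -1.0, False       # normal step with no event


def guard_terminal_bonus(detected, missed):
    """End-of-task bonus for guard deployment."""
    if detected > missed:
        return 10.0
    if detected < missed:
        return -10.0
    return 0.0  # assumption: Table 2 does not define the tie case


print(patrol_reward(False, False, intruder_distance=5.0))
print(guard_terminal_bonus(3, 1))
```

The other tasks in Table 2 follow the same pattern with the detection term replaced by the task-specific condition.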

The learning unit 150 then trains the DQN artificial intelligence 151 in the designed simulated environment.

Referring to FIG. 3, the simulated environment designed by the learning unit 150 is configured to receive a state $s_t$ carrying C items of information on a graph represented by N nodes, and to output action-value function values $Q(s_t, a_t)$ for the N nodes. The DQN artificial intelligence 151 then selects the action $a_t^* = \arg\max_{a_t} Q(s_t, a_t)$, that is, the movement path for which the output action-value function value is largest. The action-value function value is the expected reward for selecting a given movement path in the input state.

For example, the DQN artificial intelligence 151 may be configured as a convolutional neural network (CNN) that applies one-dimensional convolution to the input state.
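The mapping from an N-node, C-channel state to N action values, followed by greedy selection, can be sketched in plain Python as follows. The layer count, kernel size, hidden width, zero padding, and random initialization are assumptions chosen for illustration; the description only states that one-dimensional convolution is applied to the input state and that the action with the largest action-value function value is selected.

```python
import random


def conv1d(x, weights, bias):
    """Same-padded 1D convolution along the node axis.

    x: list of N rows, each holding C_in features.
    weights: [C_out][K][C_in], bias: [C_out]. Returns N rows of C_out values.
    """
    n, k = len(x), len(weights[0])
    half = k // 2
    out = []
    for i in range(n):
        row = []
        for w, b in zip(weights, bias):
            acc = b
            for j in range(k):
                src = i + j - half
                if 0 <= src < n:  # zero padding outside the node range
                    acc += sum(wc * xc for wc, xc in zip(w[j], x[src]))
            row.append(acc)
        out.append(row)
    return out


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def init_layer(rng, c_out, k, c_in):
    """Random weights and zero biases (illustrative initialization)."""
    weights = [[[rng.uniform(-0.5, 0.5) for _ in range(c_in)]
                for _ in range(k)] for _ in range(c_out)]
    return weights, [0.0] * c_out


def q_values(state, params):
    """Two conv layers: C -> hidden channels -> 1, giving one Q-value per node."""
    (w1, b1), (w2, b2) = params
    hidden = relu(conv1d(state, w1, b1))
    return [row[0] for row in conv1d(hidden, w2, b2)]


def greedy_action(q):
    """Index of the node with the largest action value (argmax)."""
    return max(range(len(q)), key=q.__getitem__)


rng = random.Random(0)
N, C = 6, 4
params = (init_layer(rng, 8, 3, C), init_layer(rng, 1, 3, 8))
state = [[rng.random() for _ in range(C)] for _ in range(N)]
q = q_values(state, params)
print(len(q), greedy_action(q))
```

A practical implementation would normally use a deep learning framework and might restrict the argmax to nodes reachable from the current node; that restriction is not stated in the description and is left out here.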

Referring to FIG. 4, the deep reinforcement learning method for the DQN artificial intelligence 151 may consist of an action selection step by the DQN artificial intelligence 151, a step of reflecting the action in the simulated environment, a loss function calculation step, and a step of calculating the parameters of the DQN artificial intelligence 151 according to the loss function value.

In detail, as described with reference to FIG. 3, the agent (the DQN artificial intelligence 151) selects the action (that is, the movement path) that maximizes the action-value function value $Q(s_t, a_t)$ for the state $s_t$: $a_t^* = \arg\max_{a_t} Q(s_t, a_t)$.

Here, the environment state is input from the designed simulated environment, the DQN artificial intelligence 151 selects an action, and the observable state resulting from the selected action is passed to the simulated environment.

Next, a loss function value is calculated from the environment state of the simulated environment, the action selected by the DQN artificial intelligence 151, and the action-value function value, which represents the reward for the selected action. For example, the loss function may be defined using Equation 4.

Next, the DQN artificial intelligence 151 selects the action that maximizes the reward given along the movement path, and from this calculates the parameters that minimize the loss function value. The calculated parameters are then used to update the DQN artificial intelligence 151 (the agent). For example, the parameters correspond to the movement path, that is, the behavior policy that maximizes the action-value function value for the selected task in the current environment (the simulated environment).

These steps are repeated, updating the parameters so as to minimize the loss function value, and the DQN artificial intelligence 151 is thereby trained by deep reinforcement learning.
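The repeated loss calculation and parameter update can be sketched as follows. Equation 4 is not reproduced in this section, so the sketch assumes the squared temporal-difference error commonly used with DQN, with target $r + \gamma \max_{a'} Q(s', a')$. A lookup table `q[s][a]` stands in for the CNN parameters, and the discount factor, learning rate, and the two example transitions are hypothetical.

```python
def td_target(q, transition, gamma):
    """Assumed DQN-style target: r for terminal steps, else r + gamma * max Q(s')."""
    _, _, r, s_next, done = transition
    return r if done else r + gamma * max(q[s_next])


def td_loss(q, transition, gamma=0.99):
    """Squared TD error for one transition (s, a, r, s_next, done)."""
    s, a = transition[0], transition[1]
    return (td_target(q, transition, gamma) - q[s][a]) ** 2


def sgd_step(q, transition, lr=0.1, gamma=0.99):
    """One semi-gradient step on the squared TD error.

    The target is treated as a constant, and the factor 2 from the
    derivative of the square is folded into lr.
    """
    s, a = transition[0], transition[1]
    q[s][a] += lr * (td_target(q, transition, gamma) - q[s][a])


def train(q, batch, epochs=50, lr=0.1, gamma=0.99):
    """Repeat the update over a fixed batch and record the mean loss per epoch."""
    history = []
    for _ in range(epochs):
        for tr in batch:
            sgd_step(q, tr, lr, gamma)
        history.append(sum(td_loss(q, tr, gamma) for tr in batch) / len(batch))
    return history


# Hypothetical delivery-style transitions on three zones:
# a normal step (-1) from zone 0 to 1, then reaching the target (+10, task ends).
batch = [(0, 1, -1.0, 1, False), (1, 2, 10.0, 2, True)]
q = [[0.0] * 3 for _ in range(3)]
history = train(q, batch)
print(history[0] > history[-1])
```

With a neural network in place of the table, the same loss would be minimized by backpropagation; techniques such as a replay buffer or a separate target network are common in DQN practice but are not mentioned in the description.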

Referring to FIGS. 5 and 6, the driving control unit 160 uses the trained DQN artificial intelligence 151 to pass the selected movement path to the mobile robot in a real environment or a three-dimensional simulation environment, and drives the mobile robot to travel autonomously to perform the task.

Referring to FIG. 5, the driving control unit 160 may include a movement path selection unit 610, an autonomous driving unit 620, a robot driving unit 630, and a simulation unit 640.

Referring to FIG. 6, the movement path selection unit 610 uses the trained DQN artificial intelligence 151 to send the autonomous driving unit 620 a movement path selection command for optimally performing the task in the current state within the work environment (that is, the simulated environment).

The autonomous driving unit 620 may include a control module 621 and an environment recognition module 622.

The environment recognition module 622 estimates the position and posture of the mobile robot at its current location within the work environment, generates an obstacle map of the selected zone, and detects targets in images. Here, the environment recognition module 622 may estimate the current position (that is, the current position in the real environment or the simulation environment) and posture of the mobile robot using sensor measurements supplied by the robot driving unit 630 and the simulation unit 640, described below. The environment recognition module 622 then passes the recognition result to the control module 621.

The control module 621 receives the movement path selection command from the movement path selection unit 610 and, based on the recognition result from the environment recognition module 622, plans a global path from the current position of the mobile robot to the center coordinates of the selected zone, generates a local trajectory for following the planned global path, and computes control commands for autonomous driving of the mobile robot. The control module 621 then sends the generated control commands to the robot driving unit 630 or the simulation unit 640.
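The description does not name a planning algorithm for the global path. As one possible stand-in, the sketch below runs a breadth-first search over a 4-connected occupancy grid from the robot's current cell to the cell holding the selected zone's center. The grid encoding (1 for an obstacle), the function name `plan_global_path`, and the use of grid cells rather than metric coordinates are assumptions; local trajectory generation and control command computation are omitted.

```python
from collections import deque


def plan_global_path(grid, start, goal):
    """Breadth-first search on a 4-connected occupancy grid.

    grid[r][c] == 1 marks an obstacle. Returns the list of (row, col)
    cells from start to goal, or None if the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back through the parents
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


obstacle_map = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(plan_global_path(obstacle_map, (0, 0), (2, 0)))
```

Breadth-first search returns a shortest path in number of cells; a planner such as A* or Dijkstra's algorithm with metric costs could be substituted without changing the surrounding data flow.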

The robot driving unit 630 operates the mobile robot in a real environment and may include a motor control module 631 and a sensing module 632 of the mobile robot.

The motor control module 631 receives control command signals from the autonomous driving unit 620 and drives the motors of the mobile robot to operate it in the real environment.

The sensing module 632 receives sensor measurements from a plurality of sensors mounted on the mobile robot, such as cameras and lidar, and passes them to the autonomous driving unit 620.

The simulation unit 640 receives control command signals from the autonomous driving unit 620 and generates and visualizes a three-dimensional simulation. The simulation unit 640 serves to operate the mobile robot in a three-dimensional simulation environment: like the robot driving unit 630, it drives the motors of the mobile robot within the three-dimensional simulation environment and passes the measurements of the sensors mounted on the simulated mobile robot to the autonomous driving unit 620.
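Because the robot driving unit 630 and the simulation unit 640 both accept control commands and return sensor measurements, they can be modeled as two implementations of one interface. The sketch below illustrates that data flow only; the `Backend` protocol, the `("goto", zone)` command format, and the toy `SimulationBackend` are hypothetical names and simplifications, not the structure disclosed in FIGS. 5 and 6.

```python
from typing import Protocol, Sequence


class Backend(Protocol):
    """Interface assumed to be shared by the robot driving unit and the simulation unit."""

    def apply(self, command: tuple) -> None: ...

    def read_sensors(self) -> dict: ...


class SimulationBackend:
    """Toy stand-in for the simulation unit: the robot jumps to the commanded zone."""

    def __init__(self, start_zone: int):
        self.zone = start_zone

    def apply(self, command: tuple) -> None:
        self.zone = command[1]

    def read_sensors(self) -> dict:
        return {"zone": self.zone}


def select_zone(q_values: Sequence[float]) -> int:
    """Movement path selection: the zone with the largest action value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def control_step(q_values: Sequence[float], backend: Backend) -> int:
    """One pass: path selection, control command, backend, sensor feedback."""
    target = select_zone(q_values)
    backend.apply(("goto", target))
    return backend.read_sensors()["zone"]


sim = SimulationBackend(start_zone=0)
print(control_step([0.1, 0.7, 0.2], sim))
```

A backend for the real robot would implement the same two methods on top of the motor control and sensing modules, so the selection and control logic would not need to change between simulation and deployment.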

Meanwhile, FIGS. 7A to 7E are graphs showing the task success rate of each task against the number of training episodes when the DQN artificial intelligence according to the embodiment is trained. FIG. 7A shows the learning curve of the reinforcement learning agent for patrol, FIG. 7B for goods delivery, FIG. 7C for cleaning, FIG. 7D for guard deployment, and FIG. 7E for fire surveillance; in each graph, the task success rate is evaluated and plotted at the end of every episode.

The figures show that the task success rate of the DQN artificial intelligence increases with the number of training episodes. These results indicate that the DQN artificial intelligence is learning a behavior policy for performing each task optimally.

The learning curves of the DQN artificial intelligence also differ with the complexity of the task and the settings of the training parameters. For example, in the learning curve for patrol in FIG. 7A, the task success rate approaches 1 at roughly 1,000 training episodes, and in the learning curves for goods delivery (FIG. 7B), cleaning (FIG. 7C), and fire surveillance (FIG. 7E), it approaches 1 once the number of training episodes reaches 3,000 or more. In the learning curve for guard deployment in FIG. 7D, however, the task success rate reaches only about 0.9 even after 12,500 or more training episodes. These results indicate that a more complex task is harder to learn and yields a lower task success rate, and that, among the illustrated tasks, guard deployment is more complex and harder to learn than the others.

According to this embodiment, the reinforcement learning-based movement path selection system 100 for a mobile robot performing multiple tasks applies autonomous driving technology that combines position recognition using sensors such as cameras and lidar with driving control, so that a mobile robot performing multiple tasks such as patrol and goods delivery can select the optimal movement path for the selected task. In addition, training the DQN artificial intelligence by deep reinforcement learning allows the optimal movement path of the mobile robot to be selected so that the task is performed efficiently.

Although the embodiments have been described with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims below.

100: Reinforcement learning-based movement path selection system for a mobile robot performing multiple tasks
110: Input unit
120: Processor
130: Memory
140: Output unit
150: Learning unit
151: DQN artificial intelligence (agent)
160: Driving control unit

Claims

A reinforcement learning-based movement path selection method for a mobile robot performing multiple tasks, the method comprising:
receiving parameters of a DQN artificial intelligence, including a task to be performed by the mobile robot and a work environment;
designing a simulated environment by assigning a node to each of a plurality of zones of the input work environment and representing the work environment as a graph, defining an environment state and an observable state for the task, and designing a goal for the task and a reward function according to the goal;
performing reinforcement learning by having the DQN artificial intelligence select, in the simulated environment, the movement path that maximizes an action-value function value; and
controlling autonomous driving of the mobile robot in a real environment or a simulation environment using the trained DQN artificial intelligence.

The method of claim 1, wherein the performing of the reinforcement learning comprises:
calculating a loss function value reflecting the environment state of the simulated environment, the action selected by the DQN artificial intelligence, and the observable state resulting from the selected action;
calculating a parameter that minimizes the loss function value; and
updating the DQN artificial intelligence with the calculated parameter,
wherein the calculating of the loss function value and the updating are performed repeatedly.

The method of claim 2, wherein the DQN artificial intelligence comprises a convolutional neural network (CNN) that applies one-dimensional convolution to a state carrying a plurality of items of information on a graph represented by a plurality of nodes in the simulated environment.

The method of claim 1, wherein the designing of the simulated environment comprises:
creating a map of the input work environment;
dividing the work environment into a plurality of zones and assigning a node to each zone; and
representing the work environment as a graph that indicates the possibility of movement from each zone to other zones and the observability of dynamic obstacles.

The method of claim 4, wherein:
the environment state includes information for the mobile robot to perform the task within the work environment; and
the observable state includes those states of the environment state that are accessible while the mobile robot performs the task.

The method of claim 1, wherein the controlling of the autonomous driving of the mobile robot comprises:
generating, using the trained DQN artificial intelligence, a movement path selection command that selects a movement path for performing the task within the simulated environment;
generating a control command for autonomous driving of the mobile robot based on the movement path selection command and a recognition result of the work environment; and
operating the mobile robot in a real environment or a simulation environment according to the control command.

The method of claim 6, wherein the controlling of the autonomous driving of the mobile robot comprises receiving sensor measurements taken by a plurality of sensors provided on the mobile robot in the real environment or simulation environment in which the mobile robot is operated, and estimating the current position and posture of the mobile robot.

A reinforcement learning-based movement path selection system that controls autonomous driving of a mobile robot performing multiple tasks, the system comprising:
an input unit that receives parameters of a DQN artificial intelligence, including a task to be performed by the mobile robot and a work environment;
a learning unit that designs a simulated environment for reinforcement learning of the DQN artificial intelligence and trains the DQN artificial intelligence by reinforcement learning in the designed simulated environment; and
a driving control unit that controls autonomous driving of the mobile robot in a real environment or a simulation environment using the trained DQN artificial intelligence.

The system of claim 8, wherein the learning unit:
has the DQN artificial intelligence select an action by taking the movement path that maximizes an action-value function value;
calculates a parameter that minimizes a loss function value reflecting the action selected by the DQN artificial intelligence and the simulated environment; and
trains the DQN artificial intelligence by reinforcement learning by updating the DQN artificial intelligence with the calculated parameter.

The system of claim 8, wherein the simulated environment includes:
a graph representing the input work environment with a node assigned to each of a plurality of zones;
an environment state including information for the mobile robot to perform the task within the work environment, and an observable state including those states of the environment state that are accessible while the mobile robot performs the task; and
a goal suited to the task and a reward function set by reward conditions according to the goal.

The system of claim 10, wherein the graph indicates the possibility of movement from each zone to other zones and the observability of dynamic obstacles.

The system of claim 8, wherein the driving control unit comprises:
a movement path selection unit that generates, using the trained DQN artificial intelligence, a movement path selection command that selects a movement path for performing the task within the simulated environment;
an autonomous driving unit that generates a control command for autonomous driving of the mobile robot based on the movement path selection command and a recognition result of the work environment;
a robot driving unit that operates the mobile robot in a real environment according to the control command; and
a simulation unit that operates the mobile robot in a simulation environment according to the control command.

The system of claim 12, wherein the autonomous driving unit comprises:
a control module that plans a global path from the current position of the mobile robot to a selected zone, generates a local trajectory for following the global path, and generates the control command; and
an environment recognition module that estimates the position and posture of the mobile robot at its current location within the work environment, generates an obstacle map, and detects targets.

The system of claim 13, wherein the environment recognition module receives, from the robot driving unit and the simulation unit, sensor measurements taken by a plurality of sensors provided on the mobile robot.

A mobile robot performing multiple tasks that operates based on a trained DQN artificial intelligence, the DQN artificial intelligence having been trained by reinforcement learning by designing a simulated environment for an input task and work environment and selecting, in the simulated environment, the movement path that maximizes an action-value function value.