KR20220160391A

KR20220160391A - Generating collision-free path by RNN-based multi-agent deep reinforcement learning

Info

Publication number: KR20220160391A
Application number: KR1020210068563A
Authority: KR
Inventors: 최호진; 심민호; 원준희; 구본홍; 윤성열; 최형균; 이성후; 심현우; 김보라; 이정욱; 남제현
Original assignee: 한국과학기술원; 주식회사 네비웍스
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2022-12-06

Abstract

A path generation method according to the present invention comprises the steps of: a simulation progress step of generating an agent state based on information about an environment obtained from a simulator and manipulating an agent motion through a predetermined force vector; and a generation step of predicting a recursive path feature based on the agent state and generating a force vector and a cumulative expected reward which is to be transmitted to an agent. The path generation method according to the present invention utilizes the advantages of deep learning to show uniform performance even in the presence of an arbitrary number of agents and obstacles. In addition, by using an RNN-based model, the agent can effectively consider a path of the agent and paths of neighboring agents, so that a robust simulation can be performed even in a longer time unit, thereby significantly reducing the complexity of calculation.

Description

Collision-free path generation method using RNN-based multi-agent deep reinforcement learning

The present invention relates to a collision-free path generation method that uses RNN-based multi-agent deep reinforcement learning to predict collision-free paths and generate natural crowd scenes.

Crowd simulation is a technology that generates paths by controlling the movements of many characters, or agents. It is used to create large-scale crowd scenes (mob scenes) in video games and films, and in fields such as urban engineering and evacuation simulation. To produce a realistic crowd scene, each character must reach its target location without colliding with other characters or obstacles, while behaving as naturally as a real person. Generating natural crowd paths is difficult in a non-communicating setting, where agents do not know one another's velocities or destinations, and a variety of approaches exist to address this problem.

Existing methods fall broadly into reaction-based and trajectory-based approaches, according to the type of information they use. In a reaction-based approach, each agent generates an immediate path from the information about the current environment that it obtains at each short time unit. The computational cost is relatively low, but because the future states of other agents are not considered, such methods generate somewhat unnatural motion or work only in specific scenarios.

A trajectory-based approach records the movement of every agent over time to build trajectories, then uses them to predict the destinations of other characters and generate an optimal path. It produces more natural paths than a reaction-based approach, but it is computationally expensive, and when agents are crowded together it can suffer from the freezing robot problem, in which every path is judged to be unsafe.

More recently, there have been attempts to apply deep reinforcement learning. Reinforcement learning is a branch of machine learning that defines the state, action, and reward of the environment in which an agent exists and, based on the experiences the agent gathers in that environment, learns to maximize the expected cumulative reward the agent obtains over time. With recent advances in deep learning, deep reinforcement learning, which uses an artificial neural network to predict the agent's actions, has drawn attention and has shown strong performance in fields such as robotics, network routing, and game artificial intelligence.
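The objective named above can be written compactly. The following is a common textbook form, not a formula taken from the source: the discount factor \gamma, the horizon T, and the symbols \pi (policy) and r_t (reward at time unit t) are assumptions introduced here for illustration.

```latex
J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{T} \gamma^{t} \, r_{t} \right],
\qquad 0 < \gamma \le 1
```

Under this reading, learning means searching for the policy \pi that makes J(\pi) as large as possible.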

An object of the present invention is to provide a path generation method that improves the quality of generated paths by using an RNN-based multi-agent deep reinforcement learning model.

To achieve this object, a path generation method according to the present invention includes: a simulation progress step of generating an agent state based on information about an environment obtained from a simulator and controlling the motion of the agent through a predetermined force vector; and a generation step of predicting a recursive path feature based on the agent state and generating a force vector to be applied to the agent and a cumulative expected reward.

The simulation progress step may include: a pre-processing step of converting the information about the environment into a form usable as input to an actor model and a critic model; a progress step of applying the force vector generated in the generation step to each corresponding agent; and a check step of determining whether an agent has arrived at its destination and whether it has collided with another agent or an obstacle.

In addition, the actor model and the critic model may include: a deep recursive feature generation step of generating a recursive path feature based on the generated agent state; a policy generation step of generating the agent's force at the current time unit based on the deep recursive feature; and a state evaluation step of generating the agent's expected cumulative reward at the current time unit based on the deep recursive feature.

According to the present invention, natural, collision-free paths can be generated in an environment containing many agents and obstacles. Specifically, by using an RNN-based deep reinforcement learning model, the states of surrounding agents can be predicted effectively, and natural paths can be generated based on the predicted states.

FIG. 1 is a diagram showing a model of the path generation method according to the present invention.
FIG. 2 is a diagram showing the data flow of the policy model in the path generation method according to the present invention.
FIG. 3 is a flow diagram of a simulation that uses a simulator configured according to the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. The same or similar elements are given the same reference numerals regardless of the figure number, and redundant descriptions of them are omitted. In describing the embodiments, detailed descriptions of related known technologies are omitted where they could obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments; they do not limit the technical idea disclosed in this specification, which should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

The operating principle of the present invention is described on the assumption that a physics-based simulator has been built that can place agents, obstacles, and destinations, give each agent a lidar function, and run a simulation based on the generated paths. Under this assumption, the detailed operating principle of natural path generation is explained below.

The present invention uses an RNN-based multi-agent deep reinforcement learning model that can learn a policy for predicting the action an agent should take based on the agent's state over time.

FIG. 1 is a diagram showing a model of the path generation method according to the present invention.

Specifically, FIG. 1 shows an agent path generation system that generates natural paths. Environment information obtained from the simulator is processed to build an agent state batch, which is then used to predict the agent's force vector and cumulative expected reward.

As shown in FIG. 1, many deep reinforcement learning methods simultaneously train an actor model, which corresponds to the policy, and a critic model, which corresponds to the cost function. A preferred implementation of the deep reinforcement learning model of the present invention includes both models: an actor model and a critic model.

In what follows, the agent state is the feature generated at each time unit from the environment information obtained from the simulator. It is a 42-dimensional real-valued vector that concatenates a 2-dimensional relative position vector from the agent to its destination with a 20-dimensional depth map and a 20-dimensional velocity map, both obtained by dividing the 180 degrees in front of the agent into 20 equal sectors and casting lidar rays.
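The layout of the 42-dimensional state can be sketched as follows. This is a minimal illustration under assumptions: the function names, the `MAX_RANGE` value, the choice of one scalar per ray in the velocity map, and the ordering (goal offset, then depth, then velocity) are hypothetical and are not specified in the source.

```python
import math

NUM_RAYS = 20      # front 180 degrees split into 20 equal sectors (from the text)
MAX_RANGE = 10.0   # hypothetical lidar range; not given in the source


def ray_angles(heading):
    """Centre angle of each of the 20 sectors spanning the front 180 degrees."""
    step = math.pi / NUM_RAYS
    return [heading - math.pi / 2 + (i + 0.5) * step for i in range(NUM_RAYS)]


def build_agent_state(position, goal, depth_map, velocity_map):
    """Concatenate goal offset (2) + depth map (20) + velocity map (20)."""
    if len(depth_map) != NUM_RAYS or len(velocity_map) != NUM_RAYS:
        raise ValueError("each lidar map must have 20 entries")
    relative = [goal[0] - position[0], goal[1] - position[1]]
    return relative + list(depth_map) + list(velocity_map)


# An agent at the origin with nothing in lidar range and a goal at (3, 4).
state = build_agent_state(
    position=(0.0, 0.0),
    goal=(3.0, 4.0),
    depth_map=[MAX_RANGE] * NUM_RAYS,
    velocity_map=[0.0] * NUM_RAYS,
)
print(len(state))
```

The fixed length is what lets the same network handle any number of neighbours: other agents and obstacles only show up through the 20 depth and 20 velocity readings.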

The agent reward is the feedback on the action the agent takes at each time unit: a positive reward is given when the agent arrives at its destination, and a negative reward when the agent collides with another agent or an obstacle. Finally, the agent action is the force that the policy generates for the agent at each time unit. Each agent feeds that force to the simulator as input and, as a result, obtains the environment information for the next time unit.
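The reward rule can be illustrated with a few lines. The source states only the signs of the rewards; the magnitudes below, the zero reward for ordinary steps, and the choice to let a collision take precedence over an arrival are assumptions made for this sketch.

```python
ARRIVAL_REWARD = 1.0       # hypothetical magnitude; the source only says "positive"
COLLISION_PENALTY = -1.0   # hypothetical magnitude; the source only says "negative"


def agent_reward(arrived, collided):
    """Per-time-unit feedback for one agent."""
    if collided:            # assumed precedence when both flags are set
        return COLLISION_PENALTY
    if arrived:
        return ARRIVAL_REWARD
    return 0.0              # assumed: no feedback on ordinary steps


print(agent_reward(arrived=True, collided=False))
print(agent_reward(arrived=False, collided=True))
```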

The actor model generates, from the agent's state, the force vector applied to the agent so that no collision occurs at the current time unit. The data flow of the proposed actor model is as follows. Each agent has a 42-dimensional state vector; as the simulation proceeds, these state vectors are accumulated to form a state batch every 80 time units. This yields a state batch of size 80x42.
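One way to accumulate the 80x42 state batch is a fixed-length buffer per agent. The sketch below assumes a sliding window that always holds the most recent 80 states; the source does not say whether batches overlap, and the class and method names are hypothetical.

```python
from collections import deque

HORIZON = 80     # time units per state batch (from the text)
STATE_DIM = 42   # size of one agent state (from the text)


class StateBatch:
    """Keeps the most recent 80 state vectors of a single agent."""

    def __init__(self):
        self._buffer = deque(maxlen=HORIZON)

    def push(self, state):
        if len(state) != STATE_DIM:
            raise ValueError("state must have 42 entries")
        self._buffer.append(list(state))

    def ready(self):
        """True once 80 time units of history are available."""
        return len(self._buffer) == HORIZON

    def as_rows(self):
        """Return the batch as 80 rows of 42 values, oldest first."""
        return [row[:] for row in self._buffer]


batch = StateBatch()
for t in range(100):
    batch.push([float(t)] * STATE_DIM)

rows = batch.as_rows()
print(batch.ready(), len(rows), len(rows[0]))
```

With a sliding window the model can emit a force at every time unit once the buffer is full; a non-overlapping scheme would instead clear the buffer after each batch.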

Next, features are extracted with a different type of artificial neural network for each component of the state in the state batch. For the relative distance vector from the agent to the destination, a multilayer perceptron produces a 128-dimensional feature vector. For the 20-dimensional depth map and the 20-dimensional velocity map obtained from the lidar cast in front of the agent, a one-dimensional convolutional neural network produces a 256-dimensional feature vector for each. Concatenating these feature vectors gives a 640-dimensional feature vector (128 + 256 + 256). Feeding it to an LSTM cell yields a feature vector that reflects the trajectory over 80 time units. Passing that through a multilayer perceptron finally yields a 2-dimensional vector corresponding to the force applied to the agent at the current time unit and state.
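The shapes in this data flow can be traced with a small NumPy forward pass. It is a sketch under assumptions, not the patented network: the weights are random and untrained, each multilayer perceptron is reduced to a single layer, the convolution uses 16 filters of width 5 (chosen only because 16 positions x 16 channels gives the stated 256 values), and the LSTM width `HIDDEN` is arbitrary. Only the sizes 42, 128, 256, 640, 80, and 2 come from the text.

```python
import numpy as np

T, HIDDEN = 80, 128            # 80 time units (text); LSTM width is an assumption
rng = np.random.default_rng(0)


def dense(n_in, n_out):
    """Random weights and zero bias for one linear layer (untrained)."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv1d_features(x, kernel, bias):
    """Valid 1D convolution over the 20 rays, flattened to 256 values per step."""
    k = kernel.shape[0]
    windows = np.stack([x[:, i:i + k] for i in range(x.shape[1] - k + 1)], axis=1)
    return relu(windows @ kernel + bias).reshape(x.shape[0], -1)


def lstm_last_hidden(sequence, w, b):
    """Run one LSTM cell over the sequence and return the final hidden state."""
    h, c = np.zeros(HIDDEN), np.zeros(HIDDEN)
    for x_t in sequence:
        i, f, g, o = np.split(np.concatenate([x_t, h]) @ w + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


params = {
    "goal": dense(2, 128),                    # goal offset -> 128 features
    "depth": dense(5, 16),                    # width-5 kernel, 16 channels
    "velocity": dense(5, 16),
    "lstm": dense(640 + HIDDEN, 4 * HIDDEN),  # gates i, f, g, o stacked
    "head": dense(HIDDEN, 2),                 # final layer -> 2D force
}


def extract_features(batch):
    """Map an (80, 42) state batch to (80, 640) per-step features."""
    goal, depth, velocity = batch[:, :2], batch[:, 2:22], batch[:, 22:]
    goal_f = relu(goal @ params["goal"][0] + params["goal"][1])
    depth_f = conv1d_features(depth, *params["depth"])
    velocity_f = conv1d_features(velocity, *params["velocity"])
    return np.concatenate([goal_f, depth_f, velocity_f], axis=1)


def actor_forward(batch):
    """Return the 2D force for the current time unit."""
    h = lstm_last_hidden(extract_features(batch), *params["lstm"])
    return h @ params["head"][0] + params["head"][1]


state_batch = rng.normal(size=(T, 42))
print(extract_features(state_batch).shape, actor_forward(state_batch).shape)
```

Swapping the last layer for one with a single output would give the scalar head described for the critic model, with the rest of the pipeline unchanged.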

FIG. 2 is a diagram showing the data flow of the policy model in the path generation method according to the present invention.

The critic model generates, from the agent's state, the cumulative expected reward that the agent can obtain at the current time unit. The data flow of the proposed critic model is as follows. As with the actor model, the 42-dimensional state vector of each agent is accumulated over 80 time units to obtain a state batch of size 80x42. Features are then extracted with a different type of artificial neural network suited to each state component. As in the actor model, the intermediate result is a 640-dimensional feature vector; feeding it to an LSTM cell yields a feature vector that reflects the trajectory over 80 time units. Passing that through a multilayer perceptron finally produces a scalar corresponding to the cumulative expected reward the agent can obtain at the current time unit and state. A larger value means the current state is a desirable one with a high cumulative expected reward; a smaller value means the current state has a low cumulative expected reward and a high probability of collision.
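To make the critic's scalar output concrete, the sketch below computes the cumulative reward actually observed from each time unit onward, which is the kind of quantity a critic is usually trained to estimate. The discount factor `gamma` and the use of discounted returns are assumptions; the source does not describe the training target.

```python
def discounted_returns(rewards, gamma=0.99):
    """Cumulative (discounted) reward from each time unit to the end."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# An agent that reaches its goal on the third step ...
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))   # [0.25, 0.5, 1.0]
# ... and one that collides on the second step.
print(discounted_returns([0.0, -1.0], gamma=0.5))       # [-0.5, -1.0]
```

States that lead toward an arrival carry positive values and states that lead toward a collision carry negative ones, which matches the interpretation of large and small critic outputs given above.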

FIG. 3 is a flow diagram of a simulation that uses a simulator configured according to the present invention.

The flow includes a simulation progress step, which generates the agent state based on the environment information held by the simulator and controls the agent's motion through a given force vector, and a generation step, which predicts a recursive path feature based on the agent state and generates the force vector to be applied to the agent and the cumulative expected reward.

Here, the simulation progress step may include a pre-processing step of converting the environment information into a form usable as input to the models (the actor model and the critic model), a progress step of applying the generated force vector to each corresponding agent, and a check step of determining whether an agent has arrived at its destination and whether it has collided with another agent or an obstacle.
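The three sub-steps can be sketched as a toy loop. Everything numeric here is an assumption (`DT`, `AGENT_RADIUS`, `GOAL_TOLERANCE`, unit-mass point agents), obstacles are omitted, and `stub_policy` is a stand-in that simply pushes each agent toward its goal; it replaces the actor model only so that the loop runs.

```python
import math

DT = 0.1              # hypothetical integration step
AGENT_RADIUS = 0.3    # hypothetical agent size
GOAL_TOLERANCE = 0.2  # hypothetical arrival distance


def preprocess(agent):
    """Pre-processing step (reduced here to the goal offset only)."""
    return (agent["goal"][0] - agent["pos"][0], agent["goal"][1] - agent["pos"][1])


def stub_policy(state):
    """Placeholder for the actor model: unit force toward the goal."""
    norm = math.hypot(*state) or 1.0
    return (state[0] / norm, state[1] / norm)


def apply_force(agent, force):
    """Progress step: move a unit-mass point agent under the given force."""
    vx = agent["vel"][0] + force[0] * DT
    vy = agent["vel"][1] + force[1] * DT
    agent["vel"] = (vx, vy)
    agent["pos"] = (agent["pos"][0] + vx * DT, agent["pos"][1] + vy * DT)


def check(agents):
    """Check step: arrival and agent-agent collision flags for every agent."""
    arrived = [math.dist(a["pos"], a["goal"]) < GOAL_TOLERANCE for a in agents]
    collided = [
        any(i != j and math.dist(a["pos"], b["pos"]) < 2 * AGENT_RADIUS
            for j, b in enumerate(agents))
        for i, a in enumerate(agents)
    ]
    return arrived, collided


def run(agents, max_steps=500):
    """Repeat pre-process -> force -> progress -> check until done."""
    arrived, collided = check(agents)
    for step in range(1, max_steps + 1):
        for agent in agents:
            apply_force(agent, stub_policy(preprocess(agent)))
        arrived, collided = check(agents)
        if any(collided) or all(arrived):
            return step, arrived, collided
    return max_steps, arrived, collided


def make_agent(pos, goal):
    return {"pos": pos, "goal": goal, "vel": (0.0, 0.0)}


# Two agents on parallel lanes: both should arrive without colliding.
print(run([make_agent((0.0, 0.0), (2.0, 0.0)), make_agent((0.0, 5.0), (2.0, 5.0))]))
# Two agents swapping places head-on: the stub policy walks them into each other.
print(run([make_agent((0.0, 0.0), (2.0, 0.0)), make_agent((2.0, 0.0), (0.0, 0.0))]))
```

The head-on case shows why a goal-seeking rule alone is not enough: avoiding the collision requires a policy that also uses the lidar readings and the recent trajectory.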

The actor model and the critic model consist of a deep recursive feature generation step, which generates a recursive path feature based on the generated agent state; a policy generation step, which generates the agent's force at the current time unit based on the deep recursive feature; and a state evaluation step, which generates the agent's expected cumulative reward at the current time unit based on the deep recursive feature.

Existing crowd simulation methods have not shown uniform performance across scenarios, and operating in a specific scenario required tuning various hyperparameters.

In contrast, the method proposed in the present invention, which generates natural paths using RNN-based multi-agent deep reinforcement learning, leverages the strengths of deep learning to show uniform performance even with an arbitrary number of agents and obstacles. In addition, the RNN-based model lets each agent effectively take into account its own path and the paths of neighboring agents, which made robust simulation possible even with longer time units and can thereby greatly reduce computational complexity.

A data processing device that performs the collision-free path generation method using RNN-based multi-agent deep reinforcement learning according to the present invention includes a processor implemented to execute instructions readable by the data processing device, and the processor can perform each step of the path generation method described above.

The present invention can be implemented on a computer system. A system of one or more computers can be configured to perform each step of the method described above by having software, firmware, hardware, or a combination of them installed on the system that, in operation, causes the system to perform the operations. One or more computer programs can be configured to perform particular operations or tasks by including instructions that, when executed by a data processing device, cause the device to perform the operations.

The path generation method described above may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The features, structures, effects, and the like described in the embodiments above are included in at least one embodiment of the present invention and are not necessarily limited to a single embodiment. Furthermore, the features, structures, and effects illustrated in each embodiment can be combined or modified for other embodiments by a person of ordinary skill in the field to which the embodiments belong. Content relating to such combinations and modifications should therefore be construed as falling within the scope of the present invention.

RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory model

Claims

A path generation method comprising:
a simulation progress step of generating an agent state based on environment information obtained from a simulator and controlling the agent's motion through a predetermined force vector; and
a generation step of predicting a recursive path feature based on the agent state and generating a force vector to be applied to the agent and a cumulative expected reward.

The path generation method of claim 1, wherein the simulation progress step comprises:
a pre-processing step of converting the information about the environment into a form usable as input to an actor model and a critic model;
a progress step of applying the force vector generated in the generation step to each corresponding agent; and
a check step of determining whether an agent has arrived at its destination and whether the agent has collided with another agent or an obstacle.

The path generation method of claim 2, wherein the actor model and the critic model comprise:
a deep recursive feature generation step of generating a recursive path feature based on the generated agent state;
a policy generation step of generating the agent's force at the current time unit based on the deep recursive feature; and
a state evaluation step of generating the agent's expected cumulative reward at the current time unit based on the deep recursive feature.