KR102331803B1 - Vision and language navigation system

Info

Publication number: KR102331803B1
Application number: KR1020190140411A
Authority: KR
Inventors: 김인철; 황지수
Original assignee: 경기대학교 산학협력단
Priority date: 2019-11-05
Filing date: 2019-11-05
Publication date: 2021-11-30
Also published as: KR20210054355A

Abstract

A vision- and language-based deep neural network spatial navigation system is disclosed. The system includes an input processing unit that detects, in an external observation image of an agent, a landmark that is at least one of an object and a place related to an initially input natural language instruction, and an action planning unit that plans the agent's autonomous movement action in consideration of the detected landmark.

Description

Visual and language based spatial navigation system {Vision and language navigation system}

The present invention relates to spatial navigation technology, and more particularly to vision- and language-based spatial navigation technology using a deep neural network.

Building a machine or robot that can carry out tasks according to a person's natural language instruction has been a long-standing dream of artificial intelligence (AI) and robotics. With the recent rapid progress of computer vision and natural language processing, research is active on a variety of complex intelligence problems that combine these technologies, such as Visual Question Answering, Visual Dialog, Embodied Question Answering, which additionally requires the agent to plan and execute actions, and Interactive Question Answering.

In particular, Vision-and-Language Navigation (VLN) is one of the complex intelligence problems that brings this dream a step closer. In VLN, an agent must navigate to a destination on its own by understanding real-time input images and a natural language instruction in a three-dimensional environment. Conventional techniques for vision- and language-based spatial navigation use LSTM (Long Short-Term Memory), which is commonly used in Seq2Seq (Sequence to Sequence) models, as the basic structure. In addition, studies have proposed applying soft-attention to each feature in order to combine the input image with the instruction information.

Korean Laid-Open Patent Publication No. 10-2018-0134683 (published December 19, 2018)

An object of the present invention is to provide a technical solution for minimizing the error rate when navigating a path according to a natural language instruction.

A vision- and language-based deep neural network spatial navigation system according to one aspect may include an input processing unit that detects, in an external observation image of an agent, a landmark that is at least one of an object and a place related to an initially input natural language instruction, and an action planning unit that plans the agent's autonomous movement action in consideration of the detected landmark.

The input processing unit may include a feature extraction unit that includes an instruction extraction unit that encodes the initially input natural language instruction to extract an instruction feature, an image extraction unit that extracts an image feature from a panoramic image serving as the input image, an object detection unit that detects, based on the image feature, an object related to the natural language instruction in the panoramic image and extracts the corresponding object feature, and a place recognition unit that recognizes, based on the image feature, a place related to the natural language instruction in the panoramic image and extracts the corresponding place feature.

The object detection unit may detect objects based on a YOLO (You Only Look Once) v3 neural network.

The place recognition unit may recognize places based on a convolutional neural network (CNN) classification model that uses the Matterport 3D simulator and Places365 data.

The object detection unit and the place recognition unit may generate the object feature and the place feature, respectively, from the detection probability of the object or place and the direction in which the object or place is detected.

The input processing unit may further include a feature definition unit that generates a multi-modal feature from the features extracted by the feature extraction unit.

The feature definition unit may include an attention unit that generates an attended feature for each feature extracted by the feature extraction unit through a soft-attention technique, applying weights determined from the agent's immediately preceding context information, and a multi-modal feature generation unit that generates the multi-modal feature from the attended features.

The multi-modal feature generation unit may generate the multi-modal feature by including, in addition to the attended features, an action feature of the action the agent performed at the immediately preceding time.

The action planning unit may include a context extraction unit that extracts, based on the multi-modal feature, a context feature representing the current situation, and an action decision unit that decides the agent's autonomous movement action based on some of the features obtained through the input processing unit and the context feature extracted by the context extraction unit.

The context extraction unit may extract the context feature using LSTM (Long Short-Term Memory), a type of recurrent neural network.

The action decision unit may decide the agent's autonomous movement action based on the image feature, the attended instruction feature, and the context feature.

Meanwhile, a vision- and language-based deep neural network spatial navigation method according to one aspect may include an input processing step of detecting, in an external observation image of an agent, landmark information that is at least one of an object and a place related to an initially input natural language instruction, and an action planning step of planning the agent's autonomous movement action in consideration of the detected landmark information.

By detecting the objects and places that serve as landmarks during spatial navigation, extracting the related feature vectors, and applying selective attention to the extracted information within the system, the present invention can provide a navigation path for an instruction with a low error rate.

FIG. 1 is a block diagram of a vision- and language-based deep neural network spatial navigation system according to an embodiment.
FIG. 2 is a diagram illustrating a vision-and-language navigation (VLN) environment.
FIG. 3 is a structural diagram of a landmark-based VLN (LVLN) according to an embodiment.
FIG. 4 is an exemplary diagram of an object feature matrix.
FIG. 5 is a reference diagram for describing a place recognition network.

The foregoing and additional aspects of the present invention will become more apparent through the preferred embodiments described with reference to the accompanying drawings. Hereinafter, the present invention is described in detail through these embodiments so that those skilled in the art can readily understand and reproduce it.

FIG. 1 is a block diagram of a vision- and language-based deep neural network spatial navigation system according to an embodiment. The system is mounted on an agent such as a robot and enables the agent to perform autonomous movement actions based on vision and language. The system includes an input processing unit 100 and an action planning unit 400, and may further include a progress checking unit 500. The input processing unit 100 detects, in the agent's input image, a landmark that is at least one of an object and a place related to the initially input natural language instruction. Here, the input image is an image captured by the agent, that is, an observation image. The action planning unit 400 plans the agent's autonomous movement action in consideration of the landmark detected by the input processing unit 100. In other words, the system detects in the input image the main places and objects mentioned in the natural language instruction and uses this information to plan the autonomous movement action. For example, if the natural language instruction is “Walk forward to the yellow sofa thing. Walk around the yellow sofa thing and enter the door on the left. Stand at the top of the stairs”, the landmarks named by specific words such as yellow sofa, door, and stairs are detected in the input image, and the autonomous movement action is planned with them in mind.

The input processing unit 100 may include a feature extraction unit 200 and a feature definition unit 300. The feature extraction unit 200 extracts (generates) from the inputs the features needed to decide the agent's autonomous movement action; when information is input, it extracts perceptual features with deep learning algorithms. In an embodiment, the feature extraction unit 200 may include an instruction extraction unit 210, an image extraction unit 220, an object detection unit 230, and a place recognition unit 240. The instruction extraction unit 210 encodes the initially input natural language instruction to extract an instruction feature. In an embodiment, the instruction extraction unit 210 extracts the instruction feature from the natural-language instruction using LSTM (Long Short-Term Memory), a recurrent neural network. The image extraction unit 220 extracts an image feature from the image that is input in real time (at every time step). Here, the input image is the surrounding image (external image) observed by the agent and may be a 360° panoramic image. In an embodiment, the image extraction unit 220 extracts a visual feature from the panoramic RGB input image using a convolutional neural network (CNN). Here, the convolutional neural network model may be ResNet. The extracted visual feature compactly represents attributes of the input image such as objects, places, and background, and can serve as an essential feature for grasping the robot's current situation.

The object detection unit 230 detects objects related to the natural language instruction in the input image and extracts (generates) the corresponding object feature; it receives as input the image feature extracted by the image extraction unit 220. The object detection unit 230 takes the output of the image extraction unit 220, detects objects in the image that can serve as specific landmark elements, and encodes them as a feature vector. The place recognition unit 240 recognizes places related to the natural language instruction in the input image and extracts (generates) the corresponding place feature; like the object detection unit 230, it receives the image feature extracted by the image extraction unit 220, recognizes specific places detected in the image, and encodes them as a feature vector.

In an embodiment, the object detection unit 230 detects objects based on the YOLO (You Only Look Once) v3 neural network, known as a representative real-time object detection deep learning model. In an embodiment, the place recognition unit 240 recognizes places based on a convolutional neural network (CNN) classification model that uses the Matterport 3D simulator and Places365 data. The object detection unit 230 may extract (generate) the object feature from the detection probability and detection direction of each object obtained through the deep learning model, and the place recognition unit 240 may generate the place feature from the detection probability and detection direction of each place obtained through the deep learning model.

The feature definition unit 300 combines the instruction, image, object, and place features extracted by the feature extraction unit 200 to generate a composite multi-modal feature. In an embodiment, the feature definition unit 300 includes an attention unit 310 and a multi-modal feature generation unit 320. The attention unit 310 generates an attended feature for each feature extracted by the feature extraction unit 200 through a soft-attention technique. That is, within the attention unit 310, the instruction attention unit 311 receives the instruction feature and generates an attended feature, the visual attention unit 312 receives the visual feature and generates an attended feature, the object attention unit 313 receives the object feature and generates an attended feature, and the place attention unit 314 receives the place feature and generates an attended feature. Each of the attention units 311, 312, 313, and 314 may generate its attended feature by applying weights determined from the agent's immediately preceding context information. The multi-modal feature generation unit 320 generates the multi-modal feature needed to plan the agent's action from the attended instruction, image, object, and place features, and may additionally reflect the action feature of the action the agent performed at the immediately preceding time.

The action planning unit 400 may include a context extraction unit 410 and an action decision unit 420. The context extraction unit 410 extracts a feature of the agent's current situation (context feature) based on the multi-modal feature. In an embodiment, the context extraction unit 410 extracts the context feature using LSTM, a type of recurrent neural network. The action decision unit 420 decides the agent's autonomous movement action based on some of the features obtained through the input processing unit 100 and the context feature extracted by the context extraction unit 410. Specifically, the action decision unit 420 may decide the agent's action based on the image feature extracted by the image extraction unit 220, the attended instruction feature from the instruction attention unit 311, and the context feature extracted by the context extraction unit 410. Meanwhile, the progress checking unit 500 determines whether the agent is actually approaching the target point while it continues to move autonomously in the three-dimensional indoor environment, relying on the natural language instruction and the input images.

The vision- and language-based deep neural network spatial navigation method of the system is described in more detail below. Instead of telling the agent the target point directly, the system provides a natural language instruction, which is a high-level movement plan, and has the agent move autonomously according to this instruction and the real-time input images. The vision-and-language navigation task may be performed in Matterport3D, a three-dimensional indoor simulation environment that provides photo-realistic images, as shown in FIG. 2. A topological map in graph form is laid over the indoor space in which the agent operates, as shown in the lower part of FIG. 2. The map consists of nodes, which represent specific points in the indoor space, and edges, each of which connects two adjacent nodes between which direct movement is possible. For example, in FIG. 2 the yellow dots and lines form the topological map showing the spatial paths along which the agent can move. The red star placed on it indicates the starting point, and the blue line indicates the optimal path to the target point.

However, the agent is not given this graph-form topological map directly. Instead, as shown in the upper part of FIG. 2, it is given a 360° panoramic image that captures its surroundings at the current position. This panoramic image can be divided into a total of 36 partial images, split evenly in the horizontal and vertical directions, as shown in the upper part of FIG. 2. At every moment the agent estimates the layout of the indoor environment and its own current position from the panoramic input image, selects one of the 36 partial images that make up the panorama, and performs the action of heading in that direction. The result of the selected action is that the agent's position changes to the nearest node on the topological map lying in that direction. The vision-and-language navigation (VLN) problem can therefore be defined as follows.

① Instruction:

The natural language instruction is a sequence of words.

② State:

Each state s_t of the vision-and-language navigation problem is expressed by the agent's real-time location information, which consists of the three-dimensional position of the point where the agent is located, the horizontal direction the agent is facing (heading), and the vertical direction (elevation). The initial state is given by the agent's starting position.

③ Observation:

The input given to the agent at every moment is a 360° panoramic image acquired at the current position of state s_t. The panoramic image consists of 36 partial images in total: 360° is divided horizontally into 12 regions of 30° each, and each of these is further divided into 3 vertical (elevation) regions.

④ Action:

At every moment the agent selects one of the 36 partial images that make up the input panoramic image and moves toward the corresponding direction region. Accordingly, there is one action per partial image, each representing a movement toward the direction region corresponding to that partial image.

⑤ State Transition:

An action executed by the agent in a state causes a transition to a new state.

⑥ Episode:

An episode is the sequence of actions the agent performs, starting from the initial state. Executing each action of the episode changes the agent to the next state and produces a new observation as input.

⑦ Evaluation:

When the episode is complete, the distance between the arrival point and the target point is calculated. If the distance between the two points is within a predetermined distance (for example, within 3 meters), the task is judged to be successful.
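The observation, action, and evaluation definitions above can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the row-major ordering of the 36 partial images, the elevation centers of -30°, 0°, and +30°, and the function names `view_direction` and `is_success` are hypothetical and are not taken from the patent, which specifies only 12 headings of 30° each, 3 elevations, and a success distance such as 3 meters.

```python
import math

NUM_HEADINGS = 12        # 360 degrees split into 30-degree horizontal regions
NUM_ELEVATIONS = 3       # three vertical regions per heading
HEADING_STEP_DEG = 30.0
# Assumed elevation centers (down, level, up); the text only says there are three.
ELEVATION_CENTERS_DEG = (-30.0, 0.0, 30.0)


def view_direction(index):
    """Map a partial-image (action) index in [0, 36) to (heading_deg, elevation_deg)."""
    if not 0 <= index < NUM_HEADINGS * NUM_ELEVATIONS:
        raise ValueError("view index must be in [0, 36)")
    # Assumed row-major layout: one row of 12 headings per elevation.
    elevation_row, heading_col = divmod(index, NUM_HEADINGS)
    return heading_col * HEADING_STEP_DEG, ELEVATION_CENTERS_DEG[elevation_row]


def is_success(arrival, goal, threshold_m=3.0):
    """Return True if the arrival point lies within threshold_m of the goal."""
    return math.dist(arrival, goal) <= threshold_m
```

For instance, `view_direction(13)` returns `(30.0, 0.0)` under this layout, and `is_success((0, 0, 0), (1, 2, 2))` is true because the two points are exactly 3 meters apart.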

FIG. 3 shows the overall structure of LVLN, a deep neural network model according to an embodiment. After the natural language instruction is input at the start, the LVLN model must repeat, at every time step, the process of deciding the action to execute now from the current input image and the immediately preceding action. The LVLN model treats this process as a problem of generating an output sequence from an input sequence and adopts LSTM (Long Short-Term Memory), a type of recurrent neural network, as its central module. Around the LSTM, the LVLN model is broadly divided into an encoder part, which obtains the multi-modal feature vector needed for action decision from inputs such as the natural language instruction, the input image, and the immediately preceding action, and a decoder part, which decides the action to execute now based on the multi-modal feature vector. The encoder part in turn consists of feature extraction modules, which extract features from each input, and attention modules, which apply attention to the extracted features.

The instruction encoder encodes the natural language instruction through a recurrent neural network (LSTM) and adds positional encoding (PE) of the words, so that it can track how much of the instruction has been carried out so far, to produce the instruction feature. The visual feature extractor uses ResNet-152, a representative convolutional neural network (CNN), to extract the visual feature from the input panoramic image. The visual feature of the panoramic image is formed by concatenating the visual features extracted from each partial image, as in Equation 1.

This visual feature is also provided as input to the object detection network (ODN) and the place recognition network (PRN), which extract the object feature and the place feature representing the specific objects and places contained in the input panoramic image. Finally, the action performed at the immediately preceding time (t-1) is also encoded to generate the action feature.

To decide the correct action at every time step, the LVLN model must determine which part of the natural language instruction and which region of the input image to focus on. Therefore, as shown in FIG. 3, the attention stage uses the immediately preceding hidden state (h_{t-1}) of the recurrent neural network LSTM, which represents the task context so far, to generate attended features by applying the soft-attention technique to each feature. In particular, the LVLN model also applies the attention mechanism to the object feature and the place feature in order to attend to the objects and places mentioned in the natural language instruction. The attention computation for the instruction feature is given by Equation 2.

In Equation 2, the positional encoding represents the position of the words in the natural language instruction that have been carried out so far, and a correlation value is computed between the word at each position in the instruction and the immediately preceding context (hidden) feature h_{t-1}. The result is the attention weight to be applied to the instruction feature. The attention computation for the visual feature follows Equation 3, which is similar to the case of the instruction feature.

The function appearing in Equation 3 is implemented as a single multi-layer perceptron (MLP), and Equation 3 also defines the attention weight to be applied to the visual feature. The attention process for the object and place features is similar to that for the visual feature: as in Equation 4, soft-attention is performed using the immediately preceding hidden state h_{t-1} and the current landmark feature.
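The soft-attention step conditioned on the previous hidden state can be sketched as follows. This is a minimal illustration, not the patent's Equations 2 to 4: it assumes relevance is a plain dot product between each feature vector and h_{t-1}, whereas the patent uses learned projections (including an MLP for the visual case) whose exact form is not reproduced here. The names `softmax` and `soft_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, prev_hidden):
    """Attend over `features` (equal-length vectors) using `prev_hidden`.

    Assumption: the score of each feature is its dot product with the
    previous hidden state; learned projections are omitted for brevity.
    Returns the attention-weighted sum and the attention weights.
    """
    scores = [sum(f_k * h_k for f_k, h_k in zip(f, prev_hidden)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    attended = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return attended, weights


attended, weights = soft_attention([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
```

In this toy call the first feature aligns with the hidden state, so it receives the larger weight and dominates the attended vector. The same routine could be applied separately to instruction, visual, object, and place features, which is how the text describes the four attention modules.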

The attended instruction feature, visual feature, object feature, and place feature are then combined with the action feature, which represents the action performed immediately before, into a single multi-modal feature vector. This vector is given as input to the LSTM recurrent neural network, which produces a new hidden state as in Equation 5.
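
As a rough sketch of this fusion step, the code below concatenates the five features and runs one step of a standard LSTM cell. The concatenation order, the parameter layout in `params`, and the all-zero toy weights are assumptions chosen only to keep the example small; they are not taken from Equation 5.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(instr, visual, obj, place, prev_action):
    """Concatenate the attended features and the previous action feature."""
    return instr + visual + obj + place + prev_action

def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell.

    params maps a gate name to (W, b); W has one row per hidden unit and
    len(x) + len(h_prev) columns.
    """
    z = x + h_prev

    def gate(name, act):
        W, b = params[name]
        return [act(sum(w * v for w, v in zip(row, z)) + bi)
                for row, bi in zip(W, b)]

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("cell", math.tanh)
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

# Toy example: 1-dimensional features, hidden size 2, all-zero weights.
x = fuse([0.2], [0.4], [0.1], [0.3], [1.0])
zeros = ([[0.0] * 7 for _ in range(2)], [0.0, 0.0])
params = {"input": zeros, "forget": zeros, "output": zeros, "cell": zeros}
h, c = lstm_step(x, [0.0, 0.0], [1.0, -1.0], params)
```

With zero weights every sigmoid gate evaluates to 0.5, so the new cell state is simply half the previous one; trained weights would of course behave differently.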

The cell state of the LSTM and the previous hidden state, together with the attended visual feature and the attention weights of the instruction feature, can be used as inputs to the progress monitor. The action score produced in this process can be used in computing the loss that optimizes the model.

Finally, the action decoder determines the action to perform at the current step from the instruction feature, which indicates the part of the natural language instruction that currently deserves attention, the visual feature of the input image, and the hidden state, which carries the context information. The action decoder consists of a linear layer and a softmax layer, and the action is determined as in Equation 6, where the score term holds an evaluation value for each action that can currently be performed.

Accordingly, the action decoder selects the action with the highest evaluation value among the currently performable actions.
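
A linear-plus-softmax decoder of this kind can be sketched as follows. The way the three inputs are concatenated, the weight matrix `W`, the bias `b`, and the fixed number of candidate actions are assumptions made for illustration; Equation 6 may combine the inputs differently.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decode_action(instr, visual, hidden, W, b):
    """Linear layer followed by softmax; returns (best action index, probabilities)."""
    x = instr + visual + hidden
    logits = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
    probs = softmax(logits)
    # Greedy choice: the action with the highest evaluation value.
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy example with three candidate actions and an identity weight matrix.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [0.0, 0.0, 0.0]
best, probs = decode_action([1.0], [0.0], [0.5], W, b)
```

Here the logits are 1.0, 0.0, and 0.5, so the first action is selected.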

Meanwhile, the object detection network (ODN) of the LVLN model may use YOLO (You Only Look Once), a representative single-stage detector, for fast real-time processing; specifically, YOLO v3 may be used. Each object feature vector is designed from the direction in which the object was detected and the detection probability of the object, as expressed in Equation 7, where a semicolon denotes concatenation of the individual features.

As in Equation 7, one panoramic image is divided into four partial images of 90° each, and the type, detection heading, and detection probability of each object detected in those partial images are used to build the individual object features that make up the object feature matrix. The heading is one of 12 directional regions obtained by dividing the 360° panoramic image horizontally into 30° steps. The object feature, expressed as a matrix of size (M×N), therefore contains probability values indicating the detection confidence of each object. Taking FIG. 4 as an example, a sink with object index 5 is detected in directional region 2 of the first partial image, so the entry at (5, 2) of the object feature matrix is 0.88, the probability value indicating the detection confidence of that object. This probability value was obtained using YOLO v3.
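
The construction of such a matrix can be illustrated with the short sketch below. Several details are assumptions made here and not stated in the text: rows index object classes and columns index the 12 heading regions, both indices are 1-based as in the FIG. 4 example, region 1 starts at a heading of 0°, and the highest probability is kept when the same cell is hit twice. The class count of 10 is arbitrary.

```python
def heading_region(heading_deg, region_width=30.0):
    """Map a heading in degrees to a 1-based directional region index."""
    return int((heading_deg % 360.0) // region_width) + 1

def build_feature_matrix(detections, num_classes, num_regions=12):
    """Build a (num_classes x num_regions) matrix of detection probabilities.

    detections: iterable of (class_index, region_index, probability),
                with 1-based class and region indices.
    """
    matrix = [[0.0] * num_regions for _ in range(num_classes)]
    for cls, region, prob in detections:
        row, col = cls - 1, region - 1
        # Keep the most confident detection for each (class, region) cell.
        matrix[row][col] = max(matrix[row][col], prob)
    return matrix

# The FIG. 4 example: a sink (index 5) detected in region 2 with probability 0.88.
features = build_feature_matrix([(5, 2, 0.88)], num_classes=10)
region = heading_region(45.0)
```

Under the 1-based convention assumed here, the sink's confidence ends up at `features[4][1]` in zero-based Python indexing.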

The place recognition network (PRN) may be a convolutional neural network (CNN) with the structure shown in FIG. 5. It is a scene recognition network that was pre-trained on a place dataset and then retrained on new per-place image data collected in the Matterport3D environment. As with the object detection network, the input panoramic image is divided horizontally into four partial images of 90° each, and places are detected in each of them. From the index number of each detected place, its detection heading, and its detection probability, the place feature matrix is generated as in Equation 8.

The object features and place features generated in this way also pass through the context-based attention step, in order to strengthen their association with the objects or places mentioned in the natural language instruction.

Meanwhile, the progress monitor in FIG. 3 takes as input the previous hidden state of the LSTM recurrent neural network, which represents the context of the previous step, the current cell state, the attended visual feature, and the attention weights over the instruction words. From these inputs, it computes an internal hidden state and an evaluation score, as in Equation 9.
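
One plausible form of such a progress monitor is sketched below. The inputs match those listed above, but the exact combination is an assumption: a sigmoid gate over the previous hidden state and attended visual feature, an element-wise product with the tanh of the cell state, and a tanh-squashed linear readout. The parameters `W_h` and `w_pm` and all function names are hypothetical, and Equation 9 may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def progress_monitor(h_prev, c_t, v_att, alpha, W_h, w_pm):
    """Return (internal hidden state, progress score).

    h_prev: previous LSTM hidden state     c_t:   current LSTM cell state
    v_att:  attended visual feature        alpha: instruction attention weights
    """
    gate = [sigmoid(s) for s in linear(W_h, h_prev + v_att)]
    # Element-wise product of the gate with the squashed cell state.
    h_pm = [g * math.tanh(c) for g, c in zip(gate, c_t)]
    score = math.tanh(sum(w * v for w, v in zip(w_pm, alpha + h_pm)))
    return h_pm, score

# Toy example: hidden size 2, one visual dimension, two instruction words.
W_h = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_pm = [1.0, 1.0, 0.0, 0.0]
h_pm, score = progress_monitor([0.0, 0.0], [1.0, -1.0], [0.0], [0.5, 0.5], W_h, w_pm)
```

Feeding the instruction attention weights into the readout lets the score reflect how far along the instruction the agent's attention has moved.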

In this equation, the product symbol denotes the element-wise product of matrices, and σ denotes the sigmoid function. So that the value generated during progress monitoring can be used in the model training stage, the LVLN model defines a loss function as in Equation 10.

This loss function consists of two parts: a cross-entropy loss on the action decision and a mean squared error loss on the progress monitor. The mixing ratio of the two losses can be adjusted through the coefficient λ to suit the characteristics of the problem. In the cross-entropy loss, one term is the ground-truth action at each time step, and the other is the probability the agent assigns to the candidate action. In the mean squared error loss, one term is the normalized distance between the agent's current position and the goal, and the other is the progress evaluation score computed according to Equation 9.
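
The combined loss can be sketched as below. The weighting `lam * ce + (1 - lam) * mse`, the averaging over time steps, and the default `lam=0.5` are assumptions for illustration; the text only states that λ adjusts the mixing ratio of the two losses.

```python
import math

def combined_loss(action_probs, gt_actions, progress_targets, progress_preds, lam=0.5):
    """Weighted sum of an action cross-entropy loss and a progress MSE loss.

    action_probs:     per-step probability lists over candidate actions.
    gt_actions:       per-step index of the ground-truth action.
    progress_targets: per-step normalized distance-based targets.
    progress_preds:   per-step progress monitor scores.
    """
    steps = len(gt_actions)
    ce = -sum(math.log(p[a]) for p, a in zip(action_probs, gt_actions)) / steps
    mse = sum((y - q) ** 2 for y, q in zip(progress_targets, progress_preds)) / steps
    return lam * ce + (1.0 - lam) * mse

# Toy two-step episode.
probs = [[0.5, 0.5], [0.25, 0.75]]
loss = combined_loss(probs, [0, 1], [0.5, 1.0], [0.5, 0.5], lam=0.5)
ce_only = combined_loss(probs, [0, 1], [0.5, 1.0], [0.5, 0.5], lam=1.0)
```

Setting `lam` to 1.0 reduces the objective to pure action cross-entropy, which is a convenient way to check the contribution of the progress term.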

In summary, LVLN (Landmark-based VLN), a new deep neural network model for the vision-and-language navigation (VLN) problem, detects in the input image the main places and objects mentioned in the natural language instruction and uses this information in addition to the linguistic features of the instruction and the visual features of the whole input image. In particular, its context-based attention mechanism strengthens the association and agreement among each entity in the natural language instruction, each region of interest (ROI) in the image, and the individual objects and places detected in the image. The LVLN model also includes a progress monitor module that checks the agent's actual progress toward the goal, in order to improve the likelihood that the agent reaches it.

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that it can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: input processing unit 200: feature extraction unit
210: instruction extraction unit 220: image extraction unit
230: object detection unit 240: place recognition unit
300: feature definition unit 310: attention unit
311: instruction attention unit 312: visual attention unit
313: object attention unit 314: place attention unit
320: multi-modal feature generation unit 400: action planning unit
410: context extraction unit 420: action determination unit
500: progress monitoring unit

Claims

A visual and language-based spatial navigation system comprising:
an input processing unit that detects, from an externally observed image of an agent, a landmark that is at least one of an object and a place related to an initially input natural language instruction; and
an action planning unit that plans the autonomous movement behavior of the agent in consideration of the detected landmark,
wherein the input processing unit comprises:
a feature extraction unit including an instruction extraction unit that encodes the initially input natural language instruction to extract an instruction feature, an image extraction unit that extracts an image feature from a panoramic image that is the input image, an object detection unit that detects an object related to the natural language instruction in the panoramic image based on the image feature and generates a corresponding object feature, and a place recognition unit that recognizes a place related to the natural language instruction in the panoramic image based on the image feature and generates a corresponding place feature; and
a feature definition unit including an attention unit that generates an attended feature for each feature extracted by the feature extraction unit through a soft-attention technique, applying a weight determined by reflecting the agent's immediately preceding context information, and a multi-modal feature generation unit that generates a multi-modal feature from the attended features.

The system of claim 1, wherein the object detection unit detects objects based on a You Only Look Once (YOLO) neural network.

The system of claim 1, wherein the place recognition unit recognizes places based on a convolutional neural network (CNN) classification model that uses the Matterport 3D simulator and Places365 data.

The system of claim 1, wherein the object detection unit and the place recognition unit generate the object feature and the place feature, respectively, based on the detection probability and the detection direction of the object or place.

The system of claim 1, wherein the multi-modal feature generation unit generates the multi-modal feature by including, in addition to the attended features, an action feature of the action the agent performed at the immediately preceding time step.

The system of any one of claims 1 to 5, wherein the action planning unit comprises:
a context extraction unit that extracts, from the multi-modal feature, a context feature that is the current context information; and
an action determination unit that determines the autonomous movement behavior of the agent based on the image feature obtained through the input processing unit, the attended instruction feature, and the context feature extracted by the context extraction unit.

The system of claim 6, wherein the context extraction unit extracts the context feature using a Long Short-Term Memory (LSTM) network, a type of recurrent neural network.

The system of claim 6, wherein the action determination unit determines the autonomous movement behavior of the agent based on the image feature, the attended instruction feature, and the context feature.

A visual and language-based spatial navigation method comprising:
an input processing step of detecting, from an externally observed image of an agent, landmark information that is at least one of an object and a place related to an initially input natural language instruction; and
an action planning step of planning the autonomous movement behavior of the agent in consideration of the detected landmark information,
wherein the input processing step comprises:
a feature extraction step including encoding the initially input natural language instruction to extract an instruction feature, extracting an image feature from a panoramic image that is the input image, detecting an object related to the natural language instruction in the panoramic image based on the image feature and extracting a corresponding object feature, and recognizing a place related to the natural language instruction in the panoramic image based on the image feature and extracting a corresponding place feature; and
a feature definition step including generating an attended feature for each extracted feature through a soft-attention technique, applying a weight determined by reflecting the agent's immediately preceding context information, and generating a multi-modal feature from the attended features and an action feature of the action the agent performed at the immediately preceding time step.

The method of claim 9, wherein the action planning step comprises:
a context extraction step of extracting, from the multi-modal feature, a context feature that is the current context information, using a Long Short-Term Memory (LSTM) network, a type of recurrent neural network; and
an action determination step of determining the autonomous movement behavior of the agent based on the image feature, the attended instruction feature, and the context feature.

(deleted)