KR102244086B1

KR102244086B1 - System for visual commonsense reasoning using knowledge graph

Info

Publication number: KR102244086B1
Application number: KR1020190160915A
Authority: KR
Inventors: 김인철; 이재윤
Original assignee: 경기대학교 산학협력단
Priority date: 2019-12-05
Filing date: 2019-12-05
Publication date: 2021-04-23

Abstract

Disclosed is an image-based common sense reasoning system using a knowledge graph. The system comprises: a knowledge inference unit that generates knowledge features after searching for knowledge related to an image, a query for the image and a list of answers to the query; a visual context inference unit for generating visual context features by associating the image, the query and the list of answers; and an answer decision unit that determines an answer to a question in the answer list based on the knowledge features and the visual context features.

Description

System for visual commonsense reasoning using knowledge graph

The present invention relates to deep neural network models and, more particularly, to a visual commonsense reasoning (VCR) model.

As is well known, advances in machine learning, led by deep learning, have driven rapid progress in core artificial intelligence (AI) technologies such as computer vision and natural language processing. As a result, highly challenging tasks are again emerging that, like the classic Turing Test, probe how closely AI can approach human-like composite intelligence. Among them, Visual Question Answering (VQA) is a form of Visual Turing Test: an intelligence task that examines how naturally an AI can automatically generate answers to natural language questions about an image.

A representative limitation of current VQA is that most questions deal only with content explicitly contained in the input image or the question, and few require so-called commonsense reasoning. In response to this limitation, Visual Commonsense Reasoning (VCR) problems have recently been proposed. Given an image, a natural language question, and a list of responses as input, a VCR problem asks for the most appropriate answer to the question and the rationale for it. VCR differs considerably from VQA in that it requires separate commonsense reasoning, such as identifying hidden relationships between objects and presenting the grounds for an answer.

Korean Registered Patent Publication No. 10-2011788 (published October 21, 2019)

An object of the present invention is to provide a technical means for improving the performance of visual commonsense reasoning.

According to one aspect, a visual commonsense reasoning system using a knowledge graph may include: a knowledge inference unit that retrieves knowledge related to an image, a query about the image, and a list of responses to the query, and then generates a knowledge feature; a visual context inference unit that generates a visual context feature by associating the image, the query, and the response list with one another; and an answer determination unit that determines the answer to the query from the response list based on the knowledge feature and the visual context feature.

The knowledge inference unit may include: a knowledge search unit that uses words for the visual features of the image and words extracted from the query and the response list as search terms to retrieve related knowledge from a graph-form knowledge base composed of a set of triples; and a knowledge embedding unit that embeds the related knowledge to generate the knowledge feature.

The knowledge search unit may embed the retrieved knowledge and the query with BERT (Bidirectional Encoder Representations from Transformers) and then extract a subset of the knowledge as the related knowledge through a similarity comparison.

The knowledge embedding unit may generate the knowledge feature by performing knowledge graph embedding based on a graph convolutional neural network.

The visual context inference unit may include: a visual grounding unit that embeds the query and the response list with BERT (Bidirectional Encoder Representations from Transformers) and then performs visual grounding, which maps the pointing identifiers contained in the query and in the response list to the visual feature regions detected in the image; a visual contextualization unit that extracts context information among the visual feature regions detected in the image, the query grounded by the visual grounding unit, and the grounded response list; and a contextualization embedding unit that fuses the contextualized question, the contextualized objects, and the grounded response list into a single multi-modal context feature.

The visual grounding unit may use a BLSTM (Bidirectional Long Short-Term Memory) to map the pointing identifiers contained in the query to visual feature regions in the image.

The answer determination unit may fuse the knowledge feature with the multi-modal context feature, pass the result through two fully connected layers, and then select the response that receives the highest score from a softmax layer as the answer to the query.

Meanwhile, according to one aspect, a visual commonsense reasoning method using a knowledge graph may include: a knowledge inference step of retrieving knowledge related to an image, a query about the image, and a list of responses to the query, and then generating a knowledge feature; a visual context inference step of generating a visual context feature by associating the image, the query, and the response list with one another; and an answer determination step of determining the answer to the query from the response list based on the knowledge feature and the visual context feature.

The present invention extracts commonsense-related keywords from the image, the question, and the response list, retrieves commonsense information for them from an external knowledge database, and uses that information as an additional input to answer determination, which makes it possible to give correct answers to questions that require common sense.

FIG. 1 is a block diagram of a visual commonsense reasoning system using a knowledge graph according to an embodiment.
FIG. 2 shows an example of visual commonsense reasoning (VCR).
FIG. 3 is a structural diagram of a visual commonsense reasoning model using a knowledge graph according to an embodiment.
FIG. 4 illustrates the knowledge search module.
FIG. 5 illustrates the knowledge embedding module.
FIG. 6 illustrates the visual grounding module.
FIG. 7 illustrates the visual contextualization module.
FIG. 8 illustrates the context information embedding module.
FIG. 9 illustrates the answer determination module.

The foregoing and additional aspects of the present invention will become more apparent from the preferred embodiments described with reference to the accompanying drawings. The invention is described in detail below through these embodiments so that those skilled in the art can readily understand and reproduce it.

FIG. 1 is a block diagram of a visual commonsense reasoning system using a knowledge graph according to an embodiment. As shown in FIG. 1, the system includes a knowledge inference unit 100, a visual context inference unit 200, and an answer determination unit 300. The knowledge inference unit 100 retrieves, from an external knowledge base 400, knowledge (common sense) related to the input image, the natural language query (question) about the image, and the list of natural language responses to the query, and then generates a knowledge feature from the retrieval results. To do so, the knowledge inference unit 100 searches the knowledge base 400 using, as search terms, words for the visual features (visual concepts) detected in the image and words contained in the query and in the response list. For reference, object detection in the image may be performed with Mask R-CNN.

As shown in FIG. 1, the knowledge inference unit 100 may include a knowledge search unit 110 and a knowledge embedding unit 120. The knowledge search unit 110 uses the visual feature words of the image and the words extracted from the query and the response list as search terms to retrieve related knowledge from the graph-form knowledge base 400, which is composed of a set of triples. The query and the response list each contain not only natural language but also pointing identifiers (pointing tags), assigned in advance to the relevant words, that refer to visual feature regions (object regions) appearing in the image. The words designated by the pointing identifiers may be used as search terms.

The knowledge search unit 110 passes the retrieved related knowledge to the knowledge embedding unit 120. It need not pass all of the retrieved knowledge and may instead extract and pass only a subset. In one embodiment, the knowledge search unit 110 extracts knowledge features from the retrieved knowledge through BERT (Bidirectional Encoder Representations from Transformers) embedding, extracts a query feature from the query through BERT embedding as well, and selects only a subset of the knowledge as the related knowledge by comparing the similarity between the knowledge features and the query feature. The knowledge embedding unit 120 then embeds the related knowledge supplied by the knowledge search unit 110 to generate the knowledge feature. In one embodiment, the knowledge embedding unit 120 generates the knowledge feature by performing knowledge graph embedding based on a graph convolutional neural network (GCN).

The visual context inference unit 200 generates a visual context feature by associating the image, the query, and the response list with one another. As shown in FIG. 1, the visual context inference unit 200 may include a visual grounding unit 210, a visual contextualization unit 220, and a contextualization embedding unit 230. The visual grounding unit 210 embeds the query and the response list with BERT and then performs visual grounding, which maps (matches) the pointing identifiers contained in the query and in the response list to the object regions detected in the image. The visual contextualization unit 220 extracts context information among the visual features (objects) detected in the image, the query grounded by the visual grounding unit 210, and the grounded response list. In one embodiment, the visual contextualization unit 220 extracts the context information using Bilinear Attention Networks (BAN). The contextualization embedding unit 230 fuses the contextualized question, the contextualized objects, and the grounded response list into a single feature (the multi-modal context feature). In one embodiment, the contextualization embedding unit 230 produces the multi-modal context feature using a BLSTM (bidirectional long short-term memory).

The answer determination unit 300 determines the most appropriate answer to the query from the response list based on the knowledge feature obtained from the knowledge inference unit 100 and the visual context feature obtained from the visual context inference unit 200. In one embodiment, the answer determination unit 300 fuses the knowledge feature with the multi-modal context feature, passes the result through two fully connected layers, and then selects the response that receives the highest score from a softmax layer as the answer to the query.

The following describes in more detail a new deep neural network model, Visual Commonsense Reasoning with Knowledge Graph (KG_VCR). VCR problems are posed in three different formats, as follows.

① Q → A: given a question, select the correct answer.

② QA → R: given a question and its correct answer, select the correct rationale.

③ Q → AR: given a question, select both the correct answer and the correct rationale.

The input to a VCR problem is assumed to be given in the following form.

- Image

- Object detection results

- Query mixing natural language and pointing tags

- Responses mixing natural language and pointing tags

As shown in FIG. 2, a VCR problem provides, in addition to the input image, each object region detectable in that image together with its object type. The question, the answers, and the rationales likewise contain not only natural language but also pointing tags, for example [person3], that refer to object regions appearing in the image. The number of candidate responses presented for each question is N = 4, and exactly one of them is assumed to be correct. Model accuracy is evaluated differently depending on the problem format: the baseline (chance-level) accuracy is 1/N for the Q → A and QA → R problems, whereas it is 1/N² for the Q → AR problem.
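The baseline figures follow from uniform random guessing: one choice among N candidates for Q → A or QA → R, and two independent choices for Q → AR. The sketch below only illustrates that arithmetic; the function name and the task labels are illustrative and do not come from the patent.

```python
from fractions import Fraction


def chance_accuracy(task: str, n_choices: int = 4) -> Fraction:
    """Chance-level accuracy for a VCR problem format with n_choices candidates."""
    if task in ("Q->A", "QA->R"):
        # A single selection among N candidates.
        return Fraction(1, n_choices)
    if task == "Q->AR":
        # The answer and the rationale must both be right: (1/N) * (1/N).
        return Fraction(1, n_choices ** 2)
    raise ValueError(f"unknown task format: {task}")


for task in ("Q->A", "QA->R", "Q->AR"):
    print(task, chance_accuracy(task))
```

With N = 4 this gives 1/4 for the two single-step formats and 1/16 for the joint format.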

FIG. 3 is a structural diagram of the visual commonsense reasoning model using a knowledge graph (KG_VCR) according to an embodiment. KG_VCR, a deep neural network model for visual commonsense reasoning, includes a knowledge graph reasoning module, a visual context reasoning module, and an answer determination module. These correspond, respectively, to the knowledge inference unit 100, the visual context inference unit 200, and the answer determination unit 300 described above.

The knowledge graph reasoning module 100 retrieves commonsense knowledge associated with the input image, the question, and the response list from ConceptNet, which serves as the external knowledge base 400, and embeds the retrieved knowledge graph into a single knowledge vector by applying a graph convolutional neural network (GCN). The visual context reasoning module 200 associates the objects contained in the image, the natural language question, and the response list with one another and extracts the relationships and context information among them, thereby generating a multi-modal context vector. The answer determination module 300 combines the knowledge vector produced by the knowledge graph reasoning module 100 and the multi-modal context vector produced by the visual context reasoning module 200 in a mutually complementary way to determine the best answer from the presented response list.

Knowledge graph reasoning is described first. To extract common sense that helps with a VCR problem from the external knowledge base 400 and use it effectively in answer determination, both the retrieval of relevant common sense and the embedding of the extracted knowledge graph are critical. To retrieve relevant common sense from the ConceptNet knowledge base 400, the KG_VCR model uses not only the visual concepts recognized in the image but also keywords extracted from the natural language question and the response list. In addition, it uses GCN, a representative graph convolutional neural network, to embed the commonsense graph, which consists of a set of triples structured as <subject, relationship, object>, into a single vector in a way that takes the relationships inherent in the graph into account.

As shown in FIG. 3, the knowledge graph reasoning module 100 includes two sub-modules, a knowledge search module 110 and a knowledge embedding module 120, which correspond to the knowledge search unit 110 and the knowledge embedding unit 120 described above. The structure of the knowledge search module 110 is illustrated in FIG. 4. To extract common sense associated with the input image from the ConceptNet knowledge base 400, the knowledge search module 110 recognizes the objects, scenes, and activities contained in the image according to predefined categories and uses these visual concept words as search terms for the knowledge base 400. To also retrieve common sense associated with the natural language question and response list, it uses the keywords appearing in the question and the response list as search terms as well. It then retrieves knowledge consisting of about 100 triples per keyword from the knowledge base 400.
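The keyword lookup described above can be sketched against a small in-memory triple list. This is only an illustration under stated assumptions: `TOY_KB` is a made-up stand-in rather than real ConceptNet data, `retrieve_triples` is a hypothetical helper, and matching a keyword against the subject or object of a triple is an assumed retrieval rule that the text does not spell out.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship, object)

# Hypothetical stand-in for the knowledge base; not real ConceptNet content.
TOY_KB: List[Triple] = [
    ("dog", "IsA", "animal"),
    ("dog", "RelatedTo", "leash"),
    ("park", "RelatedTo", "bench"),
    ("person", "RelatedTo", "dog"),
]


def retrieve_triples(
    keywords: List[str], kb: List[Triple], per_keyword: int = 100
) -> Dict[str, List[Triple]]:
    """For each keyword, return up to per_keyword triples that mention it."""
    results: Dict[str, List[Triple]] = {}
    for kw in keywords:
        # Assumed rule: a triple is relevant if the keyword is its subject or object.
        matches = [t for t in kb if kw in (t[0], t[2])]
        results[kw] = matches[:per_keyword]
    return results


visual_concepts = ["dog", "park"]  # assumed output of an image recognizer
text_keywords = ["person"]         # assumed keywords from the question and responses
retrieved = retrieve_triples(visual_concepts + text_keywords, TOY_KB)
```

The `per_keyword` cap mirrors the "about 100 triples per keyword" figure in the text.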

The retrieved knowledge is BERT-embedded, its cosine similarity to the question q is computed, and only the top 50 items are kept. Each retrieved item takes the form of a <subject, relationship, object> triple. The subjects and objects are visual concepts contained in the image (objects, scenes, activities) or words appearing in the question and the response list, and each relationship is one of 31 relations that describe how they are related, such as RelatedTo, SimilarTo, LocationAt, and IsA.
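The top-50 selection step can be sketched as a cosine-similarity ranking. The vectors below are toy two-dimensional values standing in for BERT embeddings, and the function names are illustrative, so this shows only the ranking logic, not the embedding itself.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_by_similarity(
    question_vec: Sequence[float],
    knowledge_vecs: List[Sequence[float]],
    k: int = 50,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k items most similar to the question."""
    scored = [(i, cosine(question_vec, vec)) for i, vec in enumerate(knowledge_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for BERT embeddings of the question and three triples.
question = [1.0, 0.0]
knowledge = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
selected = top_k_by_similarity(question, knowledge, k=2)
```

Here the second and third items are kept, in that order, because they point most nearly in the direction of the question vector.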

The knowledge embedding module 120 generates, from the related common sense retrieved from the knowledge base 400, a knowledge vector that can help with answer determination. A multi-layer perceptron (MLP) or similar could be used for knowledge embedding, but GCN, a graph convolutional neural network, is preferable given that a knowledge graph gives a well-structured representation of the relationships between concepts and words. For reference, GCN is well known as a method that reflects the structural characteristics of a knowledge graph, in which concept nodes are connected by relation arcs, and can therefore embed the knowledge graph into a vector effectively.

As shown in FIG. 5, the concept nodes of the graph supplied as input to the GCN are the knowledge entities contained in the common sense retrieved from the knowledge base 400 and the visual concepts recognized in the image. Here, a knowledge entity means a subject or object of a commonsense fact in <subject, relationship, object> triple form. In addition to the knowledge entity or visual concept word, each concept node also incorporates the natural language question itself, to capture its relevance to the given question. Specifically, as in FIG. 5, each BERT-embedded knowledge entity or visual concept word is combined with the natural language question, and the result of a max pooling operation is stored in the concept node. Concept nodes that are bound together in a triple through at least one relationship are connected by an edge, which completes the initial input graph for the GCN. Thereafter, each time the graph passes through a GCN layer, information from adjacent nodes flows in along the edges and the information in each concept node is updated, as in Equation 1.
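Equation 1 itself is not reproduced in this text. One form consistent with the four quantities the description goes on to define (adjacency matrix, node features, layer weights, layer bias) is the common GCN propagation rule below. The symbol names A, H, W, b, the layer index l, and the nonlinearity σ are assumptions introduced here for illustration, not taken from the patent.

```latex
H^{(l+1)} = \sigma\left( A \, H^{(l)} \, W^{(l)} + b^{(l)} \right)
```

Under this reading, multiplying by A is what lets each node aggregate the features of its neighbors before the layer's linear transform is applied.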

Here, the symbols of Equation 1 denote, in order, the adjacency matrix of the graph, the feature vector values of the nodes at a given layer, the weight values of that layer, and the bias values of that layer. After the graph nodes have been updated through the GCN layers in this way, a max pooling operation is applied again to the node vectors to produce the final knowledge vector.
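The layer update followed by max pooling can be sketched in plain Python. This is a minimal illustration, assuming a ReLU activation, an unnormalized adjacency matrix with self-loops, and toy numbers; none of these specifics are stated in the text.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj: Matrix, feats: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """One update: ReLU(adj @ feats @ weight + bias). ReLU is an assumed activation."""
    aggregated = matmul(matmul(adj, feats), weight)
    return [[max(0.0, x + b) for x, b in zip(row, bias)] for row in aggregated]


def max_pool_nodes(feats: Matrix) -> List[float]:
    """Element-wise maximum over all node vectors, giving one graph-level vector."""
    return [max(column) for column in zip(*feats)]


# Three concept nodes; node 0 is linked to nodes 1 and 2 (self-loops included).
adj = [[1.0, 1.0, 1.0],
       [1.0, 1.0, 0.0],
       [1.0, 0.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 2.0], [3.0, -1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, to keep the arithmetic easy to follow
bias = [0.0, 0.0]

updated = gcn_layer(adj, feats, weight, bias)
knowledge_vector = max_pool_nodes(updated)
```

With these toy values the updated node features are [4, 1], [1, 2], and [4, 0], and the pooled knowledge vector is [4, 2].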

Visual context reasoning is described next. As shown in FIG. 3, the visual context reasoning module 200 includes three sub-modules, a visual grounding module, a visual contextualization module, and a context embedding module, which correspond to the visual grounding unit 210, the visual contextualization unit 220, and the contextualization embedding unit 230 described above. The first step in generating the multi-modal context vector from the heterogeneous input data is visual grounding, which maps each pointing tag appearing in the question and the response list, such as [person1] or [person5], to the appropriate object region in the input image. FIG. 6 shows an example of the visual grounding module 210 matching each pointing tag in a question to a person region in the image. As in FIG. 6, the visual grounding module 210 uses a BLSTM (bidirectional long short-term memory), a type of recurrent neural network (RNN), to map each pointing tag in the question sequence to a specific object region in the input image.
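The tag-to-region correspondence that grounding establishes can be shown with a plain lookup. This sketch deliberately leaves out the BLSTM and covers only the pairing of tags with detected regions; the box format, the region dictionary, and `ground_tokens` are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), an assumed box format
TAG = re.compile(r"\[([a-z]+)(\d+)\]")


def ground_tokens(
    text: str, regions: Dict[str, Box]
) -> List[Tuple[str, Optional[Box]]]:
    """Pair each whitespace-separated token with a region if it is a pointing tag."""
    grounded = []
    for token in text.split():
        match = TAG.fullmatch(token.strip("?.,!"))
        key = f"{match.group(1)}{match.group(2)}" if match else None
        grounded.append((token, regions.get(key) if key else None))
    return grounded


# Assumed detector output: one box per tagged object.
regions = {"person1": (10, 20, 110, 220), "person5": (300, 40, 380, 210)}
grounded = ground_tokens("Why is [person1] pointing at [person5] ?", regions)
```

Ordinary words are paired with `None`, while each pointing tag carries the box of the region it refers to, which is the information a downstream sequence model would consume alongside the token embedding.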

The visual contextualization module 220 associates each object region (visual object) in the image, the grounded question, and the grounded responses with one another to extract context information. That is, as shown in FIG. 7, it applies an attention mechanism, conditioned on one grounded response, to the question and to each object region of the image, thereby generating an attended question vector and an attended object vector. Equation 2 gives the computation of attention over the question conditioned on the response.
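Equation 2 is not reproduced in this text, so the sketch below uses plain dot-product attention as a stand-in to show the general idea of weighting question tokens by their relevance to a response. It is not the patent's formula and it is not a bilinear attention network; the vectors and names are toy assumptions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query_vec: List[float], items: List[List[float]]) -> List[float]:
    """Weighted sum of items, weighted by softmax of their dot products with query_vec."""
    scores = [sum(q * x for q, x in zip(query_vec, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    return [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]


# Toy embeddings: one response vector and two question-token vectors.
response_vec = [1.0, 0.0]
question_tokens = [[1.0, 0.0], [0.0, 1.0]]
attended_question = attend(response_vec, question_tokens)
```

The same `attend` call applied to object-region vectors instead of question tokens would give the analogous attended object vector.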

As shown in FIG. 8, the context information embedding module 230 sequentially combines the different pieces of context information generated earlier through a BLSTM, which is likewise a kind of recurrent neural network, to generate the final multi-modal context vector.

Finally, as shown in FIG. 9, the answer determination module 300 combines the knowledge vector produced by knowledge graph reasoning with the multi-modal context vector produced by visual context reasoning, passes the result through two fully connected layers and a softmax layer, and determines the most suitable answer from the response list.
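The decision step can be sketched as follows. Several details are assumptions made for illustration: fusion is done by concatenation, a ReLU sits between the two fully connected layers, each candidate response has its own context vector, and the softmax is taken over the candidates' scores. The weights are toy values rather than trained parameters.

```python
import math
from typing import List


def linear(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer; weights[j] holds the input weights of output unit j."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weights, bias)]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_response(knowledge_vec, context_vec, w1, b1, w2, b2) -> float:
    fused = knowledge_vec + context_vec                      # fusion by concatenation (assumed)
    hidden = [max(0.0, h) for h in linear(fused, w1, b1)]    # first FC layer + assumed ReLU
    return linear(hidden, w2, b2)[0]                         # second FC layer -> one score


def choose_answer(knowledge_vec, context_vecs, w1, b1, w2, b2) -> int:
    """Index of the candidate response with the highest softmax probability."""
    scores = [score_response(knowledge_vec, c, w1, b1, w2, b2) for c in context_vecs]
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy values: one knowledge vector and N = 4 per-response context vectors.
knowledge_vec = [1.0]
context_vecs = [[0.0], [2.0], [1.0], [0.5]]
w1, b1 = [[1.0, 1.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 1.0]], [0.0]
best = choose_answer(knowledge_vec, context_vecs, w1, b1, w2, b2)
```

With these values the four candidates score 1, 5, 3, and 2, so the second response (index 1) is selected.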

The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the invention pertains will understand that it may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a limiting sense. The scope of the invention is defined by the claims rather than by the foregoing description, and all differences within an equivalent scope should be construed as being included in the invention.

100: knowledge inference unit
110: knowledge search unit
120: knowledge embedding unit
200: visual context inference unit
210: visual grounding unit
220: visual contextualization unit
230: contextualization embedding unit
300: answer decision unit
400: knowledge base

Claims

An image-based common sense reasoning system using a knowledge graph, comprising:
a knowledge inference unit that retrieves knowledge related to an image, a query about the image, and a list of responses to the query, and then generates a knowledge feature;
a visual context inference unit that generates a visual context feature by associating the image with the query and the response list; and
an answer decision unit that determines an answer to the query from the response list based on the knowledge feature and the visual context feature,
wherein the knowledge inference unit includes:
a knowledge search unit that searches a graph-type knowledge base composed of a set of triples for related knowledge, using as search terms words for visual features of the image and words extracted from the query and the response list; and
a knowledge embedding unit that generates the knowledge feature by embedding the related knowledge,
wherein pointing identifiers indicating visual features in the image are assigned in advance to some words in the query and the response list, and
wherein the visual context inference unit includes:
a visual grounding unit that embeds the query and the response list with BERT (Bidirectional Encoder Representations from Transformers) and then performs visual grounding that matches the pointing identifiers included in each of the query and the response list with visual feature regions detected in the image;
a visual contextualization unit that extracts context information between the visual feature regions detected in the image and the query and response list grounded by the visual grounding unit; and
a contextualization embedding unit that fuses the contextualized query, the contextualized objects, and the grounded response list into a single multi-modal context feature.

The system of claim 1,
wherein the knowledge search unit embeds the retrieved knowledge and the query with BERT (Bidirectional Encoder Representations from Transformers) and then extracts a portion of the knowledge as the related knowledge through similarity comparison.

The system of claim 1,
wherein the knowledge embedding unit generates the knowledge feature by performing knowledge graph embedding based on a graph convolutional neural network.

The system of claim 1,
wherein the visual grounding unit uses a BLSTM (Bidirectional Long Short-Term Memory) to map the pointing identifiers included in the query to visual feature regions in the image.

The system of any one of claims 1 to 4,
wherein the answer decision unit fuses the knowledge feature and the multi-modal context feature, passes the result through two fully-connected layers, and then determines the response with the highest score from a softmax layer as the answer to the query.

An image-based common sense reasoning method using a knowledge graph, performed by an image-based common sense reasoning system using a knowledge graph that includes a knowledge inference unit, a visual context inference unit, and an answer decision unit, the method comprising:
a knowledge inference step in which the knowledge inference unit retrieves knowledge related to an image, a query about the image, and a list of responses to the query, and then generates a knowledge feature;
a visual context inference step in which the visual context inference unit generates a visual context feature by associating the image with the query and the response list; and
an answer decision step in which the answer decision unit determines an answer to the query from the response list based on the knowledge feature and the visual context feature,
wherein the knowledge inference step includes:
a knowledge search step of searching a graph-type knowledge base composed of a set of triples for related knowledge, using as search terms words for visual features of the image and words extracted from the query and the response list, respectively; and
a knowledge embedding step of generating the knowledge feature by embedding the related knowledge,
wherein pointing identifiers indicating visual features in the image are assigned in advance to some words in the query and the response list, and
wherein the visual context inference step includes:
a visual grounding step of embedding the query and the response list with BERT (Bidirectional Encoder Representations from Transformers) and then performing visual grounding that matches the pointing identifiers included in each of the query and the response list with visual feature regions detected in the image;
a visual contextualization step of extracting context information between the visual feature regions detected in the image and the query and response list grounded in the visual grounding step; and
a contextualization embedding step of fusing the contextualized query, the contextualized objects, and the grounded response list into a single multi-modal context feature.

The method of claim 6,
wherein the knowledge embedding step generates the knowledge feature by performing knowledge graph embedding based on a graph convolutional neural network.

The method of claim 6 or 7,
wherein the answer decision step fuses the knowledge feature and the multi-modal context feature, passes the result through two fully-connected layers, and then determines the response with the highest score from a softmax layer as the answer to the query.

(Deleted)