KR102593463B1

KR102593463B1 - Apparatus and method for generating language based on commonsense reasoning

Info

Publication number: KR102593463B1
Application number: KR1020210127435A
Authority: KR
Inventors: 임희석; 서재형
Original assignee: 고려대학교 산학협력단
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2023-10-24
Also published as: KR20230044834A

Abstract

Disclosed are an apparatus and a method for generating language based on commonsense reasoning. The apparatus may include an input unit that receives at least one concept, at least one image, and at least one document, and a processor that uses the at least one concept to extract scene knowledge from the at least one image and relational knowledge from the at least one document, and that trains a language model using the scene knowledge and the relational knowledge as input values.

Description

Apparatus and method for generating language based on commonsense reasoning

The present disclosure relates to an apparatus and a method for generating language based on commonsense reasoning.

Artificial intelligence (AI) enables machines such as computers to perform human capabilities such as learning and reasoning, and may be implemented in the form of a device, a system, or a program. AI has recently grown rapidly thanks to advances in machine learning, in which a model is trained on various kinds of information to find an optimal model that is then used to make decisions, and it is strongly affecting a wide range of industries. AI is also applied to natural language processing, where it makes tasks such as text analysis, speech recognition and speech-to-text conversion, and text recognition and translation into other languages more accurate and effective. More recently, techniques that use AI to generate a new, precise sentence in response to a given sentence or set of concepts have been researched, developed, and published. Such sentence generation techniques are expected to be usable not only for simple texts such as news articles or e-mails but also for literary works.

However, owing to technical limitations, these sentence generation techniques frequently omit some of the input concepts from the generated sentence or produce awkward sentences that are hard for humans to understand. The need for research and development of language generation techniques that produce sentences as natural as those written by humans has therefore continued to grow.

An object to be achieved is to provide an apparatus and a method for generating language based on commonsense reasoning that strengthen commonsense reasoning ability by reflecting retrieved external knowledge and that generate more natural sentences on the basis of the strengthened ability.

To achieve the above object, an apparatus and a method for generating language based on commonsense reasoning are provided.

The apparatus for generating language based on commonsense reasoning may include an input unit that receives at least one concept, at least one image, and at least one document, and a processor that uses the at least one concept to extract scene knowledge from the at least one image and relational knowledge from the at least one document, and that trains a language model using the scene knowledge and the relational knowledge as input values.

The method for generating language based on commonsense reasoning may include: receiving at least one concept, at least one image, and at least one document; extracting scene knowledge from the at least one image using the at least one concept; extracting relational knowledge from the at least one document using the at least one concept; and training a language model using the scene knowledge and the relational knowledge as input values.

According to the apparatus and method described above, commonsense reasoning ability is strengthened by reflecting retrieved external knowledge, and more natural sentences can be generated on the basis of the strengthened ability.

According to the apparatus and method described above, scene knowledge is used to assume the general situations that a combination of the given concepts can form, and relational knowledge is used to infer the general relationships between the scene knowledge and one or more concepts, so that more plausible sentences can be constructed during language generation.

According to the apparatus and method described above, the low performance and limited commonsense reasoning ability of language models such as conventional autoregressive generative models can be reinforced. This helps prevent the model from generating sentences that omit given concepts, or sentences that defy common sense because information is missing or has been dropped.

According to the apparatus and method described above, a language generation model, apparatus, and method can be implemented that are general enough to apply to most language generation models, regardless of whether the model has an encoder.

FIG. 1 is a block diagram of an embodiment of an apparatus for generating language based on commonsense reasoning.
FIG. 2 is a block diagram illustrating an embodiment of the operation of a processor.
FIG. 3 is a block diagram of an embodiment of a language model processing unit.
FIG. 4 is a flowchart of an embodiment of a method for generating language based on commonsense reasoning.

Throughout the specification, like reference numerals refer to like elements unless otherwise indicated. A term with the suffix 'unit' used below may be implemented in software and/or hardware. Depending on the embodiment, one 'unit' may be implemented as a single physical or logical component, a plurality of 'units' may be implemented as a single physical or logical component, or one 'unit' may be implemented as a plurality of physical or logical components. When one part is described as being connected to another part, the two parts may be physically connected and/or electrically connected to each other. When one part is described as including another part, this does not exclude further parts unless stated otherwise; further parts may be included at the designer's discretion. Expressions such as 'first' to 'N-th' (N being a natural number of 1 or more) serve to distinguish one or more parts from other parts and, unless stated otherwise, do not necessarily imply an order. A singular expression may include the plural unless the context clearly indicates otherwise.

Hereinafter, an embodiment of the apparatus for generating language based on commonsense reasoning is described with reference to FIGS. 1 to 3.

FIG. 1 is a block diagram of an embodiment of an apparatus for generating language based on commonsense reasoning.

According to the embodiment shown in FIG. 1, the apparatus 100 for generating language based on commonsense reasoning (hereinafter, the language generation apparatus) may include a processor 110 and, as needed, may further include at least one of an input unit 101, a storage unit 105, and an output unit 109. At least two of the input unit 101, the storage unit 105, the output unit 109, and the processor 110 may be arranged to pass data, commands, or instructions to each other unidirectionally or bidirectionally over a circuit, a cable, or a wireless communication network.

The input unit 101 may receive, from a user or another device (not shown), data or a program (which may also be called an app, software, or an application) required for the operation of the language generation apparatus 100.

For example, the input unit 101 may receive the concept set 91 itself, or at least one concept (91-1 in FIG. 2) that may be included in the concept set 91. The at least one concept 91-1 may include one or more words, for example at least one content word, and may also include at least one function word as needed. The words may belong to predetermined parts of speech. For example, so that sentences reflecting common sense can be generated in an objective form, they may include at least one of a verb (e.g., carry, hold, wear) and a noun (e.g., rose, garden, dress). Here, verbs may include the base form, the gerund, present or past tense forms, and/or the past participle, and nouns may include singular common nouns, plural common nouns, singular proper nouns, and/or plural proper nouns. Depending on the embodiment, the words may also include adverbs or adjectives in addition to verbs and nouns. The concept set 91 is a set formed of one or more such concepts 91-1, and may be provided as a database whose elements are the individual concepts 91-1.

The input unit 101 may further receive an image 93 used for extracting scene knowledge (which may be a still image, a video, or one or more individual frames of a video) or a document 95 used for extracting relational knowledge (which includes text and may also include tables, images, or symbols as needed). At least one of the image 93 and the document 95 may be input directly by a user's operation, or may be received through a separately provided memory device or another information processing device (e.g., a server hardware device or a smartphone). At least one of the image 93 and the document 95 may be obtained from an external website accessible over a wired or wireless communication network. For example, the image 93 may be collected from an external search site or image database site, or may be obtained from a separately prepared image data set (e.g., VATEX or Visual Genome). The document 95 may be collected from, for example, an external search site or an information database site (e.g., a dictionary or wiki site), and may include web documents. The collection of at least one of the image 93 and the document 95 may be performed manually or automatically, according to a user's operation or under the control of the processor 110.
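
As an illustration of how a concept set restricted to verbs and nouns might be represented, the sketch below stores each concept together with a part-of-speech tag and keeps only the verb and noun forms listed above. The `Concept` class, the `filter_concepts` helper, and the choice of Penn Treebank tag names are assumptions made for this example; the specification does not prescribe a data format.

```python
from dataclasses import dataclass

# Penn Treebank tag names, assumed here for illustration only.
# Verbs: base form, gerund, present tense, past tense, past participle.
VERB_TAGS = {"VB", "VBG", "VBP", "VBZ", "VBD", "VBN"}
# Nouns: singular/plural common nouns, singular/plural proper nouns.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


@dataclass(frozen=True)
class Concept:
    """One element of the concept set (hypothetical representation)."""
    word: str
    pos: str  # part-of-speech tag


def filter_concepts(concepts):
    """Keep only concepts tagged as a verb or a noun."""
    allowed = VERB_TAGS | NOUN_TAGS
    return [c for c in concepts if c.pos in allowed]


concept_set = [
    Concept("rose", "NN"),
    Concept("garden", "NN"),
    Concept("wear", "VB"),
    Concept("red", "JJ"),  # adjective: dropped by this particular filter
]
kept = filter_concepts(concept_set)
print([c.word for c in kept])
```

An embodiment that also admits adverbs or adjectives would simply widen the set of allowed tags.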

A program input through the input unit 101 may include, for example, an algorithm for the operation of the processor 110. Depending on the embodiment, the input unit 101 may include a keyboard, a mouse, a tablet, a touch screen, a touch pad, a trackball, a trackpad, a scanner, an image capturing module, an ultrasound scanner, a motion sensor, a vibration sensor, a light-receiving sensor, a pressure sensor, a proximity sensor, and/or a microphone. It may also include a data input/output terminal capable of receiving data from another device such as a portable memory device (e.g., a Universal Serial Bus terminal, a High-Definition Multimedia Interface terminal, or a slot such as M.2 or PCI Express), or a communication module connected to an external device over a wired or wireless communication network (e.g., a LAN card, a short-range communication module, or a mobile communication module).

The storage unit 105 may store data or programs required for the operation of the language generation apparatus 100. For example, the storage unit 105 may temporarily or non-temporarily store at least one of the concept set 91, the image 93, and the document 95 to be used by the processor 110. The data stored in the storage unit 105 may have been input through the input unit 101 or generated during processing by the processor 110. The stored data (at least one of the concept set 91, the image 93, and the document 95) may be delivered to the processor 110 when needed, for example when the processor 110 requests it. Depending on the embodiment, the storage unit 105 may include at least one of a main memory device and an auxiliary memory device. The main memory device may include a semiconductor storage medium such as ROM and/or RAM. The auxiliary memory device may include a flash memory device, a Secure Digital (SD) card, a solid state drive (SSD), a hard disk drive (HDD), a magnetic drum, optical media such as a compact disc (CD), a DVD, or a laser disc, a magnetic tape, a magneto-optical disk, and/or a floppy disk.

The output unit 109 may output the processing results of the processor 110 or the data stored in the storage unit 105 to the outside. For example, the output unit 109 may visually or audibly output at least one sentence that the processor 110 generated in response to the processed concept set 91. In this case, the at least one sentence may be displayed on a display screen, for example as text. As another example, the output unit 109 may deliver one or more sentences generated by the processor 110 to an external device (e.g., a smartphone, a server hardware device, or a memory device) over a cable or a wireless communication network. Depending on the embodiment, the output unit 109 may be integrated with the language generation apparatus 100 or may be physically separable from it, and may include, for example, a display, a printer, a speaker, a video output terminal, a data input/output terminal, and/or a communication module. The output unit 109 is not, however, limited to these.

The processor 110 may use the concept set 91 to strengthen the performance of a language model (for example, its commonsense reasoning ability and/or its sentence generation ability based on that reasoning), or may generate one or more sentences using the language model. To perform these operations, the processor 110 may execute a program stored in the storage unit 105. The processor 110 may include, for example, a central processing unit (CPU), a microcontroller unit (MCU), an application processor (AP), an electronic control unit (ECU), a baseboard management controller (BMC), a microprocessor (Micom), and/or at least one other electronic device capable of performing various computation and control operations. These processing or control devices may be implemented, for example, using one or more semiconductor chips, circuits, or related components, alone or in combination.

As shown in FIG. 1, the processor 110 may, in one embodiment, include an extraction unit 120, an encoding unit 130, and a language model processing unit 140. At least two of the extraction unit 120, the encoding unit 130, and the language model processing unit 140 may be separated physically or logically. When they are physically separated, at least two of them may be implemented by, for example, different physical processing devices. When they are logically separated, all of them may be implemented by a single physical processing device.

FIG. 2 is a block diagram illustrating an embodiment of the operation of the processor.

As shown in FIGS. 1 and 2, the extraction unit 120 may include a scene knowledge extraction unit 121 and a relational knowledge extraction unit 123.

The scene knowledge extraction unit 121 may obtain at least one concept 91-1 (e.g., flower, dress, garden, hold, and wear) of the concept set 91 that was input through the input unit 101 or stored in the storage unit 105. It may also obtain at least one image 93 that was received through the input unit 101 (for example, through a search of a website) or stored in the storage unit 105, and may acquire scene knowledge 121a based on the at least one concept 91-1 and the at least one image 93. The scene knowledge 121a may include, for example, one or more language items (e.g., words or combinations of words), and more specifically one or more words extracted from the at least one image 93.

The scene knowledge extraction unit 121 may use the images 93 as a database, searching for and extracting content that describes a general situation involving a particular concept 91-1 in order to acquire the scene knowledge 121a. For example, when concepts 91-1 such as [rose], [garden], and [wear] are input, scene knowledge 121a such as [a woman is wearing a black dress] or [a woman is holding a flower] may be obtained from the images 93. The scene knowledge extraction unit 121 may also extract the scene knowledge 121a from a data set that contains descriptions of at least one image (e.g., VATEX or Visual Genome).

According to one embodiment, the scene knowledge extraction unit 121 may rely on Wittgenstein's picture theory to extract the scene knowledge 121a. The picture theory holds that language (propositions) corresponds one-to-one to the world (states of affairs, or facts), and that language can depict the world as a picture does. Following this theory, the scene knowledge extraction unit 121 may acquire the scene knowledge 121a by performing generative commonsense reasoning about a scene.

More specifically, to extract descriptions that are highly relevant to the concepts 91-1, the scene knowledge extraction unit 121 may retrieve scene knowledge from the at least one image 93 by performing part-of-speech tagging and stemming. Part-of-speech tagging may be performed by selecting concepts 91-1 according to their part of speech. For example, only the concepts 91-1 that are verbs or nouns may be selected from among all concepts 91-1. Part-of-speech tagging may be performed using, for example, the part-of-speech (POS) tagger of the Natural Language Toolkit (NLTK). Stemming may be performed by extracting the stems of all or some of the tokens in the concepts 91-1 and in the descriptions of the images 93, which allows grammatical differences to be ignored. According to one embodiment, once sentences have been extracted, the scene knowledge extraction unit 121 may treat as valid only those extracted sentences that contain at least two concepts sharing a stem with the selected concepts 91-1. A sentence treated as valid may be reconstructed into a scene knowledge-based sentence that depicts the external world. The scene knowledge 121a may be acquired through this process.

A scene knowledge-based sentence may include, for example, a set of concepts comprising verbs and nouns. Learning from such scene knowledge-based sentences improves sentence generation ability on the basis of an understanding of general situations obtained from external world knowledge, much as one would describe a picture.
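
The stem-matching rule described above (keep an image description only if it contains at least two of the given concepts after stemming) can be sketched as follows. This is a minimal illustration, not the patented implementation: `toy_stem` is a deliberately crude, hypothetical suffix stripper standing in for a real stemmer such as one from NLTK, and the captions are invented example data.

```python
import re


def toy_stem(token: str) -> str:
    """Crude suffix stripper (hypothetical stand-in for a real stemmer)."""
    token = token.lower()
    for suffix in ("ing", "ed", "s"):
        # Strip the suffix only if at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def matched_concepts(caption, concepts):
    """Return the concepts whose stem occurs among the caption's token stems."""
    stems = {toy_stem(t) for t in re.findall(r"[A-Za-z]+", caption)}
    return {c for c in concepts if toy_stem(c) in stems}


def select_scene_sentences(captions, concepts, min_match=2):
    """Keep captions that contain at least `min_match` of the given concepts."""
    return [s for s in captions if len(matched_concepts(s, concepts)) >= min_match]


concepts = ["rose", "garden", "wear"]
captions = [
    "A woman is wearing a black dress in the garden.",
    "A woman holds roses.",
    "Two dogs run on the beach.",
]
scene_sentences = select_scene_sentences(captions, concepts)
print(scene_sentences)
```

Only the first caption is kept, because it matches both "wear" (via "wearing") and "garden"; the second matches a single concept and the third matches none.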

The relational knowledge extraction unit 123 may acquire relational knowledge 123a based on at least one concept 91-1 and at least one document 95. The relational knowledge 123a may include external world knowledge about a given concept 91-1 as well as non-visual common sense that is difficult to derive from the retrieved or used scene knowledge 121a alone. Based on Wittgenstein's use theory, the relational knowledge 123a makes it possible to infer information about the general relationships of each concept 91-1. Specifically, within the one or more general situations assumed for the given concepts 91-1, the relational knowledge 123a makes it possible to expand the information on the general relationships of each concept 91-1 by additionally using external knowledge (e.g., the document 95). For example, when the sentence [a woman is holding a flower] has been extracted from the image 93 and [celebration] is additionally used as relational knowledge 123a, the language model processing unit 140 can generate the sentence [a woman is holding a flower and celebrating].

According to one embodiment, to acquire the relational knowledge 123a, the relational knowledge extraction unit 123 may first compute a relevance score for each of one or more documents 95 and use the scores to select at least one document 95. In this case, documents 95 with relatively high relevance scores, that is, highly relevant documents, may be selected. The relevance score of a document 95 may be computed based on, for example, how frequently the given one or more concepts 91-1 occur in that document, the length of that document, and the average length of all documents 95. According to one embodiment, the score may be given by Equation 1 below.

[Equation 1]
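The symbols defined for Equation 1 (IDF, the term frequency f, the hyperparameters k and b, the document length |D|, and the average length avgDL) match the standard Okapi BM25 ranking function. Assuming that is the intended form, and writing q_i for the i-th of the n concepts in the query Q (notation introduced here for illustration only), the score may be written as:

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k + 1)}
       {f(q_i, D) + k \left( 1 - b + b \, \dfrac{|D|}{\mathrm{avgDL}} \right)}
```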

In Equation 1, score(D, Q) is the score for a given document D and concept Q (the query), IDF() is the inverse document frequency, and f() is the term frequency. k and b are hyperparameters and may be set to, for example, 1.2 and 0.75, respectively. |D| is the length of the document D, and avgDL is the average length of all documents.
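The relevance scoring described above can be sketched in a few lines. This is a minimal illustration that assumes the score is the standard Okapi BM25 function with a smoothed IDF term; the function name `bm25_scores`, the whitespace tokenization, and the sample documents are hypothetical and are not taken from the specification.

```python
import math
from collections import Counter

def bm25_scores(concepts, documents, k=1.2, b=0.75):
    """Score each tokenized document against the concept query with Okapi BM25."""
    n_docs = len(documents)
    avg_dl = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents that contain each concept.
    doc_freq = {c: sum(1 for doc in documents if c in doc) for c in concepts}
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for c in concepts:
            # Smoothed IDF variant (an assumption; the text only names IDF()).
            idf = math.log((n_docs - doc_freq[c] + 0.5) / (doc_freq[c] + 0.5) + 1.0)
            norm = tf[c] + k * (1.0 - b + b * len(doc) / avg_dl)
            score += idf * tf[c] * (k + 1.0) / norm
        scores.append(score)
    return scores

# Hypothetical documents, already split into tokens.
documents = [
    "a woman holds a flower to celebrate a graduation".split(),
    "the stock market closed higher on friday".split(),
    "people give a flower bouquet when they celebrate".split(),
]
concepts = ["woman", "flower", "celebrate"]
scores = bm25_scores(concepts, documents)
# Document indices ordered from most to least relevant.
ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```

Documents would then be selected from the front of `ranked`; how many to keep is a design choice the text leaves open.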

Once at least one document 95 is selected, the selected document 95 is decomposed into at least one sentence, and each sentence may be used as a relative knowledge-based sentence. Depending on the embodiment, only those sentences obtained by the decomposition that contain at least two concepts 91-1 may be used as relative knowledge-based sentences.
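The decomposition and the two-concept filter described above can be sketched as follows. The regular-expression sentence splitter, the exact-match test on lowercase tokens, and the name `knowledge_sentences` are assumptions made for illustration; a real system would likely use a proper sentence segmenter and match on stems.

```python
import re

def knowledge_sentences(document, concepts, min_concepts=2):
    """Split a document into sentences and keep those naming enough concepts."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    kept = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        # Count how many distinct given concepts occur in this sentence.
        if sum(1 for c in concepts if c in tokens) >= min_concepts:
            kept.append(sentence)
    return kept

# Hypothetical selected document.
document = (
    "A woman received a flower at the ceremony. "
    "The hall was crowded. "
    "Friends celebrate with a flower bouquet."
)
selected = knowledge_sentences(document, ["woman", "flower", "celebrate"])
```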

If necessary, a noise removal process may be performed on the sentences to remove irrelevant information. For example, noise caused by irrelevant information can be removed by selecting the highly relevant sentences from among the sentences in the one or more selected documents 95 and excluding the rest. According to one embodiment, the noise removal process may be performed using the GloVe word embedding method. Specifically, for example, the similarity (for example, the cosine similarity) between the given concepts 91-1 and the concepts extracted from a sentence (for example, a sentence extracted from a document with a high relevance score) is computed first, and only the sentence(s) with relatively high similarity are extracted and selected, so that highly relevant sentences are retained.
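The similarity-based noise removal described above can be sketched as follows. The three-dimensional vectors in `TOY_VECTORS` are made-up stand-ins for pretrained GloVe embeddings. Averaging the word vectors of the query and of each sentence, and the threshold of 0.8, are assumptions chosen for illustration rather than details from the specification.

```python
import math

# Toy 3-dimensional vectors standing in for pretrained GloVe embeddings
# (hypothetical values chosen for illustration).
TOY_VECTORS = {
    "woman": [0.9, 0.1, 0.0],
    "flower": [0.7, 0.6, 0.1],
    "celebrate": [0.6, 0.7, 0.0],
    "bouquet": [0.7, 0.7, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}

def mean_vector(words):
    """Average the vectors of the words that have an embedding."""
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_sentences(concepts, sentences, threshold=0.8):
    """Keep sentences whose mean vector is close to the mean concept vector."""
    query = mean_vector(concepts)
    kept = []
    for sentence in sentences:
        vec = mean_vector(sentence.lower().split())
        if query is not None and vec is not None and cosine(query, vec) >= threshold:
            kept.append(sentence)
    return kept

sentences = ["friends celebrate with a flower bouquet", "the stock market fell"]
kept = filter_sentences(["woman", "flower", "celebrate"], sentences)
```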

The scene knowledge 121a and the relationship knowledge 123a obtained as described above may be passed to the encoding unit 130. The encoding unit 130 may encode the scene knowledge 121a and the relationship knowledge 123a and merge the encoding results 131 and 132 to generate an input value 135 for the language model processing unit 140. Here, the input value 135 may be given as a pair of at least one concept and a knowledge-based sentence. According to one embodiment, the encoding unit 130 may convert the scene knowledge 121a to generate a first sequence 131 corresponding to the scene knowledge 121a and/or convert the relationship knowledge 123a to generate a second sequence 132 corresponding to the relationship knowledge 123a. More specifically, each of the sequences 131 and 132 may include a start token (SOS), an end token (EOS), and, placed between the start token (SOS) and the end token (EOS), at least one concept together with the sentence from which the concepts were extracted or derived, where that sentence may belong to either the scene knowledge 121a or the relationship knowledge 123a. The encoding unit 130 may merge the sequences 131 and 132 to generate the input value 135 for the language model processing unit 140 and pass it to the language model processing unit 140.
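The sequence construction described above can be sketched as follows. The token strings `<SOS>` and `<EOS>`, the whitespace tokenization, and merging by simple concatenation are assumptions for illustration; the specification does not fix the token format or the merge operation.

```python
SOS, EOS = "<SOS>", "<EOS>"  # placeholder token strings (assumption)

def build_sequence(concepts, sentence):
    """Place the concepts and their source sentence between SOS and EOS."""
    return [SOS, *concepts, *sentence.split(), EOS]

def build_input(concepts, scene_sentence, relation_sentence):
    first = build_sequence(concepts, scene_sentence)      # first sequence (131)
    second = build_sequence(concepts, relation_sentence)  # second sequence (132)
    return first + second                                 # merged input value (135)

# Hypothetical concepts and knowledge sentences.
model_input = build_input(
    ["woman", "flower"],
    "a woman is holding a flower",
    "people give a flower to celebrate",
)
```

In practice the token list would still have to be mapped to vocabulary indices by the tokenizer of whichever language model is used.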

The language model processing unit 140 may perform learning processing based on the input value 135. For example, the language model processing unit 140 may train a language model using the input value 135 and/or obtain a result (that is, a sentence or the like) corresponding to the concept 91-1 using the input value 135. According to one embodiment, the language model processing unit 140 may be implemented using an autoregressive language model such as the Generative Pre-trained Transformer (GPT), Generative Pre-trained Transformer 2 (GPT2), BART, or T5. GPT includes a transformer decoder with a very large number of parameters. GPT2 is a transformer that, like a pre-activation residual network, places a normalization layer before each self-attention block; it is a decoder-only, transformer-based autoregressive model pre-trained on massive text data. BART is pre-trained through denoising methods such as token masking, deletion, or infilling, and has a transformer encoder-decoder structure. T5 also has a transformer encoder-decoder structure and uses relative position encoding instead of conventional positional encoding. In addition to these, the language model processing unit 140 may perform training and inference based on at least one learning model such as a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a convolutional recurrent neural network (CRNN), a deep belief network (DBN), a deep Q-network, a long short-term memory (LSTM), a multi-layer perceptron, a support vector machine (SVM), a generative adversarial network (GAN), and/or a conditional generative adversarial network (cGAN).

FIG. 3 is a block diagram of an embodiment of the language model processing unit.

For example, if the language model processing unit 140 is implemented using Generative Pre-trained Transformer 2 (GPT2), the language model processing unit 140 may include, as shown in FIG. 3, a masked multi-head attention 141, a multi-head attention 143, a feed-forward network 145, and an output processing unit 147. The masked multi-head attention 141 performs self-attention based on only some of the words (some of the tokens) in the input sentence. The multi-head attention 143 may receive the processing result of the masked multi-head attention 141, perform multiple self-attention operations on it, and pass the result to the feed-forward network 145. The feed-forward network 145 performs learning based on the processing result of the multi-head attention 143, and the output processing unit 147 uses a softmax function or the like to generate and obtain, from the output of the feed-forward network 145, the final result 149, that is, a sentence improved on the basis of the scene knowledge 121a and the relationship knowledge 123a.
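The masking step described above can be illustrated with a single attention head. The following is a minimal sketch, not the GPT2 implementation. It assumes the common causal form of masking, in which each position attends only to itself and to earlier positions. It also uses the input vectors directly as queries, keys, and values, with no learned projections. The function names are hypothetical.

```python
import math

def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Single-head self-attention where position i sees only positions 0..i."""
    dim = len(x[0])
    outputs, weights = [], []
    for i, query in enumerate(x):
        # Scaled dot-product scores against the visible positions only.
        scores = [
            sum(q * k for q, k in zip(query, x[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        # Masked (future) positions receive an attention weight of zero.
        attn = softmax(scores) + [0.0] * (len(x) - i - 1)
        weights.append(attn)
        outputs.append(
            [sum(a * x[j][d] for j, a in enumerate(attn)) for d in range(dim)]
        )
    return outputs, weights

# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens)
```

Because the first position can see only itself, its attention weight on itself is 1 and its output equals its input. Later positions mix in the positions before them.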

Accordingly, the language generation device 100 can generate more natural sentences based on a commonsense reasoning ability strengthened by the image and the relationship knowledge.

The language generation device 100 described above may be implemented using a device specially designed to perform the operations described above, or by using one or more information processing devices alone or in combination. Here, the one or more information processing devices may include, for example, a desktop computer, a laptop computer, a server hardware device, a scanner, a printer, a three-dimensional printer, a smartphone, a tablet PC, a smart watch, a smart tag, a smart band, a head mounted display (HMD) device, a portable game console, a personal digital assistant (PDA), a navigation device, a smart key, a digital television, a set-top box, a digital media player, a media streaming device, a DVD player, a compact disc player, a sound reproduction device (such as an artificial intelligence speaker), a home appliance (for example, a refrigerator, fan, air conditioner, oven, or washing machine), a manned or unmanned vehicle (for example, a car, bus, or two-wheeled vehicle, a mobile robot, a radio-controlled model vehicle, or a robot vacuum cleaner), a manned or unmanned aerial vehicle (for example, an aircraft, helicopter, drone, model airplane, or model helicopter), a medical device (such as an X-ray imaging device, a computed tomography (CT) device, or a magnetic resonance imaging (MRI) device), a household, industrial, or military robot, an industrial or military machine, or a traffic controller, but are not limited thereto. Depending on the situation or conditions, a designer or user may also adopt, as the language generation device 100, at least one of various other devices capable of computing and controlling information in addition to the information processing devices listed above.

Hereinafter, several embodiments of a language generation method based on commonsense reasoning are described with reference to FIG. 4.

FIG. 4 is a flowchart of an embodiment of a language generation method based on commonsense reasoning.

Referring to FIG. 4, to perform the language generation method based on commonsense reasoning (hereinafter, the language generation method), one or more concepts may first be input by a user or from another device (such as a memory device or an information processing device) (200). The input concepts may be stored in the storage unit in the form of a concept set.

When the language generation operation is started according to a user's operation or a predefined setting, the processor acquires the input concept or at least one concept from the concept set stored in the storage unit, and, simultaneously or sequentially, at least one of an image and a document may be acquired (202). The image and/or document may be acquired through at least one of manipulation of the input unit by a user or the like, a search of a website, and reception from an external storage medium, and may also be acquired by various other methods. Here, the image may include a still image or a video, and may include at least one frame of a video. The document may include text and, if necessary, may also include one or more images.

Once at least one image is acquired, scene knowledge may be extracted and obtained from the at least one image (204). Likewise, once at least one document is acquired, relationship knowledge may be extracted and obtained from the at least one document. The extraction of scene knowledge and the extraction of relationship knowledge may be started simultaneously or at different times. Thus, depending on the embodiment, the extraction of scene knowledge may be performed first and the extraction of relationship knowledge afterwards, or the two may be performed at the same time. Specifically, for example, the scene knowledge may be obtained by performing generative commonsense reasoning about the scene according to Wittgenstein's picture theory. In this case, the scene knowledge may be obtained based on part-of-speech tagging, which selects the given concepts according to their part of speech, and stemming, which extracts stems from all tokens in the concepts and the description. The relationship knowledge may be obtained, for example, by finding the word(s) corresponding to the concepts in a given document and aggregating the distances between them, their frequencies, and the like. Specifically, the similarity between the given concepts and concepts extracted from a sentence may be computed using, for example, the GloVe word embedding method, highly relevant sentences may be selected on that basis, and, using the score for each document, one or more (for example, five) of the sentences with high relevance scores may be extracted in descending order of score to obtain the relationship knowledge.
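The part-of-speech selection and stemming described above for scene knowledge can be sketched as follows. Everything here is a simplified stand-in: `TOY_POS` is a hypothetical lexicon used in place of a trained tagger, `stem` is a naive suffix stripper used in place of a real stemmer, and requiring a caption to contain every selected concept is an assumption about how captions and concepts are matched.

```python
# Hypothetical part-of-speech lexicon; a real system would use a trained tagger.
TOY_POS = {"woman": "NOUN", "flower": "NOUN", "hold": "VERB", "red": "ADJ"}

def stem(token):
    """Naive suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def select_concepts(concepts):
    """Keep only the concepts tagged as nouns or verbs."""
    return [c for c in concepts if TOY_POS.get(c) in ("NOUN", "VERB")]

def match_captions(concepts, captions):
    """Return the image captions whose stems cover every selected concept."""
    wanted = {stem(c) for c in select_concepts(concepts)}
    matched = []
    for caption in captions:
        stems = {stem(t) for t in caption.lower().split()}
        if wanted <= stems:
            matched.append(caption)
    return matched

# Hypothetical image descriptions.
captions = ["a woman is holding flowers", "a dog runs on the beach"]
scene = match_captions(["woman", "flower", "hold", "red"], captions)
```

Stemming lets the concept "hold" match "holding" and "flower" match "flowers", while the adjective "red" is dropped before matching.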

Once the scene knowledge and the relationship knowledge are obtained, encoding is performed (208). Specifically, the encoding may be performed by generating a first sequence corresponding to the scene knowledge and a second sequence corresponding to the relationship knowledge, and merging the generated first and second sequences to produce the input value for the language model.

The input value obtained by the encoding is input to the language model, and the language model may be trained based on the input value or may generate a result corresponding to the input value (that is, a sentence corresponding to the input concepts) (210). Here, the language model may include the Generative Pre-trained Transformer (GPT), Generative Pre-trained Transformer 2 (GPT2), BART, or T5, and may also include various other learning models that a designer may consider.

As a result of the learning processing, the language model may be trained and updated, and/or a processing result (that is, a sentence) corresponding to the given concepts may be obtained (212). The obtained sentence may be presented to the user visually or audibly through the output unit, or may be transmitted to another external information processing device, memory device, or the like.

The language generation method according to the embodiments described above may be implemented in the form of a program that can be run by a computer device. The program may include instructions, libraries, data files, and/or data structures, alone or in combination, and may be designed and written in machine code or in a high-level language. The program may be specially designed to implement the method described above, or may be implemented using various functions or definitions that are known and available to those skilled in the field of computer software. Here, the computer device may include a processor, memory, and the like that enable the functions of the program to be realized, and may further include a communication device if necessary. The program for implementing the language generation method described above may be recorded on a computer-readable recording medium. The computer-readable recording medium may include at least one type of physical storage medium capable of temporarily or non-temporarily storing one or more programs executed in response to a call from a computer or the like, for example, a semiconductor storage medium such as a solid state drive (SSD), ROM, RAM, or flash memory, a magnetic disk storage medium such as a hard disk or floppy disk, an optical recording medium such as a compact disc or DVD, or a magneto-optical recording medium such as a floptical disk.

Several embodiments of the language generation device and the language generation method based on commonsense reasoning have been described above, but the language generation device and the language generation method are not limited to these embodiments. Various other devices or methods that a person of ordinary skill in the art can implement by modifying and varying the embodiments described above may also be embodiments of the language generation device or the language generation method. For example, even if the described method(s) are performed in an order different from that described, and/or the component(s) of the described system, structure, device, circuit, and the like are combined, connected, or assembled in a form different from that described, or are replaced or substituted by other components or equivalents, the result may still be an embodiment of the language generation device and/or the language generation method described above.

91: Concept set 93: Image
95: Document 100: Language generation device
101: Input unit 105: Storage unit
109: Output unit 110: Processor
120: Extraction unit 121: Scene knowledge extraction unit
123: Relationship knowledge extraction unit 130: Encoding unit
140: Language model processing unit

Claims

A language generation device based on commonsense reasoning, comprising:
an input unit that receives at least one concept, at least one image, and at least one document; and
a processor that, using the at least one concept, extracts scene knowledge from the at least one image, extracts relationship knowledge from the at least one document, and performs learning processing of a language model using the scene knowledge and the relationship knowledge as input values,
wherein the processor performs generative commonsense reasoning according to Wittgenstein's picture theory to obtain the scene knowledge from the at least one image, and
wherein the processor extracts the scene knowledge by selecting concepts that are verbs and nouns from among the at least one concept, or by extracting stems from all or some tokens in a description of the at least one concept or the at least one image.

delete

The language generation device based on commonsense reasoning according to claim 1,
wherein the processor extracts information about general relationships from the at least one document in order to expand information about the relationships of the concepts according to Wittgenstein's use theory.

A language generation device based on commonsense reasoning, comprising:
an input unit that receives at least one concept, at least one image, and at least one document; and
a processor that, using the at least one concept, extracts scene knowledge from the at least one image, extracts relationship knowledge from the at least one document, and performs learning processing of a language model using the scene knowledge and the relationship knowledge as input values,
wherein the processor extracts information about general relationships from the at least one document in order to expand information about the relationships of the concepts according to Wittgenstein's use theory, and
wherein the processor obtains a relevance score for the at least one document, selects a document with a high relevance score from among the at least one document, and decomposes the document with the high relevance score into at least one sentence.

The language generation device based on commonsense reasoning according to claim 5,
wherein the processor calculates a similarity between the at least one concept and a concept extracted from a sentence of the document with the high relevance score, and selects sentences with relatively high similarity, thereby selecting highly relevant sentences.

The language generation device based on commonsense reasoning according to claim 1,
wherein the language model includes a generative pre-trained transformer (GPT), generative pre-trained transformer 2 (GPT2), or BART.

A language generation method based on commonsense reasoning, performed by a language generation device based on commonsense reasoning that is implemented as a computing device including at least a processor, the method comprising:
receiving at least one concept, at least one image, and at least one document as input;
extracting scene knowledge from the at least one image using the at least one concept;
extracting relationship knowledge from the at least one document using the at least one concept; and
performing learning processing of a language model using the scene knowledge and the relationship knowledge as input values,
wherein the step of extracting scene knowledge from the at least one image using the at least one concept includes
obtaining the scene knowledge from the at least one image by performing generative commonsense reasoning according to Wittgenstein's picture theory, and
wherein the step of obtaining the scene knowledge from the at least one image by performing generative commonsense reasoning according to Wittgenstein's picture theory includes
extracting the scene knowledge by selecting concepts that are verbs and nouns from among the at least one concept, or by extracting stems from all or some tokens in a description of the at least one concept or the at least one image.

delete

The language generation method based on commonsense reasoning according to claim 8,
wherein the step of extracting relationship knowledge from the at least one document using the at least one concept includes
extracting information about general relationships from the at least one document in order to expand information about the relationships of the concepts according to Wittgenstein's use theory.

A language generation method based on commonsense reasoning, performed by a language generation device based on commonsense reasoning that is implemented as a computing device including at least a processor, the method comprising:
receiving at least one concept, at least one image, and at least one document as input;
extracting scene knowledge from the at least one image using the at least one concept;
extracting relationship knowledge from the at least one document using the at least one concept; and
performing learning processing of a language model using the scene knowledge and the relationship knowledge as input values,
wherein the step of extracting relationship knowledge from the at least one document using the at least one concept includes
extracting information about general relationships from the at least one document in order to expand information about the relationships of the concepts according to Wittgenstein's use theory, and
wherein the step of extracting information about general relationships from the at least one document includes:
obtaining a relevance score for the at least one document;
selecting a document with a high relevance score from among the at least one document; and
decomposing the document with the high relevance score into at least one sentence.

The language generation method based on commonsense reasoning according to claim 12,
wherein the step of extracting information about general relationships from the at least one document further includes:
calculating a similarity between the at least one concept and a concept extracted from a sentence of the document with the high relevance score; and
selecting highly relevant sentences by selecting sentences with relatively high similarity.

The language generation method based on commonsense reasoning according to claim 8,
wherein the language model includes a generative pre-trained transformer (GPT), generative pre-trained transformer 2 (GPT2), or BART.