KR20240052394A

KR20240052394A - Device and method for generating Korean commonsense reasoning dataset

Info

Publication number: KR20240052394A
Application number: KR1020220132373A
Authority: KR
Inventors: 임희석; 서재형
Original assignee: 고려대학교 산학협력단
Priority date: 2022-10-14
Filing date: 2022-10-14
Publication date: 2024-04-23

Abstract

A dataset generation device is disclosed. The dataset generation device includes a data collection unit that collects text data, a preprocessing unit that performs a preprocessing operation on the text data, and a dataset extraction unit that builds a dataset from the preprocessed text data. The dataset extraction unit builds the dataset by extracting, from concept information set-sentence pairs that are generated as a result of the preprocessing and that each consist of a concept information set and a sentence corresponding to that set, the pairs that satisfy a predetermined condition.

Description

Device and method for generating a Korean commonsense reasoning dataset {DEVICE AND METHOD FOR GENERATING KOREAN COMMONSENSE REASONING DATASET}

The present invention relates to natural language generation within the field of natural language processing (NLP), and more particularly to a method of automatically building a dataset for evaluating the commonsense reasoning ability of deep-learning-based language models.

Commonsense reasoning is the ability to infer general knowledge related to given conditions or to concepts that did not appear during training. CommonsenseQA built a dataset of commonsense-oriented questions based on ConceptNet concepts, and categorized and analyzed the commonsense reasoning skills required by each question. VQA proposed a task that tests a model's ability to answer a question when given an image and the question. Human evaluation verified that this task requires commonsense reasoning that draws not only on the given information but also on concepts that did not appear during training.

Recent natural language processing (NLP) research has released a variety of commonsense reasoning datasets. The Cosmos QA dataset is a QA dataset built on facts that are not explicitly stated in the context. The CoS-E dataset adds human explanations of commonsense to CommonsenseQA in an effort to strengthen a model's commonsense reasoning when it is trained on the dataset. The XCOPA dataset builds a commonsense dataset while also trying to reduce the differences in commonsense that can arise from linguistic and cultural differences. For Korean, however, there is no dataset for generation-based commonsense reasoning, and consequently no follow-up research that builds on one. The present invention therefore aims to build a system that can produce a new text generation dataset and to suggest a future direction for Korean natural language processing research.

Republic of Korea Registered Patent No. 10-2409667 (announced June 16, 2022)
Republic of Korea Patent Publication No. 2022-0077311 (published June 9, 2022)
Republic of Korea Patent Publication No. 2022-0064966 (published May 19, 2022)

Korean natural language processing research still has two problems: (1) a lack of research on natural language generation and (2) weak ability to generate sentences grounded in general commonsense. First, Korean research has concentrated mostly on natural language understanding tasks, so there are no datasets or evaluation criteria for quantitatively assessing a language model's generated output. As a result, there is no clear way to verify a language model's generation ability and no clear guideline to follow, which makes improvement research harder than it is for natural language understanding. Second, Korean language models, in addition to lacking datasets and evaluation criteria for verifying their performance, have considerable difficulty generating sentences even from simple commonsense knowledge.

A dataset generation device according to an embodiment of the present invention includes a data collection unit that collects text data, a preprocessing unit that performs a preprocessing operation on the text data, and a dataset extraction unit that builds a dataset from the preprocessed text data. The dataset extraction unit builds the dataset by extracting, from concept information set-sentence pairs that are generated as a result of the preprocessing and that each consist of a concept information set and a sentence corresponding to that set, the pairs that satisfy a predetermined condition.

The present invention presents a method of eliciting and evaluating the commonsense reasoning ability of Korean generative models through a Korean commonsense reasoning dataset. With the dataset, a generative model can generate sentences that embody the result of Korean-language commonsense reasoning. In addition, rather than comparing Korean generative models through evaluation methods designed for natural language understanding, the dataset makes it possible to evaluate and compare the commonsense reasoning ability and sentence reconstruction ability that generation requires. Furthermore, the invention can suggest a new direction for, and contribute to, natural language generation research aimed at improving the fluency and commonsense reasoning of Korean generative models.

A brief description of each drawing is provided so that the drawings cited in the detailed description of the present invention can be more fully understood.
Figure 1 is a diagram for explaining the commonsense reasoning result of a generative model using the dataset proposed in the present invention.
Figure 2 shows an example of generating a sentence by combining a concept information set from the proposed evaluation dataset.
Figure 3 is a functional block diagram of a dataset generation device for Korean language models, according to an embodiment of the present invention.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed in this specification are given only for the purpose of explaining those embodiments. The embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Because the embodiments according to the concept of the present invention can be modified in various ways and can take various forms, the embodiments are illustrated in the drawings and described in detail in this specification. However, this is not intended to limit the embodiments according to the concept of the present invention to the specific disclosed forms; they include all changes, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. On the other hand, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between. Other expressions that describe the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in this specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related technology, and should not be interpreted in an idealized or overly formal sense unless clearly so defined in this specification.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings attached to this specification. However, the scope of the patent application is not restricted or limited by these embodiments. The same reference numerals in the drawings denote the same members.

The present invention proposes a dataset (named KommonGen) with which a (Korean) language generation model generates sentences on the basis of commonsense reasoning.

With the proposed dataset (KommonGen), a model's performance can be evaluated on Korean sentences generated by combining concept information made up of objects and actions commonly seen in everyday life. To generate a correct sentence, the model needs both the compositional ability to combine the individual concepts into a single sentence and prior knowledge of the general relationships between those concepts. In addition, the sentence generated by the model must contain all of the given concept information and must be a natural sentence that conforms to Korean grammar.

For example, if concept information is given in random order, as in {“해변” (beach), “밀려오는” (rolling in), “파도” (waves), “보인다” (is seen)} in Figure 1, a person can take the relationships between the concepts into account and produce a single short sentence that is acceptable as commonsense, such as “파도가 밀려오는 해변이 보인다.” (“A beach with waves rolling in can be seen.”). Likewise, a Korean generative model must generate a sentence that combines the concept information in a way that reflects general commonsense.

In addition, the dataset presented in the present invention (KommonGen) enables quantitative evaluation of the sentence composition ability and commonsense reasoning ability of Korean generative models. Furthermore, based on the evaluation criteria, the Korean generation ability of arbitrary language models can be compared and analyzed, for example KoGPT2 and KoBART, which are publicly released large-scale Korean pre-trained language models, and mBART-50, which is a multilingual language model. The dataset and evaluation method proposed in the present invention are expected to contribute to the development of Korean natural language generation research.

KoGPT2 and KoBART are representative language models pre-trained on a large Korean corpus. KoGPT2 is a model released by SKT-AI that trains GPT2, which consists only of the Transformer decoder, on more than 40GB of Korean text. It has 12 layers, 12 heads, and 125M model parameters. KoBART was likewise released by SKT-AI and trains the BART model, which uses the Transformer encoder-decoder structure, on more than 40GB of Korean text. KoBART applies to Korean the Text Infilling objective used in the pre-training of BART. It has 6 layers, 16 heads, and 124M model parameters. To compare and analyze the performance of Korean generative models, the present invention also includes in the experiments the mBART-50 model, which was trained on a multilingual corpus of 50 languages including Korean. mBART-50 extends mBART, which was pre-trained on 25 languages, by adding bidirectional fine-tuning for each language pair and expanding coverage to 50 languages. It was trained on more than 54GB of Korean text and has 24 layers, 16 heads, and 610M model parameters.

However, the scope of the present invention is not limited to the types of (Korean) generative models described above, and various language generation models not described here may be used depending on the embodiment.

In the present invention, the dataset was built so that it is not overly confined to a specific field and does not use technical jargon, so that sentences are generated from general commonsense. Accordingly, the source data were taken from the recognition technology / visual intelligence category of AI-HUB: the caption information of the MS-COCO dataset (T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” European conference on computer vision, pp. 740-755, 2014.), which had first been translated into Korean and then corrected for errors.

As shown in Table 1, to extract declarative sentences rather than phrases from a total of 576,704 image captions translated into Korean, the 163,629 sentences that end with the final ending “-다.” were selected. The selected sentences were segmented into morphemes and tagged with parts of speech using the Mecab analyzer (T. KUDO, “Mecab: Yet another part-of-speech and morphological analyzer,” http://mecab.sourceforge.net/, 2005.) through the Korean morphological analysis package KoNLPy (E. L. Park and S. Cho, “Konlpy: Korean natural language processing in python,” Annual Conference on Human and Language Technology, pp. 133-136, 2014.).
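The selection step described above can be sketched as a simple suffix filter. This is a minimal illustration, assuming the captions are already available as Python strings; the function name `select_declaratives` and the sample captions are hypothetical and do not come from the specification.

```python
def select_declaratives(captions):
    """Keep only captions that end with the declarative final ending '-다.'."""
    selected = []
    for caption in captions:
        text = caption.strip()  # ignore stray leading/trailing whitespace
        if text.endswith("다."):
            selected.append(text)
    return selected


# Hypothetical sample captions (not taken from the actual corpus).
captions = [
    "파도가 밀려오는 해변이 보인다.",     # full declarative sentence
    "해변에 있는 파라솔",                # noun phrase without a final ending
    "  남자가 카페에서 커피를 마신다. ",   # declarative with stray whitespace
]
declaratives = select_declaratives(captions)
```

A filter of this kind only checks the surface form, so the morphological analysis that follows is still needed to confirm the sentence structure.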

To keep the concept information as objective as possible, only substantives and predicates were used as parts of speech. The substantives are common nouns, proper nouns, dependent nouns, unit nouns, numerals, and pronouns. The predicates are verbs, adjectives, auxiliary predicates, and the affirmative and negative copulas. Because endings must be taken into account when extracting the predicates used in a Korean sentence, each predicate concept was formed together with its pre-final endings, final endings, connective endings, and transformative endings. Considering the “subject”, “object”, and “predicate” that make up a typical Korean declarative sentence, the concept information set extracted from each sentence through part-of-speech tagging was required to contain at least one predicate and at least two substantives.
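The extraction rule above can be sketched as follows. This is an illustration under stated assumptions, not the specification's implementation: the tag names follow the mecab-ko tag set (NNG, NNP, NNB, NNBC, NR, NP for substantives; VV, VA, VX, VCP, VCN for predicates; EP, EF, EC, ETN, ETM for endings), the input is an already tagged list of `(surface, tag)` pairs, and the example tokens are hand-tagged for illustration rather than real analyzer output.

```python
SUBSTANTIVE_TAGS = {"NNG", "NNP", "NNB", "NNBC", "NR", "NP"}
PREDICATE_TAGS = {"VV", "VA", "VX", "VCP", "VCN"}
ENDING_TAGS = {"EP", "EF", "EC", "ETN", "ETM"}


def extract_concepts(tagged):
    """Return the concept list for one sentence, or None if it has
    fewer than one predicate or fewer than two substantives."""
    concepts = []
    n_substantives = 0
    n_predicates = 0
    i = 0
    while i < len(tagged):
        surface, tag = tagged[i]
        head = tag.split("+")[0]  # compound tags such as 'VV+EF' start with the stem tag
        if head in SUBSTANTIVE_TAGS:
            concepts.append(surface)
            n_substantives += 1
        elif head in PREDICATE_TAGS:
            # Attach the endings that follow the predicate stem.
            j = i + 1
            while j < len(tagged) and set(tagged[j][1].split("+")) <= ENDING_TAGS:
                surface += tagged[j][0]
                j += 1
            concepts.append(surface)
            n_predicates += 1
            i = j
            continue
        i += 1
    if n_predicates >= 1 and n_substantives >= 2:
        return concepts
    return None


# Hand-tagged example sentence: "파도가 밀려오는 해변이 보인다."
sentence = [
    ("파도", "NNG"), ("가", "JKS"), ("밀려오", "VV"), ("는", "ETM"),
    ("해변", "NNG"), ("이", "JKS"), ("보인다", "VV+EF"), (".", "SF"),
]
# Hand-tagged noun phrase with no predicate: "해변의 파라솔"
phrase = [("해변", "NNG"), ("의", "JKG"), ("파라솔", "NNG")]

sentence_concepts = extract_concepts(sentence)
phrase_concepts = extract_concepts(phrase)
```

Particles and punctuation are dropped, so the first example yields four concepts while the noun phrase is rejected for lacking a predicate.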

Among the concept information set-sentence pairs that satisfied the minimum inclusion condition, the present invention used as the dataset those pairs whose set contains between three and six concepts in total. The training dataset consists of 4,354/27,448/56,249/56 concept set-sentence pairs containing between three and six concepts, and the validation dataset consists of 201/309/488/2 concept set-sentence pairs containing between three and six concepts. The evaluation dataset contains concept information not seen in the training data and has a higher proportion of more complex compositions, so that it tests the commonsense reasoning ability and sentence composition ability of the generative model. To this end, each concept information set in the evaluation dataset includes at least one concept that does not exist in the training dataset, and the proportion of concept set-sentence pairs with the maximum of six concepts is increased by 5.1%. However, sentences that contain concepts which do not appear in the training dataset and which are used very infrequently tend to have awkward expressions and grammatical structures that are themselves rarely used. Therefore, a binary classification into awkward and natural sentences may additionally be carried out by three judges who are native Korean speakers with at least a higher-education background. As a result, of the 2,000 concept information-sentence pairs initially assembled for evaluation, 568 awkward sentences were removed by majority agreement, and the remaining 1,432 concept information-sentence pairs were used as the final evaluation dataset.
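The selection conditions in this paragraph can be sketched as three small predicates. This is a hedged illustration: the function names, the use of a Python set for the training vocabulary, and the boolean encoding of the judges' votes (True meaning "awkward") are assumptions made for the example.

```python
def within_size_limits(concepts, low=3, high=6):
    """A pair is eligible when its set holds between three and six concepts."""
    return low <= len(concepts) <= high


def has_unseen_concept(concepts, train_vocab):
    """Evaluation sets must contain at least one concept absent from training."""
    return any(concept not in train_vocab for concept in concepts)


def is_awkward_by_majority(votes):
    """votes holds one boolean per judge; True means the judge found the sentence awkward."""
    return sum(votes) > len(votes) / 2


# Hypothetical training vocabulary and candidate evaluation set.
train_vocab = {"해변", "파도", "보인다", "남자", "커피"}
candidate = ["카페", "남자", "커피", "마신다"]

eligible = within_size_limits(candidate) and has_unseen_concept(candidate, train_vocab)
removed = is_awkward_by_majority([True, False, False])  # only one of three judges objects
```

With three judges, a sentence is removed only when at least two of them mark it as awkward, which corresponds to the majority agreement described above.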

[Table 1]

The evaluation metrics are described below.

Generative models are evaluated in two ways: by the n-gram overlap of the generated sentences with reference sentences, and by a Coverage score that reflects whether the given concept information is included.

First, to evaluate the structural soundness of the sentences generated by a model, the n-gram-based metrics BLEU (K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311-318, 2002.), METEOR (S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65-72, 2005.), and ROUGE (C.-Y. LIN, “Rouge: A package for automatic evaluation of summaries,” Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, pp. 74-81, 2004.) are used. Through n-gram overlap, these metrics partly capture the typical word order and grammatical characteristics of Korean declarative sentences and measure how closely a generated sentence matches the human-written reference sentence.
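To make the idea of n-gram overlap concrete, the sketch below computes the clipped n-gram precision that BLEU is built on. It is a simplified illustration, not the full BLEU score (which also combines several n-gram orders and applies a brevity penalty), and the token lists are hypothetical morpheme-level segmentations.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference,
    counting each n-gram at most as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


reference = ["파도", "가", "밀려오는", "해변", "이", "보인다"]
candidate = ["파도", "가", "보인다"]

unigram_precision = clipped_precision(candidate, reference, 1)
bigram_precision = clipped_precision(candidate, reference, 2)
```

Here every candidate token appears in the reference, but only one of the two candidate bigrams does, which shows how higher-order n-grams penalize word-order differences.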

Second, the generated output is segmented into morphemes, and the extent to which each sentence contains the given concept information is expressed as a Coverage percentage. Coverage indicates whether the generative model recomposes the concept information that it is required to include into a sentence as intended.
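A minimal sketch of the Coverage computation follows. It assumes that the generated sentence has already been segmented into tokens in the same way the concepts were formed and that a concept counts as covered on an exact token match; the specification does not spell out the matching rule, so both points are assumptions.

```python
def coverage(concepts, generated_tokens):
    """Percentage of the given concepts that appear among the generated tokens."""
    if not concepts:
        return 0.0
    present = set(generated_tokens)
    hits = sum(1 for concept in concepts if concept in present)
    return 100.0 * hits / len(concepts)


# Hypothetical example: the generated sentence omits the concept "카페".
concepts = ["카페", "남자", "커피", "마신다"]
generated = ["남자", "가", "커피", "를", "마신다"]

score = coverage(concepts, generated)
```

A corpus-level Coverage would then be the mean of this value over all evaluation pairs.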

In addition, to standardize the values computed by each metric, the generated output is segmented with Mecab-based morpheme segmentation for every metric except ROUGE. Because ROUGE does not evaluate segmented Korean text reliably, its score is instead computed on the index values that each model's tokenizer assigns from its vocabulary.

The experiments and their results are described below.

Experiments were conducted on the KoGPT2 model, which consists only of the Transformer decoder, and on the KoBART and mBART-50 models, which are sequence-to-sequence models with both an encoder and a decoder. The deep learning frameworks used for the experiments were PyTorch 1.9 (A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, Vol. 32, pp. 8026-8037, 2019.) and Huggingface Transformers (https://github.com/huggingface/transformers). The hyperparameters for KoGPT2 were batch 4 with gradient accumulation, initial learning rate 5 × 10^(-5), warmup steps 400, AdamW optimizer (β₁ = 0.9, β₂ = 0.999, ε = 1e-8), block size 128, seed 42, and epochs 5. For KoBART and mBART-50, they were batch 16 with gradient accumulation, initial learning rate 5 × 10^(-5), warmup steps 400, AdamW optimizer (β₁ = 0.9, β₂ = 0.999, ε = 1e-8), epochs 5, max source length 64, max target length 256, and src/tgt language ko_KR (mBART-50 only). The experiments used computing resources consisting of 128GB of RAM, an 18-core Intel Xeon Gold 6230 CPU, and an NVIDIA A6000 (48GB) GPU.

To generate a sentence from the given concept information on the basis of commonsense reasoning, the model is trained to reconstruct, from an input concept set x ∈ X, the sentence y ∈ Y composed of those concepts. Equation 1 below reflects the training process of the generative model.

[Equation 1]

In Equation 1, the length term denotes the maximum generation length for the sentence y, and the probability is computed as a product of successive conditional probabilities over the input sequence (Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” The journal of machine learning research, Vol. 3, pp. 1137-1155, 2003.). The hidden-state term denotes the hidden vector returned by the model, and e(·) denotes the embedding of an input token.

[Equation 2]

Then, as in Equation 2, the model parameters θ are trained on the KommonGen dataset D so as to maximize the log-likelihood of the predicted probability of the next token in autoregressive sentence generation.
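For orientation, the standard autoregressive formulation that matches the surrounding description can be written as below. This is a sketch under assumptions, not a reproduction of Equations 1 and 2: the symbols T (maximum generation length), y_t (the t-th token of y), and p_θ (the model distribution) are chosen here for illustration, and the way the hidden vector and the embedding e(·) enter the conditional probability is omitted.

```latex
\begin{align}
p_\theta(y \mid x) &= \prod_{t=1}^{T} p_\theta\bigl(y_t \mid y_{<t},\, x\bigr) \\
\theta^{*} &= \arg\max_{\theta} \sum_{(x,\, y) \in D} \sum_{t=1}^{T} \log p_\theta\bigl(y_t \mid y_{<t},\, x\bigr)
\end{align}
```

The first line is the chain-rule factorization of the sentence probability, and the second is the maximum log-likelihood objective over the dataset D.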

The trained model generates sentences with beam size 10, max sequence length 30, min sequence length 10, and n-gram size 3 as decoding hyperparameters. The 10 sentences produced by the beam are ranked by the number of given concepts that each contains after segmentation. The candidates are sorted in descending order of that count, and the top-ranked sentence is used as the final output.
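The re-ranking step can be sketched as follows, assuming the beam candidates are already segmented into token lists and that a concept counts as included on an exact token match. The function name `rerank_by_concepts`, the tie-breaking by original beam order, and the example candidates are assumptions made for illustration.

```python
def rerank_by_concepts(candidates, concepts):
    """Return the candidate that contains the most given concepts.
    Ties keep the original beam order because Python's sort is stable."""
    if not candidates:
        return None

    def covered(tokens):
        present = set(tokens)
        return sum(1 for concept in concepts if concept in present)

    ranked = sorted(candidates, key=covered, reverse=True)
    return ranked[0]


concepts = ["카페", "남자", "커피", "마신다"]
# Hypothetical beam output, listed in beam-score order.
candidates = [
    ["남자", "가", "커피", "를", "마신다"],                # 3 of 4 concepts
    ["카페", "에서", "남자", "가", "커피", "를", "마신다"],  # 4 of 4 concepts
    ["카페", "에서", "커피", "를", "마신다"],               # 3 of 4 concepts
]

best = rerank_by_concepts(candidates, concepts)
```

In this example the second candidate is selected even though it was not the highest-scoring beam, because it is the only one that includes every concept.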

[Table 2]

Table 2 compares the performance of the generative models on the KommonGen dataset using the evaluation criteria presented above. On every metric, the sequence-to-sequence models KoBART and mBART-50 outperform KoGPT2, which consists only of a decoder. This is because a sequence-to-sequence model has an encoder that can capture the context on both sides of each input token, which gives it an advantage in commonsense reasoning when recomposing the concept information. In contrast, the decoder-only KoGPT2 performs poorly because the prompt-based generation method does not sufficiently narrow the gap between the pre-training stage, in which a large corpus is learned unidirectionally, and fine-tuning. Coverage also shows that KoGPT2 and KoBART generate sentences with about 23% and 10% of the given concept information missing, respectively. These figures indicate that the two models have insufficient commonsense reasoning ability and have difficulty composing sentences under the given conditions.

In particular, mBART-50, which was pre-trained on multilingual corpora covering 50 languages including Korean, achieves the highest performance in both the n-gram-based evaluation and the evaluation based on the degree of concept information matching. This result appears to be influenced by the fact that it used over 14 GB more Korean corpus for pre-training than the models trained on a Korean-only corpus, and that it has roughly 5 times as many model parameters. It is also attributed to the bidirectional training method that mBART-50 introduced while doubling the 25 languages of the original mBART, which greatly improved the model's understanding of Korean, a medium/low-resource language. Furthermore, the back translation method used for multilingual data augmentation appears to have had a positive effect on reconstructing the KommonGen evaluation dataset, which consists of sentences from the MS-COCO dataset translated into Korean.

Figure 2 shows one example of sentences generated by combining a concept information set from the KommonGen evaluation dataset. First, KoGPT2 generated a sentence that omitted the given concept "카페" (cafe) and used somewhat awkward expressions and Korean grammar. The unstable sentence structure and grammatical errors explain why KoGPT2 scores low on n-gram-based metrics, and they appear to result from insufficient commonsense reasoning about the given concept information. KoBART generated a sentence containing all of the concept information, but repeated the same concept information more than necessary. This indicates that further improvement is needed in combining the given concept information into a sentence. mBART-50 generated a sentence with a more natural and stable structure than the other models. However, its expressions are shorter and simpler than those of the human-written sentence, so improvement appears necessary for it to generate sentences with a more varied vocabulary and richer expressions.

Figure 3 is a functional block diagram of a dataset generation device for a Korean language model according to an embodiment of the present invention.

Referring to Figure 3, the dataset generation device 100, which may also be referred to as a language model evaluation device or the like, includes a data collection unit 110, a preprocessing unit 120, and a dataset extraction unit 130. Depending on the embodiment, the dataset generation device 100 may further include at least one of a learning unit 140, an evaluation unit 150, and a storage unit 160. The dataset generation device 100 may be implemented as a computing device including at least a processor and/or a memory. The computing device may be a personal computer (PC), a laptop computer, a notebook PC, a server, or the like.

The data collection unit 110 may collect (text) data through a predetermined wired or wireless communication network. To this end, the data collection unit 110 may collect (text) data by crawling at least one server that provides text data.

As one example, the data collection unit 110 may collect the caption information of a predetermined dataset (for example, the MS-COCO dataset) through a wired or wireless communication network. As another example, the data collection unit 110 may collect text data from a server that holds and provides a large amount of text data, such as Wikipedia. The data collected by the data collection unit 110 may be stored in the storage unit 160.

The preprocessing unit 120 may perform a preprocessing operation on the collected (text) data. Specifically, the preprocessing unit 120 may extract only declarative sentences from the collected data. To this end, the preprocessing unit 120 may extract only sentences that end with the final ending "-다.".
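The declarative-sentence filter described above can be sketched as a simple suffix check. The function name is hypothetical, and the input is assumed to be already split into sentences.

```python
def extract_declarative_sentences(sentences: list[str]) -> list[str]:
    """Keep only sentences that end with the final ending "다."."""
    result = []
    for sentence in sentences:
        stripped = sentence.strip()  # ignore surrounding whitespace
        if stripped.endswith("다."):
            result.append(stripped)
    return result
```

With this check, questions such as "어디 가니?" and exclamations such as "정말 멋지구나!" are discarded, while "나는 학교에 간다." is kept.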

In addition, the preprocessing unit 120 may perform morphological analysis and/or part-of-speech tagging. For example, Mecab-based morpheme segmentation may be performed using KoNLPy, a Korean morphological analysis package. According to one embodiment, the preprocessing unit 120 may tag only substantives (체언) and predicates (용언). The tagged substantives and predicates constitute a concept information set. The concept information set may also be configured to include the pre-final endings, final endings, connective endings, and transformative endings of the corresponding predicates. In this way, concept information set sentence pairs, each including a concept information set and a sentence, can be constructed.
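As a rough sketch of how a concept information set might be assembled from tagged morphemes: the function below takes a list of `(morpheme, tag)` pairs, which is the shape returned by `konlpy.tag.Mecab().pos(sentence)`. The tag groups are assumptions based on the mecab-ko tag set, the function name is hypothetical, and the sample input in the usage note is hand-written rather than actual analyzer output.

```python
# Assumed tag groups (mecab-ko style); the source does not list specific tags.
SUBSTANTIVE_TAGS = {"NNG", "NNP", "NNB", "NR", "NP"}   # nouns, numerals, pronouns
PREDICATE_TAGS = {"VV", "VA", "VX", "VCP", "VCN"}      # verbs, adjectives, copulas
ENDING_TAGS = {"EP", "EF", "EC", "ETN", "ETM"}         # pre-final, final, connective, transformative


def build_concept_set(
    tagged: list[tuple[str, str]], include_endings: bool = False
) -> list[str]:
    """Collect substantive and predicate morphemes, optionally with endings."""
    keep = SUBSTANTIVE_TAGS | PREDICATE_TAGS
    if include_endings:
        keep = keep | ENDING_TAGS
    concepts: list[str] = []
    for morpheme, tag in tagged:
        # Compound tags such as "VV+EF" are split on "+" before matching.
        if any(part in keep for part in tag.split("+")):
            if morpheme not in concepts:  # keep first occurrence only
                concepts.append(morpheme)
    return concepts
```

For the hand-written input `[("강아지", "NNG"), ("가", "JKS"), ("공원", "NNG"), ("에서", "JKB"), ("뛰", "VV"), ("ㄴ다", "EF"), (".", "SF")]`, the default call keeps "강아지", "공원", and "뛰"; with `include_endings=True` the ending "ㄴ다" is kept as well.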

In addition, the preprocessing unit 120 may perform machine translation on text data written in a different language to convert it into Korean, and then perform the preprocessing operation described above.

The concept information set sentence pairs constructed by the preprocessing unit 120 may be stored in the storage unit 160.

The dataset extraction unit 130 may build a dataset from the concept information set sentence pairs. The dataset may refer to at least one of a training dataset, a validation dataset, and an evaluation dataset.

Specifically, the dataset extraction unit 130 may build the dataset by extracting, from the concept information set sentence pairs, those pairs that have a predetermined number of concept information items (that is, pairs that include a predetermined number, for example at least 3 and at most 6, of predicates and/or substantives).
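The extraction condition above amounts to a size filter over the pairs. A minimal sketch, assuming each pair is represented as a `(concepts, sentence)` tuple and using the example bounds of 3 and 6 as defaults (the function name and representation are illustrative):

```python
def filter_by_concept_count(
    pairs: list[tuple[list[str], str]], min_count: int = 3, max_count: int = 6
) -> list[tuple[list[str], str]]:
    """Keep pairs whose concept set size lies within [min_count, max_count]."""
    return [
        (concepts, sentence)
        for concepts, sentence in pairs
        if min_count <= len(concepts) <= max_count
    ]
```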

According to an embodiment, the datasets may be built so that the concept information sets included in the training dataset, the validation dataset, and the evaluation dataset do not overlap. As one example, the evaluation dataset may be built to include (only) concept information that was not learned from the training dataset (that is, concept information not included in the training dataset). In addition, the evaluation dataset may be built so that its proportion of predicates and/or substantives is higher than that of the training dataset.
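One way to obtain splits whose concept information sets do not overlap is to group pairs by concept set and assign whole groups to a split. The sketch below is an assumption about how this could be done, not the procedure of the embodiment: the function name, the `eval_fraction` and `seed` parameters, and the two-way split are illustrative, and it only guarantees that identical concept sets never appear in both splits (individual concepts may still be shared).

```python
import random
from collections import defaultdict

Pair = tuple[list[str], str]


def split_by_concept_set(
    pairs: list[Pair], eval_fraction: float = 0.2, seed: int = 0
) -> tuple[list[Pair], list[Pair]]:
    """Split pairs so that no concept set occurs in both train and evaluation."""
    groups: dict[frozenset[str], list[Pair]] = defaultdict(list)
    for concepts, sentence in pairs:
        groups[frozenset(concepts)].append((concepts, sentence))

    keys = list(groups)                  # order of first appearance
    random.Random(seed).shuffle(keys)    # reproducible shuffle
    n_eval = round(len(keys) * eval_fraction)

    evaluation = [pair for key in keys[:n_eval] for pair in groups[key]]
    train = [pair for key in keys[n_eval:] for pair in groups[key]]
    return train, evaluation
```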

The dataset built (or extracted) by the dataset extraction unit 130 may be stored in the storage unit 160.

The learning unit 140 may use the generated training dataset to train at least one pre-trained (Korean) language model that is stored in advance. The model may be trained to reconstruct the corresponding sentence y from an input concept set.

The at least one language model trained by the learning unit 140 may be stored in the storage unit 160.

The evaluation unit 150 may evaluate the at least one language model trained using the training data. Specifically, the evaluation unit 150 may input the concept information sets of the evaluation dataset into each of the language models and compare each output sentence (reconstructed sentence) with the sentence included in the corresponding concept information set sentence pair, thereby deriving evaluation metrics for each language model. The evaluation metrics may include at least one of ROUGE-2, ROUGE-L, BLEU 3, BLEU 4, METEOR, and Coverage. As the output sentence of a language model, the sentence that contains the most items of the input concept information set may be selected from among the plurality of sentences generated (reconstructed) by that language model (for example, the 10 sentences generated, one per beam), and the language model may be evaluated on the basis of the selected sentence.
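As an illustration of the Coverage metric, the sketch below assumes Coverage is the fraction of input concepts that appear in the output sentence, averaged over the evaluation examples. That definition, the function names, and the use of substring matching instead of morpheme-level matching are assumptions for illustration.

```python
def concept_coverage(concepts: list[str], sentence: str) -> float:
    """Fraction of the input concepts that occur in the generated sentence."""
    if not concepts:
        return 0.0
    covered = sum(1 for concept in concepts if concept in sentence)
    return covered / len(concepts)


def mean_coverage(examples: list[tuple[list[str], str]]) -> float:
    """Average concept coverage over (concepts, generated sentence) examples."""
    if not examples:
        return 0.0
    total = sum(concept_coverage(concepts, sentence) for concepts, sentence in examples)
    return total / len(examples)
```

Under this definition, a model that drops about 23% of the given concept information would have a mean coverage of roughly 0.77.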

The evaluation results produced by the evaluation unit 150 may be stored in the storage unit 160.

The storage unit 160 may store an operating system (OS), programs, source code, and the like for the operation of the dataset generation device 100. The storage unit 160 may also store at least one pre-trained (Korean) language model. In addition, the storage unit 160 may store the data collected by the data collection unit 110, the data preprocessed by the preprocessing unit 120, the dataset extracted by the dataset extraction unit 130, the result of training by the learning unit 140 (that is, the trained language model), the evaluation results produced by the evaluation unit 150, and the like.

The device described above may be implemented as a hardware component, a software component, and/or a combination of hardware components and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct a processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and a person of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents. Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

100: dataset generation device
110: data collection unit
120: preprocessing unit
130: dataset extraction unit
140: learning unit
150: evaluation unit
160: storage unit

Claims

A dataset generation device comprising:
a data collection unit that collects text data;
a preprocessing unit that performs a preprocessing operation on the text data; and
a dataset extraction unit that builds a dataset from the preprocessed text data,
wherein the dataset extraction unit builds the dataset by extracting concept information set sentence pairs that satisfy a predetermined condition from concept information set sentence pairs that are generated as a result of the preprocessing, each of which consists of a concept information set and a sentence corresponding to the concept information set.

The dataset generation device of claim 1,
wherein the preprocessing unit
extracts declarative sentences from the text data, and
performs morpheme segmentation and part-of-speech tagging on each of the declarative sentences.

The dataset generation device of claim 2,
wherein the preprocessing unit tags parts of speech for the substantives and predicates of each of the declarative sentences.

The dataset generation device of claim 1,
wherein the concept information set includes pre-final endings, final endings, connective endings, and transformative endings of the predicates included in the concept information set.

The dataset generation device of claim 1,
wherein the dataset extraction unit builds, as the dataset, those concept information set sentence pairs that include a predetermined number of predicates and substantives from among the concept information set sentence pairs.

The dataset generation device of claim 1,
further comprising a learning unit that trains at least one pre-trained language model using training data included in the dataset.

The dataset generation device of claim 6,
further comprising an evaluation unit that evaluates the performance of the at least one language model trained by the learning unit using an evaluation dataset included in the dataset.