KR102564692B1

KR102564692B1 - System and method for processing natural language using correspondence learning

Info

Publication number: KR102564692B1
Application number: KR1020210000530A
Authority: KR
Inventors: 정상근; 서혜인
Original assignee: 충남대학교 산학협력단
Priority date: 2021-01-04
Filing date: 2021-01-04
Publication date: 2023-08-08
Also published as: KR20220098628A

Abstract

Disclosed are a natural language processing system and method using correspondence learning. In a specific implementation, a structure vector is generated from each sentence vector of a plurality of collected corpora, and the generated structure vectors are matched with an input corpus to build a learning model. Learning is then performed on the input corpus based on the built model to estimate a plurality of similar structure vectors whose sentence structure resembles that of the input corpus, and a correct-answer sentence structure tag sequence is generated from a distance function between the estimated similar structure vectors and the sentence vector. As a result, an input sentence can be encoded in a way that reflects its grammatical structure, which allows a natural conversation to proceed.

Description

Natural language processing system and method using correspondence learning {SYSTEM AND METHOD FOR PROCESSING NATURAL LANGUAGE USING CORRESPONDENCE LEARNING}

The present invention relates to a natural language processing system and method using correspondence learning, and more particularly to a technology that estimates the sentence vector of an input sentence and a structure vector having a grammatical structure similar to that of the input sentence, and generates a correct-answer grammatical structure tag sequence as the estimated structure vector and the sentence vector are brought closer together in a vector space, so that a search for sentences having similar grammatical structures can be encoded in real time.

Natural language understanding refers to mechanically analyzing the linguistic phenomena of human utterances and structuring them into a form that a computer can understand. It is one of the core technologies used in information extraction, document summarization, and the like.

A vector space learning method that uses the correspondence, that is, the mutual relationship between sentences and their meanings, has been introduced and is used in various areas of natural language understanding.

However, for word order and sentence structure, which are important factors in determining the meaning of a sentence, natural language understanding systems have no learning method that uses a vector space between sentences and their grammatical structures.

Accordingly, the present applicant proposes a scheme that models the correspondence between sentences and their grammatical structures for a plurality of collected corpora, defines the correspondence between each sentence and its grammatical structure as a distance in a vector space, and performs learning on an input corpus.

Sangkeun Jung. Semantic vector learning for natural language understanding. Computer Speech & Language, 56:130-145, 2019.

Therefore, the present invention aims to provide a natural language processing system and method using correspondence learning that models, through learning, the correspondence between sentences and their grammatical structures for a plurality of collected corpora, defines the correspondence between each sentence and its grammatical structure as a distance in a vector space, performs learning on an input corpus, estimates a plurality of structure vectors having a sentence structure similar to that of the input corpus, and generates a correct-answer sentence structure tag sequence based on a distance function between the estimated structure vectors and the sentence vector, so that an input sentence can be encoded in a way that reflects the grammatical structure of the input corpus.

The objects of the present invention are not limited to those mentioned above. Other objects and advantages not mentioned can be understood from the following description and will become clearer from the embodiments of the present invention. It will also be readily apparent that the objects and advantages of the present invention can be realized by the means set forth in the claims and combinations thereof.

A natural language processing system using correspondence learning according to an embodiment of the present invention includes:

a grammar structure reader that expresses a collected corpus in sentence form through grammatical structure modeling and preprocessing, splits the sentence-form dependency phrase into tokens with a pre-trained BERT, and generates a structure vector from the embedding value of each token to perform learning modeling;

a text reader that splits an input corpus into tokens with the tokenizer of a pre-trained BERT (Bidirectional Encoder Representations from Transformers) language model and generates a sentence vector from the embedding value of each token;

a grammar structure writer that generates similar structure vectors by performing learning, through the learning modeling, on the sentence vector generated by the text reader; and

a learning unit that estimates a correct-answer structure vector based on a distance function between each of the plurality of similar structure vectors from the grammar structure writer and the sentence vector from the text reader.

Preferably, the grammar structure reader may include at least one of:

a dependency structure reader module that expresses the dependency relations between words in sentence form through dependency parsing and preprocessing of a plurality of collected sentences, splits the sentence-form dependency phrase with the BERT tokenizer, and computes an embedding value for each token to output a dependency parse vector;

a phrase structure reader module that expresses, in sentence form, the phrase structure tree produced by phrase structure parsing of the plurality of collected sentences, splits the sentence-form phrase structure with the BERT tokenizer, and computes an embedding value for each token to output a phrase structure vector; and

a morphological analysis reader module that performs part-of-speech tagging on the plurality of sentences, expresses the part-of-speech tag of each morpheme in sentence form, splits the sentence-form tag sequence with the BERT tokenizer, and computes an embedding value for each token to output a morpheme vector.

Preferably, in the preprocessing of the grammar structure reader,

a dependency parse tree is generated through dependency parsing for each sentence vector of the input text reader, the arc labels, which are the relations between words in each generated tree, are converted into a sentence-form dependency phrase, and the result may be derived by applying the pre-trained BERT language model to the converted sentence.

Preferably, the text reader may include:

a BERT tokenizer that sequentially splits the input corpus into tokens; and

a sentence vector output unit that computes an embedding value for each token and outputs the computed embedding values as a sentence vector.

Preferably, the learning unit may include:

a distance cost derivation module that derives a distance cost between the input sentence vector of the text reader and a similar structure vector of the grammar structure writer; and

a correct-answer grammatical structure derivation module that generates a grammatical structure tag sequence for the sentence vector and similar structure vector with the smallest derived distance cost and outputs the generated tag sequence as the correct-answer grammatical structure tag sequence.

Preferably, the learning unit

may further include a learning controller that controls learning performance based on the ground-truth grammatical structure tag sequence and the generated correct-answer grammatical structure tag sequence.

Preferably, the learning controller may include:

a generation cost calculation module that computes a generation cost as the correlation entropy between the ground-truth grammatical structure tag sequence and the generated correct-answer grammatical structure tag sequence;

a loss cost calculation module that computes a loss cost as the sum of the computed generation cost and the distance cost; and

a learning performance control module that controls learning performance using the loss cost.

A natural language processing method using correspondence learning according to another embodiment of the present invention includes:

a grammar structure reading step of expressing a collected corpus in sentence form through grammatical structure modeling, preprocessing the sentence-form dependency phrase, splitting it into tokens with a pre-trained BERT, and generating a structure vector from the embedding value of each token to perform learning modeling;

a text reading step of splitting an input corpus into tokens with the tokenizer of a pre-trained BERT (Bidirectional Encoder Representations from Transformers) language model and generating a sentence vector from the embedding value of each token;

a grammar structure writing step of generating similar structure vectors by performing learning, through the learning modeling, on the generated sentence vector; and

a learning step of estimating a correct-answer structure vector based on a distance function between each of the plurality of similar structure vectors and the sentence vector of the text reader.

Preferably, the preprocessing of the grammar structure reader may be performed in the same manner as described above for the system.

Preferably, the learning step may include:

deriving a distance cost between the input sentence vector of the text reader and a similar structure vector of the grammar structure writer; and generating a grammatical structure tag sequence for the sentence vector and similar structure vector with the smallest derived distance cost and outputting the generated tag sequence as the correct-answer grammatical structure tag sequence.

Preferably, the learning step may further include:

computing a generation cost as the correlation entropy between the ground-truth grammatical structure tag sequence and the generated correct-answer grammatical structure tag sequence; computing a loss cost as the sum of the computed generation cost and the distance cost; and controlling learning performance using the loss cost.

According to an embodiment, a learning model is built for a sentence representation that reflects sentence structure in vector form, using a method of learning the correspondence between each sentence vector and structure vector for a plurality of collected corpora so that sentences with similar sentence structures are placed in nearby regions of the vector space. Learning is then performed on an input corpus based on the built model to estimate a plurality of similar structure vectors whose sentence structure resembles that of the input corpus, and a correct-answer sentence structure tag sequence is generated from a distance function between the estimated similar structure vectors and the sentence vector. The input sentence can thus be encoded in a way that reflects its grammatical structure, which has the effect of allowing a natural conversation to proceed.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description below, serve to further the understanding of its technical idea. The present invention should therefore not be construed as limited to the matters shown in the drawings.
FIG. 1 is a configuration diagram of a natural language processing system using correspondence learning according to an embodiment.
FIG. 2 is a detailed configuration diagram of the grammar structure reader of the system according to an embodiment.
FIG. 3 is an exemplary diagram showing a dependency parse tree of the system according to an embodiment.
FIG. 4 is an exemplary diagram showing a phrase structure analysis result of the system according to an embodiment.
FIG. 5 is an exemplary diagram showing a POS analysis result of the system according to an embodiment.
FIG. 6 is a detailed configuration diagram of the text reader of the system according to an embodiment.
FIG. 7 is a detailed configuration diagram of the learning unit of the system according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. The embodiments are therefore not limited to the specific disclosed forms, and the scope of this specification includes changes, equivalents, and substitutes that fall within its technical idea.

Terms such as "first" and "second" may be used to describe various components, but these terms should be interpreted only as distinguishing one component from another. For example, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is said to be "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to indicate that the described features, numbers, steps, operations, components, parts, or combinations thereof exist, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this specification, are not to be interpreted in an idealized or excessively formal sense.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings.

FIG. 1 is a configuration diagram of a natural language processing system using correspondence learning according to an embodiment, FIG. 2 is a detailed configuration diagram of the grammar structure reader of FIG. 1, FIG. 3 is an exemplary diagram showing the dependency parse tree of FIG. 2, FIG. 4 is an exemplary diagram showing the phrase structure analysis result of FIG. 2, FIG. 5 is an exemplary diagram showing the POS analysis result of FIG. 2, FIG. 6 is a detailed configuration diagram of the text reader of FIG. 1, and FIG. 7 is a detailed configuration diagram of the learning unit of FIG. 1.

Referring to FIGS. 1 to 7, the natural language processing system using correspondence learning according to an embodiment builds a learning model for a sentence representation that reflects sentence structure in vector form, using a method of learning the correspondence between each sentence vector and structure vector for a plurality of collected corpora so that sentences with similar sentence structures are placed in nearby regions of the vector space. It then performs learning, based on the built model, on the sentence vector of an input corpus to estimate a plurality of similar structure vectors having a similar sentence structure, and generates a correct-answer sentence structure tag sequence based on a distance function between each of the estimated similar structure vectors and the sentence vector. To this end, the system may include a grammar structure reader (100), a text reader (200), a grammar structure writer (300), and a learning unit (400).

Here, the grammar structure reader (100) is configured to build a learning model by generating a structure vector for each of the collected sentence vectors. To this end, as shown in FIG. 2, the grammar structure reader (100) may include at least one of a dependency structure reader module (110), a phrase structure reader module (120), and a morpheme structure (POS: part-of-speech) reader module (130).

The dependency structure reader module (110) expresses the dependency relations between words in sentence form through dependency parsing and preprocessing of a plurality of collected sentences, splits the sentence-form dependency phrase with the BERT tokenizer, computes an embedding value for each token to output a dependency parse vector, and performs learning modeling with the output dependency parse vector.

Here, in the preprocessing, a dependency parse tree is generated through dependency parsing of each of the collected sentences, the arc labels, which are the relations between words in each generated tree, are converted into a sentence-form dependency phrase, and a dependency structure vector is output by applying the pre-trained BERT language model to the converted sentence.

For example, referring to FIG. 3, the dependency phrase for the collected corpus “on april first I need a flight from phoenix to san diego” is “prep-prbj-advmod-nsubj-ROOT-det_dobj_aci-prep-pobj-prep-compoung-pobj”.
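The conversion of arc labels into a sentence-form dependency phrase can be illustrated with a minimal sketch. It assumes the parser output is already available as one (token, head index, arc label) triple per token; the `Arc` type, the function name `arcs_to_structure_sentence`, and the hand-written toy parse are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple

# Hypothetical parser output: one (token, head_index, arc_label) triple per
# token, in surface order. A real system would obtain these from a dependency
# parser; the triples below are hand-written for illustration only.
Arc = Tuple[str, int, str]


def arcs_to_structure_sentence(arcs: List[Arc], sep: str = "-") -> str:
    """Join the arc labels, in token order, into one sentence-form string."""
    return sep.join(label for _token, _head, label in arcs)


toy_parse: List[Arc] = [
    ("I", 1, "nsubj"),
    ("need", 1, "ROOT"),
    ("a", 3, "det"),
    ("flight", 1, "dobj"),
]

structure_sentence = arcs_to_structure_sentence(toy_parse)
print(structure_sentence)  # nsubj-ROOT-det-dobj
```

The resulting string plays the role of the sentence-form dependency phrase that is subsequently tokenized and embedded.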

Here, when the collected corpus is in Korean, the ETRI dependency parsing API published by AIOpen is used, and for English sentences the spaCy open-source library is used. The sequence of steps in this dependency parsing is not described in detail in this specification, but should be understood at the level of a person skilled in the art.

The phrase structure reader module (120) expresses, in sentence form, the phrase structure tree produced by phrase structure parsing of the plurality of collected sentences, splits the sentence-form phrase structure with the BERT tokenizer, and computes an embedding value for each token to output a phrase structure vector, with which learning modeling is performed.

For example, as shown in FIG. 4, the phrase structure for the collected sentence “on april first I need a flight from phoenix to san diego” is “(S (PP (IN) (NP (NP (NNP)) (ADVP (RB)))) (NP (PRP)) (VP (VBP) (NP (NP (DT) (NN)) (VP (VBG) (PP (IN) (NP (NNP))) (PP (IN) (NP (NNP) (NNP)))))))”.
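The bracketed string above keeps only the constituent and part-of-speech labels and drops the words. A minimal sketch of such a serialization follows, assuming the parse tree is given as nested tuples of the form (label, child, ...), where a plain string child is a word. The tuple encoding, the function name `strip_words`, and the toy tree are illustrative assumptions, not the parser format used in the specification.

```python
from typing import Union

# Toy constituency tree: (label, child, child, ...); a plain string is a word.
Tree = Union[str, tuple]


def strip_words(tree: Tree) -> str:
    """Serialize a tree to brackets, keeping labels and dropping leaf words."""
    if isinstance(tree, str):
        return ""  # a leaf word contributes nothing to the structure string
    label, *children = tree
    parts = [strip_words(child) for child in children]
    inner = " ".join(part for part in parts if part)
    return f"({label} {inner})" if inner else f"({label})"


toy_tree = (
    "S",
    ("NP", ("PRP", "I")),
    ("VP", ("VBP", "need"), ("NP", ("DT", "a"), ("NN", "flight"))),
)

print(strip_words(toy_tree))  # (S (NP (PRP)) (VP (VBP) (NP (DT) (NN))))
```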

Meanwhile, the morpheme structure (POS tagging) reader module (130) performs part-of-speech tagging on the plurality of sentences, expresses the part-of-speech tag of each morpheme in sentence form, splits the sentence-form tag sequence with the BERT tokenizer, computes an embedding value for each token to output a morpheme vector, and performs learning modeling based on it.

For example, referring to FIG. 5, the part-of-speech tag sequence for the collected sentence “on april first I need a flight from phoenix to san diego” is “ADP-PROPN-ADV-PRON-VERB-DET-NOUN-VERB-ADP-PROPN-ADP-PROPN-PROPN”.

Meanwhile, the input corpus is passed to the text reader (200), which is configured to generate a sentence vector based on a pre-trained BERT (Bidirectional Encoder Representations from Transformers) language model and preprocessing. To this end, referring to FIG. 6, the text reader (200) includes a BERT tokenizer (210) and a sentence vector output unit (220).

Here, in the preprocessing, the grammar structure writer (300) generates a dependency parse tree through dependency parsing of the input corpus, converts the arc labels, which are the relations between words in each generated tree, into a sentence-form dependency phrase, and passes the converted sentence to the pre-trained BERT language model.

The BERT tokenizer (210) of the pre-trained BERT language model then sequentially splits the input sentence into tokens, and the tokens are passed to the sentence vector output unit (220). The sentence vector output unit (220) computes an embedding value for each token and outputs the computed embedding values as a sentence vector.
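The two stages of the text reader (tokenization, then a sentence vector built from token embeddings) can be sketched with a toy stand-in. The sketch below does not use a real BERT model: it assumes a tiny hand-written vocabulary with static 3-dimensional embeddings, a greedy longest-match sub-word split in the style of WordPiece, and mean pooling over tokens. The specification does not state how token embeddings are combined, so the pooling choice, the names `TOY_EMBEDDINGS`, `wordpiece_tokenize`, and `sentence_vector`, and all values are assumptions.

```python
from typing import Dict, List

# Hypothetical vocabulary with static toy embeddings. A real pre-trained BERT
# would produce contextual embeddings of much higher dimension.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "flight": [1.0, 0.0, 0.0],
    "to": [0.0, 1.0, 0.0],
    "san": [0.0, 0.0, 1.0],
    "##s": [0.5, 0.5, 0.0],
    "[UNK]": [0.0, 0.0, 0.0],
}


def wordpiece_tokenize(text: str, vocab: Dict[str, List[float]]) -> List[str]:
    """Greedy longest-match-first sub-word split, WordPiece style."""
    tokens: List[str] = []
    for word in text.lower().split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation piece
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:  # no piece matched: the whole word is unknown
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens


def sentence_vector(tokens: List[str], emb: Dict[str, List[float]]) -> List[float]:
    """Mean-pool the token embeddings into one vector (assumed pooling)."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(emb[token]):
            total[i] += value
    return [x / len(tokens) for x in total] if tokens else total


tokens = wordpiece_tokenize("flights to san", TOY_EMBEDDINGS)
print(tokens)  # ['flight', '##s', 'to', 'san']
print(sentence_vector(tokens, TOY_EMBEDDINGS))  # [0.375, 0.375, 0.25]
```

The same two functions could be applied to a sentence-form structure string, which is how the reader modules and the text reader end up in a shared vector space in this sketch.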

The output sentence vector is then passed to the grammar structure writer (300). The grammar structure writer (300) performs learning on the sentence vector from the text reader (200) based on the built learning model and outputs a plurality of similar structure vectors for the input sentence vector.

For example, for the input sentence “오늘 미세먼지 정보를 알려줄래?” (“Can you tell me today's fine dust information?”), a dependency parse tree is output through dependency parsing.

The arc labels, which are the relations between words in the generated dependency parse tree, expressed as a sentence-form dependency phrase are “NP_AJT-NP-NP_OBJ-VP”.

The sentence-form dependency phrase is then processed into a grammatical-structure sentence and run through the pre-trained BERT language model to derive the final plurality of similar structure vectors, each of which is passed to the learning unit (400).

The learning unit (400) is configured to estimate a correct-answer structure vector based on a distance function between each of the plurality of similar structure vectors from the grammar structure writer and the sentence vector from the text reader (200). To this end, referring to FIG. 7, the learning unit (400) may include a distance cost derivation module (410) and a correct-answer grammatical structure derivation module (420).

The distance cost derivation module (410) derives a distance cost, using a predetermined cosine distance function, between the input sentence vector of the text reader (200) and each similar structure vector of the grammar structure writer (300).

That is, a grammatical structure tag sequence is generated for the sentence vector and structure vector that are closest to each other in the vector space, with the distance derived by a cosine function. The procedure for deriving a distance in a vector space with a cosine function is similar to that disclosed in the prior literature, so a detailed description is omitted.
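Since the specification omits the details of the cosine computation, the following is only a sketch of one common definition: cosine distance taken as one minus cosine similarity. That definition, the function names `cosine_distance` and `distance_costs`, and the toy vectors are assumptions and are not taken from the specification or the cited prior literature.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(u, v): 0 for the same direction, 2 for opposite."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_costs(
    sentence_vec: Sequence[float], structure_vecs: Sequence[Sequence[float]]
) -> List[float]:
    """Distance cost between one sentence vector and each structure vector."""
    return [cosine_distance(sentence_vec, s) for s in structure_vecs]


costs = distance_costs([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(costs)  # [0.0, 1.0, 2.0]
```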

The correct-answer grammar structure derivation module 420 then generates a grammatical structure tag sequence for the sentence vector and structure vector with the smallest derived distance cost, and outputs it as the correct-answer grammatical structure tag sequence. Natural language understanding of the input query sentence is performed using this tag sequence, and a response sentence is then generated.
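The selection step performed by module 420 can be sketched as an argmin over candidate structure vectors. This is an illustrative sketch only: the candidate list, the three-dimensional toy vectors, the second tag sequence, and the name `pick_closest` are all assumptions, and real vectors would come from the pre-trained language model.

```python
import math
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (tag sequence, structure vector)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity, assumed here as the distance cost."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def pick_closest(sentence_vec: Sequence[float],
                 candidates: List[Candidate]) -> Tuple[str, float]:
    """Return the tag sequence whose structure vector has the smallest cost."""
    tags, vec = min(candidates, key=lambda c: cosine_distance(sentence_vec, c[1]))
    return tags, cosine_distance(sentence_vec, vec)


# Hypothetical candidates produced by the grammar structure writer.
candidates: List[Candidate] = [
    ("NP_AJT-NP-NP_OBJ-VP", [1.0, 0.0, 0.0]),
    ("NP_SBJ-VP", [0.0, 1.0, 0.0]),
]
best_tags, best_cost = pick_closest([0.9, 0.1, 0.0], candidates)
print(best_tags)
```

With these toy values the first candidate is selected, because its vector points in nearly the same direction as the sentence vector.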

Meanwhile, the natural language processing system using relational learning further includes a learning controller that controls learning performance based on the actually measured correct-answer grammatical structure tag sequence and the generated grammatical structure tag sequence. As shown in FIG. 7, the learning controller includes a generation cost calculation module 430, a loss cost calculation module 440, and a learning performance control module 450.

That is, the generation cost calculation module 430 derives a generation cost from the correlation entropy between the actually measured correct-answer grammatical structure tag sequence and the generated grammatical structure tag sequence, and transmits the derived generation cost to the loss cost calculation module 440.

The loss cost calculation module 440 calculates a loss cost as the sum of the calculated generation cost and the distance cost, and transmits the calculated loss cost to the learning performance control module 450.

The learning performance control module 450 maximizes learning performance by adjusting the learning variables so that the loss cost is minimized.
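The cost composition described above can be sketched as follows. The text names the generation cost a "correlation entropy" without giving a formula; this sketch assumes it behaves like a token-level cross-entropy between the generated tag distribution and the measured tags, which is an interpretation and not confirmed by the source. All function names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def generation_cost(predicted_probs: Sequence[Sequence[float]],
                    gold_indices: Sequence[int]) -> float:
    """Assumed form: mean negative log-probability of each measured tag."""
    if not gold_indices or len(predicted_probs) != len(gold_indices):
        raise ValueError("need one probability row per gold tag")
    total = 0.0
    for probs, gold in zip(predicted_probs, gold_indices):
        total += -math.log(probs[gold])
    return total / len(gold_indices)


def loss_cost(gen_cost: float, distance_cost: float) -> float:
    """Loss cost as the plain sum of generation cost and distance cost."""
    return gen_cost + distance_cost


# Toy example: two tag positions, two possible tags each.
probs: List[List[float]] = [[0.5, 0.5], [0.25, 0.75]]
gold = [0, 1]
gen = generation_cost(probs, gold)
total = loss_cost(gen, distance_cost=0.1)
print(round(gen, 4), round(total, 4))
```

A training loop would adjust the model parameters to reduce `total`, which lowers both the tag-generation error and the distance between the sentence vector and the structure vector.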

Here, the learning variables are the training parameters of a general recurrent neural network (RNN) language model. An encoding device trained on an RNN model consists of an encoder, which includes an input layer and a hidden layer, and a decoder, which includes a hidden layer and an output layer.

The model trained in the encoder and decoder continues to generate data during training; once training is complete, feeding the decoder a distribution in the encoder's output format produces a consistent output. In other words, an RNN learns by using earlier and/or later data in a time series. RNNs are based on the idea that the output at time step t can also be influenced by inputs from earlier and/or later time steps. For example, to fill a blank in a Korean-language exercise with the best-matching word, the context of the words after the blank can be inferred from the sentences preceding it. A bidirectional RNN contains two RNNs, and its output is determined by the hidden layers of both.

The following describes a simulation in which the correct-answer sentence structure tag sequences, derived from the training results for the Korean weather and nevi corpora and the English Atis and snips corpora, are evaluated against predetermined evaluation criteria (BLEU and ROUGE).

Here, BLEU scores a result against a reference by using n-grams to measure how much the two sequences overlap, and ROUGE derives a score from the length, in words, of the longest overlapping span between two sentences.
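The two criteria can be illustrated with a simplified sketch. It computes only the clipped n-gram precision at the core of BLEU (full BLEU also applies a brevity penalty and combines several n-gram orders) and a longest-common-subsequence F-score in the style of ROUGE-L. Which ROUGE variant the evaluation used is not stated, so ROUGE-L is an assumption, and the tag sequences are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def ngram_precision(candidate: Sequence[str], reference: Sequence[str], n: int) -> float:
    """Clipped n-gram precision: share of candidate n-grams found in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    dp: List[List[int]] = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_f1(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """F1 of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(candidate), lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


reference = "NP_AJT NP NP_OBJ VP".split()
candidate = "NP_AJT NP VP".split()
print(ngram_precision(candidate, reference, 1))  # 1.0
print(ngram_precision(candidate, reference, 2))  # 0.5
print(rouge_l_f1(candidate, reference))
```

In this toy case every candidate tag appears in the reference, so the 1-gram precision is perfect, while the missing `NP_OBJ` tag lowers both the 2-gram precision and the LCS-based recall.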

As an example, for the sentence vector and syntax vector of the input corpus “Will you tell me today's fine dust information?” (“오늘 미세먼지 정보 알려줄래”), the similar dependency structure vectors derived as a training result are shown in Table 1 below.

[Table 1]

Referring to Table 1, for the corpus “Will you tell me today's fine dust information?” and its sentence vector “NP_AJT_NP_NP_OBJ_VP”, when dependency phrases such as “Tell me today's fine dust level” and “Tell me tomorrow's weather in Dokdo”, together with the dependency syntax vector “NP_AJT_NP_NP_OBJ_VP”, are transmitted to the learning unit 400, the training result yields the similar dependency phrase “Tell me the highest temperature in Gyeongbuk the day after tomorrow” and the dependency syntax vector “NP_AJT_NP_NP_OBJ_VP”.

Here, the sentence vectors are derived with the pre-trained BERT-Base, Multilingual Cased model for the Korean corpora and with the pre-trained BERT-Base, Cased model for the English corpora.

Likewise, the structure vectors are derived with the pre-trained BERT-Base, Multilingual Cased model for the Korean corpora and with the pre-trained BERT-Base, Cased model for the English corpora.

The results of evaluating these training results against the predetermined evaluation criteria are shown in Table 2 below.

[Table 2]

Referring to Table 2, performance on the weather, nevi, Atis, and snips corpora is higher than the baseline, and for BLEU the largest improvement is observed for 1-grams.

A natural language processing method using relational learning according to another embodiment of the present invention includes:

a grammar structure reading step of expressing a collected corpus in sentence form through grammatical structure modeling, preprocessing the sentence-form dependency phrases, separating them into tokens through a pre-trained BERT, and generating structure vectors from the embedding value of each token to build a learning model; a text reader step of separating an input corpus into tokens through the tokenizer of a pre-trained BERT (Bidirectional Encoder Representations from Transformers) language model and then generating a sentence vector from the embedding value of each token; a grammar structure writing step of generating similar structure vectors by performing learning, through the learning model, on the generated sentence vector; and a learning step of estimating a correct-answer structure vector based on a distance function between each of the plurality of similar structure vectors and the sentence vector of the text reader. Each of these steps is a function performed by the above-described grammar structure reader 100, text reader 200, grammar structure writer 300, and learning unit 400, and a detailed description is therefore omitted.

Accordingly, in one embodiment, a learning model is built for a sentence representation method that reflects sentence structure in vector form, by learning the association between each sentence vector and structure vector of the collected corpora so that sentences with similar sentence structures are placed close together in the vector space. Learning is then performed on an input corpus based on the built model to estimate a plurality of similar structure vectors whose sentence structure resembles that of the input corpus, and a correct-answer sentence structure tag sequence is generated based on the distance function between the estimated similar structure vectors and the sentence vector. The input sentence can thereby be encoded in a way that reflects its grammatical structure, allowing a natural conversation to proceed.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but those of ordinary skill in the art will appreciate that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, those of ordinary skill in the art can apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the components of the described system, structure, device, circuit, and the like are coupled or combined in a different form from the described method, or are replaced or substituted by other components or equivalents.

100: grammar structure reader
200: text reader
300: grammar structure writer
400: learning unit

Claims

A natural language processing system using relational learning, comprising:
a grammar structure reader that expresses a collected corpus in sentence form through grammatical structure modeling and preprocessing, separates the sentence-form dependency phrases into tokens through a pre-trained BERT, and generates structure vectors from the embedding value of each token to build a learning model;
a text reader that separates an input corpus into tokens through the tokenizer of a pre-trained BERT (Bidirectional Encoder Representations from Transformers) language model and then generates a sentence vector from the embedding value of each token;
a grammar structure writer that generates similar structure vectors by performing learning, through the learning model, on the sentence vector generated by the text reader; and
a learning unit that estimates a correct-answer structure vector based on a distance function between each of the plurality of similar structure vectors of the grammar structure writer and the sentence vector of the text reader,
wherein the learning unit calculates a generation cost from the correlation entropy between the actually measured correct-answer grammatical structure tag sequence and the generated correct-answer grammatical structure tag sequence, calculates a loss cost as the sum of the calculated generation cost and the distance cost between the input sentence vector of the text reader and the similar structure vector of the grammar structure writer, and controls learning performance using the loss cost.

The system of claim 1, wherein the grammar structure reader comprises at least one of:
a dependency structure reader module that expresses, for a plurality of collected sentences, the dependencies between words in sentence form through dependency parsing and preprocessing, separates the sentence-form dependency phrases into tokens through a BERT tokenizer, calculates an embedding value for each token, and outputs the result as a dependency syntax vector;
a phrase structure reader module that expresses in sentence form the phrase structure tree generated by phrase structure parsing of the plurality of collected sentences, separates the sentence-form phrase structure into tokens through the BERT tokenizer, calculates an embedding value for each token, and outputs the result as a phrase structure vector; and
a morpheme analysis reader module that performs part-of-speech tagging on the plurality of sentences, expresses the part-of-speech tags of each morpheme in sentence form, separates the sentence-form morpheme analysis into tokens through the BERT tokenizer, calculates an embedding value for each token, and outputs the result as a morpheme vector.

The system of claim 2, wherein the preprocessing of the grammar structure reader
generates a dependency parse tree for each sentence vector of the input text reader through dependency parsing, converts the arc labels, which are the relationships between words in each generated tree, into a dependency phrase in sentence form, and transmits the converted sentence to the pre-trained BERT language model.

The system of claim 3, wherein the text reader comprises:
a BERT tokenizer that sequentially separates the input corpus into tokens; and
a sentence vector output unit that calculates an embedding value for each token and outputs the calculated embedding values as a sentence vector.

The system of claim 4, wherein the learning unit comprises:
a distance cost derivation module that derives a distance cost between the input sentence vector of the text reader and a similar structure vector of the grammar structure writer; and
a correct-answer grammar structure derivation module that generates a grammatical structure tag sequence for the sentence vector and the similar structure vector having the smallest derived distance cost, and outputs the generated tag sequence as the correct-answer grammatical structure tag sequence.

The system of claim 5, wherein the learning unit
further comprises a learning controller that controls learning performance based on the actually measured correct-answer grammatical structure tag sequence and the generated correct-answer grammatical structure tag sequence.

The system of claim 6, wherein the learning controller comprises:
a generation cost calculation module that calculates a generation cost based on the correlation entropy between the actually measured correct-answer grammatical structure tag sequence and the generated correct-answer grammatical structure tag sequence;
a loss cost calculation module that calculates a loss cost as the sum of the calculated generation cost and the distance cost; and
a learning performance control module that controls learning performance using the loss cost.

A natural language processing method using relational learning, performed by the natural language processing system using relational learning of claim 1 that includes the grammar structure reader, the text reader, and the learning unit, the method comprising:
a grammar structure reading step of expressing a collected corpus in sentence form through grammatical structure modeling, preprocessing the sentence-form dependency phrases, separating them into tokens through a pre-trained BERT, and generating structure vectors from the embedding value of each token to build a learning model;
a text reader step of separating an input corpus into tokens through the tokenizer of a pre-trained BERT (Bidirectional Encoder Representations from Transformers) language model and then generating a sentence vector from the embedding value of each token;
a grammar structure writing step of generating similar structure vectors by performing learning, through the learning model, on the generated sentence vector; and
a learning step of estimating a correct-answer structure vector based on a distance function between each of the plurality of similar structure vectors and the sentence vector of the text reader,
wherein the learning step calculates a generation cost from the correlation entropy between the actually measured correct-answer grammatical structure tag sequence and the generated correct-answer grammatical structure tag sequence, calculates a loss cost as the sum of the calculated generation cost and the distance cost between the input sentence vector of the text reader and the similar structure vector of the grammar structure writer, and controls learning performance using the loss cost.

The method of claim 8, wherein the preprocessing comprises:
generating a dependency parse tree for each sentence vector of the input text reader through dependency parsing;
converting the arc labels, which are the relationships between words in each generated tree, into a dependency phrase in sentence form; and
transmitting the sentence of the converted dependency phrase to the pre-trained BERT language model.

The method of claim 8, wherein the learning step comprises:
deriving a distance cost between the input sentence vector of the text reader and the similar structure vector of the grammar structure writer; and
generating a grammatical structure tag sequence for the sentence vector and the similar structure vector having the smallest derived distance cost, and outputting the generated tag sequence as the correct-answer grammatical structure tag sequence.

The method of claim 10, wherein the learning step further comprises:
calculating a generation cost based on the correlation entropy between the actually measured correct-answer grammatical structure tag sequence and the generated correct-answer grammatical structure tag sequence;
calculating a loss cost as the sum of the calculated generation cost and the distance cost; and
controlling learning performance using the loss cost.

A computer-readable recording medium on which a program for executing the natural language processing method using relational learning according to any one of claims 8 to 11 is recorded.