KR20200066972A

KR20200066972A - Global Relation Extraction Model for Korean Documents

Info

Publication number: KR20200066972A
Application number: KR1020180153759A
Authority: KR
Inventors: 임희석; 조재춘; 김규경; 김경민
Original assignee: 고려대학교 산학협력단
Priority date: 2018-12-03
Filing date: 2018-12-03
Publication date: 2020-06-11

Abstract

Disclosed is a global relation extraction device for Korean documents. The present invention relates to natural language processing and, more particularly, to a model device that assists in document summarization. The invention performs global-level relation extraction by analyzing, as a whole, the fragmentary relations extracted within individual sentences.

Description

Global Relation Extraction Model for Korean Documents

The present invention relates to natural language processing and, more particularly, to a model device that assists in document summarization.

Relation Extraction: Relation extraction plays a key role in extracting structured information from unstructured data. It can be applied not only to text but to any medium that conveys meaning through the relations between entities. Relation extraction falls into two broad kinds: global-level relation extraction and mention-level relation extraction. Most current high-performing relation extraction models belong to the latter kind, and their performance on relations within a sentence is excellent. They are limited, however, when it comes to summarizing a large amount of information and identifying its topic. Rather than extracting relations by analyzing a large body of documents all at once, this work analyzes each sentence first and then attempts global-level relation extraction by analyzing, as a whole, the fragmentary relations extracted within sentences. This approach performs global-level relation extraction while carrying out mention-level relation extraction in parallel, so that loss of information is prevented as far as possible.

Memory-Augmented Neural Network: Because this work carries out global-level and mention-level relation extraction together, it has to process a large number of entity relations. Among these, the relations that involve the main entities have a large influence on global-level relation extraction. For this reason, those relations are stored separately as samples in external memory and are used to classify the corresponding relation during global-level relation extraction. The relations between main entities extracted in this way are then used to readjust the mention-level relations. This applies the one-shot learning method used in meta-learning. In the model this work draws on, the first classification prediction result is stored as a sample in external memory and is retrieved later when a sample of a similar type is predicted; depending on the result, a back-propagated signal is sent to the first classification prediction process to readjust its details once more.

There are two broad kinds of relation extraction: global-level relation extraction and mention-level relation extraction. Most current high-performing relation extraction models belong to the latter kind. Their performance on relations within a sentence is excellent, but their performance and accuracy tend to drop sharply on long texts. Global relation extraction, on the other hand, processes a large amount of text at once, and information is frequently lost. As a result, relation extraction technology is of limited use for relation extraction within documents, where summarization and analysis are most in demand. The present invention attempts global-level relation extraction by analyzing, as a whole, the fragmentary relations extracted within sentences. This approach performs global-level relation extraction while carrying out mention-level relation extraction in parallel, so that loss of information is prevented as far as possible.

To solve the problems of the prior art described above, a preferred embodiment of the present invention provides a global relation extractor for Korean documents.

The invention makes more accurate relation extraction possible when relations are extracted at the document level. In long natural-language texts, relations between entities that are expressed across several sentences generally reflect the topic of the text better than relations that hold within a single sentence, and the model of the present invention, which uses external memory, is less constrained by the difficulty of capturing them. Extracting relations expressed across several sentences therefore summarizes the document more accurately and makes it possible to build a knowledge base that is far more structured and easier to use than one built from fragmentary relations alone. Given that the volume of data keeps growing and classifying it by hand is becoming increasingly difficult, the value of accurate global relation extraction is expected to keep rising.

FIG. 1 is a diagram illustrating the LSTM-based fragmentary relation extractor.
FIG. 2 is a diagram illustrating meta-learning through an external memory neural network.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention. In describing the drawings, similar reference numerals are used for similar components.

The model used in this invention consists of two main parts, and training proceeds in three stages. The model is made up of a fragmentary relation extraction model and an external memory neural network. The three training stages are: training of the fragmentary relation extraction model, training of the memory-augmented neural network for global relations, and finally retraining of the relation extraction model to reflect the results of the memory-augmented neural network training.

Before fragmentary relation extraction is attempted, the text must be preprocessed. Preprocessing consists of sentence splitting, named entity recognition within each sentence, and finally marking the two entities whose relation within the sentence is to be linked (entity linking) and classified (relation classification). These steps were handled with existing models. Entity pairs are marked one pair at a time until the relations between all entities in the sentence have been extracted.
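The pair-by-pair marking step can be sketched as follows. This is a minimal illustration that assumes entity spans have already been produced by an existing named entity recognizer and do not overlap; the function name `mark_entity_pairs` and the marker strings `[E1]`, `[/E1]`, `[E2]`, `[/E2]` are hypothetical and do not come from the original.

```python
from itertools import combinations


def mark_entity_pairs(tokens, entity_spans):
    """Yield one marked copy of the sentence for every pair of entities.

    tokens       : list of token strings for one sentence
    entity_spans : list of (start, end) token indices, end exclusive,
                   assumed non-overlapping (output of an NER step)
    """
    for (s1, e1), (s2, e2) in combinations(sorted(entity_spans), 2):
        marked = []
        for i, tok in enumerate(tokens):
            # Open a marker before the first token of each entity.
            if i == s1:
                marked.append("[E1]")
            if i == s2:
                marked.append("[E2]")
            marked.append(tok)
            # Close the marker after the last token of each entity.
            if i == e1 - 1:
                marked.append("[/E1]")
            if i == e2 - 1:
                marked.append("[/E2]")
        yield (s1, e1), (s2, e2), marked


# Toy sentence with three single-token entities: A, B and C.
tokens = ["A", "works", "at", "B", "in", "C"]
spans = [(0, 1), (3, 4), (5, 6)]
pairs = list(mark_entity_pairs(tokens, spans))
```

Three entities give three marked copies of the sentence, one per pair, and each copy would be passed to the relation extraction model in turn.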

Fragmentary relation extraction model:

The preprocessed text is input to the LSTM-based fragmentary relation extraction model shown in FIG. 1. At this stage the text is analyzed sentence by sentence. Each token in a sentence is embedded into a fixed-dimensional vector using an embedding matrix whose size is determined by the embedding dimension and the size of the vocabulary. Relative to entity 1 and entity 2 marked in the input sentence, every other token is given a position mark according to whether it is closer to entity 1, closer to entity 2, or far from both. This is done through syntactic analysis (part-of-speech tagging), so that each token is assigned to the entity it is grammatically closer to; if the distance exceeds a certain threshold, the token is marked as unrelated to either entity. The resulting n token embeddings are passed through an LSTM (Long Short-Term Memory) network, which converts them into a fixed-size output vector. This output vector represents the relation between the two entities whose in-sentence relation is being encoded. The trained model feeds the output vector into a softmax layer, which determines whether the entity pair is related and classifies the relation, thereby performing mention-level relation extraction. The relation classifier is defined in terms of the softmax layer and the first and second entities as follows.

[Equation not reproduced in the source text.]

In this definition, the softmax layer is parameterized by a weight vector and a bias.
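Since the classifier equation is not preserved in this text, the following is only a sketch of an LSTM encoder followed by a softmax layer, under stated assumptions. Position marks are computed from token-index distance as a stand-in for the grammatical distance the description refers to; the output layer uses a weight matrix `W_out` and bias `b_out` so that several relation classes can be scored; and all names, dimensions, and the random parameters are hypothetical.

```python
import numpy as np


def position_tags(n_tokens, e1_idx, e2_idx, max_dist=3):
    """Tag each token 1 or 2 for the nearer entity, or 0 if both are farther than max_dist."""
    tags = []
    for i in range(n_tokens):
        d1, d2 = abs(i - e1_idx), abs(i - e2_idx)
        if min(d1, d2) > max_dist:
            tags.append(0)
        else:
            tags.append(1 if d1 <= d2 else 2)
    return tags


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def lstm_encode(inputs, W, U, b):
    """Run a single-layer LSTM over inputs of shape (n, d_in); return the last hidden state."""
    d_h = U.shape[1]
    h, c = np.zeros(d_h), np.zeros(d_h)
    for x in inputs:
        z = W @ x + U @ h + b            # stacked pre-activations, shape (4 * d_h,)
        i, f, o, g = np.split(z, 4)      # input, forget, output gates and candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def classify_relation(token_ids, tags, p):
    """Embed tokens and position tags, encode with the LSTM, score relation classes."""
    x = np.concatenate([p["E"][token_ids], p["P"][tags]], axis=1)
    out_vec = lstm_encode(x, p["W"], p["U"], p["b"])
    return softmax(p["W_out"] @ out_vec + p["b_out"])


# Untrained toy parameters: vocabulary 50, 5 relation classes.
rng = np.random.default_rng(0)
V, d, d_p, d_h, n_rel = 50, 8, 4, 16, 5
params = {
    "E": rng.normal(size=(V, d)),                    # token embedding matrix
    "P": rng.normal(size=(3, d_p)),                  # embeddings for position tags 0, 1, 2
    "W": rng.normal(size=(4 * d_h, d + d_p)) * 0.1,
    "U": rng.normal(size=(4 * d_h, d_h)) * 0.1,
    "b": np.zeros(4 * d_h),
    "W_out": rng.normal(size=(n_rel, d_h)) * 0.1,
    "b_out": np.zeros(n_rel),
}
token_ids = [3, 17, 8, 42, 5, 9]
tags = position_tags(len(token_ids), e1_idx=0, e2_idx=3, max_dist=1)
probs = classify_relation(token_ids, tags, params)
```

One of the relation classes can be reserved for "no relation", which lets the same softmax layer decide both whether a pair is related and which relation it holds.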

External memory:

The results obtained from the fragmentary relation extraction model are, however, only mention-level relations. Analyzing entities whose relation spans several sentences requires a further step: a comprehensive analysis of the relations in the text. To this end, the output vectors of the mention-level relations obtained from the text are first stored in external memory in order, and candidate relations that cross sentence boundaries are searched for. If two entities share intermediate entities, all of the relations involved are combined to generate a new prediction vector. The trained memory-augmented neural network then uses this prediction vector to define the relation between entities that belong to different sentences. Each mention-level relation is identified by the order in which it was input and is associated with the global relation it belongs to. The entities at the two ends of a global relation are its representative entities, and the entities between them are its member entities. The prediction vector that represents such a global relation is defined as follows.

[Equation not reproduced in the source text.]

The prediction vector is then classified using cosine similarity to determine which class the global relation belongs to.
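The external-memory step can be illustrated with a small sketch. Because the prediction-vector equation is not preserved here, the combination rule below (an element-wise mean) is an assumption, as are the class `RelationMemory`, the per-class prototype vectors, and the toy data. Only two-hop chains through a single shared entity are handled.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class RelationMemory:
    """Holds mention-level relation vectors in the order they were written."""

    def __init__(self):
        self.entries = []  # list of (head entity, tail entity, vector)

    def write(self, head, tail, vector):
        self.entries.append((head, tail, list(vector)))

    def bridged_candidates(self):
        """Yield (head, middle, tail, v1, v2) for stored relations that share an entity."""
        for h1, t1, v1 in self.entries:
            for h2, t2, v2 in self.entries:
                if t1 == h2 and h1 != t2:
                    yield h1, t1, t2, v1, v2


def combine(v1, v2):
    # Assumption: element-wise mean stands in for the original combination rule.
    return [(a + b) / 2 for a, b in zip(v1, v2)]


def classify(vector, prototypes):
    """Return the label whose prototype vector is most similar to the given vector."""
    return max(prototypes, key=lambda label: cosine(vector, prototypes[label]))


# Toy data: two mention-level relations from different sentences share "Seoul".
memory = RelationMemory()
memory.write("Korea University", "Seoul", [1.0, 0.0, 0.2])
memory.write("Seoul", "South Korea", [0.8, 0.1, 0.0])
prototypes = {"located_in": [1.0, 0.0, 0.0], "founded_by": [0.0, 1.0, 0.0]}

results = [
    (head, tail, classify(combine(v1, v2), prototypes))
    for head, _mid, tail, v1, v2 in memory.bridged_candidates()
]
```

In a trained system the prototypes would be replaced by relation samples held in the memory-augmented network; fixed vectors are used here only to keep the example self-contained.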

Training procedure:

Training the two models described above requires three stages in total. The first stage trains the fragmentary relation extraction model end to end on syntactically parsed sentences in which two entities are marked, using the gold relation between the entities as the target. Global relation extraction with the external memory is learned in the same way. In that case, however, the input is not text but a set of relation tuples: the first round of training is completed by end-to-end training on the prediction vector, obtained by convolving the tuples that lead to a global relation, against the given gold label.

After the first round of training, the second round of training through the memory-augmented neural network begins. Its purpose is to reflect the global relations in the mention-level relations. For each global relation, the entities at each position in the relation are considered in turn, and the sentence in which the relation between such an entity and the main entities of the global relation is to be redefined is identified. An entity in that sentence that belongs to the global relation is then selected, and the mention-level relation containing it is compared with the other relations of the same global relation held in external memory. Specifically, the prediction vector of the global relation to which the entity belongs is checked for similarity to the mention-level relation in the sentence that contains the entity. This similarity is computed using cosine similarity.
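For reference, the standard cosine similarity between two vectors is given below. The symbols are chosen here for illustration and are not the original notation: $\mathbf{u}$ stands for the prediction vector of the global relation and $\mathbf{v}$ for the mention-level relation vector being compared with it.

```latex
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
```

The value lies between -1 and 1, with values near 1 indicating that the two vectors point in nearly the same direction.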

This calculation yields a weight vector. The weight vector is used to decide whether the relation to which the entity in the sentence belongs should be redefined or modified to reflect the global relation and, if so, how it should be modified. The weight vector is obtained as follows.

[Equation not reproduced in the source text.]

If the weight obtained exceeds a threshold, a new mention-level relation prediction vector is generated by applying a sum-product operation to the existing mention-level relation prediction vector, and the fragmentary relation extraction model is retrained. The first round of training is then repeated, followed again by the second round, and this cycle continues until performance is as high as possible.
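The threshold test can be sketched as below. The weight and update equations are not preserved in this text, so the sketch makes two labeled assumptions: the weight is taken to be a single cosine similarity value rather than a vector, and the new vector is a weighted blend of the mention-level and global vectors. The function name `readjust` and the default threshold of 0.5 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def readjust(mention_vec, global_vec, threshold=0.5):
    """Return (vector, changed): the possibly updated mention-level vector and a flag."""
    w = cosine(mention_vec, global_vec)
    if w <= threshold:
        # Not similar enough to the global relation: leave the vector as it is.
        return list(mention_vec), False
    # Assumption: pull the mention-level vector toward the global one in proportion to w.
    new_vec = [(1 - w) * m + w * g for m, g in zip(mention_vec, global_vec)]
    return new_vec, True


# A mention-level vector that agrees with the global relation is updated ...
updated, changed = readjust([1.0, 0.0], [1.0, 1.0])
# ... while an orthogonal one is left untouched.
kept, kept_changed = readjust([0.0, 1.0], [1.0, 0.0])
```

Vectors flagged as changed would form the targets for retraining the fragmentary relation extraction model, after which both training rounds are repeated.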

The embodiments of the present invention described above are disclosed for purposes of illustration. Those of ordinary skill in the art will be able to make various modifications, changes, and additions within the spirit and scope of the present invention, and such modifications, changes, and additions should be regarded as falling within the scope of the following claims.

Claims

A global relation extractor for Korean documents.