KR20220091035A

KR20220091035A - Summary evaluation device, control method thereof and summary evaluation program

Info

Publication number: KR20220091035A
Application number: KR1020200182146A
Authority: KR
Inventors: 이동엽; 고병일; 이다니엘; 신명철; 김응균
Original assignee: 주식회사 카카오; 주식회사 카카오엔터프라이즈
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2022-06-30
Also published as: KR102495881B1

Abstract

The present invention relates to an apparatus, a control method, and a computer program for evaluating a predicted summary of a source document in order to determine whether the predicted summary properly summarizes the contents of the source document. More specifically, the present invention is characterized by obtaining a first comparison result by comparing the source document with the predicted summary, obtaining a second comparison result by comparing a reference (correct-answer) summary with the predicted summary, and combining the first and second comparison results to obtain an evaluation result for the predicted summary.

Description

SUMMARY EVALUATION DEVICE, CONTROL METHOD THEREOF AND SUMMARY EVALUATION PROGRAM

The present invention relates to technology for evaluating summaries of articles, documents, and the like. More specifically, the present invention relates to technology for evaluating summaries produced automatically through machine learning.

With the recent development of information technology, automatic document summarization, in which a computer automatically summarizes articles or documents, has become widespread. With the shift to a digital age centered on electronic documents, the amount of digitized information is growing explosively, and automatic document summarization is increasingly emphasized as a way to address information overload.

As automatic document summarization grows in importance, correct evaluation of the summaries produced by the various summarization techniques is also becoming important.

Automatic document summarization can be divided into extractive summarization and abstractive summarization. Extractive summarization composes a summary by extracting key sentences from the document as they are, so the summary consists of words that appear in the document. Abstractive summarization has a machine learning model directly generate a summary that reflects the contents of the document, so words that do not appear in the document may be generated in the summary.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is commonly used as an existing method for evaluating summaries automatically. In the ROUGE method, an expert prepares a reference (correct-answer) summary, and the automatically generated summary is compared with it to determine how similar the two are. ROUGE evaluates performance based on the n-gram overlap between the summary automatically generated by the model and the reference summary.
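The n-gram overlap idea behind ROUGE can be illustrated with a minimal sketch of ROUGE-N recall. The function names and the whitespace tokenization are assumptions made for illustration only; they are not part of the disclosure, and whitespace tokenization in particular is a simplification that would not suit Korean text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate.

    Hypothetical helper: tokens are obtained by splitting on whitespace.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: each reference n-gram is matched at most as often
    # as it appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


unigram = rouge_n_recall("the cat sat", "the cat sat on the mat", n=1)
bigram = rouge_n_recall("the cat sat", "the cat sat on the mat", n=2)
synonym = rouge_n_recall("physician", "doctor", n=1)
```

In this sketch a synonym such as "physician" for "doctor" scores zero, which is the kind of purely lexical behavior that motivates a semantic metric.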

Because the existing evaluation method described above is based on n-gram overlap, it cannot reflect semantic information, and its evaluation reliability for abstractive summarization is therefore low.

Accordingly, research is needed on technology that can evaluate a summary while reflecting its semantic information.

An object of the present invention is to provide a summary evaluation apparatus, a control method, and a program that have high evaluation reliability and are optimized for Korean.

Another object of the present invention is to provide an apparatus, a method, and a program capable of evaluating a summary through a comparison, taking semantic information into account, between a source document and an automatically generated summary.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention belongs.

According to one aspect of the present invention for solving the above or other problems, there is provided a control method of a summary evaluation apparatus, the method comprising: obtaining a first comparison result by comparing a source document with a predicted summary; obtaining a second comparison result by comparing a reference (correct-answer) summary with the predicted summary; and obtaining an evaluation result for the predicted summary by combining the first and second comparison results.

The method may further comprise: obtaining a first sentence embedding vector based on the source document; and obtaining a second sentence embedding vector based on the predicted summary, wherein the first comparison result may be a first similarity between the first and second sentence embedding vectors.

The method may further comprise obtaining a third sentence embedding vector based on the reference summary, wherein the second comparison result may be a second similarity between the second and third sentence embedding vectors.

The first to third sentence embedding vectors may be embedding vectors that capture the meaning of a sentence.

The first to third sentence embedding vectors may be embedding vectors obtained based on an SBERT (Sentence Bidirectional Encoder Representations from Transformers) model.

The first and second similarities may be cosine similarities.

According to another aspect of the present invention for solving the above or other problems, there is provided a summary evaluation apparatus comprising: a memory that stores instructions; and a processor configured to execute the stored instructions, wherein the processor obtains a first comparison result by comparing a source document with a predicted summary, obtains a second comparison result by comparing a reference summary with the predicted summary, and obtains an evaluation result for the predicted summary by combining the first and second comparison results.

The processor may obtain a first sentence embedding vector based on the source document and a second sentence embedding vector based on the predicted summary, and the first comparison result may be a first similarity between the first and second sentence embedding vectors.

The processor may obtain a third sentence embedding vector based on the reference summary, and the second comparison result may be a second similarity between the second and third sentence embedding vectors.

The effects of the summary evaluation apparatus according to the present invention are as follows.

According to at least one of the embodiments of the present invention, a summary generated by an automatic document summarization technique can be evaluated with high reliability.

In addition, according to at least one of the embodiments of the present invention, a summary can be evaluated through a semantic comparison between the source document and the summary.

Further scope of applicability of the present invention will become apparent from the detailed description below. However, since various changes and modifications within the spirit and scope of the present invention can be clearly understood by those skilled in the art, the detailed description and specific embodiments, such as the preferred embodiments of the present invention, should be understood as given by way of example only.

FIG. 1 is a block diagram of a summary evaluation apparatus 100 according to an embodiment of the present invention.
FIG. 2 is a control flowchart of the summary evaluation apparatus 100 according to an embodiment of the present invention.
FIG. 3 is a detailed block diagram of the first and second comparison units 121 and 122 according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the configuration of the summary evaluation apparatus 100 according to an embodiment.
FIG. 5 is a diagram illustrating experimental results for various summarization models, used to verify the evaluation metric according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating a comparison between the evaluation metric according to an embodiment of the present invention (RDASS) and an existing evaluation metric (ROUGE).
FIG. 7 shows data comparing the efficacy of the language models according to an embodiment of the present invention (P-BERT, FWA-BERT) with that of another language model (MUSE).
FIG. 8 is a table showing the correlation between the evaluation metric according to an embodiment of the present invention and other evaluation metrics.
FIGS. 9 and 10 show qualitative analysis results of experiments conducted to demonstrate the efficacy of an embodiment of the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification and do not in themselves have distinct meanings or roles. In describing the embodiments disclosed in this specification, detailed descriptions of related known technologies are omitted where they could obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments disclosed in this specification; they do not limit the technical idea disclosed herein, which should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present.

A singular expression includes the plural unless the context clearly indicates otherwise.

In this application, terms such as "comprise" and "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

FIG. 1 is a block diagram of a summary evaluation apparatus 100 according to an embodiment of the present invention.

The summary evaluation apparatus 100 according to an embodiment of the present invention may be configured to include a first comparison unit 121, a second comparison unit 122, and an evaluation result obtaining unit 123.

The summary evaluation apparatus 100 is an apparatus for evaluating a predicted summary 102. Based on the source document 101, the predicted summary 102, and the reference (correct-answer) summary 103 input to the apparatus, it determines how well the predicted summary 102 summarizes the contents of the source document 101 and outputs the determination as an evaluation result. The evaluation result may be a numerical score of how well the predicted summary 102 summarizes the contents of the source document 101. For example, it may be expressed as a number between 0 and 1, where a value closer to 1 indicates that the summary reflects the contents of the source document 101 well, and a value closer to 0 indicates that it reflects them poorly.

FIG. 2 is a control flowchart of the summary evaluation apparatus 100 according to an embodiment of the present invention. FIG. 3 is a detailed block diagram of the first and second comparison units 121 and 122 according to an embodiment of the present invention. The following description refers to FIGS. 2 and 3 together.

When the source document 101 is input, the first sentence embedding vector calculating unit 301-1 obtains a first sentence embedding vector (S201-1). This operation converts the source document 101 into a fixed-dimensional vector so that semantic information between sentences can be taken into account. The fixed-dimensional vector can reflect the semantic information of a sentence and is called a sentence embedding vector.

A representative method of obtaining such a sentence embedding vector is to obtain a word embedding value (vector) for each word in the sentence using GloVe, BERT, or the like, and then average these values.
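The averaging approach described above can be sketched as follows. This is a minimal illustration that assumes the per-word embedding vectors have already been produced by some model; the function name `mean_pool` and the toy two-dimensional vectors are hypothetical.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one sentence vector."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    dim = len(word_vectors[0])
    if any(len(vec) != dim for vec in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    n = len(word_vectors)
    # Component j of the sentence vector is the mean of component j
    # over all word vectors.
    return [sum(vec[j] for vec in word_vectors) / n for j in range(dim)]


# Toy example: two words embedded in two dimensions.
sentence_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Whatever the sentence length, the result has the same dimension as a single word vector, which is what allows documents and summaries of different lengths to be compared.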

In an embodiment of the present invention, in order to compare semantic information between sentences, it is proposed to use Sentence-BERT (hereinafter, SBERT), a language model pre-trained on a large-scale dataset segmented by the BPE (Byte Pair Encoding) algorithm. BPE is a subword-level segmentation method used as a preprocessing step in recent natural language processing approaches such as NMT and BERT.
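As a rough illustration of subword segmentation, the following sketch learns BPE merge rules from a toy word-frequency table by repeatedly merging the most frequent adjacent symbol pair. It is a simplified, hypothetical implementation (no end-of-word marker, ties broken by first occurrence) and is not the tokenizer used in the embodiment.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merge rules and the segmented vocabulary."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


merges, segmented = learn_bpe({"abab": 3, "abc": 2}, num_merges=2)
```

With this toy input the pair ("a", "b") is merged first because it is the most frequent, and ("ab", "ab") second, so frequent character sequences end up as single subword units.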

SBERT is a model designed to extract the semantic meaning of a sentence better than BERT by modifying the BERT model structure and fine-tuning it on NLI (natural language inference) datasets.

The sentence embedding vector using SBERT can be obtained by Equations 1 and 2 below.

Here, w_1, ..., w_n denote the individual words of the sentence, [CLS] is a tag added before the first sentence, and [SEP] is a tag that separates sentences or is appended at the end of a sentence.

e_cls, e_1, ..., e_n, e_sep are the word embedding vectors for [CLS], w_1, ..., w_n, and [SEP], respectively. For example, e_1 is the word embedding vector for the word w_1. e_1 reflects not only the semantic information of the word w_1 but can also reflect semantic information about the other words in the sentence (w_2, ..., w_n).

Thereafter, the first sentence embedding vector calculating unit 301-1 may apply mean pooling to the word embedding vectors e_1, ..., e_n to obtain the first sentence embedding vector of the source document 101, as in Equation 2 below.

Mean pooling is only one example; other pooling methods may evidently be applied.

Here, j is the dimension index of the embedding, and n is the number of words constituting the sentence (the source document 101).
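A form of Equations 1 and 2 consistent with the definitions given here can be written as follows. This is a reconstruction from the surrounding description rather than the published equations; the symbol v for the sentence embedding vector and the double index e_{i,j} (component j of word vector e_i) are assumed notation.

```latex
% Equation 1 (assumed form): BERT maps the tagged word sequence
% to one contextual embedding vector per position.
\mathrm{BERT}\bigl([\mathrm{CLS}],\, w_1, \ldots, w_n,\, [\mathrm{SEP}]\bigr)
  = e_{cls},\, e_1, \ldots, e_n,\, e_{sep}

% Equation 2 (assumed form): mean pooling over the n word embeddings,
% taken separately for each embedding dimension j.
v_j = \frac{1}{n} \sum_{i=1}^{n} e_{i,j}
```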

Through the same process, when the predicted summary 102 is input, the second sentence embedding vector calculating unit 301-2 may obtain a second sentence embedding vector (S201-2). Likewise, when the reference summary 103 is input, the third sentence embedding vector calculating unit 301-3 may obtain a third sentence embedding vector (S201-3).

Next, the first similarity calculating unit 302-1 may calculate a first similarity based on the first sentence embedding vector and the second sentence embedding vector (S202-1). The first similarity, and the second similarity described below, indicate how close the two compared vectors are in the vector space.

The first and second similarities according to an embodiment of the present invention are calculated as cosine similarities. The first similarity may be calculated by cosine similarity as in Equation 3 below.

Here, the value obtained from Equation 3 is the first similarity 303-1.

That is, the first similarity can indicate whether the source document 101 and the predicted summary 102 are semantically similar, because it is the similarity between the first sentence embedding vector, which carries the semantic information of the source document 101, and the second sentence embedding vector, which carries the semantic information of the predicted summary 102.

Similarly, the second similarity calculating unit 302-2 may then calculate a second similarity based on the second sentence embedding vector and the third sentence embedding vector (S202-2).

The second similarity may be calculated by cosine similarity as in Equation 4 below.

Here, the value obtained from Equation 4 is the second similarity 303-2.
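Forms of Equations 3 and 4 consistent with the description of cosine similarity can be written as follows. These are reconstructions, and the notation is assumed: v_d, v_p, and v_r stand for the first (source document), second (predicted summary), and third (reference summary) sentence embedding vectors, and s(p, d) and s(p, r) for the first and second similarities.

```latex
% Equation 3 (assumed form): first similarity, predicted summary vs. source document
s(p, d) = \cos(v_p, v_d)
        = \frac{v_p^{\top} v_d}{\lVert v_p \rVert \, \lVert v_d \rVert}

% Equation 4 (assumed form): second similarity, predicted summary vs. reference summary
s(p, r) = \cos(v_p, v_r)
        = \frac{v_p^{\top} v_r}{\lVert v_p \rVert \, \lVert v_r \rVert}
```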

The first and second similarities 303-1 and 303-2 calculated in this way are passed to the evaluation result obtaining unit 123. The evaluation result obtaining unit 123 combines the first and second similarities 303-1 and 303-2 to obtain a final evaluation result (S203) and outputs the obtained evaluation result (S204).

The combination of the first and second similarities 303-1 and 303-2 performed by the evaluation result obtaining unit 123 may be carried out as in Equation 5 below.

Here, RDASS is the final output of the evaluation procedure performed according to an embodiment of the present invention and serves as the evaluation metric for the predicted summary.
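The flow from the three sentence embedding vectors to a single score can be sketched as follows. The text does not reproduce how the two similarities are combined, so this sketch assumes a simple arithmetic mean. That choice, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rdass(v_d, v_p, v_r):
    """Combine document-aware and reference-aware similarity into one score.

    v_d, v_p, v_r: sentence embedding vectors of the source document,
    the predicted summary, and the reference summary.
    """
    s_pd = cosine_similarity(v_p, v_d)  # first similarity (S202-1)
    s_pr = cosine_similarity(v_p, v_r)  # second similarity (S202-2)
    # ASSUMPTION: the combination (S203) is taken to be the arithmetic mean.
    return (s_pd + s_pr) / 2.0


# Toy vectors: the prediction matches the document but is orthogonal
# to the reference, so the two similarities are 1 and 0.
score = rdass(v_d=[1.0, 0.0], v_p=[1.0, 0.0], v_r=[0.0, 1.0])
```

Under this assumption, a summary that agrees with only one of the two comparison targets receives a middling score, while one that agrees with both approaches 1.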

That is, with the evaluation metric RDASS according to an embodiment of the present invention, the predicted summary 102 is evaluated not only by comparing it with the reference summary 103 with their semantic information taken into account, but also by taking into account the semantic information shared between the predicted summary 102 and the source document 101, so higher evaluation reliability can be expected.

Furthermore, the present invention proposes fine-tuning the SBERT model.

SBERT is a trainable model and can be further trained to be better suited to the source document 101 and the reference summary 103. That is, the present invention proposes a fine-tuning method for SBERT that is optimized for an abstractive summarization model.

Most neural approaches to abstractive summarization are based on an encoder-decoder architecture. Formally, given a document D = [w_1, ..., w_k], the goal is to generate a summary y_p = [w_1, ..., w_n] from the hidden representation h_p = [h_1, ..., h_n]. The hidden representation h_p = [h_1, ..., h_n] is the output vector of the decoder and serves as the embedding vector for the predicted summary 102.

In an embodiment of the present invention, SBERT is fine-tuned using the hidden representation of the decoder.

In particular, an embodiment of the present invention proposes using a triplet objective function to fine-tune SBERT.

Given an anchor h_p, a positive reference representation v_pr, a negative representation v_nr, and the Euclidean distance d, the triplet objective function J(p, r) for the predicted summary 102 and the reference summary 103 may be defined as in Equation 6 below.

Here, ε is a margin that ensures that h_p is at least ε closer to v_pr than to v_nr; in an embodiment of the present invention, ε is set to 1. Similarly, the triplet objective function J(p, d) between the source document 101 and the predicted summary 102 may be defined as in Equation 7 below.
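The behavior of a margin-based triplet objective can be illustrated with a small sketch. It assumes the standard hinge form, max(d(anchor, positive) - d(anchor, negative) + ε, 0), which matches the ingredients listed above but is not confirmed by the text. The function names and toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form).

    The loss is zero once the anchor is at least `margin` closer to the
    positive than to the negative, and grows linearly otherwise.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


# Anchor coincides with the positive and is far from the negative: no loss.
satisfied = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# Roles swapped: the anchor is far from the positive, so the loss is large.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

In the setting described above, the anchor would be the decoder representation h_p, with the positive and negative taken from embeddings of matching and non-matching reference summaries (or source documents).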

That is, fine-tuning SBERT consists of minimizing the triplet objective functions J(p, r) and J(p, d), as in Equation 8 below.
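Forms of Equations 6 to 8 consistent with the description can be written as follows. These are reconstructions under stated assumptions: the hinge form of the triplet objective, the symbols v_pd and v_nd for the positive and negative source document representations, and the plain sum in Equation 8 are not confirmed by the text.

```latex
% Equation 6 (assumed form): triplet objective against the reference summary
J(p, r) = \max\bigl( d(h_p, v_{pr}) - d(h_p, v_{nr}) + \epsilon,\; 0 \bigr)

% Equation 7 (assumed form): triplet objective against the source document
J(p, d) = \max\bigl( d(h_p, v_{pd}) - d(h_p, v_{nd}) + \epsilon,\; 0 \bigr)

% Equation 8 (assumed form): overall fine-tuning objective to be minimized
J = J(p, r) + J(p, d)
```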

The SBERT objective function J of Equation 8 is optimized for abstractive summarization. In general, a negative log-likelihood (NLL) objective between the predicted summary 102 and the reference summary 103 is used for abstractive summarization. SBERT fine-tuned in this way is hereinafter referred to as FWA-SBERT.

FIG. 4 is a diagram illustrating the configuration of the summary evaluation apparatus 100 according to an embodiment.

Referring to FIG. 4, the summary evaluation apparatus 100 includes a memory 192 and a processor 191. The memory 192 stores one or more instructions executable by the processor 191. The processor 191 executes the one or more instructions stored in the memory 192. By executing the instructions, the processor 191 may perform one or more of the operations described above with reference to FIGS. 1 to 3. In accordance with the instructions, the processor 191 may evaluate the predicted summary 102 and output the evaluation result.

Embodiments of the present invention have been described so far. The effects of these embodiments are described below on the basis of specific experimental results.

* Preparation of experiments

A model was trained and evaluated using a dataset covering 10 topics, including politics, economy, international affairs, culture, and information technology, taken from the News tab of Daum, a Korean portal site. About 3 million news articles were extracted from it. The numbers of articles used for training, validation, and testing were 2.98M, 0.01M, and 0.01M, respectively. This dataset is referred to as the "Daum/News dataset."

With the Daum/News dataset, the article contents could be fully understood and an appropriate evaluation conducted. The dataset contains articles from 143 newspapers, each with a different summarization style, and it was used in experiments to confirm the effectiveness of the method proposed in the present invention.

SBERT를 활용하기 위해 본 발명에서는 위키, 세종 말뭉치, 웹 문서를 포함하여 2,300만 개의 문장과 160만 개의 문서로 구성된 한국어 데이터 세트에 대해 BERT(bert-base-uncased)를 먼저 사전 학습했다. 다음으로 NLI 및 의미론적 텍스트 유사성(STS) 벤치 마크(STSb) 각각의 분류 및 회귀 목표로 SBERT를 훈련했다.In order to utilize SBERT, in the present invention, bert-base-uncased (BERT) was first pre-learned on a Korean data set consisting of 23 million sentences and 1.6 million documents including wikis, Sejong corpus, and web documents. Next, the SBERT was trained with the classification and regression goals of the NLI and semantic textual similarity (STS) benchmarks (STSb), respectively.

Because the NLI and STSb data sets are in English, Korean NLI and STS data sets translated with Kakao Machine Translator were additionally used. Evaluation on the STS benchmark test data set yielded a Spearman rank correlation of 80.52.
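As an illustration of how a Spearman rank correlation of the kind reported above can be computed, the following minimal Python sketch ranks model similarity scores and gold STS scores and applies the tie-free formula ρ = 1 − 6Σd²/(n(n² − 1)). The function names and the sample scores are hypothetical and are not taken from the experiment, and the sketch assumes that neither list contains tied values.

```python
def ranks(values):
    """Return 1-based ranks of the values (the smallest value gets rank 1)."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch assumes no tied values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation of two equally long, tie-free sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical model similarities and gold STS labels (illustrative only).
predicted = [0.9, 0.1, 0.5, 0.7]
gold = [4.8, 1.0, 3.9, 2.5]
rho = spearman(predicted, gold)
print(round(rho, 2))  # 0.8
```

Only the ordering of the scores matters here, which is why a rank correlation is a natural fit for comparing cosine similarities against labels on a different scale.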

The pre-trained SBERT model was then fine-tuned with an abstractive summarization model so that it captures the contextual information of the correct answer summary 103 and the source document 101 together with the prediction summary 102.

To demonstrate the effect of the present invention, several evaluators were asked to rate relevance, consistency, and flexibility. Relevance indicates the degree of appropriateness of the document, consistency indicates the degree of factuality, and flexibility indicates the degree of quality of the generated summary.

Given the source document 101, the prediction summary 102, and the correct answer summary 103, each evaluator assigned a score from 1 to 5 for each evaluation criterion (that is, relevance, consistency, and flexibility).

Six people in total performed the evaluation: three holding doctoral degrees and three holding master's degrees in computer science. Averaged over 200 sample summaries drawn from the Daum/News data set, the scores were 3.8 for relevance, 3.6 for consistency, and 3.9 for flexibility.

The effect is demonstrated below by comparison with the ROUGE evaluation method.

FIG. 5 is a diagram showing experimental results for various summarization models, used to verify the evaluation metric according to an embodiment of the present invention.

The experimental results shown in FIG. 5 follow a baseline method in which the correct answer summary 103 (Reference Summary) is set as the upper bound (1.00) and the results for the other models are compared against it.

In the results shown, the Lead-1 model selects the first sentence of the source document 101 and uses it as the prediction summary 102, and the Lead-3 model uses the first through third sentences of the source document 101 as the prediction summary 102.
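The Lead-1 and Lead-3 baselines described above can be sketched in a few lines of Python. This is only an illustration: the regular-expression sentence splitter, the function names, and the sample document are assumptions, and a real Korean pipeline would likely use a proper sentence segmenter.

```python
import re


def split_sentences(document):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [part for part in parts if part]


def lead_k(document, k):
    """Return the first k sentences of the document as the prediction summary."""
    return " ".join(split_sentences(document)[:k])


# Hypothetical four-sentence source document (illustrative only).
source = "Stocks rose today. Tech led the gains. Oil fell. Analysts were upbeat."
lead_1 = lead_k(source, 1)
lead_3 = lead_k(source, 3)
print(lead_1)  # Stocks rose today.
print(lead_3)  # Stocks rose today. Tech led the gains. Oil fell.
```

Because both baselines copy text verbatim from the source, they need no training and serve as simple extractive points of comparison.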

Because reporters tend to use implicit wording when summarizing a document, the value of s(p, d) (the similarity between the source document 101 and the prediction summary 102) is relatively low for the upper bound. However, because its s(p, r) score is 1.00, the correct answer summary 103 (Reference Summary) attains the highest RDASS score.
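The relationship between s(p, r), s(p, d), and the RDASS score can be illustrated with a small Python sketch. Cosine similarity between sentence embedding vectors follows the claims; combining the two similarities by a plain average is an assumption made here for illustration, since this section only states that the two comparison results are combined. The function names and the toy 2-dimensional vectors are hypothetical stand-ins for SBERT embeddings.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rdass(v_p, v_r, v_d):
    """Return (score, s_pr, s_pd) for prediction, reference and document vectors."""
    s_pr = cosine_similarity(v_p, v_r)  # prediction vs. correct answer summary
    s_pd = cosine_similarity(v_p, v_d)  # prediction vs. source document
    # Assumption: the two similarities are combined by averaging.
    return (s_pr + s_pd) / 2, s_pr, s_pd


# Toy embeddings (hypothetical values, not real SBERT outputs).
v_d = [3.0, 4.0]  # source document
v_r = [1.0, 0.0]  # correct answer summary

# Scoring the correct answer summary itself as the prediction gives s(p, r) = 1.
score, s_pr, s_pd = rdass(v_r, v_r, v_d)
print(s_pr, s_pd, score)  # 1.0 0.6 0.8
```

The example mirrors the upper-bound row: s(p, r) is exactly 1.00 because the prediction and the reference coincide, while s(p, d) can still be lower.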

For Lead-1, s(p, r) is higher than s(p, d), whereas for Lead-3, s(p, d) is higher than s(p, r). The reason is that Lead-3 includes more sentences from the document, so s(p, r), the similarity to the correct answer summary 103, is lower while s(p, d), the similarity to the source document 101, increases.

The ROUGE performance of the upper bound is relatively low compared with other studies conducted on English data sets. This is because Korean is an agglutinative language in which the same meaning can be expressed in different ways.

However, the RDASS score of the upper bound is similar to the score of the correct answer summary 103. This confirms that the proposed evaluation method can reflect the semantic meaning of the reference summary and the document well.

Although 'BERTSUMABS' shows a higher similarity to the correct answer summary 103 than the upper bound, it is based on a generative model and therefore does not extract sentences from the document as the Lead baselines do. As a result, it shows a relatively low s(p, d) score.

FIG. 6 is a diagram showing a comparison between the evaluation metric according to an embodiment of the present invention (RDASS) and the existing evaluation metric (ROUGE). The comparison was made by analyzing the correlation between each metric and the evaluators' ratings. That is, a metric whose results are closer to the evaluators' ratings can be regarded as the better metric, so the correlation with the evaluators' ratings is analyzed.

FIG. 6(a) shows the comparison in terms of Pearson correlation, and FIG. 6(b) shows the comparison in terms of Kendall rank correlation.
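For reference, the two correlation measures used in FIG. 6 can be computed as in the following Python sketch. The Kendall coefficient is implemented in its tau-b form, which tolerates ties such as repeated 1-to-5 ratings; whether the experiment used this variant is not stated, so that choice is an assumption, as are the function names and the sample scores.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b); assumes neither sequence is constant."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominators
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    return (concordant - discordant) / sqrt(untied_in_x * untied_in_y)


# Hypothetical metric scores and human ratings (illustrative only).
metric_scores = [0.2, 0.4, 0.6, 0.8]
human_scores = [1, 3, 2, 4]
r = pearson(metric_scores, human_scores)
tau = kendall_tau_b(metric_scores, human_scores)
print(round(r, 3), round(tau, 3))  # 0.8 0.667
```

Pearson measures linear agreement between the raw values, while Kendall only compares the ordering of pairs, so reporting both guards against a metric that ranks summaries well but on a distorted scale.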

Referring to FIG. 6(a), the left side shows the Pearson correlation between the evaluators' ratings and ROUGE, and the right side shows the Pearson correlation between the evaluators' ratings and FWA-SBERT (the present invention). The correlation for FWA-SBERT is considerably higher than that for the ROUGE scores.

Referring to FIG. 6(b), the left side shows the Kendall rank correlation between the evaluators' ratings and ROUGE, and the right side shows the Kendall rank correlation between the evaluators' ratings and FWA-SBERT (the present invention). In terms of Kendall rank correlation as well, FWA-SBERT is considerably higher than the ROUGE scores.

According to the graphs in FIGS. 6(a) and 6(b), RDASS shows a higher correlation with the evaluators' judgments than ROUGE does.

Accordingly, the experimental results demonstrate that the summary evaluation technique according to an embodiment of the present invention reflects semantic information in its evaluation better than ROUGE, which is based on n-gram overlap.

The effect of the present invention is described below through a comparison with cases in which other language models are applied.

FIG. 7 shows data comparing the effectiveness of the language models according to an embodiment of the present invention (P-SBERT, FWA-SBERT) with that of another language model (MUSE).

The experiment compares three language models: MUSE (Multilingual Universal Sentence Encoder), P-SBERT (pre-trained SBERT), and FWA-SBERT.

MUSE is a multilingual sentence encoder that uses multi-task learning to embed text from 16 languages into a single semantic space. The model was trained on more than one billion question-answer pairs and is reported to achieve competitive state-of-the-art results on semantic retrieval, bitext retrieval, and retrieval question answering.

P-SBERT is a language model according to an embodiment of the present invention that uses only the pre-trained SBERT, without fine-tuning.

FWA-SBERT is the SBERT model fine-tuned according to an embodiment of the present invention.

Referring to the figure, P-SBERT shows a higher correlation with the evaluators' ratings than MUSE does. Furthermore, FWA-SBERT overall shows the highest correlation with the evaluators' judgments.

That is, FIG. 7 confirms that RDASS, the evaluation metric according to an embodiment of the present invention, comes closer to the evaluators' ratings than the other metrics do, and in particular demonstrates that fine-tuning yields higher effectiveness.

FIG. 8 shows a table of the correlations between the evaluation metrics according to an embodiment of the present invention and other evaluation metrics.

Referring to FIG. 8, the ROUGE metrics are highly correlated with one another. For example, the correlation is 0.84 between ROUGE-1 and ROUGE-2 and 0.99 between ROUGE-1 and ROUGE-L.

However, the correlation between the ROUGE metrics and the evaluation metrics according to the present invention (s(p, r), s(p, d), and RDASS) is relatively low. This can be interpreted as being because the evaluation metrics according to an embodiment of the present invention reflect semantic information that ROUGE could not capture.

FIGS. 9 and 10 show qualitative analysis results of experiments carried out to demonstrate the effectiveness of an embodiment of the present invention.

Referring to FIG. 9, the source document 101, the prediction summary 102, the correct answer summary 103, the RDASS result 701, the ROUGE result 702, and the evaluation by the evaluators 703 are shown. The correct answer summary 103, "메시가 30번째 생일 함께한 이는 아내와 아들" ("The ones Messi spent his 30th birthday with were his wife and son"), and the prediction summary 102, "메시 30번째 생일, 가족과 함께 오붓하게 보내" ("Messi spends his 30th birthday cozily with his family"), carry substantially the same content. The evaluation result 703 given by the evaluators is therefore high.

However, the ROUGE evaluation result 702 is not very high. Because ROUGE does not consider semantic information, it treats the expression '가족' (family) and the expression '아내와 아들' (wife and son) as different.
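The effect described above can be reproduced with a simplified unigram-overlap score in the spirit of ROUGE-1. The following Python sketch is not the scorer used for the ROUGE result 702: whitespace tokenization, punctuation stripping, and the function names are assumptions, and an actual Korean ROUGE setup may tokenize at the morpheme level and would give a different number.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation (a simplification)."""
    tokens = (token.strip(",.'\"") for token in text.split())
    return [token for token in tokens if token]


def rouge1_f1(reference, prediction):
    """F1 of clipped unigram overlap between reference and prediction."""
    ref_counts = Counter(tokenize(reference))
    pred_counts = Counter(tokenize(prediction))
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# The two summaries quoted for FIG. 9.
reference = "메시가 30번째 생일 함께한 이는 아내와 아들"
prediction = "메시 30번째 생일, 가족과 함께 오붓하게 보내"
score = rouge1_f1(reference, prediction)
print(round(score, 3))  # 0.286
```

Only "30번째" and "생일" match exactly; "가족과" shares no token with "아내와 아들", and even "메시가" and "메시" differ because of the attached particle, which is why surface overlap stays low although the meaning is nearly the same.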

In contrast to the ROUGE evaluation result 702, the RDASS result 701 according to an embodiment of the present invention, which takes semantic information into account, is a high score (0.81). This shows that, unlike ROUGE, the evaluation sufficiently considers semantic information.

Likewise, referring to FIG. 10, the correct answer summary 103, "삼성전자, 중남미 최대 시장 브라질에 qled tv 론칭" ("Samsung Electronics launches qled tv in Brazil, the largest market in Latin America"), and the prediction summary 102, "삼성전자, 브라질서 'gled tv' 신제품 출시" ("Samsung Electronics releases new 'gled tv' product in Brazil"), carry substantially the same content. As a result, the evaluation result 703 given by the evaluators is high.

However, the ROUGE evaluation result 702 is not very high. Because ROUGE does not consider semantic information, it treats the expression '론칭' (launching) and the expression '신제품 출시' (new product release) as different.

In contrast to the ROUGE evaluation result 702, the RDASS result 701 according to an embodiment of the present invention, which takes semantic information into account, is a high score (0.71). This shows that, unlike ROUGE, the evaluation sufficiently considers semantic information.

According to the experimental results above, the ROUGE metric widely used to evaluate Korean summaries fails to reflect semantic information. Korean is a language with diverse expressions, so prediction summaries may exist that express the same meaning as the correct answer summary in many different ways. For this reason, high accuracy cannot be expected from the ROUGE metric alone.

To solve this problem, the embodiments of the present invention therefore propose a new evaluation metric, RDASS. It was confirmed that RDASS can reflect the deep semantic relationship between the prediction summary and the source document. Extensive evaluation demonstrated that the evaluation metric according to an embodiment of the present invention (RDASS) correlates more strongly with human judgment than the ROUGE evaluation results do.

Embodiments of the apparatus, control method, and computer program for evaluating a prediction summary according to the present invention have been described above, but they are described as at least one embodiment, and the technical idea of the present invention and its configuration and operation are not limited thereby; the scope of the technical idea of the present invention is not limited by the drawings or by the description referring to the drawings. In addition, the concepts and embodiments presented in the present invention may be used by a person having ordinary skill in the art to which the present invention pertains as a basis for modifying or designing other structures that carry out the same purposes of the present invention. Equivalent structures so modified or changed by a person having ordinary skill in the art remain bound by the technical scope of the present invention described in the claims, and various changes, substitutions, and alterations are possible without departing from the spirit or scope of the invention described in the claims.

Claims

1. A control method of a summary evaluation device, the method comprising:
obtaining a first comparison result by comparing a source document with a prediction summary;
obtaining a second comparison result by comparing a correct answer summary with the prediction summary; and
combining the first and second comparison results to obtain an evaluation result for the prediction summary.

2. The method of claim 1, further comprising:
obtaining a first sentence embedding vector based on the source document; and
obtaining a second sentence embedding vector based on the prediction summary,
wherein the first comparison result is a first similarity between the first and second sentence embedding vectors.

3. The method of claim 2, further comprising:
obtaining a third sentence embedding vector based on the correct answer summary,
wherein the second comparison result is a second similarity between the second and third sentence embedding vectors.

4. The method of claim 3, wherein the first to third sentence embedding vectors are embedding vectors that contain the meaning of a sentence.

5. The method of claim 4, wherein the first to third sentence embedding vectors are embedding vectors obtained based on an SBERT (Sentence Bidirectional Encoder Representations) model.

6. The method of claim 3, wherein the first and second similarities are cosine similarities.

7. A summary evaluation device comprising:
a memory storing instructions; and
a processor configured to execute the stored instructions, wherein the processor is configured to:
obtain a first comparison result by comparing a source document with a prediction summary;
obtain a second comparison result by comparing a correct answer summary with the prediction summary; and
combine the first and second comparison results to obtain an evaluation result for the prediction summary.

8. The summary evaluation device of claim 7, wherein the processor is further configured to:
obtain a first sentence embedding vector based on the source document; and
obtain a second sentence embedding vector based on the prediction summary,
wherein the first comparison result is a first similarity between the first and second sentence embedding vectors.

9. The summary evaluation device of claim 8, wherein the processor is further configured to:
obtain a third sentence embedding vector based on the correct answer summary,
wherein the second comparison result is a second similarity between the second and third sentence embedding vectors.

10. The summary evaluation device of claim 9, wherein the first to third sentence embedding vectors are embedding vectors that contain the meaning of a sentence.

11. The summary evaluation device of claim 10, wherein the first to third sentence embedding vectors are embedding vectors obtained based on an SBERT (Sentence Bidirectional Encoder Representations) model.

12. The summary evaluation device of claim 9, wherein the first and second similarities are cosine similarities.

13. A computer program stored in a medium to execute, in combination with hardware, the method of any one of claims 1 to 6.