KR102535852B1

KR102535852B1 - TextRank-based key sentence extraction method and device using BERT sentence embedding vectors

Info

Publication number: KR102535852B1
Application number: KR1020200067679A
Authority: KR
Inventors: 손영두; 양승호; 신석원
Original assignee: 동국대학교 산학협력단; 주식회사 인사이저
Priority date: 2020-06-04
Filing date: 2020-06-04
Publication date: 2023-05-24
Also published as: KR20210151281A

Abstract

The present invention relates to a TextRank-based key sentence extraction method and device using BERT sentence embedding vectors. According to an embodiment of the present invention, the method is a computer-implemented method for key sentence extraction executed on a computing device, comprising: a first step of dividing the natural language data from which key sentences are to be extracted into sentences; a second step of adding a special classification token (CLS) in front of each divided sentence; a third step of converting each sentence to which the CLS token has been added into a sentence vector using a sentence vector conversion model; a fourth step of constructing a similarity matrix by calculating the similarity between sentences based on the sentence vectors; a fifth step of calculating the importance of each sentence by applying the similarity matrix to TextRank; and a sixth step of extracting key sentences according to the calculated importance.

Description

Method and device for TextRank-based key sentence extraction using BERT sentence embedding vectors

The present invention relates to a TextRank-based key sentence extraction method and device, and more particularly to a TextRank-based key sentence extraction method and device using BERT sentence embedding vectors.

Conventional key sentence extraction has relied mainly on TextRank. TextRank is a methodology that computes the similarity between tokens (documents, sentences, words, and so on), builds a similarity matrix from those values, and calculates the importance of each token from that matrix. Because it depends on the similarity matrix, its results depend on how the similarity between tokens is calculated.

Similarity between sentences has mainly been calculated by vectorizing each sentence from the counts or the presence of the words it contains (BoW: Bag of Words) and assigning similarity based on the distance between the vectors.
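As a minimal sketch of the Bag of Words vectorization described above, the following counts words over a shared vocabulary. The function name `bow_vectors` and the whitespace tokenization are assumptions made for illustration and are not taken from the patent.

```python
from collections import Counter
from typing import List


def bow_vectors(sentences: List[str]) -> List[List[int]]:
    """Turn each sentence into a vector of word counts over a shared vocabulary."""
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vectors


# Two sentences with the same words in a different order.
vecs = bow_vectors(["the bank raised rates", "rates raised the bank"])
```

Both sentences receive the same vector here, which illustrates why a count-based representation cannot reflect word order or position.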

Because a purely frequency-based numerical representation of natural language tokens (BoW) contains only whether a word is present and how often it appears, it can hardly reflect the meaning of a token or the information carried by its position. The first approach introduced to address this was a model that applies unsupervised learning to word occurrences. However, this too reflects only the order in which words appear or whether they co-occur, so it cannot be said to fully capture the meaning of a word or the characteristics of the word itself. Accordingly, much recent research extracts the meaning of natural language tokens with machine learning models trained on the relationships between words and on various natural language processing tasks. Such a model is called a language model.

A language model is trained on various natural language processing tasks, such as finding the answer to a question, choosing the sentence that follows, and filling in a missing word. Because it generates suitable vectors per token (document, sentence, word, and so on) for these tasks, it goes beyond frequency-based representations, which record only whether or how often a word appears. The inherent meaning of a word, its effect on the sentence, and the complex language system as a whole all influence the vectors it produces.

Representative language models include GPT and ELMo. However, because language models such as GPT and ELMo use a recurrent neural network as their basic structure, they analyze language token by token in a fixed order. When a given token is analyzed, only the tokens before it or only the tokens after it have an influence, so these models are not perfectly suited to a language system in which both the preceding and the following context matter.

BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google. It departs from the recurrent neural network structure and is built on the Transformer model, whose basic structure is multi-head self-attention. Each time a token is analyzed, the preceding and following context is revisited, so both contexts are continuously reflected. BERT also uses MLM (a masked language model), which hides part of the given text and trains the model to predict the hidden words, thereby learning the relationships between words. As a result, BERT achieved state-of-the-art performance, surpassing earlier language models on several natural language processing tasks.

Although language models exist that can go beyond the statistical information in text data and analyze the meaning inherent in each word or sentence, current methods of extracting information from sentences (key sentences, keywords, and so on) remain bound by the limits of frequency-based methodology. They therefore cannot fully reflect the meaning of a sentence or its context in information extraction.

These key sentence extraction methods predate the concept of a language model, and no method yet exists that extracts key sentences using a language model's numerical representation of sentences. A key sentence extraction technique is needed that is based on BERT sentence embeddings, which are computed from an understanding of the language system rather than from simple statistics.

The present invention provides a method of using meaningful sentence embedding vectors extracted with the BERT (Bidirectional Encoder Representations from Transformers) model in TextRank-based key sentence extraction, so that a simple method can be built for applying the best currently available language model to key sentence extraction.

A TextRank-based key sentence extraction method using BERT sentence embedding vectors according to an embodiment of the present invention is a computer-implemented method for key sentence extraction executed on a computing device, comprising: a first step of dividing the natural language data from which key sentences are to be extracted into sentences; a second step of adding a special classification token (CLS) in front of each divided sentence; a third step of converting each sentence to which the CLS token has been added into a sentence vector using a sentence vector conversion model; a fourth step of constructing a similarity matrix by calculating the similarity between sentences based on the sentence vectors; a fifth step of calculating the importance of each sentence by applying the similarity matrix to TextRank; and a sixth step of extracting key sentences according to the calculated importance.

According to another embodiment of the present invention, in the second step, the special classification token (CLS) may be used as the embedding vector of the sentence token.

According to another embodiment of the present invention, in the third step, each sentence to which the CLS token has been added may be converted into a sentence vector using a BERT (Bidirectional Encoder Representations from Transformers) model.

According to another embodiment of the present invention, in the fourth step, the similarity matrix may be constructed by calculating sentence similarity as the dot product of the sentence vectors.

According to another embodiment of the present invention, the sentence with the highest calculated importance value may be extracted as a key sentence.

A TextRank-based key sentence extraction device using BERT sentence embedding vectors according to an embodiment of the present invention is a computing device comprising: a memory; at least one processor; and a program for key sentence extraction that is stored in the memory and implemented to be executed by the at least one processor. The program may include instructions that perform: a first step of dividing the natural language data from which key sentences are to be extracted into sentences; a second step of adding a special classification token (CLS) in front of each divided sentence; a third step of converting each sentence to which the CLS token has been added into a sentence vector using a sentence vector conversion model; a fourth step of constructing a similarity matrix by calculating the similarity between sentences based on the sentence vectors; a fifth step of calculating the importance of each sentence by applying the similarity matrix to TextRank; and a sixth step of extracting key sentences according to the calculated importance.

By providing a method of using meaningful sentence embedding vectors extracted with the BERT (Bidirectional Encoder Representations from Transformers) model in TextRank-based key sentence extraction, the present invention makes it possible to build a simple method for applying the best currently available language model to key sentence extraction. The invention can therefore be used for summary extraction and keyword extraction from existing text data, applied to financial market analysis using news data and to analysis of customer needs and product problems using review data, and can further promote the use of text data in domestic and foreign industries.

FIG. 1 is a flowchart illustrating a TextRank-based key sentence extraction method using BERT sentence embedding vectors according to an embodiment of the present invention.
FIG. 2 is a diagram showing a similarity matrix calculated from sentence vectors according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a method of extracting the importance of each sentence according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a method of extracting key sentences according to an embodiment of the present invention.
FIG. 5 is a schematic block diagram of a computing device that executes the key sentence extraction method according to an embodiment of the present invention.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

In describing the present invention, detailed descriptions of related known technologies are omitted where they would unnecessarily obscure the subject matter of the invention. In addition, ordinal numbers used in this specification (for example, first, second, and so on) are merely identifiers for distinguishing one component from another.

Throughout the specification, when one component is described as being "connected" or "coupled" to another component, the two may be connected or coupled directly, but unless specifically stated otherwise they may also be connected or coupled through another component in between. Likewise, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. Terms such as "unit" and "module" mean a unit that processes at least one function or operation, and such a unit may be implemented as one or more pieces of hardware or software, or as a combination of hardware and software.

The present invention relates to a key sentence extraction method using the BERT (Bidirectional Encoder Representations from Transformers) model, and discloses a way of choosing a sentence similarity measure suited to key sentence extraction.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a TextRank-based key sentence extraction method using BERT sentence embedding vectors according to an embodiment of the present invention, and FIG. 2 is a diagram showing a similarity matrix calculated from sentence vectors according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating a method of extracting the importance of each sentence according to an embodiment of the present invention, and FIG. 4 is a diagram illustrating a method of extracting key sentences according to an embodiment of the present invention. FIG. 5 is a schematic block diagram of a computing device that executes the key sentence extraction method according to an embodiment of the present invention.

Referring to FIG. 1, the key sentence extraction device first divides the natural language data from which key sentences are to be extracted into sentences (S110).
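Step S110 could be sketched as follows. The patent does not say how sentences are delimited, so the punctuation-based regular expression and the name `split_sentences` are assumptions, and a real system would likely use a language-specific sentence splitter.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at ., ! or ? followed by whitespace."""
    # Assumption: terminal punctuation plus whitespace marks a sentence boundary.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sentences = split_sentences("Rates rose today. Markets fell! Will it last?")
```

The punctuation stays attached to each sentence because the split happens on the whitespace that follows it.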

Next, a special classification token (CLS) is added in front of each divided sentence (S120).

According to an embodiment of the present invention, the CLS token of the BERT (Bidirectional Encoder Representations from Transformers) model is used, BERT being the best-performing of the pre-trained language models trained on various tasks.

The special classification token (CLS) is a BERT-specific token trained to hold the information inherent in a sentence so that this sentence-level information can be used for classification tasks.

The CLS token is used as follows. The natural language data from which key sentences are to be extracted is divided into sentences, and a CLS token is added in front of each sentence. Because this only requires adding the word '[CLS]' at the start of the sentence, the token can be added to a sentence of any form without further constraints. The vector that the model produces for the CLS token is then used to calculate the similarity between sentences. In other words, the CLS token is the token whose representation vector BERT uses as the sentence representation in sentence-level classification, and that representation vector is used as the embedding vector of the sentence token.
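A minimal sketch of step S120, assuming the token is prepended as literal text separated by a space. The helper name `add_cls` is hypothetical.

```python
from typing import List

CLS_TOKEN = "[CLS]"


def add_cls(sentences: List[str]) -> List[str]:
    """Prepend the [CLS] marker to every sentence."""
    return [f"{CLS_TOKEN} {sentence}" for sentence in sentences]


marked = add_cls(["Rates rose today.", "Markets fell!"])
```

Many BERT tokenizer implementations insert this token themselves, in which case an explicit step like this one would be skipped to avoid adding it twice.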

Each sentence to which the CLS token has been added is then converted into a sentence vector using a sentence vector conversion model (S130).

That is, each sentence to which the CLS token has been added is vectorized using a pre-trained BERT model.
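One possible way to obtain the CLS vectors in step S130 is sketched below with the Hugging Face `transformers` library and PyTorch. The patent names neither a library nor a checkpoint, so both, along with the function name `cls_sentence_vectors` and the default model `bert-base-multilingual-cased`, are assumptions. The tokenizer used here inserts `[CLS]` on its own, so raw sentences are passed in.

```python
from typing import List


def cls_sentence_vectors(
    sentences: List[str],
    model_name: str = "bert-base-multilingual-cased",  # assumed checkpoint
) -> List[List[float]]:
    """Return the last-layer [CLS] vector of each sentence."""
    # Imported lazily so this module loads even where the libraries are absent.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # The tokenizer adds [CLS] and [SEP] around every sentence automatically.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Position 0 of the last hidden layer corresponds to the [CLS] token.
    return outputs.last_hidden_state[:, 0, :].tolist()
```

Calling the function downloads the checkpoint on first use. It returns one vector per input sentence, which can then feed the similarity calculation.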

The similarity between sentences is then calculated from the sentence vectors to construct a similarity matrix (S140).

In the inter-sentence similarity calculation according to an embodiment of the present invention, the similarity between two sentences is the dot product of their sentence vectors. This is equivalent to cosine similarity: cosine similarity divides the dot product of two vectors by the product of their magnitudes, and the method relies on the magnitudes in that denominator being set to 1 by BERT's internal algorithm. Using the dot product as the similarity is also how BERT's internal algorithm calculates similarity between tokens, so it suits tokens extracted by BERT better than other similarity measures.
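The relationship between the dot product and cosine similarity can be sketched as follows. Cosine similarity is the dot product of u and v divided by the product of their magnitudes, so once both magnitudes are 1 the two quantities coincide. The sketch normalizes explicitly rather than assuming the encoder output already has unit length, and the names `normalize`, `dot`, and `similarity_matrix` are illustrative.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(vectors: List[Vector]) -> List[List[float]]:
    """Pairwise dot products of unit vectors, which equal cosine similarities."""
    unit = [normalize(v) for v in vectors]
    return [[dot(a, b) for b in unit] for a in unit]


# Toy 2-dimensional vectors standing in for sentence embeddings.
sim = similarity_matrix([[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]])
```

The resulting matrix is symmetric with ones on the diagonal. Opposite vectors score -1, so a ranking step that expects non-negative weights has to decide how to treat negative entries.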

As shown in FIG. 2, once the similarity matrix has been constructed from the calculated sentence similarities, it is applied to TextRank to extract the importance of each sentence (S150).
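Step S150 could be sketched as a PageRank-style power iteration over the similarity matrix. The patent does not give the TextRank parameters, so the damping factor of 0.85, the removal of self-similarity, the clipping of negative weights to zero, and the name `textrank` are all assumptions.

```python
from typing import List


def textrank(
    sim: List[List[float]],
    damping: float = 0.85,
    iterations: int = 100,
    tol: float = 1e-8,
) -> List[float]:
    """Score each sentence by power iteration on a weighted similarity graph."""
    n = len(sim)
    if n == 0:
        return []
    # Assumption: ignore self-similarity and clip negative weights to zero.
    w = [[max(sim[i][j], 0.0) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if out_sum[j] > 0:
                    rank += w[j][i] / out_sum[j] * scores[j]
                else:
                    rank += scores[j] / n  # isolated sentence: spread evenly
            new_scores.append((1.0 - damping) / n + damping * rank)
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Toy matrix in which the middle sentence is most similar to the others.
scores = textrank([[1.0, 0.9, 0.1], [0.9, 1.0, 0.8], [0.1, 0.8, 1.0]])
```

The scores sum to 1, and a sentence that is strongly similar to many others receives a higher score.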

The sentences are then ranked by the extracted importance, and the key sentences of the given text data are extracted (S160).

FIG. 3 shows the importance values calculated for each sentence of an actual financial news article using TextRank, and FIG. 4 shows an example of key sentences extracted from an actual financial news article.

As shown in FIG. 4, the sentence with the highest score (S2) and the sentence with the next highest score (S1) may be extracted as key sentences. In this way, the sentences with the highest importance values are extracted as key sentences.
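The selection in step S160 amounts to sorting sentences by score and keeping the top few. In the sketch below, the function name `top_sentences`, the parameter `k`, and the score values are hypothetical. The scores are chosen only so that S2 ranks first and S1 second, as in the example above, and are not taken from FIG. 3.

```python
from typing import List, Tuple


def top_sentences(
    sentences: List[str], scores: List[float], k: int = 1
) -> List[Tuple[int, str, float]]:
    """Return the k highest-scoring sentences as (index, sentence, score)."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [(i, sentences[i], scores[i]) for i in ranked[:k]]


# Hypothetical importance values for three sentences.
picked = top_sentences(["S1", "S2", "S3"], [0.35, 0.40, 0.25], k=2)
```

With `k=1` only the single highest-scoring sentence is returned, which matches the variant that extracts one key sentence.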

In summary, according to the present invention, raw text data is divided into sentences and a CLS token is added to each. The sentences with the CLS token are vectorized with the BERT model, the similarity between sentences is calculated from the sentence vectors, and a similarity matrix is built from the calculated values. The similarity matrix is then applied to TextRank to extract an importance value for each sentence, and the final key sentences are selected based on that importance.

As shown in FIG. 5, the TextRank-based key sentence extraction method using BERT sentence embedding vectors described above may be executed by a computing device (100) that includes at least one processor (110) and a memory (120). The memory (120) stores the program for key sentence extraction described above, and the program may be implemented to be executed by the at least one processor (110).

The program for key sentence extraction may include instructions that divide the natural language data from which key sentences are to be extracted into sentences, add a special classification token (CLS) in front of each divided sentence, convert each sentence to which the CLS token has been added into a sentence vector using a sentence vector conversion model, construct a similarity matrix by calculating the similarity between sentences based on the sentence vectors, calculate the importance of each sentence by applying the similarity matrix to TextRank, and extract key sentences according to the calculated importance.

According to the embodiments described above, providing a method of using meaningful sentence embedding vectors extracted with the BERT (Bidirectional Encoder Representations from Transformers) model in TextRank-based key sentence extraction makes it possible to build a simple method for applying the best currently available language model to key sentence extraction. The method can also be used for summary extraction and keyword extraction from existing text data, applied to financial market analysis using news data and to analysis of customer needs and product problems using review data, and can further promote the use of text data in domestic and foreign industries.

Although the present invention has been described above with reference to its embodiments, those of ordinary skill in the art will readily understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

Claims

A computer-implemented method for key sentence extraction executed on a computing device, the method comprising:
a first step of dividing natural language data from which key sentences are to be extracted into sentences;
a second step of adding a special classification token (CLS) in front of each of the divided sentences, wherein the special classification token (CLS) is used as an embedding vector of the sentence token;
a third step of converting each sentence to which the special classification token (CLS) is added into a sentence vector by using a BERT (Bidirectional Encoder Representations from Transformers) model;
a fourth step of constructing a similarity matrix by calculating similarities between sentences based on the sentence vectors;
a fifth step of calculating the importance of each sentence by applying the similarity matrix to TextRank; and
a sixth step of extracting key sentences according to the calculated importance,
wherein, in the fourth step, the similarity matrix is constructed by calculating sentence similarity using the dot product of the sentence vectors.

delete

The method of claim 1,
wherein the sixth step extracts the sentence with the highest calculated importance value as a key sentence.

A computing device comprising: a memory; at least one processor; and a program for key sentence extraction stored in the memory and implemented to be executed by the at least one processor,
wherein the program for key sentence extraction comprises instructions that cause:
natural language data from which key sentences are to be extracted to be divided into sentences; a special classification token (CLS) to be added in front of each of the divided sentences, the special classification token (CLS) being used as an embedding vector of the sentence token; each sentence to which the special classification token (CLS) is added to be converted into a sentence vector using a BERT (Bidirectional Encoder Representations from Transformers) model; a similarity matrix to be constructed by calculating similarities between sentences based on the sentence vectors, the sentence similarity being calculated using the dot product of the sentence vectors; the importance of each sentence to be calculated by applying the similarity matrix to TextRank; and key sentences to be extracted according to the calculated importance.