KR102376489B1 - Text document cluster and topic generation apparatus and method thereof

Info

Publication number: KR102376489B1
Application number: KR1020190151194A
Authority: KR
Inventors: 김문종; 류승현; 홍범석; 장정훈
Original assignee: 주식회사 와이즈넛
Priority date: 2019-11-22
Filing date: 2019-11-22
Publication date: 2022-03-18
Also published as: KR20210062934A

Abstract

The present invention relates to an apparatus and method for clustering text documents and generating topics based on word ranking. The apparatus includes a first module that extracts weighted keywords for each text document by using relationship information between words in the unstructured text of a plurality of text documents collected from outside; a second module that normalizes the weights of the keywords extracted by the first module for each text document and calculates keyword rankings; a third module that clusters similar text documents based on the per-document keyword rankings calculated by the second module; and a fourth module that generates a representative topic from each group of similar text documents clustered by the third module. This improves clustering quality, allows the information contained in documents to be obtained more quickly, thereby contributing to more valuable decision-making, and can be applied to per-cluster statistics, time series analysis, and the like through clustering of large volumes of data.

Description

Apparatus and method for generating text document clusters and topics based on word ranking

The present invention relates to an apparatus and method for clustering text documents and generating topics based on word ranking, and more particularly to an apparatus and method that analyze the syntactic relationships in unstructured text documents, extract keywords, cluster the text documents based on word ranking, and generate a representative topic for the clustered text documents.

Most of today's digital data consists of unstructured data such as text, images, and voice. The demand for meaningful analysis of unstructured data, and for deriving value from it, continues to grow.

In the field of natural language processing of text, unsupervised learning is attracting attention because it extracts language features automatically by building a probabilistic model from a set of documents. Recently, in particular, methods have also been proposed that go beyond representations of words and phrases to reflect contextual information in sentences and documents.

Furthermore, natural language processing is used in various fields such as document retrieval, summarization, related-information analysis, and clustering. According to market research by MarketsandMarkets, the global text analytics market is expected to sustain an average annual growth rate of about 17.2% and reach USD 8.79 billion in 2022. To analyze such unstructured text data, large companies such as SAS and Microsoft are entering the text analytics market.

Meanwhile, existing clustering techniques require text words to be converted into numerical data before processing and require the number of clusters to be specified in advance. By using the keyword weight information of the text, the inventors devised a method that clusters documents without numerical conversion or a preset number of clusters, and that generates a keyword-based representative topic for each cluster of documents.

As the volume of unstructured data grows, demand for various forms of analysis for efficient work management is increasing not only in companies but across society as a whole.

Korean Registered Patent No. 10-1505546 (published March 26, 2015)

The present invention has been devised to solve the problems described above. An object of the present invention is to provide a word ranking-based apparatus and method for clustering text documents and generating topics that extract keywords from text documents, cluster the documents by unsupervised learning, and automatically generate a topic representing the clustered documents.

Another object of the present invention is to provide a word ranking-based apparatus and method for clustering text documents and generating topics that, when clustering large volumes of unstructured documents, calculate the ranking of the keywords that carry the important meaning of each document, cluster the documents accordingly, and generate a keyword-based representative topic, thereby improving the quality of the clusters.

To achieve the above objects, a first aspect of the present invention provides a word ranking-based apparatus for clustering text documents and generating topics, comprising: a first module that extracts weighted keywords for each text document by using relationship information between words in the unstructured text of a plurality of text documents collected from outside; a second module that normalizes the weights of the keywords extracted by the first module for each text document and calculates keyword rankings; a third module that clusters similar text documents based on the per-document keyword rankings calculated by the second module; and a fourth module that generates a representative topic from each group of similar text documents clustered by the third module.

Here, the first module preferably uses the CYK (Cocke-Younger-Kasami) algorithm to decompose the words in the unstructured text of each text document into morphemes, classifies them by part of speech and parses their relationships, then forms the relationships between words within the sentences of each text document and extracts keywords by calculating weights centered on the most frequently referenced words.
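As a rough illustration of the parsing step, the following is a minimal sketch of a CYK recognizer over a grammar in Chomsky normal form. The toy lexicon and rules (`LEXICON`, `RULES`) and the function names are hypothetical; the text does not give the grammar the first module uses or how reference counts are read off the parse table.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (not taken from the patent).
LEXICON = {"documents": {"NP"}, "contain": {"V"}, "keywords": {"NP"}}
RULES = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}


def cyk_table(tokens, lexicon, rules):
    """Build the CYK table: table[i][j] holds the symbols spanning tokens[i..j]."""
    n = len(tokens)
    table = [[set() for _ in range(n)] for _ in range(n)]
    # Spans of length 1 come straight from the lexicon.
    for i, tok in enumerate(tokens):
        table[i][i] = set(lexicon.get(tok, ()))
    # Longer spans combine two adjacent sub-spans via a binary rule.
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span - 1
            for split in range(start, end):
                pairs = product(table[start][split], table[split + 1][end])
                for left, right in pairs:
                    table[start][end] |= rules.get((left, right), set())
    return table


def accepts(tokens, lexicon=LEXICON, rules=RULES, start_symbol="S"):
    """Return True if the whole token sequence derives the start symbol."""
    if not tokens:
        return False
    table = cyk_table(tokens, lexicon, rules)
    return start_symbol in table[0][len(tokens) - 1]
```

A real system would replace the toy lexicon with part-of-speech tags from a morphological analyzer; how often each word is referenced during parsing would then have to be counted separately.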

Preferably, the weight of a keyword extracted by the first module, Keyword(t, d, D), may be calculated by Equation 1 below.

(Equation 1)

Here, t is a word at the morpheme level, d is a specific text document belonging to the full set of text documents, D is the full set of text documents, and t_cnt is the number of times each word is referenced in CYK.

Preferably, to reduce the gap in keyword weights between text documents, the second module replaces the weight of the highest-weighted keyword with a preset reference score, normalizes the remaining keywords against that preset reference score, and assigns a keyword ranking for each text document with the preset reference score as the maximum score.
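The normalization and ranking step can be sketched as follows. This is a minimal sketch that assumes linear scaling to the reference score; the text only says the remaining keywords are normalized against that score, so the scaling rule, the function name `normalize_and_rank`, and the tie-breaking by keyword are illustrative assumptions.

```python
def normalize_and_rank(weights, reference_score=100.0):
    """Scale one document's keyword weights so the top keyword gets
    reference_score, then rank keywords by the scaled score.

    weights: dict mapping keyword -> raw weight (must contain a positive value).
    Returns a list of (rank, keyword, score), best first.
    """
    if not weights:
        return []
    top = max(weights.values())
    if top <= 0:
        raise ValueError("at least one keyword weight must be positive")
    scale = reference_score / top  # assumed: linear scaling
    scored = sorted(
        ((kw, w * scale) for kw, w in weights.items()),
        key=lambda item: (-item[1], item[0]),  # ties broken alphabetically
    )
    return [(rank, kw, score) for rank, (kw, score) in enumerate(scored, start=1)]


ranking = normalize_and_rank({"cluster": 8.0, "topic": 4.0, "keyword": 2.0})
# ranking == [(1, "cluster", 100.0), (2, "topic", 50.0), (3, "keyword", 25.0)]
```

Because every document's top keyword ends up at the same score, documents with very different raw weight ranges become directly comparable.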

Preferably, the third module sets any one of the text documents as a pivot to form the first cluster, then compares each remaining text document in turn against the clusters by document similarity to decide whether it belongs to an existing cluster or to a different one, thereby clustering similar text documents across all text documents.

Preferably, the third module may calculate the document similarity by Equation 2 below.

(Equation 2)

Here, let D denote the set of all text documents and C the set of all clusters; the maximum number of clusters is then C_D. An existing cluster (C_i) holds the top N-best keywords (k_i) that define the cluster, together with their weights. The third module compares a new text document (D_ki) against it and, if they match, adds the document to the existing cluster (C_i); if the document matches none of the clusters C_i ... C_n, a new cluster (C_new) is created.
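The pivot-based procedure can be sketched as a single pass over the documents. Equation 2 is not reproduced in the text, so the `similarity` function below is an assumed stand-in (a weighted Jaccard overlap of keyword scores), and the `threshold`, `n_best` values and all function names are hypothetical.

```python
def top_keywords(scores, n_best):
    """Keep the n_best highest-scoring keywords of one document."""
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return dict(ordered[:n_best])


def similarity(doc_kw, cluster_kw):
    """Assumed stand-in for Equation 2: weighted Jaccard overlap in [0, 1]."""
    keys = set(doc_kw) | set(cluster_kw)
    shared = sum(min(doc_kw.get(k, 0.0), cluster_kw.get(k, 0.0)) for k in keys)
    total = sum(max(doc_kw.get(k, 0.0), cluster_kw.get(k, 0.0)) for k in keys)
    return shared / total if total else 0.0


def cluster_documents(docs, n_best=3, threshold=0.5):
    """docs: dict doc_id -> {keyword: normalized score}.
    The first document becomes the pivot of the first cluster; each later
    document joins its most similar cluster or starts a new one."""
    clusters = []  # each cluster: {"keywords": dict, "members": [doc_id, ...]}
    for doc_id, scores in docs.items():
        doc_kw = top_keywords(scores, n_best)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(doc_kw, cluster["keywords"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"keywords": doc_kw, "members": [doc_id]})
        else:
            best["members"].append(doc_id)
    return clusters
```

In this sketch a cluster keeps the keywords of its pivot document unchanged; whether the cluster keywords should be updated as members are added is not stated in the text. Note that the number of clusters is never specified up front and is bounded only by the number of documents.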

Preferably, the fourth module selects candidate sentences based on dependency syntax from each group of similar text documents clustered by the third module, calculates and analyzes TextRank-based sentence weights for the selected candidate sentences to extract a weight-based representative sentence, and generates a representative topic after removing unnecessary modifiers from the extracted representative sentence.
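The sentence-weighting step can be sketched with a small TextRank implementation. This is a minimal sketch that uses the word-overlap similarity from the original TextRank formulation and plain power iteration; the damping factor, iteration count, and function names are assumptions, and the fourth module's exact variant is not specified in the text.

```python
import math


def sentence_similarity(a, b):
    """Word overlap normalized by log sentence lengths (TextRank-style)."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom <= 0:  # both sentences have a single token
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """sentences: list of token lists. Returns one score per sentence."""
    n = len(sentences)
    sim = [
        [0.0 if i == j else sentence_similarity(sentences[i], sentences[j])
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of its neighbours' scores,
        # proportional to the edge weight between them.
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def representative_sentence(sentences):
    """Return the index of the highest-scoring sentence."""
    scores = textrank(sentences)
    return max(range(len(sentences)), key=scores.__getitem__)
```

Sentences that share many words with the rest of the cluster score highest, so an off-topic sentence is unlikely to be picked as the representative.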

Preferably, the fourth module uses dependency parsing to select a sentence from each group of similar text documents clustered by the third module as a candidate sentence when the sentence consists of a single clause or phrase with a subject, verb, and object structure.
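The candidate filter can be illustrated on pre-parsed input. This is a minimal sketch that assumes a dependency parser has already produced `(token, part_of_speech, relation)` triples with Universal Dependencies labels (`root`, `nsubj`, `obj`); the label set, the input shape, and the function name are assumptions, since the text does not name a parser or tag scheme.

```python
def is_candidate(parsed):
    """parsed: list of (token, upos, deprel) triples for one sentence.
    Accept the sentence if it has exactly one verbal root, at least one
    subject, and at least one object (a single subject-verb-object clause)."""
    roots = [item for item in parsed if item[2] == "root"]
    has_subject = any(deprel.startswith("nsubj") for _, _, deprel in parsed)
    has_object = any(deprel == "obj" for _, _, deprel in parsed)
    single_verbal_root = len(roots) == 1 and roots[0][1] == "VERB"
    return single_verbal_root and has_subject and has_object


example = [
    ("banks", "NOUN", "nsubj"),
    ("raise", "VERB", "root"),
    ("rates", "NOUN", "obj"),
]
# is_candidate(example) -> True
```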

Preferably, to remove unnecessary modifiers from the extracted weight-based representative sentence, the fourth module extracts the nouns through morphological analysis and then generates the representative topic while preserving word order.
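The final step reduces to filtering a tagged sentence. This is a minimal sketch that assumes a morphological analyzer has already produced `(surface, tag)` pairs and that noun tags start with `NN`; the tag prefix and the function name `make_topic` are assumptions for illustration.

```python
def make_topic(tagged_sentence, noun_prefixes=("NN",)):
    """Keep only noun tokens, in their original order, and join them."""
    nouns = [
        surface
        for surface, tag in tagged_sentence
        if tag.startswith(noun_prefixes)
    ]
    return " ".join(nouns)


tagged = [
    ("central", "JJ"), ("bank", "NN"), ("sharply", "RB"),
    ("raises", "VBZ"), ("loan", "NN"), ("rates", "NNS"),
]
topic = make_topic(tagged)
# topic == "bank loan rates"
```

Adjectives, adverbs, and verbs are dropped, so the topic is a compact noun sequence that still reads in the order of the source sentence.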

A second aspect of the present invention provides a word ranking-based method for clustering text documents and generating topics using an apparatus that includes first to fourth modules, the method comprising: (a) extracting, through the first module, weighted keywords for each text document by using relationship information between words in the unstructured text of a plurality of text documents collected from outside; (b) normalizing, through the second module, the weights of the keywords extracted in step (a) for each text document and calculating keyword rankings; (c) clustering, through the third module, similar text documents based on the per-document keyword rankings calculated in step (b); and (d) generating, through the fourth module, a representative topic from each group of similar text documents clustered in step (c).

Here, step (a) preferably uses the CYK (Cocke-Younger-Kasami) algorithm, through the first module, to decompose the words in the unstructured text of each text document into morphemes, classifies them by part of speech and parses their relationships, then forms the relationships between words within the sentences of each text document and extracts keywords by calculating weights centered on the most frequently referenced words.

(Equation 1)

Preferably, in step (b), to reduce the gap in keyword weights between text documents, the second module replaces the weight of the highest-weighted keyword with a preset reference score, normalizes the remaining keywords against that preset reference score, and assigns a keyword ranking for each text document with the preset reference score as the maximum score.

Preferably, in step (c), the third module sets any one of the text documents as a pivot to form the first cluster, then compares each remaining text document in turn against the clusters by document similarity to decide whether it belongs to an existing cluster or to a different one, thereby clustering similar text documents across all text documents.

Preferably, in step (c), the document similarity may be calculated by Equation 2 below.

(Equation 2)

Preferably, step (d) may comprise: (d-1) selecting, through the fourth module, candidate sentences based on dependency syntax from each group of similar text documents clustered in step (c); (d-2) calculating and analyzing, through the fourth module, TextRank-based sentence weights for the candidate sentences selected in step (d-1) to extract a weight-based representative sentence; and (d-3) generating, through the fourth module, a representative topic after removing unnecessary modifiers from the weight-based representative sentence extracted in step (d-2).

Preferably, in step (d-1), the fourth module uses dependency parsing to select a sentence from each group of similar text documents clustered in step (c) as a candidate sentence when the sentence consists of a single clause or phrase with a subject, verb, and object structure.

Preferably, in step (d-3), to remove unnecessary modifiers from the weight-based representative sentence extracted in step (d-2), the fourth module extracts the nouns through morphological analysis and then generates the representative topic while preserving word order.

A third aspect of the present invention provides a computer-readable recording medium on which a program capable of executing the above-described word ranking-based method for clustering text documents and generating topics is recorded.

The word ranking-based method for clustering text documents and generating topics according to the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system.

Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, hard disks, floppy disks, removable storage devices, non-volatile memory (flash memory), and optical data storage devices.

According to the word ranking-based apparatus and method for clustering text documents and generating topics described above, when clustering large volumes of unstructured documents, the ranking of the keywords that carry the important meaning of each document is calculated, the documents are clustered accordingly, and a keyword-based representative topic is generated, which has the advantage of improving the quality of the clusters.

In addition, according to the present invention, the unstructured text data accumulated in each company is clustered into documents of similar meaning and the representative topics are summarized, so that the information contained in the documents can be obtained more quickly. This contributes to more valuable decision-making and can be applied to per-cluster statistics, time series analysis, and the like through clustering of large volumes of data.

FIG. 1 is an overall block diagram illustrating a word ranking-based apparatus for clustering text documents and generating topics according to an embodiment of the present invention.
FIG. 2 is a diagram showing pseudocode of the CYK algorithm used by the first module applied to an embodiment of the present invention.
FIG. 3 is a diagram showing an algorithm for explaining keyword weight normalization and keyword ranking calculation by the second module applied to an embodiment of the present invention.
FIG. 4 is a diagram conceptually illustrating how the third module applied to an embodiment of the present invention clusters text documents based on keyword ranking.
FIG. 5 is a diagram schematically illustrating the process, performed by the fourth module applied to an embodiment of the present invention, from generating sentence candidates to generating a representative topic.
FIG. 6 is an overall flowchart illustrating a word ranking-based method for clustering text documents and generating topics according to an embodiment of the present invention.
FIG. 7 is a flowchart specifically illustrating the step in which the fourth module applied to an embodiment of the present invention generates a representative topic from each group of clustered similar text documents.

The above objects, features, and advantages are described in detail below with reference to the accompanying drawings, so that a person of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of related known technology are omitted where they could unnecessarily obscure the gist of the invention.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component. The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise.

The terms used in the present invention were chosen, as far as possible, from general terms in wide current use while taking into account their function in the invention, but they may vary with the intent of those skilled in the art, with precedent, or with the emergence of new technology. In certain cases a term has been chosen arbitrarily by the applicant, and its meaning is then described in detail in the relevant part of the description. The terms used in the present invention should therefore be defined by their meaning and by the overall content of the invention, not simply by their names.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit" and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings. The embodiments illustrated below may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art.

Each block of the accompanying block diagrams and each combination of steps in the flowcharts may be carried out by computer program instructions (an execution engine). Because these computer program instructions may be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed by that processor create means for performing the functions described in each block of the block diagram or each step of the flowchart. These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the instructions stored in that memory can produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flowchart.

The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on that equipment to produce a computer-executed process. The instructions that operate the computer or other programmable data processing equipment can thereby provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for carrying out the specified logical functions. It should be noted that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or may be performed in the reverse order of the corresponding functions as needed.

FIG. 1 is an overall block diagram illustrating a word ranking-based apparatus for clustering text documents and generating topics according to an embodiment of the present invention. FIG. 2 shows pseudocode of the CYK algorithm used by the first module applied to an embodiment. FIG. 3 shows an algorithm for explaining keyword weight normalization and keyword ranking calculation by the second module. FIG. 4 conceptually illustrates how the third module clusters text documents based on keyword ranking. FIG. 5 schematically illustrates the process, performed by the fourth module, from generating sentence candidates to generating a representative topic.

Referring to FIGS. 1 to 5, the word ranking-based apparatus for clustering text documents and generating topics according to an embodiment of the present invention broadly includes a first module (100), a second module (200), a third module (300), and a fourth module (400). The components shown in FIGS. 1 to 5 are not essential, however, so the apparatus according to an embodiment of the present invention may have more or fewer components.

The components of the word ranking-based apparatus for clustering text documents and generating topics according to an embodiment of the present invention are described in detail below.

The first module (100) extracts weighted keywords for each text document by using relationship information between words in the unstructured text of a plurality of text documents collected from outside.

That is, the first module (100) uses, for example, the CYK (Cocke-Younger-Kasami) algorithm to decompose the words in the unstructured text of each text document into morphemes, classifies them by part of speech and parses their relationships, then forms the relationships between words within the sentences of each text document and extracts keywords by calculating weights centered on the most frequently referenced words.

The weight of a keyword extracted by the first module (100), Keyword(t, d, D), may be calculated by Equation 1 below.

(Equation 1)

The second module 200 normalizes the keyword weights of each text document extracted by the first module 100 and calculates a ranking of the keywords.

That is, to reduce the gap in keyword weights between text documents, the second module 200 sets the keyword with the highest weight to a preset reference score (for example, 100 points), normalizes the remaining keywords to that preset reference score, and ranks the keywords of each text document with the preset reference score as the maximum score.

The third module 300 clusters similar text documents based on the keyword ranking of each text document calculated by the second module 200.

That is, the third module 300 sets any one of the text documents as a pivot to form the first cluster, and then takes the remaining text documents one at a time, comparing document similarity against the clusters to decide whether each document belongs to an existing cluster or to a different one, until similar text documents have been clustered for all text documents.

The third module 300 may also calculate the document similarity by Equation 2 below.

(Equation 2)

The fourth module 400 generates a representative topic from each cluster of similar text documents formed by the third module 300.

That is, the fourth module 400 selects candidate sentences based on dependency structure from each cluster of similar text documents formed by the third module 300, calculates and analyzes TextRank-based sentence weights using the selected candidate sentences to extract a weight-based representative sentence, and generates a representative topic after removing unnecessary modifiers from the extracted representative sentence.

The fourth module 400 may also use dependency parsing to select a sentence in each cluster of similar text documents formed by the third module 300 as a candidate sentence when the sentence forms a single clause or phrase having a subject, verb, and object structure.

In addition, to remove unnecessary modifiers from the extracted weight-based representative sentence, the fourth module 400 may extract nouns through morphological analysis and then generate the representative topic while preserving word order.

A method for generating text document clusters and topics based on word ranking according to an embodiment of the present invention is described in detail below.

FIG. 6 is an overall flowchart illustrating a method for generating text document clusters and topics based on word ranking according to an embodiment of the present invention, and FIG. 7 is a flowchart illustrating in detail the step in which the fourth module applied to an embodiment of the present invention generates a representative topic from each cluster of similar text documents.

Referring to FIGS. 1 to 7, in the method for generating text document clusters and topics based on word ranking according to an embodiment of the present invention, the first module 100 first extracts, for each text document, keywords with calculated weights by using relationship information between words in the unstructured text of a plurality of text documents collected from outside (S100).

As an example, in step S100 the first module 100 may use the CYK (Cocke-Younger-Kasami) algorithm to decompose the words of the unstructured text in each text document into morphemes, classify them by part of speech (for example, substantives, predicates, adverbial phrases, adnominal phrases, and others such as interjections and symbols) and parse their relationships, and then form the relationships between words within the sentences of each text document and extract keywords by calculating weights centered on frequently referenced words.

The weight Keyword(t, d, D) of a keyword extracted by the first module 100 may be calculated by Equation 1 below.

(Equation 1)

Here, t is a word in morpheme units, d is a specific text document belonging to the full set of text documents, D is the full set of text documents, and t_cnt is the number of times each word is referenced in the CYK parse. Logarithms are taken when computing TF (Term Frequency, the frequency of a word) and DF (Document Frequency, the frequency of text documents that contain a given word among all text documents) in order to offset the large effect that the reciprocal of DF would otherwise have on the overall weight as the number of text documents grows.

Keyword extraction is a technique for extracting the words that carry important meaning in a document, and methods such as TF-IDF (Term Frequency - Inverse Document Frequency) have traditionally been used. The conventional TF-IDF method extracts keywords based on how often words occur, on the assumption that a frequent word carries important meaning.

However, even the same word plays a different semantic role in a sentence depending on its part of speech (for example, because an adverb modifies a verb, the verb actually carries more important meaning than the adverb), so simple frequency counts are insufficient for identifying the keywords that carry important meaning in a document.

For this reason, an embodiment of the present invention uses the CYK algorithm to decompose words into morphemes and parses their relationships by classifying them as substantives, predicates, adverbial phrases, adnominal phrases, and others such as interjections and symbols (see FIG. 2).
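For readers unfamiliar with CYK, the following is a minimal sketch of the textbook CYK recognizer for a grammar in Chomsky normal form. It is not the morphological analyzer of the embodiment: the toy grammar, the category names (`NP`, `V`, `VP`, `S`), and the function name `cyk_recognize` are assumptions made purely for illustration.

```python
from itertools import product


def cyk_recognize(tokens, unary_rules, binary_rules, start="S"):
    """Return True if `tokens` can be derived from `start` under a CNF grammar.

    unary_rules:  dict mapping a token to the set of categories that produce it
    binary_rules: dict mapping a pair (B, C) to the set of categories A with A -> B C
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][length - 1] holds the categories that derive tokens[i : i + length]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = set(unary_rules.get(tok, ()))
    for span in range(2, n + 1):            # length of the span being filled
        for i in range(n - span + 1):       # start position of the span
            for split in range(1, span):    # length of the left part
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for b, c in product(left, right):
                    table[i][span - 1] |= binary_rules.get((b, c), set())
    return start in table[0][n - 1]


# Hypothetical toy grammar: S -> NP VP, VP -> V NP
TOY_UNARY = {"documents": {"NP"}, "keywords": {"NP"}, "contain": {"V"}}
TOY_BINARY = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

print(cyk_recognize(["documents", "contain", "keywords"], TOY_UNARY, TOY_BINARY))
```

The table that CYK fills in records which spans combine with which, and a real parser could count how often each word takes part in such combinations; how the embodiment derives its reference counts from the parse is not specified in this text.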

Using the CYK algorithm to form the relationships between words in a sentence and to compute frequencies centered on frequently referenced words makes it possible to extract more meaningful keywords than conventional frequency-based approaches. The formula for extracting keywords is Equation 1 above.

Next, the second module 200 normalizes the keyword weights of each text document extracted in step S100 and calculates a ranking of the keywords (S200).

As an example, in step S200 the second module 200 may, to reduce the gap in keyword weights between text documents, set the keyword with the highest weight to a preset reference score (for example, 100 points), normalize the remaining keywords to that preset reference score, and rank the keywords of each text document with the preset reference score as the maximum score.

That is, the keyword weight calculated by Equation 1 above is a non-negative value (0 or greater). Preferably, a correction value of 1 is added when computing TF, DF, and CYK so that log(0) is never evaluated.
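The effect of the correction value can be sketched as follows. Equation 1 itself is not reproduced in this text, so the way the three terms are combined in `illustrative_keyword_weight` is an assumption for illustration only and should not be read as the patented formula; only the `log(1 + x)` correction is taken from the description.

```python
import math


def damped(count):
    """Log-damp a non-negative count; adding 1 keeps log(0) from being evaluated."""
    return math.log(1 + count)


def illustrative_keyword_weight(tf, df, n_docs, ref_count):
    """Hypothetical combination of TF, DF, and the CYK reference count.

    NOT Equation 1 of the patent: the product form and the (1 + df) term are
    assumptions chosen only to show that every factor stays non-negative.
    """
    rarity = math.log(1 + n_docs / (1 + df))  # log-damped reciprocal of DF
    return damped(tf) * rarity * damped(ref_count)


# A word that never occurs gets weight 0 instead of raising a math domain error.
print(illustrative_keyword_weight(tf=0, df=0, n_docs=10, ref_count=0))
# With TF and DF held fixed, a more frequently referenced word scores higher.
print(illustrative_keyword_weight(3, 2, 10, 4) > illustrative_keyword_weight(3, 2, 10, 1))
```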

However, the raw values cannot be compared directly to judge whether a keyword in one text document has the same importance as a keyword in another. This is because weights are uniformly higher in text documents with high overall word frequencies and uniformly lower in text documents with relatively low word frequencies.

For this reason, to reduce the gap in keyword weights between documents, the word with the highest weight is set to 100 points and the remaining words are normalized to that 100-point scale. After this normalization, the keywords of each document can be ranked against a maximum of 100 points (see FIG. 3).
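The normalization and ranking step can be sketched as below. The function name `normalize_and_rank`, the dictionary input format, and the alphabetical tie-breaking rule are assumptions for illustration; the description only specifies that the top keyword is mapped to the reference score and the others are scaled to match.

```python
def normalize_and_rank(weights, reference_score=100.0):
    """Scale one document's keyword weights so the top keyword equals
    `reference_score`, then rank the keywords from highest to lowest.

    weights: dict mapping keyword -> raw non-negative weight
    Returns a list of (keyword, normalized_score, rank) tuples.
    """
    if not weights:
        return []
    top = max(weights.values())
    scale = reference_score / top if top > 0 else 0.0
    # Sort by descending weight; ties are broken alphabetically (an assumption).
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(kw, w * scale, rank) for rank, (kw, w) in enumerate(ordered, start=1)]


raw = {"cluster": 4.0, "keyword": 2.0, "text": 1.0}
for keyword, score, rank in normalize_and_rank(raw):
    print(rank, keyword, score)
```

Because every document's top keyword ends up at the same reference score, scores from a long, word-dense document and from a short one become directly comparable.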

Then, the third module 300 clusters similar text documents based on the keyword ranking of each text document calculated in step S200 (S300).

As an example, in step S300 the third module 300 may set any one of the text documents as a pivot to form the first cluster, and then take the remaining text documents one at a time, comparing document similarity against the clusters to decide whether each document belongs to an existing cluster or to a different one, until similar text documents have been clustered for all text documents. In this way, clusters are generated by comparing every text document with every cluster.

In step S300, the document similarity may be calculated by Equation 2 below.

(Equation 2)

Here, if D denotes all text documents and C denotes all clusters, the maximum number of clusters is C_D, and forming C_D has a computational complexity of O(C*D). An existing cluster C_i holds the top N-best keywords k_i that define the cluster together with their weights; the third module 300 compares a new text document D_ki against it and, if they match, adds the document to the existing cluster C_i. If the document matches none of the clusters C_i ... C_n, a new cluster C_new is created (see FIG. 4).

In addition, unlike the k-means algorithm, an embodiment of the present invention does not specify the number of clusters in advance, so the number of clusters cannot be set directly. Instead, documents can be clustered on the basis of a match on the 1-best keyword (the 1-best keyword always scores 100 points). The similarity required within a cluster can be tuned by adjusting the number of n-best keywords used in the analysis.
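The single-pass, pivot-based procedure can be sketched as follows. Equation 2 is not reproduced in this text, so the matching rule used here (a document joins a cluster when its top `n_best` keywords are the same set as the cluster's) is an assumption standing in for the document similarity; the function name and input format are likewise hypothetical.

```python
def cluster_documents(doc_keywords, n_best=1):
    """Single-pass clustering over documents given as ranked keyword lists.

    doc_keywords: dict mapping document id -> keywords sorted by descending score
    n_best:       how many top keywords must match (assumed matching rule)
    Returns a list of clusters, each {"keywords": frozenset, "docs": [ids]}.
    """
    clusters = []
    for doc_id, ranked in doc_keywords.items():
        signature = frozenset(ranked[:n_best])
        for cluster in clusters:              # compare with every existing cluster
            if cluster["keywords"] == signature:
                cluster["docs"].append(doc_id)
                break
        else:                                 # no match: the document becomes a new pivot
            clusters.append({"keywords": signature, "docs": [doc_id]})
    return clusters


docs = {
    "d1": ["battery", "phone"],
    "d2": ["battery", "charger"],
    "d3": ["delivery", "delay"],
}
print([c["docs"] for c in cluster_documents(docs, n_best=1)])
print([c["docs"] for c in cluster_documents(docs, n_best=2)])
```

Raising `n_best` makes the match stricter, which yields more and tighter clusters; this mirrors the tuning knob described above, while each document is still compared with each existing cluster at most once.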

Next, the fourth module 400 generates a representative topic from each cluster of similar text documents formed in step S300 (S400).

As an example, step S400 may comprise: a step (S410) in which the fourth module 400 selects candidate sentences based on dependency structure from each cluster of similar text documents formed in step S300; a step (S420) in which the fourth module 400 calculates and analyzes TextRank-based sentence weights using the candidate sentences selected in step S410 to extract a weight-based representative sentence; and a step (S430) in which the fourth module 400 generates a representative topic after removing unnecessary modifiers from the weight-based representative sentence extracted in step S420.

In step S410, the fourth module 400 preferably uses dependency parsing to select a sentence in each cluster of similar text documents formed in step S300 as a candidate sentence when the sentence forms a single clause or phrase having a subject, verb, and object structure.

In step S430, to remove unnecessary modifiers from the weight-based representative sentence extracted in step S420, the fourth module 400 may extract nouns through morphological analysis and then generate the representative topic while preserving word order.

Meanwhile, unlike classification, clustering of unstructured text data is unsupervised, so there is no reference against which to evaluate whether the clustering succeeded; moreover, to find out what the clustered text documents mean, the contents of the text documents in each cluster have to be read again.

To remove this inconvenience, an embodiment of the present invention does not stop at clustering the text documents but also generates the topic of each cluster of text documents. The TextRank algorithm is a representative method for extracting important sentences from a text document made up of many sentences.

The TextRank algorithm is based on PageRank, the ranking algorithm of Google's search engine. PageRank finds important web pages by assigning scores to incoming links.

TextRank treats each sentence as a page: if the words that make up a sentence are linked to many other sentences, that sentence is scored as important. The basic formula of the TextRank algorithm is given by Equation 3 below.

(Equation 3)

Here, TR(V_i) is the TextRank value of the sentence or word V_i, W_ji is the weight between sentences or words i and j, and d (the damping factor) is, in PageRank, the probability that a web surfer who is not satisfied with the current page moves on to another page; as in PageRank, its default value is set to 0.85.
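As a minimal sketch, the score update of the widely published TextRank formulation, TR(V_i) = (1 - d) + d * sum over incoming j of (w_ji / sum over k of w_jk) * TR(V_j), can be iterated as below. Equation 3 is not reproduced in this text, so it is an assumption that it has this standard form; the adjacency-matrix input and the fixed iteration count are also choices made here for illustration.

```python
def textrank(weights, d=0.85, iterations=50):
    """Iterate the standard TextRank update on a weighted graph.

    weights[j][i] is the edge weight from node j to node i (0 means no edge).
    Returns one score per node.
    """
    n = len(weights)
    scores = [1.0] * n
    out_sums = [sum(row) for row in weights]  # total outgoing weight of each node
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if weights[j][i] > 0 and out_sums[j] > 0
            )
            new_scores.append((1 - d) + d * incoming)
        scores = new_scores
    return scores


# Sentence 0 shares words with sentences 1 and 2, which share nothing with each other.
graph = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
scores = textrank(graph)
print(scores.index(max(scores)))  # index of the highest-scoring sentence
```

In this small graph the hub sentence receives contributions from both neighbors and therefore ends up with the highest score, which is the behavior the description attributes to heavily linked sentences.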

However, when PageRank is carried over to TextRank, a sentence made up of only a few words can still receive a high link value in the computation of TR(V_i) if its words are linked to many other sentences, and such a sentence is therefore likely to be ranked as important.

If such a sentence is chosen as the representative topic, the topic consists of only one or two words and its meaning becomes too broad to convey the subject properly. To solve this problem, an embodiment of the present invention uses dependency parsing and selects a sentence as a candidate only when it forms a single clause or phrase having a subject, verb, and object structure.
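The candidate filter can be sketched as follows, assuming the sentences have already been dependency-parsed by some external parser. The role labels `"subject"`, `"verb"`, and `"object"` and the `(token, role)` input format are hypothetical; a real parser would emit its own label set, and this sketch does not check that the sentence is a single clause.

```python
REQUIRED_ROLES = {"subject", "verb", "object"}  # hypothetical role labels


def select_candidates(parsed_sentences):
    """Keep only sentences whose parse contains a subject, a verb, and an object.

    parsed_sentences: list of sentences, each a list of (token, role) pairs
    """
    candidates = []
    for sentence in parsed_sentences:
        roles = {role for _, role in sentence}
        if REQUIRED_ROLES <= roles:  # all three roles are present
            candidates.append(sentence)
    return candidates


parsed = [
    [("company", "subject"), ("released", "verb"), ("product", "object")],
    [("great", "modifier"), ("product", "object")],  # fragment: no subject or verb
]
print(len(select_candidates(parsed)))
```

Filtering before ranking keeps one- or two-word fragments out of the TextRank graph entirely, so they cannot be returned as the representative sentence.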

In the TextRank step, only the selected candidate sentences are analyzed to extract the representative sentence; then, to remove unnecessary modifiers from the representative sentence, nouns are extracted through morphological analysis and a topic phrase is generated while preserving word order. The procedure from candidate sentence generation to representative topic generation may be carried out as shown in FIG. 5.
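The final noun-extraction step can be sketched as below, assuming the representative sentence has already been tagged by a morphological analyzer. The tag name `"NOUN"`, the `(token, tag)` input format, and joining the nouns with spaces are assumptions for illustration.

```python
def make_topic(tagged_sentence, noun_tags=("NOUN",)):
    """Build a topic phrase from the nouns of a tagged sentence, in original order.

    tagged_sentence: list of (token, tag) pairs from a morphological analyzer
    noun_tags:       tags treated as nouns (hypothetical tag set)
    """
    return " ".join(token for token, tag in tagged_sentence if tag in noun_tags)


representative = [
    ("the", "DET"), ("new", "ADJ"), ("battery", "NOUN"),
    ("quickly", "ADV"), ("drains", "VERB"), ("power", "NOUN"),
]
print(make_topic(representative))
```

Iterating over the tokens in their original sequence is what preserves word order; determiners, adjectives, adverbs, and verbs are simply skipped.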

The formula for selecting candidate sentences and generating the representative topic is given by Equation 4 below.

(Equation 4)

Here, TR(S_i) is the TextRank value of the sentence S_i; W_ji is the weight between words i and j belonging to the sentence S_i (an optimal value can be determined through simulation); S is the set of sentences that have an SVO(t_s, t_v, t_o) structure as a result of dependency parsing; and d (the damping factor) is, in PageRank, the probability that a web surfer who is not satisfied with the current page moves on to another page, with its default value set to 0.85 as in PageRank.

Meanwhile, the method for generating text document clusters and topics based on word ranking according to an embodiment of the present invention may also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system.

The computer-readable recording medium may also be distributed over computer systems connected by a computer network, so that the code is stored and executed in a distributed manner.

Although preferred embodiments of the apparatus and method for generating text document clusters and topics based on word ranking according to the present invention have been described above, the present invention is not limited to them; it may be practiced with various modifications within the scope of the claims, the detailed description of the invention, and the accompanying drawings, and such modifications also belong to the present invention.

100: first module
200: second module
300: third module
400: fourth module

Claims

An apparatus for generating text document clusters and topics based on word ranking, comprising:
a first module for extracting, for each text document, keywords with calculated weights by using relationship information between words in the unstructured text of a plurality of text documents collected from outside;
a second module for normalizing the keyword weights of each text document extracted by the first module and calculating a ranking of the keywords;
a third module for clustering similar text documents based on the keyword ranking of each text document calculated by the second module; and
a fourth module for generating a representative topic from each cluster of similar text documents formed by the third module,
wherein the fourth module uses dependency parsing to select a sentence in each cluster of similar text documents formed by the third module as a candidate sentence when the sentence forms a single clause or phrase having a subject, verb, and object structure, calculates and analyzes TextRank-based sentence weights using the selected candidate sentences to extract a weight-based representative sentence, extracts nouns through morphological analysis in order to remove unnecessary modifiers from the extracted weight-based representative sentence, and then generates the representative topic while preserving word order.

2. The apparatus of claim 1,
wherein the first module uses the CYK (Cocke-Younger-Kasami) algorithm to decompose the words of the unstructured text in each text document into morphemes, classifies them by part of speech and parses their relationships, and then forms the relationships between words within the sentences of each text document and extracts the keywords by calculating weights centered on frequently referenced words.

3. The apparatus of claim 2,
wherein the weight Keyword(t, d, D) of a keyword extracted by the first module is calculated by Equation 1 below.
(Equation 1)

Here, t is a word in morpheme units, d is a specific text document belonging to the full set of text documents, D is the full set of text documents, and t_cnt is the number of times each word is referenced in the CYK parse.

4. The apparatus of claim 1,
wherein the second module, to reduce the gap in keyword weights between text documents, sets the keyword with the highest weight to a preset reference score, normalizes the remaining keywords to the preset reference score, and ranks the keywords of each text document with the preset reference score as the maximum score.

5. The apparatus of claim 1,
wherein the third module sets any one of the text documents as a pivot to form a first cluster, and takes the remaining text documents one at a time, comparing document similarity against the clusters to decide whether each belongs to the same cluster or to a different cluster, thereby clustering similar text documents for all text documents.

6. The apparatus of claim 5,
wherein the third module calculates the document similarity by Equation 2 below.
(Equation 2)

Here, if D denotes all text documents and C denotes all clusters, the maximum number of clusters is C_D; an existing cluster C_i holds the top N-best keywords k_i that define the cluster together with their weights; a new text document D_ki compared through the third module is added to the existing cluster C_i if it matches; and if it matches none of the clusters C_i ... C_n, a new cluster C_new is created.

delete

A method for generating text document clusters and topics based on word ranking using an apparatus including first to fourth modules, the method comprising:
(a) extracting, through the first module, keywords with calculated weights for each text document by using relationship information between words in the unstructured text of a plurality of text documents collected from outside;
(b) normalizing, through the second module, the keyword weights of each text document extracted in step (a) and calculating a ranking of the keywords;
(c) clustering, through the third module, similar text documents based on the keyword ranking of each text document calculated in step (b); and
(d) generating, through the fourth module, a representative topic from each cluster of similar text documents formed in step (c),
wherein step (d) comprises:
(d-1) selecting, through the fourth module, candidate sentences based on dependency structure from each cluster of similar text documents formed in step (c);
(d-2) calculating and analyzing, through the fourth module, TextRank-based sentence weights using the candidate sentences selected in step (d-1) to extract a weight-based representative sentence; and
(d-3) generating, through the fourth module, a representative topic after removing unnecessary modifiers from the weight-based representative sentence extracted in step (d-2),
wherein in step (d-1), dependency parsing is used through the fourth module to select a sentence in each cluster of similar text documents formed in step (c) as a candidate sentence when the sentence forms a single clause or phrase having a subject, verb, and object structure, and
wherein in step (d-3), to remove unnecessary modifiers from the weight-based representative sentence extracted in step (d-2), nouns are extracted through morphological analysis by the fourth module and the representative topic is then generated while preserving word order.

11. The method of claim 10,
wherein step (a) uses, through the first module, the CYK (Cocke-Younger-Kasami) algorithm to decompose the words of the unstructured text in each text document into morphemes, classifies them by part of speech and parses their relationships, and then forms the relationships between words within the sentences of each text document and extracts the keywords by calculating weights centered on frequently referenced words.

12. The method of claim 11,
wherein the weight Keyword(t, d, D) of a keyword extracted by the first module is calculated by Equation 1 below.
(Equation 1)

13. The method of claim 10,
wherein in step (b), to reduce the gap in keyword weights between text documents, the second module sets the keyword with the highest weight to a preset reference score, normalizes the remaining keywords to the preset reference score, and ranks the keywords of each text document with the preset reference score as the maximum score.

14. The method of claim 10,
wherein in step (c), the third module sets any one of the text documents as a pivot to form a first cluster, and takes the remaining text documents one at a time, comparing document similarity against the clusters to decide whether each belongs to the same cluster or to a different cluster, thereby clustering similar text documents for all text documents.

15. The method of claim 14,
wherein in step (c), the document similarity is calculated by Equation 2 below.
(Equation 2)

delete

A computer-readable recording medium on which is recorded a program for causing a computer to execute the method of any one of claims 10 to 15.