KR101850993B1

KR101850993B1 - Method and apparatus for extracting keyword based on cluster

Info

Publication number: KR101850993B1
Application number: KR1020160166340A
Authority: KR
Inventors: 김한준; 유한묵
Original assignee: 서울시립대학교 산학협력단
Priority date: 2016-12-08
Filing date: 2016-12-08
Publication date: 2018-04-23

Abstract

A cluster-based keyword extraction method according to an embodiment of the present invention includes: classifying each of a plurality of documents into one of a plurality of clusters; calculating mutual information indicating the degree to which two words contained in the plurality of documents are concentrated in some of the plurality of clusters; and, for each cluster, calculating the importance (text rank) of each word in the cluster based on the mutual information between that word and the other words in the same cluster. This makes it possible to identify keywords in documents on a topic that accounts for a relatively small share of the entire document set.

Description

METHOD AND APPARATUS FOR EXTRACTING KEYWORD BASED ON CLUSTER

The present invention relates to a cluster-based keyword extraction method and apparatus, and more particularly to a method and apparatus that cluster similar documents into the same cluster and extract keywords for each cluster.

With the recent development of Internet technology, the number of documents published on the web is increasing rapidly. Accordingly, there is a growing need for techniques that pick out key information from such documents, and keyword extraction is one such technique. A keyword is a representative word that captures the content of a document, and it can be used for document summarization or for building a word network.

Keyword extraction methods fall broadly into word-statistics methods and word-graph methods. A word-statistics method evaluates the importance of a word using statistical information about the document. A word-graph method evaluates the importance of a word by considering the relationships between that word and its neighboring words on a word graph.

Existing word-graph methods judge the importance of a word against the entire given document set. Words in documents on a topic that makes up a relatively small share of the document set are only weakly associated with words in documents on other topics. As a result, even a word that is an important keyword for its topic is not extracted as a keyword, because the topic accounts for only a small share of the whole set.

Consequently, when documents on less popular topics are analyzed together with documents on other topics, keyword extraction accuracy is low.

The problem to be solved in one embodiment is to provide a technique that, when extracting keywords with a word-graph method, identifies keywords contained in documents on a topic that makes up a relatively small share of the entire document set.

However, the technical problems to be addressed by embodiments of the present invention are not limited to the one mentioned above, and various technical problems may be derived from the description below within a scope that is obvious to a person of ordinary skill in the art.

A cluster-based keyword extraction method according to an embodiment includes: classifying each of a plurality of documents so that it belongs to one of a plurality of clusters; calculating mutual information indicating the degree to which two words contained in the plurality of documents are concentrated in some of the plurality of clusters; and, for each cluster, calculating the importance (text rank) of each word in the cluster based on the mutual information between that word and the other words in the same cluster.

The method may further include: for each cluster, dividing the documents belonging to the cluster into predetermined units and calculating a co-occurrence value indicating the degree to which two words in the cluster appear together within a unit; and, for each cluster, applying a maximum spanning tree algorithm to connect all words in the cluster without cycles such that the sum of the co-occurrence values between connected words is maximized. In that case, calculating the importance of each word may include calculating, for each cluster, the importance of each word based on the mutual information between the word and the words connected to it.

The mutual information may be defined by an expression (Equation 1) whose terms are: the p-th cluster; the probability that the two words appear together in the p-th cluster; the probability that the two words are in the same cluster, taken over all clusters; and the ratio of the number of documents in the p-th cluster to the total number of documents.

The importance may be obtained from an expression (Equation 2) whose terms are: a word in the cluster; the set of words connected to that word, and a word in that set; the set of words connected to that second word, and a word in that set; the mutual information between connected words; and d, the probability that the importance of a word in the maximum spanning tree is evaluated from the other words connected to it. The initial importance of every word in each cluster is set to a predetermined value, the expression is evaluated for all words, and the evaluation is repeated until the importance values of all words change by no more than a preset range.

The method may further include excluding words whose frequency of common appearance across the plurality of documents is at or above a preset ratio.

Excluding the words may include calculating an ICF value (Equation 3) from the word, the number of documents N, and the number of documents in which the word appears, and excluding words whose ICF value is less than or equal to a preset value.

A cluster-based keyword extraction apparatus according to an embodiment includes a processor that extracts keywords on a cluster basis. The processor performs: an operation of classifying each of a plurality of documents so that it belongs to one of a plurality of clusters; an operation of calculating mutual information indicating the degree to which two words contained in the plurality of documents are concentrated in some of the plurality of clusters; and an operation of calculating, for each cluster, the importance (text rank) of each word in the cluster based on the mutual information between that word and the other words in the same cluster.

The processor may further perform: an operation of dividing, for each cluster, the documents belonging to the cluster into predetermined units and calculating a co-occurrence value indicating the degree to which two words in the cluster appear together within a unit; and an operation of applying, for each cluster, a maximum spanning tree algorithm to connect all words in the cluster without cycles such that the sum of the co-occurrence values between connected words is maximized. In that case, the operation of calculating the importance of each word may include calculating, for each cluster, the importance of each word based on the mutual information between the word and the words connected to it.

In the apparatus, the mutual information may be defined by an expression whose terms are: the p-th cluster; the probability that the two words appear together in the p-th cluster; the probability that the two words are in the same cluster, taken over all clusters; and the ratio of the number of documents in the p-th cluster to the total number of documents.

In the apparatus, the importance may be obtained from an expression whose terms are: a word in the cluster; the set of words connected to that word, and a word in that set; and the set of words connected to that second word, and a word in that set. The initial importance of every word in each cluster is set to a predetermined value, and the expression is then evaluated for all words.

The processor may further perform an operation of excluding words whose frequency of common appearance across the plurality of documents is at or above a preset ratio.

The operation of excluding the words may include calculating an ICF value from the word, the number of documents N, and the number of documents in which the word appears, and excluding words whose ICF value is less than or equal to a preset value.

According to an embodiment, the importance of a word can be evaluated within the cluster containing similar documents, which improves keyword extraction accuracy for documents that make up a relatively small share of the entire document set.

FIG. 1 is a flowchart showing the sequence of a cluster-based keyword extraction method according to an embodiment.
FIG. 2 is an exemplary diagram illustrating the step of classifying each of a plurality of documents so that it belongs to one of a plurality of clusters.
FIG. 3 is an exemplary diagram illustrating the step of applying a maximum spanning tree algorithm to connect all words in one cluster without cycles.
FIG. 4 is an exemplary diagram illustrating the step of calculating the importance of each word based on mutual information.
FIG. 5 is an exemplary diagram illustrating the step of excluding words whose frequency of common appearance across a plurality of documents is at or above a preset ratio.
FIG. 6 is an exemplary diagram illustrating the step of merging the words of all clusters and sorting them in descending order of importance.

The objects and technical configuration of the present invention, and the resulting effects, will be understood more clearly from the following detailed description based on the accompanying drawings. Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

The embodiments disclosed herein should not be interpreted or used to limit the scope of the present invention. To a person of ordinary skill in the art, it is evident that the description, including the embodiments, has a variety of applications. Accordingly, the embodiments set out in the detailed description are examples intended to explain the invention better and are not intended to limit its scope.

The functional blocks shown in the drawings and described below are only examples of possible implementations. Other implementations may use other functional blocks without departing from the spirit and scope of the detailed description. In addition, although one or more functional blocks of the present invention are shown as separate blocks, one or more of them may be a combination of various hardware and software components that perform the same function.

The expression that something includes certain components is open-ended: it merely indicates that those components are present and should not be understood to exclude additional components.

Furthermore, when a component is described as being connected or coupled to another component, it may be directly connected or coupled to that component, but intervening components may also be present.

Expressions such as 'first' and 'second' are used only to distinguish between multiple components and do not limit their order or other characteristics.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a flowchart showing the sequence of a cluster-based keyword extraction method according to an embodiment. Given a plurality of documents, the method of FIG. 1 clusters them so that similar documents belong to the same cluster and then extracts keywords for each cluster.

To this end, each of the plurality of documents is first classified so that it belongs to one of a plurality of clusters (S110).

FIG. 2 is an exemplary diagram illustrating this classification step. Referring to FIG. 2, the documents can be classified according to a specific criterion so that each belongs to one of a plurality of clusters.

For example, the k-means algorithm can be applied to classify the documents into clusters. A document-term matrix can be used as the input to k-means, and the elbow method (elbow algorithm) can be applied to determine a suitable number of clusters.
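Step S110 can be illustrated with a minimal sketch, assuming whitespace tokenization, raw term counts, and centroids seeded from the first k documents. The function names `document_term_matrix` and `kmeans` and the sample documents are hypothetical and do not come from the specification; choosing k with the elbow method is not shown.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([float(counts[w]) for w in vocab])
    return vocab, rows

def kmeans(rows, k, iterations=20):
    """Plain k-means; the first k rows seed the centroids (illustrative choice)."""
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(r, centroids[c])))
            for r in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy corpus with two obvious topics.
docs = [
    "cat dog cat",
    "stock market stock",
    "dog cat dog",
    "market stock bank",
]
vocab, dtm = document_term_matrix(docs)
labels = kmeans(dtm, k=2)
```

In practice a library implementation with better initialization would likely be preferred; the sketch only shows how a document-term matrix feeds the clustering step.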

Next, for each cluster, the documents belonging to the cluster are divided into predetermined units, and a co-occurrence value is calculated that indicates the degree to which two words in the cluster appear together within a unit (S120).

The predetermined unit may be, for example, a document, a paragraph, or a sentence. If the unit is a document, the probability that two words A and B appear together in a single document, taken over all documents in the cluster, can be used as the co-occurrence value of the pair A-B. If the unit is a paragraph, the probability that A and B appear together in a single paragraph, taken over the paragraphs of all documents in the cluster, can be used instead. In this way, the co-occurrence value of every pair of words in a cluster can be obtained for each cluster.
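The co-occurrence calculation of step S120 can be sketched as follows, assuming each unit (document, paragraph, or sentence) has already been tokenized and that the co-occurrence value of a pair is the fraction of units containing both words. The function name `co_occurrence` and the sample units are hypothetical.

```python
from itertools import combinations
from collections import Counter

def co_occurrence(units):
    """units: list of token lists, one per document, paragraph, or sentence.
    Returns {(a, b): fraction of units containing both a and b}, with a < b."""
    pair_counts = Counter()
    for tokens in units:
        # Count each unordered pair once per unit, regardless of repetitions.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1
    total = len(units)
    return {pair: n / total for pair, n in pair_counts.items()}

# Hypothetical units belonging to a single cluster.
units = [
    ["keyword", "cluster", "graph"],
    ["keyword", "cluster"],
    ["graph", "tree"],
    ["keyword", "tree"],
]
cooc = co_occurrence(units)
```

Pairs that never share a unit are simply absent from the result, which keeps the later graph sparse.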

Next, for each cluster, a maximum spanning tree algorithm is applied to connect all words in the cluster without cycles such that the sum of the co-occurrence values between connected words is maximized (S130).

FIG. 3 is an exemplary diagram illustrating the step of applying a maximum spanning tree algorithm to connect all words in one cluster without cycles.

Referring to FIG. 3, each word in a cluster is treated as a node, and words are connected by edges. Each edge is assigned the co-occurrence value of its two words as calculated in step S120. The maximum spanning tree algorithm then connects all words without cycles such that the sum of the co-occurrence values on the selected edges is maximized.

Calculating the importance (text rank) of a word in step S150 against every other word in the cluster would require a large amount of computation. With the maximum spanning tree, only the relationships between each word and the words with which it has a high co-occurrence value are considered, so the importance of each word can be calculated quickly and accurately.
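One way to build the tree of step S130 is Kruskal's algorithm run over edges in descending order of weight. The specification does not name a particular maximum spanning tree algorithm, so this choice, the function name `maximum_spanning_tree`, and the sample weights are assumptions made for illustration.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm on edges sorted by descending weight.
    weighted_edges: iterable of (weight, u, v). Returns the chosen edges."""
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for weight, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            tree.append((weight, u, v))
    return tree

# Hypothetical words and co-occurrence values.
nodes = ["a", "b", "c", "d"]
edges = [
    (0.5, "a", "b"),
    (0.2, "a", "c"),
    (0.4, "b", "c"),
    (0.1, "c", "d"),
    (0.3, "b", "d"),
]
tree = maximum_spanning_tree(nodes, edges)
```

If some words never co-occur with the rest, the result is a forest rather than a single tree; how the specification handles that case is not stated.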

Meanwhile, a word rated important only in a specific cluster is more likely to be a true keyword for a specific topic than a word rated important across all clusters. This is because the frequency of a given word varies from document to document depending on the topic, and a word that appears often in all documents is likely to be a word in general use and is therefore unsuitable as a keyword for a specific topic.

Accordingly, mutual information is calculated that indicates the degree to which two words contained in the plurality of documents are concentrated in some of the plurality of clusters (S140).

For example, the mutual information of each pair of words can be calculated using [Equation 1].

Equation 1 defines the mutual information between two words in terms of: the p-th cluster; the probability that the two words appear together in the p-th cluster; the probability that the two words are in the same cluster, taken over all clusters; and the ratio of the number of documents in the p-th cluster to the total number of documents.

Equation 1 therefore expresses how strongly the two words appear within a specific cluster relative to the probability that they appear in the same cluster over all clusters; that is, the degree to which they are concentrated in some of the clusters. A high MI value means the two words are strongly concentrated in a specific cluster, a moderate value means they are spread evenly over several clusters, and a value close to 0 means the two words occur independently of the clusters.
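Because the expression of Equation 1 itself is not reproduced in this text, the following sketch computes only two of the per-cluster quantities it is described as using: the share of documents in each cluster and the probability that two words appear together in a document of that cluster. The notation P(C_p) and P(w1, w2 | C_p) in the comments, the function name `cluster_probabilities`, and the sample data are assumptions; how the quantities are combined into MI is deliberately left out.

```python
def cluster_probabilities(clusters, w1, w2):
    """clusters: list of non-empty clusters, each a list of documents (token lists).
    Returns two lists indexed by cluster p:
      p_cluster[p]             ~ P(C_p), share of all documents in cluster p
      p_joint_given_cluster[p] ~ P(w1, w2 | C_p), share of cluster p's documents
                                 that contain both words
    """
    total_docs = sum(len(c) for c in clusters)
    p_cluster = [len(c) / total_docs for c in clusters]
    p_joint_given_cluster = [
        sum(1 for doc in c if w1 in doc and w2 in doc) / len(c)
        for c in clusters
    ]
    return p_cluster, p_joint_given_cluster

# Hypothetical clusters: the pair ("a", "b") is concentrated in the first one.
clusters = [
    [["a", "b"], ["a", "b"], ["a"]],
    [["c"]],
]
p_cluster, p_joint = cluster_probabilities(clusters, "a", "b")
```

A pair whose joint probability is high in one cluster and near zero elsewhere is the kind of pair the text describes as concentrated in a specific cluster.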

Next, for each cluster, the importance (text rank) of each word in the cluster is calculated based on the mutual information between the word and the other words connected to it in the maximum spanning tree (S150).

For example, the importance of each word can be calculated using [Equation 2].

Equation 2 is expressed in terms of: a word in the cluster; the set of words connected to that word, and a word in that set; the set of words connected to that second word, and a word in that set; the mutual information between the first word and the second; the mutual information between the second word and the third; and d, the probability that the importance of a word in the maximum spanning tree is evaluated from the other words connected to it, which can be set as a parameter.

The importance of every word in each cluster is first set to a predetermined value. The importance is then calculated for all words, and the calculation is repeated until the values for all words change by no more than a preset range, which yields the importance of each word.

FIG. 4 is an exemplary diagram illustrating the step of calculating the importance of each word based on mutual information.

Referring to FIG. 4, when the importance of every word is initialized to 1 and d is set to 0.85, substituting a particular word into Equation 2 gives a value of 5.4, as shown in FIG. 4. The importance of every word in a cluster is obtained from Equation 2 in this way, repeating the calculation until the values for all words change by no more than a preset range. When repeating Equation 2 for all words, the root mean square error (RMSE) between the importance values of the previous iteration and those of the current iteration can be calculated at each iteration, and the calculation can be treated as complete once the RMSE converges to 0 or falls below a preset value.
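The iteration of step S150 can be sketched as below. Since the expression of Equation 2 is not reproduced in this text, the update rule is an assumption: it follows the widely used weighted TextRank form, in which a word's score is (1 - d) plus d times the sum, over its tree neighbors, of the neighbor's score weighted by the edge's mutual information divided by the neighbor's total mutual information. The function name `text_rank`, the tolerance, and the sample weights are hypothetical; the initial value 1 and d = 0.85 follow the FIG. 4 example.

```python
import math

def text_rank(weights, d=0.85, tol=1e-6, max_iter=500):
    """weights: {(u, v): mutual information} for undirected tree edges.
    Iterates an ASSUMED weighted-TextRank update until the root mean square
    change between successive iterations falls below tol."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    # Total mutual information on the edges of each word.
    out_sum = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    rank = {n: 1.0 for n in adj}  # initial importance of every word
    for _ in range(max_iter):
        new_rank = {}
        for i, nbrs in adj.items():
            incoming = sum(
                w / out_sum[j] * rank[j]
                for j, w in nbrs.items()
                if out_sum[j] > 0
            )
            new_rank[i] = (1 - d) + d * incoming
        rmse = math.sqrt(
            sum((new_rank[n] - rank[n]) ** 2 for n in adj) / len(adj)
        )
        rank = new_rank
        if rmse < tol:
            break
    return rank

# Hypothetical star-shaped tree with mutual information on each edge.
edges = {
    ("keyword", "cluster"): 0.6,
    ("keyword", "graph"): 0.3,
    ("keyword", "tree"): 0.1,
}
ranks = text_rank(edges)
```

In this toy tree the hub word ends with the highest score, and each leaf's score grows with the mutual information on its edge to the hub.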

In this way, the importance of a word can be evaluated within the cluster containing similar documents, which improves keyword extraction accuracy for documents that make up a relatively small share of the entire document set.

Meanwhile, even though the clusters are formed by a specific criterion (for example, topic), a word that appears frequently across all clusters may not be an important keyword in any particular cluster. Therefore, to remove words that appear frequently across all clusters even if their importance from step S150 is high, words whose frequency of common appearance across the plurality of documents is at or above a preset ratio can be excluded (S160).

For example, the ratio at which a given word appears in common across the plurality of documents can be obtained using [Equation 3].

Equation 3 is expressed in terms of the word, the number of documents N, and the number of documents in which the word appears. Accordingly, the more evenly a word appears across all documents, the lower its ICF (inverse cluster frequency).

FIG. 5 is an exemplary diagram illustrating the step of excluding words whose frequency of common appearance across a plurality of documents is at or above a preset ratio.

For example, as in FIG. 5, a word whose ICF is less than 1 is excluded even if its calculated importance is high, so that the word is judged not to be a keyword.
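The filtering of step S160 can be sketched as follows. The expression of Equation 3 is not reproduced in this text, so the form used here, the natural logarithm of N divided by the number of documents containing the word, is an assumption modeled on inverse document frequency; the logarithm base, the threshold, the function names `icf` and `drop_common_words`, and the sample data are likewise hypothetical.

```python
import math

def icf(word, docs):
    """ASSUMED form: log(N / n_w), where N is the number of documents and
    n_w is the number of documents that contain the word."""
    n_w = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_w) if n_w else float("inf")

def drop_common_words(ranks, docs, threshold):
    """Exclude words whose ICF is less than or equal to the threshold."""
    return {w: r for w, r in ranks.items() if icf(w, docs) > threshold}

# Hypothetical documents (as word sets) and importance scores.
docs = [
    {"the", "cluster"},
    {"the", "keyword"},
    {"the", "graph"},
    {"the", "keyword", "tree"},
]
ranks = {"the": 3.2, "keyword": 2.1, "tree": 0.9}
kept = drop_common_words(ranks, docs, threshold=0.5)
```

Here "the" appears in every document, so its ICF is 0 and it is dropped despite having the highest importance score.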

Afterward, the words in each cluster can be sorted in descending order of importance to see which words are the keywords of each cluster (S170). In addition, as in FIG. 6, the words of all clusters can be merged and sorted in descending order of importance to view the keywords for all documents together (S180).
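Steps S170 and S180 amount to two sorts, sketched below. The function names `top_keywords` and `merged_keywords`, the cluster labels, and the scores are hypothetical, and keeping a word that occurs in several clusters as separate entries is an assumption, since the text does not say how such words are merged.

```python
def top_keywords(ranks, n=None):
    """Sort one cluster's words by importance, highest first (S170)."""
    ordered = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
    return ordered[:n]

def merged_keywords(ranks_by_cluster):
    """Pool (word, cluster, importance) triples from every cluster and sort
    them together, highest importance first (S180)."""
    pooled = [
        (word, cluster, score)
        for cluster, ranks in ranks_by_cluster.items()
        for word, score in ranks.items()
    ]
    return sorted(pooled, key=lambda t: t[2], reverse=True)

# Hypothetical importance scores per cluster.
ranks_by_cluster = {
    "sports": {"goal": 2.0, "match": 1.5},
    "finance": {"stock": 2.4, "bank": 0.8},
}
per_cluster = top_keywords(ranks_by_cluster["sports"])
merged = merged_keywords(ranks_by_cluster)
```

Passing `n` to `top_keywords` limits the output to the n highest-ranked words of a cluster.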

The embodiments of the present invention described above can be implemented by various means, for example by hardware, firmware, software, or a combination thereof.

In a hardware implementation, the method according to embodiments of the present invention can be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), processors, controllers, microcontrollers, microprocessors, and the like.

In a firmware or software implementation, the method according to embodiments of the present invention can be implemented in the form of a module, procedure, or function that performs the functions or operations described above. Software code can be stored in a memory unit and executed by a processor. The memory unit may be located inside or outside the processor and can exchange data with the processor by various known means.

A person skilled in the art to which the present invention pertains will understand that the invention can be carried out in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

A cluster-based keyword extraction method comprising:
classifying each of a plurality of documents so that it belongs to one of a plurality of clusters;
calculating mutual information indicating the degree to which two words contained in the plurality of documents are concentrated in some of the plurality of clusters; and
for each cluster, calculating the importance (text rank) of each word in the cluster based on the mutual information between that word and the other words in the same cluster.

2. The method according to claim 1, further comprising:
dividing, for each cluster, the documents belonging to the cluster into predetermined units and calculating a co-occurrence indicating the degree to which two words included in the cluster appear together; and
applying, for each cluster, a maximum spanning tree algorithm to connect all the words in the cluster without cycles such that the sum of the co-occurrences of the connected words is maximized,
wherein calculating the importance of each word comprises calculating, for each cluster, the importance of each word included in the cluster based on the mutual information and the words connected to that word.
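The maximum spanning tree step recited above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: it uses Kruskal's algorithm with edges sorted from heaviest to lightest, and the function name `maximum_spanning_tree`, the dictionary layout of `cooccurrence`, and the sample words and weights are all hypothetical.

```python
from typing import Dict, List, Tuple


def maximum_spanning_tree(
    words: List[str],
    cooccurrence: Dict[Tuple[str, str], float],
) -> List[Tuple[str, str, float]]:
    """Connect words without cycles so that total co-occurrence is maximal.

    Kruskal's algorithm, visiting the heaviest edges first (assumed choice).
    """
    parent = {w: w for w in words}

    def find(w: str) -> str:
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    tree = []
    by_weight = sorted(cooccurrence.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), weight in by_weight:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge does not close a cycle
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Hypothetical co-occurrence counts for one cluster.
words = ["cluster", "keyword", "rank", "document"]
cooccurrence = {
    ("cluster", "keyword"): 5,
    ("keyword", "rank"): 4,
    ("cluster", "rank"): 3,
    ("rank", "document"): 2,
    ("cluster", "document"): 1,
}
tree = maximum_spanning_tree(words, cooccurrence)
```

With four words the tree keeps three edges; the `cluster`/`rank` edge is dropped because both words are already connected through `keyword`.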

3. The method according to claim 1,
wherein the mutual information is calculated by the following equation:
[equation not reproduced]
(where the quantities in the equation are: the p-th cluster; the probability that the two words occur at the same time in the p-th cluster; the probability that the two words occur in the same cluster over all clusters; and the ratio of the number of documents in the p-th cluster to the total number of documents).
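Because the claimed equation is not reproduced in this text, the following sketch does not attempt to restore it. It instead computes the standard mutual information between a binary variable (whether a document contains both words) and the document's cluster label, which is one common way to measure how strongly a word pair is biased toward particular clusters. The function name, the document-level counting, and the sample data are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, List, Set


def pair_cluster_mutual_information(
    docs: List[Set[str]],
    labels: List[Hashable],
    t_i: str,
    t_j: str,
) -> float:
    """Standard I(X; C), with X = 1 if a document contains both words.

    Assumption: this stands in for the claimed equation, which is not
    available in the source text.
    """
    n = len(docs)
    joint = Counter()
    for words, label in zip(docs, labels):
        x = int(t_i in words and t_j in words)
        joint[(x, label)] += 1

    # Marginal counts for the co-occurrence indicator and the cluster label.
    count_x = Counter()
    count_c = Counter()
    for (x, c), count in joint.items():
        count_x[x] += count
        count_c[c] += count

    mi = 0.0
    for (x, c), count in joint.items():
        p_xc = count / n
        mi += p_xc * math.log(p_xc / ((count_x[x] / n) * (count_c[c] / n)))
    return mi


# Pair concentrated in cluster 0 versus spread evenly over both clusters.
biased = pair_cluster_mutual_information(
    [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}], [0, 0, 1, 1], "a", "b"
)
even = pair_cluster_mutual_information(
    [{"a", "b"}, {"c"}, {"a", "b"}, {"c"}], [0, 0, 1, 1], "a", "b"
)
```

A pair that appears only in one cluster scores high, while a pair spread evenly over the clusters scores zero, which matches the intent of a measure of cluster bias.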

4. The method of claim 3,
wherein the text rank is calculated by the following equation:
[equation not reproduced]
(where the quantities in the equation are: a word contained in the cluster; the set of words associated with a first word, and the words in that set; and the set of words associated with a second word, and the words in that set),
wherein the value is set to a predetermined value and, for all the words, is repeatedly calculated until it no longer changes by more than a predetermined range.
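The iterative calculation described above can be illustrated with the generic weighted TextRank update. This is a sketch under stated assumptions rather than the claimed equation, which is not reproduced in this text: edge weights stand in for the mutual information between connected words, and the damping factor 0.85, the initial value 1.0, the tolerance, and the function name `weighted_text_rank` are all hypothetical choices.

```python
from typing import Dict, List, Tuple


def weighted_text_rank(
    edges: List[Tuple[str, str, float]],
    damping: float = 0.85,   # assumed value, not taken from the source
    initial: float = 1.0,    # assumed predetermined starting value
    tol: float = 1e-6,       # assumed convergence range
    max_iter: int = 200,
) -> Dict[str, float]:
    """Generic weighted TextRank on an undirected word graph."""
    neighbors: Dict[str, Dict[str, float]] = {}
    for a, b, w in edges:
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w

    # Total edge weight attached to each word, used to normalize its votes.
    strength = {v: sum(nb.values()) for v, nb in neighbors.items()}
    rank = {v: initial for v in neighbors}

    for _ in range(max_iter):
        new_rank = {}
        for v, nb in neighbors.items():
            incoming = sum(
                w / strength[u] * rank[u] for u, w in nb.items() if strength[u] > 0
            )
            new_rank[v] = (1 - damping) + damping * incoming
        # Stop once no word's value changes by more than the tolerance.
        converged = max(abs(new_rank[v] - rank[v]) for v in rank) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical tree edges weighted by mutual information.
ranks = weighted_text_rank(
    [("keyword", "cluster", 2.0), ("keyword", "rank", 1.0), ("keyword", "document", 1.0)]
)
```

In this example `keyword` is connected to every other word and receives the highest value, and `cluster` outranks the remaining words because its edge carries more weight.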

5. The method according to claim 1,
further comprising excluding words whose frequency of appearance in the plurality of documents is equal to or greater than a predetermined ratio.

6. The method of claim 5,
wherein excluding the words comprises calculating an ICF value by the following equation:
[equation not reproduced]
(where the quantities in the equation are: a word; N, the number of documents; and the number of documents in which the word has appeared),
and excluding words whose ICF value is less than or equal to a predetermined value.
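The filtering step above can be sketched as follows. Since the ICF equation is not reproduced in this text, the sketch assumes an inverse-frequency form, log(N / n_t), where n_t is the number of documents containing word t; this form is an assumption chosen because it is low for words that appear in most documents, and the function name and threshold are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def filter_by_icf(
    docs: List[List[str]],
    threshold: float,
) -> Tuple[Set[str], Dict[str, float]]:
    """Keep words whose assumed ICF score exceeds the threshold.

    Assumption: ICF(t) = log(N / n_t); the source equation is unavailable.
    """
    n_docs = len(docs)
    doc_freq: Dict[str, int] = {}
    for words in docs:
        for w in set(words):  # count each word once per document
            doc_freq[w] = doc_freq.get(w, 0) + 1

    icf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}
    # Words at or below the threshold are excluded, as the claim recites.
    kept = {w for w, score in icf.items() if score > threshold}
    return kept, icf


docs = [
    ["the", "cluster"],
    ["the", "keyword"],
    ["the", "rank"],
    ["the", "cluster"],
]
kept, icf = filter_by_icf(docs, threshold=0.5)
```

Here `the` appears in every document, so its score is zero and it is excluded, while the less frequent words are kept.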

7. A cluster-based keyword calculation apparatus comprising a processor that generates keywords based on clusters,
wherein the processor performs operations of:
classifying each of a plurality of documents as belonging to one of a plurality of clusters;
computing mutual information indicating the degree to which two words contained in the plurality of documents are biased toward some of the plurality of clusters; and
calculating, for each cluster, a text rank of each word included in the cluster based on the mutual information between that word and another word included in the same cluster.

8. The apparatus of claim 7,
wherein the processor further performs operations of:
dividing, for each cluster, the documents belonging to the cluster into predetermined units and calculating a co-occurrence indicating the degree to which two words included in the cluster appear together; and
applying, for each cluster, a maximum spanning tree algorithm to connect all the words in the cluster without cycles such that the sum of the co-occurrences of the connected words is maximized,
wherein the operation of calculating the importance of each word comprises calculating, for each cluster, the importance of each word included in the cluster based on the mutual information and the words connected to that word.

9. The apparatus of claim 7,
wherein the mutual information is calculated by the following equation:
[equation not reproduced]
(where the quantities in the equation are: the p-th cluster; the probability that the two words occur at the same time in the p-th cluster; the probability that the two words occur in the same cluster over all clusters; and the ratio of the number of documents in the p-th cluster to the total number of documents).

10. The apparatus of claim 9,
wherein the text rank is calculated by the following equation:
[equation not reproduced]
(where the quantities in the equation are: a word contained in the cluster; the set of words associated with a first word, and the words in that set; and the set of words associated with a second word, and the words in that set),
wherein the value is set to a predetermined value and, for all the words, is repeatedly calculated until it no longer changes by more than a predetermined range.

11. The apparatus of claim 7,
wherein the processor further performs an operation of excluding words whose frequency of appearance in the plurality of documents is equal to or greater than a predetermined ratio.

12. The apparatus of claim 11,
wherein the operation of excluding the words comprises calculating an ICF value by the following equation:
[equation not reproduced]
(where the quantities in the equation are: a word; N, the number of documents; and the number of documents in which the word has appeared),
and excluding words whose ICF value is less than or equal to a predetermined value.

13. A program stored in a computer-readable medium for causing a processor to perform each step of the method according to any one of claims 1 to 6.

14. A computer-readable medium having stored thereon instructions for causing a processor to perform each step of the method according to any one of claims 1 to 6.