KR100876319B1

KR100876319B1 - Apparatus for providing document clustering using re-weighted term

Info

Publication number: KR100876319B1
Application number: KR1020070081006A
Authority: KR
Inventors: 이주홍; 박선; 김덕환; 안찬민
Original assignee: 인하대학교 산학협력단
Priority date: 2007-08-13
Filing date: 2007-08-13
Publication date: 2008-12-31

Abstract

A clustering apparatus and method are presented that minimize the semantic gap between the clusters a user requires and the clusters a system provides. In the document clustering apparatus using term weight recalculation, a preprocessing unit (200) decomposes input document information into individual sentences, removes stop words, extracts word roots, and generates a term-sentence matrix so that documents can be clustered on the basis of non-negative matrix factorization. A document clustering unit (300) recalculates the term weights and generates document clusters.

Description

Apparatus for providing Document Clustering using Re-weighted Term

The present invention relates to a document clustering apparatus using term weight recalculation. More particularly, it relates to an apparatus that uses terms whose weights are recalculated from user feedback in order to minimize the semantic gap between high-level user requirements and low-level system clusters, and that improves the performance of a document clustering method by using, during clustering, a semantic feature matrix and a semantic variable matrix that represent the internal structure of the document set.

Document clustering is the discovery, by a clustering algorithm, of groups of documents with similar characteristics within a document set. It is an important document analysis method used in many information retrieval applications, such as organizing data, browsing web search results, and multi-document summarization.

In document clustering, the distribution of the data set, its internal structure, and the form of clustering the user wants all have a considerable effect on the clustering result. Studies have sought to improve clustering quality while providing only limited information to the user.

One such study proposed a semi-supervised document clustering model that integrates prior knowledge of cluster membership into document cluster analysis.

This semi-supervised document clustering model encodes the clusters into which the user wants documents grouped, and adds the encoded prior knowledge of the user to the clustering cost function as a penalty term.

Another study is the semi-supervised K-means method, which generates initial seed clusters from labeled data and performs clustering under constraints generated from the labeled data.

In addition, a method has been proposed that converts the clustering of web search results, an unsupervised learning problem, into a supervised learning problem. This method computes several attributes from a given query and the ranked list of search results, and applies these attributes together with training data to regression model learning, with the aim of supporting the user's browsing of the search results.

As yet another study, there is a method that separates the document features needed for clustering from general features unrelated to cluster names, and that provides a means of adjusting parameters by applying various types of user feedback to the proposed model.

However, the conventional methods described above reflect user feedback insufficiently or not at all, which causes a semantic gap between the clustering result the user wants and the clustering result produced by the system.

In addition, the conventional methods described above have difficulty reflecting the internal structure of the document set in the clustering, and therefore cannot guarantee the accuracy of the document clusters.

An object of the present invention is to provide a clustering apparatus and method that minimize the semantic gap between the clusters a user requires and the clusters a system provides.

Another object of the present invention is to provide a clustering apparatus and method with which the internal structure of the clusters and the distribution of semantic features can be readily identified.

Yet another object of the present invention is to provide a clustering apparatus and method that improve the accuracy of document clustering by recalculating term weights.

Yet another object of the present invention is to provide a term weight estimation method that can be applied to conventional document clustering methods to improve their performance.

The present invention relates to a document clustering apparatus using term weight recalculation, comprising: a preprocessing unit that, as a stage preceding document clustering by term weight recalculation, decomposes input document information into individual sentences, removes stop words, extracts word roots, and generates a term-sentence matrix, thereby enabling documents to be clustered on the basis of non-negative matrix factorization (NMF); and a document clustering unit that performs non-negative matrix factorization on the term-sentence matrix, recalculates the term weights, and generates document clusters; wherein the non-negative matrix factorization is performed so that the objective function given by the following equation is minimized.

(In the equation, J is the objective function, A is the term-sentence matrix, W is the non-negative semantic feature matrix with W=[w_ij], and H^T is the non-negative semantic variable matrix with H=[h_ij].)

Preferably, the apparatus further comprises: an input unit that receives, from a user, document information for term weight recalculation and classifies the received documents; and an output unit that outputs the document clustering result generated by the document clustering unit so that the user can check it.

Also preferably, the preprocessing unit comprises: sentence decomposition means for decomposing the received document information into individual sentences; document cluster setting means for setting the number of document clusters for the received document information; stop word removal means for removing stop words from each decomposed sentence; root extraction means for extracting word roots from each sentence from which stop words have been removed; term-frequency vector generation means for generating, for each sentence from which stop words have been removed, a term-frequency vector according to how often each term is used; and term-sentence matrix generation means for generating a term-sentence matrix from the received document information.

Also preferably, the document clustering unit comprises: non-negative matrix factorization means for factorizing the term-sentence matrix into a non-negative semantic feature matrix (NSFM) and a non-negative semantic variable matrix (NSVM); non-negative matrix calculation means for calculating the non-negative semantic feature matrix and the non-negative semantic variable matrix; weight recalculation means for recalculating the term weights using the non-negative semantic feature matrix and the non-negative semantic variable matrix; and document cluster generation means for clustering documents using the recalculated term weights.

Also preferably, the stop words are removed using the Rijsbergen stop word list.

(Deleted)

Also preferably, to update the element values of W and H^T, the calculation given by the following equation is repeated until the value of the objective function falls below a preset convergence tolerance or a preset number of iterations is exceeded.

Also preferably, the convergence tolerance is between 0.0001 and 0.01.

Also preferably, the number of iterations is between 10 and 100.

Also preferably, a non-negative semantic feature matrix and a non-negative semantic variable matrix are generated as the result of the non-negative matrix factorization, and the generated matrices are normalized using the following equation.

(In the equation, w_ij denotes an element of the non-negative semantic feature matrix, and h_ij denotes an element of the non-negative semantic variable matrix.)

Also preferably, document d_i is assigned to cluster x if the condition given by the following expression is satisfied.

Also preferably, the term weights are recalculated according to the following equations.

(In the equations, Δg_a is the change in the average weight over all elements of row a, Δg_a^i is the change in the weight of the a-th term for the i-th document, and n is the total number of documents. Further, g_a is the weight of the a-th term, A_ai is the frequency of the a-th term in the i-th document, W and H are the non-negative semantic feature matrix and the non-negative semantic variable matrix generated by the non-negative matrix factorization, I_i is the set of indices k in the i-th document column H_*i of H for which ΔH_ki is not 0, H_ki^old is the original value, and H_ki^new is the value modified by the training documents.)

Also preferably, the initial value of g_a^old is set to 1.

Also preferably, H_ki^new is calculated by the following equations.

(In the equations, c is the index of a cluster with c=1,2,...,e, e is the number of clusters, the difference term represents the difference in importance of the semantic features within the c-th cluster, and ξ is an adjustment constant.)

Also preferably, the adjustment constant ξ is 0.5.

(In the equation, A is the term-sentence matrix, the left-hand side is the term-sentence matrix to which the recalculated weights have been applied, and G is the weight matrix of the recalculated terms.)

And preferably, the document clusters are generated by applying the K-means method to the weighted term-sentence matrix.

(Deleted)

According to the present invention, the semantic gap between the clusters the user wants and the clusters produced by the system can be minimized by using documents classified in advance by the user.

According to the present invention, the internal structure of the clusters and the distribution of semantic features can be readily identified by using semantic features and semantic variables.

According to the present invention, the accuracy of document clustering can be improved because the term weights are easily recalculated using semantic features and semantic variables.

According to the present invention, the weight estimation can be computed not only with a document clustering method based on non-negative matrix factorization but also by applying singular value decomposition (SVD), which broadens the range of application.

According to the present invention, the estimated term weights can also be applied to conventional document clustering methods, which improves the performance of those methods as well.

Before the details for carrying out the present invention are described, it should be noted that configurations not directly related to the technical gist of the invention have been omitted to the extent that the omission does not obscure that gist.

In addition, the terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concept of a term in order to describe the invention in the best way.

The present invention proposes an apparatus and method that estimate term weights based on non-negative matrix factorization (NMF) using a supervised learning method, and that cluster documents by applying the estimated term weights to the document set to be clustered.

Non-negative matrix factorization is an algorithm inspired by the observation that humans recognize an object as a combination of its parts. It decomposes object information into base features and encoding features, and thereby yields a parts-based representation.

Representing a whole object as a combination of such parts allows large amounts of information to be represented efficiently. In the preferred embodiment of the present invention, positive factors are found from a given positive matrix.

Hereinafter, a document clustering apparatus using term weight recalculation according to a preferred embodiment of the present invention is described with reference to FIG. 1.

FIG. 1 is an overall configuration diagram of the document clustering apparatus using term weight recalculation according to a preferred embodiment of the present invention.

As shown in FIG. 1, the document clustering apparatus using term weight recalculation according to the preferred embodiment includes an input unit (100), a preprocessing unit (200), a document clustering unit (300), and an output unit (400).

The input unit (100) receives, from the user, document information for term weight recalculation and classifies the received documents so that they match the cluster labels.

The preprocessing unit (200) operates as the stage preceding document clustering by term weight recalculation. It decomposes the received document information into individual sentences, removes unnecessary character strings, extracts word roots, and makes it possible to cluster the documents on the basis of non-negative matrix factorization. It includes sentence decomposition means (210), document cluster setting means (220), stop word removal means (230), root extraction means (240), term-frequency vector generation means (250), and term-sentence matrix generation means (260).

First, the sentence decomposition means (210) decomposes the received document information into individual sentences.

Next, the document cluster setting means (220) sets the number of document clusters for the received document information. In this embodiment the number of document clusters is set to k, but the present invention is not limited thereto.

Next, the stop word removal means (230) removes stop words from each sentence produced by the sentence decomposition means (210), using a stop word list (for example, the Rijsbergen list).

For reference, stop words (also called noise words) are words that are ignored during a search, or that a search engine excludes from the index when building its database. In Korean, examples are postpositional particles, suffixes, conjunctions, and word endings such as '~를', '~을', '~에서', and '~와'. In English, examples are verbs, auxiliary verbs, personal pronouns, demonstrative pronouns, and prepositions such as 'a', 'an', 'is', 'are', 'if', 'for', 'but', 'this', and 'may'.

Next, the root extraction means (240) extracts the word roots from each sentence.

For reference, a root is the morpheme, among those that make up a word, that carries the word's substantive meaning. In a compound word it is each of the combined content morphemes, and in a derived word it is the part that remains after the affixes are removed. For example, the roots of '덮개' (cover), '사람답다' (humanlike), and '넉넉하다' (ample) are '덮', '사람', and '넉넉', respectively.

Next, the term-frequency vector generation means (250) generates, for each sentence, a term-frequency vector according to how often each term other than the stop words is used.

Finally, the term-sentence matrix generation means (260) generates an m×n matrix A from the m terms and n sentences in the documents.

Here, matrix A can be written as A=[A_*1,A_*2,...,A_*n], where each column vector A_*i is the term-frequency vector of the i-th document.
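By way of illustration only, the preprocessing chain described above (sentence decomposition, stop word removal, root extraction, term-frequency vectors, term-sentence matrix) might be sketched in Python as follows. The names `STOP_WORDS`, `split_sentences`, `stem`, `term_frequency`, and `term_sentence_matrix` are hypothetical, the stop word set is a tiny stand-in for the Rijsbergen list, and the suffix stripping is a crude placeholder for real root extraction.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word set; a stand-in for the Rijsbergen list.
STOP_WORDS = {"a", "an", "is", "are", "if", "for", "but", "this", "may", "the"}

def split_sentences(document):
    """Split a document into sentences on '.', '!' or '?' (a simplification)."""
    return [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]

def stem(word):
    """Crude suffix stripping; a placeholder for real root extraction."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequency(sentence):
    """Term-frequency vector of one sentence, with stop words removed."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(stem(w) for w in words if w not in STOP_WORDS)

def term_sentence_matrix(document):
    """Return (terms, A): A has one row per term and one column per sentence."""
    vectors = [term_frequency(s) for s in split_sentences(document)]
    terms = sorted(set().union(*vectors)) if vectors else []
    matrix = [[vec[t] for vec in vectors] for t in terms]
    return terms, matrix

terms, A = term_sentence_matrix("Cats chase mice. This dog chases cats. Mice are fast.")
for term, row in zip(terms, A):
    print(term, row)
```

Each column of `A` is the term-frequency vector of one sentence, matching the layout A=[A_*1,...,A_*n] used in the description.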

The document clustering unit (300) performs non-negative matrix factorization on the matrix A generated by the term-sentence matrix generation means (260), recalculates the term weights, and generates the document clusters. It includes non-negative matrix factorization means (310), non-negative matrix calculation means (320), weight recalculation means (330), and document cluster generation means (340).

The non-negative matrix factorization means (310) factorizes matrix A into a non-negative semantic feature matrix (NSFM) and a non-negative semantic variable matrix (NSVM).

Hereinafter, the decomposition of the term-sentence matrix using non-negative matrix factorization (NMF) is described.

Assuming that the document set consists of k clusters, the term-sentence matrix A is decomposed, as in [Equation 1], into an m×k non-negative semantic feature matrix W and a k×n non-negative semantic variable matrix H^T such that the objective function of [Equation 2] is minimized. In this embodiment, the element in row i and column j of matrix A is denoted A_ij.

In [Equation 1] and [Equation 2], W=[w_ij], H=[h_ij], and W=[w_1,w_2,...,w_k].

To update the element values of W and H^T, [Equation 3] is repeated until the value of the objective function J of [Equation 2] falls below a preset convergence tolerance or a specified number of iterations is exceeded.

The preset convergence tolerance can be set between 0.0001 and 0.01, and is preferably 0.001. The specified number of iterations can be set between 10 and 100, and is preferably 50.

The non-negative matrices W and H are obtained by performing the non-negative matrix factorization of [Equation 3] on the term-sentence matrix A constructed from the given document set.
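The iteration described above can be sketched as follows. Because the equations themselves are not reproduced in this text, the sketch assumes the commonly used squared-error objective J = 0.5 * ||A - W H^T||^2 and the standard multiplicative update rules for that objective; these forms, the random initialization, and the names `nmf`, `objective`, and `multiply_ratio` are assumptions made for illustration. The defaults `tol=0.001` and `max_iter=50` follow the preferred values stated in the description.

```python
import random

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    cols = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def objective(A, W, H):
    # Assumed objective: J = 0.5 * squared Frobenius norm of (A - W H^T).
    R = matmul(W, transpose(H))
    return 0.5 * sum((a - r) ** 2 for ra, rr in zip(A, R) for a, r in zip(ra, rr))

def multiply_ratio(M, num, den, eps=1e-9):
    # Element-wise M * num / den; eps guards against division by zero.
    return [[m * a / (d + eps) for m, a, d in zip(rm, ra, rd)]
            for rm, ra, rd in zip(M, num, den)]

def nmf(A, k, tol=0.001, max_iter=50, seed=0):
    """Factor A (m x n) into W (m x k) and H (n x k) so that A ~ W H^T."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() for _ in range(k)] for _ in range(m)]
    H = [[rng.random() for _ in range(k)] for _ in range(n)]
    for _ in range(max_iter):
        # Assumed multiplicative updates for the squared-error objective.
        H = multiply_ratio(H, matmul(transpose(A), W),
                           matmul(H, matmul(transpose(W), W)))
        W = multiply_ratio(W, matmul(A, H),
                           matmul(W, matmul(transpose(H), H)))
        if objective(A, W, H) < tol:  # stop once J is below the tolerance
            break
    return W, H

A = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
W, H = nmf(A, k=2)
print(objective(A, W, H))
```

Multiplicative updates keep every element non-negative as long as the starting values are non-negative, which is why no explicit clipping step appears in the loop.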

Matrices W and H are then normalized using [Equation 4], and matrix H is used to determine the cluster label of each data point. For example, document d_i is assigned to cluster x if the condition given by the corresponding expression is satisfied.
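A minimal sketch of the normalization and labeling step is given below. The normalization equation and the assignment condition are not reproduced in this text, so the sketch assumes a form often used with NMF-based clustering: each column of W is scaled to unit Euclidean length, the matching column of H is scaled by the same factor so that the product W H^T is unchanged, and document d_i is assigned to the cluster whose entry in row i of H is largest. The function names `normalize` and `assign_clusters` are hypothetical.

```python
import math

def normalize(W, H):
    """Scale columns of W to unit length and compensate in H (assumed form)."""
    k = len(W[0])
    # Fall back to 1.0 for an all-zero column to avoid dividing by zero.
    norms = [math.sqrt(sum(row[j] ** 2 for row in W)) or 1.0 for j in range(k)]
    W_n = [[row[j] / norms[j] for j in range(k)] for row in W]
    H_n = [[row[j] * norms[j] for j in range(k)] for row in H]
    return W_n, H_n

def assign_clusters(H):
    """Assumed rule: document i goes to the cluster with the largest H[i][j]."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in H]

W = [[3.0, 0.0], [4.0, 0.0], [0.0, 0.5]]
H = [[0.2, 0.1], [0.01, 0.9], [0.3, 0.0]]
W_n, H_n = normalize(W, H)
labels = assign_clusters(H_n)
print(labels)
```

Normalizing first makes the entries of H comparable across clusters, since otherwise a semantic feature with a large overall scale in W would be under-represented in H.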

The following is an example in which matrix A is decomposed into the matrices W and H^T by applying non-negative matrix factorization using [Equation 3]. In this example, k=2, the number of iterations allowed for convergence is 50, the convergence tolerance is 0.001, and the initial values of the W and H^T matrices were each 0.5.

As a result of the non-negative matrix factorization, the j-th column vector A_*j of matrix A is expressed as a linear combination of the l-th column vectors W_*l of matrix W and the elements H_kj^T of matrix H^T, as in [Equation 5].
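The linear-combination view can be checked numerically. The sketch below rebuilds one column of the product W H^T as a weighted sum of the columns of W, with the weights taken from the corresponding column of H^T; the function name `reconstruct_column` and the small matrices are illustrative assumptions, not values from the specification.

```python
def reconstruct_column(W, H_T, j):
    """Column j of W H^T as a weighted sum of the columns of W."""
    m, k = len(W), len(W[0])
    out = [0.0] * m
    for l in range(k):
        weight = H_T[l][j]          # semantic variable for feature l, column j
        for i in range(m):
            out[i] += weight * W[i][l]
    return out

W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]       # 3 terms x 2 semantic features
H_T = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]       # 2 semantic features x 3 sentences
print(reconstruct_column(W, H_T, 2))
```

The third column uses both semantic features with weight 1, so it is simply the sum of the two columns of `W`.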

Next, the non-negative matrix calculation means (320) calculates the non-negative semantic feature matrix W and the non-negative semantic variable matrix H.

When the non-negative semantic feature matrix obtained from matrix A by the non-negative matrix factorization means (310) is denoted W and the non-negative semantic variable matrix is denoted H, a semantic feature vector W_*l of matrix W represents an internal feature of the sentences, and a semantic variable H_kj, an element of matrix H, represents the importance of that semantic feature within a sentence.

Next, the weight recalculation means (330) recalculates the term weights using the result of the non-negative matrix factorization, so that they can be applied to the document set to be clustered.

Hereinafter, the term weight recalculation performed by the weight recalculation means (330) is described in detail.

The term weight recalculation can be carried out according to [Equation 9] and [Equation 10]. [Equation 9] is derived from [Equation 6] to [Equation 8].

In [Equation 6] to [Equation 8], g_a is the weight of the a-th term, A_ai is the frequency of the a-th term in the i-th document, I_i is the set of indices k in the i-th document column H_*i of H for which ΔH_ki is not 0, H_ki^old is the original value, and H_ki^new is the value modified by the training documents.

In [Equation 9], Δg_a is the change in the average weight over all elements of row a, Δg_a^i is the change in the weight of the a-th term for the i-th document, and n is the total number of documents.

In [Equation 10], the initial value of g_a^old is set to 1.
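The averaging step can be illustrated with a short sketch. It takes the per-document changes Δg_a^i as given inputs, because the equations that produce them are not reproduced in this text. It assumes that Δg_a is the mean of Δg_a^i over the n documents, as the definitions above suggest, and that the new weight is obtained additively as g_a^new = g_a^old + Δg_a; the additive form and the name `update_weights` are assumptions made for illustration.

```python
def update_weights(g_old, delta_per_doc):
    """Average the per-document weight changes and add them to the old weights.

    delta_per_doc[i][a] is the change for term a contributed by document i.
    The additive update is an assumed form, not taken from the specification.
    """
    n = len(delta_per_doc)
    g_new = []
    for a, g in enumerate(g_old):
        delta = sum(doc[a] for doc in delta_per_doc) / n   # average over documents
        g_new.append(g + delta)
    return g_new

g_old = [1.0, 1.0, 1.0]                      # initial weights are set to 1
delta_per_doc = [[0.5, 0.0, -0.25],          # illustrative changes from document 0
                 [0.25, 0.0, -0.25]]         # illustrative changes from document 1
g_new = update_weights(g_old, delta_per_doc)
print(g_new)
```

Terms that no training document changes keep their initial weight of 1, which is consistent with the default used when the weight matrix is built.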

H_ki^new can be obtained as in [Equation 11] to [Equation 15].

[Equation 11] uses the normalized value of H_ji^old.

In [Equation 12], c is the index of a cluster with c=1,2,...,e, and e is the number of clusters. For example, if D_c denotes the c-th cluster, the number of documents in the c-th cluster is |D_c|=f^c, and the index k runs over k=1,2,...,f^c.

In [Equation 13], the difference term represents the difference in importance of the semantic features within the c-th cluster.

In [Equation 14], ξ is an adjustment constant, preferably 0.5.

[Equation 16] gives the weight matrix of the terms recalculated by [Equation 9].

In [Equation 16], an element of the weight matrix G takes the corresponding value of g_a^new when a term of A_a* matching the a-th term of g_a^new exists, and takes the value 1 otherwise.
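The rule for filling the weight matrix can be sketched as a lookup with a default of 1. The sketch assumes that G is a diagonal matrix with one entry per term of the target document set; that layout, the names `weight_vector` and `diagonal`, and the sample weights are illustrative assumptions.

```python
def weight_vector(target_terms, learned_weights):
    """Learned weight where the term was seen in training, otherwise 1."""
    return [learned_weights.get(term, 1.0) for term in target_terms]

def diagonal(weights):
    """Assumed layout: G is diagonal, with one weight per term."""
    n = len(weights)
    return [[weights[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

learned = {"cat": 1.375, "mice": 0.75}       # illustrative recalculated weights
target_terms = ["cat", "dog", "mice"]        # terms of the set to be clustered
g = weight_vector(target_terms, learned)
G = diagonal(g)
print(G)
```

The term "dog" does not occur among the learned weights, so it receives the default value 1 and its row of the term-sentence matrix is left unchanged when G is applied.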

Finally, the document cluster generation means (340) clusters the documents by applying the calculated term weights to the document set.

The estimated weights are applied to matrix A using [Equation 16], which yields the weighted matrix of [Equation 17].

The matrix to which the recalculated weights have been applied is then clustered using either the method based on non-negative matrix factorization or the K-means method.
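A minimal sketch of this last step is shown below, using the K-means option. It assumes that applying the weights means scaling row a of A by the weight of term a (equivalent to multiplying by a diagonal G), and it uses a plain K-means with a naive initialization from the first k columns; the names `apply_weights` and `kmeans`, the initialization, and the sample data are assumptions made for illustration.

```python
def apply_weights(g, A):
    """Scale row a of A by g[a]; the same as G A when G is diagonal."""
    return [[w * x for x in row] for w, row in zip(g, A)]

def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain K-means; centroids start at the first k points (naive choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

A = [[2, 0, 2, 0], [1, 0, 2, 0], [0, 3, 0, 2], [0, 1, 0, 1]]   # terms x documents
g = [1.5, 1.0, 1.0, 0.5]                                       # illustrative weights
A_weighted = apply_weights(g, A)
documents = [list(col) for col in zip(*A_weighted)]            # one vector per column
labels = kmeans(documents, k=2)
print(labels)
```

Because the weights rescale term rows before distances are computed, terms with larger recalculated weights contribute more to the separation between clusters.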

The output unit (400) outputs the document clusters generated by the document clustering unit (300) so that the user can check them. In the preferred embodiment it includes a monitor, a printer, or the like, but the present invention is not limited thereto.

Hereinafter, a document clustering method using term weight recalculation according to a preferred embodiment of the present invention is described with reference to FIG. 2.

FIG. 2 is an overall flowchart of the document clustering method using term weight recalculation according to a preferred embodiment of the present invention.

As shown in FIG. 2, documents are first extracted, by user feedback, from the document set to be clustered, and their clusters are determined (S10).

In other words, the user supplies documents from the document set for recalculating the term weights, and these documents are classified so as to match the cluster labels.

Unlike relevance feedback, which expands a general query and rewrites its terms, the user feedback in this embodiment means feedback in which the user extracts documents that belong to the clusters of the desired result, so that the term weights can be recalculated to yield accurate clustering results.

Next, the extracted document set is preprocessed (S20).

Step S20 is performed by removing stop words from, and extracting roots of, the documents extracted in step S10.

Next, a term-sentence matrix A is generated by building a vector model from the preprocessed documents (S30).

After the preprocessing of step S20, the m×n matrix A, consisting of a total of m terms and n sentences, can be written as A = [A_{*1}, A_{*2}, ..., A_{*n}], where each column vector A_{*i} is the term-frequency vector of the i-th document.
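Steps S20 and S30 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the stop-word list STOP_WORDS and the helpers tokenize and term_sentence_matrix are hypothetical names introduced here, root extraction is omitted for brevity, and each input string is treated as one sentence.

```python
from collections import Counter

# Hypothetical, tiny stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def tokenize(sentence):
    """Lower-case a sentence, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOP_WORDS]


def term_sentence_matrix(sentences):
    """Return (terms, A), where A[a][i] counts term a in sentence i.

    A has one row per term and one column per sentence, so column i
    is the term-frequency vector A_{*i}.
    """
    counts = [Counter(tokenize(s)) for s in sentences]
    terms = sorted(set().union(*counts))
    A = [[c[t] for c in counts] for t in terms]
    return terms, A


terms, A = term_sentence_matrix(["The cat sat.", "The cat and the dog."])
```

In this example the rows of A correspond to the terms "cat", "dog", and "sat", and the two columns correspond to the two input sentences.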

Next, the term-sentence matrix A is decomposed into a non-negative semantic feature matrix and a non-negative semantic variable matrix using non-negative matrix factorization (S40).

As a result of step S40, each sentence vector in the extracted document set is expressed as a linear sum of semantic feature vectors, each multiplied by its weight, the semantic variable. A semantic feature vector represents an internal feature of the sentences, and a semantic variable represents the importance of that semantic feature within a sentence.
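The decomposition of step S40 can be illustrated with the following sketch. It assumes that the objective function is the squared Frobenius error J = ||A - W H^T||^2 and that W and H are updated with the widely used multiplicative update rules of Lee and Seung; neither the exact form of J nor the update rule is reproduced in this part of the text, so both are assumptions. The function names and the default values tol=1e-3 and max_iter=100 are illustrative choices, picked to fall within the ranges recited in the claims (0.0001 to 0.01 and 10 to 100).

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def objective(A, W, H):
    """Squared Frobenius error ||A - W H^T||^2 (assumed form of J)."""
    WH = matmul(W, transpose(H))
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))


def nmf(A, r, tol=1e-3, max_iter=100, seed=0, eps=1e-9):
    """Factor A (m x n) into non-negative W (m x r) and H (n x r), A ~ W H^T."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + eps for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    At = transpose(A)
    for _ in range(max_iter):
        # W <- W * (A H) / (W H^T H), element-wise
        AH = matmul(A, H)
        WHtH = matmul(W, matmul(transpose(H), H))
        W = [[w * num / (den + eps) for w, num, den in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, AH, WHtH)]
        # H <- H * (A^T W) / (H W^T W), element-wise
        AtW = matmul(At, W)
        HWtW = matmul(H, matmul(transpose(W), W))
        H = [[h * num / (den + eps) for h, num, den in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, AtW, HWtW)]
        # Stop once the objective falls below the convergence tolerance.
        if objective(A, W, H) < tol:
            break
    return W, H


# Example: a rank-1 non-negative matrix is recovered with a single feature.
A_demo = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
W_demo, H_demo = nmf(A_demo, r=1)
```

Because the updates only multiply non-negative quantities, W and H stay non-negative throughout, which is what allows the columns of W to be read as semantic features and the entries of H as their importance.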

Next, the term weights are recalculated by applying the weight recalculation method to the non-negative semantic variable matrix (S50).

The present invention proposes a document clustering method that estimates term weights by supervised learning on the basis of non-negative matrix factorization and clusters documents by applying the estimated term weights to the set to be clustered. The weight recalculation method according to the preferred embodiment can be applied to any method that clusters by decomposing the original data, such as singular value decomposition or non-negative matrix factorization.

Next, the calculated term weights are applied to the entire document set, and the documents are then clustered using the document clustering method (S60).

Finally, the clustering result of step S60 is presented to the user, and the procedure ends if the user is satisfied with the result (S70).

If it is determined in step S70 that the user is not satisfied with the clustering result, the procedure returns to step S50.

Hereinafter, experimental results for the document clustering method using term weight recalculation according to the preferred embodiment of the present invention will be described.

The experiment was performed on data randomly sampled from part of the 20 Newsgroups collection, which contains 20,000 documents.

The newsgroups cover 20 different topics, including computer graphics, the Windows operating system, computer hardware, religion, medicine, and politics, and each topic contains the same number of articles.

The evaluation data used in the experiment are summarized in [Table 1] below.

| Document set property | 20 Newsgroups |
|---|---|
| Total number of documents | 20,000 |
| Number of documents used | 5,400 |
| Number of clusters | 20 |
| Number of clusters used | 10 |
| Number of documents in the largest cluster | 1,000 |
| Number of documents in the smallest cluster | 100 |
| Number of documents in the median cluster | 500 |
| Average number of documents per cluster | 540 |

NMI (Normalized Mutual Information), based on Equation 18 below, was used as the measure of performance.

Given two document clusterings C and C', the mutual information MI(C, C') between them is defined by Equation 18 below.

In Equation 18, p(c_i) and p(c'_j) are the probabilities that a document of the document set belongs to cluster c_i and to cluster c'_j, respectively, and p(c_i, c'_j) is the probability that a document belongs to both c_i and c'_j at the same time. H(C) and H(C') are the entropies of C and C', respectively.
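The evaluation measure can be sketched in code as follows. Equation 18 is not reproduced in this text, so the sketch assumes the standard definition of mutual information, MI(C, C') = Σ p(c_i, c'_j) log2 [ p(c_i, c'_j) / (p(c_i) p(c'_j)) ], and assumes that the normalization divides by max(H(C), H(C')), which is a common convention; the patent's exact normalization may differ. The probabilities are estimated from label counts, and the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy H(C) of a clustering given as one label per document."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """MI(C, C') estimated from the joint and marginal label counts."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum((nij / n) * log2(nij * n / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())


def nmi(labels_a, labels_b):
    """Mutual information divided by max(H(C), H(C')) (assumed normalization)."""
    denom = max(entropy(labels_a), entropy(labels_b))
    return mutual_information(labels_a, labels_b) / denom if denom > 0 else 0.0


# Identical partitions (up to renaming of labels) versus independent ones.
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Under these assumptions the measure ranges from 0, when the two clusterings are independent, to 1, when they are identical up to a renaming of the cluster labels.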

As shown in [Table 2] below, the NMI of two different clustering methods was compared while increasing the number of clusters k from 2 to 10.

| k | K-means | Weighted K-means | Non-negative matrix factorization | Weighted non-negative matrix factorization |
|---|---|---|---|---|
| 2 | 0.6545 | 0.6648 | 0.6874 | 0.7235 |
| 3 | 0.6898 | 0.6987 | 0.6998 | 0.7432 |
| 4 | 0.6211 | 0.6541 | 0.7021 | 0.7534 |
| 5 | 0.6455 | 0.6675 | 0.7212 | 0.7512 |
| 6 | 0.6635 | 0.6784 | 0.6985 | 0.7224 |
| 7 | 0.6422 | 0.6544 | 0.7201 | 0.7403 |
| 8 | 0.6665 | 0.6855 | 0.7032 | 0.7644 |
| 9 | 0.6633 | 0.6913 | 0.6991 | 0.7411 |
| 10 | 0.6420 | 0.6702 | 0.7045 | 0.7321 |
| Average | 0.6543 | 0.6739 | 0.7034 | 0.7413 |

In [Table 2], K-means denotes the standard K-means method, and non-negative matrix factorization (NMF) denotes the clustering method proposed by Xu. Weighted K-means and weighted non-negative matrix factorization denote the K-means and NMF methods, respectively, with the recalculated weights applied.

The evaluation was carried out by applying Equation 18 to the clustering results of the two methods, with and without the recalculated weights.

As [Table 2] shows, the methods that use recalculated weights perform better than those that do not. The weighted K-means method has an average NMI 2.9% higher than the unweighted K-means method, and the weighted NMF clustering method has an average NMI 5.2% higher than the unweighted NMF clustering method.

The K-means method without weight recalculation shows the lowest performance.

The NMF clustering method outperforms the K-means method because clustering that reflects the internal structure of the data through non-negative matrix factorization is more accurate than K-means clustering, which relies on simple similarity. The weighted NMF clustering method shows the best performance because, while reflecting the internal structure of the data, it also recalculates the term weights through user-supplied training and thereby steers the result toward the clusters the user wants.

Although the present invention has been described and illustrated with reference to a preferred embodiment that exemplifies its technical idea, the present invention is not limited to the configuration and operation exactly as illustrated and described, and those skilled in the art will appreciate that many changes and modifications can be made to the present invention without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be regarded as falling within the scope of the present invention.

FIG. 1 is an overall configuration diagram of a document clustering apparatus using term weight recalculation according to a preferred embodiment of the present invention.

FIG. 2 is an overall flowchart of a document clustering method using term weight recalculation according to a preferred embodiment of the present invention.

Claims

A document clustering apparatus using term weight recalculation, comprising:

a preprocessing unit 200 which, as a preliminary step of document clustering using term weight recalculation, decomposes input document information into individual sentences, removes stop words, extracts roots, and generates a term-sentence matrix so that the documents can be clustered on the basis of non-negative matrix factorization (NMF); and

a document clustering unit 300 which factorizes the term-sentence matrix by non-negative matrix factorization, recalculates the weights of the terms, and generates document clusters,

wherein the non-negative matrix factorization is performed such that the objective function given by the following equation is minimized.

(Here, J is the objective function, A is the term-sentence matrix, W is the non-negative semantic feature matrix (NSFM) with W = [w_ij], and H^T is the non-negative semantic variable matrix (NSVM) with elements h_ij.)

The apparatus of claim 1, further comprising:

an input unit 100 which receives, from a user, document information for recalculating the weights of terms and classifies the received documents; and

an output unit 400 which outputs the document clustering result generated by the document clustering unit 300 so that the user can check it.

The apparatus of claim 1,

wherein the preprocessing unit 200 comprises:

sentence decomposition means 210 for decomposing the received document information into individual sentences;

document cluster setting means 220 for setting the number of document clusters in the received document information;

stop-word removing means for removing stop words from each of the decomposed sentences;

root extracting means 240 for extracting roots from each sentence from which the stop words have been removed;

term-frequency vector generating means 250 for generating a term-frequency vector according to the frequency of each term in each sentence from which the stop words have been removed; and

term-sentence matrix generating means 260 for generating a term-sentence matrix from the received document information.

The apparatus of claim 1,

wherein the document clustering unit 300 comprises:

non-negative matrix factorization means 310 for factorizing the term-sentence matrix into a non-negative semantic feature matrix (NSFM) and a non-negative semantic variable matrix (NSVM);

non-negative matrix calculation means 320 for calculating the non-negative semantic feature matrix and the non-negative semantic variable matrix;

weight recalculation means 330 for recalculating the weights of the terms using the non-negative semantic feature matrix and the non-negative semantic variable matrix; and

document cluster generation means 340 for clustering the documents using the recalculated term weights.

The apparatus of claim 1,

wherein the stop-word removal is performed using Rijsbergen's stop-word list.

(Deleted)

The apparatus of claim 1,

wherein the element values of W and H^T are updated repeatedly until the value of the objective function falls below a preset convergence tolerance or the number of iterations exceeds a preset count.

The apparatus of claim 7,

wherein the convergence tolerance is 0.0001 to 0.01.

The apparatus of claim 7,

wherein the number of iterations is 10 to 100.

The apparatus of claim 1,

wherein the non-negative matrix factorization produces a non-negative semantic feature matrix and a non-negative semantic variable matrix, and the produced matrices are normalized using the following equation.

(In the above equation, w_ij is an element of the non-negative semantic feature matrix, and h_ij is an element of the non-negative semantic variable matrix.)

The apparatus of claim 10,

wherein the document is assigned to cluster x.

The apparatus of claim 1,

wherein the weights of the terms are recalculated by the following equation.

(In the above equation, Δg_a is the average change in the weight of the a-th term over all documents, Δg_a^i is the change in the weight of the a-th term for the i-th document, and n is the number of documents. Further, g_a is the weight of the a-th term, A_ai is the frequency of the a-th term in the i-th document, W and H are the non-negative semantic feature matrix and the non-negative semantic variable matrix, respectively, produced by the non-negative matrix factorization, I_i is the set of indices k for which ΔH_ki ≠ 0 in the i-th document vector H_{*i} of H, H_ki^old is the original value, and H_ki^new is the value modified by the training documents.)

The apparatus of claim 12,

wherein the initial value of g_a^old is set to 1.

The apparatus of claim 12,

wherein H_ki^new is calculated by the following equation.

(In the above equation, c is the serial number of the cluster, c = 1, 2, ..., e, e represents the number of clusters,

The apparatus of claim 14,

wherein the adjustment constant ξ is 0.5.

The apparatus of claim 1,

(Wherein A is a term-sentence matrix,

The apparatus of claim 16,

wherein the document clusters are generated by applying the K-means method to said matrix.

(Deleted)