JP2002183171A

JP2002183171A - Document data clustering system

Info

Publication number: JP2002183171A
Application number: JP2000377606A
Authority: JP
Inventors: Kai Itou; 快伊藤; Takao Fukushige; 貴雄福重; Takamasa Koyama; 隆正小山
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2000-12-12
Filing date: 2000-12-12
Publication date: 2002-06-28

Abstract

PROBLEM TO BE SOLVED: To provide a document clustering system that can classify document data into a number of clusters appropriate to the clustering target. SOLUTION: The set of document feature vectors generated by a feature vector generating means 103 is subjected to singular value decomposition, and document similarity vectors 108 for computing the similarity between documents are generated from the singular value decomposition result 106. A cluster generating means 110 uses the document similarity vectors to compute the distance between each target document and each cluster centroid, classifies the same target documents a second time with more dimensions of the document similarity vector than were used for the first classification, and compares the two results; a cluster that changes little is judged to be a stable cluster. A data selecting means 109 excludes the documents of stable clusters and selects the documents that the cluster generating means will classify next, and this trial is repeated. Because classification is repeated in stages, a number of clusters appropriate to the target can be determined even when the number of clusters is not fixed in advance.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] Technical Field of the Invention: The present invention relates to a document clustering system used in document retrieval systems, document filing systems, and the like, and in particular makes it possible to classify documents accurately by performing clustering in stages.

[0002] Description of the Related Art: A known approach to document classification extracts keywords from each document as a representation of its features and classifies the documents automatically based on those keywords. The keywords are either extracted automatically from the document by a document analysis method such as morphological analysis, or assigned manually in advance. For example, Japanese Unexamined Patent Application Publication No. H7-114572 (JP-A-7-114572) discloses a method in which words are extracted automatically from documents by a document analysis method, a feature vector built from the vectors associated with each word is obtained for each document, and the documents are classified according to the similarity of these feature vectors.

[0003] Problems to Be Solved by the Invention: When documents are classified by similarity, however, many classification candidates arise, so the criterion for deciding which candidate to select and adopt becomes unclear and classification becomes difficult.

[0004] A further problem is that it is hard to understand what the classification result means.

[0005] Moreover, the clustering that people perform in everyday life often yields a hierarchical result. Newspaper articles, for example, are first divided broadly into political, economic, international, and sports and entertainment articles, and the economic articles are then subdivided into economic policy, stock market, and corporate trend articles. This suggests that hierarchical clustering is more natural and easier for people to understand than non-hierarchical clustering. Even when clustering is performed mechanically, it is therefore desirable to obtain a classification result that is as hierarchical as possible.

[0006] The present invention addresses these problems. Its object is to provide a document clustering system that can accurately classify document data into a number of clusters appropriate to the clustering target, can associate with each classification result an indication of its content, and can classify document data into a hierarchical structure.

[0007] Means for Solving the Problems: The present invention provides a document data clustering system that comprises a document database storing machine-readable document data and a dictionary storing machine-readable words, and that clusters the documents stored in the document database. The system includes: feature vector creating means that creates a feature vector for each document stored in the document database from the frequency with which the words stored in the dictionary appear in the document; singular value decomposition means that performs singular value decomposition on the set of feature vectors created by the feature vector creating means; document similarity vector creating means that creates, from the result of the singular value decomposition, document similarity vectors for calculating the similarity between documents; cluster creating means that creates clusters for all or some of the documents in the document database using the set of document similarity vectors; a cluster information table that stores information on the created clusters; and clustering data selecting means that refers to the cluster information table and selects from the document database the documents to be clustered by the cluster creating means. For the documents to be clustered, the cluster creating means uses the document similarity vectors to calculate the distance between each document and each cluster centroid. It then performs a second clustering on the same documents after increasing, within a suitable range, the number of dimensions of the document similarity vector used in the first clustering, compares the two clustering results, and judges a cluster that changes little to be a stable cluster. The clustering data selecting means removes the documents assigned to stable clusters from the clustering targets and thereby selects the targets of the next clustering by the cluster creating means. This trial is repeated between the cluster creating means and the clustering data selecting means.

[0008] Also, for the documents to be clustered, the cluster creating means uses the document similarity vectors to calculate the distance between each document and each cluster centroid, performs a second clustering on the same documents after increasing, within a suitable range, the number of dimensions of the document similarity vector used in the first clustering, compares the two clustering results, and judges a cluster that changes greatly to be an unstable cluster. The clustering data selecting means removes the documents assigned to unstable clusters from the clustering targets and thereby selects the targets of the next clustering by the cluster creating means. This trial is repeated between the cluster creating means and the clustering data selecting means.

[0009] Label extracting means is also provided that refers to the document database, the dictionary, the singular value decomposition result, and the cluster information table and extracts a label for each cluster. The label extracting means calculates a feature vector that expresses the pseudo frequency of appearance of each word at the cluster centroid and, from the documents assigned to that cluster, extracts as labels the character strings that appear near the words with a high frequency of appearance in that feature vector.

[0010] Cluster hierarchy determining means is also provided that sets the hierarchical relationships among the clusters created by the cluster creating means. Define the stable dimension d(C) of a cluster C as the number of dimensions in use when C was judged to be a stable cluster. When every document belonging to a cluster C' whose stable dimension is lower than d(C) lies within a fixed distance R(C) of the centroid g(C) of cluster C, with the distance measured in d(C) dimensions, the cluster hierarchy determining means makes cluster C the upper-level (parent) cluster of cluster C'.
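The parent test described in this paragraph can be sketched in Python with NumPy. This is only an illustrative sketch under stated assumptions: the function name `is_parent_cluster`, its parameters, and the use of Euclidean distance are hypothetical, since the text specifies neither the distance measure nor how R(C) is chosen.

```python
import numpy as np

def is_parent_cluster(DS, centroid_c, d_c, radius_c, members_c_prime, d_c_prime):
    """Return True if cluster C qualifies as the parent of cluster C'.

    DS holds one document similarity vector per row (assumed layout).
    """
    # C' must have become stable at a lower dimension than C.
    if d_c_prime >= d_c:
        return False
    # Distances are measured using only the first d(C) dimensions.
    diffs = DS[members_c_prime][:, :d_c] - centroid_c[:d_c]
    distances = np.linalg.norm(diffs, axis=1)  # Euclidean distance (assumption)
    # Every document of C' has to lie within R(C) of g(C).
    return bool(np.all(distances <= radius_c))

DS = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
g_c = np.array([0.0, 0.0])
near = is_parent_cluster(DS, g_c, d_c=2, radius_c=1.0,
                         members_c_prime=[0, 1], d_c_prime=1)
far = is_parent_cluster(DS, g_c, d_c=2, radius_c=1.0,
                        members_c_prime=[1, 2], d_c_prime=1)
```

In this toy data the first candidate passes because both of its documents lie within distance 1.0 of the centroid, while the second fails because one document lies at distance 3.0.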

[0011] In the document data clustering system of the present invention, the number of dimensions of the document similarity vector is therefore increased gradually and clustering is repeated in stages, so that a number of clusters appropriate to the clustering target can be determined even when the number of clusters has not been decided in advance.

[0012] In addition, the label extracting means can extract character strings that represent the content of a cluster from the documents belonging to that cluster.

[0013] Furthermore, the cluster hierarchy determining means can set up a hierarchy among the clusters produced by the staged clustering and so generate hierarchical relationships between clusters that agree with human intuition.

[0014] Embodiments of the Invention (First Embodiment): As shown in FIG. 1, the document data clustering system of the first embodiment comprises a document database 101 that stores document data; a machine-readable dictionary 102 that stores words; feature vector creating means 103 that creates feature vectors from the word frequencies in the documents to be clustered; singular value decomposition means 105 that performs singular value decomposition on the feature vector data; document similarity vector creating means 107 that creates document similarity vectors, which are feature vectors of reduced dimension; clustering target data selecting means 109 that selects the documents to be clustered; cluster creating means 110 that creates clusters of the target documents based on the document similarity vectors; a cluster information table 111 that stores information on the clusters created by the cluster creating means 110; and result display means 112 that displays the clustering result.

[0015] In FIG. 1, 104 denotes the feature vector set created and stored by the feature vector creating means 103, 106 denotes the singular value decomposition result created and stored by the singular value decomposition means 105, and 108 denotes the document similarity vector set created and stored by the document similarity vector creating means 107.

[0016] The operation of the document data clustering system configured as above is described next with reference to FIGS. 2 to 10.

[0017] First, the procedure that leads from document data to document similarity vectors is described. FIG. 3 is a flowchart of this procedure.

[0018] Step 302: Referring to the dictionary 102, the feature vector creating means 103 creates the feature vectors 104 of the documents in the document database 101 from statistical information on their word frequencies.

[0019] The feature vector creating means 103 consults the dictionary 102, counts how often each word listed in the dictionary 102 appears in a document, and uses each count as the value of the corresponding element of that document's feature vector. The number of dimensions of a feature vector equals the total number of words in the dictionary, and the number of feature vectors equals the number of documents. A word that does not appear in the document has a frequency of 0.
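A minimal Python sketch of this counting step follows. It assumes each document has already been split into words (the text mentions morphological analysis for this, which is not shown), and the names `build_feature_vectors`, `dictionary`, and `documents` are hypothetical.

```python
from collections import Counter

def build_feature_vectors(documents, dictionary):
    """Return one frequency vector per document, one element per dictionary word."""
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        # Counter returns 0 for a dictionary word that never occurs in the document.
        # Words outside the dictionary (such as "noise" below) are ignored.
        vectors.append([counts[word] for word in dictionary])
    return vectors

dictionary = ["cluster", "document", "vector"]
documents = [
    ["cluster", "vector", "cluster", "noise"],
    ["document", "document"],
]
feature_vectors = build_feature_vectors(documents, dictionary)
```

Each vector has as many elements as the dictionary has words, and there is one vector per document, matching the dimensions stated above.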

[0020] FIG. 4 shows an example of feature vectors based on the frequency of words in documents. It indicates that, in document 1 with document identifier 0001, word 1 from the dictionary 102 appears 13 times, word 2 appears 0 times, and word 3 appears 4 times.

[0021] The elements of a feature vector need not be raw frequencies. They may instead be frequencies normalized by dividing by the document length, as shown in FIG. 5; frequencies normalized by dividing by the sum of the word frequencies in the document, as shown in FIG. 6; or tf-idf values, which are widely used in information retrieval and take into account both the frequency within the document and the frequency across all documents. Any value may be used as long as it is statistical information calculated from the frequency of appearance.
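The three weighting options can be sketched as follows. The function names are hypothetical, the document length of 100 is an invented example value, and the tf-idf formula shown (raw count times the logarithm of document count over document frequency) is only one common variant; the text does not say which variant is intended.

```python
import math

def normalize_by_length(counts, doc_length):
    """Divide each frequency by the document length."""
    return [c / doc_length for c in counts]

def normalize_by_total(counts):
    """Divide each frequency by the sum of the dictionary-word frequencies."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def tf_idf(count_matrix):
    """One common tf-idf variant: tf * log(N / df). Other variants exist."""
    n_docs = len(count_matrix)
    n_words = len(count_matrix[0])
    # df[j] = number of documents that contain word j at least once.
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_words)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_words)] for row in count_matrix]

counts = [13, 0, 4]                      # the frequencies quoted for document 0001
by_length = normalize_by_length(counts, doc_length=100)   # length is hypothetical
by_total = normalize_by_total(counts)
weighted = tf_idf([[1, 0], [1, 2]])
```

In the tf-idf example the first word occurs in every document, so its weight is zero, which is the usual motivation for this weighting.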

[0022] Step 303: The singular value decomposition means 105 performs singular value decomposition on the feature vector set obtained in step 302.

[0023] When the feature vector set is expressed as a matrix X, the number of rows is the number of feature vectors and the number of columns is the number of dimensions of the feature vector, that is, the number of words in the dictionary 102. If X has rank r, m rows, and n columns, singular value decomposition factors it into three matrices D, S, and T (T' denotes the transpose of T):

X = D S T'    (Equation 1)

Here S is an r × r diagonal matrix whose diagonal elements are the singular values of X, and D (m × r) and T (n × r) are column-orthogonal matrices (T'T = D'D = I, where I is the identity matrix). The feature vectors are decomposed in this way to obtain the singular value decomposition result 106.
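As a hedged illustration, the decomposition of Equation 1 can be reproduced with NumPy's `numpy.linalg.svd`. The matrix values are invented for the example. Note that NumPy returns min(m, n) singular values rather than exactly r, so the shapes agree with the text only when X has full rank, as it does here.

```python
import numpy as np

# Rows are documents, columns are dictionary words (example values only).
X = np.array([[13.0, 0.0, 4.0],
              [2.0, 5.0, 0.0],
              [0.0, 1.0, 7.0],
              [1.0, 1.0, 1.0]])

# full_matrices=False gives the compact form: D is m x 3, T' is 3 x n here.
D, singular_values, T_transposed = np.linalg.svd(X, full_matrices=False)
S = np.diag(singular_values)   # diagonal matrix of singular values
T = T_transposed.T

reconstruction = D @ S @ T.T   # Equation 1: X = D S T'
```

The singular values come back in descending order, which is the order the later dimension selection relies on.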

[0024] Once the singular value decomposition has been performed, a low-rank approximation can be made by taking the s largest diagonal elements of the matrix S. The feature vectors obtained by the low-rank approximation then have fewer dimensions than the original feature vector set.
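A short sketch of the low-rank step, assuming NumPy and an invented 3 × 2 matrix; the function name `low_rank` is hypothetical. It returns both the s-dimensional reduced vectors and the rank-s approximation of X.

```python
import numpy as np

def low_rank(X, s):
    """Keep the s largest singular values of X."""
    D, sv, Tt = np.linalg.svd(X, full_matrices=False)
    reduced = D[:, :s] * sv[:s]          # one s-dimensional vector per document
    approximation = reduced @ Tt[:s, :]  # rank-s approximation of X
    return reduced, approximation

X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
reduced, X1 = low_rank(X, 1)
```

Here the singular values are 2 and 1, so keeping one of them preserves the first column of X and zeroes the second.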

[0025] Step 305: Using the singular value decomposition result 106, the document similarity vector creating means 107 creates the document similarity vector set used to calculate the similarity between documents.

[0026] With the matrix representation X of the feature vector set 104, the similarity between documents can be expressed by the matrix XX'. From the singular value decomposition given by (Equation 1), the inter-document similarity matrix can be rewritten as

XX' = D S T' T S D' = D S S D' = (DS)(DS)'    (Equation 2)

The matrix DS is an m × r matrix whose number of rows equals the number of documents m and whose number of columns equals the rank r of X. The row vectors of DS are used as the document similarity vectors.
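The identity in Equation 2 can be checked numerically. The sketch below assumes NumPy and invented matrix values; `document_similarity_vectors` is a hypothetical name.

```python
import numpy as np

def document_similarity_vectors(X):
    """Return the matrix DS; row i is the similarity vector of document i."""
    D, sv, _ = np.linalg.svd(X, full_matrices=False)
    # Broadcasting scales column j of D by sv[j], which equals D @ diag(sv).
    return D * sv

X = np.array([[13.0, 0.0, 4.0],
              [2.0, 5.0, 0.0],
              [0.0, 1.0, 7.0],
              [1.0, 1.0, 1.0]])
DS = document_similarity_vectors(X)
similarity = X @ X.T               # inter-document similarity matrix XX'
```

Inner products between rows of DS therefore reproduce the entries of XX' without needing the word-space matrix T.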

[0027] The diagonal elements of the matrix S obtained by singular value decomposition correspond to the principal components in principal component analysis, so the more dimensions are used, the more information is carried.

[0028] The clustering procedure of this system is described next. FIG. 2 is a flowchart of this procedure.

[0029] Step 202: The clustering target data selecting means 109 registers every document (every element) of the document database 101 in the cluster information table 111 as a clustering target.

[0030] FIG. 7 shows an example of the contents of the cluster information table 111. The table has one record per document in the document database 101, and each record has at least a document identifier that identifies the document, a field for the identifier of the cluster to which the document is assigned, and a flag that indicates whether the document is a clustering target. FIG. 7 shows that, as a result of step 202, every document in the document database 101 is assigned to the cluster with cluster identifier 0 and every document is a clustering target. Cluster identifier 0 denotes a virtual cluster, because no clustering has been performed yet. A clustering target flag of 1 means the document is a clustering target, and 0 means it is not.
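One possible in-memory representation of the cluster information table is sketched below. The class and field names (`ClusterRecord`, `document_id`, `cluster_id`, `is_target`) are hypothetical; the text only requires that these three pieces of information exist per document.

```python
from dataclasses import dataclass

@dataclass
class ClusterRecord:
    document_id: str
    cluster_id: int = 0   # 0 = virtual cluster, nothing clustered yet
    is_target: int = 1    # 1 = clustering target, 0 = not a target

def init_cluster_table(document_ids):
    """Step 202: register every document as a clustering target."""
    return [ClusterRecord(doc_id) for doc_id in document_ids]

table = init_cluster_table(["0001", "0002", "0003"])
target_count = sum(record.is_target for record in table)
```

Counting the records whose flag is 1 gives the number of documents still to be clustered, which is the quantity checked in step 204.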

[0031] Step 203: The cluster creating means 110 sets the following parameters: the number of clusters k created in one clustering pass, the increment Δk of the number of clusters, the number of dimensions d of the document similarity vector used for cluster assignment, and the increment Δd of the number of dimensions.

[0032] The number of clusters k may be any value from 1 up to the number of documents in the document database. The increment Δk may be 0, or any value whose absolute value is between 1 and the number of documents in the document database. The number of dimensions d and its increment Δd must each be no greater than the rank r of the matrix S calculated in step 303.

[0033] Step 204: The cluster creating means 110 refers to the cluster information table 111 and checks the number of documents to be clustered. Before any clustering has been performed, every document is a clustering target, so this number equals the number of documents in the document database and processing proceeds to step 205.

[0034] Step 205: Non-hierarchical clustering by the method known as K-means is performed using the document similarity vector set 108 obtained in step 305.

[0035] The K-means method clusters with the following algorithm. First, k cluster centroid points are given as initial values. For each document to be clustered (each element), the distance to each of the k cluster centroids is calculated and the element is assigned to the cluster whose centroid is nearest. Once every element has been assigned, the centroid of each of the k clusters is recomputed from its elements and used as the new centroid point, and the elements are assigned to clusters again. An element may move to a different cluster on each assignment pass; assignment ends when, over all elements, the number whose newly assigned cluster differs from the previous one falls to or below a fixed number.
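The algorithm just described can be sketched with NumPy as follows. The names `k_means`, `max_changes`, and `max_iterations` are hypothetical, Euclidean distance is assumed, and keeping the old centroid when a cluster becomes empty is a choice made for this sketch that the text does not address.

```python
import numpy as np

def k_means(vectors, initial_centroids, max_changes=0, max_iterations=100):
    """Assign each row of `vectors` to the nearest centroid until assignments settle."""
    vectors = np.asarray(vectors, dtype=float)
    centroids = np.array(initial_centroids, dtype=float)
    assignment = None
    for _ in range(max_iterations):
        # distances[i, j] = Euclidean distance from element i to centroid j
        distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)
        # Stop when the number of elements that changed cluster is small enough.
        if assignment is not None and np.sum(new_assignment != assignment) <= max_changes:
            assignment = new_assignment
            break
        assignment = new_assignment
        for j in range(len(centroids)):
            members = vectors[assignment == j]
            if len(members) > 0:          # empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return assignment, centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centroids = k_means(points, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
```

With `max_changes=0` the loop stops only when no element changes cluster between two consecutive passes; a larger value corresponds to the "fixed number" tolerance mentioned in the text.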

[0036] In step 205, the documents marked as clustering targets in the cluster information table 111 are clustered by the K-means method. The elements from the first to the d-th dimension of the r-dimensional document similarity vectors are used both to calculate the distance between a cluster centroid and each element's document similarity vector and to calculate the centroids. The initial centroid point of each cluster and the list of document identifiers of the documents in each cluster are recorded at this time. Step 205 produces k clusters.

[0037] Step 206: The number of dimensions in use, d, is increased moderately within a range that does not reach r. If the increment is Δd, then d + Δd becomes the new d. For example, if d is 100 and the increment is 50, the new number of dimensions in use is 150.

Step 207: As in step 205, k clusters are created by the K-means method. Because more dimensions are used than in step 205, this clustering draws on more information. The initial centroids are the same points that were used in step 205. Using the same initial centroids makes it possible to match the clusters obtained in step 205 with those obtained in step 207. The list of document identifiers of the documents in each cluster is recorded at this time as well.
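Steps 205 to 207 can be illustrated with the sketch below. It assumes the initial centroids are full-length points whose first d (or d + Δd) coordinates are used in each pass, which is one reading of "the same points"; the function names, the fixed iteration count, and the data are all hypothetical.

```python
import numpy as np

def k_means_labels(vectors, centroids, iterations=20):
    """Simplified K-means with a fixed number of passes (a sketch, not tuned)."""
    centroids = centroids.copy()
    for _ in range(iterations):
        distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return labels

def cluster_twice(DS, initial_points, d, delta_d):
    """Cluster with d dimensions, then with d + delta_d, from the same start points."""
    first = k_means_labels(DS[:, :d], initial_points[:, :d])                       # step 205
    second = k_means_labels(DS[:, :d + delta_d], initial_points[:, :d + delta_d])  # step 207
    return first, second

DS = np.array([[0.0, 0.0, 0.0],
               [0.1, 0.0, 5.0],
               [1.0, 0.0, 0.0],
               [1.1, 0.0, 5.0]])
initial_points = DS[[0, 3]]        # two documents chosen as start points (assumption)
first, second = cluster_twice(DS, initial_points, d=1, delta_d=2)
```

Because both passes start from the same points, cluster j of the first pass can be compared directly with cluster j of the second. In this toy data the third dimension dominates, so the grouping changes between passes, which is the situation the stability check is meant to detect.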

[0038] Step 208: The clustering result of step 205 is compared with the result of step 207, which used more information, to identify the clusters that need no further clustering. If the documents assigned to a cluster do not change when the amount of information is increased, it can be judged that giving still more information would not change the clustering result for that cluster.

[0039] To make this judgment, the clusters obtained in step 205 and in step 207 are first compared, cluster by cluster, as to which documents are assigned to each, and the degree of coincidence is determined.

[0040] FIG. 8 compares the documents assigned to one cluster by the two clustering passes. The second row shows the documents in the cluster created in step 205, that is, using d dimensions, and the third row shows the documents in the cluster created in step 207, that is, using d + Δd dimensions. In the figure, 0 means that the document is not in the cluster and 1 means that it is. The degree of coincidence may be given, for example, as the ratio of the number of documents common to both clustering results to the total number of documents.

[0041] FIG. 8 shows a case in which 100 documents in total were assigned by the first clustering and 110 by the second, with 95 documents in common. The degree of coincidence is then 95 / (100 + 110) ≈ 0.45. A suitable threshold is set for each cluster; a cluster whose degree of coincidence is high is treated as a stable cluster, and every other cluster is treated as an unstable cluster.
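The worked figure of 95 / (100 + 110) can be reproduced with a few lines of Python. The function name and the document numbering are hypothetical; the formula follows the example in the text, where the denominator is the sum of the two cluster sizes.

```python
def coincidence_degree(first_members, second_members):
    """Documents common to both passes divided by the sum of the two cluster sizes."""
    first, second = set(first_members), set(second_members)
    total = len(first) + len(second)
    return len(first & second) / total if total else 0.0

first_pass = range(0, 100)     # 100 documents in the first clustering
second_pass = range(5, 115)    # 110 documents in the second, 95 of them shared
degree = coincidence_degree(first_pass, second_pass)
```

With this definition two identical clusters score exactly 0.5, so any stability threshold would have to be chosen below 0.5.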

[0042] Unstable clusters need not be judged by the degree of coincidence alone; the ratio of the number of documents assigned to the cluster to the total number of documents, for example, may also be used.

[0043] Step 208: Next, documents in stable clusters need no further clustering, so their clustering target flag in the cluster information table 111 is set to indicate that they are no longer targets. For documents in unstable clusters, the clustering target flag is left unchanged so that they are clustered again. For each document in a stable cluster, the cluster number assigned in the second clustering is recorded in the cluster number field of the cluster information table 111.
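The table update can be sketched as below, using the flag convention stated for the cluster information table (1 = clustering target, 0 = not a target). The dictionary layout, the function name, and cluster number 1003 for the unstable cluster are hypothetical.

```python
def update_cluster_table(table, second_assignment, stable_clusters):
    """Record stable cluster numbers and clear the target flag for those documents."""
    for doc_id, cluster_id in second_assignment.items():
        if cluster_id in stable_clusters:
            table[doc_id] = {"cluster_id": cluster_id, "is_target": 0}
        # Documents in unstable clusters are left untouched and stay targets.
    return table

table = {doc_id: {"cluster_id": 0, "is_target": 1}
         for doc_id in ["0001", "0002", "0003", "0004", "0005"]}
second_assignment = {"0001": 1001, "0002": 1003, "0003": 1001,
                     "0004": 1002, "0005": 1003}
table = update_cluster_table(table, second_assignment, stable_clusters={1001, 1002})
remaining = [doc_id for doc_id, rec in table.items() if rec["is_target"] == 1]
```

The documents left in `remaining` are the ones passed to the next clustering round.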

[0044] FIG. 9 shows an example of the updated cluster information table 111. Document identifiers 0001 and 0003 are assigned to stable cluster 1001, and document identifiers 0004 and m are assigned to stable cluster 1002. Document identifiers 0002 and 0005 are to be clustered again, so their clustering target flag remains 1.

[0045] Step 209: After step 208, the number of clusters to create, k, is increased by the Δk set in step 203. If Δk is 0, the number of clusters does not change.

[0046] Processing then returns to step 204, where the cluster information table 111 is consulted again to check the number of documents to be clustered.

[0047] If every cluster was judged stable in step 208, no documents remain to be clustered, so processing moves to step 210 and the clustering ends. If documents to be clustered remain, steps 205 to 208 are repeated, this time with d + Δd as the number of dimensions of the document similarity vector used for cluster assignment.
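Putting steps 202 to 210 together, the staged loop might look like the sketch below. Several points are assumptions made for this sketch rather than statements from the text: initial centroids are drawn at random from the remaining documents, K-means runs for a fixed number of passes, and every cluster is treated as stable once all r dimensions are in use so that the loop is guaranteed to end.

```python
import numpy as np

def k_means_labels(vectors, centroids, iterations=20):
    """Simplified K-means with a fixed number of passes."""
    centroids = centroids.copy()
    for _ in range(iterations):
        dist = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return labels

def staged_clustering(DS, k, d, delta_d, threshold, delta_k=0, seed=0):
    """Return {document index: (stage, cluster number)} for every document."""
    if delta_d < 1:
        raise ValueError("delta_d must be at least 1 in this sketch")
    rng = np.random.default_rng(seed)
    r = DS.shape[1]
    final = {}
    targets = np.arange(len(DS))                 # step 202: all documents are targets
    stage = 0
    while len(targets) > 0:                      # step 204
        k_now = max(1, min(k, len(targets)))
        d2 = min(d + delta_d, r)                 # step 206
        init = DS[rng.choice(targets, size=k_now, replace=False)]
        first = k_means_labels(DS[targets][:, :d], init[:, :d])      # step 205
        second = k_means_labels(DS[targets][:, :d2], init[:, :d2])   # step 207
        remaining = []
        for j in range(k_now):                   # step 208
            a = set(targets[first == j].tolist())
            b = set(targets[second == j].tolist())
            total = len(a) + len(b)
            degree = len(a & b) / total if total else 0.0
            stable = degree >= threshold or d2 == r   # forced stop at full rank (assumption)
            for doc in b:
                if stable:
                    final[doc] = (stage, j)
                else:
                    remaining.append(doc)
        targets = np.array(sorted(remaining), dtype=int)
        k, d, stage = k + delta_k, d2, stage + 1      # step 209
    return final

DS = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
result = staged_clustering(DS, k=2, d=1, delta_d=1, threshold=0.4)
```

The threshold of 0.4 is an arbitrary example value. The final number of clusters is not fixed in advance: it is the number of distinct (stage, cluster number) pairs in the result.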

[0048] FIG. 10 is a conceptual diagram of how the clusters created from step 201 to step 210 change. The left side shows the number of clusters k and the number of dimensions of the document similarity vector used for the clustering in step 207. Here k is 7 and Δk is 0. The first clustering produced three stable clusters (plain squares in the figure) and four unstable clusters (hatched squares). The next stage shows the result of clustering the documents that had been assigned to the four unstable clusters.

【００４９】The figure shows that, by repeating the clustering in this way, no unstable clusters remain at the fourth stage and the cluster creation process ends.

【００５０】As described above, the document data clustering system of this embodiment performs singular value decomposition of the document data, gradually increases the information used to calculate distances between documents, and repeats the clustering in stages. As a result, a number of clusters suited to the clustering target can be determined even when the number of clusters is not fixed in advance, and clusters that gather only similar documents can be generated.

【００５１】(Second Embodiment) The second embodiment describes a system that labels a clustering result with character strings representing its content.

【００５２】As shown in FIG. 11, this system includes, as in the first embodiment, a document database 1101 that stores document data, a machine-readable dictionary 1102 that stores words, feature vector creating means 1103 that creates feature vectors, singular value decomposition means 1105 that performs singular value decomposition of the feature vector set, document similarity vector creating means 1107 that calculates, from the singular value decomposition result, the document similarity vectors used to compute the similarity between documents, and result display means 1112. It further includes label extracting means 1110 that extracts labels representing the clusters.

【００５３】In FIG. 11, 1104 denotes the feature vector set created by the feature vector creating means 1103, 1106 denotes the singular value decomposition result obtained by the singular value decomposition means 1105, and 1108 denotes the document similarity vector set created by the document similarity vector creating means 1107. 1109 denotes a cluster information table in which the documents in the document database have been assigned, by some method, to one of a plurality of clusters and which stores the same information as in the first embodiment.

【００５４】The label extracting means 1110 extracts labels representing the clusters while referring to the document database 1101, the dictionary 1102, the singular value decomposition result 1106, the document similarity vector set 1108, and the cluster information table 1109. 1111 denotes the labels extracted by the label extracting means 1110. The result display means 1112 displays the cluster information table 1109 and the extracted labels 1111.

【００５５】The operation of the document data clustering system configured as above is described with reference to FIGS. 12 to 14.

【００５６】FIG. 12 is a flowchart showing the processing procedure of this system.
Step 1202: The feature vector creating means 1103 creates the feature vector set 1104 using the document database 1101 and the dictionary 1102. This procedure is the same as step 302 of the first embodiment (FIG. 3).

【００５７】Step 1203: Next, the singular value decomposition means 1105 performs singular value decomposition of the feature vector set and obtains the singular value decomposition result 1106. This procedure is the same as step 303 of the first embodiment (FIG. 3).

【００５８】As a result of the singular value decomposition, as in (Equation 1) of the first embodiment, the matrix representation X (m × n) of the feature vector set is decomposed into three matrices D (m × r), S (r × r), and T (n × r). Here m is the number of documents in the document database, n is the number of words in the dictionary, and r is the rank of the matrix X.

【００５９】Step 1204: The document similarity vector creating means 1107 creates the document similarity vectors 1108 from the singular value decomposition result. This procedure is the same as step 304 of the first embodiment (FIG. 3).

【００６０】Step 1205: Next, the label extracting means 1110 moves on to extracting labels from the clusters. Step 1205 determines whether labels have been extracted from all clusters. If they have, the process ends (step 1210). Otherwise, the label extracting means 1110 performs label extraction (steps 1206 to 1209) on each cluster whose labels have not yet been extracted.

【００６１】Step 1206: First, the centroid vector g of the cluster is obtained. Each element of g is the average of the corresponding elements of the document similarity vectors 1108 of the documents assigned to the cluster. The number of dimensions of the cluster centroid vector equals that of the document similarity vectors, namely r.
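
The centroid computation in step 1206 can be illustrated with a minimal sketch. The function name `centroid` and the sample vectors are assumptions made for illustration; the patent only specifies the element-wise average.

```python
def centroid(vectors):
    """Return the element-wise mean of equal-length document similarity vectors."""
    if not vectors:
        raise ValueError("a cluster must contain at least one document")
    r = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(r)]


# Hypothetical similarity vectors (r = 3) of three documents in one cluster.
doc_vectors = [[1.0, 2.0, 0.0], [3.0, 0.0, 2.0], [2.0, 4.0, 1.0]]
g = centroid(doc_vectors)
print(g)
```

The result has the same number of dimensions as the input vectors, matching the statement that the centroid has r dimensions.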

【００６２】Step 1207: Next, the representative word vector h is obtained from the cluster centroid vector g (1 × r) and the singular value decomposition result 1106 by the calculation shown in (Equation 3).

h = gST' (Equation 3)

The right-hand side of (Equation 3) is (Equation 1) with the matrix D replaced by the centroid vector g, so h represents the feature vector corresponding to the cluster centroid vector g. Each element of the representative word vector h therefore corresponds to a word in the dictionary 1102, and its value corresponds to the frequency of appearance of that word at the cluster centroid.

【００６３】FIG. 13 shows an example of a representative word vector. In the figure, the representative word vector is enclosed by a dotted line, and the word corresponding to each element is written on the left.

【００６４】Next, the word corresponding to each element of the representative word vector is obtained by referring to the dictionary 1102, and these words form the representative word set. For each representative word, the value of the corresponding element of the representative word vector is recorded as its score. The number of representative words equals the number of words in the dictionary 1102, but words with small scores may be removed from the representative words.
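
As a rough sketch of step 1207, the product h = gST' and the optional pruning of low-scoring words might look as follows. It assumes S is diagonal and passed as a list of singular values and that T is given as n rows of r values; the function names, the `min_score` cutoff, and the sample numbers (which are not a real decomposition) are illustrative assumptions.

```python
def representative_word_vector(g, singular_values, t):
    """Compute h = g S T' where S = diag(singular_values) and t is n x r."""
    scaled = [gi * si for gi, si in zip(g, singular_values)]  # g S, length r
    return [sum(sk * tk for sk, tk in zip(scaled, row)) for row in t]


def representative_words(h, words, min_score=None):
    """Pair each dictionary word with its score, highest first.

    min_score is a hypothetical cutoff for dropping low-scoring words.
    """
    scored = sorted(zip(words, h), key=lambda pair: pair[1], reverse=True)
    if min_score is None:
        return scored
    return [(word, score) for word, score in scored if score >= min_score]


g = [1.0, 2.0]                             # cluster centroid, r = 2
singular_values = [2.0, 0.5]               # diagonal of S
t = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # one row per dictionary word
words = ["living", "standard", "policy"]

h = representative_word_vector(g, singular_values, t)
kept = representative_words(h, words, min_score=1.5)
print(h, kept)
```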

【００６５】Step 1208: Next, using the representative words, character strings located around the representative words that appear in the documents assigned to the cluster are extracted as label candidates. Here, a character string bounded by any two representative words is taken as a label candidate.
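
One possible reading of step 1208 is sketched below: every substring that begins with one representative word and ends with another is collected as a candidate. The function name, the `max_length` limit, and the English sample sentence are assumptions added for illustration; the patent itself does not state a length limit.

```python
def label_candidates(text, representative_words, max_length=30):
    """Collect substrings that start with one representative word and end with another."""
    hits = []  # (start index, word) for every occurrence of every word
    for word in representative_words:
        pos = text.find(word)
        while pos != -1:
            hits.append((pos, word))
            pos = text.find(word, pos + 1)

    found = set()
    for begin, first in hits:
        for pos, last in hits:
            end = pos + len(last)
            # The closing word must start at or after the end of the opening word.
            if pos >= begin + len(first) and end - begin <= max_length:
                found.add((text[begin:end], first, last))
    return sorted(found)


text = "improving the standard of living in rural areas"
candidates = label_candidates(text, ["standard", "living"])
print(candidates)
```

Each candidate keeps its opening and closing words so that their scores can be looked up when the candidate is ranked.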

【００６６】Other ways of extracting character strings around representative words as labels are also conceivable. For example, the part of speech of the representative words may be taken into account, so that a character string that begins with a representative word that is a proper noun and ends with a representative word that is a sa-hen (suru) verb is taken as a label candidate. The documents from which labels are extracted may be all documents assigned to the cluster or a selected subset. One way to select documents is to calculate the distance between the cluster centroid vector and each document similarity vector and pick several documents in order of increasing distance.

【００６７】Step 1209: A score is calculated for each label candidate, and the label candidates with high scores are extracted as labels representing the cluster. Several scoring methods are conceivable; one example is the calculation shown in (Equation 4).

(p × w1 + (1 − p) × w2) / L (Equation 4)

Here, p is the weight given to the score of the representative word at the head of the label candidate, and its value is between 0 and 1 inclusive. w1 and w2 are the scores of the representative words at the head and at the tail of the label candidate, respectively. L is the length of the label candidate.
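
Equation 4 translates directly into a small function. The sketch below is illustrative only: the function name and the sample scores are assumptions, and measuring L in characters is one plausible choice that the text does not spell out.

```python
def label_score(w1, w2, length, p=0.5):
    """Score a label candidate as (p * w1 + (1 - p) * w2) / L (Equation 4)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1 inclusive")
    if length <= 0:
        raise ValueError("label length must be positive")
    return (p * w1 + (1.0 - p) * w2) / length


# Hypothetical scores: head word 6.0, tail word 2.0, candidate length 4.
balanced = label_score(6.0, 2.0, 4)            # equal weight on both words
head_only = label_score(6.0, 2.0, 4, p=1.0)    # only the head word counts
print(balanced, head_only)
```

Dividing by L favours short candidates, so a long string needs high-scoring boundary words to outrank a short one.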

【００６８】FIG. 14 shows an example in which label candidates are extracted from a document and their scores are calculated. In the upper example, "生活水準" (standard of living) is extracted as the character string bounded by "生活" (living) and "水準" (standard), and its score is calculated with p = 0.5 in (Equation 4). In the lower example, "専門性を生かした連携" (collaboration that draws on expertise) is extracted as the character string bounded by "専門" (specialty) and "連携" (collaboration), and its score is calculated.

【００６９】As described above, the document data clustering system of this embodiment can extract, as labels, character strings that represent the content of a clustering result, and can thereby provide information for telling clusters apart and inferring what they contain.

【００７０】(Third Embodiment) The third embodiment describes a system that excludes unstable clusters from the target, repeats re-clustering, and generates clusters that gather only similar documents.

【００７１】In the first embodiment, re-clustering is performed after excluding the stable clusters from the target. In the system of the third embodiment, re-clustering is performed after excluding the unstable clusters, that is, those showing a high degree of mismatch when clustered with an increased amount of information.

【００７２】As shown in FIG. 15, this system includes, as the clustering target data selecting means 109, main parent clustering target data selecting means 1091, which removes the unstable clusters from the document database 101 or from the clustering result, treats the remaining clusters as the clustering target, and selects the identifiers of their document data. The rest of the configuration is the same as in the first embodiment (FIG. 1).

【００７３】The operation of this system is shown in the flowchart of FIG. 16. In FIG. 16, the procedure from step 202 to step 207 is the same as in the first embodiment (FIG. 2).

【００７４】Step 208: The clustering result of step 205 is compared with the result of step 207, which was obtained with more information, to determine which clusters to exclude from clustering. A cluster whose assigned documents change greatly when the amount of information is increased is judged likely to change even more when further information is given.

【００７５】The judgment proceeds as follows. First, for each cluster obtained in steps 205 and 207, the documents assigned in each run are compared and the degree of agreement is examined. FIG. 17 compares, for one cluster, the documents assigned by the two clusterings. The second row shows the documents included in the cluster created in step 205, that is, using d dimensions, and the third row shows the documents included in the cluster created in step 207, that is, using d + Δd dimensions. In the figure, 0 means the document is not included in the cluster and 1 means it is. The degree of agreement may be given, for example, by the ratio of the number of documents included in both clustering results to the total number of documents. FIG. 17 shows a case in which 100 documents were assigned in the first clustering, 110 documents in the second, and 40 documents are not shared between the two. The degree of mismatch is then 40 / (100 + 110) ≈ 0.19.
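
The mismatch calculation for FIG. 17 can be reproduced with Python sets. This is a sketch under the assumption that each cluster is represented as a set of document identifiers; the function name and the synthetic identifiers are illustrative.

```python
def mismatch_degree(first, second):
    """Documents found in only one of the two results, divided by |first| + |second|."""
    first, second = set(first), set(second)
    total = len(first) + len(second)
    if total == 0:
        return 0.0
    return len(first ^ second) / total  # ^ is the symmetric difference


# Synthetic identifiers matching the counts quoted for FIG. 17:
# 100 documents in the first result, 110 in the second, 40 not shared.
first_run = set(range(0, 100))
second_run = set(range(15, 125))

degree = mismatch_degree(first_run, second_run)
print(round(degree, 2))
```

With these counts the quotient is 40 / 210, which evaluates to about 0.19.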

【００７６】An appropriate threshold is set for each cluster. Clusters whose degree of mismatch is high are excluded as unstable clusters, and the rest are treated as stable clusters. Unstable clusters may be judged not only by the degree of mismatch but also by, for example, the ratio of the number of documents assigned to the cluster to the total number of documents.

【００７７】Next, for the documents included in an unstable cluster, the clustering target flag in the cluster information table 111 is set to "not required". For the documents included in a stable cluster, the clustering target flag is left unchanged so that they remain re-clustering targets. In the cluster information table 111, the cluster number field of each document included in a stable cluster records the cluster number assigned in the second clustering.
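
The table update described here might be sketched as follows, assuming the cluster information table is held as a dictionary keyed by document identifier. The field names `cluster` and `target`, the function name, and the cluster number 1003 for the unstable cluster are hypothetical; the other identifiers follow the FIG. 18 example.

```python
def apply_stability_result(table, second_assignment, unstable_clusters):
    """Update the cluster information table in place after the stability judgment.

    table: document id -> {"cluster": number or None, "target": 0 or 1}
    second_assignment: document id -> cluster number from the second clustering
    unstable_clusters: set of cluster numbers judged unstable
    """
    for doc_id, cluster in second_assignment.items():
        row = table[doc_id]
        if cluster in unstable_clusters:
            row["target"] = 0         # excluded from further clustering
        else:
            row["cluster"] = cluster  # flag unchanged; record second-round cluster
    return table


table = {doc: {"cluster": None, "target": 1}
         for doc in ["0001", "0002", "0003", "0004", "0005"]}
second = {"0001": 1001, "0002": 1003, "0003": 1001, "0004": 1002, "0005": 1003}

apply_stability_result(table, second, unstable_clusters={1003})
targets = sorted(doc for doc, row in table.items() if row["target"] == 1)
print(targets)
```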

【００７８】FIG. 18 shows an example of the updated cluster information table 111. Document identifiers 0001 and 0003 are assigned to the stable cluster 1001, and document identifiers 0004 and m to the stable cluster 1002; they are re-clustering targets, so their clustering target flag is 1. Document identifiers 0002 and 0005 are excluded from re-clustering, so their clustering target flag is 0.

【００７９】From step 208 onward, the same processing as in FIG. 1 is performed. In this way, the stable clusters can be clustered further.

【００８０】FIG. 19 is a conceptual diagram showing how the clusters change. The left side shows the number of clusters k and the number of dimensions of the document similarity vectors used for the clustering in step 207. Here k is 7 and Δk is 0. The first clustering produced three stable clusters (plain squares in the figure) and four unstable clusters (hatched squares). The next stage shows the result of clustering the documents that had been assigned to the three stable clusters.

【００８１】By repeating the clustering in this way, clusters that gather only similar documents can be generated.

【００８２】As described above, the document data clustering system of this embodiment performs singular value decomposition of the document data, gradually increases the information used to calculate distances between documents, and repeats the clustering in stages. As a result, a number of clusters suited to the clustering target can be determined even when the number of clusters is not fixed in advance, and clusters that gather only similar documents can be generated.

【００８３】(Fourth Embodiment) The fourth embodiment describes a document data clustering system that classifies document data hierarchically.

【００８４】The basic idea behind the hierarchy in this system is as follows. In the first embodiment, while clusters are being created, the number of dimensions used is increased, stable clusters are removed as they appear, and the remaining elements are clustered in a space of higher dimension. This realizes stepwise clustering.

【００８５】In such stepwise clustering, denser clusters tend to stabilize early, while more spread-out clusters tend to be generated later. Consequently, denser clusters, that is, clusters of documents related to a more specific theme, tend to be generated early, and sparser clusters, that is, clusters of documents grouped under a more general theme, tend to be generated later.

【００８６】In this case, making a document cluster related to a more general theme the upper cluster of a document cluster on a more specific theme also matches human intuition about how clusters are arranged in a hierarchy.

【００８７】Therefore, to introduce a natural hierarchical relationship among the clusters obtained in stages during clustering, it is desirable to place a cluster that stabilized later (one with a higher stable dimension) above a cluster that stabilized earlier (one with a lower stable dimension) in the hierarchy.

【００８８】However, because it is natural to expect an upper cluster to cover the themes of its lower clusters, a lower cluster should lie within a certain range of the centroid of its upper cluster.

【００８９】To satisfy these requirements, the system refers to the centroid of each cluster and the dimension at which the cluster was judged stable. When a cluster C' that stabilized earlier lies within a certain distance of the centroid of a cluster C that stabilized later at dimension d(C), that is, when the distance in dimension d(C) from the centroid of cluster C to cluster C' falls within a certain distance representing the size of cluster C, C is positioned as the upper cluster of C'. This sets the hierarchical relationships among the clusters generated in stages.

【００９０】Here, the centroid of each cluster and the distance representing the size of a cluster are computable quantities that can be obtained mechanically by calculation between document feature vectors.

【００９１】The document data clustering system of this embodiment performs clustering according to the above idea.

【００９２】As shown in FIG. 20, this system includes, as in the first embodiment (FIG. 1), a document database 1901, a dictionary 1902, feature vector creating means 1903, a feature vector set 1904, singular value decomposition means 1905, a singular value decomposition result 1906, document similarity vector creating means 1907, a document similarity vector set 1908, clustering target data selecting means 1909, cluster creating means 1910, a cluster information table 1911, and result display means 1915. It further includes cluster hierarchy relation determining means 1912 that calculates the hierarchical relationships between clusters from the cluster information stored in the cluster information table 1911 and the document similarity vector set 1908, a cluster size information table 1913 that stores the size information of each cluster, and a cluster hierarchy relation table 1914 in which the hierarchical relationships between clusters are registered.

【００９３】Here, in addition to its operation in the first embodiment, the cluster creating means 1910, on detecting a stable cluster, stores the dimension in use at that point in the cluster size information table 1913 as the stable dimension of that cluster and, at the same time, stores the centroid of the cluster in the same table. The result display means 1915 displays the clustering result and the hierarchical relationships between clusters. Apart from the cluster creating means 1910, the internal configuration and operation of 1901 through 1911 are the same as in the first embodiment.

【００９４】FIG. 21 shows an example of the cluster size information table 1913. For each cluster, this table stores the cluster number, the stable dimension, the centroid vector, and the reference radius. The stable dimension of a cluster C is the number of dimensions in use when, during clustering, cluster C is judged stable and removed from the clustering target. Hereinafter, this stable dimension is written d(C).

【００９５】The reference radius of a cluster C is defined as follows. Let R1(C) be the distance in d(C) dimensions between the cluster centroid of C (written g(C)) and the document belonging to C that is farthest from g(C) in d(C) dimensions. Let R2(C) be the distance in d(C) dimensions between g(C) and the document that, among the documents belonging to clusters other than C whose stable dimension is d(C) or more, is closest to g(C) in d(C) dimensions. The reference radius is the smaller of R1(C) and R2(C). Hereinafter, the reference radius of cluster C is written R(C).
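
The definition of R(C) can be sketched in code as follows. The sketch assumes each cluster is a dictionary holding its stable dimension, centroid, and member vectors, and it measures distance as the Euclidean distance over the first d(C) components. The names, the data layout, and the fallback to R1(C) when no other cluster qualifies are assumptions for illustration.

```python
import math


def dist(x1, x2, d):
    """Euclidean distance over the first d components of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1[:d], x2[:d])))


def reference_radius(name, clusters):
    """Return R(C) = min(R1(C), R2(C)) for the cluster called name.

    clusters: name -> {"dim": stable dimension, "centroid": [...], "docs": [[...], ...]}
    """
    c = clusters[name]
    d, g = c["dim"], c["centroid"]
    r1 = max(dist(g, doc, d) for doc in c["docs"])        # farthest member
    outside = [dist(g, doc, d)                             # candidates for R2
               for other, info in clusters.items()
               if other != name and info["dim"] >= d
               for doc in info["docs"]]
    return min([r1] + outside)  # falls back to R1 when no other cluster qualifies


clusters = {
    "A": {"dim": 1, "centroid": [0.0, 0.0], "docs": [[1.0, 5.0], [-2.0, 1.0]]},
    "B": {"dim": 2, "centroid": [3.25, 0.0], "docs": [[1.5, 0.0], [5.0, 0.0]]},
}
radius_a = reference_radius("A", clusters)
radius_b = reference_radius("B", clusters)
print(radius_a, radius_b)
```

In this example a document of B lies closer to the centroid of A than A's farthest member, so R2 limits the radius of A; B has no later-stabilizing neighbour, so its radius is R1.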

【００９６】Here, the distance in d(C) dimensions between documents or cluster centroids is the distance in d(C) dimensions between document similarity vectors. The distance dist(x1, x2) in d(C) dimensions between the document similarity vectors x1 = (x_{11}, x_{12}, …, x_{1d(C)}, …, x_{1r}) and x2 = (x_{21}, x_{22}, …, x_{2d(C)}, …, x_{2r}) is defined by (Equation 5).

dist(x1, x2) = {(x_{11} − x_{21})^2 + (x_{12} − x_{22})^2 + … + (x_{1d(C)} − x_{2d(C)})^2}^{0.5} (Equation 5)
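
Equation 5 is an ordinary Euclidean distance restricted to the first d(C) components. A minimal sketch, with the function name and sample vectors chosen for illustration:

```python
import math


def dist(x1, x2, d):
    """Distance of Equation 5: Euclidean distance over the first d components."""
    if d > min(len(x1), len(x2)):
        raise ValueError("d exceeds the number of available dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1[:d], x2[:d])))


x1 = [3.0, 0.0, 5.0]
x2 = [0.0, 4.0, 9.0]
in_two_dims = dist(x1, x2, 2)    # uses only the first two components
in_three_dims = dist(x1, x2, 3)  # uses all three components
print(in_two_dims, in_three_dims)
```

Because components beyond d are ignored, two documents can be close at a low dimension and farther apart once more dimensions are taken into account.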

【００９７】FIG. 22 shows an example of the cluster hierarchy relation table 1914. If the total number of generated clusters is Cmax, the table is a Cmax × Cmax matrix whose (i, j) element holds 1 when the cluster with cluster number i is an upper cluster of the cluster with cluster number j, and 0 otherwise. FIG. 22 shows a state in which cluster 2001 is registered as an upper cluster of cluster 1001. In the initial state, all elements are 0.
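
The table can be modelled as a nested dictionary indexed by cluster number, as in the sketch below. The helper names are hypothetical; the cluster numbers follow the FIG. 22 example, with 1002 added for illustration.

```python
def empty_hierarchy(cluster_ids):
    """Create a Cmax x Cmax table with every element initialised to 0."""
    return {i: {j: 0 for j in cluster_ids} for i in cluster_ids}


def set_parent(table, parent, child):
    """Register parent as an upper cluster of child: element (parent, child) = 1."""
    table[parent][child] = 1


def parents_of(table, child):
    """List every cluster registered as an upper cluster of child."""
    return [i for i in table if table[i][child] == 1]


hierarchy = empty_hierarchy([1001, 1002, 2001])
set_parent(hierarchy, 2001, 1001)
print(parents_of(hierarchy, 1001), parents_of(hierarchy, 1002))
```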

【００９８】The operation of the document data clustering system configured as above is described with reference to FIGS. 23 to 25. FIG. 23 is a flowchart showing the clustering process and the procedure for calculating the hierarchical relationships between clusters.

【００９９】Step 2202: The feature vector creating means 1903 creates the feature vectors 1904 for the documents contained in the document database 1901.
Step 2203: The singular value decomposition means 1905 performs singular value decomposition of the feature vector set and obtains the singular value decomposition result 1906.
Step 2204: The document similarity vector creating means 1907 creates the document similarity vector set 1908 from the singular value decomposition result.
The procedure up to this point is the same as steps 302 to 305 of the first embodiment (FIG. 3).

【０１００】Step 2205: The clustering target data selecting means 1909 and the cluster creating means 1910 cluster all documents in the document database while referring to the document similarity vector set 1908 and the document database 1901. This procedure is the same as steps 201 to 210 of the first embodiment (FIG. 2), except that each time the cluster creating means 1910 recognizes a stable cluster (corresponding to step 208 of the first embodiment), it registers the number of dimensions in use at that point and the centroid vector of the cluster in the cluster size information table 1913.

【０１０１】Step 2206: Next, the cluster hierarchy relation determining means 1912 refers to the contents of the cluster information table 1911 and the cluster size information table 1913, calculates the reference radius of each cluster according to the definition given above, and registers it in the cluster size information table 1913.
Step 2207: The cluster hierarchy relation determining means 1912 refers to the contents of the cluster information table 1911 and the cluster size information table 1913, calculates the hierarchical relationships between the clusters, and registers the result in the cluster hierarchy relation table 1914.

【０１０２】FIG. 24 is a flowchart showing the procedure for calculating the hierarchical relationships between clusters.
Step 2302: First, all elements of the cluster hierarchy relation table 1914 are initialized to 0 (the state with no hierarchical relationships).
Step 2303: Next, the cluster size information table 1913 is consulted and the dimension d is set to the smallest registered stable dimension.
Step 2304: Next, the cluster size information table 1913 is consulted and the set of clusters whose stable dimension is d is obtained as the comparison target set. Hereinafter, the comparison target set is written S0.

【０１０３】Step 2305: Next, the dimension is increased by the value of Δd used during clustering.
Step 2306: The cluster size information table 1913 is consulted to check whether the increased dimension exceeds the largest stable dimension. If it does, the calculation of the hierarchical relationships between clusters ends (step 2314).
Step 2307: If it does not, the cluster size information table 1913 is consulted and the set of clusters whose stable dimension is the current dimension is obtained as the processing target set. Hereinafter, the processing target set is written S1.

[0104] Step 2308: Next, all clusters belonging to the processing target set S1 are marked as unprocessed.
Step 2309: The following steps are repeated until no unprocessed cluster remains in S1.
Step 2310: One cluster is taken from S1 as the processing target.
Step 2311: The inclusion relationship between each cluster in S0 and the cluster being processed is examined. This procedure is described in detail later with reference to FIG. 25.

[0105] Step 2312: The cluster that has just been processed is marked as processed, and the procedure from step 2309 is repeated.

[0106] When the hierarchical relationships with the clusters of S0 have been calculated for all clusters of S1:
Step 2313: All clusters of the processing target set S1 are added to the comparison target set S0, and processing returns to step 2305.

[0107] In this way, for each cluster, the hierarchical relationships are calculated with the clusters having a lower stable dimension than that cluster, that is, with the clusters that were determined to be stable at an earlier stage.
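The loop of FIG. 24 (steps 2302 to 2314) can be sketched roughly as below. This is an illustrative sketch under stated assumptions: the cluster hierarchy relation table 1914 is represented as a set of (upper, lower) pairs rather than a 0/1 matrix, the containment test of step 2311 is passed in as a hypothetical callback `is_upper`, and `stable_dim` is assumed to be a non-empty mapping taken from the cluster size information table.

```python
def build_hierarchy(stable_dim, delta_d, is_upper):
    """Return the set of (upper, lower) cluster pairs.

    stable_dim: dict mapping cluster name -> stable dimension (non-empty).
    delta_d:    dimension increment used during clustering.
    is_upper:   callback is_upper(c, c_prime) -> bool (step 2311, FIG. 25).
    """
    relations = set()                                   # step 2302: no relations yet
    d = min(stable_dim.values())                        # step 2303
    d_max = max(stable_dim.values())
    s0 = [c for c, k in stable_dim.items() if k == d]   # step 2304: comparison set S0
    while True:
        d += delta_d                                    # step 2305
        if d > d_max:                                   # step 2306
            return relations                            # step 2314
        s1 = [c for c, k in stable_dim.items() if k == d]  # step 2307: target set S1
        for c in s1:                                    # steps 2308 to 2312
            for c_prime in s0:                          # step 2311
                if is_upper(c, c_prime):
                    relations.add((c, c_prime))
        s0.extend(s1)                                   # step 2313
```

Because S1 is merged into S0 only after every cluster of S1 has been processed, clusters with the same stable dimension are never compared with each other in this sketch, which matches the summary that each cluster is related only to clusters that stabilized earlier.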

[0108] FIG. 25 is a flowchart showing the procedure for obtaining lower clusters. With reference to FIG. 25, the calculation in step 2311 of the hierarchical relationship between the cluster being processed and each cluster of the comparison target set S0 is described. Here, the cluster being processed is denoted C.

[0109] Step 2402: All clusters in the comparison target set are marked as uncompared.
Step 2403: Next, the cluster size information table 1913 is referred to, and the centroid vector of the cluster centroid g(C) of the cluster C being processed and the reference radius R(C) of cluster C are obtained.

[0110] Step 2404: Next, it is checked whether the comparison target set contains an uncompared cluster. If not, the calculation of the hierarchical relationship ends (step 2409).
Step 2405: If there are uncompared clusters, one of them is taken as the comparison target. The cluster taken as the comparison target is denoted C'.

[0111] Step 2406: Next, it is checked whether all documents belonging to the comparison target cluster C' fall within the reference radius R(C) centered on the centroid g(C) of the cluster C being processed. The distance is calculated according to the definition of (Equation 5).

[0112] Step 2407: If all of them fall within that range, the cluster C being processed is made the upper cluster of the comparison target cluster C'.
Step 2408: The comparison target cluster C' is marked as compared. If the check in step 2406 finds a document belonging to the comparison target cluster C' that does not fall within the reference radius R(C) of the centroid g(C) of the cluster C being processed, processing moves directly from step 2406 to step 2408. After step 2408, processing returns to step 2404 to check whether there is a next comparison target.

[0113] The above is the method of examining the hierarchical relationship between the cluster being processed and each cluster of the comparison target set S0.
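The containment test of FIG. 25 (steps 2402 to 2409) might look like the following sketch. It assumes a Euclidean distance over the first d(C) components in place of (Equation 5), and it treats a document lying exactly on the reference radius as inside the range; both are assumptions, and the names `dist` and `lower_clusters` are hypothetical.

```python
import math

def dist(u, v, d):
    """Euclidean distance over the first d components (assumed stand-in for Equation 5)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u[:d], v[:d])))

def lower_clusters(g_c, r_c, d_c, s0):
    """Return the names of the clusters C' in S0 for which C is the upper cluster.

    g_c: centroid vector g(C) of the cluster C being processed.
    r_c: reference radius R(C).
    d_c: stable dimension d(C).
    s0:  dict mapping cluster name -> list of document vectors (comparison set).
    """
    lower = []
    for name, docs in s0.items():                        # steps 2404 and 2405
        # Step 2406: every document of C' must lie within R(C) of g(C).
        if all(dist(x, g_c, d_c) <= r_c for x in docs):
            lower.append(name)                           # step 2407
        # Step 2408: C' counts as compared once the loop moves on.
    return lower
```

A single document outside the radius is enough to reject C' as a lower cluster, which is why the check is written with `all`.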

[0114] As described above, the document data clustering system of the fourth embodiment performs singular value decomposition of the document data, gradually increases the information used to calculate distances between documents, and clusters in stages. A natural hierarchical relationship can then be introduced among the resulting clusters using computable quantities.

[0115] In the above embodiment, even after cluster C is recognized as the upper cluster of cluster C', the elements of cluster C' are not directly included in cluster C. However, the elements of a lower cluster may be included in its upper cluster.

[0116] The centroid of the upper cluster may also be recalculated taking the elements of the lower cluster into account.
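One possible reading of these two variations is sketched below: the documents of the lower cluster are pooled with those of the upper cluster and the centroid is taken as the component-wise mean of the pooled set. The text does not specify how the recalculation is done, so the unweighted mean and the name `merged_centroid` are assumptions.

```python
def merged_centroid(upper_docs, lower_docs):
    """Centroid of the upper cluster after absorbing the lower cluster's documents.

    Both arguments are lists of document similarity vectors of equal length.
    Assumes an unweighted component-wise mean over the pooled documents.
    """
    docs = upper_docs + lower_docs
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]
```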

[0117] In the above embodiment, a distance called the reference radius was defined and used when examining the hierarchical relationship between clusters. Instead of the reference radius defined above, various distances defined in d(C) dimensions may be used, for example the maximum, minimum, or average of the distances from the centroid of the cluster being processed to each document belonging to that cluster, or a radius that contains 95% of the elements.
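The alternative radii listed above can be computed side by side, as in the following sketch. The Euclidean distance is again an assumed stand-in for (Equation 5), and the 95% radius is taken here as the smallest distance that covers at least 95% of the cluster's documents, which is one interpretation among several. The names `dist` and `radius_candidates` are hypothetical.

```python
import math
import statistics

def dist(u, v, d):
    """Euclidean distance over the first d components (assumed stand-in for Equation 5)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u[:d], v[:d])))

def radius_candidates(docs, g, d):
    """Alternative radii for a cluster with documents `docs` and centroid `g`."""
    dists = sorted(dist(x, g, d) for x in docs)
    # Number of documents that must be covered: ceil(0.95 * n), in integer arithmetic.
    k = (95 * len(dists) + 99) // 100
    return {
        "max": dists[-1],
        "min": dists[0],
        "mean": statistics.fmean(dists),
        "p95": dists[k - 1],
    }
```

A larger radius such as the maximum makes an upper cluster absorb more lower clusters, while the minimum makes containment rare, so the choice controls how deep the resulting hierarchy becomes.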

[0118]

[Effects of the Invention] As is clear from the above description, the document data clustering system of the present invention performs singular value decomposition of the document data, gradually increases the information used to calculate distances between documents, and repeats clustering in stages. As a result, the number of clusters suited to the clustering target can be determined even when the number of clusters is not decided in advance, and clusters that gather only similar documents can be generated.

[0119] In a system provided with the label extracting means, each clustering result can be labeled with a character string representing its contents, so that the clustering results can be presented in an easily understandable manner.

[0120] Furthermore, by arranging the clusters generated by the stepwise clustering into a hierarchy, a hierarchical relationship between clusters that matches human intuition can be generated.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the overall configuration of the document data clustering system according to the first embodiment;

FIG. 2 is a flowchart showing the clustering procedure of the document data clustering system according to the first embodiment;

FIG. 3 is a flowchart showing the procedure from the document database to the creation of document similarity vectors in the document data clustering system according to the first embodiment;

FIG. 4 shows an example of feature vectors based on the appearance frequency of words in documents in the document data clustering system according to the first embodiment;

FIG. 5 shows an example of feature vectors based on values normalized by dividing the appearance frequency by the document length in the document data clustering system according to the first embodiment;

FIG. 6 shows an example of feature vectors based on values normalized by dividing the appearance frequency by the total appearance frequency of words in the document in the document data clustering system according to the first embodiment;

FIG. 7 shows a description example of the cluster information table of the document data clustering system according to the first embodiment;

FIG. 8 shows an example comparing the documents assigned to a certain cluster by two clustering runs in the document data clustering system according to the first embodiment;

FIG. 9 shows an example of the updated cluster information table of the document data clustering system according to the first embodiment;

FIG. 10 is a conceptual diagram showing how the clusters change in the document data clustering system according to the first embodiment;

FIG. 11 is a block diagram showing the overall configuration of the document data clustering system according to the second embodiment;

FIG. 12 is a flowchart showing the processing procedure of the document data clustering system according to the second embodiment;

FIG. 13 shows an example of a representative word vector of the document data clustering system according to the second embodiment;

FIG. 14 shows an example in which label candidates are extracted from documents and their scores are calculated in the document data clustering system according to the second embodiment;

FIG. 15 is a block diagram showing the overall configuration of the document data clustering system according to the third embodiment;

FIG. 16 is a flowchart showing the clustering procedure of the document data clustering system according to the third embodiment;

FIG. 17 shows an example comparing the documents assigned to a cluster by two clustering runs in the document data clustering system according to the third embodiment;

FIG. 18 shows an example of the updated cluster information table of the document data clustering system according to the third embodiment;

FIG. 19 is a conceptual diagram showing how the clusters change in the document data clustering system according to the third embodiment;

FIG. 20 is a block diagram showing the configuration of the document data clustering system according to the fourth embodiment;

FIG. 21 shows an example of the cluster size information table according to the fourth embodiment;

FIG. 22 shows an example of the cluster hierarchy relation table according to the fourth embodiment;

FIG. 23 is a flowchart showing the operation procedure of the document data clustering system according to the fourth embodiment;

FIG. 24 is a flowchart showing the procedure for calculating the hierarchical relationship between clusters according to the fourth embodiment;

FIG. 25 is a flowchart showing the lower cluster calculation procedure according to the fourth embodiment.

[Explanation of symbols]

101, 1101, 1901 Document database
102, 1102, 1902 Dictionary
103, 1103, 1903 Feature vector creating means
104, 1104, 1904 Feature vector set
105, 1105, 1905 Singular value decomposition means
106, 1106, 1906 Singular value decomposition result
107, 1107, 1907 Document similarity vector creating means
108, 1108, 1908 Document similarity vector set
109, 1909 Clustering target data selecting means
110, 1910 Cluster creating means
111, 1109, 1911 Cluster information table
112, 1112, 1915 Result display means
1091 Main parent clustering target data selecting means
1110 Label extracting means
1111 Extracted label
1912 Cluster hierarchy relation determining means
1913 Cluster size information table
1914 Cluster hierarchy relation table

Continuation of front page
(72) Inventor: Takamasa Koyama, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
F-term (reference): 5B075 ND03 NK06 NR12 PQ02 PR04 PR06 QM08 UU06

Claims

[Claims]

1. A document data clustering system comprising a document database storing machine-readable document data and a dictionary storing machine-readable words, the system clustering the documents stored in the document database and further comprising: feature vector creating means for creating a feature vector of each document stored in the document database based on the frequency of appearance of the words stored in the dictionary; singular value decomposition means for performing singular value decomposition on the set of feature vectors created by the feature vector creating means; document similarity vector creating means for creating, from the result of the singular value decomposition, document similarity vectors for calculating the similarity between documents; cluster creating means for creating clusters for all or some of the documents in the document database using the set of document similarity vectors created by the document similarity vector creating means; a cluster information table for storing information on the created clusters; and clustering target data selecting means for selecting, from the document database with reference to the cluster information table, the documents to be clustered by the cluster creating means, wherein the cluster creating means calculates, for the documents to be clustered, the distance between each document and a cluster centroid using the document similarity vectors, performs a second clustering on the same target documents with the number of dimensions of the document similarity vectors increased within an appropriate range from that used in the first clustering, compares the results of the two clusterings, and determines a cluster with little change to be a stable cluster; the clustering target data selecting means removes the documents assigned to the stable cluster from the clustering targets and selects the targets to be clustered next by the cluster creating means; and this trial is repeated between the cluster creating means and the clustering target data selecting means.

2. A document data clustering system comprising a document database storing machine-readable document data and a dictionary storing machine-readable words, the system clustering the documents stored in the document database and further comprising: feature vector creating means for creating a feature vector of each document stored in the document database based on the frequency of appearance of the words stored in the dictionary; singular value decomposition means for performing singular value decomposition on the set of feature vectors created by the feature vector creating means; document similarity vector creating means for creating, from the result of the singular value decomposition, document similarity vectors for calculating the similarity between documents; cluster creating means for creating clusters for all or some of the documents in the document database using the set of document similarity vectors created by the document similarity vector creating means; a cluster information table for storing information on the created clusters; and clustering target data selecting means for selecting, from the document database with reference to the cluster information table, the documents to be clustered by the cluster creating means, wherein the cluster creating means calculates, for the documents to be clustered, the distance between each document and a cluster centroid using the document similarity vectors, performs a second clustering on the same target documents with the number of dimensions of the document similarity vectors increased within an appropriate range from that used in the first clustering, compares the results of the two clusterings, and determines a cluster with a large change to be an unstable cluster; the clustering target data selecting means removes the documents assigned to the unstable cluster from the clustering targets and selects the targets to be clustered next by the cluster creating means; and this trial is repeated between the cluster creating means and the clustering target data selecting means.

3. The document data clustering system according to claim 1 or 2, further comprising label extracting means for extracting a label for each cluster by referring to the document database, the dictionary, the singular value decomposition result, and the cluster information table, wherein the label extracting means calculates a feature vector expressing the pseudo appearance frequency of the words at the centroid of the cluster, and extracts as a label, from the documents assigned to the cluster, a character string appearing around a word having a high appearance frequency in that feature vector.

4. The document data clustering system according to claim 1, further comprising cluster hierarchy relation determining means for setting a hierarchical relationship between the clusters created by the cluster creating means, wherein, with the number of dimensions at the time an arbitrary cluster C is determined to be a stable cluster defined as the stable dimension d(C) of that cluster, the cluster hierarchy relation determining means ranks cluster C as a cluster higher than a cluster C' whose stable dimension is lower than the stable dimension d(C) of cluster C when the distance in d(C) dimensions between every document belonging to cluster C' and the centroid g(C) of cluster C is within a certain distance R(C).

5. The document data clustering system according to claim 4, wherein the cluster hierarchy relation determining means, letting R1(C) be the largest of the distances in d(C) dimensions between each document belonging to cluster C and the centroid g(C) of cluster C, and letting R2(C) be the smallest of the distances in d(C) dimensions between each document belonging to a cluster other than C whose stable dimension is d(C) or more and the centroid g(C) of cluster C, sets the smaller of R1(C) and R2(C) as R(C).