JP2012088972A

JP2012088972A - Data classification device, data classification method and data classification program

Info

Publication number: JP2012088972A
Application number: JP2010235879A
Authority: JP
Inventors: Seiichi Konya; 精一紺谷; Akimichi Tanaka; 明通田中; Masashi Uchiyama; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-10-20
Filing date: 2010-10-20
Publication date: 2012-05-10

Abstract

PROBLEM TO BE SOLVED: To properly classify new data without recalculating the clustering and without affecting an existing clustering result. SOLUTION: A data classification device 1 for transforming new data Z into an eigenspace through a transformation matrix M obtained by learning training data X includes: a similarity calculation unit 5 for calculating a similarity matrix K between the training data X; an eigenspace learning unit 6 for calculating the transformation matrix M for transforming the similarity matrix K into its eigenspace on the basis of the similarity matrix K and the eigenvectors Q corresponding to the training data X; and an eigenspace transformation unit 7 for transforming the new data Z into the eigenspace, at the time of classification processing on the new data Z, from a similarity matrix H between the new data Z and the training data X calculated by the similarity calculation unit 5 and the transformation matrix M calculated by the eigenspace learning unit 6.

Description

The present invention relates to an unsupervised data classification technique, and more particularly to a technique for classifying unknown data into already generated clusters after data such as text and images have been clustered.

Conventionally, a method called spectral clustering, described for example in Non-Patent Document 1, has been known as an unsupervised data classification method.

Spectral clustering, as in Algorithm 1 shown in Table 1, is a data classification method that classifies n data items into k clusters, given n data items x_i ∈ R^m and the following parameters: the similarity τ between data items, the number of clusters k, and the number of k-means clustering iterations T. k-means clustering is described in Non-Patent Document 2.

Spectral clustering regards the input data as the nodes of a graph and the affinities between input data as the edge weights, and partitions this graph into k parts. A partition is chosen such that the total weight of the cut edges is small and the numbers of nodes in the resulting parts are balanced.

FIG. 13 shows an example of a graph in which edges of weight 1 connect eight data items. Partitioning this graph into two gives the result shown in FIG. 14. Only 2 edges are cut, and both parts contain 4 nodes. Spectral clustering is realized by computing eigenvectors from the affinity matrix of the input data, mapping the input data into an eigenspace as shown in FIG. 15, and running the k-means method of Algorithm 2 shown in Table 2 in that space.
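The cut criterion can be made concrete with a small sketch. The edge set of FIG. 13 is not reproduced in this text, so the graph below (two fully connected groups of four nodes joined by two weight-1 edges) and the helper name `cut_weight` are hypothetical, chosen only so that the stated outcome (2 cut edges, 4 nodes per part) is reproduced.

```python
from itertools import combinations

def cut_weight(edges, part_a):
    """Sum the weights of edges that have exactly one endpoint in part_a."""
    part_a = set(part_a)
    return sum(w for (u, v), w in edges.items() if (u in part_a) != (v in part_a))

# Hypothetical 8-node graph (not the actual graph of FIG. 13):
# nodes 1-4 and nodes 5-8 are each fully connected, and two edges bridge the groups.
edges = {}
for group in ([1, 2, 3, 4], [5, 6, 7, 8]):
    for u, v in combinations(group, 2):
        edges[(u, v)] = 1
edges[(3, 5)] = 1
edges[(4, 6)] = 1

part_a = {1, 2, 3, 4}
part_b = {5, 6, 7, 8}
print(cut_weight(edges, part_a), len(part_a), len(part_b))
```

An unbalanced or badly placed partition, such as `{1, 2}` against the remaining six nodes, cuts more edge weight in this graph, which is why the balanced split is preferred.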

As a more concrete example, FIG. 5 shows an example of spectral clustering with k = 2, τ = 200, and T = 200.

In steps 2 to 5 of Algorithm 1, the input data x_i are transformed into eigenspace data y_i. The eigenspace data are distributed like the data Y shown in FIG. 6. Next, k-means clustering is performed on the eigenspace data by steps 1 to 6 of Algorithm 2 shown in Table 2, with the number of clusters k = 2 and the number of iterations T = 200. This yields the cluster centers shown as centroids in FIG. 7 and the data classified by the cluster index g into g_i = 0 and g_i = 1.

Classifying the original data by the cluster index gives the clustering result shown in FIG. 8. The spiral-shaped data are properly separated into two clusters.
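Tables 1 and 2 are not reproduced in this text, so the following is only a sketch of spectral clustering in the style of Non-Patent Document 1, not a transcription of Algorithm 1 or Algorithm 2. The Gaussian affinity exp(-||x_i - x_j||^2 / τ), the farthest-point initialization of k-means, the toy data, and all function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_affinity(X, tau):
    """Assumed affinity exp(-||x_i - x_j||^2 / tau) with a zero diagonal."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    A = np.exp(-sq / tau)
    np.fill_diagonal(A, 0.0)
    return A

def spectral_embed(X, k, tau):
    """Map each input row to a unit-length row of the top-k eigenvector matrix."""
    A = gaussian_affinity(X, tau)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
    _, eigvecs = np.linalg.eigh(L)                      # eigenvalues in ascending order
    Q = eigvecs[:, -k:]                                 # eigenvectors of the k largest
    Y = Q / np.linalg.norm(Q, axis=1, keepdims=True)    # row-normalize
    return Q, Y

def kmeans(Y, k, T):
    """Plain k-means run for T iterations; farthest-point init is an assumed choice."""
    centers = [Y[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(T):
        dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        g = dist.argmin(axis=1)                         # cluster index per data item
        for j in range(k):
            if np.any(g == j):
                centers[j] = Y[g == j].mean(axis=0)
    return centers, g

# Toy data: two well-separated groups of four points (hypothetical, not FIG. 5).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
Q, Y = spectral_embed(X, k=2, tau=2.0)
centers, g = kmeans(Y, k=2, T=20)
print(g)
```

The cluster centers returned here live in the eigenspace of Y, not in the space of X, which is the point the following paragraphs build on.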

Non-Patent Document 1: Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. "On spectral clustering: analysis and an algorithm." Neural Information Processing Symposium 14, 2001.
Non-Patent Document 2: C. M. Bishop, "Pattern Recognition and Machine Learning" (Japanese edition, second volume), Springer Japan, July 1, 2008, pp. 140-142.

However, the spectral clustering described above has the problem that it cannot classify data other than the training data, that is, the x_i given at clustering time.

The cluster centers c_j (j = 1, ..., k) obtained by spectral clustering lie in a space different from that of the original data, so a means of transforming into this space is required in order to classify new data.

For example, even if one tries to assign new data to the clustering result shown in FIG. 14, the data cannot be classified, because spectral clustering provides no method for mapping new data into the eigenspace shown in FIG. 15. Spectral clustering therefore has to be redone with the new data added.

The clustering result may also change under the influence of the new data. For example, when new data item 9 is added as in the sample graph of FIG. 16, the original partition would have 4 cut edges, as shown in FIG. 17. A new partition with 3 cut edges is chosen instead, so the clustering result for the original data changes as well.

The present invention solves the above problems. Its object is to classify data appropriately when new data are classified into a clustering result created from training data, without requiring recalculation of the clustering and without affecting the existing clustering result.

Accordingly, as shown in FIG. 2, when processing the training data X, the present invention calculates a similarity matrix K between the training data X and calculates a transformation matrix M on the basis of this similarity matrix and the eigenvectors Q corresponding to the eigenvalues of the training data X. Then, when processing new data Z, it calculates a similarity matrix H between the new data Z and the training data X, and from this similarity matrix H and the transformation matrix M calculates the data obtained by transforming the new data Z into the eigenspace. This makes it possible to transform the new data Z into eigenspace data V.

That is, one aspect of the data classification device of the present invention is a data classification device that transforms new data into an eigenspace by a transformation matrix learned from training data, comprising: similarity calculation means for calculating a similarity matrix between the training data; eigenspace learning means for calculating, on the basis of the similarity matrix and eigenvectors corresponding to the training data, a transformation matrix for transforming the similarity matrix into its eigenspace; and eigenspace transformation means for transforming the new data into the eigenspace, at the time of classifying the new data, from the similarity matrix between the training data and the new data calculated by the similarity calculation means and the transformation matrix calculated by the eigenspace learning means.

One aspect of the data classification method of the present invention is a data classification method that transforms new data into an eigenspace by a transformation matrix learned from training data, comprising: a step in which similarity calculation means calculates a similarity matrix between the training data; a step in which eigenspace learning means calculates, on the basis of the similarity matrix and eigenvectors corresponding to the training data, a transformation matrix for transforming the similarity matrix into its eigenspace; a step in which the similarity calculation means calculates a similarity matrix between new data and the training data; and a step in which eigenspace transformation means transforms the new data into the eigenspace from that similarity matrix and the transformation matrix.

The present invention may also take the form of a data classification program that causes a computer to function as each of the means constituting the above data classification device.

According to the invention described above, when new data are classified into a clustering result created from training data, recalculation of the clustering becomes unnecessary, and the data can be classified appropriately without affecting the existing clustering result.

FIG. 1 is a block diagram of a data classification device according to an embodiment of the invention.
FIG. 2 is a schematic diagram outlining the data classification process according to the invention.
FIG. 3 is a chart showing the procedure for processing training data according to the embodiment.
FIG. 4 is a chart showing the procedure for classifying new data according to the embodiment.
FIG. 5 shows an example of input data.
FIG. 6 shows an example of the distribution of the eigenspace data y_i.
FIG. 7 shows an example of clustering in the eigenspace.
FIG. 8 shows an example of a cluster index.
FIG. 9 shows an example of new data.
FIG. 10 shows an example of new data transformed into the eigenspace.
FIG. 11 shows an example of new data classified in the eigenspace.
FIG. 12 shows an example of a classification result.
FIG. 13 shows an example of a sample graph.
FIG. 14 shows the partitioning result of the sample graph.
FIG. 15 shows an example of an eigenspace.
FIG. 16 shows an example of the sample graph with new data added.
FIG. 17 shows an example of the partitioning result of the sample graph with new data added.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to the following embodiments.

(Overview)
The data classification device 1 according to the embodiment of the present invention shown in FIG. 1, when processing the training data X, calculates a similarity matrix K between the training data X and calculates a transformation matrix M on the basis of this similarity matrix and the eigenvectors Q corresponding to the eigenvalues of the training data X. Then, when processing new data Z, it calculates a similarity matrix H between the new data Z and the training data X, and from this similarity matrix H and the transformation matrix M calculates the data obtained by transforming the new data Z into the eigenspace. This makes it possible to transform the new data Z into eigenspace data V.

(Device configuration)
As shown in FIG. 1, the data classification device 1 includes a data input unit 2, a spectral clustering unit 3, a storage unit 4, a similarity calculation unit 5, an eigenspace learning unit 6, an eigenspace transformation unit 7, a data allocation unit 8, and a result output unit 9.

The functional units 2 to 9 of the data classification device 1 are realized, for example, by the hardware resources of a computer. That is, the data classification device 1 includes computer hardware resources such as a CPU, memory, a storage device (for example, a hard disk drive), and I/O devices (for example, a network device or USB). The functional units 2 to 9 are implemented by these hardware resources cooperating with software resources (an OS, applications, and the like).

The data input unit 2 receives, from a network, a file, or the like via an I/O device, the training data x_i ∈ R^m (i = 1, ..., n), the number of clusters k, the similarity τ between the training data x_i, and the number of k-means clustering iterations T. It also receives new data z_j ∈ R^m (j = 1, ..., l) via the I/O device.

The spectral clustering unit 3 performs spectral clustering of the training data x_i and calculates the eigenvectors q_i of the training data x_i, the cluster centers c_i, and the cluster index g. For this spectral clustering, for example, the spectral clustering method of Non-Patent Document 1 using Algorithm 1 in Table 1 described above is applied.

The storage unit 4 stores the input training data x_i, the parameters k, τ, and T, the eigenvectors Q, the cluster centers c_i, and the transformation matrix M (all obtained from the training data x_i).

When processing the training data x_i, the similarity calculation unit 5 retrieves the training data x_i and the parameter τ from the storage unit 4 and calculates the similarity matrix K (= k_ij) between the training data x_i according to equation (1).

When processing the new data z_j, the similarity calculation unit 5 retrieves the training data x_i and the parameter τ from the storage unit 4 and calculates the similarity matrix H (= h_ij) between the training data x_i and the new data z_j according to equation (2).
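Equations (1) and (2) are not reproduced in this text. As a sketch only, the following assumes a Gaussian similarity exp(-||a - b||^2 / τ), which is a common choice in spectral clustering but is not confirmed by the source. The function names and the sample values are hypothetical. What the sketch is meant to show is the shapes: K is n x n (training against training) and H is l x n (new against training).

```python
import math

def similarity(a, b, tau):
    """Assumed Gaussian similarity exp(-||a - b||^2 / tau); not taken from the source."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / tau)

def similarity_matrix(rows, cols, tau):
    """Matrix whose (i, j) entry is the similarity between rows[i] and cols[j]."""
    return [[similarity(r, c, tau) for c in cols] for r in rows]

# Hypothetical values: n = 3 training items, l = 1 new item, m = 2 dimensions.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
Z = [[0.5, 0.0]]
tau = 200.0

K = similarity_matrix(X, X, tau)   # n x n, role of equation (1)
H = similarity_matrix(Z, X, tau)   # l x n, role of equation (2)
print(len(K), len(K[0]), len(H), len(H[0]))
```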

When processing the training data x_i, the eigenspace learning unit 6 forms the k sets of simultaneous linear equations of equation (3) from the calculated similarity matrix K of the training data x_i and the eigenvectors Q retrieved from the storage unit 4. It then solves these equations by Gaussian elimination or a similar method to calculate the transformation matrix M = (m_1, ..., m_k). Here, q_i in equation (3) is a component of the eigenvectors Q.
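Equation (3) is not reproduced in this text. One reading consistent with the description (k linear systems built from K and the columns q_i of Q, solved for m_1, ..., m_k) is K m_i = q_i, that is, K M = Q. The sketch below solves that system by Gaussian elimination with partial pivoting. The reading of equation (3), the function name, and the 2 x 2 values are assumptions.

```python
def solve_linear_systems(K, Q):
    """Solve K M = Q for M (n x k) by Gaussian elimination with partial pivoting."""
    n = len(K)
    k = len(Q[0])
    # Augmented matrix [K | Q] so all k right-hand sides are eliminated together.
    aug = [[float(v) for v in K[i]] + [float(v) for v in Q[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("K is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + k):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution, one right-hand side (one column m_j) at a time.
    M = [[0.0] * k for _ in range(n)]
    for row in range(n - 1, -1, -1):
        for j in range(k):
            s = aug[row][n + j] - sum(aug[row][c] * M[c][j] for c in range(row + 1, n))
            M[row][j] = s / aug[row][row]
    return M

# Hypothetical 2 x 2 example, chosen only to keep the arithmetic easy to check.
K = [[2.0, 1.0], [1.0, 3.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]
M = solve_linear_systems(K, Q)
print(M)
```

Under this reading, multiplying the training similarity matrix K by M reproduces Q, which is what lets the same M be applied to the similarity rows of unseen data.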

When processing the new data z_j, the eigenspace transformation unit 7 calculates, as in equation (4), the product U of the similarity matrix H between the training data x_i and the new data z_j and the transformation matrix M stored in the storage unit 4.

The eigenspace transformation unit 7 then normalizes every row of the product U to a unit vector, transposes the result, and calculates the eigenspace data V = (v_1, ..., v_l), v_i = (v_i1, ..., v_ik), by the operation of equation (5).
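Equations (4) and (5) are not reproduced in this text. The sketch below covers only what the description states: the product U = H M followed by normalizing each row of U to unit length. The transposition and the exact form of equation (5) are left out, and each row of the result is simply treated as one transformed data item. The helper names and numeric values are hypothetical.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def normalize_rows(U):
    """Scale every row to unit Euclidean length (all-zero rows are left unchanged)."""
    out = []
    for row in U:
        norm = math.sqrt(sum(u * u for u in row))
        out.append([u / norm for u in row] if norm > 0 else list(row))
    return out

# Hypothetical shapes: l = 1 new item, n = 3 training items, k = 2 clusters.
H = [[0.9, 0.1, 0.2]]                       # l x n similarity to the training data
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # n x k transformation matrix
U = matmul(H, M)                            # l x k
V = normalize_rows(U)                       # each row now has unit length
print(U, V)
```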

The data allocation unit 8 allocates each of the data v_1, ..., v_l calculated by the eigenspace transformation unit 7 to the cluster center c_j closest to that data, as in equations (6) and (7).
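Equations (6) and (7) are not reproduced in this text. The description amounts to a nearest-center rule, which can be sketched as follows. The Euclidean distance, the function name `assign_to_nearest`, and the sample centers and data are assumptions for illustration.

```python
def assign_to_nearest(V, centers):
    """Return, for each row of V, the index of the closest cluster center."""
    assignments = []
    for v in V:
        sq_dists = [sum((vi - ci) ** 2 for vi, ci in zip(v, c)) for c in centers]
        assignments.append(min(range(len(centers)), key=sq_dists.__getitem__))
    return assignments

# Hypothetical eigenspace data and cluster centers with k = 2.
centers = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
b = assign_to_nearest(V, centers)
print(b)
```

Because only distances to stored centers are evaluated, the centers and the cluster index of the training data are never modified by this step.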

The result output unit 9 outputs the cluster index g of the training data x_i and the allocation b of the new data z_j to a network, a file, or the like via the I/O device.

(Description of processing procedure)
The specific procedure S101 to S104 by which the data classification device 1 processes the training data is described with reference to FIG. 3.

S101: When the data input unit 2 receives the training data illustrated in FIG. 5 and the parameters (k = 2, τ = 200, T = 200) via the I/O device, it stores them in the storage unit 4.

S102: The spectral clustering unit 3 retrieves the training data X and the parameters (k = 2, τ = 200, T = 200) from the storage unit 4, transforms the training data by the spectral clustering described above into its eigenspace given by the eigenvectors Q, which yields the distribution of Y shown in FIG. 6, and calculates the cluster centers c_i shown as centroids in FIG. 7. It also calculates the cluster index g shown in FIG. 8. The k eigenvectors Q = (q_1, ..., q_k), the cluster centers c_i, and the cluster index g calculated in this way are stored in the storage unit 4. The cluster index g of the training data is output by the result output unit 9.

S103: The similarity calculation unit 5 calculates the similarity matrix K from the training data X stored in the storage unit 4 by the operation of equation (1). This similarity matrix K is stored in the storage unit 4.

S104: The eigenspace learning unit 6 retrieves the similarity matrix K of the training data X from the storage unit 4 and calculates, from equation (3) with this similarity matrix K applied, the transformation matrix M to the eigenvectors Q. This transformation matrix M is stored in the storage unit 4.

Next, the specific procedure S201 to S204 for classifying new data is described with reference to FIG. 4.

S201: When the data input unit 2 receives the new data Z illustrated in FIG. 9 via the I/O device, it passes the data to the similarity calculation unit 5.

S202: The similarity calculation unit 5 calculates, by the operation of equation (2), the similarity matrix H between the new data Z supplied from the data input unit 2 and the training data X retrieved from the storage unit 4, and stores it in the storage unit 4.

S203: The eigenspace transformation unit 7 retrieves the similarity matrix H and the transformation matrix M from the storage unit 4 and performs the operations of equations (4) and (5). These operations transform the new data Z into the eigenspace. As a result, the distribution of the data V shown in FIG. 10 is obtained.

S204: The data allocation unit 8 allocates the new data Z transformed into the eigenspace in S203 to the cluster centers by the operations of equations (6) and (7). As a result, the data V are divided into cluster 0 and cluster 1 as shown in FIG. 11.

The result output unit 9 outputs the allocation b of the new data Z obtained in S204. As shown in FIG. 12, the new data Z are appropriately allocated in accordance with the result of the spectral clustering.
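The two procedures can be strung together in a short end-to-end sketch. Since equations (1) to (7) and Tables 1 and 2 are not reproduced in this text, several parts are assumptions: the Gaussian similarity, the reading of equation (3) as K M = Q, the use of `numpy.linalg.solve` in place of hand-written Gaussian elimination, a hard-coded cluster index g standing in for the k-means result of S102, and the toy data. The function names are hypothetical.

```python
import numpy as np

def gaussian_similarity(A, B, tau):
    """Assumed similarity exp(-||a - b||^2 / tau) between rows of A and rows of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / tau)

def unit_rows(U):
    """Normalize every row to unit Euclidean length."""
    return U / np.linalg.norm(U, axis=1, keepdims=True)

# --- Training data processing (S101 to S104), hypothetical data ---
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
g = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # stand-in for the cluster index from S102
tau, k = 2.0, 2

K = gaussian_similarity(X, X, tau)          # S103: n x n similarity matrix

# Stand-in for S102: top-k eigenvectors of the normalized affinity matrix.
A = K.copy()
np.fill_diagonal(A, 0.0)
d = A.sum(axis=1)
L = A / np.sqrt(np.outer(d, d))
Q = np.linalg.eigh(L)[1][:, -k:]            # n x k eigenvectors
Y = unit_rows(Q)                            # training data in the eigenspace
centers = np.array([Y[g == j].mean(axis=0) for j in range(k)])

M = np.linalg.solve(K, Q)                   # S104: assumed reading K M = Q

# --- New data classification (S201 to S204) ---
Z = np.array([[0.5, 0.5], [10.5, 10.5]])    # S201: new data
H = gaussian_similarity(Z, X, tau)          # S202: l x n similarity matrix
V = unit_rows(H @ M)                        # S203: transform into the eigenspace
dist = np.linalg.norm(V[:, None, :] - centers[None, :, :], axis=2)
b = dist.argmin(axis=1)                     # S204: nearest cluster center
print(b)
```

In this sketch the eigendecomposition is computed once, on the training data only. Each new item costs one row of similarities, one matrix product, and k distance evaluations, and Q, the centers, and g stay untouched.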

(Effects of this embodiment)
As described above, the data classification device 1 learns, from the training data and its corresponding eigenvectors, a method of transforming into the eigenspace and applies that transformation to new data. New data can therefore be appropriately allocated to the clusters obtained by spectral clustering.

In addition, the conventional technique (spectral clustering) computes the eigenvectors of the affinity matrix (similarity matrix) and therefore requires a large amount of computation that grows with the number of data items. In contrast, the data classification device 1 according to the embodiment of the present invention can reduce the amount of computation required for data classification by performing spectral clustering on a small number of data items sampled from the input data and classifying the remaining data into the clustering result.

In particular, in this embodiment, in the course of processing the training data, the spectral clustering unit 3 calculates the eigenvectors corresponding to the training data by spectral clustering of the training data based on the training data, its similarity, the number of clusters, and the number of clustering iterations. This allows the eigenspace learning unit 6 to calculate the transformation matrix efficiently. Moreover, since the spectral clustering also yields the cluster centers of the training data, the data allocation unit 8 can allocate new data in the eigenspace efficiently.

Furthermore, since the data allocation unit 8 allocates the eigenspace data to the cluster center closest to that data, new data are clearly classified in the eigenspace.

The data classification technique of this embodiment is effective, for example, for classifying text data and image data.

(Aspect of the present invention as a program)
The present invention can be realized by implementing some or all of the functions of the functional units 2 to 9 of the data classification device 1 of the above embodiment as a computer program and executing that program on a computer. The processing procedure of this embodiment can likewise be implemented as a computer program and executed by a computer. Furthermore, a program for realizing these functions on a computer can be recorded on a computer-readable recording medium, for example an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, CD-R, CD-RW, HDD, or removable disk, and stored or distributed in that form. The program can also be provided over a network, for example via the Internet or electronic mail.

Description of symbols
1 ... Data classification device
3 ... Spectral clustering unit (spectral clustering means)
5 ... Similarity calculation unit (similarity calculation means)
6 ... Eigenspace learning unit (eigenspace learning means)
7 ... Eigenspace transformation unit (eigenspace transformation means)
8 ... Data allocation unit (data allocation means)

Claims

1. A data classification device that transforms new data into an eigenspace by a transformation matrix learned from training data, comprising:
similarity calculation means for calculating a similarity matrix between the training data;
eigenspace learning means for calculating, on the basis of the similarity matrix and eigenvectors corresponding to the training data, a transformation matrix for transforming the similarity matrix into its eigenspace; and
eigenspace transformation means for transforming the new data into the eigenspace, at the time of classification processing of the new data, from the similarity matrix between the training data and the new data calculated by the similarity calculation means and the transformation matrix calculated by the eigenspace learning means.

2. The data classification device according to claim 1, further comprising spectral clustering means for calculating eigenvectors and cluster centers corresponding to the training data by performing spectral clustering of the training data on the basis of the training data, its similarity, the number of clusters, and the number of clustering iterations.

3. The data classification device according to claim 2, further comprising data allocation means for allocating the eigenspace data calculated by the eigenspace transformation means to the cluster center closest to that data.

4. A data classification method for transforming new data into an eigenspace by a transformation matrix learned from training data, comprising:
a step in which similarity calculation means calculates a similarity matrix between the training data;
a step in which eigenspace learning means calculates, on the basis of the similarity matrix and eigenvectors corresponding to the training data, a transformation matrix for transforming the similarity matrix into its eigenspace;
a step in which the similarity calculation means calculates a similarity matrix between the new data and the training data; and
a step in which eigenspace transformation means transforms the new data into the eigenspace from that similarity matrix and the transformation matrix.

5. The data classification method according to claim 4, further comprising a step in which spectral clustering means calculates eigenvectors and cluster centers corresponding to the training data by performing spectral clustering of the training data on the basis of the training data, its similarity, the number of clusters, and the number of clustering iterations.

6. The data classification method according to claim 5, further comprising a step of allocating the eigenspace data calculated by the eigenspace transformation means to the cluster center closest to that data.

7. A data classification program for causing a computer to function as each of the means constituting the data classification device according to any one of claims 1 to 3.