JP5410741B2

JP5410741B2 - Data processing system and data processing program

Info

Publication number: JP5410741B2
Application number: JP2008308840A
Authority: JP
Inventors: 明中村; 悟速水; けい子山本; 保臣紀ノ定
Original assignee: Panasonic Healthcare Co Ltd
Current assignee: PHC Corp
Priority date: 2008-12-03
Filing date: 2008-12-03
Publication date: 2014-02-05
Anticipated expiration: 2028-12-03
Also published as: JP2010134632A

Description

The present invention relates to the clustering of data sets and, more particularly, to a data processing system and a data processing program for clustering large-scale data sets.

A technique for classifying a given data set without external criteria such as the types or number of clusters is called clustering or cluster analysis. Well-known representative algorithms include the k-means method (also called the c-means method) and agglomerative hierarchical clustering (Non-Patent Document 1).

Meanwhile, the Gaussian mixture model of Non-Patent Document 1 is a method for estimating a probability density function from a data set. Because the cluster membership probability of each data item can be calculated from the probability density function, it can also be regarded as a clustering method. Other known clustering methods based on probability density estimation include the mean shift method (Non-Patent Document 2) and its improvement, the extended mean shift method (Non-Patent Document 3).

In these conventional clustering methods, the entire data set must first be read into memory before execution. When each data item is represented as a vector with a predetermined number of dimensions (= D), the memory needed to hold a data set of N items is proportional to D × N, so once the target data set becomes large it is no longer possible to hold all the data in memory. For example, holding a data set of one million 500-dimensional items in 32-bit double precision requires 4 GB of memory, which an ordinary personal computer cannot handle. Even when the target is not that large, the computation time of many of the existing clustering methods described above grows in proportion to N², so for large-scale data the processing cannot be completed in a practical amount of time.
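The memory estimate above can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name `dataset_bytes` is hypothetical, and the width of 8 bytes per value is an assumption (it is the width that reproduces the 4 GB figure for 500 dimensions and one million items).

```python
def dataset_bytes(n_items: int, n_dims: int, bytes_per_value: int) -> int:
    """Bytes needed to hold n_items vectors of n_dims values each (proportional to D x N)."""
    return n_items * n_dims * bytes_per_value


# 1,000,000 items x 500 dimensions, assuming 8 bytes per stored value.
total = dataset_bytes(1_000_000, 500, 8)
print(total / 1e9, "GB")
```

Halving the assumed width to 4 bytes per value halves the total, which is why the per-value width matters when judging whether a data set fits in memory.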

A technique called data squashing has been developed to handle such large-scale data. In data squashing, a data set too large to fit in memory is read sequentially, one item at a time, and converted in a single pass into a data set that is far smaller than the original. Various existing algorithms are then applied to the converted data. In other words, data squashing is a kind of preprocessing technique for large-scale data.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies: Non-Patent Document 4) is known as one of the representative data squashing methods. BIRCH converts the original data set X = {x1, x2, ..., xN} into a set V = {v1, v2, ..., vM} of partial clusters (that is, parts of clusters). Since N >> M, existing clustering methods can be applied to V. Patent Document 1 describes a scheme that combines data squashing with an existing clustering method. This scheme consists of coarse classification by BIRCH (called online micro-clustering in Patent Document 1) and detailed classification by the k-means method (called offline macro-clustering in Patent Document 1). Non-Patent Document 5 indicates the state of the art at the time of filing of the present invention.
Patent Document 1: JP 2005-100363 A
Non-Patent Document 1: Sadaaki Miyamoto, "Introduction to Cluster Analysis", Morikita Publishing, 1999
Non-Patent Document 2: K. Fukunaga, "Statistical Pattern Recognition (second edition)", Academic Press, 1990
Non-Patent Document 3: Toru Wakahara et al., "Hierarchical clustering by the extended mean shift method", IEICE Technical Report PRMU98-38, 1998
Non-Patent Document 4: T. Zhang et al., "BIRCH: A New Data Clustering Algorithm and its Applications", Journal of Data Mining and Knowledge Discovery, vol. 1, pp. 103-114, 1996
Non-Patent Document 5: M. P. Wand et al., "Kernel Smoothing", Chapman & Hall, 1995

As described above, existing clustering methods are difficult to apply to large-scale data on their own. Using data squashing as preprocessing (coarse classification) makes it possible to apply existing clustering methods, but with conventional schemes such as the one described in Patent Document 1, highly accurate classification results cannot be obtained when the shapes and sizes of the clusters are non-uniform.

This is because the detailed classification does not use the information on the local density and distribution shape of the individual partial clusters obtained by the coarse classification. For example, each partial cluster generated by BIRCH is called a CF vector (Cluster Feature vector). It consists of the number of data items belonging to the partial cluster, the sum vector of those data items, and their sum of squares, and therefore contains information on the local distribution of the partial cluster. However, when the detailed classification is performed by the k-means method, each item handled by k-means must be a D-dimensional vector representing a single point in D-dimensional space. The detailed classification has therefore been performed on the centroid of each partial cluster (that is, the vector obtained by dividing the sum vector in the CF vector by the number of data items belonging to the partial cluster).

This amounts to performing the detailed classification using only the position of each partial cluster; information on the distribution shape of each partial cluster and on the number of data items belonging to it is not taken into account at all.
The same holds when another existing clustering method is used for the detailed classification. Each clustering method basically operates on D-dimensional vectors (that is, points in D-dimensional space), so if an existing method is simply used as it is for the detailed classification, the local density and distribution shape of the partial clusters are likewise ignored.

An object of the present invention is to solve the above problem by providing a data processing system and a data processing program that, when a large-scale data set is clustered in two stages consisting of coarse classification and detailed classification, appropriately handle in the detailed classification the attributes relating to the local density of each partial cluster obtained by the coarse classification, and thereby obtain highly accurate clustering results.

To solve the above problem, the present invention is a data processing system for clustering an input data set given as a set of vector data having a predetermined number of dimensions, comprising: coarse classification means for converting the input data set into a set of partial clusters; and detailed classification means for clustering the set of partial clusters produced by the coarse classification means, wherein the detailed classification means performs the detailed classification taking into account attributes relating to the local density of the partial clusters produced by the coarse classification means.

With the data processing system configured in this way, highly accurate clustering results can be obtained even when clustering a large-scale data set that cannot be processed by a single method because of constraints such as computation time and memory capacity. In particular, when the clustering target is large-scale data whose clusters are non-uniform in shape or size, the classification accuracy can be improved over the prior art.

Further, in the data processing system, the detailed classification means comprises: local maximum point search means that takes the centroid coordinates of each partial cluster as the representative point of that partial cluster and searches for a local maximum point of the probability density function starting from the representative point; convergence determination means that determines whether the search by the local maximum point search means has converged, based on distance information between the previous and current values of the local mean vector for each partial cluster; and partial cluster classification means that classifies partial clusters that have converged to the same or neighboring local maximum points as belonging to the same cluster. The local maximum point search means performs the search using the number of samples belonging to each partial cluster as the weight of that partial cluster.

With the detailed classification means configured in this way, even clusters with irregular distribution shapes can be handled, and the classification accuracy can be improved.
Further, in the local maximum point search means, the parameter that defines the local neighborhood range may be made variable for each partial cluster, and may be controlled so that it is made smaller for partial clusters with high local density and larger for partial clusters with low local density.

Further, the local maximum point search means may calculate the local density separately for each dimension of the input data and control the parameter separately for each dimension according to that density.
The gist of the data processing program of the present invention is a program for a data processing system that clusters an input data set given as a set of vector data having a predetermined number of dimensions, the program causing a computer to function as coarse classification means for converting the input data set into a set of partial clusters and as detailed classification means for clustering the set of partial clusters produced by the coarse classification means, and, when the computer functions as the detailed classification means, causing it to perform the detailed classification taking into account attributes relating to the local density of the partial clusters produced by the coarse classification means. The program of the present invention provides a data processing program for a data processing system that can obtain highly accurate clustering results when clustering a large-scale data set that cannot be processed by a single method because of constraints such as computation time and memory capacity. In particular, when the clustering target is large-scale data whose clusters are non-uniform in shape or size, it provides a data processing program that can improve the classification accuracy over the prior art.

According to the present invention, it is possible to provide a data processing system that can obtain highly accurate clustering results when clustering a large-scale data set that cannot be processed by a single method because of constraints such as computation time and memory capacity. In particular, when the clustering target is large-scale data whose clusters are non-uniform in shape or size, the classification accuracy can be improved over the prior art.

(First embodiment)
A first embodiment of the present invention is described below with reference to FIGS. 1 to 7. FIG. 1 is a block diagram showing the configuration of the data processing system 10; the internal configuration of the data processing system 10 is shown as functional blocks.

As shown in FIG. 1, the data processing system 10 includes an input device 11 such as a keyboard and a data processing device 12 that operates according to various programs. A storage device 13, serving as storage means for storing various data, and an output device 14 are connected to the data processing device 12. The output device 14 includes, for example, a display and a printer. The storage device 13 may be an internal device of the computer or may be configured as an external storage device.

The data processing device 12 consists of a computer 16 having a ROM 12a and a RAM 12b, and performs the various processes described later according to a data processing program stored in the ROM 12a.
The data processing device 12 constitutes a coarse classification processing unit 20, which has as its functions a nearest partial cluster search unit 22, an addability determination unit 24, and a partial cluster set update unit 26, and a detailed classification processing unit 30, which has as its functions a local maximum point search unit 32, a convergence determination unit 34, and a partial cluster classification unit 36. The data processing device 12 corresponds to the coarse classification means and the detailed classification means.

(Operation)
The operation of the data processing system 10 configured as described above is now explained.
FIG. 2 is a flowchart of the coarse classification processing performed by the data processing device 12.

It is assumed that the data set to be processed has been input to the data processing device 12 by input means such as the input device 11 and stored in the storage device 13 in advance, and that the processing described below is performed by the data processing program. This data set corresponds to the input data set given as a set of vector data having a predetermined number of dimensions.

FIG. 2 is a flowchart outlining the processing by the data processing device 12. In S10, the coarse classification processing unit 20 converts the data set to be processed, X = {x₁, x₂, …, x_N}, which consists of N data items, into a set of partial clusters V = {v₁, v₂, …, v_M}, thereby classifying it into M partial clusters. Then, in S20, the detailed classification processing unit 30 classifies the set V of partial clusters into C clusters. In general, N >> M and M >> C. The true number of clusters C is usually unknown. The number of partial clusters M must be set appropriately by controlling the operation of the coarse classification processing unit, as described later.

FIG. 3 is a flowchart of an example of the processing of the coarse classification processing unit 20.
The coarse classification processing unit 20 performs the coarse classification by repeating the procedure shown in FIG. 3 each time it reads one data item x_i (1 ≦ i ≦ N) from the data set X.

As shown in the figure, when the coarse classification processing unit 20 reads the data item x_i to be processed in S11, it searches in S12 for the partial cluster v_k closest to x_i in the current partial cluster set V stored in the storage device 13.

Then, in S13, the coarse classification processing unit 20 calculates the diameter T(v_k ∪ x_i) that would result from adding x_i to v_k, and in S14 compares this value with a predetermined threshold T₀. If the unit determines that T(v_k ∪ x_i) is smaller than T₀ ("YES"), it adds x_i to the partial cluster v_k in S15. If it determines that T(v_k ∪ x_i) is equal to or greater than T₀ ("NO"), it creates in S16 a new partial cluster consisting only of x_i. The coarse classification processing unit 20 performs this processing on all the data in X, and the partial cluster set finally obtained is the result of the coarse classification.

Here, S12 is the processing of the nearest partial cluster search unit 22, which corresponds to the nearest partial cluster search means. S14 is the processing of the addability determination unit 24, which corresponds to the addability determination means. S15 and S16 are the processing of the partial cluster set update unit 26, which corresponds to the partial cluster set update means. The storage device 13 corresponds to the storage means that stores the set of partial clusters.
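The one-pass procedure of S11 to S16 can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation: the names `PartialCluster`, `diameter`, `distance`, and `coarse_classify` are hypothetical, the diameter and distance are assumed to be the root-mean-square forms that can be computed from (n, LS, SS) alone as in BIRCH, and the handling of the very first data item (when no partial cluster exists yet) is an assumption.

```python
import math


class PartialCluster:
    """Summary (n, LS, SS) of a partial cluster; the raw data items are not kept."""

    def __init__(self, x):
        self.n = 1
        self.ls = list(x)                  # LS: sum vector of the members
        self.ss = sum(c * c for c in x)    # SS: sum of squared norms of the members

    def add(self, x):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(c * c for c in x)


def diameter(n, ls, ss):
    """Assumed RMS pairwise distance from (n, LS, SS); 0 for a single item."""
    if n < 2:
        return 0.0
    sq = (2.0 * n * ss - 2.0 * sum(c * c for c in ls)) / (n * (n - 1))
    return math.sqrt(max(sq, 0.0))


def distance(v, x):
    """Assumed RMS distance between the members of v and a single item x."""
    dot = sum(a * b for a, b in zip(v.ls, x))
    sq = v.ss / v.n - 2.0 * dot / v.n + sum(c * c for c in x)
    return math.sqrt(max(sq, 0.0))


def coarse_classify(data, t0):
    clusters = []
    for x in data:                                            # S11: read one item
        if clusters:
            vk = min(clusters, key=lambda v: distance(v, x))  # S12: nearest cluster
            # S13: diameter of v_k if x were added
            ls = [a + b for a, b in zip(vk.ls, x)]
            ss = vk.ss + sum(c * c for c in x)
            if diameter(vk.n + 1, ls, ss) < t0:               # S14: compare with T0
                vk.add(x)                                     # S15: add to v_k
                continue
        clusters.append(PartialCluster(x))                    # S16: new partial cluster
    return clusters


result = coarse_classify([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], t0=1.0)
print([c.n for c in result])
```

Only the summaries are held in memory, so the storage cost depends on the number of partial clusters M rather than on N.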

The partial cluster v is described by the three elements n, LS, and SS as follows.
v = (n, LS, SS)
n : number of data items constituting v
LS : sum vector (linear sum) of the n data items constituting v
SS : sum of squares (square sum) of the n data items constituting v
The diameter T(v) of the partial cluster v is the average Euclidean distance between all the data items constituting v, and is obtained from (n, LS, SS) by the following equation (1).

The derivation of equation (1) is now described.

The elements LS (linear sum) and SS (square sum) of the CF vector of a partial cluster v consisting of n individuals {x_i} (1 ≦ i ≦ n) are expressed as follows (d: number of dimensions).

The diameter T(v) of the partial cluster v is defined as the average Euclidean distance between all the individuals belonging to v.
Therefore,

and hence T(v) can be calculated from n, LS, and SS alone.
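The equations of this derivation are not reproduced in this text. As a hedged reconstruction, assuming the diameter is the root-mean-square pairwise distance used in BIRCH (an assumption, chosen because it is the form that depends only on n, LS, and SS), the steps would read:

```latex
\begin{align*}
LS &= \sum_{i=1}^{n} x_i, \qquad SS = \sum_{i=1}^{n} \lVert x_i \rVert^2 \\
T(v)^2 &= \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} \lVert x_i - x_j \rVert^2 \\
       &= \frac{1}{n(n-1)} \Bigl( 2n \sum_{i=1}^{n} \lVert x_i \rVert^2
          - 2 \Bigl\lVert \sum_{i=1}^{n} x_i \Bigr\rVert^2 \Bigr) \\
       &= \frac{2n\,SS - 2\lVert LS \rVert^2}{n(n-1)} \qquad (n \ge 2)
\end{align*}
```

The middle step expands each squared distance as the sum of the two squared norms minus twice the inner product, which is why the individual data items drop out.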

Returning to the main discussion, the distance d(v, x) between the partial cluster v and a data item x is the average Euclidean distance between x and all the data items constituting v, and is obtained by the following equation (4). The partial cluster closest to the data item that has been read is determined on the basis of this distance measure.

Equation (4) above is derived as follows. That is, the distance d between partial clusters is calculated from the CF vectors in the following way. The distance d(v_a, v_b) between partial clusters v_a and v_b is defined by the following equation (5), where n_a and n_b are the numbers of individuals in v_a and v_b.

From the definition of T(v),

holds, and since the CF vector of the union cluster (v_a ∪ v_b) can be calculated from the CF vectors of v_a and v_b, d(v_a, v_b) can ultimately be obtained from the CF vectors of v_a and v_b alone.

The distance d(v, x) between a single data item x and a partial cluster v can be calculated from the following equation (7), which is derived from equation (6) above by regarding x as a partial cluster consisting of only one data item.
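Equations (4) to (7) are not reproduced in this text. Under the same root-mean-square assumption as for the diameter (an assumption, not confirmed by the source), one consistent way to write the inter-cluster distance, its relation to T, and the single-item case is:

```latex
\begin{align*}
d(v_a, v_b)^2 &= \frac{1}{n_a n_b} \sum_{x_i \in v_a} \sum_{x_j \in v_b} \lVert x_i - x_j \rVert^2
  = \frac{SS_a}{n_a} + \frac{SS_b}{n_b} - \frac{2\, LS_a \cdot LS_b}{n_a n_b} \\
(n_a + n_b)(n_a + n_b - 1)\, T(v_a \cup v_b)^2
  &= n_a (n_a - 1)\, T(v_a)^2 + n_b (n_b - 1)\, T(v_b)^2 + 2 n_a n_b\, d(v_a, v_b)^2 \\
d(v, x)^2 &= \frac{SS}{n} - \frac{2\, LS \cdot x}{n} + \lVert x \rVert^2
\end{align*}
```

The last line is the first line with v_b replaced by the single item x, that is, n_b = 1, LS_b = x, and SS_b = ‖x‖².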

An example of the results produced by the coarse classification processing unit 20 is described below. FIGS. 4(a) to 4(c) plot three data sets to be processed. In all three, the number of dimensions is D = 2, the number of data items is N = 300000, and the number of clusters is C = 3, and the cluster to which each data item belongs is indicated by shading. In the figures, N₁, N₂, and N₃ are the numbers of data items belonging to clusters c₁, c₂, and c₃, respectively.

FIGS. 5(a) to 5(c) are examples of coarse classification results for the data sets shown in FIGS. 4(a) to 4(c), respectively. Each circle represents one partial cluster; the size of the circle indicates the diameter T(v) of the partial cluster, and the darkness of its color indicates the number of data items belonging to it. The number of partial clusters M varies with the threshold T₀; here the value of T₀ was controlled so that M is approximately 300 in each of (a) to (c).

Next, the detailed classification processing unit 30 shown in FIG. 1 is described.
The detailed classification processing unit 30 in this embodiment is based on the mean shift method described in Non-Patent Documents 2 and 3. In the known mean shift method, the local mean vector defined for each data point is moved step by step so that the local maximum points of the density distribution are searched in parallel, and data that have converged to the same local maximum point are judged to belong to the same cluster. FIG. 10 is a conceptual diagram of the local maximum point search in the mean shift method. As shown in the figure, the individuals that converge to the same local maximum point form a cluster.

FIG. 10 shows a state in which each data point (for example, x_i, x_j, x_k) is taken as the initial value of a local mean vector and the search proceeds from that starting point.
When the known mean shift method is applied as it is to the set of partial clusters V = {v₁, v₂, …, v_M}, the centroid coordinates g(v_k) = LS_k / n_k of each partial cluster (where LS_k and n_k are the sum vector and the number of data items of partial cluster v_k) are taken as the representative point of that partial cluster and treated as the data to be processed, and the local mean vector m_k(l) for each partial cluster is moved step by step according to the procedure of the following equation (8).

Here, l (= 1, 2, …) is the iteration count of the move operation, ω(x, x'; h) = exp(−||x − x'||² / h²) is the kernel function that defines the local neighborhood, and h is a separately determined bandwidth parameter. As shown in Non-Patent Document 3, the direction of the move produced by equation (8) is equal to the steepest-gradient direction ∇p(x) of the probability density function p(x), and its magnitude is larger the farther the point is from a local maximum (that is, the smaller p(x) is). The probability density function p(x) here is the one given by kernel density estimation with ω(x, x'; h) as the kernel function, and is expressed by the following equation (9). The above processing of the detailed classification processing unit 30 is performed by the local maximum point search unit 32.
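A minimal sketch of the unweighted move operation follows. Equations (8) and (9) are not reproduced in this text, so the update is assumed to be the standard mean shift form (the kernel-weighted average of the centroids), and the density is left unnormalised; only the kernel ω is taken from the text. The names `kernel`, `mean_shift_step`, `density`, and the sample centroids are hypothetical.

```python
import math


def kernel(x, y, h):
    """omega(x, x'; h) = exp(-||x - x'||^2 / h^2), as given in the text."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (h * h))


def mean_shift_step(m, points, h):
    """Assumed form of the move: replace m by the kernel-weighted mean of the points."""
    w = [kernel(m, p, h) for p in points]
    total = sum(w)
    return tuple(sum(wi * p[d] for wi, p in zip(w, points)) / total
                 for d in range(len(m)))


def density(x, points, h):
    """Unnormalised kernel density estimate at x (normalising constant omitted)."""
    return sum(kernel(x, p, h) for p in points) / len(points)


# Centroids g(v_k) of four hypothetical partial clusters.
centroids = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.2), (4.0, 4.0)]

m = centroids[0]              # the centroid itself is the starting point
for _ in range(50):           # fixed iteration count, for illustration only
    m = mean_shift_step(m, centroids, h=1.0)
print(m)
```

Each step moves the local mean vector uphill in the estimated density, so the vector settles near the mode formed by the three nearby centroids and is essentially unaffected by the distant one.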

Next, each time the move operation has been completed for all partial clusters, the convergence determination unit 34 of the detailed classification processing unit 30 performs a convergence determination according to the following equation (10). Equation (10) means that the sum, over all partial clusters, of the squared Euclidean distances between the local mean vectors m_k(l) and m_k(l−1) is computed, and convergence is judged to have occurred when this sum is less than a threshold Th. In this embodiment, the squared Euclidean distance between the local mean vectors m_k(l) and m_k(l−1) corresponds to the distance information between the previous and current values of the local mean vector for each partial cluster in the iterated move operation. Instead of the squared Euclidean distance, another distance measure such as the Euclidean distance or the city-block distance may be used as the distance information.

When the convergence determination unit 34 of the detailed classification processing unit 30 determines by equation (10) that convergence has occurred (that is, when equation (10) holds), it ends this processing. The partial cluster classification unit 36 then treats the data that have converged to the same local maximum point (or to extremely close coordinate points, that is, neighboring local maximum points) as belonging to the same cluster, that is, as one cluster.
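The convergence test and the grouping of converged points can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the tolerance `tol` used to decide that two converged points are "extremely close" is not specified in the text and is introduced here only for the example.

```python
def has_converged(prev, curr, th):
    """True when the summed squared Euclidean moves of all local mean vectors is below th."""
    total = sum(sum((a - b) ** 2 for a, b in zip(p, c))
                for p, c in zip(prev, curr))
    return total < th


def group_by_mode(modes, tol):
    """Give the same label to converged points within tol of a group's first point."""
    reps, labels = [], []
    for m in modes:
        for idx, r in enumerate(reps):
            if sum((a - b) ** 2 for a, b in zip(m, r)) <= tol * tol:
                labels.append(idx)
                break
        else:                      # no existing group is close enough
            reps.append(m)
            labels.append(len(reps) - 1)
    return labels


labels = group_by_mode([(0.0, 0.0), (0.001, 0.0), (5.0, 5.0), (5.0, 5.001)], tol=0.01)
print(labels)
```

The number of distinct labels is the number of clusters found, so C does not have to be supplied in advance.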

As shown later, applying the known mean shift method as it is may fail to produce the desired classification results.
In this embodiment, therefore, highly accurate detailed classification results are obtained by appropriately handling the information on the local distribution of each partial cluster obtained by the coarse classification. Specifically, the local mean vector m_k(l) for each partial cluster is moved step by step according to the procedure of the following equation (11).

The first expression of equation (11) indicates that the centroid coordinates g(v_k) of each partial cluster are taken as that cluster's representative point and used as the initial value, that is, the starting point, of the local mean vector.

Here, n_j is the number of samples belonging to partial cluster v_j, and h_j is a bandwidth parameter set for each partial cluster according to its local density. It is defined by equation (12) so that h_j² for partial cluster v_j is proportional to T(v_j)², and so that the average of h_j² weighted by the number of samples n_j equals h².

Equation (12) is derived as follows.

Let T(v_j) denote the radius of partial cluster v_j and h_j its bandwidth parameter.
h_j is chosen so that h_j² is proportional to T(v_j)² and the average of h_j² weighted by n_j equals h². That is, a constant α satisfying these two conditions is sought, where n_j is the number of samples belonging to v_j and N is the total number of samples (the sum of n_j over all partial clusters). Equations (13) and (14) determine α, and equation (17), which is identical to equation (12), follows.
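The derivation can be written out as below. This is a reconstruction from the two conditions stated in the text (proportionality to T(v_j)² and a sample-weighted average equal to h²); the equation images are not reproduced here, so the exact form and numbering of the intermediate steps in the original are an assumption.

```latex
\begin{align}
h_j^2 &= \alpha\, T(v_j)^2, \qquad
\frac{1}{N}\sum_{j} n_j h_j^2 = h^2, \qquad
N = \sum_{j} n_j \\
\frac{\alpha}{N}\sum_{j} n_j\, T(v_j)^2 &= h^2
\;\;\Longrightarrow\;\;
\alpha = \frac{N h^2}{\sum_{i} n_i\, T(v_i)^2} \\
h_j^2 &= \frac{N h^2\, T(v_j)^2}{\sum_{i} n_i\, T(v_i)^2}
\end{align}
```

Substituting the last line back into the weighted average gives h², which confirms that both conditions are met.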

Returning to the main discussion, making the bandwidth parameter variable as in equation (12) has the following effect when attention is paid to a given local mean vector: in equation (11), the value of the parameter of the kernel function is made smaller for a partial cluster with a large radius (that is, low local density) and larger for a partial cluster with a small radius (that is, high local density). As a result, during the local maximum point search the local mean vector moves toward partial clusters of higher density. The processing that varies the bandwidth parameter for each partial cluster in this way is performed by the local maximum point search unit 32.

In equation (11), the value of the kernel function is also multiplied by the number of samples n_j belonging to each partial cluster before summation. This amounts to performing the local maximum point search with the number of samples n_j as the weight of each partial cluster.
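A single shift step with sample-count weights and a per-cluster bandwidth might look like the sketch below. Equation (11) itself is not reproduced in this text, so the unnormalized Gaussian kernel exp(−d²/h_j²) and the function name `shift_step` are assumptions chosen for illustration only.

```python
import math

def shift_step(m, centroids, counts, bandwidths_sq):
    """Move the local mean vector m once.
    centroids: g(v_j); counts: n_j; bandwidths_sq: h_j squared."""
    num = [0.0] * len(m)
    den = 0.0
    for g, n, h2 in zip(centroids, counts, bandwidths_sq):
        d2 = sum((a - b) ** 2 for a, b in zip(m, g))
        # Kernel value weighted by the number of samples in the cluster.
        w = n * math.exp(-d2 / h2)
        den += w
        for k, gk in enumerate(g):
            num[k] += w * gk
    # Weighted mean of the centroids becomes the new local mean vector.
    return [v / den for v in num]
```

Starting from m = g(v_k) and repeating this step until the convergence test passes gives the local maximum point search for partial cluster v_k.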

The reference value of the bandwidth parameter h must be chosen appropriately in view of the distribution of the data as a whole, and the number of clusters obtained by the detailed classification changes with this value. Non-Patent Document 3 describes a method that takes the average nearest-neighbor distance over all samples as the initial reference value and increases it gradually. Bandwidth selection in the mean shift method is essentially the same problem as in kernel density estimation, so h may also be determined by the plug-in rule described on pp. 71-72 of Non-Patent Document 5.

FIGS. 6(a) to 6(c) show the results of detailed classification performed by the detailed classification processing unit of this embodiment on the partial cluster sets of FIGS. 5(a) to 5(c) obtained by the coarse classification processing described above. The results shown are those obtained when h was varied until the number of clusters matched the true number of clusters, 3. Partial clusters classified into the same cluster are drawn with the same brightness.

FIGS. 7(a) to 7(c) show the results of detailed classification of the coarse classification results of FIGS. 5(a) to 5(c) by the known mean shift method. In FIG. 6 the desired classification is obtained in all of (a) to (c), whereas in FIG. 7 the correct classification is not obtained in (b) and (c), where the shapes and sample counts of the clusters are non-uniform. This shows that this embodiment, by taking the local density of each partial cluster into account in the detailed classification, gives more accurate classification results than the prior art.

The data processing system 10 configured as described above has the following features.
(1) The data processing device 12 of the data processing system of this embodiment, acting as the coarse classification means, converts a data set (the input data set) into a set of partial clusters and, acting as the detailed classification means, clusters that set of partial clusters while taking into account an attribute relating to the local density of each partial cluster.

As a result, a data processing system can be provided that gives highly accurate clustering results for large-scale data sets that a single method cannot process because of constraints on computation time, memory capacity, and the like. In particular, when the large-scale data to be clustered has clusters of non-uniform shape or size, classification accuracy can be improved over the prior art.

(2) As the coarse classification means, the data processing device 12 of this embodiment comprises: the storage device 13 (storage means), which stores the set of partial clusters into which the input data set has been converted; the nearest-neighbor partial cluster search unit 22 (nearest-neighbor partial cluster search means), which searches the set of partial clusters stored in the storage device 13 for the nearest-neighbor partial cluster, that is, the partial cluster closest to the data item being processed; and the addition determination unit 24 (addition determination means), which determines whether the data item should be added to that nearest-neighbor partial cluster. The data processing device 12 further comprises the partial cluster set update unit 26 (partial cluster set update means), which, according to the determination result, either adds the data item to the nearest-neighbor partial cluster or creates a new partial cluster for it. Each time one data item is read from the data set (input data set), the addition determination unit 24 performs the nearest-neighbor partial cluster search and the addition determination, and the partial cluster set update unit 26 carries out either the addition or the creation of a new partial cluster. With the coarse classification means configured in this way, the entire input data set need not be held in memory, so large-scale data that would not fit in memory can be handled.

(3) As the detailed classification means, the data processing device 12 of this embodiment comprises: the local maximum point search unit 32 (local maximum point search means), which takes the centroid coordinates of each partial cluster as that cluster's representative point and searches for a local maximum point of the probability density function starting from it; the convergence determination unit 34 (convergence determination means), which determines whether the search by the local maximum point search unit 32 has converged on the basis of distance information between the previous and current local mean vectors of each partial cluster; and the partial cluster classification unit 36 (partial cluster classification means), which classifies partial clusters that converged to the same or neighboring local maximum points as belonging to the same cluster. The local maximum point search unit 32 performs the search with the number of samples belonging to each partial cluster as the weight of that partial cluster. As a result, clusters with irregular distribution shapes can also be handled, and classification accuracy is improved.

(4) In the local maximum point search unit 32 (local maximum point search means) of the data processing device 12 of this embodiment, the parameter that defines the local neighborhood range (the bandwidth parameter) is variable for each partial cluster, and is controlled so as to be small for partial clusters with high local density and large for partial clusters with low local density. As a result, the effects of (1), (2), and (3) above can be readily achieved.

(Second Embodiment)
Next, a second embodiment is described with reference to FIG. 8. Components identical to those of the first embodiment are given the same reference numerals, and only the differences are described.

In the second embodiment, the processing of the coarse classification processing unit 20 of the first embodiment is replaced with a different scheme.
Specifically, instead of describing each partial cluster by the three elements (n, LS, SS), the scheme of the second embodiment describes it by two elements: its centroid coordinates and its number of samples. This is a simplified version that, when computing the distance between a data item x and a partial cluster, uses only the centroid coordinates and not a quantity corresponding to the diameter T of the partial cluster. The diameters T are computed together when the coarse classification ends.

FIG. 8 is a flowchart of the processing performed by the coarse classification processing unit 20. As shown there, the coarse classification processing unit 20 repeats the procedure of FIG. 8 each time one data item x_i (1 ≤ i ≤ N) is read from the data set X. When x_i is read in S11, S12 first searches the current partial cluster set V for the partial cluster v_k closest to x_i. S13A then computes the distance d(v_k, x_i) between v_k and x_i, and S14A compares d(v_k, x_i) with a predetermined threshold d₀.

If d(v_k, x_i) is less than d₀ in S14A, processing moves to S15A, where x_i is added to partial cluster v_k and the centroid g(v_k) of v_k is updated. If d(v_k, x_i) is greater than or equal to d₀ in S14A, a new partial cluster consisting only of x_i is created in S16. Performing the above processing for all data in X yields the final partial cluster set as the coarse classification result.
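The one-pass procedure of steps S11 to S16 can be sketched as follows. The function name `coarse_classify` and the dictionary layout of a partial cluster are assumptions for illustration. The centroid is updated incrementally, which gives the same result as averaging all members.

```python
import math

def coarse_classify(data, d0):
    """Stream over the data once, keeping only (n, g) per partial cluster."""
    clusters = []  # each entry: {"n": sample count, "g": centroid}
    for x in data:                                   # S11: read one item
        nearest, best = None, math.inf
        for v in clusters:                           # S12: nearest cluster
            d = math.dist(v["g"], x)                 # S13A: distance
            if d < best:
                nearest, best = v, d
        if nearest is not None and best < d0:        # S14A: compare with d0
            n = nearest["n"]                         # S15A: add and update g
            nearest["g"] = [(n * gk + xk) / (n + 1)
                            for gk, xk in zip(nearest["g"], x)]
            nearest["n"] = n + 1
        else:                                        # S16: new partial cluster
            clusters.append({"n": 1, "g": list(x)})
    return clusters
```

Because only the counts and centroids are retained, memory use grows with the number of partial clusters rather than with the number of data items.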

In this scheme, a partial cluster v is described by the two elements n and g(v) as follows.
v = (n, g(v))
n: number of data items constituting v
g(v): centroid coordinates of the n data items constituting v
The centroid g(v) of partial cluster v is the mean vector of all data constituting v and is given by equation (18).

The distance d(v, x) between partial cluster v and data item x is the Euclidean distance between g(v) and x and is given by equation (19). The partial cluster closest to a data item that has been read is determined using this distance measure.

Once all data have been assigned to partial clusters, the diameter T is computed for each partial cluster. The diameter T(v) of partial cluster v is the average Euclidean distance between all data belonging to v and is computed directly from the coordinates of the data by equation (20).
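Read literally, the diameter is the mean pairwise Euclidean distance among the members of a partial cluster. The sketch below follows that reading. Equation (20) is not reproduced in this text, so the exact normalization used in the original is an assumption, as is the convention of returning 0 for a single-member cluster.

```python
import math
from itertools import combinations

def diameter(points):
    """Mean Euclidean distance over all distinct pairs of members."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0  # a cluster with one member has no pairs
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
```

This step needs the member coordinates, which is why the text computes it once after all assignments are finished rather than during the streaming pass.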

In this second embodiment, S12 is the processing of the nearest-neighbor partial cluster search unit 22, which corresponds to the nearest-neighbor partial cluster search means. S14A is the processing of the addition determination unit 24, which corresponds to the addition determination means. S15A and S16 are the processing of the partial cluster set update unit 26, which corresponds to the partial cluster set update means. The storage device 13 corresponds to the storage means that stores the partial cluster set.
(Third Embodiment)
Next, a third embodiment is described with reference to FIG. 9. This embodiment considers only the number of samples of each partial cluster. That is, whereas the first embodiment performs the local maximum point search with the update rule of equation (11), which takes both the number of samples and the radius of each partial cluster into account, the search can be simplified to consider the number of samples only. In that case, the update rule for the shift operation of the local mean vector is given by equation (21).

FIG. 9 shows the results of detailed classification using this scheme. Although slightly inferior to the results of FIG. 6, they are a substantial improvement over the conventional method.
(Fourth Embodiment)
Next, a fourth embodiment, corresponding to claim 4, is described. This embodiment is characterized in that the local maximum point search unit 32 of the detailed classification processing unit 30 controls the bandwidth parameter h independently for each dimension of the input data.

In detail, this embodiment makes the SS (square sum) of the CF vector a vector whose elements are the per-dimension sums of squares, thereby extending the bandwidth parameter h to the anisotropic case (a bandwidth matrix H whose off-diagonal elements are zero).

Here, for a partial cluster v consisting of n individuals {x_i} (1 ≤ i ≤ n), the elements LS (linear sum) and SS (square sum) of its CF vector are the sum of the x_i and the sum of their squared norms, respectively (d: number of dimensions).

Of these, SS is changed to a d-dimensional vector SS whose k-th element is the sum of squares of the k-th coordinate over the n individuals.

Because SS (the scalar quantity) is obtained as the sum of the elements of the vector SS, this is an extension of the conventional CF vector. That is, the diameter T under the assumption that the partial cluster is a hypersphere, the distance between two partial clusters, and the sum of two partial clusters can all be computed as before.
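The extended CF vector can be illustrated with a short sketch. The function names `make_cf`, `merge_cf`, and `scalar_ss` and the tuple layout (n, LS, SS) are assumptions for the example. The point shown is that the scalar SS is recoverable from the vector form and that two CF vectors still combine by element-wise addition.

```python
def make_cf(points):
    """Build (n, LS, SS) where SS holds one sum of squares per dimension."""
    n = len(points)
    d = len(points[0])
    ls = [sum(p[k] for p in points) for k in range(d)]
    ss_vec = [sum(p[k] ** 2 for p in points) for k in range(d)]
    return n, ls, ss_vec

def merge_cf(a, b):
    """Sum of two partial clusters: every component is additive."""
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])

def scalar_ss(cf):
    """Conventional scalar SS, recovered as the sum of the vector elements."""
    return sum(cf[2])
```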

Now regard a partial cluster represented by a CF vector v whose SS has been extended to a vector as a hyperellipsoid, and consider the diameter T_k(v) of partial cluster v for each dimension k (1 ≤ k ≤ d).
The diameter T_k(v) can be computed in the same way as when SS is a scalar.

(Calculation of the bandwidth matrix H)
Here, let the bandwidth parameter corresponding to CF vector v_j be a bandwidth matrix H_j whose off-diagonal elements are zero, and write the (k, k) element of H_j as h_jk².

h_jk is then set according to the local density in the k-th dimension as follows (1 ≤ k ≤ d). That is, it is defined by equation (26) so that h_jk² is proportional to T_k(v_j)² and so that the average of h_jk², taking into account the number of samples belonging to v_j, equals the square h_k² of the fixed bandwidth in the k-th dimension.
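Applying the two stated conditions separately in each dimension suggests the following form. This is a reconstruction by analogy with the isotropic case of equation (12), assuming the average is weighted by the sample counts n_j with N as their total. Equation (26) itself is not reproduced in this text.

```latex
\begin{align}
h_{jk}^2 &= \alpha_k\, T_k(v_j)^2, \qquad
\frac{1}{N}\sum_{j} n_j h_{jk}^2 = h_k^2 \\
h_{jk}^2 &= \frac{N h_k^2\, T_k(v_j)^2}{\sum_{i} n_i\, T_k(v_i)^2},
\qquad 1 \le k \le d
\end{align}
```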

(Mean shift clustering using the bandwidth matrix H)
The local mean vector m_i for the centroid g(v_i) of each CF vector is then moved iteratively by the procedure of equation (27) to search for a local maximum point.

Here, ω_H(x, x'; H) is the anisotropic kernel function that defines the local neighborhood, expressed in terms of the bandwidth matrix H.

The embodiments of the present invention may be modified as follows.
○ In the first embodiment, the partial clusters obtained by the coarse classification processing unit 20 are described by the three elements (n, LS, SS), but this configuration is not essential. Any configuration from which the centroid coordinates of a partial cluster and a quantity corresponding to its variance can be obtained will do. In the extreme, the centroid and variance can be computed as long as the correspondence between the coordinates of all data and the partial clusters is given. For efficient coarse classification of a large-scale data set, however, the configuration described above is preferable, because it allows efficient calculation of the distance d(v, x) between partial cluster v and data item x, calculation of the diameter T(v) of a partial cluster, and updating of the parameters when data item x is added to partial cluster v.

○ In the first embodiment, the method by which the detailed classification processing unit 30 sets the bandwidth parameter h_j for each partial cluster is not limited to the method of the embodiment above (equation (12)). For example, any scheme may be used in which h_j is large when the radius of the partial cluster is large (that is, local density is low) and small when the radius is small (that is, local density is high). Furthermore, the kernel function that defines the local neighborhood in the detailed classification processing is not limited to a Gaussian function, and any kernel function may be used.

○ In the first embodiment, the coarse classification means and the detailed classification means are both implemented by the data processing device 12, which consists of one computer, but they may instead be implemented on two separate computers. Likewise, the nearest-neighbor partial cluster search unit 22, the addition determination unit 24, the partial cluster set update unit 26, and the local maximum point search unit 32, convergence determination unit 34, and partial cluster classification unit 36 of the detailed classification processing unit 30 may each be implemented by a separate computer.

Technical ideas that can be derived from this specification other than those in the claims are listed below.

(1) The program for a data processing system according to claim 4, wherein, when causing the computer to function as the local maximum point search means, the program causes the computer to make the parameter that defines the local neighborhood range variable for each partial cluster and to control the parameter so that it is small for partial clusters with high local density and large for partial clusters with low local density.

(2) The program for a data processing system according to (1) above, wherein, when causing the computer to function as the local maximum point search means, the program causes the computer to calculate a local density individually for each dimension of the input data and to control the parameter individually for each dimension according to that density.

FIG. 1 is a block diagram of the data processing system 10.
FIG. 2 is a flowchart outlining the processing by the data processing device 12.
FIG. 3 is a flowchart of an example of the processing of the coarse classification processing unit 20.
FIGS. 4(a) to 4(c) are plots of the three data sets to be processed.
FIGS. 5(a) to 5(c) illustrate examples of coarse classification results for the data sets of FIGS. 4(a) to 4(c).
FIGS. 6(a) to 6(c) illustrate the results of detailed classification performed by the detailed classification processing unit of the first embodiment on the partial cluster sets of FIGS. 5(a) to 5(c) obtained by the coarse classification processing.
FIGS. 7(a) to 7(c) illustrate the results of detailed classification of the coarse classification results of FIGS. 5(a) to 5(c) by the known mean shift method.
FIG. 8 is a flowchart of the processing performed by the coarse classification processing unit 20 of the second embodiment.
FIGS. 9(a) to 9(c) illustrate the results of detailed classification in the third embodiment.
FIG. 10 is a conceptual diagram of the local maximum point search in the mean shift method.

Explanation of symbols

10: data processing system
11: input device
12: data processing device (coarse classification means, detailed classification means)
12a: ROM
12b: RAM
13: storage device (storage means)
14: output device
20: coarse classification processing unit
22: nearest-neighbor partial cluster search unit
24: addition determination unit
26: partial cluster set update unit
30: detailed classification processing unit
32: local maximum point search unit
34: convergence determination unit
36: partial cluster classification unit

Claims

A data processing system for clustering an input data set given as a set of vector data of a predetermined number of dimensions, comprising:
coarse classification means for converting the input data set into a set of partial clusters; and
detailed classification means for clustering the set of partial clusters produced by the coarse classification means,
wherein the detailed classification means comprises:
local maximum point search means for taking the centroid coordinates of each partial cluster as the representative point of that partial cluster and searching for a local maximum point of a probability density function starting from the representative point; convergence determination means for determining whether the search by the local maximum point search means has converged, on the basis of distance information between the previous and current local mean vectors of each partial cluster; and partial cluster classification means for classifying partial clusters that converged to the same or neighboring local maximum points as belonging to the same cluster,
and wherein the local maximum point search means performs the search using the number of samples belonging to each partial cluster as the weight of that partial cluster.

The data processing system according to claim 1, wherein the local maximum point search means makes a parameter that defines a local neighborhood range variable for each partial cluster and controls the parameter so that it is small for partial clusters with high local density and large for partial clusters with low local density.

The data processing system according to claim 2, wherein the local maximum point search means calculates a local density individually for each dimension of the input data and controls the parameter individually for each dimension according to that density.

A data processing program for clustering an input data set given as a set of vector data of a predetermined number of dimensions, the program causing a computer to function as:
coarse classification means for converting the input data set into a set of partial clusters; and
detailed classification means for clustering the set of partial clusters produced by the coarse classification means,
and, when functioning as the detailed classification means, to function as:
local maximum point search means for taking the centroid coordinates of each partial cluster as the representative point of that partial cluster and searching for a local maximum point of a probability density function starting from the representative point;
convergence determination means for determining whether the local maximum point search has converged, on the basis of distance information between the previous and current local mean vectors of each partial cluster; and
partial cluster classification means for classifying partial clusters that converged to the same or neighboring local maximum points as belonging to the same cluster,
wherein, when functioning as the local maximum point search means, the computer performs the search using the number of samples belonging to each partial cluster as the weight of that partial cluster.