JP2008250848A

JP2008250848A - Clustering method, data processor, and program

Info

Publication number: JP2008250848A
Application number: JP2007093969A
Authority: JP
Inventors: Hiroki Imamura; 弘樹今村; Makoto Fujimura; 誠藤村; Hideo Kuroda; 英夫黒田
Original assignee: Nagasaki University NUC
Current assignee: Nagasaki University NUC
Priority date: 2007-03-30
Filing date: 2007-03-30
Publication date: 2008-10-16

Abstract

PROBLEM TO BE SOLVED: To cluster data accurately on the basis of a self-organizing map.

SOLUTION: The method performs self-organizing map processing, which fits code vectors to the input data using a self-organizing map; determination processing, which determines whether the code vectors so obtained have converged to a predetermined state; and separation/combination processing, which, when convergence has not been reached, compares the distance between code vectors with a set threshold, separates code vectors whose distance is longer than the threshold, and combines those whose distance is shorter. Self-organizing map processing is applied again to the code vectors clustered by the separation/combination processing, and this cycle is repeated. The clustering state that the determination processing judges to have converged to the predetermined state is output.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a clustering method that classifies various kinds of data using a self-organizing map (SOM), to a data processing apparatus that performs the clustering, and to a program that implements the processing on a data processing apparatus such as a computer.

Conventionally, the k-means method, the fuzzy c-means method, and the like are known as techniques for clustering various kinds of data. These techniques presuppose that the data to be clustered are normally distributed. In contrast, arbitrary-shape clustering methods have been proposed as techniques that cluster accurately even when the data distribution has an arbitrary shape. However, the arbitrary-shape clustering methods proposed so far have not been very accurate: when different clusters lie close together, data that belong to different clusters may still be assigned to the same cluster.
Non-Patent Document 1: "Arbitrary Shape Clustering Robust to Noise Based on Labeling by Ascending Order of Intercluster Distance", Journal of the Institute of Image Information, Vol. 60, No. 4, pp. 618-620, 2006

The inventors of the present application proposed, in Non-Patent Document 1, a clustering method that addresses the above problem.
In the proposed method, code vectors are generated at random, fitted to the data using a self-organizing map (SOM), and then clustered on the basis of the continuity of the code vectors. First, the chain of fitted code vectors is split, and the pieces are labeled, wherever the angle between successive connected code vectors changes sharply. Next, for each data point, the code vector at minimum distance is found and its label is assigned to the data point, which clusters the data.
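
The angle-based labeling of the earlier method can be illustrated with a minimal Python sketch. It assumes a two-dimensional, ordered chain of code vectors. The function names `turning_angle` and `label_by_angle` and the 60-degree cutoff `max_turn_deg` are hypothetical, since the text only says that the chain is split where the angle "changes sharply".

```python
import math


def turning_angle(a, b, c):
    """Angle in degrees between segment a->b and segment b->c (0 = straight on)."""
    v1 = (b[0] - a[0], b[1] - a[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # degenerate segment: treat as no turn
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding error
    return math.degrees(math.acos(cos_t))


def label_by_angle(chain, max_turn_deg=60.0):
    """Label an ordered 2-D chain; a new label starts after each sharp turn.

    max_turn_deg is an assumed cutoff, not a value taken from the source.
    """
    labels = [0] * len(chain)
    current = 0
    for i in range(1, len(chain)):
        if i >= 2 and turning_angle(chain[i - 2], chain[i - 1], chain[i]) > max_turn_deg:
            current += 1
        labels[i] = current
    return labels


chain = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(label_by_angle(chain))
```

A chain that bends by 90 degrees at (2, 0) is split there, so the code vectors after the bend receive a new label.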

FIGS. 8 and 9 show examples of processing by this conventional clustering method using a self-organizing map. FIG. 8 assumes a case in which clustering succeeds. As shown at the left of FIG. 8, a number of data points are distributed in two dimensions. The distribution consists of a u-shaped group and an n-shaped group placed close together one above the other, so that the whole also resembles an x.

Code vectors are then fitted to the data using the self-organizing map, giving the code vectors shown in the center of FIG. 8. Next, the chain of fitted code vectors is split and labeled wherever the angle between connected code vectors changes sharply, and the labels are assigned to the data. If the assignment is appropriate, the upper u-shaped cluster (black circles) and the lower n-shaped cluster (triangles) are separated into different clusters, as shown at the right of FIG. 8.

In the example of FIG. 9, by contrast, code vectors are fitted with the self-organizing map to the same data (left of FIG. 9), giving the code vectors shown in the center of FIG. 9, and the result is a division into three clusters, marked by black circles, triangles, and squares, as shown at the right of FIG. 9. Here the cluster marked by triangles straddles the upper and lower groups, so data that belong to different clusters have been clustered together. In addition, data that belong to a single cluster have been split between the cluster marked by black circles and the cluster marked by squares.

The examples of FIGS. 8 and 9 do not specify what the data represent. If, for example, each data point marked by a circle at the left of FIGS. 8 and 9 is a face image, the clustering can be used for personal authentication based on face images. Suppose that the data in the upper u-shaped group are various face images of person A and the data in the lower n-shaped group are various face images of person B. When A and B are to be identified from image data, clustering as in FIG. 8 gives the desired result, whereas clustering as in FIG. 9 fails to distinguish A from B correctly and causes a personal authentication error. Multiple face images of one person arise from many factors, such as changes in facial expression and changes in shooting conditions such as lighting, so image recognition requires accurate clustering.
Note that this description illustrates only the principle of clustering. Actual face images are likely to be data of much higher dimensionality than the two-dimensional data shown in FIGS. 8 and 9.

Besides face images, various schemes that authenticate individuals from images of fingerprints, irises, blood vessels, and the like have been proposed and are coming into practical use. In every case, accurately clustering a large number of input images is very important for accurate authentication. The same holds outside images: when individuals are authenticated by voice, for example, authentication is possible by clustering features such as the frequency characteristics of the input speech, and the principle of clustering is the same. The same applies to authentication based on handwriting.
As described above, the conventional clustering method using a self-organizing map does not always cluster appropriately, so the reliability of its results is not high. With face images, for example, very precise clustering is required to judge correctly whether two images show the same person even when the facial expression changes, and the conventional clustering method is not sufficient for this.

The present invention has been made in view of these points, and its object is to enable more accurate clustering of data based on a self-organizing map.

In the present invention, when a plurality of input data are clustered on the basis of a self-organizing map algorithm, the following are performed: code vector setting processing, which sets code vectors in the coordinate space in which the data are arranged; self-organizing map processing, which fits the code vectors set by the code vector setting processing to the data using a self-organizing map; determination processing, which determines whether the code vectors obtained by the self-organizing map processing have converged to a predetermined state; and separation/combination processing, which, when the determination processing finds that convergence has not been reached, compares the distance between the code vectors obtained by the self-organizing map processing with a set threshold, separates them when the distance is longer than the threshold, and combines them when the distance is shorter than the threshold. Self-organizing map processing is applied again to the code vectors clustered by the separation and combination, this cycle is repeated, and the input data are clustered on the basis of the clustering state that the determination processing judges to have converged to the predetermined state.

According to the present invention, separating and combining code vectors on the basis of the distances between them makes it possible to cluster data appropriately. For example, when the method is applied to image data, a person or object shown in the image data can be identified accurately. When personal authentication is performed from face images, the images can be classified accurately according to whose face they show. Likewise, when characters shown in image data are recognized, the character recognition can be performed accurately.

An embodiment of the present invention is described below with reference to FIGS. 1 to 7. In this embodiment, the invention is applied to the classification of various kinds of data using a self-organizing map.
FIG. 1 is a flowchart showing an example of processing according to the embodiment, and FIG. 2 shows an example of the configuration of an apparatus that performs the processing.

The apparatus consists, for example, of a computer and its peripherals. Data are input by capturing image data with an image capturing unit 11, which comprises a camera, a video capture device, or the like connected to the computer, and an identification result based on the clustering result is shown on a display 21. Software (a program) that makes the computer function as an image identification apparatus is installed on the computer. This software executes the processing shown in the flowchart of FIG. 1, among other operations. The apparatus is operated using, for example, a keyboard 17 connected to the computer.

The image data captured by the image capturing unit 11 are sent to a data processing unit 12 and, under the control of a control unit 15, converted into data for clustering.

The data for clustering are stored in a data memory 13 and clustered under the control of the control unit 15. When the resulting clusters are output as an identification result, they may be output directly as clustered data. Alternatively, various authentication processes may be performed, for example by determining whether each cluster is the same cluster as a reference image stored in a database 14. The recognition result is sent, for example, to a display control unit 18, which shows it on the display 21.

Next, the clustering process of this example, as executed by the data processing apparatus configured as described above, is described in detail.
First, an overview is given with reference to the flowchart of FIG. 1. When the data to be clustered are input, they are placed in an n-dimensional coordinate space (n is an integer of 2 or more), and code vectors are generated at random in that space. The number of code vectors generated is a preset number, or it may be set according to the number of input data. In this example, each code vector also carries a value called a label, which is updated as clusters are separated and combined, as described later.

Once the code vectors have been generated, self-organizing map processing (SOM processing) fits them to the input data (step S1). In this processing, the code vector that satisfies a certain condition is taken as the winning vector, and the code vectors are updated according to the relationship between the winning vector and its neighboring code vectors. Self-organizing map processing is an established technique that moves the code vectors toward the data.

It is then determined whether the self-organizing map processing has made the code vectors converge sufficiently (step S2). If so, processing of the code vectors ends, each data point is clustered on the basis of the code vector labels (step S6), and the clustering process ends. Convergence here means that the code vectors approximate the input data.
If the code vectors have not converged, a threshold Th is set on the basis of the current distances between code vectors (step S3). The setting of Th is described later. For example, an appropriate value is set using the maximum and the average of the current distances between adjacent code vectors.

Once the threshold Th has been set, the current distance between each pair of code vectors is compared with it. If the distance is greater than Th (or equal to it, in the formulas given later), the two code vectors are separated into different clusters (step S4) and given different label values. If the distance is smaller than Th, the two code vectors are combined into the same cluster (step S5) and given the same label value.

For the code vectors of each cluster after the separation in step S4 and the combination in step S5, processing returns to step S1, where self-organizing map processing updates the code vectors and brings them closer to the input data. The sequence from the self-organizing map processing of step S1 through the separation and combination of steps S4 and S5 is repeated until step S2 determines that the code vectors have converged sufficiently.

In step S6, each data point is clustered on the basis of the code vector labels: the code vector nearest to the data point in the coordinate space is found, and the data point is assigned to the cluster corresponding to that code vector's label value. Thus, if there are two label values the data are clustered into two clusters, and if there are three label values, into three.
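
The label assignment of step S6 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: the names `euclid` and `assign_labels` are hypothetical, and Euclidean distance is assumed as the measure of "nearest".

```python
import math


def euclid(a, b):
    """Euclidean distance between two points of equal dimension (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_labels(data, code_vectors, labels):
    """For each data point, return the label of its nearest code vector."""
    result = []
    for x in data:
        nearest = min(range(len(code_vectors)),
                      key=lambda i: euclid(x, code_vectors[i]))
        result.append(labels[nearest])
    return result


code_vectors = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
labels = [1, 1, 2, 2]  # two label values, hence two clusters
data = [(0.2, 0.1), (5.5, 4.8), (0.9, -0.3)]
print(assign_labels(data, code_vectors, labels))
```

The number of distinct values in `labels` determines the number of clusters, matching the statement that two label values give two clusters and three give three.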

Next, FIGS. 3 to 6 show examples of actual clustering. FIGS. 3 and 5 show clustering by the processing of this example, and FIGS. 4 and 6 show clustering of the same data as FIGS. 3 and 5, respectively, by the conventional processing.

First, the example of FIG. 3. FIG. 3(a) shows a distribution of artificially generated input data in a two-dimensional coordinate space. As FIG. 3(a) shows, the data fall into two groups. FIG. 3(b) shows code vectors generated at random in the coordinate space containing the data. In this initial state the code vectors form a single connected chain. Starting from the state of FIG. 3(b), self-organizing map processing updates the code vectors and moves them toward the input data points.

As the self-organizing map processing is repeated, the code vectors approach the data step by step, as shown in FIGS. 3(c) and 3(d). In this example, code vectors are also separated and combined according to the distances between them. In FIG. 3(d) the code vectors have been separated into two chains, but the separation is still incomplete.

At the end of the computation, when the code vectors have been updated sufficiently, they trace a shape that nearly connects the data points, as shown in FIG. 3(e), and they are divided into two chains corresponding to the arrangement of the data.
Clustering each data point on the basis of the code vectors labeled as in FIG. 3(e) divides the data correctly into two clusters, as shown in FIG. 3(f), where the two clusters are marked by ○ and ×.

In contrast, FIG. 4 shows the conventional processing applied to the same data as FIG. 3. FIG. 4(a) shows the distribution of artificially generated input data in a two-dimensional coordinate space and is identical to FIG. 3(a). FIG. 4(b) shows code vectors generated at random in that space. Starting from the state of FIG. 4(b), self-organizing map processing updates the code vectors and moves them toward the data points in turn, as shown in FIGS. 4(c), 4(d), and 4(e). In the conventional processing the code vectors remain a single connected chain throughout the self-organizing map processing, and they are still a single chain at the end of the iteration shown in FIG. 4(e).
Once the code vectors of FIG. 4(e) are obtained, the chain is divided according to the angles between successive code vectors. The resulting clustering, shown in FIG. 4(f), consists of four clusters (marked by ○, ×, △, and □) that do not match the actual data distribution at all.

FIG. 5 shows another example of the processing of this example. FIG. 5(a) shows a distribution of artificially generated input data in a two-dimensional coordinate space. As FIG. 5(a) shows, the data fall into three groups. FIG. 5(b) shows code vectors generated at random in the coordinate space containing the data. In this initial state the code vectors form a single connected chain. Starting from the state of FIG. 5(b), self-organizing map processing updates the code vectors and moves them toward the input data points.

As the self-organizing map processing is repeated, the code vectors approach the data step by step, as shown in FIGS. 5(c) and 5(d), while being separated and combined according to the distances between them. As a result, at the end of the computation, when the code vectors have been updated sufficiently, they trace a shape that nearly connects the data points, as shown in FIG. 5(e), and are divided into three chains corresponding to the arrangement of the data.
Clustering each data point on the basis of the code vectors labeled as in FIG. 5(e) divides the data correctly into three clusters, as shown in FIG. 5(f), where the three clusters are marked by ○, ×, and △.

In contrast, FIG. 6 shows the conventional processing applied to the same data as FIG. 5. FIG. 6(a) shows the distribution of artificially generated input data in a two-dimensional coordinate space and is identical to FIG. 5(a). FIG. 6(b) shows code vectors generated at random in that space. Starting from the state of FIG. 6(b), self-organizing map processing updates the code vectors and moves them toward the data points in turn, as shown in FIGS. 6(c), 6(d), and 6(e). In the conventional processing the code vectors remain a single connected chain throughout the self-organizing map processing, and they are still a single chain at the end of the iteration shown in FIG. 6(e).
Once the code vectors of FIG. 6(e) are obtained, the chain is to be divided according to the angles between successive code vectors, but in this example no angle differs enough to produce a division. The resulting clustering, shown in FIG. 6(f), therefore places all the data in a single cluster, which does not match the actual data distribution at all.

Next, the algorithm of this example is described in detail using mathematical expressions. A first-order (one-dimensional) self-organizing map is used here.
[1.] First, k-dimensional code vectors $m_l^p$ are generated at random within the domain. Here k is the dimensionality of the data: for example, when each datum has [n × n] pixels, the dimensionality is [n × n]. The code vector $m_l^p$ denotes the p-th ($1 \le p \le num_l$) code vector in class l ($1 \le l \le C_l$).
[2.] The following update of the code vectors $m_l^p$ is repeated for t = 1, 2, ..., T and j = 1, 2, ..., n.
[3.] The code vector $m_l^p$ that satisfies equation (1) is found.

The code vector satisfying equation (1) is taken as the winning code vector for the data $x_j$.
[4.] The winning code vector and its neighboring code vectors are updated by equation (2).

Here α(t) and Nc(t) are obtained from equations (3) and (4).

Note that [q] denotes the largest integer not exceeding q.
The steps up to this point correspond to the self-organizing map processing of step S1 in the flowchart of FIG. 1.
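
Steps [1.] to [4.] can be sketched in Python for a single connected chain of code vectors. Equations (1) to (4) are not reproduced in this text, so the sketch substitutes textbook SOM forms: the winner is the code vector nearest to $x_j$, each neighbor within the radius moves toward $x_j$ by a fraction α(t), and both α(t) and Nc(t) decay linearly. The schedules, the parameter values `alpha0` and `n0`, and the names `euclid` and `som_fit` are assumptions, not the patent's own definitions.

```python
import math


def euclid(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def som_fit(data, chain, T=50, alpha0=0.5, n0=2):
    """Fit an ordered chain of code vectors to data with a one-dimensional SOM.

    alpha(t) and Nc(t) are ASSUMED linear-decay schedules standing in for
    equations (3) and (4), which are not available in the source text.
    """
    chain = [tuple(m) for m in chain]  # work on a copy
    for t in range(1, T + 1):
        frac = 1.0 - (t - 1) / T
        alpha = alpha0 * frac           # assumed learning rate alpha(t)
        radius = math.floor(n0 * frac)  # assumed neighborhood Nc(t), [q] = floor
        for x in data:
            # winner: the code vector closest to the data point x
            c = min(range(len(chain)), key=lambda i: euclid(x, chain[i]))
            lo = max(0, c - radius)
            hi = min(len(chain) - 1, c + radius)
            # move the winner and its chain neighbors toward x
            for i in range(lo, hi + 1):
                chain[i] = tuple(m + alpha * (xk - m)
                                 for m, xk in zip(chain[i], x))
    return chain


data = [(i / 10, 0.0) for i in range(11)]        # points on the x axis
start = [(0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]     # chain placed away from the data
print(som_fit(data, start))
```

In the method described here the neighborhood would presumably be limited to code vectors of the same class once chains have been separated; the sketch handles only one chain.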

[5.] The following is repeated for $l_1 = 1, 2, \ldots, C_l$ and $p = 1, 2, \ldots, num_{l_1}$, where $num_{l_1}$ is the number of code vectors in class $l_1$. Initially $d_{total} = 0$ and $d_{cnt} = 0$. Here $d_{total}$ is the sum of the differences between adjacent code vectors within a class, and $d_{cnt}$ is the number of gaps between code vectors within a class (that is, the number of code vectors minus 1).

[5.1] If $m^{l_1}_{p+1}$ exists, equation (5) is evaluated, and $d_{total} = d_{total} + d_{diff}$ and $d_{cnt} = d_{cnt} + 1$ are set.

[5.2] If $d_{cnt} \neq 0$, the threshold is set to $Th = (d_{total} / d_{cnt}) + \alpha\, d_{diff}^{(max)}$, where $d_{diff}^{(max)}$ is the maximum value of $d_{diff}$. The term $d_{total} / d_{cnt}$ divides the summed differences between adjacent code vectors by their count and is therefore their average. The threshold Th is thus the average difference between adjacent code vectors plus the maximum difference multiplied by the coefficient α.
[6.] Step [5.1] is repeated for $l_1 = 1, 2, \ldots, C_l$ and $p = 1, 2, \ldots, num_{l_1}$.
The steps up to this point set the threshold Th in step S3.
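
The threshold of step [5.2] can be sketched in Python as below. Equation (5) is not reproduced in this text, so $d_{diff}$ is assumed to be the Euclidean distance between adjacent code vectors, and one threshold is assumed to be computed over all classes together. The names `euclid`, `compute_threshold`, and `alpha_th` are hypothetical; `alpha_th` is the coefficient α of step [5.2], renamed to avoid confusion with the learning rate α(t).

```python
import math


def euclid(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compute_threshold(chains, alpha_th=0.5):
    """Th = (d_total / d_cnt) + alpha_th * max(d_diff); None when d_cnt is 0.

    chains is a list of classes, each an ordered list of code vectors.
    """
    d_total = 0.0  # sum of adjacent distances
    d_cnt = 0      # number of gaps between adjacent code vectors
    d_max = 0.0    # largest adjacent distance seen so far
    for chain in chains:
        for p in range(len(chain) - 1):
            d_diff = euclid(chain[p], chain[p + 1])  # assumed form of eq. (5)
            d_total += d_diff
            d_cnt += 1
            d_max = max(d_max, d_diff)
    if d_cnt == 0:
        return None
    return d_total / d_cnt + alpha_th * d_max


# adjacent distances 1, 1, 4: mean 2, max 4
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (6.0, 0.0)]
print(compute_threshold([chain], alpha_th=0.5))
```

With adjacent distances of 1, 1, and 4 and a coefficient of 0.5, the threshold is 2 + 0.5 × 4 = 4, so only the longest gap reaches it.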

[6.1] In the comparison with the threshold Th, if the condition of equation (6) is satisfied, $e^{l_1}_p = 2$ and $e^{l_1}_{p+1} = 1$ are set, and the labels of the code vectors from $p+1$ to $num_{l_1}$ in class $l_1$ are incremented by one.
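
The separation of step [6.1] can be sketched in Python as below. Equation (6) is not reproduced in this text, so the condition is assumed to be "adjacent distance greater than or equal to Th", following the overview of step S4. The names `euclid` and `split_chain` are hypothetical, and the end markers e are represented only implicitly: the last vector of one piece and the first vector of the next are the two ends created by a cut.

```python
import math


def euclid(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split_chain(chain, th):
    """Cut an ordered chain wherever adjacent code vectors are at least th apart.

    Returns a list of pieces; each piece corresponds to one label value.
    """
    if not chain:
        return []
    pieces = [[chain[0]]]
    for p in range(len(chain) - 1):
        if euclid(chain[p], chain[p + 1]) >= th:  # assumed form of eq. (6)
            pieces.append([])  # later vectors move to the next label
        pieces[-1].append(chain[p + 1])
    return pieces


chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (6.0, 0.0), (7.0, 0.0)]
print(split_chain(chain, th=3.0))
```

Starting a new piece for every vector after the cut mirrors the instruction to increment the labels of code vectors p+1 through the end of the class.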

[7.] The following is repeated for $l_1 = 1, 2, \ldots, C_l$; $l_2 = l_1 + 2, \ldots, C_l$; $p = 1, 2, \ldots, num_{l_1}$; $q = 1, 2, \ldots, num_{l_2}$, where $num_{l_1}$ and $num_{l_2}$ are the numbers of code vectors in classes $l_1$ and $l_2$, respectively.

[7.1] If e^l1_p = 1 or 2 and e^l2_p = 1 or 2, calculate equation (7).

If the distance d between the code vectors obtained from equation (7) and the threshold Th satisfy d < Th, the classes are combined according to the following cases.

・If e^l1_p = 1 and e^l2_p = 1, two assignments are made [equations not reproduced] that interchange the order of the code vectors having the label value l_2. A further assignment [equation not reproduced] then combines the code vectors included in classes l_1 and l_2, and the process moves to [7].

・If e^l1_p = 2 and e^l2_p = 2, two assignments are made [equations not reproduced]. Two further assignments [equations not reproduced] then combine the code vectors included in classes l_1 and l_2, and the process moves to [7].

・If e^l1_p = 1 and e^l2_p = 2, two assignments are made [equations not reproduced]. Two further assignments [equations not reproduced] then combine the code vectors included in classes l_1 and l_2, and the process moves to [7].

・If e^l1_p = 2 and e^l2_p = 1, two assignments are made [equations not reproduced]. Two further assignments [equations not reproduced] then combine the code vectors included in classes l_1 and l_2, and the process moves to [7].
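The four endpoint cases can be sketched as one function that joins two chains at whichever pair of ends is close enough. Because the assignment equations are not reproduced in this text, the concatenation orders below are assumptions chosen so that the two close endpoints become neighbours; equation (7) is assumed to be the Euclidean distance, and e = 1 and e = 2 are assumed to mark the first and last code vector of a class. The name `try_merge` is hypothetical, and taking the closest endpoint pair (rather than the first pair found below Th) is a simplification.

```python
import math

def try_merge(a, b, th):
    """Join chains a and b if their closest pair of endpoints is nearer than th.

    Returns the combined chain, or None when no endpoints are close enough.
    ASSUMPTION: index 0 is the e = 1 end and index -1 is the e = 2 end.
    """
    candidates = [
        (math.dist(a[0], b[0]), lambda: b[::-1] + a),    # e = 1, e = 1: reverse b first
        (math.dist(a[-1], b[-1]), lambda: a + b[::-1]),  # e = 2, e = 2
        (math.dist(a[0], b[-1]), lambda: b + a),         # e = 1, e = 2
        (math.dist(a[-1], b[0]), lambda: a + b),         # e = 2, e = 1
    ]
    d, build = min(candidates, key=lambda c: c[0])
    return build() if d < th else None

merged = try_merge([(0, 0), (1, 0)], [(3, 0), (2, 0)], th=1.5)
```

In this example the two e = 2 ends are one unit apart, so the second chain is reversed and appended, which keeps the merged chain ordered along the line.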

In the algorithm described so far, [4] updates the code vectors, [5] calculates the threshold Th, [6] separates code vectors according to the threshold, and [7] combines code vectors according to the threshold.

Processing in this way runs the self-organizing map processing while performing the separation and combination shown in FIG. 3 or FIG. 5, so the data can be clustered well.
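The overall control flow (fit, check convergence, otherwise separate and combine, then fit again) can be sketched as a small driver. The callables `fit`, `converged`, and `separate_and_combine` are hypothetical placeholders supplied by the caller, and the `max_rounds` cap is an added safeguard that the text does not mention.

```python
def cluster(chains, fit, converged, separate_and_combine, max_rounds=100):
    """Alternate self-organizing map fitting with separation/combination.

    fit(chains)                  -> chains fitted to the input data
    converged(chains)            -> True when the predetermined state is reached
    separate_and_combine(chains) -> chains after threshold-based split and merge
    All three are caller-supplied (hypothetical) callables.
    """
    for _ in range(max_rounds):
        chains = fit(chains)
        if converged(chains):
            return chains
        chains = separate_and_combine(chains)
    return chains  # safeguard: give up after max_rounds

# Toy stand-ins that only exercise the control flow: the "state" is a counter.
result = cluster(
    0,
    fit=lambda s: s + 1,
    converged=lambda s: s >= 3,
    separate_and_combine=lambda s: s,
)
```

Passing the three steps in as functions keeps the loop independent of how the self-organizing map update and the convergence test are actually defined.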

The examples of FIG. 3 and FIG. 5 place data on two-dimensional coordinates, but the same technique also applies when more complex data, such as image data, is placed on coordinate axes of higher dimension and then clustered.

For example, for an image in which the letter "A" is drawn as a dot pattern, as shown in FIG. 7, many sets of vectors are generated from the horizontal and vertical pixel arrangement of the image. Treating those vectors as input data and running the self-organizing map processing with the separation and combination described above enables accurate clustering of the image data.
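One simple way to turn a dot pattern into input vectors is to let each filled dot contribute its (column, row) position as a two-dimensional vector. This is only an illustrative assumption: the helper `dots_to_vectors`, the `#` marker, and the tiny glyph below are hypothetical, and the vectors behind FIG. 7 may be derived differently.

```python
def dots_to_vectors(rows, mark="#"):
    """Return one (x, y) vector per filled dot, scanning rows top to bottom."""
    return [
        (x, y)
        for y, row in enumerate(rows)
        for x, ch in enumerate(row)
        if ch == mark
    ]

# A very coarse letter "A" drawn as dots (illustrative data only).
GLYPH_A = [
    ".#.",
    "#.#",
    "###",
    "#.#",
]

vectors = dots_to_vectors(GLYPH_A)
```

The resulting list can then serve as the input data that the code vectors are fitted to.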

For image data, the method can be applied, for example, to identifying an individual from a photographed face image. In this case, images of each individual are held in a database, the face images in the database are compared with the captured image data, and clustering determines which face image in the database belongs to the same cluster as the captured image data. Even when the facial expression in the captured image differs from that in the database image, or the brightness or shadow conditions at the time of capture differ from those of the database image, accurate personal authentication from the face is possible. When the database holds, for each individual, multiple images with different expressions and shooting conditions, authentication accuracy improves further.

As another application of image recognition, the method can also be used to compare a captured (input) image of a handwritten character with character patterns prepared in a database. Applying it to handwritten character recognition can improve the recognition rate for handwritten characters.

Furthermore, the method may be applied to clustering for identifying data other than face images and handwritten characters.

For example, as images for authenticating an individual, images of specific parts of the body other than the face, such as a fingerprint, an iris, or blood vessels (veins), may be clustered by the method of the present invention for authentication.

The results of analyzing the characteristics of a voice, such as speech (frequency characteristics and the like), may also be clustered by the method of the present invention for personal authentication.

In the embodiment described above, the data processing apparatus that performs clustering is configured using a personal computer, which is a general-purpose information processing apparatus, as shown in FIG. 2. It may instead be configured as a dedicated information processing apparatus for clustering. Alternatively, the processing of the present invention may be implemented as a program (software) and installed on various information processing apparatuses, which are thereby configured as apparatuses that perform the processing of the present invention. The program may be distributed on a storage medium such as an optical disc or a semiconductor memory, or it may be downloaded via a transmission means such as the Internet.

FIG. 1 is a flowchart showing a processing example according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of an apparatus that executes the processing according to an embodiment of the present invention.
FIG. 3 is an explanatory diagram showing a processing example (Example 1) according to an embodiment of the present invention.
FIG. 4 is an explanatory diagram showing an example in which conventional processing is applied to the same data as in the example of FIG. 3.
FIG. 5 is an explanatory diagram showing a processing example (Example 2) according to an embodiment of the present invention.
FIG. 6 is an explanatory diagram showing an example in which conventional processing is applied to the same data as in the example of FIG. 5.
FIG. 7 is an explanatory diagram showing an example in which the processing of an embodiment of the present invention is applied to image data.
FIG. 8 is a principle diagram showing an example of clustering based on a conventional self-organizing map.
FIG. 9 is a principle diagram showing an example of clustering based on a conventional self-organizing map.

Explanation of symbols

11: image capture unit, 12: data processing unit, 13: data memory, 14: database, 15: control unit, 16: program memory, 17: keyboard, 18: display control unit, 19: display

Claims

A clustering method for clustering a plurality of input data based on a self-organizing map, the method comprising:
a code vector setting process of setting code vectors on the coordinate axes on which the plurality of data are arranged;
a self-organizing map process of fitting the code vectors set in the code vector setting process to the plurality of data using a self-organizing map algorithm;
a determination process of determining whether the code vectors obtained by the self-organizing map process have converged to a predetermined state; and
a separation/combination process of, when the determination process determines that convergence has not been reached, comparing the distance between the code vectors obtained by the self-organizing map process with a set threshold, separating the code vectors when the distance is longer than the threshold, and combining them when the distance is shorter than the threshold,
wherein the self-organizing map process is repeated on the code vectors clustered by the separation and combination in the separation/combination process so as to fit them to the plurality of data, and the plurality of data are clustered based on the clustering state that the determination process determines to have converged to the predetermined state.

The clustering method according to claim 1,
wherein the threshold is a value calculated using the maximum value and the average value of the distances between code vectors.

The clustering method according to claim 1,
wherein the plurality of input data are image data for personal identification, and an individual is identified from an image by clustering based on the self-organizing map.

The clustering method according to claim 1,
wherein the plurality of input data are character image data, and character recognition is performed on an image by clustering based on the self-organizing map.

A data processing apparatus that clusters a plurality of input data based on a self-organizing map, the apparatus comprising:
code vector setting means for setting code vectors on the coordinate axes on which the plurality of data are arranged;
self-organizing map processing means for fitting the code vectors set by the code vector setting means to the plurality of data using a self-organizing map algorithm;
determination means for determining whether the code vectors obtained by the processing in the self-organizing map processing means have converged to a predetermined state; and
separation/combination processing means for, when the determination means determines that convergence has not been reached, comparing the distance between the code vectors obtained by the self-organizing map processing with a set threshold, separating the code vectors when the distance is longer than the threshold, and combining them when the distance is shorter than the threshold,
wherein the apparatus repeats the process of obtaining code vectors by performing the self-organizing map processing in the self-organizing map processing means on the code vectors clustered by the separation and combination in the separation/combination processing means, and outputs the result of clustering the plurality of data based on the clustering state that the determination means determines to have converged to the predetermined state.

A program for implementing, on a data processing apparatus, processing that clusters a plurality of input data based on a self-organizing map, the program causing the apparatus to execute:
a code vector setting process of setting code vectors on the coordinate axes on which the plurality of data are arranged;
a self-organizing map process of fitting the code vectors set in the code vector setting process to the plurality of data using a self-organizing map algorithm;
a determination process of determining whether the code vectors obtained by the self-organizing map process have converged to a predetermined state; and
a separation/combination process of, when the determination process determines that convergence has not been reached, comparing the distance between the code vectors obtained by the self-organizing map process with a set threshold, separating the code vectors when the distance is longer than the threshold, and combining them when the distance is shorter than the threshold,
wherein the process of obtaining code vectors by performing the self-organizing map process on the code vectors clustered by the separation and combination in the separation/combination process is repeated, and the plurality of data are clustered based on the clustering state that the determination process determines to have converged to the predetermined state.