JP2009048575A

JP2009048575A - Clustering device, clustering method, program, and recording medium

Info

Publication number: JP2009048575A
Application number: JP2007216509A
Authority: JP
Inventors: Masatsugu Minamishima; 正嗣南嶋
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2007-08-22
Filing date: 2007-08-22
Publication date: 2009-03-05

Abstract

PROBLEM TO BE SOLVED: To provide a clustering device, a clustering method, a program, and a recording medium capable of reducing the imbalance in the number of vectors included in each cluster.

SOLUTION: A dimension reduction means 11 converts multidimensional vector data into two-dimensional vector data, and an imaging means 12 converts the two-dimensional vector data into binary image data. A region extraction means 13 extracts initial regions, each composed of adjacent pixels, from among those pixels of the image represented by the binary image data that correspond to a two-dimensional vector. A cluster range determining means 14 extracts two-dimensional normal distributions from the frequency distribution of the pixels in each initial region, and determines a cluster range based on the mean and standard deviation of each extracted two-dimensional normal distribution. A cluster determining means 15 classifies the pixels into clusters according to an evaluation value H and whether the pixels are included in a cluster range.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a clustering device, a clustering method, a program, and a recording medium that divide a set of a plurality of elements into clusters, which are subsets. More specifically, it relates to a clustering device, a clustering method, a program, and a recording medium that divide a set of multidimensional vectors into clusters based on the similarity between vectors.

Finding similar elements in a set of a plurality of elements and dividing the set into several subsets is called clustering. Clustering is applied to the classification of large amounts of information used on the Web or within a company, and divides that information into groups of similar information. Because clustering makes it possible to find necessary information quickly and to present the features of similar information as a summary, it can be used for information retrieval.

Clustering of a set of vectors is either hierarchical or non-hierarchical.

In hierarchical clustering, when a set contains N vectors, for example, N clusters each consisting of a single vector are formed first. Next, the distance between two clusters is calculated based on the distance between the vectors they contain, and the two clusters with the smallest inter-cluster distance are merged into one. This merging is repeated until all clusters have been merged into a single cluster, forming a hierarchy of clusters. The vectors can then be classified into k clusters by, for example, fixing the number of clusters at k and cutting the hierarchy at the level that consists of k clusters.
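The merging procedure described above can be sketched as follows. This is only an illustration: the text does not say how the distance between two clusters is derived from vector distances, so the sketch assumes single linkage (the smallest pairwise Euclidean distance), and the function name `agglomerate` is hypothetical. Stopping when k clusters remain is equivalent to cutting the full hierarchy at the level of k clusters.

```python
import math

def agglomerate(vectors, k):
    """Merge the closest pair of clusters until k clusters remain."""
    # Start with one singleton cluster per vector (stored as index lists).
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Assumption: single linkage, i.e. the smallest pairwise distance.
        return min(math.dist(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest inter-cluster distance.
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        p, q = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        # Merge cluster q into cluster p.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
print(agglomerate(points, 3))
```

Note that nothing in this procedure balances cluster sizes: the isolated point ends up in a cluster of its own, which is the kind of imbalance the document goes on to discuss.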

Non-hierarchical clustering classifies vectors without using a hierarchical structure. A typical example is the k-means method, which classifies vectors into k clusters according to the distances between vectors. The k-means procedure for a set of N vectors and k clusters is as follows.

In procedure 1, k arbitrary vectors among the N vectors are taken as the initial values of the vectors indicating the centers of the k clusters. In procedure 2, each of the N vectors is assigned to the cluster whose center vector is closest. In procedure 3, the mean of the vectors in each cluster is taken as the new center of that cluster. In procedure 4, procedures 2 and 3 are repeated until the cluster centers no longer change, at which point the process ends.
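Procedures 1 to 4 can be written as a short loop. The sketch below is a minimal illustration, not a reference implementation: the caller supplies the initial centers (procedure 1), Euclidean distance is assumed, and keeping the old center when a cluster becomes empty, as well as the `max_iter` safety limit, are assumptions not found in the text.

```python
import math

def kmeans(vectors, initial_centers, max_iter=100):
    """Return (labels, centers) following procedures 2 to 4."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        # Procedure 2: assign each vector to the cluster with the nearest center.
        labels = [min(range(len(centers)), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        # Procedure 3: the mean of each cluster's vectors becomes its new center.
        new_centers = []
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # assumption: keep an empty cluster's center
        # Procedure 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(kmeans(data, [data[0], data[2]]))
```

Because the result depends on `initial_centers`, different choices in procedure 1 can lead to different final clusters.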

Both the hierarchical clustering and the non-hierarchical clustering by the k-means method described above require the number of clusters k to be specified manually.

An automatic document classification method is one example of a conventional technique that does not require the number of clusters k to be specified manually. This method determines the number of clusters so that the processing time stays within an allowable time, represents each document as a vector according to the strength of the semantic elements expressing its content, and classifies the documents (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent Laid-Open No. 10-171823

However, in the conventional technique described above, even though the number of clusters can be determined automatically, the number of vectors included in each cluster may be unbalanced. In hierarchical clustering, for example, depending on the number of clusters, the result may contain one cluster holding 2% of all vectors and another holding 40%.

When this hierarchical clustering is used for information retrieval and the search results are displayed, a cluster containing many vectors yields many similar documents, so finding the desired document takes a long time. Conversely, when few documents are displayed, the desired document may not be among them.

To correct this imbalance, clusters with many vectors must be split manually at a lower level, or clusters with few vectors merged manually at a higher level, using a dendrogram, that is, a tree diagram of the hierarchical structure.

In non-hierarchical clustering by the k-means method, the clustering result varies with how the initial cluster centers are chosen in procedure 1. For example, if the initial centers are concentrated in a particular region and one initial center lies far from that region, the number of vectors in each cluster becomes unbalanced, as in hierarchical clustering. Correcting this imbalance requires manually revising the clustering result.

An object of the present invention is to provide a clustering device, a clustering method, a program, and a recording medium that can reduce the imbalance in the number of vectors included in each cluster.

The present invention is a clustering device comprising:
input means for inputting multidimensional vector data representing multidimensional vectors;
dimension conversion means for converting the multidimensional vector data input by the input means into two-dimensional vector data representing two-dimensional vectors by a predetermined dimension conversion method;
imaging means for converting the two-dimensional vector data converted by the dimension conversion means into image data representing an image composed of pixels, each pixel being represented by a value indicating the presence or absence of a corresponding two-dimensional vector and being assigned a frequency representing the number of corresponding two-dimensional vectors;
region extraction means for extracting regions composed of adjacent pixels from among those pixels of the image data converted by the imaging means that are represented by the value indicating that a corresponding two-dimensional vector is present;
cluster range determining means for extracting, from the distribution of the frequencies assigned to the pixels included in each region extracted by the region extraction means, the normal distributions constituting that distribution, and for determining, for each extracted normal distribution, a range centered on the mean of that normal distribution and determined based on its standard deviation as the range of a cluster; and
pixel classification means for classifying each pixel into one of the clusters whose ranges have been determined by the cluster range determining means, in accordance with a predetermined classification condition for equalizing the number of pixels classified into the clusters.
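As an illustration of the region extraction element above, the sketch below groups pixels of value 1 into regions of mutually adjacent pixels using a breadth-first search. The text only says "adjacent", so treating the eight surrounding pixels as adjacent is an assumption, and the function name `extract_regions` and the list-of-rows image layout are hypothetical.

```python
from collections import deque

def extract_regions(image):
    """Group pixels of value 1 into regions of mutually adjacent pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new region and grow it breadth-first.
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                # Assumption: 8-neighbour adjacency (including diagonals).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(region)
    return regions

image = [[1, 1, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 1, 1]]
print(extract_regions(image))
```

With 4-neighbour adjacency instead, diagonal contacts would no longer join pixels into one region, so the choice affects how many initial regions are found.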

In the present invention, the predetermined classification condition is characterized in that:
when the frequency assigned to a pixel constitutes only one normal distribution, the pixel is classified into the cluster corresponding to that normal distribution;
when part of the frequency assigned to a pixel constitutes a plurality of normal distributions, the pixel is classified into the cluster corresponding to whichever of those normal distributions has the largest evaluation value, the evaluation value being the standard deviation of the normal distribution divided by its mean; and
when the frequency assigned to a pixel constitutes no normal distribution, the pixel is classified into whichever of the clusters whose ranges are closest to the pixel has the largest evaluation value.
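The three cases of the classification condition can be sketched as a single decision function. This is a simplified reading under stated assumptions: following the abstract, "constitutes a normal distribution" is approximated as "lies inside that cluster's range"; the range is modeled as a circle; and the evaluation value (standard deviation divided by mean) is taken as a precomputed number `h`. The `Cluster` class and `classify_pixel` function are hypothetical names.

```python
import math
from dataclasses import dataclass

@dataclass
class Cluster:
    center: tuple  # mean of the extracted normal distribution (pixel coordinates)
    radius: float  # assumption: circular range derived from the standard deviation
    h: float       # evaluation value: standard deviation divided by mean

def classify_pixel(pixel, clusters):
    """Return the index of the cluster to which the pixel is assigned."""
    def gap(c):
        # Distance from the pixel to the edge of the range (<= 0 means inside).
        return math.dist(pixel, c.center) - c.radius

    inside = [i for i, c in enumerate(clusters) if gap(c) <= 0]
    if len(inside) == 1:
        # Case 1: the pixel belongs to exactly one range.
        return inside[0]
    if inside:
        # Case 2: overlapping ranges, so take the largest evaluation value.
        return max(inside, key=lambda i: clusters[i].h)
    # Case 3: outside every range, so take the nearest range,
    # breaking ties by the largest evaluation value.
    nearest = min(gap(c) for c in clusters)
    candidates = [i for i, c in enumerate(clusters)
                  if math.isclose(gap(c), nearest)]
    return max(candidates, key=lambda i: clusters[i].h)

clusters = [Cluster((0, 0), 2.0, 0.5), Cluster((3, 0), 2.0, 0.8),
            Cluster((20, 0), 1.0, 0.1)]
print([classify_pixel(p, clusters) for p in [(-1, 0), (1.5, 0), (18, 0)]])
```

Preferring the larger evaluation value in contested cases favors clusters whose distributions are relatively wide, which is presumably how the condition steers pixels toward a more even split.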

The present invention is also a clustering method comprising:
an input step of inputting multidimensional vector data representing multidimensional vectors;
a dimension conversion step of converting the multidimensional vector data input in the input step into two-dimensional vector data representing two-dimensional vectors in accordance with a predetermined dimension conversion condition;
an imaging step of converting the two-dimensional vector data converted in the dimension conversion step into image data representing an image composed of pixels, each pixel being represented by a value indicating the presence or absence of a corresponding two-dimensional vector and being assigned a frequency representing the number of corresponding two-dimensional vectors;
a region extraction step of extracting regions composed of adjacent pixels from among those pixels of the image data converted in the imaging step that are represented by the value indicating that a corresponding two-dimensional vector is present;
a cluster range determining step of extracting, from the distribution of the frequencies assigned to the pixels included in each region extracted in the region extraction step, the normal distributions constituting that distribution, and of determining, for each extracted normal distribution, a range centered on the mean of that normal distribution and determined based on its standard deviation as the range of a cluster; and
a pixel classification step of classifying each pixel into one of the clusters whose ranges have been determined in the cluster range determining step, in accordance with a predetermined classification condition for equalizing the number of pixels classified into the clusters.

The present invention is also a program for causing a computer to execute:
an input step of inputting multidimensional vector data representing multidimensional vectors;
a dimension conversion step of converting the multidimensional vector data input in the input step into two-dimensional vector data representing two-dimensional vectors in accordance with a predetermined dimension conversion condition;
an imaging step of converting the two-dimensional vector data converted in the dimension conversion step into image data representing an image composed of pixels, each pixel being represented by a value indicating the presence or absence of a corresponding two-dimensional vector and being assigned a frequency representing the number of corresponding two-dimensional vectors;
a region extraction step of extracting regions composed of adjacent pixels from among those pixels of the image data converted in the imaging step that are represented by the value indicating that a corresponding two-dimensional vector is present;
a cluster range determining step of extracting, from the distribution of the frequencies assigned to the pixels included in each region extracted in the region extraction step, the normal distributions constituting that distribution, and of determining, for each extracted normal distribution, a range centered on the mean of that normal distribution and determined based on its standard deviation as the range of a cluster; and
a pixel classification step of classifying each pixel into one of the clusters whose ranges have been determined in the cluster range determining step, in accordance with a predetermined classification condition for equalizing the number of pixels classified into the clusters.
The present invention is also a computer-readable recording medium on which the program is recorded.

According to the present invention, multidimensional vector data representing multidimensional vectors is input by the input means, and the dimension conversion means converts the input multidimensional vector data into two-dimensional vector data representing two-dimensional vectors in accordance with a predetermined dimension conversion condition.

The imaging means then converts the two-dimensional vector data converted by the dimension conversion means into image data representing an image composed of pixels, each represented by a value indicating the presence or absence of a corresponding two-dimensional vector and assigned a frequency representing the number of corresponding two-dimensional vectors. The region extraction means extracts regions composed of adjacent pixels from among those pixels of the converted image data that are represented by the value indicating that a corresponding two-dimensional vector is present.

Further, the cluster range determining means extracts, from the distribution of the frequencies assigned to the pixels included in each region extracted by the region extraction means, the normal distributions constituting that distribution, and determines, for each extracted normal distribution, a range centered on its mean and determined based on its standard deviation as the range of a cluster. The pixel classification means then classifies each pixel into one of the clusters whose ranges have been determined by the cluster range determining means, in accordance with a predetermined classification condition for equalizing the number of pixels classified into the clusters.

That is, because classifying pixels into clusters classifies the multidimensional vectors corresponding to those pixels into clusters, classifying the pixels in accordance with a predetermined classification condition for equalizing the number of pixels per cluster reduces the imbalance in the number of vectors included in each cluster. If document data is represented as multidimensional vector data and the invention is applied to retrieval, the document data can be searched more quickly.

Also according to the present invention, in the input step, multidimensional vector data representing multidimensional vectors is input. In the dimension conversion step, the multidimensional vector data input in the input step is converted into two-dimensional vector data representing two-dimensional vectors in accordance with a predetermined dimension conversion condition. In the imaging step, the two-dimensional vector data converted in the dimension conversion step is converted into image data representing an image composed of pixels, each represented by a value indicating the presence or absence of a corresponding two-dimensional vector and assigned a frequency representing the number of corresponding two-dimensional vectors.

In the region extraction step, regions composed of adjacent pixels are extracted from among those pixels of the image data converted in the imaging step that are represented by the value indicating that a corresponding two-dimensional vector is present. In the cluster range determining step, the normal distributions constituting the distribution of the frequencies assigned to the pixels included in each region extracted in the region extraction step are extracted from that distribution, and, for each extracted normal distribution, a range centered on its mean and determined based on its standard deviation is determined as the range of a cluster. In the pixel classification step, each pixel is classified into one of the clusters whose ranges have been determined in the cluster range determining step, in accordance with a predetermined classification condition for equalizing the number of pixels classified into the clusters.

That is, when the clustering method according to the present invention is applied, classifying pixels into clusters classifies the multidimensional vectors corresponding to those pixels into clusters, so classifying the pixels in accordance with a predetermined classification condition for equalizing the number of pixels per cluster reduces the imbalance in the number of vectors included in each cluster. If document data is represented as multidimensional vector data and the method is applied to retrieval, the document data can be searched more quickly.

Also according to the present invention, the following steps can be provided as a program for causing a computer to execute them:
an input step of inputting multidimensional vector data representing multidimensional vectors;
a dimension conversion step of converting the multidimensional vector data input in the input step into two-dimensional vector data representing two-dimensional vectors in accordance with a predetermined dimension conversion condition;
an imaging step of converting the two-dimensional vector data converted in the dimension conversion step into image data representing an image composed of pixels, each pixel being represented by a value indicating the presence or absence of a corresponding two-dimensional vector and being assigned a frequency representing the number of corresponding two-dimensional vectors;
a region extraction step of extracting regions composed of adjacent pixels from among those pixels of the image data converted in the imaging step that are represented by the value indicating that a corresponding two-dimensional vector is present;
a cluster range determining step of extracting, from the distribution of the frequencies assigned to the pixels included in each region extracted in the region extraction step, the normal distributions constituting that distribution, and of determining, for each extracted normal distribution, a range centered on the mean of that normal distribution and determined based on its standard deviation as the range of a cluster; and
a pixel classification step of classifying each pixel into one of the clusters whose ranges have been determined in the cluster range determining step, in accordance with a predetermined classification condition for equalizing the number of pixels classified into the clusters.

Also according to the present invention, a computer-readable recording medium on which the program is recorded can be provided.

FIG. 1 is a block diagram showing the functional configuration of a clustering automation apparatus 1 according to an embodiment of the present invention. The clustering method according to the present invention is carried out by the clustering automation apparatus 1.

The clustering automation apparatus 1, which is a clustering device, includes a multidimensional vector data input means (hereinafter referred to as input means) 10, a dimension reduction means 11, an imaging means 12, a region extraction means 13, a cluster range determining means 14, and a cluster determining means 15.

The clustering automation apparatus 1 is implemented by, for example, a computer. The computer constituting the clustering automation apparatus 1 includes: an input device such as a keyboard and a mouse; an output device including a display device such as a display or a printing device such as a printer; a communication device that transmits and receives information via a communication line such as a LAN (Local Area Network); a storage device, composed of a semiconductor memory or a hard disk device, that stores programs and data; and a central processing unit (hereinafter referred to as "CPU") that executes the programs stored in the storage device to control the input device, the output device, and the communication device. The programs are programs for controlling the clustering automation apparatus 1 and may include an OS (Operating System) and application programs. The computer may be any generally known computer, and a detailed description is omitted.

The input means 10 reads the multidimensional vector data stored in a multidimensional vector data storage device 2 and inputs it to the clustering automation apparatus 1. Multidimensional vector data is data representing multidimensional vectors. A multidimensional vector is, for example, a representation of document data as a vector according to its content.

The multidimensional vector data storage device 2 is a storage device connected to a communication line such as a LAN, and the information stored in it can be read by the communication device included in the computer. In the configuration shown in FIG. 1, the multidimensional vector data storage device 2 is a separate device independent of the clustering automation apparatus 1, but it may instead be included in the clustering automation apparatus 1.

The dimension reduction means 11, which is the dimension conversion means, converts the multidimensional vector data input by the input means 10 into two-dimensional vector data representing two-dimensional vectors by a predetermined dimension conversion method. The predetermined dimension conversion method is, for example, conversion using multidimensional scaling (Multi-Dimensional Scaling: hereinafter abbreviated as "MDS").

MDS is classified into metric multidimensional scaling (Metric MDS: hereinafter "metric MDS") and non-metric multidimensional scaling (Non-metric MDS: hereinafter "non-metric MDS"). Metric MDS spatially arranges objects based on distance data between the objects, whereas non-metric MDS spatially arranges objects based on distance data between the objects or on dissimilarity data corresponding to the distance data (see, for example, Takayuki Saito and Hiroshi Yadohisa (齋藤堯幸, 宿久洋), "Analysis Method of Related Data", first edition, Kyoritsu Shuppan Co., Ltd., September 10, 2006, pp. 37-123).

Both metric MDS and non-metric MDS can reduce the number of dimensions while preserving the distance relationships, that is, the similarity relationships, between vectors as far as possible, by making the number of dimensions of the space in which the objects are arranged smaller than the number of dimensions of the objects. The Euclidean distance, for example, may be used as the distance. Besides MDS, any other method that reduces the vectors to a lower dimension while preserving the distance relationships between the multidimensional vectors represented by the multidimensional vector data as far as possible, such as principal component analysis, may be used.
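As a rough illustration of reducing vectors to two dimensions, the sketch below uses principal component analysis, which the text names as an alternative to MDS; it is not an MDS implementation. The leading two axes of the covariance matrix are found by power iteration with deflation. The function name `pca_to_2d`, the fixed start vector, and the iteration count are assumptions made for this example.

```python
import math

def pca_to_2d(vectors, iterations=200):
    """Project vectors onto their two leading principal axes (illustrative)."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # d x d covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]

    def leading_axis(matrix):
        # Power iteration from a fixed start vector (assumed not to be
        # orthogonal to the leading eigenvector).
        axis = [1.0 / math.sqrt(d)] * d
        for _ in range(iterations):
            w = [sum(matrix[a][b] * axis[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            axis = [x / norm for x in w]
        return axis

    axes = []
    for _ in range(2):
        e = leading_axis(cov)
        eigenvalue = sum(e[a] * cov[a][b] * e[b]
                         for a in range(d) for b in range(d))
        axes.append(e)
        # Deflation: remove the found component so the next pass finds the second axis.
        cov = [[cov[a][b] - eigenvalue * e[a] * e[b] for b in range(d)]
               for a in range(d)]
    return [tuple(sum(r[j] * e[j] for j in range(d)) for e in axes)
            for r in centered]

pts = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 1.0, 0.0), (4.0, 1.0, 0.0)]
print(pca_to_2d(pts))
```

For points that already lie in a plane, as in this example, the pairwise Euclidean distances are preserved; for genuinely higher-dimensional data they are only approximated.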

FIG. 2 shows an example of an XY coordinate system 21 indicating the positions of the two-dimensional vectors converted by the dimension reduction means 11. The position of each two-dimensional vector represented by the two-dimensional vector data is expressed as coordinates (x, y) in the XY coordinate system. Each "×" mark indicates the position of the tip of a two-dimensional vector whose starting point is at coordinates (0, 0).

The imaging means 12 converts the two-dimensional vector data converted by the dimension reduction means 11 into binary image data. Binary image data is data representing an image composed of pixels that each take the value "0" or "1". The imaging means 12 converts the two-dimensional vector data into binary image data according to the following procedure.

手順１では、２次元ベクトルデータが示す２次元ベクトルのＸＹ座標における位置を座標（ｘ，ｙ）とするとき、ｘの最大値Ｘｍａｘ、ｙの最大値Ｙｍａｘ、ｘの最小値Ｘｍｉｎ、およびｙの最小値Ｙｍｉｎを求め、さらにｘの範囲Ｌｘ＝Ｘｍａｘ−Ｘｍｉｎ、およびｙの範囲Ｌｙ＝Ｙｍａｘ−Ｙｍｉｎを求める。 In the procedure 1, when the position in the XY coordinates of the two-dimensional vector indicated by the two-dimensional vector data is the coordinates (x, y), the maximum value Xmax of x, the maximum value Ymax of y, the minimum value Xmin of x, and the value of y The minimum value Ymin is obtained, and the x range Lx = Xmax−Xmin and the y range Ly = Ymax−Ymin are obtained.

手順２では、２値画像データが示す画像の解像度をｍ×ｎ、ｍおよびｎを自然数、２次元ベクトルデータが示す２次元ベクトルの数をＮとするとき、条件１および条件２を満たす最小のｍおよびｎを求める。条件１は、ｍ：ｎ＝（Ｌｘ＋１）：（Ｌｙ＋１）であり、条件２は、ｍ×ｎ≧Ｎである。条件１および条件２を満たすことによって、２次元ベクトルデータが示す２次元ベクトル間の距離関係つまり類似度関係を可能な限り保持したまま、２次元ベクトルデータを２値画像データに変換することができる。 In procedure 2, where the resolution of the image indicated by the binary image data is m × n, m and n are natural numbers, and the number of two-dimensional vectors indicated by the two-dimensional vector data is N, the smallest m and n that satisfy Condition 1 and Condition 2 are obtained. Condition 1 is m : n = (Lx + 1) : (Ly + 1), and Condition 2 is m × n ≧ N. Satisfying Condition 1 and Condition 2 allows the two-dimensional vector data to be converted into binary image data while preserving, as far as possible, the distance relationships between the two-dimensional vectors indicated by the two-dimensional vector data, that is, the similarity relationships.

手順３では、画素のＸＹ座標における位置を座標（Ｘ，Ｙ）とするとき、変換式Ｘ＝ｘ×ｍ／（Ｌｘ＋１）およびＹ＝ｙ×ｎ／（Ｌｙ＋１）によって、２次元ベクトルデータが示す２次元ベクトルの座標（ｘ，ｙ）を、画素の座標（Ｘ，Ｙ）に変換する。変換後、２次元ベクトルが変換された座標にある画素の値を「１」、２次元ベクトルが変換された座標にない画素の値を「０」とする。 In the procedure 3, when the position of the pixel in the XY coordinates is the coordinate (X, Y), the two-dimensional vector data is represented by the conversion formulas X = x × m / (Lx + 1) and Y = y × n / (Ly + 1). The coordinates (x, y) of the two-dimensional vector are converted into the coordinates (X, Y) of the pixels. After conversion, the value of the pixel at the coordinate where the two-dimensional vector is converted is “1”, and the value of the pixel not at the coordinate where the two-dimensional vector is converted is “0”.

手順１〜３によって、２次元ベクトルデータが示す２次元ベクトル間の距離関係つまり類似度関係を可能な限り保持したまま、２次元ベクトルデータを２値画像データに変換することができる。 By the procedures 1 to 3, the two-dimensional vector data can be converted into the binary image data while maintaining the distance relationship between the two-dimensional vectors indicated by the two-dimensional vector data, that is, the similarity relationship as much as possible.
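
Procedures 1 to 3 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation described in the patent: the function name `rasterize` is hypothetical, the grid size is obtained by scaling both sides by a common factor and rounding up (which satisfies m × n ≧ N but only approximates the ratio in Condition 1 and is not guaranteed to be the smallest pair), and the coordinates are shifted by Xmin and Ymin so that pixel indices start at 0, which the text does not state explicitly.

```python
import math
from collections import Counter

def rasterize(points):
    """Map 2-D vectors to an m x n pixel grid; return (m, n, counts)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    lx = max(xs) - x_min              # Lx = Xmax - Xmin (procedure 1)
    ly = max(ys) - y_min              # Ly = Ymax - Ymin
    n_vec = len(points)               # N

    # Procedure 2 (approximation): scale (Lx+1) and (Ly+1) by a common
    # factor and round up, which guarantees m * n >= N.
    scale = math.sqrt(n_vec / ((lx + 1) * (ly + 1)))
    m = max(1, math.ceil(scale * (lx + 1)))
    n = max(1, math.ceil(scale * (ly + 1)))

    # Procedure 3: X = x * m / (Lx + 1), Y = y * n / (Ly + 1).
    # Assumption: subtract the minimum first so indices lie in 0..m-1, 0..n-1.
    counts = Counter()
    for x, y in points:
        col = int((x - x_min) * m / (lx + 1))
        row = int((y - y_min) * n / (ly + 1))
        counts[(col, row)] += 1       # frequency of the pixel
    return m, n, counts

m, n, counts = rasterize([(0, 0), (0, 0), (3, 1), (3, 1.2)])
```

A pixel has the value "1" exactly when its entry in `counts` is positive; the count itself corresponds to the frequency described for FIG. 4.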

図３は、画像化手段１２によって変換された画像２２の一例を示す図である。桝目の１つ１つの矩形が１つの画素２２１を表し、白色の矩形は画素値が「０」の画素であり、斜線を付した矩形は画素値が「１」の画素である。 FIG. 3 is a diagram illustrating an example of the image 22 converted by the imaging unit 12. Each rectangle of each square represents one pixel 221, a white rectangle is a pixel having a pixel value of “0”, and a hatched rectangle is a pixel having a pixel value of “1”.

図４は、画像化手段１２によって変換された画素のデータ構造３１の一例を示す図である。各画素は、ｉおよびｊを自然数とするとき、構造体型配列Ｇ［ｉ］［ｊ］のデータ構造で表され、各Ｇ［ｉ］［ｊ］は、画素値、クラスタ情報、ステータス情報および度数を含む。度数は、２次元ベクトルデータが２値画像データに変換されたとき、各画素について、画素の座標（Ｘ，Ｙ）が同じになる２次元ベクトルの数であり、「分布度数」ともいう。たとえば、画素の位置が座標（１，１）である画素に変換される２次元ベクトルの数が４個であると、その画素の度数は「４」となる。クラスタ情報およびステータス情報については、後述する。 FIG. 4 is a diagram illustrating an example of the data structure 31 of the pixels converted by the imaging means 12. Each pixel is represented by the data structure of a structure-type array G[i][j], where i and j are natural numbers, and each G[i][j] contains a pixel value, cluster information, status information, and a frequency. The frequency is, for each pixel, the number of two-dimensional vectors that map to the same pixel coordinates (X, Y) when the two-dimensional vector data is converted into binary image data, and is also referred to as the “distribution frequency”. For example, if four two-dimensional vectors are converted to the pixel at coordinates (1, 1), the frequency of that pixel is “4”. The cluster information and status information will be described later.

図４に示したデータ構造３１では、画素Ｇ［１］［１］の画素値は「１」であり、後述するステータス情報は「未処理」であり、度数は「４」である。同様に、画素Ｇ［２］［１］の画素値は「１」であり、ステータス情報は「未処理」であり、度数は「５」である。画素Ｇ［３］［１］の画素値は「１」であり、ステータス情報は「未処理」であり、度数は「７」である。画素Ｇ［ｍ］［ｎ］の画素値は「０」であり、ステータス情報は「未処理」であり、度数は「０」である。 In the data structure 31 shown in FIG. 4, the pixel value of the pixel G [1] [1] is “1”, status information to be described later is “unprocessed”, and the frequency is “4”. Similarly, the pixel value of the pixel G [2] [1] is “1”, the status information is “unprocessed”, and the frequency is “5”. The pixel value of the pixel G [3] [1] is “1”, the status information is “unprocessed”, and the frequency is “7”. The pixel value of the pixel G [m] [n] is “0”, the status information is “unprocessed”, and the frequency is “0”.
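
The structure-type array G[i][j] might be modelled as in the following sketch. The class name `Pixel`, its field names, and the helper `make_grid` are hypothetical choices for illustration; the text only specifies that each element holds a pixel value, cluster information, status information, and a frequency. Note that Python indices start at 0, whereas the text counts from 1.

```python
from dataclasses import dataclass, field

@dataclass
class Pixel:
    value: int = 0                    # pixel value: 0 or 1
    clusters: list = field(default_factory=list)  # cluster information
    processed: bool = False           # status: False = "unprocessed"
    frequency: int = 0                # number of vectors mapped to this pixel

def make_grid(m, n, counts):
    """Build G as an m x n grid from a {(i, j): frequency} mapping."""
    grid = [[Pixel() for _ in range(n)] for _ in range(m)]
    for (i, j), freq in counts.items():
        grid[i][j].value = 1
        grid[i][j].frequency = freq
    return grid

# Frequencies 4, 5 and 7 as in the FIG. 4 example (0-based indices here).
grid = make_grid(3, 2, {(0, 0): 4, (1, 0): 5, (2, 0): 7})
```

Keeping the cluster information as a list allows one pixel to carry several cluster numbers, as happens for G[2][1] in FIG. 9.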

図５は、領域抽出手段１３によって抽出された初期領域が示された画像２３の一例を示す図である。領域抽出手段１３は、画像化手段１２によって変換された２値画像データが示す画像を構成する画素のうち、画素値が「１」である画素の中から、隣接する画素同士を１つの領域として抽出し、抽出した領域を初期領域とする。隣接は、画素がＸ軸方向つまり図５に示した桝目の幅方向、またはＹ軸方向つまり図５に示した桝目の高さ方向に隣接することをいう。図５に示した画像２３には、４つの初期領域２３ａ〜２３ｄが示されている。初期領域２３ａは４つの画素によって構成され、初期領域２３ｂは１つの画素によって構成され、初期領域２３ｃは３つの画素によって構成され、初期領域２３ｄは２つの画素によって構成されている。 FIG. 5 is a diagram showing an example of an image 23 in which the initial area extracted by the area extracting unit 13 is shown. The area extraction unit 13 sets adjacent pixels as one area from among the pixels constituting the image indicated by the binary image data converted by the imaging unit 12 among the pixels having the pixel value “1”. Extraction is performed, and the extracted area is set as an initial area. Adjacent means that the pixels are adjacent in the X-axis direction, that is, the width direction of the mesh shown in FIG. 5, or in the Y-axis direction, that is, the height direction of the mesh shown in FIG. In the image 23 shown in FIG. 5, four initial regions 23 a to 23 d are shown. The initial region 23a is composed of four pixels, the initial region 23b is composed of one pixel, the initial region 23c is composed of three pixels, and the initial region 23d is composed of two pixels.

図６は、領域抽出手段１３による領域抽出処理後の画素のデータ構造３２の一例を示す図である。クラスタ情報は、初期領域を識別するための番号である。ステータス情報は、初期領域を抽出する際に用いる画素の状態を示す情報であり、「未処理」はまだいずれの初期領域としても抽出されていないことを示し、「処理済」はすでにいずれかの初期領域として抽出されたことを示す。 FIG. 6 is a diagram illustrating an example of the pixel data structure 32 after the region extraction processing by the region extraction unit 13. The cluster information is a number for identifying the initial area. The status information is information indicating the state of the pixel used when extracting the initial region. “Unprocessed” indicates that it has not been extracted as any initial region, and “Processed” has already been selected. Indicates that it has been extracted as the initial region.

図６に示したデータ構造３２では、画素Ｇ［１］［１］のクラスタ情報は「１」であり、ステータス情報は「処理済」であり、度数は「４」である。同様に、画素Ｇ［２］［１］のクラスタ情報は「１」であり、ステータス情報は「処理済」であり、度数は「５」である。画素Ｇ［３］［１］のクラスタ情報は「１」であり、ステータス情報は「処理済」であり、度数は「７」である。画素Ｇ［ｍ］［ｎ］のクラスタ情報は「ｋ」であり、ステータス情報は「処理済」であり、度数は「２」である。 In the data structure 32 shown in FIG. 6, the cluster information of the pixel G [1] [1] is “1”, the status information is “processed”, and the frequency is “4”. Similarly, the cluster information of the pixel G [2] [1] is “1”, the status information is “processed”, and the frequency is “5”. The cluster information of the pixel G [3] [1] is “1”, the status information is “processed”, and the frequency is “7”. The cluster information of the pixel G [m] [n] is “k”, the status information is “processed”, and the frequency is “2”.

クラスタ範囲決定手段１４は、領域抽出手段１３によって抽出された各初期領域に含まれる画素の度数の分布に基づいて、クラスタの範囲を決定する。すなわち、次元縮退手段１１および画像化手段１２は、隣接する画素を初期領域とすることによって、類似度の高いデータ同士を１つの領域とし、クラスタ範囲決定手段１４は、度数の分布に基づいてクラスタの範囲を決定することによって、さらに類似度の高いデータ同士を１つのクラスタに集約する。 The cluster range determining means 14 determines the cluster ranges based on the frequency distribution of the pixels included in each initial region extracted by the region extracting means 13. That is, the dimension reduction means 11 and the imaging means 12 group data with high similarity into one region by treating adjacent pixels as an initial region, and the cluster range determining means 14 then gathers data with still higher similarity into one cluster by determining the cluster ranges based on the frequency distribution.

領域抽出手段１３によって抽出された各初期領域に含まれる画素の度数の分布は、複数の正規分布である２次元正規分布が混合された２次元混合正規分布であるとみなすことができる。すなわち、各初期領域を構成する画素の度数の分布が、複数の山が形成される２次元混合正規分布であり、複数の山のうちの各山がそれぞれ１つの２次元正規分布と仮定して、２次元混合正規分布を構成する２次元正規分布を抽出することができる。 The frequency distribution of the pixels included in each initial region extracted by the region extracting unit 13 can be regarded as a two-dimensional mixed normal distribution in which a plurality of two-dimensional normal distributions that are normal distributions are mixed. That is, it is assumed that the frequency distribution of the pixels constituting each initial region is a two-dimensional mixed normal distribution in which a plurality of peaks are formed, and each of the plurality of peaks is a two-dimensional normal distribution. A two-dimensional normal distribution constituting a two-dimensional mixed normal distribution can be extracted.

２次元混合正規分布から２次元正規分布を抽出する方法として、たとえばＥＭ（Expectation Maximization）アルゴリズムを用いた方法がある。ＥＭアルゴリズムは、尤度の期待値を求めるＥステップ（Expectation Step）と、尤度の期待値を最大化するＭステップ（Maximization Step）とを交互に繰り返すことによって、確かではない情報を含む観測データから最尤推定を行うための反復アルゴリズムである（たとえば赤穂昭太郎著、「ＥＭアルゴリズムの幾何学」、情報処理Ｖｏｌ．３７、Ｎｏ．１、１９９６年参照）。 One method for extracting two-dimensional normal distributions from a two-dimensional mixed normal distribution uses, for example, the EM (Expectation Maximization) algorithm. The EM algorithm is an iterative algorithm for performing maximum likelihood estimation from observation data that contains uncertain information, by alternately repeating an E step (Expectation Step), which obtains the expected value of the likelihood, and an M step (Maximization Step), which maximizes the expected value of the likelihood (see, for example, Shotaro Akaho, “Geometry of the EM Algorithm”, Information Processing Vol. 37, No. 1, 1996).

クラスタ範囲決定手段１４は、次に示す手順で２次元混合正規分布から２次元正規分布を抽出する。各初期領域の分布が２次元混合正規分布であり、各初期領域の各山が１つの２次元正規分布であると仮定して、ＥＭアルゴリズムを用いた手順の例を示す。 The cluster range determining means 14 extracts a two-dimensional normal distribution from the two-dimensional mixed normal distribution in the following procedure. An example of the procedure using the EM algorithm is shown assuming that the distribution of each initial region is a two-dimensional mixed normal distribution and each mountain of each initial region is one two-dimensional normal distribution.

手順１では、各初期領域ｃの度数の分布について、ＥＭアルゴリズムを用いて２次元正規分布を抽出する。ここに、ｃは１〜Ｕの自然数であり、Ｕは初期領域の数である。手順２では、各初期領域ｃについて抽出された各２次元正規分布について、Ｘ軸方向の平均ｘｃｔ、Ｙ軸方向の平均ｙｃｔ、Ｘ軸方向の標準偏差σｘｃｔ、およびＹ軸方向の標準偏差σｙｃｔを算出する。ここに、ｔは１〜Ｒの自然数であり、Ｒは初期領域で抽出された２次元正規分布の数である。 In the procedure 1, a two-dimensional normal distribution is extracted using the EM algorithm for the frequency distribution of each initial region c. Here, c is a natural number of 1 to U, and U is the number of initial regions. In step 2, for each two-dimensional normal distribution extracted for each initial region c, the average xct in the X-axis direction, the average yct in the Y-axis direction, the standard deviation σxct in the X-axis direction, and the standard deviation σyct in the Y-axis direction are calculated. calculate. Here, t is a natural number of 1 to R, and R is the number of two-dimensional normal distributions extracted in the initial region.
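
For one initial region, the E and M steps could take the following form. This is a sketch under assumptions that the text does not state: each component is taken to be axis-aligned (no X-Y correlation), which matches the fact that only per-axis means and standard deviations are used afterwards, and each pixel k at (X_k, Y_k) is weighted by its frequency f_k. The symbols \gamma_{kt}, \pi_t and N_t are introduced here for illustration; \bar{x}_t and \sigma_{x,t} correspond to xct and σxct for the region under consideration.

```latex
% Assumed notation: pixel k at (X_k, Y_k) with frequency f_k; component t has
% mixing weight \pi_t, means (\bar{x}_t, \bar{y}_t) and standard deviations
% (\sigma_{x,t}, \sigma_{y,t}); R is the number of components in the region.
\begin{align*}
\text{E step:}\quad
\gamma_{kt} &= \frac{\pi_t\,\mathcal{N}(X_k \mid \bar{x}_t, \sigma_{x,t}^2)\,
                     \mathcal{N}(Y_k \mid \bar{y}_t, \sigma_{y,t}^2)}
                    {\sum_{s=1}^{R} \pi_s\,\mathcal{N}(X_k \mid \bar{x}_s, \sigma_{x,s}^2)\,
                     \mathcal{N}(Y_k \mid \bar{y}_s, \sigma_{y,s}^2)} \\
\text{M step:}\quad
N_t &= \sum_k f_k\,\gamma_{kt}, \qquad \pi_t = \frac{N_t}{\sum_k f_k} \\
\bar{x}_t &= \frac{1}{N_t}\sum_k f_k\,\gamma_{kt}\,X_k, \qquad
\sigma_{x,t}^2 = \frac{1}{N_t}\sum_k f_k\,\gamma_{kt}\,(X_k - \bar{x}_t)^2 \\
\bar{y}_t &= \frac{1}{N_t}\sum_k f_k\,\gamma_{kt}\,Y_k, \qquad
\sigma_{y,t}^2 = \frac{1}{N_t}\sum_k f_k\,\gamma_{kt}\,(Y_k - \bar{y}_t)^2
\end{align*}
```

The two steps are repeated until the parameters stop changing appreciably; the converged means and standard deviations are the quantities computed in procedure 2.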

手順３では、座標（ｘｃｔ，ｙｃｔ）を中心とし、Ｘ軸方向の半径をｐ×σｘｃｔとし、Ｙ軸方向の半径をｑ×σｙｃｔとする楕円内の範囲をクラスタの範囲として決定する。ここに、ｐおよびｑは、「３」以下の正の実数であり、ｐおよびｑを変化させることによって所望のクラスタの範囲とすることができる。たとえば、Ｘ軸方向の正規分布を考えるとき、ｐ＝１とすると、クラスタの範囲に６８．３％の画素を含めることができる。さらに、ｐ＝２とすると、９５．４％の画素を含めることができ、ｐ＝３とすると、９９．７％の画素を含めることができる。 In the procedure 3, the range in the ellipse having the coordinates (xct, yct) as the center, the radius in the X-axis direction as p × σxct, and the radius in the Y-axis direction as q × σyct is determined as the cluster range. Here, p and q are positive real numbers of “3” or less, and a desired cluster range can be obtained by changing p and q. For example, when considering a normal distribution in the X-axis direction, if p = 1, 68.3% of pixels can be included in the cluster range. Furthermore, if p = 2, 95.4% of pixels can be included, and if p = 3, 99.7% of pixels can be included.
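
The elliptical range test of procedure 3 and the quoted percentages can be checked with a short sketch. The function names `in_cluster_range` and `coverage_1d` are hypothetical; the coverage figure uses the standard result that a one-dimensional normal distribution places a fraction erf(p/√2) of its mass within p standard deviations of the mean.

```python
import math

def in_cluster_range(x, y, mean_x, mean_y, sd_x, sd_y, p=2.0, q=2.0):
    """True if (x, y) lies inside the ellipse centred on the mean
    with radii p * sd_x (X axis) and q * sd_y (Y axis)."""
    rx, ry = p * sd_x, q * sd_y
    return ((x - mean_x) / rx) ** 2 + ((y - mean_y) / ry) ** 2 <= 1.0

def coverage_1d(p):
    """Fraction of a 1-D normal distribution within p standard deviations."""
    return math.erf(p / math.sqrt(2.0))

for p in (1, 2, 3):
    print(p, round(100 * coverage_1d(p), 1))
```

These percentages describe one axis taken on its own, as in the text's X-axis example. For the full two-dimensional ellipse with p = q and no correlation, the enclosed fraction is 1 − exp(−p²/2), which is smaller.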

図７は、クラスタ範囲決定手段１４によって抽出された２次元正規分布２４の一例を示す図である。２次元正規分布２４は、Ｙ＝Ｘ＋ａの直線２４１上での２次元正規分布であり、高さは各画素の度数２４２を示す。 FIG. 7 is a diagram illustrating an example of the two-dimensional normal distribution 24 extracted by the cluster range determining unit 14. The two-dimensional normal distribution 24 is a two-dimensional normal distribution on the straight line 241 of Y = X + a, and the height indicates the frequency 242 of each pixel.

図８は、クラスタ範囲決定手段１４によって抽出された２次元正規分布２５の一例を示す図である。２次元正規分布２５は、Ｘ軸方向の２次元正規分布である。範囲２５１は、クラスタのＸ軸方向の範囲であり、範囲２５１の長さは、Ｘ軸方向の半径がｐ×σｘｃｔの楕円のＸ軸方向の直径に相当する。 FIG. 8 is a diagram illustrating an example of the two-dimensional normal distribution 25 extracted by the cluster range determining unit 14. The two-dimensional normal distribution 25 is a two-dimensional normal distribution in the X-axis direction. The range 251 is a range in the X-axis direction of the cluster, and the length of the range 251 corresponds to the diameter in the X-axis direction of an ellipse whose radius in the X-axis direction is p × σxct.

図９は、クラスタ範囲決定手段１４によるクラスタ範囲決定処理後の画素のデータ構造３３の一例を示す図である。クラスタ範囲決定手段１４によってクラスタの範囲が決定された後は、クラスタ情報は、クラスタを識別するための番号であり、各画素が入っているクラスタの範囲のクラスタの番号を示す。 FIG. 9 is a diagram illustrating an example of the pixel data structure 33 after the cluster range determination processing by the cluster range determination unit 14. After the cluster range is determined by the cluster range determining unit 14, the cluster information is a number for identifying the cluster, and indicates the number of the cluster in the cluster range in which each pixel is included.

図９に示したデータ構造３３では、画素Ｇ［１］［１］のクラスタ情報は「１」であり、画素Ｇ［２］［１］のクラスタ情報は「１」および「２」であり、画素Ｇ［３］［１］のクラスタ情報は「２」であり、画素Ｇ［ｍ］［ｎ］のクラスタ情報は「ｋ’」である。画素Ｇ［２］［１］は、クラスタ情報が「１」のクラスタの範囲と、クラスタ情報が「２」のクラスタの範囲に入っている。ステータス情報および度数は、図６に示したデータ構造３２と同じであり、重複を避けるために説明は省略する。 In the data structure 33 shown in FIG. 9, the cluster information of the pixel G [1] [1] is “1”, the cluster information of the pixel G [2] [1] is “1” and “2”, The cluster information of the pixel G [3] [1] is “2”, and the cluster information of the pixel G [m] [n] is “k ′”. The pixel G [2] [1] is in the cluster range where the cluster information is “1” and the cluster range where the cluster information is “2”. The status information and the frequency are the same as those in the data structure 32 shown in FIG. 6, and a description thereof is omitted to avoid duplication.

クラスタ決定手段１５は、画素が、クラスタ範囲決定手段１４によって決定されたクラスタの範囲内に入っているか否かによって、次に示す手順で画素を各クラスタに分類する。手順１では、各画素のクラスタ情報をチェックし、１つのクラスタの番号のみがある場合は、すなわち画素の度数が１つの正規分布のみを構成する度数の場合は、その画素をそのクラスタ情報が示す番号のクラスタに分類する。 The cluster determining unit 15 classifies the pixels into clusters according to the following procedure depending on whether or not the pixel is within the cluster range determined by the cluster range determining unit 14. In step 1, the cluster information of each pixel is checked. If there is only one cluster number, that is, if the frequency of a pixel is a frequency that constitutes only one normal distribution, the cluster information indicates that pixel. Classify into numbered clusters.

手順２では、画素のクラスタ情報に、クラスタの番号がない場合は、すなわち、画素の度数がいずれの正規分布をも構成しない度数の場合は、その画素に最も距離が近いクラスタの範囲のクラスタに分類する。最も距離が近いクラスタが複数ある場合は、最も距離が近い複数のクラスタのうち、評価値Ｈの値が最大であるクラスタに分類する。評価値Ｈは、各クラスタに対応する２次元正規分布のＸ軸方向の標準偏差とＹ軸方向の標準偏差との平均を、各２次元正規分布のＸ軸方向の平均とＹ軸方向の平均との平均で除算した値であり、式Ｈ＝（（σｘｃｔ＋σｙｃｔ）／２）／（（ｘｃｔ＋ｙｃｔ）／２）＝（σｘｃｔ＋σｙｃｔ）／（ｘｃｔ＋ｙｃｔ）によって算出する。ここに、ｘｃｔはＸ軸方向の平均、ｙｃｔはＹ軸方向の平均、σｘｃｔはＸ軸方向の標準偏差、σｙｃｔはＹ軸方向の標準偏差である。 In procedure 2, if the cluster information of a pixel contains no cluster number, that is, if the frequency of the pixel does not constitute any normal distribution, the pixel is classified into the cluster whose range is closest to the pixel. If several clusters are equally closest, the pixel is classified into the one among them with the largest evaluation value H. The evaluation value H is the average of the standard deviations in the X-axis and Y-axis directions of the two-dimensional normal distribution corresponding to each cluster, divided by the average of the means in the X-axis and Y-axis directions of that two-dimensional normal distribution, and is calculated by the formula H = ((σxct + σyct) / 2) / ((xct + yct) / 2) = (σxct + σyct) / (xct + yct). Here, xct is the mean in the X-axis direction, yct is the mean in the Y-axis direction, σxct is the standard deviation in the X-axis direction, and σyct is the standard deviation in the Y-axis direction.

手順３では、画素のクラスタ情報に、複数のクラスタの番号がある場合は、すなわち、画素の度数の一部が複数の正規分布を構成する度数の場合は、その複数のクラスタのうち、評価値Ｈが最大であるクラスタに画素を分類する。 In the procedure 3, when there are a plurality of cluster numbers in the pixel cluster information, that is, when a part of the frequency of the pixel is a frequency constituting a plurality of normal distributions, the evaluation value of the plurality of clusters is determined. Classify pixels into clusters where H is maximum.
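
The three classification procedures can be sketched as below. The function and key names are hypothetical, and the text does not define how the distance from a pixel to a cluster range is measured; this sketch assumes the Euclidean distance to the cluster centre as a stand-in. It also assumes that xct + yct is non-zero so that H is defined.

```python
import math

def evaluation_value(c):
    """H = (sd_x + sd_y) / (mean_x + mean_y) for one cluster."""
    return (c["sd_x"] + c["sd_y"]) / (c["mean_x"] + c["mean_y"])

def classify_pixel(px, py, memberships, clusters):
    """memberships: numbers of the cluster ranges containing the pixel.
    clusters: {number: {"mean_x", "mean_y", "sd_x", "sd_y"}}."""
    if len(memberships) == 1:         # procedure 1: exactly one range
        return memberships[0]
    if memberships:                   # procedure 3: several ranges
        candidates = list(memberships)
    else:                             # procedure 2: no range
        # Assumption: distance to a range ~ distance to its centre.
        dists = {cid: math.hypot(px - c["mean_x"], py - c["mean_y"])
                 for cid, c in clusters.items()}
        nearest = min(dists.values())
        candidates = [cid for cid, d in dists.items()
                      if math.isclose(d, nearest)]
    # Ties and overlaps are resolved in favour of the largest H.
    return max(candidates, key=lambda cid: evaluation_value(clusters[cid]))

clusters = {
    1: {"mean_x": 10, "mean_y": 10, "sd_x": 1, "sd_y": 1},   # H = 0.1
    2: {"mean_x": 20, "mean_y": 10, "sd_x": 3, "sd_y": 3},   # H = 0.2
}
```

With these example values, a pixel lying in both ranges, or equidistant from both centres, goes to cluster 2, the wider distribution.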

図１０は、クラスタ決定手段１５によって画素がクラスタに分類された画像２６の一例を示す図である。画像２６には、３つのクラスタ２６ａ〜２６ｃが示されている。 FIG. 10 is a diagram illustrating an example of an image 26 in which pixels are classified into clusters by the cluster determination unit 15. In the image 26, three clusters 26a to 26c are shown.

クラスタ決定手段１５は、評価値Ｈを用いて画素を分類するので、いずれのクラスタの範囲にも入らない画素および複数のクラスタの範囲に入る画素を、度数がより小さく分散の大きい正規分布のクラスタに分類することができる。すなわち、画素をクラスタに分類することによって、画素に対応する多次元ベクトルをクラスタに分類することができるので、山の高さがより低くかつ広がりがより広い正規分布のクラスタに分類されるベクトル数を増加することができ、各クラスタのベクトル数の偏りをより低減することができる。 Because the cluster determining means 15 classifies pixels using the evaluation value H, pixels that fall within no cluster range and pixels that fall within several cluster ranges can be classified into the cluster of the normal distribution with the smaller frequency and the larger variance. That is, since classifying pixels into clusters also classifies the multidimensional vectors corresponding to those pixels into clusters, the number of vectors classified into clusters of normal distributions with lower peaks and wider spread can be increased, and the imbalance in the number of vectors among the clusters can be further reduced.

このように、入力手段１０によって、多次元ベクトルを表す多次元ベクトルデータが入力され、次元縮退手段１１によって、入力手段１０によって入力された多次元ベクトルデータが、予め定める次元変換条件に従って２次元ベクトルを表す２次元ベクトルデータに変換される。 In this way, multidimensional vector data representing a multidimensional vector is input by the input means 10, and the multidimensional vector data input by the input means 10 by the dimension reduction means 11 is converted into a two-dimensional vector according to a predetermined dimension conversion condition. Is converted into two-dimensional vector data.

そして、画像化手段１２によって、次元縮退手段１１によって変換された２次元ベクトルデータが、対応する２次元ベクトルの有無を示す値で表され、かつ対応する２次元ベクトルの数を表す度数が付与される画素からなる画像を表す画像データに変換され、領域抽出手段１３によって、画像化手段１２によって変換された画像データが示す画素のうち、対応する２次元ベクトルが有ることを示す値で表される画素の中から、隣接する画素によって構成される初期領域が抽出される。 Then, the imaging unit 12 represents the two-dimensional vector data converted by the dimension reduction unit 11 with a value indicating the presence or absence of the corresponding two-dimensional vector, and a frequency indicating the number of the corresponding two-dimensional vectors is given. The pixel data is converted into image data representing an image composed of pixels, and is represented by a value indicating that there is a corresponding two-dimensional vector among the pixels indicated by the image data converted by the image extraction unit 12 by the region extraction unit 13. An initial region composed of adjacent pixels is extracted from the pixels.

さらに、クラスタ範囲決定手段１４によって、領域抽出手段１３によって抽出された各初期領域に含まれる画素に付与された度数の分布から、その分布を構成する正規分布が抽出され、抽出された正規分布ごとに、各正規分布の平均を中心とし、かつ標準偏差に基づいて決められる範囲がクラスタの範囲として決定され、クラスタ決定手段１５によって、クラスタに分類される画素の数を均一化するための予め定める分類条件に従って、クラスタ範囲決定手段１４によって範囲が決定されたクラスタに、各画素が分類される。 Further, a normal distribution constituting the distribution is extracted from the distribution of frequencies given to the pixels included in each initial region extracted by the region extracting unit 13 by the cluster range determining unit 14, and each extracted normal distribution is extracted. In addition, a range centered on the average of each normal distribution and determined based on the standard deviation is determined as a cluster range, and the cluster determining unit 15 determines in advance the number of pixels classified into clusters. According to the classification condition, each pixel is classified into a cluster whose range has been determined by the cluster range determination means 14.

さらに、前記予め定める分類条件は、画素に付与される度数が１つの正規分布のみを構成する度数の場合は、その度数が付与される画素を、その度数が構成する正規分布に対応するクラスタに分類し、画素に付与される度数の一部が複数の正規分布を構成する度数の場合は、その度数が付与される画素を、その度数の一部が構成する複数の正規分布のうち、正規分布の標準偏差を平均で除算した評価値が最も大きい正規分布に対応するクラスタに分類し、画素に付与される度数がいずれの正規分布をも構成しない度数の場合は、その度数が付与される画素を、その度数が付与される画素に最も近い範囲のクラスタのうち、前記評価値が最も大きいクラスタに分類することである。 Furthermore, the predetermined classification condition is as follows. When the frequency given to a pixel constitutes only one normal distribution, the pixel is classified into the cluster corresponding to that normal distribution. When parts of the frequency given to a pixel constitute a plurality of normal distributions, the pixel is classified into the cluster corresponding to the normal distribution, among those, whose evaluation value, obtained by dividing the standard deviation of the normal distribution by its mean, is the largest. When the frequency given to a pixel constitutes no normal distribution, the pixel is classified into the cluster with the largest evaluation value among the clusters whose ranges are closest to the pixel.

したがって、正規分布の標準偏差を平均で除算した評価値を用いることによって、正規分布を構成しない度数の画素および複数の正規分布を構成する度数の画素を、度数がより少なく分散の大きいクラスタに分類することができ、各クラスタに含まれる画素数つまりベクトル数の偏りをより少なくすることができる。 Therefore, by using the evaluation value obtained by dividing the standard deviation of the normal distribution by the average, the frequency pixels that do not constitute the normal distribution and the frequency pixels that constitute the plurality of normal distributions are classified into clusters with less frequency and greater variance. Thus, the deviation of the number of pixels, that is, the number of vectors included in each cluster can be further reduced.

図１１は、画像化手段１２が実行する画像化処理の処理手順を示すフローチャートである。次元縮退手段１１によって多次元ベクトルデータが２次元ベクトルデータに変換された後、ステップＡ１に移る。 FIG. 11 is a flowchart showing the processing procedure of the imaging process executed by the imaging unit 12. After the multidimensional vector data is converted into two-dimensional vector data by the dimension reduction means 11, the process proceeds to step A1.

ステップＡ１では、２次元ベクトルデータが示す２次元ベクトルのＸＹ座標における位置を座標（ｘ，ｙ）とするとき、ｘの最大値Ｘｍａｘ、ｙの最大値Ｙｍａｘ、ｘの最小値Ｘｍｉｎ、およびｙの最小値Ｙｍｉｎから、ｘの範囲Ｌｘ＝Ｘｍａｘ−Ｘｍｉｎ、およびｙの範囲Ｌｙ＝Ｙｍａｘ−Ｙｍｉｎを求める。ステップＡ２では、２値画像データが示す画像の解像度をｍ×ｎ、文書データ数つまり２次元ベクトルの数をＮとするとき、条件１および条件２を満たす最小のｍおよびｎを求める。条件１は、ｍ：ｎ＝（Ｌｘ＋１）：（Ｌｙ＋１）であり、条件２は、ｍ×ｎ≧Ｎである。ここに、ｍおよびｎは自然数である。 In step A1, when the position in the XY coordinates of the two-dimensional vector indicated by the two-dimensional vector data is the coordinate (x, y), the maximum value Xmax of x, the maximum value Ymax of y, the minimum value Xmin of x, and the value of y From the minimum value Ymin, an x range Lx = Xmax−Xmin and a y range Ly = Ymax−Ymin are obtained. In step A2, the minimum m and n satisfying the conditions 1 and 2 are obtained when the resolution of the image indicated by the binary image data is m × n and the number of document data, that is, the number of two-dimensional vectors is N. Condition 1 is m: n = (Lx + 1) :( Ly + 1), and Condition 2 is m × n ≧ N. Here, m and n are natural numbers.

ステップＡ３では、２次元ベクトルデータが示す２次元ベクトルの位置を示す座標（ｘ，ｙ）を、式Ｘ＝ｘ×ｍ／（Ｌｘ＋１）、および式Ｙ＝ｙ×ｎ／（Ｌｙ＋１）によって、２値画像データが示す画像上の画素の座標（Ｘ，Ｙ）に変換する。変換後、２次元ベクトルデータが示す２次元ベクトルが対応している画素の値を「１」、２次元ベクトルデータが示す２次元ベクトルが対応していない画素の値を「０」として、画像化処理を終了する。ステップＡ１〜Ａ３は、画像化ステップである。 In step A3, coordinates (x, y) indicating the position of the two-dimensional vector indicated by the two-dimensional vector data are expressed as 2 by the formula X = x × m / (Lx + 1) and the formula Y = y × n / (Ly + 1). Conversion is made to the coordinates (X, Y) of the pixel on the image indicated by the value image data. After conversion, the pixel value corresponding to the two-dimensional vector indicated by the two-dimensional vector data is “1”, and the pixel value not corresponding to the two-dimensional vector indicated by the two-dimensional vector data is “0”. The process ends. Steps A1 to A3 are imaging steps.

図１２は、領域抽出手段１３が実行する領域抽出処理の処理手順を示すフローチャートである。画像化手段１２が、図１１に示した画像化処理を終了すると、ステップＢ１に移る。 FIG. 12 is a flowchart showing a processing procedure of region extraction processing executed by the region extraction means 13. When the imaging unit 12 finishes the imaging process shown in FIG. 11, the process proceeds to step B1.

ステップＢ１では、各画素Ｇ［ｉ］［ｊ］について、画像化手段１２による画像化処理の結果に基づいて画素値を「０」または「１」とし、ステータス情報を初期値として「未処理」とし、座標（ｉ，ｊ）つまり構造体型配列Ｇ［ｉ］［ｊ］の初期値として、ｉ＝１，ｊ＝１とする。ステップＢ２では、ｉ＝ｉ＋１、ｊ＝１として順次画素をチェックする。ただし、ｉ＝１およびｊ＝１のときは、この処理をパスする。 In step B1, for each pixel G [i] [j], the pixel value is set to “0” or “1” based on the result of the imaging process by the imaging unit 12, and “unprocessed” is set with the status information as an initial value. As an initial value of coordinates (i, j), that is, the structure type array G [i] [j], i = 1 and j = 1. In step B2, pixels are sequentially checked with i = i + 1 and j = 1. However, this process is passed when i = 1 and j = 1.

ステップＢ３では、ｉ＝ｍ＋１であるか否かを判定する。ｉ＝ｍ＋１であると、ステップＢ１１に進み、ｉ＝ｍ＋１でないと、ステップＢ４に進む。ステップＢ４では、ステータス情報が「未処理」であるか否かを判定する。ステータス情報が未処理であると、ステップＢ５に進み、ステータス情報が未処理でないと、ステップＢ２に戻る。ステップＢ５では、ステータス情報を「処理済」とする。ステップＢ６では、画素値が「１」であるか否かを判定する。画素値が「１」であると、ステップＢ７に進み、画素値が「１」でないと、ステップＢ２に戻る。 In step B3, it is determined whether i = m + 1. If i = m + 1, the process proceeds to step B11; otherwise, it proceeds to step B4. In step B4, it is determined whether the status information is “unprocessed”. If the status information is “unprocessed”, the process proceeds to step B5; otherwise, it returns to step B2. In step B5, the status information is set to “processed”. In step B6, it is determined whether the pixel value is “1”. If the pixel value is “1”, the process proceeds to step B7; otherwise, it returns to step B2.

ステップＢ７では、座標（ｉ，ｊ）の画素に隣接する画素について順次画素値が「１」か否かを調べる。ステップＢ８では、隣接する画素で画素値が「１」、かつステータス情報が「未処理」のものがあるか否かを判定する。隣接する画素で画素値が「１」、かつステータス情報が「未処理」のものがあると、ステップＢ９に進み、隣接する画素で画素値が「１」、かつステータス情報が「未処理」のものがないと、ステップＢ１３に進む。 In step B7, it is examined whether or not the pixel value of the pixel adjacent to the pixel at coordinates (i, j) is “1” sequentially. In step B8, it is determined whether there is an adjacent pixel having a pixel value “1” and status information “unprocessed”. If there is an adjacent pixel whose pixel value is “1” and status information is “unprocessed”, the process proceeds to step B9, where the adjacent pixel has a pixel value of “1” and status information is “unprocessed”. If there is nothing, the process proceeds to step B13.

ステップＢ９では、その隣接する画素へ移動する。すなわち座標（ｉ，ｊ）をその隣接する画素の座標にする。ステップＢ１０では、移動した画素のステータス情報を「処理済」として、ステップＢ７に戻る。ステップＢ１１では、ｉ＝１、ｊ＝ｊ＋１とする。ステップＢ１２では、ｊ＝ｎ＋１であるか否かを判定する。ｊ＝ｎ＋１であると、領域抽出処理を終了し、ｊ＝ｎ＋１でないと、ステップＢ２に戻る。 In step B9, the pixel moves to the adjacent pixel. That is, the coordinates (i, j) are set to the coordinates of the adjacent pixels. In step B10, the status information of the moved pixel is set to “processed”, and the process returns to step B7. In step B11, i = 1 and j = j + 1. In step B12, it is determined whether j = n + 1. If j = n + 1, the region extraction process is terminated. If j = n + 1 is not satisfied, the process returns to step B2.

ステップＢ１３では、隣接する４方向をチェックしたか否かを判定する。隣接する４方向をチェックした場合は、ステップＢ１４に進み、隣接する４方向をチェックしない場合は、ステップＢ７に戻る。ステップＢ１４では、最初の移動元であるか否かを判定する。最初の移動元であると、ステップＢ１６に進み、最初の移動元でないと、ステップＢ１５に進む。最初の移動元か否かは、たとえばステップＢ９で、隣接する画素に移動するとき、移動前の画素の座標（ｉ，ｊ）を順次記憶しておき、ステップＢ１４で、最初の移動前の座標と同じであるか否かを判断する。 In step B13, it is determined whether all four adjacent directions have been checked. If they have, the process proceeds to step B14; if not, it returns to step B7. In step B14, it is determined whether the current pixel is the first movement source. If it is, the process proceeds to step B16; if not, it proceeds to step B15. Whether the pixel is the first movement source can be determined, for example, by sequentially storing the coordinates (i, j) of the pixel before each move to an adjacent pixel in step B9 and checking in step B14 whether the current coordinates are the same as the coordinates before the first move.

ステップＢ１５では、１つ前の移動元の画素に戻る。すなわち、画素の座標を１つ前に移動した移動前の画素の座標（ｉ，ｊ）に戻す。ステップＢ１６では、画素値が「１」である隣接する画素を１つの初期領域として決定して、ステップＢ２に戻る。ステップＢ１〜ステップＢ１６は、領域抽出ステップである。 In step B15, the process returns to the previous pixel of the movement source. That is, the pixel coordinate is returned to the coordinate (i, j) of the pixel before the movement that was moved one before. In step B16, an adjacent pixel whose pixel value is “1” is determined as one initial region, and the process returns to step B2. Steps B1 to B16 are region extraction steps.
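
Steps B1 to B16 amount to labelling 4-connected regions of pixels whose value is "1", moving to an unprocessed neighbour where possible and backtracking to the previous movement source otherwise. The sketch below follows that idea with an explicit stack standing in for the stored movement sources; the function name `extract_regions`, the 0-based indexing, and the representation of the result as a label grid are assumptions for illustration.

```python
def extract_regions(image):
    """image[i][j] is 0 or 1 (i: X index, j: Y index).
    Returns (labels, count); labels[i][j] == 0 means 'no region'."""
    m, n = len(image), len(image[0])
    labels = [[0] * n for _ in range(m)]
    processed = [[False] * n for _ in range(m)]   # status information
    region = 0
    for j in range(n):                # scan row by row (steps B2, B11)
        for i in range(m):
            if processed[i][j]:       # step B4
                continue
            processed[i][j] = True    # step B5
            if image[i][j] != 1:      # step B6
                continue
            region += 1
            labels[i][j] = region
            stack = [(i, j)]          # movement sources, first one at bottom
            while stack:
                ci, cj = stack[-1]
                # Steps B7-B8: look for an unprocessed neighbour with value 1.
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = ci + di, cj + dj
                    if (0 <= ni < m and 0 <= nj < n
                            and image[ni][nj] == 1
                            and not processed[ni][nj]):
                        processed[ni][nj] = True      # step B10
                        labels[ni][nj] = region
                        stack.append((ni, nj))        # step B9: move
                        break
                else:
                    stack.pop()       # step B15: back to the movement source
    return labels, region

labels, count = extract_regions([[1, 1, 0], [0, 0, 0], [1, 0, 1]])
```

When the stack becomes empty, the walk has returned to the first movement source with all four directions checked, which corresponds to fixing one initial region in step B16.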

図１３は、クラスタ範囲決定手段１４が実行するクラスタ範囲決定処理の処理手順を示すフローチャートである。領域抽出手段１３が、図１２に示した領域抽出処理を終了すると、ステップＣ１に移る。 FIG. 13 is a flowchart showing a processing procedure of cluster range determination processing executed by the cluster range determination means 14. When the area extracting unit 13 finishes the area extracting process shown in FIG. 12, the process proceeds to step C1.

ステップＣ１では、各初期領域ｃについて、ＥＭアルゴリズムを用いて、各初期領域の２次元混合正規分布からその分布を構成する２次元正規分布を求める。ここに、ｃは１〜Ｕの自然数であり、Ｕは初期領域の数である。ステップＣ２では、各初期領域ｃについて求められた各２次元正規分布の平均ｘｃｔおよびｙｃｔと、標準偏差σｘｃｔおよびσｙｃｔとを求める。ここに、ｔは、１〜Ｒの自然数であり、Ｒは各２次元混合正規分布から求められた２次元正規分布の数である。 In step C1, for each initial region c, a two-dimensional normal distribution constituting the distribution is obtained from the two-dimensional mixed normal distribution of each initial region using the EM algorithm. Here, c is a natural number of 1 to U, and U is the number of initial regions. In step C2, the average xct and yct of each two-dimensional normal distribution obtained for each initial region c and the standard deviations σxct and σyct are obtained. Here, t is a natural number of 1 to R, and R is the number of two-dimensional normal distributions obtained from each two-dimensional mixed normal distribution.

ステップＣ３では、点（ｘｃｔ，ｙｃｔ）を中心とするＸ軸方向の半径ｐ×σｘｃｔ、およびＹ軸方向の半径ｑ×σｙｃｔの楕円範囲をクラスタの範囲として、クラスタ範囲決定処理を終了する。ステップＣ１〜ステップＣ３は、クラスタ範囲決定ステップである。 In step C3, the elliptical range centered on the point (xct, yct), with a radius of p × σxct in the X-axis direction and a radius of q × σyct in the Y-axis direction, is set as the cluster range, and the cluster range determination process ends. Steps C1 to C3 are the cluster range determination steps.

図１４は、クラスタ決定手段１５が実行するクラスタ決定処理の処理手順を示すフローチャートである。クラスタ範囲決定手段１４が、図１３に示したクラスタ範囲決定処理を終了すると、ステップＤ１に移る。 FIG. 14 is a flowchart showing a processing procedure of cluster determination processing executed by the cluster determination means 15. When the cluster range determining means 14 finishes the cluster range determining process shown in FIG. 13, the process proceeds to step D1.

ステップＤ１では、対応するクラスタが１つであるか否かを判定する。クラスタ情報に１つのクラスタの番号のみがあると、対応するクラスタが１つであると判定して、ステップＤ６に進む。クラスタ情報にあるクラスタの番号が１つのみでないと、対応するクラスタが１つでないと判定して、ステップＤ２に進む。ステップＤ２では、対応するクラスタがないか否かを判定する。クラスタ情報にクラスタの番号がないと、対応するクラスタがないと判定して、ステップＤ３に進む。クラスタ情報にクラスタの番号があると、対応するクラスタがあると判定して、ステップＤ７に進む。 In step D1, it is determined whether there is one corresponding cluster. If there is only one cluster number in the cluster information, it is determined that there is only one corresponding cluster, and the process proceeds to step D6. If there is not only one cluster number in the cluster information, it is determined that there is not one corresponding cluster, and the process proceeds to step D2. In step D2, it is determined whether or not there is a corresponding cluster. If there is no cluster number in the cluster information, it is determined that there is no corresponding cluster, and the process proceeds to step D3. If there is a cluster number in the cluster information, it is determined that there is a corresponding cluster, and the process proceeds to step D7.

In step D3, it is determined whether exactly one cluster is at the shortest distance from the pixel. If so, the process proceeds to step D6; if not, the process proceeds to step D4. In step D4, the pixel is classified into the cluster with the largest evaluation value H among the clusters at the shortest distance. The evaluation value H is calculated by the formula H = (σxct + σyct) / (xct + yct). Here, xct is the mean in the X-axis direction, yct is the mean in the Y-axis direction, σxct is the standard deviation in the X-axis direction, and σyct is the standard deviation in the Y-axis direction.
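As a worked illustration of the evaluation value, the formula H = (σxct + σyct) / (xct + yct) can be written directly as a function. The function name `evaluation_value` and the sample numbers are hypothetical; the sketch does not guard against a zero denominator.

```python
def evaluation_value(mean_x, mean_y, sigma_x, sigma_y):
    """Evaluation value H = (sigma_x + sigma_y) / (mean_x + mean_y)."""
    return (sigma_x + sigma_y) / (mean_x + mean_y)


# Two hypothetical candidate clusters with the same mean sum but different spread.
narrow = evaluation_value(40, 60, 2, 3)   # (2 + 3) / (40 + 60)
wide = evaluation_value(50, 50, 8, 12)    # (8 + 12) / (50 + 50)
chosen = "wide" if wide > narrow else "narrow"
```

With these numbers the wider distribution has the larger H, so a pixel tied between the two candidates would be assigned to it.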

In step D5, it is determined whether all pixels have been classified. If all pixels have been classified, the cluster determination process ends; if not, the process returns to step D1 and the next pixel is processed. In step D6, the pixel is classified into the single cluster found, and the process proceeds to step D5. In step D7, the pixel is classified into the cluster with the largest evaluation value H among its corresponding clusters, and the process proceeds to step D5. Steps D1 to D7 constitute the pixel classification step.
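The branching of steps D1 to D7 can be sketched as a single function applied to each pixel. This is a hedged illustration under stated assumptions: the names `classify_pixel` and `classify_all` and the data layout are hypothetical, and because this passage does not define how the distance to a cluster is measured, the sketch assumes the Euclidean distance to the cluster mean.

```python
import math


def classify_pixel(pixel, containing, clusters):
    """Return the cluster id for one pixel.

    pixel      : (x, y)
    containing : ids of clusters whose range contains the pixel
                 (the pixel's cluster information)
    clusters   : dict id -> (mean_x, mean_y, sigma_x, sigma_y)
    """
    def h(cid):
        mx, my, sx, sy = clusters[cid]
        return (sx + sy) / (mx + my)  # evaluation value H

    def dist(cid):
        # Assumption: distance is measured to the cluster mean.
        mx, my, _, _ = clusters[cid]
        return math.hypot(pixel[0] - mx, pixel[1] - my)

    if len(containing) == 1:           # D1 yes -> D6
        return containing[0]
    if containing:                     # D2 no -> D7: several ranges overlap
        return max(containing, key=h)
    # D3: no range contains the pixel, so look for the nearest cluster(s).
    shortest = min(dist(c) for c in clusters)
    nearest = [c for c in clusters if math.isclose(dist(c), shortest)]
    if len(nearest) == 1:              # D3 yes -> D6
        return nearest[0]
    return max(nearest, key=h)         # D4: tie broken by the largest H


def classify_all(pixels, clusters):
    """D5 loop: classify every pixel; pixels maps (x, y) -> containing ids."""
    return {xy: classify_pixel(xy, ids, clusters) for xy, ids in pixels.items()}
```

Both tie-breaking branches (D4 and D7) use the same rule, so pixels that are ambiguous between clusters are steered toward the cluster with the larger H.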

By classifying pixels into clusters, the multidimensional vectors corresponding to those pixels can be classified into clusters. Because pixels are classified using the evaluation value H, pixels that fall within no cluster range and pixels that fall within multiple cluster ranges are assigned to the cluster whose normal distribution has the lower frequency and the larger variance. This increases the number of vectors classified into clusters whose normal distributions have a lower peak and a wider spread, and thereby further reduces the imbalance in the number of vectors among clusters.

Thus, in the step processed by the input means 10, multidimensional vector data representing multidimensional vectors is input. In the step processed by the dimension reduction means 11, the multidimensional vector data input in the step processed by the input means 10 is converted into two-dimensional vector data representing two-dimensional vectors according to a predetermined dimension conversion condition.

In steps A1 to A3 of the flowchart shown in FIG. 11, the two-dimensional vector data converted in the step processed by the dimension reduction means 11 is converted into image data representing an image composed of pixels, each of which is represented by a value indicating the presence or absence of a corresponding two-dimensional vector and is assigned a frequency representing the number of corresponding two-dimensional vectors. In steps B1 to B16 of the flowchart shown in FIG. 12, regions constituted by adjacent pixels are extracted from among those pixels of the image data converted in steps A1 to A3 that are represented by the value indicating that a corresponding two-dimensional vector is present.
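The imaging and region extraction summarized above can be sketched as follows. This is not the step-by-step procedure of FIG. 11 and FIG. 12, which is not reproduced in this passage: the sketch substitutes a simple grid quantization and a generic breadth-first flood fill. The names `to_image` and `extract_regions`, the `cell` parameter, and the choice of 8-neighbor adjacency are assumptions.

```python
from collections import Counter, deque


def to_image(vectors_2d, cell=1.0):
    """Map 2-D vectors to pixels and count them.

    Returns {(col, row): frequency}; a pixel is present in the dict only
    if at least one vector falls into it.
    """
    freq = Counter()
    for x, y in vectors_2d:
        freq[(int(x // cell), int(y // cell))] += 1
    return dict(freq)


def extract_regions(freq):
    """Label groups of adjacent occupied pixels.

    Returns {(col, row): region number}, numbering regions from 1.
    Adjacency is 8-neighbor here (assumption).
    """
    labels = {}
    next_label = 0
    for start in sorted(freq):
        if start in labels:
            continue
        next_label += 1
        labels[start] = next_label
        queue = deque([start])
        while queue:  # breadth-first flood fill over occupied neighbors
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in freq and nb not in labels:
                        labels[nb] = next_label
                        queue.append(nb)
    return labels
```

Each labeled region corresponds to one initial region, and the frequencies of its pixels form the distribution from which the normal distributions are later extracted.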

In steps C1 to C3 of the flowchart shown in FIG. 13, the normal distributions constituting the distribution of the frequencies assigned to the pixels in each region extracted in steps B1 to B16 of the flowchart shown in FIG. 12 are extracted, and for each extracted normal distribution, a range centered on the mean of that distribution and determined based on its standard deviation is set as the range of a cluster. Then, in steps D1 to D7 of the flowchart shown in FIG. 14, each pixel is classified into one of the clusters whose ranges were determined in steps C1 to C3 of the flowchart shown in FIG. 13, according to a predetermined classification condition for equalizing the number of pixels classified into the clusters.

The program that controls the clustering automation apparatus 1 is also a program for causing a computer to execute each step of the clustering method according to the present invention. Therefore, the present invention can be provided as a program for causing a computer to execute each step of the clustering method.

In the embodiment described above, the program is stored in a storage device of the computer, such as a semiconductor memory or a hard disk device. However, the program is not limited to these storage devices and may be recorded on any computer-readable recording medium. The recording medium may be, for example, a medium that is read by inserting it into a program reading device provided as an external storage device (not shown), or it may be a storage device of another apparatus.

Whichever recording medium is used, it suffices that the stored program is accessed and executed by the computer. Alternatively, the program may be read from the recording medium, stored in the program storage area of the storage device, and then executed. The program may also be downloaded from another apparatus via a communication network and stored in the program storage area. In that case, the program used for downloading is either stored in advance in the storage device of the computer or installed into the program storage area from another recording medium.

The recording medium configured to be separable from the main body may be a medium that carries the program in a fixed manner, including: tape media such as magnetic tape and cassette tape; disk media, namely magnetic disks such as flexible disks and hard disks, and optical disks such as CD-ROM (Compact Disk Read Only Memory), MO (Magneto Optical disk), MD (Mini Disc), and DVD (Digital Versatile Disk); card media such as IC (Integrated Circuit) cards (including memory cards) and optical cards; and semiconductor memories such as mask ROM, EPROM (Erasable Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory), and flash ROM. Therefore, the present invention can be provided as a computer-readable recording medium on which a program for causing a computer to execute each step of the clustering method is recorded.

FIG. 1 is a block diagram showing the functional configuration of the clustering automation apparatus 1 according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the XY coordinate system 21 indicating the positions of the two-dimensional vectors converted by the dimension reduction means 11.
FIG. 3 is a diagram showing an example of the image 22 converted by the imaging means 12.
FIG. 4 is a diagram showing an example of the data structure 31 of a pixel converted by the imaging means 12.
FIG. 5 is a diagram showing an example of the image 23 in which the initial regions extracted by the region extraction means 13 are shown.
FIG. 6 is a diagram showing an example of the data structure 32 of a pixel after the region extraction process by the region extraction means 13.
FIG. 7 is a diagram showing an example of the two-dimensional normal distribution 24 extracted by the cluster range determination means 14.
FIG. 8 is a diagram showing an example of the two-dimensional normal distribution 25 extracted by the cluster range determination means 14.
FIG. 9 is a diagram showing an example of the data structure 33 of a pixel after the cluster range determination process by the cluster range determination means 14.
FIG. 10 is a diagram showing an example of the image 26 in which pixels have been classified into clusters by the cluster determination means 15.
FIG. 11 is a flowchart showing the procedure of the imaging process executed by the imaging means 12.
FIG. 12 is a flowchart showing the procedure of the region extraction process executed by the region extraction means 13.
FIG. 13 is a flowchart showing the procedure of the cluster range determination process executed by the cluster range determination means 14.
FIG. 14 is a flowchart showing the procedure of the cluster determination process executed by the cluster determination means 15.

Explanation of symbols

1 Clustering automation apparatus
10 Multidimensional vector data input means
11 Dimension reduction means
12 Imaging means
13 Region extraction means
14 Cluster range determination means
15 Cluster determination means
16 Multidimensional vector data storage means

Claims

A clustering apparatus comprising:
input means for inputting multidimensional vector data representing multidimensional vectors;
dimension conversion means for converting the multidimensional vector data input by the input means into two-dimensional vector data representing two-dimensional vectors by a predetermined dimension conversion method;
imaging means for converting the two-dimensional vector data converted by the dimension conversion means into image data representing an image composed of pixels, each pixel being represented by a value indicating the presence or absence of a corresponding two-dimensional vector and being assigned a frequency representing the number of corresponding two-dimensional vectors;
region extraction means for extracting regions constituted by adjacent pixels from among those pixels of the image data converted by the imaging means that are represented by the value indicating that a corresponding two-dimensional vector is present;
cluster range determination means for extracting, from the distribution of the frequencies assigned to the pixels included in each region extracted by the region extraction means, the normal distributions constituting that distribution, and for determining, for each extracted normal distribution, a range centered on the mean of the normal distribution and determined based on its standard deviation as the range of a cluster; and
pixel classification means for classifying each pixel into one of the clusters whose ranges have been determined by the cluster range determination means, according to a predetermined classification condition for equalizing the number of pixels classified into the clusters.

The clustering apparatus according to claim 1, wherein the predetermined classification condition is that:
when the frequency assigned to a pixel is a frequency constituting only one normal distribution, the pixel is classified into the cluster corresponding to that normal distribution;
when the frequency assigned to a pixel is a frequency constituting part of a plurality of normal distributions, the pixel is classified into the cluster corresponding to whichever of those normal distributions has the largest evaluation value, the evaluation value being obtained by dividing the standard deviation by the mean; and
when the frequency assigned to a pixel is a frequency constituting no normal distribution, the pixel is classified into whichever of the clusters whose range is closest to the pixel has the largest evaluation value.

A clustering method comprising:
an input step of inputting multidimensional vector data representing multidimensional vectors;
a dimension conversion step of converting the multidimensional vector data input in the input step into two-dimensional vector data representing two-dimensional vectors according to a predetermined dimension conversion condition;
an imaging step of converting the two-dimensional vector data converted in the dimension conversion step into image data representing an image composed of pixels, each pixel being represented by a value indicating the presence or absence of a corresponding two-dimensional vector and being assigned a frequency representing the number of corresponding two-dimensional vectors;
a region extraction step of extracting regions constituted by adjacent pixels from among those pixels of the image data converted in the imaging step that are represented by the value indicating that a corresponding two-dimensional vector is present;
a cluster range determination step of extracting, from the distribution of the frequencies assigned to the pixels included in each region extracted in the region extraction step, the normal distributions constituting that distribution, and of determining, for each extracted normal distribution, a range centered on the mean of the normal distribution and determined based on its standard deviation as the range of a cluster; and
a pixel classification step of classifying each pixel into one of the clusters whose ranges have been determined in the cluster range determination step, according to a predetermined classification condition for equalizing the number of pixels classified into the clusters.

A program for causing a computer to execute:
an input step of inputting multidimensional vector data representing multidimensional vectors;
a dimension conversion step of converting the multidimensional vector data input in the input step into two-dimensional vector data representing two-dimensional vectors according to a predetermined dimension conversion condition;
an imaging step of converting the two-dimensional vector data converted in the dimension conversion step into image data representing an image composed of pixels, each pixel being represented by a value indicating the presence or absence of a corresponding two-dimensional vector and being assigned a frequency representing the number of corresponding two-dimensional vectors;
a region extraction step of extracting regions constituted by adjacent pixels from among those pixels of the image data converted in the imaging step that are represented by the value indicating that a corresponding two-dimensional vector is present;
a cluster range determination step of extracting, from the distribution of the frequencies assigned to the pixels included in each region extracted in the region extraction step, the normal distributions constituting that distribution, and of determining, for each extracted normal distribution, a range centered on the mean of the normal distribution and determined based on its standard deviation as the range of a cluster; and
a pixel classification step of classifying each pixel into one of the clusters whose ranges have been determined in the cluster range determination step, according to a predetermined classification condition for equalizing the number of pixels classified into the clusters.

A computer-readable recording medium on which the program according to claim 4 is recorded.